Why does my browser-converted DOCX have no tables?

PDF.js extracts text with positioning data, not table structure. Reconstructing tables requires layout heuristics that pure-JS libraries don't implement well. Use LibreOffice or the API for table-aware conversion.

Can I convert in a Web Worker?

Yes. PDF.js already runs its parser in a worker, and the docx library is pure JS, so it works in workers too. For large PDFs whose conversion would otherwise block the UI, building the DOCX in a worker keeps the page responsive.

How do I handle scanned PDFs?

Detect first: if PDF.js extracts very little text from a multi-page PDF, it's image-only. Pure-JS OCR (Tesseract.js) works in the browser but is slow. The API handles OCR fallback automatically.
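
The detection step can be sketched as a small heuristic over per-page character counts. The helper name looksScanned and the threshold of 20 characters per page are assumptions for illustration, not values from PDF.js or the API; tune the threshold against your own documents.

```javascript
// Hypothetical helper: guesses whether a PDF is image-only.
// pageCharCounts: number of extracted text characters on each page.
// minAvgCharsPerPage: assumed threshold, not a PDF.js or API constant.
function looksScanned(pageCharCounts, minAvgCharsPerPage = 20) {
  if (pageCharCounts.length === 0) return false; // no pages, nothing to judge
  const total = pageCharCounts.reduce((sum, n) => sum + n, 0);
  return total / pageCharCounts.length < minAvgCharsPerPage;
}

// With PDF.js, a per-page count could be gathered like this (not run here):
//   const content = await page.getTextContent();
//   const chars = content.items.reduce((n, item) => n + (item.str || "").trim().length, 0);
```

An average is used rather than a per-page minimum so that a single blank or image-only page in an otherwise text-based PDF does not trigger OCR.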

What about Cloudflare Workers / Vercel Edge?

PDF.js loads in Workers but is slow for large PDFs. LibreOffice doesn't run in Workers (no native binaries). The API is the right answer for edge runtimes.

What's the file size limit on the API?

Free tier: 25MB upload. Most PDFs are well under this, but scanned PDFs at 300 DPI with many pages can exceed it. In that case, split the file first or use the upload-via-URL endpoint.
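
A pre-flight size check avoids uploading a file the API will reject. This is a minimal sketch: the name fitsFreeTier is hypothetical, and it assumes "25MB" means 25 * 1024 * 1024 bytes, which the API may count differently.

```javascript
// Assumption: 25MB is interpreted as 25 * 1024 * 1024 bytes.
const FREE_TIER_MAX_BYTES = 25 * 1024 * 1024;

// Hypothetical helper. Accepts a File or Blob (which expose .size)
// or an ArrayBuffer or typed array (which expose .byteLength).
function fitsFreeTier(input, maxBytes = FREE_TIER_MAX_BYTES) {
  const bytes = typeof input.size === "number" ? input.size : input.byteLength;
  return bytes <= maxBytes;
}
```

In the browser this could run in the file input's change handler, before any network call is made.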

Does the output preserve hyperlinks?

PDF.js exposes hyperlinks as link annotations (via page.getAnnotations()), separate from the text content. With the docx library you can map them to DOCX hyperlinks using ExternalHyperlink. LibreOffice and the API preserve them automatically.
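
As a rough sketch of the browser-side approach, the helper below filters annotation objects down to external links. The name extractExternalLinks is hypothetical, and the input is assumed to have the shape PDF.js returns from page.getAnnotations() (objects with subtype, url, and rect); check this against the pdfjs-dist version you install.

```javascript
// Keeps only link annotations that carry an external URL.
// Internal links (which have a destination but no url) are skipped.
function extractExternalLinks(annotations) {
  return annotations
    .filter((a) => a.subtype === "Link" && typeof a.url === "string")
    .map((a) => ({ url: a.url, rect: a.rect }));
}

// Possible usage with PDF.js and docx (not run here):
//   const links = extractExternalLinks(await page.getAnnotations());
//   new ExternalHyperlink({
//     link: links[0].url,
//     children: [new TextRun({ text: "link text", style: "Hyperlink" })],
//   });
```

Each rect gives the link's position on the page. Matching it to the text items it covers, so that the right words become the link text, is a separate step that this sketch leaves out.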

How to Convert PDF to DOCX in JavaScript (Browser + Node)

JavaScript has weaker pure-JS options for PDF-to-DOCX than Python. The browser path (PDF.js + docx) extracts text but loses layout precision. Node + LibreOffice gets much closer to the original layout. For most production use cases, the API is the simpler answer because pure-JS PDF parsing produces inconsistent output across PDF variants.

Method 1: PDF.js + docx (browser, basic)

This works for text-only PDFs. Extract text with PDF.js, build a DOCX with the docx library. Layout precision is limited.

npm install pdfjs-dist docx file-saver

import * as pdfjsLib from "pdfjs-dist/build/pdf";
import { Document, Paragraph, Packer, TextRun } from "docx";
import { saveAs } from "file-saver";

pdfjsLib.GlobalWorkerOptions.workerSrc = "/pdf.worker.min.js";

async function pdfToDocx(pdfFile) {
  const arrayBuffer = await pdfFile.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;

  const paragraphs = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const text = await page.getTextContent();

    let lineText = "";
    let lastY = null;
    for (const item of text.items) {
      const y = item.transform[5]; // baseline y position of this text item
      if (lastY !== null && Math.abs(y - lastY) > 5) {
        // y moved by more than 5 units: flush the current line and start a new one
        if (lineText.trim()) {
          paragraphs.push(new Paragraph({ children: [new TextRun(lineText.trim())] }));
        }
        lineText = "";
      }
      // Always append with a trailing space so the first item of a line
      // is not glued to the item that follows it
      lineText += item.str + " ";
      lastY = y;
    }
    if (lineText.trim()) {
      paragraphs.push(new Paragraph({ children: [new TextRun(lineText.trim())] }));
    }
    paragraphs.push(new Paragraph({ children: [new TextRun("")] })); // blank paragraph as a gap between pages
  }

  const doc = new Document({
    sections: [{ children: paragraphs }],
  });

  const blob = await Packer.toBlob(doc);
  saveAs(blob, "output.docx");
}

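document.querySelector("input[type=file]").addEventListener("change", (e) => {
  const file = e.target.files[0];
  if (!file) return; // selection was cancelled
  pdfToDocx(file).catch((err) => console.error("conversion failed:", err));
});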

This produces a DOCX with the text content but loses tables, images, and complex layout. For text-heavy documents (essays, reports without tables), it's serviceable. For anything with structure, use method 2 or 3.
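
The line-grouping heuristic is the fragile part of this method, so it can help to isolate it as a pure function that is testable without PDF.js. The sketch below is one way to do that; the name groupItemsIntoLines is hypothetical, and the input is assumed to be objects with str and transform fields, as in the items used above.

```javascript
// Groups text items into lines by comparing baseline y (transform[5]).
// The default yTolerance of 5 is the same threshold used in the method above.
function groupItemsIntoLines(items, yTolerance = 5) {
  const lines = [];
  let current = [];
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5];
    if (lastY !== null && Math.abs(y - lastY) > yTolerance && current.length) {
      lines.push(current.join(" ")); // y jumped: close the current line
      current = [];
    }
    const text = item.str.trim();
    if (text) current.push(text); // skip whitespace-only items
    lastY = y;
  }
  if (current.length) lines.push(current.join(" "));
  return lines;
}
```

Joining items with a single space is a simplification: PDFs that split a word across several text items will gain spurious spaces, which is one reason output varies across PDF variants.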

Method 2: LibreOffice via child_process (Node)

For higher-fidelity conversion in Node, shell out to LibreOffice. Same approach as Python but from JavaScript.

apt install libreoffice --no-install-recommends

import { spawn } from "node:child_process";
import path from "node:path";

function pdfToDocx(inPath, outDir, timeoutMs = 120000) {
  return new Promise((resolve, reject) => {
    const child = spawn(
      "libreoffice",
      [
        "--headless",
        "--infilter=writer_pdf_import",
        "--convert-to", "docx",
        "--outdir", outDir,
        inPath,
      ],
      { env: { ...process.env, HOME: "/tmp" } }
    );

    const timer = setTimeout(() => {
      child.kill("SIGKILL");
      reject(new Error("libreoffice timed out"));
    }, timeoutMs);

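    let stderr = "";
    child.stderr.on("data", (chunk) => (stderr += chunk));

    // Spawn failures (e.g. libreoffice not on PATH) arrive as an "error" event;
    // without a listener, Node throws and the process crashes.
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });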

    child.on("close", (code) => {
      clearTimeout(timer);
      if (code !== 0) {
        reject(new Error(`libreoffice exit ${code}: ${stderr}`));
        return;
      }
      const base = path.basename(inPath, path.extname(inPath));
      resolve(path.join(outDir, `${base}.docx`));
    });
  });
}

const out = await pdfToDocx("document.pdf", "./out");
console.log("wrote:", out);

Three things to know:

HOME=/tmp in containers: LibreOffice creates a profile directory on first run and needs a writable HOME.
The writer_pdf_import filter makes LibreOffice open the PDF in Writer rather than Draw, so it can be exported as DOCX.
Run one conversion at a time per host. Queue incoming jobs so that concurrency is bounded to one.
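
Bounding concurrency to one does not need a queue library for a single process. The following is a minimal in-memory sketch; createSerialQueue is a hypothetical name, and it does not survive restarts or coordinate across multiple Node processes.

```javascript
// Minimal serial queue: each task starts only after the previous one settles.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    const run = tail.then(() => task());
    // Swallow the rejection on the chain only, so one failed job does not
    // block later jobs. The caller still sees the rejection through `run`.
    tail = run.catch(() => {});
    return run;
  };
}

// Possible usage with the pdfToDocx function above (not run here):
//   const enqueue = createSerialQueue();
//   const out = await enqueue(() => pdfToDocx("document.pdf", "./out"));
```

For multiple workers or persistence across restarts, a real job queue is the better fit.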

Method 3: ChangeThisFile API (with OCR fallback)

The API runs LibreOffice server-side and falls back to OCR for image-only PDFs. Free tier covers 1,000 conversions/month.

const API_KEY = "ctf_sk_your_key_here";

async function pdfToDocx(pdfBuffer, filename = "document.pdf") {
  const form = new FormData();
  form.append("file", new Blob([pdfBuffer], { type: "application/pdf" }), filename);
  form.append("source", "pdf");
  form.append("target", "docx");

  const response = await fetch("https://changethisfile.com/v1/convert", {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}` },
    body: form,
  });

  if (!response.ok) throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  return await response.arrayBuffer();
}

// Cloudflare Worker example:
export default {
  async fetch(request) {
    const pdf = await request.arrayBuffer();
    const docx = await pdfToDocx(new Uint8Array(pdf));
    return new Response(docx, {
      headers: {
        "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      },
    });
  },
};

For text-layer PDFs, the API converts with LibreOffice directly, which is the fast path. For scanned PDFs, it runs OCR first and then constructs the DOCX. Set the request timeout to 180s or more for large or scanned documents.
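
The fetch call above sets no timeout of its own. One way to add one is an AbortController wrapper, sketched below. The helper name fetchWithTimeout and the injectable fetchImpl parameter are assumptions for illustration; fetchImpl defaults to the global fetch and exists only so the helper can be exercised without a network.

```javascript
// Wraps a fetch call with an abort-based timeout (default 180 seconds).
async function fetchWithTimeout(url, options = {}, timeoutMs = 180000, fetchImpl = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // Resolves once response headers arrive; reading the body afterwards
    // is not covered by this timer.
    return await fetchImpl(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timer); // avoid a stray timer once the request settles
  }
}

// Possible usage inside pdfToDocx (not run here):
//   const response = await fetchWithTimeout("https://changethisfile.com/v1/convert", {
//     method: "POST",
//     headers: { Authorization: `Bearer ${API_KEY}` },
//     body: form,
//   });
```

When the timeout fires, the returned promise rejects with an abort error, which callers can catch and report as a timeout.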

When to use each

Approach	Best for	Tradeoff
PDF.js + docx (browser)	Text-only PDFs, privacy-first conversion	Loses tables, images, complex layout
LibreOffice via Node	Higher fidelity, server-side batch	1GB install, single-threaded per host
ChangeThisFile API	Mixed input including scans, no infra	Network call, file size limit (25MB free)

Production tips

Be honest about pure-JS fidelity. The browser path produces text-only DOCX. If users expect tables, images, and layout to transfer, you need server-side conversion.
For Node + LibreOffice, use a job queue. LibreOffice serializes internally, so multiple concurrent processes contend for locks. A simple BullMQ queue or in-memory semaphore is enough.
Set timeout 180s+ for large PDFs. Long documents with images take 30-60s to convert; complex layouts take longer.
Detect scanned PDFs early. Run pdftotext first; if the output is tiny, the PDF is image-only and a plain conversion will produce an empty DOCX. Use OCR (the API does this automatically).
Lazy-load PDF.js. The pdfjs-dist bundle is ~1MB. For pages that only sometimes convert PDFs, use a dynamic import to keep the initial bundle small.
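
The lazy-loading tip can be sketched as a memoized loader, so the dynamic import runs at most once no matter how many conversions are triggered. The helper name lazyOnce is hypothetical; the commented usage assumes a bundler that code-splits dynamic imports.

```javascript
// Memoizes an async loader so the underlying work runs at most once.
function lazyOnce(loader) {
  let promise = null;
  return () => {
    if (promise === null) {
      promise = loader().catch((err) => {
        promise = null; // allow a retry after a failed load
        throw err;
      });
    }
    return promise;
  };
}

// Possible usage in a bundled browser app (not run here):
//   const loadPdfJs = lazyOnce(() => import("pdfjs-dist"));
//   const pdfjsLib = await loadPdfJs();
```

Resetting the cached promise on failure means a transient network error while fetching the chunk does not permanently break conversion on that page.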

For text-heavy PDFs in the browser, PDF.js + docx works. For real production use, choose Node + LibreOffice or the API. Free tier covers 1,000 conversions/month.
