JavaScript has weaker pure-JS options for PDF-to-DOCX conversion than Python. The browser path (PDF.js + docx) extracts text but loses layout precision. Node + LibreOffice preserves far more of the original layout. For most production use cases, the ChangeThisFile API (method 3) is the simpler answer, because pure-JS PDF parsing produces inconsistent output across PDF variants.
Method 1: PDF.js + docx (browser, basic)
This works for text-only PDFs. Extract text with PDF.js, build a DOCX with the docx library. Layout precision is limited.
npm install pdfjs-dist docx file-saver
import * as pdfjsLib from "pdfjs-dist";
import { Document, Paragraph, Packer, TextRun } from "docx";
import { saveAs } from "file-saver";

// Point this at the worker script that ships with your pdfjs-dist version
// (the file name and extension vary between releases).
pdfjsLib.GlobalWorkerOptions.workerSrc = "/pdf.worker.min.js";

async function pdfToDocx(pdfFile) {
  const arrayBuffer = await pdfFile.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
  const paragraphs = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const text = await page.getTextContent();
    let lineText = "";
    let lastY = null;
    for (const item of text.items) {
      // transform[5] is the item's baseline y; a jump of more than 5 units starts a new line.
      if (lastY !== null && Math.abs(item.transform[5] - lastY) > 5) {
        if (lineText.trim()) {
          paragraphs.push(new Paragraph({ children: [new TextRun(lineText.trim())] }));
        }
        // Keep the trailing space so the next item on this line is not glued to this one.
        lineText = item.str + " ";
      } else {
        lineText += item.str + " ";
      }
      lastY = item.transform[5];
    }
    // Flush the last line of the page.
    if (lineText.trim()) {
      paragraphs.push(new Paragraph({ children: [new TextRun(lineText.trim())] }));
    }
    paragraphs.push(new Paragraph({ children: [new TextRun("")] })); // empty paragraph as a gap between pages
  }
  const doc = new Document({
    sections: [{ children: paragraphs }],
  });
  const blob = await Packer.toBlob(doc);
  saveAs(blob, "output.docx");
}

document.querySelector("input[type=file]").addEventListener("change", (e) => {
  const file = e.target.files[0];
  if (file) pdfToDocx(file).catch(console.error); // ignore a cancelled picker, surface failures
});
This produces a DOCX with the text content but loses tables, images, and complex layout. For text-heavy documents (essays, reports without tables), it's serviceable. For anything with structure, use method 2 or 3.
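The line-detection step in method 1 depends on PDF.js returning items in reading order, which is not guaranteed for every PDF. The following is a minimal sketch of a variant that groups items by baseline first and then sorts each line left to right. The function name `groupItemsIntoLines` and the `yTolerance` parameter are illustrative, not part of PDF.js, and the sketch still merges multi-column layouts into single lines.

```javascript
// Group PDF.js text items into lines by baseline, then order each line left to right.
// `items` has the shape returned by page.getTextContent(): { str, transform }, where
// transform[4] is x and transform[5] is y in PDF units (y grows upward).
function groupItemsIntoLines(items, yTolerance = 5) {
  const lines = []; // each entry: { y, items: [...] }
  for (const item of items) {
    if (!item.str || !item.str.trim()) continue; // skip whitespace-only items
    const y = item.transform[5];
    // Reuse an existing line whose baseline is within the tolerance.
    let line = lines.find((l) => Math.abs(l.y - y) <= yTolerance);
    if (!line) {
      line = { y, items: [] };
      lines.push(line);
    }
    line.items.push(item);
  }
  return lines
    .sort((a, b) => b.y - a.y) // top of the page first
    .map((line) =>
      line.items
        .sort((a, b) => a.transform[4] - b.transform[4]) // left to right
        .map((item) => item.str.trim())
        .join(" ")
    );
}
```

Each returned string can be passed to `new Paragraph({ children: [new TextRun(line)] })` in place of the inline accumulation loop.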
Method 2: LibreOffice via child_process (Node)
For higher-fidelity conversion in Node, shell out to LibreOffice. Same approach as Python but from JavaScript.
apt install libreoffice --no-install-recommends
import { spawn } from "node:child_process";
import path from "node:path";

function pdfToDocx(inPath, outDir, timeoutMs = 120000) {
  return new Promise((resolve, reject) => {
    const child = spawn(
      "libreoffice",
      [
        "--headless",
        "--infilter=writer_pdf_import",
        "--convert-to", "docx",
        "--outdir", outDir,
        inPath,
      ],
      { env: { ...process.env, HOME: "/tmp" } }
    );
    // Kill the process if it hangs; the later "close" event then rejects as a no-op.
    const timer = setTimeout(() => {
      child.kill("SIGKILL");
      reject(new Error("libreoffice timed out"));
    }, timeoutMs);
    let stderr = "";
    child.stderr.on("data", (chunk) => (stderr += chunk));
    // Without this handler a missing binary (ENOENT) would crash the process.
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (code !== 0) {
        reject(new Error(`libreoffice exit ${code}: ${stderr}`));
        return;
      }
      // LibreOffice writes <input basename>.docx into outDir.
      const base = path.basename(inPath, path.extname(inPath));
      resolve(path.join(outDir, `${base}.docx`));
    });
  });
}

const out = await pdfToDocx("document.pdf", "./out");
console.log("wrote:", out);
Three things to know:
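- HOME=/tmp in containers: LibreOffice creates a profile directory on first run, so HOME must point somewhere writable.
- The writer_pdf_import filter makes LibreOffice open the PDF in Writer rather than Draw (its default for PDFs), which is what allows export to DOCX.
- Treat it as single-threaded per host. Concurrent LibreOffice processes contend with each other, so use a queue to bound concurrency to one.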
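Bounding concurrency to one does not require a queueing library. Below is a minimal in-memory sketch built on a promise chain; `createSerialQueue` and `enqueueConversion` are hypothetical names, and the queue only covers a single Node process (multiple workers on the same host would need an external queue).

```javascript
// Minimal in-memory serial queue: tasks run one at a time, in submission order.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    // Start this task only after the previous one has settled.
    const result = tail.then(() => task());
    // Swallow the error on the internal chain so one failure does not block later tasks;
    // the caller still sees the rejection through `result`.
    tail = result.catch(() => {});
    return result;
  };
}

const enqueueConversion = createSerialQueue();

// Hypothetical usage with the pdfToDocx function from method 2:
// const out = await enqueueConversion(() => pdfToDocx("a.pdf", "./out"));
```

Every caller gets back the promise for its own conversion, so timeouts and errors from one job propagate to that caller only.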
Method 3: ChangeThisFile API (with OCR fallback)
The API runs LibreOffice server-side and falls back to OCR for image-only PDFs. Free tier covers 1,000 conversions/month.
const API_KEY = "ctf_sk_your_key_here";

async function pdfToDocx(pdfBuffer, filename = "document.pdf") {
  const form = new FormData();
  form.append("file", new Blob([pdfBuffer], { type: "application/pdf" }), filename);
  form.append("source", "pdf");
  form.append("target", "docx");
  const response = await fetch("https://changethisfile.com/v1/convert", {
    method: "POST",
    headers: { Authorization: `Bearer ${API_KEY}` },
    body: form,
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  return await response.arrayBuffer();
}

// Cloudflare Worker example:
export default {
  async fetch(request) {
    const pdf = await request.arrayBuffer();
    const docx = await pdfToDocx(new Uint8Array(pdf));
    return new Response(docx, {
      headers: {
        "Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      },
    });
  },
};
For text-layer PDFs, the API uses LibreOffice (instant). For scanned PDFs, it runs OCR first, then constructs the DOCX. Set the request timeout to 180s or more for large or scanned documents.
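The method 3 snippet sets no timeout of its own, so a slow conversion waits for as long as the runtime allows. One way to enforce the 180s budget is an AbortController, sketched below. `convertWithTimeout` and the `fetchImpl` option are hypothetical names introduced for illustration; the endpoint and headers are the ones from the example above.

```javascript
// Sketch: POST a prepared FormData to the conversion endpoint with a hard deadline.
// `fetchImpl` is an injection point for tests; it defaults to the global fetch.
async function convertWithTimeout(form, apiKey, { timeoutMs = 180_000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl("https://changethisfile.com/v1/convert", {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
      signal: controller.signal, // aborting rejects the pending request
    });
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}: ${await response.text()}`);
    }
    // Read the body inside the try block so the deadline also covers the download.
    return await response.arrayBuffer();
  } finally {
    clearTimeout(timer);
  }
}
```

Clearing the timer in `finally` keeps a finished request from holding the event loop open until the deadline expires.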
When to use each
| Approach | Best for | Tradeoff |
|---|---|---|
| PDF.js + docx (browser) | Text-only PDFs, privacy-first conversion | Loses tables, images, complex layout |
| LibreOffice via Node | Higher fidelity, server-side batch | 1GB install, single-threaded per host |
| ChangeThisFile API | Mixed input including scans, no infra | Network call, file size limit (25MB free) |
Production tips
- Be honest about pure-JS fidelity. The browser path produces text-only DOCX. If users expect tables, images, and layout to transfer, you need server-side conversion.
- For Node + LibreOffice, use a job queue. LibreOffice serializes internally: multiple concurrent processes contend for locks. A simple BullMQ queue or in-memory semaphore is enough.
- Set a timeout of 180s or more for large PDFs. Long documents with images take 30-60s to convert; complex layouts take longer.
- Detect scanned PDFs early. Run pdftotext first; if the output is tiny, the PDF is image-only and plain conversion will produce an empty DOCX. Use OCR (the API does this automatically).
- Lazy-load PDF.js. The pdfjs-dist bundle is ~1MB. For pages that only sometimes convert PDFs, use a dynamic import to keep the initial bundle small.
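The scanned-PDF check can be made concrete with a small heuristic. This is a sketch under stated assumptions: `looksScanned` and `extractText` are hypothetical helpers, the 50-characters-per-page threshold is a guess to tune on your own files, and `extractText` needs poppler's `pdftotext` on the PATH.

```javascript
// Heuristic: treat a PDF as scanned when it yields very little text per page.
// The default threshold of 50 characters per page is an assumption, not a standard.
function looksScanned(extractedText, pageCount, minCharsPerPage = 50) {
  const visibleChars = extractedText.replace(/\s+/g, "").length;
  return visibleChars < minCharsPerPage * Math.max(pageCount, 1);
}

// Extract the text layer with pdftotext; "-" sends the output to stdout.
async function extractText(pdfPath) {
  const { execFile } = await import("node:child_process");
  const { promisify } = await import("node:util");
  const { stdout } = await promisify(execFile)("pdftotext", [pdfPath, "-"]);
  return stdout;
}

// Hypothetical usage:
// const text = await extractText("document.pdf");
// const pages = (text.match(/\f/g) || []).length; // pdftotext normally emits a form feed per page
// if (looksScanned(text, pages)) { /* route to OCR instead of plain conversion */ }
```

Counting only non-whitespace characters keeps a file that produces nothing but blank lines and page breaks from passing as text.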
For text-heavy PDFs in the browser, PDF.js + docx works. For real production use, choose Node + LibreOffice or the API, whose free tier covers 1,000 conversions/month.