PDF-to-DOCX is one of the harder format conversions because PDF is a fixed-layout format and DOCX is reflowable. Perfect fidelity isn't achievable; the question is how close to the original layout you get. pdf2docx does well on most text PDFs; LibreOffice is more thorough on complex layouts; OCR is needed for scans.

Method 1: pdf2docx (pure Python, the canonical option)

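pdf2docx is the most-used Python library for PDF-to-DOCX. It is written in Python and installs with pip alone (no Office install needed; PDF parsing is handled by its PyMuPDF dependency, which pip pulls in for you). It preserves text, tables, and images.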

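pip install pdf2docx

from typing import Optional

from pdf2docx import Converter

def pdf_to_docx(in_path: str, out_path: str, start: int = 0, end: Optional[int] = None) -> None:
    # start is 0-based; end is exclusive (None means "through the last page").
    cv = Converter(in_path)
    try:
        cv.convert(out_path, start=start, end=end)
    finally:
        cv.close()  # release the PDF handle even if conversion fails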

pdf_to_docx("document.pdf", "document.docx")
# For pages 5-10 only:
pdf_to_docx("big_report.pdf", "chapter.docx", start=4, end=10)
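
The start=4, end=10 pair above is easy to get wrong by one, because start is 0-based while end is exclusive. A minimal sketch of a helper that takes human-style page numbers instead; page_span is a hypothetical name, not part of pdf2docx:

```python
from typing import Optional

def page_span(first_page: int, last_page: Optional[int] = None) -> dict:
    """Map 1-based, inclusive page numbers to start/end keyword arguments."""
    if first_page < 1:
        raise ValueError("first_page must be >= 1")
    if last_page is not None and last_page < first_page:
        raise ValueError("last_page must be >= first_page")
    # start is 0-based, so subtract one. end is 0-based and exclusive, which
    # means the 1-based inclusive last page number passes through unchanged.
    return {"start": first_page - 1, "end": last_page}

print(page_span(5, 10))  # {'start': 4, 'end': 10}
```

With this in place, the chapter example could be written as pdf_to_docx("big_report.pdf", "chapter.docx", **page_span(5, 10)).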

Three things to know:

  • Tables are detected and preserved — usually well, but complex nested tables can break.
  • Images are embedded. Charts and diagrams transfer as images, not editable shapes.
  • Multi-column layout becomes single-column DOCX. Reading order may surprise you on academic papers — expect to clean up.

For batch conversion of a directory:

from pathlib import Path

for pdf in Path('./pdfs').glob('*.pdf'):
    docx = pdf.with_suffix('.docx')
    pdf_to_docx(str(pdf), str(docx))
    print(f"converted {pdf.name}")
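
The loop above stops at the first PDF that fails to convert. One possible variant, sketched here under the assumption that you would rather collect failures and keep going, takes the conversion function as a parameter; convert_directory is a hypothetical helper, and any function with the (in_path, out_path) shape of pdf_to_docx can be passed in:

```python
from pathlib import Path
from typing import Callable, List, Tuple

def convert_directory(
    src_dir: str,
    convert: Callable[[str, str], None],
) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Convert every *.pdf in src_dir; return (written outputs, failures)."""
    converted: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        docx = pdf.with_suffix(".docx")
        try:
            convert(str(pdf), str(docx))
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((pdf, str(exc)))
        else:
            converted.append(docx)
    return converted, failed
```

Calling convert_directory("./pdfs", pdf_to_docx) would then return the list of written .docx paths plus a list of (pdf, error message) pairs to inspect or retry.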

Method 2: LibreOffice via subprocess

LibreOffice's --convert-to docx runs a PDF-import-then-DOCX-export pipeline, the same result you get by opening the PDF in Writer and saving it as DOCX. It often holds up better than pdf2docx on complex layouts.

apt install libreoffice --no-install-recommends
# macOS: brew install --cask libreoffice

import subprocess
import os
from pathlib import Path

def pdf_to_docx(in_path: str, out_dir: str, timeout: int = 120) -> str:
    os.makedirs(out_dir, exist_ok=True)

    env = os.environ.copy()
    env["HOME"] = "/tmp"  # LibreOffice needs writable HOME

    result = subprocess.run(
        [
            "libreoffice", "--headless",
            "--infilter=writer_pdf_import",
            "--convert-to", "docx",
            "--outdir", out_dir,
            in_path,
        ],
        capture_output=True, text=True, timeout=timeout, env=env,
    )
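    if result.returncode != 0:
        raise RuntimeError(f"libreoffice failed: {result.stderr}")

    # LibreOffice can exit 0 without writing a file, so confirm the output exists.
    out_path = Path(out_dir) / f"{Path(in_path).stem}.docx"
    if not out_path.exists():
        raise RuntimeError(f"libreoffice produced no output: {result.stdout} {result.stderr}")
    return str(out_path)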

out = pdf_to_docx("document.pdf", "./out")
print("wrote:", out)

Three things to know:

  • HOME=/tmp in containers — LibreOffice creates a profile dir on first run.
  • The writer_pdf_import filter is essential. Without it, LibreOffice opens the PDF with its Draw import filter, and a Draw document can't be exported to DOCX.
  • Effectively single-threaded per host. Concurrent LibreOffice processes that share a user profile contend for it instead of running in parallel, so launching several at once doesn't help.
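
If conversions are triggered from several threads (a web server's worker threads, for example), the one-at-a-time constraint can be enforced in code. This is a minimal sketch that only serializes callers inside a single Python process; run_serialized and _soffice_slot are hypothetical names, and coordinating across separate processes would need a multiprocessing.Semaphore or a job queue instead:

```python
import threading
from typing import Callable

# One slot: at most one LibreOffice conversion runs at a time in this process.
_soffice_slot = threading.BoundedSemaphore(1)

def run_serialized(convert: Callable[[str, str], str], in_path: str, out_dir: str) -> str:
    """Call convert(in_path, out_dir) while holding the single conversion slot."""
    with _soffice_slot:  # other callers block here until the slot is free
        return convert(in_path, out_dir)
```

A caller would use run_serialized(pdf_to_docx, "document.pdf", "./out") in place of calling pdf_to_docx directly.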

Method 3: ChangeThisFile API (with OCR fallback)

The API runs LibreOffice server-side and falls back to OCR for image-only PDFs. Free tier covers 1,000 conversions/month.

import requests

API_KEY = "ctf_sk_your_key_here"

def pdf_to_docx(in_path: str, out_path: str) -> None:
    with open(in_path, "rb") as f:
        response = requests.post(
            "https://changethisfile.com/v1/convert",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"source": "pdf", "target": "docx"},
            timeout=180,
        )
    response.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(response.content)

pdf_to_docx("document.pdf", "document.docx")

For text-layer PDFs, the API uses LibreOffice (the fast path). For scanned PDFs (no text layer), it runs OCR first and then constructs the DOCX. OCR adds 5-30s depending on page count.

When to use each

| Approach | Best for | Tradeoff |
| --- | --- | --- |
| pdf2docx | Pip-only install, no system packages, basic-to-medium PDFs | Multi-column layout flattens; complex tables can break |
| LibreOffice | Higher fidelity on complex layouts, broad PDF support | ~1 GB install, single-threaded per host |
| ChangeThisFile API | Mixed input including scans, no infra | Network call, file size limit (25 MB free) |
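
The comparison above can be read as a small routing rule. The sketch below is one way to encode it, not a definitive policy: choose_method and its boolean inputs are hypothetical, how you decide that a layout is "complex" is left to you, and the 25 MB free-tier limit is interpreted as 25 MiB.

```python
FREE_TIER_MAX_BYTES = 25 * 1024 * 1024  # "25 MB" free-tier limit, read as 25 MiB (assumption)

def choose_method(has_text_layer: bool, complex_layout: bool, size_bytes: int) -> str:
    """Pick a conversion route: 'pdf2docx', 'libreoffice', or 'api'."""
    if not has_text_layer:
        # Scans need OCR, which only the API route provides here.
        if size_bytes > FREE_TIER_MAX_BYTES:
            raise ValueError("scanned PDF exceeds the free-tier upload limit")
        return "api"
    # Text-layer PDFs stay local: LibreOffice for complex layouts, pdf2docx otherwise.
    return "libreoffice" if complex_layout else "pdf2docx"
```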

Production tips

  • Detect scanned PDFs early. Run pdftotext first; if the output is under 200 bytes for a multi-page document, the PDF is almost certainly image-only. Either reject it or fall back to OCR (the API does this automatically).
  • Pre-warm LibreOffice in containers. First-run profile creation adds ~5s. Run a throwaway conversion at container start.
  • Set timeout 180s+ for big PDFs. Long documents with many images can take 30-60s to convert; complex layouts longer.
  • Don't expect perfect fidelity. PDF-to-DOCX is approximate by nature. Set user expectations: tables and text are preserved; complex floating layouts may need cleanup.
  • Bound LibreOffice concurrency. One conversion at a time per host. Use a multiprocessing.Semaphore or a job queue to enforce.
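
The scanned-PDF check from the first tip can be sketched as follows, assuming the pdftotext CLI (from poppler-utils) is on PATH. Both function names are hypothetical. The page count relies on pdftotext ending each page with a form feed character, and whitespace is stripped before measuring so that the form feeds themselves don't count toward the 200-byte threshold, which is a choice made here rather than something the tip specifies.

```python
import subprocess

def extract_text(pdf_path: str, timeout: int = 60) -> str:
    """Return the PDF's text layer using pdftotext; '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, encoding="utf-8", errors="replace",
        timeout=timeout, check=True,
    )
    return result.stdout

def looks_image_only(text: str, min_bytes: int = 200) -> bool:
    """Heuristic: a multi-page document with almost no extractable text."""
    pages = text.count("\f")  # one form feed per page in pdftotext output
    payload = "".join(text.split())  # drop all whitespace, form feeds included
    return pages > 1 and len(payload.encode("utf-8")) < min_bytes
```

A pipeline might call looks_image_only(extract_text("document.pdf")) and route the file to OCR when it returns True.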

For most text-layer PDFs, pdf2docx is the right answer: fast, and nothing to install beyond pip. For higher fidelity on complex docs, use LibreOffice. For mixed input including scans, use the API.