Why does my converted DOCX look different from the PDF?

PDF is fixed-layout; DOCX is reflowable. Perfect fidelity isn't possible. Multi-column layouts collapse to single-column; precise positioning becomes paragraph-flow. The text and tables are preserved; visual layout is approximate.

Can I convert specific pages?

pdf2docx: pass start= and end= to convert(); both are 0-based page indices and end is exclusive. LibreOffice: extract the pages first with pdftk or PyPDF2 (now maintained as pypdf), then convert. The API: pass page_range='5-10' in form data.
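
The off-by-one between a human-readable range and pdf2docx's arguments is easy to get wrong. A minimal sketch of the mapping, assuming a 1-based inclusive range string such as '5-10' (the helper name page_range_to_slice is hypothetical, not part of pdf2docx):

```python
def page_range_to_slice(page_range: str) -> tuple[int, int]:
    """Turn a 1-based inclusive range like '5-10' into (start, end) for pdf2docx.

    pdf2docx counts pages from 0 and treats `end` as exclusive.
    """
    first_text, _, last_text = page_range.partition("-")
    first = int(first_text)
    last = int(last_text) if last_text else first  # "7" means page 7 only
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {page_range!r}")
    return first - 1, last


start, end = page_range_to_slice("5-10")
print(start, end)  # 4 10
```

The resulting pair can be passed straight through as start= and end=.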

What about scanned PDFs?

Neither pdf2docx nor LibreOffice handles scanned (image-only) PDFs without OCR. Run ocrmypdf first to add a text layer, then convert. The API handles this automatically with built-in OCR.
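
One way to wire the OCR step in front of a local converter is sketched below. It assumes the pdftotext (poppler-utils) and ocrmypdf command-line tools are installed; the function names, the timeouts, and the choice of the --skip-text flag are illustrative assumptions, and the 200-byte threshold is a heuristic rather than a guarantee.

```python
import subprocess


def has_text_layer(pdf_path: str, min_bytes: int = 200) -> bool:
    """Heuristic: a PDF with almost no extractable text is probably a scan."""
    # `pdftotext <file> -` writes the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, timeout=60, check=True,
    )
    return len(result.stdout.strip()) >= min_bytes


def ocr_command(in_path: str, out_path: str) -> list[str]:
    """Build the ocrmypdf call; --skip-text leaves pages that already have text alone."""
    return ["ocrmypdf", "--skip-text", in_path, out_path]


def ensure_text_layer(in_path: str, ocr_path: str) -> str:
    """Return a path to a PDF with a text layer, running OCR only when needed."""
    if has_text_layer(in_path):
        return in_path
    subprocess.run(ocr_command(in_path, ocr_path), check=True, timeout=600)
    return ocr_path
```

The path returned by ensure_text_layer can then be handed to pdf2docx or LibreOffice as usual.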

Why are tables broken in the output?

PDF tables are positioned glyphs, not table structures. pdf2docx detects tables heuristically (inferring cell boundaries from alignment), which works in most cases but breaks on complex nested tables or merged cells. LibreOffice is slightly better; manual cleanup is sometimes needed.

What's the file size limit on the API?

Free tier: 25MB upload. Most PDFs are well under this, but multi-page scans at 300 DPI can exceed it; split those locally first.
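
A pre-flight check along these lines can catch oversized files before the upload. This is a sketch only: reading "25MB" as 25 MiB is an assumption, and both helper names are hypothetical. The page chunks are 0-based and end-exclusive, so they can be fed to whatever splitter you use.

```python
import os

FREE_TIER_LIMIT_BYTES = 25 * 1024 * 1024  # assumption: "25MB" means 25 MiB


def fits_upload_limit(path: str, limit: int = FREE_TIER_LIMIT_BYTES) -> bool:
    """True if the file on disk is no larger than the upload limit."""
    return os.path.getsize(path) <= limit


def page_chunks(page_count: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """Split a document into 0-based, end-exclusive page ranges."""
    return [
        (start, min(start + pages_per_chunk, page_count))
        for start in range(0, page_count, pages_per_chunk)
    ]


print(page_chunks(25, 10))  # [(0, 10), (10, 20), (20, 25)]
```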

Does it preserve hyperlinks?

pdf2docx preserves hyperlinks as DOCX hyperlink fields. LibreOffice also preserves them. Footnotes and endnotes are usually preserved but may shift position relative to the source.

How to Convert PDF to DOCX in Python (3 Methods + API)

PDF-to-DOCX is one of the harder format conversions because PDF is a fixed-layout format and DOCX is reflowable. Perfect fidelity isn't achievable; the question is how close to the original layout you get. pdf2docx does well on most text PDFs; LibreOffice is more thorough on complex layouts; OCR is needed for scans.

Method 1: pdf2docx (pure Python, the canonical option)

pdf2docx is a widely used Python library for PDF-to-DOCX conversion. It installs with pip alone (no Office install needed; it builds on PyMuPDF and python-docx) and preserves text, tables, and images.

pip install pdf2docx

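from typing import Optional

from pdf2docx import Converter

def pdf_to_docx(in_path: str, out_path: str, start: int = 0, end: Optional[int] = None) -> None:
    # start is a 0-based page index; end is exclusive (None means the last page).
    cv = Converter(in_path)
    try:
        cv.convert(out_path, start=start, end=end)
    finally:
        cv.close()  # release the PDF even if conversion raises

pdf_to_docx("document.pdf", "document.docx")
# For pages 5-10 only (0-based indices 4 through 9):
pdf_to_docx("big_report.pdf", "chapter.docx", start=4, end=10)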

Three things to know:

Tables are detected and preserved — usually well, but complex nested tables can break.
Images are embedded. Charts and diagrams transfer as images, not editable shapes.
Multi-column layout becomes single-column DOCX. Reading order may surprise you on academic papers — expect to clean up.

For batch conversion of a directory:

from pathlib import Path

for pdf in Path('./pdfs').glob('*.pdf'):
    docx = pdf.with_suffix('.docx')
    pdf_to_docx(str(pdf), str(docx))
    print(f"converted {pdf.name}")
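
In a larger batch, one malformed PDF should not stop the rest. A hedged variant of the loop above that collects failures and skips files converted on an earlier run; convert_directory is a hypothetical helper, and it takes the converter as an argument so any of the pdf_to_docx functions with an (input path, output path) signature can be plugged in:

```python
from pathlib import Path
from typing import Callable


def convert_directory(
    src_dir: str,
    convert: Callable[[str, str], None],
) -> tuple[list[Path], list[tuple[Path, Exception]]]:
    """Convert every PDF in src_dir, collecting failures instead of stopping."""
    done: list[Path] = []
    failed: list[tuple[Path, Exception]] = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        docx = pdf.with_suffix(".docx")
        if docx.exists():  # already converted on an earlier run
            continue
        try:
            convert(str(pdf), str(docx))
            done.append(docx)
        except Exception as exc:  # keep going; report at the end
            failed.append((pdf, exc))
    return done, failed
```

Calling convert_directory("./pdfs", pdf_to_docx) returns the written DOCX paths plus a list of (pdf, exception) pairs to log or retry.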

Method 2: LibreOffice via subprocess

LibreOffice can import a PDF into Writer and export it as DOCX in one headless call (--convert-to docx), broadly the same open-then-save-as approach as Word's File > Open. It often gives higher fidelity than pdf2docx on complex layouts.

apt install libreoffice --no-install-recommends
# macOS: brew install --cask libreoffice

import subprocess
import os
from pathlib import Path

def pdf_to_docx(in_path: str, out_dir: str, timeout: int = 120) -> str:
    os.makedirs(out_dir, exist_ok=True)

    env = os.environ.copy()
    env["HOME"] = "/tmp"  # LibreOffice needs writable HOME

    result = subprocess.run(
        [
            "libreoffice", "--headless",
            "--infilter=writer_pdf_import",
            "--convert-to", "docx",
            "--outdir", out_dir,
            in_path,
        ],
        capture_output=True, text=True, timeout=timeout, env=env,
    )
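    out_path = Path(out_dir) / f"{Path(in_path).stem}.docx"
    # LibreOffice can exit 0 without writing anything, so check the file too.
    if result.returncode != 0 or not out_path.exists():
        raise RuntimeError(f"libreoffice failed: {result.stderr or result.stdout}")

    return str(out_path)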

out = pdf_to_docx("document.pdf", "./out")
print("wrote:", out)

Three things to know:

Set HOME=/tmp in containers. LibreOffice creates a profile directory under HOME on first run, so HOME must be writable.
The writer_pdf_import filter is essential. Without it, LibreOffice opens the PDF as a drawing rather than as an editable text document, and the DOCX export fails.
Plan for one conversion at a time per host. Concurrent LibreOffice processes that share a user profile contend for it instead of running in parallel.

Method 3: ChangeThisFile API (with OCR fallback)

The API runs LibreOffice server-side and falls back to OCR for image-only PDFs. Free tier covers 1,000 conversions/month.

import requests

API_KEY = "ctf_sk_your_key_here"

def pdf_to_docx(in_path: str, out_path: str) -> None:
    with open(in_path, "rb") as f:
        response = requests.post(
            "https://changethisfile.com/v1/convert",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"source": "pdf", "target": "docx"},
            timeout=180,
        )
    response.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(response.content)

pdf_to_docx("document.pdf", "document.docx")

For text-layer PDFs, the API uses LibreOffice (instant). For scanned PDFs (no text layer), it runs OCR first then constructs the DOCX. OCR adds 5-30s depending on page count.

When to use each

Approach	Best for	Tradeoff
pdf2docx	Pure Python, no native deps, basic-to-medium PDFs	Multi-column layout flattens; complex tables can break
LibreOffice	Higher fidelity on complex layouts, broad PDF support	1GB install, single-threaded per host
ChangeThisFile API	Mixed input including scans, no infra	Network call, file size limit (25MB free)
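
The table can be read as a small decision rule. The sketch below is one possible encoding of it, not a definitive policy; the function name, its three boolean inputs, and the returned labels are all hypothetical:

```python
def choose_method(
    has_text_layer: bool,
    complex_layout: bool,
    libreoffice_available: bool,
) -> str:
    """Pick a conversion route following the tradeoffs in the table."""
    if not has_text_layer:
        # Scans need OCR; the API has it built in (or run ocrmypdf locally first).
        return "api"
    if complex_layout and libreoffice_available:
        return "libreoffice"
    return "pdf2docx"


print(choose_method(has_text_layer=True, complex_layout=False, libreoffice_available=False))
# pdf2docx
```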

Production tips

Detect scanned PDFs early. Run pdftotext first; if the output is under 200 bytes for a multi-page document, the PDF is almost certainly image-only. Either reject it or fall back to OCR (the API does this automatically).
Pre-warm LibreOffice in containers. First-run profile creation adds ~5s. Run a throwaway conversion at container start.
Set timeout 180s+ for big PDFs. Long documents with many images can take 30-60s to convert; complex layouts longer.
Don't expect perfect fidelity. PDF-to-DOCX is approximate by nature. Set user expectations: tables and text are preserved; complex floating layouts may need cleanup.
Bound LibreOffice concurrency. Run one conversion at a time per host, and use a multiprocessing.Semaphore or a job queue to enforce that limit.
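
For the concurrency bound, here is a minimal sketch for the single-process case. It swaps in threading.BoundedSemaphore, which is enough when all conversions are triggered from threads of one worker process; across several worker processes you would need a multiprocessing.Semaphore shared with those workers, or a job queue. The name convert_serialized is hypothetical, and the convert argument stands for the LibreOffice pdf_to_docx function from Method 2.

```python
import threading
from typing import Callable

# One slot: at most one LibreOffice conversion runs at a time in this process.
_libreoffice_slot = threading.BoundedSemaphore(1)


def convert_serialized(
    convert: Callable[[str, str], str],
    in_path: str,
    out_dir: str,
) -> str:
    """Run convert(in_path, out_dir) while holding the single conversion slot."""
    with _libreoffice_slot:
        return convert(in_path, out_dir)
```

Request handlers can call convert_serialized from any thread; callers beyond the first simply wait for the slot instead of starting a competing LibreOffice process.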

For most text-layer PDFs, pdf2docx is the right answer — pure Python, fast, no native deps. For higher fidelity on complex docs, LibreOffice. For mixed input including scans, the API. Free tier covers 1,000 conversions/month.
