PDF-to-DOCX is one of the harder format conversions because PDF is a fixed-layout format and DOCX is reflowable. Perfect fidelity isn't achievable; the question is how close to the original layout you get. pdf2docx does well on most text PDFs; LibreOffice is more thorough on complex layouts; OCR is needed for scans.
Method 1: pdf2docx (pure Python, the canonical option)
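pdf2docx is one of the most widely used Python libraries for PDF-to-DOCX conversion. The library itself is pure Python and needs no Office installation, though it depends on PyMuPDF, a compiled extension that pip installs as a prebuilt wheel. It preserves text, tables, and images.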
pip install pdf2docx
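from typing import Optional

from pdf2docx import Converter


def pdf_to_docx(in_path: str, out_path: str, start: int = 0, end: Optional[int] = None) -> None:
    # start is a 0-based page index; end is exclusive (None means through the last page)
    cv = Converter(in_path)
    try:
        cv.convert(out_path, start=start, end=end)
    finally:
        cv.close()  # release the PDF handle even if conversion fails


pdf_to_docx("document.pdf", "document.docx")

# For pages 5-10 only:
pdf_to_docx("big_report.pdf", "chapter.docx", start=4, end=10)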
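In `Converter.convert`, `start` is a zero-based index and `end` is exclusive, so page ranges quoted the way people read them (1-based, inclusive) need translating. A minimal sketch of that translation follows; `page_range_to_args` is a hypothetical helper for illustration, not part of pdf2docx.

```python
from typing import Dict, Optional


def page_range_to_args(first_page: int, last_page: Optional[int] = None) -> Dict[str, Optional[int]]:
    """Map 1-based inclusive page numbers to a 0-based start / exclusive end pair."""
    if first_page < 1:
        raise ValueError("first_page is 1-based and must be >= 1")
    if last_page is not None and last_page < first_page:
        raise ValueError("last_page must be >= first_page")
    # A 1-based inclusive last page N is the same boundary as a 0-based exclusive end N.
    return {"start": first_page - 1, "end": last_page}


print(page_range_to_args(5, 10))  # {'start': 4, 'end': 10}
```

The result can be unpacked into the wrapper above, for example `pdf_to_docx(src, dst, **page_range_to_args(5, 10))`.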
Three things to know:
- Tables are detected and preserved — usually well, but complex nested tables can break.
- Images are embedded. Charts and diagrams transfer as images, not editable shapes.
- Multi-column layout often flattens to a single column in the DOCX. Reading order may surprise you on academic papers, so expect to clean up.
For batch conversion of a directory:
from pathlib import Path

for pdf in Path('./pdfs').glob('*.pdf'):
    docx = pdf.with_suffix('.docx')
    pdf_to_docx(str(pdf), str(docx))
    print(f"converted {pdf.name}")
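In a larger batch, one malformed PDF will raise and stop the loop. The sketch below shows one way to isolate failures so the remaining files still convert. `convert_all` is a hypothetical helper, and it assumes the converter is any callable taking `(in_path, out_path)`, such as the `pdf_to_docx` wrapper from this section.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def convert_all(
    src_dir: str,
    convert: Callable[[str, str], None],
) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Convert every PDF in src_dir; return (converted outputs, failures with reasons)."""
    converted: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        docx = pdf.with_suffix(".docx")
        try:
            convert(str(pdf), str(docx))
            converted.append(docx)
        except Exception as exc:  # one bad file should not stop the batch
            failed.append((pdf, f"{type(exc).__name__}: {exc}"))
    return converted, failed
```

Usage would look like `ok, bad = convert_all("./pdfs", pdf_to_docx)`, after which `bad` can be logged or retried with a different method.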
Method 2: LibreOffice via subprocess
LibreOffice's --convert-to docx runs the same import-then-export pipeline as opening the PDF in Writer and saving it as DOCX, much like using File > Open on a PDF in Word. It often gives higher fidelity than pdf2docx on complex layouts.
apt install libreoffice --no-install-recommends
# macOS: brew install --cask libreoffice
import os
import subprocess
from pathlib import Path


def pdf_to_docx(in_path: str, out_dir: str, timeout: int = 120) -> str:
    os.makedirs(out_dir, exist_ok=True)
    env = os.environ.copy()
    env["HOME"] = "/tmp"  # LibreOffice needs a writable HOME for its profile
    result = subprocess.run(
        [
            "libreoffice", "--headless",
            "--infilter=writer_pdf_import",
            "--convert-to", "docx",
            "--outdir", out_dir,
            in_path,
        ],
        capture_output=True, text=True, timeout=timeout, env=env,
    )
    if result.returncode != 0:
        raise RuntimeError(f"libreoffice failed: {result.stderr}")
    out_path = Path(out_dir) / f"{Path(in_path).stem}.docx"
    # LibreOffice can exit 0 without writing a file, so confirm the output exists
    if not out_path.exists():
        raise RuntimeError(f"libreoffice produced no output: {result.stderr or result.stdout}")
    return str(out_path)


out = pdf_to_docx("document.pdf", "./out")
print("wrote:", out)
Three things to know:
- Set HOME=/tmp in containers. LibreOffice creates a profile directory under HOME on first run and needs it to be writable.
- The writer_pdf_import filter is essential. Without it, LibreOffice does not treat the PDF as an editable Writer document and the DOCX export fails.
- Treat it as single-threaded per host. Concurrent LibreOffice processes that share a profile contend for it rather than run in parallel, so extra processes don't buy throughput.
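Since conversions should not overlap, callers in a threaded service need a gate in front of LibreOffice. A minimal sketch using a one-slot semaphore follows; `run_serialized` and `_LO_SLOT` are hypothetical names, and this version only guards threads inside a single process.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# One slot: at most one LibreOffice conversion runs at a time in this process.
_LO_SLOT = threading.BoundedSemaphore(1)


def run_serialized(convert: Callable[..., T], *args, **kwargs) -> T:
    """Call convert(*args, **kwargs) while holding the single conversion slot."""
    with _LO_SLOT:
        return convert(*args, **kwargs)
```

A call such as `run_serialized(pdf_to_docx, "document.pdf", "./out")` then waits its turn instead of launching a second LibreOffice alongside the first. Across several worker processes the same pattern would need a `multiprocessing.Semaphore` shared with the workers, or a job queue with a single consumer.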
Method 3: ChangeThisFile API (with OCR fallback)
The API runs LibreOffice server-side and falls back to OCR for image-only PDFs. Free tier covers 1,000 conversions/month.
import requests

API_KEY = "ctf_sk_your_key_here"


def pdf_to_docx(in_path: str, out_path: str) -> None:
    with open(in_path, "rb") as f:
        response = requests.post(
            "https://changethisfile.com/v1/convert",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"source": "pdf", "target": "docx"},
            timeout=180,
        )
    response.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(response.content)


pdf_to_docx("document.pdf", "document.docx")
For text-layer PDFs, the API uses LibreOffice, which is near-instant for typical documents. For scanned PDFs (no text layer), it runs OCR first and then constructs the DOCX. OCR adds 5-30 s depending on page count.
When to use each
| Approach | Best for | Tradeoff |
|---|---|---|
| pdf2docx | Pip-only install, no Office or system packages, basic-to-medium PDFs | Multi-column layout can flatten; complex tables can break |
| LibreOffice | Higher fidelity on complex layouts, broad PDF support | ~1 GB install, one conversion at a time per host |
| ChangeThisFile API | Mixed input including scans, no infra | Network call, file size limit (25 MB on the free tier) |
Production tips
- Detect scanned PDFs early. Run pdftotext (from poppler-utils) first; if the output is under 200 bytes for a multi-page document, the PDF is almost certainly image-only. Either reject it or fall back to OCR (the API does this automatically).
- Pre-warm LibreOffice in containers. First-run profile creation adds ~5s. Run a throwaway conversion at container start.
- Set timeout 180s+ for big PDFs. Long documents with many images can take 30-60s to convert; complex layouts longer.
- Don't expect perfect fidelity. PDF-to-DOCX is approximate by nature. Set user expectations: tables and text are preserved; complex floating layouts may need cleanup.
- Bound LibreOffice concurrency. Allow one conversion at a time per host, enforced with a multiprocessing.Semaphore or a job queue.
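The scanned-PDF check from the first tip can be sketched as below. It assumes the `pdftotext` CLI is on PATH and that the page count comes from elsewhere (for example `pdfinfo`). `looks_scanned` and `extract_text` are hypothetical helper names, and the 200-byte threshold is the rule of thumb from the tip, not a hard limit.

```python
import subprocess


def looks_scanned(extracted_text: str, page_count: int, min_bytes: int = 200) -> bool:
    """Heuristic: a multi-page PDF that yields almost no text is likely image-only."""
    # strip() also drops the form feeds pdftotext emits between pages
    text_bytes = len(extracted_text.strip().encode("utf-8"))
    return page_count > 1 and text_bytes < min_bytes


def extract_text(pdf_path: str, timeout: int = 60) -> str:
    """Run pdftotext and return the text it writes to stdout ("-" selects stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout
```

A caller would combine the two, for example `looks_scanned(extract_text(path), page_count)`, and route positives to OCR. Single-page documents are deliberately never flagged here, because a short one-page PDF can legitimately contain very little text.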
For most text-layer PDFs, pdf2docx is the right answer: fast, pip-installable, and free of any Office dependency. For higher fidelity on complex documents, use LibreOffice. For mixed input including scans, use the API.