You have 500 DOCX files that need to become PDFs. Or 2,000 DOC files migrating to DOCX. Or a directory of Markdown files building into a documentation site. One file at a time through a web converter would take days. You need batch processing.
Batch conversion is command-line territory. The tools are free, reliable, and scriptable: LibreOffice headless for office documents, Pandoc for markup languages, Ghostscript for PDF manipulation. The challenge isn't the conversion itself — it's handling the failures, edge cases, and consistency issues that only appear at scale.
LibreOffice Headless: The Office Format Workhorse
LibreOffice's headless mode runs without a GUI, accepting command-line arguments for document conversion. It's the most reliable free tool for converting between office formats.
Basic usage:
# Single file
libreoffice --headless --convert-to pdf document.docx
# All DOCX files in a directory
libreoffice --headless --convert-to pdf *.docx
# Specify output directory
libreoffice --headless --convert-to pdf --outdir /output/ *.docx
# Convert DOC to DOCX
libreoffice --headless --convert-to docx *.doc
# Convert to ODT
libreoffice --headless --convert-to odt *.docx

Supported conversions: DOC/DOCX/ODT/RTF to PDF, DOC to DOCX, DOCX to ODT (and reverse), spreadsheets (XLS/XLSX/ODS) to PDF/CSV, presentations (PPT/PPTX/ODP) to PDF, and most cross-format office document conversions.
LibreOffice Batch Limitations
Single instance: LibreOffice headless can only run one conversion at a time. Concurrent invocations queue or fail. For parallel processing, use separate user profiles: libreoffice --headless -env:UserInstallation=file:///tmp/lo-profile-1 --convert-to pdf file.docx. Each profile allows an independent instance.
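The profile trick above can be combined with GNU parallel to run several independent instances at once. A minimal sketch, assuming bash and GNU parallel are installed; the profile paths under /tmp and the output/ directory are illustrative:

```shell
# Run up to 4 LibreOffice conversions concurrently, giving each job
# slot its own user profile so the instances don't conflict.
convert_one() {
  local f="$1" slot="$2"
  libreoffice --headless --norestore \
    -env:UserInstallation="file:///tmp/lo-profile-$slot" \
    --convert-to pdf --outdir output/ "$f"
}
export -f convert_one

# {%} is GNU parallel's job-slot number (1..4 with -j4), so each
# concurrent job reuses a stable profile directory.
find . -maxdepth 1 -name '*.docx' | parallel -j4 convert_one {} '{%}'
```

Keying the profile to the job slot rather than the filename keeps the number of profile directories bounded at the parallelism level.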
Timeout risk: Complex documents (100+ pages, many images, embedded objects) can take minutes to convert. LibreOffice has no built-in timeout — a stuck conversion blocks the queue. Wrap calls in timeout: timeout 120 libreoffice --headless --convert-to pdf file.docx.
Recovery mode: If LibreOffice crashes (it happens with corrupt or very complex files), it may start in recovery mode on the next invocation, blocking the command line. Add --norestore to prevent this: libreoffice --headless --norestore --convert-to pdf file.docx.
Font availability: LibreOffice on Linux typically lacks Microsoft fonts (Calibri, Cambria, etc.). Install ttf-mscorefonts-installer (Debian/Ubuntu) or equivalent. Without these fonts, DOCX files using Calibri will render with Liberation Sans, changing line breaks and page breaks throughout.
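Before starting a large batch, it is worth confirming the fonts the documents depend on are actually installed. A small sketch using fc-list (part of fontconfig); the font names checked here are just the common Microsoft defaults:

```shell
# Warn up front about any required font that fontconfig can't find.
for font in Calibri Cambria; do
  if ! fc-list | grep -qi "$font"; then
    echo "WARNING: $font not installed; expect substitution" >&2
  fi
done
```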
Pandoc: Markup Format Batch Processing
Pandoc converts between markup formats: Markdown, HTML, LaTeX, DOCX, ODT, EPUB, reStructuredText, and dozens more. For batch processing, wrap Pandoc in a shell script.
Batch Markdown to DOCX:
for f in *.md; do
pandoc "$f" -o "${f%.md}.docx"
done

Batch Markdown to PDF (via LaTeX):
for f in *.md; do
pandoc "$f" -o "${f%.md}.pdf" --pdf-engine=xelatex
done

Batch HTML to Markdown:
for f in *.html; do
pandoc "$f" -f html -t markdown -o "${f%.html}.md"
done

With bibliography:
for f in *.md; do
pandoc "$f" --bibliography=refs.bib --csl=apa.csl -o "${f%.md}.pdf"
done

Pandoc processes one file per invocation. For true parallelism: find . -name '*.md' | parallel pandoc {} -o {.}.pdf (using GNU parallel).
Ghostscript: PDF Batch Operations
Ghostscript processes PDFs at the page description level — it's the lowest-level tool for PDF manipulation.
Compress PDFs (batch):
for f in *.pdf; do
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH \
-sOutputFile="compressed/${f}" "$f"
done

Compression presets: /screen (72 DPI, smallest), /ebook (150 DPI, good quality), /printer (300 DPI, print quality), /prepress (300 DPI, preserves color fidelity).
Convert to PDF/A:
gs -dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE \
-sColorConversionStrategy=UseDeviceIndependentColor \
-sDEVICE=pdfwrite -sOutputFile=output_pdfa.pdf \
PDFA_def.ps input.pdf

Merge multiple PDFs:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
-sOutputFile=merged.pdf file1.pdf file2.pdf file3.pdf

Extract pages:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
-dFirstPage=5 -dLastPage=10 \
-sOutputFile=pages_5-10.pdf input.pdf
Naming Conventions for Batch Output
Consistent naming prevents confusion when processing hundreds of files. Strategies:
Extension swap: Replace the source extension with the target extension. report.docx becomes report.pdf. This is the default for LibreOffice and most tools. Simple, but overwrites if a file with the target name already exists.
Suffix addition: Add a suffix before the extension. report.docx becomes report_converted.pdf. Prevents overwriting but clutters filenames.
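LibreOffice always names its output by swapping the extension, so suffix naming requires a rename after conversion. A sketch, assuming the "_converted" suffix from the example above:

```shell
# Convert, then rename the output to add a suffix so it can never
# collide with a source file, even in the same directory.
for f in *.docx; do
  libreoffice --headless --convert-to pdf "$f" \
    && mv "${f%.docx}.pdf" "${f%.docx}_converted.pdf"
done
```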
Output directory: Write all converted files to a separate directory. Source stays untouched. This is the cleanest approach:
mkdir -p output
libreoffice --headless --convert-to pdf --outdir output/ *.docx

Preserve directory structure: For nested directories, replicate the source tree in the output:
find source/ -name '*.docx' | while IFS= read -r f; do
outdir="output/$(dirname "${f#source/}")"
mkdir -p "$outdir"
libreoffice --headless --convert-to pdf --outdir "$outdir" "$f"
done
Error Handling at Scale
When processing 500 files, some will fail. Corrupt files, password-protected documents, unsupported features, and timeouts are all common. A batch script without error handling stops at the first failure, or worse, continues silently while producing broken output.
Log successes and failures:
for f in *.docx; do
if timeout 120 libreoffice --headless --norestore --convert-to pdf --outdir output/ "$f" 2>>error.log; then
echo "OK: $f" >> conversion.log
else
echo "FAIL: $f" >> conversion.log
fi
done

Common failure patterns:
- Password-protected files: LibreOffice can't open them without the password. They fail silently or produce empty output. Pre-scan for encrypted files and separate them.
- Corrupt files: Damaged ZIP structure (DOCX) or broken binary (DOC). LibreOffice may crash or produce partial output. The timeout wrapper prevents infinite hangs.
- Unsupported features: OLE objects, ActiveX controls, and some embedded content may not convert. The file converts but the unsupported content is missing.
- Encoding issues: Old DOC files or RTF files may use non-UTF-8 encodings. Text may garble in conversion. Specify encoding when possible.
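The pre-scan suggested above can be sketched in a few lines. Password-protected OOXML files are stored as OLE containers rather than ZIP archives, so file(1) reports them differently from a normal DOCX, and unzip -t catches a damaged ZIP structure. The skip/ directory name is illustrative, and the exact strings file prints vary by version:

```shell
# Move files that are unlikely to convert out of the batch first.
mkdir -p skip
for f in *.docx; do
  if file "$f" | grep -qi 'encrypted\|Composite Document'; then
    mv "$f" skip/                      # likely password-protected
  elif ! unzip -tq "$f" >/dev/null 2>&1; then
    mv "$f" skip/                      # damaged ZIP structure
  fi
done
```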
Post-conversion verification: After batch conversion, verify output files exist and have non-zero size. For PDF output, pdfinfo (from poppler-utils) confirms the file is valid PDF and reports page count.
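That verification pass can be scripted directly against the source list, so missing outputs are caught as well as broken ones. A sketch assuming the output/ directory used earlier and pdfinfo from poppler-utils:

```shell
# For each source file, check the expected PDF exists, is non-empty,
# and parses as valid PDF.
for f in *.docx; do
  pdf="output/${f%.docx}.pdf"
  if [ ! -s "$pdf" ]; then
    echo "MISSING/EMPTY: $pdf" >> verify.log
  elif ! pdfinfo "$pdf" >/dev/null 2>&1; then
    echo "INVALID: $pdf" >> verify.log
  fi
done
```

Iterating over the sources rather than the outputs is the important detail: a conversion that produced no file at all still gets logged.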
When to Use a Service vs. DIY
DIY batch conversion makes sense when: you have the technical skills, the files are on your machine (no upload needed), privacy requires local processing, you need custom handling (specific output settings, naming conventions), or it's a one-time batch.
A conversion service makes sense when: you don't want to install and configure tools, the batch is small enough to upload (under 100 files), you need a specific conversion that DIY tools handle poorly (e.g., high-fidelity PDF to DOCX), or you need an API for ongoing automated conversion.
ChangeThisFile's API (/v1/convert) handles individual file conversions with authentication. For batch processing, wrap API calls in a script:
for f in *.docx; do
curl -X POST https://changethisfile.com/v1/convert \
-H "X-API-Key: YOUR_KEY" \
-F "file=@$f" \
-F "target=pdf" \
-o "output/${f%.docx}.pdf"
done

Rate limits apply (5 requests/minute for anonymous, higher for authenticated). For large batches, local tools are faster and free.
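At 5 requests/minute, the loop needs to pace itself to one upload every 12 seconds and should notice failed requests. A sketch building on the curl example above (endpoint and header names as shown there; the error log filename is illustrative):

```shell
# Pace API calls to stay under 5 requests/minute and log failures.
# curl -f makes HTTP error responses exit non-zero.
mkdir -p output
for f in *.docx; do
  if ! curl -sf -X POST https://changethisfile.com/v1/convert \
      -H "X-API-Key: YOUR_KEY" \
      -F "file=@$f" -F "target=pdf" \
      -o "output/${f%.docx}.pdf"; then
    echo "FAIL: $f" >> api_errors.log
  fi
  sleep 12   # 60s / 5 requests per minute
done
```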
Common Batch Conversion Pitfalls
Font substitution across files: If 500 files use Calibri but your Linux server has Liberation Sans, all 500 convert with the substitute font. This is consistent (good) but different from the original (bad). Install the required fonts before starting the batch, not after discovering the problem 200 files in.
Disk space: 500 DOCX files might be 2GB. Converting to PDF might produce another 2GB. Converting large image-heavy documents can produce PDFs larger than the source. Monitor disk space during batch jobs: df -h before and periodically during processing.
Overwriting originals: If source and output directories are the same and the tool writes output with the same name, you'll overwrite your source files. Always use a separate output directory, or verify the output extension differs from the source extension before starting.
Character encoding: Old documents (DOC, RTF, TXT from the 1990s-2000s) may use Windows-1252, ISO-8859-1, or other non-UTF-8 encodings. Conversion tools may assume UTF-8 and garble special characters (accented letters, em dashes, smart quotes). Detect encoding first: file -i document.txt shows the detected encoding. Convert encoding separately if needed: iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt.
Invisible failures: Some conversions "succeed" but produce broken output — a PDF with blank pages, a DOCX with missing images, or an HTML file with garbled characters. Spot-check a sample of converted files. For PDF, verify page count matches. For DOCX, spot-check formatting on a few representative files.
Batch conversion is a solved problem at the tool level. LibreOffice, Pandoc, and Ghostscript handle the heavy lifting reliably. The unsolved part is the operational layer: error handling, logging, verification, and dealing with the 5% of files that don't convert cleanly. Invest time in the script (timeout wrappers, logging, output verification) rather than assuming all 500 files will convert perfectly on the first pass. They won't.
For small batches (under 50 files), a simple shell loop is sufficient. For large batches (hundreds or thousands), add parallel processing, progress tracking, and systematic error handling. The tools are the same — the difference is how much you trust your input files and how much you verify your output.
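Those large-batch pieces can be combined in one GNU parallel invocation. A sketch, assuming GNU parallel and the directory layout from earlier examples; the job-log filename and parallelism level are arbitrary:

```shell
# Large batch: progress bar, per-file timeout, per-slot LibreOffice
# profile, and a job log recording each file's exit code. Failed jobs
# can later be rerun with: parallel --retry-failed --joblog batch.log
find source/ -name '*.docx' | \
  parallel --bar --joblog batch.log -j4 \
    timeout 120 libreoffice --headless --norestore \
      -env:UserInstallation=file:///tmp/lo-{%} \
      --convert-to pdf --outdir output/ {}
```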