You have 500 DOCX files that need to become PDFs. Or 2,000 DOC files migrating to DOCX. Or a directory of Markdown files building into a documentation site. One file at a time through a web converter would take days. You need batch processing.
Batch conversion is command-line territory. The tools are free, reliable, and scriptable: LibreOffice headless for office documents, Pandoc for markup languages, Ghostscript for PDF manipulation. The challenge isn't the conversion itself — it's handling the failures, edge cases, and consistency issues that only appear at scale.
LibreOffice Headless: The Office Format Workhorse
LibreOffice's headless mode runs without a GUI, accepting command-line arguments for document conversion. It's the most reliable free tool for converting between office formats.
Basic usage:
# Single file
libreoffice --headless --convert-to pdf document.docx
# All DOCX files in a directory
libreoffice --headless --convert-to pdf *.docx
# Specify output directory
libreoffice --headless --convert-to pdf --outdir /output/ *.docx
# Convert DOC to DOCX
libreoffice --headless --convert-to docx *.doc
# Convert to ODT
libreoffice --headless --convert-to odt *.docx

Supported conversions: DOC/DOCX/ODT/RTF to PDF, DOC to DOCX, DOCX to ODT (and reverse), spreadsheets (XLS/XLSX/ODS) to PDF/CSV, presentations (PPT/PPTX/ODP) to PDF, and most cross-format office document conversions.
LibreOffice Batch Limitations
Single instance: LibreOffice headless can only run one conversion at a time. Concurrent invocations queue or fail. For parallel processing, use separate user profiles: libreoffice --headless -env:UserInstallation=file:///tmp/lo-profile-1 --convert-to pdf file.docx. Each profile allows an independent instance.
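The profile trick above can be combined with GNU parallel to run several independent instances at once. A minimal sketch, assuming bash and GNU parallel are installed; the profile paths under /tmp and the output/ directory are illustrative:

```shell
# Run up to 4 LibreOffice conversions concurrently, giving each job
# slot its own user profile so the instances don't conflict.
convert_one() {
  local f="$1" slot="$2"
  libreoffice --headless --norestore \
    -env:UserInstallation="file:///tmp/lo-profile-$slot" \
    --convert-to pdf --outdir output/ "$f"
}
export -f convert_one

# {%} is GNU parallel's job-slot number (1..4 with -j4), so each
# concurrent job reuses a stable profile directory.
find . -maxdepth 1 -name '*.docx' | parallel -j4 convert_one {} '{%}'
```

Keying the profile to the job slot rather than the filename keeps the number of profile directories bounded at the parallelism level.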
Timeout risk: Complex documents (100+ pages, many images, embedded objects) can take minutes to convert. LibreOffice has no built-in timeout — a stuck conversion blocks the queue. Wrap calls in timeout: timeout 120 libreoffice --headless --convert-to pdf file.docx.
Recovery mode: If LibreOffice crashes (it happens with corrupt or very complex files), it may start in recovery mode on the next invocation, blocking the command line. Add --norestore to prevent this: libreoffice --headless --norestore --convert-to pdf file.docx.
Font availability: LibreOffice on Linux typically lacks Microsoft fonts (Calibri, Cambria, etc.). Install ttf-mscorefonts-installer (Debian/Ubuntu) or equivalent. Without these fonts, DOCX files using Calibri will render with Liberation Sans, changing line breaks and page breaks throughout.
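Before starting a large batch, it is worth confirming the fonts the documents depend on are actually installed. A small sketch using fc-list (part of fontconfig); the font names checked here are just the common Microsoft defaults:

```shell
# Warn up front about any required font that fontconfig can't find.
for font in Calibri Cambria; do
  if ! fc-list | grep -qi "$font"; then
    echo "WARNING: $font not installed; expect substitution" >&2
  fi
done
```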
Pandoc: Markup Format Batch Processing
Pandoc converts between markup formats: Markdown, HTML, LaTeX, DOCX, ODT, EPUB, reStructuredText, and dozens more. For batch processing, wrap Pandoc in a shell script.
Batch Markdown to DOCX:
for f in *.md; do
pandoc "$f" -o "${f%.md}.docx"
done

Batch Markdown to PDF (via LaTeX):
for f in *.md; do
pandoc "$f" -o "${f%.md}.pdf" --pdf-engine=xelatex
done

Batch HTML to Markdown:
for f in *.html; do
pandoc "$f" -f html -t markdown -o "${f%.html}.md"
done

With bibliography:
for f in *.md; do
pandoc "$f" --bibliography=refs.bib --csl=apa.csl -o "${f%.md}.pdf"
done

Pandoc processes one file per invocation. For true parallelism: find . -name '*.md' | parallel pandoc {} -o {.}.pdf (using GNU parallel).
Ghostscript: PDF Batch Operations
Ghostscript processes PDFs at the page description level — it's the lowest-level tool for PDF manipulation.
Compress PDFs (batch):
for f in *.pdf; do
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH \
-sOutputFile="compressed/${f}" "$f"
done

Compression presets: /screen (72 DPI, smallest), /ebook (150 DPI, good quality), /printer (300 DPI, print quality), /prepress (300 DPI, preserves color fidelity).
Convert to PDF/A:
gs -dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE \
-sColorConversionStrategy=UseDeviceIndependentColor \
-sDEVICE=pdfwrite -sOutputFile=output_pdfa.pdf \
PDFA_def.ps input.pdf

Merge multiple PDFs:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
-sOutputFile=merged.pdf file1.pdf file2.pdf file3.pdf

Extract pages:
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
-dFirstPage=5 -dLastPage=10 \
-sOutputFile=pages_5-10.pdf input.pdf
Naming Conventions for Batch Output
Consistent naming prevents confusion when processing hundreds of files. Strategies:
Extension swap: Replace the source extension with the target extension. report.docx becomes report.pdf. This is the default for LibreOffice and most tools. Simple, but overwrites if a file with the target name already exists.
Suffix addition: Add a suffix before the extension. report.docx becomes report_converted.pdf. Prevents overwriting but clutters filenames.
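LibreOffice always names its output by swapping the extension, so suffix naming requires a rename after conversion. A sketch, assuming the "_converted" suffix from the example above:

```shell
# Convert, then rename the output to add a suffix so it can never
# collide with a source file, even in the same directory.
for f in *.docx; do
  libreoffice --headless --convert-to pdf "$f" \
    && mv "${f%.docx}.pdf" "${f%.docx}_converted.pdf"
done
```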
Output directory: Write all converted files to a separate directory. Source stays untouched. This is the cleanest approach:
mkdir -p output
libreoffice --headless --convert-to pdf --outdir output/ *.docx

Preserve directory structure: For nested directories, replicate the source tree in the output:
find source/ -name '*.docx' | while IFS= read -r f; do
outdir="output/$(dirname "${f#source/}")"
mkdir -p "$outdir"
libreoffice --headless --convert-to pdf --outdir "$outdir" "$f"
done
Error Handling at Scale
When processing 500 files, some will fail. Corrupt files, password-protected documents, unsupported features, and timeouts are all common. A batch script without error handling stops at the first failure, or worse, continues silently while producing broken output.
Log successes and failures:
for f in *.docx; do
if timeout 120 libreoffice --headless --norestore --convert-to pdf --outdir output/ "$f" 2>>error.log; then
echo "OK: $f" >> conversion.log
else
echo "FAIL: $f" >> conversion.log
fi
done

Common failure patterns:
- Password-protected files: LibreOffice can't open them without the password. They fail silently or produce empty output. Pre-scan for encrypted files and separate them.
- Corrupt files: Damaged ZIP structure (DOCX) or broken binary (DOC). LibreOffice may crash or produce partial output. The timeout wrapper prevents infinite hangs.
- Unsupported features: OLE objects, ActiveX controls, and some embedded content may not convert. The file converts but the unsupported content is missing.
- Encoding issues: Old DOC files or RTF files may use non-UTF-8 encodings. Text may garble in conversion. Specify encoding when possible.
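The pre-scan suggested above can be sketched in a few lines. Password-protected OOXML files are stored as OLE containers rather than ZIP archives, so file(1) reports them differently from a normal DOCX, and unzip -t catches a damaged ZIP structure. The skip/ directory name is illustrative, and the exact strings file prints vary by version:

```shell
# Move files that are unlikely to convert out of the batch first.
mkdir -p skip
for f in *.docx; do
  if file "$f" | grep -qi 'encrypted\|Composite Document'; then
    mv "$f" skip/                      # likely password-protected
  elif ! unzip -tq "$f" >/dev/null 2>&1; then
    mv "$f" skip/                      # damaged ZIP structure
  fi
done
```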
Post-conversion verification: After batch conversion, verify output files exist and have non-zero size. For PDF output, pdfinfo (from poppler-utils) confirms the file is valid PDF and reports page count.
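That verification pass can be scripted directly against the source list, so missing outputs are caught as well as broken ones. A sketch assuming the output/ directory used earlier and pdfinfo from poppler-utils:

```shell
# For each source file, check the expected PDF exists, is non-empty,
# and parses as valid PDF.
for f in *.docx; do
  pdf="output/${f%.docx}.pdf"
  if [ ! -s "$pdf" ]; then
    echo "MISSING/EMPTY: $pdf" >> verify.log
  elif ! pdfinfo "$pdf" >/dev/null 2>&1; then
    echo "INVALID: $pdf" >> verify.log
  fi
done
```

Iterating over the sources rather than the outputs is the important detail: a conversion that produced no file at all still gets logged.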
When to Use a Service vs. DIY
DIY batch conversion makes sense when: you have the technical skills, the files are on your machine (no upload needed), privacy requires local processing, you need custom handling (specific output settings, naming conventions), or it's a one-time batch.
A conversion service makes sense when: you don't want to install and configure tools, the batch is small enough to upload (under 100 files), you need a specific conversion that DIY tools handle poorly (e.g., high-fidelity PDF to DOCX), or you need an API for ongoing automated conversion.
ChangeThisFile's API (/v1/convert) handles individual file conversions with authentication. For batch processing, wrap API calls in a script:
for f in *.docx; do
curl -X POST https://changethisfile.com/v1/convert \
-H "X-API-Key: YOUR_KEY" \
-F "file=@$f" \
-F "target=pdf" \
-o "output/${f%.docx}.pdf"
done

Rate limits apply (5 requests/minute for anonymous, higher for authenticated). For large batches, local tools are faster and free.
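At 5 requests/minute, the loop needs to pace itself to one upload every 12 seconds and should notice failed requests. A sketch building on the curl example above (endpoint and header names as shown there; the error log filename is illustrative):

```shell
# Pace API calls to stay under 5 requests/minute and log failures.
# curl -f makes HTTP error responses exit non-zero.
mkdir -p output
for f in *.docx; do
  if ! curl -sf -X POST https://changethisfile.com/v1/convert \
      -H "X-API-Key: YOUR_KEY" \
      -F "file=@$f" -F "target=pdf" \
      -o "output/${f%.docx}.pdf"; then
    echo "FAIL: $f" >> api_errors.log
  fi
  sleep 12   # 60s / 5 requests per minute
done
```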
Common Batch Conversion Pitfalls
Font substitution across files: If 500 files use Calibri but your Linux server has Liberation Sans, all 500 convert with the substitute font. This is consistent (good) but different from the original (bad). Install the required fonts before starting the batch, not after discovering the problem 200 files in.
Disk space: 500 DOCX files might be 2GB. Converting to PDF might produce another 2GB. Converting large image-heavy documents can produce PDFs larger than the source. Monitor disk space during batch jobs: df -h before and periodically during processing.
Overwriting originals: If source and output directories are the same and the tool writes output with the same name, you'll overwrite your source files. Always use a separate output directory, or verify the output extension differs from the source extension before starting.
Character encoding: Old documents (DOC, RTF, TXT from the 1990s-2000s) may use Windows-1252, ISO-8859-1, or other non-UTF-8 encodings. Conversion tools may assume UTF-8 and garble special characters (accented letters, em dashes, smart quotes). Detect encoding first: file -i document.txt shows the detected encoding. Convert encoding separately if needed: iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt.
Invisible failures: Some conversions "succeed" but produce broken output — a PDF with blank pages, a DOCX with missing images, or an HTML file with garbled characters. Spot-check a sample of converted files. For PDF, verify page count matches. For DOCX, spot-check formatting on a few representative files.
Batch conversion is a solved problem at the tool level. LibreOffice, Pandoc, and Ghostscript handle the heavy lifting reliably. The unsolved part is the operational layer: error handling, logging, verification, and dealing with the 5% of files that don't convert cleanly. Invest time in the script (timeout wrappers, logging, output verification) rather than assuming all 500 files will convert perfectly on the first pass. They won't.
For small batches (under 50 files), a simple shell loop is sufficient. For large batches (hundreds or thousands), add parallel processing, progress tracking, and systematic error handling. The tools are the same — the difference is how much you trust your input files and how much you verify your output.
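Those large-batch pieces can be combined in one GNU parallel invocation. A sketch, assuming GNU parallel and the directory layout from earlier examples; the job-log filename and parallelism level are arbitrary:

```shell
# Large batch: progress bar, per-file timeout, per-slot LibreOffice
# profile, and a job log recording each file's exit code. Failed jobs
# can later be rerun with: parallel --retry-failed --joblog batch.log
find source/ -name '*.docx' | \
  parallel --bar --joblog batch.log -j4 \
    timeout 120 libreoffice --headless --norestore \
      -env:UserInstallation=file:///tmp/lo-{%} \
      --convert-to pdf --outdir output/ {}
```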