"Compress PDF" means different things depending on the file. A scan-heavy PDF shrinks when you downsample embedded images. A text-only PDF shrinks when you strip unused objects and apply flate compression. Treating both the same way is why most compression tools either do nothing or visibly degrade your document.
This guide covers three techniques ordered by aggressiveness: lossless object cleanup, image-aware compression, and the ChangeThisFile API for environments where you don't want to install Ghostscript or PyMuPDF.
TL;DR
- Text-only PDF: PyMuPDF garbage collect + deflate. File shrinks 20-40%, images untouched.
- Scan/image-heavy PDF: Ghostscript /ebook preset, which downsamples images to 150 DPI. Typical 60-80% size reduction, no visible loss on screen.
- No local install: POST to https://changethisfile.com/v1/convert with target=pdf and compress=true.
Why PDFs get bloated
PDFs accumulate fat in three ways:
- Embedded images at print DPI. A scanner creates 600 DPI TIFFs. Acrobat embeds them verbatim. A 20-page scan PDF can be 50MB when 5MB would look identical on screen.
- Incremental save cruft. Every time you edit and resave a PDF, Acrobat appends a new revision rather than rewriting the file. A 10-edit document can have 10x the necessary bytes.
- Embedded fonts (unsubsetted). A 5MB TrueType font embedded in full when only 30 glyphs are used.
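Lossless compression attacks the second and third causes. Image resampling attacks the first. The right technique depends on what's making your file large. pdfinfo yourdoc.pdf reports the page count and file size; pdfimages -list yourdoc.pdf and pdffonts yourdoc.pdf (from the same Poppler tools) list the embedded images and fonts, which is usually enough to tell which cause dominates.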
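Incremental-save cruft can also be spotted without any PDF library, because each incremental save appends a new trailer that ends in a %%EOF marker. The sketch below counts those markers. It is a rough heuristic under stated assumptions, not a parser: linearized ("fast web view") files normally contain two markers, and the marker bytes can occasionally appear inside an uncompressed stream. The function names are made up for this example.

```python
import re
from pathlib import Path

EOF_MARKER = re.compile(rb"%%EOF")


def count_revisions(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; every incremental save appends one more."""
    return len(EOF_MARKER.findall(pdf_bytes))


def revision_report(path: str) -> dict:
    """Summarize file size and the approximate number of saved revisions."""
    data = Path(path).read_bytes()
    revisions = count_revisions(data)
    return {
        "size_mb": round(len(data) / 1e6, 2),
        "revisions": revisions,
        "likely_incremental_cruft": revisions > 2,  # 2 allows for linearized files
    }


# Synthetic bytes standing in for a PDF that was saved once and then edited once.
sample = b"%PDF-1.4\n(original body)\n%%EOF\n(appended revision)\n%%EOF\n"
print(count_revisions(sample))  # 2
```

A high count suggests that a full rewrite (Method 1) will recover space even before any image is touched.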
Method 1: Lossless compression with PyMuPDF
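PyMuPDF's save options perform garbage collection and flate (zlib) compression without touching image quality at all. Font subsetting is a separate step: if unsubsetted fonts are the problem, call doc.subset_fonts() before saving (depending on your PyMuPDF version, this may require the fontTools package).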
pip install PyMuPDF
import os

import fitz  # PyMuPDF


def compress_pdf_lossless(in_path: str, out_path: str) -> dict:
    doc = fitz.open(in_path)
    doc.save(
        out_path,
        garbage=4,            # drop unreferenced objects, compact the xref table, merge duplicate objects and streams
        deflate=True,         # flate-compress streams that are stored uncompressed
        deflate_images=True,  # flate-compress image streams that are stored uncompressed
        clean=True,           # clean and sanitize content streams
    )
    doc.close()

    before = os.path.getsize(in_path)
    after = os.path.getsize(out_path)
    return {
        "before_mb": round(before / 1e6, 2),
        "after_mb": round(after / 1e6, 2),
        "reduction_pct": round((1 - after / before) * 100, 1),
    }


result = compress_pdf_lossless("document.pdf", "document-compressed.pdf")
print(result)  # e.g. {'before_mb': 8.4, 'after_mb': 5.1, 'reduction_pct': 39.3}
Typical results on text PDFs: 20-40% smaller. On already-compressed image PDFs: 5-15%. This never degrades visual quality.
Method 2: Ghostscript image resampling (best for scans)
Ghostscript can downsample embedded images to screen DPI. The /ebook preset targets 150 DPI — indistinguishable from 300-600 DPI on any display under 200 PPI.
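The arithmetic behind that claim is easy to check. The sketch below is illustrative only: the US Letter page size and the 110 PPI display are assumed values, and the function names are made up for this example.

```python
def pixels_at_dpi(width_in: float, height_in: float, dpi: float) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_count_ratio(src_dpi: float, dst_dpi: float) -> float:
    """How many times fewer pixels remain after downsampling src_dpi to dst_dpi."""
    return (src_dpi / dst_dpi) ** 2


# Assumed example: a US Letter page (8.5 x 11 in).
scan = pixels_at_dpi(8.5, 11, 600)    # as scanned
ebook = pixels_at_dpi(8.5, 11, 150)   # after /ebook downsampling
screen = pixels_at_dpi(8.5, 11, 110)  # page at 100% zoom on an assumed 110 PPI display

print(scan, ebook, screen)            # (5100, 6600) (1275, 1650) (935, 1210)
print(pixel_count_ratio(600, 150))    # 16.0
```

Even after downsampling, the 150 DPI image has more pixels than this display can show at 100% zoom. The file does not shrink by the full 16x, because the images are stored compressed and compressed size does not scale linearly with pixel count.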
# Install
apt install ghostscript # Linux
brew install ghostscript # macOS
# /screen = 72 DPI (tiny, visibly degraded — avoid)
# /ebook = 150 DPI (web/email quality — recommended)
# /printer = 300 DPI (print quality — conservative)
# /prepress = 300 DPI + color profiles kept
gs -sDEVICE=pdfwrite \
-dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=compressed.pdf \
input.pdf
import os
import subprocess


def ghostscript_compress(in_path: str, out_path: str, preset: str = "/ebook") -> dict:
    """preset: /screen /ebook /printer /prepress"""
    result = subprocess.run([
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={preset}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={out_path}",
        in_path,
    ], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Ghostscript failed: {result.stderr}")

    before = os.path.getsize(in_path)
    after = os.path.getsize(out_path)
    return {
        "before_mb": round(before / 1e6, 2),
        "after_mb": round(after / 1e6, 2),
        "reduction_pct": round((1 - after / before) * 100, 1),
    }


# 20-page scan PDF: 18.2MB → 3.4MB (81% reduction)
print(ghostscript_compress("scan.pdf", "scan-compressed.pdf"))
Real benchmarks (20-page document scan):
- /screen: 18.2MB → 1.1MB (94%; too aggressive, visible blur)
- /ebook: 18.2MB → 3.4MB (81%; sweet spot)
- /printer: 18.2MB → 7.8MB (57%; conservative)
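If /ebook is too soft and /printer too large, Ghostscript also accepts explicit image-resolution parameters alongside a preset. The sketch below only builds the argument list, so it runs without Ghostscript installed. build_gs_args is a hypothetical helper; the -dDownsample...Images and -d...ImageResolution switches are Ghostscript distiller parameters, but how they combine with a preset can differ between versions, so treat the result as a starting point and confirm the output DPI with pdfimages -list.

```python
from typing import Optional

VALID_PRESETS = {"/screen", "/ebook", "/printer", "/prepress"}


def build_gs_args(in_path: str, out_path: str,
                  preset: str = "/ebook", dpi: Optional[int] = None) -> list:
    """Build a pdfwrite command line, optionally overriding the image DPI."""
    if preset not in VALID_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    if dpi is not None and dpi <= 0:
        raise ValueError("dpi must be positive")

    args = [
        "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={preset}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
    ]
    if dpi is not None:
        # Override color and grayscale targets; 1-bit images keep the preset's value.
        args += [
            "-dDownsampleColorImages=true", f"-dColorImageResolution={dpi}",
            "-dDownsampleGrayImages=true", f"-dGrayImageResolution={dpi}",
        ]
    args += [f"-sOutputFile={out_path}", in_path]
    return args


# Assumed example: a 200 DPI target between /ebook (150) and /printer (300).
print(" ".join(build_gs_args("scan.pdf", "scan-200dpi.pdf", dpi=200)))
```

The returned list can be passed straight to subprocess.run, as in ghostscript_compress. Validating the preset up front also keeps arbitrary strings out of the command line.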
Method 3: ChangeThisFile API
POST the PDF with target=pdf. The API runs Ghostscript /ebook server-side — no local installs.
curl -X POST https://changethisfile.com/v1/convert \
-H "Authorization: Bearer ctf_sk_your_key_here" \
-F "file=@document.pdf" \
-F "target=pdf" \
--output compressed.pdf
import os

import requests

API_KEY = "ctf_sk_your_key_here"


def compress_pdf(in_path: str, out_path: str) -> dict:
    with open(in_path, "rb") as f:
        resp = requests.post(
            "https://changethisfile.com/v1/convert",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"target": "pdf"},
            timeout=120,
        )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

    before = os.path.getsize(in_path)
    after = os.path.getsize(out_path)
    return {"reduction_pct": round((1 - after / before) * 100, 1)}


print(compress_pdf("big.pdf", "small.pdf"))
// Requires node-fetch v2 and form-data (npm install node-fetch@2 form-data).
// node-fetch v3 is ESM-only and cannot be loaded with require().
const fs = require('fs');
const FormData = require('form-data');
const fetch = require('node-fetch');

async function compressPdf(inPath, outPath) {
  const form = new FormData();
  form.append('file', fs.createReadStream(inPath));
  form.append('target', 'pdf');

  const res = await fetch('https://changethisfile.com/v1/convert', {
    method: 'POST',
    headers: { 'Authorization': 'Bearer ctf_sk_your_key_here', ...form.getHeaders() },
    body: form,
  });
  if (!res.ok) throw new Error(await res.text());

  const buf = await res.buffer();
  fs.writeFileSync(outPath, buf);
}
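Both clients write the response body to disk as-is. A cheap safeguard is to check for the %PDF- header first, so that an error page returned with a 200 status never ends up saved as compressed.pdf. This is a minimal sketch: looks_like_pdf and save_if_pdf are hypothetical helpers, and the error body in the demo is invented for illustration, not taken from the API.

```python
def looks_like_pdf(data: bytes) -> bool:
    """True if the payload starts with the PDF file header."""
    return data.startswith(b"%PDF-")


def save_if_pdf(data: bytes, out_path: str) -> int:
    """Write the payload only if it looks like a PDF; return bytes written."""
    if not looks_like_pdf(data):
        preview = data[:80].decode("latin-1")
        raise ValueError(f"response is not a PDF, starts with: {preview!r}")
    with open(out_path, "wb") as f:
        return f.write(data)


print(looks_like_pdf(b"%PDF-1.7\n..."))            # True
print(looks_like_pdf(b'{"error": "hypothetical"}'))  # False
```

In the Python client above, this would replace the plain f.write(resp.content) call with save_if_pdf(resp.content, out_path).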
Edge cases and gotchas
- Password-protected PDFs. Ghostscript cannot compress password-protected PDFs by default. Decrypt first with qpdf --decrypt input.pdf decrypted.pdf.
- Already-compressed images. If images are already JPEG-compressed at 150 DPI, Ghostscript /ebook may produce a slightly larger file (re-encoding overhead). Check input DPI with pdfimages -list input.pdf.
- Vector/text PDFs don't benefit from Ghostscript image resampling. Use PyMuPDF lossless instead; Ghostscript can even inflate pure-vector PDFs by rebuilding the content stream inefficiently.
- Form fields and annotations. Ghostscript flattens interactive forms. If the PDF has fillable fields you need to preserve, use PyMuPDF lossless or the API (which preserves structure).
- PDF/A compliance. Ghostscript's /ebook may break PDF/A compliance markers. For archival PDFs, use /prepress or PyMuPDF.
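Since two of these cases can leave the output larger than the input, a pipeline can simply keep whichever file is smaller. A minimal sketch, with keep_smaller as a hypothetical helper and synthetic files standing in for real PDFs:

```python
import os
import shutil
import tempfile


def keep_smaller(original: str, candidate: str, final: str) -> str:
    """Copy the smaller of the two files to `final`; report which one won."""
    if os.path.getsize(candidate) < os.path.getsize(original):
        shutil.copyfile(candidate, final)
        return "compressed"
    shutil.copyfile(original, final)
    return "original"


# Synthetic demo: the "compressed" candidate came out larger than the input.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "in.pdf")
    candidate = os.path.join(tmp, "gs-out.pdf")
    final = os.path.join(tmp, "final.pdf")
    with open(original, "wb") as f:
        f.write(b"x" * 100)
    with open(candidate, "wb") as f:
        f.write(b"x" * 120)
    choice = keep_smaller(original, candidate, final)
    final_size = os.path.getsize(final)

print(choice, final_size)  # original 100
```

Run after either compression method, this guarantees the pipeline never makes a file bigger.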
Scaling tips for bulk compression
import concurrent.futures
import os
import subprocess
from pathlib import Path


def compress_one(pdf_path: Path, out_dir: Path) -> str:
    out = out_dir / pdf_path.name
    subprocess.run([
        "gs", "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/ebook",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={out}", str(pdf_path),
    ], check=True)
    before = pdf_path.stat().st_size
    after = out.stat().st_size
    return f"{pdf_path.name}: {before//1024}KB → {after//1024}KB"


input_dir = Path("./pdfs")
out_dir = Path("./pdfs-compressed")
out_dir.mkdir(exist_ok=True)

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futs = [pool.submit(compress_one, p, out_dir) for p in input_dir.glob("*.pdf")]
    for f in concurrent.futures.as_completed(futs):
        print(f.result())
Ghostscript is CPU-bound, and each job runs in its own gs process, so a thread pool is enough to parallelize it; max_workers=os.cpu_count() fully saturates a machine. For API batching, stay under 10 concurrent requests on the free tier (1,000 req/month limit).
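For the API side, both limits can be enforced in one place. The sketch below caps concurrency and stops submitting once a request budget is used up. run_with_quota is a hypothetical helper, and convert stands for any function that uploads one file, such as compress_pdf from Method 3; the demo uses a fake so that it runs offline.

```python
import concurrent.futures
from typing import Callable, Iterable


def run_with_quota(items: Iterable[str], convert: Callable[[str], object],
                   max_concurrent: int = 8, quota: int = 1000):
    """Run convert() on at most `quota` items, `max_concurrent` at a time.

    Returns (results, skipped): results maps item -> return value or exception,
    skipped lists the items left out because the quota ran out.
    """
    items = list(items)
    allowed, skipped = items[:quota], items[quota:]
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(convert, item): item for item in allowed}
        for fut in concurrent.futures.as_completed(futures):
            item = futures[fut]
            try:
                results[item] = fut.result()
            except Exception as exc:  # keep going; record the failure per file
                results[item] = exc
    return results, skipped


# Offline demo with a fake converter and a deliberately tiny quota.
results, skipped = run_with_quota(
    ["a.pdf", "b.pdf", "c.pdf"], lambda name: name.upper(),
    max_concurrent=2, quota=2,
)
print(sorted(results.items()), skipped)
# [('a.pdf', 'A.PDF'), ('b.pdf', 'B.PDF')] ['c.pdf']
```

The quota here is per call, so a real batch job would need to persist how many requests it has already used this month.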
Match the technique to the PDF type. For scanned documents, Ghostscript /ebook is a widely used default. For text/vector PDFs, PyMuPDF's lossless pass is the safe choice because it never touches image data. Get a free API key (1,000 conversions/month) if you'd rather skip the install.