PDF-to-image is one of the highest-volume use cases for conversion pipelines: document preview thumbnails, OCR preprocessing, PDF archival to image sequences. At 1,000 PDFs, the naive serial approach takes 33 minutes. At scale, the architecture changes: you need to think about multi-page handling, per-page vs per-document billing, timeout management for large PDFs, and how to distribute the work across the async jobs queue.

TL;DR — the math for 1K PDFs

Assumptions: average PDF is 5 pages, 8MB, converts in ~2s at the API. 10 concurrent workers.

| Batch size | Wall time (10 workers) | Plan needed | Cost |
|---|---|---|---|
| 100 PDFs | ~20s | Free (1K/mo) | $0 |
| 1,000 PDFs | ~3.5 min | Hobby ($29/mo) | $2.90 |
| 10,000 PDFs | ~35 min | Startup ($99/mo) | ~$20 |
| 50,000 PDFs | ~3 hr | Scale ($499/mo) | ~$100 |

Each PDF conversion = 1 API call regardless of page count. The output for a multi-page PDF is a ZIP file containing one image per page. You are billed per conversion (per PDF), not per page.
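The wall-time and cost math in the table above reduces to two multiplications, so it's worth wiring a quick estimator into your tooling. This is a sketch; the plan prices and quotas are the ones from the table, so substitute your own plan's numbers:

```python
def estimate_batch(n_pdfs: int, secs_per_pdf: float = 2.0,
                   workers: int = 10, plan_price: float = 29.0,
                   plan_quota: int = 10_000) -> tuple[float, float]:
    """Return (wall_time_seconds, approx_cost_usd) for a batch.

    Wall time assumes the workers stay saturated; cost is the plan's
    effective per-conversion rate times the batch size.
    """
    wall_secs = n_pdfs * secs_per_pdf / workers
    cost = n_pdfs * (plan_price / plan_quota)
    return wall_secs, cost


# 1,000 PDFs at 2s each on 10 workers, Hobby plan ($29 / 10K conversions):
wall, cost = estimate_batch(1_000)
# 200 s of wall time, ~$2.90 — matching the table above
```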

Naive approach: why sequential fails at 1K

The sequential pattern:

import requests

for pdf_path in pdf_files:
    with open(pdf_path, 'rb') as f:
        resp = requests.post(
            'https://changethisfile.com/v1/convert',
            headers={'Authorization': f'Bearer {API_KEY}'},
            files={'file': f},
            data={'target': 'jpg'},
            timeout=120,
        )
    resp.raise_for_status()
    # save output...

At 2s per PDF (fast, single-page): 1K PDFs = 33 minutes. At 5s per PDF (realistic, multi-page): 83 minutes. And this assumes zero failures — a single hung conversion blocks the entire queue. The 120s timeout applies per request, meaning a slow 50-page PDF can stall your pipeline for 2 minutes.

For PDFs specifically, there's a second failure mode: memory. Converting a 100-page PDF produces 100 images in a single ZIP response. If you're processing 10 PDFs in parallel, you may be holding 10 large ZIPs in memory simultaneously.

Batching strategies for large PDF sets

Three strategies depending on your volume and PDF characteristics:

Strategy 1: Sync endpoint, worker pool (under 10K PDFs, under 20MB each)

Use /v1/convert with 10-15 concurrent asyncio workers. Simple, fast, works for most pipelines.
The full pipeline later in this guide implements this strategy.

Strategy 2: Async jobs endpoint (large PDFs or high volume)

Use /v1/jobs (POST to start, GET to poll). The API processes the conversion in the background and returns a download URL. This avoids the 120s sync timeout and is the correct pattern for PDFs over 20MB or 50+ pages.

import asyncio
import httpx

async def convert_large_pdf(client, pdf_path, target='jpg'):
    # Submit job (Path.open is not an async context manager, so
    # read the file synchronously before posting)
    resp = await client.post(
        'https://changethisfile.com/v1/jobs',
        headers={'Authorization': f'Bearer {API_KEY}'},
        files={'file': (pdf_path.name, pdf_path.read_bytes())},
        data={'target': target},
    )
    resp.raise_for_status()
    job_id = resp.json()['job_id']

    # Poll until done
    for _ in range(60):  # max 5 min
        await asyncio.sleep(5)
        status_resp = await client.get(
            f'https://changethisfile.com/v1/jobs/{job_id}',
            headers={'Authorization': f'Bearer {API_KEY}'},
        )
        status = status_resp.json()
        if status['state'] == 'done':
            # Download result
            dl = await client.get(status['download_url'])
            return dl.content  # ZIP bytes
        elif status['state'] == 'failed':
            raise RuntimeError(f"Job {job_id} failed: {status.get('error')}")

    raise TimeoutError(f'Job {job_id} did not complete in 5 minutes')

Strategy 3: Webhook-driven (fire and forget, largest volumes)

Submit all jobs upfront, receive results via webhook. See the webhook guide for the full pattern. Best for overnight batch jobs where you don't need to wait for results.

Full 1K PDF pipeline

#!/usr/bin/env python3
"""Convert a directory of PDFs to JPG images."""
import asyncio
import hashlib
import json
import os
import zipfile
from pathlib import Path

import httpx

API_KEY = os.environ['CTF_API_KEY']
API_URL = 'https://changethisfile.com/v1/convert'
CONCURRENCY = 10
OUTPUT_DIR = Path('output')
FAILURES_FILE = Path('failures.jsonl')


def idempotency_key(pdf_path: Path) -> str:
    stat = pdf_path.stat()
    payload = f"{pdf_path.resolve()}|jpg|{stat.st_size}|{stat.st_mtime_ns}"
    return hashlib.sha256(payload.encode()).hexdigest()[:32]


def already_converted(pdf_path: Path) -> bool:
    """Check if output already exists (any page image)."""
    out_dir = OUTPUT_DIR / pdf_path.stem
    if not out_dir.exists():
        return False
    return any(out_dir.glob('page-*.jpg'))


async def convert_pdf(
    client: httpx.AsyncClient,
    pdf_path: Path,
    sem: asyncio.Semaphore,
) -> tuple[str, bool]:
    if already_converted(pdf_path):
        return pdf_path.name, True  # skip

    async with sem:
        for attempt in range(3):
            try:
                content = pdf_path.read_bytes()
                resp = await client.post(
                    API_URL,
                    headers={
                        'Authorization': f'Bearer {API_KEY}',
                        'Idempotency-Key': idempotency_key(pdf_path),
                    },
                    files={'file': (pdf_path.name, content)},
                    data={'target': 'jpg'},
                    timeout=180,
                )

                if resp.status_code == 429:
                    await asyncio.sleep(int(resp.headers.get('Retry-After', '60')))
                    continue
                resp.raise_for_status()

                # Save output (may be ZIP for multi-page)
                out_dir = OUTPUT_DIR / pdf_path.stem
                out_dir.mkdir(parents=True, exist_ok=True)

                ct = resp.headers.get('Content-Type', '')
                if 'zip' in ct:
                    zip_path = out_dir / 'pages.zip'
                    zip_path.write_bytes(resp.content)
                    with zipfile.ZipFile(zip_path) as zf:
                        zf.extractall(out_dir)
                    zip_path.unlink()
                else:
                    (out_dir / 'page-001.jpg').write_bytes(resp.content)

                return pdf_path.name, True

            except (httpx.TimeoutException, httpx.HTTPStatusError):
                # retry server errors too, so one 5xx doesn't kill the task
                if attempt == 2:
                    return pdf_path.name, False
                await asyncio.sleep(2 ** attempt)

    return pdf_path.name, False


async def main():
    pdf_files = sorted(Path('.').glob('**/*.pdf'))
    print(f'Found {len(pdf_files)} PDFs')
    OUTPUT_DIR.mkdir(exist_ok=True)

    sem = asyncio.Semaphore(CONCURRENCY)
    success = 0

    async with httpx.AsyncClient() as client:
        tasks = [convert_pdf(client, p, sem) for p in pdf_files]
        for i, coro in enumerate(asyncio.as_completed(tasks), 1):
            name, ok = await coro
            if ok:
                success += 1
            else:
                with FAILURES_FILE.open('a') as f:
                    json.dump({'file': name}, f)
                    f.write('\n')
            print(f'\r[{i}/{len(pdf_files)}] {success} ok', end='')

    print(f'\nDone: {success}/{len(pdf_files)} converted')


if __name__ == '__main__':
    asyncio.run(main())

Multi-page PDF output handling

When a PDF has more than one page, the API returns a Content-Type: application/zip response containing one JPG per page, named page-001.jpg, page-002.jpg, etc. Single-page PDFs return the image directly.

Always check the Content-Type before writing the output:

import io
import zipfile

ct = resp.headers.get('Content-Type', '')
if 'zip' in ct:
    # Multi-page PDF: extract ZIP
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        zf.extractall(out_dir)
else:
    # Single-page PDF: direct image
    (out_dir / 'page-001.jpg').write_bytes(resp.content)

For downstream OCR or image processing, you often want page images named consistently regardless of PDF page count. The page-NNN.jpg naming convention the API uses is glob-friendly: sorted(out_dir.glob('page-*.jpg')) gives you pages in order.
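Because the names are zero-padded, a plain lexicographic sort yields page order (up to 999 pages), so the downstream iteration is one line. A minimal helper, with a hypothetical `run_ocr` standing in for your downstream step:

```python
from pathlib import Path


def ordered_pages(out_dir: Path) -> list[Path]:
    """Page images in page order; the zero-padded page-NNN.jpg names
    sort correctly as plain strings."""
    return sorted(out_dir.glob('page-*.jpg'))


# e.g. feed pages to OCR in reading order:
# for page in ordered_pages(Path('output/report')):
#     text = run_ocr(page)   # run_ocr is your downstream step
```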

Cost tracking and quota monitoring

At $29/mo for 10K conversions, each PDF costs $0.0029. At $99/mo for 50K, each costs $0.00198. Track your conversion spend in the pipeline:

remaining = int(resp.headers.get('X-CTF-Remaining', '0'))
if remaining < 500:
    print(f'WARNING: only {remaining} conversions left this month')
    # Optionally: pause batch, send alert, switch to higher plan

For overnight batch jobs, estimate quota usage before starting: on the Startup plan (50K conversions/mo), a batch of len(pdf_files) PDFs consumes len(pdf_files) / 50,000 of the monthly quota. If a single batch would use more than 80% of it, consider splitting the work across two calendar months or upgrading temporarily.
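That pre-flight check is a one-liner worth putting at the top of the batch script (quota numbers from the plans above; adjust for yours):

```python
def quota_check(n_pdfs: int, monthly_quota: int, used_so_far: int = 0,
                threshold: float = 0.80) -> tuple[float, bool]:
    """Fraction of the monthly quota this batch would bring usage to,
    and whether that stays at or under the threshold (80% by default)."""
    frac = (used_so_far + n_pdfs) / monthly_quota
    return frac, frac <= threshold


# 10,000 PDFs on Startup (50K/mo): 20% of quota, safe to run
frac, ok = quota_check(10_000, 50_000)
```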

PDF-to-image at scale is straightforward once you handle multi-page output (ZIP detection), large file routing (async jobs), and idempotency. The patterns here scale from 100 to 50,000 PDFs without architectural changes — just increase concurrency and upgrade your plan. Free tier covers 1K conversions for pipeline validation.