Extracting plain text from PDF files is one of the most common tasks in any document-processing pipeline — RAG indexing, full-text search, training data prep. Python has three serious options: pdfplumber, pypdf (the successor to PyPDF2), and a hosted API. Each has tradeoffs.

The local libraries are free and run on your hardware. They are also a maintenance liability — PDF parsing breaks on edge cases (tagged PDFs, embedded forms, weird font encodings, password-protected files), and you end up writing fallback logic. The API approach trades a per-call cost for someone else's problem.

This guide shows working code for all three approaches, when each one wins, and how to handle the common error cases.

Method 1: ChangeThisFile API (one HTTP call)

If you want to skip dependency management entirely, hit the ChangeThisFile API. One POST request, plain text response. Get a free API key at changethisfile.com/api — the free tier handles 1,000 conversions/month.

import requests

API_KEY = "sk_test_your_key_here"

def pdf_to_text(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://changethisfile.com/v1/convert",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"source": "pdf", "target": "txt"},
            timeout=60,
        )
    response.raise_for_status()
    return response.content.decode("utf-8")

text = pdf_to_text("invoice.pdf")
print(text[:500])

Error handling adds only a few lines:

try:
    text = pdf_to_text("invoice.pdf")
except requests.HTTPError as e:
    if e.response.status_code == 413:
        print("File too large for plan tier")
    elif e.response.status_code == 429:
        print(f"Rate limited, retry after {e.response.headers.get('Retry-After')}s")
    else:
        print(f"Conversion failed: {e.response.json().get('error')}")

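For the 429 case, a small retry helper with exponential backoff covers most transient failures. This is a sketch, not part of the API client — with_retry is my own name, and it works with any callable returning a requests-style response (status_code and headers attributes):

```python
import time

RETRYABLE = {429, 500, 502, 503}

def with_retry(call, max_attempts=4, base_delay=1.0):
    """Retry `call()` while it returns a retryable status code,
    honoring a Retry-After header when the server sends one."""
    response = call()
    for attempt in range(1, max_attempts):
        if response.status_code not in RETRYABLE:
            break
        # Server-provided delay wins; otherwise back off exponentially
        delay = float(response.headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
        response = call()
    return response
```

Wrap the call site as response = with_retry(lambda: requests.post(url, ...)) and keep raise_for_status() afterward for the non-retryable errors.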
The API uses Poppler under the hood for PDF text extraction, which is the same engine that powers pdftotext on Linux. It handles tagged PDFs, multi-column layouts, and most font encoding quirks correctly out of the box.

Method 2: pdfplumber (best local library)

For local extraction, pdfplumber is the most reliable Python library. It builds on pdfminer.six and adds a clean API for both text and table extraction.

# pip install pdfplumber
import pdfplumber

def pdf_to_text(pdf_path: str) -> str:
    text_parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text_parts.append(page_text)
    return "\n\n".join(text_parts)

text = pdf_to_text("invoice.pdf")
print(text[:500])

pdfplumber's main advantage is that it preserves reading order more reliably than PyPDF2 on multi-column documents. It also gives you per-character positioning if you need to do layout-aware extraction (table boundaries, column detection).

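pdfplumber's extract_tables() returns each table as a list of rows, where each row is a list of cell strings (None for empty cells). A small helper, assuming that shape, can turn the result into CSV — tables_to_csv is a hypothetical name, not part of pdfplumber:

```python
import csv
import io

def tables_to_csv(tables):
    """Convert extract_tables() output (list of tables, each a list of
    rows of cell strings, with None for empty cells) into one CSV
    string per table."""
    csv_strings = []
    for table in tables:
        buf = io.StringIO()
        writer = csv.writer(buf)
        for row in table:
            writer.writerow("" if cell is None else cell for cell in row)
        csv_strings.append(buf.getvalue())
    return csv_strings
```

With a real file this would be called as tables_to_csv(pdf.pages[0].extract_tables()) inside a pdfplumber.open() block ("report.pdf" or whatever your input is).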
The main downside: pdfplumber is slow on large PDFs. A 500-page report can take 30+ seconds. If you need throughput, use the API or batch with multiprocessing.

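A minimal batching sketch using the standard library's ProcessPoolExecutor — extract_batch is my own helper name; pass it any top-level extractor, such as the pdf_to_text above:

```python
from concurrent.futures import ProcessPoolExecutor

def extract_batch(paths, extract_fn, max_workers: int = 4) -> dict:
    """Run `extract_fn` over many PDF paths in parallel and return
    {path: extracted_text}. `extract_fn` must be a module-level
    function so the worker processes can unpickle it."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(extract_fn, paths)))
```

Lambdas and nested functions will not work here, because worker processes receive the function by pickling its importable name.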
Method 3: PyPDF2 (lightweight, fast)

PyPDF2 is the classic option, now maintained under the name pypdf. It is the fastest of the local libraries for text extraction, but it has more rough edges with complex layouts.

# pip install pypdf  (the maintained successor to PyPDF2)
from pypdf import PdfReader

def pdf_to_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    # extract_text() returns "" on pages with no text layer
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)

text = pdf_to_text("invoice.pdf")
print(text[:500])

Use pypdf when you need raw speed and the PDFs are simple (single-column, standard fonts, no embedded forms). For anything more complex, pdfplumber or the API will give cleaner output.

Handling scanned PDFs (no text layer)

None of the methods above extract text from scanned PDFs — those are images of pages with no embedded text. You need OCR.

For local OCR, combine pdf2image + Tesseract:

# pip install pdf2image pytesseract  (also requires the Poppler and Tesseract binaries)
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path: str) -> str:
    images = convert_from_path(pdf_path)  # renders each page to a PIL image
    return "\n\n".join(pytesseract.image_to_string(img) for img in images)

OCR is slow (multiple seconds per page) and accuracy depends on scan quality. The ChangeThisFile API does not currently OCR scanned PDFs — that's on the roadmap as a separate endpoint. For now, detect scanned PDFs by checking if pdfplumber returns empty strings for every page.

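That detection check is a short helper. The heuristic is split into a pure function so it is easy to test; both function names here are my own:

```python
def pages_look_scanned(page_texts, min_chars: int = 20) -> bool:
    """Pure heuristic: True when every page's text is empty or
    near-empty, which usually means image-only pages with no text layer."""
    return all(len((text or "").strip()) < min_chars for text in page_texts)

def is_scanned(pdf_path: str) -> bool:
    import pdfplumber  # third-party: pip install pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        return pages_look_scanned(page.extract_text() for page in pdf.pages)
```

If is_scanned() returns True, route the file to the OCR path above instead of the text extractors.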
When to use each

Approach              | Best for                                                   | Tradeoff
----------------------|------------------------------------------------------------|---------------------------------------
ChangeThisFile API    | Production apps, RAG pipelines, varied input quality       | Per-call cost, network dependency
pdfplumber            | One-off extractions, table-heavy docs, custom layout logic | Slower on large files
pypdf                 | High-throughput simple PDFs                                | Worse output on complex layouts
Tesseract + pdf2image | Scanned PDFs only                                          | Slow, accuracy depends on scan quality

For most production pipelines, the API wins on operational simplicity: no system dependencies, no PDF parser updates to babysit, no "why does this PDF break extraction" debugging on user-uploaded files. For one-off scripts on your own clean PDFs, or when you cannot make outbound HTTP calls, pdfplumber is the strongest local option: reliable, free, and solid on most edge cases. If the API fits your pipeline, get a free key at changethisfile.com/api; the free tier covers 1,000 conversions/month.