Extracting plain text from PDF files is one of the most common tasks in any document-processing pipeline — RAG indexing, full-text search, training data prep. Python has three serious options: pdfplumber, pypdf (the successor to PyPDF2), and a hosted API. Each has tradeoffs.

The local libraries are free and run on your hardware. They are also a maintenance liability — PDF parsing breaks on edge cases (tagged PDFs, embedded forms, weird font encodings, password-protected files), and you end up writing fallback logic. The API approach trades a per-call cost for someone else's problem.

This guide shows working code for all three approaches, when each one wins, and how to handle the common error cases.

Method 1: ChangeThisFile API (one HTTP call)

If you want to skip dependency management entirely, hit the ChangeThisFile API. One POST request, plain text response. Get a free API key at changethisfile.com/api — the free tier handles 1,000 conversions/month.

import requests

API_KEY = "sk_test_your_key_here"

def pdf_to_text(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://changethisfile.com/v1/convert",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"source": "pdf", "target": "txt"},
            timeout=60,
        )
    response.raise_for_status()
    return response.content.decode("utf-8")

text = pdf_to_text("invoice.pdf")
print(text[:500])

Error handling adds only a few lines:

try:
    text = pdf_to_text("invoice.pdf")
except requests.HTTPError as e:
    if e.response.status_code == 413:
        print("File too large for plan tier")
    elif e.response.status_code == 429:
        print(f"Rate limited, retry after {e.response.headers.get('Retry-After')}s")
    else:
        print(f"Conversion failed: {e.response.json().get('error')}")

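For the 429 case, a small retry helper with exponential backoff covers most transient failures. This is a sketch, not part of the API client — with_retry is my own name, and it works with any callable returning a requests-style response (status_code and headers attributes):

```python
import time

RETRYABLE = {429, 500, 502, 503}

def with_retry(call, max_attempts=4, base_delay=1.0):
    """Retry `call()` while it returns a retryable status code,
    honoring a Retry-After header when the server sends one."""
    response = call()
    for attempt in range(1, max_attempts):
        if response.status_code not in RETRYABLE:
            break
        # Server-provided delay wins; otherwise back off exponentially
        delay = float(response.headers.get("Retry-After", base_delay * 2 ** attempt))
        time.sleep(delay)
        response = call()
    return response
```

Wrap the call site as response = with_retry(lambda: requests.post(url, ...)) and keep raise_for_status() afterward for the non-retryable errors.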
The API uses Poppler under the hood for PDF text extraction, which is the same engine that powers pdftotext on Linux. It handles tagged PDFs, multi-column layouts, and most font encoding quirks correctly out of the box.

Method 2: pdfplumber (best local library)

For local extraction, pdfplumber is the most reliable Python library. It builds on pdfminer.six and adds a clean API for both text and table extraction.

# pip install pdfplumber
import pdfplumber

def pdf_to_text(pdf_path: str) -> str:
    text_parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text_parts.append(page_text)
    return "\n\n".join(text_parts)

text = pdf_to_text("invoice.pdf")
print(text[:500])

pdfplumber's main advantage is that it preserves reading order more reliably than PyPDF2 on multi-column documents. It also gives you per-character positioning if you need to do layout-aware extraction (table boundaries, column detection).

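pdfplumber's extract_tables() returns each table as a list of rows, where each row is a list of cell strings (None for empty cells). A small helper, assuming that shape, can turn the result into CSV — tables_to_csv is a hypothetical name, not part of pdfplumber:

```python
import csv
import io

def tables_to_csv(tables):
    """Convert extract_tables() output (list of tables, each a list of
    rows of cell strings, with None for empty cells) into one CSV
    string per table."""
    csv_strings = []
    for table in tables:
        buf = io.StringIO()
        writer = csv.writer(buf)
        for row in table:
            writer.writerow("" if cell is None else cell for cell in row)
        csv_strings.append(buf.getvalue())
    return csv_strings
```

With a real file this would be called as tables_to_csv(pdf.pages[0].extract_tables()) inside a pdfplumber.open() block ("report.pdf" or whatever your input is).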
The main downside: pdfplumber is slow on large PDFs. A 500-page report can take 30+ seconds. If you need throughput, use the API or batch with multiprocessing.

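A minimal batching sketch using the standard library's ProcessPoolExecutor — extract_batch is my own helper name; pass it any top-level extractor, such as the pdf_to_text above:

```python
from concurrent.futures import ProcessPoolExecutor

def extract_batch(paths, extract_fn, max_workers: int = 4) -> dict:
    """Run `extract_fn` over many PDF paths in parallel and return
    {path: extracted_text}. `extract_fn` must be a module-level
    function so the worker processes can unpickle it."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(extract_fn, paths)))
```

Lambdas and nested functions will not work here, because worker processes receive the function by pickling its importable name.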
Method 3: PyPDF2 (lightweight, fast)

PyPDF2 is the classic option, now maintained under the name pypdf. It is the fastest of the local libraries for text extraction, but it has more rough edges with complex layouts.

# pip install pypdf  (the maintained successor to PyPDF2)
from pypdf import PdfReader

def pdf_to_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    # extract_text() returns "" on pages with no text layer
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)

text = pdf_to_text("invoice.pdf")
print(text[:500])

Use pypdf when you need raw speed and the PDFs are simple (single-column, standard fonts, no embedded forms). For anything more complex, pdfplumber or the API will give cleaner output.

Handling scanned PDFs (no text layer)

None of the methods above extract text from scanned PDFs — those are images of pages with no embedded text. You need OCR.

For local OCR, combine pdf2image + Tesseract:

# pip install pdf2image pytesseract  (also requires the Poppler and Tesseract binaries)
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path: str) -> str:
    images = convert_from_path(pdf_path)  # renders each page to a PIL image
    return "\n\n".join(pytesseract.image_to_string(img) for img in images)

OCR is slow (multiple seconds per page) and accuracy depends on scan quality. The ChangeThisFile API does not currently OCR scanned PDFs — that's on the roadmap as a separate endpoint. For now, detect scanned PDFs by checking if pdfplumber returns empty strings for every page.

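That detection check is a short helper. The heuristic is split into a pure function so it is easy to test; both function names here are my own:

```python
def pages_look_scanned(page_texts, min_chars: int = 20) -> bool:
    """Pure heuristic: True when every page's text is empty or
    near-empty, which usually means image-only pages with no text layer."""
    return all(len((text or "").strip()) < min_chars for text in page_texts)

def is_scanned(pdf_path: str) -> bool:
    import pdfplumber  # third-party: pip install pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        return pages_look_scanned(page.extract_text() for page in pdf.pages)
```

If is_scanned() returns True, route the file to the OCR path above instead of the text extractors.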
When to use each

Approach              | Best for                                                   | Tradeoff
----------------------|------------------------------------------------------------|---------------------------------------
ChangeThisFile API    | Production apps, RAG pipelines, varied input quality       | Per-call cost, network dependency
pdfplumber            | One-off extractions, table-heavy docs, custom layout logic | Slower on large files
pypdf                 | High-throughput simple PDFs                                | Worse output on complex layouts
Tesseract + pdf2image | Scanned PDFs only                                          | Slow, accuracy depends on scan quality

For most production pipelines, the API wins on operational simplicity: no system dependencies, no PDF parser updates to babysit, no "why does this PDF break extraction" debugging on user-uploaded files. For one-off scripts on your own clean PDFs, or when you cannot make outbound HTTP calls, pdfplumber is the strongest local option: reliable, free, and solid on most edge cases. If the API fits your pipeline, get a free key at changethisfile.com/api; the free tier covers 1,000 conversions/month.