You have a stack of scanned PDFs. Maybe they're old contracts, archived records, tax documents from a decade ago, or research papers from a library's digitization project. They look like normal PDFs — you can view them, print them, email them. But try to select text, search for a name, or copy a paragraph, and nothing happens. These PDFs are images. Each page is a photograph of a document, stored as pixels, with no text layer.
Converting these scanned documents into something useful — searchable, editable, text-extractable — requires OCR. The technology has improved enormously, but it's still not magic. OCR accuracy depends on the quality of the source image, and the quality of most scans ranges from "good enough" to "barely readable." This guide covers the OCR pipeline, the factors that affect accuracy, and the practical workflows for single files and large batches.
Scanned PDF vs. Text PDF: How to Tell
The quickest test: open the PDF and try to select text. If you can click and drag to highlight individual words, it's a text-based PDF (or a scanned PDF that's already been OCR'd). If the cursor shows a crosshair or nothing highlights, it's a pure image scan.
For a definitive check: use pdffonts (from poppler-utils) on the command line. A text-based PDF lists fonts. A scanned PDF shows no fonts or only a font used for the invisible OCR layer. pdfimages -list will show one large image per page in a scanned PDF, while a text-based PDF has images only where actual images were placed in the document.
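The same check can be approximated programmatically. A crude, stdlib-only heuristic (a sketch, not a replacement for pdffonts): a text-based PDF almost always declares at least one /Font resource, while a pure scan typically contains only image XObjects. Note that PDFs using compressed object streams can hide both markers, so treat a result from this function as a hint only.

```python
def looks_scanned(pdf_bytes: bytes) -> bool:
    """Crude heuristic: image XObjects present but no /Font resources
    suggests a pure image scan. pdffonts/pdfimages are authoritative."""
    has_font = b"/Font" in pdf_bytes
    has_image = b"/Subtype /Image" in pdf_bytes or b"/Subtype/Image" in pdf_bytes
    return has_image and not has_font

# Usage:
# with open("scan.pdf", "rb") as f:
#     print(looks_scanned(f.read()))
```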
Hybrid PDFs exist: they contain scanned page images with an invisible OCR text layer behind them. This is the output of "searchable PDF" creation. The visible layer is the original scan; the text layer enables search and selection. These are the best of both worlds but are only as accurate as the OCR that created them.
OCR Technology: Tesseract, Adobe, ABBYY
Tesseract 5 (open source, maintained by Google) is the most widely used OCR engine. Version 5 uses LSTM neural networks and supports 100+ languages. For clean, printed English text at 300 DPI, Tesseract achieves 95-99% character accuracy. For degraded scans, complex layouts, or non-Latin scripts, accuracy drops. Tesseract is free, runs locally, and integrates with most document processing pipelines.
Adobe Acrobat's OCR (commercial) integrates directly into the PDF workflow. It handles layout analysis (detecting columns, tables, images) better than Tesseract out of the box and produces searchable PDFs with the text layer aligned precisely to the scan. Quality is consistently good across varied document types.
ABBYY FineReader (commercial) is the gold standard for OCR accuracy, particularly for complex documents: multi-language text, degraded historical documents, tables, and forms. ABBYY's layout analysis is the most sophisticated, correctly identifying and preserving document structure across columns, sidebars, and mixed content.
Cloud OCR services: Google Cloud Vision, AWS Textract, and Azure AI Document Intelligence offer OCR as APIs. They're convenient for batch processing and handle varied document types well. The tradeoff: your documents are uploaded to cloud servers, which may be unacceptable for sensitive content. Google Cloud Vision's accuracy rivals ABBYY for most document types.
Accuracy Comparison
On clean, 300 DPI, single-column English documents with standard fonts:
- Tesseract 5: 97-99% character accuracy
- Adobe Acrobat: 98-99% character accuracy
- ABBYY FineReader: 99%+ character accuracy
- Google Cloud Vision: 98-99% character accuracy
On degraded, low-resolution, multi-column, or non-English documents, accuracy drops 5-20 percentage points across all engines. ABBYY and Google Cloud Vision maintain the smallest accuracy drop for complex layouts. Tesseract's accuracy drops more sharply with poor image quality.
Character accuracy sounds abstract. In practice, a 500-word document is roughly 2,500 characters, so 97% accuracy means about 75 characters wrong, spread across roughly 15 words. For searchability, this is fine (most words are still findable). For editing or data extraction, 15 wrong words per page requires manual proofreading.
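The arithmetic behind that estimate is simple to make explicit (assuming an average of 5 characters per word and that each character error lands in a distinct word, both simplifications):

```python
def expected_errors(words: int, char_accuracy: float, chars_per_word: int = 5):
    """Estimate character and word errors for a document of a given size."""
    total_chars = words * chars_per_word
    char_errors = total_chars * (1 - char_accuracy)
    word_errors = char_errors / chars_per_word  # assume errors hit distinct words
    return round(char_errors), round(word_errors)

print(expected_errors(500, 0.97))  # (75, 15)
```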
Factors That Affect OCR Accuracy
Resolution (DPI): 300 DPI is the minimum for reliable OCR. 600 DPI is better for small text and complex documents. Below 200 DPI, accuracy drops sharply. If you're scanning documents specifically for OCR, scan at 300-600 DPI. If you're working with existing scans, check the resolution with pdfimages -list or identify -verbose (ImageMagick).
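If a tool reports only pixel dimensions rather than DPI, the effective resolution follows from the physical page size. A quick sketch (the 8.5-inch width assumes US Letter; adjust for A4 or other sizes):

```python
def effective_dpi(pixel_width: int, page_width_inches: float = 8.5) -> float:
    """Effective scan resolution: pixel width divided by physical width."""
    return pixel_width / page_width_inches

# A 2550-pixel-wide scan of a Letter page is exactly 300 DPI:
print(effective_dpi(2550))  # 300.0
```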
Contrast and clarity: OCR works by distinguishing dark characters from a light background. Faded text, yellowed paper, bleed-through from the other side of the page, and uneven lighting all reduce accuracy. Pre-processing (binarization, deskewing, noise removal) can improve results significantly.
Font type: Printed text in standard fonts (serif and sans-serif, 10-14pt) OCRs well. Decorative fonts, handwriting, and very small text (below 8pt) are problematic. Monospaced fonts (like typewriter output) OCR well because character spacing is consistent.
Language: English, German, French, Spanish, and other Latin-script languages with large training datasets achieve the highest accuracy. CJK (Chinese, Japanese, Korean) languages are well-supported by modern engines. Arabic and Hebrew (right-to-left, connected scripts) are harder. Languages with limited training data have lower accuracy.
Layout complexity: Single-column text with no tables, images, or sidebars is the easiest for OCR. Multi-column layouts require the engine to determine reading order. Tables require cell boundary detection. Headers, footers, page numbers, and watermarks can be incorrectly included in the text stream.
Document age and condition: Historical documents (pre-1950) with inconsistent printing, aging paper, foxing (brown spots), and varied typefaces are the hardest category. Specialized historical OCR models exist (Transkribus for handwritten historical documents) but accuracy is still lower than for modern printed text.
Creating Searchable PDFs
A searchable PDF (also called a "sandwich PDF") layers invisible OCR text behind the visible scan image. The page looks like the original scan but supports text selection, search, and copy. This is the standard output format for document digitization projects.
Using ocrmypdf (recommended):
```shell
ocrmypdf --deskew --clean input.pdf output.pdf
```

ocrmypdf is a Python tool that wraps Tesseract. It handles deskewing (straightening rotated pages), cleaning (removing noise), and text layer positioning automatically. Options: `--language eng+fra` for multi-language documents, `--rotate-pages` to auto-detect and fix page orientation, `--skip-text` to skip pages that already have text, `--output-type pdfa` to produce PDF/A output.
Using Adobe Acrobat: Edit > Preferences > Document > Enable "Make searchable (Run OCR)" when scanning, or Tools > Enhance Scans > Recognize Text for existing scans. Acrobat produces well-positioned text layers with good accuracy.
Using Tesseract directly:
```shell
tesseract page.tiff output pdf
```

Tesseract outputs a searchable PDF from a single image. For multi-page documents, process each page and combine the results with pdftk, or use ocrmypdf, which handles multi-page documents natively.
Image-to-Text Pipeline
Sometimes you don't need a searchable PDF — you need the text itself, extracted for data entry, indexing, or format conversion. The pipeline:
- Pre-process the image: convert to grayscale, increase contrast, deskew, and remove borders and noise. With ImageMagick: `convert scan.tiff -colorspace gray -threshold 50% -deskew 40% cleaned.tiff`
- Run OCR: `tesseract cleaned.tiff output -l eng` produces output.txt. For structured output with position data, use `tesseract cleaned.tiff output hocr` (HTML-based) or `tesseract cleaned.tiff output tsv` (tab-separated with coordinates).
- Post-process the text: fix common OCR errors ("rn" misread as "m," "1" misread as "l," missing spaces, broken lines). Spell-checking can catch many errors automatically. For structured data (invoices, forms), regular expressions extract specific fields.
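The post-processing step can be sketched with stdlib regexes. The rules below are illustrative examples only; real pipelines tune them against a dictionary or spell-checker, and the `extract_total` pattern is a hypothetical field extractor for invoice-style text:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Apply illustrative fixes for common OCR line artifacts."""
    # Rejoin words hyphenated across line breaks: "recog-\nnition" -> "recognition"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Merge hard-wrapped lines inside a paragraph; keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces introduced by the merges
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

def extract_total(text: str):
    """Pull a dollar amount following a 'Total' label, if present."""
    m = re.search(r"total[:\s]*\$?([\d.,]+)", text, re.IGNORECASE)
    return m.group(1) if m else None

print(clean_ocr_text("Optical recog-\nnition of scanned\ntext.\n\nNew paragraph."))
print(extract_total("TOTAL: $1,234.56"))  # 1,234.56
```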
For converting extracted text to other formats: TXT to DOCX, TXT to PDF, or TXT to HTML. The text won't have the original formatting — it's raw content that you reformat in the target format.
Handwriting Recognition: Current Limitations
Handwriting OCR (also called HTR — Handwritten Text Recognition) is a fundamentally harder problem than printed text OCR. Individual handwriting varies enormously in letter formation, spacing, size, and consistency. No OCR engine achieves printed-text accuracy on handwriting.
Current capabilities:
- Neat block printing: 80-90% character accuracy with modern engines (Google Cloud Vision, Azure AI). Usable with proofreading.
- Cursive handwriting: 50-80% accuracy depending on neatness. Often unusable without significant manual correction.
- Historical handwriting: Specialized models (Transkribus) can be trained on specific handwriting styles. Accuracy varies wildly from 60% to 95% depending on training data and script consistency.
- Forms with handwritten entries: Cloud services (AWS Textract) can identify form fields and extract handwritten values within them. Accuracy is good for numbers and short text, poor for long handwritten notes.
The practical recommendation: if you have handwritten documents, set realistic expectations. OCR will give you a starting point, not a finished transcription. Budget manual correction time. For large collections of similar handwriting (one person's notes, a specific era's records), training a custom HTR model dramatically improves accuracy.
Batch OCR Processing
For processing hundreds or thousands of scanned documents:
ocrmypdf batch:
```shell
find /path/to/scans -name '*.pdf' -exec ocrmypdf --skip-text {} {}.ocr.pdf \;
```

The `--skip-text` flag skips files that already have a text layer, avoiding redundant processing. Adding `--jobs 4` lets ocrmypdf OCR up to four pages of each file in parallel (adjust to your CPU cores); to process multiple files concurrently, feed the file list to `xargs -P` or GNU parallel instead of `find -exec`.
Performance expectations: A single page at 300 DPI processes in 3-10 seconds with Tesseract on a modern CPU. A 100-page document takes 5-15 minutes. A batch of 1,000 documents (average 10 pages each) is roughly 10,000 pages, or about 8-28 hours on a single machine at those per-page speeds. Cloud OCR services can process faster through parallelization.
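Those figures are straight multiplication, which makes capacity planning easy to script (the 3-10 seconds per page range is the one quoted above and varies with CPU and settings):

```python
def batch_hours(docs: int, pages_per_doc: int, secs_per_page: float) -> float:
    """Hours to OCR a batch serially at a given per-page speed."""
    return docs * pages_per_doc * secs_per_page / 3600

# 1,000 documents of 10 pages at the fast and slow ends of 3-10 s/page:
print(round(batch_hours(1000, 10, 3), 1))   # 8.3
print(round(batch_hours(1000, 10, 10), 1))  # 27.8
```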
Common batch processing pitfalls:
- Some PDFs are already text-based; OCR adds a redundant (possibly incorrect) text layer. Use `--skip-text` to detect and skip these.
- Rotated pages confuse OCR. Use `--rotate-pages` to auto-detect and correct orientation before processing.
- Multi-language documents need the correct language flag. Processing French text with an English-only model produces garbage. Use `--language fra`, or `--language eng+fra` for mixed documents.
- Very large files (100+ page PDFs at 600 DPI) can exhaust RAM. Process in chunks or increase available memory.
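A batch driver can bake those pitfall-avoiding flags in. A minimal sketch (the paths and language list are hypothetical; the command is built as an argument list so nothing needs shell quoting):

```python
from pathlib import Path

def build_ocr_command(src: Path, dst: Path, languages: str = "eng") -> list:
    """Assemble an ocrmypdf invocation that sidesteps common batch pitfalls."""
    return [
        "ocrmypdf",
        "--skip-text",      # don't re-OCR PDFs that already have a text layer
        "--rotate-pages",   # auto-correct page orientation first
        "--language", languages,
        str(src), str(dst),
    ]

cmd = build_ocr_command(Path("scan.pdf"), Path("scan.ocr.pdf"), "eng+fra")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```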
OCR has come a long way from the unreliable, expensive technology of the 2000s. Modern engines handle clean printed text at near-human accuracy, and the tools (ocrmypdf, Tesseract, cloud APIs) are free or affordable. The technology's limits are in the source material: degraded scans, handwriting, complex layouts, and non-standard fonts still challenge every engine.
For most practical purposes, scanning at 300+ DPI and running ocrmypdf produces searchable PDFs that are good enough for retrieval and basic text extraction. For editing, plan on manual cleanup after OCR. For data entry from forms, cloud services like AWS Textract are more cost-effective than general OCR + manual extraction. Match the tool to the task and the quality of your source material.