You have a stack of scanned PDFs. Maybe they're old contracts, archived records, tax documents from a decade ago, or research papers from a library's digitization project. They look like normal PDFs — you can view them, print them, email them. But try to select text, search for a name, or copy a paragraph, and nothing happens. These PDFs are images. Each page is a photograph of a document, stored as pixels, with no text layer.
Converting these scanned documents into something useful — searchable, editable, text-extractable — requires OCR. The technology has improved enormously, but it's still not magic. OCR accuracy depends on the quality of the source image, and the quality of most scans ranges from "good enough" to "barely readable." This guide covers the OCR pipeline, the factors that affect accuracy, and the practical workflows for single files and large batches.
Scanned PDF vs. Text PDF: How to Tell
The quickest test: open the PDF and try to select text. If you can click and drag to highlight individual words, it's a text-based PDF (or a scanned PDF that's already been OCR'd). If the cursor shows a crosshair or nothing highlights, it's a pure image scan.
For a definitive check: use pdffonts (from poppler-utils) on the command line. A text-based PDF lists fonts. A scanned PDF shows no fonts or only a font used for the invisible OCR layer. pdfimages -list will show one large image per page in a scanned PDF, while a text-based PDF has images only where actual images were placed in the document.
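The same check can be approximated programmatically. A crude, stdlib-only heuristic (a sketch, not a replacement for pdffonts): a text-based PDF almost always declares at least one /Font resource, while a pure scan typically contains only image XObjects. Note that PDFs using compressed object streams can hide both markers, so treat a result from this function as a hint only.

```python
def looks_scanned(pdf_bytes: bytes) -> bool:
    """Crude heuristic: image XObjects present but no /Font resources
    suggests a pure image scan. pdffonts/pdfimages are authoritative."""
    has_font = b"/Font" in pdf_bytes
    has_image = b"/Subtype /Image" in pdf_bytes or b"/Subtype/Image" in pdf_bytes
    return has_image and not has_font

# Usage:
# with open("scan.pdf", "rb") as f:
#     print(looks_scanned(f.read()))
```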
Hybrid PDFs exist: they contain scanned page images with an invisible OCR text layer behind them. This is the output of "searchable PDF" creation. The visible layer is the original scan; the text layer enables search and selection. These are the best of both worlds but are only as accurate as the OCR that created them.
OCR Technology: Tesseract, Adobe, ABBYY
Tesseract 5 (open source, maintained by Google) is the most widely used OCR engine. Version 5 uses LSTM neural networks and supports 100+ languages. For clean, printed English text at 300 DPI, Tesseract achieves 95-99% character accuracy. For degraded scans, complex layouts, or non-Latin scripts, accuracy drops. Tesseract is free, runs locally, and integrates with most document processing pipelines.
Adobe Acrobat's OCR (commercial) integrates directly into the PDF workflow. It handles layout analysis (detecting columns, tables, images) better than Tesseract out of the box and produces searchable PDFs with the text layer aligned precisely to the scan. Quality is consistently good across varied document types.
ABBYY FineReader (commercial) is the gold standard for OCR accuracy, particularly for complex documents: multi-language text, degraded historical documents, tables, and forms. ABBYY's layout analysis is the most sophisticated, correctly identifying and preserving document structure across columns, sidebars, and mixed content.
Cloud OCR services: Google Cloud Vision, AWS Textract, and Azure AI Document Intelligence offer OCR as APIs. They're convenient for batch processing and handle varied document types well. The tradeoff: your documents are uploaded to cloud servers, which may be unacceptable for sensitive content. Google Cloud Vision's accuracy rivals ABBYY for most document types.
Accuracy Comparison
On clean, 300 DPI, single-column English documents with standard fonts:
- Tesseract 5: 97-99% character accuracy
- Adobe Acrobat: 98-99% character accuracy
- ABBYY FineReader: 99%+ character accuracy
- Google Cloud Vision: 98-99% character accuracy
On degraded, low-resolution, multi-column, or non-English documents, accuracy drops 5-20 percentage points across all engines. ABBYY and Google Cloud Vision maintain the smallest accuracy drop for complex layouts. Tesseract's accuracy drops more sharply with poor image quality.
Character accuracy sounds abstract. In practice, a 500-word document is roughly 2,500 characters, so 97% accuracy means about 75 characters wrong, spread across roughly 15 words. For searchability, this is fine (most words are still findable). For editing or data extraction, 15 wrong words per page requires manual proofreading.
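The arithmetic behind that estimate is simple to make explicit (assuming an average of 5 characters per word and that each character error lands in a distinct word, both simplifications):

```python
def expected_errors(words: int, char_accuracy: float, chars_per_word: int = 5):
    """Estimate character and word errors for a document of a given size."""
    total_chars = words * chars_per_word
    char_errors = total_chars * (1 - char_accuracy)
    word_errors = char_errors / chars_per_word  # assume errors hit distinct words
    return round(char_errors), round(word_errors)

print(expected_errors(500, 0.97))  # (75, 15)
```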
Factors That Affect OCR Accuracy
Resolution (DPI): 300 DPI is the minimum for reliable OCR. 600 DPI is better for small text and complex documents. Below 200 DPI, accuracy drops sharply. If you're scanning documents specifically for OCR, scan at 300-600 DPI. If you're working with existing scans, check the resolution with pdfimages -list or identify -verbose (ImageMagick).
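If a tool reports only pixel dimensions rather than DPI, the effective resolution follows from the physical page size. A quick sketch (the 8.5-inch width assumes US Letter; adjust for A4 or other sizes):

```python
def effective_dpi(pixel_width: int, page_width_inches: float = 8.5) -> float:
    """Effective scan resolution: pixel width divided by physical width."""
    return pixel_width / page_width_inches

# A 2550-pixel-wide scan of a Letter page is exactly 300 DPI:
print(effective_dpi(2550))  # 300.0
```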
Contrast and clarity: OCR works by distinguishing dark characters from a light background. Faded text, yellowed paper, bleed-through from the other side of the page, and uneven lighting all reduce accuracy. Pre-processing (binarization, deskewing, noise removal) can improve results significantly.
Font type: Printed text in standard fonts (serif and sans-serif, 10-14pt) OCRs well. Decorative fonts, handwriting, and very small text (below 8pt) are problematic. Monospaced fonts (like typewriter output) OCR well because character spacing is consistent.
Language: English, German, French, Spanish, and other Latin-script languages with large training datasets achieve the highest accuracy. CJK (Chinese, Japanese, Korean) languages are well-supported by modern engines. Arabic and Hebrew (right-to-left, connected scripts) are harder. Languages with limited training data have lower accuracy.
Layout complexity: Single-column text with no tables, images, or sidebars is the easiest for OCR. Multi-column layouts require the engine to determine reading order. Tables require cell boundary detection. Headers, footers, page numbers, and watermarks can be incorrectly included in the text stream.
Document age and condition: Historical documents (pre-1950) with inconsistent printing, aging paper, foxing (brown spots), and varied typefaces are the hardest category. Specialized historical OCR models exist (Transkribus for handwritten historical documents) but accuracy is still lower than for modern printed text.
Creating Searchable PDFs
A searchable PDF (also called a "sandwich PDF") layers invisible OCR text behind the visible scan image. The page looks like the original scan but supports text selection, search, and copy. This is the standard output format for document digitization projects.
Using ocrmypdf (recommended):
```shell
ocrmypdf --deskew --clean input.pdf output.pdf
```

ocrmypdf is a Python tool that wraps Tesseract. It handles deskewing (straightening rotated pages), cleaning (removing noise), and text layer positioning automatically. Options: `--language eng+fra` for multi-language documents, `--rotate-pages` to auto-detect and fix page orientation, `--skip-text` to skip pages that already have text, `--output-type pdfa` to produce PDF/A output.
Using Adobe Acrobat: Edit > Preferences > Document > Enable "Make searchable (Run OCR)" when scanning, or Tools > Enhance Scans > Recognize Text for existing scans. Acrobat produces well-positioned text layers with good accuracy.
Using Tesseract directly:
```shell
tesseract page.tiff output pdf
```

Tesseract outputs a searchable PDF from a single image. For multi-page documents, process each page and combine the results with pdftk, or use ocrmypdf, which handles multi-page documents natively.
Image-to-Text Pipeline
Sometimes you don't need a searchable PDF — you need the text itself, extracted for data entry, indexing, or format conversion. The pipeline:
- Pre-process the image: convert to grayscale, increase contrast, deskew, and remove borders and noise. With ImageMagick: `convert scan.tiff -colorspace gray -threshold 50% -deskew 40% cleaned.tiff`
- Run OCR: `tesseract cleaned.tiff output -l eng` produces output.txt. For structured output with position data, use `tesseract cleaned.tiff output hocr` (HTML-based) or `tesseract cleaned.tiff output tsv` (tab-separated with coordinates).
- Post-process the text: fix common OCR errors ("rn" misread as "m," "1" misread as "l," missing spaces, broken lines). Spell-checking can catch many errors automatically. For structured data (invoices, forms), regular expressions extract specific fields.
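The post-processing step can be sketched with stdlib regexes. The rules below are illustrative examples only; real pipelines tune them against a dictionary or spell-checker, and the `extract_total` pattern is a hypothetical field extractor for invoice-style text:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Apply illustrative fixes for common OCR line artifacts."""
    # Rejoin words hyphenated across line breaks: "recog-\nnition" -> "recognition"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Merge hard-wrapped lines inside a paragraph; keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces introduced by the merges
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

def extract_total(text: str):
    """Pull a dollar amount following a 'Total' label, if present."""
    m = re.search(r"total[:\s]*\$?([\d.,]+)", text, re.IGNORECASE)
    return m.group(1) if m else None

print(clean_ocr_text("Optical recog-\nnition of scanned\ntext.\n\nNew paragraph."))
print(extract_total("TOTAL: $1,234.56"))  # 1,234.56
```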
For converting extracted text to other formats: TXT to DOCX, TXT to PDF, or TXT to HTML. The text won't have the original formatting — it's raw content that you reformat in the target format.
Handwriting Recognition: Current Limitations
Handwriting OCR (also called HTR — Handwritten Text Recognition) is a fundamentally harder problem than printed text OCR. Individual handwriting varies enormously in letter formation, spacing, size, and consistency. No OCR engine achieves printed-text accuracy on handwriting.
Current capabilities:
- Neat block printing: 80-90% character accuracy with modern engines (Google Cloud Vision, Azure AI). Usable with proofreading.
- Cursive handwriting: 50-80% accuracy depending on neatness. Often unusable without significant manual correction.
- Historical handwriting: Specialized models (Transkribus) can be trained on specific handwriting styles. Accuracy varies wildly from 60% to 95% depending on training data and script consistency.
- Forms with handwritten entries: Cloud services (AWS Textract) can identify form fields and extract handwritten values within them. Accuracy is good for numbers and short text, poor for long handwritten notes.
The practical recommendation: if you have handwritten documents, set realistic expectations. OCR will give you a starting point, not a finished transcription. Budget manual correction time. For large collections of similar handwriting (one person's notes, a specific era's records), training a custom HTR model dramatically improves accuracy.
Batch OCR Processing
For processing hundreds or thousands of scanned documents:
ocrmypdf batch:
```shell
find /path/to/scans -name '*.pdf' -exec ocrmypdf --skip-text {} {}.ocr.pdf \;
```

The `--skip-text` flag skips files that already have a text layer, avoiding redundant processing. Adding `--jobs 4` lets ocrmypdf OCR up to four pages of each file in parallel (adjust to your CPU cores); to process multiple files concurrently, feed the file list to `xargs -P` or GNU parallel instead of `find -exec`.
Performance expectations: A single page at 300 DPI processes in 3-10 seconds with Tesseract on a modern CPU. A 100-page document takes 5-15 minutes. A batch of 1,000 documents (average 10 pages each) is roughly 10,000 pages, or about 8-28 hours on a single machine at those per-page speeds. Cloud OCR services can process faster through parallelization.
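Those figures are straight multiplication, which makes capacity planning easy to script (the 3-10 seconds per page range is the one quoted above and varies with CPU and settings):

```python
def batch_hours(docs: int, pages_per_doc: int, secs_per_page: float) -> float:
    """Hours to OCR a batch serially at a given per-page speed."""
    return docs * pages_per_doc * secs_per_page / 3600

# 1,000 documents of 10 pages at the fast and slow ends of 3-10 s/page:
print(round(batch_hours(1000, 10, 3), 1))   # 8.3
print(round(batch_hours(1000, 10, 10), 1))  # 27.8
```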
Common batch processing pitfalls:
- Some PDFs are already text-based; OCR adds a redundant (possibly incorrect) text layer. Use `--skip-text` to detect and skip these.
- Rotated pages confuse OCR. Use `--rotate-pages` to auto-detect and correct orientation before processing.
- Multi-language documents need the correct language flag. Processing French text with an English-only model produces garbage. Use `--language fra`, or `--language eng+fra` for mixed documents.
- Very large files (100+ page PDFs at 600 DPI) can exhaust RAM. Process in chunks or increase available memory.
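A batch driver can bake those pitfall-avoiding flags in. A minimal sketch (the paths and language list are hypothetical; the command is built as an argument list so nothing needs shell quoting):

```python
from pathlib import Path

def build_ocr_command(src: Path, dst: Path, languages: str = "eng") -> list:
    """Assemble an ocrmypdf invocation that sidesteps common batch pitfalls."""
    return [
        "ocrmypdf",
        "--skip-text",      # don't re-OCR PDFs that already have a text layer
        "--rotate-pages",   # auto-correct page orientation first
        "--language", languages,
        str(src), str(dst),
    ]

cmd = build_ocr_command(Path("scan.pdf"), Path("scan.ocr.pdf"), "eng+fra")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```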
OCR has come a long way from the unreliable, expensive technology of the 2000s. Modern engines handle clean printed text at near-human accuracy, and the tools (ocrmypdf, Tesseract, cloud APIs) are free or affordable. The technology's limits are in the source material: degraded scans, handwriting, complex layouts, and non-standard fonts still challenge every engine.
For most practical purposes, scanning at 300+ DPI and running ocrmypdf produces searchable PDFs that are good enough for retrieval and basic text extraction. For editing, plan on manual cleanup after OCR. For data entry from forms, cloud services like AWS Textract are more cost-effective than general OCR + manual extraction. Match the tool to the task and the quality of your source material.