PDF is the most misunderstood file format on the internet. People treat it like a document format — something akin to DOCX or ODT — but it's actually a page description language. Think of it less as a Word document and more as a set of instructions for drawing text and graphics onto a fixed-size canvas.

This distinction matters every time you try to edit a PDF, convert one to Word, or extract data from one. The results range from "surprisingly good" to "complete disaster" depending on how the PDF was created and what's inside it. This guide explains what's actually happening inside a PDF file and how to get the best results when you need to convert one.

If you've ever wondered why your carefully formatted PDF turned into a mess of random text boxes in Word, you're about to find out exactly why.

What PDF Actually Is: A PostScript Descendant

Adobe created PDF in 1993 as a way to represent documents independent of the application, hardware, and operating system used to create them. It descends from PostScript, the page description language that many laser printers interpret internally. When you "print to PDF," the application sends the same kind of drawing commands it would send to a printer, but the output is saved to a file instead of being put on paper.

A PDF file is a collection of objects organized into a specific structure: a header declaring the PDF version, a body containing page objects with their content streams, a cross-reference table for fast random access to any object, and a trailer pointing to the root of the document. Each page is an independent unit with its own content stream — a sequence of drawing operators that place text, lines, images, and shapes at exact coordinates.

Here's the critical insight: when a PDF says "draw the word 'Revenue' at position (72, 680) in 14pt Helvetica Bold," that's all it's saying. It does not say "this is a heading." It does not say "this belongs to a table." It does not say "this paragraph continues on the next page." The visual result looks like a heading in a table, but the PDF has no idea what it is semantically.
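To see how little information such an instruction carries, here is a rough sketch that reads text-drawing operators out of an already-decompressed content stream. It assumes the simplest possible form (`BT /Font size Tf x y Td (text) Tj ET`); real content streams use many more operators, escapes, and encodings, and `read_text_ops` is a name invented for this example.

```python
import re

# Matches one simple text block: BT /Font size Tf x y Td (string) Tj ET
TEXT_BLOCK = re.compile(
    r"BT\s+/(?P<font>\S+)\s+(?P<size>[\d.]+)\s+Tf\s+"
    r"(?P<x>-?[\d.]+)\s+(?P<y>-?[\d.]+)\s+Td\s+"
    r"\((?P<text>[^)]*)\)\s+Tj\s+ET"
)


def read_text_ops(content_stream: str) -> list[dict]:
    """Return everything the stream says about each piece of text."""
    return [
        {
            "text": m["text"],
            "x": float(m["x"]),
            "y": float(m["y"]),
            "font": m["font"],   # a resource name such as F1, not a role
            "size": float(m["size"]),
        }
        for m in TEXT_BLOCK.finditer(content_stream)
    ]


stream = "BT /F1 14 Tf 72 680 Td (Revenue) Tj ET\nBT /F2 10 Tf 72 660 Td (Q1 total) Tj ET"
ops = read_text_ops(stream)
```

Each result holds a string, a position, a font resource name, and a size. Nothing in it says "heading" or "table cell"; any such label has to be inferred.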

Why "Convert PDF to Word" Often Looks Wrong

When you convert a PDF to DOCX, the conversion tool has to reverse-engineer document structure from visual positioning. It looks at text at position (72, 680) in 14pt bold and guesses that it's a heading. It sees a grid of text at regular intervals and guesses that it's a table. It notices a large gap between text blocks and guesses that it's a paragraph break.
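The guessing described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular converter works: the `Span` type, the 1.2x size threshold for headings, and the `gap_factor` of 1.8 are all invented for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    text: str
    x: float
    y: float  # PDF origin is bottom-left, so a larger y is higher on the page
    size: float
    bold: bool = False


def guess_structure(spans: list[Span], gap_factor: float = 1.8) -> list[tuple[str, str]]:
    """Label each span 'heading', 'paragraph-start', or 'line' from layout alone."""
    if not spans:
        return []
    body_size = median(s.size for s in spans)  # assume most text is body text
    ordered = sorted(spans, key=lambda s: (-s.y, s.x))  # top to bottom
    labels = []
    previous = None
    for span in ordered:
        if span.bold and span.size >= body_size * 1.2:
            label = "heading"  # bigger and bolder than the body: probably
        elif previous is None or (previous.y - span.y) > gap_factor * span.size:
            label = "paragraph-start"  # unusually large vertical gap
        else:
            label = "line"
        labels.append((label, span.text))
        previous = span
    return labels
```

Every branch is a threshold picked by hand, which is why the same document can convert well with one tool and badly with another.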

These guesses fail in predictable ways:

  • Tables become scattered text boxes. PDF tables are just text drawn at grid positions. If column spacing is irregular or cells span multiple rows, the converter can't reliably reconstruct the table structure.
  • Multi-column layouts merge or split incorrectly. A two-column academic paper might come through as one long column, or a single paragraph might split into two columns mid-sentence.
  • Headers and footers repeat as body text. The PDF has no concept of "header" or "footer" — it's just text at the top or bottom of each page. Converters often dump it inline with the body content.
  • Fonts change or substitute. If the PDF references a font that isn't embedded and the converter doesn't have it, text reflows with a different font, changing spacing and line breaks throughout.

Tagged vs. Untagged PDFs: The Quality Divider

Tagged PDFs include a hidden semantic layer that labels content: this is a heading, this is a paragraph, this is a table cell, this is an image with alt text. Think of it as an accessibility tree embedded in the PDF file. Screen readers use these tags to navigate the document, and conversion tools use them to reconstruct structure accurately.

The difference in conversion quality is dramatic. A tagged PDF from a modern Word export converts back to DOCX almost perfectly: headings, tables, lists, and formatting all survive. An untagged PDF, such as an untagged export from a design tool like InDesign or the output of a scan-to-PDF tool, converts to Word as a collection of positioned text boxes with approximate formatting.

How to check: open the PDF in Adobe Acrobat and go to File > Properties > Description. Look for "Tagged PDF: Yes" or "Tagged PDF: No." In most PDF readers, you can also check the accessibility panel. If you're batch-processing PDFs and need good conversion results, tagged PDFs are the ones worth converting — untagged ones will likely need manual cleanup.
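For batch processing, a crude first-pass check is possible without Acrobat. The sketch below only scans the raw bytes for the two catalog entries a tagged PDF normally carries (`/StructTreeRoot` and `/MarkInfo` with `/Marked true`). It is a heuristic with a made-up function name: files that store the catalog inside compressed object streams will be reported as untagged even when they are not, so a proper PDF parser is the safer choice for anything important.

```python
import re

MARKED_TRUE = re.compile(rb"/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes contain the usual tagged-PDF markers."""
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    is_marked = MARKED_TRUE.search(pdf_bytes) is not None
    return has_struct_tree and is_marked
```

Typical use would be `looks_tagged(pathlib.Path("report.pdf").read_bytes())`, where the filename is a placeholder.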

PDF/A for Archiving: What It Is and Why It Exists

PDF/A (ISO 19005) is a restricted subset of PDF designed for long-term archival. It requires all fonts to be embedded, prohibits encryption, forbids external content references and JavaScript, and mandates either device-independent color spaces or ICC profiles. The idea: a PDF/A file must be self-contained enough to render identically 50 years from now without any external resources.

Governments and regulated industries often require PDF/A for official records. Many U.S. federal courts require or recommend PDF/A for electronic case filing, and it commonly appears alongside eIDAS-compliant electronic signatures in EU document workflows. If you're submitting documents to any government body, there's a good chance PDF/A is specified somewhere in the requirements.

There are multiple conformance levels: PDF/A-1a (the strictest, requiring tagged structure and Unicode mapping), PDF/A-1b (the minimum, requiring embedded fonts and device-independent color but not tags), and newer versions (PDF/A-2, PDF/A-3) that support JPEG2000 compression, transparency, and embedded files. Most practical workflows target PDF/A-1b — it satisfies archival requirements without demanding fully tagged structure.
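A PDF/A file declares its part and conformance level in its XMP metadata, using the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below reads that declaration straight from the file bytes, assuming the metadata packet is stored uncompressed (which is the usual case for PDF/A). The function name is invented, and the result is only what the file claims: confirming real conformance takes a validator such as veraPDF.

```python
import re
from typing import Optional

# XMP may write the property as an element or as an attribute.
PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
CONFORMANCE = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return e.g. 'PDF/A-1b' if the XMP metadata claims it, else None."""
    part = PART.search(pdf_bytes)
    if part is None:
        return None
    conformance = CONFORMANCE.search(pdf_bytes)
    level = conformance.group(1).decode().lower() if conformance else ""
    return f"PDF/A-{part.group(1).decode()}{level}"
```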

PDF Forms: AcroForms vs. XFA

PDFs support two completely different form technologies, and they're not compatible with each other. AcroForms are the older, simpler system: form fields are annotations layered on top of the PDF page content, with each field having a name, type (text, checkbox, dropdown), and value. Nearly every PDF reader supports AcroForms, and they survive most conversions intact.

XFA (XML Forms Architecture) is Adobe's later, more powerful form system. XFA forms use XML to define dynamic layouts that can reflow, calculate values, and validate input. The problem: support outside Adobe Acrobat and Adobe Reader is thin, and XFA was deprecated in PDF 2.0. Chrome's PDF viewer, macOS Preview, and most third-party PDF tools ignore XFA forms, showing either a blank form or a "please open in Adobe" message. Firefox's pdf.js added XFA rendering in 2021, but it is the exception rather than the rule.

When converting a PDF with forms to another format, AcroForm data usually survives as text content. XFA data is frequently lost entirely because the conversion tool can't parse it. If you need form data from an XFA PDF, your best bet is filling it in Adobe Acrobat and "flattening" it (printing to PDF), which burns the form values into the page content as static text.

Text-Based vs. Scanned PDFs

A text-based PDF contains actual text objects — characters with positions, fonts, and encoding. You can select text, search it, and copy it. A scanned PDF is just a series of images, one per page. The visual result might look identical, but internally they're completely different.

You can't convert a scanned PDF to an editable document without OCR (Optical Character Recognition). The conversion tool needs to "read" the images and recognize characters, just like a human would. OCR quality depends on scan resolution (300 DPI minimum for reliable results), document clarity, language complexity, and the OCR engine used. Modern OCR engines like Tesseract 5 handle clean English documents well, but handwriting, low-resolution scans, and complex layouts still produce unreliable results.

Some PDFs are hybrid: they contain scanned images with an invisible OCR text layer behind them. These were run through OCR at scan time, so the text is searchable and selectable, but it's only as accurate as the OCR was when the scan was created. When converting these to Word, you'll get the OCR text layer — errors and all.
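The three cases (text-based, scanned, hybrid) leave different fingerprints in a page's content stream. The sketch below classifies one already-decompressed stream by looking for text-showing operators (`Tj`, `TJ`), the XObject paint operator (`Do`, which is how page images are drawn), and text rendering mode 3 (`3 Tr`, invisible text). It is a rough heuristic with invented names: `Do` can also paint non-image content, and a real tool would tokenize the stream rather than use regular expressions.

```python
import re

SHOWS_TEXT = re.compile(r"\b(?:Tj|TJ)\b")      # text-showing operators
PAINTS_XOBJECT = re.compile(r"/\S+\s+Do\b")    # e.g. "/Im0 Do" draws an image
INVISIBLE_TEXT = re.compile(r"\b3\s+Tr\b")     # rendering mode 3: no fill, no stroke


def classify_page(content_stream: str) -> str:
    """Return 'text', 'scanned', 'hybrid', or 'empty' for one page's stream."""
    shows_text = bool(SHOWS_TEXT.search(content_stream))
    paints_xobject = bool(PAINTS_XOBJECT.search(content_stream))
    if not shows_text:
        return "scanned" if paints_xobject else "empty"
    if paints_xobject and INVISIBLE_TEXT.search(content_stream):
        return "hybrid"  # page image plus an invisible OCR text layer
    return "text"
```

A page classified as "scanned" needs OCR before conversion; a "hybrid" page already has OCR text, so its quality is worth spot-checking before trusting the output.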

PDF Security: Passwords, Permissions, and Encryption

PDFs support two types of password protection. The user password (also called the "open password") prevents opening the file at all. Without it, the PDF is an encrypted blob. The owner password (also called the "permissions password") controls what you can do with an open PDF: printing, editing, copying text, filling forms.

The owner password is a polite request, not a technical barrier. A PDF protected only by an owner password is still encrypted, but the decryption key is derived from an empty user password, so any reader can decrypt the content without asking for anything. Enforcement therefore relies on the reader honoring the permission flags. Most commercial tools respect them, but open-source tools and command-line utilities like qpdf can remove the restrictions trivially by decrypting the file and writing it back out without them.

The user password, on the other hand, is real encryption. PDF supports RC4 (40-bit and 128-bit) and AES (128-bit and 256-bit) encryption. AES-256 (standardized in PDF 2.0 as encryption revision 6) is genuinely secure when the password is strong. RC4-40, found in older PDFs, has a key space small enough to brute-force quickly on modern hardware. When converting an encrypted PDF, you'll need the password first: short of guessing a weak password, there's no way around AES-256 encryption without it.
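The permission flags mentioned above are stored as a single signed 32-bit integer, the `/P` entry of the encryption dictionary, where each permission is one bit. The sketch below decodes the commonly used bits as numbered in the PDF specification (bit 1 is the lowest); the dictionary keys and function name are labels chosen for this example.

```python
# Bit positions (1-based, lowest bit first) from the PDF standard security handler.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p: int) -> dict[str, bool]:
    """Map the /P integer to named permissions; a set bit means 'allowed'."""
    # Python's & works on negative ints as two's complement, which matches /P.
    return {name: bool(p & (1 << (bit - 1))) for name, bit in PERMISSION_BITS.items()}


everything = decode_permissions(-4)      # all permission bits set
locked_down = decode_permissions(-3904)  # only the reserved bits set
```

Here `-4` grants every permission and `-3904` grants none. Either way, these are flags a reader chooses to honor, not a second layer of encryption.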

Converting FROM PDF: What Works and What Doesn't

Converting from PDF is the hard direction. You're trying to reconstruct structure from a format that deliberately threw structure away.

What works well:

  • PDF to DOCX with a tagged, text-based PDF from a modern word processor. Expect 90%+ fidelity.
  • PDF to PNG or PDF to JPG — rasterizing a PDF to an image is always reliable because you're just taking a screenshot of the rendered page.
  • PDF text extraction from clean, text-based PDFs with standard encodings. Copy-paste or programmatic extraction with tools like pdftotext works well.

What produces mixed results:

  • PDF to DOCX with untagged PDFs — tables, columns, and formatting will be approximate at best.
  • PDF to HTML — usually produces absolute-positioned elements, not a flowing HTML document. Useful for visual reproduction, not for responsive web content.

What usually fails:

  • Scanned PDF to any editable format without OCR. You'll get either nothing or gibberish.
  • Complex PDF forms to DOCX — form fields, especially XFA forms, are almost always lost.
  • PDFs with mathematical equations, musical notation, or specialized content. These use custom fonts and positioning that general-purpose converters rarely handle correctly.

Converting TO PDF: The Easy Direction

Converting to PDF is almost always reliable because PDF is a print format — anything you can print, you can make into a PDF. The source application renders the document through its normal layout engine and outputs PDF drawing commands instead of sending them to a printer.

DOCX to PDF works excellently via LibreOffice or Microsoft Word, preserving formatting, images, tables, headers, and footers with high fidelity. PowerPoint to PDF works similarly well. Spreadsheets to PDF work but may need print area configuration for large sheets. HTML to PDF is reliable via headless Chrome/Chromium rendering. Images to PDF are perfect — the image is simply embedded as-is.

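The main gotcha when converting to PDF is font availability. If the source document uses a font that the conversion tool doesn't have, it substitutes a similar font, changing text spacing and potentially breaking layout. LibreOffice on Linux, for example, doesn't ship Calibri (the longtime default Microsoft Office font), so DOCX files using Calibri may render slightly differently. Note that the Microsoft core fonts package (ttf-mscorefonts-installer on Debian/Ubuntu) covers Arial, Times New Roman, Verdana, and similar older fonts but not Calibri or Cambria. For those, install the metric-compatible replacements Carlito and Caladea (fonts-crosextra-carlito and fonts-crosextra-caladea on Debian/Ubuntu), which keep line breaks and spacing the same even though the glyph shapes differ slightly.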
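The LibreOffice route can be scripted. The sketch below builds the headless conversion command and runs it with `subprocess`. It assumes LibreOffice is installed and reachable as `soffice` on the PATH (some systems name the launcher `libreoffice` instead); the function names and file paths are placeholders.

```python
import subprocess
from pathlib import Path


def libreoffice_pdf_command(source: Path, out_dir: Path) -> list[str]:
    """Build the headless LibreOffice command that converts `source` to PDF."""
    return [
        "soffice",            # assumed launcher name; may differ per system
        "--headless",         # no GUI
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source: Path, out_dir: Path) -> Path:
    """Run the conversion and return the path LibreOffice writes to."""
    subprocess.run(libreoffice_pdf_command(source, out_dir), check=True)
    return out_dir / (source.stem + ".pdf")
```

Calling `convert_to_pdf(Path("report.docx"), Path("out"))` would be expected to produce `out/report.pdf`. Any font substitution happens silently during this step, so it is worth comparing page counts against the original.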

Why PDFs Can Be Huge (and How to Shrink Them)

A simple one-page PDF with text can be 20KB. A brochure with photos can be 200MB. The culprits are almost always embedded fonts and high-resolution images.

Font embedding is a size multiplier. A single font file (e.g., one weight of a typeface) is typically 50–200KB. A PDF using four weights of two typefaces might embed 800KB+ of font data. PDF/A requires every font to be embedded (subsets are allowed), so archival PDFs are usually larger than non-archival equivalents that reference fonts without embedding them. Font subsetting, which embeds only the glyphs actually used in the document, can reduce this dramatically, sometimes by 90%.

Images are the other major contributor. A 3000×2000 pixel photo at full quality is about 3–5MB in JPEG. If the PDF was created by dragging a high-res photo into a Word document that displays it at 3 inches wide, the full-resolution image is still embedded even though it's displayed small. Re-saving the PDF with image downsampling (typically to 150–300 DPI for print, 72–150 DPI for screen) can shrink files by 80%+.
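The arithmetic behind that example is worth spelling out. A 3000-pixel-wide photo displayed 3 inches wide has an effective resolution of 1000 DPI, far beyond what print or screen needs. The sketch below computes the effective DPI and the pixel dimensions after downsampling to a target; the function names are illustrative.

```python
def effective_dpi(pixels_wide: int, displayed_inches: float) -> float:
    """Resolution the image actually has at its displayed size."""
    return pixels_wide / displayed_inches


def downsampled_size(
    width_px: int, height_px: int, displayed_inches_wide: float, target_dpi: int
) -> tuple[int, int]:
    """Pixel dimensions after downsampling to target_dpi (never upscales)."""
    scale = min(1.0, target_dpi / effective_dpi(width_px, displayed_inches_wide))
    return round(width_px * scale), round(height_px * scale)


dpi = effective_dpi(3000, 3)                    # 1000.0
new_size = downsampled_size(3000, 2000, 3, 150)  # (450, 300)
```

Going from 3000×2000 to 450×300 cuts the pixel count from 6,000,000 to 135,000, roughly a 44-fold reduction before any change in JPEG quality.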

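Other size inflators: embedded video or 3D content (yes, PDF supports these), duplicate objects from repeated copy-paste operations, and incremental saves that append changes without rewriting the file. Tools like Ghostscript and qpdf can help. Ghostscript re-renders the file through its pdfwrite device and can downsample and recompress images along the way. qpdf rewrites the file structure, which discards unreferenced objects and the leftovers of incremental saves; its --linearize option reorders the result for fast web viewing but does not recompress images.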
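One common way to apply image downsampling is Ghostscript's `pdfwrite` device with a `-dPDFSETTINGS` preset. The sketch below only builds the command line, assuming Ghostscript is installed as `gs` (on Windows the binary is usually named differently); the function name and file names are placeholders. The presets trade size for quality, with `/screen` the most aggressive and `/prepress` the least.

```python
PRESETS = {"screen", "ebook", "printer", "prepress"}


def ghostscript_shrink_command(source: str, dest: str, preset: str = "ebook") -> list[str]:
    """Build a Ghostscript command that rewrites `source` with smaller images."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return [
        "gs",                          # assumed binary name
        "-sDEVICE=pdfwrite",           # write a new PDF
        f"-dPDFSETTINGS=/{preset}",    # quality and downsampling preset
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={dest}",
        source,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Because Ghostscript regenerates the file rather than editing it, it is worth checking that links, form fields, and tags you depend on survived.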

Specific Gotchas When Working with PDFs

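Fonts not embedding: If you create a PDF on a system with a specific font and open it on a system without that font, the PDF viewer substitutes a different font. This changes text spacing, line breaks, and sometimes overall layout. Always embed fonts when creating PDFs for distribution. In Word, the Options > Save > "Embed fonts in the file" setting applies to the DOCX itself; Word's PDF export normally embeds the fonts it uses as long as their licenses permit it, and choosing the PDF/A option in the export dialog makes embedding mandatory.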

Images getting downsampled: Some PDF creation tools automatically reduce image resolution to save file size. If you're creating a PDF for print, verify that images are at least 300 DPI at their print size. Tools like pdfimages -list can show you the actual resolution of every image in a PDF.

Hyperlinks lost in conversion: PDF hyperlinks are annotations, not part of the text content. Converting PDF to plain text, images, or some document formats drops all hyperlinks. DOCX conversion usually preserves them, but only if the conversion tool specifically handles link annotations.

Table structure destroyed: This is the single most common complaint about PDF conversion. PDF tables are visual constructs — text drawn at grid positions with lines drawn between them. Conversion tools must infer cell boundaries from coordinates, and they regularly get it wrong with merged cells, nested tables, or tables that span pages.
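A simplified version of that inference shows where it goes wrong. The sketch below clusters the left edges of text fragments into columns and assigns each fragment to the nearest one. The 5-point tolerance and both function names are assumptions made for the example; real converters also use ruling lines, row alignment, and whitespace analysis.

```python
def infer_columns(x_positions: list[float], tolerance: float = 5.0) -> list[float]:
    """Cluster left edges that sit within `tolerance` points of their neighbor."""
    groups: list[list[float]] = []
    for x in sorted(x_positions):
        if groups and x - groups[-1][-1] <= tolerance:
            groups[-1].append(x)   # close enough: same column
        else:
            groups.append([x])     # gap too large: start a new column
    return [sum(group) / len(group) for group in groups]


def assign_column(x: float, columns: list[float]) -> int:
    """Index of the column whose left edge is nearest to x."""
    return min(range(len(columns)), key=lambda i: abs(columns[i] - x))


columns = infer_columns([72, 73, 200, 201, 199, 350])  # [72.5, 200.0, 350.0]
merged_cell = assign_column(130, columns)               # 0
```

The last call is the failure case: a centered label in a cell that spans the first two columns starts at x=130, matches neither column edge, and gets forced into column 0. Nothing in the coordinates says the cell was merged.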

Right-to-left and bidirectional text: PDFs store text in visual order (left-to-right as drawn on screen), but Arabic and Hebrew text needs logical order (right-to-left) for proper editing. Converting a PDF with mixed LTR/RTL text to Word often produces text in the wrong reading order.

PDF's reputation as a "difficult" format is deserved, but it's not arbitrary. The format was designed to be a universal, device-independent print representation — and it does that job better than anything else. The pain comes when you try to use it as something it wasn't designed for: an editable, structured document format.

The key to working with PDFs effectively is understanding which direction you're converting. Going to PDF is reliable. Going from PDF depends almost entirely on whether the PDF has tagged structure and actual text content (not scanned images). When you have a tagged, text-based PDF from a modern application, conversion results are genuinely good. When you have an untagged scan from 2003, manage your expectations accordingly.