Every document format represents a different tradeoff between editability, visual fidelity, compatibility, and simplicity. DOCX gives you rich editing. PDF locks down visual appearance. HTML flows to any screen size. Markdown strips away everything but structure. Understanding what each format stores — and what it discards — is the difference between a clean conversion and a mangled mess.
This guide covers the major document formats, how they relate to each other, and what to expect when converting between them. The short version: conversion quality is asymmetric. Some directions are nearly perfect; others are fundamentally lossy. Knowing which is which saves hours of cleanup.
DOCX: The Universal Document Format
DOCX is the default output of Microsoft Word and the lingua franca of document exchange. Internally, it's a ZIP archive containing XML files: word/document.xml holds the body content, word/styles.xml defines formatting styles, word/media/ contains embedded images, and [Content_Types].xml declares the content type of each part, while the .rels relationship files link the parts together. You can literally rename a .docx file to .zip, extract it, and read the XML with a text editor.
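Because the container is an ordinary ZIP archive, you can inspect it without Word installed. The following is a minimal sketch using only Python's standard library; the function names are illustrative, not part of any DOCX library.

```python
import zipfile


def list_docx_parts(path):
    """Return the sorted names of all parts stored inside a DOCX archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())


def read_docx_part(path, part="word/document.xml"):
    """Return one part of the archive as text (the main body by default)."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part).decode("utf-8")
```

Calling list_docx_parts("report.docx") on a typical file (the file name is a placeholder) should show entries such as [Content_Types].xml, word/document.xml, and word/styles.xml.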
DOCX supports everything you'd expect from a modern word processor: paragraph and character styles, headers and footers with page numbers, tracked changes and comments, embedded images and charts, tables with merged cells, footnotes and endnotes, and bookmarks and cross-references. (VBA macros are the exception: they require the macro-enabled variant, DOCM, because a plain .docx file cannot store them.) This feature richness is both its strength and the reason conversions from DOCX to simpler formats lose information.
Office Open XML (ISO/IEC 29500), the format family DOCX belongs to, is well-documented and widely supported. LibreOffice, Google Docs, Apple Pages, and dozens of programmatic libraries (python-docx, docx4j, Open XML SDK) can read and write DOCX files. Compatibility isn't perfect: advanced Word features like SmartArt, ActiveX controls, and complex table styles may render differently. For typical documents, though, interoperability is solid.
DOC vs. DOCX: Why DOCX Is Strictly Better
The legacy DOC format (used by Word 97–2003) is a binary format based on Microsoft's OLE compound document structure. It's the same container format used by old XLS and PPT files. Binary formats are hard to parse, hard to inspect, and hard for non-Microsoft tools to support correctly. LibreOffice handles most DOC files well, but complex formatting, embedded objects, and VBA macros are more likely to break than in DOCX.
DOCX became Word's default format in 2007 and improves on DOC in nearly every practical respect: smaller files (XML compresses well in ZIP), an open, ISO-standardized specification, better programmatic access, and broader tool support. If you receive a DOC file, the first thing to do is convert it to DOCX. This is a one-way upgrade: you gain compatibility and, apart from the occasional legacy feature, lose no content. The only reason to keep DOC files is compatibility with Word 2003 or earlier, which Microsoft stopped supporting in 2014.
ODT: The Open Alternative
ODT (OpenDocument Text) is the default format for LibreOffice Writer and the document format mandated by many European governments. Like DOCX, it's a ZIP archive of XML files. Like DOCX, it supports styles, images, tables, headers/footers, and tracked changes. The two formats share the same paradigm — they just use different XML schemas.
Converting between DOCX and ODT is generally good, with minor style differences. Font substitution is the most common issue: a DOCX file using Calibri will render differently in LibreOffice if Calibri isn't installed. Table borders, paragraph spacing, and numbered list styles sometimes shift slightly. For text-heavy documents without complex formatting, the conversion is virtually transparent.
ODT's advantage is that it's a fully open standard (OASIS/ISO) with no patent encumbrances. Its disadvantage is that Microsoft Word's ODT support is mediocre — Word can open ODT files but occasionally mangles styles and formatting. If you're exchanging documents with Word users, DOCX remains the pragmatic choice.
RTF and Plain Text: Simplicity Has Its Uses
RTF (Rich Text Format) was Microsoft's attempt at a cross-platform document format in the late 1980s. It's a text-based format (you can open an RTF file in a text editor and see markup) that supports basic formatting: bold, italic, fonts, colors, tables, and embedded images. It has no macro support or advanced layout features, and its style and revision-tracking capabilities are far more limited than DOCX's and inconsistently handled outside Word.
RTF's main value today is safety. Unlike macro-enabled Word files (DOCM), RTF files can't carry VBA macros, making them a safer choice for email attachments in security-conscious environments. They are not risk-free, since RTF can still embed OLE objects, but some organizations require RTF for document submissions specifically because it can't carry macro code. Converting RTF to DOCX is reliable: nearly everything RTF supports has an equivalent in DOCX.
Plain text (TXT) stores characters and line breaks. Nothing else. No fonts, no bold, no images, no tables. Converting anything to TXT is destructive by design — you're deliberately stripping formatting to get raw content. But that's sometimes exactly what you want: extracting text for search indexing, feeding content into a script, or removing formatting before pasting into another application. Converting TXT to PDF wraps the text in a basic PDF with a monospaced font — functional, not pretty.
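To make the "strip everything but the characters" idea concrete, here is a rough sketch that pulls paragraph text out of a DOCX file with the standard library alone. It assumes the body lives in word/document.xml and reads only w:t text runs, so headers, footers, footnotes, and comments are ignored, and text inside nested structures such as text boxes may be repeated. The function name is illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a DOCX file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each w:t element holds the text of one run; join the runs in order.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

All styling is dropped on purpose: only the characters inside the text runs survive, which is exactly the destructive behavior described above.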
HTML: The Underappreciated Document Format
HTML is usually thought of as a web technology, but it's an excellent document format. It supports headings, paragraphs, lists, tables, images, links, bold, italic, and custom styling via CSS. It reflows to any screen size. Every device on earth can render it. And unlike DOCX, you can open it in any text editor and read the source.
Converting DOCX to HTML works well for text content: headings map to <h1>–<h6>, paragraphs to <p>, lists to <ul>/<ol>, and tables to <table>. What gets lost: page-specific layout (headers, footers, page breaks), precise font sizing and spacing, and Word-specific features like tracked changes and comments. The result is clean, semantic HTML — exactly what you'd want for publishing content on a website or blog.
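The style-to-tag mapping described above can be sketched in a few lines. This is a simplified illustration, not how any particular converter works: it assumes the paragraphs have already been extracted as (style name, text) pairs, and the mapping table is a hypothetical subset of Word's built-in style names.

```python
from html import escape

# Hypothetical mapping from Word paragraph style names to HTML tags.
STYLE_TO_TAG = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Heading 3": "h3",
    "Normal": "p",
}


def paragraphs_to_html(paragraphs):
    """Render (style_name, text) pairs as HTML; unknown styles become <p>."""
    lines = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style, "p")
        # Escape the text so characters like & and < stay valid in HTML.
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)
```

Note what the sketch has no place for: page-level styles such as headers and footers have no entry in the mapping, which mirrors why they are dropped in a real DOCX to HTML conversion.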
Going the other direction — HTML to DOCX — is less common but useful for turning web content into printable documents. HTML's flexible layout translates reasonably to Word, though CSS styles don't all have DOCX equivalents (flexbox, grid, and media queries have no meaning in a Word document).
Markdown: The Developer's Choice
Markdown is a lightweight markup language that uses plain text conventions to indicate formatting: # Heading, **bold**, - list item, [link](url). It was designed to be readable as plain text even without rendering. GitHub, Reddit, Discord, Stack Overflow, and most developer documentation use Markdown.
Markdown's limitation is that it only supports basic formatting. No font choices, no colors (in standard Markdown), no complex tables, no page layout. This is by design — Markdown forces you to focus on content structure, not visual presentation.
The conversion chain works well: HTML to Markdown strips HTML down to its structural essence. Markdown to HTML produces clean, semantic HTML. From there, the HTML can become PDF or DOCX. The tool Pandoc has made Markdown a universal interchange hub — you can go from Markdown to DOCX, PDF (via LaTeX), HTML, EPUB, RST, and dozens of other formats.
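As an illustration of using Pandoc as the hub, the sketch below wraps the pandoc command line from Python. It assumes pandoc is installed and on the PATH; the wrapper function names are made up for this example. Pandoc infers the output format from the target file's extension, so only the input format is passed explicitly.

```python
import shutil
import subprocess


def pandoc_command(source, target, from_format="markdown"):
    """Build the pandoc argument list; -f sets the input format, -o the output file."""
    return ["pandoc", "-f", from_format, "-o", target, source]


def convert_with_pandoc(source, target, from_format="markdown"):
    """Run pandoc, raising an error if it is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, target, from_format), check=True)
```

For example, convert_with_pandoc("notes.md", "notes.docx") would produce a Word file, and changing the target to "notes.html" would produce HTML (both file names are placeholders).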
Conversion Quality Hierarchy: What to Expect
Not all conversions are equal. Here's a realistic quality ranking from best to worst:
Nearly perfect:
- DOCX ↔ ODT: Same paradigm, different XML schemas. Minor style differences, especially around fonts and list numbering.
- DOC → DOCX: Upgrade from binary to XML. Occasional loss of legacy features (old-style text boxes, certain WordArt).
Good:
- DOCX → PDF: Prints the document as-is. Layout is preserved if the converter has the right fonts. This is the most reliable conversion in the document world.
- DOCX → HTML: Content and structure survive; page layout, headers/footers, and precise spacing are dropped.
Lossy:
- PDF → DOCX: Reconstructs document structure from visual positions. Quality ranges from "good enough" (tagged PDFs from modern Word) to "unusable" (scanned documents, complex layouts).
- HTML → Markdown: Complex HTML tables, images with captions, and styled content lose their visual formatting.
Destructive (by design):
- Anything → TXT: All formatting is stripped. Only the raw text content survives. This is sometimes exactly what you want.
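If you want a pipeline to warn users before a risky conversion, the ranking above can be encoded as a simple lookup. This is only a sketch of that idea: the tier labels come from the list, while the table layout, the lowercase format keys, and the function name are choices made for this example.

```python
# Quality tiers from the ranking above, keyed by (source, target) format.
QUALITY = {
    ("docx", "odt"): "nearly perfect",
    ("odt", "docx"): "nearly perfect",
    ("doc", "docx"): "nearly perfect",
    ("docx", "pdf"): "good",
    ("docx", "html"): "good",
    ("pdf", "docx"): "lossy",
    ("html", "md"): "lossy",
}


def expected_quality(source, target):
    """Return the expected quality tier for a conversion, or 'unknown'."""
    source, target = source.lower(), target.lower()
    if target == "txt":
        # Anything to TXT strips all formatting by design.
        return "destructive"
    return QUALITY.get((source, target), "unknown")
```

The table is deliberately directional: ("docx", "pdf") and ("pdf", "docx") are separate entries, reflecting the asymmetry the ranking describes.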
Common Conversion Workflows
"I received a DOC file but I use Google Docs." Convert DOC to DOCX first, then upload to Google Docs. Google Docs can import DOC files directly, but the conversion is more reliable through DOCX as an intermediate because Google's DOC parser is less mature than its DOCX parser.
"I need to submit a PDF." Convert DOCX to PDF. This is the most common document conversion in existence and it works well. Make sure fonts are embedded (especially if using non-standard fonts) and check that the PDF looks right before submitting.
"I need to edit a PDF." Convert PDF to DOCX, make your edits in Word or LibreOffice, then convert back to PDF. Expect some formatting drift — the round trip through PDF-to-DOCX-to-PDF won't reproduce the original PDF pixel-for-pixel. For minor text changes, PDF editing tools (Adobe Acrobat, PDF-XChange) are less disruptive.
"I want to publish a document on my blog." Convert DOCX to HTML or write in Markdown and convert to HTML. HTML is the native language of the web — starting from HTML avoids the CSS and layout issues that come from embedded PDF viewers or Word plugins.
"I need to extract just the text from a document." Convert to TXT. This works from any format and gives you clean text with no formatting artifacts. Useful for text analysis, search indexing, or pasting into another application without carrying over formatting.
What Gets Lost in Document Conversion
Every document conversion involves tradeoffs. Here's what commonly doesn't survive:
- Macros and active content. VBA macros live in macro-enabled DOCM files (a plain DOCX cannot hold them) and don't carry over to any other format. Period. If the document relies on macros for functionality, conversion breaks it.
- Tracked changes and comments. Among the formats covered here, DOCX and ODT are the ones that reliably carry tracked changes. Converting to PDF, HTML, or most other formats either flattens the changes (accepts all) or discards them. If you need to preserve revision history, stay in DOCX.
- Embedded fonts. DOCX files can embed fonts for portable rendering. PDF also embeds fonts. HTML does not (it references web fonts or system fonts). ODT can embed fonts but most tools don't by default. When converting between formats, font availability on the rendering system determines whether text appears correctly.
- Complex tables. Merged cells, nested tables, and tables with precise column widths are the most fragile element in any conversion. They break going DOCX → HTML (HTML tables work differently than Word tables), going PDF → DOCX (the converter must detect table structure from coordinates), and even going DOCX → ODT (subtle differences in table models).
- Headers, footers, and page breaks. These are page-layout concepts. They survive DOCX → PDF (which is page-based) but are lost in DOCX → HTML (which is flow-based) and DOCX → Markdown (which has no page concept at all).
- Precise spacing and positioning. Exact paragraph spacing, character kerning, line spacing, and margin sizes are preserved in DOCX → PDF but lost or approximated in conversions to HTML, Markdown, and RTF.
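Several of the fragile features listed above can be detected before converting. The following is a rough pre-flight check using only the standard library. It assumes the usual Office Open XML part names (word/vbaProject.bin for macros, word/comments.xml for comments) and looks for the w:ins, w:del, w:tbl, w:gridSpan, and w:vMerge elements in the main body only. The function name and the result keys are illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def audit_word_file(path):
    """Report features in a DOCX/DOCM file that tend not to survive conversion."""
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
        root = ET.fromstring(archive.read("word/document.xml"))

    def has(tag):
        # True if at least one element with this local name exists in the body.
        return next(root.iter(f"{{{W_NS}}}{tag}"), None) is not None

    return {
        "macros": "word/vbaProject.bin" in names,
        "comments": "word/comments.xml" in names,
        "tracked_changes": has("ins") or has("del"),
        "tables": has("tbl"),
        "merged_cells": has("gridSpan") or has("vMerge"),
    }
```

A result with "tracked_changes" or "merged_cells" set to True is a hint to review the converted output by hand rather than trusting it blindly.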
LibreOffice as a Conversion Engine
LibreOffice in headless mode (libreoffice --headless --convert-to) is the workhorse behind most server-side document conversions, including the ones on ChangeThisFile. It can convert between DOCX, DOC, ODT, RTF, PDF, HTML, and TXT without a graphical interface.
What it does well: DOCX/DOC to PDF conversion is reliable for typical business documents. Text, images, basic tables, and standard formatting come through cleanly. It handles batch conversion well and is free to deploy on any server.
Where it struggles: Complex Word-specific features like SmartArt, advanced table styles, text effects, and certain embedded object types don't render identically. Font substitution is the biggest practical issue — if the source document uses fonts not installed on the server, LibreOffice substitutes its defaults, changing text flow and page breaks. The fix is installing the expected fonts on the server.
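One way to catch font substitution before it happens is to compare the fonts a document declares against the fonts installed on the server. The sketch below reads the font names from word/fontTable.xml, the part where Word normally records them. How you obtain the list of installed fonts is left to the caller (for example from your system's font tooling), and both function names are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/fontTable.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def declared_fonts(path):
    """Return the sorted font names listed in the document's font table."""
    with zipfile.ZipFile(path) as archive:
        if "word/fontTable.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("word/fontTable.xml"))
    name_attr = f"{{{W_NS}}}name"
    names = {font.get(name_attr) for font in root.iter(f"{{{W_NS}}}font")}
    return sorted(name for name in names if name)


def missing_fonts(path, installed):
    """Return declared fonts that are absent from the caller-supplied installed list."""
    installed_lower = {name.lower() for name in installed}
    return [name for name in declared_fonts(path) if name.lower() not in installed_lower]
```

A non-empty result from missing_fonts is a signal that text flow and page breaks may shift during conversion unless those fonts are installed first.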
LibreOffice also has a single-instance limitation: headless conversions that share one user profile effectively run one document at a time, so concurrent requests queue up behind each other. For high-throughput conversion, you either run multiple LibreOffice instances, each with its own user profile directory (and its own port if you drive them through a listener), or use a conversion queue.
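A headless conversion can be driven from Python with a thin wrapper around the command line. This is a sketch under a few assumptions: the executable is called libreoffice as in the command above (on many installs the binary is soffice), and the optional profile_dir parameter uses LibreOffice's -env:UserInstallation setting to give each worker its own profile, which is one common way to run several instances side by side. The function names and parameters are illustrative.

```python
import subprocess
from pathlib import Path


def soffice_command(source, outdir, target_format="pdf",
                    profile_dir=None, binary="libreoffice"):
    """Build the argument list for a headless LibreOffice conversion."""
    cmd = [binary]
    if profile_dir is not None:
        # A separate profile per worker avoids instances blocking each other.
        profile_uri = Path(profile_dir).resolve().as_uri()
        cmd.append(f"-env:UserInstallation={profile_uri}")
    cmd += ["--headless", "--convert-to", target_format,
            "--outdir", str(outdir), str(source)]
    return cmd


def convert(source, outdir, target_format="pdf", profile_dir=None, timeout=120):
    """Run the conversion; raises if LibreOffice fails or exceeds the timeout."""
    cmd = soffice_command(source, outdir, target_format, profile_dir)
    subprocess.run(cmd, check=True, timeout=timeout)
```

The timeout matters in practice: a conversion that hangs on a malformed file would otherwise block the worker indefinitely.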
The golden rule of document conversion: always work from the most editable source. If you have the original DOCX, convert from that — not from a PDF someone emailed you. Every step away from the original editable format loses information that can't be recovered.
When conversion is unavoidable, match the conversion to the use case. Need visual fidelity? Go to PDF. Need web publishing? Go to HTML. Need raw text for processing? Go to TXT. Need to hand off to a developer? Go to Markdown. And when the conversion produces imperfect results, understand that it's not a bug in the tool — it's a fundamental consequence of moving data between formats that represent different things.