How to Convert Documents Between Any Format

Published Mar 19, 2026 · 7 min read · By ChangeThisFile Team

Quick Answer

Document conversion quality depends on the direction and the formats involved. DOCX-to-PDF is nearly lossless. PDF-to-DOCX is lossy because PDF stores visual positions, not document structure. DOCX and ODT convert between each other well. Converting anything to TXT strips all formatting. For best results, always convert from the original editable source format, not from a PDF.

Every document format represents a different tradeoff between editability, visual fidelity, compatibility, and simplicity. DOCX gives you rich editing. PDF locks down visual appearance. HTML flows to any screen size. Markdown strips away everything but structure. Understanding what each format stores — and what it discards — is the difference between a clean conversion and a mangled mess.

This guide covers the major document formats, how they relate to each other, and what to expect when converting between them. The short version: conversion quality is asymmetric. Some directions are nearly perfect; others are fundamentally lossy. Knowing which is which saves hours of cleanup.

DOCX: The Universal Document Format

DOCX is the default output of Microsoft Word and the lingua franca of document exchange. Internally, it's a ZIP archive containing XML files: word/document.xml holds the body content, word/styles.xml defines formatting styles, word/media/ contains embedded images, and [Content_Types].xml declares the content type of every part in the package. You can rename a .docx file to .zip, extract it, and read the XML with a text editor.
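The ZIP structure described above can be inspected with nothing but the Python standard library. This is a minimal sketch: list_docx_parts and build_toy_package are hypothetical helper names, and the toy archive only mimics the part layout; it is not a file Word would open.

```python
import io
import zipfile


def list_docx_parts(docx_bytes: bytes) -> list[str]:
    """Return the sorted names of the parts stored inside a DOCX package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(package.namelist())


def build_toy_package() -> bytes:
    """Build a tiny stand-in archive (NOT a valid Word file) for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/styles.xml", "<styles/>")
    return buffer.getvalue()


if __name__ == "__main__":
    for part_name in list_docx_parts(build_toy_package()):
        print(part_name)
```

For a real file, passing the result of open("report.docx", "rb").read() (the filename is a placeholder) should list the same kinds of parts, plus relationship files and any media.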

DOCX supports everything you'd expect from a modern word processor: paragraph and character styles, headers and footers with page numbers, tracked changes and comments, embedded images and charts, tables with merged cells, footnotes and endnotes, and bookmarks and cross-references. VBA macros are the one exception: they live in the macro-enabled variant (.docm), and a plain .docx cannot carry them. This feature richness is both its strength and the reason conversions from DOCX to simpler formats lose information.

The Office Open XML format (ISO/IEC 29500) that DOCX uses is well-documented and widely supported. LibreOffice, Google Docs, Apple Pages, and dozens of programmatic libraries (python-docx, docx4j, Open XML SDK) can read and write DOCX files. Compatibility isn't perfect: advanced Word features like SmartArt, ActiveX controls, and complex table styles may render differently. For typical documents, though, interoperability is solid.

DOC vs. DOCX: Why DOCX Is Strictly Better

The legacy DOC format (used by Word 97–2003) is a binary format based on Microsoft's OLE compound document structure. It's the same container format used by old XLS and PPT files. Binary formats are hard to parse, hard to inspect, and hard for non-Microsoft tools to support correctly. LibreOffice handles most DOC files well, but complex formatting, embedded objects, and VBA macros are more likely to break than in DOCX.

DOCX replaced DOC as Word's default format in 2007 and is better in nearly every practical respect: smaller files (XML compresses well in ZIP), an open ISO standard, easier programmatic access, and broader tool support. If you receive a DOC file, the first thing to do is convert it to DOCX. This is a one-way upgrade: you gain compatibility and, apart from the occasional legacy feature such as old-style text boxes or certain WordArt, lose no content. The only reason to keep DOC files is compatibility with Word 2003 or earlier, which Microsoft has not supported since 2014.

ODT: The Open Alternative

ODT (OpenDocument Text) is the default format for LibreOffice Writer and the document format mandated by many European governments. Like DOCX, it's a ZIP archive of XML files. Like DOCX, it supports styles, images, tables, headers/footers, and tracked changes. The two formats share the same paradigm — they just use different XML schemas.
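Because DOCX and ODT are both ZIP archives while DOC is an OLE compound file, a converter cannot trust the file extension and usually looks at the bytes instead. The sketch below is one way to do that, not a complete detector: sniff_document_format is a hypothetical name, and the check for word/document.xml assumes the part layout Word normally writes.

```python
import io
import zipfile

# Header bytes of an OLE compound file (legacy DOC, but also old XLS and PPT).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def sniff_document_format(data: bytes) -> str:
    """Guess a document format from its leading bytes and ZIP contents."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"{\\rtf"):
        return "rtf"
    if data.startswith(OLE_MAGIC):
        return "ole"  # the container alone cannot tell DOC from XLS or PPT
    if data.startswith(b"PK"):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as package:
                names = set(package.namelist())
                if "word/document.xml" in names:
                    return "docx"
                if "mimetype" in names:
                    declared = package.read("mimetype").decode("ascii", "replace")
                    if declared.strip() == ODT_MIMETYPE:
                        return "odt"
        except zipfile.BadZipFile:
            pass  # starts like a ZIP but is truncated or corrupt
    return "unknown"
```

Anything that falls through to "unknown" is best treated as plain text or rejected, depending on how strict the pipeline needs to be.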

Converting between DOCX and ODT is generally good, with minor style differences. Font substitution is the most common issue: a DOCX file using Calibri will render differently in LibreOffice if Calibri isn't installed. Table borders, paragraph spacing, and numbered list styles sometimes shift slightly. For text-heavy documents without complex formatting, the conversion is virtually transparent.

ODT's advantage is that it's a fully open standard (OASIS/ISO) with no patent encumbrances. Its disadvantage is that Microsoft Word's ODT support is mediocre — Word can open ODT files but occasionally mangles styles and formatting. If you're exchanging documents with Word users, DOCX remains the pragmatic choice.

RTF and Plain Text: Simplicity Has Its Uses

RTF (Rich Text Format) was Microsoft's attempt at a cross-platform document format in the late 1980s. It's a text-based format (you can open an RTF file in a text editor and see markup) that supports basic formatting: bold, italic, fonts, colors, tables, and embedded images. It doesn't support tracked changes, styles (in the DOCX sense), macros, or advanced layout features.

RTF's main value today is safety. Unlike macro-enabled Word files (.docm), RTF files can't carry VBA macros, making them a safer choice for email attachments in security-conscious environments. Some organizations still require RTF for document submissions for exactly this reason. RTF is not risk-free, since it can still embed OLE objects, but it removes the macro attack surface. Converting RTF to DOCX is reliable because RTF's feature set is a subset of what DOCX supports.

Plain text (TXT) stores characters and line breaks. Nothing else. No fonts, no bold, no images, no tables. Converting anything to TXT is destructive by design — you're deliberately stripping formatting to get raw content. But that's sometimes exactly what you want: extracting text for search indexing, feeding content into a script, or removing formatting before pasting into another application. Converting TXT to PDF wraps the text in a basic PDF with a monospaced font — functional, not pretty.
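To make "stripping formatting to get raw content" concrete, here is a minimal sketch that pulls only the text runs out of a DOCX body. The function name docx_to_text is hypothetical, and the approach is deliberately naive: it ignores headers, footnotes, and tab or line-break elements, and paragraphs nested inside text boxes may appear twice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(docx_bytes: bytes) -> str:
    """Return the body text of a DOCX file, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Each w:t element holds one run of text; bold, fonts, and colors
        # live in sibling w:rPr elements that are simply never read here.
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        lines.append("".join(runs))
    return "\n".join(lines)
```

Note that nothing in the code removes formatting explicitly. The formatting is lost simply because the extractor only looks at the text nodes.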

HTML: The Underappreciated Document Format

HTML is usually thought of as a web technology, but it's an excellent document format. It supports headings, paragraphs, lists, tables, images, links, bold, italic, and custom styling via CSS. It reflows to any screen size. Every device on earth can render it. And unlike DOCX, you can open it in any text editor and read the source.

Converting DOCX to HTML works well for text content: headings map to <h1>–<h6>, paragraphs to <p>, lists to <ul>/<ol>, and tables to <table>. What gets lost: page-specific layout (headers, footers, page breaks), precise font sizing and spacing, and Word-specific features like tracked changes and comments. The result is clean, semantic HTML — exactly what you'd want for publishing content on a website or blog.
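The heading-to-tag mapping can be sketched in a few lines. This assumes paragraphs arrive as a style name plus text and that headings use Word's built-in English style names ("Heading 1" through "Heading 6"); paragraph_to_html is a hypothetical helper, not any converter's real API.

```python
import html
import re


def paragraph_to_html(style_name: str, text: str) -> str:
    """Map a Word paragraph style to a semantic HTML element."""
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    # Anything that is not a recognized heading style falls back to <p>.
    tag = f"h{match.group(1)}" if match else "p"
    return f"<{tag}>{html.escape(text)}</{tag}>"


if __name__ == "__main__":
    print(paragraph_to_html("Heading 2", "Fees & Terms"))
    print(paragraph_to_html("Normal", "Payment is due within 30 days."))
```

Page-level properties have no branch in this mapping at all, which is why headers, footers, and page breaks drop out of the result.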

Going the other direction — HTML to DOCX — is less common but useful for turning web content into printable documents. HTML's flexible layout translates reasonably to Word, though CSS styles don't all have DOCX equivalents (flexbox, grid, and media queries have no meaning in a Word document).

Markdown: The Developer's Choice

Markdown is a lightweight markup language that uses plain text conventions to indicate formatting: # Heading, **bold**, - list item, [link](url). It was designed to be readable as plain text even without rendering. GitHub, Reddit, Discord, Stack Overflow, and most developer documentation use Markdown.

Markdown's limitation is that it only supports basic formatting. No font choices, no colors (in standard Markdown), no complex tables, no page layout. This is by design — Markdown forces you to focus on content structure, not visual presentation.

The conversion chain works well: HTML to Markdown strips HTML down to its structural essence. Markdown to HTML produces clean, semantic HTML. From there, the HTML can become PDF or DOCX. The tool Pandoc has made Markdown a universal interchange hub — you can go from Markdown to DOCX, PDF (via LaTeX), HTML, EPUB, RST, and dozens of other formats.
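A hedged sketch of driving Pandoc from a script: it relies on Pandoc inferring the input and output formats from the file extensions and uses only the --output option. The helper names pandoc_command and convert_with_pandoc are hypothetical, and the filenames are placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source: Path, target: Path) -> list[str]:
    """Build the argument list; Pandoc guesses formats from the extensions."""
    return ["pandoc", str(source), "--output", str(target)]


def convert_with_pandoc(source: Path, target: Path) -> None:
    """Run Pandoc and raise if it is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, target), check=True)


if __name__ == "__main__":
    # Prints the command without running it.
    print(" ".join(pandoc_command(Path("notes.md"), Path("notes.docx"))))
```

Swapping the target extension to .html or .epub is enough to change the output format, which is what makes Markdown convenient as a hub.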

Conversion Quality Hierarchy: What to Expect

Not all conversions are equal. Here's a realistic quality ranking from best to worst:

Nearly perfect:

DOCX ↔ ODT: Same paradigm, different XML schemas. Minor style differences, especially around fonts and list numbering.
DOC → DOCX: Upgrade from binary to XML. Occasional loss of legacy features (old-style text boxes, certain WordArt).

Good:

DOCX → PDF: Prints the document as-is. Layout is preserved if the converter has the right fonts. This is the most reliable conversion in the document world.
DOCX → HTML: Content and structure survive; page layout, headers/footers, and precise spacing are dropped.

Lossy:

PDF → DOCX: Reconstructs document structure from visual positions. Quality ranges from "good enough" (tagged PDFs from modern Word) to "unusable" (scanned documents, complex layouts).
HTML → Markdown: Complex HTML tables, images with captions, and styled content lose their visual formatting.

Destructive (by design):

Anything → TXT: All formatting is stripped. Only the raw text content survives. This is sometimes exactly what you want.
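To see why PDF → DOCX sits in the lossy tier, consider what "reconstructing structure from visual positions" means in practice. The sketch below is a toy illustration, not how any particular converter works: Fragment and group_into_lines are hypothetical names, coordinates are assumed to be in PDF points with y increasing up the page, and the tolerance value is an arbitrary guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A piece of text with the position where the PDF draws it."""
    x: float
    y: float
    text: str


def group_into_lines(fragments: list[Fragment], tolerance: float = 2.0) -> list[str]:
    """Rebuild reading-order lines by clustering fragments with similar y."""
    lines: list[tuple[float, list[Fragment]]] = []
    # Walk from the top of the page down (larger y first), left to right.
    for fragment in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0] - fragment.y) <= tolerance:
            lines[-1][1].append(fragment)
        else:
            lines.append((fragment.y, [fragment]))
    return [
        " ".join(f.text for f in sorted(members, key=lambda f: f.x))
        for _, members in lines
    ]
```

Even this simple step depends on a guessed tolerance. Deciding which lines form a paragraph, a heading, or a table cell takes further heuristics of the same kind, and each one can guess wrong.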

Common Conversion Workflows

"I received a DOC file but I use Google Docs." Convert DOC to DOCX first, then upload to Google Docs. Google Docs can import DOC files directly, but the conversion is more reliable through DOCX as an intermediate because Google's DOC parser is less mature than its DOCX parser.

"I need to submit a PDF." Convert DOCX to PDF. This is the most common document conversion in existence and it works well. Make sure fonts are embedded (especially if using non-standard fonts) and check that the PDF looks right before submitting.

"I need to edit a PDF." Convert PDF to DOCX, make your edits in Word or LibreOffice, then convert back to PDF. Expect some formatting drift — the round trip through PDF-to-DOCX-to-PDF won't reproduce the original PDF pixel-for-pixel. For minor text changes, PDF editing tools (Adobe Acrobat, PDF-XChange) are less disruptive.

"I want to publish a document on my blog." Convert DOCX to HTML or write in Markdown and convert to HTML. HTML is the native language of the web — starting from HTML avoids the CSS and layout issues that come from embedded PDF viewers or Word plugins.

"I need to extract just the text from a document." Convert to TXT. This works from any format and gives you clean text with no formatting artifacts. Useful for text analysis, search indexing, or pasting into another application without carrying over formatting.

What Gets Lost in Document Conversion

Every document conversion involves tradeoffs. Here's what commonly doesn't survive:

Macros and active content. VBA macros (which live in macro-enabled .docm files, not plain .docx) don't survive conversion to any non-Word format. If the document relies on macros for functionality, conversion breaks it.
Tracked changes and comments. Only DOCX and ODT support tracked changes natively. Converting to PDF, HTML, or any other format either flattens the changes (accepts all) or discards them. If you need to preserve revision history, stay in DOCX.
Embedded fonts. DOCX files can embed fonts for portable rendering. PDF also embeds fonts. HTML does not (it references web fonts or system fonts). ODT can embed fonts but most tools don't by default. When converting between formats, font availability on the rendering system determines whether text appears correctly.
Complex tables. Merged cells, nested tables, and tables with precise column widths are the most fragile element in any conversion. They break going DOCX → HTML (HTML tables work differently than Word tables), going PDF → DOCX (the converter must detect table structure from coordinates), and even going DOCX → ODT (subtle differences in table models).
Headers, footers, and page breaks. These are page-layout concepts. They survive DOCX → PDF (which is page-based) but are lost in DOCX → HTML (which is flow-based) and DOCX → Markdown (which has no page concept at all).
Precise spacing and positioning. Exact paragraph spacing, character kerning, line spacing, and margin sizes are preserved in DOCX → PDF but lost or approximated in conversions to HTML, Markdown, and RTF.
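Several of the casualties listed above can be detected before converting, so the user can be warned up front. The sketch below assumes the part names Word typically writes (word/vbaProject.bin, word/comments.xml, word/header1.xml and so on); conversion_warnings is a hypothetical helper and the check is not exhaustive.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def conversion_warnings(package_bytes: bytes) -> list[str]:
    """List features in a Word package that most target formats will drop."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        names = set(package.namelist())
        root = ET.fromstring(package.read("word/document.xml"))

    warnings = []
    if "word/vbaProject.bin" in names:  # present in macro-enabled files
        warnings.append("macros")
    if "word/comments.xml" in names:
        warnings.append("comments")
    # Tracked insertions and deletions are stored as w:ins and w:del elements.
    has_insertions = next(root.iter(f"{{{W_NS}}}ins"), None) is not None
    has_deletions = next(root.iter(f"{{{W_NS}}}del"), None) is not None
    if has_insertions or has_deletions:
        warnings.append("tracked changes")
    if any(name.startswith(("word/header", "word/footer")) for name in names):
        warnings.append("headers/footers")
    return warnings
```

An empty list does not guarantee a clean conversion. It only means none of these particular features were found.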

LibreOffice as a Conversion Engine

LibreOffice in headless mode (libreoffice --headless --convert-to) is the workhorse behind most server-side document conversions, including the ones on ChangeThisFile. It can convert between DOCX, DOC, ODT, RTF, PDF, HTML, and TXT without a graphical interface.
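A minimal sketch of calling the headless converter from Python. It uses the --headless, --convert-to, and --outdir options; the helper names, the 120-second timeout, and the assumption that the binary is on PATH as libreoffice (some installs expose it as soffice) are illustrative choices, not details of ChangeThisFile's setup.

```python
import subprocess
from pathlib import Path


def libreoffice_command(source: Path, target_format: str, out_dir: Path) -> list[str]:
    """Build the argument list for one headless conversion."""
    return [
        "libreoffice",
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        str(out_dir),
        str(source),
    ]


def convert(source: Path, target_format: str, out_dir: Path) -> Path:
    """Run the conversion and return the expected output path."""
    subprocess.run(
        libreoffice_command(source, target_format, out_dir),
        check=True,
        timeout=120,  # assumed limit so a stuck conversion cannot hang forever
    )
    # LibreOffice writes the output using the source's base name.
    return out_dir / f"{source.stem}.{target_format}"
```

Passing the arguments as a list rather than a shell string avoids quoting problems with filenames that contain spaces.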

What it does well: DOCX/DOC to PDF conversion is reliable for typical business documents. Text, images, basic tables, and standard formatting come through cleanly. It handles batch conversion well and is free to deploy on any server.

Where it struggles: Complex Word-specific features like SmartArt, advanced table styles, text effects, and certain embedded object types don't render identically. Font substitution is the biggest practical issue — if the source document uses fonts not installed on the server, LibreOffice substitutes its defaults, changing text flow and page breaks. The fix is installing the expected fonts on the server.

LibreOffice also has a single-instance limitation: an instance converts one document at a time in headless mode, so concurrent requests queue up behind each other. For high-throughput conversion, you either run multiple LibreOffice instances (each with its own user profile directory, and its own port if you drive them over a socket) or use a conversion queue.
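The conversion-queue option can be sketched with a single-worker executor, which runs submitted jobs strictly one after another. SerialConverter is a hypothetical class, and the convert callable stands in for whatever actually invokes LibreOffice.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class SerialConverter:
    """Queue conversion jobs and run them one at a time, in submission order."""

    def __init__(self, convert: Callable[[str], str]) -> None:
        self._convert = convert
        # One worker thread means at most one conversion is ever in flight.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def submit(self, source: str) -> Future:
        """Enqueue a job; the returned Future resolves to the output path."""
        return self._executor.submit(self._convert, source)

    def shutdown(self) -> None:
        """Wait for queued jobs to finish, then stop the worker."""
        self._executor.shutdown(wait=True)
```

Callers get a Future back immediately, so a web request handler can wait on its own job without blocking the jobs queued behind it. Scaling up would mean several such queues, each feeding its own LibreOffice instance.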

The golden rule of document conversion: always work from the most editable source. If you have the original DOCX, convert from that — not from a PDF someone emailed you. Every step away from the original editable format loses information that can't be recovered.

When conversion is unavoidable, match the conversion to the use case. Need visual fidelity? Go to PDF. Need web publishing? Go to HTML. Need raw text for processing? Go to TXT. Need to hand off to a developer? Go to Markdown. And when the conversion produces imperfect results, understand that it's not a bug in the tool — it's a fundamental consequence of moving data between formats that represent different things.

Key Takeaways

DOCX is a ZIP archive of XML files and the most widely supported editable document format. When in doubt, use DOCX.
DOC is the legacy binary format from Word 97–2003. Convert DOC to DOCX immediately — it's a one-way upgrade with no downside.
DOCX-to-PDF is the most reliable document conversion. PDF-to-DOCX is the least reliable because it must reconstruct structure from visual positions.
HTML is an underappreciated document format: it supports rich content, reflows to any screen, and every device can render it.
Markdown is ideal for content that will be published to the web or converted to multiple output formats. It forces clean structure.
Macros, tracked changes, and complex tables are the first casualties of document conversion. Keep the original editable file.
Always convert from the most editable source format available, not from a PDF or image.

Frequently Asked Questions

What's the best way to convert a PDF back to an editable document?

Convert to DOCX using a tool that supports tagged PDF extraction. The results will be best if the PDF was created from a modern word processor (Word 2010+) and has tagged structure. For scanned PDFs, you'll need OCR first. In all cases, expect to do some manual cleanup — PDF-to-DOCX is inherently imperfect because PDF doesn't store document structure.

Why does my DOCX look different in LibreOffice vs. Word?

Font substitution is the primary cause. If LibreOffice doesn't have a font the document uses (commonly Calibri, Cambria, or other Microsoft fonts), it substitutes a similar font with different metrics, changing text flow and page breaks. Install the Microsoft core fonts on your system to fix most discrepancies.

Should I use DOCX or PDF for sharing documents?

Use PDF if the recipient only needs to read the document and you want it to look identical everywhere. Use DOCX if the recipient needs to edit the document. For formal submissions (legal, government, academic), PDF is almost always required. For collaborative work, DOCX (or Google Docs) is more practical.

Can I convert a Word document to HTML for my website?

Yes, and it works well for content. DOCX-to-HTML preserves headings, paragraphs, lists, tables, and images. But the output HTML may include inline styles and Word-specific markup that needs cleanup. For best results, convert to HTML, then clean up the output with an HTML formatter or paste into your CMS's content editor.
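One possible cleanup pass, sketched with the standard library's HTML parser: it re-emits the markup while dropping style, class, and lang attributes. The attribute list and the names AttributeStripper and strip_inline_styles are assumptions for illustration. Comments and doctype declarations are also dropped, because the parser's default handlers ignore them.

```python
import html
from html.parser import HTMLParser

DROPPED_ATTRIBUTES = {"style", "class", "lang"}  # assumed cleanup policy


class AttributeStripper(HTMLParser):
    """Re-serialize HTML without presentational attributes."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.parts: list[str] = []

    def _open_tag(self, tag, attrs) -> str:
        kept = [(key, value) for key, value in attrs if key not in DROPPED_ATTRIBUTES]
        rendered = "".join(
            f" {key}" if value is None else f' {key}="{html.escape(value, quote=True)}"'
            for key, value in kept
        )
        return f"<{tag}{rendered}"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._open_tag(tag, attrs) + ">")

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._open_tag(tag, attrs) + " />")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")  # keep entities exactly as written

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_inline_styles(markup: str) -> str:
    """Return the markup with style, class, and lang attributes removed."""
    parser = AttributeStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Structural attributes such as href and src pass through untouched, so links and images keep working after the cleanup.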

What's the difference between ODT and DOCX?

Both are ZIP archives of XML files with similar capabilities. DOCX is Microsoft's Open XML standard; ODT is the OASIS OpenDocument standard. They're interchangeable for most documents. DOCX has better compatibility with Microsoft Word; ODT has better compatibility with LibreOffice and is mandated by some European governments. For maximum compatibility, DOCX is the safer choice.

Why do my tables break when converting documents?

Tables are the most fragile element in document conversion because different formats implement tables differently. DOCX uses Word's table model with merged cells and precise column widths. HTML uses a flow-based table model. PDF stores tables as positioned text. Each conversion requires translating between these different table concepts, and merged cells, nested tables, and multi-page tables are where translations fail.

Is RTF still useful?

RTF is useful in two scenarios: security-sensitive environments where macro-free documents are required, and as a universal interchange format for basic formatted text. RTF can't carry VBA macros, which removes one common attack vector for untrusted documents, although it can still embed OLE objects and so is not risk-free. For everything else, DOCX is more capable and better supported.

How do I convert multiple documents at once?

For bulk conversion, LibreOffice's command line is the standard tool: 'libreoffice --headless --convert-to pdf *.docx' converts all DOCX files in the current directory to PDF. For web-based conversion, ChangeThisFile processes files individually but you can queue multiple uploads. For programmatic workflows, use Pandoc or the LibreOffice API to convert, or a library such as python-docx to read and write DOCX content directly.
