Rename any .docx file to .zip and extract it. You'll find a folder structure with XML files, a media directory for embedded images, and relationship files that tie everything together. This is not a hack or a trick — it's how the format actually works. DOCX is a ZIP container, and every word processor that reads it starts by unzipping it.

Microsoft introduced DOCX with Office 2007 (internally version 12), replacing the binary DOC format that had been the default since Word 97. The move was partly strategic (responding to pressure from the OpenDocument format) and partly practical. Binary DOC files were opaque blobs that only Microsoft's code could reliably parse. DOCX files are structured, standards-based, and often recoverable even when corrupted, because a broken XML part can frequently be fixed by hand.

This guide covers what's actually inside a DOCX file, how styles and formatting work at the XML level, what happens with macros and compatibility mode, and where things break during conversion. Whether you're converting documents between formats, troubleshooting formatting issues, or just curious about the most widely used document format on earth, this is the technical foundation you need.

Inside a DOCX File: The ZIP Structure

A DOCX file is an Open Packaging Conventions (OPC) container — a ZIP archive with a specific directory layout. When you unzip one, you'll find:

  • [Content_Types].xml — declares the MIME types for every part in the package
  • _rels/.rels — the root relationships file, pointing to the main document part
  • word/document.xml — the actual document content (paragraphs, runs, text)
  • word/styles.xml — style definitions (Normal, Heading 1, etc.)
  • word/numbering.xml — list definitions (bullets, numbered lists)
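  • word/settings.xml — document-level settings (compatibility flags, default tab stop, zoom, whether change tracking is on); page size and margins live in the section properties inside document.xml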
  • word/fontTable.xml — fonts used in the document
  • word/media/ — embedded images and other media files
  • word/_rels/document.xml.rels — relationships for the document part (images, hyperlinks, headers)
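
You can check this layout without renaming anything. The sketch below uses Python's standard zipfile module to list the parts of a package; the function name list_parts is made up for illustration, and the argument can be a file path or an open binary file.

```python
import zipfile

def list_parts(docx):
    """Return (name, compressed_size, uncompressed_size) for every part in the package."""
    with zipfile.ZipFile(docx) as zf:
        return [(info.filename, info.compress_size, info.file_size)
                for info in zf.infolist()]
```

Calling list_parts("report.docx") on a typical file should show the parts listed above, plus extras such as word/theme/theme1.xml and docProps/core.xml.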

Each XML file uses the WordprocessingML namespace (http://schemas.openxmlformats.org/wordprocessingml/2006/main, typically prefixed w:). The document content in document.xml is a sequence of <w:p> (paragraph) elements, each containing <w:r> (run) elements with text and formatting. A "run" is a contiguous span of text with the same formatting — if you bold one word in a sentence, that sentence becomes three runs: normal text, bold text, normal text.
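
To see the run structure directly, here is a minimal sketch that reads word/document.xml and reports each paragraph's runs along with whether the run is bold. It is deliberately simplified: it only looks at runs that are direct children of a paragraph (runs nested in hyperlinks or revision markup are skipped) and only at bold applied as direct formatting, not bold inherited from a style. The function name paragraph_runs is illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def paragraph_runs(docx):
    """Return one list per paragraph; each item is (text, is_bold) for a run."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        runs = []
        for r in p.findall("w:r", NS):  # direct child runs only
            text = "".join(t.text or "" for t in r.findall("w:t", NS))
            b = r.find("w:rPr/w:b", NS)
            # <w:b/> with no w:val means "on"; w:val="0" or "false" means "off".
            bold = b is not None and b.get(f"{{{W}}}val", "true") not in ("0", "false")
            runs.append((text, bold))
        paragraphs.append(runs)
    return paragraphs
```

A sentence with one bolded word comes back as three tuples, matching the three-run split described above.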

The OOXML Standard (ECMA-376 / ISO 29500)

DOCX implements Office Open XML (OOXML), standardized as ECMA-376 in 2006 and as ISO/IEC 29500 in 2008. The standardization was controversial: critics argued the spec was too large (6,000+ pages), too Microsoft-specific, and rubber-stamped through ISO's fast-track process. The debate produced two conformance classes: Strict (the ISO-clean version) and Transitional (which permits legacy Microsoft features). In practice, virtually every DOCX file in the wild uses Transitional conformance, because that's what Microsoft Office produces by default.

The practical impact: if a tool claims "OOXML support," ask whether it supports Transitional or Strict. LibreOffice handles both reasonably well. Google Docs supports a subset of Transitional. Smaller tools often implement just enough of the spec to handle simple documents and fail on complex ones.

How Styles and Formatting Actually Work

DOCX formatting operates on a cascade, similar to CSS but with different precedence rules. At the base are the document defaults (defined in styles.xml), followed by table style properties, then numbering properties, then the paragraph style, then the character style, and finally direct formatting (run properties applied inline). Direct formatting always wins, which is why changing a style's font sometimes doesn't fix everything: direct formatting buried in individual runs still overrides the style.

Styles in styles.xml have a type (paragraph, character, table, numbering), a unique ID, a display name, and optionally a parent style they inherit from. The "Normal" style is the root of most paragraph style chains. "Heading 1" typically inherits from "Normal" with overrides for font size, spacing, and weight. If you change Normal's font, all headings that inherit from it change too — unless they have direct overrides.

This inheritance model is why documents sometimes behave unexpectedly after conversion. A DOCX file might define Heading 1 as "inherit from Normal, override font size to 16pt." If Normal uses Calibri but the conversion tool substitutes Arial, the heading also changes because it inherited the font from Normal. The heading's definition didn't mention a font at all — it relied on inheritance.
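
The inheritance lookup can be sketched in a few lines. The function below, whose name resolve_run_property is made up for illustration, walks the basedOn chain in styles.xml and reports which style actually supplies a run property. It is a simplification under stated assumptions: it ignores table styles, numbering, linked character styles, theme fonts, and direct formatting, and it only falls back to the document defaults.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def resolve_run_property(styles_xml, style_id, prop, attr="val"):
    """Return (value, source) for a run property such as 'sz' or 'rFonts'.

    Walks style_id -> basedOn -> ... and then falls back to docDefaults.
    """
    root = ET.fromstring(styles_xml)
    styles = {s.get(f"{{{W}}}styleId"): s for s in root.findall("w:style", NS)}
    seen = set()  # guards against a cyclic basedOn chain
    while style_id in styles and style_id not in seen:
        seen.add(style_id)
        style = styles[style_id]
        el = style.find(f"w:rPr/w:{prop}", NS)
        if el is not None and el.get(f"{{{W}}}{attr}") is not None:
            return el.get(f"{{{W}}}{attr}"), style_id
        based_on = style.find("w:basedOn", NS)
        style_id = based_on.get(f"{{{W}}}val") if based_on is not None else None
    el = root.find(f"w:docDefaults/w:rPrDefault/w:rPr/w:{prop}", NS)
    if el is not None and el.get(f"{{{W}}}{attr}") is not None:
        return el.get(f"{{{W}}}{attr}"), "docDefaults"
    return None, None
```

For the Heading 1 example above, asking for "sz" returns a value from Heading1 itself (w:sz is in half-points, so 32 means 16pt), while asking for "rFonts" with attr="ascii" returns the font from Normal.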

Themes and Font Resolution

Modern DOCX files use theme fonts rather than explicit font names. In the XML, you'll see <w:rFonts w:asciiTheme="minorHAnsi"/> instead of <w:rFonts w:ascii="Calibri"/>. The actual font name is resolved at render time by looking up the theme definition in word/theme/theme1.xml. The default Office theme that shipped with Word 2013 through 2019 maps "minorHAnsi" to Calibri and "majorHAnsi" to Calibri Light; the Word 2007 and 2010 theme used Cambria for the major (heading) font.

This indirection causes problems during conversion. If the conversion tool doesn't resolve theme fonts before processing, text inherits a fallback font instead of the intended one. LibreOffice handles theme resolution well as of version 7.x. Older tools and lightweight parsers often miss it, producing documents where every font is Times New Roman because the theme lookup failed.
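
A rough sketch of the lookup a converter has to perform: given the contents of word/theme/theme1.xml and a theme value such as "minorHAnsi", find the typeface in the theme's font scheme. The function name resolve_theme_font is hypothetical, and the mapping is simplified: East Asian values read the ea entry, Bidi values read the cs entry, everything else reads latin, and per-script font overrides are ignored.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"

def resolve_theme_font(theme_xml, theme_value):
    """Map a w:asciiTheme-style value (e.g. 'minorHAnsi') to a typeface name, or None."""
    root = ET.fromstring(theme_xml)
    group = "majorFont" if theme_value.startswith("major") else "minorFont"
    suffix = theme_value[5:]  # "major" and "minor" are both five characters
    script = {"EastAsia": "ea", "Bidi": "cs"}.get(suffix, "latin")
    el = root.find(f".//a:fontScheme/a:{group}/a:{script}", {"a": A})
    if el is None:
        return None
    # An empty typeface means the theme leaves this script unspecified.
    return el.get("typeface") or None
```

A parser that skips this step has only the theme value to go on, which is how the Times New Roman fallback described above comes about.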

Track Changes and Comments at the XML Level

Track changes in DOCX work by wrapping modified content in revision markup elements. An insertion is wrapped in <w:ins>, a deletion in <w:del>, and a format change in <w:rPrChange>. Each revision element carries an author, date, and unique revision ID. The original and modified content both exist in the XML simultaneously — accepting or rejecting changes removes one version and unwraps the other.
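
Because both versions sit in the XML, revisions are easy to audit. The sketch below lists insertions and deletions with their author and date, and approximates "accept all" by collecting only <w:t> text (deleted text is stored in <w:delText>, so it drops out). The function names are illustrative, and the approach ignores moves, formatting changes, and paragraph boundaries.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Inserted text lives in <w:t>; deleted text lives in <w:delText>.
TEXT_TAG = {f"{{{W}}}ins": f"{{{W}}}t", f"{{{W}}}del": f"{{{W}}}delText"}

def list_revisions(document_xml):
    """Return (kind, author, date, text) for each <w:ins>/<w:del>, in document order."""
    root = ET.fromstring(document_xml)
    revisions = []
    for el in root.iter():
        if el.tag in TEXT_TAG:
            text = "".join(t.text or "" for t in el.iter(TEXT_TAG[el.tag]))
            kind = el.tag.rsplit("}", 1)[1]  # strip the namespace: "ins" or "del"
            revisions.append((kind, el.get(f"{{{W}}}author"),
                              el.get(f"{{{W}}}date"), text))
    return revisions

def accepted_text(document_xml):
    """Approximate the text after accepting all insertions and deletions."""
    root = ET.fromstring(document_xml)
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))
```

Running list_revisions on a document before sending it out is a quick way to find out whether old wording and reviewer names are still embedded in it.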

This has practical implications for conversion and privacy. When you convert a DOCX to PDF with track changes active, the conversion tool must decide whether to show markup or accept all changes first. LibreOffice accepts all changes by default before rendering to PDF. Some tools render the markup visually, producing a PDF with red strikethroughs and blue insertions — not what you wanted if you're submitting a final document.

Comments use a similar mechanism: a <w:commentRangeStart> and <w:commentRangeEnd> pair marks the commented text in document.xml, and the comment content is stored in word/comments.xml. When you convert to a format that can't carry comments (plain text) or with a tool that doesn't export them (many PDF and HTML converters), the comments are silently dropped. Always check for and resolve comments before converting if they contain important information.
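
A pre-conversion check for comments might look like the sketch below. It assumes the comments part sits at the conventional path word/comments.xml (strictly, the location is defined by the document's relationships file), and the function name read_comments is illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def read_comments(docx):
    """Return (id, author, text) for each comment, or [] if there is no comments part."""
    with zipfile.ZipFile(docx) as zf:
        if "word/comments.xml" not in zf.namelist():
            return []
        root = ET.fromstring(zf.read("word/comments.xml"))
    comments = []
    for c in root.findall("w:comment", NS):
        text = "".join(t.text or "" for t in c.iter(f"{{{W}}}t"))
        comments.append((c.get(f"{{{W}}}id"), c.get(f"{{{W}}}author"), text))
    return comments
```

An empty list means there is nothing to lose; anything else is worth reviewing before the conversion runs.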

Macros: DOCX vs. DOCM

DOCX files cannot contain macros. This was a deliberate security decision by Microsoft: Word will not save VBA code into a .docx file. If a document contains VBA macros, it must be saved as .docm (macro-enabled document). Under the hood, the main difference is one extra part, word/vbaProject.bin, which holds the VBA project in the old binary format (yes, even in an otherwise XML-based container), along with a macro-enabled content type for the main document part in [Content_Types].xml.

This matters for conversion because macros don't survive format changes. Converting DOCM to PDF, ODT, or any non-Microsoft format drops the macro code entirely. The macros aren't converted or translated — they're binary Microsoft VBA that only Microsoft Office can execute. If a document relies on macros for functionality (auto-calculated fields, custom formatting), the converted output will be missing those computed values.

One security note: if you receive a .docx file that asks you to "enable macros," the file has been renamed from .docm to .docx to bypass security warnings. Modern versions of Office will still block the macros because they detect the VBA binary inside the ZIP regardless of the file extension.
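
Since the extension can be changed but the package contents cannot hide, a simple check is to look inside the ZIP for the VBA part. This is a minimal sketch with a made-up function name; it only detects an embedded vbaProject.bin and says nothing about other ways a document can load code, such as an attached template.

```python
import zipfile

def contains_vba(docx):
    """Return True if the package contains a vbaProject.bin part, whatever its extension."""
    with zipfile.ZipFile(docx) as zf:
        return any(name.lower().endswith("vbaproject.bin") for name in zf.namelist())
```

A file named invoice.docx for which contains_vba returns True deserves suspicion.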

Compatibility Mode: When Word Holds Itself Back

Open a DOCX file created in Word 2007 with Word 2024, and you'll see "[Compatibility Mode]" in the title bar. This means Word is deliberately disabling features introduced after the document's original version. The specific restrictions are controlled by the <w:compat> element in settings.xml, which lists compatibility flags like <w:compatSetting w:name="compatibilityMode" w:uri="http://schemas.microsoft.com/office/word" w:val="15"/> (the value corresponds to the Office version: 12=2007, 14=2010, 15=2013+).
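
To find out which mode a document is in without opening Word, read the flag from word/settings.xml. The sketch below takes the XML content of that part and returns the numeric value, or None when the setting is absent (which is typical of files written by older versions). The function name is illustrative.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Version labels as given in the text above.
MODE_LABELS = {12: "Word 2007", 14: "Word 2010", 15: "Word 2013 or later"}

def compatibility_mode(settings_xml):
    """Return the compatibilityMode value from settings.xml as an int, or None if absent."""
    root = ET.fromstring(settings_xml)
    for setting in root.findall("w:compat/w:compatSetting", NS):
        if setting.get(f"{{{W}}}name") == "compatibilityMode":
            return int(setting.get(f"{{{W}}}val"))
    return None
```

MODE_LABELS.get(compatibility_mode(xml), "unknown") turns the number into a readable label.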

Compatibility mode affects layout in subtle ways. Newer versions of Word refined text measurement, paragraph spacing, and table cell sizing. A document in compatibility mode renders using the old measurements, so line breaks and page breaks appear in the same places as the original version. Converting to a non-Microsoft format ignores compatibility mode entirely — LibreOffice and Google Docs use their own text measurement, which is why a document that looks perfect in Word might have slightly different line breaks in LibreOffice.

To exit compatibility mode: File > Info > Convert. This updates the compatibility version and may reflow the document. It's a one-way operation — you can't go back to the old compatibility level without manually editing settings.xml.

Font Embedding in DOCX

DOCX supports embedding fonts directly in the file, stored as .odttf (obfuscated TrueType font) files in the word/fonts/ directory. The obfuscation is trivial — it XORs the first 32 bytes with a GUID — and exists to satisfy font licensing requirements rather than provide real protection. Not all fonts permit embedding; the font's OS/2 table contains an fsType field that declares embedding permissions.
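
The XOR step is small enough to show in full. The sketch below assumes the key is the GUID from the font's w:fontKey attribute in fontTable.xml, stripped of braces and hyphens and used in reversed byte order, which is how open-source readers appear to handle it; treat the byte order as an assumption to verify against the spec. The function name is made up.

```python
def deobfuscate_odttf(data: bytes, font_key: str) -> bytes:
    """XOR the first 32 bytes of an .odttf file with the 16-byte GUID key.

    font_key looks like "{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}".
    Assumption: the key bytes are the GUID's hex digits read in reverse byte order.
    """
    hex_digits = font_key.strip("{}").replace("-", "")
    key = bytes.fromhex(hex_digits)[::-1]
    if len(key) != 16:
        raise ValueError("font key must contain 32 hex digits")
    # The 16-byte key is applied twice: to bytes 0-15 and again to bytes 16-31.
    head = bytes(b ^ key[i % 16] for i, b in enumerate(data[:32]))
    return head + data[32:]
```

Because XOR is its own inverse, the same function obfuscates and de-obfuscates. A correctly recovered TrueType file usually starts with the bytes 00 01 00 00, which makes a handy sanity check.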

Embedding is disabled by default in Word. Enable it via File > Options > Save > "Embed fonts in the file." You can choose to embed only characters used in the document (subsetting), which significantly reduces file size. A full Calibri font file is about 800KB; a subset containing only the 150 characters in a typical business letter might be 50KB.

Font embedding matters for conversion fidelity. When you convert DOCX to PDF on a system that doesn't have the document's fonts installed, the conversion tool uses embedded fonts if present, or substitutes if not. LibreOffice on Linux typically lacks Calibri, Cambria, and other Microsoft-proprietary fonts. If the DOCX embeds them, conversion can reproduce the original layout closely. If not, you get substitutes such as Carlito (designed to match Calibri's metrics) or Liberation Sans, whose metrics differ from Calibri's enough to move line breaks.

DOCX Compared to Other Document Formats

DOCX vs. DOC: DOC is a binary format from Word 97, a proprietary blob whose specification Microsoft published only in 2008, under EU pressure. DOCX is smaller (ZIP compression, typically 50-75% smaller), inspectable (XML text), repairable, and standards-based. There is no technical reason to use DOC for new documents. The only valid reason is compatibility with truly ancient systems that can't open DOCX, and even Office 2003 can open DOCX with the compatibility pack.

DOCX vs. ODT: ODT (OpenDocument Text) is the ISO/IEC 26300 standard, used by LibreOffice. Structurally similar to DOCX — both are ZIP containers with XML content. ODT uses different XML namespaces and a different style model. Simple documents convert between DOCX and ODT with 95%+ fidelity. Complex documents with track changes, equations, SmartArt, or advanced table features lose formatting in the conversion. Choose DOCX for Microsoft Office environments, ODT for vendor independence.

DOCX vs. PDF: DOCX is an editing format; PDF is a distribution format. DOCX to PDF conversion is reliable and near-lossless. PDF to DOCX is lossy because PDF doesn't store document structure. Use DOCX for documents you'll edit; use PDF for documents you'll share as-is.

DOCX vs. RTF: RTF supports basic formatting (bold, italic, fonts, tables) but not advanced features (track changes, themes, embedded objects, SmartArt). RTF is close to universal: virtually every word processor can open it, including the basic editors bundled with operating systems. DOCX is feature-rich but needs software that implements far more of a much larger spec. For maximum compatibility with minimal formatting needs, convert DOCX to RTF.

Converting DOCX: What Survives and What Doesn't

DOCX is among the most convertible document formats because it carries semantic structure that conversion tools can interpret. But not everything survives every conversion.

To PDF (/docx-to-pdf): Near-perfect. Text, formatting, images, tables, headers/footers, page numbers all survive. Main risk: font substitution on systems lacking the original fonts. Resolution: embed fonts before converting, or install Microsoft core fonts on the conversion system.

To ODT (/docx-to-odt): Good for simple documents. Paragraph styles, basic formatting, images, and tables convert well. Complex features that break: equations (OMML vs. MathML), SmartArt, advanced numbering, conditional text. Track changes convert with varying accuracy.

To HTML (/docx-to-html): Structure converts well (headings, lists, bold/italic, links). Visual layout does not (precise spacing, page margins, columns, headers/footers are discarded or approximated). Embedded images are typically base64-encoded inline or extracted to a media directory.

To Markdown (/docx-to-md): Headings, bold, italic, links, and images survive. Tables convert to Markdown table syntax. Everything else is lost: page layout, fonts, colors, precise spacing. Markdown is intentionally minimal.

To plain text (/docx-to-txt): Only text content survives. All formatting, images, tables (as structure), and metadata are discarded. Useful for text extraction, indexing, and accessibility fallbacks.

Troubleshooting Common DOCX Problems

"The file is corrupted and cannot be opened": DOCX is a ZIP file. If the ZIP structure is damaged, Word can't open it. Try: rename to .zip, extract, check for malformed XML in word/document.xml. Common causes: incomplete download (file was truncated), email attachment encoding error, or disk corruption. If the XML is intact but a media file is corrupted, you can delete the broken media reference from document.xml.rels and repackage the ZIP.
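
The manual inspection can be partly automated. The sketch below opens the package, runs the ZIP's own CRC check, and tries to parse every .xml and .rels part, collecting whatever fails. The function name is illustrative. If the archive is truncated so badly that its central directory is gone, zipfile.ZipFile itself raises BadZipFile before any of this runs, which is a diagnosis in its own right.

```python
import zipfile
import xml.etree.ElementTree as ET

def find_damaged_parts(docx):
    """Return a list of (part_name, problem) for parts that fail CRC or XML parsing."""
    problems = []
    with zipfile.ZipFile(docx) as zf:
        bad = zf.testzip()  # name of the first part with a bad CRC, or None
        if bad is not None:
            problems.append((bad, "CRC check failed"))
        for name in zf.namelist():
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                ET.fromstring(zf.read(name))
            except (ET.ParseError, zipfile.BadZipFile) as exc:
                problems.append((name, f"{type(exc).__name__}: {exc}"))
    return problems
```

A ParseError message includes the line and column of the first problem, which tells you where to look in the extracted file.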

Formatting looks different on another computer: Almost always a font issue. The document uses a font that's installed on the creator's machine but not on the reader's. Word substitutes another font, and because its character widths differ, line breaks and spacing shift. Fix: embed fonts in the document before sharing, or use standard fonts (Arial, Times New Roman) that are installed on nearly every Windows and macOS system.

File is unexpectedly large: Check word/media/ for oversized images. A common pattern: a user pastes a 20MP photo and Word embeds the full-resolution image. The document displays it at 3 inches wide but stores the entire 15MB image. Compress images in Word (select a picture, then Picture Format > Compress Pictures) or extract, resize, and re-embed manually.
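
Finding the culprits takes one pass over the archive. This sketch, with a made-up function name, lists the largest files under word/media/ by uncompressed size.

```python
import zipfile

def largest_media(docx, top=5):
    """Return up to `top` (size_in_bytes, part_name) pairs from word/media/, largest first."""
    with zipfile.ZipFile(docx) as zf:
        media = [(info.file_size, info.filename)
                 for info in zf.infolist()
                 if info.filename.startswith("word/media/")]
    return sorted(media, reverse=True)[:top]
```

One or two multi-megabyte entries at the top of the list usually account for most of the file size.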

Track changes won't fully accept: Sometimes track changes involve structural revisions (section breaks, table modifications) that Word's accept/reject mechanism handles poorly. Open the document.xml in a text editor, search for <w:ins> and <w:del> elements, and manually clean up orphaned revision markup. This is last-resort surgery, but it works when Word's UI gives up.

DOCX is the dominant document format for good reasons: it's a well-structured container that carries both content and presentation, it's backed by a published standard (however imperfect), and it's universally supported. Understanding that it's a ZIP of XML files transforms it from a black box into something inspectable, debuggable, and repairable.

For conversion workflows, DOCX is the best starting point. It carries enough semantic information (headings, lists, tables, styles) for conversion tools to produce reasonable output in almost any target format. Converting to DOCX from less structured formats (like PDF) is where quality drops — not because of any DOCX limitation, but because the source format didn't preserve the structure DOCX expects.