Every XLSX file you've ever opened is a ZIP archive. Rename any .xlsx to .zip, extract it, and you'll find a folder of XML files: one per worksheet, one for shared strings, one for styles, one for the workbook structure. This isn't trivia. It's the key to understanding why XLSX files behave the way they do during conversion, why they sometimes balloon to enormous sizes, and why certain features vanish when you export to CSV or PDF.
Microsoft introduced XLSX in Office 2007 as part of the Office Open XML (OOXML) standard, replacing the proprietary binary XLS format that had been the default since 1997. The format was designed for interoperability, transparency, and extensibility. Two decades later, XLSX is the de facto standard for spreadsheet interchange: Google Sheets exports it, LibreOffice reads and writes it, every data tool on earth can parse it.
But "standard" doesn't mean "simple." The ECMA-376 specification that defines OOXML is over 6,000 pages long. This guide focuses on what actually matters: the architecture inside the ZIP, the parts that affect your data, and the parts that break during conversion.
Inside the ZIP: What's Actually in an XLSX File
Unzip any XLSX file and you'll find a directory structure like this:
my_spreadsheet.xlsx/
├── [Content_Types].xml
├── _rels/
│ └── .rels
├── docProps/
│ ├── app.xml (application metadata)
│ └── core.xml (author, created date, modified date)
└── xl/
├── workbook.xml (sheet names, defined names, calc settings)
├── sharedStrings.xml (all unique text strings)
├── styles.xml (number formats, fonts, fills, borders)
├── theme/
│ └── theme1.xml (color palette, font scheme)
├── worksheets/
│ ├── sheet1.xml (cell data for Sheet1)
│ └── sheet2.xml (cell data for Sheet2)
├── charts/ (if any charts exist)
├── drawings/ (images, shapes)
├── tables/ (structured table definitions)
└── pivotTables/ (pivot table cache + definitions)Each XML file has a specific role. The workbook.xml is the table of contents. The worksheets/sheet1.xml contains the actual cell data. The styles.xml defines every visual format. The sharedStrings.xml deduplicates text. Relationships between these files are defined in .rels files using OPC (Open Packaging Convention) relationships.
The Shared Strings Table: Why It Exists and Why It Matters
If you have a spreadsheet with 100,000 rows and column B contains the country name "United States" in 40,000 of them, XLSX doesn't store "United States" 40,000 times. Instead, sharedStrings.xml stores each unique string once and assigns it an index. The worksheet XML references strings by index: <c t="s"><v>7</v></c> means "the string at index 7 in the shared strings table."
This drastically reduces file size for data with repeating text values. A 50MB naive XML representation might compress to 2MB thanks to shared strings plus ZIP compression. But it also means parsing an XLSX file requires loading the shared strings table first, then resolving every string reference. Libraries like SheetJS, openpyxl, and Apache POI all handle this transparently, but it's the reason XLSX parsing uses more memory than you'd expect from the file size.
Cell Types and Storage
Each cell in a worksheet XML has a type attribute (t) that determines how its value (v) is interpreted:
t="s"— String (index into shared strings table)t="n"or no type attribute — Number (stored as a decimal)t="b"— Boolean (0 or 1)t="e"— Error (#REF!, #VALUE!, #N/A, etc.)t="str"— Inline string (not in shared strings table)t="d"— Date (ISO 8601, rarely used)
Here's what trips people up: dates are almost always stored as numbers. The date January 15, 2026 is stored as the serial number 46038 (days since January 1, 1900). The cell's style tells Excel to display it as a date. When you convert XLSX to CSV, the conversion library must read the style information to decide whether to output 46038 or 2026-01-15. If the library ignores styles, your dates become mysterious five-digit numbers.
There's also the infamous 1900 date bug: Excel treats 1900 as a leap year (it wasn't), so date serial numbers are off by one day for dates before March 1, 1900. This bug is preserved in the XLSX spec for backward compatibility with Lotus 1-2-3.
Styles, Conditional Formatting, and Visual Features
The styles.xml file contains every number format, font, fill color, border style, and alignment used in the workbook. Cells reference styles by index. A typical workbook might have 50-200 style definitions, even if it looks simple — Excel generates styles liberally.
Conditional formatting rules are stored per-worksheet in the sheet XML, not in the styles file. Each rule specifies a range, a condition (greater than, between, contains text, etc.), and a differential format (what to change when the condition is true). Complex conditional formatting with custom formulas is fully preserved in XLSX but is one of the first things to break in cross-platform conversion. LibreOffice handles most conditions, Google Sheets handles fewer, and XLSX to PDF conversion preserves the visual result but not the dynamic behavior.
Charts, Images, and Drawings
Charts in XLSX are stored as separate XML files in the xl/charts/ directory. Each chart XML defines the chart type, data ranges, axis settings, colors, and labels. Charts are complex objects: a single chart can reference multiple worksheets, use dynamic named ranges, and contain custom formatting on individual data points.
When converting XLSX, charts are the first casualty. XLSX to CSV drops them entirely (CSV has no concept of charts). XLSX to PDF renders them as static images. XLSX to ODS preserves most chart types but may alter appearance. XLSX to HTML typically renders charts as images or drops them.
Images and shapes are stored in the xl/drawings/ directory as DrawingML XML. Embedded images are in xl/media/. These are the primary reason XLSX files get large — a few high-resolution images can inflate a file from kilobytes to megabytes.
Formulas and Calculation Chains
Formulas in XLSX are stored as strings in the cell's <f> element: <f>SUM(A1:A100)</f>. The cell also stores the last calculated value in the <v> element, so applications can display results without recalculating. This cached value is what you get when you convert XLSX to CSV — the formulas are gone, replaced by their last computed results.
The calcChain.xml file defines the order in which cells should be recalculated. This matters for performance: a workbook with 100,000 formulas needs to recalculate in dependency order (calculate A1 before B1 if B1 references A1). The calculation chain is regenerated when you open the workbook, so it's safe to delete it — Excel will rebuild it. Some XLSX optimization tools strip the calculation chain to reduce file size.
Array Formulas and Dynamic Arrays
Legacy array formulas (entered with Ctrl+Shift+Enter, shown with curly braces) are stored with a t="array" attribute and a ref attribute defining the output range. Modern dynamic array formulas (SORT, FILTER, UNIQUE, SEQUENCE — introduced in Excel 365) use a different mechanism: the formula is in one cell, and a <xm:f> element with the cm (connection metadata) attribute links the spill range.
Dynamic arrays are the biggest interoperability gap in modern spreadsheets. LibreOffice doesn't support SORT, FILTER, or UNIQUE as of version 24.8. Google Sheets has its own implementations that behave differently. When you convert XLSX to ODS, dynamic array formulas become static values or errors. This is a one-way door: converting back won't restore the formulas.
Macros and the XLSM/XLTM Split
XLSX files cannot contain macros. This is a deliberate security decision. When Microsoft introduced the Open XML formats, they split macro-enabled workbooks into a separate extension: .xlsm. The internal structure is identical to XLSX, but XLSM files include a xl/vbaProject.bin file containing compiled VBA (Visual Basic for Applications) code.
This split exists because macros are a security risk. VBA has full access to the Windows API, file system, and network. Macro viruses in Excel files were a major attack vector in the late 1990s and early 2000s. By making macros impossible in .xlsx, Microsoft ensured that most spreadsheets exchanged via email are safe to open. If someone sends you a .xlsm file, you know it contains code and can decide whether to trust it.
Other macro-adjacent files: .xltx (template, no macros), .xltm (template with macros), .xlam (add-in with macros), and .xlsb (binary workbook, which is XLSX's binary cousin — faster to load for very large files but not human-readable).
Pivot Tables and Data Connections
Pivot tables in XLSX are stored in two parts: the pivot cache (xl/pivotCache/) and the pivot table definition (xl/pivotTables/). The cache stores a snapshot of the source data at the time of the last refresh. The definition stores the layout: which fields are in rows, columns, values, and filters, plus aggregation functions and formatting.
This dual storage means the pivot table's visual output is independent of the source data. You can delete the source worksheet and the pivot table still displays its cached data. But the pivot can't be refreshed without the source. When converting XLSX to other formats, the pivot table's displayed values survive (they're just cells with values), but the pivot functionality doesn't. Converting to CSV flattens the pivot to its current state. Converting to ODS partially preserves the pivot structure, but refreshing may not work correctly.
The 1,048,576 Row Limit and Other Boundaries
XLSX supports 1,048,576 rows (220) and 16,384 columns (214, columns A through XFD) per worksheet. These limits are hardcoded in the specification and haven't changed since 2007.
In practice, you'll hit memory limits before row limits. A worksheet with 1 million rows and 20 columns of data might be 200MB as an XLSX file but require 2-4GB of RAM to open in Excel. Google Sheets has a separate limit of 10 million cells per workbook (not per sheet), so a single sheet can't exceed about 10 million cells total.
Other limits worth knowing: 255 characters in a sheet name, 32,767 characters per cell, 64,000 conditional formatting rules per sheet, 4,096 custom number formats, and 1,024 page breaks per sheet. The 32,767 character cell limit occasionally bites people who store long text strings or serialized data in cells.
For datasets that exceed these limits, you need to move beyond spreadsheet formats entirely. Convert CSV to Parquet for columnar storage that handles billions of rows, or CSV to Arrow for fast in-memory analytics.
Why XLSX Files Get Huge (and How to Fix It)
XLSX files bloat for specific, identifiable reasons:
- Embedded images. A single high-resolution photo can be 5-10MB. XLSX stores images uncompressed inside the ZIP. Ten product photos = 50MB+ workbook. Fix: resize images before inserting, or link to external images instead of embedding.
- Unused styles. Copy-pasting from multiple sources accumulates unique style definitions. A workbook with 10,000 unique cell formats creates a massive
styles.xml. Fix: "Clear All Formatting" on ranges that don't need specific formatting. - Pivot cache bloat. Each pivot table stores a complete copy of its source data in the cache. Three pivot tables from the same 100,000-row source = three copies of that data. Fix: share a single pivot cache across multiple pivot tables.
- Volatile functions. Formulas using NOW(), TODAY(), INDIRECT(), or OFFSET() force full recalculation on every change, and the calculation chain stores dependency information for every affected cell. Thousands of INDIRECT() formulas create enormous
calcChain.xmlfiles. - Phantom data. Deleting content from cells doesn't always remove their style definitions. A sheet where you once had data in row 1,000,000 still has those rows "used" in the XML, even if they appear empty. Fix: select the empty rows, right-click, Delete (not Clear Contents).
A quick diagnostic: rename the file to .zip, extract it, and check the sizes of individual XML files. If sharedStrings.xml is 50MB, you have massive text data. If xl/media/ is huge, it's images. If a sheet XML is large, check for phantom rows or excessive formulas.
Converting XLSX: What Survives and What Doesn't
Understanding the internal structure tells you exactly what survives each conversion:
| Target Format | Preserves | Loses |
|---|---|---|
| CSV | Cell values (one sheet only) | Formulas, styles, charts, images, multiple sheets, data types |
| JSON | Cell values, data types, multiple sheets | Formulas, styles, charts, images |
| TSV | Cell values (one sheet only) | Same as CSV |
| Visual appearance (static snapshot) | Formulas, interactivity, editability | |
| ODS | Data, most formulas, basic formatting | Complex conditional formatting, dynamic arrays, some chart types |
| HTML | Table structure, basic formatting | Formulas, charts (or rasterized), interactivity |
| XLS (legacy) | Data, most formulas, styles | Features beyond XLS limits (>65K rows, >256 cols) |
XLSX is a remarkably well-designed format once you understand its internals. The ZIP-of-XML architecture makes it inspectable, debuggable, and programmatically manipulable in ways that the old binary XLS format never was. You can write a Python script to modify cell values directly in the XML without any spreadsheet library.
The practical takeaway: when converting XLSX to simpler formats, you're always discarding layers. CSV strips everything except values. JSON strips formatting but keeps structure. PDF preserves appearance but freezes the content. Knowing what each layer contains — shared strings, styles, formulas, charts, pivot caches — lets you predict exactly what will survive the conversion and what won't.