General

XLSX: Inside Excel's Modern Spreadsheet Format

Published Mar 19, 2026 8 min read By ChangeThisFile Team

Quick Answer

XLSX is a ZIP archive containing XML files that define worksheets, styles, shared strings, and metadata. It replaced the binary XLS format in Office 2007 with the Open XML standard, supporting up to 1,048,576 rows and 16,384 columns per sheet. Understanding its internals explains why files bloat, why some features survive format conversion, and why .xlsm exists separately for macros.

Every XLSX file you've ever opened is a ZIP archive. Rename any .xlsx to .zip, extract it, and you'll find a folder of XML files: one per worksheet, one for shared strings, one for styles, one for the workbook structure. This isn't trivia. It's the key to understanding why XLSX files behave the way they do during conversion, why they sometimes balloon to enormous sizes, and why certain features vanish when you export to CSV or PDF.

Microsoft introduced XLSX in Office 2007 as part of the Office Open XML (OOXML) standard, replacing the proprietary binary XLS format that had been the default since 1997. The format was designed for interoperability, transparency, and extensibility. Two decades later, XLSX is the de facto standard for spreadsheet interchange: Google Sheets exports it, LibreOffice reads and writes it, every data tool on earth can parse it.

But "standard" doesn't mean "simple." The ECMA-376 specification that defines OOXML is over 6,000 pages long. This guide focuses on what actually matters: the architecture inside the ZIP, the parts that affect your data, and the parts that break during conversion.

Inside the ZIP: What's Actually in an XLSX File

Unzip any XLSX file and you'll find a directory structure like this:

my_spreadsheet.xlsx/
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── docProps/
│   ├── app.xml          (application metadata)
│   └── core.xml         (author, created date, modified date)
└── xl/
    ├── workbook.xml     (sheet names, defined names, calc settings)
    ├── sharedStrings.xml (all unique text strings)
    ├── styles.xml       (number formats, fonts, fills, borders)
    ├── theme/
    │   └── theme1.xml   (color palette, font scheme)
    ├── worksheets/
    │   ├── sheet1.xml   (cell data for Sheet1)
    │   └── sheet2.xml   (cell data for Sheet2)
    ├── charts/          (if any charts exist)
    ├── drawings/        (images, shapes)
    ├── tables/          (structured table definitions)
    └── pivotTables/     (pivot table cache + definitions)

Each XML file has a specific role. The workbook.xml is the table of contents. The worksheets/sheet1.xml contains the actual cell data. The styles.xml defines every visual format. The sharedStrings.xml deduplicates text. Relationships between these files are defined in .rels files using OPC (Open Packaging Convention) relationships.

The Shared Strings Table: Why It Exists and Why It Matters

If you have a spreadsheet with 100,000 rows and column B contains the country name "United States" in 40,000 of them, XLSX doesn't store "United States" 40,000 times. Instead, sharedStrings.xml stores each unique string once and assigns it an index. The worksheet XML references strings by index: <c t="s"><v>7</v></c> means "the string at index 7 in the shared strings table."

This drastically reduces file size for data with repeating text values. A 50MB naive XML representation might compress to 2MB thanks to shared strings plus ZIP compression. But it also means parsing an XLSX file requires loading the shared strings table first, then resolving every string reference. Libraries like SheetJS, openpyxl, and Apache POI all handle this transparently, but it's the reason XLSX parsing uses more memory than you'd expect from the file size.

Cell Types and Storage

Each cell in a worksheet XML has a type attribute (t) that determines how its value (v) is interpreted:

t="s" — String (index into shared strings table)
t="n" or no type attribute — Number (stored as a decimal)
t="b" — Boolean (0 or 1)
t="e" — Error (#REF!, #VALUE!, #N/A, etc.)
t="str" — Inline string (not in shared strings table)
t="d" — Date (ISO 8601, rarely used)

Here's what trips people up: dates are almost always stored as numbers. The date January 15, 2026 is stored as the serial number 46038 (days since January 1, 1900). The cell's style tells Excel to display it as a date. When you convert XLSX to CSV, the conversion library must read the style information to decide whether to output 46038 or 2026-01-15. If the library ignores styles, your dates become mysterious five-digit numbers.

There's also the infamous 1900 date bug: Excel treats 1900 as a leap year (it wasn't), so date serial numbers are off by one day for dates before March 1, 1900. This bug is preserved in the XLSX spec for backward compatibility with Lotus 1-2-3.

Styles, Conditional Formatting, and Visual Features

The styles.xml file contains every number format, font, fill color, border style, and alignment used in the workbook. Cells reference styles by index. A typical workbook might have 50-200 style definitions, even if it looks simple — Excel generates styles liberally.

Conditional formatting rules are stored per-worksheet in the sheet XML, not in the styles file. Each rule specifies a range, a condition (greater than, between, contains text, etc.), and a differential format (what to change when the condition is true). Complex conditional formatting with custom formulas is fully preserved in XLSX but is one of the first things to break in cross-platform conversion. LibreOffice handles most conditions, Google Sheets handles fewer, and XLSX to PDF conversion preserves the visual result but not the dynamic behavior.

Charts, Images, and Drawings

Charts in XLSX are stored as separate XML files in the xl/charts/ directory. Each chart XML defines the chart type, data ranges, axis settings, colors, and labels. Charts are complex objects: a single chart can reference multiple worksheets, use dynamic named ranges, and contain custom formatting on individual data points.

When converting XLSX, charts are the first casualty. XLSX to CSV drops them entirely (CSV has no concept of charts). XLSX to PDF renders them as static images. XLSX to ODS preserves most chart types but may alter appearance. XLSX to HTML typically renders charts as images or drops them.

Images and shapes are stored in the xl/drawings/ directory as DrawingML XML. Embedded images are in xl/media/. These are the primary reason XLSX files get large — a few high-resolution images can inflate a file from kilobytes to megabytes.

Formulas and Calculation Chains

Formulas in XLSX are stored as strings in the cell's <f> element: <f>SUM(A1:A100)</f>. The cell also stores the last calculated value in the <v> element, so applications can display results without recalculating. This cached value is what you get when you convert XLSX to CSV — the formulas are gone, replaced by their last computed results.

The calcChain.xml file defines the order in which cells should be recalculated. This matters for performance: a workbook with 100,000 formulas needs to recalculate in dependency order (calculate A1 before B1 if B1 references A1). The calculation chain is regenerated when you open the workbook, so it's safe to delete it — Excel will rebuild it. Some XLSX optimization tools strip the calculation chain to reduce file size.

Array Formulas and Dynamic Arrays

Legacy array formulas (entered with Ctrl+Shift+Enter, shown with curly braces) are stored with a t="array" attribute and a ref attribute defining the output range. Modern dynamic array formulas (SORT, FILTER, UNIQUE, SEQUENCE — introduced in Excel 365) use a different mechanism: the formula is in one cell, and a <xm:f> element with the cm (connection metadata) attribute links the spill range.

Dynamic arrays are the biggest interoperability gap in modern spreadsheets. LibreOffice doesn't support SORT, FILTER, or UNIQUE as of version 24.8. Google Sheets has its own implementations that behave differently. When you convert XLSX to ODS, dynamic array formulas become static values or errors. This is a one-way door: converting back won't restore the formulas.

Macros and the XLSM/XLTM Split

XLSX files cannot contain macros. This is a deliberate security decision. When Microsoft introduced the Open XML formats, they split macro-enabled workbooks into a separate extension: .xlsm. The internal structure is identical to XLSX, but XLSM files include a xl/vbaProject.bin file containing compiled VBA (Visual Basic for Applications) code.

This split exists because macros are a security risk. VBA has full access to the Windows API, file system, and network. Macro viruses in Excel files were a major attack vector in the late 1990s and early 2000s. By making macros impossible in .xlsx, Microsoft ensured that most spreadsheets exchanged via email are safe to open. If someone sends you a .xlsm file, you know it contains code and can decide whether to trust it.

Other macro-adjacent files: .xltx (template, no macros), .xltm (template with macros), .xlam (add-in with macros), and .xlsb (binary workbook, which is XLSX's binary cousin — faster to load for very large files but not human-readable).

Pivot Tables and Data Connections

Pivot tables in XLSX are stored in two parts: the pivot cache (xl/pivotCache/) and the pivot table definition (xl/pivotTables/). The cache stores a snapshot of the source data at the time of the last refresh. The definition stores the layout: which fields are in rows, columns, values, and filters, plus aggregation functions and formatting.

This dual storage means the pivot table's visual output is independent of the source data. You can delete the source worksheet and the pivot table still displays its cached data. But the pivot can't be refreshed without the source. When converting XLSX to other formats, the pivot table's displayed values survive (they're just cells with values), but the pivot functionality doesn't. Converting to CSV flattens the pivot to its current state. Converting to ODS partially preserves the pivot structure, but refreshing may not work correctly.

The 1,048,576 Row Limit and Other Boundaries

XLSX supports 1,048,576 rows (2²⁰) and 16,384 columns (2¹⁴, columns A through XFD) per worksheet. These limits are hardcoded in the specification and haven't changed since 2007.

In practice, you'll hit memory limits before row limits. A worksheet with 1 million rows and 20 columns of data might be 200MB as an XLSX file but require 2-4GB of RAM to open in Excel. Google Sheets has a separate limit of 10 million cells per workbook (not per sheet), so a single sheet can't exceed about 10 million cells total.

Other limits worth knowing: 255 characters in a sheet name, 32,767 characters per cell, 64,000 conditional formatting rules per sheet, 4,096 custom number formats, and 1,024 page breaks per sheet. The 32,767 character cell limit occasionally bites people who store long text strings or serialized data in cells.

For datasets that exceed these limits, you need to move beyond spreadsheet formats entirely. Convert CSV to Parquet for columnar storage that handles billions of rows, or CSV to Arrow for fast in-memory analytics.

Why XLSX Files Get Huge (and How to Fix It)

XLSX files bloat for specific, identifiable reasons:

Embedded images. A single high-resolution photo can be 5-10MB. XLSX stores images uncompressed inside the ZIP. Ten product photos = 50MB+ workbook. Fix: resize images before inserting, or link to external images instead of embedding.
Unused styles. Copy-pasting from multiple sources accumulates unique style definitions. A workbook with 10,000 unique cell formats creates a massive styles.xml. Fix: "Clear All Formatting" on ranges that don't need specific formatting.
Pivot cache bloat. Each pivot table stores a complete copy of its source data in the cache. Three pivot tables from the same 100,000-row source = three copies of that data. Fix: share a single pivot cache across multiple pivot tables.
Volatile functions. Formulas using NOW(), TODAY(), INDIRECT(), or OFFSET() force full recalculation on every change, and the calculation chain stores dependency information for every affected cell. Thousands of INDIRECT() formulas create enormous calcChain.xml files.
Phantom data. Deleting content from cells doesn't always remove their style definitions. A sheet where you once had data in row 1,000,000 still has those rows "used" in the XML, even if they appear empty. Fix: select the empty rows, right-click, Delete (not Clear Contents).

A quick diagnostic: rename the file to .zip, extract it, and check the sizes of individual XML files. If sharedStrings.xml is 50MB, you have massive text data. If xl/media/ is huge, it's images. If a sheet XML is large, check for phantom rows or excessive formulas.

Converting XLSX: What Survives and What Doesn't

Understanding the internal structure tells you exactly what survives each conversion:

Target Format	Preserves	Loses
CSV	Cell values (one sheet only)	Formulas, styles, charts, images, multiple sheets, data types
JSON	Cell values, data types, multiple sheets	Formulas, styles, charts, images
TSV	Cell values (one sheet only)	Same as CSV
PDF	Visual appearance (static snapshot)	Formulas, interactivity, editability
ODS	Data, most formulas, basic formatting	Complex conditional formatting, dynamic arrays, some chart types
HTML	Table structure, basic formatting	Formulas, charts (or rasterized), interactivity
XLS (legacy)	Data, most formulas, styles	Features beyond XLS limits (>65K rows, >256 cols)

XLSX is a remarkably well-designed format once you understand its internals. The ZIP-of-XML architecture makes it inspectable, debuggable, and programmatically manipulable in ways that the old binary XLS format never was. You can write a Python script to modify cell values directly in the XML without any spreadsheet library.

The practical takeaway: when converting XLSX to simpler formats, you're always discarding layers. CSV strips everything except values. JSON strips formatting but keeps structure. PDF preserves appearance but freezes the content. Knowing what each layer contains — shared strings, styles, formulas, charts, pivot caches — lets you predict exactly what will survive the conversion and what won't.

Key Takeaways

XLSX is a ZIP of XML files. Rename to .zip and explore it to diagnose bloat, missing data, or conversion issues.
Dates are stored as serial numbers (days since 1900), not date strings. The cell's style determines display format. Conversion tools that ignore styles output raw numbers instead of dates.
The shared strings table deduplicates text values, dramatically reducing file size for repetitive data but increasing memory usage during parsing.
XLSX cannot contain macros — that's what .xlsm is for. This split is a security feature, not a limitation.
Dynamic array formulas (SORT, FILTER, UNIQUE) don't convert to ODS or Google Sheets. They're the biggest interoperability gap in modern spreadsheets.
File bloat comes from embedded images, unused styles, pivot cache duplication, and phantom rows. Check individual XML file sizes inside the ZIP to diagnose.

Frequently Asked Questions

What's the difference between XLSX and XLSM?

XLSX and XLSM are structurally identical — both are ZIP archives of XML files following the Open XML specification. The only difference is that XLSM includes a vbaProject.bin file containing VBA macro code. XLSX files cannot contain macros by design. This separation exists for security: you can trust that a .xlsx file won't execute code when opened. If you save a macro-enabled workbook as .xlsx, Excel strips the VBA code with a warning.

Why does my XLSX file show dates as numbers after conversion?

XLSX stores dates as serial numbers (e.g., 46038 for January 15, 2026) with a style that tells Excel to display them as dates. When converting to CSV or JSON, the conversion tool must read the style information to know that 46038 should be formatted as a date. If the tool only reads cell values without checking styles, you get raw serial numbers. Most modern conversion libraries (SheetJS, openpyxl, Apache POI) handle this correctly, but simpler parsers may not.

How big can an XLSX file get before Excel struggles?

Excel technically supports 1,048,576 rows and 16,384 columns per sheet, but practical limits are lower. A workbook with 500,000+ rows of data with formulas typically requires 2-4GB of RAM and becomes sluggish. Files over 100MB take noticeably long to open. The 32-bit version of Excel has a 2GB memory limit. For datasets approaching these limits, consider converting to a more efficient format like Parquet or loading directly into a database.

Can I edit XLSX files programmatically without Excel?

Yes. Since XLSX is a ZIP of XML files, you can manipulate it with any ZIP and XML library. In Python, openpyxl is the standard library for reading and writing XLSX. In JavaScript, SheetJS (xlsx) is the most popular. In Java, Apache POI handles XLSX natively. You can even unzip the file, modify the XML directly with a text editor, re-zip it, and rename to .xlsx — though libraries are safer because they handle all the relationships and cross-references.

Why is my XLSX file so much larger than the same data in CSV?

XLSX stores significantly more information than CSV: styles for every cell, the shared strings table, workbook metadata, theme definitions, and relationship files. Even with ZIP compression, the XML overhead makes XLSX files 2-5x larger than equivalent CSV for pure data. However, for data with many repeating text values, the shared strings table plus ZIP compression can make XLSX smaller than CSV. The crossover point is typically around 100,000+ rows with highly repetitive text columns.

What happens to formulas when I convert XLSX to CSV?

Formulas are replaced by their last calculated values. A cell containing =SUM(A1:A100) that displays 5,280 becomes the static value 5280 in the CSV. The formula itself is permanently lost. If you need to preserve formulas, convert to ODS instead of CSV — ODS supports most Excel formulas. Always save a backup of the original XLSX before converting to a format that strips formulas.

Is XLSX an open standard or proprietary?

XLSX is an open standard. It's defined by ECMA-376 (Office Open XML) and also approved as ISO/IEC 29500. The specification is publicly available and free to implement. However, Microsoft's implementation includes extensions and behaviors not fully covered by the standard, and some features (like threaded comments and dynamic arrays) are Excel-specific additions that other implementations may not support. In practice, XLSX is 'open but Microsoft-led.'

How do I convert XLSX with multiple sheets to separate CSV files?

Most conversion tools process one sheet at a time. When you <a href='/xlsx-to-csv'>convert XLSX to CSV on ChangeThisFile</a>, the converter handles the active sheet. For programmatic batch conversion, use a library like SheetJS or openpyxl to iterate over each sheet and export individually. In Python: loop through wb.sheetnames, access each sheet, and write each to a separate CSV file with the sheet name as the filename.

Ready to convert your files?

Use ChangeThisFile to convert between 600+ formats — free, fast, and private.

Start Converting