CSV is the cockroach of data formats. It predates the internet, has no real specification, breaks in hundreds of subtle ways, and is absolutely impossible to kill. Every spreadsheet, every database, every data tool, every programming language supports CSV. It is the lowest common denominator of data exchange — and that's both its greatest strength and its most frustrating property.
The format looks trivially simple: one record per line, fields separated by commas. But behind that simplicity lurks a minefield of encoding issues, delimiter ambiguity, type coercion disasters, and incompatible implementations. This guide covers the real-world behavior of CSV files — not the idealized version, but the one that eats your ZIP codes and reformats your dates.
If you've ever opened a CSV and found garbled characters, missing leading zeros, or dates that make no sense, this guide explains why and how to fix it.
RFC 4180: The Standard That Wasn't
RFC 4180, published in 2005 by Yakov Shafranovich, is titled "Common Format and MIME Type for Comma-Separated Values (CSV) Files." It's an informational RFC, not a standard — meaning it describes existing practice rather than mandating behavior. By the time it was written, CSV had been in use for over 30 years with no specification at all.
RFC 4180 defines these rules:
- Each record is on a separate line, terminated by CRLF (\r\n).
- The last record may or may not have a trailing CRLF.
- An optional header line may appear as the first line, with the same format as normal records.
- Fields are separated by commas. Spaces are part of the field (not trimmed).
- Fields may be enclosed in double quotes. Fields containing commas, double quotes, or line breaks must be quoted.
- A double quote inside a quoted field is escaped by doubling it: "".
These rules sound clear. In practice, almost every CSV file you encounter violates at least one of them. Most files use LF (\n) line endings, not CRLF. Many use semicolons or tabs instead of commas. Some quote all fields; others quote none. The RFC acknowledges this reality but can't fix it.
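These rules are easy to exercise mechanically. As a minimal sketch, Python's csv module — whose default dialect follows RFC 4180's quoting, escaping, and CRLF conventions — can round-trip a field containing all three characters that force quoting:

```python
import csv
import io

# A record whose fields contain a double quote, a comma, and a newline --
# all three characters that force quoting under RFC 4180.
row = ['Widget "Pro"', 'a small, useful\ndevice']

buf = io.StringIO()
csv.writer(buf).writerow(row)   # default dialect: CRLF terminator, "" escaping
serialized = buf.getvalue()

buf.seek(0)
parsed = next(csv.reader(buf))  # the multiline quoted field parses back intact
```

The serialized form is `"Widget ""Pro""","a small, useful` + newline + `device"` followed by CRLF — the inner quote doubled, the comma and line break protected by the surrounding quotes.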
The Delimiter Problem
The "C" in CSV stands for "comma," but commas are only the most common delimiter, not the only one. The delimiter problem stems from a fundamental conflict: the character used to separate fields must never appear unescaped inside a field value.
| Delimiter | Used By | Problem |
|---|---|---|
| Comma (,) | US, UK, most of Asia | Commas appear in text fields, addresses, numbers with thousands separators |
| Semicolon (;) | Germany, France, Italy, Brazil, most of Europe | Necessary because these locales use comma as decimal separator (3,14) |
| Tab (\t) | TSV files, scientific data, database exports | Invisible character, hard to distinguish from spaces visually |
| Pipe (\|) | Legacy mainframe systems, some US government data | Uncommon but unambiguous; rarely appears in field values |
European CSV files are the most common surprise. If you're in Germany, Excel saves CSV with semicolons because the locale uses commas for decimals. A file exported from German Excel and opened in American Excel will mash all columns into one because American Excel expects commas. The fix: convert to TSV, which uses tabs and avoids the locale problem entirely.
Delimiter auto-detection is imperfect. Python's csv.Sniffer class examines the first few lines and guesses. PapaParse (JavaScript) also auto-detects. Both can be fooled by files with quoted fields containing the wrong delimiter character.
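A sketch of Sniffer-based detection — the sample string here is invented, mimicking what German Excel would export:

```python
import csv

# A semicolon-delimited sample, as German Excel would produce.
sample = "name;city\nAnna;Berlin\nJonas;Munich\n"

# Restricting the candidate set makes the guess more reliable.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")

rows = list(csv.reader(sample.splitlines(), dialect))
```

Passing an explicit `delimiters` candidate list is the main defense against misdetection; with no restriction, Sniffer may latch onto any consistently repeated character.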
Quoting Rules and Edge Cases
RFC 4180's quoting rules are straightforward in theory: if a field contains a comma, a double quote, or a newline, enclose it in double quotes. Escape internal double quotes by doubling them. In practice:
```
name,description,price
Widget,"A small, useful device",9.99
"Gadget ""Pro""","Multiline
description here",19.99
```

Common quoting violations in the wild:
- No quoting at all. Some producers never quote fields, even when they contain commas. The field Smith, John, written without quotes, breaks every parser.
- Single quotes. Some tools use single quotes instead of double quotes. This is not RFC-compliant, and most strict parsers reject it.
- Backslash escaping. Some producers use \" instead of "" to escape quotes. MySQL's SELECT INTO OUTFILE does this by default.
- Quote all fields. Some producers quote every field, including numbers and booleans. This is technically valid but can confuse parsers that try to detect types.
Newlines inside quoted fields are the most dangerous edge case. A field value that spans multiple lines is valid CSV, but many naive parsers (and many command-line tools like wc -l) treat every line break as a record boundary. If your CSV has multiline fields, test your parser explicitly.
Encoding: UTF-8, BOM, and the Excel Problem
CSV files have no standard way to declare their character encoding. Unlike XML (which has an encoding declaration in the prolog) or JSON (which mandates UTF-8), a CSV file is just bytes. The consumer must guess the encoding — and often guesses wrong.
Common Encodings in the Wild
UTF-8 is now the default for most modern tools, but legacy CSV files are everywhere in Windows-1252 (Western European Windows), Latin-1/ISO 8859-1, Shift-JIS (Japanese), or GB2312 (Chinese). Decoding with the wrong encoding produces garbled characters (mojibake): a UTF-8 file opened as Latin-1 turns café into cafÃ©, while a Latin-1 file opened as UTF-8 typically fails outright or yields replacement characters (�).
The UTF-8 BOM (Byte Order Mark) is three bytes (EF BB BF) at the start of a file that signal UTF-8 encoding. Excel on Windows requires a BOM to correctly interpret UTF-8 CSV files. Without it, Excel assumes the system's locale encoding (usually Windows-1252 in Western countries), mangling any non-ASCII characters. But many Unix tools treat the BOM as data, producing a phantom invisible character in the first field name.
The practical advice: if your CSV will be opened in Excel, add a UTF-8 BOM. If it will be consumed by code, omit the BOM. If it needs to work for both, add the BOM and make sure your parser skips it. In Python: open('file.csv', 'w', encoding='utf-8-sig') writes with BOM; encoding='utf-8-sig' on read strips it automatically.
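A sketch of the BOM round-trip using a temporary file (the file name is arbitrary):

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

# Write with a BOM so Excel detects UTF-8.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write("name,city\r\nRené,Zürich\r\n")

with open(path, "rb") as f:
    raw = f.read()                      # begins with the bytes EF BB BF

# utf-8-sig strips the BOM on read; plain utf-8 leaves a phantom \ufeff.
with open(path, encoding="utf-8-sig", newline="") as f:
    clean = f.read()
with open(path, encoding="utf-8", newline="") as f:
    with_bom = f.read()

os.remove(path)
```

The phantom \ufeff is exactly the bug that shows up as an invisible prefix on the first column header when a parser reads BOM-prefixed files as plain UTF-8.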
Type Destruction: Leading Zeros, Dates, and Long Numbers
CSV has no type system. Every value is a string. This is fine in theory — the consumer decides how to interpret each field. In practice, spreadsheet software aggressively auto-detects types and silently destroys data.
| Original Value | What Excel Does | Why |
|---|---|---|
| 02134 | 2134 | Detected as a number; leading zero stripped |
| 1234567890123456 | 1.23457E+15 | Exceeds the 15-digit precision limit; switched to scientific notation |
| 3-15 | 15-Mar | Interpreted as a date |
| 1E3 | 1000 | Interpreted as scientific notation |
| TRUE | TRUE (boolean) | Interpreted as a boolean; cell format changes |
| 000-123-4567 | Varies | May be interpreted as a date, formula, or number depending on locale |
This destroys ZIP codes, phone numbers, part numbers, product SKUs, barcodes, social security numbers, and any identifier with leading zeros or long digit strings. The damage is often silent — you don't notice until downstream processing fails.
Prevention strategies: Import via Excel's Data > From Text/CSV wizard and set columns to "Text" type. Prefix values with a single quote in the CSV ('02134) — Excel treats the quote as a format signal, not data. Or better: use a format with types. Converting CSV to JSON preserves string values as strings, and converting CSV to XLSX lets you control column types explicitly.
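Code-side consumers avoid the problem entirely, because Python's csv module never coerces types: every value stays a string until you convert it. A sketch of carrying such identifiers safely into JSON:

```python
import csv
import io
import json

data = "zip,phone\n02134,000-123-4567\n"

# csv.DictReader yields strings only -- no auto-detection, no coercion.
rows = list(csv.DictReader(io.StringIO(data)))

# Serializing to JSON keeps the values explicitly typed as strings.
payload = json.dumps(rows)
```

The leading zero in 02134 survives both steps, because at no point is the value reinterpreted as a number.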
Newlines in Fields: The Silent Parser Killer
A field value can legitimately contain a newline if the field is quoted:
```
id,comment
1,"This comment
spans two lines"
2,"Single line comment"
```

This is valid per RFC 4180. The problem: tools that process CSV line-by-line (shell scripts using while read line, awk, wc -l, and many ETL tools) don't handle this. They see four lines instead of two records plus a header. Database LOAD DATA commands often fail on multiline fields too.
If your CSV has multiline fields and you need to process it with line-oriented tools, either: (1) replace newlines inside fields with a placeholder before processing and restore them after, (2) convert to a format that doesn't have this ambiguity (JSON or JSONL), or (3) use a proper CSV parser that handles quoted multiline fields (Python's csv module, PapaParse, etc.).
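Option (3) is usually the simplest. A sketch comparing a line-oriented count against a real parser, on the same data as the example above:

```python
import csv
import io

data = 'id,comment\n1,"This comment\nspans two lines"\n2,"Single line comment"\n'

naive_line_count = data.count("\n")             # what wc -l would report
records = list(csv.reader(io.StringIO(data)))   # header + 2 records
```

The naive count sees four lines; the parser sees three records (one of them a header), with the embedded newline preserved inside the field value.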
Excel's CSV vs. RFC 4180
Microsoft Excel is the world's most popular CSV tool, and it deviates from RFC 4180 in several ways:
- Delimiter follows locale. In countries using comma as the decimal separator, Excel uses semicolons. The file is still saved with a .csv extension.
- Encoding defaults to the system locale, not UTF-8. Japanese Excel defaults to Shift-JIS; German Excel defaults to Windows-1252.
- Date format follows locale. Dates are reformatted on display according to the system's short date format, so the same CSV shows different dates on different machines.
- Number precision is 15 significant digits. Digits beyond the 15th are silently replaced with zeros: a 16-digit credit card number has its last digit turned into a 0.
- "Save as CSV" drops everything except the active sheet, and strips all formatting, formulas, charts, and multi-sheet structure. It also doesn't warn about data loss when a cell contains a formula that evaluates to a value.
For programmatic CSV work, avoid Excel entirely. Use Python's csv module, PapaParse in JavaScript, or import into Google Sheets (which handles UTF-8 better than Excel). If you must use Excel, import via the Data tab rather than double-clicking the file.
Best Practices for Producing CSV
If you're generating CSV files for others to consume:
- Use UTF-8 encoding, with a BOM if any consumer might be Excel. Omit the BOM for code-only consumption.
- Quote all fields or at least all string fields. The tiny size overhead prevents delimiter-in-value bugs.
- Use ISO 8601 dates (2026-03-19). Never use locale-dependent formats like 3/19/26.
- Include a header row. Name columns descriptively; don't use spaces or special characters in headers.
- Use CRLF line endings for maximum compatibility, even on Unix systems.
- Escape properly: double quotes inside quoted fields become "".
- Document the encoding and delimiter in a README or metadata file. Don't assume the consumer will guess correctly.
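As a sketch, a producer following this checklist with Python's standard library (the field names are invented for the example):

```python
import csv
import io

rows = [
    ["order_id", "customer", "shipped_on"],
    ["A0042", "Müller, Hans", "2026-03-19"],  # ISO 8601 date, comma in a name
]

buf = io.StringIO()
# Quote every field and use CRLF line endings.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
writer.writerows(rows)

# Prepend the BOM for Excel consumers, then encode as UTF-8.
payload = ("\ufeff" + buf.getvalue()).encode("utf-8")
```

QUOTE_ALL makes the comma in "Müller, Hans" a non-issue, and the \ufeff prefix becomes the three BOM bytes EF BB BF on encoding.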
CSV persists because nothing else is as universally readable. Every programming language, spreadsheet, database, and data tool can ingest CSV. That universality comes at the cost of ambiguity — there's no standard encoding, no standard delimiter, no type system, and no way to represent nested data. CSV is the format you use when you need the widest possible compatibility and your data is flat.
The pragmatic approach: use CSV for import/export and data exchange, but don't store working data in CSV. Convert to JSON when you need types and structure. Convert to XLSX when you need typed columns and multiple sheets. Convert to Parquet when you need columnar storage for analytics. CSV is a transport format, not a storage format.