CSV is the cockroach of data formats. It predates the internet, has no real specification, breaks in hundreds of subtle ways, and is absolutely impossible to kill. Every spreadsheet, every database, every data tool, every programming language supports CSV. It is the lowest common denominator of data exchange — and that's both its greatest strength and its most frustrating property.
The format looks trivially simple: one record per line, fields separated by commas. But behind that simplicity lurks a minefield of encoding issues, delimiter ambiguity, type coercion disasters, and incompatible implementations. This guide covers the real-world behavior of CSV files — not the idealized version, but the one that eats your ZIP codes and reformats your dates.
If you've ever opened a CSV and found garbled characters, missing leading zeros, or dates that make no sense, this guide explains why and how to fix it.
RFC 4180: The Standard That Wasn't
RFC 4180, published in 2005 by Yakov Shafranovich, is titled "Common Format and MIME Type for Comma-Separated Values (CSV) Files." It's an informational RFC, not a standard — meaning it describes existing practice rather than mandating behavior. By the time it was written, CSV had been in use for over 30 years with no specification at all.
RFC 4180 defines these rules:
- Each record is on a separate line, terminated by CRLF (\r\n).
- The last record may or may not have a trailing CRLF.
- An optional header line may appear as the first line, with the same format as normal records.
- Fields are separated by commas. Spaces are part of the field (not trimmed).
- Fields may be enclosed in double quotes. Fields containing commas, double quotes, or line breaks must be quoted.
- A double quote inside a quoted field is escaped by doubling it: "".
These rules sound clear. In practice, almost every CSV file you encounter violates at least one of them. Most files use LF (\n) line endings, not CRLF. Many use semicolons or tabs instead of commas. Some quote all fields; others quote none. The RFC acknowledges this reality but can't fix it.
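These rules are easy to exercise mechanically. As a minimal sketch, Python's csv module — whose default dialect follows RFC 4180's quoting, escaping, and CRLF conventions — can round-trip a field containing all three characters that force quoting:

```python
import csv
import io

# A record whose fields contain a double quote, a comma, and a newline --
# all three characters that force quoting under RFC 4180.
row = ['Widget "Pro"', 'a small, useful\ndevice']

buf = io.StringIO()
csv.writer(buf).writerow(row)   # default dialect: CRLF terminator, "" escaping
serialized = buf.getvalue()

buf.seek(0)
parsed = next(csv.reader(buf))  # the multiline quoted field parses back intact
```

The serialized form is `"Widget ""Pro""","a small, useful` + newline + `device"` followed by CRLF — the inner quote doubled, the comma and line break protected by the surrounding quotes.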
The Delimiter Problem
The "C" in CSV stands for "comma," but commas are only the most common delimiter, not the only one. The delimiter problem stems from a fundamental conflict: the character used to separate fields must never appear unescaped inside a field value.
| Delimiter | Used By | Problem |
|---|---|---|
| Comma (,) | US, UK, most of Asia | Commas appear in text fields, addresses, numbers with thousands separators |
| Semicolon (;) | Germany, France, Italy, Brazil, most of Europe | Necessary because these locales use comma as decimal separator (3,14) |
| Tab (\t) | TSV files, scientific data, database exports | Invisible character, hard to distinguish from spaces visually |
| Pipe (\|) | Legacy mainframe systems, some US government data | Uncommon but unambiguous; rarely appears in field values |
European CSV files are the most common surprise. If you're in Germany, Excel saves CSV with semicolons because the locale uses commas for decimals. A file exported from German Excel and opened in American Excel will mash all columns into one because American Excel expects commas. The fix: convert to TSV, which uses tabs and avoids the locale problem entirely.
Delimiter auto-detection is imperfect. Python's csv.Sniffer class examines the first few lines and guesses. PapaParse (JavaScript) also auto-detects. Both can be fooled by files with quoted fields containing the wrong delimiter character.
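A sketch of Sniffer-based detection — the sample string here is invented, mimicking what German Excel would export:

```python
import csv

# A semicolon-delimited sample, as German Excel would produce.
sample = "name;city\nAnna;Berlin\nJonas;Munich\n"

# Restricting the candidate set makes the guess more reliable.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")

rows = list(csv.reader(sample.splitlines(), dialect))
```

Passing an explicit `delimiters` candidate list is the main defense against misdetection; with no restriction, Sniffer may latch onto any consistently repeated character.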
Quoting Rules and Edge Cases
RFC 4180's quoting rules are straightforward in theory: if a field contains a comma, a double quote, or a newline, enclose it in double quotes. Escape internal double quotes by doubling them. In practice:
```
name,description,price
Widget,"A small, useful device",9.99
"Gadget ""Pro""","Multiline
description here",19.99
```

Common quoting violations in the wild:
- No quoting at all. Some producers never quote fields, even when they contain commas. The field Smith, John, written without quotes, breaks every parser.
- Single quotes. Some tools use single quotes instead of double quotes. This is not RFC-compliant, and most strict parsers reject it.
- Backslash escaping. Some producers use \" instead of "" to escape quotes. MySQL's SELECT INTO OUTFILE does this by default.
- Quote all fields. Some producers quote every field, including numbers and booleans. This is technically valid but can confuse parsers that try to detect types.
Newlines inside quoted fields are the most dangerous edge case. A field value that spans multiple lines is valid CSV, but many naive parsers (and many command-line tools like wc -l) treat every line break as a record boundary. If your CSV has multiline fields, test your parser explicitly.
Encoding: UTF-8, BOM, and the Excel Problem
CSV files have no standard way to declare their character encoding. Unlike XML (which has an encoding declaration in the prolog) or JSON (which mandates UTF-8), a CSV file is just bytes. The consumer must guess the encoding — and often guesses wrong.
Common Encodings in the Wild
UTF-8 is now the default for most modern tools, but legacy CSV files are everywhere in Windows-1252 (Western European Windows), Latin-1/ISO 8859-1, Shift-JIS (Japanese), or GB2312 (Chinese). Decoding with the wrong encoding produces garbled characters (mojibake): a UTF-8 file opened as Latin-1 turns café into cafÃ©, while a Latin-1 file opened as UTF-8 typically fails outright or yields replacement characters (�).
The UTF-8 BOM (Byte Order Mark) is three bytes (EF BB BF) at the start of a file that signal UTF-8 encoding. Excel on Windows requires a BOM to correctly interpret UTF-8 CSV files. Without it, Excel assumes the system's locale encoding (usually Windows-1252 in Western countries), mangling any non-ASCII characters. But many Unix tools treat the BOM as data, producing a phantom invisible character in the first field name.
The practical advice: if your CSV will be opened in Excel, add a UTF-8 BOM. If it will be consumed by code, omit the BOM. If it needs to work for both, add the BOM and make sure your parser skips it. In Python: open('file.csv', 'w', encoding='utf-8-sig') writes with BOM; encoding='utf-8-sig' on read strips it automatically.
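A sketch of the BOM round-trip using a temporary file (the file name is arbitrary):

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)

# Write with a BOM so Excel detects UTF-8.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write("name,city\r\nRené,Zürich\r\n")

with open(path, "rb") as f:
    raw = f.read()                      # begins with the bytes EF BB BF

# utf-8-sig strips the BOM on read; plain utf-8 leaves a phantom \ufeff.
with open(path, encoding="utf-8-sig", newline="") as f:
    clean = f.read()
with open(path, encoding="utf-8", newline="") as f:
    with_bom = f.read()

os.remove(path)
```

The phantom \ufeff is exactly the bug that shows up as an invisible prefix on the first column header when a parser reads BOM-prefixed files as plain UTF-8.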
Type Destruction: Leading Zeros, Dates, and Long Numbers
CSV has no type system. Every value is a string. This is fine in theory — the consumer decides how to interpret each field. In practice, spreadsheet software aggressively auto-detects types and silently destroys data.
| Original Value | What Excel Does | Why |
|---|---|---|
| 02134 | 2134 | Detected as a number; leading zero stripped |
| 1234567890123456 | 1.23457E+15 | Exceeds the 15-digit precision limit; switched to scientific notation |
| 3-15 | 15-Mar | Interpreted as a date |
| 1E3 | 1000 | Interpreted as scientific notation |
| TRUE | TRUE (boolean) | Interpreted as a boolean; cell format changes |
| 000-123-4567 | Varies | May be interpreted as a date, formula, or number depending on locale |
This destroys ZIP codes, phone numbers, part numbers, product SKUs, barcodes, social security numbers, and any identifier with leading zeros or long digit strings. The damage is often silent — you don't notice until downstream processing fails.
Prevention strategies: Import via Excel's Data > From Text/CSV wizard and set columns to "Text" type. Prefix values with a single quote in the CSV ('02134) — Excel treats the quote as a format signal, not data. Or better: use a format with types. Converting CSV to JSON preserves string values as strings, and converting CSV to XLSX lets you control column types explicitly.
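Code-side consumers avoid the problem entirely, because Python's csv module never coerces types: every value stays a string until you convert it. A sketch of carrying such identifiers safely into JSON:

```python
import csv
import io
import json

data = "zip,phone\n02134,000-123-4567\n"

# csv.DictReader yields strings only -- no auto-detection, no coercion.
rows = list(csv.DictReader(io.StringIO(data)))

# Serializing to JSON keeps the values explicitly typed as strings.
payload = json.dumps(rows)
```

The leading zero in 02134 survives both steps, because at no point is the value reinterpreted as a number.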
Newlines in Fields: The Silent Parser Killer
A field value can legitimately contain a newline if the field is quoted:
```
id,comment
1,"This comment
spans two lines"
2,"Single line comment"
```

This is valid per RFC 4180. The problem: tools that process CSV line-by-line (shell scripts using while read line, awk, wc -l, and many ETL tools) don't handle this. They see four lines instead of two records plus a header. Database LOAD DATA commands often fail on multiline fields too.
If your CSV has multiline fields and you need to process it with line-oriented tools, either: (1) replace newlines inside fields with a placeholder before processing and restore them after, (2) convert to a format that doesn't have this ambiguity (JSON or JSONL), or (3) use a proper CSV parser that handles quoted multiline fields (Python's csv module, PapaParse, etc.).
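Option (3) is usually the simplest. A sketch comparing a line-oriented count against a real parser, on the same data as the example above:

```python
import csv
import io

data = 'id,comment\n1,"This comment\nspans two lines"\n2,"Single line comment"\n'

naive_line_count = data.count("\n")             # what wc -l would report
records = list(csv.reader(io.StringIO(data)))   # header + 2 records
```

The naive count sees four lines; the parser sees three records (one of them a header), with the embedded newline preserved inside the field value.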
Excel's CSV vs. RFC 4180
Microsoft Excel is the world's most popular CSV tool, and it deviates from RFC 4180 in several ways:
- Delimiter follows locale. In countries using comma as the decimal separator, Excel uses semicolons. The file is still saved with a .csv extension.
- Encoding defaults to the system locale, not UTF-8. Japanese Excel defaults to Shift-JIS; German Excel defaults to Windows-1252.
- Date format follows locale. Dates are reformatted on display according to the system's short date format, so the same CSV shows different dates on different machines.
- Number precision is 15 significant digits. Digits beyond the 15th are silently replaced with zeros: a 16-digit credit card number has its last digit turned into a 0.
- "Save as CSV" drops everything except the active sheet, and strips all formatting, formulas, charts, and multi-sheet structure. It also doesn't warn about data loss when a cell contains a formula that evaluates to a value.
For programmatic CSV work, avoid Excel entirely. Use Python's csv module, PapaParse in JavaScript, or import into Google Sheets (which handles UTF-8 better than Excel). If you must use Excel, import via the Data tab rather than double-clicking the file.
Best Practices for Producing CSV
If you're generating CSV files for others to consume:
- Use UTF-8 encoding, with a BOM if any consumer might be Excel. Omit the BOM for code-only consumption.
- Quote all fields or at least all string fields. The tiny size overhead prevents delimiter-in-value bugs.
- Use ISO 8601 dates (2026-03-19). Never use locale-dependent formats like 3/19/26.
- Include a header row. Name columns descriptively; don't use spaces or special characters in headers.
- Use CRLF line endings for maximum compatibility, even on Unix systems.
- Escape properly: double quotes inside quoted fields become "".
- Document the encoding and delimiter in a README or metadata file. Don't assume the consumer will guess correctly.
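As a sketch, a producer following this checklist with Python's standard library (the field names are invented for the example):

```python
import csv
import io

rows = [
    ["order_id", "customer", "shipped_on"],
    ["A0042", "Müller, Hans", "2026-03-19"],  # ISO 8601 date, comma in a name
]

buf = io.StringIO()
# Quote every field and use CRLF line endings.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
writer.writerows(rows)

# Prepend the BOM for Excel consumers, then encode as UTF-8.
payload = ("\ufeff" + buf.getvalue()).encode("utf-8")
```

QUOTE_ALL makes the comma in "Müller, Hans" a non-issue, and the \ufeff prefix becomes the three BOM bytes EF BB BF on encoding.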
CSV persists because nothing else is as universally readable. Every programming language, spreadsheet, database, and data tool can ingest CSV. That universality comes at the cost of ambiguity — there's no standard encoding, no standard delimiter, no type system, and no way to represent nested data. CSV is the format you use when you need the widest possible compatibility and your data is flat.
The pragmatic approach: use CSV for import/export and data exchange, but don't store working data in CSV. Convert to JSON when you need types and structure. Convert to XLSX when you need typed columns and multiple sheets. Convert to Parquet when you need columnar storage for analytics. CSV is a transport format, not a storage format.