You download a CSV file. You open it. The columns are smashed into one, or the accented characters look like gibberish, or the numbers have commas where decimals should be. This happens because CSV files carry no metadata about their encoding or delimiter — the consuming application must guess, and it frequently guesses wrong.

These aren't exotic edge cases. They happen every time a German colleague sends a CSV to an American one, every time a database exports UTF-8 and Excel expects Windows-1252, and every time a Python script writes a file that a Java application reads. CSV's simplicity means it has no mechanism to prevent these problems.

This guide explains the specific encoding and delimiter issues you'll encounter, why they happen, and how to fix each one.

Character Encoding: Why Bytes Aren't Characters

A CSV file is bytes on disk. Characters like a, é, ü, ¥, and emojis are stored as byte sequences according to a character encoding. Different encodings map different byte sequences to the same character — and the same byte sequence to different characters.

The byte 0xE9 is:

  • Latin-1 / ISO 8859-1: é (e with acute accent)
  • Windows-1252: é (same character, different standard)
  • UTF-8: Invalid (in UTF-8, é is the two-byte sequence 0xC3 0xA9)

When a program opens a Latin-1 file assuming UTF-8, it encounters invalid byte sequences and either throws an error, replaces characters with �, or produces the garbled text known as mojibake. The reverse — opening a UTF-8 file as Latin-1 — doesn't produce errors but shows wrong characters: café becomes cafÃ©.
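A quick interactive check makes the asymmetry concrete. This is a minimal Python sketch; the strings are illustrative, not from any particular file:

```python
# The same byte means different things under different encodings.
raw = b"\xe9"
print(raw.decode("latin-1"))       # é
print(raw.decode("windows-1252"))  # é

try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("0xE9 alone is not valid UTF-8")

# Mojibake in the other direction: UTF-8 bytes misread as Latin-1.
print("café".encode("utf-8").decode("latin-1"))  # cafÃ©
```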

Common Encodings in CSV Files

| Encoding | Byte Range | Common Source | Notes |
| --- | --- | --- | --- |
| UTF-8 | 1-4 bytes per char | Modern tools, web, Linux, macOS | The global standard. Backward-compatible with ASCII. |
| Windows-1252 | 1 byte per char | Excel on Windows (Western Europe/Americas) | Superset of Latin-1 with extra characters (curly quotes, em dash). |
| Latin-1 (ISO 8859-1) | 1 byte per char | Older European systems, some databases | Almost identical to Windows-1252. Missing curly quotes and euro sign. |
| Shift-JIS | 1-2 bytes per char | Japanese Windows systems | Contains ASCII bytes in multi-byte sequences, which can break naive parsers. |
| GB2312 / GBK | 1-2 bytes per char | Chinese Windows systems | Same multi-byte issues as Shift-JIS. |
| UTF-16 | 2-4 bytes per char | Some Windows exports, SQL Server bulk export | Has a BOM, little-endian or big-endian variants. Not line-oriented; breaks head, tail, grep. |

In 2026, UTF-8 is the correct default. But if you're processing CSV files from legacy systems, corporate databases, or region-specific software, you'll encounter all of these.

The BOM (Byte Order Mark): Love It or Hate It

The UTF-8 BOM is three bytes (EF BB BF) at the start of a file. It signals "this file is UTF-8." Whether to include it is one of the most contentious issues in CSV handling.
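Because the BOM is just three fixed bytes, you can check for it directly by opening the file in binary mode. A minimal sketch; the function name and path are placeholders:

```python
# Detect a UTF-8 BOM by inspecting the first three raw bytes.
def has_utf8_bom(path):
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"
```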

| Scenario | BOM? | Why |
| --- | --- | --- |
| CSV for Excel (Windows) | Yes | Without a BOM, Excel assumes Windows-1252 and mangles accented characters. |
| CSV for Excel (macOS) | Either | Excel for Mac handles UTF-8 slightly better, but the BOM doesn't hurt. |
| CSV for Google Sheets | Either | Google Sheets auto-detects UTF-8 with or without a BOM. |
| CSV for Python/code | No | Some parsers treat the BOM as data, adding an invisible \ufeff to the first field name. |
| CSV for Linux tools | No | head, sort, cut, awk don't expect a BOM and may produce wrong output. |

The safe approach when you don't know the consumer: add the BOM. Python's csv module and most modern parsers skip the BOM when present. The risk of Excel mangling your data (without BOM) is higher than the risk of a parser choking on BOM bytes (rare in modern tools).

Handling BOM in Python

Python's utf-8-sig encoding handles BOM automatically:

import csv

# Writing CSV with BOM (for Excel)
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'City'])
    writer.writerow(['José', 'Montréal'])

# Reading CSV that might have BOM
with open('input.csv', 'r', encoding='utf-8-sig', newline='') as f:
    reader = csv.reader(f)
    # BOM is stripped automatically

utf-8-sig writes BOM on output and strips it on input. Use it as your default encoding for CSV files and you'll avoid 90% of encoding issues.

Delimiter Detection: Why Columns Merge or Split

The "C" in CSV means comma, but commas are just one of many delimiters in the wild:

| Delimiter | Where | Why |
| --- | --- | --- |
| Comma (,) | US, UK, most of Asia, Australia | The "default" delimiter in English-speaking countries. |
| Semicolon (;) | Germany, France, Italy, Spain, Portugal, Brazil, and most of continental Europe | These countries use comma as the decimal separator (3,14 not 3.14). Using comma as both decimal and field separator creates ambiguity, so the field separator changes to semicolon. |
| Tab (\t) | TSV files, database exports, scientific data | Tab characters rarely appear in data values. No locale dependency. |
| Pipe (\|) | US government data, mainframe exports, some HL7 messages | Extremely rare in data values. Unambiguous. |

The European Locale Problem

When Excel on a German Windows system saves a CSV file, it uses semicolons because the Windows regional settings define comma as the decimal separator. The file has a .csv extension even though it's technically semicolon-separated. When an American colleague opens this file in Excel, Excel sees commas (in the decimal numbers) and semicolons, and tries to use commas as delimiters — producing mangled data.

This is the single most common CSV interoperability problem in international organizations. Solutions:

  • TSV: Convert to TSV. Tab delimiters work identically across all locales.
  • Explicit import: Use Excel's Data > From Text/CSV import wizard, which lets you specify the delimiter.
  • Standard decimal: Agree on a standard: always use dot as decimal separator in CSV, regardless of locale. This requires the CSV producer to override locale defaults.
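The TSV conversion in the first bullet is a few lines in Python. A sketch under the assumption that the source is a semicolon-delimited UTF-8 export; the file names are placeholders:

```python
import csv

# Convert a semicolon-delimited CSV (typical European Excel export) to TSV.
def semicolon_csv_to_tsv(src, dst):
    with open(src, "r", encoding="utf-8-sig", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter=";")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            writer.writerow(row)
```

Note this only changes the field separator; decimal commas inside values (3,14) pass through unchanged and still need normalizing if the consumer expects dots.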

How Delimiter Auto-Detection Works (and Fails)

Most CSV parsers auto-detect the delimiter by analyzing the first few lines. Common approaches:

  • Frequency analysis: Count occurrences of candidate delimiters (,, ;, \t, |) in each line. The character with the most consistent count across lines wins.
  • Python's csv.Sniffer: Examines a sample of the file and returns a Dialect object with the detected delimiter, quoting style, and line terminator. Works well on clean files, fails on files with irregular quoting or mixed delimiters.
  • PapaParse: Auto-detects delimiter by testing candidates and scoring based on column count consistency. Generally reliable.
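Combining Sniffer with an explicit fallback gives you auto-detection without trusting it blindly. A sketch; the sample string is illustrative:

```python
import csv

sample = "name;city\nJosé;Montréal\n"

try:
    # Restricting the candidate set makes Sniffer markedly more reliable.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    delimiter = dialect.delimiter
except csv.Error:
    delimiter = ","  # detection failed; fall back to a known default
```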

Auto-detection fails when:

  • The first few lines aren't representative (e.g., a header with no commas followed by data rows with commas in text fields)
  • Multiple delimiters appear with similar frequency
  • The file has only one column (no delimiter at all)
  • Quoted fields contain the delimiter character

When reliability matters, always specify the delimiter explicitly rather than relying on auto-detection.

Excel Import Wizard: The Correct Way to Open CSV

Double-clicking a CSV file tells Excel to guess the encoding and delimiter. The import wizard gives you control:

  1. Open Excel (don't open the CSV file directly).
  2. Go to Data > From Text/CSV (Windows) or File > Import (macOS).
  3. Select the CSV file.
  4. In the preview dialog:
    • Set File Origin to UTF-8 (or the correct encoding).
    • Set Delimiter to Comma, Semicolon, Tab, or Other.
    • Preview the column split to verify it looks correct.
  5. Click Transform Data to set column types (Text for ZIP codes, IDs, phone numbers).
  6. Click Load.

This process takes 30 seconds and prevents the encoding/delimiter/type-coercion problems that plague double-click opens. For CSV files from unknown sources, always use the import wizard.

Python csv Module Gotchas

Python's csv module is reliable but has specific behaviors that cause issues:

  • Default encoding is system-dependent. On Windows, open('file.csv') uses the system's default encoding (often Windows-1252). Always specify encoding explicitly: open('file.csv', encoding='utf-8-sig').
  • Newline handling. On Python 3, always open CSV files with newline='' to prevent the csv module from mangling line endings: open('file.csv', 'w', newline='', encoding='utf-8-sig'). Without newline='', Windows systems may produce files with double line breaks.
  • Sniffer is limited. csv.Sniffer().sniff(sample) examines only the sample you pass. Pass at least 10-20 lines for reliable detection. It can't detect encoding — only delimiter and quoting style.
  • Large files and memory. csv.reader is already streaming (reads line by line), so memory is rarely an issue. But csv.DictReader creates a dictionary per row, which adds overhead for millions of rows.
# The safe Python CSV pattern
import csv

# Reading (handles BOM, explicit encoding)
with open('input.csv', 'r', encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

# Writing (BOM for Excel, explicit newline handling)
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'city', 'country'])
    writer.writeheader()
    writer.writerow({'name': 'José', 'city': 'Montréal', 'country': 'CA'})

TSV: The Locale-Independent Escape Hatch

Tab-Separated Values (TSV) solves the delimiter problem by using a character that almost never appears in data values and has no locale dependency. Tabs are tabs everywhere — no country uses tab as a decimal separator.

TSV advantages over CSV:

  • No locale ambiguity (tabs work the same in every country)
  • Tab characters in data values are extremely rare
  • No quoting needed in most cases (reducing parser complexity)
  • Directly pasteable into spreadsheets (Excel accepts tab-pasted data natively)

TSV disadvantage: tabs are invisible, making TSV harder to inspect in a text editor (commas and semicolons are visible characters).

For data exchange between international teams, converting CSV to TSV eliminates the entire delimiter class of problems. If the data will be consumed by code rather than visually inspected, the invisible-tab issue is irrelevant.
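With Python's csv module, TSV is just a different delimiter. A minimal sketch; the file name is a placeholder:

```python
import csv

# Writing and reading TSV: only the delimiter changes.
with open("data.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["name", "city"])
    writer.writerow(["José", "Montréal"])

with open("data.tsv", "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
```

The built-in csv.excel_tab dialect is equivalent to passing delimiter='\t' explicitly.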

Diagnosing CSV Encoding Problems

When a CSV file looks wrong, diagnose the issue systematically:

| Symptom | Cause | Fix |
| --- | --- | --- |
| All columns in one cell | Wrong delimiter assumed | Re-import with correct delimiter (Data > From Text/CSV) |
| Ã© instead of é | UTF-8 file opened as Latin-1/Windows-1252 | Re-open specifying UTF-8 encoding |
| ? or � characters | Latin-1 file opened as UTF-8, or encoding mismatch | Try Windows-1252 or Latin-1 encoding |
| First column name has invisible char | UTF-8 BOM treated as data | Open with utf-8-sig encoding or strip first 3 bytes |
| Asian characters garbled | Shift-JIS or GBK file opened as UTF-8 | Detect encoding with chardet library, re-open correctly |
| Extra blank lines between rows | CRLF line endings doubled | Open with newline='' parameter in Python |
| Numbers show as dates | Excel auto-type-detection | Import via wizard, set column type to Text |
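One way to act on this table programmatically is a fallback chain: try the strict encodings first and end with one that accepts every byte. A sketch, not a substitute for proper detection; the function name is hypothetical:

```python
# Try likely encodings in order. latin-1 maps every byte to a character,
# so it always succeeds and must go last.
def read_csv_text(path):
    for enc in ("utf-8-sig", "cp1252", "latin-1"):
        try:
            with open(path, "r", encoding=enc, newline="") as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")  # unreachable: latin-1 is last
```

Because latin-1 never fails, this always returns something; a successful decode is not proof the encoding guess was right.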

For programmatic detection, Python's chardet library can guess the encoding of a file by analyzing byte patterns. It's not perfect but handles common cases well:

import chardet

with open('mystery.csv', 'rb') as f:
    result = chardet.detect(f.read(10000))
print(result)  # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

CSV encoding and delimiter problems are not bugs — they're the predictable consequence of a format with no metadata. CSV files are raw bytes with no declaration of what those bytes mean or how they're structured. Every solution (BOM for encoding, explicit delimiter specification, TSV for locale independence) works around this fundamental limitation.

The pragmatic approach: standardize on UTF-8 with BOM for files that might touch Excel, use TSV for international exchange, and always specify encoding and delimiter explicitly when processing CSV programmatically. If your data has types, nesting, or encoding requirements that CSV can't handle cleanly, convert to JSON and leave CSV's limitations behind.