You download a CSV file. You open it. The columns are smashed into one, or the accented characters look like gibberish, or the numbers have commas where decimals should be. This happens because CSV files carry no metadata about their encoding or delimiter — the consuming application must guess, and it frequently guesses wrong.

These aren't exotic edge cases. They happen every time a German colleague sends a CSV to an American one, every time a database exports UTF-8 and Excel expects Windows-1252, and every time a Python script writes a file that a Java application reads. CSV's simplicity means it has no mechanism to prevent these problems.

This guide explains the specific encoding and delimiter issues you'll encounter, why they happen, and how to fix each one.

Character Encoding: Why Bytes Aren't Characters

A CSV file is bytes on disk. Characters like a, é, ü, ¥, and emojis are stored as byte sequences according to a character encoding. Different encodings map different byte sequences to the same character — and the same byte sequence to different characters.

The byte 0xE9 is:

  • Latin-1 / ISO 8859-1: é (e with acute accent)
  • Windows-1252: é (same character, different standard)
  • UTF-8: Invalid (in UTF-8, é is the two-byte sequence 0xC3 0xA9)

When a program opens a Latin-1 file assuming UTF-8, it encounters invalid byte sequences and either throws an error, replaces characters with �, or produces the garbled text known as mojibake. The reverse — opening a UTF-8 file as Latin-1 — doesn't produce errors but shows wrong characters: café becomes cafÃ©.
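A quick interactive check makes the asymmetry concrete. This is a minimal Python sketch; the strings are illustrative, not from any particular file:

```python
# The same byte means different things under different encodings.
raw = b"\xe9"
print(raw.decode("latin-1"))       # é
print(raw.decode("windows-1252"))  # é

try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("0xE9 alone is not valid UTF-8")

# Mojibake in the other direction: UTF-8 bytes misread as Latin-1.
print("café".encode("utf-8").decode("latin-1"))  # cafÃ©
```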

Common Encodings in CSV Files

| Encoding | Byte Range | Common Source | Notes |
| --- | --- | --- | --- |
| UTF-8 | 1-4 bytes per char | Modern tools, web, Linux, macOS | The global standard. Backward-compatible with ASCII. |
| Windows-1252 | 1 byte per char | Excel on Windows (Western Europe/Americas) | Superset of Latin-1 with extra characters (curly quotes, em dash). |
| Latin-1 (ISO 8859-1) | 1 byte per char | Older European systems, some databases | Almost identical to Windows-1252. Missing curly quotes and euro sign. |
| Shift-JIS | 1-2 bytes per char | Japanese Windows systems | Contains ASCII bytes in multi-byte sequences, which can break naive parsers. |
| GB2312 / GBK | 1-2 bytes per char | Chinese Windows systems | Same multi-byte issues as Shift-JIS. |
| UTF-16 | 2-4 bytes per char | Some Windows exports, SQL Server bulk export | Has a BOM, little-endian or big-endian variants. Not line-oriented; breaks head, tail, grep. |

In 2026, UTF-8 is the correct default. But if you're processing CSV files from legacy systems, corporate databases, or region-specific software, you'll encounter all of these.

The BOM (Byte Order Mark): Love It or Hate It

The UTF-8 BOM is three bytes (EF BB BF) at the start of a file. It signals "this file is UTF-8." Whether to include it is one of the most contentious issues in CSV handling.
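Because the BOM is just three fixed bytes, you can check for it directly by opening the file in binary mode. A minimal sketch; the function name and path are placeholders:

```python
# Detect a UTF-8 BOM by inspecting the first three raw bytes.
def has_utf8_bom(path):
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"
```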

| Scenario | BOM? | Why |
| --- | --- | --- |
| CSV for Excel (Windows) | Yes | Without a BOM, Excel assumes Windows-1252 and mangles accented characters. |
| CSV for Excel (macOS) | Either | Excel for Mac handles UTF-8 slightly better, but the BOM doesn't hurt. |
| CSV for Google Sheets | Either | Google Sheets auto-detects UTF-8 with or without a BOM. |
| CSV for Python/code | No | Some parsers treat the BOM as data, adding an invisible \ufeff to the first field name. |
| CSV for Linux tools | No | head, sort, cut, awk don't expect a BOM and may produce wrong output. |

The safe approach when you don't know the consumer: add the BOM. Python's csv module and most modern parsers skip the BOM when present. The risk of Excel mangling your data (without BOM) is higher than the risk of a parser choking on BOM bytes (rare in modern tools).

Handling BOM in Python

Python's utf-8-sig encoding handles BOM automatically:

import csv

# Writing CSV with BOM (for Excel)
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'City'])
    writer.writerow(['José', 'Montréal'])

# Reading CSV that might have BOM
with open('input.csv', 'r', encoding='utf-8-sig', newline='') as f:
    reader = csv.reader(f)
    # BOM is stripped automatically

utf-8-sig writes BOM on output and strips it on input. Use it as your default encoding for CSV files and you'll avoid 90% of encoding issues.

Delimiter Detection: Why Columns Merge or Split

The "C" in CSV means comma, but commas are just one of many delimiters in the wild:

| Delimiter | Where | Why |
| --- | --- | --- |
| Comma (,) | US, UK, most of Asia, Australia | The "default" delimiter in English-speaking countries. |
| Semicolon (;) | Germany, France, Italy, Spain, Portugal, Brazil, and most of continental Europe | These countries use comma as the decimal separator (3,14 not 3.14). Using comma as both decimal and field separator creates ambiguity, so the field separator changes to semicolon. |
| Tab (\t) | TSV files, database exports, scientific data | Tab characters rarely appear in data values. No locale dependency. |
| Pipe (\|) | US government data, mainframe exports, some HL7 messages | Extremely rare in data values. Unambiguous. |

The European Locale Problem

When Excel on a German Windows system saves a CSV file, it uses semicolons because the Windows regional settings define comma as the decimal separator. The file has a .csv extension even though it's technically semicolon-separated. When an American colleague opens this file in Excel, Excel sees commas (in the decimal numbers) and semicolons, and tries to use commas as delimiters — producing mangled data.

This is the single most common CSV interoperability problem in international organizations. Solutions:

  • TSV: Convert to TSV. Tab delimiters work identically across all locales.
  • Explicit import: Use Excel's Data > From Text/CSV import wizard, which lets you specify the delimiter.
  • Standard decimal: Agree on a standard: always use dot as decimal separator in CSV, regardless of locale. This requires the CSV producer to override locale defaults.
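The TSV conversion in the first bullet is a few lines in Python. A sketch under the assumption that the source is a semicolon-delimited UTF-8 export; the file names are placeholders:

```python
import csv

# Convert a semicolon-delimited CSV (typical European Excel export) to TSV.
def semicolon_csv_to_tsv(src, dst):
    with open(src, "r", encoding="utf-8-sig", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter=";")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            writer.writerow(row)
```

Note this only changes the field separator; decimal commas inside values (3,14) pass through unchanged and still need normalizing if the consumer expects dots.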

How Delimiter Auto-Detection Works (and Fails)

Most CSV parsers auto-detect the delimiter by analyzing the first few lines. Common approaches:

  • Frequency analysis: Count occurrences of candidate delimiters (,, ;, \t, |) in each line. The character with the most consistent count across lines wins.
  • Python's csv.Sniffer: Examines a sample of the file and returns a Dialect object with the detected delimiter, quoting style, and line terminator. Works well on clean files, fails on files with irregular quoting or mixed delimiters.
  • PapaParse: Auto-detects delimiter by testing candidates and scoring based on column count consistency. Generally reliable.
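Combining Sniffer with an explicit fallback gives you auto-detection without trusting it blindly. A sketch; the sample string is illustrative:

```python
import csv

sample = "name;city\nJosé;Montréal\n"

try:
    # Restricting the candidate set makes Sniffer markedly more reliable.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    delimiter = dialect.delimiter
except csv.Error:
    delimiter = ","  # detection failed; fall back to a known default
```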

Auto-detection fails when:

  • The first few lines aren't representative (e.g., a header with no commas followed by data rows with commas in text fields)
  • Multiple delimiters appear with similar frequency
  • The file has only one column (no delimiter at all)
  • Quoted fields contain the delimiter character

When reliability matters, always specify the delimiter explicitly rather than relying on auto-detection.

Excel Import Wizard: The Correct Way to Open CSV

Double-clicking a CSV file tells Excel to guess the encoding and delimiter. The import wizard gives you control:

  1. Open Excel (don't open the CSV file directly).
  2. Go to Data > From Text/CSV (Windows) or File > Import (macOS).
  3. Select the CSV file.
  4. In the preview dialog:
    • Set File Origin to UTF-8 (or the correct encoding).
    • Set Delimiter to Comma, Semicolon, Tab, or Other.
    • Preview the column split to verify it looks correct.
  5. Click Transform Data to set column types (Text for ZIP codes, IDs, phone numbers).
  6. Click Load.

This process takes 30 seconds and prevents the encoding/delimiter/type-coercion problems that plague double-click opens. For CSV files from unknown sources, always use the import wizard.

Python csv Module Gotchas

Python's csv module is reliable but has specific behaviors that cause issues:

  • Default encoding is system-dependent. On Windows, open('file.csv') uses the system's default encoding (often Windows-1252). Always specify encoding explicitly: open('file.csv', encoding='utf-8-sig').
  • Newline handling. On Python 3, always open CSV files with newline='' to prevent the csv module from mangling line endings: open('file.csv', 'w', newline='', encoding='utf-8-sig'). Without newline='', Windows systems may produce files with double line breaks.
  • Sniffer is limited. csv.Sniffer().sniff(sample) examines only the sample you pass. Pass at least 10-20 lines for reliable detection. It can't detect encoding — only delimiter and quoting style.
  • Large files and memory. csv.reader is already streaming (reads line by line), so memory is rarely an issue. But csv.DictReader creates a dictionary per row, which adds overhead for millions of rows.
# The safe Python CSV pattern
import csv

# Reading (handles BOM, explicit encoding)
with open('input.csv', 'r', encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

# Writing (BOM for Excel, explicit newline handling)
with open('output.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'city', 'country'])
    writer.writeheader()
    writer.writerow({'name': 'José', 'city': 'Montréal', 'country': 'CA'})

TSV: The Locale-Independent Escape Hatch

Tab-Separated Values (TSV) solves the delimiter problem by using a character that almost never appears in data values and has no locale dependency. Tabs are tabs everywhere — no country uses tab as a decimal separator.

TSV advantages over CSV:

  • No locale ambiguity (tabs work the same in every country)
  • Tab characters in data values are extremely rare
  • No quoting needed in most cases (reducing parser complexity)
  • Directly pasteable into spreadsheets (Excel accepts tab-pasted data natively)

TSV disadvantage: tabs are invisible, making TSV harder to inspect in a text editor (commas and semicolons are visible characters).

For data exchange between international teams, converting CSV to TSV eliminates the entire delimiter class of problems. If the data will be consumed by code rather than visually inspected, the invisible-tab issue is irrelevant.
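With Python's csv module, TSV is just a different delimiter. A minimal sketch; the file name is a placeholder:

```python
import csv

# Writing and reading TSV: only the delimiter changes.
with open("data.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["name", "city"])
    writer.writerow(["José", "Montréal"])

with open("data.tsv", "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
```

The built-in csv.excel_tab dialect is equivalent to passing delimiter='\t' explicitly.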

Diagnosing CSV Encoding Problems

When a CSV file looks wrong, diagnose the issue systematically:

| Symptom | Cause | Fix |
| --- | --- | --- |
| All columns in one cell | Wrong delimiter assumed | Re-import with correct delimiter (Data > From Text/CSV) |
| Ã© instead of é | UTF-8 file opened as Latin-1/Windows-1252 | Re-open specifying UTF-8 encoding |
| ? or � characters | Latin-1 file opened as UTF-8, or encoding mismatch | Try Windows-1252 or Latin-1 encoding |
| First column name has invisible char | UTF-8 BOM treated as data | Open with utf-8-sig encoding or strip first 3 bytes |
| Asian characters garbled | Shift-JIS or GBK file opened as UTF-8 | Detect encoding with chardet library, re-open correctly |
| Extra blank lines between rows | CRLF line endings doubled | Open with newline='' parameter in Python |
| Numbers show as dates | Excel auto-type-detection | Import via wizard, set column type to Text |
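One way to act on this table programmatically is a fallback chain: try the strict encodings first and end with one that accepts every byte. A sketch, not a substitute for proper detection; the function name is hypothetical:

```python
# Try likely encodings in order. latin-1 maps every byte to a character,
# so it always succeeds and must go last.
def read_csv_text(path):
    for enc in ("utf-8-sig", "cp1252", "latin-1"):
        try:
            with open(path, "r", encoding=enc, newline="") as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")  # unreachable: latin-1 is last
```

Because latin-1 never fails, this always returns something; a successful decode is not proof the encoding guess was right.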

For programmatic detection, Python's chardet library can guess the encoding of a file by analyzing byte patterns. It's not perfect but handles common cases well:

import chardet

with open('mystery.csv', 'rb') as f:
    result = chardet.detect(f.read(10000))
print(result)  # {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

CSV encoding and delimiter problems are not bugs — they're the predictable consequence of a format with no metadata. CSV files are raw bytes with no declaration of what those bytes mean or how they're structured. Every solution (BOM for encoding, explicit delimiter specification, TSV for locale independence) works around this fundamental limitation.

The pragmatic approach: standardize on UTF-8 with BOM for files that might touch Excel, use TSV for international exchange, and always specify encoding and delimiter explicitly when processing CSV programmatically. If your data has types, nesting, or encoding requirements that CSV can't handle cleanly, convert to JSON and leave CSV's limitations behind.