Data migration between formats is one of the most common file operations and one of the least understood. "Convert CSV to JSON" sounds simple, but the details determine whether you get clean data or silent corruption: Does the CSV have a header row? Are dates ISO 8601 or locale-specific? Do numbers have leading zeros that must be preserved? Is the CSV semicolon-delimited because it was exported from European Excel?
This guide covers the most common format migrations, what happens to your data in each conversion, where data is lost or transformed, and how to validate the result. The focus is practical: real conversions between CSV, JSON, XML, YAML, and TOML, with specific attention to the failure modes that cause production bugs.
The Data Loss Matrix
Every format has features that other formats lack. Converting from a richer format to a simpler one discards information. Converting from a simpler format to a richer one loses nothing; the data simply gains access to the target format's capabilities.
| From \ To | CSV | JSON | XML | YAML | TOML |
|---|---|---|---|---|---|
| CSV | - | Lossless (gains types/structure) | Lossless (gains markup) | Lossless (gains types) | Lossless (gains types, but flat only) |
| JSON | Loses nesting, types | - | Lossless (gains attributes, verbosity) | Lossless | Lossless (if shallow, no null) |
| XML | Loses everything except text | Loses attributes, namespaces, PI, comments | - | Loses attributes, namespaces | Loses attributes, namespaces, deep nesting |
| YAML | Loses nesting, types | Lossless (1.2) | Lossless (gains verbosity) | - | Lossless (if shallow, no null) |
| TOML | Loses nesting, types | Dates become strings | Lossless (gains verbosity) | Lossless | - |
The key takeaway: conversions up in expressiveness (CSV to JSON, JSON to XML) are generally lossless. Conversions down (XML to JSON, JSON to CSV) lose features. Lateral conversions (JSON to YAML, TOML to YAML) are usually lossless.
CSV to JSON: Gaining Structure
This is the most common data migration and usually the smoothest. Each CSV row becomes a JSON object, with column headers as keys.
Input CSV:

```csv
name,age,city
Alice,30,New York
Bob,25,London
```

Output JSON:

```json
[
  {"name": "Alice", "age": "30", "city": "New York"},
  {"name": "Bob", "age": "25", "city": "London"}
]
```

Issues to watch:
- Type inference. Should `30` become the number `30` or the string `"30"`? Most converters default to strings (safe) or attempt type detection (risky). If `age` is sometimes `"N/A"`, type detection breaks. ChangeThisFile's CSV to JSON converter preserves values as strings by default.
- Header row. The converter must know whether the first row is headers. Without headers, you get an array of arrays instead of an array of objects.
- Empty values. Is an empty cell `null`, an empty string `""`, or should the key be omitted? Different converters make different choices.
- Encoding. If the CSV is Latin-1 or Windows-1252 and the JSON is UTF-8 (mandatory for JSON), non-ASCII characters must be transcoded correctly.
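The string-preserving default can be sketched in a few lines of standard-library Python. `csv_to_json` and its `empty_as` parameter are illustrative names for this sketch, not any particular library's API:

```python
import csv
import io
import json

def csv_to_json(csv_text, empty_as=None):
    """Convert CSV text to a list of objects, never coercing types.

    "30" stays the string "30" and "02134" keeps its leading zero;
    empty cells become `empty_as` (default None, i.e. JSON null).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: (value if value != "" else empty_as) for key, value in row.items()}
        for row in reader
    ]

records = csv_to_json("name,age,city\nAlice,30,New York\nBob,,London\n")
print(json.dumps(records))
# [{"name": "Alice", "age": "30", "city": "New York"}, {"name": "Bob", "age": null, "city": "London"}]
```

Passing `empty_as=""` instead keeps empty cells as empty strings; the point is that the choice is explicit rather than left to the converter.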
For large CSV files (100MB+), convert to JSONL instead of JSON — one object per line, streamable, constant memory.
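A streaming CSV-to-JSONL pass needs only the standard library; the function name and file paths here are illustrative:

```python
import csv
import json

def csv_to_jsonl(csv_path, jsonl_path):
    """Stream a CSV into JSONL: one JSON object per line, constant memory.

    Rows are read and written one at a time, so a 100MB+ file never has
    to fit in memory the way a single JSON array would.
    """
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
```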
JSON to CSV: Flattening Hierarchy
This is the conversion most likely to lose data, because CSV can't represent JSON's nesting.
Input JSON:

```json
{
  "name": "Alice",
  "address": {
    "street": "123 Main St",
    "city": "New York"
  },
  "tags": ["premium", "verified"]
}
```

Output CSV (flattened):

```csv
name,address.street,address.city,tags
Alice,123 Main St,New York,"premium,verified"
```

Flattening strategies:
- Dot notation: nested objects become dotted keys: `address.street`, `address.city`. Works well for uniform nesting but creates long column names.
- Array serialization: arrays are joined with commas or serialized as JSON strings within CSV cells. Both approaches are lossy; the consumer must know the convention to reconstruct the array.
- Row multiplication: each array element gets its own row, with non-array fields repeated. This preserves data but explodes file size and makes post-processing complex.
Before converting JSON to CSV, check: is the JSON flat (no nesting)? If yes, the conversion is straightforward. If it has nesting, decide on a flattening strategy and accept the information loss. For spreadsheet consumption, JSON to XLSX is often better — XLSX supports multiple sheets (one per nested entity) and typed columns.
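A minimal dot-notation flattener, with arrays serialized as JSON strings (one of the lossy conventions above); `flatten` is an illustrative sketch, not a library function:

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-notation keys suitable for CSV columns.

    Arrays are JSON-encoded into a single cell, which is lossy by convention:
    the consumer must know to json.loads() those cells to recover the array.
    """
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, list):
            flat[path] = json.dumps(value)
        else:
            flat[path] = value
    return flat

flatten({"name": "Alice", "address": {"city": "New York"}, "tags": ["premium"]})
# {'name': 'Alice', 'address.city': 'New York', 'tags': '["premium"]'}
```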
XML to JSON: Losing the Attribute/Element Distinction
XML to JSON conversion is deceptively complex because XML has features with no JSON equivalent:
```xml
<!-- Input XML -->
<product id="P001" status="active">
  <name>Widget</name>
  <price currency="USD">29.99</price>
  <tags>
    <tag>popular</tag>
    <tag>sale</tag>
  </tags>
</product>
```

Different converters produce different JSON from this XML:
```json
// Convention 1: @ prefix for attributes
{"product": {"@id": "P001", "@status": "active", "name": "Widget",
             "price": {"@currency": "USD", "#text": "29.99"},
             "tags": {"tag": ["popular", "sale"]}}}

// Convention 2: flatten attributes into properties
{"product": {"id": "P001", "status": "active", "name": "Widget",
             "price": "29.99", "price_currency": "USD",
             "tags": ["popular", "sale"]}}

// Convention 3: Parker convention (drop attributes)
{"product": {"name": "Widget", "price": "29.99",
             "tags": {"tag": ["popular", "sale"]}}}
```

There is no standard for XML-to-JSON conversion. Every library makes different choices. When using ChangeThisFile's XML to JSON converter, verify the output matches your expected structure.
Lost in all conventions: namespaces, processing instructions, CDATA markers, comments, and mixed content. If you need to round-trip (XML to JSON to XML), the output XML will differ from the input.
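Convention 1 can be sketched with the standard library's ElementTree; this sketch ignores namespaces, comments, and mixed content, which are exactly the features noted above as lost:

```python
import xml.etree.ElementTree as ET

def xml_to_json(element):
    """Convert an Element using the @-prefix convention (convention 1).

    Attributes become "@"-prefixed keys, text alongside attributes becomes
    "#text", and repeated child tags collapse into a list.
    """
    node = {f"@{k}": v for k, v in element.attrib.items()}
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        if not node:
            return text
        if text:
            node["#text"] = text
        return node
    for child in children:
        value = xml_to_json(child)
        if child.tag in node:
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node

xml_to_json(ET.fromstring('<price currency="USD">29.99</price>'))
# {'@currency': 'USD', '#text': '29.99'}
```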
YAML to JSON: Usually Clean
YAML 1.2 is a JSON superset, so YAML to JSON conversion is straightforward. YAML's extra features (comments, anchors, multiline strings) are resolved during parsing — the resulting data structure maps directly to JSON.
What's lost:
- Comments are stripped (JSON has no comments)
- Anchors and aliases are resolved to duplicated data
- Merge keys (`<<`) are expanded
- Multiline strings become single-line strings with embedded `\n`
Type coercion warning: if a YAML 1.1 parser reads `NO` as boolean `false`, the JSON output will contain `false`; the type coercion happens during YAML parsing, before JSON serialization. The JSON is "correct" for what the YAML parser produced, but may not be what you intended.
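The effect is easy to reproduce with PyYAML, a third-party parser whose default resolver follows YAML 1.1 boolean rules:

```python
import json
import yaml  # PyYAML (third-party): pip install pyyaml

# PyYAML resolves NO/no/off as booleans and trims trailing zeros on floats
data = yaml.safe_load("country: NO\nversion: 3.10\n")
print(json.dumps(data))
# {"country": false, "version": 3.1}
```

A YAML 1.2 parser (strict core schema) would keep `NO` as the string `"NO"`; the coercion depends entirely on which parser sits in the pipeline.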
The reverse — JSON to YAML — is lossless. JSON's entire type system maps to YAML. The YAML output can then be enhanced with comments and multiline strings.
JSON to YAML / YAML to TOML: Config Migration
Configuration format migration is common when switching ecosystems or adopting a new standard.
JSON to YAML: Lossless. Every JSON value has a YAML equivalent. After converting, add comments to document your config values — this is the primary reason to migrate.
YAML to TOML: Usually lossless for application configs (1-3 nesting levels). May lose data for deep nesting (TOML becomes verbose past 3 levels), heterogeneous arrays (TOML requires same-type arrays), null values (TOML has no null), and anchors/aliases (TOML has no reference system).
JSON to TOML: Works well for flat-to-moderately-nested JSON. Deep nesting becomes verbose. Null values must be handled (omit the key or use a sentinel value). Dates in ISO 8601 string format can be converted to TOML's native date type.
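Null handling before a JSON-to-TOML conversion can be as simple as recursively dropping `None` values; `drop_nulls` is an illustrative helper, not a library call:

```python
def drop_nulls(value):
    """Recursively omit null values, since TOML cannot represent them.

    Omitting the key is one of the two options above; the other is
    substituting a sentinel value agreed on with the config's consumers.
    """
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value if v is not None]
    return value

drop_nulls({"host": "example.com", "proxy": None, "ports": [80, None]})
# {'host': 'example.com', 'ports': [80]}
```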
Steps for any config migration:
- Convert the file format with a converter.
- Validate the output parses correctly in the target format.
- Re-add comments (comments are lost in all format conversions).
- Test with your application to catch any type interpretation differences.
- Keep the old config file as a reference during the transition period.
Encoding Traps During Migration
Character encoding issues are the most common source of silent data corruption during format migration:
| Migration | Encoding Risk | Prevention |
|---|---|---|
| CSV to JSON | CSV encoding unknown (might be Latin-1). JSON requires UTF-8. | Detect CSV encoding first (chardet in Python). Transcode to UTF-8 before converting. |
| XML to JSON | XML declares encoding in prolog. JSON is always UTF-8. | Read XML with declared encoding. Write JSON as UTF-8. |
| JSON to CSV | JSON is UTF-8. CSV has no encoding declaration. | Write CSV as UTF-8 with BOM for Excel. Document the encoding. |
| Any to YAML | YAML is UTF-8 by default but allows other encodings. | Always use UTF-8 for YAML. Other encodings are rare and poorly supported. |
| Any to TOML | TOML mandates UTF-8. | Transcode source to UTF-8 before conversion. |
The safe default: convert source data to UTF-8 before any format conversion. All modern formats (JSON, YAML, TOML) are UTF-8. CSV is the only common format where encoding is ambiguous.
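A pragmatic transcoding pass tries a short list of candidate encodings. A detection library such as chardet gives better guesses; the fixed fallback list below is a simpler assumption for the sketch:

```python
def transcode_to_utf8(path, out_path):
    """Rewrite a file as UTF-8, guessing the source encoding by trial.

    latin-1 never raises, so the loop always finds a decoding; that also
    means it can silently mis-decode, which is why detection libraries exist.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return encoding
```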
Type Coercion During Migration
Type interpretation changes when data moves between formats:
| Data | In CSV | In JSON | In YAML | In TOML |
|---|---|---|---|---|
| `42` | String `"42"` | Number `42` | Number `42` | Integer `42` |
| `3.10` | String `"3.10"` | Number `3.1` | Number `3.1` | Float `3.1` |
| `true` | String `"true"` | Boolean `true` | Boolean `true` | Boolean `true` |
| `NO` | String `"NO"` | String `"NO"` | Boolean `false` (1.1) / String `"NO"` (1.2) | String `"NO"` |
| `2026-03-19` | String `"2026-03-19"` | String `"2026-03-19"` | Date (if parsed) | Date `2026-03-19` |
| `null` | String `"null"` or empty | Null `null` | Null `null` | Not representable |
| `02134` | String `"02134"` | String `"02134"` or number `2134` | Number `1116` (octal in 1.1) / `2134` (1.2) | String `"02134"` |
The most dangerous migrations for type coercion: CSV to YAML (everything was a string, now some strings become booleans or numbers), and JSON to YAML (numbers and booleans are preserved, but adding new YAML values manually may trigger coercion). Always validate output data against expected types after conversion.
Validation After Conversion
Every format migration should include a validation step. Don't trust that the conversion is correct — verify it.
- Record count. The output should have the same number of records as the input. Count rows in CSV, array length in JSON, record count in JSONL.
- Field count. Each output record should have the same number of fields as the corresponding input record (accounting for flattening or nesting changes).
- Type spot-checks. Verify that leading zeros are preserved, dates aren't reformatted, booleans haven't been coerced, and numbers haven't changed precision.
- Encoding spot-checks. Find records with non-ASCII characters (accents, CJK, emoji) and verify they survived the conversion.
- Round-trip test. Convert A to B, then B back to A. Compare with the original. Any differences highlight information loss.
- Edge case checks. Empty fields, null values, extremely long strings, numbers with many decimal places, Unicode surrogate pairs.
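A round-trip check for the CSV-to-JSON path can be sketched with in-memory strings; a real check would diff files and report the first divergence:

```python
import csv
import io

def round_trips(csv_text):
    """Convert CSV -> list of objects -> CSV and compare with the original.

    Any difference (quoting style, empty vs null, column order) marks a
    spot where information was transformed in flight.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], [dict(zip(rows[0], r)) for r in rows[1:]]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue() == csv_text

round_trips("name,age\nAlice,30\n")  # True for simple data
```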
```python
# Quick validation in Python: compare record counts after CSV -> JSON
import csv
import json

with open('data.csv', newline='', encoding='utf-8') as f:
    csv_count = sum(1 for _ in csv.reader(f)) - 1  # minus header row

with open('data.json', encoding='utf-8') as f:
    json_count = len(json.load(f))

assert csv_count == json_count, f"Record count mismatch: CSV={csv_count}, JSON={json_count}"
```
Format migration is not just a conversion — it's a translation between data models. CSV's flat strings, JSON's typed trees, XML's attributed markup, and YAML's typed indentation each represent data differently. Understanding these differences is the difference between a clean migration and weeks of debugging silent data corruption.
The practical workflow: know what you're losing before converting (check the data loss matrix), handle encoding explicitly (UTF-8 everywhere), validate the output (record counts, type checks, encoding checks), and keep the original data until you've verified the migration is complete. Format conversion tools handle the mechanics; understanding the data model differences is your job.