Data migration between formats is one of the most common file operations and one of the least understood. "Convert CSV to JSON" sounds simple, but the details determine whether you get clean data or silent corruption: Does the CSV have a header row? Are dates ISO 8601 or locale-specific? Do numbers have leading zeros that must be preserved? Is the CSV semicolon-delimited because it was exported from European Excel?
This guide covers the most common format migrations, what happens to your data in each conversion, where data is lost or transformed, and how to validate the result. The focus is practical: real conversions between CSV, JSON, XML, YAML, and TOML, with specific attention to the failure modes that cause production bugs.
The Data Loss Matrix
Every format has features that other formats lack. Converting from a richer format to a simpler one discards information. Converting from a simpler format to a richer one loses nothing; the data simply gains access to the target format's capabilities.
| From \ To | CSV | JSON | XML | YAML | TOML |
|---|---|---|---|---|---|
| CSV | - | Lossless (gains types/structure) | Lossless (gains markup) | Lossless (gains types) | Lossless (gains types, but flat only) |
| JSON | Loses nesting, types | - | Lossless (gains attributes, verbosity) | Lossless | Lossless (if shallow, no null) |
| XML | Loses everything except text | Loses attributes, namespaces, PI, comments | - | Loses attributes, namespaces | Loses attributes, namespaces, deep nesting |
| YAML | Loses nesting, types | Lossless (1.2) | Lossless (gains verbosity) | - | Lossless (if shallow, no null) |
| TOML | Loses nesting, types | Dates become strings | Lossless (gains verbosity) | Lossless | - |
The key takeaway: conversions up in expressiveness (CSV to JSON, JSON to XML) are generally lossless. Conversions down (XML to JSON, JSON to CSV) lose features. Lateral conversions (JSON to YAML, TOML to YAML) are usually lossless.
CSV to JSON: Gaining Structure
This is the most common data migration and usually the smoothest. Each CSV row becomes a JSON object, with column headers as keys.
Input CSV:

```csv
name,age,city
Alice,30,New York
Bob,25,London
```

Output JSON:

```json
[
  {"name": "Alice", "age": "30", "city": "New York"},
  {"name": "Bob", "age": "25", "city": "London"}
]
```

Issues to watch:
- Type inference. Should `30` become the number `30` or the string `"30"`? Most converters default to strings (safe) or attempt type detection (risky). If `age` is sometimes `"N/A"`, type detection breaks. ChangeThisFile's CSV to JSON converter preserves values as strings by default.
- Header row. The converter must know whether the first row is headers. Without headers, you get an array of arrays instead of an array of objects.
- Empty values. Is an empty cell `null`, an empty string `""`, or should the key be omitted? Different converters make different choices.
- Encoding. If the CSV is Latin-1 or Windows-1252 and the JSON is UTF-8 (mandatory for JSON), non-ASCII characters must be transcoded correctly.
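The string-preserving default can be sketched in a few lines of standard-library Python. `csv_to_json` and its `empty_as` parameter are illustrative names for this sketch, not any particular library's API:

```python
import csv
import io
import json

def csv_to_json(csv_text, empty_as=None):
    """Convert CSV text to a list of objects, never coercing types.

    "30" stays the string "30" and "02134" keeps its leading zero;
    empty cells become `empty_as` (default None, i.e. JSON null).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: (value if value != "" else empty_as) for key, value in row.items()}
        for row in reader
    ]

records = csv_to_json("name,age,city\nAlice,30,New York\nBob,,London\n")
print(json.dumps(records))
# [{"name": "Alice", "age": "30", "city": "New York"}, {"name": "Bob", "age": null, "city": "London"}]
```

Passing `empty_as=""` instead keeps empty cells as empty strings; the point is that the choice is explicit rather than left to the converter.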
For large CSV files (100MB+), convert to JSONL instead of JSON — one object per line, streamable, constant memory.
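A streaming CSV-to-JSONL pass needs only the standard library; the function name and file paths here are illustrative:

```python
import csv
import json

def csv_to_jsonl(csv_path, jsonl_path):
    """Stream a CSV into JSONL: one JSON object per line, constant memory.

    Rows are read and written one at a time, so a 100MB+ file never has
    to fit in memory the way a single JSON array would.
    """
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
```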
JSON to CSV: Flattening Hierarchy
This is the conversion most likely to lose data, because CSV can't represent JSON's nesting.
Input JSON:

```json
{
  "name": "Alice",
  "address": {
    "street": "123 Main St",
    "city": "New York"
  },
  "tags": ["premium", "verified"]
}
```

Output CSV (flattened):

```csv
name,address.street,address.city,tags
Alice,123 Main St,New York,"premium,verified"
```

Flattening strategies:
- Dot notation: nested objects become dotted keys: `address.street`, `address.city`. Works well for uniform nesting but creates long column names.
- Array serialization: arrays are joined with commas or serialized as JSON strings within CSV cells. Both approaches are lossy; the consumer must know the convention to reconstruct the array.
- Row multiplication: each array element gets its own row, with non-array fields repeated. This preserves data but explodes file size and makes post-processing complex.
Before converting JSON to CSV, check: is the JSON flat (no nesting)? If yes, the conversion is straightforward. If it has nesting, decide on a flattening strategy and accept the information loss. For spreadsheet consumption, JSON to XLSX is often better — XLSX supports multiple sheets (one per nested entity) and typed columns.
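A minimal dot-notation flattener, with arrays serialized as JSON strings (one of the lossy conventions above); `flatten` is an illustrative sketch, not a library function:

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-notation keys suitable for CSV columns.

    Arrays are JSON-encoded into a single cell, which is lossy by convention:
    the consumer must know to json.loads() those cells to recover the array.
    """
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, list):
            flat[path] = json.dumps(value)
        else:
            flat[path] = value
    return flat

flatten({"name": "Alice", "address": {"city": "New York"}, "tags": ["premium"]})
# {'name': 'Alice', 'address.city': 'New York', 'tags': '["premium"]'}
```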
XML to JSON: Losing the Attribute/Element Distinction
XML to JSON conversion is deceptively complex because XML has features with no JSON equivalent:
```xml
<!-- Input XML -->
<product id="P001" status="active">
  <name>Widget</name>
  <price currency="USD">29.99</price>
  <tags>
    <tag>popular</tag>
    <tag>sale</tag>
  </tags>
</product>
```

Different converters produce different JSON from this XML:
```json
// Convention 1: @ prefix for attributes
{"product": {"@id": "P001", "@status": "active", "name": "Widget",
             "price": {"@currency": "USD", "#text": "29.99"},
             "tags": {"tag": ["popular", "sale"]}}}

// Convention 2: flatten attributes into properties
{"product": {"id": "P001", "status": "active", "name": "Widget",
             "price": "29.99", "price_currency": "USD",
             "tags": ["popular", "sale"]}}

// Convention 3: Parker convention (drop attributes)
{"product": {"name": "Widget", "price": "29.99",
             "tags": {"tag": ["popular", "sale"]}}}
```

There is no standard for XML-to-JSON conversion. Every library makes different choices. When using ChangeThisFile's XML to JSON converter, verify the output matches your expected structure.
Lost in all conventions: namespaces, processing instructions, CDATA markers, comments, and mixed content. If you need to round-trip (XML to JSON to XML), the output XML will differ from the input.
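Convention 1 can be sketched with the standard library's ElementTree; this sketch ignores namespaces, comments, and mixed content, which are exactly the features noted above as lost:

```python
import xml.etree.ElementTree as ET

def xml_to_json(element):
    """Convert an Element using the @-prefix convention (convention 1).

    Attributes become "@"-prefixed keys, text alongside attributes becomes
    "#text", and repeated child tags collapse into a list.
    """
    node = {f"@{k}": v for k, v in element.attrib.items()}
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        if not node:
            return text
        if text:
            node["#text"] = text
        return node
    for child in children:
        value = xml_to_json(child)
        if child.tag in node:
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node

xml_to_json(ET.fromstring('<price currency="USD">29.99</price>'))
# {'@currency': 'USD', '#text': '29.99'}
```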
YAML to JSON: Usually Clean
YAML 1.2 is a JSON superset, so YAML to JSON conversion is straightforward. YAML's extra features (comments, anchors, multiline strings) are resolved during parsing — the resulting data structure maps directly to JSON.
What's lost:
- Comments are stripped (JSON has no comments)
- Anchors and aliases are resolved to duplicated data
- Merge keys (`<<`) are expanded
- Multiline strings become single-line strings with embedded `\n`
Type coercion warning: if a YAML 1.1 parser reads `NO` as boolean `false`, the JSON output will contain `false`; the type coercion happens during YAML parsing, before JSON serialization. The JSON is "correct" for what the YAML parser produced, but may not be what you intended.
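The effect is easy to reproduce with PyYAML, a third-party parser whose default resolver follows YAML 1.1 boolean rules:

```python
import json
import yaml  # PyYAML (third-party): pip install pyyaml

# PyYAML resolves NO/no/off as booleans and trims trailing zeros on floats
data = yaml.safe_load("country: NO\nversion: 3.10\n")
print(json.dumps(data))
# {"country": false, "version": 3.1}
```

A YAML 1.2 parser (strict core schema) would keep `NO` as the string `"NO"`; the coercion depends entirely on which parser sits in the pipeline.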
The reverse — JSON to YAML — is lossless. JSON's entire type system maps to YAML. The YAML output can then be enhanced with comments and multiline strings.
JSON to YAML / YAML to TOML: Config Migration
Configuration format migration is common when switching ecosystems or adopting a new standard.
JSON to YAML: Lossless. Every JSON value has a YAML equivalent. After converting, add comments to document your config values — this is the primary reason to migrate.
YAML to TOML: Usually lossless for application configs (1-3 nesting levels). May lose data for deep nesting (TOML becomes verbose past 3 levels), heterogeneous arrays (TOML requires same-type arrays), null values (TOML has no null), and anchors/aliases (TOML has no reference system).
JSON to TOML: Works well for flat-to-moderately-nested JSON. Deep nesting becomes verbose. Null values must be handled (omit the key or use a sentinel value). Dates in ISO 8601 string format can be converted to TOML's native date type.
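Null handling before a JSON-to-TOML conversion can be as simple as recursively dropping `None` values; `drop_nulls` is an illustrative helper, not a library call:

```python
def drop_nulls(value):
    """Recursively omit null values, since TOML cannot represent them.

    Omitting the key is one of the two options above; the other is
    substituting a sentinel value agreed on with the config's consumers.
    """
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value if v is not None]
    return value

drop_nulls({"host": "example.com", "proxy": None, "ports": [80, None]})
# {'host': 'example.com', 'ports': [80]}
```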
Steps for any config migration:
- Convert the file format with a converter.
- Validate the output parses correctly in the target format.
- Re-add comments (comments are lost in all format conversions).
- Test with your application to catch any type interpretation differences.
- Keep the old config file as a reference during the transition period.
Encoding Traps During Migration
Character encoding issues are the most common source of silent data corruption during format migration:
| Migration | Encoding Risk | Prevention |
|---|---|---|
| CSV to JSON | CSV encoding unknown (might be Latin-1). JSON requires UTF-8. | Detect CSV encoding first (chardet in Python). Transcode to UTF-8 before converting. |
| XML to JSON | XML declares encoding in prolog. JSON is always UTF-8. | Read XML with declared encoding. Write JSON as UTF-8. |
| JSON to CSV | JSON is UTF-8. CSV has no encoding declaration. | Write CSV as UTF-8 with BOM for Excel. Document the encoding. |
| Any to YAML | YAML is UTF-8 by default but allows other encodings. | Always use UTF-8 for YAML. Other encodings are rare and poorly supported. |
| Any to TOML | TOML mandates UTF-8. | Transcode source to UTF-8 before conversion. |
The safe default: convert source data to UTF-8 before any format conversion. All modern formats (JSON, YAML, TOML) are UTF-8. CSV is the only common format where encoding is ambiguous.
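A pragmatic transcoding pass tries a short list of candidate encodings. A detection library such as chardet gives better guesses; the fixed fallback list below is a simpler assumption for the sketch:

```python
def transcode_to_utf8(path, out_path):
    """Rewrite a file as UTF-8, guessing the source encoding by trial.

    latin-1 never raises, so the loop always finds a decoding; that also
    means it can silently mis-decode, which is why detection libraries exist.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return encoding
```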
Type Coercion During Migration
Type interpretation changes when data moves between formats:
| Data | In CSV | In JSON | In YAML | In TOML |
|---|---|---|---|---|
| `42` | String `"42"` | Number `42` | Number `42` | Integer `42` |
| `3.10` | String `"3.10"` | Number `3.1` | Number `3.1` | Float `3.1` |
| `true` | String `"true"` | Boolean `true` | Boolean `true` | Boolean `true` |
| `NO` | String `"NO"` | String `"NO"` | Boolean `false` (1.1) / String `"NO"` (1.2) | String `"NO"` |
| `2026-03-19` | String `"2026-03-19"` | String `"2026-03-19"` | Date (if parsed) | Date `2026-03-19` |
| `null` | String `"null"` or empty | Null `null` | Null `null` | Not representable |
| `02134` | String `"02134"` | String `"02134"` or number `2134` | Number `1116` (octal in 1.1) / `2134` (1.2) | String `"02134"` |
The most dangerous migrations for type coercion: CSV to YAML (everything was a string, now some strings become booleans or numbers), and JSON to YAML (numbers and booleans are preserved, but adding new YAML values manually may trigger coercion). Always validate output data against expected types after conversion.
Validation After Conversion
Every format migration should include a validation step. Don't trust that the conversion is correct — verify it.
- Record count. The output should have the same number of records as the input. Count rows in CSV, array length in JSON, record count in JSONL.
- Field count. Each output record should have the same number of fields as the corresponding input record (accounting for flattening or nesting changes).
- Type spot-checks. Verify that leading zeros are preserved, dates aren't reformatted, booleans haven't been coerced, and numbers haven't changed precision.
- Encoding spot-checks. Find records with non-ASCII characters (accents, CJK, emoji) and verify they survived the conversion.
- Round-trip test. Convert A to B, then B back to A. Compare with the original. Any differences highlight information loss.
- Edge case checks. Empty fields, null values, extremely long strings, numbers with many decimal places, Unicode surrogate pairs.
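A round-trip check for the CSV-to-JSON path can be sketched with in-memory strings; a real check would diff files and report the first divergence:

```python
import csv
import io

def round_trips(csv_text):
    """Convert CSV -> list of objects -> CSV and compare with the original.

    Any difference (quoting style, empty vs null, column order) marks a
    spot where information was transformed in flight.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], [dict(zip(rows[0], r)) for r in rows[1:]]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue() == csv_text

round_trips("name,age\nAlice,30\n")  # True for simple data
```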
```python
# Quick validation in Python: compare record counts after CSV -> JSON
import csv
import json

with open('data.csv', newline='', encoding='utf-8') as f:
    csv_count = sum(1 for _ in csv.reader(f)) - 1  # minus header row

with open('data.json', encoding='utf-8') as f:
    json_count = len(json.load(f))

assert csv_count == json_count, f"Record count mismatch: CSV={csv_count}, JSON={json_count}"
```
Format migration is not just a conversion — it's a translation between data models. CSV's flat strings, JSON's typed trees, XML's attributed markup, and YAML's typed indentation each represent data differently. Understanding these differences is the difference between a clean migration and weeks of debugging silent data corruption.
The practical workflow: know what you're losing before converting (check the data loss matrix), handle encoding explicitly (UTF-8 everywhere), validate the output (record counts, type checks, encoding checks), and keep the original data until you've verified the migration is complete. Format conversion tools handle the mechanics; understanding the data model differences is your job.