There's a moment in every data project where CSV stops working. Maybe it's the 2GB file that takes 4 minutes to load and crashes your script. Maybe it's the Excel workbook that truncates at row 1,048,576 and silently drops the rest. Maybe it's the third time this week you've had to manually specify column types because CSV doesn't store them.
CSV and Excel are spreadsheet-era formats designed for data that fits on a screen. Modern datasets — web analytics, IoT sensors, financial transactions, genomics — routinely reach millions or billions of rows. At this scale, the overhead of text-based formats becomes a real cost: storage, memory, parsing time, and lost type information.
This guide covers the six formats that solve these problems, each optimized for a different access pattern. The goal isn't to pick one format for everything — it's to understand which format matches your workload.
CSV at Scale: Where It Breaks
Before discussing alternatives, here's exactly why CSV fails at scale:
- No schema. Every consumer must independently infer column types. A column of ZIP codes (02134, 10001) has no way to declare itself as string. Every reader that auto-detects types will strip the leading zero.
- Text encoding overhead. The number 3.14159265358979 is 16 bytes in CSV but 8 bytes as a binary float64. Dates, booleans, and integers carry similar overhead: 2-4x larger as text than binary.
- Memory amplification. Loading a 1GB CSV into pandas uses 5-10GB of RAM; loading a 1GB Parquet file uses 1-2GB. The difference is Python object overhead for strings and the lack of columnar compression.
- No column pruning. Reading 3 columns from a 50-column CSV requires scanning every byte of every row. Columnar formats read only the columns you need.
- No predicate pushdown. Filtering rows (WHERE revenue > 1000) in CSV requires reading every row, parsing it, and then checking the condition. Parquet stores min/max statistics per column chunk, so entire chunks can be skipped without reading.
- No compression. CSV is plain text. You can gzip it, but then you lose random access and the ability to split the file for parallel processing. Columnar formats compress internally while remaining queryable.
CSV is still the right choice for interchange (maximum compatibility), small datasets (under 100MB), and human inspection. For everything else, read on.
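The text-versus-binary overhead is easy to verify with nothing but the standard library — the same value that costs 16 bytes as CSV text fits in 8 bytes as a float64:

```python
import struct

text = "3.14159265358979"               # the value as CSV stores it
binary = struct.pack("d", float(text))  # the same value as a binary float64

print(len(text.encode("ascii")))  # 16 bytes as text
print(len(binary))                # 8 bytes as binary
```

Every numeric column in a large CSV pays this tax on every row, before parsing cost is even counted.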
Parquet: The Default for Analytics
Apache Parquet is a columnar storage format designed for analytical workloads. It stores data by column instead of by row, which means reading a single column from a 100-column dataset only touches ~1% of the file. Parquet files are typically 3-10x smaller than equivalent CSV and 10-30x faster to read for analytical queries.
How it works: Parquet divides data into row groups (default ~128MB each). Within each row group, data is stored column by column. Each column chunk is independently compressed (Snappy or Zstd) and carries metadata: data type, min/max values, null count, and distinct count. This metadata enables predicate pushdown — a query filter can skip entire row groups without decompressing them.
Why Parquet Dominates Data Lakes
Compression. Columnar storage compresses dramatically because values in the same column have similar patterns. A column of country names with 50 unique values across 10 million rows compresses to near-zero via dictionary encoding. Typical Parquet compression ratios vs. CSV: 5-10x for string-heavy data, 3-5x for numeric data.
Schema. Every Parquet file carries its full schema: column names, data types (int32, int64, float, double, string, binary, timestamp, decimal), and nullability. No type inference needed. No leading-zero-stripped ZIP codes. No dates-as-serial-numbers.
Partitioning. Parquet files can be partitioned by column values (e.g., data/year=2026/month=03/part-001.parquet). Queries that filter on partition columns skip entire files. A query for March 2026 data in a 5-year dataset reads ~0.5% of the total data.
Ecosystem. Parquet is read natively by pandas, polars, DuckDB, Spark, Presto/Trino, Athena, BigQuery, Snowflake, Databricks, R (arrow package), and Julia. It's the de facto standard for data lakes on AWS S3, GCP Cloud Storage, and Azure Blob Storage.
Parquet's Trade-offs
Not human-readable. You can't open a Parquet file in a text editor. For inspection, you need parquet-tools, DuckDB (SELECT * FROM 'file.parquet' LIMIT 10), or pandas (pd.read_parquet('file.parquet')).
Write-once. Parquet files are immutable. You can't update a single row. You write the entire file (or partition) and replace it. This is fine for analytics (read-heavy) but wrong for transactional data (write-heavy).
Row-level access is slow. If you need row 7,531,422 specifically, Parquet must scan row groups until it finds it. Databases with indexes are better for point lookups. Parquet is optimized for scanning, not seeking.
To get data into Parquet: convert CSV with pandas or pyarrow, or use DuckDB to query the CSV directly and write Parquet: COPY (SELECT * FROM 'data.csv') TO 'data.parquet'.
Arrow / Feather: Fastest In-Memory Interchange
Apache Arrow is an in-memory columnar format. Feather (Feather V2) is Arrow's file format — a direct serialization of Arrow's in-memory layout to disk. The key feature: zero-copy reads. When you load a Feather file, the data goes into memory in exactly the format that analytics libraries use internally. There's no deserialization step.
Read speed comparison (1GB of data):
| Format | Read Time | Memory Usage |
|---|---|---|
| CSV | ~30 seconds | 5-10 GB |
| Parquet (Snappy) | ~2 seconds | 1-2 GB |
| Feather (uncompressed) | ~0.3 seconds | ~1 GB |
| Feather (LZ4) | ~0.5 seconds | ~1 GB |
Feather's speed advantage over Parquet comes from its simpler structure: no row groups, no page headers, no complex encoding schemes. It's Arrow's memory layout written directly to disk. The trade-off: Feather files are larger than Parquet (less sophisticated compression) and don't support partitioning.
Best use case: Intermediate files in a pipeline. Read CSV once, convert to Arrow, and all subsequent reads are near-instantaneous. Passing data between Python and R processes. Caching expensive transformations.
Not ideal for: Long-term storage (Parquet compresses better), sharing with non-technical users (no universal viewer), or data lakes (no partitioning).
SQLite: A Database in a Single File
SQLite is a relational database engine in a single file. It's not a data format in the traditional sense — it's a fully ACID-compliant database with SQL support, indexes, transactions, and constraints. But it solves the large-dataset problem differently from columnar formats: instead of optimizing scans, it optimizes queries.
What SQLite does that file formats can't:
- Indexes. Create an index on a column and point lookups go from O(n) to O(log n). Finding one row in 100 million takes microseconds, not minutes.
- Joins. Relating data across multiple tables is native SQL. In CSV, you'd need to load both files into pandas and merge.
- Updates. Modify individual rows without rewriting the entire file. Insert new data incrementally. Delete rows. All impossible with CSV or Parquet.
- Constraints. NOT NULL, UNIQUE, FOREIGN KEY, CHECK — enforce data quality at the storage level.
Size: SQLite files are typically 1.5-3x larger than equivalent Parquet (no columnar compression) but smaller than CSV. A 1GB CSV becomes ~800MB in SQLite with appropriate types.
Performance: For analytical scans (SUM of all rows), SQLite is slower than Parquet and DuckDB. For indexed lookups and joins, SQLite is dramatically faster. For mixed workloads (some scans, some lookups), SQLite is often the best single-file option.
To get data into SQLite: the sqlite3 command-line shell imports CSV with .import data.csv tablename and exports it with .mode csv. Programmatically, Python's built-in csv and sqlite3 modules load a CSV into a table in a few lines.
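A minimal sketch of that load using only the standard library (the CSV content, table, and column names are made up for illustration; swap :memory: for a filename to get a persistent database):

```python
import csv
import io
import sqlite3

csv_text = "id,name,score\n1,Alice,95\n2,Bob,87\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")

rows = csv.reader(io.StringIO(csv_text))
next(rows)  # skip the header row
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)

conn.execute("CREATE INDEX idx_score ON scores(score)")  # O(log n) lookups from here on
top = conn.execute("SELECT name FROM scores WHERE score > 90").fetchone()[0]
print(top)
```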
HDF5: Hierarchical Data for Science
HDF5 (Hierarchical Data Format version 5) is the standard for scientific and engineering data. It stores datasets in a filesystem-like hierarchy: groups (directories) containing datasets (arrays) with attributes (metadata). A single HDF5 file can contain hundreds of datasets of different shapes and types.
What makes HDF5 unique:
- Multidimensional data. HDF5 natively stores N-dimensional arrays. A 3D temperature grid (1000 x 1000 x 100 time steps) is a single dataset with efficient slicing on any axis.
- Chunked I/O. Datasets are stored in chunks, and you can read a slice (e.g., one time step of the temperature grid) without loading the full dataset.
- Compression per-dataset. Different datasets in the same file can use different compression algorithms (gzip, LZF, szip) at different levels.
- Self-describing. Attributes on datasets store units, coordinate systems, experiment parameters — metadata travels with the data.
HDF5 in Python: The h5py library provides NumPy-like access: f['temperatures'][0:10, :, :] reads the first 10 rows of a 3D dataset. pandas has pd.read_hdf() for tabular data stored in HDF5 via PyTables.
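A small end-to-end sketch, assuming h5py and NumPy are installed (the grid dimensions, dataset name, and file name are illustrative):

```python
import numpy as np
import h5py

temps = np.random.rand(100, 50, 50)  # time x lat x lon grid

with h5py.File("climate.h5", "w") as f:
    dset = f.create_dataset("temperatures", data=temps,
                            chunks=(1, 50, 50),    # one chunk per time step
                            compression="gzip")
    dset.attrs["units"] = "kelvin"  # metadata travels with the data

with h5py.File("climate.h5", "r") as f:
    step = f["temperatures"][0]     # reads one chunk, not the whole array
    units = f["temperatures"].attrs["units"]
print(step.shape, units)
```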
When to use HDF5: Multidimensional scientific data (climate models, medical imaging, particle physics), large collections of related arrays, and any dataset where the hierarchy of groups and attributes is valuable. For flat tabular data, Parquet is simpler and better integrated with analytics tools.
JSON Lines: Streaming JSON for Large Data
JSONL (JSON Lines, also called NDJSON — Newline Delimited JSON) puts one complete JSON object per line. It solves JSON's fatal flaw for large data: standard JSON requires parsing the entire document before you can access any element. A 10GB JSON array must be fully loaded and parsed. A 10GB JSONL file can be processed one line at a time.
```
{"id":1,"name":"Alice","score":95}
{"id":2,"name":"Bob","score":87}
{"id":3,"name":"Charlie","score":92}
```

Why JSONL matters for large data:
- Streaming. Process line by line with constant memory: for line in open('big.jsonl'): record = json.loads(line).
- Appendable. Add new records by appending lines. No need to read-parse-modify-write the entire file.
- Schema-flexible. Different records can have different fields. Row 1 can have 5 fields, row 1000 can have 8. This is impossible in CSV (fixed columns) and Parquet (fixed schema).
- Rich types per record. Each JSON object preserves nested structures, arrays, and type information that CSV loses.
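The properties above fit in a few lines of standard-library Python (file name and record contents are made up for illustration):

```python
import json

events = [
    {"id": 1, "name": "Alice", "score": 95},
    {"id": 2, "tags": ["new", "trial"]},  # different fields: fine in JSONL
]
with open("events.jsonl", "w") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Appending a record never requires rewriting the file
with open("events.jsonl", "a") as f:
    f.write(json.dumps({"id": 3, "score": 92}) + "\n")

# Stream back one line at a time with constant memory
records = [json.loads(line) for line in open("events.jsonl")]
print(len(records))
```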
When JSONL is the right choice: Log files, event streams, semi-structured data where records have varying schemas, data exchange between microservices, and any dataset that needs to be appended to incrementally. For structured analytics on fixed-schema data, Parquet is better. For human editing, standard JSON is better. JSONL is the sweet spot for machine-generated, machine-consumed, variable-schema data at scale.
Convert between these formats as needed: CSV to JSONL (also called NDJSON) to add structure for streaming pipelines, or JSONL back to CSV to flatten to tabular form.
Excel's Ceiling: Why 1M Rows Isn't Enough
Excel's 1,048,576-row limit is a hard wall, not a soft guideline. But the practical limits are much lower:
- Performance degrades from ~100K rows. Formulas that reference large ranges become noticeably slow. Scrolling lags. Saving takes seconds.
- Memory spikes at ~500K rows. A 500K-row workbook with 20 columns and some formulas uses 2-4GB of RAM. The 32-bit version of Excel caps at 2GB.
- Charts on large datasets freeze the UI. A scatter plot with 100K points can take 30+ seconds to render and makes the workbook sluggish.
- Pivot tables on large sources are unstable. Refreshing a pivot table backed by 1M rows can take minutes and occasionally crashes Excel.
If your workflow hits these limits, you have three options:
- Stay in Excel with Power Query/Power Pivot. These tools use a columnar in-memory engine (VertiPaq) that handles millions of rows efficiently. But the learning curve is steep and the data doesn't live in regular cells.
- Move to a database. Import your data into PostgreSQL, SQLite, or DuckDB. Query with SQL. Export subsets to Excel for presentation.
- Move to Parquet + Python/R. Convert your CSV to Parquet, analyze in pandas/polars/DuckDB, and export summaries to XLSX for sharing.
Decision Matrix: Choosing by Use Case
| Use Case | Recommended Format | Why |
|---|---|---|
| Data lake / warehouse | Parquet | Columnar compression, partitioning, universal analytics tool support |
| Sharing with non-technical users | XLSX or CSV | Universal readability, Excel compatible |
| Pipeline intermediate files | Feather/Arrow | Fastest read/write, zero-copy in pandas/polars |
| Application database (single file) | SQLite | ACID transactions, indexes, SQL queries, updates |
| Scientific / multidimensional data | HDF5 | N-dimensional arrays, chunked I/O, metadata attributes |
| Event streams / logs | JSONL | Appendable, streaming, flexible schema |
| One-time data exchange | CSV | Maximum compatibility, human-readable |
| Archival (decades) | CSV + Parquet | CSV for guaranteed readability, Parquet for efficient access |
The common pattern for large data projects: receive data as CSV, convert to Parquet for analysis, analyze in DuckDB or pandas, export summaries to XLSX or CSV for stakeholders. Raw CSV is archived alongside Parquet for long-term accessibility.
The format landscape for large datasets is mature and well-differentiated. CSV is for interchange and small data. Parquet is for analytics. Arrow is for speed. SQLite is for queries. HDF5 is for science. JSONL is for streams. Choosing wrong means paying a continuous tax in storage, speed, or correctness.
The most common mistake is sticking with CSV out of familiarity. A 5-minute conversion to Parquet saves hours of waiting for reads, gigabytes of memory, and eliminates entire classes of type-inference bugs. The second most common mistake is reaching for a heavyweight database when SQLite or Parquet would suffice. A single Parquet file on S3 queried by DuckDB or Athena replaces many use cases that once required a running database server.