There's a moment in every data project where CSV stops working. Maybe it's the 2GB file that takes 4 minutes to load and crashes your script. Maybe it's the Excel workbook that truncates at row 1,048,576 and silently drops the rest. Maybe it's the third time this week you've had to manually specify column types because CSV doesn't store them.
CSV and Excel are spreadsheet-era formats designed for data that fits on a screen. Modern datasets — web analytics, IoT sensors, financial transactions, genomics — routinely reach millions or billions of rows. At this scale, the overhead of text-based formats becomes a real cost: storage, memory, parsing time, and lost type information.
This guide covers the six formats that solve these problems, each optimized for a different access pattern. The goal isn't to pick one format for everything — it's to understand which format matches your workload.
CSV at Scale: Where It Breaks
Before discussing alternatives, here's exactly why CSV fails at scale:
- No schema. Every consumer must independently infer column types. A column of ZIP codes (02134, 10001) has no way to declare itself as string. Every reader that auto-detects types will strip the leading zero.
- Text encoding overhead. The number 3.14159265358979 is 16 bytes in CSV but 8 bytes as a binary float64. Dates, booleans, and integers carry similar overhead: 2-4x larger as text than binary.
- Memory amplification. Loading a 1GB CSV into pandas uses 5-10GB of RAM; loading a 1GB Parquet file uses 1-2GB. The difference is Python object overhead for strings and the lack of columnar compression.
- No column pruning. Reading 3 columns from a 50-column CSV requires scanning every byte of every row. Columnar formats read only the columns you need.
- No predicate pushdown. Filtering rows (WHERE revenue > 1000) in CSV requires reading every row, parsing it, and then checking the condition. Parquet stores min/max statistics per column chunk, so entire chunks can be skipped without reading.
- No compression. CSV is plain text. You can gzip it, but then you lose random access and the ability to split the file for parallel processing. Columnar formats compress internally while remaining queryable.
CSV is still the right choice for interchange (maximum compatibility), small datasets (under 100MB), and human inspection. For everything else, read on.
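The text-versus-binary overhead is easy to verify with nothing but the standard library — the same value that costs 16 bytes as CSV text fits in 8 bytes as a float64:

```python
import struct

text = "3.14159265358979"               # the value as CSV stores it
binary = struct.pack("d", float(text))  # the same value as a binary float64

print(len(text.encode("ascii")))  # 16 bytes as text
print(len(binary))                # 8 bytes as binary
```

Every numeric column in a large CSV pays this tax on every row, before parsing cost is even counted.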
Parquet: The Default for Analytics
Apache Parquet is a columnar storage format designed for analytical workloads. It stores data by column instead of by row, which means reading a single column from a 100-column dataset only touches ~1% of the file. Parquet files are typically 3-10x smaller than equivalent CSV and 10-30x faster to read for analytical queries.
How it works: Parquet divides data into row groups (default ~128MB each). Within each row group, data is stored column by column. Each column chunk is independently compressed (Snappy or Zstd) and carries metadata: data type, min/max values, null count, and distinct count. This metadata enables predicate pushdown — a query filter can skip entire row groups without decompressing them.
Why Parquet Dominates Data Lakes
Compression. Columnar storage compresses dramatically because values in the same column have similar patterns. A column of country names with 50 unique values across 10 million rows compresses to near-zero via dictionary encoding. Typical Parquet compression ratios vs. CSV: 5-10x for string-heavy data, 3-5x for numeric data.
Schema. Every Parquet file carries its full schema: column names, data types (int32, int64, float, double, string, binary, timestamp, decimal), and nullability. No type inference needed. No leading-zero-stripped ZIP codes. No dates-as-serial-numbers.
Partitioning. Parquet files can be partitioned by column values (e.g., data/year=2026/month=03/part-001.parquet). Queries that filter on partition columns skip entire files. A query for March 2026 data in a 5-year dataset reads ~0.5% of the total data.
Ecosystem. Parquet is read natively by pandas, polars, DuckDB, Spark, Presto/Trino, Athena, BigQuery, Snowflake, Databricks, R (arrow package), and Julia. It's the de facto standard for data lakes on AWS S3, GCP Cloud Storage, and Azure Blob Storage.
Parquet's Trade-offs
Not human-readable. You can't open a Parquet file in a text editor. For inspection, you need parquet-tools, DuckDB (SELECT * FROM 'file.parquet' LIMIT 10), or pandas (pd.read_parquet('file.parquet')).
Write-once. Parquet files are immutable. You can't update a single row. You write the entire file (or partition) and replace it. This is fine for analytics (read-heavy) but wrong for transactional data (write-heavy).
Row-level access is slow. If you need row 7,531,422 specifically, Parquet must scan row groups until it finds it. Databases with indexes are better for point lookups. Parquet is optimized for scanning, not seeking.
To get data into Parquet: convert CSV with pandas or pyarrow, or use DuckDB to query the CSV directly and write Parquet: COPY (SELECT * FROM 'data.csv') TO 'data.parquet'.
Arrow / Feather: Fastest In-Memory Interchange
Apache Arrow is an in-memory columnar format. Feather (Feather V2) is Arrow's file format — a direct serialization of Arrow's in-memory layout to disk. The key feature: zero-copy reads. When you load a Feather file, the data goes into memory in exactly the format that analytics libraries use internally. There's no deserialization step.
Read speed comparison (1GB of data):
| Format | Read Time | Memory Usage |
|---|---|---|
| CSV | ~30 seconds | 5-10 GB |
| Parquet (Snappy) | ~2 seconds | 1-2 GB |
| Feather (uncompressed) | ~0.3 seconds | ~1 GB |
| Feather (LZ4) | ~0.5 seconds | ~1 GB |
Feather's speed advantage over Parquet comes from its simpler structure: no row groups, no page headers, no complex encoding schemes. It's Arrow's memory layout written directly to disk. The trade-off: Feather files are larger than Parquet (less sophisticated compression) and don't support partitioning.
Best use case: Intermediate files in a pipeline. Read CSV once, convert to Arrow, and all subsequent reads are near-instantaneous. Passing data between Python and R processes. Caching expensive transformations.
Not ideal for: Long-term storage (Parquet compresses better), sharing with non-technical users (no universal viewer), or data lakes (no partitioning).
SQLite: A Database in a Single File
SQLite is a relational database engine in a single file. It's not a data format in the traditional sense — it's a fully ACID-compliant database with SQL support, indexes, transactions, and constraints. But it solves the large-dataset problem differently from columnar formats: instead of optimizing scans, it optimizes queries.
What SQLite does that file formats can't:
- Indexes. Create an index on a column and point lookups go from O(n) to O(log n). Finding one row in 100 million takes microseconds, not minutes.
- Joins. Relating data across multiple tables is native SQL. In CSV, you'd need to load both files into pandas and merge.
- Updates. Modify individual rows without rewriting the entire file. Insert new data incrementally. Delete rows. All impossible with CSV or Parquet.
- Constraints. NOT NULL, UNIQUE, FOREIGN KEY, CHECK — enforce data quality at the storage level.
Size: SQLite files are typically 1.5-3x larger than equivalent Parquet (no columnar compression) but smaller than CSV. A 1GB CSV becomes ~800MB in SQLite with appropriate types.
Performance: For analytical scans (SUM of all rows), SQLite is slower than Parquet and DuckDB. For indexed lookups and joins, SQLite is dramatically faster. For mixed workloads (some scans, some lookups), SQLite is often the best single-file option.
To get data into SQLite: the sqlite3 command-line shell imports CSV with .import data.csv tablename and exports it with .mode csv. Programmatically, Python's built-in csv and sqlite3 modules load a CSV into a table in a few lines.
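A minimal sketch of that load using only the standard library (the CSV content, table, and column names are made up for illustration; swap :memory: for a filename to get a persistent database):

```python
import csv
import io
import sqlite3

csv_text = "id,name,score\n1,Alice,95\n2,Bob,87\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")

rows = csv.reader(io.StringIO(csv_text))
next(rows)  # skip the header row
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", rows)

conn.execute("CREATE INDEX idx_score ON scores(score)")  # O(log n) lookups from here on
top = conn.execute("SELECT name FROM scores WHERE score > 90").fetchone()[0]
print(top)
```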
HDF5: Hierarchical Data for Science
HDF5 (Hierarchical Data Format version 5) is the standard for scientific and engineering data. It stores datasets in a filesystem-like hierarchy: groups (directories) containing datasets (arrays) with attributes (metadata). A single HDF5 file can contain hundreds of datasets of different shapes and types.
What makes HDF5 unique:
- Multidimensional data. HDF5 natively stores N-dimensional arrays. A 3D temperature grid (1000 x 1000 x 100 time steps) is a single dataset with efficient slicing on any axis.
- Chunked I/O. Datasets are stored in chunks, and you can read a slice (e.g., one time step of the temperature grid) without loading the full dataset.
- Compression per-dataset. Different datasets in the same file can use different compression algorithms (gzip, LZF, szip) at different levels.
- Self-describing. Attributes on datasets store units, coordinate systems, experiment parameters — metadata travels with the data.
HDF5 in Python: The h5py library provides NumPy-like access: f['temperatures'][0:10, :, :] reads the first 10 rows of a 3D dataset. pandas has pd.read_hdf() for tabular data stored in HDF5 via PyTables.
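A small end-to-end sketch, assuming h5py and NumPy are installed (the grid dimensions, dataset name, and file name are illustrative):

```python
import numpy as np
import h5py

temps = np.random.rand(100, 50, 50)  # time x lat x lon grid

with h5py.File("climate.h5", "w") as f:
    dset = f.create_dataset("temperatures", data=temps,
                            chunks=(1, 50, 50),    # one chunk per time step
                            compression="gzip")
    dset.attrs["units"] = "kelvin"  # metadata travels with the data

with h5py.File("climate.h5", "r") as f:
    step = f["temperatures"][0]     # reads one chunk, not the whole array
    units = f["temperatures"].attrs["units"]
print(step.shape, units)
```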
When to use HDF5: Multidimensional scientific data (climate models, medical imaging, particle physics), large collections of related arrays, and any dataset where the hierarchy of groups and attributes is valuable. For flat tabular data, Parquet is simpler and better integrated with analytics tools.
JSON Lines: Streaming JSON for Large Data
JSONL (JSON Lines, also called NDJSON — Newline Delimited JSON) puts one complete JSON object per line. It solves JSON's fatal flaw for large data: standard JSON requires parsing the entire document before you can access any element. A 10GB JSON array must be fully loaded and parsed. A 10GB JSONL file can be processed one line at a time.
```
{"id":1,"name":"Alice","score":95}
{"id":2,"name":"Bob","score":87}
{"id":3,"name":"Charlie","score":92}
```

Why JSONL matters for large data:
- Streaming. Process line by line with constant memory: for line in open('big.jsonl'): record = json.loads(line).
- Appendable. Add new records by appending lines. No need to read-parse-modify-write the entire file.
- Schema-flexible. Different records can have different fields. Row 1 can have 5 fields, row 1000 can have 8. This is impossible in CSV (fixed columns) and Parquet (fixed schema).
- Rich types per record. Each JSON object preserves nested structures, arrays, and type information that CSV loses.
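The properties above fit in a few lines of standard-library Python (file name and record contents are made up for illustration):

```python
import json

events = [
    {"id": 1, "name": "Alice", "score": 95},
    {"id": 2, "tags": ["new", "trial"]},  # different fields: fine in JSONL
]
with open("events.jsonl", "w") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Appending a record never requires rewriting the file
with open("events.jsonl", "a") as f:
    f.write(json.dumps({"id": 3, "score": 92}) + "\n")

# Stream back one line at a time with constant memory
records = [json.loads(line) for line in open("events.jsonl")]
print(len(records))
```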
When JSONL is the right choice: Log files, event streams, semi-structured data where records have varying schemas, data exchange between microservices, and any dataset that needs to be appended to incrementally. For structured analytics on fixed-schema data, Parquet is better. For human editing, standard JSON is better. JSONL is the sweet spot for machine-generated, machine-consumed, variable-schema data at scale.
Convert between these formats as needed: CSV to JSONL (also called NDJSON) to add structure for streaming pipelines, or JSONL back to CSV to flatten to tabular form.
Excel's Ceiling: Why 1M Rows Isn't Enough
Excel's 1,048,576-row limit is a hard wall, not a soft guideline. But the practical limits are much lower:
- Performance degrades from ~100K rows. Formulas that reference large ranges become noticeably slow. Scrolling lags. Saving takes seconds.
- Memory spikes at ~500K rows. A 500K-row workbook with 20 columns and some formulas uses 2-4GB of RAM. The 32-bit version of Excel caps at 2GB.
- Charts on large datasets freeze the UI. A scatter plot with 100K points can take 30+ seconds to render and makes the workbook sluggish.
- Pivot tables on large sources are unstable. Refreshing a pivot table backed by 1M rows can take minutes and occasionally crashes Excel.
If your workflow hits these limits, you have three options:
- Stay in Excel with Power Query/Power Pivot. These tools use a columnar in-memory engine (VertiPaq) that handles millions of rows efficiently. But the learning curve is steep and the data doesn't live in regular cells.
- Move to a database. Import your data into PostgreSQL, SQLite, or DuckDB. Query with SQL. Export subsets to Excel for presentation.
- Move to Parquet + Python/R. Convert your CSV to Parquet, analyze in pandas/polars/DuckDB, and export summaries to XLSX for sharing.
Decision Matrix: Choosing by Use Case
| Use Case | Recommended Format | Why |
|---|---|---|
| Data lake / warehouse | Parquet | Columnar compression, partitioning, universal analytics tool support |
| Sharing with non-technical users | XLSX or CSV | Universal readability, Excel compatible |
| Pipeline intermediate files | Feather/Arrow | Fastest read/write, zero-copy in pandas/polars |
| Application database (single file) | SQLite | ACID transactions, indexes, SQL queries, updates |
| Scientific / multidimensional data | HDF5 | N-dimensional arrays, chunked I/O, metadata attributes |
| Event streams / logs | JSONL | Appendable, streaming, flexible schema |
| One-time data exchange | CSV | Maximum compatibility, human-readable |
| Archival (decades) | CSV + Parquet | CSV for guaranteed readability, Parquet for efficient access |
The common pattern for large data projects: receive data as CSV, convert to Parquet for analysis, analyze in DuckDB or pandas, export summaries to XLSX or CSV for stakeholders. Raw CSV is archived alongside Parquet for long-term accessibility.
The format landscape for large datasets is mature and well-differentiated. CSV is for interchange and small data. Parquet is for analytics. Arrow is for speed. SQLite is for queries. HDF5 is for science. JSONL is for streams. Choosing wrong means paying a continuous tax in storage, speed, or correctness.
The most common mistake is sticking with CSV out of familiarity. A 5-minute conversion to Parquet saves hours of waiting for reads, gigabytes of memory, and eliminates entire classes of type-inference bugs. The second most common mistake is reaching for a heavyweight database when SQLite or Parquet would suffice. A single Parquet file on S3 queried by DuckDB or Athena replaces many use cases that once required a running database server.