Every time two systems exchange data, they must agree on a serialization format. For most applications, JSON is the default — it's human-readable, universally supported, and fast enough. But "fast enough" has limits. When you're processing millions of API calls per second, transmitting data over cellular networks, or synchronizing game state 60 times per second, JSON's verbosity and parse overhead become measurable costs.
Binary serialization formats exist to eliminate those costs. Protocol Buffers (Google), Thrift (Facebook/Apache), Avro (Apache/Hadoop), MessagePack, and CBOR each take a different approach to making data smaller, faster to parse, and more rigidly typed. The tradeoff: binary data is not human-readable, requires schemas or format knowledge to decode, and adds tooling complexity.
This guide compares text and binary formats across the dimensions that matter: size, speed, schema evolution, tooling, and practical applicability. The goal is to help you decide when the complexity of a binary format is worth the performance gain.
Text vs Binary: The Fundamental Tradeoff
| Property | Text Formats (JSON, XML, YAML) | Binary Formats (Protobuf, MsgPack, Avro) |
|---|---|---|
| Human readable | Yes | No (requires decoder) |
| Debuggable | Open in any editor, pipe to jq | Need format-specific tools to inspect |
| Size | Large (verbose keys, delimiters, quoting) | Small (30-70% of JSON equivalent) |
| Parse speed | Moderate (string parsing, type detection) | Fast (direct memory mapping in some cases) |
| Schema required | No (self-describing) | Varies: some require schema (Protobuf, Avro), some don't (MsgPack, CBOR) |
| Language support | Universal | Wide but not universal (Protobuf has best coverage) |
| Version control | Clean diffs | Binary diffs (meaningless in git diff) |
The key insight: most applications don't need binary formats. JSON over gzip is typically sufficient for web APIs, mobile apps, and microservice communication. Binary formats become necessary when you're optimizing for latency at the microsecond level, bandwidth at the kilobyte level, or throughput at millions of messages per second.
JSON: The Baseline
JSON is the default serialization format for the web. A sample record:
{"id": 12345, "name": "Alice Johnson", "email": "alice@example.com", "active": true, "score": 98.5, "tags": ["premium", "verified"]}This record is 124 bytes as JSON. The key names (id, name, email, active, score, tags) are repeated in every record, quotes and colons add overhead, and numbers are stored as text (the 5-digit number 12345 takes 5 bytes instead of 2-4 in binary).
JSON's strengths for serialization: universal parser support, self-describing (no schema needed to decode), human-debuggable, and clean version control diffs. Its weaknesses: verbose, no binary data support (must Base64-encode), no schema enforcement, and no integer/float distinction.
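To make the baseline concrete, here is a quick stdlib-only sketch that serializes the sample record and measures its size. (The exact byte counts depend on whitespace: `json.dumps` inserts spaces after separators by default, and compact separators are the cheapest size win JSON offers.)

```python
import json

record = {
    "id": 12345,
    "name": "Alice Johnson",
    "email": "alice@example.com",
    "active": True,
    "score": 98.5,
    "tags": ["premium", "verified"],
}

# Default separators insert a space after every ':' and ','.
spaced = json.dumps(record)
# Compact separators drop that whitespace entirely.
compact = json.dumps(record, separators=(",", ":"))

print(len(spaced.encode("utf-8")), len(compact.encode("utf-8")))
```

Even the compact form spends most of its bytes on repeated key names, quotes, and delimiters, which is exactly the overhead binary formats target.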
Protocol Buffers (Protobuf): Google's Binary Format
Protocol Buffers, developed by Google, are the most widely used binary serialization format. Protobuf uses a schema (defined in .proto files) to generate code for serializing and deserializing data in your target language.
```proto
// user.proto
syntax = "proto3";

message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
  bool active = 4;
  float score = 5;
  repeated string tags = 6;
}
```

The same record that was 124 bytes in JSON is approximately 45-55 bytes in Protobuf — about 60% smaller. Field names are replaced by numeric tags (1, 2, 3...), numbers use variable-length encoding (small numbers take fewer bytes), and there's no quoting, delimiter, or structural overhead.
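Much of that size win comes from varint encoding. As a minimal sketch (not the official library), here is how Protobuf's base-128 varints work, and how the `id` field above would appear on the wire: a one-byte field key of `(field_number << 3) | wire_type`, followed by the varint-encoded value.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Field key for `int32 id = 1`: field number 1, wire type 0 (varint).
key = encode_varint((1 << 3) | 0)
value = encode_varint(12345)
print((key + value).hex())  # 3 bytes total, vs 12 for JSON's {"id":12345}
```

The value 12345 fits in 2 bytes instead of the 5 ASCII digits JSON needs, and the field name `id` costs a single key byte.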
Protobuf Strengths
- Size: 30-40% of JSON for typical data. Numeric-heavy data sees even better compression.
- Speed: 5-10x faster to parse than JSON. Protobuf's wire format maps almost directly to memory layout.
- Schema evolution: Fields can be added, deprecated (reserved), and renamed without breaking existing consumers — as long as field numbers don't change. This is critical for APIs that evolve over years.
- Code generation: `protoc` generates serialization/deserialization code for C++, Java, Python, Go, Rust, C#, and more. No manual parsing code needed.
- gRPC integration: Protobuf is the native format for gRPC, Google's high-performance RPC framework used by most major tech companies.
Protobuf Weaknesses
- Not human-readable. Binary data requires `protoc --decode` or format-specific tools to inspect. Debugging network issues is harder than with JSON.
- Schema required. You can't decode Protobuf without the `.proto` schema file (you can see raw field numbers and types, but not field names).
- No self-description. A Protobuf message doesn't contain its schema. Both producer and consumer must have the same `.proto` file.
- Build step required. `protoc` must run before compilation to generate code from `.proto` files. This adds build complexity.
- Not good for ad-hoc data. Every data structure needs a `.proto` definition. For quick data exchange between scripts or exploratory data analysis, JSON is far more practical.
MessagePack: Binary JSON
MessagePack is a binary format that mirrors JSON's data model exactly: objects, arrays, strings, numbers, booleans, and null. It requires no schema — you can convert JSON to MessagePack and back losslessly. Think of it as "JSON, but in binary."
Our sample record in MessagePack: approximately 85-90 bytes (vs 124 in JSON). Smaller than JSON but larger than Protobuf because MessagePack includes field names in the encoding (like JSON does), just in a more compact binary representation.
| Feature | JSON | MessagePack |
|---|---|---|
| Data model | Object, array, string, number, boolean, null | Same (plus binary type) |
| Schema required | No | No |
| Size | 100% (baseline) | ~65-75% of JSON |
| Parse speed | 1x | 2-5x faster |
| Binary data | Base64 string | Native binary type |
| Human readable | Yes | No |
MessagePack's sweet spot: you want smaller/faster than JSON, you don't want to manage schemas, and you're willing to trade readability for performance. Common uses include Redis serialization, real-time data feeds, and game networking. Converting JSON to MessagePack is lossless and adds native binary data support.
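The compactness is easy to see by hand. Here is a sketch (assuming the public MessagePack spec; a real application would use a `msgpack` library) that encodes `{"id": 12345}` byte by byte: a fixmap header, a fixstr header for the key, and a uint16 marker for the value.

```python
import struct

# Hand-encode {"id": 12345} per the MessagePack spec:
#   fixmap: 0x80 | size      (maps with <= 15 entries)
#   fixstr: 0xa0 | length    (strings <= 31 bytes)
#   uint16: 0xcd + 2 big-endian bytes (values 256..65535)
msg = bytes([0x80 | 1, 0xA0 | 2]) + b"id" + b"\xcd" + struct.pack(">H", 12345)

print(msg.hex(), len(msg))  # 7 bytes, vs 12 for the JSON '{"id":12345}'
```

The key name is still present (no schema), but every structural character — braces, quotes, the colon — collapses into single header bytes.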
CBOR: The IETF Standard
CBOR (Concise Binary Object Representation, RFC 8949) is an IETF-standardized binary format based on JSON's data model. It's similar to MessagePack but with standardized extensions: tags for dates, URIs, regex, big numbers, and other types that JSON lacks. CBOR is the serialization format for COSE (CBOR Object Signing and Encryption) and is used in WebAuthn, FIDO2, and IoT protocols (CoAP).
CBOR vs MessagePack: CBOR is an IETF standard (RFC), MessagePack is a community specification. CBOR has extensible type tags (can represent dates, URIs, etc.), MessagePack has better library coverage. For new projects, CBOR's standardization and extensibility make it the better choice. For projects already using MessagePack, there's no compelling reason to switch.
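CBOR's encoding is structurally similar but built from a uniform rule: every item starts with a head byte holding a 3-bit major type and a 5-bit argument, with larger arguments spilling into following bytes. A minimal sketch of that head encoding (small arguments only; RFC 8949 defines the full range):

```python
def cbor_head(major: int, arg: int) -> bytes:
    """RFC 8949 head: major type in the top 3 bits, argument below."""
    if arg < 24:                       # argument fits in the head byte
        return bytes([(major << 5) | arg])
    if arg < 256:                      # one-byte argument follows
        return bytes([(major << 5) | 24, arg])
    if arg < 65536:                    # two-byte big-endian argument
        return bytes([(major << 5) | 25]) + arg.to_bytes(2, "big")
    raise NotImplementedError("larger arguments omitted in this sketch")

# {"id": 12345}: map of 1 pair (major 5), text "id" (major 3), uint (major 0)
doc = cbor_head(5, 1) + cbor_head(3, 2) + b"id" + cbor_head(0, 12345)
print(doc.hex())  # 7 bytes, same size as the MessagePack encoding
```

Tags (major type 6) hang off the same rule, which is what gives CBOR its standardized extensions for dates, big numbers, and the rest.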
Apache Avro: Schema Evolution for Data Pipelines
Avro is a binary format designed for Hadoop and data pipeline use cases. Its defining feature: the schema is stored with the data. An Avro file starts with the schema definition (in JSON), followed by the binary data. Any consumer can decode the data without a separate schema file.
Avro schemas are JSON documents:
```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

Avro's schema evolution is particularly robust. A consumer can read data written with an older or newer schema version — the Avro library automatically maps fields, applies defaults for new fields, and ignores removed fields. This makes Avro ideal for data lakes and event sourcing where producers and consumers evolve independently.
Avro is the standard format in the Kafka ecosystem (Schema Registry), Apache Spark, and Hadoop. It's less common in web APIs (Protobuf/gRPC dominates there).
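Avro's on-the-wire integers are also worth a look: like Protobuf it uses base-128 varints, but all ints are first zigzag-encoded so that small negative numbers stay small too. A stdlib sketch (assuming 64-bit signed longs, per the Avro spec):

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned: 0,-1,1,-2,... -> 0,1,2,3,..."""
    # Python's >> is an arithmetic shift, which is what zigzag needs.
    return (n << 1) ^ (n >> 63)

def avro_long(n: int) -> bytes:
    """Avro long: zigzag, then base-128 varint."""
    z, out = zigzag(n), bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

print(avro_long(-1).hex(), avro_long(12345).hex())
```

Without zigzag, `-1` would occupy the maximum varint width; with it, `-1` encodes in a single byte.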
Size and Speed Comparison
Approximate benchmarks for our sample user record (124 bytes in JSON):
| Format | Size (bytes) | % of JSON | Relative Parse Speed | Schema Required |
|---|---|---|---|---|
| JSON | 124 | 100% | 1x | No |
| JSON (gzip) | ~110 | ~89% | 1x + decompress | No |
| XML | ~250 | ~200% | 0.3-0.5x | No |
| YAML | ~110 | ~89% | 0.1-0.2x | No |
| MessagePack | ~88 | ~71% | 2-5x | No |
| CBOR | ~90 | ~73% | 2-4x | No |
| Protobuf | ~50 | ~40% | 5-10x | Yes |
| Avro | ~35* | ~28%* | 3-7x | Yes (embedded) |
| FlatBuffers | ~80 | ~65% | 20-50x (zero-copy) | Yes |
* Avro's per-record size is very small because field names and types are in the schema header, not repeated per record. The overhead is amortized across all records — Avro is most efficient for large datasets.
Important context: for a single API response of 1-10KB, the size and speed differences are imperceptible. These benchmarks matter when you're processing millions of records, transmitting over constrained networks (IoT, mobile in developing regions), or operating at the latency boundaries of high-frequency trading or real-time gaming.
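The amortization point is easy to verify with the standard library: gzip's fixed header and trailer overhead can swamp a single tiny record, while a batch of records with repeated key names compresses dramatically. A small sketch:

```python
import gzip
import json

record = {"id": 12345, "name": "Alice Johnson", "email": "alice@example.com"}

one = json.dumps(record).encode("utf-8")
many = json.dumps([{**record, "id": i} for i in range(1000)]).encode("utf-8")

one_gz, many_gz = gzip.compress(one), gzip.compress(many)

# A lone ~70-byte record barely shrinks (gzip adds ~20 bytes of framing);
# 1000 records with identical keys compress to a small fraction of raw size.
print(f"single: {len(one)} raw, {len(one_gz)} gzipped")
print(f"batch:  {len(many)} raw, {len(many_gz)} gzipped")
```

This is the same effect Avro exploits structurally: state the repeated part (the schema) once, and pay only for the varying values per record.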
Schema Evolution: Why It Matters
Real systems evolve. Fields are added, renamed, deprecated, and retyped. How a serialization format handles these changes determines how much pain version upgrades cause.
| Format | Add Field | Remove Field | Rename Field | Change Type |
|---|---|---|---|---|
| JSON | Just add it | Just remove it | Add new + keep old | Consumers must handle both |
| Protobuf | Add with new field number | Reserve the number, stop using | Change name, keep number | Limited (compatible changes only) |
| Avro | Add with default value | Stop using (reader ignores) | Use aliases | Limited (promotions only) |
| MessagePack | Same as JSON | Same as JSON | Same as JSON | Same as JSON |
Protobuf and Avro have the best schema evolution stories because they decouple field identity from field names. In Protobuf, field 3 is always field 3 regardless of name. In Avro, the schema registry tracks compatible versions. JSON's schema evolution is "anything goes" — which is flexible but means every consumer must defensively handle unexpected input.
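What "defensively handle" means in practice for JSON can be sketched in a few lines. The function and field names below are illustrative, not from any particular codebase: the consumer supplies defaults for fields it has never seen, accepts a renamed field under both names, and silently drops fields it doesn't know.

```python
import json

def parse_user(raw: str) -> dict:
    """Defensive JSON consumer: tolerate missing, renamed, extra fields."""
    data = json.loads(raw)
    return {
        "id": data.get("id"),
        # Hypothetical producer v2 renamed "name" to "full_name"; accept both.
        "name": data.get("full_name", data.get("name", "")),
        # A field added after the consumer shipped gets an explicit default.
        "active": data.get("active", True),
    }

old = parse_user('{"id": 1, "name": "Alice"}')
new = parse_user('{"id": 1, "full_name": "Alice J", "active": false, "plan": "pro"}')
print(old, new)
```

Protobuf and Avro bake these rules into the format; with JSON, every consumer reimplements them by hand.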
When to Use Binary Formats
Switch from JSON to a binary format when:
- High throughput: Processing >100K messages/second where parse overhead is measurable.
- Bandwidth constraints: IoT devices on cellular/LPWAN, mobile apps in bandwidth-limited regions, or cloud data transfer costs that scale with bytes transferred.
- Real-time requirements: Game state synchronization (60+ updates/second), financial trading, live video metadata.
- Large-scale data storage: Data lakes, event sourcing, analytics pipelines — Avro/Parquet save significant storage costs at petabyte scale.
- Microservice APIs: gRPC + Protobuf is the standard for inter-service communication in large-scale systems (Google, Netflix, Lyft, Square).
Keep JSON when:
- Your API is consumed by web browsers (JSON.parse is universally available)
- Humans need to read and debug the data
- The data volume is modest (<10K requests/minute)
- You don't want to manage schemas and code generation
- Interoperability with diverse systems matters more than performance
Converting Between Text and Binary Formats
Format conversion between text and binary follows a general principle: text-to-binary is usually lossless, binary-to-text depends on type compatibility.
| Conversion | Lossless? | Notes |
|---|---|---|
| JSON to MessagePack | Yes | MessagePack has the same data model as JSON, plus binary type. |
| MessagePack to JSON | Almost | MessagePack's binary type must be Base64-encoded in JSON. |
| JSON to Protobuf | Requires schema | Must define .proto first. JSON values mapped to Protobuf types. |
| Protobuf to JSON | Yes | Protobuf has official JSON mapping specification. |
| JSON to Avro | Requires schema | Must define Avro schema. Types must be compatible. |
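The one lossy edge in the table — binary data crossing into JSON — looks like this in practice. A stdlib sketch of the Base64 round trip that MessagePack-to-JSON (or any binary-to-JSON) conversion requires:

```python
import base64
import json

payload = bytes(range(8))  # raw binary, not valid UTF-8 text

# JSON has no binary type: encode to Base64 text on the way in...
doc = json.dumps({"blob": base64.b64encode(payload).decode("ascii")})

# ...and decode on the way out. Cost: Base64 inflates size by ~4/3.
restored = base64.b64decode(json.loads(doc)["blob"])
print(restored == payload)
```

The round trip is exact, but the JSON side carries the 33% Base64 overhead plus the knowledge, outside the document itself, that the field is binary rather than ordinary text.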
The serialization format decision is usually simple: use JSON unless you have a measured reason not to. JSON's universal support, human readability, and debugging ease outweigh the size and speed advantages of binary formats for the vast majority of applications.
When you do need better performance: MessagePack is the lowest-friction upgrade (same data model, no schema). Protocol Buffers are the industry standard for high-performance APIs (schema-driven, gRPC integration). Avro is the standard for data pipelines (embedded schema, excellent evolution). Pick based on your specific constraint — bandwidth, latency, storage, or schema management — not on theoretical benchmarks.