Raw, uncompressed 1080p video at 30fps generates about 186MB per second. That's 11GB per minute, or 670GB per hour. Without compression, a single two-hour movie would need well over a terabyte of storage. Video codecs solve this by reducing that data by 100-1000x while preserving enough visual quality that your eyes can't tell the difference.
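These figures follow directly from the frame geometry. A quick sanity check of the arithmetic, assuming 8-bit color at 3 bytes per pixel (uncompressed RGB; 4:2:0 subsampled video would be half that):

```python
# Raw data rate = width x height x bytes-per-pixel x frames-per-second.
# 3 bytes/pixel assumes 8-bit RGB; 4:2:0 subsampled video is 1.5 bytes/pixel.
def raw_rate_bytes_per_sec(width: int, height: int, fps: int,
                           bytes_per_pixel: float = 3) -> int:
    return int(width * height * bytes_per_pixel * fps)

per_sec = raw_rate_bytes_per_sec(1920, 1080, 30)
print(per_sec / 1e6)         # ~186.6 MB per second
print(per_sec * 60 / 1e9)    # ~11.2 GB per minute
print(per_sec * 3600 / 1e9)  # ~671.8 GB per hour
```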
But codecs aren't magic boxes. They make specific engineering tradeoffs: newer codecs compress harder but need more CPU time to encode. Different profiles enable different features at different complexity levels. Hardware encoders sacrifice compression efficiency for real-time speed. Understanding these tradeoffs is what lets you choose the right codec and settings for your specific use case, rather than guessing.
This guide explains how video codecs actually work, from the fundamental compression techniques to the practical encoding settings you'll use when converting files.
Codec vs Container: The First Distinction
A codec (coder-decoder) is an algorithm that compresses video frames into a fraction of their original size (encoding) and reconstructs them for playback (decoding). H.264, H.265, VP9, and AV1 are codecs.
A container (or wrapper) is a file format that packages one or more compressed streams (video, audio, subtitles) into a single file with synchronization and metadata. MP4, MKV, and WebM are containers.
The same codec can live in different containers. H.264 video can be in an MP4 file, an MKV file, or a MOV file — the compressed video data is identical in each case. The container only determines how the streams are organized on disk, what metadata is stored, and what additional tracks (subtitles, chapters) are supported.
This is why MKV-to-MP4 conversion is often instant and lossless: if both containers support the same codecs, you're just moving data from one wrapper to another (remuxing) without touching the compressed content.
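In FFmpeg terms, a remux is just stream copying. A minimal sketch in Python (the filenames are placeholders; `-c copy` is FFmpeg's flag for copying streams without re-encoding):

```python
import subprocess

def remux_cmd(src: str, dst: str) -> list[str]:
    # -c copy moves every stream bit-for-bit into the new container:
    # no decode, no encode, so the operation is fast and lossless.
    return ["ffmpeg", "-i", src, "-c", "copy", dst]

cmd = remux_cmd("input.mkv", "output.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually remux
print(" ".join(cmd))
```

Because no encoder runs, even a feature-length file remuxes in seconds; the only failures come from container limitations, such as a subtitle format the target container can't carry.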
How Video Compression Works
Every video codec exploits two types of redundancy to shrink file size:
Spatial Compression (Within a Frame)
A single video frame has enormous redundancy. A blue sky region is mostly the same color. A wooden desk has repeating grain patterns. Instead of storing every pixel individually, the codec divides the frame into blocks and predicts each block's content from its neighbors.
The process for a typical codec:
- Block partitioning: Divide the frame into blocks. H.264 uses 16x16 macroblocks (optionally split to 4x4). H.265 uses Coding Tree Units (CTUs) up to 64x64, recursively split. AV1 uses superblocks up to 128x128.
- Prediction: For each block, predict its content from adjacent decoded blocks (intra prediction). The codec tries multiple prediction modes (horizontal, vertical, diagonal, DC/flat) and picks the one with the smallest residual (difference between prediction and actual).
- Transform: Apply a DCT (Discrete Cosine Transform) or similar transform to the residual, converting spatial data to frequency data. Low-frequency components (gradual changes) get large coefficients; high-frequency components (sharp edges, fine detail) get small ones.
- Quantization: This is the lossy step. Divide the transform coefficients by a quantization parameter (QP). Small coefficients round to zero and are discarded. Higher QP = more zeros = smaller file = more quality loss.
- Entropy coding: Compress the quantized coefficients using arithmetic coding (CABAC in H.264/H.265, a multi-symbol arithmetic coder in AV1). This is lossless — it just represents the data more efficiently.
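The transform-and-quantize steps above can be sketched in miniature. This toy (a plain floating-point DCT-II with a single flat quantizer step, not any real codec's integer transform) shows why smooth blocks compress so well: nearly all coefficients quantize to zero.

```python
import math

def dct2(block):
    """2-D DCT-II of an n x n block (the transform family codecs build on)."""
    n = len(block)
    def c(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    return [[c(u) * c(v) * sum(
                 block[x][y]
                 * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                 * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                 for x in range(n) for y in range(n))
             for v in range(n)]
            for u in range(n)]

# A smooth 8x8 block: a gentle horizontal gradient (like a patch of sky).
block = [[2 * col for col in range(8)] for _ in range(8)]

qp = 10  # toy quantizer step; real codecs use per-frequency step sizes
quantized = [[round(coeff / qp) for coeff in row] for row in dct2(block)]
zeros = sum(row.count(0) for row in quantized)
# Only the DC term and one low-frequency coefficient survive quantization.
print(f"{zeros}/64 coefficients quantized to zero")
```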
Temporal Compression (Between Frames)
Consecutive video frames are usually very similar — most of the scene doesn't change between frames. Temporal compression exploits this by storing only the differences between frames.
The codec performs motion estimation: for each block in the current frame, it searches nearby positions in previously decoded reference frames for the best match. If a person walks to the right, the codec stores "copy the block from 10 pixels to the left in the previous frame" (a motion vector) rather than the actual pixels. Only the residual (what the motion vector didn't predict) gets encoded.
Block sizes for motion estimation have grown with each codec generation. H.264 searches with blocks down to 4x4. H.265 uses flexible block sizes up to 64x64. AV1 supports 128x128 and adds warped motion prediction (accounting for rotation and zoom, not just translation). Larger and more flexible block sizes = better predictions = smaller residuals = smaller files. But also = more searching = slower encoding.
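A toy version of motion estimation, assuming nothing beyond exhaustive SAD (sum of absolute differences) search over a small window; real encoders use hierarchical and predictive search strategies, but the principle is the same:

```python
def sad(a, b):  # sum of absolute differences: the standard matching cost
    return sum(abs(x - y) for x, y in zip(a, b))

def best_motion_vector(ref, cur, bx, by, bs=4, search=2):
    """Exhaustive search: find the offset in `ref` that best predicts the
    bs x bs block of `cur` at (bx, by). Returns ((dx, dy), cost)."""
    block = [cur[by + r][bx:bx + bs] for r in range(bs)]
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if not (0 <= x <= len(ref[0]) - bs and 0 <= y <= len(ref) - bs):
                continue  # candidate block falls outside the reference frame
            cand = [ref[y + r][x:x + bs] for r in range(bs)]
            cost = sum(sad(a, b) for a, b in zip(block, cand))
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best

# Reference frame with a bright 4x4 "object"; in the current frame it has
# moved one pixel right, so the best predictor sits one pixel to the left.
ref = [[0] * 8 for _ in range(8)]
for r in range(2, 6):
    for c in range(2, 6):
        ref[r][c] = 200
cur = [row[-1:] + row[:-1] for row in ref]  # shift every row right by 1
mv, cost = best_motion_vector(ref, cur, 3, 2)
print(mv, cost)  # (-1, 0) with a perfect (zero-residual) match
```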
Frame Types: I, P, and B
Not all frames are compressed equally. Codecs use three frame types with different compression strategies:
I-frames (Intra-coded): Compressed using only spatial techniques — no reference to other frames. An I-frame is essentially a standalone compressed image (like a JPEG). It's the largest frame type but the only one that can be decoded independently. Every video starts with an I-frame, and they appear periodically as sync points for seeking.
P-frames (Predictive): Compressed using motion estimation from previous frames (forward prediction only). P-frames are significantly smaller than I-frames because they only store what changed. A P-frame references one or more previous I or P frames.
B-frames (Bidirectional): Compressed using motion estimation from both previous and future frames. B-frames achieve the best compression because they have more reference options. The tradeoff: B-frames must be stored out of display order (the encoder needs to encode the future reference frame before the B-frame), which adds complexity and latency.
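The out-of-order storage can be illustrated with a sketch. Assuming the common pattern of two B-frames between references, each I/P reference frame must be emitted before the B-frames that display ahead of it:

```python
def decode_order(display):
    """Reorder frames from display order to a valid decode order."""
    out, held = [], []
    for f in display:
        if f.startswith("B"):
            held.append(f)   # a B-frame waits for its future reference
        else:
            out.append(f)    # emit the I/P reference first...
            out.extend(held) # ...then the Bs that depend on it
            held.clear()
    return out + held

print(decode_order(["I0", "B1", "B2", "P3", "B4", "B5", "P6"]))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

The gap between the two orders is exactly the extra latency B-frames introduce: P3 must be encoded and transmitted before B1 can even be decoded.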
Typical compression ratios by frame type (H.264, 1080p):
- I-frame: 50-200KB per frame
- P-frame: 10-50KB per frame
- B-frame: 5-25KB per frame
GOP (Group of Pictures) is the sequence of frames between I-frames. A typical GOP is 30-250 frames. Shorter GOPs = more I-frames = larger file = better seeking accuracy. Longer GOPs = fewer I-frames = smaller file = coarser seeking. The default in most encoders is GOP = 250 frames (about 8 seconds at 30fps).
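Plugging the illustrative frame sizes above into a back-of-the-envelope model (midpoints of the ranges, two B-frames per P-frame; both are assumptions, not measurements) shows how GOP length moves the average bitrate:

```python
# Midpoints of the H.264 1080p figures above: I~125KB, P~30KB, B~15KB.
def gop_bytes(gop_len: int, b_per_p: int = 2,
              i_kb: int = 125, p_kb: int = 30, b_kb: int = 15) -> int:
    """Bytes for one GOP: one I-frame, then P-frames with b_per_p Bs each."""
    inter = gop_len - 1                    # frames after the leading I-frame
    bs = inter * b_per_p // (b_per_p + 1)  # B-frames in the GOP
    ps = inter - bs                        # P-frames in the GOP
    return (i_kb + ps * p_kb + bs * b_kb) * 1000

def avg_kbps(gop_len: int, fps: int = 30) -> float:
    return gop_bytes(gop_len) * 8 / (gop_len / fps) / 1000

for g in (30, 90, 250):
    print(f"GOP {g:>3}: ~{avg_kbps(g):,.0f} kbps")
```

The absolute numbers are only as good as the assumed frame sizes, but the direction is the point: shortening the GOP from 250 to 30 frames raises the bitrate by roughly 15% here, purely from the extra I-frames.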
Codec Generations: The Evolution
| Codec | Standard | Year | Key Innovation | Efficiency vs Previous |
|---|---|---|---|---|
| MPEG-1 | ISO 11172 | 1993 | First practical video codec (VCD quality) | Baseline |
| MPEG-2 | ISO 13818 | 1995 | Interlaced video, DVD/broadcast quality | ~30% better than MPEG-1 |
| MPEG-4 ASP | ISO 14496-2 | 2001 | Object-based coding, quarter-pixel motion | ~30% better than MPEG-2 |
| H.264 (AVC) | ITU-T / ISO 14496-10 | 2003 | CABAC, flexible block sizes (4x4-16x16), multi-reference prediction | ~50% better than MPEG-2 |
| H.265 (HEVC) | ITU-T / ISO 23008-2 | 2013 | CTU up to 64x64, 35 intra prediction modes (33 angular plus DC and planar), SAO filter | ~50% better than H.264 |
| VP9 | Google/WebM | 2013 | Superblocks up to 64x64, 10 intra modes, adaptive reference frames | ~45% better than H.264 |
| AV1 | AOM | 2018 | Superblocks up to 128x128, 56+ intra modes, film grain synthesis, warped motion | ~65% better than H.264 |
Each generation roughly doubles compression efficiency over the one before it (or halves the required bitrate at the same quality). The cost is always encoding complexity: H.264 encoding is roughly 5x faster than H.265, which is roughly 5x faster than AV1 (software encoders).
Hardware vs Software Encoding
Software encoders (x264, x265, libaom, SVT-AV1, libvpx) run on CPU. They have access to the full codec feature set and can perform exhaustive optimization. Software encoding is slow but produces the best quality per bit.
Hardware encoders (NVIDIA NVENC, Intel Quick Sync Video, AMD VCE/AMF, Apple VideoToolbox) use dedicated silicon on GPUs and SoCs. They're dramatically faster (real-time or faster) but use simplified algorithms that produce files 20-40% larger at the same quality.
| Encoder | Type | Speed (1080p30) | Quality-per-bit |
|---|---|---|---|
| x264 (slow preset) | Software | ~40 fps | Reference quality |
| NVENC H.264 | Hardware | ~300 fps | ~75-85% of x264 |
| x265 (medium preset) | Software | ~15 fps | Reference quality |
| NVENC HEVC | Hardware | ~200 fps | ~75-85% of x265 |
| SVT-AV1 (preset 6) | Software | ~25 fps | ~90% of libaom |
| NVENC AV1 (RTX 40+) | Hardware | ~120 fps | ~80-85% of SVT-AV1 |
When to use hardware encoding: Live streaming, real-time recording (OBS, screen capture), quick exports where time matters more than file size.
When to use software encoding: Final delivery, archival, web hosting (encode once, serve many times), any situation where smaller files save bandwidth or storage cost over time.
Profiles and Levels
Codecs define profiles (which compression features are enabled) and levels (maximum resolution, bitrate, and decode complexity). This lets a single codec standard span everything from low-power IoT cameras to 8K cinema.
H.264 profiles:
- Baseline: No B-frames, no CABAC (uses CAVLC instead). For low-latency video conferencing and mobile devices. About 10-15% less efficient than Main.
- Main: Adds B-frames and CABAC. Good balance of efficiency and decode complexity.
- High: Adds 8x8 transforms, custom quantization matrices, monochrome support. Standard for Blu-ray and most modern encoding. About 10% more efficient than Main.
- High 10: Adds 10-bit color depth. Required for HDR content.
H.264 levels: Level 3.1 = max 720p 30fps. Level 4.0 = max 1080p 30fps. Level 4.1 = 1080p 30fps with more bitrate headroom (a common streaming target). Level 4.2 = max 1080p 60fps. Level 5.1 = max 4K 30fps. Level 5.2 = max 4K 60fps.
For most conversions: H.264 High Profile, Level 4.2 covers 1080p up to 60fps and is supported by virtually all modern devices; use Level 4.1 for 1080p30 when a target device lists 4.1 as its maximum. For 4K, use Level 5.1 or 5.2. Specify these in FFmpeg with -profile:v high -level 4.2.
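Level selection can be automated from the standard's limits. This sketch hard-codes a few MaxFS (frame size in macroblocks) and MaxMBPS (macroblocks per second) rows from the H.264 level table and picks the lowest listed level that fits:

```python
import math

# A few (level, MaxFS, MaxMBPS) rows from the H.264 spec's level table
# (Annex A); enough to cover 720p through 4K60.
LEVELS = [
    ("3.1", 3600, 108000),
    ("3.2", 5120, 216000),
    ("4.0", 8192, 245760),
    ("4.2", 8704, 522240),
    ("5.0", 22080, 589824),
    ("5.1", 36864, 983040),
    ("5.2", 36864, 2073600),
]

def min_level(width: int, height: int, fps: float) -> str:
    """Lowest listed level whose frame-size and throughput limits fit."""
    mbs = math.ceil(width / 16) * math.ceil(height / 16)  # 16x16 macroblocks
    for name, max_fs, max_mbps in LEVELS:
        if mbs <= max_fs and mbs * fps <= max_mbps:
            return name
    raise ValueError("beyond the listed levels")

print(min_level(1280, 720, 30))   # 3.1
print(min_level(1920, 1080, 30))  # 4.0
print(min_level(1920, 1080, 60))  # 4.2
print(min_level(3840, 2160, 60))  # 5.2
```

Note that 1080p60 lands on Level 4.2, not 4.1: Levels 4.0 and 4.1 share the same macroblocks-per-second ceiling, so 4.1 only buys extra bitrate, not extra frame rate.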
Encoding Modes: CRF, CBR, VBR, and CQ
How the encoder allocates bits across frames dramatically affects both quality and file size.
CRF (Constant Rate Factor): The encoder targets a constant perceptual quality. Easy frames get fewer bits; complex frames get more. File size varies depending on content complexity. This is the recommended mode for offline encoding. CRF 23 is the x264 default. Lower = better quality = larger files.
CBR (Constant Bitrate): Every second of video gets the same number of bits regardless of complexity. Simple scenes waste bits; complex scenes don't get enough. Used for streaming where the delivery channel has a fixed bandwidth (e.g., satellite broadcast). Avoid for file-based encoding.
VBR (Variable Bitrate): The encoder varies the bitrate within a specified range (min/max) while targeting an average. More intelligent bit allocation than CBR, but the target average bitrate means the encoder must predict complexity across the entire video. Two-pass VBR encoding improves this by analyzing the video in the first pass and allocating bits in the second.
CQ (Constant Quality): Similar to CRF but specific to hardware encoders. NVENC and QSV use CQ mode to approximate CRF behavior. The quality scale differs from CRF — NVENC CQ 19 roughly equals x264 CRF 19, but the exact mapping varies by content.
Recommendation: Use CRF for all offline encoding. CRF 18 for visually lossless, CRF 23 for good quality, CRF 28 for acceptable quality at smaller size. Only use CBR/VBR when a target file size or bitrate is required.
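As a concrete starting point, a CRF encode in FFmpeg built from Python (the filenames and CRF value are placeholders to adjust):

```python
import subprocess

def crf_encode_cmd(src: str, dst: str, crf: int = 23,
                   preset: str = "medium") -> list[str]:
    # x264 CRF mode: constant perceptual quality, variable file size.
    return ["ffmpeg", "-i", src,
            "-c:v", "libx264", "-crf", str(crf), "-preset", preset,
            "-c:a", "copy",  # leave the audio stream untouched
            dst]

cmd = crf_encode_cmd("input.mkv", "output.mp4", crf=18)  # visually lossless
# subprocess.run(cmd, check=True)  # uncomment to run the encode
print(" ".join(cmd))
```

The `-preset` flag trades encoding time for compression efficiency independently of CRF: a slower preset produces a smaller file at the same quality target.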
Video codecs are fundamentally about trading computation for compression. Newer codecs search harder for redundancy, try more prediction modes, and use more sophisticated transforms — all of which takes more CPU time but produces smaller files at the same quality.
For practical conversion tasks, the essential knowledge is: CRF mode for quality control, the right profile/level for your target devices, and knowing when a conversion is a remux (fast, lossless) versus a transcode (slow, potential quality change). The codec choice itself — H.264, H.265, or AV1 — depends on your audience's devices, your encoding time budget, and whether file size savings justify the computational cost.