Every compressed file on your computer exploits the same fundamental insight: data contains patterns, and patterns can be represented more efficiently. A file that says "the cat sat on the mat" has obvious redundancy — "the" appears twice, "at" appears three times. Compression algorithms find these patterns systematically and replace them with shorter representations.

This guide covers the actual algorithms behind the file formats. When you ZIP a file (Deflate), compress a tarball (GZIP/XZ/ZSTD), or create a 7z archive (LZMA2), these are the algorithms doing the work. Understanding them explains why some formats compress better than others, why compression speed varies so dramatically, and why some data barely compresses at all.

Everything here is lossless — the decompressed output is bit-for-bit identical to the original input. Lossy compression (JPEG, MP3, H.264) uses entirely different techniques that discard data the human eye or ear won't miss. We're talking about data compression, not perceptual compression.

The Fundamental Principle: Redundancy Exploitation

All lossless compression reduces to one operation: finding redundancy and encoding it more efficiently. Redundancy comes in two forms:

  • Repetition (sequential redundancy): The same byte sequences appear multiple times. "function" appears 500 times in a JavaScript file. Each occurrence after the first can be replaced with a short reference: "copy from offset X, length Y."
  • Distribution bias (statistical redundancy): Some bytes appear far more often than others. In English text, 'e' appears 12.7% of the time, 'z' appears 0.07%. Assigning short codes to frequent bytes and long codes to rare bytes reduces total size.

Dictionary-based algorithms attack the first type. Entropy coders attack the second. The best modern algorithms (Deflate, LZMA, Zstandard) attack both simultaneously, which is why they compress better than either approach alone.

Data that has no redundancy — truly random bytes — cannot be compressed. This is provable: no algorithm can compress all possible inputs. Files that are already compressed (JPG, MP3, PNG, ZIP) contain minimal residual redundancy, which is why compressing them again barely reduces size.
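This is easy to observe directly. A quick sketch using Python's stdlib zlib (Deflate) on redundant text versus random bytes:

```python
import os
import zlib

# Highly redundant input: one phrase repeated many times.
text = b"the cat sat on the mat. " * 1000
# Incompressible input: random bytes contain no exploitable redundancy.
noise = os.urandom(len(text))

print(len(zlib.compress(text)))   # a tiny fraction of the original 24,000 bytes
print(len(zlib.compress(noise)))  # slightly LARGER than the original (framing overhead)
```

The random input actually grows slightly, because Deflate falls back to "stored" blocks and still has to pay for headers, which is exactly the behavior you see when zipping a folder of JPEGs.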

LZ77 and LZ78: Dictionary-Based Compression

Abraham Lempel and Jacob Ziv published two foundational algorithms in 1977 and 1978 that spawned entire families of compressors.

LZ77: Sliding Window Compression

LZ77 uses a "sliding window" of recently-processed data (typically 32KB to 64MB). As it reads input, it searches the window for the longest match — a sequence of bytes it has already seen. When it finds a match, it emits a back-reference: (distance, length) meaning "go back distance bytes and copy length bytes." When there's no match, it emits the literal byte.

Example: "ABCABCABC" — after seeing the first "ABC", the second "ABC" is encoded as (3, 3) — go back 3 bytes, copy 3. The third "ABC" is (3, 3) or (6, 3). In fact, because a copy is allowed to overlap the current position (length can exceed distance), the entire "ABCABC" tail can be encoded as a single (3, 6) reference. The longer the match, the better the compression.
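A toy matcher makes the mechanics concrete. This is a naive sketch (the function name and structure are mine, not any real library's API); production compressors replace the inner loop with hash chains or suffix structures:

```python
def lz77_compress(data, window=32 * 1024, min_match=3):
    """Toy LZ77: emit literal bytes and (distance, length) back-references."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        start = max(0, i - window)
        # Naive search for the longest match inside the sliding window.
        for j in range(start, i):
            k = 0
            # k may run past i: overlapping copies are legal in LZ77.
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])  # no useful match: emit the literal byte
            i += 1
    return out

print(lz77_compress(b"ABCABCABC"))  # → [65, 66, 67, (3, 6)]
```

Note that the greedy search finds a single overlapping (3, 6) reference covering the whole "ABCABC" tail, rather than two separate (3, 3) references.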

The window size determines how far back the algorithm can look for matches. GZIP's Deflate uses a 32KB window. LZMA uses up to 1.5GB. This difference is the primary reason LZMA compresses better — it can find duplicate patterns separated by millions of bytes, not just 32,000.

LZ77 variants power: GZIP, ZIP (Deflate), LZMA (7z, XZ), Zstandard, Brotli, LZ4, Snappy. Almost every modern compressor is an LZ77 derivative.

LZ78 and LZW: Explicit Dictionary

LZ78 builds an explicit dictionary of seen phrases rather than searching a sliding window. Each new phrase is the longest existing dictionary entry plus one new byte. LZW (Lempel-Ziv-Welch, 1984) is the most famous LZ78 variant — it was used in GIF images, Unix compress, and early modems (V.42bis).
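The "longest existing entry plus one new byte" rule is short enough to sketch. This toy LZW encoder (function name is mine) starts the dictionary with all 256 single bytes, as the real algorithm does:

```python
def lzw_encode(data):
    """Toy LZW: grow a phrase dictionary, emit one code per phrase."""
    table = {bytes([b]): b for b in range(256)}  # codes 0-255: single bytes
    phrase, codes = b"", []
    for b in data:
        candidate = phrase + bytes([b])
        if candidate in table:
            phrase = candidate               # keep extending the current phrase
        else:
            codes.append(table[phrase])      # emit the longest known phrase
            table[candidate] = len(table)    # new entry: old phrase + 1 new byte
            phrase = bytes([b])
    if phrase:
        codes.append(table[phrase])
    return codes

print(lzw_encode(b"ABABAB"))  # → [65, 66, 256, 256] — "AB" becomes code 256
```

The decoder can rebuild the identical dictionary from the code stream alone, which is why LZW never needs to transmit the dictionary.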

LZW's historical significance is enormous (it's why GIF exists), but it's been surpassed by LZ77-based algorithms for general compression. The Unisys patent on LZW (expired 2003-2004) drove the creation of GZIP and PNG as patent-free alternatives, which turned out to compress better anyway.

Today, LZ78/LZW is mainly encountered in legacy GIF files and old Unix .Z compressed files. No modern archive format uses it.

Entropy Coding: Huffman, Arithmetic, and ANS

Entropy coding assigns shorter binary codes to more frequent symbols and longer codes to rarer symbols. The theoretical minimum size is given by Shannon's entropy: H = -sum(p * log2(p)) bits per symbol. A perfect entropy coder achieves this minimum.
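The formula translates directly into code. A short sketch (the function name is mine) that measures the entropy of a byte string's distribution:

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Bits per symbol of the byte distribution: H = -sum(p * log2(p))."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy(b"ab" * 4))      # two equally likely symbols: 1.0 bit
print(shannon_entropy(b"aaaaaaab"))    # skewed distribution: ~0.54 bits
```

A file whose bytes measure close to 8.0 bits/symbol (already-compressed or encrypted data) is, by this measure, nearly incompressible for a pure entropy coder.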

Huffman Coding (1952)

David Huffman's algorithm builds a binary tree from symbol frequencies. Frequent symbols get short codes (near the root), rare symbols get long codes (deep in the tree). The code is prefix-free — no code is a prefix of another — so the decoder can unambiguously parse the bit stream.

Example: if 'e' has frequency 30%, it might get a 2-bit code (10). If 'q' has frequency 0.1%, it gets a 12-bit code (110110110100). The average bits-per-symbol is lower than a fixed 8-bits-per-byte encoding.
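The tree construction is a classic use of a min-heap: repeatedly merge the two lowest-frequency nodes. A compact sketch (names are mine):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build prefix-free codes from a {symbol: frequency} map."""
    tiebreak = count()  # unique tags so the heap never compares unlike nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least-frequent nodes...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))  # ...merge
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse both branches
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # degenerate single-symbol input
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"e": 30, "t": 20, "a": 15, "q": 1}))
```

Frequent symbols end up near the root (short codes), rare ones deep in the tree (long codes), and because codes come from tree leaves, no code can be a prefix of another.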

Huffman coding is used in: Deflate (GZIP, ZIP, PNG), JPEG (for the final entropy coding stage), and many legacy formats. It's simple, fast, and gets within 1 bit per symbol of the theoretical optimum. Its weakness: it assigns whole-number bit lengths, so it can't represent symbols that should get, say, 2.3 bits.

Arithmetic Coding and Range Coding

Arithmetic coding encodes the entire input as a single fraction between 0 and 1, achieving compression within a couple of bits of the theoretical entropy limit for the whole message — the overhead is per message, not per symbol, which is what makes it beat Huffman on skewed distributions. Instead of assigning a fixed code per symbol, it progressively narrows a numeric range based on each symbol's probability.

Range coding is a simplified variant that uses integers instead of fractions, avoiding the floating-point precision issues. LZMA uses range coding for its entropy coding stage, which is one reason it compresses better than Deflate (which uses Huffman).

Arithmetic coding's advantage over Huffman is most significant when symbol probabilities are very uneven. If one symbol has probability 95%, Huffman must still assign it 1 bit (the minimum code length). Arithmetic coding can give it close to the theoretical -log2(0.95) ≈ 0.074 bits — more than a 13x saving on that symbol.
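The range-narrowing idea can be shown without a full bit-level coder. This sketch (names mine; real coders use integer range arithmetic and emit bits incrementally) just tracks the shrinking interval — its final width is the product of the symbol probabilities, and the bits needed are about -log2 of that width:

```python
import math

def arithmetic_interval(message, probs):
    """Narrow [low, high) by each symbol's slice of the current interval."""
    cum, running = {}, 0.0
    for sym in sorted(probs):   # cumulative start of each symbol's slice
        cum[sym] = running
        running += probs[sym]
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        high = low + width * (cum[sym] + probs[sym])  # order matters: use old low
        low = low + width * cum[sym]
    return low, high

low, high = arithmetic_interval("A" * 20 + "B", {"A": 0.95, "B": 0.05})
print(math.ceil(-math.log2(high - low)))  # ~6 bits for the whole 21-symbol message
```

A 1-bit-per-symbol Huffman code would need 21 bits for the same message; the interval width 0.95^20 × 0.05 says about 6 bits suffice.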

ANS: Asymmetric Numeral Systems (2009)

ANS, developed by Jarek Duda in 2009, achieves arithmetic-coding-level compression ratios with Huffman-coding-level speed. It encodes by mapping symbols to integers using a state machine, which is simple enough to run at near-memory-bandwidth speed on modern CPUs.
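The simplest variant to sketch is rANS (range ANS): the entire message folds into one integer state, growing by roughly -log2(p) bits per symbol. This toy (names mine; real implementations renormalize the state to emit bytes as they go) encodes backwards so decoding runs forwards:

```python
def _cumulative(freqs):
    cum, running = {}, 0
    for sym in sorted(freqs):   # slice start of each symbol in [0, total)
        cum[sym] = running
        running += freqs[sym]
    return cum

def rans_encode(message, freqs, total):
    """rANS: fold each symbol into a single growing integer state."""
    cum, x = _cumulative(freqs), 1
    for sym in reversed(message):
        f = freqs[sym]
        x = (x // f) * total + cum[sym] + (x % f)
    return x

def rans_decode(x, freqs, total, n):
    cum, out = _cumulative(freqs), []
    for _ in range(n):
        slot = x % total   # which symbol's slice does the state land in?
        sym = next(s for s in freqs if cum[s] <= slot < cum[s] + freqs[s])
        out.append(sym)
        x = freqs[sym] * (x // total) + slot - cum[sym]  # exact inverse of encode
    return "".join(out)
```

Frequent symbols (large f) grow the state slowly, rare symbols grow it fast — the same asymmetry an arithmetic coder expresses with interval widths, but done with one multiply, one divide, and one add per symbol.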

ANS powers Zstandard (Facebook) via its FSE (Finite State Entropy) coder, and Apple's LZFSE. Its speed advantage over arithmetic coding is a key reason Zstandard decompresses so much faster than LZMA — comparable entropy-coding efficiency, but the coding stage runs several times faster.

ANS is arguably the most important compression algorithm development since LZMA. It broke the assumption that better compression requires slower decoding.

DEFLATE: LZ77 + Huffman (1993)

Deflate, designed by Phil Katz for PKZIP, is the most widely deployed compression algorithm in computing. It uses LZ77 with a 32KB sliding window, followed by Huffman coding of the LZ77 output (literal bytes and back-references).

The algorithm: (1) Scan input with a 32KB window, emitting literal bytes and (distance, length) back-references. (2) Build Huffman trees for the literals/lengths and for the distances. (3) Encode the LZ77 output using these Huffman codes. (4) Write the Huffman tree definitions and encoded data as a "block."

Deflate is used in GZIP, ZIP, PNG, HTTP content-encoding, PDF, and zlib. Its 32KB window size is the main limitation — it can't find duplicate patterns separated by more than 32KB, which limits compression on files with long-range redundancy. Zopfli (Google) is an ultra-slow Deflate compressor that exhaustively searches for the optimal encoding within Deflate's constraints, producing ~5% smaller output at 100x the CPU cost.
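The window limit is easy to demonstrate empirically with Python's zlib. Place a duplicate 10,000 bytes back (inside the window) versus 40,000 bytes back (outside it):

```python
import random
import zlib

random.seed(0)  # fixed seed so the sizes are reproducible
near = bytes(random.randrange(256) for _ in range(10_000))
far = bytes(random.randrange(256) for _ in range(40_000))

# Second copy starts 10,000 bytes back — inside the 32KB window, so the
# entire repeat collapses into back-references.
print(len(zlib.compress(near + near, 9)) / len(zlib.compress(near, 9)))  # ~1.0

# Second copy starts 40,000 bytes back — outside the window, so Deflate
# never sees the repetition and the output roughly doubles.
print(len(zlib.compress(far + far, 9)) / len(zlib.compress(far, 9)))     # ~2.0
```

Same redundancy in both inputs; only the distance between the copies changes.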

LZMA: Large Dictionary + Range Coding

LZMA (Igor Pavlov, 1998) achieves dramatically better compression than Deflate through three improvements:

  1. Much larger dictionary: Up to 1.5GB vs Deflate's 32KB. Finding patterns across a wider window captures redundancy that Deflate misses entirely.
  2. Range coding instead of Huffman: Gets closer to Shannon entropy, especially for highly biased probability distributions.
  3. Context modeling: LZMA uses a Markov chain model that considers recent symbols when predicting the next byte's probability distribution. If the last two bytes were "th", the probability of 'e' is very high. This context awareness improves the entropy coder's efficiency.

The combined effect is substantial: LZMA typically compresses 20-30% smaller than Deflate on text and 10-15% smaller on binary data. The cost is speed — LZMA compression is 5-10x slower due to the larger dictionary search and more complex modeling.
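The dictionary-size advantage shows up directly when the redundancy is long-range. Using the stdlib lzma module (preset 9 uses a 64MB dictionary) against zlib on a repeat that lies 40,000 bytes back:

```python
import lzma
import random
import zlib

random.seed(0)
chunk = bytes(random.randrange(256) for _ in range(40_000))
data = chunk + chunk  # the duplicate starts 40,000 bytes back

# Deflate's 32KB window can't reach the first copy: output ~doubles.
print(len(zlib.compress(data, 9)))
# LZMA's dictionary spans the whole input: the second copy collapses
# into back-references, so output is barely more than one copy.
print(len(lzma.compress(data, preset=9)))
```

The first number lands near 80KB, the second near 40KB — the entire difference is window reach, not entropy coding.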

LZMA2 (used in 7z and XZ) adds multi-threading (parallel block compression) and better incompressible data handling. The core algorithm is the same.

Burrows-Wheeler Transform (BWT)

The Burrows-Wheeler Transform (1994) is a conceptually beautiful algorithm used by BZIP2. It doesn't compress directly — it rearranges the input to make it easier for a subsequent compressor.

BWT works by creating all rotations of the input string, sorting them lexicographically, and taking the last column. This last column has a remarkable property: characters that appear in similar contexts cluster together. All 'e's that follow 'th' end up adjacent, all spaces after periods end up together, and so on.

The clustered output is then compressed with Move-to-Front transform (which exploits the local clustering) and Huffman coding. The result compresses 10-15% better than Deflate.
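Both directions of the transform fit in a few lines as an O(n²)-ish toy (real implementations use suffix arrays; a unique terminator like '$' makes inversion unambiguous):

```python
def bwt(s):
    """Forward BWT: sort all rotations of s, take the last column."""
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)

def ibwt(last):
    """Inverse BWT: repeatedly prepend the last column and re-sort."""
    n = len(last)
    table = [""] * n
    for _ in range(n):
        table = sorted(last[i] + table[i] for i in range(n))
    # The original string is the row that ends with the terminator.
    return next(row for row in table if row.endswith("$"))

print(bwt("banana$"))  # → "annb$aa" — note the clustered runs of repeated letters
```

The transform stores no extra data beyond the terminator, yet it is fully reversible — which is what makes it usable in a lossless pipeline like BZIP2's.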

BWT's weakness: the sorting step requires O(n) memory and O(n log n) time, where n is the block size. BZIP2's default 900KB block size keeps memory reasonable but limits the context window. LZMA achieves better compression with a simpler algorithm by using a larger dictionary. BWT is elegant but has been outpaced by dictionary-based methods.

Brotli: Google's Web-Optimized Algorithm

Brotli (2015, Google) was designed specifically for web content — HTML, CSS, JavaScript, and WOFF2 font compression. It uses LZ77 with a large window (up to 16MB), context-modeled Huffman entropy coding, and a built-in static dictionary of 13,504 common words and phrases drawn from a corpus of English text and HTML/CSS/JS tokens.

The static dictionary is Brotli's secret weapon for web content. When compressing HTML, common strings like "