Every compressed file on your computer exploits the same fundamental insight: data contains patterns, and patterns can be represented more efficiently. A file that says "the cat sat on the mat" has obvious redundancy — "the" appears twice, "at" appears three times. Compression algorithms find these patterns systematically and replace them with shorter representations.

This guide covers the actual algorithms behind the file formats. When you ZIP a file (Deflate), compress a tarball (GZIP/XZ/ZSTD), or create a 7z archive (LZMA2), these are the algorithms doing the work. Understanding them explains why some formats compress better than others, why compression speed varies so dramatically, and why some data barely compresses at all.

Everything here is lossless — the decompressed output is bit-for-bit identical to the original input. Lossy compression (JPEG, MP3, H.264) uses entirely different techniques that discard data the human eye or ear won't miss. We're talking about data compression, not perceptual compression.

The Fundamental Principle: Redundancy Exploitation

All lossless compression reduces to one operation: finding redundancy and encoding it more efficiently. Redundancy comes in two forms:

  • Repetition (sequential redundancy): The same byte sequences appear multiple times. "function" appears 500 times in a JavaScript file. Each occurrence after the first can be replaced with a short reference: "copy from offset X, length Y."
  • Distribution bias (statistical redundancy): Some bytes appear far more often than others. In English text, 'e' appears 12.7% of the time, 'z' appears 0.07%. Assigning short codes to frequent bytes and long codes to rare bytes reduces total size.

Dictionary-based algorithms attack the first type. Entropy coders attack the second. The best modern algorithms (Deflate, LZMA, Zstandard) attack both simultaneously, which is why they compress better than either approach alone.

Data that has no redundancy — truly random bytes — cannot be compressed. This is provable: no algorithm can compress all possible inputs. Files that are already compressed (JPG, MP3, PNG, ZIP) contain minimal residual redundancy, which is why compressing them again barely reduces size.
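This is easy to observe directly. A quick sketch using Python's stdlib zlib (Deflate) on redundant text versus random bytes:

```python
import os
import zlib

# Highly redundant input: one phrase repeated many times.
text = b"the cat sat on the mat. " * 1000
# Incompressible input: random bytes contain no exploitable redundancy.
noise = os.urandom(len(text))

print(len(zlib.compress(text)))   # a tiny fraction of the original 24,000 bytes
print(len(zlib.compress(noise)))  # slightly LARGER than the original (framing overhead)
```

The random input actually grows slightly, because Deflate falls back to "stored" blocks and still has to pay for headers, which is exactly the behavior you see when zipping a folder of JPEGs.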

LZ77 and LZ78: Dictionary-Based Compression

Abraham Lempel and Jacob Ziv published two foundational algorithms in 1977 and 1978 that spawned entire families of compressors.

LZ77: Sliding Window Compression

LZ77 uses a "sliding window" of recently-processed data (typically 32KB to 64MB). As it reads input, it searches the window for the longest match — a sequence of bytes it has already seen. When it finds a match, it emits a back-reference: (distance, length) meaning "go back distance bytes and copy length bytes." When there's no match, it emits the literal byte.

Example: "ABCABCABC" — after seeing the first "ABC", the second "ABC" is encoded as (3, 3) — go back 3 bytes, copy 3. The third "ABC" is (3, 3) or (6, 3). In fact, because a copy is allowed to overlap the current position (length can exceed distance), the entire "ABCABC" tail can be encoded as a single (3, 6) reference. The longer the match, the better the compression.
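A toy matcher makes the mechanics concrete. This is a naive sketch (the function name and structure are mine, not any real library's API); production compressors replace the inner loop with hash chains or suffix structures:

```python
def lz77_compress(data, window=32 * 1024, min_match=3):
    """Toy LZ77: emit literal bytes and (distance, length) back-references."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        start = max(0, i - window)
        # Naive search for the longest match inside the sliding window.
        for j in range(start, i):
            k = 0
            # k may run past i: overlapping copies are legal in LZ77.
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])  # no useful match: emit the literal byte
            i += 1
    return out

print(lz77_compress(b"ABCABCABC"))  # → [65, 66, 67, (3, 6)]
```

Note that the greedy search finds a single overlapping (3, 6) reference covering the whole "ABCABC" tail, rather than two separate (3, 3) references.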

The window size determines how far back the algorithm can look for matches. GZIP's Deflate uses a 32KB window. LZMA uses up to 1.5GB. This difference is the primary reason LZMA compresses better — it can find duplicate patterns separated by millions of bytes, not just 32,000.

LZ77 variants power: GZIP, ZIP (Deflate), LZMA (7z, XZ), Zstandard, Brotli, LZ4, Snappy. Almost every modern compressor is an LZ77 derivative.

LZ78 and LZW: Explicit Dictionary

LZ78 builds an explicit dictionary of seen phrases rather than searching a sliding window. Each new phrase is the longest existing dictionary entry plus one new byte. LZW (Lempel-Ziv-Welch, 1984) is the most famous LZ78 variant — it was used in GIF images, Unix compress, and early modems (V.42bis).
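The "longest existing entry plus one new byte" rule is short enough to sketch. This toy LZW encoder (function name is mine) starts the dictionary with all 256 single bytes, as the real algorithm does:

```python
def lzw_encode(data):
    """Toy LZW: grow a phrase dictionary, emit one code per phrase."""
    table = {bytes([b]): b for b in range(256)}  # codes 0-255: single bytes
    phrase, codes = b"", []
    for b in data:
        candidate = phrase + bytes([b])
        if candidate in table:
            phrase = candidate               # keep extending the current phrase
        else:
            codes.append(table[phrase])      # emit the longest known phrase
            table[candidate] = len(table)    # new entry: old phrase + 1 new byte
            phrase = bytes([b])
    if phrase:
        codes.append(table[phrase])
    return codes

print(lzw_encode(b"ABABAB"))  # → [65, 66, 256, 256] — "AB" becomes code 256
```

The decoder can rebuild the identical dictionary from the code stream alone, which is why LZW never needs to transmit the dictionary.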

LZW's historical significance is enormous (it's why GIF exists), but it's been surpassed by LZ77-based algorithms for general compression. The Unisys patent on LZW (expired 2003-2004) drove the creation of GZIP and PNG as patent-free alternatives, which turned out to compress better anyway.

Today, LZ78/LZW is mainly encountered in legacy GIF files and old Unix .Z compressed files. No modern archive format uses it.

Entropy Coding: Huffman, Arithmetic, and ANS

Entropy coding assigns shorter binary codes to more frequent symbols and longer codes to rarer symbols. The theoretical minimum size is given by Shannon's entropy: H = -sum(p * log2(p)) bits per symbol. A perfect entropy coder achieves this minimum.
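The formula translates directly into code. A short sketch (the function name is mine) that measures the entropy of a byte string's distribution:

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Bits per symbol of the byte distribution: H = -sum(p * log2(p))."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy(b"ab" * 4))      # two equally likely symbols: 1.0 bit
print(shannon_entropy(b"aaaaaaab"))    # skewed distribution: ~0.54 bits
```

A file whose bytes measure close to 8.0 bits/symbol (already-compressed or encrypted data) is, by this measure, nearly incompressible for a pure entropy coder.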

Huffman Coding (1952)

David Huffman's algorithm builds a binary tree from symbol frequencies. Frequent symbols get short codes (near the root), rare symbols get long codes (deep in the tree). The code is prefix-free — no code is a prefix of another — so the decoder can unambiguously parse the bit stream.

Example: if 'e' has frequency 30%, it might get a 2-bit code (10). If 'q' has frequency 0.1%, it gets a 12-bit code (110110110100). The average bits-per-symbol is lower than a fixed 8-bits-per-byte encoding.
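The tree construction is a classic use of a min-heap: repeatedly merge the two lowest-frequency nodes. A compact sketch (names are mine):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build prefix-free codes from a {symbol: frequency} map."""
    tiebreak = count()  # unique tags so the heap never compares unlike nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least-frequent nodes...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))  # ...merge
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse both branches
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # degenerate single-symbol input
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"e": 30, "t": 20, "a": 15, "q": 1}))
```

Frequent symbols end up near the root (short codes), rare ones deep in the tree (long codes), and because codes come from tree leaves, no code can be a prefix of another.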

Huffman coding is used in: Deflate (GZIP, ZIP, PNG), JPEG (for the final entropy coding stage), and many legacy formats. It's simple, fast, and gets within 1 bit per symbol of the theoretical optimum. Its weakness: it assigns whole-number bit lengths, so it can't represent symbols that should get, say, 2.3 bits.

Arithmetic Coding and Range Coding

Arithmetic coding encodes the entire input as a single fraction between 0 and 1, achieving compression within a couple of bits of the theoretical entropy limit for the whole message — the overhead is per message, not per symbol, which is what makes it beat Huffman on skewed distributions. Instead of assigning a fixed code per symbol, it progressively narrows a numeric range based on each symbol's probability.

Range coding is a simplified variant that uses integers instead of fractions, avoiding the floating-point precision issues. LZMA uses range coding for its entropy coding stage, which is one reason it compresses better than Deflate (which uses Huffman).

Arithmetic coding's advantage over Huffman is most significant when symbol probabilities are very uneven. If one symbol has probability 95%, Huffman must still assign it 1 bit (the minimum code length). Arithmetic coding can give it close to the theoretical -log2(0.95) ≈ 0.074 bits — more than a 13x saving on that symbol.
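The range-narrowing idea can be shown without a full bit-level coder. This sketch (names mine; real coders use integer range arithmetic and emit bits incrementally) just tracks the shrinking interval — its final width is the product of the symbol probabilities, and the bits needed are about -log2 of that width:

```python
import math

def arithmetic_interval(message, probs):
    """Narrow [low, high) by each symbol's slice of the current interval."""
    cum, running = {}, 0.0
    for sym in sorted(probs):   # cumulative start of each symbol's slice
        cum[sym] = running
        running += probs[sym]
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        high = low + width * (cum[sym] + probs[sym])  # order matters: use old low
        low = low + width * cum[sym]
    return low, high

low, high = arithmetic_interval("A" * 20 + "B", {"A": 0.95, "B": 0.05})
print(math.ceil(-math.log2(high - low)))  # ~6 bits for the whole 21-symbol message
```

A 1-bit-per-symbol Huffman code would need 21 bits for the same message; the interval width 0.95^20 × 0.05 says about 6 bits suffice.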

ANS: Asymmetric Numeral Systems (2009)

ANS, developed by Jarek Duda in 2009, achieves arithmetic-coding-level compression ratios with Huffman-coding-level speed. It encodes by mapping symbols to integers using a state machine, which is simple enough to run at near-memory-bandwidth speed on modern CPUs.
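The simplest variant to sketch is rANS (range ANS): the entire message folds into one integer state, growing by roughly -log2(p) bits per symbol. This toy (names mine; real implementations renormalize the state to emit bytes as they go) encodes backwards so decoding runs forwards:

```python
def _cumulative(freqs):
    cum, running = {}, 0
    for sym in sorted(freqs):   # slice start of each symbol in [0, total)
        cum[sym] = running
        running += freqs[sym]
    return cum

def rans_encode(message, freqs, total):
    """rANS: fold each symbol into a single growing integer state."""
    cum, x = _cumulative(freqs), 1
    for sym in reversed(message):
        f = freqs[sym]
        x = (x // f) * total + cum[sym] + (x % f)
    return x

def rans_decode(x, freqs, total, n):
    cum, out = _cumulative(freqs), []
    for _ in range(n):
        slot = x % total   # which symbol's slice does the state land in?
        sym = next(s for s in freqs if cum[s] <= slot < cum[s] + freqs[s])
        out.append(sym)
        x = freqs[sym] * (x // total) + slot - cum[sym]  # exact inverse of encode
    return "".join(out)
```

Frequent symbols (large f) grow the state slowly, rare symbols grow it fast — the same asymmetry an arithmetic coder expresses with interval widths, but done with one multiply, one divide, and one add per symbol.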

ANS powers Zstandard (Facebook) via its FSE (Finite State Entropy) coder, and Apple's LZFSE. Its speed advantage over arithmetic coding is a key reason Zstandard decompresses so much faster than LZMA — comparable entropy-coding efficiency, but the coding stage runs several times faster.

ANS is arguably the most important compression algorithm development since LZMA. It broke the assumption that better compression requires slower decoding.

DEFLATE: LZ77 + Huffman (1993)

Deflate, designed by Phil Katz for PKZIP, is the most widely deployed compression algorithm in computing. It uses LZ77 with a 32KB sliding window, followed by Huffman coding of the LZ77 output (literal bytes and back-references).

The algorithm: (1) Scan input with a 32KB window, emitting literal bytes and (distance, length) back-references. (2) Build Huffman trees for the literals/lengths and for the distances. (3) Encode the LZ77 output using these Huffman codes. (4) Write the Huffman tree definitions and encoded data as a "block."

Deflate is used in GZIP, ZIP, PNG, HTTP content-encoding, PDF, and zlib. Its 32KB window size is the main limitation — it can't find duplicate patterns separated by more than 32KB, which limits compression on files with long-range redundancy. Zopfli (Google) is an ultra-slow Deflate compressor that exhaustively searches for the optimal encoding within Deflate's constraints, producing ~5% smaller output at 100x the CPU cost.
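The window limit is easy to demonstrate empirically with Python's zlib. Place a duplicate 10,000 bytes back (inside the window) versus 40,000 bytes back (outside it):

```python
import random
import zlib

random.seed(0)  # fixed seed so the sizes are reproducible
near = bytes(random.randrange(256) for _ in range(10_000))
far = bytes(random.randrange(256) for _ in range(40_000))

# Second copy starts 10,000 bytes back — inside the 32KB window, so the
# entire repeat collapses into back-references.
print(len(zlib.compress(near + near, 9)) / len(zlib.compress(near, 9)))  # ~1.0

# Second copy starts 40,000 bytes back — outside the window, so Deflate
# never sees the repetition and the output roughly doubles.
print(len(zlib.compress(far + far, 9)) / len(zlib.compress(far, 9)))     # ~2.0
```

Same redundancy in both inputs; only the distance between the copies changes.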

LZMA: Large Dictionary + Range Coding

LZMA (Igor Pavlov, 1998) achieves dramatically better compression than Deflate through three improvements:

  1. Much larger dictionary: Up to 1.5GB vs Deflate's 32KB. Finding patterns across a wider window captures redundancy that Deflate misses entirely.
  2. Range coding instead of Huffman: Gets closer to Shannon entropy, especially for highly biased probability distributions.
  3. Context modeling: LZMA uses a Markov chain model that considers recent symbols when predicting the next byte's probability distribution. If the last two bytes were "th", the probability of 'e' is very high. This context awareness improves the entropy coder's efficiency.

The combined effect is substantial: LZMA typically compresses 20-30% smaller than Deflate on text and 10-15% smaller on binary data. The cost is speed — LZMA compression is 5-10x slower due to the larger dictionary search and more complex modeling.
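The dictionary-size advantage shows up directly when the redundancy is long-range. Using the stdlib lzma module (preset 9 uses a 64MB dictionary) against zlib on a repeat that lies 40,000 bytes back:

```python
import lzma
import random
import zlib

random.seed(0)
chunk = bytes(random.randrange(256) for _ in range(40_000))
data = chunk + chunk  # the duplicate starts 40,000 bytes back

# Deflate's 32KB window can't reach the first copy: output ~doubles.
print(len(zlib.compress(data, 9)))
# LZMA's dictionary spans the whole input: the second copy collapses
# into back-references, so output is barely more than one copy.
print(len(lzma.compress(data, preset=9)))
```

The first number lands near 80KB, the second near 40KB — the entire difference is window reach, not entropy coding.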

LZMA2 (used in 7z and XZ) adds multi-threading (parallel block compression) and better incompressible data handling. The core algorithm is the same.

Burrows-Wheeler Transform (BWT)

The Burrows-Wheeler Transform (1994) is a conceptually beautiful algorithm used by BZIP2. It doesn't compress directly — it rearranges the input to make it easier for a subsequent compressor.

BWT works by creating all rotations of the input string, sorting them lexicographically, and taking the last column. This last column has a remarkable property: characters that appear in similar contexts cluster together. All 'e's that follow 'th' end up adjacent, all spaces after periods end up together, and so on.

The clustered output is then compressed with Move-to-Front transform (which exploits the local clustering) and Huffman coding. The result compresses 10-15% better than Deflate.
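Both directions of the transform fit in a few lines as an O(n²)-ish toy (real implementations use suffix arrays; a unique terminator like '$' makes inversion unambiguous):

```python
def bwt(s):
    """Forward BWT: sort all rotations of s, take the last column."""
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)

def ibwt(last):
    """Inverse BWT: repeatedly prepend the last column and re-sort."""
    n = len(last)
    table = [""] * n
    for _ in range(n):
        table = sorted(last[i] + table[i] for i in range(n))
    # The original string is the row that ends with the terminator.
    return next(row for row in table if row.endswith("$"))

print(bwt("banana$"))  # → "annb$aa" — note the clustered runs of repeated letters
```

The transform stores no extra data beyond the terminator, yet it is fully reversible — which is what makes it usable in a lossless pipeline like BZIP2's.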

BWT's weakness: the sorting step requires O(n) memory and O(n log n) time, where n is the block size. BZIP2's default 900KB block size keeps memory reasonable but limits the context window. LZMA achieves better compression with a simpler algorithm by using a larger dictionary. BWT is elegant but has been outpaced by dictionary-based methods.

Brotli: Google's Web-Optimized Algorithm

Brotli (2015, Google) was designed specifically for web content — HTML, CSS, JavaScript, and WOFF2 font compression. It uses LZ77 with a large window (up to 16MB), context-modeled Huffman entropy coding, and a built-in static dictionary of 13,504 common words and phrases drawn from a corpus of English text and HTML/CSS/JS tokens.

The static dictionary is Brotli's secret weapon for web content. When compressing HTML, common strings like "