If you write software, you interact with archive formats constantly — even when you don't realize it. Every npm install downloads and extracts tarballs. Every Docker image is built from TAR layers. Every pip install unpacks a wheel, which is a ZIP file. Every Java application is a ZIP file renamed to .jar. Every GitHub release auto-generates .tar.gz and .zip downloads.

Most of the time, these archives are invisible — tools handle them automatically. But when things break — corrupted packages, slow CI builds, bloated Docker images, failed deployments — understanding the underlying archive format lets you diagnose and fix problems that would otherwise be mysterious.

This guide maps every major developer touchpoint where archive formats appear, from source code distribution to deployment artifacts.

Source Tarballs: The Linux Tradition

The convention of distributing source code as .tar.gz (or .tar.xz) tarballs dates to the earliest days of open-source software. The Linux kernel, GNU tools, and virtually every C/C++ library released before GitHub followed this pattern: download the tarball, extract, ./configure && make && make install.

Why TAR and not ZIP? Because source code distribution on Unix requires preserving file permissions. Build scripts (configure, autogen.sh) must be executable. Symlinks between library versions must survive extraction. TAR preserves these; ZIP doesn't reliably.

The Linux kernel itself distributes as .tar.xz (since version 3.x; previously .tar.bz2, and before that .tar.gz). XZ compression makes the kernel source tarball ~135MB instead of ~220MB (tar.gz) or ~1.4GB (uncompressed) — a significant difference when mirrored across hundreds of distribution sites.
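The ratio gap is easy to reproduce at small scale. A rough sketch on synthetic data (the file names are made up, and real source trees compress differently):

```shell
# Build a tarball of repetitive text and compare gzip vs xz output sizes
mkdir -p demo-src
seq 1 20000 > demo-src/numbers.txt
tar -cf demo.tar demo-src
gzip -k -9 demo.tar   # -k keeps demo.tar; produces demo.tar.gz
xz   -k -9 demo.tar   # produces demo.tar.xz
ls -l demo.tar demo.tar.gz demo.tar.xz
```

On highly redundant input like this, both compressors shrink the tarball dramatically; on real mixed source code the gap between gzip and xz is smaller but still significant, which is why kernel.org pays the extra CPU cost of xz.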

In 2026, many projects have moved to "just clone the git repo" for development. But tarballs remain essential for: offline installation, reproducible builds (a tarball is a snapshot at a specific version), air-gapped environments, and package build systems that start from a source tarball.

GitHub Release Archives

Every GitHub release automatically generates two source archives: Source code (tar.gz) and Source code (zip). These are created by GitHub from the repository state at the tagged commit, essentially as git archive output: the .git directory is excluded, all tracked files are included, and export-ignore rules in .gitattributes are honored.

Key details developers should know:

  • These are auto-generated on each request (not pre-built). The content is always consistent with the tagged commit, but the archive file itself isn't cached permanently.
  • Submodules are NOT included. If your project uses git submodules, the auto-generated archives won't contain submodule contents. This is a common complaint. Workaround: build the archive yourself with submodules included and attach it as a release asset.
  • Release assets (manually attached files) are separate from auto-generated archives. You can attach pre-built binaries, custom tarballs, or any file up to 2GB per asset.
  • URL format: github.com/{owner}/{repo}/archive/refs/tags/{tag}.tar.gz for tarballs, .zip for ZIP. These URLs are stable and suitable for build scripts.
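That URL scheme is easy to template in build scripts. A minimal sketch (the owner, repo, and tag values here are placeholders):

```shell
# Build stable source-archive URLs for a tagged release
owner=example-org
repo=example-repo
tag=v1.2.3
tarball="https://github.com/${owner}/${repo}/archive/refs/tags/${tag}.tar.gz"
zipball="https://github.com/${owner}/${repo}/archive/refs/tags/${tag}.zip"
echo "$tarball"
echo "$zipball"
# Then fetch with: curl -L -o "${repo}-${tag}.tar.gz" "$tarball"
```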

For projects using ChangeThisFile: if a user downloads a .tar.gz release but needs .zip (or vice versa), they can convert TAR.GZ to ZIP or ZIP to TAR.GZ here.

Docker Image Layers: TAR All the Way Down

A Docker image is a stack of filesystem layers, each stored as a TAR archive. When you docker build, each Dockerfile instruction (RUN, COPY, ADD) creates a new layer — a TAR file containing the files added or modified by that instruction.

  • Layer format: Each layer is typically a gzip-compressed TAR (tar.gz). The OCI image spec also allows uncompressed and Zstandard-compressed layers, and recent Docker and containerd versions support zstd for faster push/pull.
  • Image manifest: The image manifest (JSON) lists layers by their content hash (SHA-256 of the compressed tar). This enables content-addressable storage and layer deduplication.
  • Layer reuse: When you push an image, Docker only uploads layers that the registry doesn't already have. When you pull, it only downloads new layers. This is why sharing a base image (like python:3.12) between multiple images is efficient — the base layers are stored once.

Optimization insight: Every unnecessary file left behind when a Dockerfile instruction finishes becomes permanent bloat in that layer. RUN apt-get install -y gcc && rm -rf /var/lib/apt/lists/* deletes the apt cache within the same RUN instruction, so the cache never appears in the resulting layer at all. If the rm runs in a separate RUN instruction, the cache exists in one layer and is "whited out" in the next; both layers are stored, and the total image size includes both.

This is why multi-stage Docker builds are effective: the final image only contains layers from the last stage, not the intermediate build layers with compilers, dev dependencies, and build artifacts.
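A minimal multi-stage sketch (the image tags and paths are illustrative, not a recommendation for any particular project):

```dockerfile
# Stage 1: build with the full toolchain
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN go build -o /out/app .

# Stage 2: ship only the binary; the build stage's layers are discarded
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```

The final image contains only the layers of the second FROM block, so the Go toolchain, source tree, and module cache never reach the registry.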

npm Packages: Tarballs in Disguise

Every npm package is a gzipped tarball (.tgz). When you run npm install lodash, npm downloads lodash-4.17.21.tgz from the registry, extracts it to node_modules/lodash/, and discards the tarball (or caches it in ~/.npm/_cacache/).

You can inspect any npm package as a tarball:

# Download without installing
npm pack lodash
# Creates lodash-4.17.21.tgz

# View contents
tar -tzf lodash-4.17.21.tgz

# Extract to see what you're installing
tar -xzf lodash-4.17.21.tgz
# Creates package/ directory with all package files

The tarball contains a package/ directory with the published files. npm publish creates this tarball from your project, honoring .npmignore (or files in package.json) to exclude test files, source maps, and development artifacts.

Common issue: Accidentally publishing large files (node_modules, test fixtures, videos) inflates download sizes. Use npm pack --dry-run before publishing to see exactly what will be included in the tarball.

Java JARs and WARs: ZIP Files in Disguise

A Java JAR (Java ARchive) is a ZIP file with a different extension. A WAR (Web Application Archive) is also a ZIP file. An EAR (Enterprise Archive) is also a ZIP file. Java's entire packaging ecosystem is built on ZIP.

# Prove it:
file application.jar
# application.jar: Java archive data (JAR)

unzip -l application.jar
# Lists all classes, resources, META-INF/MANIFEST.MF

# Inspect without Java tools
7z l application.jar

# Extract
unzip application.jar -d extracted/

The JAR format extends ZIP with the META-INF/MANIFEST.MF file, which specifies the main class, classpath, and JAR metadata. Spring Boot "fat JARs" contain nested JARs (JAR-within-JAR), which are ZIP files containing ZIP files — a matryoshka doll of archives.

This ZIP foundation means standard archive tools can inspect Java packages. When debugging "class not found" errors, examining the JAR with unzip -l is often faster than navigating IDE project structures. Convert JAR/ZIP to 7Z if you need to compress a JAR more aggressively for transfer (rename .jar to .zip first, or just use 7z directly on the .jar).

Python Wheels and sdist: ZIP and TAR

Python has two package distribution formats:

  • Wheel (.whl): A ZIP file containing pre-built Python code and metadata. The standard binary distribution format since PEP 427 (2012). Wheels install fast because there's no build step — just extract and copy. The filename encodes platform compatibility: numpy-1.26.0-cp312-cp312-manylinux_2_17_x86_64.whl.
  • Source distribution (.tar.gz): A gzipped tarball containing the source code plus build configuration (setup.py or pyproject.toml). The recipient's machine must build the package during installation, including compiling any C extensions. Slower than wheels, but works on any platform.

# Inspect a wheel
unzip -l numpy-1.26.0-cp312-cp312-manylinux_2_17_x86_64.whl

# It's a ZIP file
file numpy-1.26.0-cp312-cp312-manylinux_2_17_x86_64.whl
# ZIP archive data, at least v2.0 to extract

pip install prefers wheels (fast, pre-built) and falls back to sdist (requires compilation) when a wheel isn't available for the target platform. If you see pip compiling packages during install, it's because no pre-built wheel exists for your Python version + OS + architecture combination.
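Because a wheel is plain ZIP, you can even fabricate one with nothing but the stdlib's zipfile CLI. A toy sketch (the file name mimics wheel naming conventions, but this is not an installable package):

```shell
# Create a ZIP, give it a .whl name, and list it with standard ZIP tooling
printf 'Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n' > METADATA
python3 -m zipfile -c demo-1.0-py3-none-any.whl METADATA
python3 -m zipfile -l demo-1.0-py3-none-any.whl
```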

System Packages: .deb and .rpm

Linux system packages use archive formats internally:

  • .deb (Debian/Ubuntu): An ar archive with three members: a debian-binary version file and two tarballs, control.tar (package metadata, maintainer scripts) and data.tar (the actual files to install), each compressed with gzip, xz, or zstd. You can extract a .deb without dpkg: ar x package.deb && tar -xJf data.tar.xz.
  • .rpm (Red Hat/Fedora/SUSE): A custom binary format containing a cpio archive (similar to TAR but older) with the payload. Extract with: rpm2cpio package.rpm | cpio -idm. Modern RPMs compress the cpio payload with XZ or Zstandard.

Understanding the internal structure helps when debugging package issues. Can't install a .deb? Extract it manually and inspect the control scripts. Need a single file from an RPM? Extract with rpm2cpio instead of installing the whole package.
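The .deb layout can even be reproduced by hand with ar and tar. A rough sketch that assembles an archive with the right shape (not a real installable package; the control fields are minimal placeholders):

```shell
# Assemble a .deb-shaped ar archive: version file + two tarballs
printf '2.0\n' > debian-binary
mkdir -p ctrl data/usr/local/bin
printf 'Package: demo\nVersion: 1.0\n' > ctrl/control
tar -czf control.tar.gz -C ctrl .
tar -czf data.tar.gz -C data .
ar rc demo.deb debian-binary control.tar.gz data.tar.gz
ar t demo.deb   # lists the three members in order
```

Member order matters in real packages: dpkg expects debian-binary first, which is why the ar invocation lists it first.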

Arch Linux uses .tar.zst for packages — a plain Zstandard-compressed tarball with no additional wrapper. This simplicity makes Arch packages the easiest to inspect: tar -tf package.pkg.tar.zst lists everything; tar -xf extracts everything.

CI/CD Artifact Caching and Archive Performance

CI/CD systems (GitHub Actions, GitLab CI, CircleCI) cache build artifacts between runs to speed up builds. These caches are archive files — typically TAR with GZIP or Zstandard compression.

GitHub Actions uses @actions/cache, which creates tarballs of cached directories (node_modules, .gradle, pip cache). Cache restore extracts the tarball. The compression algorithm affects both cache size (storage costs) and restore speed (build time).

Optimization opportunities:

  • Use Zstandard where supported. Recent versions of actions/cache compress with Zstandard automatically when the zstd binary is available on the runner. ZSTD cache restore is typically 2-3x faster than GZIP for large node_modules directories.
  • Cache selectively. Caching all of node_modules/ (potentially hundreds of MB) takes time to archive and restore. Caching just the pnpm store or npm cache directory is often faster because pnpm's store is already deduplicated.
  • Artifact size matters. CI artifacts (test reports, coverage files, built binaries) stored between pipeline stages are archives. Compressing build outputs with 7z instead of ZIP can halve artifact storage for large binary artifacts.
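A typical actions/cache step for an npm project looks like this (the paths and key scheme shown are one common convention, not the only option):

```yaml
- name: Cache npm downloads
  uses: actions/cache@v4
  with:
    # Cache npm's content-addressable store rather than node_modules itself
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-
```

Under the hood, a cache hit means downloading and extracting a tarball of ~/.npm; the key ties that tarball to the lockfile contents so a dependency change invalidates it.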

Build Artifacts: Choosing the Right Format

When your build pipeline produces distributable artifacts, the format choice affects downstream consumers:

  • Source release: .tar.gz or .tar.xz (preserves permissions, Unix convention)
  • Windows binary: .zip (Windows users can open it natively)
  • macOS binary: .dmg or .zip (DMG is macOS convention; ZIP is universal)
  • Linux binary: .tar.gz (preserves executable permissions)
  • Cross-platform binary: .zip + .tar.gz (provide both; let users choose)
  • Internal transfer: .tar.zst or .7z (best compression/speed for known environments)

Many projects provide both .zip and .tar.gz downloads for each release — this is the right approach. ZIP for Windows users, TAR.GZ for Unix users. GitHub generates both automatically for source code; for binary releases, build both and attach them as release assets.
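Producing both archives from one staging directory takes a few lines of shell. A sketch that uses the stdlib zipfile CLI so it works even where the zip tool isn't installed (the names are examples):

```shell
# Stage the release contents, then emit both .tar.gz and .zip
mkdir -p dist
printf 'hello\n' > dist/README.txt
tar -czf myapp-1.0.tar.gz -C dist .         # Unix-friendly, keeps permissions
python3 -m zipfile -c myapp-1.0.zip dist/   # ZIP for Windows users
tar -tzf myapp-1.0.tar.gz
python3 -m zipfile -l myapp-1.0.zip
```

Note the asymmetry: the tarball stores paths relative to dist/ (via -C), while the zipfile CLI keeps the dist/ prefix; pick one layout and apply it to both for consistency.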

Archive formats are the invisible infrastructure of software distribution. Every package manager, container runtime, and CI/CD system uses them internally. When things work, you never think about them. When things break — corrupted packages, slow builds, bloated images — knowing the underlying format turns mystery errors into diagnosable problems.

For converting between formats in your development workflow: TAR.GZ to ZIP for Windows distribution, ZIP to TAR.GZ for Unix-friendly packaging, 7Z to ZIP for universally openable releases.