If you write software, you interact with archive formats constantly — even when you don't realize it. Every npm install downloads and extracts tarballs. Every Docker image is built from TAR layers. Every pip install unpacks a wheel, which is a ZIP file. Every Java application is a ZIP file renamed to .jar. Every GitHub release auto-generates .tar.gz and .zip downloads.

Most of the time, these archives are invisible — tools handle them automatically. But when things break — corrupted packages, slow CI builds, bloated Docker images, failed deployments — understanding the underlying archive format lets you diagnose and fix problems that would otherwise be mysterious.

This guide maps every major developer touchpoint where archive formats appear, from source code distribution to deployment artifacts.

Source Tarballs: The Linux Tradition

The convention of distributing source code as .tar.gz (or .tar.xz) tarballs dates to the earliest days of open-source software. The Linux kernel, GNU tools, and virtually every C/C++ library released before GitHub followed this pattern: download the tarball, extract, ./configure && make && make install.

Why TAR and not ZIP? Because source code distribution on Unix requires preserving file permissions. Build scripts (configure, autogen.sh) must be executable. Symlinks between library versions must survive extraction. TAR preserves these; ZIP doesn't reliably.

The Linux kernel itself distributes as .tar.xz (since version 3.x; previously .tar.bz2, and before that .tar.gz). XZ compression makes the kernel source tarball ~135MB instead of ~220MB (tar.gz) or ~1.4GB (uncompressed) — a significant difference when mirrored across hundreds of distribution sites.
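The ratio gap is easy to reproduce at small scale. A rough sketch on synthetic data (the file names are made up, and real source trees compress differently):

```shell
# Build a tarball of repetitive text and compare gzip vs xz output sizes
mkdir -p demo-src
seq 1 20000 > demo-src/numbers.txt
tar -cf demo.tar demo-src
gzip -k -9 demo.tar   # -k keeps demo.tar; produces demo.tar.gz
xz   -k -9 demo.tar   # produces demo.tar.xz
ls -l demo.tar demo.tar.gz demo.tar.xz
```

On highly redundant input like this, both compressors shrink the tarball dramatically; on real mixed source code the gap between gzip and xz is smaller but still significant, which is why kernel.org pays the extra CPU cost of xz.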

In 2026, many projects have moved to "just clone the git repo" for development. But tarballs remain essential for: offline installation, reproducible builds (a tarball is a snapshot at a specific version), air-gapped environments, and package build systems that start from a source tarball.

GitHub Release Archives

Every GitHub release automatically generates two source archives: Source code (tar.gz) and Source code (zip). These are created by GitHub from the repository state at the tagged commit, essentially as git archive output: the .git directory is excluded, all tracked files are included, and export-ignore rules in .gitattributes are honored.

Key details developers should know:

  • These are auto-generated on each request (not pre-built). The content is always consistent with the tagged commit, but the archive file itself isn't cached permanently.
  • Submodules are NOT included. If your project uses git submodules, the auto-generated archives won't contain submodule contents. This is a common complaint. Workaround: build the archive yourself with submodules included and attach it as a release asset.
  • Release assets (manually attached files) are separate from auto-generated archives. You can attach pre-built binaries, custom tarballs, or any file up to 2GB per asset.
  • URL format: github.com/{owner}/{repo}/archive/refs/tags/{tag}.tar.gz for tarballs, .zip for ZIP. These URLs are stable and suitable for build scripts.
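That URL scheme is easy to template in build scripts. A minimal sketch (the owner, repo, and tag values here are placeholders):

```shell
# Build stable source-archive URLs for a tagged release
owner=example-org
repo=example-repo
tag=v1.2.3
tarball="https://github.com/${owner}/${repo}/archive/refs/tags/${tag}.tar.gz"
zipball="https://github.com/${owner}/${repo}/archive/refs/tags/${tag}.zip"
echo "$tarball"
echo "$zipball"
# Then fetch with: curl -L -o "${repo}-${tag}.tar.gz" "$tarball"
```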

For projects using ChangeThisFile: if a user downloads a .tar.gz release but needs .zip (or vice versa), they can convert TAR.GZ to ZIP or ZIP to TAR.GZ here.

Docker Image Layers: TAR All the Way Down

A Docker image is a stack of filesystem layers, each stored as a TAR archive. When you docker build, each Dockerfile instruction (RUN, COPY, ADD) creates a new layer — a TAR file containing the files added or modified by that instruction.

  • Layer format: Each layer is typically a gzip-compressed TAR (tar.gz). The OCI image spec also allows uncompressed and Zstandard-compressed layers, and recent Docker and containerd versions support zstd for faster push/pull.
  • Image manifest: The image manifest (JSON) lists layers by their content hash (SHA-256 of the compressed tar). This enables content-addressable storage and layer deduplication.
  • Layer reuse: When you push an image, Docker only uploads layers that the registry doesn't already have. When you pull, it only downloads new layers. This is why sharing a base image (like python:3.12) between multiple images is efficient — the base layers are stored once.

Optimization insight: Every unnecessary file left behind when a Dockerfile instruction finishes becomes permanent bloat in that layer. RUN apt-get install -y gcc && rm -rf /var/lib/apt/lists/* deletes the apt cache within the same RUN instruction, so the cache never appears in the resulting layer at all. If the rm runs in a separate RUN instruction, the cache exists in one layer and is "whited out" in the next; both layers are stored, and the total image size includes both.

This is why multi-stage Docker builds are effective: the final image only contains layers from the last stage, not the intermediate build layers with compilers, dev dependencies, and build artifacts.
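A minimal multi-stage sketch (the image tags and paths are illustrative, not a recommendation for any particular project):

```dockerfile
# Stage 1: build with the full toolchain
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN go build -o /out/app .

# Stage 2: ship only the binary; the build stage's layers are discarded
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```

The final image contains only the layers of the second FROM block, so the Go toolchain, source tree, and module cache never reach the registry.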

npm Packages: Tarballs in Disguise

Every npm package is a gzipped tarball (.tgz). When you run npm install lodash, npm downloads lodash-4.17.21.tgz from the registry, extracts it to node_modules/lodash/, and discards the tarball (or caches it in ~/.npm/_cacache/).

You can inspect any npm package as a tarball:

# Download without installing
npm pack lodash
# Creates lodash-4.17.21.tgz

# View contents
tar -tzf lodash-4.17.21.tgz

# Extract to see what you're installing
tar -xzf lodash-4.17.21.tgz
# Creates package/ directory with all package files

The tarball contains a package/ directory with the published files. npm publish creates this tarball from your project, honoring .npmignore (or files in package.json) to exclude test files, source maps, and development artifacts.

Common issue: Accidentally publishing large files (node_modules, test fixtures, videos) inflates download sizes. Use npm pack --dry-run before publishing to see exactly what will be included in the tarball.

Java JARs and WARs: ZIP Files in Disguise

A Java JAR (Java ARchive) is a ZIP file with a different extension. A WAR (Web Application Archive) is also a ZIP file. An EAR (Enterprise Archive) is also a ZIP file. Java's entire packaging ecosystem is built on ZIP.

# Prove it:
file application.jar
# application.jar: Java archive data (JAR)

unzip -l application.jar
# Lists all classes, resources, META-INF/MANIFEST.MF

# Inspect without Java tools
7z l application.jar

# Extract
unzip application.jar -d extracted/

The JAR format extends ZIP with the META-INF/MANIFEST.MF file, which specifies the main class, classpath, and JAR metadata. Spring Boot "fat JARs" contain nested JARs (JAR-within-JAR), which are ZIP files containing ZIP files — a matryoshka doll of archives.

This ZIP foundation means standard archive tools can inspect Java packages. When debugging "class not found" errors, examining the JAR with unzip -l is often faster than navigating IDE project structures. Convert JAR/ZIP to 7Z if you need to compress a JAR more aggressively for transfer (rename .jar to .zip first, or just use 7z directly on the .jar).

Python Wheels and sdist: ZIP and TAR

Python has two package distribution formats:

  • Wheel (.whl): A ZIP file containing pre-built Python code and metadata. The standard binary distribution format since PEP 427 (2012). Wheels install fast because there's no build step — just extract and copy. The filename encodes platform compatibility: numpy-1.26.0-cp312-cp312-manylinux_2_17_x86_64.whl.
  • Source distribution (.tar.gz): A gzipped tarball containing the source code plus build configuration (setup.py or pyproject.toml). The recipient's machine must build the package during installation, including compiling any C extensions. Slower than wheels, but works on any platform.

# Inspect a wheel
unzip -l numpy-1.26.0-cp312-cp312-manylinux_2_17_x86_64.whl

# It's a ZIP file
file numpy-1.26.0-cp312-cp312-manylinux_2_17_x86_64.whl
# ZIP archive data, at least v2.0 to extract

pip install prefers wheels (fast, pre-built) and falls back to sdist (requires compilation) when a wheel isn't available for the target platform. If you see pip compiling packages during install, it's because no pre-built wheel exists for your Python version + OS + architecture combination.
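Because a wheel is plain ZIP, you can even fabricate one with nothing but the stdlib's zipfile CLI. A toy sketch (the file name mimics wheel naming conventions, but this is not an installable package):

```shell
# Create a ZIP, give it a .whl name, and list it with standard ZIP tooling
printf 'Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n' > METADATA
python3 -m zipfile -c demo-1.0-py3-none-any.whl METADATA
python3 -m zipfile -l demo-1.0-py3-none-any.whl
```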

System Packages: .deb and .rpm

Linux system packages use archive formats internally:

  • .deb (Debian/Ubuntu): An ar archive with three members: a debian-binary version file and two tarballs, control.tar (package metadata, maintainer scripts) and data.tar (the actual files to install), each compressed with gzip, xz, or zstd. You can extract a .deb without dpkg: ar x package.deb && tar -xJf data.tar.xz.
  • .rpm (Red Hat/Fedora/SUSE): A custom binary format containing a cpio archive (similar to TAR but older) with the payload. Extract with: rpm2cpio package.rpm | cpio -idm. Modern RPMs compress the cpio payload with XZ or Zstandard.

Understanding the internal structure helps when debugging package issues. Can't install a .deb? Extract it manually and inspect the control scripts. Need a single file from an RPM? Extract with rpm2cpio instead of installing the whole package.
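The .deb layout can even be reproduced by hand with ar and tar. A rough sketch that assembles an archive with the right shape (not a real installable package; the control fields are minimal placeholders):

```shell
# Assemble a .deb-shaped ar archive: version file + two tarballs
printf '2.0\n' > debian-binary
mkdir -p ctrl data/usr/local/bin
printf 'Package: demo\nVersion: 1.0\n' > ctrl/control
tar -czf control.tar.gz -C ctrl .
tar -czf data.tar.gz -C data .
ar rc demo.deb debian-binary control.tar.gz data.tar.gz
ar t demo.deb   # lists the three members in order
```

Member order matters in real packages: dpkg expects debian-binary first, which is why the ar invocation lists it first.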

Arch Linux uses .tar.zst for packages — a plain Zstandard-compressed tarball with no additional wrapper. This simplicity makes Arch packages the easiest to inspect: tar -tf package.pkg.tar.zst lists everything; tar -xf extracts everything.

CI/CD Artifact Caching and Archive Performance

CI/CD systems (GitHub Actions, GitLab CI, CircleCI) cache build artifacts between runs to speed up builds. These caches are archive files — typically TAR with GZIP or Zstandard compression.

GitHub Actions uses @actions/cache, which creates tarballs of cached directories (node_modules, .gradle, pip cache). Cache restore extracts the tarball. The compression algorithm affects both cache size (storage costs) and restore speed (build time).

Optimization opportunities:

  • Use Zstandard where supported. Recent versions of actions/cache compress with Zstandard automatically when the zstd binary is available on the runner. ZSTD cache restore is typically 2-3x faster than GZIP for large node_modules directories.
  • Cache selectively. Caching all of node_modules/ (potentially hundreds of MB) takes time to archive and restore. Caching just the pnpm store or npm cache directory is often faster because pnpm's store is already deduplicated.
  • Artifact size matters. CI artifacts (test reports, coverage files, built binaries) stored between pipeline stages are archives. Compressing build outputs with 7z instead of ZIP can halve artifact storage for large binary artifacts.
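A typical actions/cache step for an npm project looks like this (the paths and key scheme shown are one common convention, not the only option):

```yaml
- name: Cache npm downloads
  uses: actions/cache@v4
  with:
    # Cache npm's content-addressable store rather than node_modules itself
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-
```

Under the hood, a cache hit means downloading and extracting a tarball of ~/.npm; the key ties that tarball to the lockfile contents so a dependency change invalidates it.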

Build Artifacts: Choosing the Right Format

When your build pipeline produces distributable artifacts, the format choice affects downstream consumers:

  • Source release: .tar.gz or .tar.xz (preserves permissions, Unix convention)
  • Windows binary: .zip (Windows users can open it natively)
  • macOS binary: .dmg or .zip (DMG is macOS convention; ZIP is universal)
  • Linux binary: .tar.gz (preserves executable permissions)
  • Cross-platform binary: .zip + .tar.gz (provide both; let users choose)
  • Internal transfer: .tar.zst or .7z (best compression/speed for known environments)

Many projects provide both .zip and .tar.gz downloads for each release — this is the right approach. ZIP for Windows users, TAR.GZ for Unix users. GitHub generates both automatically for source code; for binary releases, build both and attach them as release assets.
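Producing both archives from one staging directory takes a few lines of shell. A sketch that uses the stdlib zipfile CLI so it works even where the zip tool isn't installed (the names are examples):

```shell
# Stage the release contents, then emit both .tar.gz and .zip
mkdir -p dist
printf 'hello\n' > dist/README.txt
tar -czf myapp-1.0.tar.gz -C dist .         # Unix-friendly, keeps permissions
python3 -m zipfile -c myapp-1.0.zip dist/   # ZIP for Windows users
tar -tzf myapp-1.0.tar.gz
python3 -m zipfile -l myapp-1.0.zip
```

Note the asymmetry: the tarball stores paths relative to dist/ (via -C), while the zipfile CLI keeps the dist/ prefix; pick one layout and apply it to both for consistency.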

Archive formats are the invisible infrastructure of software distribution. Every package manager, container runtime, and CI/CD system uses them internally. When things work, you never think about them. When things break — corrupted packages, slow builds, bloated images — knowing the underlying format turns mystery errors into diagnosable problems.

For converting between formats in your development workflow: TAR.GZ to ZIP for Windows distribution, ZIP to TAR.GZ for Unix-friendly packaging, 7Z to ZIP for universally openable releases.