Best Audio Formats for Voice Recording and Dictation

Published Mar 19, 2026 | 6 min read | By ChangeThisFile Team

Quick Answer

Voice has a narrower frequency range than music (primarily 300 Hz to 3.4 kHz for intelligibility, up to 8 kHz for natural sound), so it compresses extremely well. For maximum quality, use WAV at 48 kHz / 16-bit. For maximum efficiency, Opus at 32-64 kbps is clear and natural. Phone voice memos typically record in M4A (AAC) at 128 kbps — more than adequate for speech.

Voice recording has different requirements than music recording. Human speech occupies a much narrower frequency range, has less dynamic range, and is mono by nature (one mouth, one microphone). These properties mean speech compresses remarkably well — a 64 kbps Opus recording of speech is essentially indistinguishable from the uncompressed original.

Whether you're recording lectures, dictation, interviews, voice memos, audiobook narration, or content for transcription, the format choice depends on one question: do you need maximum quality (for editing and publishing) or maximum efficiency (for storage and transmission)?

Why Voice Is Different from Music

Voice signals have specific properties that affect format choice:

Frequency range: Telephone-quality speech is 300-3,400 Hz. Wideband speech (natural sounding) extends to ~8 kHz. Full-bandwidth (capturing every nuance including breath sounds and room tone) goes to ~16 kHz. Compare to music, which uses the full 20-20,000 Hz range.
Dynamic range: Normal speech has about 30-40 dB of dynamic range (whisper to shout). Music might have 60-70 dB. This means 16-bit recording (96 dB range) is more than sufficient for voice — 24-bit's extra headroom provides less benefit than it does for music.
Mono signal: One speaker = one channel. Stereo recording of a single voice doubles file size with no benefit. Always record voice as mono unless you're specifically capturing room ambience with multiple microphones.
Predictable patterns: Speech has repeating phonetic patterns that compression algorithms exploit. Vowels are periodic waveforms; consonants are brief noise bursts. This predictability means speech compresses better than music at any given bitrate.
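The 96 dB figure quoted for 16-bit audio comes from the ratio between the largest and smallest amplitudes that linear PCM can represent. A minimal sketch of that arithmetic follows; the function name is illustrative, and the result is the theoretical ceiling rather than what a real microphone and preamp will deliver.

```python
import math


def pcm_dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM at the given bit depth."""
    # The amplitude ratio between full scale and one quantization step is 2**bits.
    return 20 * math.log10(2 ** bits)


for bits in (16, 24):
    print(f"{bits}-bit PCM: {pcm_dynamic_range_db(bits):.1f} dB")
# 16-bit PCM: 96.3 dB
# 24-bit PCM: 144.5 dB
```

Against the 30-40 dB range of normal speech, 16-bit already leaves more than 50 dB of margin, which is why 24-bit mostly buys safety against poorly set input levels.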

Format Recommendations by Use Case

Use Case	Format	Settings	Size/Min	Quality
Professional narration / audiobook	WAV	48 kHz, 16-bit, mono	5.5 MB	Perfect (uncompressed)
Podcast recording (raw)	WAV	48 kHz, 24-bit, mono	8.3 MB	Perfect (with headroom)
Lecture / meeting recording	Opus or AAC	64 kbps, mono	0.5 MB	Excellent
Voice memo / quick note	Opus or M4A	32-48 kbps, mono	0.2-0.4 MB	Very good
Dictation for transcription	WAV or Opus 64 kbps	Mono, 16 kHz+ sample rate	0.5-5.5 MB	Varies
VoIP / phone call recording	Opus	24-48 kbps, mono	0.2-0.4 MB	Good to very good

The range is dramatic: uncompressed 16-bit voice at about 5.5 MB/min versus 32 kbps compressed audio at about 0.2 MB/min, a 24x difference in bitrate (768 kbps versus 32 kbps). For a 60-minute recording, that's roughly 330 MB versus 14 MB. The compressed version is perfectly clear for speech.
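The per-minute sizes in the table can be reproduced with simple arithmetic. The sketch below uses illustrative function names and reports decimal megabytes (1 MB = 1,000,000 bytes), so uncompressed 48 kHz / 16-bit mono comes out at 5.76 MB rather than 5.5, which is the same quantity expressed in binary mebibytes. Container overhead is ignored.

```python
MB = 1_000_000  # decimal megabytes; container/header overhead is ignored


def pcm_bytes_per_minute(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Size of one minute of uncompressed linear PCM audio."""
    return sample_rate_hz * (bit_depth // 8) * channels * 60


def compressed_bytes_per_minute(bitrate_kbps: float) -> float:
    """Size of one minute of audio encoded at a constant bitrate."""
    return bitrate_kbps * 1000 / 8 * 60


wav_16 = pcm_bytes_per_minute(48_000, 16)  # 5,760,000 bytes
opus_32 = compressed_bytes_per_minute(32)  # 240,000 bytes

print(f"WAV 48 kHz/16-bit mono: {wav_16 / MB:.2f} MB/min")  # 5.76 MB/min
print(f"Opus 32 kbps:           {opus_32 / MB:.2f} MB/min")  # 0.24 MB/min
print(f"Ratio:                  {wav_16 / opus_32:.0f}x")    # 24x
```

Multiplying any per-minute figure by 60 gives the hourly size, for example about 21.6 MB for an hour at 48 kbps.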

Phone Voice Recorder Formats

Smartphone voice recording apps typically output in these formats:

iPhone Voice Memos: M4A (AAC) at ~128 kbps by default. Compressed mode available at lower bitrates. Lossless mode records at higher quality (Apple Lossless). Files saved as .m4a.
Android recorders: Varies by manufacturer and app. Common outputs: M4A (AAC), OGG (Vorbis), AMR (legacy narrowband), and WAV. Samsung Voice Recorder defaults to M4A. Google Recorder exports as M4A.
Third-party apps: Many recording apps (Otter.ai, Rev, RecUp) record in M4A or WAV depending on quality settings.

For phone recordings intended for transcription or further editing, choose the highest quality setting available — WAV if the app offers it, otherwise M4A at the highest bitrate. You can always compress later; you can't uncompress.

To extract audio from phone recordings for editing on desktop: convert M4A to WAV for DAW import, or convert M4A to MP3 for universal sharing.
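For readers who prefer a local tool, one way to do the M4A-to-WAV step is to call FFmpeg from a script. This is a sketch under assumptions: FFmpeg is installed and on the PATH, the file names are placeholders, and the helper names are illustrative. It produces 16-bit mono WAV at 48 kHz, matching the quality settings recommended earlier.

```python
import shutil
import subprocess


def m4a_to_wav_command(src: str, dst: str, sample_rate: int = 48_000) -> list[str]:
    """Build an FFmpeg command that decodes src to 16-bit mono PCM WAV."""
    return [
        "ffmpeg",
        "-n",                 # never overwrite an existing output file
        "-i", src,            # input file (e.g. a phone voice memo)
        "-ac", "1",           # downmix to one channel (mono)
        "-ar", str(sample_rate),
        "-c:a", "pcm_s16le",  # 16-bit little-endian linear PCM
        dst,
    ]


def convert(src: str, dst: str) -> None:
    """Run the conversion; assumes an ffmpeg binary is available on PATH."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(m4a_to_wav_command(src, dst), check=True)


print(" ".join(m4a_to_wav_command("memo.m4a", "memo.wav")))
```

Decoding to WAV does not restore anything the AAC encoder discarded; it only gives the editor an uncompressed file to work on so that later edits do not add another lossy generation.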

Audio Requirements for Transcription Services

Automated transcription services (Whisper, Deepgram, Google Speech-to-Text, Amazon Transcribe, AssemblyAI) have specific format preferences:

OpenAI Whisper: Accepts most formats via FFmpeg. Internally resamples to 16 kHz mono. Any format works, but WAV/FLAC at 16+ kHz is optimal to avoid unnecessary transcoding.
Google Speech-to-Text: Prefers FLAC or WAV, mono, 16 kHz sample rate for speech recognition models. Accepts other formats but recommends these for best accuracy.
Amazon Transcribe: Accepts WAV, MP3, MP4, FLAC, OGG, AMR, WebM. Sample rates from 8 kHz to 48 kHz. Mono recommended.
Rev / human transcription: Accepts virtually any format. WAV or MP3 at 128+ kbps recommended for human transcribers who need to hear nuance.

General rule for transcription: mono, 16 kHz minimum sample rate, clear audio. The quality of the audio matters far more than the format — a clean 64 kbps Opus recording in a quiet room transcribes better than a 320 kbps MP3 recorded in a noisy car.
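The mono and 16 kHz parts of that rule are easy to check before uploading a WAV file. A minimal sketch using Python's standard-library `wave` module follows; the function name and messages are illustrative, it only reads uncompressed WAV files, and it cannot tell you whether the recording is clean.

```python
import wave


def check_wav_for_transcription(path: str, min_rate_hz: int = 16_000) -> list[str]:
    """Return a list of problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        if channels != 1:
            problems.append(f"expected mono, found {channels} channels")
        if rate < min_rate_hz:
            problems.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    return problems
```

Calling `check_wav_for_transcription("interview.wav")` (a placeholder file name) returns an empty list when the header meets both criteria.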

Noise, Compression Artifacts, and Speech Clarity

Low-bitrate compression affects voice differently than music. At 64 kbps MP3, music has noticeable artifacts (washy cymbals, narrow stereo). At 64 kbps MP3, speech sounds slightly muffled — like a phone call on a mediocre connection. Perfectly intelligible, but not broadcast quality.

Opus handles this much better. At 64 kbps Opus, speech is nearly transparent — clear, natural, with preserved consonant detail (the "s" and "t" sounds that MP3 struggles with at low bitrates). At 32 kbps Opus, speech is still very good — comparable to a high-quality phone call.

Background noise interacts with compression: encoders can't distinguish noise from signal, so they spend bits encoding background hum, air conditioning, traffic, etc. A clean recording at 32 kbps sounds better than a noisy recording at 128 kbps. If you have control over your recording environment, reducing noise at the source is worth more than doubling your bitrate.

Why Opus Is the Best Choice for Voice

Opus was standardized by the IETF (RFC 6716) and combines technology from Skype's SILK speech codec and Xiph.Org's CELT general-audio codec. Its SILK mode is specifically optimized for speech: it models the spectral envelope of the human voice efficiently. At low bitrates where other codecs struggle, Opus excels because it can switch seamlessly between speech-optimized (SILK), general audio (CELT), and hybrid modes.

Practical comparison for voice at common bitrates:

Bitrate	MP3	AAC	Opus
24 kbps	Unintelligible	Barely intelligible	Clear (phone quality)
32 kbps	Poor	Marginal	Good (natural)
48 kbps	Marginal	Acceptable	Very good
64 kbps	Acceptable	Good	Excellent (near-transparent)

For voice memos, dictation, and any application where file size matters: Opus at 32-64 kbps mono. A 60-minute recording at 48 kbps Opus: 21 MB. The same recording at 128 kbps MP3: 57 MB. The Opus version sounds better at less than half the size.
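As an illustration of the recommended settings, the sketch below builds an FFmpeg command that encodes a recording to mono Opus at a chosen bitrate. It assumes an FFmpeg build that includes the `libopus` encoder; the function name, file names, and the 48 kbps default are placeholders. The `-application voip` option asks libopus to favor speech intelligibility.

```python
def opus_voice_command(src: str, dst: str, bitrate_kbps: int = 48) -> list[str]:
    """Build an FFmpeg command that encodes src as mono, speech-tuned Opus."""
    if bitrate_kbps <= 0:
        raise ValueError("bitrate_kbps must be positive")
    return [
        "ffmpeg",
        "-n",                      # never overwrite an existing output file
        "-i", src,
        "-ac", "1",                # single voice, so one channel
        "-c:a", "libopus",         # requires FFmpeg built with libopus
        "-b:a", f"{bitrate_kbps}k",
        "-application", "voip",    # libopus mode that favors speech
        dst,
    ]


print(" ".join(opus_voice_command("lecture.wav", "lecture.opus")))
```

The resulting list can be passed to `subprocess.run(..., check=True)` to perform the encode.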

Converting Voice Recordings

Common voice recording conversions fall into three groups: turning a phone recording into a desktop-editable format, preparing audio for transcription, and sharing voice notes.

Voice is the easiest audio content to handle — it compresses better, requires less bandwidth, and is more forgiving of format choices than music. For professional narration, record WAV and edit losslessly. For everything else, Opus at 32-64 kbps gives you broadcast-quality voice in tiny files. The most important factor isn't the format — it's the recording environment. A clean recording in a quiet room at 32 kbps Opus beats a noisy recording at 320 kbps MP3 every time.

Key Takeaways

Voice compresses extremely well — speech at 64 kbps Opus is nearly indistinguishable from uncompressed.
For quality recording: WAV, 48 kHz, 16-bit (or 24-bit for extra headroom), mono.
For efficient recording/storage: Opus 32-64 kbps mono. A 60-minute recording is 14-29 MB.
Phone voice memos (M4A/AAC at 128 kbps) are more than adequate for speech.
For transcription services: mono audio, 16 kHz minimum sample rate, clean recording matters more than format.
Always record mono for single-voice content — stereo doubles file size with zero benefit.

Frequently Asked Questions

What format should I use for voice memos?

Your phone's default (usually M4A/AAC at 128 kbps) is fine for casual voice memos. If you want smaller files with equal quality, use an app that supports Opus at 48-64 kbps. For voice memos you plan to edit later, use the highest quality setting available — WAV if the app offers it.

Is 128 kbps MP3 good enough for voice?

Yes. Speech occupies a narrow frequency range (primarily 300 Hz to 8 kHz) with limited dynamic range. At 128 kbps, MP3 captures this range without audible artifacts. You could go lower — 96 kbps MP3 is acceptable for speech, and 64 kbps Opus sounds better than 128 kbps MP3 for voice. 128 kbps MP3 is a safe, universally compatible choice.

What audio format do transcription services prefer?

Most accept any common format. For best results, provide mono audio at 16 kHz or higher sample rate. WAV and FLAC are preferred by services like Google Speech-to-Text. OpenAI Whisper accepts anything FFmpeg can decode. The quality of the recording (clear speech, low noise) matters far more than the format choice.

Should I record voice in mono or stereo?

Mono, unless you're deliberately capturing stereo room ambience with two microphones. A single speaker recorded through a single microphone produces a mono signal. Recording or exporting this as stereo creates two identical channels, doubling file size with no quality benefit. Always use mono for single-voice content.

Why does my phone record in M4A instead of MP3?

AAC (the codec inside M4A) is more efficient than MP3 — it sounds better at the same bitrate, or sounds the same at a lower bitrate. Apple chose AAC as the default for iOS because it delivers good quality in small files. Android manufacturers have similarly moved to AAC or OGG. MP3 is the older format with wider compatibility, but for phone recordings, M4A/AAC is the better technical choice.

How much storage does voice recording use?

Uncompressed WAV at 48 kHz/16-bit mono: about 5.5 MB per minute (330 MB per hour). M4A at 128 kbps: about 0.96 MB per minute (57 MB per hour). Opus at 48 kbps: about 0.35 MB per minute (21 MB per hour). The ratio between uncompressed and efficient compression is about 16:1 (768 kbps versus 48 kbps), and the compressed versions are perfectly adequate for speech.

Can I improve a low-quality voice recording by converting to a higher bitrate?

No. Converting a 32 kbps recording to 320 kbps doesn't restore data that was discarded during the original encoding. The result is a larger file with the same audio quality. To get better quality, you need to re-record from the source. Post-processing (noise reduction, EQ) can improve clarity, but format conversion cannot.

What's the best format for recording Zoom or Teams calls?

Both Zoom and Teams use Opus for the live call, but their recording features output differently. Zoom records to M4A (AAC) by default, with a WAV option in settings. Teams records to MP4 (video) or M4A (audio-only). For the best quality recording of a call, use a local recording (WAV) rather than the cloud recording, which is re-encoded and often at a lower bitrate.
