Voice recording has different requirements than music recording. Human speech occupies a much narrower frequency range, has less dynamic range, and is mono by nature (one mouth, one microphone). These properties mean speech compresses remarkably well — a 64 kbps Opus recording of speech is essentially indistinguishable from the uncompressed original.

Whether you're recording lectures, dictation, interviews, voice memos, audiobook narration, or content for transcription, the format choice depends on one question: do you need maximum quality (for editing and publishing) or maximum efficiency (for storage and transmission)?

Why Voice Is Different from Music

Voice signals have specific properties that affect format choice:

  • Frequency range: Telephone-quality speech is 300-3,400 Hz. Wideband speech (natural sounding) extends to ~8 kHz. Full-bandwidth (capturing every nuance including breath sounds and room tone) goes to ~16 kHz. Compare to music, which uses the full 20-20,000 Hz range.
  • Dynamic range: Normal speech has about 30-40 dB of dynamic range (whisper to shout). Music might have 60-70 dB. This means 16-bit recording (96 dB range) is more than sufficient for voice — 24-bit's extra headroom provides less benefit than it does for music.
  • Mono signal: One speaker = one channel. Stereo recording of a single voice doubles file size with no benefit. Always record voice as mono unless you're specifically capturing room ambience with multiple microphones.
  • Predictable patterns: Speech has repeating phonetic patterns that compression algorithms exploit. Vowels are periodic waveforms; consonants are brief noise bursts. This predictability means speech compresses better than music at any given bitrate.
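
Two of the figures above follow from simple formulas, and a short sketch makes them easy to check. The theoretical dynamic range of linear PCM is 20·log10(2^bits), and a sample rate can represent frequencies only up to half its value (the Nyquist limit), which is why wideband speech up to ~8 kHz calls for a 16 kHz sample rate. The function names below are illustrative, not taken from any library.

```python
import math

def pcm_dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bit_depth)

def nyquist_bandwidth_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent (half the rate)."""
    return sample_rate_hz / 2

# Bit depth versus the 30-40 dB range of normal speech
for bits in (16, 24):
    print(f"{bits}-bit PCM: about {pcm_dynamic_range_db(bits):.0f} dB")

# Sample rate versus the speech bandwidths listed above
for rate in (8_000, 16_000, 48_000):
    print(f"{rate} Hz sampling captures up to {nyquist_bandwidth_hz(rate):.0f} Hz")
```

Even 16-bit leaves well over 50 dB of margin beyond the range of normal speech, which is the reasoning behind treating 24-bit as optional for voice.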

Format Recommendations by Use Case

Use Case | Format | Settings | Size/Min | Quality
Professional narration / audiobook | WAV | 48 kHz, 16-bit, mono | 5.5 MB | Perfect (uncompressed)
Podcast recording (raw) | WAV | 48 kHz, 24-bit, mono | 8.2 MB | Perfect (with headroom)
Lecture / meeting recording | Opus or AAC | 64 kbps, mono | 0.5 MB | Excellent
Voice memo / quick note | Opus or M4A | 32-48 kbps, mono | 0.2-0.4 MB | Very good
Dictation for transcription | WAV or Opus 64 kbps | Mono, 16 kHz+ sample rate | 0.5-5.5 MB | Varies
VoIP / phone call recording | Opus | 24-48 kbps, mono | 0.2-0.4 MB | Good to very good

The range is dramatic: uncompressed voice at 5.5 MB/min versus compressed at 0.2 MB/min — a 27x difference. For a 60-minute recording, that's 330 MB vs 12 MB. The compressed version is perfectly clear for speech.
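
The per-minute sizes can be reproduced with a few lines of arithmetic. The sketch below assumes constant bitrate and ignores header and container overhead, so real files will be slightly larger. The WAV figures in the table appear to be binary megabytes (1,048,576 bytes), so the sketch uses that unit. The helper names are illustrative.

```python
MIB = 1024 * 1024  # binary megabyte, which the WAV figures appear to use

def pcm_bytes_per_minute(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Raw PCM payload for one minute (WAV header ignored)."""
    return sample_rate_hz * (bit_depth // 8) * channels * 60

def compressed_bytes_per_minute(bitrate_kbps: int) -> int:
    """Encoded payload for one minute at a constant bitrate (container ignored)."""
    return bitrate_kbps * 1000 * 60 // 8

if __name__ == "__main__":
    settings = [
        ("WAV 48 kHz, 16-bit, mono", pcm_bytes_per_minute(48_000, 16)),
        ("WAV 48 kHz, 24-bit, mono", pcm_bytes_per_minute(48_000, 24)),
        ("64 kbps compressed", compressed_bytes_per_minute(64)),
        ("32 kbps compressed", compressed_bytes_per_minute(32)),
    ]
    for label, size in settings:
        # Multiply by 60 to estimate a one-hour recording
        print(f"{label}: {size / MIB:.1f} MB/min, {size * 60 / MIB:.0f} MB/hour")
```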

Phone Voice Recorder Formats

Smartphone voice recording apps typically output in these formats:

  • iPhone Voice Memos: M4A (AAC) at ~128 kbps by default. Compressed mode available at lower bitrates. Lossless mode records at higher quality (Apple Lossless). Files saved as .m4a.
  • Android recorders: Varies by manufacturer and app. Common outputs: M4A (AAC), OGG (Vorbis), AMR (legacy narrowband), and WAV. Samsung Voice Recorder defaults to M4A. Google Recorder exports as M4A.
  • Third-party apps: Many recording apps (Otter.ai, Rev, RecUp) record in M4A or WAV depending on quality settings.

For phone recordings intended for transcription or further editing, choose the highest quality setting available: WAV if the app offers it, otherwise M4A at the highest bitrate. You can always compress later, but you can't restore detail that a lossy encoder has already discarded.

To extract audio from phone recordings for editing on desktop: convert M4A to WAV for DAW import, or convert M4A to MP3 for universal sharing.
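
If you prefer to do the M4A to WAV step locally, one possible approach is to call FFmpeg from a script. The sketch below assumes FFmpeg is installed and on your PATH. The function name and the file name `interview.m4a` are placeholders. It only builds the command unless both FFmpeg and the input file are present.

```python
import shutil
import subprocess
from pathlib import Path

def m4a_to_wav_command(src: str, sample_rate_hz: int = 48_000) -> list[str]:
    """Build an FFmpeg command that decodes an M4A file to 16-bit mono WAV."""
    source = Path(src)
    target = source.with_suffix(".wav")
    return [
        "ffmpeg",
        "-i", str(source),
        "-vn",                       # skip any video or cover-art stream
        "-ac", "1",                  # downmix to mono
        "-ar", str(sample_rate_hz),  # output sample rate
        "-c:a", "pcm_s16le",         # 16-bit little-endian PCM
        str(target),
    ]

if __name__ == "__main__":
    command = m4a_to_wav_command("interview.m4a")  # placeholder file name
    print(" ".join(command))
    # Only run the conversion when FFmpeg and the input file actually exist
    if shutil.which("ffmpeg") and Path("interview.m4a").exists():
        subprocess.run(command, check=True)
```

Decoding to WAV does not restore quality lost when the phone encoded the M4A. It only gives the DAW an uncompressed file to work with, so later edits add no further generation loss.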

Audio Requirements for Transcription Services

Automated transcription services (Whisper, Deepgram, Google Speech-to-Text, Amazon Transcribe, AssemblyAI) have specific format preferences:

  • OpenAI Whisper: Accepts most formats via FFmpeg. Internally resamples to 16 kHz mono. Any format works, but WAV/FLAC at 16+ kHz is optimal to avoid unnecessary transcoding.
  • Google Speech-to-Text: Prefers FLAC or WAV, mono, 16 kHz sample rate for speech recognition models. Accepts other formats but recommends these for best accuracy.
  • Amazon Transcribe: Accepts WAV, MP3, MP4, FLAC, OGG, AMR, WebM. Sample rates from 8 kHz to 48 kHz. Mono recommended.
  • Rev / human transcription: Accepts virtually any format. WAV or MP3 at 128+ kbps recommended for human transcribers who need to hear nuance.

General rule for transcription: mono, 16 kHz minimum sample rate, clear audio. The quality of the audio matters far more than the format — a clean 64 kbps Opus recording in a quiet room transcribes better than a 320 kbps MP3 recorded in a noisy car.
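
The mono and 16 kHz parts of that rule are easy to check before uploading a file. Here is a minimal sketch using Python's standard `wave` module, which reads WAV files only. The function names are illustrative, and audio clarity cannot be verified this way.

```python
import io
import wave

def check_wav_for_transcription(wav_file, min_rate_hz: int = 16_000) -> list[str]:
    """Return a list of problems; an empty list means the file fits the rule of thumb."""
    problems = []
    with wave.open(wav_file, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        if channels != 1:
            problems.append(f"expected mono, found {channels} channels")
        if rate < min_rate_hz:
            problems.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    return problems

def make_silent_wav(rate_hz: int, channels: int, seconds: int = 1) -> io.BytesIO:
    """Create an in-memory 16-bit WAV of silence, used here only for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * channels * seconds)
    buffer.seek(0)
    return buffer

if __name__ == "__main__":
    print(check_wav_for_transcription(make_silent_wav(16_000, 1)))
    print(check_wav_for_transcription(make_silent_wav(8_000, 2)))
```

`wave.open` also accepts a file path, so the same check works on a recording saved to disk.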

Noise, Compression Artifacts, and Speech Clarity

Low-bitrate compression affects voice differently than music. At 64 kbps, MP3-encoded music has noticeable artifacts (washy cymbals, narrow stereo). Speech at the same MP3 bitrate sounds slightly muffled, like a phone call on a mediocre connection: perfectly intelligible, but not broadcast quality.

Opus handles this much better. At 64 kbps Opus, speech is nearly transparent — clear, natural, with preserved consonant detail (the "s" and "t" sounds that MP3 struggles with at low bitrates). At 32 kbps Opus, speech is still very good — comparable to a high-quality phone call.

Background noise interacts with compression: encoders can't distinguish noise from signal, so they spend bits encoding background hum, air conditioning, traffic, etc. A clean recording at 32 kbps sounds better than a noisy recording at 128 kbps. If you have control over your recording environment, reducing noise at the source is worth more than doubling your bitrate.

Why Opus Is the Best Choice for Voice

Opus was standardized by the IETF (RFC 6716) and combines two codecs: SILK, contributed by Skype and designed for speech, and CELT, from Xiph.Org, designed for general audio. SILK encodes the spectral envelope of the human voice efficiently. At low bitrates where other codecs struggle, Opus excels because it can switch seamlessly between the speech-optimized SILK mode, the general-audio CELT mode, and a hybrid of the two.

Practical comparison for voice at common bitrates:

Bitrate | MP3 | AAC | Opus
24 kbps | Unintelligible | Barely intelligible | Clear (phone quality)
32 kbps | Poor | Marginal | Good (natural)
48 kbps | Marginal | Acceptable | Very good
64 kbps | Acceptable | Good | Excellent (near-transparent)

For voice memos, dictation, and any application where file size matters, use Opus at 32-64 kbps mono. A 60-minute recording at 48 kbps Opus is about 21.6 MB. The same recording at 128 kbps MP3 is about 57.6 MB. The Opus version sounds better at less than half the size.
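
One way to produce such a file is FFmpeg's libopus encoder. The sketch below only builds the command line. It assumes an FFmpeg build with libopus enabled, and the function name and `lecture.wav` are placeholders. The `-application voip` option asks the encoder to favor speech intelligibility.

```python
from pathlib import Path

def opus_voice_command(src: str, bitrate_kbps: int = 48) -> list[str]:
    """Build an FFmpeg command that encodes a recording to mono Opus for speech."""
    if bitrate_kbps <= 0:
        raise ValueError("bitrate_kbps must be positive")
    source = Path(src)
    target = source.with_suffix(".opus")
    return [
        "ffmpeg",
        "-i", str(source),
        "-vn",                       # drop any video stream
        "-ac", "1",                  # one voice, one channel
        "-c:a", "libopus",           # requires FFmpeg built with libopus
        "-b:a", f"{bitrate_kbps}k",  # target bitrate, e.g. 32k to 64k for voice
        "-application", "voip",      # libopus tuning for speech
        str(target),
    ]

if __name__ == "__main__":
    print(" ".join(opus_voice_command("lecture.wav", bitrate_kbps=48)))
```

The printed command can be pasted into a terminal or passed to `subprocess.run` once the input file exists.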

Converting Voice Recordings

Common voice recording conversions: phone recording to desktop-editable format, preparing for transcription, sharing voice notes.

Key conversions: M4A to WAV (import phone recordings to DAW) | M4A to MP3 (share phone recordings) | WAV to MP3 (compress for sharing) | OGG to WAV (Android recordings to DAW) | Opus to MP3 (compatibility) | MP4 to MP3 (extract audio from video interviews) | WAV to Opus (efficient archival)

Voice is the easiest audio content to handle: it compresses better, requires less bandwidth, and is more forgiving of format choices than music. For professional narration, record WAV and edit losslessly. For everything else, Opus at 32-64 kbps gives you clear, natural-sounding voice in tiny files. The most important factor isn't the format; it's the recording environment. A clean recording in a quiet room at 32 kbps Opus beats a noisy recording at 320 kbps MP3 every time.
