Motivation
Voice models are increasingly deployed in call centers, clinical intake, legal discovery, and enterprise meeting tools. These contexts share properties that public benchmarks rarely capture: noisy environments, accented and non-native speakers, dense domain jargon, and the need to classify intent and extract entities rather than just emit a transcript. Standard speech evaluations do not tell us whether a model can do the downstream work that a deployment actually pays for.
VoicePro-Bench is structured around five independent axes — transcription, SLURP intent and entity extraction, MELD emotion, AMI multi-turn reasoning, and MUSAN/WHAM noise robustness — with three model classes (dedicated ASR, audio-native MLLM, and a Claude text-reasoner control on the reference transcript) scored under matched conditions. The text-reasoner control is the lever that lets us isolate audio understanding from reasoning capacity: when the same questions are answered under matched prompts from raw audio and from the reference transcript, the remaining gap is attributable to audio understanding, not to reasoning or prompting.
Main result
On SLURP intent the best text-reasoner control (Claude Opus 4.5, F1 0.748) outperforms the best audio-native MLLM (Gemini 2.5 Pro, F1 0.479) by 27 pp. On AMI reasoning Claude Haiku 4.5 (text, F1 0.636) leads the best audio-native MLLM by 12.4 pp token-F1. Same questions, same answers — audio understanding is the dominant source of error, not reasoning capacity.
The benchmark
The transcription axis uses 760 curated audio samples drawn from FLEURS English (en-US), VoxPopuli, and six accented language variants (Hindi, Spanish, German, French, Mandarin, Japanese), with a 200-sample balanced subset for the reported model evaluation. The other four axes draw on task-specific corpora: 200 SLURP samples for intent and entity extraction, 200 MELD samples for 7-class emotion, 248 AMI-grounded questions for multi-turn reasoning, and a 200-sample MUSAN/WHAM noise-cliff at four SNR levels (20, 10, 5, 0 dB). We evaluate twelve frontier systems organized into three classes:
1. Dedicated ASR (audio in, transcript out). Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, and ElevenLabs Scribe.
2. Audio-native multimodal LLMs (audio in, task-conditioned output). GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, and Gemini 2.5 Flash.
3. Text reasoners over reference transcripts (text in, text out). Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5, used as an upper-bound control that isolates reasoning quality from audio understanding.
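For orientation, the axis and model-class layout above can be summarized as a small configuration object. The sketch below is ours, not the released harness; the field names and model identifiers are illustrative, while the sample counts and SNR levels mirror the text above.

```python
# Illustrative summary of the benchmark layout; names and identifiers are ours,
# not the released harness's.
AXES = {
    "transcription":   {"n": 200, "corpora": ["FLEURS en-US", "VoxPopuli", "accented variants"]},
    "intent_entities": {"n": 200, "corpora": ["SLURP"]},
    "emotion":         {"n": 200, "corpora": ["MELD (7-class)"]},
    "reasoning":       {"n": 248, "corpora": ["AMI-grounded QA"]},
    "noise_cliff":     {"n": 200, "corpora": ["MUSAN", "WHAM"], "snr_db": [20, 10, 5, 0]},
}

MODEL_CLASSES = {
    "dedicated_asr": ["whisper-v3", "nova-3", "nova-2", "universal-2", "scribe"],
    "audio_mllm":    ["gpt-4o-audio", "gpt-4o-mini-audio", "gemini-2.5-pro", "gemini-2.5-flash"],
    "text_control":  ["claude-opus-4.5", "claude-sonnet-4.5", "claude-haiku-4.5"],
}
```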
Primary metrics are WER and CER on transcription (95% bootstrap confidence intervals, 10,000 resamples), weighted F1 on intent and emotion, set-F1 on entities, token-level F1 on AMI reasoning, and WER-under-noise deltas on the MUSAN/WHAM cliff. On SLURP, dedicated ASR providers (which cannot emit intent labels directly) run an ASR→Claude-Haiku cascade so the comparison is against what a provider-grade deployment would actually produce.
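As a concrete reading of the transcription metric, the sketch below computes corpus WER with a percentile bootstrap over utterances. It assumes the jiwer package for edit-distance WER; the released harness's text normalization and resampling details may differ.

```python
import random

import jiwer  # pip install jiwer; assumed here for edit-distance WER


def wer_with_bootstrap_ci(refs, hyps, n_boot=10_000, alpha=0.05, seed=0):
    """Corpus WER with a percentile bootstrap CI over utterances (a sketch;
    the released harness's normalization and resampling may differ)."""
    point = jiwer.wer(refs, hyps)  # length-weighted corpus WER over all utterances
    rng = random.Random(seed)
    n = len(refs)
    stats = []
    for _ in range(n_boot):
        draw = [rng.randrange(n) for _ in range(n)]  # resample utterances with replacement
        stats.append(jiwer.wer([refs[i] for i in draw], [hyps[i] for i in draw]))
    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return point, (lo, hi)
```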
Headline results
Table 1 reports WER and CER for all twelve models on the 200-sample transcription subset, after the 2026-04-22 rate-limit backfill that recovered the audio-native MLLMs to full coverage. Within the audio-facing tiers, ElevenLabs Scribe leads on WER (0.408), with Gemini 2.5 Pro (0.411) inside overlapping CIs, a statistical tie at the top. AssemblyAI Universal-2 leads on CER (0.125). The dedicated-ASR cluster spans only 0.024 WER from best to worst. Claude text reasoners on the reference transcript define a ceiling (WER 0.385–0.395) attributable to formatting, punctuation, and casing normalization — not audio. ElevenLabs Scribe and Gemini 2.5 Pro both land within 0.03 WER of that text-only ceiling.
| Model | Class | WER | CER |
|---|---|---|---|
| Whisper v3 | Dedicated ASR | 0.432 | 0.137 |
| Deepgram Nova-3 | Dedicated ASR | 0.429 | 0.144 |
| Deepgram Nova-2 | Dedicated ASR | 0.432 | 0.144 |
| AssemblyAI Univ.-2 | Dedicated ASR | 0.427 | 0.125 |
| ElevenLabs Scribe | Dedicated ASR | 0.408 | 0.132 |
| GPT-4o Audio | Audio MLLM | 0.612 | 0.308 |
| GPT-4o-mini Audio | Audio MLLM | 0.611 | 0.334 |
| Gemini 2.5 Pro | Audio MLLM | 0.411 | 0.132 |
| Gemini 2.5 Flash | Audio MLLM | 0.429 | 0.143 |
| Claude Opus 4.5 | Text control | 0.391 | 0.116 |
| Claude Sonnet 4.5 | Text control | 0.395 | 0.128 |
| Claude Haiku 4.5 | Text control | 0.385 | 0.106 |
Figure 1. WER by model (scale 0.0–1.0).
Scribe and Gemini 2.5 Pro tie at the top
After the 2026-04-22 rate-limit backfill brought the audio-native MLLMs to full 200-sample coverage, Gemini 2.5 Pro (WER 0.411) essentially matches ElevenLabs Scribe (0.408) at the top of the transcription axis: the 95% bootstrap CIs are 0.366–0.458 and 0.365–0.454 respectively, a statistical tie. Gemini 2.5 Flash (0.429) sits mid-pack inside the dedicated-ASR cluster. GPT-4o Audio (0.612) and GPT-4o-mini Audio (0.611) remain dramatically worse. The simpler earlier story — that specialized ASR subsumes audio-native MLLMs on transcription — does not hold at full coverage on our 200-sample subset; one frontier MLLM has reached parity, while the rest of the audio-native tier still lags.
Within the dedicated-ASR cluster, WER spans only 0.024 from best (ElevenLabs 0.408) to worst (Whisper v3 / Deepgram Nova-2 at 0.432). At this scale, the choice between top-five production ASR providers should be made on cost, latency, language coverage, and reliability rather than headline WER alone. AssemblyAI Universal-2 has the lowest character error rate (0.125), suggesting it preserves character-level fidelity even when word-level errors are marginally higher.
On transcription the best frontier multimodal model has drawn level with the best dedicated ASR. On intent and reasoning, the text-reasoner control on the reference transcript still wins by double digits — audio understanding, not reasoning, remains the dominant source of error on today's downstream voice tasks.
Refusals, timeouts, and noise cliffs
Three deployment-visible failure modes show up in the axes beyond transcription. First, GPT-4o Audio refuses 20% of valid SLURP transcription requests (40 of 200), with benign user utterances like "play radio station 101.9" treated as actions the model cannot perform; GPT-4o-mini Audio refuses 26.5%. This is a prompt-following failure in audio mode, not an audio-understanding failure, and it would not appear on a leaderboard that filters non-responses.
Second, at full 200-sample coverage GPT-4o Audio's transcription WER is 0.612 with a 95% bootstrap CI of 0.464–0.795 — a CI width of 0.33, nearly 4× Gemini 2.5 Pro's 0.09. This is no longer a sample-size artifact; it reflects a genuinely bimodal distribution with a heavy CJK tail alongside a sizeable fraction of clean transcriptions. Third, under additive MUSAN/WHAM noise, Gemini 2.5 Flash posts a 20-to-0 dB WER jump of +0.073 — roughly 1.5× the steepest dedicated-ASR cliff (Deepgram Nova-2, +0.049) — with the degradation coming from quality loss on faithful transcription attempts rather than from refusals. Over 95% of Gemini 2.5 Flash's 0-dB outputs are still transcription attempts, so the cliff is a pure quality effect.
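The cliff number quoted above is simply the WER difference between the cleanest and noisiest SNR conditions; a minimal helper (our own, not the harness API) makes the definition explicit.

```python
def noise_cliff(wer_by_snr: dict[int, float]) -> float:
    """WER degradation from the cleanest (highest-SNR) to the noisiest (lowest-SNR)
    condition, e.g. {20: 0.43, 10: 0.45, 5: 0.47, 0: 0.50} -> 0.07."""
    return wer_by_snr[min(wer_by_snr)] - wer_by_snr[max(wer_by_snr)]
```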
A separate integration cliff: before we passed explicit BCP-47 language codes to Deepgram Nova-3, its default multilingual auto-detect covered only a small set of Western and Indic languages and silently returned empty strings for Mandarin and Japanese. Passing an explicit code (zh, ja) brought Nova-3 back into line with the other dedicated ASR providers. This kind of integration detail is not surfaced in the default SDK examples and likely explains a non-trivial fraction of multilingual quality complaints from Deepgram customers.
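For reference, the fix amounts to passing an explicit language parameter rather than relying on auto-detect. The sketch below calls Deepgram's prerecorded /v1/listen REST endpoint directly; parameter names and per-model language support should be verified against current Deepgram documentation.

```python
import requests

DG_URL = "https://api.deepgram.com/v1/listen"  # Deepgram prerecorded endpoint


def transcribe(audio_bytes: bytes, api_key: str, language: str) -> str:
    """Send audio with an explicit BCP-47 language code (e.g. "zh", "ja") instead of
    relying on multilingual auto-detect. Illustrative only; check Deepgram's docs."""
    resp = requests.post(
        DG_URL,
        params={"model": "nova-3", "language": language},
        headers={"Authorization": f"Token {api_key}", "Content-Type": "audio/wav"},
        data=audio_bytes,
    )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
```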
Discussion
VoicePro-Bench's main result is that audio understanding, not reasoning, accounts for the downstream-task gap on today's frontier models. Claude Opus 4.5 as a text-reasoner control reaches SLURP intent F1 0.748 on the reference transcript; the best audio-native MLLM reaches 0.479. Claude Haiku 4.5 on AMI meeting transcripts reaches token-F1 0.636; GPT-4o Audio and Gemini 2.5 Pro on raw audio reach 0.509–0.512. Same questions, same answers — when the audio is replaced by the ground-truth transcript the gap closes.
On transcription the dedicated-ASR cluster is tight (0.024 WER spread) and now joined at the top by Gemini 2.5 Pro (0.411, statistical tie with ElevenLabs Scribe), so provider-selection decisions should turn on cost, latency, language coverage, and reliability rather than headline WER. On downstream tasks, the ASR→Claude cascade outperforms every audio-native MLLM we evaluated on SLURP intent and entity F1. Emotion (MELD 7-class) is not yet production-usable on any audio-native model: the best score is Gemini 2.5 Pro at F1 0.443, roughly 2–3× the random baseline but well below any usable routing signal.
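For reference, the ASR→Claude cascade is just a transcript handed to a text reasoner with the label set in the prompt. Below is a minimal sketch using the Anthropic Python SDK; the model identifier and prompt wording are placeholders, not the harness's exact configuration.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def cascade_intent(transcript: str, labels: list[str]) -> str:
    """ASR -> text-reasoner cascade: classify a SLURP-style intent from an ASR
    transcript. Prompt and model id are illustrative placeholders."""
    msg = client.messages.create(
        model="claude-haiku-4-5",  # placeholder id; substitute the current Haiku model
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                f"Utterance: {transcript}\n"
                f"Choose exactly one intent from: {', '.join(labels)}.\n"
                f"Answer with the label only."
            ),
        }],
    )
    return msg.content[0].text.strip()
```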
We release the benchmark and evaluation harness so that other labs can reproduce these numbers. Per-axis sample sizes are informative for aggregate rankings but too small for per-accent or per-scenario claims; we do not make those claims in the paper body.