Why equity, not just accuracy
Public speech benchmarks typically report a single summary word error rate aggregated across the languages a model happens to support. This obscures a critical property: the worst-served language is the one that matters most for the last billion users. A model that averages a 3% word error rate but returns HTTP 400 on Javanese, Yoruba, or Amharic is not a multilingual model in any practical sense.
GlobalVoice-Bench reframes voice AI evaluation around the language equity gap: the performance differential between high- and low-resource languages under identical evaluation protocols. The benchmark reports four axes — per-tier transcription, code-switching, accent-sensitivity variance, and culturally-grounded transcription — so that a provider's behaviour can be evaluated on the dimensions that actually matter for multilingual deployment, not a single tier-averaged number.
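To make the headline metric concrete, here is a minimal sketch of the per-tier WER and high-to-low gap computation. The tier map, field names, and use of jiwer are illustrative choices, not the benchmark's actual harness.

```python
# Minimal sketch: per-tier WER and the high->low equity-gap ratio.
# Tier assignments and record fields are illustrative.
from collections import defaultdict
import jiwer  # any WER implementation works here

TIERS = {
    "en": "high", "zh": "high", "es": "high", "fr": "high",
    "de": "high", "ru": "high", "ja": "high",
    "hi": "mid", "ar": "mid", "pt": "mid", "tr": "mid",
    "ko": "mid", "vi": "mid", "pl": "mid",
    "sw": "low", "am": "low", "ha": "low", "yo": "low",
    "ig": "low", "jv": "low",
}

def equity_gap(samples):
    """samples: iterable of dicts with 'lang', 'reference', 'hypothesis'."""
    by_tier = defaultdict(lambda: ([], []))
    for s in samples:
        refs, hyps = by_tier[TIERS[s["lang"]]]
        refs.append(s["reference"])
        hyps.append(s["hypothesis"])
    tier_wer = {t: jiwer.wer(refs, hyps) for t, (refs, hyps) in by_tier.items()}
    # Equity gap: how much worse the worst-served tier is than the best-served one.
    return tier_wer, tier_wer["low"] / tier_wer["high"]
```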
Main result
Three of five dedicated ASR providers (Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2) do not produce usable transcripts on our low-resource tier: Deepgram returns HTTP 400 on Swahili, Amharic, Hausa, Yoruba, Igbo, and Javanese; AssemblyAI Universal-2 scores a WER above 0.85. Only ElevenLabs Scribe (WER 0.409) and Gemini 2.5 Pro (0.413) remain usable, statistically tied at the top.
The benchmark
GlobalVoice-Bench samples 800 FLEURS utterances (40 per language) across 20 languages organized into three resource tiers — seven high-resource (English, Mandarin, Spanish, French, German, Russian, Japanese), seven mid-resource (Hindi, Arabic, Portuguese, Turkish, Korean, Vietnamese, Polish), and six low-resource (Swahili, Amharic, Hausa, Yoruba, Igbo, Javanese). A 200-sample balanced subset drives the reported per-tier results. We evaluate 12 frontier systems organized into three classes:
- Dedicated ASR (audio in, transcript out): Whisper v3, Deepgram Nova-3, Deepgram Nova-2, AssemblyAI Universal-2, and ElevenLabs Scribe.
- Audio-native multimodal LLMs: GPT-4o Audio, GPT-4o-mini Audio, Gemini 2.5 Pro, and Gemini 2.5 Flash.
- Text reasoners on reference transcripts (upper-bound control): Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5.
Evaluation is performed along four axes: per-tier transcription (WER, CER on CJK), Mandarin–English code-switching from ASCEND (150 samples, 356 annotated switch-point boundaries), accent-sensitivity variance from Common Voice 17 (745 samples across seven languages and 22 accent cells), and an 800-sample culturally-grounded transcription axis (40 per language × 20 languages). All scores carry 95% bootstrap confidence intervals.
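A minimal sketch of the bootstrap interval, assuming resampling is done at the utterance level; the number of resamples and the seed are illustrative.

```python
# Sketch: 95% bootstrap confidence interval for WER, resampling utterances
# with replacement and taking the percentile interval of the resampled scores.
import numpy as np
import jiwer

def bootstrap_wer_ci(refs, hyps, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(refs)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample utterance indices
        stats.append(jiwer.wer([refs[i] for i in idx],
                               [hyps[i] for i in idx]))
    point = jiwer.wer(refs, hyps)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)
```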
The equity gap
Figure: per-tier WER by model (high / mid / low resource), with high-to-low gap multipliers — Whisper v3 2.5×, AssemblyAI Univ.-2 4.3×, ElevenLabs Scribe 1.8×, GPT-4o Audio 3.8×, GPT-4o-mini Audio 2.8×, Gemini 2.5 Pro 1.6×, Gemini 2.5 Flash 2.1×; both Deepgram models are unsupported on the low-resource tier. The underlying numbers are in the table below.
| Model | High WER ↓ | Mid WER ↓ | Low WER ↓ |
|---|---|---|---|
| Whisper v3 | 0.214 | 0.232 | 0.529 |
| Deepgram Nova-3 | 0.210 | 0.228 | unsup. |
| Deepgram Nova-2 | 0.213 | 0.246 | unsup. |
| AssemblyAI Univ.-2 | 0.201 | 0.217 | 0.856 |
| ElevenLabs Scribe | 0.230 | 0.217 | 0.409 |
| GPT-4o Audio | 0.742 | 0.456 | 2.829 |
| GPT-4o-mini Audio | 0.838 | 0.574 | 2.377 |
| Gemini 2.5 Pro | 0.257 | 0.198 | 0.413 |
| Gemini 2.5 Flash | 0.239 | 0.208 | 0.510 |
The low-resource penalty is large for every model that supports the tier: even within the audio-native MLLMs, Gemini 2.5 Pro shows roughly a 1.6× WER increase from high to low resource. For dedicated ASR, four of the five systems (all but ElevenLabs Scribe) cluster within 0.013 WER of each other on high-resource (0.201–0.214), so the choice between them is essentially a wash at the top of the hierarchy. Differences become decision-relevant only as we descend tiers.
GPT-4o Audio averages WER above 1.0 on low-resource (2.829) — the model produces more text than the reference contains, dominated by hallucination and repetition. Its mini variant is similarly broken. Gemini 2.5 Pro and Flash, in contrast, track dedicated ASR closely on high- and mid-resource; on low-resource, ElevenLabs Scribe (0.409) and Gemini 2.5 Pro (0.413) are statistically tied at the top. The audio-native MLLM category is not monolithic: provider-level differences inside the category are larger than the gap between the category and dedicated ASR.
Code-switching and silent drops
The ASCEND Mandarin–English code-switching axis (150 samples, 356 annotated switch points) reveals a silent-drop failure mode that tier-transcription leaderboards hide. Deepgram Nova-2 returns an empty transcript on 49.3% of samples and Gemini 2.5 Flash on 28.7% — the HTTP request succeeds, but the body is empty. On the non-empty windows, Gemini 2.5 Flash leads boundary WER at 0.781, followed by Gemini 2.5 Pro at 0.821, with ElevenLabs Scribe the best dedicated ASR at 0.932. No audio model scores switch-point language-ID accuracy above 0.343; even the best audio model mis-attributes the language on roughly two-thirds of switch points. Code-switched speech is a research ceiling today, not a vendor-selection problem.
| Model | Refusal % ↓ | Boundary WER ↓ | LID acc ↑ |
|---|---|---|---|
| ElevenLabs Scribe | 0.0 | 0.932 | 0.343 |
| Whisper v3 | 0.7 | 0.948 | 0.164 |
| AssemblyAI Univ.-2 | 2.0 | 0.989 | 0.145 |
| Deepgram Nova-3 | 6.7 | 2.065 | 0.000 |
| Deepgram Nova-2 | 49.3 | 1.501 | 0.000 |
| GPT-4o Audio | 4.7 | 1.064 | 0.135 |
| GPT-4o-mini Audio | 0.0 | 1.649 | 0.067 |
| Gemini 2.5 Pro | 2.7 | 0.821 | 0.206 |
| Gemini 2.5 Flash | 28.7 | 0.781 | 0.288 |
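A sketch of how the two headline axis metrics can be computed, under simplifying assumptions: refusal rate counts empty hypotheses, and switch-point language-ID accuracy is approximated by checking the script (Han vs. Latin) of the hypothesis token aligned to each annotated boundary. The alignment itself is elided; `aligned_token` is a hypothetical helper, not part of the benchmark code.

```python
# Sketch: refusal rate and switch-point LID accuracy for Mandarin-English
# code-switching, approximating language ID by Unicode script.
import re

HAN = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs

def refusal_rate(hypotheses):
    """Fraction of samples whose hypothesis is empty or whitespace-only."""
    return sum(1 for h in hypotheses if not h or not h.strip()) / len(hypotheses)

def switch_point_lid_accuracy(switch_points, aligned_token):
    """switch_points: list of (sample_id, position, target_lang in {'zh', 'en'}).
    aligned_token: hypothetical alignment helper returning the hypothesis token
    at a boundary, or None if the transcript is empty."""
    correct = 0
    for sample_id, pos, target_lang in switch_points:
        token = aligned_token(sample_id, pos)
        if token is None:
            continue  # empty transcript earns no credit at this boundary
        is_han = bool(HAN.search(token))
        correct += (target_lang == "zh") == is_han
    return correct / len(switch_points)
```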
Accent variance and cultural-QA coverage
The accent-sensitivity axis (Common Voice 17, 745 samples across seven languages and 22 accent cells) measures the across-accent WER standard deviation per model: a small value means WER is roughly constant across regional dialects of the same language; a large value means WER is heavily driven by which accent a sample happens to carry. Gemini 2.5 Flash is the most accent-robust audio-native model we evaluated (mean across-accent std 0.039). Within-language dispersion is dominated by Arabic (MSA vs. Egyptian, std 0.09–0.16 for most providers) and Italian (Northern/Southern/Central, 0.04–0.07); Spanish accent cells are essentially interchangeable for every top provider (std ≤ 0.032). The practical implication: "handles accent X for language Y" does not predict accent robustness in language Z.
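A sketch of the accent-sensitivity statistic, assuming each sample record carries a language and accent label; cell construction and field names are illustrative.

```python
# Sketch: WER per (language, accent) cell, then the standard deviation across
# accents within each language, averaged over languages to give one number per model.
from collections import defaultdict
import numpy as np
import jiwer

def accent_sensitivity(samples):
    """samples: iterable of dicts with 'lang', 'accent', 'reference', 'hypothesis'."""
    cells = defaultdict(lambda: ([], []))
    for s in samples:
        refs, hyps = cells[(s["lang"], s["accent"])]
        refs.append(s["reference"])
        hyps.append(s["hypothesis"])
    cell_wer = {k: jiwer.wer(r, h) for k, (r, h) in cells.items()}
    per_lang = defaultdict(list)
    for (lang, _accent), w in cell_wer.items():
        per_lang[lang].append(w)
    lang_std = {lang: float(np.std(ws)) for lang, ws in per_lang.items()}
    return lang_std, float(np.mean(list(lang_std.values())))  # per-language std, model mean
```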
The cultural-QA axis expands the original comprehension pilot into an 800-sample per-language transcription evaluation on culturally-grounded recordings (40 per language × 20 languages). Coverage is the discriminating signal: ElevenLabs Scribe is the only dedicated ASR provider with 100% effective non-empty hypothesis coverage across all 20 languages; AssemblyAI Universal-2 reaches 90%, Whisper large-v3 75%, Deepgram Nova-3 70%, and Deepgram Nova-2 65% (Nova-2 specifically returns HTTP 400 on Arabic — a provider-API gap, not a benchmark artifact). After the day-3 backfill, the audio-native MLLMs cluster near the top: GPT-4o-mini Audio 100%, GPT-4o Audio 99.9%, Gemini 2.5 Flash 99.6%, and Gemini 2.5 Pro 95.0%, with its 40 residual failures all on Yoruba. Coverage is not accuracy, though: GPT-4o Audio's per-tier WER on the cultural-QA audio is 0.672 / 0.913 / 2.369 (high/mid/low) — its errors are paraphrase and hallucination, not missing transcripts.
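A sketch of the coverage statistic, assuming hard API errors (e.g. HTTP 400) are recorded as missing hypotheses; the data layout is illustrative.

```python
# Sketch: per-(provider, language) effective non-empty hypothesis coverage.
# API errors are assumed to be stored as hypothesis=None and count as empty.
from collections import defaultdict

def coverage_matrix(results):
    """results: iterable of dicts with 'provider', 'lang', 'hypothesis' (None on API error)."""
    totals = defaultdict(int)
    nonempty = defaultdict(int)
    for r in results:
        key = (r["provider"], r["lang"])
        totals[key] += 1
        if r["hypothesis"] and r["hypothesis"].strip():
            nonempty[key] += 1
    return {key: nonempty[key] / totals[key] for key in totals}
```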
Transcription is the floor, coverage is the ceiling. Models today are close to the floor in high-resource settings, blocked by HTTP 400 errors at the floor in low-resource settings, and inconsistently covered everywhere in between.
Implications
The language equity gap is a training-data and language-support story. The models we evaluated are not architecturally incapable of serving low-resource languages — ElevenLabs Scribe and Gemini 2.5 Pro demonstrate that broad coverage is feasible. The absence of reliable coverage from Deepgram and AssemblyAI on the low-resource tier reflects commercial neglect, not technical impossibility. We argue the per-(provider, language) coverage matrix belongs in every public speech-model card.
Refusal rate, not boundary WER, is the dominant failure mode for two production-grade audio systems on code-switched Mandarin–English: Deepgram Nova-2 refuses 49.3% of samples, Gemini 2.5 Flash refuses 28.7%. Any multilingual deployment pipeline must treat refusal rate as a first-class metric alongside WER — a silent empty transcript passed downstream is worse than a noisy transcript flagged for review.
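A minimal sketch of that guard, with an illustrative exception type; the point is only that emptiness becomes an explicit, catchable failure rather than a silent pass-through.

```python
# Sketch: refuse to forward empty transcripts downstream; surface them for review.
class EmptyTranscriptError(RuntimeError):
    """Raised when a provider returns a syntactically successful but empty transcript."""

def checked_transcript(hypothesis: str) -> str:
    if not hypothesis or not hypothesis.strip():
        raise EmptyTranscriptError("provider returned an empty transcript; route to review")
    return hypothesis
```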