Why Indic, why now
Voice AI for Indian languages is simultaneously one of the largest commercial opportunities in contemporary speech technology — 22 constitutionally scheduled languages, roughly 1.4 billion speakers — and one of the least benchmarked. Frontier multimodal models from OpenAI and Google, alongside specialized speech providers Deepgram, AssemblyAI, and ElevenLabs, are deployed globally and claim multilingual coverage. Indic-native specialists like Sarvam Saaras claim to beat the frontier on Indian languages. These claims have been compared against each other only in vendor blog posts, never in an independent reproducible benchmark. We are not aware of a prior independent head-to-head evaluation on the same data with the same normalization.
Three observations motivate the benchmark. First, WER alone is insufficient: Whisper's default BasicTextNormalizer strips the matras and viramas that carry phonetic content in Brahmi-family scripts, so we use the IndicNLP library as the reference normalizer. Second, WER cannot distinguish a high-quality hypothesis in the wrong script from a low-quality hypothesis in the right script — our Script Fidelity Rate metric makes the distinction explicit. Third, frontier-vs-specialist comparisons currently exist only as vendor self-reports, so we run all seven models on the same 160-sample stratified subset with the same scorer.
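The normalizer choice matters more than it may appear. As an illustration (not Whisper's actual code), here is a naive diacritic-stripping pass of the kind Latin-centric normalizers apply; on Devanagari it deletes the matras and viramas that carry phonetic content:

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Naive diacritic stripping: decompose to NFD, drop non-spacing marks.

    Harmless on Latin ("café" -> "cafe"), destructive on Brahmi-family
    scripts, where vowel signs and viramas are combining marks.
    """
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
```

Running this on नमस्ते removes the virama and the े vowel sign, leaving a string with different phonetic content, which is exactly why a script-aware normalizer such as IndicNLP's is needed instead.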
Main result
ElevenLabs Scribe v2 leads at aggregate WER 0.277 over 10 Indic languages. Sarvam Saaras v3 (0.308) and Deepgram Nova-3 (0.350) are statistically tied (paired bootstrap p = 0.38). AssemblyAI Universal-3 Pro silently script-collapses five of the Indic languages it accepts; self-hosted Whisper large-v3 Latin-transliterates 83% of Odia samples and 75% of Malayalam samples. Only ElevenLabs Scribe v2, Sarvam Saaras v3, and Whisper cover all 10 target languages, and Whisper's coverage is undermined by its transliteration mode.
The benchmark
BharatVoice-Bench evaluates seven systems on a 160-sample stratified subset drawn from an 11,487-sample curated corpus of FLEURS, AI4Bharat IndicVoices, AI4Bharat Svarah, and HiACC. Every (model, language) cell carries 95% bootstrap confidence intervals over 10,000 resamples. References are normalized via the IndicNLP library — preserving matras and viramas that Whisper's default BasicTextNormalizer strips — before WER and CER are computed. The benchmark has three independent axes:
1. Transcription fidelity. WER and CER per (model, language) with bootstrap CIs; pairwise significance tests on aggregate WER.
2. Script Fidelity Rate (SFR). Fraction of output characters in the expected script for the target language. Cells below 0.5 indicate script collapse: the model is producing output dominantly in the wrong script. This failure mode is invisible to WER.
3. Code-switching. CMI-bucketed WER on Hindi-English, switch-point F1 on language-boundary prediction, and an LLM-as-judge Entity Preservation score (Claude Opus 4.6) that catches semantic drift WER misses.
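For reference, the core metric of axis 1 is a standard word-level edit distance. The following is a minimal pure-Python sketch, not the benchmark's actual scorer (which presumably uses an established library), and it assumes IndicNLP normalization has already been applied to both strings:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length.

    Note that WER is unbounded above: enough insertions push it past 1.0,
    which is why composite scores need to clamp it.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic programming over the hypothesis dimension.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion of a reference word
                      d[j - 1] + 1,      # insertion of a hypothesis word
                      prev + (r != h))   # substitution or match
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)
```

A four-word hypothesis against a one-word reference yields WER 3.0, illustrating the unboundedness noted in the comment.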
We also expose a fourth, implicit axis: API coverage. Several frontier providers return HTTP 400 for specific (provider, language) pairs without documenting the limitation in any model card. We treat the coverage matrix as a first-class result.
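SFR itself reduces to a Unicode-block membership count. A minimal sketch, using an illustrative subset of the ten target scripts (block ranges per the Unicode standard; a full implementation would cover all ten):

```python
# Unicode block ranges for a few Brahmi-family scripts (illustrative subset).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),   # Hindi, Marathi
    "Bengali":    (0x0980, 0x09FF),
    "Oriya":      (0x0B00, 0x0B7F),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def script_fidelity_rate(text: str, script: str) -> float:
    """Fraction of alphabetic characters in the expected Unicode block.

    Digits, whitespace, and punctuation are ignored, so a transcript with
    Latin punctuation is not penalized; a romanized transcript scores 0.0.
    """
    lo, hi = SCRIPT_RANGES[script]
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(1 for c in letters if lo <= ord(c) <= hi)
    return in_script / len(letters)
```

Whether combining marks and shared punctuation are counted is a design choice; this sketch counts only characters Python classifies as alphabetic, which may differ in detail from the benchmark's scorer.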
The leaderboard
Table 1 reports overall WER and the composite score. ElevenLabs Scribe v2 leads at WER 0.277 (95% CI 0.244–0.311), achieving the lowest WER on 7 of 10 Indic languages. Sarvam Saaras v3 (0.308) and Deepgram Nova-3 Multilingual (0.350) form a statistical cluster — paired bootstrap p = 0.38 between the two, so we cannot reliably rank them on 160 samples. The two OpenAI transcribe variants are similarly indistinguishable from each other (p = 0.30). AssemblyAI Universal-3 Pro posts an aggregate WER of 0.843, but that number is dominated by the script collapse mode described in the next section. Whisper large-v3 posts aggregate WER 4.32 because its Latin-transliteration hypotheses disagree with every token of the Brahmi-script reference; with WER inputs clamped to [0, 1] inside the composite (necessary because WER is unbounded above when insertions exceed reference length), Whisper's composite is 0.000 and it sits at the bottom of the leaderboard.
| Model | Composite | WER ↓ | SFR ↑ | WER_CS ↓ |
|---|---|---|---|---|
| ElevenLabs Scribe v2 | 0.472 | 0.277 | 0.964 | 0.323 |
| Deepgram Nova-3 Multi. | 0.420 | 0.350 | 0.957 | 0.325 |
| Sarvam Saaras v3 | 0.383 | 0.308 | 0.996 | 0.444 |
| GPT-4o-mini Transcribe | 0.302 | 0.419 | 0.988 | 0.474 |
| GPT-4o Transcribe | 0.268 | 0.408 | 0.998 | 0.547 |
| AssemblyAI Univ.-3 Pro | 0.021 | 0.843 | 0.414 | 0.683 |
| Whisper large-v3 | 0.000 | 4.320 | 0.378 | 3.521 |
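The paired-bootstrap ties reported above (e.g. p = 0.38 between Sarvam and Deepgram) can be approximated with a sketch like the following. The benchmark's exact resampling procedure is not specified here, so treat this as one plausible two-sided formulation over per-sample WER pairs:

```python
import random

def paired_bootstrap_p(wer_a, wer_b, n_resamples=10_000, seed=0):
    """Two-sided paired bootstrap p-value approximation.

    Resample per-sample (WER_a, WER_b) pairs with replacement and count
    how often the mean difference flips sign relative to the observed one.
    """
    rng = random.Random(seed)
    n = len(wer_a)
    observed = sum(a - b for a, b in zip(wer_a, wer_b)) / n
    flips = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(wer_a[i] - wer_b[i] for i in idx) / n
        if (diff > 0) != (observed > 0):
            flips += 1
    return min(1.0, 2 * flips / n_resamples)
```

On 160 samples, a 0.04 aggregate-WER gap can easily produce a p-value this large, which is why the text declines to rank the tied pair.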
Figure 1. Per-model WER (axis range 0.0–1.0).
The script collapse problem
Script Fidelity Rate measures the fraction of output characters in the expected script for the target language. A high WER paired with a high SFR means the model is trying — making word-level errors in the right alphabet. A low SFR means something more pathological: the model is producing output in the wrong script entirely. WER cannot distinguish these, but customers care about the difference: a romanized Latin transcript of Malayalam audio is unusable for downstream Indic NLP regardless of what its WER number says.
Two distinct script-collapse modes appear across our model slate. AssemblyAI Universal-3 Pro silently falls back to its Universal-2 predecessor for Bengali, Gujarati, Malayalam, Punjabi, and Telugu, returning romanized Latin output. SFR on those five cells is 0.0 (every sample is in the wrong script) and WER exceeds 1.0 against the Brahmi-script references. Whisper large-v3 exhibits a different collapse mode: rather than falling back to another model, it transliterates the audio into Latin script. Its Latin-transliteration rate is 83% on Odia, 75% on Malayalam, 50% on Telugu, and 33–38% across Hindi, Marathi, and Punjabi. A Whisper hypothesis that correctly captures the phonetics in Latin letters still scores SFR = 0 against a Brahmi-script reference, and WER explodes because every token counts as a substitution. This is the mechanical cause of Whisper's 4.32 aggregate WER, and the single strongest illustration in the benchmark that SFR and WER must be read together.
Table 2. Script Fidelity Rate per (model, language); "—" marks languages the provider's API rejects.
| Model | Ben | Guj | Hin | Kan | Mal | Mar | Ori | Pan | Tam | Tel |
|---|---|---|---|---|---|---|---|---|---|---|
| Sarvam Saaras v3 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.99 | 1.00 | 1.00 |
| ElevenLabs Scribe v2 | 0.98 | 1.00 | 0.88 | 0.99 | 1.00 | 1.00 | 0.97 | 0.90 | 0.93 | 0.99 |
| GPT-4o Transcribe | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | — | — | 1.00 | 1.00 |
| GPT-4o-mini Transcribe | 0.99 | 0.99 | 0.93 | 1.00 | 1.00 | 1.00 | — | — | 1.00 | 1.00 |
| Deepgram Nova-3 Multi. | 0.94 | 0.99 | 0.82 | 1.00 | — | 1.00 | — | — | 0.95 | 1.00 |
| AssemblyAI Univ.-3 Pro | 0.00 | 0.00 | 0.81 | 0.91 | 0.00 | 1.00 | — | 0.00 | 1.00 | 0.00 |
| Whisper large-v3 | 0.44 | 0.66 | 0.44 | 0.75 | 0.00 | 0.33 | 0.00 | 0.48 | 0.45 | 0.23 |
WER on a script-collapsed transcript is meaningless: every token is an error even if the phonetic content is preserved. SFR turns an invisible deployment-killing failure into a visible coverage finding.
Coverage is deployment-critical
Across the seven systems we evaluated, only ElevenLabs Scribe v2, Sarvam Saaras v3, and self-hosted Whisper large-v3 successfully transcribed all 10 Indic languages in our test set, and Whisper's coverage is severely undermined by the Latin-transliteration mode described above. Odia is rejected outright by GPT-4o Transcribe, GPT-4o-mini Transcribe, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Multilingual: the API returns HTTP 400 with no transcript. Deepgram additionally rejects Malayalam and Punjabi, and the SFR table above shows both OpenAI variants also returning no transcript for Punjabi. None of these silent-rejection patterns is surfaced in vendor documentation. For a customer building an Odia-language voice product, discovering at integration time that four of seven providers will simply refuse the audio is a high-stakes, deployment-defining surprise.
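Because the rejections are undocumented, the coverage matrix has to be built empirically. A sketch of the bookkeeping, over a hypothetical probe log of (provider, language, HTTP status) records; the provider and language identifiers below are illustrative, not actual API values:

```python
def coverage_matrix(responses):
    """Fold recorded API responses into a provider -> language -> status map.

    A 200 means a transcript came back; anything else (notably 400) means
    the (provider, language) pair was refused.
    """
    matrix = {}
    for provider, language, status in responses:
        matrix.setdefault(provider, {})[language] = (
            "ok" if status == 200 else "rejected"
        )
    return matrix

# Hypothetical probe log for illustration.
log = [
    ("elevenlabs", "or", 200),
    ("deepgram",   "or", 400),   # Odia rejected with HTTP 400
    ("deepgram",   "ml", 400),   # Malayalam rejected
    ("deepgram",   "hi", 200),
]
```

Publishing the resulting matrix alongside WER is the "first-class result" treatment argued for above.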
Coverage gaps disproportionately affect speakers of languages with less commercial buying power. Odia (~38 million native speakers), Punjabi (~125 million), and Malayalam (~35 million) are not small communities. ElevenLabs Scribe v2 and Sarvam Saaras v3 demonstrate that broad Indic coverage with high SFR is feasible for motivated providers; the absence of coverage from the others reflects commercial neglect, not technical impossibility. We argue this coverage matrix belongs in every public speech-model card.
Code-switching: where the gap is largest
Hindi–English is the only code-switch pair for which we obtained sufficient samples across low, mid, and high CMI buckets. Even so, the per-bucket pattern is informative. ElevenLabs Scribe v2 and Deepgram Nova-3 remain relatively flat across CMI intensity (WER ≈ 0.18–0.30), indicating that their advertised code-switch handling holds up empirically on Hindi–English. GPT-4o-mini Transcribe shows a sharp jump from 0.19 (low CMI) to 0.59 (mid CMI) — a 3× degradation as English-token density increases inside a predominantly Hindi utterance. Sarvam Saaras v3, despite leading on monolingual Indic, climbs from 0.29 (low) to 0.64 (high) as code-mixing intensifies. Whisper large-v3's WER_CS of 3.52 reflects its Latin-transliteration mode on the Indic-script portions of the Hinglish reference.
Table 3. Code-switch WER by (language pair, CMI bucket).
| Model | ben-eng/low | hin-eng/low | hin-eng/mid | hin-eng/high | tam-eng/low |
|---|---|---|---|---|---|
| Deepgram Nova-3 Multi. | 0.42 | 0.27 | 0.18 | 0.20 | 0.56 |
| ElevenLabs Scribe v2 | 0.33 | 0.28 | 0.30 | 0.22 | 0.49 |
| Sarvam Saaras v3 | 0.26 | 0.29 | 0.50 | 0.64 | 0.53 |
| GPT-4o-mini Transcribe | 0.34 | 0.19 | 0.59 | 0.66 | 0.59 |
| GPT-4o Transcribe | 0.55 | 0.43 | 0.45 | 0.49 | 0.81 |
| AssemblyAI Univ.-3 Pro | 1.03 | 0.45 | 0.53 | 0.66 | 0.75 |
| Whisper large-v3 | 4.79 | 1.38 | 4.06 | 3.13 | 4.24 |
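The CMI buckets in the table above follow the Code-Mixing Index family of metrics. Here is a sketch of one common formulation (Das & Gambäck style); the bucket thresholds are our illustrative assumption, not the benchmark's documented cut-offs:

```python
def cmi(token_langs):
    """Code-Mixing Index over a token-level language tag sequence.

    CMI = 100 * (1 - max_w / (n - u)), where n is the total token count,
    u the count of language-independent tokens (tagged None), and max_w
    the token count of the dominant language. 0 means monolingual.
    """
    n = len(token_langs)
    u = sum(1 for t in token_langs if t is None)
    if n == u:
        return 0.0
    counts = {}
    for t in token_langs:
        if t is not None:
            counts[t] = counts.get(t, 0) + 1
    return 100 * (n - u - max(counts.values())) / (n - u)

def cmi_bucket(score, low=15.0, high=30.0):
    """Illustrative bucketing; the benchmark's exact thresholds may differ."""
    return "low" if score < low else "mid" if score <= high else "high"
```

A predominantly Hindi utterance with one English token in four scores CMI 25, landing in the mid bucket under these thresholds.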
Switch-point F1 (token-level language-boundary prediction) and Entity Preservation (LLM-as-judge fraction of named entities preserved through the transcript) reveal additional structure. ElevenLabs Scribe v2 leads switch-point F1 at 0.42, with Deepgram Nova-3 close behind at 0.40. GPT-4o-mini Transcribe and Sarvam Saaras v3 both score 0.0 on switch-point F1: they do not preserve mixed-language token boundaries at all. Sarvam Saaras v3 also scores 0.0 on Entity Preservation, almost certainly because the Indic-only model transliterates English entities into Indic script rather than preserving them in Latin. That behaviour is acceptable for monolingual Indic deployments but problematic for any pipeline that passes entities downstream to text-only systems.
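Switch-point F1 can be sketched as set overlap on boundary positions. This sketch assumes the gold and predicted token-language sequences are already aligned; a real evaluation would need a hypothesis-to-reference alignment step first:

```python
def switch_points(token_langs):
    """Positions i where the language label changes between tokens i-1 and i."""
    return {i for i in range(1, len(token_langs))
            if token_langs[i] != token_langs[i - 1]}

def switch_point_f1(gold_langs, pred_langs):
    """F1 over predicted vs. gold switch-point positions."""
    gold, pred = switch_points(gold_langs), switch_points(pred_langs)
    if not gold and not pred:
        return 1.0                      # both monolingual: vacuously perfect
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a model that transliterates everything into one script predicts no switch points at all, which is consistent with the 0.0 scores reported above.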
The harder finding is what is missing. Despite mining 9,000 IndicVoices conversational utterances with aggressive transliterated-English detection, we found only a few hundred Tamil–English and Bengali–English code-switch samples — almost all in the low-CMI bucket. Real-world Tanglish and Madras Bashai routinely sustain mid- and high-CMI mixing; the gap is in corpus curation, not in speaker behaviour. We report this as a corpus-level scarcity that propagates into every model's code-switch coverage on Dravidian–English pairs, not a benchmark-design choice.
Discussion
BharatVoice-Bench's main result on transcription is that the strongest Indic frontier system is a dedicated ASR — ElevenLabs Scribe v2 — and the strongest Indic-specialist (Sarvam Saaras v3) is statistically tied with Deepgram Nova-3 Multilingual on aggregate WER (p = 0.38). OpenAI's audio-native transcribe variants trail dedicated ASR by 0.10–0.14 WER on Indic. The framing that frontier multimodal models subsume specialized ASR does not hold on Indian languages.
The methodological contribution we care most about is Script Fidelity Rate. WER alone is not enough to measure Indic transcription quality: it conflates word-level errors with silent script collapse, and a romanized Latin transcript of Bengali audio is unusable downstream regardless of the WER number it reports. We release the benchmark, the per-(model, language) coverage matrix, the IndicNLP normalization, and the LLM-judge prompts so that other labs can reproduce and extend these numbers.