Paper № 05 · Multilingual · 12 min read

BharatVoice-Bench

Independent benchmarking of frontier voice models on Indian languages

Published

April 16, 2026

Authors

Datoric Research

Dataset

160-sample stratified subset drawn from an 11,487-sample curated corpus. Seven frontier and Indic-specialist systems. 10 Indic languages plus Indian-accented English, with Hindi–English code-switching as a separate CMI-bucketed axis. Three axes: transcription fidelity (WER and CER with IndicNLP normalization), Script Fidelity Rate, and code-switching (CMI-bucketed WER, switch-point F1, LLM-as-judge Entity Preservation). Bootstrap 95% CIs throughout.

Abstract

India's 22 constitutionally scheduled languages and 450+ living languages represent the largest concentration of linguistic diversity on the planet, yet every public benchmark of frontier voice AI on Indic speech is either vendor-authored or restricted to a single capability axis. BharatVoice-Bench is an independent, multi-axis evaluation of seven frontier and Indic-specialist voice models — ElevenLabs Scribe v2, Deepgram Nova-3 Multilingual, Sarvam Saaras v3, GPT-4o Transcribe, GPT-4o-mini Transcribe, AssemblyAI Universal-3 Pro, and self-hosted Whisper large-v3 — on a 160-sample stratified subset drawn from an 11,487-sample curated corpus. It spans 10 Indic languages plus Indian-accented English. The benchmark reports three axes: transcription fidelity (IndicNLP-normalized WER and CER), Script Fidelity Rate (SFR) to catch silent script collapse, and CMI-bucketed code-switching WER with switch-point F1 and an LLM-as-judge Entity Preservation score. Four findings stand out. API coverage gaps are substantial — Odia is unsupported by OpenAI, Deepgram, and AssemblyAI. SFR exposes a failure mode invisible to WER alone: Whisper large-v3 Latin-transliterates Odia on 83% of samples, Malayalam on 75%, and Telugu on 50%, driving its 0.378 aggregate SFR; AssemblyAI Universal-3 Pro script-collapses five Indic languages to romanized Latin output. Public Dravidian–English code-switch data is essentially absent at mid and high CMI. And the strongest Indic specialist (Sarvam Saaras v3) is statistically tied with Deepgram Nova-3 Multilingual on aggregate WER (paired bootstrap p = 0.38), while OpenAI's audio-native transcribe variants trail dedicated ASR by 0.10–0.14 WER — the framing that frontier multimodal models subsume specialized ASR does not hold on Indian languages.

Headline

SFR

Script Fidelity Rate catches silent script-collapse — Whisper Latin-transliterates 83% of Odia audio, invisible to WER alone

Why Indic, why now

Voice AI for Indian languages is simultaneously one of the largest commercial opportunities in contemporary speech technology — 22 constitutionally scheduled languages, roughly 1.4 billion speakers — and one of the least benchmarked. Frontier multimodal models from OpenAI and Google, alongside specialized speech providers Deepgram, AssemblyAI, and ElevenLabs, are deployed globally and claim multilingual coverage. Indic-native specialists like Sarvam Saaras claim to beat the frontier on Indian languages. These claims have been compared against each other only in vendor blog posts, never in an independent reproducible benchmark. We are not aware of a prior independent head-to-head evaluation on the same data with the same normalization.

Three observations motivate the benchmark. First, WER alone is insufficient: Whisper's default BasicTextNormalizer strips the matras and viramas that carry phonetic content in Brahmi-family scripts, so we use the IndicNLP library as the reference normalizer. Second, WER cannot distinguish a high-quality hypothesis in the wrong script from a low-quality hypothesis in the right script — our Script Fidelity Rate metric makes the distinction explicit. Third, frontier-vs-specialist comparisons currently exist only as vendor self-reports, so we run all seven models on the same 160-sample stratified subset with the same scorer.
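As a concrete illustration of the normalization choice, a minimal scorer can be sketched with the indic-nlp-library and jiwer packages. This is an illustrative sketch, not the released scorer:

# pip install indic-nlp-library jiwer
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
import jiwer

_factory = IndicNormalizerFactory()

def indic_wer_cer(reference: str, hypothesis: str, lang: str) -> tuple[float, float]:
    """WER and CER after IndicNLP normalization, which canonicalizes
    nukta and matra variants instead of stripping them the way Whisper's
    BasicTextNormalizer does."""
    norm = _factory.get_normalizer(lang)  # ISO 639-1 code, e.g. "hi", "bn", "ml"
    ref, hyp = norm.normalize(reference), norm.normalize(hypothesis)
    return jiwer.wer(ref, hyp), jiwer.cer(ref, hyp)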

Main result

ElevenLabs Scribe v2 leads at aggregate WER 0.277 over 10 Indic languages. Sarvam Saaras v3 (0.308) and Deepgram Nova-3 (0.350) are statistically tied (paired bootstrap p = 0.38). AssemblyAI Universal-3 Pro silently script-collapses five of the Indic languages it accepts; self-hosted Whisper large-v3 Latin-transliterates 83% of Odia and 75% of Malayalam samples. Only ElevenLabs Scribe v2, Sarvam Saaras v3, and Whisper cover all 10 target languages — but Whisper's coverage is undermined by the transliteration mode.

The benchmark

BharatVoice-Bench evaluates seven systems on a 160-sample stratified subset drawn from an 11,487-sample curated corpus of FLEURS, AI4Bharat IndicVoices, AI4Bharat Svarah, and HiACC. Every (model, language) cell carries 95% bootstrap confidence intervals over 10,000 resamples. References are normalized via the IndicNLP library — preserving matras and viramas that Whisper's default BasicTextNormalizer strips — before WER and CER are computed. The benchmark has three independent axes:

  • 01 · Transcription fidelity. WER and CER per (model, language) with bootstrap CIs; pairwise significance tests on aggregate WER.
  • 02 · Script Fidelity Rate (SFR). Fraction of output characters in the expected script for the target language. Cells below 0.5 indicate script collapse: the model is producing output dominantly in the wrong script. Invisible to WER.
  • 03 · Code-switching. CMI-bucketed WER on Hindi–English, switch-point F1 on language-boundary prediction, and an LLM-as-judge Entity Preservation score (Claude Opus 4.6) that catches semantic drift WER misses; a minimal sketch of the judge call follows this list.
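For the judge axis, the call shape looks like the sketch below. The prompt and model identifier shown are illustrative placeholders, not the released judge artifacts:

# pip install anthropic -- illustrative judge call; prompt and model id
# are placeholders, not the released benchmark artifacts.
import anthropic

JUDGE_PROMPT = """List the named entities in REFERENCE, then output only the
fraction (a number in [0, 1]) that also appear, unmangled, in HYPOTHESIS.

REFERENCE: {ref}
HYPOTHESIS: {hyp}"""

def entity_preservation(ref: str, hyp: str, model: str = "<judge-model-id>") -> float:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    reply = client.messages.create(
        model=model,
        max_tokens=16,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(ref=ref, hyp=hyp)}],
    )
    return float(reply.content[0].text.strip())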

We also expose a fourth implicit axis: API coverage. Several frontier providers silently return HTTP 400 errors for specific (provider, language) pairs that no model card documents. We treat the coverage matrix as a first-class result.

The leaderboard

Table 1 reports overall WER and the composite score. ElevenLabs Scribe v2 leads at WER 0.277 (95% CI 0.244–0.311), achieving the lowest WER on 7 of 10 Indic languages. Sarvam Saaras v3 (0.308) and Deepgram Nova-3 Multilingual (0.350) form a statistical cluster — paired bootstrap p = 0.38 between the two, so we cannot reliably rank them on 160 samples. The two OpenAI transcribe variants are similarly indistinguishable from each other (p = 0.30). AssemblyAI Universal-3 Pro posts an aggregate WER of 0.843, but that number is dominated by the script collapse mode described in the next section. Whisper large-v3 posts aggregate WER 4.32 because its Latin-transliteration hypotheses disagree with every token of the Brahmi-script reference; with WER inputs clamped to [0, 1] inside the composite (necessary because WER is unbounded above when insertions exceed reference length), Whisper's composite is 0.000 and it sits at the bottom of the leaderboard.
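The pairwise tests are paired bootstraps over per-sample WERs. A minimal sketch, assuming both models were scored on the same 160 clips:

import numpy as np

def paired_bootstrap_p(wer_a, wer_b, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap: resample the evaluation clips with
    replacement and measure how often the sign of the mean WER gap
    flips relative to the observed gap."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(wer_a) - np.asarray(wer_b)   # per-clip WER gap
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot = diffs[idx].mean(axis=1)
    flips = np.mean(np.sign(boot) != np.sign(diffs.mean()))
    return min(1.0, 2.0 * flips)                    # two-sided p-value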

Model                       Composite   WER ↓   SFR ↑   WER_CS ↓
ElevenLabs Scribe v2        0.472       0.277   0.964   0.323
Deepgram Nova-3 Multi.      0.420       0.350   0.957   0.325
Sarvam Saaras v3            0.383       0.308   0.996   0.444
GPT-4o-mini Transcribe      0.302       0.419   0.988   0.474
GPT-4o Transcribe           0.268       0.408   0.998   0.547
AssemblyAI Univ.-3 Pro      0.021       0.843   0.414   0.683
Whisper large-v3            0.000       4.320   0.378   3.521
Table 1. BharatVoice-Bench composite leaderboard. Composite = (1 − min(WER, 1)) · SFR · (1 − min(WER_CS, 1)), higher is better. WER inputs are clamped to [0, 1] inside the composite because WER is unbounded above (insertions can exceed reference token count); without clamping, two negative (1 − WER) factors would sign-flip into a positive product. Models with catastrophic transcription failure (WER > 1) therefore receive a 0 composite, as Whisper does here. Per-axis WER columns retain the raw, unclamped values.
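In code, the caption's composite is a one-liner; the assertion checks it against the ElevenLabs row of Table 1:

def composite(wer: float, sfr: float, wer_cs: float) -> float:
    """Composite from Table 1. Both WER inputs are clamped to [0, 1] so
    an unbounded WER (insertions exceeding reference length) zeroes the
    score instead of sign-flipping the product."""
    return (1.0 - min(wer, 1.0)) * sfr * (1.0 - min(wer_cs, 1.0))

assert abs(composite(0.277, 0.964, 0.323) - 0.472) < 5e-3  # ElevenLabs row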

Figure 1
[Bar chart: aggregate WER per system, axis 0.0 to 1.0. ElevenLabs Scribe v2 0.28; Sarvam Saaras v3 0.31; Deepgram Nova-3 0.35; GPT-4o Transcribe 0.41; GPT-4o-mini Transcribe 0.42; AssemblyAI Univ.-3 Pro 0.84; Whisper large-v3 4.32 (off scale).]
Figure 1. Aggregate WER (lower is better) across the seven evaluated systems. ElevenLabs Scribe v2 leads; Sarvam Saaras v3 and Deepgram Nova-3 are statistically tied behind it; AssemblyAI Universal-3 Pro is pushed out by script collapse. Whisper large-v3's WER > 4 reflects Latin transliteration against Brahmi-script references.

The script collapse problem

Script Fidelity Rate measures the fraction of output characters in the expected script for the target language. A high WER paired with a high SFR means the model is trying — making word-level errors in the right alphabet. A low SFR means something more pathological: the model is producing output in the wrong script entirely. WER cannot distinguish these, but customers care about the difference: a romanized Latin transcript of Malayalam audio is unusable for downstream Indic NLP regardless of what its WER number says.
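A minimal SFR implementation needs only the Unicode block ranges of the target scripts. The sketch below uses the standard blocks for the scripts in Table 2; the benchmark's exact character classes (for example, the treatment of digits and punctuation) may differ:

# Unicode block ranges (inclusive) for the target scripts.
SCRIPT_RANGES = {
    "deva": (0x0900, 0x097F),   # Devanagari: Hindi, Marathi
    "beng": (0x0980, 0x09FF),   # Bengali
    "guru": (0x0A00, 0x0A7F),   # Gurmukhi: Punjabi
    "gujr": (0x0A80, 0x0AFF),   # Gujarati
    "orya": (0x0B00, 0x0B7F),   # Odia
    "taml": (0x0B80, 0x0BFF),   # Tamil
    "telu": (0x0C00, 0x0C7F),   # Telugu
    "knda": (0x0C80, 0x0CFF),   # Kannada
    "mlym": (0x0D00, 0x0D7F),   # Malayalam
    "latn": (0x0041, 0x007A),   # Latin: Indian-accented English
}

def script_fidelity_rate(text: str, script: str) -> float:
    """Fraction of alphabetic characters that fall in the expected
    script's Unicode block; digits, punctuation, and spaces are ignored."""
    lo, hi = SCRIPT_RANGES[script]
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(lo <= ord(c) <= hi for c in letters) / len(letters)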

Two distinct script-collapse modes show up in our slate. AssemblyAI Universal-3 Pro silently falls back to its Universal-2 predecessor for Bengali, Gujarati, Malayalam, Punjabi, and Telugu, returning romanized Latin output. SFR on those five cells is 0.0 — every sample is in the wrong script — and WER exceeds 1.0 on the Brahmi-script references. Whisper large-v3 exhibits a different collapse mode: it transliterates the audio into the Latin script itself. Its Latin-transliteration rate is 83% on Odia, 75% on Malayalam, 50% on Telugu, and 33–38% across Hindi, Marathi, and Punjabi. A Whisper hypothesis that correctly captures the phonetics in Latin letters still scores SFR = 0 against a Brahmi-script reference, and WER explodes because every token is counted as a substitution. This is the mechanical cause of Whisper's 4.32 aggregate WER — and the single strongest illustration in the benchmark that SFR and WER must be read together.

Model                       Ben    Guj    Hin    Kan    Mal    Mar    Ori    Pan    Tam    Tel
Sarvam Saaras v3            0.99   1.00   1.00   1.00   1.00   1.00   0.99   0.99   1.00   1.00
ElevenLabs Scribe v2        0.98   1.00   0.88   0.99   1.00   1.00   0.97   0.90   0.93   0.99
GPT-4o Transcribe           0.99   1.00   1.00   1.00   1.00   1.00   —      —      1.00   1.00
GPT-4o-mini Transcribe      0.99   0.99   0.93   1.00   1.00   1.00   —      —      1.00   1.00
Deepgram Nova-3 Multi.      0.94   0.99   0.82   1.00   —      1.00   —      —      0.95   1.00
AssemblyAI Univ.-3 Pro      0.00   0.00   0.81   0.91   0.00   1.00   —      0.00   1.00   0.00
Whisper large-v3            0.44   0.66   0.44   0.75   0.00   0.33   0.00   0.48   0.45   0.23
Table 2. Script Fidelity Rate (SFR) per (model, language). Cells below 0.5 indicate script collapse — the dominant output script is not the expected target script. A dash indicates the model returned HTTP 400 or similar refusal. AssemblyAI's collapse is wrong-Brahmi + Latin fallback; Whisper's is consistent Latin transliteration.

WER on a script-collapsed transcript is meaningless: every token is an error even if the phonetic content is preserved. SFR turns an invisible deployment-killing failure into a visible coverage finding.

Coverage is deployment-critical

Across the seven systems we evaluated, only ElevenLabs Scribe v2, Sarvam Saaras v3, and self-hosted Whisper large-v3 successfully transcribed all 10 Indic languages in our test set — and Whisper's coverage is severely undermined by the Latin-transliteration mode described above. Odia is rejected outright by GPT-4o Transcribe, GPT-4o-mini Transcribe, AssemblyAI Universal-3 Pro, and Deepgram Nova-3 Multilingual — the API returns HTTP 400 with no transcript. Deepgram additionally rejects Malayalam and Punjabi. None of these silent-rejection patterns is surfaced in vendor documentation. For a customer building an Odia-language voice product, discovering at integration time that four of seven providers will simply refuse the audio is a high-stakes, deployment-defining surprise.
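Recording the coverage matrix requires nothing more than attempting one transcription per (provider, language) cell and keeping the refusals. In the sketch below, the transcribe callables are hypothetical stand-ins for the vendor SDKs, assumed to raise on HTTP 400:

# Sketch of the coverage probe; `transcribe` callables are hypothetical
# stand-ins for each vendor SDK and are assumed to raise on refusal.
LANGS = ["bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]

def probe_coverage(providers: dict, sample_audio: bytes) -> dict:
    """Build the (provider, language) coverage matrix by attempting one
    real transcription per cell and recording refusals verbatim."""
    matrix = {}
    for name, transcribe in providers.items():
        for lang in LANGS:
            try:
                transcribe(sample_audio, lang)
                matrix[(name, lang)] = "ok"
            except Exception as exc:   # e.g. HTTP 400 with no transcript
                matrix[(name, lang)] = f"refused: {exc}"
    return matrix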

Coverage gaps disproportionately affect speakers of languages with less commercial buying power. Odia (~38 million native speakers), Punjabi (~125 million), and Malayalam (~35 million) are not small communities. ElevenLabs Scribe v2 and Sarvam Saaras v3 demonstrate that broad Indic coverage with high SFR is feasible for motivated providers; the absence of coverage from the others reflects commercial neglect, not technical impossibility. We argue this coverage matrix belongs in every public speech-model card.

Code-switching: where the gap is largest

Hindi–English is the only code-switch pair for which we obtained sufficient samples across low, mid, and high CMI buckets. Even so, the per-bucket pattern is informative. ElevenLabs Scribe v2 and Deepgram Nova-3 remain relatively flat across CMI intensity (WER ≈ 0.18–0.30), indicating that their advertised code-switch handling holds up empirically on Hindi–English. GPT-4o-mini Transcribe shows a sharp jump from 0.19 (low CMI) to 0.59 (mid CMI) — a 3× degradation as English-token density increases inside a predominantly Hindi utterance. Sarvam Saaras v3, despite leading on monolingual Indic, climbs from 0.29 (low) to 0.64 (high) as code-mixing intensifies. Whisper large-v3's WER_CS of 3.52 reflects its Latin-transliteration mode on the Indic-script portions of the Hinglish reference.

Model                       ben-eng/low   hin-eng/low   hin-eng/mid   hin-eng/high   tam-eng/low
Deepgram Nova-3 Multi.      0.42          0.27          0.18          0.20           0.56
ElevenLabs Scribe v2        0.33          0.28          0.30          0.22           0.49
Sarvam Saaras v3            0.26          0.29          0.50          0.64           0.53
GPT-4o-mini Transcribe      0.34          0.19          0.59          0.66           0.59
GPT-4o Transcribe           0.55          0.43          0.45          0.49           0.81
AssemblyAI Univ.-3 Pro      1.03          0.45          0.53          0.66           0.75
Whisper large-v3            4.79          1.38          4.06          3.13           4.24
Table 3. Code-switching WER by (pair, CMI bucket). Empty pools indicate public CS corpora do not cover mid/high intensity for that pair. The GPT-4o-mini cliff between low and mid CMI on Hindi–English is the most informative single pattern here.
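For readers unfamiliar with the bucketing axis, the Code-Mixing Index is conventionally computed following Das and Gambäck (2014). The sketch below uses that formulation; the bucket cut-points shown are illustrative, not the benchmark's:

from collections import Counter

def code_mixing_index(lang_tags: list[str]) -> float:
    """Utterance-level CMI in the Das and Gambäck (2014) form:
    100 * (1 - max_i(w_i) / (n - u)), where w_i counts tokens tagged
    language i, n is total tokens, and u counts language-independent
    tokens (tagged "other" here)."""
    content = [t for t in lang_tags if t != "other"]
    if not content:
        return 0.0
    w_max = Counter(content).most_common(1)[0][1]
    return 100.0 * (1.0 - w_max / len(content))

def cmi_bucket(cmi: float) -> str:
    """Illustrative cut-points for the low/mid/high buckets."""
    if cmi < 15.0:
        return "low"
    return "mid" if cmi < 30.0 else "high"

A purely Hindi utterance scores 0; a 50/50 Hindi–English mix scores 50, the maximum for two languages.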

Switch-point F1 (token-level language-boundary prediction) and Entity Preservation (LLM-as-judge fraction of named entities preserved through the transcript) reveal additional structure. ElevenLabs Scribe v2 leads switch-point F1 at 0.42, with Deepgram Nova-3 close behind at 0.40. GPT-4o-mini Transcribe and Sarvam Saaras v3 both score 0.0 on switch-point F1 — they do not preserve mixed-language token boundaries at all. Sarvam Saaras v3 also scores 0.0 on Entity Preservation, almost certainly because the Indic-only model transliterates English entities into Indic script rather than preserving them in Latin. That behaviour is acceptable for monolingual Indic deployments but problematic for any pipeline that feeds the entities to downstream text-only systems.
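Switch-point F1 itself is simple once hypothesis tokens have been aligned to reference tokens (for example, via edit-distance alignment). A minimal sketch under that assumption:

def switch_points(tags: list[str]) -> set[int]:
    """Indices i where the language tag changes between token i-1 and i."""
    return {i for i in range(1, len(tags)) if tags[i] != tags[i - 1]}

def switch_point_f1(ref_tags: list[str], hyp_tags: list[str]) -> float:
    """F1 over switch-point positions, assuming hypothesis tokens are
    aligned to reference tokens."""
    ref_sp, hyp_sp = switch_points(ref_tags), switch_points(hyp_tags)
    if not ref_sp and not hyp_sp:
        return 1.0
    tp = len(ref_sp & hyp_sp)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(hyp_sp), tp / len(ref_sp)
    return 2 * precision * recall / (precision + recall)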

The harder finding is what is missing. Despite mining 9,000 IndicVoices conversational utterances with aggressive transliterated-English detection, we found only a few hundred Tamil–English and Bengali–English code-switch samples — almost all in the low-CMI bucket. Real-world Tanglish and Madras Bashai routinely sustain mid- and high-CMI mixing; the gap is in corpus curation, not in speaker behaviour. We report this as a corpus-level scarcity that propagates into every model's code-switch coverage on Dravidian–English pairs, not a benchmark-design choice.

Discussion

BharatVoice-Bench's main result on transcription is that the strongest Indic frontier system is a dedicated ASR — ElevenLabs Scribe v2 — and the strongest Indic-specialist (Sarvam Saaras v3) is statistically tied with Deepgram Nova-3 Multilingual on aggregate WER (p = 0.38). OpenAI's audio-native transcribe variants trail dedicated ASR by 0.10–0.14 WER on Indic. The framing that frontier multimodal models subsume specialized ASR does not hold on Indian languages.

The methodological contribution we care most about is Script Fidelity Rate. WER alone is not enough to measure Indic transcription quality: it conflates word-level errors with silent script collapse, and a romanized Latin transcript of Bengali audio is unusable downstream regardless of the WER number it reports. We release the benchmark, the per-(model, language) coverage matrix, the IndicNLP normalization, and the LLM-judge prompts so that other labs can reproduce and extend these numbers.

Cite this work

@article{datoric05,
  title={BharatVoice-Bench: Independent benchmarking of frontier voice models on Indian languages},
  author={Datoric Research},
  year={2026},
  journal={Datoric Research Notes},
  url={https://datoric.com/research/bharatvoice-bench}
}

Data sources

  • FLEURS (10 Indic configs)
  • AI4Bharat IndicVoices
  • AI4Bharat Svarah
  • HiACC
