Why verification, not just description
Most video benchmarks measure whether a model can describe what is in a clip. VideoTruth-Bench measures something different: whether a model will tell you when a caption contradicts what the video shows, and whether the framing of the caption changes the model's answer. This is the capability that matters for fact-checking video claims, moderating video content, reviewing security footage, and verifying insurance claims. A model that generates fluent captions can still fail silently when asked whether a supplied description is actually correct.
We evaluate six frontier multimodal models with eight real VATEX video frames passed to every API call, under three prompt framings (direct, indirect, adversarial) over a graded six-level contradiction taxonomy, plus temporal-ordering and hallucination-probe measurements. Our core contribution is a matched-item measurement of how much framing moves detection — the sycophancy gap — across the frontier. Prior work has measured sycophancy in text-only LLMs; we are not aware of a matched-item measurement for video-language models with the video actually attached.
Main result
At full video-frame coverage the top of the contradiction-detection leaderboard is a three-way tie: Claude Haiku 4.5 (0.870), Gemini 2.5 Flash (0.869), and GPT-4o (0.864) cluster within 1 pp from three different vendors. L5 causal ranges 0.58–0.96 across models and L6 omission ranges 0.50–0.74 — no model exceeds 0.74 on L6. Sycophancy is universal: every model drops 24–60 pp under an "already-verified" preamble. Hallucination refusal splits cleanly: Claude family and GPT-4o at ≥88%, both Gemini models below 70%. No single model wins all four axes.
The benchmark
VideoTruth-Bench evaluates 100 VATEX videos with 566 model×task evaluations per model. Contradictions are generated using the Claude API at six levels of increasing subtlety (L1 entity swap, L2 temporal reorder, L3 quantitative, L4 attributive, L5 causal, L6 omission). For each clip we pair the caption with three prompt framings — direct, indirect ("describe first, then compare"), and adversarial ("this caption has been verified as accurate") — holding content constant and varying only the wrapper. Per-model per-level contradiction sample size is 22–30; sycophancy analysis is at n=30 per framing per model where coverage permits. GPT-4o-mini was initially part of the slate but was rate-limited during the parallel run and yielded too few per-cell samples for meaningful inference. We evaluate six frontier multimodal models across four axes:
- 01Contradiction detection by subtlety level. A six-level taxonomy from L1 entity swap through L2 temporal, L3 quantitative, L4 attributive, L5 causal, and L6 omission.
- 02Temporal ordering. Binary before/after and ordered-sequence accuracy on genuine events in the video.
- 03Hallucination resistance. Fraction of probes about absent events that are correctly refused.
- 04Sycophancy. Contradiction-detection drop between direct and adversarial framings of the same content.
Three-way tie at the top, L6 universal ceiling
Three models from three different vendors cluster within 1 pp of each other at the top of overall contradiction-detection accuracy — Claude Haiku 4.5 (0.870), Gemini 2.5 Flash (0.869), GPT-4o (0.864) — with bootstrap 95% CIs that overlap heavily. The middle tier (Sonnet 0.796, Opus 0.759) is reliably below the top three but well above Gemini 2.5 Pro (0.652). Intra-vendor, smaller models beat larger: Haiku > Sonnet > Opus (0.870 / 0.796 / 0.759) and Gemini 2.5 Flash > 2.5 Pro (0.869 / 0.652). We hypothesize the larger models are tuned toward more qualified answers that don't crisp into yes/no verdicts on the contradiction prompt.
| Model | L1 Entity | L2 Temp. | L3 Quant. | L4 Attr. | L5 Causal | L6 Omiss. | Overall | n |
|---|---|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | 0.889 | 0.900 | 1.000 | 0.760 | 0.962 | 0.733 | 0.870 | 162 |
| Gemini 2.5 Flash | 1.000 | 0.909 | 0.812 | 1.000 | 0.952 | 0.593 | 0.869 | 130 |
| GPT-4o | 1.000 | 0.833 | 0.875 | 0.920 | 0.846 | 0.733 | 0.864 | 162 |
| Claude Sonnet 4.5 | 0.889 | 0.900 | 0.750 | 0.760 | 0.962 | 0.533 | 0.796 | 162 |
| Claude Opus 4.5 | 0.926 | 0.833 | 0.750 | 0.800 | 0.769 | 0.500 | 0.759 | 162 |
| Gemini 2.5 Pro | 0.731 | 0.733 | 0.625 | 0.760 | 0.577 | 0.500 | 0.652 | 161 |
L6 omission — "does this summary leave out a key event?" — is where the frontier tier collapses. No model exceeds 0.74; two models (Claude Opus 4.5, Gemini 2.5 Pro) sit at chance. L5 causal is narrower — the top five models score 0.77–0.96 — but no model is perfect on either. Benchmarks that stop short of L6 will conclude the frontier tier is uniformly reliable at verification, when in fact models are at their ceiling on obvious mismatches and at chance on noticing what's missing. Rankings also differ by level: for L1–L3 entity/quantitative fact-checks Gemini 2.5 Flash dominates; for L5 causal Haiku and Sonnet tie at 0.962; for L6 omission Haiku and GPT-4o tie at 0.733. A production stack may want different models for different verification subtasks rather than a single composite winner.
No model is sycophancy-immune
The sycophancy gap measures how much a model's contradiction detection degrades when the same contradictory caption is framed as coming from an authoritative source rather than as a neutral statement. A zero gap means the model is unaffected by framing. A large gap means the model is effectively being steered by prompt phrasing rather than by what is actually in the video.
Figure
Direct framing
Adversarial framing
Gemini 2.5 Flash
gap 23.6pp
Claude Haiku 4.5
gap 26.6pp
Gemini 2.5 Pro
gap 26.6pp
GPT-4o
gap 33.3pp
Claude Sonnet 4.5
gap 40.0pp
Claude Opus 4.5
gap 60.0pp
| Model | Direct | Indirect | Adversarial | Gap (pp) |
|---|---|---|---|---|
| Gemini 2.5 Flash | 0.967 | 0.967 | 0.731 | 23.6 |
| Claude Haiku 4.5 | 0.933 | 0.933 | 0.667 | 26.7 |
| Gemini 2.5 Pro | 0.933 | 1.000 | 0.667 | 26.7 |
| GPT-4o | 1.000 | 0.967 | 0.667 | 33.3 |
| Claude Sonnet 4.5 | 0.833 | 1.000 | 0.433 | 40.0 |
| Claude Opus 4.5 | 0.900 | 0.900 | 0.300 | 60.0 |
| Macro-avg | 0.928 | 0.961 | 0.578 | 35.7 |
Every model in our slate loses between 23.6 and 60.0 percentage points of contradiction-detection accuracy under the "already-verified" preamble — a 2.5× spread across the six full-coverage models, with no model immune. Gemini 2.5 Flash is the most robust at 23.6 pp, but "most robust" here still means losing a quarter of its detections under adversarial framing. Claude Opus 4.5 is the most susceptible, dropping from 0.900 direct to 0.300 adversarial. The within-Claude pattern is striking: the two more-cautious Claude models (Sonnet, Opus) are more easily steered than Haiku — we interpret this as the larger models treating an "already-verified" preamble as a strong prior that overrides direct visual evidence. Notably, Claude Haiku 4.5 is the strongest model on direct contradiction detection but only midpack on sycophancy robustness (26.7 pp); detection accuracy does not transfer into framing robustness.
Indirect framing ("describe first, then compare") is weakly dominant over direct on this slate: it matches direct for most models and strictly improves Claude Sonnet 4.5 (+16.7 pp) and Gemini 2.5 Pro (+6.7 pp), and is never materially worse than direct. Indirect does not close the adversarial gap — under "already-verified" framing, every model still drops substantially — but it is a safe default for direct evaluation settings. Defenders should assume adversarial framing will land, monitor for the preamble patterns that trigger it, and pair video verification with independent retrieval-augmented checks.
Model selection alone cannot provide sycophancy robustness on today's frontier tier: every model in our evaluation loses at least 24 pp of contradiction-detection accuracy under authoritative framing, and the strongest contradiction-detector (Claude Haiku 4.5) is only midpack on sycophancy robustness.
Hallucination resistance splits by vendor
Hallucination refusal is the cleanest cross-vendor effect in our data — and it is the inverse of the contradiction-detection ranking. Four of six models refuse probes about non-existent video content at ≥88% (Claude Sonnet 0.95, GPT-4o 0.95, Claude Haiku 0.93, Claude Opus 0.88). Both Gemini models are outliers: Flash refuses only 0.66 and Pro only 0.59, fabricating descriptions of non-existent events, objects, and details in 34–41% of probes. The within-Anthropic variance is small (≤7 pp); the cross-vendor gap is the dominant signal. Gemini 2.5 Flash is top-tier on contradiction detection and bottom-tier on hallucination refusal — a production stack that needs both cannot pick a single model from one vendor.
| Model | Temporal (B/A) | n_t | Halluc. Refusal | n_h |
|---|---|---|---|---|
| Claude Haiku 4.5 | 0.667 | 60 | 0.930 | 100 |
| Claude Opus 4.5 | 0.617 | 60 | 0.880 | 100 |
| Claude Sonnet 4.5 | 0.567 | 60 | 0.950 | 100 |
| Gemini 2.5 Flash | 0.564 | 55 | 0.663 | 95 |
| Gemini 2.5 Pro | 0.577 | 52 | 0.589 | 95 |
| GPT-4o | 0.583 | 60 | 0.950 | 100 |
Temporal binary ordering is at or just above chance. The highest before/after accuracy is 66.7% (Claude Haiku 4.5); the rest cluster 56–62% against a 50% chance baseline. Open-ended temporal questions ("what happened immediately after X?" and "what happened between X and Y?") score zero for every model under our exact-substring scorer, but the raw responses are plausible paraphrases that don't satisfy strict matching. We report only the binary sub-score here. Temporal ordering remains hard even for models that excel at contradiction detection on the same videos.
A cross-model breakdown of the hallucination rate by trigger category shows existence probes ("did the speaker hold up a specific object at time X?" when that object never appears) elicit the highest average hallucination at 33.3% (CI very wide due to small per-model n), followed by specific-detail at 24.6%. Counting and temporal-reference probes are safest at 17.2% and 16.3%; spatial-reference safest of all at 10.0%. A concrete deployment recommendation: when the decisive probe can be phrased as a count, phrase it as a count.
Safety implications
No single model wins all four axes, and the within-vendor hierarchy inverts the usual expectation that bigger beats smaller. Claude Haiku 4.5 ties for top on contradictions and L6 omission, is mid-pack on sycophancy (26.7 pp gap), and is the best on temporal binary ordering; it is outperformed by its own family members on hallucination refusal. Gemini 2.5 Flash is top-tier on contradictions and the most sycophancy-robust model, but bottom-tier on hallucination refusal alongside Gemini 2.5 Pro. GPT-4o is near-perfect on L1–L4 and strong on L6 but middle-of-the-pack on sycophancy. A production verification stack may want different models for different subtasks rather than a single composite winner.
In adversarial settings — misinformation, fraud — bad actors naturally frame false claims with authority. Every frontier model in our slate loses at least 24 percentage points of contradiction-detection accuracy under an "already-verified" preamble, with Claude Opus 4.5 losing 60 pp. Model selection alone cannot provide sycophancy robustness on this tier. Defenders should assume adversarial framing will land, monitor for the preamble patterns that trigger it, and pair video verification with independent retrieval-augmented checks. Sample sizes are adequate to catch the large effects we report but too small for close pair rankings. Human inter-annotator agreement has not been performed. The caption-only ablation we release is a methodology control for benchmark designers: any video-language benchmark whose input pipeline can silently fail will produce model rankings that look plausible but measure the wrong thing — hard-failing on missing video input and publishing a caption-only baseline together rule that class of artifact out.