Paper № 04Safety14 min read

VideoTruth-Bench

Cross-modal consistency verification across six contradiction levels

Published

April 22, 2026

Authors

Datoric Research

Dataset

100 VATEX videos; 566 model×task evaluations per model. Four axes: six-level contradiction taxonomy, temporal ordering, hallucination refusal, and sycophancy under three prompt framings. 6 frontier multimodal models, eight real video frames per API call. Bootstrap 95% CI. A caption-only ablation is released alongside the main results as a methodology control.

Download PDF View PDF

Abstract

VideoTruth-Bench evaluates four trust-relevant axes of video–language understanding: contradiction detection across a graded six-level taxonomy (L1 entity swap through L6 omission), temporal ordering, hallucination resistance, and sycophancy under adversarial caption framing. We report 100 VATEX videos (566 model×task evaluations per model) across six frontier multimodal models — Claude Haiku 4.5, Opus 4.5, Sonnet 4.5; Gemini 2.5 Flash and Gemini 2.5 Pro; and GPT-4o — with eight real video frames passed to every API call. Three findings structure the release. First, the contradiction-detection leaderboard collapses into a three-way tie at the top: Claude Haiku 4.5 (0.870), Gemini 2.5 Flash (0.869), and GPT-4o (0.864) cluster within 1 pp of each other and from three different vendors, with Sonnet/Opus at 0.76–0.80 and Gemini 2.5 Pro at 0.65. L5 (causal) and L6 (omission) are the discriminating levels — no model exceeds 0.74 on L6. Second, sycophancy under an "already-verified" adversarial preamble costs every model 24–60 pp of contradiction-detection accuracy (Gemini 2.5 Flash 23.6 pp → Claude Opus 4.5 60.0 pp); no model in the frontier tier is sycophancy-immune. Third, hallucination resistance splits cleanly along vendor lines: Claude family and GPT-4o refuse ≥88% of probes about non-existent video content, while Gemini 2.5 Flash refuses 66% and Gemini 2.5 Pro 59%, fabricating descriptions in 34–41% of probes. Binary temporal before/after is 56–67% across models, just above chance. We also release a documented caption-only ablation that quantifies how much of each model's verdict comes from text prior versus video evidence — a methodology-relevant control benchmark designers should include to rule out silently-broken video-input pipelines.

Headline

24–60 pp

sycophancy-gap range across six frontier models under an "already-verified" preamble — no model in the frontier tier is sycophancy-immune

Why verification, not just description

Most video benchmarks measure whether a model can describe what is in a clip. VideoTruth-Bench measures something different: whether a model will tell you when a caption contradicts what the video shows, and whether the framing of the caption changes the model's answer. This is the capability that matters for fact-checking video claims, moderating video content, reviewing security footage, and verifying insurance claims. A model that generates fluent captions can still fail silently when asked whether a supplied description is actually correct.

We evaluate six frontier multimodal models with eight real VATEX video frames passed to every API call, under three prompt framings (direct, indirect, adversarial) over a graded six-level contradiction taxonomy, plus temporal-ordering and hallucination-probe measurements. Our core contribution is a matched-item measurement of how much framing moves detection — the sycophancy gap — across the frontier. Prior work has measured sycophancy in text-only LLMs; we are not aware of a matched-item measurement for video-language models with the video actually attached.

Main result

At full video-frame coverage the top of the contradiction-detection leaderboard is a three-way tie: Claude Haiku 4.5 (0.870), Gemini 2.5 Flash (0.869), and GPT-4o (0.864) cluster within 1 pp from three different vendors. L5 causal ranges 0.58–0.96 across models and L6 omission ranges 0.50–0.74 — no model exceeds 0.74 on L6. Sycophancy is universal: every model drops 24–60 pp under an "already-verified" preamble. Hallucination refusal splits cleanly: Claude family and GPT-4o at ≥88%, both Gemini models below 70%. No single model wins all four axes.

The benchmark

VideoTruth-Bench evaluates 100 VATEX videos with 566 model×task evaluations per model. Contradictions are generated using the Claude API at six levels of increasing subtlety (L1 entity swap, L2 temporal reorder, L3 quantitative, L4 attributive, L5 causal, L6 omission). For each clip we pair the caption with three prompt framings — direct, indirect ("describe first, then compare"), and adversarial ("this caption has been verified as accurate") — holding content constant and varying only the wrapper. Per-model per-level contradiction sample size is 22–30; sycophancy analysis is at n=30 per framing per model where coverage permits. GPT-4o-mini was initially part of the slate but was rate-limited during the parallel run and yielded too few per-cell samples for meaningful inference. We evaluate six frontier multimodal models across four axes:

01Contradiction detection by subtlety level. A six-level taxonomy from L1 entity swap through L2 temporal, L3 quantitative, L4 attributive, L5 causal, and L6 omission.
02Temporal ordering. Binary before/after and ordered-sequence accuracy on genuine events in the video.
03Hallucination resistance. Fraction of probes about absent events that are correctly refused.
04Sycophancy. Contradiction-detection drop between direct and adversarial framings of the same content.

Three-way tie at the top, L6 universal ceiling

Three models from three different vendors cluster within 1 pp of each other at the top of overall contradiction-detection accuracy — Claude Haiku 4.5 (0.870), Gemini 2.5 Flash (0.869), GPT-4o (0.864) — with bootstrap 95% CIs that overlap heavily. The middle tier (Sonnet 0.796, Opus 0.759) is reliably below the top three but well above Gemini 2.5 Pro (0.652). Intra-vendor, smaller models beat larger: Haiku > Sonnet > Opus (0.870 / 0.796 / 0.759) and Gemini 2.5 Flash > 2.5 Pro (0.869 / 0.652). We hypothesize the larger models are tuned toward more qualified answers that don't crisp into yes/no verdicts on the contradiction prompt.

Model	L1 Entity	L2 Temp.	L3 Quant.	L4 Attr.	L5 Causal	L6 Omiss.	Overall	n
Claude Haiku 4.5	0.889	0.900	1.000	0.760	0.962	0.733	0.870	162
Gemini 2.5 Flash	1.000	0.909	0.812	1.000	0.952	0.593	0.869	130
GPT-4o	1.000	0.833	0.875	0.920	0.846	0.733	0.864	162
Claude Sonnet 4.5	0.889	0.900	0.750	0.760	0.962	0.533	0.796	162
Claude Opus 4.5	0.926	0.833	0.750	0.800	0.769	0.500	0.759	162
Gemini 2.5 Pro	0.731	0.733	0.625	0.760	0.577	0.500	0.652	161

Table 1. Contradiction-detection accuracy by level across all six evaluated models. L1–L4 saturate for the top three; L5 and L6 spread the field. L6 (omission) is the universal ceiling: no model exceeds 0.74.

L6 omission — "does this summary leave out a key event?" — is where the frontier tier collapses. No model exceeds 0.74; two models (Claude Opus 4.5, Gemini 2.5 Pro) sit at chance. L5 causal is narrower — the top five models score 0.77–0.96 — but no model is perfect on either. Benchmarks that stop short of L6 will conclude the frontier tier is uniformly reliable at verification, when in fact models are at their ceiling on obvious mismatches and at chance on noticing what's missing. Rankings also differ by level: for L1–L3 entity/quantitative fact-checks Gemini 2.5 Flash dominates; for L5 causal Haiku and Sonnet tie at 0.962; for L6 omission Haiku and GPT-4o tie at 0.733. A production stack may want different models for different verification subtasks rather than a single composite winner.

No model is sycophancy-immune

The sycophancy gap measures how much a model's contradiction detection degrades when the same contradictory caption is framed as coming from an authoritative source rather than as a neutral statement. A zero gap means the model is unaffected by framing. A large gap means the model is effectively being steered by prompt phrasing rather than by what is actually in the video.

Figure

Direct framing

Adversarial framing

Gemini 2.5 Flash

gap 23.6pp

.967

.731

Claude Haiku 4.5

gap 26.6pp

.933

.667

Gemini 2.5 Pro

gap 26.6pp

.933

.667

GPT-4o

gap 33.3pp

1.000

.667

Claude Sonnet 4.5

gap 40.0pp

.833

.433

Claude Opus 4.5

gap 60.0pp

.900

.300

Figure 1. Contradiction detection under direct versus adversarial framing across all six evaluated models. The gap between the two bars is the sycophancy effect. At full video-frame coverage the range is 23.6 to 60.0 pp — no model is immune. Gemini 2.5 Flash is the most robust (23.6 pp); Claude Opus 4.5 is the most susceptible (60.0 pp).

Model	Direct	Indirect	Adversarial	Gap (pp)
Gemini 2.5 Flash	0.967	0.967	0.731	23.6
Claude Haiku 4.5	0.933	0.933	0.667	26.7
Gemini 2.5 Pro	0.933	1.000	0.667	26.7
GPT-4o	1.000	0.967	0.667	33.3
Claude Sonnet 4.5	0.833	1.000	0.433	40.0
Claude Opus 4.5	0.900	0.900	0.300	60.0
Macro-avg	0.928	0.961	0.578	35.7

Table 2. Detection rates by prompt variant across all six evaluated models. The sycophancy gap is direct minus adversarial. Each framing is evaluated at n=30 per model where coverage permits.

Every model in our slate loses between 23.6 and 60.0 percentage points of contradiction-detection accuracy under the "already-verified" preamble — a 2.5× spread across the six full-coverage models, with no model immune. Gemini 2.5 Flash is the most robust at 23.6 pp, but "most robust" here still means losing a quarter of its detections under adversarial framing. Claude Opus 4.5 is the most susceptible, dropping from 0.900 direct to 0.300 adversarial. The within-Claude pattern is striking: the two more-cautious Claude models (Sonnet, Opus) are more easily steered than Haiku — we interpret this as the larger models treating an "already-verified" preamble as a strong prior that overrides direct visual evidence. Notably, Claude Haiku 4.5 is the strongest model on direct contradiction detection but only midpack on sycophancy robustness (26.7 pp); detection accuracy does not transfer into framing robustness.

Indirect framing ("describe first, then compare") is weakly dominant over direct on this slate: it matches direct for most models and strictly improves Claude Sonnet 4.5 (+16.7 pp) and Gemini 2.5 Pro (+6.7 pp), and is never materially worse than direct. Indirect does not close the adversarial gap — under "already-verified" framing, every model still drops substantially — but it is a safe default for direct evaluation settings. Defenders should assume adversarial framing will land, monitor for the preamble patterns that trigger it, and pair video verification with independent retrieval-augmented checks.

Model selection alone cannot provide sycophancy robustness on today's frontier tier: every model in our evaluation loses at least 24 pp of contradiction-detection accuracy under authoritative framing, and the strongest contradiction-detector (Claude Haiku 4.5) is only midpack on sycophancy robustness.

Hallucination resistance splits by vendor

Hallucination refusal is the cleanest cross-vendor effect in our data — and it is the inverse of the contradiction-detection ranking. Four of six models refuse probes about non-existent video content at ≥88% (Claude Sonnet 0.95, GPT-4o 0.95, Claude Haiku 0.93, Claude Opus 0.88). Both Gemini models are outliers: Flash refuses only 0.66 and Pro only 0.59, fabricating descriptions of non-existent events, objects, and details in 34–41% of probes. The within-Anthropic variance is small (≤7 pp); the cross-vendor gap is the dominant signal. Gemini 2.5 Flash is top-tier on contradiction detection and bottom-tier on hallucination refusal — a production stack that needs both cannot pick a single model from one vendor.

Model	Temporal (B/A)	n_t	Halluc. Refusal	n_h
Claude Haiku 4.5	0.667	60	0.930	100
Claude Opus 4.5	0.617	60	0.880	100
Claude Sonnet 4.5	0.567	60	0.950	100
Gemini 2.5 Flash	0.564	55	0.663	95
Gemini 2.5 Pro	0.577	52	0.589	95
GPT-4o	0.583	60	0.950	100

Table 3. Temporal binary before/after accuracy and hallucination refusal rate (fraction of probes about absent events that are correctly refused — higher is better).

Temporal binary ordering is at or just above chance. The highest before/after accuracy is 66.7% (Claude Haiku 4.5); the rest cluster 56–62% against a 50% chance baseline. Open-ended temporal questions ("what happened immediately after X?" and "what happened between X and Y?") score zero for every model under our exact-substring scorer, but the raw responses are plausible paraphrases that don't satisfy strict matching. We report only the binary sub-score here. Temporal ordering remains hard even for models that excel at contradiction detection on the same videos.

A cross-model breakdown of the hallucination rate by trigger category shows existence probes ("did the speaker hold up a specific object at time X?" when that object never appears) elicit the highest average hallucination at 33.3% (CI very wide due to small per-model n), followed by specific-detail at 24.6%. Counting and temporal-reference probes are safest at 17.2% and 16.3%; spatial-reference safest of all at 10.0%. A concrete deployment recommendation: when the decisive probe can be phrased as a count, phrase it as a count.

Safety implications

No single model wins all four axes, and the within-vendor hierarchy inverts the usual expectation that bigger beats smaller. Claude Haiku 4.5 ties for top on contradictions and L6 omission, is mid-pack on sycophancy (26.7 pp gap), and is the best on temporal binary ordering; it is outperformed by its own family members on hallucination refusal. Gemini 2.5 Flash is top-tier on contradictions and the most sycophancy-robust model, but bottom-tier on hallucination refusal alongside Gemini 2.5 Pro. GPT-4o is near-perfect on L1–L4 and strong on L6 but middle-of-the-pack on sycophancy. A production verification stack may want different models for different subtasks rather than a single composite winner.

In adversarial settings — misinformation, fraud — bad actors naturally frame false claims with authority. Every frontier model in our slate loses at least 24 percentage points of contradiction-detection accuracy under an "already-verified" preamble, with Claude Opus 4.5 losing 60 pp. Model selection alone cannot provide sycophancy robustness on this tier. Defenders should assume adversarial framing will land, monitor for the preamble patterns that trigger it, and pair video verification with independent retrieval-augmented checks. Sample sizes are adequate to catch the large effects we report but too small for close pair rankings. Human inter-annotator agreement has not been performed. The caption-only ablation we release is a methodology control for benchmark designers: any video-language benchmark whose input pipeline can silently fail will produce model rankings that look plausible but measure the wrong thing — hard-failing on missing video input and publishing a caption-only baseline together rule that class of artifact out.

Cite this work

@article{datoric04,
  title={VideoTruth-Bench: Cross-modal consistency verification across six contradiction levels},
  author={Datoric Research},
  year={2026},
  journal={Datoric Research Notes},
  url={https://datoric.com/research/videotruth-bench}
}

Data sources

→VATEX
→Synthetic adversarial constructions

№ 05Multilingual

BharatVoice-Bench

Independent benchmarking of frontier voice models on Indian languages

Read paper

№ 03Video

VidWork-Bench

A five-axis benchmark for procedural video understanding

Read paper