Why workflows matter
The most promising use cases for video AI are professional: manufacturing QC, medical training review, equipment repair guidance, and first-aid and safety monitoring. These tasks share a structural property that is missing from public video benchmarks: they are procedural. They require the model to understand not just what objects are present, but what order steps occur in, which steps depend on which, and whether the sequence was executed correctly.
VidWork-Bench is a five-axis framework for measuring procedural video understanding: step recognition, temporal ordering, causal reasoning, cross-modal grounding, and adversarial error detection. We instantiate it on 171 curated clips across three domains (cooking, repair/manufacturing, first-aid/safety) and four duration buckets (30s, 60s, 180s, 300s), generating 2,092 QA items after three-model difficulty filtering. The adversarial error-detection axis has no direct analogue in prior procedural video work.
Main result
A paired 1-frame vs 8-frame ablation on the two axes most plausibly multi-frame-dependent — temporal ordering and causal reasoning — finds no statistically significant benefit from multi-frame context for Claude Sonnet 4.5 or GPT-4o. On GPT-4o causal reasoning the 1-frame condition is significantly better (Δ = +0.014, 95% CI [+0.002, +0.026]). Any benchmark claiming to measure "temporal reasoning" by serving more frames needs to verify the extra frames are actually being used.
The benchmark
VidWork-Bench draws 171 clips from YouCook2 (cooking), COIN (repair/manufacturing), and curated instructional content (first-aid/safety), binned into four duration buckets (72 × 30s, 42 × 60s, 40 × 180s, 17 × 300s). Each clip contains at least three identifiable steps. The 2,092 QA items are distributed across five axes: step recognition (n = 423), temporal ordering (n = 90), causal reasoning (n = 225), cross-modal grounding (n = 264), and adversarial error detection (n = 1,090). Items are filtered against a three-model baseline consensus: questions that all three baselines answered correctly are dropped as insufficiently discriminating. One cell (repair/manufacturing × 300s) has zero clips and is reported as unpopulated. Three-annotator human inter-annotator agreement (IAA) has not yet been collected.
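For concreteness, the consensus filter fits in a few lines. This is a minimal sketch with a hypothetical item schema, not the release code:

```python
def filter_items(items):
    """Drop QA items that all three baseline models answered correctly."""
    kept = []
    for item in items:
        # Hypothetical schema: item["baseline_correct"] maps each of the
        # three baseline models to a boolean correctness flag.
        if not all(item["baseline_correct"].values()):
            kept.append(item)  # at least one baseline failed: keep the item
    return kept
```

The five axes are scored as follows: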
1. Step recognition. Fuzzy-match F1 against the specific annotated steps (not generic procedural templates).
2. Temporal ordering. Accuracy on pairwise before/after questions drawn from step-boundary annotations.
3. Causal reasoning. Accuracy on "why did the person do X?" and "what would happen if they skipped Y?" questions.
4. Cross-modal grounding. Accuracy on questions whose answers appear in the video but *not* in the ASR transcript, forcing the model to attend to frames.
5. Adversarial error detection. Detection rate on descriptions modified along one of eight error types (step swap, step omission, step modification, action modification, tool substitution, quantity error, causal reversal, insufficient input); a minimal sketch of the first two perturbations follows this list.
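The sketch below illustrates the first two perturbation types over an ordered list of step strings; the function names are ours, not the release code:

```python
import random

def step_swap(steps, rng=random):
    """Error type step_swap: exchange two adjacent steps."""
    out = list(steps)
    i = rng.randrange(len(out) - 1)  # clips have >= 3 steps by construction
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def step_omission(steps, rng=random):
    """Error type step_omission: silently drop one step."""
    out = list(steps)
    del out[rng.randrange(len(out))]
    return out
```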
All scores are reported with 95% bootstrap confidence intervals (10,000 resamples; 1,000 for the paired single-frame ablation). We evaluate six frontier vision–language models — GPT-4o, GPT-4o-mini, Gemini 2.5 Flash, Claude Haiku 4.5, Sonnet 4.5, and Opus 4.5. Gemini 2.5 Pro is not evaluated; it returned 503 at a higher rate than Flash during our evaluation window. Gemini 2.5 Flash itself suffered a 30% 503 rate and is reported only on step recognition (the axis that completed before retries were exhausted). Claude Opus 4.5 runs at 4 frames per sample due to compute constraints; the other models run at 8 frames.
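The CI construction is a standard percentile bootstrap over per-item scores; a minimal sketch consistent with that setup (our illustration, not the release code):

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a mean per-item score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample items with replacement and recompute the mean each time.
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)
```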
Multi-frame context doesn't measurably help
We re-ran the temporal-ordering and causal-reasoning axes at max_frames = 1 for the two models most relevant to the question — Claude Sonnet 4.5 (the leaderboard winner) and GPT-4o (the strongest non-Claude model with complete coverage). Every 1-frame response is paired with the corresponding 8-frame response on the identical (clip, question) pair, enabling a paired bootstrap on the score difference.
| Model | Axis | n | 1f | 8f | Δ (1f − 8f) | Sig. |
|---|---|---|---|---|---|---|
| Sonnet 4.5 | Temporal | 90 | 0.365 | 0.387 | −0.022 | ns |
| Sonnet 4.5 | Causal | 225 | 0.491 | 0.503 | −0.012 | ns |
| GPT-4o | Temporal | 90 | 0.203 | 0.218 | −0.014 | ns |
| GPT-4o | Causal | 225 | 0.397 | 0.383 | +0.014 | p < 0.05 |
For three of the four (model, axis) cells, the 8-frame condition is numerically slightly better but the 95% paired CI includes zero. For the fourth — GPT-4o on causal reasoning — the 1-frame condition is significantly better. We read this cautiously. Three candidate explanations are non-exclusive: the ASR transcript is included in both conditions and may already carry the temporal signal; our items may not be subtle enough to separate 1-frame from 8-frame evidence at current difficulty; and extra frames may occasionally introduce distractor content that hurts rather than helps. The key empirical takeaway is that we cannot reject the null of 1-frame = 8-frame on this pool, and any future claim that procedural reasoning requires multi-frame context must include an ablation like this one.
Our negative result suggests a substantial fraction of procedural items can be answered from the ASR transcript plus one representative frame.
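For reference, the paired test above is the same percentile bootstrap applied to per-item score differences rather than raw scores; a minimal sketch under the same assumptions:

```python
import numpy as np

def paired_bootstrap_delta(scores_1f, scores_8f, n_resamples=1_000, seed=0):
    """Paired bootstrap on per-item differences (1-frame minus 8-frame).

    Inputs must be aligned on identical (clip, question) pairs. The effect
    is "significant" in the sense used above when the CI excludes zero.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_1f, dtype=float) - np.asarray(scores_8f, dtype=float)
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(means, [0.025, 0.975])
    return diffs.mean(), (lo, hi)
```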
Leaderboard and axis breakdown
Table 2 reports the full per-axis leaderboard. Claude Sonnet 4.5 wins the composite (unweighted mean across the five axes) at 0.446, narrowly ahead of Claude Haiku 4.5 (0.422) and Opus 4.5 (0.420). The Claude-family cluster sits 8–13 composite points above the GPT-4o family (mini 0.335, 4o 0.313). The Sonnet–Haiku gap (0.024) is within combined bootstrap uncertainty on several axes, so we do not over-claim a tight ranking between them; what is robust is the Claude-family vs GPT-4o-family separation, and that separation is almost entirely driven by the error-detection axis. GPT-4o-mini outscores full GPT-4o on every axis except step recognition; we do not have a clean explanation and flag it as worth a separate study.
| Model | Step (F1) | Temporal | Causal | X-Modal | Error Det. | Composite |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 0.048 | 0.387 | 0.503 | 0.343 | 0.947 | 0.446 |
| Claude Haiku 4.5 | 0.036 | 0.308 | 0.487 | 0.342 | 0.936 | 0.422 |
| Claude Opus 4.5 | 0.035 | 0.335 | 0.457 | 0.343 | 0.933 | 0.420 |
| GPT-4o-mini | 0.094 | 0.238 | 0.426 | 0.270 | 0.647 | 0.335 |
| GPT-4o | 0.104 | 0.218 | 0.383 | 0.223 | 0.639 | 0.313 |
| Gemini 2.5 Flash | 0.056 | — | — | — | — | — |
Step recognition is uniformly hard: no model exceeds 11% F1. The scorer requires fuzzy overlap against the specific annotated steps rather than accepting generic procedural templates ("gather supplies, perform procedure, verify result"), and the dominant failure mode across all models is template substitution — producing a plausible cooking/repair/first-aid boilerplate rather than the procedure-specific steps. Models know what procedures look like in the abstract but do not recover the procedure-specific step signal from our clips.
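To make the scoring concrete, here is a greedy fuzzy-match F1 in the spirit of the scorer; the similarity function and the 0.6 threshold are our assumptions, not the released implementation:

```python
from difflib import SequenceMatcher

def _sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_step_f1(predicted, reference, threshold=0.6):
    """Greedy fuzzy matching of predicted steps against annotated steps.

    Each predicted step may consume at most one reference step, so generic
    boilerplate ("gather supplies") cannot match every annotation.
    """
    unmatched = list(reference)
    tp = 0
    for pred in predicted:
        scored = [(_sim(pred, ref), ref) for ref in unmatched]
        if scored:
            best_score, best_ref = max(scored)
            if best_score >= threshold:
                unmatched.remove(best_ref)
                tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```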
Error detection separates Claude from GPT-4o
The adversarial error-detection axis is the decisive axis in this benchmark. Table 3 reports per-error-type detection rates across the five models with full coverage. Across the six error types with meaningful sample size (the remaining two, step modification and insufficient input, account for only four of the 1,090 items and are omitted from the table), Claude models detect adversarial errors at 85–97%; GPT-4o-family models detect at 38–76%. The largest gaps are on step omission (Haiku 96.6% vs GPT-4o 59.2%) and tool substitution (Opus 96.8% vs GPT-4o 53.2%). The gap on quantity error (87.5–91.7% vs 37.5–45.8%) is notable because quantity errors are arguably the subtlest category: they require verifying a specific numerical claim rather than catching a qualitative mismatch.
| Error type | n | GPT-4o | GPT-4o-mini | Haiku 4.5 | Sonnet 4.5 | Opus 4.5 |
|---|---|---|---|---|---|---|
| step_swap | 428 | 0.729 | 0.755 | 0.956 | 0.967 | 0.967 |
| step_omission | 206 | 0.592 | 0.738 | 0.966 | 0.947 | 0.850 |
| action_modification | 181 | 0.597 | 0.481 | 0.895 | 0.934 | 0.945 |
| tool_substitution | 124 | 0.532 | 0.524 | 0.911 | 0.952 | 0.968 |
| causal_reversal | 123 | 0.602 | 0.537 | 0.911 | 0.911 | 0.911 |
| quantity_error | 24 | 0.458 | 0.375 | 0.917 | 0.875 | 0.917 |
| Weighted mean | 1,090 | 0.639 | 0.647 | 0.936 | 0.947 | 0.933 |
Two readings compete. The Claude-family gap could reflect genuinely more careful procedural scrutiny, or it could reflect a higher base rate of "flagging errors when prompted" — which would inflate detection rates at the cost of false positives on matched-correct descriptions. Disambiguating requires a symmetric correct-description counter-set where the described procedure is actually correct and the model must not flag. We do not have that counter-set in this release; constructing it is the highest-priority follow-up, and we treat the current number as a detection-plus-propensity composite.
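A sketch of what that disambiguation would report once the counter-set exists (the counter-set itself is future work; the names and the balanced-accuracy summary are our choices, not part of this release):

```python
def detection_vs_propensity(flags_on_errors, flags_on_correct):
    """Separate detection skill from flagging propensity.

    flags_on_errors:  bools, model flagged each perturbed description
    flags_on_correct: bools, model flagged each matched-correct description
    An indiscriminate flagger gets a high detection rate AND a high
    false-positive rate; balanced accuracy penalizes that.
    """
    tpr = sum(flags_on_errors) / len(flags_on_errors)    # current Table 3 number
    fpr = sum(flags_on_correct) / len(flags_on_correct)  # requires the counter-set
    return {"detection_rate": tpr,
            "false_positive_rate": fpr,
            "balanced_accuracy": (tpr + (1.0 - fpr)) / 2.0}
```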
Scope we don't claim
VidWork-Bench has four scope limits we document explicitly. Gemini 2.5 Pro returned persistent 503 errors during the evaluation window and is not evaluated. Gemini 2.5 Flash suffered a 30% 503 rate and is reported only on step recognition; its other four axes are left blank rather than reported as partial. Claude Opus 4.5 was run at 4 frames per sample instead of 8 due to compute constraints. The repair/manufacturing × 300s cell in the duration grid is empty: sufficiently long repair clips with dense step annotations were not available during curation, so the cell is reported as unpopulated.
Two caveats affect how the composite ranking should be read. First, the causal-reasoning and temporal-ordering items are Claude-authored, which likely confers a small within-family advantage on those axes; the adversarial error-detection axis, which contributes most of the Claude-vs-GPT composite gap, is rule-based (perturbed from a correct description along one of eight categories) and is not subject to Claude-authoring bias. Second, human inter-annotator agreement has not yet been measured; the reference answers and the key-term-overlap scorer are the only labels. We release the full evaluation corpus, per-cell aggregates, and the single-frame ablation pool so that other labs can reproduce and extend these numbers.