# Workflow Eval Detail

anthropic · claude-sonnet-4-5 (Apr 3, 07:58 PM) · Asset gEvCHSJ

Score: 1 · Latency: 2.21s · Cost: $0.0082
Analyzes video frames to detect hardcoded captions baked into the visual content—useful for compliance checks and accessibility audits.
All providers achieved near-perfect burned-in caption detection at low cost. gemini-3.1-flash-lite-preview is favored for speed and expense, and gpt-5.1 for quality, though each model was tested on only 3 cases.
Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.
We evaluate caption detection accuracy, confidence calibration, and response integrity alongside speed and cost thresholds.
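As a rough sketch of how the per-case measurements could be rolled up into the table below, here is a minimal aggregation helper. The `CaseResult` fields and the `summarize` function are illustrative assumptions, not the actual eval harness; only the three metric families (efficacy, efficiency, expense) come from the description above.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class CaseResult:
    score: float      # 1.0 = correct burned-in caption verdict, 0.0 = miss
    latency_s: float  # wall-clock seconds for the model call
    tokens: int       # total tokens (prompt + completion)
    cost_usd: float   # provider-billed cost for the call

def summarize(results: list[CaseResult]) -> dict:
    """Average efficacy/efficiency/expense metrics across eval cases."""
    return {
        "cases": len(results),
        "avg_score": round(mean(r.score for r in results), 2),
        "avg_latency_s": round(mean(r.latency_s for r in results), 2),
        "avg_tokens": round(mean(r.tokens for r in results)),
        "avg_cost_usd": round(mean(r.cost_usd for r in results), 4),
    }
```

With three hypothetical cases whose averages match the anthropic row below, `summarize` would report 3 cases, 2.37s average latency, 2,446 average tokens, and $0.0077 average cost.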
| Provider | Model | Cases | Avg Score | Avg Latency | Avg Tokens | Avg Cost | Avg Cost / Min |
|---|---|---|---|---|---|---|---|
| anthropic | claude-sonnet-4-5 | 3 | 1 | 2.37s | 2,446 | $0.0077 | $0.0268/min |
| google | gemini-2.5-flash | 3 | 0.99 | 5.73s | 2,105 | $0.0028 | $0.0099/min |
| google | gemini-3-flash-preview | 3 | 1 | 5.11s | 2,680 | $0.0027 | $0.0093/min |
| google | gemini-3.1-flash-lite-preview | 3 | 1 | 1.80s | 1,965 | $0.0005 | $0.0019/min |
| openai | gpt-5-mini | 3 | 0.91 | 16.97s | 3,137 | $0.0028 | $0.0098/min |
| openai | gpt-5.1 | 3 | 1 | 2.83s | 1,631 | $0.0006 | $0.0021/min |
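The speed/expense takeaway can be expressed as a selection rule over the summary rows: filter to models meeting a score threshold, then pick the cheapest per minute. The row tuples below mirror the table; the `cheapest_passing` helper is an illustrative sketch, not part of the eval tooling.

```python
# (model, avg_score, avg_cost_per_min_usd) rows mirroring the table above.
SUMMARY = [
    ("claude-sonnet-4-5", 1.00, 0.0268),
    ("gemini-2.5-flash", 0.99, 0.0099),
    ("gemini-3-flash-preview", 1.00, 0.0093),
    ("gemini-3.1-flash-lite-preview", 1.00, 0.0019),
    ("gpt-5-mini", 0.91, 0.0098),
    ("gpt-5.1", 1.00, 0.0021),
]

def cheapest_passing(rows, min_score=1.0):
    """Return the lowest-cost row whose average score meets the threshold."""
    passing = [r for r in rows if r[1] >= min_score]
    return min(passing, key=lambda r: r[2]) if passing else None
```

Applied to these rows with `min_score=1.0`, the rule selects gemini-3.1-flash-lite-preview at $0.0019/min, matching the summary's recommendation.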