anthropic · claude-sonnet-4-5 · May 18, 06:01 PM · Asset 88Lb01q · Score 0.9 · Latency 7.07s · Cost $0.0151
Workflow Eval Detail
Answers natural-language questions about a video by retrieving relevant context and returning a concise response.
On this small 7-case suite, every model scores between 0.85 and 1.0, but Google's Flash and Flash-Lite models offer the best overall tradeoff of quality, latency, and cost for Ask Questions.
Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.
We score answer accuracy and response integrity while tracking latency, token usage, and cost.
| Provider | Model | Cases | Avg Score | Avg Latency | Avg Tokens | Avg Cost | Avg Cost / Min |
|---|---|---|---|---|---|---|---|
| anthropic | claude-sonnet-4-5 | 1 | 0.9 | 7.07s | 4,451 | $0.0151 | $0.0263/min |
| google | gemini-2.5-flash | 1 | 1 | 3.73s | 2,881 | $0.0017 | $0.0029/min |
| google | gemini-3-flash-preview | 1 | 0.96 | 4.61s | 3,796 | $0.0022 | $0.0039/min |
| google | gemini-3.1-flash-lite | 1 | 0.98 | 2.58s | 3,480 | $0.0011 | $0.0018/min |
| google | gemini-3.1-flash-lite-preview | 1 | 0.97 | 2.67s | 3,489 | $0.0011 | $0.0019/min |
| openai | gpt-5-mini | 1 | 0.85 | 13.4s | 4,750 | $0.0023 | $0.004/min |
| openai | gpt-5.1 | 1 | 1 | 3.48s | 3,013 | $0.0018 | $0.0032/min |
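
One way to read the quality, latency, and cost tradeoff from these rows is a Pareto filter: keep only the models that no other model matches or beats on all three metrics at once. The sketch below is an illustration, not the scoring method this page uses. The `EvalRow` type and the `dominates` and `pareto_front` helpers are hypothetical names, and the values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRow:
    model: str
    score: float      # average score, higher is better
    latency_s: float  # average latency in seconds, lower is better
    cost_usd: float   # average cost per run in USD, lower is better


# Values copied from the results table (one case per model).
ROWS = [
    EvalRow("claude-sonnet-4-5", 0.90, 7.07, 0.0151),
    EvalRow("gemini-2.5-flash", 1.00, 3.73, 0.0017),
    EvalRow("gemini-3-flash-preview", 0.96, 4.61, 0.0022),
    EvalRow("gemini-3.1-flash-lite", 0.98, 2.58, 0.0011),
    EvalRow("gemini-3.1-flash-lite-preview", 0.97, 2.67, 0.0011),
    EvalRow("gpt-5-mini", 0.85, 13.4, 0.0023),
    EvalRow("gpt-5.1", 1.00, 3.48, 0.0018),
]


def dominates(a: EvalRow, b: EvalRow) -> bool:
    """True if `a` is at least as good as `b` on every metric and strictly better on one."""
    no_worse = (
        a.score >= b.score
        and a.latency_s <= b.latency_s
        and a.cost_usd <= b.cost_usd
    )
    strictly_better = (
        a.score > b.score
        or a.latency_s < b.latency_s
        or a.cost_usd < b.cost_usd
    )
    return no_worse and strictly_better


def pareto_front(rows: list[EvalRow]) -> list[EvalRow]:
    """Return the rows that no other row dominates, in input order."""
    return [r for r in rows if not any(dominates(other, r) for other in rows)]


if __name__ == "__main__":
    for row in pareto_front(ROWS):
        print(f"{row.model}: score={row.score}, latency={row.latency_s}s, cost=${row.cost_usd}")
```

On the rows above, the filter keeps gemini-2.5-flash, gemini-3.1-flash-lite, and gpt-5.1. With a single case per model, small differences in score or latency may not hold up on a larger suite, so treat the result as a rough guide rather than a ranking.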