anthropic · claude-sonnet-4-5 · Apr 3, 07:58 PM
Asset 88Lb01q. Score: 0.98. Latency: 6.21s. Cost: $0.012.
Workflow Eval Detail
This workflow answers natural-language questions about a video by retrieving relevant context and returning a concise answer.
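Answer quality is near-perfect across providers at very low cost. Google gemini-3.1-flash-lite-preview is best for speed and cost, and OpenAI gpt-5.1 is rated best for quality, although it shares the top score of 1 with the three Gemini models. These findings are based on only 6 cases (one per model), so treat them as preliminary.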
Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.
We score answer accuracy and response integrity while tracking latency, token usage, and cost.
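As a rough illustration of how one run's metrics might be recorded and compared against an earlier run to flag regressions, here is a minimal sketch. The `EvalRun` fields, the `find_regressions` helper, and all thresholds are hypothetical and not taken from the eval tooling itself. The baseline values come from the claude-sonnet-4-5 run shown on this page, and the "later" run is invented purely to exercise the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """One eval run. Field names are hypothetical and mirror the tracked metrics."""
    provider: str
    model: str
    score: float      # efficacy: answer accuracy and response integrity, 0..1
    latency_s: float  # efficiency: wall-clock seconds
    tokens: int       # efficiency: total tokens used
    cost_usd: float   # expense: dollars spent on the run


def find_regressions(baseline: EvalRun, current: EvalRun,
                     max_score_drop: float = 0.05,
                     max_latency_ratio: float = 1.5,
                     max_cost_ratio: float = 1.5) -> list[str]:
    """Return the names of metrics that worsened beyond the (assumed) thresholds."""
    issues = []
    if baseline.score - current.score > max_score_drop:
        issues.append("score")
    if current.latency_s > baseline.latency_s * max_latency_ratio:
        issues.append("latency")
    if current.cost_usd > baseline.cost_usd * max_cost_ratio:
        issues.append("cost")
    return issues


# Baseline: the claude-sonnet-4-5 run reported on this page.
baseline = EvalRun("anthropic", "claude-sonnet-4-5", 0.98, 6.21, 3247, 0.012)
# Hypothetical later run, invented for illustration only.
later = EvalRun("anthropic", "claude-sonnet-4-5", 0.90, 10.0, 3300, 0.012)

print(find_regressions(baseline, later))
```

Ratio-based thresholds for latency and cost keep the check meaningful across models whose absolute numbers differ by an order of magnitude. With a single case per model, any real threshold would need tuning against run-to-run noise.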
| Provider | Model | Cases | Avg Score | Avg Latency | Avg Tokens | Avg Cost | Avg Cost / Min |
|---|---|---|---|---|---|---|---|
| anthropic | claude-sonnet-4-5 | 1 | 0.98 | 6.21s | 3,247 | $0.012 | $0.0209/min |
| google | gemini-2.5-flash | 1 | 1 | 3.84s | 1,685 | $0.0013 | $0.0022/min |
| google | gemini-3-flash-preview | 1 | 1 | 5.6s | 2,746 | $0.002 | $0.0036/min |
| google | gemini-3.1-flash-lite-preview | 1 | 1 | 3.4s | 2,322 | $0.0008 | $0.0014/min |
| openai | gpt-5-mini | 1 | 0.91 | 11.77s | 3,517 | $0.0019 | $0.0034/min |
| openai | gpt-5.1 | 1 | 1 | 4.33s | 1,918 | $0.0039 | $0.0067/min |
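The comparison in the summary can be reproduced from the table with a few lines of Python. This is a sketch, not the dashboard's own logic. The rows are copied from the table, with the `google` provider label on the Gemini rows following the summary's naming. The last step assumes that "Avg Cost / Min" means cost divided by video duration in minutes, which the page does not state. Under that assumption, dividing cost by cost per minute gives an implied video length.

```python
# Rows copied from the table:
# (provider, model, score, latency_s, tokens, cost_usd, cost_per_min_usd)
ROWS = [
    ("anthropic", "claude-sonnet-4-5", 0.98, 6.21, 3247, 0.012, 0.0209),
    ("google", "gemini-2.5-flash", 1.0, 3.84, 1685, 0.0013, 0.0022),
    ("google", "gemini-3-flash-preview", 1.0, 5.6, 2746, 0.002, 0.0036),
    ("google", "gemini-3.1-flash-lite-preview", 1.0, 3.4, 2322, 0.0008, 0.0014),
    ("openai", "gpt-5-mini", 0.91, 11.77, 3517, 0.0019, 0.0034),
    ("openai", "gpt-5.1", 1.0, 4.33, 1918, 0.0039, 0.0067),
]

# Lowest latency and lowest cost per run.
fastest = min(ROWS, key=lambda r: r[3])
cheapest = min(ROWS, key=lambda r: r[5])

# Every model that reaches the highest score (ties are kept, not broken).
top_score = max(r[2] for r in ROWS)
top_models = [r[1] for r in ROWS if r[2] == top_score]

# Assumption: cost_per_min = cost / video_minutes, so video_minutes = cost / cost_per_min.
implied_minutes = [r[5] / r[6] for r in ROWS]

print("fastest:", fastest[1])
print("cheapest:", cheapest[1])
print("top score:", top_score, top_models)
print("implied video minutes:", [round(m, 2) for m in implied_minutes])
```

The implied durations all land between roughly 0.55 and 0.60 minutes, which is consistent with every model having been run on the same short asset. The small spread comes from rounding in the displayed figures.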