Workflow Eval Detail

Ask Questions

Answers natural-language questions about a video by retrieving relevant context and returning a concise response.

Latest run: completed
muxinc/ai
main b7cce22 · @mux/ai v0.13.1
Cases: 6
Avg Score: 0.98
Avg Latency: 5.86s
Avg Cost: $0.0036
Avg Cost / Min: $0.0064/min
Avg Tokens: 2,573
TL;DR

Near-perfect answer quality across providers at very low cost: Google gemini-3.1-flash-lite-preview is best for speed and cost, and OpenAI gpt-5.1 is best for quality. Note that these findings are based on only 6 cases.

Best Quality: openai · gpt-5.1
Fastest: google · gemini-3.1-flash-lite-preview
Most Economical: google · gemini-3.1-flash-lite-preview

What we measure

Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.

Efficacy: quality + correctness
Efficiency: latency + token usage
Expense: cost per request

Workflow snapshot

Suite status: success
Suite average score: 0.98
Suite duration: 35.16s
Last suite run: Apr 3, 07:57 PM

Evaluation criteria

From eval tests

We score answer accuracy and response integrity while tracking latency, token usage, and cost.

Question set (3 checks)
  • Has on-screen text? Answer: Yes (confidence 0.94)
  • Is the speaker visible? Answer: No (confidence 0.88)
  • Scene indoors? Answer: Yes (confidence 0.91)
Accuracy: match expected yes/no outputs.
Format: required fields + response shape.
Integrity: confidence and reasoning present.
Throughput: latency and cost stay within targets.
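The criteria above imply a per-question response shape roughly like the following. This is a sketch for illustration only; the field names (`assetId`, `storyboardUrl`, `answers`, and so on) are assumptions, not the actual @mux/ai schema.

```typescript
// Assumed response shape for the Ask Questions workflow.
// Field names are illustrative guesses, not the real @mux/ai types.
interface QuestionAnswer {
  question: string;
  answer: "Yes" | "No";   // accuracy check: must match the expected output
  confidence: number;     // integrity check: must fall in [0, 1]
  reasoning: string;      // integrity check: must be non-empty
}

interface AskQuestionsResponse {
  assetId: string;        // must be preserved from the request
  storyboardUrl: string;  // must be preserved from the request
  answers: QuestionAnswer[];
}

// Integrity check mirroring "confidence is 0-1 and reasoning is non-empty".
function passesIntegrity(r: AskQuestionsResponse): boolean {
  return r.answers.every(
    (a) => a.confidence >= 0 && a.confidence <= 1 && a.reasoning.length > 0
  );
}
```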
Response validation
Efficacy checks
  • Answers match expected yes/no outputs.
  • Response includes required fields and answer structure.
  • Confidence is 0-1 and reasoning strings are non-empty.
  • Response preserves asset ID and storyboard URL.
Efficiency targets
  • Latency: scores are normalized between 0 and 1. Under 8s earns 1.0; past 12s trends toward 0.
  • Token usage: scores are normalized between 0 and 1. Under 2,900 tokens earns 1.0; higher usage reduces the score.
  • Usage data must include total tokens for cost analysis.
Expense guardrails
  • Estimated cost under $0.012 per request for full score.
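The efficiency targets read as piecewise normalization between a full-score threshold and a zero-score threshold. A minimal sketch, assuming a linear falloff in between; only the full-score thresholds (8s, 2,900 tokens, $0.012) are stated above, so the linear shape and the token/cost zero-score ceilings here are assumptions:

```typescript
// Piecewise-linear normalization: 1.0 at or below `fullAt`, 0.0 at or
// above `zeroAt`, linear in between. The linear shape and the zeroAt
// values for tokens and cost are assumptions for illustration.
function normalize(value: number, fullAt: number, zeroAt: number): number {
  if (value <= fullAt) return 1;
  if (value >= zeroAt) return 0;
  return (zeroAt - value) / (zeroAt - fullAt);
}

const latencyScore = (seconds: number) => normalize(seconds, 8, 12);
const tokenScore = (tokens: number) => normalize(tokens, 2_900, 5_800); // assumed ceiling
const costScore = (usd: number) => normalize(usd, 0.012, 0.024);        // assumed ceiling
```

Under these assumptions, the suite-average latency of 5.86s and token usage of 2,573 both earn a full 1.0, while gpt-5-mini's 11.77s latency lands near the bottom of the latency band.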

Provider breakdown

Run b7cce22
Efficacy score: higher is better
Latency: lower is better
Token usage: lower is better
Cost: lower is better
Provider  | Model                         | Cases | Avg Score | Avg Latency | Avg Tokens | Avg Cost | Avg Cost / Min
anthropic | claude-sonnet-4-5             | 1     | 0.98      | 6.21s       | 3,247      | $0.012   | $0.0209/min
google    | gemini-2.5-flash              | 1     | 1.00      | 3.84s       | 1,685      | $0.0013  | $0.0022/min
google    | gemini-3-flash-preview        | 1     | 1.00      | 5.6s        | 2,746      | $0.002   | $0.0036/min
google    | gemini-3.1-flash-lite-preview | 1     | 1.00      | 3.4s        | 2,322      | $0.0008  | $0.0014/min
openai    | gpt-5-mini                    | 1     | 0.91      | 11.77s      | 3,517      | $0.0019  | $0.0034/min
openai    | gpt-5.1                       | 1     | 1.00      | 4.33s       | 1,918      | $0.0039  | $0.0067/min

Recent cases

Latest 6
anthropic · claude-sonnet-4-5 · Apr 3, 07:58 PM · Asset 88Lb01q
  Score 0.98 · Latency 6.21s · Cost $0.012
google · gemini-2.5-flash · Apr 3, 07:58 PM · Asset 88Lb01q
  Score 1.00 · Latency 3.84s · Cost $0.0013
google · gemini-3-flash-preview · Apr 3, 07:58 PM · Asset 88Lb01q
  Score 1.00 · Latency 5.6s · Cost $0.002
google · gemini-3.1-flash-lite-preview · Apr 3, 07:58 PM · Asset 88Lb01q
  Score 1.00 · Latency 3.4s · Cost $0.0008
openai · gpt-5-mini · Apr 3, 07:58 PM · Asset 88Lb01q
  Score 0.91 · Latency 11.77s · Cost $0.0019
openai · gpt-5.1 · Apr 3, 07:58 PM · Asset 88Lb01q
  Score 1.00 · Latency 4.33s · Cost $0.0039