Workflow Eval Detail

Ask Questions

Answers natural-language questions about a video by retrieving relevant context and answering with a concise response.
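For orientation, here is a minimal sketch of how a caller might drive this workflow. The askQuestions entry point, its options, and the response shape are illustrative assumptions for this page, not the documented @mux/ai API.

```typescript
// Hypothetical usage of the Ask Questions workflow. The import, function
// name, options, and response fields below are assumptions for illustration;
// consult the @mux/ai package for the real interface.
import { askQuestions } from "@mux/ai"; // assumed export

interface QuestionAnswer {
  question: string;
  answer: "yes" | "no"; // this suite scores yes/no outputs
  confidence: number;   // expected to fall between 0 and 1
  reasoning: string;    // must be non-empty to pass integrity checks
}

async function main(): Promise<void> {
  const result = await askQuestions({
    assetId: "88Lb01q", // the asset used in the latest run
    questions: [
      "Has on-screen text?",
      "Is the speaker visible?",
      "Scene indoors?",
    ],
  });

  // Response validation requires the asset ID and storyboard URL to be preserved.
  console.log(result.assetId, result.storyboardUrl);
  for (const qa of result.answers as QuestionAnswer[]) {
    console.log(`${qa.question} -> ${qa.answer} (confidence ${qa.confidence})`);
  }
}

main().catch(console.error);
```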

Latest run: completed
muxinc/ai · main @ d5b5d84 · @mux/ai v0.7.4
Cases: 5
Avg Score: 0.98
Avg Latency: 7.12s
Avg Cost: $0.0039
Avg Cost / Min: $0.0067/min
Avg Tokens: 2,332
TL;DR

All providers perform strongly on this workflow, with very high accuracy and low cost. OpenAI gpt-5.1 is the recommended default for quality and latency, pending confirmation on a larger sample: each provider was evaluated on a single case in this run.

Best Quality: openai gpt-5.1
Fastest: openai gpt-5.1
Most Economical: openai gpt-5-mini

What we measure

Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.

Efficacy: quality and correctness
Efficiency: latency and token usage
Expense: cost per request
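A sketch of the record each run could capture, with field names assumed from the metrics shown on this page:

```typescript
// Hypothetical shape of one eval-run record. Field names mirror the metrics
// on this page but are assumptions, not the @mux/ai eval schema.
interface EvalRunMetrics {
  efficacy: {
    score: number;          // 0-1: answer quality and correctness
  };
  efficiency: {
    latencySeconds: number; // wall-clock time per case
    totalTokens: number;    // usage data must include this for cost analysis
  };
  expense: {
    costUsd: number;          // estimated cost per request
    costUsdPerMinute: number; // cost normalized by video duration (assumed)
  };
}
```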

Workflow snapshot

Suite status: success
Suite average score: 0.98
Suite duration: 35.58s
Last suite run: Feb 18, 09:11 PM

Evaluation criteria

From eval tests

We score answer accuracy and response integrity while tracking latency, token usage, and cost.

Question set (3 checks)
  • Has on-screen text? Answer: Yes (confidence 0.94)
  • Is the speaker visible? Answer: No (confidence 0.88)
  • Scene indoors? Answer: Yes (confidence 0.91)
Accuracy: answers match expected yes/no outputs.
Format: response includes the required fields and shape.
Integrity: confidence and reasoning are present.
Throughput: latency and cost stay within targets.
Response validation
Efficacy checks
  • Answers match expected yes/no outputs.
  • Response includes required fields and answer structure.
  • Confidence is between 0 and 1 and reasoning strings are non-empty.
  • Response preserves asset ID and storyboard URL.
Efficiency targets (a scoring sketch follows this list)
  • Latency: scores are normalized between 0 and 1. Under 8s earns 1.0; past 12s trends toward 0.
  • Token usage: scores are normalized between 0 and 1. Under 2,900 tokens earns 1.0; higher usage reduces the score.
  • Usage data must include total tokens for cost analysis.
Expense guardrails
  • Estimated cost under $0.012 per request for full score.
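A minimal sketch of how these targets could be turned into normalized scores. The thresholds (8s/12s, 2,900 tokens, $0.012) come from the bullets above; the linear falloff and the token zero point are assumptions about how scores "trend toward 0".

```typescript
// Hypothetical scoring helpers for the efficiency and expense targets above.
// The interpolation shape and the token zero point are assumptions.

/** Linear ramp to [0, 1]: full score at or below `full`, zero at or above `zero`. */
function rampDown(value: number, full: number, zero: number): number {
  if (value <= full) return 1;
  if (value >= zero) return 0;
  return (zero - value) / (zero - full);
}

const latencyScore = (seconds: number) => rampDown(seconds, 8, 12);
const tokenScore = (tokens: number) => rampDown(tokens, 2_900, 5_800); // zero point assumed
const costScore = (usd: number) => (usd < 0.012 ? 1 : 0); // guardrail from above

// Worked example: the gpt-5-mini case from this run (13.2s, 3,216 tokens, $0.0015).
console.log(latencyScore(13.2)); // 0     -- past the 12s threshold under this ramp
console.log(tokenScore(3_216));  // ~0.89 -- over the 2,900-token target
console.log(costScore(0.0015));  // 1     -- well under the $0.012 guardrail
```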

Provider breakdown

Run d5b5d84
Higher efficacy score is better; lower latency, token usage, and cost are better.
Provider   Model                   Cases  Avg Score  Avg Latency  Avg Tokens  Avg Cost  Avg Cost / Min
anthropic  claude-sonnet-4-5       1      1          5.99s        2,856       $0.0109   $0.0189/min
google     gemini-2.5-flash        1      1          5.28s        1,504       $0.0015   $0.0027/min
google     gemini-3-flash-preview  1      1          7.56s        2,531       $0.0022   $0.0039/min
openai     gpt-5-mini              1      0.92       13.2s        3,216       $0.0015   $0.0027/min
openai     gpt-5.1                 1      1          3.55s        1,555       $0.0031   $0.0055/min
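These per-provider rows roll up directly into the suite figures above: the 0.98 suite average score is the mean of the five case scores ((1 + 1 + 1 + 0.92 + 1) / 5 = 0.984), the 7.12s average latency is the mean of the five case latencies, and the 35.58s suite duration is their sum.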

Recent cases

Latest 5, all run Feb 18, 09:12 PM against Asset 88Lb01q

Provider   Model                   Score  Latency  Cost
anthropic  claude-sonnet-4-5       1      5.99s    $0.0109
google     gemini-2.5-flash        1      5.28s    $0.0015
google     gemini-3-flash-preview  1      7.56s    $0.0022
openai     gpt-5-mini              0.92   13.2s    $0.0015
openai     gpt-5.1                 1      3.55s    $0.0031
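The gpt-5-mini case is the only one below a perfect score, which is consistent with the efficiency targets above: its 13.2s latency is past the 12s threshold and its 3,216 tokens exceed the 2,900-token target, even though its cost sits well inside the $0.012 guardrail.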