Workflow Eval Detail

Ask Questions

Answers natural-language questions about a video by retrieving relevant context and returning a concise response.

Latest Run: completed
muxinc/ai
main · c15880c · @mux/ai v0.22.0
Cases: 7
Avg Score: 0.95
Avg Latency: 5.36s
Avg Cost: $0.0036
Avg Cost / Min: $0.0063/min
Avg Tokens: 3,694
TL;DR

On this small 7-case suite (one case per model), every model scores at least 0.85, but Google's flash and flash-lite models offer the best overall tradeoff of quality, latency, and cost for Ask Questions.

Best Quality: google · gemini-2.5-flash
Fastest: google · gemini-3.1-flash-lite
Most Economical: google · gemini-3.1-flash-lite

What we measure

Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.

Efficacy: Quality + correctness
Efficiency: Latency + token usage
Expense: Cost per request

Workflow snapshot

Suite status: success
Suite average score: 0.95
Suite duration: 37.54s
Last suite run: May 18, 05:59 PM

Evaluation criteria

From eval tests

We score answer accuracy and response integrity while tracking latency, token usage, and cost.

Question set: 3 checks
Question: Has on-screen text?
Answer: Yes · Conf 0.94
Question: Is the speaker visible?
Answer: No · Conf 0.88
Question: Scene indoors?
Answer: Yes · Conf 0.91
Accuracy: Match expected yes/no outputs.
Format: Required fields + response shape.
Integrity: Confidence and reasoning present.
Throughput: Latency and cost stay within targets.
Response validation
Efficacy checks
  • Answers match expected yes/no outputs.
  • Response includes required fields and answer structure.
  • Confidence is 0-1 and reasoning strings are non-empty.
  • Response preserves asset ID and storyboard URL.
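
The efficacy checks above could be expressed as a small validator. This is a minimal sketch, not the actual eval code: the response shape, the field names (`assetId`, `storyboardUrl`, `answers`, `confidence`, `reasoning`), the storyboard URL, and the reasoning strings are all assumptions made for illustration. Only the questions, answers, confidences, and asset ID come from this page.

```typescript
// Hypothetical response shape; field names are assumptions, not the @mux/ai API.
interface QuestionAnswer {
  question: string;
  answer: string;
  confidence: number;
  reasoning: string;
}

interface AskQuestionsResponse {
  assetId: string;
  storyboardUrl: string;
  answers: QuestionAnswer[];
}

interface Expected {
  assetId: string;
  storyboardUrl: string;
  answers: string[]; // expected "yes" / "no" per question, in order
}

// Returns a list of failures; an empty list means every check passed.
function validateResponse(response: AskQuestionsResponse, expected: Expected): string[] {
  const failures: string[] = [];
  if (response.assetId !== expected.assetId) failures.push("asset ID not preserved");
  if (response.storyboardUrl !== expected.storyboardUrl) failures.push("storyboard URL not preserved");
  if (response.answers.length !== expected.answers.length) failures.push("unexpected number of answers");

  response.answers.forEach((item, i) => {
    const normalized = item.answer.trim().toLowerCase();
    if (normalized !== "yes" && normalized !== "no") {
      failures.push(`answer ${i} is not yes/no`);
    } else if (expected.answers[i] !== undefined && normalized !== expected.answers[i]) {
      failures.push(`answer ${i} does not match expected output`);
    }
    // The negated range test also rejects NaN.
    if (!(item.confidence >= 0 && item.confidence <= 1)) failures.push(`confidence ${i} outside 0-1`);
    if (item.reasoning.trim().length === 0) failures.push(`reasoning ${i} is empty`);
  });
  return failures;
}

// Sample built from the question set on this page; URL and reasoning are placeholders.
const sample: AskQuestionsResponse = {
  assetId: "88Lb01q",
  storyboardUrl: "https://example.com/storyboard.png",
  answers: [
    { question: "Has on-screen text?", answer: "Yes", confidence: 0.94, reasoning: "placeholder reasoning" },
    { question: "Is the speaker visible?", answer: "No", confidence: 0.88, reasoning: "placeholder reasoning" },
    { question: "Scene indoors?", answer: "Yes", confidence: 0.91, reasoning: "placeholder reasoning" },
  ],
};

const failures = validateResponse(sample, {
  assetId: "88Lb01q",
  storyboardUrl: "https://example.com/storyboard.png",
  answers: ["yes", "no", "yes"],
});
console.log(failures.length === 0 ? "all checks passed" : failures.join("; "));
```

Returning a list of failures rather than a single boolean keeps each check independently reportable, which is convenient when one response fails several checks at once.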
Efficiency targets
  • Latency: scores are normalized between 0 and 1. Under 8s earns 1.0; past 12s trends toward 0.
  • Token usage: scores are normalized between 0 and 1. Under 2,900 tokens earns 1.0; higher usage reduces the score.
  • Usage data must include total tokens for cost analysis.
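
One way to read the efficiency targets above is as clamped scoring functions. The thresholds (8s, 12s, 2,900 tokens) come from this page; the shape of the falloff does not. The linear ramp for latency and the `budget / tokens` decay for token usage are assumptions, as are the function names, and both functions treat a value exactly at the threshold as a full score.

```typescript
// Clamp a value into the [0, 1] range.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Latency score: 1.0 at or under 8s. The linear ramp down to 0 at 12s is an
// assumption; the page only says scores "trend toward 0" past 12s.
function latencyScore(seconds: number, fullScoreAt = 8, zeroAt = 12): number {
  return clamp01((zeroAt - seconds) / (zeroAt - fullScoreAt));
}

// Token score: 1.0 at or under 2,900 total tokens. The budget / tokens decay
// is an assumption; the page only says higher usage reduces the score.
function tokenScore(totalTokens: number, budget = 2900): number {
  return clamp01(budget / totalTokens);
}

console.log(latencyScore(3.73)); // under 8s, so full score
console.log(latencyScore(10));   // halfway between 8s and 12s under the assumed ramp
console.log(tokenScore(2881));   // under budget, so full score
console.log(tokenScore(5800));   // twice the budget under the assumed decay
```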
Expense guardrails
  • Estimated cost under $0.012 per request for full score.
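
The expense guardrail can be sketched the same way. Only the $0.012 full-score threshold comes from this page; the proportional falloff above the threshold and the function name `costScore` are assumptions, and a cost exactly at the threshold is treated as a full score.

```typescript
// Cost score: 1.0 when the estimated cost is at or under the threshold.
// Above it, the score decays as threshold / cost (an assumed falloff; the page
// only specifies the full-score threshold).
function costScore(costUsd: number, fullScoreUnder = 0.012): number {
  if (costUsd <= fullScoreUnder) return 1;
  return fullScoreUnder / costUsd;
}

console.log(costScore(0.0017)); // well under the threshold, so full score
console.log(costScore(0.0151)); // above the threshold, so reduced under the assumed falloff
```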

Provider breakdown

Run c15880c
Efficacy score: higher is better
Latency: lower is better
Token Usage: lower is better
Cost: lower is better
Provider | Model | Cases | Avg Score | Avg Latency | Avg Tokens | Avg Cost | Avg Cost / Min
anthropic | claude-sonnet-4-5 | 1 | 0.9 | 7.07s | 4,451 | $0.0151 | $0.0263/min
google | gemini-2.5-flash | 1 | 1 | 3.73s | 2,881 | $0.0017 | $0.0029/min
google | gemini-3-flash-preview | 1 | 0.96 | 4.61s | 3,796 | $0.0022 | $0.0039/min
google | gemini-3.1-flash-lite | 1 | 0.98 | 2.58s | 3,480 | $0.0011 | $0.0018/min
google | gemini-3.1-flash-lite-preview | 1 | 0.97 | 2.67s | 3,489 | $0.0011 | $0.0019/min
openai | gpt-5-mini | 1 | 0.85 | 13.4s | 4,750 | $0.0023 | $0.004/min
openai | gpt-5.1 | 1 | 1 | 3.48s | 3,013 | $0.0018 | $0.0032/min
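
The suite-level averages at the top of the page can be reproduced from the per-model rows above. The sketch below copies the figures from the table; the `ModelRow` type and the `mean` helper are illustrative names, not part of any library.

```typescript
interface ModelRow {
  provider: string;
  model: string;
  score: number;
  latencySeconds: number;
  tokens: number;
  costUsd: number;
}

// Values copied from the provider breakdown for run c15880c.
const rows: ModelRow[] = [
  { provider: "anthropic", model: "claude-sonnet-4-5", score: 0.9, latencySeconds: 7.07, tokens: 4451, costUsd: 0.0151 },
  { provider: "google", model: "gemini-2.5-flash", score: 1, latencySeconds: 3.73, tokens: 2881, costUsd: 0.0017 },
  { provider: "google", model: "gemini-3-flash-preview", score: 0.96, latencySeconds: 4.61, tokens: 3796, costUsd: 0.0022 },
  { provider: "google", model: "gemini-3.1-flash-lite", score: 0.98, latencySeconds: 2.58, tokens: 3480, costUsd: 0.0011 },
  { provider: "google", model: "gemini-3.1-flash-lite-preview", score: 0.97, latencySeconds: 2.67, tokens: 3489, costUsd: 0.0011 },
  { provider: "openai", model: "gpt-5-mini", score: 0.85, latencySeconds: 13.4, tokens: 4750, costUsd: 0.0023 },
  { provider: "openai", model: "gpt-5.1", score: 1, latencySeconds: 3.48, tokens: 3013, costUsd: 0.0018 },
];

// Arithmetic mean of a non-empty list.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const summary = {
  avgScore: mean(rows.map((r) => r.score)).toFixed(2),            // "0.95"
  avgLatency: mean(rows.map((r) => r.latencySeconds)).toFixed(2), // "5.36"
  avgTokens: Math.round(mean(rows.map((r) => r.tokens))),         // 3694
  avgCost: mean(rows.map((r) => r.costUsd)).toFixed(4),           // "0.0036"
};
console.log(summary);
```

The summed latencies also come to 37.54s, which matches the suite duration reported in the workflow snapshot.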

Recent cases

Latest 6
anthropic · claude-sonnet-4-5 · May 18, 06:01 PM
Asset 88Lb01q
Score: 0.9 · Latency: 7.07s · Cost: $0.0151
google · gemini-2.5-flash · May 18, 06:01 PM
Asset 88Lb01q
Score: 1 · Latency: 3.73s · Cost: $0.0017
google · gemini-3-flash-preview · May 18, 06:01 PM
Asset 88Lb01q
Score: 0.96 · Latency: 4.61s · Cost: $0.0022
google · gemini-3.1-flash-lite · May 18, 06:01 PM
Asset 88Lb01q
Score: 0.98 · Latency: 2.58s · Cost: $0.0011
google · gemini-3.1-flash-lite-preview · May 18, 06:01 PM
Asset 88Lb01q
Score: 0.97 · Latency: 2.67s · Cost: $0.0011
openai · gpt-5-mini · May 18, 06:01 PM
Asset 88Lb01q
Score: 0.85 · Latency: 13.4s · Cost: $0.0023