Workflow Eval Detail

Ask Questions

Answers natural-language questions about a video by retrieving relevant context and answering with a concise response.
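For orientation, here is a minimal sketch of how a caller might drive this workflow. The askQuestions entry point, its options, and the response shape are illustrative assumptions for this page, not the documented @mux/ai API.

```typescript
// Hypothetical usage of the Ask Questions workflow. The import, function
// name, options, and response fields below are assumptions for illustration;
// consult the @mux/ai package for the real interface.
import { askQuestions } from "@mux/ai"; // assumed export

interface QuestionAnswer {
  question: string;
  answer: "yes" | "no"; // this suite scores yes/no outputs
  confidence: number;   // expected to fall between 0 and 1
  reasoning: string;    // must be non-empty to pass integrity checks
}

async function main(): Promise<void> {
  const result = await askQuestions({
    assetId: "88Lb01q", // the asset used in the latest run
    questions: [
      "Has on-screen text?",
      "Is the speaker visible?",
      "Scene indoors?",
    ],
  });

  // Response validation requires the asset ID and storyboard URL to be preserved.
  console.log(result.assetId, result.storyboardUrl);
  for (const qa of result.answers as QuestionAnswer[]) {
    console.log(`${qa.question} -> ${qa.answer} (confidence ${qa.confidence})`);
  }
}

main().catch(console.error);
```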

Latest run: completed
muxinc/ai · main @ d5b5d84 · @mux/ai v0.7.4
Cases: 5
Avg Score: 0.98
Avg Latency: 7.12s
Avg Cost: $0.0039
Avg Cost / Min: $0.0067/min
Avg Tokens: 2,332
TL;DR

All providers perform strongly on this workflow, with very high accuracy and low cost. OpenAI gpt-5.1 is the recommended default for quality and latency, pending confirmation on a larger sample: each provider was evaluated on a single case in this run.

Best Quality: openai gpt-5.1
Fastest: openai gpt-5.1
Most Economical: openai gpt-5-mini

What we measure

Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.

Efficacy: quality and correctness
Efficiency: latency and token usage
Expense: cost per request
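A sketch of the record each run could capture, with field names assumed from the metrics shown on this page:

```typescript
// Hypothetical shape of one eval-run record. Field names mirror the metrics
// on this page but are assumptions, not the @mux/ai eval schema.
interface EvalRunMetrics {
  efficacy: {
    score: number;          // 0-1: answer quality and correctness
  };
  efficiency: {
    latencySeconds: number; // wall-clock time per case
    totalTokens: number;    // usage data must include this for cost analysis
  };
  expense: {
    costUsd: number;          // estimated cost per request
    costUsdPerMinute: number; // cost normalized by video duration (assumed)
  };
}
```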

Workflow snapshot

Suite status: success
Suite average score: 0.98
Suite duration: 35.58s
Last suite run: Feb 18, 09:11 PM

Evaluation criteria

From eval tests

We score answer accuracy and response integrity while tracking latency, token usage, and cost.

Question set (3 checks)
  • Has on-screen text? Answer: Yes (confidence 0.94)
  • Is the speaker visible? Answer: No (confidence 0.88)
  • Scene indoors? Answer: Yes (confidence 0.91)
Accuracy: answers match expected yes/no outputs.
Format: response includes the required fields and shape.
Integrity: confidence and reasoning are present.
Throughput: latency and cost stay within targets.
Response validation
Efficacy checks
  • Answers match expected yes/no outputs.
  • Response includes required fields and answer structure.
  • Confidence is between 0 and 1 and reasoning strings are non-empty.
  • Response preserves asset ID and storyboard URL.
Efficiency targets (a scoring sketch follows this list)
  • Latency: scores are normalized between 0 and 1. Under 8s earns 1.0; past 12s trends toward 0.
  • Token usage: scores are normalized between 0 and 1. Under 2,900 tokens earns 1.0; higher usage reduces the score.
  • Usage data must include total tokens for cost analysis.
Expense guardrails
  • Estimated cost under $0.012 per request for full score.
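A minimal sketch of how these targets could be turned into normalized scores. The thresholds (8s/12s, 2,900 tokens, $0.012) come from the bullets above; the linear falloff and the token zero point are assumptions about how scores "trend toward 0".

```typescript
// Hypothetical scoring helpers for the efficiency and expense targets above.
// The interpolation shape and the token zero point are assumptions.

/** Linear ramp to [0, 1]: full score at or below `full`, zero at or above `zero`. */
function rampDown(value: number, full: number, zero: number): number {
  if (value <= full) return 1;
  if (value >= zero) return 0;
  return (zero - value) / (zero - full);
}

const latencyScore = (seconds: number) => rampDown(seconds, 8, 12);
const tokenScore = (tokens: number) => rampDown(tokens, 2_900, 5_800); // zero point assumed
const costScore = (usd: number) => (usd < 0.012 ? 1 : 0); // guardrail from above

// Worked example: the gpt-5-mini case from this run (13.2s, 3,216 tokens, $0.0015).
console.log(latencyScore(13.2)); // 0     -- past the 12s threshold under this ramp
console.log(tokenScore(3_216));  // ~0.89 -- over the 2,900-token target
console.log(costScore(0.0015));  // 1     -- well under the $0.012 guardrail
```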

Provider breakdown

Run d5b5d84
Higher efficacy score is better; lower latency, token usage, and cost are better.
Provider   Model                   Cases  Avg Score  Avg Latency  Avg Tokens  Avg Cost  Avg Cost / Min
anthropic  claude-sonnet-4-5       1      1          5.99s        2,856       $0.0109   $0.0189/min
google     gemini-2.5-flash        1      1          5.28s        1,504       $0.0015   $0.0027/min
google     gemini-3-flash-preview  1      1          7.56s        2,531       $0.0022   $0.0039/min
openai     gpt-5-mini              1      0.92       13.2s        3,216       $0.0015   $0.0027/min
openai     gpt-5.1                 1      1          3.55s        1,555       $0.0031   $0.0055/min
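These per-provider rows roll up directly into the suite figures above: the 0.98 suite average score is the mean of the five case scores ((1 + 1 + 1 + 0.92 + 1) / 5 = 0.984), the 7.12s average latency is the mean of the five case latencies, and the 35.58s suite duration is their sum.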

Recent cases

Latest 5, all run Feb 18, 09:12 PM against Asset 88Lb01q

Provider   Model                   Score  Latency  Cost
anthropic  claude-sonnet-4-5       1      5.99s    $0.0109
google     gemini-2.5-flash        1      5.28s    $0.0015
google     gemini-3-flash-preview  1      7.56s    $0.0022
openai     gpt-5-mini              0.92   13.2s    $0.0015
openai     gpt-5.1                 1      3.55s    $0.0031
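The gpt-5-mini case is the only one below a perfect score, which is consistent with the efficiency targets above: its 13.2s latency is past the 12s threshold and its 3,216 tokens exceed the 2,900-token target, even though its cost sits well inside the $0.012 guardrail.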