Workflow Eval Detail

Summarization

Generates concise summaries and smart tags from your content for search, discovery, and quick recaps.

Latest Run: completed
muxinc/ai
main · c15880c · @mux/ai v0.22.0
Cases: 27
Avg Score: 0.97
Avg Latency: 6.89s
Avg Cost: $0.0042
Avg Cost / Min: $0.0079/min
Avg Tokens: 3,495
TL;DR

High-quality, low-cost summarization across providers, with Anthropic best on quality and Google best on speed and cost. Conclusions are tentative given only 27 cases across 7 models.

Best Quality: anthropic · claude-sonnet-4-5
Fastest: google · gemini-3.1-flash-lite-preview
Most Economical: google · gemini-3.1-flash-lite

What we measure

Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.

Efficacy: Quality + correctness
Efficiency: Latency + token usage
Expense: Cost per request

Workflow snapshot

Suite status: success
Suite average score: 0.95
Suite duration: 55.56s
Last suite run: May 18, 05:58 PM

Evaluation criteria

From eval tests

We score summary quality, tag relevance, and semantic similarity while tracking latency, token usage, and cost.

Efficacy checks
  • Title is non-empty, <=100 chars, and avoids filler starters.
  • Description is non-empty, <=1000 chars, and avoids meta phrases.
  • Tags are non-empty strings, unique, and <=10 items.
  • Title, description, and tags are semantically similar to references.
  • Response includes asset ID and HTTPS storyboard URL.
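
The structural part of the efficacy checks above could be sketched roughly as follows. This is a minimal illustration, not the eval suite's actual code: the `SummaryResult` shape, the `efficacy_failures` helper, and both phrase lists are hypothetical, since the source does not enumerate the filler starters or meta phrases. The semantic-similarity check is left out because it depends on an embedding or grading model the source does not specify.

```python
from dataclasses import dataclass

# Hypothetical phrase lists: the source does not say which phrases are rejected.
FILLER_STARTERS = ("in this video", "this video", "a video of")
META_PHRASES = ("this summary", "the transcript", "as an ai")


@dataclass
class SummaryResult:
    """Assumed response shape, for illustration only."""
    asset_id: str
    title: str
    description: str
    tags: list[str]
    storyboard_url: str


def efficacy_failures(result: SummaryResult) -> list[str]:
    """Return the names of failed structural checks (empty list means all passed)."""
    failures = []

    title = result.title.strip()
    if not title or len(title) > 100:
        failures.append("title length")
    if title.lower().startswith(FILLER_STARTERS):
        failures.append("title filler starter")

    description = result.description.strip()
    if not description or len(description) > 1000:
        failures.append("description length")
    if any(phrase in description.lower() for phrase in META_PHRASES):
        failures.append("description meta phrase")

    tags = result.tags
    if len(tags) > 10:
        failures.append("too many tags")
    if any(not isinstance(tag, str) or not tag.strip() for tag in tags):
        failures.append("empty tag")
    if len(set(tags)) != len(tags):
        failures.append("duplicate tags")

    if not result.asset_id:
        failures.append("missing asset id")
    if not result.storyboard_url.startswith("https://"):
        failures.append("storyboard url not https")

    return failures
```

Returning the names of the failed checks, rather than a single boolean, makes it easier to see which rule a given provider response tripped.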
Efficiency targets
  • Latency: scores are normalized between 0 and 1. Under 8s earns 1.0; past 20s trends toward 0.
  • Token usage: scores are normalized between 0 and 1. Under 4,000 tokens earns 1.0; higher usage reduces the score.
  • Usage data must include input and output tokens > 0.
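
One way to read the efficiency targets above is as a clamped linear falloff. The sketch below assumes linear interpolation between the stated thresholds; the source only says scores "trend toward 0" past 20s and that higher token usage "reduces the score", so the exact curve and the 8,000-token zero point are assumptions, and all function names are hypothetical.

```python
def linear_falloff(value: float, full: float, zero: float) -> float:
    """Score 1.0 at or below `full`, 0.0 at or above `zero`, linear in between."""
    if value <= full:
        return 1.0
    if value >= zero:
        return 0.0
    return (zero - value) / (zero - full)


def latency_score(seconds: float) -> float:
    # Thresholds from the text: under 8s earns 1.0, past 20s trends toward 0.
    return linear_falloff(seconds, full=8.0, zero=20.0)


def token_score(total_tokens: int, zero_at: int = 8000) -> float:
    # 4,000 comes from the text; `zero_at` is an assumed upper bound.
    return linear_falloff(total_tokens, full=4000, zero=zero_at)


def usage_is_valid(input_tokens: int, output_tokens: int) -> bool:
    # Usage data must report both input and output tokens greater than zero.
    return input_tokens > 0 and output_tokens > 0
```

Under these assumptions a 14s response would score 0.5 on latency, halfway between the two thresholds.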
Expense guardrails
  • Estimated cost under $0.015 per request for full score.
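
The expense guardrail can be sketched in the same spirit. Only the $0.015 threshold comes from the text; how the score degrades above it is not stated, so the linear decay to zero at twice the budget is an assumption, and `cost_score` is a hypothetical name.

```python
def cost_score(cost_usd: float, budget: float = 0.015, zero_at: float = 0.030) -> float:
    """Full score at or under `budget`; assumed linear decay to 0 at `zero_at`."""
    if cost_usd <= budget:
        return 1.0
    if cost_usd >= zero_at:
        return 0.0
    return (zero_at - cost_usd) / (zero_at - budget)
```

For example, a request estimated at $0.0132 stays inside the guardrail and would keep a full score of 1.0.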

Provider breakdown

Run c15880c
Efficacy score: higher is better. Latency, token usage, and cost: lower is better.

Provider   Model                          Cases  Avg Score  Avg Latency  Avg Tokens  Avg Cost  Avg Cost / Min
anthropic  claude-sonnet-4-5              4      0.99       6.06s        3,838       $0.0132   $0.0229/min
google     gemini-2.5-flash               4      0.99       7.41s        3,041       $0.0029   $0.0064/min
google     gemini-3-flash-preview         4      0.97       7.82s        4,339       $0.005    $0.011/min
google     gemini-3.1-flash-lite          4      0.99       2.37s        2,986       $0.0009   $0.0016/min
google     gemini-3.1-flash-lite-preview  4      0.99       2.69s        2,985       $0.0009   $0.0016/min
openai     gpt-5-mini                     4      0.92       16.48s       4,522       $0.0024   $0.0047/min
openai     gpt-5.1                        3      0.99       4.91s        2,509       $0.0036   $0.0072/min
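
As a sanity check on how the per-model rows relate to the headline numbers, the sketch below computes case-weighted means of latency and tokens from the breakdown. The results line up with the reported Avg Latency (6.89s) and Avg Tokens (3,495), which suggests, but does not confirm, that the run-level figures are averaged per case. The `weighted_mean` helper is hypothetical.

```python
# (provider, model, cases, avg_latency_s, avg_tokens), copied from the breakdown.
ROWS = [
    ("anthropic", "claude-sonnet-4-5", 4, 6.06, 3838),
    ("google", "gemini-2.5-flash", 4, 7.41, 3041),
    ("google", "gemini-3-flash-preview", 4, 7.82, 4339),
    ("google", "gemini-3.1-flash-lite", 4, 2.37, 2986),
    ("google", "gemini-3.1-flash-lite-preview", 4, 2.69, 2985),
    ("openai", "gpt-5-mini", 4, 16.48, 4522),
    ("openai", "gpt-5.1", 3, 4.91, 2509),
]


def weighted_mean(rows, column: int) -> float:
    """Mean of a per-model average, weighted by each model's case count."""
    total_cases = sum(row[2] for row in rows)
    return sum(row[2] * row[column] for row in rows) / total_cases


total_cases = sum(row[2] for row in ROWS)   # 27
avg_latency = weighted_mean(ROWS, 3)        # about 6.89 seconds
avg_tokens = weighted_mean(ROWS, 4)         # about 3,495 tokens
```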

Recent cases

Latest 6
Provider   Model              Time              Asset    Score  Latency  Cost
anthropic  claude-sonnet-4-5  May 18, 06:01 PM  88Lb01q  1      6.17s    $0.0132
anthropic  claude-sonnet-4-5  May 18, 06:01 PM  88Lb01q  1      5.02s    $0.0129
anthropic  claude-sonnet-4-5  May 18, 06:01 PM  88Lb01q  1      5.94s    $0.0133
google     gemini-2.5-flash   May 18, 06:01 PM  88Lb01q  0.95   7.82s    $0.0037
google     gemini-2.5-flash   May 18, 06:01 PM  88Lb01q  1      6.81s    $0.0025
anthropic  claude-sonnet-4-5  May 18, 06:01 PM  88Lb01q  0.98   7.12s    $0.0131