Workflow Eval Detail

Summarization

Generates concise summaries and smart tags from your content—perfect for search, discovery, and quick recaps.
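Concretely, each case produces a structured record. The shape below is an illustrative sketch, not the actual @mux/ai API; the fields mirror what the efficacy checks further down this page validate (title, description, tags, asset ID, storyboard URL).

```typescript
// Illustrative output shape -- field names are assumptions, not the
// real @mux/ai types. The fields mirror the efficacy checks below.
interface SummarizationResult {
  assetId: string;        // Mux asset the summary was generated for
  title: string;          // non-empty, <= 100 chars, no filler starters
  description: string;    // non-empty, <= 1000 chars, no meta phrases
  tags: string[];         // non-empty strings, unique, <= 10 items
  storyboardUrl: string;  // HTTPS URL returned with the response
}
```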

Latest Run: completed
muxinc/ai
main · d5b5d84 · @mux/ai v0.7.4
Cases: 5
Avg Score: 0.94
Avg Latency: 8.81s
Avg Cost: $0.004
Avg Cost / Min: $0.007/min
Avg Tokens: 2,705
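The page doesn't define Avg Cost / Min, but if it is the request cost normalized by the asset's media duration (an assumption), the two cost figures above imply roughly 34 seconds of media per case:

```typescript
// Assumed relationship: costPerMin = cost / mediaMinutes.
const avgCost = 0.004;        // $ per request (this run)
const avgCostPerMin = 0.007;  // $ per media minute (this run)
const impliedMediaMinutes = avgCost / avgCostPerMin; // ≈ 0.57 min (~34s)
```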
TL;DR

Anthropic leads on summarization quality, OpenAI gpt-5.1 on latency, and OpenAI gpt-5-mini on cost, but these findings rest on a single run of 5 cases (one per provider model) and should be treated as directional.

Best Quality: anthropic · claude-sonnet-4-5
Fastest: openai · gpt-5.1
Most Economical: openai · gpt-5-mini

What we measure

Each eval run captures efficacy, efficiency, and expense. We use this data to compare providers and track regressions over time.

Efficacy: quality + correctness
Efficiency: latency + token usage
Expense: cost per request
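As a sketch of how a run can be aggregated along these dimensions (the record shape and helper are assumptions, not the eval harness's actual code), each case reduces to one value per dimension, then averages per provider to produce the breakdown table below:

```typescript
// Minimal sketch: one record per case, averaged per provider.
interface CaseMetrics {
  provider: string;
  efficacy: number;  // quality + correctness, 0..1
  latencyS: number;  // efficiency: wall-clock seconds
  tokens: number;    // efficiency: input + output tokens
  costUsd: number;   // expense: estimated $ per request
}

function averageByProvider(cases: CaseMetrics[]) {
  const groups = new Map<string, CaseMetrics[]>();
  for (const c of cases) {
    const bucket = groups.get(c.provider) ?? [];
    bucket.push(c);
    groups.set(c.provider, bucket);
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return [...groups].map(([provider, group]) => ({
    provider,
    avgScore: mean(group.map(c => c.efficacy)),
    avgLatencyS: mean(group.map(c => c.latencyS)),
    avgTokens: mean(group.map(c => c.tokens)),
    avgCostUsd: mean(group.map(c => c.costUsd)),
  }));
}
```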

Workflow snapshot

Suite status: success
Suite average score: 0.94
Suite duration: 45.56s
Last suite run: Feb 18, 09:10 PM

Evaluation criteria

From eval tests

We score summary quality, tag relevance, and semantic similarity while tracking latency, token usage, and cost.
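The semantic-similarity part is commonly computed by embedding the candidate and reference text and taking cosine similarity. The generic sketch below shows the comparison step only; it is not confirmed as this suite's exact method, and the embedding call itself is omitted.

```typescript
// Generic cosine similarity between two embedding vectors -- a common
// way to score "semantically similar to references". Not confirmed as
// this suite's exact implementation.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```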

[Demo: transcript excerpts ("...and so when we look at the numbers...", "...the growth trajectory has been...") are analyzed via semantic extraction into an AI summary of the 8:00 asset, with 6 tags extracted.]
Efficacy checks
  • Title is non-empty, <=100 chars, and avoids filler starters.
  • Description is non-empty, <=1000 chars, and avoids meta phrases.
  • Tags are non-empty strings, unique, and <=10 items.
  • Title, description, and tags are semantically similar to references.
  • Response includes asset ID and HTTPS storyboard URL.
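A minimal pass/fail sketch of these checks, reusing the SummarizationResult shape from the top of the page. The filler-starter and meta-phrase lists are assumed placeholders (the suite's actual lists aren't shown), and the semantic-similarity check is scored separately, as sketched earlier.

```typescript
// Sketch only: the real suite scores these checks rather than gating
// on them; the word lists below are assumed examples, not the real ones.
const FILLER_STARTERS: string[] = ["In this video"]; // assumed example
const META_PHRASES: string[] = ["this transcript"];  // assumed example

function passesEfficacyChecks(r: SummarizationResult): boolean {
  const titleOk =
    r.title.length > 0 &&
    r.title.length <= 100 &&
    !FILLER_STARTERS.some(s => r.title.startsWith(s));
  const descriptionOk =
    r.description.length > 0 &&
    r.description.length <= 1000 &&
    !META_PHRASES.some(p => r.description.toLowerCase().includes(p));
  const tagsOk =
    r.tags.length > 0 &&
    r.tags.length <= 10 &&
    r.tags.every(t => t.trim().length > 0) &&
    new Set(r.tags).size === r.tags.length;
  const responseOk =
    r.assetId.length > 0 && r.storyboardUrl.startsWith("https://");
  return titleOk && descriptionOk && tagsOk && responseOk;
}
```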
Efficiency targets
  • Latency: scores are normalized between 0 and 1. Under 8s earns 1.0; past 20s trends toward 0.
  • Token usage: scores are normalized between 0 and 1. Under 4,000 tokens earns 1.0; higher usage reduces the score.
  • Usage data must include input and output tokens > 0.
Expense guardrails
  • Estimated cost under $0.015 per request for full score.
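Assuming a linear ramp between the stated endpoints (the criteria give thresholds but not the curve, and the token and cost fall-off ceilings below are also assumptions), the efficiency and expense scores can be sketched as:

```typescript
// Linear ramp: 1.0 at/below `fullAt`, 0.0 at/above `zeroAt`.
// The 8s/20s, 4,000-token, and $0.015 endpoints come from the criteria
// above; the linear shape and the 12k-token / $0.045 ceilings are assumed.
function rampDown(value: number, fullAt: number, zeroAt: number): number {
  if (value <= fullAt) return 1;
  if (value >= zeroAt) return 0;
  return (zeroAt - value) / (zeroAt - fullAt);
}

const latencyScore = (seconds: number) => rampDown(seconds, 8, 20);
const tokenScore = (tokens: number) => rampDown(tokens, 4_000, 12_000);
const costScore = (usd: number) => rampDown(usd, 0.015, 0.045);

// Guard from the criteria: usage data must report input and output tokens > 0.
const usageValid = (inputTokens: number, outputTokens: number) =>
  inputTokens > 0 && outputTokens > 0;
```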

Provider breakdown

Run d5b5d84 (higher is better for efficacy score; lower is better for latency, token usage, and cost)

Provider   Model                   Cases  Avg Score  Avg Latency  Avg Tokens  Avg Cost  Avg Cost / Min
anthropic  claude-sonnet-4-5       1      0.97       6.71s        3,019       $0.0109   $0.019/min
google     gemini-2.5-flash        1      0.96       8.31s        2,408       $0.0032   $0.0056/min
google     gemini-3-flash-preview  1      0.95       7.34s        2,793       $0.0023   $0.004/min
openai     gpt-5-mini              1      0.89       16.58s       3,478       $0.0016   $0.0029/min
openai     gpt-5.1                 1      0.95       5.09s        1,828       $0.0021   $0.0036/min

Recent cases

Latest 6
Provider   Model                   When              Asset    Score  Latency  Cost
anthropic  claude-sonnet-4-5       Feb 18, 09:12 PM  88Lb01q  0.97   6.71s    $0.0109
google     gemini-2.5-flash        Feb 18, 09:12 PM  88Lb01q  0.96   8.31s    $0.0032
google     gemini-3-flash-preview  Feb 18, 09:12 PM  88Lb01q  0.95   7.34s    $0.0023
openai     gpt-5-mini              Feb 18, 09:12 PM  88Lb01q  0.89   16.58s   $0.0016
openai     gpt-5.1                 Feb 18, 09:12 PM  88Lb01q  0.95   5.09s    $0.0021