About These Evals

How our team evaluates the Mux AI SDK in production-like conditions.

We treat evals as part of shipping. Each new workflow in @mux/ai launches with coverage that measures quality, speed, and cost, so teams can trust the defaults we recommend.

The 3 Es Framework

A calm, measurable lens for every eval.

  1. Efficacy: Does it work correctly?
  2. Efficiency: How fast and scalable is it?
  3. Expense: What does it cost?

Efficacy

Does it work correctly?

  • Accuracy and output quality on real inputs
  • Schema-compliant, reliable formatting
  • Guardrails for common failure modes
  • Side-by-side provider comparison

Efficiency

How fast and scalable is it?

  • Token consumption and efficiency budgets
  • Wall clock latency at the workflow level
  • Performance against clear thresholds

Expense

What does it cost?

  • Estimated USD per request
  • Cost comparison across providers
  • Opportunities for prompt optimization

How we apply the framework

Start with objective signals

We always measure efficiency and expense first, even while efficacy scoring is still evolving.

Build efficacy over time

Ground truth, thresholds, and test sets improve with usage. We iterate until the scores match real quality.

Foundational model coverage

For core models from OpenAI, Anthropic, and Google, we target all three Es from day one.

Eval structure

Our evals pair real inputs with expected outputs, then score the workflow against the 3 Es with consistent, repeatable metrics.

  1. Test data with inputs and ground truth expectations.
  2. Task function that runs the workflow and reports traces.
  3. Scorers for efficacy, efficiency, and expense.
evalite("Workflow Name", {
  data: [{ input, expected }],
  task: async (input) => { /* run workflow */ },
  scorers: [
    { name: "accuracy", scorer: ({ output, expected }) => 0.9 },
  ],
});

Reporting pipeline

Automated reporting

We run evals on pull requests, then re-run on merge to main. The main-branch run is what we publish: JSON results are ingested, insights extracted, and acceptance criteria applied to recommend the best models.
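
In rough terms, the acceptance step filters main-branch results by threshold and ranks what remains; the field names and thresholds in this sketch are illustrative assumptions, not our exact criteria.

// Illustrative sketch: apply acceptance criteria to per-model results
type ModelResult = { model: string; accuracy: number; p95LatencyMs: number; costUsd: number };

function recommendModel(results: ModelResult[]): ModelResult | undefined {
  return results
    .filter((r) => r.accuracy >= 0.8 && r.p95LatencyMs <= 10_000) // acceptance criteria
    .sort((a, b) => b.accuracy - a.accuracy || a.costUsd - b.costUsd) // quality first, then cost
    .at(0);
}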

Published insights

Those recommendations and metrics are published to our database and surface directly in this dashboard for each workflow.

Transparent by design

Sharing these evals publicly is our way of earning trust. You can see how we validate quality, speed, and cost so the workflows we ship deliver real impact inside production video pipelines.

Efficacy scorers

  • Detection or classification accuracy
  • Confidence calibration
  • Response integrity
  • Semantic similarity
  • No filler phrases
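
To make one of these concrete, here is a minimal sketch of a "no filler phrases" scorer in the same shape as the snippet above; the phrase list and scoring rule are illustrative assumptions, not the exact checks we ship.

// Illustrative sketch: penalize boilerplate phrases that add no information
const FILLER_PHRASES = ["as an ai", "i hope this helps", "in conclusion"];

const noFillerPhrases = {
  name: "no-filler-phrases",
  scorer: ({ output }: { output: string }) => {
    const text = output.toLowerCase();
    const hits = FILLER_PHRASES.filter((phrase) => text.includes(phrase)).length;
    // Full credit when no filler appears, scaled down per offending phrase
    return Math.max(0, 1 - hits / FILLER_PHRASES.length);
  },
};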

Efficiency scorers

  • Latency performance
  • Token efficiency
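
As an example of the latency check, a scorer can compare measured wall-clock time against a per-workflow budget; the 5-second budget and the durationMs field below are illustrative assumptions.

// Illustrative sketch: score wall-clock latency against an assumed budget
const LATENCY_BUDGET_MS = 5_000;

const latencyPerformance = {
  name: "latency-performance",
  scorer: ({ output }: { output: { durationMs: number } }) =>
    // 1 when the run is within budget, decaying toward 0 as it exceeds the budget
    Math.min(1, LATENCY_BUDGET_MS / Math.max(output.durationMs, 1)),
};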

Expense scorers

  • Usage data present
  • Cost within budget
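
A minimal cost-within-budget scorer might look like the sketch below; the budget figure and the estimatedCostUsd field are illustrative assumptions.

// Illustrative sketch: pass/fail check that estimated spend stays within budget
const COST_BUDGET_USD = 0.01;

const costWithinBudget = {
  name: "cost-within-budget",
  scorer: ({ output }: { output: { estimatedCostUsd: number } }) =>
    output.estimatedCostUsd <= COST_BUDGET_USD ? 1 : 0,
};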

Cross-provider testing

We run every eval across providers to compare quality, latency, and token usage side by side.

// Fan each test asset out across every provider so results are directly comparable
const providers = ["openai", "anthropic", "google"];
const data = providers.flatMap(provider =>
  testAssets.map(asset => ({ input: { assetId: asset.id, provider } })),
);

This helps reduce bias in our default recommendations by grounding decisions in comparable data across providers. It keeps our suggested workflows focused on measurable impact rather than preference.

Model pricing

We estimate costs using default model pricing and verify these numbers regularly against provider docs.

Provider     Model                    Input per 1M tokens    Output per 1M tokens
OpenAI       gpt-5.1                  $1.25                  $10.00
Anthropic    claude-sonnet-4-5        $3.00                  $15.00
Google       gemini-3-flash-preview   $0.50                  $3.00
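
Given per-token prices like those above, the per-request estimate is a simple calculation; the pricing map mirrors the table, while the helper name and usage shape are illustrative.

// Prices from the table above, in USD per 1M tokens
const PRICING: Record<string, { inputPer1M: number; outputPer1M: number }> = {
  "gpt-5.1": { inputPer1M: 1.25, outputPer1M: 10.0 },
  "claude-sonnet-4-5": { inputPer1M: 3.0, outputPer1M: 15.0 },
  "gemini-3-flash-preview": { inputPer1M: 0.5, outputPer1M: 3.0 },
};

// Illustrative helper: estimate USD for one request from its token usage
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const { inputPer1M, outputPer1M } = PRICING[model];
  return (inputTokens / 1_000_000) * inputPer1M + (outputTokens / 1_000_000) * outputPer1M;
}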

Resources