About These Evals

How our team evaluates the Mux AI SDK in production-like conditions.

We treat evals as part of shipping. Each new workflow in @mux/ai launches with coverage that measures quality, speed, and cost, so teams can trust the defaults we recommend.

The 3 Es Framework

A calm, measurable lens for every eval.

  1. Efficacy: Does it work correctly?
  2. Efficiency: How fast and scalable is it?
  3. Expense: What does it cost?

Efficacy

Does it work correctly?

  • Accuracy and output quality on real inputs
  • Schema-compliant, reliable formatting
  • Guardrails for common failure modes
  • Side-by-side provider comparison

Efficiency

How fast and scalable is it?

  • Token consumption and efficiency budgets
  • Wall clock latency at the workflow level
  • Performance against clear thresholds

Expense

What does it cost?

  • Estimated USD per request
  • Cost comparison across providers
  • Opportunities for prompt optimization

How we apply the framework

Start with objective signals

We always measure efficiency and expense first, even while efficacy scoring is still evolving.

Build efficacy over time

Ground truth, thresholds, and test sets improve with usage. We iterate until the scores match real quality.

Foundational model coverage

For core models from OpenAI, Anthropic, and Google, we target all three Es from day one.

Eval structure

Our evals pair real inputs with expected outputs, then score the workflow against the 3 Es with consistent, repeatable metrics.

  1. Test data with inputs and ground truth expectations.
  2. Task function that runs the workflow and reports traces.
  3. Scorers for efficacy, efficiency, and expense.
evalite("Workflow Name", {
  data: [{ input, expected }],
  task: async (input) => { /* run workflow */ },
  scorers: [
    { name: "accuracy", scorer: ({ output, expected }) => 0.9 },
  ],
});

Reporting pipeline

Automated reporting

We run evals on pull requests, then re-run on merge to main. The main-branch run is what we publish: JSON results are ingested, insights extracted, and acceptance criteria applied to recommend the best models.
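
In rough terms, the acceptance step filters main-branch results by threshold and ranks what remains; the field names and thresholds in this sketch are illustrative assumptions, not our exact criteria.

// Illustrative sketch: apply acceptance criteria to per-model results
type ModelResult = { model: string; accuracy: number; p95LatencyMs: number; costUsd: number };

function recommendModel(results: ModelResult[]): ModelResult | undefined {
  return results
    .filter((r) => r.accuracy >= 0.8 && r.p95LatencyMs <= 10_000) // acceptance criteria
    .sort((a, b) => b.accuracy - a.accuracy || a.costUsd - b.costUsd) // quality first, then cost
    .at(0);
}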

Published insights

Those recommendations and metrics are published to our database and surface directly in this dashboard for each workflow.

Transparent by design

Sharing these evals publicly is our way of earning trust. You can see how we validate quality, speed, and cost so the workflows we ship deliver real impact inside production video pipelines.

Efficacy scorers

  • Detection or classification accuracy
  • Confidence calibration
  • Response integrity
  • Semantic similarity
  • No filler phrases
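
To make one of these concrete, here is a minimal sketch of a "no filler phrases" scorer in the same shape as the snippet above; the phrase list and scoring rule are illustrative assumptions, not the exact checks we ship.

// Illustrative sketch: penalize boilerplate phrases that add no information
const FILLER_PHRASES = ["as an ai", "i hope this helps", "in conclusion"];

const noFillerPhrases = {
  name: "no-filler-phrases",
  scorer: ({ output }: { output: string }) => {
    const text = output.toLowerCase();
    const hits = FILLER_PHRASES.filter((phrase) => text.includes(phrase)).length;
    // Full credit when no filler appears, scaled down per offending phrase
    return Math.max(0, 1 - hits / FILLER_PHRASES.length);
  },
};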

Efficiency scorers

  • Latency performance
  • Token efficiency
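
As an example of the latency check, a scorer can compare measured wall-clock time against a per-workflow budget; the 5-second budget and the durationMs field below are illustrative assumptions.

// Illustrative sketch: score wall-clock latency against an assumed budget
const LATENCY_BUDGET_MS = 5_000;

const latencyPerformance = {
  name: "latency-performance",
  scorer: ({ output }: { output: { durationMs: number } }) =>
    // 1 when the run is within budget, decaying toward 0 as it exceeds the budget
    Math.min(1, LATENCY_BUDGET_MS / Math.max(output.durationMs, 1)),
};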

Expense scorers

  • Usage data present
  • Cost within budget
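
A minimal cost-within-budget scorer might look like the sketch below; the budget figure and the estimatedCostUsd field are illustrative assumptions.

// Illustrative sketch: pass/fail check that estimated spend stays within budget
const COST_BUDGET_USD = 0.01;

const costWithinBudget = {
  name: "cost-within-budget",
  scorer: ({ output }: { output: { estimatedCostUsd: number } }) =>
    output.estimatedCostUsd <= COST_BUDGET_USD ? 1 : 0,
};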

Cross-provider testing

We run every eval across providers to compare quality, latency, and token usage side by side.

// Fan each test asset out across every provider so results are directly comparable
const providers = ["openai", "anthropic", "google"];
const data = providers.flatMap(provider =>
  testAssets.map(asset => ({ input: { assetId: asset.id, provider } })),
);

This helps reduce bias in our default recommendations by grounding decisions in comparable data across providers. It keeps our suggested workflows focused on measurable impact rather than preference.

Model pricing

We estimate costs using default model pricing and verify these numbers regularly against provider docs.

Provider     Model                    Input per 1M tokens    Output per 1M tokens
OpenAI       gpt-5.1                  $1.25                  $10.00
Anthropic    claude-sonnet-4-5        $3.00                  $15.00
Google       gemini-3-flash-preview   $0.50                  $3.00
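
Given per-token prices like those above, the per-request estimate is a simple calculation; the pricing map mirrors the table, while the helper name and usage shape are illustrative.

// Prices from the table above, in USD per 1M tokens
const PRICING: Record<string, { inputPer1M: number; outputPer1M: number }> = {
  "gpt-5.1": { inputPer1M: 1.25, outputPer1M: 10.0 },
  "claude-sonnet-4-5": { inputPer1M: 3.0, outputPer1M: 15.0 },
  "gemini-3-flash-preview": { inputPer1M: 0.5, outputPer1M: 3.0 },
};

// Illustrative helper: estimate USD for one request from its token usage
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const { inputPer1M, outputPer1M } = PRICING[model];
  return (inputTokens / 1_000_000) * inputPer1M + (outputTokens / 1_000_000) * outputPer1M;
}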

Resources