Efficacy
Does it work correctly?
- Accuracy and output quality on real inputs
- Schema-compliant, reliable formatting
- Guardrails for common failure modes
- Side-by-side provider comparison
About These Evals
We treat evals as part of shipping. Each new workflow in @mux/ai launches with coverage that measures quality, speed, and cost, so teams can trust the defaults we recommend.
Every workflow is scored against three Es:
- Efficacy: Does it work correctly?
- Efficiency: How fast and scalable is it?
- Expense: What does it cost?
We always measure efficiency and expense first, even while efficacy scoring is still evolving.
Ground truth, thresholds, and test sets improve with usage. We iterate until the scores match real quality.
For core models from OpenAI, Anthropic, and Google, we target all three Es from day one.
Our evals pair real inputs with expected outputs, then score each workflow against the three Es with consistent, repeatable metrics.
evalite("Workflow Name", {
data: [{ input, expected }],
task: async (input) => { /* run workflow */ },
scorers: [
{ name: "accuracy", scorer: ({ output, expected }) => 0.9 },
],
});We run evals on pull requests, then re-run on merge to main. The main-branch run is what we publish: JSON results are ingested, insights extracted, and acceptance criteria applied to recommend the best models.
Those recommendations and metrics are published to our database and surface directly in this dashboard for each workflow.
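As a rough illustration of that last step, here is a minimal sketch of how acceptance criteria might be applied to the ingested results. The `EvalResult` and `AcceptanceCriteria` shapes, field names, and thresholds are placeholders for illustration, not our published schema:

```ts
// Hypothetical result and criteria shapes; the real schema lives with each workflow.
type EvalResult = {
  workflow: string;
  provider: "openai" | "anthropic" | "google";
  model: string;
  accuracy: number; // 0..1, averaged across test cases
  p95LatencyMs: number;
  costUsdPerRun: number;
};

type AcceptanceCriteria = {
  minAccuracy: number;
  maxP95LatencyMs: number;
  maxCostUsdPerRun: number;
};

// Keep only models that clear every threshold, then prefer the cheapest one.
function recommend(
  results: EvalResult[],
  criteria: AcceptanceCriteria,
): EvalResult | undefined {
  return results
    .filter(
      (r) =>
        r.accuracy >= criteria.minAccuracy &&
        r.p95LatencyMs <= criteria.maxP95LatencyMs &&
        r.costUsdPerRun <= criteria.maxCostUsdPerRun,
    )
    .sort((a, b) => a.costUsdPerRun - b.costUsdPerRun)[0];
}
```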
Sharing these evals publicly is our way of earning trust. You can see how we validate quality, speed, and cost so the workflows we ship deliver real impact inside production video pipelines.
We run every eval across providers to compare quality, latency, and token usage side by side.
const providers = ["openai", "anthropic", "google"];
const data = providers.flatMap(provider =>
testAssets.map(asset => ({ input: { assetId: asset.id, provider } })),
);This helps reduce bias in our default recommendations by grounding decisions in comparable data across providers. It keeps our suggested workflows focused on measurable impact rather than preference.
We estimate costs using default model pricing and verify these numbers regularly against provider docs.
| Provider | Model | Input (USD per 1M tokens) | Output (USD per 1M tokens) |
|---|---|---|---|
| OpenAI | gpt-5.1 | $1.25 | $10.00 |
| Anthropic | claude-sonnet-4-5 | $3.00 | $15.00 |
| Google | gemini-3-flash-preview | $0.50 | $3.00 |
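As an illustration of how these rates translate into per-run estimates, here is a minimal sketch. The pricing map simply hardcodes the table above, and the token counts in the example are placeholders:

```ts
// Rough per-run cost estimate from token counts; prices are USD per 1M tokens.
const pricing = {
  "gpt-5.1": { inputPer1M: 1.25, outputPer1M: 10.0 },
  "claude-sonnet-4-5": { inputPer1M: 3.0, outputPer1M: 15.0 },
  "gemini-3-flash-preview": { inputPer1M: 0.5, outputPer1M: 3.0 },
} as const;

function estimateCostUsd(
  model: keyof typeof pricing,
  inputTokens: number,
  outputTokens: number,
): number {
  const rate = pricing[model];
  return (
    (inputTokens / 1_000_000) * rate.inputPer1M +
    (outputTokens / 1_000_000) * rate.outputPer1M
  );
}

// Example: ~12k input tokens and ~1k output tokens on gpt-5.1
// ≈ 0.012 * $1.25 + 0.001 * $10.00 ≈ $0.025 per run.
```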