Evaluating LLMs for production: what benchmarks don't tell you

Model selection decisions in production AI teams are often made the wrong way. A team reads benchmark results, runs a few manual tests, and picks the model that performed best in the demo. Six months later, the model that won the evaluation is causing more production incidents than expected, or it is significantly more expensive at production volume than the estimates suggested, or it handles poorly a class of edge cases that the evaluation never represented.

The problem is not that benchmarks are wrong. They accurately measure what they set out to measure. The problem is that benchmark conditions rarely match production conditions, and the gap between the two is where most model selection errors live.

What benchmarks measure and what they miss

Public benchmarks measure capability on carefully curated datasets under controlled conditions. They tell you whether a model can, in principle, perform a task type. They do not tell you how the model performs on your specific data distribution, in your specific prompt structure, against your specific quality criteria.

The mismatch shows up in three ways.

Distribution mismatch. Benchmark datasets are constructed to be representative of a general distribution. Your production data is specific to your domain, your users, and your use case. A model that leads on general reasoning benchmarks may underperform on the structured extraction task your application depends on, because your inputs have idiosyncrasies that the benchmark dataset did not include.

Prompt structure sensitivity. Most benchmarks use a specific prompt format designed by the benchmark authors. Production systems use prompt structures designed by your team for your context. Models vary in how sensitive they are to prompt structure, and a model’s benchmark rank can shift significantly when you switch from the benchmark’s prompt format to your own.

Latency and cost at scale. Benchmarks report accuracy metrics. They rarely report the latency distribution at production call volume, the behavior when the API is under load, or the cost per successful output when factoring in retries and error handling. A model that scores higher on quality may produce worse outcomes at scale if its p95 latency exceeds your response time budget or its error rate at peak load is higher than a competitor’s.
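As a rough illustration of the third point, the two numbers benchmarks rarely report can be computed from your own call logs. The sketch below assumes a hypothetical log format (one record per attempt, retries included, with `cost_usd` and `ok` fields) and uses a nearest-rank percentile; neither is prescribed by any particular provider.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def cost_per_successful_output(calls):
    """Total spend divided by the number of successful attempts.

    `calls` is a list of dicts with hypothetical keys 'cost_usd' and 'ok',
    one per attempt, so failed attempts and retries count toward the cost.
    """
    total_cost = sum(c["cost_usd"] for c in calls)
    successes = sum(1 for c in calls if c["ok"])
    if successes == 0:
        return float("inf")  # all spend, no usable output
    return total_cost / successes
```

Comparing `p95(...)` against your response time budget, and `cost_per_successful_output(...)` across candidates, puts retries and failures into the comparison rather than leaving them as a surprise.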

The evaluation structure that predicts production performance

A production-relevant evaluation has four components that benchmark comparisons typically omit.

A representative dataset from your actual domain. This is the most important component and the hardest to build. It requires taking real or realistic examples from your production distribution, covering the range of inputs the model will encounter including the difficult ones. The evaluation dataset should include easy cases (to confirm baseline performance), hard cases (to differentiate models at the capability frontier), and adversarial cases (inputs designed to surface failure modes). If you are early in development and do not have production data, the process of constructing realistic test cases forces clarity about what the system is actually supposed to handle.

Explicit quality criteria with measurable definitions. “Good output” is not a quality criterion. “The extracted entities are present in the source text, the extraction covers all entities mentioned, and the output is valid JSON matching the defined schema” is a quality criterion. The difference is that the second version can be evaluated automatically or by a human reviewer without ambiguity. Before running a model evaluation, write down what correct output looks like in specific, testable terms.
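To make the extraction criterion above concrete, here is a minimal sketch of it as an automated check. The schema (`{"entities": [...]}` with string entries) and the function name are assumptions for illustration, and "present in the source text" is simplified to an exact substring match, which a real system would likely need to relax.

```python
import json


def check_extraction(source_text, raw_output, expected_entities):
    """Score one output against three pass/fail criteria.

    Assumed schema (hypothetical): {"entities": ["...", "..."]}.
    """
    result = {"valid_json": False, "grounded": False, "complete": False}
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return result
    entities = parsed.get("entities") if isinstance(parsed, dict) else None
    if not isinstance(entities, list) or not all(isinstance(e, str) for e in entities):
        return result
    result["valid_json"] = True
    # Grounded: every extracted entity appears verbatim in the source.
    result["grounded"] = all(e in source_text for e in entities)
    # Complete: every entity a reviewer expected was extracted.
    result["complete"] = set(expected_entities) <= set(entities)
    return result
```

Each criterion maps to one boolean, so a failing output says which part of the definition it violated.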

Multi-dimensional scoring. Most production AI tasks have multiple quality dimensions that trade off against each other. A summarization task might have dimensions of factual accuracy, completeness, conciseness, and format compliance. A model that scores highest on completeness might score lower on conciseness. Which trade-off is right for your use case is a product decision, not a model selection decision, but it has to be made explicitly before the evaluation or the scoring will be ambiguous. Build a scoring rubric with weights for each dimension that reflect your actual priorities.
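A weighted rubric can be as small as the sketch below. The dimension names follow the summarization example above; the specific weights are placeholders, not recommendations, and per-dimension scores are assumed to be normalized to the range 0 to 1.

```python
# Hypothetical weights; set these from your own product priorities.
SUMMARY_WEIGHTS = {
    "factual_accuracy": 0.4,
    "completeness": 0.3,
    "conciseness": 0.2,
    "format_compliance": 0.1,
}


def weighted_score(dimension_scores, weights):
    """Combine per-dimension scores (each in [0, 1]) into one number."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(dimension_scores[d] * w for d, w in weights.items())
    # Dividing by the total keeps the result in [0, 1] even if the
    # weights do not sum exactly to 1.
    return weighted_sum / total_weight
```

Writing the weights down before scoring is the point: the same outputs rank differently under different weights, so the weights should not be chosen after seeing which model they favor.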

Cost-adjusted performance. The relevant metric is not accuracy; it is accuracy per dollar at your target volume. Model A might be 5% more accurate than Model B but 3x more expensive. Whether that trade-off is worth it depends on what 5% accuracy improvement is worth in your use case. Building cost into the evaluation score from the start avoids the situation where the team selects a model on quality metrics and then discovers the cost at scale is prohibitive.
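One way to frame the Model A versus Model B question is to ask what each additional correct output costs. The sketch below uses made-up numbers: it reads "5% more accurate" as five percentage points (0.85 versus 0.80) and assumes per-call prices of $0.003 and $0.001, purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float           # fraction of outputs judged correct
    cost_per_call_usd: float  # average cost of one call


def cost_per_correct_output(model):
    """Average spend needed to obtain one correct output."""
    return model.cost_per_call_usd / model.accuracy


def marginal_cost_per_extra_correct(better, cheaper):
    """Extra spend per additional correct output when choosing `better`."""
    extra_accuracy = better.accuracy - cheaper.accuracy
    if extra_accuracy <= 0:
        raise ValueError("`better` must have higher accuracy than `cheaper`")
    extra_cost = better.cost_per_call_usd - cheaper.cost_per_call_usd
    return extra_cost / extra_accuracy


# Hypothetical figures, not measurements of any real model.
model_a = Candidate("model-a", accuracy=0.85, cost_per_call_usd=0.003)
model_b = Candidate("model-b", accuracy=0.80, cost_per_call_usd=0.001)
```

Under these assumptions, `marginal_cost_per_extra_correct(model_a, model_b)` comes to about $0.04: Model A is worth choosing only if one additional correct output is worth more than roughly four cents to the product.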

Running the evaluation

The mechanics of a model evaluation are straightforward once the dataset and criteria are defined. Run each candidate model against the full dataset. Score each output against your criteria. Aggregate across dimensions with your defined weights.

Two practices that improve the reliability of the results:

Score outputs blind to the model that produced them. When human reviewers know which model produced an output, their scores are biased. If you are using human review for any part of the evaluation, strip the model identifier before presenting outputs to reviewers. This applies even to self-review: it is difficult to evaluate your own outputs without anchoring to your prior expectations about model quality.

Use the same prompt structure you will use in production. Do not use the benchmark prompt format or a simplified test prompt. Use the actual prompt template your application will use, with realistic context. Model performance can vary significantly across prompt formats, and an evaluation that uses a different prompt than production is measuring a different system.
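The blind-scoring practice is easy to automate. A minimal sketch, assuming each output is a dict with hypothetical `model`, `case_id`, and `text` fields: shuffle the outputs, replace the model name with an opaque ID, and keep the mapping aside until scoring is finished.

```python
import random


def blind_outputs(outputs, seed=0):
    """Return (review_items, key) with model identity removed from the items.

    `key` maps each blind ID back to the model and should be kept away
    from reviewers until all scores are recorded.
    """
    rng = random.Random(seed)
    shuffled = list(outputs)
    rng.shuffle(shuffled)  # so ordering does not reveal the source model
    review_items = []
    key = {}
    for i, out in enumerate(shuffled):
        blind_id = f"item-{i:04d}"
        review_items.append(
            {"blind_id": blind_id, "case_id": out["case_id"], "text": out["text"]}
        )
        key[blind_id] = out["model"]
    return review_items, key
```

This removes the label but not stylistic tells in the text itself, so it reduces reviewer bias rather than eliminating it.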

The regression test infrastructure

The model evaluation for initial selection is one-time work. The regression test infrastructure that runs continuously in production is the ongoing investment that protects your application from silent quality degradation.

A regression test suite is a subset of your evaluation dataset (the cases most sensitive to quality changes) run automatically whenever something changes: a model version update, a prompt change, a change to retrieval or context construction. The suite does not need to be comprehensive. It needs to be fast (under five minutes) and to cover the cases most likely to surface regressions.
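The core of such a suite is a comparison against stored baseline scores. The sketch below assumes each case already has a numeric score between 0 and 1 from your scoring step; the `tolerance` value is an arbitrary placeholder to absorb run-to-run noise.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return cases whose score dropped by more than `tolerance`.

    `baseline` and `candidate` map case IDs to scores in [0, 1].
    A case missing from `candidate` is reported as a regression.
    """
    regressions = {}
    for case_id, base_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None or new_score < base_score - tolerance:
            regressions[case_id] = (base_score, new_score)
    return regressions
```

Wired into CI, a non-empty result would fail the build for a prompt change or model version bump, which is what lets the suite answer "did this change make things worse?" without a human in the loop.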

The regression suite answers the question “did this change make things worse?” It does not answer the question “is the system performing well overall?” Answering that question requires periodic full evaluations against the complete dataset, ideally refreshed with a sample of recent production inputs so the dataset keeps pace with what users actually send.

The teams that catch model-related regressions quickly have built this infrastructure before they needed it. The teams that discover regressions in user feedback have not.

When to switch models

The decision to switch models mid-production is more disruptive than most teams expect. Prompt engineering that was tuned for one model’s behavior does not always transfer. Evaluation thresholds calibrated for one model’s output distribution may not apply to a different model’s outputs. The downstream code that processes the model’s responses may have implicit assumptions about response format that break under a new model.

A structured model switch process includes: running the full evaluation suite against the new model before changing anything in production, identifying the cases where the new model underperforms relative to the current model, updating the prompt to address the most significant underperformance, re-running the evaluation, and only then planning a phased rollout.

Phased rollout means routing a small percentage of production traffic to the new model, monitoring the quality metrics you established in the evaluation, and comparing them to the metrics from the current model before increasing the percentage. This is the same pattern as a feature flag rollout for traditional software, applied to model selection.
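A common way to implement the traffic split is deterministic hashing, so the same user or request key always lands on the same model at a given percentage. This is a sketch under assumptions: the model names are placeholders, and the choice of key (user ID, session ID, request ID) is up to your system.

```python
import hashlib


def route_model(request_key, rollout_percent,
                current="current-model", candidate="candidate-model"):
    """Pick a model for this key; `rollout_percent` is an integer 0-100."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key to a stable bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return candidate if bucket < rollout_percent else current
```

Because the bucket for a key never changes, raising the percentage only adds keys to the candidate group; nobody who was already on the new model is moved back, which keeps the before-and-after quality comparison clean.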

Teams that switch models by changing the API call and watching for user complaints are skipping the steps that make the switch predictable. The switch may go fine. When it does not, the diagnosis starts from zero.

The honest cost of getting this right

Building a proper model evaluation infrastructure takes longer than most teams budget for it. A representative dataset, explicit quality criteria, multi-dimensional scoring rubrics, automated evaluation pipelines, and regression test suites collectively represent weeks of engineering work before the first model comparison runs.

The alternative is running a few manual tests, picking the model that seemed best, and discovering its production failure modes through user feedback over the following months. Both approaches take time. The first approach takes time upfront, produces predictable outcomes, and generates infrastructure that pays dividends on every future model change. The second takes time distributed across incidents, investigations, and reactive fixes, and generates no reusable infrastructure.

Most teams that have been through a poorly managed model selection learn to do it differently the next time. The lesson the experience teaches is to build the infrastructure before you need it; it is cheaper to learn that lesson in advance than through production incidents.
