Why AI teams need a culture of evaluation

Evaluation infrastructure is a tooling problem. Evaluation culture is an organizational problem. Teams that build the tooling without changing how they make decisions discover that the tooling goes unused. The harder work is building the norms.

By Ramiro Enriquez

The case for AI evaluation infrastructure is not difficult to make. Teams that measure quality systematically catch problems before they reach users. They improve faster because they have feedback on whether changes help or hurt. They deploy with more confidence because they have evidence, not just intuition. The argument is clear, and most teams that hear it agree with it.

And yet, in practice, evaluation infrastructure is consistently underbuilt, underused, and deprioritized. Teams build a test suite at launch, run it a few times, and then let it drift out of date as the system evolves. They build monitoring dashboards that no one reviews regularly. They write evaluation code that is not maintained and breaks without anyone noticing until months later.

The gap between knowing evaluation matters and actually doing it consistently is not a tooling gap. It is a cultural gap. The tooling is necessary but not sufficient. What makes evaluation stick is organizational norms that treat measurement as a first-class activity rather than overhead.

What evaluation culture looks like

Evaluation culture is the set of shared norms and practices that make systematic quality measurement the default rather than the exception. It shows up in specific observable behaviors.

Teams with evaluation culture define success criteria before they build. They answer “how will we know if this is working?” before they answer “how will we build this?” The evaluation criteria are written down, specific, and agreed to by the people who will make deployment decisions. This takes time and is often uncomfortable because it forces honest conversation about what the feature is actually supposed to do.

Teams with evaluation culture run evaluations before and after changes. A model update, a prompt change, or a retrieval configuration change triggers an evaluation run before deployment. The team looks at the numbers, not just the vibes. This sounds like obvious engineering practice but it is not the norm in AI development, where changes are often made based on intuitive assessment of a few examples.
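One way to make "look at the numbers" concrete is a small gate that compares evaluation scores from before and after a change. The sketch below is illustrative only: the function name `regression_gate`, the `max_drop` tolerance, and the assumption that each evaluation case yields a score between 0 and 1 are all hypothetical choices, not a prescribed method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GateResult:
    baseline_mean: float
    candidate_mean: float
    passed: bool


def regression_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Compare mean eval scores before and after a change.

    Assumption: scores are per-case quality scores in [0, 1], and
    max_drop is the largest decrease in the mean the team will accept.
    """
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    baseline = mean(baseline_scores)
    candidate = mean(candidate_scores)
    # The change passes only if quality did not drop by more than max_drop.
    return GateResult(baseline, candidate, candidate >= baseline - max_drop)


if __name__ == "__main__":
    # Hypothetical scores from the same test set, before and after a prompt change.
    result = regression_gate([0.9, 0.8, 0.85], [0.7, 0.75, 0.8])
    print("deploy" if result.passed else "hold and investigate")
```

A gate this simple hides per-category regressions behind an average, so a team might run it once per input category rather than once overall.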

Teams with evaluation culture review evaluation results regularly, not just at launch. The monthly review of production quality metrics is as normal as the monthly review of business metrics. When quality trends in the wrong direction, it generates the same response as a business metric that trends in the wrong direction: investigation, hypothesis formation, and action.

Teams with evaluation culture treat evaluation gaps as technical debt. When a new feature or input category is not covered by evaluation, that gap is tracked and scheduled for resolution, not accepted as a permanent state.
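Tracking gaps can start with something as small as comparing the input categories seen in production against the categories the evaluation set covers. The sketch below assumes each evaluation case is a dictionary with a `category` field and that production traffic has already been labeled by category; both are assumptions made for illustration.

```python
def find_coverage_gaps(production_categories, eval_cases):
    """Return production input categories that have no evaluation case.

    Assumptions: production_categories is an iterable of category labels,
    and each eval case is a dict with a "category" key.
    """
    covered = {case["category"] for case in eval_cases}
    return sorted(set(production_categories) - covered)


if __name__ == "__main__":
    # Hypothetical labels and cases.
    seen_in_production = ["billing", "refunds", "account_access"]
    eval_cases = [
        {"category": "billing", "input": "Why was I charged twice?"},
        {"category": "refunds", "input": "How do I get my money back?"},
    ]
    for gap in find_coverage_gaps(seen_in_production, eval_cases):
        print(f"uncovered category, file a tracking ticket: {gap}")
```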

Why evaluation culture is hard to build

Most of the cultural obstacles to evaluation are not technical. They are organizational dynamics that consistently de-prioritize measurement in favor of building.

The speed trade-off is real. Writing evaluation criteria before building does slow down the start of development. Running evals before deploying a change adds time to each deploy cycle. Reviewing quality metrics adds another meeting to the calendar. Each of these is a small short-term cost that produces significant value in the medium term, but on a team under pressure to ship, the immediate costs tend to win out over the later benefits.

Evaluation results are uncomfortable. When you run systematic evaluation, you find things that are not working. Features that were described as working well turn out to have significant failure rates on specific input categories. Model updates that were described as improvements turn out to have regressions in some dimensions. This is exactly what evaluation is supposed to surface, but it is uncomfortable for an organization to discover that a system it has been treating as working well has measurable quality problems.

Teams that have not built the norm of finding and discussing quality problems tend to avoid creating the conditions where they would discover those problems. Evaluation culture requires a prior cultural commitment to honest assessment of quality, which requires psychological safety around imperfect systems.

Success is defined by shipping, not by quality. In most engineering organizations, the celebration happens at launch. The feature is shipped, the milestone is hit, the team moves on. There is no comparable organizational event for “the feature continued to perform well for six months in production.” Evaluation work that prevents degradation is invisible success. Shipping new features is visible success. This incentive structure consistently de-prioritizes the ongoing work of maintaining quality.

Building the norms

Evaluation culture is built through specific practices that, repeated over time, become norms. The practices are not complex. They are primarily about sequencing and habits.

Make success criteria part of the spec. Before a feature enters development, the spec should include a section on evaluation: what does good output look like for this feature, what are the failure modes, and how will quality be measured at launch and on an ongoing basis. This is the most important norm to build because it affects every subsequent decision. Teams that define success criteria upfront make better technical choices during development because they know what they are optimizing for.
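Success criteria are easier to hold a team to when they are written in a form that can be checked mechanically. The following is a minimal sketch under stated assumptions: the `SuccessCriterion` class, the metric names, and the thresholds are hypothetical placeholders that a real spec would replace with its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    name: str                     # metric name (hypothetical)
    threshold: float              # agreed value from the spec
    higher_is_better: bool = True

    def is_met(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def unmet_criteria(criteria, measured):
    """Return names of criteria that fail or have no measurement at all."""
    unmet = []
    for criterion in criteria:
        value = measured.get(criterion.name)
        # A missing measurement counts as unmet: unmeasured is not passing.
        if value is None or not criterion.is_met(value):
            unmet.append(criterion.name)
    return unmet


# Hypothetical spec section, agreed before development starts.
SPEC = [
    SuccessCriterion("answer_accuracy", 0.90),
    SuccessCriterion("hallucination_rate", 0.02, higher_is_better=False),
]

if __name__ == "__main__":
    print(unmet_criteria(SPEC, {"answer_accuracy": 0.93, "hallucination_rate": 0.05}))
```

Treating a missing measurement as a failure is a deliberate choice in this sketch: it keeps a criterion from passing silently just because nobody measured it.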

Build evaluation alongside the feature, not after. The evaluation infrastructure for a feature (test data, evaluation code, quality metrics) should be built while the feature is being built, not as a separate follow-up project after launch. Treating evaluation as a post-launch activity means it is always deprioritized by the next feature, and it means the team does not have the feedback loop during development that would catch design problems before they are baked into a launched system.

Review quality data in existing team rituals. Adding evaluation review to existing meetings is more sustainable than creating new evaluation meetings. Adding a five-minute quality review to the weekly team sync is easier to maintain than a dedicated monthly evaluation review. The goal is to make quality data part of the team’s normal information diet, not a special event.

Celebrate quality findings, not just quality scores. When evaluation reveals a problem, the team that discovered it should be recognized for finding the problem, not blamed for having it. The incentive structure should make finding quality problems a positive event. Teams that have learned to celebrate the discovery of quality issues before they reach users have solved the hardest cultural problem in building evaluation culture.

The leadership role

Evaluation culture cannot be built bottom-up alone. Engineering teams that want to do evaluation well need leadership that creates the organizational conditions for it.

This means accepting the speed trade-off. Leaders who consistently pressure teams to skip evaluation steps in the name of shipping faster will not have teams that maintain evaluation culture. The implicit message that shipping is what matters and quality measurement is overhead is impossible to counteract at the team level.

It means being willing to delay deployment when evaluation reveals problems. A team that runs evaluation conscientiously and discovers a quality issue, only to have the deployment proceed anyway because of schedule pressure, will not run evaluation conscientiously next time. The evaluation has to matter.

And it means asking about evaluation in reviews of AI work. When leaders consistently ask “how are we measuring quality?” and “what do the evaluation results show?” those questions signal that evaluation is a real expectation, not a box-checking exercise. Questions shape behavior. Leaders who ask about evaluation will get teams that do evaluation.

The long-term payoff

Teams with evaluation culture compound. Each project builds evaluation infrastructure that informs the next. Each quality finding improves the team’s understanding of the system. Each evaluation review builds the team’s ability to distinguish real problems from noise. Over time, the team develops institutional knowledge of its AI systems that is not possible without systematic measurement.

Teams without evaluation culture also compound, in the opposite direction. Each unexamined quality degradation reduces user trust a little more. Each deployment made without evaluation increases the chance of a visible failure that damages trust significantly. Each quarter without quality review increases the gap between what the team thinks the system is doing and what it is actually doing.

The investment in evaluation culture is front-loaded: it is hardest to build in the early months when there is no established norm and significant cost to building one. The payoff is back-loaded: it becomes most valuable as the team’s AI portfolio grows and the cost of not measuring quality compounds. Organizations that build the culture early are in a significantly stronger position than those that try to build it after quality problems have already accumulated.
