What good AI observability looks like

Traditional observability tells you if your system is up and how fast it is. AI systems need a second layer: is the output quality good, is it degrading, and why? The teams shipping reliable AI have built this layer. Most have not.

By Ramiro Enriquez

Traditional software observability answers a defined set of questions: is the service up, is it responding within acceptable latency, are error rates within bounds, what is the resource utilization? These questions are necessary and the tooling for answering them is mature. Metrics, logs, distributed traces: a team that has instrumented these well can diagnose most operational problems.

AI systems need this infrastructure too. They also need something different. The operational questions are necessary but not sufficient because the most important failure mode in AI systems is not operational failure. It is quality failure: the system is up, responding quickly, and returning 200s, while generating outputs that are wrong, misleading, or harmful. Standard observability infrastructure cannot detect this. The service looks healthy while the product is failing.

Good AI observability answers both sets of questions. The teams that have built this have a significant operational advantage over the teams that have not: they catch quality problems before users report them, they detect degradation as it happens rather than after it has accumulated, and they have the data to make principled decisions about model updates and configuration changes.

The questions AI observability needs to answer

The operational layer answers: is the system functioning? The quality layer answers four different questions.

Is the output quality good? This means measuring the AI’s outputs against whatever quality criteria matter for the use case: accuracy, relevance, completeness, tone, adherence to constraints. The challenge is that quality is often not a single number. It is a distribution across the range of inputs the system encounters. A system that produces high-quality outputs on 80% of inputs and low-quality outputs on 20% needs different attention than one that produces medium-quality outputs across all inputs.
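To make the distinction concrete, here is a minimal sketch in Python. The scores, the 0.5 acceptability threshold, and the `quality_summary` helper are illustrative assumptions, not part of any particular tool. It shows two systems with the same mean quality that look very different once you check the tail of the distribution.

```python
from statistics import mean

def quality_summary(scores, low_threshold=0.5):
    """Summarize a batch of per-output quality scores in [0, 1].

    low_threshold is an assumed cut-off for "unacceptable" output.
    """
    ordered = sorted(scores)
    n = len(ordered)
    return {
        "mean": mean(ordered),
        # Simple index-based 10th percentile: how bad the worst outputs get.
        "p10": ordered[int(0.1 * (n - 1))],
        # Share of outputs that fall below the acceptability threshold.
        "low_share": sum(1 for s in ordered if s < low_threshold) / n,
    }

# Hypothetical data: 80% high / 20% low, versus medium quality everywhere.
bimodal = quality_summary([0.9] * 80 + [0.1] * 20)
uniform = quality_summary([0.74] * 100)

print(bimodal)
print(uniform)
```

Both summaries report roughly the same mean, but only the first has a fifth of its outputs below the threshold. A dashboard that shows the mean alone cannot tell these two systems apart.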

Is quality changing over time? Model behavior can shift when a provider updates the underlying model. System prompts interact with model updates in unexpected ways. The distribution of inputs changes as the product evolves and the user base grows. Quality that was adequate at launch may not be adequate six months later. Detecting drift requires tracking quality over time, not just at a point in time.

What is the cost of current quality levels? AI inference has a direct cost per call. That cost needs to be understood relative to the quality it produces. A configuration that produces high-quality outputs at ten times the cost of one that produces adequate-quality outputs may or may not be justified. The observability layer should make this comparison visible.

What are the failure modes? When the AI produces bad output, why? Is it a specific input category that consistently fails? A length or format constraint that the model does not handle well? A topic area where the model’s training is poor? Understanding the structure of failures is the input to improving the system. Without structured failure data, improvements are guesses.

What to instrument

Building AI observability requires instrumenting things that traditional monitoring does not capture.

Input and output sampling. A sample of real inputs and their corresponding outputs, stored with enough context to evaluate quality later, is the foundation of AI observability. The sample size and strategy matter: you need enough examples to detect statistically significant changes in quality, but not so many that storage and review become prohibitive. A common approach is systematic sampling (every Nth request) combined with targeted capture of flagged or unusual cases.
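A minimal sketch of that combined strategy might look like the following. The `RequestSampler` class, its field names, and the in-memory `store` list are hypothetical; a real system would write to a durable data store and decide for itself what counts as "flagged".

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequestSampler:
    """Keep every Nth request, plus anything explicitly flagged."""
    every_n: int = 100
    seen: int = 0
    store: list = field(default_factory=list)  # stand-in for a real data store

    def observe(self, request_id: str, prompt: str, output: str,
                flagged: bool = False) -> Optional[str]:
        """Record the request if it qualifies; return the capture reason or None."""
        self.seen += 1
        if flagged:
            reason = "flagged"
        elif self.seen % self.every_n == 0:
            reason = "systematic"
        else:
            return None
        # Store enough context to evaluate quality later.
        self.store.append({
            "request_id": request_id,
            "prompt": prompt,
            "output": output,
            "reason": reason,
        })
        return reason

sampler = RequestSampler(every_n=3)
reasons = [sampler.observe(f"req-{i}", "prompt", "output") for i in range(1, 7)]
flagged_reason = sampler.observe("req-7", "prompt", "output", flagged=True)
```

Recording the capture reason alongside each sample matters later: flagged cases are not representative of normal traffic, so quality estimates should usually be computed from the systematic sample only.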

Quality signals. The specific quality signals to capture depend on the use case, but there are common categories. Explicit user feedback (thumbs up/down, accepted/rejected suggestions) is direct but sparse. Implicit behavioral signals (did the user edit the output before using it, did they request a different output, did they abandon the workflow after the AI response) are more abundant and often more honest. Automated evaluation against predefined criteria is more scalable but requires investing in the evaluation criteria and the evaluation infrastructure.
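As a rough illustration of the implicit signals, the sketch below computes the share of responses that were edited, regenerated, or abandoned. The event shape (a dict with three boolean fields) and the function name are assumptions made for this example.

```python
def implicit_signal_rates(events):
    """Return the fraction of responses showing each implicit signal.

    events: iterable of dicts with boolean keys 'edited', 'regenerated'
    and 'abandoned' (a hypothetical event schema).
    """
    events = list(events)
    if not events:
        return {}
    total = len(events)
    return {
        name: sum(1 for e in events if e[name]) / total
        for name in ("edited", "regenerated", "abandoned")
    }

events = [
    {"edited": True,  "regenerated": False, "abandoned": False},
    {"edited": True,  "regenerated": True,  "abandoned": False},
    {"edited": False, "regenerated": False, "abandoned": False},
    {"edited": False, "regenerated": False, "abandoned": False},
]
rates = implicit_signal_rates(events)
print(rates)
```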

Latency by request characteristics. AI inference latency is not constant. It varies by input length, output length, model configuration, and load on the inference provider. Aggregate latency metrics hide the distribution. Instrumenting latency by relevant request characteristics reveals which request types are slow and whether slowness is getting worse.
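One way to sketch this is to group requests by an input-length bucket and compute a tail percentile per bucket. The bucket boundaries (500 and 4000 tokens) and the function names below are arbitrary assumptions; the percentile uses the nearest-rank method.

```python
import math
from collections import defaultdict

def input_bucket(input_tokens):
    """Assign a request to a coarse input-length bucket (assumed boundaries)."""
    if input_tokens < 500:
        return "short"
    if input_tokens < 4000:
        return "medium"
    return "long"

def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]

def latency_p95_by_bucket(requests):
    """requests: iterable of (input_tokens, latency_ms) pairs."""
    grouped = defaultdict(list)
    for input_tokens, latency_ms in requests:
        grouped[input_bucket(input_tokens)].append(latency_ms)
    return {bucket: p95(latencies) for bucket, latencies in grouped.items()}

# Hypothetical measurements.
requests = [(100, 200), (120, 220), (5000, 3000), (6000, 4200)]
by_bucket = latency_p95_by_bucket(requests)
print(by_bucket)
```

The same grouping works for output length, model configuration, or time of day; the point is that each group gets its own percentile instead of being folded into one aggregate.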

Cost per request and cost by user segment. API-based AI inference is billed per token. Cost varies by input length, output length, and model tier. Understanding cost distribution tells you where your inference budget is going, which users or use cases are disproportionately expensive, and what the cost implications of configuration changes would be.
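The arithmetic is simple enough to sketch. The prices below are invented for illustration (expressed in whole micro-dollars per token so the sums stay exact), and the tier names, record fields, and function names are likewise assumptions; substitute your provider's actual price sheet.

```python
from collections import defaultdict

# Hypothetical prices in micro-dollars per token. Not real provider pricing.
PRICES = {
    "small": {"input": 1, "output": 2},
    "large": {"input": 10, "output": 30},
}

def request_cost(model_tier, input_tokens, output_tokens):
    """Cost of one request in micro-dollars."""
    price = PRICES[model_tier]
    return input_tokens * price["input"] + output_tokens * price["output"]

def cost_by_segment(requests):
    """Sum request costs per user segment."""
    totals = defaultdict(int)
    for r in requests:
        totals[r["segment"]] += request_cost(
            r["model_tier"], r["input_tokens"], r["output_tokens"]
        )
    return dict(totals)

requests = [
    {"segment": "free", "model_tier": "small", "input_tokens": 1000, "output_tokens": 500},
    {"segment": "enterprise", "model_tier": "large", "input_tokens": 1000, "output_tokens": 500},
    {"segment": "enterprise", "model_tier": "large", "input_tokens": 2000, "output_tokens": 100},
]
totals = cost_by_segment(requests)
print(totals)
```

Swapping `"segment"` for a product feature or request type gives the other breakdowns with no change to the cost calculation itself.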

Error categorization. Not all errors are the same. An infrastructure error (the AI provider is down) requires a different response than a quality error (the AI produced an off-topic response) or a constraint violation (the AI response contained something it was instructed not to include). Categorizing errors before logging them makes it possible to alert on meaningful conditions rather than raw error counts.
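A minimal sketch of categorizing before logging, using the three error types named above. The inputs to `categorize` (a provider health flag, an on-topic judgment, and a list of violated constraints) are assumed to come from checks you already run; the names are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Optional

class ErrorCategory(Enum):
    INFRASTRUCTURE = "infrastructure"              # provider down, timeouts
    QUALITY = "quality"                            # e.g. off-topic response
    CONSTRAINT_VIOLATION = "constraint_violation"  # output broke an instruction

def categorize(provider_ok: bool, on_topic: bool,
               violated_constraints: list) -> Optional[ErrorCategory]:
    """Return the error category for one request, or None if it succeeded."""
    if not provider_ok:
        return ErrorCategory.INFRASTRUCTURE
    if violated_constraints:
        return ErrorCategory.CONSTRAINT_VIOLATION
    if not on_topic:
        return ErrorCategory.QUALITY
    return None

outcomes = [
    categorize(provider_ok=False, on_topic=True, violated_constraints=[]),
    categorize(provider_ok=True, on_topic=False, violated_constraints=[]),
    categorize(provider_ok=True, on_topic=True, violated_constraints=["mentions_competitor"]),
    categorize(provider_ok=True, on_topic=True, violated_constraints=[]),
]

# Count by category so alerts can be defined per category, not on a raw total.
counts = Counter(c.value for c in outcomes if c is not None)
print(counts)
```

The ordering of the checks is a design choice: here an infrastructure failure takes precedence because there is no output to judge, and a constraint violation outranks an off-topic response.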

The drift detection problem

One of the most important and underappreciated functions of AI observability is detecting quality drift before it becomes user-visible.

Quality drift can happen from several sources. The AI model provider updates the underlying model, changing behavior in ways that affect quality for specific input types. The distribution of user inputs shifts as the product grows, exposing the AI to input categories it was not optimized for. A configuration change (a prompt modification, a context window adjustment) changes behavior in ways that were not fully evaluated before deployment.

Detecting drift requires a baseline and a comparison. The baseline is the quality distribution observed during a reference period when the system was known to be working well. The comparison is the current quality distribution. Alerting on statistically significant divergence from the baseline catches drift early.

This is harder to build than traditional drift detection because quality is not a single metric. A system that tracks one quality dimension may drift on another without detection. Good drift detection covers multiple quality dimensions and is sensitive to changes in the distribution, not just the mean.
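One possible way to compare distributions instead of means is a two-sample Kolmogorov-Smirnov test, sketched below in plain Python and run once per quality dimension. This is a swapped-in illustration, not a prescribed method: the dimension names and scores are invented, and the critical value uses the standard large-sample approximation, which is only a rough guide for small samples or heavily tied scores.

```python
import math
from bisect import bisect_right

def ks_statistic(baseline, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(baseline), sorted(current)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )

def has_drifted(baseline, current, alpha=0.05):
    """True if the KS statistic exceeds the asymptotic critical value."""
    n, m = len(baseline), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, current) > critical

def drift_report(baseline_by_dim, current_by_dim, alpha=0.05):
    """Check every quality dimension independently."""
    return {
        dim: has_drifted(baseline_by_dim[dim], current_by_dim[dim], alpha)
        for dim in baseline_by_dim
    }

# Hypothetical scores: 'relevance' keeps the same mean but splits into
# high and low clusters; 'tone' is unchanged.
baseline = {"relevance": [0.74] * 100, "tone": [0.8] * 100}
current = {"relevance": [0.9] * 80 + [0.1] * 20, "tone": [0.8] * 100}

report = drift_report(baseline, current)
print(report)
```

In this example the mean relevance score is essentially identical in both periods, so a mean-only alert would stay silent, while the distribution comparison flags the change. Running many dimensions at a fixed `alpha` raises the false-alarm rate, so the threshold may need tightening as dimensions are added.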

The two layers in practice

The operational layer and the quality layer have different owners, different tooling, and different cadences.

The operational layer is owned by whoever runs the infrastructure. It lives in whatever monitoring system the organization uses for its other services. Alerts fire on anomalies. The on-call rotation responds. The tooling is standard.

The quality layer is owned by the product and engineering teams who built the AI feature. It lives in a combination of a data store (for sampled inputs/outputs), an evaluation pipeline (for computing quality metrics), and a dashboard (for visualizing quality trends). Alerts fire on quality degradation. The team that built the feature responds. The tooling is often custom.

The teams that have built good AI observability tend to build the quality layer as a product requirement, not as an operational afterthought. The evaluation infrastructure is built alongside the AI feature, not after it ships. The quality dashboard is reviewed on a regular cadence, not only when something goes wrong.

What to build before launch versus what can wait

There is a practical sequencing question for teams that have not yet built AI observability: what is necessary before launch versus what can be added after?

Before launch, you need enough quality visibility to know whether the system is working at all. This means at minimum: input/output sampling sufficient to evaluate quality on a reasonable sample, at least one automated quality metric that correlates with the quality dimension you care most about, and cost monitoring so you know what you are spending.

After launch, you can add: richer evaluation across more quality dimensions, behavioral signal collection from users, drift detection against a baseline, detailed failure categorization, and cost analysis by user segment and request type. These are important but they require real production data to calibrate, which means they are most useful after you have launched and accumulated that data.

The mistake to avoid is launching with no quality monitoring and planning to add it later. Teams that launch without quality monitoring tend not to add it later, because by the time they would add it, they have user-visible problems and are triaging rather than building infrastructure. The time to build quality monitoring is before you need it urgently.

The questions to ask about your current AI observability

A practical way to assess whether your AI observability is adequate is to ask specific questions and see if you can answer them from your current instrumentation.

If quality degraded by 10% over the last two weeks, would you know? If you cannot answer yes to this, you do not have drift detection.

What percentage of AI outputs do users modify before using them? If you cannot answer this, you do not have implicit quality signal collection.

What are the five input categories where the AI performs worst? If you cannot answer this, you do not have structured failure analysis.

What did your AI inference cost last month, broken down by product feature? If you cannot answer this, you do not have cost attribution.

The answers to these questions are what good AI observability makes possible. They are also the inputs to the decisions that determine whether your AI system gets better over time or stays at its launch quality while the world around it changes.
