
What to instrument when your AI degrades in production

Most AI systems fail silently. Latency dashboards say 200 OK while quality drifts. Here is the four-layer telemetry stack that catches it.

By Ramiro Enriquez

Layered telemetry diagram showing token, quality, behavior, and outcome signals stacked above an AI request path

A team we know shipped a customer-facing knowledge assistant in January. By mid-March, support was handling a steady trickle of tickets that all looked the same: the assistant kept recommending a product that had been discontinued six weeks earlier. Nothing in the dashboards lit up. Latency was normal. Cost was flat. The model returned 200 OK on every call.

The retrieval index had stopped refreshing. The nightly job had been failing silently for 43 days, and the only signal that something was wrong was a slow climb in escalation rate that nobody had wired to an alert.

This is the default failure mode for production AI. It does not crash. It does not throw. It produces confident, plausible, wrong answers, and it does so consistently enough to feel like the system is working until somebody downstream notices that it is not.

Traditional observability misses this entirely. Infrastructure dashboards measure the request path. AI fails at the output layer, not the request layer. If you only watch HTTP status, latency, and uptime, you will keep finding out about quality regressions from your users. Months after they started.

Six failure modes that never trip an infra alert

Before discussing what to instrument, it is worth naming what you are instrumenting against. The failure modes that degrade production AI are not the ones traditional monitoring is built for.

Model drift. Provider models change. The endpoint name stays the same, the contract stays the same, but the behavior shifts after an update. The model you tuned your prompts against in February is not the model your prompts run against in May.

Prompt drift. Upstream schemas change. A retrieval-augmented pipeline that injects a JSON blob from a partner API will behave differently after that schema adds two fields. Your prompt was never edited. Its context was.

Tool and function-calling failures. Agentic systems accumulate silent breakage as tool signatures evolve. A model that cannot find the right tool often does not return an error. It improvises. The improvisation looks like an answer.

Retrieval staleness. In RAG systems, the index is a snapshot. As the source content updates, the retrieval layer answers questions with stale context. The model is correct given what it was shown. What it was shown is wrong.

Silent model downgrades. Cost-optimization logic that falls back to a cheaper model on cache miss or rate-limit backoff quietly degrades quality. The fallback fires. The quality gate does not.

Cost-quality regressions from cache invalidation. When a semantic cache breaks (TTL expiry, schema change, key rotation), every cached call reverts to full inference. Costs spike. Quality sometimes drops, because the cache may have been holding outputs from a better prompt version.

The pattern across all six: the failure is silent, gradual, and invisible to anything that watches the request path. You need a different category of signal entirely. Four of them, actually. (Our earlier piece on the observability gap frames the problem at a higher level. This post is the implementation guide.)

The four-layer instrumentation stack

The right mental model is a stack. Infrastructure observability sits below it. Four AI-specific layers sit above.

  1. Token and cost. What every call costs and which model produced it.
  2. Quality. Whether the outputs match what “correct” looks like.
  3. Behavior. How the system operates: tool calls, retries, latency, fallbacks.
  4. Outcome. What users actually do with the result.

Each layer answers a different question. None of them substitute for the others. A team that monitors only Layer 1 will catch cost spikes and miss every quality regression. A team that runs eval harnesses but never wires Layer 4 will pass internal tests while users churn.

The stack is the unit of completeness, not any single layer.

The four layers compound. Skipping one leaves a category of failure invisible.

Layer 1: Token and cost

This is the entry point. It is the easiest to implement, has the fastest payoff, and gives you the trace IDs that make the other three layers possible.

Every AI request should emit a structured event. Required fields: input and output token counts, model identifier, computed cost, cache hit/miss status, prompt version, operation type, tenant or user ID. Add a trace ID that follows the request through every downstream step.
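
As a concrete sketch, here is roughly what that event could look like in Python. The field names and the emit_llm_event helper are illustrative, not a prescribed schema; the part that matters is that every call produces one queryable record carrying a trace ID the other three layers can join against.

```python
import json
import time
import uuid

def emit_llm_event(*, model_id, prompt_version, operation, tenant_id,
                   input_tokens, output_tokens, cost_usd, cache_hit,
                   trace_id=None):
    """Emit one structured Layer 1 event per LLM call (field names illustrative)."""
    event = {
        "trace_id": trace_id or str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_id": model_id,              # which model actually served the call
        "prompt_version": prompt_version,
        "operation": operation,            # e.g. "summarize_ticket"
        "tenant_id": tenant_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": cost_usd,
        "cache_hit": cache_hit,
    }
    # Replace with your event pipeline (queue, warehouse ingest, OTLP exporter, ...).
    print(json.dumps(event))
    return event["trace_id"]
```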

What to track: cost per operation by feature, model used per request, cache hit rate trend, input token distribution over time.

What to alert on: cost per operation exceeding 2x the rolling seven-day average. Cache hit rate dropping more than 20 percentage points in 24 hours. Any model identifier appearing in production that is not in your approved routing configuration.
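
Expressed as checks over those events, the three alerts can stay very small. A sketch; the function names are illustrative and the thresholds are the ones above:

```python
from statistics import mean

def cost_alert(today_cost_per_op: float, last_7_days: list[float]) -> bool:
    """Fire when today's cost per operation exceeds 2x the rolling 7-day average."""
    if not last_7_days:
        return False
    return today_cost_per_op > 2 * mean(last_7_days)

def cache_alert(hit_rate_now: float, hit_rate_24h_ago: float) -> bool:
    """Fire when the cache hit rate drops more than 20 percentage points in 24 hours."""
    return (hit_rate_24h_ago - hit_rate_now) > 0.20

def unapproved_models(models_seen: set[str], approved: set[str]) -> set[str]:
    """Return any model identifier seen in production that is not in the approved routing config."""
    return models_seen - approved
```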

What this layer misses: it tells you the bill changed. It tells you nothing about whether the answers changed. Cost and quality are largely independent dimensions, and treating one as a proxy for the other is the most common instrumentation mistake we see.

Layer 2: Quality

This is the hardest layer to instrument and the one with the highest payoff. It requires defining what “correct” looks like before you can measure deviation from it. Most teams skip this for that reason. That is the wrong call.

The practical starting point is a golden set: 50 to 200 representative inputs with known-good outputs, stored in version control. Run your live pipeline against this set on a schedule (daily) and per deploy. Track three scores per output. Groundedness: does it stay anchored to the retrieved context? Faithfulness: is it consistent with the source material? Format adherence: does it match the schema the downstream consumer expects?
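
A minimal replay harness is not much code. The sketch below assumes a JSONL golden set with input and expected fields, and treats the pipeline and the scoring functions as stand-ins you supply:

```python
import json

def run_golden_set(golden_path: str, pipeline, score_fns: dict) -> dict:
    """Replay a version-controlled golden set through the live pipeline (Layer 2).

    `pipeline` is whatever callable produces your production output for an input;
    `score_fns` maps score names (groundedness, faithfulness, format adherence)
    to functions of (output, expected) -> float. Both are placeholders here.
    """
    with open(golden_path) as f:
        examples = [json.loads(line) for line in f]  # each line: {"input": ..., "expected": ...}

    totals = {name: 0.0 for name in score_fns}
    for ex in examples:
        output = pipeline(ex["input"])
        for name, fn in score_fns.items():
            totals[name] += fn(output, ex["expected"])

    n = len(examples)
    return {name: total / n for name, total in totals.items()}
```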

Here is what this catches that nothing else does. A provider quietly updates a model endpoint. The new version produces JSON with a slightly different field order and occasionally omits a nullable field that the old version always included. A downstream parser swallows the resulting null-reference exceptions in a broad error handler. The system degrades gracefully. Users see subtly worse outputs. The only growing signal is edit rate on co-edited content, which nobody is tracking.

This is exactly the class of failure scheduled golden-set replay is built for. A model.version field per request would have pinpointed the upgrade boundary on day one.

For teams without a formal eval harness yet, a usable proxy is output schema pass rate: the percentage of responses that successfully parse against the expected schema. It is blunt, but it catches a large class of regressions for free. Build it the same week you ship Layer 1.
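
If your outputs are JSON, schema pass rate is a few lines. A sketch that assumes the jsonschema package; any validator works:

```python
import json
from jsonschema import validate, ValidationError  # assumes the jsonschema package is installed

def schema_pass_rate(responses: list[str], schema: dict) -> float:
    """Fraction of raw model responses that parse and validate against the expected schema."""
    passed = 0
    for raw in responses:
        try:
            validate(instance=json.loads(raw), schema=schema)
            passed += 1
        except (json.JSONDecodeError, ValidationError):
            pass
    return passed / len(responses) if responses else 0.0
```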

Layer 3: Behavior

This layer measures how the system operates, not what it costs or produces.

Signals to track: tool-call success rate broken out by tool name. Retry count per request. Refusal rate. Latency at p50, p95, and p99. Fallback event counts: how often does the system route to a secondary model or default response?
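
As with Layer 1, one structured record per request is enough. A sketch with illustrative field names, keyed to the same trace ID so it can be joined against the cost events:

```python
def behavior_event(trace_id: str, *, tool_name: str | None,
                   tool_success: bool | None, retries: int,
                   refused: bool, latency_ms: float,
                   fallback_model: str | None) -> dict:
    """Build one Layer 3 record per request (field names illustrative).

    The point is that tool results, retries, refusals, latency, and fallback
    destinations land in a store where they can be sliced per tool and per operation.
    """
    return {
        "trace_id": trace_id,
        "tool_name": tool_name,            # None if no tool was called
        "tool_success": tool_success,
        "retries": retries,
        "refused": refused,
        "latency_ms": latency_ms,
        "fallback_model": fallback_model,  # None unless a fallback fired
    }
```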

Latency distribution matters because AI systems have heavy tails. A p50 of 800ms can coexist with a p99 of 12 seconds. The p99 is what your most complex requests experience. It is also, often, what your most valuable users experience, because the most complex requests are the ones that need to work the most.

A real story from this layer. An agentic workflow at one team called a downstream tool with a strict enum on a “priority” field. A developer added a new valid enum value on the downstream side without updating the tool definition passed to the model. The model, unable to call the tool successfully, started routing through a different freeform-string tool that accepted anything. That tool had no schema validation. Bad data flowed downstream for three days before someone noticed. Tool-call success rate broken out by tool name would have flagged it on day one: one tool’s success rate dropped to zero while another’s volume spiked. Aggregate “tool-call success rate” across all tools looked normal because the sum was unchanged.

The lesson generalizes.

Aggregate metrics hide compositional failures.

Always break behavior signals out by the smallest meaningful unit. Per tool. Per prompt version. Per operation type. The rolled-up total smooths the signal you most need to see.
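
In terms of the behavior records sketched above, the difference between the rolled-up number and the view that would have caught the enum failure is a single group-by:

```python
from collections import defaultdict

def tool_success_rates(events: list[dict]) -> tuple[float, dict[str, float]]:
    """Return the aggregate tool-call success rate and the per-tool breakdown."""
    calls = [e for e in events if e.get("tool_name") is not None]
    if not calls:
        return 0.0, {}

    aggregate = sum(e["tool_success"] for e in calls) / len(calls)

    per_tool = defaultdict(lambda: [0, 0])  # tool name -> [successes, total calls]
    for e in calls:
        per_tool[e["tool_name"]][0] += e["tool_success"]
        per_tool[e["tool_name"]][1] += 1

    breakdown = {tool: ok / total for tool, (ok, total) in per_tool.items()}
    return aggregate, breakdown
```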

Layer 4: Outcome

This is the ground truth layer. It captures what users actually do with the AI’s outputs, which is the only reliable measure of whether the system is useful.

Outcome signals depend on the surface. For chat surfaces: thumbs ratings, copy-out rate, escalation to a human. For co-editing or generation: edit distance between AI output and final user-submitted version, accept rate. For agentic workflows: task completion rate, abandonment mid-flow, time to user-visible success.
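
A sketch for the co-editing case, using edit similarity as a cheap proxy for edit distance. The field names are illustrative, and the right signals depend on your surface:

```python
from difflib import SequenceMatcher

def outcome_event(trace_id: str, ai_output: str, final_text: str,
                  accepted: bool, escalated: bool) -> dict:
    """Build one Layer 4 record per surfaced output, joined to Layer 1 via trace_id."""
    similarity = SequenceMatcher(None, ai_output, final_text).ratio()
    return {
        "trace_id": trace_id,
        "edit_similarity": similarity,  # 1.0 = the user kept the output verbatim
        "accepted": accepted,
        "escalated": escalated,
    }
```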

These signals are noisier than the other three layers. They require more volume to be actionable, and they take longer to converge. They are also the only signals that close the loop between what you instrumented and what your users experienced.

In the stale-RAG-index story from the opening, the only rising signal for 43 days was escalation rate to human support. Layer 1 saw nothing. Layer 2 was not running. Layer 3 was clean. Layer 4 was telling the truth, and nobody was listening.

Wire Layer 4 signals back to the trace IDs from Layer 1. That is what lets you answer the question that matters most: which operation types, which models, which prompt versions, which tenants correlate with the worst user outcomes? Without trace correlation, outcome data is a thermometer. With it, it is a diagnostic.
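
The join itself is unglamorous. A sketch over the event shapes used above, grouping escalation rate by any Layer 1 dimension:

```python
from collections import defaultdict

def escalation_rate_by(key: str, llm_events: list[dict], outcomes: list[dict]) -> dict:
    """Join Layer 4 outcomes to Layer 1 events on trace_id, then report
    escalation rate per value of a Layer 1 dimension (model_id, prompt_version,
    operation, tenant_id). Event shapes match the earlier sketches."""
    by_trace = {e["trace_id"]: e for e in llm_events}
    groups = defaultdict(lambda: [0, 0])  # dimension value -> [escalations, total]
    for o in outcomes:
        event = by_trace.get(o["trace_id"])
        if event is None:
            continue
        groups[event[key]][0] += o["escalated"]
        groups[event[key]][1] += 1
    return {value: esc / total for value, (esc, total) in groups.items()}
```

Called as escalation_rate_by("prompt_version", llm_events, outcomes), it answers exactly the question above, one dimension at a time.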

Anti-patterns to retire today

A short list of habits that look like instrumentation but are not.

  • Monitoring only latency. A model can be fast and wrong simultaneously. If your AI dashboard has one panel and it is latency, you are watching the wrong thing.
  • Logging prompts but not outputs. Prompt logging is useful for debugging. Output logging is what tells you whether the system did its job. Teams with the former and not the latter have a record of what they asked and no record of what they got back.
  • Treating “the model is fine” as the null hypothesis. Provider models change. The default assumption should be that something has drifted; you just have not measured it yet.
  • Alerting on cost without quality gates. A cost spike is a signal to investigate. A flat cost is not a signal that quality is stable. Cost and quality move independently.
  • Deferring eval infrastructure. “Later” usually becomes “never.” A 50-example golden set takes an afternoon. Not having one means every model update is a blind deployment.

Where to start tomorrow

If you are starting from zero, this is the smallest viable progression. It is achievable in two weeks with one engineer.

  1. Day one. Emit a structured JSON event for every LLM call. Required fields: input/output token counts, model ID, computed cost, cache status, prompt version, trace ID, tenant ID. Pipe to a queryable store. This unlocks every later step.
  2. Day three. Write 50 golden-set examples for your highest-volume operation. Inputs and known-good outputs. Store in version control alongside the code that calls the model.
  3. Day five. Schedule a daily run of the golden set against production routing. Track output schema pass rate as the baseline metric. Add groundedness and faithfulness scores as you have capacity.
  4. Day eight. Break Layer 3 metrics out by tool and by operation type. Tool-call success rate per tool. Retry count per operation. Fallback event count by destination model.
  5. Day twelve. Wire one outcome signal back to trace IDs. Pick the one your product surface makes cheapest: thumbs ratings, accept rate, escalation rate. Build the join.
  6. Day fourteen. Set one alert per layer. Cost-per-operation regression. Golden-set score regression. Tool-call success-rate regression. Outcome-signal regression. Four alerts. Cover the stack.
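
One way to keep yourself honest is to write the four alerts down as data before wiring them up. The thresholds below are illustrative placeholders, not recommendations:

```python
# One alert per layer. "metric" names match the progression above; thresholds
# are placeholders to replace with your own baselines.
ALERTS = [
    {"layer": 1, "metric": "cost_per_operation",     "fires_when": "above 2x rolling 7-day average"},
    {"layer": 2, "metric": "golden_set_score",       "fires_when": "below the previous baseline"},
    {"layer": 3, "metric": "tool_call_success_rate", "fires_when": "below per-tool baseline"},
    {"layer": 4, "metric": "escalation_rate",        "fires_when": "above rolling 7-day average"},
]
```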

Zylver Signal is the productized version of this stack, for teams that want it as a service rather than a project. The principles in this post are the same either way.

The takeaway

Production AI fails at the output layer. Instrument the output layer, or wait for users to do it for you.

