Why AI systems drift without contracts

AI systems degrade silently over time. Not because the model changes, but because the assumptions baked into the system (about inputs, outputs, and behavior) are never made explicit enough to enforce.

By Ramiro Enriquez

Drift is what happens to AI systems that were not designed to resist it. The model does not change. The infrastructure does not fail. The system just quietly becomes less reliable, less accurate, or less aligned with what it was supposed to do. By the time it surfaces in user feedback or business metrics, the gap between the intended behavior and the current behavior has been compounding for months.

The root cause is almost always the same: the system’s assumptions were implicit. The team knew what the system was supposed to do, but that knowledge lived in heads and documentation rather than in enforceable artifacts. When something changed (input distribution, model version, upstream data schema, product requirements), nothing caught the mismatch.

Contracts are what catch mismatches before they become drift.

What “contract” means in an AI system

In traditional software, a contract is a formal interface definition: the types go in, the types come out, and the compiler or runtime enforces the boundary. Violating a contract produces an immediate, loud error. In a statically typed system, you cannot accidentally pass a string where an integer is expected and have the system quietly continue.

In AI systems, contracts are harder to enforce and more important to have. The model accepts natural language. It returns natural language or loosely-typed structured output. The surface area for silent violation is enormous.

A contract in an AI system is any mechanism that makes an assumption explicit and detectable. This includes:

  • Input schemas: defining the expected format, range, and distribution of inputs, not just the type
  • Output schemas: defining what valid output looks like structurally, and what values are acceptable
  • Behavioral assertions: defining what the system should do in specific cases, expressed as test cases with expected outputs
  • Dependency pins: recording exactly which model version, prompt version, and retrieval configuration produced the validated behavior

None of these is novel. Software systems have had all of them for decades. The difference is that in AI systems, teams routinely skip them because the model “understands” the intent. The model may well understand the intent, but it is less reliable at maintaining it across months of production variation than a type system or a test suite.

The four failure modes of undocumented assumptions

Drift accumulates through a small number of recurring patterns. Most AI systems that degrade in production do so through one or more of these.

Input distribution shift. The system was designed and tested on a specific distribution of inputs. Over time, the actual inputs diverge. Users find edge cases the prompt was not designed for. Integrations send data in formats that differ from the examples the system was tested on. A new customer segment uses the product differently. The model handles the new inputs, but handles them worse. Nothing breaks. The quality just erodes.

A contract on input distribution does not prevent the shift. It detects it. When incoming data drifts outside the range the system was validated on, something alerts. Without the contract, the shift is invisible until someone notices the outputs are off.
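As a minimal sketch of what "something alerts" could mean: the example below tracks a single property of the input (length in characters) and flags a window of traffic when too many inputs fall outside the validated range. The `InputContract` class, its thresholds, and the choice of length as the monitored property are all assumptions made for illustration; a real system would likely monitor several properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputContract:
    """Hypothetical contract: the input range the system was validated on."""
    min_chars: int
    max_chars: int
    max_out_of_range_rate: float  # tolerated fraction of out-of-range inputs


def out_of_range_rate(inputs: list[str], contract: InputContract) -> float:
    """Fraction of inputs whose length falls outside the validated range."""
    if not inputs:
        return 0.0
    outside = sum(
        1 for text in inputs
        if not (contract.min_chars <= len(text) <= contract.max_chars)
    )
    return outside / len(inputs)


def should_alert(inputs: list[str], contract: InputContract) -> bool:
    """True when this window of traffic has drifted past the tolerated rate."""
    return out_of_range_rate(inputs, contract) > contract.max_out_of_range_rate


# Illustrative values only.
contract = InputContract(min_chars=10, max_chars=200, max_out_of_range_rate=0.05)
recent = [
    "short",                                # below the validated range
    "a normal length support question",
    "x" * 500,                              # above the validated range
    "another normal question",
]
print(out_of_range_rate(recent, contract), should_alert(recent, contract))
```

The check does not block anything. It only turns an invisible shift into a visible signal, which is all the contract is meant to do here.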

Model version change. A model provider updates the underlying model. The API looks the same (same endpoint, same response format), but the behavior changes. Some capabilities improve. Some degrade. Some edge cases that the old model handled consistently are now inconsistent. The system continues to return 200s. Nothing in the monitoring infrastructure changes. The drift is invisible.

Behavioral contracts (test cases with expected outputs, validated against a pinned model version) catch this, provided they run on a schedule and not only when the team deploys. When the provider updates the model and behavior shifts, the tests fail. The team knows before users know.
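A behavioral contract of this kind can be very small. The sketch below assumes the model client exposes the version it actually served (many do, but the exact field varies by provider) and uses a substring check as the simplest possible assertion. `BehavioralCase`, `run_contract`, and `fake_model` are hypothetical names; `fake_model` stands in for a real client so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BehavioralCase:
    name: str
    prompt: str
    must_contain: str  # simplest possible assertion on the output


def run_contract(
    call_model: Callable[[str], str],
    reported_version: str,
    pinned_version: str,
    cases: list[BehavioralCase],
) -> list[str]:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    if reported_version != pinned_version:
        violations.append(
            f"model version changed: pinned {pinned_version}, got {reported_version}"
        )
    for case in cases:
        output = call_model(case.prompt)
        if case.must_contain.lower() not in output.lower():
            violations.append(f"case failed: {case.name}")
    return violations


# Stand-in for a real model client (assumption, for illustration only).
def fake_model(prompt: str) -> str:
    return "Refund approved." if "refund" in prompt.lower() else "I am not sure."


cases = [
    BehavioralCase("refund request", "Customer asks for a refund", "refund"),
    BehavioralCase("escalation", "Customer threatens legal action", "escalate"),
]
violations = run_contract(fake_model, "model-v2", "model-v1", cases)
for violation in violations:
    print(violation)
```

Reporting the version mismatch separately from the failed cases is a deliberate choice: it tells the team whether behavior changed because the model changed, or changed while the model stayed the same.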

Prompt drift. Someone edits the prompt to fix a specific case. The fix works for the reported issue. It degrades behavior on cases that were not reported because they were working. The previous state of the prompt was never formally captured as a tested artifact, so there is no baseline to compare against. The edit ships. The regression is undiscovered.

Prompt versioning with behavioral regression tests catches this. Every prompt change runs the test suite. Regressions on previously-passing cases block the change until the conflict is resolved intentionally.
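One way to express that gate, as a sketch: run the same cases against the current and the candidate prompt, and block the change if any case that used to pass now fails, unless the regression is explicitly acknowledged. The function names and the two stand-in prompt functions are hypothetical; a real setup would render each prompt version and call the model.

```python
from typing import Callable

Case = tuple[str, str]  # (input text, substring the output must contain)


def passing(run: Callable[[str], str], cases: dict[str, Case]) -> set[str]:
    """Names of the cases that pass for one prompt version."""
    return {
        name for name, (text, expected) in cases.items()
        if expected in run(text)
    }


def regressions(
    run_current: Callable[[str], str],
    run_candidate: Callable[[str], str],
    cases: dict[str, Case],
) -> set[str]:
    """Cases that pass on the current prompt but fail on the candidate."""
    return passing(run_current, cases) - passing(run_candidate, cases)


def may_ship(regressed: set[str], acknowledged: set[str]) -> bool:
    """Block the change unless every regression was explicitly acknowledged."""
    return regressed <= acknowledged


# Stand-ins for "prompt version + model" (assumption, for illustration only).
def current_prompt(text: str) -> str:
    return f"SUMMARY: {text[:20]}"


def candidate_prompt(text: str) -> str:
    # The edit keeps more of the input but drops the prefix other cases rely on.
    return f"Summary of ticket: {text[:40]}"


cases = {
    "has_prefix": ("printer is on fire", "SUMMARY:"),
    "keeps_subject": ("printer is on fire", "printer"),
}
regressed = regressions(current_prompt, candidate_prompt, cases)
print(regressed, may_ship(regressed, acknowledged=set()))
```

The `acknowledged` set is what makes the resolution intentional: a regression can still ship, but only if someone names it.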

Schema drift in dependencies. The AI system depends on upstream data: a database query, an API response, a document retrieval result. The schema of that data changes incrementally. A field is renamed. A value range expands. A nested object is flattened. The system keeps working because the model is flexible enough to handle the variation. But the prompt was written assuming the old schema, so the model’s interpretation is now partially wrong. The outputs are subtly degraded in ways that do not produce errors.

Input schema contracts with validation at the boundary catch this. When the upstream data violates the expected schema, the validation fails loudly instead of producing silently degraded output.
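A sketch of validation at the boundary, using only the standard library. The schema, the field names, and the `ContractViolation` exception are hypothetical; teams often use a schema library for this, but the principle is the same. The drifted record mirrors the examples above: one field renamed, one nested object flattened.

```python
class ContractViolation(ValueError):
    """Raised when upstream data does not match the expected schema."""


# Hypothetical schema the prompt was written against.
EXPECTED_ORDER_SCHEMA = {"order_id": str, "total_cents": int, "customer": dict}


def validate_record(record: dict, schema: dict[str, type]) -> dict:
    """Return the record unchanged if it matches the schema, else raise."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Unknown fields are also reported: a rename shows up as one missing
    # field plus one unexpected field.
    for name in sorted(set(record) - set(schema)):
        problems.append(f"unexpected field: {name}")
    if problems:
        raise ContractViolation("; ".join(problems))
    return record


valid = {"order_id": "A-1", "total_cents": 1299, "customer": {"name": "Ada"}}
drifted = {"order_id": "A-2", "amount_cents": 1299, "customer_name": "Ada"}

validate_record(valid, EXPECTED_ORDER_SCHEMA)  # passes silently
try:
    validate_record(drifted, EXPECTED_ORDER_SCHEMA)
except ContractViolation as exc:
    print(f"rejected before reaching the model: {exc}")
```

Rejecting unknown fields is stricter than many systems need. It is shown here because it is the cheapest way to notice a rename.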

The compounding problem

Each of these failure modes alone produces marginal degradation. Combined, they compound.

Consider a system that has input distribution shift, an undocumented prompt edit, and a model version change, all within six months. Each change individually might reduce output quality by a few percent. Combined, the system is operating far from the validated baseline, and no one knows because each change was below the visibility threshold of whatever monitoring exists.

The compounding problem is why AI system drift is harder to recover from than most teams expect. When you finally identify that the system has degraded, you cannot easily attribute the degradation. Was it the prompt change from three months ago? The model version that updated last month? The upstream schema change? Without contracts that create snapshots of expected behavior at each point, the investigation is archaeological.

Recovery requires either reverting everything (losing whatever improvements came with the changes) or running a controlled series of ablations to isolate the cause. Both are expensive. Both are avoidable with contracts.

What a contract layer looks like in practice

The implementation does not need to be complex. The value is in having something, not in having something sophisticated.

At minimum, a contract layer for an AI system includes:

A prompt registry with version history and associated test suites. Every prompt has an identifier. Every change is a new version. Every version has tests that define expected behavior on representative inputs. Changes to prompts run the test suite; regressions require explicit acknowledgment before deploying.

Input validation at the boundary with schema definitions for every input type the system handles. Validation should fail closed: invalid input does not reach the model; it fails with a structured error that gets logged and alerted. The schema definition is also documentation: it makes the assumptions explicit.

Output validation at the boundary with structural assertions on the model’s response. If the model is supposed to return a JSON object with specific fields, validate that it does. If it is supposed to return a summary within a certain length range, assert the range. Structural validation does not catch semantic degradation, but it catches the cases where the model’s output format has changed in ways the downstream code cannot handle.
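A sketch of structural output validation. The field names, the allowed sentiment values, and the length bounds are assumptions chosen for illustration; the point is that each one is written down and checked.

```python
import json

# Hypothetical output contract.
REQUIRED_FIELDS = {"summary": str, "sentiment": str}
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}
MIN_SUMMARY_CHARS, MAX_SUMMARY_CHARS = 20, 280


def validate_output(raw: str) -> dict:
    """Parse and structurally validate a model response; raise ValueError on violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    if data["sentiment"] not in ALLOWED_SENTIMENTS:
        raise ValueError(f"sentiment not in allowed set: {data['sentiment']}")
    if not MIN_SUMMARY_CHARS <= len(data["summary"]) <= MAX_SUMMARY_CHARS:
        raise ValueError(f"summary length out of range: {len(data['summary'])}")
    return data


good = '{"summary": "Customer reports a billing error on the March invoice.", "sentiment": "negative"}'
bad = '{"summary": "ok", "sentiment": "angry"}'

print(validate_output(good)["sentiment"])
try:
    validate_output(bad)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Nothing here checks whether the summary is accurate. That belongs to the behavioral tests; this layer only guarantees that downstream code receives the shape it expects.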

A versioned baseline that captures the model, prompt version, and input/output examples that produced validated behavior. When any of these change, the baseline is updated, the tests are re-run, and the delta is explicit. The baseline is the artifact that makes drift detectable.
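A baseline can be as simple as a frozen record with a fingerprint. In the sketch below, `Baseline`, its field names, and the 12-character truncated hash are all assumptions; any stable serialization would work. The `changed_fields` helper is what makes the delta explicit.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Baseline:
    """Hypothetical record of the configuration that produced validated behavior."""
    model_version: str
    prompt_version: str
    examples: tuple[tuple[str, str], ...]  # (input, validated output) pairs

    def fingerprint(self) -> str:
        """Short, stable identifier for this exact combination."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def changed_fields(old: Baseline, new: Baseline) -> list[str]:
    """Names of the fields that differ between two baselines."""
    old_fields, new_fields = asdict(old), asdict(new)
    return [name for name in old_fields if old_fields[name] != new_fields[name]]


# Illustrative values only.
validated = Baseline(
    model_version="model-v1",
    prompt_version="summarize@3",
    examples=(("printer is on fire", "SUMMARY: printer is on fire"),),
)
current = Baseline(
    model_version="model-v2",
    prompt_version="summarize@3",
    examples=validated.examples,
)

if current.fingerprint() != validated.fingerprint():
    print("baseline changed:", changed_fields(validated, current))
```

Logging the fingerprint alongside each production response is one way to make later investigation less archaeological: every output can be traced to the configuration that produced it.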

The organizational dimension

Contracts are as much an organizational pattern as a technical one. The technical implementation is straightforward. The harder problem is that teams building AI systems often do not treat behavioral guarantees as first-class engineering artifacts.

In traditional software development, interfaces are defined before implementation. In AI development, the common pattern is: implement the feature, iterate until it seems to work, ship it, and document the intended behavior afterward (if at all). This sequence produces systems where the assumptions are permanently implicit.

The fix is treating the contract definition as part of the feature specification. Before building the AI feature, define: what inputs does it accept, what outputs does it produce, and what are the behavioral requirements that must hold for specific cases? These definitions become the test suite that validates the implementation and the baseline that detects future drift.

Teams that do this describe it as slowing down initial development by a small amount. They also describe far fewer incidents from silent degradation, faster investigation when something does go wrong, and substantially more confidence when making changes. The overhead is front-loaded; the benefit compounds indefinitely.

The alternative is building systems that work until they do not, with no reliable mechanism for knowing when the transition happened.
