
How to manage AI model upgrades without breaking production

Model providers update their underlying models regularly, sometimes without announcement and without changing the API version. The same endpoint that returned reliable outputs last month may behave differently today. Managing this risk requires different practices than managing software library upgrades.

By Ramiro Enriquez

Software library upgrades are predictable: the version number changes, a changelog documents what changed, and teams can choose when to upgrade. AI model upgrades are different. Model providers update their underlying models regularly, sometimes without announcement, sometimes without changing the API version string, and often without comprehensive documentation of what specifically changed in model behavior. The same API endpoint that returned reliable outputs last month may behave differently today.

This is not a theoretical concern. Production AI systems across industries have experienced silent quality degradations, unexpected behavior changes on edge cases, and latency shifts following unannounced model updates. The teams that handled these well had practices in place before the upgrade happened. The teams that did not found out through user complaints.

Why model upgrades are different from software upgrades

When a software library upgrades from version 3.1 to 3.2, the behavior change is bounded by the library’s interface contract. If the function signature is the same, the outputs for the same inputs should be the same or explicitly documented as changed. Version pinning means you can stay on 3.1 and choose when to move.

AI model upgrades do not have this property. The model is a statistical function with no formal contract on specific outputs. A new version of a model might produce better outputs on average according to the provider’s benchmarks while producing worse outputs for your specific use case, your specific prompts, and your specific edge cases. The provider is not wrong to ship it; it performs better on the metrics they track. But your production system may experience it as a regression.

Several categories of change occur during model upgrades:

Behavioral drift on edge cases. The model handles common inputs similarly before and after the upgrade but diverges on unusual inputs. These are the hardest to catch because your evaluation suite may not cover the long tail of inputs your production system encounters.

Format changes. The model’s output format shifts slightly: different punctuation conventions, different capitalization, different list formatting. If downstream systems parse model outputs with brittle string parsing, these changes break things that were working.

Latency and cost shifts. Model upgrades sometimes change the token efficiency of outputs and the inference speed. A system designed around a specific latency budget may exceed it after an upgrade, not because of a bug, but because the model now generates longer or more detailed responses.

Safety and refusal changes. Provider safety tuning evolves with model versions. A prompt that produced usable outputs before may now be refused or returned with heavy caveats. This is particularly common in models used in creative or customer service contexts.

Build the evaluation harness before you need it

The most important practice for managing model upgrades is building an evaluation harness before any upgrade happens. A harness is only useful if it exists when you need it; building it after a regression is discovered means investigating a problem without the tool designed to investigate problems.

A minimal evaluation harness for a production AI feature includes:

A golden dataset. A set of representative inputs from production, covering common cases and known edge cases, with expected outputs recorded. The expected outputs do not need to be a single correct answer; they can be behavioral specifications (should produce a response under 200 words, should mention the customer’s name, should not include pricing information) or human-reviewed examples of acceptable outputs.

A comparison mechanism. Something that runs the current model and the candidate model against the golden dataset and surfaces the differences. This does not need to be sophisticated; a diff of outputs with human review is a valid starting point. More mature versions include automated scoring on dimensions that matter for your use case.

Regression thresholds. Explicit criteria for what constitutes a regression. This forces you to define what you care about before seeing results, which reduces the temptation to rationalize a regression as acceptable because the new model otherwise seems better.

Running the harness on a regular cadence against the production model version, even without an active upgrade, creates a baseline and catches silent drift when providers update models without announcement.
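As a rough illustration, the three pieces of the harness can fit in a few dozen lines. The sketch below is one possible shape, not a prescribed design: GoldenCase, pass_rate, is_regression, the 2-point threshold, and the two stand-in model functions are all hypothetical names and values chosen for the example. The checks mirror the behavioral specifications mentioned above.

```python
from dataclasses import dataclass
from typing import Callable

# A check receives the model's output text and returns True if it passes.
Check = Callable[[str], bool]


@dataclass
class GoldenCase:
    name: str
    prompt: str
    checks: dict[str, Check]  # behavioral specifications, keyed by a label


def pass_rate(model: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of all checks that the model's outputs satisfy."""
    total = passed = 0
    for case in cases:
        output = model(case.prompt)
        for check in case.checks.values():
            total += 1
            passed += check(output)
    return passed / total if total else 1.0


def is_regression(current: float, candidate: float, max_drop: float = 0.02) -> bool:
    """The threshold (an assumed value here) is fixed before looking at results."""
    return (current - candidate) > max_drop


# Hypothetical stand-ins for real provider calls.
def current_model(prompt: str) -> str:
    return "Hi Dana, your order has shipped."


def candidate_model(prompt: str) -> str:
    return "Hello! Your order has shipped. Our premium plan costs $20."


cases = [
    GoldenCase(
        name="shipping_update",
        prompt="Write a shipping update for Dana.",
        checks={
            "under_200_words": lambda out: len(out.split()) < 200,
            "mentions_name": lambda out: "Dana" in out,
            "no_pricing": lambda out: "$" not in out,
        },
    )
]

current_score = pass_rate(current_model, cases)
candidate_score = pass_rate(candidate_model, cases)
print(f"current={current_score:.2f} candidate={candidate_score:.2f} "
      f"regression={is_regression(current_score, candidate_score)}")
```

In this toy run the candidate drops the customer’s name and mentions pricing, so it fails two of three checks and is flagged. A real harness would swap the stand-ins for actual model calls and store the per-check results for human review.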

Shadow mode testing for model changes

Before routing any production traffic to a new model version, run it in shadow mode: every production request goes to both the current model and the candidate, the current model’s response is served to the user, and the candidate’s response is logged for comparison.

Shadow mode lets you observe how the candidate model behaves on actual production inputs without affecting users. The comparison between current and candidate responses reveals whether the upgrade would cause regressions at production scale, not just on your golden dataset.

The practical implementation is simpler than it sounds. Most AI providers support simultaneous API calls; shadow mode is a wrapper that sends each request to both endpoints, returns the primary response, and logs the secondary for async analysis. The logging infrastructure is more work than the parallel call; you need something that stores both responses, links them to the same input, and makes comparison easy.

Shadow mode has a cost: you are running twice the AI calls for the duration of the test. For high-volume systems, this may be significant. Shadowing a sample of production traffic rather than every request is a reasonable tradeoff in those cases.
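A minimal sketch of such a wrapper follows, assuming each model is exposed as a plain function from prompt to text. ShadowRunner, its sample_rate parameter, and the log callback are hypothetical names for this example; a production version would also need timeouts and a durable log store.

```python
import random
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional

ModelFn = Callable[[str], str]


class ShadowRunner:
    """Serves the primary model's response and logs the candidate's for comparison."""

    def __init__(self, primary: ModelFn, candidate: ModelFn,
                 log: Callable[[dict], None], sample_rate: float = 1.0,
                 rng: Optional[random.Random] = None) -> None:
        self.primary = primary
        self.candidate = candidate
        self.log = log
        self.sample_rate = sample_rate  # fraction of requests to shadow
        self.rng = rng or random.Random()
        self._pool = ThreadPoolExecutor(max_workers=4)

    def handle(self, prompt: str) -> str:
        shadow: Optional[Future] = None
        if self.rng.random() < self.sample_rate:
            # Start the candidate call in the background, alongside the primary.
            shadow = self._pool.submit(self.candidate, prompt)
        primary_output = self.primary(prompt)
        if shadow is not None:
            # Log once the candidate finishes; the caller does not wait for it.
            shadow.add_done_callback(
                lambda fut: self._record(prompt, primary_output, fut)
            )
        return primary_output  # the user only ever sees the primary response

    def _record(self, prompt: str, primary_output: str, fut: Future) -> None:
        exc = fut.exception()
        self.log({
            "prompt": prompt,  # links both responses to the same input
            "primary": primary_output,
            "candidate": None if exc else fut.result(),
            "candidate_error": repr(exc) if exc else None,
        })

    def close(self) -> None:
        """Wait for in-flight shadow calls to finish."""
        self._pool.shutdown(wait=True)
```

Setting sample_rate below 1.0 shadows only that fraction of requests. A failing candidate call is recorded as an error rather than raised, so a broken candidate cannot affect what users receive.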

Phased rollout for model changes

If shadow mode testing shows acceptable results, move to a phased rollout rather than switching all traffic at once. Route a small percentage of production traffic to the new model version, monitor quality and error metrics, and increase the percentage over time.

The phased rollout serves a different purpose than shadow mode. Shadow mode reveals what the new model does differently on the same inputs. A phased rollout reveals whether real users, interacting with a real system, experience the change as a regression. Users sometimes accept or reject outputs for reasons that are not obvious from static comparison; actual user behavior in production reveals things that shadow mode cannot.

Practical phased rollout requires a few things: a way to route specific requests to specific model versions, per-version observability so you can compare metrics between the current and candidate, and a fast path to shift traffic back to the current version if something goes wrong.
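One way to sketch those three requirements is shown below. It assumes requests carry a stable key such as a user ID; the Rollout class, its method names, and the model identifiers in the usage note are hypothetical. Hashing the key keeps each user on the same version as the percentage grows.

```python
import hashlib
from collections import Counter


def bucket(request_key: str) -> int:
    """Map a stable key (for example a user ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


class Rollout:
    def __init__(self, current: str, candidate: str, candidate_percent: int = 0) -> None:
        self.current = current
        self.candidate = candidate
        self.candidate_percent = 0
        self.outcomes: Counter = Counter()  # per-version success and error counts
        self.set_percent(candidate_percent)

    def set_percent(self, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.candidate_percent = percent

    def choose(self, request_key: str) -> str:
        """Return the model version that should serve this request."""
        if bucket(request_key) < self.candidate_percent:
            return self.candidate
        return self.current

    def record(self, version: str, ok: bool) -> None:
        self.outcomes[(version, "ok" if ok else "error")] += 1

    def error_rate(self, version: str) -> float:
        errors = self.outcomes[(version, "error")]
        total = errors + self.outcomes[(version, "ok")]
        return errors / total if total else 0.0

    def roll_back(self) -> None:
        """Fast path: send all traffic back to the current version."""
        self.set_percent(0)
```

A caller might construct Rollout("model-current", "model-candidate", candidate_percent=5), call choose(user_id) for each request, record the outcome, and call roll_back() if the candidate’s error rate pulls away from the current version’s.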

What rollback actually means for AI

Rollback for software deployments is usually straightforward: deploy the previous version of the application. Rollback for AI model changes is more complicated because the change may not be fully under your control.

If you have pinned to a specific model version and the provider still serves it, rolling back is as simple as changing the version parameter. Many providers do not guarantee long-term availability of specific model versions, which limits this option.

When a specific version is unavailable, the practical rollback options are:

Prompt engineering workarounds. Modify prompts to elicit the behavior the old model version produced automatically. This works when the change is behavioral; it does not work when the change is in capability.

Output post-processing. Add post-processing to enforce format constraints that the new model no longer produces consistently. This works for format regressions; it does not work for quality regressions.

Cached response serving. For high-volume, low-personalization features, fall back to cached responses from before the upgrade while the regression is diagnosed. This is a short-term measure, not a solution.

Feature degradation. Temporarily reduce the feature to a simpler, non-AI version while the regression is resolved. This is the nuclear option and the most reliable one if the regression is severe.
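Two of these options, output post-processing and feature degradation, combine naturally into a fallback chain. The sketch below assumes a hypothetical ticket classifier that expects the model to return a JSON object with a "label" field; the label set, keyword rules, and function names are invented for illustration.

```python
import json
import re
from typing import Callable, Optional, Tuple

ALLOWED_LABELS = {"billing", "shipping", "other"}  # hypothetical label set


def extract_json(raw: str) -> Optional[dict]:
    """Post-processing: recover a JSON object even if the model wraps it in prose."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def rule_based_label(text: str) -> str:
    """Feature degradation: a simple non-AI classifier used when the model fails."""
    lowered = text.lower()
    if "refund" in lowered or "charged" in lowered:
        return "billing"
    if "delivery" in lowered or "package" in lowered:
        return "shipping"
    return "other"


def classify(text: str, model: Callable[[str], str]) -> Tuple[str, str]:
    """Return (label, source), where source is 'model' or 'fallback'."""
    try:
        parsed = extract_json(model(text))
    except Exception:
        parsed = None  # treat a failed model call like an unusable output
    label = parsed.get("label") if parsed else None
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label, "model"
    return rule_based_label(text), "fallback"
```

Returning the source alongside the label makes it possible to count how often the fallback fires, which is itself a useful regression signal.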

Pinning versus tracking

Model providers offer different interfaces for version management. Some allow explicit pinning to a named snapshot of a model (often identified by a date-stamped version like gpt-4-0125-preview). Others provide only a rolling “latest” endpoint that always serves the current model.

Where pinning is available, pin your production deployments. Accept the overhead of evaluating and migrating to new versions on your schedule rather than the provider’s. The cost is a regular migration process; the benefit is that behavior changes reach production when you choose, not when the provider does.

Where pinning is not available or the provider discontinues pinned versions, the evaluation harness and shadow mode process become even more important. You cannot control when the model changes, so you need fast detection of changes and a fast response path.
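Fast detection can be as simple as comparing each scheduled evaluation run against a rolling baseline. The sketch below assumes some scheduled job already produces a pass rate between 0 and 1 for each run; DriftMonitor, the seven-run window, and the 5-point threshold are hypothetical choices for the example.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flags an evaluation run whose pass rate falls well below the recent baseline."""

    def __init__(self, baseline_runs: int = 7, max_drop: float = 0.05) -> None:
        self.history: deque = deque(maxlen=baseline_runs)
        self.max_drop = max_drop  # assumed threshold; tune to your own tolerance

    def observe(self, pass_rate: float) -> bool:
        """Record a run and return True if it looks like drift."""
        drifted = bool(self.history) and (mean(self.history) - pass_rate) > self.max_drop
        if not drifted:
            # Only healthy runs update the baseline, so a degraded model
            # cannot quietly become the new normal.
            self.history.append(pass_rate)
        return drifted
```

A drifted run would typically trigger an alert and a closer look at the individual failing cases, since the aggregate number only says that something changed, not what.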

The teams that manage AI model upgrades well have usually invested in these practices not because they predicted a specific failure, but because they recognized that AI model behavior is a production dependency they do not fully control. The practices exist to transfer some of that control back to the engineering team.
