How to use feature flags with AI systems
Feature flags are a standard tool for gradual software rollouts, but AI systems introduce dimensions that standard feature flag patterns do not handle well. Prompts, models, and inference configurations need their own flagging approaches.
By Ramiro Enriquez
Feature flags let you change the behavior of a system for a subset of users or traffic without deploying new code. In standard software, this means showing a new UI to 5% of users, or enabling a new API endpoint for internal accounts before rolling it out broadly. The pattern is well understood: a flag is a boolean or enumerated value, a flag evaluation service resolves the value per request based on targeting rules, and the code branches based on the result.
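As a reference point for the rest of the article, the standard pattern can be sketched in a few lines. This is a minimal illustration, not any particular vendor's API: the `Rule` and `Flag` types, the `evaluate` function, and the `new_endpoint` flag are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    attribute: str   # context key to inspect, e.g. "account_type"
    equals: str      # value that triggers the rule
    value: str       # flag value returned when the rule matches


@dataclass
class Flag:
    name: str
    default: str
    rules: list = field(default_factory=list)


def evaluate(flag: Flag, context: dict) -> str:
    """Return the value of the first matching rule, else the flag default."""
    for rule in flag.rules:
        if context.get(rule.attribute) == rule.equals:
            return rule.value
    return flag.default


# Hypothetical flag: enable a new endpoint for internal accounts only.
new_endpoint = Flag(
    name="new_endpoint",
    default="off",
    rules=[Rule(attribute="account_type", equals="internal", value="on")],
)

if evaluate(new_endpoint, {"account_type": "internal"}) == "on":
    pass  # the code branches here to serve the new endpoint
```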
AI systems introduce dimensions that this pattern does not handle cleanly. A model is not a boolean. A prompt is not a feature. The quality characteristics you want to measure during rollout (does the AI response meet user needs?) are harder to quantify than the quality characteristics of regular software features (did the page load? did the API return a 200?). The rollback unit is not a code deploy; it may be a model version, a prompt version, a system configuration, or all three simultaneously.
Building effective feature flag infrastructure for AI systems requires extending the standard pattern to handle these dimensions.
What you are actually flagging
In a standard application, feature flags control code paths. In an AI system, there are four distinct things that might change independently, each of which may need its own flag:
The model itself. When you move from one model version to another (a major version upgrade, a different provider, a fine-tuned variant), you want to gradually migrate traffic to verify quality before full rollout. Model flags are often the most consequential because they affect every request that touches that model.
The prompt or system instructions. Prompts are the most frequently changed component of an AI system. A prompt change can meaningfully affect output quality, tone, format, and safety characteristics. A flag that controls which prompt version is served for a given request lets you test prompt changes against a subset of traffic before rolling them out.
Inference configuration. Temperature, max tokens, response format, and tool configurations affect output behavior without changing the model or prompt. These parameters are often tuned in production based on observed behavior, and tuning one at a time for a subset of traffic is safer than changing configuration globally.
Post-processing and response handling. How AI output is filtered, formatted, and validated before being returned to the user can be flagged independently of the AI call itself. A new output validation rule or response transformation can be rolled out progressively.
These four dimensions interact: a prompt change may interact differently with a new model than with the old one. Flag them as independently as possible, but be aware that combinations may need to be tested as units when the interaction is significant.
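One way to keep the four dimensions independent is to resolve them into a single variant object per request, starting from the control values and overriding only the dimensions whose flag is set. The sketch below assumes flags arrive as a plain dictionary; the `AIVariant` type, the model and prompt identifiers, and the post-processor names are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AIVariant:
    model: str                  # placeholder model identifier
    prompt_version: str         # placeholder prompt version label
    inference: dict = field(default_factory=dict)
    post_processors: tuple = ()


CONTROL = AIVariant(
    model="model-a",
    prompt_version="summarize-v13",
    inference={"temperature": 0.2, "max_tokens": 512},
    post_processors=("strip_whitespace",),
)


def resolve_variant(flags: dict) -> AIVariant:
    """Override only the flagged dimensions; everything else stays on control."""
    return AIVariant(
        model=flags.get("model", CONTROL.model),
        prompt_version=flags.get("prompt_version", CONTROL.prompt_version),
        # Inference settings are merged key by key so one parameter can be tuned alone.
        inference={**CONTROL.inference, **flags.get("inference", {})},
        post_processors=flags.get("post_processors", CONTROL.post_processors),
    )


# A prompt-only experiment leaves model, inference, and post-processing on control.
variant = resolve_variant({"prompt_version": "summarize-v14"})
```

When a prompt and a model need to be tested as a unit, the same structure works: set both keys in the same flag value so they are always served together.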
Targeting and traffic allocation
Standard feature flag targeting rules (user ID, account type, region, random percentage) apply to AI flags but need some additions.
Request characteristics matter for AI flags in ways they often do not for regular feature flags. A prompt that works well for simple queries may degrade on complex ones. A model that performs well on short-context requests may struggle with long-context ones. Consider adding request characteristics to your targeting rules: estimated token count, topic category, user history with the feature, or any signal that correlates with request complexity or type.
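A targeting rule based on request characteristics might look like the sketch below. The token estimate is a rough heuristic (roughly four characters per token is an assumption, not a real tokenizer), and the topic labels, the 200-token threshold, and the version names are illustrative only.

```python
def estimate_tokens(text: str) -> int:
    """Rough size estimate; swap in your model's tokenizer for real use."""
    return max(1, len(text) // 4)


# Hypothetical topic labels considered low risk for an early prompt rollout.
LOW_RISK_TOPICS = {"general", "how_to"}


def prompt_version_for(request_text: str, topic: str) -> str:
    """Serve the new prompt only to short, low-risk requests at first."""
    if estimate_tokens(request_text) <= 200 and topic in LOW_RISK_TOPICS:
        return "v14"  # treatment
    return "v13"      # control
```

Starting with simple requests and widening the rule later gives you quality data on the easy cases before the new prompt sees long or complex ones.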
Traffic allocation for AI flags should generally be more conservative than for regular feature flags, especially at the model level. The difference in output quality between two model versions can be significant, and quality regressions affect user trust in ways that are harder to recover from than, say, a UI regression. Start model changes at 1-2% of traffic and expand only after quality metrics are confirmed, even though that is a slower ramp than most teams default to.
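Percentage rollouts are usually implemented by hashing a stable identifier into a bucket, so the same user stays in the same group as the percentage grows. The sketch below is one common approach rather than a prescribed one; the function names and the `model_v2` flag name are hypothetical.

```python
import hashlib


def bucket(flag_name: str, unit_id: str) -> float:
    """Map (flag, unit) to a stable value in [0, 100) with 0.01 resolution."""
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def in_rollout(flag_name: str, unit_id: str, percent: float) -> bool:
    """True if this unit falls inside the first `percent` of buckets."""
    return bucket(flag_name, unit_id) < percent


# Start a model change at 2% of users; raising the number later only adds users.
serve_new_model = in_rollout("model_v2", "user-1842", 2.0)
```

Because the flag name is part of the hash input, a user who lands in the early cohort for one flag is not automatically in the early cohort for every other flag.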
Holdback populations are more important for AI than for regular software. A holdback is a fixed percentage of traffic that always receives the control version, even after a flag is fully rolled out. This lets you detect degradations in the new version relative to a baseline without relying purely on time-series analysis. Quality regressions in AI systems can be subtle and slow-moving; a holdback makes them detectable.
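A holdback can be layered on top of a percentage rollout by checking a separately salted bucket first. This is a sketch under assumptions: the 5% holdback size, the salt format, and the group labels are illustrative choices, not recommendations from a specific tool.

```python
import hashlib

HOLDBACK_PERCENT = 5.0  # assumed size; pick what your traffic volume supports


def _bucket(salt: str, unit_id: str) -> float:
    """Stable value in [0, 100) for a given salt and unit."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def assign(flag_name: str, unit_id: str, rollout_percent: float) -> str:
    """Return 'holdback_control', 'treatment', or 'control' for this unit."""
    # The holdback uses its own salt, so membership does not shift as the
    # rollout percentage changes, even at 100%.
    if _bucket(f"{flag_name}:holdback", unit_id) < HOLDBACK_PERCENT:
        return "holdback_control"
    if _bucket(flag_name, unit_id) < rollout_percent:
        return "treatment"
    return "control"
```

With `rollout_percent` at 100, every unit is either in treatment or in the holdback, which gives you a standing baseline to compare against.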
Evaluation during rollout
The critical question during an AI feature flag rollout is: is the new version better, worse, or the same? This is harder to answer for AI than for regular features because the quality signal is often qualitative.
Define your evaluation criteria before starting the rollout. What does “better” mean for this specific use case? For a summarization system, it might be summary length, factual accuracy, or user engagement with the summary. For a code assistant, it might be acceptance rate on suggestions, or downstream error rates in code that used AI suggestions. For a customer service bot, it might be resolution rate or escalation rate. If you cannot define the metric before rollout, you will not be able to evaluate the rollout reliably.
Pair automatic metrics with periodic human evaluation. Automatic metrics (length, sentiment, engagement signals) catch obvious regressions quickly but miss subtler quality changes. Periodic human evaluation of a sample of responses from both the control and treatment variants provides the signal that automatic metrics miss. The sample does not need to be large; a structured review of 20-30 responses from each variant per week is often enough to catch meaningful quality differences.
Track evaluation results per flag and per segment, not just in aggregate. If a flag is running on 10% of traffic and quality metrics look acceptable in aggregate, check whether quality differs by user segment or request type. A prompt change that improves average quality but degrades quality for a specific user segment may not be visible in aggregate metrics.
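The segment breakdown can be as simple as grouping a quality signal by segment and variant. The sketch below uses suggestion acceptance as the signal; the record fields (`segment`, `variant`, `accepted`) and the sample numbers are invented for illustration.

```python
from collections import defaultdict


def acceptance_by_segment(records):
    """Return {(segment, variant): acceptance rate} from per-request records."""
    counts = defaultdict(lambda: [0, 0])  # key -> [accepted, total]
    for r in records:
        key = (r["segment"], r["variant"])
        counts[key][0] += int(r["accepted"])
        counts[key][1] += 1
    return {key: accepted / total for key, (accepted, total) in counts.items()}


def make_records(segment, variant, accepted, total):
    """Helper for the example: `accepted` of `total` requests were accepted."""
    return [
        {"segment": segment, "variant": variant, "accepted": i < accepted}
        for i in range(total)
    ]


# Invented data: treatment wins overall but loses on long-context requests.
records = (
    make_records("short", "control", 60, 100)
    + make_records("long", "control", 16, 20)
    + make_records("short", "treatment", 80, 100)
    + make_records("long", "treatment", 10, 20)
)
rates = acceptance_by_segment(records)
```

In this invented data set the overall acceptance rate rises from 76 of 120 to 90 of 120, while the `long` segment falls from 0.8 to 0.5, which is the pattern aggregate metrics hide.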
Rollback
Rollback for AI system changes is typically faster and cleaner than rollback for schema migrations or data transformations, because you are usually changing configuration rather than data. Flipping a flag back to the control version reverts the behavior for new requests as soon as the flag state propagates, without requiring a code deploy.
The complication is that AI output may have already been stored, sent, or acted upon by the time you detect a quality regression. If your system stores AI-generated content (summaries, recommendations, drafts), a rollback stops generating bad output but does not fix existing bad output. Consider whether stored AI output needs to be flagged as generated by a specific model or prompt version so it can be identified and reprocessed if a regression is detected.
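Tagging stored output with its provenance makes the cleanup query straightforward. The following is a sketch with hypothetical field names; in practice these would likely be columns on whatever table or document store holds the generated content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredOutput:
    record_id: str
    content: str
    model: str            # model identifier in effect when this was generated
    prompt_version: str   # prompt version in effect when this was generated


def affected_by(outputs, model=None, prompt_version=None):
    """Return stored outputs produced by the given model or prompt version."""
    return [
        o for o in outputs
        if (model is not None and o.model == model)
        or (prompt_version is not None and o.prompt_version == prompt_version)
    ]


stored = [
    StoredOutput("r1", "summary text", "model-a", "v13"),
    StoredOutput("r2", "summary text", "model-a", "v14"),
    StoredOutput("r3", "summary text", "model-b", "v14"),
]

# After a regression is traced to prompt v14, find what needs reprocessing.
to_reprocess = affected_by(stored, prompt_version="v14")
```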
Build flag state into your incident response runbook for AI systems. When a quality regression is detected, the first intervention is usually to revert the relevant flag to the control version, not to roll back a code deploy. Teams that do not have this workflow documented often spend the first hour of an incident looking for a code change to revert when the actual change was a configuration flag.
Infrastructure considerations
Flag evaluation should be fast and local. AI requests are often already latency-sensitive; adding a remote flag evaluation call in the hot path adds latency and introduces a dependency that can fail. Cache flag state locally (per process or per container) with a short TTL, and use a polling model to refresh the cache rather than evaluating flags remotely per request.
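A local cache with background polling can be sketched as follows. The `fetch` callable stands in for whatever client your flag service provides and is an assumption, as are the class name and the 15-second interval; the polling interval plays the role of the TTL.

```python
import threading


class LocalFlagCache:
    """In-process flag state, refreshed by polling instead of per-request calls."""

    def __init__(self, fetch, defaults=None):
        self._fetch = fetch                    # hypothetical: returns a dict of flags
        self._flags = dict(defaults or {})     # served until the first refresh
        self._lock = threading.Lock()

    def refresh(self) -> bool:
        """Pull fresh state; on failure keep serving the last known state."""
        try:
            fresh = self._fetch()
        except Exception:
            return False
        with self._lock:
            self._flags = dict(fresh)
        return True

    def get(self, name, default=None):
        """Local lookup only; never touches the network on the request path."""
        with self._lock:
            return self._flags.get(name, default)

    def start_polling(self, interval_seconds: float = 15.0) -> threading.Event:
        """Refresh in a daemon thread; set the returned event to stop polling."""
        stop = threading.Event()

        def loop():
            # Event.wait returns False on timeout, True once stop is set.
            while not stop.wait(interval_seconds):
                self.refresh()

        threading.Thread(target=loop, daemon=True).start()
        return stop
```

The trade-off is staleness: a flag flip, including a rollback, takes up to one polling interval to reach every process, so keep the interval short enough for your incident response needs.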
Store flag state alongside AI request logs. When you review a request to understand why a particular response was generated, you need to know which model, prompt version, and configuration were in effect at the time. Log the resolved flag state for each request as part of your AI observability infrastructure.
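In practice this can be a single structured log line per request. The sketch below uses Python's standard `logging` and `json` modules; the logger name and field names are illustrative, and the function returns the line only to make the example easy to inspect.

```python
import json
import logging

logger = logging.getLogger("ai_requests")  # illustrative logger name


def log_ai_request(request_id: str, resolved_flags: dict, latency_ms: float) -> str:
    """Emit one JSON line that records what was actually served."""
    record = {
        "request_id": request_id,
        "flags": resolved_flags,   # the resolved values, not the targeting rules
        "latency_ms": latency_ms,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


line = log_ai_request(
    "req-123",
    {"model": "model-a", "prompt_version": "v14", "temperature": 0.2},
    latency_ms=840.0,
)
```

Logging the resolved values rather than the flag names alone means the record stays interpretable after the flag's targeting rules change.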
Version your prompts and configurations in a way that flag states can reference. A flag that points to “prompt version 14” is more reproducible than a flag that points to a mutable value. If you change the prompt that a flag references without incrementing the version, you lose the ability to compare what was actually served during the rollout.
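One way to enforce this is a registry that refuses to overwrite a published version. This is a minimal in-memory sketch; the `PromptRegistry` class and its method names are hypothetical, and a real system would back it with a database or version control.

```python
class PromptRegistry:
    """Stores prompt text under immutable (name, version) keys."""

    def __init__(self):
        self._versions = {}

    def publish(self, name: str, version: int, text: str) -> None:
        """Register a new version; reject edits to an existing one."""
        key = (name, version)
        if key in self._versions:
            raise ValueError(
                f"{name} version {version} is already published; "
                "publish a new version instead"
            )
        self._versions[key] = text

    def get(self, name: str, version: int) -> str:
        """Look up exactly the text that a flag value refers to."""
        return self._versions[(name, version)]


registry = PromptRegistry()
registry.publish("summarize", 13, "Summarize the text in three sentences.")
registry.publish("summarize", 14, "Summarize the text in three short bullet points.")

# A flag value of 14 now always resolves to the same prompt text.
prompt = registry.get("summarize", 14)
```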