The AI infrastructure decisions that age poorly

Infrastructure decisions made early in a project tend to persist longer than they should. The codebase grows around them, so changing them later touches more code and carries more risk than it would have at the start. The cost of a poor infrastructure decision compounds over time.

AI systems have a specific set of infrastructure decisions that age particularly badly. They are not obvious mistakes at the time they are made. They seem like reasonable choices given the information available early in development. But they carry assumptions about how AI systems work that do not hold as the system matures, and unwinding them later is expensive.

Recognizing these decisions before you make them is far cheaper than unwinding them afterward.

Treating the model as a black box with no observability layer

The most common poor infrastructure decision is building an AI integration with no internal observability: no sampling of inputs and outputs, no quality tracking, no cost attribution per request type.

This seems reasonable at first. The model API is simple: send a request, get a response. The response either satisfies the use case or it does not. Observability feels like an operational concern that can be added later.

The problem is that AI systems degrade in ways that are invisible without observability. Model providers update underlying models in ways that change behavior. The distribution of user inputs shifts as the product grows. A system prompt that worked well for one set of inputs starts producing worse results for a different set. Without observability, you find out about these problems through user complaints rather than through monitoring.

By the time observability feels urgent, you are adding it to a system that was not designed for it. Retrofitting sampling and quality tracking into an existing AI integration is significantly more work than building it in from the start. And the data you need to establish a quality baseline does not exist, because you were not collecting it.

The decision that ages well: instrument inputs, outputs, latency, and cost from day one. Even a simple log of sampled requests with their outputs gives you the baseline you will need later.
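A minimal sketch of what day-one instrumentation could look like, in Python. The names here (`call_with_logging`, `call_model`, `log_sink`, `sample_rate`) are hypothetical and not taken from any particular provider SDK; `call_model` stands in for whatever function actually calls your model. Every request gets a lightweight record with latency and sizes, and a sampled fraction also keeps the full prompt and output.

```python
import json
import random
import time
from typing import Callable, List, Optional


def call_with_logging(
    call_model: Callable[[str], str],  # hypothetical: wraps your real model call
    prompt: str,
    request_type: str,
    log_sink: List[str],               # stand-in for a log file or table
    sample_rate: float = 0.1,
    rng: Optional[random.Random] = None,
) -> str:
    """Call the model and record one JSON log line for the request."""
    rng = rng or random.Random()
    start = time.perf_counter()
    output = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000.0

    # Cheap metadata is recorded for every request.
    record = {
        "request_type": request_type,
        "latency_ms": latency_ms,
        "input_chars": len(prompt),
        "output_chars": len(output),
    }
    # Full text is kept only for a sampled fraction of requests.
    if rng.random() < sample_rate:
        record["prompt"] = prompt
        record["output"] = output

    log_sink.append(json.dumps(record))
    return output
```

In a real system the sink would be a log pipeline or a database table rather than a list, and sampled text may need redaction before it is stored.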

Hardcoding the model

Early AI features often target a specific model by name. The model is chosen for a reason: it produces good results for the use case, the API is familiar, the pricing fits the budget. The model name then goes directly into the code, often as a string literal repeated at every call site.

This creates brittleness that compounds over time. Model providers deprecate models. Better models become available. Cost or latency considerations change which model tier is appropriate. Multi-model strategies emerge where different request types route to different models.

When the model is hardcoded, any of these changes requires finding every place the model is referenced and updating each one, with the associated risk of inconsistency and the maintenance cost of doing it repeatedly.

The decision that ages well: abstract the model behind a configuration layer from the beginning. The model name is a configuration value, not a code value. Changing models means changing configuration, not changing code. This also makes experimentation easier: you can run different models in different environments or for different user segments without code changes.
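As an illustration, the configuration layer can start as a small lookup that maps a request type to a model name. This is a sketch under assumptions: the model names (`example-model-large`, `example-model-small`), the JSON shape, and the function names are all made up for the example.

```python
import json

# Hypothetical configuration. In practice this would live in a config file,
# environment variable, or config service, not in application code.
EXAMPLE_CONFIG = """
{
  "default": "example-model-large",
  "routes": {
    "classification": "example-model-small"
  }
}
"""


def load_model_config(raw: str) -> dict:
    """Parse the model configuration and check that it has a default."""
    config = json.loads(raw)
    if "default" not in config:
        raise ValueError("model config must define a 'default' model")
    return config


def resolve_model(config: dict, request_type: str) -> str:
    """Return the model for a request type, falling back to the default."""
    return config.get("routes", {}).get(request_type, config["default"])
```

Call sites ask `resolve_model(config, "classification")` instead of naming a model, so routing a request type to a different model becomes a one-line configuration change.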

Prompt strings in application code

Related to model hardcoding is prompt hardcoding: system prompts and user prompt templates embedded directly in application code as string literals.

Prompts need to change more often than most other application configuration. They are tuned iteratively. They need to be updated when model behavior changes. Different prompts may need to be tested against each other to determine which produces better outputs. Treating prompts as application code creates friction for all of these activities.

When prompts are in code, changing them requires a code deployment. Testing a prompt change means writing and deploying code. Iterating on prompts during model evaluation requires touching the codebase. The operational overhead pushes teams toward making fewer prompt changes than the system would benefit from.

The decision that ages well: treat prompts as configuration or content, not code. Store them in a location that can be updated without a code deployment. This might be a database, a content management system, a configuration file outside the main application, or a dedicated prompt management layer. The specific mechanism matters less than the separation: prompts should be changeable independently of the code that uses them.

No request-level cost tracking

AI inference has a direct cost per request. That cost varies significantly by input length, output length, and model tier. The cost of a given request type can differ by an order of magnitude depending on these factors.

Teams that do not track cost at the request level often discover cost problems through their invoice rather than through their monitoring. A particular feature, user behavior, or edge case generates unexpectedly expensive requests. By the time the invoice arrives, the cost has already been incurred and finding the source requires reconstruction rather than observation.

Request-level cost tracking is also the data that drives optimization decisions. Which features are driving most of the inference cost? Which users or use cases are disproportionately expensive? What would the cost impact of switching to a smaller model for a specific request type be? Without cost data at the request level, these questions can only be answered with estimates, not measurements.

The decision that ages well: track token counts and estimated cost per request from the beginning. Attribute cost to the feature, request type, or user segment that incurred it. Major model providers generally return token usage in their API responses; the work is in collecting it systematically and making it queryable.
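A sketch of per-request cost attribution follows. The prices and model names are invented for the example and do not reflect any real provider's pricing; the token counts are assumed to come from the usage data in the provider's response.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical prices in USD per million tokens: (input, output).
PRICES_PER_MILLION = {
    "example-model-small": (0.50, 1.50),
    "example-model-large": (5.00, 15.00),
}


@dataclass(frozen=True)
class CostRecord:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def record_cost(feature: str, model: str, input_tokens: int, output_tokens: int) -> CostRecord:
    """Build a cost record for one request from its token counts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return CostRecord(feature, model, input_tokens, output_tokens, cost)


def cost_by_feature(records: Iterable[CostRecord]) -> Dict[str, float]:
    """Sum estimated cost per feature."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.feature] += r.cost_usd
    return dict(totals)
```

Storing one such record per request is enough to answer which feature drives most of the spend, and to estimate what moving a request type to a smaller model would save.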

Synchronous inference everywhere

The default pattern for AI integration is synchronous: the user takes an action, the application calls the AI, the AI responds, the result is returned to the user. This pattern is appropriate for some use cases and inappropriate for others.

For interactive use cases where the user is waiting for the AI response, synchronous inference is correct. For background processing, document analysis, batch operations, or any case where the user does not need the AI result immediately, synchronous inference imposes unnecessary constraints: it ties up application resources during inference, it couples the user experience to inference latency, and it makes the system brittle to inference provider slowdowns.

Teams that build everything synchronously because it is simpler initially find themselves making significant architectural changes when they need to support batch processing, long-running tasks, or resilient background operations. The changes require introducing queuing infrastructure, async processing patterns, and new user-facing states that were not in the original design.

The decision that ages well: identify which AI calls require synchronous responses and which do not, and build the async infrastructure early even if it is not immediately needed. A simple job queue for AI tasks that do not need immediate responses is much easier to introduce before the system is built around synchronous inference than after.
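To make the idea concrete, here is a deliberately small in-process job queue using only the Python standard library. It is a sketch, not a production design: a real system would likely use a durable queue and persistent job state. The function names and the `(status, value)` result shape are assumptions for the example.

```python
import queue
import threading
from typing import Callable, Dict, Optional, Tuple

JobResult = Tuple[str, Optional[str]]  # (status, output or error message)


def submit(jobs: queue.Queue, results: Dict[str, JobResult], job_id: str, prompt: str) -> None:
    """Enqueue an AI task and mark it pending so the UI can show its state."""
    results[job_id] = ("pending", None)
    jobs.put((job_id, prompt))


def start_worker(
    jobs: queue.Queue,
    results: Dict[str, JobResult],
    call_model: Callable[[str], str],  # hypothetical: wraps your real model call
) -> threading.Thread:
    """Start a background thread that processes jobs until it receives None."""

    def run() -> None:
        while True:
            job = jobs.get()
            if job is None:  # sentinel value: shut the worker down
                jobs.task_done()
                break
            job_id, prompt = job
            try:
                results[job_id] = ("done", call_model(prompt))
            except Exception as exc:
                results[job_id] = ("failed", str(exc))
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

The caller returns to the user immediately after `submit` and reads `results[job_id]` later. The explicit `pending`, `done`, and `failed` states are the new user-facing states the design has to account for.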

No fallback behavior for inference failures

AI inference providers have availability and latency characteristics that differ from internal services. Inference can be slow or unavailable due to provider outages, rate limits, or high demand. Teams that do not plan for inference failure end up with systems that fail in user-visible ways when the inference provider has problems.

This often shows up as unhandled exceptions that result in error pages, timeouts that degrade the user experience, or cascading failures where inference slowness causes downstream services to back up. The failure modes are not graceful because no one designed for them.

The decision that ages well: define fallback behavior for every AI call before shipping it. What happens if the AI call fails? What happens if it is slow? For some calls, the fallback is a graceful degradation: the feature is unavailable but the rest of the application works. For others, it is a cached result, a simpler non-AI alternative, or a queue for later processing. The specific fallback depends on the use case; having no fallback is the decision that ages badly.
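The two questions above (what if the call fails, what if it is slow) can be answered in one wrapper. This is a sketch using Python's standard `concurrent.futures`; `call_model` and `fallback` are hypothetical callables supplied by the caller, and the timeout value is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Shared pool for inference calls; the size is an arbitrary choice for the sketch.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(
    call_model: Callable[[str], str],
    prompt: str,
    fallback: Callable[[str], str],
    timeout_s: float = 5.0,
) -> str:
    """Return the model output, or the fallback result on error or timeout."""
    future = _executor.submit(call_model, prompt)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only prevents a call that has not started yet; a call that
        # is already running keeps going in the background and is ignored.
        future.cancel()
        return fallback(prompt)
    except Exception:
        # Provider errors, rate limits, and similar failures land here.
        return fallback(prompt)
```

The `fallback` function is where the use-case-specific decision lives: it might return a cached result, a non-AI alternative, or a marker that tells the UI to hide the feature.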

What these decisions have in common

The infrastructure decisions that age poorly share a structural characteristic: they are easy to skip at the start and expensive to add later because the rest of the system has been built around their absence.

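Observability requires that the code emit data; retrofitting that into existing code is tedious. Model abstraction requires that callers go through an interface; changing direct calls to use an interface touches many files. Prompt externalization requires that the deployment process include a prompt management step; adding that to an existing deployment pipeline has organizational overhead. Request cost tracking requires a data model for cost records; adding that later means backfilling historical data or accepting a gap. Async processing requires queuing infrastructure and new user-facing states; adding those later changes both the architecture and the product. Fallback behavior requires that each call site define what failure looks like; adding that later means revisiting every AI call.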

None of these decisions are technically difficult. They are easy to implement at the start and progressively harder to implement as the system grows. The teams that make them early spend a small amount of time on infrastructure that pays dividends for the life of the system. The teams that skip them spend much more time later, under more pressure, fixing problems that were predictable from the beginning.

The list of decisions that age poorly in AI systems is not long. The ones here are the ones that appear most consistently and cost the most to fix later. Building in these directions from the start is among the highest-leverage investments available early in an AI project.
