How to handle rate limits in production AI systems

Rate limits are the first infrastructure constraint most AI applications encounter in production. In development, you send a few requests per minute and everything works. In production, you discover that your usage patterns hit provider limits in ways that are hard to predict, and that the behavior when you hit those limits is worse than you designed for.

Understanding rate limits well enough to build systems that handle them gracefully requires understanding what is actually being limited and why, then building the right response to each constraint type.

What gets rate-limited

AI API providers apply rate limits along multiple dimensions simultaneously. Requests per minute (RPM) and tokens per minute (TPM) are the most common, but providers also apply daily caps, per-model caps, and organization-level limits that aggregate across all your API keys.

TPM limits are often the binding constraint in production, not RPM, because AI responses are variable in size. A system that stays well within its request limit can still hit token limits if a batch of requests generates unusually long responses. Systems that only monitor request rates and ignore token consumption often hit unexpected limit errors when response sizes increase.

Some providers also apply separate limits for input tokens and output tokens. A system that sends large context windows (high input token consumption) with short responses has a different constraint profile than one that sends concise prompts but requests long completions. Understanding which limit is actually binding in your workload matters for designing the right mitigation.
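One way to find the binding limit is to compute utilization along each dimension and take the largest. The sketch below assumes a steady workload and uses made-up limits and traffic numbers; the `Limits` class and both function names are illustrative, not part of any provider SDK.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Per-minute limits; the numbers used below are made up for illustration."""
    rpm: int
    input_tpm: int
    output_tpm: int


def utilization(limits: Limits, requests_per_min: float,
                avg_input_tokens: float, avg_output_tokens: float) -> dict:
    """Return the fraction of each limit a steady workload would consume."""
    return {
        "rpm": requests_per_min / limits.rpm,
        "input_tpm": requests_per_min * avg_input_tokens / limits.input_tpm,
        "output_tpm": requests_per_min * avg_output_tokens / limits.output_tpm,
    }


def binding_constraint(usage: dict) -> str:
    """The binding constraint is the dimension with the highest utilization."""
    return max(usage, key=usage.get)


# Hypothetical workload: large prompts, short completions.
limits = Limits(rpm=1_000, input_tpm=400_000, output_tpm=80_000)
usage = utilization(limits, requests_per_min=200,
                    avg_input_tokens=1_500, avg_output_tokens=100)
print(binding_constraint(usage))
```

With these assumed numbers the workload uses 20% of its request limit but 75% of its input token limit, so input tokens are the dimension to mitigate first.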

Tier-based limits add another layer: providers raise limits as your account demonstrates consistent usage and payment history. A limit that constrains you today may not constrain you next month once your account moves to a higher tier. Design your rate limit handling to adapt as your tier improves rather than hardcoding assumptions about specific numbers.

The baseline: retry with exponential backoff

Every AI API client should implement retry with exponential backoff as a baseline. When a request returns a rate limit error (typically HTTP 429), the correct response is to wait and retry rather than immediately failing.

Exponential backoff means doubling the wait time with each retry: wait one second, then two, then four, then eight. Add jitter (a random offset to the wait time) to prevent synchronized retries when multiple clients hit the same limit simultaneously. Without jitter, all clients that hit a limit at the same time will retry at the same time, producing a thundering herd that hits the limit again immediately.
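As a minimal sketch of that schedule, the function below computes the wait before each retry. The one-second base, the 60-second cap, and the choice of adding up to 50% as jitter are assumptions for illustration; other jitter strategies are equally valid.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, plus jitter."""
    source = rng or random
    exponential = min(cap, base * (2 ** attempt))
    # Add a random offset of up to 50% so clients do not retry in lockstep.
    return exponential + source.uniform(0, exponential / 2)


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))
```

The cap keeps a long retry sequence from producing waits of several minutes.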

Most AI provider SDKs implement basic retry logic. Check whether your SDK’s default retry behavior matches your production requirements before building custom retry logic. SDK defaults are usually conservative; production systems often need tuned retry counts and backoff parameters.

Set a maximum retry count and a maximum total wait time. A request that retries for five minutes before failing is worse than one that fails fast and lets the caller handle the failure. The right values depend on your application’s latency requirements.
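The two budgets can be enforced together in one retry wrapper. This is a sketch under stated assumptions: `RateLimitError` is a hypothetical stand-in for whatever your client raises on HTTP 429, and the defaults (five retries, 30 seconds total) are placeholders to tune against your latency requirements.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever your client raises on HTTP 429."""


def call_with_retry(send, max_retries=5, max_total_wait=30.0,
                    base=1.0, sleep=time.sleep):
    """Call `send()`; on RateLimitError, back off until a budget is exhausted."""
    waited = 0.0
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            # Exponential backoff with up to 50% jitter.
            delay = base * (2 ** attempt) * random.uniform(1.0, 1.5)
            # Give up when retries are used up or the next wait would
            # exceed the total time budget; the caller handles the failure.
            if attempt == max_retries or waited + delay > max_total_wait:
                raise
            sleep(delay)
            waited += delay
```

Passing `sleep` as a parameter lets tests replace it with a no-op so the retry path runs instantly.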

Request queuing

For workloads where you control the request rate, queuing is more effective than reactive retry. Rather than sending requests as fast as your application generates them and hitting limits, a queue throttles outgoing requests to stay within your rate limit budget.

A token bucket or sliding window counter tracks your recent consumption and holds new requests when consumption approaches the limit. This keeps you from generating rate limit errors in the first place, rather than handling them after the fact.
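A token bucket can be sketched in a few lines. The class below is illustrative rather than a production limiter: it is not thread-safe, and the `rate` and `capacity` values you pass are assumptions you would derive from your own limits (for a TPM budget, `rate` would be the per-minute limit divided by 60).

```python
import time


class TokenBucket:
    """Holds up to `capacity` units and refills at `rate` units per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.available = capacity
        self.updated = clock()

    def _refill(self) -> None:
        # Credit the units earned since the last update, up to capacity.
        now = self.clock()
        self.available = min(self.capacity,
                             self.available + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` units if available; otherwise leave the bucket untouched."""
        self._refill()
        if cost <= self.available:
            self.available -= cost
            return True
        return False

    def wait_time(self, cost: float = 1.0) -> float:
        """Seconds until `cost` units will be available (0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.available) / self.rate)
```

A queue worker would call `try_acquire` with the estimated token cost of the next request and sleep for `wait_time` when it returns `False`.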

Queuing works well for batch processing, background jobs, and any workload where requests do not need to complete within a specific latency window. It is less appropriate for synchronous user-facing requests where adding queue wait time to response latency is noticeable.

For user-facing workloads, a hybrid approach works: requests that arrive below the rate limit proceed immediately; requests that arrive when you are near the limit either queue with a short timeout or fail fast with a specific error that the client can handle gracefully (showing a “try again in a moment” message, for example).
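The hybrid policy reduces to a three-way decision per request. In this sketch, `estimated_wait_s` is assumed to come from whatever limiter you use, and the two-second queue timeout is an arbitrary placeholder.

```python
from enum import Enum


class Admission(Enum):
    PROCEED = "proceed"  # capacity available: send immediately
    QUEUE = "queue"      # near the limit: wait briefly for capacity
    REJECT = "reject"    # wait would be too long: fail fast


def admit(estimated_wait_s: float, max_queue_wait_s: float = 2.0) -> Admission:
    """Decide what to do with a user-facing request.

    `estimated_wait_s` is how long the limiter expects the request to wait
    for capacity (0 means there is headroom right now).
    """
    if estimated_wait_s <= 0:
        return Admission.PROCEED
    if estimated_wait_s <= max_queue_wait_s:
        return Admission.QUEUE
    return Admission.REJECT
```

A `REJECT` decision would map to the specific error your client turns into a "try again in a moment" message.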

Monitoring rate limit proximity

Rate limit errors are lagging indicators. By the time you see a 429, you have already failed a user request. A better approach is to monitor how close you are to your limits before errors appear.

Many providers include rate limit headers in API responses: the current limit, remaining quota, and reset time. Parsing and recording these headers gives you leading indicators of impending constraint. An alert when you are consuming 80% of your TPM budget gives you time to respond before you start seeing errors.
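A sketch of turning those headers into a proximity metric follows. Header names vary by provider; the two defaults used here are assumptions, so check what your provider actually sends and pass the right names.

```python
from typing import Mapping, Optional


def budget_used(headers: Mapping[str, str],
                limit_header: str = "x-ratelimit-limit-tokens",
                remaining_header: str = "x-ratelimit-remaining-tokens",
                ) -> Optional[float]:
    """Return the fraction of the token budget consumed, or None if unknown."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        limit = int(lowered[limit_header])
        remaining = int(lowered[remaining_header])
    except (KeyError, ValueError):
        return None
    if limit <= 0:
        return None
    return 1 - remaining / limit


def should_alert(used: Optional[float], threshold: float = 0.8) -> bool:
    """Alert once consumption reaches the threshold (80% by default)."""
    return used is not None and used >= threshold
```

Recording `budget_used` as a gauge on every response gives you the time series that the 80% alert is evaluated against.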

Track rate limit proximity separately from error rates. A system with zero rate limit errors that consistently runs at 95% of its limit is one spike away from errors. A system whose rate limit errors are rare and resolve on retry may be functioning acceptably. The two patterns call for different responses.

Multi-provider and multi-model fallback

If your application has latency requirements that make queuing and retry impractical, maintaining the ability to route traffic to a secondary provider or model gives you capacity headroom.

This is not the same as a quality fallback (routing to a less capable model when the primary is unavailable). A rate limit fallback routes to a provider that has available capacity, not to one with degraded capability. If you have rate limit budgets with two providers, you can distribute load across them to stay within each provider’s limits while serving more total traffic.
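One simple way to distribute load is to send each request to the provider with the most remaining headroom. The sketch below assumes you track per-minute token consumption yourself; the `ProviderBudget` class, the provider labels, and the limits are all hypothetical, and resetting `tpm_used` each minute is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProviderBudget:
    name: str            # hypothetical label, e.g. "primary"
    tpm_limit: int       # tokens per minute allowed by this provider
    tpm_used: int = 0    # tokens consumed in the current minute

    def headroom(self) -> int:
        return self.tpm_limit - self.tpm_used


def pick_provider(providers: List[ProviderBudget],
                  cost: int) -> Optional[ProviderBudget]:
    """Choose the provider with the most headroom that can fit `cost` tokens."""
    candidates = [p for p in providers if p.headroom() >= cost]
    if not candidates:
        return None  # no provider has capacity: queue or fail fast
    chosen = max(candidates, key=ProviderBudget.headroom)
    chosen.tpm_used += cost  # reserve the tokens against the chosen budget
    return chosen
```

Returning `None` when no provider has capacity keeps the routing decision separate from the queue-or-reject decision.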

The practical cost is maintaining integrations with multiple providers and ensuring that prompt compatibility, response format, and quality are acceptable from each. For applications where the model is central to quality (creative generation, complex reasoning), multi-provider rate limit fallback is harder to implement without quality degradation. For applications where the model is more interchangeable (classification, extraction, summarization), it is more tractable.

Prioritizing requests under constraint

When you cannot serve all incoming requests within your rate limit budget, you need a policy for which requests to serve and which to delay or drop.

The simplest policy is first-in-first-out within the queue. This is fair but ignores business priority. A batch job that was queued before a real-time user request should not block the user request.

A more useful policy assigns priority levels to request types and processes high-priority requests first. Real-time user-facing requests get high priority; background jobs and batch processing get low priority. Under normal load, all requests are served. Under rate limit pressure, background jobs queue longer while user requests are served immediately.

This requires tagging requests with priority when they enter the queue and processing the queue in priority order rather than arrival order. The implementation complexity is modest; the operational benefit when you hit rate limits under load is significant.
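A priority queue of this kind can be sketched with the standard library's `heapq`. The two priority levels and the class name are illustrative assumptions; a real system might add more levels or aging so that low-priority work is not starved indefinitely.

```python
import heapq
import itertools
from typing import Any

HIGH, LOW = 0, 1  # lower number is served first


class PriorityRequestQueue:
    """Serves requests by priority, then by arrival order within a priority."""

    def __init__(self) -> None:
        self._heap = []
        # Monotonic counter breaks ties so equal priorities stay first-in-first-out.
        self._counter = itertools.count()

    def put(self, request: Any, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def get(self) -> Any:
        """Pop the highest-priority request; raises IndexError if empty."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityRequestQueue()
queue.put("batch-1", LOW)
queue.put("user-1", HIGH)
print(queue.get())  # the user request is served before the earlier batch job
```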

Testing rate limit behavior

Rate limit handling is code that only runs under specific conditions, which means it often goes untested until production. Build explicit test coverage for your rate limit code paths.

Inject 429 responses in your test suite to verify that retry logic, queue behavior, and fallback routing work correctly. Verify that retry with backoff terminates correctly (does not retry indefinitely). Verify that queue overflow (more requests than the queue can hold) fails gracefully rather than silently. Verify that rate limit monitoring correctly reads the response headers your provider actually sends.
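A sketch of the first two checks using `unittest` and a scripted fake transport is shown below. `FakeTransport` and `send_with_retry` are hypothetical stand-ins for your own client code; the point is the shape of the tests, not the client.

```python
import unittest


class FakeTransport:
    """Test double that returns a scripted sequence of HTTP status codes."""

    def __init__(self, statuses):
        self.statuses = list(statuses)
        self.calls = 0

    def send(self):
        self.calls += 1
        # Once the script runs out, keep returning 429 forever.
        return self.statuses.pop(0) if self.statuses else 429


def send_with_retry(transport, max_retries=3, sleep=lambda seconds: None):
    """Hypothetical client under test: retries on 429 up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        status = transport.send()
        if status != 429:
            return status
        if attempt < max_retries:
            sleep(2 ** attempt)
    return 429


class RetryTests(unittest.TestCase):
    def test_recovers_after_injected_429s(self):
        transport = FakeTransport([429, 429, 200])
        self.assertEqual(send_with_retry(transport), 200)
        self.assertEqual(transport.calls, 3)

    def test_retry_terminates(self):
        transport = FakeTransport([])  # every call returns 429
        self.assertEqual(send_with_retry(transport, max_retries=3), 429)
        self.assertEqual(transport.calls, 4)  # one initial try plus three retries
```

Run the tests with `python -m unittest`. Injecting a no-op `sleep` keeps the suite fast while still exercising the backoff path.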

Production rate limit events also benefit from logging that captures the specific limit hit, the request that triggered it, and the resolution (retry succeeded, request failed, routed to fallback). This logging makes rate limit incidents diagnosable after the fact and provides the data needed to tune your limits and queuing parameters over time.
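A structured log record makes those three facts queryable later. The field names and resolution values below are assumptions chosen for illustration; adapt them to your logging schema.

```python
import json
import logging

logger = logging.getLogger("rate_limits")


def log_rate_limit_event(limit_type: str, request_id: str,
                         resolution: str, attempts: int) -> str:
    """Emit one structured record per rate limit event and return the JSON line."""
    record = json.dumps({
        "event": "rate_limit",
        "limit_type": limit_type,    # which limit was hit, e.g. "tpm" or "rpm"
        "request_id": request_id,    # the request that triggered it
        "resolution": resolution,    # "retry_succeeded", "failed", or "fallback"
        "attempts": attempts,
    }, sort_keys=True)
    logger.warning(record)
    return record


log_rate_limit_event("tpm", "req-123", "retry_succeeded", attempts=2)
```

Counting records by `limit_type` and `resolution` over time shows which limit to raise and whether your retry and queue parameters are doing their job.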
