Provider Limits Are an Architecture Problem

Air traffic control, not retry loops — what we learned running hundreds of agents against dynamic provider quotas.

Oct 06, 2025

Archive note: I originally published this on the Vertesia blog on October 6, 2025. I’m importing it here because it captures an earlier step in the thinking that led to my current work on production AI systems.

Running LLM workloads at production scale exposes a specific kind of pain: you don’t actually know your provider capacity until you hit it.

A customer needed to migrate millions of assets, with ten-plus inference calls per item. Every agent was technically well-behaved. The system still thrashed.

That isn’t unusual. In practice, usable throughput can vary with factors like regional demand, model availability, service quotas, provider-side scheduling, and overall system load.

So the throughput you had at 2 PM may be gone at 3 PM, and the usable number is not always visible in advance. You cannot plan for a capacity limit you cannot see.

Air traffic control, not retry loops

The pattern we ended up with looks more like air traffic control than ordinary retry logic. Instead of letting every agent independently slam into provider limits, the system coordinates expensive model calls as a shared resource.

Every agent runs as a durable, long-running workflow. Expensive model calls go through a central rate limiter that issues clearance.

Traditional approach:

agent makes an LLM call
hits a rate limit
retries with exponential backoff
wastes compute or crashes

What we do instead:

agent requests a ticket from the rate limiter
if capacity exists, the call proceeds
if not, the entire workflow suspends
when capacity opens up, the workflow wakes and continues

The goal is to avoid wasting active compute while preserving enough state to resume safely. Workflow suspension isn’t busy-waiting: the workflow pauses, frees its resources, and resumes with context intact no matter how long it slept.
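To make the ticket flow concrete, here is a minimal in-process sketch using Python's asyncio. It is an analogy, not the production design: a suspended coroutine only lives in memory, whereas a durable workflow persists its state. The names TicketLimiter, agent, and run_demo are hypothetical, and the sleep stands in for a model call.

```python
import asyncio


class TicketLimiter:
    """Central limiter: callers wait for a ticket instead of retrying alone."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity   # concurrent calls currently allowed
        self._in_flight = 0
        self._cond = asyncio.Condition()

    async def acquire(self) -> None:
        async with self._cond:
            # The coroutine is suspended here (no polling) until a slot frees up.
            await self._cond.wait_for(lambda: self._in_flight < self._capacity)
            self._in_flight += 1

    async def release(self) -> None:
        async with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()   # wake waiters so they can re-check capacity

    async def set_capacity(self, capacity: int) -> None:
        # Hook for an adaptive controller to raise or lower the limit.
        async with self._cond:
            self._capacity = capacity
            self._cond.notify_all()


async def agent(limiter: TicketLimiter, agent_id: int, log: list) -> None:
    await limiter.acquire()
    try:
        log.append(("start", agent_id))
        await asyncio.sleep(0.01)     # stand-in for the expensive model call
        log.append(("done", agent_id))
    finally:
        await limiter.release()


async def run_demo(n_agents: int = 5, capacity: int = 2) -> list:
    limiter = TicketLimiter(capacity)
    log: list = []
    await asyncio.gather(*(agent(limiter, i, log) for i in range(n_agents)))
    return log
```

Calling asyncio.run(run_demo(5, 2)) runs five agents with at most two calls in flight at once. The waiting agents consume no CPU while they wait, which is the property the workflow version keeps across much longer pauses.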

Learning to surf an unknown wave

The rate limiter has to adapt to capacity that shifts constantly and sometimes dramatically. It needs to discover usable throughput over time and adjust without creating a retry storm.

Dynamic capacity discovery

The adaptive loop:

probes for more capacity during smooth operation
backs off when it hits limits
remembers successful capacity levels for fast recovery
isolates failures to prevent cascades
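One plausible way to implement the probe and back-off steps is additive increase with multiplicative decrease (AIMD), the same idea TCP congestion control uses. That choice is an assumption for illustration; the original system's exact policy is not described here. The class name, field names, and default values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveCapacity:
    limit: float = 4.0        # currently allowed calls per window (illustrative unit)
    floor: float = 1.0        # never drop below this
    ceiling: float = 1000.0   # never probe above this
    probe_step: float = 1.0   # additive increase after a clean window
    backoff: float = 0.5      # multiplicative decrease after a throttle
    last_good: float = 0.0    # most recent limit that completed without throttling

    def on_success(self) -> None:
        # Remember the level that just worked, then probe slightly higher.
        self.last_good = self.limit
        self.limit = min(self.ceiling, self.limit + self.probe_step)

    def on_throttle(self) -> None:
        # The provider pushed back: cut the limit, but keep last_good for recovery.
        self.limit = max(self.floor, self.limit * self.backoff)
```

Starting from a limit of 4, three clean windows raise the limit to 7 and record 6 as the last good level. A single throttle then halves the limit to 3.5 while the remembered level stays at 6.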

Circuit breaker with fast resume

When consecutive failures signal a genuine capacity constraint, the circuit breaker opens to prevent a meltdown. Unlike a traditional circuit breaker that resets to zero, this one remembers the last healthy capacity level and resumes below it when conditions improve — so recovery happens in seconds rather than minutes.
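A rough sketch of a breaker that remembers the last healthy level might look like the following. The threshold, cooldown, and resume fraction are made-up parameters, and FastResumeBreaker is a hypothetical name rather than the production component. The clock is injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class FastResumeBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 5.0,
        resume_fraction: float = 0.8,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.resume_fraction = resume_fraction
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None
        self.last_healthy = 0.0   # capacity level at the most recent success

    def record_success(self, capacity: float) -> None:
        self.consecutive_failures = 0
        self.last_healthy = capacity

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.is_open():
            self.opened_at = self.clock()   # open (or re-open) the breaker

    def is_open(self) -> bool:
        # Open means: tripped, and the cooldown has not elapsed yet.
        return (
            self.opened_at is not None
            and self.clock() - self.opened_at < self.cooldown_s
        )

    def resume_capacity(self) -> float:
        # Call once is_open() turns False: close the breaker and restart
        # just below the last healthy level instead of from the minimum.
        self.opened_at = None
        self.consecutive_failures = 0
        return self.last_healthy * self.resume_fraction
```

The design choice that matters is in resume_capacity: the restart point is derived from remembered state, so the limiter does not have to climb back up from its floor one probe at a time.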

Per-model, per-environment intelligence

Each model-environment combination maintains its own capacity model:

GPT-4 capacity is tracked independently from Claude throughput
production learns separately from staging
different regions adapt to their own constraints
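The isolation described above can be sketched as a registry keyed by model, environment, and region, where each key carries its own learned limit. The key fields, the region string, and the simple halve-on-throttle policy are illustrative assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class CapacityKey(NamedTuple):
    model: str
    environment: str
    region: str


class CapacityRegistry:
    """Keeps one independently learned limit per (model, environment, region)."""

    def __init__(self, initial_limit: float = 4.0, floor: float = 1.0) -> None:
        self._floor = floor
        # Unknown keys start at initial_limit the first time they are touched.
        self._limits = defaultdict(lambda: initial_limit)

    def limit(self, key: CapacityKey) -> float:
        return self._limits[key]

    def on_success(self, key: CapacityKey) -> None:
        self._limits[key] = self._limits[key] + 1.0

    def on_throttle(self, key: CapacityKey) -> None:
        # Only this key backs off; every other key keeps its own limit.
        self._limits[key] = max(self._floor, self._limits[key] * 0.5)
```

With this shape, a throttle on CapacityKey("gpt-4", "production", "us-east") lowers only that entry. The Claude entry in the same environment, and the GPT-4 entry in staging, keep whatever they had learned.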

What actually changed in production

In one customer migration, this pattern changed the operating behavior of the system. Instead of hundreds of agents independently retrying, blocking, or failing, runs could pause when capacity disappeared and resume when throughput returned.

The exact numbers depend on workload shape, region, model, and provider conditions, so I wouldn’t generalize the result into a benchmark. The important lesson was architectural: adaptive flow control beat blind retry logic.

Why this matters

Dynamic quota systems change how infrastructure has to be planned. The assumptions behind fixed-limit planning no longer hold:

you cannot plan capacity based on fixed limits
simple rate limiters with a fixed, configured limit are insufficient
yesterday’s throughput won’t necessarily hold today

This architecture treats capacity as something to discover rather than configure, so agents self-regulate against resources they cannot see.

What that gives you

The practical effect is calmer operation: fewer retry storms, better use of whatever capacity is available, less manual retuning, and long-running jobs that can keep moving without pretending provider capacity is stable.

The takeaway

The lesson is simple: provider capacity is not a local error condition. At scale, it becomes a shared scheduling problem.

If every agent handles that problem alone, the system thrashes. If the platform coordinates capacity centrally, agents can pause, resume, and keep making progress without pretending the world is stable.

This is part of an archive of earlier pieces I wrote on LLMs and agent systems. The newer architectural series picks up where this one leaves off.

Eric Barroca
