Queues, Persistent Workflows, or Stateless APIs: AI Agent Production Architecture Patterns
May 28, 2026
Most AI agents in production are still a Python while loop on a Hetzner box. It works for the demo, survives the first investor call, then dies the first time someone restarts the VM or kills the process for a deploy.
The result: state lost on SIGTERM, duplicate charges from idempotency gaps, and on-call pages at 3 AM instead of feature work. Real production agents don't run in one process. They use one of three patterns: stateless API, queue plus worker, or persistent workflow engine. This guide compares all three and shows where the inference layer sits underneath each.
The Three Patterns at a Glance
Before picking, you need a clean mental model of what each pattern actually owns. Pattern selection is mostly a function of task duration and crash tolerance.
| Pattern | Task duration | State lives in | Survives restart? | Best for |
|---|---|---|---|---|
| Stateless API | < 30s | Request only | N/A (no state) | Sync chat, single-shot retrieval |
| Queue + worker | 30s to 30min | Queue + DB | Yes (queue replays) | Async batch jobs, fan-out |
| Persistent workflow | 30min to days | Workflow engine | Yes (event-sourced replay) | Long agents, human-in-the-loop |
The pattern isn't a religion. Most mature systems run two of them together. The mistake is picking none and shipping a single-process loop.
Why "Python Loop on a Single VM" Always Breaks
It feels productive. You wrap an agent in while True:, hit deploy, and call it shipped. Here's what goes wrong, in the order you'll see it.
- No horizontal scaling. One VM equals one process. Traffic doubles, you're stuck vertically scaling until the box runs out.
- No fault tolerance. Any crash, OOM, or kill -9 loses in-flight state. There's no resume.
- No observability. Without per-step traces, you can't tell which tool call hung or where tokens blew up.
- No back-pressure. Spikes flood the LLM endpoint, you hit rate limits, and the loop has no retry semantics beyond a bare time.sleep.
- Deploys cause outages. Every git push kills in-flight agent runs.
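That's the floor. Even a small queue plus worker setup addresses most of these failures: workers scale horizontally, the queue absorbs spikes, and an unacknowledged job is redelivered after a crash or deploy.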
Pattern 1: Stateless API
The simplest production shape. Request lands on a FastAPI or Flask endpoint, the agent runs synchronously, response returns. No state outlives the request.
When stateless API fits
- The full task finishes in under 30 seconds (most chat, RAG lookups, single-tool calls).
- You don't need to recover from worker crashes mid-task.
- Horizontal scaling is just "add more replicas behind a load balancer."
What you'll use
FastAPI or Flask for the HTTP layer. Modal works well if you want autoscaling on GPU-backed endpoints without managing infra. Behind it, the LLM call is just one outbound HTTP request to an inference endpoint, so the API stays thin and stateless.
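To make "thin and stateless" concrete, here is a minimal sketch of the handler as a plain function, independent of any web framework. The names ChatRequest, ChatResponse, handle_chat, and fake_model are hypothetical, and fake_model stands in for the single outbound call to the inference endpoint so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChatRequest:
    user_id: str
    message: str


@dataclass(frozen=True)
class ChatResponse:
    reply: str


def handle_chat(request: ChatRequest, call_model: Callable[[str], str]) -> ChatResponse:
    """Handle one request end to end; nothing is stored between calls."""
    # Everything the handler needs arrives in the request itself.
    prompt = f"User {request.user_id} says: {request.message}"
    # In production this would be one outbound HTTP call to the inference endpoint.
    reply = call_model(prompt)
    return ChatResponse(reply=reply)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the inference endpoint (assumption, not a real API)."""
    return f"echo: {prompt}"
```

Because the function keeps no module-level state, any replica behind the load balancer can serve any request, which is what makes "add more replicas" a valid scaling plan.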
What it can't do
If a task takes 90 seconds and your client times out at 60, the client retries while the first attempt is still running. You'll double-bill, double-execute side effects, and confuse the user. The moment a task crosses 30 seconds, switch patterns.
Pattern 2: Queue Plus Worker
The default for production AI workloads in the 30-second to 30-minute range. The API just enqueues a job. Workers pull, execute, write results back.
When queue plus worker fits
- Document processing, batch inference, multi-step RAG pipelines.
- Tasks where the user submits and either polls or gets a webhook.
- Workloads that fan out (1 input becomes 50 sub-tasks).
What you'll use
SQS or Redis Queue for lightweight setups. Celery if you want Python-native task routing. RabbitMQ for stricter delivery guarantees. Workers run on whatever compute makes sense, often a small autoscaled pool.
The hard parts
Queues introduce delivery semantics that bite. At-least-once delivery means your worker will see duplicate messages, so every task needs an idempotency key. Visibility timeouts must exceed worst-case task duration (or be extended while the task runs), or messages re-deliver while still being processed. Dead-letter queues catch poison messages, but only if you actually configure them.
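The duplicate and poison-message handling described above can be sketched with the standard library alone. This is an in-memory toy, not a real broker: run_worker, MAX_ATTEMPTS, the idempotency_key and attempts fields, and the plain dict used as a result store are all assumptions made for illustration.

```python
import queue
from typing import Callable

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered


def run_worker(
    jobs: "queue.Queue[dict]",
    handler: Callable[[dict], str],
    results: dict,
    dead_letters: list,
) -> None:
    """Drain the queue once, tolerating duplicates and poison messages."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        key = job["idempotency_key"]
        if key in results:
            # Duplicate delivery: the work is already done, so skip it.
            continue
        try:
            results[key] = handler(job)
        except Exception:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(job)  # park it for a human to inspect
            else:
                jobs.put(job)  # re-enqueue for another attempt
```

In a real deployment the results dict would be a database table with a unique constraint on the key, and the dead_letters list would be the broker's own dead-letter queue.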
Pattern 3: Persistent Workflow Engine
For agents that run for hours, wait for human approval, or retry across days. The workflow's progress is durable: after a crash, the engine replays the recorded history and resumes from the last completed step.
When persistent workflow fits
- Multi-day onboarding agents, deep research tasks, approval loops.
- Anywhere "wait 48 hours then continue" appears in the logic.
- Compensating transactions (undo step 3 if step 5 fails).
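The compensating-transaction bullet is easier to picture in code. Below is a minimal saga sketch, assuming each step is a hypothetical (name, action, compensation) triple. A workflow engine would persist this progress durably; here it lives in memory only to show the control flow.

```python
from typing import Callable, List, Tuple

# (name, action, compensation): all three are illustrative, not an engine API.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            completed.append(step)
        except Exception:
            log.append(f"failed:{name}")
            # Undo newest-first so later steps are rolled back before earlier ones.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"undone:{done_name}")
            return False
    return True
```

Undoing in reverse order matters: a charge should be refunded before the reservation it depended on is released.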
What you'll use
Temporal is the dominant choice for general-purpose workflow durability. Prefect and Dagster lean toward data-pipeline ergonomics. Airflow still exists, but its scheduler-heavy model fits batch ETL better than long-running agent loops.
The mental shift
Workflow code looks synchronous but isn't. Every external call is recorded to an event log. On crash, the engine replays the log and skips already-completed steps. That means workflow code must be deterministic: no random.random() or direct datetime.now() in workflow code (use the engine-provided helpers instead), and no side effects outside activities.
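A toy replay loop shows why determinism matters. This is not how any particular engine is implemented; ReplayContext, step, and research_workflow are hypothetical names, and the event log is a plain list rather than durable storage.

```python
from typing import Any, Callable, Dict, List


class ReplayContext:
    """Toy event-sourced context: records step results and reuses them on replay."""

    def __init__(self, history: List[Dict[str, Any]]) -> None:
        self.history = history  # stands in for the durable event log
        self.position = 0       # index of the next step this run will reach

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.history):
            event = self.history[self.position]
            if event["name"] != name:
                # The code took a different path than the recorded run did.
                raise RuntimeError(
                    f"non-deterministic replay: expected {event['name']}, got {name}"
                )
            result = event["result"]  # replay: reuse the recorded result
        else:
            result = fn()  # first execution: run the side effect, then record it
            self.history.append({"name": name, "result": result})
        self.position += 1
        return result


def research_workflow(
    ctx: ReplayContext,
    fetch: Callable[[], str],
    summarize: Callable[[str], str],
) -> str:
    doc = ctx.step("fetch", fetch)
    return ctx.step("summarize", lambda: summarize(doc))
```

If a run crashes during summarize, a second run with the same history skips fetch and picks up at summarize. If the workflow code branched on random.random() or the wall clock, the step names could differ between runs and the replay check would fail.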
Decision Framework
| Your situation | Pattern |
|---|---|
| Sync chat, RAG lookups, single-tool agents | Stateless API |
| Batch jobs, document processing, fan-out | Queue + worker |
| Long agents, human-in-the-loop, multi-day tasks | Persistent workflow |
| Mix of all three | Stateless API at the edge + queue for async + workflow for long-running |
Many teams eventually run all three. The API handles user-facing requests, the queue absorbs batch and fan-out, the workflow engine owns anything that has to survive a deploy.
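The duration thresholds from the comparison table can be written down as a small routing helper. This is only a sketch: choose_pattern is a hypothetical function, and treating exactly 30 minutes as workflow territory is a judgment call, since the table's ranges meet at that boundary.

```python
def choose_pattern(expected_seconds: float, waits_on_human: bool = False) -> str:
    """Map a task's expected duration to one of the three patterns."""
    # Anything that pauses for a person belongs in a workflow, whatever its runtime.
    if waits_on_human or expected_seconds >= 30 * 60:
        return "persistent workflow"
    if expected_seconds >= 30:
        return "queue + worker"
    return "stateless api"
```

For example, choose_pattern(5) picks the stateless API, choose_pattern(120) picks queue plus worker, and choose_pattern(10, waits_on_human=True) picks the workflow engine.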
Engineering Reality: What Production Actually Requires
Architecture diagrams hide the failures. Here's what bites teams after the demo.
Idempotency keys are non-negotiable. Every task needs a stable key (often a hash of input plus user ID). Queue retries, workflow replays, and client retries will all hit your handler multiple times. Without a key, you'll double-charge users and double-call LLM endpoints.
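One way to derive such a key, as a minimal sketch: hash the user ID together with a canonical JSON form of the input. The function name idempotency_key and the "user_id:json" layout are assumptions. The approach also assumes the task input is JSON-serializable.

```python
import hashlib
import json


def idempotency_key(user_id: str, task_input: dict) -> str:
    """Derive a stable key from the user ID and a canonical form of the input."""
    # sort_keys and fixed separators make the JSON independent of dict ordering.
    canonical = json.dumps(task_input, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{user_id}:{canonical}".encode("utf-8"))
    return digest.hexdigest()
```

The same user submitting the same input always maps to the same key, regardless of field order, so a retry can be recognized and skipped. If users may legitimately submit identical input twice, the key would also need a client-supplied request ID.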
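At-least-once is the default, exactly-once is a lie. SQS standard queues deliver at-least-once, and Celery does the same once late acknowledgement (acks_late) is enabled. With Celery's default early acknowledgement, a crashed worker can drop a task instead of repeating it. Redis-backed queues vary, so check what yours does when a worker dies mid-job. Either way, assuming exactly-once creates ghost duplicates or silent gaps. Build idempotency at the application layer.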
Dead-letter queues need alerts. A DLQ that nobody watches is a silent data loss machine. Wire DLQ depth to PagerDuty or your alerting tool of choice.
Retry backoff matters. Use exponential backoff with jitter, capped at a sensible maximum number of attempts. A flat 1-second retry storm against an LLM endpoint will trip rate limits and effectively DDoS your own provider.
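A minimal sketch of that retry policy, using the "full jitter" variant where each delay is drawn uniformly between zero and an exponentially growing ceiling. The names backoff_delay and call_with_retries and the default base, cap, and attempt count are illustrative assumptions, not recommendations for any specific provider.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(fn: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, sleeping with jittered exponential backoff between failures."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # A real client would retry only transient errors (timeouts, 429s).
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the last error
            sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

The jitter is what prevents a fleet of workers from retrying in lockstep after a shared outage. The sleep parameter is injectable so the policy can be tested without waiting.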
Workflow replay determinism. Temporal's replay model means any non-deterministic code outside activities breaks replay, because the re-executed code no longer matches the recorded history. Use the engine-provided workflow.now() and workflow.random(), never the stdlib versions.
Tracing across pattern boundaries. OpenTelemetry trace context must propagate from API into the queue payload, then from the worker into the workflow engine, then into the LLM call. Without that, debugging a slow agent run means grepping seven services.
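The API-to-queue hop is where trace context is most often dropped, so here is a hand-rolled sketch of carrying a W3C traceparent value inside the queue payload. In practice you would likely use OpenTelemetry's own propagators rather than this code. The function names and the payload shape are assumptions; only the header layout (version, 32-hex trace ID, 16-hex span ID, 2-hex flags) follows the W3C Trace Context format.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a trace: version 00, random trace ID and span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID and mint a new span ID for the next hop."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        return new_traceparent()  # missing or malformed context: start fresh
    trace_id, _, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def enqueue_payload(task: dict, traceparent: str) -> dict:
    """API side: carry the trace context inside the queue message."""
    return {"task": task, "traceparent": traceparent}


def worker_traceparent(payload: dict) -> str:
    """Worker side: continue the caller's trace instead of starting a new one."""
    return child_traceparent(payload.get("traceparent", ""))
```

The worker's span shares the API's trace ID, so one query in the tracing backend shows the whole run. The same value would then be forwarded into the workflow engine and onto the outbound LLM request.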
Where the Inference Layer Sits
None of these patterns is an inference platform. They orchestrate calls to one. The LLM, image model, or speech model is always one HTTP hop away.
For teams running orchestration on managed compute, an inference API is the cleaner pairing: no GPU ops, per-request billing. For teams self-hosting or doing high-volume inference, dedicated GPU rentals make more sense. GMI Cloud offers both: H100 SXM at $2.00/GPU-hour for self-hosting, plus 100+ models behind one inference API for outbound calls. Check gmicloud.ai/pricing for live rates.
Reasoning-class models like DeepSeek's V-series handle the planning steps. Small-class GPT mini variants handle the cheap classification calls. Routing between them is the orchestrator's job, not the inference layer's.
FAQ
Can I skip queues and just use a workflow engine for everything?
You can, but it's overkill for sub-second tasks. Workflow engines add latency from event-log writes on every step. Use a stateless API for fast paths, queue for medium async, workflow for long-running. Mixing patterns by task duration beats forcing one shape onto everything.
Is Temporal worth the learning curve for a small team?
If your agent runs longer than 30 minutes or needs to survive deploys, yes. If everything finishes in under 60 seconds, Temporal is overhead you don't need yet. Start with a queue, graduate to workflow when long-running tasks appear.
How do I handle LLM rate limits across queue workers?
Centralize rate limiting at the LLM client layer, not per worker. Use a token bucket shared via Redis or a sidecar proxy. Each worker checks out tokens before calling the inference endpoint, so concurrent workers don't collectively burst past the provider's limit.
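The token-bucket logic itself is small. The sketch below keeps the bucket in process memory to stay runnable without dependencies, so it only limits one process. A shared limiter would hold the same two values (token count and last-update time) in Redis and update them atomically. TokenBucket and try_acquire are hypothetical names.

```python
import threading
import time
from typing import Callable


class TokenBucket:
    """In-process token bucket; a shared deployment would keep this state in Redis."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()
        self.lock = threading.Lock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; False means wait or requeue."""
        with self.lock:
            now = self.clock()
            # Refill in proportion to elapsed time, never above capacity.
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False
```

A worker would call try_acquire() before each inference request and back off or requeue the job when it returns False. Setting cost to an estimated token count instead of 1 lets the same bucket enforce a tokens-per-minute limit.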
Does GMI Cloud replace Temporal or Celery?
No. GMI Cloud provides the inference endpoints and GPU compute that orchestration layers call. You still pick your own workflow engine, queue, and API framework. GMI sits one layer below, handling the model calls your agents make.
Colin Mo
