Queues, Persistent Workflows, or Stateless APIs: AI Agent Production Architecture Patterns

May 28, 2026

Most AI agents in production are still a Python while loop on a Hetzner box. It works for the demo, survives the first investor call, then dies the first time someone restarts the VM or kills the process for a deploy.

The result: state lost on every SIGTERM, duplicate charges when reruns lack idempotency, and on-call pages at 3 AM instead of feature work. Real production agents don't run in one process. They use one of three patterns: stateless API, queue plus worker, or persistent workflow engine. This guide compares all three and shows where the inference layer sits underneath each.

The Three Patterns at a Glance

Before picking, you need a clean mental model of what each pattern actually owns. Pattern selection is mostly a function of task duration and crash tolerance.

Pattern	Task duration	State lives in	Survives restart?	Best for
Stateless API	< 30s	Request only	N/A (no state)	Sync chat, single-shot retrieval
Queue + worker	30s to 30min	Queue + DB	Yes (queue replays)	Async batch jobs, fan-out
Persistent workflow	30min to days	Workflow engine	Yes (event-sourced replay)	Long agents, human-in-the-loop

The pattern isn't a religion. Most mature systems run two of them together. The mistake is picking none and shipping a single-process loop.

Why "Python Loop on a Single VM" Always Breaks

It feels productive. You wrap an agent in while True:, hit deploy, and call it shipped. Here's what goes wrong, in the order you'll see it.

No horizontal scaling. One VM equals one process. When traffic doubles, your only option is a bigger box, until there is no bigger box.
No fault tolerance. Any crash, OOM, or kill -9 loses in-flight state. There's no resume.
No observability. Without per-step traces, you can't tell which tool call hung or where tokens blew up.
No back-pressure. Spikes flood the LLM endpoint, you hit rate limits, and the loop has no retry semantics beyond a bare time.sleep.
Deploys cause outages. Every git push kills in-flight agent runs.

That's the floor. Even a small queue plus worker setup fixes most of these failures.

Pattern 1: Stateless API

The simplest production shape. Request lands on a FastAPI or Flask endpoint, the agent runs synchronously, response returns. No state outlives the request.

When stateless API fits

The full task finishes in under 30 seconds (most chat, RAG lookups, single-tool calls).
You don't need to recover from worker crashes mid-task.
Horizontal scaling is just "add more replicas behind a load balancer."

What you'll use

FastAPI or Flask for the HTTP layer. Modal works well if you want autoscaling on GPU-backed endpoints without managing infra. Behind it, the LLM call is just one outbound HTTP request to an inference endpoint, so the API stays thin and stateless.

What it can't do

If a task takes 90 seconds and the client (or a proxy in front of you) times out at 60, the client retries while the first run is still going. You'll double-bill, double-execute side effects, and confuse the user. The moment a task crosses 30 seconds, switch patterns.

Pattern 2: Queue Plus Worker

The default for production AI workloads in the 30-second to 30-minute range. The API just enqueues a job. Workers pull, execute, write results back.
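The enqueue, pull, write-back flow can be sketched in a few lines. This is a minimal in-memory illustration, not a production setup: queue.Queue stands in for SQS, Redis, or RabbitMQ, the results dict stands in for a database table, and submit, worker_drain, poll, and run_agent are hypothetical names.

```python
import queue
import uuid

jobs = queue.Queue()   # assumption: stand-in for SQS / Redis / RabbitMQ
results = {}           # assumption: stand-in for a results table in a DB


def run_agent(payload: str) -> str:
    """Hypothetical placeholder for the real agent / LLM call."""
    return payload.upper()


def submit(payload: str) -> str:
    """API handler: enqueue the job and return a job ID immediately."""
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "queued", "output": None}
    jobs.put((job_id, payload))
    return job_id


def worker_drain() -> None:
    """Worker: pull jobs until the queue is empty, writing results back."""
    while True:
        try:
            job_id, payload = jobs.get_nowait()
        except queue.Empty:
            return
        results[job_id] = {"status": "done", "output": run_agent(payload)}


def poll(job_id: str) -> dict:
    """Polling endpoint: report the job's current status."""
    return results[job_id]
```

The API process only ever calls submit and poll, so it stays fast no matter how long run_agent takes.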

When queue plus worker fits

Document processing, batch inference, multi-step RAG pipelines.
Tasks where the user submits and either polls or gets a webhook.
Workloads that fan out (1 input becomes 50 sub-tasks).

What you'll use

SQS or Redis Queue for lightweight setups. Celery if you want Python-native task routing. RabbitMQ for stricter delivery guarantees. Workers run on whatever compute makes sense, often a small autoscaled pool.

The hard parts

Queues introduce delivery semantics that bite. At-least-once delivery means your worker will see duplicate messages, so every task needs an idempotency key. Visibility timeouts must exceed worst-case task duration (or be extended while the task runs), or messages re-deliver while still being processed. Dead-letter queues catch poison messages, but only if you actually configure them.
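To make the visibility-timeout and dead-letter behavior concrete, here is a toy in-memory queue. It is a sketch of the semantics only, loosely modeled on how SQS-style queues behave; ToyQueue, Message, and the max_receives parameter are hypothetical names, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare messages by identity, not by field values
class Message:
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0


@dataclass
class ToyQueue:
    visibility_timeout: float
    max_receives: int
    messages: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(Message(body))

    def receive(self, now: float):
        """Return the next visible message, or None. Does not delete it."""
        for msg in list(self.messages):
            if msg.invisible_until > now:
                continue  # a worker still holds this message
            if msg.receive_count >= self.max_receives:
                # Poison message: park it instead of retrying forever.
                self.messages.remove(msg)
                self.dead_letters.append(msg)
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg: Message) -> None:
        """Ack: the worker finished, so the message is removed for good."""
        if msg in self.messages:
            self.messages.remove(msg)
```

If a worker receives a message at t=0 with a 30-second timeout and is still working at t=31, a second worker gets the same message. That is the duplicate the idempotency key has to absorb.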

Pattern 3: Persistent Workflow Engine

For agents that run for hours, wait for human approval, or retry across days. The workflow code itself is durable. After a crash, execution resumes from the last recorded step.

When persistent workflow fits

Multi-day onboarding agents, deep research tasks, approval loops.
Anywhere "wait 48 hours then continue" appears in the logic.
Compensating transactions (undo step 3 if step 5 fails).

What you'll use

Temporal is the dominant choice for general-purpose workflow durability. Prefect and Dagster lean toward data-pipeline ergonomics. Airflow still exists, but its scheduler-heavy model fits batch ETL better than long-running agent loops.

The mental shift

Workflow code looks synchronous but isn't. Every external call is recorded to an event log. On crash, the engine replays the log and skips already-completed steps. That means workflow code must be deterministic: no random.random() outside engine-provided helpers, no direct datetime.now(), no non-idempotent side effects outside activities.
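The replay idea is easier to see in a toy. The sketch below is not Temporal's API or implementation; ToyWorkflow and its step method are hypothetical, and the event log is a plain list where a real engine would use durable storage. It shows two things: completed steps are read back from the log instead of re-executed, and a replay that asks for a different step than the log recorded is rejected.

```python
class ToyWorkflow:
    """Replays from an event log; completed steps are not re-executed."""

    def __init__(self, event_log: list):
        self.event_log = event_log  # assumption: durable in a real engine
        self.position = 0

    def step(self, name: str, fn):
        if self.position < len(self.event_log):
            recorded_name, result = self.event_log[self.position]
            if recorded_name != name:
                # The code took a different path than the recorded history.
                raise RuntimeError(
                    f"non-deterministic replay: log has {recorded_name!r}, "
                    f"code asked for {name!r}"
                )
            self.position += 1
            return result  # skip execution, reuse the recorded result
        result = fn()
        self.event_log.append((name, result))
        self.position += 1
        return result


calls = []  # records which steps actually executed


def plan():
    calls.append("plan")
    return "outline"


def draft():
    calls.append("draft")
    return "text"


def run(log: list) -> str:
    wf = ToyWorkflow(log)
    wf.step("plan", plan)
    return wf.step("draft", draft)
```

If the process dies after "plan" is logged, calling run again with the same log executes only "draft". A branch on random.random() or datetime.now() could change which step is requested at a given position, which is exactly what the determinism rule guards against.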

Decision Framework

Your situation	Pattern
Sync chat, RAG lookups, single-tool agents	Stateless API
Batch jobs, document processing, fan-out	Queue + worker
Long agents, human-in-the-loop, multi-day tasks	Persistent workflow
Mix of all three	Stateless API at the edge + queue for async + workflow for long-running

Teams with mixed workloads often end up running all three. The API handles user-facing requests, the queue absorbs batch and fan-out, the workflow engine owns anything that has to survive a deploy.
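As a rough illustration, the duration thresholds from this guide can be written as a routing function. The name choose_pattern and its waits_on_human flag are hypothetical, and the cutoffs are this article's rules of thumb rather than hard limits.

```python
def choose_pattern(expected_seconds: float, waits_on_human: bool = False) -> str:
    """Map a task's expected duration to one of the three patterns."""
    if waits_on_human or expected_seconds >= 30 * 60:
        return "persistent workflow"   # 30 min to days, or approval loops
    if expected_seconds >= 30:
        return "queue + worker"        # 30 s to 30 min
    return "stateless API"             # under 30 s
```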

Engineering Reality: What Production Actually Requires

Architecture diagrams hide the failures. Here's what bites teams after the demo.

Idempotency keys are non-negotiable. Every task needs a stable key (often a hash of input plus user ID). Queue retries, workflow replays, and client retries will all hit your handler multiple times. Without a key, you'll double-charge users and double-call LLM endpoints.
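One way to derive such a key, sketched under a few assumptions: the task input is JSON-serializable, the key is a SHA-256 hash of the user ID plus the canonicalized input, and completed results live in a dict that stands in for a database table with a unique constraint. The names idempotency_key and handle_once are hypothetical.

```python
import hashlib
import json

_completed = {}  # assumption: stand-in for a DB table keyed on the hash


def idempotency_key(user_id: str, task_input: dict) -> str:
    """Stable key: same user and same input always hash to the same value."""
    canonical = json.dumps(task_input, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{user_id}:{canonical}".encode("utf-8")).hexdigest()


def handle_once(user_id: str, task_input: dict, do_work):
    """Run do_work at most once per key; return the stored result on repeats."""
    key = idempotency_key(user_id, task_input)
    if key in _completed:
        return _completed[key]
    # Note: check-then-set is not atomic across processes. A real system
    # would rely on an atomic insert (unique constraint or SET-if-absent).
    result = do_work(task_input)
    _completed[key] = result
    return result
```

Sorting the JSON keys matters: {"a": 1, "b": 2} and {"b": 2, "a": 1} are the same task and should not produce two charges.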

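At-least-once is the default, exactly-once is a lie. SQS standard queues deliver at-least-once, and Celery does the same once late acknowledgement (acks_late) is enabled; with Celery's default early acknowledgement, a task can instead be lost if the worker dies mid-run. Check the exact semantics of whatever Redis-backed queue you use. None of these modes gives you exactly-once execution, and pretending otherwise creates ghost duplicates. Build idempotency at the application layer.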

Dead-letter queues need alerts. A DLQ that nobody watches is a silent data loss machine. Wire DLQ depth to PagerDuty or your alerting tool of choice.

Retry backoff matters. Use exponential backoff with jitter, capped at a sensible maximum delay and number of attempts. A flat 1-second retry loop across many workers turns into a retry storm that trips rate limits and effectively DDoSes your own provider.
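A minimal sketch of capped exponential backoff with "full jitter" (a random sleep between zero and the exponential ceiling). The RetryableError class, the call_with_retry name, and the default values are assumptions for illustration; in real code you would map your provider's 429 and timeout errors onto whatever you treat as retryable.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for errors worth retrying (HTTP 429, timeouts)."""


def call_with_retry(fn, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn, retrying on RetryableError with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error (or dead-letter)
            # Ceiling doubles each attempt (0.5s, 1s, 2s, ...) up to cap.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0.0, ceiling))
```

The jitter is what keeps many workers that failed at the same instant from retrying at the same instant.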

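Workflow replay determinism. Temporal's replay model means non-deterministic code in the workflow function itself (as opposed to inside activities) can produce a different sequence of steps on replay, which the engine rejects as a non-determinism error. In the Python SDK, use the engine-provided workflow.now() and workflow.random(), never the stdlib versions.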

Tracing across pattern boundaries. OpenTelemetry trace context must propagate from API into the queue payload, then from the worker into the workflow engine, then into the LLM call. Without that, debugging a slow agent run means grepping seven services.
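To show what actually travels in the queue payload, here is a hand-rolled sketch using the W3C traceparent format (version, 32-hex trace ID, 16-hex parent span ID, flags). In practice OpenTelemetry's propagation API would inject and extract this for you; the function names below and the payload shape are hypothetical.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace at the API edge: 00-<trace id>-<span id>-<flags>."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # 01 = sampled


def child_traceparent(parent: str) -> str:
    """Keep the trace ID, mint a new span ID for the next hop."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def enqueue_payload(task: dict, traceparent: str) -> dict:
    """API side: carry the trace context inside the queue message."""
    return {"task": task, "traceparent": traceparent}


def worker_handle(payload: dict) -> str:
    """Worker side: continue the same trace before calling the LLM."""
    return child_traceparent(payload["traceparent"])
```

Because the trace ID survives every hop, the API request, the queued job, and the LLM call all show up under one trace.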

Where the Inference Layer Sits

None of these patterns is an inference platform. They orchestrate calls to one. The LLM, image model, or speech model is always one HTTP hop away.

For teams running orchestration on managed compute, an inference API is the cleaner pairing: no GPU ops, per-request billing. For teams self-hosting or doing high-volume inference, dedicated GPU rentals make more sense. GMI Cloud offers both: H100 SXM at $2.00/GPU-hour for self-hosting, plus 100+ models behind one inference API for outbound calls. Check gmicloud.ai/pricing for live rates.

Reasoning-class models like DeepSeek's V-series handle the planning steps. Small-class GPT mini variants handle the cheap classification calls. Routing between them is the orchestrator's job, not the inference layer's.

FAQ

Can I skip queues and just use a workflow engine for everything?

You can, but it's overkill for sub-second tasks. Workflow engines add latency from event-log writes on every step. Use a stateless API for fast paths, queue for medium async, workflow for long-running. Mixing patterns by task duration beats forcing one shape onto everything.

Is Temporal worth the learning curve for a small team?

If your agent runs longer than 30 minutes or needs to survive deploys, yes. If everything finishes in under 60 seconds, Temporal is overhead you don't need yet. Start with a queue, graduate to workflow when long-running tasks appear.

How do I handle LLM rate limits across queue workers?

Centralize rate limiting at the LLM client layer, not per worker. Use a token bucket shared via Redis or a sidecar proxy. Each worker checks out tokens before calling the inference endpoint, so concurrent workers don't collectively burst past the provider's limit.
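The token bucket itself is small. The sketch below keeps the bucket in process memory behind a lock, which only limits threads inside one worker; it is an assumption-laden stand-in for the shared version, where the same refill-and-take logic would run atomically in Redis or in a sidecar. The TokenBucket class and its parameters are hypothetical.

```python
import threading
import time


class TokenBucket:
    """Refills at rate_per_sec up to capacity; each call spends tokens."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()
        self.lock = threading.Lock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take cost tokens if available; return False if the caller must wait."""
        with self.lock:
            now = self.clock()
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False
```

A worker that gets False should back off and retry rather than call the inference endpoint anyway. Setting cost to an estimated token count lets the same bucket enforce tokens-per-minute limits instead of requests-per-second.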

Does GMI Cloud replace Temporal or Celery?

No. GMI Cloud provides the inference endpoints and GPU compute that orchestration layers call. You still pick your own workflow engine, queue, and API framework. GMI sits one layer below, handling the model calls your agents make.

Colin Mo
