
How to Separate LLM Inference from Agent Orchestration: A Production Architecture Guide

May 28, 2026

Most agent codebases couple the model call directly into the workflow step, then need a full rewrite when the model changes. That coupling looks harmless on day one. It rots fast. By month six you're patching retry logic inside business code, absorbing cost spikes from cold starts you can't isolate on the bill, and shipping late on every model swap.

The split is the fix: treat inference as one layer, orchestration as another, and let each scale on its own clock. This guide covers why the two layers have different scaling profiles, how to draw the boundary in real code, which platforms own which side, and how GMI Cloud fits on the inference half of that line.

Why Inference and Orchestration Don't Belong in the Same Box

The two layers solve different problems. Inference is stateless math on a GPU, billed per token or per GPU-hour. Orchestration is stateful coordination of steps, billed per workflow-hour or run on infrastructure you host yourself. Wiring them together saves a week. It costs a quarter.

Different scaling profiles

Inference scales with concurrent requests. A spike from 50 to 500 RPS pushes you to add GPU replicas or burst to a managed API. Orchestration scales with concurrent workflows in flight, which can sit idle for minutes between steps. Inference is GPU-bound and light on CPU. Orchestration needs no GPU and spends most of its time holding state and waiting on I/O.

Different failure semantics

When a GPU node OOMs mid-decode, you want the orchestrator to catch a timeout, retry on a fallback model, and keep workflow state intact. When a workflow step crashes, you want durable replay, not a re-roll of the LLM call. Bundling them means one failure mode pollutes both recovery paths.

Different pricing models

Per-token inference billing rewards tight prompts and small models. Per-workflow-hour orchestration billing rewards short steps and clean state. Putting them in the same process hides which one is bleeding money. You'll spend a sprint just adding spans before you can tell.

The Reference Architecture in One Diagram (Described)

Draw it left to right. Client request hits the orchestrator. The orchestrator owns workflow state, retries, fallbacks, and step ordering. Each step that needs a model emits a call to the inference layer through a thin client. The inference layer answers and returns. Tool calls fan out from the orchestrator, not from inside the model client.

Client → Orchestrator (state, retry, fallback) → Inference Layer (model API or self-hosted GPU)
         ↓
         External tools / DBs / vector store

The contract between layers is narrow on purpose. The orchestrator sends a prompt plus model hint. The inference layer returns tokens plus a request ID. Nothing else crosses the line. When you swap a frontier-class model for a smaller routing model, the orchestrator code doesn't change.
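A minimal sketch of that contract in Python, assuming nothing about the backend. The names `InferenceRequest`, `InferenceResult`, `InferenceClient`, and `EchoClient`, and the `"small-router"` hint, are illustrative and not part of any real SDK.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class InferenceRequest:
    prompt: str
    model_hint: str  # hypothetical label such as "small-router" or "frontier"


@dataclass(frozen=True)
class InferenceResult:
    text: str        # generated tokens, decoded to text
    request_id: str  # ID issued by the inference layer, used for tracing


class InferenceClient(Protocol):
    """The only surface the orchestrator is allowed to depend on."""

    def complete(self, request: InferenceRequest) -> InferenceResult: ...


class EchoClient:
    """Stand-in backend so the sketch runs without a network call."""

    def complete(self, request: InferenceRequest) -> InferenceResult:
        return InferenceResult(
            text=f"[{request.model_hint}] {request.prompt}",
            request_id="req-0001",
        )


def summarize_step(client: InferenceClient, document: str) -> InferenceResult:
    # Orchestrator-side step: it knows the contract, not the backend.
    request = InferenceRequest(prompt=f"Summarize: {document}", model_hint="small-router")
    return client.complete(request)
```

Because `summarize_step` depends only on the protocol, swapping the backend means passing a different client object; the step itself stays untouched.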

Picking the Two Halves: Layer-by-Layer

You pick each side independently. That's the whole point of separation.

Orchestration layer choices

| Option | Best for | Trade-off |
| --- | --- | --- |
| Temporal | Durable workflows, strong replay semantics | Heavier ops footprint |
| LangGraph | Graph-shaped agent flows, Python-native | Younger, fewer prod guarantees |
| Queue + worker (Celery / RQ + Redis) | Simple fan-out, existing Python stack | You'll build retry yourself |
| Custom Python state machine | Tight control, small surface | Long-tail bugs are yours |

There's no winner here. If you already run Temporal for non-AI workflows, extend it. If you're prototyping, LangGraph ships faster. If you're a data team with Celery, don't introduce a new system.

Inference layer choices

| Option | Best for | Trade-off |
| --- | --- | --- |
| Managed inference API | Burst traffic, no GPU ops | Per-token pricing at scale |
| Self-hosted on H100/H200 SXM | Steady high RPS, custom models | You own vLLM tuning, autoscaling |
| Hybrid (managed for spikes, self-hosted baseline) | Predictable load with bursts | Two surfaces to monitor |

For most teams shipping in 2026, the inference layer is where GMI Cloud (gmicloud.ai) sits naturally. You either call its Inference Engine for 100+ pre-deployed models, or rent H100 SXM at about $2.00 per GPU-hour and H200 SXM at about $2.60 per GPU-hour for self-hosted serving. Check gmicloud.ai/pricing for current rates. Orchestrator choice stays yours.

GPU Selection for the Self-Hosted Path

If you self-host the inference layer, the GPU choice depends on model size and decode pattern. H100 and H200 anchor the production set.

| GPU | VRAM | Memory bandwidth | Best for | GMI Cloud price |
| --- | --- | --- | --- | --- |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 7B-70B FP8, latency-sensitive | ~$2.00/hr |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 70B+ long context, decode-bound | ~$2.60/hr |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | 7B-34B, existing fleet | Contact |
| L4 | 24 GB GDDR6 | 300 GB/s | 7B INT8/INT4 sidecars | Contact |

Per NVIDIA's H200 Product Brief (2024), H200 delivers up to 1.9x inference speedup on Llama 2 70B versus H100 (TensorRT-LLM, FP8, batch 64, 128/2048 tokens). NVLink 4.0 runs 900 GB/s bidirectional aggregate per GPU on HGX/DGX platforms. For KV-cache budgeting: KV bytes per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element. The leading 2 covers keys and values, and bytes_per_element is 2 for FP16, the usual default.
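As a worked instance of the KV-cache formula, here is a small calculator. The model shape used in the example (80 layers, 8 KV heads, head dimension 128) is an assumption matching the published Llama 2 70B configuration, and the 4,096-token sequence length is picked only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """KV-cache size for one request; bytes_per_element=2 corresponds to FP16."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element


# Assumed shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
print(per_request / 2**30)  # 1.25 (GiB)
```

Under those assumptions each full-length request holds about 1.25 GiB of KV cache, which comes out of whatever VRAM is left after the model weights are loaded.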

Consumer cards like the 4090 belong in dev only. NVIDIA's GeForce EULA carries data center use restrictions (see nvidia.com/en-us/drivers/geforce-license), and that's a compliance risk you don't want in prod.

Engineering Reality: What Breaks After the Demo

Architectural diagrams are clean. Production is not. Here's what actually trips teams once the separation goes live.

Retry policy lives in the orchestrator, not the platform. A managed inference API may retry transient socket errors at the platform edge, but timeouts, 429s, and bad JSON are yours. Put the retry decorator on the orchestrator's inference step. Use exponential backoff with jitter, cap at 3 attempts, and surface the request ID on every retry log.
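The retry policy described above can be sketched with the standard library alone. `TransientInferenceError` and `call_with_retry` are hypothetical names, and the delay values are placeholders to tune. The backoff here uses "full jitter" (a random sleep up to the exponential cap), which is one common variant rather than the only option.

```python
import logging
import random
import time

log = logging.getLogger("orchestrator.inference")


class TransientInferenceError(Exception):
    """Hypothetical error type covering timeouts and 429s."""

    def __init__(self, message, request_id=None):
        super().__init__(message)
        self.request_id = request_id


def call_with_retry(call, max_attempts=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run call() with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientInferenceError as exc:
            # Surface the request ID on every retry log line.
            log.warning(
                "inference attempt %d/%d failed request_id=%s: %s",
                attempt, max_attempts, exc.request_id, exc,
            )
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists so tests can inject a fake and avoid real waits.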

Observability needs spans across the boundary. Wrap each inference call in an OpenTelemetry span on the orchestrator side, then attach the inference platform's x-request-id as a span attribute. Without that link, you can't correlate a slow workflow with a slow decode.
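To show the correlation idea without pulling in a tracing dependency, the sketch below uses a stdlib stand-in for a span. `span`, `RECORDED_SPANS`, and `traced_inference` are hypothetical. With the OpenTelemetry Python API, the equivalent steps would be `tracer.start_as_current_span(...)` followed by `span.set_attribute(...)`.

```python
import time
from contextlib import contextmanager

RECORDED_SPANS = []  # stand-in for a span exporter


@contextmanager
def span(name):
    """Record a named, timed unit of work with free-form attributes."""
    attributes = {}
    start = time.perf_counter()
    try:
        yield attributes
    finally:
        RECORDED_SPANS.append({
            "name": name,
            "duration_s": time.perf_counter() - start,
            "attributes": attributes,
        })


def traced_inference(call_model, prompt):
    # call_model is a hypothetical callable returning (text, request_id).
    with span("inference.call") as attrs:
        text, request_id = call_model(prompt)
        # Link the orchestrator-side span to the platform's request ID.
        attrs["inference.request_id"] = request_id
        return text
```

With the request ID stored on the span, a slow workflow step can be matched to the platform-side record of the same call.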

Fallback and circuit breakers belong on the orchestrator. Wire a circuit breaker (resilience4j on the JVM, pybreaker in Python) on the inference client. When the primary model errors past threshold, route to a smaller fallback. Don't retry the same broken model six times.
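A hand-rolled sketch of the breaker-plus-fallback pattern, written against the standard library instead of a real breaker library so the mechanics are visible. `CircuitBreaker` and `complete_with_fallback` are hypothetical names, and the threshold and cool-down values are placeholders.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, allow trial calls again (half-open).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def complete_with_fallback(breaker, primary, fallback, prompt):
    """Try the primary model unless the breaker is open, else use the fallback."""
    if breaker.allow():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(prompt)
```

Once the breaker opens, the primary model is skipped entirely until the cool-down passes, which is what prevents repeated calls to a model that is already failing.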

Multi-model routing lives at the workflow step. Send routing decisions and short-form summaries to a GPT mini-class model. Send hard reasoning to Claude Opus class or DeepSeek V-series. Cheap first, expensive on escalation. Evaluator-in-the-loop (Ragas, LangSmith evals) catches regressions before they reach users.
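"Cheap first, expensive on escalation" might look like the sketch below. The tier names, the `complete(model, prompt)` callable, and the JSON-with-`answer` acceptance check are all assumptions for illustration. A real evaluator would be more involved.

```python
import json

# Hypothetical tier names; map them to real model IDs in configuration.
CHEAP_MODEL = "mini-class"
STRONG_MODEL = "frontier-class"


def is_acceptable(raw):
    """Cheap check: output must be a JSON object with a non-empty 'answer'."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and bool(parsed.get("answer"))


def route_with_escalation(complete, prompt):
    """Return (model_used, raw_output), escalating only when the cheap tier fails."""
    # complete(model, prompt) -> str is a hypothetical inference-client callable.
    raw = complete(CHEAP_MODEL, prompt)
    if is_acceptable(raw):
        return CHEAP_MODEL, raw
    return STRONG_MODEL, complete(STRONG_MODEL, prompt)
```

Returning the model name alongside the output lets the orchestrator log how often escalation happens, which is the number to watch when tuning the cheap tier's prompt.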

Model Tier Recommendations for the Inference Layer

Match the model class to the workflow step, not the other way around.

| Step type | Model class | Why |
| --- | --- | --- |
| Routing, classification, short summaries | GPT mini-class | Cheap, fast, predictable JSON |
| Tool-arg generation, planning | Frontier-class reasoning models | Reliable structured output |
| Hard reasoning, multi-step plans | Claude Opus-class | Deeper chain quality |
| Bulk reasoning at low cost | DeepSeek V-series | Cost-efficient reasoning |

You can run all four through one inference API gateway. GMI Cloud's Inference Engine page (snapshot 2026-03-03) lists 100+ pre-deployed models behind a single API, which means your orchestrator only learns one client SDK. Swap the model name string, redeploy the orchestrator step. No GPU re-provisioning.

When the Hybrid Pays Off

A baseline of self-hosted H100 for predictable load plus a managed API for traffic spikes is the most common production shape we see. The orchestrator routes by load: under threshold, hit the self-hosted endpoint; over threshold, burst to the managed API. Two cost curves, one workflow contract. Bottom line: you don't lock in either side.
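One way to sketch the load-based routing described above is to count in-flight requests against the self-hosted baseline. `LoadRouter` is a hypothetical name, and in-flight count is only one possible load signal; queue depth or GPU utilization would be reasonable alternatives.

```python
import threading


class LoadRouter:
    """Send traffic to the self-hosted endpoint until it is saturated, then burst."""

    def __init__(self, self_hosted, managed, max_in_flight):
        self.self_hosted = self_hosted      # callable: prompt -> result
        self.managed = managed              # callable: prompt -> result
        self.max_in_flight = max_in_flight  # capacity of the self-hosted baseline
        self._in_flight = 0
        self._lock = threading.Lock()

    def complete(self, prompt):
        # Reserve a self-hosted slot atomically if one is free.
        with self._lock:
            use_self_hosted = self._in_flight < self.max_in_flight
            if use_self_hosted:
                self._in_flight += 1
        if not use_self_hosted:
            return self.managed(prompt)  # over threshold: burst to the managed API
        try:
            return self.self_hosted(prompt)
        finally:
            # Release the slot whether the call succeeded or raised.
            with self._lock:
                self._in_flight -= 1
```

Both backends sit behind the same `complete(prompt)` call, so workflow steps never see which cost curve served them.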

FAQ

Where does state live in the separated architecture?

State lives in the orchestrator's durable store: Temporal's history, LangGraph's checkpointer, or your own database. The inference layer is stateless by design. Each call carries the prompt the orchestrator assembled, and the platform returns tokens plus a request ID. If you stuff state into the inference client, you've leaked orchestration across the boundary.
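For the "your own database" option, a minimal sketch of step-level checkpointing with SQLite follows. `StepStore` and its schema are hypothetical, and a production store would also need to handle concurrent workers. The point is that a replayed step reuses the recorded output and does not call the model again.

```python
import sqlite3


class StepStore:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "workflow_id TEXT, step TEXT, output TEXT, request_id TEXT, "
            "PRIMARY KEY (workflow_id, step))"
        )

    def run_step(self, workflow_id, step, call_model):
        """Run a model-backed step once per workflow; replay returns the stored output."""
        row = self.db.execute(
            "SELECT output FROM steps WHERE workflow_id = ? AND step = ?",
            (workflow_id, step),
        ).fetchone()
        if row is not None:
            return row[0]  # replay: no new model call
        # call_model is a hypothetical callable returning (output, request_id).
        output, request_id = call_model()
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO steps VALUES (?, ?, ?, ?)",
                (workflow_id, step, output, request_id),
            )
        return output
```

The inference client never sees this table, so the inference side of the boundary stays stateless.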

Can GMI Cloud act as the orchestrator?

No, and that's intentional. GMI Cloud serves the inference layer through its Inference Engine API and H100/H200 GPU instances. For workflow orchestration you still pick Temporal, LangGraph, a queue-and-worker stack, or custom Python. The honest framing: GMI handles inference; you handle orchestration.

How do I avoid vendor lock-in at the inference layer?

Wrap the inference call in a thin internal client interface that takes prompt and model hint, returns tokens and request ID. Implement that interface twice: once for your primary managed API, once for a self-hosted vLLM endpoint or a second managed API. Swap implementations with a config flag, not a code change.
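A sketch of the config-flag swap, with two placeholder backends that return canned strings in place of real HTTP calls. The class names and the `INFERENCE_BACKEND` setting are hypothetical.

```python
import os


class ManagedApiClient:
    """Placeholder for the primary managed API; the real HTTP call is omitted."""

    name = "managed"

    def complete(self, prompt, model_hint):
        return f"managed({model_hint}): {prompt}", "req-managed-1"


class SelfHostedClient:
    """Placeholder for a self-hosted endpoint such as a vLLM server."""

    name = "self_hosted"

    def complete(self, prompt, model_hint):
        return f"self_hosted({model_hint}): {prompt}", "req-local-1"


BACKENDS = {cls.name: cls for cls in (ManagedApiClient, SelfHostedClient)}


def client_from_config(env=os.environ):
    """Pick the implementation from configuration, defaulting to the managed API."""
    # INFERENCE_BACKEND is a hypothetical setting name.
    backend = env.get("INFERENCE_BACKEND", "managed")
    try:
        return BACKENDS[backend]()
    except KeyError:
        raise ValueError(f"unknown INFERENCE_BACKEND: {backend!r}") from None
```

Both classes expose the same `complete(prompt, model_hint)` method returning text plus a request ID, so the orchestrator code that calls them is identical either way.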

What's the smallest viable separation for a 3-person team?

Celery plus Redis on the orchestrator side, a managed inference API on the inference side. You'll get queued steps, retry, fallback, and a wide model catalog without standing up a GPU cluster. Durability is limited to what you persist yourself, since a queue-and-worker stack doesn't give you Temporal-style replay. When load justifies it, you can move steady traffic to self-hosted H100 at $2.00/GPU-hour and keep the same orchestrator.

Colin Mo
