What Is AI Inference? Training and Inference Are Two Different Jobs With Two Different Bills
April 13, 2026
A team finishes fine-tuning a model, celebrates the benchmark score, and then discovers that the real cost lands later, every time a user sends a request. Training is where a model learns. Inference is where it earns its keep, answering one prompt at a time, in production, under latency and cost pressure. Training happens once and produces weights; inference happens millions of times and produces your invoice. This article explains the difference through four concrete tasks, shows why inference is its own engineering problem, and points to where production teams actually run it.
Training Teaches the Model, Inference Uses It
The cleanest way to separate the two is by what changes. During training, the model's weights are updated repeatedly as it sees data and corrects its errors. During inference, the weights are frozen. The model reads your input and produces an output, and nothing about the model changes.
Four tasks make the split concrete:
- Answering a chat message. A frozen LLM reads the conversation and generates the next reply. This is inference. No learning happens.
- Fine-tuning on support tickets. The model's weights are adjusted so it answers in your company's voice. This is training.
- Generating an image from a prompt. A trained diffusion model turns text into pixels. Inference.
- Pretraining a base model on web text. Weights are built from scratch over weeks on large clusters. Training.
The pattern: if weights are being updated, it is training. If frozen weights are being read to produce an output, it is inference. A model like GPT-5.5 or Claude Opus 4.7 was trained once at enormous cost, but every API call you make to it afterward is inference, and inference is what you pay for continuously.
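The rule above can be sketched with a toy one-parameter model. This is a minimal illustration only: the function names `predict` and `train_step`, the learning rate, and the data are all assumptions made for the example, not anything a real framework or provider exposes.

```python
# Toy model y = w * x, used only to show which step changes the weight.

def predict(w: float, x: float) -> float:
    """Inference: read the frozen weight and return an output. w is not modified."""
    return w * x


def train_step(w: float, x: float, y_true: float, lr: float = 0.1) -> float:
    """Training: one gradient-descent step on squared error; returns an updated weight."""
    error = predict(w, x) - y_true
    grad = 2.0 * error * x  # derivative of (w*x - y_true)^2 with respect to w
    return w - lr * grad


# Training phase: the weight changes on every step.
w = 0.0
for _ in range(50):
    w = train_step(w, x=1.0, y_true=3.0)

# Inference phase: the weight is only read, never written.
w_before = w
outputs = [predict(w, x) for x in (1.0, 2.0, 4.0)]
assert w == w_before
```

The training loop is the only place `w` is reassigned; every call to `predict` afterward leaves it untouched, which is the same split that separates a fine-tuning job from an API call.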
Why the Cost Profiles Diverge
Training is bursty and finite. You provision a large cluster, run for days or weeks, and release it. Inference is steady and unbounded: it scales with how many users you have and how often they ask. A pretraining run might consume thousands of GPU-hours in one concentrated burst, but a product serving one million requests a day, every day, can exceed that total within months and then keeps going. This is why inference, not training, tends to dominate the lifetime infrastructure budget of deployed AI products.
Put the divergence in GPU-hours and dollars. Suppose pretraining a mid-size model takes 4,000 GPU-hours in one concentrated run. At an H100 rate of $2.00 per GPU-hour, that is about $8,000, spent once. Now serve that model to a product doing one million requests a day. If each request occupies even a tenth of a GPU-second, that is 100,000 GPU-seconds, or about 28 GPU-hours a day: roughly $56 daily or $1,700 a month, every month, climbing with traffic. After about 144 days, just under five months, cumulative inference spend passes the one-time training cost and keeps growing, which is why most deployed products spend more keeping a frozen model answering than they spent teaching it.
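The arithmetic in this example is easy to rerun with your own numbers. The sketch below uses the figures from the paragraph above; the function name `inference_breakeven` and its parameters are hypothetical, and it assumes flat traffic and a single GPU-hour price for both training and serving.

```python
def inference_breakeven(training_gpu_hours, price_per_gpu_hour,
                        requests_per_day, gpu_seconds_per_request):
    """Return (training_cost, daily_inference_cost, breakeven_days).

    Assumes constant daily traffic and the same GPU-hour price for both jobs.
    """
    training_cost = training_gpu_hours * price_per_gpu_hour
    daily_gpu_hours = requests_per_day * gpu_seconds_per_request / 3600
    daily_cost = daily_gpu_hours * price_per_gpu_hour
    return training_cost, daily_cost, training_cost / daily_cost


training_cost, daily_cost, days = inference_breakeven(
    training_gpu_hours=4_000,
    price_per_gpu_hour=2.00,
    requests_per_day=1_000_000,
    gpu_seconds_per_request=0.1,
)
print(f"training: ${training_cost:,.0f} once")
print(f"inference: ${daily_cost:,.2f}/day, about ${daily_cost * 30:,.0f}/month")
print(f"break-even after about {days:.0f} days")
```

Because the same hourly price appears on both sides, it cancels out of the break-even: the crossover depends only on training GPU-hours divided by daily inference GPU-hours. Doubling traffic halves the time it takes inference to overtake training.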
The inference bottleneck also has a number behind it. Generating each token for a single stream requires reading the model's weights from GPU memory once, so a 70B model in FP8, about 70GB, on an H100's 3.35 TB/s of bandwidth tops out at roughly 48 tokens per second per stream (3,350 GB/s divided by 70 GB) before compute even enters the picture. The same model on an H200's 4.80 TB/s lifts that ceiling proportionally, to roughly 69 tokens per second. This is why inference teams chase memory bandwidth rather than peak FLOPS: the card that moves weights fastest generates tokens fastest, and for single-stream decoding that one spec predicts speed better than the compute number on the box.
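The ceiling is a single division, sketched below. The function name `decode_ceiling` is hypothetical, and the estimate is deliberately rough: it assumes one stream, counts only weight reads, and ignores KV-cache traffic, batching, and kernel overhead, so real throughput per stream will land below it.

```python
def decode_ceiling(params_billions, bytes_per_param, bandwidth_tb_per_s):
    """Upper bound on single-stream tokens/second when decoding is bandwidth-bound.

    Each generated token requires reading every weight from GPU memory once,
    so the ceiling is memory bandwidth divided by total weight bytes.
    """
    weight_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# 70B parameters in FP8 (1 byte per parameter), bandwidth figures from the text.
h100 = decode_ceiling(70, 1, 3.35)
h200 = decode_ceiling(70, 1, 4.80)
print(f"H100 ceiling: {h100:.1f} tokens/s per stream")
print(f"H200 ceiling: {h200:.1f} tokens/s per stream")
```

Halving the bytes per parameter doubles the ceiling, which is one reason quantization matters so much for serving: it shrinks the amount of data that has to cross the memory bus per token.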
The Engineering Problems Are Not the Same
Because the workloads differ, the hardware and software priorities differ too. Training optimizes for throughput across a whole dataset and tolerates long job times. Inference optimizes for per-request latency and for keeping expensive GPUs busy without overpaying for idle time.
| Dimension | Training | Inference |
|---|---|---|
| Weights | Updated | Frozen |
| Frequency | Once per model version | Every user request |
| Key metric | Time to converge | Latency per request, $/request |
| Bottleneck | Compute and interconnect | Memory bandwidth, utilization |
| Scaling pattern | Large burst, then release | Continuous, traffic-driven |
The bottleneck row matters most. Training is often compute and interconnect bound, which is why it favors large pooled clusters. Inference decoding is usually memory-bandwidth bound, which is why the GPU that moves weights fastest from memory often generates tokens fastest. Confusing the two leads teams to buy training-shaped infrastructure for an inference-shaped problem.
Where Production Inference Actually Runs
Once a team accepts that inference is the recurring cost, the question becomes where to run it without rebuilding the stack as traffic grows. This is the gap GMI Cloud is built to close.
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. It runs more than 100 models through a managed API, including flagship models for teams that want a trained model on call without operating it themselves. GMI Cloud's serverless inference bills per request, from $0.000001 to $0.50, and scales to zero so idle periods do not accumulate GPU charges.
For teams that want hardware directly, the same platform exposes dedicated and bare metal GPUs at published rates:
| GPU | VRAM | Memory bandwidth | GMI Cloud price |
|---|---|---|---|
| NVIDIA H100 SXM5 | 80GB HBM3 | 3.35 TB/s | $2.00/GPU-hour |
| NVIDIA H200 SXM5 | 141GB HBM3e | 4.80 TB/s | $2.60/GPU-hour |
| NVIDIA B200 | 180GB HBM3e | 8.0 TB/s | $4.00/GPU-hour |
| NVIDIA GB200 NVL72 | 13.5TB pooled (72 GPUs) | 130 TB/s (NVLink, rack aggregate) | $8.00/GPU-hour |
GMI Cloud's bare metal instances run with no hypervisor, so no virtualization layer sits between the workload and the memory bandwidth that token generation depends on. You can confirm current rates and the full model library at gmicloud.ai/en/pricing and console.gmicloud.ai.
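One rough way to read the table for a bandwidth-bound workload is memory bandwidth per dollar. The sketch below uses the three single-GPU rows only; the GB200 NVL72 row is left out because its bandwidth figure is an NVLink number rather than per-GPU memory bandwidth. The names `GPUS` and `bandwidth_per_dollar` are hypothetical, and the ratio ignores VRAM capacity, batching behavior, and availability, so treat it as a first filter and not a purchasing rule.

```python
# (memory bandwidth in TB/s, price in $/GPU-hour), copied from the table above.
GPUS = {
    "H100 SXM5": (3.35, 2.00),
    "H200 SXM5": (4.80, 2.60),
    "B200": (8.0, 4.00),
}


def bandwidth_per_dollar(gpus):
    """Return TB/s of memory bandwidth per $/GPU-hour for each GPU."""
    return {name: bw / price for name, (bw, price) in gpus.items()}


ratios = bandwidth_per_dollar(GPUS)
ranked = sorted(ratios, key=ratios.get, reverse=True)
for name in ranked:
    print(f"{name}: {ratios[name]:.2f} TB/s per $/hour")
```

At these rates the newer cards deliver slightly more bandwidth per dollar, but the spread is modest, so whether the model fits in VRAM and how well the workload batches will often decide the choice.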
One Boundary Worth Drawing Clearly
It is easy to assume the platform that trained a model is the platform you should run inference on. They are separable choices. Training infrastructure rewards large, pooled, interconnect-heavy clusters that you hold briefly. Inference infrastructure rewards low per-request latency and high utilization that you hold continuously. A team can pretrain anywhere and still choose a different provider for serving, because the serving decision is governed by latency targets and cost per request, not by where the weights were born.
Matching the Workload to the Right Setup
Inference workloads are not uniform, so the recommendation depends on traffic shape rather than a single best answer.
- Best for variable or early-stage API traffic: serverless inference, where scale-to-zero avoids paying for idle GPUs.
- Best for sustained high-throughput serving: dedicated GPU clusters, where consistent latency matters more than elasticity.
- Best for teams needing full control of the inference stack: bare metal, with root access and preconfigured CUDA, TensorRT-LLM, and vLLM.
- Not ideal for one-off pretraining runs on a tight budget: these need large clusters held only briefly, so a training-focused provider may fit better.
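The choice between the first two options in the list above often comes down to a break-even request volume. The sketch below is an illustration under stated assumptions: the per-request price of $0.0005 is a hypothetical value picked from within the published $0.000001 to $0.50 range, the dedicated GPU is billed around the clock at the $2.00 H100 rate, and the function names are invented for the example. It also assumes one GPU can absorb the traffic, which you would need to verify separately.

```python
HOURS_PER_DAY = 24


def dedicated_daily_cost(gpu_price_per_hour, gpu_count=1):
    """Cost of holding dedicated GPUs all day, regardless of traffic."""
    return gpu_price_per_hour * HOURS_PER_DAY * gpu_count


def serverless_daily_cost(requests_per_day, price_per_request):
    """Cost of per-request billing; zero traffic costs zero."""
    return requests_per_day * price_per_request


def cheaper_option(requests_per_day, price_per_request,
                   gpu_price_per_hour, gpu_count=1):
    """Return 'serverless' or 'dedicated', whichever costs less per day."""
    serverless = serverless_daily_cost(requests_per_day, price_per_request)
    dedicated = dedicated_daily_cost(gpu_price_per_hour, gpu_count)
    return "serverless" if serverless < dedicated else "dedicated"


# Hypothetical per-request price; the real figure depends on the model used.
PRICE_PER_REQUEST = 0.0005
H100_PRICE = 2.00

breakeven = dedicated_daily_cost(H100_PRICE) / PRICE_PER_REQUEST
print(f"break-even: about {breakeven:,.0f} requests/day")
print(cheaper_option(10_000, PRICE_PER_REQUEST, H100_PRICE))
print(cheaper_option(500_000, PRICE_PER_REQUEST, H100_PRICE))
```

Below the break-even volume, scale-to-zero wins because idle hours cost nothing; above it, a held GPU is cheaper per request, provided it stays busy.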
Treat Inference as the Cost That Never Stops
The useful reframe for any team shipping AI is that training is the upfront experiment and inference is the operating expense. Size your infrastructure around the request volume you expect to serve every day, not around the one training run that produced the model. The teams that control AI spend are the ones that planned for the recurring cost first and let the model choice follow from it.
Colin Mo
