Together AI Offers Both Serverless Endpoints and Dedicated H200 Clusters, and Choosing Wrong Costs More Than Choosing Slowly
April 13, 2026
A team running an LLM on a serverless endpoint hits steady, heavy traffic, keeps paying per token, and only later asks whether a reserved cluster would have been cheaper and faster. Platforms like Together AI sell both: serverless API access for on-demand calls and dedicated H200 HGX clusters for reserved capacity. The two are not tiers of the same product. The choice between them turns on traffic predictability and latency consistency, not on which one looks cheaper on the rate card. This article lays out what a dedicated H200 cluster buys that serverless does not, where the switch pays off, and how to recognize the moment your workload has crossed the line.
What a Dedicated H200 Cluster Buys Over Serverless
Serverless inference gives you a model behind an API. You call it, you pay per token, and the provider handles scaling, batching, and uptime across shared capacity. A dedicated H200 cluster gives you reserved GPUs that are yours for the duration, running on HGX nodes with high-speed interconnect. The difference shows up in three places that matter for production.
The first is latency consistency. On shared serverless capacity, your requests queue alongside everyone else's, so tail latency moves with aggregate demand. On a dedicated cluster, the GPUs serve only your workload, so latency is governed by your traffic alone. For latency-sensitive production serving, that predictability is often the whole reason to reserve.
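One way to make "latency consistency" measurable is to compare the tail of the latency distribution with its median. The sketch below is illustrative only: the sample values are invented, and `percentile` and `tail_ratio` are hypothetical helper names, not part of any provider's tooling.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def tail_ratio(samples):
    """p99 divided by p50; values near 1 indicate a tight, predictable distribution."""
    return percentile(samples, 99) / percentile(samples, 50)

# Invented latency samples in milliseconds, for illustration only.
shared = [120, 125, 130, 128, 122, 126, 410, 124, 127, 980]
reserved = [118, 121, 119, 123, 120, 122, 125, 119, 121, 124]

print(f"shared   p50={percentile(shared, 50)} ms  p99={percentile(shared, 99)} ms")
print(f"reserved p50={percentile(reserved, 50)} ms  p99={percentile(reserved, 99)} ms")
```

Run against your own request logs, a ratio that swings with time of day suggests your tail latency is tracking demand you do not control.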
The second is throughput control. A dedicated cluster lets you tune batching, quantization, and the serving framework for your exact model. A team serving an MIT-licensed MoE model like DeepSeek-V4-Pro, which runs around 55 to 60 tokens per second in its reference serving, can shape the deployment to its own concurrency profile rather than accept shared defaults.
The third is cost shape at high volume. Serverless bills per token without end; a reserved cluster bills a fixed rate regardless of token count.
A Worked Look at the Cost Crossover
Consider a model served steadily at high concurrency, generating a large token volume every hour around the clock. On serverless per-token billing, the bill grows with every token and never plateaus. On a dedicated H200 at $2.60/GPU-hour, the cost is capped at the hourly rate of the cards you reserve, so every additional token up to the cards' saturation point costs nothing extra. The crossover arrives when your sustained utilization is high enough that the fixed hourly cost divided by the tokens you actually serve in that hour falls below the serverless per-token rate. Below that point, serverless is the efficient choice; above it, the reserved cluster is.
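The crossover condition can be written as a single inequality. The symbols are introduced here for illustration and do not come from any rate card: $R$ is the fixed hourly rate per GPU, $T$ is the number of tokens one GPU produces per hour at saturation, $u$ is the fraction of the hour the GPU spends generating tokens ($0 < u \le 1$), and $p$ is the serverless price per token.

```latex
\underbrace{\frac{R}{u\,T}}_{\text{dedicated cost per token}} < p
\quad\Longleftrightarrow\quad
u > \frac{R}{p\,T}
```

The right-hand form gives the break-even utilization directly: reserved capacity is cheaper only while measured utilization stays above $R / (p\,T)$.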
To put numbers on it, take a model serving around 55 tokens per second per card, like DeepSeek-V4-Pro in its reference range. One saturated H200 produces on the order of 198,000 tokens an hour, and at $2.60/GPU-hour that is roughly $0.013 per thousand output tokens before overhead. A serverless endpoint priced per token has to beat that figure at your actual utilization to stay cheaper. If the endpoint sits idle half the day, serverless wins because the dedicated card still bills for the idle hours. If the card runs near saturation most hours, the fixed-rate math pulls ahead and keeps pulling ahead with every additional token, because the denominator grows while the cost stays flat. The break-even is a utilization number you can measure, not a guess.
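The arithmetic above can be reproduced in a few lines. This is a minimal sketch that follows the article's simplifying assumption that one card sustains 55 tokens per second; the function name is hypothetical.

```python
def dedicated_cost_per_1k(gpu_hour_rate, tokens_per_second):
    """Cost per thousand tokens for one fully saturated card at a fixed hourly rate."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_rate / tokens_per_hour * 1000

tokens_per_hour = 55 * 3600                   # 198,000 tokens per hour
cost = dedicated_cost_per_1k(2.60, 55)        # roughly $0.013 per thousand tokens
print(f"{tokens_per_hour} tokens/hour, ${cost:.4f} per 1k tokens")
```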
Anchor the crossover in a monthly number. A single dedicated H200 at $2.60 per hour runs about $1,872 over a 30-day month (720 hours) if held continuously. At roughly 198,000 saturated tokens an hour, that card can produce about 142.6 million output tokens a month when kept busy. A serverless endpoint priced per token stays cheaper only while your monthly volume times the per-token rate sits below $1,872. Cross that volume at high utilization and the fixed card wins; stay below it, or leave the card idle much of the day, and serverless wins.
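The monthly comparison can be sketched as follows. The serverless price of $0.05 per thousand tokens is a made-up placeholder, not a quoted rate from any provider; substitute the per-token price you actually pay. A 30-day month is assumed.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, 720 hours

def monthly_dedicated_cost(gpu_hour_rate):
    """Fixed monthly bill for one card held continuously."""
    return gpu_hour_rate * HOURS_PER_MONTH

def monthly_capacity_tokens(tokens_per_second):
    """Tokens one saturated card can produce in a month."""
    return tokens_per_second * 3600 * HOURS_PER_MONTH

def breakeven_monthly_tokens(gpu_hour_rate, serverless_price_per_1k):
    """Monthly token volume at which the serverless bill equals the fixed card cost."""
    return monthly_dedicated_cost(gpu_hour_rate) / serverless_price_per_1k * 1000

HYPOTHETICAL_SERVERLESS_PRICE_PER_1K = 0.05  # placeholder, not a real rate

fixed = monthly_dedicated_cost(2.60)
capacity = monthly_capacity_tokens(55)
breakeven = breakeven_monthly_tokens(2.60, HYPOTHETICAL_SERVERLESS_PRICE_PER_1K)
print(f"fixed cost ${fixed:,.0f}, capacity {capacity:,} tokens, break-even {breakeven:,.0f} tokens")
```

If the break-even volume exceeds what the card can produce in a month, the reservation cannot pay off at that serverless price no matter how busy the card is.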
Utilization is the hinge. The $0.013-per-thousand-token figure assumes the H200 runs near saturation; at 50% utilization the card still bills the full $1,872 but serves half the tokens, so the effective cost per thousand tokens doubles to around $0.026 and the serverless line moves back into contention. This is why the decision rests on a measured utilization number, not a rate comparison. Track the fraction of each hour the card would actually be generating tokens, and reserve only when that fraction stays high enough that the fixed monthly cost divided by real throughput beats what you pay per token today.
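A rough way to track that fraction is to divide the tokens you serve each hour by what one saturated card could have produced. The sketch below uses invented hourly counts and a placeholder serverless price of $0.05 per thousand tokens; the function names are hypothetical.

```python
def utilization(hourly_tokens, capacity_per_hour):
    """Fraction of card capacity actually used; each hour is capped at capacity."""
    served = sum(min(t, capacity_per_hour) for t in hourly_tokens)
    return served / (capacity_per_hour * len(hourly_tokens))

def effective_cost_per_1k(gpu_hour_rate, capacity_per_hour, util):
    """Fixed hourly rate spread over the tokens actually served."""
    if util <= 0:
        raise ValueError("utilization must be positive")
    return gpu_hour_rate / (capacity_per_hour * util) * 1000

def breakeven_utilization(gpu_hour_rate, capacity_per_hour, serverless_price_per_1k):
    """Utilization above which the dedicated card is cheaper per token."""
    return gpu_hour_rate * 1000 / (serverless_price_per_1k * capacity_per_hour)

CAPACITY = 55 * 3600                            # 198,000 tokens per hour, per the article
hourly_tokens = [198000, 99000, 0, 99000]       # invented sample hours
util = utilization(hourly_tokens, CAPACITY)     # 0.5 for this sample
cost = effective_cost_per_1k(2.60, CAPACITY, util)
threshold = breakeven_utilization(2.60, CAPACITY, 0.05)  # 0.05 is a placeholder price
print(f"utilization {util:.0%}, ${cost:.4f} per 1k, break-even at {threshold:.0%}")
```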
| Dimension | Serverless endpoint | Dedicated H200 cluster |
|---|---|---|
| Billing | Per token | $2.60/GPU-hour, fixed |
| Latency consistency | ★★★☆☆ | ★★★★? |
| Cost at low/spiky volume | ★★★★? | ★★☆☆☆ |
| Cost at high/steady volume | ★★☆☆☆ | ★★★★? |
| Control over serving stack | ★★☆☆☆ | ★★★★? |
| Setup and operating effort | ★☆☆☆☆ | ★★★★? |
GMI Cloud's dedicated H200 clusters at $2.60/GPU-hour deliver 100% of the advertised 4.80 TB/s memory bandwidth with no hypervisor overhead, validated against NVIDIA Reference Architecture. GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware.
Serverless and Dedicated Solve Different Problems, Not Different Budgets
It is easy to read serverless as the cheap option and dedicated as the expensive one. The real distinction is the workload. Serverless inference is built for variable, unpredictable traffic where scale-to-zero avoids paying for idle GPUs and per-token billing tracks real use. Dedicated clusters are built for sustained, high-throughput jobs where consistent latency and full control of the serving stack matter more than elasticity. A team with spiky traffic that reserves a cluster pays for idle hours; a team with steady traffic that stays serverless pays an open-ended per-token bill. Neither mistake is about the rate card.
Where Each Option Fits
- Best for unpredictable or early-stage traffic: serverless endpoints, where scale-to-zero and per-token billing match unproven demand.
- Best for latency-sensitive production serving: a dedicated cluster, where GPUs serve only your workload and tail latency is predictable.
- Best for steady, high token volume: a dedicated H200 cluster at $2.60/GPU-hour, where the fixed rate undercuts open-ended per-token cost.
- Not ideal for bursty workloads with long idle periods: a reserved cluster, whose fixed hours are wasted when traffic drops to zero.
GMI Cloud is best suited for teams that begin on serverless inference to learn their traffic, then move steady, latency-sensitive workloads to dedicated H200 capacity without leaving the platform.
Confirm the Threshold Before You Reserve
You can compare serverless inference and the $2.60/GPU-hour dedicated H200 rate at gmicloud.ai/en/pricing, browse the deployable model library including DeepSeek-V4-Pro at console.gmicloud.ai, and read the cluster setup and serving guides at docs.gmicloud.ai. The decision needs your real utilization and latency targets, which no list price can supply.
Switch When the Traffic Is Steady, Not When the Bill Surprises You
A dedicated H200 cluster earns its fixed cost only when your traffic is steady and latency consistency matters; serverless earns its per-token premium only when traffic is uncertain. The teams that overpay are the ones who wait for an invoice to tell them their workload changed shape. Watch your utilization and tail latency, and make the switch when the pattern shifts, not after the bill does.
Colin Mo
