
A100, H100, and H200 Cloud Pricing Compared Per GPU and Per 8-GPU Node for 2026

April 13, 2026

Comparing GPU generations on a single per-hour number hides the decision most teams are actually making, which is how much a full serving node costs. A100, H100, and H200 each carry a different per-GPU rate, and the dollar gap between them grows eightfold once you multiply by the eight GPUs a typical serving node holds. The per-card price tells you the entry point; the per-node price tells you the bill. This article lays out the side-by-side rates, shows how the node math changes the picture, and explains which generation fits which inference workload.

Why the Per-GPU Rate Is Only Half the Comparison

A per-hour GPU rate is the headline, but serving nodes are provisioned in multiples, most commonly eight GPUs to a node for larger models and high concurrency. Multiplying the rate by eight does two things: it scales the absolute cost, and it multiplies the dollar gap between generations by the same factor, even though the percentage gap stays the same.

This matters because the per-card difference between an H100 and an H200 looks small until you multiply it across a node and run it for a month. The same multiplication can also justify a newer card: if a higher per-hour rate comes with enough extra memory or bandwidth to serve more tokens, the per-node cost per unit of work can fall even as the sticker rises.

The comparison that drives a decision is therefore two-dimensional: rate per GPU, and total per node, read together.

The Side-by-Side Rates

The table places the three generations next to each other on both axes. The A100 row is qualitative, reflecting its position as the older value tier; the H100 and H200 rows use GMI Cloud's published rates, with the 8-GPU node figure derived by multiplying the per-GPU rate by eight.

| GPU | VRAM | Memory bandwidth | Per-GPU price | Implied 8-GPU node price |
|---|---|---|---|---|
| NVIDIA A100 | 40-80GB HBM2e | up to 2.0 TB/s | older value tier | scales from a lower base |
| NVIDIA H100 SXM5 | 80GB HBM3 | 3.35 TB/s | $2.00/GPU-hour | $16.00/hour |
| NVIDIA H200 SXM5 | 141GB HBM3e | 4.80 TB/s | $2.60/GPU-hour | $20.80/hour |

Two readings make the node math concrete:

  • The H100-to-H200 gap is $0.60 per GPU-hour and $4.80 per node-hour. Over a month of sustained serving, that node-level difference is what to weigh against the H200's larger memory, not the per-card figure alone.
  • The A100 wins on entry price and loses on memory. Its older architecture and lower bandwidth make it a value tier for smaller models, but it cannot match the H200's 141GB for long context or large batches.
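The node figures in the table can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it uses the two rates quoted above, assumes an 8-GPU node, and uses `Decimal` so the dollar amounts stay exact. The function and variable names are ours, not part of any GMI Cloud tooling.

```python
from decimal import Decimal

GPUS_PER_NODE = 8  # node size assumed throughout this article

# Per-GPU hourly rates quoted in the table (USD per GPU-hour).
RATES = {
    "H100": Decimal("2.00"),
    "H200": Decimal("2.60"),
}


def node_rate(per_gpu_rate: Decimal, gpus: int = GPUS_PER_NODE) -> Decimal:
    """Hourly cost of a node: per-GPU rate times GPU count."""
    return per_gpu_rate * gpus


node_rates = {name: node_rate(rate) for name, rate in RATES.items()}
per_gpu_gap = RATES["H200"] - RATES["H100"]
per_node_gap = node_rates["H200"] - node_rates["H100"]

for name, rate in node_rates.items():
    print(f"{name}: ${rate}/node-hour")
print(f"gap: ${per_gpu_gap}/GPU-hour, ${per_node_gap}/node-hour")
```

Changing `GPUS_PER_NODE` to the node size you actually provision shows how the dollar gap scales with it.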

GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. Its published rates are $2.00 per GPU-hour for the H100 and $2.60 per GPU-hour for the H200, on a platform validated against NVIDIA Reference Architecture and backed by a 99.99% platform availability SLA. The node figures above are those published rates multiplied by eight, not estimates.

When the Newer Generation Pays for Itself at the Node Level

A higher per-node cost is justified only when the extra hardware does more work. The H200's jump to 141GB and 4.80 TB/s pays off in two specific cases:

  • Long context. A larger KV cache from long prompts fits without spilling, which keeps throughput stable where an H100 node would degrade.
  • High concurrency. More memory absorbs more simultaneous requests per GPU, which can lower the per-node cost per token even though the node rate is higher.
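A rough way to see why memory drives both cases is to estimate how many tokens of KV cache fit in a node once the weights are loaded. The sketch below uses the standard per-token KV size (a key and a value vector per layer) with a hypothetical 70B-class model configuration; the layer count, KV head count, head dimension, and 16-bit precision are illustrative assumptions, not figures from this article. It also ignores activations, framework overhead, and fragmentation, so the results are upper bounds rather than measurements.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_capacity(node_memory_gb: float, weights_gb: float,
                      per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in node memory left over after weights."""
    free_bytes = (node_memory_gb - weights_gb) * 1e9
    return int(free_bytes // per_token_bytes)


# Hypothetical 70B-class model with grouped-query attention, 16-bit values.
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)
WEIGHTS_GB = 140.0  # assumption: 70e9 parameters x 2 bytes each

# Node memory is 8 GPUs times the per-GPU VRAM from the table.
h100_tokens = kv_token_capacity(8 * 80, WEIGHTS_GB, PER_TOKEN)
h200_tokens = kv_token_capacity(8 * 141, WEIGHTS_GB, PER_TOKEN)

print(f"KV bytes per token: {PER_TOKEN}")
print(f"H100 node KV capacity: {h100_tokens} tokens")
print(f"H200 node KV capacity: {h200_tokens} tokens")
print(f"capacity ratio: {h200_tokens / h100_tokens:.2f}x")
```

Under these assumptions the H200 node holds close to twice the cached tokens of the H100 node, which is the headroom that long prompts and concurrent requests draw on. Substitute your own model's configuration before relying on the ratio.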

One boundary applies. Per-node hourly pricing assumes sustained, predictable load that keeps all eight GPUs busy. For variable API traffic, a serverless model where you pay per request avoids paying for idle GPUs in the node, and the node math stops applying. Where the node math does apply, GMI Cloud's bare metal nodes run with no hypervisor, which GMI Cloud says delivers 100% of the advertised memory bandwidth, so the per-node throughput you measure reflects the GPUs rather than virtualization overhead.

Running the Node Math Over a Month, Not an Hour

The per-node hourly figure is the input; the monthly bill is what a budget actually defends. Stretching the math across a realistic serving window is where the generation decision usually settles, because small per-hour gaps compound.

Take the H100 and H200 nodes from the table. The per-node gap is $4.80 per hour. Run a node continuously for a 730-hour month and that gap becomes roughly $3,500 in additional spend for the H200 node. That number looks like a reason to stay on the H100 until you put throughput next to it.
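The monthly figure comes straight from the table rates. As a minimal sketch, assuming an 8-GPU node running all 730 hours at the quoted rates (the helper name is ours):

```python
from decimal import Decimal

HOURS_PER_MONTH = 730  # continuous-serving month used in this article
GPUS_PER_NODE = 8


def monthly_node_cost(per_gpu_rate: Decimal,
                      hours: int = HOURS_PER_MONTH) -> Decimal:
    """Monthly bill for one node billed hourly at a per-GPU rate."""
    return per_gpu_rate * GPUS_PER_NODE * hours


h100_month = monthly_node_cost(Decimal("2.00"))
h200_month = monthly_node_cost(Decimal("2.60"))
monthly_gap = h200_month - h100_month

print(f"H100 node: ${h100_month:,.2f}/month")
print(f"H200 node: ${h200_month:,.2f}/month")
print(f"gap:       ${monthly_gap:,.2f}/month")
```

The exact gap is $3,504, the "roughly $3,500" above, on monthly totals of $11,680 and $15,184.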

The question that decides it is whether the H200 node serves enough additional tokens to lower cost per token despite the higher monthly total. Two conditions make it do so:

  • The workload is memory-bound on the H100. If long context or high concurrency is already forcing the H100 node to cap batch size or evict KV cache, the H200's 141GB lifts effective throughput enough to absorb the extra cost.
  • Utilization stays high. The node math only holds when the GPUs are busy. A node that sits idle half the month pays the higher rate without earning the extra throughput, which is the case where the cheaper generation wins by default.
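Both conditions reduce to one number: cost per token. The sketch below is an illustration under stated assumptions, not a benchmark. The node rates come from the table, but the throughput values are placeholders to replace with your own measurements, and `utilization` is simply the fraction of billed hours the node spends serving.

```python
def cost_per_million_tokens(node_rate: float, tokens_per_sec: float,
                            utilization: float = 1.0) -> float:
    """USD per million tokens for a node billed by the hour."""
    if tokens_per_sec <= 0 or not 0 < utilization <= 1:
        raise ValueError("need tokens_per_sec > 0 and 0 < utilization <= 1")
    tokens_per_billed_hour = tokens_per_sec * 3600 * utilization
    return node_rate / tokens_per_billed_hour * 1_000_000


H100_NODE = 16.00  # USD per node-hour, from the table
H200_NODE = 20.80

# Throughput multiple the H200 node needs to match H100 cost per token.
break_even_ratio = H200_NODE / H100_NODE

# Placeholder throughput, NOT a measurement: swap in your own numbers.
h100_tps = 10_000.0
h100_cost = cost_per_million_tokens(H100_NODE, h100_tps)
h200_cost_fast = cost_per_million_tokens(H200_NODE, h100_tps * 1.5)
h200_cost_slow = cost_per_million_tokens(H200_NODE, h100_tps * 1.2)
h100_cost_half_idle = cost_per_million_tokens(H100_NODE, h100_tps, 0.5)

print(f"break-even throughput ratio: {break_even_ratio:.2f}x")
print(f"H100 node:                ${h100_cost:.4f} per 1M tokens")
print(f"H200 node at 1.5x:        ${h200_cost_fast:.4f} per 1M tokens")
print(f"H200 node at 1.2x:        ${h200_cost_slow:.4f} per 1M tokens")
print(f"H100 node, 50% utilized:  ${h100_cost_half_idle:.4f} per 1M tokens")
```

At these rates the H200 node has to deliver at least 1.3 times the H100 node's throughput to come out cheaper per token, and halving utilization doubles cost per token on either node.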

This is why the node decision cannot be made from the rate card alone. The same two nodes can favor different generations depending on whether the workload fills them. Measure your tokens per second on each before committing a month of budget to either.

Matching the Generation to the Workload

The three generations map cleanly to different serving needs:

  • Best for smaller models at the lowest entry cost: A100, where the older value tier serves models that do not need newer-architecture memory.
  • Best for balanced 7B to 70B serving: H100 at $2.00, where the per-node cost stays low for mainstream production load.
  • Best for long context or high-concurrency nodes: H200 at $2.60, where the extra memory lowers cost per token at the node level despite the higher rate.
  • Not suited to sizing by per-GPU rate alone: any 8-GPU deployment, where the node total and measured throughput decide real cost.

GMI Cloud is best suited for teams provisioning multi-GPU serving nodes, particularly those weighing a generation upgrade where the per-node cost over months, not the per-card sticker, drives the decision. You can confirm current pricing at gmicloud.ai/en/pricing and provision nodes through console.gmicloud.ai.

Decide on the Node Bill, Not the Card Rate

The per-GPU rate is where the comparison starts and the per-node bill is where it ends. Multiply each generation's rate by the node size you actually provision, weigh the newer card's memory against the higher node total, and let throughput settle the tie. A100, H100, and H200 are not ranked; they are sized to different models and node budgets. Run the node math with your own model, and the right generation becomes the one whose node bill matches the work it does.

Colin Mo
