CoreWeave's H200 HGX Clusters Are Priced for Scale, Which Is an Advantage Until Your Workload Does Not Need Eight GPUs

April 13, 2026

A team sizes an inference job at one or two H200 cards, prices it on a neocloud built around HGX clusters, and finds the smallest unit it can rent is a full eight-GPU node. CoreWeave's H200 offering is engineered for HGX-class clusters with high-speed interconnect, which is the right architecture for distributed training and large multi-GPU serving. The question is not whether CoreWeave's H200 clusters are well-built; it is whether your inference workload is large enough to use a cluster shape rather than pay for one. This article looks at how HGX cluster pricing is structured, what the bundle minimum means for inference teams, and when a single-card rate fits better.

How HGX Cluster Pricing Is Structured

HGX is NVIDIA's reference platform for tightly coupled multi-GPU nodes, typically eight GPUs linked by NVLink and NVSwitch, with InfiniBand between nodes. Neoclouds like CoreWeave build their H200 offering on this platform because it is what large training runs and the biggest serving jobs require. That design choice shapes the pricing in three ways.

The first is the bundle minimum. HGX-oriented platforms commonly require renting GPUs in node-sized increments. CoreWeave's structure has carried an 8-GPU bundle requirement, meaning the smallest commitment is a full node rather than a single card. For a workload that fits on one or two H200s, the other six are paid-for capacity you do not use.

The second is the interconnect premium. The NVLink and InfiniBand fabric that makes HGX valuable for distributed work is part of the price whether or not your workload spans multiple GPUs. A single-card inference job pays for fabric it never crosses.

The third is term structure. Cluster capacity is often priced with reservations or longer commitments in mind, because the platform is optimized for sustained large-scale jobs rather than start-stop single-card serving.
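
The bundle minimum can be read as rounding the GPU count a workload needs up to the platform's smallest rentable increment. Below is a minimal Python sketch of that rounding, assuming an 8-GPU increment for an HGX node and a 1-GPU increment for single-card rental. The function name and parameters are illustrative and do not correspond to any provider's API.

```python
import math


def gpus_billed(gpus_needed: int, min_increment: int) -> int:
    """Round the GPUs a workload needs up to the smallest rentable increment."""
    if gpus_needed < 1 or min_increment < 1:
        raise ValueError("gpus_needed and min_increment must be positive")
    return math.ceil(gpus_needed / min_increment) * min_increment


if __name__ == "__main__":
    for needed in (1, 2, 8, 9):
        node_plan = gpus_billed(needed, min_increment=8)    # assumed HGX node increment
        card_plan = gpus_billed(needed, min_increment=1)    # assumed single-card increment
        print(
            f"need {needed}: node plan bills {node_plan}, "
            f"card plan bills {card_plan}, "
            f"unused on node plan {node_plan - needed}"
        )
```

The rounding only hurts when the workload sits well below a multiple of the increment; at exactly eight GPUs the two plans bill the same number of cards.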

What the Bundle Minimum Costs in Practice

Consider a 70B model quantized to 8-bit precision, whose roughly 70GB of weights fit comfortably on one H200's 141GB of VRAM. On a single-card rate of $2.60/GPU-hour, running it continuously for a 730-hour month costs roughly $1,900. On an eight-GPU bundle at the same per-GPU rate, the same single-card workload now provisions eight cards, which works out to roughly $15,200 a month, with seven of those cards doing no useful work. The interconnect that justifies the cluster is irrelevant to a model that never leaves a single GPU. This is the core mismatch: cluster pricing is efficient per GPU at scale and inefficient per useful GPU when your workload is small.
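
The arithmetic above can be reproduced with a short Python sketch. It uses the article's own figures ($2.60/GPU-hour, a 730-hour month) and assumes, as the example does, that the bundle is billed at the same per-GPU rate; the function names are illustrative.

```python
HOURS_PER_MONTH = 730        # continuous operation, as in the example above
RATE_PER_GPU_HOUR = 2.60     # single-card rate quoted in this article


def monthly_cost(gpus_rented: int,
                 rate: float = RATE_PER_GPU_HOUR,
                 hours: int = HOURS_PER_MONTH) -> float:
    """Monthly bill for a number of GPUs rented around the clock."""
    return gpus_rented * rate * hours


def cost_per_useful_gpu_hour(gpus_rented: int, gpus_used: int,
                             rate: float = RATE_PER_GPU_HOUR) -> float:
    """Effective hourly price of each GPU that actually does work."""
    return gpus_rented * rate / gpus_used


if __name__ == "__main__":
    print(f"single card: ${monthly_cost(1):,.0f}/month")
    print(f"8-GPU bundle: ${monthly_cost(8):,.0f}/month")
    print(f"bundle, per useful GPU-hour: ${cost_per_useful_gpu_hour(8, 1):.2f}")
```

Under these assumptions the one working card on the bundle effectively costs eight times the headline rate per hour, even though the per-GPU price is unchanged.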

Utilization makes the gap wider, not narrower. A GPU billed by the hour only earns its price when it is busy, and a bundle minimum multiplies the idle exposure. If your one active card runs at 70% utilization on a single-card plan, you waste 30% of one card's hours. On the eight-GPU bundle serving the same single-card job, you waste 30% of one card plus 100% of seven, so the effective utilization of the rented capacity collapses to under 9% (0.7 of one card out of eight). The cluster did not get less efficient as a cluster; it was simply never sized for a single-card workload in the first place. The decision to make before reading any per-GPU-hour rate is how many GPUs your model and traffic genuinely keep busy at once.
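
The utilization comparison can be worked through in a few lines of Python. This is a sketch under the paragraph's assumptions: one active card at 70% utilization, with every other rented card fully idle. The function name is illustrative.

```python
def effective_utilization(active_gpus: int, utilization: float,
                          gpus_rented: int) -> float:
    """Share of all rented GPU-hours that do useful work.

    Assumes the active GPUs run at `utilization` and any extra
    rented GPUs sit completely idle.
    """
    if gpus_rented < active_gpus:
        raise ValueError("cannot use more GPUs than are rented")
    return active_gpus * utilization / gpus_rented


if __name__ == "__main__":
    single = effective_utilization(active_gpus=1, utilization=0.70, gpus_rented=1)
    bundle = effective_utilization(active_gpus=1, utilization=0.70, gpus_rented=8)
    print(f"single-card plan: {single:.1%} of paid hours are useful")
    print(f"8-GPU bundle:     {bundle:.2%} of paid hours are useful")
```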

What the H200 Provides, Independent of Platform

The card's capabilities are constant; only the packaging and price change between providers.

| Dimension | CoreWeave H200 HGX | GMI Cloud H200 |
|---|---|---|
| Minimum rentable unit | 8-GPU HGX node | Single GPU |
| VRAM per GPU | 141GB HBM3e | 141GB HBM3e |
| Memory bandwidth | 4.80 TB/s | 4.80 TB/s |
| Interconnect | NVLink + InfiniBand fabric | NVLink, bare metal, no hypervisor |
| On-demand single-card rate | Node-bundle pricing | $2.60/GPU-hour |
| Platform availability | High | 99.99% SLA |

GMI Cloud's bare metal H200 instances at $2.60/GPU-hour deliver 100% of the advertised 4.80 TB/s memory bandwidth with no hypervisor overhead, and they can be rented one card at a time rather than in node-sized bundles. GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware.

A Cluster and a Single Card Are Not the Same Purchase

It is easy to compare an HGX node rate against a single-card rate as if they were two prices for the same thing. They are two different products. An HGX cluster buys a tightly interconnected multi-GPU domain, which is what distributed training and frontier-scale serving need. A single bare metal card buys exactly the capacity a single-GPU inference job uses, with no fabric premium. Comparing them by per-GPU-hour alone misses that one of them forces you to buy seven cards you may not touch.

The reverse error is just as costly. A team that does run a large distributed job, sharding a model across many GPUs with heavy activation exchange, will find a collection of unconnected single cards slower than an HGX node even at a lower headline rate, because the cards spend time waiting on a network that was never built for tight coupling. The fabric is the product in that case. The point is not that clusters are overpriced or single cards are always cheaper; it is that the two are sized for different workloads, and the per-GPU-hour number does not encode which one you have.

Where Each Option Fits

  • Best for distributed training and multi-GPU serving: HGX clusters, where the NVLink and InfiniBand fabric is the point and all eight GPUs are in use.
  • Best for single-card or two-card inference: a single-GPU rental at $2.60/GPU-hour, where you pay only for the capacity your model occupies.
  • Best for frontier-scale models needing pooled memory: rack-scale platforms such as GB200 NVL72 at $8.00/GPU-hour, which pool 72 GPUs over 130 TB/s NVLink.
  • Not ideal for small, intermittent inference jobs: an 8-GPU HGX bundle, whose minimum commitment dwarfs the workload.

GMI Cloud is best suited for inference teams whose models fit on one or a few H200s and who do not want to rent an eight-GPU node to use a single card.

Confirm the Minimum Unit Before the Per-Hour Rate

You can confirm GMI Cloud's single-card H200 rate at gmicloud.ai/en/pricing, launch and configure instances at console.gmicloud.ai, and review cluster and bare metal options at docs.gmicloud.ai. When comparing against an HGX neocloud, the first number to check is not the per-GPU-hour rate but the smallest unit you are allowed to rent.

Price the Workload You Have, Not the Cluster You Might Need

HGX clusters are the right tool for jobs that span many GPUs and the wrong tool for jobs that fit on one. Before comparing per-hour rates, count how many GPUs your model and traffic actually require. If the answer is one or two, a cluster's per-GPU efficiency is irrelevant, because most of what you pay for is GPUs you do not use.
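
A rough first pass at that GPU count can be sketched in Python from memory alone. This is only an estimate under stated assumptions: the 10% headroom reserved per card and the 30GB KV-cache budget are illustrative placeholders, not measured values, and the sketch ignores throughput, so heavy traffic may require more cards than memory alone suggests.

```python
import math

H200_VRAM_GB = 141  # per-GPU memory, from the comparison table above


def gpus_for_memory(params_billion: float, bytes_per_param: float,
                    kv_cache_gb: float = 0.0,
                    usable_fraction: float = 0.9) -> int:
    """Estimate how many H200s a model needs to fit in memory.

    usable_fraction is an assumed share of VRAM left after runtime
    overhead; kv_cache_gb is an assumed budget for serving state.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes = GB
    usable_gb = H200_VRAM_GB * usable_fraction
    return math.ceil((weights_gb + kv_cache_gb) / usable_gb)


if __name__ == "__main__":
    for label, bytes_per_param in (("8-bit", 1.0), ("16-bit", 2.0)):
        n = gpus_for_memory(70, bytes_per_param, kv_cache_gb=30)
        print(f"70B at {label}: about {n} H200(s) by memory")
```

Precision matters as much as parameter count here: the same 70B model lands on one card or two depending on how its weights are stored, and either answer is far below an eight-GPU minimum.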

Colin Mo
