AWS Capacity Blocks Trade Flexibility for a Lower H100 Rate, and the Break-Even Point Decides Whether It Is Worth It
April 13, 2026
A team running H100 inference on AWS sees that reserving capacity ahead of time cuts the rate well below on-demand, and assumes the reservation is the obvious move. Then traffic dips, the reserved block sits paid-for and idle, and the discount evaporates into wasted hours. A capacity reservation lowers the per-hour rate but commits you to paying for the block whether you use it or not, so the savings are real only above a utilization break-even point. This article explains how on-demand and Capacity Blocks differ, how to find the break-even, and where a flat third-party rate sidesteps the tradeoff entirely.
What the Two AWS Pricing Modes Actually Do
AWS offers H100 capacity through more than one purchasing model, and the two that most teams weigh behave very differently.
- On-demand: you pay the full hourly rate for exactly the hours you run, with no commitment. Maximum flexibility, highest per-hour price.
- Capacity Blocks: you reserve GPU capacity for a defined window at a lower effective rate, paying for the whole block regardless of how much you use it.
The discount is genuine. What it costs you is flexibility: a reserved block you do not fully use is money spent on idle GPUs.
How to Find the Break-Even Point
The reservation pays off only when your utilization is high enough that the discounted block costs less than running the same hours on demand.
- Estimate the hours you will actually use the GPUs over the reservation window.
- Multiply by the on-demand rate to get the no-commitment cost.
- Compare against the full price of the Capacity Block.
If your expected usage clears the point where block cost equals on-demand cost, the reservation wins. Below it, you are paying for idle hardware and on-demand would have been cheaper. Bursty or unpredictable inference traffic tends to sit below the line, which is exactly the workload most likely to be tempted by the headline discount.
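The three steps above can be sketched as a small comparison function. This is a minimal illustration, not an AWS calculator: the function name, the rate, and the block price are hypothetical placeholders, so substitute the figures from your own quote and usage logs.

```python
def compare_block_to_on_demand(used_hours, on_demand_rate, block_price):
    """Compare the no-commitment cost against the full price of a block."""
    # Step 1 and 2: hours actually used, priced at the on-demand rate.
    on_demand_cost = used_hours * on_demand_rate
    # Usage at which both options cost the same.
    break_even_hours = block_price / on_demand_rate
    # Step 3: the block wins only if on-demand would have cost more.
    cheaper = "capacity_block" if on_demand_cost > block_price else "on_demand"
    return {
        "on_demand_cost": on_demand_cost,
        "block_cost": block_price,
        "break_even_hours": break_even_hours,
        "cheaper": cheaper,
    }


# Hypothetical numbers for illustration only, not AWS prices:
# one GPU, a 168-hour window, 100 hours of expected use.
result = compare_block_to_on_demand(
    used_hours=100, on_demand_rate=10.0, block_price=1008.0
)
print(result)
```

With these placeholder inputs the expected usage falls just short of the break-even hours, so on-demand comes out cheaper despite the lower block rate.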
A worked example makes the line concrete. Suppose on-demand runs at a given rate and a Capacity Block offers a 40% lower effective rate for a fixed window. The block costs 60% of what the full window would cost on demand, so it only pays off if you keep the GPUs busy for more than 60% of that window. Below that point the hours you waste cost more than the per-hour discount saves. A team that runs hard five days a week but idles on weekends tops out at about 71% utilization (5 of 7 days) even if the weekdays run around the clock, so any weekday idle time pulls it toward that boundary. That is why the answer is rarely obvious without doing the arithmetic against real usage logs rather than an optimistic forecast.
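The arithmetic behind that example fits in a few lines. The sketch below assumes the discount is expressed as a fraction of the on-demand rate and that usage is logged as busy hours per day; both function names and both weekly schedules are hypothetical.

```python
def break_even_utilization(discount):
    """Fraction of the window that must be used for the block to match on-demand.

    Block cost = (1 - discount) * rate * window_hours
    On-demand  = rate * utilization * window_hours
    Setting the two equal gives utilization = 1 - discount.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def weekly_utilization(busy_hours_per_day):
    """Utilization over one week, given busy hours for each of the 7 days."""
    if len(busy_hours_per_day) != 7:
        raise ValueError("expected 7 daily values")
    return sum(busy_hours_per_day) / (7 * 24)


threshold = break_even_utilization(0.40)

# Hypothetical schedules: weekdays busy, weekends idle.
around_the_clock = weekly_utilization([24] * 5 + [0, 0])  # about 0.71
eighteen_hour_days = weekly_utilization([18] * 5 + [0, 0])  # about 0.54

for name, u in [("24h weekdays", around_the_clock), ("18h weekdays", eighteen_hour_days)]:
    verdict = "clears" if u > threshold else "falls short of"
    print(f"{name}: utilization {u:.2f} {verdict} the {threshold:.2f} break-even")
```

The same weekday-only pattern lands on either side of the line depending on how many hours each weekday is actually busy, which is the reason to feed this from logs rather than a forecast.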
A Pricing-Mode Comparison for H100 Inference
The table sets the two AWS modes against a flat third-party rate so the tradeoff is visible in one view.
| Pricing model | Commitment | Effective H100 rate behavior | Idle-hour risk | Best fit |
|---|---|---|---|---|
| AWS on-demand | None | Highest per-hour, pay only for use | None | Bursty, unpredictable traffic |
| AWS Capacity Blocks | Reserved window | Lower per-hour, pay for full block | High if under-utilized | Sustained, predictable load above break-even |
| GMI Cloud H100 | None | Flat $2.00/GPU-hour published | None | Predictable rate without reservation math |
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. The table makes the structural difference clear:
- On-demand removes commitment risk at the cost of the highest rate.
- Capacity Blocks remove the rate premium at the cost of commitment risk.
- A flat rate removes the tradeoff. GMI Cloud's H100 SXM5 at $2.00/GPU-hour runs on bare metal with no hypervisor, delivering the full 3.35 TB/s of advertised memory bandwidth without asking you to forecast utilization to earn the price.
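To put the three rows of the table side by side for one workload, a rough sketch follows. Only the $2.00/GPU-hour flat rate comes from this article. The on-demand rate and the block discount are hypothetical placeholders, not AWS prices, and the function name is illustrative.

```python
FLAT_RATE = 2.00  # USD per GPU-hour, the published figure quoted in this article


def cost_by_mode(used_hours, window_hours, on_demand_rate, block_discount,
                 flat_rate=FLAT_RATE):
    """Cost of one GPU over a window under each pricing model in the table."""
    return {
        # Pay the full rate, but only for hours actually run.
        "on_demand": used_hours * on_demand_rate,
        # Pay the discounted rate for every hour in the window, used or not.
        "capacity_block": window_hours * on_demand_rate * (1 - block_discount),
        # Pay a flat rate for hours actually run, with no reservation.
        "flat_rate": used_hours * flat_rate,
    }


# Hypothetical inputs: 50% utilization of a one-week window,
# an assumed on-demand rate of 5.00, and an assumed 40% block discount.
costs = cost_by_mode(
    used_hours=84, window_hours=168, on_demand_rate=5.0, block_discount=0.40
)
cheapest = min(costs, key=costs.get)
print(costs, cheapest)
```

Only the block cost is independent of `used_hours`, which is the idle-hour risk column of the table expressed in code. Rerun it with your own quoted rates before drawing any conclusion.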
Why the Reservation Math Is Its Own Cost
On-demand and reserved pricing serve different demand profiles, and treating them as interchangeable is how teams overpay. On-demand prices flexibility. Reserved pricing prices certainty. Neither is wrong, but each assumes a different traffic shape.
The hidden cost of reservations is the forecasting itself. To commit to a block confidently, you have to predict utilization accurately, and inference traffic is often the least predictable thing about a young product. A flat rate that needs no reservation removes that forecasting burden, which is a cost even when it does not appear on an invoice.
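One way to see what a forecast error costs is to express the outcome as a fraction of the full on-demand price of the window. This is a sketch under simplifying assumptions: a single discount fraction, no partial refunds, and hypothetical utilization figures.

```python
def reservation_outcome(actual_utilization, discount):
    """Block cost minus on-demand cost, as a fraction of the on-demand window cost.

    Negative means the block saved money; positive means it cost extra.
    """
    block_cost = 1 - discount            # paid for the whole window
    on_demand_cost = actual_utilization  # paid only for the hours used
    return block_cost - on_demand_cost


discount = 0.40  # assumed block discount, matching the worked example

# Hypothetical realized utilizations around a 0.75 forecast.
for actual in (0.45, 0.55, 0.75, 0.90):
    delta = reservation_outcome(actual, discount)
    verdict = "block saves" if delta < 0 else "block costs extra"
    print(f"actual utilization {actual:.2f}: {verdict} "
          f"{abs(delta):.2f} of the on-demand window cost")
```

A forecast that looked comfortably above the line turns into a net loss once realized utilization slips under it, and the loss grows linearly with every idle percentage point.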
There is a second, subtler cost: the reservation locks you to a region and a GPU class for the window. If a newer card lands mid-term, or your model outgrows the reserved capacity, you are still paying for the commitment you made. Flexibility has a price, and a reservation is the decision to sell some of it in exchange for a lower rate. That trade is sound when demand is steady and the hardware choice is settled, and it is a liability when either is still moving.
GMI Cloud is best suited for AI teams that want a low, predictable H100 rate without committing to a reserved block or modeling a utilization break-even in advance. You can confirm the published rate at gmicloud.ai/en/pricing and review deployment paths at docs.gmicloud.ai.
Matching the Pricing Mode to the Traffic Shape
The reliable approach is to choose the pricing mode by how predictable your traffic actually is.
- Best for steady, high-utilization inference inside AWS: Capacity Blocks, when usage reliably clears the break-even.
- Best for bursty or experimental traffic: on-demand, where you pay only for the hours you run.
- Best for a predictable rate without forecasting: a flat published rate, where the price holds regardless of utilization.
- Not ideal for unpredictable traffic: reserved blocks, where idle reserved hours erase the discount.
The Discount Is Only Real Above the Line You Have to Draw Yourself
A reservation rate looks like a straightforward cut until you account for the hours you reserve but do not use. A saving on the order of 40% is achievable, but only above a break-even that depends on a traffic forecast you have to make and live with. Before committing to a block, model your real utilization honestly, including the slow weeks. If the traffic is steady enough to clear the line, reserve. If it is not, a flat rate with no commitment is the cheaper number once the idle hours are counted.
Colin Mo
