The Sticker Price on an AWS p5.48xlarge Is the Part of the H100 Bill That Is Easiest to Predict
April 13, 2026
A team prices an 8-GPU H100 instance on AWS, divides by eight, and writes down a per-GPU number that looks competitive. Then the monthly invoice arrives with line items the instance page never emphasized: EBS storage, data transfer, and the elastic network fabric the instance depends on. On a large general-purpose cloud, the GPU hourly rate is rarely the whole H100 cost, because storage and networking are billed separately and scale with how you actually run the job. This article breaks down what a p5.48xlarge really costs to operate, why the add-on charges accumulate, and how a flat per-GPU rate changes the math.
What the p5.48xlarge Actually Bundles
The p5.48xlarge is AWS's 8-way H100 instance. Its on-demand rate covers the eight GPUs and the host, but a production inference deployment touches several other billed services.
- Compute: eight NVIDIA H100 GPUs plus host CPU and memory, billed per instance-hour.
- Storage: EBS volumes for model weights and working data, billed per GB-month plus provisioned IOPS or throughput.
- Data transfer: egress and cross-AZ traffic, billed per GB.
- Networking: the high-bandwidth fabric that keeps multi-GPU jobs fed, with its own configuration and cost implications.
Each service is reasonable on its own. The surprise is that the per-GPU figure people quote from the instance page is only the first of these.
It is worth being concrete about scale. An 8-way H100 instance running a production endpoint might keep tens of terabytes of model weights and cached data on EBS, provision high IOPS to load them without stalling the GPUs, and push large inference payloads back to clients across availability zones. None of that appears in the per-instance-hour rate, yet all of it lands on the same invoice. A team that benchmarks only the GPU rate and signs off on the budget is reading one line of a bill on which four or five lines vary with usage.
Why the Add-On Charges Accumulate
The instance rate is fixed per hour. The add-on charges are not, and that is where budgets drift.
Storage cost scales with how many model checkpoints and datasets you keep resident, and with the IOPS you provision to load them quickly. A team serving several large models keeps far more on disk than the headline implies.
Data transfer cost scales with traffic. Inference endpoints that return large payloads, or that pull data across availability zones, accrue egress that the GPU rate never reflects.
Networking cost scales with topology. Multi-instance serving leans on the elastic fabric, and the configuration that delivers full bandwidth carries its own charges.
The result is that two teams running the same instance can see different effective per-GPU costs, depending entirely on the services around the GPU.
This is the part that makes cross-provider comparison hard. A flat per-GPU rate is a single number you can multiply by hours and GPUs. A bundled rate plus metered extras is a function of your storage footprint, your traffic pattern, and your network topology, none of which you know precisely until the workload has been running for a month. The honest way to compare is to model a full month of realistic usage on the bundled provider, add the metered services, and only then divide by GPU-hours to get a number that lines up with a flat rate.
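The comparison procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a billing calculator. The class and function names are hypothetical, the storage and transfer unit prices are placeholders rather than AWS list prices, and the $98.32 instance rate is simply the ~$12.29 per-GPU figure quoted in this article multiplied by eight. Provisioned IOPS or throughput charges are not modeled separately; they would be added to the total in the same way.

```python
from dataclasses import dataclass

GPUS_PER_INSTANCE = 8  # the p5.48xlarge is an 8-way H100 instance


@dataclass
class MonthlyUsage:
    """One month of usage for a single instance (illustrative fields)."""
    instance_hours: float            # hours the instance was running
    storage_gb_months: float         # GB-months of block storage kept resident
    transfer_gb: float               # GB of egress plus cross-AZ traffic
    network_extras_usd: float = 0.0  # fabric-related charges as a lump sum


@dataclass
class MeteredRates:
    """Unit prices; substitute the values from your own invoice."""
    instance_usd_per_hour: float
    storage_usd_per_gb_month: float
    transfer_usd_per_gb: float


def effective_per_gpu_hour(usage: MonthlyUsage, rates: MeteredRates) -> float:
    """Full monthly bill divided by the GPU-hours actually consumed."""
    if usage.instance_hours <= 0:
        raise ValueError("instance_hours must be positive")
    compute = usage.instance_hours * rates.instance_usd_per_hour
    storage = usage.storage_gb_months * rates.storage_usd_per_gb_month
    transfer = usage.transfer_gb * rates.transfer_usd_per_gb
    total = compute + storage + transfer + usage.network_extras_usd
    return total / (usage.instance_hours * GPUS_PER_INSTANCE)


# Placeholder prices for illustration only, NOT quoted AWS rates.
rates = MeteredRates(
    instance_usd_per_hour=98.32,   # ~12.29 per GPU x 8, per this article
    storage_usd_per_gb_month=0.10,
    transfer_usd_per_gb=0.05,
)

# Two hypothetical teams on the same instance with different footprints.
lean_team = MonthlyUsage(instance_hours=730, storage_gb_months=2_000, transfer_gb=1_000)
heavy_team = MonthlyUsage(instance_hours=730, storage_gb_months=40_000, transfer_gb=200_000)

for name, usage in [("lean", lean_team), ("heavy", heavy_team)]:
    print(f"{name}: ${effective_per_gpu_hour(usage, rates):.2f} per GPU-hour")
```

Both teams pay the same instance rate, but the team with the larger storage footprint and heavier traffic ends up with a higher effective per-GPU-hour figure. That effective figure, not the instance rate divided by eight, is the number to set against a flat rate.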
A Per-GPU Cost Comparison for H100 Inference
The table sets the AWS bundle against a flat per-GPU rate. The AWS per-GPU figure is the instance rate divided by eight before storage, transfer, and networking; the GMI Cloud figure is the published all-in GPU-hour rate.
| Provider | Offering | Per-GPU rate | What the rate includes | Separate billed extras |
|---|---|---|---|---|
| AWS | p5.48xlarge (8x H100) | ~$12.29/hr per GPU on-demand | GPU + host | EBS storage, data egress, cross-AZ, network fabric |
| GMI Cloud | NVIDIA H100 SXM5 | $2.00/GPU-hour | GPU on bare metal, full bandwidth | Transparent published model and GPU rates |
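To put the table's hourly figures on a monthly footing, the sketch below multiplies each per-GPU rate by eight GPUs and a 730-hour month. The 730-hour month and the function name are assumptions made for illustration, and the rates are copied from the table as quoted, so confirm current pricing before relying on them. The AWS result is a floor that excludes storage, transfer, and networking.

```python
HOURS_PER_MONTH = 730  # common billing approximation: 8,760 hours / 12
GPUS = 8               # one 8-way H100 node

# Per-GPU hourly figures as quoted in the table above (may be out of date).
AWS_PER_GPU_HOUR = 12.29   # approximate: instance rate / 8, before metered extras
FLAT_PER_GPU_HOUR = 2.00   # published flat GPU-hour rate


def monthly_gpu_cost(per_gpu_hour: float, gpus: int = GPUS,
                     hours: float = HOURS_PER_MONTH) -> float:
    """GPU-only cost of running `gpus` GPUs continuously for `hours` hours."""
    return per_gpu_hour * gpus * hours


aws_floor = monthly_gpu_cost(AWS_PER_GPU_HOUR)
flat_total = monthly_gpu_cost(FLAT_PER_GPU_HOUR)

print(f"AWS GPU-only floor: ${aws_floor:,.2f} per month")
print(f"Flat-rate total:    ${flat_total:,.2f} per month")
```

With the table's figures, a month of continuous 8-GPU use comes to roughly $71,800 at the AWS per-GPU rate and $11,680 at the flat rate, before any metered extras are added on the AWS side.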
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. None of this means AWS is wrong to itemize: large clouds price granularly because they serve every kind of workload. The point is that a flat rate makes the H100 line predictable.
A few readings worth stating plainly:
- The AWS per-GPU number is a floor, not a total. Storage and transfer sit on top and scale with usage.
- A flat rate collapses the variables. GMI Cloud's H100 SXM5 at $2.00/GPU-hour runs on bare metal with no hypervisor, delivering 100% of the advertised 3.35 TB/s memory bandwidth that inference throughput depends on.
- Predictability is itself a cost saving when finance has to forecast a quarter rather than reconcile an invoice.
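The forecasting point can be made concrete with a small sketch that turns a fixed monthly charge plus a set of metered-usage scenarios into a quarterly budget band. The function name and the scenario amounts are hypothetical. The two fixed monthly figures are the table's per-GPU rates multiplied by 8 GPUs and an assumed 730-hour month.

```python
from typing import Iterable, Tuple

MONTHS_PER_QUARTER = 3


def quarterly_band(fixed_monthly_usd: float,
                   metered_monthly_scenarios_usd: Iterable[float]) -> Tuple[float, float]:
    """Return (low, high) quarterly totals across metered-usage scenarios.

    With no metered scenarios the band collapses to a single number.
    """
    scenarios = list(metered_monthly_scenarios_usd) or [0.0]
    low = (fixed_monthly_usd + min(scenarios)) * MONTHS_PER_QUARTER
    high = (fixed_monthly_usd + max(scenarios)) * MONTHS_PER_QUARTER
    return low, high


# Flat rate: 2.00 x 8 GPUs x 730 h, nothing metered on top.
flat_low, flat_high = quarterly_band(11_680.0, [])

# Bundled rate: 12.29 x 8 GPUs x 730 h, plus hypothetical monthly extras
# for a quiet, a typical, and a busy month (placeholder amounts).
bundled_low, bundled_high = quarterly_band(71_773.6, [500.0, 5_000.0, 15_000.0])

print(f"flat:    ${flat_low:,.0f} to ${flat_high:,.0f}")
print(f"bundled: ${bundled_low:,.0f} to ${bundled_high:,.0f}")
```

The flat case yields a single figure, while the bundled case yields a range whose width depends entirely on how far the metered scenarios diverge. That width is what finance has to absorb when the forecast is made before the traffic is known.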
When the Large-Cloud Bundle Is Still the Right Fit
On-demand instances and flat-rate dedicated GPUs serve different needs, and the distinction is worth keeping clear. A large cloud bundles GPU, storage, and networking because many workloads need all three deeply integrated with the rest of that cloud's services. A dedicated GPU provider strips the rate down to the compute and publishes it flat.
That difference matters most when an inference workload already lives inside an existing cloud footprint. If your data, identity, and pipelines are deep in one provider, the integrated bundle can be worth its premium. If the workload is primarily GPU inference, the add-on charges are paying for integration you may not use.
GMI Cloud is best suited for AI teams whose primary cost is GPU inference and who want a flat, forecastable per-GPU rate rather than a bundle of separately metered cloud services. You can confirm the current rate at gmicloud.ai/en/pricing and review deployment options at docs.gmicloud.ai.
Matching the Pricing Model to the Workload
The reliable approach is to match the pricing model to where your cost actually concentrates.
- Best for workloads deep inside an existing cloud: the integrated p5 bundle, where co-located storage and services justify the extras.
- Best for GPU-dominant inference: a flat per-GPU rate, where the bill tracks the GPUs and little else.
- Not ideal for tight forecasting: usage-metered storage and egress, when finance needs a fixed monthly number.
- Not ideal for bandwidth-sensitive serving on shared tiers: virtualized instances that lose a slice of advertised bandwidth to overhead.
Read the Whole Invoice Before You Trust the Hourly Rate
The hourly GPU rate is the easiest number to compare and the least complete. On a large cloud, the H100 cost is the instance rate plus the storage you keep, the data you move, and the network you provision. Before you commit to a per-GPU figure, model a full month of real traffic against it. The provider that wins on the instance page is not always the one that wins on the invoice, and the only way to know is to add up everything the GPU rate leaves out.
Colin Mo
