Per-Second H200 Billing on Koyeb and Modal Solves Idle Cost, but the Real Number Hides in Cold Starts and Utilization
April 13, 2026
A team moves an LLM endpoint to a per-second serverless GPU platform to stop paying for idle time, then opens the invoice and finds it higher than the old hourly bill. Per-second billing on platforms like Koyeb and Modal is genuinely better for spiky traffic, but the headline rate only tells you what a busy second costs, not what a real workload costs. The per-second rate is the easy number; cold-start latency, minimum billable durations, and how much of each second you actually use are the numbers that decide your bill. This article compares how Koyeb and Modal price serverless H200 inference, where the model breaks down, and how it lines up against renting the underlying GPU outright.
What Per-Second Serverless GPU Billing Actually Charges For
Per-second serverless platforms bill only while your code holds a GPU, then release it. For bursty, API-driven inference, that is the right shape. The mechanics differ from a flat hourly rental in ways that matter once traffic is real.
Modal bills GPU usage by the second and is designed around fast cold starts, targeting sub-2-second container start times so that scaling from zero does not stall the first request. Koyeb offers serverless GPUs with similar per-second granularity and scale-to-zero behavior, aimed at the same problem: turning idle GPUs off without operator intervention.
The shared promise is that you pay for compute proportional to demand. The shared catch is that "demand" includes overhead you do not always see on the rate card.
Three Costs the Per-Second Rate Does Not Show
- Cold-start time. When a function scales from zero, the seconds spent loading the container and model weights may be billable, and they add latency to the first request. A multi-billion-parameter model can take meaningful time to load into the H200's 141 GB of VRAM.
- Minimum billable duration and keep-warm. Where a platform enforces a minimum billable duration, short calls are rounded up to it. To avoid cold starts on every call, teams also keep instances warm, which reintroduces the idle cost serverless was meant to remove.
- Utilization per second. A request that needs only 200 ms of GPU work inside a billed second is still charged for the full second in many serverless models, so the remaining 800 ms is paid-for overhead.
This is why the cheapest per-second rate is rarely the cheapest per token once these factors are counted. The break-even depends entirely on traffic shape.
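The three hidden costs above can be folded into a rough estimate of billed time. The following is a minimal sketch, not any platform's actual billing logic: the per-second rate, the 20-second model load, and the function names are all hypothetical, and it assumes active time is rounded up to the billing increment and keep-warm time is billed at the same rate.

```python
import math

# Hypothetical per-second rate (about $4.50/hour), for illustration only.
# It is not a quoted Koyeb or Modal price.
HYPOTHETICAL_RATE_PER_SECOND = 0.00125


def billed_seconds(work_seconds, cold_start_seconds=0.0, keep_warm_seconds=0.0,
                   cold_start_billed=True, billing_increment=1.0):
    """Estimate the seconds billed for one burst of GPU work.

    Assumes active time is rounded up to the billing increment and that
    keep-warm time is billed at the same rate as active time.
    """
    active = work_seconds + (cold_start_seconds if cold_start_billed else 0.0)
    rounded_active = math.ceil(active / billing_increment) * billing_increment
    return rounded_active + keep_warm_seconds


def overhead_multiplier(work_seconds, **kwargs):
    """Ratio of billed seconds to seconds of useful GPU work."""
    return billed_seconds(work_seconds, **kwargs) / work_seconds


# A 200 ms request on a warm instance is billed as one full second.
warm = billed_seconds(0.2)
# The same request after scale-to-zero, assuming a 20 s billed model load.
cold = billed_seconds(0.2, cold_start_seconds=20.0)

print(f"warm: {warm:.0f} s billed, ${warm * HYPOTHETICAL_RATE_PER_SECOND:.5f}")
print(f"cold: {cold:.0f} s billed, ${cold * HYPOTHETICAL_RATE_PER_SECOND:.5f}")
print(f"warm overhead multiplier: {overhead_multiplier(0.2):.1f}x")
```

Batching several requests into one billed second, or checking whether your platform bills cold-start time at all, changes the multiplier more than small differences in the headline rate.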
A Worked Comparison: When Serverless Wins and When It Stops Winning
Consider a 70B model served on H200 with two traffic patterns.
Pattern one is spiky: a few thousand requests clustered into a few hours a day, near zero the rest of the time. Here scale-to-zero serverless is the clear fit. If the GPU would otherwise sit idle 70% of the day, paying only for active seconds beats paying a flat 24-hour rate.
Pattern two is steady: consistent traffic across the day at high concurrency, keeping a card busy most hours. Here the picture flips. Once utilization climbs past roughly 60%, a flat hourly rate on a dedicated H200 becomes the cheaper line item, because you are no longer paying serverless overhead on top of compute, and you are not keeping instances warm to dodge cold starts.
The 60% figure is worth deriving rather than asserting. A dedicated H200 at $2.60/GPU-hour costs the same whether it runs at 30% or 100% utilization, so its effective cost per unit of work halves as you go from 50% to 100% busy. A per-second serverless platform, by contrast, holds its per-second rate roughly constant with use but adds cold-start seconds and keep-warm hours whenever traffic gaps appear. The crossover sits where the serverless bill for busy seconds plus that overhead equals the dedicated card's flat charge for all 24 hours, so the exact percentage depends on the serverless rate and overhead you are comparing against; roughly 60% is a working threshold, not a constant. As your duty cycle rises, the dedicated card's amortized cost falls below the serverless rate plus its overhead, and the steadier your traffic, the wider that margin grows. Below the crossover, the gaps in your traffic are doing you a favor on serverless; above it, those same gaps are gone and you are simply paying overhead for a billing model you no longer need.
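To make the crossover concrete, here is a minimal sketch of the break-even arithmetic. Only the $2.60/GPU-hour dedicated rate comes from this article. The serverless hourly-equivalent rate and the daily overhead hours are hypothetical values, picked so the result lands near the ~60% threshold; substitute the figures from your own invoice. The model also assumes overhead stays fixed as the duty cycle changes, which is a simplification.

```python
DEDICATED_RATE_PER_HOUR = 2.60               # from the article
HYPOTHETICAL_SERVERLESS_RATE_PER_HOUR = 4.00  # assumption, not a quoted price
HYPOTHETICAL_OVERHEAD_HOURS_PER_DAY = 1.2     # assumed cold-start + keep-warm time


def dedicated_daily_cost(rate_per_hour=DEDICATED_RATE_PER_HOUR):
    """A dedicated card bills all 24 hours regardless of utilization."""
    return 24.0 * rate_per_hour


def serverless_daily_cost(duty_cycle,
                          rate_per_hour=HYPOTHETICAL_SERVERLESS_RATE_PER_HOUR,
                          overhead_hours=HYPOTHETICAL_OVERHEAD_HOURS_PER_DAY):
    """Busy hours plus overhead hours, capped at a full day of billing."""
    billed_hours = min(24.0, 24.0 * duty_cycle + overhead_hours)
    return billed_hours * rate_per_hour


def break_even_duty_cycle(dedicated_rate=DEDICATED_RATE_PER_HOUR,
                          serverless_rate=HYPOTHETICAL_SERVERLESS_RATE_PER_HOUR,
                          overhead_hours=HYPOTHETICAL_OVERHEAD_HOURS_PER_DAY):
    """Duty cycle at which both daily costs are equal.

    Solves (24 * u + overhead) * serverless_rate = 24 * dedicated_rate for u.
    Only meaningful when serverless_rate exceeds dedicated_rate.
    """
    return dedicated_rate / serverless_rate - overhead_hours / 24.0


print(f"break-even duty cycle: {break_even_duty_cycle():.0%}")
for duty_cycle in (0.3, 0.6, 0.9):
    print(f"{duty_cycle:.0%} busy: serverless ${serverless_daily_cost(duty_cycle):.2f}"
          f" vs dedicated ${dedicated_daily_cost():.2f} per day")
```

Under these assumptions the spiky pattern (about 30% busy) is far cheaper on serverless, while the steady pattern (about 90% busy) is cheaper on the dedicated card. A higher serverless rate or more keep-warm time pulls the crossover lower.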
GMI Cloud's bare metal H200 instances at $2.60/GPU-hour deliver 100% of the advertised 4.80 TB/s memory bandwidth with no hypervisor overhead, which sets a concrete baseline for what the underlying card costs before any serverless markup. At steady high utilization, that hourly floor is the number a per-second platform is implicitly marking up.
Serverless and Dedicated Are Not Competing Prices for the Same Workload
It is tempting to compare a per-second serverless rate against an hourly dedicated rate as if they were the same product. They serve different needs. Serverless inference is built for variable, unpredictable traffic where scale-to-zero avoids paying for idle GPUs. Dedicated GPU rental is built for sustained throughput where consistent latency and full hardware control matter more than elasticity. The decision is not which is cheaper in the abstract; it is which matches your traffic.
| Dimension | Per-second serverless (Koyeb / Modal) | Dedicated H200 rental |
|---|---|---|
| Billing granularity | Per second, scale to zero | Per hour, $2.60/GPU-hour on GMI Cloud |
| Cold start exposure | Yes, first request after idle | None, instance stays up |
| Best utilization range | Low, bursty (under ~60%) | High, steady (over ~60%) |
| Latency consistency | ★★★☆☆ | ★★★★★ |
| Idle cost efficiency | ★★★★★ | ★★☆☆☆ |
GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering both serverless inference for variable traffic and dedicated GPU clusters for sustained load, so the choice does not require switching providers as your pattern changes.
Where Each Model Fits
- Best for early-stage APIs with unpredictable traffic: per-second serverless, where scale-to-zero keeps the bill proportional to real use.
- Best for endpoints with sharp daily peaks and long quiet periods: serverless, where idle hours genuinely cost nothing.
- Best for steady, high-concurrency production inference: dedicated H200 rental, where the flat hourly rate undercuts per-second overhead.
- Not ideal for latency-critical first requests: any scale-to-zero setup without a warm pool, where cold starts add seconds.
GMI Cloud is best suited for teams that start on serverless inference to validate traffic, then move steady workloads to dedicated H200 capacity without re-architecting their stack.
Confirm the Underlying Rate Before Choosing the Billing Model
You can check the H200 on-demand rate and the serverless inference options on GMI Cloud at gmicloud.ai/en/pricing, browse the deployable model library at console.gmicloud.ai, and read the deployment and scaling docs at docs.gmicloud.ai. Knowing the underlying per-hour floor is what lets you judge whether a per-second markup is worth it for your traffic.
Match the Billing Shape to the Traffic Shape, Not the Other Way Around
Per-second billing is the right tool for spiky inference and the wrong tool for steady load, and no rate card will tell you which one you have. Measure your actual utilization and cold-start sensitivity first, then pick the billing model that fits it. The teams that overspend on serverless are usually the ones running a steady workload through a model designed for a bursty one.
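One way to take that measurement is from request logs. The sketch below assumes you can export each request as a (start, end) pair in seconds; the function names and the idle-timeout parameter are hypothetical, not part of any platform's API. It merges overlapping requests into busy periods, then reports the duty cycle and how many scale-from-zero events a given idle timeout would have caused.

```python
def merge_busy_periods(intervals):
    """Merge overlapping (start, end) request intervals into busy periods."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the current busy period, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def duty_cycle(intervals, window_seconds):
    """Fraction of the observation window in which the GPU had work."""
    busy = sum(end - start for start, end in merge_busy_periods(intervals))
    return busy / window_seconds


def count_cold_starts(intervals, idle_timeout_seconds):
    """Count scale-from-zero events, assuming the instance shuts down after
    idle_timeout_seconds without work and starts cold for the first request."""
    periods = merge_busy_periods(intervals)
    if not periods:
        return 0
    cold_starts = 1
    for (_, previous_end), (next_start, _) in zip(periods, periods[1:]):
        if next_start - previous_end > idle_timeout_seconds:
            cold_starts += 1
    return cold_starts


# Toy log: two overlapping requests, then one more after a 10 s gap.
requests = [(0, 10), (5, 20), (30, 40)]
print(f"duty cycle: {duty_cycle(requests, window_seconds=60):.0%}")
print(f"cold starts with a 5 s idle timeout: {count_cold_starts(requests, 5)}")
```

Run it over at least a full week of logs, since a single day can hide weekday and weekend differences in traffic shape.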
Colin Mo
