DeepInfra vs Baseten vs Modal vs GMI Cloud: LLM Inference Provider Pricing Comparison

May 28, 2026

Most LLM inference pricing comparisons lead with hourly GPU rates or per-token figures. Those numbers are real, but they are not the bill. A platform billing per-replica-hour charges you whether or not any requests are running. A platform with per-second active compute pricing looks cheaper per hour until the first cold start adds 30 seconds to a user-facing request. A per-token platform looks expensive at small volumes and looks very different at ten million tokens per month. The gap between a platform's headline rate and its actual monthly invoice depends on the billing model, the cold start behavior, and the concurrency architecture, and all three differ substantially between Baseten, Modal, DeepInfra, and MaaS platforms like GMI Cloud. This piece breaks each one down.

Three Billing Models, Three Different Total Costs

Understanding the billing model before comparing prices prevents the most common mistake in provider selection:

  • Per-replica-hour (Baseten): Every running model replica incurs continuous cost regardless of request volume. A two-replica setup for redundancy costs twice the single-replica rate even during idle periods. Total cost scales with the number of replicas, not with request volume.
  • Per-second active compute (Modal): Billing runs only when the function is executing. Idle time costs nothing. The tradeoff is cold start latency when the container has not been called recently.
  • Per-token (DeepInfra, MaaS): Cost scales directly with output volume. No idle cost, no cold start, no replica management. The tradeoff is that only pre-deployed models are accessible. Custom model deployment is not available on pure per-token platforms.

Each model optimizes for a different assumption about the workload. Understanding which assumption matches the actual traffic pattern determines which model produces the lower invoice.
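As a rough illustration of how the three models diverge, the sketch below expresses each one as a monthly-cost function. The function names, the 720-hour month, and the example workload are assumptions made for illustration; none of them come from a provider's rate card.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month


def per_replica_hour_cost(rate_per_hour: float, replicas: int) -> float:
    """Always-on replicas bill for every hour, busy or idle."""
    return rate_per_hour * replicas * HOURS_PER_MONTH


def per_second_cost(rate_per_hour: float, active_fraction: float) -> float:
    """Only the fraction of the month spent executing is billed."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return rate_per_hour * active_fraction * HOURS_PER_MONTH


def per_token_cost(price_per_million: float, tokens: int) -> float:
    """Cost follows token volume; elapsed time does not appear at all."""
    return price_per_million * tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: $4.00/hr GPU, busy 5% of the time,
    # or 20M tokens at $0.50 per million.
    print(per_replica_hour_cost(4.00, replicas=1))
    print(per_second_cost(4.00, active_fraction=0.05))
    print(per_token_cost(0.50, tokens=20_000_000))
```

Request volume is not an input to the first function, and time is not an input to the third. That asymmetry is why the same workload can produce very different invoices.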

Platform-by-Platform Pricing and Hidden Costs

Baseten

Baseten's headline differentiator is Truss, its open-source inference framework that gives teams fine-grained control over batching, hardware configuration, and custom model packaging. The platform raised a $300M round from CapitalG and NVIDIA in January 2026, which reflects its position as the enterprise-grade custom model deployment option.

The billing model is per-replica-hour. An H100 instance on Baseten runs approximately $6.50 per GPU-hour at list rates, among the highest in the category. That rate runs continuously for every replica that is deployed, regardless of request volume.

The hidden cost surface is significant for teams that underestimate it. Two replicas for reliability: double the cost. A minimum warm replica to avoid cold starts: continuous billing even during low traffic. The Truss deployment abstraction creates switching costs when migrating to another platform because models are packaged against Baseten's specific framework. For teams that need full deployment control, HIPAA/SOC 2 compliance, and enterprise support with dedicated engineers, these costs are justified. For teams that need to run standard models at volume, they are not.

For high-concurrency scenarios, Baseten's dedicated infrastructure eliminates cold starts entirely. The cost of zero cold starts is paying the replica-hour rate continuously.

Modal

Modal's billing model is per-second of actual GPU execution. There is no charge during idle periods. An H100 on Modal runs approximately $3.95 per GPU-hour when active, which is cheaper than Baseten's per-hour rate on active compute but requires accounting for cold start behavior.

Cold start times on Modal: CPU containers start in under a second. Typical GPU model containers start in "a few seconds" per Modal's documentation. For large LLM workloads with 70B+ parameter models, cold starts extend to tens of seconds as weights load from the cache layer. For a user-facing inference endpoint, a 30-second cold start on the first request after a quiet period is not acceptable without mitigation.

Modal provides keep-warm configuration to hold containers ready without generating full per-second billing. This reduces cold start frequency but adds idle compute cost. The platform requires teams to configure their own inference stack (typically vLLM or a custom runtime). Baseten abstracts this; Modal does not. That difference represents real engineering time per deployment.

For bursty workloads with unpredictable traffic patterns, Modal's per-second billing eliminates idle cost at the expense of cold start risk. For sustained high-concurrency inference, containers are rarely idle, so the effective hourly rate approaches, and can exceed, that of dedicated instance providers, and Modal offers no management overhead advantage to offset it.

DeepInfra

DeepInfra bills per token through an OpenAI-compatible API and covers 50+ open-source models without any infrastructure management. Pricing starts at $0.06 per million tokens for small models and reaches $0.55 per million input tokens for larger models like DeepSeek-R1. On independent benchmarks, DeepInfra consistently ranks among the cheapest providers for open-source frontier models.

There are no cold starts on DeepInfra's hosted endpoints because the models are pre-deployed by the platform. There is no replica management. There is no minimum commitment.

The limitation is the model catalog. DeepInfra does not support custom model deployment. If a workload requires a fine-tuned model, a model not in DeepInfra's catalog, or any model with custom preprocessing, DeepInfra is not the platform. For workloads running standard open-source models at high volume, it is among the most cost-efficient options available.

Four Models Available on Per-Token MaaS Pricing Through GMI Cloud

GMI Cloud's MaaS layer applies the per-token billing model to a broader model catalog that includes both open-source and proprietary models from OpenAI, Google, and DeepSeek. The four models below illustrate the price range:

Model                 | Input price per 1M tokens | Output price per 1M tokens | Best for
Gemini 3.1 Flash-Lite | $0.10                     | $0.40                      | High-volume, long-context, budget workloads
GPT-5.4-nano          | $0.20                     | $1.25                      | Reasoning-sensitive tasks, coding subagents
GPT-5.4-mini          | $0.40                     | $2.50                      | Mid-tier quality, balanced cost and capability
DeepSeek-V4-Pro       | $1.39                     | Varies                     | Near-frontier quality at sub-frontier price

No cold starts, no replica charges, no idle cost. The invoice scales with tokens consumed, not with time elapsed or containers running.
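To make the table concrete, here is a minimal sketch that turns those list prices into a monthly estimate. The `PRICES` dictionary and the `monthly_token_cost` helper are hypothetical names, and the token volumes in the usage note are invented. The DeepSeek-V4-Pro output price is left unset because the table lists it only as "Varies".

```python
from typing import Optional

# USD per 1M tokens as (input, output), copied from the table above.
# None marks the DeepSeek-V4-Pro output price, which the table lists as "Varies".
PRICES: dict[str, tuple[float, Optional[float]]] = {
    "Gemini 3.1 Flash-Lite": (0.10, 0.40),
    "GPT-5.4-nano": (0.20, 1.25),
    "GPT-5.4-mini": (0.40, 2.50),
    "DeepSeek-V4-Pro": (1.39, None),
}


def monthly_token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate a monthly bill from input and output token counts."""
    input_price, output_price = PRICES[model]
    if output_price is None and output_tokens > 0:
        # Refuse to guess a price the table does not state.
        raise ValueError(f"no fixed output price listed for {model}")
    cost = input_price * input_tokens / 1_000_000
    if output_price is not None:
        cost += output_price * output_tokens / 1_000_000
    return cost
```

For example, `monthly_token_cost("GPT-5.4-nano", 40_000_000, 10_000_000)` prices a hypothetical month of 40M input and 10M output tokens. Output tokens cost several times more than input tokens on every row that lists both, so the input/output split matters as much as the total.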

The practical advantage over Baseten and Modal is operational simplicity at scale. A team running GPT-5.4-nano for classification plus DeepSeek-V4-Pro for complex reasoning on the same platform pays for tokens consumed by each model without managing separate deployment stacks, replica configurations, or keep-warm policies. Billing visibility is per-request, not per-hour-of-infrastructure.

GMI Cloud's serverless inference layer handles request routing and scaling automatically. Dedicated endpoint configurations are available for teams that need predictable latency guarantees above what the serverless tier provides. Model documentation and console access are at console.gmicloud.ai and docs.gmicloud.ai.

Matching the Billing Model to the Workload

Three workload types map clearly to a billing model:

  • Custom model deployment with fine-tuned weights: Baseten or Modal. Neither per-token platform supports arbitrary model deployment. The cost difference between them depends on traffic pattern: sustained high concurrency favors Baseten's dedicated infrastructure; bursty low-average traffic favors Modal's per-second billing.
  • High-concurrency production serving of standard models: Per-token MaaS or DeepInfra. No cold starts, no replica management, predictable cost per request. The model catalog constraint is the real decision variable: proprietary models (GPT, Gemini, DeepSeek API) require MaaS; open-source-only teams can use DeepInfra at lower per-token rates.
  • Bursty experimental workloads with unpredictable request patterns: Modal for custom models, per-token MaaS for standard models. Both avoid idle cost during quiet periods, which matters when traffic is irregular.
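The mapping above can be written as a small decision helper. This is a deliberate simplification: the function name and its two boolean inputs are hypothetical, and a real evaluation would also weigh compliance needs, catalog coverage, and latency targets.

```python
def suggest_billing_model(custom_weights: bool, bursty: bool) -> str:
    """Map the workload types above to a billing model (a simplification)."""
    if custom_weights:
        # Per-token platforms do not host arbitrary models, so the choice
        # is between per-second billing and always-on replicas.
        return "per-second (Modal)" if bursty else "per-replica-hour (Baseten)"
    # Standard catalog models: per-token avoids idle cost and cold starts
    # whether traffic is sustained or bursty.
    return "per-token (MaaS or DeepInfra)"
```

The first branch is the custom-weights question because it rules out an entire billing model. Traffic shape only matters once that is settled.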

The Invoice Is the Real Price, Not the Rate Card

A $6.50/hr GPU on Baseten for a two-replica deployment of a seldom-called model costs $9,360 per month before a single request arrives. A $3.95/hr Modal container with 90% idle time costs $284 per month in compute plus cold starts on 10% of requests. A $0.20/M token per-token service for 50 million tokens costs $10 per month with no baseline.
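The three figures can be reproduced with a few lines of arithmetic. The sketch assumes a 720-hour (30-day) month, which is the value that yields the numbers quoted above. It also treats all 50 million tokens as billed at the single $0.20 per million rate.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month

# Baseten: two always-on replicas at $6.50 per GPU-hour.
baseten = 6.50 * 2 * HOURS_PER_MONTH

# Modal: $3.95 per GPU-hour, billed only for the 10% of the month that is active.
modal = 3.95 * 0.10 * HOURS_PER_MONTH

# Per-token: 50 million tokens at $0.20 per million.
per_token = 0.20 * 50_000_000 / 1_000_000

for name, cost in [("Baseten", baseten), ("Modal", modal), ("Per-token", per_token)]:
    print(f"{name}: ${cost:,.2f}")
```

This prints $9,360.00, $284.40, and $10.00. Swapping in a different active fraction or token volume shows how quickly the ordering can change for a given traffic pattern.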

These are the same category of service. The billing models produce very different invoices at the same workload volume. Running the math against the actual traffic pattern before selecting a provider is the step that most initial evaluations skip.

Before committing to any of these platforms, verify current pricing in their official documentation. Pricing changes frequently in this category, and the numbers that matter are the ones on the current rate card applied to your actual monthly token or GPU-hour volume.

Colin Mo
