Cheapest Cloud Options for Running Mistral Large Inference at Scale

June 04, 2026

Mistral Large inference has economics that behave differently from most other frontier-class models. The official API at $2.00 per million input tokens and $6.00 per million output tokens via la Plateforme is EU-hosted, GDPR-compliant, and comes with a batch API discount that drops output pricing to $3.00 per million for non-realtime workloads. Self-hosting Mistral Large 2's open weights on bare metal H100 or H200 infrastructure beats that rate only at high concurrency with continuous batching. Choosing the cheapest option correctly requires knowing which scenario applies to your workload.

The official Mistral API batch tier at $3.00/M output tokens is the cheapest managed inference option for async workloads with 24-hour latency tolerance. No self-hosting or GPU management required, and EU data residency is included.
Self-hosting Mistral Large 2 on GMI Cloud H200 at $2.60/hr with continuous batching at batch size 32 or above delivers approximately $3.00 to $3.50 per million output tokens at production utilization, competitive with the official batch API and cheaper than the standard real-time tier.
Mistral Large 2 open weights under the Mistral Research License support self-hosting with restrictions on commercial use at scale. Teams that need full commercial deployment without restrictions should verify license terms before building a self-hosted production stack.
The H200's 141 GB VRAM fits Mistral Large 2 at FP8 in a single GPU, eliminating tensor parallelism overhead. The H100's 80 GB requires INT4 quantization to fit the 123B parameter model single-GPU, or a two-GPU FP8 configuration.
EU data residency is the deciding constraint for many enterprise workloads. La Plateforme runs entirely within EU data centers and is the only option in this comparison that provides native GDPR compliance without requiring self-hosted infrastructure in the EU.
Throughput determines self-hosting economics more than hourly rate. An H200 running Mistral Large 2 at FP8 with continuous batching achieves 120 to 150 tokens per second per request at low concurrency, but 400 or more tokens per second aggregate throughput at batch size 32. The difference between those two numbers is the difference between self-hosting being more expensive and less expensive than the official API.

‍

Understanding the Mistral Large Family

Mistral Large 2 (released July 2024, model ID mistral-large-2407) is the open-weight flagship: 123 billion dense parameters, 128K context window, Grouped Query Attention with 48 attention heads and 8 KV heads, available under the Mistral Research License. It supports over 80 programming languages and was designed specifically for single-node inference, which is why the 123B parameter count is not accidental.

The current Mistral Large in the API (large-latest) reflects subsequent updates. Mistral Large 3 is the commercially deployed successor with improved reasoning benchmarks. Pricing through la Plateforme is consistent across versions at $2.00 per million input tokens and $6.00 per million output tokens.

For cost purposes, the relevant variants are:

Mistral Large via la Plateforme (API): $2.00 input / $6.00 output per million tokens, standard real-time. $1.00 input / $3.00 output via the batch API (50 percent discount, 24-hour turnaround). EU-hosted, GDPR-compliant, SOC 2 Type II.

Mistral Large 2 self-hosted: Open weights under the Mistral Research License, downloadable from Hugging Face. GPU compute cost only. Commercial use restrictions apply at very large scale; verify license terms for your deployment.

Third-party hosted inference: Amazon Bedrock, Azure AI Foundry, and specialized providers host Mistral Large under standard API models with enterprise SLAs and their respective ecosystem integrations.

‍

Hardware Requirements: What Mistral Large 2 Actually Needs

Mistral Large 2 is a 123B dense transformer. Unlike MoE models, all 123 billion parameters contribute to every forward pass. This makes VRAM planning straightforward but demanding.

FP8 on a single H200 is the recommended production configuration. The H200's 141 GB provides the full 123 GB weight footprint at FP8 plus approximately 18 GB of headroom for KV cache and activations. At 128K context, the KV cache can grow large, but most production workloads operate at far shorter actual context lengths where the headroom is sufficient.

INT4 on a single H100 is the budget alternative. A single H100 80GB has 18 GB of headroom after the 62 GB INT4 weight footprint for KV cache and batching. INT4 quantization introduces quality tradeoffs on precision-sensitive tasks, typically 3 to 8 percent degradation on mathematical reasoning and 1 to 3 percent on coding benchmarks. For conversational and summarization workloads, the degradation is often imperceptible.

FP8 on 2x H100 (two H100 80GB GPUs, 160 GB combined) fits the FP8 weight matrix with 37 GB remaining for KV cache. This configuration uses tensor parallelism, which adds communication overhead between the two GPUs on each forward pass. A single H200 at FP8 eliminates that overhead and costs $2.60/hr versus $4.00/hr for two H100s, making the H200 the more cost-efficient option for this model specifically.

‍

The Three Cost Approaches: Which Is Cheapest for Your Workload

The cheapest option for Mistral Large inference depends entirely on monthly token volume, latency requirements, and whether EU data residency is a requirement. Three distinct approaches cover the full range.

Approach 1: Official Mistral API (la Plateforme)

The official Mistral API is the cheapest managed option for most teams below 200 million output tokens per month.

Standard real-time pricing: $2.00 input / $6.00 output per million tokens. Batch API: $1.00 input / $3.00 output per million tokens with 24-hour turnaround. For non-realtime workloads (overnight data processing, content generation pipelines, batch summarization, evaluation runs), the batch API at $3.00/M output is the lowest-cost option available for Mistral Large without managing GPU infrastructure.

La Plateforme runs entirely within EU data centers. This is the only option on this list that provides GDPR compliance without requiring self-hosted infrastructure. For European enterprises or any team processing personal data of EU residents, la Plateforme resolves the data residency question that all US-hosted providers must address through additional legal mechanisms.

Rate limits on the standard tier cap at 1 million tokens per minute. Enterprise contracts remove this ceiling and provide volume discounts of 10 to 30 percent for teams spending $100,000 or more per month.

Best for: Teams below 200 million output tokens per month, EU data residency requirements, async batch workloads, teams that cannot staff GPU infrastructure management.

Approach 2: Self-Hosted on GMI Cloud Dedicated GPU Infrastructure

Self-hosting Mistral Large 2 on GMI Cloud's H100 or H200 hardware breaks even with the official API at approximately 200 to 400 million output tokens per month, depending on the hardware configuration and achieved utilization.

Single H100 80GB at INT4 on GMI Cloud ($2.00/hr):

At 90 tokens per second with single-request serving, an H100 generates roughly 163 million output tokens per month at 70 percent utilization. Monthly cost at $2.00/hr: $1,460. Effective cost per million output tokens: $8.96. That is above the standard API rate at most volume levels, which means single-request serving on H100 INT4 is not economically competitive with the direct API.

The math changes with continuous batching. At batch size 32 with vLLM or SGLang, aggregate throughput on a single H100 serving Mistral Large 2 INT4 reaches 400 to 600 tokens per second. At 500 tok/s and 70 percent utilization: 907 million output tokens per month at $1,460. Effective cost per million: $1.61. That is below the batch API rate of $3.00/M and well below the standard API rate of $6.00/M.

Single H200 141GB at FP8 on GMI Cloud ($2.60/hr):

FP8 preserves model quality at a monthly cost of $1,898. At batch size 32, aggregate throughput on H200 FP8 reaches 600 to 800 tokens per second. At 700 tok/s and 70 percent utilization: 1.27 billion output tokens per month. Effective cost per million: $1.49. The quality advantage of FP8 over INT4 with lower effective cost per token at this batch size makes H200 FP8 the recommended production configuration for teams that justify self-hosting.

The critical caveat: these economics only hold at sustained high concurrency. If your workload averages 20 concurrent requests per day rather than 200, batch size 32 is theoretical. Actual throughput reverts toward the single-request rate and the cost economics revert to the unfavorable comparison.

Two H100s at FP8 as an alternative: Two H100 SXM GPUs provide 160 GB combined for Mistral Large 2 FP8 plus KV cache headroom. At $4.00 to $4.20/hr combined, the monthly cost is $2,920 to $3,066. At double the throughput of a single H100, the economics roughly match the H200 single-GPU configuration. The H200 is generally preferred because it eliminates tensor parallelism communication overhead and the operational complexity of managing a two-GPU configuration.

Approach 3: Third-Party Hosted APIs (Amazon Bedrock, Azure AI Foundry, Together AI)

Amazon Bedrock and Azure AI Foundry provide Mistral Large inference within AWS and Azure ecosystems respectively. For teams whose applications are already deeply integrated with these clouds, the operational simplicity of keeping Mistral Large inference in the same ecosystem can justify the premium over direct la Plateforme pricing.

Amazon Bedrock typically charges $2.00 to $3.00 per million input tokens and $6.00 per million output tokens for Mistral Large, with serverless billing and no minimum commitments. For AWS-native applications, this eliminates cross-cloud networking costs and simplifies IAM-based access control.

Azure AI Foundry pricing is comparable, with the addition of Azure's compliance certifications and managed private endpoints for enterprise security requirements.

Together AI includes Mistral Large variants in its 200-plus model catalog, and at standard per-token rates that are competitive with direct API pricing. For teams already using Together AI for other model families, keeping Mistral Large on the same platform simplifies billing and API management.

‍

The Batch API Advantage for Cost-Sensitive Workloads

The Mistral batch API at 50 percent discount is underused. At $1.00/M input and $3.00/M output, it is the cheapest non-self-hosted option for Mistral Large and competitive with self-hosted GPU infrastructure at all but the highest concurrency levels.

The constraint is latency: batch API requests are queued with up to 24-hour turnaround. For workloads where this is acceptable, the batch API eliminates the infrastructure management burden of self-hosting while delivering costs that match dedicated GPU economics at moderate scale.

Use cases where the batch API is the clear winner:

Data processing pipelines: Content classification, document summarization, metadata extraction, sentiment analysis at scale. These jobs run overnight and deliver results the next morning regardless of whether they use real-time or batch inference.

Training data generation: Generating synthetic instruction-response pairs, creating evaluation benchmarks, or processing large document corpora for fine-tuning datasets. None of these require real-time responses.

Evaluation and testing: Running benchmark evaluations, automated quality checks, or regression testing against a new model version. The 24-hour turnaround is irrelevant when you are testing infrastructure rather than serving users.

Content generation queues: Blog drafts, product description generation, or other structured content workflows where the output feeds into an editorial queue rather than a live user request.

‍

Cost Comparison: The Full Picture

Key clarification on self-hosted costs: The $1.49 to $1.61 figures for self-hosted GMI Cloud infrastructure assume sustained batch size 32 at 70 percent GPU utilization. At lower utilization (10 to 20 percent, which is common during early production), effective costs rise to $6.00 to $12.00 per million tokens. Self-hosting only wins at scale.

‍

When to Choose Each Option

Use the official Mistral batch API if: Your workload tolerates 24-hour latency, you need EU data residency without managing infrastructure, or your monthly volume is below 300 million output tokens where the self-hosting economics do not justify the operational overhead.

Use GMI Cloud H200 FP8 self-hosted if: Monthly output token volume exceeds 400 million, average concurrency is high enough to sustain batch size 16 or above, you need full control over the serving stack, or fine-tuning access on open weights is required.

Use GMI Cloud H100 INT4 self-hosted if: Budget is the primary constraint, INT4 quality is acceptable for your use case, and single-GPU simplicity matters more than FP8 precision.

Use Amazon Bedrock or Azure if: Your application is deeply integrated with AWS or Azure infrastructure and the ecosystem simplicity justifies the pricing premium.

Use Together AI if: You are managing multiple models under a single API budget and want Mistral Large alongside other model families without operating separate provider relationships.

Serving Framework for Mistral Large 2 Self-Hosted

vLLM is the default serving framework for Mistral Large 2 self-hosted. It supports the model natively with continuous batching, FP8 and INT4 quantization via bitsandbytes, and an OpenAI-compatible API endpoint that makes it a drop-in replacement for the official la Plateforme API with a single base URL change.

Key configuration for single H200 FP8:

‍

Setting --gpu-memory-utilization 0.92 rather than the default 0.90 recovers approximately 2.8 GB of additional memory for KV cache, increasing maximum batch size at the cost of slightly higher OOM risk. Adjust based on your workload's context length distribution.

SGLang delivers 29 percent higher throughput than vLLM on comparable hardware for batch-heavy workloads through RadixAttention KV cache prefix reuse. For workloads with shared system prompts across many requests (RAG with a common context, agent systems with a shared instruction set), SGLang is the better choice. For workloads with highly varied prompts and no shared prefix, vLLM and SGLang perform comparably.

GMI Cloud H200 nodes ship pre-configured with both vLLM and SGLang on CUDA 12.x, eliminating the environment setup phase. Root access and custom software stacks are supported for teams that need framework customization beyond standard configuration.

‍

Conclusion

Mistral Large is unusual among frontier-class models because its official API pricing, EU data residency, and batch API discount together make it genuinely competitive with self-hosted infrastructure at moderate volume levels. The cheapest option is not automatically self-hosting.

For async workloads below 300 million output tokens per month, the official Mistral batch API at $3.00/M output is hard to beat on a total cost basis: zero infrastructure management, native EU GDPR compliance, and pricing that matches or beats self-hosted GPU at average utilization.

For real-time workloads at higher volume, GMI Cloud's H200 FP8 single-GPU deployment at $2.60/hr becomes the lower-cost option above approximately 400 million output tokens per month with sustained concurrency. The same OpenAI-compatible API structure used for la Plateforme works unchanged when pointing at a self-hosted vLLM endpoint on GMI Cloud, so the migration requires a base URL change rather than an application rewrite.

‍

FAQs

What are the VRAM requirements for self-hosting Mistral Large 2? Mistral Large 2 is a 123B dense parameter model. At FP8 precision, weights require approximately 123 GB of VRAM, fitting within a single H200 141GB with roughly 18 GB of headroom for KV cache and activations. At INT4 quantization, the weight footprint drops to approximately 62 GB, fitting comfortably within a single H100 80GB with around 18 GB remaining for KV cache. At FP16, the full 246 GB weight footprint requires a minimum of four H100 80GB GPUs or two H200s. The H200 FP8 single-GPU configuration is the recommended production setup: it eliminates tensor parallelism communication overhead, preserves FP8 model quality, and reduces effective cost per token relative to two-GPU FP8 configurations.

When does self-hosting Mistral Large 2 on GMI Cloud become cheaper than the official API? The crossover depends on GPU utilization and batch size. At sustained batch size 32 and 70 percent GPU utilization on a single H200 ($2.60/hr), effective cost per million output tokens is approximately $1.49, well below the official API standard rate of $6.00/M and below the batch API rate of $3.00/M. The self-hosting economics only hold at high and consistent concurrency. At 10 to 20 percent GPU utilization, which is typical during early production phases, effective self-hosted cost rises to $6.00 to $12.00 per million output tokens, making the official API or batch API the cheaper option. The practical crossover threshold where self-hosting consistently wins sits at approximately 400 million output tokens per month with a concurrency profile that sustains batch size 16 or above.

What is the Mistral batch API and when should I use it? The Mistral batch API provides 50 percent discounts on standard la Plateforme pricing, reducing Mistral Large output from $6.00 to $3.00 per million tokens, with requests queued for up to 24-hour turnaround. It is the cheapest managed inference option for Mistral Large and the right choice for any workload that does not require real-time responses: overnight data processing, document summarization, training data generation, evaluation runs, batch classification, and content generation queues that feed into editorial workflows rather than live user requests. For teams where the majority of Mistral Large usage is async, the batch API eliminates the need for self-hosted GPU infrastructure at moderate token volumes while providing EU GDPR compliance through la Plateforme's EU-hosted infrastructure.

Does self-hosting Mistral Large 2 provide EU GDPR compliance? Not automatically. Mistral Large 2 open weights can be self-hosted anywhere, including on EU-based infrastructure, but the compliance posture depends entirely on the hosting provider's jurisdiction and data handling practices. Self-hosting on GMI Cloud's US infrastructure means data is processed under US jurisdiction, which requires additional legal mechanisms (Standard Contractual Clauses, adequacy decision) to comply with GDPR for EU personal data. For teams requiring EU GDPR compliance with Mistral Large without self-hosted EU infrastructure, la Plateforme is the clearest path: it runs entirely within EU data centers and is the only option discussed here that provides native GDPR compliance without additional legal framework requirements.

How does Mistral Large 2 compare to similarly-priced models for self-hosting economics? Mistral Large 2 at 123B parameters is significantly heavier than Llama 3.3 70B or Qwen3-32B, which means lower tokens per second at equivalent hardware. An H100 serving Llama 3.3 70B FP8 achieves 2,000 to 3,000 tokens per second with continuous batching, while Mistral Large 2 INT4 on the same hardware achieves 400 to 600 tokens per second at batch size 32. The lower throughput means Mistral Large 2 requires more GPU-hours per billion tokens, which makes the self-hosting economics less favorable relative to the API pricing compared to smaller models. Teams whose primary requirement is instruction-following quality per compute dollar often find that Llama 3.3 70B or Qwen3-32B on GMI Cloud H100 at $2.00/hr delivers comparable quality to Mistral Large at 3 to 5 times the throughput on equivalent hardware.

Build AI Without Limits

GMI Cloud helps you architect, deploy, optimize, and scale your AI strategies

FAQ

Mistral Large 2 is a 123B dense parameter model. At FP8 precision, weights require approximately 123 GB of VRAM, fitting within a single H200 141GB with roughly 18 GB of headroom for KV cache and activations. At INT4 quantization, the weight footprint drops to approximately 62 GB, fitting comfortably within a single H100 80GB with around 18 GB remaining for KV cache. At FP16, the full 246 GB weight footprint requires a minimum of four H100 80GB GPUs or two H200s. The H200 FP8 single-GPU configuration is the recommended production setup: it eliminates tensor parallelism communication overhead, preserves FP8 model quality, and reduces effective cost per token relative to two-GPU FP8 configurations.

Ready to build?

Explore powerful AI models and launch your project in just a few clicks.

Get Started