Best GPU Cloud for Running Qwen3 Inference at Scale

Q: How much VRAM does Qwen3-235B actually require for production inference?

Qwen3-235B activates only 22 billion parameters per token, but the full 235 billion parameter weight matrix must be loaded into GPU memory before inference can begin. At FP8 precision, this requires approximately 235 GB of VRAM. The recommended production configuration is 8x H200 SXM with 1,128 GB total VRAM, which provides weight storage plus substantial KV cache headroom for concurrent requests and the 262,144 token context window. Planning hardware around the 22B active parameter figure is the most common and costly mistake when deploying this model.

Q: What is the difference between Qwen3 thinking mode and non-thinking mode for infrastructure sizing?

In thinking mode, Qwen3 generates internal reasoning tokens before producing its final response. These tokens count toward total token consumption, KV cache usage, throughput requirements, and billing on per-token APIs. A response that takes 500 output tokens in non-thinking mode can generate 2,000 to 5,000 total tokens in thinking mode on complex reasoning tasks. Infrastructure should therefore be sized for the mode used in production, not the mode used during lightweight development testing.

Q: Why is NVLink interconnect especially important for Qwen3-235B compared to dense models?

Qwen3-235B uses a Mixture-of-Experts architecture that routes each token to 8 of 128 experts during each forward pass. In multi-GPU deployment, those experts are distributed across GPUs, creating all-to-all communication across the interconnect. NVLink handles this traffic far more efficiently than PCIe at production batch sizes. For Qwen3-235B, interconnect bandwidth can be more important than raw compute because expert routing becomes a primary throughput constraint.

Q: How does Qwen3-32B compare to Qwen3-235B for production cost-efficiency?

Qwen3-32B is the more practical and cost-efficient option for most production workloads because it fits on a single H100 at FP8 or a single H200 at FP16. It avoids the multi-GPU MoE routing complexity of Qwen3-235B and can deliver strong throughput with standard inference optimizations. Qwen3-235B provides higher capability on complex reasoning tasks, but it requires an 8x H200 cluster for recommended production deployment. For most teams, Qwen3-32B is the starting point, while Qwen3-235B is reserved for workloads that require frontier-class reasoning quality.

May 29, 2026

Qwen3 is one of the most capable open-weight model families available in 2026, and the hardware decisions behind deploying it at production scale are meaningfully different from other model families. The flagship Qwen3-235B-A22B is a Mixture-of-Experts model that activates only 22 billion parameters per token but requires the full 235 billion parameter weight matrix in VRAM at all times. That distinction drives every infrastructure decision downstream.

The "22B active" figure in Qwen3-235B is not the VRAM requirement. The full 235 billion parameter weights must reside in GPU memory during inference. At FP8, that is approximately 235 GB. An 8x H200 cluster (1,128 GB total) is the production standard for serving Qwen3-235B at useful throughput.
Qwen3-32B is the most practical single-GPU deployment. At FP8, weights consume roughly 32 GB on a single H100 80GB with 34 to 35 GB of KV cache headroom remaining. On an H200, FP16 weights fit at 64 GB with substantial headroom for batching and long context.
GMI Cloud offers Qwen3-32B FP8 inference at $0.10 per million input tokens and $0.60 per million output tokens through its serverless Inference Engine, with dedicated H100 and H200 clusters for teams that need infrastructure control at larger scale.
The MoE architecture creates unique serving dynamics. MoE routing selects 8 experts per token from 128 total experts in Qwen3-235B. Expert routing and communication across GPUs is the primary throughput constraint in multi-GPU deployment, making interconnect bandwidth more important than raw compute for this model.
Thinking mode versus non-thinking mode doubles output token volume. Qwen3's chain-of-thought reasoning generates internal thinking tokens that count toward billing and throughput. At scale, a workload that averages 500 tokens per response in non-thinking mode can average 2,000 to 5,000 tokens in thinking mode. Sizing infrastructure around the correct mode for each use case is critical.
The cost crossover from per-token API to dedicated GPU infrastructure for Qwen3-32B sits at approximately 80 to 100 million output tokens per month, similar to other 70B-class models, because the per-active-parameter inference cost aligns with dense 22B model economics despite the 235B total parameter size.

‍

The Qwen3 Family: Which Model for Which Workload

Qwen3 is not a single model. It is a family spanning dense models from 0.6B to 32B and two flagship MoE architectures (30B-A3B and 235B-A22B), plus subsequent Qwen3.5 and Qwen3.6 releases that extend the lineup further. The right model depends on your hardware budget, context length requirements, and whether you need the reasoning-mode capability.

Dense models (0.6B to 32B): The Qwen3-32B is the largest and most capable dense model in the family. It delivers performance competitive with Qwen2.5-72B on most coding and instruction-following benchmarks while fitting on a single H100 80GB or H200. For production teams that want a single-GPU deployment with strong capability and no MoE complexity, Qwen3-32B is the practical default. Smaller variants (8B, 14B) fit on lower-cost hardware and are suitable for latency-sensitive applications where response time matters more than frontier-class performance.

MoE models (30B-A3B and 235B-A22B): Qwen3-30B-A3B activates 3 billion parameters per token from a 30B total parameter pool. At Q4 quantization it requires approximately 17 GB of VRAM, fitting on a single RTX 4090 or any H100/H200 with substantial headroom. This is the best cost-efficiency option for teams that need MoE speed benefits without the multi-GPU infrastructure requirement of the 235B flagship.

Qwen3-235B-A22B is the frontier-class option, matching GPT-4o class benchmarks on reasoning, coding, and multilingual tasks. It requires multi-GPU infrastructure at any useful production throughput, but its MoE architecture means token generation speed benefits from the relatively low active parameter count once the full weight matrix is resident in GPU memory.

Thinking mode versus non-thinking mode: Qwen3 models support both a chain-of-thought reasoning mode (thinking mode) and a direct response mode (non-thinking mode). In thinking mode, the model generates internal reasoning tokens before producing output, which significantly increases total token count per response. For tasks that benefit from step-by-step reasoning (mathematics, code generation, complex analysis), thinking mode improves quality at the cost of higher latency and more tokens billed. For straightforward conversational and retrieval tasks, non-thinking mode produces faster, cheaper responses with competitive quality.

This toggle has direct infrastructure implications. Teams sizing GPU capacity or per-token budgets based on non-thinking mode output lengths and then enabling thinking mode in production discover the cost and latency can be three to ten times higher than anticipated.

‍

Hardware Requirements by Model Variant

Model	Precision	VRAM Required	Minimum Configuration
Qwen3-8B	FP16	~16 GB	1x H100 80GB (comfortable)
Qwen3-14B	FP16	~28 GB	1x H100 80GB (with headroom)
Qwen3-32B	FP8	~32 GB	1x H100 80GB (34-35 GB KV headroom)
Qwen3-32B	FP16	~64 GB	1x H200 141GB (recommended)
Qwen3-30B-A3B (MoE)	Q4	~17 GB	1x H100 80GB (substantial headroom)
Qwen3-235B-A22B (MoE)	FP8	~235 GB	4x H100 80GB minimum, 8x H200 recommended
Qwen3-235B-A22B (MoE)	INT4	~132 GB	2x H100 80GB (with quality tradeoff)

Critical note on Qwen3-235B planning: The "22B active parameters" figure refers only to how many parameters are computed per forward pass. The full 235B weight matrix must be loaded into VRAM before any inference can begin. Planning infrastructure based on the 22B active figure is one of the most common and costly mistakes when deploying this model.

The MoE Serving Constraint: Why Interconnect Matters More Than Compute

For dense models like Llama 3.3 70B or Qwen3-32B, GPU-to-GPU communication during tensor parallelism is the secondary concern after memory bandwidth. For Qwen3-235B, the MoE expert routing mechanism changes this balance.

Qwen3-235B has 128 experts. During each forward pass, 8 are selected per token. In a multi-GPU deployment, experts are distributed across GPUs, and routing sends each token's computation to the GPUs holding the selected experts. This communication pattern creates all-to-all traffic across the interconnect that scales with batch size and expert distribution.

NVLink, which delivers 900 GB/s bidirectional bandwidth in 8-GPU H100 and H200 SXM configurations, handles this traffic efficiently. PCIe Gen5 at 128 GB/s creates a communication bottleneck that limits effective throughput on MoE models at production batch sizes. For Qwen3-235B specifically, the difference between NVLink-connected SXM GPUs and PCIe-connected instances is more impactful than for dense models of similar total parameter count.

GMI Cloud's H200 SXM clusters run on 8-GPU nodes with NVLink 4.0 (900 GB/s bidirectional per GPU) and 3.2 Tbps InfiniBand for inter-node communication. This is the interconnect configuration required for Qwen3-235B to reach production throughput targets without expert routing becoming the bottleneck.

Serving Framework Recommendations for Qwen3

vLLM is the default starting point. It supports all Qwen3 text models natively from version 0.8.4 onward, with thinking mode support from version 0.9.0. PagedAttention and continuous batching deliver strong throughput on Qwen3-32B on a single H100. For Qwen3-235B on multi-GPU configurations, vLLM's tensor parallelism handles expert distribution across GPUs.

SGLang provides a 29 percent throughput advantage over vLLM generally through RadixAttention KV cache reuse. For Qwen3 workloads with shared system prompts (RAG applications, agents with common context), SGLang's prefix caching removes the KV recomputation cost on the shared prefix for every request in the batch. At high concurrency this compounds into meaningful throughput gains. SGLang also handles the thinking mode token generation more efficiently through its structured output optimizations.

TensorRT-LLM delivers 15 to 30 percent higher peak throughput over vLLM on NVIDIA hardware after a compilation step. For fixed production deployments where model weights and configuration are stable, TensorRT-LLM is the highest-throughput option on GMI Cloud's H100 and H200 nodes. The compilation requirement makes it less practical for teams still iterating on model versions.

Thinking mode configuration in vLLM: The enable_thinking flag requires careful management at the serving layer. For applications that mix thinking and non-thinking requests, routing to separate endpoints with appropriate configurations avoids unexpected token explosions on non-thinking workloads. The context window of 262,144 tokens in Qwen3-235B is relevant here: thinking mode reasoning chains can consume tens of thousands of tokens on complex tasks, requiring adequate KV cache allocation.

Platform Comparison for Qwen3 Inference

GMI Cloud

GMI Cloud provides both managed serverless inference and dedicated GPU infrastructure for Qwen3 workloads across the full model family.

The Inference Engine offers Qwen3-32B FP8 at $0.10 per million input tokens and $0.60 per million output tokens with no GPU provisioning required. Automatic request batching, latency-aware scheduling, and scaling to zero between requests handle traffic variability without idle GPU cost. For teams starting with Qwen3 inference, this is the lowest-friction entry point before traffic justifies dedicated infrastructure.

For Qwen3-235B at production scale, dedicated H200 SXM clusters with NVLink interconnects provide the hardware configuration the model requires. H200 GPUs at $2.60/hr on bare metal with RDMA-ready networking, pre-installed vLLM, TensorRT-LLM, and Triton Inference Server eliminate the environment setup phase. The 8x H200 configuration recommended for Qwen3-235B FP8 is available on-demand with per-minute billing.

GMI Cloud's multi-region footprint across the US, Taiwan, Singapore, Thailand, Malaysia, and Japan also directly addresses the APAC data residency requirements that matter for organizations deploying Qwen3 in Asian markets. Qwen3's multilingual training, which covers 119 languages, makes it the natural choice for APAC enterprise workloads.

Together AI

Together AI hosts Qwen3-235B-A22B on managed infrastructure with an OpenAI-compatible API. For teams that need the flagship model without building multi-GPU infrastructure, Together AI's managed endpoint provides immediate access. The catalog also includes Qwen3-32B, Qwen3-72B (Qwen3.5 family), and other Qwen variants with consistent API access.

LoRA fine-tuning on Qwen3 variants is supported, which is relevant for teams building domain-specific Qwen3 applications. No other managed provider on this list offers fine-tuning as a managed service on Qwen3 class models.

Cerebras

Cerebras offers Qwen3-32B and Qwen3-235B on its free tier with 30 requests per minute and 1 million tokens per day at no cost, no credit card required. Its wafer-scale silicon delivers approximately 3,000 tokens per second throughput on these models, significantly faster than GPU-based inference at low concurrency. For agentic workflows where many sequential Qwen3 calls are chained together, Cerebras' speed reduces total wall-clock time meaningfully.

The model catalog is narrower than Together AI or GMI Cloud, and dedicated capacity with throughput guarantees requires moving off the free tier. For prototyping Qwen3 applications and benchmarking response quality before committing to paid infrastructure, Cerebras is a strong starting point.

Fireworks AI

Fireworks AI hosts Qwen3 models with FireAttention optimization delivering low latency on H100 hardware. SOC 2 Type II and HIPAA certification makes it the right choice for regulated industry workloads running Qwen3. For healthcare, finance, and government applications where compliance certification matters alongside Qwen3's multilingual and reasoning capabilities, Fireworks provides the necessary certifications that generic GPU rental cannot.

DeepInfra

DeepInfra hosts Qwen3-235B at $0.15 per million input tokens and $0.60 per million output tokens under the Apache 2.0 license. For teams specifically seeking the lowest per-token rate on the flagship model from a managed provider, DeepInfra is competitive. The platform offers straightforward API access with less ecosystem depth than Together AI but also less overhead.

Cost at Scale: When to Move From Per-Token to Dedicated Infrastructure

For Qwen3-32B specifically, a single H100 running the model at FP8 with continuous batching at batch size 32 delivers roughly 2,000 to 3,000 tokens per second. At $2.00/hr, the effective cost per million output tokens sits at approximately $0.19 to $0.28 at full utilization, well below GMI Cloud's per-token serverless rate of $0.60/M output and significantly below most third-party managed API rates.

For Qwen3-235B, the math is different because the hardware requirement is an 8x H200 cluster at approximately $20.80/hr combined. At useful production throughput on this cluster with MoE routing, effective cost per million output tokens can reach $0.08 to $0.15 at high utilization, competitive with the per-token rates charged by managed providers for this model.

Qwen3 for APAC Workloads: The Multilingual Advantage

Qwen3 was trained on data covering 119 languages, with particular depth in Chinese, Japanese, Korean, and Southeast Asian languages. This makes it the natural default for APAC enterprise workloads where multilingual quality matters: customer service systems serving mixed-language markets, document processing across Asian languages, and regional applications where leading English-first models underperform.

For Japanese enterprise workloads, the combination of Qwen3's multilingual training and GMI Cloud's Japan-based infrastructure (including the Kagoshima AI Factory in development) addresses both model capability and data residency requirements in a single stack. Japan's APPI requirements and Qwen3's language coverage align naturally with GMI Cloud's APAC-focused infrastructure footprint.

Conclusion

Qwen3 is the strongest open-weight model family for production deployment in 2026 when multilingual quality, reasoning capability, and infrastructure flexibility are the requirements. The deployment decision hinges on which variant fits your hardware budget and throughput requirements.

For most teams, Qwen3-32B on a single H100 or H200 covers the capability requirement with the simplest infrastructure. GMI Cloud's Inference Engine at $0.10 per million input tokens and $0.60 per million output tokens provides the most accessible entry point. As volume grows, dedicated H100 and H200 clusters at GMI Cloud carry the same workload at lower effective cost per token with full serving stack control.

For teams that need Qwen3-235B at production scale, the 8x H200 NVLink cluster is the hardware standard, and GMI Cloud's multi-region APAC infrastructure makes it the natural deployment platform for the markets where Qwen3's multilingual strength matters most.

‍

FAQs

How much VRAM does Qwen3-235B actually require for production inference? Despite activating only 22 billion parameters per token, Qwen3-235B requires the full 235 billion parameter weight matrix to be loaded into GPU memory before inference can begin. At FP8 precision, this requires approximately 235 GB of VRAM. The minimum practical production configuration is 4x H100 80GB (320 GB total) at INT4 quantization, which introduces measurable quality degradation on reasoning tasks. The recommended production configuration is 8x H200 SXM (1,128 GB total) at FP8, which provides weight storage plus substantial KV cache headroom for concurrent requests and the 262,144 token context window. Planning hardware based on the 22B active parameter figure is the most common and costly mistake when deploying this model.

What is the difference between Qwen3 thinking mode and non-thinking mode for infrastructure sizing? In thinking mode, Qwen3 generates an internal chain-of-thought reasoning trace before producing its final response. These internal tokens count toward total token consumption, VRAM KV cache usage, and billing on per-token APIs. A response that takes 500 output tokens in non-thinking mode can generate 2,000 to 5,000 total tokens in thinking mode on complex reasoning tasks. For infrastructure sizing, this means throughput drops proportionally when thinking mode is enabled, and KV cache allocation must account for the longer effective sequence length. GPU capacity and per-token budgets should be sized for the mode used in production, not the mode used in development testing.

Why is NVLink interconnect especially important for Qwen3-235B compared to dense models? Qwen3-235B's MoE architecture routes each token to 8 of 128 available experts during each forward pass. In a multi-GPU deployment, these experts are distributed across GPUs, requiring all-to-all communication across the interconnect for every token generated. This communication pattern scales with batch size and creates more inter-GPU traffic than standard tensor parallelism in dense models. NVLink at 900 GB/s bidirectional bandwidth handles this routing traffic efficiently. PCIe Gen5 at 128 GB/s becomes a throughput bottleneck for Qwen3-235B at production batch sizes in a way it does not for dense models like Llama 3.3 70B. SXM form factor with NVLink is the recommended configuration for Qwen3-235B in production.

How does Qwen3-32B compare to Qwen3-235B for production cost-efficiency? Qwen3-32B is a dense 32B parameter model. It fits on a single H100 80GB at FP8 or a single H200 at FP16, runs with standard tensor parallelism, and benefits from all standard inference optimizations. At $2.00/hr for a single H100, it delivers roughly 2,000 to 3,000 tokens per second at batch size 32, yielding an effective cost of $0.19 to $0.28 per million output tokens at full utilization. Qwen3-235B requires an 8x H200 cluster at roughly $20.80/hr total, and its MoE architecture adds routing overhead. The capability gap between the two is real on complex reasoning tasks, but Qwen3-32B covers the majority of production use cases at 80 to 90 percent lower hardware cost. For most teams, Qwen3-32B on GMI Cloud's managed Inference Engine at $0.60 per million output tokens is the starting point, with Qwen3-235B reserved for workloads where frontier-class reasoning quality is the requirement.

Which GPU cloud providers offer Qwen3 inference today, and how do their approaches differ? GMI Cloud offers Qwen3-32B FP8 through its serverless Inference Engine at $0.10 per million input tokens and $0.60 per million output tokens, plus dedicated H100 and H200 clusters for self-managed deployments of any Qwen3 variant. Together AI hosts multiple Qwen3 variants with LoRA fine-tuning support, making it the choice for teams iterating on custom model behavior. Cerebras provides Qwen3-32B and Qwen3-235B on its free tier at up to 3,000 tokens per second throughput, useful for prototyping. Fireworks AI offers Qwen3 inference with SOC 2 Type II and HIPAA certification for regulated workloads. DeepInfra hosts Qwen3-235B at $0.15 per million input tokens, among the lowest rates for managed flagship inference. The key differentiator for teams building toward scale is whether the provider supports dedicated GPU infrastructure as an upgrade path from per-token pricing, which GMI Cloud provides on a single platform.

Build AI Without Limits

GMI Cloud helps you architect, deploy, optimize, and scale your AI strategies

Ready to build?

Explore powerful AI models and launch your project in just a few clicks.

Get Started