Where to Deploy DeepSeek-V3 for Production AI Inference
DeepSeek-V3 is a 671 billion parameter Mixture-of-Experts model that activates only 37 billion parameters per token. It competes with GPT-4o on benchmarks at a fraction of the cost, carries an MIT license, and runs on open infrastructure. But deploying it for production inference requires solving three real problems: the hardware demands are serious, the official API carries data sovereignty risks that block many enterprise workloads, and the serving stack matters more for MoE models than most teams expect.
- DeepSeek-V3 requires a minimum of 8x H200 GPUs at FP8 to meet production throughput requirements. At INT4 quantization, the weight footprint drops to 350 to 400GB but adds quality degradation that is measurable on reasoning-heavy tasks.
- The official DeepSeek API costs $0.27/M input and $1.10/M output tokens, making it among the cheapest GPT-4-class options available. The constraint is data: all traffic routes through servers subject to Chinese data laws, which is a hard blocker for HIPAA, GDPR, and most enterprise data governance frameworks.
- GMI Cloud runs DeepSeek-V3 on US-hosted H100 and H200 infrastructure with an OpenAI-compatible API, no data residency risk, and serverless inference that scales to zero between requests. The free endpoint lets you benchmark production performance before committing to any spend.
- SGLang is the recommended inference engine for DeepSeek-V3, delivering 3.1x faster inference than vLLM through MLA-optimized kernels and Multi-Token Prediction support. On H200, SGLang with MTP enabled boosts throughput 1.2 to 2.1 times over the baseline.
- Per-token API pricing becomes expensive above 500 million tokens per month. Below that threshold, a managed US-hosted endpoint is cheaper and simpler than self-hosting. Above it, dedicated GPU infrastructure on a purpose-built platform is the better option.
- The self-hosting decision is not just about cost. Data sovereignty, rate limit predictability, fine-tuning access, and control over the serving stack each push teams toward self-hosted infrastructure even at lower token volumes.
Why DeepSeek-V3 Specifically
DeepSeek-V3 is the most cost-efficient frontier-class model available with open weights in 2026. Its MoE architecture delivers performance comparable to GPT-4o class models on coding, reasoning, and multilingual tasks while activating only 37 billion parameters per token. The active parameter count is what determines inference speed: at comparable hardware, DeepSeek-V3 inference runs significantly faster than a dense 70 billion parameter model despite carrying 671 billion total parameters.
The MIT license means commercial deployment is unrestricted. The open weights mean teams can self-host without per-token licensing fees. The 131K context window covers most production RAG and document analysis use cases.
The model already runs in production at scale across many US and European teams. The deployment decision is not whether to use DeepSeek-V3, but where to run it, with what inference stack, and under what data governance framework.
Hardware Requirements for Production Inference
DeepSeek-V3 is a data-center-class model that cannot run at useful throughput on consumer or workstation hardware. Understanding the minimum and recommended configurations prevents expensive mistakes at deployment time.
Full model at FP8 precision: Weight storage requires approximately 700 GB of VRAM. At FP8, an 8x H200 configuration (141GB per GPU, 1,128GB total) is the minimum practical setup for production throughput. An 8x H100 configuration (80GB per GPU, 640GB total) falls short of the weight footprint at FP8, requiring either INT4 quantization or a 16-GPU H100 configuration.
INT4 quantization: Reduces weight memory to approximately 350 to 400 GB, fitting within an 8x H100 80GB cluster (640GB total). INT4 introduces measurable quality degradation on reasoning-heavy tasks, typically 5 to 15 percent on benchmark evaluations. For general conversational and coding use cases, this degradation is often acceptable.
Interconnect requirements: DeepSeek-V3 uses tensor parallelism across GPUs during inference. NVLink bandwidth between GPUs is critical for communication-heavy MoE routing. Multi-GPU configurations with H200 SXM (NVLink 5.0, 900 GB/s bidirectional) outperform configurations using PCIe interconnects for this model specifically.
Hardware Configuration Comparison
| Configuration | VRAM | Fits V3 FP8 | Practical Throughput |
|---|---|---|---|
| 8x H200 SXM | 1,128 GB | Yes | High, recommended for production |
| 16x H100 SXM | 1,280 GB | Yes | High, higher node cost |
| 8x H100 SXM | 640 GB | No (FP8), Yes (INT4) | Moderate with quality tradeoff |
| 4x H200 SXM | 564 GB | No (FP8), Yes (INT4) | Low concurrency only |
For production inference serving concurrent users, 8x H200 SXM is the standard configuration. GMI Cloud's dedicated H200 clusters provide this configuration with RDMA-ready networking and the SGLang inference stack recommended by the DeepSeek team.
The Serving Stack: SGLang vs vLLM
The inference engine choice matters more for DeepSeek-V3 than for most models. The MoE architecture and Multi-head Latent Attention design benefit from specific optimizations that are not uniformly available across inference frameworks.
SGLang is the inference engine officially recommended by the DeepSeek team, and the performance gap over vLLM is substantial for this specific model. SGLang achieves 3.1x faster inference than vLLM on DeepSeek-V3, driven by MLA-optimized kernels including FlashAttention3, FlashInfer, FlashMLA, and CutlassMLA. With Multi-Token Prediction enabled via EAGLE speculative decoding, SGLang delivers 1.8x decode speedup at batch size 1 and 1.5x at batch size 32 on H200 hardware. For multi-turn conversational workloads, SGLang's RadixAttention KV cache reuse provides an additional 30 percent throughput advantage through prefix sharing across requests.
vLLM is the more broadly supported option, with coverage for more hardware configurations (including TPUs and Trainium) and a larger contributor ecosystem. For DeepSeek-V3 specifically, vLLM delivers lower throughput than SGLang on equivalent hardware. The choice between them for DeepSeek is usually straightforward: if you are deploying on NVIDIA H100 or H200 hardware and DeepSeek is your primary model, SGLang is the better serving engine.
TensorRT-LLM achieves 15 to 30 percent higher peak throughput than vLLM after a 10 to 30 minute compilation step. NVIDIA-only and suited for production configurations where the model will not change frequently. Less flexible than SGLang for multi-model or rapidly iterating deployments.
Deployment Options: A Practical Comparison
Option 1: Official DeepSeek API
The official API costs $0.27 per million input tokens and $1.10 per million output tokens for DeepSeek-V3. On cache hits, input drops to $0.028 per million, making it essentially free for repeated prompt prefixes.
These are among the lowest per-token rates for a frontier-class model. The operational simplicity is real: no infrastructure to manage, no serving stack to configure, no GPU procurement required.
The production constraint is data sovereignty. All traffic sent to the official DeepSeek API routes through servers subject to Chinese data law, including mandatory government access provisions under the Cybersecurity Law and Data Security Law of the People's Republic of China. For workloads involving personal data under GDPR, healthcare data under HIPAA, financial data under SOX, or any government and defense workloads, the official API is not a viable production option.
Additionally, the API has rate limits during peak demand that dedicated infrastructure does not. Teams with high-throughput production requirements can hit these limits before they become billable constraints.
Best for: Low-volume workloads where data governance is not a constraint, rapid prototyping, and cost benchmarking before committing to infrastructure.
Option 2: GMI Cloud Managed Inference
GMI Cloud runs DeepSeek-V3 on US-hosted H100 and H200 hardware through its Inference Engine, with an OpenAI-compatible API that requires no code changes from teams already integrated with the official DeepSeek API or OpenAI.
The key differences from the official API: data stays within US infrastructure under GMI Cloud's data governance, rate limits are not a constraint for high-throughput workloads, and the serverless model scales to zero between requests so idle periods cost nothing. Free inference on the DeepSeek R1 Distill Llama 70B and Llama 3.3 70B Instruct Turbo endpoints lets teams benchmark the infrastructure before committing.
For teams that need dedicated capacity with predictable latency guarantees, GMI Cloud's dedicated H200 clusters provide bare metal access to 8x H200 SXM configurations with RDMA networking and the SGLang serving stack. This is the configuration used for production DeepSeek-V3 workloads that require sustained high throughput.
The per-token cost on GMI Cloud's managed endpoint is higher than the official DeepSeek API, which is the standard tradeoff for US-hosted infrastructure on a model controlled by a Chinese lab. For teams where data sovereignty matters, the premium is not optional.
Deploy DeepSeek-V3 on GMI Cloud
Option 3: Third-Party Per-Token APIs (Together AI, Fireworks AI, OpenRouter)
Together AI hosts DeepSeek-V3 at approximately $1.25 per million tokens (combined input and output pricing). Fireworks AI offers comparable rates with FireAttention optimization delivering up to four times lower latency than standard vLLM. Both run on US infrastructure with standard SLA commitments, SOC 2 certification, and OpenAI-compatible APIs.
These providers are the fastest path to US-hosted DeepSeek-V3 inference for teams that want managed API simplicity without the official API's data residency issue. The per-token premium over the official DeepSeek API (roughly 2 to 5x at standard rates) is the cost of US-hosted infrastructure and operational overhead.
Both also offer dedicated GPU endpoints for teams that need consistent throughput. Together AI's dedicated H100 endpoints for DeepSeek start at $3.99/hr with reserved pricing from $2.25/hr.
Option 4: Self-Hosted on Dedicated GPU Infrastructure
Self-hosting DeepSeek-V3 makes economic sense above approximately 500 million tokens per month, where the fixed cost of GPU time undercuts per-token rates. The break-even depends on utilization: an 8x H200 cluster running at 70 percent utilization and generating approximately 100 tokens per second per GPU produces roughly 17 billion tokens per month. At $1.10/M output token pricing, that volume generates $18,700/month in avoided API cost versus a cluster cost of approximately $13,000/month on dedicated H200 infrastructure.
Beyond the economics, self-hosting gives teams complete control over the serving stack, context window configuration, fine-tuning pipeline, and data governance. There is no external API call, no rate limit, and no dependency on a third-party provider's uptime for production workloads.
The operational cost is real. Running SGLang on a multi-GPU H200 cluster requires ML engineering time for deployment, monitoring, incident response, and model updates. Industry estimates put ongoing ops at 20 to 30 percent of a senior ML engineer's time for a production serving stack. At standard ML engineer compensation, that is $3,000 to $6,000 per month in staffing cost before hardware.
GMI Cloud's dedicated GPU clusters provide an intermediate option: bare metal H200 access with RDMA networking, root access, and custom software stacks, without the need to manage physical hardware. Teams deploy their own SGLang or vLLM configuration and maintain the serving stack themselves, but GMI Cloud handles the infrastructure layer. This is the right configuration for teams that need self-hosting control without datacenter operations overhead.
The Data Sovereignty Decision
For many enterprise teams, the deployment decision starts and ends with data governance. The question is not which option is cheapest but which options are actually available given their compliance requirements.
US government and defense workloads: The official DeepSeek API is not viable. Self-hosting on US-based infrastructure with FedRAMP-aligned providers is the only option.
Healthcare workloads under HIPAA: Data routed through the official DeepSeek API cannot be HIPAA-compliant. US-hosted providers with BAA agreements (Fireworks AI is HIPAA-certified, GMI Cloud can discuss compliance requirements) are required.
European workloads under GDPR: The official DeepSeek API routes data to China, which is not an adequate jurisdiction under GDPR without specific safeguards. EU-hosted providers (Nebius, Scaleway) or US providers with adequate transfer mechanisms are required. Self-hosting on EU-based GPU infrastructure is the cleanest solution.
General enterprise workloads with proprietary data: The open-source community consensus is that hosting DeepSeek's open weights on US or EU infrastructure, rather than using the official API, eliminates the data sovereignty concern. "It would be unlikely that the US would take any action on using the open-source R1 or V3 models as long as they were hosted on US-based servers," notes one analysis of the regulatory landscape. GMI Cloud's US-hosted inference infrastructure addresses this directly.
Choosing Your Deployment Path
Under 100 million tokens per month, no data governance constraints: The official DeepSeek API at $0.27/M input is the cheapest and simplest option. No infrastructure to manage.
Under 100 million tokens per month, data governance constraints apply: GMI Cloud's managed inference or Together AI / Fireworks AI for US-hosted per-token access. The cost premium over the official API is the price of compliance.
100 to 500 million tokens per month: Run the TCO calculation with your specific token ratio. At this range, managed GPU endpoints from GMI Cloud ($2.00 to $2.40/hr H100/H200 with serverless scaling) often beat third-party per-token APIs, and the predictability of GPU-hour pricing simplifies budgeting.
Above 500 million tokens per month: Dedicated H200 infrastructure with self-managed SGLang serving is the most cost-efficient option. GMI Cloud's dedicated clusters provide the hardware layer; the serving stack is yours to configure and control.
Any volume with fine-tuning requirements: Self-hosting is required. No managed inference provider supports fine-tuned DeepSeek-V3 weights at production scale with full control over the adapter lifecycle.
Conclusion
DeepSeek-V3 is one of the most capable open-weight models available in 2026, and the deployment decision is real infrastructure work rather than a simple API key swap. The hardware demands require multi-GPU H200 configurations for full FP8 precision, SGLang outperforms vLLM by 3x on this model specifically, and the official API's data sovereignty constraints rule it out for most enterprise workloads.
For teams that need US-hosted production inference today, GMI Cloud's managed DeepSeek endpoints and dedicated H200 clusters provide the right combination: production-grade infrastructure, US data residency, OpenAI-compatible APIs, and a serverless default that eliminates idle compute cost. The free model endpoints let you benchmark production performance before any billing begins.
FAQs
What hardware does DeepSeek-V3 require for production inference? DeepSeek-V3 has 671 billion total parameters and requires approximately 700 GB of VRAM at FP8 precision for weight storage alone. The minimum practical production configuration is 8x H200 SXM GPUs (1,128 GB combined VRAM), which provides headroom for KV cache and activations alongside the model weights. An 8x H100 SXM configuration (640 GB total) can run the model at INT4 quantization with a 5 to 15 percent quality tradeoff on reasoning tasks. For production workloads requiring FP8 quality on H100 hardware, a 16-GPU configuration is needed. NVLink interconnects are important for this model's MoE routing communication patterns: SXM form factor with NVLink is strongly preferred over PCIe for multi-GPU DeepSeek-V3 deployments.
Why should I host DeepSeek-V3 on US infrastructure instead of using the official API? The official DeepSeek API routes traffic through servers subject to Chinese data law, which mandates government access to data upon request. For workloads involving personal data (GDPR), healthcare data (HIPAA), financial data (SOX), or any government and defense applications, this is a hard compliance blocker. DeepSeek-V3's open weights under MIT license mean teams can host the model on US or EU infrastructure and preserve the cost and capability advantages of the model without the data residency risk. US-hosted options include GMI Cloud's managed inference, Together AI, and Fireworks AI for per-token access, or self-hosted SGLang on dedicated GPU clusters for high-volume production workloads.
Which inference engine should I use for DeepSeek-V3 in production? SGLang is the inference engine recommended by the DeepSeek team and delivers 3.1x faster throughput than vLLM on DeepSeek-V3 specifically. The performance gap comes from MLA-optimized kernels (FlashMLA, FlashInfer, CutlassMLA) tuned for DeepSeek's Multi-head Latent Attention architecture, plus Multi-Token Prediction support via EAGLE speculative decoding that delivers a 1.8x decode speedup at batch size 1. SGLang's RadixAttention also provides 30 percent throughput improvement for multi-turn workloads by reusing KV cache across requests with shared prefixes. vLLM remains a viable option for multi-hardware environments or when model flexibility matters more than peak throughput on DeepSeek specifically.
At what token volume does self-hosting DeepSeek-V3 become cheaper than per-token APIs? The crossover point is approximately 500 million output tokens per month when comparing self-hosted GPU infrastructure against third-party managed APIs. Against the official DeepSeek API at $1.10/M output tokens, the crossover comes later because the official rate is lower. Against US-hosted providers at $1.00 to $1.50/M output tokens, a dedicated 8x H200 cluster at roughly $13,000/month running at 70 percent utilization breaks even at around 500 million output tokens per month. This calculation assumes you can sustain that utilization level; underutilized dedicated clusters increase the effective per-token cost significantly. For teams below this threshold, GMI Cloud's serverless inference with automatic scaling to zero is more cost-efficient than dedicated hardware because it charges only for active compute time.
How does GMI Cloud's DeepSeek-V3 deployment differ from using the official API or Together AI? GMI Cloud runs DeepSeek-V3 on US-hosted H100 and H200 infrastructure, which addresses the data sovereignty concern that blocks the official API for most enterprise workloads. Unlike Together AI and Fireworks AI, which offer per-token pricing on shared managed infrastructure, GMI Cloud also provides dedicated H200 bare metal clusters for teams that need full serving stack control, custom SGLang configurations, and predictable throughput guarantees. The serverless inference layer scales to zero between requests, eliminating idle cost for variable-traffic workloads. The free model endpoints on GMI Cloud run on the same H100 and H200 production infrastructure, providing an accurate performance benchmark before any billing commitment begins.
Build AI Without Limits
GMI Cloud helps you architect, deploy, optimize, and scale your AI strategies

.png)