
10 Production Pitfalls That Kill AI Inference Deployments

April 30, 2026

Most inference deployments that fail in production passed every test in staging. The difference isn't code quality; it's that staging doesn't replicate the scale, concurrency, and edge cases that production exposes. Failures come from infrastructure constraints, performance assumptions that break under load, and operational blind spots that only emerge over days and weeks of real traffic.

Choosing the right infrastructure helps, but doesn't replace good engineering practices. Platforms like GMI Cloud reduce some of these risks through pre-configured runtimes (TensorRT-LLM, vLLM, Triton) and on-demand GPU access. But even on well-configured infrastructure, teams that skip these ten lessons tend to redeploy, debug, and scale the hard way.

This article covers: infrastructure pitfalls that cause downtime, performance pitfalls that degrade experience, operational pitfalls that hide cost and quality issues, a strategic pitfall that costs millions later, a pre-deployment checklist, and how to choose infrastructure that prevents failure.

Infrastructure Pitfall 1: No Auto-Scaling

A single GPU instance runs fine at 40% utilization. Then traffic spikes at 3 AM, and response times jump from 200 ms to 8 seconds because the queue backs up. By the time a human realizes the problem and scales manually, customers have already seen degradation for an hour.

A common approach is Kubernetes Horizontal Pod Autoscaler (HPA) targeting 70% GPU utilization, with a minimum of 1 replica and a maximum of 8. This scales inference deployments up within 30 seconds of sustained overload and scales down during quiet hours to save cost. Setting the scale-down grace period to 5 minutes prevents thrashing when traffic fluctuates.
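As a sketch of this setup, assuming GPU utilization is exposed to the HPA as a per-pod metric named DCGM_FI_DEV_GPU_UTIL through a Prometheus adapter and that the deployment is called llm-inference (both placeholder names), the manifest can be generated like this:

```python
# Sketch of an HPA v2 manifest, built in Python and dumped to YAML.
# Assumes GPU utilization is exposed as a per-pod metric named
# "DCGM_FI_DEV_GPU_UTIL" via a Prometheus adapter, and that the inference
# deployment is named "llm-inference" (both hypothetical).
import yaml  # pip install pyyaml

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "llm-inference-hpa"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "llm-inference",
        },
        "minReplicas": 1,
        "maxReplicas": 8,
        "metrics": [{
            "type": "Pods",
            "pods": {
                "metric": {"name": "DCGM_FI_DEV_GPU_UTIL"},
                "target": {"type": "AverageValue", "averageValue": "70"},
            },
        }],
        # 5-minute grace period before scaling down, to avoid thrashing.
        "behavior": {"scaleDown": {"stabilizationWindowSeconds": 300}},
    },
}

with open("hpa.yaml", "w") as f:
    yaml.safe_dump(hpa, f, sort_keys=False)
# Apply with: kubectl apply -f hpa.yaml
```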

Infrastructure Pitfall 2: Cold Start Lag in Serverless

Serverless inference seems attractive: pay per request, no idle GPU costs. But when a function wakes after sitting idle for 10 minutes, the first request hits a 15-second cold start. In production, that tail latency violates SLAs and frustrates users.

One option is keeping a minimum of one warm replica by periodically sending a dummy request, then allowing auto-scaling up to 20 replicas under load. Adding a warm-up script that loads the model into VRAM before the instance accepts traffic bridges the gap between a cold function and a production-ready one.
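A minimal keep-warm loop might look like the following sketch, assuming a hypothetical completion endpoint and a ping interval shorter than the provider's idle timeout:

```python
# Minimal keep-warm loop: sends a tiny dummy request every few minutes so the
# serverless function never goes fully cold. The endpoint URL and payload are
# placeholders; adjust them to your provider's API.
import time
import requests

ENDPOINT = "https://inference.example.com/v1/completions"  # hypothetical
PING_INTERVAL_S = 300  # keep shorter than the provider's idle timeout

while True:
    try:
        r = requests.post(
            ENDPOINT,
            json={"prompt": "ping", "max_tokens": 1},
            timeout=30,
        )
        print(f"keep-warm status={r.status_code} "
              f"latency={r.elapsed.total_seconds():.2f}s")
    except requests.RequestException as e:
        print(f"keep-warm failed: {e}")
    time.sleep(PING_INTERVAL_S)
```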

Infrastructure Pitfall 3: Single-Region Deployment

When a single region experiences a data center outage, inference goes dark. It sounds unlikely until it happens on a Monday morning and customers lose access for 4 hours. The typical mitigation is dual-region deployment with DNS failover (Route 53 or Cloudflare).

Route 53 health checks run every 30 seconds, so with a low failure threshold and a short DNS TTL, failover completes within about 60 seconds. The cost of maintaining a warm standby in a second region is roughly 50% of primary production cost, a small premium for 99.99% uptime.
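A sketch of that failover setup with boto3, using placeholder domain names, IPs, and hosted zone ID:

```python
# Sketch of Route 53 DNS failover using boto3. Hosted zone ID, domain, and
# IPs are placeholders. The health check polls the primary region every 30s.
import boto3

r53 = boto3.client("route53")

hc = r53.create_health_check(
    CallerReference="inference-primary-hc-1",  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "inference-us-east.example.com",
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 2,
    },
)

def failover_record(identifier, role, ip, health_check_id=None):
    # Build one half of a PRIMARY/SECONDARY failover record pair.
    rec = {
        "Name": "inference.example.com",
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,           # "PRIMARY" or "SECONDARY"
        "TTL": 30,                  # short TTL so clients re-resolve quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        rec["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rec}

r53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # placeholder
    ChangeBatch={"Changes": [
        failover_record("primary", "PRIMARY", "203.0.113.10",
                        hc["HealthCheck"]["Id"]),
        failover_record("secondary", "SECONDARY", "198.51.100.20"),
    ]},
)
```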

Performance Pitfall 4: No KV-Cache Optimization

Inference engines cache the key-value tensors of previous tokens to avoid recomputing attention. With long contexts (4K to 8K tokens) and many concurrent users, the KV-cache consumes most of the VRAM. If no limits are set, 128 concurrent requests with 16K contexts fill the GPU, and new requests get stuck in the queue.

A common fix is setting max_seq_len to 4K and configuring a VRAM usage alert at the 85% threshold. This prevents the GPU from becoming a bottleneck and gives ops teams time to scale before saturation. Using the formula KV per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element helps teams right-size context length for their model and hardware.
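Plugging assumed Llama-70B-style dimensions (80 layers, 8 KV heads with grouped-query attention, head dim 128, FP16 cache) into that formula shows why 16K contexts at high concurrency overwhelm an 80 GB GPU:

```python
# KV-cache sizing using the formula above. Model dimensions are assumed
# (Llama-70B-like with grouped-query attention); substitute your model's.
num_layers = 80
num_kv_heads = 8
head_dim = 128
bytes_per_element = 2  # FP16

def kv_bytes_per_request(seq_len: int) -> int:
    # The factor of 2 covers both the key and the value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

gib = 1024 ** 3
print(f"4K context:  {kv_bytes_per_request(4096) / gib:.2f} GiB per request")
print(f"16K context: {kv_bytes_per_request(16384) / gib:.2f} GiB per request")
print(f"16K x 128 concurrent: {128 * kv_bytes_per_request(16384) / gib:.0f} GiB total")
# ~1.25 GiB at 4K and ~5 GiB at 16K per request; 128 concurrent 16K requests
# would need ~640 GiB of KV-cache, far beyond a single 80 GB GPU.
```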

Performance Pitfall 5: Wrong Precision for the Workload

Running FP16 (half-precision floating point) feels safer than FP8 because it loses less precision. In production, though, FP16 throughput is roughly half that of FP8 on modern GPUs like NVIDIA's H100, which has dedicated FP8 tensor cores. One option is testing FP8 quantization; for most LLMs, accuracy loss is under 1%, and throughput roughly doubles.

FP8 doesn't work for all models, especially those with dynamic activations. The safest approach is measuring accuracy loss on a representative validation set, then A/B testing FP8 in production with a canary deployment to 5% of traffic before full rollout.
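A minimal sketch of the 5% canary split, with placeholder endpoints for the FP16 and FP8 deployments:

```python
# Minimal 5% canary split between the existing FP16 deployment and a new FP8
# deployment. Endpoint URLs are placeholders; a real rollout would also log
# the variant so accuracy and latency can be compared per precision.
import random
import requests

FP16_ENDPOINT = "https://inference.example.com/fp16/v1/completions"  # placeholder
FP8_ENDPOINT = "https://inference.example.com/fp8/v1/completions"    # placeholder
CANARY_FRACTION = 0.05

def route_request(payload: dict):
    variant = "fp8" if random.random() < CANARY_FRACTION else "fp16"
    endpoint = FP8_ENDPOINT if variant == "fp8" else FP16_ENDPOINT
    resp = requests.post(endpoint, json=payload, timeout=60)
    # Return the variant with the response so logs can group quality and
    # latency by precision during the canary window.
    return variant, resp
```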

Performance Pitfall 6: No Continuous Batching

Without continuous batching, a GPU waits for all requests in a batch to finish before accepting new ones. If one request takes 8 seconds and others finish in 2 seconds, those fast requests idle until the slow one completes, wasting GPU capacity. Most teams find that continuous batching (token-level scheduling) improves throughput 2-4x.

vLLM enables continuous batching by default. TensorRT-LLM requires enable_chunked_context=True and a compatible batching scheduler. A maximum batch size of 32 with a queueing delay of around 500 ms is a reasonable starting point for balancing throughput and latency.
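A sketch of a vLLM engine with continuous batching (on by default); here max_num_seqs is assumed to be the knob corresponding to the batch size of 32 above, and the model name is only an example:

```python
# Sketch of a vLLM engine with continuous batching, which is on by default.
# max_num_seqs caps how many sequences are scheduled concurrently -- the rough
# equivalent of the "batch size 32" setting above. Model name is an example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    max_num_seqs=32,              # upper bound on concurrently scheduled requests
    max_model_len=4096,           # matches the 4K context limit from Pitfall 4
    gpu_memory_utilization=0.90,  # leave headroom for activations
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```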

Operational Pitfall 7: No Request-Level Logging

A deployment is slow, but ops doesn't know which part of the pipeline is the bottleneck. Was it model inference, preprocessing, or network I/O? Without request-level logs, debugging becomes guesswork. Most teams that scale find it essential to log each request's ID along with prompt_tokens, completion_tokens, latency_ms, and status_code.

Shipping these fields as structured logs or metrics to a backend like Prometheus or Datadog allows grouping by model, region, and user. This enables tracking p99 latency trends and correlating slowness with other signals like GPU utilization.
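A minimal structured-logging helper emitting those fields as one JSON line per request (field names follow the schema above; how the lines are shipped is left to the logging pipeline):

```python
# Minimal structured request log: one JSON line per request with the fields
# named above. In production these lines would be shipped to an observability
# backend; field names here mirror the article's example schema.
import json
import time
import uuid
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("inference.requests")

def log_request(prompt_tokens: int, completion_tokens: int, status_code: int,
                started_at: float, model: str, region: str) -> None:
    log.info(json.dumps({
        "request_id": str(uuid.uuid4()),
        "model": model,
        "region": region,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": round((time.monotonic() - started_at) * 1000, 1),
        "status_code": status_code,
    }))

# Usage:
# t0 = time.monotonic()
# ... run inference ...
# log_request(812, 156, 200, t0, "llama-70b-v3.1", "us-east")
```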

Operational Pitfall 8: No Model Versioning Strategy

A new model version is deployed to production, and inference breaks. The old version is lost, or rolling back requires a manual redeployment that takes 20 minutes. Most teams find that version-numbered model files (llama-70b-v3.1.safetensors, llama-70b-v3.2.safetensors) plus blue-green deployment prevent this chaos.

In practice this is blue-green with a canary phase: route 10% of traffic to the new version for 24 hours while monitoring for errors. If error rates stay low, shift traffic to 50%, then 100%, over the next day. If errors spike, revert 100% of traffic to the old (green) version instantly.
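A sketch of the promote-or-revert decision during that rollout; the error-rate source and traffic-split function are placeholders for whatever router or service mesh is in use:

```python
# Sketch of the promote-or-revert decision during the canary phase of a
# blue-green rollout. get_error_rate and set_traffic_split are placeholders
# for the monitoring query and router/service-mesh call you actually use.
CANARY_STEPS = [10, 50, 100]    # percent of traffic sent to the new version
ERROR_RATE_THRESHOLD = 0.01     # revert if the new version exceeds 1% errors

def advance_rollout(get_error_rate, set_traffic_split) -> str:
    """get_error_rate() -> float, set_traffic_split(new_pct: int) -> None."""
    for pct in CANARY_STEPS:
        set_traffic_split(pct)
        # ... wait for the observation window (e.g. 24h at the first step) ...
        if get_error_rate() > ERROR_RATE_THRESHOLD:
            set_traffic_split(0)  # instant revert: 100% back to the old version
            return "reverted"
    return "promoted"
```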

Operational Pitfall 9: No Cost Alerting

Inference costs grow silently. A bug floods the queue with requests, or a customer's integration calls the API 100x per second instead of 10x. By the time the bill arrives, the overcharge is $50k. One option is configuring daily budget alerts at >120% of the rolling 7-day average.

Grafana dashboards showing real-time $/request and monthly burn rate make cost visible to the whole team. Slack notifications alert ops within 10 minutes of anomalous spend, enabling quick diagnosis.
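A sketch of the budget check itself, assuming the last seven days of spend come from your billing export and using a placeholder Slack webhook URL:

```python
# Sketch of the >120% rolling-average budget alert. daily_spend holds the
# last 7 days of spend in dollars (source depends on your billing export);
# the Slack webhook URL is a placeholder.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
ALERT_MULTIPLIER = 1.2

def check_spend(daily_spend: list[float], today_spend: float) -> None:
    baseline = sum(daily_spend) / len(daily_spend)  # rolling 7-day average
    if today_spend > ALERT_MULTIPLIER * baseline:
        requests.post(SLACK_WEBHOOK, json={
            "text": (f"Inference spend alert: ${today_spend:,.0f} today vs "
                     f"${baseline:,.0f} 7-day average "
                     f"(threshold {ALERT_MULTIPLIER:.0%}).")
        }, timeout=10)

# Usage: check_spend([1800, 1750, 1900, 1820, 1780, 1850, 1810], 2600)
```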

Strategic Pitfall 10: Vendor Lock-In Without an Exit Plan

Choosing an inference provider feels permanent once models and integrations are built. If the provider raises prices 40% or shuts down, you're stuck replatforming under pressure. Most teams that avoid this pitfall write an exit plan document before signing a contract.

The plan includes API compatibility testing (how hard is it to swap SDKs?), model export testing (can models be downloaded in standard formats?), egress cost estimation (does the provider charge for data transfer?), and a backup provider shortlist. Quarterly reviews ensure the plan stays current.

Pre-Deployment Checklist

| Pitfall | Prevention | Verification |
| --- | --- | --- |
| No auto-scaling | HPA target 70% GPU util, min 1, max 8 | Trigger load test; verify scale within 60s |
| Cold start | Min replica 1 + warm-up script | Measure first-request latency, expect <2s |
| Single region | Dual region + DNS failover | Simulate region outage; verify failover <60s |
| KV-cache bloat | max_seq_len 4K, VRAM alert at 85% | Load concurrent requests, monitor VRAM usage |
| Wrong precision | A/B test FP8 on 5% canary | Measure throughput and accuracy on test set |
| No continuous batching | Enable in vLLM or TensorRT-LLM | Compare batched and non-batched throughput |
| No request logging | Log request_id, tokens, latency, status | Query logs for p99 latency by model |
| No versioning | Version files + blue-green deploy | Deploy new version to 10% traffic, revert on error |
| No cost alerting | Daily budget alerts at 120% baseline | Trigger spike in requests, verify Slack alert fires |
| Vendor lock-in | Write exit plan before contract | Test model export and API swap time |

GMI Cloud Infrastructure: Built for Production Reliability

GMI Cloud is worth considering when evaluating infrastructure that addresses several of these pitfalls. Listed pricing at the time of writing is H100 SXM at ~$2.10/GPU-hour and H200 SXM at ~$2.50/GPU-hour. Nodes provide 8 GPUs with NVLink 4.0 (900 GB/s per-GPU bidirectional aggregate on HGX/DGX platforms) and 3.2 Tbps InfiniBand. Pre-configured runtimes include TensorRT-LLM, vLLM, and Triton with NCCL.

Teams should verify auto-scaling behavior, monitoring capabilities, and FP8 support against their specific model and traffic patterns during a trial. For teams that prefer managed inference without GPU management, GMI's Inference Engine offers 100+ pre-deployed models with per-request pricing (check gmicloud.ai/pricing for current availability and rates).

Colin Mo
