How Can I Run Inference Efficiently on AI Models?
March 10, 2026
GMI Cloud Blog | AI Infrastructure Guide | gmicloud.ai
Running AI inference efficiently comes down to three levers: choosing the right model, choosing the right precision, and choosing the right infrastructure. Most teams overspend on inference not because they lack compute, but because they haven't optimized across all three dimensions systematically.
This guide provides a framework for inference optimization, from model selection to serving configuration.
Platforms like GMI Cloud offer optimized inference infrastructure with 100+ API-callable models, but the principles here apply regardless of provider.
We focus on NVIDIA data center GPUs; AMD MI300X, Google TPUs, and AWS Trainium are outside scope.
Let's work through each lever, starting with the one that has the biggest impact.
Lever 1: Model Selection
Choosing the right model is the single largest efficiency decision you'll make. A model that's oversized for your task wastes VRAM, bandwidth, and money on every request. A model that's undersized produces poor results and generates rework.
The key is matching model capability to task requirements, not defaulting to the biggest or cheapest option available.
Image and Video Tasks
For image generation, seedream-5.0-lite ($0.035/request) delivers strong text-to-image and image-to-image quality at an efficient cost-per-request. For image editing workflows, reve-edit-fast-20251030 ($0.007/request) provides fast turnaround with solid output quality.
For video generation, the quality-cost spectrum is wide. pixverse-v5.6-t2v ($0.03/request) handles text-to-video efficiently. For image-to-video, Kling-Image2Video-V1.6-Pro ($0.098/request) delivers higher fidelity for production pipelines.
For research requiring maximum quality, Sora-2-Pro ($0.50/request) or Veo3 ($0.40/request) are the top-tier options.
Audio and TTS
minimax-tts-speech-2.6-turbo ($0.06/request) is a reliable mid-range TTS model. elevenlabs-tts-v3 ($0.10/request) delivers broadcast-quality output for production deployments. For voice cloning, minimax-audio-voice-clone-speech-2.6-hd ($0.10/request) handles speaker replication from samples.
For prototyping and A/B testing across multiple TTS models, inworld-tts-1.5-mini ($0.005/request) provides a cost-efficient baseline to benchmark against higher-end options.
Image Research and Exploration
For research workflows that require high-fidelity image editing, bria-fibo-edit ($0.04/request) and seedream-4-0-250828 ($0.05/request) provide the precision that serious work demands.
The bria-fibo series (relight, restyle, restore at $0.000001/request) serves as a low-cost entry point for baseline experiments and pipeline testing.
Model Selection Table

| Task | Efficient Pick | Price | High-Fidelity Pick | Price |
|---|---|---|---|---|
| Text-to-image | seedream-5.0-lite | $0.035/req | gemini-3-pro-image-preview | $0.134/req |
| Image editing | reve-edit-fast-20251030 | $0.007/req | bria-fibo-edit | $0.04/req |
| Text-to-video | pixverse-v5.6-t2v | $0.03/req | Sora-2-Pro | $0.50/req |
| Image-to-video | Kling-Image2Video-V1.6-Pro | $0.098/req | Kling-Image2Video-V2-Master | $0.28/req |
| TTS | minimax-tts-speech-2.6-turbo | $0.06/req | elevenlabs-tts-v3 | $0.10/req |
| Voice cloning | minimax-audio-voice-clone-speech-2.6-turbo | $0.06/req | minimax-audio-voice-clone-speech-2.6-hd | $0.10/req |
| Video (research) | Veo3-Fast | $0.15/req | Veo3 | $0.40/req |
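To make the efficient-vs-high-fidelity tradeoff concrete, here is a back-of-envelope cost sketch using the text-to-image prices from the table above. The request volume is a hypothetical example, not a recommendation:

```python
# Per-request prices taken from the model selection table above.
PRICE_PER_REQUEST = {
    "seedream-5.0-lite": 0.035,           # efficient pick
    "gemini-3-pro-image-preview": 0.134,  # high-fidelity pick
}

def monthly_cost(model: str, requests_per_day: int, days: int = 30) -> float:
    """Projected monthly spend in USD for a single model."""
    return PRICE_PER_REQUEST[model] * requests_per_day * days

# Hypothetical workload: 10,000 image requests per day.
for model in PRICE_PER_REQUEST:
    print(f"{model}: ${monthly_cost(model, requests_per_day=10_000):,.2f}/month")
```

At this volume the high-fidelity pick costs roughly 3.8x more per month, which is why defaulting to the premium option for every request is the most common overspend.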
Once you've picked the right model, the next lever is how you represent its parameters.
Lever 2: Precision Optimization
Precision determines how many bytes each model parameter uses in memory. Lower precision means less VRAM, faster reads, and higher throughput, but potentially lower output quality. Choosing the right precision for your workload is the second-highest-impact optimization.
The Precision Ladder
FP32 (32-bit float): Full precision. Used in training for gradient accuracy. Rarely used for inference because it doubles memory cost with negligible quality benefit over FP16.
FP16/BF16 (16-bit): The default for most inference workloads. 2 bytes per parameter, so a 70B model needs ~140 GB. Safe choice when quality is the top priority.
FP8 (8-bit float): Halves memory vs. FP16. A 70B model drops to ~70 GB, fitting on a single H100. Requires Hopper-generation (H100, H200) or Ada-generation (e.g., L4) GPUs. Quality loss is minimal for most tasks.
INT8 (8-bit integer): Similar memory savings to FP8. Broader hardware support (works on A100). Slightly different quantization tradeoffs depending on the model architecture.
INT4 (4-bit): Aggressive quantization. A 70B model drops to ~35 GB. Quality degradation becomes noticeable on complex tasks. Best for latency-critical deployments where some quality loss is acceptable.
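The ladder above reduces to simple arithmetic. A minimal sketch (weight memory only; KV-cache, activations, and framework overhead add more on top):

```python
# Bytes per parameter at each precision on the ladder.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory footprint of the model weights alone."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A 70B-parameter model at three rungs of the ladder:
for p in ("fp16", "fp8", "int4"):
    print(f"70B @ {p}: {weight_memory_gb(70e9, p):.0f} GB")  # 140, 70, 35 GB
```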
Decision Rules
Use FP8 as your default starting point on H100/H200 hardware. It halves VRAM vs. FP16 with minimal quality impact. Drop to INT4 only if latency is critical and you've validated quality is acceptable. Stay at FP16 when output quality is non-negotiable (research publications, production content).
With model and precision locked in, the third lever is how you serve requests.
Lever 3: Serving Optimization
Even with the right model at the right precision, poor serving configuration leaves performance on the table. This lever covers how requests are scheduled, batched, and processed.
Continuous Batching
Static batching waits for a full batch before processing, which adds latency. Continuous batching (supported by vLLM and TensorRT-LLM) inserts new requests into the batch as slots open up. This keeps GPU utilization high and reduces time-to-first-token.
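A toy scheduler simulation makes the difference visible. This is illustrative only; real engines like vLLM and TensorRT-LLM schedule at the per-token level with far more sophistication. Here, each request needs some number of decode steps and the GPU runs a fixed number of slots:

```python
def simulate(arrivals, lengths, batch_size, continuous):
    """Mean latency (completion - arrival) for a toy batch scheduler.

    Static mode waits for a full batch and holds it until the longest
    request finishes; continuous mode refills a slot the moment it frees up.
    """
    n = len(arrivals)
    finish = [0.0] * n
    if continuous:
        slots = [0.0] * batch_size  # time each slot becomes free
        for i in range(n):
            s = min(range(batch_size), key=lambda k: slots[k])
            start = max(arrivals[i], slots[s])
            slots[s] = start + lengths[i]
            finish[i] = slots[s]
    else:
        t = 0.0
        for b in range(0, n, batch_size):
            batch = range(b, min(b + batch_size, n))
            start = max(t, max(arrivals[i] for i in batch))
            t = start + max(lengths[i] for i in batch)  # longest request gates the batch
            for i in batch:
                finish[i] = t
    return sum(f - a for f, a in zip(finish, arrivals)) / n

arrivals = [0, 0, 0, 0]
lengths = [10, 1, 1, 1]  # one long request, three short ones
print(simulate(arrivals, lengths, batch_size=2, continuous=False))  # 10.5
print(simulate(arrivals, lengths, batch_size=2, continuous=True))   # 4.0
```

The short requests no longer wait behind the long one, which is exactly the time-to-first-token win continuous batching delivers.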
KV-Cache Management
For LLM inference, the key-value cache stores attention states for each token in the sequence. KV-cache memory grows with sequence length and concurrency. The formula: KV per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element.
For Llama 2 70B (80 layers, 8 KV heads with GQA, head_dim 128) at FP16 with 4K context, that's ~1.3 GB per concurrent request. At 100 concurrent users, that's over 130 GB of VRAM just for KV-cache, on top of the model weights. Managing this efficiently (FP8 KV-cache, paged attention) is critical for high-concurrency deployments.
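Plugging Llama 2 70B's published shapes (80 layers, 8 KV heads under grouped-query attention, head_dim 128) into the formula above:

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """KV-cache size per request; the factor of 2 covers the K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

per_request = kv_cache_gb(80, 8, 128, 4096, 2)  # FP16 = 2 bytes/element
print(f"~{per_request:.2f} GB per request")
print(f"~{100 * per_request:.0f} GB at 100 concurrent requests")
```

Switching the KV-cache to FP8 (1 byte per element) halves both numbers, which is often the difference between one node and two.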
Speculative Decoding
Standard autoregressive LLM inference generates one token per forward pass. Speculative decoding uses a smaller "draft" model to predict multiple tokens, then verifies them with the main model in a single pass. This can deliver 1.5-2x throughput improvements without quality loss.
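A greedy-only toy sketch of the idea follows. Real implementations verify draft tokens against the target's probability distribution in one batched forward pass; the "models" here are deterministic stand-in functions, not LLMs:

```python
def speculative_decode(target, draft, prompt, k, n_tokens):
    """Greedy speculative decoding sketch. `target` and `draft` each map a
    token sequence to its next token. The draft proposes k tokens; one
    target verification "pass" keeps the matching prefix, plus one
    corrected token from the target on the first mismatch."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_tokens:
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        passes += 1  # one target forward pass verifies all k proposals
        for tok in proposal:
            expected = target(seq)
            if tok == expected:
                seq.append(tok)
            else:
                seq.append(expected)  # correction comes from the same pass
                break
    return seq[len(prompt):][:n_tokens], passes

# Deterministic stand-ins: a "good" draft agrees with the target exactly.
target = lambda s: (s[-1] + 1) % 5
good, bad = target, (lambda s: 0)

tokens, fast = speculative_decode(target, good, [0], k=4, n_tokens=8)
_, slow = speculative_decode(target, bad, [0], k=4, n_tokens=8)
print(tokens, fast, slow)  # a good draft needs far fewer target passes
```

The throughput gain tracks the draft model's acceptance rate: the more often its proposals match, the fewer expensive target passes per generated token.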
Serving Framework Selection
TensorRT-LLM excels at maximum throughput with NVIDIA-specific optimizations. vLLM provides flexibility with PagedAttention for efficient memory management. Both support continuous batching and FP8. Choose TensorRT-LLM for production throughput; vLLM for rapid iteration and broader model support.
These three levers work together. Here's how to apply them to your specific situation.
Applying the Framework by Role
For AI Researchers
Your priority is output quality, not cost minimization. Use high-fidelity models (Kling-Image2Video-V2-Master, Sora-2-Pro, Veo3) at FP16 precision. Run the bria-fibo series for baseline experiments at minimal cost, then switch to premium models for final results.
Dedicated GPU instances give you control over precision and batching configuration.
For Enterprise Project Leads
Your priority is cost-per-request at acceptable quality. Start with the "efficient pick" column from the model table. Use FP8 precision to halve VRAM costs. Enable continuous batching and monitor GPU utilization. Target 70%+ utilization; below that, you're overpaying for idle capacity.
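One way to track that target is to sample `nvidia-smi` periodically and average the readings. A minimal sketch (the query flags are standard nvidia-smi options; the sample string stands in for live output, and the 70% threshold mirrors the target above):

```python
import csv
import io
import statistics

# In production, collect readings with:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
# Here we use a canned sample of three polls.
SAMPLE = "85\n60\n90\n"

def mean_utilization(csv_text: str) -> float:
    """Average GPU utilization (%) across sampled readings."""
    vals = [float(row[0]) for row in csv.reader(io.StringIO(csv_text)) if row]
    return statistics.mean(vals)

util = mean_utilization(SAMPLE)
print(f"mean utilization: {util:.1f}% -> {'OK' if util >= 70 else 'underutilized'}")
```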
For Technical Teams Running Multi-Model Pipelines
Your priority is iteration speed and A/B testing. Use API-based inference to test multiple models without provisioning separate GPU instances for each. Compare output quality and latency across the efficient and high-fidelity picks, then lock in your production choice and optimize precision and batching around it.
Infrastructure: The Performance Ceiling
The three levers above are software optimizations. Hardware sets the ceiling. Here's how the current NVIDIA lineup compares for inference workloads.
| GPU | VRAM | Memory Bandwidth | FP8 Compute | NVLink | MIG |
|---|---|---|---|---|---|
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS | 900 GB/s* | Up to 7 instances |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS | 900 GB/s* | Up to 7 instances |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | N/A | 600 GB/s | Up to 7 instances |
| L4 | 24 GB GDDR6 | 300 GB/s | 242 TOPS | None (PCIe) | No |

*NVLink 4.0: 900 GB/s bidirectional aggregate per GPU on HGX/DGX platforms. Sources: NVIDIA H100 Datasheet (2023), H200 Product Brief (2024), A100 Datasheet, L4 Datasheet.
Per NVIDIA's H200 Product Brief (2024), the H200 delivers up to 1.9x inference speedup on Llama 2 70B vs. H100 (TensorRT-LLM, FP8, batch 64, 128/2048 tokens). The 76% VRAM increase means you can run larger models or higher concurrency without multi-GPU overhead.
Getting Started
Two paths depending on your stage. If you're benchmarking models or building a proof of concept, start with API-based inference to test multiple models without infrastructure overhead.
If you're optimizing an existing production pipeline, provision dedicated GPU instances and apply all three levers: right model, FP8 precision, continuous batching with TensorRT-LLM or vLLM.
Cloud inference platforms like GMI Cloud support both paths, with infrastructure optimized for performance and cost-efficiency.
Explore the model library or provision GPU instances depending on your needs.
Start where you are, apply the framework, and measure the results.
FAQ
What's the single highest-impact optimization for inference cost?
Model selection. Switching from an oversized model to one that matches your task requirements can reduce per-request costs by 5-10x while maintaining output quality.
When should I use FP8 vs. FP16?
Use FP8 as the default on H100/H200 hardware. It halves VRAM and improves throughput with minimal quality loss. Stay at FP16 only when output quality is non-negotiable, such as research publications or premium content generation.
How do I know if my GPU utilization is good enough?
Target 70%+ utilization. Below that, you're paying for idle capacity. Enable continuous batching and monitor request queuing. If requests queue frequently, you need more capacity or better batching.
Should I use TensorRT-LLM or vLLM?
TensorRT-LLM for maximum throughput in production with NVIDIA GPUs. vLLM for rapid prototyping, broader model support, and efficient memory management via PagedAttention. Both support FP8 and continuous batching.
Colin Mo
