What Factors Influence Inference Speed in Machine Learning Models?
March 10, 2026
GMI Cloud Blog | AI Infrastructure Guide | gmicloud.ai
Inference speed in machine learning models is determined by the interaction of three categories of factors: hardware (GPU memory, bandwidth, compute), model characteristics (size, architecture, precision), and serving configuration (batching, framework, optimization techniques).
Most speed bottlenecks trace back to a mismatch between these factors, not a single root cause.
This guide breaks down each category so you can diagnose where your inference pipeline is slow and what to do about it.
Optimized infrastructure like GMI Cloud addresses these factors through high-performance GPUs, an optimized inference engine, and a model library with 100+ API-callable options.
We focus on NVIDIA data center GPUs; AMD MI300X, Google TPUs, and AWS Trainium are outside scope.
Let's start with the factor that sets the ceiling: hardware.
Factor 1: Hardware
Hardware determines the upper bound of inference speed. Three GPU specs matter most.
Memory Capacity (VRAM)
VRAM determines how large a model you can load. A 70B-parameter model at FP8 needs ~70 GB for the weights alone, before KV-cache and activation overhead. If the model doesn't fit, you either quantize further, split it across GPUs (adding communication overhead), or downgrade to a smaller model. Insufficient VRAM is the most common reason teams hit speed walls.
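The sizing rule above is simple arithmetic: parameters times bytes per parameter. A minimal sketch (the helper name and the GB convention are illustrative; real deployments also need headroom for KV-cache and activations):

```python
# Bytes per parameter for common precisions (standard values). This
# estimate covers weights only -- KV-cache and activations add more.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billions: float, precision: str) -> float:
    """VRAM needed for model weights alone, in GB (1 GB = 1e9 bytes here)."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_vram_gb(70, "fp8"))   # 70.0 -- fits an 80 GB H100 only with
                                   # almost no KV-cache headroom
print(weight_vram_gb(70, "fp16"))  # 140.0 -- needs two GPUs or an H200
print(weight_vram_gb(7, "fp16"))   # 14.0 -- comfortable on a single card
```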
Memory Bandwidth
Bandwidth determines how fast the GPU reads parameters during each forward pass. For LLM inference, this is usually the primary bottleneck. The H200's 4.8 TB/s vs. H100's 3.35 TB/s directly translates to ~40% faster token generation on bandwidth-bound workloads.
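Because a bandwidth-bound decode must read every weight once per generated token, the bandwidth spec translates directly into an upper bound on single-stream token rate. A back-of-envelope sketch (batch 1, ignoring KV-cache reads; the function name is illustrative):

```python
def peak_tokens_per_sec(bandwidth_tb_s: float, params_billions: float,
                        bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed: each token requires one
    full read of the weights, so rate = bandwidth / model size in bytes."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# 70B model at FP8 (1 byte per parameter):
h100 = peak_tokens_per_sec(3.35, 70, 1.0)  # ~47.9 tok/s ceiling
h200 = peak_tokens_per_sec(4.8, 70, 1.0)   # ~68.6 tok/s ceiling
print(round(h200 / h100, 2))               # 1.43 -- the ~40% gap cited above
```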
Compute (FLOPS)
FLOPS matter most for compute-bound tasks like diffusion model inference, where each denoising step involves heavy matrix math. For LLM inference, FLOPS are rarely the bottleneck because the model spends most time reading parameters, not computing.
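Whether a workload is bandwidth-bound or compute-bound can be estimated with the roofline model: compare its arithmetic intensity (FLOPs performed per byte moved) against the GPU's machine balance (peak FLOPS divided by bandwidth). A sketch under that standard model (the intensity values below are rough illustrations, not measurements):

```python
def bottleneck(flops_per_byte: float, peak_tflops: float,
               bandwidth_tb_s: float) -> str:
    """Roofline check: below the machine balance point the workload is
    limited by memory bandwidth; above it, by compute throughput."""
    machine_balance = peak_tflops / bandwidth_tb_s  # FLOPs per byte
    return "bandwidth-bound" if flops_per_byte < machine_balance else "compute-bound"

# H100 FP8: 1979 TFLOPS / 3.35 TB/s ~= 590 FLOPs per byte of balance.
# Batch-1 LLM decode does roughly 2 FLOPs per weight byte read:
print(bottleneck(2, 1979, 3.35))     # bandwidth-bound
# Large matrix-heavy diffusion steps can sit far above the balance point:
print(bottleneck(1000, 1979, 3.35))  # compute-bound
```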
GPU Comparison for Inference
| GPU | VRAM | Bandwidth | FP8 Compute | NVLink |
|---|---|---|---|---|
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS | 900 GB/s* |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS | 900 GB/s* |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | N/A | 600 GB/s |
| L4 | 24 GB GDDR6 | 300 GB/s | 242 TOPS | None (PCIe) |

*NVLink 4.0: 900 GB/s bidirectional aggregate per GPU on HGX/DGX platforms. Sources: NVIDIA H100 Datasheet (2023), H200 Product Brief (2024), A100 Datasheet, L4 Datasheet.
Per NVIDIA's H200 Product Brief (2024), the H200 delivers up to 1.9x inference speedup on Llama 2 70B vs. H100 (TensorRT-LLM, FP8, batch 64, 128/2048 tokens). NVLink matters when splitting models across GPUs: faster links mean less communication overhead.
Hardware sets the ceiling. But the model you run determines how much of that ceiling you use.
Factor 2: Model Characteristics
Two models on the same GPU can differ in inference speed by 10x, depending on three properties.
Parameter Count
More parameters means more data to read per forward pass. A 70B model reads ~70 GB per token at FP8; a 7B model reads ~7 GB. On the same GPU, the smaller model generates tokens roughly 10x faster. Choosing the smallest model that meets your quality bar is the single largest speed lever.
Architecture Type
Different architectures have fundamentally different inference profiles. LLMs generate tokens autoregressively (one forward pass per token), making them bandwidth-bound. Diffusion models run 20-50 denoising passes per image, making them compute-bound.
This means the same GPU can be a bottleneck for one model type and overkill for another. Diagnosing speed issues requires knowing which bottleneck your model hits.
Precision
Precision controls bytes per parameter. FP16 uses 2 bytes; FP8 uses 1; INT4 uses 0.5. Lower precision means less VRAM and faster reads, but potentially lower quality.
FP8 on H100/H200 is the current sweet spot: halves memory vs. FP16 with minimal quality loss. INT4 is more aggressive and requires validation. FP16 is the safe choice when quality is non-negotiable.
With hardware and model locked in, the third factor is how you configure serving.
Factor 3: Serving Configuration
Even with optimal hardware and model selection, poor serving configuration leaves speed on the table.
Batching Strategy
Static batching waits for a full batch before processing, adding latency. Continuous batching (vLLM, TensorRT-LLM) inserts new requests as slots open, keeping GPU utilization high. Switching from static to continuous batching typically improves throughput 2-4x.
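The latency cost of static batching is easy to see in a toy model. The sketch below simulates only the static case (all timings in milliseconds are hypothetical, and every request is assumed to need the same number of decode steps); continuous batching would admit each request the moment a slot frees up:

```python
def static_batch_latency(arrival_times_ms, batch_size, step_ms, steps):
    """Static batching toy model: wait until a full batch has arrived,
    then run the whole batch to completion. Returns per-request
    completion times in ms."""
    done = []
    for i in range(0, len(arrival_times_ms), batch_size):
        batch = arrival_times_ms[i:i + batch_size]
        start = max(batch)                 # last arrival gates the batch
        finish = start + steps * step_ms
        done += [finish] * len(batch)
    return done

# Four requests arriving 1 s apart, batch of 4, 50 steps at 20 ms/step:
print(static_batch_latency([0, 1000, 2000, 3000], 4, 20, 50))
# -> [4000, 4000, 4000, 4000]: request 0 waited 3 s just to start.
# Continuous batching would start it at t=0 and finish it near t=1000 ms.
```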
KV-Cache Management
For LLM inference, the KV-cache stores attention states per token. It grows with sequence length and concurrency: KV per request ≈ 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element.
For Llama 2 70B (80 layers, 8 KV heads under GQA, head dim 128) at FP16 with 4K context, that's ~1.3 GB per concurrent request. At 100 concurrent users, KV-cache alone consumes ~134 GB, more than an entire H200. An FP8 KV-cache halves this, and paged attention (PagedAttention in vLLM) eliminates most fragmentation waste.
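Plugging Llama 2 70B's published dimensions into the formula above (80 layers, 8 KV heads, head dim 128; the helper name is illustrative):

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """KV-cache per request: 2 (K and V) x layers x KV heads x head dim
    x sequence length x bytes per element, in GB (1 GB = 1e9 bytes)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Llama 2 70B at FP16 (2 bytes/element), 4096-token context:
per_req = kv_cache_gb(80, 8, 128, 4096, 2)
print(round(per_req, 2))        # 1.34 GB per concurrent request
print(round(per_req * 100))     # ~134 GB for 100 concurrent requests
print(round(kv_cache_gb(80, 8, 128, 4096, 1) * 100))  # FP8 KV-cache halves it
```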
Speculative Decoding
Standard autoregressive inference generates one token per forward pass. Speculative decoding uses a smaller draft model to propose multiple tokens, then verifies them in a single forward pass of the large model. This typically delivers a 1.5-2x speedup without changing the output, because rejected proposals fall back to the large model's own prediction.
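The propose-then-verify loop can be illustrated with a toy greedy version. Everything here is a simplification: `draft_next` and `target_next` are hypothetical stand-ins for real model calls, and on real hardware the verification of all proposed tokens happens in one batched forward pass rather than a Python loop:

```python
def speculative_step(draft_next, target_next, context, k):
    """One round of greedy speculative decoding: the draft proposes k
    tokens; the target keeps the longest matching prefix, then commits
    one token of its own. Output always matches pure target decoding."""
    proposal, ctx = [], list(context)
    for _ in range(k):                        # cheap draft passes
        tok = draft_next(tuple(ctx))
        proposal.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in proposal:                      # one batched pass on real HW
        if target_next(tuple(ctx)) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    accepted.append(target_next(tuple(ctx)))  # target's own next token
    return accepted

# Toy models over integer tokens: the draft agrees with the target
# except at every 3rd position.
target = lambda ctx: len(ctx) % 10
draft = lambda ctx: len(ctx) % 10 if len(ctx) % 3 else (len(ctx) % 10) + 1
print(speculative_step(draft, target, (1, 2), 4))
# -> [2, 3]: two tokens committed from a single verification pass.
```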
Framework Selection
TensorRT-LLM provides maximum throughput with NVIDIA-specific optimizations. vLLM provides flexibility with PagedAttention for memory efficiency. Both support FP8 and continuous batching. Choose TensorRT-LLM for production throughput; vLLM for rapid iteration.
These three factors interact. Here's how to apply this to real tasks.
Applying the Diagnosis
When inference is slow, the fix depends on which factor is the bottleneck.
Image generation slow? Likely compute-bound. For efficient text-to-image, seedream-5.0-lite ($0.035/request) is optimized for speed-quality balance. For higher fidelity, seedream-4-0-250828 ($0.05/request) provides more capability.
LLM responses slow? Likely bandwidth-bound. Upgrade to higher bandwidth (H200 > H100 > A100), reduce precision to FP8, enable continuous batching. Check KV-cache usage if concurrency is high.
Video generation slow? Inherently heavy. Kling-Image2Video-V1.6-Pro ($0.098/request) delivers high fidelity on optimized infrastructure. pixverse-v5.6-t2v ($0.03/request) trades some quality for speed.
TTS latency too high? minimax-tts-speech-2.6-turbo ($0.06/request) is optimized for low-latency. elevenlabs-tts-v3 ($0.10/request) provides broadcast quality with competitive speed.
Model Picks by Role
| Role | Priority | Model | Price | Why This One |
|---|---|---|---|---|
| R&D Engineer | Max fidelity video | Kling-Image2Video-V2-Master | $0.28/req | Top-tier research output |
| R&D Engineer | Image research | bria-fibo-edit | $0.04/req | High-fidelity editing |
| Algorithm Optimizer | Speed benchmarking | pixverse-v5.6-t2v | $0.03/req | Fast, efficient video |
| Algorithm Optimizer | TTS optimization | minimax-tts-speech-2.6-turbo | $0.06/req | Low-latency delivery |
| Deployment Engineer | Production image | seedream-5.0-lite | $0.035/req | Quality + speed balance |
| Deployment Engineer | Production TTS | elevenlabs-tts-v3 | $0.10/req | Broadcast-quality output |
| Grad Researcher | Baseline experiments | bria-fibo-image-blend | $0.000001/req | Low-cost exploration |
| Grad Researcher | Video (top-tier) | Sora-2-Pro | $0.50/req | Publication-grade fidelity |
Diagnose First, Then Optimize
The mistake most teams make is optimizing the wrong factor. Before tuning batching or switching GPUs, identify which category is your actual bottleneck.
If your model doesn't fit in VRAM, no serving optimization will help. If GPU utilization is below 50%, batching is the issue, not hardware. If your model is oversized for the task, a smaller model will outperform any infrastructure upgrade.
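That triage order can be written down directly. A sketch that encodes the heuristics above (the 50% threshold comes from the text; the function and its inputs are illustrative, not a universal rule):

```python
def diagnose(model_fits_vram: bool, gpu_util_pct: float,
             smaller_model_meets_quality: bool) -> str:
    """Triage in the order described above: check fit first, then GPU
    utilization, then model sizing, and only then blame hardware."""
    if not model_fits_vram:
        return "hardware: model exceeds VRAM -- quantize, shard, or shrink"
    if gpu_util_pct < 50:
        return "serving: low utilization -- fix batching and framework first"
    if smaller_model_meets_quality:
        return "model: downsize -- the single largest speed lever"
    return "hardware: well-utilized but still slow -- upgrade bandwidth/compute"

print(diagnose(True, 35.0, False))   # points at serving config, not hardware
print(diagnose(False, 90.0, False))  # VRAM fit trumps everything else
```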
Cloud platforms like GMI Cloud let you test quickly.
Try models from the model library via API to isolate whether the bottleneck is model choice or infrastructure.
When you need dedicated hardware, GPU instances give you full control over precision, batching, and framework.
Diagnose first, then optimize.
FAQ
What's the single biggest factor affecting inference speed?
Model size relative to VRAM. If the model barely fits, everything slows down. Choosing the smallest model that meets your quality bar is the highest-impact decision.
When should I upgrade hardware vs. optimize software?
If GPU utilization is 70%+ and you're still slow, hardware is the bottleneck. If utilization is low, fix batching, precision, and framework first. Software optimization is cheaper and faster.
How much does FP8 actually speed up inference?
On H100/H200, FP8 roughly halves VRAM vs. FP16 and increases throughput proportionally on bandwidth-bound workloads. Quality impact is minimal for most tasks. Always validate on your specific use case.
How do I tell if I'm bandwidth-bound or compute-bound?
Monitor GPU compute utilization vs. memory bandwidth during inference. If compute is at 30% but bandwidth at 90%, you're bandwidth-bound (typical for LLMs). If compute is 80%+ with moderate bandwidth, you're compute-bound (typical for diffusion models).
Colin Mo
