vLLM vs TensorRT-LLM vs Triton: The Runtime Decision That Shapes Your Inference Cost
April 30, 2026
The runtime engine you choose for LLM inference can change your cost per token by 2 to 3 times on the same GPU. Yet most platform evaluations focus on GPU specs and model availability while treating the software layer as an afterthought. Providers like GMI Cloud pre-configure all three major runtimes (vLLM, TensorRT-LLM, Triton) on every instance, but many teams still need to understand the tradeoffs to get the most out of them. This article covers: what each runtime does, throughput and latency comparisons, deployment complexity, the decision matrix that guides runtime selection, and how to stack them for maximum flexibility.
Understanding the Three Runtimes
vLLM is an open-source inference engine built around the PagedAttention algorithm, which reduces memory fragmentation and enables higher batch sizes without GPU out-of-memory errors. TensorRT-LLM is NVIDIA's compiler-optimized runtime that transforms models into lower-level kernel implementations, trading compilation time for significantly faster inference. Triton is an orchestration layer that routes requests across multiple model backends and instances, providing a service interface rather than a runtime itself.
The confusion usually arises because Triton doesn't execute models directly. Instead, it wraps TensorRT-LLM, vLLM, or other backends and exposes them through a unified inference server API. Thinking of Triton as a runtime is like thinking of a load balancer as a web server: it directs work to the components that do the actual execution rather than executing anything itself. Understanding this distinction clarifies when to use each tool and how to combine them effectively.
Throughput & Latency Comparison
vLLM and TensorRT-LLM deliver different performance profiles on the same hardware. The typical pattern is that TensorRT-LLM achieves 20 to 40% higher throughput after compilation, while vLLM reaches production faster with minimal setup overhead.
Consider Llama 2 70B in FP8 precision on a single H100 SXM with 80GB HBM3. vLLM with continuous batching typically delivers 85 to 95 tokens per second at batch=8 and p95 latency around 450 milliseconds. TensorRT-LLM, after compiling the Llama 2 weights to TensorRT format, reaches 110 to 130 tokens per second at equivalent batch size with p95 latency closer to 350 milliseconds. That roughly 30% throughput gain adds up across a month of inference workload.
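To make the gap concrete, here is a back-of-the-envelope calculation using the midpoints of the ranges above. The arithmetic is illustrative only; real throughput depends on sequence lengths, batching behavior, and traffic shape.

```python
# Rough monthly token volume per GPU, using midpoints of the ranges above.
# Illustrative arithmetic only -- actual throughput depends on model,
# batch size, sequence lengths, and traffic pattern.
vllm_tps = 90         # midpoint of 85-95 tokens/sec (vLLM, batch=8)
trtllm_tps = 120      # midpoint of 110-130 tokens/sec (TensorRT-LLM, batch=8)

seconds_per_month = 30 * 24 * 3600

vllm_monthly = vllm_tps * seconds_per_month       # ~233M tokens
trtllm_monthly = trtllm_tps * seconds_per_month   # ~311M tokens

print(f"vLLM:         {vllm_monthly / 1e6:.0f}M tokens/month per GPU")
print(f"TensorRT-LLM: {trtllm_monthly / 1e6:.0f}M tokens/month per GPU")
print(f"Difference:   {(trtllm_monthly - vllm_monthly) / 1e6:.0f}M tokens/month per GPU")
```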
A typical vLLM launch follows a simple pattern: specify the model, the quantization format, and tensor parallelism if needed; a minimal sketch appears below. TensorRT-LLM requires building an engine first: converting model weights to FP8, compiling them with TensorRT, then serving the compiled artifact. The TensorRT compilation step typically takes 15 to 30 minutes per model-GPU combination, but happens once. vLLM skips this step entirely, reducing time-to-production significantly.
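On the vLLM side, that pattern can be expressed through the Python API. This is a minimal sketch with a placeholder model ID; exact argument names and supported quantization formats vary by vLLM version.

```python
# Minimal vLLM launch sketch using the Python API.
# Model ID and settings are illustrative placeholders; check your vLLM
# version's documentation for exact argument names and FP8 support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model ID
    quantization="fp8",                 # FP8 weights, as in the example above
    tensor_parallel_size=1,             # raise if the model won't fit on one GPU
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```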
The latency story becomes more nuanced at high batch sizes. vLLM excels when requests arrive in steady streams at moderate batch sizes. TensorRT-LLM dominates when you're trying to maximize throughput on a fixed number of GPUs or when serving tail latency (p99) is critical. A hybrid approach uses vLLM for fast iteration and testing, then switches to TensorRT-LLM once the model reaches production.
Deployment Complexity
vLLM deployment is straightforward: install the Python package, run a single launch command, and the server is listening on port 8000 within minutes. Most teams accomplish initial deployment in under 30 minutes, including model download time.
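Once the server is up, any OpenAI-compatible client can talk to it. A minimal smoke test with plain requests, assuming the default port and a placeholder model name:

```python
# Smoke test against a locally running vLLM OpenAI-compatible server.
# Assumes the default port 8000 mentioned above; model name is a placeholder.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-70b-hf",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```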
TensorRT-LLM requires more steps. First, download or convert model weights to the format TensorRT expects. Second, run the compilation step, which invokes NVIDIA's compiler and generates an optimized engine file (typically 20 to 80 GB for large models). Third, configure TensorRT-LLM to serve the compiled engine. The entire workflow takes 45 to 90 minutes depending on model size and GPU availability.
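For comparison, recent TensorRT-LLM releases expose a high-level Python LLM API that wraps the convert-and-compile steps. The sketch below shows the shape of that workflow rather than exact syntax; argument names change between releases, and older versions use the checkpoint-conversion and trtllm-build tools instead.

```python
# TensorRT-LLM workflow sketch using the high-level Python LLM API.
# Names and options are illustrative -- verify against the TensorRT-LLM
# version you are running.
from tensorrt_llm import LLM, SamplingParams

# Constructing the LLM triggers checkpoint conversion and engine compilation,
# the 15-30 minute step described above. It happens once per model-GPU
# combination; the resulting engine is then served repeatedly.
llm = LLM(model="meta-llama/Llama-2-70b-hf")  # placeholder model ID

params = SamplingParams(max_tokens=128)
outputs = llm.generate(["Summarize the vLLM vs TensorRT-LLM tradeoff."], params)
print(outputs[0].outputs[0].text)
```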
Triton adds a configuration layer: defining model repositories with config.pbtxt files, specifying input/output shapes, and declaring which backend (vLLM, TensorRT-LLM, or custom) handles each model. This is straightforward for teams experienced with containerized services but adds friction for first-time users. One option is to start with vLLM and introduce Triton only when you need to serve multiple models or implement complex routing logic.
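As a rough sketch of what that configuration layer looks like, the snippet below lays out a minimal model repository for a vLLM-backed model. Field names follow the conventions of Triton's vLLM backend, but verify them against the backend documentation for your Triton version.

```python
# Sketch: create a minimal Triton model repository for a vLLM-backed model.
# Layout and field names follow Triton's vLLM backend conventions, but check
# them against your Triton version before relying on this.
from pathlib import Path

version_dir = Path("model_repository/llama2-70b/1")
version_dir.mkdir(parents=True, exist_ok=True)

# config.pbtxt declares which backend serves the model.
(version_dir.parent / "config.pbtxt").write_text(
    'backend: "vllm"\n'
    'instance_group [ { count: 1, kind: KIND_MODEL } ]\n'
)

# model.json carries engine arguments passed through to vLLM
# (model ID and settings are placeholders).
(version_dir / "model.json").write_text(
    '{"model": "meta-llama/Llama-2-70b-hf", "tensor_parallel_size": 1}'
)
```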
Decision Matrix: When to Use Each
| Scenario | Recommended Runtime | Why |
|---|---|---|
| Prototyping, fast iteration | vLLM | Minutes to production, model updates are straightforward |
| Single model, production, max throughput | TensorRT-LLM | 30% throughput gain justifies compilation overhead |
| Multiple models, dynamic routing | Triton | Unified API and request routing eliminate custom code |
| Bursty traffic, unpredictable batch sizes | vLLM | PagedAttention handles variable batching without waste |
| Cost-sensitive, consistent load | TensorRT-LLM | Lower per-token cost offsets engineering time |
| Hybrid (fast iteration + prod performance) | vLLM for dev, TensorRT-LLM for prod | Deploy models to production after vLLM validation |
Most teams find that the decision hinges on two factors: how many models you're serving and how often you update them. If you're running a single model that changes quarterly, TensorRT-LLM's upfront investment pays for itself quickly. If you're running ten models that get refreshed monthly, vLLM's low operational friction becomes the priority.
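That two-factor heuristic can be written down explicitly. The thresholds below are illustrative examples, not recommendations:

```python
# Illustrative heuristic encoding the two factors above: how many models you
# serve and how often they change. Thresholds are arbitrary examples.
def pick_runtime(num_models: int, updates_per_year: int) -> str:
    if num_models > 1:
        return "Triton (vLLM backend now, compile hot models to TensorRT-LLM later)"
    if updates_per_year <= 4:
        return "TensorRT-LLM"  # stable single model: compilation cost amortizes
    return "vLLM"              # frequently refreshed model: favor low friction

print(pick_runtime(num_models=1, updates_per_year=4))    # TensorRT-LLM
print(pick_runtime(num_models=10, updates_per_year=12))  # Triton (...)
```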
Stacking Runtimes for Maximum Flexibility
The most powerful approach combines all three. Develop and test in vLLM, compile high-traffic models to TensorRT-LLM, and orchestrate both backends through Triton. This gives you fast iteration on new models and optimized performance on proven ones.
The architecture looks like this: a Triton inference server exposes a unified /v1/chat/completions endpoint. Behind it, TensorRT-LLM handles your flagship Llama 3 model that serves 80% of requests. vLLM handles newer or less-trafficked models still in testing phases. A router in Triton directs requests based on model ID. When a new model graduates from testing to production, you compile it to TensorRT-LLM and migrate traffic gradually.
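From the client's point of view, the routing is invisible: one endpoint serves every model, and the model field selects which backend handles the request. A minimal sketch, assuming the unified OpenAI-style endpoint described above; the URL and model IDs are placeholders.

```python
# Client-side view of the stacked setup: one endpoint, model ID picks the
# backend. Endpoint URL and model names are placeholders.
import requests

ENDPOINT = "http://inference.internal:8000/v1/chat/completions"

def chat(model_id: str, prompt: str) -> str:
    resp = requests.post(
        ENDPOINT,
        json={"model": model_id, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Flagship model, compiled to TensorRT-LLM, behind Triton.
print(chat("llama-3-70b-trtllm", "Summarize the deployment plan."))
# Newer model still in testing, served by vLLM behind the same endpoint.
print(chat("experimental-8b-vllm", "Summarize the deployment plan."))
```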
This hybrid setup lets your team optimize where it matters most: the 80% of traffic hitting proven models. Newer models get vLLM's agility without the latency penalty for users. Many teams find that this combination reduces per-token cost by 25 to 35% compared to vLLM alone while retaining the deployment flexibility of vLLM for experimentation.
GMI Cloud ships TensorRT-LLM, vLLM, and Triton pre-installed on GPU instances with CUDA 12.x, NCCL, and cuDNN. This reduces the setup time for runtime stacking, though teams should verify that pre-compiled versions match their model requirements. The Inference Engine also offers 100+ pre-deployed models with per-request pricing, which provides an alternative path for teams that prefer managed inference over self-hosted runtime configuration. Check gmicloud.ai/pricing for current rates.
Colin Mo
