Speculative Decoding Cuts LLM Latency by Letting a Small Model Guess and a Large Model Check

April 13, 2026

Generating text one token at a time is the slow part of LLM inference, and for years it looked unavoidable: each token depends on the one before it, so the big model runs once per token. Speculative decoding breaks that assumption without changing the output. A small draft model proposes several tokens at once, and the large target model verifies them in a single pass, accepting the ones it would have produced anyway. The large model still decides every token, so quality is unchanged, but it spends far fewer forward passes doing it. This article explains the mechanism, the math behind the speedup, and where it pays off in production.

The Bottleneck Speculative Decoding Attacks

Standard autoregressive decoding is sequential. To generate token N, the model needs token N-1, so the large model runs a full forward pass for every single token. Because decoding at small batch sizes is memory-bandwidth bound, each pass streams the entire weight set through the chip, and the GPU's compute sits underused. The inefficiency is structural: you are paying for compute capacity you cannot use because the work arrives one token at a time.

Speculative decoding exploits that idle compute. Verifying several proposed tokens in one forward pass costs almost the same as generating one token, because the bottleneck was moving weights, not the arithmetic. If most guesses are right, you get several tokens for the price of one pass.
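A back-of-envelope estimate makes the point concrete. The sketch below is illustrative only: the parameter count, bytes per weight, memory bandwidth, and peak compute are assumed round numbers, not figures for any specific GPU or model. It also ignores the KV cache, attention cost, and kernel overheads.

```python
def pass_time_ms(params, bytes_per_weight, bandwidth_bytes_s, peak_flops, tokens_in_pass):
    """Rough forward-pass time: the slower of streaming weights and doing the math."""
    # Every pass reads all weights once, however many tokens it scores.
    memory_s = params * bytes_per_weight / bandwidth_bytes_s
    # Roughly 2 FLOPs per parameter per token (one multiply, one add).
    compute_s = 2 * params * tokens_in_pass / peak_flops
    return 1000 * max(memory_s, compute_s)


# Assumed, illustrative figures (not taken from any datasheet).
PARAMS = 70e9            # model parameters
BYTES_PER_WEIGHT = 2     # 16-bit weights
BANDWIDTH = 3e12         # bytes per second of memory bandwidth
PEAK_FLOPS = 1e15        # floating-point operations per second

one_token = pass_time_ms(PARAMS, BYTES_PER_WEIGHT, BANDWIDTH, PEAK_FLOPS, 1)
five_tokens = pass_time_ms(PARAMS, BYTES_PER_WEIGHT, BANDWIDTH, PEAK_FLOPS, 5)
print(f"1 token per pass:  {one_token:.1f} ms")
print(f"5 tokens per pass: {five_tokens:.1f} ms")
```

Under these assumptions both passes take the same time, because weight streaming dominates and the extra arithmetic for four more tokens fits inside compute that was idle anyway.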

How the Draft-and-Verify Loop Works

The method pairs two models that share a vocabulary.

  • The draft model is small and fast. It cheaply generates a short run of candidate tokens, say four or five ahead, guessing what the large model would say.
  • The target model is the large, high-quality model you actually want to serve. It takes the draft's candidates and verifies them in a single parallel forward pass.
  • Acceptance keeps every leading token that matches what the target model would have generated, and rejects from the first mismatch onward. The target model then produces the correcting token, and the loop repeats.

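The output matches what the target model would have produced alone: token for token under greedy decoding, and in distribution under sampling when the standard rejection-sampling acceptance rule is used. Speculative decoding is a latency optimization, not a quality tradeoff, which is what separates it from simply using a smaller model.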
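The loop can be sketched in a few lines for the greedy case. This is a minimal illustration, not how any particular engine implements it. `draft` and `target` are hypothetical callables that return the next token for a sequence. Verification is emulated with a Python loop, whereas a real engine scores all positions in one batched forward pass. The bonus token emitted when every proposal is accepted is a common refinement that the description above does not mention.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens it emits."""
    # 1. The draft proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position (one batched pass in a real engine).
    emitted: List[int] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if proposed[i] != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(proposed[i])

    # 3. Every proposal matched, so the same pass also yields one extra token.
    emitted.append(target(prefix + proposed))
    return emitted


def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, max_new_tokens: int) -> List[int]:
    """Repeat rounds until max_new_tokens tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new_tokens]


# Toy models: the target counts upward mod 10; the draft is wrong after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


def target_only(prefix: List[int], n: int) -> List[int]:
    out = list(prefix)
    for _ in range(n):
        out.append(toy_target(out))
    return out


print(generate([1], toy_draft, toy_target, k=4, max_new_tokens=12))
print(target_only([1], 12))
```

The two printed sequences are the same, which is the property the method depends on: the draft only changes how many target passes are needed, not which tokens come out.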

The Speedup Math, Worked Out

The gain depends on the acceptance rate: how often the draft's guesses survive verification. Suppose the draft proposes four tokens per round and the target accepts three on average. Counting the target's own correcting token, roughly four tokens emerge per target forward pass instead of one, approaching a 4x reduction in target passes, minus the small cost of running the draft. The closer the draft model's distribution is to the target's, the higher the acceptance rate and the larger the speedup. A draft that guesses poorly wastes its proposals and can even slow things down, which is why model pairing matters.
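A stricter way to model this treats the acceptance rate as a per-token probability. The sketch below assumes each drafted token is accepted independently with probability `alpha`, which is a simplification, and that a round stops at the first rejection. The function name and parameters are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass: accepted prefix plus one target token."""
    # P(at least i drafts accepted) = alpha**i, so the expectation is
    # 1 + alpha + alpha**2 + ... + alpha**k.
    return sum(alpha ** i for i in range(k + 1))


for alpha in (0.5, 0.75, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

Under this model a per-token rate of 0.75 with four drafted tokens gives about 3.05 tokens per pass, not 4, because one early rejection discards every proposal after it. An average of three accepted tokens out of four therefore implies a per-token rate well above 75%.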

A fuller worked example shows where the speedup leaks. Say the target model alone generates at 40 tokens per second, meaning a 25 millisecond forward pass per token. With a draft that proposes four tokens and an acceptance rate of 75%, each target pass yields three accepted tokens plus one correction, so four useful tokens per 25 millisecond verification, which would lift throughput to 160 tokens per second if drafting were free. But the draft itself takes time to produce those four candidates. If the draft adds 8 milliseconds per round, the real cost per round is 33 milliseconds for four tokens, or about 120 tokens per second: still a clear win at roughly 3x the baseline. Drop the acceptance rate to 40% and you accept fewer than two tokens per round while still paying the draft overhead. That is about 2.6 tokens per 33 millisecond round, or roughly 79 tokens per second, so the advantage shrinks fast. The lesson is that acceptance rate and draft speed both have to clear a bar, and only measurement on real traffic confirms they do.
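The arithmetic in this example can be reproduced with a few lines. The function name and parameters are illustrative. The sketch assumes a verification pass costs the same as a single-token target pass, and that each round emits the accepted tokens plus one token from the target.

```python
def effective_tokens_per_second(target_ms: float, draft_ms_per_round: float,
                                tokens_per_round: float) -> float:
    """Throughput when each round costs one target pass plus the draft overhead."""
    return 1000 * tokens_per_round / (target_ms + draft_ms_per_round)


baseline = effective_tokens_per_second(25, 0, 1)             # target alone
free_draft = effective_tokens_per_second(25, 0, 4)           # 3 accepted + 1, no draft cost
with_draft = effective_tokens_per_second(25, 8, 4)           # same, plus 8 ms of drafting
low_accept = effective_tokens_per_second(25, 8, 0.4 * 4 + 1) # 40% of 4 accepted, + 1

for name, value in [("baseline", baseline), ("free draft", free_draft),
                    ("with draft", with_draft), ("40% acceptance", low_accept)]:
    print(f"{name}: {value:.1f} tokens/s")
```

Changing the draft overhead or the tokens per round shows quickly how sensitive the result is to each input.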

Pairing Models for a Good Acceptance Rate

The practical engineering problem is choosing a draft model that is cheap to run yet predicts the target well. A common pairing uses a small model from the same family or a distilled version as the draft and the flagship as the target.

| Role          | Property that matters           | Example fit                   | Effect on speedup                   |
|---------------|---------------------------------|-------------------------------|-------------------------------------|
| Draft model   | Low latency, cheap per token    | GPT-5.4-nano at $0.20/M input | More guesses per second             |
| Target model  | High quality, shared vocabulary | DeepSeek-V4-Pro               | Higher acceptance, identical output |
| Mismatch risk | Draft too different from target | Unrelated tokenizer           | Low acceptance, little gain         |

The last column is the one to test empirically, because the acceptance rate behind it depends on how similar the two models' distributions are on your traffic.

Where to Run a Speculative Decoding Setup

Running two models in a tight verify loop needs both on fast, low-overhead hardware, because the draft latency and the target bandwidth both feed the final speed. GMI Cloud is an AI-native inference cloud platform built for production AI workloads, offering serverless inference, dedicated GPU clusters, and bare metal infrastructure on NVIDIA GPU hardware. Its model library hosts both ends of a pairing, including GPT-5.4-nano at $0.20/M input as a cheap draft candidate and DeepSeek-V4-Pro as a high-quality target.

GMI Cloud's bare metal instances ship with TensorRT-LLM and vLLM preconfigured. Both inference engines implement speculative decoding, so teams do not build the verify loop from scratch. The bare metal GPUs run with no hypervisor, delivering 100% of the advertised memory bandwidth that the target model's verification pass depends on. You can review the model library at console.gmicloud.ai and integration guidance at docs.gmicloud.ai.

One Distinction Worth Drawing

Speculative decoding is not the same as using a smaller model to save money. A smaller model alone changes the output and lowers quality. Speculative decoding uses the small model only to propose, while the large model verifies and owns every accepted token, so the result matches the large model exactly. The first trades quality for cost; the second trades a little extra compute on the draft for lower latency at unchanged quality. Confusing them leads teams to ship a cheaper model when they meant to ship a faster pipeline.

When Speculative Decoding Pays Off

The technique helps in some settings more than others.

  • Best for latency-sensitive serving of large models: where cutting target passes directly lowers per-token latency.
  • Best for predictable, structured text: where draft guesses are easy and acceptance rates run high.
  • Not ideal for very small target models: where a single forward pass is already cheap and the draft overhead is not worth it.
  • Not ideal for poorly matched model pairs: where low acceptance erases the speedup.
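A rough break-even check captures the last two cases. The sketch below is an illustration under a simplifying assumption: a verification pass costs the same as one ordinary target pass. The function name and the timings are hypothetical.

```python
def break_even_tokens_per_round(target_ms: float, draft_ms_per_round: float) -> float:
    """Minimum tokens a round must emit for speculation to match the target alone."""
    # Speculation wins when tokens / (target_ms + draft_ms) > 1 / target_ms.
    return 1 + draft_ms_per_round / target_ms


# Large target: a 25 ms pass with 8 ms of drafting per round.
print(break_even_tokens_per_round(25, 8))
# Small target: a 5 ms pass with the same 8 ms of drafting per round.
print(break_even_tokens_per_round(5, 8))
```

With a slow target pass, the bar is low and even modest acceptance clears it. With a fast target pass, the same draft overhead raises the bar sharply, which is why small targets and poorly matched pairs tend not to benefit.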

Test the Acceptance Rate Before You Bank the Speedup

The headline speedups assume the draft model guesses well on your actual traffic, and that number varies by workload. Before committing, measure the acceptance rate on representative prompts, confirm the combined draft-plus-verify latency beats the target alone, and only then size your infrastructure around the gain. Speculative decoding is real and lossless, but its payoff is earned at the pairing level, not assumed from the paper.
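One way to run that check is to log each draft-and-verify round on representative prompts and summarize the results. The sketch below assumes a hypothetical log format (`Round`) and made-up sample values. It is not the output of any particular inference engine, and it assumes each round emits the accepted tokens plus one token from the target.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Round:
    proposed: int      # tokens the draft proposed
    accepted: int      # leading proposals the target accepted
    draft_ms: float    # time spent drafting
    verify_ms: float   # time spent in the target's verification pass


def summarize(rounds: List[Round], baseline_ms_per_token: float) -> Dict[str, float]:
    """Compare measured speculative rounds against the target running alone."""
    acceptance_rate = sum(r.accepted for r in rounds) / sum(r.proposed for r in rounds)
    tokens = sum(r.accepted + 1 for r in rounds)  # accepted plus one target token
    total_ms = sum(r.draft_ms + r.verify_ms for r in rounds)
    ms_per_token = total_ms / tokens
    return {
        "acceptance_rate": acceptance_rate,
        "tokens_per_round": tokens / len(rounds),
        "ms_per_token": ms_per_token,
        "speedup": baseline_ms_per_token / ms_per_token,
    }


# Made-up sample rounds for illustration only.
sample = [
    Round(proposed=4, accepted=4, draft_ms=8, verify_ms=25),
    Round(proposed=4, accepted=2, draft_ms=8, verify_ms=26),
    Round(proposed=4, accepted=3, draft_ms=8, verify_ms=25),
    Round(proposed=4, accepted=1, draft_ms=8, verify_ms=25),
]
report = summarize(sample, baseline_ms_per_token=25)
for key, value in report.items():
    print(f"{key}: {value:.3f}")
```

A speedup above 1 on real traffic is the signal to size infrastructure around the gain. A value at or below 1 means the pairing is not paying for its draft overhead.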

Colin Mo
