We replaced traditional DNS with a programmable, API-driven routing layer tuned for real-time AI inference, cutting costs by 40% while achieving 98.5% optimal first-attempt routing and 2-second failover under live multimodal traffic.
When you're running large-scale inference workloads, infrastructure needs shift constantly and dramatically. Standard DNS (Domain Name System) and basic load-balancing strategies work well for websites, but they crack under the demands of real-time AI systems. Large Language Models (LLMs), image/video generation pipelines, and agentic runtimes don't just want an IP address. They want the right backend, with minimal latency and high availability across volatile compute resources, because the end-user expected the result 50 ms ago, right after they clicked "Generate."
Normally, this is where backend engineers reach for a load balancer and call it a day. Unfortunately, the costs would look ridiculous at scale. Fortunately, we're scaling (if the Powers-That-Be are reading this, please scale the engineering team too).
That's why we built a custom DNS optimization backend: a routing system designed for high-throughput, low-latency inference that scales across distributed GPUs and responds in real time to backend health and load signals.
This system achieved <2s failover and significantly improved inference throughput by dynamically routing requests based on real-time health, load, and utilization signals.
Oh, and the 40% cost reduction in our Cloudflare bill (for inference routing) is nice too.
Traditional DNS approaches (round-robin, geo-DNS, or even basic weighted load-balancing) fall short for our specific purpose: serving an inference API at low latency. Here's why: DNS answers get cached for the length of their TTL, so route changes propagate slowly, and none of these schemes reacts to real-time backend health, load, or GPU utilization when handing out an address.
When routing breaks, users time out, token streaming stalls, and costs balloon while our on-call hours double. Worse, inference traffic can swing so fast that even good routes become bad ones mid-session.
Our publicly available metrics on OpenRouter alone show daily traffic tripling over the last two months, and a traditional load balancer would have been both more expensive and more unwieldy than what we cooked up. This surge in token volume, spread across diverse models and regions, means our routing system has to make millions of split-second decisions daily to keep latency stable and costs predictable.
Note the sudden jumps in July. These aren’t planned load tests but real customer demand spikes that our routing system absorbed without error rate increases that would cause the PM’s heart rate to follow the same graph.
Routing shouldn't live in a config file, as this makes it static. Wait, why is static bad? Because a static config can't react to the real-time health and load signals described above, and changing it means a redeploy, which is far too slow when a GPU cluster is melting down mid-spike.
Instead, we exposed our routing logic through an API so routing becomes an active lever instead of a static choice. We use this API to shift per-endpoint weights, drain unhealthy backends, and steer traffic between regions on the fly.
This setup gives us operational velocity. Instead of redeploying infra, we tune the system live. Kind of like adjusting dials on a running engine.
Every decision is observable, revertible, and audit-ready. Eventually, this becomes the backbone of our inference control plane: a programmable surface to route, schedule, and shape AI inference traffic at scale.
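To make that programmable surface concrete, here's a minimal sketch of what a control-plane Worker could look like: one authenticated endpoint that updates a per-endpoint weight in a KV-stored routing table. The ROUTING_KV and ADMIN_TOKEN bindings, the /weights URL scheme, and the single "routing-table" key are illustrative assumptions, not our production API.

```ts
// Sketch of a control-plane Worker: PUT /weights/<endpointId> updates that
// endpoint's weight in the KV-stored routing table that the data-plane reads.
// ROUTING_KV, ADMIN_TOKEN, the URL scheme, and the key name are placeholders.
// Assumes @cloudflare/workers-types for the KVNamespace type.

export interface Env {
  ROUTING_KV: KVNamespace;
  ADMIN_TOKEN: string;
}

interface Endpoint {
  id: string;
  url: string;
  weight: number;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Only authenticated operators may touch routing weights.
    if (request.headers.get("X-Admin-Token") !== env.ADMIN_TOKEN) {
      return new Response("forbidden", { status: 403 });
    }

    const match = new URL(request.url).pathname.match(/^\/weights\/([\w-]+)$/);
    if (request.method !== "PUT" || !match) {
      return new Response("not found", { status: 404 });
    }

    const body = (await request.json()) as { weight?: number };
    if (typeof body.weight !== "number" || body.weight < 0) {
      return new Response("weight must be a non-negative number", { status: 400 });
    }

    // Read-modify-write of the routing table; last write wins, which is fine
    // for a sketch, but a real control plane would serialize these updates.
    const table =
      ((await env.ROUTING_KV.get("routing-table", "json")) as Endpoint[] | null) ?? [];
    const endpoint = table.find((e) => e.id === match[1]);
    if (!endpoint) {
      return new Response("unknown endpoint", { status: 404 });
    }
    endpoint.weight = body.weight;

    // The write is the whole "deploy": no DNS change, no infra rollout.
    await env.ROUTING_KV.put("routing-table", JSON.stringify(table));
    return new Response(`weight for ${match[1]} set to ${body.weight}\n`);
  },
};
```

One trade-off worth noting: Workers KV is eventually consistent, so a change like this can take up to around a minute to reach every edge location. That delay is the price of reads cheap enough to sit in the hot path.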
TL;DR:
We built a system using Cloudflare Workers and KV storage, with routing weights assigned per endpoint. This lets us evaluate and adapt routing in real time based on backend-specific telemetry.
KEY WINS (what the bean-counters cared about): a 40% lower Cloudflare bill for inference routing, 98.5% optimal first-attempt routing, and sub-2-second failover under live multimodal traffic.
Instead of simple round-robin, we dynamically assign and update weights based on real-time signals and current data. The goal is a smarter form of geo+weighted routing with inference-specific logic layered on top.
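As a rough illustration of that pattern, the data-plane sketch below loads the routing table from KV and proxies each request to a weighted-random backend. The "routing-table" key, the Endpoint shape, the cacheTtl value, and the example backend host are assumptions for the sketch, not our actual schema.

```ts
// Sketch of the data-plane Worker: read per-endpoint weights from KV and proxy
// the inference request to a weighted-random backend. Key name, Endpoint shape,
// and cacheTtl are illustrative assumptions.

export interface Env {
  ROUTING_KV: KVNamespace;
}

interface Endpoint {
  id: string;
  url: string;    // origin of a GPU cluster, e.g. "https://gpu-eu-1.example.com"
  weight: number; // relative traffic share, updated out-of-band from telemetry
}

// Classic weighted-random selection over the configured endpoints.
function pickWeighted(endpoints: Endpoint[]): Endpoint {
  const total = endpoints.reduce((sum, e) => sum + e.weight, 0);
  let r = Math.random() * total;
  for (const e of endpoints) {
    r -= e.weight;
    if (r <= 0) return e;
  }
  return endpoints[endpoints.length - 1]; // guard against floating-point edge cases
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // cacheTtl trades a short propagation delay for much cheaper KV reads.
    const table = (await env.ROUTING_KV.get("routing-table", {
      type: "json",
      cacheTtl: 60,
    })) as Endpoint[] | null;

    const healthy = (table ?? []).filter((e) => e.weight > 0);
    if (healthy.length === 0) {
      return new Response("no healthy backends", { status: 503 });
    }

    // Rewrite the URL to the chosen backend and forward the original request.
    const target = pickWeighted(healthy);
    const url = new URL(request.url);
    const origin = new URL(target.url);
    url.protocol = origin.protocol;
    url.hostname = origin.hostname;

    return fetch(new Request(url.toString(), request));
  },
};
```

The geo part could layer on top of this by keeping one routing table per region and using request.cf properties such as country or colo to decide which table to read.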
During a multi-modal model rollout, traffic spiked and slammed one of our GPU clusters.
The system kicked into action: health and load signals flagged the saturated cluster, its weight dropped, and traffic shifted to healthier backends.
Failover time? Under 2 seconds.
It wasn't fully invisible. Some users felt the hiccup (but no one complained!). More importantly, the session stayed alive, jobs completed, and we tuned the reactivity loop to be harder, better, faster, stronger. (Shoutout to my Daft Punk fans.)
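For the curious, the failover path can be sketched roughly like this: try the preferred backend with a tight timeout, and if it errors or stalls, retry the next candidate inside the same request, so recovery never waits on a DNS TTL. The timeout, the two-attempt budget, and the error handling are assumptions for the sketch, and it relies on AbortSignal.timeout being available in the runtime.

```ts
// Sketch of in-request failover: attempt the preferred backend with a short
// timeout, then fall back to the next candidate. Timeout and attempt budget
// are illustrative, not our production tuning.

async function fetchWithFailover(
  request: Request,
  candidates: string[], // backend origins, ordered best-first by the router
  attemptTimeoutMs = 1000,
): Promise<Response> {
  let lastError: unknown = new Error("no backends configured");

  for (const origin of candidates.slice(0, 2)) { // cap attempts to bound tail latency
    const url = new URL(request.url);
    const target = new URL(origin);
    url.protocol = target.protocol;
    url.hostname = target.hostname;

    try {
      const response = await fetch(new Request(url.toString(), request.clone()), {
        signal: AbortSignal.timeout(attemptTimeoutMs), // abort slow attempts quickly
      });
      // Pass non-5xx responses straight through; only 5xx triggers a retry elsewhere.
      if (response.status < 500) return response;
      lastError = new Error(`backend ${origin} returned ${response.status}`);
    } catch (err) {
      lastError = err; // timeout or network failure: move on to the next candidate
    }
  }

  return new Response(`all backends failed: ${String(lastError)}`, { status: 502 });
}
```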
Routing logic is only as good as the signals feeding it. Sometimes, the only relevant data is current data.
We built our observability stack to ensure every decision is grounded in reality instead of assumptions.
This tight feedback loop between telemetry and routing lets us make micro-adjustments before small problems become macro-failures, which is essential for a system that lives or dies by user experience.
This is also essential for the engineering team to get sleep.
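As a sketch of what that feedback loop can look like, the cron-triggered Worker below pulls recent per-backend signals and rewrites the routing table in KV. The METRICS_URL binding, the BackendReport fields, and the scoring formula are hypothetical stand-ins for whatever telemetry pipeline actually feeds the loop.

```ts
// Sketch of the telemetry -> routing feedback loop: a cron-triggered Worker
// recomputes weights from fresh signals and writes the table the data-plane
// Worker reads. METRICS_URL, BackendReport, and the scoring are illustrative.

export interface Env {
  ROUTING_KV: KVNamespace;
  METRICS_URL: string; // hypothetical aggregation endpoint for backend telemetry
}

interface BackendReport {
  id: string;
  url: string;
  healthy: boolean;       // last health probe result
  errorRate: number;      // fraction of 5xx responses in the last window
  p95LatencyMs: number;   // recent p95 time-to-first-token
  gpuUtilization: number; // 0..1, averaged over the cluster
}

// Turn one backend's signals into a routing weight: drop failing backends,
// penalize latency and GPU saturation. Thresholds are illustrative.
function computeWeight(r: BackendReport): number {
  if (!r.healthy || r.errorRate > 0.05) return 0;
  const latencyFactor = Math.min(1, 500 / Math.max(r.p95LatencyMs, 1));
  const headroomFactor = 1 - Math.min(r.gpuUtilization, 0.95);
  return latencyFactor * headroomFactor;
}

export default {
  // Bound to a cron trigger (e.g. every minute); each run refreshes the table.
  async scheduled(_controller: ScheduledController, env: Env): Promise<void> {
    const res = await fetch(env.METRICS_URL);
    if (!res.ok) return; // keep the last known-good table on telemetry hiccups

    const reports = (await res.json()) as BackendReport[];
    const table = reports.map((r) => ({ id: r.id, url: r.url, weight: computeWeight(r) }));

    await env.ROUTING_KV.put("routing-table", JSON.stringify(table));
  },
};
```

In practice a loop like this would likely also want smoothing or hysteresis so weights don't oscillate on noisy signals; that is part of what tuning the reactivity loop means.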
Metrics are only valuable if they hold up under scrutiny, especially when things get stressful. We're glad to state: ours did! (The bean-counters were happy.)
These numbers come from production, under real customer load, across multiple inference types and regions. We're happy to report that the current system is predictably stable, recovers from unexpected spikes, and continuously optimizes itself.*
*(Statement based on known and expected traffic with above average growth. It is by no means a declaration of victory against the possible traffic expected in the near future as we add new models to the inference engine. Please unchain engineers from their desks and take several names off the on-call list, starting with —)
We’ve proven that adaptive routing works. Now we’re evolving it into a more intelligent, model-aware system; one that thinks ahead and responds even harder better faster stronger.
Ultimately, we're building toward a unified control plane: one where inference traffic is not just routed, but orchestrated with precision. The kind that would make an autocrat’s military procession blush.
Final note: if you're deploying LLMs or agent loops at scale, you can't be stuck in the mindset of managing infrastructure. You're engineering orchestration. And your routing layer is part of your product experience, even if end users never know it exists.
Because all they see is “Thinking…”
#BuildAIWithoutLimits
Take our Inference Engine for a spin; maybe you'll break our Cloudflare Workers!