llm-d: The Intelligent Inference Scheduler That Fixes What More GPUs Can’t

The GPU Utilization Problem in LLM Inference

If you’re running vLLM at scale, you’ve probably hit this wall: you add more GPUs, throughput goes up, but your P95 latency stays stubbornly high. Your slowest 5% of users still wait 10x longer than your median. More hardware doesn’t fix it.

The problem isn’t capacity. It’s routing.

When you deploy multiple vLLM replicas behind a standard load balancer, three critical bottlenecks conspire to create unpredictable, high tail latency:

1. Prefill queuing. Large prompts block smaller requests from starting. A user with a 50-token prompt who submits just after a 10,000-token request can experience 3x higher time-to-first-token (TTFT) simply due to head-of-line blocking.

2. KV cache misses. With round-robin routing, requests sharing the same system prompt land on different vLLM instances. Each one recomputes the same 6,000-token prefix from scratch—wasting the entire GPU prefill cycle even though the data was already computed elsewhere.

3. Unpredictable queue depth. Random routing ignores how busy each instance actually is. Your P50 might be 200ms while P95 reaches 2,000ms—a 10x gap caused purely by routing blindness.

Adding a 5th, 10th, or 50th GPU replica doesn’t touch any of these. Capacity goes up, but your unluckiest users still get cache misses, still hit overloaded pods, and still wait.
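To make the cache-miss effect concrete, here is a toy simulation, not a model of vLLM or llm-d. It assumes each pod holds an LRU cache with room for three tenant prefixes, that nine tenants take turns sending requests, and that a request is a "hit" only if its tenant's prefix is already on the chosen pod. The pod count, tenant count, cache size, and both routing functions are made up for illustration.

```python
import random
from collections import OrderedDict

def hit_rate(route, n_pods=4, n_tenants=9, cache_slots=3, rounds=50):
    """Fraction of requests that find their tenant's prefix cached on the chosen pod."""
    caches = [OrderedDict() for _ in range(n_pods)]  # one small LRU per pod
    hits = 0
    total = rounds * n_tenants
    for i in range(total):
        tenant = i % n_tenants             # tenants take turns sending requests
        cache = caches[route(tenant, n_pods)]
        if tenant in cache:
            hits += 1
            cache.move_to_end(tenant)      # refresh LRU position
        else:
            cache[tenant] = True           # "prefill" and cache the prefix
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used prefix
    return hits / total

def random_route(seed=0):
    rng = random.Random(seed)              # seeded so the run is repeatable
    return lambda tenant, n_pods: rng.randrange(n_pods)

def affinity_route(tenant, n_pods):
    return tenant % n_pods                 # same tenant always lands on the same pod

naive = hit_rate(random_route())
aware = hit_rate(affinity_route)
print(f"random: {naive:.0%}  affinity: {aware:.0%}")
```

In this toy setup, affinity routing misses only on each tenant's first request, while random routing keeps evicting prefixes it will need again. Adding pods to the random router spreads the same tenants across more caches, which is why capacity alone does not fix the miss rate.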

Introducing llm-d: Cache-Aware Routing for LLM Inference

llm-d is a Kubernetes-native inference scheduling framework that sits above vLLM and makes one deceptively simple change: instead of routing requests randomly, it routes them to the pod that already has the right data in its KV cache.

The reported results are dramatic: up to 57x faster time-to-first-token, 2x throughput, and a 40-60% reduction in P95/P99 TTFT on identical hardware.

This collaborative open-source project by IBM, Google, Red Hat, Alibaba Cloud, DaoCloud, and the broader AI infrastructure community provides four well-lit paths for production LLM serving:

  • Intelligent inference scheduling—prefix cache-aware routing across vLLM replicas
  • Prefill/decode disaggregation—separate pods for compute-bound prefill and memory-bound decode phases
  • Wide expert parallelism—distributing large MoE models across multiple nodes
  • Tiered prefix cache—extending KV cache beyond GPU vRAM to CPU memory, SSD, and shared filesystems

How Cache-Aware Routing Works

At the heart of llm-d is the Endpoint Picker (EPP)—a sidecar component that intercepts every inference request via Envoy’s external processing callback and makes intelligent pod selection decisions before forwarding.

The EPP runs a 4-step cycle for every request:

Step 1—Discover. Enumerate all pods in the InferencePool. Collect queue depth, loaded models, KV cache contents, and real-time metrics via Prometheus and KV-Events.

Step 2—Filter. Exclude pods that are overloaded, have insufficient memory, or cannot serve the requested model variant using filters like decode-filter, prefill-filter, by-label, and by-label-selector.

Step 3—Score. Run pluggable scorers in parallel: session affinity score + prefix cache hit score + load score. The final score is a weighted combination.

Step 4—Select. The max-score-picker selects the highest-scoring pod, with built-in tie-breaking and fallback logic ensuring a pod is always selected.
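The four steps can be sketched in a few lines of Python. This is a simplified illustration, not the EPP's actual code (llm-d's scheduler is a separate service with its own plugin interfaces). The `Pod` fields, the `schedule` function, the filter threshold, and the weights are all hypothetical, and the scorers run sequentially here rather than in parallel.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pod:
    name: str
    queue_depth: int
    prefix_hit: float  # fraction of the prompt already cached, 0.0 to 1.0

def schedule(pods, filters, weighted_scorers):
    """Return the best pod from a metrics snapshot (Step 1 is assumed done)."""
    if not pods:
        raise ValueError("no pods discovered")
    # Step 2: keep pods that pass every filter; fall back to all pods if none do.
    candidates = [p for p in pods if all(f(p) for f in filters)] or list(pods)
    # Step 3: weighted sum of scorer outputs.
    def total(pod):
        return sum(weight * scorer(pod) for weight, scorer in weighted_scorers)
    # Step 4: highest score wins; ties go to the alphabetically first name.
    return min(candidates, key=lambda p: (-total(p), p.name))

pods = [Pod("pod-a", 2, 0.9), Pod("pod-b", 0, 0.1), Pod("pod-c", 12, 1.0)]
filters = [lambda p: p.queue_depth < 10]                 # drop overloaded pods
scorers = [(3.0, lambda p: p.prefix_hit),                # cache affinity
           (2.0, lambda p: 1.0 / (1 + p.queue_depth))]   # emptier queue scores higher
best = schedule(pods, filters, scorers)
print(best.name)  # pod-a
```

Here pod-c has the best cache hit but is filtered out as overloaded, and pod-a beats the idle pod-b because its cache hit outweighs its short queue. The fallback to the unfiltered list mirrors the guarantee that a pod is always selected.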

The Scoring Pipeline Explained

The default well-lit path uses a weighted scoring formula:

final_score = (prefix_score × 3) + (kv_utilization_score × 2) + (queue_score × 2)

The prefix-cache-scorer (weight 3) queries the KV-cache index to find what fraction of the incoming prompt’s prefix is already cached on each pod. A pod with 90% of the prompt cached gets a raw score of 0.9, which contributes 0.9 × 3 = 2.7 to the final score. This is the dominant signal by design.
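One way to picture the "fraction of the prefix already cached" lookup is with chained block hashes, the general idea behind prefix caching in engines like vLLM. The sketch below is an assumption-laden illustration: the 16-token block size, the SHA-256 chaining, and the `block_hashes` and `prefix_score` helpers are hypothetical, and llm-d's real index differs in its details.

```python
import hashlib

BLOCK = 16  # tokens per KV block (assumed for illustration)

def block_hashes(tokens, block=BLOCK):
    """Chain-hash full blocks so each hash identifies the whole prefix up to that block."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes

def prefix_score(prompt_tokens, cached_hashes, block=BLOCK):
    """Fraction of the prompt's full blocks that form a contiguous cached prefix."""
    hashes = block_hashes(prompt_tokens, block)
    if not hashes:
        return 0.0
    matched = 0
    for h in hashes:
        if h not in cached_hashes:
            break  # a prefix match must be contiguous from the start
        matched += 1
    return matched / len(hashes)

system_prompt = list(range(144))               # 9 blocks of shared context
pod_cache = set(block_hashes(system_prompt))   # what one pod already holds
request = system_prompt + list(range(1000, 1016))  # plus 1 block of new user input
print(prefix_score(request, pod_cache))  # 0.9
```

Because each hash folds in the previous one, changing any early token changes every later hash, so a pod only gets credit for blocks that match from the very start of the prompt.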

The kv-cache-utilization-scorer (weight 2) reads how full each pod’s KV cache is, that is, how much of the GPU VRAM set aside for cached blocks is in use. A pod near full capacity scores lower, which steers requests away from pods that would have to evict existing blocks to serve them.

The queue-scorer (weight 2) counts pending requests per pod, preventing hotspots that pure cache affinity would create.

A pod with a perfect cache hit but full queue can still lose to a pod with partial cache hit and empty queue—which is exactly the right behavior.
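A quick worked example shows how that trade-off falls out of the formula. It assumes each scorer returns a value between 0 and 1 where higher is better (so a full queue scores 0 and an empty one scores 1). That normalization is an assumption for illustration, as are the sample pod values.

```python
WEIGHTS = {"prefix": 3.0, "kv_utilization": 2.0, "queue": 2.0}

def final_score(prefix, kv_utilization, queue):
    """Weighted sum of three normalized scores in [0, 1]; higher is better."""
    return (prefix * WEIGHTS["prefix"]
            + kv_utilization * WEIGHTS["kv_utilization"]
            + queue * WEIGHTS["queue"])

# Perfect cache hit, but the queue is full (queue score 0).
hot = final_score(prefix=1.0, kv_utilization=0.5, queue=0.0)   # 3.0 + 1.0 + 0.0
# Half the prompt cached, but the queue is empty (queue score 1).
cool = final_score(prefix=0.5, kv_utilization=0.5, queue=1.0)  # 1.5 + 1.0 + 2.0

print(hot, cool)  # 4.0 4.5
```

With equal KV utilization, the pod with the partial hit wins 4.5 to 4.0. Under these weights the empty-queue pod needs more than a third of the prompt cached to beat a fully cached but saturated pod.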

Performance Comparisons: The Numbers Tell the Story

Here’s what changes when you replace round-robin with cache-aware routing on the same hardware:

  • Cache Hit Rate: 20-30% (naive) → 60-80% (llm-d)
  • P95/P99 TTFT Improvement: 40-60% reduction
  • Time-to-First-Token: Up to 57x faster
  • Throughput: 2x improvement
  • Multi-turn Conversations: Reuse cached context instead of recomputing every turn

In validated benchmarks on 8 vLLM pods (16 H100 GPUs total) using a realistic B2B workload simulating 150 enterprise customers with 6,000-token contexts, cache-aware routing demonstrated why it’s the single highest-impact optimization for production LLM inference.
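For a rough sense of what the hit-rate shift means in prefill work, here is some back-of-the-envelope arithmetic. It is not a benchmark result. It takes the midpoints of the ranges above (25% and 70%) and assumes a cache hit skips the entire 6,000-token context while a miss recomputes all of it. Both are simplifications.

```python
def recomputed_prefix_tokens(prefix_tokens, hit_pct):
    """Average context tokens re-prefilled per request under an all-or-nothing hit model."""
    return prefix_tokens * (100 - hit_pct) // 100

PREFIX = 6000  # shared context length from the benchmark description
naive = recomputed_prefix_tokens(PREFIX, 25)   # midpoint of 20-30%
aware = recomputed_prefix_tokens(PREFIX, 70)   # midpoint of 60-80%
print(naive, aware)  # 4500 1800
```

Under these assumptions the average request re-prefills 1,800 context tokens instead of 4,500, which is 2.5x less redundant prefill before any queueing effects are counted.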

When to Use llm-d vs. Naive Scaling

Use naive vLLM scaling when:

  • You’re in early development or testing phases
  • Your workload has minimal shared context between requests
  • You have simple, stateless inference requirements
  • You’re not experiencing tail latency issues yet

Use llm-d when:

  • You have enterprise workloads with shared system prompts
  • Multi-turn conversations are common in your application
  • P95/P99 latency matters as much as P50
  • You’re scaling beyond 2-3 vLLM replicas
  • Cache hit rates below 50% are costing you GPU hours
  • You need to maximize inference density on existing hardware

The key insight: adding more GPUs with naive scaling increases capacity but does NOT reduce tail latency. Cache misses still occur, causing high latency for unlucky users. llm-d’s cache-aware routing directly attacks the tail latency problem by eliminating redundant computation through intelligent cache reuse.

Implications for AI Infrastructure Teams

For platform engineers and DevOps teams managing LLM infrastructure, llm-d represents a paradigm shift in how we think about inference scaling.

Hardware Efficiency: Before adding more GPUs, optimize what you have. Cache-aware routing can deliver order-of-magnitude improvements without capex. In an era of constrained GPU supply, this isn’t just optimization—it’s strategy.

Operational Simplicity: llm-d sits above your existing model server, not inside it. Deployment is non-invasive, using standard Kubernetes Gateway API and Envoy. You don’t need to modify vLLM or retrain models.

Composability: Start with prefix-aware routing, then progressively layer on prefill/decode disaggregation, tiered KV caching, and SLO-aware autoscaling as your workload demands. Each well-lit path is independently validated.

Multi-Accelerator Support: With v0.6.0, llm-d now supports AMD ROCm, Intel Gaudi/HPU, Intel XPU, and CPUs with AMX, which makes it viable for heterogeneous infrastructure environments.

Future-Proofing: The pluggable architecture means new scheduling strategies can be added without disrupting existing deployments. The project is actively developed by major cloud and enterprise infrastructure players.

Getting Started with llm-d

llm-d v0.6.0, released April 2026, brings several important updates, including an expanded plugin ecosystem, new accelerator support, and a graduated Workload Variant Autoscaler.

The deployment model is straightforward: install the llm-d inference scheduler as a Kubernetes Gateway, configure your InferencePool with vLLM backends, and let the EPP handle intelligent routing. Each well-lit path ships with Helmfile guides, reference configurations, and validated benchmarks.

Documentation and guides are available at llm-d.ai, with the open-source repository at github.com/llm-d/llm-d.

Conclusion

llm-d addresses a fundamental blind spot in how most teams scale LLM inference today: they add hardware without fixing the routing that causes tail latency in the first place. The shift from cache-blind load balancing to cache-aware scheduling is not incremental—it’s a structural change that unlocks dramatic improvement on the same hardware.

If you’re operating vLLM at any meaningful scale, cache-aware routing should be the first optimization you evaluate—before adding hardware, before tuning model parameters, before exploring quantization. The returns are immediate, and the deployment is non-invasive.

The message is clear: when more GPUs can’t deliver the latency you need, smarter routing can.


Source: llm-d: The Inference Scheduler That Fixes What More GPUs Can’t by Yakov Beder – https://medium.com/@yakovbeder/llm-d-the-inference-scheduler-that-fixes-what-more-gpus-cant-03644ac55504