
For the last few years, Kubernetes has been the default control plane for almost everything except one increasingly important workload: large language model inference. Teams that adopted Kubernetes for microservices, batch jobs, and stateful systems often ran their GenAI inference on bespoke setups, vendor platforms, or hand-tuned vLLM deployments glued together with custom scripts. That gap is now closing.

At KubeCon EU 2026, the CNCF accepted llm-d into its Sandbox. Backed by Red Hat, Google, IBM, CoreWeave, and NVIDIA, llm-d is a purpose-built, Kubernetes-native framework for distributed LLM inference. It is not another model server competing with vLLM. It is the orchestration and serving layer that turns a fleet of accelerators into a coherent, scalable inference platform on Kubernetes.

This matters because the economics and operational reality of LLM inference differ from those of traditional web services. A single large model can exceed the memory of one GPU. Request patterns are bursty. Generating the first token is compute-bound, while generating each subsequent token is bound largely by memory bandwidth. Treating inference like a stateless HTTP service leaves performance and cost savings on the table. llm-d is the CNCF community’s answer to that problem.

The Inference Gap: Why Kubernetes Needed a Native Answer

According to the CNCF’s 2026 survey, around 66% of organizations now run GenAI inference on Kubernetes. That adoption happened faster than the tooling matured. Most teams stitched together their own stack: a model server such as vLLM, a custom Deployment or StatefulSet, an Ingress or Gateway in front, a homegrown autoscaler keyed off queue depth, and a lot of YAML.

This works until it doesn’t. The problems show up at scale:

Model size vs. accelerator memory: Large models must be sharded across multiple GPUs or nodes, which standard Deployments do not coordinate well.
Prefill vs. decode imbalance: The compute-heavy prefill phase and the memory-bandwidth-bound decode phase compete for the same resources when colocated.
Scheduling fragility: Multi-GPU inference pods need all their resources at once. Partial scheduling wastes expensive accelerators.
Routing blindness: Standard HTTP load balancing does not understand KV-cache locality, model affinity, or queue depth.
Cost opacity: Without per-accelerator metrics, FinOps for inference is guesswork.
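
To make the first problem concrete, here is a minimal Python sketch of the arithmetic behind "model size vs. accelerator memory". The function name, the 90% usable-memory fraction, and the example model sizes are assumptions chosen for illustration; they do not come from llm-d. The result is only a lower bound, because KV cache and activations need memory as well.

```python
import math


def min_gpus_for_weights(params_billion: float, bytes_per_param: float,
                         gpu_mem_gib: float, usable_fraction: float = 0.9) -> int:
    """Lower bound on the GPUs needed just to hold the model weights.

    usable_fraction is an assumed share of device memory that is available
    for weights; KV cache and activations are not counted here.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    usable_bytes = gpu_mem_gib * 2**30 * usable_fraction
    return math.ceil(weight_bytes / usable_bytes)


# Hypothetical examples: 16-bit weights (2 bytes per parameter) on 80 GiB GPUs.
for size in (8, 70, 405):
    print(f"{size}B parameters -> at least {min_gpus_for_weights(size, 2, 80)} GPU(s)")
```

Anything above one GPU means the model has to be sharded, which is where coordinated scheduling starts to matter.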

llm-d addresses these as first-class concerns rather than afterthoughts. It aligns with the emerging Kubernetes AI Requirements (KARs) and positions itself as the reference inference stack for the cloud-native ecosystem.

Prefill/Decode Disaggregation: The Core Idea

The single most important architectural concept in llm-d is the disaggregation of the prefill and decode phases of inference.

When an LLM processes a request, it first reads the entire prompt and builds a key-value (KV) cache. This is the prefill phase, and it is compute-bound: it benefits from raw GPU throughput. Then the model generates tokens one at a time, each step depending on the KV cache. This is the decode phase, and it is largely memory-bandwidth-bound and latency-sensitive.
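
A rough way to see why the two phases behave so differently is to compare arithmetic intensity (FLOPs per byte read from memory) with the accelerator's ratio of peak compute to memory bandwidth. The sketch below assumes about 2 FLOPs per parameter per token and that the weights are read once per forward pass; it ignores attention and KV-cache traffic. The accelerator figures are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of the weights
    per pass, so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def likely_bottleneck(tokens_per_pass: int, bytes_per_param: float,
                      peak_flops: float, mem_bandwidth: float) -> str:
    """Compare the intensity of a pass with the hardware ridge point."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte moved
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge_point else "memory bandwidth"


# Hypothetical accelerator: 1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS, MEM_BW = 1e15, 3e12

# Prefill handles the whole prompt in one pass; decode handles one token per sequence.
print("prefill, 2048-token prompt:", likely_bottleneck(2048, 2, PEAK_FLOPS, MEM_BW))
print("decode, 1 sequence:", likely_bottleneck(1, 2, PEAK_FLOPS, MEM_BW))
print("decode, 64 sequences batched:", likely_bottleneck(64, 2, PEAK_FLOPS, MEM_BW))
```

Under these assumptions even a batch of 64 decoding sequences stays below the ridge point, which is why decode pools are tuned for memory bandwidth and concurrency rather than raw FLOPs.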

When both phases run on the same GPU, they interfere. A long prompt being prefilled can stall token generation for other requests, hurting time-to-first-token and inter-token latency simultaneously. Disaggregation separates these phases onto different pools of accelerators that can be scaled and tuned independently.

Phase	Characteristic	Bottleneck	Scaling Strategy
Prefill	Processes the full prompt, builds KV cache	Compute (FLOPs)	Scale for throughput on high-FLOP accelerators
Decode	Generates tokens iteratively	Memory bandwidth, latency	Scale for concurrency and low latency

By splitting these across nodes, llm-d lets platform teams right-size each pool. You can throw high-throughput accelerators at prefill and optimize a separate pool for low-latency decode, transferring the KV cache between them. The result is better GPU utilization and more predictable latency under load.
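
Disaggregation is not free: the KV cache built during prefill has to reach the decode pool. The sketch below estimates how much data that is using the standard per-token KV-cache size for a transformer with grouped key/value heads. The model shape and the link speed are hypothetical values chosen for illustration, not figures for any specific model or for llm-d.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence.

    The factor 2 accounts for storing both keys and values in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_seconds(num_bytes: int, link_bytes_per_second: float) -> float:
    """Idealized transfer time; ignores protocol overhead and overlap with compute."""
    return num_bytes / link_bytes_per_second


# Hypothetical model shape: 80 layers, 8 KV heads of dimension 128, 16-bit cache.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=4096)
print(f"KV cache for a 4096-token prompt: {size / 2**30:.2f} GiB")

# Hypothetical 50 GB/s link between the prefill and decode pools.
print(f"Idealized transfer time: {transfer_seconds(size, 50e9) * 1000:.1f} ms")
```

Whether this transfer cost is acceptable depends on prompt length and interconnect, which is one reason to benchmark disaggregation with real traffic.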

How llm-d Fits the Kubernetes Stack

llm-d is not a monolith. It is designed to compose with the newest Kubernetes primitives for AI workloads. This is what makes it cloud-native rather than just another inference server wrapped in a container.

Dynamic Resource Allocation (DRA)

Kubernetes v1.36 matured Dynamic Resource Allocation, the successor to the aging device plugin framework for GPUs and accelerators. NVIDIA and Google have donated DRA drivers to the CNCF, and these act as the accelerator abstraction layer. llm-d uses DRA to request GPUs declaratively, with structured parameters that the scheduler and autoscaler can interpret. This enables topology-aware allocation, partitionable devices, and cleaner multi-accelerator scheduling.

Gang Scheduling

Distributed inference needs all its pods at once. A model sharded across four GPUs is useless with three. llm-d relies on gang scheduling, available in the Kubernetes v1.36 workload-aware scheduling features, to ensure that a distributed inference deployment either gets all its resources or waits, rather than partially allocating and stranding expensive accelerators.
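
The all-or-nothing behavior can be illustrated with a toy placement function. This is a simplified sketch, not the Kubernetes scheduler's algorithm: the node names, the greedy first-fit strategy, and the function name are assumptions made for the example.

```python
from typing import Dict, List, Optional


def gang_schedule(free_gpus: Dict[str, int],
                  pod_requests: List[int]) -> Optional[Dict[int, str]]:
    """Place every pod of a gang, or place none of them.

    Returns a mapping of pod index to node name if all pods fit, else None.
    Uses greedy first-fit on the largest requests first, so it can miss
    placements that an exhaustive search would find.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt changes nothing
    placement: Dict[int, str] = {}
    by_size = sorted(range(len(pod_requests)), key=lambda i: -pod_requests[i])
    for idx in by_size:
        need = pod_requests[idx]
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod does not fit, so the whole gang waits
        remaining[node] -= need
        placement[idx] = node
    return placement


# A model sharded across four single-GPU pods.
print(gang_schedule({"node-a": 2, "node-b": 1}, [1, 1, 1, 1]))  # only 3 GPUs free
print(gang_schedule({"node-a": 2, "node-b": 2}, [1, 1, 1, 1]))  # all 4 fit
```

In the first call nothing is placed, so the three free GPUs stay available for other work instead of being held by a deployment that cannot start.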

Kueue for Job Queuing

For multi-tenant inference pools and batch-style inference, llm-d integrates with Kueue. This brings quota management, fair sharing, and queuing across teams. Platform teams can define ClusterQueues that cap GPU budgets per team while keeping the pool efficiently shared.
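
The quota idea can be sketched in a few lines of Python. This is a conceptual model only: it does not use Kueue's API, the class and method names are hypothetical, and it leaves out borrowing and fair sharing between queues.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class TeamGpuQueue:
    """Toy per-team queue with a hard GPU quota and FIFO admission."""
    gpu_quota: int
    gpus_in_use: int = 0
    pending: Deque[Tuple[str, int]] = field(default_factory=deque)

    def submit(self, name: str, gpus: int) -> bool:
        """Admit the workload if it fits the quota, otherwise queue it."""
        # Strict FIFO: nothing jumps ahead of workloads that are already waiting.
        if not self.pending and self.gpus_in_use + gpus <= self.gpu_quota:
            self.gpus_in_use += gpus
            return True
        self.pending.append((name, gpus))
        return False

    def release(self, gpus: int) -> List[str]:
        """Free GPUs and admit waiting workloads in order while they fit."""
        self.gpus_in_use -= gpus
        admitted = []
        while self.pending and self.gpus_in_use + self.pending[0][1] <= self.gpu_quota:
            name, need = self.pending.popleft()
            self.gpus_in_use += need
            admitted.append(name)
        return admitted


queue = TeamGpuQueue(gpu_quota=8)
print(queue.submit("chat-serving", 4))   # fits within the quota
print(queue.submit("batch-scoring", 4))  # fits, quota now fully used
print(queue.submit("eval-run", 2))       # exceeds the quota, so it waits
print(queue.release(4))                  # freed capacity admits the waiting workload
```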

Gateway API Inference Extension

Above llm-d sits the Gateway API Inference Extension (GIE), which provides intelligent routing for LLM traffic. Unlike standard HTTP routing, GIE understands inference-specific signals: model affinity, KV-cache locality, queue depth, and load. It routes requests to the right pool and the right replica, which is essential once prefill and decode are disaggregated.
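
To show what "understanding inference-specific signals" can mean in practice, here is a toy scoring function that trades KV-cache locality against queue depth. It is a sketch under assumptions: the weights, the Replica fields, and the scoring formula are hypothetical and are not the extension's actual scorer.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Replica:
    name: str
    queue_depth: int
    cached_prompts: List[Sequence[int]]  # token sequences whose KV cache is resident


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt: Sequence[int], replicas: List[Replica],
                 cache_weight: float = 1.0, queue_weight: float = 0.1) -> Replica:
    """Prefer replicas that already hold a matching prefix, penalize long queues."""
    def score(replica: Replica) -> float:
        best = max((common_prefix_len(prompt, c) for c in replica.cached_prompts),
                   default=0)
        hit_ratio = best / len(prompt) if prompt else 0.0
        return cache_weight * hit_ratio - queue_weight * replica.queue_depth
    return max(replicas, key=score)


prompt = [1, 2, 3, 4]
warm = Replica("decode-0", queue_depth=3, cached_prompts=[[1, 2, 3, 9]])
cold = Replica("decode-1", queue_depth=0, cached_prompts=[])
print(pick_replica(prompt, [warm, cold]).name)  # cache hit outweighs the short queue

warm.queue_depth = 9
print(pick_replica(prompt, [warm, cold]).name)  # queue is now too long
```

A plain round-robin balancer would ignore both signals and recompute prefixes that another replica already holds.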

llm-d vs. vLLM, KServe, and Ray Serve

A common point of confusion is where llm-d sits relative to existing tools. It does not replace all of them; it orchestrates and complements them.

Tool	Primary Role	When to Use
vLLM	High-performance model server / inference engine	As the underlying engine, often used by llm-d itself
KServe	General model serving framework on Kubernetes	Mixed model types, classic ML plus LLMs, standardized CRDs
Ray Serve	Python-native distributed serving	Teams already invested in the Ray ecosystem
llm-d	Kubernetes-native distributed LLM inference orchestration	Large-scale LLM inference with prefill/decode disaggregation and multi-node scaling

The practical mental model: vLLM is the engine, llm-d is the distributed serving and orchestration layer that runs engines like vLLM across a fleet, and the Gateway API Inference Extension is the smart front door. KServe and Ray Serve remain valid choices, especially for mixed workloads, but llm-d is purpose-built for the specific challenges of large-scale LLM inference on Kubernetes.

Observability and Cost: Making Inference Accountable

One of the most underrated aspects of running inference at scale is knowing what it costs and how it performs per accelerator. llm-d treats observability as a requirement, aligning with AI Conformance expectations.

Key signals platform teams should capture:

Time-to-first-token (TTFT): Dominated by prefill and queueing; the primary latency SLO for interactive use.
Inter-token latency: Reflects decode performance and concurrency pressure.
Per-accelerator utilization: GPU compute and memory usage, exported via Prometheus.
Queue depth and batch size: Drive autoscaling decisions far better than CPU metrics.
Throughput (tokens/sec): The real unit of inference work.
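
The three latency and throughput signals above can be derived from nothing more than a request timestamp and per-token emission timestamps. The following sketch assumes timestamps in seconds; the function name and dictionary keys are made up for the example and do not correspond to llm-d or vLLM metric names.

```python
from typing import Dict, List


def latency_metrics(request_time: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, mean inter-token latency, and tokens/sec for one request."""
    if not token_times:
        raise ValueError("at least one generated token is required")
    ttft = token_times[0] - request_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_time
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean_itl,
        "tokens_per_s": len(token_times) / total,
    }


# Request arrives at t=10.0 s; four tokens are emitted afterwards.
metrics = latency_metrics(10.0, [10.5, 10.55, 10.6, 10.65])
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```

In a real deployment these values would be recorded as histograms rather than per-request averages, so that tail latency stays visible.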

With OpenTelemetry and Prometheus wired in, these metrics feed both autoscaling and FinOps. The cost question llm-d helps answer is concrete: is it cheaper to run distributed inference across several smaller accelerators with disaggregation, or to consolidate on fewer large GPU nodes? The answer depends on model size, request mix, and latency targets, but for the first time the data to decide is available natively.
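
Once throughput per pool is measured, the cost question reduces to simple arithmetic. The sketch below compares two setups by cost per million generated tokens. All prices and throughput figures are invented placeholders; real numbers have to come from your own benchmarks and billing data.

```python
def cost_per_million_tokens(gpu_hourly_cost: float, gpu_count: int,
                            tokens_per_second: float) -> float:
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_cost * gpu_count / tokens_per_hour * 1_000_000


# Placeholder scenario A: 4 large GPUs, colocated prefill and decode.
consolidated = cost_per_million_tokens(gpu_hourly_cost=2.0, gpu_count=4,
                                       tokens_per_second=2000)

# Placeholder scenario B: 8 smaller GPUs with disaggregated pools.
disaggregated = cost_per_million_tokens(gpu_hourly_cost=1.0, gpu_count=8,
                                        tokens_per_second=2500)

print(f"consolidated:  {consolidated:.2f} per million tokens")
print(f"disaggregated: {disaggregated:.2f} per million tokens")
```

The comparison only holds if both setups meet the same latency targets, so it should be read together with TTFT and inter-token latency.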

Security and Multi-Tenancy

Inference pools are expensive shared resources, which makes isolation important. llm-d supports multi-tenant inference pools with workload isolation and RBAC. Combined with Kueue quotas and namespace-scoped policies, platform teams can let multiple AI teams share a GPU fleet without one team starving another or accessing another team’s models. This is the same multi-tenancy discipline platform teams already apply to compute, extended to accelerators.

A Production Readiness Checklist

llm-d is in the CNCF Sandbox, which means it is early. Sandbox status signals direction and community backing, not production maturity. Platform teams evaluating it should be deliberate.

Validate your Kubernetes version. You need v1.36-era features: mature DRA, gang scheduling, and workload-aware scheduling. Confirm your managed provider (EKS/GKE/AKS) exposes the required feature gates and DRA drivers.
Confirm accelerator drivers. Ensure the NVIDIA or Google DRA drivers are installed and supported on your node pools.
Start with a single model and pool. Prove the basic serving path before introducing prefill/decode disaggregation across nodes.
Wire observability first. Export TTFT, inter-token latency, throughput, and per-GPU metrics to Prometheus before scaling up.
Introduce the Gateway API Inference Extension early. Smart routing is what makes disaggregation pay off.
Layer in Kueue for multi-tenancy. Define GPU quotas per team before opening the pool to multiple consumers.
Run a cost comparison. Benchmark distributed inference against your current setup with real traffic, not synthetic load.
Plan for sandbox churn. APIs may change. Pin versions, track releases, and avoid hard-coupling your platform to unstable interfaces.

Why This Is a Platform Engineering Story

It is tempting to file llm-d under "AI infrastructure" and leave it to ML teams. That would be a mistake. The whole point of llm-d is to make distributed LLM inference a self-service primitive of the internal developer platform (IDP). Instead of every AI team building bespoke serving stacks, the platform team offers inference as a paved road: declare a model, a pool, and an SLO, and the platform handles GPU allocation, scheduling, routing, scaling, and observability.

This is the same shift platform engineering brought to application deployment, now applied to inference. The golden path for an AI team becomes a governed, observable, cost-aware inference service rather than a pile of custom YAML and tribal knowledge.

The Bottom Line

llm-d represents the cloud-native ecosystem catching up to where AI workloads actually are. By disaggregating prefill and decode, integrating with DRA, gang scheduling, Kueue, and the Gateway API Inference Extension, and treating observability and multi-tenancy as requirements, it offers a coherent answer to the inference gap on Kubernetes.

It is early, and Sandbox status means platform teams should pilot rather than bet the farm. But the backing from Red Hat, Google, IBM, CoreWeave, and NVIDIA, plus alignment with the Kubernetes AI Requirements, makes llm-d the most credible candidate to become the standard distributed inference layer for Kubernetes. For platform teams whose organizations are already running GenAI on Kubernetes, now is the time to start evaluating it, building the observability foundation, and planning the golden path for inference-as-a-service.

llm-d: Kubernetes-Native Distributed LLM Inference and CNCF’s Answer to the AI Inference Gap