AI workloads have strained Kubernetes since the beginning. The original Device Plugin API understood only GPU counts: not topology, not sharing, not high-speed interconnects. Kubernetes 1.34 addresses this with Dynamic Resource Allocation (DRA) reaching general availability (GA).
The Problem with Counting GPUs
The traditional device plugin model breaks down when:
- Training jobs need specific GPU partitions (MIG, time-slicing)
- Multiple pods must share a single physical GPU
- Distributed training requires high-speed interconnects (InfiniBand, NVLink)
- Inference needs right-sizing, not whole GPUs
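For context, the only thing the legacy device-plugin model lets a workload say is how many GPUs it wants. A minimal sketch of such a request:

```yaml
# Legacy device-plugin request: an opaque counter. There is no way to ask
# for a specific MIG partition, a shared slice, or NVLink-connected devices.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 2   # "two GPUs", nothing more specific
```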
DRA solves this with structured device information: drivers publish their devices as ResourceSlices, workloads request them through ResourceClaims, and DeviceClasses group devices by driver and selection criteria.
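What the scheduler actually consumes are the ResourceSlices a driver publishes per node. A trimmed, illustrative example; the attribute and capacity names are driver-specific assumptions, not a fixed schema:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node-a-gpu.nvidia.com
spec:
  driver: gpu.nvidia.com
  nodeName: node-a
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    attributes:
      productName:               # driver-specific attribute (illustrative)
        string: "NVIDIA A100-SXM4-40GB"
    capacity:
      memory:
        value: 40Gi
```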
Goal
Configure a Kubernetes 1.34+ cluster to use Dynamic Resource Allocation for GPU workloads, enabling proper sharing, topology awareness, and gang scheduling.
Prerequisites
- Kubernetes 1.34+ (DRA is GA)
- Container runtime with CDI support (containerd 1.7+ with CDI enabled, or 2.0+ where it is on by default)
- GPU drivers installed on the nodes
- A DRA-capable device driver (NVIDIA, AMD, or custom)
Steps
1. Confirm DRA is enabled

```shell
# DRA is GA and enabled by default on Kubernetes 1.34+; no feature gates required.
# On 1.32-1.33 (beta) it must be enabled on kube-apiserver, kube-scheduler, and kubelet:
--feature-gates=DynamicResourceAllocation=true
# and the beta API group must be served by kube-apiserver:
--runtime-config=resource.k8s.io/v1beta1=true
```
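Once the control plane is configured, you can confirm the API group is being served:

```shell
# Should list deviceclasses, resourceclaims, resourceclaimtemplates, and resourceslices
kubectl api-resources --api-group=resource.k8s.io
```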
2. Install a DRA driver (NVIDIA example)

```shell
# Chart and repository names vary by release; check the NVIDIA/k8s-dra-driver-gpu
# project for the current install instructions.
helm install nvidia-dra nvidia/k8s-dra-driver --namespace kube-system
```
3. Define a DeviceClass

In the GA API (`resource.k8s.io/v1`), the alpha-era `ResourceClass` kind has been replaced by `DeviceClass`, which selects devices with CEL expressions:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
```
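A DeviceClass can also narrow selection by driver-published attributes or capacity. A hypothetical class matching only large-memory devices; the attribute names and the exact CEL helpers available are driver- and version-specific assumptions, so treat this as a sketch:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-large-memory
spec:
  selectors:
  - cel:
      # Illustrative expression; verify the capacity keys your driver publishes.
      expression: >-
        device.driver == "gpu.nvidia.com" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```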
4. Create a ResourceClaim for your workload

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: training-gpus
spec:
  devices:
    requests:
    - name: gpus
      exactly:
        deviceClassName: gpu.nvidia.com
        allocationMode: ExactCount
        count: 4
```

The alpha-era `parameters`, `sharing`, and `constraints` fields are gone in the GA API: device selection is expressed with CEL selectors on the request or DeviceClass, and node placement (for example, pinning to `p4d.24xlarge` instances) is handled with an ordinary `nodeSelector` on the Pod.
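A standalone ResourceClaim is shared by every pod that references it. For per-pod devices (a fresh allocation per replica, e.g. under a Deployment or Job), use a ResourceClaimTemplate instead:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          allocationMode: ExactCount
          count: 1
```

Pods reference a template with `resourceClaimTemplateName` instead of `resourceClaimName` in their `resourceClaims` entry.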
5. Reference the claim in your Pod

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  resourceClaims:
  - name: gpus
    resourceClaimName: training-gpus
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    resources:
      claims:
      - name: gpus
```
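To see whether the claim was actually allocated and which devices it received:

```shell
# Allocation status, including the driver, pool, and device names chosen
kubectl describe resourceclaim training-gpus
# Consumers of the claim appear under status.reservedFor
kubectl get resourceclaim training-gpus -o yaml
```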
Common Pitfalls
| Issue | Symptom | Fix |
|---|---|---|
| Pod stuck “Pending” | ResourceSlices not populated | Verify DRA driver is running and registered |
| Training slowdown | Pods on different network spines | Use topology-aware scheduling with node labels |
| GPU OOM | Over-scheduling shared GPUs | Configure sharing and memory limits via the DRA driver's claim configuration (driver-specific) |
Advanced: Gang Scheduling with KAI Scheduler
For distributed training, gang scheduling ensures all pods in a job start together or not at all. The CNCF Sandbox KAI Scheduler provides DRA-aware gang scheduling with hierarchical queues and topology-aware placement. Note that the `PodGroup` below uses the scheduler-plugins coscheduling API; KAI Scheduler ships its own PodGroup CRD, so check its documentation for the exact `apiVersion`.

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8
  scheduleTimeoutSeconds: 300
```
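Worker pods then opt in to the gang by naming the scheduler and labeling themselves with the pod group. The label key and scheduler name below follow the scheduler-plugins coscheduling convention and are assumptions; KAI Scheduler's own deployment may use different values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  labels:
    # Coscheduling convention; KAI may use a different label or CRD reference.
    scheduling.x-k8s.io/pod-group: distributed-training
spec:
  schedulerName: kai-scheduler   # assumption: depends on your scheduler deployment
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
```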
Verify
```shell
# Check ResourceSlices (cluster-scoped, so no namespace flag)
kubectl get resourceslices
# View ResourceClaims
kubectl get resourceclaims
# Verify GPU allocation inside the pod
kubectl exec training-job -- nvidia-smi
# Check scheduler logs for DRA decisions
kubectl logs -n kube-system kube-scheduler-<node-name> | grep -i dra
```
