AI workloads have strained Kubernetes since the beginning. The original Device Plugin API understood only GPU counts: not topology, not sharing, not high-speed interconnects. Kubernetes 1.34 addresses this with Dynamic Resource Allocation (DRA) reaching general availability (GA).
The Problem with Counting GPUs
The traditional device plugin model breaks down when:
- Training jobs need specific GPU partitions (MIG, time-slicing)
- Multiple pods must share a single physical GPU
- Distributed training requires high-speed interconnects (InfiniBand, NVLink)
- Inference needs right-sizing, not whole GPUs
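For context, the only thing the legacy device-plugin model lets a workload say is how many GPUs it wants. A minimal sketch of such a request:

```yaml
# Legacy device-plugin request: an opaque counter. There is no way to ask
# for a specific MIG partition, a shared slice, or NVLink-connected devices.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 2   # "two GPUs", nothing more specific
```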
DRA solves this with structured device information: drivers publish their devices as ResourceSlices, workloads request them through ResourceClaims, and DeviceClasses group devices by driver and selection criteria.
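What the scheduler actually consumes are the ResourceSlices a driver publishes per node. A trimmed, illustrative example; the attribute and capacity names are driver-specific assumptions, not a fixed schema:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node-a-gpu.nvidia.com
spec:
  driver: gpu.nvidia.com
  nodeName: node-a
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    attributes:
      productName:               # driver-specific attribute (illustrative)
        string: "NVIDIA A100-SXM4-40GB"
    capacity:
      memory:
        value: 40Gi
```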
Goal
Configure a Kubernetes 1.34+ cluster to use Dynamic Resource Allocation for GPU workloads, enabling proper sharing, topology awareness, and gang scheduling.
Prerequisites
- Kubernetes 1.34+ (DRA is GA)
- Container runtime with CDI support (containerd 1.7+ with CDI enabled, or 2.0+ where it is on by default)
- GPU drivers installed on the nodes
- A DRA-capable device driver (NVIDIA, AMD, or custom)
Steps
1. Confirm DRA is enabled

```shell
# DRA is GA and enabled by default on Kubernetes 1.34+; no feature gates required.
# On 1.32-1.33 (beta) it must be enabled on kube-apiserver, kube-scheduler, and kubelet:
--feature-gates=DynamicResourceAllocation=true
# and the beta API group must be served by kube-apiserver:
--runtime-config=resource.k8s.io/v1beta1=true
```
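Once the control plane is configured, you can confirm the API group is being served:

```shell
# Should list deviceclasses, resourceclaims, resourceclaimtemplates, and resourceslices
kubectl api-resources --api-group=resource.k8s.io
```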
2. Install a DRA driver (NVIDIA example)

```shell
# Chart and repository names vary by release; check the NVIDIA/k8s-dra-driver-gpu
# project for the current install instructions.
helm install nvidia-dra nvidia/k8s-dra-driver --namespace kube-system
```
3. Define a DeviceClass

In the GA API (`resource.k8s.io/v1`), the alpha-era `ResourceClass` kind has been replaced by `DeviceClass`, which selects devices with CEL expressions:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
```
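A DeviceClass can also narrow selection by driver-published attributes or capacity. A hypothetical class matching only large-memory devices; the attribute names and the exact CEL helpers available are driver- and version-specific assumptions, so treat this as a sketch:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-large-memory
spec:
  selectors:
  - cel:
      # Illustrative expression; verify the capacity keys your driver publishes.
      expression: >-
        device.driver == "gpu.nvidia.com" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```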
4. Create a ResourceClaim for your workload

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: training-gpus
spec:
  devices:
    requests:
    - name: gpus
      exactly:
        deviceClassName: gpu.nvidia.com
        allocationMode: ExactCount
        count: 4
```

The alpha-era `parameters`, `sharing`, and `constraints` fields are gone in the GA API: device selection is expressed with CEL selectors on the request or DeviceClass, and node placement (for example, pinning to `p4d.24xlarge` instances) is handled with an ordinary `nodeSelector` on the Pod.
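A standalone ResourceClaim is shared by every pod that references it. For per-pod devices (a fresh allocation per replica, e.g. under a Deployment or Job), use a ResourceClaimTemplate instead:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          allocationMode: ExactCount
          count: 1
```

Pods reference a template with `resourceClaimTemplateName` instead of `resourceClaimName` in their `resourceClaims` entry.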
5. Reference the claim in your Pod

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  resourceClaims:
  - name: gpus
    resourceClaimName: training-gpus
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    resources:
      claims:
      - name: gpus
```
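To see whether the claim was actually allocated and which devices it received:

```shell
# Allocation status, including the driver, pool, and device names chosen
kubectl describe resourceclaim training-gpus
# Consumers of the claim appear under status.reservedFor
kubectl get resourceclaim training-gpus -o yaml
```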
Common Pitfalls
| Issue | Symptom | Fix |
|---|---|---|
| Pod stuck “Pending” | ResourceSlices not populated | Verify DRA driver is running and registered |
| Training slowdown | Pods on different network spines | Use topology-aware scheduling with node labels |
| GPU OOM | Over-scheduling shared GPUs | Configure sharing and memory limits via the DRA driver's claim configuration (driver-specific) |
Advanced: Gang Scheduling with KAI Scheduler
For distributed training, gang scheduling ensures all pods in a job start together or not at all. The CNCF Sandbox KAI Scheduler provides DRA-aware gang scheduling with hierarchical queues and topology-aware placement. Note that the `PodGroup` below uses the scheduler-plugins coscheduling API; KAI Scheduler ships its own PodGroup CRD, so check its documentation for the exact `apiVersion`.

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8
  scheduleTimeoutSeconds: 300
```
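Worker pods then opt in to the gang by naming the scheduler and labeling themselves with the pod group. The label key and scheduler name below follow the scheduler-plugins coscheduling convention and are assumptions; KAI Scheduler's own deployment may use different values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  labels:
    # Coscheduling convention; KAI may use a different label or CRD reference.
    scheduling.x-k8s.io/pod-group: distributed-training
spec:
  schedulerName: kai-scheduler   # assumption: depends on your scheduler deployment
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
```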
Verify
```shell
# Check ResourceSlices (cluster-scoped, so no namespace flag)
kubectl get resourceslices
# View ResourceClaims
kubectl get resourceclaims
# Verify GPU allocation inside the pod
kubectl exec training-job -- nvidia-smi
# Check scheduler logs for DRA decisions
kubectl logs -n kube-system kube-scheduler-<node-name> | grep -i dra
```
