“Kubernetes is the operating system for AI” can sound like marketing—until you look at what platform teams are optimizing for today. The hard problems aren’t just “how do I run a model?” but rather: how do I place tightly-coupled jobs efficiently, how do I tune resource requests without restarts, how do I avoid config drift at scale, and how do I keep clusters predictable when workloads have very different cost and latency profiles.
In that context, the CNCF community’s framing of Kubernetes v1.35 as an AI-infrastructure signal is worth taking seriously—not because Kubernetes is suddenly an ML framework, but because the control-plane and CLI surface are evolving toward the pain points that show up first in AI-heavy fleets.
1) Scheduling is shifting from “pods” to “workloads”
For years, Kubernetes scheduling has been excellent for independent pods and service-style deployments. Distributed training and pipeline stages, by contrast, have often required custom controllers, ad-hoc admission logic, or external schedulers to avoid "partial placement" (some pods land, the rest wait) and the wasted capacity that comes with it.
Kubernetes v1.35 introduces a workload API and an early implementation of workload-aware scheduling, including gang-scheduling semantics for “all-or-nothing” placement across a group of pods (alpha, per upstream notes). Even if you don’t flip the feature gate immediately, the direction is clear: the platform is acknowledging that some jobs should be reasoned about as a unit.
Operator takeaway: treat this as a design shift. If you run any of the following, it’s time to track the KEPs and roadmap:
- distributed training (multi-node, multi-GPU) where “half a job” burns budget without progress
- tightly-coupled data processing stages that need co-scheduling to meet SLOs
- inference fleets that mix bursty batch and steady low-latency endpoints
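To make the gang-scheduling idea concrete, here is a sketch of what an all-or-nothing workload spec could look like. The `apiVersion`, `kind`, and field names below are assumptions for illustration, not the actual v1.35 alpha schema; check the upstream KEP before relying on any of them.

```yaml
# Hypothetical sketch: the group/version, kind, and field names are
# illustrative assumptions, NOT the real v1.35 alpha API.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: dist-trainer
spec:
  podGroups:
    - name: workers
      # Gang semantics: either all 8 worker pods are placed, or none are,
      # so a half-scheduled job never holds GPUs without making progress.
      policy:
        gang:
          minCount: 8
```

Until the alpha API settles, teams typically get the same all-or-nothing behavior from external projects such as Kueue or the scheduler-plugins coscheduling plugin, which is also a reasonable baseline to compare the in-tree implementation against.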
2) In-place pod resource resize reduces restart churn
One of the most practical "AI cluster" improvements is also one of the least flashy: in-place pod resource resize graduates to stable in v1.35. For platform teams, this is about reducing operational churn (restarts, rolling disruptions, cold caches) when you're tuning CPU/memory requests and limits in response to real workload behavior.
In AI contexts, the biggest wins show up in two places:
- Inference tuning loops: when you need to dial memory/CPU to hit latency targets without rolling a whole deployment.
- Long-running batch: when you learn mid-run that a job is under-provisioned and restarting would be costly.
Operator takeaway: the feature doesn’t replace good capacity planning, but it changes the failure modes. If you support self-service ML teams, you can build “safe resize” workflows into your internal platform (with guardrails) rather than forcing redeploys.
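As a concrete sketch of what a resize-friendly spec looks like: `resizePolicy` is the container-level knob that controls whether a resize restarts the container, while the pod name and image below are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server        # placeholder name
spec:
  containers:
    - name: server
      image: registry.example.com/inference:latest   # placeholder image
      resources:
        requests: {cpu: "2", memory: "8Gi"}
        limits:   {cpu: "2", memory: "8Gi"}
      # Per-resource restart behavior for in-place resize:
      resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired       # CPU can change without a restart
        - resourceName: memory
          restartPolicy: RestartContainer  # e.g. runtimes that size heaps at startup
# With a recent kubectl, the change goes through the pod's resize subresource:
#   kubectl patch pod inference-server --subresource resize \
#     --patch '{"spec":{"containers":[{"name":"server",
#       "resources":{"requests":{"cpu":"4"},"limits":{"cpu":"4"}}}]}}'
```

A "safe resize" workflow in an internal platform can wrap exactly this patch behind policy checks (quota headroom, node capacity, SRE approval) instead of exposing raw kubectl to ML teams.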
3) The “config last mile” gets safer: KYAML default output
Large organizations don’t fail on the first manifest; they fail on the thousandth. YAML is flexible, but that flexibility produces surprises across tooling, formatting, and review. Kubernetes v1.35’s kubectl defaulting to KYAML—a stricter subset—signals a push toward safer, more consistent manifest generation and review.
Why this matters for platform engineering:
- Golden paths rely on predictable output. If your scaffolding tools and CI pipelines consume kubectl output, strictness reduces edge-case diffs.
- Policy-as-code becomes easier to enforce when the configuration surface is less ambiguous.
Operator takeaway: if you have internal templates, generators, or “kubectl output → PR” automation, validate KYAML behavior in a staging pipeline. The KEP includes toggles (for controlled rollout) and the docs describe the subset.
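For reviewers who have not seen it yet, KYAML output is flow-style YAML with braces, double-quoted strings, and trailing commas, which makes it less whitespace-sensitive and more diff-friendly than free-form YAML. The object below is a hypothetical illustration of roughly what that looks like; the exact formatting rules are defined by the KEP and docs.

```yaml
# Requesting KYAML explicitly (illustrative object):
#   kubectl get configmap app-config -o kyaml
{
  apiVersion: "v1",
  kind: "ConfigMap",
  metadata: {
    name: "app-config",
    namespace: "default",
  },
  data: {
    LOG_LEVEL: "info",
  },
}
```

Because KYAML is still valid YAML, existing parsers in CI pipelines should accept it unchanged; the thing to verify in staging is that diff-based review tooling handles the new formatting gracefully.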
4) Kubernetes for AI isn’t just features—it’s the platform model
The strongest argument for Kubernetes as an AI workload platform is that it is already the place where teams converge for scheduling, policy, and multi-tenancy. AI adds pressure (cost spikes, security boundaries, and heavy device scheduling), but it also increases the value of shared operational standards.
Practically, that means platform teams should treat “AI readiness” as a product roadmap:
- Define workload classes (training, batch, inference) and map them to scheduling and quota models.
- Standardize deployment patterns (e.g., inference as a Deployment/Service plus autoscaling; training as Jobs with clear queueing semantics).
- Build governance into the platform (cost controls, approvals, audit trails), not into tribal knowledge.
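One low-tech way to encode workload classes with APIs that are stable today is a PriorityClass per class plus a quota that caps the expensive resources for that class. Names, values, and the namespace below are illustrative.

```yaml
# A "class" for steady low-latency inference (name and value are illustrative):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-latency-critical
value: 100000
globalDefault: false
description: "Steady low-latency inference endpoints"
---
# Cap GPU consumption for the training class in its namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: training-gpu-quota
  namespace: ml-training          # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "16" # extended-resource quota on GPU requests
```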
Practical next steps for the next 30 days
- Read the v1.35 notes with an “AI ops” lens: identify which improvements reduce churn or waste for your heaviest workloads.
- Pilot workload-aware scheduling in a non-prod cluster for one distributed job class.
- Adopt in-place resize where it meaningfully reduces restarts, but gate it behind policy and SRE-owned runbooks.
- Test KYAML in CI and scaffolding workflows, especially where automated formatting affects reviews.
