Most Kubernetes upgrade discussions focus on the control plane and API deprecations. But in real clusters, the most painful failures tend to come from the parts that look “boring” until they aren’t: node images, kernel flags, CNI datapaths…and the container runtime.
Kubernetes v1.35 makes that runtime dependency explicit: the SIG Node community has flagged v1.35 as the last Kubernetes release to support the containerd v1.x series. That doesn’t mean your cluster explodes on day one. It does mean that if you keep treating containerd as “just whatever came with the OS,” the next major Kubernetes upgrade can turn into a forced, time-boxed runtime migration.
This post lays out a practical playbook: what to inventory, how to stage the runtime switch, how to reduce node churn risk, and how to validate the move with metrics and workload-level checks. It’s written for platform teams that run mixed node pools, managed Kubernetes, or opinionated distros (K3s, RKE2, etc.)—because that’s where the edge cases hide.
Why the containerd 2.0 transition matters
From the cluster’s perspective, the container runtime is “just” the implementation behind the CRI (Container Runtime Interface). From the node’s perspective, it’s the critical service that translates kubelet requests into running pods, images, snapshots, cgroups, and logs. A runtime mismatch is rarely a clean failure; it’s usually a slow-burn reliability problem: image pull weirdness, sandbox creation latency, sporadic pod start failures, or resource accounting surprises.
Containerd 2.0 is a major release line. Major lines typically mean:
- Configuration defaults may change (or options may be renamed/retired).
- Plugins and integrations may lag (snapshotters, registries, security tooling).
- Operational runbooks need a refresh: log locations, metric names, and troubleshooting commands.
If you wait until the Kubernetes upgrade itself forces your hand, you end up combining two high-risk changes: Kubernetes API/behavior changes and runtime behavior changes. The goal is to de-risk by separating them: migrate the runtime first while Kubernetes stays constant, then upgrade Kubernetes later.
Step 0: build a runtime inventory (it’s never just “containerd everywhere”)
Before you plan a migration, answer these questions with data:
- Which node pools use containerd? (You may still have legacy nodes on CRI-O in some environments, or a vendor-specific runtime bundle.)
- What containerd major version is actually running? OS packages, AMIs, and “golden images” drift over time.
- What kubelet version(s) are paired with which runtime versions? Your fleet might not be homogeneous.
- Which clusters are managed? In some managed services, you don’t directly control containerd, but you do control node image selection and upgrade cadence.
Make this inventory easy to keep current. A common pattern is a daemonset that emits node runtime metadata as labels/metrics, so you can see runtime versions per pool in a dashboard instead of a spreadsheet.
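Even before a daemonset exists, you can get a first-pass inventory from the API server, since every node reports its runtime in `status.nodeInfo.containerRuntimeVersion`. A minimal sketch; the pool label key is an assumption (GKE shown here; substitute your provider's, e.g. `eks.amazonaws.com/nodegroup`):

```python
from collections import defaultdict

# Pool label key is an assumption; adjust for your provider.
POOL_LABEL = "cloud.google.com/gke-nodepool"

def runtime_inventory(nodes_json: dict, pool_label: str = POOL_LABEL) -> dict:
    """Map node pool -> set of kubelet-reported containerRuntimeVersion strings."""
    inventory = defaultdict(set)
    for node in nodes_json.get("items", []):
        pool = node["metadata"].get("labels", {}).get(pool_label, "<unlabeled>")
        runtime = node["status"]["nodeInfo"]["containerRuntimeVersion"]
        inventory[pool].add(runtime)
    return dict(inventory)
```

Feed it the parsed output of `kubectl get nodes -o json`; a pool that maps to more than one version string is exactly the drift you're looking for.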
Step 1: decide your migration strategy (in-place vs. new node image)
In most orgs, the lowest-risk path is immutable nodes: build or adopt a new node image that ships containerd 2.0+, then rotate nodes via your autoscaler or node pool upgrade mechanism. This gives you rollback: if something goes sideways, you can halt the rollout or revert the node image version.
In-place upgrades of containerd on nodes can work, but they are harder to roll back cleanly and tend to create “snowflake” state over time (especially if some nodes miss a package update or config merge). If you must do in-place upgrades, be strict about:
- Version pinning
- Config management (templated, validated, reviewed)
- Automated conformance checks after the upgrade
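For the conformance-check step, even a tiny script beats eyeballing node output: compare each node's reported runtime string against the pinned version. A sketch under assumptions; the pinned version is illustrative, and the regex assumes the `containerd://<semver>` format kubelet reports in `nodeInfo`:

```python
import re

# Pinned version is an illustrative assumption; set it from your config management.
PINNED = "2.0.0"

def parse_runtime_version(cri_version: str) -> str:
    """Extract '2.0.0' from a kubelet-reported string like 'containerd://2.0.0'."""
    m = re.fullmatch(r"containerd://v?(\d+\.\d+\.\d+)", cri_version.strip())
    if not m:
        raise ValueError(f"unexpected runtime string: {cri_version!r}")
    return m.group(1)

def conforms(cri_version: str, pinned: str = PINNED) -> bool:
    """True iff the node runs exactly the pinned containerd version."""
    return parse_runtime_version(cri_version) == pinned
```

Run it across the fleet after every in-place upgrade wave, and treat any non-conforming node as a rollout blocker, not a follow-up ticket.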
Step 2: isolate risk with canary pools and workload selection
Runtime changes surface first in the “unsexy” workloads: CI runners, log shippers, service meshes, security agents, and anything that uses unusual capabilities or mounts. Pick a canary node pool and explicitly schedule a representative set of workloads there.
Good canary coverage includes:
- Image-heavy deployments (large multi-arch images, private registries, frequent pull/push behavior)
- High-churn jobs (CronJobs, build pods, ephemeral test pods)
- Network-sensitive components (sidecars, eBPF-based CNI, Gateway API controllers)
- Storage (CSI drivers, snapshots, and workloads that stress overlayfs)
- Policy/security (PSA/OPA, runtime security agents, seccomp profiles)
If you can’t easily move workloads around, start simple: pick one service that is important but not existential, override its deployment to prefer the canary pool, and observe it for 24–72 hours.
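One low-ceremony way to express “prefer the canary pool” is a patch that adds preferred (not required) node affinity, so pods still schedule elsewhere if the pool is cordoned or unhealthy. A sketch; the pool label key/value and the weight are assumptions:

```python
def canary_affinity_patch(label_key: str = "pool",
                          label_value: str = "runtime-canary",
                          weight: int = 80) -> dict:
    """Build a deployment patch that prefers (not requires) the canary pool.

    Label key/value are illustrative assumptions; use your own node labels.
    """
    return {
        "spec": {"template": {"spec": {"affinity": {"nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [{
                "weight": weight,
                "preference": {"matchExpressions": [{
                    "key": label_key,
                    "operator": "In",
                    "values": [label_value],
                }]},
            }],
        }}}}}
    }
```

Serialize it with `json.dumps(...)` and apply via `kubectl patch deployment <name> -p '<json>'`; using *preferred* rather than *required* affinity is what keeps the canary experiment reversible.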
Step 3: define success metrics that catch runtime regression early
You’re looking for small increases that predict bigger incidents later. Track:
- Pod startup latency (time from scheduled to ready)
- Image pull latency and error rate
- Sandbox creation failures (kubelet events)
- Node-level CPU and IO overhead (runtime and snapshotter behavior can shift)
- OOM and eviction patterns (cgroup accounting changes can reveal hidden pressure)
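Pod startup latency in particular can be derived straight from pod status, with no extra instrumentation: the `PodScheduled` and `Ready` conditions carry transition timestamps. A sketch that reads them from `kubectl get pod -o json` output, assuming both conditions are present and `True`:

```python
from datetime import datetime

def startup_latency_seconds(pod: dict) -> float:
    """Seconds from PodScheduled=True to Ready=True, per the pod's conditions."""
    times = {}
    for cond in pod["status"].get("conditions", []):
        if cond["type"] in ("PodScheduled", "Ready") and cond["status"] == "True":
            # Kubernetes timestamps are RFC 3339 with a trailing 'Z'.
            times[cond["type"]] = datetime.fromisoformat(
                cond["lastTransitionTime"].replace("Z", "+00:00"))
    return (times["Ready"] - times["PodScheduled"]).total_seconds()
```

Collect this per node pool and compare distributions between the canary pool and the rest of the fleet; a shifted p95 is exactly the kind of small regression this step is meant to catch.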
Kubernetes also exposes signals meant for this transition. In the v1.35 release messaging, SIG Node calls out a kubelet metric intended to help identify nodes that are at risk as support boundaries change. Even if you don’t rely on one specific metric name, the broader point stands: treat runtime version drift as something you can measure, not something you “discover” at upgrade time.
Step 4: update your runbooks (your on-call future will thank you)
Runtime migrations fail in the gap between “it should work” and “what do we do at 3am.” Before you roll beyond canaries, update runbooks:
- Where to find containerd logs across distros
- How to validate CRI health (kubelet → runtime connectivity)
- How to interpret common errors (image unpack, snapshotter issues, registry auth)
- Rollback criteria (what triggers a stop/rollback vs. a fix-forward)
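The CRI health check in that list is also scriptable. `crictl info` prints a JSON status document that includes runtime conditions; a sketch that checks the two conditions kubelet depends on (field names follow the CRI status response, so verify against your crictl version):

```python
import json

def cri_healthy(crictl_info_json: str) -> bool:
    """True iff the runtime reports both RuntimeReady and NetworkReady.

    Input is the raw JSON printed by `crictl info` (an assumption to verify
    against your crictl/containerd versions).
    """
    status = json.loads(crictl_info_json).get("status", {})
    conds = {c["type"]: c["status"] for c in status.get("conditions", [])}
    return bool(conds.get("RuntimeReady")) and bool(conds.get("NetworkReady"))
```

Wiring this into your node health daemonset means a broken kubelet-to-runtime path shows up on a dashboard before it shows up as a page.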
Also decide in advance whether you will allow mixed runtime majors in a single cluster long-term. Often the answer is “temporarily yes” (during the rollout) and “permanently no” (after the migration). Write that down. Enforce it with policy once the migration is complete.
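That “one major per cluster” policy can be checked mechanically from the `containerRuntimeVersion` strings each node already reports. A sketch of the post-migration enforcement check (helper names are illustrative):

```python
def runtime_majors(runtime_versions: list) -> set:
    """Distinct (runtime, major version) pairs seen across the fleet."""
    majors = set()
    for rv in runtime_versions:
        runtime, _, version = rv.partition("://")
        majors.add(f"{runtime} {version.lstrip('v').split('.')[0]}")
    return majors

def policy_ok(runtime_versions: list) -> bool:
    """True iff at most one runtime major is present in the cluster."""
    return len(runtime_majors(runtime_versions)) <= 1
```

Run it in CI against the live node list and fail the pipeline when a second major appears; during the rollout window you simply disable the gate, which makes the temporary exception explicit rather than accidental.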
Step 5: coordinate with your Kubernetes upgrade roadmap
Once the runtime migration is stable, you’ve earned a lot of flexibility: the next Kubernetes upgrade becomes “just Kubernetes,” rather than “Kubernetes plus the thing that starts every pod.”
If you’re on a downstream distro like K3s, pay attention to how they align their release streams with upstream. Distro releases often bundle runtime choices, and a seemingly simple minor upgrade can pull in a containerd major behind the scenes. Use that to your advantage: pick the distro release that gives you containerd 2.0 on your schedule, not the night before your Kubernetes version crosses the support boundary.
Common pitfalls (and how to avoid them)
- Assuming “managed Kubernetes handles it.” Even when the provider manages the runtime, you’re still responsible for choosing node images, scheduling capacity, and controlling rollout risk.
- Not testing private registry paths. Runtime changes can surface auth/cert edge cases. Test the exact registries you use in production.
- Forgetting about GPU nodes. GPU stacks are integration-heavy; validate driver containers, runtime classes, and device plugins early.
- Ignoring the boring workloads. Daemonsets and agents are the first to break—and the hardest to debug if they stop shipping logs.
Bottom line
Kubernetes v1.35’s containerd guidance is a gift: it turns an eventual surprise into a visible deadline. Treat it like you would a certificate rotation or a base image refresh. Move the runtime with canaries, measurable success criteria, and an immutable rollout plan. Then upgrade Kubernetes with a clear head.