Kubernetes v1.36 Roundup: In-Place Restarts, SIG Storage Milestones, and Runtime Security Patches

Kubernetes v1.36, codenamed Haru, shipped on April 22, 2026, and the ecosystem around it has been busy. From in-place Pod restarts graduating to beta, to SIG Storage delivering long-awaited data-protection primitives, to containerd patching a batch of security vulnerabilities, the past month has given operators plenty to unpack. This article rounds up the most significant developments across the Kubernetes landscape and what they mean for the teams running production clusters.

In-Place Pod Restarts Graduate to Beta

One of the headline features in Kubernetes v1.36 is the promotion of RestartAllContainersOnContainerExits to beta, enabled by default. Developed under KEP-5532 in SIG Node, this capability allows a container’s exit behavior to trigger a fast, in-place restart of the entire Pod while keeping the sandbox intact.

Historically, recovering from a software crash in a multi-container Pod meant deleting and recreating the entire Pod object. That approach created control-plane churn, forced IP reassignments, released bound GPUs and TPUs, and risked losing node-locality for warm caches. For large batch or AI/ML workloads, where thousands of Pods might fail simultaneously, the thundering-herd effect on the scheduler could delay recovery by minutes.

The new RestartAllContainers action changes the equation. When triggered, the kubelet halts all containers, re-runs init containers (including sidecars) in order, and restarts the workload containers, all while preserving:

  • The same Pod IP and network namespace
  • Bound accelerators such as GPUs and TPUs
  • Mounted volumes, including emptyDir and PVCs
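
The sequence above can be illustrated with a toy model. This is only a sketch of the ordering and of what stays untouched, not the kubelet's implementation: ToyPod, restart_all_containers, and every field and container name here are hypothetical, and sidecars are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyPod:
    """Toy stand-in for a Pod; not a real Kubernetes type."""
    ip: str
    devices: List[str]
    volumes: List[str]
    init_containers: List[str]
    app_containers: List[str]
    events: List[str] = field(default_factory=list)


def restart_all_containers(pod: ToyPod) -> None:
    """Record the restart sequence without touching ip, devices, or volumes."""
    # 1. Halt the running workload containers.
    for name in pod.app_containers:
        pod.events.append(f"stop:{name}")
    # 2. Re-run init containers in their declared order.
    for name in pod.init_containers:
        pod.events.append(f"init:{name}")
    # 3. Start the workload containers again.
    for name in pod.app_containers:
        pod.events.append(f"start:{name}")


pod = ToyPod(
    ip="10.0.0.7",
    devices=["gpu-0"],
    volumes=["scratch"],
    init_containers=["fetch-model"],
    app_containers=["trainer"],
)
before = (pod.ip, tuple(pod.devices), tuple(pod.volumes))
restart_all_containers(pod)
after = (pod.ip, tuple(pod.devices), tuple(pod.volumes))

print(pod.events)
print(before == after)
```

The point of the model is the contrast with delete-and-recreate: the container list is cycled, while the address, devices, and volumes are never reassigned.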

In-place restart was a key enabler for JobSet, which reportedly reduced recovery time from minutes to seconds by adopting it. Kubernetes v1.35 had already introduced the AllContainersRestarting Pod condition to help SREs and autoscalers distinguish restarts from true failures.

Operators adopting this feature should note three caveats: init containers must be idempotent, graceful termination (preStop hooks) is not supported during in-place restarts, and external CD and observability tools may need updates to handle re-running init containers without flagging them as new deployments.
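
To make the idempotency caveat concrete, here is a minimal sketch of an init step that is safe to run more than once. It assumes the init container writes to a volume that survives the in-place restart (such as an emptyDir); prepare_cache and the ".initialized" marker file are hypothetical names chosen for illustration.

```python
import os
import tempfile


def prepare_cache(cache_dir: str) -> bool:
    """Initialize cache_dir once; repeated calls are harmless.

    Returns True if this call did the work, False if an earlier run had.
    """
    marker = os.path.join(cache_dir, ".initialized")
    if os.path.exists(marker):
        # A previous run finished, so there is nothing left to do.
        return False
    os.makedirs(cache_dir, exist_ok=True)
    # Write to a temporary name and rename it, so an interrupted run
    # never leaves a marker that claims success.
    tmp = marker + ".tmp"
    with open(tmp, "w") as fh:
        fh.write("ok\n")
    os.replace(tmp, marker)
    return True


with tempfile.TemporaryDirectory() as root:
    cache = os.path.join(root, "cache")
    first = prepare_cache(cache)   # first run does the setup
    second = prepare_cache(cache)  # re-run after an in-place restart is a no-op
    print(first, second)
```

Checking for a completion marker before doing work is one common way to get idempotency; steps that are naturally repeatable (for example, overwriting a config file with the same content) need no marker at all.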

SIG Storage: VolumeGroupSnapshot GA, CBT Beta, and COSI Advances

The Kubernetes SIG Storage spotlight published in mid-June offers a clear view of where persistent storage is heading. Two features in particular stand out for production workloads.

VolumeGroupSnapshot Reaches GA

VolumeGroupSnapshot, which moved to General Availability in Kubernetes v1.36, enables a crash-consistent, point-in-time snapshot of multiple PersistentVolumes. For applications such as databases that span multiple volumes, this ensures every volume in the group is captured at the same point in time rather than one after another. The feature has been in development for multiple releases and is now ready for broad adoption.

CSI Changed Block Tracking Enters Beta

CSI Changed Block Tracking (CBT) also graduated to beta in v1.36. CBT allows storage systems to report only the blocks that have changed since the last snapshot, dramatically reducing the amount of data that needs to be transferred during incremental backups. For teams running large stateful workloads, this translates directly into shorter backup windows and lower egress costs.
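
The saving is easy to see in a toy model. The sketch below does not call any real CSI API: with CBT the storage system reports the changed-block list itself, whereas here the list is derived by comparing two in-memory byte strings purely so there is something to feed the backup step. The function names and the 4-byte block size are assumptions made for readability.

```python
from typing import List

BLOCK_SIZE = 4  # bytes; unrealistically small so the example stays readable


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def changed_blocks(base: bytes, target: bytes,
                   block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indices of blocks that differ between two equal-sized snapshots."""
    if len(base) != len(target):
        raise ValueError("toy model assumes equal-sized snapshots")
    old = split_blocks(base, block_size)
    new = split_blocks(target, block_size)
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


def apply_incremental(backup: bytes, target: bytes, indices: List[int],
                      block_size: int = BLOCK_SIZE) -> bytes:
    """Bring an existing backup up to date by copying only the listed blocks."""
    blocks = split_blocks(backup, block_size)
    fresh = split_blocks(target, block_size)
    for i in indices:
        blocks[i] = fresh[i]
    return b"".join(blocks)


snap1 = b"AAAABBBBCCCCDDDD"
snap2 = b"AAAAXXXXCCCCDDDY"
delta = changed_blocks(snap1, snap2)
restored = apply_incremental(snap1, snap2, delta)
transferred = len(delta) * BLOCK_SIZE
print(delta, transferred, len(snap2), restored == snap2)
```

Here only two of four blocks differ, so the incremental pass moves 8 bytes instead of 16; on real volumes where a small fraction of blocks change between snapshots, the ratio is correspondingly more favorable.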

COSI and VolumeAttributesClass

The Container Object Storage Interface (COSI) is transitioning to v1alpha2, with plans for beta promotion in a future release. COSI aims to standardize object-storage bucket provisioning in Kubernetes, much as CSI did for block and file storage. Given the exabyte-scale datasets now common in AI/ML pipelines, object storage is becoming a first-class concern for cluster operators.

Another win for stateful workloads is the GA graduation of VolumeAttributesClass in v1.34, which allows users to dynamically tune storage properties such as IOPS or throughput through the Kubernetes API, without recreating volumes or taking downtime.
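
As a rough sketch of what that looks like in practice, the snippet below builds a VolumeAttributesClass manifest and the merge patch that points an existing PVC at it. The driver name, class name, and parameter keys are hypothetical; the parameters a driver accepts are defined by that CSI driver, so check its documentation rather than relying on the keys shown here.

```python
import json
from typing import Dict


def volume_attributes_class(name: str, driver: str,
                            parameters: Dict[str, object]) -> dict:
    """Build a VolumeAttributesClass manifest as a plain dict."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "VolumeAttributesClass",
        "metadata": {"name": name},
        "driverName": driver,
        # Parameter values are strings in the API, so coerce them here.
        "parameters": {key: str(value) for key, value in parameters.items()},
    }


def retune_patch(class_name: str) -> dict:
    """Merge patch that switches a PVC to another VolumeAttributesClass."""
    return {"spec": {"volumeAttributesClassName": class_name}}


# Hypothetical driver and parameter keys, for illustration only.
gold = volume_attributes_class(
    "gold", "csi.example.com", {"iops": 16000, "throughput": 600}
)
patch = retune_patch("gold")

print(json.dumps(gold, indent=2))
print(json.dumps(patch))
```

The printed patch could be applied with something like kubectl patch pvc <name> --type merge -p '<patch>'; the volume itself stays bound and mounted while the driver applies the new settings.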

containerd 2.3.2 Patches Five CVEs

The container runtime layer also saw important activity. containerd 2.3.2 shipped on June 18 as a patch release addressing five security vulnerabilities: CVE-2026-50195, CVE-2026-53488, CVE-2026-53492, CVE-2026-53489, and CVE-2026-47262. The release also updates the bundled runc binary to v1.4.3 and bumps Go to 1.26.4.

Beyond security, the release fixes a data race when reading shim logs on Windows and resolves container startup failures caused by concurrent task RPC timeouts during slow container creation. For image distribution, the resolver now retries on transient network errors. Operators running containerd 2.3.x should plan to upgrade promptly.
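
For fleets with many nodes, a small check can flag which ones still need the patch. The sketch below assumes the runtime version is available as a string of the form "containerd://2.3.1", which is how nodes typically report it in status.nodeInfo.containerRuntimeVersion; the function names are hypothetical, and the check deliberately covers only the 2.3.x line discussed above.

```python
import re
from typing import Tuple

PATCHED = (2, 3, 2)  # the patch release named above


def parse_runtime_version(value: str) -> Tuple[str, Tuple[int, int, int]]:
    """Parse 'containerd://2.3.1' into ('containerd', (2, 3, 1))."""
    match = re.fullmatch(
        r"([a-z0-9-]+)://v?(\d+)\.(\d+)\.(\d+).*", value.strip()
    )
    if match is None:
        raise ValueError(f"unrecognized runtime version: {value!r}")
    major, minor, patch = (int(part) for part in match.group(2, 3, 4))
    return match.group(1), (major, minor, patch)


def needs_upgrade(value: str,
                  patched: Tuple[int, int, int] = PATCHED) -> bool:
    """True for containerd on the same minor line but below the patched release."""
    runtime, version = parse_runtime_version(value)
    return (
        runtime == "containerd"
        and version[:2] == patched[:2]
        and version < patched
    )


for reported in ["containerd://2.3.1", "containerd://2.3.2", "cri-o://1.33.0"]:
    print(reported, needs_upgrade(reported))
```

Nodes on other containerd release lines are reported as False here only because this article does not say which other lines are affected; consult the containerd advisories before treating them as safe.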

Helm v3.21.2 Aligns with Kubernetes v1.36

Helm v3.21.2 arrived on June 20 as a patch release primarily bumping Kubernetes client libraries (client-go and friends) to match the v1.36 release. The Helm maintainers noted that v3.22.0 will be the final feature release in the Helm 3 series, making this a good time for chart authors to verify compatibility with the latest Kubernetes APIs.

The Bigger Picture: Telemetry, AI, and Platform Engineering

Two broader trends are worth watching alongside the release notes.

First, the CNCF’s recent Telemetry That Matters panel at Observability Summit North America highlighted a growing concern: cloud-native platforms are drowning in telemetry data. Industry experience suggests roughly 50% of collected metrics are never queried. The push toward “green observability” treats telemetry reduction as both a cost-optimization and a sustainability goal. For Kubernetes operators, this means being more intentional about what signals are collected from clusters and how they flow through OpenTelemetry pipelines.

Second, Kubernetes is increasingly positioned as the “operating system for AI.” SIG Storage’s roadmap explicitly calls out AI/ML as a primary driver for future work, including data-aware scheduling, high-performance parallel file systems, and NVMe-over-Fabrics technologies managed natively via Kubernetes. The in-place restart feature, meanwhile, directly addresses the reliability needs of distributed training jobs that cannot afford to lose accelerator bindings during recovery.

What to Watch Next

Looking ahead, several items are on the near-term radar:

  • Volume Health, currently in development, will offer persistent visibility into the operational status of volumes, moving beyond the current non-persistent event-based reporting.
  • Mutable PV Affinity, introduced as alpha in v1.35, is seeking community feedback for use cases such as migrating volumes between zonal and regional storage.
  • Node Lifecycle Controller, which graduated to beta in v1.36, streamlines node heartbeat handling and the node lease API, improving reliability at scale.
  • RKE2 has switched its default ingress controller from Ingress NGINX (which reached end-of-life in March 2026) to Traefik starting with v1.36, a change that will affect new cluster deployments.
