Kubernetes has always had “ways” to restart things, but most of them are really replacements: delete a Pod and let a controller recreate it; bump a Deployment annotation to trigger a rollout; drain a node; or, in the worst cases, cordon a node and kill the workload to force a clean slate. Kubernetes 1.35 adds a new tool to that kit: Restart All Containers (alpha), a feature that makes a full, in-place restart of a Pod a first-class capability.
This sounds small, but it’s one of those “sharp edges become primitives” moments. When the platform gives you a supported, API-driven restart mechanism, you can build safer automation around it—especially in environments with GPU nodes, stateful sidecars, and increasingly AI-shaped workloads that have complicated warm-up and caching behavior.
What “in-place restart” actually means
In most clusters today, “restart” usually means “replace,” and that replacement has consequences:
- Identity changes: new Pod UID, new IP, new endpoint readiness transitions, and sometimes new scheduling placement.
- Storage semantics: volumes might remount; local ephemeral state is lost; initialization might re-run.
- Disruption budget accounting: a rollout behaves differently from a deliberate restart event.
An in-place restart aims to refresh the containers within the existing Pod object rather than forcing the Pod to be deleted and recreated. Conceptually: keep the Pod where it is, but restart containers so you can reset state, apply configuration that requires process restart, or recover from certain classes of “stuck” behavior—without triggering a reschedule.
That “without reschedule” piece matters most on nodes that are hard to come by or slow to warm up: GPU nodes, bare-metal nodes with attached accelerators, nodes with large local caches, or nodes pinned to specific network topologies.
Where this helps in real operations
1) Draining the “soft failure” swamp
Every SRE team has a catalog of soft failures: gRPC channels that wedge, sidecars that leak memory, log shippers that stop tailing after rotation, agents that fail to reload updated credentials, and so on. The standard response is “restart the Pod,” which often becomes “rollout restart the Deployment” or “delete a Pod.”
That approach can be too blunt when a workload is stable but a single instance is degraded. A targeted, in-place restart can be the smallest viable hammer: reset the process and restore health while minimizing cluster churn.
2) Faster recovery on scarce scheduling domains
Rescheduling can be expensive. Even if the scheduler finds a node quickly, your workload may have to rehydrate caches, pull large images, or re-run init containers. For GPU workloads, rescheduling can also trigger longer queues and unfairness: a “simple restart” becomes “wait for another GPU.”
In-place restart is attractive because it’s local: if the node is healthy and the Pod placement is good, you don’t want the restart mechanism to add scheduling variance.
3) Cleaner automation hooks
Platform teams commonly build “restart bots” that watch for certain metrics or alerts and then trigger a restart. Today those bots often do one of two things:
- patch an annotation on a controller to force a rollout (affects all replicas), or
- delete a specific Pod (which may reschedule and can interact poorly with PDBs).
A supported restart primitive gives a better contract for automation: it’s explicit, scoped, and easier to reason about in change reviews.
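That contrast can be sketched in miniature. The snippet below is a hypothetical illustration (the action name, fields, and `build_restart_request` helper are all assumptions, and the actual restart call is deliberately left out): it shows the kind of explicit, scoped request a restart bot could log and submit, instead of patching a controller or deleting a Pod.

```python
import json
import time

def build_restart_request(namespace, pod, reason, requested_by):
    # Explicit and scoped: one Pod, one action, with a reason and a requester
    # that survive change review and audit logs.
    return {
        "namespace": namespace,
        "pod": pod,
        "action": "restart-all-containers",  # hypothetical action name
        "reason": reason,
        "requestedBy": requested_by,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

req = build_restart_request(
    "prod", "model-server-3", "wedged gRPC channel", "restart-bot"
)
print(json.dumps(req, indent=2))
```

Compared with an annotation patch (which restarts every replica) or a Pod deletion (which changes identity and placement), a request like this names exactly one Pod and exactly one action.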
Why AI workloads make this more interesting (and more dangerous)
AI-shaped Kubernetes workloads aren’t just “one container.” They’re often a bundle:
- a model server (GPU-bound),
- a routing/proxy sidecar,
- an agent runtime or tool server,
- telemetry/exporters, and
- sometimes a local cache or vector store process.
These systems have stateful behavior above the filesystem: warmed KV caches, compiled kernels, loaded model weights in GPU memory, and connection pools to upstream services. A reschedule wipes all of that, and can cause cold-start storms. But a restart also wipes in-memory state—so the goal isn’t “never restart,” it’s “restart with intention.”
In-place restarts can help teams separate concerns:
- Reset the process without moving the workload.
- Control blast radius by restarting one replica instead of the whole fleet.
- Coordinate restarts with traffic shifting, circuit breakers, and queues.
They can also amplify risk if misused. A naive automation could turn intermittent latency into a restart loop that never allows a model server to warm up. The more expensive the warm-up, the more careful you need to be with restart policies and alert thresholds.
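One way to keep automation from turning latency blips into restart loops is to enforce a minimum warm-up window between restarts of the same replica. A minimal sketch, assuming you size the window to your workload's real warm-up time:

```python
import time

class RestartGuard:
    """Refuses to restart a replica again before its warm-up window elapses."""

    def __init__(self, warmup_seconds, clock=time.monotonic):
        self.warmup = warmup_seconds
        self.clock = clock            # injectable for testing
        self.last_restart = {}

    def allow(self, replica):
        last = self.last_restart.get(replica)
        # Never restarted, or the warm-up window has fully elapsed.
        return last is None or self.clock() - last >= self.warmup

    def record(self, replica):
        self.last_restart[replica] = self.clock()
```

An expensive model server might warrant a window of several minutes; the point is that the guard, not the alert volume, sets the restart cadence.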
How to roll it out safely
Because Restart All Containers is alpha, the operational stance should be: treat it as an experiment, not as a default behavior.
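Alpha features are disabled by default, so experimenting means opting in explicitly on a test cluster first. A hypothetical kubelet configuration fragment is sketched below; the gate name is an assumption, not a confirmed identifier.

```yaml
# Hypothetical fragment: alpha features must be enabled via a feature gate.
# The gate name "RestartAllContainers" below is an assumption.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  RestartAllContainers: true
```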
Step 1: Start with a narrow allowlist
Pick one or two workloads where the current restart mechanism causes pain—often GPU-bound inference deployments or stateful-ish services with fragile clients. Avoid your critical path at first.
Step 2: Wrap it in “traffic-aware” runbooks
Write a runbook that says: before you restart a replica, confirm that (a) there is spare capacity, (b) the load balancer health checks are sensible, and (c) you can drain connections gracefully. Then automate those checks.
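Those three checks can be encoded directly, so the automation refuses to act when any of them fails. A sketch with hypothetical inputs and thresholds:

```python
def pre_restart_checks(spare_capacity_pct, health_checks_ok,
                       inflight_connections, min_spare_pct=20):
    # (a) spare capacity, (b) sane health checks, (c) connections drained.
    checks = {
        "spare_capacity": spare_capacity_pct >= min_spare_pct,
        "health_checks_sane": health_checks_ok,
        "connections_drained": inflight_connections == 0,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return len(failed) == 0, failed

# Only proceed with the restart when ok is True.
ok, failed = pre_restart_checks(
    spare_capacity_pct=35, health_checks_ok=True, inflight_connections=0
)
```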
Step 3: Observe the restart outcome, not just the restart event
Track post-restart warm-up time, readiness delay, error rate spikes, and tail latency. If your workload takes 90 seconds to recover, your alerting should reflect that reality.
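Concretely, that means summarizing each restart from post-restart samples rather than just logging that it happened. A sketch, assuming you can collect (seconds since restart, readiness, error rate) samples for the replica:

```python
def restart_outcome(samples, ready_threshold_s, error_budget):
    """samples: list of (seconds_since_restart, is_ready, error_rate) tuples."""
    time_to_ready = next((t for t, ready, _ in samples if ready), None)
    peak_error_rate = max(err for _, _, err in samples)
    return {
        "time_to_ready_s": time_to_ready,
        "recovered_in_budget": (time_to_ready is not None
                                and time_to_ready <= ready_threshold_s),
        "peak_error_rate": peak_error_rate,
        "error_spike": peak_error_rate > error_budget,
    }

# A replica that takes 90 seconds to recover should be judged against a
# 90-second budget, not a generic 10-second one.
outcome = restart_outcome(
    [(10, False, 0.10), (45, True, 0.02), (90, True, 0.01)],
    ready_threshold_s=90,
    error_budget=0.05,
)
```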
Step 4: Prefer “restart one” patterns
Even if the primitive is “restart all containers,” your operational pattern should be “restart one replica at a time,” with a cool-down and an explicit success condition.
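That pattern is small enough to write down. A sketch where the restart call, the health probe, and the cool-down are all injected (and hypothetical):

```python
def rolling_restart(replicas, restart_fn, is_healthy_fn, cooldown_fn):
    """Restart one replica at a time; stop the sweep on the first failure."""
    done = []
    for replica in replicas:
        restart_fn(replica)        # e.g. invoke the (alpha) restart primitive
        cooldown_fn()              # wait out the warm-up window
        if not is_healthy_fn(replica):
            return done, replica   # explicit failure beats a silent loop
        done.append(replica)
    return done, None
```

The explicit success condition is what distinguishes a controlled sweep from a restart storm: the first unhealthy replica halts the process instead of feeding it.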
What to watch next
Over time, a restart primitive can become the foundation for better tooling: controllers that manage coordinated restarts, policies that constrain automation, or runtime integrations that can restart only the relevant containers without turning every incident into a full rollout.
The bigger theme is Kubernetes continuing to evolve from a “declarative desired state engine” into a platform with richer operational verbs—verbs you can standardize, audit, and reason about. For platform teams dealing with expensive nodes and complex AI workloads, that’s not just convenience; it’s cost control.
