ARC + Karpenter: A Practical Pattern for Zonal-Shift Resiliency in EKS

Kubernetes gives platform teams a deep toolbox for resiliency: topology spread constraints, pod disruption budgets, readiness and liveness probes, and autoscaling primitives that help capacity follow demand. But there’s one failure mode that keeps showing up in real-world incident reviews: zonal gray failures — partial outages where an Availability Zone isn’t completely down, but it’s degraded enough to cause cascading timeouts, throttling, or error spikes.

In those moments, the fastest way to restore customer-facing SLOs often isn’t a perfect root-cause analysis. It’s decisively shifting traffic and compute away from the affected zone. Amazon Application Recovery Controller (ARC) exists for that purpose, and AWS recently published a concrete pattern for integrating ARC’s zonal shift events with Karpenter. It addresses a gap many EKS teams have likely felt: ARC can cordon nodes and pull endpoints, but Karpenter may still provision into the degraded zone unless the autoscaler is explicitly told not to.

This post breaks down what AWS built, why it’s important, and how to adopt the same ideas even if you’re not on EKS — because the underlying lesson is universal: your traffic shifting system and your provisioning system must share the same understanding of fault domains.

Why zonal shift is different from “normal” Kubernetes healing

Most Kubernetes resiliency mechanisms assume an individual pod or node is unhealthy and should be replaced. Zonal incidents are trickier:

  • Pods can appear healthy but still serve errors due to network jitter, storage latency, or impaired control plane dependencies in that zone.
  • Autoscalers may make things worse by adding more capacity in the degraded zone because it still satisfies constraints and has perceived headroom.
  • Human troubleshooting is slow when the failure is intermittent, multi-service, or caused by shared infrastructure beyond the cluster.

ARC’s core value is speed: detect zonal impairment and shift traffic away quickly. The missing link is preventing fresh capacity from being created in the same zone during the incident window.

The integration gap: ARC knows the zone is bad; Karpenter doesn’t

Karpenter is widely used because it provisions nodes dynamically based on pending pod requirements, rather than relying on pre-baked node groups. But that flexibility has a side effect: if a NodePool spans multiple zones, Karpenter will continue to consider all eligible zones unless something changes the scheduling/provisioning constraints.

AWS’s solution is a controller that listens for zonal shift or autoshift events, identifies the impaired zone, and then mutates Karpenter NodePools so the bad zone is excluded until the shift ends.
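The core mutation is easy to picture. Here is a minimal sketch in Python, using a plain dict to stand in for a Karpenter NodePool spec. The requirement key (topology.kubernetes.io/zone) matches Karpenter’s well-known label, but the exclude_zone helper is illustrative, not AWS’s actual controller code:

```python
import copy

# Minimal sketch: remove an impaired zone from a NodePool's zone requirement.
# The dict stands in for a Karpenter NodePool spec; exclude_zone is an
# illustrative helper, not AWS's controller implementation.

ZONE_KEY = "topology.kubernetes.io/zone"

def exclude_zone(nodepool: dict, impaired_zone: str) -> dict:
    """Return a copy of the NodePool spec with impaired_zone removed."""
    patched = copy.deepcopy(nodepool)
    for req in patched["spec"]["template"]["spec"]["requirements"]:
        if req["key"] == ZONE_KEY and req["operator"] == "In":
            req["values"] = [z for z in req["values"] if z != impaired_zone]
    return patched

nodepool = {
    "spec": {"template": {"spec": {"requirements": [
        {"key": ZONE_KEY, "operator": "In",
         "values": ["us-east-1a", "us-east-1b", "us-east-1c"]},
    ]}}}
}

patched = exclude_zone(nodepool, "us-east-1a")
print(patched["spec"]["template"]["spec"]["requirements"][0]["values"])
# → ['us-east-1b', 'us-east-1c']
```

In the real controller this would be a patch against the NodePool custom resource via the Kubernetes API; the point is that the mutation is a pure, reversible transformation of the zone requirement.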

How the AWS controller pattern works (conceptually)

The architecture AWS describes can be summarized in three parts:

  1. Signal: ARC emits events when an autoshift or manual zonal shift happens.
  2. Transport: Events flow through EventBridge into an SQS queue (decoupling delivery from controller uptime).
  3. Actuation: A Kubernetes controller reads events and edits NodePool zone requirements to remove the impaired zone, restoring them when the shift ends.

Operationally, the controller behaves like a “fault domain policy enforcer.” When a zone is marked unhealthy, it rewrites the constraints Karpenter consults when making scaling decisions.
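The actuation step reduces to a small state machine: a shift-started event takes a zone away, a shift-ended event restores it. The event fields below are illustrative placeholders, not the real ARC/EventBridge schema:

```python
# Sketch of the actuation loop: map a zonal-shift event to a zone set.
# "zone" and "state" are illustrative field names, not the actual
# ARC/EventBridge event schema.

def handle_event(event: dict, nodepool_zones: set) -> set:
    zone = event["zone"]
    if event["state"] == "ACTIVE":   # shift started: take the zone away
        return nodepool_zones - {zone}
    if event["state"] == "ENDED":    # shift ended: restore the zone
        return nodepool_zones | {zone}
    return nodepool_zones            # unknown state: change nothing

zones = {"us-east-1a", "us-east-1b", "us-east-1c"}
zones = handle_event({"zone": "us-east-1b", "state": "ACTIVE"}, zones)
print(sorted(zones))   # ['us-east-1a', 'us-east-1c']
zones = handle_event({"zone": "us-east-1b", "state": "ENDED"}, zones)
print(sorted(zones))   # ['us-east-1a', 'us-east-1b', 'us-east-1c']
```

Defaulting to “change nothing” on an unrecognized event keeps the controller fail-safe: a malformed message never removes capacity.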

Why annotations matter

One subtle but important part of AWS’s approach is that the controller stores the removed zone in an annotation (for example, an “away-zones” list). That preserves intent so the controller can revert to the original state when the incident ends, rather than requiring a human to remember what changed.
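A sketch of that bookkeeping, modeled on the “away-zones” list the post describes (the annotation key itself is hypothetical):

```python
# Sketch of annotation-based reversibility: record removed zones in an
# "away-zones" annotation so the revert needs no human memory.
# The annotation key is a hypothetical example, not AWS's actual key.

AWAY_KEY = "example.com/away-zones"

def mark_away(meta: dict, zone: str) -> None:
    annotations = meta.setdefault("annotations", {})
    away = set(filter(None, annotations.get(AWAY_KEY, "").split(",")))
    away.add(zone)
    annotations[AWAY_KEY] = ",".join(sorted(away))

def restore(meta: dict, zone: str) -> None:
    annotations = meta.get("annotations", {})
    away = set(filter(None, annotations.get(AWAY_KEY, "").split(",")))
    away.discard(zone)
    if away:
        annotations[AWAY_KEY] = ",".join(sorted(away))
    else:
        annotations.pop(AWAY_KEY, None)  # incident over: leave no residue

meta = {}
mark_away(meta, "us-east-1a")
print(meta["annotations"][AWAY_KEY])              # us-east-1a
restore(meta, "us-east-1a")
print(AWAY_KEY in meta.get("annotations", {}))    # False
```

Because the annotation lives on the object itself, the revert survives controller restarts: any replica can read the annotation and undo the change.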

Design principles platform teams should steal

Even if you never run this exact code, the pattern offers several principles worth standardizing:

1) Treat “avoid this zone” as a first-class control input

Many orgs express zone awareness only as an output constraint (“spread replicas across zones”). During an incident, you need a fast mechanism that flips constraints into an explicit avoid list. This is a different class of control plane input than ordinary scheduling preferences.
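One way to make the avoid list first-class is to compute effective zones as configured zones minus avoided zones at every placement decision, and to fail loudly rather than silently when the avoid list would empty the set. A minimal sketch (the function name and error behavior are my illustration, not a prescribed API):

```python
# Sketch: an explicit avoid list is a separate control input that overrides
# normal placement preferences, rather than being folded into spread rules.

def effective_zones(configured: list, avoid: set) -> list:
    remaining = [z for z in configured if z not in avoid]
    if not remaining:
        # Fail loudly rather than silently scheduling into an avoided zone.
        raise RuntimeError("avoid list removes every configured zone")
    return remaining

print(effective_zones(["us-east-1a", "us-east-1b"], {"us-east-1a"}))
# → ['us-east-1b']
```

Keeping the avoid list as its own input also gives you a single place to audit during an incident: the question “why is nothing landing in 1a?” has one answer.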

2) Drive both scheduling and provisioning from the same state

It’s not enough to cordon nodes if new nodes can still appear in the same zone. Likewise, it’s not enough to stop provisioning if traffic still routes there. Resiliency requires a consistent model across:

  • Ingress / load balancing
  • Endpoint registration
  • Node provisioning
  • Scheduler placement

3) Build reversibility into incident automation

Most “incident scripts” fail because they’re one-way. Reversibility means: every automated mutation should carry enough context to reverse itself safely. Annotations, structured status fields, or a dedicated CRD all work.

4) Make the automation itself multi-AZ resilient

AWS notes running multiple replicas of the controller across zones. That’s table stakes: the component that reacts to a zonal failure cannot itself depend on a single zone.

Adoption checklist: what to validate before you copy the pattern

If you’re considering implementing the same integration, validate these items up front:

  • NodePool modeling: Do your NodePools explicitly constrain zones today? If not, your controller needs discovery logic (like examining subnets) to determine which zones are “in play.”
  • Workload constraints: Some workloads may be pinned to zones for data locality. Decide whether zonal shift should override those constraints or instead trigger a different remediation path.
  • Capacity math: When you remove a zone, you’re reducing total available capacity. Confirm HPA/VPA behavior, and confirm you have headroom in remaining zones.
  • Failure semantics: Clarify whether a zone is “away” due to ARC autoshift, a manual operator decision, or an internal SLO-based detector. Your platform should treat these differently.
  • Rollbacks: Define what “shift completed” means and how you reintroduce zones gradually (all at once vs. staged).
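The capacity-math item above is worth making concrete. A back-of-envelope check, with illustrative numbers: if every zone runs above roughly (N−1)/N of its capacity, the survivors cannot absorb a lost zone’s load.

```python
# Back-of-envelope capacity check before removing a zone: can the remaining
# zones absorb the displaced load? Numbers are illustrative.

def survives_zone_loss(zone_capacity: dict, zone_usage: dict, lost: str) -> bool:
    remaining_capacity = sum(c for z, c in zone_capacity.items() if z != lost)
    total_usage = sum(zone_usage.values())  # displaced pods must fit elsewhere
    return total_usage <= remaining_capacity

capacity = {"a": 100, "b": 100, "c": 100}   # schedulable CPU per zone

print(survives_zone_loss(capacity, {"a": 60, "b": 60, "c": 60}, "a"))
# → True  (180 total usage fits in 200 remaining)
print(survives_zone_loss(capacity, {"a": 70, "b": 70, "c": 70}, "a"))
# → False (210 total usage exceeds 200 remaining)
```

In a three-zone cluster, this means sustained utilization above about two-thirds per zone leaves no headroom for a zonal shift; that threshold is worth alerting on before an incident, not during one.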

What this means for the broader Kubernetes ecosystem

The big takeaway is that the industry is converging on a pattern where clusters need an incident-time policy plane: a way to temporarily change placement and provisioning rules based on real-time infrastructure health signals.

Today, most of that logic lives in provider tools (ARC) plus autoscalers (Karpenter). Over time, expect to see more cross-cutting integrations where infrastructure health signals feed directly into declarative cluster policy, possibly via standard interfaces or dedicated “fault domain” APIs.

If you run multi-zone Kubernetes in production, you should have a documented answer to this question: when an AZ is degraded, how do we prevent the cluster from scaling into it? AWS has now provided a concrete reference implementation for EKS teams — and a useful blueprint for everyone else.
