GitOps promised a simple contract: describe desired state in Git, let a controller reconcile reality. For a single cluster (or a handful), it’s a superpower. But as more organizations run dozens or hundreds of clusters, GitOps can start to feel like it’s missing an orchestration layer: you can declare the change, but you can’t easily manage the rollout of that change across an entire fleet with confidence.
Fastly recently published a candid write-up about operating a large Kubernetes fleet and the tooling they ended up building around Argo CD. It’s a useful case study because it highlights a pattern that’s becoming common in platform engineering: GitOps is the baseline, and “fleet rollout safety” becomes the differentiator.
The fleet problem GitOps doesn’t solve by itself
Argo CD (and similar tools) excels at reconciling one cluster to one desired state. At fleet scale, platform teams run into recurring questions:
- Ordering: which clusters should upgrade first, and why?
- Blast radius: how do you guarantee a change can’t hit every region at once?
- Validation: what automated checks must pass before the next wave?
- Pausing: how do you stop the rollout when you see a problem?
- Visibility: can you answer “what’s running where?” in seconds?
None of these are “anti-GitOps.” They’re the natural next layer once GitOps becomes the default.
What Fastly says they were missing
In their post, Fastly describes how GitOps kept them consistent early, but at scale it exposed gaps — specifically an orchestration layer for multi-cluster rollouts and automated validation to reduce manual checks. That’s a familiar story: the reconcilers are solid, but the “release management” experience is still DIY.
Pattern: add a rollout controller above GitOps
The key idea is to treat GitOps reconciliations as the execution engine, and add a higher-level system that decides when and where to apply changes. There are different ways to do this (homegrown tooling, CRDs, pipelines), but the requirements are consistent:
- Define waves (canary → small region → broader region → global)
- Gate wave progression on health signals
- Enforce invariants (never upgrade two critical regions simultaneously)
- Surface state as a single “fleet release” view
Think of it as progressive delivery for infrastructure, not just applications.
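The wave-gated progression described above can be sketched in a few lines. This is a minimal illustration, not Fastly's implementation; `sync` and `healthy` stand in for whatever reconciliation trigger and health gate a real system would use.

```python
# Minimal sketch of wave-based promotion: each wave syncs, then a health
# gate decides whether the rollout proceeds. All names are illustrative.

def run_rollout(waves, sync, healthy):
    """Promote through waves; stop at the first wave that fails its gate.

    waves   -- ordered list of cluster lists, e.g. [["canary"], ["us-east"], ...]
    sync    -- callable(cluster) that triggers reconciliation
    healthy -- callable(cluster) -> bool health gate
    """
    completed = []
    for wave in waves:
        for cluster in wave:
            sync(cluster)
        if not all(healthy(c) for c in wave):
            # Pause instead of proceeding: the next wave never starts.
            return {"status": "paused", "completed": completed, "failed_wave": wave}
        completed.append(wave)
    return {"status": "done", "completed": completed}
```

The important property is that the loop can only ever stop or proceed one wave at a time; there is no code path that touches the whole fleet at once.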
Stealable idea #1: make rollouts a first-class object
When rollouts live in wiki pages and runbooks, they fail under pressure. A strong platform approach is to define a “fleet rollout” object (or at least a pipeline artifact) that has:
- Target set (which clusters)
- Wave definitions
- Validation checks
- Pause/abort semantics
- Audit trail
That object becomes the unit of coordination between platform, SRE, and service owners.
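As a rough illustration, the fields listed above map naturally onto a small data type. The schema below is hypothetical (not an existing CRD), but it shows what "rollouts as a first-class object" means in practice:

```python
from dataclasses import dataclass, field

# Hypothetical "fleet rollout" object carrying the fields listed above.
# Field names are illustrative, not an existing CRD schema.

@dataclass
class FleetRollout:
    targets: list[str]                 # target set: which clusters
    waves: list[list[str]]             # ordered wave definitions
    checks: list[str]                  # validation checks gating each wave
    paused: bool = False               # pause/abort semantics
    audit: list[str] = field(default_factory=list)  # append-only audit trail

    def pause(self, reason: str) -> None:
        """Stop wave progression and record why."""
        self.paused = True
        self.audit.append(f"paused: {reason}")

    def record(self, event: str) -> None:
        self.audit.append(event)
```

Once the rollout is an object, "who paused it, when, and why" is a field lookup rather than a Slack archaeology exercise.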
Stealable idea #2: validate automatically, but choose signals carefully
Automated validation is easy to say and hard to do. The trick is to pick signals that are:
- Fast (you want signal within minutes, not hours)
- Hard to game (not “the deploy succeeded”)
- Service-aware (error rate, latency, saturation)
- Environment-aware (some regions behave differently)
For cluster upgrades, also consider control-plane signals: API server latency, etcd health, node readiness churn, and CNI stability.
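A wave gate built on those criteria might look like the sketch below. The thresholds and metric names are illustrative assumptions, not recommended values:

```python
# Sketch of a wave gate combining service-level and control-plane
# signals. Thresholds and metric names are illustrative assumptions.

LIMITS = {
    "error_rate": 0.01,          # max fraction of failed requests
    "p99_latency_ms": 500,       # max service p99 latency
    "apiserver_p99_ms": 1000,    # control-plane: API server latency
    "node_ready_fraction": 0.95, # node readiness churn floor
}

def gate(metrics: dict, limits: dict = LIMITS) -> bool:
    """Pass only if every signal is within its limit.

    Note that "the deploy succeeded" is deliberately not a signal here:
    it is too easy to game, so we gate on observed behavior instead.
    """
    return (
        metrics["error_rate"] <= limits["error_rate"]
        and metrics["p99_latency_ms"] <= limits["p99_latency_ms"]
        and metrics["apiserver_p99_ms"] <= limits["apiserver_p99_ms"]
        and metrics["node_ready_fraction"] >= limits["node_ready_fraction"]
    )
```

Because some regions behave differently, a real system would likely keep per-environment `limits` rather than one global table.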
Stealable idea #3: enforce blast-radius boundaries with policy, not etiquette
At fleet scale, “don’t deploy everywhere at once” can’t be a cultural guideline — it has to be policy. Good rollout systems encode constraints like:
- Max N clusters concurrently in an upgrade wave
- Max 1 cluster per region at a time
- Require canary success before broader rollout
This is the same shift you saw in CI/CD: from “be careful” to “guardrails.”
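Encoding the constraints above as an admission check is straightforward; the point is that the check runs in code, not in someone's head. The API below is hypothetical:

```python
# Sketch of blast-radius policy enforced in code rather than by
# convention. Constraint values mirror the bullets above; the function
# signature is a hypothetical example.

MAX_CONCURRENT = 3  # max N clusters concurrently in an upgrade wave

def may_start(cluster_region: str, in_flight: dict[str, str]) -> bool:
    """Decide whether one more cluster may begin upgrading.

    in_flight maps cluster name -> region for upgrades in progress.
    """
    if len(in_flight) >= MAX_CONCURRENT:
        return False  # fleet-wide concurrency cap
    if cluster_region in in_flight.values():
        return False  # max 1 cluster per region at a time
    return True
```

Canary-before-broad-rollout is then just the first wave of the pacing loop refusing to advance until its gate passes; no separate mechanism is needed.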
Stealable idea #4: separate “desired state” from “release intent”
GitOps repositories describe a desired state. Release intent describes how to transition the fleet to that state. Keep them separate so you can:
- Promote the same commit through environments
- Control rollout pacing independently
- Reuse validation across releases
This makes rollouts less chaotic and improves auditability.
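The separation can be made concrete with two small types: desired state identifies *what* (a Git commit), release intent describes *how* (pacing and gates). The types and names below are illustrative assumptions:

```python
from dataclasses import dataclass

# Sketch separating "what" (desired state, a Git commit) from "how"
# (release intent: pacing and validation). All names are illustrative.

@dataclass(frozen=True)
class DesiredState:
    repo: str
    commit: str  # the same commit is promoted through environments

@dataclass
class ReleaseIntent:
    state: DesiredState
    waves: list[list[str]]  # rollout pacing, controlled independently
    checks: list[str]       # validation reused across releases

# One commit, promoted with different pacing per environment:
state = DesiredState("platform/config", "abc123")
staging = ReleaseIntent(state, waves=[["staging"]], checks=["smoke"])
prod = ReleaseIntent(state, waves=[["canary"], ["us"], ["eu"]], checks=["slo"])
```

Because `DesiredState` is immutable and shared, "what did we ship?" has exactly one answer across environments, while each `ReleaseIntent` keeps its own pacing and audit story.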
Where Argo CD fits (and where it doesn’t)
Argo CD remains the reconciliation engine in this pattern. It’s good at ensuring each cluster converges to the desired manifests. The orchestration layer decides when to let Argo synchronize a given cluster (or which clusters are allowed to sync).
That distinction is powerful because it means you don’t need to replace Argo CD; you enhance it.
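One common way to implement that gating is to leave automated sync disabled on each Argo CD Application and have the orchestration layer enable it per cluster as waves proceed. `spec.syncPolicy.automated` is a real Argo CD Application field; the surrounding logic here is an illustrative sketch operating on a manifest dict, not a client for the Argo CD API:

```python
# Sketch of gating Argo CD syncs from above: applications start without
# automated sync, and the rollout layer enables it per cluster when that
# cluster's wave is allowed to proceed. spec.syncPolicy.automated is a
# real Application field; everything else here is illustrative.

def allow_sync(app: dict) -> dict:
    """Enable automated sync on an Argo CD Application manifest."""
    app.setdefault("spec", {})["syncPolicy"] = {"automated": {}}
    return app

def block_sync(app: dict) -> dict:
    """Remove automated sync so the cluster holds its current state."""
    app.setdefault("spec", {}).pop("syncPolicy", None)
    return app
```

Sync windows are an alternative Argo CD mechanism for the same effect; either way, the reconciler stays untouched and only its permission to act is orchestrated.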
Practical implementation options
Organizations implement “fleet rollout” in several ways:
- Pipeline-driven: CI triggers Argo syncs for waves, with checks between waves.
- Controller-driven: a custom controller manages per-cluster Application objects and sync windows.
- Git-branch promotion: promote config via branches/tags and let Argo’s App-of-Apps pattern handle the rest, with sync gating.
Your best choice depends on how much control you need and how much custom software you’re willing to own.
What this means for platform engineering in 2026
Platform teams are increasingly judged on release safety and velocity. As fleets grow, the question becomes: can you ship infrastructure changes like a product team ships software? GitOps gives you reproducibility, but fleet rollouts require orchestration and validation.
Fastly’s write-up is a reminder that “we built a thin orchestration layer on top of GitOps” is not an outlier story anymore — it’s becoming table stakes for serious multi-cluster operations.
