Beyond All-or-Nothing: The Case for Gradual Rollouts
You’ve adopted GitOps. Your infrastructure is declarative, version-controlled, and automatically reconciled. But when it comes to deploying application changes, are you still flipping a switch and hoping for the best?
Progressive delivery bridges this gap. Instead of instant cutover, traffic shifts gradually — 5% → 25% → 100% — with automated checks at every step. If metrics degrade, instant rollback. If health checks pass, automatic promotion. The result: safer deployments without sacrificing velocity.
The Progressive Delivery Stack
At its core, progressive delivery combines three capabilities:
- Traffic Shifting — Gradually move users from old to new version
- Automated Analysis — Continuously evaluate SLOs and business metrics
- Automatic Promotion/Rollback — Decisions based on data, not gut feeling
The two leading implementations in the Kubernetes ecosystem are Argo Rollouts and Flagger. Both integrate with existing GitOps workflows but approach progressive delivery differently.
Argo Rollouts: Native Kubernetes Experience
Argo Rollouts extends the Deployment concept with custom resources. You get canaries, blue-green deployments, and experiments using familiar Kubernetes primitives.
Architecture Overview
```
┌─────────────────────────────────────────┐
│        Argo Rollouts Controller         │
│ (manages Rollout CRD, traffic shaping)  │
├─────────────────────────────────────────┤
│              Service Mesh               │
│    (Istio, Linkerd, NGINX, ALB, SMI)    │
├─────────────────────────────────────────┤
│             Prometheus/OTel             │
│      (metric queries for analysis)      │
└─────────────────────────────────────────┘
```
Example: Canary Deployment
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vs
            routes:
            - primary
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 20
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
      - analysis:
          templates:
          - templateName: success-rate
          - templateName: latency
```
Analysis Template
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 5m
    count: 3
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{service="payment-service",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment-service"}[5m]))
```
Flagger: GitOps-Native Approach
Flagger takes a different approach. Instead of replacing Deployments, it works alongside them — creating canary resources and managing traffic splitting externally.
Architecture Overview
```
┌─────────────────────────────────────────┐
│                 Flagger                 │
│  (watches Deployments, manages canary)  │
├─────────────────────────────────────────┤
│         Service Mesh / Ingress          │
│ (Istio, Linkerd, NGINX, Gloo, Contour)  │
├─────────────────────────────────────────┤
│          Prometheus/CloudWatch          │
│       (metrics for canary checks)       │
└─────────────────────────────────────────┘
```
Example: Automated Canary
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  service:
    port: 8080
  analysis:
    interval: 30s
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://payment-service-canary/"
```
Argo Rollouts vs Flagger: Quick Comparison
| Aspect | Argo Rollouts | Flagger |
|---|---|---|
| Deployment Model | Replaces Deployment with Rollout CRD | Watches existing Deployments |
| GitOps Integration | Argo CD native (same project) | Works with any GitOps tool |
| Traffic Control | Multiple meshes + ALB/NLB | Multiple meshes + ingress controllers |
| Experimentation | Built-in A/B/n testing | A/B testing via webhooks |
| Analysis | AnalysisTemplate/AnalysisRun CRDs | Inline metric thresholds |
| Rollback | Automatic on failed analysis | Automatic on threshold breach |
Metric-Driven Promotion
The magic happens when deployment decisions are based on actual system behavior, not time-based guesses.
Key Metrics to Watch
- Golden Signals: Latency, traffic, errors, saturation
- Business Metrics: Conversion rates, checkout completion
- Infrastructure Metrics: CPU, memory, disk I/O
Prometheus Integration Example
```yaml
# Argo Rollouts: P99 latency check (the query returns seconds)
- name: p99-latency
  interval: 5m
  successCondition: result[0] <= 0.2  # 200ms
  provider:
    prometheus:
      address: http://prometheus.monitoring
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        )
```

```yaml
# Flagger: success rate check (percentage of non-5xx requests)
metrics:
- name: request-success-rate
  thresholdRange:
    min: 99.0
  interval: 1m
```
Adoption Path: From GitOps to Progressive Delivery
For teams already running Argo CD or Flux, the transition is gradual:
Phase 1: Observability Foundation
- Ensure metrics are flowing (Prometheus/Grafana operational)
- Define SLOs and error budgets
- Set up alerting on key services
Phase 2: First Canary
- Pick a non-critical service with good metrics coverage
- Install Argo Rollouts or Flagger controller
- Convert Deployment to Rollout/Canary (small team impact)
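
To keep that conversion low-risk, Argo Rollouts can reference an existing Deployment via `workloadRef` instead of duplicating its pod template in the Rollout spec. A minimal sketch, assuming a Deployment named `payment-service` already exists:

```yaml
# Sketch: adopting an existing Deployment via workloadRef
# (the Deployment keeps owning the pod template; names are illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 10m}
```

This keeps the Git diff small: the Deployment manifest stays in place, and the Rollout only adds the strategy on top.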
Phase 3: Expand Coverage
- Roll out to more services
- Refine analysis templates based on learnings
- Add automated load testing in canary phase
Phase 4: Advanced Patterns
- A/B/n testing for feature validation
- Multi-region progressive rollouts
- Chaos engineering integration
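
For the A/B/n case, Argo Rollouts provides a dedicated Experiment CRD that runs multiple variants side by side for a fixed duration and scores them with the same AnalysisTemplates. A minimal sketch (image tags, variant names, and labels are illustrative):

```yaml
# Sketch: two variants evaluated in parallel for one hour
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: payment-service-ab
spec:
  duration: 1h
  templates:
  - name: variant-a
    replicas: 2
    template:
      metadata:
        labels: {app: payment-service, variant: a}
      spec:
        containers:
        - name: payment-service
          image: org/payment-service:v2.0
  - name: variant-b
    replicas: 2
    template:
      metadata:
        labels: {app: payment-service, variant: b}
      spec:
        containers:
        - name: payment-service
          image: org/payment-service:v2.1
  analyses:
  - name: success-rate
    templateName: success-rate
```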
Integration with Argo CD
Argo Rollouts shines here because it's part of the same ecosystem:
```yaml
# Argo CD Application whose path contains a Rollout manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/org/gitops-repo
    targetRevision: HEAD
    path: apps/payment-service
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
The Rollout resource is just another Kubernetes object — Argo CD manages it like any Deployment.
Common Pitfalls and How to Avoid Them
Insufficient Metrics Coverage
Problem: Canary proceeds based on partial data.
Solution: Require minimum metric samples before promotion decision.
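
In Argo Rollouts terms, this maps to the `count`, `failureLimit`, and `inconclusiveLimit` fields on an AnalysisTemplate metric, plus a success condition that tolerates empty query results. A sketch, assuming the success-rate metric from earlier:

```yaml
# Sketch: require several samples before trusting the result
metrics:
- name: success-rate
  interval: 1m
  count: 5              # take five samples, not one
  failureLimit: 1       # more than one bad sample fails the analysis
  inconclusiveLimit: 2  # too many empty results halts the rollout
  successCondition: len(result) > 0 && result[0] >= 0.95
```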
Overly Aggressive Traffic Shifts
Problem: 50% traffic jump exposes too many users to issues.
Solution: Use smaller steps (5% → 10% → 25% → 50% → 100%).
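
In Flagger, nonuniform increments like these can be expressed with the `stepWeights` list in place of the fixed `stepWeight` increment. A sketch of just the relevant analysis fields:

```yaml
# Sketch: explicit, conservative step sizes instead of a fixed increment
analysis:
  interval: 1m
  threshold: 5
  stepWeights: [5, 10, 25, 50]
```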
Ignoring Cold Start Effects
Problem: New pods show artificially high latency initially.
Solution: Add warmup period or exclude initial metrics from analysis.
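
Argo Rollouts supports this directly via the `initialDelay` field on an analysis metric, which postpones the first sample. A sketch reusing the P99 latency check from earlier:

```yaml
# Sketch: skip the cold-start window before sampling latency
metrics:
- name: p99-latency
  initialDelay: 2m  # ignore the first two minutes after pods start
  interval: 1m
  count: 5
  successCondition: result[0] <= 0.2  # 200ms; the query returns seconds
  provider:
    prometheus:
      address: http://prometheus.monitoring
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        )
```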
When to Choose Which
Choose Argo Rollouts if:
- You're already using Argo CD
- You want tight integration with your GitOps workflow
- You need sophisticated experimentation (A/B/n testing)
Choose Flagger if:
- You use Flux or another GitOps tool
- You prefer keeping native Deployments
- You want simpler, less invasive setup
Conclusion
Progressive delivery isn't just a safety net — it's a competitive advantage. Teams that deploy confidently multiple times per day recover faster from incidents, validate features with real traffic, and reduce the blast radius of bad changes.
The tooling is mature, the patterns are proven, and the integration with existing GitOps workflows is seamless. Whether you choose Argo Rollouts or Flagger, the important step is starting: pick a service, set up your first canary, and let data drive your deployment decisions.
GitOps gave us declarative infrastructure. Progressive delivery gives us declarative confidence in our deployments.
