How Do You Implement a Canary Deployment in Kubernetes?
A canary deployment gradually shifts a small percentage of traffic to a new version while the majority continues hitting the stable version. If metrics look good, traffic is increased until the canary becomes the new production release.
Detailed Answer
A canary deployment releases a new version of your application to a small subset of users before rolling it out to the entire fleet. The name comes from the "canary in a coal mine" -- if the canary (small release) is healthy, the full rollout proceeds. If not, you pull back before users are affected.
Native Kubernetes Canary (Basic Approach)
The simplest canary in Kubernetes uses two Deployments behind a single Service. Traffic is split based on the ratio of Pod replicas.
Stable Deployment (90% of traffic)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: web-app
      track: stable
  template:
    metadata:
      labels:
        app: web-app
        track: stable
    spec:
      containers:
        - name: web-app
          image: web-app:1.0
          ports:
            - containerPort: 8080
```
Canary Deployment (10% of traffic)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-app
      track: canary
  template:
    metadata:
      labels:
        app: web-app
        track: canary
    spec:
      containers:
        - name: web-app
          image: web-app:2.0
          ports:
            - containerPort: 8080
```
Shared Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app  # Matches BOTH stable and canary Pods
  ports:
    - port: 80
      targetPort: 8080
```
The Service selects all Pods with app: web-app, regardless of the track label. With 9 stable Pods and 1 canary Pod, roughly 10% of requests hit the canary.
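Because the split is driven purely by replica counts, shifting traffic in this approach is just a matter of scaling the two Deployments. A hedged sketch, assuming the stable and canary Deployments above are already applied to your cluster:

```shell
# Shift the split from 90/10 to 50/50 by rebalancing replica counts.
kubectl scale deployment web-app-stable --replicas=5
kubectl scale deployment web-app-canary --replicas=5

# Promote the canary: move all traffic to v2, then retire the stable Deployment.
kubectl scale deployment web-app-canary --replicas=10
kubectl scale deployment web-app-stable --replicas=0
```

To roll back instead, reverse the scaling: return the stable Deployment to full replica count and scale the canary to zero.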
Limitations of the Native Approach
- Granularity is limited to replica count. Getting 1% traffic requires 99 stable Pods and 1 canary Pod.
- No sticky sessions. A single user may alternate between versions.
- No automated analysis. You must monitor manually and decide when to promote or roll back.
- No header-based routing. You cannot route specific users to the canary.
Service Mesh Canary (Precise Control)
For production-grade canary deployments, use Istio's traffic splitting:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
    - web-app
  http:
    - route:
        - destination:
            host: web-app
            subset: stable
          weight: 95
        - destination:
            host: web-app
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-app
spec:
  host: web-app
  subsets:
    - name: stable
      labels:
        track: stable
    - name: canary
      labels:
        track: canary
```
This splits traffic 95/5 at the request level, independent of replica counts: the canary receives roughly 5% of requests even if it runs on a single Pod.
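Progressing the canary then means editing only the route weights. One way to do this, sketched here under the assumption that the VirtualService above is applied (route index 0 is stable, index 1 is canary), is a JSON patch:

```shell
# Increase the canary share from 5% to 20% without touching any Deployments.
kubectl patch virtualservice web-app --type=json -p='[
  {"op": "replace", "path": "/spec/http/0/route/0/weight", "value": 80},
  {"op": "replace", "path": "/spec/http/0/route/1/weight", "value": 20}
]'
```

In practice, teams usually apply the edited VirtualService from version control rather than patching imperatively, so the declared weights stay the source of truth.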
Automated Canary with Argo Rollouts
Argo Rollouts provides a Canary strategy with built-in traffic management and analysis:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
      canaryService: web-app-canary
      stableService: web-app-stable
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: web-app-canary
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: web-app:2.0
          ports:
            - containerPort: 8080
```
This gradually increases traffic: 5% for 5 minutes, then 20%, then 50%, then full rollout. An AnalysisTemplate runs in parallel, querying Prometheus for error rates:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```
If the success rate drops below 99%, the rollout is automatically aborted and traffic shifts back to the stable version.
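It can be useful to sanity-check the analysis query by hand before trusting it to gate a rollout. A sketch using the Prometheus HTTP query API, assuming the address from the AnalysisTemplate above and a canary Service named web-app-canary (adjust both for your cluster):

```shell
# Run the same success-rate ratio the AnalysisTemplate evaluates.
curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{service="web-app-canary",status=~"2.."}[5m])) / sum(rate(http_requests_total{service="web-app-canary"}[5m]))'
```

If the query returns an empty result here, the AnalysisTemplate will fail the same way, so this catches label mismatches before they abort a release.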
Canary Deployment Workflow
- Deploy canary with minimal traffic (1-5%).
- Monitor metrics -- error rate, latency, resource usage.
- Gradually increase traffic at each step if metrics are healthy.
- Promote the canary to become the new stable version.
- Or roll back immediately if any metric breaches its threshold.
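With Argo Rollouts, the promote and rollback steps of this workflow map onto the kubectl plugin. A sketch, assuming the Rollout named web-app from the example above and the Argo Rollouts kubectl plugin installed:

```shell
# Watch the rollout progress through its canary steps.
kubectl argo rollouts get rollout web-app --watch

# Skip the current pause and advance to the next step.
kubectl argo rollouts promote web-app

# Abort: shift all traffic back to the stable version.
kubectl argo rollouts abort web-app

# Roll back to the previous revision entirely.
kubectl argo rollouts undo web-app
```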
When to Use Canary vs. Other Strategies
| Strategy | Best For |
|---|---|
| Rolling update | Standard releases with low risk |
| Blue-green | Instant cutover with full pre-validation |
| Canary | High-risk changes that need gradual validation with real traffic |
| A/B testing | Feature experimentation targeting specific user segments |
Summary
Canary deployments let you test new versions with a small fraction of production traffic before committing to a full rollout. The native Kubernetes approach uses replica ratios, which is simple but imprecise. For production-grade canary releases, tools like Istio, Argo Rollouts, or Flagger provide percentage-based traffic splitting and automated metric analysis. The canary pattern is essential for high-availability systems where a bad release could affect millions of users.
Why Interviewers Ask This
Canary deployments are a critical strategy for reducing risk in production releases. Interviewers want to see that you can implement progressive delivery and understand the tooling involved.
Key Takeaways
- Canary deployments minimize blast radius by testing with a fraction of production traffic.
- Native Kubernetes offers a basic canary via replica ratios, but precise control requires a service mesh.
- Automated canary analysis compares metrics between the canary and baseline to decide promotion or rollback.