How Do You Implement a Canary Deployment in Kubernetes?

advanced | deployments, devops, sre, CKA, CKAD
TL;DR

A canary deployment gradually shifts a small percentage of traffic to a new version while the majority continues hitting the stable version. If metrics look good, traffic is increased until the canary becomes the new production release.

Detailed Answer

A canary deployment releases a new version of your application to a small subset of users before rolling it out to the entire fleet. The name comes from the "canary in a coal mine" -- if the canary (small release) is healthy, the full rollout proceeds. If not, you pull back before users are affected.

Native Kubernetes Canary (Basic Approach)

The simplest canary in Kubernetes uses two Deployments behind a single Service. Traffic is split based on the ratio of Pod replicas.

Stable Deployment (90% of traffic)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: web-app
      track: stable
  template:
    metadata:
      labels:
        app: web-app
        track: stable
    spec:
      containers:
        - name: web-app
          image: web-app:1.0
          ports:
            - containerPort: 8080

Canary Deployment (10% of traffic)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-app
      track: canary
  template:
    metadata:
      labels:
        app: web-app
        track: canary
    spec:
      containers:
        - name: web-app
          image: web-app:2.0
          ports:
            - containerPort: 8080

Shared Service

apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app      # Matches BOTH stable and canary Pods
  ports:
    - port: 80
      targetPort: 8080

The Service selects all Pods with app: web-app, regardless of the track label. With 9 stable Pods and 1 canary Pod, roughly 10% of requests hit the canary.
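Before sending real users to the canary, it is common to smoke-test the new version directly. A minimal sketch of a preview Service, assuming a hypothetical name `web-app-canary-preview`; by including the `track` label in the selector, it matches only the canary Pods:

```yaml
# Hypothetical preview Service: selects ONLY the canary Pods
# by adding the track label to the selector.
apiVersion: v1
kind: Service
metadata:
  name: web-app-canary-preview
spec:
  selector:
    app: web-app
    track: canary    # narrows the match to canary Pods only
  ports:
    - port: 80
      targetPort: 8080
```

Shifting traffic in the native model means rescaling: for example, `kubectl scale deployment web-app-canary --replicas=3` while scaling the stable Deployment down to 7 keeps the total at 10 and moves the canary to roughly 30% of traffic.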

Limitations of the Native Approach

  • Granularity is limited to replica count. Getting 1% traffic requires 99 stable Pods and 1 canary Pod.
  • No sticky sessions. A single user may alternate between versions.
  • No automated analysis. You must manually monitor and decide whether to promote or roll back.
  • No header-based routing. You cannot route specific users to the canary.

Service Mesh Canary (Precise Control)

For production-grade canary deployments, use Istio's traffic splitting:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
    - web-app
  http:
    - route:
        - destination:
            host: web-app
            subset: stable
          weight: 95
        - destination:
            host: web-app
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-app
spec:
  host: web-app
  subsets:
    - name: stable
      labels:
        track: stable
    - name: canary
      labels:
        track: canary

This sends exactly 5% of traffic to the canary, regardless of the number of replicas.
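Istio also removes the header-based routing limitation of the native approach. A sketch, assuming a hypothetical `x-canary: true` header set by your clients or gateway; matching requests always go to the canary, while everything else follows the weighted split:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
    - web-app
  http:
    # Requests carrying the (hypothetical) x-canary header
    # are routed to the canary subset unconditionally.
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: web-app
            subset: canary
    # All other traffic keeps the 95/5 weighted split.
    - route:
        - destination:
            host: web-app
            subset: stable
          weight: 95
        - destination:
            host: web-app
            subset: canary
          weight: 5
```

Routes are evaluated in order, so the header match must come before the default weighted route.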

Automated Canary with Argo Rollouts

Argo Rollouts provides a Canary strategy with built-in traffic management and analysis:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
      canaryService: web-app-canary
      stableService: web-app-stable
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: web-app-canary
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: web-app:2.0
          ports:
            - containerPort: 8080

This gradually increases traffic: 5% for 5 minutes, then 20%, then 50%, then full rollout. An AnalysisTemplate runs in parallel, querying Prometheus for error rates:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))

If the success rate drops below 99%, the rollout is automatically aborted and traffic shifts back to the stable version.
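Error rate is rarely the only signal worth gating on. A second AnalysisTemplate for tail latency can run alongside the success-rate check; a sketch, assuming the service exposes a Prometheus histogram named `http_request_duration_seconds_bucket` and an example threshold of 500 ms:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency
spec:
  args:
    - name: service-name
  metrics:
    - name: p99-latency
      interval: 60s
      # Abort if p99 latency exceeds 0.5 seconds (example threshold)
      successCondition: result[0] <= 0.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le))
```

Multiple templates can be listed under `analysis.templates` in the Rollout, and all must pass for the rollout to proceed.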

Canary Deployment Workflow

  1. Deploy canary with minimal traffic (1-5%).
  2. Monitor metrics -- error rate, latency, resource usage.
  3. Gradually increase traffic at each step if metrics are healthy.
  4. Promote the canary to become the new stable version once all steps pass.
  5. Roll back immediately if any metric breaches its threshold.
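Flagger automates exactly this workflow. A minimal sketch of its Canary resource, assuming Flagger is installed alongside a service mesh; the `request-success-rate` and `request-duration` metrics are Flagger built-ins:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m        # how often to evaluate metrics
    threshold: 5        # failed checks before automatic rollback
    maxWeight: 50       # stop increasing at 50%, then promote
    stepWeight: 10      # increase canary traffic 10% per step
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99       # built-in metric: % of non-5xx responses
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500      # p99 latency in milliseconds
        interval: 1m
```

Flagger watches the target Deployment for image changes and drives the weight progression, analysis, promotion, and rollback without manual steps.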

When to Use Canary vs. Other Strategies

| Strategy | Best For |
|---|---|
| Rolling update | Standard releases with low risk |
| Blue-green | Instant cutover with full pre-validation |
| Canary | High-risk changes that need gradual validation with real traffic |
| A/B testing | Feature experimentation targeting specific user segments |

Summary

Canary deployments let you test new versions with a small fraction of production traffic before committing to a full rollout. The native Kubernetes approach uses replica ratios, which is simple but imprecise. For production-grade canary releases, tools like Istio, Argo Rollouts, or Flagger provide percentage-based traffic splitting and automated metric analysis. The canary pattern is essential for high-availability systems where a bad release could affect millions of users.

Why Interviewers Ask This

Canary deployments are a critical strategy for reducing risk in production releases. Interviewers want to see that you can implement progressive delivery and understand the tooling involved.

Common Follow-Up Questions

How do you control the exact traffic percentage in a canary deployment?
Native Kubernetes can only approximate traffic percentages through replica ratios. For exact percentage-based splitting you need a service mesh (Istio, Linkerd) or a progressive delivery tool like Argo Rollouts or Flagger.
What metrics should you monitor during a canary rollout?
Error rate, latency (p50, p95, p99), CPU/memory usage, and business metrics like conversion rate or transaction success rate.
What is the difference between canary and A/B testing?
Canary routes a percentage of random traffic to the new version. A/B testing routes specific user segments based on criteria like headers, cookies, or geography.

Key Takeaways

  • Canary deployments minimize blast radius by testing with a fraction of production traffic.
  • Native Kubernetes offers a basic canary via replica ratios, but precise control requires a service mesh.
  • Automated canary analysis compares metrics between the canary and baseline to decide promotion or rollback.
