How do you set up a highly available Kubernetes cluster?

advanced | architecture, devops, sre, cloud architect, CKA
TL;DR

A highly available Kubernetes cluster requires multiple control plane nodes (minimum 3) behind a load balancer for the API servers, an etcd cluster with an odd number of members to maintain quorum, leader election for the scheduler and controller manager, and worker nodes spread across failure domains.

Detailed Answer

A production Kubernetes cluster must tolerate component failures without downtime. High availability (HA) is achieved by running redundant instances of every control plane component and distributing them across failure domains (availability zones, racks, or data centers).

HA Architecture Overview

                    Load Balancer (L4/TCP)
                    |        |        |
            +-------+  +-------+  +-------+
            | CP-1  |  | CP-2  |  | CP-3  |
            | api   |  | api   |  | api   |
            | sched |  | sched |  | sched |
            | cm    |  | cm    |  | cm    |
            | etcd  |  | etcd  |  | etcd  |
            +-------+  +-------+  +-------+
              AZ-1       AZ-2       AZ-3
                    |        |        |
         +------+------+------+------+------+
         | W-1  | W-2  | W-3  | W-4  | W-5  |
         +------+------+------+------+------+

Component-Level HA

kube-apiserver -- The API server is stateless, so multiple instances run in parallel and all actively serve requests (active-active). A layer-4 (TCP) load balancer in front distributes traffic across the healthy API server endpoints.

# kubeadm HA setup with a load balancer endpoint
kubeadm init \
  --control-plane-endpoint "api.k8s.example.com:6443" \
  --upload-certs

# Join additional control plane nodes
kubeadm join api.k8s.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <key>

etcd -- Requires an odd number of members (3 or 5) for quorum. Uses Raft consensus where a majority must agree on writes. In a 3-member cluster, 1 failure is tolerated. In a 5-member cluster, 2 failures are tolerated.
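The quorum size follows directly from majority voting; a quick sketch of the arithmetic (the helper name is illustrative, not an etcd command):

```shell
# Quorum for an n-member etcd cluster: floor(n/2) + 1 (shell integer division)
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # → 2, so a 3-member cluster tolerates 1 failure
quorum 5   # → 3, so a 5-member cluster tolerates 2 failures
quorum 4   # → 3, same tolerance as 3 members
```

Note that an even member count raises the quorum without raising fault tolerance, which is why 3 or 5 members are the standard recommendation.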

kube-scheduler and kube-controller-manager -- Use leader election to ensure only one active instance at a time. The others remain on standby.

# Verify leader election leases
kubectl get leases -n kube-system

# Example output:
# NAME                      HOLDER                              AGE
# kube-controller-manager   cp-1_abc123-def456                  5d
# kube-scheduler            cp-2_ghi789-jkl012                  5d
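Leader election timing is governed by flags shared by both components; the values below are the upstream defaults, shown as a reference rather than a recommended change:

```shell
# Leader-election flags on kube-scheduler and kube-controller-manager (defaults)
--leader-elect=true                  # enable leader election
--leader-elect-lease-duration=15s    # how long an acquired lease is valid
--leader-elect-renew-deadline=10s    # leader must renew before this elapses
--leader-elect-retry-period=2s       # standby instances retry at this interval
```

If the current holder stops renewing, a standby instance acquires the lease once it expires, so failover typically completes within about the lease duration (15 seconds).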

Stacked vs. External etcd

Stacked etcd topology -- etcd runs on the same nodes as other control plane components. This is simpler to set up and requires fewer machines, but a node failure loses both a control plane member and an etcd member simultaneously.

External etcd topology -- etcd runs on dedicated nodes separate from the Kubernetes control plane. This provides better fault isolation and allows independent scaling, but requires more infrastructure.

# kubeadm config for external etcd
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  external:
    endpoints:
    - https://etcd-1.example.com:2379
    - https://etcd-2.example.com:2379
    - https://etcd-3.example.com:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key

Load Balancer Configuration

The load balancer must operate at layer 4 (TCP): clients authenticate to the API server with TLS client certificates, which a layer-7 proxy terminating TLS would break. Common choices include HAProxy, NGINX (stream module), and cloud provider load balancers:

# Example HAProxy configuration for API server HA
frontend k8s-api
    bind *:6443
    mode tcp
    default_backend k8s-api-backend

backend k8s-api-backend
    mode tcp
    balance roundrobin
    option tcp-check
    server cp-1 10.0.1.10:6443 check fall 3 rise 2
    server cp-2 10.0.2.10:6443 check fall 3 rise 2
    server cp-3 10.0.3.10:6443 check fall 3 rise 2
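A single HAProxy instance is itself a single point of failure. On-prem setups often pair two HAProxy nodes with keepalived, which fails a virtual IP over between them via VRRP. A minimal sketch, where the VIP 10.0.0.100 and interface eth0 are assumptions:

```shell
# /etc/keepalived/keepalived.conf on the primary HAProxy node
vrrp_instance k8s_api {
    state MASTER            # the backup node uses state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100            # backup node uses a lower priority, e.g. 90
    virtual_ipaddress {
        10.0.0.100          # clients and kubelets target this VIP on port 6443
    }
}
```

The control-plane endpoint (api.k8s.example.com in the kubeadm example above) then resolves to the VIP rather than to any single load balancer host.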

Worker Node HA

Worker nodes should be spread across failure domains using topology spread constraints:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx:1.27
        resources:
          requests:
            cpu: "250m"
            memory: "128Mi"

Pod Disruption Budgets

PDBs protect applications during voluntary disruptions (node drains, upgrades):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: web
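After creating the PDB, you can confirm how much voluntary disruption the cluster will currently allow:

```shell
kubectl get pdb web-pdb
# The ALLOWED DISRUPTIONS column shows how many matching pods may be
# evicted right now; with minAvailable 50% and 6 healthy replicas it is 3.
```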

Validating HA Setup

# Check all control plane components are running
kubectl get pods -n kube-system -l tier=control-plane -o wide

# Verify etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://10.0.1.10:2379,https://10.0.2.10:2379,https://10.0.3.10:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify leader election is working
kubectl get lease -n kube-system kube-controller-manager -o jsonpath='{.spec.holderIdentity}'
kubectl get lease -n kube-system kube-scheduler -o jsonpath='{.spec.holderIdentity}'

# Check node distribution across zones
kubectl get nodes -L topology.kubernetes.io/zone

# Simulate a control plane failure and verify cluster continues operating
# (do this in a test environment!)
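One concrete way to exercise this on kubeadm clusters is to remove the API server's static pod manifest on one node, which the kubelet treats as stopping that instance. The node name cp-1 is an assumption; run this only in a test environment:

```shell
# Stop one API server by moving its static pod manifest aside
ssh cp-1 'sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/'

# The cluster should keep serving requests through the remaining API servers
kubectl get nodes

# Restore the manifest; the kubelet restarts the API server automatically
ssh cp-1 'sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/'
```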

Managed Kubernetes HA

Cloud providers simplify HA significantly. With EKS, GKE, or AKS, the control plane is fully managed and distributed across availability zones automatically. Your responsibility is ensuring worker nodes are spread across zones using node groups or node pools configured for multiple AZs.
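For example, with eksctl a managed node group can be pinned to multiple availability zones; the cluster and zone names below are assumptions:

```shell
# Create an EKS managed node group spanning three availability zones
eksctl create nodegroup \
  --cluster prod \
  --name workers \
  --nodes 6 \
  --node-zones us-east-1a,us-east-1b,us-east-1c
```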

Why Interviewers Ask This

HA architecture questions reveal whether a candidate can design production-grade clusters. Interviewers evaluate understanding of failure modes, quorum requirements, load balancing strategies, and the trade-offs between stacked and external etcd topologies.

Common Follow-Up Questions

What is the difference between stacked and external etcd topologies?
Stacked etcd runs on the same nodes as the control plane components, reducing infrastructure but coupling failures. External etcd runs on dedicated nodes, providing better isolation and allowing independent scaling of etcd and control plane.
How does leader election work for the scheduler and controller manager?
Only one instance is active at a time. They compete for a Lease object in kube-system. The holder renews it periodically. If the lease expires, another instance takes over within the configured lease duration.
What failure scenarios should you test?
Single control plane node failure, etcd member failure, network partition between AZs, load balancer failover, and simultaneous worker node failures. Chaos engineering tools like Litmus or chaos-mesh help automate this.
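As a sketch of automating this, a Chaos Mesh PodChaos experiment can randomly kill one replica of the web app from earlier and confirm the deployment self-heals without violating its PDB (the default namespace is an assumption):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-web-replica
spec:
  action: pod-kill
  mode: one                 # affect a single randomly chosen matching pod
  selector:
    namespaces:
    - default
    labelSelectors:
      app: web              # matches the web-app Deployment's pods
```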

Key Takeaways

  • HA requires redundancy at every layer: API server, etcd, scheduler, controller manager, and worker nodes
  • etcd quorum (majority of members) is the critical factor; losing quorum means losing the ability to write cluster state
  • A load balancer in front of API server instances is essential for transparent failover