What is etcd and what role does it play in Kubernetes?

Level: beginner · Tags: architecture, devops, SRE, cloud architect, CKA
TL;DR

etcd is a distributed, strongly consistent key-value store that serves as the backing store for all Kubernetes cluster data. Every object, configuration, and piece of state in the cluster is persisted in etcd, making it the single source of truth.

Detailed Answer

etcd is an open-source, distributed key-value store developed by CoreOS (now part of Red Hat). In Kubernetes, it serves as the persistent storage backend for all cluster data. When you create a Deployment, a Service, a ConfigMap, or any other Kubernetes object, it is serialized and stored in etcd. When you query the API server, it reads from etcd (or its watch cache) to return the current state.

How Kubernetes Uses etcd

The kube-apiserver is the only Kubernetes component that communicates directly with etcd. All other components (scheduler, controller manager, kubelet) interact with cluster state exclusively through the API server. This design provides a single point of access control and ensures consistent serialization of data.
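This relationship is configured through the API server's own flags, which point it at the etcd endpoints and client certificates. A representative excerpt (the paths and the 10.0.0.10 address mirror kubeadm defaults used elsewhere in this article; your cluster's values may differ):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --etcd-servers=https://10.0.0.10:2379
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
```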

Data in etcd is organized under a key prefix, typically /registry/. For example:

  • /registry/pods/default/my-pod -- A pod named "my-pod" in the default namespace
  • /registry/deployments/production/web-app -- A Deployment in the production namespace
  • /registry/services/kube-system/kube-dns -- The kube-dns Service

Raft Consensus

etcd uses the Raft consensus algorithm to replicate data across all members of the cluster. Raft ensures that as long as a majority (quorum) of members are available, the cluster can accept writes. For a 3-member etcd cluster, it can tolerate 1 failure. For 5 members, it can tolerate 2 failures.

| Cluster Size | Quorum | Failure Tolerance |
|--------------|--------|-------------------|
| 1            | 1      | 0                 |
| 3            | 2      | 1                 |
| 5            | 3      | 2                 |
| 7            | 4      | 3                 |
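The table follows directly from the majority rule: quorum is floor(n/2) + 1, and failure tolerance is n minus quorum. A quick sketch in plain shell arithmetic (no cluster required):

```shell
#!/bin/sh
# Compute quorum and failure tolerance for odd etcd cluster sizes.
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))        # majority of n members
  tolerance=$(( n - quorum ))    # members that can fail while quorum holds
  echo "size=$n quorum=$quorum tolerance=$tolerance"
done
```

Note that even cluster sizes add no tolerance: a 4-member cluster has a quorum of 3 and tolerates only 1 failure, the same as 3 members, which is why odd sizes are recommended.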

Backup and Restore

Backing up etcd is the single most important disaster recovery procedure in Kubernetes:

# Create a snapshot backup
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

# Restore from a snapshot (stop the API server first)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --initial-cluster=controlplane=https://10.0.0.10:2380 \
  --initial-advertise-peer-urls=https://10.0.0.10:2380 \
  --name=controlplane

Monitoring etcd Health

# Check etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# List etcd members
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --write-out=table

Inspecting Data Stored in etcd

While you should not directly modify data in etcd, inspecting it can be useful for debugging:

# List all keys under /registry
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only | head -20

# Read a specific key (output is protobuf-encoded)
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod
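Once you have a key listing, standard text tools can summarize what is stored per resource type. A sketch using a small hard-coded sample in place of live output (in a real cluster you would pipe `etcdctl get /registry --prefix --keys-only` into the same awk/sort/uniq pipeline):

```shell
#!/bin/sh
# Count objects per resource type from a /registry key listing.
# These sample keys stand in for live `etcdctl ... --keys-only` output.
keys='/registry/pods/default/my-pod
/registry/pods/kube-system/coredns-abc
/registry/deployments/production/web-app
/registry/services/kube-system/kube-dns'

# Field 3 of each /-separated key is the resource type (pods, services, ...).
summary=$(printf '%s\n' "$keys" | awk -F/ '{print $3}' | sort | uniq -c | sort -rn)
printf '%s\n' "$summary"
```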

Performance Tuning

etcd performance directly impacts the responsiveness of the entire cluster. Key considerations:

  • Storage: Use fast SSDs with low-latency I/O. etcd uses a write-ahead log (WAL) and fsyncs every committed write before acknowledging it. If etcd_disk_wal_fsync_duration_seconds consistently exceeds 10ms, disk I/O is a bottleneck.
  • Network: etcd members communicate over gRPC. Network latency between members should be under 10ms for reliable operation.
  • Compaction: etcd stores all revisions of every key, so the database grows over time. Kubernetes configures automatic compaction, but you should verify it is working by monitoring etcd_mvcc_db_total_size_in_bytes.
  • Defragmentation: After compaction frees space logically, defragmentation reclaims it on disk. Schedule periodic defragmentation during maintenance windows.
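The fsync guideline above can be checked from etcd's Prometheus metrics. A simplified sketch that derives the mean fsync latency from the histogram's _sum and _count series (the two sample lines stand in for real output of scraping the /metrics endpoint with the usual certificate flags; a production check would compute a percentile from the histogram buckets rather than the mean):

```shell
#!/bin/sh
# Flag slow WAL fsyncs using sample etcd metrics output.
metrics='etcd_disk_wal_fsync_duration_seconds_sum 12.5
etcd_disk_wal_fsync_duration_seconds_count 1000'

sum=$(printf '%s\n' "$metrics" | awk '/_sum/ {print $2}')
count=$(printf '%s\n' "$metrics" | awk '/_count/ {print $2}')
# Mean latency in milliseconds (the metric's sum is in seconds).
mean_ms=$(awk -v s="$sum" -v c="$count" 'BEGIN { printf "%.2f", (s / c) * 1000 }')
echo "mean fsync latency: ${mean_ms}ms"
awk -v m="$mean_ms" 'BEGIN { exit !(m > 10) }' && echo "WARNING: disk I/O is a bottleneck"
```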

etcd in the Static Pod Manifest

On kubeadm clusters, etcd runs as a static pod:

# /etc/kubernetes/manifests/etcd.yaml (excerpt)
spec:
  containers:
  - command:
    - etcd
    - --data-dir=/var/lib/etcd
    - --listen-client-urls=https://127.0.0.1:2379,https://10.0.0.10:2379
    - --advertise-client-urls=https://10.0.0.10:2379
    - --listen-peer-urls=https://10.0.0.10:2380
    - --initial-advertise-peer-urls=https://10.0.0.10:2380
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data

The /var/lib/etcd directory holds the actual database and WAL files. Losing this directory without a backup means losing all cluster state.

Why Interviewers Ask This

Interviewers ask about etcd to assess whether a candidate understands where cluster state lives and the implications for backup, disaster recovery, and high availability. Misunderstanding etcd often leads to data loss scenarios in production.

Common Follow-Up Questions

How do you back up and restore etcd?
Use etcdctl snapshot save to create a backup and etcdctl snapshot restore to recover. Regular automated backups are essential for disaster recovery.
What happens if etcd loses quorum?
etcd stops accepting writes, and linearizable reads also fail. No new objects can be created and no state changes can be persisted, so scheduling and controller activity effectively halt, though existing workloads continue running on their nodes.
Should etcd run on SSDs or HDDs?
SSDs are strongly recommended. etcd is sensitive to disk I/O latency, and slow disks cause leader election timeouts and cluster instability.

Key Takeaways

  • etcd is the only stateful component in the control plane and requires careful operational attention
  • It uses the Raft consensus algorithm requiring a majority of members to be available for writes
  • Regular backups of etcd are non-negotiable in production environments