How does etcd ensure data consistency in a Kubernetes cluster?
etcd uses the Raft consensus algorithm to ensure strong consistency across all cluster members. Every write must be acknowledged by a majority (quorum) of members before it is committed. This guarantees that linearizable reads (the default) return the most recent committed write, preventing split-brain scenarios and data divergence.
Detailed Answer
etcd provides strong consistency (linearizability) for all operations, meaning every read reflects the most recent committed write across the entire cluster. This is achieved through the Raft consensus algorithm, a protocol designed for managing a replicated log across distributed systems.
Raft Consensus Fundamentals
Raft organizes cluster members into three roles:
- Leader -- Handles all client writes and replicates them to followers. Only one leader exists at any time.
- Follower -- Receives log entries from the leader and acknowledges them. Responds to reads in serializable mode.
- Candidate -- A follower that has initiated a leader election after not hearing from the leader.
Write Path
When the Kubernetes API server writes to etcd, the following sequence occurs:
1. Client sends write request to etcd (any member)
2. If the member is not the leader, it forwards the request to the leader
3. Leader appends the entry to its local write-ahead log (WAL)
4. Leader replicates the log entry to all followers in parallel
5. Each follower appends to its own WAL and acknowledges
6. Once a majority (quorum) acknowledges, the leader commits the entry
7. Leader applies the entry to its state machine (the key-value store)
8. Leader responds to the client with success
9. Followers learn about the commit and apply it to their state machines
For a 3-member cluster, the leader needs 1 additional acknowledgment (2 out of 3 = quorum). For a 5-member cluster, it needs 2 additional (3 out of 5).
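The quorum arithmetic above can be sketched in a few lines of Python. This is a toy model of the commit rule only (the leader counts its own WAL append plus follower acknowledgments), not etcd's actual implementation:

```python
# Simplified Raft commit rule: an entry commits once a majority
# of the cluster (leader included) has appended it.

def quorum(cluster_size: int) -> int:
    """Smallest majority for a cluster of the given size."""
    return cluster_size // 2 + 1

def is_committed(cluster_size: int, acks: int) -> bool:
    """acks includes the leader's own local WAL append."""
    return acks >= quorum(cluster_size)

# A 3-member cluster commits with the leader plus 1 follower (2/3):
print(quorum(3), is_committed(3, 2))   # 2 True
# A 5-member cluster still lacks quorum with only 2 acks (needs 3/5):
print(quorum(5), is_committed(5, 2))   # 3 False
```

Note that the quorum is a majority of all configured members, not of the members currently reachable; this is what prevents two partitions from committing independently.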
Leader Election
If the leader fails or becomes network-partitioned, Raft triggers an automatic leader election:
1. Followers have a randomized election timeout (the Raft paper suggests 150-300ms; etcd's default election timeout is 1000ms)
2. A follower that times out without hearing from the leader becomes a Candidate
3. The Candidate increments its term number and votes for itself
4. It requests votes from all other members
5. Members grant their vote if:
- They haven't voted in this term yet
- The Candidate's log is at least as up-to-date as their own
6. If the Candidate receives a majority of votes, it becomes the new Leader
7. The new Leader begins sending heartbeats to prevent further elections
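The vote-granting check in step 5 can be sketched as follows. This is a simplified model where a member's log is summarized by its last entry's (term, index) pair; the names are illustrative, not etcd internals:

```python
# Sketch of Raft's vote-granting rule: grant only if we haven't
# voted for someone else this term AND the candidate's log is at
# least as up-to-date as ours.

def log_up_to_date(candidate_last, voter_last):
    """Raft compares last-entry terms first, then last-entry indexes."""
    c_term, c_index = candidate_last
    v_term, v_index = voter_last
    if c_term != v_term:
        return c_term > v_term
    return c_index >= v_index

def grant_vote(voted_for, term, candidate_id, candidate_last, voter_last):
    """voted_for maps term -> candidate this member already voted for."""
    not_yet_voted = voted_for.get(term) in (None, candidate_id)
    if not_yet_voted and log_up_to_date(candidate_last, voter_last):
        voted_for[term] = candidate_id
        return True
    return False

votes = {}
print(grant_vote(votes, 6, "member-b", (5, 100), (5, 100)))  # True
print(grant_vote(votes, 6, "member-c", (5, 120), (5, 100)))  # False: already voted in term 6
```

The log up-to-date check is what guarantees a new leader already holds every committed entry: a candidate missing committed entries cannot win votes from a quorum.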
# Check the current leader
ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://10.0.1.10:2379,https://10.0.2.10:2379,https://10.0.3.10:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--write-out=table
# Output shows which member is leader:
# ENDPOINT          ID      VERSION  DB SIZE  IS LEADER  RAFT TERM  RAFT INDEX
# 10.0.1.10:2379    abc123  3.5.12   20 MB    true       5          150234
# 10.0.2.10:2379    def456  3.5.12   20 MB    false      5          150234
# 10.0.3.10:2379    ghi789  3.5.12   20 MB    false      5          150234
Read Consistency Levels
etcd supports two read consistency levels:
Linearizable reads (default) -- The read request goes to the leader, which confirms it is still the leader with a quorum check before responding. This guarantees the returned data reflects the most recent committed write. The Kubernetes API server uses linearizable reads by default.
Serializable reads -- Any member can serve the read from its local state without contacting the leader. This is faster but may return slightly stale data if the member has not yet applied the latest committed entries.
# Linearizable read (default)
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod \
--consistency=l
# Serializable read (faster but potentially stale)
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod \
--consistency=s
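The staleness risk of serializable reads comes from the gap between the committed log and what each member has applied. A toy model (not etcd code) makes this concrete:

```python
# Why a serializable read can be stale: a follower serves reads from
# its locally applied state, which may lag behind the leader's.

class Member:
    def __init__(self):
        self.log = []        # committed entries: (key, value)
        self.applied = 0     # how many log entries have been applied
        self.store = {}      # the key-value state machine

    def apply_up_to(self, n):
        for key, value in self.log[self.applied:n]:
            self.store[key] = value
        self.applied = n

leader, follower = Member(), Member()
for m in (leader, follower):
    m.log.append(("/registry/pods/default/my-pod", "v1"))
    m.log.append(("/registry/pods/default/my-pod", "v2"))

leader.apply_up_to(2)     # leader has applied everything
follower.apply_up_to(1)   # follower lags by one committed entry

# Linearizable read (via leader) sees v2; a serializable read
# served by the lagging follower still sees v1.
print(leader.store["/registry/pods/default/my-pod"])    # v2
print(follower.store["/registry/pods/default/my-pod"])  # v1
```

In real etcd the lag is usually milliseconds, but controllers that must not act on stale data (for example, to avoid conflicting writes) rely on the linearizable default.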
Revisions and MVCC
etcd implements Multi-Version Concurrency Control (MVCC). Every modification increments a global revision number. This allows:
- Watch from a specific revision -- Kubernetes controllers watch from a known revision, so they never miss events
- Historical queries -- Read the value of a key at a previous revision
- Transactions -- Atomic compare-and-swap operations using revision-based preconditions
# Get the current revision
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod -w json | python3 -c "
import json,sys; d=json.load(sys.stdin); print('Revision:', d['header']['revision'])"
# Watch from a specific revision
ETCDCTL_API=3 etcdctl watch /registry/pods --prefix --rev=150000
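The revision mechanics above can be sketched with a minimal MVCC store: every write bumps a single global revision, and older versions stay readable. This is illustrative only, not etcd's bbolt-backed implementation:

```python
# Toy MVCC store: one global revision counter, full version
# history per key, reads at any past revision.

class MVCCStore:
    def __init__(self):
        self.revision = 0
        self.history = {}   # key -> list of (revision, value)

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, rev=None):
        """Latest value at or before `rev` (defaults to current revision)."""
        rev = self.revision if rev is None else rev
        versions = [v for r, v in self.history.get(key, []) if r <= rev]
        return versions[-1] if versions else None

store = MVCCStore()
store.put("/registry/pods/default/my-pod", "Pending")   # revision 1
store.put("/registry/pods/default/my-pod", "Running")   # revision 2

print(store.get("/registry/pods/default/my-pod"))          # Running
print(store.get("/registry/pods/default/my-pod", rev=1))   # Pending
```

Watches work on the same principle: a watcher started at `rev=N` replays every change with revision greater than N, which is how Kubernetes controllers resume after a restart without missing events.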
Performance and Operational Impact
The consistency model directly affects performance and operations:
Disk I/O -- Every committed write requires an fsync to the WAL. Slow disks cause Raft heartbeat timeouts, triggering unnecessary leader elections. The etcd_disk_wal_fsync_duration_seconds metric should stay below 10ms.
Network latency -- Raft replication requires round-trips between the leader and followers. High latency between etcd members increases write latency and can cause election instability. Keep etcd members in the same region.
Cluster sizing -- Larger clusters (5 or 7 members) tolerate more failures but increase write latency since more members must acknowledge each write. Most production Kubernetes clusters use 3 or 5 etcd members.
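The sizing trade-off follows directly from the quorum formula; a quick sketch puts numbers on it:

```python
# Failure tolerance per cluster size: quorum = majority,
# tolerated failures = size - quorum.

def fault_tolerance(size):
    q = size // 2 + 1
    return q, size - q

for size in (1, 3, 5, 7):
    q, f = fault_tolerance(size)
    print(f"{size} members: quorum {q}, tolerates {f} failure(s)")
```

Note that even-sized clusters are avoided: 4 members need a quorum of 3 and tolerate only 1 failure, the same as 3 members, while adding replication cost.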
# Monitor critical etcd metrics
# WAL fsync latency (should be < 10ms)
curl -s http://localhost:2381/metrics | grep etcd_disk_wal_fsync_duration
# Raft proposal failures (should be 0 in steady state)
curl -s http://localhost:2381/metrics | grep etcd_server_proposals_failed_total
# Leader changes (frequent changes indicate instability)
curl -s http://localhost:2381/metrics | grep etcd_server_leader_changes_seen_total
# Database size
curl -s http://localhost:2381/metrics | grep etcd_mvcc_db_total_size_in_bytes
Compaction and Defragmentation
Because of MVCC, etcd retains all revisions of every key. Without compaction, the database grows indefinitely:
# Compact old revisions (Kubernetes does this automatically)
ETCDCTL_API=3 etcdctl compact 150000
# Defragment to reclaim disk space after compaction
ETCDCTL_API=3 etcdctl defrag --endpoints=https://10.0.1.10:2379
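Conceptually, compaction drops versions older than the compaction revision while keeping each live key readable. A toy sketch of this idea (not etcd's actual storage layout):

```python
# Toy compaction over an MVCC history: discard versions below the
# compaction revision, but retain each key's latest version so
# current values stay readable.

def compact(history, compact_rev):
    """history: key -> [(revision, value), ...] in revision order."""
    compacted = {}
    for key, versions in history.items():
        keep = [(r, v) for r, v in versions if r >= compact_rev]
        if not keep and versions:
            keep = [versions[-1]]   # key's current value must survive
        compacted[key] = keep
    return compacted

history = {"/registry/pods/default/my-pod": [(1, "Pending"), (2, "Running"), (5, "Running")]}
print(compact(history, 3))   # only the revision-5 version survives
```

This also explains why defragmentation is a separate step: compaction frees logical space inside the database file, but the file itself only shrinks after `etcdctl defrag` rewrites it.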
Why Interviewers Ask This
This question evaluates deep understanding of distributed systems concepts underlying Kubernetes. Interviewers want to know if a candidate can reason about consistency guarantees, failure scenarios, and the operational implications of etcd's consensus model on cluster reliability.
Key Takeaways
- Raft requires a majority quorum for every write, providing strong consistency guarantees
- Leader election happens automatically when the current leader fails, with brief write unavailability
- Understanding etcd's consistency model is essential for sizing clusters and planning failure tolerance