How does etcd ensure data consistency in a Kubernetes cluster?
etcd uses the Raft consensus algorithm to ensure strong consistency across all cluster members. Every write must be acknowledged by a majority (quorum) of members before it is committed. This guarantees that linearizable reads (the default) return the most recent committed write, preventing split-brain scenarios and data divergence.
Detailed Answer
etcd provides strong consistency (linearizability) for all operations, meaning every read reflects the most recent committed write across the entire cluster. This is achieved through the Raft consensus algorithm, a protocol designed for managing a replicated log across distributed systems.
Raft Consensus Fundamentals
Raft organizes cluster members into three roles:
- Leader -- Handles all client writes and replicates them to followers. Only one leader exists at any time.
- Follower -- Receives log entries from the leader and acknowledges them. Responds to reads in serializable mode.
- Candidate -- A follower that has initiated a leader election after not hearing from the leader.
Write Path
When the Kubernetes API server writes to etcd, the following sequence occurs:
1. Client sends write request to etcd (any member)
2. If the member is not the leader, it forwards the request to the leader
3. Leader appends the entry to its local write-ahead log (WAL)
4. Leader replicates the log entry to all followers in parallel
5. Each follower appends to its own WAL and acknowledges
6. Once a majority (quorum) acknowledges, the leader commits the entry
7. Leader applies the entry to its state machine (the key-value store)
8. Leader responds to the client with success
9. Followers learn about the commit and apply it to their state machines
For a 3-member cluster, the leader needs 1 additional acknowledgment (2 out of 3 = quorum). For a 5-member cluster, it needs 2 additional (3 out of 5).
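The quorum arithmetic above can be sketched in a few lines of Python. This is a toy model of the commit rule only (the leader counts its own WAL append plus follower acknowledgments), not etcd's actual implementation:

```python
# Simplified Raft commit rule: an entry commits once a majority
# of the cluster (leader included) has appended it.

def quorum(cluster_size: int) -> int:
    """Smallest majority for a cluster of the given size."""
    return cluster_size // 2 + 1

def is_committed(cluster_size: int, acks: int) -> bool:
    """acks includes the leader's own local WAL append."""
    return acks >= quorum(cluster_size)

# A 3-member cluster commits with the leader plus 1 follower (2/3):
print(quorum(3), is_committed(3, 2))   # 2 True
# A 5-member cluster still lacks quorum with only 2 acks (needs 3/5):
print(quorum(5), is_committed(5, 2))   # 3 False
```

Note that the quorum is a majority of all configured members, not of the members currently reachable; this is what prevents two partitions from committing independently.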
Leader Election
If the leader fails or becomes network-partitioned, Raft triggers an automatic leader election:
1. Followers have a randomized election timeout (the Raft paper suggests 150-300ms; etcd's default election timeout is 1000ms)
2. A follower that times out without hearing from the leader becomes a Candidate
3. The Candidate increments its term number and votes for itself
4. It requests votes from all other members
5. Members grant their vote if:
- They haven't voted in this term yet
- The Candidate's log is at least as up-to-date as their own
6. If the Candidate receives a majority of votes, it becomes the new Leader
7. The new Leader begins sending heartbeats to prevent further elections
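The vote-granting check in step 5 can be sketched as follows. This is a simplified model where a member's log is summarized by its last entry's (term, index) pair; the names are illustrative, not etcd internals:

```python
# Sketch of Raft's vote-granting rule: grant only if we haven't
# voted for someone else this term AND the candidate's log is at
# least as up-to-date as ours.

def log_up_to_date(candidate_last, voter_last):
    """Raft compares last-entry terms first, then last-entry indexes."""
    c_term, c_index = candidate_last
    v_term, v_index = voter_last
    if c_term != v_term:
        return c_term > v_term
    return c_index >= v_index

def grant_vote(voted_for, term, candidate_id, candidate_last, voter_last):
    """voted_for maps term -> candidate this member already voted for."""
    not_yet_voted = voted_for.get(term) in (None, candidate_id)
    if not_yet_voted and log_up_to_date(candidate_last, voter_last):
        voted_for[term] = candidate_id
        return True
    return False

votes = {}
print(grant_vote(votes, 6, "member-b", (5, 100), (5, 100)))  # True
print(grant_vote(votes, 6, "member-c", (5, 120), (5, 100)))  # False: already voted in term 6
```

The log up-to-date check is what guarantees a new leader already holds every committed entry: a candidate missing committed entries cannot win votes from a quorum.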
# Check the current leader
ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://10.0.1.10:2379,https://10.0.2.10:2379,https://10.0.3.10:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--write-out=table
# Output shows which member is leader:
# ENDPOINT          ID      VERSION  DB SIZE  IS LEADER  RAFT TERM  RAFT INDEX
# 10.0.1.10:2379    abc123  3.5.12   20 MB    true       5          150234
# 10.0.2.10:2379    def456  3.5.12   20 MB    false      5          150234
# 10.0.3.10:2379    ghi789  3.5.12   20 MB    false      5          150234
Read Consistency Levels
etcd supports two read consistency levels:
Linearizable reads (default) -- The read request goes to the leader, which confirms it is still the leader with a quorum check before responding. This guarantees the returned data reflects the most recent committed write. The Kubernetes API server uses linearizable reads by default.
Serializable reads -- Any member can serve the read from its local state without contacting the leader. This is faster but may return slightly stale data if the member has not yet applied the latest committed entries.
# Linearizable read (default)
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod \
--consistency=l
# Serializable read (faster but potentially stale)
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod \
--consistency=s
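The staleness risk of serializable reads comes from the gap between the committed log and what each member has applied. A toy model (not etcd code) makes this concrete:

```python
# Why a serializable read can be stale: a follower serves reads from
# its locally applied state, which may lag behind the leader's.

class Member:
    def __init__(self):
        self.log = []        # committed entries: (key, value)
        self.applied = 0     # how many log entries have been applied
        self.store = {}      # the key-value state machine

    def apply_up_to(self, n):
        for key, value in self.log[self.applied:n]:
            self.store[key] = value
        self.applied = n

leader, follower = Member(), Member()
for m in (leader, follower):
    m.log.append(("/registry/pods/default/my-pod", "v1"))
    m.log.append(("/registry/pods/default/my-pod", "v2"))

leader.apply_up_to(2)     # leader has applied everything
follower.apply_up_to(1)   # follower lags by one committed entry

# Linearizable read (via leader) sees v2; a serializable read
# served by the lagging follower still sees v1.
print(leader.store["/registry/pods/default/my-pod"])    # v2
print(follower.store["/registry/pods/default/my-pod"])  # v1
```

In real etcd the lag is usually milliseconds, but controllers that must not act on stale data (for example, to avoid conflicting writes) rely on the linearizable default.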
Revisions and MVCC
etcd implements Multi-Version Concurrency Control (MVCC). Every modification increments a global revision number. This allows:
- Watch from a specific revision -- Kubernetes controllers watch from a known revision, so they never miss events
- Historical queries -- Read the value of a key at a previous revision
- Transactions -- Atomic compare-and-swap operations using revision-based preconditions
# Get the current revision
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod -w json | python3 -c "
import json,sys; d=json.load(sys.stdin); print('Revision:', d['header']['revision'])"
# Watch from a specific revision
ETCDCTL_API=3 etcdctl watch /registry/pods --prefix --rev=150000
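The revision mechanics above can be sketched with a minimal MVCC store: every write bumps a single global revision, and older versions stay readable. This is illustrative only, not etcd's bbolt-backed implementation:

```python
# Toy MVCC store: one global revision counter, full version
# history per key, reads at any past revision.

class MVCCStore:
    def __init__(self):
        self.revision = 0
        self.history = {}   # key -> list of (revision, value)

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, rev=None):
        """Latest value at or before `rev` (defaults to current revision)."""
        rev = self.revision if rev is None else rev
        versions = [v for r, v in self.history.get(key, []) if r <= rev]
        return versions[-1] if versions else None

store = MVCCStore()
store.put("/registry/pods/default/my-pod", "Pending")   # revision 1
store.put("/registry/pods/default/my-pod", "Running")   # revision 2

print(store.get("/registry/pods/default/my-pod"))          # Running
print(store.get("/registry/pods/default/my-pod", rev=1))   # Pending
```

Watches work on the same principle: a watcher started at `rev=N` replays every change with revision greater than N, which is how Kubernetes controllers resume after a restart without missing events.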
Performance and Operational Impact
The consistency model directly affects performance and operations:
Disk I/O -- Every committed write requires an fsync to the WAL. Slow disks cause Raft heartbeat timeouts, triggering unnecessary leader elections. The etcd_disk_wal_fsync_duration_seconds metric should stay below 10ms.
Network latency -- Raft replication requires round-trips between the leader and followers. High latency between etcd members increases write latency and can cause election instability. Keep etcd members in the same region.
Cluster sizing -- Larger clusters (5 or 7 members) tolerate more failures but increase write latency since more members must acknowledge each write. Most production Kubernetes clusters use 3 or 5 etcd members.
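The sizing trade-off follows directly from the quorum formula; a quick sketch puts numbers on it:

```python
# Failure tolerance per cluster size: quorum = majority,
# tolerated failures = size - quorum.

def fault_tolerance(size):
    q = size // 2 + 1
    return q, size - q

for size in (1, 3, 5, 7):
    q, f = fault_tolerance(size)
    print(f"{size} members: quorum {q}, tolerates {f} failure(s)")
```

Note that even-sized clusters are avoided: 4 members need a quorum of 3 and tolerate only 1 failure, the same as 3 members, while adding replication cost.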
# Monitor critical etcd metrics
# WAL fsync latency (should be < 10ms)
curl -s http://localhost:2381/metrics | grep etcd_disk_wal_fsync_duration
# Raft proposal failures (should be 0 in steady state)
curl -s http://localhost:2381/metrics | grep etcd_server_proposals_failed_total
# Leader changes (frequent changes indicate instability)
curl -s http://localhost:2381/metrics | grep etcd_server_leader_changes_seen_total
# Database size
curl -s http://localhost:2381/metrics | grep etcd_mvcc_db_total_size_in_bytes
Compaction and Defragmentation
Because of MVCC, etcd retains all revisions of every key. Without compaction, the database grows indefinitely:
# Compact old revisions (Kubernetes does this automatically)
ETCDCTL_API=3 etcdctl compact 150000
# Defragment to reclaim disk space after compaction
ETCDCTL_API=3 etcdctl defrag --endpoints=https://10.0.1.10:2379
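Conceptually, compaction drops versions older than the compaction revision while keeping each live key readable. A toy sketch of this idea (not etcd's actual storage layout):

```python
# Toy compaction over an MVCC history: discard versions below the
# compaction revision, but retain each key's latest version so
# current values stay readable.

def compact(history, compact_rev):
    """history: key -> [(revision, value), ...] in revision order."""
    compacted = {}
    for key, versions in history.items():
        keep = [(r, v) for r, v in versions if r >= compact_rev]
        if not keep and versions:
            keep = [versions[-1]]   # key's current value must survive
        compacted[key] = keep
    return compacted

history = {"/registry/pods/default/my-pod": [(1, "Pending"), (2, "Running"), (5, "Running")]}
print(compact(history, 3))   # only the revision-5 version survives
```

This also explains why defragmentation is a separate step: compaction frees logical space inside the database file, but the file itself only shrinks after `etcdctl defrag` rewrites it.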
Why Interviewers Ask This
This question evaluates deep understanding of distributed systems concepts underlying Kubernetes. Interviewers want to know if a candidate can reason about consistency guarantees, failure scenarios, and the operational implications of etcd's consensus model on cluster reliability.
Key Takeaways
- Raft requires a majority quorum for every write, providing strong consistency guarantees
- Leader election happens automatically when the current leader fails, with brief write unavailability
- Understanding etcd's consistency model is essential for sizing clusters and planning failure tolerance