Kubernetes Architecture Explained — A Deep Dive

The Big Picture

A Kubernetes cluster is split into two layers: the control plane, which makes decisions about the cluster, and the worker nodes, which run your actual workloads. Every interaction — from running kubectl apply to a pod being scheduled — flows through a well-defined chain of components.

Here's the high-level layout:

┌─────────────────────────────────────────────────────────┐
│                     CONTROL PLANE                        │
│                                                          │
│  ┌──────────────┐  ┌────────────────────┐  ┌──────────┐ │
│  │kube-apiserver│  │kube-controller-mgr │  │kube-sched│ │
│  │              │  │                    │  │          │ │
│  │  (REST API)  │  │ (reconciliation    │  │ (pod     │ │
│  │              │  │  loops)            │  │placement)│ │
│  └──────┬──▲────┘  └────────┬───────────┘  └────┬─────┘ │
│         │  │                │                   │       │
│         │  └────────────────┴───────────────────┘       │
│         ▼                                               │
│  ┌──────────────┐                                       │
│  │    etcd      │                                       │
│  │ (cluster     │                                       │
│  │  state store)│                                       │
│  └──────────────┘                                       │
└─────────────────────────────────────────────────────────┘
         │
         │ Watch/API calls over TLS
         │
┌────────▼──────────────────────────────────────────────────┐
│                      WORKER NODE                           │
│                                                            │
│  ┌──────────┐    ┌────────────┐    ┌───────────────────┐  │
│  │  kubelet │    │ kube-proxy │    │ container runtime │  │
│  │          │    │            │    │ (containerd/CRI-O)│  │
│  │ (pod     │    │ (service   │    │                   │  │
│  │  manager)│    │  routing)  │    │                   │  │
│  └──────────┘    └────────────┘    └───────────────────┘  │
│                                                            │
│  ┌────────┐  ┌────────┐  ┌────────┐                       │
│  │ Pod A  │  │ Pod B  │  │ Pod C  │                       │
│  └────────┘  └────────┘  └────────┘                       │
└────────────────────────────────────────────────────────────┘

Every arrow in this diagram is an API call over TLS. No component talks directly to etcd except the API server. This is a deliberate design choice that we'll explore throughout this guide.

Control Plane Components

kube-apiserver: The Front Door

The API server is the only component that directly reads from and writes to etcd. Every other component — the scheduler, controller manager, kubelet, even kubectl — communicates through it.

What it actually does:

  1. Authentication: Validates who you are (certificates, tokens, OIDC).
  2. Authorization: Checks if you're allowed to do what you're requesting (RBAC, ABAC, webhook).
  3. Admission control: Mutating and validating webhooks modify or reject requests before they're persisted.
  4. Validation: Ensures the object schema is correct.
  5. Persistence: Writes the object to etcd.
  6. Notification: Informs watchers that a resource has changed.

You can interact with the API server directly to understand what it does:

# See all API resources the server exposes
kubectl api-resources

# Make a raw API call
kubectl get --raw /api/v1/namespaces/default/pods

# Check which API versions are available
kubectl api-versions

# Inspect the full spec of any resource
kubectl explain pod.spec --recursive

The API server is stateless — you can run multiple instances behind a load balancer for high availability. All state lives in etcd.

Key interview insight: When someone says "Kubernetes is declarative," what they mean mechanically is that you POST a desired state to the API server, and controllers (which watch the API server) continuously reconcile actual state toward that desired state.

etcd: The Source of Truth

etcd is a distributed key-value store that holds the entire cluster state. Every object you create — pods, services, secrets, config maps — is stored here as a serialized protobuf.

What's stored in etcd:

/registry/pods/default/my-pod
/registry/deployments/default/my-deployment
/registry/services/specs/default/my-service
/registry/secrets/default/my-secret
/registry/events/default/my-pod.17a3b2c1

Critical properties of etcd in a Kubernetes context:

  • Consistency: etcd uses the Raft consensus algorithm. A write is committed only when a majority of etcd nodes acknowledge it.
  • Watch mechanism: Components can watch for changes to specific keys or prefixes. This is how the scheduler knows when a new pod needs to be placed — it watches for pods with no spec.nodeName.
  • Compaction and defragmentation: etcd keeps a history of revisions. Without regular compaction, it grows unbounded. Kubernetes handles this automatically, but you need to understand it for disaster recovery.
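The watch mechanism is easier to internalize with a toy model: a store that stamps every write with a monotonically increasing revision and notifies subscribers registered on a key prefix. This is an illustration of the idea in Python, not etcd's actual implementation — the class and key names are invented.

```python
# Toy model of etcd's watch semantics: every write bumps a global
# revision, and watchers registered on a key prefix are notified.

class ToyStore:
    def __init__(self):
        self.data = {}          # key -> (value, mod_revision)
        self.revision = 0       # global, monotonically increasing
        self.watchers = []      # (prefix, callback)

    def put(self, key, value):
        self.revision += 1
        self.data[key] = (value, self.revision)
        for prefix, callback in self.watchers:
            if key.startswith(prefix):
                callback(key, value, self.revision)

    def watch(self, prefix, callback):
        self.watchers.append((prefix, callback))


store = ToyStore()
seen = []
# A "scheduler" watching the pods prefix, like a watch on /registry/pods/
store.watch("/registry/pods/", lambda k, v, rev: seen.append((k, rev)))

store.put("/registry/pods/default/my-pod", {"nodeName": None})
store.put("/registry/services/specs/default/my-svc", {})  # prefix not matched
print(seen)  # → [('/registry/pods/default/my-pod', 1)]
```

The real store delivers watch events over a gRPC stream and lets a client resume from a past revision, which is what makes missed-event recovery possible.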

Backing up etcd is the single most important backup operation in a Kubernetes cluster:

# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table

What happens if etcd goes down? The API server cannot read or write state. Existing pods continue running (kubelet manages them locally), but no new scheduling, scaling, or deployments can occur. The cluster is effectively frozen.

kube-scheduler: Placing Pods on Nodes

When you create a pod (directly or through a Deployment), it initially has no spec.nodeName. The scheduler's job is to find the best node for it.

The scheduling process has two phases:

Phase 1 — Filtering: Eliminate nodes that can't run the pod.

  • Does the node have enough CPU and memory to satisfy the pod's requests?
  • Does the node match the pod's nodeSelector or nodeAffinity rules?
  • Does the node have taints that the pod doesn't tolerate?
  • Does the pod request a specific port that's already in use on the node?

Phase 2 — Scoring: Rank the remaining nodes to pick the best one.

  • NodeResourcesFit (least-allocated strategy): Prefer nodes with more available resources (spreads load).
  • NodeResourcesBalancedAllocation: Prefer nodes where CPU and memory usage ratios are similar.
  • NodeAffinity: Prefer nodes matching the pod's preferred (not required) affinities.
  • PodTopologySpread: Honor topology spread constraints to distribute pods across failure domains.

The scheduler then binds the pod by writing the chosen node name into spec.nodeName (through a Binding request to the API server). The kubelet on that node picks it up from there.
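The two phases can be sketched as a filter-then-score loop. This is a simplified model — the node and pod shapes are made up for the demo, and the real scheduler framework runs many plugins per phase — but the structure is the same:

```python
# Sketch of the two-phase scheduling loop: filter out infeasible nodes,
# then score the remainder and pick the best.

def feasible(node, pod):
    # Filtering: enough free resources, and no taints the pod doesn't tolerate
    fits = node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]
    untolerated = any(t not in pod["tolerations"] for t in node["taints"])
    return fits and not untolerated

def score(node, pod):
    # Scoring: least-allocated style -- prefer nodes with more headroom
    return (node["free_cpu"] - pod["cpu"]) + (node["free_mem"] - pod["mem"])

def schedule(pod, nodes):
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None  # pod stays Pending
    return max(candidates, key=lambda n: score(n, pod))["name"]

nodes = [
    {"name": "worker-1", "free_cpu": 2, "free_mem": 4, "taints": ["gpu"]},
    {"name": "worker-2", "free_cpu": 4, "free_mem": 8, "taints": []},
    {"name": "worker-3", "free_cpu": 1, "free_mem": 1, "taints": []},
]
pod = {"cpu": 2, "mem": 2, "tolerations": []}
print(schedule(pod, nodes))  # → worker-2 (worker-1 tainted, worker-3 too small)
```

If the candidate list comes back empty, the real scheduler emits exactly the kind of "0/3 nodes are available" event shown below.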

You can see the scheduler's decision-making:

# Check why a pod is Pending
kubectl describe pod my-pod
# Look under Events for messages like:
#   0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had
#   untolerated taint {node.kubernetes.io/not-ready: }.

# Check node resources
kubectl describe node worker-1
# Look at "Allocated resources" to see how much is committed vs available

Taints and tolerations are a common interview topic:

# Taint a node — no pods will be scheduled unless they tolerate it
kubectl taint nodes worker-1 gpu=true:NoSchedule

# A pod that tolerates this taint
# spec:
#   tolerations:
#   - key: "gpu"
#     operator: "Equal"
#     value: "true"
#     effect: "NoSchedule"

kube-controller-manager: The Reconciliation Engine

The controller manager runs dozens of control loops, each responsible for one type of resource. The pattern is always the same:

  1. Watch the API server for changes to a specific resource type
  2. Compare desired state (what the user specified) with actual state (what's running)
  3. Take action to reconcile the difference
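The pattern is small enough to sketch directly. A toy reconcile function for a ReplicaSet-style controller — the signatures are invented for illustration, not controller-runtime code:

```python
# The universal controller pattern as a minimal loop: compare desired vs
# actual replica counts and act only on the difference.

def reconcile(desired_replicas, actual_pods, create_pod, delete_pod):
    diff = desired_replicas - len(actual_pods)
    if diff > 0:
        for _ in range(diff):
            create_pod()
    elif diff < 0:
        for _ in range(-diff):
            delete_pod()

pods = []
reconcile(3, pods, create_pod=lambda: pods.append("pod"), delete_pod=pods.pop)
print(len(pods))  # → 3

pods.pop()  # simulate a pod dying
reconcile(3, pods, create_pod=lambda: pods.append("pod"), delete_pod=pods.pop)
print(len(pods))  # → 3: the controller replaced it
```

Note that the loop compares state, not events — it converges correctly no matter how the pod count drifted, which is the level-triggered behavior discussed at the end of this guide.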

Key controllers and what they do:

Deployment controller: Watches Deployments. When you update a Deployment's pod template, it creates a new ReplicaSet and scales it up while scaling the old one down.

ReplicaSet controller: Watches ReplicaSets. Ensures the correct number of pods exist. If a pod dies, it creates a new one.

Node controller: Monitors node health. If a node stops reporting heartbeats, the controller marks it as NotReady and eventually evicts its pods (after the pod-eviction-timeout).

Endpoint controller: Watches Services and Pods. Maintains the Endpoints object that maps a Service to the set of pod IPs that match its selector.

Job controller: Watches Jobs. Creates pods to execute the job and tracks their completion.

ServiceAccount controller: Creates the default ServiceAccount in new namespaces.

You can see controllers in action:

# Delete a pod managed by a ReplicaSet
kubectl delete pod my-deployment-abc123-xyz

# Watch the ReplicaSet controller immediately create a replacement
kubectl get pods -w

# Check the ReplicaSet's events
kubectl describe rs my-deployment-abc123
# Events:
#   Created pod: my-deployment-abc123-new

How controller leader election works: In an HA setup with multiple control plane nodes, only one instance of the controller manager is active at a time. The others are on standby. They use a Lease object in Kubernetes to elect a leader. If the leader fails, another instance acquires the lease within seconds.
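The lease mechanics can be modeled with a fake clock: a candidate wins the lease only when the previous holder has stopped renewing it for longer than the lease duration. A sketch of the idea, not the real Lease API:

```python
# Miniature lease-based leader election. Candidates call try_acquire on a
# schedule; the current leader renews, and standbys take over on expiry.

class Lease:
    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.renewed_at = None

    def try_acquire(self, candidate, now):
        expired = (self.holder is None or
                   now - self.renewed_at > self.duration)
        if expired:
            self.holder, self.renewed_at = candidate, now
        elif self.holder == candidate:
            self.renewed_at = now   # leader renews its own lease
        return self.holder == candidate

lease = Lease(duration=15)
assert lease.try_acquire("cm-1", now=0)      # cm-1 becomes leader
assert not lease.try_acquire("cm-2", now=5)  # lease still held, cm-2 stands by
assert lease.try_acquire("cm-1", now=10)     # leader renews
# cm-1 crashes and stops renewing; cm-2 takes over once the lease expires
assert lease.try_acquire("cm-2", now=30)
print(lease.holder)  # → cm-2
```

The real implementation stores the lease as a Lease object in the API server, so acquisition and renewal go through the same authenticated, serialized write path as everything else.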

cloud-controller-manager: The Cloud Bridge

If you're running on AWS, GCP, or Azure, the cloud controller manager integrates Kubernetes with cloud APIs:

  • Node controller: Detects when a cloud VM is deleted and removes the corresponding Node object.
  • Route controller: Configures cloud network routes so pods on different nodes can communicate.
  • Service controller: Creates cloud load balancers when you create a Service of type LoadBalancer.

This component doesn't exist in bare-metal or local (kind/minikube) clusters.

Node Components

kubelet: The Node Agent

The kubelet is the most critical component on each worker node. It has one job: ensure that the containers described in a pod spec are running and healthy.

The kubelet's workflow:

  1. Watch the API server for pods assigned to its node (pods where spec.nodeName matches).
  2. Pull the container image via the container runtime (if not already cached).
  3. Create and start containers using the Container Runtime Interface (CRI).
  4. Run probes — liveness, readiness, and startup probes on the configured schedule.
  5. Report status back to the API server: pod phase, container states, resource usage.

The kubelet also handles:

  • Static pods: Pods defined as YAML files in a directory on the node (typically /etc/kubernetes/manifests/). The control plane components themselves — API server, etcd, scheduler, controller manager — are often run as static pods managed by the kubelet on control plane nodes.
  • Volume mounting: Attaches persistent volumes, projected volumes (ConfigMaps, Secrets, downward API), and ephemeral volumes to pods.
  • Container lifecycle hooks: Executes postStart and preStop hooks.
  • Eviction: When node resources are critically low (disk, memory, PIDs), the kubelet evicts pods based on their QoS class — BestEffort first, then Burstable, then Guaranteed.
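The eviction ordering can be illustrated by deriving a QoS class from requests and limits and sorting candidates by it. The QoS rules here are simplified (Guaranteed really requires matching CPU and memory requests/limits on every container), and real eviction also weighs actual usage against requests and pod priority — this sketch covers only the class ordering:

```python
# Sketch of eviction ordering by QoS class: BestEffort first, then
# Burstable, Guaranteed last. Pod dicts are invented for the demo.

def qos_class(pod):
    containers = pod["containers"]
    if all(c.get("requests") and c.get("limits") and
           c["requests"] == c["limits"] for c in containers):
        return "Guaranteed"
    if any(c.get("requests") or c.get("limits") for c in containers):
        return "Burstable"
    return "BestEffort"

EVICTION_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}

pods = [
    {"name": "db",    "containers": [{"requests": {"cpu": 1}, "limits": {"cpu": 1}}]},
    {"name": "batch", "containers": [{}]},                  # no requests/limits
    {"name": "web",   "containers": [{"requests": {"cpu": 1}, "limits": {"cpu": 2}}]},
]
victims = sorted(pods, key=lambda p: EVICTION_ORDER[qos_class(p)])
print([p["name"] for p in victims])  # → ['batch', 'web', 'db']
```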

You can inspect the kubelet directly on a node:

# Check kubelet status on a node
systemctl status kubelet

# View kubelet logs for debugging
journalctl -u kubelet -f

# The kubelet exposes metrics and a read-only API
curl http://localhost:10255/pods   # read-only port (disabled by default on modern clusters)

kube-proxy: Service Networking

kube-proxy runs on every node and implements the networking rules that make Services work. When you create a Service, kube-proxy ensures that traffic to the Service's ClusterIP gets forwarded to one of the backing pods.

kube-proxy operates in one of three modes:

iptables mode (default on most clusters):

  • Creates iptables rules for each Service and Endpoint
  • Traffic matching a Service IP is DNAT'd to a random backend pod
  • Statistically random load balancing
  • Rules scale linearly with the number of services/endpoints
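The "statistically random" part comes from how kube-proxy chains the rules: with n backends, the first rule matches with probability 1/n, the next with 1/(n-1), and so on, which works out to a uniform choice overall. A quick simulation of that chain (illustrative Python, not actual iptables output):

```python
# Simulate kube-proxy's iptables "statistic" rule chain: rules are tried
# in order, rule i matching with probability 1/(n - i), so each backend
# is equally likely despite the sequential evaluation.

import random

def pick_backend(backends):
    n = len(backends)
    for i, backend in enumerate(backends):
        if random.random() < 1 / (n - i):  # last rule always matches
            return backend
    return backends[-1]

random.seed(42)
counts = {"pod-a": 0, "pod-b": 0, "pod-c": 0}
for _ in range(30000):
    counts[pick_backend(list(counts))] += 1
print(counts)  # roughly 10000 hits each
```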

IPVS mode (better for large clusters):

  • Uses Linux IPVS (IP Virtual Server) for load balancing
  • Supports multiple balancing algorithms: round-robin, least connections, shortest expected delay
  • Scales better than iptables because IPVS uses hash tables instead of chains

nftables mode (newer alternative):

  • Uses nftables instead of iptables
  • Better performance characteristics than iptables
  • Still relatively new in Kubernetes

You can check which mode your cluster uses:

# Check kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# See the iptables rules kube-proxy creates (on a node)
iptables -t nat -L KUBE-SERVICES -n

Container Runtime: Where Containers Actually Run

Kubernetes doesn't run containers directly. It delegates to a container runtime via the Container Runtime Interface (CRI).

Common runtimes:

  • containerd: The most widely used runtime. Docker's core container execution engine, extracted as a standalone daemon. Used by most managed Kubernetes services (EKS, GKE, AKS).
  • CRI-O: Built specifically for Kubernetes. Lighter weight than containerd, supports only what CRI requires.

The container runtime handles:

  • Pulling images from registries
  • Creating and managing container processes
  • Setting up namespaces and cgroups for isolation
  • Managing container storage (overlay filesystems)

With crictl you can talk to the runtime directly:

# On a node running containerd, list containers directly
crictl ps

# Pull an image via the runtime
crictl pull nginx:latest

# Inspect a container
crictl inspect <container-id>

How Components Interact: The Request Flow

Let's trace what happens when you run kubectl apply -f deployment.yaml for a new Deployment with 3 replicas.

   kubectl                API Server              etcd
     │                        │                     │
     │── POST Deployment ────►│                     │
     │                        │── validate ──┐      │
     │                        │              │      │
     │                        │◄─ admission ─┘      │
     │                        │── store ───────────►│
     │◄── 201 Created ───────│                     │
     │                        │                     │
     │    Controller Manager  │                     │
     │                        │                     │
     │         ┌──────────────│◄── watch event ────│
     │         │              │                     │
     │         │  Deployment  │                     │
     │         │  controller  │                     │
     │         │  creates     │── store RS ────────►│
     │         │  ReplicaSet  │                     │
     │         │              │                     │
     │         │  ReplicaSet  │                     │
     │         │  controller  │                     │
     │         │  creates 3   │── store Pods ──────►│
     │         │  Pods        │                     │
     │         └──────────────│                     │
     │                        │                     │
     │    Scheduler           │                     │
     │         ┌──────────────│◄── watch: unbound  │
     │         │              │    pods             │
     │         │  Picks nodes │                     │
     │         │  for each    │── update pod ──────►│
     │         │  pod         │   .spec.nodeName    │
     │         └──────────────│                     │
     │                        │                     │
     │    kubelet (per node)  │                     │
     │         ┌──────────────│◄── watch: pods on  │
     │         │              │    my node          │
     │         │  Pulls image │                     │
     │         │  Starts      │                     │
     │         │  containers  │                     │
     │         │  via CRI     │── update pod ──────►│
     │         │              │   status: Running   │
     │         └──────────────│                     │

Step by step:

  1. kubectl sends a POST request to the API server with the Deployment object.
  2. API server authenticates the request, authorizes it via RBAC, runs admission webhooks, validates the schema, and stores the Deployment in etcd.
  3. Deployment controller (in the controller manager) notices a new Deployment via its watch. It creates a ReplicaSet matching the Deployment's pod template.
  4. ReplicaSet controller notices the new ReplicaSet needs 3 pods but 0 exist. It creates 3 Pod objects (with no spec.nodeName).
  5. Scheduler notices 3 unscheduled pods. For each, it runs filtering and scoring, picks a node, and patches the pod with spec.nodeName.
  6. kubelet on each chosen node notices a new pod assigned to it. It pulls the image, creates the containers via the container runtime, starts probes, and reports the pod status back.
  7. Endpoint controller notices 3 new Running pods matching the labels of any existing Service. It updates the Endpoints object.

The entire process — from kubectl apply to running containers — typically takes seconds, but each step is asynchronous. No single component orchestrates the whole flow. Each one watches for its trigger condition and acts independently. This decoupled design is why Kubernetes is resilient: if the scheduler goes down for a minute, pods already running are unaffected, and unscheduled pods will simply queue until the scheduler recovers.

A Deeper Look at API Server Request Processing

Every request to the API server passes through a well-defined pipeline:

Request ──► Authentication ──► Authorization ──► Admission (Mutating)
                                                       │
                                                       ▼
Response ◄── Storage (etcd) ◄── Admission (Validating) ◄──── Validation

Authentication supports multiple methods simultaneously:

  • Client certificates (used by kubelet, controller manager, scheduler)
  • Bearer tokens (ServiceAccount tokens, OIDC tokens)
  • Authentication proxy (for integration with external identity providers)

Authorization is pluggable. Most clusters use RBAC:

# Check if you can perform an action
kubectl auth can-i create deployments --namespace production

# Check what a service account can do
kubectl auth can-i --list --as=system:serviceaccount:default:my-sa

Admission controllers are powerful and underappreciated. They can modify (mutate) or reject (validate) requests. Examples:

  • LimitRanger: Applies default resource requests/limits if not specified
  • NamespaceLifecycle: Prevents creating resources in terminating namespaces
  • PodSecurity: Enforces Pod Security Standards
  • Custom webhooks: Your own admission logic (e.g., require all images from an approved registry)
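The two admission phases can be sketched as plain functions: a mutating step that defaults resource requests (LimitRanger-style) and a validating step that enforces an approved registry. The policy, registry name, and pod shapes are invented for illustration — real webhooks receive AdmissionReview objects over HTTPS:

```python
# Sketch of mutating vs validating admission logic.

APPROVED_REGISTRY = "registry.example.com/"   # hypothetical policy

def mutate(pod):
    # Mutating admission: fill in default resource requests if missing
    for c in pod["containers"]:
        c.setdefault("resources", {"requests": {"cpu": "100m", "memory": "128Mi"}})
    return pod

def validate(pod):
    # Validating admission: every image must come from the approved registry
    for c in pod["containers"]:
        if not c["image"].startswith(APPROVED_REGISTRY):
            return False, f"image {c['image']} not from approved registry"
    return True, ""

pod = mutate({"containers": [{"image": "registry.example.com/web:v1"}]})
ok, reason = validate(pod)
print(ok, pod["containers"][0]["resources"]["requests"]["cpu"])  # → True 100m

bad = mutate({"containers": [{"image": "docker.io/evil:latest"}]})
print(validate(bad)[0])  # → False
```

Mutation runs before validation for a reason: validating logic should see the final object, defaults included.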

etcd Internals for Kubernetes Operators

Understanding etcd at a deeper level matters for cluster operations:

Raft consensus: etcd requires a quorum (majority) of nodes to commit writes. For a 3-node etcd cluster, 2 must agree. For 5 nodes, 3 must agree. This is why etcd clusters should always have an odd number of members — a four-node cluster provides no more fault tolerance than a three-node one (both tolerate one failure).
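The quorum arithmetic is worth internalizing:

```python
# Quorum math for etcd: a cluster of n members needs floor(n/2) + 1 votes
# to commit a write, so it tolerates n - quorum failures. Even sizes add
# a member without adding fault tolerance.

def quorum(n):
    return n // 2 + 1

def fault_tolerance(n):
    return n - quorum(n)

for n in range(1, 8):
    print(n, quorum(n), fault_tolerance(n))
# 3 members: quorum 2, tolerates 1 failure
# 4 members: quorum 3, still tolerates only 1 failure
# 5 members: quorum 3, tolerates 2 failures
```

This is why 3 and 5 are the standard etcd cluster sizes.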

Performance considerations: etcd is sensitive to disk latency. SSDs are strongly recommended. The heartbeat interval (default 100ms) and election timeout (default 1000ms) should be tuned based on network latency. Slow etcd is the #1 cause of control plane instability.

Size limits: A single etcd request is capped at 1.5MB by default. The database itself has a storage quota that defaults to 2GB (configurable, with roughly 8GB the recommended maximum). This is why you shouldn't store large objects in ConfigMaps or Secrets.

# Check etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check database size
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
  --endpoints=https://10.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

High Availability Architecture

In production, the control plane should be highly available:

                    ┌──────────────┐
                    │ Load Balancer│
                    └──────┬───────┘
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
  │ Control Plane 1│ │ Control Plane 2│ │ Control Plane 3│
  │                │ │                │ │                │
  │ API Server     │ │ API Server     │ │ API Server     │
  │ Controller Mgr │ │ Controller Mgr │ │ Controller Mgr │
  │ Scheduler      │ │ Scheduler      │ │ Scheduler      │
  │ etcd           │ │ etcd           │ │ etcd           │
  └────────────────┘ └────────────────┘ └────────────────┘

  • API server: All instances are active. A load balancer distributes requests across them. Since they're stateless, this just works.
  • Controller manager and scheduler: Only one instance is active (the leader). Others are on standby. Leader election uses Lease objects with configurable timeouts.
  • etcd: All members participate in Raft consensus. Write availability requires a quorum.

The managed Kubernetes services (EKS, GKE, AKS) handle all of this for you. If you run self-managed clusters, you need to set this up, monitor it, and practice disaster recovery.

What To Take Away

The architecture of Kubernetes is built around a few key principles:

  1. Declarative state: You tell the system what you want, not what to do. Controllers continuously reconcile toward your desired state.
  2. API server as the single gateway: Every interaction flows through the API server. This centralizes authentication, authorization, admission, and audit logging.
  3. Watch-based coordination: Components don't poll on a timer. They watch the API server for changes and react. This is efficient and fast.
  4. Decoupled components: Each component has a narrow responsibility. The scheduler doesn't know about container runtimes. The kubelet doesn't know about ReplicaSets. This modularity makes the system resilient and extensible.
  5. Level-triggered, not edge-triggered: Controllers don't react to events per se — they react to state differences. If a controller misses an event, it will still reconcile correctly on its next sync, because it compares desired state with actual state.

When you're in an interview and asked about Kubernetes architecture, lead with these principles. They show that you understand not just what the components are, but why they're designed the way they are.

Related Topics