How Does Prometheus Monitor Kubernetes?

Level: intermediate | Tags: monitoring, devops, sre, CKA
TL;DR

Prometheus monitors Kubernetes by scraping metrics endpoints from Pods, nodes, and cluster components. It uses Kubernetes service discovery to automatically find targets. The kube-prometheus-stack (Prometheus Operator) is the standard deployment method, providing pre-built dashboards and alerting rules.

Detailed Answer

How Prometheus Works

Prometheus is a pull-based monitoring system. It periodically scrapes HTTP endpoints (typically /metrics) on targets, parses the exposed metrics, stores them in a time-series database, and evaluates alerting rules.

The four main metric types:

  • Counter: Monotonically increasing value (e.g., total HTTP requests)
  • Gauge: Value that can go up or down (e.g., current memory usage)
  • Histogram: Distribution of values in buckets (e.g., request latency)
  • Summary: Similar to histogram but calculates quantiles client-side
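For reference, this is roughly what these types look like in the text exposition format Prometheus scrapes (metric and label names here are illustrative):

```
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027

# HELP process_memory_bytes Current memory usage.
# TYPE process_memory_bytes gauge
process_memory_bytes 5.2e+07

# HELP request_duration_seconds Request latency distribution.
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 240
request_duration_seconds_bucket{le="0.5"} 310
request_duration_seconds_bucket{le="+Inf"} 325
request_duration_seconds_sum 48.3
request_duration_seconds_count 325
```

Note that a histogram is exposed as cumulative `_bucket` series plus `_sum` and `_count`, which is why quantiles can be computed server-side with `histogram_quantile()`.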

Deploying Prometheus with kube-prometheus-stack

The recommended way to deploy Prometheus on Kubernetes is through the kube-prometheus-stack Helm chart, which includes Prometheus, Grafana, Alertmanager, and node-exporter:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=securePassword

This deploys:

  • Prometheus - Metrics collection and storage
  • Alertmanager - Alert routing and notification
  • Grafana - Dashboards and visualization
  • node-exporter - Host-level metrics (CPU, memory, disk)
  • kube-state-metrics - Kubernetes object state metrics

# Verify the deployment
kubectl get pods -n monitoring
kubectl get svc -n monitoring

Kubernetes Metrics Sources

| Source | Metrics | Endpoint |
|---|---|---|
| kube-apiserver | API request latency, counts | /metrics on :6443 |
| kubelet | Container CPU, memory, network | /metrics on :10250 |
| cAdvisor (in kubelet) | Container resource usage | /metrics/cadvisor on :10250 |
| kube-state-metrics | Object state (Pod phase, replicas) | /metrics on :8080 |
| node-exporter | Node CPU, memory, disk, network | /metrics on :9100 |
| CoreDNS | DNS query latency, cache stats | /metrics on :9153 |
| etcd | Cluster health, disk I/O | /metrics on :2379 |

ServiceMonitor CRD

The Prometheus Operator uses ServiceMonitor CRDs to define scrape targets declaratively:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-metrics
  namespace: monitoring
  labels:
    release: monitoring  # Must match Prometheus Operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

This tells Prometheus to scrape the port named metrics, every 30 seconds, on the Endpoints of every Service labeled app: my-app in the production namespace.

Application Instrumentation

Expose custom metrics from your application:

# Application Deployment with metrics port
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:v2
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: production
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
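In practice you would instrument the application with an official Prometheus client library, but to make the contract concrete, here is a minimal sketch of a /metrics endpoint using only the Python standard library. The metric name and the counting logic are illustrative, not part of any real library API:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # hypothetical counter, incremented by application traffic


def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments, then samples.
    return (
        "# HELP myapp_http_requests_total Total HTTP requests handled.\n"
        "# TYPE myapp_http_requests_total counter\n"
        f"myapp_http_requests_total {REQUEST_COUNT}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            # Content type Prometheus expects for the text format
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            REQUEST_COUNT += 1  # count non-metrics traffic
            self.send_response(200)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


def start_server(port: int = 0) -> HTTPServer:
    """Serve /metrics in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client library (e.g. prometheus_client for Python) handles the same exposition format plus label escaping, histograms, and concurrency safety, so prefer it for real services.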

Key PromQL Queries for Kubernetes

# CPU usage per Pod
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

# Memory usage per Pod
container_memory_working_set_bytes{namespace="production"}

# Pod restart count
kube_pod_container_status_restarts_total{namespace="production"}

# Pods not ready (excluding Pods created within the last 5 minutes)
kube_pod_status_ready{condition="false"} == 1
  and on(namespace, pod) (time() - kube_pod_created > 300)

# Node CPU utilization percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Persistent Volume usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100

# API server request rate
rate(apiserver_request_total[5m])

# API server error rate
rate(apiserver_request_total{code=~"5.."}[5m]) / rate(apiserver_request_total[5m]) * 100

Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: kubernetes-pod-alerts
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Pod has restarted {{ $value }} times in the last 15 minutes."

        - alert: PodNotReady
          expr: kube_pod_status_phase{phase=~"Pending|Unknown"} > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for more than 15 minutes"

        - alert: HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container!=""}
            / on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} memory usage above 90%"

Accessing Dashboards

# Port-forward to Grafana
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80

# Port-forward to Prometheus UI
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090

# Port-forward to Alertmanager
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-alertmanager 9093:9093

Production Considerations

For production clusters, ensure Prometheus has sufficient storage and retention configured. Use remote write (Thanos, Cortex, or Mimir) for long-term storage and multi-cluster aggregation. Set resource requests and limits on Prometheus Pods to prevent OOM kills. Use recording rules to pre-compute expensive queries that power dashboards.
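As a sketch of that last point, a recording rule that pre-computes per-namespace CPU usage might look like the following (rule group and recorded series names here are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: monitoring  # must match the Operator's ruleSelector
spec:
  groups:
    - name: cpu-recording-rules
      interval: 1m
      rules:
        # Pre-compute the expensive rate() so dashboards query the
        # cheap recorded series instead of re-evaluating it per panel.
        - record: namespace:container_cpu_usage_seconds:rate5m
          expr: |
            sum by (namespace) (
              rate(container_cpu_usage_seconds_total{container!=""}[5m])
            )
```

Dashboards then chart `namespace:container_cpu_usage_seconds:rate5m` directly, which is evaluated once per interval instead of on every dashboard refresh.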

Why Interviewers Ask This

Interviewers ask this because monitoring is fundamental to operating Kubernetes in production, and Prometheus is the de facto standard for Kubernetes observability.

Common Follow-Up Questions

How does Prometheus discover scrape targets in Kubernetes?
Through kubernetes_sd_config, which queries the Kubernetes API for Pods, Services, Endpoints, Ingresses, and Nodes; relabeling rules then filter and rewrite the discovered targets based on their labels or annotations.
What is a ServiceMonitor?
A CRD from the Prometheus Operator that declaratively defines scrape targets. Prometheus automatically configures itself based on ServiceMonitor objects.
How do you expose custom metrics from your application?
Instrument your application with a Prometheus client library and expose a /metrics endpoint. Then create a ServiceMonitor to scrape it.

Key Takeaways

  • Prometheus uses a pull-based model, scraping /metrics endpoints on targets
  • Kubernetes service discovery automates target configuration
  • kube-prometheus-stack provides a production-ready monitoring setup