How Do You Set Up Grafana Dashboards for Kubernetes?

intermediate | monitoring · devops · sre · platform engineer · CKA
TL;DR

Grafana dashboards for Kubernetes visualize metrics from Prometheus, providing real-time visibility into cluster health, node resources, Pod performance, and application behavior. You can use pre-built community dashboards or create custom ones using PromQL queries.

Detailed Answer

Grafana is the standard visualization layer for Kubernetes monitoring. Combined with Prometheus, it provides real-time dashboards for cluster operators and development teams.

Installation

# Add the chart repository, then install kube-prometheus-stack
# (Prometheus + Grafana + dashboards)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=admin

This installs Prometheus, Grafana, Alertmanager, and a comprehensive set of pre-configured dashboards.

Essential Kubernetes Dashboards

Cluster Overview

Key panels for a cluster-level dashboard:

# Total cluster CPU usage percentage
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum(machine_cpu_cores) * 100

# Total cluster memory usage percentage
sum(container_memory_working_set_bytes{container!=""})
/
sum(machine_memory_bytes) * 100

# Number of running Pods
# (sum, not count: kube_pod_status_phase is a 0/1 series per pod and phase,
# so count would also include series whose value is 0)
sum(kube_pod_status_phase{phase="Running"})

# Pods in error state
sum(kube_pod_status_phase{phase=~"Failed|Unknown"})

# Node count
count(kube_node_info)

Node Resource Dashboard

# CPU usage per node
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage per node
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

# Disk usage per node
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"})
/ node_filesystem_size_bytes{mountpoint="/"} * 100

# Network throughput per node
rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])
rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m])

Pod Performance Dashboard

# CPU usage vs requests per Pod
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{resource="cpu", namespace="production"}) by (pod)

# Memory usage vs limits per Pod
sum(container_memory_working_set_bytes{namespace="production"}) by (pod)
/
sum(kube_pod_container_resource_limits{resource="memory", namespace="production"}) by (pod)

# Pod restart count
sum(kube_pod_container_status_restarts_total{namespace="production"}) by (pod)

# OOMKilled containers
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

The Four Golden Signals Dashboard

Google's SRE book defines four golden signals. Here is how to build a panel for each:

# 1. Latency (request duration)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{namespace="production"}[5m]))
  by (le, service))

# 2. Traffic (requests per second)
sum(rate(http_requests_total{namespace="production"}[5m])) by (service)

# 3. Errors (error rate)
sum(rate(http_requests_total{namespace="production", status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total{namespace="production"}[5m])) by (service)

# 4. Saturation (resource pressure)
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", namespace="production"}) by (pod)
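Before wiring any of these queries into a panel, it helps to sanity-check them against Prometheus's HTTP API directly. A minimal sketch using only the standard library (the localhost URL assumes you have port-forwarded the Prometheus service, whose name depends on your Helm release):

```python
import json
import urllib.parse
import urllib.request

# Assumes Prometheus is reachable locally, e.g. via:
#   kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-prometheus 9090
PROM_URL = "http://localhost:9090"

def instant_query(expr):
    """Run a PromQL instant query via Prometheus's /api/v1/query endpoint."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def first_value(response):
    """Extract the numeric value of the first series in a vector result."""
    result = response.get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else None

# Example (requires a live cluster):
# running = first_value(instant_query('sum(kube_pod_status_phase{phase="Running"})'))
```

If a query returns an empty result here, it will render as "No data" in Grafana, which usually means a label selector does not match your metric labels.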

Popular Community Dashboards

Import these by ID from grafana.com/dashboards:

| Dashboard ID | Name | Purpose |
|--------------|------|---------|
| 315 | Kubernetes Cluster Overview | Cluster health at a glance |
| 1860 | Node Exporter Full | Detailed node metrics |
| 13770 | Kubernetes Pods | Per-Pod resource usage |
| 6417 | Kubernetes Cluster (Prometheus) | Comprehensive cluster view |
| 11074 | Node Exporter for Prometheus | Node-level dashboards |
| 14981 | CoreDNS | DNS performance |
| 15757 | Kubernetes / Views / Namespaces | Namespace-level overview |

# Import via Grafana UI:
# Dashboards → Import → Enter ID → Select Prometheus data source → Import
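Imports can also be scripted against Grafana's HTTP API, which is handy when bootstrapping several dashboards at once. A sketch under stated assumptions: the localhost URL and API token are placeholders for your environment, while the grafana.com download endpoint and Grafana's `/api/dashboards/import` endpoint are standard:

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # e.g. via kubectl port-forward (placeholder)

def build_import_payload(dashboard_json, datasource="Prometheus"):
    """Wrap raw dashboard JSON in the body expected by POST /api/dashboards/import,
    wiring every declared datasource input to the given data source name."""
    inputs = [
        {"name": i["name"], "type": i["type"],
         "pluginId": i.get("pluginId", "prometheus"), "value": datasource}
        for i in dashboard_json.get("__inputs", [])
        if i.get("type") == "datasource"
    ]
    return {"dashboard": dashboard_json, "overwrite": True, "inputs": inputs}

def import_dashboard(gnet_id, token, datasource="Prometheus"):
    """Download a community dashboard from grafana.com and import it into Grafana."""
    src = f"https://grafana.com/api/dashboards/{gnet_id}/revisions/latest/download"
    with urllib.request.urlopen(src) as resp:
        dashboard_json = json.load(resp)
    body = json.dumps(build_import_payload(dashboard_json, datasource)).encode()
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/import", data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# import_dashboard(1860, token="<service-account-token>")  # Node Exporter Full
```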

Dashboards as Code

Provision dashboards automatically using ConfigMaps and the Grafana sidecar:

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"       # Sidecar watches for this label
data:
  my-dashboard.json: |
    {
      "annotations": { "list": [] },
      "title": "My Service Dashboard",
      "panels": [
        {
          "title": "Request Rate",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{service=\"my-service\"}[5m]))",
              "legendFormat": "RPS"
            }
          ],
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
        },
        {
          "title": "Error Rate",
          "type": "stat",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{service=\"my-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"my-service\"}[5m])) * 100"
            }
          ],
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
        }
      ],
      "schemaVersion": 39,
      "version": 1
    }
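Hand-writing panel JSON like the above gets tedious past a few panels; a small generator keeps the layout logic in one place and the output pastes straight into a ConfigMap. A minimal sketch (the field set is deliberately reduced; real dashboard JSON carries many more properties):

```python
import json

def panel(title, expr, x, y, panel_type="timeseries"):
    """A minimal Grafana panel definition for a single PromQL target."""
    return {
        "title": title,
        "type": panel_type,
        "datasource": "Prometheus",
        "targets": [{"expr": expr, "legendFormat": title}],
        "gridPos": {"h": 8, "w": 12, "x": x, "y": y},
    }

def dashboard(title, queries):
    """Lay panels out two per row on Grafana's 24-unit-wide grid."""
    panels = [
        panel(t, expr, x=(i % 2) * 12, y=(i // 2) * 8)
        for i, (t, expr) in enumerate(queries)
    ]
    return {"annotations": {"list": []}, "title": title,
            "panels": panels, "schemaVersion": 39, "version": 1}

doc = dashboard("My Service Dashboard", [
    ("Request Rate", 'sum(rate(http_requests_total{service="my-service"}[5m]))'),
    ("Pod Restarts", 'sum(kube_pod_container_status_restarts_total) by (pod)'),
])
print(json.dumps(doc, indent=2))  # paste into the ConfigMap's data field
```

Checking the generated JSON into Git alongside the ConfigMap keeps dashboards reviewable like any other code change.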

Grafana with Helm Values

# kube-prometheus-stack values
grafana:
  adminPassword: "secure-password"
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      searchNamespace: ALL
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: Kubernetes
          type: file
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      cluster-overview:
        gnetId: 315
        revision: 3
        datasource: Prometheus
      node-exporter:
        gnetId: 1860
        revision: 33
        datasource: Prometheus

Dashboard Design Best Practices

  1. Use template variables — add namespace, Pod, and node dropdowns for filtering
  2. Set meaningful thresholds — color panels red/yellow/green based on SLO targets
  3. Include context — link to logs and traces from dashboard panels
  4. Organize by audience — separate dashboards for cluster ops, namespace owners, and developers
  5. Keep panels focused — each panel answers one question
  6. Use appropriate panel types — time series for trends, stats for current values, tables for lists
  7. Document dashboards — add text panels explaining what each panel shows and when to be concerned

Useful Variables for Templates

# Namespace variable
label_values(kube_pod_info, namespace)

# Pod variable (filtered by namespace)
label_values(kube_pod_info{namespace="$namespace"}, pod)

# Node variable
label_values(kube_node_info, node)

Alerting from Grafana

Grafana can also trigger notifications from the same PromQL that powers dashboard panels. The equivalent check, written here in Prometheus alerting-rule syntax:

# Prometheus-style alerting rule
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% for {{ $labels.service }}"

However, for production alerting, it is generally recommended to use Prometheus Alertmanager rather than Grafana alerts, as Alertmanager provides better deduplication, grouping, and routing.

Why Interviewers Ask This

Effective monitoring dashboards are critical for operating production clusters. This question tests your ability to build actionable observability that enables quick incident response.

Common Follow-Up Questions

What are the most important Kubernetes metrics to dashboard?
Node CPU/memory utilization, Pod restart counts, container resource usage vs. limits, API server latency, etcd health, and application-specific metrics like request rate and error rate.
How do you provision dashboards as code?
Use Grafana's provisioning system with ConfigMaps or the Grafana Operator. Store dashboard JSON in Git and deploy via Helm or Kustomize.
What is the difference between Grafana dashboards and Grafana alerts?
Dashboards provide visual exploration. Alerts trigger notifications based on query thresholds. Both use PromQL but serve different purposes — dashboards for investigation, alerts for detection.

Key Takeaways

  • Start with community dashboards (IDs 315, 1860, 13770) and customize them for your needs.
  • Provision dashboards as code using ConfigMaps and the Grafana sidecar for GitOps workflows.
  • Focus dashboards on the four golden signals: latency, traffic, errors, and saturation.
