How Do You Set Up Grafana Dashboards for Kubernetes?

intermediate | monitoring · devops · sre · platform engineer · CKA
TL;DR

Grafana dashboards for Kubernetes visualize metrics from Prometheus, providing real-time visibility into cluster health, node resources, Pod performance, and application behavior. You can use pre-built community dashboards or create custom ones using PromQL queries.

Detailed Answer

Grafana is the standard visualization layer for Kubernetes monitoring. Combined with Prometheus, it provides real-time dashboards for cluster operators and development teams.

Installation

# Add the chart repository, then install kube-prometheus-stack
# (Prometheus + Grafana + dashboards)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=admin

This installs Prometheus, Grafana, Alertmanager, and a comprehensive set of pre-configured dashboards.

Essential Kubernetes Dashboards

Cluster Overview

Key panels for a cluster-level dashboard:

# Total cluster CPU usage percentage
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum(machine_cpu_cores) * 100

# Total cluster memory usage percentage
sum(container_memory_working_set_bytes{container!=""})
/
sum(machine_memory_bytes) * 100

# Number of running Pods
# (sum, not count: kube_pod_status_phase is a 0/1 series per pod and phase,
# so count would also include series whose value is 0)
sum(kube_pod_status_phase{phase="Running"})

# Pods in error state
sum(kube_pod_status_phase{phase=~"Failed|Unknown"})

# Node count
count(kube_node_info)

Node Resource Dashboard

# CPU usage per node
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage per node
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100

# Disk usage per node
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"})
/ node_filesystem_size_bytes{mountpoint="/"} * 100

# Network throughput per node
rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])
rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m])

Pod Performance Dashboard

# CPU usage vs requests per Pod
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{resource="cpu", namespace="production"}) by (pod)

# Memory usage vs limits per Pod
sum(container_memory_working_set_bytes{namespace="production"}) by (pod)
/
sum(kube_pod_container_resource_limits{resource="memory", namespace="production"}) by (pod)

# Pod restart count
sum(kube_pod_container_status_restarts_total{namespace="production"}) by (pod)

# OOMKilled containers
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

The Four Golden Signals Dashboard

Google's SRE book defines four golden signals. Here is how to build a panel for each:

# 1. Latency (request duration)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{namespace="production"}[5m]))
  by (le, service))

# 2. Traffic (requests per second)
sum(rate(http_requests_total{namespace="production"}[5m])) by (service)

# 3. Errors (error rate)
sum(rate(http_requests_total{namespace="production", status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total{namespace="production"}[5m])) by (service)

# 4. Saturation (resource pressure)
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", namespace="production"}) by (pod)
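Before wiring any of these queries into a panel, it helps to sanity-check them against Prometheus's HTTP API directly. A minimal sketch using only the standard library (the localhost URL assumes you have port-forwarded the Prometheus service, whose name depends on your Helm release):

```python
import json
import urllib.parse
import urllib.request

# Assumes Prometheus is reachable locally, e.g. via:
#   kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-prometheus 9090
PROM_URL = "http://localhost:9090"

def instant_query(expr):
    """Run a PromQL instant query via Prometheus's /api/v1/query endpoint."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def first_value(response):
    """Extract the numeric value of the first series in a vector result."""
    result = response.get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else None

# Example (requires a live cluster):
# running = first_value(instant_query('sum(kube_pod_status_phase{phase="Running"})'))
```

If a query returns an empty result here, it will render as "No data" in Grafana, which usually means a label selector does not match your metric labels.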

Popular Community Dashboards

Import these by ID from grafana.com/dashboards:

| Dashboard ID | Name | Purpose |
|--------------|------|---------|
| 315 | Kubernetes Cluster Overview | Cluster health at a glance |
| 1860 | Node Exporter Full | Detailed node metrics |
| 13770 | Kubernetes Pods | Per-Pod resource usage |
| 6417 | Kubernetes Cluster (Prometheus) | Comprehensive cluster view |
| 11074 | Node Exporter for Prometheus | Node-level dashboards |
| 14981 | CoreDNS | DNS performance |
| 15757 | Kubernetes / Views / Namespaces | Namespace-level overview |

# Import via Grafana UI:
# Dashboards → Import → Enter ID → Select Prometheus data source → Import
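Imports can also be scripted against Grafana's HTTP API, which is handy when bootstrapping several dashboards at once. A sketch under stated assumptions: the localhost URL and API token are placeholders for your environment, while the grafana.com download endpoint and Grafana's `/api/dashboards/import` endpoint are standard:

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # e.g. via kubectl port-forward (placeholder)

def build_import_payload(dashboard_json, datasource="Prometheus"):
    """Wrap raw dashboard JSON in the body expected by POST /api/dashboards/import,
    wiring every declared datasource input to the given data source name."""
    inputs = [
        {"name": i["name"], "type": i["type"],
         "pluginId": i.get("pluginId", "prometheus"), "value": datasource}
        for i in dashboard_json.get("__inputs", [])
        if i.get("type") == "datasource"
    ]
    return {"dashboard": dashboard_json, "overwrite": True, "inputs": inputs}

def import_dashboard(gnet_id, token, datasource="Prometheus"):
    """Download a community dashboard from grafana.com and import it into Grafana."""
    src = f"https://grafana.com/api/dashboards/{gnet_id}/revisions/latest/download"
    with urllib.request.urlopen(src) as resp:
        dashboard_json = json.load(resp)
    body = json.dumps(build_import_payload(dashboard_json, datasource)).encode()
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/import", data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# import_dashboard(1860, token="<service-account-token>")  # Node Exporter Full
```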

Dashboards as Code

Provision dashboards automatically using ConfigMaps and the Grafana sidecar:

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"       # Sidecar watches for this label
data:
  my-dashboard.json: |
    {
      "annotations": { "list": [] },
      "title": "My Service Dashboard",
      "panels": [
        {
          "title": "Request Rate",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{service=\"my-service\"}[5m]))",
              "legendFormat": "RPS"
            }
          ],
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
        },
        {
          "title": "Error Rate",
          "type": "stat",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(rate(http_requests_total{service=\"my-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"my-service\"}[5m])) * 100"
            }
          ],
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
        }
      ],
      "schemaVersion": 39,
      "version": 1
    }
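Hand-writing panel JSON like the above gets tedious past a few panels; a small generator keeps the layout logic in one place and the output pastes straight into a ConfigMap. A minimal sketch (the field set is deliberately reduced; real dashboard JSON carries many more properties):

```python
import json

def panel(title, expr, x, y, panel_type="timeseries"):
    """A minimal Grafana panel definition for a single PromQL target."""
    return {
        "title": title,
        "type": panel_type,
        "datasource": "Prometheus",
        "targets": [{"expr": expr, "legendFormat": title}],
        "gridPos": {"h": 8, "w": 12, "x": x, "y": y},
    }

def dashboard(title, queries):
    """Lay panels out two per row on Grafana's 24-unit-wide grid."""
    panels = [
        panel(t, expr, x=(i % 2) * 12, y=(i // 2) * 8)
        for i, (t, expr) in enumerate(queries)
    ]
    return {"annotations": {"list": []}, "title": title,
            "panels": panels, "schemaVersion": 39, "version": 1}

doc = dashboard("My Service Dashboard", [
    ("Request Rate", 'sum(rate(http_requests_total{service="my-service"}[5m]))'),
    ("Pod Restarts", 'sum(kube_pod_container_status_restarts_total) by (pod)'),
])
print(json.dumps(doc, indent=2))  # paste into the ConfigMap's data field
```

Checking the generated JSON into Git alongside the ConfigMap keeps dashboards reviewable like any other code change.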

Grafana with Helm Values

# kube-prometheus-stack values
grafana:
  adminPassword: "secure-password"
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      searchNamespace: ALL
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: Kubernetes
          type: file
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      cluster-overview:
        gnetId: 315
        revision: 3
        datasource: Prometheus
      node-exporter:
        gnetId: 1860
        revision: 33
        datasource: Prometheus

Dashboard Design Best Practices

  1. Use template variables — add namespace, Pod, and node dropdowns for filtering
  2. Set meaningful thresholds — color panels red/yellow/green based on SLO targets
  3. Include context — link to logs and traces from dashboard panels
  4. Organize by audience — separate dashboards for cluster ops, namespace owners, and developers
  5. Keep panels focused — each panel answers one question
  6. Use appropriate panel types — time series for trends, stats for current values, tables for lists
  7. Document dashboards — add text panels explaining what each panel shows and when to be concerned

Useful Variables for Templates

# Namespace variable
label_values(kube_pod_info, namespace)

# Pod variable (filtered by namespace)
label_values(kube_pod_info{namespace="$namespace"}, pod)

# Node variable
label_values(kube_node_info, node)

Alerting from Grafana

Grafana can also trigger notifications from the same PromQL that powers dashboard panels. The equivalent check, written here in Prometheus alerting-rule syntax:

# Prometheus-style alerting rule
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% for {{ $labels.service }}"

However, for production alerting, it is generally recommended to use Prometheus Alertmanager rather than Grafana alerts, as Alertmanager provides better deduplication, grouping, and routing.

Why Interviewers Ask This

Effective monitoring dashboards are critical for operating production clusters. This question tests your ability to build actionable observability that enables quick incident response.

Common Follow-Up Questions

What are the most important Kubernetes metrics to dashboard?
Node CPU/memory utilization, Pod restart counts, container resource usage vs. limits, API server latency, etcd health, and application-specific metrics like request rate and error rate.
How do you provision dashboards as code?
Use Grafana's provisioning system with ConfigMaps or the Grafana Operator. Store dashboard JSON in Git and deploy via Helm or Kustomize.
What is the difference between Grafana dashboards and Grafana alerts?
Dashboards provide visual exploration. Alerts trigger notifications based on query thresholds. Both use PromQL but serve different purposes — dashboards for investigation, alerts for detection.

Key Takeaways

  • Start with community dashboards (IDs 315, 1860, 13770) and customize them for your needs.
  • Provision dashboards as code using ConfigMaps and the Grafana sidecar for GitOps workflows.
  • Focus dashboards on the four golden signals: latency, traffic, errors, and saturation.
