How Do You Set Up Grafana Dashboards for Kubernetes?
Grafana dashboards for Kubernetes visualize metrics from Prometheus, providing real-time visibility into cluster health, node resources, Pod performance, and application behavior. You can use pre-built community dashboards or create custom ones using PromQL queries.
Detailed Answer
Grafana is the standard visualization layer for Kubernetes monitoring. Combined with Prometheus, it provides real-time dashboards for cluster operators and development teams.
Installation
# Install kube-prometheus-stack (Prometheus + Grafana + dashboards)
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set grafana.adminPassword=admin
This installs Prometheus, Grafana, Alertmanager, and a comprehensive set of pre-configured dashboards.
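After installation you can fetch the admin password and reach the Grafana UI locally. A minimal sketch, assuming the release name monitoring from the command above (the Secret and Service follow the chart's <release>-grafana naming convention; verify the exact names with kubectl get secrets,svc -n monitoring):

```shell
# Read the admin password from the Secret created by the chart
kubectl get secret monitoring-grafana -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo

# Forward the Grafana Service to localhost:3000
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring
# Then open http://localhost:3000 and log in as "admin"
```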
Essential Kubernetes Dashboards
Cluster Overview
Key panels for a cluster-level dashboard:
# Total cluster CPU usage percentage
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum(machine_cpu_cores) * 100
# Total cluster memory usage percentage
sum(container_memory_working_set_bytes{container!=""})
/
sum(machine_memory_bytes) * 100
# Number of running Pods (the metric is a 0/1 gauge per pod and phase, so sum, not count)
sum(kube_pod_status_phase{phase="Running"})
# Pods in error state
sum(kube_pod_status_phase{phase=~"Failed|Unknown"})
# Node count
count(kube_node_info)
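Dashboards render faster when heavy expressions like these are precomputed by Prometheus. A sketch of recording rules for the cluster-level queries above, using the PrometheusRule CRD that kube-prometheus-stack watches (the rule names and the release label value are illustrative and depend on your Helm release):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-overview-rules
  namespace: monitoring
  labels:
    release: monitoring  # must match the Helm release so Prometheus discovers the rule
spec:
  groups:
    - name: cluster-overview.rules
      rules:
        - record: cluster:cpu_usage:percent
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            / sum(machine_cpu_cores) * 100
        - record: cluster:memory_usage:percent
          expr: |
            sum(container_memory_working_set_bytes{container!=""})
            / sum(machine_memory_bytes) * 100
```

Panels then query the short recorded name (for example cluster:cpu_usage:percent) instead of re-evaluating the full expression on every refresh.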
Node Resource Dashboard
# CPU usage per node
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage per node
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
# Disk usage per node
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"})
/ node_filesystem_size_bytes{mountpoint="/"} * 100
# Network throughput per node
rate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])
rate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m])
Pod Performance Dashboard
# CPU usage vs requests per Pod
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
/
sum(kube_pod_container_resource_requests{resource="cpu", namespace="production"}) by (pod)
# Memory usage vs limits per Pod
sum(container_memory_working_set_bytes{namespace="production"}) by (pod)
/
sum(kube_pod_container_resource_limits{resource="memory", namespace="production"}) by (pod)
# Pod restart count
sum(kube_pod_container_status_restarts_total{namespace="production"}) by (pod)
# OOMKilled containers
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
The Four Golden Signals Dashboard
Google's SRE book defines four golden signals. Here is how to chart each of them:
# 1. Latency (request duration)
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{namespace="production"}[5m]))
by (le, service))
# 2. Traffic (requests per second)
sum(rate(http_requests_total{namespace="production"}[5m])) by (service)
# 3. Errors (error rate)
sum(rate(http_requests_total{namespace="production", status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total{namespace="production"}[5m])) by (service)
# 4. Saturation (resource pressure)
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
/
sum(kube_pod_container_resource_limits{resource="cpu", namespace="production"}) by (pod)
Popular Community Dashboards
Import these by ID from grafana.com/dashboards:
| Dashboard ID | Name | Purpose |
|-------------|------|---------|
| 315 | Kubernetes Cluster Overview | Cluster health at a glance |
| 1860 | Node Exporter Full | Detailed node metrics |
| 13770 | Kubernetes Pods | Per-Pod resource usage |
| 6417 | Kubernetes Cluster (Prometheus) | Comprehensive cluster view |
| 11074 | Node Exporter for Prometheus | Node-level dashboards |
| 14981 | CoreDNS | DNS performance |
| 15757 | Kubernetes / Views / Namespaces | Namespace-level overview |
# Import via Grafana UI:
# Dashboards → Import → Enter ID → Select Prometheus data source → Import
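Imports can also be scripted against Grafana's HTTP API, which is useful in CI. A hedged sketch, assuming Grafana is reachable on localhost:3000 with admin/admin credentials and a data source named Prometheus; the DS_PROMETHEUS input name is the convention used by dashboard 1860 (Node Exporter Full) and may differ for other dashboards:

```shell
# Download a community dashboard's JSON from grafana.com
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o node-exporter-full.json

# Import it, mapping the dashboard's data source input to "Prometheus"
curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/import \
  -H 'Content-Type: application/json' \
  -d "{\"dashboard\": $(cat node-exporter-full.json), \"overwrite\": true,
       \"inputs\": [{\"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\",
                     \"pluginId\": \"prometheus\", \"value\": \"Prometheus\"}]}"
```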
Dashboards as Code
Provision dashboards automatically using ConfigMaps and the Grafana sidecar:
apiVersion: v1
kind: ConfigMap
metadata:
name: custom-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1" # Sidecar watches for this label
data:
my-dashboard.json: |
{
"annotations": { "list": [] },
"title": "My Service Dashboard",
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"my-service\"}[5m]))",
"legendFormat": "RPS"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
},
{
"title": "Error Rate",
"type": "stat",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"my-service\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"my-service\"}[5m])) * 100"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
}
],
"schemaVersion": 39,
"version": 1
}
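A hypothetical apply-and-verify flow for the ConfigMap above (the sidecar container name grafana-sc-dashboard matches current kube-prometheus-stack releases but may vary by chart version):

```shell
# Apply the labeled ConfigMap; the sidecar loads it into Grafana automatically
kubectl apply -f custom-dashboard.yaml

# Confirm the sidecar picked it up
kubectl logs -n monitoring deploy/monitoring-grafana \
  -c grafana-sc-dashboard --tail=20
```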
Grafana with Helm Values
# kube-prometheus-stack values
grafana:
adminPassword: "secure-password"
sidecar:
dashboards:
enabled: true
label: grafana_dashboard
searchNamespace: ALL
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: default
orgId: 1
folder: Kubernetes
type: file
options:
path: /var/lib/grafana/dashboards/default
dashboards:
default:
cluster-overview:
gnetId: 315
revision: 3
datasource: Prometheus
node-exporter:
gnetId: 1860
revision: 33
datasource: Prometheus
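Applying these values is a standard Helm upgrade; a sketch assuming the settings above are saved as values.yaml and the release is named monitoring:

```shell
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f values.yaml
```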
Dashboard Design Best Practices
- Use template variables — add namespace, Pod, and node dropdowns for filtering
- Set meaningful thresholds — color panels red/yellow/green based on SLO targets
- Include context — link to logs and traces from dashboard panels
- Organize by audience — separate dashboards for cluster ops, namespace owners, and developers
- Keep panels focused — each panel answers one question
- Use appropriate panel types — time series for trends, stats for current values, tables for lists
- Document dashboards — add text panels explaining what each panel shows and when to be concerned
Useful Variables for Templates
# Namespace variable
label_values(kube_pod_info, namespace)
# Pod variable (filtered by namespace)
label_values(kube_pod_info{namespace="$namespace"}, pod)
# Node variable
label_values(kube_node_info, node)
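In panel queries, these variables interpolate with a $ prefix. If a variable has Multi-value or Include All enabled, use the =~ regex matcher so several selections expand correctly. For example:

```
# Single-value variable
sum(rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod"}[5m]))
# Multi-value variable (regex match)
sum(rate(container_cpu_usage_seconds_total{namespace=~"$namespace"}[5m])) by (pod)
```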
Alerting from Grafana
Grafana's unified alerting can also evaluate rules against the same queries that drive your panels. The equivalent rule, expressed here in Prometheus rule syntax, looks like this:
# Alert rule
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate above 5% for {{ $labels.service }}"
However, for production alerting, it is generally recommended to use Prometheus Alertmanager rather than Grafana alerts, as Alertmanager provides better deduplication, grouping, and routing.
Why Interviewers Ask This
Effective monitoring dashboards are critical for operating production clusters. This question tests your ability to build actionable observability that enables quick incident response.
Key Takeaways
- Start with community dashboards (IDs 315, 1860, 13770) and customize them for your needs.
- Provision dashboards as code using ConfigMaps and the Grafana sidecar for GitOps workflows.
- Focus dashboards on the four golden signals: latency, traffic, errors, and saturation.