Build production-ready Grafana dashboards with dynamic variables, custom panels, and sophisticated alert rules. Integrate Prometheus metrics for comprehensive monitoring with multi-condition alerting and notification channels.
Prerequisites
- Grafana 10+ installed and running
- Prometheus server with metrics collection
- Administrative access to configure Grafana
- Basic understanding of PromQL queries
What this solves
Raw Prometheus metrics become useful once dashboards can be filtered by instance and job, rendered in panels suited to each metric, and backed by alert rules that fire on real problems. This tutorial covers template variables, CPU, memory, disk, and network panels, multi-condition alert rules, notification routing, and SLO burn rate alerts that can be reused across environments and services.
Step-by-step configuration
Verify Prometheus and Grafana installation
Ensure both services are running and accessible before proceeding with advanced configuration.
systemctl status prometheus
systemctl status grafana-server
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[] | select(test("up|node_"))' | head -5
Configure Prometheus data source with advanced settings
Set up the Prometheus data source with a 60-second query timeout, a 15-second minimum step interval, and POST requests so that long dashboard queries are not limited by URL length.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      queryTimeout: 60s
      timeInterval: 15s
      customQueryParameters: 'max_source_resolution=5m&partial_response=true'
    secureJsonData: {}
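The `customQueryParameters` value is an ordinary URL query string, and the two duration settings use a number followed by a unit. A minimal Python sketch, assuming the provisioning file is generated by a script, shows how these values could be built and checked before they are written out. The helper names `parse_duration` and `build_json_data` are hypothetical and not part of Grafana.

```python
from urllib.parse import urlencode

# Seconds per unit for the simple "<number><unit>" durations used above.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration such as '60s' or '5m' into seconds."""
    number, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not number.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return int(number) * _UNITS[unit]


def build_json_data(query_timeout: str, time_interval: str, params: dict) -> dict:
    """Assemble the jsonData mapping for the data source (illustrative only)."""
    # Fail early on malformed durations instead of shipping a broken file.
    parse_duration(query_timeout)
    parse_duration(time_interval)
    return {
        "httpMethod": "POST",
        "queryTimeout": query_timeout,
        "timeInterval": time_interval,
        "customQueryParameters": urlencode(params),
    }


json_data = build_json_data(
    "60s",
    "15s",
    {"max_source_resolution": "5m", "partial_response": "true"},
)
print(json_data["customQueryParameters"])
```

Building the query string with `urlencode` avoids hand-escaping mistakes if a parameter value ever contains `&` or `=`.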
Create dashboard variables for dynamic filtering
Configure template variables that allow users to filter dashboards by instance, job, or custom labels dynamically.
{
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(up, instance)",
"refresh": 1,
"includeAll": true,
"allValue": ".*",
"multi": true,
"options": [],
"current": {},
"hide": 0,
"sort": 1
},
{
"name": "job",
"type": "query",
"query": "label_values(up, job)",
"refresh": 1,
"includeAll": true,
"allValue": ".*",
"multi": true,
"regex": "/^(?!prometheus).*$/",
"sort": 1
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,15m,30m,1h,6h,12h",
"current": {
"text": "5m",
"value": "5m"
}
}
]
}
}
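To make the variable behaviour concrete, the following Python sketch approximates two things the configuration above relies on: the `regex` filter that drops the `prometheus` job from the option list, and the way a multi-value selection ends up as one regex alternation inside a `=~` matcher. It is an approximation under stated assumptions, not Grafana's actual implementation; the function names are hypothetical and Grafana's own escaping rules may differ in detail.

```python
import re


def apply_variable_regex(values, pattern):
    """Keep only the values that match the variable's regex filter."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.search(v)]


def interpolate_multi(values):
    """Join the selected values into one alternation for use with =~."""
    escaped = [re.escape(v) for v in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


# Hypothetical job names returned by label_values(up, job).
jobs = apply_variable_regex(["prometheus", "node", "api"], r"^(?!prometheus).*$")
selector = 'up{job=~"%s"}' % interpolate_multi(jobs)
print(jobs)
print(selector)
```

The negative lookahead `(?!prometheus)` rejects any value that starts with `prometheus`, which is why that job never appears in the dropdown.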
Build advanced system overview dashboard
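Start the system overview dashboard with a CPU panel that plots current usage per instance next to a one-hour linear projection and alerts when usage stays above 85% for five minutes. Memory, disk, and network panels follow in the next steps.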
{
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\",job=~\"$job\"}[5m])) * 100)",
"legendFormat": "{{instance}} - Current",
"refId": "A"
},
{
"expr": "predict_linear((100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\",job=~\"$job\"}[5m])) * 100))[1h:1m], 3600)",
"legendFormat": "{{instance}} - Predicted +1h",
"refId": "B"
}
],
"yAxes": [
{
"min": 0,
"max": 100,
"unit": "percent"
}
],
"alert": {
"conditions": [
{
"query": {
"queryType": "A",
"refId": "A"
},
"reducer": {
"type": "last",
"params": []
},
"evaluator": {
"params": [85],
"type": "gt"
}
}
],
"executionErrorState": "alerting",
"for": "5m",
"frequency": "10s",
"handler": 1,
"name": "High CPU Usage",
"noDataState": "no_data"
}
}
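The two queries in this panel rest on simple arithmetic: CPU usage is 100 minus the idle share, and `predict_linear` fits a least-squares line through recent samples and extrapolates it. The Python sketch below reproduces both calculations on made-up samples. It assumes a gauge-like input (a usage percentage rather than a raw counter) and extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the evaluation time; the function names mirror the concepts and are not a Prometheus API.

```python
def cpu_usage_percent(idle_rate):
    """Usage is whatever share of CPU time was not spent idle."""
    return 100 - idle_rate * 100


def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    target_time = samples[-1][0] + seconds_ahead
    return intercept + slope * target_time


# Hypothetical usage samples: (seconds, percent), rising 5 points per 10 minutes.
usage = [(0, 40.0), (600, 45.0), (1200, 50.0), (1800, 55.0)]
print(cpu_usage_percent(0.25))
print(predict_linear(usage, 3600))
```

With these samples the projection one hour out lands at roughly 85%, which is the alert threshold used in the panel.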
Configure memory usage panel with thresholds
Create a memory usage stat panel whose background turns yellow at 70% and red at 85%.
{
"title": "Memory Usage",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\",job=~\"$job\"} / node_memory_MemTotal_bytes{instance=~\"$instance\",job=~\"$job\"})) * 100",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"thresholds": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 70
},
{
"color": "red",
"value": 85
}
]
},
"unit": "percent",
"min": 0,
"max": 100
}
},
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"],
"fields": ""
},
"orientation": "auto",
"textMode": "auto",
"colorMode": "background"
}
}
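The threshold steps behave like a lookup: the base step (with `value` set to `null`) always applies, and each later step takes over once the value reaches its lower bound. A small Python sketch of that rule, assuming the steps are listed in ascending order as above; `threshold_color` and `memory_used_percent` are hypothetical helper names.

```python
STEPS = [
    {"color": "green", "value": None},   # base step, always matches
    {"color": "yellow", "value": 70},
    {"color": "red", "value": 85},
]


def threshold_color(value, steps):
    """Return the colour of the last step whose lower bound is <= value."""
    color = None
    for step in steps:
        bound = step["value"]
        if bound is None or value >= bound:
            color = step["color"]
    return color


def memory_used_percent(available_bytes, total_bytes):
    """Same arithmetic as the panel query: 1 minus the available share."""
    return (1 - available_bytes / total_bytes) * 100


used = memory_used_percent(4, 16)
print(used, threshold_color(used, STEPS))
```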
Create disk I/O heatmap visualization
Build a heatmap showing disk I/O patterns across time and instances for performance analysis.
{
"title": "Disk I/O Operations Heatmap",
"type": "heatmap",
"targets": [
{
"expr": "sum by (instance) (irate(node_disk_io_time_seconds_total{instance=~\"$instance\",job=~\"$job\"}[5m]))",
"format": "time_series",
"refId": "A"
}
],
"heatmap": {
"xAxis": {
"show": true
},
"yAxis": {
"show": true,
"logBase": 1,
"min": "0",
"max": "1"
},
"yBucketBound": "auto",
"xBucketSize": null,
"yBucketSize": null
},
"color": {
"mode": "spectrum",
"colorScheme": "interpolateSpectral",
"exponent": 0.5,
"fill": "dark-orange"
},
"legend": {
"show": false
}
}
Set up network traffic monitoring table
Create a table visualization showing detailed network statistics with sorting and filtering capabilities.
{
"title": "Network Interface Statistics",
"type": "table",
"targets": [
{
"expr": "irate(node_network_receive_bytes_total{instance=~\"$instance\",job=~\"$job\",device!~\"lo|veth.*|docker.*|virbr.*|br-.*\"}[5m]) * 8",
"format": "table",
"legendFormat": "",
"refId": "A"
},
{
"expr": "irate(node_network_transmit_bytes_total{instance=~\"$instance\",job=~\"$job\",device!~\"lo|veth.*|docker.*|virbr.*|br-.*\"}[5m]) * 8",
"format": "table",
"refId": "B"
}
],
"transformations": [
{
"id": "merge",
"options": {}
},
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true,
"__name__": true,
"job": true
},
"indexByName": {
"instance": 0,
"device": 1,
"Value #A": 2,
"Value #B": 3
},
"renameByName": {
"Value #A": "RX (bps)",
"Value #B": "TX (bps)",
"device": "Interface",
"instance": "Instance"
}
}
}
],
"fieldConfig": {
"defaults": {
"custom": {
"displayMode": "auto",
"filterable": true
},
"unit": "bps"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Instance"
},
"properties": [
{
"id": "unit",
"value": "string"
}
]
}
]
}
}
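The `device!~` matcher is what keeps loopback and container interfaces out of the table. PromQL anchors label regexes at both ends, so a pattern such as `veth.` only matches names that are exactly one character longer than `veth`; a `.*` suffix is needed to cover real interface names. The Python sketch below mirrors that anchoring with `re.fullmatch` and also shows the bytes-to-bits conversion behind the `* 8`. The device names and helper names are hypothetical.

```python
import re

# Exclusion pattern with ".*" suffixes so that whole interface names match.
EXCLUDED = re.compile(r"lo|veth.*|docker.*|virbr.*|br-.*")


def is_reported(device):
    """True when the device survives the device!~ exclusion."""
    # fullmatch mirrors PromQL, which anchors label regexes at both ends.
    return EXCLUDED.fullmatch(device) is None


def to_bits_per_second(bytes_per_second):
    """The queries multiply by 8 because the panel unit is bits per second."""
    return bytes_per_second * 8


devices = ["lo", "eth0", "veth1a2b3c", "docker0", "br-4f2e", "ens5"]
print([d for d in devices if is_reported(d)])
print(to_bits_per_second(125_000))
```

A single-dot pattern would let `veth1a2b3c` through, because `re.fullmatch(r"veth.", "veth1a2b3c")` returns no match.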
Configure advanced alert rules with multiple conditions
Set up sophisticated alerting rules that combine multiple metrics and conditions for accurate incident detection.
apiVersion: 1
groups:
  - name: system_alerts
    folder: System Monitoring
    interval: 1m
    rules:
      - uid: high_cpu_memory_combo
        title: High CPU and Memory Usage Combined
        condition: C
        data:
          - refId: A
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            # Placeholder: replace with the UID of your Prometheus data source.
            datasourceUid: YOUR_PROMETHEUS_DATASOURCE_UID
            model:
              expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              refId: A
          - refId: B
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            # Placeholder: replace with the UID of your Prometheus data source.
            datasourceUid: YOUR_PROMETHEUS_DATASOURCE_UID
            model:
              expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
              refId: B
          - refId: C
            queryType: ''
            relativeTimeRange:
              from: 0
              to: 0
            # Expression queries run against Grafana's built-in expression engine.
            datasourceUid: __expr__
            model:
              type: classic_conditions
              conditions:
                - evaluator:
                    params:
                      - 80
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - A
                  reducer:
                    params: []
                    type: last
                  type: query
                - evaluator:
                    params:
                      - 85
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - B
                  reducer:
                    params: []
                    type: last
                  type: query
              refId: C
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        annotations:
          # Both queries already return values on a 0-100 scale, so print them as-is.
          description: 'Instance {{ $labels.instance }} has high CPU ({{ printf "%.1f" $values.A.Value }}%) AND high memory usage ({{ printf "%.1f" $values.B.Value }}%)'
          runbook_url: "https://runbooks.example.com/high-resource-usage"
          summary: "Critical resource usage on {{ $labels.instance }}"
        labels:
          severity: critical
          team: infrastructure
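The condition in query C reduces each series to its last value, compares that value with a threshold, and chains the results with `and`. The Python sketch below walks through that evaluation on hypothetical samples; it illustrates the logic only and is not how Grafana implements its expression engine.

```python
# Thresholds taken from the rule above; the structure is simplified.
CONDITIONS = [
    {"query": "A", "threshold": 80, "operator": "and"},  # CPU percent
    {"query": "B", "threshold": 85, "operator": "and"},  # memory percent
]


def reduce_last(series):
    """The 'last' reducer keeps only the most recent sample."""
    return series[-1]


def evaluate(conditions, data):
    """Evaluate conditions in order, chaining each with its operator."""
    result = None
    for cond in conditions:
        fired = reduce_last(data[cond["query"]]) > cond["threshold"]
        if result is None:
            result = fired
        elif cond["operator"] == "and":
            result = result and fired
        else:
            result = result or fired
    return bool(result)


# Hypothetical samples for one instance.
both_high = {"A": [70.0, 91.5], "B": [80.0, 88.0]}
cpu_only = {"A": [95.0], "B": [60.0]}
print(evaluate(CONDITIONS, both_high), evaluate(CONDITIONS, cpu_only))
```

Requiring both conditions is what keeps a short CPU spike on a host with plenty of free memory from paging anyone.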
Configure notification channels
Set up Slack, email, and webhook notification channels with templated messages. Routing between them is configured in the next step.
apiVersion: 1
notifiers:
  - name: critical-slack
    type: slack
    uid: critical_slack_channel
    settings:
      url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
      channel: "#alerts-critical"
      username: grafana
      title: "Critical Alert - {{ .GroupLabels.alertname }}"
      text: |
        {{ range .Alerts }}
        Alert: {{ .Annotations.summary }}
        Description: {{ .Annotations.description }}
        Severity: {{ .Labels.severity }}
        Instance: {{ .Labels.instance }}
        Runbook: {{ .Annotations.runbook_url }}
        {{ end }}
      iconEmoji: ":exclamation:"
  - name: warning-email
    type: email
    uid: warning_email_list
    settings:
      addresses: "devops@example.com;sre@example.com"
      subject: "[Grafana] Warning Alert - {{ .GroupLabels.alertname }}"
      body: |
        Grafana Alert Notification
        {{ range .Alerts }}
        {{ .Annotations.summary }}
        Description: {{ .Annotations.description }}
        Severity: {{ .Labels.severity }}
        Instance: {{ .Labels.instance }}
        Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
        {{ if .Annotations.runbook_url }}
        Runbook: {{ .Annotations.runbook_url }}
        {{ end }}
        {{ end }}
  - name: webhook-integration
    type: webhook
    uid: external_webhook
    settings:
      url: https://api.example.com/webhooks/grafana-alerts
      httpMethod: POST
      username: grafana
      password: webhook_secret_password
      title: "Grafana Alert"
      body: |
        {
          "alertname": "{{ .GroupLabels.alertname }}",
          "status": "{{ .Status }}",
          "alerts": [
            {{ range $i, $alert := .Alerts }}{{ if $i }},{{ end }}
            {
              "summary": "{{ .Annotations.summary }}",
              "description": "{{ .Annotations.description }}",
              "severity": "{{ .Labels.severity }}",
              "instance": "{{ .Labels.instance }}",
              "starts_at": "{{ .StartsAt }}",
              "ends_at": "{{ .EndsAt }}"
            }
            {{ end }}
          ]
        }
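On the receiving side of the webhook, or when testing the endpoint by hand, it helps to know the payload shape the template aims for. The Python sketch below builds the same structure with `json.dumps`, which takes care of comma placement and of escaping quotes inside annotation text. The sample alerts and the `build_payload` helper are hypothetical and only cover the string fields of the template.

```python
import json


def build_payload(status, alertname, alerts):
    """Serialize alerts into the JSON shape used by the webhook body."""
    body = {
        "alertname": alertname,
        "status": status,
        "alerts": [
            {
                "summary": alert["annotations"].get("summary", ""),
                "description": alert["annotations"].get("description", ""),
                "severity": alert["labels"].get("severity", ""),
                "instance": alert["labels"].get("instance", ""),
            }
            for alert in alerts
        ],
    }
    return json.dumps(body)


# Hypothetical alerts; the second description contains quotes that would
# break a JSON string assembled by plain text substitution.
ALERTS = [
    {
        "labels": {"severity": "critical", "instance": "web-1:9100"},
        "annotations": {"summary": "Critical resource usage on web-1:9100",
                        "description": "CPU high"},
    },
    {
        "labels": {"severity": "critical", "instance": "web-2:9100"},
        "annotations": {"summary": "Critical resource usage on web-2:9100",
                        "description": 'Disk "sda" saturated'},
    },
]

payload = build_payload("firing", "High CPU Usage", ALERTS)
print(payload)
```

A payload produced this way can be posted to the endpoint with any HTTP client to check that the receiver parses it before wiring up Grafana.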
Set up notification policies with label-based routing
Configure intelligent alert routing based on severity levels and team ownership using label matchers.
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-receiver
    group_by:
      - alertname
      - instance
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 12h
    routes:
      - receiver: critical-notifications
        group_wait: 5s
        group_interval: 5s
        repeat_interval: 1h
        matchers:
          - severity = critical
        routes:
          - receiver: infrastructure-critical
            matchers:
              - team = infrastructure
            continue: true
          - receiver: database-critical
            matchers:
              - team = database
            continue: true
      - receiver: warning-notifications
        group_wait: 30s
        group_interval: 30s
        repeat_interval: 6h
        matchers:
          - severity = warning
      - receiver: info-notifications
        group_wait: 5m
        group_interval: 5m
        repeat_interval: 24h
        matchers:
          - severity = info
contactPoints:
  - orgId: 1
    name: critical-notifications
    receivers:
      - uid: critical_slack_channel
        type: slack
      - uid: critical_pagerduty
        type: pagerduty
        settings:
          integrationKey: YOUR_PAGERDUTY_INTEGRATION_KEY
          severity: critical
          component: grafana
          group: infrastructure
  - orgId: 1
    name: warning-notifications
    receivers:
      - uid: warning_email_list
        type: email
      - uid: warning_slack_general
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WARNING/WEBHOOK
          channel: "#alerts-general"
  - orgId: 1
    name: default-receiver
    receivers:
      - uid: default_email
        type: email
        settings:
          addresses: "admin@example.com"
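The policy tree is walked top-down: an alert descends into the first child route whose matchers fit its labels, stops checking siblings unless that route sets `continue`, and falls back to the parent's receiver when no child matches. The Python sketch below models that walk with equality matchers only. It is a simplified reading of the routing rules under those assumptions, not Grafana's implementation, and the `POLICY` dictionary is a hand-written copy of the tree above.

```python
POLICY = {
    "receiver": "default-receiver",
    "routes": [
        {
            "receiver": "critical-notifications",
            "matchers": {"severity": "critical"},
            "routes": [
                {"receiver": "infrastructure-critical",
                 "matchers": {"team": "infrastructure"}, "continue": True},
                {"receiver": "database-critical",
                 "matchers": {"team": "database"}, "continue": True},
            ],
        },
        {"receiver": "warning-notifications", "matchers": {"severity": "warning"}},
        {"receiver": "info-notifications", "matchers": {"severity": "info"}},
    ],
}


def matches(route, labels):
    """A route matches when every one of its matchers equals the alert's label."""
    return all(labels.get(k) == v for k, v in route.get("matchers", {}).items())


def resolve(route, labels):
    """Return the receivers an alert with these labels is delivered to."""
    receivers = []
    for child in route.get("routes", []):
        if not matches(child, labels):
            continue
        receivers.extend(resolve(child, labels))
        if not child.get("continue", False):
            break  # stop at the first match unless the route says continue
    # No child matched: the alert stays with the current route's receiver.
    return receivers or [route["receiver"]]


print(resolve(POLICY, {"severity": "critical", "team": "infrastructure"}))
print(resolve(POLICY, {"severity": "critical", "team": "web"}))
```

Tracing a few label sets through a model like this is a quick way to spot alerts that would silently fall through to the default receiver.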
Create service-level dashboard with SLI/SLO tracking
Build a comprehensive service dashboard that tracks service level indicators and objectives with burn rate alerts.
{
"title": "Service Level Objective Tracking",
"type": "stat",
"targets": [
{
"expr": "(sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[1h])) / sum(rate(http_requests_total{job=~\"$job\"}[1h]))) * 100",
"legendFormat": "Success Rate (1h)",
"refId": "A"
},
{
"expr": "(sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[24h])) / sum(rate(http_requests_total{job=~\"$job\"}[24h]))) * 100",
"legendFormat": "Success Rate (24h)",
"refId": "B"
},
{
"expr": "99.5 - (sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[30d])) / sum(rate(http_requests_total{job=~\"$job\"}[30d]))) * 100",
"legendFormat": "Error Budget Consumption (30d)",
"refId": "C"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"thresholds": {
"steps": [
{
"color": "red",
"value": null
},
{
"color": "yellow",
"value": 99
},
{
"color": "green",
"value": 99.5
}
]
},
"unit": "percent",
"decimals": 2
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Error Budget Consumption (30d)"
},
"properties": [
{
"id": "thresholds",
"value": {
"steps": [
{
"color": "green",
"value": null
},
{
"color": "yellow",
"value": 0.25
},
{
"color": "red",
"value": 0.4
}
]
}
}
]
}
]
},
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
},
"orientation": "horizontal",
"textMode": "value_and_name",
"colorMode": "background"
}
}
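The arithmetic behind the three queries is easier to check with concrete numbers. In the Python sketch below, `success_rate_percent` corresponds to queries A and B, and `slo_gap` corresponds to query C as written, which yields the distance between the 99.5% target and the measured success rate (negative while the service is within its SLO). `budget_consumed_fraction` is an additional, commonly used way to express the same data as a share of the error budget; it is not one of the panel queries, and all request counts are made up.

```python
SLO_PERCENT = 99.5


def success_rate_percent(good_requests, total_requests):
    """Share of requests that did not return a 5xx status, in percent."""
    return good_requests / total_requests * 100


def slo_gap(slo_percent, sli_percent):
    """What query C computes: target minus measured success rate."""
    return slo_percent - sli_percent


def budget_consumed_fraction(slo_percent, sli_percent):
    """Observed error share divided by the error share the SLO allows."""
    return (100 - sli_percent) / (100 - slo_percent)


# Hypothetical 30-day totals: 10,000 requests, 10 of them failed.
sli = success_rate_percent(9990, 10000)
print(sli)
print(slo_gap(SLO_PERCENT, sli))
print(budget_consumed_fraction(SLO_PERCENT, sli))
```

With 10 failures in 10,000 requests the service sits at about 99.9%, 0.4 points above target, having used roughly a fifth of its error budget.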
Configure burn rate alerting for SLO monitoring
Set up multi-window burn rate alerts that detect when your error budget is being consumed too quickly.
{
"uid": "slo_burn_rate_alert",
"title": "SLO Burn Rate - Fast Burn",
"condition": "C",
"data": [
{
"refId": "A",
"queryType": "",
"relativeTimeRange": {
"from": 3600,
"to": 0
},
"model": {
"expr": "(1 - (sum(rate(http_requests_total{code!~\"5..\"}[1h])) / sum(rate(http_requests_total[1h])))) > (14.4 * (1 - 0.995))",
"refId": "A"
}
},
{
"refId": "B",
"queryType": "",
"relativeTimeRange": {
"from": 300,
"to": 0
},
"model": {
"expr": "(1 - (sum(rate(http_requests_total{code!~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])))) > (14.4 * (1 - 0.995))",
"refId": "B"
}
},
{
"refId": "C",
"queryType": "",
"relativeTimeRange": {
"from": 0,
"to": 0
},
"model": {
"conditions": [
{
"evaluator": {
"params": [0],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A"]
},
"reducer": {
"params": [],
"type": "last"
},
"type": "query"
},
{
"evaluator": {
"params": [0],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["B"]
},
"reducer": {
"params": [],
"type": "last"
},
"type": "query"
}
],
"refId": "C"
}
}
],
"noDataState": "NoData",
"execErrState": "Alerting",
"for": "2m",
"annotations": {
"description": "Service is burning through error budget at 14.4x the acceptable rate. Immediate action required.",
"runbook_url": "https://runbooks.example.com/slo-burn-rate",
"summary": "Fast SLO burn rate detected"
},
"labels": {
"severity": "critical",
"team": "sre",
"type": "slo"
}
}
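The factor 14.4 and the threshold in the queries follow from simple arithmetic. With a 99.5% SLO the allowed error ratio is 0.005, so burning at 14.4 times that rate means an error ratio above 0.072, and one hour at that pace uses 2% of a 30-day budget. The Python sketch below reproduces those numbers and the two-window check; the function names and sample ratios are hypothetical.

```python
SLO = 0.995
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(error_ratio, slo=SLO):
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo)


def budget_spent(rate, hours, window_hours=WINDOW_HOURS):
    """Fraction of the whole budget used after burning at `rate` for `hours`."""
    return rate * hours / window_hours


def fast_burn(long_window_ratio, short_window_ratio, factor=14.4, slo=SLO):
    """Fire only when both the 1h and the 5m error ratios exceed the threshold."""
    threshold = factor * (1 - slo)
    return long_window_ratio > threshold and short_window_ratio > threshold


print(14.4 * (1 - SLO))          # error-ratio threshold used in the queries
print(budget_spent(14.4, 1))     # share of the 30-day budget burned in one hour
print(fast_burn(0.08, 0.09), fast_burn(0.08, 0.01))
```

The short window is what lets the alert clear quickly once errors stop, even though the one-hour average is still elevated.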
Enable and restart Grafana service
Apply all configuration changes by restarting Grafana and verify the service starts correctly.
sudo systemctl restart grafana-server
sudo systemctl status grafana-server
sudo journalctl -u grafana-server -f --lines=20
Verify your setup
Test your advanced Grafana configuration to ensure all components are working correctly.
# Check dashboard variables are loading
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
"http://localhost:3000/api/dashboards/uid/YOUR_DASHBOARD_UID" | \
jq '.dashboard.templating.list[] | {name: .name, type: .type}'
# Verify alert rules are active
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
"http://localhost:3000/api/ruler/grafana/api/v1/rules" | \
jq '.[] | keys'
# Test notification channels
curl -X POST -H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"name":"test-notification"}' \
"http://localhost:3000/api/alert-notifications/test"
# Check Prometheus connectivity
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
"http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up" | \
jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
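If the connectivity check is part of a script rather than a one-off command, the same filtering can be done in Python once the JSON response has been fetched. The sketch below works on an already decoded response and mirrors the `jq` filter; the sample data and the helper names are hypothetical, and fetching the response is left out.

```python
def summarize_up(response):
    """Mirror the jq filter: one {instance, value} entry per series."""
    return [
        {"instance": series["metric"].get("instance"), "value": series["value"][1]}
        for series in response["data"]["result"]
    ]


def down_instances(response):
    """Instances whose `up` sample is anything other than "1"."""
    return [row["instance"] for row in summarize_up(response) if row["value"] != "1"]


# Hypothetical response in the shape of Prometheus's instant-query API.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "instance": "localhost:9100", "job": "node"},
             "value": [1700000000, "0"]},
        ],
    },
}

print(summarize_up(SAMPLE))
print(down_instances(SAMPLE))
```

Note that Prometheus returns sample values as strings, which is why the comparison is against `"1"` rather than the number 1.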
Advanced dashboard techniques
Create annotation queries for event correlation
Add deployment and incident annotations to correlate system behavior with external events.
{
"annotations": {
"list": [
{
"name": "Deployments",
"datasource": "Prometheus",
"enable": true,
"expr": "increase(deployment_timestamp[5m])",
"iconColor": "blue",
"titleFormat": "Deployment: {{service}}",
"textFormat": "Version {{version}} deployed by {{user}}",
"tags": ["deployment"],
"type": "tags"
},
{
"name": "Incidents",
"datasource": "Prometheus",
"enable": true,
"expr": "incident_start_timestamp > 0",
"iconColor": "red",
"titleFormat": "Incident: {{incident_id}}",
"textFormat": "{{description}} - Severity: {{severity}}",
"tags": ["incident"],
"type": "tags"
}
]
}
}
Configure custom value mappings and transformations
Transform raw metric values into meaningful business indicators using Grafana's transformation engine.
{
"transformations": [
{
"id": "calculateField",
"options": {
"mode": "reduceRow",
"reduce": {
"reducer": "sum"
},
"replaceFields": false,
"alias": "Total Requests"
}
},
{
"id": "calculateField",
"options": {
"mode": "binary",
"binary": {
"left": "Success Rate",
"operator": "*",
"reducer": "sum",
"right": "Total Requests"
},
"replaceFields": false,
"alias": "Successful Requests"
}
}
],
"fieldConfig": {
"defaults": {
"mappings": [
{
"options": {
"from": 0,
"to": 50,
"result": {
"text": "Low Traffic",
"color": "blue"
}
},
"type": "range"
},
{
"options": {
"from": 50,
"to": 200,
"result": {
"text": "Normal Traffic",
"color": "green"
}
},
"type": "range"
},
{
"options": {
"from": 200,
"to": null,
"result": {
"text": "High Traffic",
"color": "orange"
}
},
"type": "range"
}
]
}
}
}
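Range mappings work as an ordered lookup from a numeric value to a display text. The Python sketch below models that lookup for the three ranges above, assuming that bounds are inclusive, that a `null` bound means unbounded, and that the first matching entry wins; `map_value` is a hypothetical helper, not a Grafana function.

```python
# Hand-written copy of the three range mappings; None stands in for JSON null.
MAPPINGS = [
    {"type": "range", "options": {"from": 0, "to": 50, "result": {"text": "Low Traffic"}}},
    {"type": "range", "options": {"from": 50, "to": 200, "result": {"text": "Normal Traffic"}}},
    {"type": "range", "options": {"from": 200, "to": None, "result": {"text": "High Traffic"}}},
]


def map_value(value, mappings):
    """Return the display text of the first range that contains value."""
    for mapping in mappings:
        options = mapping["options"]
        low, high = options["from"], options["to"]
        above = low is None or value >= low
        below = high is None or value <= high
        if above and below:
            return options["result"]["text"]
    return None  # no mapping applies; the raw value would be shown


for requests_per_second in (25, 120, 900):
    print(requests_per_second, map_value(requests_per_second, MAPPINGS))
```

Because adjacent ranges share their boundary values, the order of the entries decides which label a value of exactly 50 or 200 receives.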
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Variables not loading values | Incorrect Prometheus query or missing data | curl "http://localhost:9090/api/v1/label/__name__/values" to verify metrics exist |
| Alert rules not triggering | Query returns no data or condition logic error | Test queries in Prometheus UI first, check sudo journalctl -u grafana-server \| grep -i alert |
| Notification channels failing | Invalid webhook URL or authentication | Test channels manually: curl -X POST "webhook_url" -d "test payload" |
| Dashboard loading slowly | Inefficient PromQL queries or large time ranges | Add query timeout limits, use recording rules for complex calculations |
| Permission denied errors | Grafana service account lacks file access | sudo chown -R grafana:grafana /etc/grafana/provisioning/ |
Next steps
- Set up MySQL backup monitoring with Prometheus alerts and Grafana dashboards
- Set up Alertmanager with email and Slack notifications for monitoring alerts
- Configure Prometheus long-term storage with Thanos for unlimited data retention
- Implement Grafana SSO authentication with OAuth providers