Configure advanced Grafana dashboards with Prometheus

Build production-ready Grafana dashboards with dynamic variables, custom panels, and sophisticated alert rules. Integrate Prometheus metrics for comprehensive monitoring with multi-condition alerting and notification channels.

Prerequisites

Grafana 10+ installed and running
Prometheus server with metrics collection
Administrative access to configure Grafana
Basic understanding of PromQL queries

What this solves

This tutorial builds Grafana dashboards on top of Prometheus metrics using template variables, stat, heatmap, and table panels, plus alert rules with label-based notification routing. Because every query is filtered through variables, the same dashboards can be reused across environments and services.

Step-by-step configuration

Verify Prometheus and Grafana installation

Ensure both services are running and accessible before proceeding with advanced configuration.

systemctl status prometheus
systemctl status grafana-server
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[] | select(test("up|node_"))' | head -5

Configure Prometheus data source with advanced settings

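Provision the Prometheus data source with a query timeout and a minimum time interval that matches your scrape interval. The `customQueryParameters` shown here (`max_source_resolution`, `partial_response`) are understood by Thanos Query rather than by Prometheus itself, so drop that line if Grafana talks to Prometheus directly.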

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      queryTimeout: 60s
      timeInterval: 15s
      customQueryParameters: 'max_source_resolution=5m&partial_response=true'
    secureJsonData: {}

Create dashboard variables for dynamic filtering

Configure template variables that allow users to filter dashboards by instance, job, or custom labels dynamically.

{
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(up, instance)",
        "refresh": 1,
        "includeAll": true,
        "allValue": ".*",
        "multi": true,
        "options": [],
        "current": {},
        "hide": 0,
        "sort": 1
      },
      {
        "name": "job",
        "type": "query", 
        "query": "label_values(up, job)",
        "refresh": 1,
        "includeAll": true,
        "allValue": ".*",
        "multi": true,
        "regex": "/^(?!prometheus).*$/",
        "sort": 1
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "1m,5m,15m,30m,1h,6h,12h",
        "current": {
          "text": "5m",
          "value": "5m"
        }
      }
    ]
  }
}
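
The panel queries in the following steps reference these variables with the `=~` regex matcher. The Python sketch below approximates why: a multi-value selection is expanded into a regex alternation, and the "All" option falls back to the `allValue` setting. The function names and the `$__all` placeholder handling are illustrative assumptions, not Grafana's actual implementation, and its real escaping rules may differ in detail.

```python
import re

ALL_SENTINEL = "$__all"  # placeholder used in this sketch for the "All" option


def expand_variable(selected, all_value=".*"):
    """Approximate how a multi-value variable becomes a regex fragment."""
    if ALL_SENTINEL in selected:
        return all_value  # mirrors the variable's "allValue" setting
    # Escape regex metacharacters, then double the backslashes because the
    # pattern ends up inside a double-quoted PromQL string literal.
    escaped = [re.escape(v).replace("\\", "\\\\") for v in selected]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def build_selector(metric, label, selected):
    """Build a selector such as up{instance=~"(web1|web2)"}."""
    return f'{metric}{{{label}=~"{expand_variable(selected)}"}}'


print(build_selector("up", "instance", ["web1", "web2"]))
print(build_selector("up", "job", [ALL_SENTINEL]))
```

An equality matcher (`=`) would only work for a single selected value, which is why every query in this tutorial uses `=~` with `$instance` and `$job`.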

Build advanced system overview dashboard

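Start with a CPU usage panel that plots current utilization per instance next to a one-hour linear forecast; the memory, disk, and network panels follow in the next steps. The embedded `alert` block uses the legacy panel-alert format, so with unified alerting define the rule separately, as in the alert rules step later on.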

{
  "targets": [
    {
      "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\",job=~\"$job\"}[5m])) * 100)",
      "legendFormat": "{{instance}} - Current",
      "refId": "A"
    },
    {
      "expr": "predict_linear((100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\",job=~\"$job\"}[5m])) * 100))[1h:1m], 3600)",
      "legendFormat": "{{instance}} - Predicted +1h",
      "refId": "B"
    }
  ],
  "yAxes": [
    {
      "min": 0,
      "max": 100,
      "unit": "percent"
    }
  ],
  "alert": {
    "conditions": [
      {
        "query": {
          "queryType": "A",
          "refId": "A"
        },
        "reducer": {
          "type": "last",
          "params": []
        },
        "evaluator": {
          "params": [85],
          "type": "gt"
        }
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "10s",
    "handler": 1,
    "name": "High CPU Usage",
    "noDataState": "no_data"
  }
}
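
The second query relies on `predict_linear`, which fits a straight line through recent samples and extrapolates it. As a rough illustration of that idea, here is a simplified least-squares forecast in Python. It assumes samples arrive as `(timestamp_seconds, value)` pairs and extrapolates from the last sample rather than from the query evaluation time; `linear_forecast` is a hypothetical helper, not Prometheus code.

```python
def linear_forecast(samples, seconds_ahead):
    """Extrapolate with simple least-squares regression.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns the fitted value `seconds_ahead` seconds after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    t_last = samples[-1][0]
    # Shift timestamps so the last sample sits at x = 0; the intercept is
    # then the fitted value at the time of the last sample.
    xs = [t - t_last for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("all samples share one timestamp")
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * seconds_ahead


# Made-up history: CPU usage rising one percentage point per minute.
history = [(60 * i, 40.0 + i) for i in range(6)]
print(linear_forecast(history, 3600))
```

A forecast above 100 on a percentage series means the trend would saturate the CPU within the hour, even though the panel's 0 to 100 axis clips the line.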

Configure memory usage panel with thresholds

Create a stat panel that shows memory usage as a percentage and colors the background yellow from 70% and red from 85%.

{
  "title": "Memory Usage",
  "type": "stat",
  "targets": [
    {
      "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\",job=~\"$job\"} / node_memory_MemTotal_bytes{instance=~\"$instance\",job=~\"$job\"})) * 100",
      "legendFormat": "{{instance}}",
      "refId": "A"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "color": {
        "mode": "thresholds"
      },
      "thresholds": {
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "yellow",
            "value": 70
          },
          {
            "color": "red",
            "value": 85
          }
        ]
      },
      "unit": "percent",
      "min": 0,
      "max": 100
    }
  },
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"],
      "fields": ""
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "background"
  }
}
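
To make the threshold steps concrete, the sketch below mirrors the panel's arithmetic and step lookup in Python. It assumes that a step applies once the value reaches its bound (inclusive) and that the `null` step is the base color; both function names are hypothetical.

```python
def memory_used_percent(mem_available_bytes, mem_total_bytes):
    """Same arithmetic as the panel's PromQL expression."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


def threshold_color(value, steps):
    """Return the color of the highest step whose bound is <= value.

    steps: list of {"color": str, "value": number or None} in ascending
    order; None marks the base step that covers everything below.
    """
    color = None
    for step in steps:
        bound = step["value"]
        if bound is None or value >= bound:
            color = step["color"]
    return color


STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 70},
    {"color": "red", "value": 85},
]

used = memory_used_percent(4 * 1024**3, 16 * 1024**3)
print(used, threshold_color(used, STEPS))
```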

Create disk I/O heatmap visualization

Build a heatmap showing disk I/O patterns across time and instances for performance analysis.

{
  "title": "Disk I/O Operations Heatmap",
  "type": "heatmap",
  "targets": [
    {
      "expr": "sum by (instance) (irate(node_disk_io_time_seconds_total{instance=~\"$instance\",job=~\"$job\"}[5m]))",
      "format": "time_series",
      "refId": "A"
    }
  ],
  "heatmap": {
    "xAxis": {
      "show": true
    },
    "yAxis": {
      "show": true,
      "logBase": 1,
      "min": "0",
      "max": "1"
    },
    "yBucketBound": "auto",
    "xBucketSize": null,
    "yBucketSize": null
  },
  "color": {
    "mode": "spectrum",
    "colorScheme": "interpolateSpectral",
    "exponent": 0.5,
    "fill": "dark-orange"
  },
  "legend": {
    "show": false
  }
}

Set up network traffic monitoring table

Create a table visualization showing detailed network statistics with sorting and filtering capabilities.

{
  "title": "Network Interface Statistics",
  "type": "table",
  "targets": [
    {
      "expr": "irate(node_network_receive_bytes_total{instance=~\"$instance\",job=~\"$job\",device!~\"lo|veth.*|docker.*|virbr.*|br-.*\"}[5m]) * 8",
      "format": "table",
      "legendFormat": "",
      "refId": "A"
    },
    {
      "expr": "irate(node_network_transmit_bytes_total{instance=~\"$instance\",job=~\"$job\",device!~\"lo|veth.*|docker.*|virbr.*|br-.*\"}[5m]) * 8",
      "format": "table",
      "refId": "B"
    }
  ],
  "transformations": [
    {
      "id": "merge",
      "options": {}
    },
    {
      "id": "organize",
      "options": {
        "excludeByName": {
          "Time": true,
          "__name__": true,
          "job": true
        },
        "indexByName": {
          "instance": 0,
          "device": 1,
          "Value #A": 2,
          "Value #B": 3
        },
        "renameByName": {
          "Value #A": "RX (bps)",
          "Value #B": "TX (bps)",
          "device": "Interface",
          "instance": "Instance"
        }
      }
    }
  ],
  "fieldConfig": {
    "defaults": {
      "custom": {
        "displayMode": "auto",
        "filterable": true
      },
      "unit": "bps"
    },
    "overrides": [
      {
        "matcher": {
          "id": "byName",
          "options": "Instance"
        },
        "properties": [
          {
            "id": "unit",
            "value": "string"
          }
        ]
      }
    ]
  }
}
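
PromQL anchors regular-expression matchers at both ends of the label value, so a device pattern has to cover the whole interface name to exclude it. The Python sketch below imitates that with `re.fullmatch` and also shows the byte-to-bit conversion behind the `* 8` in the queries. The exclusion pattern and helper names are assumptions chosen for illustration.

```python
import re

# Assumed exclusion pattern for loopback and virtual interfaces.
EXCLUDED_DEVICES = re.compile(r"lo|veth.*|docker.*|virbr.*|br-.*")


def is_physical(device):
    """True when the device name does NOT fully match the exclusion regex.

    re.fullmatch imitates PromQL's anchoring of regex matchers at both
    ends of the label value.
    """
    return EXCLUDED_DEVICES.fullmatch(device) is None


def bytes_to_bits_per_second(bytes_per_second):
    """Convert a byte rate to a bit rate, as the `* 8` in the query does."""
    return bytes_per_second * 8


for name in ["eth0", "lo", "veth1a2b3c", "docker0", "br-4f2a"]:
    print(name, is_physical(name))
```

Because of the anchoring, a pattern such as `veth.` would match only names with exactly one character after `veth`, which is why each alternative in this sketch ends in `.*`.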

Configure advanced alert rules with multiple conditions

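Provision an alert rule that fires only when CPU usage exceeds 80% and memory usage exceeds 85% at the same time, sustained for five minutes. Queries A and B fetch the two metrics, and expression C combines them with two `gt` conditions joined by `and`. In a real provisioning file each `data` entry also needs a `datasourceUid` (the UID of your Prometheus data source for A and B, `__expr__` for C), which is left out here.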

apiVersion: 1
groups:
  - name: system_alerts
    folder: System Monitoring
    interval: 1m
    rules:
      - uid: high_cpu_memory_combo
        title: High CPU and Memory Usage Combined
        condition: C
        data:
          - refId: A
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              refId: A
          - refId: B
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
              refId: B
          - refId: C
            queryType: ''
            relativeTimeRange:
              from: 0
              to: 0
            model:
              conditions:
                - evaluator:
                    params:
                      - 80
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - A
                  reducer:
                    params: []
                    type: last
                  type: query
                - evaluator:
                    params:
                      - 85
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - B
                  reducer:
                    params: []
                    type: last
                  type: query
              refId: C
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        annotations:
          description: 'Instance {{ $labels.instance }} has high CPU ({{ printf "%.1f" $values.A.Value }}%) AND high memory usage ({{ printf "%.1f" $values.B.Value }}%)'
          runbook_url: "https://runbooks.example.com/high-resource-usage"
          summary: "Critical resource usage on {{ $labels.instance }}"
        labels:
          severity: critical
          team: infrastructure
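
The interplay between the combined condition and the `for: 5m` setting can be sketched as a small state machine. The Python below is a simplified model under stated assumptions: one evaluation per `interval`, a breach must hold continuously for the whole `for` duration before the rule fires, and any non-breaching evaluation resets it. The function names are hypothetical and Grafana's scheduler may count evaluations slightly differently.

```python
def breached(cpu_percent, mem_percent, cpu_limit=80, mem_limit=85):
    """Both `gt` conditions joined by AND, as in expression C."""
    return cpu_percent > cpu_limit and mem_percent > mem_limit


def alert_states(condition_results, interval_s=60, for_s=300):
    """Return the state after each evaluation: Normal, Pending or Alerting."""
    states = []
    breached_for = None  # seconds the condition has been continuously true
    for is_breached in condition_results:
        if not is_breached:
            breached_for = None
            states.append("Normal")
            continue
        breached_for = 0 if breached_for is None else breached_for + interval_s
        states.append("Alerting" if breached_for >= for_s else "Pending")
    return states


# Made-up samples: (cpu %, memory %) at one-minute intervals.
samples = [(90, 88), (91, 87), (70, 90), (92, 90)]
print(alert_states([breached(c, m) for c, m in samples]))
```

In this model a single healthy evaluation sends the rule back to Normal, so short spikes never reach the notification stage.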

Configure notification channels

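Define Slack, email, and webhook notification targets. Note that the top-level `notifiers` key belongs to Grafana's legacy alerting provisioning; under unified alerting the same settings are provisioned as `contactPoints`, as the notification policy step shows.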

apiVersion: 1
notifiers:
  - name: critical-slack
    type: slack
    uid: critical_slack_channel
    settings:
      url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
      channel: "#alerts-critical"
      username: grafana
      title: "Critical Alert - {{ .GroupLabels.alertname }}"
      text: |
        {{ range .Alerts }}
        Alert: {{ .Annotations.summary }}
        Description: {{ .Annotations.description }}
        Severity: {{ .Labels.severity }}
        Instance: {{ .Labels.instance }}
        Runbook: {{ .Annotations.runbook_url }}
        {{ end }}
      iconEmoji: ":exclamation:"

  - name: warning-email
    type: email
    uid: warning_email_list
    settings:
      addresses: "devops@example.com;sre@example.com"
      subject: "[Grafana] Warning Alert - {{ .GroupLabels.alertname }}"
      body: |
        Grafana Alert Notification
        {{ range .Alerts }}
        {{ .Annotations.summary }}
        Description: {{ .Annotations.description }}
        Severity: {{ .Labels.severity }}
        Instance: {{ .Labels.instance }}
        Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
        {{ if .Annotations.runbook_url }}
        Runbook: {{ .Annotations.runbook_url }}
        {{ end }}
        {{ end }}

  - name: webhook-integration
    type: webhook
    uid: external_webhook
    settings:
      url: https://api.example.com/webhooks/grafana-alerts
      httpMethod: POST
      username: grafana
      password: webhook_secret_password
      title: "Grafana Alert"
      body: |
        {
          "alertname": "{{ .GroupLabels.alertname }}",
          "status": "{{ .Status }}",
          "alerts": [
            {{ range .Alerts }}
            {
              "summary": "{{ .Annotations.summary }}",
              "description": "{{ .Annotations.description }}",
              "severity": "{{ .Labels.severity }}",
              "instance": "{{ .Labels.instance }}",
              "starts_at": "{{ .StartsAt }}",
              "ends_at": "{{ .EndsAt }}"
            }{{ if not (eq . (index $.Alerts (sub (len $.Alerts) 1))) }},{{ end }}
            {{ end }}
          ]
        }

Set up notification policies with label-based routing

Configure intelligent alert routing based on severity levels and team ownership using label matchers.

apiVersion: 1
policies:
  - orgId: 1
    receiver: default-receiver
    group_by:
      - alertname
      - instance
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 12h
    routes:
      - receiver: critical-notifications
        group_wait: 5s
        group_interval: 5s
        repeat_interval: 1h
        matchers:
          - severity = critical
        routes:
          - receiver: infrastructure-critical
            matchers:
              - team = infrastructure
            continue: true
          - receiver: database-critical
            matchers:
              - team = database
            continue: true
      
      - receiver: warning-notifications
        group_wait: 30s
        group_interval: 30s
        repeat_interval: 6h
        matchers:
          - severity = warning
        
      - receiver: info-notifications
        group_wait: 5m
        group_interval: 5m
        repeat_interval: 24h
        matchers:
          - severity = info

contactPoints:
  - orgId: 1
    name: critical-notifications
    receivers:
      - uid: critical_slack_channel
        type: slack
      - uid: critical_pagerduty
        type: pagerduty
        settings:
          integrationKey: YOUR_PAGERDUTY_INTEGRATION_KEY
          severity: critical
          component: grafana
          group: infrastructure
  
  - orgId: 1
    name: warning-notifications
    receivers:
      - uid: warning_email_list
        type: email
      - uid: warning_slack_general
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WARNING/WEBHOOK
          channel: "#alerts-general"
  
  - orgId: 1
    name: default-receiver
    receivers:
      - uid: default_email
        type: email
        settings:
          addresses: "admin@example.com"
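
How an alert travels through this policy tree can be sketched in a few lines of Python. The model below follows the Alertmanager-style routing rules that notification policies are based on: children are tried in order, matching stops at the first matching child unless it sets `continue`, and a route with no matching child handles the alert itself. Matchers are reduced to plain equality, and `matches` and `resolve` are hypothetical helpers, so treat this as a sketch rather than Grafana's implementation.

```python
def matches(route, labels):
    """A route matches when every `label = value` matcher agrees."""
    return all(labels.get(k) == v for k, v in route.get("matchers", {}).items())


def resolve(route, labels):
    """Return the receivers selected for `labels`, walking depth first."""
    if not matches(route, labels):
        return []
    selected = []
    for child in route.get("routes", []):
        found = resolve(child, labels)
        selected.extend(found)
        # Stop at the first matching child unless it sets `continue`.
        if found and not child.get("continue", False):
            break
    # With no matching child, the current route handles the alert.
    return selected or [route["receiver"]]


POLICY = {
    "receiver": "default-receiver",
    "routes": [
        {
            "receiver": "critical-notifications",
            "matchers": {"severity": "critical"},
            "routes": [
                {"receiver": "infrastructure-critical",
                 "matchers": {"team": "infrastructure"}, "continue": True},
                {"receiver": "database-critical",
                 "matchers": {"team": "database"}, "continue": True},
            ],
        },
        {"receiver": "warning-notifications", "matchers": {"severity": "warning"}},
        {"receiver": "info-notifications", "matchers": {"severity": "info"}},
    ],
}

print(resolve(POLICY, {"severity": "critical", "team": "infrastructure"}))
print(resolve(POLICY, {"severity": "critical", "team": "sre"}))
```

In this model a critical alert labelled `team = infrastructure` is delivered to `infrastructure-critical` only; the parent `critical-notifications` receiver is used just for critical alerts that match no team-specific child.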

Create service-level dashboard with SLI/SLO tracking

Build a comprehensive service dashboard that tracks service level indicators and objectives with burn rate alerts.

{
  "title": "Service Level Objective Tracking",
  "type": "stat",
  "targets": [
    {
      "expr": "(sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[1h])) / sum(rate(http_requests_total{job=~\"$job\"}[1h]))) * 100",
      "legendFormat": "Success Rate (1h)",
      "refId": "A"
    },
    {
      "expr": "(sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[24h])) / sum(rate(http_requests_total{job=~\"$job\"}[24h]))) * 100",
      "legendFormat": "Success Rate (24h)",
      "refId": "B"
    },
    {
      "expr": "99.5 - (sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[30d])) / sum(rate(http_requests_total{job=~\"$job\"}[30d]))) * 100",
      "legendFormat": "Error Budget Consumption (30d)",
      "refId": "C"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "color": {
        "mode": "thresholds"
      },
      "thresholds": {
        "steps": [
          {
            "color": "red",
            "value": null
          },
          {
            "color": "yellow",
            "value": 99
          },
          {
            "color": "green",
            "value": 99.5
          }
        ]
      },
      "unit": "percent",
      "decimals": 2
    },
    "overrides": [
      {
        "matcher": {
          "id": "byName",
          "options": "Error Budget Consumption (30d)"
        },
        "properties": [
          {
            "id": "thresholds",
            "value": {
              "steps": [
                {
                  "color": "green",
                  "value": null
                },
                {
                  "color": "yellow",
                  "value": 0.25
                },
                {
                  "color": "red",
                  "value": 0.4
                }
              ]
            }
          }
        ]
      }
    ]
  },
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "horizontal",
    "textMode": "value_and_name",
    "colorMode": "background"
  }
}
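
The numbers behind this panel are easier to reason about with a worked example. The Python sketch below assumes the 99.5% objective and 30-day window used in the queries; the helper names and the request counts are made up for illustration.

```python
def success_rate_percent(total_requests, failed_requests):
    """Share of requests that did not fail, as a percentage."""
    if total_requests == 0:
        raise ValueError("no requests in the window")
    return (1 - failed_requests / total_requests) * 100


def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of full outage the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_consumed_fraction(total_requests, failed_requests, slo_percent):
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = 1 - slo_percent / 100
    observed = failed_requests / total_requests
    return observed / allowed


print(round(error_budget_minutes(99.5), 1))
print(round(success_rate_percent(1000, 2), 2))
print(round(budget_consumed_fraction(1000, 2, 99.5), 2))
```

With these sample counts, 2 failures in 1000 requests leave the service above its objective while already using a sizeable share of the budget, which is the situation the yellow band is meant to highlight.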

Configure burn rate alerting for SLO monitoring

Set up multi-window burn rate alerts that detect when your error budget is being consumed too quickly.

{
  "uid": "slo_burn_rate_alert",
  "title": "SLO Burn Rate - Fast Burn",
  "condition": "C",
  "data": [
    {
      "refId": "A",
      "queryType": "",
      "relativeTimeRange": {
        "from": 3600,
        "to": 0
      },
      "model": {
        "expr": "(1 - (sum(rate(http_requests_total{code!~\"5..\"}[1h])) / sum(rate(http_requests_total[1h])))) > bool (14.4 * (1 - 0.995))",
        "refId": "A"
      }
    },
    {
      "refId": "B",
      "queryType": "",
      "relativeTimeRange": {
        "from": 300,
        "to": 0
      },
      "model": {
        "expr": "(1 - (sum(rate(http_requests_total{code!~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])))) > bool (14.4 * (1 - 0.995))",
        "refId": "B"
      }
    },
    {
      "refId": "C",
      "queryType": "",
      "relativeTimeRange": {
        "from": 0,
        "to": 0
      },
      "model": {
        "conditions": [
          {
            "evaluator": {
              "params": [0],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": ["A"]
            },
            "reducer": {
              "params": [],
              "type": "last"
            },
            "type": "query"
          },
          {
            "evaluator": {
              "params": [0],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": ["B"]
            },
            "reducer": {
              "params": [],
              "type": "last"
            },
            "type": "query"
          }
        ],
        "refId": "C"
      }
    }
  ],
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "for": "2m",
  "annotations": {
    "description": "Service is burning through error budget at 14.4x the acceptable rate. Immediate action required.",
    "runbook_url": "https://runbooks.example.com/slo-burn-rate",
    "summary": "Fast SLO burn rate detected"
  },
  "labels": {
    "severity": "critical",
    "team": "sre",
    "type": "slo"
  }
}
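
The factor 14.4 is not arbitrary: over a 30-day window, burning the budget 14.4 times faster than allowed for one hour uses up 2% of the whole budget. The Python sketch below reproduces that arithmetic and the two-window check, assuming the 99.5% objective from the rule; the function names are hypothetical.

```python
SLO = 0.995
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(error_ratio, slo=SLO):
    """How many times faster than allowed the budget is being spent."""
    return error_ratio / (1 - slo)


def budget_spent(rate, hours, window_hours=WINDOW_HOURS):
    """Fraction of the whole error budget consumed at `rate` for `hours`."""
    return rate * hours / window_hours


def fast_burn(error_ratio_1h, error_ratio_5m, factor=14.4, slo=SLO):
    """Fire only when both the long and the short window exceed the factor."""
    threshold = factor * (1 - slo)
    return error_ratio_1h > threshold and error_ratio_5m > threshold


print(round(budget_spent(14.4, 1), 4))
print(fast_burn(0.08, 0.09))
print(fast_burn(0.08, 0.01))
```

Requiring the 5-minute window to agree with the 1-hour window lets the alert clear quickly once the errors stop, instead of staying active until the hour-long average recovers.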

Enable and restart Grafana service

Apply all configuration changes by restarting Grafana and verify the service starts correctly.

sudo systemctl restart grafana-server
sudo systemctl status grafana-server
sudo journalctl -u grafana-server -f --lines=20

Verify your setup

Test your advanced Grafana configuration to ensure all components are working correctly.

# Check dashboard variables are loading
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
  "http://localhost:3000/api/dashboards/uid/YOUR_DASHBOARD_UID" | \
  jq '.dashboard.templating.list[] | {name: .name, type: .type}'

# Verify alert rules are active
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
  "http://localhost:3000/api/ruler/grafana/api/v1/rules" | \
  jq '.[] | keys'

# Test notification channels
curl -X POST -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"test-notification"}' \
  "http://localhost:3000/api/alert-notifications/test"

# Check Prometheus connectivity
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
  "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up" | \
  jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
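
When this check runs from a script rather than a terminal, the same response can be inspected in Python. The sketch below parses a made-up sample of the Prometheus instant-query response format and lists targets whose `up` value is not `1`; the sample payload and the `down_targets` helper are illustrative assumptions.

```python
import json

# Made-up sample in the shape returned by /api/v1/query?query=up
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "instance": "localhost:9100", "job": "node"},
             "value": [1700000000, "1"]},
            {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
             "value": [1700000000, "0"]}
          ]}}
"""


def down_targets(response_text):
    """Return instances whose `up` sample is not "1"."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise RuntimeError("query failed")
    # Each result carries [timestamp, value], with the value as a string.
    return [r["metric"].get("instance", "?")
            for r in body["data"]["result"]
            if r["value"][1] != "1"]


print(down_targets(SAMPLE_RESPONSE))
```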

Advanced dashboard techniques

Create annotation queries for event correlation

Add deployment and incident annotations to correlate system behavior with external events.

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "enable": true,
        "expr": "increase(deployment_timestamp[5m])",
        "iconColor": "blue",
        "titleFormat": "Deployment: {{service}}",
        "textFormat": "Version {{version}} deployed by {{user}}",
        "tags": ["deployment"],
        "type": "tags"
      },
      {
        "name": "Incidents",
        "datasource": "Prometheus", 
        "enable": true,
        "expr": "incident_start_timestamp > 0",
        "iconColor": "red",
        "titleFormat": "Incident: {{incident_id}}",
        "textFormat": "{{description}} - Severity: {{severity}}",
        "tags": ["incident"],
        "type": "tags"
      }
    ]
  }
}

Configure custom value mappings and transformations

Transform raw metric values into meaningful business indicators using Grafana's transformation engine.

{
  "transformations": [
    {
      "id": "calculateField",
      "options": {
        "mode": "reduceRow",
        "reduce": {
          "reducer": "sum"
        },
        "replaceFields": false,
        "alias": "Total Requests"
      }
    },
    {
      "id": "calculateField",
      "options": {
        "mode": "binary",
        "binary": {
          "left": "Success Rate",
          "operator": "*",
          "reducer": "sum",
          "right": "Total Requests"
        },
        "replaceFields": false,
        "alias": "Successful Requests"
      }
    }
  ],
  "fieldConfig": {
    "defaults": {
      "mappings": [
        {
          "options": {
            "from": 0,
            "to": 50,
            "result": {
              "text": "Low Traffic",
              "color": "blue"
            }
          },
          "type": "range"
        },
        {
          "options": {
            "from": 50,
            "to": 200,
            "result": {
              "text": "Normal Traffic", 
              "color": "green"
            }
          },
          "type": "range"
        },
        {
          "options": {
            "from": 200,
            "to": null,
            "result": {
              "text": "High Traffic",
              "color": "orange"
            }
          },
          "type": "range"
        }
      ]
    }
  }
}

Common issues

Symptom	Cause	Fix
Variables not loading values	Incorrect Prometheus query or missing data	`curl "http://localhost:9090/api/v1/label/__name__/values"` to verify metrics exist
Alert rules not triggering	Query returns no data or condition logic error	Test queries in Prometheus UI first, check `sudo journalctl -u grafana-server | grep -i alert`
Notification channels failing	Invalid webhook URL or authentication	Test channels manually: `curl -X POST "webhook_url" -d "test payload"`
Dashboard loading slowly	Inefficient PromQL queries or large time ranges	Add query timeout limits, use recording rules for complex calculations
Permission denied errors	Grafana service account lacks file access	`sudo chown -R grafana:grafana /etc/grafana/provisioning/`
