Configure advanced Grafana dashboards and alerting with Prometheus integration

Advanced 45 min May 21, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Build production-ready Grafana dashboards with dynamic variables, custom panels, and sophisticated alert rules. Integrate Prometheus metrics for comprehensive monitoring with multi-condition alerting and notification channels.

Prerequisites

  • Grafana 10+ installed and running
  • Prometheus server with metrics collection
  • Administrative access to configure Grafana
  • Basic understanding of PromQL queries

What this solves

This tutorial turns raw Prometheus metrics into dashboards that filter by instance and job through template variables, and pairs them with alert rules and notification routing that can be reused across environments and services.

Step-by-step configuration

Verify Prometheus and Grafana installation

Ensure both services are running and accessible before proceeding with advanced configuration.

systemctl status prometheus
systemctl status grafana-server
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[] | select(test("up|node_"))' | head -5
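The jq filter above keeps every metric name that contains "up" or "node_" anywhere in the string. A minimal Python sketch of the same check follows; the function name and the sample list are illustrative assumptions, not output from a real server.

```python
import re


def filter_metric_names(names, pattern=r"up|node_"):
    """Keep names that contain a match for `pattern`, like jq's test()."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


# Hypothetical sample of the "data" array from /api/v1/label/__name__/values.
sample = ["go_goroutines", "node_cpu_seconds_total", "up", "node_load1"]
print(filter_metric_names(sample)[:5])
```

Because the pattern is not anchored, a name such as startup_time also passes the filter. Anchor the pattern (for example ^up$) if you only want the up metric itself.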

Configure Prometheus data source with advanced settings

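Set up the Prometheus data source with a 60-second query timeout and a 15-second minimum interval, which should match your scrape interval. The customQueryParameters shown here (max_source_resolution and partial_response) are Thanos query options; plain Prometheus does not use them, so drop that line unless you query through Thanos. Save the file under /etc/grafana/provisioning/datasources/.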

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true
    jsonData:
      httpMethod: POST
      queryTimeout: 60s
      timeInterval: 15s
      customQueryParameters: 'max_source_resolution=5m&partial_response=true'
    secureJsonData: {}

Create dashboard variables for dynamic filtering

Configure template variables that allow users to filter dashboards by instance, job, or custom labels dynamically.

{
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(up, instance)",
        "refresh": 1,
        "includeAll": true,
        "allValue": ".*",
        "multi": true,
        "options": [],
        "current": {},
        "hide": 0,
        "sort": 1
      },
      {
        "name": "job",
        "type": "query", 
        "query": "label_values(up, job)",
        "refresh": 1,
        "includeAll": true,
        "allValue": ".*",
        "multi": true,
        "regex": "/^(?!prometheus).*$/",
        "sort": 1
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "1m,5m,15m,30m,1h,6h,12h",
        "current": {
          "text": "5m",
          "value": "5m"
        }
      }
    ]
  }
}
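The regex on the job variable uses a negative lookahead to hide every job whose name starts with "prometheus". The sketch below reproduces that filter in Python and approximates how a multi-value selection ends up inside instance=~"$instance". The helper names are hypothetical, and the joining logic is an assumption about the general shape of the interpolation, not Grafana's exact implementation.

```python
import re

# Same pattern as the variable's "regex" field, without the slash delimiters.
JOB_FILTER = re.compile(r"^(?!prometheus).*$")


def filter_jobs(job_names):
    """Return the job names that remain after the variable's regex filter."""
    return [job for job in job_names if JOB_FILTER.match(job)]


def as_regex_value(selected, all_value=".*"):
    """Approximate the value substituted into a =~ matcher.

    Assumption: an empty selection stands for "All", and several values are
    escaped and joined into one alternation group.
    """
    if not selected:
        return all_value
    if len(selected) == 1:
        return re.escape(selected[0])
    return "(" + "|".join(re.escape(value) for value in selected) + ")"


print(filter_jobs(["prometheus", "node", "mysql"]))
print(as_regex_value(["web1", "web2"]))
```

This is why the panel queries use =~ rather than =: the substituted value is a regular expression, not a literal label value.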

Build advanced system overview dashboard

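Build the system overview from panels for CPU, memory, disk, and network. The first panel plots CPU utilization (query A) alongside a one-hour projection (query B), both filtered by the instance and job variables. The alert block uses the legacy dashboard alerting format, which Grafana 11 removed; on current versions, define the threshold as a Grafana-managed alert rule instead, as in the alert rule steps later in this tutorial.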

{
  "targets": [
    {
      "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\",job=~\"$job\"}[5m])) * 100)",
      "legendFormat": "{{instance}} - Current",
      "refId": "A"
    },
    {
      "expr": "predict_linear((100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$instance\",job=~\"$job\"}[5m])) * 100))[1h:1m], 3600)",
      "legendFormat": "{{instance}} - Predicted +1h",
      "refId": "B"
    }
  ],
  "yAxes": [
    {
      "min": 0,
      "max": 100,
      "unit": "percent"
    }
  ],
  "alert": {
    "conditions": [
      {
        "query": {
          "queryType": "A",
          "refId": "A"
        },
        "reducer": {
          "type": "last",
          "params": []
        },
        "evaluator": {
          "params": [85],
          "type": "gt"
        }
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "10s",
    "handler": 1,
    "name": "High CPU Usage",
    "noDataState": "no_data"
  }
}
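To make the two CPU queries concrete, here is a small Python sketch of the arithmetic behind them: utilization as 100 minus the idle percentage, and a least-squares line extrapolated one hour ahead. It is a simplified illustration with hypothetical function names. In particular, it extrapolates from the last sample's timestamp, which is an assumption and may differ from how Prometheus anchors predict_linear.

```python
def cpu_usage_percent(idle_rate):
    """Mirror 100 - (idle rate * 100) for an averaged idle rate in [0, 1]."""
    return 100.0 - idle_rate * 100.0


def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and extrapolate.

    Returns the fitted value `seconds_ahead` seconds after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    target_time = samples[-1][0] + seconds_ahead
    return intercept + slope * target_time


# Utilization samples (percent) taken one minute apart, rising steadily.
history = [(0, 10.0), (60, 20.0), (120, 30.0)]
print(cpu_usage_percent(0.15))
print(predict_linear(history, 3600))
```

A projection is only meaningful when it is computed on the same quantity as the current value. Extrapolating the raw node_cpu_seconds_total counter would yield accumulated CPU seconds, not a percentage.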

Configure memory usage panel with thresholds

Create a memory usage visualization with dynamic thresholds and trend analysis.

{
  "title": "Memory Usage",
  "type": "stat",
  "targets": [
    {
      "expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$instance\",job=~\"$job\"} / node_memory_MemTotal_bytes{instance=~\"$instance\",job=~\"$job\"})) * 100",
      "legendFormat": "{{instance}}",
      "refId": "A"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "color": {
        "mode": "thresholds"
      },
      "thresholds": {
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "yellow",
            "value": 70
          },
          {
            "color": "red",
            "value": 85
          }
        ]
      },
      "unit": "percent",
      "min": 0,
      "max": 100
    }
  },
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"],
      "fields": ""
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "background"
  }
}
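The threshold steps are read from lowest to highest: the first step (value null) is the base color, and each later step takes over once the value reaches it. A minimal Python sketch of that lookup, assuming the steps are listed in ascending order as above (the function name is illustrative):

```python
def threshold_color(value, steps):
    """Return the color of the highest step whose value is <= `value`."""
    color = None
    for step in steps:
        if step["value"] is None or value >= step["value"]:
            color = step["color"]
    return color


STEPS = [
    {"color": "green", "value": None},
    {"color": "yellow", "value": 70},
    {"color": "red", "value": 85},
]

for usage in (42, 70, 84.9, 85):
    print(usage, threshold_color(usage, STEPS))
```

With these steps, memory usage turns yellow at exactly 70 percent and red at exactly 85 percent.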

Create disk I/O heatmap visualization

Build a heatmap showing disk I/O patterns across time and instances for performance analysis.

{
  "title": "Disk I/O Operations Heatmap",
  "type": "heatmap",
  "targets": [
    {
      "expr": "sum by (instance) (irate(node_disk_io_time_seconds_total{instance=~\"$instance\",job=~\"$job\"}[5m]))",
      "format": "time_series",
      "refId": "A"
    }
  ],
  "heatmap": {
    "xAxis": {
      "show": true
    },
    "yAxis": {
      "show": true,
      "logBase": 1,
      "min": "0",
      "max": "1"
    },
    "yBucketBound": "auto",
    "xBucketSize": null,
    "yBucketSize": null
  },
  "color": {
    "mode": "spectrum",
    "colorScheme": "interpolateSpectral",
    "exponent": 0.5,
    "fill": "dark-orange"
  },
  "legend": {
    "show": false
  }
}

Set up network traffic monitoring table

Create a table visualization showing detailed network statistics with sorting and filtering capabilities.

{
  "title": "Network Interface Statistics",
  "type": "table",
  "targets": [
    {
      "expr": "irate(node_network_receive_bytes_total{instance=~\"$instance\",job=~\"$job\",device!~\"lo|veth.*|docker.*|virbr.*|br-.*\"}[5m]) * 8",
      "format": "table",
      "legendFormat": "",
      "refId": "A"
    },
    {
      "expr": "irate(node_network_transmit_bytes_total{instance=~\"$instance\",job=~\"$job\",device!~\"lo|veth.*|docker.*|virbr.*|br-.*\"}[5m]) * 8",
      "format": "table",
      "refId": "B"
    }
  ],
  "transformations": [
    {
      "id": "merge",
      "options": {}
    },
    {
      "id": "organize",
      "options": {
        "excludeByName": {
          "Time": true,
          "__name__": true,
          "job": true
        },
        "indexByName": {
          "instance": 0,
          "device": 1,
          "Value #A": 2,
          "Value #B": 3
        },
        "renameByName": {
          "Value #A": "RX (bps)",
          "Value #B": "TX (bps)",
          "device": "Interface",
          "instance": "Instance"
        }
      }
    }
  ],
  "fieldConfig": {
    "defaults": {
      "custom": {
        "displayMode": "auto",
        "filterable": true
      },
      "unit": "bps"
    },
    "overrides": [
      {
        "matcher": {
          "id": "byName",
          "options": "Instance"
        },
        "properties": [
          {
            "id": "unit",
            "value": "string"
          }
        ]
      }
    ]
  }
}
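PromQL regex matchers are anchored, so the device filter must match the whole interface name. The Python sketch below mimics that with fullmatch, assuming the exclusions are meant as prefixes (veth.*, docker.*, and so on); the helper names are hypothetical. It also shows the bytes-to-bits conversion behind the trailing * 8.

```python
import re

# Assumption: each exclusion is a prefix, so the alternatives end in ".*".
EXCLUDED = re.compile(r"lo|veth.*|docker.*|virbr.*|br-.*")


def is_monitored(device):
    """True when the device passes device!~"..." (whole-string match)."""
    return EXCLUDED.fullmatch(device) is None


def to_bits_per_second(bytes_per_second):
    """The queries multiply byte rates by 8 to display bits per second."""
    return bytes_per_second * 8


for name in ("eth0", "lo", "veth1a2b3c", "docker0", "br-4f2a"):
    print(name, is_monitored(name))
print(to_bits_per_second(125))
```

A pattern such as veth. (a single dot, no star) would only exclude names exactly five characters long, so virtual interfaces with longer names would still appear in the table.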

Configure advanced alert rules with multiple conditions

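Set up an alert rule that fires only when CPU and memory are both high, which cuts noise from short spikes in a single metric. Place the file under /etc/grafana/provisioning/alerting/. Each query entry also needs a datasourceUid: the UID of your Prometheus data source for A and B, and __expr__ for the expression C. The example omits them, so add the values from your installation.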

apiVersion: 1
groups:
  - name: system_alerts
    folder: System Monitoring
    interval: 1m
    rules:
      - uid: high_cpu_memory_combo
        title: High CPU and Memory Usage Combined
        condition: C
        data:
          - refId: A
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
              refId: A
          - refId: B
            queryType: ''
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
              refId: B
          - refId: C
            queryType: ''
            relativeTimeRange:
              from: 0
              to: 0
            model:
              conditions:
                - evaluator:
                    params:
                      - 80
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - A
                  reducer:
                    params: []
                    type: last
                  type: query
                - evaluator:
                    params:
                      - 85
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - B
                  reducer:
                    params: []
                    type: last
                  type: query
              refId: C
        noDataState: NoData
        execErrState: Alerting
        for: 5m
        annotations:
          description: 'Instance {{ $labels.instance }} has high CPU ({{ printf "%.1f" $values.C0.Value }}%) AND high memory usage ({{ printf "%.1f" $values.C1.Value }}%)'
          runbook_url: "https://runbooks.example.com/high-resource-usage"
          summary: "Critical resource usage on {{ $labels.instance }}"
        labels:
          severity: critical
          team: infrastructure
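The condition C above reduces each query to its most recent value and joins the two comparisons with AND. A minimal Python sketch of that logic, with illustrative function names and made-up sample series:

```python
def last(series):
    """The 'last' reducer: the most recent value of a series."""
    return series[-1]


def combined_alert(cpu_series, mem_series, cpu_limit=80, mem_limit=85):
    """True only when both reduced values exceed their limits."""
    return last(cpu_series) > cpu_limit and last(mem_series) > mem_limit


print(combined_alert([50, 90], [60, 88]))  # both above their limits
print(combined_alert([50, 90], [60, 70]))  # memory below its limit
```

Because both comparisons must hold, a CPU spike on a host with plenty of free memory does not page anyone.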

Configure notification channels

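Set up notification channels for Slack, email, and a generic webhook. The notifiers format below belongs to legacy alerting; with Grafana's unified alerting, define the same integrations as contact points, as in the notification policy step that follows.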

apiVersion: 1
notifiers:
  - name: critical-slack
    type: slack
    uid: critical_slack_channel
    settings:
      url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
      channel: "#alerts-critical"
      username: grafana
      title: "Critical Alert - {{ .GroupLabels.alertname }}"
      text: |
        {{ range .Alerts }}
        Alert: {{ .Annotations.summary }}
        Description: {{ .Annotations.description }}
        Severity: {{ .Labels.severity }}
        Instance: {{ .Labels.instance }}
        Runbook: {{ .Annotations.runbook_url }}
        {{ end }}
      iconEmoji: ":exclamation:"

  - name: warning-email
    type: email
    uid: warning_email_list
    settings:
      addresses: "devops@example.com;sre@example.com"
      subject: "[Grafana] Warning Alert - {{ .GroupLabels.alertname }}"
      body: |
        Grafana Alert Notification

        {{ range .Alerts }}
        {{ .Annotations.summary }}
        Description: {{ .Annotations.description }}
        Severity: {{ .Labels.severity }}
        Instance: {{ .Labels.instance }}
        Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
        {{ if .Annotations.runbook_url }}
        View Runbook: {{ .Annotations.runbook_url }}
        {{ end }}
        {{ end }}

  - name: webhook-integration
    type: webhook
    uid: external_webhook
    settings:
      url: https://api.example.com/webhooks/grafana-alerts
      httpMethod: POST
      username: grafana
      password: webhook_secret_password
      title: "Grafana Alert"
      body: |
        {
          "alertname": "{{ .GroupLabels.alertname }}",
          "status": "{{ .Status }}",
          "alerts": [
            {{ range $i, $alert := .Alerts }}{{ if $i }},{{ end }}
            {
              "summary": "{{ .Annotations.summary }}",
              "description": "{{ .Annotations.description }}",
              "severity": "{{ .Labels.severity }}",
              "instance": "{{ .Labels.instance }}",
              "starts_at": "{{ .StartsAt }}",
              "ends_at": "{{ .EndsAt }}"
            }
            {{ end }}
          ]
        }

Set up notification policies with label-based routing

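Route alerts by severity and team ownership using label matchers. Every receiver named in a route must exist as a contact point: the example defines only critical-notifications, warning-notifications, and default-receiver, so add contact points for infrastructure-critical, database-critical, and info-notifications before provisioning.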

apiVersion: 1
policies:
  - orgId: 1
    receiver: default-receiver
    group_by:
      - alertname
      - instance
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 12h
    routes:
      - receiver: critical-notifications
        group_wait: 5s
        group_interval: 5s
        repeat_interval: 1h
        matchers:
          - severity = critical
        routes:
          - receiver: infrastructure-critical
            matchers:
              - team = infrastructure
            continue: true
          - receiver: database-critical
            matchers:
              - team = database
            continue: true
      
      - receiver: warning-notifications
        group_wait: 30s
        group_interval: 30s
        repeat_interval: 6h
        matchers:
          - severity = warning
        
      - receiver: info-notifications
        group_wait: 5m
        group_interval: 5m
        repeat_interval: 24h
        matchers:
          - severity = info

contactPoints:
  - orgId: 1
    name: critical-notifications
    receivers:
      - uid: critical_slack_channel
        type: slack
      - uid: critical_pagerduty
        type: pagerduty
        settings:
          integrationKey: YOUR_PAGERDUTY_INTEGRATION_KEY
          severity: critical
          component: grafana
          group: infrastructure
  
  - orgId: 1
    name: warning-notifications
    receivers:
      - uid: warning_email_list
        type: email
      - uid: warning_slack_general
        type: slack
        settings:
          url: https://hooks.slack.com/services/YOUR/WARNING/WEBHOOK
          channel: "#alerts-general"
  
  - orgId: 1
    name: default-receiver
    receivers:
      - uid: default_email
        type: email
        settings:
          addresses: "admin@example.com"
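It helps to trace by hand where a given alert ends up. The Python sketch below walks a simplified copy of the policy tree using Alertmanager-style matching, which Grafana's notification policies are modeled on: a route that matches hands the alert to its matching children, and handles it itself only when no child matches. The data structure and function names are illustrative assumptions, and only equality matchers are supported.

```python
def matches(route, labels):
    """A route matches when all of its equality matchers agree with the labels."""
    wanted = route.get("matchers", {})
    return all(labels.get(name) == value for name, value in wanted.items())


def resolve(route, labels):
    """Return the receivers that would be notified for an alert's labels."""
    if not matches(route, labels):
        return []
    receivers = []
    for child in route.get("routes", []):
        found = resolve(child, labels)
        receivers.extend(found)
        # Stop at the first matching child unless it sets continue: true.
        if found and not child.get("continue", False):
            break
    # No child matched, so this route handles the alert itself.
    return receivers or [route["receiver"]]


POLICY = {
    "receiver": "default-receiver",
    "routes": [
        {
            "receiver": "critical-notifications",
            "matchers": {"severity": "critical"},
            "routes": [
                {"receiver": "infrastructure-critical",
                 "matchers": {"team": "infrastructure"}, "continue": True},
                {"receiver": "database-critical",
                 "matchers": {"team": "database"}, "continue": True},
            ],
        },
        {"receiver": "warning-notifications", "matchers": {"severity": "warning"}},
        {"receiver": "info-notifications", "matchers": {"severity": "info"}},
    ],
}

print(resolve(POLICY, {"severity": "critical", "team": "infrastructure"}))
print(resolve(POLICY, {"severity": "critical", "team": "web"}))
```

Under these semantics, a critical alert owned by the infrastructure team goes to infrastructure-critical only, while critical alerts from other teams fall back to critical-notifications.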

Create service-level dashboard with SLI/SLO tracking

Build a comprehensive service dashboard that tracks service level indicators and objectives with burn rate alerts.

{
  "title": "Service Level Objective Tracking",
  "type": "stat",
  "targets": [
    {
      "expr": "(sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[1h])) / sum(rate(http_requests_total{job=~\"$job\"}[1h]))) * 100",
      "legendFormat": "Success Rate (1h)",
      "refId": "A"
    },
    {
      "expr": "(sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[24h])) / sum(rate(http_requests_total{job=~\"$job\"}[24h]))) * 100",
      "legendFormat": "Success Rate (24h)",
      "refId": "B"
    },
    {
      "expr": "99.5 - (sum(rate(http_requests_total{job=~\"$job\",code!~\"5..\"}[30d])) / sum(rate(http_requests_total{job=~\"$job\"}[30d]))) * 100",
      "legendFormat": "Error Budget Consumption (30d)",
      "refId": "C"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "color": {
        "mode": "thresholds"
      },
      "thresholds": {
        "steps": [
          {
            "color": "red",
            "value": null
          },
          {
            "color": "yellow",
            "value": 99
          },
          {
            "color": "green",
            "value": 99.5
          }
        ]
      },
      "unit": "percent",
      "decimals": 2
    },
    "overrides": [
      {
        "matcher": {
          "id": "byName",
          "options": "Error Budget Consumption (30d)"
        },
        "properties": [
          {
            "id": "thresholds",
            "value": {
              "steps": [
                {
                  "color": "green",
                  "value": null
                },
                {
                  "color": "yellow",
                  "value": 0.25
                },
                {
                  "color": "red",
                  "value": 0.4
                }
              ]
            }
          }
        ]
      }
    ]
  },
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "horizontal",
    "textMode": "value_and_name",
    "colorMode": "background"
  }
}
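The arithmetic behind the three queries is easy to check by hand. The sketch below computes a success rate from request counts and then the value of query C, which subtracts that rate from the 99.5 percent objective. The function names and the request counts are illustrative assumptions.

```python
def success_rate_percent(total_requests, server_errors):
    """Share of requests that did not return a 5xx status, in percent."""
    if total_requests == 0:
        raise ValueError("no requests in the window")
    return (total_requests - server_errors) / total_requests * 100


def slo_shortfall(success_percent, slo_percent=99.5):
    """Mirror query C: positive once the success rate drops below the SLO."""
    return slo_percent - success_percent


healthy = success_rate_percent(10_000, 30)
degraded = success_rate_percent(10_000, 80)
print(round(healthy, 2), round(slo_shortfall(healthy), 2))
print(round(degraded, 2), round(slo_shortfall(degraded), 2))
```

Note that query C is negative while the service meets its objective and becomes positive only when the 30-day success rate falls below 99.5 percent, so the yellow and red thresholds on that field mark how far below the objective the service has dropped.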

Configure burn rate alerting for SLO monitoring

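Set up multi-window burn rate alerts that detect when your error budget is being consumed too quickly. A burn rate of 14.4 uses up 2% of a 30-day error budget in one hour; with a 99.5% SLO, that corresponds to an error ratio above 14.4 × 0.005 = 7.2%. Requiring both the 1-hour and the 5-minute window to exceed the threshold keeps the alert from firing on an incident that has already ended.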

{
  "uid": "slo_burn_rate_alert",
  "title": "SLO Burn Rate - Fast Burn",
  "condition": "C",
  "data": [
    {
      "refId": "A",
      "queryType": "",
      "relativeTimeRange": {
        "from": 3600,
        "to": 0
      },
      "model": {
        "expr": "(1 - (sum(rate(http_requests_total{code!~\"5..\"}[1h])) / sum(rate(http_requests_total[1h])))) > bool (14.4 * (1 - 0.995))",
        "refId": "A"
      }
    },
    {
      "refId": "B",
      "queryType": "",
      "relativeTimeRange": {
        "from": 300,
        "to": 0
      },
      "model": {
        "expr": "(1 - (sum(rate(http_requests_total{code!~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])))) > bool (14.4 * (1 - 0.995))",
        "refId": "B"
      }
    },
    {
      "refId": "C",
      "queryType": "",
      "relativeTimeRange": {
        "from": 0,
        "to": 0
      },
      "model": {
        "conditions": [
          {
            "evaluator": {
              "params": [0],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": ["A"]
            },
            "reducer": {
              "params": [],
              "type": "last"
            },
            "type": "query"
          },
          {
            "evaluator": {
              "params": [0],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": ["B"]
            },
            "reducer": {
              "params": [],
              "type": "last"
            },
            "type": "query"
          }
        ],
        "refId": "C"
      }
    }
  ],
  "noDataState": "NoData",
  "execErrState": "Alerting",
  "for": "2m",
  "annotations": {
    "description": "Service is burning through error budget at 14.4x the acceptable rate. Immediate action required.",
    "runbook_url": "https://runbooks.example.com/slo-burn-rate",
    "summary": "Fast SLO burn rate detected"
  },
  "labels": {
    "severity": "critical",
    "team": "sre",
    "type": "slo"
  }
}
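The burn rate numbers can be verified with a few lines of arithmetic. This Python sketch computes the burn rate from an error ratio, the share of a 30-day budget consumed over a window, and the two-window check the rule performs. All function names are illustrative, and the 30-day period is the assumption used throughout this section.

```python
def burn_rate(error_ratio, slo=0.995):
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo)


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's error budget used during the window."""
    return rate * window_hours / period_hours


def fast_burn(error_ratio_1h, error_ratio_5m, slo=0.995, factor=14.4):
    """True when both windows exceed factor * (1 - slo)."""
    threshold = factor * (1 - slo)
    return error_ratio_1h > threshold and error_ratio_5m > threshold


print(round(burn_rate(0.072), 2))
print(round(budget_consumed(14.4, window_hours=1), 4))
print(fast_burn(0.10, 0.09), fast_burn(0.10, 0.01))
```

The second call to fast_burn returns False because the 5-minute window has recovered, which is exactly the case the short window is there to filter out.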

Enable and restart Grafana service

Apply all configuration changes by restarting Grafana and verify the service starts correctly.

sudo systemctl enable grafana-server
sudo systemctl restart grafana-server
sudo systemctl status grafana-server
sudo journalctl -u grafana-server -f --lines=20

Verify your setup

Test your advanced Grafana configuration to ensure all components are working correctly.

# Check dashboard variables are loading
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
  "http://localhost:3000/api/dashboards/uid/YOUR_DASHBOARD_UID" | \
  jq '.dashboard.templating.list[] | {name: .name, type: .type}'

# Verify alert rules are active
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
  "http://localhost:3000/api/ruler/grafana/api/v1/rules" | \
  jq '.[] | keys'

# Test notification channels
curl -X POST -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name":"test-notification"}' \
  "http://localhost:3000/api/alert-notifications/test"

# Check Prometheus connectivity
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
  "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up" | \
  jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
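If you prefer to script the connectivity check, the same extraction as the last jq filter can be done in Python. The sketch below parses a Prometheus instant-query response and lists each instance with its up value. The sample payload is a hand-written assumption that follows the documented response shape, not output captured from a real server.

```python
import json


def summarize_up(response_text):
    """Extract instance and value from an instant-query response for `up`."""
    payload = json.loads(response_text)
    return [
        {"instance": item["metric"].get("instance"), "value": item["value"][1]}
        for item in payload["data"]["result"]
    ]


def down_instances(response_text):
    """Instances whose up value is "0" (sample values arrive as strings)."""
    return [row["instance"] for row in summarize_up(response_text) if row["value"] == "0"]


# Hypothetical response body for query=up.
SAMPLE = """{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "up", "instance": "localhost:9100", "job": "node"},
   "value": [1716300000, "1"]},
  {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
   "value": [1716300000, "0"]}
]}}"""

print(summarize_up(SAMPLE))
print(down_instances(SAMPLE))
```

Sample values are strings in the JSON response, so compare against "0" and "1" or convert with float() before doing arithmetic.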

Advanced dashboard techniques

Create annotation queries for event correlation

Add deployment and incident annotations to correlate system behavior with external events.

{
  "annotations": {
    "list": [
      {
        "name": "Deployments",
        "datasource": "Prometheus",
        "enable": true,
        "expr": "increase(deployment_timestamp[5m])",
        "iconColor": "blue",
        "titleFormat": "Deployment: {{service}}",
        "textFormat": "Version {{version}} deployed by {{user}}",
        "tags": ["deployment"],
        "type": "tags"
      },
      {
        "name": "Incidents",
        "datasource": "Prometheus", 
        "enable": true,
        "expr": "incident_start_timestamp > 0",
        "iconColor": "red",
        "titleFormat": "Incident: {{incident_id}}",
        "textFormat": "{{description}} - Severity: {{severity}}",
        "tags": ["incident"],
        "type": "tags"
      }
    ]
  }
}

Configure custom value mappings and transformations

Transform raw metric values into meaningful business indicators using Grafana's transformation engine.

{
  "transformations": [
    {
      "id": "calculateField",
      "options": {
        "mode": "reduceRow",
        "reduce": {
          "reducer": "sum"
        },
        "replaceFields": false,
        "alias": "Total Requests"
      }
    },
    {
      "id": "calculateField",
      "options": {
        "mode": "binary",
        "binary": {
          "left": "Success Rate",
          "operator": "*",
          "reducer": "sum",
          "right": "Total Requests"
        },
        "replaceFields": false,
        "alias": "Successful Requests"
      }
    }
  ],
  "fieldConfig": {
    "defaults": {
      "mappings": [
        {
          "options": {
            "from": 0,
            "to": 50,
            "result": {
              "text": "Low Traffic",
              "color": "blue"
            }
          },
          "type": "range"
        },
        {
          "options": {
            "from": 50,
            "to": 200,
            "result": {
              "text": "Normal Traffic", 
              "color": "green"
            }
          },
          "type": "range"
        },
        {
          "options": {
            "from": 200,
            "to": null,
            "result": {
              "text": "High Traffic",
              "color": "orange"
            }
          },
          "type": "range"
        }
      ]
    }
  }
}
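Range mappings replace a numeric value with a label. The Python sketch below applies the three ranges from the configuration above, assuming inclusive bounds, a null bound meaning "unbounded", and the first matching range winning. Those assumptions only matter at the shared boundaries 50 and 200. The function name is illustrative.

```python
def map_traffic(value, mappings):
    """Return the text of the first range mapping that contains `value`."""
    for mapping in mappings:
        low, high = mapping["from"], mapping["to"]
        if (low is None or value >= low) and (high is None or value <= high):
            return mapping["text"]
    return None


MAPPINGS = [
    {"from": 0, "to": 50, "text": "Low Traffic"},
    {"from": 50, "to": 200, "text": "Normal Traffic"},
    {"from": 200, "to": None, "text": "High Traffic"},
]

for rate in (25, 120, 500):
    print(rate, map_traffic(rate, MAPPINGS))
```

If the exact behavior at 50 and 200 matters for your dashboard, consider making the ranges non-overlapping (for example 0 to 49.99 and 50 to 199.99) so the result does not depend on evaluation order.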

Common issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| Variables not loading values | Incorrect Prometheus query or missing data | Run curl "http://localhost:9090/api/v1/label/__name__/values" to verify metrics exist |
| Alert rules not triggering | Query returns no data or condition logic error | Test queries in the Prometheus UI first, then check sudo journalctl -u grafana-server \| grep -i alert |
| Notification channels failing | Invalid webhook URL or authentication | Test channels manually: curl -X POST "webhook_url" -d "test payload" |
| Dashboard loading slowly | Inefficient PromQL queries or large time ranges | Add query timeout limits, use recording rules for complex calculations |
| Permission denied errors | Grafana service account lacks file access | sudo chown -R grafana:grafana /etc/grafana/provisioning/ |
