Set up comprehensive Grafana alerting with webhook endpoints, Slack and Teams notifications, and advanced alert conditions. Configure data sources, create alert rules, and implement custom notification channels for production monitoring.
Prerequisites
- Grafana 9.0 or higher
- Prometheus data source
- SMTP server access for email notifications
- Webhook endpoints for external integrations
- Administrative access to configure notification channels
What this solves
Grafana's unified alerting system lets you create sophisticated alert rules that can notify multiple channels when your infrastructure needs attention. This tutorial shows you how to set up webhook endpoints for external integrations, configure notification channels for Slack and Microsoft Teams, and create advanced alert conditions with custom templating for production environments.
Step-by-step configuration
Update system packages
Start by updating your package manager to ensure you have the latest security patches.
sudo apt update && sudo apt upgrade -y
Install Grafana if not already present
Install Grafana from the official repository if you don't have it running yet. Current Debian and Ubuntu releases deprecate `apt-key`, so add Grafana's signing key to a dedicated keyring instead:
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana
Configure Grafana for unified alerting
Enable unified alerting in the main configuration file and set retention policies.
[unified_alerting]
enabled = true
# Alert rule evaluation: retry attempts and minimum evaluation interval
max_attempts = 3
min_interval = 10s

# Disable the legacy alerting engine
[alerting]
enabled = false

# Data retention for alert state history
[unified_alerting.state_history]
enabled = true
max_age = 168h

# Screenshot settings for alert notifications
[unified_alerting.screenshots]
capture = true
max_concurrent_screenshots = 5
Start and enable Grafana
Start the Grafana service and enable it to run on boot.
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo systemctl status grafana-server
Configure Prometheus data source
Add a Prometheus data source through the Grafana web interface at http://your-server:3000. Navigate to Configuration > Data Sources > Add data source > Prometheus.
Name: Prometheus
URL: http://localhost:9090
Access: Server (default)
Scrape interval: 15s
Query timeout: 60s
HTTP Method: POST
Click "Save & Test" to verify the connection works.
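The same data source can also be provisioned from a script via Grafana's `POST /api/datasources` endpoint. This sketch only builds the request body and an authorized request object; the server URL and token are placeholders you would replace.

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # adjust to your server
API_TOKEN = "YOUR-API-TOKEN"           # a Grafana service-account token

def prometheus_datasource_payload() -> dict:
    """Body for POST /api/datasources, mirroring the UI fields above."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://localhost:9090",
        "access": "proxy",  # "Server" access in the UI
        "jsonData": {
            "httpMethod": "POST",
            "timeInterval": "15s",  # scrape interval
            "queryTimeout": "60s",
        },
    }

def build_request() -> urllib.request.Request:
    """Authorized request object; pass it to urllib.request.urlopen() to send."""
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/datasources",
        data=json.dumps(prometheus_datasource_payload()).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )

print(build_request().full_url)
```

Run `urllib.request.urlopen(build_request())` against a live Grafana to actually create the data source.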
Create contact points for notifications
Navigate to Alerting > Contact points > Add contact point. Create separate contact points for each notification method.
Slack Contact Point:
Name: slack-alerts
Integration: Slack
Webhook URL: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
Channel: #alerts
Username: Grafana
Title: {{ template "default.title" . }}
Text: {{ template "default.message" . }}
Microsoft Teams Contact Point:
Name: teams-alerts
Integration: Microsoft Teams
Webhook URL: https://your-tenant.webhook.office.com/webhookb2/YOUR-WEBHOOK-URL
Title: {{ template "default.title" . }}
Summary: {{ template "default.message" . }}
Email Contact Point:
Name: email-alerts
Integration: Email
Addresses: ops-team@example.com, alerts@example.com
Subject: [GRAFANA] {{ .GroupLabels.alertname }}
Message: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
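Before wiring the Slack contact point into Grafana, you can verify the webhook itself with a hand-rolled message. A minimal sketch, assuming the placeholder webhook URL from above; only `send` performs a real HTTP call:

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"  # placeholder

def slack_test_payload(alertname: str, summary: str) -> dict:
    """Minimal Slack incoming-webhook payload mirroring the contact point above."""
    return {
        "channel": "#alerts",
        "username": "Grafana",
        "text": f"[TEST] {alertname}: {summary}",
    }

def send(payload: dict) -> int:
    """POST the payload to the Slack webhook and return the HTTP status."""
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

print(json.dumps(slack_test_payload("HighCPU", "CPU above 80%")))
```

Call `send(slack_test_payload("HighCPU", "CPU above 80%"))` with a real webhook URL to post a test message.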
Configure SMTP for email notifications
Update Grafana's SMTP settings to enable email notifications.
[smtp]
enabled = true
host = smtp.gmail.com:587
user = your-email@gmail.com
password = your-app-password
cert_file =
key_file =
skip_verify = false
from_address = grafana@example.com
from_name = Grafana Alerts
ehlo_identity = example.com
startTLS_policy = MandatoryStartTLS
Restart Grafana after updating SMTP settings:
sudo systemctl restart grafana-server
Create custom webhook endpoint
Set up a webhook contact point for external integrations like PagerDuty or custom applications.
Name: custom-webhook
Integration: Webhook
URL: https://api.example.com/v1/alerts
HTTP Method: POST
Authorization Header: Bearer YOUR-API-TOKEN
Content-Type: application/json
Body:
{
"alert_name": "{{ .GroupLabels.alertname }}",
"status": "{{ .Status }}",
"severity": "{{ .GroupLabels.severity }}",
"summary": "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}",
"timestamp": "{{ (index .Alerts 0).StartsAt }}",
"firing_alerts": {{ .Alerts.Firing | len }},
"resolved_alerts": {{ .Alerts.Resolved | len }}
}
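Assuming the template renders as intended, a single firing alert might produce a payload like the one below (values are illustrative, not real Grafana output); parsing it confirms that the template's unquoted counts still yield valid JSON:

```python
import json

# Hypothetical rendered body for one firing alert.
rendered = """
{
  "alert_name": "HighMemoryUsage",
  "status": "firing",
  "severity": "critical",
  "summary": "Memory usage critical on web-1",
  "timestamp": "2024-01-15T02:00:00Z",
  "firing_alerts": 1,
  "resolved_alerts": 0
}
"""

payload = json.loads(rendered)
print(payload["alert_name"], payload["status"], payload["firing_alerts"])
```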
Create notification policies
Navigate to Alerting > Notification policies. Configure routing based on alert labels and severity.
# Root policy (catches all alerts)
Default contact point: email-alerts
Group by: alertname, instance
Group wait: 10s
Group interval: 5m
Repeat interval: 12h
High severity alerts
Matchers:
- severity = critical
Contact point: slack-alerts
Note: each policy routes to exactly one contact point. To notify Slack, Teams, and the webhook together, add all three integrations to a single contact point, or create sibling policies with "Continue matching subsequent sibling nodes" enabled.
Group wait: 0s
Repeat interval: 5m
Override grouping: true
Warning alerts
Matchers:
- severity = warning
Contact point: slack-alerts
Group interval: 10m
Repeat interval: 4h
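Conceptually, notification routing walks the policy tree and delivers to the first matching policy, falling back to the root's contact point. A simplified Python model of that logic (equality matchers only, and hypothetical contact-point names; real Grafana also supports regex matchers and the "continue" flag):

```python
def matches(policy_matchers: dict, alert_labels: dict) -> bool:
    """True if every matcher equals the corresponding alert label."""
    return all(alert_labels.get(k) == v for k, v in policy_matchers.items())

def route(alert_labels: dict, policies: list, default: str) -> str:
    """Return the contact point of the first matching policy, else the default."""
    for policy in policies:
        if matches(policy["matchers"], alert_labels):
            return policy["contact_point"]
    return default

policies = [
    {"matchers": {"severity": "critical"}, "contact_point": "critical-alerts"},
    {"matchers": {"severity": "warning"}, "contact_point": "slack-alerts"},
]

print(route({"severity": "critical", "alertname": "HighCPU"}, policies, "email-alerts"))
print(route({"severity": "info"}, policies, "email-alerts"))
```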
Create advanced alert rules
Navigate to Alerting > Alert rules > New rule. Create comprehensive alert rules with multiple conditions.
CPU Usage Alert Rule:
Rule name: High CPU Usage
Folder: Infrastructure
Group: System Metrics
Query A - Current CPU usage
Query: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Alias: cpu_usage
Query B - CPU usage trend (15min average)
Query: avg_over_time((100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))[15m:])
Alias: cpu_trend
Condition
Expression: $A > 80 && $B > 70 (Grafana math expressions reference queries by RefID and use && / ||, not AND / OR)
Evaluation: Last value, IS ABOVE, 0
Evaluation behavior
Evaluate every: 1m
For: 5m
Labels
severity: warning
team: infrastructure
service: system
Annotations
summary: High CPU usage detected on {{ $labels.instance }}
description: CPU usage is {{ $values.A | humanize }}% (15min avg: {{ $values.B | humanize }}%)
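The arithmetic behind query A is worth sanity-checking: `rate(...{mode="idle"})` yields the fraction of time the CPU was idle, so usage is 100 minus that fraction times 100. A quick model of the rule's condition:

```python
def cpu_usage_percent(idle_fraction: float) -> float:
    """Mirror of 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100."""
    return 100.0 - idle_fraction * 100.0

def should_fire(usage: float, trend: float) -> bool:
    # Fires only when usage is high now AND has been elevated over the
    # 15-minute window, which filters out short spikes.
    return usage > 80 and trend > 70

print(cpu_usage_percent(0.25))  # 25% idle -> 75.0% usage
print(should_fire(85.0, 72.0))  # True: sustained high load
print(should_fire(85.0, 40.0))  # False: just a spike
```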
Create memory usage alert with templating
Create a more complex alert rule for memory usage with custom templating.
Rule name: High Memory Usage
Folder: Infrastructure
Group: System Metrics
Query A - Available memory percentage
Query: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Alias: memory_available
Query B - Memory usage percentage
Query: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
Alias: memory_usage
Condition
Expression: $A < 15
Evaluation: Last value, IS BELOW, 15
Evaluation behavior
Evaluate every: 30s
For: 2m
Labels
severity: critical
team: infrastructure
service: system
runbook_url: https://wiki.example.com/runbooks/memory-alerts
Annotations with advanced templating
summary: Memory usage critical on {{ $labels.instance }}
description: |
  Memory usage: {{ $values.B | printf "%.1f" }}%
  Available memory: {{ $values.A | printf "%.1f" }}%
  Instance: {{ $labels.instance }}
  Job: {{ $labels.job }}
  Runbook: {{ $labels.runbook_url }}
Create application-specific alert rules
Set up alerts for application metrics like HTTP response times and error rates.
Rule name: High HTTP Error Rate
Folder: Applications
Group: Web Services
Query A - Error rate calculation
Query: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
Alias: error_rate
Query B - Request volume
Query: rate(http_requests_total[5m])
Alias: request_rate
Condition
Expression: $A > 5 && $B > 1
Evaluation: Last value, IS ABOVE, 0
Evaluation behavior
Evaluate every: 30s
For: 3m
Labels
severity: critical
team: backend
service: {{ $labels.service }}
environment: {{ $labels.environment }}
Annotations
summary: High error rate for {{ $labels.service }}
description: |
  Error rate: {{ $values.A | printf "%.2f" }}%
  Request rate: {{ $values.B | printf "%.1f" }} req/s
  Service: {{ $labels.service }}
  Environment: {{ $labels.environment }}
  Instance: {{ $labels.instance }}
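The two-query condition embodies a useful guard: an error percentage alone is misleading on a near-idle service. A small model of the calculation and the firing logic:

```python
def http_error_rate(error_rps: float, total_rps: float) -> float:
    """Mirror of rate(5xx)[5m] / rate(total)[5m] * 100."""
    if total_rps == 0:
        return 0.0
    return error_rps / total_rps * 100.0

def should_fire(error_rate: float, request_rate: float) -> bool:
    # The volume guard (request rate > 1 req/s) stops a single failed
    # request on a near-idle service from paging anyone.
    return error_rate > 5 and request_rate > 1

print(http_error_rate(2.0, 16.0))  # 12.5
print(should_fire(12.5, 16.0))     # True
print(should_fire(12.5, 0.5))      # False: too little traffic to matter
```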
Configure alert rule groups and folders
Organize alerts into logical groups for better management. Navigate to Alerting > Alert rules and create folders.
Folders:
├── Infrastructure
│ ├── System Metrics (CPU, Memory, Disk)
│ ├── Network (Connectivity, Bandwidth)
│ └── Database (MySQL, PostgreSQL, Redis)
├── Applications
│ ├── Web Services (HTTP errors, latency)
│ ├── Background Jobs (Queue length, failures)
│ └── API Endpoints (Rate limits, timeouts)
└── Business Metrics
├── User Activity (Logins, signups)
├── Revenue (Transactions, conversions)
└── Performance (Page load, API response)
Create silences and maintenance windows
Set up alert silences for planned maintenance. Navigate to Alerting > Silences > New silence.
Matchers:
- service = "web-frontend"
- environment = "production"
Start: 2024-01-15 02:00 UTC
End: 2024-01-15 04:00 UTC
Timezone: UTC
Created by: ops-team
Comment: Scheduled maintenance - database migration
Or create regex-based silences
Matchers:
- alertname =~ "High.*Usage"
- instance =~ "web-[0-9]+\.example\.com"
Duration: 2h
Comment: Load testing in progress
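Regex matchers follow Alertmanager-style semantics: `=~` must match the whole label value. This sketch models how the silence above would be evaluated against an alert's labels, so you can check your patterns before creating the silence:

```python
import re

def silenced(alert_labels: dict, matchers: list) -> bool:
    """Check an alert against silence matchers.

    Each matcher is (label, op, value), where op is "=" for equality or
    "=~" for a regex that must match the entire label value.
    """
    for label, op, value in matchers:
        actual = alert_labels.get(label, "")
        if op == "=" and actual != value:
            return False
        if op == "=~" and not re.fullmatch(value, actual):
            return False
    return True

maintenance = [
    ("alertname", "=~", "High.*Usage"),
    ("instance", "=~", r"web-[0-9]+\.example\.com"),
]

print(silenced({"alertname": "High CPU Usage", "instance": "web-3.example.com"}, maintenance))
```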
Configure alert templates
Create custom notification templates for better alert messages. Navigate to Alerting > Contact points > Message templates.
Name: detailed-alert-template
Content:
{{ define "alert.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }} x{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{ end }}
{{ define "alert.summary" }}
{{ if gt (len .Alerts.Firing) 0 }}
Firing Alerts:
{{ range .Alerts.Firing }}
• {{ .Annotations.summary }}
Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
Resolved Alerts:
{{ range .Alerts.Resolved }}
• {{ .Annotations.summary }}
Resolved: {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ end }}
{{ end }}
Dashboard: {{ .ExternalURL }}
{{ end }}
Configure webhook security
Secure webhook endpoints
Add authentication and validation to your webhook endpoints to prevent unauthorized access.
# In webhook contact point configuration
HTTP Headers:
X-Grafana-Source: alerting
User-Agent: Grafana/10.2.3
Authorization: Bearer YOUR-SECRET-TOKEN
Note that Grafana sends these header values as static strings; the webhook contact point does not render templates (or hashing functions) inside header fields, so compute any signatures or timestamps on the receiver side instead.
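On the receiver side, compare the presented token with `hmac.compare_digest` rather than `==`, which keeps the comparison constant-time. A sketch of the check, using the placeholder token from this tutorial:

```python
import hmac

SECRET_TOKEN = "your-secret-token"  # shared out-of-band with Grafana

def token_valid(auth_header: str) -> bool:
    """Constant-time check of a 'Bearer <token>' Authorization header."""
    if not auth_header.startswith("Bearer "):
        return False
    presented = auth_header[len("Bearer "):]
    # hmac.compare_digest avoids the timing side channel that a plain ==
    # comparison of secrets can leak.
    return hmac.compare_digest(presented, SECRET_TOKEN)

print(token_valid("Bearer your-secret-token"))  # True
```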
Validate webhook payload
Create a simple webhook receiver to test your configuration.
#!/usr/bin/env python3
import json
from http.server import HTTPServer, BaseHTTPRequestHandler

SECRET_TOKEN = "your-secret-token"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)

        # Verify authorization
        auth_header = self.headers.get('Authorization', '')
        if not auth_header.startswith('Bearer '):
            self.send_response(401)
            self.end_headers()
            return

        token = auth_header[7:]  # Remove 'Bearer '
        if token != SECRET_TOKEN:
            self.send_response(401)
            self.end_headers()
            return

        try:
            alert_data = json.loads(post_data.decode('utf-8'))
            print(f"Received alert: {alert_data.get('alert_name', 'unknown')}")
            print(f"Status: {alert_data.get('status', 'unknown')}")
            print(f"Summary: {alert_data.get('summary', '')}")
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps({"status": "received"}).encode())
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()

if __name__ == '__main__':
    server = HTTPServer(('localhost', 8080), WebhookHandler)
    print("Webhook test server running on http://localhost:8080")
    server.serve_forever()
Save the script as /tmp/webhook-test.py, then run it:
python3 /tmp/webhook-test.py
Advanced alerting features
Create multi-condition alerts
Build complex alerts that require multiple conditions to be met simultaneously.
Rule name: Service Degradation
Folder: Applications
Group: Service Health
Query A - High response time
Query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) * 1000
Alias: response_time_p95
Query B - Error rate
Query: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
Alias: error_rate
Query C - Request volume
Query: rate(http_requests_total[5m])
Alias: request_volume
Complex condition with AND/OR logic (a math expression over the query RefIDs)
Expression: ($A > 500 && $C > 10) || ($B > 2 && $C > 5)
Evaluation: Last value, IS ABOVE, 0
Evaluate every: 30s
For: 5m
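The compound condition reads as "slow under real load, or erroring with meaningful traffic". A small Python model makes the boolean structure easy to verify against sample values:

```python
def service_degraded(p95_ms: float, error_rate_pct: float, rps: float) -> bool:
    """Mirror of the alert condition: slow under load, OR erroring with traffic."""
    slow_under_load = p95_ms > 500 and rps > 10
    erroring_with_traffic = error_rate_pct > 2 and rps > 5
    return slow_under_load or erroring_with_traffic

print(service_degraded(650.0, 0.5, 20.0))  # True: slow under real load
print(service_degraded(120.0, 3.0, 6.0))   # True: erroring with traffic
print(service_degraded(650.0, 0.5, 3.0))   # False: barely any traffic
```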
Configure alert dependencies
Set up alert routing based on dependencies to reduce noise during outages.
# Parent policy - Database connectivity
Matchers:
- alertname = "Database Connection Failed"
- service = "postgresql"
Contact point: critical-alerts
Group wait: 0s
Repeat interval: 5m
Child policy - Application errors (only if DB is healthy)
Matchers:
- alertname = "High Application Error Rate"
- database_status != "down"
Contact point: app-team-alerts
Group interval: 10m
Repeat interval: 30m
Suppress application alerts while the database is down
Matchers:
- alertname = "High Application Error Rate"
- database_status = "down"
Grafana has no built-in null route, so point this policy at a low-noise contact point, or create a silence with these matchers for the duration of the outage.
Create alert annotations with links
Add useful links and context to alert notifications for faster troubleshooting.
# In alert rule annotations
summary: High CPU usage on {{ $labels.instance }}
description: |
CPU usage: {{ $values.cpu_usage | printf "%.1f" }}%
Instance: {{ $labels.instance }}
Environment: {{ $labels.environment }}
🔍 Troubleshooting Links:
• System Dashboard: http://grafana.example.com/d/system/system-overview?var-instance={{ $labels.instance }}
• CPU Details
• Server Logs
• Runbook: https://wiki.example.com/runbooks/high-cpu-usage
runbook_url: https://wiki.example.com/runbooks/high-cpu-usage
dashboard_url: http://grafana.example.com/d/system/system-overview?var-instance={{ $labels.instance }}
Verify your setup
Test your alerting configuration to ensure everything works correctly.
# Check Grafana status
sudo systemctl status grafana-server
Test contact points from Grafana UI
Navigate to Alerting > Contact points > Test
View alert rule state (Prometheus-compatible rules endpoint)
curl -H "Authorization: Bearer YOUR-API-KEY" \
  "http://localhost:3000/api/prometheus/grafana/api/v1/rules"
Check notification policies
curl -H "Authorization: Bearer YOUR-API-KEY" \
"http://localhost:3000/api/v1/provisioning/policies"
View active alerts (Alertmanager-compatible endpoint)
curl -H "Authorization: Bearer YOUR-API-KEY" \
  "http://localhost:3000/api/alertmanager/grafana/api/v2/alerts"
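These curl checks are easy to script for recurring verification. The sketch below only builds the authorized request (no network call at import time); run `fetch` against a live Grafana with a real token:

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "YOUR-API-KEY"  # a Grafana service-account token

def api_request(path: str) -> urllib.request.Request:
    """Build an authorized GET request against the Grafana HTTP API."""
    return urllib.request.Request(
        f"{GRAFANA_URL}{path}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )

def fetch(path: str):
    """Perform the request and decode the JSON response (needs a running Grafana)."""
    with urllib.request.urlopen(api_request(path)) as resp:
        return json.loads(resp.read())

# Endpoints used above, e.g.:
#   fetch("/api/v1/provisioning/policies")
#   fetch("/api/alertmanager/grafana/api/v2/alerts")
req = api_request("/api/v1/provisioning/policies")
print(req.full_url)
```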
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Alerts not firing | Incorrect query or evaluation settings | Check query syntax in Explore tab, verify evaluation interval |
| Notifications not sent | Contact point configuration error | Test contact point, check SMTP/webhook settings |
| Too many alert notifications | Incorrect grouping or repeat interval | Adjust notification policy grouping and intervals |
| Webhook returns 401/403 | Authentication headers missing or incorrect | Verify Authorization header and webhook endpoint security |
| Email notifications fail | SMTP configuration incorrect | Test SMTP settings, check firewall rules for port 587 |
| Alert templates not rendering | Template syntax errors or missing variables | Validate template syntax, test with sample alert data |
| Silences not working | Label matchers don't match alert labels | Check exact label names and values in alert rules |
Next steps
- Configure Grafana dashboards for TimescaleDB analytics to visualize time-series data alongside your alerts
- Set up Prometheus and Grafana monitoring stack for a complete observability solution
- Configure Zabbix custom alerting with webhooks for comparison with other monitoring tools
- Implement Grafana SLA reporting with Prometheus for business-level monitoring
- Configure Grafana alert escalation with PagerDuty for enterprise incident management