Set up comprehensive Grafana alerting with webhook endpoints, Slack and Teams notifications, and advanced alert conditions. Configure data sources, create alert rules, and implement custom notification channels for production monitoring.
Prerequisites
- Grafana 9.0 or higher
- Prometheus data source
- SMTP server access for email notifications
- Webhook endpoints for external integrations
- Administrative access to configure notification channels
What this solves
Grafana's unified alerting system lets you create sophisticated alert rules that can notify multiple channels when your infrastructure needs attention. This tutorial shows you how to set up webhook endpoints for external integrations, configure notification channels for Slack and Microsoft Teams, and create advanced alert conditions with custom templating for production environments.
Step-by-step configuration
Update system packages
Start by updating your package manager to ensure you have the latest security patches.
sudo apt update && sudo apt upgrade -y
Install Grafana if not already present
Install Grafana from the official repository if you don't have it running yet. Current Debian and Ubuntu releases deprecate `apt-key`, so add Grafana's signing key to a dedicated keyring instead:
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana
Configure Grafana for unified alerting
Enable unified alerting in the main configuration file and set retention policies.
[unified_alerting]
enabled = true
# Alert rule evaluation: retry attempts and minimum evaluation interval
max_attempts = 3
min_interval = 10s

# Disable the legacy alerting engine
[alerting]
enabled = false

# Data retention for alert state history
[unified_alerting.state_history]
enabled = true
max_age = 168h

# Screenshot settings for alert notifications
[unified_alerting.screenshots]
capture = true
max_concurrent_screenshots = 5
Start and enable Grafana
Start the Grafana service and enable it to run on boot.
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo systemctl status grafana-server
Configure Prometheus data source
Add a Prometheus data source through the Grafana web interface at http://your-server:3000. Navigate to Configuration > Data Sources > Add data source > Prometheus.
Name: Prometheus
URL: http://localhost:9090
Access: Server (default)
Scrape interval: 15s
Query timeout: 60s
HTTP Method: POST
Click "Save & Test" to verify the connection works.
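The same data source can also be provisioned from a script via Grafana's `POST /api/datasources` endpoint. This sketch only builds the request body and an authorized request object; the server URL and token are placeholders you would replace.

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"  # adjust to your server
API_TOKEN = "YOUR-API-TOKEN"           # a Grafana service-account token

def prometheus_datasource_payload() -> dict:
    """Body for POST /api/datasources, mirroring the UI fields above."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://localhost:9090",
        "access": "proxy",  # "Server" access in the UI
        "jsonData": {
            "httpMethod": "POST",
            "timeInterval": "15s",  # scrape interval
            "queryTimeout": "60s",
        },
    }

def build_request() -> urllib.request.Request:
    """Authorized request object; pass it to urllib.request.urlopen() to send."""
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/datasources",
        data=json.dumps(prometheus_datasource_payload()).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )

print(build_request().full_url)
```

Run `urllib.request.urlopen(build_request())` against a live Grafana to actually create the data source.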
Create contact points for notifications
Navigate to Alerting > Contact points > Add contact point. Create separate contact points for each notification method.
Slack Contact Point:
Name: slack-alerts
Integration: Slack
Webhook URL: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
Channel: #alerts
Username: Grafana
Title: {{ template "default.title" . }}
Text: {{ template "default.message" . }}
Microsoft Teams Contact Point:
Name: teams-alerts
Integration: Microsoft Teams
Webhook URL: https://your-tenant.webhook.office.com/webhookb2/YOUR-WEBHOOK-URL
Title: {{ template "default.title" . }}
Summary: {{ template "default.message" . }}
Email Contact Point:
Name: email-alerts
Integration: Email
Addresses: ops-team@example.com, alerts@example.com
Subject: [GRAFANA] {{ .GroupLabels.alertname }}
Message: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
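Before wiring the Slack contact point into Grafana, you can verify the webhook itself with a hand-rolled message. A minimal sketch, assuming the placeholder webhook URL from above; only `send` performs a real HTTP call:

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"  # placeholder

def slack_test_payload(alertname: str, summary: str) -> dict:
    """Minimal Slack incoming-webhook payload mirroring the contact point above."""
    return {
        "channel": "#alerts",
        "username": "Grafana",
        "text": f"[TEST] {alertname}: {summary}",
    }

def send(payload: dict) -> int:
    """POST the payload to the Slack webhook and return the HTTP status."""
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

print(json.dumps(slack_test_payload("HighCPU", "CPU above 80%")))
```

Call `send(slack_test_payload("HighCPU", "CPU above 80%"))` with a real webhook URL to post a test message.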
Configure SMTP for email notifications
Update Grafana's SMTP settings to enable email notifications.
[smtp]
enabled = true
host = smtp.gmail.com:587
user = your-email@gmail.com
password = your-app-password
cert_file =
key_file =
skip_verify = false
from_address = grafana@example.com
from_name = Grafana Alerts
ehlo_identity = example.com
startTLS_policy = MandatoryStartTLS
Restart Grafana after updating SMTP settings:
sudo systemctl restart grafana-server
Create custom webhook endpoint
Set up a webhook contact point for external integrations like PagerDuty or custom applications.
Name: custom-webhook
Integration: Webhook
URL: https://api.example.com/v1/alerts
HTTP Method: POST
Authorization Header: Bearer YOUR-API-TOKEN
Content-Type: application/json
Body:
{
"alert_name": "{{ .GroupLabels.alertname }}",
"status": "{{ .Status }}",
"severity": "{{ .GroupLabels.severity }}",
"summary": "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}",
"timestamp": "{{ (index .Alerts 0).StartsAt }}",
"firing_alerts": {{ .Alerts.Firing | len }},
"resolved_alerts": {{ .Alerts.Resolved | len }}
}
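Assuming the template renders as intended, a single firing alert might produce a payload like the one below (values are illustrative, not real Grafana output); parsing it confirms that the template's unquoted counts still yield valid JSON:

```python
import json

# Hypothetical rendered body for one firing alert.
rendered = """
{
  "alert_name": "HighMemoryUsage",
  "status": "firing",
  "severity": "critical",
  "summary": "Memory usage critical on web-1",
  "timestamp": "2024-01-15T02:00:00Z",
  "firing_alerts": 1,
  "resolved_alerts": 0
}
"""

payload = json.loads(rendered)
print(payload["alert_name"], payload["status"], payload["firing_alerts"])
```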
Create notification policies
Navigate to Alerting > Notification policies. Configure routing based on alert labels and severity.
# Root policy (catches all alerts)
Default contact point: email-alerts
Group by: alertname, instance
Group wait: 10s
Group interval: 5m
Repeat interval: 12h
High severity alerts
Matchers:
- severity = critical
Contact point: slack-alerts
Note: each policy routes to exactly one contact point. To notify Slack, Teams, and the webhook together, add all three integrations to a single contact point, or create sibling policies with "Continue matching subsequent sibling nodes" enabled.
Group wait: 0s
Repeat interval: 5m
Override grouping: true
Warning alerts
Matchers:
- severity = warning
Contact point: slack-alerts
Group interval: 10m
Repeat interval: 4h
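Conceptually, notification routing walks the policy tree and delivers to the first matching policy, falling back to the root's contact point. A simplified Python model of that logic (equality matchers only, and hypothetical contact-point names; real Grafana also supports regex matchers and the "continue" flag):

```python
def matches(policy_matchers: dict, alert_labels: dict) -> bool:
    """True if every matcher equals the corresponding alert label."""
    return all(alert_labels.get(k) == v for k, v in policy_matchers.items())

def route(alert_labels: dict, policies: list, default: str) -> str:
    """Return the contact point of the first matching policy, else the default."""
    for policy in policies:
        if matches(policy["matchers"], alert_labels):
            return policy["contact_point"]
    return default

policies = [
    {"matchers": {"severity": "critical"}, "contact_point": "critical-alerts"},
    {"matchers": {"severity": "warning"}, "contact_point": "slack-alerts"},
]

print(route({"severity": "critical", "alertname": "HighCPU"}, policies, "email-alerts"))
print(route({"severity": "info"}, policies, "email-alerts"))
```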
Create advanced alert rules
Navigate to Alerting > Alert rules > New rule. Create comprehensive alert rules with multiple conditions.
CPU Usage Alert Rule:
Rule name: High CPU Usage
Folder: Infrastructure
Group: System Metrics
Query A - Current CPU usage
Query: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Alias: cpu_usage
Query B - CPU usage trend (15min average)
Query: avg_over_time((100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))[15m:])
Alias: cpu_trend
Condition
Expression: $A > 80 && $B > 70 (Grafana math expressions reference queries by RefID and use && / ||, not AND / OR)
Evaluation: Last value, IS ABOVE, 0
Evaluation behavior
Evaluate every: 1m
For: 5m
Labels
severity: warning
team: infrastructure
service: system
Annotations
summary: High CPU usage detected on {{ $labels.instance }}
description: CPU usage is {{ $values.A | humanize }}% (15min avg: {{ $values.B | humanize }}%)
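The arithmetic behind query A is worth sanity-checking: `rate(...{mode="idle"})` yields the fraction of time the CPU was idle, so usage is 100 minus that fraction times 100. A quick model of the rule's condition:

```python
def cpu_usage_percent(idle_fraction: float) -> float:
    """Mirror of 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100."""
    return 100.0 - idle_fraction * 100.0

def should_fire(usage: float, trend: float) -> bool:
    # Fires only when usage is high now AND has been elevated over the
    # 15-minute window, which filters out short spikes.
    return usage > 80 and trend > 70

print(cpu_usage_percent(0.25))  # 25% idle -> 75.0% usage
print(should_fire(85.0, 72.0))  # True: sustained high load
print(should_fire(85.0, 40.0))  # False: just a spike
```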
Create memory usage alert with templating
Create a more complex alert rule for memory usage with custom templating.
Rule name: High Memory Usage
Folder: Infrastructure
Group: System Metrics
Query A - Available memory percentage
Query: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Alias: memory_available
Query B - Memory usage percentage
Query: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
Alias: memory_usage
Condition
Expression: $A < 15
Evaluation: Last value, IS BELOW, 15
Evaluation behavior
Evaluate every: 30s
For: 2m
Labels
severity: critical
team: infrastructure
service: system
runbook_url: https://wiki.example.com/runbooks/memory-alerts
Annotations with advanced templating
summary: Memory usage critical on {{ $labels.instance }}
description: |
  Memory usage: {{ $values.B | printf "%.1f" }}%
  Available memory: {{ $values.A | printf "%.1f" }}%
  Instance: {{ $labels.instance }}
  Job: {{ $labels.job }}
  Runbook: {{ $labels.runbook_url }}
Create application-specific alert rules
Set up alerts for application metrics like HTTP response times and error rates.
Rule name: High HTTP Error Rate
Folder: Applications
Group: Web Services
Query A - Error rate calculation
Query: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
Alias: error_rate
Query B - Request volume
Query: rate(http_requests_total[5m])
Alias: request_rate
Condition
Expression: $A > 5 && $B > 1
Evaluation: Last value, IS ABOVE, 0
Evaluation behavior
Evaluate every: 30s
For: 3m
Labels
severity: critical
team: backend
service: {{ $labels.service }}
environment: {{ $labels.environment }}
Annotations
summary: High error rate for {{ $labels.service }}
description: |
  Error rate: {{ $values.A | printf "%.2f" }}%
  Request rate: {{ $values.B | printf "%.1f" }} req/s
  Service: {{ $labels.service }}
  Environment: {{ $labels.environment }}
  Instance: {{ $labels.instance }}
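The two-query condition embodies a useful guard: an error percentage alone is misleading on a near-idle service. A small model of the calculation and the firing logic:

```python
def http_error_rate(error_rps: float, total_rps: float) -> float:
    """Mirror of rate(5xx)[5m] / rate(total)[5m] * 100."""
    if total_rps == 0:
        return 0.0
    return error_rps / total_rps * 100.0

def should_fire(error_rate: float, request_rate: float) -> bool:
    # The volume guard (request rate > 1 req/s) stops a single failed
    # request on a near-idle service from paging anyone.
    return error_rate > 5 and request_rate > 1

print(http_error_rate(2.0, 16.0))  # 12.5
print(should_fire(12.5, 16.0))     # True
print(should_fire(12.5, 0.5))      # False: too little traffic to matter
```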
Configure alert rule groups and folders
Organize alerts into logical groups for better management. Navigate to Alerting > Alert rules and create folders.
Folders:
├── Infrastructure
│ ├── System Metrics (CPU, Memory, Disk)
│ ├── Network (Connectivity, Bandwidth)
│ └── Database (MySQL, PostgreSQL, Redis)
├── Applications
│ ├── Web Services (HTTP errors, latency)
│ ├── Background Jobs (Queue length, failures)
│ └── API Endpoints (Rate limits, timeouts)
└── Business Metrics
├── User Activity (Logins, signups)
├── Revenue (Transactions, conversions)
└── Performance (Page load, API response)
Create silences and maintenance windows
Set up alert silences for planned maintenance. Navigate to Alerting > Silences > New silence.
Matchers:
- service = "web-frontend"
- environment = "production"
Start: 2024-01-15 02:00 UTC
End: 2024-01-15 04:00 UTC
Timezone: UTC
Created by: ops-team
Comment: Scheduled maintenance - database migration
Or create regex-based silences
Matchers:
- alertname =~ "High.*Usage"
- instance =~ "web-[0-9]+\.example\.com"
Duration: 2h
Comment: Load testing in progress
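Regex matchers follow Alertmanager-style semantics: `=~` must match the whole label value. This sketch models how the silence above would be evaluated against an alert's labels, so you can check your patterns before creating the silence:

```python
import re

def silenced(alert_labels: dict, matchers: list) -> bool:
    """Check an alert against silence matchers.

    Each matcher is (label, op, value), where op is "=" for equality or
    "=~" for a regex that must match the entire label value.
    """
    for label, op, value in matchers:
        actual = alert_labels.get(label, "")
        if op == "=" and actual != value:
            return False
        if op == "=~" and not re.fullmatch(value, actual):
            return False
    return True

maintenance = [
    ("alertname", "=~", "High.*Usage"),
    ("instance", "=~", r"web-[0-9]+\.example\.com"),
]

print(silenced({"alertname": "High CPU Usage", "instance": "web-3.example.com"}, maintenance))
```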
Configure alert templates
Create custom notification templates for better alert messages. Navigate to Alerting > Contact points > Message templates.
Name: detailed-alert-template
Content:
{{ define "alert.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }} x{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
{{ end }}
{{ define "alert.summary" }}
{{ if gt (len .Alerts.Firing) 0 }}
Firing Alerts:
{{ range .Alerts.Firing }}
• {{ .Annotations.summary }}
Labels: {{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }} {{ end }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
Resolved Alerts:
{{ range .Alerts.Resolved }}
• {{ .Annotations.summary }}
Resolved: {{ .EndsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ end }}
{{ end }}
Dashboard: {{ .ExternalURL }}
{{ end }}
Configure webhook security
Secure webhook endpoints
Add authentication and validation to your webhook endpoints to prevent unauthorized access.
# In webhook contact point configuration
HTTP Headers:
X-Grafana-Source: alerting
User-Agent: Grafana/10.2.3
Authorization: Bearer YOUR-SECRET-TOKEN
Note that Grafana sends these header values as static strings; the webhook contact point does not render templates (or hashing functions) inside header fields, so compute any signatures or timestamps on the receiver side instead.
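On the receiver side, compare the presented token with `hmac.compare_digest` rather than `==`, which keeps the comparison constant-time. A sketch of the check, using the placeholder token from this tutorial:

```python
import hmac

SECRET_TOKEN = "your-secret-token"  # shared out-of-band with Grafana

def token_valid(auth_header: str) -> bool:
    """Constant-time check of a 'Bearer <token>' Authorization header."""
    if not auth_header.startswith("Bearer "):
        return False
    presented = auth_header[len("Bearer "):]
    # hmac.compare_digest avoids the timing side channel that a plain ==
    # comparison of secrets can leak.
    return hmac.compare_digest(presented, SECRET_TOKEN)

print(token_valid("Bearer your-secret-token"))  # True
```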
Validate webhook payload
Create a simple webhook receiver to test your configuration.
#!/usr/bin/env python3
import json
from http.server import HTTPServer, BaseHTTPRequestHandler

SECRET_TOKEN = "your-secret-token"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)

        # Verify authorization
        auth_header = self.headers.get('Authorization', '')
        if not auth_header.startswith('Bearer '):
            self.send_response(401)
            self.end_headers()
            return

        token = auth_header[7:]  # Remove 'Bearer '
        if token != SECRET_TOKEN:
            self.send_response(401)
            self.end_headers()
            return

        try:
            alert_data = json.loads(post_data.decode('utf-8'))
            print(f"Received alert: {alert_data.get('alert_name', 'unknown')}")
            print(f"Status: {alert_data.get('status', 'unknown')}")
            print(f"Summary: {alert_data.get('summary', '')}")
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps({"status": "received"}).encode())
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()

if __name__ == '__main__':
    server = HTTPServer(('localhost', 8080), WebhookHandler)
    print("Webhook test server running on http://localhost:8080")
    server.serve_forever()
Save the script as /tmp/webhook-test.py, then run it:
python3 /tmp/webhook-test.py
Advanced alerting features
Create multi-condition alerts
Build complex alerts that require multiple conditions to be met simultaneously.
Rule name: Service Degradation
Folder: Applications
Group: Service Health
Query A - High response time
Query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) * 1000
Alias: response_time_p95
Query B - Error rate
Query: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
Alias: error_rate
Query C - Request volume
Query: rate(http_requests_total[5m])
Alias: request_volume
Complex condition with AND/OR logic (a math expression over the query RefIDs)
Expression: ($A > 500 && $C > 10) || ($B > 2 && $C > 5)
Evaluation: Last value, IS ABOVE, 0
Evaluate every: 30s
For: 5m
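The compound condition reads as "slow under real load, or erroring with meaningful traffic". A small Python model makes the boolean structure easy to verify against sample values:

```python
def service_degraded(p95_ms: float, error_rate_pct: float, rps: float) -> bool:
    """Mirror of the alert condition: slow under load, OR erroring with traffic."""
    slow_under_load = p95_ms > 500 and rps > 10
    erroring_with_traffic = error_rate_pct > 2 and rps > 5
    return slow_under_load or erroring_with_traffic

print(service_degraded(650.0, 0.5, 20.0))  # True: slow under real load
print(service_degraded(120.0, 3.0, 6.0))   # True: erroring with traffic
print(service_degraded(650.0, 0.5, 3.0))   # False: barely any traffic
```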
Configure alert dependencies
Set up alert routing based on dependencies to reduce noise during outages.
# Parent policy - Database connectivity
Matchers:
- alertname = "Database Connection Failed"
- service = "postgresql"
Contact point: critical-alerts
Group wait: 0s
Repeat interval: 5m
Child policy - Application errors (only if DB is healthy)
Matchers:
- alertname = "High Application Error Rate"
- database_status != "down"
Contact point: app-team-alerts
Group interval: 10m
Repeat interval: 30m
Suppress application alerts while the database is down
Matchers:
- alertname = "High Application Error Rate"
- database_status = "down"
Grafana has no built-in null route, so point this policy at a low-noise contact point, or create a silence with these matchers for the duration of the outage.
Create alert annotations with links
Add useful links and context to alert notifications for faster troubleshooting.
# In alert rule annotations
summary: High CPU usage on {{ $labels.instance }}
description: |
CPU usage: {{ $values.cpu_usage | printf "%.1f" }}%
Instance: {{ $labels.instance }}
Environment: {{ $labels.environment }}
🔍 Troubleshooting Links:
• System Dashboard: http://grafana.example.com/d/system/system-overview?var-instance={{ $labels.instance }}
• CPU Details
• Server Logs
• Runbook: https://wiki.example.com/runbooks/high-cpu-usage
runbook_url: https://wiki.example.com/runbooks/high-cpu-usage
dashboard_url: http://grafana.example.com/d/system/system-overview?var-instance={{ $labels.instance }}
Verify your setup
Test your alerting configuration to ensure everything works correctly.
# Check Grafana status
sudo systemctl status grafana-server
Test contact points from Grafana UI
Navigate to Alerting > Contact points > Test
View alert rule state (Prometheus-compatible rules endpoint)
curl -H "Authorization: Bearer YOUR-API-KEY" \
  "http://localhost:3000/api/prometheus/grafana/api/v1/rules"
Check notification policies
curl -H "Authorization: Bearer YOUR-API-KEY" \
"http://localhost:3000/api/v1/provisioning/policies"
View active alerts (Alertmanager-compatible endpoint)
curl -H "Authorization: Bearer YOUR-API-KEY" \
  "http://localhost:3000/api/alertmanager/grafana/api/v2/alerts"
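These curl checks are easy to script for recurring verification. The sketch below only builds the authorized request (no network call at import time); run `fetch` against a live Grafana with a real token:

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "YOUR-API-KEY"  # a Grafana service-account token

def api_request(path: str) -> urllib.request.Request:
    """Build an authorized GET request against the Grafana HTTP API."""
    return urllib.request.Request(
        f"{GRAFANA_URL}{path}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )

def fetch(path: str):
    """Perform the request and decode the JSON response (needs a running Grafana)."""
    with urllib.request.urlopen(api_request(path)) as resp:
        return json.loads(resp.read())

# Endpoints used above, e.g.:
#   fetch("/api/v1/provisioning/policies")
#   fetch("/api/alertmanager/grafana/api/v2/alerts")
req = api_request("/api/v1/provisioning/policies")
print(req.full_url)
```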
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Alerts not firing | Incorrect query or evaluation settings | Check query syntax in Explore tab, verify evaluation interval |
| Notifications not sent | Contact point configuration error | Test contact point, check SMTP/webhook settings |
| Too many alert notifications | Incorrect grouping or repeat interval | Adjust notification policy grouping and intervals |
| Webhook returns 401/403 | Authentication headers missing or incorrect | Verify Authorization header and webhook endpoint security |
| Email notifications fail | SMTP configuration incorrect | Test SMTP settings, check firewall rules for port 587 |
| Alert templates not rendering | Template syntax errors or missing variables | Validate template syntax, test with sample alert data |
| Silences not working | Label matchers don't match alert labels | Check exact label names and values in alert rules |
Next steps
- Configure Grafana dashboards for TimescaleDB analytics to visualize time-series data alongside your alerts
- Set up Prometheus and Grafana monitoring stack for a complete observability solution
- Configure Zabbix custom alerting with webhooks for comparison with other monitoring tools
- Implement Grafana SLA reporting with Prometheus for business-level monitoring
- Configure Grafana alert escalation with PagerDuty for enterprise incident management