Set up comprehensive Linux system monitoring with Prometheus, Node Exporter, and Alertmanager to track CPU, memory, and disk usage with automated alerts and response scripts for proactive system management.
Prerequisites
- Root or sudo access
- At least 2GB RAM
- Basic understanding of systemd services
- Email server access for notifications (optional)
What this solves
System resource monitoring helps prevent downtime by alerting you when resource usage crosses a critical threshold. This tutorial sets up Prometheus with Node Exporter for metrics collection, alert rules for CPU, memory, root filesystem usage, system load, and instance availability, Alertmanager for notifications, and custom scripts for automated responses such as service restarts or disk cleanup.
Step-by-step installation
Update system packages
Start by updating your package manager to ensure you get the latest versions of all dependencies.
sudo apt update && sudo apt upgrade -y
Create system users for monitoring services
Create dedicated users for Prometheus, Node Exporter, and Alertmanager to run services securely without root privileges.
sudo useradd --no-create-home --shell /bin/false prometheus
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo useradd --no-create-home --shell /bin/false alertmanager
Create directory structure
Set up the directory structure for configuration files, data storage, and binaries with correct permissions.
sudo mkdir -p /etc/prometheus /var/lib/prometheus /opt/prometheus
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo mkdir -p /opt/node_exporter
sudo mkdir -p /opt/monitoring-scripts
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
Download and install Prometheus
Download the Prometheus release used in this tutorial (v2.48.0; check the project's releases page for newer versions) and copy the binaries to /usr/local/bin.
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
sudo cp prometheus-2.48.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.48.0.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo chmod 755 /usr/local/bin/prometheus /usr/local/bin/promtool
Download and install Node Exporter
Install Node Exporter to collect system metrics like CPU, memory, disk, and network statistics.
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
sudo chmod 755 /usr/local/bin/node_exporter
Download and install Alertmanager
Install Alertmanager to handle alerts from Prometheus and send notifications via email, Slack, or other channels.
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
sudo chmod 755 /usr/local/bin/alertmanager /usr/local/bin/amtool
Configure Prometheus
Create the main Prometheus configuration file at /etc/prometheus/prometheus.yml with scrape targets for Prometheus itself and Node Exporter, plus a reference to the alerting rules file.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 5s
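The configuration sets a global scrape interval and then overrides it for one job. As a minimal sketch of that precedence, the Python below represents the same settings as a plain dict (an assumption made to avoid a YAML dependency) and resolves the interval each job ends up with. The `effective_intervals` helper is hypothetical and not part of Prometheus.

```python
"""Show which scrape interval applies to each job (illustrative only)."""

# The settings from prometheus.yml, written as a plain dict.
CONFIG = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {"job_name": "prometheus"},
        {"job_name": "node_exporter", "scrape_interval": "5s"},
    ],
}


def effective_intervals(config):
    """A job-level scrape_interval overrides the global default."""
    default = config["global"]["scrape_interval"]
    return {
        job["job_name"]: job.get("scrape_interval", default)
        for job in config["scrape_configs"]
    }


for job, interval in effective_intervals(CONFIG).items():
    print(f"{job}: scraped every {interval}")
```

Scraping Node Exporter every 5 seconds gives finer-grained data for the alert rules at the cost of more stored samples.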
Create alerting rules
Define alert rules in /etc/prometheus/alert_rules.yml for critical system resources including CPU, memory, disk usage, and system load. The CPU rule uses rate(), which averages over the whole 5-minute window and is therefore less prone to flapping on short spikes.
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 2 minutes. Current value: {{ $value }}%"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90%. Current value: {{ $value }}%"
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Root filesystem usage is above 85%. Current value: {{ $value }}%"
      - alert: HighSystemLoad
        expr: node_load15 > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "15-minute load average is above 2. Current value: {{ $value }}"
      - alert: SystemDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
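To make the threshold arithmetic in these rules concrete, the following Python sketch applies the same formulas to plain numbers. The function names and sample values are illustrative assumptions; in the real setup Prometheus evaluates the PromQL expressions against scraped metrics.

```python
"""Mirror the alert-rule arithmetic on plain numbers (illustrative only)."""


def cpu_usage_percent(idle_seconds_per_core, window_seconds):
    """Equivalent of: 100 - avg(per-second idle rate) * 100.

    idle_seconds_per_core: how much each core's idle counter grew
    during the window (hypothetical sample input).
    """
    idle_rates = [idle / window_seconds for idle in idle_seconds_per_core]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


def used_percent(available_bytes, total_bytes):
    """Equivalent of: (1 - available / total) * 100."""
    return (1 - available_bytes / total_bytes) * 100


# Two cores that were idle for 15 s and 30 s of a 300 s window.
cpu = cpu_usage_percent([15, 30], 300)
# 1 GiB available out of 8 GiB total.
mem = used_percent(1 * 1024**3, 8 * 1024**3)

print(f"CPU {cpu:.1f}% -> exceeds 85% threshold: {cpu > 85}")
print(f"Memory {mem:.1f}% -> exceeds 90% threshold: {mem > 90}")
```

An alert only fires once its expression has stayed true for the whole `for:` duration, so a single sample above the threshold is not enough.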
Configure Alertmanager
Set up the Alertmanager configuration in /etc/alertmanager/alertmanager.yml with email notifications and webhook integration for automated responses. In email_configs, the subject is set through headers and the plain-text body through text.
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-smtp-password'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          CRITICAL ALERT: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:8080/webhook/critical'
        send_resolved: true
  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'WARNING: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Warning: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:8080/webhook/warning'
        send_resolved: true
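Alertmanager checks the child routes in order and hands an alert to the first one whose matchers all match (unless a route sets `continue: true`); alerts that match no child route go to the top-level receiver. The sketch below imitates that selection for the routing tree in this tutorial. The `pick_receiver` function and `ROUTES` table are hypothetical helpers for illustration, not Alertmanager code.

```python
"""Sketch of how the routing tree picks a receiver (illustrative only)."""

# Child routes in the order they appear in alertmanager.yml.
ROUTES = [
    ({"severity": "critical"}, "critical-alerts"),
    ({"severity": "warning"}, "warning-alerts"),
]
DEFAULT_RECEIVER = "default-receiver"


def pick_receiver(labels):
    """Return the receiver of the first child route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


for alert in (
    {"alertname": "HighMemoryUsage", "severity": "critical"},
    {"alertname": "HighDiskUsage", "severity": "warning"},
    {"alertname": "Custom", "severity": "info"},
):
    print(alert["alertname"], "->", pick_receiver(alert))
```

Note that with the rules in this tutorial only HighMemoryUsage and SystemDown are labeled critical, so the other alerts reach the warning receiver and its webhook URL.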
Create automated response scripts
Create scripts that respond to alerts automatically by freeing disk space, restarting services, or clearing logs. Save the first one as /opt/monitoring-scripts/cleanup-disk.sh.
#!/bin/bash
# Automated disk cleanup script
LOG_FILE="/var/log/monitoring-cleanup.log"
echo "$(date): Starting automated disk cleanup" >> "$LOG_FILE"
# Clear temporary files not accessed for more than 7 days
find /tmp -type f -atime +7 -delete 2>/dev/null
echo "$(date): Cleared old temporary files" >> "$LOG_FILE"
# Remove journal entries older than 30 days
journalctl --vacuum-time=30d
echo "$(date): Cleaned journal logs" >> "$LOG_FILE"
# Clear package cache
apt clean 2>/dev/null || dnf clean all 2>/dev/null
echo "$(date): Cleared package cache" >> "$LOG_FILE"
# Check disk usage after cleanup
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
echo "$(date): Disk usage after cleanup: ${DISK_USAGE}%" >> "$LOG_FILE"
if [ "$DISK_USAGE" -lt 80 ]; then
  echo "$(date): Disk cleanup successful" >> "$LOG_FILE"
else
  echo "$(date): WARNING: Disk usage still high after cleanup" >> "$LOG_FILE"
fi
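Before letting a script delete files unattended, it can help to preview what the `find /tmp -type f -atime +7` selection would match. The following Python sketch lists those files without deleting anything. The `stale_files` helper is hypothetical; like find's `-atime +N`, it truncates the file age to whole 24-hour periods before comparing.

```python
"""Dry-run preview of `find /tmp -type f -atime +7` (nothing is deleted)."""
import os
import stat
import time

SECONDS_PER_DAY = 86400


def stale_files(root, days=7, now=None):
    """Return regular files under root last accessed more than `days` whole days ago."""
    now = time.time() if now is None else now
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            age_days = int((now - info.st_atime) // SECONDS_PER_DAY)
            if stat.S_ISREG(info.st_mode) and age_days > days:
                matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    for path in stale_files("/tmp"):
        print(path)
```

On filesystems mounted with `noatime`, access times are not updated on reads, so both this preview and the find command may treat files that are still in use as stale.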
Create service restart script
Create /opt/monitoring-scripts/restart-services.sh, a script that restarts services when memory usage is critical and terminates runaway processes when CPU usage is high.
#!/bin/bash
# Automated service restart script
LOG_FILE="/var/log/monitoring-restarts.log"
ALERT_TYPE="${1:-unknown}"
echo "$(date): Received $ALERT_TYPE alert, checking system status" >> "$LOG_FILE"
# Get current system metrics
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//')
MEM_USAGE=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
LOAD_AVG=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
echo "$(date): CPU: ${CPU_USAGE}%, Memory: ${MEM_USAGE}%, Load: ${LOAD_AVG}" >> "$LOG_FILE"
# Restart high-memory services if memory is critical
if (( $(echo "$MEM_USAGE > 90" | bc -l) )); then
  echo "$(date): Critical memory usage, restarting services" >> "$LOG_FILE"
  systemctl restart apache2 2>/dev/null || systemctl restart nginx 2>/dev/null
  systemctl restart mysql 2>/dev/null || systemctl restart mariadb 2>/dev/null
  echo "$(date): Services restarted" >> "$LOG_FILE"
fi
# Kill processes consuming excessive CPU
if (( $(echo "$CPU_USAGE > 90" | bc -l) )); then
  echo "$(date): High CPU usage detected, checking for runaway processes" >> "$LOG_FILE"
  # Terminate up to three non-root processes using more than 50% CPU
  ps aux --sort=-%cpu | awk 'NR>1 && $3>50 && $1!="root" {print $2}' | head -3 | xargs -r kill -TERM
  echo "$(date): Terminated high CPU processes" >> "$LOG_FILE"
fi
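The script measures memory with `free` (used divided by total), while the HighMemoryUsage rule is based on MemAvailable, so the two can disagree near the threshold. As a hedged sketch, the Python below computes the percentage the same way the alert rule does, from text in the format of /proc/meminfo. The sample values and helper names are hypothetical.

```python
"""Compute memory usage the same way the HighMemoryUsage rule does."""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_used_percent(meminfo):
    """(1 - MemAvailable / MemTotal) * 100, matching the alert expression."""
    return (1 - meminfo["MemAvailable"] / meminfo["MemTotal"]) * 100


# Hypothetical sample; on a real host read the text from /proc/meminfo.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          300000 kB
MemAvailable:     600000 kB
"""

used = memory_used_percent(parse_meminfo(SAMPLE))
print(f"Memory used: {used:.1f}% -> above 90% threshold: {used > 90}")
```

On a live system the same functions can be fed with `open("/proc/meminfo").read()`.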
Create webhook server for automated responses
Set up a simple webhook server at /opt/monitoring-scripts/webhook-server.py that receives alerts from Alertmanager and triggers the automated response scripts. Because HighDiskUsage and HighCPUUsage are labeled with severity warning, they arrive on /webhook/warning, so both endpoints share the same dispatch logic. Resolved notifications are logged but do not trigger a script.
#!/usr/bin/env python3
import json
import logging
import subprocess
from http.server import HTTPServer, BaseHTTPRequestHandler

logging.basicConfig(filename='/var/log/monitoring-webhook.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

SCRIPT_DIR = '/opt/monitoring-scripts'


def run_response(alert_name, severity):
    """Run the response script that matches the alert name, if any."""
    if alert_name == 'HighDiskUsage':
        subprocess.run([f'{SCRIPT_DIR}/cleanup-disk.sh'], check=False)
        logging.info("Triggered disk cleanup for HighDiskUsage alert")
    elif alert_name in ('HighMemoryUsage', 'HighCPUUsage'):
        subprocess.run([f'{SCRIPT_DIR}/restart-services.sh', severity], check=False)
        logging.info(f"Triggered service restart for {alert_name} alert")


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path not in ('/webhook/critical', '/webhook/warning'):
            self.send_response(404)
            self.end_headers()
            return
        severity = self.path.rsplit('/', 1)[-1]
        try:
            content_length = int(self.headers.get('Content-Length', 0))
            alert_data = json.loads(self.rfile.read(content_length).decode('utf-8'))
            logging.info(f"Received {severity} alert: {alert_data}")
            for alert in alert_data.get('alerts', []):
                # send_resolved is enabled, so only act on alerts that are still firing
                if alert.get('status', 'firing') == 'firing':
                    run_response(alert.get('labels', {}).get('alertname', ''), severity)
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'Alert processed')
        except Exception as e:
            logging.error(f"Error processing {severity} alert: {e}")
            self.send_response(500)
            self.end_headers()


if __name__ == '__main__':
    server = HTTPServer(('localhost', 8080), WebhookHandler)
    logging.info("Webhook server starting on port 8080")
    server.serve_forever()
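To see what the webhook server has to work with, the sketch below builds a trimmed example of the JSON body Alertmanager posts to a webhook receiver and works out which script each alert would trigger, without running anything. Because the receivers set `send_resolved: true`, Alertmanager also posts when an alert clears, so this sketch only acts on alerts whose status is `firing`. The `ACTIONS` mapping and `commands_for` helper are illustrative assumptions built from the script paths in this tutorial.

```python
"""Decide which response script an Alertmanager webhook payload would trigger."""
import json

# Script paths from the tutorial; the mapping itself is an illustrative assumption.
ACTIONS = {
    "HighDiskUsage": ["/opt/monitoring-scripts/cleanup-disk.sh"],
    "HighMemoryUsage": ["/opt/monitoring-scripts/restart-services.sh", "critical"],
    "HighCPUUsage": ["/opt/monitoring-scripts/restart-services.sh", "critical"],
}


def commands_for(payload):
    """Return the commands to run for every firing alert in the payload."""
    commands = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications should not trigger a response
        name = alert.get("labels", {}).get("alertname", "")
        if name in ACTIONS:
            commands.append(ACTIONS[name])
    return commands


# Trimmed example of the JSON body sent to a webhook receiver.
SAMPLE_BODY = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "critical-alerts",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "HighMemoryUsage", "severity": "critical"}},
        {"status": "resolved",
         "labels": {"alertname": "HighDiskUsage", "severity": "warning"}},
    ],
})

for command in commands_for(json.loads(SAMPLE_BODY)):
    print("would run:", " ".join(command))
```

Separating the decision from the execution like this makes the dispatch logic easy to test before it is allowed to restart services on a production host.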
Set permissions for monitoring scripts
Make the monitoring scripts executable and set proper ownership for security.
sudo chmod 755 /opt/monitoring-scripts/*.sh
sudo chmod 755 /opt/monitoring-scripts/webhook-server.py
sudo chown root:root /opt/monitoring-scripts/*
Install dependencies for the response scripts
The webhook server uses only the Python standard library, so Python 3 itself is enough. The restart script also needs bc for its floating-point comparisons.
sudo apt install -y python3 bc
Create systemd service files
Create systemd service files for Prometheus, Node Exporter, Alertmanager, and the webhook server. Start with /etc/systemd/system/prometheus.service. Note that --web.enable-lifecycle lets any client that can reach port 9090 reload or shut down Prometheus over HTTP, so restrict access to that port.
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Create Node Exporter service
Create /etc/systemd/system/node_exporter.service so that Node Exporter runs as a service and collects system metrics.
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=:9100 \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Create Alertmanager service
Create /etc/systemd/system/alertmanager.service so that Alertmanager runs as a service and handles alert notifications.
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=:9093
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Create webhook server service
Create /etc/systemd/system/monitoring-webhook.service for the automated response webhook server. It runs as root because the response scripts restart services and terminate processes.
[Unit]
Description=Monitoring Webhook Server
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/bin/python3 /opt/monitoring-scripts/webhook-server.py
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Set correct file ownership
Ensure all configuration files have the correct ownership for their respective services.
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
sudo chown prometheus:prometheus /etc/prometheus/alert_rules.yml
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
Configure firewall rules
Open the ports for the Prometheus and Alertmanager web interfaces, plus the Node Exporter port if another Prometheus server will scrape this host. The webhook server listens only on localhost, so it needs no firewall rule.
sudo ufw allow 9090/tcp comment "Prometheus"
sudo ufw allow 9100/tcp comment "Node Exporter"
sudo ufw allow 9093/tcp comment "Alertmanager"
Enable and start services
Reload systemd, enable all monitoring services to start on boot, and start them immediately.
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl enable --now node_exporter
sudo systemctl enable --now alertmanager
sudo systemctl enable --now monitoring-webhook
Create monitoring dashboards and notifications
Install and configure Grafana
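Add the Grafana APT repository and install Grafana to build visual dashboards from your monitoring data. The apt-key command is deprecated, so store the signing key in a dedicated keyring.
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list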
sudo apt update
sudo apt install -y grafana
Configure Grafana data source
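Create a provisioning file under /etc/grafana/provisioning/datasources/ (for example prometheus.yml; the file name is your choice) so that Grafana connects to your Prometheus instance automatically.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true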
Enable and start Grafana
Start Grafana service and configure it to start automatically on boot.
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server
Verify your setup
Check that all services are running and accessible through their web interfaces.
# Check service status
sudo systemctl status prometheus node_exporter alertmanager monitoring-webhook
# Verify Prometheus is collecting metrics
curl -s 'http://localhost:9090/api/v1/query?query=up'
# Check Node Exporter metrics
curl -s http://localhost:9100/metrics | head -20
# Test Alertmanager
curl -s http://localhost:9093/api/v2/status
# Check webhook server
curl -s http://localhost:8080/webhook/warning -X POST -d '{}'
# View recent logs
sudo journalctl -u prometheus --since "10 minutes ago" --no-pager
sudo journalctl -u alertmanager --since "10 minutes ago" --no-pager
Access the web interfaces:
- Prometheus: http://your-server:9090
- Alertmanager: http://your-server:9093
- Grafana: http://your-server:3000 (default login admin/admin; Grafana asks you to set a new password on first login)
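The `up` query used in the verification step returns JSON that is tedious to read by eye. As a minimal sketch, the Python below parses a response of that shape and lists the targets that are down. `SAMPLE` is a hypothetical response, and `fetch_up` and `down_targets` are illustrative helper names; `fetch_up` assumes Prometheus is reachable at localhost:9090 and is defined here but not called.

```python
"""Summarise the JSON returned by Prometheus for the query `up`."""
import json
import urllib.parse
import urllib.request


def fetch_up(base_url="http://localhost:9090"):
    """Run the instant query `up` against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def down_targets(result):
    """Return 'job/instance' for every series whose value is 0."""
    down = []
    for series in result["data"]["result"]:
        _timestamp, value = series["value"]
        if float(value) == 0:
            metric = series["metric"]
            down.append(f"{metric.get('job')}/{metric.get('instance')}")
    return down


# Hypothetical response in the shape the API returns for an instant vector.
SAMPLE = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
         "value": [1700000000.0, "1"]},
        {"metric": {"__name__": "up", "job": "node_exporter", "instance": "localhost:9100"},
         "value": [1700000000.0, "0"]},
    ]},
}

print("down targets:", down_targets(SAMPLE))
```

On a host where the stack is running, `down_targets(fetch_up())` should return an empty list when both scrape targets are healthy.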
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Prometheus won't start | Configuration syntax error | sudo /usr/local/bin/promtool check config /etc/prometheus/prometheus.yml |
| Node Exporter not collecting metrics | Permission issues or wrong user | sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter |
| Alerts not firing | Alert rules not loaded | sudo /usr/local/bin/promtool check rules /etc/prometheus/alert_rules.yml |
| Email notifications not working | SMTP configuration error | Check SMTP settings in /etc/alertmanager/alertmanager.yml |
| Webhook server not responding | Service not running or port 8080 already in use | sudo systemctl status monitoring-webhook and check /var/log/monitoring-webhook.log |
| Can't access web interfaces | Firewall blocking ports | Verify firewall rules: sudo ufw status or sudo firewall-cmd --list-ports |
Next steps
- Install and configure Grafana with Prometheus for system monitoring to create advanced dashboards
- Monitor container performance with Prometheus and cAdvisor for comprehensive metrics collection to extend monitoring to containers
- Configure network monitoring with SNMP and Grafana dashboards for infrastructure visibility to monitor network devices
- Setup centralized log aggregation with Elasticsearch 8, Logstash 8, and Kibana 8 (ELK Stack) for comprehensive log analysis
- Configure Prometheus Alertmanager with Slack integration for team notifications
Automated install script
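Run this to automate the core setup (Prometheus, Node Exporter, and Alertmanager). It does not install the response scripts, the webhook server, or Grafana.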
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Versions
PROMETHEUS_VERSION="2.48.0"
NODE_EXPORTER_VERSION="1.7.0"
ALERTMANAGER_VERSION="0.26.0"
# Functions to print colored output
print_status() {
    echo -e "${GREEN}[INFO]${NC} $1"
}
print_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}
print_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}
# Cleanup function: roll back a failed installation
cleanup() {
    if [[ $? -ne 0 ]]; then
        print_error "Installation failed. Cleaning up..."
        systemctl stop prometheus node_exporter alertmanager 2>/dev/null || true
        systemctl disable prometheus node_exporter alertmanager 2>/dev/null || true
        # userdel accepts a single login name per call
        for user in prometheus node_exporter alertmanager; do
            userdel "$user" 2>/dev/null || true
        done
        rm -rf /etc/prometheus /var/lib/prometheus /etc/alertmanager /var/lib/alertmanager /opt/monitoring-scripts
        rm -f /usr/local/bin/{prometheus,promtool,node_exporter,alertmanager,amtool}
        rm -f /etc/systemd/system/{prometheus,node_exporter,alertmanager}.service
    fi
}
trap cleanup ERR
# Check if running as root or with sudo
if [[ $EUID -ne 0 ]]; then
    print_error "This script must be run as root or with sudo"
    exit 1
fi
# Auto-detect distribution
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_UPDATE="apt update -y"
            PKG_INSTALL="apt install -y"
            PKG_UPGRADE="apt upgrade -y"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            PKG_UPGRADE="dnf upgrade -y"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            PKG_UPGRADE="yum upgrade -y"
            ;;
        *)
            print_error "Unsupported distribution: $ID"
            exit 1
            ;;
    esac
else
    print_error "Cannot detect distribution. /etc/os-release not found."
    exit 1
fi
print_status "Starting Prometheus monitoring stack installation on $PRETTY_NAME"
# Step 1: Update system
echo "[1/12] Updating system packages..."
$PKG_UPDATE
$PKG_UPGRADE
$PKG_INSTALL wget tar
# Step 2: Create system users
echo "[2/12] Creating system users..."
useradd --no-create-home --shell /bin/false prometheus 2>/dev/null || true
useradd --no-create-home --shell /bin/false node_exporter 2>/dev/null || true
useradd --no-create-home --shell /bin/false alertmanager 2>/dev/null || true
# Step 3: Create directory structure
echo "[3/12] Creating directory structure..."
mkdir -p /etc/prometheus /var/lib/prometheus /opt/prometheus
mkdir -p /etc/alertmanager /var/lib/alertmanager
mkdir -p /opt/node_exporter /opt/monitoring-scripts
chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
# Step 4: Download and install Prometheus
echo "[4/12] Installing Prometheus..."
cd /tmp
wget -q "https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
tar xzf "prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz"
cp "prometheus-${PROMETHEUS_VERSION}.linux-amd64/prometheus" /usr/local/bin/
cp "prometheus-${PROMETHEUS_VERSION}.linux-amd64/promtool" /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
chmod 755 /usr/local/bin/prometheus /usr/local/bin/promtool
# Step 5: Download and install Node Exporter
echo "[5/12] Installing Node Exporter..."
wget -q "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
tar xzf "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
cp "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter" /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter
chmod 755 /usr/local/bin/node_exporter
# Step 6: Download and install Alertmanager
echo "[6/12] Installing Alertmanager..."
wget -q "https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
tar xzf "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
cp "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/alertmanager" /usr/local/bin/
cp "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/amtool" /usr/local/bin/
chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
chmod 755 /usr/local/bin/alertmanager /usr/local/bin/amtool
# Step 7: Configure Prometheus
echo "[7/12] Configuring Prometheus..."
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 5s
EOF
# Step 8: Create alerting rules
echo "[8/12] Creating alerting rules..."
cat > /etc/prometheus/alert_rules.yml << 'EOF'
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
      - alert: DiskSpaceUsage
        expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
EOF
# Step 9: Configure Alertmanager
echo "[9/12] Configuring Alertmanager..."
cat > /etc/alertmanager/alertmanager.yml << 'EOF'
global:
  smtp_smarthost: 'localhost:587'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'console-notifications'
receivers:
  - name: 'console-notifications'
    webhook_configs:
      - url: 'http://localhost:8080/webhook'
EOF
chown prometheus:prometheus /etc/prometheus/prometheus.yml /etc/prometheus/alert_rules.yml
chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
chmod 644 /etc/prometheus/prometheus.yml /etc/prometheus/alert_rules.yml /etc/alertmanager/alertmanager.yml
# Step 10: Create systemd services
echo "[10/12] Creating systemd services..."
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/ --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --web.listen-address=0.0.0.0:9090
Restart=always
[Install]
WantedBy=multi-user.target
EOF
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --web.listen-address=0.0.0.0:9100
Restart=always
[Install]
WantedBy=multi-user.target
EOF
cat > /etc/systemd/system/alertmanager.service << 'EOF'
[Unit]
Description=Alertmanager
After=network.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager/ --web.listen-address=0.0.0.0:9093
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Step 11: Start and enable services
echo "[11/12] Starting services..."
systemctl daemon-reload
systemctl enable prometheus node_exporter alertmanager
systemctl start prometheus node_exporter alertmanager
# Step 12: Configure firewall
echo "[12/12] Configuring firewall..."
if command -v ufw >/dev/null 2>&1; then
    ufw allow 9090/tcp
    ufw allow 9100/tcp
    ufw allow 9093/tcp
elif command -v firewall-cmd >/dev/null 2>&1; then
    firewall-cmd --permanent --add-port=9090/tcp
    firewall-cmd --permanent --add-port=9100/tcp
    firewall-cmd --permanent --add-port=9093/tcp
    firewall-cmd --reload
fi
# Verification
echo "Verifying installation..."
sleep 10
if systemctl is-active --quiet prometheus && systemctl is-active --quiet node_exporter && systemctl is-active --quiet alertmanager; then
    print_status "All services are running successfully!"
    print_status "Prometheus: http://localhost:9090"
    print_status "Node Exporter: http://localhost:9100"
    print_status "Alertmanager: http://localhost:9093"
    print_warning "Remember to configure proper notification channels in /etc/alertmanager/alertmanager.yml"
else
    print_error "Some services failed to start. Check logs with 'journalctl -u prometheus -u node_exporter -u alertmanager'"
    exit 1
fi
# Cleanup temp files
rm -rf /tmp/prometheus-* /tmp/node_exporter-* /tmp/alertmanager-*
print_status "Prometheus monitoring stack installation completed successfully!"
Review the script before running. It must run as root, so execute it with: sudo bash install.sh