Monitor Linux system resources with performance alerts and automated responses

Intermediate 45 min Apr 05, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up comprehensive Linux system monitoring with Prometheus, Node Exporter, and Alertmanager to track CPU, memory, and disk usage with automated alerts and response scripts for proactive system management.

Prerequisites

  • Root or sudo access
  • At least 2GB RAM
  • Basic understanding of systemd services
  • Email server access for notifications (optional)

What this solves

System resource monitoring prevents downtime by alerting you when CPU, memory, disk, or network usage reaches critical thresholds. This tutorial sets up Prometheus with Node Exporter for metrics collection, Alertmanager for notifications, and custom scripts for automated responses like service restarts or resource cleanup.

Step-by-step installation

Update system packages

Update the package index and installed packages so that all dependencies are current. Run the command that matches your distribution.

# Ubuntu / Debian
sudo apt update && sudo apt upgrade -y

# AlmaLinux / Rocky Linux
sudo dnf update -y

Create system users for monitoring services

Create dedicated users for Prometheus, Node Exporter, and Alertmanager to run services securely without root privileges.

sudo useradd --no-create-home --shell /bin/false prometheus
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo useradd --no-create-home --shell /bin/false alertmanager

Create directory structure

Set up the directory structure for configuration files, data storage, and binaries with correct permissions.

sudo mkdir -p /etc/prometheus /var/lib/prometheus /opt/prometheus
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo mkdir -p /opt/node_exporter
sudo mkdir -p /opt/monitoring-scripts
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

Download and install Prometheus

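Download the Prometheus release tarball (v2.48.0 in this tutorial; check the releases page for newer versions), then install the binaries and the console files that the systemd unit references.

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
sudo cp prometheus-2.48.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.48.0.linux-amd64/promtool /usr/local/bin/
sudo cp -r prometheus-2.48.0.linux-amd64/consoles prometheus-2.48.0.linux-amd64/console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo chmod 755 /usr/local/bin/prometheus /usr/local/bin/promtool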

Download and install Node Exporter

Install Node Exporter to collect system metrics like CPU, memory, disk, and network statistics.

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
sudo chmod 755 /usr/local/bin/node_exporter

Download and install Alertmanager

Install Alertmanager to handle alerts from Prometheus and send notifications via email, Slack, or other channels.

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar xvf alertmanager-0.26.0.linux-amd64.tar.gz
sudo cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
sudo chmod 755 /usr/local/bin/alertmanager /usr/local/bin/amtool

Configure Prometheus

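Create the main Prometheus configuration file at /etc/prometheus/prometheus.yml. It defines scrape targets for Prometheus itself and for Node Exporter, points to the alerting rules file, and tells Prometheus where Alertmanager is listening.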

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 5s
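
The configuration sets a global scrape interval and overrides it for one job. As a rough illustration of how that override resolves, the sketch below mirrors the YAML in a hand-written Python dictionary and computes the effective interval per job. The dictionary and the function name `effective_intervals` are assumptions made for this example; this is not a parser for real Prometheus configuration.

```python
# Hand-written mirror of the prometheus.yml above (illustrative only).
config = {
    "global": {"scrape_interval": "15s", "evaluation_interval": "15s"},
    "scrape_configs": [
        {"job_name": "prometheus", "targets": ["localhost:9090"]},
        {"job_name": "node_exporter", "targets": ["localhost:9100"],
         "scrape_interval": "5s"},
    ],
}


def effective_intervals(cfg):
    """Return {job_name: interval}; a job-level value overrides the global one."""
    default = cfg["global"]["scrape_interval"]
    return {job["job_name"]: job.get("scrape_interval", default)
            for job in cfg["scrape_configs"]}


print(effective_intervals(config))
```

Node Exporter is scraped three times as often as Prometheus itself here, which gives the alert expressions finer-grained data at the cost of more stored samples.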

Create alerting rules

Define alert rules for CPU, memory, disk usage, system load, and target availability in /etc/prometheus/alert_rules.yml.

groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 2 minutes. Current value: {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90%. Current value: {{ $value }}%"

      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Root filesystem usage is above 85%. Current value: {{ $value }}%"

      - alert: HighSystemLoad
        expr: node_load15 > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "15-minute load average is above 2. Current value: {{ $value }}"

      - alert: SystemDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
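
To make the arithmetic in these expressions concrete, here is a small sketch that applies the same formulas to made-up numbers. The function names and sample values are hypothetical and chosen only for illustration; in Prometheus the condition must also hold for the whole `for:` duration before the alert fires.

```python
def usage_percent(available, total):
    """Mirror of (1 - available / total) * 100 from the memory and disk rules."""
    return (1 - available / total) * 100


def cpu_busy_percent(idle_rate):
    """Mirror of 100 - idle_rate * 100, where idle_rate is the averaged
    fraction of time the cores spent idle (0.0 to 1.0)."""
    return 100 - idle_rate * 100


GIB = 1024 ** 3

# Hypothetical sample values, not measurements from a real host.
mem = usage_percent(available=1 * GIB, total=16 * GIB)
disk = usage_percent(available=20 * GIB, total=100 * GIB)
cpu = cpu_busy_percent(idle_rate=0.10)

print(f"memory {mem:.2f}% exceeds 90% threshold: {mem > 90}")
print(f"disk   {disk:.2f}% exceeds 85% threshold: {disk > 85}")
print(f"cpu    {cpu:.2f}% exceeds 85% threshold: {cpu > 85}")
```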

Configure Alertmanager

Create /etc/alertmanager/alertmanager.yml with email notifications and webhook integration for automated responses. Replace the SMTP settings and addresses with your own.

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-smtp-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
  - match:
      severity: warning
    receiver: 'warning-alerts'

receivers:
  # In email_configs the subject is set through headers and the
  # plain-text body goes in text.
  - name: 'default-receiver'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          CRITICAL ALERT: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:8080/webhook/critical'
        send_resolved: true

  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@example.com'
        headers:
          Subject: 'WARNING: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Warning: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Instance: {{ .Labels.instance }}
          {{ end }}
    webhook_configs:
      - url: 'http://localhost:8080/webhook/warning'
        send_resolved: true
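
The routing tree decides which receiver handles an alert: child routes are checked in order, the first one whose labels match wins, and anything unmatched falls back to the top-level receiver. The sketch below models only that selection logic for the configuration above. `ROUTES` and `pick_receiver` are hypothetical names for this example, and features such as grouping, `continue`, and nested routes are deliberately left out.

```python
# Child routes in the order they appear in alertmanager.yml.
ROUTES = [
    ({"severity": "critical"}, "critical-alerts"),
    ({"severity": "warning"}, "warning-alerts"),
]
DEFAULT_RECEIVER = "default-receiver"


def pick_receiver(labels):
    """Return the receiver of the first route whose labels all match."""
    for match, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


for labels in (
    {"alertname": "HighMemoryUsage", "severity": "critical"},
    {"alertname": "HighDiskUsage", "severity": "warning"},
    {"alertname": "Custom", "severity": "info"},
):
    print(labels["alertname"], "->", pick_receiver(labels))
```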

Create automated response scripts

Create scripts that respond to alerts automatically by freeing disk space, restarting services, or clearing logs. Save the disk cleanup script as /opt/monitoring-scripts/cleanup-disk.sh.

#!/bin/bash
# Automated disk cleanup script

LOG_FILE="/var/log/monitoring-cleanup.log"
echo "$(date): Starting automated disk cleanup" >> "$LOG_FILE"

# Clear temporary files not accessed for more than 7 days
find /tmp -type f -atime +7 -delete 2>/dev/null
echo "$(date): Cleared old temporary files" >> "$LOG_FILE"

# Remove journal entries older than 30 days
journalctl --vacuum-time=30d
echo "$(date): Cleaned journal logs" >> "$LOG_FILE"

# Clear package cache
apt clean 2>/dev/null || dnf clean all 2>/dev/null
echo "$(date): Cleared package cache" >> "$LOG_FILE"

# Check disk usage after cleanup (-P keeps each filesystem on one line)
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | sed 's/%//')
echo "$(date): Disk usage after cleanup: ${DISK_USAGE}%" >> "$LOG_FILE"
if [ "$DISK_USAGE" -lt 80 ]; then
    echo "$(date): Disk cleanup successful" >> "$LOG_FILE"
else
    echo "$(date): WARNING: Disk usage still high after cleanup" >> "$LOG_FILE"
fi
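
The final check in the cleanup script can also be written in Python, for instance if the webhook server should verify the result itself. The following is a minimal sketch using only the standard library. The names `disk_used_percent` and `cleanup_succeeded` are hypothetical, and the percentage is computed as used / (used + free), which should be close to the `Use%` column of `df` but is not guaranteed to be identical.

```python
import shutil


def disk_used_percent(path="/"):
    """Approximate df's Use% for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.used / (usage.used + usage.free) * 100


def cleanup_succeeded(percent, threshold=80):
    """Same rule as the shell script: success means usage below the threshold."""
    return percent < threshold


percent = disk_used_percent("/")
print(f"Root filesystem usage: {percent:.1f}%")
print("Cleanup successful" if cleanup_succeeded(percent) else "Usage still high")
```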

Create service restart script

Create a script that restarts services when memory usage is critical and terminates runaway processes when CPU usage is high. Save it as /opt/monitoring-scripts/restart-services.sh and adjust the service names to match what runs on your server.

#!/bin/bash
# Automated service restart script

LOG_FILE="/var/log/monitoring-restarts.log"
ALERT_TYPE="$1"
echo "$(date): Received $ALERT_TYPE alert, checking system status" >> "$LOG_FILE"

# Get current system metrics
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//')
MEM_USAGE=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
LOAD_AVG=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
echo "$(date): CPU: ${CPU_USAGE}%, Memory: ${MEM_USAGE}%, Load: ${LOAD_AVG}" >> "$LOG_FILE"

# Restart high-memory services if memory is critical
if (( $(echo "$MEM_USAGE > 90" | bc -l) )); then
    echo "$(date): Critical memory usage, restarting services" >> "$LOG_FILE"
    systemctl restart apache2 2>/dev/null || systemctl restart nginx 2>/dev/null
    systemctl restart mysql 2>/dev/null || systemctl restart mariadb 2>/dev/null
    echo "$(date): Services restarted" >> "$LOG_FILE"
fi

# Kill processes consuming excessive CPU
if (( $(echo "$CPU_USAGE > 90" | bc -l) )); then
    echo "$(date): High CPU usage detected, checking for runaway processes" >> "$LOG_FILE"
    # Send SIGTERM to at most three non-root processes using more than 50% CPU
    ps aux --sort=-%cpu | awk 'NR>1 && $3>50 && $1!="root" {print $2}' | head -3 | xargs -r kill -TERM
    echo "$(date): Terminated high CPU processes" >> "$LOG_FILE"
fi
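
The script derives memory usage from the `used` column of `free`, which may be calculated differently depending on the procps version, whereas the HighMemoryUsage rule is based on `MemAvailable`. As a sketch of how to compute the same figure as the alert rule, the example below parses text in the format of /proc/meminfo. The function names and the sample values are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_used_percent(meminfo):
    """Same formula as the alert rule: (1 - MemAvailable / MemTotal) * 100."""
    return (1 - meminfo["MemAvailable"] / meminfo["MemTotal"]) * 100


# Hypothetical sample; on a real host read open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:       16000000 kB
MemFree:         1000000 kB
MemAvailable:    4000000 kB
"""

print(f"{memory_used_percent(parse_meminfo(SAMPLE)):.2f}%")
```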

Create webhook server for automated responses

Set up a simple webhook server at /opt/monitoring-scripts/webhook-server.py that receives alerts from Alertmanager and triggers the automated response scripts.

#!/usr/bin/env python3
import json
import logging
import subprocess
from http.server import HTTPServer, BaseHTTPRequestHandler

logging.basicConfig(filename='/var/log/monitoring-webhook.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

SCRIPT_DIR = '/opt/monitoring-scripts'

# Response script (with arguments) to run for each alert name
ACTIONS = {
    'HighDiskUsage': [f'{SCRIPT_DIR}/cleanup-disk.sh'],
    'HighMemoryUsage': [f'{SCRIPT_DIR}/restart-services.sh', 'critical'],
    'HighCPUUsage': [f'{SCRIPT_DIR}/restart-services.sh', 'critical'],
}


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path not in ('/webhook/critical', '/webhook/warning'):
            self.send_response(404)
            self.end_headers()
            return

        severity = self.path.rsplit('/', 1)[-1]
        try:
            # Treat a missing Content-Length header as an empty body
            content_length = int(self.headers.get('Content-Length', 0))
            post_data = self.rfile.read(content_length)
            alert_data = json.loads(post_data.decode('utf-8'))
            logging.info(f"Received {severity} alert: {alert_data}")

            for alert in alert_data.get('alerts', []):
                # send_resolved is enabled in Alertmanager, so resolved
                # notifications also arrive here; only act on firing alerts
                if alert.get('status') == 'resolved':
                    continue
                alert_name = alert.get('labels', {}).get('alertname', '')
                # HighDiskUsage and HighCPUUsage carry severity "warning" in
                # alert_rules.yml, so dispatch by alert name on both paths
                command = ACTIONS.get(alert_name)
                if command:
                    subprocess.run(command, check=False)
                    logging.info(f"Ran {command[0]} for {alert_name} alert")

            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'Alert processed')

        except Exception as e:
            logging.error(f"Error processing {severity} alert: {e}")
            self.send_response(500)
            self.end_headers()


if __name__ == '__main__':
    server = HTTPServer(('localhost', 8080), WebhookHandler)
    logging.info("Webhook server starting on port 8080")
    server.serve_forever()
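
To exercise the webhook server without waiting for a real alert, you can post a hand-built payload to it. The sketch below builds a minimal body shaped like an Alertmanager webhook notification (only a subset of the real fields) and defines a helper to send it. `build_payload` and `post_alert` are hypothetical helper names, and `TestAlert` is an invented alert name that matches none of the response scripts.

```python
import json
import urllib.request


def build_payload(alertname, instance="localhost:9100", status="firing"):
    """Build a minimal Alertmanager-style webhook body for testing."""
    return {
        "version": "4",
        "status": status,
        "alerts": [{
            "status": status,
            "labels": {"alertname": alertname, "instance": instance},
            "annotations": {"summary": f"Test {alertname} on {instance}"},
        }],
    }


def post_alert(url, payload):
    """POST the payload as JSON and return the HTTP status code."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Print the body that would be sent; nothing is posted at import time.
print(json.dumps(build_payload("TestAlert"), indent=2))
```

With the webhook service running, calling `post_alert("http://localhost:8080/webhook/warning", build_payload("TestAlert"))` should be logged in /var/log/monitoring-webhook.log without running any script. Sending a real alert name such as `HighDiskUsage` would start the corresponding cleanup, so use that only deliberately.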

Set permissions for monitoring scripts

Make the monitoring scripts executable and set proper ownership for security.

sudo chmod 755 /opt/monitoring-scripts/*.sh
sudo chmod 755 /opt/monitoring-scripts/webhook-server.py
sudo chown root:root /opt/monitoring-scripts/*

Never use chmod 777. It gives every user on the system full access to your files. Mode 755 lets only the owner (root) modify the scripts, while other users can read and execute them but not change them.

Install dependencies for the response scripts

The webhook server uses only the Python standard library, and the restart script needs bc for its numeric comparisons. Install both with your distribution's package manager.

# Ubuntu / Debian
sudo apt install -y python3 bc

# AlmaLinux / Rocky Linux
sudo dnf install -y python3 bc

Create systemd service files

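Create systemd unit files for Prometheus, Node Exporter, Alertmanager, and the webhook server. Start with Prometheus and save the unit as /etc/systemd/system/prometheus.service.

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target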

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Create Node Exporter service

Configure the systemd service for Node Exporter to collect system metrics. Save the unit as /etc/systemd/system/node_exporter.service.

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=:9100 \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Create Alertmanager service

Set up the systemd service for Alertmanager to handle alert notifications. Save the unit as /etc/systemd/system/alertmanager.service.

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=:9093
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Create webhook server service

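Configure the systemd service for the automated response webhook server. Save the unit as /etc/systemd/system/monitoring-webhook.service. It runs as root because the response scripts restart services and clean system directories.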

[Unit]
Description=Monitoring Webhook Server
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/bin/python3 /opt/monitoring-scripts/webhook-server.py
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Set correct file ownership

Ensure all configuration files have the correct ownership for their respective services.

sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
sudo chown prometheus:prometheus /etc/prometheus/alert_rules.yml
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml

Configure firewall rules

Open the ports for the Prometheus web interface, Node Exporter, and Alertmanager. The webhook server listens only on localhost:8080, so it needs no firewall rule. These endpoints are unauthenticated by default, so restrict access to trusted networks where possible.

# Ubuntu / Debian (ufw)
sudo ufw allow 9090/tcp comment "Prometheus"
sudo ufw allow 9100/tcp comment "Node Exporter"
sudo ufw allow 9093/tcp comment "Alertmanager"

# AlmaLinux / Rocky Linux (firewalld)
sudo firewall-cmd --permanent --add-port=9090/tcp --add-port=9100/tcp --add-port=9093/tcp
sudo firewall-cmd --reload

Enable and start services

Reload systemd, enable all monitoring services to start on boot, and start them immediately.

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl enable --now node_exporter
sudo systemctl enable --now alertmanager
sudo systemctl enable --now monitoring-webhook

Create monitoring dashboards and notifications

Install and configure Grafana

Add the Grafana repository and install Grafana to build visual dashboards from your monitoring data.

# Ubuntu / Debian
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana

# AlmaLinux / Rocky Linux
sudo dnf install -y https://dl.grafana.com/oss/release/grafana-10.2.2-1.x86_64.rpm

Configure Grafana data source

Create a provisioning file in /etc/grafana/provisioning/datasources/ (for example, prometheus.yml) so that Grafana connects to your Prometheus instance automatically.

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true

Enable and start Grafana

Start Grafana service and configure it to start automatically on boot.

sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server

Verify your setup

Check that all services are running and reachable on their ports.

# Check service status
sudo systemctl status prometheus node_exporter alertmanager monitoring-webhook

# Verify Prometheus is collecting metrics
curl -s 'http://localhost:9090/api/v1/query?query=up'

# Check Node Exporter metrics
curl -s http://localhost:9100/metrics | head -20

# Test Alertmanager
curl -s http://localhost:9093/api/v2/status

# Check webhook server
curl -s http://localhost:8080/webhook/warning -X POST -d '{}'

# View recent logs
sudo journalctl -u prometheus --since "10 minutes ago" --no-pager
sudo journalctl -u alertmanager --since "10 minutes ago" --no-pager
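
The `up` query returns JSON in which each target has the value "1" when its last scrape succeeded and "0" otherwise. As a sketch of how to interpret that output programmatically, the function below lists the instances that are down. `down_targets` is a hypothetical helper and `SAMPLE` is a hand-written response in the shape of the Prometheus instant-query format, not output captured from a real server.

```python
import json


def down_targets(response_text):
    """Return the instances whose 'up' value is not "1"."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    return [series["metric"].get("instance", "unknown")
            for series in body["data"]["result"]
            if series["value"][1] != "1"]


# Hand-written sample response with one healthy and one failing target.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
         "value": [1712300000.0, "1"]},
        {"metric": {"__name__": "up", "instance": "localhost:9100", "job": "node_exporter"},
         "value": [1712300000.0, "0"]},
    ]},
})

print(down_targets(SAMPLE))
```

To use it against a live server, pass the text returned by the curl command for the `up` query instead of `SAMPLE`.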

Access the web interfaces:

  • Prometheus: http://your-server:9090
  • Alertmanager: http://your-server:9093
  • Grafana: http://your-server:3000 (admin/admin)

Common issues

Symptom | Cause | Fix
Prometheus won't start | Configuration syntax error | sudo /usr/local/bin/promtool check config /etc/prometheus/prometheus.yml
Node Exporter not collecting metrics | Permission issues or wrong user | sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Alerts not firing | Alert rules not loaded | sudo /usr/local/bin/promtool check rules /etc/prometheus/alert_rules.yml
Email notifications not working | SMTP configuration error | Check SMTP settings in /etc/alertmanager/alertmanager.yml
Webhook server not responding | Service not running or script error | Check sudo journalctl -u monitoring-webhook, then sudo systemctl restart monitoring-webhook
Can't access web interfaces | Firewall blocking ports | Verify firewall rules: sudo ufw status or sudo firewall-cmd --list-ports
