Kubernetes Monitoring with Prometheus Operator Setup

Deploy a production-grade monitoring stack with Prometheus Operator, configure ServiceMonitor resources for automatic scraping, and add custom alerting rules and Grafana dashboards for Kubernetes cluster observability.

Prerequisites

Kubernetes cluster with admin access
kubectl configured
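70Gi+ of available persistent storage (50Gi for Prometheus, 10Gi each for Grafana and Alertmanager with the values used in this guide)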
Basic understanding of Kubernetes resources

What this solves

Prometheus Operator simplifies monitoring deployment in Kubernetes by using custom resources to manage Prometheus instances, alerting rules, and service discovery. This approach provides declarative configuration, automatic reloading, and seamless integration with Kubernetes RBAC and networking.

Step-by-step installation

Install Helm package manager

Helm is required to install the Prometheus Operator stack. If it is not already available, install it with one of the two methods below: the apt repository on Debian/Ubuntu, or the official installer script on other distributions.

# Debian/Ubuntu: install from the Helm apt repository
curl -fsSL https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt update
sudo apt install -y helm

# Other distributions: use the official installer script
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Add the Prometheus community Helm repository

Add the official repository that contains the kube-prometheus-stack chart with all required components.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create monitoring namespace

Create a dedicated namespace for the monitoring stack to isolate resources and apply specific policies.

kubectl create namespace monitoring

Create custom values configuration

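Save the following as prometheus-values.yaml. It configures persistent storage, resource limits, and scrape intervals for production use. Replace fast-ssd with a StorageClass that exists in your cluster (list them with kubectl get storageclass), and change the example Grafana admin password before deploying.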

prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi
    scrapeInterval: 30s
    evaluationInterval: 30s

grafana:
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: fast-ssd
  adminPassword: "SecureAdminPassword123!"
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 1Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 512Mi
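
To sanity-check the 50Gi request against the 30d retention, the Prometheus documentation gives a rough sizing rule: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes). The sketch below applies that rule. The function name and the example figure of 100,000 active series are assumptions for illustration, not measurements from your cluster.

```shell
#!/usr/bin/env bash
# Rough Prometheus disk estimate (sketch, not a capacity guarantee).
# Usage: estimate_storage_gib <active_series> <scrape_interval_s> <retention_days> [bytes_per_sample]
# bytes_per_sample defaults to 2, the upper end of the commonly quoted 1-2 bytes.
estimate_storage_gib() {
  local series=$1 interval=$2 days=$3 bytes_per_sample=${4:-2}
  local gib=$((1024 * 1024 * 1024))
  # samples/s = series / interval; total = samples/s * retention seconds * bytes per sample
  local total_bytes=$(( series * days * 86400 * bytes_per_sample / interval ))
  # Round up to whole GiB
  echo $(( (total_bytes + gib - 1) / gib ))
}

# Example (assumed workload): 100,000 active series, 30s scrape interval, 30 days retention
estimate_storage_gib 100000 30 30   # prints 17
```

The estimate covers stored samples only, so leave headroom for the write-ahead log and compaction when choosing the volume size.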

Install Prometheus Operator with Helm

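Deploy the complete monitoring stack, including Prometheus, Grafana, Alertmanager, and the bundled exporters (node-exporter and kube-state-metrics). The two --set flags make Prometheus pick up every ServiceMonitor and PrometheusRule in the cluster instead of only those labeled for this Helm release.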

helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false

Wait for deployment completion

Monitor the deployment progress and ensure all pods are running before proceeding with configuration.

kubectl get pods -n monitoring -w

Note: The deployment may take 3-5 minutes. Wait until all pods show Running status before continuing.
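
Instead of watching the list by eye, the pod status column can be filtered. A minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Lists pods that are not yet Running or Completed.
# Reads `kubectl get pods --no-headers` output on stdin (columns: NAME READY STATUS ...).
not_running_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage against a live cluster (needs kubectl access):
#   kubectl get pods -n monitoring --no-headers | not_running_pods
```

An empty result means every pod in the namespace has reached Running or Completed.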

Configure ServiceMonitor resources

Create application ServiceMonitor

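ServiceMonitor resources tell Prometheus which Services to scrape for metrics. This example, saved as webapp-servicemonitor.yaml, selects Services labeled app: webapp in the default and production namespaces and scrapes their port named metrics.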

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-metrics
  namespace: monitoring
  labels:
    app: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - default
    - production

kubectl apply -f webapp-servicemonitor.yaml
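
A ServiceMonitor selects Services, not pods, so the application needs a Service whose labels and port name line up with it. The sketch below writes such a Service manifest. The Service name, the pod selector, and port 8080 are hypothetical and should be replaced with your application's values.

```shell
#!/usr/bin/env bash
# Write a Service that the webapp-metrics ServiceMonitor would select.
# Hypothetical values: Service name, pod selector, and port 8080.
cat > webapp-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: default
  labels:
    app: webapp          # must match spec.selector.matchLabels of the ServiceMonitor
spec:
  selector:
    app: webapp          # pods backing the Service
  ports:
  - name: metrics        # must match endpoints[].port of the ServiceMonitor
    port: 8080
    targetPort: 8080
EOF

# Apply it once kubectl points at the right cluster:
#   kubectl apply -f webapp-service.yaml
```

If the port has no name, or the name differs from the one in the ServiceMonitor, the target does not appear in Prometheus.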

Create database ServiceMonitor

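Monitor PostgreSQL or MySQL databases through dedicated exporters that expose database-specific metrics. This example, saved as database-servicemonitor.yaml, targets a postgres-exporter Service. Because it has no namespaceSelector, it only matches Services in the ServiceMonitor's own namespace (monitoring).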

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: postgres-exporter
  namespace: monitoring
  labels:
    app: postgres-exporter
spec:
  selector:
    matchLabels:
      app: postgres-exporter
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 60s
    scrapeTimeout: 30s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: instance
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: kubernetes_namespace

kubectl apply -f database-servicemonitor.yaml

Configure ingress ServiceMonitor

Monitor NGINX ingress controller metrics to track request rates, response times, and error rates across all services.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-ingress
  namespace: monitoring
  labels:
    app: nginx-ingress
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  endpoints:
  - port: prometheus
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - ingress-nginx

kubectl apply -f ingress-servicemonitor.yaml

Set up custom metrics and alerting rules

Create application-specific PrometheusRule

PrometheusRule resources define alerting rules that trigger based on metric thresholds and conditions. This example monitors application performance.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webapp-alerts
  namespace: monitoring
  labels:
    app: webapp
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: webapp.rules
    interval: 30s
    rules:
    - alert: WebAppHighResponseTime
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{app="webapp"}[5m])) > 0.5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Web application response time is high"
        description: "95th percentile response time is {{ $value }}s for {{ $labels.instance }}"
    
    - alert: WebAppHighErrorRate
      expr: sum by (instance) (rate(http_requests_total{app="webapp",status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{app="webapp"}[5m])) > 0.1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.instance }}"
    
    - alert: WebAppPodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total{container="webapp"}[15m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Pod is crash looping"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"

kubectl apply -f webapp-alerts.yaml
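
promtool validates plain Prometheus rule files, not the PrometheusRule wrapper, so the groups list under spec has to be extracted before checking. A minimal sketch, assuming spec is a top-level key whose children are indented by two spaces, as in the manifests in this guide; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print the body of the top-level `spec:` key of a PrometheusRule manifest,
# de-indented by two spaces, so it reads as a plain Prometheus rule file.
extract_rule_groups() {
  awk '
    /^spec:/            { in_spec = 1; next }   # start after the spec line
    in_spec && /^[^ ]/  { in_spec = 0 }         # next top-level key ends the block
    in_spec             { sub(/^  /, ""); print }
  ' "$1"
}

# Usage (promtool ships with Prometheus):
#   extract_rule_groups webapp-alerts.yaml > webapp.rules.yaml
#   promtool check rules webapp.rules.yaml
```

Running the check before kubectl apply catches expression and syntax errors that would otherwise only show up in the operator logs.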

Configure infrastructure alerting rules

Monitor cluster-wide metrics including node resources, storage utilization, and system components health.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: infrastructure.rules
    interval: 60s
    rules:
    - alert: NodeHighCPUUsage
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node CPU usage is high"
        description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
    
    - alert: NodeHighMemoryUsage
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node memory usage is high"
        description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
    
    - alert: PersistentVolumeUsageHigh
      expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Persistent volume usage is high"
        description: "Volume {{ $labels.persistentvolumeclaim }} usage is {{ $value }}%"

kubectl apply -f infrastructure-alerts.yaml

Configure Alertmanager routing

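Set up alert routing and notification channels so that critical alerts reach the right teams. The operator reads the configuration from the alertmanager.yaml key of the Secret below. Because this Secret belongs to the Helm release, a later helm upgrade can overwrite manual changes; for a durable setup, put the same configuration under alertmanager.config in the values file.

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-stack-kube-prom-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |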
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alerts@example.com'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'web.hook'
      routes:
      - match:
          severity: critical
        receiver: 'critical-alerts'
      - match:
          severity: warning
        receiver: 'warning-alerts'
    
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://example.com/webhook'
    
    - name: 'critical-alerts'
      email_configs:
      - to: 'oncall@example.com'
        subject: 'CRITICAL: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Critical Alert'
    
    - name: 'warning-alerts'
      email_configs:
      - to: 'team@example.com'
        subject: 'WARNING: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

kubectl apply -f alertmanager-config.yaml
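
Routing can be checked offline with amtool, which ships with Alertmanager, before the Secret is applied. The sketch below wraps its routes test command. It assumes the routing configuration (the content under the Secret's data key) has also been saved on its own as alertmanager.yaml; that file name and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# Ask amtool which receiver a label set would be routed to.
# AMTOOL can be overridden (for example AMTOOL=echo to print the command only).
route_for() {
  local config=$1; shift
  "${AMTOOL:-amtool}" config routes test --config.file="$config" "$@"
}

# Usage, assuming the routing config is saved as alertmanager.yaml:
#   amtool check-config alertmanager.yaml
#   route_for alertmanager.yaml severity=critical
#   route_for alertmanager.yaml severity=info
```

With the routing tree above, a severity=critical label set should resolve to critical-alerts, and a label set that matches no child route should fall through to the default web.hook receiver.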

Deploy Grafana dashboards for cluster monitoring

Access Grafana interface

Create a port-forward to access Grafana and configure dashboards for cluster monitoring and application metrics visualization.

kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80

Note: Access Grafana at http://localhost:3000 with username 'admin' and the password you set in the values file.

Create custom application dashboard

Import or create custom dashboards that visualize your application metrics, request rates, and performance indicators.

{
  "dashboard": {
    "id": null,
    "title": "Web Application Metrics",
    "tags": ["webapp", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{app=\"webapp\"}[5m])",
            "legendFormat": "{{ instance }} - {{ method }}"
          }
        ],
        "yAxes": [
          {
            "label": "Requests/sec"
          }
        ]
      },
      {
        "title": "Response Time (95th percentile)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{app=\"webapp\"}[5m]))",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yAxes": [
          {
            "label": "Seconds"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"webapp\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{app=\"webapp\"}[5m])) * 100"
          }
        ],
        "format": "percent"
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}

Configure dashboard provisioning

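Set up automatic dashboard provisioning with ConfigMaps to deploy dashboards consistently across environments. The Grafana sidecar installed by the chart loads any ConfigMap that carries the grafana_dashboard label.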

apiVersion: v1
kind: ConfigMap
metadata:
  name: webapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  webapp-dashboard.json: |
    {
      "id": null,
      "title": "Web Application Dashboard",
      "panels": [
        {
          "title": "Pod CPU Usage",
          "type": "graph",
          "targets": [
            {
              "expr": "rate(container_cpu_usage_seconds_total{pod=~\"webapp-.*\"}[5m]) * 100",
              "legendFormat": "{{ pod }}"
            }
          ]
        },
        {
          "title": "Pod Memory Usage",
          "type": "graph",
          "targets": [
            {
              "expr": "container_memory_working_set_bytes{pod=~\"webapp-.*\"} / 1024 / 1024",
              "legendFormat": "{{ pod }}"
            }
          ]
        }
      ]
    }

kubectl apply -f dashboard-configmap.yaml
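
Embedding JSON in YAML by hand is easy to get wrong, so it can help to keep each dashboard in its own .json file and generate the ConfigMap from it. A minimal sketch; the function name and the file name in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Wrap a dashboard JSON file in a ConfigMap manifest carrying the
# grafana_dashboard label. Prints the manifest on stdout.
# Usage: dashboard_configmap <configmap-name> <dashboard.json> [namespace]
dashboard_configmap() {
  local name=$1 file=$2 namespace=${3:-monitoring}
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: $name
  namespace: $namespace
  labels:
    grafana_dashboard: "1"
data:
  $(basename "$file"): |
EOF
  # Indent every JSON line by four spaces so it sits inside the block scalar
  sed 's/^/    /' "$file"
}

# Usage:
#   dashboard_configmap webapp-dashboard webapp-dashboard.json | kubectl apply -f -
```

The same dashboard file can then be imported through the Grafana UI in one environment and provisioned through a ConfigMap in another.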

Verify your setup

Confirm that all monitoring components are operational and collecting metrics from your cluster.

# Check Prometheus Operator pods
kubectl get pods -n monitoring

# Verify ServiceMonitor discovery
kubectl get servicemonitors -n monitoring

# Check PrometheusRule status
kubectl get prometheusrules -n monitoring

# Access Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-prometheus 9090:9090

# Access Alertmanager UI
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-alertmanager 9093:9093

# Test metrics endpoint
curl http://localhost:9090/api/v1/targets
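
The targets endpoint returns a large JSON document, so a short summary of target health is easier to read. A minimal sketch that avoids a JSON parser; it assumes the compact "health":"up" form that the API returns, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Summarize target health from the /api/v1/targets JSON read on stdin.
# Uses grep instead of a JSON parser, so it depends on the compact
# "health":"<state>" form of the API response.
targets_by_health() {
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4 | sort | uniq -c
}

# Usage with the Prometheus port-forward running in another terminal:
#   curl -s http://localhost:9090/api/v1/targets | targets_by_health
```

Any targets counted as down are worth inspecting on the Targets page of the Prometheus UI, which shows the scrape error for each one.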

Common issues

Symptom	Cause	Fix
ServiceMonitor not discovered	Label selectors don't match	Check `kubectl get servicemonitors -n monitoring -o yaml` and verify that the selector labels match the Service labels
Metrics not scraped	Service endpoint not accessible	Verify the Service exists and exposes the named port: `kubectl get svc -l app=your-app -o yaml`
Prometheus rules not loading	Syntax errors in PrometheusRule	Copy the `groups:` block from under `spec:` into a plain rule file and validate it with `promtool check rules your-rules.yaml`
Grafana dashboards empty	Data source not configured	Check Prometheus data source URL in Grafana settings
Persistent volumes failing	StorageClass not available	List classes with `kubectl get storageclass`, then create `fast-ssd` or set `storageClassName` to an existing class
Alertmanager not receiving alerts	Alert routing configuration	Check the Alertmanager logs: `kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager -c alertmanager`

Automated install script

Run this to automate the entire setup

install.sh

#!/usr/bin/env bash
set -euo pipefail

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

# Default values
NAMESPACE="${1:-monitoring}"
RETENTION="${2:-30d}"
STORAGE_SIZE="${3:-50Gi}"
GRAFANA_PASSWORD="${4:-SecureAdminPassword123!}"

usage() {
    echo "Usage: $0 [namespace] [retention] [storage_size] [grafana_password]"
    echo "  namespace: Kubernetes namespace (default: monitoring)"
    echo "  retention: Prometheus retention period (default: 30d)"
    echo "  storage_size: Prometheus storage size (default: 50Gi)"
    echo "  grafana_password: Grafana admin password (default: SecureAdminPassword123!)"
    exit 1
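}

# Show usage when help is requested
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
    usage
fi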

log_info() {
    echo -e "${GREEN}[INFO]${NC} $1"
}

log_warn() {
    echo -e "${YELLOW}[WARN]${NC} $1"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}

cleanup() {
    log_error "Installation failed! Cleaning up..."
    helm uninstall prometheus-stack --namespace "$NAMESPACE" 2>/dev/null || true
    # Caution: this deletes the namespace even if it existed before the script ran
    kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
    exit 1
}

trap cleanup ERR

# Detect distribution
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian) 
            PKG_MGR="apt"
            PKG_INSTALL="apt install -y"
            PKG_UPDATE="apt update"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora) 
            PKG_MGR="dnf"
            PKG_INSTALL="dnf install -y"
            PKG_UPDATE="dnf makecache"
            ;;
        amzn) 
            PKG_MGR="yum"
            PKG_INSTALL="yum install -y"
            PKG_UPDATE="yum makecache"
            ;;
        *) 
            log_error "Unsupported distribution: $ID"
            exit 1
            ;;
    esac
else
    log_error "Cannot detect OS distribution"
    exit 1
fi

# Check prerequisites
echo "[1/8] Checking prerequisites..."
if [ "$EUID" -eq 0 ]; then
    log_warn "Running as root. Consider using a non-root user with sudo access."
fi

if ! command -v kubectl &> /dev/null; then
    log_error "kubectl is required but not installed"
    exit 1
fi

if ! kubectl cluster-info &> /dev/null; then
    log_error "Cannot connect to Kubernetes cluster"
    exit 1
fi

# Install Helm
echo "[2/8] Installing Helm package manager..."
if ! command -v helm &> /dev/null; then
    case "$PKG_MGR" in
        apt)
            curl -fsSL https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
            echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
            sudo $PKG_UPDATE
            sudo $PKG_INSTALL helm
            ;;
        dnf|yum)
            curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
            ;;
    esac
    log_info "Helm installed successfully"
else
    log_info "Helm already installed"
fi

# Add Helm repository
echo "[3/8] Adding Prometheus community Helm repository..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
log_info "Helm repository added and updated"

# Create namespace
echo "[4/8] Creating monitoring namespace..."
if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
    kubectl create namespace "$NAMESPACE"
    log_info "Namespace $NAMESPACE created"
else
    log_info "Namespace $NAMESPACE already exists"
fi

# Create values file
echo "[5/8] Creating custom values configuration..."
cat > /tmp/prometheus-values.yaml << EOF
prometheus:
  prometheusSpec:
    retention: $RETENTION
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: $STORAGE_SIZE
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi
    scrapeInterval: 30s
    evaluationInterval: 30s

grafana:
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "$GRAFANA_PASSWORD"
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 1Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 512Mi
EOF
# The values file contains the Grafana admin password, so keep it private
chmod 600 /tmp/prometheus-values.yaml
log_info "Values configuration created"

# Install Prometheus Operator
echo "[6/8] Installing Prometheus Operator with Helm..."
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace "$NAMESPACE" \
  --values /tmp/prometheus-values.yaml \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false \
  --wait --timeout=10m

log_info "Prometheus stack installed successfully"

# Wait for deployment
echo "[7/8] Waiting for deployment completion..."
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=prometheus" -n "$NAMESPACE" --timeout=300s
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=grafana" -n "$NAMESPACE" --timeout=300s
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=alertmanager" -n "$NAMESPACE" --timeout=300s
log_info "All pods are running"

# Create sample ServiceMonitor
echo "[8/8] Creating sample ServiceMonitor..."
cat > /tmp/webapp-servicemonitor.yaml << EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-metrics
  namespace: $NAMESPACE
  labels:
    app: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - default
    - production
EOF
kubectl apply -f /tmp/webapp-servicemonitor.yaml
chmod 644 /tmp/webapp-servicemonitor.yaml
log_info "Sample ServiceMonitor created"

# Verification
echo ""
log_info "=== Installation Complete ==="
echo "Namespace: $NAMESPACE"
echo "Retention: $RETENTION"
echo "Storage Size: $STORAGE_SIZE"
echo ""
echo "Access Grafana:"
echo "kubectl port-forward -n $NAMESPACE svc/prometheus-stack-grafana 3000:80"
echo "Then visit: http://localhost:3000"
echo "Username: admin"
echo "Password: $GRAFANA_PASSWORD"
echo ""
echo "Access Prometheus:"
echo "kubectl port-forward -n $NAMESPACE svc/prometheus-stack-kube-prom-prometheus 9090:9090"
echo "Then visit: http://localhost:9090"
echo ""
echo "Running pods:"
kubectl get pods -n "$NAMESPACE"

# Cleanup temp files
rm -f /tmp/prometheus-values.yaml /tmp/webapp-servicemonitor.yaml

log_info "Prometheus Operator monitoring stack is ready!"

Review the script before running. Execute with: bash install.sh
