Deploy a production-grade monitoring stack with Prometheus Operator, configure ServiceMonitor resources for automatic scraping, and create custom alerting rules with Grafana dashboards for comprehensive Kubernetes cluster observability.
Prerequisites
- Kubernetes cluster with admin access
- kubectl configured
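- 70 GiB+ of available persistent storage (the values below request 50Gi for Prometheus and 10Gi each for Grafana and Alertmanager)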
- Basic understanding of Kubernetes resources
What this solves
Prometheus Operator simplifies monitoring deployment in Kubernetes by using custom resources to manage Prometheus instances, alerting rules, and service discovery. This approach provides declarative configuration, automatic reloading, and seamless integration with Kubernetes RBAC and networking.
Step-by-step installation
Install Helm package manager
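Helm is required to install the Prometheus Operator stack. If it is not already installed, the commands below add Helm's apt repository on Debian or Ubuntu; on other systems, follow the Helm installation instructions for your platform.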
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt update
sudo apt install -y helm
Add the Prometheus community Helm repository
Add the official repository that contains the kube-prometheus-stack chart with all required components.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Create monitoring namespace
Create a dedicated namespace for the monitoring stack to isolate resources and apply specific policies.
kubectl create namespace monitoring
Create custom values configuration
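Save the following as prometheus-values.yaml. It configures the stack with persistent storage, resource limits, and custom scrape intervals for production use. The fast-ssd StorageClass is an example name; replace it with a StorageClass that exists in your cluster (check with kubectl get storageclass), and choose your own Grafana admin password.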
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi
    scrapeInterval: 30s
    evaluationInterval: 30s
grafana:
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: fast-ssd
  adminPassword: "SecureAdminPassword123!"
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 1Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 512Mi
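To sanity-check the 50Gi request against the 30d retention, Prometheus's documented rule of thumb is: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes). The sketch below applies that formula. The helper name estimate_tsdb_bytes, the 10,000 samples/s ingestion rate, and the 2 bytes per sample are assumptions for illustration, not measurements from this stack.

```shell
#!/usr/bin/env bash
# Rough Prometheus disk estimate:
#   bytes = retention_seconds * samples_per_second * bytes_per_sample
# All three inputs are assumptions; replace them with values measured on your cluster.
estimate_tsdb_bytes() {
  local retention_days=$1 samples_per_sec=$2 bytes_per_sample=$3
  echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}

# Example: 30 days of retention, 10,000 samples/s, 2 bytes per sample.
bytes=$(estimate_tsdb_bytes 30 10000 2)
echo "estimated bytes: ${bytes}"
echo "estimated GiB:   $(( bytes / 1024 / 1024 / 1024 ))"
```

With these inputs the estimate is about 48 GiB, which leaves little headroom in a 50Gi volume. On a running Prometheus, rate(prometheus_tsdb_head_samples_appended_total[1h]) gives the actual ingestion rate to plug in.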
Install Prometheus Operator with Helm
Deploy the complete monitoring stack including Prometheus, Grafana, Alertmanager, and various exporters for comprehensive cluster monitoring.
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false
Wait for deployment completion
Monitor the deployment progress and ensure all pods are running before proceeding with configuration.
kubectl get pods -n monitoring -w
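Watching pods works interactively, but a script needs a command that exits. A minimal sketch using kubectl wait follows; the function name wait_for_stack is hypothetical, and the label values are the ones the automated install script in this guide uses. Note that kubectl wait fails immediately if no pods match the selector yet, so it may need a short delay or retry right after helm install.

```shell
#!/usr/bin/env bash
# Block until Prometheus, Grafana, and Alertmanager pods report Ready,
# or return non-zero as soon as one of them times out.
wait_for_stack() {
  local namespace=${1:-monitoring} timeout=${2:-300s} name
  for name in prometheus grafana alertmanager; do
    kubectl wait --for=condition=ready pod \
      -l "app.kubernetes.io/name=${name}" \
      -n "${namespace}" --timeout="${timeout}" || return 1
  done
}

# Usage (requires access to the cluster):
#   wait_for_stack monitoring 300s && echo "stack is ready"
```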
Configure ServiceMonitor resources
Create application ServiceMonitor
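ServiceMonitor resources tell Prometheus which Services to scrape for metrics. This example monitors a custom web application: it selects Services labeled app: webapp in the default and production namespaces and scrapes the Service port named metrics. Save it as webapp-servicemonitor.yaml.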
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-metrics
  namespace: monitoring
  labels:
    app: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - default
      - production
kubectl apply -f webapp-servicemonitor.yaml
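The ServiceMonitor only produces scrape targets if a Service carries the app: webapp label and exposes a port whose name is metrics. As an illustration, the sketch below prints such a Service manifest. The helper name render_webapp_service, the Service name, the pod selector, and port 8080 are all assumptions; adjust them to your application.

```shell
#!/usr/bin/env bash
# Print a Service manifest that the webapp-metrics ServiceMonitor would select.
# Port 8080 and the pod selector label are hypothetical placeholders.
render_webapp_service() {
  local namespace=${1:-default}
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: ${namespace}
  labels:
    app: webapp          # matched by the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app: webapp          # pods backing the Service (assumed label)
  ports:
    - name: metrics      # matched by the ServiceMonitor's endpoints[].port
      port: 8080
      targetPort: 8080
EOF
}

render_webapp_service default
# To create it in the cluster:
#   render_webapp_service default | kubectl apply -f -
```

The port is matched by name rather than by number, so renaming the Service port without updating the ServiceMonitor silently removes the target.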
Create database ServiceMonitor
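Monitor PostgreSQL or MySQL databases using dedicated exporters that expose database-specific metrics. This example targets a PostgreSQL exporter Service labeled app: postgres-exporter and relabels each target with its pod name and namespace. Save it as database-servicemonitor.yaml.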
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: postgres-exporter
  namespace: monitoring
  labels:
    app: postgres-exporter
spec:
  selector:
    matchLabels:
      app: postgres-exporter
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 60s
      scrapeTimeout: 30s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: instance
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: kubernetes_namespace
kubectl apply -f database-servicemonitor.yaml
Configure ingress ServiceMonitor
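Monitor NGINX ingress controller metrics to track request rates, response times, and error rates across all services. The port name in this example must match the name of the metrics port on your ingress controller's Service, which depends on how the controller was installed. Save it as ingress-servicemonitor.yaml.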
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-ingress
  namespace: monitoring
  labels:
    app: nginx-ingress
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  endpoints:
    - port: prometheus
      path: /metrics
      interval: 30s
  namespaceSelector:
    matchNames:
      - ingress-nginx
kubectl apply -f ingress-servicemonitor.yaml
Set up custom metrics and alerting rules
Create application-specific PrometheusRule
PrometheusRule resources define alerting rules that trigger based on metric thresholds and conditions. This example monitors application performance.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webapp-alerts
  namespace: monitoring
  labels:
    app: webapp
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: webapp.rules
      interval: 30s
      rules:
        - alert: WebAppHighResponseTime
          expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{app="webapp"}[5m])) > 0.5
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Web application response time is high"
            description: "95th percentile response time is {{ $value }}s for {{ $labels.instance }}"
        - alert: WebAppHighErrorRate
          # Aggregate both sides by instance so the status label does not prevent the division from matching
          expr: sum by (instance) (rate(http_requests_total{app="webapp",status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{app="webapp"}[5m])) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.instance }}"
        - alert: WebAppPodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total{container="webapp"}[15m]) > 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod is crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"
kubectl apply -f webapp-alerts.yaml
Configure infrastructure alerting rules
Monitor cluster-wide metrics including node resources, storage utilization, and system components health.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: infrastructure.rules
      interval: 60s
      rules:
        - alert: NodeHighCPUUsage
          expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node CPU usage is high"
            description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
        - alert: NodeHighMemoryUsage
          expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node memory usage is high"
            description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
        - alert: PersistentVolumeUsageHigh
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Persistent volume usage is high"
            description: "Volume {{ $labels.persistentvolumeclaim }} usage is {{ $value }}%"
kubectl apply -f infrastructure-alerts.yaml
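Applying a PrometheusRule does not guarantee Prometheus loaded it, so it helps to ask Prometheus directly through its /api/v1/rules endpoint. The sketch below checks that response for a rule name. The helper name rule_loaded is hypothetical, and it assumes Prometheus is reachable on localhost:9090 (for example via kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-prometheus 9090:9090).

```shell
#!/usr/bin/env bash
# Succeed if the JSON on stdin (a /api/v1/rules response) contains the given name.
# Plain grep avoids a jq dependency; this is a text match, not a real JSON parse,
# so it also matches group names and treats regex characters in the name as such.
rule_loaded() {
  local rule_name=$1
  grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"${rule_name}\""
}

# Usage, assuming Prometheus is port-forwarded to localhost:9090:
#   curl -s http://localhost:9090/api/v1/rules | rule_loaded WebAppHighErrorRate \
#     && echo "loaded" || echo "missing"
```

New rules can take a short while to appear, because the operator first regenerates the rule files and then triggers a reload.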
Configure Alertmanager routing
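Set up alert routing and notification channels so that critical alerts reach the right teams. Save the following as alertmanager-config.yaml. It replaces the Secret that the Helm chart generated for Alertmanager, so a later helm upgrade may overwrite it unless the same configuration is also set under alertmanager.config in your values file.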
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-stack-kube-prom-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  # The Prometheus Operator reads the configuration from the key alertmanager.yaml
  alertmanager.yaml: |
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alerts@example.com'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'web.hook'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
        - match:
            severity: warning
          receiver: 'warning-alerts'
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://example.com/webhook'
      - name: 'critical-alerts'
        email_configs:
          - to: 'oncall@example.com'
            # The subject is set through headers; the plain-text body through text
            headers:
              Subject: 'CRITICAL: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
            text: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              {{ end }}
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts'
            title: 'Critical Alert'
      - name: 'warning-alerts'
        email_configs:
          - to: 'team@example.com'
            headers:
              Subject: 'WARNING: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
kubectl apply -f alertmanager-config.yaml
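One way to exercise the routing tree without waiting for a real incident is to post a synthetic alert to Alertmanager's v2 API and see which receiver fires. The sketch below builds such a payload. The helper name build_test_alert and the alert name RoutingTest are made up for illustration, and sending it assumes a port-forward to Alertmanager on localhost:9093.

```shell
#!/usr/bin/env bash
# Build a one-element alert list in the shape accepted by Alertmanager's
# POST /api/v2/alerts endpoint. Label values are synthetic test data.
build_test_alert() {
  local alertname=$1 severity=$2
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Routing test"}}]\n' \
    "$alertname" "$severity"
}

build_test_alert RoutingTest critical
# To send it, assuming Alertmanager is port-forwarded to localhost:9093:
#   build_test_alert RoutingTest critical |
#     curl -s -X POST -H 'Content-Type: application/json' \
#       --data @- http://localhost:9093/api/v2/alerts
```

Sending the same payload with severity warning should reach the warning-alerts receiver instead, which makes it easy to confirm both routes.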
Deploy Grafana dashboards for cluster monitoring
Access Grafana interface
Create a port-forward to access Grafana and configure dashboards for cluster monitoring and application metrics visualization.
kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80
Create custom application dashboard
Create custom dashboards that visualize your application's request rate, latency, and error rate. The JSON below uses the {"dashboard": ...} envelope accepted by Grafana's HTTP API (POST /api/dashboards/db).
{
  "dashboard": {
    "id": null,
    "title": "Web Application Metrics",
    "tags": ["webapp", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{app=\"webapp\"}[5m])",
            "legendFormat": "{{ instance }} - {{ method }}"
          }
        ],
        "yAxes": [
          {
            "label": "Requests/sec"
          }
        ]
      },
      {
        "title": "Response Time (95th percentile)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{app=\"webapp\"}[5m]))",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yAxes": [
          {
            "label": "Seconds"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"webapp\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{app=\"webapp\"}[5m])) * 100"
          }
        ],
        "format": "percent"
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
Configure dashboard provisioning
Set up automatic dashboard provisioning with ConfigMaps so dashboards deploy consistently across environments. The Grafana sidecar shipped with kube-prometheus-stack loads ConfigMaps labeled grafana_dashboard; each file must contain the bare dashboard object, without the "dashboard" envelope used by the HTTP API. Save this as dashboard-configmap.yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: webapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  webapp-dashboard.json: |
    {
      "id": null,
      "title": "Web Application Dashboard",
      "panels": [
        {
          "title": "Pod CPU Usage",
          "type": "graph",
          "targets": [
            {
              "expr": "rate(container_cpu_usage_seconds_total{pod=~\"webapp-.*\"}[5m]) * 100",
              "legendFormat": "{{ pod }}"
            }
          ]
        },
        {
          "title": "Pod Memory Usage",
          "type": "graph",
          "targets": [
            {
              "expr": "container_memory_working_set_bytes{pod=~\"webapp-.*\"} / 1024 / 1024",
              "legendFormat": "{{ pod }}"
            }
          ]
        }
      ]
    }
kubectl apply -f dashboard-configmap.yaml
Verify your setup
Confirm that all monitoring components are operational and collecting metrics from your cluster.
# Check Prometheus Operator pods
kubectl get pods -n monitoring
# Verify ServiceMonitor discovery
kubectl get servicemonitors -n monitoring
# Check PrometheusRule status
kubectl get prometheusrules -n monitoring
# Access Prometheus UI (port-forward blocks; run each one in its own terminal)
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-prometheus 9090:9090
# Access Alertmanager UI
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-alertmanager 9093:9093
# Test metrics endpoint (requires the Prometheus port-forward above to be running)
curl http://localhost:9090/api/v1/targets
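The raw /api/v1/targets response is long, so a quick count of healthy and unhealthy targets is often easier to read. The sketch below tallies targets by health value. The helper name count_targets is hypothetical, and it assumes the compact JSON that Prometheus returns on this endpoint.

```shell
#!/usr/bin/env bash
# Count targets with the given health value ("up", "down", or "unknown") in a
# /api/v1/targets response read from stdin. Text match, not a real JSON parse.
count_targets() {
  local health=$1
  grep -o "\"health\":\"${health}\"" | wc -l | tr -d ' '
}

# Usage, assuming Prometheus is port-forwarded to localhost:9090:
#   curl -s http://localhost:9090/api/v1/targets | count_targets down
```

A non-zero count for down is the cue to open the Targets page in the Prometheus UI and read the scrape error for each failing endpoint.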
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| ServiceMonitor not discovered | ServiceMonitor selector does not match the Service's labels, namespace, or port name | Run `kubectl get servicemonitors -n monitoring -o yaml` and compare `spec.selector` with the Service labels |
| Metrics not scraped | Service endpoint not accessible | Verify the Service exists: `kubectl get svc -l app=your-app` |
| Prometheus rules not loading | Syntax errors in PrometheusRule | Copy the `groups:` list from `spec` into a plain rules file and validate it with `promtool check rules your-rules.yaml` |
| Grafana dashboards empty | Data source not configured | Check the Prometheus data source URL in Grafana settings |
| Persistent volumes failing | StorageClass not available | List available classes with `kubectl get storageclass`, then create the missing StorageClass or use an existing one |
| Alertmanager not receiving alerts | Alert routing configuration | Check the Alertmanager logs: `kubectl logs -n monitoring alertmanager-prometheus-stack-kube-prom-alertmanager-0 -c alertmanager` |
Next steps
- Configure advanced Grafana dashboards and alerting with Prometheus integration for custom visualization
- Set up Alertmanager with email and Slack notifications for monitoring alerts
- Implement Istio observability with Jaeger tracing and Kiali dashboard for Kubernetes service mesh
- Configure Kubernetes network policies for enhanced cluster security
- Set up Kubernetes Ingress NGINX with cert-manager for automated SSL certificates
Automated install script
Run this to automate the entire setup
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Default values
NAMESPACE="${1:-monitoring}"
RETENTION="${2:-30d}"
STORAGE_SIZE="${3:-50Gi}"
GRAFANA_PASSWORD="${4:-SecureAdminPassword123!}"
usage() {
    echo "Usage: $0 [namespace] [retention] [storage_size] [grafana_password]"
    echo " namespace: Kubernetes namespace (default: monitoring)"
    echo " retention: Prometheus retention period (default: 30d)"
    echo " storage_size: Prometheus storage size (default: 50Gi)"
    echo " grafana_password: Grafana admin password (default: SecureAdminPassword123!)"
    exit 1
}
log_info() {
    echo -e "${GREEN}[INFO]${NC} $1"
}
log_warn() {
    echo -e "${YELLOW}[WARN]${NC} $1"
}
log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}
cleanup() {
    log_error "Installation failed! Cleaning up..."
    helm uninstall prometheus-stack --namespace "$NAMESPACE" 2>/dev/null || true
    # Delete the namespace only if this script created it, so a pre-existing namespace is never removed
    if [ "${CREATED_NAMESPACE:-false}" = "true" ]; then
        kubectl delete namespace "$NAMESPACE" 2>/dev/null || true
    fi
    exit 1
}
trap cleanup ERR
# Detect distribution
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_INSTALL="apt install -y"
            PKG_UPDATE="apt update"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora)
            PKG_MGR="dnf"
            PKG_INSTALL="dnf install -y"
            PKG_UPDATE="dnf makecache"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_INSTALL="yum install -y"
            PKG_UPDATE="yum makecache"
            ;;
        *)
            log_error "Unsupported distribution: $ID"
            exit 1
            ;;
    esac
else
    log_error "Cannot detect OS distribution"
    exit 1
fi
# Check prerequisites
echo "[1/8] Checking prerequisites..."
if [ "$EUID" -eq 0 ]; then
    log_warn "Running as root. Consider using a non-root user with sudo access."
fi
if ! command -v kubectl &> /dev/null; then
    log_error "kubectl is required but not installed"
    exit 1
fi
if ! kubectl cluster-info &> /dev/null; then
    log_error "Cannot connect to Kubernetes cluster"
    exit 1
fi
# Install Helm
echo "[2/8] Installing Helm package manager..."
if ! command -v helm &> /dev/null; then
    case "$PKG_MGR" in
        apt)
            curl -fsSL https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
            echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
            sudo $PKG_UPDATE
            sudo $PKG_INSTALL helm
            ;;
        dnf|yum)
            curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
            ;;
    esac
    log_info "Helm installed successfully"
else
    log_info "Helm already installed"
fi
# Add Helm repository
echo "[3/8] Adding Prometheus community Helm repository..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
log_info "Helm repository added and updated"
# Create namespace
echo "[4/8] Creating monitoring namespace..."
if ! kubectl get namespace "$NAMESPACE" &> /dev/null; then
    kubectl create namespace "$NAMESPACE"
    # Record that this run created the namespace
    CREATED_NAMESPACE=true
    log_info "Namespace $NAMESPACE created"
else
    log_info "Namespace $NAMESPACE already exists"
fi
# Create values file
echo "[5/8] Creating custom values configuration..."
cat > /tmp/prometheus-values.yaml << EOF
prometheus:
  prometheusSpec:
    retention: $RETENTION
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: $STORAGE_SIZE
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi
    scrapeInterval: 30s
    evaluationInterval: 30s
grafana:
  persistence:
    enabled: true
    size: 10Gi
  adminPassword: "$GRAFANA_PASSWORD"
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 1Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 512Mi
EOF
# The file contains the Grafana admin password, so restrict it to the current user
chmod 600 /tmp/prometheus-values.yaml
log_info "Values configuration created"
# Install Prometheus Operator
echo "[6/8] Installing Prometheus Operator with Helm..."
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace "$NAMESPACE" \
    --values /tmp/prometheus-values.yaml \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
    --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false \
    --wait --timeout=10m
log_info "Prometheus stack installed successfully"
# Wait for deployment
echo "[7/8] Waiting for deployment completion..."
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=prometheus" -n "$NAMESPACE" --timeout=300s
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=grafana" -n "$NAMESPACE" --timeout=300s
kubectl wait --for=condition=ready pod -l "app.kubernetes.io/name=alertmanager" -n "$NAMESPACE" --timeout=300s
log_info "All pods are running"
# Create sample ServiceMonitor
echo "[8/8] Creating sample ServiceMonitor..."
cat > /tmp/webapp-servicemonitor.yaml << EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-metrics
  namespace: $NAMESPACE
  labels:
    app: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - default
      - production
EOF
kubectl apply -f /tmp/webapp-servicemonitor.yaml
chmod 644 /tmp/webapp-servicemonitor.yaml
log_info "Sample ServiceMonitor created"
# Verification
echo ""
log_info "=== Installation Complete ==="
echo "Namespace: $NAMESPACE"
echo "Retention: $RETENTION"
echo "Storage Size: $STORAGE_SIZE"
echo ""
echo "Access Grafana:"
echo "kubectl port-forward -n $NAMESPACE svc/prometheus-stack-grafana 3000:80"
echo "Then visit: http://localhost:3000"
echo "Username: admin"
echo "Password: $GRAFANA_PASSWORD"
echo ""
echo "Access Prometheus:"
echo "kubectl port-forward -n $NAMESPACE svc/prometheus-stack-kube-prom-prometheus 9090:9090"
echo "Then visit: http://localhost:9090"
echo ""
echo "Running pods:"
kubectl get pods -n "$NAMESPACE"
# Cleanup temp files
rm -f /tmp/prometheus-values.yaml /tmp/webapp-servicemonitor.yaml
log_info "Prometheus Operator monitoring stack is ready!"
Review the script before running. Execute with: bash install.sh