Set up Kubernetes monitoring with Prometheus Operator and custom metrics

Intermediate | 45 min | May 21, 2026
Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Deploy a production-grade monitoring stack with Prometheus Operator, configure ServiceMonitor resources for automatic scraping, and create custom alerting rules with Grafana dashboards for comprehensive Kubernetes cluster observability.

Prerequisites

  • Kubernetes cluster with admin access
  • kubectl configured
  • 70 GB+ of available persistent storage (50Gi for Prometheus, 10Gi each for Grafana and Alertmanager)
  • Basic understanding of Kubernetes resources

What this solves

Prometheus Operator simplifies monitoring deployment in Kubernetes by using custom resources to manage Prometheus instances, alerting rules, and service discovery. This approach provides declarative configuration, automatic reloading, and seamless integration with Kubernetes RBAC and networking.

Step-by-step installation

Install Helm package manager

Helm is required to install the Prometheus Operator stack. Install it on your system if not already available.

# Ubuntu / Debian: install from the Helm apt repository
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt update
sudo apt install -y helm

# Other distributions (for example AlmaLinux, Rocky Linux): use the installer script instead
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
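
Before running either installer, it can help to check whether Helm is already on the PATH. The following is a minimal sketch; the helper name have_cmd is hypothetical and not part of Helm.

```shell
# have_cmd NAME: succeed if NAME resolves to a command available in this shell.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd helm; then
  # Print the installed client version to confirm which Helm release is in use.
  helm version --short
else
  echo "helm not found: follow the installation commands above"
fi
```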

Add the Prometheus community Helm repository

Add the official repository that contains the kube-prometheus-stack chart with all required components.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create monitoring namespace

Create a dedicated namespace for the monitoring stack to isolate resources and apply specific policies.

kubectl create namespace monitoring

Create custom values configuration

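Save the following as prometheus-values.yaml. It configures persistent storage, resource limits, a 30-day retention period, and 30-second scrape and evaluation intervals for production use. Replace fast-ssd with a StorageClass that exists in your cluster (list them with kubectl get storageclass).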

prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 8Gi
    scrapeInterval: 30s
    evaluationInterval: 30s

grafana:
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: fast-ssd
  adminPassword: "SecureAdminPassword123!"  # example value; replace before deploying
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 1Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 512Mi
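
Whether 50Gi is enough for 30 days of retention depends on how many samples the cluster produces. The Prometheus storage documentation gives a rough rule: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that rule; the function name and the ingestion rate of 10,000 samples per second are assumptions chosen for illustration.

```shell
# estimate_tsdb_gib RETENTION_DAYS SAMPLES_PER_SECOND BYTES_PER_SAMPLE
# Prints the estimated TSDB size in whole GiB (rounded down).
estimate_tsdb_gib() {
  retention_seconds=$(( $1 * 86400 ))
  bytes=$(( retention_seconds * $2 * $3 ))
  echo $(( bytes / 1024 / 1024 / 1024 ))
}

# Assumed example: 30 days of retention, 10,000 samples/s, 2 bytes per sample.
estimate_tsdb_gib 30 10000 2
```

With these assumed inputs the function prints 48, just under the 50Gi requested above. On a running Prometheus, the actual ingestion rate can be read with the query rate(prometheus_tsdb_head_samples_appended_total[5m]).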

Install Prometheus Operator with Helm

Deploy the complete monitoring stack including Prometheus, Grafana, Alertmanager, and various exporters for comprehensive cluster monitoring.

helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false
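
helm install fails if a release with the same name already exists. For repeatable runs, one option is helm upgrade --install, which installs the release on the first run and upgrades it afterwards. The sketch below reuses the release name, chart, and settings from the command above; the function name, --wait, and the 10-minute timeout are choices made for this example.

```shell
# install_stack: install the release, or upgrade it in place if it already exists.
install_stack() {
  helm upgrade --install prometheus-stack prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --values prometheus-values.yaml \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
    --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false \
    --wait \
    --timeout 10m
}

# Usage: run from the directory that contains prometheus-values.yaml.
#   install_stack
```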

Wait for deployment completion

Monitor the deployment progress and ensure all pods are running before proceeding with configuration.

kubectl get pods -n monitoring -w
Note: The deployment may take 3-5 minutes. Wait until all pods show Running status before continuing.
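
Watching pods with -w relies on someone deciding when to stop. In a script, a possible alternative is to block until every Deployment, StatefulSet, and DaemonSet in the namespace has finished rolling out. This is a sketch; the function name and the default timeout of 300s are assumptions.

```shell
# wait_for_rollouts NAMESPACE [TIMEOUT]
# Runs `kubectl rollout status` for each workload and stops at the first failure.
wait_for_rollouts() {
  ns="$1"
  timeout="${2:-300s}"
  for res in $(kubectl get deployments,statefulsets,daemonsets \
      --namespace "$ns" --output name); do
    kubectl rollout status "$res" --namespace "$ns" --timeout="$timeout" || return 1
  done
}

# Usage:
#   wait_for_rollouts monitoring 300s
```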

Configure ServiceMonitor resources

Create application ServiceMonitor

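ServiceMonitor resources tell Prometheus which Services to scrape for metrics. This example, saved as webapp-servicemonitor.yaml, scrapes the port named metrics on Services labelled app: webapp in the default and production namespaces.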

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-metrics
  namespace: monitoring
  labels:
    app: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - default
    - production
kubectl apply -f webapp-servicemonitor.yaml
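
A ServiceMonitor matches Services by label and refers to the scrape port by name, so the target Service needs both. The sketch below writes a Service that the webapp-metrics ServiceMonitor would match. The Service name, the port number 8080, and the function name are assumptions for illustration; adjust them to your application.

```shell
# write_webapp_service FILE: write an example Service manifest to FILE.
write_webapp_service() {
  cat > "$1" <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: default        # must be listed in the ServiceMonitor's namespaceSelector
  labels:
    app: webapp             # matched by spec.selector.matchLabels
spec:
  selector:
    app: webapp
  ports:
  - name: metrics           # matched by endpoints[].port (a name, not a number)
    port: 8080
    targetPort: 8080
EOF
}

# Usage:
#   write_webapp_service webapp-service.yaml
#   kubectl apply -f webapp-service.yaml
```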

Create database ServiceMonitor

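Monitor PostgreSQL or MySQL databases through dedicated exporters that expose database-specific metrics. This example, saved as database-servicemonitor.yaml, targets a postgres-exporter Service. Because it has no namespaceSelector, it only matches Services in the monitoring namespace.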

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: postgres-exporter
  namespace: monitoring
  labels:
    app: postgres-exporter
spec:
  selector:
    matchLabels:
      app: postgres-exporter
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 60s
    scrapeTimeout: 30s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: instance
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: kubernetes_namespace
kubectl apply -f database-servicemonitor.yaml

Configure ingress ServiceMonitor

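Monitor NGINX ingress controller metrics to track request rates, response times, and error rates across all services. Save the manifest as ingress-servicemonitor.yaml. The port name must match the name of the metrics port on your ingress controller's Service.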

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-ingress
  namespace: monitoring
  labels:
    app: nginx-ingress
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  endpoints:
  - port: prometheus
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - ingress-nginx
kubectl apply -f ingress-servicemonitor.yaml

Set up custom metrics and alerting rules

Create application-specific PrometheusRule

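PrometheusRule resources define alerting rules that fire when an expression stays true for the duration given in the for field. This example, saved as webapp-alerts.yaml, covers response time, error rate, and container restarts for the web application.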

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webapp-alerts
  namespace: monitoring
  labels:
    app: webapp
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: webapp.rules
    interval: 30s
    rules:
    - alert: WebAppHighResponseTime
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{app="webapp"}[5m])) > 0.5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Web application response time is high"
        description: "95th percentile response time is {{ $value }}s for {{ $labels.instance }}"
    
    - alert: WebAppHighErrorRate
      expr: sum by (instance) (rate(http_requests_total{app="webapp",status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{app="webapp"}[5m])) > 0.1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.instance }}"
    
    - alert: WebAppPodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total{container="webapp"}[15m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Pod is crash looping"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"
kubectl apply -f webapp-alerts.yaml
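
The WebAppHighErrorRate rule compares the share of 5xx responses among all responses against a threshold of 0.1. Because both rates cover the same 5-minute window, the ratio reduces to errors divided by total requests in that window. The sketch below works through that arithmetic; the function name and the request counts are made up for illustration.

```shell
# error_ratio ERRORS TOTAL
# Prints ERRORS / TOTAL with three decimals, or NaN when TOTAL is zero.
error_ratio() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "NaN"; exit }
    printf "%.3f\n", e / t
  }'
}

error_ratio 30 600   # 30 errors out of 600 requests
error_ratio 90 600   # 90 errors out of 600 requests
```

The first call prints 0.050, which stays below the threshold. The second prints 0.150, which exceeds 0.1 and would fire the alert once it has held for the 5 minutes set in the for field.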

Configure infrastructure alerting rules

Alert on cluster-wide conditions such as node CPU usage, node memory usage, and persistent volume utilization. Save the manifest as infrastructure-alerts.yaml.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
  - name: infrastructure.rules
    interval: 60s
    rules:
    - alert: NodeHighCPUUsage
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node CPU usage is high"
        description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
    
    - alert: NodeHighMemoryUsage
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node memory usage is high"
        description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
    
    - alert: PersistentVolumeUsageHigh
      expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Persistent volume usage is high"
        description: "Volume {{ $labels.persistentvolumeclaim }} usage is {{ $value }}%"
kubectl apply -f infrastructure-alerts.yaml

Configure Alertmanager routing

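Set up alert routing and notification channels so that critical alerts reach the on-call team and warnings reach the wider team. Save the manifest as alertmanager-config.yaml. The SMTP host, email addresses, and webhook URLs are placeholders.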

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-stack-kube-prom-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alerts@example.com'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'web.hook'
      routes:
      - match:
          severity: critical
        receiver: 'critical-alerts'
      - match:
          severity: warning
        receiver: 'warning-alerts'
    
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://example.com/webhook'
    
    - name: 'critical-alerts'
      email_configs:
      - to: 'oncall@example.com'
        subject: 'CRITICAL: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Critical Alert'
    
    - name: 'warning-alerts'
      email_configs:
      - to: 'team@example.com'
        subject: 'WARNING: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
kubectl apply -f alertmanager-config.yaml
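
The routing tree above sends an alert to the first child route whose match labels fit, and falls back to the top-level receiver otherwise. The sketch below models only that decision for the severity label. It illustrates this particular config and is not how Alertmanager is implemented; the function name is hypothetical.

```shell
# route_receiver SEVERITY: print the receiver the config above would select.
route_receiver() {
  case "$1" in
    critical) echo "critical-alerts" ;;   # first child route
    warning)  echo "warning-alerts" ;;    # second child route
    *)        echo "web.hook" ;;          # top-level default receiver
  esac
}

route_receiver critical
route_receiver info
```

An alert with severity: critical goes to critical-alerts, while an alert with any other or no severity label (info in this example) ends up at the default web.hook receiver.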

Deploy Grafana dashboards for cluster monitoring

Access Grafana interface

Create a port-forward to access Grafana and configure dashboards for cluster monitoring and application metrics visualization.

kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80
Note: Access Grafana at http://localhost:3000 with username 'admin' and the password you set in the values file.
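
If the password from the values file is not at hand, it can usually be read back from the cluster. The Grafana chart stores the admin credentials in a Secret under the key admin-password. The sketch below assumes that Secret is named prometheus-stack-grafana, following the release name used in this guide; the function name is hypothetical.

```shell
# grafana_admin_password [SECRET_NAME]
# Reads the base64-encoded admin password from the Grafana Secret and decodes it.
grafana_admin_password() {
  kubectl get secret "${1:-prometheus-stack-grafana}" \
    --namespace monitoring \
    --output jsonpath='{.data.admin-password}' | base64 -d
}

# Usage:
#   grafana_admin_password
```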

Create custom application dashboard

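Import or create custom dashboards that visualize your application's request rate, response time, and error rate. The JSON below wraps the dashboard in a "dashboard" object, which is the envelope accepted by Grafana's dashboard HTTP API.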

{
  "dashboard": {
    "id": null,
    "title": "Web Application Metrics",
    "tags": ["webapp", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{app=\"webapp\"}[5m])",
            "legendFormat": "{{ instance }} - {{ method }}"
          }
        ],
        "yAxes": [
          {
            "label": "Requests/sec"
          }
        ]
      },
      {
        "title": "Response Time (95th percentile)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{app=\"webapp\"}[5m]))",
            "legendFormat": "{{ instance }}"
          }
        ],
        "yAxes": [
          {
            "label": "Seconds"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{app=\"webapp\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{app=\"webapp\"}[5m])) * 100"
          }
        ],
        "format": "percent"
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}

Configure dashboard provisioning

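Provision dashboards automatically from ConfigMaps so they deploy consistently across environments. The Grafana sidecar in kube-prometheus-stack loads ConfigMaps that carry the grafana_dashboard label. Save the manifest as dashboard-configmap.yaml.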

apiVersion: v1
kind: ConfigMap
metadata:
  name: webapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  webapp-dashboard.json: |
    {
      "id": null,
      "title": "Web Application Dashboard",
      "panels": [
        {
          "title": "Pod CPU Usage",
          "type": "graph",
          "targets": [
            {
              "expr": "rate(container_cpu_usage_seconds_total{pod=~\"webapp-.*\"}[5m]) * 100",
              "legendFormat": "{{ pod }}"
            }
          ]
        },
        {
          "title": "Pod Memory Usage",
          "type": "graph",
          "targets": [
            {
              "expr": "container_memory_working_set_bytes{pod=~\"webapp-.*\"} / 1024 / 1024",
              "legendFormat": "{{ pod }}"
            }
          ]
        }
      ]
    }
kubectl apply -f dashboard-configmap.yaml

Verify your setup

Confirm that all monitoring components are operational and collecting metrics from your cluster.

# Check Prometheus Operator pods
kubectl get pods -n monitoring

# Verify ServiceMonitor discovery
kubectl get servicemonitors -n monitoring

# Check PrometheusRule status
kubectl get prometheusrules -n monitoring

# Access Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-prometheus 9090:9090

# Access Alertmanager UI
kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-alertmanager 9093:9093

# Test metrics endpoint
curl http://localhost:9090/api/v1/targets
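
The targets response is long, so a quick count of unhealthy targets can be easier to read than the raw JSON. The sketch below is a crude text match that assumes the compact JSON Prometheus returns (no spaces around the colon); a JSON-aware tool such as jq would be more robust. The function name is hypothetical.

```shell
# count_down_targets: read a /api/v1/targets response on stdin and print
# how many targets report "health":"down".
count_down_targets() {
  grep -o '"health":"down"' | wc -l | tr -d ' '
}

# Usage (needs the Prometheus port-forward to be running in another terminal):
#   curl -s http://localhost:9090/api/v1/targets | count_down_targets
```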

Common issues

Symptom | Cause | Fix
ServiceMonitor not discovered | Label selectors don't match | Run kubectl get servicemonitors -n monitoring -o yaml and verify the selector labels
Metrics not scraped | Service endpoint not accessible | Verify the Service exists: kubectl get svc -l app=your-app
Prometheus rules not loading | Syntax errors in PrometheusRule | Validate the rule groups with promtool check rules your-rules.yaml
Grafana dashboards empty | Data source not configured | Check the Prometheus data source URL in Grafana settings
Persistent volumes failing | StorageClass not available | Create the StorageClass or use an existing one: kubectl get storageclass
Alertmanager not receiving alerts | Alert routing misconfigured | Check the Alertmanager logs: kubectl logs -n monitoring statefulset/alertmanager-prometheus-stack-kube-prom-alertmanager -c alertmanager
