Configure Spark Kubernetes Operator with MinIO for cloud-native analytics

Advanced 45 min Apr 23, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Deploy Apache Spark on Kubernetes with the Spark Operator and MinIO object storage for scalable big data processing. Configure RBAC, SSL certificates, and persistent storage for production-ready analytics workloads.

Prerequisites

  • Kubernetes cluster with at least 8GB RAM and 4 CPU cores
  • kubectl configured with admin access
  • Helm 3.x installed
  • At least 100GB available storage for MinIO

What this solves

The Spark Kubernetes Operator automates the deployment and management of Apache Spark applications on Kubernetes clusters, while MinIO provides S3-compatible object storage for your data lake. This combination creates a cloud-native analytics platform that scales automatically and handles large datasets efficiently. You'll use this when you need to run distributed data processing workloads with automatic resource management and fault tolerance.

Step-by-step installation

Install Kubernetes cluster requirements

Start with a working Kubernetes cluster and install Helm for managing chart deployments.

# Ubuntu 24.04 / Debian 12
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gpg

# AlmaLinux 9 / Rocky Linux 9
sudo dnf update -y
sudo dnf install -y curl wget gpg

Install kubectl and Helm

Install the Kubernetes command-line tool and Helm package manager for deploying the operator and MinIO.

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

kubectl version --client
helm version

Create dedicated namespaces

Create separate namespaces for the Spark operator and MinIO to organize resources and apply security policies.

kubectl create namespace spark-operator
kubectl create namespace minio
kubectl create namespace spark-jobs
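If you prefer a declarative workflow, the same three namespaces can be rendered as a manifest and applied with kubectl apply. A minimal sketch (the namespaces.yaml filename is our choice; nothing here touches the cluster until you apply it):

```shell
# Render a Namespace manifest for each namespace used in this guide
for ns in spark-operator minio spark-jobs; do
  cat <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: $ns
---
EOF
done > namespaces.yaml
# The generated file contains three Namespace objects
grep -c "kind: Namespace" namespaces.yaml   # 3
```

Apply it later with kubectl apply -f namespaces.yaml; unlike kubectl create, re-applying is idempotent.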

Install MinIO with SSL certificates

Deploy MinIO object storage with TLS encryption and persistent volume claims for data durability.

helm repo add minio https://charts.min.io/
helm repo update

Generate SSL certificates for MinIO

Create self-signed certificates for MinIO API and console access in development environments.

openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
  -subj "/C=US/ST=CA/L=San Francisco/O=Example/CN=minio.example.com" \
  -keyout minio.key -out minio.crt

kubectl create secret tls minio-tls --key minio.key --cert minio.crt -n minio
Note: For production environments, use certificates from a trusted CA or Let's Encrypt with cert-manager.
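One caveat with the command above: it sets only a CN, and modern TLS clients (including the Java runtime Spark runs on) validate the subjectAltName, not the CN. With OpenSSL 1.1.1 or newer you can add SANs via -addext; including the in-cluster service name is an assumption based on the minio service and namespace used later in this guide:

```shell
# Same self-signed certificate, but with SANs for both the public hostname
# and the in-cluster service DNS name (requires OpenSSL >= 1.1.1)
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
  -subj "/C=US/ST=CA/L=San Francisco/O=Example/CN=minio.example.com" \
  -addext "subjectAltName=DNS:minio.example.com,DNS:minio.minio.svc.cluster.local" \
  -keyout minio.key -out minio.crt
```

Recreate the minio-tls secret from these files if you regenerate the certificate.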

Create MinIO configuration values

Configure MinIO with persistent storage, resource limits, and SSL termination for production workloads.

auth:
  rootUser: admin
  rootPassword: SecureMinIOPassword123!

mode: standalone

persistence:
  enabled: true
  storageClass: ""
  size: 100Gi

resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"

service:
  type: ClusterIP
  port: 9000
  
consoleService:
  type: ClusterIP
  port: 9001

tls:
  enabled: true
  existingSecret: "minio-tls"

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: "nginx"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  hosts:
    - host: minio.example.com
      paths:
        - path: /
          pathType: Prefix

Deploy MinIO with Helm

Install MinIO using the configuration values and verify the deployment status.

helm install minio minio/minio -n minio -f minio-values.yaml

kubectl get pods -n minio
kubectl get svc -n minio
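The service listed here is what Spark will talk to later. Kubernetes gives every service a predictable in-cluster DNS name, which is why the sparkConf further down can reach MinIO without any external address. A quick illustration of how that name is assembled:

```shell
# Kubernetes service DNS follows <service>.<namespace>.svc.cluster.local
service=minio
namespace=minio
host="${service}.${namespace}.svc.cluster.local"
echo "${host}:9000"   # minio.minio.svc.cluster.local:9000
```

Any pod in the cluster can resolve this name, subject to the network policies applied later.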

Create MinIO buckets for Spark

Set up storage buckets for Spark application jars, data, and output using the MinIO client.

# Port forward to access MinIO API
kubectl port-forward svc/minio 9000:9000 -n minio &

Install MinIO client

wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

Configure MinIO alias

mc alias set minio https://localhost:9000 admin SecureMinIOPassword123! --insecure

Create buckets

mc mb minio/spark-jars --insecure
mc mb minio/spark-data --insecure
mc mb minio/spark-output --insecure
mc mb minio/spark-logs --insecure
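Spark addresses each of these buckets through the S3A filesystem, where the bucket name becomes the authority of an s3a:// URI. A small sketch of the mapping used in the sparkConf later on:

```shell
# Map each bucket created above to the s3a:// URI Spark will use for it,
# e.g. spark-logs becomes the event-log directory s3a://spark-logs/
for bucket in spark-jars spark-data spark-output spark-logs; do
  echo "s3a://${bucket}/"
done
```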

Install Spark Kubernetes Operator

Deploy the Spark operator using Helm with RBAC configuration and webhook settings.

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update

Create Spark operator configuration

Configure the operator with proper RBAC permissions and resource management settings.

image:
  repository: gcr.io/spark-operator/spark-operator
  tag: v1beta2-1.3.8-3.1.1

sparkJobNamespace: spark-jobs

controllerThreads: 10
resyncInterval: 30

webhook:
  enable: true
  port: 8080

resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 300Mi

rbac:
  create: true
  createClusterRole: true

serviceAccounts:
  spark:
    create: true
    name: spark
  sparkoperator:
    create: true
    name: spark-operator

leaderElection:
  lockName: "spark-operator-lock"
  lockNamespace: "spark-operator"

Deploy Spark operator

Install the Spark operator and verify it's running correctly with webhook admission control.

helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --values spark-operator-values.yaml

kubectl get pods -n spark-operator
kubectl get mutatingwebhookconfigurations

Create RBAC for Spark applications

Set up service accounts and role bindings for Spark driver and executor pods to access Kubernetes APIs.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role-binding
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-jobs
kubectl apply -f spark-rbac.yaml

Create MinIO access secrets

Store MinIO credentials as Kubernetes secrets for Spark applications to access object storage.

kubectl create secret generic minio-secret \
  --from-literal=accesskey=admin \
  --from-literal=secretkey=SecureMinIOPassword123! \
  -n spark-jobs
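Kubernetes stores secret values base64-encoded, so reading them back requires decoding — for example kubectl get secret minio-secret -n spark-jobs -o jsonpath='{.data.accesskey}' | base64 -d. The round trip itself can be checked locally:

```shell
# base64 round trip: encoding then decoding returns the original value.
# -n matters: a trailing newline would change the encoded output.
encoded=$(echo -n 'admin' | base64)
echo "$encoded"               # YWRtaW4=
echo "$encoded" | base64 -d   # admin
```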

Create Spark application with MinIO integration

Deploy a sample Spark job that reads from and writes to MinIO storage using S3A filesystem.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-minio-example
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: "apache/spark:3.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar"
  sparkVersion: "3.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: secretkey
  executor:
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 3.4.0
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: secretkey
  sparkConf:
    "spark.kubernetes.container.image.pullPolicy": "Always"
    "spark.hadoop.fs.s3a.endpoint": "https://minio.minio.svc.cluster.local:9000"
    "spark.hadoop.fs.s3a.access.key": "admin"
    "spark.hadoop.fs.s3a.secret.key": "SecureMinIOPassword123!"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    # MinIO was deployed with TLS, so connect over https; the self-signed
    # certificate must be trusted by the Spark image's Java truststore
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "true"
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://spark-logs/"
    "spark.history.fs.logDirectory": "s3a://spark-logs/"
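A sizing note on the manifest above: the 512m memory settings are not the full pod footprint. For JVM workloads, Spark requests an additional memory overhead per driver and executor pod of max(384 MiB, 10% of the configured memory). A back-of-the-envelope check:

```shell
# Pod memory request per executor = configured memory + overhead,
# where overhead = max(384 MiB, 10% of configured memory)
spark_mem_mib=512
overhead=$(( spark_mem_mib / 10 ))
if [ "$overhead" -lt 384 ]; then overhead=384; fi
echo "pod request: $(( spark_mem_mib + overhead )) MiB"   # pod request: 896 MiB
```

Keep this overhead in mind when comparing executor counts against namespace quotas.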

Deploy and monitor Spark application

Submit the Spark job and monitor its execution through Kubernetes and Spark UI.

kubectl apply -f spark-minio-job.yaml

Monitor the application

kubectl get sparkapplications -n spark-jobs
kubectl describe sparkapplication spark-minio-example -n spark-jobs
kubectl get pods -n spark-jobs

Configure Spark History Server

Deploy Spark History Server to view completed application logs and metrics stored in MinIO.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-history-server
  namespace: spark-jobs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-history-server
  template:
    metadata:
      labels:
        app: spark-history-server
    spec:
      containers:
      - name: spark-history-server
        image: apache/spark:3.4.0
        command: ["/opt/spark/bin/spark-class"]
        args: ["org.apache.spark.deploy.history.HistoryServer"]
        ports:
        - containerPort: 18080
        env:
        - name: SPARK_HISTORY_OPTS
          value: "-Dspark.history.fs.logDirectory=s3a://spark-logs/ -Dspark.hadoop.fs.s3a.endpoint=https://minio.minio.svc.cluster.local:9000 -Dspark.hadoop.fs.s3a.access.key=admin -Dspark.hadoop.fs.s3a.secret.key=SecureMinIOPassword123! -Dspark.hadoop.fs.s3a.path.style.access=true -Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -Dspark.hadoop.fs.s3a.connection.ssl.enabled=true"
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: spark-history-server
  namespace: spark-jobs
spec:
  selector:
    app: spark-history-server
  ports:
  - port: 18080
    targetPort: 18080
  type: ClusterIP
kubectl apply -f spark-history-server.yaml

Verify your setup

Check that all components are running and can communicate with each other.

# Verify MinIO is running
kubectl get pods -n minio
kubectl logs deployment/minio -n minio

Verify Spark operator is ready

kubectl get pods -n spark-operator
kubectl logs deployment/spark-operator -n spark-operator

Check Spark application status

kubectl get sparkapplications -n spark-jobs
kubectl describe sparkapplication spark-minio-example -n spark-jobs

Access MinIO console (port forward)

kubectl port-forward svc/minio-console 9001:9001 -n minio

Visit https://localhost:9001 (accept the self-signed certificate) and sign in with admin / SecureMinIOPassword123!

Access Spark History Server

kubectl port-forward svc/spark-history-server 18080:18080 -n spark-jobs

Visit http://localhost:18080

Test MinIO connectivity from Spark

mc ls minio/spark-logs --insecure

Configure production optimizations

Enable resource quotas and limits

Set up resource quotas to prevent Spark jobs from consuming all cluster resources.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-jobs-quota
  namespace: spark-jobs
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-jobs-limits
  namespace: spark-jobs
spec:
  limits:
  - default:
      cpu: "2"
      memory: "4Gi"
    defaultRequest:
      cpu: "100m"
      memory: "256Mi"
    type: Container
kubectl apply -f spark-resource-quota.yaml
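The quota above also lets you estimate job concurrency. For example, if each executor requests 1 CPU and the driver requests another, requests.cpu: "20" caps a single job at roughly 19 concurrent executors; a sketch of the arithmetic:

```shell
# Executors that fit under the CPU quota after reserving the driver's share
quota_cpu=20
driver_cpu=1
executor_cpu=1
echo $(( (quota_cpu - driver_cpu) / executor_cpu ))   # 19
```

Memory works the same way; whichever resource is exhausted first sets the real limit.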

Configure network policies

Implement network security policies to control traffic between namespaces and external access.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-jobs-netpol
  namespace: spark-jobs
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: spark-operator
  - from:
    - podSelector: {}
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: minio
    ports:
    - protocol: TCP
      port: 9000
  - to:
    - podSelector: {}
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53
kubectl label namespace minio name=minio
kubectl label namespace spark-operator name=spark-operator
kubectl apply -f network-policies.yaml

Common issues

Symptom | Cause | Fix
Spark operator pods failing | Insufficient RBAC permissions | Run kubectl get clusterrolebinding spark-operator and verify permissions
MinIO connection refused | Service discovery misconfiguration | Use the full service DNS name: minio.minio.svc.cluster.local:9000
Spark apps stuck in pending | Resource limits exceeded | Run kubectl describe pod and check resource quotas
S3A authentication failures | Incorrect MinIO credentials | Verify secret values: kubectl get secret minio-secret -o yaml
Webhook admission failures | Certificate issues | Run kubectl get mutatingwebhookconfigurations and check TLS
History server can't read logs | MinIO bucket permissions | Check the bucket policy: mc policy list minio/spark-logs
