Deploy Apache Spark on Kubernetes with the Spark Operator and MinIO object storage for scalable big data processing. Configure RBAC, SSL certificates, and persistent storage for production-ready analytics workloads.
Prerequisites
- Kubernetes cluster with at least 8GB RAM and 4 CPU cores
- kubectl configured with admin access
- Helm 3.x installed
- At least 100GB available storage for MinIO
What this solves
The Spark Kubernetes Operator automates the deployment and management of Apache Spark applications on Kubernetes clusters, while MinIO provides S3-compatible object storage for your data lake. This combination creates a cloud-native analytics platform that scales automatically and handles large datasets efficiently. You'll use this when you need to run distributed data processing workloads with automatic resource management and fault tolerance.
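Before diving into the steps, it helps to see the moving pieces in one place: every Spark-to-MinIO interaction in this guide boils down to a handful of Hadoop S3A settings. A minimal sketch follows — the key names are real S3A configuration keys, while the endpoint and credentials are this guide's example values:

```python
# Build the spark.hadoop.* settings Spark needs to treat MinIO as an
# S3-compatible endpoint. Key names are standard Hadoop S3A keys; the
# endpoint and credentials below are this guide's example values.
def s3a_conf(endpoint: str, access_key: str, secret_key: str, ssl: bool = True) -> dict:
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO serves buckets under the path (https://host/bucket),
        # not as virtual-hosted subdomains, so path-style access is required.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(ssl).lower(),
    }

conf = s3a_conf("https://minio.minio.svc.cluster.local:9000",
                "admin", "SecureMinIOPassword123!")
```

Each later section of this guide sets some subset of these keys, whether in a SparkApplication manifest or in the History Server's JVM options.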
Step-by-step installation
Install Kubernetes cluster requirements
Start with a working Kubernetes cluster and install Helm for managing chart deployments.
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gpg
Install kubectl and Helm
Install the Kubernetes command-line tool and Helm package manager for deploying the operator and MinIO.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
kubectl version --client
helm version
Create dedicated namespaces
Create separate namespaces for the Spark operator and MinIO to organize resources and apply security policies.
kubectl create namespace spark-operator
kubectl create namespace minio
kubectl create namespace spark-jobs
Install MinIO with SSL certificates
Deploy MinIO object storage with TLS encryption and persistent volume claims for data durability.
helm repo add minio https://charts.min.io/
helm repo update
Generate SSL certificates for MinIO
Create self-signed certificates for MinIO API and console access in development environments.
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
-subj "/C=US/ST=CA/L=San Francisco/O=Example/CN=minio.example.com" \
-keyout minio.key -out minio.crt
kubectl create secret tls minio-tls --key minio.key --cert minio.crt -n minio
Create MinIO configuration values
Configure MinIO with persistent storage, resource limits, and TLS for production workloads. The keys below follow the official minio/minio chart's values layout: rootUser/rootPassword sit at the top level, and TLS references the secret via certSecret with the tls.crt/tls.key key names that kubectl create secret tls produces.
rootUser: admin
rootPassword: SecureMinIOPassword123!
mode: standalone
persistence:
  enabled: true
  storageClass: ""
  size: 100Gi
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"
service:
  type: ClusterIP
  port: 9000
consoleService:
  type: ClusterIP
  port: 9001
tls:
  enabled: true
  certSecret: "minio-tls"
  publicCrt: tls.crt
  privateKey: tls.key
ingress:
  enabled: true
  ingressClassName: nginx
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  path: /
  hosts:
    - minio.example.com
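The requests/limits pairs above are easy to get wrong by a suffix (Mi vs Gi, m vs whole cores). A small sketch for sanity-checking them locally before deploying — it handles only the binary (Ki/Mi/Gi/Ti) and milli suffixes used in this guide, not the full Kubernetes quantity grammar:

```python
# Parse the Kubernetes resource quantities used in this guide so that
# requests <= limits can be verified before running helm install.
def to_base_units(qty: str) -> float:
    """Convert '512Mi' -> bytes, '250m' -> cores, '2' -> 2.0."""
    suffixes = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "m": 1e-3}
    for suffix, factor in suffixes.items():
        if qty.endswith(suffix):
            return float(qty[: -len(suffix)]) * factor
    return float(qty)

# The values from minio-values.yaml above.
resources = {
    "requests": {"memory": "512Mi", "cpu": "250m"},
    "limits": {"memory": "2Gi", "cpu": "1000m"},
}
for key in ("memory", "cpu"):
    assert to_base_units(resources["requests"][key]) <= to_base_units(resources["limits"][key])
```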
Deploy MinIO with Helm
Install MinIO using the configuration values and verify the deployment status.
helm install minio minio/minio -n minio -f minio-values.yaml
kubectl get pods -n minio
kubectl get svc -n minio
Create MinIO buckets for Spark
Set up storage buckets for Spark application jars, data, and output using the MinIO client.
# Port forward to access MinIO API
kubectl port-forward svc/minio 9000:9000 -n minio &
Install MinIO client
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/
Configure MinIO alias
mc alias set minio https://localhost:9000 admin SecureMinIOPassword123! --insecure
Create buckets
mc mb minio/spark-jars --insecure
mc mb minio/spark-data --insecure
mc mb minio/spark-output --insecure
mc mb minio/spark-logs --insecure
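If mc hangs or errors, it is worth confirming the port-forward actually reaches MinIO first. MinIO exposes an unauthenticated liveness endpoint at /minio/health/live; a sketch of polling it from Python, with certificate verification disabled because this guide uses a self-signed certificate:

```python
# Poll MinIO's documented liveness endpoint through the port-forward.
import ssl
import urllib.error
import urllib.request

def minio_is_live(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET {base_url}/minio/health/live answers 200."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # self-signed cert in this guide
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(f"{base_url}/minio/health/live",
                                    timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

With the port-forward from the previous step active, minio_is_live("https://localhost:9000") should return True before you proceed to bucket creation.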
Install Spark Kubernetes Operator
Deploy the Spark operator using Helm with RBAC configuration and webhook settings. The chart below is the original GoogleCloudPlatform release used throughout this guide; the project has since moved to the Kubeflow organization, which publishes newer versions at https://kubeflow.github.io/spark-operator.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update
Create Spark operator configuration
Configure the operator with proper RBAC permissions and resource management settings.
image:
  repository: gcr.io/spark-operator/spark-operator
  tag: v1beta2-1.3.8-3.1.1
sparkJobNamespace: spark-jobs
controllerThreads: 10
resyncInterval: 30
webhook:
  enable: true
  port: 8080
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 300Mi
rbac:
  create: true
  createClusterRole: true
serviceAccounts:
  spark:
    create: true
    name: spark
  sparkoperator:
    create: true
    name: spark-operator
leaderElection:
  lockName: "spark-operator-lock"
  lockNamespace: "spark-operator"
Deploy Spark operator
Install the Spark operator and verify it's running correctly with webhook admission control.
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--values spark-operator-values.yaml
kubectl get pods -n spark-operator
kubectl get mutatingwebhookconfigurations
Create RBAC for Spark applications
Set up service accounts and role bindings for Spark driver and executor pods to access Kubernetes APIs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role-binding
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-jobs
kubectl apply -f spark-rbac.yaml
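The ClusterRole above grants pod-level permissions cluster-wide. If Spark jobs will only ever run in the spark-jobs namespace, a namespaced Role with the same grants is a tighter alternative — a sketch (the names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark-jobs
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: spark-jobs
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```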
Create MinIO access secrets
Store MinIO credentials as Kubernetes secrets for Spark applications to access object storage.
kubectl create secret generic minio-secret \
--from-literal=accesskey=admin \
--from-literal=secretkey=SecureMinIOPassword123! \
-n spark-jobs
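Kubernetes stores Secret values base64-encoded, which is what kubectl get secret minio-secret -o yaml will show under .data. A small sketch of round-tripping that encoding, useful when verifying the credentials landed intact:

```python
# Decode the .data map of a Kubernetes Secret back to plain strings,
# mirroring what kubectl does when it creates a secret from literals.
import base64

def decode_secret(data: dict) -> dict:
    return {k: base64.b64decode(v).decode() for k, v in data.items()}

# Example .data block as it would appear in the Secret's YAML.
encoded = {
    "accesskey": base64.b64encode(b"admin").decode(),
    "secretkey": base64.b64encode(b"SecureMinIOPassword123!").decode(),
}
print(decode_secret(encoded))
```

Note that base64 is an encoding, not encryption: anyone who can read the Secret object can read the credentials, which is why RBAC on the spark-jobs namespace matters.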
Create Spark application with MinIO integration
Deploy a sample SparkApplication configured for MinIO access over the S3A filesystem. The SparkPi example below does no data I/O of its own, but its event logs are written to the spark-logs bucket, exercising the S3A path end to end.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-minio-example
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: "apache/spark:3.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar"
  sparkVersion: "3.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: secretkey
  executor:
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 3.4.0
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: secretkey
  sparkConf:
    "spark.kubernetes.container.image.pullPolicy": "Always"
    "spark.hadoop.fs.s3a.endpoint": "https://minio.minio.svc.cluster.local:9000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "true"
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://spark-logs/"
    "spark.history.fs.logDirectory": "s3a://spark-logs/"
The S3A credentials come from the injected AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY environment variables (picked up by S3A's default credential provider chain) rather than plaintext sparkConf entries, and the endpoint uses HTTPS because MinIO was deployed with TLS. Since the certificate is self-signed, the Spark image's Java truststore must trust it (or TLS must be terminated elsewhere) for S3A calls to succeed.
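One caveat on the sample job: the stock apache/spark image does not bundle the S3A connector, so any s3a:// access (including the event-log directory above) fails with a ClassNotFoundException unless hadoop-aws and the matching AWS SDK bundle are on the classpath. A sketch of a custom image — the jar versions are assumptions to verify against the Hadoop version your Spark build ships (Spark 3.4.0 ships Hadoop 3.3.x client libraries):

```dockerfile
# Sketch only: hadoop-aws must match the Hadoop version bundled with the
# Spark image, and aws-java-sdk-bundle must match that hadoop-aws release.
FROM apache/spark:3.4.0
USER root
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar /opt/spark/jars/
# Return to the non-root spark user used by the apache/spark images.
USER 185
```

Build and push this image to a registry your cluster can pull from, then point spec.image at it instead of apache/spark:3.4.0.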
Deploy and monitor Spark application
Submit the Spark job and monitor its execution through Kubernetes and Spark UI.
kubectl apply -f spark-minio-job.yaml
Monitor the application
kubectl get sparkapplications -n spark-jobs
kubectl describe sparkapplication spark-minio-example -n spark-jobs
kubectl get pods -n spark-jobs
Configure Spark History Server
Deploy Spark History Server to view completed application logs and metrics stored in MinIO.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-history-server
  namespace: spark-jobs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-history-server
  template:
    metadata:
      labels:
        app: spark-history-server
    spec:
      containers:
        - name: spark-history-server
          image: apache/spark:3.4.0
          command: ["/opt/spark/bin/spark-class"]
          args: ["org.apache.spark.deploy.history.HistoryServer"]
          ports:
            - containerPort: 18080
          env:
            - name: SPARK_HISTORY_OPTS
              value: >-
                -Dspark.history.fs.logDirectory=s3a://spark-logs/
                -Dspark.hadoop.fs.s3a.endpoint=https://minio.minio.svc.cluster.local:9000
                -Dspark.hadoop.fs.s3a.access.key=admin
                -Dspark.hadoop.fs.s3a.secret.key=SecureMinIOPassword123!
                -Dspark.hadoop.fs.s3a.path.style.access=true
                -Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
                -Dspark.hadoop.fs.s3a.connection.ssl.enabled=true
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: spark-history-server
  namespace: spark-jobs
spec:
  selector:
    app: spark-history-server
  ports:
    - port: 18080
      targetPort: 18080
  type: ClusterIP
kubectl apply -f spark-history-server.yaml
Verify your setup
Check that all components are running and can communicate with each other.
# Verify MinIO is running
kubectl get pods -n minio
kubectl logs deployment/minio -n minio
Verify Spark operator is ready
kubectl get pods -n spark-operator
kubectl logs deployment/spark-operator -n spark-operator
Check Spark application status
kubectl get sparkapplications -n spark-jobs
kubectl describe sparkapplication spark-minio-example -n spark-jobs
Access MinIO console (port forward)
kubectl port-forward svc/minio-console 9001:9001 -n minio
Visit https://localhost:9001 (admin/SecureMinIOPassword123!); expect a browser warning because the certificate is self-signed.
Access Spark History Server
kubectl port-forward svc/spark-history-server 18080:18080 -n spark-jobs
Visit http://localhost:18080
Test MinIO connectivity from Spark
mc ls minio/spark-logs --insecure
Configure production optimizations
Enable resource quotas and limits
Set up resource quotas to prevent Spark jobs from consuming all cluster resources.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-jobs-quota
  namespace: spark-jobs
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-jobs-limits
  namespace: spark-jobs
spec:
  limits:
    - default:
        cpu: "2"
        memory: "4Gi"
      defaultRequest:
        cpu: "100m"
        memory: "256Mi"
      type: Container
kubectl apply -f spark-resource-quota.yaml
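Quota sizing is easier with pod-level numbers in hand: each Spark pod requests its heap plus a memory overhead (by default max(10% of heap, 384 MiB) for JVM jobs), so raw heap figures undercount what the quota actually sees. A rough sketch of the arithmetic using this guide's example sizes:

```python
# Estimate how many executor pods fit under the namespace memory quota,
# accounting for Spark's per-pod memory overhead (default factor 0.10,
# floored at 384 MiB for JVM workloads).
def pod_memory_mib(heap_mib: int, overhead_factor: float = 0.10) -> int:
    """Memory a Spark pod requests: heap + max(factor * heap, 384) MiB."""
    return heap_mib + max(int(heap_mib * overhead_factor), 384)

quota_mib = 40 * 1024                 # requests.memory: 40Gi from the quota above
per_executor = pod_memory_mib(512)    # executors in this guide use 512m heap
print(per_executor)                   # 896 MiB requested per executor pod
print(quota_mib // per_executor)      # executor pods that fit by memory alone
```

CPU requests constrain the count the same way; whichever resource runs out first sets the real ceiling.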
Configure network policies
Implement network security policies to control traffic between namespaces and external access.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-jobs-netpol
  namespace: spark-jobs
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: spark-operator
    - from:
        - podSelector: {}
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: minio
      ports:
        - protocol: TCP
          port: 9000
    - to:
        - podSelector: {}
    # DNS resolution, allowed to any destination (no "to" clause).
    - ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
kubectl label namespace minio name=minio
kubectl label namespace spark-operator name=spark-operator
kubectl apply -f network-policies.yaml
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Spark operator pods failing | Insufficient RBAC permissions | kubectl get clusterrolebinding spark-operator and verify permissions |
| MinIO connection refused | Service discovery misconfiguration | Use full service DNS: minio.minio.svc.cluster.local:9000 |
| Spark apps stuck in pending | Resource limits exceeded | kubectl describe pod and check resource quotas |
| S3A authentication failures | Incorrect MinIO credentials | Verify secret values: kubectl get secret minio-secret -o yaml |
| Webhook admission failures | Certificate issues | kubectl get mutatingwebhookconfigurations and check TLS |
| History server can't read logs | MinIO bucket permissions | Check bucket policy: mc policy list minio/spark-logs |
Next steps
- Monitor Kubernetes cluster with Prometheus Operator for comprehensive observability
- Implement Kubernetes network policies with Calico CNI and OPA Gatekeeper for security enforcement
- Configure Spark Delta Lake integration with MinIO for ACID transactions
- Set up Spark Streaming with Kafka integration for real-time analytics
- Configure Spark Kubernetes autoscaling with KEDA for dynamic resource management
Automated install script
Run the following script to automate the entire setup.
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Default values
MINIO_DOMAIN="${1:-minio.local}"
MINIO_PASSWORD="${2:-SecureMinIOPassword123!}"
CLEANUP_ON_EXIT=false
# Usage function
usage() {
    echo "Usage: $0 [minio_domain] [minio_password]"
    echo "  minio_domain:   Domain for MinIO ingress (default: minio.local)"
    echo "  minio_password: MinIO admin password (default: SecureMinIOPassword123!)"
    echo "Example: $0 minio.example.com MySecurePassword123!"
    exit 1
}
# Validate arguments
if [[ $# -gt 2 ]]; then
    usage
fi
# Check if running as root or with sudo
if [[ $EUID -eq 0 ]]; then
    SUDO=""
elif command -v sudo >/dev/null 2>&1; then
    SUDO="sudo"
else
    echo -e "${RED}Error: This script requires root privileges or sudo${NC}"
    exit 1
fi
# Detect distribution
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_UPDATE="apt update && apt upgrade -y"
            PKG_INSTALL="apt install -y"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            ;;
        *)
            echo -e "${RED}Unsupported distribution: $ID${NC}"
            exit 1
            ;;
    esac
else
    echo -e "${RED}Cannot detect distribution${NC}"
    exit 1
fi
# Cleanup function
cleanup() {
    if [[ "$CLEANUP_ON_EXIT" == "true" ]]; then
        echo -e "${YELLOW}Cleaning up due to error...${NC}"
        kubectl delete namespace spark-operator --ignore-not-found=true 2>/dev/null || true
        kubectl delete namespace minio --ignore-not-found=true 2>/dev/null || true
        kubectl delete namespace spark-jobs --ignore-not-found=true 2>/dev/null || true
        rm -f kubectl helm mc minio.key minio.crt minio-values.yaml spark-operator-values.yaml
    fi
}
trap cleanup ERR
echo -e "${BLUE}Spark Kubernetes Operator with MinIO Installation${NC}"
echo "=================================================="
# Check prerequisites
echo -e "${BLUE}[1/10] Checking prerequisites...${NC}"
if ! command -v docker >/dev/null 2>&1 && ! command -v podman >/dev/null 2>&1; then
    echo -e "${RED}Error: Docker or Podman is required${NC}"
    exit 1
fi
# Update system packages
echo -e "${BLUE}[2/10] Updating system packages...${NC}"
$SUDO sh -c "$PKG_UPDATE"   # run via sh -c so the '&&' in PKG_UPDATE is interpreted as an operator, not a literal argument
$SUDO $PKG_INSTALL curl wget gpg openssl
CLEANUP_ON_EXIT=true
# Install kubectl
echo -e "${BLUE}[3/10] Installing kubectl...${NC}"
if ! command -v kubectl >/dev/null 2>&1; then
    KUBECTL_VERSION=$(curl -L -s https://dl.k8s.io/release/stable.txt)
    curl -LO "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
    chmod 755 kubectl
    $SUDO mv kubectl /usr/local/bin/
fi
# Install Helm
echo -e "${BLUE}[4/10] Installing Helm...${NC}"
if ! command -v helm >/dev/null 2>&1; then
    curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
fi
# Verify Kubernetes connection
echo -e "${BLUE}[5/10] Verifying Kubernetes connection...${NC}"
if ! kubectl cluster-info >/dev/null 2>&1; then
    echo -e "${RED}Error: Cannot connect to Kubernetes cluster${NC}"
    exit 1
fi
# Create namespaces
echo -e "${BLUE}[6/10] Creating namespaces...${NC}"
kubectl create namespace spark-operator --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace minio --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace spark-jobs --dry-run=client -o yaml | kubectl apply -f -
# Generate SSL certificates for MinIO
echo -e "${BLUE}[7/10] Generating SSL certificates...${NC}"
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
-subj "/C=US/ST=CA/L=San Francisco/O=Example/CN=${MINIO_DOMAIN}" \
-keyout minio.key -out minio.crt
chmod 600 minio.key
chmod 644 minio.crt
kubectl create secret tls minio-tls --key minio.key --cert minio.crt -n minio --dry-run=client -o yaml | kubectl apply -f -
# Create MinIO values file
echo -e "${BLUE}[8/10] Creating MinIO configuration...${NC}"
cat > minio-values.yaml << EOF
rootUser: admin
rootPassword: ${MINIO_PASSWORD}
mode: standalone
persistence:
  enabled: true
  size: 100Gi
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"
service:
  type: ClusterIP
  port: 9000
consoleService:
  type: ClusterIP
  port: 9001
tls:
  enabled: true
  certSecret: "minio-tls"
  publicCrt: tls.crt
  privateKey: tls.key
EOF
chmod 644 minio-values.yaml
# Install MinIO
echo -e "${BLUE}[9/10] Installing MinIO...${NC}"
helm repo add minio https://charts.min.io/
helm repo update
helm upgrade --install minio minio/minio -n minio -f minio-values.yaml --wait
# Install Spark Operator
echo -e "${BLUE}[10/10] Installing Spark Operator...${NC}"
cat > spark-operator-values.yaml << EOF
image:
  repository: gcr.io/spark-operator/spark-operator
  tag: v1beta2-1.3.8-3.1.1
serviceAccounts:
  spark:
    create: true
    name: spark
  sparkoperator:
    create: true
    name: spark-operator
rbac:
  create: true
webhook:
  enable: true
  port: 8080
metrics:
  enable: true
  port: 10254
controllerThreads: 10
resyncInterval: 30
EOF
chmod 644 spark-operator-values.yaml
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update
helm upgrade --install spark-operator spark-operator/spark-operator \
-n spark-operator -f spark-operator-values.yaml --wait
# Install MinIO client and create buckets
echo -e "${BLUE}Setting up MinIO buckets...${NC}"
if ! command -v mc >/dev/null 2>&1; then
    wget https://dl.min.io/client/mc/release/linux-amd64/mc
    chmod 755 mc
    $SUDO mv mc /usr/local/bin/
fi
# Port forward and create buckets
kubectl port-forward svc/minio 9000:9000 -n minio >/dev/null 2>&1 &
PF_PID=$!
sleep 5
mc alias set local https://localhost:9000 admin "${MINIO_PASSWORD}" --insecure
mc mb local/spark-jars --ignore-existing --insecure
mc mb local/spark-data --ignore-existing --insecure
mc mb local/spark-output --ignore-existing --insecure
mc mb local/spark-logs --ignore-existing --insecure
kill $PF_PID 2>/dev/null || true
# Verification
echo -e "${BLUE}Verifying installation...${NC}"
kubectl get pods -n minio
kubectl get pods -n spark-operator
kubectl get svc -n minio
CLEANUP_ON_EXIT=false
echo -e "${GREEN}Installation completed successfully!${NC}"
echo -e "${GREEN}MinIO Domain: ${MINIO_DOMAIN}${NC}"
echo -e "${GREEN}MinIO Password: ${MINIO_PASSWORD}${NC}"
echo -e "${YELLOW}To access MinIO console: kubectl port-forward svc/minio-console 9001:9001 -n minio${NC}"
# Clean up temporary files
rm -f minio.key minio.crt minio-values.yaml spark-operator-values.yaml
Review the script before running. Execute with: bash install.sh