Deploy Apache Spark on Kubernetes with the Spark Operator and MinIO object storage for scalable big data processing. Configure RBAC, SSL certificates, and persistent storage for production-ready analytics workloads.
Prerequisites
- Kubernetes cluster with at least 8GB RAM and 4 CPU cores
- kubectl configured with admin access
- Helm 3.x installed
- At least 100GB available storage for MinIO
What this solves
The Spark Kubernetes Operator automates the deployment and management of Apache Spark applications on Kubernetes clusters, while MinIO provides S3-compatible object storage for your data lake. This combination creates a cloud-native analytics platform that scales automatically and handles large datasets efficiently. You'll use this when you need to run distributed data processing workloads with automatic resource management and fault tolerance.
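Before diving into the steps, it helps to see the moving pieces in one place: every Spark-to-MinIO interaction in this guide boils down to a handful of Hadoop S3A settings. A minimal sketch follows — the key names are real S3A configuration keys, while the endpoint and credentials are this guide's example values:

```python
# Build the spark.hadoop.* settings Spark needs to treat MinIO as an
# S3-compatible endpoint. Key names are standard Hadoop S3A keys; the
# endpoint and credentials below are this guide's example values.
def s3a_conf(endpoint: str, access_key: str, secret_key: str, ssl: bool = True) -> dict:
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO serves buckets under the path (https://host/bucket),
        # not as virtual-hosted subdomains, so path-style access is required.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(ssl).lower(),
    }

conf = s3a_conf("https://minio.minio.svc.cluster.local:9000",
                "admin", "SecureMinIOPassword123!")
```

Each later section of this guide sets some subset of these keys, whether in a SparkApplication manifest or in the History Server's JVM options.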
Step-by-step installation
Install Kubernetes cluster requirements
Start with a working Kubernetes cluster and install Helm for managing chart deployments.
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gpg
Install kubectl and Helm
Install the Kubernetes command-line tool and Helm package manager for deploying the operator and MinIO.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
kubectl version --client
helm version
Create dedicated namespaces
Create separate namespaces for the Spark operator and MinIO to organize resources and apply security policies.
kubectl create namespace spark-operator
kubectl create namespace minio
kubectl create namespace spark-jobs
Install MinIO with SSL certificates
Deploy MinIO object storage with TLS encryption and persistent volume claims for data durability.
helm repo add minio https://charts.min.io/
helm repo update
Generate SSL certificates for MinIO
Create self-signed certificates for MinIO API and console access in development environments.
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
-subj "/C=US/ST=CA/L=San Francisco/O=Example/CN=minio.example.com" \
-keyout minio.key -out minio.crt
kubectl create secret tls minio-tls --key minio.key --cert minio.crt -n minio
Create MinIO configuration values
Configure MinIO with persistent storage, resource limits, and TLS for production workloads. The keys below follow the official minio/minio chart's values layout: rootUser/rootPassword sit at the top level, and TLS references the secret via certSecret with the tls.crt/tls.key key names that kubectl create secret tls produces.
rootUser: admin
rootPassword: SecureMinIOPassword123!
mode: standalone
persistence:
  enabled: true
  storageClass: ""
  size: 100Gi
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"
service:
  type: ClusterIP
  port: 9000
consoleService:
  type: ClusterIP
  port: 9001
tls:
  enabled: true
  certSecret: "minio-tls"
  publicCrt: tls.crt
  privateKey: tls.key
ingress:
  enabled: true
  ingressClassName: nginx
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
  path: /
  hosts:
    - minio.example.com
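The requests/limits pairs above are easy to get wrong by a suffix (Mi vs Gi, m vs whole cores). A small sketch for sanity-checking them locally before deploying — it handles only the binary (Ki/Mi/Gi/Ti) and milli suffixes used in this guide, not the full Kubernetes quantity grammar:

```python
# Parse the Kubernetes resource quantities used in this guide so that
# requests <= limits can be verified before running helm install.
def to_base_units(qty: str) -> float:
    """Convert '512Mi' -> bytes, '250m' -> cores, '2' -> 2.0."""
    suffixes = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "m": 1e-3}
    for suffix, factor in suffixes.items():
        if qty.endswith(suffix):
            return float(qty[: -len(suffix)]) * factor
    return float(qty)

# The values from minio-values.yaml above.
resources = {
    "requests": {"memory": "512Mi", "cpu": "250m"},
    "limits": {"memory": "2Gi", "cpu": "1000m"},
}
for key in ("memory", "cpu"):
    assert to_base_units(resources["requests"][key]) <= to_base_units(resources["limits"][key])
```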
Deploy MinIO with Helm
Install MinIO using the configuration values and verify the deployment status.
helm install minio minio/minio -n minio -f minio-values.yaml
kubectl get pods -n minio
kubectl get svc -n minio
Create MinIO buckets for Spark
Set up storage buckets for Spark application jars, data, and output using the MinIO client.
# Port forward to access MinIO API
kubectl port-forward svc/minio 9000:9000 -n minio &
Install MinIO client
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/
Configure MinIO alias
mc alias set minio https://localhost:9000 admin SecureMinIOPassword123! --insecure
Create buckets
mc mb minio/spark-jars --insecure
mc mb minio/spark-data --insecure
mc mb minio/spark-output --insecure
mc mb minio/spark-logs --insecure
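If mc hangs or errors, it is worth confirming the port-forward actually reaches MinIO first. MinIO exposes an unauthenticated liveness endpoint at /minio/health/live; a sketch of polling it from Python, with certificate verification disabled because this guide uses a self-signed certificate:

```python
# Poll MinIO's documented liveness endpoint through the port-forward.
import ssl
import urllib.error
import urllib.request

def minio_is_live(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET {base_url}/minio/health/live answers 200."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # self-signed cert in this guide
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(f"{base_url}/minio/health/live",
                                    timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

With the port-forward from the previous step active, minio_is_live("https://localhost:9000") should return True before you proceed to bucket creation.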
Install Spark Kubernetes Operator
Deploy the Spark operator using Helm with RBAC configuration and webhook settings. The chart below is the original GoogleCloudPlatform release used throughout this guide; the project has since moved to the Kubeflow organization, which publishes newer versions at https://kubeflow.github.io/spark-operator.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update
Create Spark operator configuration
Configure the operator with proper RBAC permissions and resource management settings.
image:
  repository: gcr.io/spark-operator/spark-operator
  tag: v1beta2-1.3.8-3.1.1
sparkJobNamespace: spark-jobs
controllerThreads: 10
resyncInterval: 30
webhook:
  enable: true
  port: 8080
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 300Mi
rbac:
  create: true
  createClusterRole: true
serviceAccounts:
  spark:
    create: true
    name: spark
  sparkoperator:
    create: true
    name: spark-operator
leaderElection:
  lockName: "spark-operator-lock"
  lockNamespace: "spark-operator"
Deploy Spark operator
Install the Spark operator and verify it's running correctly with webhook admission control.
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--values spark-operator-values.yaml
kubectl get pods -n spark-operator
kubectl get mutatingwebhookconfigurations
Create RBAC for Spark applications
Set up service accounts and role bindings for Spark driver and executor pods to access Kubernetes APIs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-role
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role-binding
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs
roleRef:
  kind: ClusterRole
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-jobs
kubectl apply -f spark-rbac.yaml
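The ClusterRole above grants pod-level permissions cluster-wide. If Spark jobs will only ever run in the spark-jobs namespace, a namespaced Role with the same grants is a tighter alternative — a sketch (the names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark-jobs
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: spark-jobs
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```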
Create MinIO access secrets
Store MinIO credentials as Kubernetes secrets for Spark applications to access object storage.
kubectl create secret generic minio-secret \
--from-literal=accesskey=admin \
--from-literal=secretkey=SecureMinIOPassword123! \
-n spark-jobs
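Kubernetes stores Secret values base64-encoded, which is what kubectl get secret minio-secret -o yaml will show under .data. A small sketch of round-tripping that encoding, useful when verifying the credentials landed intact:

```python
# Decode the .data map of a Kubernetes Secret back to plain strings,
# mirroring what kubectl does when it creates a secret from literals.
import base64

def decode_secret(data: dict) -> dict:
    return {k: base64.b64decode(v).decode() for k, v in data.items()}

# Example .data block as it would appear in the Secret's YAML.
encoded = {
    "accesskey": base64.b64encode(b"admin").decode(),
    "secretkey": base64.b64encode(b"SecureMinIOPassword123!").decode(),
}
print(decode_secret(encoded))
```

Note that base64 is an encoding, not encryption: anyone who can read the Secret object can read the credentials, which is why RBAC on the spark-jobs namespace matters.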
Create Spark application with MinIO integration
Deploy a sample SparkApplication configured for MinIO access over the S3A filesystem. The SparkPi example below does no data I/O of its own, but its event logs are written to the spark-logs bucket, exercising the S3A path end to end.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-minio-example
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: "apache/spark:3.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar"
  sparkVersion: "3.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: secretkey
  executor:
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 3.4.0
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: accesskey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: secretkey
  sparkConf:
    "spark.kubernetes.container.image.pullPolicy": "Always"
    "spark.hadoop.fs.s3a.endpoint": "https://minio.minio.svc.cluster.local:9000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.connection.ssl.enabled": "true"
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://spark-logs/"
    "spark.history.fs.logDirectory": "s3a://spark-logs/"
The S3A credentials come from the injected AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY environment variables (picked up by S3A's default credential provider chain) rather than plaintext sparkConf entries, and the endpoint uses HTTPS because MinIO was deployed with TLS. Since the certificate is self-signed, the Spark image's Java truststore must trust it (or TLS must be terminated elsewhere) for S3A calls to succeed.
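One caveat on the sample job: the stock apache/spark image does not bundle the S3A connector, so any s3a:// access (including the event-log directory above) fails with a ClassNotFoundException unless hadoop-aws and the matching AWS SDK bundle are on the classpath. A sketch of a custom image — the jar versions are assumptions to verify against the Hadoop version your Spark build ships (Spark 3.4.0 ships Hadoop 3.3.x client libraries):

```dockerfile
# Sketch only: hadoop-aws must match the Hadoop version bundled with the
# Spark image, and aws-java-sdk-bundle must match that hadoop-aws release.
FROM apache/spark:3.4.0
USER root
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar /opt/spark/jars/
# Return to the non-root spark user used by the apache/spark images.
USER 185
```

Build and push this image to a registry your cluster can pull from, then point spec.image at it instead of apache/spark:3.4.0.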
Deploy and monitor Spark application
Submit the Spark job and monitor its execution through Kubernetes and Spark UI.
kubectl apply -f spark-minio-job.yaml
Monitor the application
kubectl get sparkapplications -n spark-jobs
kubectl describe sparkapplication spark-minio-example -n spark-jobs
kubectl get pods -n spark-jobs
Configure Spark History Server
Deploy Spark History Server to view completed application logs and metrics stored in MinIO.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-history-server
  namespace: spark-jobs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-history-server
  template:
    metadata:
      labels:
        app: spark-history-server
    spec:
      containers:
        - name: spark-history-server
          image: apache/spark:3.4.0
          command: ["/opt/spark/bin/spark-class"]
          args: ["org.apache.spark.deploy.history.HistoryServer"]
          ports:
            - containerPort: 18080
          env:
            - name: SPARK_HISTORY_OPTS
              value: >-
                -Dspark.history.fs.logDirectory=s3a://spark-logs/
                -Dspark.hadoop.fs.s3a.endpoint=https://minio.minio.svc.cluster.local:9000
                -Dspark.hadoop.fs.s3a.access.key=admin
                -Dspark.hadoop.fs.s3a.secret.key=SecureMinIOPassword123!
                -Dspark.hadoop.fs.s3a.path.style.access=true
                -Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
                -Dspark.hadoop.fs.s3a.connection.ssl.enabled=true
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: spark-history-server
  namespace: spark-jobs
spec:
  selector:
    app: spark-history-server
  ports:
    - port: 18080
      targetPort: 18080
  type: ClusterIP
kubectl apply -f spark-history-server.yaml
Verify your setup
Check that all components are running and can communicate with each other.
# Verify MinIO is running
kubectl get pods -n minio
kubectl logs deployment/minio -n minio
Verify Spark operator is ready
kubectl get pods -n spark-operator
kubectl logs deployment/spark-operator -n spark-operator
Check Spark application status
kubectl get sparkapplications -n spark-jobs
kubectl describe sparkapplication spark-minio-example -n spark-jobs
Access MinIO console (port forward)
kubectl port-forward svc/minio-console 9001:9001 -n minio
Visit https://localhost:9001 (admin/SecureMinIOPassword123!); expect a browser warning because the certificate is self-signed.
Access Spark History Server
kubectl port-forward svc/spark-history-server 18080:18080 -n spark-jobs
Visit http://localhost:18080
Test MinIO connectivity from Spark
mc ls minio/spark-logs --insecure
Configure production optimizations
Enable resource quotas and limits
Set up resource quotas to prevent Spark jobs from consuming all cluster resources.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-jobs-quota
  namespace: spark-jobs
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    persistentvolumeclaims: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-jobs-limits
  namespace: spark-jobs
spec:
  limits:
    - default:
        cpu: "2"
        memory: "4Gi"
      defaultRequest:
        cpu: "100m"
        memory: "256Mi"
      type: Container
kubectl apply -f spark-resource-quota.yaml
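Quota sizing is easier with pod-level numbers in hand: each Spark pod requests its heap plus a memory overhead (by default max(10% of heap, 384 MiB) for JVM jobs), so raw heap figures undercount what the quota actually sees. A rough sketch of the arithmetic using this guide's example sizes:

```python
# Estimate how many executor pods fit under the namespace memory quota,
# accounting for Spark's per-pod memory overhead (default factor 0.10,
# floored at 384 MiB for JVM workloads).
def pod_memory_mib(heap_mib: int, overhead_factor: float = 0.10) -> int:
    """Memory a Spark pod requests: heap + max(factor * heap, 384) MiB."""
    return heap_mib + max(int(heap_mib * overhead_factor), 384)

quota_mib = 40 * 1024                 # requests.memory: 40Gi from the quota above
per_executor = pod_memory_mib(512)    # executors in this guide use 512m heap
print(per_executor)                   # 896 MiB requested per executor pod
print(quota_mib // per_executor)      # executor pods that fit by memory alone
```

CPU requests constrain the count the same way; whichever resource runs out first sets the real ceiling.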
Configure network policies
Implement network security policies to control traffic between namespaces and external access.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-jobs-netpol
  namespace: spark-jobs
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: spark-operator
    - from:
        - podSelector: {}
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: minio
      ports:
        - protocol: TCP
          port: 9000
    - to:
        - podSelector: {}
    # DNS resolution, allowed to any destination (no "to" clause).
    - ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
kubectl label namespace minio name=minio
kubectl label namespace spark-operator name=spark-operator
kubectl apply -f network-policies.yaml
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Spark operator pods failing | Insufficient RBAC permissions | kubectl get clusterrolebinding spark-operator and verify permissions |
| MinIO connection refused | Service discovery misconfiguration | Use full service DNS: minio.minio.svc.cluster.local:9000 |
| Spark apps stuck in pending | Resource limits exceeded | kubectl describe pod and check resource quotas |
| S3A authentication failures | Incorrect MinIO credentials | Verify secret values: kubectl get secret minio-secret -o yaml |
| Webhook admission failures | Certificate issues | kubectl get mutatingwebhookconfigurations and check TLS |
| History server can't read logs | MinIO bucket permissions | Check bucket policy: mc policy list minio/spark-logs |
Next steps
- Monitor Kubernetes cluster with Prometheus Operator for comprehensive observability
- Implement Kubernetes network policies with Calico CNI and OPA Gatekeeper for security enforcement
- Configure Spark Delta Lake integration with MinIO for ACID transactions
- Set up Spark Streaming with Kafka integration for real-time analytics
- Configure Spark Kubernetes autoscaling with KEDA for dynamic resource management
Automated install script
Run the following script to automate the entire setup.
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Default values
MINIO_DOMAIN="${1:-minio.local}"
MINIO_PASSWORD="${2:-SecureMinIOPassword123!}"
CLEANUP_ON_EXIT=false
# Usage function
usage() {
    echo "Usage: $0 [minio_domain] [minio_password]"
    echo "  minio_domain:   Domain for MinIO ingress (default: minio.local)"
    echo "  minio_password: MinIO admin password (default: SecureMinIOPassword123!)"
    echo "Example: $0 minio.example.com MySecurePassword123!"
    exit 1
}
# Validate arguments
if [[ $# -gt 2 ]]; then
    usage
fi
# Check if running as root or with sudo
if [[ $EUID -eq 0 ]]; then
    SUDO=""
elif command -v sudo >/dev/null 2>&1; then
    SUDO="sudo"
else
    echo -e "${RED}Error: This script requires root privileges or sudo${NC}"
    exit 1
fi
# Detect distribution
if [ -f /etc/os-release ]; then
    . /etc/os-release
    case "$ID" in
        ubuntu|debian)
            PKG_MGR="apt"
            PKG_UPDATE="apt update && apt upgrade -y"
            PKG_INSTALL="apt install -y"
            ;;
        almalinux|rocky|centos|rhel|ol|fedora)
            PKG_MGR="dnf"
            PKG_UPDATE="dnf update -y"
            PKG_INSTALL="dnf install -y"
            ;;
        amzn)
            PKG_MGR="yum"
            PKG_UPDATE="yum update -y"
            PKG_INSTALL="yum install -y"
            ;;
        *)
            echo -e "${RED}Unsupported distribution: $ID${NC}"
            exit 1
            ;;
    esac
else
    echo -e "${RED}Cannot detect distribution${NC}"
    exit 1
fi
# Cleanup function
cleanup() {
    if [[ "$CLEANUP_ON_EXIT" == "true" ]]; then
        echo -e "${YELLOW}Cleaning up due to error...${NC}"
        kubectl delete namespace spark-operator --ignore-not-found=true 2>/dev/null || true
        kubectl delete namespace minio --ignore-not-found=true 2>/dev/null || true
        kubectl delete namespace spark-jobs --ignore-not-found=true 2>/dev/null || true
        rm -f kubectl helm mc minio.key minio.crt minio-values.yaml spark-operator-values.yaml
    fi
}
trap cleanup ERR
echo -e "${BLUE}Spark Kubernetes Operator with MinIO Installation${NC}"
echo "=================================================="
# Check prerequisites
echo -e "${BLUE}[1/10] Checking prerequisites...${NC}"
if ! command -v docker >/dev/null 2>&1 && ! command -v podman >/dev/null 2>&1; then
    echo -e "${RED}Error: Docker or Podman is required${NC}"
    exit 1
fi
# Update system packages
echo -e "${BLUE}[2/10] Updating system packages...${NC}"
$SUDO sh -c "$PKG_UPDATE"   # run via sh -c so the '&&' in PKG_UPDATE is interpreted as an operator, not a literal argument
$SUDO $PKG_INSTALL curl wget gpg openssl
CLEANUP_ON_EXIT=true
# Install kubectl
echo -e "${BLUE}[3/10] Installing kubectl...${NC}"
if ! command -v kubectl >/dev/null 2>&1; then
    KUBECTL_VERSION=$(curl -L -s https://dl.k8s.io/release/stable.txt)
    curl -LO "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
    chmod 755 kubectl
    $SUDO mv kubectl /usr/local/bin/
fi
# Install Helm
echo -e "${BLUE}[4/10] Installing Helm...${NC}"
if ! command -v helm >/dev/null 2>&1; then
    curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
fi
# Verify Kubernetes connection
echo -e "${BLUE}[5/10] Verifying Kubernetes connection...${NC}"
if ! kubectl cluster-info >/dev/null 2>&1; then
    echo -e "${RED}Error: Cannot connect to Kubernetes cluster${NC}"
    exit 1
fi
# Create namespaces
echo -e "${BLUE}[6/10] Creating namespaces...${NC}"
kubectl create namespace spark-operator --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace minio --dry-run=client -o yaml | kubectl apply -f -
kubectl create namespace spark-jobs --dry-run=client -o yaml | kubectl apply -f -
# Generate SSL certificates for MinIO
echo -e "${BLUE}[7/10] Generating SSL certificates...${NC}"
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 \
-subj "/C=US/ST=CA/L=San Francisco/O=Example/CN=${MINIO_DOMAIN}" \
-keyout minio.key -out minio.crt
chmod 600 minio.key
chmod 644 minio.crt
kubectl create secret tls minio-tls --key minio.key --cert minio.crt -n minio --dry-run=client -o yaml | kubectl apply -f -
# Create MinIO values file
echo -e "${BLUE}[8/10] Creating MinIO configuration...${NC}"
cat > minio-values.yaml << EOF
rootUser: admin
rootPassword: ${MINIO_PASSWORD}
mode: standalone
persistence:
  enabled: true
  size: 100Gi
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "2Gi"
    cpu: "1000m"
service:
  type: ClusterIP
  port: 9000
consoleService:
  type: ClusterIP
  port: 9001
tls:
  enabled: true
  certSecret: "minio-tls"
  publicCrt: tls.crt
  privateKey: tls.key
EOF
chmod 644 minio-values.yaml
# Install MinIO
echo -e "${BLUE}[9/10] Installing MinIO...${NC}"
helm repo add minio https://charts.min.io/
helm repo update
helm upgrade --install minio minio/minio -n minio -f minio-values.yaml --wait
# Install Spark Operator
echo -e "${BLUE}[10/10] Installing Spark Operator...${NC}"
cat > spark-operator-values.yaml << EOF
image:
  repository: gcr.io/spark-operator/spark-operator
  tag: v1beta2-1.3.8-3.1.1
serviceAccounts:
  spark:
    create: true
    name: spark
  sparkoperator:
    create: true
    name: spark-operator
rbac:
  create: true
webhook:
  enable: true
  port: 8080
metrics:
  enable: true
  port: 10254
controllerThreads: 10
resyncInterval: 30
EOF
chmod 644 spark-operator-values.yaml
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update
helm upgrade --install spark-operator spark-operator/spark-operator \
-n spark-operator -f spark-operator-values.yaml --wait
# Install MinIO client and create buckets
echo -e "${BLUE}Setting up MinIO buckets...${NC}"
if ! command -v mc >/dev/null 2>&1; then
    wget https://dl.min.io/client/mc/release/linux-amd64/mc
    chmod 755 mc
    $SUDO mv mc /usr/local/bin/
fi
# Port forward and create buckets
kubectl port-forward svc/minio 9000:9000 -n minio >/dev/null 2>&1 &
PF_PID=$!
sleep 5
mc alias set local https://localhost:9000 admin "${MINIO_PASSWORD}" --insecure
mc mb local/spark-jars --ignore-existing --insecure
mc mb local/spark-data --ignore-existing --insecure
mc mb local/spark-output --ignore-existing --insecure
mc mb local/spark-logs --ignore-existing --insecure
kill $PF_PID 2>/dev/null || true
# Verification
echo -e "${BLUE}Verifying installation...${NC}"
kubectl get pods -n minio
kubectl get pods -n spark-operator
kubectl get svc -n minio
CLEANUP_ON_EXIT=false
echo -e "${GREEN}Installation completed successfully!${NC}"
echo -e "${GREEN}MinIO Domain: ${MINIO_DOMAIN}${NC}"
echo -e "${GREEN}MinIO Password: ${MINIO_PASSWORD}${NC}"
echo -e "${YELLOW}To access MinIO console: kubectl port-forward svc/minio-console 9001:9001 -n minio${NC}"
# Clean up temporary files
rm -f minio.key minio.crt minio-values.yaml spark-operator-values.yaml
Review the script before running. Execute with: bash install.sh