Set up a production-grade Apache Spark 3.5 cluster with YARN resource management and HDFS distributed storage for scalable big data processing. This tutorial covers multi-node Hadoop cluster configuration, YARN integration, and monitoring setup.
Prerequisites
- 4 GB RAM minimum per node
- Multiple servers (one master, one or more workers) for a multi-node cluster
- Java 11 (installation covered below)
- Network connectivity between cluster nodes
What this solves
Apache Spark with YARN and HDFS creates a powerful distributed computing platform for big data analytics, machine learning, and stream processing. YARN manages cluster resources efficiently while HDFS provides reliable distributed storage with automatic replication and fault tolerance.
Step-by-step installation
Create dedicated hadoop user
Create a system user for running Hadoop and Spark services with proper permissions.
sudo useradd -m -s /bin/bash hadoop
sudo passwd hadoop
sudo usermod -aG sudo hadoop
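For multi-node operation, the start-dfs.sh/start-yarn.sh helpers (and routine administration) rely on passwordless SSH from the hadoop user to every node, including localhost. A minimal sketch, assuming the hadoop user already exists on each node and password authentication is available for the initial key copy:
sudo -iu hadoop
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
ssh-copy-id hadoop@localhost
ssh-copy-id hadoop@datanode1   # repeat for datanode2 and datanode3
exit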
Install Java 11
Hadoop 3.4 and Spark 3.5 both run on Java 11, making it a reliable common runtime for the whole stack.
sudo apt update
sudo apt install -y openjdk-11-jdk
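Confirm the JDK installed correctly and note its home directory; on Ubuntu the openjdk-11-jdk package typically lands in the path the rest of this guide assumes:
java -version
readlink -f "$(which java)"   # typically /usr/lib/jvm/java-11-openjdk-amd64/bin/java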
Download and install Hadoop 3.4
Download Hadoop binaries and extract them to the standard installation directory.
cd /tmp
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
sudo tar -xzf hadoop-3.4.0.tar.gz -C /opt/
sudo mv /opt/hadoop-3.4.0 /opt/hadoop
sudo chown -R hadoop:hadoop /opt/hadoop
Download and install Apache Spark 3.5
Download Spark pre-built for Hadoop 3.3+ and configure it for YARN integration.
cd /tmp
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz -C /opt/
sudo mv /opt/spark-3.5.0-bin-hadoop3 /opt/spark
sudo chown -R hadoop:hadoop /opt/spark
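You can optionally verify both archives against the SHA-512 checksums Apache publishes alongside each release; the checksum file formats vary slightly, so compare the digests manually if sha512sum -c rejects one:
cd /tmp
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz.sha512
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz.sha512
sha512sum -c hadoop-3.4.0.tar.gz.sha512
sha512sum -c spark-3.5.0-bin-hadoop3.tgz.sha512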
Configure environment variables
Set Java, Hadoop, and Spark variables system-wide so every user and service sees them; the automated script at the end of this guide writes them to /etc/environment.
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
HADOOP_HOME=/opt/hadoop
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
SPARK_HOME=/opt/spark
YARN_CONF_DIR=/opt/hadoop/etc/hadoop
PATH=/opt/hadoop/bin:/opt/hadoop/sbin:/opt/spark/bin:/opt/spark/sbin:$PATH
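Export the variables in your current shell (or log out and back in once they are in /etc/environment), then confirm both stacks resolve:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 HADOOP_HOME=/opt/hadoop SPARK_HOME=/opt/spark
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
hadoop version
spark-submit --version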
Configure Hadoop environment
Set JAVA_HOME and the log directories in Hadoop's environment file, /opt/hadoop/etc/hadoop/hadoop-env.sh.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export HADOOP_LOG_DIR=/opt/hadoop/logs
export YARN_LOG_DIR=/opt/hadoop/logs
Configure core Hadoop settings
Edit /opt/hadoop/etc/hadoop/core-site.xml to set the default filesystem URI, the temporary directory, and proxy-user permissions for the hadoop user.
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/data/tmp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
</configuration>
Configure HDFS settings
Edit /opt/hadoop/etc/hadoop/hdfs-site.xml to set the NameNode and DataNode directories, the replication factor, and the web UI addresses.
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/data/datanode</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:9870</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>0.0.0.0:9868</value>
  </property>
</configuration>
Configure YARN resource manager
Edit /opt/hadoop/etc/hadoop/yarn-site.xml to configure resource management and job scheduling across the cluster.
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>0.0.0.0:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
</configuration>
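On memory-constrained nodes, Spark executors are sometimes killed by YARN's virtual-memory check even though physical memory use is fine. If you later see "running beyond virtual memory limits" errors, one optional property you can add to yarn-site.xml is:
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>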
Configure MapReduce settings
Edit /opt/hadoop/etc/hadoop/mapred-site.xml so MapReduce jobs run on the YARN framework.
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
</configuration>
Configure Spark for YARN integration
Configure Spark to use YARN as the cluster manager and enable the external shuffle service, in /opt/spark/conf/spark-defaults.conf.
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:9000/spark-logs
spark.history.fs.logDirectory hdfs://namenode:9000/spark-logs
spark.yarn.historyServer.address resourcemanager:18080
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
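These defaults apply to every job submitted from this node, and individual jobs can still override them at submit time. For example, once the cluster is up, the bundled SparkPi example with a higher executor cap:
sudo -u hadoop /opt/spark/bin/spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 100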
Configure Spark environment
Set up Spark's environment in /opt/spark/conf/spark-env.sh, including the Hadoop classpath and the History Server port.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080"
Create required directories
Create necessary directories for Hadoop data storage and logs with proper permissions.
sudo mkdir -p /opt/hadoop/data/namenode
sudo mkdir -p /opt/hadoop/data/datanode
sudo mkdir -p /opt/hadoop/data/tmp
sudo mkdir -p /opt/hadoop/logs
sudo chown -R hadoop:hadoop /opt/hadoop/data
sudo chown -R hadoop:hadoop /opt/hadoop/logs
sudo chmod 755 /opt/hadoop/data/namenode
sudo chmod 755 /opt/hadoop/data/datanode
sudo chmod 1777 /opt/hadoop/data/tmp
Configure cluster nodes
List the worker hostnames, one per line, in /opt/hadoop/etc/hadoop/workers; HDFS and the sbin start scripts use this file to find the DataNodes and NodeManagers. Hostname resolution is covered in the /etc/hosts sketch after the list.
datanode1
datanode2
datanode3
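Every hostname referenced in the configuration (namenode, resourcemanager, datanode1-3) must resolve on every node. If you are not running DNS, add entries to /etc/hosts on each machine; the addresses below are placeholders, substitute your own:
10.0.0.10  namenode resourcemanager
10.0.0.11  datanode1
10.0.0.12  datanode2
10.0.0.13  datanode3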
Copy Spark shuffle JAR to Hadoop
Copy the Spark YARN shuffle service JAR into Hadoop's classpath so the NodeManager can load the spark_shuffle auxiliary service and support dynamic allocation.
sudo cp /opt/spark/yarn/spark-*-yarn-shuffle.jar /opt/hadoop/share/hadoop/yarn/lib/
sudo chown hadoop:hadoop /opt/hadoop/share/hadoop/yarn/lib/spark-*-yarn-shuffle.jar
Initialize HDFS namenode
Format the namenode to initialize the HDFS filesystem. Only run this once.
sudo -u hadoop /opt/hadoop/bin/hdfs namenode -format -force
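If formatting succeeded, the metadata directory set in dfs.namenode.name.dir now contains a current/VERSION file recording the new clusterID, which is a quick way to confirm the step worked:
sudo -u hadoop cat /opt/hadoop/data/namenode/current/VERSION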
Create systemd service files
Create systemd units for the Hadoop components so they start automatically. Start with the NameNode unit in /etc/systemd/system/hadoop-namenode.service.
[Unit]
Description=Hadoop NameNode
After=network.target
[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop
Environment=HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
ExecStart=/opt/hadoop/bin/hdfs --daemon start namenode
ExecStop=/opt/hadoop/bin/hdfs --daemon stop namenode
Restart=on-failure
[Install]
WantedBy=multi-user.target
Create YARN ResourceManager service
Configure the systemd unit for the YARN ResourceManager in /etc/systemd/system/yarn-resourcemanager.service.
[Unit]
Description=YARN ResourceManager
After=network.target
[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop
Environment=YARN_CONF_DIR=/opt/hadoop/etc/hadoop
ExecStart=/opt/hadoop/bin/yarn --daemon start resourcemanager
ExecStop=/opt/hadoop/bin/yarn --daemon stop resourcemanager
Restart=on-failure
[Install]
WantedBy=multi-user.target
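Worker nodes need matching units for the DataNode and NodeManager. A minimal sketch for the DataNode (saved, for example, as /etc/systemd/system/hadoop-datanode.service, assuming the same install paths); mirror it for the NodeManager with yarn --daemon start nodemanager:
[Unit]
Description=Hadoop DataNode
After=network.target
[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop
Environment=HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
ExecStart=/opt/hadoop/bin/hdfs --daemon start datanode
ExecStop=/opt/hadoop/bin/hdfs --daemon stop datanode
Restart=on-failure
[Install]
WantedBy=multi-user.target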
Create Spark History Server service
Configure the systemd unit for the Spark History Server in /etc/systemd/system/spark-history-server.service so application history is retained.
[Unit]
Description=Spark History Server
After=network.target hadoop-namenode.service
Requires=hadoop-namenode.service
[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=SPARK_HOME=/opt/spark
Environment=HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
ExecStartPre=/opt/hadoop/bin/hdfs dfs -mkdir -p /spark-logs
ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-history-server.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
Enable and start services
Enable all Hadoop and Spark services to start automatically on boot.
sudo systemctl daemon-reload
sudo systemctl enable hadoop-namenode yarn-resourcemanager spark-history-server
sudo systemctl start hadoop-namenode
sudo systemctl start yarn-resourcemanager
sudo systemctl start spark-history-server
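If a unit fails to start, the reason is usually in the systemd journal or in Hadoop's own logs under /opt/hadoop/logs (filenames follow the hadoop-<user>-<daemon>-<hostname>.log pattern):
sudo journalctl -u hadoop-namenode --since "10 minutes ago"
sudo tail -n 50 /opt/hadoop/logs/hadoop-hadoop-namenode-*.log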
Configure firewall rules
Open necessary ports for Hadoop and Spark web interfaces and inter-node communication.
sudo ufw allow 9870/tcp comment 'HDFS NameNode Web UI'
sudo ufw allow 8088/tcp comment 'YARN ResourceManager Web UI'
sudo ufw allow 18080/tcp comment 'Spark History Server'
sudo ufw allow 4040/tcp comment 'Spark Application UI'
sudo ufw allow 9000/tcp comment 'HDFS NameNode IPC'
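These rules expose the ports to any source address. On hosts reachable from the internet, consider limiting them to the cluster network instead; a sketch assuming a 10.0.0.0/24 cluster subnet:
sudo ufw allow from 10.0.0.0/24 to any port 9000 proto tcp
sudo ufw allow from 10.0.0.0/24 to any port 8088 proto tcp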
Verify your setup
Check that all services are running and accessible through their web interfaces.
sudo systemctl status hadoop-namenode yarn-resourcemanager spark-history-server
sudo -u hadoop /opt/hadoop/bin/hdfs dfsadmin -report
/opt/spark/bin/spark-submit --version
curl -s 'http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus' | grep -o '"State":"[^"]*'
Test Spark integration with YARN by running a simple job:
sudo -u hadoop /opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client /opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 10
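To exercise HDFS and YARN together, write a small file into HDFS and run the bundled JavaWordCount example against it (the /user/hadoop path is just a convenient location for this test):
sudo -u hadoop /opt/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop
echo "hello spark hello yarn" | sudo -u hadoop /opt/hadoop/bin/hdfs dfs -put - /user/hadoop/words.txt
sudo -u hadoop /opt/spark/bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master yarn /opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar /user/hadoop/words.txt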
Access the web interfaces to monitor your cluster:
- HDFS NameNode: http://your-server:9870
- YARN ResourceManager: http://your-server:8088
- Spark History Server: http://your-server:18080
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| NameNode fails to start | Incorrect permissions on data directory | sudo chown -R hadoop:hadoop /opt/hadoop/data && sudo chmod 755 /opt/hadoop/data/namenode |
| Spark jobs fail with ClassNotFoundException | Missing Spark shuffle JAR in Hadoop | sudo cp /opt/spark/yarn/spark-*-yarn-shuffle.jar /opt/hadoop/share/hadoop/yarn/lib/ |
| YARN containers fail to start | Insufficient memory allocation | Increase yarn.nodemanager.resource.memory-mb in yarn-site.xml |
| Connection refused on port 9000 | NameNode not running or firewall blocking | sudo systemctl start hadoop-namenode && sudo ufw allow 9000/tcp |
| History server shows no applications | Event log directory not created | sudo -u hadoop hdfs dfs -mkdir -p /spark-logs |
Next steps
- Set up Spark 3.5 Delta Lake with MinIO for ACID transactions
- Configure MinIO with Apache Spark 3.5 for big data analytics
- Set up Prometheus and Grafana monitoring for your Spark cluster
- Implement Spark streaming with Kafka for real-time data processing
- Configure Spark security with Kerberos and SSL encryption
Automated install script
Run this script on the master node to automate the entire setup; it brings up a single-node cluster (replication factor 1), which you can extend with worker nodes afterwards.
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Default values
CLUSTER_NAME="${1:-hadoop-cluster}"
NAMENODE_HOST="${2:-$(hostname -f)}"
usage() {
echo "Usage: $0 [cluster_name] [namenode_hostname]"
echo " cluster_name: Name for the Hadoop cluster (default: hadoop-cluster)"
echo " namenode_hostname: FQDN of namenode (default: current hostname)"
exit 1
}
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] $1${NC}"
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING: $1${NC}"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR: $1${NC}"
exit 1
}
cleanup() {
warn "Installation failed. Cleaning up..."
systemctl stop hadoop-namenode hadoop-datanode yarn-resourcemanager yarn-nodemanager 2>/dev/null || true
userdel -r hadoop 2>/dev/null || true
rm -rf /opt/hadoop /opt/spark 2>/dev/null || true
}
trap cleanup ERR
# Check prerequisites
if [[ $EUID -ne 0 ]]; then
error "This script must be run as root"
fi
if [[ "$#" -gt 2 ]]; then
usage
fi
# Detect distribution
if [ -f /etc/os-release ]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_INSTALL="apt install -y"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
;;
almalinux|rocky|centos|rhel|ol|fedora)
PKG_MGR="dnf"
PKG_INSTALL="dnf install -y"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
;;
amzn)
PKG_MGR="yum"
PKG_INSTALL="yum install -y"
JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
;;
*)
error "Unsupported distribution: $ID"
;;
esac
else
error "Cannot detect distribution"
fi
log "[1/12] Updating system packages..."
if [[ "$PKG_MGR" == "apt" ]]; then
apt update
else
$PKG_INSTALL epel-release 2>/dev/null || true
fi
log "[2/12] Creating hadoop user..."
if ! id hadoop &>/dev/null; then
useradd -m -s /bin/bash hadoop
usermod -aG wheel hadoop 2>/dev/null || usermod -aG sudo hadoop
fi
log "[3/12] Installing Java 11..."
if [[ "$PKG_MGR" == "apt" ]]; then
$PKG_INSTALL openjdk-11-jdk wget curl
else
$PKG_INSTALL java-11-openjdk-devel wget curl
fi
log "[4/12] Downloading and installing Hadoop 3.4.0..."
cd /tmp
if [[ ! -f hadoop-3.4.0.tar.gz ]]; then
wget -q https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
fi
tar -xzf hadoop-3.4.0.tar.gz -C /opt/
mv /opt/hadoop-3.4.0 /opt/hadoop
chown -R hadoop:hadoop /opt/hadoop
log "[5/12] Downloading and installing Apache Spark 3.5.0..."
if [[ ! -f spark-3.5.0-bin-hadoop3.tgz ]]; then
wget -q https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
fi
tar -xzf spark-3.5.0-bin-hadoop3.tgz -C /opt/
mv /opt/spark-3.5.0-bin-hadoop3 /opt/spark
chown -R hadoop:hadoop /opt/spark
log "[6/12] Configuring environment variables..."
# /etc/environment is read by pam_env, which does not expand variables, so PATH is written out in full
cat > /etc/environment << EOF
JAVA_HOME=$JAVA_HOME
HADOOP_HOME=/opt/hadoop
HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
SPARK_HOME=/opt/spark
YARN_CONF_DIR=/opt/hadoop/etc/hadoop
PATH="/opt/hadoop/bin:/opt/hadoop/sbin:/opt/spark/bin:/opt/spark/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
EOF
cat > /opt/hadoop/etc/hadoop/hadoop-env.sh << EOF
export JAVA_HOME=$JAVA_HOME
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export HADOOP_LOG_DIR=/opt/hadoop/logs
export YARN_LOG_DIR=/opt/hadoop/logs
EOF
log "[7/12] Creating Hadoop data directories..."
mkdir -p /opt/hadoop/data/{tmp,namenode,datanode} /opt/hadoop/logs
chown -R hadoop:hadoop /opt/hadoop/data /opt/hadoop/logs
chmod 755 /opt/hadoop/data /opt/hadoop/logs
chmod 750 /opt/hadoop/data/{namenode,datanode}
log "[8/12] Configuring core Hadoop settings..."
cat > /opt/hadoop/etc/hadoop/core-site.xml << EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://$NAMENODE_HOST:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/data/tmp</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
</configuration>
EOF
log "[9/12] Configuring HDFS settings..."
cat > /opt/hadoop/etc/hadoop/hdfs-site.xml << EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/data/datanode</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>0.0.0.0:9870</value>
</property>
</configuration>
EOF
log "[10/12] Configuring YARN settings..."
cat > /opt/hadoop/etc/hadoop/yarn-site.xml << EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>$NAMENODE_HOST</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>0.0.0.0:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
</configuration>
EOF
cat > /opt/hadoop/etc/hadoop/mapred-site.xml << EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
EOF
log "[11/12] Setting up Spark configuration..."
cp /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf
cat >> /opt/spark/conf/spark-defaults.conf << EOF
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://$NAMENODE_HOST:9000/spark-logs
spark.yarn.jars hdfs://$NAMENODE_HOST:9000/spark-jars/*.jar
EOF
# yarn-site.xml registers the spark_shuffle aux service, so the NodeManager needs Spark's shuffle JAR on its classpath
cp /opt/spark/yarn/spark-*-yarn-shuffle.jar /opt/hadoop/share/hadoop/yarn/lib/
chown -R hadoop:hadoop /opt/hadoop/etc/hadoop /opt/spark/conf
log "[12/12] Initializing HDFS and starting services..."
sudo -u hadoop bash << EOF
source /etc/environment
export JAVA_HOME=$JAVA_HOME
cd /opt/hadoop
./bin/hdfs namenode -format -force
# Start the daemons directly; the start-dfs.sh/start-yarn.sh helpers require passwordless SSH, which this script does not configure
./bin/hdfs --daemon start namenode
./bin/hdfs --daemon start datanode
sleep 10
./bin/hdfs dfs -mkdir -p /spark-logs /spark-jars /user/hadoop
./bin/hdfs dfs -put /opt/spark/jars/* /spark-jars/
./bin/yarn --daemon start resourcemanager
./bin/yarn --daemon start nodemanager
EOF
# Configure firewall
if command -v ufw &>/dev/null; then
ufw allow 9000,9870,8088/tcp
elif command -v firewall-cmd &>/dev/null; then
firewall-cmd --permanent --add-port=9000/tcp --add-port=9870/tcp --add-port=8088/tcp
firewall-cmd --reload
fi
log "Verifying installation..."
sleep 5
if sudo -u hadoop /opt/hadoop/bin/hdfs dfsadmin -report | grep -q "Live datanodes"; then
log "✓ HDFS is running successfully"
else
warn "HDFS verification failed"
fi
if curl -s http://localhost:8088 | grep -q "ResourceManager"; then
log "✓ YARN ResourceManager is running"
else
warn "YARN verification failed"
fi
log "Apache Spark 3.5 cluster installation completed!"
log "HDFS Web UI: http://$NAMENODE_HOST:9870"
log "YARN Web UI: http://$NAMENODE_HOST:8088"
log "Switch to hadoop user: sudo -u hadoop -i"
Review the script before running. Execute as root, for example: sudo bash install.sh [cluster_name] [namenode_hostname]