Deploy an Apache Spark 3.5 cluster with YARN and HDFS for distributed computing

Advanced · 45 min · Apr 09, 2026
Ubuntu 24.04 Debian 12 AlmaLinux 9 Rocky Linux 9

Set up a production-grade Apache Spark 3.5 cluster with YARN resource management and HDFS distributed storage for scalable big data processing. This tutorial covers multi-node Hadoop cluster configuration, YARN integration, and monitoring setup.

Prerequisites

  • 4 GB RAM minimum per node
  • Multiple servers for the cluster (this tutorial assumes one master and three workers)
  • OpenJDK 11
  • Network connectivity and hostname resolution between cluster nodes

What this solves

Apache Spark with YARN and HDFS creates a powerful distributed computing platform for big data analytics, machine learning, and stream processing. YARN manages cluster resources efficiently while HDFS provides reliable distributed storage with automatic replication and fault tolerance.

Step-by-step installation

Create dedicated hadoop user

Create a system user for running Hadoop and Spark services with proper permissions.

sudo useradd -m -s /bin/bash hadoop
sudo passwd hadoop
sudo usermod -aG sudo hadoop
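Hadoop's start scripts and day-to-day administration assume the hadoop user can reach every node over passwordless SSH. A minimal sketch (the temp directory stands in for /home/hadoop/.ssh so it runs anywhere; the datanode hostnames are the ones used later in this tutorial):

```shell
# Generate an SSH key pair for the hadoop user.
# In production, write to /home/hadoop/.ssh instead of a temp dir.
SSH_DIR=$(mktemp -d)
ssh-keygen -q -t ed25519 -N '' -f "$SSH_DIR/id_ed25519"

# On a real cluster, push the public key to every worker:
#   for host in datanode1 datanode2 datanode3; do
#       sudo -u hadoop ssh-copy-id -i "$SSH_DIR/id_ed25519.pub" "hadoop@$host"
#   done
ls "$SSH_DIR"
```

Repeat the key distribution whenever you add a worker node.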

Install Java 11

Hadoop 3.4 and Spark 3.5 both run well on Java 11, which this tutorial uses throughout.

# Debian/Ubuntu
sudo apt update
sudo apt install -y openjdk-11-jdk

# AlmaLinux/Rocky
sudo dnf update -y
sudo dnf install -y java-11-openjdk-devel

On RHEL-family systems the JDK lands in /usr/lib/jvm/java-11-openjdk; adjust the JAVA_HOME paths in later steps accordingly.
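To confirm which JVM is active and derive the JAVA_HOME value used in later steps, you can follow the `java` symlink chain (a generic sketch, not specific to Hadoop):

```shell
# Print the active JVM and resolve its installation root; the derivation
# follows the usual symlink chain (java -> /etc/alternatives -> JDK).
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | head -n 1
    JAVA_HOME=$(dirname "$(dirname "$(readlink -f "$(command -v java)")")")
    echo "JAVA_HOME=$JAVA_HOME"
else
    echo "java not found on PATH"
fi
```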

Download and install Hadoop 3.4

Download Hadoop binaries and extract them to the standard installation directory.

cd /tmp
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
sudo tar -xzf hadoop-3.4.0.tar.gz -C /opt/
sudo mv /opt/hadoop-3.4.0 /opt/hadoop
sudo chown -R hadoop:hadoop /opt/hadoop

Download and install Apache Spark 3.5

Download Spark pre-built for Hadoop 3.3+ and configure it for YARN integration.

cd /tmp
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
sudo tar -xzf spark-3.5.0-bin-hadoop3.tgz -C /opt/
sudo mv /opt/spark-3.5.0-bin-hadoop3 /opt/spark
sudo chown -R hadoop:hadoop /opt/spark

Configure environment variables

Add the following to /etc/profile.d/hadoop.sh so the Java, Hadoop, and Spark variables apply to all users and services.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_HOME=/opt/spark
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
export PATH=/opt/hadoop/bin:/opt/hadoop/sbin:/opt/spark/bin:/opt/spark/sbin:$PATH

Configure Hadoop environment

Append the following to /opt/hadoop/etc/hadoop/hadoop-env.sh so the Hadoop daemons find Java and their log directories.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export HADOOP_LOG_DIR=/opt/hadoop/logs
export YARN_LOG_DIR=/opt/hadoop/logs

Configure core Hadoop settings

Add the following to /opt/hadoop/etc/hadoop/core-site.xml to set the filesystem URI and temporary directory. Replace namenode with your master node's hostname.

<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/data/tmp</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
</configuration>

Configure HDFS settings

Add the following to /opt/hadoop/etc/hadoop/hdfs-site.xml to set the namenode and datanode directories and the replication factor.

<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/data/datanode</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>0.0.0.0:9870</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>0.0.0.0:9868</value>
    </property>
</configuration>

Configure YARN resource manager

Add the following to /opt/hadoop/etc/hadoop/yarn-site.xml for resource management and job scheduling. The resourcemanager hostname must resolve on every node.

<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
        <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>resourcemanager</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>resourcemanager:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>0.0.0.0:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>512</value>
    </property>
</configuration>

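The 4096 MB / 4 vcore limits should reflect what the worker can actually spare. A common rule of thumb (an assumption, not an official Hadoop formula) is to reserve roughly 20% of physical RAM for the OS and the DataNode process and hand YARN the rest:

```shell
# Derive yarn.nodemanager.resource.memory-mb from physical RAM,
# holding back 20% for the OS and HDFS daemons.
TOTAL_MB=5120        # e.g. a worker with 5 GB of RAM
RESERVED_PCT=20
YARN_MB=$(( TOTAL_MB * (100 - RESERVED_PCT) / 100 ))
echo "yarn.nodemanager.resource.memory-mb = $YARN_MB"
```

On a 5 GB worker this yields the 4096 MB used above; recompute per node class if your workers differ.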
Configure MapReduce settings

Add the following to /opt/hadoop/etc/hadoop/mapred-site.xml so MapReduce jobs run on the YARN framework.

<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
    </property>
</configuration>

Configure Spark for YARN integration

Add the following to /opt/spark/conf/spark-defaults.conf (copy spark-defaults.conf.template if the file does not exist) to use YARN as the cluster manager and enable the external shuffle service with dynamic allocation.

spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:9000/spark-logs
spark.history.fs.logDirectory hdfs://namenode:9000/spark-logs
spark.yarn.historyServer.address resourcemanager:18080
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
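With dynamic allocation enabled, the executor count floats between the min/max bounds above, but each executor still has to fit inside a NodeManager. A rough capacity check against the YARN limits configured earlier (an illustrative calculation; 384 MB is Spark's default minimum memory overhead per executor):

```shell
# How many 1 GB executors fit per 4 GB NodeManager, allowing ~384 MB
# of YARN memory overhead per executor.
NODE_MB=4096
EXECUTOR_MB=1024
OVERHEAD_MB=384
PER_NODE=$(( NODE_MB / (EXECUTOR_MB + OVERHEAD_MB) ))
echo "executors per node: $PER_NODE"
```

If spark.dynamicAllocation.maxExecutors exceeds (executors per node × worker count), the extra executors simply wait in the YARN queue.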

Configure Spark environment

Append the following to /opt/spark/conf/spark-env.sh (copy spark-env.sh.template if needed) to set Spark's environment variables and classpath.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080"

Create required directories

Create necessary directories for Hadoop data storage and logs with proper permissions.

sudo mkdir -p /opt/hadoop/data/namenode
sudo mkdir -p /opt/hadoop/data/datanode
sudo mkdir -p /opt/hadoop/data/tmp
sudo mkdir -p /opt/hadoop/logs
sudo chown -R hadoop:hadoop /opt/hadoop/data
sudo chown -R hadoop:hadoop /opt/hadoop/logs
sudo chmod 755 /opt/hadoop/data/namenode
sudo chmod 755 /opt/hadoop/data/datanode
sudo chmod 1777 /opt/hadoop/data/tmp
Avoid chmod 777: it gives every user on the system full access. Fix ownership with chown and use minimal permissions such as 755 for directories and 644 for files. The sticky-bit mode 1777 above is the standard exception for shared temp directories.

Configure cluster nodes

List the worker hostnames, one per line, in /opt/hadoop/etc/hadoop/workers, and make sure every hostname resolves from every node (via DNS or /etc/hosts).

datanode1
datanode2
datanode3
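The list above can be written in one step; a sketch (the temp file stands in for /opt/hadoop/etc/hadoop/workers so it runs anywhere, and the IP addresses in the hosts entries are placeholders):

```shell
# Write the worker list; in production target /opt/hadoop/etc/hadoop/workers.
WORKERS_FILE=$(mktemp)
printf '%s\n' datanode1 datanode2 datanode3 > "$WORKERS_FILE"
cat "$WORKERS_FILE"

# Matching /etc/hosts entries on every node would look like:
#   192.168.1.11  datanode1
#   192.168.1.12  datanode2
#   192.168.1.13  datanode3
```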

Copy Spark shuffle JAR to Hadoop

Copy the Spark YARN shuffle service JAR to enable dynamic allocation.

sudo cp /opt/spark/yarn/spark-*-yarn-shuffle.jar /opt/hadoop/share/hadoop/yarn/lib/
sudo chown hadoop:hadoop /opt/hadoop/share/hadoop/yarn/lib/spark-*-yarn-shuffle.jar

Initialize HDFS namenode

Format the namenode to initialize the HDFS filesystem. Only run this once.

sudo -u hadoop /opt/hadoop/bin/hdfs namenode -format -nonInteractive

Create systemd service files

Create /etc/systemd/system/hadoop-namenode.service so the NameNode starts automatically. Worker nodes need analogous units for the DataNode and NodeManager daemons.

[Unit]
Description=Hadoop NameNode
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop
Environment=HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
ExecStart=/opt/hadoop/bin/hdfs --daemon start namenode
ExecStop=/opt/hadoop/bin/hdfs --daemon stop namenode
Restart=on-failure

[Install]
WantedBy=multi-user.target

Create YARN ResourceManager service

Create /etc/systemd/system/yarn-resourcemanager.service for the YARN ResourceManager component.

[Unit]
Description=YARN ResourceManager
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop
Environment=YARN_CONF_DIR=/opt/hadoop/etc/hadoop
ExecStart=/opt/hadoop/bin/yarn --daemon start resourcemanager
ExecStop=/opt/hadoop/bin/yarn --daemon stop resourcemanager
Restart=on-failure

[Install]
WantedBy=multi-user.target

Create Spark History Server service

Create /etc/systemd/system/spark-history-server.service so the Spark History Server tracks completed applications.

[Unit]
Description=Spark History Server
After=network.target hadoop-namenode.service
Requires=hadoop-namenode.service

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=SPARK_HOME=/opt/spark
Environment=HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
ExecStartPre=/opt/hadoop/bin/hdfs dfs -mkdir -p /spark-logs
ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-history-server.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable and start services

Enable all Hadoop and Spark services to start automatically on boot.

sudo systemctl daemon-reload
sudo systemctl enable hadoop-namenode yarn-resourcemanager spark-history-server
sudo systemctl start hadoop-namenode
sudo systemctl start yarn-resourcemanager
sudo systemctl start spark-history-server
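Daemons can take a few seconds to bind their ports after systemd reports them started. A small helper to wait for the NameNode RPC port before submitting jobs (a generic sketch using bash's /dev/tcp, not part of Hadoop; host and port match this tutorial's core-site.xml):

```shell
# Poll a TCP port until it opens or we run out of tries (1s apart).
wait_for_port() {
    local host=$1 port=$2 tries=${3:-30}
    for _ in $(seq "$tries"); do
        (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null && { exec 3>&-; return 0; }
        sleep 1
    done
    return 1
}

# 9000 is the NameNode RPC port from core-site.xml.
wait_for_port localhost 9000 5 && echo "NameNode RPC is up" || echo "NameNode RPC not reachable"
```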

Configure firewall rules

Open necessary ports for Hadoop and Spark web interfaces and inter-node communication.

# Debian/Ubuntu (ufw)
sudo ufw allow 9870/tcp comment 'HDFS NameNode Web UI'
sudo ufw allow 8088/tcp comment 'YARN ResourceManager Web UI'
sudo ufw allow 18080/tcp comment 'Spark History Server'
sudo ufw allow 4040/tcp comment 'Spark Application UI'
sudo ufw allow 9000/tcp comment 'HDFS NameNode IPC'

# AlmaLinux/Rocky (firewalld)
sudo firewall-cmd --permanent --add-port=9870/tcp --add-port=8088/tcp --add-port=18080/tcp --add-port=4040/tcp --add-port=9000/tcp
sudo firewall-cmd --reload

Verify your setup

Check that all services are running and accessible through their web interfaces.

sudo systemctl status hadoop-namenode yarn-resourcemanager spark-history-server
/opt/hadoop/bin/hdfs dfsadmin -report
/opt/spark/bin/spark-submit --version
curl -s 'http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo' | grep -o '"State":"[^"]*'

Test Spark integration with YARN by running a simple job:

sudo -u hadoop /opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  /opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 10

Access the web interfaces to monitor your cluster:

  • HDFS NameNode: http://your-server:9870
  • YARN ResourceManager: http://your-server:8088
  • Spark History Server: http://your-server:18080

Common issues

Symptom: NameNode fails to start
Cause: Incorrect permissions on the data directory
Fix: sudo chown -R hadoop:hadoop /opt/hadoop/data && sudo chmod 755 /opt/hadoop/data/namenode

Symptom: Spark jobs fail with ClassNotFoundException
Cause: Spark shuffle JAR missing from Hadoop's classpath
Fix: sudo cp /opt/spark/yarn/spark-*-yarn-shuffle.jar /opt/hadoop/share/hadoop/yarn/lib/

Symptom: YARN containers fail to start
Cause: Insufficient memory allocation
Fix: Increase yarn.nodemanager.resource.memory-mb in yarn-site.xml

Symptom: Connection refused on port 9000
Cause: NameNode not running or firewall blocking
Fix: sudo systemctl start hadoop-namenode && sudo ufw allow 9000/tcp

Symptom: History server shows no applications
Cause: Event log directory not created in HDFS
Fix: sudo -u hadoop hdfs dfs -mkdir -p /spark-logs
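When none of these entries match your symptom, the daemon logs usually hold the answer. A small helper to surface recent errors (a sketch; the default path matches this tutorial's HADOOP_LOG_DIR):

```shell
# Scan the Hadoop log directory for recent ERROR/FATAL lines.
hadoop_errors() {
    local logdir=${1:-/opt/hadoop/logs}
    grep -rhE 'ERROR|FATAL' "$logdir" 2>/dev/null | tail -n 20
}
hadoop_errors
```

For services started via systemd, `journalctl -u hadoop-namenode` shows startup failures that never reach the Hadoop log files.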
