Setting Up Apache Spark Cluster

Kuan-Chih Wang
Sep 29, 2020

Step-by-step notes on setting up an Apache Spark cluster based on Hadoop

These are step-by-step notes on setting up an Apache Spark cluster for team projects, with help from Mr. Steven Chiu, Instructor at the Institute for Information Industry, Taiwan.

Spark is probably one of the most popular big data processing platforms in use today. To manage the vast amount of data we collect for our team projects, we decided to use the Hadoop and Spark platforms for further data processing.

This note demonstrates setting up a Spark cluster on an Ubuntu Linux system.

First of all, we need to have all the required installation packages prepared for the working environment. In our class, for example, we prepared JDK (1.8.0), Scala (2.12.11), Anaconda (2020.02-Linux-x86_64) for Python 3.7, Hadoop (2.10.0), and Spark (2.4.5), all stored in the ~/Downloads directory for demonstration purposes.

Before we start, make sure each machine has a static IP address and a distinct hostname. You can change a machine's hostname by editing its /etc/hostname file.

Then, on every machine, make sure the contents of the /etc/hosts file are identical, for example as shown below.
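
A minimal example, assuming two machines with hypothetical hostnames master and worker1 and made-up static IPs; every machine's /etc/hosts should contain the same entries:

```
127.0.0.1       localhost
192.168.1.100   master
192.168.1.101   worker1
```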

First, start by installing the Java JDK.

In a terminal, the bash commands go as follows:

Unzip the installation package.
Move it to the ~ directory for demonstration purposes (a sketch of both steps follows).
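
A minimal sketch, assuming the JDK 8 tarball is named jdk-8u251-linux-x64.tar.gz and unpacks into jdk1.8.0_251 (adjust the names to the file you actually downloaded):

```bash
cd ~/Downloads
# unpack the JDK archive
tar -xzf jdk-8u251-linux-x64.tar.gz
# move it to the home directory for demonstration purposes
mv jdk1.8.0_251 ~/
```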

Second, install the Scala SDK.

In the terminal, the steps are as follows:

Unzip the installation package.
Move it to the ~ directory for demonstration purposes (a sketch of both steps follows).
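
Assuming the Scala 2.12.11 tarball from scala-lang.org is named scala-2.12.11.tgz, the commands might look like this:

```bash
cd ~/Downloads
# unpack the Scala archive
tar -xzf scala-2.12.11.tgz
# move it to the home directory for demonstration purposes
mv scala-2.12.11 ~/
```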

Third, install Anaconda3 for Python 3.7.

The steps are as follows:

Make the installer executable (change its mode).
Execute it, agree to the license, and go with the default settings for demonstration purposes (see the sketch below).
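
A sketch, assuming the installer is named Anaconda3-2020.02-Linux-x86_64.sh:

```bash
cd ~/Downloads
# make the installer executable
chmod +x Anaconda3-2020.02-Linux-x86_64.sh
# run it, agree to the license, and keep the default settings
./Anaconda3-2020.02-Linux-x86_64.sh
```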

Fourth, we will install Hadoop.

The steps are as follows:

Unzip the package.
Move it to the ~ directory for demonstration purposes, then go to the ~/hadoop-2.10.0/etc/hadoop/ directory (a sketch of these steps follows).
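
A sketch, assuming the archive is named hadoop-2.10.0.tar.gz:

```bash
cd ~/Downloads
# unpack Hadoop
tar -xzf hadoop-2.10.0.tar.gz
# move it to the home directory for demonstration purposes
mv hadoop-2.10.0 ~/
# the configuration files live here
cd ~/hadoop-2.10.0/etc/hadoop/
```
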
Edit core-site.xml: go to the bottom and insert a snippet like the one below, make sure to change the master hostname to your own, then save it.
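
A minimal example of the property block to place inside the <configuration> tags; the hostname master and port 9000 are assumptions and should be replaced with your own:

```xml
<!-- inside <configuration> ... </configuration> -->
<property>
  <name>fs.defaultFS</name>
  <!-- replace "master" with your master hostname -->
  <value>hdfs://master:9000</value>
</property>
```
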
Edit hdfs-site.xml: go to the bottom and insert a snippet like the one below to specify your namenode and datanode directory paths. Later, when you create the namenode directory on the master and the datanode directories on the workers, make sure those paths match what you type here.
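
A sketch of the properties, assuming the namenode and datanode directories will live under the Hadoop folder in your home directory (replace /home/your_user with your actual home path; the locations are just for demonstration):

```xml
<!-- inside <configuration> ... </configuration> -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <!-- namenode directory on the master -->
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/your_user/hadoop-2.10.0/namenode</value>
</property>
<property>
  <!-- datanode directory on the workers -->
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/your_user/hadoop-2.10.0/datanode</value>
</property>
```
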
Edit the slaves file: type in the hostname of every worker machine. In this example, I only set up one worker.
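
With a single worker whose hypothetical hostname is worker1, the slaves file is just one line (add one line per additional worker):

```
worker1
```
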
On the master machine, create a “namenode” directory; on the worker machine, create a “datanode” directory, as shown below.
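
For example, matching the hypothetical paths used in hdfs-site.xml above:

```bash
# on the master machine
mkdir ~/hadoop-2.10.0/namenode

# on the worker machine
mkdir ~/hadoop-2.10.0/datanode
```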

Fifth, set up Spark.

The steps are as follows:

Unzip the installation package.
Move it to the ~/ directory.
Go to the conf/ directory and make a copy of the spark-env.sh template (a sketch of these steps follows).
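
A sketch of these steps, assuming the prebuilt package spark-2.4.5-bin-hadoop2.7.tgz (the exact name depends on the build you downloaded):

```bash
cd ~/Downloads
# unpack Spark
tar -xzf spark-2.4.5-bin-hadoop2.7.tgz
# move it to the home directory for demonstration purposes
mv spark-2.4.5-bin-hadoop2.7 ~/
# create spark-env.sh from the template
cd ~/spark-2.4.5-bin-hadoop2.7/conf/
cp spark-env.sh.template spark-env.sh
```
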
Edit spark-env.sh: at the bottom, insert settings like the ones below.
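
A sketch of typical settings; the paths and the master hostname are assumptions carried over from the earlier steps, so adjust them to your own setup:

```bash
# hypothetical install paths; adjust to your machines
export JAVA_HOME=/home/your_user/jdk1.8.0_251
export HADOOP_CONF_DIR=/home/your_user/hadoop-2.10.0/etc/hadoop
export PYSPARK_PYTHON=/home/your_user/anaconda3/bin/python
# hostname of the Spark master
export SPARK_MASTER_HOST=master
```
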
Make a copy of the slaves template.
Edit slaves: at the bottom, specify all of your worker hostnames (see the sketch below).
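
For example, still inside the conf/ directory and with the hypothetical worker hostname worker1:

```bash
# create slaves from the template
cp slaves.template slaves
# list every worker hostname, one per line
echo "worker1" >> slaves
```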

Finally, edit the .bashrc file to set the environment variables.

Go to the home directory ~ and edit the .bashrc file.
At the bottom, add environment variables like the ones shown below.
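
A sketch of the variables, again using the hypothetical install paths from the steps above:

```bash
export JAVA_HOME=~/jdk1.8.0_251
export SCALA_HOME=~/scala-2.12.11
export HADOOP_HOME=~/hadoop-2.10.0
export SPARK_HOME=~/spark-2.4.5-bin-hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin
```

After saving, run source ~/.bashrc (or open a new terminal) so the variables take effect.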

Up to this point, we have only set up the necessary configuration on the master machine. You will need to repeat all of the steps above on every worker machine and on your UI machine (if you have one).

After finishing the setup on your other machines, go back to the master machine and get ready to start up Hadoop and Spark.

Starting Up Hadoop

Format the namenode directory.
Go to the Hadoop sbin directory.
Start up the Hadoop cluster (a sketch of these commands follows).
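
A minimal sketch, assuming Hadoop was installed under ~/hadoop-2.10.0 as above:

```bash
# format the namenode directory (first time only)
~/hadoop-2.10.0/bin/hdfs namenode -format
# go to the sbin directory and start HDFS on the master and workers
cd ~/hadoop-2.10.0/sbin
./start-dfs.sh    # ./start-all.sh would also start YARN
```
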
Open a web browser and enter “your_master_hostname:50070” in the URL bar, then check the Hadoop overview page. If there is a live node, your Hadoop cluster is good to go.

Starting Up Spark

Go to the sbin directory inside the Spark directory.
Execute ./start-all.sh (as shown below).
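
Assuming the same Spark install path as above:

```bash
cd ~/spark-2.4.5-bin-hadoop2.7/sbin
# starts the master here and a worker on every host listed in conf/slaves
./start-all.sh
```
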
Then open a web browser and enter “your_master_hostname:8080” in the URL bar. If you see alive workers, you are all set!

This post is just an example of setting up one master and one worker. If there are any errors that need to be corrected, please let me know.
