Setting Up an Apache Spark Cluster

A step-by-step guide to setting up an Apache Spark cluster on Hadoop

These are step-by-step notes on setting up an Apache Spark cluster for team projects, written with help from Mr. Steven Chiu, Instructor at the Institute for Information Industry, Taiwan.

Spark is one of the most popular big data processing platforms in use today. To manage the large amount of data we collect for our team projects, we decided to use Hadoop and Spark for further data processing.

This note demonstrates setting up a Spark cluster on Ubuntu Linux.

First, prepare all the installation packages the working environment needs. In our class, for example, we prepared JDK (1.8.0), Scala (2.12.11), Anaconda (2020.02-Linux-x86_64) for Python 3.7, Hadoop (2.10.0) and Spark (2.4.5), all stored in the ~/Downloads directory for demonstration purposes.

Before we start, make sure each machine has a static IP address and a distinct hostname. To change a hostname, edit the /etc/hostname file.

Then make sure the contents of /etc/hosts are identical on all machines, so that every hostname resolves to the same address everywhere.
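
As a rough sketch of what those entries can look like, the snippet below only previews the lines for /etc/hosts. The IP addresses and the hostnames master and worker1 are made up for illustration, so substitute your own. The commands that actually change the system need root and are left as comments.

```shell
# Hypothetical addresses and hostnames; replace them with your own.
HOSTS_ENTRIES="192.168.1.10 master
192.168.1.11 worker1"

# Preview the lines that every machine should carry in /etc/hosts.
printf '%s\n' "$HOSTS_ENTRIES"

# Applying the changes needs root, so these are left as comments:
#   echo "master" | sudo tee /etc/hostname      # on the master; use worker1 on the worker
#   printf '%s\n' "$HOSTS_ENTRIES" | sudo tee -a /etc/hosts
```

A reboot (or logging in again) is usually needed before a new hostname shows up in the shell prompt.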

First, install the Java SDK.

In a bash terminal, the steps are as follows:

Unpack the installation package.
Move the extracted directory to the home directory (~) for demonstration purposes.
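
The two steps above can be sketched roughly as follows. The archive file name is a placeholder I made up, since the exact JDK 8 update is not fixed here, so point JDK_ARCHIVE at the file you actually downloaded. Extracting with -C unpacks straight into the home directory, which has the same result as unpacking and then moving.

```shell
# Hypothetical archive name; point this at the JDK file you downloaded.
JDK_ARCHIVE="$HOME/Downloads/jdk-8-linux-x64.tar.gz"

if [ -f "$JDK_ARCHIVE" ]; then
    # Unpack directly into the home directory.
    tar -xzf "$JDK_ARCHIVE" -C "$HOME"
    # Show the extracted directory so its exact name can be noted for later.
    ls -d "$HOME"/jdk1.8.0*
else
    echo "JDK archive not found: $JDK_ARCHIVE" >&2
fi
```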

Second, install the Scala SDK.

In the terminal, the steps are as follows:

Unpack the installation package.
Move the extracted directory to the home directory (~) for demonstration purposes.
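
A minimal sketch of the same two steps for Scala, assuming the download is the standard scala-2.12.11.tgz archive sitting in ~/Downloads:

```shell
# Assumed archive name for Scala 2.12.11; adjust if yours differs.
SCALA_ARCHIVE="$HOME/Downloads/scala-2.12.11.tgz"

if [ -f "$SCALA_ARCHIVE" ]; then
    # Unpack directly into the home directory (same result as unpack + move).
    tar -xzf "$SCALA_ARCHIVE" -C "$HOME"
    # Quick check that the Scala launcher is in place.
    "$HOME/scala-2.12.11/bin/scala" -version
else
    echo "Scala archive not found: $SCALA_ARCHIVE" >&2
fi
```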

Third, install Anaconda3 for Python 3.7.

The steps are as follows:

Make the installer script executable.
Run it, agree to the license, and accept the default settings for demonstration purposes.
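
In shell terms this could look like the sketch below. It assumes the installer kept its original download name in ~/Downloads. The installer is interactive, so it will stop to ask about the license and the install location.

```shell
# Assumed installer name for Anaconda 2020.02; adjust if yours differs.
INSTALLER="$HOME/Downloads/Anaconda3-2020.02-Linux-x86_64.sh"

if [ -f "$INSTALLER" ]; then
    # Make the installer executable ("change mode"), then run it.
    chmod +x "$INSTALLER"
    "$INSTALLER"
else
    echo "Anaconda installer not found: $INSTALLER" >&2
fi
```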

Fourth, install Hadoop.

The steps are as follows:

Unpack the package.
Move the extracted directory to the home directory (~) for demonstration purposes, then go to the ~/hadoop-2.10.0/etc/hadoop/ directory.
Edit core-site.xml.
Go to the bottom of the file and add the property that sets the default file system, making sure it uses your own master hostname, then save the file.
Edit hdfs-site.xml.
Go to the bottom of the file and add the properties that specify your namenode and datanode directory paths. Later, when you create the namenode directory on the master and the datanode directory on the workers, make sure the paths match what you entered here.
Edit the slaves file.
List the hostname of each worker machine, one per line. In this example, I set up only one worker.
On the master machine, create the “namenode” directory; on the worker machine, create the “datanode” directory.
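
The Hadoop steps above can be sketched as one script. Treat it as an illustration under stated assumptions rather than the exact configuration used in class: the hostnames master and worker1, port 9000, the replication factor of 1 (chosen because there is only one worker) and the ~/hadoop_data paths are all my own placeholders. Note that it overwrites core-site.xml and hdfs-site.xml instead of editing them; the copies shipped with Hadoop contain only an empty configuration element.

```shell
# Assumed names and locations; adjust them to your own layout.
HADOOP_ARCHIVE="$HOME/Downloads/hadoop-2.10.0.tar.gz"
HADOOP_HOME="$HOME/hadoop-2.10.0"
CONF_DIR="$HADOOP_HOME/etc/hadoop"
MASTER_HOST="master"                        # hypothetical master hostname
WORKER_HOST="worker1"                       # hypothetical worker hostname
NAMENODE_DIR="$HOME/hadoop_data/namenode"   # hypothetical path
DATANODE_DIR="$HOME/hadoop_data/datanode"   # hypothetical path

# Unpack into the home directory if that has not been done yet.
if [ -f "$HADOOP_ARCHIVE" ] && [ ! -d "$HADOOP_HOME" ]; then
    tar -xzf "$HADOOP_ARCHIVE" -C "$HOME"
fi
mkdir -p "$CONF_DIR"   # no-op when Hadoop is already unpacked there

# core-site.xml: tell every node where the NameNode listens.
cat > "$CONF_DIR/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$MASTER_HOST:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: replication factor and storage directories.
cat > "$CONF_DIR/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file://$NAMENODE_DIR</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file://$DATANODE_DIR</value>
  </property>
</configuration>
EOF

# slaves: one worker hostname per line.
printf '%s\n' "$WORKER_HOST" > "$CONF_DIR/slaves"

# Storage directory for the master. On each worker, create the
# datanode directory instead:  mkdir -p "$DATANODE_DIR"
mkdir -p "$NAMENODE_DIR"
```

Because the directory paths are built from $HOME, this sketch assumes the same user name on the master and on the workers.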

Fifth, set up Spark.

The steps are as follows:

Unpack the installation package.
Move the extracted directory to the home directory (~).
Go to its conf/ directory and make a copy of the spark-env.sh template, named spark-env.sh.
Edit spark-env.sh.
At the bottom, add the environment variables Spark needs for your cluster.
Make a copy of the slaves template, named slaves.
Edit slaves.
At the bottom, list the hostnames of all your workers.
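
Here is a hedged sketch of the Spark steps. The archive and directory name (spark-2.4.5-bin-hadoop2.7), the JDK directory, the hostnames and the worker core and memory limits are assumptions for illustration. SPARK_MASTER_HOST, SPARK_WORKER_CORES, SPARK_WORKER_MEMORY and HADOOP_CONF_DIR are settings that spark-env.sh understands in Spark 2.x. The shipped slaves template lists localhost, so rather than copying it, the sketch writes the slaves file directly so that only the real worker is started.

```shell
# Assumed names and locations; adjust them to your own layout.
SPARK_ARCHIVE="$HOME/Downloads/spark-2.4.5-bin-hadoop2.7.tgz"
SPARK_HOME="$HOME/spark-2.4.5-bin-hadoop2.7"
SPARK_CONF="$SPARK_HOME/conf"
MASTER_HOST="master"     # hypothetical master hostname
WORKER_HOST="worker1"    # hypothetical worker hostname

# Unpack into the home directory if that has not been done yet.
if [ -f "$SPARK_ARCHIVE" ] && [ ! -d "$SPARK_HOME" ]; then
    tar -xzf "$SPARK_ARCHIVE" -C "$HOME"
fi
mkdir -p "$SPARK_CONF"   # no-op when Spark is already unpacked there

# Start spark-env.sh from the shipped template when it is available.
if [ -f "$SPARK_CONF/spark-env.sh.template" ] && [ ! -f "$SPARK_CONF/spark-env.sh" ]; then
    cp "$SPARK_CONF/spark-env.sh.template" "$SPARK_CONF/spark-env.sh"
fi

# Append the cluster settings at the bottom of spark-env.sh.
cat >> "$SPARK_CONF/spark-env.sh" <<EOF
export JAVA_HOME=$HOME/jdk1.8.0
export HADOOP_CONF_DIR=$HOME/hadoop-2.10.0/etc/hadoop
export SPARK_MASTER_HOST=$MASTER_HOST
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
EOF

# slaves: one worker hostname per line.
printf '%s\n' "$WORKER_HOST" > "$SPARK_CONF/slaves"
```

Running the script twice appends the settings twice, which is harmless but untidy, so check spark-env.sh afterwards.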

Finally, edit the .bashrc file to set the environment variables.

Go to the home directory (~) and open the .bashrc file.
At the bottom, add the environment variables for the tools installed above, such as their home directories and PATH entries.
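
As an illustration, the snippet below appends one possible set of variables to ~/.bashrc. The directory names are assumptions that follow the versions listed earlier (the JDK and Spark directory names in particular are placeholders). Only the bin directories go on PATH, because Hadoop and Spark each ship their own sbin/start-all.sh and putting both sbin directories on PATH would make that name ambiguous.

```shell
BASHRC="$HOME/.bashrc"

# Append the block only once, using the marker comment as a guard.
if ! grep -q '# >>> spark cluster settings' "$BASHRC" 2>/dev/null; then
    cat >> "$BASHRC" <<'EOF'

# >>> spark cluster settings
# Directory names below are assumptions; match them to what you unpacked.
export JAVA_HOME="$HOME/jdk1.8.0"
export SCALA_HOME="$HOME/scala-2.12.11"
export HADOOP_HOME="$HOME/hadoop-2.10.0"
export SPARK_HOME="$HOME/spark-2.4.5-bin-hadoop2.7"
export PATH="$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$PATH"
EOF
fi
```

Open a new terminal or run `source ~/.bashrc` for the variables to take effect.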

So far we have only configured the master machine. You'll need to repeat all of the steps above on each worker machine and on the UI machine (if you have one).

Once the other machines are set up, go back to the master machine and get ready to start Hadoop and Spark.

Starting Up Hadoop

Format the namenode directory.
Go to the sbin directory.
Start the Hadoop cluster.
Open a web browser and go to “your_master_hostname:50070”. Check the Hadoop overview page: if it shows a live node, your Hadoop cluster is good to go.
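
A sketch of these commands, run on the master, might look like this. It assumes Hadoop lives in ~/hadoop-2.10.0 and that the master can SSH into the workers without a password, which the start scripts rely on. I use start-dfs.sh and start-yarn.sh here; in Hadoop 2.x the single start-all.sh script also works but reports itself as deprecated.

```shell
HADOOP_HOME="$HOME/hadoop-2.10.0"   # assumed install location

if [ -x "$HADOOP_HOME/bin/hdfs" ]; then
    # Format the namenode directory. Do this only once: reformatting
    # discards the existing HDFS metadata.
    "$HADOOP_HOME/bin/hdfs" namenode -format

    # Start the HDFS daemons first, then the YARN daemons.
    cd "$HADOOP_HOME/sbin" || exit 1
    ./start-dfs.sh
    ./start-yarn.sh

    # Command-line counterpart of the overview page: lists live datanodes.
    "$HADOOP_HOME/bin/hdfs" dfsadmin -report
else
    echo "Hadoop not found under $HADOOP_HOME" >&2
fi
```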

Starting Up Spark

Go to the sbin directory inside the Spark directory.
Run ./start-all.sh.
Then open a web browser and go to “your_master_hostname:8080”. If you see alive workers, you are all set!
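
The same two steps as a sketch, again assuming the hypothetical install directory ~/spark-2.4.5-bin-hadoop2.7. Calling the script by its full path avoids accidentally running Hadoop's start-all.sh, which has the same name.

```shell
SPARK_HOME="$HOME/spark-2.4.5-bin-hadoop2.7"   # assumed install location

if [ -x "$SPARK_HOME/sbin/start-all.sh" ]; then
    # Starts the master on this machine and a worker on every host
    # listed in conf/slaves.
    "$SPARK_HOME/sbin/start-all.sh"
else
    echo "Spark not found under $SPARK_HOME" >&2
fi

# Optional smoke test once the workers show up (7077 is the default
# port of the standalone master; "master" is a hypothetical hostname):
#   "$SPARK_HOME/bin/spark-shell" --master spark://master:7077
```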

This post is just an example of setting up one master and one worker. If anything here needs correcting, please let me know.

