Setting Up Apache Spark Cluster

Step-by-step setup of an Apache Spark cluster based on Hadoop


These are step-by-step notes on setting up an Apache Spark cluster for team projects, with help received from Mr. Steven Chiu, Instructor at the Institute for Information Industry, Taiwan.

Spark is probably one of the most popular big data processing platforms in use nowadays. In order to manage the vast amount of data we collect for our team projects, we decided to use the Hadoop and Spark platforms for further data processing.

This note demonstrates setting up a Spark cluster on Ubuntu Linux.

First of all, we need to have all the required installation packages prepared for the working environment. In our class, for example, we prepared JDK (1.8.0), Scala (2.12.11), Anaconda (2020.02-Linux-x86_64) for Python 3.7, Hadoop (2.10.0), and Spark (2.4.5), stored in the ~/Downloads directory for demonstration purposes.

Before we start, make sure each machine has a static IP address and a distinct hostname. You can change the hostname by editing the /etc/hostname file.

Then, on all machines, make sure the contents of the /etc/hosts file are identical.

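The original screenshot of /etc/hosts is not reproduced here; as an illustrative sketch with one master and one worker, where the hostnames and IP addresses are placeholders, each machine's /etc/hosts would contain entries like:

192.168.1.100   master
192.168.1.101   worker1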

First, start by installing the Java JDK.

In the terminal, the steps are as follows:

Unzip the installation package, then move it to the ~ directory for demonstration purposes.
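The terminal screenshots are not reproduced here; a rough sketch of these two steps, assuming the JDK 8u241 tarball (adjust the file and folder names to the exact build you downloaded):

cd ~/Downloads
# unpack the JDK archive
tar -zxvf jdk-8u241-linux-x64.tar.gz
# move it to the home directory for demonstration purposes
mv jdk1.8.0_241 ~/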

Second, install the Scala SDK.

In the terminal, it goes as follows:

Unzip the installation package, then move it to the ~ directory for demonstration purposes.
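As a sketch, the same two steps for Scala 2.12.11 would look roughly like:

cd ~/Downloads
# unpack the Scala archive
tar -zxvf scala-2.12.11.tgz
# move it to the home directory
mv scala-2.12.11 ~/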

Third, install Anaconda3 for Python 3.7.

The steps are as follows:

Change the mode of the installer, then execute it; agree to the license and go with the default settings for demonstration purposes.
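A sketch of these steps, using the Anaconda 2020.02 installer file name:

cd ~/Downloads
# make the installer executable
chmod +x Anaconda3-2020.02-Linux-x86_64.sh
# run it, accept the license, and keep the default settings
./Anaconda3-2020.02-Linux-x86_64.sh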

Fourth, we will install Hadoop.

The steps are as follows:

Unzip the package and move it to the ~ directory for demonstration purposes, then go to the ~/hadoop-2.10.0/etc/hadoop/ directory.
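A sketch of these steps for Hadoop 2.10.0:

cd ~/Downloads
# unpack the Hadoop package
tar -zxvf hadoop-2.10.0.tar.gz
# move it to the home directory, then enter its configuration directory
mv hadoop-2.10.0 ~/
cd ~/hadoop-2.10.0/etc/hadoop/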
Edit core-site.xml: go to the bottom, insert the required configuration, make sure to change the hostname to your master's hostname, then save it.
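The configuration from the screenshot is not reproduced here; a typical core-site.xml entry for a Hadoop 2.x cluster, assuming the master's hostname is master and the conventional port 9000, looks like:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>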
Edit hdfs-site.xml: go to the bottom and insert the configuration that specifies your namenode and datanode directory paths. Later on, when you create the namenode directory on the master and the datanode directories on the workers, make sure the directory paths match what you type in here.
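Again as a hedged example with a single worker (so a replication factor of 1) and namenode/datanode directories placed under the user's home directory (the paths are placeholders):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/your_user/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/your_user/datanode</value>
  </property>
</configuration>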
Edit the slaves file and type in the hostnames of your worker machines; in this example, I only set up one worker.
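For example, with a single worker whose hostname is worker1 (a placeholder), the slaves file would contain just:

worker1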
On the master machine, create a "namenode" directory; on the worker machine, create a "datanode" directory.
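Using the placeholder paths from the hdfs-site.xml example above, that would be:

# on the master machine
mkdir ~/namenode
# on the worker machine
mkdir ~/datanode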

Fifth, set up Spark.

The steps are as follows:

Unzip the installation package, move it to the ~/ directory, then go to the conf/ directory and make a copy of the spark-env.sh template.
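A sketch of these steps, assuming the Spark 2.4.5 package pre-built for Hadoop 2.7 (adjust the file name to the package you actually downloaded):

cd ~/Downloads
# unpack the Spark package
tar -zxvf spark-2.4.5-bin-hadoop2.7.tgz
# move it to the home directory
mv spark-2.4.5-bin-hadoop2.7 ~/
# go to the conf directory and copy the spark-env.sh template
cd ~/spark-2.4.5-bin-hadoop2.7/conf/
cp spark-env.sh.template spark-env.sh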
Edit spark-env.sh and insert the required environment settings at the bottom.
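The exact settings were shown in the screenshot; a typical spark-env.sh for this kind of standalone setup, with placeholder paths matching the earlier steps, might contain:

export JAVA_HOME=/home/your_user/jdk1.8.0_241
export SCALA_HOME=/home/your_user/scala-2.12.11
export HADOOP_CONF_DIR=/home/your_user/hadoop-2.10.0/etc/hadoop
export SPARK_MASTER_HOST=master
export PYSPARK_PYTHON=/home/your_user/anaconda3/bin/python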
Make a copy of the slaves template, then edit slaves and, at the bottom, specify all your worker hostnames.
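A sketch of these two steps, again assuming a single worker named worker1:

cd ~/spark-2.4.5-bin-hadoop2.7/conf/
cp slaves.template slaves
# append each worker hostname to the slaves file
echo "worker1" >> slaves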

Finally, edit the .bashrc file to set the environment variables.

Go to the home directory ~ and edit the .bashrc file; at the bottom, add all the required environment variables.
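The exact variables were shown in the screenshot; a typical set for this layout, using the placeholder install locations from the steps above, would be appended to ~/.bashrc like:

export JAVA_HOME=/home/your_user/jdk1.8.0_241
export SCALA_HOME=/home/your_user/scala-2.12.11
export HADOOP_HOME=/home/your_user/hadoop-2.10.0
export SPARK_HOME=/home/your_user/spark-2.4.5-bin-hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin

Run source ~/.bashrc (or open a new terminal) afterwards so the changes take effect.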

Up to this point, we have only set up the necessary configuration on the master machine; you'll need to repeat all of the above steps on all your worker machines and the UI machine (if you have one).

After finishing the setup on your other machines, go back to the master machine and get ready to start up Hadoop and Spark.

Starting Up Hadoop

Format the namenode directory, go to the sbin directory, and start up the Hadoop cluster.
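A sketch of these three steps on the master machine:

cd ~/hadoop-2.10.0
# format the namenode directory (only needed once, before the first start)
bin/hdfs namenode -format
# go to the sbin directory and start the Hadoop cluster
cd sbin
./start-dfs.sh   # or ./start-all.sh to also start YARN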
Open a web browser and enter "your_master_hostname:50070" in the URL bar, then check the Hadoop overview page; if there are live nodes, your Hadoop cluster is good to go.

Starting up Spark

Go to the sbin directory inside the Spark directory and execute ./start-all.sh.
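For example:

cd ~/spark-2.4.5-bin-hadoop2.7/sbin
./start-all.sh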
Then open a web browser and enter "your_master_hostname:8080" in the URL bar; if you see alive workers, you are all set!

This post is just an example of setting up one master and one worker; if there are any errors that need to be corrected, please let me know.

