Setting Up Apache Spark Cluster
Step-by-step notes on setting up an Apache Spark cluster on Hadoop
These are step-by-step notes on setting up an Apache Spark cluster for team projects, with help from Mr. Steven Chiu, Instructor at the Institute for Information Industry, Taiwan.
Spark is one of the most popular big data processing platforms in use today. To manage the large amount of data we collect for our team projects, we decided to use Hadoop and Spark for further data processing.
This note demonstrates setting up a Spark cluster on Ubuntu Linux.
First of all, we need all the required installation packages prepared in our working environment. In our class, for example, we prepared the JDK (1.8.0), Scala (2.12.11), Anaconda (2020.02-Linux-x86_64) for Python 3.7, Hadoop (2.10.0), and Spark (2.4.5), stored in the ~/Downloads directory for demonstration purposes.
Before we start, make sure each machine has a static IP address and a distinct hostname. Change the hostname by editing the /etc/hostname file.
Then, on every machine, make sure the content of the /etc/hosts file is the same, for example:
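A minimal sketch of the two files for a two-machine cluster; the hostnames (master, worker1) and IP addresses below are placeholders for illustration, not the actual addresses used in class:

sudo nano /etc/hostname   # on each machine: a single line with that machine's own name, e.g. master or worker1
sudo nano /etc/hosts      # on every machine: one line per cluster member

# example /etc/hosts entries (replace with your own static IPs and hostnames)
127.0.0.1      localhost
192.168.0.10   master
192.168.0.11   worker1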
First, install the Java JDK (1.8.0).
In a terminal, the steps are as follows:
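A minimal sketch, assuming the JDK 1.8.0 tarball sits in ~/Downloads and is extracted under /usr/local; the exact archive and directory names depend on the JDK build you downloaded:

cd ~/Downloads
tar -zxvf jdk-8u*-linux-x64.tar.gz      # archive name varies with the JDK build
sudo mv jdk1.8.0_*/ /usr/local/jdk      # keep it in a fixed location for JAVA_HOME later
/usr/local/jdk/bin/java -version        # quick check before the PATH is set up in .bashrc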
Second, install the Scala SDK (2.12.11).
In a terminal, it goes as follows:
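Along the same lines, a sketch assuming the Scala 2.12.11 tarball is in ~/Downloads:

cd ~/Downloads
tar -zxvf scala-2.12.11.tgz
sudo mv scala-2.12.11 /usr/local/scala   # fixed location for SCALA_HOME later
/usr/local/scala/bin/scala -version      # verify the install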
Third, install Anaconda3 for Python 3.7.
The steps are as follows:
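A sketch of the Anaconda install, assuming the installer listed above is in ~/Downloads and the default install location (~/anaconda3) is kept:

cd ~/Downloads
bash Anaconda3-2020.02-Linux-x86_64.sh   # accept the license, keep the default prefix ~/anaconda3
source ~/.bashrc                         # reload the shell if you let the installer initialize conda
python --version                         # should now report Python 3.7.x from Anaconda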
Fourth, we will install Hadoop (2.10.0).
The steps are as follows:
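A sketch, assuming the Hadoop 2.10.0 tarball is in ~/Downloads and Hadoop ends up under /usr/local/hadoop; the cluster-specific XML settings are not reproduced here and depend on your own setup:

cd ~/Downloads
tar -zxvf hadoop-2.10.0.tar.gz
sudo mv hadoop-2.10.0 /usr/local/hadoop
sudo chown -R $USER /usr/local/hadoop    # let your own user own the install
ls /usr/local/hadoop/etc/hadoop          # core-site.xml, hdfs-site.xml, yarn-site.xml, slaves, hadoop-env.sh live here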
Fifth, set up Spark.
The steps are as follows:
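A sketch, assuming a prebuilt Spark 2.4.5 package in ~/Downloads; the exact archive name depends on which prebuilt bundle was downloaded:

cd ~/Downloads
tar -zxvf spark-2.4.5-bin-hadoop2.7.tgz          # archive name varies with the prebuilt bundle
sudo mv spark-2.4.5-bin-hadoop2.7 /usr/local/spark
sudo chown -R $USER /usr/local/spark
ls /usr/local/spark/conf                         # spark-env.sh and the slaves file (worker hostnames) go here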
Finally, edit the .bashrc file to set the environment variables.
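A minimal sketch of the .bashrc additions, assuming the install locations used in the sketches above; adjust the paths to wherever you actually placed each package:

# append to ~/.bashrc, then run: source ~/.bashrc
export JAVA_HOME=/usr/local/jdk
export SCALA_HOME=/usr/local/scala
export HADOOP_HOME=/usr/local/hadoop
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=$HOME/anaconda3/bin/python   # have PySpark use the Anaconda interpreter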
Up to this point, we have only set up the necessary configuration on the master machine. You'll need to repeat all of the steps above on each of your worker machines and on the UI machine (if you have one).
After finishing the setup on your other machines, go back to the master machine and get ready to start up Hadoop and Spark.
Starting Up Hadoop
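A typical start-up sequence on the master, assuming the environment variables above are in place (format the NameNode only once, before the very first start):

hdfs namenode -format     # first start only: initializes the NameNode metadata
start-dfs.sh              # launches the NameNode and the DataNodes on the workers
start-yarn.sh             # launches the ResourceManager and the NodeManagers
jps                       # run on each machine to check which Hadoop daemons are up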
Starting Up Spark
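One way to bring Spark up is standalone mode, sketched below with SPARK_HOME set as above; running Spark on YARN instead is also common, so adapt this to however your conf directory was set up. The hostname master is the placeholder from the /etc/hosts sketch earlier:

$SPARK_HOME/sbin/start-master.sh         # on the master machine
$SPARK_HOME/sbin/start-slaves.sh         # starts the workers listed in conf/slaves
pyspark --master spark://master:7077     # example interactive session against the standalone master

The Spark master's web UI should then be reachable on port 8080 of the master machine, where the connected workers show up.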
This post is just an example of setting up one master and one worker. If there are any errors that need to be corrected, please let me know.