LabData

A Data Science/Big Data Laboratory — part 2 of 4: Hadoop 3.2.1 and Spark 3.0.0 over Ubuntu 20.04 in a 3-node cluster

Assembling a Data Science/Big Data Laboratory in a Raspberry Pi 4 or VMs cluster with Hadoop, Spark, Hive, Kafka, Zookeeper and PostgreSQL

Pier Taranti
Towards Data Science
8 min read · Jun 14, 2020


This text can be used to support the installation on any Ubuntu 20.04 server cluster, and this is the beauty of well-designed layered software. Furthermore, if you have more nodes, you can distribute the software as you like. The text assumes you know the Linux command line, including ssh, vim, and nano.

I do not recommend starting with fewer than three Raspberries, since you need to set up the communication, and both Zookeeper and Kafka require an odd number of nodes. If you are trying a single node, this guide can still be used, but the performance is likely to be disappointing on a Raspberry; for a single node, I suggest a virtual machine with a reasonable amount of RAM and processing power.

Due to its size, I had to divide the tutorial into four parts.

All configuration files are available in my GitHub repository [1].

Disclaimer: This text is offered to everyone for free to use at your own risk. I took care to cite all my sources, but if you feel something is missing, please send me a note. Since different software versions may behave differently due to their dependencies, I suggest using the same versions I did on your first try.

3. Installing Hadoop and Spark

The Hadoop and Spark installation follows the instructions from [3, 4], together with other sources.

I used the following versions, downloaded from the Apache website:

  • hadoop-3.2.1.tar.gz
  • spark-3.0.0-bin-hadoop3.2.tgz

3.1 Setting your environment

First, download the archives and extract them to /opt, then give ownership to the pi user:

sudo tar -xvf hadoop-3.2.1.tar.gz -C /opt/
sudo tar -xvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/
cd /opt/
pi@pi1:/opt$ sudo mv hadoop-3.2.1 hadoop
pi@pi1:/opt$ sudo mv spark-3.0.0-bin-hadoop3.2 spark
pi@pi1:/opt$ sudo chown -R pi:pi /opt/spark
pi@pi1:/opt$ sudo chown -R pi:pi /opt/hadoop

Add the following environment variables to /home/pi/.bashrc:
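The exact lines are in the files in my repository [1]; a minimal sketch, assuming the /opt locations and the ARM64 OpenJDK 8 used in this tutorial, would be:

# minimal sketch; adjust paths if your Java or install locations differ
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin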

After editing, reload the file:

source /home/pi/.bashrc

3.2 Configuring Hadoop and Spark as a single node

Now you need to configure Hadoop and Spark.

To be clear — we first configure it as a single node and then modify for a cluster. My repository in GitHub contains only the final cluster config files.

3.2.1 Hadoop

Go to the folder

/opt/hadoop/etc/hadoop

I had a lot of trouble at this point: I accidentally inserted a blank line above the header of one of the configuration files. That blank line caused parse errors, and Hadoop kept failing until I realised the issue.

Edit the file

/opt/hadoop/etc/hadoop/hadoop-env.sh,

adding the following line at the end:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64

Edit configuration in

/opt/hadoop/etc/hadoop/core-site.xml
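The exact content is in my repository [1]; a minimal single-node sketch, assuming the hostname pi1 and the HDFS port 9000 used later in this text, would be:

<configuration>
  <!-- HDFS entry point; pi1:9000 matches the hdfs:// URLs used later -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://pi1:9000</value>
  </property>
</configuration>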

Edit configuration in

/opt/hadoop/etc/hadoop/hdfs-site.xml
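Again, the exact file is in [1]; a sketch consistent with the data directories created below would be:

<configuration>
  <!-- local directories for HDFS metadata and blocks (created below) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop_tmp/hdfs/datanode</value>
  </property>
  <!-- single node: keep only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>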

Now prepare the data area:

$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
$ sudo chown -R pi:pi /opt/hadoop_tmp

Edit configuration in

/opt/hadoop/etc/hadoop/mapred-site.xml
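A sketch of the usual content, which tells MapReduce to run on top of Yarn (the file in [1] is the authoritative version):

<configuration>
  <!-- run MapReduce jobs through YARN instead of the local job runner -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>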

Edit configuration in

/opt/hadoop/etc/hadoop/yarn-site.xml
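A sketch of the usual single-node content, enabling the shuffle service that MapReduce needs:

<configuration>
  <!-- auxiliary shuffle service required by MapReduce on YARN -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>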

Prepare the data space and start the services:

$ hdfs namenode -format -force
$ start-dfs.sh
$ start-yarn.sh
$ hadoop fs -mkdir /tmp
$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - pi supergroup 0 2019-04-09 16:51 /tmp

Use jps to check whether all services are running (the process IDs will differ):

$ jps
2736 NameNode
2850 DataNode
3430 NodeManager
3318 ResourceManager
3020 SecondaryNameNode

You need all five of these services running!

3.2.2 Testing

To test the single node, I follow the tutorial [2].

Execute the following commands:

pi@pi1:/opt$ hadoop fs -put $SPARK_HOME/README.md /
2020-06-24 19:16:02,822 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-06-24 19:16:06,389 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
pi@pi1:/opt$ spark-shell
2020-06-24 19:16:23,814 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://pi1:4040
Spark context available as 'sc' (master = local[*], app id = local-1593026210941).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val textFile = sc.textFile("hdfs://pi1:9000/README.md")
textFile: org.apache.spark.rdd.RDD[String] = hdfs://pi1:9000/README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> textFile.first()
res0: String = # Apache Spark
scala>

At this point, I got stuck, with a repeating message similar to:

INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)

I followed the suggestion from [4] and other sources and changed the following file /opt/hadoop/etc/hadoop/capacity-scheduler.xml.

The parameter yarn.scheduler.capacity.maximum-am-resource-percent should be set if you are running a cluster on a single machine with limited resources. It indicates the fraction of cluster resources made available for application masters, which increases the number of possible concurrent applications. The right value depends on your resources; it worked on my Pi 4 with 4 GB of RAM.

Edit the file, adding the property:

/opt/hadoop/etc/hadoop/capacity-scheduler.xml
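As an illustration, raising the fraction from the default 0.1 to, say, 0.5 would look like this (the value I actually used is in [1]):

<property>
  <!-- fraction of cluster resources that application masters may consume -->
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>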

Note: the tutorials usually provide commands to suppress warnings. I prefer to see these warnings when experimenting. If you would like to suppress them, refer to the first tutorial [2].

3.3 Hadoop in a cluster with Yarn

Now you should have a fully operational single-node installation. It is time for Hadoop to go to a cluster!

I executed the tutorials but faced a number of problems. This is expected: a different environment and different software versions.

After some experimenting, I managed to get a stable environment. The next steps, which configure Hadoop to operate with Yarn in a cluster, are a mix of [2] and [4].

Note: all worker nodes (pi2, pi3, ...) have the same configuration; only the master node pi1 differs, because of Spark. Again, my GitHub repository [1] has the configuration available for all nodes.

Create the folders on all nodes:

$ clustercmd sudo mkdir -p /opt/hadoop_tmp/hdfs
$ clustercmd sudo chown -R pi:pi /opt/hadoop_tmp
$ clustercmd sudo mkdir -p /opt/hadoop
$ clustercmd sudo chown -R pi:pi /opt/hadoop

The next commands will remove all data from Hadoop. Do a backup first if there is anything important.

$ clustercmd rm -rf /opt/hadoop_tmp/hdfs/datanode/*
$ clustercmd rm -rf /opt/hadoop_tmp/hdfs/namenode/*

Note that Spark will exist only on the master node.

Copy Hadoop:

From pi1:

pi@pi1:~$ rsync -vaz /opt/hadoop pi2:/opt/
pi@pi1:~$ rsync -vaz /opt/hadoop pi3:/opt/
pi@pi1:~$ rsync -vaz /opt/hadoop pi4:/opt/

Do it for all your nodes.

I prefer doing it one node at a time and confirming there is no abnormal behaviour.

Now, the following files need to be edited, changing the configuration:

/opt/hadoop/etc/hadoop/core-site.xml

/opt/hadoop/etc/hadoop/hdfs-site.xml
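The full file is in [1]; compared with the single-node version, the two properties discussed in the notes below are the relevant changes. As a sketch, with the values I chose:

<property>
  <!-- number of copies of each block; must not exceed the number of workers -->
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <!-- disables HDFS permission checking; convenient in a lab, not in production -->
  <name>dfs.permissions.enabled</name>
  <value>false</value>
</property>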

Note: the property dfs.replication indicates how many times data is replicated in the cluster. You can set it so that all data is duplicated on two or more nodes, but do not enter a value higher than the actual number of worker nodes. I used 1 because one of my nodes was using a 16 GB micro SD card (some of my parts were delayed in the post due to the COVID-19 outbreak). If you misconfigure it, your Spark applications will get stuck in the ACCEPTED state due to a lack of resources.

Note: the last property, dfs.permissions.enabled, was set to false to disable permission checking. I use Spark from a machine outside the cluster, and this facilitates my access. Obviously, I advise against using this setting in a production environment. I also disabled safe mode; to do this, after finishing the installation, run:

$ hdfs dfsadmin -safemode leave


/opt/hadoop/etc/hadoop/mapred-site.xml

/opt/hadoop/etc/hadoop/yarn-site.xml
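A typical cluster yarn-site.xml points every NodeManager at the ResourceManager on the master (pi1 here) and bounds container memory; a sketch, where the memory figures are assumptions you should tune to your hardware (the exact file is in [1]):

<configuration>
  <!-- all nodes contact the ResourceManager running on the master -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>pi1</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- assumed limits for a Raspberry Pi 4 with 4 GB of RAM -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3072</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>3072</value>
  </property>
</configuration>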

Create two files:

/opt/hadoop/etc/hadoop/master

/opt/hadoop/etc/hadoop/workers
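Both are plain text files with one hostname per line. With the nodes used in this series (pi1 as master, pi2 to pi4 as workers) they would contain:

/opt/hadoop/etc/hadoop/master:

pi1

/opt/hadoop/etc/hadoop/workers:

pi2
pi3
pi4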

After updating the configuration files on all nodes, it is necessary to format the data space and start the cluster (you can start it from any node):

$ hdfs namenode -format -force
$ start-dfs.sh
$ start-yarn.sh

3.4 Configuring Spark

Basically, you need to create/edit the following configuration file:

/opt/spark/conf/spark-defaults.conf
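A sketch of the kind of content this file holds, telling Spark to submit applications to Yarn; the memory figures below are assumptions sized for a 4 GB Pi, and the exact values I used are in [1]:

spark.master             yarn
spark.driver.memory      512m
spark.executor.memory    1g
spark.executor.cores     2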

These values can be adjusted to your hardware, but they work on a Raspberry Pi 4 with 4 GB of RAM.

Set the environment variables in:

/opt/spark/conf/spark-env.sh
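At a minimum, Spark must be able to find the Hadoop configuration so that spark-submit can reach Yarn and HDFS; a sketch, assuming the paths used in this tutorial:

# locations assumed from this tutorial; adjust to your setup
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop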

Install the following packages on all nodes so that they can process jobs written in Python/PySpark:

sudo apt install python3 python-is-python3

3.5 Testing the Cluster

Reboot all nodes and restart the services:

$ start-dfs.sh
$ start-yarn.sh

You can submit an example application to test Spark:

$ spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar

At the end of the processing, you should receive an approximate value of pi:

Pi is roughly 3.140555702778514

(this PI calculation needs improvement!!!!)

3.6 Web App for Hadoop and Yarn

3.6.1 Hadoop Web UI

http://pi1:9870/

Initially, I was not able to handle (upload/delete) files from the web interface, but a workaround is available.

The workaround is implemented by adding the following properties to Hadoop's core-site.xml:
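The usual fix is to make the web UI act as the cluster user instead of the default anonymous user (dr.who); a sketch, assuming the pi user (check [1] for the exact properties I added):

<property>
  <!-- user identity the HDFS web UI acts under; pi assumed from this setup -->
  <name>hadoop.http.staticuser.user</name>
  <value>pi</value>
</property>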

3.6.2 Yarn Web UI

http://pi1:8088/

NEXT

[1] P. G. Taranti. https://github.com/ptaranti/RaspberryPiCluster

[2] A. W. Watson. Building a Raspberry Pi Hadoop / Spark Cluster (2019)

[3] W. H. Liang. Build Raspberry Pi Hadoop/Spark Cluster from scratch (2019)

[4] F. Houbart. How to Install and Set Up a 3-Node Hadoop Cluster (2019)
