LABDATA
A Data Science/Big Data Laboratory — part 4 of 4: Kafka and Zookeeper over Ubuntu in a 3-node cluster
Assembling a Data Science/Big Data Laboratory in a Raspberry Pi 4 or VMs cluster with Hadoop, Spark, Hive, Kafka, Zookeeper and PostgreSQL
This text can be used to support the installation on any Ubuntu 20.04 server cluster, and this is the beauty of well-designed layered software. Furthermore, if you have more nodes, you can distribute the software as you like. The text assumes you know the Linux command line, including ssh, vim, and nano.
I do not recommend starting with fewer than three Raspberries, since you need to set up the communication, and Zookeeper works best with an odd number of nodes. If you are trying a single node, this guide may still be used, but the performance is likely to be disappointing on a Raspberry. For a single node, I suggest a virtual machine with a reasonable amount of RAM and processor.
Due to size, I had to divide the tutorial into four parts :
- Part 1: Introduction, Operational System and Networking
- Part 2: Hadoop and Spark
- Part 3: PostgreSQL and Hive
- Part 4: Kafka, Zookeeper and Conclusion
All configuration files are available at [1]:
Disclaimer: This text is offered to everyone for free to use at your own risk. I took care in citing all my sources, but if you feel that something is missing, please send me a note. Since different software versions may behave differently due to their dependencies, I suggest using the same versions I used on your first try.
6. Kafka
Kafka (https://kafka.apache.org/) is a robust message broker widely used to instantiate pipelines. Its retention feature makes it possible to handle a surge of information or the need to put consumers offline for maintenance.
Furthermore, as with almost all big data solutions, Kafka scales easily from a single node to full clusters with replication.
The primary literature for learning Kafka is the book “Kafka: The Definitive Guide” [2]. The e-book is freely available at
Many thanks to Confluent!
Kafka can easily handle from gigabytes up to petabytes a day. This is far beyond my lab cluster's capacity. However, I decided to install Kafka, initially as a single node and afterwards distributed, to allow playing with data pipelines, such as collecting real-time information from Twitter.
6.1 Zookeeper
The first step is to install the zookeeper server since Kafka depends on it for distributing metadata. I installed the most recent stable version available on the following sites:
https://zookeeper.apache.org/releases.html
https://downloads.apache.org/zookeeper/zookeeper-3.6.1/apache-zookeeper-3.6.1.tar.gz
pi@pi3:~/tmp$ wget https://downloads.apache.org/zookeeper/zookeeper-3.6.1/apache-zookeeper-3.6.1-bin.tar.gz
pi@pi3:~/tmp$ tar -xzvf apache-zookeeper-3.6.1-bin.tar.gz
pi@pi3:~/tmp$ sudo mv apache-zookeeper-3.6.1-bin /opt/zookeeper
pi@pi3:~/tmp$ cd /opt/
pi@pi3:/opt$ ls
hadoop hadoop_tmp hive zookeeper
pi@pi3:/opt$ sudo chown -R pi:pi zookeeper
[sudo] password for pi:
pi@pi3:/opt$
pi@pi3:/opt$ sudo mkdir /opt/zookeeper_data
pi@pi3:/opt$ sudo chown -R pi:pi zookeeper_data
Create the file:
/opt/zookeeper/conf/zoo.cfg
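A minimal single-node configuration could look like the following. The dataDir is the folder created above; the other values are Zookeeper's standard defaults, so treat this as a sketch and compare with the file in the repository [1]:

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper_data
clientPort=2181
```

Port 2181 is the client port that Kafka and the checks below connect to.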
Now you can start zookeeper in a single node:
pi@pi3:/opt$ /opt/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Checking service:
pi@pi3:/opt$ sudo netstat -plnt | grep 2181
tcp6 0 0 :::2181 :::* LISTEN 2511/java
Now we have zookeeper running locally on pi3. Next, we will install Kafka.
6.2 Kafka
Kafka installation as a single node is not too troublesome. I followed the book's instructions, changing only the installation folder to /opt and the user to pi.
I downloaded the most recent stable version, which was
kafka_2.13-2.5.0.tgz:
As usual, I saved it in /home/pi/tmp
The following commands extract the files, move them to /opt, and adjust folders and access rights:
pi@pi3:~/tmp$ tar -xzvf kafka_2.13-2.5.0.tgz
pi@pi3:~/tmp$ sudo mv kafka_2.13-2.5.0 /opt/kafka
pi@pi3:~/tmp$ cd /opt/
pi@pi3:/opt$ ls
hadoop hadoop_tmp hive kafka zookeeper zookeeper_data
pi@pi3:/opt$ sudo chown -R pi:pi kafka
pi@pi3:/opt$ sudo mkdir /opt/kafka_data
pi@pi3:/opt$ sudo chown -R pi:pi /opt/kafka_data
Edit file
/opt/kafka/config/server.properties,
changing the following parameter:
log.dirs=/opt/kafka_data
Starting Kafka:
pi@pi3:/opt$ /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties
Note: unless told otherwise, Kafka and zookeeper each keep a terminal busy when started. Use one terminal/remote session for each service initially. Later, the tutorial will show how to start them in a transparent way.
Checking port 9092:
pi@pi3:/opt$ sudo netstat -plnt | grep 9092
tcp6 0 0 :::9092 :::* LISTEN 3014/java
The following commands will allow you to ensure Kafka is running properly:
First, start zookeeper:
pi@pi3:~$ /opt/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Now you can start Kafka and create a Kafka topic with producer and consumer and test it:
pi@pi3:~$ /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties
pi@pi3:~$ /opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Created topic test.
pi@pi3:~$ /opt/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic test
Topic: test PartitionCount: 1 ReplicationFactor: 1 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0 Isr: 0
pi@pi3:~$ /opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>message test 1
>message test 2
pi@pi3:~$ /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
message test 1
message test 2
^C Processed a total of 2 messages
6.3 Changing Zookeeper and Kafka to a cluster
Attention: both Kafka and Zookeeper suggest having an odd number of nodes. I am using three, so this is fine.
pi1
pi@pi1:~$ sudo mkdir /opt/zookeeper_data
pi@pi1:~$ sudo mkdir /opt/zookeeper
pi@pi1:~$ sudo mkdir /opt/kafka
pi@pi1:~$ sudo mkdir /opt/kafka_data
pi@pi1:~$ sudo chown -R pi:pi /opt/zookeeper_data
pi@pi1:~$ sudo chown -R pi:pi /opt/zookeeper
pi@pi1:~$ sudo chown -R pi:pi /opt/kafka
pi@pi1:~$ sudo chown -R pi:pi /opt/kafka_data
pi2
pi@pi2:~$ sudo mkdir /opt/zookeeper_data
pi@pi2:~$ sudo mkdir /opt/zookeeper
pi@pi2:~$ sudo mkdir /opt/kafka
pi@pi2:~$ sudo mkdir /opt/kafka_data
pi@pi2:~$ sudo chown -R pi:pi /opt/zookeeper_data
pi@pi2:~$ sudo chown -R pi:pi /opt/zookeeper
pi@pi2:~$ sudo chown -R pi:pi /opt/kafka
pi@pi2:~$ sudo chown -R pi:pi /opt/kafka_data
pi3
pi@pi3:/opt$ rsync -vaz /opt/zookeeper/ pi2:/opt/zookeeper/
pi@pi3:/opt$ rsync -vaz /opt/kafka/ pi2:/opt/kafka/
pi@pi3:/opt$ rsync -vaz /opt/zookeeper/ pi1:/opt/zookeeper/
pi@pi3:/opt$ rsync -vaz /opt/kafka/ pi1:/opt/kafka/
Edit /opt/zookeeper/conf/zoo.cfg on all nodes, removing the previous comments and listing the three servers (pi1, pi2, pi3).
Then create the myid file in /opt/zookeeper_data on each node. The file should contain only the id of that zookeeper node (see GitHub):
pi1 -> 1
pi2 -> 2
pi3 -> 3
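As an illustration of those edits (ports 2888 and 3888 are Zookeeper's default peer and leader-election ports; check the exact files in the repository [1]), the cluster entries and the per-node id file look like this:

```
# added to /opt/zookeeper/conf/zoo.cfg on all nodes:
server.1=pi1:2888:3888
server.2=pi2:2888:3888
server.3=pi3:2888:3888

# content of the id file in /opt/zookeeper_data, e.g. on pi1:
1
```

Each node reads its own id file to find its server.N entry in the shared configuration.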
For Kafka, we need to edit (in all nodes):
/opt/kafka/config/server.properties
Changing the parameters:
broker.id=1 # 2 or 3 according to the node
(1 for pi1, 2 for pi2, and 3 for pi3)
and:
zookeeper.connect=pi1:2181,pi2:2181,pi3:2181
Now Kafka and zookeeper will run as a cluster. You need to start them on all nodes.
6.3.1 Monitoring your Kafka cluster
There are some tools for monitoring Kafka clusters. I think it is nice to have the look and feel of a full environment.
The text [3]:
provides an overview of some available tools. Kafka itself does not come with such a tool.
I opted for installing Kafka Tool. It is free for personal use, which is my case. Kafka Tool can be installed on Windows, Mac and Linux. I installed it on my Windows notebook, to reduce the workload on my Raspberries.
Below I present my actual interface with the distributed nodes:
Finally, the following post provides extra information on Kafka tool [4]:
7. Starting the Cluster
I coded one script for each node to start all services, because I sometimes forget to start specific services. You will find the scripts in the pi user's home folder as cluster-start.sh. The script uses ssh connections to start services on the other nodes, and I copied it to all my nodes.
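A minimal sketch of such a script is shown below. The service paths and node names are the ones used in this tutorial, but the script itself is my own illustration, not the one from the repository; the DRY_RUN switch is an addition so the commands can be previewed without touching the nodes.

```shell
#!/bin/bash
# Sketch of a cluster-start.sh: start Zookeeper, then Kafka, on every node.
# Node names and install paths follow this tutorial; adjust to your layout.

NODES="pi1 pi2 pi3"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually start the services

remote() {
  # Run a command on a node via ssh, or just print it in dry-run mode.
  if [ "$DRY_RUN" = "1" ]; then
    echo "ssh $1 $2"
  else
    ssh "$1" "$2"
  fi
}

for node in $NODES; do
  # Zookeeper must be up before the Kafka broker on each node.
  remote "$node" "/opt/zookeeper/bin/zkServer.sh start"
  remote "$node" "/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties"
done
```

Running it with the default DRY_RUN=1 simply prints the six ssh commands it would issue, which is a convenient way to verify the order before letting it loose on the cluster.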
You can check the cluster status using the web UI for Hadoop, Yarn, Hive and the Kafka tool.
My final architecture is as presented in the following diagram:
And the services are distributed according to the table below:
I also use an Ubuntu virtual machine over VirtualBox on my notebook, with RStudio, R, Anaconda and TensorFlow.
Conclusion
I congratulate you on concluding the whole installation. I know the feeling of relief and joy! :D
Now you have a full environment to experiment and train!
I would like to thank all authors from the sources I cited.
I authorize anyone to copy parts of this tutorial provided the origin is cited. If a substantial part is needed, please just link to it.
Any comments and questions are most welcomed!
[1] P. G. Taranti. https://github.com/ptaranti/RaspberryPiCluster
[2] N. Narkhede et al. Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale. O’Reilly Media (2017)
[3] G. Myrianthous. Overview of UI monitoring tools for Apache Kafka clusters (2019)
[4] O. Grygorian. GUI for Apache Kafka (2019)