
A Data Science/Big Data Laboratory — part 4 of 4: Kafka and Zookeeper over Ubuntu in a 3-node cluster

Assembling a Data Science/Big Data Laboratory in a Raspberry Pi 4 or VMs cluster with Hadoop, Spark, Hive, Kafka, Zookeeper and PostgreSQL

Pier Taranti
Towards Data Science
7 min read · Jun 15, 2020


This text can be used to support the installation in any Ubuntu 20.04 server cluster, and this is the beauty of well-designed layered software. Furthermore, if you have more nodes, you can distribute the software as you like. The text assumes you know the Linux command line, including ssh, vim, and nano.

I do not recommend starting with fewer than three Raspberries, since you need to set up the communication, and ZooKeeper (on which Kafka depends) works best with an odd number of nodes. If you are trying a single node, this guide may still be used, but the performance is likely to be disappointing on a Raspberry; for a single node I suggest a virtual machine with a reasonable amount of RAM and processor.

Due to size, I had to divide the tutorial into four parts:

All configuration files are available at [1]:

Disclaimer: This text is offered to everyone for free to use at your own risk. I took care in citing all my sources, but if you feel that something is missing, please send me a note. Since different software versions may behave differently due to their dependencies, I suggest using the same versions I used on your first try.

6. Kafka

Kafka (https://kafka.apache.org/) is a robust message broker widely used to build data pipelines. Its retention feature makes it possible to absorb a surge of information or to take consumers offline for maintenance.
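Retention is configured per broker (and can be overridden per topic). As an illustration only, these are the standard server.properties settings that control it; the values below are examples, not the ones I used:

log.retention.hours=168         # keep messages for seven days
log.retention.bytes=1073741824  # ...or until a partition reaches 1 GB, whichever comes first
log.segment.bytes=268435456     # segment file size; only closed segments are deleted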

Furthermore, like almost all big data solutions, Kafka scales easily from a single node to a full cluster with replication.

The primary literature for learning Kafka is the book “Kafka: The Definitive Guide” [2]. The e-book is freely available at

Many thanks to Confluent!

Kafka can easily handle from gigabytes to even petabytes a day. This is far beyond my lab-cluster capacity. However, I decided to install Kafka, initially as a single node and afterwards distributed, to allow playing with data pipelines, such as collecting real-time information from Twitter.

6.1 Zookeeper

The first step is to install the zookeeper server since Kafka depends on it for distributing metadata. I installed the most recent stable version available on the following sites:

https://zookeeper.apache.org/releases.html

https://downloads.apache.org/zookeeper/zookeeper-3.6.1/apache-zookeeper-3.6.1-bin.tar.gz

pi@pi3:~/tmp$ wget https://downloads.apache.org/zookeeper/zookeeper-3.6.1/apache-zookeeper-3.6.1-bin.tar.gz
pi@pi3:~/tmp$ tar -xzvf apache-zookeeper-3.6.1-bin.tar.gz
pi@pi3:~/tmp$ sudo mv apache-zookeeper-3.6.1-bin /opt/zookeeper
pi@pi3:~/tmp$ cd /opt/
pi@pi3:/opt$ ls
hadoop hadoop_tmp hive zookeeper
pi@pi3:/opt$ sudo chown -R pi:pi zookeeper
[sudo] password for pi:
pi@pi3:/opt$
pi@pi3:/opt$ sudo mkdir /opt/zookeeper_data
pi@pi3:/opt$ sudo chown -R pi:pi zookeeper_data

Create the file:

/opt/zookeeper/conf/zoo.cfg
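For a single node, a minimal zoo.cfg needs little more than the tick time, the data directory and the client port. The exact file I used is in the GitHub repo [1]; this sketch assumes the /opt/zookeeper_data folder created above:

tickTime=2000
dataDir=/opt/zookeeper_data
clientPort=2181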

Now you can start ZooKeeper as a single node:

pi@pi3:/opt$ /opt/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Checking service:

pi@pi3:/opt$ sudo netstat -plnt | grep 2181
tcp6 0 0 :::2181 :::* LISTEN 2511/java
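You can also ask the server itself for its status; running standalone it should report something like:

pi@pi3:/opt$ /opt/zookeeper/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Mode: standalone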

Now we have zookeeper running locally on pi3. Next, we will install Kafka.

6.2 Kafka

Kafka installation as a single node is not troublesome. I followed the book's instructions, only changing the installation folder to /opt and using my user, pi.

I downloaded the most recent stable version, which was kafka_2.13-2.5.0.tgz:

As usual, I saved it in /home/pi/tmp

The following commands extract the files, move them to /opt, and adjust folders and access rights:

pi@pi3:~/tmp$ tar -xzvf kafka_2.13-2.5.0.tgz
pi@pi3:~/tmp$ sudo mv kafka_2.13-2.5.0 /opt/kafka
pi@pi3:~/tmp$ cd /opt/
pi@pi3:/opt$ ls
hadoop hadoop_tmp hive kafka zookeeper zookeeper_data
pi@pi3:/opt$ sudo chown -R pi:pi kafka
pi@pi3:/opt$ sudo mkdir /opt/kafka_data
pi@pi3:/opt$ sudo chown -R pi:pi /opt/kafka_data

Edit the file

/opt/kafka/config/server.properties

changing the following parameter:

log.dirs=/opt/kafka_data

Starting Kafka:

pi@pi3:/opt$ /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties

Note: Kafka and ZooKeeper hold the terminal they were started from unless you tell them otherwise. Use one terminal/remote session for each service initially; later, the tutorial will show how to start them in a transparent way.

Checking port 9092:

pi@pi3:/opt$ sudo netstat -plnt | grep 9092
tcp6 0 0 :::9092 :::* LISTEN 3014/java

The following commands let you check that Kafka is running properly.

First, start zookeeper:

pi@pi3:~$ /opt/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Now you can start Kafka and create a Kafka topic with producer and consumer and test it:

pi@pi3:~$ /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties
pi@pi3:~$ /opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Created topic test.
pi@pi3:~$ /opt/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic test
Topic: test PartitionCount: 1 ReplicationFactor: 1 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0 Isr: 0
pi@pi3:~$ /opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>message test 1
>message test 2
pi@pi3:~$ /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
message test 1
message test 2
^C Processed a total of 2 messages

6.3 Changing Zookeeper and Kafka to a cluster

Attention: both Kafka and ZooKeeper recommend an odd number of nodes. I am using three, so that is fine.

pi1

pi@pi1:~$ sudo mkdir /opt/zookeeper_data
pi@pi1:~$ sudo mkdir /opt/zookeeper
pi@pi1:~$ sudo mkdir /opt/kafka
pi@pi1:~$ sudo mkdir /opt/kafka_data
pi@pi1:~$ sudo chown -R pi:pi /opt/zookeeper_data
pi@pi1:~$ sudo chown -R pi:pi /opt/zookeeper
pi@pi1:~$ sudo chown -R pi:pi /opt/kafka
pi@pi1:~$ sudo chown -R pi:pi /opt/kafka_data

pi2

pi@pi2:~$ sudo mkdir /opt/zookeeper_data
pi@pi2:~$ sudo mkdir /opt/zookeeper
pi@pi2:~$ sudo mkdir /opt/kafka
pi@pi2:~$ sudo mkdir /opt/kafka_data
pi@pi2:~$ sudo chown -R pi:pi /opt/zookeeper_data
pi@pi2:~$ sudo chown -R pi:pi /opt/zookeeper
pi@pi2:~$ sudo chown -R pi:pi /opt/kafka
pi@pi2:~$ sudo chown -R pi:pi /opt/kafka_data

pi3

pi@pi3:/opt$ rsync -vaz /opt/zookeeper/ pi2:/opt/zookeeper/
pi@pi3:/opt$ rsync -vaz /opt/kafka/ pi2:/opt/kafka/
pi@pi3:/opt$ rsync -vaz /opt/zookeeper/ pi1:/opt/zookeeper/
pi@pi3:/opt$ rsync -vaz /opt/kafka/ pi1:/opt/kafka/

On all nodes (pi1, pi2, pi3), edit the file below, removing the comments from the cluster entries:

/opt/zookeeper/conf/zoo.cfg
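With the comments removed, the cluster section of zoo.cfg should look like the sketch below (same file on all three nodes; ports 2888/3888 are ZooKeeper's defaults for peer and leader-election traffic; the full file is in the repo [1]):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper_data
clientPort=2181
server.1=pi1:2888:3888
server.2=pi2:2888:3888
server.3=pi3:2888:3888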

Create the file on each node:

/opt/zookeeper_data/myid

The file should contain only the id of the ZooKeeper node (see GitHub):

pi1 → 1

pi2 → 2

pi3 → 3
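One way of creating them is to echo the id on each node:

pi@pi1:~$ echo 1 > /opt/zookeeper_data/myid
pi@pi2:~$ echo 2 > /opt/zookeeper_data/myid
pi@pi3:~$ echo 3 > /opt/zookeeper_data/myid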

For Kafka, we need to edit (on all nodes):

/opt/kafka/config/server.properties

Changing the parameters:

broker.id=1   # 2 or 3, according to the node

(1 for pi1, 2 for pi2, and 3 for pi3)

and:

zookeeper.connect=pi1:2181,pi2:2181,pi3:2181

Now Kafka and ZooKeeper will run as a cluster. You need to start them on all nodes.
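As a quick sanity check of replication across the brokers, you can create a topic with replication factor 3 and describe it; each partition should list replicas on all three broker ids (the topic name here is just an example, and any node will do):

pi@pi3:~$ /opt/kafka/bin/kafka-topics.sh --create --zookeeper pi1:2181 --replication-factor 3 --partitions 3 --topic test-replicated
pi@pi3:~$ /opt/kafka/bin/kafka-topics.sh --zookeeper pi1:2181 --describe --topic test-replicated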

6.3.1 Monitoring your Kafka cluster

There are some tools for monitoring Kafka clusters, and I think it is nice to have the look and feel of a full environment.

The text [3] provides an overview of some available tools. Kafka itself does not come with such a tool.

I opted for installing Kafka Tool. It is free for personal use, which is my case. Kafka Tool can be installed on Windows, macOS and Linux. I installed it on my Windows notebook, to reduce the workload on my Raspberries.

Below I present my actual interface with the distributed nodes:

Finally, the following post provides extra information on Kafka Tool [4]:

7. Starting the Cluster

I coded one script for each node to start all services, because I sometimes forget to start specific services. You will find the scripts in the pi user's home folder as cluster-start.sh. The script uses ssh connections to start the services on the other nodes, and I copied it to all my nodes.

/home/pi/cluster-start.sh
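The actual scripts are in the repo [1]. As a rough sketch of the idea, the pi1 version looks something like the following (the service layout per node is an assumption here; adjust the ssh targets to wherever you placed each service):

#!/bin/bash
# cluster-start.sh (sketch) -- start all cluster services from pi1
# ZooKeeper on every node
/opt/zookeeper/bin/zkServer.sh start
ssh pi2 /opt/zookeeper/bin/zkServer.sh start
ssh pi3 /opt/zookeeper/bin/zkServer.sh start
# HDFS and YARN, started once from the master node
/opt/hadoop/sbin/start-dfs.sh
/opt/hadoop/sbin/start-yarn.sh
# one Kafka broker per node, detached with -daemon
/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties
ssh pi2 /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties
ssh pi3 /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties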

You can check the cluster status using the web UIs for Hadoop, Yarn and Hive, and with the Kafka Tool.

My final architecture is as presented in the following diagram:

And the services are distributed according to the table below:

I also use an Ubuntu virtual machine over VirtualBox on my notebook, with RStudio, R, Anaconda and TensorFlow.

Conclusion

I congratulate you on concluding the installation. I know the feeling of relief and joy! :D

Now you have a full environment to experiment and train!

I would like to thank all authors from the sources I cited.

I authorize anyone to copy parts of this tutorial provided the origin is cited. If a substantial part is needed, please just link to it.

Any comments and questions are most welcomed!

[1] P. G. Taranti. https://github.com/ptaranti/RaspberryPiCluster

[2] N. Narkhede et al. Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale. O’Reilly Media (2017)

[3] G. Myrianthous. Overview of UI monitoring tools for Apache Kafka clusters (2019)

[4] O. Grygorian. GUI for Apache Kafka (2019)
