A Data Science/Big Data Laboratory — part 4 of 4: Kafka and Zookeeper over Ubuntu in a 3-node cluster

Assembling a Data Science/Big Data Laboratory in a Raspberry Pi 4 or VMs cluster with Hadoop, Spark, Hive, Kafka, Zookeeper and PostgreSQL

This text can be used to support the installation on any Ubuntu 20.04 server cluster, and this is the beauty of well-designed layered software. Furthermore, if you have more nodes, you can distribute the software as you like. The text assumes you know the Linux command line, including ssh, vim, and nano.

I do not recommend starting with fewer than three Raspberries, since you need to set up the communication between nodes, and both Zookeeper and Kafka recommend an odd number of nodes. This guide can still be used for a single node, but performance on a Raspberry is likely to be disappointing; for a single node I suggest a virtual machine with a reasonable amount of RAM and processing power.

Due to its size, I had to divide the tutorial into four parts; this is the fourth.

All configuration files are available at [1].

Disclaimer: This text is offered to everyone for free, to use at your own risk. I took care to cite all my sources, but if you feel that something is missing, please send me a note. Since different software versions may behave differently because of their dependencies, I suggest using the same versions I used on your first try.

6. Kafka

Kafka is a robust message broker widely used to build data pipelines. Its retention feature makes it possible to absorb a surge of information or to take consumers offline for maintenance.

Furthermore, like almost all big data solutions, Kafka scales easily from a single node to a full cluster with replication.

The primary literature for learning Kafka is the book “Kafka: The Definitive Guide” [2]. The e-book is freely available at

Kafka can easily handle from gigabytes to even petabytes a day. This is far beyond my lab cluster's capacity. However, I decided to install Kafka initially as a single node and later distribute it, to allow playing with data pipelines, such as collecting real-time information from Twitter.

6.1 Zookeeper

The first step is to install the zookeeper server since Kafka depends on it for distributing metadata. I installed the most recent stable version available on the following sites:

pi@pi3:~/tmp$ wget
pi@pi3:~/tmp$ tar -xzvf apache-zookeeper-3.6.1-bin.tar.gz
pi@pi3:~/tmp$ sudo mv apache-zookeeper-3.6.1-bin /opt/zookeeper
pi@pi3:~/tmp$ cd /opt/
pi@pi3:/opt$ ls
hadoop hadoop_tmp hive zookeeper
pi@pi3:/opt$ sudo chown -R pi:pi zookeeper
[sudo] password for pi:
pi@pi3:/opt$ sudo mkdir /opt/zookeeper_data
pi@pi3:/opt$ sudo chown -R pi:pi zookeeper_data

Create the file:
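As an illustration only, a minimal standalone configuration could be generated as sketched below. The startup log further down reports conf/zoo.cfg as the configuration in use, and the data folder and port 2181 come from the surrounding steps; the tickTime value and the helper name write_zoo_cfg are assumptions of this sketch, not the author's file.

```shell
#!/bin/sh
# Sketch only: writes a minimal standalone ZooKeeper configuration.
# write_zoo_cfg is a hypothetical helper; the values are assumptions.
write_zoo_cfg() {
    cfg_file="$1"                          # e.g. /opt/zookeeper/conf/zoo.cfg
    data_dir="${2:-/opt/zookeeper_data}"   # data folder created above
    cat > "$cfg_file" <<EOF
# Basic time unit in milliseconds used for heartbeats
tickTime=2000
# Folder where ZooKeeper keeps its snapshots
dataDir=$data_dir
# Port on which clients (such as Kafka) connect
clientPort=2181
EOF
}

# Write to a scratch file first so the result can be reviewed.
example="$(mktemp)"
write_zoo_cfg "$example"
cat "$example"
rm -f "$example"
```

Once the content looks right, the same function can be pointed at the real location, for instance write_zoo_cfg /opt/zookeeper/conf/zoo.cfg.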


Now you can start zookeeper in a single node:

pi@pi3:/opt$ /opt/zookeeper/bin/ start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Checking service:

pi@pi3:/opt$ sudo netstat -plnt | grep 2181
tcp6 0 0 :::2181 :::* LISTEN 2511/java
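A plain grep for the port number can also match other columns, such as a PID that happens to be 2181. A stricter check, sketched here with a hypothetical helper named listening_on, looks only at the local-address and state columns of the netstat output:

```shell
#!/bin/sh
# listening_on PORT: reads `netstat -plnt` style output on stdin and succeeds
# only if some socket is in LISTEN state on exactly that local port.
# (Hypothetical helper name; column positions follow the output shown above.)
listening_on() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Example with the line shown above (no live service needed):
if printf 'tcp6 0 0 :::2181 :::* LISTEN 2511/java\n' | listening_on 2181; then
    echo "port 2181 is listening"
fi
```

On a live node it would be used as: sudo netstat -plnt | listening_on 2181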

Now we have Zookeeper running locally on pi3. Next, we will install Kafka.

6.2 Kafka

Installing Kafka as a single node is not so troublesome. I followed the book's instructions, but changed the installation folder to /opt and used my user, pi.

I downloaded the most recent stable version, which was


As usual, I saved it in /home/pi/tmp

The following commands extract the files, move them to /opt, and adjust folders and access rights:

pi@pi3:~/tmp$ tar -xzvf kafka_2.13-2.5.0.tgz
pi@pi3:~/tmp$ sudo mv kafka_2.13-2.5.0 /opt/kafka
pi@pi3:~/tmp$ cd /opt/
pi@pi3:/opt$ ls
hadoop hadoop_tmp hive kafka zookeeper zookeeper_data
pi@pi3:/opt$ sudo chown -R pi:pi kafka
pi@pi3:/opt$ sudo mkdir /opt/kafka_data
pi@pi3:/opt$ sudo chown -R pi:pi /opt/kafka_data

Edit file


changing the following parameter:
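Editing a single key by hand is easy to get wrong when the step is repeated on several nodes. The sketch below shows one way to set a key in a Java-style properties file such as /opt/kafka/config/server.properties. The helper name set_property is hypothetical, and log.dirs pointing at /opt/kafka_data is only an assumed example of the parameter being changed.

```shell
#!/bin/sh
# set_property FILE KEY VALUE: replace KEY's line in a Java-style
# properties file, or append it when the key is not present yet.
# (Hypothetical helper, not part of Kafka.)
set_property() {
    file="$1"; key="$2"; value="$3"
    tmp="$(mktemp)" || return 1
    awk -F= -v k="$key" -v v="$value" '
        $1 == k { print k "=" v; seen = 1; next }   # overwrite existing key
        { print }                                    # keep every other line
        END { if (!seen) print k "=" v }             # append when missing
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Demonstration on a scratch file, not on the real configuration.
demo="$(mktemp)"
printf '# broker settings\nlog.dirs=/tmp/kafka-logs\n' > "$demo"
set_property "$demo" log.dirs /opt/kafka_data   # log.dirs is an assumed example key
cat "$demo"
rm -f "$demo"
```

Commented-out lines (those starting with #) are left untouched, because the key comparison is made against the whole text before the first equals sign.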


Starting Kafka:

pi@pi3:/opt$ /opt/kafka/bin/ -daemon /opt/kafka/config/

Note: Unless you start them another way, Kafka and Zookeeper each need one open terminal. Use one terminal/remote session for each service initially. Later, the tutorial shows how to start them in a transparent way.

Checking port 9092:

pi@pi3:/opt$ sudo netstat -plnt | grep 9092
tcp6 0 0 :::9092 :::* LISTEN 3014/java

The following commands will allow you to ensure Kafka is running properly:

First, start Zookeeper:

pi@pi3:~$ /opt/zookeeper/bin/ start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

Now you can start Kafka and create a Kafka topic with producer and consumer and test it:

pi@pi3:~$ /opt/kafka/bin/ -daemon /opt/kafka/config/server.properties
pi@pi3:~$ /opt/kafka/bin/ --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Created topic test.
pi@pi3:~$ /opt/kafka/bin/ --zookeeper localhost:2181 --describe --topic test
Topic: test PartitionCount: 1 ReplicationFactor: 1 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0 Isr: 0
pi@pi3:~$ /opt/kafka/bin/ --broker-list localhost:9092 --topic test
>message test 1
>message test 2
pi@pi3:~$ /opt/kafka/bin/ --bootstrap-server localhost:9092 --topic test --from-beginning
message test 1
message test 2
^C Processed a total of 2 messages
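When these steps are scripted back to back, the broker started with -daemon may not be accepting connections yet when the topic is created. A small retry loop, sketched here with a hypothetical helper named retry, is one way to wait for a readiness check to pass before continuing:

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND...: run COMMAND until it succeeds,
# sleeping DELAY seconds between tries; fail after ATTEMPTS tries.
# (Hypothetical helper, not part of Kafka or ZooKeeper.)
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Harmless demonstration: `true` succeeds on the first try.
retry 3 1 true && echo "ready"
```

As an example of use, retry 10 2 sh -c 'netstat -lnt | grep -q ":9092 "' would wait up to about twenty seconds for port 9092 to appear in the listening sockets.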

6.3 Changing Zookeeper and Kafka to a cluster

Attention: both Kafka and Zookeeper suggest you have an odd number of nodes. I am using three, so it is OK.


pi@pi1:~$ sudo mkdir /opt/zookeeper_data
pi@pi1:~$ sudo mkdir /opt/zookeeper
pi@pi1:~$ sudo mkdir /opt/kafka
pi@pi1:~$ sudo mkdir /opt/kafka_data
pi@pi1:~$ sudo chown -R pi:pi /opt/zookeeper_data
pi@pi1:~$ sudo chown -R pi:pi /opt/zookeeper
pi@pi1:~$ sudo chown -R pi:pi /opt/kafka
pi@pi1:~$ sudo chown -R pi:pi /opt/kafka_data


pi@pi2:~$ sudo mkdir /opt/zookeeper_data
pi@pi2:~$ sudo mkdir /opt/zookeeper
pi@pi2:~$ sudo mkdir /opt/kafka
pi@pi2:~$ sudo mkdir /opt/kafka_data
pi@pi2:~$ sudo chown -R pi:pi /opt/zookeeper_data
pi@pi2:~$ sudo chown -R pi:pi /opt/zookeeper
pi@pi2:~$ sudo chown -R pi:pi /opt/kafka
pi@pi2:~$ sudo chown -R pi:pi /opt/kafka_data


pi@pi3:/opt$ rsync -vaz /opt/zookeeper/ pi2:/opt/zookeeper/
pi@pi3:/opt$ rsync -vaz /opt/kafka/ pi2:/opt/kafka/
pi@pi3:/opt$ rsync -vaz /opt/zookeeper/ pi1:/opt/zookeeper/
pi@pi3:/opt$ rsync -vaz /opt/kafka/ pi1:/opt/kafka/
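With more nodes, typing each rsync line by hand becomes tedious. The sketch below only prints the commands for a list of target nodes, so they can be reviewed before being run; the variable and function names are illustrative.

```shell
#!/bin/sh
# Prints (does not run) the rsync commands that copy both installations
# from the current node to every target node.
TARGET_NODES="pi1 pi2"
SYNC_DIRS="/opt/zookeeper /opt/kafka"

print_sync_commands() {
    for node in $TARGET_NODES; do
        for dir in $SYNC_DIRS; do
            # Trailing slashes copy the folder content, as in the commands above
            echo "rsync -vaz $dir/ $node:$dir/"
        done
    done
}

print_sync_commands
```

The printed lines can be copied into the terminal, or saved to a file and run with sh once they look right.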

Edit removing the previous comments (pi1, pi2, pi3)
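For reference, a replicated ZooKeeper configuration lists every member of the ensemble and sets two extra time limits. The sketch below prints such a section; the ports 2888 and 3888 and the limit values are the conventional ones from ZooKeeper's sample configuration and are assumptions here, not necessarily what the author's file contains.

```shell
#!/bin/sh
# zk_server_lines HOST...: prints the ensemble section of zoo.cfg for the
# given hosts, numbering them 1, 2, 3... in the order given.
# (Hypothetical helper; port and limit values are assumptions.)
zk_server_lines() {
    # Limits, in ticks, for followers to connect to and sync with the leader
    echo "initLimit=10"
    echo "syncLimit=5"
    id=1
    for host in "$@"; do
        # 2888: follower-to-leader traffic, 3888: leader election
        echo "server.$id=$host:2888:3888"
        id=$((id + 1))
    done
}

zk_server_lines pi1 pi2 pi3
```

The number after "server." has to match the id stored on the corresponding node.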


Create the files:


The file should contain only the id of the Zookeeper node (see GitHub):

pi1 -> 1

pi2 -> 2

pi3 -> 3
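Since the id follows the host name, the file can be derived rather than typed on each node. This is a minimal sketch assuming hosts named piN and ZooKeeper's convention of a file called myid inside the data folder; the helper names are hypothetical.

```shell
#!/bin/sh
# node_id HOST: derive the numeric id from a hostname of the form piN.
node_id() {
    echo "${1#pi}"
}

# write_myid HOST DATA_DIR: store the id as the only content of DATA_DIR/myid.
write_myid() {
    mkdir -p "$2" && node_id "$1" > "$2/myid"
}

# Demonstration in a scratch folder; on a node, DATA_DIR would be the
# ZooKeeper data folder (assumed here to be /opt/zookeeper_data).
scratch="$(mktemp -d)"
write_myid pi2 "$scratch"
cat "$scratch/myid"    # prints 2
rm -rf "$scratch"
```

On each node this could be run as write_myid "$(hostname)" /opt/zookeeper_data, provided the host names really are pi1, pi2 and pi3.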

For Kafka, we need to edit (in all nodes):


Changing the parameters (the id is 1, 2, or 3 according to the node: 1 for pi1, 2 for pi2, and 3 for pi3):

zookeeper.connect=pi1:2181,pi2:2181,pi3:2181
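The per-node values can be generated from the host names, which helps keep the three files consistent. In this sketch the id parameter is assumed to be Kafka's broker.id, and the function names are illustrative; the connection string is built without spaces after the commas.

```shell
#!/bin/sh
ZK_PORT=2181

# zk_connect HOST...: comma-separated host:port list, no spaces.
zk_connect() {
    list=""
    for host in "$@"; do
        list="${list:+$list,}$host:$ZK_PORT"
    done
    echo "$list"
}

# kafka_cluster_settings THIS_HOST ALL_HOSTS...: the lines that differ
# from the single-node configuration.
kafka_cluster_settings() {
    this_host="$1"; shift
    echo "broker.id=${this_host#pi}"          # assumed parameter name
    echo "zookeeper.connect=$(zk_connect "$@")"
}

kafka_cluster_settings pi2 pi1 pi2 pi3
```

For pi2 this prints broker.id=2 followed by the same zookeeper.connect line on every node.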

Now Kafka and Zookeeper will run as a cluster. You need to start them on all nodes.

6.3.1 Monitoring your Kafka cluster

There are some tools for monitoring Kafka clusters. I think it is nice to have a look and feel of a full environment.

The text [3] provides an overview of some available tools. Kafka itself does not come with such a tool.

I opted for installing Kafka Tool. It is free for personal use, which is my case. Kafka Tool can be installed on Windows, Mac and Linux. I installed it on my Windows notebook to reduce the workload on my Raspberries.

Below I present my actual interface with the distributed nodes:

Finally, the following post provides extra information on Kafka tool [4]:

7. Starting the Cluster

I coded one script for each node to start all services, because I sometimes forget to start specific services. You will find the scripts in the home folder of the pi user. The script uses ssh connections to start services on the other nodes, and I copied it to all my nodes.
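The sketch below is not the author's script. It covers only the two services from this part, assumes the stock script names shipped with the distributions (zkServer.sh and kafka-server-start.sh) and the paths used above, and only prints the ssh commands so the order can be checked before anything is started.

```shell
#!/bin/sh
# Prints the ssh commands that would start ZooKeeper on every node and
# then Kafka on every node. Nothing is executed.
CLUSTER_NODES="pi1 pi2 pi3"

print_start_commands() {
    # ZooKeeper first: the brokers register themselves in it at startup
    for node in $CLUSTER_NODES; do
        echo "ssh $node /opt/zookeeper/bin/zkServer.sh start"
    done
    for node in $CLUSTER_NODES; do
        echo "ssh $node /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties"
    done
}

print_start_commands
```

Starting every Zookeeper node before any broker keeps the ensemble available by the time Kafka tries to connect to it.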


You can check the cluster status using the web UI for Hadoop, Yarn, Hive and the Kafka tool.

My final architecture is as presented in the following diagram:

And the services are distributed according to the table below:

I also use an Ubuntu virtual machine over VirtualBox on my notebook, with RStudio, R, Anaconda and TensorFlow.


