LABDATA
A Data Science/Big Data Laboratory — part 4 of 4: Kafka and Zookeeper over Ubuntu in a 3-node cluster
Assembling a Data Science/Big Data Laboratory in a Raspberry Pi 4 or VMs cluster with Hadoop, Spark, Hive, Kafka, Zookeeper and PostgreSQL
This text can be used to support the installation on any Ubuntu 20.04 server cluster, and this is the beauty of well-designed layered software. Furthermore, if you have more nodes, you can distribute the software as you like. The text assumes you know the Linux command line, including ssh, vim, and nano.
I do not recommend starting with fewer than three Raspberries, since you need to set up the communication, and Zookeeper works best with an odd number of nodes. If you are trying a single node, this guide may still be used, but the performance is likely to be disappointing on a Raspberry. For a single node, I suggest a virtual machine with a reasonable amount of RAM and processor.
Due to size, I had to divide the tutorial into four parts :
- Part 1: Introduction, Operational System and Networking
- Part 2: Hadoop and Spark
- Part 3: PostgreSQL and Hive
- Part 4: Kafka, Zookeeper and Conclusion
All configuration files are available at [1]:
Disclaimer: This text is offered to everyone for free to use at your own risk. I took care in citing all my sources, but if you feel that something is missing, please send me a note. Since different software versions may behave differently due to their dependencies, I suggest using the same versions I used on your first try.
6. Kafka
Kafka (https://kafka.apache.org/) is a robust message broker widely used to instantiate pipelines. Its retention feature makes it possible to handle a surge of information or the need to put consumers offline for maintenance.
Furthermore, as with almost all big data solutions, Kafka scales easily from a single node to full clusters with replication.
The primary literature for learning Kafka is the book “Kafka: The Definitive Guide” [2]. The e-book is freely available at
Many thanks to Confluent!
Kafka can easily handle from gigabytes up to petabytes a day. This is far beyond my lab cluster's capacity. However, I decided to install Kafka, initially as a single node and afterwards distributed, to allow playing with data pipelines, such as collecting real-time information from Twitter.
6.1 Zookeeper
The first step is to install the zookeeper server since Kafka depends on it for distributing metadata. I installed the most recent stable version available on the following sites:
https://zookeeper.apache.org/releases.html
https://downloads.apache.org/zookeeper/zookeeper-3.6.1/apache-zookeeper-3.6.1.tar.gz
pi@pi3:~/tmp$ wget https://downloads.apache.org/zookeeper/zookeeper-3.6.1/apache-zookeeper-3.6.1-bin.tar.gz
pi@pi3:~/tmp$ tar -xzvf apache-zookeeper-3.6.1-bin.tar.gz
pi@pi3:~/tmp$ sudo mv apache-zookeeper-3.6.1-bin /opt/zookeeper
pi@pi3:~/tmp$ cd /opt/
pi@pi3:/opt$ ls
hadoop hadoop_tmp hive zookeeper
pi@pi3:/opt$ sudo chown -R pi:pi zookeeper
[sudo] password for pi:
pi@pi3:/opt$
pi@pi3:/opt$ sudo mkdir /opt/zookeeper_data
pi@pi3:/opt$ sudo chown -R pi:pi zookeeper_data
Create the file:
/opt/zookeeper/conf/zoo.cfg
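A minimal single-node configuration could look like the following. The dataDir is the folder created above; the other values are Zookeeper's standard defaults, so treat this as a sketch and compare with the file in the repository [1]:

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper_data
clientPort=2181
```

Port 2181 is the client port that Kafka and the checks below connect to.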
Now you can start zookeeper in a single node:
pi@pi3:/opt$ /opt/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Checking service:
pi@pi3:/opt$ sudo netstat -plnt | grep 2181
tcp6 0 0 :::2181 :::* LISTEN 2511/java
Now we have zookeeper running locally on pi3. Next, we will install Kafka.
6.2 Kafka
Kafka installation as a single node is not too troublesome. I followed the book's instructions, changing only the installation folder to /opt and the user to pi.
I downloaded the most recent stable version, which was
kafka_2.13-2.5.0.tgz:
As usual, I saved it in /home/pi/tmp
The following commands extract the files, move them to /opt, and adjust folders and access rights:
pi@pi3:~/tmp$ tar -xzvf kafka_2.13-2.5.0.tgz
pi@pi3:~/tmp$ sudo mv kafka_2.13-2.5.0 /opt/kafka
pi@pi3:~/tmp$ cd /opt/
pi@pi3:/opt$ ls
hadoop hadoop_tmp hive kafka zookeeper zookeeper_data
pi@pi3:/opt$ sudo chown -R pi:pi kafka
pi@pi3:/opt$ sudo mkdir /opt/kafka_data
pi@pi3:/opt$ sudo chown -R pi:pi /opt/kafka_data
Edit file
/opt/kafka/config/server.properties,
changing the following parameter:
log.dirs=/opt/kafka_data
Starting Kafka:
pi@pi3:/opt$ /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties
Note: unless told otherwise, Kafka and zookeeper each keep a terminal busy when started. Use one terminal/remote session for each service initially. Later, the tutorial will show how to start them in a transparent way.
Checking port 9092:
pi@pi3:/opt$ sudo netstat -plnt | grep 9092
tcp6 0 0 :::9092 :::* LISTEN 3014/java
The following commands will allow you to ensure Kafka is running properly:
First, start zookeeper:
pi@pi3:~$ /opt/zookeeper/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Now you can start Kafka and create a Kafka topic with producer and consumer and test it:
pi@pi3:~$ /opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties
pi@pi3:~$ /opt/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Created topic test.
pi@pi3:~$ /opt/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic test
Topic: test PartitionCount: 1 ReplicationFactor: 1 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0 Isr: 0
pi@pi3:~$ /opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>message test 1
>message test 2
pi@pi3:~$ /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
message test 1
message test 2
^C Processed a total of 2 messages
6.3 Changing Zookeeper and Kafka to a cluster
Attention: both Kafka and Zookeeper suggest having an odd number of nodes. I am using three, so this is fine.
pi1
pi@pi1:~$ sudo mkdir /opt/zookeeper_data
pi@pi1:~$ sudo mkdir /opt/zookeeper
pi@pi1:~$ sudo mkdir /opt/kafka
pi@pi1:~$ sudo mkdir /opt/kafka_data
pi@pi1:~$ sudo chown -R pi:pi /opt/zookeeper_data
pi@pi1:~$ sudo chown -R pi:pi /opt/zookeeper
pi@pi1:~$ sudo chown -R pi:pi /opt/kafka
pi@pi1:~$ sudo chown -R pi:pi /opt/kafka_data
pi2
pi@pi2:~$ sudo mkdir /opt/zookeeper_data
pi@pi2:~$ sudo mkdir /opt/zookeeper
pi@pi2:~$ sudo mkdir /opt/kafka
pi@pi2:~$ sudo mkdir /opt/kafka_data
pi@pi2:~$ sudo chown -R pi:pi /opt/zookeeper_data
pi@pi2:~$ sudo chown -R pi:pi /opt/zookeeper
pi@pi2:~$ sudo chown -R pi:pi /opt/kafka
pi@pi2:~$ sudo chown -R pi:pi /opt/kafka_data
pi3
pi@pi3:/opt$ rsync -vaz /opt/zookeeper/ pi2:/opt/zookeeper/
pi@pi3:/opt$ rsync -vaz /opt/kafka/ pi2:/opt/kafka/
pi@pi3:/opt$ rsync -vaz /opt/zookeeper/ pi1:/opt/zookeeper/
pi@pi3:/opt$ rsync -vaz /opt/kafka/ pi1:/opt/kafka/
Edit /opt/zookeeper/conf/zoo.cfg on all nodes, removing the previous comments and listing the three servers (pi1, pi2, pi3).
Then create the myid file in /opt/zookeeper_data on each node. The file should contain only the id of that zookeeper node (see GitHub):
pi1 -> 1
pi2 -> 2
pi3 -> 3
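As an illustration of those edits (ports 2888 and 3888 are Zookeeper's default peer and leader-election ports; check the exact files in the repository [1]), the cluster entries and the per-node id file look like this:

```
# added to /opt/zookeeper/conf/zoo.cfg on all nodes:
server.1=pi1:2888:3888
server.2=pi2:2888:3888
server.3=pi3:2888:3888

# content of the id file in /opt/zookeeper_data, e.g. on pi1:
1
```

Each node reads its own id file to find its server.N entry in the shared configuration.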
For Kafka, we need to edit (in all nodes):
/opt/kafka/config/server.properties
Changing the parameters:
broker.id=1 # 2 or 3 according to the node
(1 for pi1, 2 for pi2, and 3 for pi3)
and:
zookeeper.connect=pi1:2181,pi2:2181,pi3:2181
Now Kafka and zookeeper will run as a cluster. You need to start them on all nodes.
6.3.1 Monitoring your Kafka cluster
There are some tools for monitoring Kafka clusters. I think it is nice to have the look and feel of a full environment.
The text [3]:
provides an overview of some available tools. Kafka itself does not come with such a tool.
I opted for installing Kafka Tool. It is free for personal use, which is my case. Kafka Tool can be installed on Windows, Mac and Linux. I installed it on my Windows notebook, to reduce the workload on my Raspberries.
Below I present my actual interface with the distributed nodes:
Finally, the following post provides extra information on Kafka tool [4]:
7. Starting the Cluster
I coded one script for each node to start all services, because I sometimes forget to start specific services. You will find the scripts in the pi user's home folder as cluster-start.sh. The script uses ssh connections to start services on the other nodes, and I copied it to all my nodes.
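A minimal sketch of such a script is shown below. The service paths and node names are the ones used in this tutorial, but the script itself is my own illustration, not the one from the repository; the DRY_RUN switch is an addition so the commands can be previewed without touching the nodes.

```shell
#!/bin/bash
# Sketch of a cluster-start.sh: start Zookeeper, then Kafka, on every node.
# Node names and install paths follow this tutorial; adjust to your layout.

NODES="pi1 pi2 pi3"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually start the services

remote() {
  # Run a command on a node via ssh, or just print it in dry-run mode.
  if [ "$DRY_RUN" = "1" ]; then
    echo "ssh $1 $2"
  else
    ssh "$1" "$2"
  fi
}

for node in $NODES; do
  # Zookeeper must be up before the Kafka broker on each node.
  remote "$node" "/opt/zookeeper/bin/zkServer.sh start"
  remote "$node" "/opt/kafka/bin/kafka-server-start.sh -daemon /opt/kafka/config/server.properties"
done
```

Running it with the default DRY_RUN=1 simply prints the six ssh commands it would issue, which is a convenient way to verify the order before letting it loose on the cluster.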
You can check the cluster status using the web UI for Hadoop, Yarn, Hive and the Kafka tool.
My final architecture is as presented in the following diagram:
And the services are distributed according to the table below:
I also use an Ubuntu virtual machine over VirtualBox on my notebook, with RStudio, R, Anaconda and TensorFlow.
Conclusion
I congratulate you on concluding the whole installation. I know the feeling of relief and joy! :D
Now you have a full environment to experiment and train!
I would like to thank all authors from the sources I cited.
I authorize anyone to copy parts of this tutorial provided the origin is cited. If a substantial part is needed, please just link to it.
Any comments and questions are most welcomed!
[1] P. G. Taranti. https://github.com/ptaranti/RaspberryPiCluster
[2] N. Narkhede et al. Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale. O’Reilly Media (2017)
[3] G. Myrianthous. Overview of UI monitoring tools for Apache Kafka clusters (2019)
[4] O. Grygorian. GUI for Apache Kafka (2019)