Connecting the Dots (Python, Spark, and Kafka)

Python, Spark, and Kafka are vital frameworks in a data scientist's day-to-day work, so it is essential to be able to integrate them.

Kiruparan Balachandran
Towards Data Science

--

Photo By César Gaviria from Pexels

Introduction

Data scientists frequently prefer Python (or, in some cases, R) to develop machine learning models, and they have a valid justification: data-driven solutions emerge from many experiments, experimentation requires constant interaction with the language used to build the models, and the range of Python libraries and platforms for machine learning is tremendous. This is a valid argument; however, we confront issues when these models are moved to production.

We do have Python micro-service libraries such as Flask to deploy a machine-learning model and publish it as an API. Nevertheless, the question is, ‘can this cater for real-time analytics where you need to process millions of events within milliseconds?’ The answer is ‘no.’ This situation is my motivation for writing this article.

To overcome the above problems, I have identified a set of dots that, connected appropriately, solve them. In this article, I attempt to connect these dots: Python, Apache Spark, and Apache Kafka.

The article is structured in the following order:

  • The steps to set up Apache Spark in a Linux environment.
  • Starting Kafka (for more details, please refer to this article).
  • Creating a PySpark app to consume and process events and write the results back to Kafka.
  • Steps to produce and consume events using Kafka-Python.

Installing Spark

The latest version of Apache Spark is available at http://spark.apache.org/downloads.html

Spark 2.3.2 was the latest version at the time this article was written.

Step 1: Download spark-2.3.2 to the local machine using the following command

wget http://www-us.apache.org/dist/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz

Step 2: Unpack.

tar -xvf spark-2.3.2-bin-hadoop2.7.tgz

Step 3: Create soft links (optional).

This step is optional but preferred; it facilitates upgrading Spark versions in the future.

ln -s /home/xxx/spark-2.3.2-bin-hadoop2.7/ /home/xxx/spark

Step 4: Add SPARK_HOME entry to bashrc

# set Spark-related environment variables
export SPARK_HOME="/home/xxx/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$SPARK_HOME/sbin:$PATH
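
Then reload the shell configuration so the new path takes effect:

source ~/.bashrc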

Step 5: Verify the installation

pyspark

If everything is correct, the PySpark shell starts and the following output is visible on the console:

Step 6: Start the master on this machine

start-master.sh

Spark Master Web GUI (the following screen) is accessible at this URL: http://abc.def.com:8080/

Spark Master Web GUI — Image by Author

Step 7: Start a worker

start-slave.sh spark://abc.def.ghi.jkl:7077

If everything is correct, an entry for the worker appears on the same screen.

Spark Master Web GUI with workers — Image by Author

Starting Kafka

Here, Kafka is the streaming platform used to feed events into the Spark application and to receive the processed events back from it.

Please refer to the article on Kafka I have already written for more detailed instructions.

Step 1: Go to the Kafka root folder

cd /home/xxx/IQ_STREAM_PROCESSOR/kafka_2.12-2.0.0/

Step 2: Start Kafka Zookeeper

bin/zookeeper-server-start.sh config/zookeeper.properties

Step 3: Start Kafka Brokers

bin/kafka-server-start.sh config/server.properties

Step 4: Create two Kafka Topics (input_event and output_event)
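
For example, with ZooKeeper running on localhost:2181, the two topics can be created as follows:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic input_event
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic output_event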

Event Processing on Apache Spark (PySpark)

Setup Spark

Step 1

First, set up the Python packages on each node of the cluster and specify the Python path for each worker node. Installing Anaconda is preferred here, since it contains the majority of the necessary Python packages.

Add the entry below to spark-env.sh to specify the Python path for each worker node.

export PYSPARK_PYTHON='/home/xxx/anaconda3/bin/python'

Step 2

We also need to install the other Python dependencies used in this Spark app. For example, we use kafka-python to write the processed events back to Kafka.

This is the process to install kafka-python:

In a console, go to the Anaconda bin directory

cd /home/xxx/anaconda3/bin/

Execute the following command

pip install kafka-python

Step 3

Download Spark Streaming’s Kafka assembly library from the following URL: https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8-assembly (this jar is required later when submitting the Spark job).
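
For example, assuming Spark 2.3.2 built against Scala 2.11, the assembly jar can be fetched directly with:

wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.3.2/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar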

Now we have prepared the whole environment necessary to run the Spark application.

Create and Submit the Spark Application

Creating SparkContext

Spark Context is the entry point for accessing Spark functionality and provides the connection to a Spark cluster. To create a SparkContext, we first create a SparkConf, which holds the parameters to pass to the SparkContext. The code snippet below shows how to create a SparkContext.
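
A minimal sketch, assuming the master URL of the cluster started earlier; the application name is a placeholder:

from pyspark import SparkConf, SparkContext

# the master URL points at the cluster started earlier; the app name is a placeholder
conf = SparkConf().setMaster("spark://abc.def.ghi.jkl:7077").setAppName("StreamProcessor")
sc = SparkContext(conf=conf)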

Here, only the master URL and application name are set, but you are not limited to these. SparkConf allows you to control more parameters; for example, you can specify the number of cores to use for the driver process, the amount of memory to use per executor process, and so on.

Creating [StreamingContext + input stream for Kafka Brokers]

Streaming Context is the entry point for accessing Spark Streaming functionality. Its key function is to create a Discretized Stream (DStream) from different streaming sources. The following code snippet shows how to create a StreamingContext.

from pyspark.streaming import StreamingContext

# batch duration: here I process the stream once every second
ssc = StreamingContext(sc, 1)

Next, we create the input stream that pulls messages from the Kafka brokers. The following parameters should be specified when creating the input stream:

  • Host name and port of the ZooKeeper instance that this stream connects to.
  • Group ID of this consumer.
  • “Per-topic number of Kafka partitions to consume”: the number of partitions this stream reads in parallel.

The following code snippet shows how to create an input stream for the Kafka brokers.
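
A minimal sketch, assuming ZooKeeper runs on localhost:2181, a consumer group named spark-streaming-consumer, and the input_event topic read with one partition per receiver:

from pyspark.streaming.kafka import KafkaUtils

# arguments: the StreamingContext, the ZooKeeper quorum (host:port),
# the consumer group id, and the per-topic number of partitions to consume
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"input_event": 1})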

Process events and write back to Kafka

After creating the stream for the Kafka brokers, we pull each event from the stream and process it. Here I demonstrate a typical example (word count), referenced in most Spark tutorials, with minor alterations to keep the count per key throughout the processing period and write it back to Kafka.

The following code snippet describes receiving the inbound stream and creating another stream with the processed events:
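
A minimal sketch of the word count, assuming each Kafka message value is a plain text line; updateStateByKey (which needs a checkpoint directory) keeps the running count per key across batches:

ssc.checkpoint("checkpoint")  # updateStateByKey requires a checkpoint directory

def update_count(new_values, running_count):
    # add this batch's counts to the running total for the key
    return sum(new_values) + (running_count or 0)

# each Kafka record is a (key, value) pair; the text we need is the value
lines = kafkaStream.map(lambda record: record[1])
word_counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .updateStateByKey(update_count)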

Now, all that remains is to write back to Kafka. We take the processed stream and write it to the external system by applying an output operation to the stream (here we use foreachRDD). This pushes the data in each RDD to an external system (in our use case, to Kafka). The following code snippet explains how the data in each RDD is written back to Kafka:
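
A minimal sketch of the write-back, using a kafka-python producer inside foreachPartition so that one producer is created per partition rather than per record (the broker address and output topic name are assumptions):

from kafka import KafkaProducer

def send_partition(records):
    # one producer per partition; records is an iterator of (word, count) pairs
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for word, count in records:
        producer.send("output_event", "{}:{}".format(word, count).encode("utf-8"))
    producer.flush()

word_counts.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))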

Launch spark application

The spark-submit script is used to launch the Spark application. The following parameters should be specified when launching the application:

  • master: URL to connect the master; in our example, it is spark://abc.def.ghi.jkl:7077
  • deploy-mode: option to deploy driver (either at the worker node or locally as an external client)
  • jars: recall our discussion about Spark Streaming’s Kafka libraries; here we need to submit that jar to provide Kafka dependencies.

Finally, we must submit the PySpark script we wrote in this section, i.e., spark_processor.py
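
Putting it together, the launch command looks roughly like this (the assembly jar file name depends on the version you downloaded):

spark-submit \
  --master spark://abc.def.ghi.jkl:7077 \
  --deploy-mode client \
  --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar \
  spark_processor.py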

After launching all of the above commands, our Spark application appears as follows:

Provided all is fine, the following output appears in the console:

Now we have the necessary setup, and it is time for testing. Here, I use Kafka-Python to produce and consume the events, as already discussed in one of my previous articles.

Here are the code snippets for your reference:

Code for producing the events
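
A minimal sketch of a producer, assuming the broker listens on localhost:9092 and the events are plain text lines sent to the input_event topic:

from time import sleep
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# send a handful of sample text events to the input_event topic, one per second
for i in range(100):
    producer.send("input_event", "this is event number {}".format(i).encode("utf-8"))
    sleep(1)

producer.flush()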

Code for consuming the events
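
A minimal sketch of a consumer that reads the processed events back from the output_event topic:

from kafka import KafkaConsumer

consumer = KafkaConsumer("output_event", bootstrap_servers="localhost:9092", auto_offset_reset="earliest")

# print each processed event as it arrives
for message in consumer:
    print(message.value.decode("utf-8"))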

If everything is correct, the processed events are consumed and displayed in the console as follows:

Final Thoughts

The key takeaways from this article are:
1) Python, Spark, and Kafka are important frameworks in a data scientist’s daily activities.
2) This article helps data scientists to perform their experiments in Python while deploying the final model in a scalable production environment.

Thank you for reading this. Hope that you guys will also manage to connect these dots!


A data enthusiast focused on extracting insights from business data sets, machine learning, and building and deploying large-scale machine learning models.