Abstract
Nowadays, traffic loads on applications can increase or decrease drastically within short time periods, meaning that applications need to scale out and scale in easily. To do so, they need to run on an infrastructure that can add or remove resources, but more importantly, the code itself should be written to support scalability.
In our Big Data era, applications collect and process huge amounts of data. To fulfill their purpose, these applications need to process the data in a reasonable timeframe and without consuming too many resources.
When developing a service that processes a large amount of data, developers ought to be resource-aware. Manipulating large data sets in libraries such as pandas and NumPy often duplicates data in memory, quickly and needlessly exhausting the available resources. Take the case of matching a table of students with a list of exams, where each student takes all the tests (and every test is taken by every student). Expressing this relationship as a table would require (# of students) x (# of tests) rows. For large datasets, this could far exceed the available memory.
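As a rough sketch of the problem (the dataframe and column names here are hypothetical, not from the article's dataset), a naive cross join in pandas materializes every student-test pair in memory at once:

```python
import pandas as pd

# Hypothetical tables: 1M students and 50 tests.
students = pd.DataFrame({"student_id": range(1_000_000)})
tests = pd.DataFrame({"test_id": range(50)})

# A cross join materializes every (student, test) pair:
# 1,000,000 x 50 = 50,000,000 rows, duplicating both keys for each pair.
pairs = students.merge(tests, how="cross")
print(len(pairs))  # 50,000,000 rows held in RAM at the same time
```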
More to the point, a lack of resource-awareness can lead to problems such as:
- Holding the main thread busy – an intensive CPU workload could leave other requests stuck.
- OOM – processing a huge amount of data in memory could exceed the memory available on the physical server.
- CPU load – performing intensive computations can increase our application's latency, which hurts the user experience.
Besides, simply extending the resources (scaling up) or deploying more instances of an application (scaling out) won't help on its own. Scaling up is bounded by a single server's CPU and memory, which makes it a limited option in a Big Data environment, and scaling out isn't possible when the application holds all the data and tries to process it at once.
So, how can we make a data-driven application scale out?
The Dispatcher Pattern
There are many ways to help our code support scaling; one of them is a pattern named Dispatcher.
The dispatcher pattern decouples the execution of tasks from the component that requested them. It is comparable to a master-slave architecture: a dispatcher sends commands to the slaves, which perform them.
The dispatcher does more than just send commands to its workers: it applies some computation logic before dispatching, for example dividing the data into categories, such as by day. With this methodology, we can scale out our application on demand.
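As a minimal, Kafka-free sketch of the idea (the function names and record layout are hypothetical, for illustration only), a dispatcher could split a dataset by day and hand each slice to a worker:

```python
from collections import defaultdict

def dispatch(records, send_to_worker):
    """Group records by day and send each group as one atomic job."""
    jobs = defaultdict(list)
    for record in records:
        jobs[record["date"]].append(record)  # split the data by day
    for day, day_records in jobs.items():
        send_to_worker({"day": day, "records": day_records})

def worker(job):
    """Process one small job independently of the others."""
    print(f"processing {len(job['records'])} records for {job['day']}")

# In-process stand-in for a real queue; with Kafka, send_to_worker
# would publish the job to a topic instead of calling the worker directly.
dispatch(
    [{"date": "2020-03-01", "value": 1}, {"date": "2020-03-02", "value": 0}],
    send_to_worker=worker,
)
```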
The benefits of using the dispatcher pattern:
- Efficient resource utilization – with the dispatcher pattern, one can decide when to scale the application based on the system load and save costs by reducing the consumption of needless resources.
- Separation of business logic and data preprocessing – using this pattern, one can separate the concerns. Each component has a better-defined purpose: the dispatcher asks the slaves to execute atomic operations, and each slave executes a simpler operation.
- Easy debugging – with this method, one can detect bugs on a smaller, atomic part of the data (e.g., which batch was faulty).
- Improved processing time – the more workers deployed, the faster the data is processed.

There are many ways to implement the dispatcher pattern in your application. Here we're going to implement it in a microservices environment using Apache Kafka (an event streaming platform).
Apache Kafka supports a publish-subscribe (PubSub) architecture, which means that a publisher publishes messages to a queue (called a topic) and consumers consume messages from that topic. (Apache Kafka has many more capabilities, but they are out of scope for this article.)
We will use Apache Kafka to deliver messages between the dispatcher and its workers and show how the support for scaling comes about.
An example of using the Dispatcher Pattern
In this example, we're going to manipulate a COVID-19 dataset. The dataset is fake and was generated using the faker library. It covers a timeframe of positive and negative COVID-19 cases around the world and has the following attributes: patient name, city, country, birth_date, and the date and result of the COVID-19 test.
The idea is to calculate each country's COVID-19 positive rate across the dataset's timeframe.
Link for the dataset: dataset.
Dataset exploration:

The dataset contains 2M records; each record represents a patient and their COVID-19 test result. The dataset's timeframe runs from 01/03/2020 to 01/01/2021.
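A quick exploration sketch with pandas (the file name and column names are assumptions based on the attributes listed above, not the article's exact code):

```python
import pandas as pd

# Assumed file name; columns follow the attributes described above.
df = pd.read_csv("covid19_fake_dataset.csv", parse_dates=["test_date"])

print(df.shape)                                       # ~2M records
print(df["test_date"].min(), df["test_date"].max())   # 2020-03-01 .. 2021-01-01
print(df.head())
```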
The Dispatcher
The dispatcher's task is to split the data into atomic permutations and dispatch them to its workers for processing. The keys "country" and "date_hour" represent the atomic permutation we want to calculate the positive rate for. As we can see, the dispatcher's task here is small: it just has to group the data by the country and date_hour keys, which isn't an intensive operation. The dispatcher's task doesn't always have to be a heavy one.
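A minimal sketch of that grouping step, assuming the dataframe and column names from the exploration sketch above (hourly buckets and the "result" column are assumptions):

```python
# Derive the "date_hour" key from the test timestamp (assumption: hourly buckets).
df["date_hour"] = df["test_date"].dt.floor("H")

# Each (country, date_hour) group becomes one atomic job for a worker.
jobs = [
    {
        "country": country,
        "date_hour": str(date_hour),
        "records": group[["result"]].to_dict("records"),
    }
    for (country, date_hour), group in df.groupby(["country", "date_hour"])
]
print(f"{len(jobs)} jobs to dispatch")
```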

Once the dispatcher's task is implemented, we dispatch all the jobs to the workers through the Kafka broker. There are a few libraries for working with Kafka; I prefer confluent-kafka, which is faster according to the attached benchmark (http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/) and also has good documentation.
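A sketch of the dispatch step using confluent-kafka (the broker address, topic name, and job structure from the grouping sketch above are assumptions):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker address

TOPIC = "covid19-jobs"  # assumed topic name

for job in jobs:  # `jobs` as built by the dispatcher sketch above
    producer.produce(
        TOPIC,
        key=f'{job["country"]}_{job["date_hour"]}',
        value=json.dumps(job).encode("utf-8"),
    )

producer.flush()  # block until all queued messages are delivered
```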
The Workers
After dispatching, the Kafka consumers (our workers) consume the messages one by one and start working on the jobs.
The worker's task is to take one job at a time, where each job contains the records of a specific date and country, and calculate the COVID-19 positive rate for it. It's important to understand that a worker can run more than one job over its lifetime, but it handles only one job at a time.
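A sketch of one worker, again using confluent-kafka (the consumer group id, topic name, record layout, and "positive" result value are assumptions carried over from the sketches above):

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "covid19-workers",           # all workers share one consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["covid19-jobs"])         # assumed topic name

while True:
    msg = consumer.poll(1.0)                 # one job at a time
    if msg is None or msg.error():
        continue
    job = json.loads(msg.value())
    results = [r["result"] for r in job["records"]]
    positive_rate = sum(1 for r in results if r == "positive") / len(results)
    # Here the result would be persisted (e.g., to a database or another topic).
    print(job["country"], job["date_hour"], round(positive_rate, 3))
```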

We can see that the workers' tasks are small and allow us to diagnose bugs easily. Another advantage is that if the data processing unexpectedly shuts down, we don't have to start the processing from the beginning.
After all the workers have finished their jobs, we can concatenate all of the results and plot our insights.
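A sketch of that last step, assuming (this is not specified in the article) that each worker wrote its per-(country, date_hour) results as CSV rows into a shared results/ directory:

```python
import glob
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: each worker wrote rows of (country, date_hour, positive_rate).
results = pd.concat(
    (pd.read_csv(path, parse_dates=["date_hour"]) for path in glob.glob("results/*.csv")),
    ignore_index=True,
)

# Positive rate over time, one line per country.
for country, group in results.groupby("country"):
    group.sort_values("date_hour").plot(
        x="date_hour", y="positive_rate", ax=plt.gca(), label=country
    )

plt.ylabel("positive rate")
plt.show()
```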

In the plot below, we can see the COVID-19 positive rate across dates for all the countries in our dataset.
Let's summarize this demo with a bar plot comparing running this task with the dispatcher pattern and running it without one.

Summary
This article emphasizes the importance of writing code that supports scaling when you develop your next data processing job. Not paying attention to whether our code supports scaling, or not knowing how much data we're trying to process, can lead to multiple issues. In this article, we saw one way to write code that supports scaling and a practical example of how to implement it using Apache Kafka.
Hope you find it useful (: