Introduction
Data engineers usually have to pull data from several sources, clean it and aggregate it. In many cases, these processes need to be applied over large volumes of data.
In today’s article we will explore one of the most fundamental concepts in Computing, and in Data Engineering in particular, called Parallel Programming, which enables modern applications to process enormous amounts of data in relatively small time frames.
Additionally, we will discuss the benefits of parallel programming in general, as well as its drawbacks. Finally, we will explore a few packages and frameworks that can take advantage of modern multi-core systems and clusters of computers in order to distribute and parallelize workloads.
What is Parallel Programming?
Parallel Computing sits at the heart of pretty much every modern data processing tool. Such frameworks take advantage of the processing power and memory that modern machines offer: a main task is split into subtasks, and these subtasks are executed in parallel on several computers.

Note that parallel computing can even take advantage of multiple cores in just a single machine.
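To make this more concrete, here is a minimal, hypothetical sketch of splitting work across the cores of a single machine using the standard library (the square function and the input range are purely illustrative):
import os
from multiprocessing import Pool

def square(x):
    # A trivial "subtask": each worker handles part of the input
    return x * x

if __name__ == '__main__':
    # One worker process per available CPU core
    with Pool(processes=os.cpu_count()) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, 9, ..., 81]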
It is also important not to confuse concurrency with parallelism. These are two different concepts, so if you want to get a better idea of their main differences you can read one of my recent articles discussing the difference between multi-threading and multi-processing.
Benefits of Parallel Programming
The first benefit of parallel programming is its ability to take advantage of all the cores of a multi-core system while executing tasks, so that these tasks can be completed in much less time.
In the big data era, datasets can grow to enormous sizes, and it may therefore be impossible to load them onto a single machine. With parallel computing, such datasets can be partitioned and loaded across multiple machines in a distributed fashion.
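As a rough, single-machine illustration of partitioning (in a real cluster, each partition would live on a different node), consider the following hypothetical sketch:
import numpy as np
import pandas as pd

# A DataFrame standing in for a dataset too large to process as one piece
big_df = pd.DataFrame({'value': range(1_000_000)})

# Split it into 4 partitions; a distributed framework would place
# each partition on a different machine and process them in parallel
partitions = np.array_split(big_df, 4)
print([len(p) for p in partitions])  # [250000, 250000, 250000, 250000]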
Parallel Slowdown
Apart from the obvious benefits, parallel computing comes with a few drawbacks. Distributing tasks across a cluster of computers carries an overhead related to how the nodes communicate with each other.
So there’s a chance that distributing a fairly simple task won’t speed up its execution; in fact, it may instead trigger a parallel slowdown, which negatively affects the execution time.
In other words, for small and simple tasks it will be more efficient (and probably easier) to execute them on a single machine rather than distributing them across a cluster of nodes.
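As a rough illustration (the exact timings will vary from machine to machine), a sketch like the one below often shows the parallel version losing to the serial one on a trivial task, precisely because of that communication overhead:
import time
from multiprocessing import Pool

def add_one(x):
    # A task so cheap that sending it to another process costs more
    # than simply computing it locally
    return x + 1

if __name__ == '__main__':
    data = list(range(100_000))

    start = time.perf_counter()
    serial = [add_one(x) for x in data]
    print(f'serial:   {time.perf_counter() - start:.3f}s')

    start = time.perf_counter()
    with Pool(4) as pool:
        parallel = pool.map(add_one, data)
    print(f'parallel: {time.perf_counter() - start:.3f}s')  # often slower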
Another factor we need to consider when parallelizing tasks is whether they can actually be distributed across a cluster of nodes.
Parallel Programming in Python
The multiprocessing package comes with the Python standard library and offers an intuitive API that lets users spawn multiple processes for both local and remote concurrency, effectively side-stepping the Global Interpreter Lock. You can read more about what the Python GIL is and what it does in my article shared below.
More specifically, the [Pool](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool) object can be used to parallelize the execution of a function across multiple input values, distributing the input data across multiple processes that may run on different CPU cores of a single machine.
As an example, let’s assume that we have a pandas DataFrame containing personal information about students, including their age and class year.
import pandas as pd

df = pd.DataFrame(
    [
        (1, 2021, 15),
        (2, 2020, 14),
        (3, 2021, 17),
        (4, 2019, 13),
        (5, 2019, 14),
        (6, 2020, 15),
        (7, 2020, 14),
        (8, 2021, 14),
        (9, 2021, 13),
        (10, 2020, 14),
        (11, 2019, 12),
        (12, 2018, 10),
        (13, 2019, 15),
        (14, 2019, 16),
    ],
    columns=['student_id', 'class_year', 'age']
)

print(df)
    student_id  class_year  age
0            1        2021   15
1            2        2020   14
2            3        2021   17
3            4        2019   13
4            5        2019   14
5            6        2020   15
6            7        2020   14
7            8        2021   14
8            9        2021   13
9           10        2020   14
10          11        2019   12
11          12        2018   10
12          13        2019   15
13          14        2019   16
Now let’s say we want to compute the mean age of students per class year. In the example below, we take advantage of the multiprocessing library in order to parallelize this computation across 4 different cores on our local machine.
from multiprocessing import Pool

def compute_mean_age(groupby_year):
    # Each worker receives a (class_year, group) pair produced by groupby
    year, group = groupby_year
    return pd.DataFrame({'age': group['age'].mean()}, index=[year])

with Pool(4) as p:
    # Distribute the per-year groups across 4 worker processes
    mean_age = p.map(compute_mean_age, df.groupby('class_year'))

df_mean_age_per_year = pd.concat(mean_age)
And the result would be:
print(df_mean_age_per_year)
        age
2018  10.00
2019  14.00
2020  14.25
2021  14.75
Note that in this example we’ve used a very small dataset, for which the parallel approach above would actually trigger the parallel slowdown effect we discussed earlier. The approach would only be beneficial if our dataset was extremely large. For small datasets, the following would do the trick:
>>> df.groupby('class_year')['age'].mean()
class_year
2018    10.00
2019    14.00
2020    14.25
2021    14.75
Name: age, dtype: float64
Modern Parallel Computation Frameworks
Numerous technologies have emerged over time that are capable of distributing workloads and executing multiple tasks in parallel by taking advantage of multi-core systems and clusters of computers.
One such tool is Dask, which offers advanced parallelism that enables performance at scale for tools such as numpy, pandas and scikit-learn.
Instead of the "low level" multiprocessing package, we could use Dask to compute the mean age per class year from the previous example. With Dask, we can achieve the same result with much cleaner and more intuitive code, shared below.
import dask.dataframe as dd

# Split the pandas DataFrame into 4 partitions that Dask can process in parallel
dd_students = dd.from_pandas(df, npartitions=4)

df_mean_age_per_year = dd_students.groupby('class_year').age.mean().compute()
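Note that Dask builds up a lazy task graph: nothing is actually evaluated until .compute() is called, at which point the partitions are processed in parallel and the result comes back as a regular pandas object:
print(df_mean_age_per_year)
# Same values as the pandas groupby result shown earlier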
If you have done some research around Big Data, chances are you’ve come across Hadoop, which is maintained by the Apache Software Foundation. Hadoop is a collection of open-source projects, including MapReduce and the Hadoop Distributed File System (HDFS). In a nutshell, MapReduce was one of the first distributed programming paradigms used to access big data stored on HDFS. Another popular tool in the Hadoop ecosystem is Hive, a layer running on top of Hadoop that can access data using HiveQL, a SQL-like query language. For instance, you can use Hive to write a query that will in turn be translated into a job that can be executed in a distributed fashion on a cluster of computers.
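To give a feel for the MapReduce paradigm (this is just a single-machine illustration in plain Python, not actual Hadoop code), the mean-age computation can be expressed as a map phase, a shuffle that groups by key, and a reduce phase:
from collections import defaultdict

# Illustrative (class_year, age) records
records = [(2021, 15), (2020, 14), (2021, 17), (2019, 13), (2019, 14)]

# Map phase: emit (key, value) pairs, here (class_year, age)
mapped = [(year, age) for year, age in records]

# Shuffle phase: group all values that share the same key
grouped = defaultdict(list)
for year, age in mapped:
    grouped[year].append(age)

# Reduce phase: aggregate each group; on a real cluster each group
# could be reduced on a different node in parallel
mean_age_per_year = {year: sum(ages) / len(ages) for year, ages in grouped.items()}
print(mean_age_per_year)  # {2021: 16.0, 2020: 14.0, 2019: 13.5}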
Finally, another popular tool in the Data Engineering stack is Apache Spark, an engine used to execute data engineering, data science and machine learning workloads at scale. The framework supports multiple languages, including Scala and Python, and can be used to perform SQL analytics, Data Science and Machine Learning at scale, as well as batch and streaming data processing.
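As a hypothetical sketch of what the same mean-age computation might look like in PySpark (assuming a local Spark installation, and reusing the pandas DataFrame df defined earlier):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName('mean-age').getOrCreate()

# Convert the pandas DataFrame defined earlier into a Spark DataFrame
sdf = spark.createDataFrame(df)

# Compute the mean age per class year; the work is split across
# the cores (or cluster nodes) available to Spark
sdf.groupBy('class_year').agg(F.mean('age').alias('age')).orderBy('class_year').show()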
You can find more tools related to Big Data Engineering in one of my recent posts here on Medium, called Tools for Data Engineers, where I go into more detail about the purpose of the tools I mentioned today, including Apache Airflow and Apache Kafka.
Final Thoughts
In today’s article we discussed one of the most important pillars in Data Engineering, namely Parallel Computing. We covered what purpose it serves and what benefits it brings, especially in the big data era. Additionally, we explored how to perform parallel computing with Python’s built-in multiprocessing package over a pandas DataFrame, as well as how to achieve the equivalent behaviour using dask.
Finally, we discussed the most commonly used Parallel Computing frameworks that every Data Engineer should at least be aware of in order to design and implement efficient and scalable data workflows across the organisation. Such frameworks include the Hadoop ecosystem (which contains numerous tools, including MapReduce, HDFS and Hive) and Apache Spark.