
What is Parallel Computing?

Understanding the importance of parallel computing in the context of Data Engineering

Photo by Martin Sanchez on Unsplash

Introduction

Data engineers usually have to pull data from several sources, then clean and aggregate it. In many cases, these steps need to be applied to very large volumes of data.

In today’s article we will explore one of the most fundamental concepts in Computing, and in Data Engineering in particular, called Parallel Programming, which enables modern applications to process enormous amounts of data in relatively short time frames.

Additionally, we will discuss the benefits of parallel programming in general, as well as its drawbacks. Finally, we will explore a few packages and frameworks that can take advantage of modern multi-core systems and clusters of computers in order to distribute and parallelize workloads.


What is Parallel Programming?

Parallel Computing sits at the heart of pretty much every modern data processing tool. Such frameworks take advantage of the processing power and memory of modern machines: a main task is split into subtasks, and these subtasks are executed in parallel on several computers.

Multiple tasks (or subtasks) being executed in parallel – Source: Author

Note that parallel computing can even take advantage of multiple cores in just a single machine.

It is also important not to confuse concurrency and parallelism. These are two different concepts, so if you want to get a better idea of their main differences you can read one of my recent articles discussing the difference between multi-threading and multi-processing.

Multi-threading and Multi-processing in Python
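
To make the distinction slightly more concrete, here is a rough sketch (using the standard concurrent.futures module, which is not covered further in this article) that runs the same CPU-bound function first on a pool of threads, which in CPython are constrained by the Global Interpreter Lock, and then on a pool of processes, which can run truly in parallel on separate cores.

import math
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Purely CPU-bound work: no I/O to overlap
    return sum(math.sqrt(i) for i in range(n))

if __name__ == '__main__':
    inputs = [2_000_000] * 8

    # Concurrency with threads: only one thread executes Python bytecode
    # at a time because of the GIL, so CPU-bound work is not sped up
    with ThreadPoolExecutor(max_workers=4) as executor:
        thread_results = list(executor.map(cpu_bound, inputs))

    # Parallelism with processes: each worker has its own interpreter
    # and GIL, so the work can run on multiple cores at once
    with ProcessPoolExecutor(max_workers=4) as executor:
        process_results = list(executor.map(cpu_bound, inputs))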


Benefits of Parallel Programming

The first benefit of parallel programming is its ability to take advantage of all the cores of a multi-core system while executing tasks, so that they can be completed in much less time.

In the big data era, datasets can grow to enormous sizes, which may make it impossible to load them on a single machine. With parallel computing, such datasets can be partitioned and loaded across multiple machines in a distributed fashion.
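
As a small preview of the tooling discussed later in this article, here is a minimal sketch (the file pattern students-*.csv is hypothetical) of how a dataset too large for memory could be loaded in partitions with Dask:

import dask.dataframe as dd

# Each matching CSV file (and each 64 MB block within it) becomes a
# separate partition that can be processed in parallel and out of core
ddf = dd.read_csv('students-*.csv', blocksize='64MB')
print(ddf.npartitions)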


Parallel Slowdown

Apart from the obvious benefits, parallel computing also comes with a few drawbacks. Distributing tasks across a cluster of computers introduces an overhead related to how the nodes communicate with each other.

So there’s a chance that distributing fairly simple tasks won’t speed up the execution at all; it may instead trigger a parallel slowdown that actually increases the overall execution time.

In other words, for small and easy tasks it will be more efficient (and perhaps easier) to execute them on a single machine rather than distributing them across a cluster of nodes.
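
To illustrate the effect, here is a minimal sketch (the timings are purely illustrative and will vary between machines) that applies a trivial function to a list of numbers, first serially and then through a multiprocessing.Pool. On most machines the parallel version will be slower, because the cost of spawning processes and shipping data between them dwarfs the tiny amount of actual work.

import time
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    numbers = list(range(100_000))

    # Serial execution: no inter-process overhead
    start = time.perf_counter()
    serial_result = [square(n) for n in numbers]
    print(f'serial:   {time.perf_counter() - start:.4f}s')

    # Parallel execution of the same trivial task: spawning workers and
    # pickling the data back and forth usually costs more than the work itself
    start = time.perf_counter()
    with Pool(4) as p:
        parallel_result = p.map(square, numbers)
    print(f'parallel: {time.perf_counter() - start:.4f}s')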

Another factor we need to consider when parallelizing tasks is whether they can actually be distributed across a cluster of nodes.


Parallel Programming in Python

The multiprocessing package, which is part of the Python standard library, offers an intuitive API that lets users spawn multiple processes for both local and remote concurrency, effectively side-stepping the Global Interpreter Lock. You can read more about what the Python GIL is and what it does in my article shared below.

What is the Python Global Interpreter Lock (GIL)?

More specifically, the [Pool](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool) object can be used to parallelize the execution of a function across multiple input values: the input data is distributed across multiple processes, which can be spawned on different CPU cores of a single machine.

As an example, let’s assume that we have a pandas DataFrame containing personal information about Students, including their age and the class year.

import pandas as pd
df = pd.DataFrame(
    [
        (1, 2021, 15),
        (2, 2020, 14),
        (3, 2021, 17),
        (4, 2019, 13),
        (5, 2019, 14),
        (6, 2020, 15),
        (7, 2020, 14),
        (8, 2021, 14),
        (9, 2021, 13),
        (10, 2020, 14),
        (11, 2019, 12),
        (12, 2018, 10),
        (13, 2019, 15),
        (14, 2019, 16),
    ],
    columns=['student_id', 'class_year', 'age']
)
print(df)
    student_id  class_year  age
0            1        2021   15
1            2        2020   14
2            3        2021   17
3            4        2019   13
4            5        2019   14
5            6        2020   15
6            7        2020   14
7            8        2021   14
8            9        2021   13
9           10        2020   14
10          11        2019   12
11          12        2018   10
12          13        2019   15
13          14        2019   16

Now let’s say we want to compute the mean age of students, per class year. In the example below, we take advantage of the multiprocessing package in order to parallelize this computation across 4 worker processes on our local machine.

from multiprocessing import Pool

def compute_mean_age(groupby_year):
    # Each worker receives a (year, group) tuple produced by df.groupby()
    year, group = groupby_year
    return pd.DataFrame({'age': group['age'].mean()}, index=[year])

# The guard is required on platforms that spawn (rather than fork) worker processes
if __name__ == '__main__':
    with Pool(4) as p:
        mean_age = p.map(compute_mean_age, df.groupby('class_year'))
        df_mean_age_per_year = pd.concat(mean_age)

And the result would be:

print(df_mean_age_per_year) 
      age
2018  10.00
2019  14.00
2020  14.25
2021  14.75

Note that in this example we’ve used a very small dataset, for which the approach above would actually trigger the parallel slowdown effect discussed earlier. Parallelizing the computation is only beneficial when the dataset is extremely large. For small datasets, a plain groupby does the trick:

>>> df.groupby('class_year')['age'].mean()
class_year
2018        10.00
2019        14.00
2020        14.25
2021        14.75
Name: age, dtype: float64

Modern Parallel Computation Frameworks

Numerous technologies have emerged over time that are capable of distributing workloads and executing multiple tasks in parallel by taking advantage of multi-core systems and clusters of computers.

One such tool is Dask, which offers advanced parallelism that enables at-scale performance for tools such as numpy, pandas and scikit-learn.

Instead of the "low-level" multiprocessing package, we could use Dask to compute the mean age per class year from the previous example. With Dask we can achieve the same result with much cleaner and more intuitive code, as shown below.

import dask.dataframe as dd

# Split the pandas DataFrame into 4 partitions that can be processed in parallel
dd_students = dd.from_pandas(df, npartitions=4)

# Build the aggregation lazily and trigger the computation with .compute()
df_mean_age_per_year = dd_students.groupby('class_year').age.mean().compute()

If you have done some research around Big Data, chances are you’ve come across Hadoop, which is maintained by the Apache Software Foundation. Hadoop is a collection of open-source projects including MapReduce and the Hadoop Distributed File System (HDFS). In a nutshell, MapReduce was one of the first distributed programming paradigms used to process big data stored on HDFS. Another popular tool in the Hadoop ecosystem is Hive, a layer running on top of Hadoop that exposes data through HiveQL, an SQL-like query language. For instance, you can use Hive to write a query that is in turn translated into jobs executed in a distributed fashion on a cluster of computers.

Finally, another popular tool in the Data Engineering stack is Apache Spark, an engine for large-scale data processing and machine learning. The framework supports multiple languages, including Scala and Python, and can be used to perform SQL analytics, Data Science and Machine Learning at scale, as well as batch and streaming data processing.
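
As a rough sketch of what this looks like in practice (assuming pyspark is installed locally and reusing the pandas DataFrame df from the earlier example), the same mean-age-per-class-year aggregation could be expressed in PySpark as follows:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName('students').getOrCreate()

# Convert the pandas DataFrame into a distributed Spark DataFrame
sdf = spark.createDataFrame(df)

# The same mean-age-per-class-year aggregation, this time executed by Spark
mean_age_per_year = (
    sdf.groupBy('class_year')
       .agg(F.avg('age').alias('age'))
       .orderBy('class_year')
)
mean_age_per_year.show()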

You can find more tools related to Big Data Engineering in one of my recent posts here on Medium, called Tools for Data Engineers, where I go into more detail about the purpose of the tools mentioned today, as well as others such as Apache Airflow and Apache Kafka.

Tools For Data Engineers


Final Thoughts

In today’s article we discussed one of the most important pillars of Data Engineering, namely Parallel Computing. We covered the purpose it serves and the benefits it brings, especially in the big data era. Additionally, we explored how to perform parallel computing over a pandas DataFrame with Python’s built-in multiprocessing package, as well as how to achieve the equivalent behaviour using Dask.

Finally, we discussed the most commonly used Parallel Computing frameworks that every Data Engineer should at least be aware of in order to design and implement efficient and scalable data workflows across an organisation. Such frameworks include the Hadoop ecosystem (which contains numerous tools, including MapReduce, HDFS and Hive) and Apache Spark.



