Making Sense of Big Data

How to process big data with examples: MapReduce

A simple explanation of how to run parallel workloads to process big data

Ebru Cucen
Towards Data Science
5 min readJan 29, 2021


I decided to write a series on distributed computing: a historical review of where we succeeded in handling big data properly and how it evolved. The challenge with processing big files was that our computation, running on distributed machines, had to finish within a reasonable time, and this raised the questions of how we can run embarrassingly parallel computations, how we distribute the data, and how we handle failures. In 2004, Google shared the MapReduce framework, after four or five years of running an average of one hundred thousand MapReduce jobs and processing about twenty petabytes of data per day.

What they achieved was to decouple the processing into Map tasks, which process the input in parallel, and Reduce tasks, which aggregate the sorted and grouped data. A master tracks the workers' health and the completion of tasks, and re-executing the tasks of failed machines provides fault tolerance.

In simple pseudo-code, map takes a key/value pair as input and emits a list of intermediate key/value pairs, whose keys and values can be of a different type than the input. Reduce takes an intermediate key, together with all the values emitted for it by the map function, already sorted and grouped and ready for an aggregation to run over them. The output is typically a single value per key.

map (k1,v1) → list(k2,v2)

reduce (k2,list(v2)) → list(v2)

initial execution

Let’s consider a simple use case, `word count`, where we try to find the number of occurrences of each word. In this scenario, the user program splits the input file into M splits: if the input is a corpus, each split could be a document; if the input is a single document, each split could be a line, and so on. It also spins up multiple workers, one of which becomes the master node.
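As a rough sketch of what that splitting could look like (this is not from the paper or any original gist; `split_input` and `split_size` are made-up names for illustration):

# Toy illustration of splitting an input file into M pieces before map
# workers pick them up. Google's implementation used splits of roughly
# 16MB to 64MB; boundaries here are byte-based and not word-aligned.
def split_input(path, split_size=64 * 1024 * 1024):
    splits = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(split_size)
            if not chunk:
                break
            splits.append(chunk)
    return splits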

master picks map/reduce tasks

The master node is responsible for assigning workers to map and reduce tasks and for checking their health (via pinging). If a worker does not respond within the time frame, the master marks it as failed, reschedules the tasks that were in progress on it, and also re-executes its completed map tasks, since the output data on that machine is lost.
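To make that bookkeeping concrete, here is a simplified toy sketch of the failure handling, assuming the master keeps two in-memory dictionaries, `workers` (worker to last-ping time) and `tasks` (task id to its state, type, and assigned worker); these structures and names are illustrative only, not the real implementation:

import time

TIMEOUT = 10  # seconds without a ping before a worker is marked as failed

def handle_failures(workers, tasks):
    now = time.time()
    for worker, last_ping in workers.items():
        if now - last_ping <= TIMEOUT:
            continue  # this worker is still healthy
        for task in tasks.values():
            if task["worker"] != worker:
                continue
            # In-progress tasks are rescheduled; completed map tasks are
            # redone too, because their output lived on the failed
            # worker's local disk.
            if task["state"] == "in_progress" or (
                task["state"] == "completed" and task["type"] == "map"
            ):
                task["state"] = "idle"
                task["worker"] = None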

split input file processed by map workers

Workers assigned to the map function read their split and save the intermediate key/value pairs to local disk (first to an in-memory buffer, which a partitioning function then writes to disk, but we will skip the details to avoid getting distracted). It was groundbreaking at the time because map workers doing this in an embarrassingly parallel fashion resulted in high performance on large clusters of commodity PCs.
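As a minimal sketch of that buffering and partitioning step (the function and file names here are made up for illustration; the real implementation is far more involved), a map worker could partition its buffered pairs by hashing the key into one of R reduce partitions and write one local file per partition:

from collections import defaultdict

# Minimal sketch: assign each intermediate key to one of R reduce
# partitions via hash(key) % R, then write each partition to a local
# file that reduce workers will later fetch.
def flush_buffer(buffered_pairs, map_task_id, R=5000):
    partitions = defaultdict(list)
    for key, value in buffered_pairs:
        partitions[hash(key) % R].append((key, value))
    for r, pairs in partitions.items():
        with open(f"map-{map_task_id}-part-{r}.txt", "w") as f:
            for key, value in pairs:
                f.write(f"{key}\t{value}\n")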

MapReduce also provides a locality optimisation, since it uses GFS: the master asks GFS for the locations of the input file blocks and schedules map tasks on the same machine or the same rack. This allows thousands of machines to read input at local disk speed, instead of having to go across rack switches and be limited by the network read rate.

The catch is that we are responsible for writing the map and reduce functions, so let’s look at what we would have written in Python. The original article/implementation is in C++; I will use Python to keep it simple. There are many ways of implementing it, and I have shared some in the references if you want to explore more:
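The original gist is not reproduced here; a minimal Hadoop-streaming-style version in the spirit of Michael Noll’s blog post from the references (reading lines from stdin and writing tab-separated pairs to stdout) could look roughly like this:

import sys

# mapper.py - emits (word, 1) for every word read from stdin
def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

# reducer.py - sums the counts per word; it relies on the input being
# sorted by key, which the shuffle/sort phase guarantees
def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

In practice these would live in two separate scripts, with the framework’s shuffle/sort phase sitting between them.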

Now that all the intermediate key/value pairs are saved and a partitioning function has put them onto disk, the words are sorted and grouped as part of the reduce tasks. The reduce function takes each unique key and its list of grouped values, computes the sum of the counts, and appends each result to the global output. If we were asked to find the maximum temperature for a given list of cities, we would do a max aggregation in the reduce function instead, as sketched below.
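Only the aggregation would change; a hypothetical sketch (the function name is made up here):

# Reduce step for a max-temperature job: same shape as word count,
# but the aggregation is max instead of sum.
def reduce_max_temperature(city, temperatures):
    # temperatures is the sorted/grouped list of values for this city
    return city, max(temperatures)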

If a reduce worker fails after completing its task, the master does not need to ask for re-execution, as the output is already stored in the globally shared file system.

And our complete job (with the help of the mrjob module) would look like the snippet below. The mrjob module can run jobs on almost any on-premises or cloud Hadoop cluster. (PS: if you prefer Go, the gossamr repo can be useful.)
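Since the original snippet is not embedded here, a minimal word count with mrjob might look roughly like the following sketch:

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # the input key is ignored; the value is one line of the input file
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # optional local aggregation before the shuffle
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

You can run it locally with `python word_count.py input.txt`, or point it at a cluster with one of mrjob’s runners such as `-r hadoop` or `-r emr`.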

If we check our complete example, we get the number of occurrences of each word, saved in a globally accessible location. Google’s paper is based on about 2,000 worker machines, 5,000 partitions, and 200,000 splits, where each split was chosen to be about 16MB to 64MB.

aggregated reduce worker results in values for unique key

The input key/value pairs are drawn from a different domain than the intermediate and the output key/value pairs, and when we are handling petabytes of data we cannot predict them in advance.

Next Steps

MapReduce was the first abstraction to enable the execution of parallel tasks while taking advantage of the local file system. Next, inspired by the MapReduce programming model, Hadoop was launched at Yahoo in 2006, and the cloud providers did not fail to promote Hadoop-on-cloud tools: Google implemented it as Dataproc, AWS as EMR (Elastic MapReduce), and Azure as HDInsight.

Call to Arms:

So, let’s reflect back. The question is: for this simple problem, how would you improve the solution the Googlers found? How would you handle the files spilled at the end of the map phase if you wanted to handle repeated executions? How would you make the software developer’s life easier by letting them write map/reduce functions in a more abstracted way?

Until the next one, take care!

References:

Michael Noll’s blog post
