
Scalable Machine Learning with Spark

Distributed Algorithms, Map-Reduce Paradigm, Scalable ML using Spark MLlib on Standalone, AWS EMR Cluster with Docker & Nvidia RAPIDS.

Since the early 2000s, the amount of data collected has increased enormously with the rise of internet giants such as Google, Netflix, YouTube, Amazon and Facebook. Around 2010, another "data wave" arrived when mobile phones became hugely popular. In the 2020s, we anticipate another exponential rise in data as IoT devices become all-pervasive. Given this backdrop, building scalable systems becomes a sine qua non for machine learning solutions.

Machine Learning in Spark: Zero to Hero Edition

Any solution depends mainly on two types of tasks:

a) Compute-heavy: Before the 2000s, parallel processing boxes known as ‘supercomputers’ were popular for compute-heavy tasks. Until around 2005, parallel processing libraries such as MPI and PVM were the popular choice for such tasks, and the design of TensorFlow was later based on them.

b) Data-heavy: Databases based on relational algebra were designed in the 1970s, when hard disk storage was very expensive. Hence, the design aimed to reduce data redundancy by dividing larger tables into smaller ones and linking them using relationships (normalization).

Thus, traditional databases such as MySQL, PostgreSQL and Oracle were not designed to scale, especially in the data-explosion context mentioned above. Consequently, NoSQL databases were designed to cater to different situations:

  1. MongoDB: to store text documents
  2. Redis, Memcached: distributed hash tables for quick key-value lookup
  3. Elasticsearch: to search through text documents
  4. HBase and Cassandra: wide-column stores
  5. Neo4j and Grakn: graph databases

However, Machine Learning & Deep Learning solutions on large data sets are both compute-heavy and data-heavy at the same time. Hence, in order to build scalable AI/ML solutions, the solution must cater to both.

Fig 1. Author's Parallel Implementation of Photon Mapping using MPI

In 2004, Jeff Dean et al. published the seminal MapReduce paper to handle data-heavy tasks [2]. In 2006, Hadoop implemented MapReduce and designed a distributed file system called HDFS, wherein a single big file is split and stored on the disks of multiple computers. The idea was to split huge databases across the hard disks of multiple motherboards, each with its own CPU, RAM, hard disk etc., interconnected by a fast LAN.

However, Hadoop stores all the intermediate data on disk, as it was designed in the 2000s, when hard disk prices had plummeted while RAM prices remained high. In the 2010s, when RAM prices came down, Spark was born with a big design change: store all intermediate data in RAM instead of on disk.

Spark was good for both:

i) Data-heavy tasks: as it uses HDFS, and

ii) Compute-heavy tasks: as it uses RAM instead of disk to store intermediate outputs, e.g. for iterative solutions.

As Spark could utilize RAM, it became an efficient solution for iterative tasks in Machine Learning such as Stochastic Gradient Descent (SGD). This is the reason Spark MLlib became so popular for Machine Learning, in contrast to Hadoop’s Mahout.

Furthermore, to do distributed deep learning with TensorFlow you can use,

  1. Multiple GPUs on the same box (or)
  2. Multiple GPUs on different boxes (GPU cluster)

While today’s supercomputers use GPU clusters for compute-intensive tasks, you can install Spark on such a cluster to make it suitable for tasks such as distributed deep learning, which are both compute and data intensive.

Introduction to Hadoop & Spark

There are 2 major components in Hadoop,

  1. Hadoop Distributed File System (HDFS): a fault-tolerant distributed file system, used by both Hadoop and Spark. HDFS enables splitting a big file into ‘n’ chunks and keeping them on ‘n’ nodes. When the file is accessed, the different chunks of data have to be fetched across the nodes via LAN.
  2. Map-Reduce: Given a task over a huge amount of data distributed across numerous nodes, a lot of data transfer has to happen and the processing needs to be distributed. Let’s look into this in detail.

Map-Reduce Paradigm

Consider the task to find word frequency in a large distributed file of 900 GB. HDFS will enable splitting the big file into 3 chunks P1, P2, P3 of 300 GB each and keep one each in 3 nodes.

Any Hadoop code would have 3 stages:

  1. Map: Mapper function will pass through the data, stored in the disk of each node, and increment the word count in the output dictionary. It will get executed independently on each distributed box.
Fig 2. Word Count Map-Reduce workflow (Image by Author)

  2. Shuffle: Hadoop automatically moves the data across the LAN, so that the same keys are grouped together in one box.

  3. Reduce: A function which consumes the dictionary and adds up the values with the same keys (to compute the total count).
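
The three stages can be sketched in plain Python, with three small in-memory lists standing in for the chunks P1, P2 and P3. This is only a single-process simulation of the idea; the function names `mapper`, `shuffle` and `reducer` are illustrative and are not Hadoop APIs.

```python
from collections import Counter, defaultdict

def mapper(chunk):
    """Map: count words locally within one chunk (one node)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def shuffle(partial_counts):
    """Shuffle: bring the partial counts of the same word (key) together."""
    grouped = defaultdict(list)
    for counts in partial_counts:
        for word, count in counts.items():
            grouped[word].append(count)
    return grouped

def reducer(grouped):
    """Reduce: add up the values that share the same key."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Three toy chunks standing in for P1, P2, P3 stored on three nodes.
chunks = [
    ["spark is fast", "hadoop is reliable"],
    ["spark uses memory"],
    ["hadoop uses disk", "spark is popular"],
]
word_counts = reducer(shuffle([mapper(chunk) for chunk in chunks]))
print(word_counts["spark"], word_counts["is"])  # prints: 3 3
```

In a real cluster each `mapper` call would run on the node that holds the chunk, and only the much smaller partial counts would travel over the LAN during the shuffle.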

To implement a function in Hadoop, you just need to write the Map and Reduce functions. Please note, there is disk I/O between each Map-Reduce operation in Hadoop. However, almost all ML algorithms work iteratively. Each iteration step in SGD [Equation below] corresponds to a Map-Reduce operation. After each iteration step, the intermediate weights are written to disk, which can take up as much as 90% of the total time to converge.
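
Assuming the standard gradient descent form, with learning rate η and a loss L summed over all n data points (the symbols η, L, x_i and y_i are notation chosen here for illustration), one iteration step can be written as:

```latex
w_{new} = w_{old} - \eta \sum_{i=1}^{n} \nabla_{w} L(x_i, y_i; w_{old})
```

The sum over all data points is the part that gets distributed, while the subtraction that produces w_new is a cheap final step.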

Equation: Weight Update Formula in ML & DL Iteration

As a solution, Spark was born in 2013, replacing disk I/O operations with in-memory operations. With the help of Mesos (a distributed system kernel), Spark caches the intermediate data set after each iteration. Since the output of each iteration is stored in an RDD, only one disk read and one disk write are required to complete all iterations of SGD.

Spark is built on the Resilient Distributed Dataset (RDD), a fault-tolerant, immutable collection of data distributed across nodes and held in main memory. On top of RDDs, the DataFrame API is designed to abstract away this complexity and make Machine Learning on Spark easier.

RDDs support two types of operations:

  1. Transformations: create a new data set from an existing one
     ✓ Map: pass each data set element through a function
     ✓ ReduceByKey: values for each key are aggregated using a function
     ✓ Filter: select only those elements on which the function returns true

  2. Actions: return a value after running a computation on the data set
     ✓ Reduce: aggregate all elements of the RDD using some function
     ✓ Collect: return all the elements of the output data set
     ✓ SaveAsTextFile: write the elements of the data set as a text file

All transformations in Spark are lazy, i.e. they are computed only when an ‘action’ requires a result. In the code below, lineLengths is not immediately computed, due to laziness. Only when ‘reduce’ is run, Spark breaks the computation into tasks to run on separate machines, to compute total length.

lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

A simple data transformation example is counting the occurrences of keys stored in a distributed RDD of 3 partitions.
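
A minimal sketch of such a key count, assuming an existing SparkContext is passed in as `sc` (the helper name `count_keys` is hypothetical; `parallelize`, `map`, `reduceByKey` and `collect` are standard RDD methods):

```python
from operator import add

def count_keys(sc, keys, num_partitions=3):
    """Count occurrences of each key in an RDD split into `num_partitions`."""
    rdd = sc.parallelize(keys, num_partitions)   # distribute the keys
    pairs = rdd.map(lambda key: (key, 1))        # transformation (lazy)
    counts = pairs.reduceByKey(add)              # transformation (lazy)
    return dict(counts.collect())                # action: triggers the job
```

For example, `count_keys(sc, ["a", "b", "a", "c", "b", "a"])` should give 3 for "a", 2 for "b" and 1 for "c". Nothing is computed until `collect()` is called, which is the lazy behaviour described above.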

Logistic Regression as Map Reduce

The most expensive operation in SGD iteration is the gradient operation across all data points [Eqn. above]. If the data set is huge, say ‘n’ billion data points, then we can distribute gradient computation across ‘k’ different boxes.

  • Map Stage: Each box would compute gradient of n/k billion points
  • Reduce Stage: Partial sums in each box are summed up using same key
  • Gradient of Loss over all points = ∑ Partial Sums
  • Thus, easily compute w_new & store in memory of each node
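
The map and reduce stages above can be sketched in plain Python for logistic regression. This is a single-process simulation in which each inner list plays the role of one box; the toy data, learning rate and function names are assumptions made for illustration, not part of Spark or Hadoop.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def partial_gradient(partition, w):
    """Map: gradient of the log-loss summed over one box's data points."""
    grad = [0.0] * len(w)
    for x, y in partition:
        error = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad

def add_vectors(a, b):
    """Reduce: element-wise sum of two partial gradients."""
    return [ai + bi for ai, bi in zip(a, b)]

def train(partitions, n_features, lr=0.1, iterations=100):
    w = [0.0] * n_features
    for _ in range(iterations):
        partials = [partial_gradient(p, w) for p in partitions]  # map stage
        grad = reduce(add_vectors, partials)                      # reduce stage
        w = [wi - lr * gi for wi, gi in zip(w, grad)]             # w_new
    return w

# Toy data: ((bias term, feature), label), split across k = 2 boxes.
partitions = [
    [((1.0, -1.5), 0), ((1.0, -1.0), 0)],
    [((1.0, 1.0), 1), ((1.0, 1.5), 1)],
]
w = train(partitions, n_features=2)
```

Only the small gradient vectors cross box boundaries in each iteration, which is why the approach scales with the number of data points.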

This is how you distribute any optimization-based ML algorithm. However, compare the performance of Hadoop and Spark for a distributed LR implementation.

Fig 3. Running Time Comparison: Hadoop vs Spark [3]

As Spark RDDs allow performing several map operations in memory, there is no need to write interim data sets to disk, which can make Spark up to 100x faster. Note that the time taken for the first iteration is almost the same, as both Hadoop and Spark have to read from disk. But in subsequent iterations, Spark’s memory read takes only 6 secs vs 127 secs for Hadoop’s disk read.

Besides, an ML scientist doesn’t need to code the Map and Reduce functions. Most ML algorithms are contained in Spark MLlib and all data preprocessing can be done using Spark SQL.

Spark Installation and Setup

You can set up Spark on either,

  1. Your local box or boxes (OR)
  2. A managed cluster using AWS EMR or Azure Databricks

Below we will see both ways. First, we will run trivially parallelizable tasks on your personal box, after doing the Spark local system setup. Then we will take a more complex ML project and run it on Spark with Docker, AWS EMR & Spark RAPIDS.

Spark: Local System Setup

  1. docker pull jupyter/pyspark-notebook
  2. docker run -it -p 8888:8888 jupyter/pyspark-notebook
  3. Either click the link with Auth-Token or go-to http://localhost:8888/ and copy paste the token
  4. Now you can execute Spark code in Jupyter or Terminal. To execute in docker, just run spark-submit pi-code.py

Task 1: Estimate the value of Pi (π)

Take a unit circle (radius 1) centred at the origin and consider the unit square that covers its first quadrant.

  • Area of the unit square = 1
  • Since it is a unit circle, the area of the circle = π
  • The area of the quarter arc (the quarter circle inside the square) = π/4
  • Thus, π = 4 * area of quarter arc
Fig 4. Area of Circle = # of Red Points/ Total Points. (Image by Author)

The area of quarter arc can be computed using,

  1. Numerical Methods: using integration
  2. Monte Carlo Approach: to find answers using random sampling

In the Monte Carlo approach,

  • Take uniformly distributed (x, y) points from 0 to 1 (i.e. inside the square)
  • Area of the quarter region = fraction of points within the circle, i.e. those with x² + y² < 1. E.g. out of 1000 random points, if ‘k’ points are within the circle, then the area of the shaded region = k/1000

These operations are trivially parallelizable, as checking whether a point falls within the circle involves no dependency across nodes. The PySpark code below, once run on the Spark local setup, will output a value nearer to π = 3.14 as we increase the number of random points (NUM_SAMPLES).

  • The random function generates a number between 0 and 1.
  • The ‘inside’ function runs a million times and returns ‘True’ only when the random point is within the circle.
  • sc.parallelize() creates an RDD broken up into k=10 partitions.
  • Filter applies the passed function and keeps only the elements for which it returns ‘True’.
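
A minimal sketch of this estimate, following the bullets above and assuming a SparkContext is passed in as `sc` (the wrapper name `estimate_pi` is hypothetical):

```python
import random

NUM_SAMPLES = 1_000_000

def inside(_):
    """Return True when a random point of the unit square lies inside the circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1

def estimate_pi(sc, num_samples=NUM_SAMPLES, num_partitions=10):
    """Estimate pi as 4 * (fraction of random points inside the quarter circle)."""
    count = (
        sc.parallelize(range(num_samples), num_partitions)  # RDD in 10 partitions
        .filter(inside)                                      # keep points in the circle
        .count()                                             # action: runs the job
    )
    return 4.0 * count / num_samples
```

With a context created via `SparkContext("local[*]", "pi")` from `pyspark`, calling `estimate_pi(sc)` should return a value close to 3.14, and the estimate tightens as `num_samples` grows.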

Task 2: Find Word Count

To find word frequency in a large distributed file, just replace the local file path in the code below with an HDFS file path.

A plain map that splits each line would produce a list of lists; flatMap merges these into a single list of words.
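
A minimal word count sketch along these lines, assuming a SparkContext is passed in as `sc` and that `path` points to a text file (the helper name `word_count` is hypothetical):

```python
from operator import add

def word_count(sc, path):
    """Return (word, count) pairs for the text file at `path`."""
    lines = sc.textFile(path)                          # one element per line
    words = lines.flatMap(lambda line: line.split())   # flatten the lists of words
    pairs = words.map(lambda word: (word, 1))          # emit (word, 1)
    return pairs.reduceByKey(add).collect()            # sum the counts per word
```

Here `path` can be a local path such as "data.txt" or an `hdfs://` URI; the rest of the code stays the same.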

Task 3: Data Preprocessing

Most data preprocessing can be done with the DataFrame API using Spark SQL. A Spark SQL query executed on a Spark DataFrame is converted into Map and Reduce operations before execution in Spark.
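
As a small illustration, the sketch below registers a DataFrame as a temporary view and aggregates it with a SQL query. The view name `weather` and the columns `year` and `humidity` are hypothetical; `spark` is assumed to be an existing SparkSession and `df` a DataFrame with those columns.

```python
def average_humidity_by_year(spark, df):
    """Register `df` as a temp view and aggregate it with a Spark SQL query."""
    df.createOrReplaceTempView("weather")   # make the DataFrame queryable by name
    return spark.sql(
        "SELECT year, AVG(humidity) AS avg_humidity "
        "FROM weather "
        "WHERE humidity IS NOT NULL "       # drop missing values first
        "GROUP BY year ORDER BY year"
    )
```

The result is again a DataFrame, so further preprocessing steps can be chained before any action such as `show()` or `collect()` triggers execution.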

Intro to Spark MLlib & ML Pipeline

Spark MLlib is a sklearn-inspired library which contains distributed implementations of popular ML algorithms. The main difference from sklearn is the use of the sc.parallelize() function to split data across multiple boxes.

All the steps required to convert the raw data on disk into a final model are known as an ML pipeline. Pipeline() contains the input and output stages in sequence. For instance, a Tokenizer → CountVectorizer → LogisticRegression sequence can be coded as a single pipeline.
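
A minimal sketch of such a pipeline is given here. The column names `text` and `label` and the helper names are assumptions for illustration; the stage classes come from `pyspark.ml` and need an active SparkSession when they are constructed.

```python
def build_text_pipeline():
    """Tokenizer -> CountVectorizer -> LogisticRegression as one Pipeline."""
    # Imported inside the function so the module loads even where Spark is absent.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import CountVectorizer, Tokenizer

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    cv = CountVectorizer(inputCol="words", outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
    return Pipeline(stages=[tokenizer, cv, lr])

def train_and_predict(train_df, test_df):
    """Fit all stages in order on `train_df`, then score `test_df`."""
    model = build_text_pipeline().fit(train_df)
    return model.transform(test_df)   # adds a "prediction" column
```

Each stage reads the output column of the previous one, which is what lets `fit()` run the whole sequence with a single call.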

Thus, the training data is fed into the tokenizer first, then to the CountVectorizer (CV) and then to Logistic Regression (LR). To test the model, call model.transform(). You can also do distributed hyper-parameter tuning, i.e. run the same architecture on multiple boxes with different hyper-parameters. However, distributing the storage and training of one big model across the VRAM of GPUs in different boxes is slightly intricate.

ML Algorithms in Spark: Custom Implementation

An algorithm can be made parallel only when it can be divided into independent sub-tasks. To explicate, bitonic sort can be made parallel as its sequence of operations is data-independent, while that of merge sort is not. Similarly, some ML algorithms are trivially parallelizable while others are not.

a) Trivially Parallel: Take KNN (k-nearest neighbours) for example.

  • Split the data set D into ‘n’ boxes. E.g. 40K points into 4 boxes
  • Find the top ‘k’ nearest points among the 10K points in each box
  • Transfer all 4 × k candidate points to one box and find the top ‘k’ nearest points.
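
The three steps can be sketched in plain Python, with small lists standing in for the boxes. This is a single-process simulation; the function names and the toy points are illustrative only.

```python
import heapq
import math

def local_top_k(points, query, k):
    """Per-box step: the k points in this box closest to the query."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

def distributed_knn(boxes, query, k):
    """Gather the k candidates of every box, then keep the global top k."""
    candidates = [p for box in boxes for p in local_top_k(box, query, k)]
    return heapq.nsmallest(k, candidates, key=lambda p: math.dist(p, query))

# 8 toy points split into 4 boxes (the text uses 40K points in 4 boxes).
boxes = [
    [(0, 0), (5, 5)],
    [(1, 1), (9, 9)],
    [(2, 2), (8, 8)],
    [(1, 0), (7, 7)],
]
nearest = distributed_knn(boxes, query=(0, 0), k=3)
print(nearest)  # prints: [(0, 0), (1, 0), (1, 1)]
```

Since every true nearest neighbour must be among the top ‘k’ of its own box, merging the per-box candidates gives the same answer as a search over the full data set.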

b) Non-Trivially Parallel: Take GBDT for instance. Each decision tree in GBDT is built based on the residuals of the previous decision trees. Hence, training GBDT is inherently a sequential operation, not a parallel one.

However, we can parallelize the building of each decision tree, as the data sets used for the left and right sub-trees are independent. Thus, xgboost parallelizes at the tree level, i.e. the left and right sub-trees are trained on 2 nodes independently.

Time Series Prediction using Random Forest

Let’s solve an ML problem in Standalone, Spark Local & Cluster mode.

Problem Statement: The daily temperature, wind, rainfall and humidity of a location are recorded from 1990 to 2020. Given these features, build a time series model to predict the humidity in 2021. To verify the model, compare its predictions against the 2020 Q4 humidity values using a metric.

The complete source code of the below experiments can be found here.

Fig 5. Input Dataset Features
Fig 6. Time series nature of the humidity values is clearly visible

Firstly, transform data to derive new features useful to predict humidity.
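
A rough sketch of this kind of feature derivation in plain Python is shown here. The record keys (`date`, `temperature`, `rainfall`) are hypothetical, and treating ‘Day’ as the day of the year and ‘Week’ as the ISO week number is an assumption.

```python
from datetime import date

def derive_features(record):
    """Add day-of-year, ISO week and scale = temperature * rainfall."""
    d = record["date"]
    return {
        **record,
        "day": d.timetuple().tm_yday,          # 1..366
        "week": d.isocalendar()[1],            # ISO week number
        "scale": record["temperature"] * record["rainfall"],
    }

row = derive_features(
    {"date": date(2020, 10, 1), "temperature": 25.0, "rainfall": 2.0, "humidity": 60.0}
)
```

Calendar features like these let a tree-based model pick up the seasonality in the humidity values.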

Fig 7. New Features: Day, Week & Scale = Temp*Rainfall

A. Standalone Implementation

Now we can train sklearn’s Random Forest Regressor with above features.

The predicted and actual humidity values are very close (visualization below).

Fig 8. 2020Q4 Humidity: Blue-Red Line as Actual and Predicted Humidity (Standalone)

B. Spark Local Implementation

First, complete the Spark local system setup steps mentioned above. Then, you can use PySpark’s RandomForestRegressor to do the same as above.

To feed in features to machine learning models in Spark MLlib, you need to merge multiple columns into a vector column, using the VectorAssembler module in Spark ML library.

Fig 9. Features are combined using VectorAssembler

Then, you can run the train and test module similar to standalone solution.
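
A minimal sketch of the assemble, train and predict steps is given here. The label column `humidity`, the `numTrees` value and the helper name are assumptions; `train_df` and `test_df` are assumed to be Spark DataFrames that contain the feature columns.

```python
def train_and_predict_rf(train_df, test_df, feature_cols, label_col="humidity"):
    """Assemble the feature columns into one vector, fit a forest, score test data."""
    # Imported inside the function so the module loads even where Spark is absent.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    rf = RandomForestRegressor(featuresCol="features", labelCol=label_col, numTrees=20)
    model = rf.fit(assembler.transform(train_df))            # distributed training
    predictions = model.transform(assembler.transform(test_df))
    return model, predictions                                # adds a "prediction" column
```

The same assembler is applied to both the train and test data, so the model always sees the features in the same order.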

Fig 10. Predicted Regression values (Humidity)
Fig 11. 2020Q4 Humidity: Actual and Predicted (Spark Local)

The distributed Random Forest Regressor implementation in PySpark also seems to follow the trend, though the mean squared error is a bit higher. Even so, the Spark local implementation ran about 3x faster on a quad-core system with 64 GB of RAM.

Fig 12. Standalone: 2.14 secs. Spark Local: 0.71 secs for Random Forest Regression training

C. Spark Cluster: AWS Elastic Map Reduce + Docker

To get the double benefit of compute and data scale, the above solution needs to be deployed across multiple boxes. However, it is time-consuming to set up a Spark cluster using your local machines.

Instead, you can use Amazon EMR to create a cluster with workloads running on Amazon EC2 instances. Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Spark.

You can use Docker containers to handle library dependencies, starting from Amazon EMR 6.0.0, instead of installing it on each EC2 cluster instance. But you need to configure the Docker registry and define additional parameters during "Cluster Creation" in AWS EMR.

  1. Create a Docker Image (or modify a docker image)

✓ Create a directory and a requirements.txt file. The requirements.txt file should contain all the dependencies required by your Spark application.

mkdir pyspark
vi pyspark/requirements.txt
vi pyspark/Dockerfile
Sample requirements.txt:

python-dateutil==2.8.1
scikit-image==0.18.1
statsmodels==0.12.2
scikit-learn==0.23.2

✓ Create a Dockerfile inside the directory with following contents. A specific numpy version is installed to confirm docker execution from EMR Notebook.

FROM amazoncorretto:8
RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development
RUN yum list python3*
RUN yum -y install python3 python3-dev python3-pip python3-virtualenv python-dev
RUN python -V
RUN python3 -V
ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
RUN pip3 install --upgrade pip
RUN pip3 install 'numpy==1.17.5'
RUN python3 -c "import numpy as np"
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

✓ Build the docker image using the command below.

sudo docker build -t local/pyspark-example pyspark/

2. Create an ECR repository. Tag and upload the locally built image to ECR.

aws ecr create-repository --repository-name emr-docker-examples
sudo docker tag local/pyspark-example 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
aws ecr get-login-password --region us-east-1 | sudo docker login --username AWS --password-stdin 123456789123.dkr.ecr.us-east-1.amazonaws.com
sudo docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example

You can also upload to Docker Hub and give ‘docker.io/account/docker-name:tag’ instead of AWS ECR Image URI above.

3. Create Cluster with Spark in EMR

✓ Open the Amazon EMR console (Ref here)

✓ Instead of ‘Quick Options’, click ‘Go to Advanced Options’ to enable,

  • Jupyter Enterprise Gateway: a web server that helps launch kernels on behalf of remote notebooks.
  • JupyterHub: to host multiple instances of Jupyter notebook server.
  • Apache Livy: a service that enables interaction with Spark cluster over a REST interface.

✓ Select the number of nodes of each node type, as per required parallelism

  • Master node: manages the cluster by coordinating the distribution of data and tasks among other nodes
  • Core Node: run tasks and store data in HDFS (at least one for multi-node)
  • Task Node: only runs tasks & does not store data in HDFS (optional)

Thus, core nodes add both data and compute parallelism while task node adds only compute parallelism.

Fig 13. How to create Spark Cluster using AWS EMR

4. Enter below configuration under ‘Software Settings’

To avoid a user ID error, make sure "livy.impersonation.enabled" is set in the JSON below.

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com",
                "docker.privileged-containers.registries": "local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com"
            }
        }
    ]
  },
  {
      "Classification":"livy-conf",
      "Properties":{
         "livy.impersonation.enabled": "true",
         "livy.spark.master":"yarn",
         "livy.spark.deploy-mode":"cluster",
         "livy.server.session.timeout":"16h"
      }
   },
   {
    "Classification": "core-site",
    "Properties": {
      "hadoop.proxyuser.livy.groups": "*",
      "hadoop.proxyuser.livy.hosts": "*"
    }
   },
   {
      "Classification":"hive-site",
      "Properties":{
         "hive.execution.mode":"container"
      }
   },
   {
      "Classification":"spark-defaults",
      "Properties":{
         "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE":"docker",
         "spark.yarn.am.waitTime":"300s",
         "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE":"docker",
         "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG":"hdfs:///user/hadoop/config.json",
         "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE":"123456789123.dkr.ecr.us-east-1.amazonaws.com/scalableml:s3spark",
"spark.executor.instances":"2",
         "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG":"hdfs:///user/hadoop/config.json",
         "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE":"123456789123.dkr.ecr.us-east-1.amazonaws.com/scalableml:s3spark"
      }
   }
]

5. Enable ECR access from YARN

To enable YARN to access images from ECR, we set the container environment variable YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG.

However, we need to generate config.json and put it in HDFS so that it can be used by jobs running on the cluster. To do this, log in to one of the core nodes and execute the commands below.

ssh -i permission.pem <user>@<core-node-address>
aws ecr get-login-password --region us-east-1 | sudo docker login --username AWS --password-stdin 123456789123.dkr.ecr.us-east-1.amazonaws.com
mkdir -p ~/.docker
sudo cp /root/.docker/config.json ~/.docker/config.json
sudo chmod 644 ~/.docker/config.json
hadoop fs -put ~/.docker/config.json /user/hadoop/
Fig 14. To give ECR access to YARN

6. EMR Notebook and Configuration

You can create Jupyter notebooks and attach them to Amazon EMR clusters running Hadoop, Spark, and Livy. EMR Notebooks are saved in AWS S3 independently of clusters.

  • Click ‘Notebooks’ on Amazon EMR console & ‘Create Notebook’
  • Select the cluster to run the notebook on
  • In Jupyter notebook, give below config as the first cell.
%%configure -f
{"conf": 
 { 
     "spark.pyspark.virtualenv.enabled": "false",
     "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE":"docker",
     "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG":"hdfs:///user/hadoop/config.json",
     "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE":"123456789123.dkr.ecr.us-east-1.amazonaws.com/scalableml:s3spark",
     "spark.executor.instances":"2",
     "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG":"hdfs:///user/hadoop/config.json",
     "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE":"123456789123.dkr.ecr.us-east-1.amazonaws.com/scalableml:s3spark"
 }
}
  • Thus, you can give Notebook-scoped docker in EMR Notebooks to resolve dependencies. Now you can execute PySpark code in subsequent cells.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("docker-numpy").getOrCreate()
sc = spark.sparkContext

import numpy as np
print(np.__version__)

If the above code is running inside the Docker image you built, you should see numpy version 1.17.5. If not, check the cluster logs at the S3 logging path.

  • You can use the above Spark standalone code, except that the input data should be read from S3, as shown below, and converted to an RDD.
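
A minimal sketch of the S3 read, assuming `spark` is the SparkSession of the notebook and that the input is a CSV file with a header row (the helper name and any bucket path are hypothetical):

```python
def load_weather_rdd(spark, s3_path):
    """Read a CSV file from S3 into a DataFrame and expose it as an RDD of Rows."""
    df = spark.read.csv(s3_path, header=True, inferSchema=True)
    return df.rdd
```

It could be called as `load_weather_rdd(spark, "s3://<your-bucket>/<path-to-csv>")`, with the placeholder replaced by your own bucket and key.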

    As visualized below, the time taken by the RandomForest training workload steadily decreases as the number of nodes in the EMR cluster increases. The base overhead of inter-node communication remains even with small data sets. Hence, the graph will fall more steeply as the training data set gets bigger.

Fig 15. EMR Cluster Performance for various cluster sizes

Alternatively, you can execute the PySpark code in the cluster by using spark-submit command, after connecting to master node via SSH.

  • Set the DOCKER_IMAGE_NAME & DOCKER_CLIENT_CONFIG env variables
  • Execute the .py file with spark-submit & deploy-mode as ‘cluster’
  • You can read the input and write the output from S3 as below.

    However, it is more convenient to submit a Spark job via a Notebook running on the Spark cluster, along with Docker to resolve dependencies on all cluster nodes.

D. GPU Cluster: AWS EMR + Spark RAPIDS

You can use the ‘Nvidia Spark-RAPIDS plugin for Spark’ to accelerate ML pipelines using GPUs attached to EC2 nodes. The core and task instance groups must use EC2 GPU instance types, while the master node can be non-ARM non-GPU.

Spark RAPIDS would speed up data processing and model training without any code change. To create the GPU-accelerated cluster, you can follow the steps in Section C, but with these changes:

  1. Use the JSON given at Step 5 here, instead of config in Section C
  2. Save the bootstrap script given here in S3 as my-bootstap-action.sh
  3. Give the S3 file path from Step 2 as bootstrap script (to use YARN on GPU)
  4. SSH to master node and install dependencies. To execute the time series project, execute the command below.
pip3 install scikit-learn statsmodels seaborn matplotlib pandas boto3

No code change is required. Below is the timing comparison of Random Forest Regression training on Spark-RAPIDS Cluster of 1x m5.2xlarge master node and 2x g4dn.4xlarge core nodes, with other execution modes.

Fig 16. Timing Comparison: Standalone vs Spark Local vs RAPIDS

However, the speed gain is not much in the above case, as the data set is small. Let’s do a variation of the earlier ‘alphabet count’ code to compare the timings of Spark Local and Spark RAPIDS. The code below generates 100M tuples of random string and count, to feed into a distributed count operation.

Fig 17. Spark Job Progress Report in AWS EMR Notebook
Fig 18. Timing Comparison: Spark Local vs Spark RAPIDS

Closing Thoughts

Apologies for the blog length. If you have read this far, you should consider getting your hands dirty with some distributed coding. This article gave you the historical & logical context of Spark, with multiple sample implementations on Local, AWS EMR & Spark RAPIDS. If it has inspired you to explore further, then it has served its purpose.

The complete source code of the above experiments can be found here.

If you have any query or suggestion, you can reach me here

References

[1] www.appliedaicourse.com

[2] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008)

[3] Bagavathi, Arunkumar & Tzacheva, Angelina. (2017). Rule Based Systems in a Distributed Environment: Survey. 10.11159/cca17.107.

[4] PySpark Documentation: https://spark.apache.org/docs/latest/api/python/index.html

[5] Spark MLLib: https://spark.apache.org/docs/1.2.1/mllib-guide.html

[6] Spark SQL: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#sql

[7] AWS EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html

[8] AWS EMR with Docker: https://aws.amazon.com/blogs/big-data/simplify-your-spark-dependency-management-with-docker-in-emr-6-0-0/

[9] GPU Cluster (Spark-Rapids): https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html

