A n00b's guide to Apache Spark

Jeroen Schmidt
Towards Data Science
10 min read · Jun 4, 2017


I wrote this guide to help myself understand the basic underlying functions of Spark, where it fits in the Hadoop ecosystem, and how it works in Java and Scala. I hope it helps you as much as it helped me.

What is Spark?

Spark is a general-purpose, in-memory computing framework. It lets you execute real-time and batch workloads in a scripting manner, in a variety of languages, with powerful fault tolerance. Why should you care what Spark is? To put it bluntly, it addresses many of the shortcomings of Hadoop MapReduce and is roughly 10 to 100 times faster. Spark is a big deal in data science; some notable organisations that use Spark are Amazon, NASA Jet Propulsion Laboratory, IBM and Hitachi. The goal of this article is to give you a quick rundown of the functionality Spark offers and its basic inner workings, and to help you gain an appreciation of how awesome Spark is.

Spark's Context in Big Data Environments

Spark is designed to work with an external cluster manager or its own standalone manager. Spark also relies on a distributed storage system from which it reads the data it processes. The following systems are supported:

Cluster Managers:

  • Spark Standalone Manager
  • Hadoop YARN
  • Apache Mesos

Distributed Storage Systems:

  • Hadoop Distributed File System (HDFS)
  • MapR File System (MapR-FS)
  • Cassandra
  • OpenStack Swift
  • Amazon S3
  • Kudu

For sanity's sake, I will only be focusing on Spark in the context of the Hadoop ecosystem.

Spark Core provides a platform that addresses many of the shortcomings of Hadoop MapReduce: it frees us from having to break tasks up into small atomic jobs, and from wrangling with much of the complexity of building solutions on a distributed system.

Semantic note: The term Hadoop is used interchangeably to refer to the Hadoop ecosystem, Hadoop MapReduce or Hadoop HDFS. It's quite common to read statements online that "Spark replaces Hadoop" or that "Spark is the new Hadoop" and be inclined to believe they mean Spark is replacing all of the Hadoop services, BUT what they really mean is that Spark is taking on the role of Hadoop MapReduce in many use cases.

Spark Core is very versatile and has been designed with the Hadoop ecosystem in mind; it can work alongside MapReduce or provide an alternative platform for Pig, Hive and Search to run on top of. See figure 1.

Spark Core also brings its own set of useful APIs to the table:

Spark Streaming: Manages live micro-batches of data from a variety of sources. It allows near real-time results to be computed by enabling MLlib and GraphX to be applied to live streams.

GraphX: A very powerful library for graph-parallel computation. Don't confuse this with "PowerPoint charts"; this library is all about the field of mathematics called graph theory and modelling pairwise relationships between objects.

MLlib: A library for running machine learning algorithms on large data sets in a natively distributed environment. The library is still in its infancy compared to the more mature machine learning libraries found in Python or Matlab.

Spark SQL: Allows the use of SQL queries to query non-relational, distributed data.
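As a rough illustration of that last point (the JSON file and its fields are assumptions of mine, not from the article), a Spark SQL query from Scala might look like this:

```scala
// A minimal sketch of querying non-relational, distributed data with SQL.
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SqlExample").master("local[*]").getOrCreate()

    val people = spark.read.json("hdfs:///path/to/people.json") // hypothetical non-relational source
    people.createOrReplaceTempView("people")                    // expose the data to SQL

    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```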

Spark Streaming, GraphX, MLlib and Spark SQL will each get their own articles in due time, but in the meantime don't hesitate to look up the official documentation [1] [2] [3] [4].

What Makes Spark, Spark?

At the highest level of abstraction, Spark consists of three components that make it uniquely Spark: the Driver, the Executors and the DAG.

The Driver and the Executor

Spark uses a master-slave architecture. A driver coordinates many distributed workers in order to execute tasks in a distributed manner, while a resource manager handles the allocation of resources to get the tasks done.

DRIVER

Think of it as the "orchestrator". The driver is where the main method runs. It converts the program into tasks and then schedules those tasks on the executors. The driver has three different ways of communicating with the executors: Broadcast, Take and the DAG. These will be elaborated on shortly.

EXECUTORS ("WORKERS")

Executors execute the tasks delegated by the driver within a JVM instance. Executors are launched at the beginning of a Spark application and normally run for the whole lifespan of the application. This approach allows data to persist in memory while different tasks are loaded in and out of the executor throughout the application's lifespan.

In stark contrast, the JVM worker environment in Hadoop MapReduce is powered down and powered up for each task. The consequence is that Hadoop must perform reads and writes on disk at the start and end of every task.

DRIVER COMMUNICATION WITH EXECUTORS

There are several ways in which a driver can communicate with executors. As a developer or data scientist, it's important to be aware of the different types of communication and their use cases.

  1. Broadcast Action: The driver transmits the necessary data to each executor. This is optimal for data sets of up to about a million records, roughly 1 GB of data; beyond that it can become a very expensive operation.
  2. Take Action: The driver takes data from all executors. This can be a very expensive and dangerous action, as the driver might run out of memory and the network can become overwhelmed.
  3. DAG Action: By far the least expensive of the three. It transmits only control-flow logic from the driver to the executors. (A short sketch of all three follows this list.)
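To make these three concrete, here is a minimal sketch of my own (the lookup table and values are assumptions, not from the article) showing a broadcast, a DAG of transformations, and a take:

```scala
// Assumes an existing SparkContext `sc`.
import org.apache.spark.SparkContext

def communicationExamples(sc: SparkContext): Unit = {
  val countryCodes = Map("NL" -> "Netherlands", "ZA" -> "South Africa")

  // 1. Broadcast: the driver ships a read-only copy of `countryCodes` to every executor.
  val lookup = sc.broadcast(countryCodes)

  // 3. DAG: only the control-flow logic (this lambda) is sent to the executors.
  val names = sc.parallelize(Seq("NL", "ZA", "US"))
    .map(code => lookup.value.getOrElse(code, "Unknown"))

  // 2. Take: the driver pulls results back from the executors; keep the argument small.
  names.take(2).foreach(println)
}
```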

System Requirements

Spark offers a considerable performance gain over Hadoop MapReduce, but it also has a higher operating cost, as it works in memory and requires a high-bandwidth network environment (10 Gb/s or more is advised). It is recommended that the memory in the Spark cluster be at least as large as the amount of data you need to process. If there isn't enough memory for a job, Spark has several methods for spilling data over onto disk. For more on hardware requirements and recommendations, see the official Spark hardware provisioning documentation.
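One of those spill-over methods is simply choosing a storage level that allows spilling to disk. A minimal sketch, assuming an existing SparkContext and a hypothetical input path:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

def cacheWithSpill(sc: SparkContext): Unit = {
  val bigData = sc.textFile("hdfs:///path/to/big/dataset") // hypothetical path

  // MEMORY_AND_DISK keeps partitions in memory where possible and spills the rest to disk.
  bigData.persist(StorageLevel.MEMORY_AND_DISK)

  println(bigData.count()) // the first action materialises and caches the RDD
}
```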

The DAG

The DAG is a Directed Acyclic Graph, which outlines the series of steps needed to get from point A to point B. Hadoop MapReduce, like most other computing engines, works independently of the DAG; these DAG-independent engines rely on scripting platforms like Hive or Pig to link jobs together to achieve the desired result. What makes Spark so powerful in comparison is that it is aware of the DAG and actively manages it. This allows Spark to optimise job flows for the best performance and enables rollback and job-redundancy features.

Have a look at figure 3. I will elaborate on the DAG and how it operates by discussing its components.

1) SOURCE

A source can be any data source supported by Spark: HDFS, a relational database, a CSV file, etc. You will see later that we define this within our environment context setup.

2) RDD

Resilient Distributed Datasets are essentially data sets that cannot be changed. These entities exist in memory and by their very nature are immutable. Because of this immutability, a new RDD is created after every transformation performed on an existing RDD. A consequence of this design is redundancy: if at any point in the DAG's execution there is a failure, it is possible to roll back to a functioning state and reattempt the failed action or transformation.

RDDs in their original form don't have a schema attached to them, but they can be extended using something called a DataFrame. DataFrames add schema functionality to the data they contain, which is very useful when dealing with relational data sets.
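As a rough illustration of both points (the file path and the Person case class are my own assumptions), here is how a transformation yields a new, immutable RDD and how that RDD can then be given a schema as a DataFrame:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("RddToDataFrame").master("local[*]").getOrCreate()
    import spark.implicits._

    val lines  = spark.sparkContext.textFile("hdfs:///path/to/people.csv") // original RDD
    val people = lines.map { line =>                                        // transformation: a new RDD
      val fields = line.split(",")
      Person(fields(0), fields(1).trim.toInt)
    }

    val peopleDF = people.toDF() // attach a schema: the RDD becomes a DataFrame
    peopleDF.printSchema()

    spark.stop()
  }
}
```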

3) TRANSFORMATION

Transformations transform an RDD into another RDD. Some example transformations are:

  1. map
  2. reduceByKey
  3. groupByKey
  4. join
  5. Spark SQL queries (which compile down to transformations)

4) ACTION

An action is anything that retrieves data in order to answer a question. Some examples are count, take and foreach.

EXECUTING THE DAG
Spark does something called lazy evaluation. The DAG itself is built up by the transformations, but nothing happens until an action is called. When an action is executed, Spark looks at the DAG and optimises it in the context of the jobs it needs to run to reach the action step it has been asked for. When the DAG is finally executed, the driver sends the transformation commands out to the executors on the cluster.
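A minimal sketch of this behaviour (the log path is an assumption of mine): the transformations below only extend the DAG, and nothing touches the data until count() is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvaluation").setMaster("local[*]"))

    val logs   = sc.textFile("hdfs:///path/to/logs")  // nothing is read yet
    val errors = logs.filter(_.contains("ERROR"))     // transformation: the DAG grows
    val codes  = errors.map(_.split(" ")(0))          // transformation: the DAG grows

    println(codes.count()) // action: Spark now optimises and executes the DAG

    sc.stop()
  }
}
```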

THE SPARK API

The Spark API was developed with the idea of allowing developers to create programs that run on distributed systems using the same code that works for non-distributed programming. In other words, the Spark API lets us write code that runs on single-threaded and multi-threaded machines without a problem. The implication is that we can run code on our local machine and debug it while being confident that it will run on our Spark Hadoop cluster. A further implication is that you can pull data from a cluster and run it on your local machine for testing and development purposes. (A short sketch of this idea follows the language list below.)

The following languages are supported:

  • Scala
  • Java
  • Python
  • R
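As a rough sketch of the "write once, run locally or on the cluster" idea (the application name and values here are my own), the same code can run on your laptop for debugging or be submitted to a cluster unchanged:

```scala
import org.apache.spark.sql.SparkSession

object PortableApp {
  def main(args: Array[String]): Unit = {
    // For local development, master("local[*]") runs on all cores of your machine.
    // When submitting to a cluster (e.g. spark-submit --master yarn), you can omit
    // the .master(...) call and let the cluster manager supply it.
    val spark = SparkSession.builder.appName("PortableApp").master("local[*]").getOrCreate()

    val data = spark.sparkContext.parallelize(1 to 100)
    println(data.sum()) // identical code on a laptop or on the cluster

    spark.stop()
  }
}
```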

To demonstrate some of the inner workings of Spark, I'm going to run through a word count example in both Scala and Java.

In Scala
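The original article showed this code as an image, so here is a minimal sketch of the word count it walks through (the input path is an assumption of mine). The numbered comments match the line numbers referenced in the explanation below.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc     = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]")) // 1: initialise the Spark context
    val source = "hdfs:///path/to/input.txt"                                                     // 2: define our source
    val lines  = sc.textFile(source)                                                             // 3: initial RDD
    val words  = lines.flatMap(line => line.split(" "))                                          // 4: transformation -> new RDD
    val pairs  = words.map(word => (word, 1))                                                    // 5: transformation -> new RDD
    val counts = pairs.reduceByKey(_ + _)                                                        // 6: transformation -> new RDD
    counts.take(10).foreach(println)                                                             // 7: action - the DAG is executed here
    sc.stop()
  }
}
```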

Lines 1 to 2 initialise our Spark context and define our source. In line 3 we have our initial RDD. In lines 4 to 6 we define transformations of our RDD and create new RDDs. Note that by line 7 no code has been executed; only our DAG has been built up. At line 7 we finally have an action, which then executes the transformations. It's important to note that the only work distributed across the cluster is the lambda expressions passed to the transformations, as these are run by the executors. Everything else is executed on the driver.

Scala propaganda side note: Scala is an amazing language that runs on top of the JVM. It provides an environment for developers with OOP backgrounds to ease into the functional programming mindset that is optimal for distributed computing, by supporting both the OO and functional paradigms simultaneously. Why should you care? Spark is built using Scala, and as such the newest features in Spark will always be implemented in Scala first. Scala also offers the best performance of the supported languages when dealing with large data sets; as an example, Scala is roughly 10 to 225 times faster than Python depending on the use case. Scala is also used by some of the biggest names in data science and distributed computing, Amazon and Google to name a few. I hope this snippet of propaganda has convinced you to give Scala at least a curious look.

IN JAVA

In the Java version, the transformations build up the DAG in the same way. More importantly, take note that essentially every transformation is an object which is then sent to all the distributed executors. This also happens in the Scala example, but the lambda expressions hide this layer of interaction from you. The transformation objects aren't executed by the executors in the cluster until the action call (line 27 of the original listing) is executed by the driver.

CONCLUSION

In conclusion, I hope this article has helped you grasp the fundamentals of what makes Spark such an interesting and powerful platform for data science and data engineering.

The takeaways from this article should be:

  • Spark Core works alongside or replaces Hadoop MapReduce.
  • Spark is FAST compared to other computing engines! Spark's speed comes from the fact that it is aware of the DAG and can optimise it; it persists data in memory by keeping the JVM executors up for the whole job, with the end goal of minimising disk I/O.
  • Spark has some great data science and data engineering APIs for Machine Learning, Graph Theory, Data Streaming and SQL.
  • The main components to be aware of within Spark are the driver and the executors. These two components operate through something called the DAG, which is managed directly by Spark. Transformations build the DAG and produce a new RDD from an existing RDD. The RDD is an immutable data entity that provides redundancy and rollback functionality with the aid of the DAG. The DAG is only executed once an action has been called.
  • The Spark API allows you to write programs in a handful of well-established programming languages on your local machine for development and debugging purposes. That same code can then be deployed to a distributed system with no changes needed. Scala is awesome.

Bibliography

[1] Spark Streaming Programming Guide, Apache Spark documentation: https://spark.apache.org/docs/latest/streaming-programming-guide.html
[2] GraphX Programming Guide, Apache Spark documentation: https://spark.apache.org/docs/latest/graphx-programming-guide.html
[3] MLlib (Machine Learning Library) Guide, Apache Spark documentation: https://spark.apache.org/docs/latest/ml-guide.html
[4] Spark SQL, DataFrames and Datasets Guide, Apache Spark documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html
