Originally I wanted to write a single article for a fair comparison of Pandas and Spark, but it kept growing until I decided to split it up. This is the second part of this small series.
- Spark vs Pandas, part 1 – Pandas
- Spark vs Pandas, part 2 – Spark
- Spark vs Pandas, part 3 – Languages
- Spark vs Pandas, part 4 – Shootout and Recommendation
What to Expect
This second part portrays Apache Spark. I have already written a different article about Spark as part of a series about Big Data Engineering, but this time I will focus more on the differences from Pandas.
What is Apache Spark?
Apache Spark is a Big Data processing framework written in Scala targeting the Java Virtual Machine, which also provides language bindings for Java, Python and R. The inception of Spark was probably very different from that of Pandas, since Spark initially addressed the challenge of efficiently working with huge amounts of data that no longer fit into the memory of a single computer (or even into the total memory of a whole computing cluster).
One can say that Spark was created to replace Hadoop MapReduce by providing both a simpler and at the same time more powerful programming model. And Spark was very successful with that mission: I assume that no project today would start writing new Hadoop MapReduce jobs and would clearly go with Spark instead.
Spark Data Model
Initially Spark only provided a (nowadays called low-level) API of RDDs, which required developers to model their data as classes. After a couple of years, Spark was extended by the DataFrame API, which picked up many of the good ideas of Pandas and R and which is now the preferred API to use (together with Datasets, but I'll omit those in this discussion).
Similar to Pandas, Spark DataFrames are built on the concepts of columns and rows, with the set of columns implicitly defining a schema which is shared among all rows.

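As a minimal sketch (the sample records and column names are made up, and a SparkSession named spark is assumed, as provided by the spark-shell), such a DataFrame could be built from local data like this:

```scala
import spark.implicits._

// Build a small DataFrame from local sample data; the schema is derived from the tuple types
val persons = Seq(
  ("Alice", 23, 156.0, "female"),
  ("Bob",   31, 178.0, "male"),
  ("Carol", 42, 165.0, "female")
).toDF("name", "age", "height", "sex")

persons.show()
```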
In contrast to Pandas, the schema of a Spark DataFrame also dictates the data type of each column, which is enforced for every row. This aspect strongly resembles classical databases, where each column also has a fixed data type that is enforced on all records (newer NoSQL databases might be more flexible, but that doesn't mean that type enforcement is bad or old-school). You can easily display the schema of a given DataFrame in order to inspect the columns and their data types:

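For the hypothetical persons DataFrame sketched above, the output would look roughly like this:

```scala
// Print the column names and data types of the DataFrame
persons.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)
//  |-- height: double (nullable = false)
//  |-- sex: string (nullable = true)
```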
As opposed to Pandas, Spark doesn't support any indexing for efficient access to individual rows in a DataFrame. Spark solves all tasks which could benefit from an index more or less by brute force: since all transformations are always performed on all records, Spark will reorganize the data as needed on the fly.
Generally speaking, columns and rows in Spark are not interchangeable like they are in Pandas. The reason for this lack of orthogonality is that Spark is designed to scale with data in terms of the number of rows, but not in terms of the number of columns. Spark can easily work with billions of rows, but the number of columns should always stay limited (hundreds or maybe a couple of thousand).
When we try to model stock prices with open, close, low, high and volume attributes, we therefore need to take a different approach in Spark than in Pandas. Instead of using wide DataFrames with different columns for different stocks, we use a more normalized approach (in the sense of database normalization) where each row is uniquely identified by its dimensions date and asset and contains the metrics close, high, low and open.

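A small sketch of this long, normalized layout (the ticker symbols and prices are invented purely for illustration):

```scala
import spark.implicits._

// Long ("normalized") layout: one row per combination of the dimensions date and asset
val prices = Seq(
  ("2020-01-02", "AAPL", 74.3, 75.1, 73.8, 74.1),
  ("2020-01-02", "MSFT", 158.6, 160.7, 158.3, 158.8),
  ("2020-01-03", "AAPL", 74.4, 74.9, 73.2, 74.1)
).toDF("date", "asset", "close", "high", "low", "open")

prices.show()
```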
Although the data model of Apache Spark is less flexible than the one of Pandas, that doesn't need to be a bad thing. These restrictions arise naturally from Spark's focus on implementing a relational algebra (more on that below), and the enforcement of strict data types helps you to find bugs in your code earlier.
Since Spark doesn’t support indices, it also doesn’t support nested indices like Pandas. Instead Spark offers very good support for deeply nested data structures, like they are found in JSON documents or Avro messages. These kinds of data are often used in internal communication protocols between applications and Sparks exhaustive support of them underlines its focus as a data processing tool. Nested structures are implemented as complex data types like structs, arrays and maps, which in turn can again contain these complex data types.
As an example, I show you some Twitter data, which is publicly available on the Internet Archive.

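A sketch of loading such data (the path is only a placeholder; the selected fields correspond to common attributes of the Twitter JSON format):

```scala
// Read line-delimited JSON tweets; Spark infers the nested schema automatically
val tweets = spark.read.json("/data/twitter/2020/01/01/*.json.bz2")

// The flat, tabular view only hints at the nested structure
tweets.select("created_at", "user.screen_name", "text").show(5)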
This tabular representation doesn’t really show the complex nature of the tweets. A look at the schema reveals all details (although only a subsection is shown below):

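The exact schema depends on the data set, but a heavily abbreviated and purely illustrative excerpt would look something like this:

```scala
tweets.printSchema()
// root
//  |-- created_at: string (nullable = true)
//  |-- entities: struct (nullable = true)
//  |    |-- hashtags: array (nullable = true)
//  |    |    |-- element: struct (containsNull = true)
//  |    |    |    |-- text: string (nullable = true)
//  |-- user: struct (nullable = true)
//  |    |-- followers_count: long (nullable = true)
//  |    |-- screen_name: string (nullable = true)
//  ...
```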
This sort of support for complex and deeply nested schemas is something which sets Spark apart from Pandas, which can only work with purely tabular data. And since this type of data naturally arises in many domains, it is good to know that Spark could be a tool for working with that data. Suggestions for working with this kind of data might be an interesting topic for a separate article.
Flexibility of Spark
Apache Spark also provides a broad set of transformations, which implement a full relational algebra as you find it in traditional databases (MySQL, Oracle, DB2, MS SQL, …). This means that you can perform just about any transformation that you could express within a SELECT statement in SQL.
The Spark/Scala examples below follow along the lines of the Pandas examples in the first article. Note how Spark uses the wording (but not the syntax) of SQL in all its methods.
Projections
Possibly one of the simplest transformations is a projection, which simply creates a new DataFrame with a subset of the existing columns. This operation is called a projection because it is similar to a mathematical projection from a higher-dimensional space into a lower-dimensional space (for example 3D to 2D). Specifically, a projection reduces the number of dimensions and is idempotent, i.e. performing the same projection a second time on the result will not change the data any more.

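A minimal sketch, using the hypothetical persons DataFrame from above:

```scala
// Projection: keep only the name and age columns
val namesAndAges = persons.select("name", "age")
namesAndAges.show()
```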
A projection in SQL would be a very simple SELECT statement with a subset of all available columns. Note how the name of the Spark method select reflects this equivalence.
Filtering
The next simple transformation is filtering, which only selects a subset of the available rows.

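Again a small sketch on the assumed persons DataFrame:

```scala
// Filtering: keep only the rows matching a condition
val adults = persons.where(persons("age") > 30)
adults.show()
```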
Filtering in SQL is typically performed within the WHERE clause. Again note that Spark chose to use the SQL term where although Scala users would prefer filter – actually you can also use the equivalent method filter instead.
Joins
Joins are an elementary operation in a relational database – without them, the term relational wouldn't be very meaningful.
For a small demonstration, I first load a second DataFrame containing the name of the city some persons live in:

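A minimal sketch of such a second DataFrame (the names and cities are invented to match the hypothetical persons DataFrame):

```scala
import spark.implicits._

// Second DataFrame with the city each person lives in
val addresses = Seq(
  ("Alice", "Hamburg"),
  ("Bob",   "Berlin")
).toDF("name", "city")

addresses.show()
```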
Now we can perform the join operation:

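```scala
// Inner join (the default) on the common column "name"
val joined = persons.join(addresses, Seq("name"))
joined.show()
```

Passing the join columns as a Seq has the nice side effect that the join key appears only once in the result.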
In SQL a join operation is performed via the JOIN clause as part of a SELECT statement.
Note that as opposed to Pandas, Spark does not require any special indexing of either DataFrame (if you remember, Spark doesn't support the concept of indices).
Instead of relying on usable indices, Spark will reorganize the data under the hood as part of implementing an efficient parallel and distributed join operation. This reorganization will shuffle the data among all machines of a cluster, which means that all records of both DataFrames will be redistributed such that records with matching join keys are sent to the same machine. Once this shuffle phase is done, the join can be executed independently in parallel using a sort-merge join on all machines.
Concatenation
Spark also supports concatenation of multiple DataFrames, but only vertically (i.e. adding rows from a second DataFrame with the same number of columns).


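A sketch, again based on the assumed persons DataFrame:

```scala
// Vertical concatenation: append rows of a DataFrame with the same set of columns
val morePersons = Seq(
  ("Dave", 55, 182.0, "male")
).toDF("name", "age", "height", "sex")

val allPersons = persons.union(morePersons)
allPersons.show()
```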
In SQL vertical concatenation can easily be done using a UNION.
If you need a horizontal concatenation of two DataFrames (which was easy with Pandas), you have to use a join operation instead. This might be difficult (or even impossible) in some cases, specifically if no natural join key is available.
Aggregations
Simple total aggregations are also well supported in Spark. The following example calculates the minimum, maximum and average of all columns in our persons DataFrame:

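A sketch restricted to the numeric columns of the assumed persons DataFrame:

```scala
import org.apache.spark.sql.functions.{avg, max, min}

// Global (ungrouped) aggregation over the whole DataFrame
persons.agg(
  min("age"), max("age"), avg("age"),
  min("height"), max("height"), avg("height")
).show()
```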
Again, Spark clearly mimics SQL, where the same result would be achieved by using an aggregate function (like SUM, MIN, AVG etc.) in a SELECT statement.
In contrast to Spark, Pandas is also able to perform row-wise aggregations over all columns of a DataFrame. This is not directly possible in Spark; you'd need to sum up all columns manually (which is not an aggregation). Remember, Spark was developed to scale with the number of rows, not with the number of columns, and rows and columns are not interchangeable like they are in Pandas.
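For illustration, such a manual "row-wise sum" on the assumed persons DataFrame would look like this:

```scala
import org.apache.spark.sql.functions.col

// A row-wise sum has to be spelled out explicitly, column by column
val withSum = persons.withColumn("age_plus_height", col("age") + col("height"))
withSum.show()
```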
Grouped Aggregations
Like traditional databases and like Pandas, Spark also supports grouped aggregations. For example the average age and height per sex for the persons DataFrame can be calculated as follows:

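```scala
import org.apache.spark.sql.functions.avg

// Grouped aggregation: average age and height per sex
persons.groupBy("sex").agg(
  avg("age").alias("avg_age"),
  avg("height").alias("avg_height")
).show()
```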
SQL also supports grouped aggregations via the GROUP BY clause and aggregate functions in a SELECT statement.
Reshaping
The first article of this series contains a subsection about the flexible and powerful possibilities Pandas offers for reshaping a table. With Pandas you can easily dissect a table into smaller subtables (horizontally and vertically) and then concatenate them back together. With Pandas you can transpose the whole DataFrame (i.e. exchange columns with rows) with a snap of your fingers.
Spark does not offer these operations, since they do not fit well with its conceptual data model, where a DataFrame has a fixed set of columns and a possibly unknown or even unlimited number of rows (in the case of a streaming application which continually processes new rows as they enter the system).
Data Sources
Spark supports CSV, JSON, Parquet and ORC files out of the box. Since Spark is really meant for working with huge amounts of data, you won’t find direct support for working with Excel files ("if it doesn’t fit into Excel any more, it must be Big Data"). In addition to traditional files, Spark can also easily access SQL databases and there are tons of connectors for all other sorts of database systems (Cassandra, MongoDB, HBase, …).
When working in "local" mode without distributing the work over a cluster, you can happily work with files on your local file system. But when you are working inside a cluster, the data has to be accessible from all machines in the cluster. This can be accomplished by using a shared database which is reachable via the network or by using shared storage (like HDFS or S3 – but NFS might also do the trick, although probably with limited bandwidth).
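Two sketches of typical data sources (all paths and connection details are placeholders):

```scala
// Reading CSV files from shared storage
val pricesFromCsv = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://some-bucket/prices/*.csv")

// Reading a table from a SQL database via JDBC
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")
  .option("dbtable", "public.customers")
  .option("user", "reader")
  .option("password", "secret")
  .load()
```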
Limitations
As I already pointed out multiple times, one important difference between the data models of Spark and Pandas is the lack of interchangeability of columns and rows in Spark. This means that you cannot simply perform a transpose operation in Apache Spark and you cannot directly perform aggregations along columns (only along rows).
Conclusion
With all these operations in place, Spark can be understood as a relational execution engine working on external data. Spark elegantly picks up the wording and terms used in SQL, so most SQL users will quickly find out how to perform specific tasks with Spark.
But Spark's design already imposes some limitations at the conceptual level on the possible types of transformations. Spark always assumes a fixed set of columns and a possibly unknown or even unlimited number of rows. This design is fundamental to Spark's scaling capabilities and is in stark contrast to Pandas.
Spark Runtime Characteristics
So far it might seem as if Pandas generally is the better solution, since it provides more flexibility and is well integrated with the whole Python data science ecosystem. But this impression will now change when we look under the hood of Apache Spark.
Runtime Platform
Spark is implemented in the programming language Scala, which targets the Java Virtual Machine (JVM). As opposed to Python, Scala is a compiled and statically typed language, two aspects which often help the computer to generate (much) faster code. Spark does not rely on optimized low-level C/C++ code; instead all code is optimized on the fly during execution by the Java Just-in-Time (JIT) compiler.
I would dare to say (but many people will disagree) that the Java platform as a whole probably is not able to bring out the best performance of your hardware, when compared to highly optimized (but hardware specific) C/C++ or even assembler code. But performance is often more than "good enough" and certainly at least one order of magnitude better than purely interpreted code (i.e. Python without optimized low level code). The terms fast and slow should always be used in relation to a specific alternative, and in this case it is Python.
Execution Model
In contrast to Pandas, Spark uses a lazy execution model. This means that when you apply a transformation to a DataFrame, the data is not processed immediately. Instead your transformation is recorded in a logical execution plan, which essentially is a graph whose nodes represent operations (like reading data or applying a transformation). Spark only starts to work when the result of all transformations is actually required – for example when you want to display some records on the console or when you write the results into a file.
As already mentioned in the article devoted to Pandas, this lazy execution model has the huge advantage of offering Spark the ability to optimize the whole plan before execution instead of blindly following the steps as specified by the developer. Internally Spark uses the following stages:
- A logical execution plan is created from the transformations specified by the developer
- Then an analyzed execution plan is derived where all column names and external references to data sources are checked and resolved
- This plan then is transformed into an optimized execution plan by iteratively applying available optimization strategies. For example filter operations are pushed as near as possible to the data sources in order to reduce the number of records as early as possible. But there are many more optimizations.
- Finally the result of the previous step is transformed into a physical execution plan where transformations are pipelined together into so-called stages.
Just to be complete: the physical execution plan then is split up along the data into so-called tasks, which can then be executed in parallel and distributed among the machines in the cluster. You can inspect these plans yourself, as shown in the sketch below.
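A quick way to look at the different plans is the explain method, here applied to a query on the hypothetical persons DataFrame:

```scala
// Build a query; nothing is executed yet due to lazy evaluation
val query = persons.where(persons("age") > 30).select("name", "age")

// With the "extended" flag, explain prints the parsed, analyzed, optimized and physical plans
query.explain(true)
```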
This general approach is very similar to probably all relational databases, which also optimize SQL queries with similar strategies before executing them. But there is more to lazy execution than only optimization, as we see in the next paragraph.
Processing Scalability
Spark is inherently multi-threaded and can make use of all cores of your machine. In addition Spark was designed from the very beginning to perform its work in large clusters with possibly hundreds of machines and thousands of cores.
By breaking down the total amount of work into individual tasks, which can then be processed independently and in parallel (as soon as each task's input data is available), Spark makes very efficient use of the available cluster resources. It would be much harder to implement an eager execution model in a cluster without wasting lots of resources (CPU power, RAM and disk storage).
Data Scalability
Spark also scales very well with huge amounts of data. Not only does it scale via multiple machines in a cluster; the ability to spill intermediate results to disk is also at the core of Spark's design. Therefore Spark is almost never limited by the total amount of main memory, but only by the total amount of available disk space.
It is important to understand that by lazily following execution plans, the total working data set never needs to be completely materialized in RAM at any point in time. Instead all the data (including input data and intermediate results) is split up into small chunks which are processed independently, and even the results are eventually stored in small chunks that never need to fit into RAM all at once.
As you see, the somewhat reduced flexibility of Spark is outweighed by its scalability, both in terms of compute power and in terms of data size. Spark is made for different types of problems than Pandas.
Conclusion
The virtually unlimited scalability both in terms of data and processing power makes Spark a distributed and parallel relational execution engine.
Conclusion
Like Pandas, Spark is a very versatile tool for manipulating large amounts of data. While Pandas surpasses Spark in its reshaping capabilities, Spark excels at working with really huge data sets by making use of disk space in addition to RAM and by scaling to multiple CPU cores, multiple processes and multiple machines in a cluster.
As long as your data has a fixed schema (i.e. no new columns are added every day), Spark will be able to handle it, even if it contains hundreds of billions of rows. This scalability together with the availability of connectors to almost any storage system (i.e. remote filesystem, databases, etc) make Spark a wonderful tool for Big Data engineering and data integration tasks.
With this second part, we should now have a rough understanding of the focus of Pandas and Spark. But before we can eventually give advice on when to use which, we should also examine the programming languages and the ecosystems both frameworks live in. This will be the topic of the next part of the series.