Recently there was a nice article on Medium explaining why data scientists should start using Spark and Scala instead of Pandas. Although the article contains many valid points, I propose a more differentiated view, which is also reflected in my personal work where I use both, but for different kinds of tasks.
Whenever I gave a PySpark training to Data Scientists, I was always asked whether they should stop using Pandas altogether from now on, or when to prefer which of the two frameworks, Pandas or Spark. To answer this question you need to understand each tool's strengths and limitations, and you should keep in mind that both frameworks have been developed to solve similar problems, but each with a different focus.
This is the first part of a small series for comparing Spark and Pandas.
- Spark vs Pandas, part 1 – Pandas
- Spark vs Pandas, part 2 – Spark
What to Expect
I will present both frameworks, Pandas and Spark, and discuss their strengths and weaknesses to set the ground for a fair comparison. Originally I wanted to write a single article on this topic, but it kept growing until I decided to split it up.
I won't conclude with "A is better than B", but instead I will give you some insights into each framework's focus and limitations. Eventually I will conclude with some advice on how to choose between the two technologies for implementing a given task.
This first article will give you an overview of Pandas, with its strengths, weaknesses and unique selling points.
What is Pandas?
Pandas is a Python library and the de-facto standard for working with structured tabular data in Python. Pandas provides simple methods for transforming and preprocessing data with a strong focus on numerical data, but it can also be used with other data, as long as its structure is tabular.
Many readers will understand why Pandas is so important nowadays, since most data science projects use Python as a programming language at some point in time. These projects therefore rely on Pandas for reading, transforming and possibly writing data.
The reason for the popularity of Pandas in the world of Python is its simple but powerful programming interface and the fact that it interacts nicely with NumPy and therefore with most statistics and machine learning libraries.
Pandas Data Model
Basically Pandas stores your data inside so-called DataFrames, which inherently look like tables, as you might know them from databases. A DataFrame has a set of columns and a set of rows, where each row contains entries for all columns (even if they are NaN or None values to denote missing information).
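To make this concrete, here is a minimal, hypothetical example (the data is made up purely for illustration):

```python
import pandas as pd
import numpy as np

# A tiny DataFrame with three columns; missing entries are stored as NaN / None
df = pd.DataFrame({
    "sensor": ["a", "b", "c"],
    "value": [1.5, np.nan, 3.2],
    "unit": ["m", None, "m"],
})
```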

Similar to a table in a database, each DataFrame also has an index (typically with unique keys) to efficiently access individual rows or whole ranges of rows. The named columns of a Pandas DataFrame can also be seen as a horizontal index, which again allows you to efficiently access individual columns or ranges of columns.
Both the vertical row index and the horizontal column index can also contain multiple levels, which can be quite useful, for example for modelling stock prices of multiple different assets over time:
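A hedged sketch of such a multi-level row index, using made-up prices for two assets:

```python
import pandas as pd

# Hypothetical stock prices: the row index has two levels (date and asset)
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-01", "2020-01-02"]), ["AAPL", "MSFT"]],
    names=["date", "asset"],
)
prices = pd.DataFrame(
    {"open": [74.1, 158.8, 74.3, 158.3], "close": [75.1, 160.6, 74.4, 158.6]},
    index=index,
)

# Select all prices of a single asset via the second index level
prices.xs("AAPL", level="asset")
```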

Pandas provides a very orthogonal design with respect to rows and columns: they are interchangeable in most (but not all) functions. And if that is not possible, you can easily transpose a DataFrame, i.e. rotate the table by 90 degrees to turn all rows into columns and vice versa. Note that this operation is not possible with a traditional database.
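For example, with the prices DataFrame sketched above:

```python
# Transpose: all rows become columns and vice versa
prices_t = prices.T
```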
Flexibility of Pandas
The Pandas DataFrame API is very flexible and provides more functionality than you'd find in `SELECT` statements in a traditional database. Let us have a quick look at the most basic operations, just to get a feeling for them and to be able to compare the functionality with Spark in the next part of this series.
For the following tiny examples, assume that we use a DataFrame `persons` with the following contents:
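Since the original table is not reproduced here, the following is only a plausible stand-in, with columns chosen to match the examples below:

```python
import pandas as pd

# Hypothetical contents of the persons DataFrame used in the examples
persons = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dorothee"],
    "sex": ["female", "male", "male", "female"],
    "age": [23, 45, 31, 38],
    "height": [165.0, 180.0, 175.0, 170.0],
})
```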

Since most (but not all) of these operations can also be performed in a traditional database, I will also mention the corresponding operation to put them into the perspective of relational algebra.
Projections
Possibly the simplest transformation is a projection, which simply creates a new DataFrame with a subset of the existing columns. This operation is called a projection because it is similar to a mathematical projection of a higher-dimensional space onto a lower-dimensional space (for example 3D to 2D). Specifically, a projection reduces the number of dimensions and is idempotent, i.e. performing the same projection a second time on the result will not change the data any more.
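In Pandas such a projection could look like this (using the hypothetical persons DataFrame from above):

```python
# Projection: keep only a subset of the columns
result = persons[["name", "age"]]
```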

A projection in SQL would be a very simple `SELECT` statement with a subset of all available columns.
Filtering
The next simple transformation is filtering, which only selects a subset of the available rows. It is similar to a projection, but works on rows instead of columns.
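For example:

```python
# Filtering: keep only the rows that satisfy a boolean condition
result = persons[persons["age"] > 30]
```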

Filtering in SQL is typically performed within the `WHERE` clause.
Joins
Joins are an elementary operation in a relational database – without them, the term relational wouldn't be very meaningful. Pandas joins require that the right DataFrame is already indexed by the join column.
For a small demonstration, I first load a second DataFrame containing the name of the city some persons live in:
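Again the actual data is not shown in this text, so here is a made-up stand-in:

```python
# Hypothetical second DataFrame; only some persons have a known city
addresses = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "city": ["Hamburg", "Berlin"],
})
```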

Now we need to create an appropriate index for that DataFrame, otherwise Pandas refuses to perform the join operation:
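Continuing the sketch from above:

```python
# Index the right DataFrame by the join column
addresses = addresses.set_index("name")
```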

Finally we can perform the join operation:
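With the index in place, the join itself is a one-liner (a sketch, again based on the hypothetical data above):

```python
# Left join persons against the indexed addresses via the "name" column
result = persons.join(addresses, on="name")
```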

Note that this example shows that the term index really is appropriate as it provides a fast lookup mechanism, which is required for the join operation.
In SQL a join operation is performed via the `JOIN` clause as part of a `SELECT` statement.
Concatenation
In Pandas you can also concatenate (glue together) DataFrames, either horizontally (i.e. adding columns from a second DataFrame with the same number of rows) or vertically (i.e. adding rows from a second DataFrame with the same number of columns).
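A sketch of a vertical concatenation, assuming a second hypothetical DataFrame more_persons with the same columns:

```python
# Vertical concatenation: append the rows of a second DataFrame with the same columns
more_persons = pd.DataFrame({
    "name": ["Eve"], "sex": ["female"], "age": [29], "height": [168.0],
})
all_persons = pd.concat([persons, more_persons], ignore_index=True)
```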

For horizontally concatenating columns of different DataFrames (not shown), Pandas will match rows on the index, in a similar way to how vertical concatenation matches on column names.
In SQL vertical concatenation can be easily done using a `UNION`, while horizontal concatenation requires a more expensive `JOIN` operation.
Aggregations (Horizontal + Vertical)
Simple total aggregations are also well supported in Pandas. The following example calculates the minimum, maximum and average of all columns in our `persons` DataFrame:
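For example, restricting the hypothetical persons DataFrame to its numeric columns:

```python
# Total aggregation: one result per column, computed over all rows
persons[["age", "height"]].agg(["min", "max", "mean"])
```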

The aggregation is performed for each column over all rows. SQL also supports total aggregations via aggregate functions (like `SUM`, `MIN`, `AVG` etc.) in a `SELECT` statement.
Pandas can also perform a row-wise aggregation over columns, which comes in handy in some cases. For example, when looking at the COVID-19 data provided on GitHub by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, it initially looks as follows:
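The exact file is not specified here; the sketch below assumes a local copy of the confirmed-cases time series CSV from the CSSE repository:

```python
# Read the (assumed) time series of confirmed cases downloaded from the CSSE repository
covid = pd.read_csv("time_series_covid19_confirmed_global.csv")
covid.head()
```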

Each row represents a single country or state, and for every date there is a column with the cumulative number of COVID-19 cases up to that day. This data schema is actually very unfriendly for storing in a traditional database, which commonly has a fixed set of columns and where new entries should be added as new rows. But this representation adds a new column for every date. We will see how to transform this data into a more database-friendly format later.
But now, suppose we want to calculate some metric over all date columns for each different region, say the average number of new corona cases. This can be easily achieved with Pandas in several small steps including a horizontal aggregation (as opposed to the vertical aggregation above):
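One possible way to do this (the column positions are an assumption about the CSSE layout, where the first few columns hold region metadata and all remaining columns are dates):

```python
# Keep only the date columns (the leading columns contain region metadata)
date_columns = covid.columns[4:]

# Daily new cases: difference between consecutive date columns
new_cases = covid[date_columns].diff(axis=1)

# Horizontal aggregation: average over all date columns, one value per row/region
covid["avg_new_cases"] = new_cases.mean(axis=1)
```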

This example shows the flexibility of Pandas as opposed to a relational database, where aggregations are always vertical by their nature – but which would be unable to store the data in its original format anyway.
Grouped Aggregations
Pandas also supports grouped aggregations, like for example the average age and height per sex for the `persons` DataFrame:
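With the hypothetical persons DataFrame from above, this looks like:

```python
# Grouped aggregation: average age and height per sex
persons.groupby("sex")[["age", "height"]].mean()
```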

SQL also supports grouped aggregations via the `GROUP BY` clause and aggregate functions in a `SELECT` statement.
Reshaping
Let’s get back to the COVID-19 data set above:

The numbers of cases are stored in separate columns for every day. Unfortunately this representation might not be appropriate for every use case, and it might be preferable to have a separate row for every date. For example, if you wanted to store the information in a relational database, you should define a fixed set of columns, independent of the stored time frame.
The required transformation can be easily accomplished with two lines of Pandas code:
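The author's exact code is not shown; a plausible version uses melt (the metadata column names are assumptions based on the CSSE layout):

```python
# Turn every date column into a separate row with a "Date" and a "Count" column
id_columns = ["Province/State", "Country/Region", "Lat", "Long"]
cases = covid.melt(id_vars=id_columns, var_name="Date", value_name="Count")
```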

This representation is now perfect for storing in a relational database. The idea is that each row contains a combination of dimensions ("Province/State", "Country/Region" and "Date") which uniquely identifies the row. That is, these columns provide a composite primary key. In addition, each row contains one or more metrics (in this case it's only "Count"). In the world of analytical databases, this would be called a fact table.
Data Sources
Pandas provides built-in support for reading and writing CSV, JSON, Excel, fixed-width format (FWF), HDF5, Parquet, ORC, SAS and SPSS files, and it can access data in SQL databases and even Google BigQuery.
In addition to accessing data on your local file system, Pandas also supports a couple of remote file systems, the most important one probably being S3.
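A small sketch (the bucket and file names are hypothetical; reading from S3 additionally requires the s3fs package, and writing Parquet requires pyarrow or fastparquet):

```python
# Read a CSV directly from S3 and write the result as a local Parquet file
df = pd.read_csv("s3://my-bucket/input/persons.csv")
df.to_parquet("persons.parquet")
```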
This broad support of both file formats and storage systems makes Pandas a well-integrated citizen in a world of very different data sources it needs to work with.
Conclusion
Pandas supports most database-like operations you'd find in a `SELECT` statement, but it provides much greater flexibility and has a stronger focus on numerical operations. It supports many more operations, like rolling window calculations and pivoting, to name just two commonly used ones.
Internally Pandas uses NumPy arrays, which can be accessed easily and fed into all kinds of additional libraries like scikit-learn, statsmodels or even TensorFlow. Again, this sets Pandas apart from a classical database, which doesn't offer this kind of integration.
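For example, a hedged sketch using the hypothetical persons DataFrame from above together with scikit-learn:

```python
from sklearn.linear_model import LinearRegression

# The underlying NumPy arrays can be handed directly to scikit-learn
X = persons[["age"]].to_numpy()
y = persons["height"].to_numpy()
model = LinearRegression().fit(X, y)
```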
Pandas Runtime Characteristics
So far everything might have sounded just perfect. But let's peek a little bit under the hood and examine some design decisions of Pandas and the implications those decisions have.
Runtime Platform
Pandas itself is written in a mixture of Python and Cython (a superset of Python with support for static typing, which translates Python code to C and then compiles the C code to give C-like performance), with all the benefits and downsides of an interpreted and dynamically typed language. But since most transformations are eventually performed by low-level, optimized C code, performance is sufficient for most use cases.
Execution Model
Pandas implements a so-called eager execution model, which simply means that any transformation you apply to a DataFrame will be executed immediately. This approach of least surprise for the user is great for interactive work (as Data Scientists often do), but it also has some downsides: Pandas cannot create something like an optimized execution plan by fusing multiple operations together, since every transformation is performed immediately.
The idea of optimizing execution plans is rather old and helps many SQL databases to transparently boost their performance – but of course those databases see a full SQL query at once instead of seeing each tiny step one after the other. Later we will see some more important advantages of the lazy execution model implemented by Spark.
As a side note, TensorFlow supports both a lazy and an eager execution model, with the clear advice to use the eager execution model only during development, since the lazy execution is much faster due to whole-graph optimizations.
Processing Scalability
Processing itself is single-threaded and cannot be distributed across different machines. Some projects like Dask try to overcome this limitation by reimplementing the Pandas API with a completely different backend. But Pandas itself is limited to a single thread.
As bad as it sounds, Pandas is still very usable and can be very fast, since most of the work is done inside a highly optimized C/C++ backend. Sometimes it is also possible to implement some parallelism on top of Pandas with the multiprocessing or joblib libraries.
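A minimal sketch of this idea (process_group is a hypothetical per-group function; requires the joblib package):

```python
from joblib import Parallel, delayed

def process_group(group):
    # Hypothetical per-group computation
    return group["height"].mean()

# Process the groups of a DataFrame in parallel worker processes
groups = [group for _, group in persons.groupby("sex")]
results = Parallel(n_jobs=2)(delayed(process_group)(g) for g in groups)
```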
Data Scalability
Pandas requires that all of your data fits into the main memory of your local computer. It can neither spill data to disk nor distribute data within a cluster of machines. Again, projects like Dask or Vaex try to overcome this limitation by implementing distributed and/or out-of-core processing while providing a Pandas-compatible (or, in the case of Vaex, Pandas-like) API.
Conclusion
Pandas doesn't scale – it's as simple as that. If your data set grows, you need more RAM and probably a faster CPU (faster in terms of single-core performance). While this may be limiting for some people, it keeps the whole library (relatively) simple.
Conclusion
Pandas is a very versatile Python library for manipulating tabular data. It provides far more flexibility for manipulating and reshaping tabular data than a relational database and is therefore the Swiss army knife of data (pre-)processing, not only for Data Science projects.
Pandas' main limitations are that your data needs to fit into RAM and that it doesn't make good use of multiple CPUs. But if your data is small enough to escape these limitations (and that may well still be millions of records), Pandas is a wonderful and flexible library which integrates very well into the whole Python machine learning ecosystem.
Next time, we’ll have a closer look at the rival Apache Spark.