
Spark vs Pandas, part 4: Recommendations

Why neither Spark nor Pandas is better than the other. Or: Always choose the right tool for the right job.

Photo by Cesar Carlevarino Aragon on Unsplash

Originally I wanted to write a single article for a fair comparison of Pandas and Spark, but it continued to grow until I decided to split it up. This is the fourth and last part of this small series.

What to Expect

This last part of the series will give you some advice on how to choose between the two technologies for implementing a given task.


When to prefer Pandas over Spark

After a detailed analysis of the two contenders Pandas and Spark, we can now summarize the strengths and weaknesses of both and give some guidance on when to use which.

Let’s start with Pandas.

Strengths

Pandas is simple to use, and you can find lots of valuable information and online resources. Pandas performs all its operations reasonably quickly, as long as the amount of data is not too large. It is well integrated into a whole ecosystem of numerical, statistical and machine learning libraries like scikit-learn, TensorFlow and many more.
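To make this integration concrete, here is a minimal sketch of a Pandas DataFrame feeding directly into scikit-learn. The file name "measurements.csv" and its "target" column are purely hypothetical placeholders:

```python
# Minimal sketch: a Pandas DataFrame plugging directly into scikit-learn.
# "measurements.csv" and its "target" column are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("measurements.csv")       # the whole file is loaded into local RAM
X = df.drop(columns=["target"])
y = df["target"]

model = LinearRegression().fit(X, y)       # Pandas objects are accepted as-is
print(model.score(X, y))                   # R^2 on the training data
```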

Weaknesses

Pandas does not scale at all. It cannot make use of multiple CPUs and the whole data set needs to fit into the RAM of your local machine. Some projects like Dask try to address these shortcomings, but that’s a different story.

Python as a language is a little bit weak due to being dynamically typed. Writing robust code is harder than with a statically typed and compiled language.

Conclusion

Because of its simplicity, flexibility and availability, I would always use Pandas for data exploration and for experiments, as long as the data fits into memory. Specifically for ML projects, I don’t think twice and start with Pandas because of all the powerful libraries that are well integrated with it. Even if the final amount of data might be too large, Pandas and its friends are still simple and flexible enough for first experiments with a subset of the full data set.

On the other hand, nowadays I think twice before using Pandas for production workloads because of the weaker correctness guarantees of Python as a dynamically typed language. But due to the availability of many important ML libraries, Python and Pandas also have their place in production (unfortunately).

When to prefer Spark over Pandas

Spark shines in many areas where Pandas has some weaknesses, as we will see below.

Strengths

Spark scales very well: with the number of CPUs, the number of machines and, probably most importantly, with the amount of data. There is no real limit, other than time, on how much data you can process with a limited amount of resources.

Due to the availability of a vast number of connectors for all kinds of data sources and sinks, Spark is very well suited for integrating data from different origins.
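As a rough illustration, the following sketch combines a Parquet data set with a table from a relational database. The paths, the JDBC URL and the join column are hypothetical, and the JDBC read assumes the matching driver jar is on the classpath; the Parquet and JDBC connectors themselves ship with Spark:

```python
# Minimal sketch: joining data from two different origins with Spark connectors.
# All paths, the JDBC URL and the join column are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("integration-sketch").getOrCreate()

orders = spark.read.parquet("s3a://my-bucket/orders/")        # data lake (Parquet)
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db-host:5432/crm")
             .option("dbtable", "customers")
             .load())                                          # relational database

enriched = orders.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet("s3a://my-bucket/enriched_orders/")
```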

Finally, because Spark relies on Scala, a statically typed and compiled language, Spark code often has a higher inherent robustness than Python code. This makes Spark with Scala a very good candidate for production.

Weaknesses

Spark is made for huge amounts of data: although it is much faster than its ancestor Hadoop, it is still often slower on small data sets that Pandas processes in less than a second.

Spark provides some ML algorithms, but its ecosystem will probably never be as rich as the one available for Python.

One thing to keep in mind is that Spark was designed as a relational algebra running on a cluster of machines. The processing atoms of a relational algebra (joins, projections, filtering, aggregations, simple transformations etc.) are distinct from the processing atoms of a matrix algebra (matrix multiplication, decomposition etc.), which is what most ML algorithms require. When you take some time and look at the implementation of the ML algorithms in Spark, you will see that the developers had to transform numerical problems into map/reduce problems for distributed processing. While this is possible, it is of course much harder than using a distributed matrix algebra, which in turn also explains the slow development of new ML features in Spark.

Conclusion

Spark is perfect for typical ETL/ELT workloads, but only my second choice for machine learning projects due to the limited amount of available algorithms. If you have really huge amounts of data, where subsampling won’t work, Spark is still a good candidate.

Combining Spark and Pandas

So far I have taken the perspective of using either Pandas or Spark for a given task, but not of combining them in a single application. But sometimes the world has a gift for you, and this time it is PySpark. Although PySpark is primarily a Python wrapper around Spark, it includes growing support for integrating Pandas code within Spark.

The main driver for supporting Pandas directly from within Spark is that even the Spark developers understood that Spark cannot, and probably should not, try to replace Pandas in certain scenarios, most notably in the ML part of projects. Instead, Spark puts its development focus on integrating frameworks like Pandas and even TensorFlow.

Spark basically offers two levels of integration of Pandas:

Converting DataFrames

The first, simplest and most straightforward integration of Pandas is the ability to convert between Spark DataFrames and Pandas DataFrames. This allows a developer to use both frameworks and to switch between them. But beware that Spark does not magically remove the limitations of Pandas: when converting a Spark DataFrame into a Pandas DataFrame, the whole data set again needs to fit into the RAM of your local machine.
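A minimal sketch of both directions of the conversion (the toy data set is hypothetical; toPandas() and createDataFrame() are the actual entry points):

```python
# Minimal sketch: converting between Spark and Pandas DataFrames.
# toPandas() collects the complete result into the driver's RAM.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_df = spark.range(1000).withColumnRenamed("id", "value")  # hypothetical toy data
pandas_df = spark_df.toPandas()                                # Spark -> Pandas (local memory!)
back_to_spark = spark.createDataFrame(pandas_df)               # Pandas -> Spark
```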

While this limitation can be a deal-breaker for some scenarios, it is acceptable in others, where you use Spark to reduce the amount of data (by sampling or aggregation) and then use Pandas and its friends (scikit-learn etc.) to continue with the smaller data set.

Embedding Pandas

As we saw, simply switching back and forth between Spark DataFrames and Pandas DataFrames is not always advisable, but fortunately there is a better way of using Pandas within a PySpark application:

With version 2.3.0, Apache Spark introduced the so-called Pandas User Defined Functions (UDFs) in addition to the already existing Python UDFs. Before version 2.3.0, the only way to have custom Python code executed in parallel by Apache Spark on all executors was to write a small Python function containing the desired logic and to wrap that function into a Python UDF. Spark would then execute that UDF for every record.

As nice as this sounds, that approach was notoriously slow. Python UDFs in Spark are implemented by exchanging data between Spark (which lives inside a JVM process) and multiple Python processes, and then calling the user-defined Python function for every individual record. This approach contains two significant bottlenecks: first, the data exchange involves CPU-intensive serialization and deserialization steps, and second, calling a Python function for every individual record is really slow.
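For illustration, here is a minimal sketch of such a classic per-record Python UDF. The toy column and the plus_one function are hypothetical; the point is that the function is invoked once per row:

```python
# Minimal sketch: a classic per-record Python UDF (the slow path described above).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).withColumnRenamed("id", "x")   # hypothetical toy data

@udf(returnType=DoubleType())
def plus_one(x):
    return float(x) + 1.0                          # called once for every single record

df.withColumn("y", plus_one("x")).show()
```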

In order to improve the situation, Spark implemented a new API for creating Pandas UDFs that offers a significant performance boost. With Pandas UDFs, you provide Python functions which no longer work on individual records, but which transform Pandas DataFrames or Series containing batches of multiple records. Moreover, by using Apache Arrow, Spark avoids the expensive per-record serialization and deserialization: data is exchanged between the JVM and Python in Arrow’s columnar format, which both sides can consume with very little conversion overhead.
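The same hypothetical example as a Pandas UDF looks almost identical, but the function now receives whole batches as pandas.Series (this sketch assumes Spark 3.x, where the type-hint form of pandas_udf is available):

```python
# Minimal sketch: the same logic as a vectorized Pandas UDF (Arrow-based exchange).
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).withColumnRenamed("id", "x")   # hypothetical toy data

@pandas_udf("double")
def plus_one(x: pd.Series) -> pd.Series:
    return x + 1.0                                 # operates on a whole batch at once

df.withColumn("y", plus_one("x")).show()
```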

This combination brings features from both worlds into a single application: by using Pandas, you have a greater degree of flexibility, and by embedding your Pandas code in Spark, it is executed in parallel on multiple machines in the cluster.


Conclusion

Do not try to replace Pandas with Spark; they complement each other, and each has its own pros and cons.

Whether to use Pandas or Spark depends on your use case. For most Machine Learning tasks, you will probably eventually use Pandas, even if you do your preprocessing with Spark. But for complex Data Engineering tasks, which often also need to scale to huge amounts of data, I highly recommend using Spark with Scala (don’t be afraid of Scala; investing in it will pay off from a project’s point of view).

