In this third installment of the series "Pandas vs Spark" we will have a closer look at the programming languages and the implications of choosing one.
Originally I wanted to write a single article for a fair comparison of Pandas and Spark, but it kept growing until I decided to split it up. This is the third part of this small series.
- Spark vs Pandas, part 1 – Pandas
- Spark vs Pandas, part 2 – Spark
- Spark vs Pandas, part 3 – Programming Languages
- Spark vs Pandas, part 4 – Shootout and Recommendation
What to Expect
This third part of the series will focus on the programming languages Scala and Python. Spark itself is written in Scala and provides bindings for Python, while Pandas is available only for Python.
Why Programming Languages Matter
Of course programming languages play an important role, although their relevance is often misjudged. Having the right programming language on your CV may eventually be one of the deciding factors for getting a specific job or project. This is a good example of where the relevance of programming languages is easily misunderstood, especially in the context of Data Science.
Don’t get me wrong: becoming an expert in a given programming language takes far more than a couple of weeks of coding. You not only need to get used to the syntax, but also to the language-specific idioms. It’s really like learning a foreign natural language, which requires more than just knowing the words and the grammar (which in itself is already a huge task).
On the other hand, in certain areas like Data Science, methodology matters at least as much as knowing a specific programming language. I would prefer to hire a machine learning expert with profound knowledge of R for an ML project using Python over a Python expert with no knowledge of Data Science, and I bet most of you would agree. So from an expert’s point of view, the programming language doesn’t matter so much on your CV (at least it shouldn’t – I know that reality is different), as long as you know what’s going on under the hood and understand the scientific method of approaching a problem.
But things look quite different from a project’s point of view: when setting up a larger project and starting to write actual code, you eventually need to think about which programming language you’d prefer to use. This decision has many consequences that you should be aware of. I will discuss many of them in this article, with a strong focus on Scala and Python as the natural programming languages for Spark and Pandas.
Python vs Scala
When comparing Spark and Pandas, we should also include a comparison of the programming languages supported by each framework. While Pandas is "Python-only", you can use Spark with Scala, Java, Python and R, with more bindings being developed by the corresponding communities.
Since choosing a programming language has serious direct and indirect implications, I’d like to point out some fundamental differences between Python and Scala; going into more detail would probably make up a separate article on its own. I mainly pick up this comparison because the original article I referred to at the beginning also suggested that people should start using Scala (instead of Python), while I again propose a more differentiated view.
Type System
Let’s first look at the type systems: both languages provide some simple built-in types like integers, floats and strings. The fundamental types in Scala additionally come in specific sizes, like Short for a 16-bit integer or Double for a 64-bit floating point number.
Both languages also offer classes with inheritance, although many details are really different.
There are two main differences between the type systems in Scala and in Python:
- While Scala is a statically and strongly typed language (i.e. every variable and parameter has a fixed type, and the Scala compiler immediately reports an error if you try to use a wrong type), Python is dynamically typed (i.e. a single variable or parameter can technically accept any data type – although the code may assume specific types and therefore fail later during execution).
- Due to the dynamically typed nature of Python, a suitable type for a certain operation is often determined only by the operations it implements. Using the correct base class or inheritance is often not crucial; only the available methods are. This paradigm is called "Duck Typing", as the small sketch below illustrates.
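To make the second point more concrete, here is a minimal Python sketch of duck typing; the class and function names are made up purely for illustration.

```python
class Duck:
    def quack(self):
        return "Quack!"

class Person:
    # Not related to Duck by inheritance, but offers the same method.
    def quack(self):
        return "I am quacking like a duck!"

def make_it_quack(thing):
    # No type check at all: anything with a quack() method is accepted.
    return thing.quack()

print(make_it_quack(Duck()))    # works
print(make_it_quack(Person()))  # also works, despite the unrelated class
# make_it_quack(42)             # would fail only at run time (AttributeError)
```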
These differences have a huge impact, as we will see later.
Functional Language Aspects
While Python has grown from a simple scripting language into a fully featured programming language, the focus of Scala as a research project was from the very beginning to combine aspects of functional programming languages (like Haskell) with those of object-oriented languages (like Java) – there is some debate about whether this combination is successful, or even desirable.
For me, the term functional programming refers to a paradigm in which functions shall not have side effects (i.e. they do not change global state and respect immutability). Object-oriented programming, on the other hand, is just about the opposite, where each method is seen as a way to communicate with an object, which in turn changes its state. It is important to separate the paradigm itself from specific language features – one can write purely functional programs in almost any language, but only some languages provide supporting concepts, while things get complicated in others.
Both Python and Scala support some functional concepts: specifically, functions can be passed as values, and both offer anonymous functions (lambda functions). Scala also comes with a rich collections library that supports functional approaches like immutability very well, while Python’s best offering in this area is the list comprehension.
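A small sketch of these shared concepts in Python, using made-up data: a named function passed as a value, an anonymous lambda, and a list comprehension.

```python
def double(x):
    return 2 * x

numbers = [1, 2, 3, 4, 5]

# Functions are values: a named function is passed to map().
doubled = list(map(double, numbers))                       # [2, 4, 6, 8, 10]

# Anonymous functions: a lambda is passed to filter().
evens = list(filter(lambda x: x % 2 == 0, numbers))        # [2, 4]

# List comprehension: Python's idiomatic take on map/filter.
squares_of_evens = [x * x for x in numbers if x % 2 == 0]  # [4, 16]
```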
Execution Model
Python is an interpreted language, which essentially means that any valid Python code can be executed immediately – no "build" or "compile" step is required. This makes Python a great choice for interactive work, since the interpreter runs code as you type it.
Scala on the other hand is a compiled language, which means that a Scala compiler first needs to transform Scala code into so-called Java bytecode for the JVM (which in turn is translated into native machine code during execution). This three-step approach (write, compile, execute) often makes code experiments more difficult, since turnaround times are higher. Luckily Scala also provides an interactive shell, which compiles and immediately executes code as you type it. But generally speaking, Scala is meant to be compiled.
Learning Curve
Generally speaking, Python is very simple to learn – it was specifically designed that way, with a strong focus on readability. Python’s dynamic type system is well suited for beginners who have never been in contact with a programming language before. Python is very forgiving and its syntax is easy to understand.
Scala on the other hand has a much steeper learning curve, and – as opposed to Python – code can quickly become hard to read for novices. Although you only need a small subset of Scala to start using Spark, you eventually need to understand more and more details of the language when you dig deeper into Spark and try to solve more complex problems.
I found that most Java programmers initially have big problems getting used to the functional aspects of Scala, partly because of its very concise syntax. I always feel that the information density (i.e. how much logic is encoded per character of program code) is much higher in Scala than in Java, and this density is challenging for most people’s brains at the beginning, since they are used to the much more verbose boilerplate code of Java, which significantly lowers the information density.
Robustness of Code
Dynamically typed languages have one huge disadvantage over statically typed languages: Using a wrong type is only detected during run time and not earlier (during compile time). This means that if a function is called with a wrong data type under some very rare conditions, you might only notice that when it’s too late – in production.
In contrast, a statically typed and compiled language will stop you from releasing broken code to production: the compiler points directly to the usage of the wrong type, and you have to fix it before the build can finish.
Because of this difference, I found writing robust, production-ready Python code to be much more difficult than writing robust Scala code. This is even harder when writing a whole framework or library that is then used by other applications. Applications could pass wrong data types to functions; those types might be "good enough" in some cases (because they implement all required methods) but fail in other cases (because other methods are missing or their signatures have changed).
On top of that, refactoring with Python can be very difficult, since the consequences of using different types or renaming methods are not always correctly detected by your IDE.
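To illustrate the point with a minimal, made-up Python example: the function below is intended for numbers, but nothing stops a caller from passing a string, and the mistake only surfaces when that code path is actually executed.

```python
def add_tax(price, rate=0.19):
    # Intended for numeric prices; no type is enforced.
    return price + price * rate

print(add_tax(100.0))   # 119.0 - works as intended

# The following call fails only at run time with a TypeError, possibly long
# after deployment, once this code path is finally reached:
# add_tax("100.0")
# A statically typed language would have rejected it at compile time.
```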
Ecosystem
Nowadays the success of a programming language is not mainly tied to its syntax or its concepts, but to its ecosystem. This includes many aspects like the availability of useful libraries, the choice of good editors, the support of relevant operating systems and more.
Specifically, the set of available libraries nowadays has a huge impact on the primary domain in which a specific programming language is used. The most prominent example is Python, for which most new state-of-the-art machine learning algorithms are implemented first – an area where Scala is far behind, although projects like ScalaNLP try to improve the situation.
While Scala’s boost over the last years can probably be traced back to the success of Apache Spark, it is also used in many projects for network services that require high concurrency – an area where Scala’s functional programming features help with implementing robust multi-threaded code.
Conclusion
Both Scala and Python have their place. Specifically in the area of data processing, Python suits a scientific workflow well, with many small and quick code experiments as part of an exploration phase to gain new insights.
Scala’s "write-compile-execute" workflow and its static type system are a better fit for an engineering workflow, where the knowledge for approaching a specific problem is already there and experiments are therefore no longer the focus.
As I pointed out in "Robustness of Code", I prefer to use a strongly typed language for production code except in some simple cases, where the application is almost trivial.
The Value of Ecosystems
After this excursion into a comparison of Scala and Python, let’s move back to Pandas vs Spark. There is one aspect that is tightly coupled to the programming language, and that is the ecosystem. I already mentioned this aspect above, but let us focus more on the libraries that can be used together with Pandas and with Spark.
Python Ecosystem
Since Pandas at its core is built on top of NumPy arrays, it naturally integrates very well with a very rich ecosystem of numerical and statistical libraries. Just to name a few important examples:
- NumPy and SciPy for numerical computations
- scikit-learn for machine learning
- Matplotlib and seaborn for plotting
- statsmodels for statistical modelling
Moreover we also have the lovely Jupyter Notebooks for working interactively as part of an experimentally driven exploration phase.
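As a small, hedged sketch of this integration (the data and column names are made up): a Pandas column is backed by a NumPy array and can be handed directly to NumPy functions or to scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# A tiny made-up data set.
df = pd.DataFrame({
    "size":  [30, 45, 60, 80, 100],
    "price": [120, 180, 230, 310, 400],
})

# Pandas columns work directly with NumPy functions ...
print(np.log(df["size"]))

# ... and can be passed straight into scikit-learn.
model = LinearRegression()
model.fit(df[["size"]], df["price"])
print(model.coef_, model.intercept_)
```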
Scala Ecosystem
Spark on the other hand lives in a completely different universe. As a citizen of the JVM world, it can use all kinds of Java libraries – but the focus of most Java libraries is networking, web services and databases. Numerical algorithms are not part of Java’s core domain.
Therefore the ecosystem for Spark looks very different. Most importantly, there are many connectors to use Spark with all kinds of databases, such as relational databases via JDBC connectors, HBase, MongoDB, Cassandra, and so on.
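To give an impression of how uniform this connectivity is, here is a minimal sketch of reading a table through Spark’s generic JDBC data source; it is written in PySpark for consistency with the other snippets, and the connection details and table name are purely hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-example").getOrCreate()

# Read a table from a (hypothetical) PostgreSQL database via JDBC.
# The matching JDBC driver has to be on the Spark classpath.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")
    .option("dbtable", "public.customers")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

customers.groupBy("country").count().show()
```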
In addition to connectors, Spark already implements the most important machine learning algorithms, like regression, decision trees etc. For lower-level numerical algorithms there are also Breeze and ScalaNLP (which, however, cannot be directly scaled by Spark to work on different machines in parallel). But when you compare these libraries with the possibilities of the corresponding Python libraries, you quickly find that they are much smaller in scope.
Finally with Zeppelin or by using PySpark (the Python binding for Spark) in Jupyter, we can also use Spark in notebook environments.
Conclusion
We see huge differences in the ecosystems of Pandas and Spark. While Pandas has strong ties to all sorts of numerical packages, Spark excels in uniform connectivity to all sorts of data sources.
Python vs Scala for Spark
Since Spark can be used with both Scala and Python, it makes sense to dig a little deeper into choosing the appropriate programming language for working with Spark.
I already indicated that Python has a far larger set of numerical libraries which are commonly used in Data Science projects. Although this is already a strong argument for using Python with PySpark instead of Scala with Spark, another strong argument is the ease of learning Python in contrast to the steep learning curve required for non-trivial Scala programs. Even worse, Scala code is not only hard to write, but also hard to read and to understand. That makes Scala a difficult language for collaborative projects where colleagues or even non-programmers also need or want to understand the logical details of an application. This is often the case in a Data Science environment.
Python for Data Science
Because of the availability of many relevant libraries for data science, and because of the easy readability of Python code, I always recommend using PySpark for real Data Science. This also fits well with the profile of many Data Scientists, who have a strong mathematical background but often are not programming experts (the focus of their work is elsewhere).
By using PySpark, data scientists can work with huge data sets that no longer fit into the RAM of a local machine, and at the same time (to a certain degree) they can still access all the relevant Python libraries – as long as they can downsample or aggregate the data such that these tools and libraries become feasible again.
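A minimal sketch of this workflow (file path and column names are made up): Spark does the heavy lifting on the full data set, and only a small aggregate is pulled into Pandas, where the usual Python libraries apply again.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-to-pandas").getOrCreate()

# The full data set may be far too large for a single machine ...
events = spark.read.parquet("hdfs:///data/events")

# ... so Spark performs the aggregation ...
daily_counts = (
    events.groupBy(F.to_date("timestamp").alias("day"))
          .agg(F.count("*").alias("events"))
)

# ... and only the small result is converted into a Pandas DataFrame.
pdf = daily_counts.toPandas()
print(pdf.describe())
```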
Scala for Data Engineering
Things look different for data engineering, where I highly recommend using Spark with Scala. First, data engineers should have a strong technical background, so that using Scala is viable. Next, it may well be the case that some custom transformations are required that are not available in Spark. But Spark is very extensible, and in this case it can really pay off to use Scala as the native Spark programming language.
Using Scala instead of Python not only provides better performance, but also enables developers to extend Spark in many more ways than would be possible with Python. With Scala you can even access the internal developer APIs of Spark (as long as they aren’t private), whereas Python can only access the public end-user API of Spark.
Moreover I strongly believe that in data engineering projects all the aspects of "production quality code" are far more important than for an explorative data analysis task performed in a notebook environment. This is precisely where having a statically typed and compiled language like Scala provides great benefits.
Conclusion
Choosing a programming language isn’t easy. You have to think about your requirements, both functional and non-functional. While Python is great for data science, I would prefer to use Scala for data engineering with Spark.
The next and final part will summarize all the findings and give more advice on when to use what.