Highlights of Data + AI Summit 2020 (formerly Spark Summit)

Recent developments with Spark 3.0, Spark-on-Kubernetes going GA, PySpark usability improvements, and more.

Jean Yves
Towards Data Science


Source: Unsplash. Not a photo of the actual conference, which was entirely online this year.

November 20th, 2020: I just attended the first edition of the Data + AI Summit — the new name of the Spark Summit conference organized twice a year by Databricks. This was the European edition, meaning the talks took place in a European-friendly time zone. In reality it drew participants from everywhere, as the conference was virtual (and free) because of the pandemic.

The conference featured 125 pre-recorded video talks aired at a specific time (as if they were live), with the speakers available to answer questions in real time on a “Live Chat”. All the talks are available for replay for free on the summit’s website.

In this article we’ll go over the highlights of the conference, focusing on the new developments added to Apache Spark or coming soon.

Keynotes: Focus on PySpark and COVID-19

Watch the keynotes: https://databricks.com/dataaisummit/europe-2020/keynotes

On Wednesday morning, November 18, we had a keynote introduced by Ali Ghodsi, the CEO and a co-founder of Databricks. He brought on stage quite a few people, highlighting in particular the need to bolster the Python experience of Apache Spark, with discussions of Project Zen and the new Koalas API, as well as the performance boosts enabled by the latest release, Spark 3.0.

Koalas 1.4 — Announcement & Key Performance Improvements. Source.
Apache Spark 3.0 — SQL Performance Benchmark. Source.

After about an hour and a half of technical discussion, Malcolm Gladwell (the relatively famous author of The Tipping Point and Blink) discussed COVID-19 data and how it could have helped us navigate the pandemic differently, had we used it to make better decisions. He applied a data science lens to the pandemic, which he said Japan, China, and South Korea handled well, but Europe and the USA did not.

He went on to give three guiding principles: framing, usefulness, and courage. We need to use the data to properly frame our understanding of the problem. We need to make sure the data we select is the most useful (and useful doesn’t mean perfect). Lastly, by courage he means: “to have the courage to follow the data, to do what the data tells you to do.” As an example, he said that “10% of the American population has a risk of dying from COVID-19” based on obesity, and that we need to take special precautions to help the obese, but we don’t do this because we have a social norm against shaming people. He felt that those who follow these three principles will in fact become the proper leaders of the rest of us.

Getting Started with Apache Spark on Kubernetes

Watch the talk: https://databricks.com/session_eu20/getting-started-with-apache-spark-on-kubernetes

In this talk, the founders of Data Mechanics (ex-Databricks engineers) went over the pros and cons of running Spark on Kubernetes, followed by a concrete guide on how to make Spark on k8s stable and cost-effective.

A Timeline Of Improvements To Spark On Kubernetes. Image by Author.

They revealed that Spark on Kubernetes will officially be declared Generally Available and Production-Ready with the upcoming version of Spark (3.1). Update (March 2021): Spark 3.1 has been officially released; learn more about the newly available features!

One specific feature is the ability to move shuffle files off a Spark executor before it is terminated (due to dynamic allocation or a spot kill), which should significantly improve Spark’s robustness and performance in elastic cloud architectures.
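
For context, here is a minimal sketch of the configuration side of this feature, in Python. The decommissioning settings below are the ones I understand to arrive around Spark 3.1, so treat the exact keys as assumptions to verify against the documentation of your Spark version.

```python
from pyspark import SparkConf

# A minimal sketch (assuming Spark 3.1+ on Kubernetes) of the settings related
# to dynamic allocation and graceful executor decommissioning.
conf = (
    SparkConf()
    # Dynamic allocation without an external shuffle service: Spark 3.0+ can
    # track shuffle files to decide which executors are safe to remove.
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Graceful decommissioning (assumed Spark 3.1 keys): migrate shuffle and
    # cached blocks off an executor before it goes away, e.g. on a spot kill.
    .set("spark.decommission.enabled", "true")
    .set("spark.storage.decommission.enabled", "true")
    .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
)
```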

They then illustrated the development workflow with a live demo running on the Data Mechanics platform: a script builds a Docker image with all the required dependencies, pushes it to the registry, and then runs it at scale on the Kubernetes cluster, all within a 30-second iteration cycle.
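
To give a feel for what “running at scale on Kubernetes” looks like from the Spark side, here is a minimal sketch of a PySpark session pointed at a Kubernetes cluster. The API server URL, namespace, service account, and image name are hypothetical placeholders, not the values used in their demo.

```python
from pyspark.sql import SparkSession

# Minimal sketch of Spark on Kubernetes in client mode; every value below is
# a placeholder to adapt to your own cluster and registry.
spark = (
    SparkSession.builder
    .appName("spark-on-k8s-demo")
    .master("k8s://https://my-cluster.example.com:6443")  # Kubernetes API server
    .config("spark.kubernetes.namespace", "spark-apps")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.kubernetes.container.image", "my-registry.example.com/spark-py:3.0.1")
    .config("spark.executor.instances", "10")
    .getOrCreate()
)

# A trivial job to confirm executors come up on the cluster.
spark.range(1_000_000).selectExpr("sum(id) as total").show()
```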

This is a big win for the developer experience, compared to my experience deploying Spark on traditional Hadoop YARN clusters.

What is New with Apache Spark Performance Monitoring in Spark 3.0

Watch the talk: https://databricks.com/session_eu20/what-is-new-with-apache-spark-performance-monitoring-in-spark-3-0

The speaker, Luca Canali, is a Data Engineer at CERN, which processes one exabyte of data from the Large Hadron Collider using Spark (on Kubernetes). What was exciting about this talk is that it clearly explained how to write custom Spark 3 monitoring code by extending the SparkListener API.
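
PySpark cannot subclass the JVM SparkListener directly, but a listener implemented in Scala or Java can be attached from a Python session through configuration. As a minimal sketch, the listener class name and jar path below are hypothetical placeholders for your own implementation.

```python
from pyspark.sql import SparkSession

# Minimal sketch: wiring a custom listener (a JVM class extending
# org.apache.spark.scheduler.SparkListener) into a PySpark session.
# "com.example.monitoring.MyMetricsListener" and the jar path are placeholders.
spark = (
    SparkSession.builder
    .appName("custom-monitoring")
    .config("spark.extraListeners", "com.example.monitoring.MyMetricsListener")
    .config("spark.jars", "/path/to/my-metrics-listener.jar")
    .getOrCreate()
)
```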

Spark Listeners in Apache Spark 3.0. Source.

Some open-source projects have started to make use of these new capabilities in Apache Spark 3.0 like Delight, a free and cross-platform Spark UI replacement with new metrics and data visualizations. You can install it in your Spark infrastructure by downloading their open-sourced Spark agent (which uses this new SparkListener interface).

Update (April 2021): Delight has been officially released! It works on top of any Spark platform: Databricks, EMR, Dataproc, HDInsight, CDH/HDP, Spark on Kubernetes open-source, Spark-on-Kubernetes operator, open-source spark-submit, etc.
[Get Started] [Check out our open-source agent on GitHub]

From Query Plan to Query Performance: Supercharging your Spark Queries using the Spark UI SQL Tab

Watch the talk: https://databricks.com/session_eu20/from-query-plan-to-query-performance-supercharging-your-apache-spark-queries-using-the-spark-ui-sql-tab

If you love join strategies and efficiency, you will be intrigued by this use case: joining a ship to a port, but only if the ship’s location made it physically possible for the ship to be at that port.

Spark UI SQL Tab Illustration with a Join Strategy example. Source.

They created clever geographical hash regions based on latitude and longitude, and only attempt the join when the geo-hashes match, which turns the expensive condition into an equi-join that Spark can implement as a BroadcastHashJoin. This geo-filtering technique sped up the join considerably. Whew!
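
As an illustration of the idea (not their exact implementation), here is a minimal PySpark sketch: both sides get a coarse grid cell derived from latitude and longitude, the join becomes an equality on that cell that Spark can execute as a BroadcastHashJoin, and a precise proximity check is applied afterwards. The data and thresholds are made up.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("geo-hash-join-sketch").getOrCreate()

# Hypothetical inputs: ships(ship_id, lat, lon) and ports(port_id, port_lat, port_lon).
ships = spark.createDataFrame(
    [(1, 51.95, 4.14), (2, 35.45, 139.65)], ["ship_id", "lat", "lon"]
)
ports = spark.createDataFrame(
    [(100, 51.96, 4.12), (200, 1.26, 103.84)], ["port_id", "port_lat", "port_lon"]
)

# A crude stand-in for their geo-hash: round coordinates to one decimal place
# (cells of roughly 10 km), so nearby points share the same cell value.
def with_cell(df, lat_col, lon_col):
    return df.withColumn(
        "cell",
        F.concat_ws(
            "_",
            F.round(F.col(lat_col), 1).cast("string"),
            F.round(F.col(lon_col), 1).cast("string"),
        ),
    )

ships = with_cell(ships, "lat", "lon")
ports = with_cell(ports, "port_lat", "port_lon")

# Equi-join on the cell (ports is small, so Spark can broadcast it and use a
# BroadcastHashJoin), then keep only pairs that are genuinely close.
matched = (
    ships.join(F.broadcast(ports), on="cell")
    .where(
        (F.abs(F.col("lat") - F.col("port_lat")) < 0.05)
        & (F.abs(F.col("lon") - F.col("port_lon")) < 0.05)
    )
)
matched.show()
```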

Interoperability Between Koalas and Apache Spark

Watch the talk: https://databricks.com/session_eu20/koalas-interoperability-between-koalas-and-apache-spark

For those who haven’t heard of Koalas and its relationship to PySpark, “the Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.”

Koalas allows for a much richer experience on top of PySpark DataFrames: the pandas functionality we know and love is exposed, yet at runtime everything executes as Spark jobs. An example of the new features is converting a Spark DataFrame into a Koalas DataFrame with the df.to_koalas() function, which also makes it easy to set a custom index on the Koalas DataFrame (as shown below).

Koalas implements the pandas DataFrame API on top of Apache Spark. Source
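
To make that concrete, here is a minimal sketch of the round-trip between Spark and Koalas DataFrames, assuming Koalas 1.x is installed alongside PySpark; the column names and index choice are illustrative.

```python
import databricks.koalas as ks  # importing Koalas adds to_koalas() to Spark DataFrames
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("koalas-interop").getOrCreate()

# A small Spark DataFrame with illustrative columns.
sdf = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])

# Convert to Koalas, using an existing column as the index instead of letting
# Koalas generate a default one.
kdf = sdf.to_koalas(index_col="key")

# From here on, familiar pandas-style operations run as Spark jobs.
print(kdf["value"].mean())

# And back to a Spark DataFrame when needed, keeping the index as a column.
sdf2 = kdf.to_spark(index_col="key")
sdf2.show()
```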

Project Zen: Improving Apache Spark for Python Users

Watch the talk: https://databricks.com/session_eu20/project-zen-improving-apache-spark-for-python-users

I used to work at Databricks, back in 2015, and at that time Python was not nearly as popular as it is today. Back then people still believed that Scala would take off, but it never really did. Instead, we have seen a dramatic rise in the popularity of Python development worldwide, at the expense of Java (which is considered old) and Scala (which is considered obscure these days).

According to the speaker, Hyukjin Kwon, PySpark applications now represent 68% of the queries run in Databricks notebooks (vs 11% for Scala). This is a huge jump in only five years: back then it was around 35% Python. That is simply amazing.

This is what the new documentation will look like in Spark 3.1 (January 2021). Source

This talk was about the ways in which PySpark has been found lacking, and the effort to fortify it for the current needs of the PySpark dev community. The talk referenced the clever poem known as the Zen of Python:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
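
(Those are just the opening lines; the full poem ships with every Python installation and can be printed with a one-liner.)

```python
# Prints Tim Peters' full "Zen of Python".
import this
```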

The goal of the Apache Spark Project Zen (JIRA) is to make PySpark the first-class citizen it was always marketed to be. While those of us on the inside knew from the start that it was a second-class citizen, it is now high time to acknowledge this and to build out the PySpark API and documentation properly. As of June 2020, PySpark is downloaded more than a million times each week.

Conclusion: Spark is alive and thriving

I was at the Spark Summit in 2015 and it was extremely exciting. Back then I was just learning Spark, and I started to teach and consult about Spark for Databricks. Since then I have been around the world teaching and mentoring teams about Spark. If the Data + AI Summit we just attended is any indicator, then Apache Spark is still alive and thriving.

About the “AI” part of the event title, well, those who know will say: “If it is written in PowerPoint it is AI, and if it is written in Python it is Machine Learning.” I will stick to Python and PySpark ML, and now Koalas, thank you!

About the author

Laurent Weichberger is a full time consultant at Hashmap, Inc. where he works as a Sr. Cloud Architect. He is also a published author of spiritual books, which are available on Amazon. Laurent lives with his wife and children (and their pet cat), in North Carolina.

This blog post was originally published on the Data Mechanics blog.
