PySpark Recipes: Map And Unpivot

Is the PySpark API really missing key functionality?

Pan Cretan
Towards Data Science


Photo by William Bout on Unsplash

PySpark offers a fluent API that covers most needs. Still, experienced pandas users may find that some data transformations are not so straightforward. This article provides a small number of recipes for use cases that some users may consider not natively supported by the PySpark API. In reality they are supported, but they do require some more effort (and imagination).

We start with the necessary imports and the creation of a spark session.
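A minimal sketch of what this could look like, assuming we only need the SQL functions module and the itertools.chain helper used later on (the exact imports and session configuration of the original gists may differ, and the app name is arbitrary):

```python
from itertools import chain

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Local session with 4 cores and 8 GiB of driver memory, as used in this article
spark = (
    SparkSession.builder
    .master("local[4]")
    .config("spark.driver.memory", "8g")
    .appName("pyspark-recipes")
    .getOrCreate()
)
```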

Note that for the purposes of this article we are using a local PySpark environment with 4 cores and 8 GiB of memory. We will create simple data frames for each recipe.

Map values

The first recipe deals with mapping values and is based on creating a mapping column.
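A sketch along these lines illustrates the idea, using a hypothetical data frame with a fruit column and a small mapping dictionary (the data and column names are illustrative, not necessarily those of the original gist):

```python
# Hypothetical input data; the original example may differ
df = spark.createDataFrame(
    [("apple",), ("banana",), ("cherry",), (None,)],
    ["fruit"],
)

# key-value pairs defining the mapping
mapping = {"apple": "red", "banana": "yellow"}

# chain(*mapping.items()) yields (key1, value1, key2, value2, ...)
mapping_expr = F.create_map(*[F.lit(x) for x in chain(*mapping.items())])

# retrieve the mapped value for every key in the original column
df_mapped = df.withColumn("colour", mapping_expr[F.col("fruit")])
df_mapped.show()
```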

The mapping key-value pairs are stored in a dictionary. The construct chain(*mapping.items()) returns a chain object of key-value pairs as (key1, value1, key2, value2, …) that are used to create a mapping column. This mapping column is essentially a constant and, hence, we will have the same map in every row of the data frame. The mapping is achieved by retrieving the mapped value for every key in the original column. The original column may contain values that do not exist as keys in the mapping; these are consequently mapped to null. Null values in the original column do not lead to errors and remain null after the mapping. It is possible to use default values instead of nulls, e.g. with
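a coalesce over the lookup, as in the sketch below with an arbitrary default (one possible approach, not necessarily the original gist; note that it also replaces nulls that were already present in the original column):

```python
# Replace nulls produced by the lookup (and pre-existing nulls) with a default
df_with_default = df.withColumn(
    "colour",
    F.coalesce(mapping_expr[F.col("fruit")], F.lit("unknown")),
)
df_with_default.show()
```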

In the interest of completeness, it is also possible to use the pandas API on Spark to achieve the same result.
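A sketch using pyspark.pandas and Series.map, again with the hypothetical fruit data (the original gist may differ):

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"fruit": ["apple", "banana", "cherry", None]})
# Series.map behaves as in pandas: keys not found in the dict become missing values
psdf["colour"] = psdf["fruit"].map(mapping)
print(psdf)
```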

This article does not delve into this new API, which is nevertheless a great development, especially for prototyping purposes. I have included some more thoughts on this in the conclusions.

Melt (also known as unpivot)

Users who have worked with pandas may have wondered how we can emulate the functionality of melt that is available in the pandas API. Melt converts a data frame into a format where one or more columns are identifier variables (id_vars), whilst all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis. In other words, the data frame is converted from wide to long format.

The starting data frame to demonstrate this recipe can be constructed with
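a few rows of wide-format data, for example (the columns id, x and y below are hypothetical and not necessarily those of the original gist):

```python
# A small wide data frame: one identifier column and two measured variables
df_wide = spark.createDataFrame(
    [("A", 1, 10), ("B", 2, 20), ("C", 3, 30)],
    ["id", "x", "y"],
)
df_wide.show()
```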

There are likely several ways to implement a melt function in PySpark. In this article we demonstrate two in an attempt to introduce a generic way of thinking that can be helpful in other situations too. The first solution constructs an array of structs that is subsequently exploded.
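A sketch of such a melt function is given below; the function signature mirrors the pandas one, and the helper name and defaults are assumptions rather than the article's exact code:

```python
def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # one struct per value variable, holding the column name and the column value
    _vars_and_vals = F.array(*(
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ))
    # explode the array into one row per value variable
    _tmp = df.withColumn("_vars_and_vals", F.explode(_vars_and_vals))
    # keep the identifier variables and unpack the struct fields
    cols = list(id_vars) + [
        F.col("_vars_and_vals")[name].alias(name) for name in (var_name, value_name)
    ]
    return _tmp.select(*cols)
```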

In order to increase the number of rows and reduce the number of columns we need to somehow pack together the content of the value variables into a container and then explode the container into multiple rows. Using a comprehension we create an array of structs that is exploded and stored in a newly created column named _vars_and_vals. After exploding we have one struct in each row from which it is possible to retrieve the column name and column value for all value variables.

As an illustration we can unpivot some or all columns in the starting data frame.
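Continuing with the hypothetical df_wide and the melt sketch from above, two possible calls are:

```python
# Unpivot both measured variables, keeping "id" as the identifier
melt(df_wide, id_vars=["id"], value_vars=["x", "y"]).show()

# ... or unpivot a single column only
melt(df_wide, id_vars=["id"], value_vars=["y"]).show()
```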

A second solution relies on the creation of a map from arrays that is subsequently exploded.
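A possible sketch, again with assumed helper and parameter names:

```python
def melt_with_map(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # build a map from column names (keys) to column values
    _map = F.map_from_arrays(
        F.array(*(F.lit(c) for c in value_vars)),
        F.array(*(F.col(c) for c in value_vars)),
    )
    # exploding a map yields one row per entry, with a key and a value column
    return df.select(
        *id_vars,
        F.explode(_map).alias(var_name, value_name),
    )

melt_with_map(df_wide, id_vars=["id"], value_vars=["x", "y"]).show()
```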

The creation of the map can be achieved in other ways too, such as F.create_map(*chain(*((F.lit(x), F.col(x)) for x in value_vars))). Using a map as a container is perhaps less obvious because exploding maps may not be something that springs to mind straightaway, although the documentation is explicit on this. The two solutions provide identical results, but I am not so certain which solution has better performance (comments are welcome). In terms of readability I personally find the first solution better.

Conclusions

Hopefully this article provides insights on how pyspark.sql.functions that generate and handle containers, such as maps, arrays and structs, can be used to emulate well known pandas functions. An alternative option is to use the recently introduced PySpark pandas API that used to be known as Koalas before Spark v3.2. The official advice in the documentation is to use the PySpark pandas API if you are already familiar with pandas and to start directly with the PySpark API if you are not. I would consider myself an experienced pandas user, but I still prefer to rely on the PySpark API for all purposes other than fast prototyping. The main reason is that I do not need to worry about incompatibilities that may take more time to discover compared to thinking of a solution using the PySpark API directly. For example, Spark provides both null (in a SQL sense, as a missing value) and nan (numeric not-a-number), whilst pandas does not have a native value dedicated to representing missing values. Thinking in pandas when using the PySpark pandas API is in some way like trying to speak a language by translating from your mother tongue. In the long term it may be better to think directly in the new language. This is a personal view only and it is not a criticism of the quality of the PySpark pandas API. I am sure that there are different viewpoints, and diversity of opinions is essential for making progress.


I am an engineer who turned into a data analyst. I enjoy learning and sharing knowledge with experts in data analysis and modelling.