Reduce your worries: using ‘reduce’ with PySpark

Using Python to repeat PySpark operations with ease

Patrick Normile
Towards Data Science



If you use PySpark, you’re probably already familiar with its ability to write great SQL-like queries. You can easily method-chain common SQL clauses like .select(), .filter()/.where(), .join(), .withColumn(), .groupBy(), and .agg() to transform a Spark DataFrame. That is powerful on its own, but the possibilities open up when you combine it with Python-style scripting. I’ll show two examples where I use Python’s reduce from the functools library to repeatedly apply operations to Spark DataFrames.

Stacking Tables

The first trick is to stack any number of DataFrames using the SQL-like union all. Suppose you have a series of tables that all share the same structure and you want to stack them on top of each other. The data could be partitioned into many smaller tables for sharing purposes, each table could represent a month, or there could be any other reason. In this example, we assume we have a list of parquet file paths holding the tables we need to combine. We could write an unnecessary for-loop to stack them one by one, but a much better approach is to leverage reduce from the functools library.

The reduce function takes two arguments: the function we want to apply repeatedly, and the iterable we want to apply it over. The function you pass to reduce must itself take two arguments. A common example you’ll see is

from functools import reduce

reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])

This would calculate:

((((1+2)+3)+4)+5)

For this example, we will use a DataFrame method instead and repeatedly chain it over the iterable.
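Here is a minimal sketch of that approach. The parquet paths and the SparkSession setup are placeholders I’ve assumed for illustration; the key line is the final reduce call, which passes the unbound DataFrame.unionAll method as the function to repeat:

from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical parquet paths holding identically structured tables
paths = [
    "tables/2020_07.parquet",
    "tables/2020_08.parquet",
    "tables/2020_09.parquet",
]

# Read each path into its own DataFrame
dfs = [spark.read.parquet(p) for p in paths]

# Repeatedly apply unionAll to stack every DataFrame into one
stacked = reduce(DataFrame.unionAll, dfs)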

This combines all our DataFrames exactly as if we had chained the method ourselves:

(dfs[0].unionAll(dfs[1])).unionAll(dfs[2])...

Nested OR/AND

For the next example, we need to apply a filter built from a series of conditions, all linked by either OR or AND. A common SQL case: you want to query all rows associated with particular markets, and the first three characters of the ‘market’ field are all you need to match.

SELECT * FROM TABLE
WHERE (MARKET LIKE 'AQ1%' OR MARKET LIKE 'AW3%' OR MARKET LIKE 'M89%' OR ...)
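Translated to PySpark, the same pattern is a one-liner with reduce. The prefix list and the DataFrame name df here are assumptions for illustration:

from functools import reduce
from pyspark.sql import functions as F

# Market prefixes from the SQL above; extend as needed
prefixes = ["AQ1", "AW3", "M89"]

# One LIKE condition per prefix, OR-ed together by reduce
market_filter = reduce(
    lambda a, b: a | b,
    [F.col("MARKET").like(p + "%") for p in prefixes],
)

df.filter(market_filter)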

For my table, I have a row for each person and a series of columns indicating whether they were covered in a given month: a 1 if they were covered, otherwise a 0. I want to query all the members that had coverage at some point in a time range, so the SQL-like query I want to write looks like this:

SELECT * FROM TABLE
WHERE (COV_2007 = 1 OR COV_2008 = 1 OR COV_2009 = 1 OR ... OR COV_2106 = 1)

Writing this out by hand in SQL is a pain, but with Python we can script this repeated OR condition easily.

To do so, I first create a list of the columns I’m interested in, as strings. In this example, I want July 2020 through June 2021 (the columns are named ‘cov_2007’ through ‘cov_2106’).

Next, I create a list of the column-level filters; in this case, I want each column to equal the value 1. The last step before calling reduce is to create the function I want to repeat: a lambda that takes the logical OR of two conditions (you could use ‘&’ for AND instead if needed). Putting the steps together looks like the sketch below.
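Here is a minimal sketch of those three steps, assuming the member table is already loaded as a DataFrame named df (the column-name generation is my own reconstruction of the naming scheme described above):

from functools import reduce
from pyspark.sql import functions as F

# Column names 'cov_2007' through 'cov_2106' (July 2020 - June 2021)
cols = [f"cov_20{m:02d}" for m in range(7, 13)] + \
       [f"cov_21{m:02d}" for m in range(1, 7)]

# One boolean condition per column: coverage == 1
conditions = [F.col(c) == 1 for c in cols]

# OR the conditions together (swap | for & to require AND)
any_coverage = reduce(lambda a, b: a | b, conditions)

covered_members = df.filter(any_coverage)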

That’s all it takes! Using reduce saved me from tediously writing out every condition by hand and from writing a clunky for-loop. I hope you enjoyed reading!
