
Introduction
I work with PySpark in Databricks on a daily basis. My work as a Data Scientist requires me to deal with large amounts of data across many different tables, and it is often a challenging job.
As simple as the Extract, Transform and Load (ETL) process may sound, I can tell you that it is not always like that. When we work with Big Data, a lot of our thinking has to change, for two reasons:
- The amounts of data are way bigger than regular datasets.
- When working with parallel computing in clusters, we must take into account that the data will be split among many worker nodes, each performing part of the job, and then brought back together as a whole. This process can become very time consuming if the query is too complex.
Knowing that, we must learn how to write smart queries for Big Data. In this post, I will show a few of my favorite functions from the module pyspark.sql.functions, aiming to help you with your data wrangling in PySpark.
Best Functions
Now let’s move on to the content in this post.
Just like many other languages, PySpark has the benefit of modules, where you can find many ready-to-use functions for the most varied purposes. Here's the one we will load into our session:
from pyspark.sql import functions as F
If you want to see how extensive the list of functions inside pyspark.sql.functions is, go to the API Reference page. Keep in mind that it documents version 3.5.0; some older versions may not carry all the functions I will show in this post.
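By the way, if you are not sure which version your cluster is running, a quick check like the one below works in a Databricks notebook, where the spark session object already exists.
# Print the Spark version of the current session
print(spark.version)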
Dataset
The dataset to be used as our example is the Diamonds, from ggplot2, shared under MIT License.
# Point file path
path = '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv'
# Load Data
df = spark.read.csv(path, header=True, inferSchema=True)
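Before moving on, a quick check I like to do (not part of the original snippet): printSchema() shows whether inferSchema picked up the numeric columns correctly.
# Optional sanity check: confirm the inferred data types
df.printSchema()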
Creating an Index Column
For those who work with Pandas in Python, it feels weird at first to deal with a data frame without an index. So, if we want to add an index column, the function is monotonically_increasing_id(). The counting starts at 0, so if we want to start at 1, we just add +1 after the function call.
# Add an increasing ID column starting at 1
display(
    df
    .limit(100)
    .withColumn('ID', F.monotonically_increasing_id() + 1)
)
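One caveat: monotonically_increasing_id() guarantees increasing, unique IDs, but not consecutive ones, because the IDs encode the partition each row lives in. If you really need a gapless 1, 2, 3, ... sequence, a window with row_number() is a common alternative. The sketch below is my own addition (ordering by price purely as an example); note that an unpartitioned window moves every row to a single partition, so it can be expensive on truly big data.
from pyspark.sql import Window

# Gapless index: row_number() over a window; the window needs an ordering column
w = Window.orderBy('price')
display(
    df
    .limit(100)
    .withColumn('ID', F.row_number().over(w))
)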
Sum, Mean, Max, Min
The classic mathematical functions surely make this list. They are useful in almost every case.
display(
    df
    .groupBy('cut')
    .agg(F.sum('price').alias('total'),
         F.mean('price').alias('avg_price'),
         F.min('price').alias('min_price'),
         F.max('price').alias('max_price'))
)
Count and Count Distinct
It is important to count values and know how many distinct values are in the data as well.
display(
    df
    .groupBy('cut')
    .agg(F.count('cut').alias('n_count'),            # count of observations
         F.countDistinct('price').alias('distinct')) # number of distinct prices
)
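On truly large tables, exact distinct counts can get expensive because of the shuffle involved. An alternative I often reach for is approx_count_distinct(), which trades a small, configurable error for speed; a quick sketch of my own:
# Approximate distinct count; the second argument is the maximum relative error allowed
display(
    df
    .groupBy('cut')
    .agg(F.approx_count_distinct('price', 0.05).alias('approx_distinct'))
)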
Literal Value
The function lit() lets you write a literal value for every row in the data.
display(
    df                                                   # dataset
    .limit(10)                                           # only 10 rows
    .withColumn('literal', F.lit('my text or number'))   # add a column with a literal value
)

Floor, Ceiling and Percentile
Some great math functions that help a lot while wrangling data are floor() (the closest integer below) and ceiling() (the closest integer above).
display(
    df
    .limit(10)
    .select('x')
    .withColumn('floor', F.floor('x'))
    .withColumn('ceiling', F.ceiling('x'))
)

Now, percentile is especially useful to calculate the median value. Until not so long ago, I remember trying to calculate the median and getting an error; now I see that median() is there in version 3.5.0. The best workaround back then was to calculate it using percentile() at 50%.
display(
    df
    .groupBy('cut')
    .agg(F.median('price').alias('median'),
         F.percentile('price', 0.5).alias('50th pct'))
)
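If you are on an older runtime where neither median() nor percentile() exists in pyspark.sql.functions, percentile_approx() (available since Spark 3.1) is a workaround I would reach for; a sketch, assuming the same DataFrame:
# Approximate median for older Spark versions: 50th percentile via percentile_approx().
# The last argument controls the accuracy/memory trade-off.
display(
    df
    .groupBy('cut')
    .agg(F.percentile_approx('price', 0.5, 10000).alias('approx_median'))
)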

Descriptive Stats
Despite not being part of the sql.functions module, describe() will also feel familiar to Pandas users. It brings the descriptive statistics of the dataset: count, mean, standard deviation, min and max.
# Descriptive Stats
display(
    df
    .describe()
)
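If you also want quartiles, DataFrame.summary() goes a step further than describe(): by default it adds the 25%, 50% and 75% percentiles to the same statistics. A small sketch of mine, limited to two columns just to keep the output narrow:
# summary() returns count, mean, stddev, min, 25%, 50%, 75% and max
display(
    df
    .select('carat', 'price')
    .summary()
)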

Logarithm
Working as Data Scientists, we keep the log functions close. Especially for linear regression, they help normalize skewed variables.
# Calculating different Logs of 'price'
display(
    df
    .select(F.ln('price').alias('Ln'),
            F.log1p('price').alias('Log1p'),
            F.log10('price').alias('Log10'))
)
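A small aside of mine: log1p() has an exact inverse in expm1(), which is handy when you model on the log scale and need to bring predictions back to the original price scale. A sketch of the round trip:
# log1p()/expm1() round trip: 'back_to_price' should match 'price'
display(
    df
    .limit(5)
    .select('price',
            F.log1p('price').alias('log1p_price'),
            F.expm1(F.log1p('price')).alias('back_to_price'))
)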

Array Aggregate
array_agg() is a good function to take the values of a group and list them in a new column. Let's say we want to group the diamonds by the quality of the cut and have a look at the prices listed. The next snippet performs that.
# Get the aggregated values and list them in a new variable
display(
    df.limit(50)
    .groupBy('cut')
    .agg(F.array_agg('price'))
)
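If array_agg() is not available in your Spark version (it is one of the 3.5.0 additions), collect_list() gives the same kind of result and has been around much longer; collect_set() is the duplicate-free variant. A quick sketch:
# collect_list() gathers the group's values into an array column
display(
    df.limit(50)
    .groupBy('cut')
    .agg(F.collect_list('price').alias('prices'))
)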

Count IF
I bet this sounds familiar if you have ever used MS Excel, right? The idea is the same: if we want to count only the diamonds that cost more than $18,000 within each group, we can use this function. Check it out.
display(
    df
    .groupBy('cut')
    .agg(F.count_if(F.col('price') > 18000))
)
We have more Ideal and Premium cuts at those expensive prices, and almost no Fair ones.
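For older Spark versions without count_if(), the same logic can be expressed by summing a 1/0 flag built with when(); a sketch of that equivalent:
# Equivalent to count_if(): sum a conditional flag
display(
    df
    .groupBy('cut')
    .agg(F.sum(F.when(F.col('price') > 18000, 1).otherwise(0)).alias('n_expensive'))
)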

Mode
The mode of a variable is its most common value. Now we want to know the most common carat for each cut quality.
# Most common value
display(
    df
    .groupBy('cut')
    .agg(F.mode('carat').alias('mode'))
)
Regression Functions
These are really interesting. We can quickly calculate linear regression metrics like R-Squared, intercept and slope using these functions: regr_r2, regr_intercept, regr_slope, regr_avgx.
In the next snippet, we will calculate the regression R-Squared and the formula for each group.
# Remember that the regression formula is y = a + b*x
(
    df
    .groupBy('cut')
    .agg(F.regr_r2('price', 'carat').alias('regression_r2'),
         F.lit('y ='),
         F.regr_intercept('price', 'carat').alias('intercept'),
         F.lit('+'),
         F.regr_slope('price', 'carat').alias('reg_slope'),
         F.lit('*'),
         F.regr_avgx('price', 'carat').alias('avg_x'))
).show()
This is very cool! The Ideal cut (best quality) has the highest R², while the Fair cut (lowest quality) has the lowest. It makes sense.
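If you prefer a single readable formula per group instead of separate columns, the pieces can be glued together with round() and concat_ws(). This is my own sketch, not part of the original example:
# Build 'y = intercept + slope * carat' as one string column per group
display(
    df
    .groupBy('cut')
    .agg(F.round(F.regr_r2('price', 'carat'), 3).alias('r2'),
         F.concat_ws(' ',
                     F.lit('y ='),
                     F.round(F.regr_intercept('price', 'carat'), 2).cast('string'),
                     F.lit('+'),
                     F.round(F.regr_slope('price', 'carat'), 2).cast('string'),
                     F.lit('* carat')).alias('formula'))
)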

Regular Expressions
Yes, the regexes are everywhere. I see a lot of people rolling their eyes at those beautiful expressions, but they are an awesome tool. Imagine being able to extract almost anything from a text using these expressions. It might be difficult and bumpy at first, but once you learn more about them, you start to love them. And PySpark has them in its functions too.
Here, I am using regexp() combined with a literal pattern via lit() to check whether the variable clarity contains digits. The next line uses the function locate() to find the position of the first occurrence of the letter 'S' in the same variable.
# Using Regular Expressions
display(
    df
    .select('clarity',
            F.regexp('clarity', F.lit(r'(\d+)')),
            F.locate('S', 'clarity', 1))
)
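regexp() only tells you whether the pattern matches; if you want the matched text itself as a new column, regexp_extract() is the function I would use. A sketch (my addition) pulling the digits out of clarity, so 'SI2' becomes '2':
# Extract the first captured group of digits from 'clarity'
display(
    df
    .select('clarity',
            F.regexp_extract('clarity', r'(\d+)', 1).alias('clarity_digits'))
)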

Text Parsing
More on text parsing: we can use split() to break a text into parts. In the next snippet I am converting the column carat to text and splitting it by the dot. Since split() takes a regular expression, the dot has to be escaped (\.). So, a number like 0.23 becomes ["0", "23"]. Then I just use index notation to place the results in separate columns.
display(
    df
    .select(F.col('carat').cast('string'))
    .select(F.split('carat', r'\.')[0],
            F.split('carat', r'\.')[1])
)

Another parsing possibility is the function left(), similar to MS Excel's. If you have a text column and want to keep only the first N characters of it, just use left() combined with lit().
display(
    df
    .select('cut')
    .withColumn('first3', F.left('cut', F.lit(3)))
)
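If left() is not available in your Spark version (it is documented among the 3.5.0 additions), substring() is the longer-standing equivalent; it takes a 1-based start position and a length as plain integers. A sketch:
# substring(column, start, length): keep the first 3 characters of 'cut'
display(
    df
    .select('cut')
    .withColumn('first3', F.substring('cut', 1, 3))
)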

Before You Go
Wrangling data is an art. Be it in PySpark, R or Python, you will always need the best functions to make your transformations happen. Here, I listed just a few from the module pyspark.sql.functions. I suggest you visit the documentation page and create your own list of favorites.
My best tip to transform data is:
- Know what your end result should look like.
- From there, you can break the process down into smaller steps that will make it happen.
I learned that when I started my career and used to work a lot with Excel sheets. When I didn’t know how to write a big fancy formula, I used to write smaller pieces in different cells until I got to the desired result. Then, it was just a matter of gathering the pieces and making it work as a single formula.
The same applies to programming. Create your strategy with the end point in mind and go step by step, if needed. That’s it.
If you liked this content, follow my blog for more and subscribe to my newsletter.
Also find me on LinkedIn.
Interested in Learning More About PySpark?
I am just releasing my new online course, Mastering Data Processing With PySpark in Databricks. This is a good opportunity for those of you who want to upskill and learn more about wrangling Big Data!
I have also made available a free webinar with my top tips to write faster queries in PySpark. Check it out, and you will find a coupon with a special launch price inside!