Spark is an analytics engine used for large-scale data processing. It lets you spread both data and computations over clusters to achieve a substantial performance increase.
As the cost of collecting, storing, and transferring data decreases, we are likely to deal with huge amounts of data when working on a real-life problem. Thus, distributed engines like Spark are becoming the predominant tools in the data science ecosystem.
PySpark is a Python API for Spark. It combines the simplicity of Python with the efficiency of Spark, a combination that is appreciated by both data scientists and data engineers.
In this article, we will go over 7 functions of PySpark that are essential for efficient data analysis with structured data. We will be using the pyspark.sql module, which is used for structured data processing.
We first need to create a SparkSession which serves as an entry point to Spark SQL.
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sc.sparkContext.setLogLevel("WARN")
print(sc)
<pyspark.sql.session.SparkSession object at 0x7fecd819e630>
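As a side note, the builder can also take an application name and a master URL if you want more control over the session. A minimal sketch, where the application name and the local[*] master are example choices rather than something this article requires:
from pyspark.sql import SparkSession

# Example configuration: the name and local[*] master are illustrative choices.
spark = (
    SparkSession.builder
    .appName("melb-housing-analysis")
    .master("local[*]")
    .getOrCreate()
)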
We will use this SparkSession object to interact with the functions and methods of Spark SQL. Let's create a Spark data frame by reading a CSV file. We will be using the Melbourne housing dataset available on Kaggle.
df = sc.read.option("header", "true").csv(
    "/home/sparkuser/Desktop/fozzy/melb_housing.csv"
)
df.columns
['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG', 'Date', 'Postcode', 'Regionname', 'Propertycount', 'Distance', 'CouncilArea']
The dataset contains 13 features about houses in Melbourne including the house prices.
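One thing to keep in mind is that the CSV reader treats every column as a string by default. If we want Spark to infer the column types (so that, for example, Price behaves as a number), we can also pass the inferSchema option. A small sketch reusing the same file path; df_typed is just an example name so the original data frame stays untouched:
# Ask Spark to infer column types instead of reading everything as strings.
df_typed = sc.read.option("header", "true").option("inferSchema", "true").csv(
    "/home/sparkuser/Desktop/fozzy/melb_housing.csv"
)
df_typed.printSchema()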
1. select
The select function helps us to create a subset of the data frame column-wise. We just need to pass the desired column names.
df.select("Date", "Regionname", "Price").show(5)

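The select function also accepts column expressions, which is handy when we want to rename a column on the fly. A small sketch, where the alias "Region" is just an example name:
from pyspark.sql import functions as F

# Select by name and by column expression, renaming Regionname along the way.
df.select(
    "Date",
    F.col("Regionname").alias("Region"),
    "Price"
).show(5)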
2. filter
The filter function can be used to filter data points (i.e. rows) based on column values. For instance, we can filter the houses that are in the Northern Metropolitan region and cost more than 1 million.
from pyspark.sql import functions as F
df.filter(
    (F.col("Regionname") == "Northern Metropolitan") &
    (F.col("Price") > 1000000)
).count()
3022
We first import the functions module. We use the col function to refer to a column so that we can compare its values. Multiple conditions are combined using the "&" operator.
Finally, the count function returns the number of rows that fit the specified conditions. There are 3022 houses in the Northern Metropolitan region that cost over 1 million.
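The same filter can also be written as a SQL-style expression string, which some find more readable. A sketch of an equivalent query:
# Equivalent filter expressed as a SQL string instead of column objects.
df.filter(
    "Regionname = 'Northern Metropolitan' AND Price > 1000000"
).count()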
3. withColumn
The withColumn function can be used to manipulate columns or create new ones. Let's create a new column that shows the price in millions.
df = df.withColumn("Price_million", F.col("Price") / 1000000)
df.select("Price_million").show(5)

The first parameter is the name of the column to create. If we pass the name of an existing column instead, that column is overwritten, as shown in the sketch below. The second parameter defines the operation applied to the existing values.
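Here is a small sketch of the overwrite case; df_in_millions is just an example name so the rest of the article can keep using the original Price column:
# Passing an existing column name overwrites that column in place.
df_in_millions = df.withColumn("Price", F.col("Price") / 1000000)
df_in_millions.select("Price").show(5)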
4. groupby
A highly common operation in data analysis is grouping data points (i.e. rows) based on the distinct values in a column and performing some aggregation. In our dataset, we can find the average house price in each region.
df.groupby("Regionname").agg(
    F.mean("Price_million").alias("Avg_house_price")
).show()

After we group the rows according to the specified column, which is the region name in our case, we apply the mean function. The alias method assigns a new name to the aggregated column.
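The agg method can take several aggregations at once, so we could compute the average price and the number of houses per region in a single pass. A sketch, where the alias names are example choices:
# Two aggregations per region in one pass.
df.groupby("Regionname").agg(
    F.mean("Price_million").alias("Avg_house_price"),
    F.count("Price_million").alias("House_count")
).show()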
5. orderby
The orderBy function is used for sorting values. We can apply it on the entire data frame to sort the rows based on the values in a column. Another common operation is to sort the aggregated results.
For instance, the average house prices calculated in the previous step can be sorted in descending order as follows:
df.groupby("Regionname").agg(
    F.round(F.mean("Price_million"), 2).alias("Avg_house_price")
).orderBy(
    F.col("Avg_house_price").desc()
).show()

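The orderBy function also accepts an ascending parameter, so the same descending sort can be written without the desc method. A sketch of an equivalent call:
# Equivalent descending sort using the ascending parameter.
df.groupby("Regionname").agg(
    F.round(F.mean("Price_million"), 2).alias("Avg_house_price")
).orderBy("Avg_house_price", ascending=False).show()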
6. lit
We can use the lit function to create a column by assigning a literal or constant value.
Consider a case where we need a column that contains a single constant value. In Pandas, we can simply assign the desired value directly. In PySpark, however, we need to wrap the value with the lit function.
Let's see it in action. The following code filters the data points with type "h" and then selects three columns. The withColumn function creates a new column named "is_house" and fills it with ones.
df2 = df.filter(F.col("Type") == "h").select(
    "Address", "Regionname", "Price"
).withColumn("is_house", F.lit(1))
df2.show(5)

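The lit function is not limited to numbers; it works the same way for any literal value, such as a constant string. A small sketch, where the column name "source" and its value are example choices:
# Add a constant string column; the name and value are illustrative.
df2.withColumn("source", F.lit("melb_housing")).show(5)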
7. when
The when function evaluates the given conditions and returns values accordingly. Considering the previous example, we can implement a more robust solution using the when function.
The following code creates a new column named "is_house" which takes the value of 1 or 0 depending on the type column. Since we imported the functions module as F, we call the function as F.when.
df.select(
    F.when(df["Type"] == "h", 1).otherwise(0).alias("is_house")
).show(5)

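The when function can also be chained to handle more than two cases, with otherwise acting as the fallback. A sketch that maps type codes to labels, assuming "u" is another code that appears in the Type column; the label strings are example choices:
# Chained conditions with a fallback label.
df.select(
    F.when(df["Type"] == "h", "house")
     .when(df["Type"] == "u", "unit")
     .otherwise("other")
     .alias("type_label")
).show(5)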
Conclusion
We have covered 7 PySpark functions that will help you perform efficient data manipulation and analysis. The PySpark syntax reads like a mixture of Python and SQL, so if you are familiar with these tools, it will be relatively easy for you to adapt to PySpark.
It is important to note that Spark is optimized for large-scale data. Thus, you may not see any performance increase when working with small-scale data. In fact, Pandas might outperform PySpark when working with small datasets.
Thank you for reading. Please let me know if you have any feedback.