A combination of Python and SQL but easier than both

Spark is an analytics engine used for large-scale data processing. It lets you spread both data and computations over clusters to achieve a substantial performance increase.
It is easier than ever to collect, transfer, and store data, so we often deal with tremendous amounts of data when working on real-life problems. As a result, distributed engines like Spark are becoming a necessity in such cases.
PySpark is a Python API for Spark. It brings us the simplicity of Python syntax so we can easily process and analyze large amounts of data. The SQL module of PySpark takes it one step further and provides us with SQL-like operations.
What I'm trying to get at here is that PySpark is an extremely efficient tool with an easy-to-use and intuitive syntax. A significant factor that makes PySpark such a simple tool is the flexibility it offers.
Whether you are comfortable working with Pandas or SQL, you will not have a hard time learning PySpark. In this article, we will go over 3 common data manipulation operations that demonstrate its flexibility.
We first need to create a SparkSession which serves as an entry point to Spark SQL.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
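If you want to give the session a name or control where it runs, the builder also accepts configuration before getOrCreate. The app name and local master below are placeholder assumptions for a local setup, not something the dataset requires:
# placeholder configuration: adjust appName and master for your own environment
spark = (
    SparkSession.builder
    .appName("melbourne-housing")
    .master("local[*]")
    .getOrCreate()
)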
Let's create a Spark data frame by reading a CSV file. We will be using the Melbourne housing dataset available on Kaggle.
file_path = "/home/sparkuser/Downloads/melb_housing.csv"
# inferSchema makes Spark read numeric columns such as Price and Rooms as numbers
df = spark.read.csv(file_path, header=True, inferSchema=True)
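It is a good idea to take a quick look at the schema at this point to confirm that the columns and data types were read as expected.
df.printSchema()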
- Selecting a subset of columns
We may not need all the columns in a dataset. Some columns might be redundant or not applicable to a particular task.
PySpark provides multiple ways for selecting a subset of columns. The first one is the select method.
df1 = df.select("Type", "Rooms", "Regionname", "Distance", "Price")
df1.show(5)

We can also do the same operation with a Pandas-like syntax.
df2 = df["Type", "Rooms", "Regionname", "Distance", "Price"]
df2.show(5)

Both methods are fine. Just make sure you use the correct type of brackets: the select method uses parentheses ("()"), whereas the other one uses square brackets ("[]").
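The select method also accepts a plain Python list of column names, which comes in handy when the list of columns is built programmatically. The cols and df3 names below are just illustrative:
cols = ["Type", "Rooms", "Regionname", "Distance", "Price"]
df3 = df.select(cols)
df3.show(5)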
- Filtering rows
A common operation in data analysis and manipulation is filtering the observations (i.e. rows) based on a condition. We will go over 3 different methods of filtering.
The first one is the filter method.
from pyspark.sql import functions as F
df_sub = df2.filter(F.col("Rooms") == 4)
The col function is used to select the Rooms column, so we only keep the houses that have 4 rooms.
Just like the previous example, we can do the same operation with a Pandas-like syntax.
df_sub = df2[df2.Rooms == 4]
Last but not least, the where method can be used. It is similar to the where clause in SQL.
df_sub = df2.where("Rooms = 4")
df_sub.show(5)
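These conditions can also be combined. As a small sketch, the following keeps only the 4-room houses of type h; note that each condition needs its own parentheses when combined with & or |.
df_sub = df2.filter((F.col("Rooms") == 4) & (F.col("Type") == "h"))
df_sub.show(5)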

I don't think we can declare one method simpler than the others. The point here is to emphasize that there are multiple ways of doing an operation with PySpark. It does not force you to adopt a particular way of doing things; instead, PySpark adapts to your taste.
- Creating a new column
Data preprocessing usually involves some column manipulations. One of the most common operations is to create a new column based on the existing ones.
Let's go over two different methods for creating a new column in PySpark. We will create a column that shows the house prices in millions, which we can achieve by dividing the Price column by 1 million.
The first option is to use the withColumn method.
df2_new = df2.withColumn("Price_mio", F.col("Price") / 1000000)
df2_new.show(5)

The first parameter is the name of the new column, and the second one contains the operation for calculating its values. We are not limited to a simple expression in the second part; in fact, PySpark offers great flexibility when creating new columns.
Consider a case where we need to update the prices based on the value in a different column. For instance, we may want to increase the price of the houses with type h by 20 percent while the other ones remain the same.
We can perform this task using the withColumn method along with the when function as follows.
df2_new = df2.withColumn(
    "Price_updated",
    F.when(F.col("Type") == "h", F.col("Price") * 1.2)
    .otherwise(F.col("Price"))
)
df2_new.show(5)

The above code tells PySpark to multiply the price by 1.2 when the type is h and keep the price the same otherwise.
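The when calls can also be chained if there are several conditions to check. The 10 percent increase for type u below is just an assumed example to illustrate the chaining:
df2_new = df2.withColumn(
    "Price_updated",
    F.when(F.col("Type") == "h", F.col("Price") * 1.2)
    .when(F.col("Type") == "u", F.col("Price") * 1.1)
    .otherwise(F.col("Price"))
)
df2_new.show(5)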
The second option for creating a column is to use an SQL query. We first need to register the data frame as a temporary view (registerTempTable still works but is deprecated, so we use createOrReplaceTempView).
df2.createOrReplaceTempView("df2_table")
We can now use it just like an SQL table. In order to create a data frame with a new column, we can write an SQL query as follows.
df2_updated = spark.sql(
    "select *, Price / 1000000 as Price_mio from df2_table"
)
df2_updated.show(5)

We can query the temp table (df2_table) just like an SQL table. For instance, we can calculate the average house price in each region as follows.
spark.sql(
    """
    select Regionname, avg(Price) as avg_price
    from df2_table
    group by Regionname
    """
).show()
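The same aggregation can also be written with the DataFrame API instead of SQL, which is yet another illustration of the flexibility point:
df2.groupBy("Regionname").agg(F.avg("Price").alias("avg_price")).show()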

Conclusion
The examples we have gone through clearly demonstrate the flexibility of PySpark. Whether you are a Python or SQL user, you will find a comfortable way of working with PySpark.
PySpark syntax is kind of a combination of Python and SQL but I think it is simpler than both.
It is important to note that Spark is optimized for large-scale data. Thus, you may not see any performance increase when working with small-scale data. In fact, Pandas might outperform PySpark when working with small datasets.
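If the data you end up with is small, it is also easy to hand it over to Pandas. A minimal sketch, where the limit(1000) cap is just an arbitrary example:
# collects the rows to the driver, so keep this for small data only
pdf = df2.limit(1000).toPandas()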
Thank you for reading. Please let me know if you have any feedback.