Simple as Python, powerful as Spark

Spark is an analytics engine used for large-scale data processing. It lets you spread both data and computations over clusters to achieve a substantial performance increase.
PySpark is a Python library for Spark. It combines the simplicity of Python with the efficiency of Spark, a combination that is highly appreciated by both data scientists and data engineers.
In this article, we will go over 6 different column operations that are frequently used in data analysis and manipulation. We will be using the SQL module of PySpark, which provides several functions for working on structured data.
Let's start by importing the libraries.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
We need to create a SparkSession, which serves as an entry point to Spark SQL.
spark = SparkSession.builder.getOrCreate()
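The builder also lets you configure the session before it is created. For example, you can set an application name that shows up in the Spark UI; the name below is arbitrary.
# the application name is arbitrary; getOrCreate reuses an existing session if one is already running
spark = SparkSession.builder.appName("pyspark-column-ops").getOrCreate()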
We will be using this spark session throughout the article. The next step is to create a sample data frame to do the examples.
data = [
("John", "Biology", 3.45, "Junior"),
("Jane", "Chemistry", 3.60, "Sophomore"),
("Ashley", "Biology", 3.25, "Sophomore"),
("Max", "Physics", 2.95, "Senior"),
("Emily", "Biology", 3.30, "Junior")
]
columns = ["Name", "Major", "GPA", "Class"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()

We write the sample data according to a schema. Then both the data and the schema are passed to the createDataFrame function.
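If you prefer to spell out the column types instead of letting Spark infer them, createDataFrame also accepts an explicit schema. Here is a minimal sketch using the types module (the nullable flags are simply set to True for illustration).
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# explicit schema: column names and types are declared instead of inferred
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Major", StringType(), True),
    StructField("GPA", DoubleType(), True),
    StructField("Class", StringType(), True)
])
df_typed = spark.createDataFrame(data, schema=schema)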
We can now start on the column operations.
- Create a new column
The withColumn function is used for creating a new column. We pass the name of the new column along with the data to fill it.
df = df.withColumn("School", F.lit("A"))
df.show()

The lit function allows for filling a column with a constant value. Unlike pandas, we cannot just write "A".
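As a quick illustration (the column name here is made up), passing a plain Python value raises an error, while wrapping it in lit works.
df.withColumn("Country", F.lit("USA")).show()   # works: lit turns the constant into a Column
# df.withColumn("Country", "USA")               # fails: withColumn expects a Column object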
It is important to note that the withColumn function does not work in place. Thus, we need to assign the updated data frame back to a variable (the same one or a new one) to save the changes.
- Update a column
The withColumn function can also be used to update or modify the values in a column. For instance, we can convert the GPA values to a 100-point scale.
df.withColumn("GPA", F.col("GPA") * 100 / 4).show()

Please keep in mind that we need to use the col function to apply a column-wise operation. If we only write the column name as a string, we will get an error.
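For instance, here is a sketch of the same conversion rounded to one decimal place (the new column name is arbitrary).
# col() refers to the existing GPA column; round() keeps one decimal place
df.withColumn("GPA_100", F.round(F.col("GPA") * 100 / 4, 1)).show()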
- Rename a column
The withColumnRenamed function changes the name of a column. We pass the current name and the new name as arguments to this function.
df = df.withColumnRenamed("School", "University")
df.show()
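Since withColumnRenamed renames a single column per call, multiple renames can be chained; the new names below are only for illustration.
# chain the function to rename more than one column
df.withColumnRenamed("Name", "Student").withColumnRenamed("Class", "Year").show()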

- Select a column or columns
In some cases, we may need to get only a particular column or a few columns from a data frame. The select function allows us to accomplish this task by passing the names of the desired columns.
df.select("Name", "Major").show()
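select also accepts Column expressions, so we can derive values while selecting; here is a small sketch with a made-up alias.
# compute the 100-point GPA on the fly while selecting
df.select("Name", (F.col("GPA") * 100 / 4).alias("GPA_100")).show()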

- Unique values in a column
If we want to check the number of distinct values in a column, the countDistinct function can be used.
df.select(F.countDistinct("Major").alias("Unique_count")).show()

It is quite similar to SQL syntax: the function is applied while selecting the column. We can also display the unique values in a column.
df.select("Major").distinct().show()
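Another common pattern, if we just want the number itself, is to chain distinct with count, which returns a plain Python integer.
# the number of distinct majors as an int
df.select("Major").distinct().count()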

- Drop a column
The syntax for dropping a column is highly intuitive. As you might guess, the drop function is used.
df = df.drop("University")
df.show()
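The drop function also accepts several column names at once; a quick sketch (the result is not assigned here, so df itself stays unchanged).
# drop more than one column in a single call
df.drop("GPA", "Class").show()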

Conclusion
We have covered 6 commonly used column operations with PySpark. The SQL module of PySpark offers many more functions and methods to perform efficient data analysis.
It is important to note that Spark is optimized for large-scale data. Thus, you may not see any performance increase when working with small-scale data. In fact, Pandas might outperform PySpark when working with small datasets.
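If the data does fit comfortably on a single machine, a PySpark data frame can be converted with toPandas, keeping in mind that this collects everything onto the driver.
# collects the distributed data onto the driver as a pandas DataFrame
pdf = df.toPandas()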
Thank you for reading. Please let me know if you have any feedback.