
6 Must-Know Column Operations with PySpark

Simple as Python, powerful as Spark

Photo by Marc-Olivier Jodoin on Unsplash

Spark is an analytics engine used for large-scale data processing. It lets you spread both data and computations over clusters to achieve a substantial performance increase.

PySpark is a Python library for Spark. It combines the simplicity of Python with the efficiency of Spark, a pairing that is highly appreciated by both data scientists and engineers.

In this article, we will go over 6 different column operations that are frequently used in data analysis and manipulation. We will be using the SQL module of PySpark, which provides several functions for working with structured data.

Let’s start with importing the libraries.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

We need to create a SparkSession which serves as an entry point to Spark SQL.

spark = SparkSession.builder.getOrCreate()
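If you want more control over the session, the builder also accepts an application name and a master URL. The values below are illustrative, and getOrCreate simply reuses an already running session if there is one.

# Optional explicit configuration; the name and master URL are example values
spark = (
    SparkSession.builder
    .appName("column-operations")   # name shown in the Spark UI
    .master("local[*]")             # run locally, using all available cores
    .getOrCreate()                  # reuses an existing session if one is running
)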

We will be using this Spark session throughout the article. The next step is to create a sample data frame to work through the examples.

data = [
    ("John", "Biology", 3.45, "Junior"),
    ("Jane", "Chemistry", 3.60, "Sophomore"),
    ("Ashley", "Biology", 3.25, "Sophomore"),
    ("Max", "Physics", 2.95, "Senior"),
    ("Emily", "Biology", 3.30, "Junior")
]
columns = ["Name", "Major", "GPA", "Class"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
(image by author)

We write the sample data as a list of rows and the column names as the schema. Both are then passed to the createDataFrame function.
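Since we did not specify any column types, Spark infers them from the Python values. We can check the inferred schema with the printSchema method; the output below is what Spark typically infers for this data.

df.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Major: string (nullable = true)
#  |-- GPA: double (nullable = true)
#  |-- Class: string (nullable = true)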

We can now start on the column operations.


  1. Create a new column

The withColumn function is used for creating a new column. We pass the name of the new column along with the data to fill it.

df = df.withColumn("School", F.lit("A"))
df.show()
(image by author)

The lit function allows us to fill a column with a constant value. Unlike in Pandas, we cannot simply pass the bare value "A".
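As a quick illustration (the exact error message depends on the PySpark version), withColumn expects a Column object as its second argument. The "Credits" column and its value are made up for this example; lit works with numeric constants just as well as with strings.

# Passing a bare value raises an error because withColumn expects a Column:
# df.withColumn("School", "A")

# lit also works with numeric constants; "Credits" is an illustrative column name
df.withColumn("Credits", F.lit(15)).show()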

It is important to note that the withColumn function does not work in place. Thus, we need to assign the updated data frame to a variable (the same one or a new one) to keep the changes.
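Here is a minimal sketch of that behavior; the "Country" column and its value are made up for illustration.

df_new = df.withColumn("Country", F.lit("USA"))
df_new.show()   # contains the new Country column
df.show()       # the original data frame is unchanged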


  2. Update a column

The withColumn function can also be used to update or modify the values in a column. For instance, we can convert the GPA values to a 100-point scale.

df.withColumn("GPA", F.col("GPA") * 100 / 4).show()
(image by author)

Please keep in mind that we need to use the col function to apply a column-wise operation. If we only write the column name as a string, we will get an error.
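If we also want to keep the converted values tidy, the round function from the same module can be combined with col; the two decimal places below are an arbitrary choice.

# col("GPA") builds a column expression; round trims the result to 2 decimals
df.withColumn("GPA", F.round(F.col("GPA") * 100 / 4, 2)).show()

# Using the plain string "GPA" instead of col("GPA") would not work:
# df.withColumn("GPA", "GPA" * 100 / 4)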


  3. Rename a column

The withColumnRenamed function changes the name of a column. We pass the current name and the new name as arguments to this function.

df = df.withColumnRenamed("School", "University")
df.show()
(image by author)

  4. Select a column or columns

In some cases, we may need to get only a particular column or a few columns from a data frame. The select function allows us to accomplish this task; we simply pass the names of the desired columns.

df.select("Name", "Major").show()
(image by author)
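The select function also accepts column expressions, so derived columns can be selected and renamed on the fly; "GPA_100" is just an illustrative alias.

df.select(
    "Name",
    (F.col("GPA") * 100 / 4).alias("GPA_100")   # derived column with an alias
).show()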

  5. Unique values in a column

If we want to check the number of distinct values in a column, the countDistinct function can be used.

df.select(F.countDistinct("Major").alias("Unique_count")).show()
(image by author)

It is quite similar to the SQL syntax. The function is applied while selecting the column. We can also display the unique values in a column.

df.select("Major").distinct().show()
(image by author)
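If we only need the number itself as a plain Python value, counting the distinct rows of the selected column gives the same result as countDistinct.

# Returns 3 for our sample data (Biology, Chemistry, Physics)
df.select("Major").distinct().count()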

  6. Drop a column

The syntax of dropping a column is highly intuitive. As you might guess, the drop function is used.

df = df.drop("University")
df.show()
(image by author)
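The drop function also accepts several column names at once; the combination below is just an example.

# Drop more than one column in a single call
df.drop("GPA", "Class").show()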

Conclusion

We have covered 6 commonly used column operations with PySpark. The SQL module of PySpark offers many more functions and methods to perform efficient data analysis.

It is important to note that Spark is optimized for large-scale data. Thus, you may not see any performance increase when working with small-scale data; in fact, Pandas might even outperform PySpark on small datasets.

Thank you for reading. Please let me know if you have any feedback.
