
Pandas to PySpark in 6 Examples

What happens when you go large-scale

Photo by Jake Givens on Unsplash

Pandas is one of the predominant tools for manipulating and analyzing structured data. It provides numerous functions and methods to play around with tabular data.

However, Pandas may not be your best friend as the data size gets larger. When working with large-scale data, it becomes necessary to distribute both the data and the computations, which cannot be achieved with Pandas.

A highly popular option for such tasks is Spark, which is an analytics engine used for large-scale data processing. It lets you spread both data and computations over clusters to achieve a substantial performance increase.

It has become extremely easy to collect and store data, so we are likely to have huge amounts of it when working on a real-life problem. Thus, distributed engines like Spark are becoming the predominant tools in the Data Science ecosystem.

PySpark is a Python API for Spark. It combines the simplicity of Python with the high performance of Spark. In this article, we will go over 6 examples that demonstrate the PySpark equivalents of typical Pandas operations for data analysis and manipulation.


Example 1

We need a dataset for the examples. Thus, the first example is to create a data frame by reading a csv file. I will be using the Melbourne housing dataset available on Kaggle.

# Pandas
import pandas as pd
df = pd.read_csv("melb_housing.csv")

For PySpark, we first need to create a SparkSession, which serves as the entry point to Spark SQL.

from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sc.sparkContext.setLogLevel("WARN")
print(sc)
<pyspark.sql.session.SparkSession object at 0x7fecd819e630>

We can now read the csv file.

# PySpark
df = sc.read.option("header", "true").csv(
    "/home/sparkuser/Desktop/melb_housing.csv"
)
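Note that with only the header option set, Spark reads every column as a string. If you would like numeric columns such as Price to be typed automatically, you can also set the inferSchema option, at the cost of an extra pass over the file. A minimal sketch:

# PySpark (optional: let Spark infer column types, e.g. Price as a number)
df = (
    sc.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("/home/sparkuser/Desktop/melb_housing.csv")
)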

Example 2

A good way to start data analysis is to get an overview of the data. For instance, we may want to check the column names, the number of rows, and take a look at the first few rows.

# Pandas
len(df)
63023
df.columns
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG', 'Date', 'Postcode', 'Regionname', 'Propertycount', 'Distance', 'CouncilArea'])
df.head()
(image by author)

These operations are quite similar in PySpark.

# PySpark
df.count()
63023
df.columns
['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG', 'Date', 'Postcode', 'Regionname', 'Propertycount', 'Distance', 'CouncilArea']
df.show(5)
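If you also want a quick look at the column types, both libraries make that easy; a small sketch:

# Pandas
df.dtypes
# PySpark
df.printSchema()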

Example 3

In some cases, we need to filter a data frame based on column values. For instance, we may be interested in the houses that are in the Northern Metropolitan region and cost more than 1 million.

For Pandas, we specify the conditions along with the column names as below:

# Pandas
df_sub = df[
      (df.Price > 1000000) &
      (df.Regionname == 'Northern Metropolitan')
]
len(df_sub)
3022

We follow a similar approach with PySpark.

# PySpark
from pyspark.sql import functions as F
df_sub = df.filter(
    (F.col('Price') > 1000000) &
    (F.col('Regionname') == 'Northern Metropolitan')
)
df_sub.count()
3022

The functions in the SQL module of PySpark need to be separately imported. We will be using several functions in this module in the following examples.
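As a side note, PySpark's filter also accepts a SQL expression passed as a string, which can be handy if you are more comfortable with SQL syntax. A minimal sketch of the same filter:

# PySpark (alternative: SQL expression string)
df_sub = df.filter("Price > 1000000 AND Regionname = 'Northern Metropolitan'")
df_sub.count()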


Example 4

A typical task can be selecting a subset of the dataset in terms of columns. Let’s take the price, region name, and address columns and sort them by the price in descending order. We will only display the top 5 observations (i.e. rows).

# Pandas
df[['Price', 'Regionname', 'Address']].sort_values(by='Price', ascending=False).head()
# PySpark
df.select("Price", "Regionname", "Address").orderBy('Price', ascending=False).show(5)
(image by author)

Once again, the syntax is quite similar. PySpark's functions read more like SQL statements. Furthermore, as you may have noticed, PySpark displays the data frame like a SQL table.
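As an alternative to the ascending parameter, the same descending sort can be written with the desc method on a column expression; a small sketch:

# PySpark (equivalent descending sort)
df.select("Price", "Regionname", "Address").orderBy(F.col("Price").desc()).show(5)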


Example 5

We do not always work with numerical data. Textual data is a fundamental part of data science. Thus, data analysis libraries provide many functions to manipulate strings.

For instance, we may want to extract the type of address from the values in the address column.

# Pandas
df.Address[:3]
0    49 Lithgow St 
1    59A Turner St 
2    119B Yarra St

The last word in the address gives us the type, such as St or Rd. We can get this information by splitting the address at spaces and taking the last item.

Pandas does this operation using the split function under the str accessor.

# Pandas
df.Address.str.split(' ', expand=True)[2][:5]
0    St 
1    St 
2    St 
3    St 
4    Rd

The "[:5]" expression at the end is just for displaying the first 5 rows.

PySpark has a split function as well. Here is how this operation is done using this function:

# PySpark
df.select(
   F.split("Address", ' ').getItem(2).alias('Address_type')
).show(5)
(image by author)
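If some addresses contain more than three words, the type can instead be taken as the last token here as well. A minimal sketch using element_at, which accepts a negative index to count from the end of the array (available in Spark 2.4 and later):

# PySpark (alternative: take the last token of each address)
df.select(
    F.element_at(F.split("Address", ' '), -1).alias('Address_type')
).show(5)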

Example 6

One of the most frequently used functions in data analysis is the groupby function. It allows for grouping rows based on categories or distinct values in a column. Then, we can perform aggregations on numerical columns for each group.

For instance, we can calculate the average house price in each region. Both Pandas and PySpark have groupby functions for this kind of task.

# Pandas
df.groupby('Regionname').agg(Avg_price=('Price', 'mean')).round(2)
# PySpark
df.groupby('Regionname').agg(F.round(F.mean('Price'), 2).alias('Avg_price')).show()
(image by author)
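If you also want the regions ordered by their average price, you can chain a sort onto the grouped result; a small sketch:

# Pandas
df.groupby('Regionname').agg(Avg_price=('Price', 'mean')).round(2).sort_values(by='Avg_price', ascending=False)
# PySpark
df.groupby('Regionname').agg(F.round(F.mean('Price'), 2).alias('Avg_price')).orderBy('Avg_price', ascending=False).show()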

Conclusion

We have gone through 6 examples comparing the syntax of Pandas and PySpark. As we have seen, they are quite similar.

It is important to note that Spark is optimized for large-scale data. Thus, you may not see any performance increase when working with small-scale data. In fact, Pandas might outperform PySpark when working with small datasets.

Thank you for reading. Please let me know if you have any feedback.

