
Pandas-PySpark Conversion Guide

Everything you need to do Exploratory Data Analysis (EDA) with PySpark on Databricks

Photo by Sami Mititelu on Unsplash

When your data no longer fits on a single machine, a local Jupyter notebook quickly runs out of memory. This is one reason why many companies use ML platforms such as Databricks, SageMaker, and Alteryx. A good ML platform supports the entire machine learning lifecycle, from data ingestion to modeling and monitoring, which increases a team’s productivity and efficiency. In this simple tutorial, I’ll share my notes on converting Pandas scripts to PySpark, so that you can move between the two APIs seamlessly as well!

Introduction

What is Spark?

Spark is an open-source distributed computing framework. It’s a scalable, massively parallel, in-memory execution environment for running analytics applications. Spark is a fast and powerful engine for processing Hadoop data. It runs in Hadoop clusters through Hadoop YARN or Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to handle both general data processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning. You can read more about MapReduce here: https://medium.com/@francescomandru/mapreduce-explained-45a858c5ac1d (P.S. one of my favorite explanations!)

What is Databricks?

Databricks provides a unified, open platform for all your data. It empowers data scientists, data engineers, and data analysts with a simple, collaborative environment, and it’s one of the market leaders when it comes to data services. It was founded in 2013 by Ali Ghodsi and other original creators of some of the world’s most popular open-source projects: Apache Spark, Delta Lake, MLflow, and Koalas. It builds on these technologies to deliver a true lakehouse architecture, combining the best of data lakes and data warehouses in a fast, scalable, and reliable platform. Built for the cloud, your data is stored in low-cost cloud object stores such as AWS S3 and Azure Data Lake Storage, with performant access enabled through caching, optimized data layout, and other techniques. You can launch clusters with hundreds of machines, each with the mixture of CPUs and GPUs needed for your analysis.

Main concepts

Exploring the data – The table below summarizes the main functions used to get an overview of the data.

           Pandas          |              PySpark
 --------------------------|-------------------------------------
  pd.read_csv(path)        | spark.read.csv(path)
  df.shape                 | print(df.count(), len(df.columns))
  df.head(10)              | df.limit(10).toPandas()
  df[col].isnull().sum()   | df.where(df.col.isNull()).count()
  df                       | display(df)
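To see these in action, here is a minimal sketch in PySpark. It assumes a SparkSession named spark is already available (as it is in a Databricks notebook); the file path and the column name "col1" are placeholders.

# Read a CSV file (path and options are placeholders)
df = spark.read.csv("/path/to/data.csv", header=True, inferSchema=True)

# Shape: number of rows and columns
print(df.count(), len(df.columns))

# Peek at the first 10 rows as a Pandas DataFrame
df.limit(10).toPandas()

# Count nulls in a single column
df.where(df["col1"].isNull()).count()

# On Databricks, display() renders an interactive table
display(df)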

Data preprocessing

  • Drop Duplicates – drop duplicate rows on selected multiple columns
  • Filtering – filter rows according to some conditions
  • Changing columns – rename a column, convert the data type of a column, create a new column
  • Conditional column – a new column can take different values depending on a particular set of conditions (see the sketch after the table below)
  • Sort – order the data
  • Datetime conversion – convert fields containing datetime values from string to datetime
  • Groupby – aggregate a data frame with respect to given columns
  • Join() – convert a column to a comma-separated list
  • Fill nan value – replace missing/null values in a column with a given value
           Pandas          |              PySpark
 --------------------------|---------------------------------------------
  df.drop_duplicates()     | df.dropDuplicates()
  df.drop(xx, axis=1)      | df.drop(xx)
  df[['col1','col2']]      | df.select('col1','col2')
  df[df.xx.isin(xxx)]      | df.filter(df.xx.isin(xxx)).show()
  df.rename(columns={})    | df.withColumnRenamed('col1','col2')
  df.x.astype(str)         | df.withColumn(x, col(x).cast(StringType()))
  df.sort_values()         | df.sort() or df.orderBy()
  np.datetime64('today')   | current_date()
  pd.to_datetime()         | to_date(column, time_format)
  df.groupby()             | df.groupBy()
  ','.join(df[col])        | ','.join(df.select(col).toPandas()[col])
  df.fillna(0)             | df.na.fill(0)
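Here is a minimal sketch of these preprocessing steps in PySpark. The DataFrame df and its columns ("name", "age", "city", "signup_date") are hypothetical placeholders, so adapt them to your own data.

from pyspark.sql.functions import col, when, to_date, current_date
from pyspark.sql.types import StringType

# Drop duplicate rows on selected columns
df = df.dropDuplicates(["name", "city"])

# Filter rows according to a condition
adults = df.filter(col("age") >= 18)

# Rename a column and create a new string-typed copy of another
df = (df.withColumnRenamed("city", "home_city")
        .withColumn("age_str", col("age").cast(StringType())))

# Conditional column: the value depends on a set of conditions
df = df.withColumn("age_group",
                   when(col("age") >= 65, "senior")
                   .when(col("age") >= 18, "adult")
                   .otherwise("minor"))

# Sort the data
df = df.orderBy(col("age").desc())

# Datetime conversion: string to date, plus today's date
df = df.withColumn("signup_date", to_date(col("signup_date"), "yyyy-MM-dd"))
df = df.withColumn("loaded_on", current_date())

# Group by and aggregate
df.groupBy("home_city").count().show()

# Collapse a column into a comma-separated list (collects data to the driver)
cities = ",".join(df.select("home_city").toPandas()["home_city"])

# Fill missing values in a column
df = df.fillna(0, subset=["age"])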

Data frame transformation

  • Merging data frames – we can merge two data frames on a given field (see the sketch below)
  • Concatenate data frames – concatenate two or more data frames into one data frame
            Pandas           |             PySpark
 ----------------------------|----------------------------------------
  df1.merge(df2, on=, how=)  | df1.join(df2, df1.id == df2.id, how='inner')
  pd.concat([df1, df2])      | df1.union(df2)
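And a minimal sketch of merging and concatenating in PySpark, assuming two hypothetical DataFrames df1 and df2 that share an "id" column and have the same schema for the union:

# Merge (join) on the shared id column; how can be 'inner', 'left', 'right', or 'outer'
joined = df1.join(df2, on="id", how="inner")

# Concatenate the rows of two DataFrames with the same schema
combined = df1.union(df2)

# The equivalent Pandas calls, for comparison:
#   joined   = df1.merge(df2, on="id", how="inner")
#   combined = pd.concat([df1, df2])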

Conclusion

I hope this simple comparison cheat sheet can help you get started with PySpark and Databricks faster. The comparison above is only a starting point! For more details and examples, please check out this PySpark beginner tutorial page. When it comes to learning any new language or tool, the best way to learn is by doing. I highly recommend going through the PySpark code on your own, and maybe even starting a project on Databricks! 😉


If you find this helpful, please follow me and check out my other blogs. Stay tuned for more! ❤

10 Tips To Land Your First Data Science Job As a New Grad

How to Prepare for Business Case Interview as an Analyst?

10 Questions You Must Know to Ace any SQL Interviews

