How to Use Pyspark For Your Machine Learning Project

Data cleaning, EDA, feature engineering and Machine Learning with Pyspark

François St-Amant
Towards Data Science
5 min read · Nov 2, 2020


Source: https://unsplash.com/photos/Q1p7bh3SHj8

Pyspark is the Python API for Apache Spark, a distributed computing framework built for large-scale data processing. It’s a great framework to use when you are working with huge datasets, and it’s becoming a must-have skill for any data scientist.

In this tutorial, I will show you how to use Pyspark to do exactly what you are used to seeing in a Kaggle notebook (cleaning, EDA, feature engineering and building models).

I used a dataset containing customer information from a telecom company. The objective is to predict which clients will leave (churn) in the upcoming three months. The CSV file contains more than 800,000 rows and 8 features, plus a binary Churn variable.

The goal here is not to find the best solution. It’s rather to show you how to work with Pyspark. Along the way I will try to present many functions that can be used for all stages of your machine learning project!

Let’s begin by creating a SparkSession, which is the entry point to any Spark functionality.

import pyspark
from pyspark.sql import SparkSession

# Local session using 4 worker threads
spark = SparkSession.builder.master("local[4]")\
    .appName("test").getOrCreate()

Getting the data

Here is how to read a CSV using Pyspark.

df = spark.read.csv('train.csv', header=True, sep=",", inferSchema=True)

Here is what the dataframe looks like.
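
If you want to take a quick look at the result yourself, show() prints the first rows and printSchema() prints the column types that Spark inferred from the CSV.

# Preview the first 5 rows and the inferred schema
df.show(5)
df.printSchema()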

Cleaning the data

The pyspark.sql module allows you to do in Pyspark pretty much anything that can be done with SQL.
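
As a quick illustration, you can even register the dataframe as a temporary view and query it with plain SQL. The view name customers below is just an example; Churn is the target column described earlier.

# Register the dataframe as a SQL temporary view (name is arbitrary)
df.createOrReplaceTempView("customers")

# Query it with plain SQL, e.g. count rows per Churn value
spark.sql("SELECT Churn, COUNT(*) AS n FROM customers GROUP BY Churn").show()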

Let’s begin by cleaning the data a bit. First, as you can see above, we have some null values. I will drop all rows that contain a null value.

df = df.na.drop()
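
If you want to check how many nulls each column contains before dropping anything, here is one way to count them, a small sketch using pyspark.sql.functions.

from pyspark.sql import functions as F

# Count the null values in every column of the dataframe
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()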
