How to Use Pyspark For Your Machine Learning Project
Data cleaning, EDA, feature engineering and Machine Learning with Pyspark
Pyspark is the Python API for Apache Spark, a distributed computing framework built for large-scale data processing. It’s a powerful framework to use when you are working with datasets too big for a single machine, and it’s becoming a must-have skill for any data scientist.
In this tutorial, I will show how to use Pyspark to do exactly what you are used to seeing in a Kaggle notebook: cleaning, EDA, feature engineering and building models.
I used a dataset containing customer information for a telecom company. The objective is to predict which clients will leave (churn) in the upcoming three months. The CSV file contains more than 800,000 rows and 8 features, plus the binary Churn target variable.
The goal here is not to find the best solution. It’s rather to show you how to work with Pyspark. Along the way I will try to present many functions that can be used for all stages of your machine learning project!
Let’s begin by creating a SparkSession, which is the entry point to any Spark functionality.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]")\
    .appName("test").getOrCreate()
Getting the data
Here is how to read a CSV using Pyspark.
df = spark.read.csv('train.csv', header=True, sep=",", inferSchema=True)
Here is what the dataframe looks like.
Cleaning the data
The pyspark.sql module lets you do in Pyspark pretty much anything that can be done with SQL.
For instance, let’s begin by cleaning the data a bit. First, as you can see in the dataframe above, we have some null values. I will drop all rows that contain a null value.
df = df.na.drop()