
Finding Donors: Classification Project With PySpark

Learn how to use Apache PySpark to empower your classification predictions

Picture from Unsplash

Introduction

The aim of this article is to give a gentle introduction to classification problems in machine learning and to walk through a comprehensive guide to successfully developing a class prediction using PySpark.

So, without further ado, let’s jump into it!

Classification

If you want a deeper explanation of classification problems, their main algorithms, and how to deal with them using machine learning techniques, I strongly suggest you check out the following article, where I explain these concepts thoroughly.

Supervised Learning: Basics of Classification and Main Algorithms

What is Classification?

Classification is a subcategory of supervised learning where the goal is to predict the categorical class labels (discrete, unordered values, group membership) of new instances based on past observations.

There are two main types of classification problems:

  • Binary classification: The typical example is e-mail spam detection, in which each e-mail either is spam (class 1) or is not (class 0).
  • Multi-class classification: Like handwritten digit recognition (where classes go from 0 to 9).

The following example is very representative to explain binary classification:

There are 2 classes, circles and crosses, and 2 features, X1 and X2. The model is able to find the relationship between the features of each data point and its class, and to set a boundary line between them, so when provided with new data, it can estimate the class where it belongs, given its features.

Figure by Author

In this case, the new data point falls into the circle subspace and, therefore, the model will predict its class to be a circle.
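As a minimal sketch of this idea, we can assume the boundary is a straight line. The weights `W1`, `W2`, `B` and the function `predict_class` below are invented for illustration and are not taken from the figure:

```python
# Hypothetical linear boundary: W1*x1 + W2*x2 + B = 0.
# The weights are made up for illustration only.
W1, W2, B = 1.0, 1.0, -5.0

def predict_class(x1: float, x2: float) -> str:
    """Return 'circle' on one side of the boundary and 'cross' on the other."""
    score = W1 * x1 + W2 * x2 + B
    return "circle" if score < 0 else "cross"

# A new data point is classified only from its two features
new_point = (1.5, 2.0)
print(predict_class(*new_point))
```

A trained model learns such weights from the labeled data instead of having them fixed by hand.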

Classification Main Algorithms

In order to predict the class of certain samples, there are several classification algorithms that can be used. In fact, when developing our machine learning models, we will train and evaluate a certain number of them, and we will keep those with the best predictive performance.

A non-exhaustive list of some of the most used algorithms:

  • Logistic Regression
  • Decision Trees
  • Random Forests
  • Support Vector Machines
  • K-Nearest Neighbors (KNN)

Classification Evaluation Metrics

When making predictions on events, we can get four types of results:

  • True Positives: TP
  • True Negatives: TN
  • False Positives: FP
  • False Negatives: FN

All of these are represented in the following confusion matrix:

Accuracy measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

Precision tells us what proportion of the events we classified as a certain class actually were that class. It is the ratio of true positives to all predicted positives (TP + FP).

Recall (sensitivity) tells us what proportion of the events that actually were of a certain class we classified as that class. It is the ratio of true positives to all actual positives (TP + FN).

Specificity is the proportion of samples that were correctly identified as negative out of all actual negatives (TN + FP).

For classification problems with skewed class distributions, accuracy by itself is not an appropriate metric. Instead, precision and recall are much more representative.

These two metrics can be combined to get the F1 score, which is the harmonic mean of the precision and recall scores (we take the harmonic mean because we are dealing with ratios). This score can range from 0 to 1, with 1 being the best possible F1 score.
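The metrics above can be sketched in a few lines of plain Python. The function name `classification_metrics` and the counts are invented for illustration; they describe a skewed dataset of 1,000 samples with only 50 positives:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the metrics described above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Invented, skewed example: 50 actual positives, 950 actual negatives
metrics = classification_metrics(tp=30, tn=930, fp=20, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.96 while precision, recall and F1 are all 0.6, which shows why accuracy alone can be misleading on skewed data.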

ROC

Finally, the metric that we will use in our project is based on the Receiver Operating Characteristic, or ROC, curve.

The ROC curve tells us how well the model can distinguish between two classes. The area under this curve takes values from 0 to 1 (∈ [0, 1]). The better the model is, the closer to 1 this value will be.

Figure by Author

As can be seen in the image above, our classification model will draw a separation boundary between the classes and:

  • Every sample that falls to the left of the threshold will be classified as the negative class.
  • Every sample that falls to the right of the threshold will be classified as the positive class.

And the distribution of predictions will be the following:

Figure by Author

Trade-off Between Sensitivity & Specificity

When we decrease the threshold, we end up predicting more positive values, which increases sensitivity and, therefore, decreases specificity.

When we increase the threshold, we end up predicting more negative values, which increases specificity and, therefore, decreases sensitivity.

As Sensitivity ⬇️ Specificity ⬆️

As Specificity ⬇️ Sensitivity ⬆️
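This trade-off can be checked with a small sketch. The scores and labels below are invented, and we assume a sample is classified as positive when its score is greater than or equal to the threshold:

```python
# Invented model scores (predicted probability of the positive class) and true labels
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

def sensitivity_specificity(threshold: float) -> tuple:
    """Classify as positive when score >= threshold, then compute both rates."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 lowers sensitivity from 1.00 to 0.50 and raises specificity from 0.50 to 1.00.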

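In order to represent the classification performance, we consider (1 - specificity), that is, the false positive rate, instead of specificity. So, when sensitivity increases, (1 - specificity) will also increase. Plotting sensitivity against (1 - specificity) for every threshold is how we obtain the ROC curve.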

Figure by Author

Examples of Performance

As stated before, the closer the evaluator gets to 1, the better the predictive performance of the model, and the smaller the overlapping area between the classes.

Figure by Author
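To illustrate the link between class overlap and the area under the ROC curve, here is a minimal sketch. It relies on the fact that this area equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The function name `auc` and all scores are invented for illustration:

```python
from itertools import product

def auc(pos_scores, neg_scores) -> float:
    """Area under the ROC curve, computed as the fraction of (positive, negative)
    pairs in which the positive sample gets the higher score (ties count half)."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Invented scores: no overlap between the classes vs. heavy overlap
separated = auc(pos_scores=[0.7, 0.8, 0.9], neg_scores=[0.1, 0.2, 0.3])
overlapping = auc(pos_scores=[0.3, 0.5, 0.7], neg_scores=[0.2, 0.4, 0.6])
print(separated, overlapping)
```

The perfectly separated scores give an area of 1.0, while the overlapping ones give roughly 0.67.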

Finding Donors Project

A complete walkthrough of the project can be found in the following article:

Machine Learning Classification Project: Finding Donors

In the present article we will focus on the PySpark implementation of the project.

As a summary, throughout the project, we will use a number of different supervised algorithms to precisely predict individuals’ income using data collected from the 1994 U.S. Census.

We will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data.

Our goal with this implementation is to build a model that accurately predicts whether an individual makes more than $50,000, since our previous research found that the individuals who are most likely to donate money to a charity are the ones who make more than $50,000.

Therefore, we are facing a binary classification problem, where we want to determine whether an individual makes more than $50K a year (class 1) or does not (class 0).

The dataset for this project originates from the UCI Machine Learning Repository.

Data

The census dataset consists of approximately 45,222 data points, with each data point having 13 features.

Features

  • age: Age
  • workclass: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
  • education_level: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
  • education-num: Number of educational years completed
  • marital-status: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
  • occupation: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
  • relationship: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
  • race: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
  • sex: Sex (Female, Male)
  • capital-gain: Monetary Capital Gains
  • capital-loss: Monetary Capital Losses
  • hours-per-week: Average Hours Per Week Worked
  • native-country: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)

Target Variable

  • income: Income Class (<=50K, >50K)

Import Data & Exploratory Data Analysis (EDA)

We will start by importing the dataset and displaying its first rows as a first step of the exploratory data analysis.

# File location and type
file_location = "/FileStore/tables/census.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
# The parentheses allow the chained calls to span several lines
df = (spark.read.format(file_type)
  .option("inferSchema", infer_schema)
  .option("header", first_row_is_header)
  .option("sep", delimiter)
  .load(file_location))
display(df)

We will now display a summary of the dataset’s information by using the .describe() method.

# Display Dataset's Summary
display(df.describe())

Let’s also find out the dataset’s schema.

# Display Dataset's Schema
df.printSchema()

Prepare the Data

As we want to predict whether or not an individual earns more than $50K per year, we will replace the label ‘income’ with ‘>50K’.

To do so, we will create a new column whose values will be 1 or 0 depending on whether or not the individual makes more than $50K per year. We will then drop the income column.

# Import pyspark functions
from pyspark.sql import functions as F
# Add the new column to the dataset
df = df.withColumn('>50K', F.when(df.income == '<=50K', 0).otherwise(1))
# Drop the Income label
df = df.drop('income')
# Show dataset's columns
df.columns

Vectorizing Numerical Features and One-Hot Encoding Categorical Features

In order to be processed for the training of the models, features in Apache Spark must be transformed into vectors. This process will be done using certain classes that we will explore now.

First, we will import relevant libraries and methods.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier, RandomForestClassifier, LogisticRegression)
from pyspark.ml.evaluation import BinaryClassificationEvaluator

Now, we will select the categorical features.

# Selecting categorical features
categorical_columns = [
 'workclass',
 'education_level',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'hours-per-week',  # numeric in the raw data, but treated as categorical here
 'native-country',
 ]

In order to one-hot encode these categorical features, we will first pass them through an indexer and then through an encoder.

# Index the string values of each categorical column
indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in categorical_columns]
# One-hot encode the indexed values of each column
encoders = [
    OneHotEncoder(dropLast=False, inputCol=indexer.getOutputCol(),
                  outputCol="{0}_encoded".format(indexer.getOutputCol()))
    for indexer in indexers]
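To make these two steps more concrete, here is a plain-Python sketch of what indexing and one-hot encoding do conceptually. It is not Spark code: the helpers `string_index` and `one_hot` are invented for illustration, and we assume the indexer assigns index 0 to the most frequent category, which is our understanding of the StringIndexer default:

```python
from collections import Counter

def string_index(values):
    """Map each category to an index, most frequent category first (index 0)."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index: int, size: int):
    """Return a vector of zeros with a single 1 at the given index."""
    return [1 if i == index else 0 for i in range(size)]

# Toy sample of the 'sex' column
sex = ["Male", "Female", "Male", "Male", "Female"]
mapping = string_index(sex)
encoded = [one_hot(mapping[v], len(mapping)) for v in sex]
print(mapping)
print(encoded)
```

Because we pass dropLast=False in the pipeline, every category keeps its own position in the vector, as in this sketch.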

Now, we will join the categorical encoded features with the numerical ones and make a vector with both of them.

# Vectorizing encoded values
categorical_encoded = [encoder.getOutputCol() for encoder in encoders]
numerical_columns = ['age', 'education-num', 'capital-gain', 'capital-loss']
inputcols = categorical_encoded + numerical_columns
assembler = VectorAssembler(inputCols=inputcols, outputCol="features")

Now, we will set up a pipeline to automate these stages.

pipeline = Pipeline(stages=indexers + encoders + [assembler])
model = pipeline.fit(df)
# Transform data
transformed = model.transform(df)
display(transformed)

Finally, we will select a dataset with only the relevant columns.

# Keep only the feature vector and the label
final_data = transformed.select('features', '>50K')

Initializing the Models

For this project, we will study the predictive performance of three different classification algorithms:

  • Decision Trees
  • Random Forests
  • Gradient Boosted Trees

# Initialize the classification models
dtc = DecisionTreeClassifier(labelCol='>50K', featuresCol='features')
rfc = RandomForestClassifier(numTrees=150, labelCol='>50K', featuresCol='features')
gbt = GBTClassifier(labelCol='>50K', featuresCol='features', maxIter=10)

Splitting Data

We will perform a classic 80/20 split between training and testing data.

train_data, test_data = final_data.randomSplit([0.8,0.2])
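Note that, as far as we understand, randomSplit assigns rows to each split at random according to the weights, so the resulting sizes are only approximately 80/20 and change between runs unless a seed is passed, for example `final_data.randomSplit([0.8, 0.2], seed=42)`. The plain-Python sketch below only illustrates this behaviour. The helper `random_split` is invented and is not how Spark is implemented:

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split with probability given by weights."""
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        # random.choices draws one split index according to the weights
        (idx,) = rng.choices(range(len(weights)), weights=weights)
        splits[idx].append(row)
    return splits

rows = list(range(1000))
train, test = random_split(rows, weights=[0.8, 0.2], seed=42)
# Sizes are close to 800/200, but not necessarily exact
print(len(train), len(test))
```

Fixing the seed makes the split, and therefore the reported metrics, reproducible.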

Training the Models

dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)

Obtaining Predictions

dtc_preds = dtc_model.transform(test_data)
rfc_preds = rfc_model.transform(test_data)
gbt_preds = gbt_model.transform(test_data)

Evaluating the Models’ Performance

As stated before, our evaluation metric will be the area under the ROC curve, which is the default metric of BinaryClassificationEvaluator. We will initialize this class and pass it the predictions in order to obtain the value.

my_eval = BinaryClassificationEvaluator(labelCol='>50K')
# Display Decision Tree evaluation metric
print('DTC')
print(my_eval.evaluate(dtc_preds))
# Display Random Forest evaluation metric
print('RFC')
print(my_eval.evaluate(rfc_preds))
# Display Gradient Boosted Trees evaluation metric
print('GBT')
print(my_eval.evaluate(gbt_preds))

The best predictor is the Gradient Boosted Trees model. Actually, 0.911 is a pretty good value, and when we display its predictions we will see the following:

Improving the Model’s Performance

We will try to do this using the grid search cross-validation technique. With it, we will evaluate the performance of the model with different combinations of predefined sets of hyperparameter values.

The hyperparameters that we will tune are:

  • Max Depth
  • Max Bins
  • Max Iterations

# Import libraries
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Set the Parameters grid
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())
# Initialize the cross-validator class
cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=my_eval, numFolds=5)
# Run cross-validation. This can take several minutes, since one model is trained per parameter combination and fold
cvModel = cv.fit(train_data)
gbt_predictions_2 = cvModel.transform(test_data)
my_eval.evaluate(gbt_predictions_2)
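To get a feeling for the cost of this search, we can count how many models are trained. This small sketch only reuses the candidate values and the number of folds from the code above:

```python
from itertools import product

# Same candidate values as in the parameter grid above
max_depth = [2, 4, 6]
max_bins = [20, 60]
max_iter = [10, 20]
num_folds = 5

# Every combination of the three hyperparameters is evaluated on every fold
combinations = list(product(max_depth, max_bins, max_iter))
total_fits = len(combinations) * num_folds

print(len(combinations))  # 12 parameter combinations
print(total_fits)         # 60 model fits during cross-validation
```

Adding one more value to any of the lists multiplies the number of fits, so the grid should be kept small when training time matters.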

We have obtained a tiny improvement in the predictive performance, while the computation time went up to almost 20 minutes. So, in these cases, we should analyze whether the improvement is worth the effort.

Conclusion

Throughout this article we built a machine learning classification project from end to end. We also learned and obtained several insights about classification models and the keys to developing one with good performance using PySpark, its methods and implementations.

We have also learned how to tune our algorithms once a well-performing model has been identified.

In the next articles we will learn how to develop regression models in PySpark. So, if you are interested in the topic, I strongly suggest you stay tuned!

If you liked this post then you can take a look at my other posts on Data Science and Machine Learning here.

If you want to learn more about Machine Learning, Data Science and Artificial Intelligence follow me on Medium, and stay tuned for my next posts!

