A Guide to Using the Random Forest Classifier in PySpark

A practical explanatory guide for the classification of Iris flowers

Manusha Priyanjalee
Towards Data Science


Photo by Adél Grőber on Unsplash

In this article, I am going to give you a step-by-step guide on how to use PySpark for the classification of Iris flowers with Random Forest Classifier.

I have used the popular Iris dataset, and I have provided a link to it at the end of the article. I used Google Colab for coding, and the Colab notebook is also provided in Resources.

PySpark is the Python API for Apache Spark, and pip is a package manager for Python packages.

!pip install pyspark

With the above command, PySpark can be installed using pip.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ml-iris').getOrCreate()
df = spark.read.csv('IRIS.csv', header = True, inferSchema = True)
df.printSchema()

First, I need to create an entry point into all functionality in Spark. The SparkSession class is used for this. SparkSession.builder returns a builder for constructing a SparkSession. To set a name for the application, use appName(name); here I have set ‘ml-iris’ as the application name. Setting a name is not necessary, and if it is not set, a random name will be generated for the application. getOrCreate() returns the existing SparkSession if there is one, and creates a new one otherwise.

spark.read.csv("path") reads a CSV file into a Spark DataFrame. A DataFrame is a two-dimensional data structure that arranges data in a tabular format. The Iris dataset has a header row, so I set header = True; otherwise, the API treats the header as a data record. The inferSchema option controls the column types. By default, inferSchema is false, and every column is read as a string. Here I set inferSchema = True, so Spark goes through the file and infers the type of each column. printSchema() prints the schema in a tree format.

Output:

import pandas as pd
pd.DataFrame(df.take(5), columns=df.columns).transpose()

pandas is a toolkit for data analysis. Here, df.take(5) returns the first 5 rows and df.columns returns the names of all columns. DataFrame.transpose() transposes the index and columns of the DataFrame, writing columns as rows and rows as columns.

Output:

numeric_features = [t[0] for t in df.dtypes if t[1] == 'double']
df.select(numeric_features).describe().toPandas().transpose()

df.dtypes returns the names and types of all columns. Here we collect the columns of type double into numeric_features. select(numeric_features) returns a new DataFrame containing only those columns. describe() computes statistics such as count, min, max, and mean for each column, and toPandas() converts the result to a pandas DataFrame.

Output:

Since we now have a good idea about the dataset we are working with, we can start feature transformation. Feature transformation means scaling, converting, and modifying features so they can be used to train a machine learning model that makes more accurate predictions. For this purpose, I have used StringIndexer and VectorAssembler.

First, I have used VectorAssembler to combine the sepal length, sepal width, petal length, and petal width into a single vector column. Here the new vector column is called features.

Output:

Then I have used StringIndexer to encode the string column species into a column of label indices. By default, the labels are assigned by frequency, so the most frequent species gets an index of 0.

Output:

As you can see, we now have new columns named labelIndex and features.

pd.DataFrame(df.take(110), columns=df.columns).transpose()

The bottom row is the labelIndex. We can see that Iris-setosa has a labelIndex of 0, Iris-versicolor has a labelIndex of 1, and Iris-virginica has a labelIndex of 2.

Now that we have transformed our features, we need to split the dataset into training and testing data.

randomSplit() splits the DataFrame randomly into train and test sets. Here I set a seed for reproducibility. 0.7 and 0.3 are the split weights, given as a list; if they do not sum to 1.0, Spark normalizes them.

Output:

Now we can import and apply the random forest classifier. Random forest is an ensemble method that operates by constructing multiple decision trees during the training phase. The majority vote of the trees is chosen by the random forest as the final prediction.

Representation of Random Forest Classifier (Image by author)

It is a supervised learning method, mainly used for classification but applicable to regression as well. The random forest classifier is useful because:

  1. It is less prone to overfitting than a single decision tree
  2. It typically achieves high accuracy
  3. It can handle missing data well

Here featuresCol is the name of the features column of the DataFrame, which in our case is features. labelCol is the target column, which is labelIndex. rf.fit(train) fits the random forest model to our training dataset named train. rfModel.transform(test) transforms the test dataset and adds new columns to the DataFrame, such as prediction, rawPrediction, and probability.

Output:

We can clearly compare the actual values and predicted values with the output below.

predictions.select("labelIndex", "prediction").show(10)

Output:

Now we have applied the classifier for our testing data and we got the predictions. Then we need to evaluate our model.

MulticlassClassificationEvaluator is the evaluator for multi-class classification. Since we have 3 classes (Iris-setosa, Iris-versicolor, Iris-virginica), we need MulticlassClassificationEvaluator. The evaluate() method is used to evaluate the performance of the classifier.

Output:

Now we can see that the accuracy of our model is high and the test error is very low. It means our classifier model is performing well.

We can use a confusion matrix to compare the predicted iris species and the actual iris species.

MulticlassMetrics is an evaluator for multiclass classification in the pyspark.mllib library.

Output:

According to the confusion matrix, 44 (12 + 16 + 16) of the 47 test samples are correctly classified, while 3 are misclassified.

I hope this article helped you learn how to use PySpark and do a classification task with the random forest classifier. I have provided the dataset and notebook links below.

Happy Coding! 😀

Resources

Dataset

Colab Notebook
