Random Forest Classification

Background information & sample use case in 7 minutes

Nima Beheshti
Towards Data Science


Image by David Kovalenko from Unsplash

Machine learning models are usually broken down into supervised and unsupervised learning algorithms. Supervised models are trained on labeled data, where both the independent (feature) variables and the dependent (outcome) variable are defined. Conversely, unsupervised methods are used when the data are unlabeled and no outcome variable is defined. For this article we will focus on a specific supervised model, known as Random Forest, and will demonstrate a basic use case on Titanic survivor data.

Before going into the details of the Random Forest model, it’s important to define decision trees, ensemble models, and bootstrapping, which are essential to understanding the Random Forest model.

Decision Trees are used for both regression and classification problems. They visually flow like trees, hence the name, and in the classification case, they start with the root of the tree and follow binary splits based on variable outcomes until a leaf node is reached and the final binary result is given. An example of a decision tree is below:

Image by Author

Here we see the decision tree starts with Variable_1 and splits based on a specific criterion. When the answer is ‘yes’, the decision tree classifies the observation as True (True-False could be any binary pair, such as 1-0 or Yes-No). When the answer is ‘no’, the decision tree moves down to the next node and the process repeats until a leaf node is reached and the resulting outcome is decided.
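To make this concrete, here is a minimal sketch of a single decision tree classifier in sklearn, fit on a tiny made-up dataset (the feature and label values are purely illustrative and are not the Titanic data):

from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: two features per observation, binary labels
X_toy = [[25, 0], [40, 1], [18, 0], [60, 1], [35, 0], [50, 1]]
y_toy = [0, 1, 0, 1, 0, 1]

# Fit a single tree and classify a new observation
tree = DecisionTreeClassifier(max_depth=2, random_state=18).fit(X_toy, y_toy)
print(tree.predict([[30, 1]]))  # the second feature separates the classes here, so this prints [1]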

Ensemble learning is the process of training multiple models on the same data and combining (for example, averaging) their individual results to obtain a more powerful prediction or classification than any single model provides.

Bootstrapping is the process of randomly sampling subsets of a dataset (with replacement) over a given number of iterations and a given number of variables. The results from these samples are then aggregated, for example by averaging, to obtain a more robust result. Bootstrapping paired with ensemble learning in this way is known as bootstrap aggregation, or bagging.

The Random Forest algorithm combines bootstrapping and ensemble learning with the decision tree framework: it builds many decision trees on randomly drawn samples of the data and combines their results, which often leads to strong predictions/classifications.
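As a rough illustration of how these pieces fit together (a hand-rolled sketch, not the actual Random Forest implementation, and the DataFrame, feature, and target names are hypothetical), the code below draws bootstrap samples with pandas, fits one decision tree per sample, and combines the trees by a majority vote:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def mini_bagging(df, feature_cols, target_col, n_trees=10, seed=18):
    # Toy bootstrap aggregation: fit one decision tree per bootstrap sample of a DataFrame
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample: draw rows with replacement, same size as the original data
        sample = df.sample(n=len(df), replace=True, random_state=rng)
        tree = DecisionTreeClassifier(max_depth=4)
        tree.fit(sample[feature_cols], sample[target_col])
        trees.append(tree)
    return trees

def majority_vote(trees, X_new):
    # Average the trees' 0/1 predictions and round to get the ensemble's class
    votes = np.mean([tree.predict(X_new) for tree in trees], axis=0)
    return (votes >= 0.5).astype(int)

A real Random Forest additionally limits the features considered at each split (the max_features idea covered later), which this sketch leaves out.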

For this article, I will demonstrate a Random Forest model created on Titanic survivor data posted to Kaggle by Syed Hamza Ali located here; the data is licensed CC0 — Public Domain. This dataset provides information on passengers such as age, ticket class, sex, and a binary variable for whether the passenger survived. This data could also be used to compete in the Kaggle Titanic ML competition, so in the spirit of keeping that competition fair, I won’t show all of the steps I took to conduct EDA & data wrangling, or directly post the code. Instead, I’ll mention some general concepts and tips, then focus on the Random Forest model.

EDA & Data Wrangling:
One of the challenges faced when conducting EDA is missing data. When dealing with missing values we have a few options: we can fill the missing values with a fixed value such as the mean, min, or max; we can generate values using the sample mean, standard deviation, and distribution type to provide an estimate for each missing value; or we can simply drop the rows with missing data (I do not generally recommend this approach). Examples of some of these options are below:

import pandas as pd
import numpy as np

# Fill missing values in a column with that column's mean
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

# Draw replacement values from a normal distribution
# (mean, standard_deviation, size_of_sample are placeholders for your sample statistics)
np.random.normal(mean, standard_deviation, size=size_of_sample)

Additionally, it is important to treat categorical variables as such even if the datatype is an integer. A common method for doing so is known as one-hot encoding, a sample of which is below.

import pandas as pd
pd.get_dummies(df, columns=['list_of_column_names'])

Lastly, it’s important to consider that some of your variables may simply not be useful in the model. Deciding which variables to keep can be done with methods such as regularization, or with judgement calls based on your experience and intuition. Be careful about removing variables based on intuition alone, as you may mistakenly drop variables that are actually important to the model.
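As one possible version of the regularization route (a sketch, not a step I am claiming was used for this dataset), an L1-penalized logistic regression combined with sklearn's SelectFromModel flags the features whose coefficients shrink to zero; X and y stand for whatever feature matrix and target you have prepared:

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# The L1 (lasso) penalty shrinks uninformative coefficients toward zero;
# SelectFromModel keeps only the features with coefficients above its threshold
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear')).fit(X, y)
print(selector.get_support())  # boolean mask of the features that would be kept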

Train/Test Split:
We will use the sklearn module for the bulk of our analysis; specifically, in this stage we will use its train_test_split function to create separate train and test sets of the data. For a complete data science project we would also perform cross validation and pick the option with the best results. However, for simplicity, I kept cross validation out of this article and will cover cross validation and grid search in later articles. The code for running train_test_split is below:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=18)

The parameters passed to our train_test_split function are ‘X’, which contains our dataset’s variables other than the outcome variable, and ‘y’, the array of outcome values for each observation in X. The test_size parameter decides what fraction of the data will be held out for the testing dataset; in this case I chose 0.25, or 25%. The random_state parameter simply fixes the specific split taken of the data so you can later replicate your results. After using this function, we have the train and test datasets we need for model training and testing.
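For reference, and without reproducing the wrangling code, X and y can be built along these lines; the exact feature columns depend on your own cleaning and encoding, and 'Survived' is the binary target column in the Kaggle dataset:

# Target is the binary 'Survived' column; the remaining cleaned/encoded columns are the features
y = df['Survived']
X = df.drop(columns=['Survived'])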

Random Forest Model:
We will continue using the sklearn module to train our Random Forest model, specifically the RandomForestClassifier class. The RandomForestClassifier documentation shows many different parameters we can set for our model. Some of the important ones are highlighted below:

  • n_estimators — the number of decision trees you will be running in the model
  • max_depth — this sets the maximum possible depth of each tree
  • max_features — the maximum number of features the model will consider when determining a split
  • bootstrap — the default value for this is True, meaning the model follows bootstrapping principles (defined earlier).
  • max_samples — this parameter only applies when bootstrap is set to True; it sets the number of samples drawn from the data to train each tree.
  • Other important parameters are criterion, min_samples_split, min_samples_leaf, class_weight, n_jobs, and others that can be read about in sklearn’s RandomForestClassifier documentation here; a quick way to list every parameter and its default value is shown right after this list.
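As a quick sanity check (a minimal snippet, not specific to this dataset), the full set of parameters and their defaults can be printed from an unconfigured estimator:

from sklearn.ensemble import RandomForestClassifier

# Dictionary of every constructor parameter and its default value
print(RandomForestClassifier().get_params())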

For the purposes of this article, I will choose basic values for these parameters without any major fine tuning to see how this algorithm performs overall. The training code used is below:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500, max_depth=4, max_features=3, bootstrap=True, random_state=18).fit(x_train, y_train)

The parameter values I chose were n_estimators = 500, meaning 500 trees are built for this model; max_depth = 4, so the maximum possible depth of each tree is 4; max_features = 3, so at most 3 features are considered at each split; bootstrap = True, which again is the default setting, but I wanted to include it to reiterate how bootstrapping applies to Random Forest models; and finally random_state = 18.

I’d like to re-emphasize that these values were chosen with minimal fine tuning and optimization. The goal of this article is to demonstrate the Random Forest classification model, not to achieve the most optimal results (although the model does perform relatively well, as we will see shortly). In future articles I will dive into optimization methods and grid search to find a more optimal solution.

To test the trained model we can use its .predict method, passing our test dataset as a parameter. We can then use the following metrics to see how well the model performed on the test set.

from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

# Create our predictions
prediction = clf.predict(x_test)

# Create confusion matrix
confusion_matrix(y_test, prediction)

# Display accuracy score
accuracy_score(y_test, prediction)

# Display F1 score
f1_score(y_test, prediction)
Image by Author

Our model provided an accuracy measure of 86.1% and an F1 score of 80.25%.

Accuracy is measured as (TP + TN) / (all cases), while the F1 score is calculated as 2 * ((precision * recall) / (precision + recall)), with precision = TP / (TP + FP) and recall = TP / (TP + FN).
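To tie those formulas to code, the short sketch below computes accuracy and F1 from confusion-matrix counts; the TP/TN/FP/FN values are placeholders, not the actual Titanic results:

# Placeholder confusion-matrix counts (not the actual results from this model)
TP, TN, FP, FN = 60, 130, 15, 18

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)
print(accuracy, f1)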

Normally accuracy is not the metric we use to judge the performance of a classification model, because class imbalance in the data can lead to high accuracy simply from predicting the majority class. However, for simplicity I included it above. I also included the F1 score, which is the harmonic mean of precision and recall and therefore penalizes large differences between the two. Generally speaking, we would prefer to judge a classification model’s performance by its precision, recall, or F1 score.
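If you want precision, recall, and F1 for each class in one place, sklearn's classification_report is a convenient companion to the individual metric functions used above:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support in one table
print(classification_report(y_test, prediction))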

Conclusions:
The purpose of this article was to introduce Random Forest models, describe some of sklearn’s documentation, and provide an example of the model on actual data. Using Random Forest classification yielded an accuracy score of 86.1% and an F1 score of 80.25%. These tests were conducted using a normal train/test split and without much parameter tuning. In later tests we will look to include cross validation and grid search in our training phase to find a better-performing model.

Thank you for taking the time to read this article! I hope you enjoyed reading and have learned more about Random Forest Classification. As mentioned, I will continue writing articles updating the methods deployed here, as well as other methods and data science related topics.
