
The random forest algorithm is a machine learning method that can be used for supervised learning tasks such as classification and regression. The algorithm works by constructing a set of decision trees, each trained on a bootstrap sample of the data and considering a random subset of features at each split. In the case of classification, the output of a random forest model is the mode of the predicted classes across the decision trees. In this post, we will discuss how to build random forest models for classification tasks in Python.
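To make "the mode of the predicted classes" concrete, here is a minimal sketch of majority voting. The `majority_vote` helper and the five votes are hypothetical and made up for illustration. Note that scikit-learn's classifier, used later in this post, averages the trees' predicted class probabilities rather than counting hard votes, although the two approaches usually agree.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single mushroom
votes = ['p', 'e', 'p', 'p', 'e']
forest_prediction = majority_vote(votes)
print(forest_prediction)  # 'p' wins with three of the five votes
```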
Let’s get started!
CLASSIFICATION WITH RANDOM FOREST
For our classification task, we will be working with the Mushroom Classification data set which can be found here. We will be predicting on a binary target that specifies whether a mushroom is poisonous or edible.
To start, let’s import the pandas library and read our data into a data frame:
import pandas as pd
df = pd.read_csv("mushrooms.csv")
Let’s print the shape of our data frame:
print("Shape: ", df.shape)

Next, let’s print the columns in our data frame:
print(df.columns)

Now let’s also look at the first five rows of data using the ‘.head()’ method:
print(df.head())

The attribute information is as follows

We will be predicting the class for mushrooms where the possible class values are ‘e’ for edible and ‘p’ for poisonous. The next thing we will do is convert each column into machine-readable integer category codes:
df_cat = pd.DataFrame()
for i in list(df.columns):
    df_cat['{}_cat'.format(i)] = df[i].astype('category').copy()
    df_cat['{}_cat'.format(i)] = df_cat['{}_cat'.format(i)].cat.codes
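To see what this conversion does, the sketch below applies the same steps to a tiny data frame. The `toy` frame and its values are hypothetical, not rows taken from mushrooms.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from mushrooms.csv
toy = pd.DataFrame({'class': ['p', 'e', 'e', 'p'],
                    'odor': ['f', 'a', 'n', 'f']})

toy_cat = pd.DataFrame()
for col in toy.columns:
    # pandas sorts the distinct labels, then numbers them starting from 0
    toy_cat['{}_cat'.format(col)] = toy[col].astype('category').cat.codes

print(toy_cat)
```

Because the codes follow the sorted order of the labels, ‘e’ maps to 0 and ‘p’ maps to 1 in the class column.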
Let’s print the first five rows of the resulting data frame:
print(df_cat.head())


Next, let’s define our features and our targets:
X = df_cat.drop('class_cat', axis = 1)
y = df_cat['class_cat']
Now let’s import the random forests classifier from ‘sklearn’:
from sklearn.ensemble import RandomForestClassifier
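Next, let’s import ‘KFold’ from the model selection module in ‘sklearn’. We will use ‘KFold’ to validate our model. Additionally, we will use the f1-score as our performance metric, which is the harmonic mean of the precision and recall, so let’s import ‘f1_score’ from the metrics module along with ‘numpy’ for averaging the scores. Let’s also initialize the ‘KFold’ object with two splits. Because ‘random_state’ only takes effect when shuffling is enabled, we also set ‘shuffle=True’. Finally, we’ll initialize a list that we will use to append our f1-scores:
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
kf = KFold(n_splits=2, shuffle=True, random_state = 42)
results = []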
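As a quick illustration of the harmonic mean behind the f1-score, here is a small sketch that computes it by hand. The `f1_from_labels` helper and the label lists are hypothetical, and the sketch assumes there is at least one true positive so that no division by zero occurs.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the F1 score for one positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 2 true positives, 1 false positive, 2 false negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0]
f1_demo = f1_from_labels(y_true_demo, y_pred_demo)
print(f1_demo)  # precision is 2/3 and recall is 1/2, so F1 is 4/7
```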
Next, let’s iterate over the indices in our data and split our data for training and testing:
for train_index, test_index in kf.split(X):
    # KFold yields positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
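To see what the split produces, the following sketch runs ‘KFold’ on six hypothetical rows. The names `toy_X` and `toy_kf` are made up for illustration, and shuffling is left off here so that the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(6).reshape(6, 1)  # six hypothetical rows, one feature
toy_kf = KFold(n_splits=2)          # no shuffling: folds are contiguous blocks

folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in toy_kf.split(toy_X)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each row lands in the test set exactly once. Without shuffling, every fold is a contiguous block of rows, which can give misleading scores if the file happens to be sorted by class.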
Within the for-loop we will define random forest model objects, fit to the different folds of training data, predict on the corresponding folds of test data, evaluate the f1-score at each test run and append the f1-scores to our ‘results’ list. Our model will use 100 estimators, which corresponds to 100 decision trees:
for train_index, test_index in kf.split(X):
    ...
    model = RandomForestClassifier(n_estimators = 100, random_state = 24)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results.append(f1_score(y_test, y_pred))
Finally, let’s print the average performance of our model:
print("Mean F1 score: ", np.mean(results))
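As an aside, the same fit, predict, and score loop can be condensed with scikit-learn's ‘cross_val_score’ helper. This is a swapped-in convenience rather than the loop used in this post, and the sketch below runs on synthetic data from ‘make_classification’ as a stand-in for the mushroom features, so the names `X_demo`, `y_demo`, `kf_demo` and `model_demo` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data standing in for the mushroom features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

kf_demo = KFold(n_splits=2, shuffle=True, random_state=42)
model_demo = RandomForestClassifier(n_estimators=100, random_state=24)

# One F1 score per fold; a fresh copy of the model is fit on each training fold
scores = cross_val_score(model_demo, X_demo, y_demo, cv=kf_demo, scoring='f1')
print("F1 per fold:", scores)
print("Mean F1 score:", np.mean(scores))
```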

If we increase the number of splits to 5 we have:
kf = KFold(n_splits=5)
...
print("Mean F1 score: ", np.mean(results))

I’ll stop here but I encourage you to play around with the data and code yourself.
CONCLUSIONS
To summarize, in this post we discussed how to train a random forest classification model in Python. We showed how to transform categorical feature values into machine-readable category codes. Further, we showed how to split our data for training and testing, initialize our random forest model object, fit to our training data, and measure the performance of our model. I hope you found this post useful and interesting. The code in this post is available on GitHub. Thank you for reading!