Machine Learning: Step-By-Step

A Step-By-Step Guide To Machine Learning Classification In Python Using Random Forest, PCA, & Hyperparameter Tuning — WITH CODE!

Alexander Cheng
Towards Data Science



As data scientists, we have many options to choose from to create a classification model. One of the most popular and robust methods is using Random Forests. We can perform Hyperparameter Tuning on Random Forests to try to optimize the model’s performance.

It is also common practice to try Principal Component Analysis (PCA) before fitting our data to a model. But why should we even add this step? Isn’t the whole point of Random Forest to help us interpret feature importance easily?

Yes, PCA can make it a little harder to interpret each “feature” when we analyze the “feature importances” of our Random Forest model. However, PCA performs dimensionality reduction, which reduces the number of features the Random Forest has to process, so PCA may speed up training. Keep in mind that computational cost is one of the biggest drawbacks of Random Forests (the model can take a long time to run), and PCA becomes especially valuable when you are working with hundreds or even thousands of predicting features. So if the most important thing is simply having the best-performing model, and interpreting feature importance can be sacrificed, then PCA is worth trying.

Now, let’s get started with our example. We will be working with the Scikit-learn “breast cancer” dataset. We will create 3 models, and compare their performance to each other:

  • 1. Random Forest
  • 2. Random Forest With PCA Reduced Dimensionality
  • 3. Random Forest With PCA Reduced Dimensionality & Hyperparameter Tuning

1. Import Data

First, we load the data and create a dataframe. Since this is a pre-cleaned “toy” dataset from Scikit-learn, we are good to proceed with the modeling process. However, as a best practice, we should always do the following:

  • Use df.head() to take a glance at the new dataframe to make sure it looks as intended.
  • Use df.info() to get a sense of the data types and counts in each column, and convert data types where needed.
  • Use df.isna() to make sure there are no NaN values; impute values or drop rows where needed (a short sketch of such fixes follows this list).
  • Use df.describe() to get a sense for the minimum, maximum, mean, median, standard deviation, and interquartile ranges of each column.
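
If any of those checks do turn up problems, the fixes are usually one-liners. Below is a minimal sketch using a made-up column name (“some_column”); nothing like this is actually needed for this pre-cleaned dataset.

import pandas as pd
# Hypothetical clean-up sketch; 'some_column' is a made-up name and none of
# this is required for the pre-cleaned breast cancer data.
df = pd.DataFrame({'some_column': ['1.0', '2.5', None]})
df['some_column'] = df['some_column'].astype(float)  # convert strings to a numeric dtype
df['some_column'] = df['some_column'].fillna(df['some_column'].median())  # impute missing values
# df = df.dropna()  # ...or drop the affected rows instead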

The column named “cancer” is the target variable that we want to predict using our model. “0” means “no cancer” and “1” means “cancer”.

import pandas as pd
from sklearn.datasets import load_breast_cancer
columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
dataset = load_breast_cancer()
data = pd.DataFrame(dataset['data'], columns=columns)
data['cancer'] = dataset['target']
display(data.head())
display(data.info())
display(data.isna().sum())
display(data.describe())
Above is a portion of the breast cancer dataframe. Each row has observations about a patient. The end column named “cancer” is the target variable that we are trying to predict. 0 means “no cancer” and 1 means “cancer”.

2. Train / Test Split

Now we split our data using the Scikit-learn “train_test_split” function. We want to give the model as much data as possible to train with. However, we also want to make sure that we have enough data for the model to test itself on. In general, the more rows a dataset has, the larger the share we can afford to give to the training set.

For example, if we had millions of rows, we could have a 90% train / 10% test split. However, our dataset is only 569 rows, which is not a very large dataset to train or test on. So to be fair to both training and testing, we will split the data into 50% train and 50% test. We set stratify=y to ensure that both the train and test sets have the same proportion of 0s and 1s as the original dataset.

from sklearn.model_selection import train_test_split
X = data.drop('cancer', axis=1)
y = data['cancer']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state = 2020, stratify=y)
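
As an optional sanity check (my addition, not part of the original walkthrough), we can confirm that stratifying really did preserve the class balance in both halves of the split:

# Optional check: with stratify=y, the 0/1 proportions in the train and test
# sets should be nearly identical (run this after the split above).
display(y_train.value_counts(normalize=True))
display(y_test.value_counts(normalize=True))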

3. Scale Data

Before modeling, we need to “center” and “standardize” our data by scaling. We scale to control for the fact that different variables are measured on different scales. We scale so that each predictor gets a “fair fight” against the others when importance is decided. See this article. We also convert “y_train” from a Pandas “Series” object into a NumPy array for the model to accept the target training data later on.

import numpy as np
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)
y_train = np.array(y_train)
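
To see what the scaler actually did, a quick check (my addition, not in the original code) is that every column of the scaled training data now has a mean of roughly 0 and a standard deviation of roughly 1:

# Quick verification of the scaling (run after the block above):
# each column of X_train_scaled should have mean ~0 and std ~1.
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))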

4. Fit To “Baseline” Random Forest Model

Now we create a “baseline” Random Forest model. This model uses all of the predicting features and the default settings defined in the Scikit-learn Random Forest Classifier documentation. First, we instantiate the model and fit the scaled data to it. Then we can measure the accuracy of the model on our training data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
rfc_1 = RandomForestClassifier()
rfc_1.fit(X_train_scaled, y_train)
display(rfc_1.score(X_train_scaled, y_train))
# 1.0

If we are curious to see which features are most important to the Random Forest model for predicting breast cancer, we can visualize and quantify the importances by inspecting the “feature_importances_” attribute:

import matplotlib.pyplot as plt
import seaborn as sns
feats = {}
for feature, importance in zip(data.columns, rfc_1.feature_importances_):
    feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-Importance'})
importances = importances.sort_values(by='Gini-Importance', ascending=False)
importances = importances.reset_index()
importances = importances.rename(columns={'index': 'Features'})
sns.set(style="whitegrid", color_codes=True, font_scale=1.7)
fig, ax = plt.subplots()
fig.set_size_inches(30, 15)
sns.barplot(x='Gini-Importance', y='Features', data=importances, color='skyblue')
plt.xlabel('Importance', fontsize=25, weight='bold')
plt.ylabel('Features', fontsize=25, weight='bold')
plt.title('Feature Importance', fontsize=25, weight='bold')
plt.show()
display(importances)

5. PCA (Principal Component Analysis)

Now, how could we improve our baseline model? Using dimensionality reduction, we can approximate the original dataset with fewer variables while reducing the computational cost of running our model. Using PCA, we can study the cumulative explained variance ratio of the principal components to understand how many components are needed to explain most of the variance in the data.

We instantiate the PCA function and set the number of components (features) that we want to consider. We’ll set it to “30” to see the explained variance of all the generated components, before deciding where to make the cut. Then we “fit” our scaled X_train data to the PCA function.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
pca_test = PCA(n_components=30)
pca_test.fit(X_train_scaled)
sns.set(style='whitegrid')
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.axvline(linewidth=4, color='r', linestyle = '--', x=10, ymin=0, ymax=1)
display(plt.show())
evr = pca_test.explained_variance_ratio_
cvr = np.cumsum(pca_test.explained_variance_ratio_)
pca_df = pd.DataFrame()
pca_df['Cumulative Variance Ratio'] = cvr
pca_df['Explained Variance Ratio'] = evr
display(pca_df.head(10))
This graph shows that after more than 10 components, we don’t gain very much explained variance.
This dataframe shows the Cumulative Variance Ratio (how much total variance of the data is explained) and the Explained Variance Ratio (how much each PCA component explains the total variance of the data).

Looking at the dataframe above, when we use PCA to reduce our 30 predicting variables down to 10 components, we can still explain over 95% of the variance. The other 20 components explain less than 5% of the variance, so we can cut them. Using this logic, we will use PCA to reduce the number of components from 30 to 10 for X_train and X_test. We will assign these recreated, “reduced dimension” datasets to “X_train_scaled_pca” and “X_test_scaled_pca”.

pca = PCA(n_components=10)
pca.fit(X_train_scaled)
X_train_scaled_pca = pca.transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)
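
As an aside, if you would rather not pick the cutoff from the plot yourself, Scikit-learn's PCA also accepts a float between 0 and 1 for n_components and will keep however many components are needed to reach that fraction of explained variance. A minimal sketch of that alternative (not used in the rest of this walkthrough):

# Alternative: let PCA choose the number of components that together explain
# at least 95% of the variance, instead of hard-coding n_components=10.
pca_95 = PCA(n_components=0.95)
pca_95.fit(X_train_scaled)
print(pca_95.n_components_)  # number of components actually kept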

Each component is a linear combination of the original variables with corresponding “weights”. We can see these “weights” for each PCA component by creating a dataframe.

pca_dims = []
for x in range(0, len(pca_df)):
    pca_dims.append('PCA Component {}'.format(x))
pca_test_df = pd.DataFrame(pca_test.components_, columns=columns, index=pca_dims)
pca_test_df.head(10).T

6. Fit To “Baseline” Random Forest Model After PCA

Now, we can fit our X_train_scaled_pca and y_train data to another “baseline” Random Forest model, to see if we get any improvement on the model’s predictions.

rfc_2 = RandomForestClassifier()
rfc_2.fit(X_train_scaled_pca, y_train)
display(rfc_2.score(X_train_scaled_pca, y_train))
# 1.0

7. Hyperparameter Tuning Round 1: RandomizedSearchCV

After performing PCA, we can also try some hyperparameter tuning to tweak our Random Forest to try and get better predicting performance. Hyperparameters can be thought of as “settings” for a model. The perfect settings for one dataset won’t be the same for another, so we have to “tune” the model.

First, we can start with RandomizedSearchCV to consider a wide range of values. All of the hyperparameters for Random Forest can be found in the Scikit-learn Random Forest Classifier documentation.

We generate a “param_dist” with a range of values to try for each hyperparameter. RandomizedSearchCV is instantiated with our Random Forest model passed in first, then our “param_dist”, the number of iterations to try, and the number of cross-validation folds to perform.

The “verbose” parameter gives you more or less output as the search runs (like status updates). The “n_jobs” parameter lets you decide how many of your processor's cores to use. Setting “n_jobs = -1” will run the search fastest, because it uses all of your computer's cores.

We will be tuning these hyperparameters:

  • n_estimators: the number of “trees” in our Random Forest.
  • max_features: the number of features at each split.
  • max_depth: the max number of “splits” each tree can have.
  • min_samples_split: the minimum number of observations required before a node of a tree can split itself.
  • min_samples_leaf: the minimum number of observations required at each leaf at the ends of each tree.
  • bootstrap: whether to use bootstrapping or not to provide data to each tree in the Random Forest. (Bootstrapping is a random sampling from the dataset with replacement.)
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=10)]
max_features = ['log2', 'sqrt']
max_depth = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
min_samples_split = [int(x) for x in np.linspace(start=2, stop=50, num=10)]
min_samples_leaf = [int(x) for x in np.linspace(start=2, stop=50, num=10)]
bootstrap = [True, False]
param_dist = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}
rs = RandomizedSearchCV(rfc_2,
                        param_dist,
                        n_iter=100,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        random_state=0)
rs.fit(X_train_scaled_pca, y_train)
rs.best_params_
# {'n_estimators': 700,
# 'min_samples_split': 2,
# 'min_samples_leaf': 2,
# 'max_features': 'log2',
# 'max_depth': 11,
# 'bootstrap': True}

With n_iter = 100 and cv = 3, we created 300 Random Forest models, randomly sampling combinations of the hyperparameters input above. We can call “best_params_” to get the best performing model's parameters (shown at the bottom of the code box above). However, “best_params_” at this stage may not give us the best insight into a range of parameters to try in the next round of hyperparameter tuning. To get a good range of values to try next, we can easily get a dataframe of our RandomizedSearchCV results.

rs_df = pd.DataFrame(rs.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
rs_df = rs_df.drop(['mean_fit_time',
                    'std_fit_time',
                    'mean_score_time',
                    'std_score_time',
                    'params',
                    'split0_test_score',
                    'split1_test_score',
                    'split2_test_score',
                    'std_test_score'],
                   axis=1)
rs_df.head(10)

Now, let’s create bar plots of each hyperparameter on the x-axis, and the mean score of the models made at each value, to see which values were most successful on average:

fig, axs = plt.subplots(ncols=3, nrows=2)
sns.set(style="whitegrid", color_codes=True, font_scale = 2)
fig.set_size_inches(30,25)
sns.barplot(x='param_n_estimators', y='mean_test_score', data=rs_df, ax=axs[0,0], color='lightgrey')
axs[0,0].set_ylim([.83,.93])
axs[0,0].set_title(label = 'n_estimators', size=30, weight='bold')
sns.barplot(x='param_min_samples_split', y='mean_test_score', data=rs_df, ax=axs[0,1], color='coral')
axs[0,1].set_ylim([.85,.93])
axs[0,1].set_title(label = 'min_samples_split', size=30, weight='bold')
sns.barplot(x='param_min_samples_leaf', y='mean_test_score', data=rs_df, ax=axs[0,2], color='lightgreen')
axs[0,2].set_ylim([.80,.93])
axs[0,2].set_title(label = 'min_samples_leaf', size=30, weight='bold')
sns.barplot(x='param_max_features', y='mean_test_score', data=rs_df, ax=axs[1,0], color='wheat')
axs[1,0].set_ylim([.88,.92])
axs[1,0].set_title(label = 'max_features', size=30, weight='bold')
sns.barplot(x='param_max_depth', y='mean_test_score', data=rs_df, ax=axs[1,1], color='lightpink')
axs[1,1].set_ylim([.80,.93])
axs[1,1].set_title(label = 'max_depth', size=30, weight='bold')
sns.barplot(x='param_bootstrap',y='mean_test_score', data=rs_df, ax=axs[1,2], color='skyblue')
axs[1,2].set_ylim([.88,.92])
axs[1,2].set_title(label = 'bootstrap', size=30, weight='bold')
plt.show()

Looking at the plots above, we can extract insights about how well each value for each hyperparameter performed on average.

n_estimators: 300, 500, 700 seem to have the highest average scores.

min_samples_split: smaller values like 2 and 7 seem to have higher scores. There are also high scores at 23. We can try a few values above 2, and a few values around 23.

min_samples_leaf: smaller values seem to correlate with higher scores; we can try values between 2 and 7.

max_features: “sqrt” has the highest average score.

max_depth: no clear pattern, but values of 2, 3, 7, 11, 15 seem to do well.

bootstrap: “False” has the highest average score.

So now we can take these insights ^ and move into the second round of hyperparameter tuning to further narrow our selections.

8. Hyperparameter Tuning Round 2: GridSearchCV

After using RandomizedSearchCV, we can use GridSearchCV to perform a more refined search for our best hyperparameters. The hyperparameters are the same, but now we perform a more “exhaustive” search with GridSearchCV. In GridSearchCV, every single combination of hyperparameter values is tried, which takes much more computational power than RandomizedSearchCV, where we directly control how many iterations to try. For example, searching just 10 values for each of our 6 hyperparameters with 3-fold cross-validation would require 10⁶ x 3 = 3,000,000 model fits! This is why we perform GridSearchCV after RandomizedSearchCV, to help narrow the search first.

So using what we learned from our RandomizedSearchCV, let’s plug in the average best performing ranges of each hyperparameter:

from sklearn.model_selection import GridSearchCV
n_estimators = [300, 500, 700]
max_features = ['sqrt']
max_depth = [2, 3, 7, 11, 15]
min_samples_split = [2, 3, 4, 22, 23, 24]
min_samples_leaf = [2, 3, 4, 5, 6, 7]
bootstrap = [False]
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}
gs = GridSearchCV(rfc_2, param_grid, cv = 3, verbose = 1, n_jobs=-1)
gs.fit(X_train_scaled_pca, y_train)
rfc_3 = gs.best_estimator_
gs.best_params_
# {'bootstrap': False,
# 'max_depth': 7,
# 'max_features': 'sqrt',
# 'min_samples_leaf': 3,
# 'min_samples_split': 2,
# 'n_estimators': 500}

So here we are performing 3-fold cross-validation over 3 x 1 x 5 x 6 x 6 x 1 = 540 hyperparameter combinations, for a grand total of 540 x 3 = 1,620 model fits! And now, after performing RandomizedSearchCV followed by GridSearchCV, we can call “best_params_” to get the single best model with which to predict our data (shown at the bottom of the code box above).
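
If you want to double-check that arithmetic, Scikit-learn's ParameterGrid (my addition, not used in the original code) will count the combinations in “param_grid” for you:

# Count the hyperparameter combinations in param_grid and the total model
# fits once 3-fold cross-validation is applied.
from sklearn.model_selection import ParameterGrid
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)      # 540 combinations
print(n_combinations * 3)  # 1,620 model fits with cv=3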

9. Evaluate Performance Of Models On Test Data

Now, we can evaluate each of the models that we have made on our test data. Remember that we are testing 3 models:

  • 1. Baseline Random Forest
  • 2. Baseline Random Forest With PCA Reduced Dimensionality
  • 3. Baseline Random Forest With PCA Reduced Dimensionality & Hyperparameter Tuning

Let’s generate the predictions of each of these models:

y_pred = rfc_1.predict(X_test_scaled)
y_pred_pca = rfc_2.predict(X_test_scaled_pca)
y_pred_gs = rfc_3.predict(X_test_scaled_pca)

Now, let’s create confusion matrices for each model, to see how well each model was able to predict breast cancer:

from sklearn.metrics import confusion_matrix
conf_matrix_baseline = pd.DataFrame(confusion_matrix(y_test, y_pred), index = ['actual 0', 'actual 1'], columns = ['predicted 0', 'predicted 1'])
conf_matrix_baseline_pca = pd.DataFrame(confusion_matrix(y_test, y_pred_pca), index = ['actual 0', 'actual 1'], columns = ['predicted 0', 'predicted 1'])
conf_matrix_tuned_pca = pd.DataFrame(confusion_matrix(y_test, y_pred_gs), index = ['actual 0', 'actual 1'], columns = ['predicted 0', 'predicted 1'])
display(conf_matrix_baseline)
display('Baseline Random Forest recall score', recall_score(y_test, y_pred))
display(conf_matrix_baseline_pca)
display('Baseline Random Forest With PCA recall score', recall_score(y_test, y_pred_pca))
display(conf_matrix_tuned_pca)
display('Hyperparameter Tuned Random Forest With PCA Reduced Dimensionality recall score', recall_score(y_test, y_pred_gs))

Below, we have the final results of our labor:

We are using recall as our performance metric because we are dealing with diagnosing cancer — and are most concerned with minimizing False Negative prediction errors in our model.

With this ^ in mind, it looks like our Baseline Random Forest model did the best, with the highest recall score of 94.97%. Given our test data set, the baseline model correctly predicted that 170 patients had cancer, out of a total of 179 people who actually had cancer.
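
For reference, recall is simply the fraction of actual positives that the model catches: true positives divided by the sum of true positives and false negatives. Using the counts reported above (170 of 179 actual cancer cases caught, i.e. 9 false negatives), a quick check reproduces that score:

# Recall = TP / (TP + FN), using the baseline model's counts reported above.
tp, fn = 170, 9
print(tp / (tp + fn))  # ~0.9497, i.e. 94.97%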

This case study brings up an important note: sometimes, after PCA, or even after extensive hyperparameter tuning, a tuned model may not perform as well as a plain-old “vanilla” model. But it’s important to try. You never know which model will do best unless you try them all. In the case of predicting cancer — the better the model, the more lives can be saved.

Hope this helps! Happy modeling!
