
With so much attention on generative AI and vast neural networks, it is easy to overlook the tried-and-tested Machine Learning algorithms of yore (they're actually not that old…). I would go so far as to argue that for most business cases, a straightforward Machine Learning solution will get you further than a more complex AI implementation. Not only do ML algorithms scale extremely well, their far lower model complexity is (in my opinion) what makes them superior in most scenarios. Not to mention, I have also had a far easier time tracking the performance of such ML solutions.
In this article, we will tackle a classic ML problem using a classic ML solution. More specifically, I will show how one can (in only a few lines of code) identify feature importance within a dataset using a Random Forest classifier. I’ll start by demonstrating the effectiveness of this technique. I’ll then apply a ‘back-to-basics’ approach to show how this method works under the hood by creating a Decision Tree and a Random Forest from scratch whilst benchmarking the models along the way.
I have found the initial phases of an ML project to be particularly important in a professional setting. Once the project has been deemed feasible by stakeholders (those paying the bills), they will want to see a return on the investment. Part of this feasibility discussion will involve the data: is there sufficient data, is the data of high quality, and so on. Some questions about the distribution and quality of the data can only be answered after some initial analysis. The technique I'm showing here assumes you have completed that initial feasibility assessment and are ready to move to the next step. The main question we need to ask ourselves at this point is: how many features can I remove whilst still maintaining model performance? There are many benefits to reducing the number of features (the dimensionality) of our model. These include but are not limited to:
- Reduced model complexity
- Faster training times
- Reduced multicollinearity (correlated features)
- Noise reduction
- Improved model performance
Using the Random Forest technique, we will be left with a graph clearly showing how important each feature is in explaining our target (whether the Titanic passenger died or not… yes, I'm using the Titanic dataset!). We will also have a preliminary prototype model that is fit to our data and can be used for further prediction. Whilst this is only a prototype, it will serve well as a baseline for future experiments and will show stakeholders that the project is worth your time and their money! It's a great way to gain momentum during these initial stages of your project.
On the other hand, this technique may also show that further engineering effort is needed, either to create new datapoints/features or to gather them from external sources, so that your model can better learn the relationships between your features and your target (what you're trying to predict).
Let’s begin
Implementing the Random Forest
Decision Trees and their ensemble counterparts (Random Forests) do not typically require pre-processing of features (only some encoding). A decision tree chooses the feature that best separates the data according to a splitting criterion; a common one is Gini impurity, which we'll cover in later sections when we build a decision tree from scratch. Furthermore, a decision tree makes no assumptions about the distribution of the features or the relationships between them. It partitions the feature space using thresholds, making it robust to the spread of your data. Decision trees are also robust to outliers. As we will see later, they split the data based on binary decisions at each node; an outlier may affect the splitting threshold at a specific node, but it is unlikely to significantly impact the overall performance of the individual tree. We will go on to see how further performance gains can be made by averaging the predictions of a group of Decision Trees to create a Random Forest, using a technique known as bagging.
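As a quick preview of that splitting criterion, the score a node would assign to a candidate split can be computed roughly as follows. This is a minimal sketch for a 0/1 target, not sklearn's actual implementation:
# Sketch: weighted Gini impurity of a binary split (lower is better), assuming y is a 0/1 numpy array
def gini_impurity(y):
    p = y.mean()                      # proportion of positives in this group
    return 1 - p**2 - (1 - p)**2      # probability that two random picks disagree

def split_impurity(y_left, y_right):
    # weight each side's impurity by its size, then normalise
    n = len(y_left) + len(y_right)
    return (len(y_left)*gini_impurity(y_left) + len(y_right)*gini_impurity(y_right)) / n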
The only adaptations you may need to make to my code are around pre-processing and the encoding of your own features. Feel free to adapt it as necessary.
We’ll start by importing our dataset from Kaggle. When importing the Kaggle dataset, you will need to ensure your login credentials are stored at this location on your machine:
~/.kaggle/kaggle.json
The kaggle.json file contains your API key. If you are starting from scratch, simply register on Kaggle, access your account settings, head to the API section and click 'Create New Token'. This will save the kaggle.json file to your downloads folder. To move it to the correct location, simply open your CLI (I am using my Mac's terminal) and enter the following command:
mv ~/Downloads/kaggle.json ~/.kaggle/
You can verify the file has been moved by typing:
ls ~/.kaggle/
You should see the kaggle.json file.
Now that we have the API credentials in place, let's import all the necessary libraries. If you receive any errors about a module not being found, simply pip install the missing library. A quick Google search should suffice if you get stuck.
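For reference, something like the following should cover everything used below (these are the PyPI package names as I know them; adjust to your own environment):
pip install pandas numpy scikit-learn kaggle matplotlib ipywidgets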
import pandas as pd
import numpy as np
np.set_printoptions(linewidth=130)
from pathlib import Path
import zipfile,kaggle
from numpy import random
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error
Next, we’ll import our data and assign the train/test data.
path = Path('titanic')
kaggle.api.competition_download_cli(str(path))
zipfile.ZipFile(f'{path}.zip').extractall(path)
df = pd.read_csv(path/'train.csv')
tst_df = pd.read_csv(path/'test.csv')
modes = df.mode().iloc[0]
Next we'll handle some light pre-processing. Again, adapt this code to your personal requirements. We fill null values and transform Fare to LogFare (largely personal preference; as discussed previously, Decision Trees are robust to outliers and to the distribution of the data). We also set up the categorical transformation of the Embarked and Sex columns.
def process_data(df):
    df['Fare'] = df.Fare.fillna(0)
    df.fillna(modes, inplace=True)
    df['LogFare'] = np.log1p(df['Fare'])
    df['Embarked'] = pd.Categorical(df.Embarked)
    df['Sex'] = pd.Categorical(df.Sex)
process_data(df)
process_data(tst_df)
We will then identify our categorical, continuous and dependent variables:
cats=["Sex","Embarked"]
conts=['Age', 'SibSp', 'Parch', 'LogFare',"Pclass"]
dep="Survived"
Next, we need to split our data and apply the categorical transformations:
random.seed(42)
trn_df,val_df = train_test_split(df, test_size=0.25)
trn_df[cats] = trn_df[cats].apply(lambda x: x.cat.codes)
val_df[cats] = val_df[cats].apply(lambda x: x.cat.codes)
Then we assign our independent variables (x) and the dependent variable (y):
def xs_y(df):
    xs = df[cats+conts].copy()
    return xs, df[dep] if dep in df else None
trn_xs,trn_y = xs_y(trn_df)
val_xs,val_y = xs_y(val_df)
We are now ready to fit our Random Forest using sklearn's RandomForestClassifier class. The great thing about this class is that, once fitted, it exposes an attribute called feature_importances_, which allows us to identify and plot the extent to which each feature influences the survival of a passenger, which is ultimately the goal of this exercise. We will also get a performance benchmark for our model using mean_absolute_error:
rf = RandomForestClassifier(100, min_samples_leaf=5)
rf.fit(trn_xs, trn_y);
mean_absolute_error(val_y, rf.predict(val_xs))
# 0.18834080717488788
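Since both the labels and the predictions here are 0/1, this MAE is simply the misclassification rate. If you prefer to think in terms of accuracy, the classifier's built-in score method should give you one minus that number:
# Accuracy view of the same benchmark (expected to be roughly 1 - MAE)
rf.score(val_xs, val_y)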
Let’s plot our findings:
pd.DataFrame(dict(cols=trn_xs.columns, imp=rf.feature_importances_)).plot('cols', 'imp', 'barh');

So there we have it. We can see that Sex was by far the most important feature in predicting the survival of a passenger. At this stage we can decide to press on with the features we have whilst experimenting with different algorithms, using our Random Forest model as a performance benchmark. Or, if our model's predictive performance is weak, we can choose to do more feature engineering. Whilst we only used this on a very small dataset, the method scales exceptionally well. Imagine you have a dataset with 1,000+ features: with this method you can quickly extract the top features and establish a game plan for how best to progress with your project.
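As a rough sketch of what that could look like on a wide dataset (the cut-off of 10 features is arbitrary and purely illustrative):
# Rank features by importance and keep only the top N (N=10 is an arbitrary example)
importances = pd.Series(rf.feature_importances_, index=trn_xs.columns)
top_feats = importances.sort_values(ascending=False).head(10).index.tolist()
trn_xs_top = trn_xs[top_feats]   # reduced training set for further experiments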
Under the hood…
Now that we have an idea of how to implement a Random Forest to show feature importance within our dataset, let’s understand how we got there. We’ll start by creating a decision tree manually and progress from there.
We know that Sex is the most important feature in our dataset for understanding the survivability of a passenger. We can manually test this by creating some scoring functions that measure the impurity of a particular binary split. The impurity indicates how much our split on a specific feature (e.g. Sex) creates two groups where the rows in each group are similar or dissimilar to one another. The goal is to reduce impurity by splitting on the feature that best explains the relationship between our features and our target. In our previous Random Forest example, the individual decision trees chose to split on the Sex column because it creates subsets with the least mixing between the classes (Survived and Not Survived), thus reducing the overall uncertainty (impurity) in predicting the outcome. A binary split is the first building block of a Decision Tree, whereby a candidate binary split is evaluated for each feature.
We'll measure the similarity of rows within a group by taking the standard deviation of the dependent variable. A higher standard deviation means the rows are more different from one another. We'll then multiply this by the number of rows, since a bigger group of values has more impact than a smaller group:
def side_score(side, y):
    tot = side.sum()
    if tot <= 1: return 0
    return y[side].std() * tot
Now we can calculate the score for a split by adding up the scores for the left hand side and the right hand side:
def score(col, y, split):
    lhs = col <= split
    return (side_score(lhs, y) + side_score(~lhs, y)) / len(y)
We can check the impurity score for our Sex column by setting the threshold to 0.5. Within our data, female passengers are represented as 0 and male passengers as 1.
score(trn_xs["Sex"], trn_y, 0.5)
# 0.40787530982063946
For other features, the threshold is less clear-cut. We can set up an experiment for our other categorical or continuous variables by implementing a slider and seeing how the threshold affects the impurity of the data. Remember, we want to reduce the impurity (increase the purity) of the data at each split:
def iscore(nm, split):
    col = trn_xs[nm]
    return score(col, trn_y, split)
from ipywidgets import interact
interact(nm=conts, split=15.5)(iscore);
At this point, have a play with the slider. I've only applied it to the continuous variables, but you can test it on the categorical variables too. As you can imagine, doing this for each of our features is rather time-consuming. Let's write a function that finds the best split point for a column for us. We want to make a list of all the possible split points (the unique values of that field) and find the point at which score() is lowest:
def min_col(df, nm):
    col, y = df[nm], df[dep]
    unq = col.dropna().unique()
    scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
    idx = scores.argmin()
    return unq[idx], scores[idx]
min_col(trn_df, "Age")
# (6.0, 0.478316717508991)
Great. We have found the optimal split on the "Age" column of our training set is 6, with the impurity score being 0.478…
Let’s implement this idea for all columns:
cols = cats+conts
{o:min_col(trn_df, o) for o in cols}
# {'Sex': (0, 0.40787530982063946),
# 'Embarked': (0, 0.47883342573147836),
# 'Age': (6.0, 0.478316717508991),
# 'SibSp': (4, 0.4783740258817434),
# 'Parch': (0, 0.4805296527841601),
# 'LogFare': (2.4390808375825834, 0.4620823937736597),
# 'Pclass': (2, 0.46048261885806596)}
So apparently Sex<=0 is the best split we can use, as this is where our impurity score is at its lowest. Whilst these results do not exactly mirror our initial Random Forest example (understandable, as we are only using a small subset of the data and no ensemble methods), they still show we are on the right track. We have essentially recreated a basic version of the OneR classifier.
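If you wanted to pick that best single-feature rule programmatically rather than reading it off the dictionary, something along these lines would do it:
# OneR-style: choose the column whose best split has the lowest impurity score
best_col = min(cols, key=lambda c: min_col(trn_df, c)[1])
best_col  # 'Sex'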
Traversing our Decision Tree
To progress from here, it is important to understand that the optimal initial split is on our Sex column. Let's manually move on to the next level, where we decide (after having split the data into male/female) what the next optimal split is for each group. You can see where this is going: we are assembling the building blocks of our decision tree by traversing to the next level. To do so, we remove Sex from our list of possible splits, split our data into male and female groups, and find the optimal split (the one with the lowest impurity score) for each of the two groups. Let's do it.
We’ll remove the Sex column and split our data. This is essentially the first binary split within our decision tree:
cols.remove("Sex")
ismale = trn_df.Sex==1
males,females = trn_df[ismale],trn_df[~ismale]
Now we find the best split for males:
{o:min_col(males, o) for o in cols}
# {'Embarked': (0, 0.3875581870410906),
# 'Age': (6.0, 0.3739828371010595),
# 'SibSp': (4, 0.3875864227586273),
# 'Parch': (0, 0.3874704821461959),
# 'LogFare': (2.803360380906535, 0.3804856231758151),
# 'Pclass': (1, 0.38155442004360934)}
and the best split for females:
{o:min_col(females, o) for o in cols}
# {'Embarked': (0, 0.4295252982857327),
# 'Age': (50.0, 0.4225927658431649),
# 'SibSp': (4, 0.42319212059713535),
# 'Parch': (3, 0.4193314500446158),
# 'LogFare': (4.256321678298823, 0.41350598332911376),
# 'Pclass': (2, 0.3335388911567601)}
For males, the best next binary split is Age<=6, and for females it is Pclass<=2. This is where the impurity score is at its lowest.
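If we wanted to go one level deeper by hand, it would look something like this for the male branch (a sketch only; I won't show its output here):
# Sketch: apply the Age<=6 split to the male group, then re-score the remaining columns
young_males, older_males = males[males.Age<=6], males[males.Age>6]
next_cols = [c for c in cols if c != "Age"]
{o:min_col(older_males, o) for o in next_cols}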
We’ve just created our first decision tree by hand. We can repeat this process, creating new additional rules for each of the four sub-groups we have now created. Thankfully however we do not need to reinvent the wheel. There are many open-source libraries that do the heavy lifting for us. Let’s repeat the same process but use an existing library and plot our output decision tree to compare our findings:
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
model = DecisionTreeClassifier(max_leaf_nodes=4).fit(trn_xs, trn_y)
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=trn_xs.columns, filled=True, max_depth=3, rounded=True, precision=2)
plt.show()

Awesome, we get the same results as our manual splits. Let's measure the performance of this model:
mean_absolute_error(val_y, model.predict(val_xs))
# 0.2242152466367713
As expected, this model is worse than the Random Forest ensemble we created at the beginning. Each node in the diagram shows how many rows/samples match that specific set of rules, and how many of those passengers perished or survived. The Gini score shown at each node plays a similar role to the scoring function we created earlier. It can be defined as follows:
def gini(cond):
    act = df.loc[cond, dep]
    return 1 - act.mean()**2 - (1-act).mean()**2
It calculates the probability that, if you pick two rows from a group, they will have different Survived results. If the group is entirely one class, the score is 0.0; for an even mix of survivors and non-survivors it reaches its maximum of 0.5. The lower the score, the purer the group.
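As a quick sanity check of the formula (with made-up numbers, not from our data): a group where 60% of passengers survived scores 1 - 0.6² - 0.4² = 0.48, close to the maximum, while a group that is all survivors or all non-survivors scores exactly 0.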
Bigger tree…
Let’s create a larger decision tree and see how this affects performance:
model = DecisionTreeClassifier(min_samples_leaf=50).fit(trn_xs, trn_y)
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=trn_xs.columns, filled=True, rounded=True, precision=2)
plt.show()

We’ll now measure the larger model’s performance:
mean_absolute_error(val_y, model.predict(val_xs))
# 0.18385650224215247
So this larger tree outperforms our initial, smaller decision tree. I would take this with a pinch of salt, however, since our dataset is so small.
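If you want a slightly more trustworthy estimate on a dataset this small, k-fold cross-validation is a cheap sanity check. A sketch (the numbers you get will differ from the single-split MAE above):
from sklearn.model_selection import cross_val_score
# Mean 5-fold accuracy of the larger tree, using the training split only
cross_val_score(DecisionTreeClassifier(min_samples_leaf=50), trn_xs, trn_y, cv=5).mean()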
Random Forest by hand…
Finally, let's manually create our own Random Forest classifier. We'll do this by creating lots of individual decision trees using the sklearn class we covered previously, and then taking the average of all their outputs. The idea here is that by averaging the predictions of uncorrelated models, we reduce the error of our prediction. The key word here is uncorrelated: we ensure each of our decision trees trains on a different random subset of our data. On an individual level, each decision tree's predictions will therefore be somewhat off the mark, predicting too high for some rows and too low for others. By averaging the predictions of lots of individual, largely uncorrelated decision trees, we get much closer to the true target value, because the average of many uncorrelated random errors tends towards zero. Pretty cool. This technique is known as bagging. Let's put this into code.
First we handle the creation of a decision tree on a new random subset of the data:
def get_tree(prop=0.75):
    n = len(trn_y)
    idxs = random.choice(n, int(n*prop))
    return DecisionTreeClassifier(min_samples_leaf=5).fit(trn_xs.iloc[idxs], trn_y.iloc[idxs])
Now we create as many trees as we want:
trees = [get_tree() for t in range(100)]
Now we obtain the average prediction of all our trees:
all_probs = [t.predict(val_xs) for t in trees]
avg_probs = np.stack(all_probs).mean(0)
mean_absolute_error(val_y, avg_probs)
# 0.22524663677130047
This is nearly identical to the methodology used under the hood by the RandomForestClassifier we fitted at the beginning of the article. The main difference is that sklearn also considers only a random subset of columns at each split.
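That column sub-sampling is controlled by the max_features argument of RandomForestClassifier; to my knowledge 'sqrt' is the default for classifiers in recent scikit-learn versions, but you can set it explicitly if you want to be sure:
# Consider only a random sqrt(n_features) subset of columns at each split
rf = RandomForestClassifier(100, min_samples_leaf=5, max_features='sqrt')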
Conclusion
So there we have it. We’ve covered the building blocks that help you understand the most important features within your dataset. I hope you are able to apply this methodology to quickly make headway in your Data Science projects. As mentioned before, the performance benchmarks of each model in this article should be taken with a pinch of salt as our dataset was so small. Nevertheless, this method scales incredibly well and is a great way to obtain an explainable benchmark.
As always, let me know if you have any questions or wish to discuss anything mentioned in the article.
Cheers!
All images belong to the author unless otherwise stated.