Hands-on Tutorials, Machine Learning, Intuited

What’s in a “Random Forest”? Predicting Diabetes

Understanding a Random Forest intuitively and predicting diabetes in real hospital patient data

Raveena Jay
Towards Data Science
11 min read · Oct 2, 2021

Photo by Priscilla Du Preez on Unsplash

(Disclaimer: This article is mainly for people who want a friendly, intuitive understanding of what’s going on in random forests and decision trees — so I won’t be going into huge mathematical detail. I’ll post some links further below in the article to videos and material if you want to go more into the statistical details. If you’re also interested in a Python implementation, read further!)

If you’ve heard of “random forests” as a hot, sexy machine learning algorithm and you want to implement it, great! But if you’re not sure exactly what happens in a random forest, or how random forests make their classification decisions, then read on :)

We’ll find that we can break down random forests into smaller, more digestible pieces. As a forest is made of trees, so a random forest is made of a bunch of randomly sampled sub-components called decision trees. So first let’s try to understand what a decision tree is, and how it comes to its prediction. For now, we’ll just look at classification decision trees.

Decision trees have, as the name suggests, a tree-like structure made up of "True/False branches" that split (just like branches on real-life trees split!). It's like looking at an upside-down tree: imagine the trunk at the top. In fact, a decision tree is very interpretable, because humans make decision trees about certain choices too; we usually call them flowcharts!

Image credit: Wikipedia (https://en.wikipedia.org/wiki/Decision_tree_learning). Here we see a simple flowchart (decision tree) for how a computer estimates the probability that someone survived the Titanic sinking

In the decision tree above, the machine (or perhaps a diagnostician) makes a series of judgements at every level of the tree, going further down until it reaches the probabilities. For example, at the root of the tree, the machine asks: is it True or False that the passenger's gender is male? If True, go down the left branch; if False, go down the right branch. This is an example of a True/False (or Yes/No) question in a decision tree.

In this way, it’s very easy to visualize a decision tree. But how exactly do decision trees, you know, make their decision?

Choosing What “Features” to Split On

In the Titanic data example pictured above, we see that the decision tree decided to make "splits" for the passenger in question (the True/False question splits I mentioned earlier) based on certain features, like the passenger's gender, age, and number of siblings. How did the decision tree figure out those features on its own? After all, we didn't explicitly ask the tree to pick those attributes.

Imagine yourself as the decision tree, faced with predicting whether a passenger from the Titanic died or survived based on the data. At first, before you've asked any Yes/No questions about the attributes, you don't know how to sort the "dead" passengers from the "alive" passengers. But intuitively, if you pick the right features to ask Yes/No questions on (like the graph above shows), then far enough down in your flowchart you'd want one branch to hold all the "dead" passengers and a separate branch to hold the "alive" passengers, right? At that point you've sorted and filtered your passengers, and by picking the right attributes you can literally read the flowchart, visually following each branch downward to see whether a passenger is alive or dead. What an awesome way to make a prediction!

What is “Information Gain” in a decision tree?

Picking features so that certain "branches" of the decision tree become "full" of just one type of data (for example, one branch holds all the "dead" passengers and another holds just the "alive" ones) is directly related to a concept in probability called information content.

To give some motivation and intuition for "information content", imagine this situation: you live in an area plagued by a lack of rainfall, so dry grass and plant matter often catch fire from the Sun's exposed heat, and you've grown used to seeing fires around the local area. One night, however, you suddenly hear a crack! in the sky, followed by a huge explosion of fire, and, visibly shocked, you realize a lightning strike caused the fire this time. You had never seen lightning strike in the local area! It was far more surprising to see lightning cause the fire. That word, "surprise", is the information content of the event, which in this case was lightning causing the fire (instead of the Sun's heat and dryness causing a fire to spread). The surprise was high because you didn't expect lightning to be the cause; in the same way, the information content of an event is higher when the event is less likely.

Photo by David Moum on Unsplash
Photo by Dominik Kiss on Unsplash

We can call "cause of the fires in a local drought area" a variable, something we don't know in advance. This variable can take many values, since there can be many causes. Some of those values (or call them events), like "a small spark ignited the dry, parched grass in the area and a fire spread," are more likely than "a sudden lightning strike caused the fire to spread." To get a general sense of this variable, we can average the surprise over all of its values, the higher-chance ones and the lower-chance ones, and derive a number called the entropy of the variable. This number, entropy, is crucial to understanding how decision trees make their Yes/No question splits.
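If you'd like to see this idea in code, here's a quick sketch of "surprise" (information content) and entropy for a made-up fire-cause variable; the probabilities below are purely illustrative:

import numpy as np

# Illustrative (made-up) probabilities for the "cause of fire" variable
p_causes = {'dry-grass spark': 0.95, 'lightning strike': 0.05}

# Information content ("surprise") of a single event: -log2(p)
for cause, p in p_causes.items():
    print(f"{cause}: surprise = {-np.log2(p):.2f} bits")

# Entropy = the average surprise over all events: -sum(p * log2(p))
entropy = -sum(p * np.log2(p) for p in p_causes.values())
print(f"entropy of the variable: {entropy:.2f} bits")

The rare lightning strike carries far more surprise than the everyday spark, and the entropy averages those surprises together.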

How the Decision Tree Makes Its Classification

Remember how I said earlier that your intuitive goal, acting as a decision tree classifier, is to pick the "right" features so that as you go down the tree making Yes/No (or True/False) decisions, you end up with a branch or two made purely of the "dead passenger" class and another one or two branches made purely of the "alive" class? In a sense, by asking the right Yes/No questions (choosing the right feature attributes), we're traveling down the tree of our choices and making the final branches pure. In a decision tree we want to choose features that increase the class "purity" as we travel down the branches.

Now, if a final branch contains all, or most, of the "alive passenger" class, notice something: the proportion of "alive" to "dead" passengers in that branch is very uneven; there are way more alive ones. In the language of the lightning-strike analogy, that branch has "high purity," which is the same as "low entropy." Why? Because the "alive" class is much more likely in that branch; there are simply more "alive-class" data points there, so there's very little surprise left. Symmetrically, in the final branch for "dead-class" data points, the entropy is ideally also low, because that branch is made up almost purely of "dead passenger" data. So our goal in the decision tree is to find the "right" features to split on so we can lower the entropy as much as possible using True/False questions. This "lowering of entropy" down the decision tree is also called "increasing information gain": reducing the entropy at each split gains us information about the class, and that's what makes a good classifier.
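To make "lowering the entropy" concrete, here's a small sketch (with made-up passenger counts) of how the information gain of a single True/False split could be computed: it's the parent node's entropy minus the weighted average of the child nodes' entropies.

import numpy as np

def entropy(counts):
    """Entropy of a node, given its class counts, e.g. [dead, alive]."""
    p = np.array(counts) / sum(counts)
    p = p[p > 0]  # skip empty classes (0 * log 0 is treated as 0)
    return -np.sum(p * np.log2(p))

# Made-up example: a parent node with 50 "dead" and 50 "alive" passengers
parent = [50, 50]
# A candidate True/False split sends the passengers into two child branches,
# each of which is now much "purer" than the parent
left, right = [45, 5], [5, 45]

weighted_child_entropy = ((sum(left) / sum(parent)) * entropy(left)
                          + (sum(right) / sum(parent)) * entropy(right))
info_gain = entropy(parent) - weighted_child_entropy
print(f"information gain of this split: {info_gain:.3f} bits")

The tree tries candidate splits and keeps the one with the highest information gain.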

Many Decision Trees Together = Random Forest!

Now that we’ve gone through an intuitive feel for how a decision tree works, we do need to mention a few things. Decision trees are very interpretable (because you can literally read them left-to-right and top-to-bottom) but they are prone to overfitting. Overfitting happens when machine learning algorithms fit their model too closely to the data, so they can’t generalize well on test sets.

How can we fix overfitting? Well, instead of looking at just one decision tree, we look at a collection of them, say 10 or 20, and we let the trees vote on the classification. Say we are classifying diabetes in hospital patient records. If we train 20 decision trees on random subsets of the data, and for a new, unseen patient record 15 of the trees say "Yes, this patient has diabetes!" while only 5 trees say "No!", the majority vote (like in a democracy!) means we predict diabetes for that patient. This collection of decision trees making a democratic vote on a classification is the random forest.

One important note I want to make before going onto Python implementation: the 20 decision trees in the example above are trained by sampling subsets of the data; each decision tree is built from a different random sample of the data — this is called bootstrapping (it comes from a term in statistics called bootstrap sampling — when you need to get information about your population but you only have a small sample to work with; say, a small community survey.) When we do the “democratic vote” among the 20 decision trees in the random forest, that’s called aggregating — and so combining these two methods is called bootstrap aggregating, or “bagging.” If you’d like to understand more details about this “bagging” process in random forests, check out this post on Medium!
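Here's a rough sketch of that bootstrap-plus-vote idea written out with plain scikit-learn decision trees (in practice, RandomForestClassifier handles all of this for you, and it also randomly sub-samples the features at each split); it assumes numeric feature arrays and 0/1 class labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_new, n_trees=20, seed=0):
    """Train n_trees decision trees on bootstrap samples and majority-vote."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample rows with replacement
        tree = DecisionTreeClassifier(criterion='entropy')
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_new))
    # Majority vote across trees: predict 1 when at least half the trees say 1
    return (np.mean(votes, axis=0) >= 0.5).astype(int)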

If you'd like a video that goes deeper into the mathematics of decision trees than I could in this post, so you can get a more solid understanding of information gain, entropy, and the other concepts we discussed above, check out this awesome video by the YouTube channel "StatQuest" :)

Python Implementation

Now that we've gone through some conceptual context behind what random forests and decision trees are, and how they make their decisions, let's actually implement the algorithm in Python!

For this implementation, I’ll be using real-life recent data from patients at the Sylhet Diabetes Hospital in Sylhet, Bangladesh. The data was collected and published just last year in June 2020, in this research paper by Dr. MM Faniqul Islam and others (cited below), and is freely available on the UC Irvine Machine Learning Repository at this link.

First, you'll need to import the CSV file once it's downloaded from the Repository. The dataset, once loaded into a pandas DataFrame, should look like this:

Image credit: Raveena Jayadev, author
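Loading the CSV might look something like this (a minimal sketch; I'm assuming the file keeps the name the UCI repository gives it, diabetes_data_upload.csv, and sits in your working directory):

import pandas as pd

# Load the Sylhet diabetes data (file name assumed from the UCI download)
DBdata = pd.read_csv('diabetes_data_upload.csv')
print(DBdata.shape)
DBdata.head()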

Once we've downloaded the data, it's good standard practice to convert the "Yes" and "No" answers into 1s and 0s, respectively, so everything is in a numerical format. We can do that with the following code:

binary_dict = {'Yes': 1, 'No': 0}
gender_dict = {'Male': 0, 'Female': 1}

DBdata_copy = DBdata.copy()  # work on a copy so the original frame stays intact
DBdata_copy['Gender'] = DBdata['Gender'].map(gender_dict)

# Loop through each Yes/No column (everything after Age and Gender, except the class label)
for name in DBdata.columns[2:-1]:
    DBdata_copy[name] = DBdata[name].map(binary_dict)

Anyway, once we’ve converted the data numerically, it should look something like this:

Image Credit: Raveena Jayadev, author

All we have to do now is use the random forest classifier from Python's awesome scikit-learn library. We can instantiate and fit it like this:

from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=20, criterion='entropy', n_jobs=-1)
rf_classifier.fit(X_train, y_train)
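Here, X_train and y_train come from a standard train/test split of the numerically converted data. A minimal sketch of that step (I'm assuming the label column in the DataFrame is named 'class' and holds 'Positive'/'Negative' values, as in the UCI file):

from sklearn.model_selection import train_test_split

# Features: everything except the label column; target: the label column
X = DBdata_copy.drop(columns=['class'])
y = DBdata_copy['class'].map({'Positive': 1, 'Negative': 0})  # assumed label values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)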

Notice one of the parameters of RandomForestClassifier: criterion. Where I've put criterion='entropy', that's the same "entropy" I was talking about earlier, when explaining how the trees in a random forest pick their features to split on.

Awesome! Once we've fit our random forest classifier, we can plot one of the most important visualizations for forests: feature importances. This can be done with a horizontal bar plot (barh) in Matplotlib, along with the rf_classifier.feature_importances_ attribute.

Image Credit: Raveena Jayadev, Author
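The plot above could be produced with something like this (a quick sketch; I'm assuming X_train is still a pandas DataFrame, so its column names are available):

import matplotlib.pyplot as plt
import pandas as pd

# Pair each feature name with its importance and sort for a cleaner plot
importances = pd.Series(rf_classifier.feature_importances_,
                        index=X_train.columns).sort_values()
importances.plot.barh(figsize=(8, 6))
plt.xlabel('Feature importance')
plt.title('Random forest feature importances')
plt.tight_layout()
plt.show()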

It looks like “Polydipsia”, “Polyuria”, and age were 3 of the most important factors the Random Forest used to classify diabetes. Awesome!

Visualizing the Actual Decision Tree

Now here comes the fun part that probably all of you have been waiting for: we're going to visualize one of the decision trees in this random forest. To do this, we'll first need the package "graphviz". Graphviz is a general-purpose graph-visualization tool that scikit-learn can export decision trees to, and you can install the Python package with pip install graphviz or, in an Anaconda environment, conda install -c conda-forge python-graphviz (note that the Python package also needs the Graphviz system binaries installed).

We can choose one of the individual "estimators" in the random forest (each estimator is a decision tree) and program the visualization like this. Here, I called the decision tree I chose "estimator_0":

from sklearn.tree import export_graphviz
import graphviz

estimator_0 = rf_classifier.estimators_[0]  # pick one tree out of the fitted forest
tree_estimator = export_graphviz(estimator_0, out_file=None,  # return the DOT source as a string
                                 feature_names=features,
                                 class_names=target,
                                 filled=True)  # color the nodes by class purity
# Draw tree
graph = graphviz.Source(tree_estimator, format="png")

And the decision tree looks something like this! You should be able to click on the image and zoom into it with an additional tool.

Image Credit: Raveena Jayadev, Author

Let's look at the word "value" in each of the boxes: what does that mean? When you see something like value=[149, 215] at the root node, it means that before any True/False feature splits, 149 samples were of the "patient-has-no-diabetes" (Negative) class and 215 were of the "patient-has-diabetes" (Positive) class. If you recall from the "Information Gain" section of this post, I mentioned that the goal of a decision tree is to make the "best" True/False splits so that at the very bottom of the tree we get "pure" nodes containing only one class or the other. The trees aren't perfect, so the bottom-most nodes aren't always 100% pure, but look near the bottom of the picture above: at the "Age ≤ 39.5, entropy=0.8" split, the values say [0, 2] in the blue box on the left and [6, 0] in the orange box on the right. This means the Age ≤ 39.5 split was good enough to create pure nodes: on the left, only positive-diabetes patients, and on the right, only negative ones. This is fantastic! As you can see, the farther down the decision tree you go, the more cleanly the splits separate the data into pure nodes.

Another easy, visual way to see how the "purity" of each node in the decision tree increases (and hence, how good a feature split is) is the hue of its box: the darker the hue (dark blue or dark orange), the purer the node. Notice how at the top of the tree the nodes are generally light brown, light blue, or light orange, and as you go down, the colors become darker.

In terms of accuracy, random forests do a fantastic job because they average over many decision trees. By themselves, decision trees overfit, but as a team they make great majority-rule classifications. In our example, the random forest was 94% accurate, with a false-negative rate of 2.2%, which is great for a diabetes detector, since it's riskier for a machine to produce a false negative and tell a patient they don't have diabetes when they actually do.
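Those numbers come from evaluating the fitted forest on the held-out test set. Here is a sketch of how you could compute them (using the X_test/y_test split from earlier, with 1 meaning diabetes-positive; note that "false-negative rate" can be defined in slightly different ways, and here I use FN / (FN + TP)):

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = rf_classifier.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# Rows of the confusion matrix are the true classes, columns are predictions (0, then 1)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("false-negative rate:", fn / (fn + tp))  # diabetic patients the model missed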

Conclusion

I hope you, the reader, have gotten a better intuitive feel for random forests, decision trees, and how they make their classification decisions. I also hope you've learned the basics of implementing the code in Python and interpreting each decision tree. You can check out my GitHub code here for downloading the diabetes data and importing graphviz for the visualizations of the forest.

See you next time! :)

References

  1. Islam, MM Faniqul, et al. “Likelihood prediction of diabetes at early stage using data mining techniques” (2020), Computer Vision and Machine Intelligence in Medical Image Analysis. Springer, Singapore. pp. 113–125.

I recently earned my B.A. in Mathematics and I'm interested in AI's social impact & creating human-like AI/ML systems. @raveena-jay.