Outside the garage, the growls and snarls did not stop. He could not believe that the Zombie Apocalypse he had watched so many times in series and movies was finally on his front porch. He could wait hidden in the garage for a while, but he would have to come out eventually. Should he take an axe with him, or would the rifle be enough? He could try to find some food, but should he go alone? He tried to remember all the zombie movies he had seen but could not settle on a single strategy. If only he had a way of remembering every scene where a character is killed by zombies, would that be enough to increase his chances of survival? If he just had a decision guide, everything would be simpler…
Introduction
Have you ever watched one of those zombie apocalypse movies where there is one character who always seems to know where the zombies are hidden, or whether it is better to fight them or run away? Does this person really know what is going to happen? Did someone tell them beforehand? Maybe there is nothing magical about this. Maybe this person has read a lot of comics about zombies and is really good at knowing what to do in each case and at learning from others’ mistakes. How important it is to find the best way of using the events of the past as a guide for our decisions! This guide, also known as a decision tree, is a widely used supervised learning algorithm. This article is an introductory discussion of decision trees, how to build them and why many of them together create a random forest.
A simple decision
You are in the middle of the zombie mayhem and you want to know how to increase your chances of survival. At this point, you only have information from 15 of your friends. For each one of them you know whether they were alone, whether they had a vehicle or a weapon, and whether they were trained to fight. Most importantly, you know whether they survived or not. How can you use this information to your advantage?
Table 1 summarizes the characteristics and outcomes of your 15 friends. You want to be like the 3 of them who survived in the end. What do these 3 friends have in common? A simple inspection of the table tells us that the three survivors shared all of these traits: they were not alone, they were trained to fight, and they had a vehicle and a weapon. So, will you be able to survive if you have all four of these things? Past experiences are telling us that you might! If you had to decide what to take with you and whether to be on your own or not, at least you now have some historical data to support your decision.
A harder decision
Zombie apocalypses are never as simple as they look. Let’s say that instead of the 15 friends from the previous example, you now have the following 15 friends:
This time, reaching a conclusion by visual inspection alone is not that simple. The one thing we know for sure is that if you want to survive, you had better have someone by your side: the 5 people who survived were not alone (Figure 1). Beyond this, it is difficult to see whether there is a particular combination of things that will lead you to survival. Some people did not survive even though they were not alone. What went wrong for them? If you know you will not be alone, what else can you do to increase your chances of surviving? Is there anything like a decision roadmap?
The Decision Tree
We can find some answers to the previous questions in a decision tree. A decision tree is a model that predicts an outcome according to the decisions we make, and it is built from previous experiences. In our example, we can build a decision tree using the characteristics of the 15 friends and their outcomes. A decision tree consists of multiple decision nodes connected by branches. At each of these nodes, we make a decision that takes us to the following node, until we reach an outcome.
Growing a decision tree
If someone asks you to draw a genealogical tree, you would probably start with your grandparents or great-grandparents. From there, the tree grows through your parents, uncles and cousins, until it reaches you. In a similar way, to grow a decision tree you always start from the node that best separates your data; from that point, the tree keeps growing according to the feature that best divides the remaining data. There are many algorithms you can use to grow a decision tree. This article explains how to use information gain and the Shannon entropy.
Let’s focus on Table 2. We can see that there are 5 people who survived and 10 who died. This means that the probability of surviving is 5/15 = ⅓ and the probability of dying is ⅔. With this information, we can calculate the entropy of this distribution. In this case, entropy refers to the average level of surprise or uncertainty in this distribution. To calculate the entropy we use the following equation:
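Entropy = −p(surv) · log₂ p(surv) − p(die) · log₂ p(die)

Here p(surv) and p(die) are the probabilities of surviving and dying, and the logarithm is taken in base 2 so that the entropy of a 50/50 split is exactly one.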
Note that this equation can also be expressed in terms of only one of the probabilities, since p(surv) + p(die) = 1. If we plot this function, you can see how the entropy reaches its highest value of one when both p(surv) and p(die) are equal to 0.5. On the contrary, if the whole distribution corresponds to people who all survived or all died, the entropy is zero. So, the higher the entropy, the higher the uncertainty. The lower the entropy, the more homogeneous the distribution is and the less "surprised" we will be about the outcome.
In our case, the number of survivors is less than half of the entire population. It would be reasonable to think that most people do not survive the zombie apocalypse. The entropy in this case is 0.92, which is what you read off the blue curve of Figure 2 at x = ⅓ (or ⅔), or what you get when you apply the following equation:
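Entropy = −(⅓) · log₂(⅓) − (⅔) · log₂(⅔) ≈ 0.92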
Now that we know the entropy, or degree of uncertainty, of the entire distribution, what should we do? The next step is to find how to divide the data so that we reduce that level of uncertainty as much as possible.
The premise of information gain is to choose the decision node that reduces the entropy of the previous node the most. At this point, we are trying to find the best first separation of the data. Is it the fact that we are alone, that we know how to fight, or that we have a vehicle or a weapon? To find out, we can calculate the information gain of each one of these choices and then pick the one with the biggest gain. Remember that we are trying to maximize the reduction in entropy, which measures the heterogeneity or level of surprise in the outcome distribution.
Are you trained to fight?
Is this the first question you should ask yourself in this case? Will this question produce the biggest reduction in the entropy of the outcome distribution? To find out, let’s calculate the entropy of each of the two cases: we know how to fight and we do not know how to fight. Figure 3 shows that of the 9 people who knew how to fight, only 5 survived. In contrast, none of the 6 people who were not trained to fight survived.
To calculate the entropy of the previous cases we can apply the same equation we used before. Figure 3 shows how the entropy for the case in which people were trained to fight is 0.99, whereas in the other case the entropy is zero. Remember that an entropy of zero means no surprise: a homogeneous distribution. That is exactly what happens here, since none of the people who were not trained to fight survived. At this point, it is important to note that the entropy calculation in this second scenario contains a term of the form 0·log(0), which looks undefined at first sight. In these cases, you can always take the limit, for example with L’Hôpital’s rule, as it is explained in this article.
We now need to calculate the information gain of this decision. This is the same as asking how much the uncertainty in the outcome would drop if we split the data according to this question. The information gain is calculated by subtracting the entropy of each resulting node from the entropy of the parent node, with each term weighted by the number of individuals that fall into that node. So a node with a large entropy can have a small effect on the information gain if only a few people end up in it. For this example, the information gain from the ability to fight is 0.32, as shown in Figure 3.
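Putting the numbers from Figure 3 together (9 trained people with entropy 0.99, 6 untrained people with entropy 0, and the overall entropy of 0.92 computed earlier):

Gain(fight) = 0.92 − (9/15) · 0.99 − (6/15) · 0 ≈ 0.32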
Can you survive the zombies all by yourself?
We can do a similar analysis for the possibility of surviving the zombie apocalypse alone or with someone else. Figure 4 shows the calculation. In this case, the information gain is 0.52. Note how being alone never led to survival, whereas 5 of the 7 people who were not alone survived.
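As a sanity check, here is a minimal Python sketch of these entropy and information gain calculations. The counts are the ones quoted above and in Figures 3 and 4, and the function names are only illustrative:

```python
from math import log2

def entropy(p_survive):
    """Shannon entropy (base 2) of a survive/die distribution."""
    if p_survive in (0, 1):
        return 0.0  # a pure node has no uncertainty
    p_die = 1 - p_survive
    return -p_survive * log2(p_survive) - p_die * log2(p_die)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropies of the child nodes.

    `parent` and each element of `children` are (survivors, total) pairs.
    """
    parent_surv, parent_total = parent
    gain = entropy(parent_surv / parent_total)
    for child_surv, child_total in children:
        gain -= (child_total / parent_total) * entropy(child_surv / child_total)
    return gain

print(round(entropy(5 / 15), 2))  # 0.92 -> the whole group of 15 friends
print(round(entropy(5 / 7), 2))   # 0.86 -> the "not alone" node
# "Are you alone?": 7 not alone (5 survived) and 8 alone (0 survived)
print(round(information_gain((5, 15), [(5, 7), (0, 8)]), 2))  # 0.52
# "Are you trained to fight?": 9 trained (5 survived) and 6 untrained (0 survived)
print(round(information_gain((5, 15), [(5, 9), (0, 6)]), 2))  # 0.32
```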
What about having a vehicle or a weapon?
For these two cases, we can calculate the information gain as we did before (Figure 5). You can see how these information gains are smaller than the ones calculated previously. This means that, at this point, it is better to divide our data according to the two previous features than according to these two. Remember that the biggest information gain corresponds to the feature that produces the largest weighted reduction in entropy. Once we have calculated the information gain for every feature, we can decide what the first node of the decision tree will be.
The first node
Table 3 shows the information gain for each feature. The biggest information gain corresponds to whether a person was alone or had a companion. This node gives us the first decision in our tree: you will not survive alone. The 8 people who were alone did not make it, regardless of whether they had a weapon, a car or were trained to fight. This is the first thing we can infer from our analysis, and it supports what we had already concluded by simply inspecting the data.
At this point, the decision tree looks like Figure 5. We know that there is no chance of surviving if we are alone (at least according to the data we have). If we are not alone, then we might survive, but not in every case. Since we can calculate the entropy at the right node in Figure 5, which is 0.86 (this calculation is shown in Figure 4), we can also calculate the information gain of the other three features and decide what the next decision node will be.
The second node
Figure 5 shows that the biggest information gain at this point comes from the weapon feature, so that is the next decision node, as shown in Figure 6. Note how all the people who were not alone and had a weapon survived, which is why the left side of the weapon node ends in a survival leaf.
The tree is complete
There are still 3 people who were not alone and did not have a weapon that we need to classify. If we follow the same process explained previously, we will find that the next feature with the biggest information gain is the vehicle. So we can add a node to our tree in which we ask whether a particular person had a vehicle. That divides the remaining 3 people into a group of 2 who did have a vehicle but did not survive and a single person without a vehicle who survived. The final decision tree is presented in Figure 7.
The problem with decision trees
As you can see, the decision tree is a model built from previous experiences. Depending on the number of features in your data, you will encounter multiple questions that guide you to the final answer. It is important to notice how, in this case, one of the features is not represented in the decision tree. The ability to fight was never selected as a decision node, since the other features always had a bigger information gain. This means that, according to the input data, being trained to fight is not important for surviving a zombie apocalypse. However, it could also mean that we did not have enough samples to determine whether the ability to fight matters. The key here is to remember that a decision tree is only as good as the input data we use to build it. In this case, a sample of 15 people might not be enough to get a good estimate of the importance of being trained to fight. This is one of the problems with decision trees.
As with other supervised learning approaches, decision trees are not perfect. On the one hand, they rely heavily on the input data: a small change in the input data can lead to important changes in the final tree, so decision trees are not very good at generalizing. On the other hand, they tend to overfit. In other words, we can end up with a complex decision tree that works perfectly on the input data but fails dramatically on a test set. These problems can be even more pronounced when we use decision trees with continuous variables instead of categorical ones like the ones in this example.
One way of making decision trees more efficient is to prune them. This means stopping the algorithm before it reaches pure nodes like the ones we reached in our example, or removing branches that do not provide any improvement in the accuracy of the tree. Pruning gives the decision tree more generalization power. However, if we decide to prune our decision tree, we might start asking additional questions: when is the right moment to stop the algorithm? Should we stop after we reach a minimum number of samples? Or after a predefined number of nodes? How do we determine these numbers? Pruning can definitely help us avoid overfitting, but it leads to additional questions that are not so simple to answer.
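As a rough illustration of what these pruning knobs look like in practice, scikit-learn’s DecisionTreeClassifier exposes several of them; the values below are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing before every leaf becomes pure.
pre_pruned_tree = DecisionTreeClassifier(
    max_depth=4,           # at most 4 questions from the root to any leaf
    min_samples_split=10,  # do not split a node with fewer than 10 people
    min_samples_leaf=5,    # every leaf must keep at least 5 people
)

# Post-pruning: grow the full tree, then collapse branches that add
# little accuracy (cost-complexity pruning).
post_pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01)
```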
What about a full forest instead of a tree?
What if, instead of a single decision tree, we have multiple decision trees? They will differ according to the portion of the input data they take, the features they consider and their pruning settings. We end up with many decision trees and many different answers, but we can always go with the majority in the case of a classification task, or with the average if we are working on a regression. This can help us generalize better over the distribution of our data. We might suspect that one decision tree is misclassifying, but if 10 or 20 trees reach the same conclusion, then there might be no misclassification after all. Basically, we are letting the majority decide instead of guiding ourselves by one single decision tree. This methodology is called Random Forest.
The concept of Random Forests is usually related to the concept of Bagging, a process where random samples of the training set are selected with replacement. This means that individual data points can be chosen more than once. In the Random Forest methodology, we take a random sample of points, build a decision tree, and then do this again until we have multiple trees. The final decision then comes from combining the answers of all the trees.
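A minimal sketch of what "selected with replacement" means, using NumPy; the 15 rows simply stand in for the friends of the earlier example:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_rows = 15  # e.g. the 15 friends from the earlier tables

# One bootstrap sample: draw 15 row indices *with replacement*,
# so some friends can appear more than once and others not at all.
bootstrap_indices = rng.choice(n_rows, size=n_rows, replace=True)
print(sorted(bootstrap_indices))

# In Bagging, each tree of the ensemble is trained on its own bootstrap sample.
```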
Random Forests is a well-known ensemble method used for classification and regression problems. It has been applied in many industries such as finance, healthcare and e-commerce [1]. Although the original idea of Random Forests was developed gradually by many researchers, Leo Breiman is commonly credited as the creator of this methodology [2]. His personal webpage contains a detailed description of Random Forests and an extensive explanation of how and why they work. It is a long but worthwhile read.
An additional and important thing to understand about random forests is the way in which they work with the features of the dataset. At each node, the random forest randomly selects a pre-defined number of features, instead of all of them, to decide how to split that node. Remember that in the previous example we analyzed the information gain of every feature at each level of the decision tree. In contrast, a random forest only analyzes the information gain of a subset of the features at each node. So, the random forest mixes Bagging with random feature selection at each node.
A true zombie apocalypse
Let’s go back to the zombies! The previous example was really simple: we had data from 15 people and we only knew 4 things about each of them. Let’s make this harder! Say that we now have a dataset with more than one thousand entries and 10 features for each one of them. This dataset was randomly generated in Excel and does not belong to any commercial or private repository; you can access it from this GitHub page.
As is usual with these types of methodologies, it is a good idea to separate the entire dataset into a training and a testing set. We will use the training set to build the decision tree and random forest models and then evaluate them with the test set. For this purpose, we will use the scikit-learn library. This Jupyter Notebook contains a detailed explanation of the dataset, how to load it and how to build the models using the library.
The entire dataset contains 1024 entries, of which 212 (21%) correspond to survivals and 812 (79%) to deaths. We divided this dataset into a training set with 80% of the data (819 entries) and a testing set with the remaining 205 entries. Figure 8 shows how the ratio between survivals and deaths is maintained in all sets.
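The split itself can be done with a single scikit-learn call. The sketch below uses randomly generated placeholder data with the same shape as the zombie dataset (the real data lives in the Notebook), and the stratify option is an assumption about how the survival ratio was preserved:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data shaped like the zombie dataset: 1024 entries,
# 10 yes/no features encoded as -1/+1 (see below), and a survived (1) / died (0) label.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(1024, 10))
y = rng.choice([0, 1], size=1024, p=[0.79, 0.21])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # roughly 205 of the 1024 entries end up in the test set
    stratify=y,       # keep the ~21% survival ratio in both sets (Figure 8)
    random_state=0,
)
print(X_train.shape, X_test.shape)  # (819, 10) (205, 10)
```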
Regarding the features, this time we have 6 additional characteristics for each individual:
- Do you have a radio?
- Do you have food?
- Have you taken a course in outdoor survival?
- Have you taken a first aid course?
- Have you had a zombie encounter before?
- Do you have a GPS?
These 6 features, combined with the 4 features we already had, represent 10 different characteristics for each individual or entry. With this information, we can build a decision tree following the previously explained steps. The Jupyter Notebook uses the function DecisionTreeClassifier to generate a decision tree. Note that this function is not meant to work with categorical variables, so we have converted all the answers for each category to -1 or +1: every time we see a -1 in the results it means "No", whereas a +1 means "Yes". This is better explained in the Jupyter Notebook.
The Notebook explains how to load the data, call the decision tree function and plot the results. Figure 9 shows the decision tree that was built with the 819 entries of the training set (click here for a bigger picture). The dark blue boxes correspond to final decision nodes in which the answer was survival, whereas the dark orange boxes represent final decision nodes where the answer was death. You can see how the first decision node corresponds to the vehicle and, from there, the tree grows according to the different features.
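A sketch of those two steps, continuing from the placeholder split above (the real Notebook works on the actual dataset); criterion="entropy" is chosen here to match the information gain approach described earlier:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Continuing from the placeholder X_train / y_train of the earlier sketch.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)

# Draw the fitted tree; with the real data this is essentially Figure 9.
plt.figure(figsize=(20, 10))
plot_tree(tree, class_names=["died", "survived"], filled=True)
plt.show()
```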
We can evaluate how good this tree is by using the test set inputs to predict the final categories and then comparing these predictions with the true outcomes. Table 4 shows a confusion matrix with the number of times the decision tree misclassified an entry. We can see that the test set had 40 cases of survival and the decision tree correctly classified only 25 of them. On the other hand, of the 165 cases that did not survive, the decision tree misclassified 11. The ratio between the correct classifications and the 205 test points is 0.87, which is usually known as the prediction accuracy score.
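The comparison itself takes a couple of scikit-learn calls; again, this is a sketch continuing from the previous snippets, with the real numbers coming from the Notebook:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Predict the outcome of the test entries and compare with the true labels.
y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class
print(accuracy_score(y_test, y_pred))    # fraction of correctly classified entries
```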
An accuracy of 87% does not look bad, but can we improve it using a random forest? The next section of the Jupyter Notebook contains an implementation of a random forest using the sklearn function RandomForestClassifier. This random forest contains 10 decision trees, each built from a random sample of the training entries and considering only 3 features at each split. Each of the decision trees in the random forest uses 682 entries, which represent about 84% of the full training set. So, just to be clear, the random forest process will (a scikit-learn sketch follows the list):
- Take a random subset of 682 entries from the training set
- Build a decision tree that considers 3 randomly selected features at each node
- Repeat the previous steps 9 additional times
- Produce predictions by majority vote over the 10 decision trees
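A hedged sketch of that process with scikit-learn’s RandomForestClassifier, continuing from the earlier placeholder split; the parameter values mirror the description above, and max_samples is an assumption about how the Notebook limits each tree to 682 entries:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

forest = RandomForestClassifier(
    n_estimators=10,  # build 10 decision trees
    max_features=3,   # consider only 3 randomly chosen features at each split
    bootstrap=True,   # each tree gets its own random sample, drawn with replacement
    max_samples=682,  # assumed setting: 682 entries per tree, as described above
    random_state=0,
)
forest.fit(X_train, y_train)

# The forest predicts by majority vote over its 10 trees.
print(accuracy_score(y_test, forest.predict(X_test)))
```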
Table 5 shows the confusion matrix for the results coming from the random forest. We can see that these results are better than what we were getting with a single decision tree: the random forest misclassifies only 11 entries and has a prediction accuracy score of 0.95, higher than the decision tree’s 0.87.
It is important to take into account that the random forest methodology is not only as good as the input data we have but also as good as the parameters we choose. The number of decision trees we build and the number of features we analyze at each split have an important effect on the final result. So, as with many other supervised learning algorithms, it is necessary to spend some time tuning the parameters until we find the best possible result.
Conclusions
Going through this article is just like being that guy in the movies who managed to escape from the zombie chasing him because a tree branch fell on the zombie’s head at just the right time! This is not the only zombie he will encounter, and he is definitely not out of the woods yet! There are many things about Random Forests and Decision Trees that were not even mentioned in this article. However, what we covered is enough to understand the usage and applicability of these methods. Currently, there are multiple libraries and programs that build these models in seconds, so you probably don’t need to go through the entropy and information gain calculations again. Still, it is important to understand what is happening behind the curtain and how to correctly interpret the results. In a world where topics such as "Machine Learning", "Ensemble Methods" and "Data Analytics" become more common every day, it is important to have a clear idea of what these methods are and how to apply them to everyday problems. Unlike in zombie apocalypse survival movies, being ready does not happen by chance.
References
- [1] IBM. What is random forest?
- [2] Louppe, Gilles (2014). Understanding Random Forests. PhD dissertation, University of Liège.