
Go Out For Exercise Or Not? Let Data Science Decide

An introduction to the Decision Tree Machine Learning Algorithm

Photo by Sorbyphoto on Pixabay

As one of the most popular classic machine learning algorithms, the Decision Tree is much more intuitive than most others because of its explainability. Let’s consider the following conversation.

Jade: "Should we go out running tomorrow?"
Chris: "How about the weather?"
Jade: "Let me check it on my phone. It’s a sunny day!"
Chris: "That sounds good! How about the temperature then?"
Jade: "Hmmm… 35 degrees."
Chris: "Oh, it’s too hot. I would prefer to go swimming in an indoor swimming pool."

We make many decisions in our life. If we think about why we made those decisions, most of the time there is a "decision tree" behind them, just like the conversation above. That’s why the Decision Tree is probably the most intuitive machine learning method, the one closest to how the human mind works.
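
In fact, the conversation above is nothing more than a couple of nested conditions. Here is a minimal sketch in Python; the 30-degree threshold and the non-sunny branch are my own assumptions for illustration:

```python
def choose_activity(weather: str, temperature: float) -> str:
    """A hard-coded version of the decision process in the conversation above."""
    if weather == "sunny":
        if temperature >= 30:        # "35 degrees ... it's too hot"
            return "go swimming in an indoor pool"
        return "go out running"
    return "do not go out"           # assumption: any non-sunny weather means staying in
```

A Decision Tree algorithm learns exactly this kind of branching structure from data instead of having it hand-written.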

Building Blocks of a Decision Tree

Photo by PublicDomainPictures on Pixabay

How is a Decision Tree built as a machine learning model? There are several popular algorithms for building one, but they all involve two steps: constructing the tree and pruning the tree.

Constructing a Decision Tree

To construct a Decision Tree, an algorithm needs to generate three types of nodes:

  • Root Node: A Decision Tree has only one root node, which sits at the top of the tree.
  • Internal Node: These nodes sit between their parent node and their child nodes.
  • Leaf Node: These nodes NEVER have child nodes; they are the "decisions" made by the Decision Tree.
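
To make these three roles concrete, here is a minimal sketch of how such a node could be represented in Python. The field names are illustrative only and are not taken from any particular library:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TreeNode:
    feature: Optional[str] = None        # the feature this node splits on (None for leaf nodes)
    children: Dict[str, "TreeNode"] = field(default_factory=dict)  # feature value -> child node
    decision: Optional[str] = None       # the "decision"; only set for leaf nodes

    @property
    def is_leaf(self) -> bool:
        return not self.children
```

The Root Node is simply the TreeNode that no other node points to; Internal Nodes carry a feature and children; Leaf Nodes carry only a decision.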

Then, the problem becomes how a typical Decision Tree algorithm generates/decides these types of nodes:

  • Which feature should be chosen as the Root Node?
  • How are the "conditions" that split parent nodes into child nodes decided?
  • When should the algorithm stop generating branches and end up with Leaf Nodes?

These vary between different types of Decision Tree algorithms and will be discussed in a later section.

Pruning the Decision Tree

It sounds like we are done once the tree is built, but not really. Most of the time, we have to prune the tree to avoid "overfitting".

As shown in the above graph, let’s imagine that each boundary that separates the red samples from the blue samples is a condition that splits a node in the Decision Tree. The tree on the left seems to be overfitted to the training examples. After pruning the tree, we expect it to become something like the one on the right.

The reason for overfitting is usually that the training dataset is small while too many "features" are considered for splitting. The most interesting characteristic of Decision Tree models is that they can be fitted perfectly to the training dataset, i.e. classify 100% of the training samples correctly. However, this is not what we want, because it almost always means the tree we’ve built has lost its generalisation ability. In other words, it cannot be used on a real problem, only on the training dataset.

Generally, there are two types of tree pruning mechanisms:

  • Pre-Pruning

Prune the tree while constructing it. That is, before splitting a node and applying the change to the tree, test whether the split will improve the classification accuracy on the test dataset. If so, keep the branch. Otherwise, the current node is not split and consequently becomes a leaf node.

  • Post-Pruning

The algorithm first builds the full Decision Tree regardless. Then, going backwards from the leaf nodes to the root node, it evaluates every branch to test the influence of each split. If a branch does not improve performance much on the test dataset, or if removing it even improves performance, it is pruned away.
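
In practice, you rarely implement pruning by hand. As a sketch of how this looks with an off-the-shelf library, scikit-learn’s DecisionTreeClassifier (which implements an optimised CART rather than ID3) exposes pre-pruning as hyperparameters such as max_depth and min_samples_leaf, and post-pruning via cost-complexity pruning (the ccp_alpha parameter). The values below are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: stop growing branches early.
pre_pruned = DecisionTreeClassifier(
    criterion="entropy",      # split using entropy / information gain
    max_depth=3,              # never grow deeper than 3 levels
    min_samples_leaf=5,       # a leaf must contain at least 5 samples
)

# Post-pruning: grow the full tree, then prune it back by cost-complexity.
post_pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01)

for name, model in [("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```

Which mechanism (or which combination) works best is usually decided by comparing performance on held-out data, exactly as described above.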

How the Algorithm Builds the Decision Tree

Photo by FeeLoona on Pixabay

No matter what kind of algorithm we use to build the Decision Tree, two common and important criteria will be used: Purity and Entropy.

Purity

It is fairly easy to understand what purity is: for a single feature value, how pure are the decisions it corresponds to? Let’s use the above sample dataset to demonstrate.

  • When the weather is rainy, all the decisions are "No" (100% Purity)
  • When the weather is sunny, 2 of the decisions are "No" and 1 of them is "Yes"

Therefore, we can say that the purity of the first set is higher than the second.

Entropy

Entropy is almost the reverse concept of purity. It was first introduced by Claude Shannon [1], who transferred the concept from thermodynamics into information theory. It generally refers to the "uncertainty" of information. The formula, in its standard form, is given as follows:
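
$$H(X) = -\sum_{i} P(x_i)\,\log_2 P(x_i)$$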

Don’t be scared by the formula; it is very simple in our context. The set "X" is everything contained in the node, and "xᵢ" refers to the specific decision of each sample. Therefore, "P(xᵢ)" is the probability that a sample in the set carries a certain decision.

Let’s use the same example that we have used in the Purity concept explanation.

  • When the weather is rainy, both of the 2 decisions are "No" (100% Purity)
  • When the weather is sunny, 2 of the decisions are "No" and 1 of them is "Yes"
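
Plugging these counts into the entropy formula gives:

$$H(\text{Rainy}) = -\tfrac{2}{2}\log_2\tfrac{2}{2} = 0$$

$$H(\text{Sunny}) = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.918$$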

The results comply with intuition. Based on the training dataset, when the weather is rainy, it is certain that we will not go out running. In other words, given that the weather is rainy, there is no "uncertainty" at all, so the entropy is 0.

Information Gain

After understanding the concepts of Purity and Entropy, we can use them to build our Decision Tree. One of the most classic and basic Decision Tree algorithms is called "ID3". It splits the nodes based on a concept called "Information Gain", which is calculated by subtracting the weighted sum of the child nodes’ entropies from the parent node’s entropy. The formula, in its standard form, is as follows.
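
$$IG(T, a) = H(T) - \sum_{v \in \text{values}(a)} \frac{|T_v|}{|T|}\, H(T_v)$$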

where

  • "T" is the parent node and "a" is the set of attributes of "T"
  • The notation "|T|" means the size of the set

Again, do not be scared by the formula. Let’s demonstrate it in an example. Suppose we want to use "weather" as our Root Node. We need to calculate its Information Gain.

From the above illustration, we can easily calculate the entropies of the Weather Node (which is the parent node in this context) and its child nodes "Sunny", "Cloudy" and "Rainy" as follows:
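
The exact numbers depend on the counts in the sample dataset. Assuming it contains seven days in total (three sunny days with decisions No/No/Yes, two cloudy days with decision Yes, and two rainy days with decision No, i.e. 3 "Yes" and 4 "No" overall), the entropies work out to roughly:

$$H(\text{Weather}) = -\left(\tfrac{3}{7}\log_2\tfrac{3}{7} + \tfrac{4}{7}\log_2\tfrac{4}{7}\right) \approx 0.985$$

$$H(\text{Sunny}) \approx 0.918, \qquad H(\text{Cloudy}) = 0, \qquad H(\text{Rainy}) = 0$$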

Therefore, the Information Gain of using "Weather" as the Root Node can be calculated as:
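
Under the same assumption about the sample counts:

$$IG(\text{Weather}) \approx 0.985 - \left(\tfrac{3}{7}\times 0.918 + \tfrac{2}{7}\times 0 + \tfrac{2}{7}\times 0\right) \approx 0.592$$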

Similarly, we can use the same method to calculate the Information Gain of the other two features: Temperature and Wind Level.

The Information Gain for using temperature as the root node is 0.522, and for wind level it is 0.306. Therefore, we should use weather as the Root Node because it has the highest Information Gain, 0.592.

Once the Root Node is decided, Information Gain keeps being used to choose the rest of the Internal Nodes, until every branch reaches a Leaf Node, i.e. a decision.
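
To tie the pieces together, here is a minimal, self-contained sketch of the entropy and Information Gain calculations in Python. The small weather/decision lists encode the counts assumed in the worked example above, not the article’s original table:

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy of a list of decisions, e.g. ["No", "No", "Yes"]."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values: Sequence[str], labels: Sequence[str]) -> float:
    """Information Gain of splitting `labels` by `feature_values` (ID3's splitting criterion)."""
    total = len(labels)
    weighted_child_entropy = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        weighted_child_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted_child_entropy


# Assumed counts, consistent with the worked example above (3 "Yes", 4 "No").
weather = ["sunny", "sunny", "sunny", "cloudy", "cloudy", "rainy", "rainy"]
decision = ["No", "No", "Yes", "Yes", "Yes", "No", "No"]

print(round(information_gain(weather, decision), 3))  # -> 0.592
```

An ID3 implementation would compute this gain for every candidate feature, split on the one with the highest value, and then repeat the procedure recursively on each child subset.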

Drawbacks of the ID3 Algorithm (Information Gain)

Photo by geralt on Pixabay

OK. If you understand everything mentioned above, you know all you need to know about how to build a Decision Tree using the ID3 (Information Gain) algorithm. However, there are some significant drawbacks to ID3.

Suppose we have another feature, "Date". Obviously, the "Date" feature is not very useful when deciding whether we should go out running. However, the plain ID3 algorithm tends to choose features with more distinct values as nodes close to the root of the tree, because splitting on such a feature produces many tiny, almost pure child nodes and therefore a very high Information Gain. That means we may end up with a tree that has useless features as its most important nodes.

Of course, this won’t happen every time, but it is possible. Therefore, further algorithms such as C4.5 and CART were developed to improve the performance of Decision Trees.

I’ll keep writing about different kinds of machine learning and data mining algorithms, so keep an eye out if you are interested.


Summary

Photo by viarami on Pixabay

In this article, I have introduced the basics of a very popular and classic machine learning algorithm: the Decision Tree. The major building blocks of a Decision Tree model are constructing and pruning. After that, I introduced the most basic Decision Tree algorithm, ID3. We’ve learned what Purity and Entropy are, and how these useful concepts play important roles in building a Decision Tree. ID3 makes use of Information Gain for constructing a Decision Tree, which is very intuitive and explainable. However, it does have drawbacks that are improved upon in other kinds of algorithms.

My next article will be about other popular Decision Tree algorithms that build on top of ID3 to overcome some of its disadvantages. I hope these articles help you understand more about the Machine Learning/Data Science discipline!

Join Medium with my referral link – Christopher Tao

If you feel my articles are helpful, please consider joining Medium Membership to support me and thousands of other writers! (Click the link above)

Reference

[1] Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379–423.

