
How do Decision Trees and Random Forests Work? 1b

Regression Trees

Photo by Jeremy Bishop on Unsplash

In part 1 of this series, I talked about how decision trees work. (If you haven’t read that article yet, I’d suggest going back and reading it before going on. Go ahead. I’ll wait.) That article was mostly focused on classification – e.g., yes/no, live/die, etc. Well, decision trees can also be used for regression – i.e., prediction of a continuous variable. Many aspects of the decision tree are the same, but predicting the answer is handled a little differently.

For this article, I’m going to use a dataset from a Dockship.io challenge: the Power Plant Energy Prediction AI Challenge. This challenge uses four input variables to predict the energy output (PE) from a power plant. Here are the first six records in the dataset:

Screenshot by author

Where,

  • AT = Ambient Temperature
  • AP = Ambient Pressure
  • V = Exhaust Velocity
  • RH = Relative Humidity

What a regression tree does is take the average of the target variable for a given set of conditions and use that as the prediction. The mean value of PE for the entire dataset is 454.31, so that’s the prediction for the root node.

If you remember from part 1, in a classification tree, entropy is used to decide how to split the data into separate branches. Entropy is essentially a measure of disorder or uncertainty. For a continuous target, we can use the standard deviation (SD) for the same purpose: we want each split to reduce the overall standard deviation by as much as possible. The standard deviation for our dataset is 17.06. The overall standard deviation after a split is calculated in much the same way as we calculated overall entropy before – it’s simply a weighted average. For each branch, multiply the fraction of records that fall into that branch by the standard deviation of PE within that branch, then add up the products.
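Here is a minimal R sketch of that calculation. It assumes the data have been loaded into a data frame called df with the columns shown above (df is an assumption here, and R’s sd() uses the sample standard deviation, so the exact numbers may differ slightly from those quoted):

```r
# Weighted standard deviation of the target y after splitting on predictor x
# at split_value: each branch's SD is weighted by the fraction of records
# that fall into that branch.
weighted_sd <- function(x, y, split_value) {
  left  <- y[x <  split_value]
  right <- y[x >= split_value]
  (length(left)  / length(y)) * sd(left) +
    (length(right) / length(y)) * sd(right)
}

# SD reduction achieved by a candidate split (hypothetical usage):
# sd(df$PE) - weighted_sd(df$AT, df$PE, 19.68)
```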

So, let’s start by splitting on temperature, using the mean temperature of 19.68 as the split value. If we look at the mean and standard deviation of power output in each branch, we get:

Screenshot by author

Here, we see that the standard deviation is reduced substantially (from 17.06 to 9.12) by this simple split. This extremely simple model will predict a power output of 469.42 if AT is less than 19.68, and 440.71 if AT is 19.68 or more.
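Written out as a prediction rule, that one-split model would look something like the sketch below (the split value and branch means are taken from the text above):

```r
# One-split model: predict the mean PE of each temperature branch.
predict_one_split <- function(AT) ifelse(AT < 19.68, 469.42, 440.71)

predict_one_split(c(12.5, 27.3))   # 469.42 440.71
```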

If we do the same thing with the means of the other three input variables, we get:

Screenshot by author

From this, we see that the mean temperature gives us the best SD reduction. Of course, as we saw in part 1, we don’t have to (nor should we) settle for this. For instance, if we split temperature at the median (20.32) instead of the mean, we get an overall SD of 9.22, a reduction of only 7.84, so the mean is a better split value than the median here. In practice, the algorithm will try many different split values for each variable to find the best possible SD reduction.
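A rough sketch of that search for a single variable, reusing the hypothetical weighted_sd() helper from above, might look like this (rpart’s actual search is far more efficient; this only illustrates the idea):

```r
# Candidate split points: midpoints between consecutive observed AT values.
vals <- sort(unique(df$AT))
candidates <- (head(vals, -1) + tail(vals, -1)) / 2

# SD reduction achieved by each candidate split on AT.
reductions <- sapply(candidates, function(s) {
  sd(df$PE) - weighted_sd(df$AT, df$PE, s)
})

candidates[which.max(reductions)]   # best split value for AT
max(reductions, na.rm = TRUE)       # best SD reduction for AT
```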


Once we have the first split, we can try splitting each of the two resulting groups further, then splitting those, and so on. Here’s the tree produced by R’s rpart() function with its default parameters. (Note: only 70% of the full dataset was used as the training set to create this decision tree.)

Screenshot by author
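For reference, a tree along these lines could be grown with something like the following sketch. It assumes the data frame df and column names from the table above, plus an arbitrary random seed, so the resulting tree may differ in detail from the one shown:

```r
library(rpart)

set.seed(42)                                   # arbitrary seed, for reproducibility
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Regression tree ("anova" method) grown with rpart's default parameters.
fit <- rpart(PE ~ AT + V + AP + RH, data = train, method = "anova")

fit                      # print the splits
# plot(fit); text(fit)   # draw the tree
```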

As you can see, only two variables ended up being useful in this decision tree. If we use it as a predictive model, we’ll see that it’s very crude: after all, with only five leaf nodes, it can only produce five possible predictions.

Screenshot by author
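Continuing the hypothetical code above, one way to see how coarse the predictions are is to count the distinct values the tree can return:

```r
preds <- predict(fit, newdata = test)   # predictions for the held-out 30%
length(unique(preds))                   # one value per leaf node (five in the tree shown above)
```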

This is one limitation of the regression tree model. It deals with averages. A related limitation is that it can’t extrapolate outside of the training data. After all, you can’t get an average smaller than the lowest value or greater than the largest value.

While we can’t do anything about the latter limitation, we can get around the former. If we can create a lot of slightly different decision trees, we can get a lot of different possible leaf nodes, and a lot of different possible answers. If we can then average the results from all the different trees, we can get much better predictive power. This is where random forests come in, which will be the topic of the next part of this series.
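As a preview of that idea, here is a minimal sketch of the averaging step alone: grow a number of trees on bootstrap samples of the training data and average their predictions. (A true random forest also samples variables at each split; this sketch only illustrates why averaging many trees yields many more distinct predictions.)

```r
n_trees <- 100

# Each column holds one bootstrap tree's predictions on the test set.
all_preds <- replicate(n_trees, {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  tree <- rpart(PE ~ AT + V + AP + RH, data = boot, method = "anova")
  predict(tree, newdata = test)
})

averaged <- rowMeans(all_preds)   # average the trees' predictions
length(unique(averaged))          # many more distinct values than a single tree
```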


Further Reading

How do Decision Trees and Random Forests Work?

Decision Tree Regression

Use Rattle to Help You Learn R

