
Suppose we wish to perform supervised learning on a classification problem: determining whether an incoming email is spam or not spam. The spam dataset consists of 4601 emails, each labelled as not spam (0) or spam (1). The data also contains a large number of predictors (57), each of which is either a character count or the frequency of occurrence of a certain word or symbol. In this short article, we will briefly cover the main concepts in tree-based classification and compare and contrast the most popular methods.
This dataset and several worked examples are covered in detail in The Elements of Statistical Learning, 2nd edition.

The Data:
The spam dataset consists of 4601 observations, with the response, Y, being not spam (0) or spam (1). The data also contains 57 predictors, x. Each predictor reflects the frequency of a certain word, letter or symbol in the email: for example, the total number of capital letters, the frequency of the ‘$’ sign, and so on. Take a look at a sample of the original data.

Notice that the predictor values are quite small and there appear to be many zeros. We apply a log transformation to our predictors before fitting models to help deal with this issue, adding an epsilon of 0.1 inside the log so that we never take the log of 0.
We also randomly split our dataset into a training set and a test set. We will use the test set later to evaluate our model performance.
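To make the preprocessing concrete, here is a minimal sketch in R. The data frame name spam_df, the response column name spam, and the 2/3–1/3 split are assumptions for illustration; only the log(x + 0.1) transform comes from the description above.

```r
# A minimal sketch of the preprocessing described above. It assumes the data
# has already been loaded as a data frame `spam_df` with the 57 numeric
# predictors plus a 0/1 response column `spam` (names are illustrative).
eps <- 0.1
predictors <- setdiff(names(spam_df), "spam")
X <- log(spam_df[, predictors] + eps)   # log(x + 0.1) avoids taking log(0)
dat <- data.frame(X, spam = spam_df$spam)

set.seed(42)                            # reproducible split (assumed 2/3 train)
train_idx <- sample(seq_len(nrow(dat)), size = round(2 / 3 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]
```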
Decision Trees:
Decision trees for classification are among the oldest and simplest machine learning algorithms. They are popular due to their easy visual interpretation. However, in practice they tend to overfit the training data, which leads to poor performance when making predictions on new data (also known as the high-variance problem). R allows us to easily fit decision trees with the rpart function. There are several ways to decide what makes a ‘good’ or ‘best’ tree, but typically we use the Gini impurity. Ideally, a terminal node should contain only one class; the more mixed it is, the less ‘pure’ it is and the worse the tree will perform on new data.
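A hedged sketch of how such a tree might be fit, continuing from the train/test split above (the use of rpart.plot and the exact call are illustrative choices, not necessarily what produced the figure below):

```r
library(rpart)
library(rpart.plot)   # for drawing the fitted tree

# Fit a classification tree on the training data. For classification,
# rpart splits on the Gini impurity by default.
tree_fit <- rpart(factor(spam) ~ ., data = train, method = "class")

rpart.plot(tree_fit)  # visualise the splits and node class proportions
```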
The function considers a number of candidate trees, and within the chosen tree, at each node we either go left or right depending on a predictor’s value. We can see that ‘$’ appears to be the most important feature for determining whether an email is spam in our training data. Depending on ‘$’, we either go left, at which point we check ‘remove’, or go right, at which point we look at ‘hp’. Regardless of our path, we eventually end up at a terminal node, where the email is classified as either spam (1) or not spam (0). At the root node, we have 100% of our training examples. Depending on the value of log($ + 0.1), we either go to the left (which happens 76% of the time) or to the right (which happens 24% of the time). The number in the middle of each node represents its ‘impurity’. Our root node has an impurity of around 40%, because our training data consists of approximately 40% spam emails.

Once we have fit a model to our training data, we can see how it performs on our test data. This is done by feeding the new examples into the tree and following the path indicated. To evaluate the performance, we examine the confusion matrix. We obtain a test accuracy of 91%, which is quite high.
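One way this evaluation might look in code (the 0/1 coding of test$spam follows the preprocessing sketch above):

```r
# Predict class labels for the test set and tabulate them against the truth.
tree_pred <- predict(tree_fit, newdata = test, type = "class")
conf_mat  <- table(predicted = tree_pred, actual = test$spam)
conf_mat

# Test accuracy: proportion of correctly classified emails.
sum(diag(conf_mat)) / sum(conf_mat)
```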

Random Forests and Bagging:
With a single decision tree, our algorithm built several candidate trees and chose the one we found to be the ‘best’. We then used this single tree to produce predictions on new emails. Breiman greatly improved on these simple models by introducing two key concepts: first, bagging (bootstrap aggregation), and later, the random forest. These combine many (hundreds or thousands of) trees, where we take random samples of our observations and predictors to form new trees.
Random forests greatly reduce the possibility of overfitting (i.e. they reduce variance) by averaging over many trees; this is sometimes referred to as the wisdom of crowds.
Hastie described the following in a lecture (linked in sources):
- Bagging (Breiman, 1996): "Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote."
- Random Forests (Breiman 1999): "Fancier version of bagging."
Essentially, random forests use bagging to form hundreds of trees. Since bagging by definition excludes certain observations from some of our trees, we can use the trees that don’t contain a given observation to obtain ‘out-of-bag’ error estimates. We can also estimate which features matter most by seeing which predictors are consistently important across many trees. Note that random forests additionally consider only a random subset of predictors at each split, which decorrelates the individual trees. Hence, if we have some predictors that are essentially identical, the forest spreads its splits across them rather than letting a single copy dominate every tree.

We use the randomForest package in R to fit a model. We can also record which variables are used most across the trees, which allows us to visualise the ‘variable importance’. We have greatly improved our accuracy, from just under 91% to 94.6%.
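A sketch of how this might look with randomForest (ntree = 500 and the formula interface are illustrative defaults, not values taken from the text):

```r
library(randomForest)

set.seed(42)
# Fit a random forest; importance = TRUE stores the variable importance
# measures so they can be plotted afterwards.
rf_fit <- randomForest(factor(spam) ~ ., data = train,
                       ntree = 500, importance = TRUE)
rf_fit                      # printed output includes the out-of-bag (OOB) error

# Test-set accuracy
rf_pred <- predict(rf_fit, newdata = test)
rf_conf <- table(predicted = rf_pred, actual = test$spam)
sum(diag(rf_conf)) / sum(rf_conf)

# 'Variable importance' plot (cf. Figure 7)
varImpPlot(rf_fit)
```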

One of the great aspects of random forests is that they also give us the ‘relative importance’ of our predictors. Figure 7 summarises which symbols/words are important in determining spam, and which are not.
Gradient Boosting:
- Boosting (Freund & Schapire, 1996): "Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote." – Hastie.
Gradient boosting is an improvement on random forests (if you tune the hyperparameters well). Now, instead of just fitting bootstrap-aggregated trees, we take into account how intermediate trees are performing and use them to adjust the weights given to subsequent trees (some trees end up weighted more than others). This can be thought of as repeatedly fitting the residuals of our model: we start with a basic model and see how it performs at predicting new observations; the residuals are where we did not predict well, so we fit the next tree to these residuals, and so on, to account for where we did poorly. (In stochastic gradient boosting we also subsample the training data at each step, but without replacement, unlike bagging.) Gradient boosting also tends to create many ‘shallow’ trees, that is, trees containing a root node and only a small number of subsequent splits. Often they contain just a single split; these are known as ‘stumps’ or ‘weak learners’.
- "Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance." – Trevor __ Hastie

Figure 8 illustrates bagging, which was discussed above. Bagging can be viewed as an algorithm that works in parallel: multiple bootstrap samples are drawn with replacement at the same time, independently of one another, and a tree is built on each.

Bagging and boosting are not the same. Bagging is done in parallel, as seen in Figure 8, and is used to form several sets of observations from which to build models. Boosting actually uses the intermediate trees to adjust the weights used when computing the final prediction. As seen in Figure 9, boosting is not done in parallel; it updates weights as it progresses.
We use the R package gbm to apply gradient boosting to our dataset. Fitting with gradient boosting requires choosing several hyperparameters: n.trees, interaction.depth and shrinkage can be tuned using a grid search and cross-validation. We set the interaction depth to six, which allows for up to sixth-order interactions.
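A sketch of what the gbm call might look like; only interaction.depth = 6 comes from the text, while n.trees, shrinkage and cv.folds below are illustrative values one might tune over:

```r
library(gbm)

set.seed(42)
# gbm with "bernoulli" (logistic) loss expects a numeric 0/1 response.
gbm_fit <- gbm(spam ~ ., data = train, distribution = "bernoulli",
               n.trees = 5000, interaction.depth = 6,
               shrinkage = 0.01, cv.folds = 5)

# Number of trees chosen by cross-validation.
best_iter <- gbm.perf(gbm_fit, method = "cv")

# Predicted spam probabilities on the test set, thresholded at 0.5.
gbm_prob <- predict(gbm_fit, newdata = test, n.trees = best_iter,
                    type = "response")
gbm_pred <- as.numeric(gbm_prob > 0.5)
mean(gbm_pred == test$spam)   # test accuracy
```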

We have ‘boosted’ to an accuracy of over 95% on our test data!
Summary:
Tree-based classification remains one of the most popular families of algorithms used today, mainly due to its easy interpretation and its ability to deal with a large number of predictors. Decision trees are the most basic of the tree-based methods and can be easily interpreted, but are extremely prone to overfitting the training data. Fortunately, several improvements have been made over the years to address the weaknesses of simple decision trees. Random forests use bagging (bootstrap aggregation) to obtain many trees and rely on the wisdom of crowds to obtain a lower-variance prediction. Boosted trees further improve on the previous algorithms by considering how intermediate trees perform and tweaking subsequent trees to do better where the model was previously doing poorly (and weighting powerful trees more). All of these techniques are easy to use in R, Python and nearly every other programming language. Gradient boosting in particular remains one of the most popular algorithms in machine learning competitions, such as those hosted on Kaggle.
I hope this article has provided a broad overview, and a simple example of these fundamental ML techniques. Thanks for reading!
Code:
Sources:
[1] Trevor Hastie, Boosting, lecture slides, https://www.cc.gatech.edu/~hic/CS7616/pdf/lecture5.pdf
[2] Leo Breiman, (1996). Bagging Predictors. Machine Learning
[3] Leo Breiman (2001). Random Forests. Machine Learning
[4] Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning, 2nd edition. Springer.
[5] Friedman, J. H. (2002). Stochastic gradient boosting. Computational statistics & data analysis, 38(4), 367–378.
[6] Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R news, 2(3), 18–22.
[7] Genuer, R., Poggi, J. M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14), 2225–2236.