
In general, people working in groups tend to perform better than individuals. One advantage of a group is that it can approach a task from many different perspectives, which is impossible for a single person. An issue we fail to notice might be spotted by a coworker. The same advantage applies to machine learning models: we can create a robust, highly accurate model by combining individual weak learners. In machine learning, ensemble methods are used to combine base estimators (i.e. weak learners). Two types of ensemble methods are averaging (e.g. bagging) and boosting (e.g. gradient boosting, AdaBoost).
Ensemble methods not only increase performance but also reduce the risk of overfitting. Consider a person evaluating a product. One person may focus too much on a specific feature or detail and thus fail to provide a well-generalized evaluation. If a group of people evaluates the product instead, each individual might focus on a different feature or detail, so the risk of putting too much weight on any single feature is reduced and the final evaluation is more balanced. Similarly, ensemble methods result in a well-generalized model and thus reduce the risk of overfitting.
Bagging
Bagging (bootstrap aggregating) means aggregating the predictions of several weak learners. We can think of it as combining weak learners in parallel: the average of their predictions is used as the overall prediction. The most common algorithm that uses bagging is the random forest.
The base estimator of a random forest is the decision tree, which partitions the data by iteratively asking questions. A random forest is built by combining several decision trees with the bagging method. For a classification problem, the overall prediction is the majority vote of the predictions of the individual trees. For regression, the prediction of a leaf node is the mean of the target values in that leaf, and the random forest prediction is the mean of the predictions of the individual trees.
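As a quick, hedged sketch of how this looks in practice, the snippet below uses scikit-learn's RandomForestClassifier and RandomForestRegressor on two built-in toy datasets; the datasets and the number of trees are chosen only for illustration.

```python
# Random forest usage sketch: a classifier (majority vote of trees)
# and a regressor (mean of the individual tree predictions).
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X_clf, y_clf = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_clf, y_clf)
print(clf.predict(X_clf[:5]))   # class labels decided by majority vote

X_reg, y_reg = load_diabetes(return_X_y=True)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_reg, y_reg)
print(reg.predict(X_reg[:5]))   # mean of the individual tree predictions
```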
The success of a random forest highly depends on using uncorrelated decision trees. If we use the same or very similar trees, the overall result will not be much different from the result of a single decision tree. Random forests obtain uncorrelated decision trees through bootstrapping and feature randomness.
Bootstrapping means randomly selecting samples from the training data with replacement; the resulting subsets are called bootstrap samples, and each tree in the forest is trained on its own bootstrap sample.
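A small sketch of the idea, using only numpy; the sample size of 10 is arbitrary. Because we draw with replacement, some rows appear several times while others are left out entirely.

```python
# Bootstrap sampling sketch: draw n rows with replacement from the training data.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
idx = rng.integers(0, n_samples, size=n_samples)   # indices of one bootstrap sample
print("bootstrap indices:", idx)
print("unique rows used :", np.unique(idx))         # the missing rows are "out-of-bag"
```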

Feature randomness is achieved by letting each tree consider only a random subset of the features when splitting a node. In scikit-learn, the number of features considered at each split is controlled with the max_features parameter.


Bootstrap samples and feature randomness provide the random forest model with uncorrelated trees.
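In scikit-learn these two sources of randomness map directly onto constructor parameters; here is a minimal sketch, with parameter values chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,        # each tree is trained on a bootstrap sample
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
).fit(X, y)
```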
Hyperparameters are key parts of learning algorithms that affect the performance and accuracy of a model. Two critical hyperparameters of random forests are max_depth and n_estimators.
max_depth: The maximum depth of a tree. The depth starts from 0 (i.e. the depth of the root node is zero). If not specified, the trees keep splitting until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Increasing the depth more than necessary creates a risk of overfitting.
n_estimators: The number of trees in the forest. Up to a point, the results improve as the number of trees increases, but after that point adding more trees does not improve the model. Keep in mind that additional trees always mean more computation time.
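A quick way to see the effect of these two hyperparameters is to cross-validate a few settings; the sketch below does this on a toy dataset, and the grid values are arbitrary.

```python
# Compare a few (max_depth, n_estimators) settings with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for max_depth in (2, 5, None):            # None: grow until leaves are pure
    for n_estimators in (10, 100, 300):
        forest = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0
        )
        score = cross_val_score(forest, X, y, cv=5).mean()
        print(f"max_depth={max_depth}, n_estimators={n_estimators}: {score:.3f}")
```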
Boosting
Boosting means combining several weak learners in series. We end up with a strong learner built from many sequentially connected weak learners. One of the most commonly used ensemble algorithms based on boosting is the gradient boosted decision tree (GBDT). As in random forests, the weak learners (base estimators) in GBDT are decision trees; the way the trees are combined is different, though.
The gradient boosting algorithm combines weak learners sequentially in such a way that each new learner fits the residuals from the previous step, so the model improves. The final model aggregates the results from all steps, yielding a strong learner. A loss function is used to measure the residuals: for instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log loss) for classification tasks. It is worth noting that existing trees in the model do not change when a new tree is added; each added tree fits the residuals of the current model.
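To make the residual-fitting idea concrete, here is a minimal hand-rolled sketch for regression with squared error. It is not scikit-learn's implementation, and the learning rate, tree depth, and number of iterations are arbitrary.

```python
# Hand-rolled gradient boosting sketch: each new tree is fit to the residuals
# of the current model, and its prediction is added with a small learning rate.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())     # start from a constant prediction
trees = []
for _ in range(100):
    residuals = y - prediction             # negative gradient of the squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # existing trees stay unchanged
    trees.append(tree)
```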

Learning rate and n_estimators are two critical hyperparameters for gradient boosted decision trees. The learning rate, denoted as α, controls how fast the model learns: each added tree modifies the overall model, and the magnitude of that modification is scaled by the learning rate.
The lower the learning rate, the slower the model learns. The advantage of a lower learning rate is that the model becomes more robust and generalizes better; in statistical learning, models that learn slowly tend to perform better. However, learning slowly comes at a cost: it takes more trees, and thus more time, to train the model, which brings us to the other significant hyperparameter. n_estimators is the number of trees used in the model. If the learning rate is low, we need more trees to train the model. However, we need to be very careful when selecting the number of trees, because using too many trees creates a high risk of overfitting.
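One way to inspect this trade-off in scikit-learn is staged_predict, which yields the model's predictions after each added tree; the sketch below uses it on a synthetic dataset, and the learning rate and tree count are arbitrary.

```python
# Watch validation error evolve as trees are added to a GBDT.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# a low learning rate usually needs more trees to reach the same quality
gbdt = GradientBoostingRegressor(learning_rate=0.05, n_estimators=500, max_depth=3)
gbdt.fit(X_train, y_train)

# validation error after each added tree
val_errors = [mean_squared_error(y_val, p) for p in gbdt.staged_predict(X_val)]
print("best number of trees:", int(np.argmin(val_errors)) + 1)
```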
A note on overfitting
One key difference between random forests and gradient boosted decision trees is how the number of trees affects the model. Increasing the number of trees in a random forest does not cause overfitting: after some point the accuracy no longer increases with additional trees, but it is also not negatively affected by them. You still do not want to add an unnecessary number of trees for computational reasons, but there is no risk of overfitting associated with the number of trees in a random forest.
However, the number of trees in gradient boosted decision trees is critical with respect to overfitting. Adding too many trees causes overfitting, so it is important to stop adding trees at some point.
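scikit-learn's gradient boosting estimators can stop adding trees automatically when a held-out validation score stops improving; here is a sketch using the n_iter_no_change option, with values chosen only for illustration.

```python
# Early stopping sketch: cap n_estimators and let validation-based stopping
# decide how many trees are actually fitted.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

gbdt = GradientBoostingRegressor(
    n_estimators=1000,          # upper bound on the number of trees
    learning_rate=0.05,
    validation_fraction=0.1,    # part of the training data held out internally
    n_iter_no_change=10,        # stop when the validation score stops improving
    random_state=0,
).fit(X, y)

print("trees actually fitted:", gbdt.n_estimators_)
```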
Thank you for reading. Please let me know if you have any feedback.