One of the simplest and most useful techniques in Machine Learning is Ensemble Learning. Ensemble Learning (EL) is the method behind XGBoost, Bagging, Random Forest, and others, but it is much more than that.
There are a lot of very good articles on EL here on Towards Data Science ([here](https://towardsdatascience.com/simple-guide-for-ensemble-learning-methods-d87cc68705a2) is one of the stories I appreciated most). So, why another article on EL? Because I would like to show you how it works with a very simple example, one that convinced me there was no magic behind it!
Well, the first time I saw EL in action (with a few very simple regression models), I could not believe my eyes, and I still thank the professor who taught me this technique.
I had two different models ("weak learners") with an out-of-sample R² of 0.90 and 0.93, respectively. Before looking at the result, I would have expected an R² somewhere between the two original values. In other words, I thought that EL could be used to avoid having a model performing as badly as the worst one, but not to obtain a model that could _outperform the best one_.
To my great surprise, a simple average of the predictions gave an R² of 0.95.
At first, I looked for a bug; then I thought there might be something magic behind this!
What is Ensemble Learning
With EL, you can combine the predictions of two or more models to obtain a more robust and better-performing model. There are a lot of methodologies for ensembling models, but here I will discuss only two of the most useful, just to give an idea.
In regression, you can average the predictions of the available models.
In classification, you can let each model vote for a label. The label with the most votes is the one chosen by the ensemble.
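Just to make the two recipes concrete, here is a minimal sketch in plain NumPy; the predictions are made-up numbers, not the output of real fitted models.

```python
import numpy as np

# Regression: average the predictions of two (hypothetical) fitted models.
pred_a = np.array([2.1, 3.8, 5.2, 7.1])   # predictions of model A
pred_b = np.array([1.9, 4.2, 4.8, 6.7])   # predictions of model B
ensemble_regression = (pred_a + pred_b) / 2

# Classification: majority vote among three (hypothetical) models on binary labels.
votes = np.array([
    [0, 1, 1, 0],   # labels predicted by model 1
    [0, 1, 0, 0],   # labels predicted by model 2
    [1, 1, 1, 0],   # labels predicted by model 3
])
# The majority label is 1 when more than half of the models vote 1.
ensemble_classification = (votes.mean(axis=0) > 0.5).astype(int)

print(ensemble_regression)       # [2.  4.  5.  6.9]
print(ensemble_classification)   # [0 1 1 0]
```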
Why it works better
The basic reason why EL works better is that every prediction carries an error (at least in a probabilistic sense); combining two predictions can help reduce that error and therefore improve the performance metrics (RMSE, R², and so on).
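To see why, assume (just to keep the algebra simple) that we take a plain average of two predictions whose errors e₁ and e₂ have standard deviations σ₁ and σ₂ and correlation ρ. The error of the averaged prediction is (e₁ + e₂)/2, and its variance is

$$
\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right) = \frac{\sigma_1^2 + \sigma_2^2 + 2\rho\,\sigma_1\sigma_2}{4}
$$

When σ₁ ≈ σ₂ = σ, this is roughly σ²(1 + ρ)/2: the less correlated (or the more negatively correlated) the errors, the bigger the reduction. Keep this in mind for the examples below.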
In the following chart, you can see how two weak learners work on a sampled data set. The first learner has a much steeper slope than it should, while the second has an almost-zero slope (maybe as a result of excessive regularization). The ensemble looks much better.
Looking at the out-of-sample R², we have −0.01¹ and 0.22 for the two weak learners, and 0.73 for the ensemble.

There are a lot of reasons why a learner might not be a good model, even in a basic example like this one: maybe you used regularization to avoid overfitting, or you chose not to exclude some outliers, or you used a polynomial regression of the wrong degree (for example, a second-degree polynomial when your test data show a clear asymmetry that a third-degree one would fit better).
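Whatever the reason, you can reproduce the pattern of the chart with a toy setup like the one below. The data-generating process and the two slopes are made-up choices of mine (the coefficients behind the chart are not reported here), so the exact R² values will differ, but you should still see two poor learners and a much better average.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Made-up ground truth: y = 2x + noise, evaluated on a held-out sample.
x_test = rng.uniform(0, 10, size=200)
y_test = 2.0 * x_test + rng.normal(0, 2, size=200)

# Two hand-picked weak learners: one slope too steep, one almost flat.
pred_steep = 3.0 * x_test          # slope higher than it should be
pred_flat = 0.5 * x_test + 5.0     # almost-zero slope, as if over-regularized

# Ensemble: simple average of the two predictions.
pred_ensemble = (pred_steep + pred_flat) / 2

print(r2_score(y_test, pred_steep))     # poor
print(r2_score(y_test, pred_flat))      # poor
print(r2_score(y_test, pred_ensemble))  # much better than both
```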
When it works better
Let’s look at two other learners with the same data.

In this example, ensembling these two models did not improve the performance: the R² values were −0.37 and 0.22 for the two learners, and −0.04 for the ensemble. Hence, the EL model scored between the two original models.
There is a big difference between these two examples: in the former, the errors of the models were negatively correlated, while in the latter they were positively correlated (the coefficients of the three models were not estimated but chosen by the author just for the sake of the example).
Hence, Ensemble Learning can be used to improve the bias/variance balance in any case, but when the errors of the models are not positively correlated, the EL model can deliver a real boost in performance.
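A quick simulation (with made-up error distributions) shows the role of the correlation sign: two models with identical error variance give a much better ensemble when their errors are negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

def ensemble_mse(rho):
    """MSE of the averaged prediction when the two models' errors
    both have unit variance and correlation rho."""
    cov = [[1.0, rho], [rho, 1.0]]
    e1, e2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return np.mean(((e1 + e2) / 2) ** 2)

# Each single model alone has an MSE of about 1.0 in both cases.
print(ensemble_mse(-0.8))   # ~0.1: negatively correlated errors, big gain
print(ensemble_mse(+0.8))   # ~0.9: positively correlated errors, almost no gain
```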
Heterogeneous vs Homogeneous models
Often EL is used on homogeneous models (as in this example or in Random Forest), but you can combine very different models (Linear Regression + Neural Network + XGBoost), possibly with different sets of explanatory variables. Chances are that this will result in uncorrelated errors and a boost in performance.
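As a sketch (not a benchmark), mixing heterogeneous models can be as simple as averaging their predictions. Below I use scikit-learn's LinearRegression, a small MLPRegressor standing in for the neural network, and GradientBoostingRegressor standing in for XGBoost (xgboost's XGBRegressor could be swapped in if you have it installed), all on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic data, just to make the sketch runnable.
X, y = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [
    LinearRegression(),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    GradientBoostingRegressor(random_state=0),
]

predictions = []
for model in models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    predictions.append(pred)
    print(type(model).__name__, round(r2_score(y_test, pred), 3))

# Heterogeneous ensemble: simple average of the three predictions.
ensemble_pred = np.mean(predictions, axis=0)
print("Ensemble", round(r2_score(y_test, ensemble_pred), 3))
```

If you prefer not to do the averaging by hand, scikit-learn's VotingRegressor wraps the same idea.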
Comparison with Portfolio Diversification
EL works similarly to diversification in portfolio theory but it’s even better.
In diversification, you try to reduce the variance of your performance by investing in uncorrelated stocks. A well-diversified basket of stocks performs better than the worst single stock, but never better than the best one.
Quoting Warren Buffet:
"Diversification is a protection against ignorance, [it] makes very little sense for those who know what they’re doing."
In Machine Learning, EL helps to reduce the variance of your model, but it can also result in a model whose overall performance is better than that of the best single model.
Summing Up
Combining several models into one is a relatively simple technique that can help with the bias-variance trade-off and boost performance.
If you have two or more models that look good, don’t choose: use all of them (with care)!
About Me
I’m an ex-mathematician with a passion for a lot of different things: Data Science, Data Visualization, Finance, Asset/Liabilities management, and (but don’t tell anyone) Accounting!
¹ If you are wondering: R² can be negative when it is calculated on out-of-sample data or when your estimator is not OLS.