Two is better than one: Ensembling Models

Sangarshanan
Towards Data Science
5 min read · Jun 29, 2018


Ensembling sounds like a very intimidating word at first, but it's actually deceptively simple. Lemme explain ensembling with an analogy.

Ensembling algorithms to obtain the cake of accuracy

Ensembling is somewhat like watching Netflix all weekend (a generic ML algorithm): it is good on its own, but combine it with some pizza or maybe a special friend and you are all set to chill, and by "chill" I mean reach 90% accuracy.

So basically, ensembling (combining two or more algorithms) can improve or boost your performance. But there is a logic behind ensembling: you cannot just randomly combine two models and demand an increase in accuracy. There is math behind everything, so let's dive into the several ensembling methods that you can try out (if you are into that kinda thing).

Simple Averaging/ Weighted Method:

The name explains it all. This method of ensembling just takes the average of two (or more) models. But how does that work? Is it Scientology?

Assume that you have used an algorithm called WILL and the results are

Actual_targets: 1 0 1 1 0 1
Predicted vals: 1 0 1 0 1 0

Now you use another algorithm called GRACE and the results are

Actual_targets: 1 0 1 1 0 1
Predicted vals: 0 1 0 1 0 1

WILL correctly predicted the first three targets while GRACE correctly predicted the last three

What if we combined these two algorithms (WILL &amp; GRACE)? We would get a pretty decent sitcom, and our accuracy could also increase, since each model is right where the other is wrong. This is the idea behind an averaging model, which is a very basic case of ensembling.

You can implement this easily using Sklearn’s Voting Classifier, which even allows you to assign weights for each algorithm.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
decisiontree = DecisionTreeClassifier()
forest = RandomForestClassifier()
# soft voting averages class probabilities, weighted 2:1 in favour of the decision tree
ensemble = VotingClassifier(estimators=[('Decision Tree', decisiontree), ('Random Forest', forest)],
                            voting='soft', weights=[2, 1]).fit(train_X, train_Y)
print('The accuracy for Decision Tree and Random Forest is:', ensemble.score(test_X, test_Y))

We can assign weights depending on each model's performance, or take a plain average, i.e. set equal weights for all the algorithms.
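To see what that weighted average actually computes, here is a minimal by-hand sketch; the probability arrays below are made up purely for illustration:

import numpy as np
# made-up class probabilities from two hypothetical models (WILL and GRACE) for three samples
will_probs = np.array([0.9, 0.4, 0.7])
grace_probs = np.array([0.6, 0.2, 0.3])
# weighted soft vote: WILL gets weight 2, GRACE gets weight 1
weights = np.array([2, 1])
combined = (weights[0] * will_probs + weights[1] * grace_probs) / weights.sum()
final_prediction = (combined >= 0.5).astype(int)  # predict 1 where the averaged probability crosses 0.5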

Bagging Methods:

Bagging methods are much like the previous ones, but instead of using WILL &amp; GRACE we will use Fast and Furious 1, 2, 3, 4, 5, 6. That means that instead of combining different models, we combine different versions of the same model.

e.g. Random forest is a famous bagging model that trains many decision trees, each on a random bootstrap sample of the data (plus a random subset of features at each split). If the same kind of tree is trained on bootstrap samples alone, it's a bagged decision tree.

But why does bagging work? Aren't we just doing the same thing again?

Let us say a model is training and it notices an instance where a white male with blue eyes and a red shirt and a laptop is sitting in a Starbucks, and we need to find the probability that he is an aspiring screenwriter. Now the model has gone so deep into predicting that probability that it starts to justify every instance of this popular stereotype. Not every white person fitting the description above is a screenwriter… (he probably is tho)

To summarize, a model has high variance when it latches onto quirks of the training data that don't generalize, and bagging can mitigate this issue by keeping the variance of the combined predictions down.

Each model (bag) in a bagging ensemble is trained on a slightly different sample of the data, so no single model gets to dwell on very deep, overly specific representations, and the averaged output is a more test-set friendly prediction.
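As a rough sketch of the bagged decision tree mentioned above, sklearn's BaggingClassifier trains the same base tree on different bootstrap samples; this reuses the train/test variables from the earlier snippet:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# each of the 10 trees sees a different bootstrap sample of the training data
bagged_trees = BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=10)
bagged_trees.fit(train_X, train_Y)
print('Bagged decision tree accuracy:', bagged_trees.score(test_X, test_Y))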

Below is the code for Random Forest, a bagging method

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# split the features and the target column, then fit a forest of bootstrapped trees
train_x, test_x, train_y, test_y = train_test_split(dataset.drop('target', axis=1), dataset['target'])
trained_model = RandomForestClassifier().fit(train_x, train_y)
predictions = trained_model.predict(test_x)

Boosting Methods:

So in the previous examples we took several independent models and calculated their average or their weighted average. But what if we considered how well our previous model has performed and incorporated that into the next model, sorta like boosting the new model with the predictions of the previous one? Well, there you have it: that is just what boosting does.

It is also important to note that the models are added to the ensemble sequentially. Boosting sometimes gives really awesome results by reducing both bias and variance, and many Kagglers fine-tune boosting algorithms to win competitions.

Weight based boosting:

Suppose you are training a model where every training sample is assigned a weight depending on how it was predicted (good predictions mean a lower weight and bad predictions mean a higher weight). It's kinda like karma, except for the fact that boosting is not a bitch.

These weights get updated at every model in the ensemble, and as a sample's weight grows, the next model understands that it needs to get better at predicting it and focuses on those samples instead of the ones with lower weights. This goes on until maximum accuracy or the maximum number of models is reached.

These weights are kinda like real-life weight: the more a person gains, the lower the self-esteem and hence the need to hit the gym. Here the gym is the boosting algorithm… you always need a boost at the gym :)

e.g. AdaBoost: samples that are predicted correctly keep a low weight while misclassified samples get a higher weight. AdaBoost aims to convert a collection of weak classifiers into a strong classifier by adding models sequentially, each one correcting the errors of the previous ones (compulsory textbook definition).

class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
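A minimal usage sketch of that classifier, again reusing the earlier train/test variables; the shallow stump as a weak learner is an assumption, not a requirement:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# 50 shallow trees, each trained on samples reweighted by the previous trees' mistakes
adaboost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50, learning_rate=1.0)
adaboost.fit(train_X, train_Y)
print('AdaBoost accuracy:', adaboost.score(test_X, test_Y))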

Residual based boosting:

In this type of boosting we take the residual, i.e. the difference between the actual and predicted values, and hand it to the next model as its new target. But why? Consider the following:

Actual value : 1
Predicted val: 0.75
Residual : 0.25
The next model is trained on this residual, so that adding its output to the current prediction pushes us closer to the actual value.
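Here is a tiny by-hand sketch of a single residual step with a regression tree; X and y are placeholders for whatever feature matrix and numeric target you happen to have:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
# stage 0: start from a constant prediction (the mean of the targets)
prediction = np.full(len(y), y.mean())
# stage 1: fit a small tree to the residuals and add a shrunken version of its output
residuals = y - prediction
stage_one = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
prediction += 0.1 * stage_one.predict(X)  # 0.1 is the learning rate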

Whaaaaat, so we just plug in a loss function and minimise it, kinda like gradient descent? Yup, this method is also called gradient boosting.

Here the gradient is taken with respect to the model's predictions rather than its parameters; for squared error loss it is simply the residual above.

Some of the positives of this method (both knobs appear in the sketch after this list):

  • Overfitting can be kept in check by shrinking each tree's contribution with the learning rate
  • We don't rely on a single tree for the predictions; the number of estimators controls how many trees are combined
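A minimal sketch with sklearn's GradientBoostingClassifier, once more reusing the earlier train/test variables (the parameter values are arbitrary defaults, not a recommendation):

from sklearn.ensemble import GradientBoostingClassifier
# learning_rate shrinks each tree's contribution; n_estimators is how many trees are added sequentially
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(train_X, train_Y)
print('Gradient boosting accuracy:', gbm.score(test_X, test_Y))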

To prevent overfitting we can also borrow the deep learning concept of dropout and apply it to ensembling; this adds randomness and regularization and helps the model generalize well. I.e. if we have already built 4 trees, we purposely leave out 2 random trees while building the fifth tree. e.g. DART (Dropouts meet Multiple Additive Regression Trees).
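If you have the lightgbm package installed, DART is available as a boosting type; a hedged sketch with arbitrary parameter values, reusing the same train/test variables:

from lightgbm import LGBMClassifier
# boosting_type='dart' randomly drops some of the already-built trees while fitting each new one
dart_model = LGBMClassifier(boosting_type='dart', n_estimators=100, drop_rate=0.1)
dart_model.fit(train_X, train_Y)
print('DART accuracy:', dart_model.score(test_X, test_Y))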

Super awesome boosting algorithms that a lot of developers use to win competitions are

  • XGBoost (extreme gradient boosting)
  • LightGBM (very fast gradient boosting that grows trees leaf-wise rather than level-wise)
  • CatBoost (comes with good default parameters and can automatically encode categorical variables)

Hope you found this article useful and fun!
