
Ensembling HuggingFaceTransformer models

Combine 2 or more HuggingFace transformers using a simple linear layer on top of them.

Rishab Banerjee
Towards Data Science
6 min read · Jul 6, 2020


Recently, while doing some research on question answering using BERT, it was suggested that I ensemble 2 BERT models. I took the obvious route: a Google search. But to my surprise, nothing really came up. There was a plethora of articles about transformers, but nothing on how to ensemble transformer models. This article discusses exactly that: how to ensemble 2 PyTorch HuggingFace Transformer models.

What is model ensembling?
In many cases, a single model might not give the best results. However, if we combine several "weak" classifiers and merge their individual results in a meaningful way, we can often get a much better result.
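
As a toy illustration of this idea (not from the article's setup, just a hedged sketch), one simple way to combine two weak binary classifiers is to average their predicted probabilities:

import torch

# Hypothetical probabilities from two weak binary classifiers for 3 examples
probs_model_a = torch.tensor([0.55, 0.40, 0.70])
probs_model_b = torch.tensor([0.65, 0.45, 0.52])

# A very simple ensemble: average the probabilities and threshold at 0.5
ensemble_probs = (probs_model_a + probs_model_b) / 2
predictions = (ensemble_probs > 0.5).long()
print(predictions)  # tensor([1, 0, 1])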

For example, suppose we are doing question answering using BERT — if we pass 2 sentences sent_1 and sent_2 to the BERT model, the model should predict if these 2 sentences form a question-answer pair, i.e. sent_2 answers sent_1.

The usual technique is to feed the question-answer pair both ways,
[CLS] + sent_1 + [SEP] + sent_2 + [SEP] and
[CLS] + sent_2 + [SEP] + sent_1 + [SEP]
into one single model and train that model.

Instead of doing this, we can ensemble 2 models and train them in the following manner:
Model 1 gets the input
[CLS] + sent_1 + [SEP] + sent_2 + [SEP] and
Model 2 gets the input
[CLS] + sent_2 + [SEP] + sent_1 + [SEP]

Then, using a simple feed-forward network, we can combine the results of the models (don't bang your head on how it can be done, just assume it can be done for now; I will show how it is done, as that is the crux of this post).

  1. Why is ensembling better?
    Well, it's not! At least not always. In certain situations, the result of the ensemble might outperform a single big model. This works because the task at hand might be just too complicated for one single model to comprehend.
    It's the same reason why a deep CNN with 100 neurons spread across several layers will often work better than a CNN with a single layer holding all 100 neurons: each layer learns different things and makes the overall model better.
    It's the same reason why a single human is weaker than a lion, but as a society we are the most dominant species on earth.
    It's the same reason why 5 of me are better than 1 Messi (at least that's what I dreamt yesterday 😆).
    It's the same reason why Gogeta is better than Broly.
    But the main reason is that ensembles are less prone to overfitting, since the errors of the individual models tend to cancel each other out.

Single model approach

Let’s go back to the question-answering example. Let’s consider 3 sentences,

sentence_1 = Who killed Freeza?
sentence_2 = Freeza was killed by Goku
sentence_3 = Freeza destroyed the Saiyans but he spared the Mayans.

We feed the model,
INPUT: [CLS] + sentence_1 + [SEP] + sentence_2 + [SEP]
OUTPUT: 1
INPUT: [CLS] + sentence_1 + [SEP] + sentence_3 + [SEP]
OUTPUT: 0

As mentioned earlier, if we had a single model, then the normal way to train it would be to feed it both
[CLS] + question + [SEP] + answer + [SEP] and
[CLS] + answer + [SEP] + question + [SEP]

specifically in our case,
[CLS] + sentence_1 + [SEP] + sentence_2 + [SEP]
[CLS] + sentence_2 + [SEP] + sentence_1 + [SEP]
This approach doubles the size of the dataset.
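
As a rough sketch of that preprocessing (the exact pipeline in the notebook may differ, and bert-base-uncased is just an assumed checkpoint), the HuggingFace tokenizer builds the [CLS]/[SEP] structure for both orderings automatically:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

question = "Who killed Freeza?"
answer = "Freeza was killed by Goku"

# Encoding a sentence pair yields [CLS] first_text [SEP] second_text [SEP]
forward_pair = tokenizer(question, answer, return_tensors="pt")
reverse_pair = tokenizer(answer, question, return_tensors="pt")

# Both orderings get the same label, which is what doubles the dataset
print(tokenizer.decode(forward_pair["input_ids"][0]))
print(tokenizer.decode(reverse_pair["input_ids"][0]))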

Transformers Ensemble

In this approach we have 2 models:

  1. Model 1
    INPUT: [CLS] + question + [SEP] + answer + [SEP]
  2. Model 2
    INPUT: [CLS] + answer + [SEP] + question + [SEP]

Now the question becomes: how do we combine the outputs of the 2 models into one, i.e. ensemble them?

Well, talk is cheap, so let's code. Since code is not cheap, I will try to explain most of it.

Image by author (created using Paint S)

The code

First I create a new model called BertEnsembleForNextSentencePrediction.

# In BertEnsembleForNextSentencePrediction.__init__ (n_models = 2)
self.bert_model_1 = BertModel(config)
self.bert_model_2 = BertModel(config)
# Linear head over the concatenated pooler outputs: (2 * hidden_size) -> 2 classes
self.cls = nn.Linear(self.n_models * self.config.hidden_size, 2)

BertEnsembleForNextSentencePrediction takes 2 BertModel instances as input (as can be seen in the __init__) and adds an nn.Linear on top of them. nn.Linear, as mentioned here, applies a linear transformation y = xAᵀ + b to the input. I will explain why the nn.Linear is used (even though anyone who has seen a bit of the transformers code will have noticed plenty of nn.Linear layers, as in BertOnlyNSPHead, BertForSequenceClassification, etc.).
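
For instance, a quick standalone shape check (not taken from the notebook) shows what a linear head over two concatenated 768-dimensional vectors produces:

import torch
import torch.nn as nn

hidden_size = 768                       # BERT base hidden size
n_models = 2                            # two BERT encoders in the ensemble
cls = nn.Linear(n_models * hidden_size, 2)

# Pretend pooler outputs from the two models for a batch of 4 examples
pooled = torch.randn(4, n_models * hidden_size)
logits = cls(pooled)                    # applies y = xA^T + b
print(logits.shape)                     # torch.Size([4, 2])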

# Unpack the per-model inputs: index 0 feeds model 1, index 1 feeds model 2
input_ids_1 = input_ids[0]
attention_mask_1 = attention_mask[0]
token_type_ids_1 = token_type_ids[0]
input_ids_2 = input_ids[1]
attention_mask_2 = attention_mask[1]
token_type_ids_2 = token_type_ids[1]

Then comes the main forward function. The arguments input_ids, attention_mask, and token_type_ids of the forward function are tuples: the 0th index is for the first model and the 1st index is for the second model. So the first model takes as inputs input_ids[0], attention_mask[0], token_type_ids[0]. (I will not go into detail about what each of these terms means, as they are standard BERT terms.) This is exactly what the above lines do.
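
As an aside, here is a hedged sketch of how such tuple inputs could be built with the tokenizer (the notebook's own data pipeline may look different, and bert-base-uncased is an assumed checkpoint):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc_1 = tokenizer("Who killed Freeza?", "Freeza was killed by Goku", return_tensors="pt")
enc_2 = tokenizer("Freeza was killed by Goku", "Who killed Freeza?", return_tensors="pt")

# Index 0 of each tuple feeds model 1, index 1 feeds model 2
input_ids = (enc_1["input_ids"], enc_2["input_ids"])
attention_mask = (enc_1["attention_mask"], enc_2["attention_mask"])
token_type_ids = (enc_1["token_type_ids"], enc_2["token_type_ids"])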

outputs = []
outputs.append(self.bert_model_1(input_ids_1, attention_mask=attention_mask_1, token_type_ids=token_type_ids_1))
outputs.append(self.bert_model_2(input_ids_2, attention_mask=attention_mask_2, token_type_ids=token_type_ids_2))

Then we just pass the variables defined above, i.e. input_ids_1, attention_mask_1, token_type_ids_1 and input_ids_2, attention_mask_2, token_type_ids_2, to the two BertModel instances.

As written here, the BertModel returns last_hidden_state and pooler_output as the first 2 outputs. We are interested in the pooler_output here. As mentioned here, the pooler_output is

Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.

For a given input, the pooler_output is of size (batch_size, hidden_size). By default the hidden_size = 768.
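
A quick standalone check of those shapes (again assuming bert-base-uncased):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("Who killed Freeza?", "Freeza was killed by Goku", return_tensors="pt")
with torch.no_grad():
    last_hidden_state, pooler_output = model(**enc)[:2]

print(last_hidden_state.shape)  # (1, sequence_length, 768)
print(pooler_output.shape)      # (1, 768)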

# output[1] is each model's pooler_output of shape (batch_size, hidden_size);
# concatenating along dim=1 gives a (batch_size, 2 * hidden_size) tensor
last_hidden_states = torch.cat([output[1] for output in outputs], dim=1)
logits = self.cls(last_hidden_states)

Now comes the main combination part using the nn.Linear. How do we combine the outputs of model 1 and model 2? It's very simple: the pooler output of the [CLS] token from each model is of size (batch_size, 768), so for every question-answer pair each model produces a vector of size 768. Thus, for every given question-answer pair, there will be 2 vectors of size 768, one from each of the 2 models. For example,
[CLS1] + Who killed Freeza? + [SEP] + Freeza was killed by Goku + [SEP]
[CLS2] + Freeza was killed by Goku + [SEP] + Who killed Freeza? + [SEP]

The nn.Linear takes the concatenation of the elements of the outputs list, i.e. it flattens them into one vector and applies a linear transformation. The linear layer thus takes as input a vector of size (2 * 768) and outputs 2 logits, one per class (logits are not exactly probabilities, but applying a softmax to them gives probabilities, so it's close enough).

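Putting the pieces together, here is a condensed sketch of the whole model. I am assuming the class subclasses BertPreTrainedModel (the usual pattern for custom BERT heads) and that it returns only the logits; the full version, with the loss computation and training loop, is in the Colab notebook linked below.

import torch
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertEnsembleForNextSentencePrediction(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.n_models = 2
        # Two independent BERT encoders, one per input ordering
        self.bert_model_1 = BertModel(config)
        self.bert_model_2 = BertModel(config)
        # Linear head over the concatenated pooler outputs: (2 * 768) -> 2 classes
        self.cls = nn.Linear(self.n_models * config.hidden_size, 2)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = []
        # Index 0 of each tuple feeds model 1, index 1 feeds model 2
        outputs.append(self.bert_model_1(input_ids[0],
                                         attention_mask=attention_mask[0],
                                         token_type_ids=token_type_ids[0]))
        outputs.append(self.bert_model_2(input_ids[1],
                                         attention_mask=attention_mask[1],
                                         token_type_ids=token_type_ids[1]))
        # output[1] is each model's pooler_output of shape (batch_size, 768)
        pooled = torch.cat([output[1] for output in outputs], dim=1)
        logits = self.cls(pooled)  # (batch_size, 2)
        return logits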

Here is a complete working example.
https://colab.research.google.com/drive/1SyRrBAudJHiKjHnxXaZT5w_ukA0BmK9X?usp=sharing

Please note that the dataset used in the code is very small, and the way the code is written is overkill for such a small dataset. But I wrote the code in such a way that any decently sized dataset can also be used. Moreover, the code follows the general PyTorch training pattern; I find it useful to save this template and tweak it a bit based on the use case.
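
For reference, the training step with this model follows the usual PyTorch pattern. The sketch below is a hedged outline rather than the exact notebook code; it assumes the dataloader yields (input_ids, attention_mask, token_type_ids, labels) batches in the tuple format described above.

import torch.nn as nn

def train_one_epoch(model, dataloader, optimizer):
    # One epoch of the standard PyTorch loop over tuple-style ensemble batches
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for input_ids, attention_mask, token_type_ids, labels in dataloader:
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask=attention_mask,
                       token_type_ids=token_type_ids)
        loss = loss_fn(logits, labels)  # logits: (batch_size, 2), labels: (batch_size,)
        loss.backward()
        optimizer.step()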

That’s it!

References and links:

  1. Why ensembling is better: https://www.quora.com/How-do-ensemble-methods-work-and-why-are-they-superior-to-individual-models
  2. PyTorch nn.Linear: https://pytorch.org/docs/master/generated/torch.nn.Linear.html
