
Siamese and Dual BERT for Multi Text Classification

Different ways to insert a transformer in your model

Photo by rolf neumann on Unsplash

Constant research in NLP has produced a wide variety of pre-trained models. It's now common to see steady improvements in state-of-the-art results for tasks such as text classification, unsupervised topic modeling, and question answering.

One of the greatest breakthroughs was the adoption of the attention mechanism in neural network architectures. This technique is the basis of all the networks known as transformers, which apply attention to extract information about the context of a given word and then encode it in a learned vector.

There are many transformer architectures that we, as data scientists, can call on and use to make predictions or fine-tune on our task. In this post, we work with the classic BERT, but the same reasoning can be applied to any other transformer architecture. Our goal is to use BERT in dual and siamese structures, instead of as a single feature extractor, for multi-text input classification. What is presented in this post is inspired by the work listed in the references below.

THE DATA

We use a dataset from Kaggle. The News Category Dataset contains around 200k news headlines from 2012 to 2018, obtained from HuffPost. Our goal is to categorize news articles based on two different text sources: the headline and the short description. In total, the dataset covers more than 40 different news categories. For simplicity, and considering the computation time of our workflow, we use only a subset of 8 classes.
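As a rough sketch of the data preparation (the JSON-lines file name and the choice of 8 categories below are assumptions for illustration and may differ in your copy of the dataset):

import pandas as pd

# Assumed file name for the Kaggle dataset (JSON lines format)
df = pd.read_json("News_Category_Dataset_v2.json", lines=True)

# Hypothetical subset of 8 categories, chosen only for illustration
keep = ["POLITICS", "ENTERTAINMENT", "SPORTS", "BUSINESS",
        "TRAVEL", "FOOD & DRINK", "WELLNESS", "STYLE & BEAUTY"]
df = df[df["category"].isin(keep)].reset_index(drop=True)

headlines = df["headline"].astype(str).tolist()
descriptions = df["short_description"].astype(str).tolist()
labels = df["category"].astype("category").cat.codes.values  # integer labels 0..7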

We don't apply any kind of text cleaning or preprocessing; we let our BERTs do all the magic. Our working framework is TensorFlow with the great Huggingface transformers library. More specifically, we use the bare BERT model transformer, which outputs raw hidden states without any specific head on top. It's available as a TensorFlow model subclass and can easily be plugged into our network architecture for fine-tuning.
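Loading the bare model and its tokenizer is a one-liner each; the snippet below assumes the bert-base-uncased checkpoint and a reasonably recent version of the transformers library:

from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")  # bare model, no task-specific head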

SINGLE BERT

As the first competitor, we introduce a single BERT structure. It receives only one text input, which is the concatenation of our two textual sources. This is the standard approach: any model can receive concatenated features as input. For transformers, the concatenation is handled by joining the inputs with special tokens.

BERT expects input data in a specific format: there are special tokens to mark the beginning ([CLS]) and the end of sentences/textual sources ([SEP]). At the same time, tokenization involves splitting the input text into a list of tokens that are available in the vocabulary. Out-of-vocabulary words are handled with the WordPiece technique, where a word is progressively split into subwords that are part of the vocabulary. This process is carried out easily by the pre-trained Huggingface tokenizer; we only have to take care of padding.

We end up with three matrices (tokens, attention mask, sequence ids) for each text source. They are the inputs to our transformers. In the case of the single BERT, we have only one tuple of matrices, because we pass the two text sequences to our tokenizer simultaneously and they are automatically concatenated (with a [SEP] token in between).
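A minimal encoding helper along these lines might look as follows (assuming a recent transformers version where the tokenizer can be called directly on lists of sentence pairs; MAX_LEN is an arbitrary choice):

import numpy as np

MAX_LEN = 128  # arbitrary maximum sequence length

def encode(texts_a, texts_b=None, max_len=MAX_LEN):
    # Passing a pair of lists makes the tokenizer concatenate each pair with [SEP]
    enc = tokenizer(texts_a, texts_b,
                    padding="max_length", truncation=True,
                    max_length=max_len, return_token_type_ids=True)
    return (np.array(enc["input_ids"]),
            np.array(enc["attention_mask"]),
            np.array(enc["token_type_ids"]))

# Single-BERT input: the two sources concatenated by the tokenizer itself
ids, mask, seg = encode(headlines, descriptions)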

The structure of our model is very simple: the transformer is fed directly with the matrices we've built above. The final hidden state of the transformer is then reduced with an average-pooling operation, and the probability scores are computed by a final dense layer.
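A Keras sketch of this single-BERT classifier, assuming TensorFlow 2.x and the bert and MAX_LEN objects defined above (the learning rate and head sizes are illustrative choices, not necessarily the original ones):

import tensorflow as tf

def build_single_bert(bert, max_len=MAX_LEN, n_classes=8):
    ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    seg = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="token_type_ids")

    # [0] selects the last hidden state: shape (batch, max_len, 768)
    hidden = bert(ids, attention_mask=mask, token_type_ids=seg)[0]
    pooled = tf.keras.layers.GlobalAveragePooling1D()(hidden)   # average pooling
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(pooled)

    model = tf.keras.Model([ids, mask, seg], out)
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

single_model = build_single_bert(bert)
# single_model.fit([ids, mask, seg], labels, validation_split=0.1, epochs=2, batch_size=16)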

Our single BERT achieves 83% accuracy on our test data. The performance is reported in the confusion matrix below.

DUAL BERT

Our second structure can be defined as dual BERT because it uses two different transformers. They have the same architecture but are trained with different inputs: the first one receives the news headlines, while the other receives the short descriptions. The inputs are encoded as before, producing two tuples of matrices (tokens, attention mask, sequence ids), one for each input. The final hidden states of the two transformers are reduced with average pooling, then concatenated and passed through a fully connected layer.
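A possible implementation of the dual variant, reusing the imports and MAX_LEN from the snippets above (layer sizes and hyper-parameters are again illustrative):

def build_dual_bert(max_len=MAX_LEN, n_classes=8):
    # Two separate encoders, one per text source, with independent weights
    encoders = {"head": TFBertModel.from_pretrained("bert-base-uncased"),
                "desc": TFBertModel.from_pretrained("bert-base-uncased")}

    inputs, pooled = [], []
    for name, encoder in encoders.items():
        ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name=f"{name}_ids")
        mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name=f"{name}_mask")
        seg = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name=f"{name}_seg")
        hidden = encoder(ids, attention_mask=mask, token_type_ids=seg)[0]
        pooled.append(tf.keras.layers.GlobalAveragePooling1D()(hidden))
        inputs += [ids, mask, seg]

    x = tf.keras.layers.Concatenate()(pooled)
    x = tf.keras.layers.Dense(64, activation="relu")(x)  # fully connected layer
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# The model is fed with headline and description matrices encoded separately, e.g.:
# dual_model.fit([h_ids, h_mask, h_seg, d_ids, d_mask, d_seg], labels, ...)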

With these settings, we can achieve 84% accuracy on our test data.

SIAMESE BERT

Our last model is a kind of siamese architecture. It can be defined this way because the two data sources are passed through the same trainable transformer structure. The input matrices are the same as in the dual BERT case. The final hidden state of the transformer, for both data sources, is pooled with an average operation. The two pooled vectors are concatenated and passed through a fully connected layer that combines them and produces the class probabilities.
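The siamese variant differs from the dual one only in that a single BERT instance is shared between the two sources; a sketch under the same assumptions as above:

def build_siamese_bert(max_len=MAX_LEN, n_classes=8):
    shared_bert = TFBertModel.from_pretrained("bert-base-uncased")  # one encoder, shared weights

    inputs, pooled = [], []
    for name in ["head", "desc"]:
        ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name=f"{name}_ids")
        mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name=f"{name}_mask")
        seg = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name=f"{name}_seg")
        hidden = shared_bert(ids, attention_mask=mask, token_type_ids=seg)[0]  # same weights for both sources
        pooled.append(tf.keras.layers.GlobalAveragePooling1D()(hidden))
        inputs += [ids, mask, seg]

    x = tf.keras.layers.Concatenate()(pooled)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model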

Our siamese structure achieves 82% accuracy on our test data.

SUMMARY

In this post, we applied the BERT structure to a multiclass classification task. The added value of our experiments was to use transformers in various ways to deal with multiple input sources. We started with the classic concatenation of all the inputs into a single source, and ended by feeding our models with the text inputs kept separate. The dual and siamese variants presented here achieved good performance; for this reason, they can be considered good alternatives to the classic single-transformer structure.


CHECK MY GITHUB REPO

Keep in touch: LinkedIn


REFERENCES

Kaggle: Two BERTs are better than one

Kaggle: Bert-base TF2.0

