Tagging Genes and Proteins with BioBERT

The intuition behind BioBERT with an implementation in Colab

Drew Perkins
Towards Data Science


Photo by Ani Kolleshi on Unsplash

I. Introduction

Text mining in the clinical domain has become increasingly important as the volume of biomedical documents grows, with valuable information waiting to be extracted by NLP techniques. With the accelerated progress in NLP, pre-trained language models now carry millions (or even billions) of parameters and can leverage massive amounts of textual knowledge for downstream tasks such as question answering, natural language inference, and, in the case we will work through here, biomedical text tagging via named-entity recognition. All of the code can be found on my GitHub.

II. Background

In a state-of-the-art breakthrough for NLP, Google researchers developed a language model known as BERT (Devlin et al., 2018), designed to learn deep representations by jointly conditioning on bidirectional context in all layers of its architecture¹. These representations are valuable for sequential data, such as text, that relies heavily on context, and the advent of transfer learning in this field helps carry the encoded knowledge over to strengthen smaller, domain-specific tasks. In transfer learning, we call this step “fine-tuning”: the pre-trained model is adapted to the particular task we have in mind. The original English-language model used two corpora in its pre-training: Wikipedia and BooksCorpus. For a deeper intuition behind transformers like BERT, I would suggest a series of blogs on their architecture and fine-tuned tasks.

BERT Architecture (Devlin et al., 2018)

BioBERT (Lee et al., 2019) is a variation of the aforementioned model from Korea University and Clova AI. The researchers extended the corpora of the original BERT with PubMed and PMC. PubMed is a database of biomedical citations and abstracts, whereas PMC is an electronic archive of full-text journal articles. Their contribution is a biomedical language representation model that can handle tasks such as relation extraction and drug discovery, to name a few. With a pre-trained model that encompasses both general and biomedical domain corpora, developers and practitioners can now capture biomedical terms that would have been incredibly difficult for a general language model to comprehend.

BioBERT Architecture (Lee et al., 2019)

Text is broken down in BERT and BioBERT through a WordPiece tokenizer, which splits words into frequent subwords, so that Immunoglobulin is tokenized into the constituent pieces I ##mm ##uno ##g ##lo ##bul ##in². These word pieces combine the flexibility of characters with the general meanings carried by whole words. The fine-tuned tasks that achieved state-of-the-art results with BioBERT include named-entity recognition, relation extraction, and question answering. Here we will look at the first task and what exactly is being accomplished.
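To see the subword splitting in action, here is a minimal sketch; it assumes the BioBERT vocabulary file extracted in the Implementation section below, and the exact split depends on that vocabulary.

from transformers import BertTokenizer

# Build the tokenizer from the WordPiece vocabulary that ships with the BioBERT weights
# (the path assumes the files extracted in the Implementation section below).
tokenizer = BertTokenizer(vocab_file="biobert_v1.1_pubmed/vocab.txt", do_lower_case=False)

print(tokenizer.tokenize("Immunoglobulin"))
# -> something like ['I', '##mm', '##uno', '##g', '##lo', '##bul', '##in']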

III. Task

Named-entity recognition (NER) is the task of recognizing proper nouns (or other target phrases) that we establish as entity types to be labeled. The datasets used to evaluate NER are structured in the BIO (Beginning, Inside, Outside) schema, the most common tagging format for sentence tokens within this task. Additionally, an “S” prefix, as in “S-Protein”, can be used to denote a single-token entity. That way we can read off both the positional prefix and the entity type being predicted from the training data.
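As a hypothetical illustration (not taken from the corpus itself), a tagged sentence pairs each token with a positional prefix and entity type:

# Hypothetical BIO/S tagging example; the entity type names in the actual corpus differ.
tokens = ["The", "interleukin", "-", "2", "receptor", "binds", "IL2", "."]
tags   = ["O", "B-Protein", "I-Protein", "I-Protein", "I-Protein", "O", "S-Protein", "O"]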

Named-Entity Recognition with BioNLP13 Corpus

The dataset used in this example is a combination of the BioNLP09 Corpus, BioNLP11 Corpus, and BioNLP13 Corpus. Although we focus on genes and proteins, there are other entity types such as diseases, chemicals, and drugs. There are 74 tags included in these experiments, but for the sake of brevity, here is a peek at the BioNLP13 Corpus tags in the BIO schema:

'B-Anatomical_system',
'B-Cancer',
'B-Cell',
'B-Cellular_component',
'B-Developing_anatomical_structure',
'B-Gene_or_gene_product',
'B-Immaterial_anatomical_entity',
'B-Multi-tissue_structure',
'B-Organ',
'B-Organism',
'B-Organism_subdivision',
'B-Organism_substance',
'B-Pathological_formation',
'B-Simple_chemical',
'B-Tissue',
'I-Amino_acid',
'I-Anatomical_system',
'I-Cancer',
'I-Cell',
'I-Cellular_component',
'I-Developing_anatomical_structure',
'I-Gene_or_gene_product',
'I-Immaterial_anatomical_entity',
'I-Multi-tissue_structure',
'I-Organ',
'I-Organism',
'I-Organism_subdivision',
'I-Organism_substance',
'I-Pathological_formation',
'I-Simple_chemical',
'I-Tissue',
'O'

IV. Implementation

First, we will want to import BioBERT from the original GitHub repository and transfer the files to our Colab notebook. Here we are downloading the main BioBERT file, extracting the BioBERT weights, and converting them to PyTorch so they work with the HuggingFace API. We move the config file for simplicity, and now we are good to go!
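In a Colab cell, the step looks roughly like this (shell commands are prefixed with "!"); the release URL, checkpoint name, and paths are assumptions based on the naver/biobert-pretrained releases, so substitute whatever files your notebook actually uses.

# Download and extract the pre-trained BioBERT weights (URL and file names are assumptions).
!wget -q https://github.com/naver/biobert-pretrained/releases/download/v1.1-pubmed/biobert_v1.1_pubmed.tar.gz
!tar -xzf biobert_v1.1_pubmed.tar.gz

!pip install -q transformers

# Convert the TensorFlow checkpoint to PyTorch so it can be loaded through HuggingFace
# (the conversion needs TensorFlow, which Colab ships with).
!transformers-cli convert --model_type bert \
    --tf_checkpoint biobert_v1.1_pubmed/model.ckpt-1000000 \
    --config biobert_v1.1_pubmed/bert_config.json \
    --pytorch_dump_output biobert_v1.1_pubmed/pytorch_model.bin

# HuggingFace expects the configuration under the name config.json.
!cp biobert_v1.1_pubmed/bert_config.json biobert_v1.1_pubmed/config.json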

You will need the transformers library from HuggingFace. For the full list of installs and imports, consult my notebook. We set the maximum text length and batch size as constraints we will be working with later. We also create a device that utilizes the GPU for computation in Colab. The BertTokenizer class will take in the vocab.txt from the BioBERT files we’ve previously set up.
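A sketch of this setup, with illustrative values for the constants:

import torch
from transformers import BertTokenizer

# Constraints used throughout the notebook (the exact values are assumptions).
MAX_LEN = 128      # maximum number of tokens per sentence
BATCH_SIZE = 32    # batch size for training and validation

# Use the Colab GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The tokenizer is built from the vocab.txt that ships with the BioBERT files.
tokenizer = BertTokenizer(vocab_file="biobert_v1.1_pubmed/vocab.txt", do_lower_case=False)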

The SentenceFetch class will take in our data, read the TSV files that the BIO schema text comes in, and organize the sentences and tags into workable data structures. We then have methods to retrieve the sentences and tags.
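My implementation follows this general shape; it assumes one token-and-tag pair per line separated by a tab, with blank lines marking sentence boundaries, which may differ slightly from the exact corpus layout.

class SentenceFetch:
    """Read a BIO-formatted TSV file and group its tokens and tags into sentences."""

    def __init__(self, path):
        self.sentences = []
        self.tags = []
        with open(path, encoding="utf-8") as f:
            tokens, labels = [], []
            for line in f:
                line = line.strip()
                if not line:                       # blank line marks the end of a sentence
                    if tokens:
                        self.sentences.append(tokens)
                        self.tags.append(labels)
                        tokens, labels = [], []
                    continue
                token, tag = line.split("\t")[:2]  # token<TAB>tag per line (assumed layout)
                tokens.append(token)
                labels.append(tag)
            if tokens:                             # catch a final sentence with no trailing blank line
                self.sentences.append(tokens)
                self.tags.append(labels)

    def get_sentences(self):
        return self.sentences

    def get_tags(self):
        return self.tags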

We search through all the subdirectories of our root directory. In Colab, I would suggest uploading your data from either Google Drive or your local drive. We use the SentenceFetch class and create a list of sentences and tags to use in our experiments.
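Something along these lines, where the root directory name is a placeholder for wherever you uploaded the data; I also build the tag vocabulary here, with an extra PAD label used later for padding.

import os

ROOT_DIR = "data/"   # placeholder for your uploaded data directory

sentences, tags = [], []
for dirpath, _, filenames in os.walk(ROOT_DIR):
    for name in filenames:
        if name.endswith(".tsv"):
            fetch = SentenceFetch(os.path.join(dirpath, name))
            sentences.extend(fetch.get_sentences())
            tags.extend(fetch.get_tags())

# Map every tag to an index, plus a PAD label for padded positions.
tag_values = sorted({t for sent_tags in tags for t in sent_tags})
tag_values.append("PAD")
tag2idx = {t: i for i, t in enumerate(tag_values)}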

We create a helper function to tokenize the text without losing the labels to each token. We need those labels intact for our model.
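A common way to do this (and roughly what my notebook does) is to repeat a word’s label for every WordPiece sub-token it produces:

def tokenize_and_preserve_labels(sentence, text_labels):
    """WordPiece-tokenize a sentence while keeping one label per sub-token."""
    tokenized_sentence, labels = [], []
    for word, label in zip(sentence, text_labels):
        pieces = tokenizer.tokenize(word)
        tokenized_sentence.extend(pieces)
        labels.extend([label] * len(pieces))   # repeat the word's label for each piece
    return tokenized_sentence, labels

tokenized_texts_and_labels = [
    tokenize_and_preserve_labels(sent, labs) for sent, labs in zip(sentences, tags)
]
tokenized_texts = [pair[0] for pair in tokenized_texts_and_labels]
labels = [pair[1] for pair in tokenized_texts_and_labels]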

Now we can normalize our input data via “pad_sequences” in Keras. This pads sequences shorter than our maximum length so that every input has the same fixed length.
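A sketch of the padding step, with the same treatment applied to the tag sequences:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Convert tokens and tags to ids, then pad (or truncate) everything to MAX_LEN.
input_ids = pad_sequences(
    [tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
    maxlen=MAX_LEN, dtype="int64", value=0, truncating="post", padding="post",
)
tag_ids = pad_sequences(
    [[tag2idx[t] for t in lab] for lab in labels],
    maxlen=MAX_LEN, dtype="int64", value=tag2idx["PAD"], truncating="post", padding="post",
)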

Now we can finalize the data preparation for our modeling. Attention masks are used when batching sequences together to indicate which tokens should be attended to. We split our inputs and masks between training and validation data. Then we convert our data to tensors to work properly with PyTorch. Afterward, we pass these tensors through the data utils in PyTorch and finally have the data ready for our model.
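Put together, the data preparation looks roughly like this (the split ratio and random seed are illustrative):

import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, RandomSampler, SequentialSampler, DataLoader

# Attention masks: 1 for real tokens, 0 for padding.
attention_masks = [[float(i != 0) for i in seq] for seq in input_ids]

# Split inputs, tags, and masks into training and validation sets.
tr_inputs, val_inputs, tr_tags, val_tags = train_test_split(
    input_ids, tag_ids, random_state=42, test_size=0.1)
tr_masks, val_masks, _, _ = train_test_split(
    attention_masks, input_ids, random_state=42, test_size=0.1)

# Convert everything to PyTorch tensors.
tr_inputs, val_inputs = torch.tensor(tr_inputs), torch.tensor(val_inputs)
tr_tags, val_tags = torch.tensor(tr_tags), torch.tensor(val_tags)
tr_masks, val_masks = torch.tensor(tr_masks), torch.tensor(val_masks)

# Wrap the tensors in DataLoaders via PyTorch's data utils.
train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=BATCH_SIZE)

valid_data = TensorDataset(val_inputs, val_masks, val_tags)
valid_dataloader = DataLoader(valid_data, sampler=SequentialSampler(valid_data), batch_size=BATCH_SIZE)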

In the modeling stage, we use the BertConfig class to load the configuration file for the model. We also use the pre-trained weights to establish our “state_dict”. The “state_dict” is a dictionary that maps each layer to its parameter tensor, which adds a great deal of modularity to our models and optimizers in PyTorch.
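Loading both pieces is short; the paths assume the converted files from earlier:

import torch
from transformers import BertConfig

# Model configuration and the converted PyTorch weights from the earlier step.
config = BertConfig.from_json_file("biobert_v1.1_pubmed/config.json")
state_dict = torch.load("biobert_v1.1_pubmed/pytorch_model.bin", map_location="cpu")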

We create a simple BioBERT class for the NER model. Our attributes are the layers in our network along with a forward pass method.
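My class is essentially a thin token-classification head on top of the BioBERT encoder; the sketch below shows the idea, though the exact layers and weight-loading logic in the notebook may differ.

import torch.nn as nn
from transformers import BertModel

class BioBERT(nn.Module):
    """BioBERT encoder followed by dropout and a per-token linear classifier."""

    def __init__(self, config, state_dict, num_labels):
        super().__init__()
        # from_pretrained maps the converted checkpoint onto the encoder for us.
        self.bert = BertModel.from_pretrained(
            "biobert_v1.1_pubmed", config=config, state_dict=state_dict)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs[0])   # per-token hidden states
        logits = self.classifier(sequence_output)    # (batch, seq_len, num_labels)
        return logits

model = BioBERT(config, state_dict, num_labels=len(tag2idx)).to(device)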

With our primary model created, we set up the optimization and learning rate scheduler. We also set other hyperparameters here.
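The values below are illustrative rather than the exact ones from my notebook:

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

EPOCHS = 3
MAX_GRAD_NORM = 1.0

optimizer = AdamW(model.parameters(), lr=3e-5, eps=1e-8)

# Linear learning rate decay over the full training run.
total_steps = len(train_dataloader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Ignore padded positions when computing the loss.
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=tag2idx["PAD"])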

A function was created that manages training for one epoch. Here we put the model in train mode so that batch norm and dropout layers behave as they should during training rather than in eval mode, which PyTorch requires us to set explicitly. Gradients are computed and model weights are updated.
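A sketch of that function, computing token-level accuracy over non-padded positions only:

def train_epoch(model, dataloader, optimizer, scheduler):
    model.train()   # enable training-mode behaviour for dropout (and batch norm, if any)
    total_loss, correct, total = 0.0, 0, 0
    for batch in dataloader:
        input_ids, masks, label_ids = (t.to(device) for t in batch)

        model.zero_grad()
        logits = model(input_ids, attention_mask=masks)
        loss = loss_fn(logits.view(-1, logits.shape[-1]), label_ids.view(-1))

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()
        preds = logits.argmax(dim=-1)
        keep = label_ids != tag2idx["PAD"]           # score only real (non-padded) tokens
        correct += ((preds == label_ids) & keep).sum().item()
        total += keep.sum().item()

    return total_loss / len(dataloader), correct / total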

A second function was created to evaluate our model. As one could guess, we now need to inform the layers that we are in eval mode. We also deactivate the autograd engine to reduce memory usage and speed up computations. In the future, it would be more concise to consolidate these two functions as methods in the BioBERT model class.
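Its evaluation counterpart mirrors the training function, minus the gradient updates:

def eval_epoch(model, dataloader):
    model.eval()                 # switch dropout (and batch norm, if any) to eval behaviour
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():        # deactivate autograd for speed and lower memory use
        for batch in dataloader:
            input_ids, masks, label_ids = (t.to(device) for t in batch)
            logits = model(input_ids, attention_mask=masks)
            loss = loss_fn(logits.view(-1, logits.shape[-1]), label_ids.view(-1))

            total_loss += loss.item()
            preds = logits.argmax(dim=-1)
            keep = label_ids != tag2idx["PAD"]
            correct += ((preds == label_ids) & keep).sum().item()
            total += keep.sum().item()

    return total_loss / len(dataloader), correct / total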

Here we will loop through our epochs, use our two functions, and print the results for both the training and validation loss/accuracy.
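The training loop itself is then just a few lines, roughly:

for epoch in range(1, EPOCHS + 1):
    print(f"======== Epoch {epoch}/{EPOCHS} ========")
    train_loss, train_acc = train_epoch(model, train_dataloader, optimizer, scheduler)
    print(f"Train Loss: {train_loss} Train Accuracy: {train_acc}")
    val_loss, val_acc = eval_epoch(model, valid_dataloader)
    print(f"Val Loss: {val_loss} Val Accuracy: {val_acc}")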

Our results show high accuracy: about 96% on the training set and 95% on the validation set. Given the time training takes (although here the model ran incredibly fast with the available GPU!), there is no reason to increase the number of epochs.

======== Epoch 1/3 ========
Train Loss: 0.20869970354739556 Train Accuracy: 0.9479462699822381
Val Loss: 0.1037805580667087 Val Accuracy: 0.9576587301587302
======== Epoch 2/3 ========
Train Loss: 0.09325889256480109 Train Accuracy: 0.9650584665482536
Val Loss: 0.09049581730413059 Val Accuracy: 0.9589087301587302
======== Epoch 3/3 ========
Train Loss: 0.0828356556263529 Train Accuracy: 0.9658170515097693
Val Loss: 0.08888424655038213 Val Accuracy: 0.9585449735449736
CPU times: user 8min 41s, sys: 6min 12s, total: 14min 54s
Wall time: 14min 58s

When we consider loss, the learning curve shows a steady decrease over the three epochs.

Here we can now tag a random sentence from a biomedical paper abstract.
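The tagging step looks roughly like the sketch below; the sentence here is a stand-in rather than the exact abstract sentence used in my notebook, and the sub-token re-joining is just for readable output.

# A sample sentence to tag (substitute any raw biomedical text).
test_sentence = "Complementation of the deletion requires both HisG and HisZ."

tokens = tokenizer.tokenize(test_sentence)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)]).to(device)

model.eval()
with torch.no_grad():
    logits = model(input_ids)
preds = logits.argmax(dim=-1).squeeze().tolist()

# Re-join WordPiece sub-tokens before printing token/tag pairs.
joined_tokens, joined_tags = [], []
for token, pred in zip(tokens, preds):
    if token.startswith("##"):
        joined_tokens[-1] += token[2:]
    else:
        joined_tokens.append(token)
        joined_tags.append(tag_values[pred])

for token, tag in zip(joined_tokens, joined_tags):
    print(f"{token} - {tag}")

Running this over the abstract sentence from my notebook produces the output below: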

( - O 
i - O
) - O
Genetic - O
experiments - O
demonstrate - O
that - O
complementation - O
of - O
an - O
in - O
- - O
frame - O
deletion - O
of - O
HisG - S-Protein
from - O
Escherichia - O
coli - O
( - O
which - O
does - O
not - O
possess - O
HisZ - S-Protein
) - O
requires - O
both - O
HisG - S-Protein
and - O
HisZ - S-Protein
from - O
L - O

We see that it has properly tagged singular/solo protein mentions. However, given the abundance of the “O” tag denoting words irrelevant to the task at hand, some predictions overfit toward “O” because of the sheer number of such examples in the training data. This tag imbalance would be an integral area to improve upon in the future.

V. Conclusion

We got a general sense of how BioBERT operates and how it expands on the work of BERT. By leveraging BioBERT, we sought to properly tag biomedical text through the NER task. I walked us through my implementation of BioBERT that imported the necessary files, preprocessed the data, and finally, constructed, trained, and tested the model. For a deeper breakdown of BERT in Colab, I would highly recommend the tutorials of ChrisMcCormickAI. There is also some valuable information in Chapter 10 of Practical Natural Language Processing pertaining directly to NLP tools and applications (including BioBERT!) in the healthcare industry. I hope this was helpful in guiding you through a state-of-the-art model in the biomedical domain.

[1]: Devlin et al. (October 2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805

[2]: Lee et al. (September 2019). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. https://arxiv.org/abs/1901.08746
