
Bringing BERT to the field: Transformer models for gene expression prediction in maize

Zihao Xu
Towards Data Science
16 min read · Dec 14, 2020


Collaboration between Inari and IACS @ Harvard

Authors: Benjamin Levy, Zihao Xu, Liyang Zhao, Shuying Ni

To get a quick overview of our project, check out our project poster and video narration. Special thanks to our partners at Inari, Karl Kremling and Ross Altman, our Teaching Fellows Phoebe Wong and Shahab Asoodeh, and our course instructor Chris Tanner.

Disclaimer: The views expressed in this post are the views of the authors and do not reflect the views of Inari or those of the Institute for Advanced Computational Science (IACS) at Harvard University.

1. Introduction

Photo by Charles Deluvio on Unsplash

Inspired by the rapid developments in the field of Natural Language Processing (NLP), especially the papers “Attention is All You Need” (A. Vaswani et al., 2017) and “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018), Inari and a group of curious students at IACS set out to explore the possibility of applying powerful language representation models like transformers to the field of genomics.

In this blog post, we document our findings as a part of the Harvard AC297r Capstone Project. Supported by our partner Inari, we built a transformer-based model called CornBERT that, given a gene’s regulatory (promoter) sequence of maize DNA, can predict how much that gene will be expressed in ten different corn tissues.

In a human body, gene sequences govern what proteins are made. Equally important is the DNA that controls those genes, called promoters. Whereas the gene controls WHAT protein is made, the promoter controls WHEN, WHERE in the body, and HOW MUCH of that protein is made. Promoters reside in regulatory regions, which are typically located next to the genes they control; these regions are called regulatory sequences.

In simple terms, gene expression levels, typically measured in transcripts per million, represent the number of copies of a particular gene's transcript within a cell. For plants, a higher expression level of a gene encoding a heat-stress-resistance protein could translate into higher resiliency.

In the sections below, we will walk through the data acquisition process, the CornBERT model, and model performance and visualization.

2. Data Acquisition

2.1 Regulatory Sequence Data

We acquired genomic data from three databases: Ensembl Plants, RefSeq, and MaizeGDB. The first two databases contain DNA sequences with corresponding annotation files for many plant species, while the third is specific to maize (corn).

In order to extract the regulatory sequences, we first download the raw DNA sequences and the corresponding annotation files that record the start and end positions of each gene. We then feed each sequence-annotation pair into an end-to-end processing pipeline, built on the samtools and bedtools packages, which outputs the regulatory region located immediately upstream of each gene, with a user-specified length. See the figure below for a graphical walk-through.

The 3-step Process for Data Acquisition. Image by the authors.
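To make the pipeline more concrete, here is a minimal sketch of how the extraction step could be scripted with samtools and bedtools from Python. The file names are placeholders and the exact commands we ran may differ.

```python
import subprocess

# Placeholder inputs: a genome FASTA and a BED file of gene coordinates
# derived from the annotation (GFF/GTF).
GENOME_FA = "Zm-B73-REFERENCE.fa"
GENES_BED = "genes.bed"
PROMOTER_LEN = 1000  # length cap suggested by our partner

# 1. Index the genome and produce a chromosome-size file for bedtools.
subprocess.run(["samtools", "faidx", GENOME_FA], check=True)
subprocess.run(f"cut -f1,2 {GENOME_FA}.fai > genome.chrom.sizes",
               shell=True, check=True)

# 2. Take the 1,000 bp immediately upstream of each gene (strand-aware).
with open("promoters.bed", "w") as out:
    subprocess.run(
        ["bedtools", "flank", "-i", GENES_BED, "-g", "genome.chrom.sizes",
         "-l", str(PROMOTER_LEN), "-r", "0", "-s"],
        stdout=out, check=True,
    )

# 3. Extract the promoter sequences as FASTA, keyed by gene name.
subprocess.run(
    ["bedtools", "getfasta", "-fi", GENOME_FA, "-bed", "promoters.bed",
     "-s", "-name", "-fo", "promoters.fa"],
    check=True,
)
```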

The image below provides some general statistics of the processed data for each of the databases. Note that in our implementation, we capped the length of the regulatory regions at 1,000 nucleotides, as suggested by our partner Inari.

Summary Statistics of the Databases. Image by the authors.

For our purposes, we use sequences from the three databases as the “corpus” on which the language model is trained.

2.2 Gene Expression Levels

Having acquired and extracted the regulatory sequences, we now turn to the matrix of gene expression levels that serves as our target variable. Below is a schematic representation of the gene expression matrix.

Schematic of the Gene Expression Matrix. Image by the authors.

Each row of the matrix corresponds to a particular gene, where gene names like “gene_1” are unique identifiers that map to the corresponding regulatory sequences. Each column represents a particular tissue, which can be thought of as a “body part” of the maize plant.

Finally, we split the matrix row-wise into 70% training data and 30% testing data. Splitting over genes ensures that our model is able to generalize to genes it was not trained on.
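As a minimal sketch of this gene-level split (assuming the expression matrix is loaded as a pandas DataFrame indexed by gene identifier; the file name is a placeholder):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# expr is a (num_genes x 10) matrix of expression levels indexed by gene id.
expr = pd.read_csv("expression_matrix.csv", index_col="gene_id")

# Split over genes (rows), so no gene appears in both sets.
train_genes, test_genes = train_test_split(expr.index, test_size=0.3, random_state=42)
train_expr = expr.loc[train_genes]
test_expr = expr.loc[test_genes]
```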

3. CornBERT

Now to the exciting part — transformers! The sections below require some degree of understanding of the transformer models. To brush up on some of the background knowledge, check out this Amazing Blog by Jay Alammar, or this Awesome Video by The A.I. Hacker — Michael Phi.

3.1 RoBERTa

Building on previous work using transformer-type models for modeling genomic or proteomic data (see here, here, and here), we implement a transformer that we call CornBERT. The model is based on RoBERTa (Liu et al., 2019), which itself largely follows the same architecture as BERT (Devlin et al., 2018). The model consists of:

  1. An embedding layer
  2. L encoder layers
  3. A task-specific head

Data preprocessing

Before being fed into the transformer, each sequence is preprocessed using byte-level byte-pair encoding (BPE). BPE is an iterative algorithm that begins with a fixed vocabulary of individual nucleotides (A, T, C, G, N) and repeatedly merges the most frequent adjacent pair of tokens in the corpus of training sequences into a new, larger composite token. This is repeated until a target vocabulary size is reached (in our case, 5,000). For more on (byte-level) BPE, see the HuggingFace tokenizers library docs here.
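For illustration, a byte-level BPE tokenizer with a 5,000-token vocabulary can be trained with the HuggingFace tokenizers library roughly as follows; the file path is a placeholder and the printed tokens are only illustrative.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the promoter "corpus"
# ("promoters.txt" holds one sequence per line; placeholder path).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["promoters.txt"],
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# A sequence is now split into composite tokens rather than single nucleotides.
encoding = tokenizer.encode("ATTTACAAATAGCCG")
print(encoding.tokens)      # e.g. ['ATTTA', 'CAAATAG', 'CCG'] (illustrative)

tokenizer.save_model(".")   # writes vocab.json and merges.txt
```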

After tokenization, sequences tend to be roughly 180–220 tokens, so we set the transformer's maximum input length to be 256. Because inputs must be packaged into batches of equal length, we pad all sequences with a fixed <pad> token.

An example of 4 sequences left-padded to the same length of 6. Image by the authors.

Padding is done on the left to ensure that sequences are aligned at their ends since sequences were obtained by starting just upstream of a functional gene and going backward.
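With the HuggingFace transformers tokenizer wrapper, left padding to a fixed length can be requested directly. A minimal sketch, assuming the BPE vocabulary from the previous step has been saved to disk:

```python
from transformers import RobertaTokenizerFast

# Load the trained BPE vocabulary (vocab.json / merges.txt saved above) and
# pad on the left so sequences stay aligned at the gene-proximal end.
tokenizer = RobertaTokenizerFast.from_pretrained(".", padding_side="left")

batch = tokenizer(
    ["ATTTACAAATAG", "CCGGTTAACCGGTTAACCGG"],   # toy sequences
    padding="max_length",
    truncation=True,
    max_length=256,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 256])
```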

Another preprocessing step is a transformation of the gene expression levels, our target variable. Because the distribution of gene expression values (expressed in transcripts per million, or TPM) is highly right-skewed, we apply a log-transformation:

Y’ = log(Y + 1)

where Y and Y’ denote the original and the transformed expression levels and log denotes the natural logarithm. The +1 term avoids taking the log of 0, which is undefined. After the transformation, the distributions of expression levels for the training and testing sets look like the following.

Distribution of the gene expression levels (log-transformed). The original distributions are omitted here because they are too skewed to be visualized at all. Image by the authors.

We observe that the transformed distribution is roughly normal, except for a large spike at 0 corresponding to the many near-zero expression levels on the original scale.

Embedding layer

The input to the embedding layer is a padded sequence of 256 tokens, obtained using the byte-level byte-pair encoding (see the data preprocessing section above). Then:

  1. Each token is mapped to a 768-dimensional vector based on its identity (the letters it contains). E.g. “ATTA” gets mapped to [-1, 5, 2.4, 0.55, …]. These are referred to as token embeddings
  2. Separately, each token is mapped to a 768-dimensional vector based on its position. These are referred to (unsurprisingly) as positional embeddings
  3. The token embeddings and positional embeddings are summed element-wise (this is possible because both are 768 dimensions)

The output of the embedding layer is a set of 256 vectors of 768 dimensions each, forming a 256x768 matrix.
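In PyTorch terms, the embedding layer amounts to two lookup tables whose outputs are summed. A minimal sketch with the dimensions described above (the vocabulary size of 5,004 includes the special tokens):

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MAX_LEN, HIDDEN = 5004, 256, 768

token_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)   # lookup by token id
pos_emb = nn.Embedding(MAX_LEN, HIDDEN)        # lookup by position

input_ids = torch.randint(0, VOCAB_SIZE, (1, MAX_LEN))  # dummy padded sequence
positions = torch.arange(MAX_LEN).unsqueeze(0)          # 0, 1, ..., 255

# Element-wise sum of token and positional embeddings.
embeddings = token_emb(input_ids) + pos_emb(positions)
print(embeddings.shape)  # torch.Size([1, 256, 768])
```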

Transformer encoder layer

An example of a single transformer block, where the input tokens are “thinking”, “machines”. In our case, they would be groupings of nucleotides, such as “ATTTA”, “CAAATAG”. Source: “The Illustrated Transformer” (Alammar, 2018).

The transformer contains L=6 encoder layers, each of which consists of 6 “heads” of the same self-attention mechanism, followed by a concatenation of the attention outputs and 2 feed-forward neural network layers. The self-attention and feed-forward layers are connected via residual (skip) connections, which improve convergence during training.

The inputs to the self-attention mechanism are three learned vectors per token, referred to as “query”, “key”, and “value”. These vectors are assembled into three matrices, each with a row per token, denoted Q, K, and V. Attention is calculated as

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where d_k is the dimension of the key vectors.

The output of each of the 6 attention heads is concatenated and fed into two successive feed-forward layers, outputting a 768-dimensional vector for each token (Vaswani et al., 2017). Finally, these are fed into the task-specific head (next section).

Schematic of the self-attention mechanism. Source: “Attention Is All You Need” (Vaswani et al., 2017)
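A minimal PyTorch sketch of the scaled dot-product attention formula above, for a single head (tensor sizes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V for a single head."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # token-to-token similarities
    weights = torch.softmax(scores, dim=-1)            # attention weights per token
    return weights @ V                                 # weighted sum of value vectors

# Example: 256 tokens, 128 dimensions per head (768 / 6 heads; sizes illustrative).
Q, K, V = (torch.randn(256, 128) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([256, 128])
```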

Task-specific head

The final part of the transformer is referred to as the “head”, but a better name for it might be the “hat” because just like a hat, it can be taken off and switched with another if the occasion demands (we hope the same is not true of your head). This flexibility is the key to the transformer’s success.

The base of the transformer (the part with multi-headed attention) has over 46 million parameters, but we only have roughly 30,000 genes with measured expression levels. To train this part of the model, we use transfer learning: we first pre-train the model on a large corpus using a task similar to our own, and then fine-tune it on our smaller dataset.

An overview of the transfer learning process. After the language model is trained on large datasets (first two columns), the head is changed to a regression head. Figure adapted from “Genomic-ULMFiT” (Heyer, 2019)

3.2 Model Pretraining

For the pre-training step, we use a masked language model head, as in Devlin et al. (2018). In this task, 15% of the input tokens are randomly replaced with a <mask> token, and the model is trained to predict the original token at each masked position. This is accomplished by first passing the final-layer embeddings through a feed-forward layer and then projecting the vectors back into the same number of dimensions as the vocabulary (i.e. 5,004), where the values at each token’s index are taken to be the log-probabilities for that token. First, we train the model on this task on all plant promoter sequences from the RefSeq and Ensembl databases. Second, we further pre-train on just the maize promoter sequences from MaizeGDB, since the downstream task involves maize DNA alone.
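A minimal sketch of this masked-language-model setup with the HuggingFace transformers library; the configuration values mirror the description above but are illustrative rather than our exact training code.

```python
from transformers import (RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
                          DataCollatorForLanguageModeling)

# Load the byte-level BPE vocabulary saved earlier (vocab.json / merges.txt in ".").
tokenizer = RobertaTokenizerFast.from_pretrained(".", model_max_length=256)

# A RoBERTa base sized roughly as described above (values illustrative).
config = RobertaConfig(
    vocab_size=5004,
    max_position_embeddings=258,  # 256 tokens plus room for special positions
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=6,
)
model = RobertaForMaskedLM(config)

# Masks 15% of the input tokens and builds the corresponding prediction labels.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```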

3.3 Model Fine-tuning

The fine-tuning task is regression: predicting the gene expression levels for each gene. A common strategy for regression/classification with BERT models is to use the final embedding of the first token in the sequence (a special token represented as [CLS] or <s>) as the embedding of the whole sequence, which is then passed through a feed-forward layer and used for regression or classification. However, we found empirically that this did not perform well for predicting gene expression. Instead, we average the final embeddings of all the tokens in the sequence, excluding the <pad> and start/end tokens (<s> and </s>, respectively), generating a single 768-dimensional vector that is then passed through a feed-forward neural network.

Schematic of pre-trained CornBERT with prediction head, including the byte-pair encoding tokenization step. Image by the authors.

Multi-task regression

For each gene sequence, we were given gene expression levels for each of 10 different tissues. Rather than training separate networks for each tissue, we re-engineered the RoBERTa model to predict the expression level for each tissue simultaneously. Thus, the output of the regression head is not a scalar, but a 10-dimensional vector.
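Below is a simplified PyTorch sketch of this head, combining the mean pooling described above with a 10-dimensional output (one value per tissue); it is a stand-in for, not a copy of, our actual implementation.

```python
import torch
import torch.nn as nn

class MeanPoolRegressionHead(nn.Module):
    """Average final token embeddings (ignoring pad/special tokens), then regress."""

    def __init__(self, hidden_size=768, num_tissues=10):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_tissues)  # one output per tissue

    def forward(self, hidden_states, mask):
        # hidden_states: (batch, seq_len, hidden); mask: 1 for real tokens,
        # 0 for <pad>/<s>/</s> positions excluded from the average.
        mask = mask.unsqueeze(-1).float()
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.out(torch.tanh(self.dense(pooled)))  # (batch, 10)

# Example with dummy inputs.
h = torch.randn(4, 256, 768)
m = torch.ones(4, 256)
print(MeanPoolRegressionHead()(h, m).shape)  # torch.Size([4, 10])
```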

Training tricks

Training large transformer models is notoriously difficult and often requires some clever tricks to work well. Luckily, HuggingFace’s wonderful transformers library implements many of these tricks for us, such as warming up and then linearly decreasing the learning rate, batching, parallel data loading, and multiple gradient accumulation steps.

In addition, we use the LAMB optimizer, a layer-wise optimization algorithm designed specifically for the fast convergence of transformer models (You et al., 2020). We use the following hyperparameters:

beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8, max_learning_rate = 0.001

We train with a batch size of 84 on 4 NVIDIA® Tesla V100 Tensor Core GPUs for roughly 24 hours total for both pre-training runs (all plant DNA and just maize).
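As a rough sketch, the optimizer and schedule can be wired up as follows, using the third-party torch_optimizer package for LAMB and the linear warm-up helper from transformers; the stand-in model and the step counts are illustrative, not our actual training script.

```python
import torch
import torch_optimizer  # third-party package providing a LAMB implementation
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 10)  # stand-in for the full CornBERT model

optimizer = torch_optimizer.Lamb(
    model.parameters(),
    lr=1e-3,             # max_learning_rate
    betas=(0.9, 0.999),  # beta_1, beta_2
    eps=1e-8,            # epsilon
)

# Warm the learning rate up linearly, then decrease it linearly.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,      # illustrative
    num_training_steps=100_000,  # illustrative
)
```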

3.4 Supervised fine-tuning

For fine-tuning, we employ the same tricks as in the pre-training phase (LAMB optimizer, linear warm-up, and cooldown), as well as progressive unfreezing from the top down. This means that initially, the RoBERTa base layers are all kept frozen while the randomly initialized weights of the regression head are trained using the linear warm-up and cooldown as before. After 400 iterations, the top 2 layers of the RoBERTa base are unfrozen and their learning rate is linearly warmed up to half the head’s maximum learning rate (the head has a maximum of 2e-3; the top two layers have a maximum of 1e-3). The same delay of 400 steps is used for each subsequent pair of base layers (i.e. layers 3 and 4 are unfrozen after another 400 steps, followed by layers 1 and 2 after another 400 steps). The embedding layers are never unfrozen. Below is a visual depicting the layers of CornBERT colored by training group and the corresponding learning rate schedule for each group.

Layers of CornBERT colored by training group. The regression head is trained immediately; layers 5 & 6 are trained after a delay of 400 steps, followed by layers 3 & 4 and finally layers 1 & 2. The embedding layer remains frozen. Image by the authors.
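In PyTorch, progressive unfreezing boils down to toggling requires_grad on groups of parameters at scheduled steps. A simplified sketch, assuming a HuggingFace-style model that exposes a .roberta attribute (the stand-in model here is illustrative):

```python
from transformers import RobertaConfig, RobertaForSequenceClassification

# Stand-in model: any HuggingFace RoBERTa variant exposes .roberta.encoder.layer.
model = RobertaForSequenceClassification(
    RobertaConfig(vocab_size=5004, hidden_size=768, num_hidden_layers=6,
                  num_attention_heads=6, num_labels=10)
)

def freeze_base(model):
    """Freeze the embedding layer and all encoder layers; only the head trains."""
    for param in model.roberta.parameters():
        param.requires_grad = False

def unfreeze_layers(model, layer_indices):
    """Unfreeze selected encoder layers (0-indexed) of the RoBERTa base."""
    for i in layer_indices:
        for param in model.roberta.encoder.layer[i].parameters():
            param.requires_grad = True

freeze_base(model)

# Unfreezing schedule keyed by training step; the embedding layer stays frozen.
unfreeze_schedule = {400: [4, 5], 800: [2, 3], 1200: [0, 1]}

# Inside the training loop:
# if step in unfreeze_schedule:
#     unfreeze_layers(model, unfreeze_schedule[step])
```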

Training time on each dataset

Language model — all sequences: Training time: 18h 45m 33s

Language model — just maize: Training time: 4h 41m 4s

Prediction model — just B73 maize: Training time: 9m 38s

That was quite a lot of steps! Now let’s take a look at the training and validation (or “eval”) loss during the entire process of pre-training and fine-tuning.

Train and eval loss during pre-training and fine-tuning. Image by the authors.

We observe that the train and validation losses dropped steadily during the pre-training phases, while fine-tuning improved the model even further. Now it’s time to test our model and see how it performs!

4. Model Evaluation

4.1 Baseline Models

To facilitate comparison, we have developed several baseline models:

  1. const_mean: We simply use the average gene expression levels for each tissue within the training data as the prediction for all the test sequences.
  2. hand_eng: We identified a list of common promoter constituents in plants from the research by Y. Yamamoto et al., 2007. We then converted this list into binary features and applied LightGBM, a tree-based gradient boosting method, for prediction.
  3. BOW_5mer: We converted the raw sequences into Bag of Words representation of 5-mers, or 5-grams in NLP lingo. Then, we applied a simple Linear Regression using the 5-mer counts as features to predict the expression levels.

The details of these models are omitted for brevity; as one concrete example, a minimal sketch of the BOW_5mer baseline is shown below.
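The sketch counts every overlapping 5-mer with scikit-learn's character n-gram vectorizer and fits a multi-output linear regression; the sequences and targets shown are placeholders for our actual data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

# Placeholder data: in practice these are the promoter strings and the
# (genes x 10) matrix of log-transformed expression levels.
train_sequences = ["ATTTACAAATAGCCG", "CCGGTTAACCGGTTA"]
test_sequences = ["ATTTAGGGCCCAAAT"]
y_train = np.random.rand(2, 10)

# Count every overlapping 5-mer (character 5-gram) in each sequence.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(5, 5), lowercase=False)
X_train = vectorizer.fit_transform(train_sequences)
X_test = vectorizer.transform(test_sequences)

# A single multi-output linear regression over the 5-mer counts.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)  # shape (1, 10): one prediction per tissue
```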

4.2 Evaluation Metrics

To evaluate model performance, we adopt two evaluation metrics:

  • Mean squared error (MSE) of the log-transformed gene expression levels: MSE = mean[(log(Y + 1) − log(Y_pred + 1))²]
  • Mean absolute error (MAE) of the gene expression levels in the original unit (TPM): MAE = mean[|Y − Y_pred|]

where Y represents the actual gene expression level in TPM and Y_pred represents the model prediction in TPM. Note that we apply the log-transformation when using MSE since this metric is sensitive to outliers, and the addition of 1 avoids taking the log of 0, which is undefined.

Further, we calculated these two metrics both across the entire gene expression matrix (denoted by “All”) and on a per-tissue level (indicated by each of the 10 tissue names).
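A minimal numpy sketch of these two metrics, computed either over the whole matrix (“All”) or over a single tissue column; the y_true and y_pred arrays below are placeholders.

```python
import numpy as np

def mse_log(y_true, y_pred):
    """MSE of the log-transformed expression levels."""
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

def mae(y_true, y_pred):
    """MAE of the expression levels in the original unit (TPM)."""
    return np.mean(np.abs(y_true - y_pred))

# Placeholder (genes x tissues) matrices of TPM values.
y_true = np.random.rand(100, 10) * 50
y_pred = np.random.rand(100, 10) * 50

print(mse_log(y_true, y_pred))          # "All": over the entire matrix
print(mae(y_true[:, 0], y_pred[:, 0]))  # per-tissue: a single column
```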

4.3 Model Comparison

The two tables below display the actual values of the evaluation metrics for the baseline model const_mean and for our model CornBERT.

Test MAE (left) and Test MSE with log-transformation (right). Note that CornBERT improves over the constant-mean baseline model by around 10–12% for the entire set of expression levels. Image by the authors.

We observe that the MAE for CornBERT ranges from around 24 to 30 for the individual tissues, while the MAE over “All” expression levels is 26.62. This means that, on average, the absolute difference between the predicted and actual expression levels is around 26.62 TPM. In the third column of each table, we also show the percentage improvement (the reduction in the metric) achieved by our model, which we call “pct_reduction”. CornBERT improves over the const_mean baseline by around 10–13% for the entire set of expression levels, and by around 7–18% for the individual tissues.

Below, we visually display the model results. To facilitate comparison, we divide each metric by the group’s maximum (the worst-performing model), so that the metrics are capped at 1.

Test MSE (log-transformed). Image by the authors.
Test MAE. Image by the authors.

We see that CornBERT performs the best among all of the models across all of the tissues as well as over the entire test set. Pretty amazing!

4.4 The importance of location

In addition to making predictions, we can also explore which areas within the regulatory region the model finds important, i.e., which positions in the regulatory sequences have a greater impact on the predicted gene expression levels. To answer this question, we compare CornBERT’s predictions on a set of sequences that have been randomly permuted at different locations (from 0 to 1000) with its predictions on the original sequences. Intuitively, a larger difference indicates that the model places more weight on the permuted position and thus deems it more important.

Detailed methodology

To modify the regulatory sequences, we proceed in the following steps:

  1. Pick a fixed length of nucleotides to modify, L. Note, we call a set of nucleotides of length L an L-mer.
  2. For each position T on a given sequence, “flip” the nucleotides at positions [T, T+1, …, T+L−1] by changing the original L-mer to a random L-mer (so that the length is preserved).
  3. Repeat the “flipping” process at each position T until we have enough samples.

Using the notation above, we denote the “flipped” sequence flip(X, T, L), where X denotes the original sequence. After the sequence modification, we generate the prediction for the “flipped” sequence, flip(X, T, L)_pred, and calculate its absolute difference from the prediction generated by the original sequence, X_pred. Then, for each position T from 1 to 1000, we average these absolute differences into a metric we call position T’s “importance level”.

Ideally, we would like to flip every sequence at all positions. However, due to computational constraints, we randomly selected a subset of positions to flip for each sequence.
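A simplified sketch of the flipping procedure for a single sequence; the toy predictor stands in for CornBERT and the sampled positions are arbitrary.

```python
import random

NUCLEOTIDES = "ATCG"

def flip(sequence, T, L):
    """Replace the L-mer starting at position T with a random L-mer of equal length."""
    random_lmer = "".join(random.choice(NUCLEOTIDES) for _ in range(L))
    return sequence[:T] + random_lmer + sequence[T + L:]

def importance(sequence, predict, L, positions):
    """Absolute change in prediction when flipping an L-mer at each sampled position."""
    base_pred = predict(sequence)
    return {T: abs(predict(flip(sequence, T, L)) - base_pred) for T in positions}

def toy_predict(seq):
    """Stand-in predictor: GC content (purely illustrative, not CornBERT)."""
    return (seq.count("G") + seq.count("C")) / len(seq)

scores = importance("ATTTACAAATAGCCGGGA", toy_predict, L=3, positions=[0, 5, 10])
print(scores)
```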

Visualization

We experimented with L from 1 to 6 and plotted the L-mer length L on the x-axis. On the y-axis, we plot the distance of each position T from the end of the sequence, i.e., how many nucleotides it lies upstream of the gene. Note that the sequences were padded to an equal length of 1,000 from the left (the side opposite the actual gene sequence). Finally, each cell represents the importance level at a given position T and chosen length L, with brighter colors representing larger importance levels.

Heatmap of the Importance Levels. Image by the authors.

We can see that the importance is higher for distances from 0 to −100 and L from 4 to 6. This means that the model places more emphasis on subsequences of length 4 to 6 located at the end of the regulatory sequences (i.e., closer to the DNA that encodes the gene). This result aligns with biological intuition, since promoter elements like the TATA box tend to be located close to the transcription start site (TSS), or the beginning of the gene coding region. It appears as though CornBERT has learned some fundamental facts about biology all on its own!

5. Conclusion

Our team has come a long way, from raw DNA sequences and gene expression levels to CornBERT, an end-to-end language model for predicting maize gene expression. In addition, the model places higher importance on regions that are closer to the gene, which is consistent with biological knowledge.

As for future steps, we would like to explore possible ways to improve model performance. Some promising next steps include gathering more sequence-expression pairs and using more powerful computers to train our model for longer. In addition, we could use the attention mechanism of BERT-style models to more directly visualize the importance of regions within the regulatory sequences and potentially discover new motifs of biological interest.

Reflecting on our journey throughout the semester, we believe this was a deeply rewarding project. Not only did we successfully apply the transformer model to the field of genomics (and, by doing so, helped advance a burgeoning field of deep language models in biology), we also developed our ability to collaborate effectively with an industry partner and communicate our results to the broader public.

Finally, we would like to express our most sincere gratitude to our partner Inari, our TF mentors, and the lovely IACS community. Thank you all for an amazing semester!

Contact Us

Have a question? Email us or leave a comment below! :)

References

A. Vaswani et al., Attention is All You Need (2017), arXiv:1706.03762

J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), arXiv:1810.04805

Y. Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019), arXiv:1907.11692

Y. You et al., Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (2020), arXiv:1904.00962

Y. Yamamoto et al., Identification of plant promoter constituents by analysis of local distribution of short sequences (2007), BMC Genomics

Y. Jiao et al., Improved maize reference genome with single-molecule technologies (2017), Nature 546, 524–527

EnsemblPlants FTP Database, available at ftp://ftp.ensemblgenomes.org/pub/plants/release-48/, Ensembl Genomes (plant-specific)

RefSeq FTP Database, available at https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant, RefSeq, National Center for Biotechnology Information (NCBI)

Maize Genome Database, available at https://download.maizegdb.org/ and (for B73 specific data) https://download.maizegdb.org/Zm-B73-REFERENCE-GRAMENE-4.0/, MaizeGDB

Gene Expression Levels Data, RNA-seq samples of ten tissues for Zea mays subsp. mays B73 (version 5)(2020), MaizeGDB USDA-ARS, National Center for Biotechnology Information (NCBI)
