Google Cloud TPUv2

Pre-training BERT from scratch with cloud TPU

Denis Antyukhov
Towards Data Science
May 9, 2019 · 7 min read

In this experiment, we will be pre-training a state-of-the-art Natural Language Understanding model BERT on arbitrary text data using Google Cloud infrastructure.

This guide covers all stages of the procedure, including:

  1. Setting up the training environment
  2. Downloading raw text data
  3. Preprocessing text data
  4. Learning a new vocabulary
  5. Creating sharded pre-training data
  6. Setting up GCS storage for data and model
  7. Training the model on a cloud TPU

A few questions

What is this guide useful for?

With this guide, you will be able to train a BERT model on arbitrary text data. This is useful if a pre-trained model for your language or use case is not available in open source.

For whom is this guide?

This guide is intended for NLP researchers who are excited about BERT but not satisfied with the performance of the available open-source models.

How do I get started?

For persistent storage of training data and model, you will require a Google Cloud Storage bucket. Please follow the Google Cloud TPU quickstart to create a GCP account and a GCS bucket. New Google Cloud users get $300 in free credit to get started with any GCP product.

Steps 1–5 of this tutorial can be run without a GCS bucket for demonstration purposes. In that case, however, you will not be able to train the model.

What would it take?

Pre-training a BERT-Base model on a TPUv2 will take about 54 hours. Google Colab is not designed for executing such long-running jobs and will interrupt the training process every 8 hours or so. For uninterrupted training, consider using a paid preemptible TPUv2 instance.

That said, at the time of writing (09.05.2019), pre-training a BERT model from scratch on a Colab TPU can be achieved for little more than the cost of storing that model and data in GCS (~1 USD).

How do I follow the guide?

The code below is a combination of Python and Bash.
It was designed to run in a Colab Jupyter environment.
Therefore, it would be most convenient to simply run it there.

However, all steps of the guide, except for the actual training part, can be run on a separate machine. This could be useful if your dataset is too large (or too private) to preprocess inside a Colab environment.

OK, show me the code.

Here you go.

Do I need to change anything in the code?

The only parameter you really have to set is your GCS BUCKET_NAME. Everything else has default values that should work for most use cases.

Now, let’s get to business.

Step 1: setting up the training environment

First and foremost, we get the packages required to train the model.
The Jupyter environment allows executing bash commands directly from the notebook by using an exclamation mark ‘!’, like this:

!pip install sentencepiece
!git clone https://github.com/google-research/bert

I will be relying on this approach to run several other bash commands throughout the experiment.
Now, let’s import the packages and authorize ourselves in Google Cloud.

Setting up BERT training environment
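In outline, the setup cell boils down to something like this (a sketch assuming a Colab runtime with TensorFlow 1.x and a TPU attached; the variable names are my own):

import os
import sys

import sentencepiece as spm
import tensorflow as tf  # TF 1.x is assumed throughout

from google.colab import auth

# make the cloned BERT repo importable
sys.path.append("bert")
import modeling, optimization, tokenization

# authorize this notebook to access your GCS bucket
auth.authenticate_user()

# if a TPU runtime is attached, Colab exposes its address here
if "COLAB_TPU_ADDR" in os.environ:
    TPU_ADDRESS = "grpc://" + os.environ["COLAB_TPU_ADDR"]
    print("TPU address:", TPU_ADDRESS)
else:
    print("No TPU detected -- training (Step 7) will not work.")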

Step 2: getting the data

We proceed with obtaining a corpus of text data. For this experiment, we will be using the OpenSubtitles dataset, which is available for 65 languages here.

Unlike more commonly used text datasets (like Wikipedia), it does not require any complex pre-processing. It also comes pre-formatted with one sentence per line, which is a requirement for the further processing steps.

Feel free to use the dataset for your language instead by setting the corresponding language code.

Download OPUS data
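In rough outline, the download step looks like this. The URL layout below is an assumption about how OPUS hosts its raw monolingual dumps, so verify it against the OPUS download page before running:

LANG_CODE = "en"  #@param {type:"string"}

# assumed OPUS path for the raw monolingual OpenSubtitles dump -- verify before running
!wget "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/mono/OpenSubtitles.raw.{LANG_CODE}.gz" -O dataset.txt.gz
!gzip -d dataset.txt.gz
!head -n 3 dataset.txt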

For demonstration purposes, we will only use a small fraction of the whole corpus by default.

When training the real model, make sure to uncheck the DEMO_MODE checkbox to use a 100x larger dataset.

Rest assured, 100M lines are perfectly sufficient to train a reasonably good BERT-base model.

Truncate dataset
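A sketch of the truncation step, with illustrative line counts:

DEMO_MODE = True  #@param {type:"boolean"}

# keep only a small fraction of the corpus in demo mode
CORPUS_SIZE = 1000000 if DEMO_MODE else 100000000

!head -n $CORPUS_SIZE dataset.txt > subdataset.txt
!mv subdataset.txt dataset.txt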

Step 3: preprocessing text

The raw text data we have downloaded contains punctuation, uppercase letters and non-UTF-8 symbols, which we will remove before proceeding. During inference, we will apply the same procedure to new data.

If your use case requires different preprocessing (e.g. if uppercase letters or punctuation are expected during inference), feel free to modify the function below to accommodate your needs.

Define preprocessing routine
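A simple stand-in for the preprocessing routine: it lowercases the text, strips accents and punctuation, and collapses whitespace. The function name is my own:

import re
import unicodedata

def normalize_text(text):
    # lowercase everything
    text = str(text).lower()
    # decompose accented characters and drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # drop anything that is not a word character or whitespace (removes punctuation)
    text = re.sub(r"[^\w\s]", " ", text)
    # collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()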

Now let’s preprocess the whole dataset.

Apply preprocessing
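Applying it is then a single pass over the file (the file names here are assumptions carried over from the earlier sketches):

RAW_DATA_FPATH = "dataset.txt"
PRC_DATA_FPATH = "processed_dataset.txt"

with open(RAW_DATA_FPATH, encoding="utf-8", errors="ignore") as fi, \
     open(PRC_DATA_FPATH, "w", encoding="utf-8") as fo:
    for line in fi:
        cleaned = normalize_text(line)
        if cleaned:
            fo.write(cleaned + "\n")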

Step 4: building the vocabulary

For the next step, we will learn a new vocabulary that we will use to represent our dataset.

The BERT paper uses a WordPiece tokenizer, whose implementation is not available in open source. Instead, we will use the SentencePiece tokenizer in unigram mode. While it is not directly compatible with BERT, a small hack lets us make it work.

SentencePiece requires quite a lot of RAM, so running it on the full dataset in Colab will crash the kernel. To avoid this, we will randomly subsample a fraction of the dataset for building the vocabulary. Another option would be to use a machine with more RAM for this step — that decision is up to you.

Also, SentencePiece adds BOS and EOS control symbols to the vocabulary by default. We disable them explicitly by setting their indices to -1.

Typical values for VOC_SIZE lie between 32,000 and 128,000. We reserve NUM_PLACEHOLDERS tokens in case one wants to update the vocabulary and fine-tune the model after the pre-training phase is finished.

Learn SentencePiece vocabulary
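A sketch of the SentencePiece training call. We learn VOC_SIZE minus NUM_PLACEHOLDERS pieces, subsample the input to keep memory in check, rely on the default unigram model type, and disable BOS/EOS as discussed; the exact sizes are illustrative:

PRC_DATA_FPATH = "processed_dataset.txt"
MODEL_PREFIX = "tokenizer"       # produces tokenizer.model and tokenizer.vocab
VOC_SIZE = 32000                 #@param {type:"integer"}
SUBSAMPLE_SIZE = 12800000        #@param {type:"integer"}
NUM_PLACEHOLDERS = 256           #@param {type:"integer"}

# model_type defaults to 'unigram'; bos_id/eos_id of -1 disable those control symbols
SPM_COMMAND = ('--input={} --model_prefix={} '
               '--vocab_size={} --input_sentence_size={} '
               '--shuffle_input_sentence=true '
               '--bos_id=-1 --eos_id=-1').format(
               PRC_DATA_FPATH, MODEL_PREFIX,
               VOC_SIZE - NUM_PLACEHOLDERS, SUBSAMPLE_SIZE)

spm.SentencePieceTrainer.Train(SPM_COMMAND)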

Now, let’s see how we can make SentencePiece work for the BERT model.

Below is a sentence tokenized using the WordPiece vocabulary from a
pre-trained English BERT-base model from the official repo.

>>> wordpiece.tokenize("Colorless geothermal substations are generating furiously")

['color',
'##less',
'geo',
'##thermal',
'sub',
'##station',
'##s',
'are',
'generating',
'furiously']

As we can see, the WordPiece tokenizer prepends the subwords which occur in the middle of words with ‘##’. The subwords occurring at the beginning of words are unchanged. If the subword occurs both in the beginning and in the middle of words, both versions (with and without ‘##’) are added to the vocabulary.

SentencePiece has created two files: tokenizer.model and tokenizer.vocab. Let’s have a look at the learned vocabulary:

Read the learned SentencePiece vocabulary
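The .vocab file is tab-separated (token, log-probability), so a minimal reader only needs to keep the first column and skip the <unk> entry that SentencePiece places at index 0:

import random

def read_sentencepiece_vocab(filepath):
    voc = []
    with open(filepath, encoding="utf-8") as fi:
        for line in fi:
            voc.append(line.split("\t")[0])
    # the first token is <unk>; skip it
    return voc[1:]

snt_vocab = read_sentencepiece_vocab("{}.vocab".format(MODEL_PREFIX))
print("Learnt vocab size: {}".format(len(snt_vocab)))
print("Sample tokens: {}".format(random.sample(snt_vocab, 10)))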

This gives:

Learnt vocab size: 31743 
Sample tokens: ['▁cafe', '▁slippery', 'xious', '▁resonate', '▁terrier', '▁feat', '▁frequencies', 'ainty', '▁punning', 'modern']

As we may observe, SentencePiece does quite the opposite of WordPiece. From the documentation:

SentencePiece first escapes the whitespace with a meta-symbol “▁” (U+2581) as follows:

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Subwords which occur after whitespace (which are also those that most words begin with) are prepended with ‘▁’, while others are unchanged. This excludes subwords which only occur at the beginning of sentences and nowhere else. These cases should be quite rare, however.

So, in order to obtain a vocabulary analogous to WordPiece, we need to perform a simple conversion, removing “▁” from the tokens that contain it and adding “##” to the ones that don’t.

We also add some special control symbols which are required by the BERT architecture. By convention, we put those at the beginning of the vocabulary.

Additionally, we append some placeholder tokens to the vocabulary.
Those are useful if one wishes to update the pre-trained model with new,
task-specific tokens. In that case, the placeholder tokens are replaced with new real ones, the pre-training data is re-generated, and the model is fine-tuned on new data.

Convert the vocabulary to use for BERT
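A sketch of that conversion, following the rules above (the placeholder naming scheme is my own choice):

def parse_sentencepiece_token(token):
    if token.startswith("▁"):
        return token[1:]          # word-initial subword: drop the meta-symbol
    else:
        return "##" + token       # word-internal subword: add the WordPiece prefix

bert_vocab = list(map(parse_sentencepiece_token, snt_vocab))

# control symbols required by BERT, placed at the beginning of the vocabulary
ctrl_symbols = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
bert_vocab = ctrl_symbols + bert_vocab

# pad the vocabulary with placeholder tokens up to the target size
bert_vocab += ["[UNUSED_{}]".format(i) for i in range(VOC_SIZE - len(bert_vocab))]
print("Final vocab size:", len(bert_vocab))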

Finally, we write the obtained vocabulary to file.

Dump vocabulary to file
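One token per line, in the plain-text format the BERT tokenizer expects:

VOC_FNAME = "vocab.txt"  #@param {type:"string"}

with open(VOC_FNAME, "w", encoding="utf-8") as fo:
    for token in bert_vocab:
        fo.write(token + "\n")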

Now let’s see how the new vocabulary works in practice:

>>> testcase = "Colorless geothermal substations are generating furiously"
>>> bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME)
>>> bert_tokenizer.tokenize(testcase)
['color',
'##less',
'geo',
'##ther',
'##mal',
'sub',
'##station',
'##s',
'are',
'generat',
'##ing',
'furious',
'##ly']

Looking good!

Step 5: generating pre-training data

With the vocabulary at hand, we are ready to generate pre-training data for the BERT model.
Since our dataset might be quite large, we will split it into shards:

Split the dataset
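A sketch of the sharding step using the standard split utility; the shard size of 256k lines is an illustrative choice:

!mkdir ./shards
!split -a 4 -l 256000 -d $PRC_DATA_FPATH ./shards/shard_
!ls ./shards/ | head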

Now, for each shard we need to call the create_pretraining_data.py script from the BERT repo. To that end, we will employ the xargs command.

Before we start generating, we need to set some parameters to pass to the script. You can find out more about their meaning in the BERT repo README.

Define parameters for pre-training data

Running this might take quite some time depending on the size of your dataset.

Create pre-training data
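A sketch of the parameters and the xargs invocation. The flag names come from create_pretraining_data.py in the BERT repo; the parameter values and the degree of parallelism are illustrative:

MAX_SEQ_LENGTH = 128      #@param {type:"integer"}
MASKED_LM_PROB = 0.15     #@param
MAX_PREDICTIONS = 20      #@param {type:"integer"}
DO_LOWER_CASE = True      #@param {type:"boolean"}
PROCESSES = 2             #@param {type:"integer"}
PRETRAINING_DIR = "pretraining_data"  #@param {type:"string"}

# run the script once per shard, PROCESSES shards in parallel
XARGS_CMD = ("ls ./shards/ | "
             "xargs -P {procs} -I XX "
             "python3 bert/create_pretraining_data.py "
             "--input_file=./shards/XX "
             "--output_file={out_dir}/XX.tfrecord "
             "--vocab_file={vocab} "
             "--do_lower_case={lower} "
             "--max_predictions_per_seq={max_pred} "
             "--max_seq_length={seq_len} "
             "--masked_lm_prob={mlm_prob} "
             "--random_seed=34 "
             "--dupe_factor=5").format(
                 procs=PROCESSES, out_dir=PRETRAINING_DIR, vocab=VOC_FNAME,
                 lower=DO_LOWER_CASE, max_pred=MAX_PREDICTIONS,
                 seq_len=MAX_SEQ_LENGTH, mlm_prob=MASKED_LM_PROB)

tf.gfile.MkDir(PRETRAINING_DIR)
!$XARGS_CMD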

Step 6: setting up persistent storage

To preserve our hard-earned assets, we will persist them to Google Cloud Storage. Provided that you have created the GCS bucket, this should be easy.

We will create two directories in GCS, one for the data and one for the model. In the model directory, we will put the model vocabulary and configuration file.

Configure your BUCKET_NAME variable here before proceeding, otherwise you will not be able to train the model.

Configure GCS bucket name
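A minimal version of this cell; the directory name is just a default, and BUCKET_NAME must be the bucket you created during the quickstart:

BUCKET_NAME = "your_bucket_name"  #@param {type:"string"}
MODEL_DIR = "bert_model"          #@param {type:"string"}

if not BUCKET_NAME:
    print("WARNING: BUCKET_NAME is not set -- you will not be able to train the model.")

tf.gfile.MkDir(MODEL_DIR)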

Below is a sample hyperparameter configuration for BERT-Base. Change it at your own risk!

Configure BERT hyperparameters and save to disk
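These are the standard BERT-Base hyperparameters, with vocab_size tied to our tokenizer; the config is written next to the vocabulary so both can be uploaded together (a sketch):

import json

bert_base_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": VOC_SIZE
}

with open("{}/bert_config.json".format(MODEL_DIR), "w") as fo:
    json.dump(bert_base_config, fo, indent=2)

# keep the vocabulary next to the config in the model directory
!cp $VOC_FNAME $MODEL_DIR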

Now, we are ready to push our assets to GCS.

Upload assets to GCS
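Since gsutil is available in Colab out of the box, the upload itself is a one-liner (sketch):

if BUCKET_NAME:
    # copy the model directory and the pre-training data to the bucket
    !gsutil -m cp -r $MODEL_DIR $PRETRAINING_DIR gs://$BUCKET_NAME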

Step 7: training the model

We are almost ready to begin training our model.

Be advised that some of the parameters from previous steps are duplicated here, to allow for a convenient restart of the training procedure.

Make sure these parameters are set to exactly the same values you used throughout the experiment.

Configure training run

Prepare the training run configuration, build the estimator and
input function, power up the bass cannon.

Build estimator model and input function

Execute!

Execute BERT training procedure
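Condensed into a single cell, the three steps above amount to something like the following. This is a sketch that assumes the TF 1.x TPUEstimator API and reuses model_fn_builder and input_fn_builder from run_pretraining.py in the cloned BERT repo; the hyperparameter values are illustrative defaults, not necessarily the ones used for the 54-hour run:

import os
import sys
import tensorflow as tf  # TF 1.x

sys.path.append("bert")
import modeling
from run_pretraining import input_fn_builder, model_fn_builder

# --- parameters duplicated from the previous steps ---
BUCKET_NAME = "your_bucket_name"      #@param {type:"string"}
MODEL_DIR = "bert_model"              #@param {type:"string"}
PRETRAINING_DIR = "pretraining_data"  #@param {type:"string"}
MAX_SEQ_LENGTH = 128                  #@param {type:"integer"}
MAX_PREDICTIONS = 20                  #@param {type:"integer"}

# --- training hyperparameters (illustrative values) ---
TRAIN_BATCH_SIZE = 128
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000
NUM_WARMUP_STEPS = 10000
SAVE_CHECKPOINTS_STEPS = 2500
NUM_TPU_CORES = 8

BERT_GCS_DIR = "gs://{}/{}".format(BUCKET_NAME, MODEL_DIR)
DATA_GCS_DIR = "gs://{}/{}".format(BUCKET_NAME, PRETRAINING_DIR)
TPU_ADDRESS = "grpc://" + os.environ["COLAB_TPU_ADDR"]

CONFIG_FILE = os.path.join(BERT_GCS_DIR, "bert_config.json")
INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)  # None on a fresh run

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR, "*tfrecord"))

model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=TRAIN_STEPS,
    num_warmup_steps=NUM_WARMUP_STEPS,
    use_tpu=True,
    use_one_hot_embeddings=True)

train_input_fn = input_fn_builder(
    input_files=input_files,
    max_seq_length=MAX_SEQ_LENGTH,
    max_predictions_per_seq=MAX_PREDICTIONS,
    is_training=True)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE)

# checkpoints land in BERT_GCS_DIR, so an interrupted run can be resumed from there
estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)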

Training the model with the default parameters for 1 million steps
will take ~54 hours of run time. In case the kernel restarts for some reason, you may always continue training from the latest checkpoint.

This concludes the guide to pre-training BERT from scratch on a cloud TPU.

Next steps

Okay, we’ve trained the model, now what?

That is a topic for a whole new discussion. There are a couple of things you could do:

  1. Use the pre-trained model as a general-purpose NLU module
  2. Fine-tune the model for some specific classification task
  3. Create another DL model using BERT as a building block
  4. ???

The really fun stuff is still to come, so stay woke. Meanwhile, check out the awesome bert-as-service project and start serving your newly trained model in production.

Keep learning!

Other guides in this series

  1. Pre-training BERT from scratch with cloud TPU [you are here]
  2. Building a Search Engine with BERT and Tensorflow
  3. Fine-tuning BERT with Keras and tf.Module
  4. Improving sentence embeddings with BERT and Representation Learning
