The most effective way to train your own BERT model at this time
BERT-Large has been a real "game changer" in the field of Natural Language Processing in recent years. By extending the base model with transfer learning, we get state-of-the-art solutions for tasks such as Question Answering, Named Entity Recognition or Text Summarization. The model currently exists in around 10 languages, and in this article we share our training experience so that a new model can be trained in your own language relatively easily and effectively.
In terms of training, Microsoft’s ONNX Runtime library with DeepSpeed optimizations currently offers the fastest (and cheapest!) way to train the model, so that is what we used in our experiments on the Azure Machine Learning platform (feel free to check the ONNX Runtime link above for local or other environments). Note that the training takes roughly 200 hours on a single 4x Tesla V100 node.
The guide covers the training process, which consists of 2 major parts:
- Data preparation
- Training
Data preparation
A 3.4-billion-word text corpus was used for the original BERT-Large, so it is worth training with a data set of this size. An obvious choice is the Wikipedia corpus, which can be downloaded in the target language from here. The wiki corpus alone most probably won’t contain enough data, but it is definitely worth adding to your existing corpus: it is of good quality and improves the efficiency of both training and use. Data preprocessing can be computationally intensive, and depending on the size of the training files it may require a lot of RAM. For this, we used a STANDARD_D14_V2 (16 cores, 112 GB RAM, 800 GB disk) VM in AzureML.
ONNX Runtime uses NVIDIA’s BERT-Large solution. First of all, the raw dataset needs to be cleaned if necessary, and the desired format must meet two criteria:
- Each sentence is in a separate line.
- The related entries (articles, paragraphs) are separated by a blank line.
For the wiki dataset, I customized the NVIDIA BERT data preparation scripts and uploaded the result to my own repository. So let’s take a look at the process:
Downloading WikiDump:
Wikipedia dumps are available at this link. You can even download the dump with wget. The downloaded file is in .bz2 format, which can be extracted with the bunzip2 (or bzip2 -d) command on Linux-based machines.
The extracted file is in .xml format. The WikiExtractor Python package extracts the articles from the input wiki file and provides useful, essential help in cleaning the dump.
Once that is done, the script will have created a folder structure containing the extracted text. From here, we need to take the following steps to bring the dataset into a form compatible with the training script:
- Formatting
- Tokenizing
- Filtering
- Creating vocab
- Sharding
- Creating binaries
Formatting (Wiki only)
For formatting, let’s use the following script:
python3 formatting.py --input_folder=INPUT_FOLDER --output_file=OUTPUT_FILE
This collects the extracted text into a single file with one article per line.
Tokenizing
Next, we need to tokenize our articles into sentence-per-line (spl) format, since that is what the training script and further steps such as filtering and sharding require. It is quite an easy job, but the quality of the tokenization really matters, so if you don’t already know a good sentence tokenizer for your language, now is the time to do a bit of research. We list a few options below and have also implemented several of them in our tokenization script:
- NLTK is a very common library; it uses a package called punkt for sentence tokenization. Check the supported languages here.
- UDPipe (v1) supports many languages for sentence tokenization. Check them here.
- StanfordNLP is also a common choice. Check the pretrained (language) models here. Note that the library also has a spaCy wrapper, if you are more comfortable with that.
- spaCy (last but not least) is a very simple and powerful NLP library. The available languages can be found here.
Again, I strongly recommend researching the sentence tokenizers available for your language, since the choice can affect your model’s performance. So if your text is not yet in spl format, test a suitable tokenizer and run it on your text, and make sure that every paragraph is separated by an empty line.
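For example, if NLTK’s punkt model covers your language, a minimal spl conversion sketch could look like the one below (the file names and the english model are placeholders; our tokenization script does the same job with more options):
import nltk

nltk.download("punkt")  # sentence tokenizer models used by sent_tokenize

# Input: one article per line (the output of the formatting step)
# Output: one sentence per line, articles separated by a blank line
with open("articles.txt", encoding="utf-8") as src, \
        open("articles_spl.txt", "w", encoding="utf-8") as dst:
    for line in src:
        article = line.strip()
        if not article:
            continue
        for sentence in nltk.sent_tokenize(article, language="english"):
            dst.write(sentence + "\n")
        dst.write("\n")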
Filtering
As we are great fans of the Finnish TurkuNLP group (who have built wikiBERT-base models in many languages), we would like to share a customized filtering method from their pipeline. Its default parameter values are based on the TurkuNLP team’s experience, but feel free to adjust them to your dataset(s). To run it (with the default working directory):
python filtering.py INPUT_SPL_FILE
--word-chars abcdefghijklmnopqrstuvwxyz (replace with your alphabet)
--language en (replace with your lang)
--langdetect en (replace with your lang)
> OUTPUT_FILE
At this point, make sure your text is clean: it should not contain unnecessary or heavily repeated lines (more likely in crawled data), and every paragraph should consist of meaningful, continuous text. For premade alphabets, please refer to the TurkuNLP pipeline repo.
Another way to clean the text is to convert it to a smaller character set such as Latin-1 (and then back to UTF-8) to remove unnecessary symbols.
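A minimal sketch of that round trip in Python (the file names are placeholders; only do this if your language fits entirely into the target character set):
# Drop every character that cannot be represented in Latin-1
def strip_non_latin1(line: str) -> str:
    return line.encode("latin-1", errors="ignore").decode("latin-1")

with open("filtered.txt", encoding="utf-8") as src, \
        open("cleaned.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(strip_non_latin1(line))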
Creating a vocab
For creating the vocab files, we forked the TurkuNLP group’s solution (which is derived from Google’s).
Now you can run the vocabulary training on the whole dataset (note that, with our setup, it samples lines randomly from a large corpus file). Feel free to increase the vocab size according to your experience and language, but keep in mind that a larger vocab requires more VRAM, which means you will have to train with a smaller micro-batch size; this can reduce the effectiveness of training on 16 GB GPUs. It matters much less on 32 GB GPUs.
python3 spmtrain.py INPUT_FILE
--model_prefix=bert
--vocab_size=32000
--input_sentence_size=100000000
--shuffle_input_sentence=true
--character_coverage=0.9999
--model_type=bpe
After this, we need to convert our SentencePiece vocab into a BERT-compatible WordPiece vocab by running this script:
python3 sent2wordpiece.py bert.vocab > vocab.txt
Tadaa! You have created a BERT-compatible vocab based on your text corpus.
Sharding:
Sharding is recommended, since a 500 MB raw text file can take up to 50 GB of RAM during 512-sequence-length binary creation, so it is best to create shards of around 50–100 MB.
python3 sharding.py
--input_file=INPUT_FILE
--num_shards=NUMBER_OF_SHARDS_TO_MAKE
Creating binaries
For binary (.hdf5) creation, we need a BERT-compatible vocabulary file. If you already have one, great; if not, please check the vocab creation steps described above. You can process the files in parallel, but this can take up to 10 GB of RAM per 100 MB file. You can process more shards at once for the 128-sequence-length run, since it uses less RAM.
To create .hdf5 files, run:
python3 create_hdf5_files.py --max_seq_length 128 --max_predictions_per_seq 20 --vocab_file=vocab.txt --n_processes=NUMBER_OF_PROCESSES
for the 128 sequence length training, and
python3 create_hdf5_files.py --max_seq_length 512 --max_predictions_per_seq 80 --vocab_file=vocab.txt --n_processes=NUMBER_OF_PROCESSES
for the 512 sequence length preprocessing. Both are required for the training process, which will be explained later on.
After finishing the .hdf5 creation, you should have two folders, named something like this:
hdf5_lower_case_0_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/
and
hdf5_lower_case_0_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5
Moving to BLOB
In order to use the converted files for training, you’ll need to upload them to a BLOB container. We recommend using Azure Storage Explorer and the AzCopy command-line tool.
In Azure Storage Explorer you can easily create a BLOB container (use the guide above), and with AzCopy you can copy the converted files to it using the following format:
azcopy.exe cp --recursive "src" "dest"
Note: you can create a BLOB container under your Azure storage account (left side in Explorer). You can get the source by navigating to your .hdf5 folder, right-clicking it and using the "Get Shared Access Signature" option; do the same with the BLOB container as the destination.

Training
The training process has two phases: one at sequence length 128 and one at 512. This split speeds up training significantly: roughly 7,000 steps are run at sequence length 128 and about 1,500 at 512. The idea is that training is much faster at sequence length 128, while we still want the final model to handle texts of up to 512 tokens.
For the training, we used the ONNX Runtime-based solution, which now includes the DeepSpeed training optimizations. This is currently the fastest and, of course, cheapest solution available. The repository is available here. The ONNX Runtime team has also prepared a Docker image for training with the necessary components such as Open MPI, CUDA, cuDNN, NCCL and the required Python packages.
As mentioned, we ran the training in AzureML, so this guide follows that approach. Running in AzureML is not strictly required if the necessary GPU resources are available; Microsoft’s repository above also includes a recipe for local execution.
Let’s get deeper into the training
First, we need to create a compute instance to fetch the code from the GitHub repositories. This instance (VM) does not have to be a pricey one; we used a STANDARD_D1_V2 (1 core, 3.5 GB RAM, 50 GB disk), for example. To create a compute instance, open any file on the Notebooks tab and click the + button:

Now open a terminal, which you can also do on the same tab.
In the VM’s terminal, you need to use the following commands (according to the ONNX Runtime training examples repository above) to get the training code(s).
To get the ONNX Runtime code that enhances the BERT training:
git clone https://github.com/microsoft/onnxruntime-training-examples.git
cd onnxruntime-training-examples
To get NVIDIA’s BERT-Large training solution:
git clone --no-checkout https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/
git checkout 4733603577080dbd1bdcd51864f31e45d5196704
cd ..
And to put them together:
mkdir -p workspace
mv DeepLearningExamples/PyTorch/LanguageModeling/BERT/ workspace
rm -rf DeepLearningExamples
cp -r ./nvidia-bert/ort_addon/* workspace/BERT
Preparation
You will need to perform a few steps before you can start the training:
- Copy your vocab.txt file to the workspace/BERT/vocab directory
- Modify the vocab size in nvidia-bert/ort_addon/ort_supplement/ort_supplement.py at line 55
- Download and copy bert_config.json to workspace/BERT
- Modify the vocab size in workspace/BERT/bert_config.json (see the sketch below)
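For the last two steps, a small helper sketch like this can keep the two vocab sizes in sync (the paths follow the layout above; line 55 of ort_supplement.py still has to be edited by hand):
import json

# Count the WordPiece entries in vocab.txt ...
with open("workspace/BERT/vocab/vocab.txt", encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)

# ... and write the same number into bert_config.json
config_path = "workspace/BERT/bert_config.json"
with open(config_path, encoding="utf-8") as f:
    config = json.load(f)
config["vocab_size"] = vocab_size
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

print("vocab_size set to", vocab_size)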
Now you can access the training notebook in the nvidia-bert/azureml-notebooks/ directory. Open it.
AzureML Workspace setup
First, you’ll need to provide your
- workspace name
- subscription ID
- resource group
in the notebook, as in the sketch below.
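A minimal sketch of the workspace lookup with the azureml-core SDK (the placeholder values are yours to fill in; the notebook may load them slightly differently):
from azureml.core import Workspace

# Connect to your existing AzureML workspace
ws = Workspace(
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    workspace_name="<workspace-name>",
)
print(ws.name, ws.location)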
Register datastore
Next, you must give the notebook access to the BLOB container where we previously uploaded the converted dataset. This requires the
- datastore name
- account name
- account key
- container name
parameters (see the sketch below).
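Registration with azureml-core might look roughly like this (the datastore name and the placeholder credentials are assumptions):
from azureml.core import Datastore

# Register the BLOB container holding the .hdf5 shards as a datastore
ds = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="bert_data",              # any name you like
    container_name="<container-name>",
    account_name="<storage-account-name>",
    account_key="<storage-account-key>",
)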
Create AzureML Compute Cluster
In the next step, we need to create a compute target for the training. We used Standard_NC24rs_v3, which contains 4x Tesla V100 16 GB GPUs. With this setup, training takes approx. 200–220 hours.
It is really up to you which VMs you want to use. You can mainly choose between
- Standard_NC24rs_v3 (4x Tesla V100 16GB)
- Standard_ND40rs_v2 (8x Tesla V100 32GB)
VMs (see the sketch below).
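Creating the cluster with azureml-core might look roughly like this (the cluster name and node count are illustrative):
from azureml.core.compute import AmlCompute, ComputeTarget

# Provision a GPU cluster; pick the VM size that matches your GPUs and budget
compute_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC24rs_v3",
    max_nodes=1,
)
compute_target = ComputeTarget.create(ws, "bert-cluster", compute_config)
compute_target.wait_for_completion(show_output=True)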
Creating Estimator
Perhaps this is the most exciting step of the training. Here you configure the training script, which the notebook will launch as an AzureML experiment. Note that we will run the experiment (and the script) twice, once for the 128 and once for the 512 training.
You’ll need to set up the following essential parameters:
- process_count_per_node (2x)
- node_count
- input_dir
- output_dir
- train_batch_size
- gradient_accumulation_steps
- gpu_memory_limit
process_count_per_node: the number of GPUs per VM (4/8).
node_count: the overall number of VMs you use.
input_dir: the location of the 128 pretraining data within the BLOB container.
output_dir: an arbitrary directory for the checkpoints. Note: the 512 training will use this directory to load the last phase 1 checkpoint.
train_batch_size: the training batch size. Set this according to the table in the notebook.
gradient_accumulation_steps: set this according to the table in the notebook as well. The micro-batch size is calculated from the train batch size and this parameter.
gpu_memory_limit: set this depending on whether you use 16 GB or 32 GB GPUs.
And finally, don’t forget to add
'--deepspeed_zero_stage': ''
parameter to accelerate your training with the DeepSpeed ZeRO optimizer.
Note: you may also want to disable progress bar during your pretrain with this parameter:
'--disable_progress_bar': ''
Note that the micro-batch size should be maximized through the combination of the train batch size and the gradient accumulation steps. A rough end-to-end sketch of the estimator setup is shown below.
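To give a feel for how these pieces fit together, here is a rough sketch of the phase 1 (sequence length 128) setup, assuming the azureml.train.estimator.Estimator API used by the notebook; the entry script name and the batch size values are assumptions, and the flag names simply follow the parameter list above, so take the real values from the notebook and its table:
from azureml.core import Experiment
from azureml.core.runconfig import MpiConfiguration
from azureml.train.estimator import Estimator

mpi = MpiConfiguration()
mpi.process_count_per_node = 4              # GPUs per VM (4 or 8)

script_params = {
    "--input_dir": "<datastore path to the 128 seq len hdf5 shards>",
    "--output_dir": "<datastore path for the checkpoints>",
    "--train_batch_size": 64,               # from the table in the notebook
    "--gradient_accumulation_steps": 16,    # micro batch = batch size / this
    "--gpu_memory_limit": 16,               # 16 or 32, matching your GPUs
    "--deepspeed_zero_stage": "",           # enable DeepSpeed ZeRO
    "--disable_progress_bar": "",           # optional, keeps the logs short
}

estimator = Estimator(
    source_directory="workspace/BERT",
    entry_script="run_pretraining_ort.py",  # assumption, check the notebook
    script_params=script_params,
    compute_target=compute_target,
    node_count=1,                           # number of VMs
    distributed_training=mpi,
    use_gpu=True,
)

run = Experiment(ws, "bert-pretraining-128").submit(estimator)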
Submit.
Once you have started the pretraining as an AzureML experiment, you should see something like this in your Experiments tab:

Converting checkpoints
I also included a small script in my repo, named convert_checkpoint.py, to make your checkpoint compatible with the transformers library for fine-tuning.
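Once converted, loading the checkpoint for fine-tuning might look roughly like the sketch below (the file names are placeholders, and the exact output format of the conversion script may differ):
import torch
from transformers import BertConfig, BertForPreTraining

# Rebuild the model from the same config used during pretraining
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)

# Load the converted checkpoint (a plain PyTorch state dict is assumed)
state_dict = torch.load("converted_checkpoint.pt", map_location="cpu")
model.load_state_dict(state_dict, strict=False)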
After 128 train
After your cluster has finished phase 1, you can set up the phase 2 script the same way as above. It is important to use the same output directory, so that the phase 2 run finds your phase 1 checkpoint there.
The run always keeps the last 3 checkpoints, and you can configure how many training steps pass between checkpoints. We recommend around 100 steps, so you can download a checkpoint every 100 or so steps and benchmark it with a fine-tuning script.
After 512 train
Congratulations, you have a BERT-Large model in your own language! Please share your experiences here or contact me by email; we are eager to hear about them at the Applied Data Science and Artificial Intelligence Group, University of Pécs, Hungary.