Just when we thought that all name variations of BERT were taken (RoBERTa, ALBERT, FlauBERT, ColBERT, CamemBERT etc.), along comes AMBERT, another incremental iteration on the Transformer Muppet that has taken over natural language understanding. AMBERT was published on August 27 by ByteDance, the developer of TikTok and Toutiao.
This article is mostly a summary of the AMBERT paper meant to distill the main ideas without the nitty-gritty details, but I will occasionally chime in with personal observations. I will prefix my own comments / ideas with the 🤔 emoji.

The core idea: two-way tokenization
AMBERT proposes a simple twist to BERT: tokenize the input twice, once with a fine-grained tokenizer (sub-word or word level in English, character level in Chinese) and once with a coarse-grained tokenizer (phrase level in English, word level in Chinese). The hope is to leverage the best of both worlds. The former implies a smaller vocabulary, fewer out-of-vocabulary tokens, more training data per token, and thus better generalization. The latter is meant to fight strong biases that attention-based models learn with fine-grained tokenization. For instance, the token "new" will strongly attend to "york" even when their co-occurrence in a sentence is unrelated to New York (e.g. "A new chapel in York Minster was built in 1154").

The example above shows that, when the fine-grained "a" and "new" tokens are concatenated into a single coarse-grained "a_new" token, the model correctly directs its attention to "chapel".
As the authors of the paper note, coarse-grained tokenization cannot always be perfect: longer tokens imply more ambiguity in how to group characters into tokens. 🤔 For instance, consider an input like "I’m dreaming of a New York with endless summers". It’s possible that such a tokenizer would mistakenly produce the two tokens "a_new" and "york". The hope is that, in this particular case, the model will trust the fine-grained "a" / "new" / "york" tokenization more.
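🤔 To make the two views concrete, here is a minimal sketch of two-way tokenization in plain Python. The toy phrase vocabulary and the greedy longest-match strategy are my own illustration, not the paper's actual tokenizers:

```python
# Hypothetical phrase vocabulary, for illustration only.
COARSE_VOCAB = {"a_new", "new_york", "york_minster", "new_york_city"}

def coarse_tokenize(words, vocab=COARSE_VOCAB, max_len=3):
    # Greedy left-to-right, longest-match-first grouping of words into phrases.
    tokens, i = [], 0
    while i < len(words):
        for n in range(max_len, 0, -1):
            phrase = "_".join(words[i:i + n])
            if n == 1 or phrase in vocab:
                tokens.append(phrase)
                i += n
                break
    return tokens

# Fine-grained tokens here are just whitespace-split words (the real English
# tokenizer would operate at the sub-word level).
print(coarse_tokenize("a new chapel in york minster was built in 1154".split()))
# ['a_new', 'chapel', 'in', 'york_minster', 'was', 'built', 'in', '1154']
print(coarse_tokenize("dreaming of a new york with endless summers".split()))
# ['dreaming', 'of', 'a_new', 'york', 'with', 'endless', 'summers']
```

Note that the second example reproduces exactly the failure mode described above: the greedy matcher emits "a_new" and a lone "york" instead of keeping "new_york" together.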
The two inputs share a BERT encoder
A forward pass through the model consists of the following two steps:
- text tokens → token embeddings (via separate weights): Each list of tokens (one fine-grained, one coarse-grained) is looked up in its own embedding matrix and turned into a list of real-valued vectors.
- token embeddings → contextual embeddings (via shared weights): The two lists of real-valued vectors are fed into the same BERT encoder (a stack of Transformer layers), either sequentially through a single encoder copy or in parallel through two encoder copies with tied parameters. This results in two lists of per-token contextual embeddings, as sketched below.
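🤔 A minimal PyTorch sketch of this forward pass. The vocabulary sizes and hyper-parameters are illustrative, and positional/segment embeddings are omitted; this is not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class AmbertSketch(nn.Module):
    """Two embedding matrices (separate weights), one Transformer encoder (shared weights)."""

    def __init__(self, fine_vocab=30_000, coarse_vocab=110_000,
                 d_model=768, nhead=12, num_layers=12):
        super().__init__()
        self.fine_emb = nn.Embedding(fine_vocab, d_model)      # fine-grained lookup table
        self.coarse_emb = nn.Embedding(coarse_vocab, d_model)  # coarse-grained lookup table
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)  # shared by both views

    def forward(self, fine_ids, coarse_ids):
        # Each tokenization is embedded with its own matrix, then encoded by
        # the same stack of Transformer layers (sequentially here).
        fine_ctx = self.encoder(self.fine_emb(fine_ids))
        coarse_ctx = self.encoder(self.coarse_emb(coarse_ids))
        return fine_ctx, coarse_ctx  # two lists of per-token contextual embeddings

model = AmbertSketch()
fine_ids = torch.randint(0, 30_000, (1, 16))     # e.g. 16 fine-grained tokens
coarse_ids = torch.randint(0, 110_000, (1, 10))  # e.g. 10 coarse-grained tokens
fine_ctx, coarse_ctx = model(fine_ids, coarse_ids)
```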
Because AMBERT uses two embedding matrices (one for each tokenization), its parameter count is notably higher than BERT’s (194M vs 110M for the English Base models). However, latency remains relatively unchanged, since the extra parameters only add another set of embedding lookups.
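🤔 A quick back-of-the-envelope check, assuming BERT Base's hidden size of 768 and a phrase-level vocabulary of roughly 110k entries (the latter is my guess, not a number reported in the paper):

```python
d_model = 768
bert_base_params = 110e6         # standard English BERT Base
coarse_vocab_guess = 110_000     # hypothetical phrase-level vocabulary size
extra_params = coarse_vocab_guess * d_model
print(f"{(bert_base_params + extra_params) / 1e6:.0f}M")  # ~194M, in line with the reported size
```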
Variations of AMBERT
The vanilla version of AMBERT makes two strong design decisions:
- The two inputs share the encoder, as explained in the previous section.
- The two inputs are independent of each other: the fine-grained tokens do not attend to the coarse-grained tokens, and vice-versa, so their final contextual embeddings are computed separately.
The authors propose two alternative versions to challenge these decisions:
- AMBERT-Combo has two separate encoders, one for each tokenization; this increases the size of the English model from 194M to 280M.
- AMBERT-Hybrid reverts to a traditional BERT model and only modifies its input to be the concatenation of the two tokenizations (both variants are sketched below).
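🤔 In terms of the earlier sketch, the two variants differ only in how the encoder(s) process the two token sequences (again, purely illustrative):

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 768, 12, 12

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers)

fine_emb, coarse_emb = nn.Embedding(30_000, d_model), nn.Embedding(110_000, d_model)
fine_ids = torch.randint(0, 30_000, (1, 16))
coarse_ids = torch.randint(0, 110_000, (1, 10))

# AMBERT-Combo: one encoder per tokenization, no weight sharing.
fine_encoder, coarse_encoder = make_encoder(), make_encoder()
combo_fine_ctx = fine_encoder(fine_emb(fine_ids))
combo_coarse_ctx = coarse_encoder(coarse_emb(coarse_ids))

# AMBERT-Hybrid: a single encoder over the concatenated sequence, so
# fine- and coarse-grained tokens attend to each other.
hybrid_encoder = make_encoder()
hybrid_ctx = hybrid_encoder(
    torch.cat([fine_emb(fine_ids), coarse_emb(coarse_ids)], dim=1))
```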
Before diving deeper into AMBERT’s training procedure and performance, let’s get these two out of the way. When considering aggregated metrics over benchmarks for classification and machine reading comprehension, vanilla AMBERT outperforms the two variations. AMBERT-Combo occasionally scores higher on individual tasks. The ablation studies support the following hypotheses:
- AMBERT-Combo mostly under-performs vanilla AMBERT because the two separate encoders impede information sharing between tokenizations and allow their respective output contextual embeddings to diverge.
- AMBERT-Hybrid allows a fine-grained token to attend to a close coarse-grained counterpart (either the exact same token or some extension of it), which weakens important intra-tokenization attention.
The training procedure
For pre-training, the authors use the standard Masked Language Model (MLM) objective, masking the same spans of text in both tokenizations (for instance, if the coarse-grained token "a_new" is masked, then so are the fine-grained "a" and "new" tokens). The final loss sums the standard cross-entropy losses from both tokenizations.
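🤔 A sketch of how masking could be kept aligned across the two views; the span bookkeeping (a coarse-to-fine index mapping) is my own simplification of what the paper describes:

```python
import random

def aligned_mask(fine_tokens, coarse_tokens, coarse_to_fine, mask_prob=0.15, mask="[MASK]"):
    """Mask a coarse-grained token together with the fine-grained tokens covering
    the same text span. coarse_to_fine[i] lists the fine-token indices spanned by
    coarse token i (e.g. "a_new" -> indices of "a" and "new")."""
    fine_out, coarse_out = list(fine_tokens), list(coarse_tokens)
    for i, span in enumerate(coarse_to_fine):
        if random.random() < mask_prob:
            coarse_out[i] = mask
            for j in span:
                fine_out[j] = mask
    return fine_out, coarse_out

fine = ["a", "new", "chapel", "in", "york", "minster"]
coarse = ["a_new", "chapel", "in", "york_minster"]
coarse_to_fine = [[0, 1], [2], [3], [4, 5]]
print(aligned_mask(fine, coarse, coarse_to_fine, mask_prob=0.5))
```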
When fine-tuning for classification, the upstream classifier makes three predictions, based on the following contextual embeddings: a) fine-grained, b) coarse-grained, and c) a concatenation of the two. Each of these predictions contributes equally to the final loss, together with a regularization term that encourages the fine- and coarse-grained representations to stay close in embedding space. It seems that, for inference, the authors use the prediction from c), even though the wording is not entirely clear.
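🤔 In pseudo-code, the fine-tuning objective might look roughly as follows. The classifier heads, the equal weighting, and the mean-squared-distance regularizer are my reading of the paper, not its exact formulation:

```python
import torch
import torch.nn.functional as F

def ambert_classification_loss(fine_cls, coarse_cls, labels,
                               head_fine, head_coarse, head_both, lam=1.0):
    """fine_cls / coarse_cls: pooled ([CLS]-style) embeddings from each tokenization."""
    logits_fine = head_fine(fine_cls)                                   # a) fine-grained
    logits_coarse = head_coarse(coarse_cls)                             # b) coarse-grained
    logits_both = head_both(torch.cat([fine_cls, coarse_cls], dim=-1))  # c) concatenation

    loss = (F.cross_entropy(logits_fine, labels)
            + F.cross_entropy(logits_coarse, labels)
            + F.cross_entropy(logits_both, labels))
    # Regularizer nudging the two representations toward each other.
    loss = loss + lam * F.mse_loss(fine_cls, coarse_cls)
    return loss, logits_both  # logits from c) appear to be the ones used at inference

d, num_classes = 768, 2
heads = (torch.nn.Linear(d, num_classes),
         torch.nn.Linear(d, num_classes),
         torch.nn.Linear(2 * d, num_classes))
fine_cls, coarse_cls = torch.randn(4, d), torch.randn(4, d)
labels = torch.randint(0, num_classes, (4,))
loss, logits = ambert_classification_loss(fine_cls, coarse_cls, labels, *heads)
```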
Results on NLU benchmarks
AMBERT outperforms other BERT-derived models on some comparisons but not all, with larger margins on Chinese benchmarks than on English ones.
- When compared against other BERT-derived models with accuracies cited from their original respective papers, AMBERT gains 0.9% on CLUE classification (Chinese) and loses 0.4% on GLUE (English). As noted by AMBERT’s authors, this comparison should be interpreted carefully, since the baselines were trained under slightly different regimes.
- When compared against a standard BERT model (with regular word-piece tokenization) re-trained by the authors under the same conditions, AMBERT gains 2.77% on CLUE classification (Chinese) and 1.1% on GLUE (English). 🤔 This comparison should take into account that the BERT baselines have significantly fewer parameters (110M vs 194M for English).
🤔 A note on coarse-grained tokenization
I have to start with the disclaimer that I know very little about Chinese tokenization; the extent of my knowledge is limited to what the authors of the paper offer. From their description, it sounds like coarse-grained tokenization in Chinese is a natural grouping of characters into words:
The characters in the Chinese texts are naturally taken as fine-grained tokens. We conduct word segmentation on the texts and treat the words as coarse-grained tokens. We employ a word segmentation tool developed at ByteDance for the task.
Instead, I will focus on coarse-grained tokenization in English:
We perform coarse-grained tokenization on the English texts in the following way. Specifically, we first calculate the n-grams in the texts using KenLM (Heafield, 2011) and Wikipedia. We next build a phrase-level dictionary consisting of phrases whose frequencies are sufficiently high and whose last words highly depend on their previous words. We then employ a greedy algorithm to perform phrase-level tokenization on the texts.
This description is somewhat hand-wavy, especially for a paper that has the word "tokenization" in the title. While the exact procedure deserves more clarity, the high-level algorithm seems to be: 1) collect n-gram statistics from Wikipedia, 2) keep phrases that are frequent enough and whose last words are strongly predicted by their preceding words, and 3) greedily match phrases from this dictionary against the input text.
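🤔 To make the pruning step less abstract, here is one way such a filter could look. I use raw n-gram counts instead of KenLM, and the thresholds are invented, so treat this as a guess at the procedure rather than a reproduction of it:

```python
from collections import Counter

def build_phrase_vocab(sentences, n=2, min_count=50, min_cond_prob=0.6):
    """Keep n-grams that are frequent and whose last word strongly depends on
    the preceding words, i.e. P(last word | previous words) >= min_cond_prob."""
    ngram_counts, prefix_counts = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            ngram_counts[tuple(words[i:i + n])] += 1
        for i in range(len(words) - n + 2):
            prefix_counts[tuple(words[i:i + n - 1])] += 1

    vocab = set()
    for ngram, count in ngram_counts.items():
        cond_prob = count / prefix_counts[ngram[:-1]]
        if count >= min_count and cond_prob >= min_cond_prob:
            vocab.add("_".join(ngram))
    return vocab

corpus = ["new york is in new york state", "a new chapel in york"] * 100
print(build_phrase_vocab(corpus))
# {'new_york', 'a_new', 'is_in', 'chapel_in'} on this toy corpus
```

Even on a toy corpus, dubious phrases like "a_new" pass a frequency-plus-predictability filter, which hints at why the resulting vocabulary can be noisy.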
The problem is that longer tokens inevitably become more domain-specific, especially when selected based on frequency. If the nature of the fine-tuning dataset (say, legal documents) diverges from Wikipedia, most coarse-grained tokens could be out-of-vocabulary, in which case the model would rely solely on fine-grained tokens (i.e. effectively revert to standard BERT). In fact, this problem is visible even without domain shift: an ablation study in the AMBERT paper coarse-tokenizes 10k English sentences (presumably from Wikipedia) and observes that 86% of them overlap with fine-grained tokens.
Another question is whether coarse-grained tokenization can support multilingual models: the joint phrase vocabulary might become prohibitively large.
Generally, research on natural language understanding has been moving in the opposite direction, from coarser- to finer-grained granularities: BERT replaced the standard word tokens with sub-word units. There is indeed research [2] showing that masking longer spans in the MLM is helpful, but such papers make a statement about the output granularity, not the input one.
Conclusion
AMBERT augments BERT with two granularities of tokenization, showing promising results on benchmarks in Chinese. The gains are lower on English datasets, where coarse-grained tokenization is more ambiguous. Nonetheless, the idea of multiple levels of tokenization remains appealing. One could consider even finer levels of granularity for English, or tokenizations of the same granularity produced by different algorithms.
References
- [1] Zhang & Li, AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization (August 2020)
- [2] Joshi et al., SpanBERT: Improving Pre-training by Representing and Predicting Spans (2019)