A novel idea to pretrain language models using Transformers.
Cloze-driven Pretraining of Self-attention Networks is a new type of language-model pretraining published by Facebook AI Research that uses Transformer modules to produce improved word embeddings for fine-tuning. As this is a paper-dissection session, I will explain the main ideas of the paper section by section, both intuitively and, where needed, mathematically, for a better understanding. It would be great if readers open this article and the paper side by side and go through them together. I will skip the Abstract, Introduction, and Related Work sections and start from Section 3.
Prerequisites
Knowledge of Transformers. Along with the original paper, this blog from Jay Alammar is very helpful for understanding Transformers.
3. Two Tower Model:
In this section, the paper presents its novel architecture for pretraining word embeddings. Before jumping directly to the architecture, let's build the intuition behind this pretraining.
Now, what is Cloze reading? Cloze reading is an instructional strategy where readers are required to fill in the blanks within a passage with the correct words from a word bank. For example, given the sentence This is my first medium article, the pretraining idea is to predict my given This is and first medium article. Makes sense? If not, do not worry; in a moment I will present a pictorial diagram that will make it crystal clear.
Let’s come to the two-tower analogy. For now, assume these two towers are two black boxes. As the word/token my sits between the phrases This is and first medium article, This is goes to the left tower and first medium article goes to the right tower as inputs, to finally predict my. The left (forward) tower works left to right: given This is, it tries to predict my. The right (backward) tower works right to left: given article medium first, it tries to predict my. Sentences are padded with a boundary token (`<s>`) at the beginning and end. As the inputs to the two towers are not equal in length, masking needs to be done.
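To make this concrete, here is a tiny Python sketch (not from the paper’s code) of how the forward and backward contexts could be built for each target token; the `<s>` boundary token here is just for illustration.

```python
# A minimal sketch of cloze-style context construction for each target token.
sentence = ["<s>", "This", "is", "my", "first", "medium", "article", "<s>"]

for i in range(1, len(sentence) - 1):          # skip the boundary tokens
    forward_context = sentence[:i]             # fed left-to-right to the forward tower
    backward_context = sentence[i + 1:][::-1]  # fed right-to-left to the backward tower
    target = sentence[i]                       # the token both towers try to predict
    print(forward_context, backward_context, "->", target)
```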
3.1 Block Structure

In this section, I will talk in detail about the towers. These towers are Transformer decoder blocks stacked on top of each other, as shown in Figure 1. The green blocks belong to the forward tower and the blue blocks to the backward tower. From the figure, it can be seen that given `<s>` and a in the forward tower, and c and `<s>` in the backward tower, the token b is to be predicted.
This paper uses a different kind of word embedding based on CNN encodings. The details of the encoding can be found here. In short, a word is broken into characters and character embeddings are generated. On top of these, Conv1D layers with different filter sizes are applied, producing outputs of different sizes. A max-over-time pooling operation is then applied to obtain a fixed-dimensional representation of the word, which is passed through a highway network to get the final word embedding. This process is shown in Figure 2.
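Below is a rough PyTorch sketch of such a character-CNN word encoder. The filter sizes, dimensions, and single highway layer are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Sketch of a character-CNN word encoder (illustrative hyperparameters)."""

    def __init__(self, n_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64), (4, 128)), word_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_out, kernel_size=width) for width, n_out in filters
        )
        total = sum(n_out for _, n_out in filters)
        # A single highway layer: gated mix of a transform and the identity.
        self.highway_transform = nn.Linear(total, total)
        self.highway_gate = nn.Linear(total, total)
        self.proj = nn.Linear(total, word_dim)

    def forward(self, char_ids):                       # char_ids: (batch, n_words, n_chars)
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, c))     # (b*w, n_chars, char_dim)
        x = x.transpose(1, 2)                          # Conv1d expects (N, channels, length)
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]  # max-over-time pooling
        h = torch.relu(torch.cat(pooled, dim=-1))
        gate = torch.sigmoid(self.highway_gate(h))
        h = gate * torch.relu(self.highway_transform(h)) + (1 - gate) * h
        return self.proj(h).view(b, w, -1)             # (batch, n_words, word_dim)

words = torch.randint(0, 262, (2, 5, 12))              # 2 sentences, 5 words, 12 chars each
print(CharCNNWordEncoder()(words).shape)               # torch.Size([2, 5, 256])
```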

Cloze pretraining uses this CNN encoding to generate word embeddings. Fixed sinusoidal positional encodings (as described in the Transformer paper) are summed with the CNN-encoded word embeddings to provide the relative position of words in a sentence. I have prepared an extended version of the model architecture for a better understanding; it is shown in Figure 3. CNN encoding + positional encoding generates the word/token embeddings used as inputs to the Transformer modules. The input embeddings are shared between the two towers. The number of blocks in the forward and backward towers is the same for each layer. Just to be clear, Block 11 means the first block of the first layer and Block N3 means the third block of the N-th layer. While pretraining, to predict the i-th token/word (e.g. my), the forward tower masks all tokens from the i-th token onward, and the backward tower masks all tokens up to and including the i-th token, so that the model never sees the i-th token while predicting it.

One thing to note here is that, on the left (green) side, Block 11 takes input from `<s>` only, Block 12 takes inputs from both `<s>` and This, Block 13 takes inputs from `<s>`, This, and is, and so on.
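The per-tower masking can be expressed as two attention masks. The sketch below is my reading of the scheme, not the paper’s released code: in each tower, position i may not attend to itself or to anything “ahead” of it in that tower’s reading direction.

```python
import torch

def tower_masks(seq_len):
    """Build the pretraining attention masks for the two towers (sketch)."""
    idx = torch.arange(seq_len)
    # Forward tower: token i attends only to positions j < i.
    forward_mask = idx.unsqueeze(0) < idx.unsqueeze(1)    # (tgt, src), True = visible
    # Backward tower: token i attends only to positions j > i.
    backward_mask = idx.unsqueeze(0) > idx.unsqueeze(1)
    return forward_mask, backward_mask

fwd, bwd = tower_masks(5)
print(fwd.int())   # strictly lower-triangular: row i sees columns 0..i-1
print(bwd.int())   # strictly upper-triangular: row i sees columns i+1..n-1
```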
3.2 Combination of representations
In this section, I will talk about the combine block shown in Figure 3. Once the two towers are built, the outputs of their final layers are combined with a self-attention layer. A feed-forward module follows, with a softmax over V classes (where V is the vocabulary size) to predict the correct word/token.
The authors do not simply feed all the final-layer outputs of the forward and backward towers into this layer; they apply masking first. The masking is almost the same as before: for the forward tower, all outputs from the i-th block onward are masked, and for the backward tower, all outputs up to and including the i-th block are masked while predicting the i-th token. If FL1 (same as green Block N1 in Figure 3) and BL1 (same as blue Block N1 in Figure 3) are the outputs of block 1 in the final layer (L in the notation, N in Figure 3), then only FL1, FL2, …, FL(i-1) from the forward tower and BL(i+1), BL(i+2), …, BLn from the backward tower are considered as inputs to the self-attention module. The remaining final-layer blocks are masked when predicting the i-th token, as shown in Figure 3. In the self-attention layer, FL(i-1) and BL(i+1) are summed (base model) or concatenated (large model) to form the query vector. Key and value vectors are created from the other unmasked final-layer outputs of the forward and backward towers.
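Here is a small sketch of how the query and the key/value memory could be formed for one target position, based on my reading of Section 3.2 (function name and 0-based indexing are mine; “sum” for the base model, “concat” for the large model).

```python
import torch

def combine_for_position(F, B, i, mode="sum"):
    """F, B: (seq_len, dim) final-layer outputs of the forward / backward tower."""
    if mode == "sum":
        query = F[i - 1] + B[i + 1]                  # base model: sum the two states
    else:
        query = torch.cat([F[i - 1], B[i + 1]])      # large model: concatenate them
    # Keys/values: forward outputs strictly before i and backward outputs strictly after i.
    memory = torch.cat([F[:i], B[i + 1:]], dim=0)
    return query, memory

F = torch.randn(8, 1024)
B = torch.randn(8, 1024)
q, mem = combine_for_position(F, B, i=3, mode="sum")
print(q.shape, mem.shape)   # torch.Size([1024]) torch.Size([7, 1024])
```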
The masking of the output of the i-th block is done during pretraining, but during fine-tuning this masking is removed, which improves test results.
4 Fine-tuning
Different processes are used for different fine-tuning tasks.
Classification and Regression Tasks
Figure 4 shows the fine-tuning process for a single-sentence task. The changes are: (a) all the tokens of the input sentence go to each of the towers; (b) the feed-forward layer with softmax activation is removed. In this scenario, the output of the language model is the output of the self-attention layer, i.e. a tensor of shape [batch-size × time-steps × attention-dim]. As every input sentence is padded with a `<s>` token at the very start and end, it is easy to pick out the self-attention output vectors for these two positions; they are concatenated to produce a vector of size 2 × attention-dim (attention-dim = 1024 in the paper). Basically, these two token vectors are used to compute the final output for classification and regression tasks. For a classification problem, a feed-forward layer with softmax activation is placed on top of the concatenated vector to output the desired number of classes. For regression, the output dimension is 1 and the activation is linear.
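A minimal sketch of this single-sentence head, assuming the pretrained model’s self-attention output is already available as `lm_out` (names and dimensions here are illustrative):

```python
import torch
import torch.nn as nn

attention_dim, num_classes = 1024, 3
classifier = nn.Linear(2 * attention_dim, num_classes)

lm_out = torch.randn(4, 10, attention_dim)            # pretend output for 4 sentences
first_tok, last_tok = lm_out[:, 0], lm_out[:, -1]     # the two <s> boundary positions
features = torch.cat([first_tok, last_tok], dim=-1)   # (batch, 2 * attention_dim)
logits = classifier(features)                         # softmax over classes for classification
# For regression, nn.Linear(2 * attention_dim, 1) with no activation would be used instead.
print(logits.shape)                                   # torch.Size([4, 3])
```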

When the input is two sentences or a question-answer pair, a separator token is placed between the two input sentences. So, at the output of the self-attention layer, along with the two boundary token vectors, we also get the output vector for the separator token. Instead of concatenating only the boundary token vectors, in sentence-pair problems the separator token vector is concatenated as well, producing a 3 × attention-dim vector. Depending on whether the problem is regression or classification, a feed-forward layer is then used as described for the single-sentence task.
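The sentence-pair variant is almost identical; only the feature vector grows. In this sketch, `sep_idx` is a hypothetical position of the separator token in the sequence:

```python
import torch
import torch.nn as nn

attention_dim, num_classes, sep_idx = 1024, 3, 6
pair_classifier = nn.Linear(3 * attention_dim, num_classes)

lm_out = torch.randn(4, 12, attention_dim)            # pretend output for 4 sentence pairs
features = torch.cat([lm_out[:, 0], lm_out[:, sep_idx], lm_out[:, -1]], dim=-1)
print(pair_classifier(features).shape)                # torch.Size([4, 3])
```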
Structured prediction tasks
For named entity recognition, the input is the output of the pretrained language model, but without the masking at the input of the self-attention layer.
No Masking
It is observed that disabling the masking at the input of the self-attention layer improves test results. However, disabling the masking before the final layer of the towers is not recommended.
Conclusion
I skipped the remaining parts as they describe the model parameters, datasets, and results in a very straightforward way that is easy to understand. Let me know in the comments whether I should extend the article to cover the Experimental Setup and Results sections as well.