Simplifying Transformers: State of the Art NLP Using Words You Understand — Part 2 — Input

Deep dive into how transformers’ inputs are constructed

Chen Margalit
Towards Data Science

--

Inputs

Dragons hatch from eggs, babies spring out from bellies, AI-generated text starts from inputs. We all have to start somewhere.
What kind of inputs? It depends on the task at hand. If you’re building a language model, software that knows how to generate relevant text (the Transformer architecture is useful in diverse scenarios), the input is text. Nonetheless, can a computer receive any kind of input (text, image, sound) and magically know how to process it? It can’t.

I’m sure you know people who aren’t very good with words but are great with numbers. The computer is something like that. It cannot process text directly in the CPU/GPU (where the calculations happen), but it can certainly work with numbers! As you will soon see, the way to represent these words as numbers is a crucial ingredient in the secret sauce.

Image from the original paper by Vaswani, A. et al.

Tokenizer

Tokenization is the process of transforming the corpus (all the text you’ve got) into smaller parts that the machine can make better use of. Say we have a dataset of 10,000 Wikipedia articles. We take the text and transform (tokenize) it. There are many ways to tokenize text; let's see how OpenAI’s tokenizer handles the following text:

Many words map to one token, but some don’t: indivisible.

Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾

Sequences of characters commonly found next to each other may be grouped together: 1234567890

This is the tokenization result:

Image by OpenAI, taken from here

As you can see, there are around 40 words (depending on how you count punctuation signs). Out of these 40 words, 64 tokens were generated. Sometimes the token is an entire word, as with “Many”, “words” and “map”, and sometimes it's a part of a word, as with “Unicode”. Why do we break entire words into smaller parts? Why even divide sentences? We could’ve kept them intact. In the end, they are converted to numbers anyway, so what’s the difference from the computer’s point of view if the token is 3 characters long or 30?
Tokens help the model learn because text is our data, and tokens are the data’s features. Different ways of engineering those features will lead to variations in performance. For example, in the sentence “Get out!!!!!!!”, we need to decide if multiple “!” carry a different meaning than just one, or the same one. Technically we could’ve kept the sentences whole, but imagine looking at a crowd vs. at each person individually; in which scenario will you get better insights?
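If you want to play with this yourself, here is a minimal sketch using OpenAI’s tiktoken package (assuming you have it installed; “cl100k_base” is just one of the encodings it ships with, so the exact token counts may differ from the screenshot above):

import tiktoken

# Load one of tiktoken's built-in encodings and tokenize a sentence from the example above.
enc = tiktoken.get_encoding("cl100k_base")
text = "Many words map to one token, but some don't: indivisible."

token_ids = enc.encode(text)                          # text -> list of integer token ids
tokens = [enc.decode([token_id]) for token_id in token_ids]  # each id decoded back to its text piece

print(len(text.split()), "words ->", len(token_ids), "tokens")
print(tokens)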

Now that we have tokens, we can build a lookup dictionary that will allow us to get rid of words and use indexes (numbers) instead. For example, if our whole dataset is the sentence “Where is god”, we might build this kind of vocabulary, which is just a key:value pair of the words and a single number representing each of them. We won't have to use the entire word every time; we can use the number instead. For example:
{Where: 0, is: 1, god: 2}. Whenever we encounter the word “is”, we replace it with 1. For more examples of tokenizers, you can check the one Google developed or play some more with OpenAI’s tiktoken.
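To make the lookup idea concrete, here is a minimal sketch in plain Python. The “tokenizer” here is just splitting on spaces, which is a big simplification compared to the sub-word tokenizers above:

sentence = "Where is god"
tokens = sentence.split()                       # naive "tokenizer": split on spaces

# Build the vocabulary: each unique token gets its own index.
vocabulary = {token: index for index, token in enumerate(tokens)}
print(vocabulary)                               # {'Where': 0, 'is': 1, 'god': 2}

# Replace every token with its number.
token_ids = [vocabulary[token] for token in tokens]
print(token_ids)                                # [0, 1, 2]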

Word to Vector

Intuition
We are making great progress in our journey to represent words as numbers. The next step will be to generate numeric, semantic representations from those tokens. To do so, we can use an algorithm called Word2Vec. The details aren't very important at the moment, but the main idea is that you take a vector (we’ll simplify for now, think of a regular list) of numbers of any size you want (the paper’s authors used 512), and this list of numbers should represent the semantic meaning of a word. Imagine a list of numbers like [-2, 4, -3.7, 41…-0.98] which actually holds the semantic representation of a word. It should be created in such a way that if we plot these vectors on a 2D graph, similar terms will be closer than dissimilar terms.

As you can see in the picture (taken from here), “Baby” is close to “Aww” and “Asleep”, whereas “Citizen”/“State”/“America’s” are also somewhat grouped together.
*2D word vectors (a.k.a. a list with 2 numbers) will not be able to hold any accurate meaning even for one word; as mentioned, the authors used 512 numbers. Since we can't plot anything with 512 dimensions, we use a method called PCA to reduce the number of dimensions to two, hopefully preserving much of the original meaning. In the 3rd part of this series we dip our toes a bit into how that happens.

Word2Vec 2D presentation — image by Piere Mergret from here
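To give a feel for that PCA step, here is a minimal sketch using scikit-learn. The 512-dimensional vectors are random stand-ins rather than real Word2Vec embeddings, so the 2D points won't mean anything; the point is only the shape of the operation:

import numpy as np
from sklearn.decomposition import PCA

# Stand-in "embeddings": 10 words with 512 dimensions each (random numbers, just for the shapes).
fake_embeddings = np.random.randn(10, 512)

# Reduce 512 dimensions to 2 so the words can be plotted on a 2D graph.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(fake_embeddings)

print(points_2d.shape)   # (10, 2) - two numbers per word, ready to plot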

It works! You can actually train a model that will be able to produce lists of numbers that hold semantic meaning. The computer doesn't know a baby is a screaming, sleep-depriving (super sweet) small human, but it knows it usually sees the word “baby” around “aww” more often than around “State” and “Government”. I’ll be writing some more on exactly how that happens, but until then, if you’re interested, this might be a good place to check out.

These “lists of numbers” are pretty important, so they get their own name in ML terminology: embeddings. Why embeddings? Because we are performing an embedding (so creative), which is the process of mapping (translating) a term from one form (words) to another (a list of numbers). That’s a lot of ().
From here on we will refer to words as embeddings, which, as explained, are lists of numbers that hold the semantic meaning of whatever word they’re trained to represent.

Creating Embeddings with Pytorch

We first calculate the number of unique tokens we have; for simplicity, let’s say 2. The creation of the embedding layer, which is the first part of the Transformer architecture, will be as simple as writing this code:

*General code remark — don’t take this code and its conventions as good coding style, it’s written particularly to make it easy to understand.

Code

import torch.nn as nn

vocabulary_size = 2
num_dimensions_per_word = 2

embds = nn.Embedding(vocabulary_size, num_dimensions_per_word)

print(embds.weight)
---------------------
output:
Parameter containing:
tensor([[-1.5218, -2.5683],
[-0.6769, -0.7848]], requires_grad=True)

We now have an embedding matrix, which in this case is a 2 by 2 matrix, generated with random numbers drawn from the normal distribution N(0,1) (i.e. a distribution with mean 0 and variance 1).
Note the requires_grad=True; this is PyTorch's way of saying these 4 numbers are learnable weights. They can and will be adjusted in the learning process to better represent the data the model receives.
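To see the lookup in action, we can hand the layer some token indexes (like the ones our vocabulary produced) and get their vectors back. A small sketch continuing the code above:

import torch

# Token indexes 0 and 1 pull out rows 0 and 1 of the 2 by 2 embedding matrix,
# so this prints the same numbers we saw in embds.weight above.
token_ids = torch.tensor([0, 1])
print(embds(token_ids))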

In a more realistic scenario, we can expect something closer to a 10k by 512 matrix which represents our entire dataset in numbers.

vocabulary_size = 10_000
num_dimensions_per_word = 512

embds = nn.Embedding(vocabulary_size, num_dimensions_per_word)

print(embds)
---------------------
output:
Embedding(10000, 512)

*Fun fact (we can think of things that are more fun): you sometimes hear that language models use billions of parameters. This initial, not-too-crazy layer alone holds 10,000 by 512 = 5,120,000 (about 5 million) parameters. LLM (Large Language Model) stuff is difficult; it needs a lot of calculations.
Parameters is just a fancy word for those numbers (-1.5218, etc.), except that they are subject to change and will change during training.
These numbers are what the machine learns. Later, when we give it input, we multiply the input by those numbers, and we hopefully get a good result. What do you know, numbers matter. When you’re important, you get your own name, so these aren’t just numbers; they’re parameters.
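If you want to verify that count yourself, PyTorch can report it directly. A small sketch continuing the 10,000 by 512 layer above:

# Count every learnable number in the embedding layer: 10,000 * 512 = 5,120,000.
num_parameters = sum(p.numel() for p in embds.parameters())
print(num_parameters)   # 5120000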

Why use as many as 512 and not 5? Because more numbers mean we can probably capture more accurate meaning. Great, stop thinking small, let's use a million then! Why not? Because more numbers mean more calculations, more computing power, a more expensive training process, etc. 512 has been found to be a good spot in the middle.

Sequence Length

When training the model we are going to put a whole bunch of words together. It's more computationally efficient, and it helps the model learn as it gets more context together. As mentioned, every word will be represented by a 512-dimensional vector (a list with 512 numbers), and each time we pass inputs to the model (a.k.a. a forward pass), we will send a bunch of sentences, not only one. For example, say we decided to support a 50-word sequence. This means we are going to look at the number of words in a sentence, call it x: if x > 50 we split the sentence and only take the first 50 words; if x < 50, we still need the size to be exactly the same (I’ll soon explain why). To solve this we add padding, made of special dummy strings, to the rest of the sentence. For example, if we support a 7-word sentence and we have the sentence “Where is god”, we add 4 paddings, so the input to the model will be “Where is god <PAD> <PAD> <PAD> <PAD>”. Actually, we usually add at least 2 more special tokens so the model knows where the sentence starts and where it ends, so it will actually be something like “<StartOfSentence> Where is god <PAD> <PAD> <EndOfSentence>”.

* Why must all input vectors be of the same size? Because software has “expectations”, and matrices have even stricter expectations. You can’t do any “mathy” calculation you want; the calculation has to adhere to certain rules, and one of those rules is adequate vector sizes.
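Here is a minimal sketch of that padding logic for a 7-token sequence. The placement of the special strings varies between implementations; this version mirrors the example above:

max_seq_len = 7
sentence_tokens = ["Where", "is", "god"]

# Reserve 2 slots for the start/end markers and pad the rest up to the fixed length.
num_pads = max_seq_len - len(sentence_tokens) - 2
padded = ["<StartOfSentence>"] + sentence_tokens + ["<PAD>"] * num_pads + ["<EndOfSentence>"]

print(padded)
# ['<StartOfSentence>', 'Where', 'is', 'god', '<PAD>', '<PAD>', '<EndOfSentence>']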

Positional encodings

Intuition
We now have a way to represent (and learn) words in our vocabulary. Let’s make it even better by encoding the position of the words. Why is this important? Because if we take these two sentences:

1. The man played with my cat
2. The cat played with my man

We can represent the two sentences using the exact same embeddings, but the sentences have different meanings. There is data for which order does not matter; if I’m calculating a sum of something, it doesn’t matter where we start. In language, order usually matters. The embeddings contain semantic meaning, but no exact order meaning. They do hold order in a way, because these embeddings were originally created according to some linguistic logic (“baby” appears closer to “sleep”, not to “state”), but the same word can have more than one meaning in itself and, more importantly, a different meaning when it's in a different context.
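To make this concrete, here is a small sketch showing that if we throw away the order (for example by summing the word vectors), the two sentences above become indistinguishable. The embeddings are random stand-ins, and the tiny 4-dimensional size is just to keep the demo small:

import torch
import torch.nn as nn

torch.manual_seed(0)

vocab = {"The": 0, "man": 1, "played": 2, "with": 3, "my": 4, "cat": 5}
demo_embds = nn.Embedding(len(vocab), 4)   # tiny 4-dimensional embeddings, just for the demo

sentence_1 = ["The", "man", "played", "with", "my", "cat"]
sentence_2 = ["The", "cat", "played", "with", "my", "man"]

# Summing the word vectors throws away the order, so both sums come out (numerically) identical.
sum_1 = demo_embds(torch.tensor([vocab[word] for word in sentence_1])).sum(dim=0)
sum_2 = demo_embds(torch.tensor([vocab[word] for word in sentence_2])).sum(dim=0)

print(torch.allclose(sum_1, sum_2))   # True: without order, the two sentences look the same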

Representing words without their order is not good enough, so we can improve this. The authors suggest we add positional encoding to the embeddings. We do this by calculating a positional vector for every word and adding (summing) the two vectors. The positional encoding vectors must be of the same size as the embeddings so they can be added. The formula for positional encoding uses two functions: sine for the even dimensions of the vector (the 0th, 2nd, 4th, 6th number, etc.) and cosine for the odd dimensions (the 1st, 3rd, 5th number, etc.).

Visualization
By looking at these functions (sin in red, cosine in blue) you can perhaps imagine why these two functions specifically were chosen. There is some symmetry between the functions, as there is between a word and the word that came before it, which helps model (represent) these related positions. Also, they output values from -1 to 1, which are very stable numbers to work with (they don’t get super big or super small).

Formula image from the original paper by Vaswani, A. et al.

In the formula above, the upper row covers the even dimension indexes, starting from 0 (2*0, where i = 0) and continuing with 2*1, 2*2, 2*3, and so on. The second row covers the odd dimension indexes in the same way (2*0 + 1, 2*1 + 1, etc.).
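Written out, the two rows of the formula are:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the word in the sequence, i indexes the dimension pairs (so 2i and 2i+1 together cover all 512 dimensions), and d_model is the number of dimensions per word (512 in our case).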

Every positional vector is a number_of_dimensions-sized (512 in our case) vector with numbers ranging from -1 to 1.

Code

import numpy as np
from math import sin, cos

max_seq_len = 50
number_of_model_dimensions = 512


positions_vector = np.zeros((max_seq_len, number_of_model_dimensions))

for position in range(max_seq_len):
    for index in range(number_of_model_dimensions // 2):
        theta = position / (10000 ** ((2 * index) / number_of_model_dimensions))
        positions_vector[position, 2 * index] = sin(theta)
        positions_vector[position, 2 * index + 1] = cos(theta)

print(positions_vector.shape)
---------------------
output:
(50, 512)

If we print the first position, we see we only get 0 and 1 interchangeably (since sin(0) = 0 and cos(0) = 1).

print(positions_vector[0][:10])
---------------------
output:
array([0., 1., 0., 1., 0., 1., 0., 1., 0., 1.])

The second position is already much more diverse.

print(positions_vector[1][:10])
---------------------
output:
array([0.84147098, 0.54030231, 0.82185619, 0.56969501, 0.8019618 ,
0.59737533, 0.78188711, 0.62342004, 0.76172041, 0.64790587])

*Code inspiration is from here.

We have seen that different positions result in different representations. In order to finalize the Input section as a whole (squared in red in the picture below), we add the numbers in the position matrix to our input embedding matrix. We end up with a matrix of the same size as the embeddings, only this time the numbers contain semantic meaning + order.

Image from the original paper by Vaswani, A. et al.
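Putting the pieces together, here is a small sketch that continues the variables from the snippets above (the 10,000 by 512 embds layer, max_seq_len and positions_vector); the token ids are random stand-ins for a real padded sentence:

import torch

# Token ids for one padded, 50-token input sentence (random stand-ins here).
token_ids = torch.randint(0, vocabulary_size, (max_seq_len,))

word_embeddings = embds(token_ids)                                           # shape (50, 512)
positional_encodings = torch.tensor(positions_vector, dtype=torch.float32)   # shape (50, 512)

# Element-wise sum: same shape, but now the numbers carry semantic meaning + order.
input_to_encoder = word_embeddings + positional_encodings
print(input_to_encoder.shape)   # torch.Size([50, 512])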

Summary
This concludes our first part of the series (rectangled in red in the picture above). We talked about how the model gets its inputs. We saw how to break text down into its features (tokens), represent them as numbers (embeddings), and a smart way to add positional encoding to these numbers.

The next part will focus on the different mechanics of the Encoder block (the first gray rectangle), with each section describing a different coloured rectangle (e.g. Multi head attention, Add & Norm, etc.)
