
Introduction
Large Language Models (LLMs) such as ChatGPT, Gemini, and Claude have been around for a while now, and I believe all of us have already used at least one of them. As this article is written, ChatGPT is already running on the fourth generation of the GPT-based model, named GPT-4. But do you know what GPT actually is, and what the underlying neural network architecture looks like? In this article we are going to talk about GPT models, especially GPT-1, GPT-2 and GPT-3. I will also demonstrate how to code them from scratch with PyTorch so that you can get a better understanding of the structure of these models.
A Brief History of GPT
Before we get into GPT, we need to understand the original Transformer architecture first. Generally speaking, a Transformer consists of two main components: the Encoder and the Decoder. The former is responsible for understanding the input sequence, whereas the latter is used for generating another sequence based on the input. For example, in a question answering task, the decoder produces an answer to the input sequence, while in a machine translation task it generates the translation of the input.
![Figure 1. The Transformer model. The block on the left is the Encoder and the one on the right is the Decoder [1].](https://towardsdatascience.com/wp-content/uploads/2025/01/0_nyg8aTEV6i8Da66.png)
The two main components of the Transformer mentioned above also consist of several sub-components, such as the attention block, the look-ahead mask, and layer normalization. Here I assume that you already have basic knowledge about them. If you haven’t, I highly recommend you read my previous post on the topic, which you can access through the link provided at the end of this article [2].
The Transformer proved to have impressive performance in language modeling. Interestingly, researchers later found that its encoder and decoder parts can each work individually to do so. This was the moment when BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) were invented, where BERT is basically just a stack of encoders, while GPT is a stack of decoders.
Talking more specifically about GPT, its first version (GPT-1) was released by OpenAI back in 2018. This was then followed by GPT-2 and GPT-3 in 2019 and 2020, respectively. However, not many people knew about GPT at the time since it was only usable via an API. It wasn’t until 2022, when OpenAI released ChatGPT with the GPT-3.5 backend, that the public could interact with this LLM easily. Below is a figure showing the evolution of GPT models.
![Figure 2. The evolution of GPT models over time [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/1huPbIkA_Lx2CYQZyPnmx8A.png)
GPT-1
The first GPT version was published in a research paper titled "Improving Language Understanding by Generative Pre-Training" by Radford et al. [4] back in 2018. Previously I’ve mentioned that GPT is basically just a stack of decoders, and in the case of GPT-1 the decoder block is repeated 12 times. It is important to keep in mind that the decoder architecture implemented in GPT-1 is not completely identical to the one in the original Transformer. In the following figure, the model on the left is the decoder proposed in the GPT-1 paper, whereas the one on the right is the decoder part of the original Transformer. Here we can see that the part highlighted in red in the original decoder does not exist in GPT-1. This is essentially because this component is employed to combine the information coming from the encoder and from the decoder input itself. In the case of GPT-1, since we don’t have the encoder part, we can simply omit this component.
![Figure 3. The GPT-1 architecture (left) [4] and the Decoder part of the original Transformer architecture [5].](https://towardsdatascience.com/wp-content/uploads/2025/01/1YldhQxvr9wi_fHoN4wmiRg.png)
GPT-1 Pretraining
The training process of the GPT-1 model is divided into two steps: pretraining and fine-tuning. The goal of pretraining is to teach the model to predict the next token in a sequence based on the preceding tokens – a process commonly known as language modeling. This pretraining step uses a self-supervised mechanism, i.e., a training process where the label comes from the dataset itself. With this method, we don’t need to perform manual labeling. Instead, we can just chunk 513 tokens at random positions from a long text, setting the first 512 as the features and the last one as the label. This number of tokens is chosen based on the context window parameter of GPT-1, which by default is set to 512. As for the tokenization mechanism, GPT-1 uses BPE (Byte Pair Encoding). This essentially means that every single token does not necessarily correspond to a single word. Rather, it can also be a sub-word or even an individual letter.
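To make this chunking idea more concrete, below is a minimal sketch of how such a training pair could be drawn from a long stream of token ids. The `token_ids` tensor here is a hypothetical stand-in for a BPE-encoded corpus; it is not part of the codeblocks used later in this article.

```python
import torch

# Hypothetical stand-in for a long BPE-encoded corpus (ids in the range 0..39999).
token_ids = torch.randint(0, 40000, (100_000,))

start = torch.randint(0, len(token_ids) - 513, (1,)).item()  # random position in the corpus
chunk = token_ids[start : start + 513]                       # 513 consecutive tokens

features = chunk[:512]  # the first 512 tokens become the input
label = chunk[512]      # the last token becomes the target to predict
```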
The GPT-1 pretraining is done using the objective function shown in Figure 4 below, where uᵢ is the token being predicted, uᵢ₋ₖ, …, uᵢ₋₁ are the k previous tokens (the context window), and Θ denotes the model parameters. What this equation essentially does is compute the likelihood of a token occurring given the previous tokens in the sequence. The token with the highest probability will be returned as the predicted output. By doing this process iteratively, the model will continue the text provided in the prompt. If we go back to Figure 3, we will see that the GPT-1 model has two heads: text prediction and task classifier. Later on, this text generation process is going to be done using the text prediction head.
![Figure 4. The objective function for pretraining [4].](https://towardsdatascience.com/wp-content/uploads/2025/01/1sn44oFz87-3gvGhsX9zQzQ.png)
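As a rough sketch of how this objective translates into code, the log-likelihood in Figure 4 can be computed from the model’s output logits. Here `logits` and `targets` are hypothetical tensors standing in for the text prediction output and the ground-truth next tokens; in practice we minimize the equivalent cross-entropy loss rather than maximizing the log-likelihood directly.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: logits over the vocabulary and the true next-token ids.
logits = torch.randn(1, 512, 40000)           # (batch, sequence, vocab)
targets = torch.randint(0, 40000, (1, 512))   # the token that actually follows each position

log_probs = F.log_softmax(logits, dim=-1)
l1 = log_probs.gather(-1, targets.unsqueeze(-1)).sum()  # the objective in Figure 4 (to be maximized)

# Equivalent training loss (minimized): the negative of the above, i.e. cross-entropy.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
```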
GPT-1 Fine-Tuning
Even though GPT is by default a generative model, during the fine-tuning phase we treat it as a discriminative one. This is essentially because in this phase the goal is just to perform a typical classification task. In the following objective function, y represents the class to be predicted, while x¹, …, xᵐ denote the m input tokens in sequence x. We can simply think of this equation as categorizing a text into a specific class. Such a classification mechanism will later be used to perform various downstream tasks, which I will explain very soon.
![Figure 5. The objective function for the downstream classification task [4].](https://towardsdatascience.com/wp-content/uploads/2025/01/1gXcJ7PJ8xajSo-BayejEVw.png)
There are four different downstream tasks experimented in the paper: classification, natural language inference (entailment), sentence similarity, and multiple-choice question answering. The figure below illustrates the workflow of these tasks.
![Figure 6. The downstream task workflows of the GPT-1 model [4].](https://towardsdatascience.com/wp-content/uploads/2025/01/13cwafq7WLA3GtTXt4y4Gfw.png)
The Transformer blocks colored in green are GPT-1 models, each having the exact same architecture. In order to allow the model to perform different tasks, we need to arrange the input texts accordingly. For a standard text classification task, e.g., sentiment analysis or document classification, we can simply put the token sequence between the start and extract tokens to mark the beginning and the end of a text before feeding it into the GPT-1 model. The resulting tensor will then be forwarded to a linear layer, where each neuron corresponds to a single class.
![Figure 7. Examples of input texts and the corresponding labels for sentiment analysis (classification) task [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/1uhBx-8gPA5jcC-YAd9VuSg.png)
For textual entailment, the model accepts premise and hypothesis as a single sequence, separated by a delimiter token. In this case, the Task Classifier head is responsible for classifying whether the hypothesis entails the premise.
![Figure 8. Examples of input texts and the corresponding labels for textual entailment task [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/1h00gStP2yfOdokLO78SSqg.png)
In the case of the text similarity task, the model works by accepting the two texts to be compared in two different orders: text 1 followed by text 2, and text 2 followed by text 1. These two sequences are fed into the GPT model in parallel, and the resulting outputs are summed before the model eventually predicts whether the two texts are similar. Alternatively, we can configure the output layer to perform a regression task, returning a continuous similarity score.
![Figure 9. Example of a dataset for text similarity measurement [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/1NuR45_NqCaq_EgMybje6gQ.png)
Lastly, for multiple-choice question answering we wrap both the text containing facts and the corresponding question inside the context block. Next, we place a delimiter token before appending one of the answers to it. We do the same thing for all possible answers for every question. With this dataset structure, we perform inference by passing them into the model, letting it calculate the similarity score between each question-answer pair. This score indicates how well each answer addresses the question based on the given facts. We can basically think of this like a standard classification task, where the selected answer is the one having the highest similarity score.
![Figure 10. An example of a dataset for multiple-choice question answering task [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/1RE_XrS9XLaexc7IFCwC7Nw.png)
During the fine-tuning phase, we don’t completely ignore the language modeling process as it still gives the model some idea of what token should come next. In other words, we can perceive it as an auxiliary objective, which is useful for accelerating convergence while at the same time improving the generalization of the classifier model. Therefore, the downstream task objective function (L2) needs to be combined with the language modeling objective function (L1). Figure 11 below shows how this is expressed in a formal mathematical definition, where the weight λ is typically set to be less than 1, allowing the model to pay more attention to the downstream task.
![Figure 11. The objective function used for fine-tuning [4].](https://towardsdatascience.com/wp-content/uploads/2025/01/1KLbDjeipw10hU8Voi2G07g.png)
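In code, the combined objective in Figure 11 amounts to something like the following sketch, where the two cross-entropy losses stand in for the task classifier objective (L2) and the language modeling objective (L1). All tensors here are hypothetical placeholders, and λ is set to 0.5 purely as an example.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the two heads' outputs and their labels.
class_logits = torch.randn(1, 3, requires_grad=True)        # task classifier output (3 classes)
class_label = torch.tensor([1])                              # ground-truth class
text_logits = torch.randn(512, 40000, requires_grad=True)    # text prediction output, flattened
next_tokens = torch.randint(0, 40000, (512,))                # ground-truth next tokens

lambda_aux = 0.5   # weight of the auxiliary language modeling objective, typically < 1

loss_cls = F.cross_entropy(class_logits, class_label)   # downstream task objective (L2)
loss_lm = F.cross_entropy(text_logits, next_tokens)     # language modeling objective (L1)

loss = loss_cls + lambda_aux * loss_lm   # the combined fine-tuning objective
loss.backward()
```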
So, to sum up, the point of GPT-1 is that it basically works by continuing the preceding sequence. If we don’t further fine-tune the model, it will continue the sequence based on its understanding of the data provided in the self-supervised training phase. Meanwhile, if we perform fine-tuning, the model will also continue the sequence but only using the specific ground truths provided in the supervised learning phase.
GPT-1 Implementation: Look-Ahead Mask & Positional Encoding
As we already know the theory behind GPT-1, let’s now implement the architectural design from scratch! We are going to start by importing the required modules.
```python
# Codeblock 1
import torch
import torch.nn as nn
```
Afterwards, we will continue with the parameter configuration, which you can see in Codeblock 2 below. All variables we set here are exactly the same as the ones specified in the GPT-1 paper, except for the BATCH_SIZE
and N_CLASS
(written at line marked with #(1)
and #(2)
). The BATCH_SIZE
variable is necessary because PyTorch by default processes tensors in a batch regardless of the number of samples contained inside. In this case, I assume that there is only a single sample in each batch. Meanwhile, N_CLASS
will be used for the task classifier head which will run when the downstream task is performed. As an example, here I set the parameter to 3. With this configuration, we can use the head for 3-class classification task like the sentiment analysis or the textual entailment cases I showed you earlier in Figure 7 and 8.
```python
# Codeblock 2
BATCH_SIZE = 1           #(1)
N_CLASS = 3              #(2)
SEQ_LENGTH = 512         #(3)
VOCAB_SIZE = 40000       #(4)
D_MODEL = 768            #(5)
N_LAYERS = 12            #(6)
NUM_HEADS = 12           #(7)
HIDDEN_DIM = D_MODEL*4   #(8)
DROP_PROB = 0.1          #(9)
```
The `SEQ_LENGTH` parameter (`#(3)`), which is another term for the context window, is set to 512. The BPE tokenization mechanism performed on the training dataset produces 40,000 unique tokens, hence we need to use this number for `VOCAB_SIZE` (`#(4)`). Next, the `D_MODEL` parameter denotes the length of the feature vector used to represent a token, which in the case of GPT-1 is set to 768 (`#(5)`). Previously I mentioned that the decoder layer is repeated 12 times. In the above code, this number is assigned to the `N_LAYERS` variable (`#(6)`). Each decoder layer itself comprises several other components whose parameters need to be manually configured as well: the number of attention heads (`#(7)`), the number of hidden neurons in the feed forward block (`#(8)`), and the rate for the dropout layers (`#(9)`).
As the required parameters have been configured, the next thing to do is initialize a function for creating the so-called look-ahead mask and a class for creating the positional embedding. The look-ahead mask can be thought of as a tool that prevents the model from looking at subsequent tokens during the training phase, considering that later in the inference phase, subsequent tokens are unavailable. Meanwhile, the positional embedding is used to label each token with specific numbers, which is useful for preserving information regarding the token order. In fact, even though the look-ahead mask already contains this information, the positional embedding emphasizes it even further.
Look at Codeblocks 3 and 4 below to see how I implement the two concepts I just explained. I am not going to go any deeper into them here, as I’ve provided a complete explanation in my article about the Transformer, the link to which is provided in the reference list [2] – you can just click it and scroll all the way down to the Positional Encoding and the Look-Ahead Mask sections. Even the following code is exactly the same as what I wrote there!
```python
# Codeblock 3
def create_mask():
    mask = torch.tril(torch.ones((SEQ_LENGTH, SEQ_LENGTH)))
    mask[mask == 0] = -float('inf')
    mask[mask == 1] = 0
    return mask
```
```python
# Codeblock 4
class PositionalEncoding(nn.Module):
    def forward(self):
        pos = torch.arange(SEQ_LENGTH).reshape(SEQ_LENGTH, 1)
        i = torch.arange(0, D_MODEL, 2)
        denominator = torch.pow(10000, i/D_MODEL)
        even_pos_embed = torch.sin(pos/denominator)
        odd_pos_embed  = torch.cos(pos/denominator)
        stacked = torch.stack([even_pos_embed, odd_pos_embed], dim=2)
        pos_embed = torch.flatten(stacked, start_dim=1, end_dim=2)
        return pos_embed
```
GPT-1 Implementation: Decoder
Now let’s talk about the decoder part, which I implement inside the `DecoderGPT1()` class. The reason I name it this way is that we are going to use it exclusively for GPT-1. See the detailed implementation in Codeblocks 5a and 5b.
```python
# Codeblock 5a
class DecoderGPT1(nn.Module):
    def __init__(self):
        super().__init__()

        self.multihead_attention = nn.MultiheadAttention(embed_dim=D_MODEL,    #(1)
                                                         num_heads=NUM_HEADS,
                                                         batch_first=True)     #(2)
        self.dropout_0 = nn.Dropout(DROP_PROB)
        self.norm_0 = nn.LayerNorm(D_MODEL)    #(3)

        self.feed_forward = nn.Sequential(nn.Linear(D_MODEL, HIDDEN_DIM),    #(4)
                                          nn.GELU(),
                                          nn.Linear(HIDDEN_DIM, D_MODEL))
        self.dropout_1 = nn.Dropout(DROP_PROB)
        self.norm_1 = nn.LayerNorm(D_MODEL)    #(5)

        nn.init.normal_(self.feed_forward[0].weight, 0, 0.02)    #(6)
        nn.init.normal_(self.feed_forward[2].weight, 0, 0.02)    #(7)
```
There are several neural network layers I initialize in the `__init__()` method above, where every single one of them corresponds to a sub-component inside the decoder shown back in Figure 3. The first one is the multihead attention layer (`#(1)`), where the values used for `embed_dim` and `num_heads` are taken from the variables we initialized earlier. Additionally, here I set the `batch_first` parameter to `True` (`#(2)`) since our batch dimension is on the 0th axis, which is a common practice when it comes to working with PyTorch tensors. Next, we initialize two layer normalization layers with `D_MODEL` as the input argument for each (at lines `#(3)` and `#(5)`). This essentially means that these two layers will perform normalization across the 768 values of each token.
As for the feed forward block, I create it using `nn.Sequential()` (`#(4)`), where I initialize two linear layers with a GELU activation function in between. The first linear layer is responsible for expanding the 768 (`D_MODEL`)-dimensional token representation into 3072 (`HIDDEN_DIM`) dimensions. Afterwards, we pass it through GELU before shrinking it back to 768 dimensions. The authors of the paper mentioned that the weight initialization for these layers follows a normal distribution with a mean of 0 and a standard deviation of 0.02. We can manually configure them using the code at lines `#(6)` and `#(7)`.
Now let’s move on to Codeblock 5b, where I define the `forward()` method of the `DecoderGPT1()` class. You can see below that it works by accepting two inputs: `x` and `attn_mask` (`#(1)`). The first input is the embedded token sequence, while the second one is the look-ahead mask generated by the `create_mask()` function we defined earlier.
```python
# Codeblock 5b
    def forward(self, x, attn_mask):    #(1)

        residual = x    #(2)
        print(f"original & residual\t: {x.shape}")

        x = self.multihead_attention(x, x, x, attn_mask=attn_mask)[0]    #(3)
        print(f"after attention\t\t: {x.shape}")

        x = self.dropout_0(x)    #(4)
        print(f"after dropout\t\t: {x.shape}")

        x = x + residual    #(5)
        print(f"after addition\t\t: {x.shape}")

        x = self.norm_0(x)    #(6)
        print(f"after normalization\t: {x.shape}")

        residual = x
        print(f"\nx & residual\t\t: {x.shape}")

        x = self.feed_forward(x)    #(7)
        print(f"after feed forward\t: {x.shape}")

        x = self.dropout_1(x)
        print(f"after dropout\t\t: {x.shape}")

        x = x + residual
        print(f"after addition\t\t: {x.shape}")

        x = self.norm_1(x)
        print(f"after normalization\t: {x.shape}")

        return x
```
Before doing anything else, the first thing we do inside the `forward()` method above is store the original input tensor `x` in the `residual` variable (`#(2)`). The `x` tensor itself is then processed with the multihead attention layer (`#(3)`). Since we are about to perform self attention (not cross attention), the query, key and value inputs for the layer are all derived from `x`. Not only that, here we also need to pass the look-ahead mask as the argument for the `attn_mask` parameter. After processing with the attention layer is complete, we pass the `x` tensor through a dropout layer (`#(4)`) before it is eventually combined again with `residual` (`#(5)`) and normalized by layer norm (`#(6)`). The remaining processes are nearly the same, except that we replace the `self.multihead_attention` layer with the `self.feed_forward` layer (`#(7)`).
To check if our decoder works properly, we can pass a tensor with the size of 1×512×768 as shown in Codeblock 6 below. This simulates a sequence of 512 tokens, each represented as a 768-dimensional vector.
```python
# Codeblock 6
decoder_gpt_1 = DecoderGPT1()

x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
look_ahead_mask = create_mask()

x = decoder_gpt_1(x, look_ahead_mask)
```
We can see in the resulting output that this tensor successfully passed through all components in the decoder. It is worth noting that the tensor dimensions remain the same after each step, including in the final output. This property allows us to stack multiple decoders without worrying that the tensor dimensions will break. – Well, in fact, there are some dimensionality changes inside the attention and the feed forward layers, but the tensor immediately returns to its original dimension before being fed into the subsequent layers.
```
# Codeblock 6 output
original & residual : torch.Size([1, 512, 768])
after attention     : torch.Size([1, 512, 768])
after dropout       : torch.Size([1, 512, 768])
after addition      : torch.Size([1, 512, 768])
after normalization : torch.Size([1, 512, 768])

x & residual        : torch.Size([1, 512, 768])
after feed forward  : torch.Size([1, 512, 768])
after dropout       : torch.Size([1, 512, 768])
after addition      : torch.Size([1, 512, 768])
after normalization : torch.Size([1, 512, 768])
```
GPT-1 Implementation: Decoder with Input & Text Prediction
As we have completed the decoder block, we will now connect the input layer before it and attach the text prediction head to the output. You can see how I implement them in the `GPT1()` class below.
```python
# Codeblock 7a
class GPT1(nn.Module):
    def __init__(self):
        super().__init__()

        self.token_embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                            embedding_dim=D_MODEL)    #(1)

        self.positional_encoding = PositionalEncoding()    #(2)

        self.decoders = nn.ModuleList([DecoderGPT1() for _ in range(N_LAYERS)])    #(3)

        self.linear = nn.Linear(in_features=D_MODEL, out_features=VOCAB_SIZE)    #(4)

        nn.init.normal_(self.token_embedding.weight, mean=0, std=0.02)    #(5)
        nn.init.normal_(self.linear.weight, mean=0, std=0.02)    #(6)
```
Inside the `__init__()` method, we first initialize an `nn.Embedding()` layer. This layer is used to map each token into a 768 (`D_MODEL`)-dimensional vector (`#(1)`). Secondly, we initialize a positional encoding tensor using the `PositionalEncoding()` class we created earlier (`#(2)`). The 12 decoder layers need to be initialized one by one, and in this case I do it using a simple `for` loop. All these decoders are then stored in `self.decoders` (`#(3)`). Next, we initialize a linear layer, which basically corresponds to the text prediction head (`#(4)`). This layer is responsible for mapping each vector into `VOCAB_SIZE` (40,000) neurons, where every single one of them indicates the probability of a specific token being selected. Again, here I also manually configure the weight initialization distribution using the code at lines `#(5)` and `#(6)`.
Moving on to the `forward()` method in Codeblock 7b, the first thing we do is process the input tensor with the `self.token_embedding` layer (`#(1)`). Next, we inject the positional encoding tensor into `x` by element-wise addition (`#(2)`). The resulting tensor is then forwarded to the stack of 12 decoders, which we can do with another loop as shown at line `#(3)`. Remember that the GPT-1 model has two heads. In this case, the text prediction head will be included inside the `forward()` method, whereas the task classifier head will later be implemented in a separate class. To accomplish this, I return both the raw decoder output (`decoder_output`) as well as the next-word prediction output (`text_output`) as shown at line `#(5)`. Later on, I will use `decoder_output` as the input for the task classifier head.
```python
# Codeblock 7b
    def forward(self, x):
        print(f"original input\t\t: {x.shape}")

        x = self.token_embedding(x.long())    #(1)
        print(f"embedded tokens\t\t: {x.shape}")

        x = x + self.positional_encoding()    #(2)
        print(f"after addition\t\t: {x.shape}")

        for i, decoder in enumerate(self.decoders):
            x = decoder(x, attn_mask=look_ahead_mask)    #(3)
            print(f"after decoder #{i}\t: {x.shape}")

        decoder_output = x    #(4)
        print(f"decoder_output\t\t: {decoder_output.shape}")

        text_output = self.linear(x)
        print(f"text_output\t\t: {text_output.shape}")

        return decoder_output, text_output    #(5)
```
We can check if our `GPT1()` class works properly with Codeblock 8 below. The `x` tensor here is assumed to be a sequence of tokens with the length of `SEQ_LENGTH` (512), in which every element is a random integer within the range of 0 to `VOCAB_SIZE` (40,000), representing the encoded tokens.
```python
# Codeblock 8
gpt1 = GPT1()

x = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))
x = gpt1(x)
```
```
# Codeblock 8 output
original input    : torch.Size([1, 512])          #(1)
embedded tokens   : torch.Size([1, 512, 768])     #(2)
after addition    : torch.Size([1, 512, 768])
after decoder #0  : torch.Size([1, 512, 768])
after decoder #1  : torch.Size([1, 512, 768])
after decoder #2  : torch.Size([1, 512, 768])
after decoder #3  : torch.Size([1, 512, 768])
after decoder #4  : torch.Size([1, 512, 768])
after decoder #5  : torch.Size([1, 512, 768])
after decoder #6  : torch.Size([1, 512, 768])
after decoder #7  : torch.Size([1, 512, 768])
after decoder #8  : torch.Size([1, 512, 768])
after decoder #9  : torch.Size([1, 512, 768])
after decoder #10 : torch.Size([1, 512, 768])
after decoder #11 : torch.Size([1, 512, 768])
decoder_output    : torch.Size([1, 512, 768])     #(3)
text_output       : torch.Size([1, 512, 40000])   #(4)
```
Based on the above output, we can see that our `self.token_embedding` layer successfully converted the sequence of 512 tokens (`#(1)`) into a sequence of 768-dimensional token vectors (`#(2)`). This tensor dimension remained the same all the way to the last decoder layer, whose output was then stored in the `decoder_output` variable (`#(3)`). Finally, after being processed with the text prediction head, the tensor dimension changed to 1×512×40000 (`#(4)`), containing the information regarding the next-token prediction. – In the original Transformer, this is often called the shifted-right output. It basically means that the information stored in the 0th row is the prediction for the 1st token, the 1st row contains the prediction for the 2nd token, and so on. Hence, since we want to predict the 513th token, we can simply take the last (512th) row and select the element corresponding to the token with the highest probability.
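To make this concrete, here is a minimal sketch of how the next token could be picked greedily from `text_output`; the names follow the objects created in Codeblock 8. In a real setting, `x` would of course come from the BPE tokenizer rather than `torch.randint`, and the resulting id would be decoded back into text.

```python
# A minimal sketch of greedy next-token selection (assumes gpt1 from Codeblock 8).
x = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))   # stand-in for an encoded 512-token prompt
with torch.no_grad():
    decoder_output, text_output = gpt1(x)    # text_output: (1, 512, 40000)

last_row = text_output[0, -1]                # prediction for the 513th token
next_token_id = last_row.argmax().item()     # id of the most probable next token
print(next_token_id)
```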
To calculate the number of model parameters, we can use the `count_parameters()` function below.
```python
# Codeblock 9
def count_parameters(model):
    return sum([params.numel() for params in model.parameters()])

count_parameters(gpt1)
```
```
# Codeblock 9 output
146534464
```
We can see here that our GPT-1 implementation has approximately 146 million parameters. – I do need to acknowledge that this number is different from the one disclosed in the original paper, i.e., 117 million. This difference is probably because I missed some intricate details. Feel free to comment if you know which part of the code I should change to achieve this number!
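For what it’s worth, one detail that could account for most of this gap is weight tying: sharing the token embedding matrix with the text prediction head (as done in the original Transformer) would remove roughly 40,000 × 768 ≈ 30.7 million parameters from the count above, bringing it close to 117 million. Below is a minimal sketch of how this could be added at the end of `GPT1.__init__()`; note that this is my assumption, not something taken from the codeblocks above.

```python
# Hypothetical addition to GPT1.__init__(): tie the text prediction head to the token embedding.
# Both weights have shape (VOCAB_SIZE, D_MODEL), so they can point to the same tensor.
self.linear.weight = self.token_embedding.weight
```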
GPT-1 Implementation: Task Classifier Head
Remember that our `GPT1()` class only includes the text prediction head. For language modeling alone this is already sufficient, yet for fine-tuning we need to manually create the task classifier head. Look at Codeblock 10 below to see how I implement it.
```python
# Codeblock 10
class TaskClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=D_MODEL, out_features=N_CLASS)    #(1)
        nn.init.normal_(self.linear.weight, mean=0, std=0.02)

    def forward(self, x):    #(2)
        print(f"decoder_output\t: {x.shape}")

        class_output = self.linear(x)
        print(f"class_output\t: {class_output.shape}")

        return class_output
```
Similar to the text prediction head, the task classifier head is basically just a linear layer as well. However, in this case it maps every 768-dimensional token embedding into 3 (`N_CLASS`) output values, corresponding to the number of classes for the classification task we want to train it on (`#(1)`). Later on, the output from the decoder will be used as the input for the `forward()` method (`#(2)`). Thus, to test this `TaskClassifier()` class, I will pass through a dummy tensor whose dimensions exactly match the decoder output, i.e., 1×512×768. We can see in Codeblock 11 below that this tensor successfully passes through the task classifier head.
```python
# Codeblock 11
task_classifier = TaskClassifier()

x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
x = task_classifier(x)
```
```
# Codeblock 11 output
decoder_output : torch.Size([1, 512, 768])
class_output   : torch.Size([1, 512, 3])    #(1)
```
If we take a closer look at the above output, we can see that the resulting tensor now has the shape of 1×512×3 (`#(1)`). This essentially means that every single token is now represented as 3 numbers. As mentioned earlier, in this example we are about to simulate a sentiment analysis task with 3 classes: positive, negative and neutral. To determine the sentiment of the entire sequence, we can either aggregate the logits across all tokens or use only the logits from the last token (considering that it already contains information from the entire sequence). Additionally, with the same output tensor shape, we can use a similar idea to perform token-level classification tasks, such as NER (Named Entity Recognition) or POS (Part-of-Speech) tagging.
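As a small illustration of the two options above, here is a sketch of how a single sequence-level prediction could be obtained from the tensor returned by Codeblock 11 (stored in `x` there); nothing in this snippet comes from the original paper.

```python
# Sketch: turning the (1, 512, 3) token-level logits from Codeblock 11 into one prediction.
class_output = x                               # the tensor returned by task_classifier
last_token_logits = class_output[:, -1, :]     # option 1: use only the last token
mean_logits = class_output.mean(dim=1)         # option 2: average the logits across all tokens

predicted_class = last_token_logits.argmax(dim=-1)   # sequence-level class index
print(predicted_class)
```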
Later in the inference phase, we will use the `TaskClassifier()` head every time we want to perform a specific downstream task. Codeblock 12 below is a sample code for performing the forward pass. What it essentially does is pass the tokenized sentence into the `gpt1` model, which returns the raw decoder output and the next-word prediction (`#(1)`). Then, we use the output from the decoder as the input for the task classifier head, which will return the logits of the available classes (`#(2)`).
```python
# Codeblock 12
def gpt1_fine_tune(x, gpt1, task_classifier):
    print(f"original input\t\t: {x.shape}")

    decoder_output, text_output = gpt1(x)    #(1)
    print(f"decoder_output\t\t: {decoder_output.shape}")
    print(f"text_output\t\t: {text_output.shape}")

    class_output = task_classifier(decoder_output)    #(2)
    print(f"class_output\t\t: {class_output.shape}")

    return text_output, class_output
```
Based on the output produced by the following codeblock, we can see that our `gpt1_fine_tune()` function above works properly.
```python
# Codeblock 13
gpt1 = GPT1()
task_classifier = TaskClassifier()

x = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))
text_output, class_output = gpt1_fine_tune(x, gpt1, task_classifier)
```
```
# Codeblock 13 output
original input : torch.Size([1, 512])
decoder_output : torch.Size([1, 512, 768])
text_output    : torch.Size([1, 512, 40000])
class_output   : torch.Size([1, 512, 3])
```
GPT-1 Limitations
Despite obtaining remarkable results in handling the four downstream tasks I showed in Figure 6, it is important to know that this approach has some drawbacks. First, the training procedure is complex since we need to perform pretraining and fine-tuning in separate processes. Second, since fine-tuning is a discriminative process, we still need to perform manual labeling (unlike the generative pretraining, which uses a self-supervised labeling method). Third, the model is not flexible, as it can only work on the task it is fine-tuned on. For instance, a model specialized for sentiment analysis cannot be used for a question answering task. – Fortunately, GPT-2 was introduced soon after to handle these issues.
GPT-2
GPT-2 was introduced in the paper titled "Language Models are Unsupervised Multitask Learners", published several months after GPT-1 [6]. The authors of this paper found that a plain GPT language model could actually perform various downstream tasks without fine-tuning. This is made possible by modifying the objective function. Whereas GPT-1 makes predictions based solely on the previous token sequence, i.e., P(output | input), GPT-2 does so based not only on the sequence but also on the given task, i.e., P(output | input, task). With this property, the same prompt will cause the model to produce a different output whenever the given task is different. And interestingly, we can simply include the task in the prompt as natural language.
As an example, if you prompt a model with "lorem ipsum dolor sit amet", it will likely continue with "consectetur adipiscing elit." But if you include a task like "what does it mean?" in the prompt, the model will give an explanation regarding what it actually is. I tried this in ChatGPT, and the answer was exactly what I expected.
![Figure 12. ChatGPT only continues the input sentence if the task is not specified [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/1rMeh1uhTykXDh5qcu9FGWg.png)
![Figure 13. An example of how assigning a specific task causes the model to respond differently [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/1gMgpPuNLz8ZFgn8e1r4qcg.png)
The idea of providing the task in the form of natural language can be achieved by training the model with an enormous amount of text in a self-supervised manner. For the sake of comparison, the dataset used for GPT-1 to perform language modeling is the BooksCorpus dataset, which contains more than 7,000 unpublished books and is equivalent to approximately 5 GB of text. Meanwhile, the dataset used for GPT-2 is WebText, which has a size of approximately 40 GB. Not only the dataset, but the model itself is also larger. The authors of the GPT-2 paper created four model variations, each having a different configuration, as summarized in Figure 14 below. The one in the first row is equivalent to the GPT-1 model we just implemented, whereas the model recognized as GPT-2 is the one in the last row. Here we can see that GPT-2 is roughly 13 times larger than GPT-1 in terms of the number of parameters. Based on this information regarding the dataset and model size, we can definitely expect GPT-2 to perform much better than its predecessor.
![Figure 14. The four model variations proposed in the GPT-2 paper [6].](https://towardsdatascience.com/wp-content/uploads/2025/01/17h2-Sfw_ZR5ZAiJ4FcfZJA.png)
It is important to know that `N_LAYERS` and `D_MODEL` are not the only parameters we need to change if we were to actually create the model. The codeblock below shows the complete parameter configuration for GPT-2.
```python
# Codeblock 14
BATCH_SIZE = 1
SEQ_LENGTH = 1024        #(1)
VOCAB_SIZE = 50257       #(2)
D_MODEL = 1600
NUM_HEADS = 25           #(3)
HIDDEN_DIM = D_MODEL*4   #(4)
N_LAYERS = 48
DROP_PROB = 0.1
```
In this GPT version, instead of only taking into account 512 tokens when predicting the next token, the authors extend the context window to 1024 (`#(1)`) so that the model can attend to and process longer token sequences, allowing it to accept longer prompts. The vocabulary size also gets larger. Previously in GPT-1, the number of unique tokens was only 40,000, but in GPT-2 this number increased to 50,257 (`#(2)`). The last thing we need to change is the number of attention heads, which is now set to 25 as shown at line `#(3)`. The `HIDDEN_DIM` parameter actually also changes, but we don’t need to manually specify its value as it remains configured to be 4 times larger than the embedding dimension (`#(4)`).
GPT-2 Implementation: Decoder
Talking about the architecture implementation, it is important to know that the decoder used in GPT-2 is somewhat different from the one used in GPT-1. In the case of GPT-2, we use so-called pre-normalization, as opposed to GPT-1, which uses post-normalization. The idea of pre-normalization is that we place the layer norm before the main operations, i.e., the multihead attention and feed forward blocks. You can see the illustration in the following figure.
![Figure 15. The GPT-1 architecture without Task Classifier head (left) and the GPT-2 architecture (right) [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/16GS2P6dpoWjMQkjDTN0cmA.png)
I implement the decoder for GPT-2 in the `DecoderGPT23()` class below. Spoiler alert: I named it this way because the structure of the GPT-2 and GPT-3 architectures is exactly the same.
```python
# Codeblock 15
class DecoderGPT23(nn.Module):
    def __init__(self):
        super().__init__()

        self.norm_0 = nn.LayerNorm(D_MODEL)
        self.multihead_attention = nn.MultiheadAttention(embed_dim=D_MODEL,
                                                         num_heads=NUM_HEADS,
                                                         batch_first=True)
        self.dropout_0 = nn.Dropout(DROP_PROB)

        self.norm_1 = nn.LayerNorm(D_MODEL)
        self.feed_forward = nn.Sequential(nn.Linear(D_MODEL, HIDDEN_DIM),
                                          nn.GELU(),
                                          nn.Linear(HIDDEN_DIM, D_MODEL))
        self.dropout_1 = nn.Dropout(DROP_PROB)

        nn.init.normal_(self.feed_forward[0].weight, 0, 0.02)
        nn.init.normal_(self.feed_forward[2].weight, 0, 0.02)

    def forward(self, x, attn_mask):

        residual = x
        print(f"original & residual\t: {x.shape}")

        x = self.norm_0(x)
        print(f"after normalization\t: {x.shape}")

        x = self.multihead_attention(x, x, x, attn_mask=attn_mask)[0]
        print(f"after attention\t\t: {x.shape}")

        x = self.dropout_0(x)
        print(f"after dropout\t\t: {x.shape}")

        x = x + residual
        print(f"after addition\t\t: {x.shape}")

        residual = x
        print(f"\nx & residual\t\t: {x.shape}")

        x = self.norm_1(x)
        print(f"after normalization\t: {x.shape}")

        x = self.feed_forward(x)
        print(f"after feed forward\t: {x.shape}")

        x = self.dropout_1(x)
        print(f"after dropout\t\t: {x.shape}")

        x = x + residual
        print(f"after addition\t\t: {x.shape}")

        return x
```
Well, I don’t think I need to explain the above code any further since it is mostly the same as the decoder for GPT-1, except that here we place the layer normalization blocks at different positions. So, we will now jump directly into the testing code in Codeblock 16 below.
```python
# Codeblock 16
decoder_gpt_2 = DecoderGPT23()

x = torch.randn(BATCH_SIZE, SEQ_LENGTH, D_MODEL)
look_ahead_mask = create_mask()

x = decoder_gpt_2(x, look_ahead_mask)
```
We can see in the resulting output that our `x` tensor successfully passed through all sub-components inside the decoder layer.
```
# Codeblock 16 output
original & residual : torch.Size([1, 1024, 1600])
after normalization : torch.Size([1, 1024, 1600])
after attention     : torch.Size([1, 1024, 1600])
after dropout       : torch.Size([1, 1024, 1600])
after addition      : torch.Size([1, 1024, 1600])

x & residual        : torch.Size([1, 1024, 1600])
after normalization : torch.Size([1, 1024, 1600])
after feed forward  : torch.Size([1, 1024, 1600])
after dropout       : torch.Size([1, 1024, 1600])
after addition      : torch.Size([1, 1024, 1600])
```
GPT-2 Implementation: Decoder with Input & Text Prediction
Although the decoder used in GPT-2 is different from the one used in GPT-1, the other components, namely the positional encoding and the look-ahead mask, remain the same. Hence, we can just reuse them. The code used to attach these two components is mostly the same as before, but there are still some intricate details to pay attention to in Codeblock 17 below. First, here we initialize another layer normalization layer at line `#(1)` before placing it in the flow at line `#(2)`. This is done because in GPT-2 we have another layer norm block placed outside the decoder, which does not exist in GPT-1 (see Figure 15). Secondly, it is not necessary to store the raw decoder output like what we did in the `GPT1()` class (at line `#(4)` in Codeblock 7b). This is basically because GPT-2 does not require fine-tuning to perform any kind of downstream task. Rather, it relies solely on the text prediction head to do so.
```python
# Codeblock 17
class GPT23(nn.Module):
    def __init__(self):
        super().__init__()

        self.token_embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                            embedding_dim=D_MODEL)

        self.positional_encoding = PositionalEncoding()

        self.decoders = nn.ModuleList([DecoderGPT23() for _ in range(N_LAYERS)])

        self.norm_final = nn.LayerNorm(D_MODEL)    #(1)

        self.linear = nn.Linear(in_features=D_MODEL, out_features=VOCAB_SIZE)

        nn.init.normal_(self.token_embedding.weight, mean=0, std=0.02)
        nn.init.normal_(self.linear.weight, mean=0, std=0.02)

    def forward(self, x):
        print(f"original input\t\t: {x.shape}")

        x = self.token_embedding(x.long())
        print(f"embedded tokens\t\t: {x.shape}")

        x = x + self.positional_encoding()
        print(f"after addition\t\t: {x.shape}")

        for i, decoder in enumerate(self.decoders):
            x = decoder(x, attn_mask=look_ahead_mask)
            print(f"after decoder #{i}\t: {x.shape}")

        x = self.norm_final(x)    #(2)
        print(f"after final norm\t: {x.shape}")

        text_output = self.linear(x)
        print(f"text_output\t\t: {text_output.shape}")

        return text_output
```
Now we can test the `GPT23()` class above with the following codeblock. Here I test it with a sequence of tokens of length 1024. The resulting output is very long since the decoder layer is repeated 48 times.
```python
# Codeblock 18
gpt2 = GPT23()

x = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))
x = gpt2(x)
```
```
# Codeblock 18 output
original input    : torch.Size([1, 1024])
embedded tokens   : torch.Size([1, 1024, 1600])
after addition    : torch.Size([1, 1024, 1600])
after decoder #0  : torch.Size([1, 1024, 1600])
after decoder #1  : torch.Size([1, 1024, 1600])
after decoder #2  : torch.Size([1, 1024, 1600])
after decoder #3  : torch.Size([1, 1024, 1600])
.
.
.
.
after decoder #44 : torch.Size([1, 1024, 1600])
after decoder #45 : torch.Size([1, 1024, 1600])
after decoder #46 : torch.Size([1, 1024, 1600])
after decoder #47 : torch.Size([1, 1024, 1600])
after final norm  : torch.Size([1, 1024, 1600])
text_output       : torch.Size([1, 1024, 50257])
```
If we try to print out the number of parameters, we can see that GPT-2 has around 1.6 billion. Just like with the GPT-1 implementation we did earlier, this number is also slightly different from the one disclosed in the paper, which is around 1.5 billion as shown in Figure 14.
```python
# Codeblock 19
count_parameters(gpt2)
```
```
# Codeblock 19 output
1636434257
```
GPT-3
GPT-3 was proposed in the paper titled "Language Models are Few-Shot Learners", which was published back in 2020 [7]. This title signifies that the proposed model is able to perform a wide range of tasks given only several examples, a.k.a. "shots." Despite this emphasis on few-shot learning, in practice the model is also able to perform one-shot or even zero-shot learning. In case you’re not yet familiar with few-shot learning, it is basically a method to adapt the model to a specific task using only a small number of examples. Even though the objective is similar to that of fine-tuning, few-shot learning achieves it without updating the model weights. In the case of GPT models, this is possible thanks to the attention mechanism, which allows the model to dynamically focus on the most relevant parts of the instruction and examples provided in the prompt. Similar to the improvements made from GPT-1 to GPT-2, the ability of GPT-3 to perform much better in few-shot learning than its predecessors is also due to the increased amount of training data used and the larger model size.
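To give a rough idea of what few-shot prompting looks like in practice, below is a hypothetical prompt for a sentiment task; the wording and examples are my own illustration, not taken from the paper. The model receives this as plain text and is expected to continue after the final "Sentiment:" line.

```python
# A hypothetical few-shot prompt (illustration only); the model simply continues this text.
prompt = (
    "Review: The movie was fantastic.\nSentiment: positive\n\n"
    "Review: I wasted two hours of my life.\nSentiment: negative\n\n"
    "Review: The plot was fine but the pacing dragged.\nSentiment:"
)
```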
GPT-3 Implementation: Model Configuration & Architectural Design
You already read the spoiler, right? The architectural design of GPT-3 is exactly the same as GPT-2. What makes them different is only the model size, which we can adjust by using larger values for the parameters. Codeblock 20 below shows the parameter configuration for GPT-3.
```python
# Codeblock 20
BATCH_SIZE = 1
SEQ_LENGTH = 2048
VOCAB_SIZE = 50257
D_MODEL = 12288
NUM_HEADS = 96
HIDDEN_DIM = D_MODEL*4
N_LAYERS = 96
DROP_PROB = 0.1
```
As the above variables have been updated, we can simply run the following codeblock to initialize the GPT-3 model (`#(1)`) and pass a tensor representing a sequence of tokens through it (`#(2)`).
```python
# Codeblock 21
gpt3 = GPT23()    #(1)

x = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))
x = gpt3(x)    #(2)
```
Unfortunately, I cannot run the above code due to the limited memory I have. I even tried to run it on a Kaggle Notebook with 30 GB of memory, but the out-of-memory error persisted. So, for this one, I cannot show you the number of parameters the model creates when it is initialized. However, it is mentioned in the paper that GPT-3 consists of around 175 billion parameters, which basically means that it is more than 100 times larger than GPT-2 – so it makes sense that it can only be run on an extremely large and powerful machine. We can still estimate the parameter count analytically from the configuration, though, as shown in the sketch after the next figure. Look at the figure below to see how the GPT versions differ from each other.
![Figure 16. Comparison of different GPT versions [3].](https://towardsdatascience.com/wp-content/uploads/2025/01/1oGsue_2AGpLDE9P1-2C1XA.png)
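Even though the model itself cannot be instantiated here, we can still estimate how many parameters our `GPT23()` implementation would have under the Codeblock 20 configuration by counting its weights analytically. The sketch below is based only on the layers we defined ourselves (the sinusoidal positional encoding adds no parameters), and it lands at roughly 175 billion, in line with the number reported in the paper.

```python
# Estimate the parameter count of GPT23() without allocating any weights.
# Assumes the Codeblock 20 configuration is still in effect.
embedding = VOCAB_SIZE * D_MODEL                                  # token embedding
attention = 4 * D_MODEL * D_MODEL + 4 * D_MODEL                   # q, k, v, out projections (+ biases)
feed_forward = 2 * D_MODEL * HIDDEN_DIM + HIDDEN_DIM + D_MODEL    # the two linear layers (+ biases)
layer_norms = 2 * 2 * D_MODEL                                     # two layer norms per decoder
per_decoder = attention + feed_forward + layer_norms

final_norm = 2 * D_MODEL
text_head = D_MODEL * VOCAB_SIZE + VOCAB_SIZE                     # text prediction head (+ bias)

total = embedding + N_LAYERS * per_decoder + final_norm + text_head
print(f"{total:,}")   # roughly 175 billion
```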
Ending
That’s pretty much everything about the theory and the implementation of different GPT versions, especially GPT-1, GPT-2 and GPT-3. As this article is written, OpenAI hasn’t officially disclosed the architectural details for GPT-4, so we can’t reproduce it just yet. I hope OpenAI will publish the paper very soon!
Thank you for reading my article up to this point. I do appreciate your time, and I hope you learn something new here. Have a nice day!
_Note: you can also access the code used in this article here._
References
[1] Ashish Vaswani et al. Attention Is All You Need. arXiv. https://arxiv.org/pdf/1706.03762 [Accessed October 31, 2024].
[2] Muhammad Ardi. Paper Walkthrough: Attention Is All You Need. Towards Data Science. https://medium.com/towards-data-science/paper-walkthrough-attention-is-all-you-need-80399cdc59e1 [Accessed November 4, 2024].
[3] Image originally created by author.
[4] Alec Radford et al. Improving Language Understanding by Generative Pre-Training. OpenAI. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf [Accessed October 31, 2024].
[5] Image created originally by author based on [1].
[6] Alec Radford et al. Language Models are Unsupervised Multitask Learners. OpenAI. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf [Accessed October 31, 2024].
[7] Tom B. Brown et al. Language Models are Few-Shot Learners. arXiv. https://arxiv.org/pdf/2005.14165 [Accessed October 31, 2024].