
Text Generation with GPT

How to fine-tune a GPT model to generate a TED description-like text

Photo by Aaron Burden on Unsplash

If you’re working in the data science or machine learning industry, chances are that you’ve heard the term Generative AI before, which refers to AI algorithms capable of creating new content such as text, images, or audio. In this article, we’re going to delve into one of these Generative AI models: the GPT model. As you might have guessed, GPT is the foundation model behind ChatGPT and is capable of generating sequences of text.

Specifically, we will briefly discuss the fine-tuning and text generation process of a GPT model. While there are many established libraries and platforms out there that we could use to handle this task, they often abstract away many implementation details, leaving us curious about what actually happens under the hood.

Therefore, we’ll explore the fine-tuning and text generation process in low-level detail. This means we’ll cover everything comprehensively, from data preprocessing and model building to setting up the loss function, the fine-tuning process, and the logic behind text generation once the model is fine-tuned.

So, without further ado, let’s start with the dataset we’ll be using to fine-tune our GPT model!


About the Dataset

The dataset we’ll be using is the _TED-talk dataset_, which we can download directly from the HuggingFace Hub. It is listed as having a CC-BY 4.0 license, so there’s no need to worry about copyright.

# Load all necessary libraries
!pip install datasets

import torch
import numpy as np
from torch import nn
from transformers import GPT2Tokenizer, GPT2Config, GPT2Model, GPT2PreTrainedModel
from torch.optim import AdamW
from datasets import load_dataset
from tqdm import tqdm
from torch.nn import functional as F

import matplotlib.pyplot as plt
%matplotlib inline

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_built() else 'cpu'

dataset = load_dataset("gigant/ted_descriptions")
print(len(dataset['train']))

'''
Output:
5705
'''

In total, the dataset comprises 5705 entries, each containing a URL (url) and a description of a TED event (descr). For the purpose of this article, we will exclusively use the descr part as training data to fine-tune our GPT model.

print(dataset['train'][0])

'''
Output:
{'url': 'https://www.ted.com/talks/jen_gunter_how_your_sense_of_smell_helps_you_savor_flavor',
 'descr': "Eating pizza with a stuffy nose just isn't as satisfying -- 
          and there's a reason for that. Dr. Jen Gunter explains 
          how our ability to smell and taste work together to give us 
          a full sensory experience. So whether you're sniffing 
          the caramelized aroma of coffee, a whiff of trash or 
          a trillion other things, your brain knows exactly what's under 
          your nose. For more on how your body works, tune into her 
          podcast, Body Stuff with Dr. Jen Gunter, from the 
          TED Audio Collective."}
'''

Before we use each description as training data for model fine-tuning, we need to append a special word <|endoftext|> to the end of it. This special word will be especially helpful during inference time, and we’ll delve into the exact reason for this later in this article. For now, let’s append <|endoftext|> to the end of each description in our dataset.

text_corpus = [f"{txt} <|endoftext|>" for txt in dataset['train']['descr'] if txt != '']
print(text_corpus[0])

'''
Output:
Eating pizza with a stuffy nose just isn't as satisfying -- 
and there's a reason for that. Dr. Jen Gunter explains how our 
ability to smell and taste work together to give us a full 
sensory experience. So whether you're sniffing the caramelized aroma 
of coffee, a whiff of trash or a trillion other things, 
your brain knows exactly what's under your nose. For more on 
how your body works, tune into her podcast, Body Stuff with 
Dr. Jen Gunter, from the TED Audio Collective. <|endoftext|>
'''

And that’s all we need from our dataset! Let’s now continue with the tokenization process.


GPT Tokenization

Our GPT model, much like any other language model, cannot process raw text as an input. Therefore, we need to transform our text input into numerical features that can be processed by the model. This process is known as tokenization.

Image by author

There are several methods of tokenization. For instance, BERT uses the WordPiece method to tokenize input text, while the GPT model employs the BPE method.

BPE stands for Byte Pair Encoding, and this method converts our text into numerical features at the sub-word level, allowing a word to possibly have more than one token.

To use BPE, we first need to gather all the available words in our training data. Next, this tokenization method iteratively merges the most frequent character pairs from the available words until it reaches the intended vocabulary size that we set in advance. We won’t delve deeper into this tokenization process, as it’s not our primary focus in this article. However, if you wish to explore BPE in more detail, you can find additional information here.

In practice, we can implement BPE easily with the GPT tokenizer available on HuggingFace. Below is the code to load the tokenizer.

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<|pad|>') 
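
With the tokenizer loaded, we can quickly illustrate the sub-word behaviour of BPE mentioned above. This is just a small sketch; the exact splits depend on the pretrained GPT-2 vocabulary.

# A rare word is typically split into several sub-word pieces by BPE...
print(tokenizer.tokenize("autoregressively"))
# ...while a very common word usually maps to a single token
print(tokenizer.tokenize("day"))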

There are three special tokens used by any tokenization method, including the GPT tokenizer, to train a language model:

  • bos_token: special token used to indicate the beginning of a sequence
  • eos_token: special token used to indicate the end of a sequence
  • pad_token: special token used for padding

By default, the GPT tokenizer uses the special token <|endoftext|> to represent both bos_token and eos_token, while pad_token is optional and defaults to None. Whether or not we should assign pad_token to a dedicated token depends on our data preprocessing method. In this article, we’ll be using padding as part of our data preprocessing to fine-tune our GPT model, as you’ll see in a later section. For this reason, we should assign pad_token to a dedicated token, and as you can see in the code where we loaded the tokenizer above, we assigned pad_token to a dedicated token called <|pad|>.
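
As a quick optional check, we can inspect how the tokenizer resolves these special tokens and how large the vocabulary becomes after adding <|pad|>:

print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token)
print(len(tokenizer))

'''
Output:
<|endoftext|> <|endoftext|> <|pad|>
50258
'''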

Now that we have initialized the tokenizer, we can start the tokenization process of our input sequence. Let’s say we have the following sequence:

today is a beautiful day <|endoftext|><|endoftext|><|pad|><|pad|><|pad|>

we can tokenize it with the following code:

inp_text = "today is a beautiful day <|endoftext|><|endoftext|><|pad|><|pad|><|pad|>"

tokenized_inp_text = tokenizer(inp_text)['input_ids']
print(tokenized_inp_text)

'''
Output:
[40838, 318, 257, 4950, 1110, 220, 50256, 50256, 50257, 50257, 50257]
'''

As you can see, the special tokens have dedicated values: <|endoftext|> is converted into 50256, while <|pad|> is converted into 50257. This tokenized sequence will be the input of our GPT model during the fine-tuning process.

If you wish, you can also decode the tokenization result back into the original sequence with the following method:

print(' '.join(tokenizer.convert_ids_to_tokens(tokenized_inp_text)))

'''
Output:
today Ġis Ġa Ġbeautiful Ġday Ġ <|endoftext|> <|endoftext|> <|pad|> <|pad|> <|pad|>
'''

GPT Model

GPT stands for Generative Pre-trained Transformer. From its name alone, you might have already guessed that the famous Transformer architecture serves as the backbone of this model. One of the most critical components of the Transformer architecture is the Transformer layer. In essence, a Transformer layer comprises the following components (a minimal code sketch follows the figure below):

  1. Attention layer: This layer serves as a fundamental component in the Transformer architecture, allowing the model to capture the semantic meaning of input sequences.
  2. Residual and normalization layer: This layer facilitates faster convergence during model training. There are two of these layers in a Transformer layer.
  3. Feedforward layer: A regular fully-connected neural network layer placed between the first and second residual & normalization layers.

Image by author
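
To make these components more concrete, below is a minimal PyTorch sketch of a single GPT-style Transformer decoder layer. This is a simplified illustration (using a pre-normalization layout, as in GPT-2), not the exact HuggingFace implementation, which we’ll load ready-made later.

import torch
from torch import nn

class DecoderBlock(nn.Module):
    """A minimal sketch of one GPT-style Transformer decoder layer."""

    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        # feedforward layer sitting between the two residual & normalization blocks
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        # causal mask: each position may only attend to itself and preceding positions
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                # first residual connection (around attention)
        x = x + self.mlp(self.ln_2(x))  # second residual connection (around the feedforward)
        return x

Stacking several of these blocks, with token and position embeddings at the bottom and a language-modeling head on top, gives the overall GPT architecture that we’ll load from HuggingFace below.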

The GPT model consists of several stacks of this Transformer layer, and it comes in several variants and evolutions.

The very first GPT model was launched by OpenAI in 2018, while the latest version, GPT-4, emerged in 2023. Each version of GPT represents an improvement over the previous one, and below is a general overview of the evolution of the GPT model:

  • GPT-1: The very first GPT model, launched by OpenAI in 2018, had 117 million parameters and 12 stacks of Transformer layers.
  • GPT-2: Introduced by OpenAI in 2019 as an improvement over GPT-1, it came in several variants, with parameters ranging from 117 million to 1.5 billion and Transformer layers ranging from 12 to 48 stacks.
  • GPT-3: Launched in 2020 as an improvement over GPT-2, GPT-3 also came in several variants, with parameters ranging from 125 million to 175 billion and Transformer layers ranging from 12 to 96 stacks.
  • GPT-3.5: Introduced by OpenAI as an improvement over GPT-3 and optimized for conversational chat. In this version, the GPT-3 model was further trained with Reinforcement Learning from Human Feedback (RLHF) to make it more robust at generating text and responses.
  • GPT-4: Launched in 2023 as an improvement over GPT-3 and GPT-3.5, this is the latest and most advanced model in the GPT series. The detailed architecture of the model itself is still unknown, but it is estimated to have trillions of parameters.

We now understand that the GPT model consists of several stacks of Transformer layers. However, when we examine its architecture, the GPT model appears quite similar to other Transformer-based models, such as BERT. Like GPT, BERT also comprises several stacks of Transformer layers.

Now, the question arises: what sets GPT apart from other Transformer-based models like BERT? One distinctive difference is that GPT is a Transformer decoder-based model, while models designed for tasks like text classification, Named Entity Recognition (NER), or question answering are typically Transformer encoder-based.

In a Transformer encoder-based model like BERT, the attention layer within each Transformer layer operates in a bi-directional manner. This means that it can consider the entire input sequence, from the first token to the last, to determine the context of a particular token.

Image by author

The bi-directional nature of the attention layer allows the model to determine the context of each token with respect to the entire input sequence. This property makes the model helpful for text classification, such as determining the sentiment of a review.

On the other hand, in a Transformer decoder-based model like GPT, the attention layer operates in a uni-directional manner. This means that only the preceding tokens are taken into account when determining the context of a token at a particular position.

Image by author

This property is what makes the GPT model auto-regressive: it produces one token at a time based on the context of the preceding tokens.

The context size of the GPT model varies depending on the version. For example, the GPT-1 model has a context size of 512 tokens, GPT-2 has 1024, and GPT-3 has 2048. This means that the GPT-3 model, for example, uses at most the 2048 preceding tokens to predict the token at a specific position.

Now that we’ve covered the theory, let’s delve into the actual process of fine-tuning a GPT model!

Since GPT-3, GPT-3.5, and GPT-4 are not open-sourced and their sizes are too large for personal machines, we will focus on using a GPT-2 model instead. This model was open-sourced by OpenAI and is available on HuggingFace for us to use. However, the fine-tuning and text generation concepts that we’ll go through in the next sections are applicable to any variant of the GPT model.

Image by author

The GPT-2 model, as shown in the image above, comes in several variants. The GPT-2 small model has 12 stacks of Transformer decoder layers, while the extra-large one has 48.

To keep everything as simple as possible and to speed up the fine-tuning process, we’ll use the GPT-2 small model.

As mentioned above, this model consists of 12 stacks of Transformer decoder layers, and the final stack outputs a vector embedding of size 768 for each token. To map this vector embedding to an actual output (a next-token prediction), we need to add a linear layer on top of the final Transformer decoder layer. The output dimension of this linear layer should be equal to the vocabulary size of our tokenizer, which, in our case, is 50,258.

Let’s define the model architecture with the help of the HuggingFace library.

class GPT2_Model(GPT2PreTrainedModel):

    def __init__(self, config):

        super().__init__(config)

        self.transformer = GPT2Model.from_pretrained('gpt2')
        tokenizer = GPT2Tokenizer.from_pretrained('gpt2', pad_token='<|pad|>')

        # this is necessary since we add a new unique token for pad_token
        self.transformer.resize_token_embeddings(len(tokenizer))

        self.lm_head = nn.Linear(config.n_embd, len(tokenizer), bias=False)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):

        x = self.transformer(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0]
        x = self.lm_head(x)

        return x
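
As a quick optional sanity check, we can instantiate this class and verify that it outputs one logit per vocabulary entry for every input token — a small sketch using the tokenizer we loaded earlier:

check_model = GPT2_Model(GPT2Config())

sample = tokenizer("today is a beautiful day", return_tensors="pt")
logits = check_model(sample['input_ids'], attention_mask=sample['attention_mask'])

print(logits.shape)  # expected: (1, number_of_tokens, 50258)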

GPT Model Fine-Tuning

In this section, we’ll fine-tune our GPT-2 model on the TED dataset prepared in the previous section. The first step is to define the DataLoader class for loading the training data in batches. However, if we aim to load the training data in batches, there’s an important consideration: the data in each batch must have the same sequence length.

Each description text in our training data has a different length, and this is something that we need to address. One way to solve this is by applying a max_len parameter during the tokenization process, ensuring that texts with more than max_len tokens are truncated to max_len.

To set a proper value for the max_len parameter, let’s visualize the distribution of sequence lengths in our dataset.

tokenized_inp_len = [len(tokenizer(txt)['input_ids']) for txt in text_corpus]

np.random.seed(42)
x = np.asarray(tokenized_inp_len)

plt.hist(x, density=False, bins=10)  
plt.ylabel('Count')
plt.xlabel('Token Length')

Image by author

We observe that most, if not all, of our texts have fewer than 256 tokens. Thus, we can set our max_len parameter to 256. If a sequence is shorter than 256 tokens, it will be padded with the padding token we defined in the previous section until it reaches max_len.

Let’s create a Dataset class for batch training on our dataset.

class TedDataset(torch.utils.data.Dataset):

  def __init__(self, input_data, tokenizer, gpt2_type="gpt2", max_length=256):

    self.texts = [tokenizer(data, truncation=True, max_length=max_length, padding="max_length", return_tensors="pt")
                    for data in input_data]

  def __len__(self):
    return len(self.texts)

  def __getitem__(self, idx):
    return self.texts[idx]
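
With this Dataset class, a DataLoader will yield batches of fixed-length, padded sequences. Here is a quick shape check on a small subset (the batch size of 2 mirrors the one used in the training function below):

ted_dataset = TedDataset(text_corpus[:4], tokenizer)
loader = torch.utils.data.DataLoader(ted_dataset, batch_size=2, shuffle=True)

batch = next(iter(loader))
# the extra middle dimension comes from return_tensors="pt" in the tokenizer call
print(batch['input_ids'].shape)       # expected: torch.Size([2, 1, 256])
print(batch['attention_mask'].shape)  # expected: torch.Size([2, 1, 256])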

Before we fine-tune our model, there is one more step we need to take: defining the loss function.

For the loss function, we will use the standard cross-entropy loss function, as we can frame the next-token prediction task as a multi-class classification problem. This means that given an input token, we want the model to predict the most likely token available in our vocabulary as the next token to complement our input token.

To train a model for next-token prediction, we can supply our text input sequentially as the training data. As an example, let’s consider the input text ‘This is a text’, which contains four words. We can split this input text into three training examples by shifting the text sequentially, one word at a time, and using the next word as the label, as illustrated in the following image.

Image by author

The code below implements the same logic as described above at the token level.

class CrossEntropyLossFunction(nn.Module):

    def __init__(self):

        super(CrossEntropyLossFunction, self).__init__()
        self.loss_fct = nn.CrossEntropyLoss()

    def forward(self, lm_logits, labels):

        shift_logits = lm_logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        loss = self.loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

        return loss
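
To see the shifting in action, here is a small sketch with dummy tensors: the logits at positions 1 to N-1 are compared against the token ids at positions 2 to N.

criterion = CrossEntropyLossFunction()

dummy_labels = torch.randint(0, 50258, (2, 10))   # a batch of 2 sequences with 10 token ids each
dummy_logits = torch.randn(2, 10, 50258)          # one score per vocabulary entry for every position

loss = criterion(dummy_logits, dummy_labels)
print(loss)  # a scalar; with random logits, roughly ln(50258) ≈ 10.8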

Now we can fine-tune our model.

The training script for fine-tuning our GPT-2 model is a standard PyTorch script that you’ve likely encountered in many tutorials. However, it takes several hours to train our GPT-2 model, even with GPU acceleration, before it generates meaningful text at inference time.

Therefore, it’s a good idea to add a few safeguards to the training process, for example saving the model every 4 or 5 epochs, so that you won’t have to train the model from the very beginning in case something goes wrong during the fine-tuning process.
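
As a minimal sketch (the interval and file names below are assumptions, not part of the original script), a checkpoint block like this could be added at the end of the epoch loop in the training function shown below:

# Inside the epoch loop, after computing avg_train_loss:
checkpoint_every = 5  # assumed interval
if (epoch_i + 1) % checkpoint_every == 0:
    torch.save(model.state_dict(), f"gpt2_ted_epoch_{epoch_i + 1}.pt")

# To resume training or run inference later:
# gpt_model.load_state_dict(torch.load("gpt2_ted_epoch_35.pt", map_location=device))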

If you prefer, you can also just download the model weights here and load them directly for inference.

def train(model, tokenizer, train_data, epochs, learning_rate, epsilon=1e-8):

    train_dataset = TedDataset(train_data, tokenizer)
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True)

    optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = epsilon
                )
    criterion = CrossEntropyLossFunction().to(device)
    model = model.to(device)

    for epoch_i in range(0, epochs):

        total_train_loss = 0
        for train_input in tqdm(train_dataloader):

            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].to(device)

            outputs = model(input_id,
                            attention_mask = mask,
                            token_type_ids=None
                            )

            loss = criterion(outputs, input_id)

            batch_loss = loss.item()
            total_train_loss += batch_loss

            loss.backward()
            optimizer.step()
            model.zero_grad()

        avg_train_loss = total_train_loss / len(train_dataloader)

        print(f"Epoch: {epoch_i}, Avg train loss: {np.round(avg_train_loss,2)}")

epochs = 35
learning_rate = 1e-5
configuration = GPT2Config()
gpt_model = GPT2_Model(configuration).to(device)

train(gpt_model, tokenizer, text_corpus, epochs, learning_rate)

GPT Model Text Generation

Now that we have fine-tuned our model, we can use it to generate text that sounds exactly like TED descriptions, as shown in our training data. There are several steps that we need to understand to use the GPT model for text generation, and we’ll go through these steps in this section.

In a nutshell, the output of our GPT model is a vector with a size equal to the vocabulary size that we set in advance. In our case, the BPE tokenization for GPT-2 in Hugging Face has been trained with a vocabulary size of 50,257, and we added one special token for padding, bringing the final vocabulary size to 50,258.

With the help of the Softmax function, the output of our GPT model represents the probability of each token in our vocabulary as the next token given the preceding tokens.

Image by author

At the end, we’ll pick the token with the highest probability as the prediction.
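
In the simplest (greedy) case, a single decoding step would just take the argmax over this distribution. Below is a quick sketch with dummy logits standing in for the model output; the actual sampling loop we build in this section is less deterministic.

# Dummy logits standing in for the model output: (batch, sequence_length, vocab_size)
logits = torch.randn(1, 5, 50258)

probs = F.softmax(logits[:, -1, :], dim=-1)                # probabilities for the last position only
next_token_id = torch.argmax(probs, dim=-1, keepdim=True)  # greedy choice of the next token
print(next_token_id.shape)  # torch.Size([1, 1])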

Now let’s get deeper into the implementation of text generation using our GPT model.

By default, the GPT model won’t stop generating new tokens. Thus, we need to specify the maximum number of tokens that we want to generate.

max_new_tokens = 100
for _ in range(max_new_tokens):

Aside from specifying the maximum number of tokens generated by the model, we also need to add a constraint so that the model stops generating new tokens as soon as the <|endoftext|> token is picked as the next-token prediction. This is why we appended <|endoftext|> to the end of each training example before the fine-tuning process: this special token acts as a signal that our GPT model should stop generating new tokens.

max_new_tokens = 100
for _ in range(max_new_tokens):
  if idx[:,-1].item() != tokenizer.encode(tokenizer.eos_token)[0]:
      pass
  else:
      break

The GPT model can take thousands of preceding tokens as context, but since our texts have a maximum sequence length of 256 tokens, we can set the context size to 256 as well. This means that our GPT model will only take up to 256 preceding tokens as context to predict the next token.

max_new_tokens = 100
context_size = 256

for _ in range(max_new_tokens):
  if idx[:,-1].item() != tokenizer.encode(tokenizer.eos_token)[0]:
      # crop idx to the last block_size tokens
      idx_cond = idx[:, -context_size:]
  else:
      break

Now we fetch the model prediction. By applying the Softmax function, we get the probability of each token in our vocabulary being the next token.

max_new_tokens = 100
context_size = 256

for _ in range(max_new_tokens):
  if idx[:,-1].item() != tokenizer.encode(tokenizer.eos_token)[0]:
      # crop idx to the last block_size tokens
      idx_cond = idx[:, -context_size:]
      # get the predictions
      logits = model(idx_cond)
      # focus only on the last time step
      logits = logits[:, -1, :]
      # apply softmax to get probabilities
      probs = F.softmax(logits, dim=-1)
  else:
      break

We could, of course, pick the token with the highest probability as the output. However, this would lead to text generation that is too deterministic. We don’t want the model to always predict the word ‘car’ every time the context is ‘I have a new’. We need the model to be able to generate varied, yet realistic words.

There are several methods that can be used to avoid the deterministic nature of text generation in practice. First, we use a method called top_p. With this method, the model picks the smallest possible set of tokens whose cumulative probability reaches a threshold p that we set in advance. As an example, if we set top_p to 0.95, our model will pick the smallest possible set of tokens whose cumulative probability makes up 95% of the probability mass.

Image by author

To implement top_p, we sort the probabilities of our GPT output in descending order. Then, we compute the cumulative probabilities and keep only the tokens whose cumulative probability falls within the 95% probability mass.

max_new_tokens = 100
context_size = 256
top_p = 0.95

for _ in range(max_new_tokens):
  if idx[:,-1].item() != tokenizer.encode(tokenizer.eos_token)[0]:
      # crop idx to the last block_size tokens
      idx_cond = idx[:, -context_size:]
      # get the predictions
      logits = model(idx_cond)
      # focus only on the last time step
      logits = logits[:, -1, :]
      # apply softmax to get probabilities
      probs = F.softmax(logits, dim=-1)
      # sort probabilities in descending order
      sorted_probs, indices = torch.sort(probs, descending=True)
      # compute cumsum of probabilities
      probs_cumsum = torch.cumsum(sorted_probs, dim=1)
      # choose only top_p tokens
      sorted_probs, indices = sorted_probs[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] + 1], indices[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] +1]
  else:
      break

We can combine the top_p method with another method called top_k, in which the model only considers the k tokens with the highest probability. We need to set the k parameter in advance.

Image by author

max_new_tokens = 100
context_size = 256
top_p = 0.95
top_k = 10

for _ in range(max_new_tokens):
  if idx[:,-1].item() != tokenizer.encode(tokenizer.eos_token)[0]:
      # crop idx to the last block_size tokens
      idx_cond = idx[:, -context_size:]
      # get the predictions
      logits = model(idx_cond)
      # focus only on the last time step
      logits = logits[:, -1, :]
      # apply softmax to get probabilities
      probs = F.softmax(logits, dim=-1)
      # sort probabilities in descending order
      sorted_probs, indices = torch.sort(probs, descending=True)
      # compute cumsum of probabilities
      probs_cumsum = torch.cumsum(sorted_probs, dim=1)
      # choose only top_p tokens
      sorted_probs, indices = sorted_probs[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] + 1], indices[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] +1]
      # choose only top_k tokens
      sorted_probs, indices = sorted_probs[:,:top_k], indices[:,:top_k]
  else:
      break

With the application of top_k and top_p, instead of having to consider all of the tokens in our vocabulary, we end up with only a small number of promising tokens to consider as the next token. The next step involves recalculating the softmax over this refined set of promising tokens and then randomly sampling one of them as the next token.

max_new_tokens = 100
context_size = 256
top_p = 0.95
top_k = 10

for _ in range(max_new_tokens):
  if idx[:,-1].item() != tokenizer.encode(tokenizer.eos_token)[0]:
      # crop idx to the last block_size tokens
      idx_cond = idx[:, -context_size:]
      # get the predictions
      logits = model(idx_cond)
      # focus only on the last time step
      logits = logits[:, -1, :]
      # apply softmax to get probabilities
      probs = F.softmax(logits, dim=-1)
      # sort probabilities in descending order
      sorted_probs, indices = torch.sort(probs, descending=True)
      # compute cumsum of probabilities
      probs_cumsum = torch.cumsum(sorted_probs, dim=1)
      # choose only top_p tokens
      sorted_probs, indices = sorted_probs[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] + 1], indices[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] +1]
      # choose only top_k tokens
      sorted_probs, indices = sorted_probs[:,:top_k], indices[:,:top_k]
      # sample from the distribution
      sorted_probs = F.softmax(sorted_probs, dim=-1)
      idx_next = indices[:, torch.multinomial(sorted_probs, num_samples=1)].squeeze(0)
  else:
      break

Now that we have our next token prediction, we need to append it to the current context to form an updated context for the prediction of the following token.

max_new_tokens = 100
context_size = 256
top_p = 0.95
top_k = 10

for _ in range(max_new_tokens):
  if idx[:,-1].item() != tokenizer.encode(tokenizer.eos_token)[0]:
      # crop idx to the last block_size tokens
      idx_cond = idx[:, -context_size:]
      # get the predictions
      logits = model(idx_cond)
      # focus only on the last time step
      logits = logits[:, -1, :]
      # apply softmax to get probabilities
      probs = F.softmax(logits, dim=-1)
      # sort probabilities in descending order
      sorted_probs, indices = torch.sort(probs, descending=True)
      # compute cumsum of probabilities
      probs_cumsum = torch.cumsum(sorted_probs, dim=1)
      # choose only top_p tokens
      sorted_probs, indices = sorted_probs[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] + 1], indices[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] +1]
      # choose only top_k tokens
      sorted_probs, indices = sorted_probs[:,:top_k], indices[:,:top_k]
      # sample from the distribution
      sorted_probs = F.softmax(sorted_probs, dim=-1)
      idx_next = indices[:, torch.multinomial(sorted_probs, num_samples=1)].squeeze(0)
      # append new token ids
      idx = torch.cat((idx, idx_next), dim=1)
  else:
      break

We can wrap all of the steps that we have done above into a function.

def generate(idx, max_new_tokens, context_size, tokenizer, model, top_k=10, top_p=0.95):

        for _ in range(max_new_tokens):
            if idx[:,-1].item() != tokenizer.encode(tokenizer.eos_token)[0]:
                # crop idx to the last block_size tokens
                idx_cond = idx[:, -context_size:]
                # get the predictions
                logits = model(idx_cond)
                # focus only on the last time step
                logits = logits[:, -1, :]
                # apply softmax to get probabilities
                probs = F.softmax(logits, dim=-1)
                # sort probabilities in descending order
                sorted_probs, indices = torch.sort(probs, descending=True)
                # compute cumsum of probabilities
                probs_cumsum = torch.cumsum(sorted_probs, dim=1)
                # choose only top_p tokens
                sorted_probs, indices = sorted_probs[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] + 1], indices[:, :probs_cumsum[[probs_cumsum < top_p]].size()[0] +1]
                # choose only top_k tokens
                sorted_probs, indices = sorted_probs[:,:top_k], indices[:,:top_k]
                # sample from the distribution
                sorted_probs = F.softmax(sorted_probs, dim=-1)
                idx_next = indices[:, torch.multinomial(sorted_probs, num_samples=1)].squeeze(0)
                # append new token ids
                idx = torch.cat((idx, idx_next), dim=1)
            else:
                break

        return idx

Let’s generate some TED descriptions. Let’s say we give our model the prompt ‘How will AI impact’ as the context. We can generate the predicted text with the following code.

gpt_model.eval()

prompt = "How will AI impact"
generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
generated = generated.to(device)

sample_outputs = generate(generated,
                         max_new_tokens=100,
                         context_size=256,
                         tokenizer=tokenizer,
                         model=gpt_model,
                         top_k=10,
                         top_p=0.95)

for i, sample_output in enumerate(sample_outputs):
    print(f"{tokenizer.decode(sample_output, skip_special_tokens=True)}")

Cherry-picking a little, below are a couple of examples of text generated by the GPT-2 model after it has been trained for 35 epochs:

How will AI impact the world? In this thought-provoking talk, futurist Juan Enriquez offers a vision of intelligent life that blends science and design – and makes the case for globalization and the development of AI to solve problems in a way that is both fair, transparent and sustainable.


How will AI impact our lives? In a talk packed with dystopian implications, AI expert Stephen Hawking offers some ambitious AI research, some realistic threats and realistic tangible ways we can help make future AI smarter and better than we are.


Conclusion

In this article, we have learned how to fine-tune a GPT model and how to use it to generate text based on a context provided in advance. I hope this article helps you get started with GPT!

Given the GPT model’s ability to generate text, we can also use it for more advanced applications, such as chatbots, where the model should generate an answer based on a provided question.

As usual, you can find the notebook containing all the code demonstrated in this article at _this link_.

