
The rise in popularity of generative AI has also led to an increase in the number of large language models. In this story, I will compare two of them: GPT and BERT. GPT (Generative Pre-trained Transformer) is developed by OpenAI and is based on a decoder-only architecture. BERT (Bidirectional Encoder Representations from Transformers), on the other hand, is developed by Google and is an encoder-only pre-trained model.
Both are technically different, but they share a similar objective: to perform natural language processing tasks. Many articles compare the two from a technical point of view. In this story, however, I will compare them based on the quality of their common objective, which is natural language processing.
Comparison approach
How do you compare two completely different technical architectures? GPT is a decoder-only architecture and BERT is an encoder-only architecture. A technical comparison of decoder-only vs. encoder-only is like comparing a Ferrari to a Lamborghini: both are great, but with completely different technology under the chassis.
However, we can make a comparison based on the quality of a common natural language task that both can perform: the generation of embeddings. Embeddings are vector representations of text, and they form the basis of any natural language processing task in a transformer architecture. So if we can compare the quality of the embeddings, we can use it to judge the quality of the downstream natural language tasks.
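To make this concrete: texts with similar meaning should map to vectors that point in similar directions, which is what lets them fall into the same cluster later on. Below is a minimal sketch of that idea using made-up placeholder vectors and cosine similarity; real GPT or BERT embeddings have hundreds or thousands of dimensions, but the principle is the same.
import numpy as np

##Placeholder 4-dimensional "embeddings" (illustrative values only)
emb_dog_review = np.array([0.8, 0.1, 0.3, 0.2])    # "My dog loves this food"
emb_dawg_review = np.array([0.7, 0.2, 0.3, 0.1])   # "My dawg can't get enough of it"
emb_coffee_review = np.array([0.1, 0.9, 0.1, 0.7]) # "This coffee tastes burnt"

def cosine_similarity(a, b):
    ##Cosine similarity: close to 1.0 means similar direction, lower means less related
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(emb_dog_review, emb_dawg_review))   # high: similar reviews
print(cosine_similarity(emb_dog_review, emb_coffee_review)) # lower: unrelated reviews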
Shown below is the comparison approach I will take.

Let us start with GPT
I tossed a coin, and GPT won the toss! So let us start with GPT first. I will take text from Amazon's fine food reviews dataset. Reviews are a good way to test both models, as they are expressed in natural language and are very spontaneous. They capture the feelings of customers and can contain all types of language: the good, the bad, and the ugly! In addition, they can have many misspelled words, emojis, as well as commonly used slang.
Here is an example of the review text.

In order to get the embeddings of the text using GPT, we need to make an API call to OpenAI. The result is an embedding, or vector, of size 1536 for each text. Here is a sample of the data, which includes the embeddings.

The next step is clustering and visualization. One can use KMeans to cluster the embedding vectors and t-SNE to reduce the 1536 dimensions to 2. Shown below are the results after clustering and dimensionality reduction.

One can observe that the clusters are very well formed. Hovering over some of the clusters helps in understanding their meaning. For example, the red cluster is related to dog food. Further analysis also shows that the GPT embeddings have correctly identified that the words 'Dog' and 'Dawg' are similar and placed the corresponding reviews in the same cluster.
Overall, the GPT embeddings give good results, as indicated by the quality of the clustering.
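For reference, here is a minimal, self-contained sketch of this clustering and visualization step. It uses random vectors in place of the real GPT embeddings and assumes scikit-learn and matplotlib; the full embedding pipeline is given in the Technical Implementation section at the end of the story.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

##Stand-in for the real embedding matrix (1000 reviews x 1536 dimensions)
matrix = np.random.rand(1000, 1536)

##Cluster the embeddings with KMeans
kmeans_clusters = KMeans(n_clusters=10, random_state=0).fit_predict(matrix)

##Reduce the embeddings to 2 dimensions with t-SNE for plotting
tsne_out = TSNE(n_components=2, random_state=42, init='random').fit_transform(matrix)

##Scatter plot colored by cluster
plt.scatter(tsne_out[:, 0], tsne_out[:, 1], c=kmeans_clusters, cmap='tab10', s=5)
plt.title('Review embeddings: KMeans clusters after t-SNE')
plt.show()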
It’s now BERT’s turn
Can BERT perform better? Let us find out. There are multiple versions of the BERT model, such as bert-base-uncased and bert-large-uncased, which differ in model size and embedding vector size. Here is the result based on BERT base, which has an embedding size of 768.

The green cluster corresponds to dog food. However, one can observe that the clusters are widely spread and not very compact compared to GPT. The main reason is that the embedding vector length of 768 is inferior to GPT's embedding vector length of 1536.
Fortunately, BERT also comes in a larger variant with an embedding size of 1024. Here are the results.

Here the orange cluster corresponds to dog food. The cluster is relatively compact, which is a better result compared to the 768-size embeddings. However, there are some points that are far from the center, and these points are incorrectly classified. For example, there is a review about coffee that has been incorrectly classified as dog food because it contains the word 'Dog'.
Conclusion
Clearly, GPT does a better job and provides higher-quality embeddings compared to BERT. However, I would not give all the credit to GPT, as there are other aspects to the comparison. Here is a summary table.

GPT wins over BERT on embedding quality, thanks to its higher embedding size. However, GPT requires a paid API, while BERT is free. In addition, the BERT model is open source and not a black box, so you can analyze it further to understand it better. The GPT models from OpenAI are black boxes.
In conclusion, I would recommend using BERT for moderately complex text, such as web pages or books, which contain curated text. GPT can be used for very complex text, such as customer reviews, which is written entirely in natural language and not curated.
Technical Implementation
Here is a Python code snippet that implements the process described in the story. For illustration, I have given the GPT example; a sketch of the BERT version follows the GPT code.
##Import packages
import openai
import numpy as np
import pandas as pd
import tiktoken
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
##Read data
file_name = 'path_to_file'
df = pd.read_csv(file_name)
##Set parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
max_tokens = 8000  # the maximum for text-embedding-ada-002 is 8191
top_n = 1000
encoding = tiktoken.get_encoding(embedding_encoding)
col_embedding = 'embedding'
n_clusters = 10  # number of KMeans clusters (set as needed)
n_tsne = 2
n_iter = 1000
##Get the embedding from OpenAI
openai.api_key = "YOUR_OPENAI_API_KEY"

def get_embedding(text, model):
    ##Replace newlines and request the embedding vector for the text
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

col_txt = 'Review'
df["n_tokens"] = df[col_txt].apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
df = df[df.n_tokens > 0].reset_index(drop=True)  ##Remove rows with no tokens, for example blank lines
df[col_embedding] = df[col_txt].apply(lambda x: get_embedding(x, model=embedding_model))
matrix = np.array(df[col_embedding].to_list())
##Cluster the embeddings with KMeans
kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
kmeans = kmeans_model.fit(matrix)
kmeans_clusters = kmeans.predict(matrix)
##Reduce the embeddings to 2 dimensions with TSNE
tsne_model = TSNE(n_components=n_tsne, verbose=0, random_state=42, n_iter=n_iter, init='random')
tsne_out = tsne_model.fit_transform(matrix)
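The BERT embeddings can be computed locally rather than through an API. Here is a minimal sketch of one common way to do it, assuming the Hugging Face transformers library and mean pooling of the last hidden state; bert-base-uncased gives 768-dimensional vectors, and bert-large-uncased gives 1024. It reuses the df and col_txt variables from the GPT snippet above, and the same KMeans and TSNE steps can then be applied to these vectors.
##Sketch: BERT embeddings via Hugging Face transformers (assumed approach)
import torch
from transformers import AutoTokenizer, AutoModel

bert_model_name = "bert-base-uncased"  # use "bert-large-uncased" for 1024-dim embeddings
tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
bert_model = AutoModel.from_pretrained(bert_model_name)

def get_bert_embedding(text):
    ##Tokenize and run the model without gradient tracking
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    ##Mean-pool the last hidden state to get one vector per text
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

df['bert_embedding'] = df[col_txt].apply(get_bert_embedding)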
Dataset citation
The dataset is available here under the CC0 (Public Domain) license. Both commercial and non-commercial use are permitted.
Additional Resources
Website
You can visit my website to do analytics with zero coding. https://experiencedatascience.com
YouTube channel
Please visit my YouTube channel to learn about data science and AI use cases through demos.