
ChatGPT Generated Food Industry Reviews: Realism Assessment

Investigating how review and survey collection by food industry companies can be supported by ChatGPT-generated data.

Where It Started

The bulk of my past research used Generative Adversarial Networks (GANs) to create deepfake images for my datasets. I did this to increase the diversity of information within a dataset, which I predicted would result in better object detection models (see more about this research here!). While a completely different task from deepfake image creation, it made me wonder: is there a way to increase the size of the datasets of reviews I had for different companies in the food industry?

Could I train a GAN? Yes, but GANs are not great at generating tabular data, and to me, text reviews are closer to the kind of data found in spreadsheets. Then came ChatGPT. Voila! Could I create new reviews for my dataset simply by asking ChatGPT to generate them with different prompts?

Why It Matters

There are a few reasons we would want to increase the size of a dataset.

Lacking enough data to train a model.

Biased dataset (hence we need to rebalance it with data from the underrepresented class).

Lack of diversity within the dataset.

The datasets I was creating (with the approval of the company featured in today's example) lacked negative reviews, review diversity, and size, warranting dataset augmentation.

Lacking enough data to train a model. If you attempt to build a model without enough data, a multitude of problems can occur. One is that the model overfits the training data and performs poorly on real-world examples.

Biased dataset. If a dataset is dominated by one class, it lacks representation for the other classes, which leads to models and analyses unfit for those classes. We want a balanced dataset to ensure our model performs well across all classes of data we are interested in investigating.
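As a quick illustration, here is a minimal sketch of how you might check a labeled review dataset for imbalance (the sentiment_df DataFrame and its label column are hypothetical placeholders, not the dataset used later in this article):

import pandas as pd

#Hypothetical labeled reviews; 'label' is a placeholder column name
sentiment_df = pd.DataFrame({
    'Review': ['Great pizza!', 'Loved the cannoli.', 'Cold, bland pasta.'],
    'label': ['positive', 'positive', 'negative'],
})

#Class proportions reveal imbalance at a glance
print(sentiment_df['label'].value_counts(normalize=True))
#positive    0.67
#negative    0.33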

Lack of diversity within the dataset. The real world is messy. If our dataset lacks diversity, our model will not generalize well to changes in the details of samples once it is put into production. Generalizability is the model's ability to recognize that a sample belongs to a certain class, even when that sample contains features that are distinct from, or underrepresented in, the examples of that class it was trained on.


The Analysis

The following libraries were used for today's analysis.

from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader as api
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import spacy
import spacy.cli
from random import sample
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

#To download the spaCy model (only needed once)
spacy.cli.download("en_core_web_lg")

The Dataset

The original dataset was compiled with the permission of the company (Altomontes Inc) to use its reviews (see these two articles where I show how to apply Natural Language Processing to the reviews).

Machine Learning is Not Just for Big Tech

Topic Modeling Analysis for Small Businesses

Example of an original review:

"Highly recommend!!! I never knew about Altomontes until a friend dropped off a meal at my house recently. My husband and I decided to go and check it out and while we were there, we met the owner and she was the sweetest person ever! She basically gave us a tour. We bought the chicken marsala for dinner and it was wonderful! We also bought the Brooklyn pizza for lunch and it was delicious! We sampled their coffee and they let us taste a cannoli, biscotti and a cookie. Very good! We bought shells, marinara sauce, a bottle of wine, some cheese etc. Everything looked good, taste good and we could spend hours there! We will be back!! Maybe today?"

I created the next two datasets using ChatGPT. One contained positive reviews about an Italian Market and the goods it sells, while the other contained negative reviews about an Italian Market and the goods it sells.

Example of a ChatGPT-generated positive review:

"The Pecorino Toscano cheese I bought had a robust and savory taste. Its firm and crumbly texture, with a hint of grassiness, made it a great choice for grating, shaving, or enjoying on its own."

Example of a ChatGPT-generated negative review:

"The prosciutto and arugula pizza I tried had wilted arugula and the prosciutto was tough. It wasn’t appetizing."

Once all of the reviews were created and placed into CSV files (find them here), I formatted them into two structures: a dictionary where each key was a source (original, generated positive, or generated negative) and its values were that source's reviews, and a list of tuples where each tuple contained a review and its source.

#Dictionary: Keys=Source, Values=Reviews
#Lists: List of reviews for each dataset

reviews = []
reviews_dict = {}

reviews_dict['original review'] = []
reviews_dict['fake positive review'] = []
reviews_dict['fake negative review'] = []

#Original reviews
orig_reviews = pd.read_csv('/content/drive/MyDrive/reviews/Altomontes_reviews.csv')
for rev in orig_reviews.Review:
  reviews_dict['original review'].append(rev)
  reviews.append((rev,'original review'))

#positive reviews
pos_reviews = pd.read_csv('/content/drive/MyDrive/reviews/generated_positive_reviews - Sheet1.csv')
for rev in pos_reviews.Review:
  reviews.append((rev,'fake positive review'))
  reviews_dict['fake positive review'].append(rev)

#Negative reviews
neg_reviews = pd.read_csv('/content/drive/MyDrive/reviews/generated_negative_reviews - Sheet1.csv')
for rev in neg_reviews.Review:
  reviews.append((rev,'fake negative review'))
  reviews_dict['fake negative review'].append(rev)

Sentence Assessments

Realism Assessment

First, I simply wanted to assess whether the reviews seemed "realistic." This is similar to calculating the coherence of a given body of text and its connections. My initial thought was that they would all be deemed realistic, but I thought it would be interesting to also visualize the scores from the original reviews, artificial positive reviews, and artificial negative reviews.

To start this assessment, we are first going to create an assess_sentence_realism function. This function looks at the cohesiveness of a sentence and whether the input sentence is "realistic" to how a human would interpret it.

def assess_sentence_realism(sentence, model):
    """
    A function that accepts a sentence and embeddings model as inputs, and outputs a
    'realism' score based on the cohesion and similarity between words of the
    sentence.

    Inputs:
    sentence (str): A string of words.
    model (.model): An embedding model (user's choice)

    Returns:
    avg_similarity: An average similarity score between the words of the sentence.
    """
    tokens = sentence.split()

    # Calculate the average similarity between adjacent word pairs
    similarities = []
    for i in range(len(tokens) - 1):
        word1 = tokens[i]
        word2 = tokens[i + 1]
        if word1 in model.key_to_index and word2 in model.key_to_index:
            similarity = model.cosine_similarities(
                model.get_vector(word1),
                [model.get_vector(word2)]
            )[0]
            similarities.append(similarity)

    # Calculate the average similarity score
    if similarities:
        avg_similarity = sum(similarities) / len(similarities)
    else:
        avg_similarity = 0.0

    return avg_similarity

You will need to download an embedding model to create your word embeddings. The model I chose to use was the Google News 300 model (check it out here).

# Download the pre-trained Word2Vec model
#model_name = 'word2vec-google-news-300'  # Example model name
#model = api.load(model_name)
#model.save('/content/drive/MyDrive/models/word2vec-google-news-300.model') 

pretrained_model_path = '/content/drive/MyDrive/models/word2vec-google-news-300.model'
model = KeyedVectors.load(pretrained_model_path)

scores = {}
sources = []

# Evaluate the realism score for each sentence and store in scores dictionary
for sentence, source in reviews:
    realism_score = assess_sentence_realism(sentence, model)
    if source in scores:
        scores[source].append(realism_score)
    else:
        scores[source] = [realism_score]
    sources.append(source)
    #print(f"Realism Score for {source}: {realism_score}")

# Calculate the mean score for each source
mean_scores = {source: np.mean(score_list) for source, score_list in scores.items()}

# Plot the mean scores in a bar chart, one color per source
colors = {'original review': 'green', 'fake positive review': 'blue', 'fake negative review': 'red'}
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
plt.figure(figsize=(8, 6))
plt.bar(mean_scores.keys(), mean_scores.values(), color=[colors[s] for s in mean_scores.keys()])
plt.xlabel("Source")
plt.ylabel("Mean Realism Score")
plt.title("Mean Realism Scores by Source")
plt.show()

Original Dataset Score: 0.17

ChatGPT-generated Positive Dataset Score: 0.15

ChatGPT-generated Negative Dataset Score: 0.16

As expected (well, maybe), the original reviews were scored the most realistic. Why could this be? For starters, they were the real reviews. I also suspect one big reason ChatGPT scored lower is that, over time, the model fell into a pattern with the reviews it was creating. Some of the reviews were very close to identical, with ChatGPT simply swapping a few words around.

(i.e., The pizza was not good and was very dry → The pasta was not good and was very dry)

There was also a lack of diversity in the ChatGPT reviews compared to the real reviews, which we can expect (think about it: many different people wrote the real reviews, while one model created the fake ones). With that being said, the reviews created by ChatGPT still scored relatively well compared to the original reviews, and I would say it is worth going ahead and training our models with them in the future.
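One way to quantify this suspicion (a rough sketch that was not part of the original analysis; the helper function and the 0.9 threshold are illustrative choices) is to flag pairs of generated reviews whose TF-IDF cosine similarity is very high:

#Illustrative helper: flag near-duplicate reviews via TF-IDF cosine similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_near_duplicates(reviews, threshold=0.9):
    """Return index pairs of reviews whose cosine similarity exceeds the threshold."""
    tfidf = TfidfVectorizer().fit_transform(reviews)
    sim = cosine_similarity(tfidf)
    return [(i, j) for i in range(len(reviews))
            for j in range(i + 1, len(reviews)) if sim[i, j] > threshold]

#e.g., find_near_duplicates(reviews_dict['fake positive review'])

A large number of flagged pairs among the generated reviews would support the pattern-following explanation above.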

Before doing so, let’s also do a similarity assessment between the reviews.


Similarity Assessment

Next, I wanted to look at the similarities between each batch of generated reviews and the original reviews. To do this, we can use cosine similarity to calculate how similar the sentence vectors from each source are. First, we can create a cosine similarity function that transforms our sentences into vectors using TfidfVectorizer() and then calculates the cosine similarity between the two resulting sentence vectors.

def calculate_cosine_similarity(sentence1, sentence2):
  """
  A function that accepts two sentences as input and outputs their cosine
  similarity.

  Inputs:
  sentence1 (str): A string of words
  sentence2 (str): A string of words 

  Returns:
  cosine_sim: Cosine similarity score for the two input sentences
  """
  # Initialize the TfidfVectorizer
  vectorizer = TfidfVectorizer()

  # Create the TF-IDF matrix for the two sentences
  tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])

  # Calculate the cosine similarity (sklearn's pairwise function imported above;
  # the function is named to avoid shadowing sklearn's cosine_similarity)
  cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

  return cosine_sim[0][0]
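As a quick sanity check, here is how the function might be called (the sentences are toy examples, not drawn from the actual datasets):

#Toy usage example (illustrative sentences)
s1 = "The cannoli was fresh and the espresso was excellent."
s2 = "The cannoli was stale and the espresso was bitter."
print(calculate_cosine_similarity(s1, s2))  #value in [0, 1]; higher means more shared vocabulary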

One problem I had was that the datasets were now so big that the calculations were taking too long (and sometimes I did not have enough RAM on Google Colab to continue). To combat this issue, I randomly sampled 200 reviews from each of the datasets when calculating the similarity.

#Random Sample 200 Reviews
o_review = sample(reviews_dict['original review'],200)
p_review = sample(reviews_dict['fake positive review'],200)
n_review = sample(reviews_dict['fake negative review'],200)

r_dict = {'original review': o_review,
          'fake positive review': p_review,
          'fake negative review':n_review}

Now that we have the randomly selected samples, we can look at cosine similarities between different combinations of the datasets.

#Cosine Similarity Calculation
source = ['original review','fake negative review','fake positive review']
source_to_compare = ['original review','fake negative review','fake positive review']
avg_cos_sim_per_word = {}
for s in source:
  for s2 in source_to_compare:
    if s != s2:
      count = []  #Reset per pair so one pair's scores don't bleed into the next
      for sent in r_dict[s]:
          for sent2 in r_dict[s2]:
            similarity = calculate_cosine_similarity(sent, sent2)
            count.append(similarity)
      avg_cos_sim_per_word['{0} to {1}'.format(s,s2)] = np.mean(count)

results = pd.DataFrame(avg_cos_sim_per_word,index=[0]).T

When compared against the original dataset, the fake negative reviews were more similar than the fake positive ones. My hypothesis is that this is because I used more prompts to create the negative reviews than the positive ones. Unsurprisingly, the ChatGPT-generated reviews showed the highest similarity to one another.
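To make these pairwise averages easier to read than a transposed DataFrame, one option is to reshape them into a source-by-source matrix and plot a heatmap (a minimal sketch using the avg_cos_sim_per_word dictionary built above; the diagonal stays empty because same-source pairs were skipped):

#Reshape the pairwise averages into a source-by-source matrix
labels = ['original review', 'fake negative review', 'fake positive review']
matrix = pd.DataFrame(index=labels, columns=labels, dtype=float)
for pair, value in avg_cos_sim_per_word.items():
    s, s2 = pair.split(' to ')
    matrix.loc[s, s2] = value

#Annotated heatmap of average cosine similarities between sources
sns.heatmap(matrix, annot=True, cmap='viridis')
plt.title('Average Cosine Similarity Between Sources')
plt.show()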

Great, we have the cosine similarities, but is there another step we can take to assess the similarities of the reviews? There is! Let's visualize the sentences as vectors. To do this, we must embed the sentences (turn them into vectors of numbers), and then we can visualize them in 2D space. I used spaCy to embed the sentences and Matplotlib to visualize them.

# Load spaCy's large English model (includes pre-trained word vectors)
nlp = spacy.load('en_core_web_lg')

source_embeddings = {}

for source, source_sentences in reviews_dict.items():
    source_embeddings[source] = []
    for sentence in source_sentences:
        # Tokenize the sentence using spaCy
        doc = nlp(sentence)

        # Retrieve word embeddings
        word_embeddings = np.array([token.vector for token in doc])

        # Save word embeddings for the source
        source_embeddings[source].append(word_embeddings)

def legend_without_duplicate_labels(figure):
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    figure.legend(by_label.values(), by_label.keys(), loc='lower right')

# Plot embeddings with colors based on source

fig, ax = plt.subplots()
colors = ['g', 'b', 'r']  # Colors for each source
i=0
for source, embeddings in source_embeddings.items():
    for embedding in embeddings:
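        # Note: only the first two of the 300 embedding dimensions are plotted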
        ax.scatter(embedding[:, 0], embedding[:, 1], c=colors[i], label=source)
    i+=1
legend_without_duplicate_labels(plt)
plt.show()

The good news is we can clearly see the embeddings and distributions of the sentence vectors closely align. Visual inspection shows there is more variability in the distribution of the original reviews, supporting the assertion that they are more diverse. Since ChatGPT generated both the positive and negative reviews, we would expect their distributions to be the same. Notice, however, that the fake negative reviews actually have a wider distribution and more variance than the positive ones. Why might this be? It is probably due in part to the fact that I had to trick ChatGPT into creating the fake negative reviews (ChatGPT is designed to make positive statements) and I had to provide more prompts to get enough negative reviews compared to positive ones. This additional diversity helps the dataset, because more diverse data lets us train higher-performing machine learning models.

Next, we can inspect the three distributions of reviews and see if there are any distinguishing patterns.

What do we see? Visually, the bulk of the reviews for each dataset are centered around the origin and span from -10 to 10. This is a positive sign and supports the use of fake reviews for training prediction models. The variances are roughly the same; however, the original reviews had a wider spread in their distribution, both laterally and longitudinally, a proxy for more diversity in the lexicon within those reviews. The reviews from ChatGPT definitely had similar distributions, but the positive reviews had more outliers. As stated, these distinctions could be a result of the way I prompted the system to generate reviews.
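One caveat with the plots above: they chart only the first two raw dimensions of the 300-dimensional vectors. As an alternative view (a sketch that assumes the source_embeddings dictionary built above; the PCA projection is my substitution here, not part of the original analysis), you could average each review's word vectors into a single sentence vector and project those to 2D:

from sklearn.decomposition import PCA

#Average each review's token vectors into one sentence vector per review
sentence_vectors, vec_labels = [], []
for source, embeddings in source_embeddings.items():
    for embedding in embeddings:
        if len(embedding) > 0:
            sentence_vectors.append(embedding.mean(axis=0))
            vec_labels.append(source)

#Project the 300-dimensional sentence vectors down to 2D
coords = PCA(n_components=2).fit_transform(np.array(sentence_vectors))

#Scatter plot, one color per source
fig, ax = plt.subplots()
vec_labels = np.array(vec_labels)
for source, color in zip(['original review', 'fake positive review', 'fake negative review'],
                         ['green', 'blue', 'red']):
    mask = vec_labels == source
    ax.scatter(coords[mask, 0], coords[mask, 1], c=color, label=source, alpha=0.5)
ax.legend(loc='lower right')
plt.title('PCA Projection of Mean Sentence Vectors')
plt.show()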


Pitfalls and Shortcomings

While increasing the size and diversity of a dataset has many benefits, there are weaknesses and pitfalls to this approach. The newly generated data may not be representative of, or close to the format of, the real data. While we can make some calculations and visualizations to support similarity, we can never be sure how representative the generated reviews are of real customer language. We could develop models for reading surveys and reviews for a company in the food industry with this data, but the model may break down when given the unstructured, "dirty" data of the real world, since we trained it largely on generated fake data that follows an underlying pattern.

Another issue is that once the fake data is added, we forfeit the ability to use various analytical techniques for information extraction. For example, if I conduct topic modeling analysis with this new dataset, the topics won't be describing just the original data. They will now be describing the fake data as well, and that tells my customer nothing. Why does my customer care if "spaghetti is dry" is a topic when I created fake reviews stating exactly that? That's my problem, not theirs. To be frank, this process hinders our ability to conduct exploratory data analysis (EDA). I see this as the biggest trade-off: with this dataset, we can create classification and prediction models that may be suitable for interpreting new reviews (maybe even better ones, due to the increased size of the dataset, though you will need to build in processes for testing this), at the expense of not being able to extract as much information from the data the company already has.

My biggest caution to anyone using generated data is: do not forget about the original data you collected. Do not forget about the original questions and problems you are trying to solve. Forgetting this could lead you down a rabbit hole of trying to solve a problem that lies in the fake data itself!

Conclusion

One problem that arises in Data Science is a lack of data and of diversity in the data. There are various methods for generating new data, and today we saw how ChatGPT can be used to create more data for your dataset. Today's findings are especially helpful for those working in the food industry. Augmenting in this way can alleviate issues of imbalance and lack of diversity in datasets, leading to models that perform better on real-world data post-training.

What did today show?

ChatGPT data could be helpful for your next Natural Language Processing (NLP) project, especially if you are implementing data science techniques for businesses in the food industry. I would caution that it is always good to try to collect the real data first. If you find that you need more data for your dataset, it may not hurt to explore options like Generative Adversarial Networks (GANs) or Large Language Models (LLMs), like ChatGPT. Finally, I always want to foot-stomp this, especially when working with Generative AI: it is important to use these tools ethically and in a positive manner for all impacted parties.

You may be wondering, what does it mean to use these tools ethically? You should use generative AI to help support people and bring a positive impact in their lives. There are use cases of people creating deepfakes that are hurtful to a person’s image, which is never okay. Additionally, generative AI should never be used to trick someone or change their thoughts with fake, untrue data. Today’s example is a perfect use case of how we can create data that will be used to train models for a company to understand the sentiment of customers’ reviews. It will help the company change its goods and processes to cater to the needs of the customer, positively impacting both parties!

If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here (I receive a small commission when you do this)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!


Sources

  1. Data usage approved by Altomontes Inc.
  2. Full code here
