Introduction
The journey I am about to take you on is important for two reasons.
- It will show you how you can use ChatGPT to help support companies working in the food industry.
- Arguably the most important reason: I am going to walk through a post I made almost two years ago, point out the problems with that article, and attempt to fix them.
Yes, I argue that the second reason is more important. Why? Looking back at the ways and processes you used to analyze data in the past is important because it allows you to learn how to fix your failures, which ultimately leads to success. I am in no way perfect, and I personally look for the mistakes I have made in the past in the hopes of learning from them and developing stronger models for the clients I support.
The Original Publication
I first published "Machine Learning is Not Just for Big Tech" in July of 2021.
The purpose of the article was to show how a company in the food industry could be supported by the various uses of machine learning (ML). I used Natural Language Processing (NLP) techniques to work with reviews across the internet about the company. Some of the methods I used from NLP were Topic Modeling Analysis to gain a better understanding of what customers were talking about and Sentiment Analysis to create a model that could help predict the sentiment of future reviews and provide feedback to the company. The analysis showed both methods were capable of being performed on a small corpus of data.
AH! The big mistake.
My data was not great. Not only was the dataset small, but it was also biased toward positive reviews. This led to models that almost always predicted a review to be positive (not helpful for the company) and that overfit the data.
Solution? I thought about using a Generative Adversarial Network (GAN) to create new synthetic reviews, but then I thought, could I just ask ChatGPT? Boom. The first mistake from my original work was solved. I was able to use ChatGPT to create artificial positive and negative reviews which ultimately balanced my Italian Food Company Review Dataset!
The Dataset
Luckily, I did my due diligence and verified the usability of the data I created before training any models or doing any sort of analysis. Check out the post below, which provides a more in-depth analysis of the real as well as the artificial data.
Positive Review Creation
For the dataset, I wanted an equal balance of positive and negative reviews. First, I prompted ChatGPT to create positive reviews.
Create 500 different positive reviews about different Italian foods and products purchased from an Italian market and put them in a CSV file.
I did this at least 5 times for two reasons. One, ChatGPT kept timing out. Two, I wanted to make sure I got enough different reviews. Additionally, to increase the diversity of generated data, I would change the query each time. For example, I would say Italian Desserts or Italian Wines in place of Italian foods and products. Let’s take a look at a fake positive review generated by ChatGPT.
"The Pecorino Toscano cheese I bought had a robust and savory taste. Its firm and crumbly texture, with a hint of grassiness, made it a great choice for grating, shaving, or enjoying on its own.
Not bad if you ask me!
Negative Review Creation
For creating the negative reviews, I followed the same process. I did have to explicitly tell ChatGPT that I was not making the negative reviews to harm anyone, which is the truth! I simply wanted a classification model and analysis that would generalize to all the types of data the model may encounter (positive and negative reviews).
Create 100 negative reviews about different Italian foods purchased from an Italian market and put them in a CSV file.
Example of a generated negative review:
"The prosciutto and arugula pizza I tried had wilted arugula and the prosciutto was tough. It wasn’t appetizing."
Again, not bad if you ask me!
Analysis
First error:
cannot import name ‘pad_sequences’ from ‘keras.preprocessing.sequence’ (/usr/local/lib/python3.10/dist-packages/keras/preprocessing/sequence.py)
Solution:
→ Instead of the import
from keras.preprocessing.sequence import pad_sequences
→ Use the import:
from keras.utils import pad_sequences
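If your code needs to run under both older and newer Keras versions, a hedged compatibility import is one option (a small sketch, not from the original post):

# pad_sequences moved from keras.preprocessing.sequence to keras.utils
try:
    from keras.utils import pad_sequences  # newer Keras (>= 2.9)
except ImportError:
    from keras.preprocessing.sequence import pad_sequences  # older Keras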
Most of the code worked as before, which was surprising. With the recent release of Keras 3.0, some of this code may be deprecated depending on what packages and IDE you are using to run your code.
Data Cleaning
There was some additional cleaning that needed to be done on the reviews before a model could be trained. Luckily, a sentiment column could be appended, with "0" for negative reviews and "1" for positive reviews.
Original Cleaning
While the following line would be acceptable if I were adding a new column, I decided to use a lambda function instead for cleaning the dataset.
# Original line from the first analysis
df['Label'] = [1 if x >= 3 else 0 for x in df['Rating']]
Also, I decided to treat a rating of 3 as negative, which luckily helped balance the dataset.
orig_reviews['Rating'] = orig_reviews['Rating'].apply(lambda x: 0 if x < 4 else 1)
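The generated reviews also needed sentiment labels before they could be merged in. A minimal sketch of that step (pos_gen and neg_gen are hypothetical DataFrames holding the ChatGPT batches; the original post does not show this code):

import pandas as pd

# Each generated batch is entirely one sentiment, so labeling is trivial
pos_gen['sentiment'] = 1  # assumption: positive batches loaded into pos_gen
neg_gen['sentiment'] = 0  # assumption: negative batches loaded into neg_gen

# Combine the real and generated reviews into one dataset
orig_reviews = orig_reviews.rename(columns={'Rating': 'sentiment'})
df = pd.concat([orig_reviews, pos_gen, neg_gen], ignore_index=True)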
I noticed through the cleaning process that I had undefined variables in the original post, a problem I used to have a lot (honestly, I still do). Currently, I attempt to mitigate this problem by resetting my kernel and double-checking my work before integrating it into one of my posts.
Problems with the Original Dataset
The biggest issue with the dataset was the giant imbalance between the positive and negative reviews. Originally, there were 512 positive reviews and 132 negative reviews. Why is this a problem? Training a model on this dataset will more than likely lead to a model that almost always predicts reviews as positive. I definitely overlooked this in my original post and should have done a better job of addressing it. While I do not shy away from imbalanced datasets, I will not use them as-is until I have exhausted all efforts and tried various techniques to balance them (including further data collection!).
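One standard alternative to generating new data is weighting the loss by class frequency, which neither version of this project used; a quick PyTorch sketch using the counts above:

import torch
import torch.nn as nn

# Inverse-frequency weights for 132 negative (label 0) and 512 positive (label 1) reviews
counts = torch.tensor([132.0, 512.0])
weights = counts.sum() / (2 * counts)  # the rarer class gets the larger weight
loss_fn = nn.CrossEntropyLoss(weight=weights)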
What if we did not train a model at all and just used BERT?
As we can see, the dataset in its current state does not support a strong sentiment analysis model. Is BERT capable of accurately predicting the sentiment of the reviews with no training? Or does another approach need to be taken?
With no training, BERT was accurate 68.27% of the time. Higher accuracies may not have been achieved due to the unstructured nature and fuzzy language used in many of the reviews which BERT could not comprehend.
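The post does not show the zero-shot setup, but a minimal sketch using the Hugging Face pipeline API gives the idea (note the default pipeline loads a DistilBERT model fine-tuned on SST-2 rather than raw BERT, so treat this as an approximation):

from transformers import pipeline
from sklearn.metrics import accuracy_score

# Off-the-shelf sentiment model; no fine-tuning on our reviews
classifier = pipeline('sentiment-analysis')

preds = [1 if r['label'] == 'POSITIVE' else 0
         for r in classifier(df.content.tolist(), truncation=True)]
print(f'Zero-shot accuracy: {accuracy_score(df.sentiment, preds):.2%}')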
Let’s examine the confusion matrix in pursuit of understanding the BERT predictions. For the negative reviews, BERT was correct 76 times; 56 reviews were predicted positive that should have been negative. This could be because those reviews were more neutral in their given state but leaned toward a positive sentiment. For example, "This place used to be good but the pizza was not very good today." As humans, we read this sentence and understand there is a mostly negative sentiment being expressed. BERT, on the other hand, may interpret the repeated use of "good" in the sentence as grounds for predicting the sentence as positive.
For the positive reviews, BERT achieves more accurate classifications: 363 times it correctly predicts a review to be positive. One failure, however, is the 149 predictions where BERT believed a positive review to be negative.
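For reference, the matrix itself is one scikit-learn call (assuming preds holds the zero-shot predictions from the sketch above):

from sklearn.metrics import confusion_matrix

# Rows are true labels (0 = negative, 1 = positive); columns are predictions
print(confusion_matrix(df.sentiment, preds))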
I wanted to outline and discuss the information from the confusion matrix since it shows how difficult it can be for algorithms to understand human language, especially when different linguistic features are incorporated, such as sarcasm and the sentiment the speaker wishes to communicate. The purpose of using BERT with no training was to see whether a model needed to be fine-tuned for the company or whether an off-the-shelf model could be used. With an accuracy below 80%, I would recommend fine-tuning a model to be more aligned with the company’s data.
Model Creation
I noticed in my original post I used Convolutional Neural Networks (CNN) for developing my model. From what I recall, I had done this because I was working a lot with CNNs for my published research (Benefits of using blended generative adversarial network images to augment classification model training data sets). While it’s not wrong to use CNNs (1-D in this case), I wanted to also look at other models which may provide better predictions for the dataset.
# T = padded sequence length, V = vocabulary size (both set during tokenization)
from tensorflow.keras.layers import (Input, Embedding, Conv1D,
                                     MaxPooling1D, GlobalMaxPooling1D, Dense)
from tensorflow.keras.models import Model

D = 20  # embedding dimension
i = Input(shape=(T,))
x = Embedding(V + 1, D)(i)
x = Conv1D(16, 2, activation='relu')(x)
x = MaxPooling1D(2)(x)
x = Conv1D(32, 2, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(1, activation='sigmoid')(x)
model = Model(i, x)
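For completeness, this model would then be compiled and trained along these lines (a sketch; X_train, y_train, X_test, and y_test are assumed padded sequences and labels not shown in the original post):

# Binary cross-entropy matches the single sigmoid output
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5)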
New Model
While creating my own model was beneficial for my own learning, I decided to conduct transfer learning and train on the data with a well-known model, BERT, for my new sentiment analysis (SA) model.
See the full BERT code I used below.
Why Transfer Learning? Why reinvent the wheel when there are powerful models readily available that can be tailored to your given problem? Whenever you face a problem, always do your research to see what others have accomplished in the past. You may be surprised to find that many people have experienced the same, or a similar, problem and have already solved it for you!
Results
Overall, I missed the mark on (or really, did not outline) how the dataset I originally had led to a model that was biased and heavily overfit.
Training: No Dataset Augmentation
As stated, the original model was overfitting due to the extreme imbalance between the positive and negative labels. Some of the changes that led to a "better" model were going from an 80/20 split to a 70/30 split for the training and test sets, as well as adding more dropout into the BERT model (I used 0.5).
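Both changes are small (a sketch; note the full code at the end of this post still shows the original 80/20 value):

# 70/30 split instead of the original 80/20
df_train, df_test = train_test_split(
    df, test_size=0.3, random_state=RANDOM_SEED, stratify=df.sentiment
)
# In the classifier below, dropout is raised via nn.Dropout(p=0.5)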
As you can see, there was a stable loss of 50% (somewhat expected) throughout the model training. When I trained with a 70/30 split, the biggest difference was that the training loss went from ~80% to ~60%. Overall, the models did not perform well with the imbalanced dataset.
Dataset with Augmentation
How does BERT interpret fake reviews?
BERT underperformed with an accuracy of 38.22%. I do not put the full fault on the BERT model. The generated data from ChatGPT lacked distinction between what makes a review positive or negative, a clear indication that ChatGPT generates its data from patterns in past text rather than from any sentient understanding of sentiment.
The most significant problem in the confusion matrix above is that BERT classified many reviews as positive when in fact they were negative. Why? Well, for starters, more negative reviews had to be generated by ChatGPT than positive ones. Clearly, the patterns ChatGPT used for creating its negative reviews were too similar to the positive reviews, highlighting one of the downfalls of data generation: ChatGPT struggles to produce data that is both diverse across categories and aligned with real-world information.
Fine-tuning BERT with the Augmented Dataset
With ChatGPT, the dataset was balanced to contain 1,126 positive reviews and 1,124 negative reviews (an additional 614 positive and 992 negative reviews). One benefit of using Generative AI algorithms is their ability to balance datasets, especially those with huge imbalances like this one. The downfall? The newly generated data may not be representative of the original data, and this needs to be taken into consideration.
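Verifying the balance is a one-liner (sentiment is the label column used in the code at the end of this post):

# Expect roughly equal counts for labels 1 and 0
print(df.sentiment.value_counts())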
Once the dataset is balanced, we can attempt to fine-tune BERT again for sentiment analysis using the same processes as before. One distinct change: the number of tokens per embedding decreased because the generated reviews lowered the average review length (150 → 60).
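Rather than hardcoding that length, it can be derived from the token-length distribution (a sketch using the token_lens list collected in the code at the end of this post):

import numpy as np

# Choose a sequence length that covers ~95% of reviews
MAX_LEN = int(np.percentile(token_lens, 95))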
Success! Dataset augmentation improved the fine-tuning of the BERT model. After 5 epochs, an accuracy of 99.67% was achieved with little loss. The test loss stayed consistently around 13.3%, with an accuracy of 97.3% at epoch 5.
How would this model benefit a company in the food industry that uses it? While using data from just the company could lead to a model that is more personalized and aligned with their processes, using external data can help provide the company with a model that is more generalizable and adaptable to real-world changes. Those changes, which are almost guaranteed to occur, may be alien to a model trained only on internal data; a model that can adapt to the unpredictability of the real world is less likely to fail or make undesired decisions.
Conclusion
Generative AI has recently exploded, and finding appropriate use cases that are positive and beneficial to companies in different industries is important. For companies in the food industry, or any company with a product and reviews about said product, ChatGPT can help support models that flag reviews as positive or negative to support business operations and product development. While ChatGPT can be used for automation, we should be wary of its power, taking a human-on-the-loop approach and evaluating the generated data. Whatever Generative AI technique you use for your dataset development, ensure the data is realistic to the real-world information it is meant to capture, in hopes of creating the strongest-performing model possible.
On a more personal note, today showed that we can always improve, and that we must learn from our poor past performances to do so. Learning from your mistakes and weaknesses is one of the most important parts of being a Data Scientist, and ultimately it will foster a career marked with excellence and focused on continuous development.
If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here (I receive a small commission when you do this)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!
Sources
- Data usage approved by Altomontes Inc.
Code
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from transformers import (BertTokenizer, BertModel, AdamW,
                          get_linear_schedule_with_warmup)

# Set the model name
MODEL_NAME = 'bert-base-cased'
# Build a BERT based tokenizer
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
# Store the token length of each review
token_lens = []
# Iterate through the content column
for txt in df.content:
    tokens = tokenizer.encode(txt, max_length=512)
    token_lens.append(len(tokens))
MAX_LEN = 160
class GPReviewDataset(Dataset):
# Constructor Function
def __init__(self, reviews, targets, tokenizer, max_len):
self.reviews = reviews
self.targets = targets
self.tokenizer = tokenizer
self.max_len = max_len
# Length magic method
def __len__(self):
return len(self.reviews)
# get item magic method
def __getitem__(self, item):
review = str(self.reviews[item])
target = self.targets[item]
# Encoded format to be returned
encoding = self.tokenizer.encode_plus(
review,
add_special_tokens=True,
max_length=self.max_len,
return_token_type_ids=False,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
return_attention_mask=True,
return_tensors='pt',
)
        return {
            'review_text': review,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }
# The original post left RANDOM_SEED undefined; any fixed value works
RANDOM_SEED = 42
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=RANDOM_SEED, stratify=df.sentiment
)
def create_data_loader(df, tokenizer, max_len, batch_size):
ds = GPReviewDataset(
reviews=df.content.to_numpy(),
targets=df.sentiment.to_numpy(),
tokenizer=tokenizer,
max_len=max_len
)
return DataLoader(
ds,
batch_size=batch_size,
num_workers=0
)
BATCH_SIZE = 16
train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)
print(df_train.shape, df_test.shape)
# Load the pretrained BERT model
bert_model = BertModel.from_pretrained(MODEL_NAME, return_dict=False)
# Build the Sentiment Classifier class
class SentimentClassifier(nn.Module):
# Constructor class
def __init__(self, n_classes):
super(SentimentClassifier, self).__init__()
self.bert = BertModel.from_pretrained(MODEL_NAME,return_dict=False)
self.drop = nn.Dropout(p=0.5)
self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
    # Forward propagation method
    def forward(self, input_ids, attention_mask, return_dict=False):
_, pooled_output = self.bert(
input_ids=input_ids,
attention_mask=attention_mask,
return_dict=False
)
# Add a dropout layer
output = self.drop(pooled_output)
return self.out(output)
# Select the device, instantiate the model, and move it to the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SentimentClassifier(2)
model = model.to(device)
# Number of iterations
EPOCHS = 10
# AdamW optimizer (newer transformers versions removed AdamW;
# torch.optim.AdamW(model.parameters(), lr=2e-5) is the drop-in replacement)
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=0,
num_training_steps=total_steps
)
# Set the loss function
loss_fn = nn.CrossEntropyLoss().to(device)
# Function for a single training iteration
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
model = model.train()
losses = []
correct_predictions = 0
for d in data_loader:
input_ids = d["input_ids"].to(device)
attention_mask = d["attention_mask"].to(device)
targets = d["targets"].to(device)
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
return_dict=True)
_, preds = torch.max(outputs, dim=1)
loss = loss_fn(outputs, targets)
correct_predictions += torch.sum(preds == targets)
losses.append(loss.item())
# Backward prop
loss.backward()
        # Clip gradients, then step the optimizer and LR scheduler
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
return correct_predictions.double() / n_examples, np.mean(losses)
def eval_model(model, data_loader, loss_fn, device, n_examples):
model = model.eval()
losses = []
correct_predictions = 0
with torch.no_grad():
for d in data_loader:
input_ids = d["input_ids"].to(device)
attention_mask = d["attention_mask"].to(device)
targets = d["targets"].to(device)
            # Get model outputs
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
return_dict=True)
_, preds = torch.max(outputs, dim=1)
loss = loss_fn(outputs, targets)
correct_predictions += torch.sum(preds == targets)
losses.append(loss.item())
return correct_predictions.double() / n_examples, np.mean(losses)
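The listing stops at the helper functions; a minimal driver loop tying them together might look like this (a sketch using the objects defined above):

for epoch in range(EPOCHS):
    # One pass over the training set, then evaluate on the held-out set
    train_acc, train_loss = train_epoch(model, train_data_loader, loss_fn,
                                        optimizer, device, scheduler, len(df_train))
    test_acc, test_loss = eval_model(model, test_data_loader, loss_fn,
                                     device, len(df_test))
    print(f'Epoch {epoch + 1}/{EPOCHS}: '
          f'train loss {train_loss:.4f}, acc {train_acc:.4f} | '
          f'test loss {test_loss:.4f}, acc {test_acc:.4f}')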