
Machine Learning is gaining traction in the small-business food industry and has been shown to boost businesses’ productivity. Today’s analysis is Part 1 of my investigation into the online reviews of Altomonte’s Italian Market, located near Philadelphia, PA. Natural Language Processing helped this small but rapidly expanding food business gain a deeper understanding of customers’ perceptions of Altomonte’s and its operations.
Natural Language Processing (NLP)
Natural Language Processing is a domain of Machine Learning that seeks to uncover hidden meaning and sentiment in textual data. If you want to learn more, Dan Jurafsky and James H. Martin have a free textbook that takes an excellent, deeper dive into the theory and processes of NLP. The main technique used in this analysis was Topic Modeling Analysis, an unsupervised machine learning technique in NLP that can uncover the latent meaning of a corpus of text. Applied to the reviews of a small business, Topic Modeling Analysis helps uncover the opinions and feelings that customers collectively attribute to the business, allowing the business to better shape its reputation.
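To make the idea concrete, here is a minimal, hypothetical sketch of topic modeling with Gensim's LDA on a toy corpus of four made-up review snippets; the real analysis below applies the same dictionary, bag-of-words, and LDA steps to the full review dataset.
from gensim import corpora, models
# toy corpus of made-up review snippets (illustration only)
toy_reviews = [
    ["fresh", "bread", "italian", "sandwich"],
    ["great", "pizza", "italian", "sauce"],
    ["friendly", "staff", "quick", "service"],
    ["long", "line", "slow", "service"],
]
# map each unique token to an id and convert each document to a bag-of-words vector
dictionary = corpora.Dictionary(toy_reviews)
bow_corpus = [dictionary.doc2bow(doc) for doc in toy_reviews]
# fit a 2-topic LDA model: each topic is a weighted mix of words,
# and each document is a weighted mix of topics
lda = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, random_state=42)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
With only four tiny documents the topics themselves are not meaningful, but the workflow is exactly what the analysis below scales up to the real reviews.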
Analysis
List of the various packages used throughout the analysis:
import pandas as pd
import numpy as np
import nltk
from nltk import FreqDist, PorterStemmer
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
import re
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import word_tokenize,sent_tokenize
from wordcloud import WordCloud
from gensim import corpora
import gensim
import spacy
from gensim.models.coherencemodel import CoherenceModel
from gensim.models import LdaMulticore
from gensim import models
import pprint
import tqdm
Dataset
The dataset used was a corpus of reviews of Altomonte’s Italian Market collected from Yelp, TripAdvisor, and Google Reviews. The reviews dated back 10 years. Some reviews dated back even further; however, I felt that reviews older than 10 years would be less applicable to Altomonte’s because of the business's substantial growth over the past decade. The reviews were cleaned and combined into one Pandas data frame with the columns "Month", "Year", "Review", "Rating", and "Platform".
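Collecting and merging the raw reviews is outside the scope of this post, but a minimal sketch of how the combined data frame might be assembled looks like the following; the per-platform file names are hypothetical.
# hypothetical per-platform exports, each with the same five columns
frames = [
    pd.read_csv("yelp_reviews.csv"),
    pd.read_csv("tripadvisor_reviews.csv"),
    pd.read_csv("google_reviews.csv"),
]
# combine into one data frame and keep only reviews from roughly the last 10 years
df = pd.concat(frames, ignore_index=True)
df = df[["Month", "Year", "Review", "Rating", "Platform"]]
df = df[df["Year"] >= df["Year"].max() - 10].reset_index(drop=True)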

Dataset Statistics
Before diving into the Topic Modeling Analysis, various statistics about the dataset were extracted.
print("Average Rating for All Online Reviews :" ,df['Rating'].mean())
The average rating for all of the reviews was 4.23 out of 5.
print("Number of each Rating for all of the reviews")
df['Rating'].value_counts()
The breakdown of the reviews by rating is as follows:
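A quick bar chart of those counts makes the breakdown easier to scan; here is a minimal sketch using the same value_counts() output.
# plot the number of reviews at each star rating
rating_counts = df['Rating'].value_counts().sort_index()
plt.figure(figsize=(8, 4))
sns.barplot(x=rating_counts.index, y=rating_counts.values)
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.show()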


Word Frequency
The frequency of the words was extracted to see if any meaning could be obtained from the most frequent words in the corpus.
# function to plot the most frequent terms
def frequent_words(x, terms=20):
    totalwords = ' '.join([text for text in x])
    totalwords = totalwords.split()
    fdist = FreqDist(totalwords)
    words_df = pd.DataFrame({'word': list(fdist.keys()), 'count': list(fdist.values())})
    # select the top `terms` most frequent words
    d = words_df.nlargest(columns="count", n=terms)
    plt.figure(figsize=(20, 5))
    ax = sns.barplot(data=d, x="word", y="count")
    ax.set(ylabel='Count')
    plt.show()
First, a function was created that finds the 20 most frequent words and plots their frequencies as a bar graph. Then, all unwanted symbols, numbers, and characters were removed.
# remove unwanted characters, numbers and symbols
df['Review'] = df['Review'].str.replace("[^a-zA-Z#]", " ", regex=True)
Finally, the stopwords were removed from the dataset and the entire dataset was lowercased for normalization.
def remove_stopwords(onl_rev):
    new_review = " ".join([i for i in onl_rev if i not in stop_words])
    return new_review
# remove short words (length < 3)
df['Review'] = df['Review'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
# remove stopwords from the text
Reviews = [remove_stopwords(r.split()) for r in df['Review']]
# make entire text lowercase
Reviews = [r.lower() for r in Reviews]
Lemmatization
After the reviews were cleaned, the dataset was lemmatized. Lemmatization is the process of reducing a word to its base form, or lemma (for example, "sauces" becomes "sauce").
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatization(texts, tags=['NOUN', 'ADJ']):  # keep only nouns and adjectives
    output = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output
This function is then applied to the tokenized reviews (the tokenization step is shown in the next section).
Lem_reviews = lemmatization(tokenized_Reviews)
print(Lem_reviews[1])

Tokenization
Tokenization is the process of breaking a corpus and its sentences down into individual "tokens." Here, we split each review apart so that we can examine each word in the corpus individually.
#Tokenization
tokenized_Reviews = pd.Series(Reviews).apply(lambda x: x.split())
print(tokenized_Reviews[1])

Final Cleaning
reviews_cleaned = []
for i in range(len(Lem_reviews)):
    reviews_cleaned.append(' '.join(Lem_reviews[i]))
df['Reviews'] = reviews_cleaned
frequent_words(df['Reviews'], 20)
One final cleaning of the dataset is applied and the word frequencies are plotted.

Additionally, a word cloud was created from the most frequent words.
full_text = ' '.join(df['Review'])
cloud_no_stopword = WordCloud(background_color='white', stopwords=stop_words).generate(full_text)
plt.imshow(cloud_no_stopword, interpolation='bilinear')
plt.axis('off')
plt.show()

Five of the most frequent words in the word frequency list are Food, Italian, Good, Great, and Sandwich. From the word frequencies, we can gain some insight and conclude that customers perceive Altomonte’s Italian Market as a good, even GREAT, Italian market that serves a great selection of sandwiches (in Philly, we call sandwiches HOAGIES! I highly recommend stopping by and getting one!).
While word frequencies provide useful insight into the words customers use when reviewing a business, pairing word frequency with Topic Modeling Analysis helps draw more concrete conclusions about the true feelings and views customers have towards a business.
Topic Modeling Analysis
For Topic Modeling Analysis, I used a Latent Dirichlet Allocation (LDA) model from the Gensim Python package. The LDA model seeks to uncover latent topics in the corpus of reviews that may not be noticeable from a top-level analysis.
First, a dictionary was created of the lemmatized reviews.
overall_dictionary = corpora.Dictionary(Lem_reviews)
#Converting reviews into a Document Term Matrix
overall_doctermtx = [overall_dictionary.doc2bow(review) for review in Lem_reviews]
The number of topics chosen to represent a corpus depends on the coherence score of the topics. A coherence score measures the quality of the topics, i.e., how semantically related the words grouped into each topic are.
def compute_c_values(dictionary, corpus, texts, limit, start=2, step=1):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with the respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = LDA(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

The function above calculates the coherence scores for a corpus. For this corpus, a model with 7 topics had the highest coherence score, 0.400, so 7 topics were used in the Topic Modeling Analysis to extract the latent topics in the dataset.
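The call that produced those coherence scores is not shown above; here is a minimal sketch of how the sweep might be run and inspected, using overall_dictionary, overall_doctermtx, and Lem_reviews from earlier. The range of 2 to 10 candidate topic counts is an assumption, and the LDA alias matches the one defined in the next code block.
# sketch: sweep candidate topic counts and compare their coherence scores
LDA = gensim.models.ldamodel.LdaModel  # same alias used by the final model below

model_list, coherence_values = compute_c_values(
    dictionary=overall_dictionary,
    corpus=overall_doctermtx,
    texts=Lem_reviews,
    start=2, limit=11, step=1,  # assumed search range: 2 through 10 topics
)

# plot coherence vs. number of topics to spot the peak
topic_range = list(range(2, 11))
plt.plot(topic_range, coherence_values, marker='o')
plt.xlabel('Number of Topics')
plt.ylabel('Coherence Score (c_v)')
plt.show()

for k, cv in zip(topic_range, coherence_values):
    print(f"{k} topics: coherence = {cv:.3f}")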
LDA = gensim.models.ldamodel.LdaModel
LDA_overall_Model = LDA(corpus=overall_doctermtx, id2word=overall_dictionary, num_topics=7)
LDA_overall_Model.print_topics(num_topics=7, num_words=5)
The 5 most important words were extracted for each of the 7 topics. As shown below, each word is assigned a weight by the LDA model indicating its importance within the topic.

Now, we can identify 7 topics that represent Altomonte’s online reviews dataset. These topics are (in order from above):
- Good Italian Pizza
- Wide Selections of Fresh Food
- Wide Sandwich Selection
- Wide Selection of Italian Foods
- Great Place to go for Italian Goods/Cuisine/etc.
- Great Sandwiches
- High-Quality Italian Foods
And there we have it! These topics help explain customers’ perceptions of Altomonte’s Italian Market! Altomonte’s can be described as a unique food store with a wide selection of pizza, sandwiches, and food. Whether you go to the store to shop for ingredients to make recipes at home or grab a bite to eat at the in-house hot bar, sandwich counter, or pizza counter, you will thoroughly enjoy your visit to Altomonte’s Italian Market and enjoy the food! The topics found make sense, since Altomonte’s offers a wide range of products beyond freshly made Italian food. Recently, the R&D department has worked tirelessly to broaden the store's product lines. The latent meaning underlying the reviews shows that customers have noticed this expansion, and the leadership team at Altomonte’s should therefore continue to expand its product lines.
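As a follow-up, the fitted model can also tag each review with its dominant topic, which makes it easy to see how many reviews fall under each theme. Here is a minimal sketch using LDA_overall_Model and overall_doctermtx from above; the Dominant_Topic column name is my own.
# assign each review the topic with the highest probability
dominant_topics = []
for bow in overall_doctermtx:
    topic_probs = LDA_overall_Model.get_document_topics(bow, minimum_probability=0.0)
    dominant_topics.append(max(topic_probs, key=lambda pair: pair[1])[0])
df['Dominant_Topic'] = dominant_topics
# count how many reviews fall under each of the 7 topics
print(df['Dominant_Topic'].value_counts())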
Conclusion
Topic Modeling Analysis was used to show that small businesses in the food industry can gain more insight into how customers perceive the goods and services they offer. For Altomonte’s, customers value the food, pizza, sandwiches, and unique selection offered by the business. Additionally, the owners’ continual work to expand the variety of goods and services in the store has not gone unnoticed and is gaining the attention of consumers. This knowledge can help Altomonte’s continue to develop and invest in the processes it uses to grow the store’s product offerings and customer experience as it continues serving the people of the Greater Philadelphia Area. Thank you for reading!
Sources
- Géron, A.: Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools, and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA (2017).
- Vasiliev, Yuli. Natural Language Processing with Python and Spacy: A Practical Introduction. San Francisco: No Starch, 2020. Print.
- Use of the picture was approved by Altomonte’s Italian Market Inc.
- https://www.analyticsvidhya.com/blog/2018/10/mining-online-reviews-topic-modeling-lda/
- https://realpython.com/sentiment-analysis-python/
- https://neptune.ai/blog/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know
- https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
- https://iq.opengenus.org/topic-modelling-techniques/#:~:text=Topic%20modeling%20can%20be%20used%20in%20graph%20based,time%20and%20helps%20students%20get%20their%20results%20quickly.