Text Mining and Sentiment Analysis for Yelp Reviews of a Burger Chain

Text, such as social media posts and customer reviews, is a gold mine waiting to be discovered. We can turn this unstructured data into useful insights, which can help companies better understand how customers like their products or services and more importantly, why, and then make business improvements as quickly as possible.

Photo by Ramille Soares on Unsplash

1 Case Background

Super Duper Burgers is one of my favourite burger restaurants. Every time I went there, I always saw customers queueing up for the burgers. One day I started wondering: why are people so obsessed with this burger chain? There are lots of reviews on Yelp, and maybe they are a good place to start figuring out the secrets behind its popularity.

2 Data Understanding

2.1 Data Source

I used the Yelp API to get information on 17 Super Duper Burgers restaurants in the Bay Area, such as their URLs, review counts, ratings and locations.
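
My exact request is not shown here, but a minimal sketch of what it could look like with the Yelp Fusion business search endpoint (the search term, location and field selection below are my assumptions):

import requests
import pandas as pd

API_KEY = 'YOUR_YELP_API_KEY'  # placeholder -- use your own Yelp Fusion API key
headers = {'Authorization': f'Bearer {API_KEY}'}
params = {'term': 'Super Duper Burgers', 'location': 'San Francisco Bay Area', 'limit': 50}

# Business search endpoint of the Yelp Fusion API
resp = requests.get('https://api.yelp.com/v3/businesses/search', headers=headers, params=params)
businesses = resp.json()['businesses']

# Keep the fields we care about: name, url, review count, rating and location
restaurants = pd.DataFrame([
    {'name': b['name'], 'url': b['url'], 'review_count': b['review_count'],
     'rating': b['rating'], 'city': b['location']['city']}
    for b in businesses
])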

Restaurants Info from Yelp API

Then I used Beautiful Soup to scrape the reviews for each restaurant. I collected not only the content of each review, but also its date and the rating given by that customer. The date is useful for time series analysis, and the rating can serve as the target variable if we apply a supervised learning algorithm for prediction. I collected 10,661 reviews in total.
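
The scraping code is not reproduced here either; a rough sketch of the approach with requests and Beautiful Soup (the CSS selectors are hypothetical -- Yelp's markup changes often, so inspect the page and adjust them before running):

import requests
from bs4 import BeautifulSoup

def scrape_reviews(url):
    """Pull review text, date and rating from one Yelp page (sketch only)."""
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    reviews = []
    for block in soup.select('div.review'):  # hypothetical selector
        text = block.select_one('p.comment')                          # hypothetical
        date = block.select_one('span.review-date')                   # hypothetical
        stars = block.select_one('div[aria-label*="star rating"]')    # hypothetical
        if text and date and stars:
            reviews.append({'review': text.get_text(strip=True),
                            'date': date.get_text(strip=True),
                            'rating': stars['aria-label']})
    return reviews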

Yelp Reviews

2.2 Exploratory Data Analysis

Visualization is a good way to do exploratory data analysis.

import pandas as pd
import matplotlib.pyplot as plt

# Convert the date column to datetime and use it as the index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Count the number of reviews in each month
plt.plot(df['review'].resample('M').count())
plt.xlabel('Year')
plt.ylabel('Number of reviews')
plt.title('Number of reviews per month')
plt.show()

The reviews span more than ten years, from 2010-04-09 to 2020-12-21. The number of reviews per month was generally increasing, which suggests the burger chain has become more and more popular over the past decade. After COVID hit, the number dropped significantly, to only around 30 reviews per month.

import seaborn as sns

# Share of each star rating, as a percentage of all reviews
ax = sns.barplot(data=df, x='rating', y='rating',
                 estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")
plt.title('Distribution of Customer Rating')
plt.show()

# Average rating per month
plt.plot(df['rating'].resample('M').mean())
plt.xlabel('Year')
plt.ylabel('Rating')
plt.title('Average Monthly Customer Rating')
plt.ylim(0, 5)
plt.show()

Most customers were satisfied with the restaurants, and over 70% of them gave ratings of 4 or 5. Over time there is not much variation in the rating, which stayed quite stable at around 4.

2.3 Data Cleaning

In the reviews, some HTML character references such as "&amp;" are not useful in our text content, so I removed them.

# Strip leftover HTML character references from the scraped review text
df['review'] = [i.replace("&amp;", '').replace("'", '') for i in df['review']]

Next, I wanted to make sure all the reviews are in English, so I did language detection with a library called langdetect and its detect_langs function.

from langdetect import detect_langs

# detect_langs returns a list of candidate languages with probabilities;
# keep the language code of the most likely candidate for each review
language = [detect_langs(i) for i in df.review]
languages = [str(i[0]).split(':')[0] for i in language]
df['language'] = languages

8 out of the 10,661 reviews were detected as other languages. Most of them are very short or stretch words for emphasis, such as "waaaaaay" for "way" and "guuuud" for "good", which makes the detection less accurate. Looking closer at each of these 8 reviews, all of them are actually in English, so I kept them.

Reviews detected in other languages
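
A quick way to pull up those flagged reviews for a manual check, using the language column created above:

# Inspect the reviews that were not detected as English
print(df.loc[df['language'] != 'en', ['review', 'language']])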

3 Text Mining

3.1 Stopwords

In every language, there are words that occur too frequently and are not informative, such as "a", "an", "the", "and" in English. It is useful to build a list containing all the stopwords and get rid of them before we do any text mining.

Depending on the specific context, you may also want to add more to the list. In our case, words like "super", "duper" are not very meaningful.

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Combine the NLTK and scikit-learn stop word lists, plus a few domain-specific words
my_stop_words = set(stopwords.words('english') + list(ENGLISH_STOP_WORDS) + ['super', 'duper', 'place'])

3.2 Word Cloud

A word cloud is a very popular way to highlight the high-frequency words in textual data. The more often a word appears in the text, the bigger and bolder it is in the cloud.

from wordcloud import WordCloud
# concatenate all the reviews into one single string 
full_text = ' '.join(df['review'])
cloud_no_stopword = WordCloud(background_color='white', stopwords=my_stop_words).generate(full_text)
plt.imshow(cloud_no_stopword, interpolation='bilinear')
plt.axis('off')
plt.show()
Word Cloud

We can see that "burger", "garlic fries", "cheese" and some other words were mentioned by lots of customers.

3.3 Tokenization and Bag-of-Words (BoW)

In addition to the word cloud, we may also be interested in exactly how often a word appears across all the reviews. Here we are actually transforming the text data into a numeric form, and Bag-of-Words is the simplest text representation in numbers: it builds a list of the words occurring in a collection of documents (the corpus) and keeps track of their frequencies.

from nltk.tokenize import word_tokenize
from nltk import FreqDist

# Tokenize the lower-cased text and keep only alphabetic, non-stop-word tokens
lower_full_text = full_text.lower()
word_tokens = word_tokenize(lower_full_text)
tokens = [word for word in word_tokens if word.isalpha() and word not in my_stop_words]

# Count the 20 most frequent tokens
token_dist = FreqDist(tokens)
dist = pd.DataFrame(token_dist.most_common(20), columns=['Word', 'Frequency'])

Obviously, "burger" and "burgers" are saying the same thing and we can do better than that by using stemming. Stemming is the process of transforming words into root forms, even if the stemmed word is not a valid word in the language. In general, stemming will tend to chop off suffixes such as "-ed" and "ing" as well as plural forms.

from nltk.stem import PorterStemmer

# Stem each token and recount the frequencies
porter = PorterStemmer()
stemmed_tokens = [porter.stem(word) for word in tokens]
stemmed_token_dist = FreqDist(stemmed_tokens)
stemmed_dist = pd.DataFrame(stemmed_token_dist.most_common(20), columns=['Word', 'Frequency'])

3.4 N-grams

Under the Bag-of-Words approach, word order is discarded. However, in many cases the sequence of words matters. For example, compare these two sentences: 1) I am happy, not sad. 2) I am sad, not happy. Their meanings are totally different, yet they get exactly the same numeric representation with single-token BoW. To better capture the context, we can look at pairs or triples of words that appear next to each other, which give us more useful information.

from sklearn.feature_extraction.text import CountVectorizer

# Count bigrams (pairs of adjacent words) across all reviews
vect = CountVectorizer(stop_words=list(my_stop_words), ngram_range=(2, 2))
bigrams = vect.fit_transform(df['review'])
bigram_df = pd.DataFrame(bigrams.toarray(), columns=vect.get_feature_names_out())

# Sum the counts for each bigram and keep the 20 most frequent
bigram_frequency = pd.DataFrame(bigram_df.sum(axis=0)).reset_index()
bigram_frequency.columns = ['bigram', 'frequency']
bigram_frequency = bigram_frequency.sort_values(by='frequency', ascending=False).head(20)

Garlic fries seem to be the most popular menu item for this burger chain, even ahead of the burgers! Other frequently mentioned items include the mini burger, ice cream, veggie burger and chicken sandwich. Pairs of tokens give us more insight than single ones.
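
To make the comparison easier to read, we can chart the bigram_frequency frame computed above; a minimal sketch:

# Horizontal bar chart of the 20 most frequent bigrams
bigram_frequency.plot.barh(x='bigram', y='frequency', legend=False, figsize=(8, 6))
plt.gca().invert_yaxis()  # most frequent bigram at the top
plt.xlabel('Frequency')
plt.title('Top 20 Bigrams in Reviews')
plt.tight_layout()
plt.show()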

3.5 Why People Love It

Even though the bigrams give us more information, they only answer the question of WHAT. If I were the business owner, I would definitely be interested in WHY: why do people love the fries? Is it the special flavor or the sauce?

To answer this question, I will use the Word2Vec model to see which words are most likely to appear around target words such as fries, burgers and service. Word2Vec uses a neural network to learn word associations from the corpus. Compared to BoW and n-grams, Word2Vec leverages the surrounding context and better captures the meaning of a word and its relationships.

There are two model architectures behind Word2Vec: continuous Bag-of-Words (CBOW) and skip-gram. I am not going to go into the details of the algorithms here; you can find more in other articles and papers. In general, CBOW is faster, while skip-gram is slower but does a better job of representing infrequent words.
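
In Gensim, the choice between the two architectures comes down to a single flag; a minimal illustration (tokenized_sentences is a placeholder for any list of token lists, such as the one we build below):

from gensim.models import Word2Vec

# sg=0 trains CBOW (the default), sg=1 trains skip-gram
cbow_model = Word2Vec(sentences=tokenized_sentences, sg=0)
skipgram_model = Word2Vec(sentences=tokenized_sentences, sg=1)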

We can do all of this with Gensim in Python. First, I collected the good reviews (ratings of 4 or 5) and did some basic preprocessing.

import re
from nltk.tokenize import sent_tokenize

# Good reviews: those rated 4 or 5
df_good = df[df['rating'] >= 4]
good_reviews = ' '.join(df_good.review)

# Split the long string into sentences, then clean and tokenize each sentence
sentences_good = sent_tokenize(good_reviews)
good_token_clean = list()
for sentence in sentences_good:
    eng_word = re.findall(r'[A-Za-z-]+', sentence)
    good_token_clean.append([i.lower() for i in eng_word if i.lower() not in my_stop_words])
Pre and Post Text Cleaning

Now we can build the model and see what customers love most about the service of this burger chain.

from gensim.models import Word2Vec

# Train a CBOW model (sg=0); the `vector_size` parameter is called `size` in Gensim versions before 4.0
model_ted = Word2Vec(sentences=good_token_clean, vector_size=500, window=10, min_count=1, workers=4, sg=0)
model_ted.predict_output_word(['service'], topn=10)  # words most likely to appear around 'service'

Obviously, people really appreciate the friendly customer service as well as the fast, quick responses. We can do the same for other target words we are interested in. These surrounding words are very informative and can better explain why people love or complain about certain things.
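
Repeating the query for other target words is straightforward; a quick sketch (the target words here are just examples I picked):

# Words most likely to appear around each target word, per the model trained above
for target in ['fries', 'burger', 'sauce']:
    print(target, model_ted.predict_output_word([target], topn=5))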

4 Sentiment Analysis

Sentiment analysis is the process of understanding people's opinions about a subject. There are two types of methods: lexicon/rule-based and automated.

4.1 Lexicon-based Tool – VADER

This method uses a predefined list of words with sentiment scores and matches words from the lexicon against words in the text. I will use the VADER analyzer in the NLTK package. For each piece of text, the analyzer provides four scores: negative, neutral, positive and compound. The first three are easy to understand; the compound score is a normalized combination of the positive and negative scores that ranges from -1 to 1, where below 0 is negative and above 0 is positive. I will use the compound score to measure the overall sentiment.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Score every review and keep the compound score
sid = SentimentIntensityAnalyzer()
sentiment_scores = df['review'].apply(sid.polarity_scores)
sentiment = sentiment_scores.apply(lambda x: x['compound'])

# Average compound score per month (df is indexed by date)
monthly_sentiment = sentiment.resample('M').mean()

Generally, the sentiment towards this burger chain is positive, but we can notice a decreasing trend over the past decade, especially after the pandemic started.
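
The trend can be plotted directly from the monthly_sentiment series computed above; a minimal sketch:

# Plot the average compound score per month
plt.plot(monthly_sentiment)
plt.axhline(0, color='grey', linestyle='--')  # neutral sentiment line
plt.xlabel('Year')
plt.ylabel('Average compound score')
plt.title('Average Monthly Sentiment')
plt.show()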

4.2 Supervised Learning Classifiers

We can also use historical data with known sentiment to predict the sentiment of a new piece of text. Here I will use two supervised learning classifiers: logistic regression and Naive Bayes.

First, I labeled the positive reviews (ratings of four or five) as "1" and the negative reviews (ratings of one or two) as "0". 85% of the resulting 9,271 reviews are positive.
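
The labeling step itself is not shown above; one way it could look, producing the df_update frame and label column used in the next snippets (I am assuming 3-star reviews were simply dropped as ambiguous):

# Keep only clearly positive (4-5 stars) and clearly negative (1-2 stars) reviews
df_update = df[df['rating'] != 3].copy()
df_update['label'] = (df_update['rating'] >= 4).astype(int)  # 1 = positive, 0 = negative
df_update['label'].value_counts(normalize=True)  # check the class balance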

Then, I vectorized the reviews using BoW and split them into training set and test set.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Vectorize the reviews with a 300-word Bag-of-Words
vect = CountVectorizer(max_features=300, stop_words=list(my_stop_words))
vect.fit(df_update.review)
X = vect.transform(df_update.review)
X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())

# Define the vector of targets and matrix of features
y = df_update.label
X = X_df

# Perform the train-test split, preserving the class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

Now we can build the models. The first one is logistic regression.

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X_train, y_train)
y_pred_lg = log_reg.predict(X_test)

# The coefficients tell us which words are most informative for each class
log_odds = log_reg.coef_[0]
coeff = pd.DataFrame(log_odds, X.columns, columns=['coef']).sort_values(by='coef', ascending=False)
Most informative words for positive reviews
Most informative words for negative reviews

The second model is Naive Bayes.

from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
pred = nb_classifier.predict(X_test)

Finally, we can compare the generalization performance of the two models. It turned out that both worked really well, with accuracy over 90%. Of course, we could still improve the models by using n-grams, TF-IDF, etc.

Generalization Performance
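
The comparison itself can be reproduced with standard scikit-learn metrics; a minimal sketch using the predictions computed above:

from sklearn.metrics import accuracy_score, classification_report

# Accuracy and per-class metrics on the held-out test set
print('Logistic regression accuracy:', accuracy_score(y_test, y_pred_lg))
print('Naive Bayes accuracy:', accuracy_score(y_test, pred))
print(classification_report(y_test, y_pred_lg))
print(classification_report(y_test, pred))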

5 Conclusion

Text mining not only lets us know what people are talking about, but also how they talk about it. It is very valuable for brand monitoring, product analysis and customer service. With Python, it is convenient to leverage all kinds of libraries to dive deeper into the text and extract valuable insights.

