Text, such as social media posts and customer reviews, is a gold mine waiting to be discovered. We can turn this unstructured data into useful insights that help companies understand how customers feel about their products or services and, more importantly, why, and then make business improvements quickly.

1 Case Background
Super Duper Burgers is one of my favourite burger restaurants. Every time I go there, I see customers queueing up for burgers. One day I started wondering: why are people so obsessed with this burger chain? There are lots of reviews on Yelp, and they seemed like a good place to start figuring out the secrets behind its popularity.
2 Data Understanding
2.1 Data Source
I used the Yelp API to get information about 17 Super Duper Burgers restaurants in the Bay Area, such as URLs, review counts, ratings, locations, etc.

Then I used Beautiful Soup to scrape the reviews for each restaurant. I collected not only the content of each review, but also its date and the rating given by that customer. The date is useful for time series analysis, and the rating can serve as the target variable if we apply a supervised learning algorithm for prediction. I collected 10,661 reviews in total.
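For reference, here is a minimal sketch of how this collection step could look. The businesses/search endpoint and its parameters are part of the public Yelp Fusion API, but the API key, the location string and the CSS selectors in the scraping part are placeholders that would need to be adapted to Yelp's actual pages.
import requests
from bs4 import BeautifulSoup
API_KEY = 'your-yelp-api-key'  # placeholder: a valid Yelp Fusion API key
headers = {'Authorization': f'Bearer {API_KEY}'}
# search the Yelp Fusion API for Super Duper Burgers locations in the Bay Area
resp = requests.get('https://api.yelp.com/v3/businesses/search',
                    headers=headers,
                    params={'term': 'Super Duper Burgers', 'location': 'San Francisco Bay Area', 'limit': 20})
businesses = resp.json()['businesses']  # each item contains url, review_count, rating, location, ...
# scrape the review text, date and rating from each restaurant's Yelp page
for biz in businesses:
    page = requests.get(biz['url'])
    soup = BeautifulSoup(page.text, 'html.parser')
    for review in soup.select('li.review'):  # placeholder selector, not Yelp's actual markup
        text = review.select_one('p').get_text()
        # ...extract the date and rating similarly, then append everything to a DataFrame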

2.2 Exploratory Data Analysis
Visualization is a good way to do exploratory data analysis.
import pandas as pd
import matplotlib.pyplot as plt
# df holds the scraped reviews; use the review date as a DatetimeIndex
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
# count the number of reviews posted each month
plt.plot(df['review'].resample('M').count())
plt.xlabel('Year')
plt.ylabel('Number of reviews')
plt.title('Number of reviews per month')
plt.show()

The reviews span more than ten years, from 2010-04-09 to 2020-12-21. The number of reviews per month was generally increasing, which suggests the burger chain became more and more popular over the past decade. After COVID hit, the number dropped significantly, to only around 30 reviews per month.
import seaborn as sns
# bar height = percentage of reviews with each rating
ax = sns.barplot(data=df, x='rating', y='rating', estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")
plt.title('Distribution of Customer Rating')
plt.show()

plt.plot(df['rating'].resample('M').mean())
plt.xlabel('Year')
plt.ylabel('Rating')
plt.title('Average Monthly Customer Rating')
plt.ylim(0,5)
plt.show()

Most customers were satisfied with the restaurants, and over 70% of them gave ratings of 4 or 5. Over time there is not much variation in the rating; it stayed quite stable around 4.
2.3 Data Cleaning
In the reviews, some HTML character references such as "&amp;" and "&#39;" are not useful in our text content, so I removed them.
df['review'] = [i.replace("&amp;", '').replace("&#39;", '') for i in df['review']]
Next, I wanted to make sure all the reviews are in English, so I did language detection with a library called langdetect and its function detect_langs.
from langdetect import detect_langs
# detect_langs returns a list of candidate languages with probabilities, e.g. [en:0.99]
language = [detect_langs(i) for i in df.review]
# keep only the language code of the most likely candidate
languages = [str(i[0]).split(':')[0] for i in language]
df['language'] = languages
8 out of the 10,661 reviews were detected as other languages. Most of them are very short and put some sort of emphasis on a word: "waaaaaay" for "way" and "guuuud" for "good". In such cases the detection is not very accurate. If we look closer at each of these 8 reviews, all of them are actually in English, so I kept them.
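For the curious, a quick way to take that closer look, using the language column created above:
# inspect the handful of reviews not detected as English
print(df[df['language'] != 'en']['review'])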

3 Text Mining
3.1 Stopwords
In every language, there are words that occur too frequently and are not informative, such as "a", "an", "the", "and" in English. It is useful to build a list containing all the stopwords and get rid of them before we do any text mining.
Depending on the specific context, you may also want to add more to the list. In our case, words like "super", "duper" are not very meaningful.
from nltk.corpus import stopwords  # requires nltk.download('stopwords') the first time
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
my_stop_words = set(stopwords.words('english') + list(ENGLISH_STOP_WORDS) + ['super', 'duper', 'place'])
3.2 Word Cloud
A word cloud is a very popular way to highlight high-frequency words in textual data. The more often a specific word appears in the text, the bigger and bolder it will be in the word cloud.
from wordcloud import WordCloud
# concatenate all the reviews into one single string
full_text = ' '.join(df['review'])
cloud_no_stopword = WordCloud(background_color='white', stopwords=my_stop_words).generate(full_text)
plt.imshow(cloud_no_stopword, interpolation='bilinear')
plt.axis('off')
plt.show()

We can see that "burger", "garlic fries", "cheese" and some other words were mentioned by lots of customers.
3.3 Tokenization and Bag-of-Words (BoW)
In addition to the word cloud, we may also be interested in exactly how often a word appears across all the reviews. Here, we are actually transforming the text data into numeric form, and Bag-of-Words is the simplest such representation. It basically builds a list of the words occurring within a collection of documents (a corpus) and keeps track of their frequencies.
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt') the first time
from nltk import FreqDist
lower_full_text = full_text.lower()
word_tokens = word_tokenize(lower_full_text)
tokens = list()
# keep only alphabetic tokens that are not stopwords
for word in word_tokens:
    if word.isalpha() and word not in my_stop_words:
        tokens.append(word)
token_dist = FreqDist(tokens)
dist = pd.DataFrame(token_dist.most_common(20), columns=['Word', 'Frequency'])
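To make the frequency table easier to read, we can also plot it; this is just a sketch using the dist DataFrame built above.
# bar chart of the 20 most frequent tokens
plt.figure(figsize=(10, 5))
plt.bar(dist['Word'], dist['Frequency'])
plt.xticks(rotation=45)
plt.title('Top 20 Words in Reviews')
plt.show()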

Obviously, "burger" and "burgers" are saying the same thing and we can do better than that by using stemming. Stemming is the process of transforming words into root forms, even if the stemmed word is not a valid word in the language. In general, stemming will tend to chop off suffixes such as "-ed" and "ing" as well as plural forms.
from nltk.stem import PorterStemmer
porter = PorterStemmer()
stemmed_tokens = [porter.stem(word) for word in tokens]
stemmed_token_dist = FreqDist(stemmed_tokens)
stemmed_dist = pd.DataFrame(stemmed_token_dist.most_common(20), columns=['Word', 'Frequency'])

3.4 N-grams
Under the Bag-of-Words approach, word order is discarded. However, in many cases the sequence of words matters. For example, compare these two sentences: 1) I am happy, not sad. 2) I am sad, not happy. Their meanings are completely different, yet they get exactly the same numeric representation under single-token BoW. To better capture the context, we can consider pairs or triples of adjacent words (bigrams or trigrams), which give us more useful information.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words=list(my_stop_words), ngram_range=(2, 2))
bigrams = vect.fit_transform(df['review'])
bigram_df = pd.DataFrame(bigrams.toarray(), columns=vect.get_feature_names_out())
bigram_frequency = pd.DataFrame(bigram_df.sum(axis=0)).reset_index()
bigram_frequency.columns = ['bigram', 'frequency']
bigram_frequency = bigram_frequency.sort_values(by='frequency', ascending=False).head(20)

Garlic fries seem to be the most popular menu item for this burger chain, even more than the burgers! Other top sellers include the mini burger, ice cream, veggie burger and chicken sandwich. Pairs of tokens give us more insight than single ones.
3.5 Why People Love It
Even though the bigrams give us more information, they only answer the question of WHAT. If I were the business owner, I would definitely be interested in WHY: Why do people love the fries? Is it because of a special flavor or the sauce?
To answer this question, I will use a Word2Vec model to see which words are most likely to appear around target words such as fries, burgers, service, etc. Word2Vec uses a neural network to learn word associations from the corpus. Compared to BoW and n-grams, Word2Vec leverages the context and better captures the meaning and relationships of words.
There are two model architectures behind Word2Vec: continuous Bag-of-Words (CBOW) and skip-gram. I am not going to go into the details of the algorithms here; you can find more in other articles and papers. In general, CBOW is faster, while skip-gram is slower but does a better job of representing infrequent words.
We can easily do the job with Gensim in Python. First, I took the good reviews (rating of 4 or 5) and did some basic preprocessing.
import re
from nltk.tokenize import sent_tokenize
# good reviews: rating of 4 or 5
df_good = df[df['rating'] >= 4]
good_reviews = ' '.join(df_good.review)
# split the long string into sentences
sentences_good = sent_tokenize(good_reviews)
good_token_clean = list()
# get tokens for each sentence
for sentence in sentences_good:
    eng_word = re.findall(r'[A-Za-z-]+', sentence)
    good_token_clean.append([i.lower() for i in eng_word if i.lower() not in my_stop_words])

Now we can build the model and see what customers love most about the service of the burger chain.
from gensim.models import Word2Vec
# sg=0 selects the CBOW architecture; vector_size was called size in older Gensim versions
model_ted = Word2Vec(sentences=good_token_clean, vector_size=500, window=10, min_count=1, workers=4, sg=0)
model_ted.predict_output_word(['service'], topn=10)

Obviously, people really appreciate the friendly customer service as well as the fast and quick response. We can do the same for other target words that we are interested in, as sketched below. These surrounding words are very informative and can better explain why people love or complain about certain things.
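For example, a sketch of the same query for "fries", using the model trained above:
# words most likely to appear around 'fries' in the good reviews
model_ted.predict_output_word(['fries'], topn=10)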
4 Sentiment Analysis
Sentiment Analysis is the process of understanding people's opinions about a subject. There are two types of methods: lexicon/rule-based and automated (machine learning based).
4.1 Lexicon-based Tool – VADER
This method uses a predefined list of words with associated sentiment scores and matches words from the lexicon against words in the text. I will use the VADER analyzer from the NLTK package. For each piece of text, the analyzer provides four scores: negative, neutral, positive and compound. The first three are easy to understand; the compound score is a normalized combination of the individual word scores and ranges from -1 to 1: below 0 is negative and above 0 is positive. I am going to use the compound score to measure the sentiment.
# Load SentimentIntensityAnalyzer (requires nltk.download('vader_lexicon') the first time)
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Instantiate a new SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
# Generate sentiment scores for every review and keep the compound score
sentiment_scores = df['review'].apply(sid.polarity_scores)
sentiment = sentiment_scores.apply(lambda x: x['compound'])
monthly_sentiment = sentiment.resample('M').mean()
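Plotting the monthly averages makes the trend described below visible; a quick sketch using the monthly_sentiment series just computed:
# average compound sentiment score per month
plt.plot(monthly_sentiment)
plt.xlabel('Year')
plt.ylabel('Average compound score')
plt.title('Average Monthly Sentiment')
plt.show()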

Generally, the sentiment for this burger chain is positive, although we can notice a decreasing trend over the past decade, especially after the pandemic.
4.2 Supervised Learning Classifiers
We can also use historical data with known sentiment to predict the sentiment of a new piece of text. Here I will use two supervised learning classifiers: logistic regression and Naive Bayes.
First, I labeled the positive reviews (rating of four or five) as "1" and the negative reviews (rating of one or two) as "0". 85% of the resulting 9,271 labeled reviews are positive.
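A minimal sketch of this labeling step, assuming ratings are stored as integers and using df_update as the name the later code expects for the labeled subset:
# keep only clearly positive (4-5 stars) and clearly negative (1-2 stars) reviews
df_update = df[df['rating'] != 3].copy()
df_update['label'] = (df_update['rating'] >= 4).astype(int)  # 1 = positive, 0 = negative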

Then, I vectorized the reviews using BoW and split them into a training set and a test set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
# vectorization: keep the 300 most frequent tokens
vect = CountVectorizer(max_features=300, stop_words=list(my_stop_words))
vect.fit(df_update.review)
X = vect.transform(df_update.review)
X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
# Define the vector of targets and matrix of features
y = df_update.label
X = X_df
# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
Now we can build the models. The first one is logistic regression.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression().fit(X_train, y_train)
y_pred_lg = log_reg.predict(X_test)
# find the most informative words
log_odds = log_reg.coef_[0]
coeff = pd.DataFrame(log_odds, X.columns, columns=['coef']).sort_values(by='coef', ascending=False)


The second model is Naive Bayes.
from sklearn.naive_bayes import MultinomialNB
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
pred = nb_classifier.predict(X_test)
Finally, we can compare the generalization performance of the two models. It turns out that both models work really well, with accuracy over 90%. Of course, we could still improve the models by using n-grams, Tf-idf, and so on.
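A quick way to check the accuracy on the test set, using the predictions computed above:
from sklearn.metrics import accuracy_score
# accuracy of each classifier on the held-out test set
print('Logistic regression:', accuracy_score(y_test, y_pred_lg))
print('Naive Bayes:', accuracy_score(y_test, pred))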

5 Conclusion
Text mining not only lets us know what people are talking about, but also how they talk about it. It is very important and beneficial for brand monitoring, product analysis and customer service. With Python, it is convenient to leverage all kinds of libraries to dive deeper into the text and extract valuable insights.