Natural Language Processing for Consumer Satisfaction in Python

Data mining and visualization with consumer reviews

Cláudio Alves Monteiro

Published in Towards Data Science · 6 min read · Sep 11, 2020

How can a company better understand consumers’ perception of its products in order to improve them?

Marketing consultancies and in-house teams have long analyzed consumers’ perceptions to improve products, branding, and marketing campaigns. The traditional methods for doing so are in-depth interviews and surveys, which often do not follow statistical sampling techniques and require resources for data collection.

In the era of social networks, we produce vast amounts of data on a daily basis. A company’s Facebook page, for example, can serve as a valuable resource for collecting comments, inbox messages, and reactions. Other sources of data are consumer reviews and ratings on e-commerce platforms like Amazon and Lowe’s. Here I’m going to present how one can use Python and Natural Language Processing (NLP) to analyze text data and get insights from these reviews.

The data set used as an example consists of 20,473 consumer reviews from different retailers and products (mainly refrigerators). The goal is to (1) extract words within a text that encode a characteristic of a product’s functioning, its structure, or the purchase process; (2) identify what is being said in each sales channel; (3) explore this information to generate insights, for example, which features are most related to positive reviews; and finally, (4) generate data visualizations to communicate the results.

Exploring which attributes (columns) we are going to use, one can see that we have useful information such as the retailer, category, brand, user rating, and the review text itself. With that in mind, we can lay out the steps needed to achieve our goals:

A. Preprocessing and Exploratory Analysis: (1) clean and tokenize the text; (2) count words to identify features as the most common nouns; (3) create a binary valuation of the product (good/bad).

B. Sentiment Analysis: (1) identify valuable features to explore; (2) build a word cloud visualization of the most frequent adjectives near a feature; (3) track the sentiment for a feature over time.

Preprocessing and Exploratory Analysis

First, we import the libraries and NLTK models. If you don’t have one of these libraries, you can run pip install followed by the library name in the terminal to install it. Note the use of the NLTK package for natural language processing, pandas for data manipulation, and plotly and cufflinks for data visualization.

An initial step is to identify duplicated reviews based on the review_id column and remove them. In this data set, 69% of the rows were duplicates, which is a high share and should be verified with the team responsible for collecting the data from the websites.
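The deduplication step can be sketched in plain Python as follows. This is a minimal illustration, not the original project code: the rows are toy data, the review_text key is an assumed name, and only review_id comes from the text above. With pandas, DataFrame.drop_duplicates(subset="review_id") does the same job on a data frame.

```python
def drop_duplicate_reviews(reviews, key="review_id"):
    """Keep the first review seen for each key value, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        if review[key] not in seen:
            seen.add(review[key])
            unique.append(review)
    return unique


# Toy rows for illustration; "review_text" is an assumed column name.
reviews = [
    {"review_id": "r1", "review_text": "Great fridge"},
    {"review_id": "r2", "review_text": "Ice maker is too small"},
    {"review_id": "r1", "review_text": "Great fridge"},
    {"review_id": "r3", "review_text": "Door feels flimsy"},
    {"review_id": "r2", "review_text": "Ice maker is too small"},
]

unique_reviews = drop_duplicate_reviews(reviews)
duplicate_share = 1 - len(unique_reviews) / len(reviews)
print(f"{len(unique_reviews)} unique reviews, {duplicate_share:.0%} duplicated")
```

Reporting the duplicated share alongside the cleaned data makes it easy to flag an unusually high value to whoever collected the reviews.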

After importing and exploring the data, we can preprocess the comments: put the words in lower case, remove numbers, remove stopwords (very common words such as connectives) and apply a lemmatization algorithm for word standardization. This standardizes the text and makes feature extraction more efficient, so that, for example, “Enjoyable” and “Enjoy” both become “enjoy”.
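The cleaning steps just described could look roughly like this. It is a simplified sketch under stated assumptions: the stopword set is a tiny hand-written sample rather than NLTK’s English list, and the LEMMAS dictionary is a hypothetical stand-in for a real lemmatizer, so that the example runs without downloading any corpora.

```python
import re

# Tiny illustrative stopword list; NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "i", "this", "to", "of", "in", "for"}

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {"doors": "door", "shelves": "shelf", "enjoyed": "enjoy", "enjoying": "enjoy"}


def preprocess(text):
    """Lowercase, drop numbers and punctuation, remove stopwords, map to lemmas."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)        # remove numbers
    tokens = re.findall(r"[a-z']+", text)   # crude tokenizer: letters and apostrophes
    tokens = [t.strip("'") for t in tokens]
    return [LEMMAS.get(t, t) for t in tokens if t and t not in STOPWORDS]


print(preprocess("I enjoyed the 2 doors and the 3 shelves in this fridge!"))
# ['enjoy', 'door', 'shelf', 'fridge']
```

Keeping the whole pipeline in one function makes it straightforward to apply the same cleaning to every review before counting words.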

Now we can count the words in the reviews and identify the word class (noun, adjective, verb, adverb, etc.). With this step, we will be able to identify relevant features expressed in nouns.
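A small sketch of the counting step is shown below. It assumes the tokens have already been tagged and arrive as (word, tag) pairs with Penn Treebank tags, which is the shape nltk.pos_tag returns; the tagged reviews here are written by hand for illustration, and count_by_tag_prefix is a hypothetical helper name.

```python
from collections import Counter

# Tagged tokens in the (word, tag) shape that nltk.pos_tag returns,
# using Penn Treebank tags. Hand-written here for illustration.
tagged_reviews = [
    [("fridge", "NN"), ("great", "JJ"), ("ice", "NN"), ("maker", "NN")],
    [("door", "NN"), ("small", "JJ"), ("ice", "NN"), ("melt", "VB")],
    [("ice", "NN"), ("water", "NN"), ("cold", "JJ"), ("door", "NN")],
]


def count_by_tag_prefix(tagged_docs, prefix):
    """Count words whose tag starts with prefix ('NN' nouns, 'JJ' adjectives)."""
    counts = Counter()
    for doc in tagged_docs:
        counts.update(word for word, tag in doc if tag.startswith(prefix))
    return counts


noun_counts = count_by_tag_prefix(tagged_reviews, "NN")
print(noun_counts.most_common(2))
# [('ice', 3), ('door', 2)]
```

Matching on the tag prefix groups singular and plural noun tags (NN, NNS) together, and the same helper can count adjectives by passing "JJ".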

Here we can see relevant features like fridge, door, ice, refrigerator, water, space, and others. The next step then is to explore which sentiments are linked to these features.

Sentiment Analysis

To perform the sentiment analysis, I’ve made a list of the relevant features observed in the previous analysis. We will use this list to create other attributes in the data set: (1) one that identifies whether there is a relevant feature in each review; (2) a column with the adjectives (sentiments); and (3) an attribute that tells us whether the review was positive (≥4 stars) or negative (<4 stars).
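The three attributes can be derived per review along these lines. This is only a sketch: the feature list is a subset of the nouns mentioned earlier, and the key names tagged, rating, features, adjectives and positive are assumptions made for the example rather than the column names of the original data set.

```python
FEATURES = ["ice", "door", "water", "space"]  # subset of the nouns found earlier


def annotate(review):
    """Add feature, adjective and sentiment attributes to one review dict.

    Expects 'tagged' (list of (word, tag) pairs) and 'rating' (1-5 stars);
    both key names are assumptions made for this sketch.
    """
    words = [word for word, _ in review["tagged"]]
    return {
        **review,
        "features": [f for f in FEATURES if f in words],
        "adjectives": [w for w, tag in review["tagged"] if tag.startswith("JJ")],
        "positive": review["rating"] >= 4,  # 4 stars or more counts as positive
    }


review = {
    "rating": 5,
    "tagged": [("ice", "NN"), ("maker", "NN"), ("large", "JJ"), ("quiet", "JJ")],
}
annotated = annotate(review)
print(annotated["features"], annotated["adjectives"], annotated["positive"])
# ['ice'] ['large', 'quiet'] True
```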

It is now possible to search for words and adjectives close to the features to understand what people are saying about each aspect. This process could also be narrowed to a specific brand or product just by applying a simple filter, for example to compare your brand with other brands, but for educational purposes we include every brand in this analysis.

Here I coded three different strategies for collecting the text around the feature we are looking for, so that in the next step the counting is done on the extracted text. We can take all words in a review that mentions the feature (ruleAll), only the words right next to the feature word (ruleNext), or the words within a distance of 3 words from the feature (ruleNear).
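One way to express the three strategies is sketched below. The function names rule_all, rule_next and rule_near are my own for this illustration and are not the original implementation. I also assume that “right next to” means the immediate neighbor on either side of the feature, and that the feature word itself is left out of the result.

```python
def rule_all(tokens, feature):
    """All words of the review except the feature, if the review mentions it."""
    return [t for t in tokens if t != feature] if feature in tokens else []


def rule_near(tokens, feature, distance=3):
    """Words within `distance` positions of any occurrence of the feature."""
    hits = [i for i, t in enumerate(tokens) if t == feature]
    keep = set()
    for i in hits:
        start = max(0, i - distance)
        keep.update(range(start, min(len(tokens), i + distance + 1)))
    # Return in reading order, skipping the feature word itself
    return [tokens[j] for j in sorted(keep) if tokens[j] != feature]


def rule_next(tokens, feature):
    """Words immediately before or after the feature (assumed: both sides)."""
    return rule_near(tokens, feature, distance=1)


tokens = ["love", "fridge", "large", "ice", "maker", "work", "great", "door", "small"]
print(rule_next(tokens, "ice"))  # ['large', 'maker']
print(rule_near(tokens, "ice"))  # ['love', 'fridge', 'large', 'maker', 'work', 'great']
print(rule_all(tokens, "ice"))
```

The narrower the window, the more specific the words are to the feature, at the cost of fewer words to count.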

Finally, we code the wordcloud visualization, which counts the words in the text string returned by the previous search algorithm.

And now we can visualize the results:

The first result comes from exploring the words next to the feature ice, which leads to the words maker, machine, and dispenser, indicating that many reviews are related to this aspect of the refrigerators.

A more in-depth analysis is to identify negative and positive sentiments related to the word ‘ice’, so we can get insights into consumers’ perception of the ice makers. The left wordcloud below is based on the words close to the feature ice in positive reviews. It shows relevant adjectives such as easy, big, spacious, and large, which suggest that consumers are interested in a large ice maker.

The right wordcloud captures words close to ice in negative reviews. Here the adjectives small and tiny stand out, which reinforces consumers’ interest in a large ice maker, though not one that is too big, since huge also appears near ice in bad reviews.

At the same time, it is also possible to see positive sentiments related to ice in bad reviews, like good and great, which suggests that even in negative reviews people can still indicate a good feature of the product.

This same model can be replicated for any other feature just by changing the ‘aspect’ in the function aspectSentimentWordcloud().
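The idea of splitting the words near an aspect by review sentiment can be sketched as follows. This is not the aspectSentimentWordcloud() function itself: aspect_word_counts and near_words are hypothetical helpers, the reviews are toy data with assumed keys positive and tokens, and the output is a pair of word counts that a word cloud could then be drawn from.

```python
from collections import Counter


def near_words(tokens, aspect, distance=3):
    """Words within `distance` positions of the aspect, excluding the aspect."""
    positions = set()
    for i, token in enumerate(tokens):
        if token == aspect:
            positions.update(range(max(0, i - distance), min(len(tokens), i + distance + 1)))
    return [tokens[j] for j in sorted(positions) if tokens[j] != aspect]


def aspect_word_counts(reviews, aspect, distance=3):
    """Return (positive_counts, negative_counts) of words near the aspect."""
    counts = {True: Counter(), False: Counter()}
    for review in reviews:
        counts[review["positive"]].update(near_words(review["tokens"], aspect, distance))
    return counts[True], counts[False]


# Toy reviews for illustration only
reviews = [
    {"positive": True, "tokens": ["large", "ice", "maker", "easy", "use"]},
    {"positive": True, "tokens": ["love", "large", "ice", "tray"]},
    {"positive": False, "tokens": ["ice", "maker", "small", "good", "fridge", "though"]},
    {"positive": False, "tokens": ["door", "squeak"]},
]

positive, negative = aspect_word_counts(reviews, "ice")
print(positive.most_common(1))  # [('large', 2)]
print(dict(negative))
```

Changing the aspect argument to another feature, such as door, reuses the same counting logic without any other modification.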

**(left)** Words in positive reviews near ‘ice’ by Author | **(right)** Words in negative reviews near ‘ice’ by Author

Time Series Sentiment Analysis

As we saw in the beginning, there is date information in the data set that can be used to give us insights about the distribution of positive reviews for a specific brand, product, or feature. Here I show the code to count positive reviews by date, and visualize it using plotly interactive plots.
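A rough sketch of the counting step, before any plotting, is given below. It assumes each review carries a date and a star rating under the hypothetical keys date and rating, treats 4 stars or more as positive as defined earlier, and aggregates by month, which is my own choice of granularity. The reviews are made-up toy rows.

```python
from collections import defaultdict
from datetime import date


def positive_share_by_month(reviews):
    """Map 'YYYY-MM' to the percentage of reviews rated 4 stars or more."""
    totals = defaultdict(lambda: [0, 0])  # month -> [positive, all]
    for review in reviews:
        month = review["date"].strftime("%Y-%m")
        totals[month][0] += review["rating"] >= 4
        totals[month][1] += 1
    return {m: 100 * pos / n for m, (pos, n) in sorted(totals.items())}


# Toy reviews for illustration only
reviews = [
    {"date": date(2019, 9, 3), "rating": 5},
    {"date": date(2019, 9, 17), "rating": 2},
    {"date": date(2019, 10, 1), "rating": 4},
    {"date": date(2019, 10, 9), "rating": 5},
    {"date": date(2019, 10, 20), "rating": 1},
    {"date": date(2019, 10, 28), "rating": 4},
]

for month, share in positive_share_by_month(reviews).items():
    print(month, f"{share:.0f}%")
```

The resulting month-to-percentage mapping is the series that would be handed to a plotting library.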

In the generated plot we can see that there is no clear long-term trend, as the regression line (trace1) is almost flat. Nevertheless, there are sizable swings in the positive-percentage line itself, such as from September to October 2019, when the share of positive reviews drops to about 60% and then rebounds to about 90%. This kind of analysis can reveal how consumers perceived a product that was launched during this period.
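The flatness of a trend line can be checked numerically with an ordinary least squares fit. The sketch below is an illustration, not the code behind the plot: linear_trend is a hypothetical helper, the monthly percentages are made up, and the x values are simply the positions 0, 1, 2, and so on.

```python
def linear_trend(values):
    """Ordinary least squares fit of values against their index 0..n-1.

    Returns (slope, intercept); a slope near zero means no clear trend.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Made-up monthly percentages of positive reviews, for illustration only
monthly_positive = [80.0, 84.0, 62.0, 90.0, 79.0]
slope, intercept = linear_trend(monthly_positive)
fitted = [intercept + slope * x for x in range(len(monthly_positive))]
print(f"slope per month: {slope:+.2f} points")
```

A slope of a fraction of a percentage point per month, next to swings of twenty points or more between individual months, is what a nearly flat regression line with a volatile underlying series looks like.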

Conclusion

In this article, I wanted to show how an analyst or data scientist can use Python tools to process, analyze, and visualize text data to get insights into consumers’ perceptions.

These tools can not only reduce the cost of collecting data but also surface trends and ideas that are spontaneous, since the source is a large volume of reviews and comments from satisfied or unsatisfied consumers rather than closed questions from a survey.

One can also use them to analyze open questions in surveys or in-depth interviews, although the amount of data may not be as large as it is on online platforms.

I express my gratitude to Birdie for the data set. To access the data and view the complete project, you can visit https://github.com/claudioalvesmonteiro/nlp_birdie