NLP: Text Data Visualization

Namrata Kapoor
Towards Data Science
4 min read · Mar 27, 2021

“Data will talk to you if you are willing to listen.” – Jim Bergeson

Text data visualization has many advantages: you can quickly pick out the most frequently used words to understand what a text is about, chart the number of positive and negative reviews across an entire data set, break ratings down user-wise and product-wise, show the relations between parts of speech, and much more.

Now let us see how to do it. Amazon is a big retail brand, and its products attract a lot of reviews. Let's get this data set from Kaggle and get started.
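
If you are working outside a Kaggle notebook, one way to pull the data (a sketch, assuming the snap/amazon-fine-food-reviews dataset slug and a configured Kaggle API token) is:

import kaggle
# Downloads and unzips Reviews.csv into the path used below
kaggle.api.dataset_download_files('snap/amazon-fine-food-reviews',
                                  path='../input/amazon-fine-food-reviews',
                                  unzip=True)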

Import a few important libraries for the task:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
from wordcloud import WordCloud

Make a data frame from reviews CSV:

df = pd.read_csv('../input/amazon-fine-food-reviews/Reviews.csv')

Let’s visualize the data:

df.head(10)
Image by Author

Dropping null values if any:

print(df.shape)
print(df.isnull().values.any())
df.dropna(axis=0, inplace=True)
print(df.shape)

Dropping duplicates:

df.drop_duplicates(subset=['Score','Text'],keep='first',inplace=True)
print(df.shape)
df.head(10)

Visualizing total count of scores:

plt.figure(figsize=(10,10))
ax = sns.countplot(x="Score", data=df, order=df["Score"].value_counts().index)
# Label each bar with its count; patches follow the same order as value_counts()
for p, label in zip(ax.patches, df["Score"].value_counts()):
    ax.annotate(label, (p.get_x() + 0.25, p.get_height() + 0.5))
Image by Author

Group by ProductId and keep only products with at least 400 reviews:

df.groupby('ProductId').count()
df_products = df.groupby('ProductId').filter(lambda x: len(x) >= 400)
df_product_groups = df_products.groupby('ProductId')
# Number of reviews kept and number of distinct products
print(len(df_products))
print(len(df_product_groups))

Plot the scores product-wise:

plt.figure(figsize=(20,20))
sns.countplot(y="ProductId", hue="Score", data=df_products);
Image by Author

Group by UserId and keep only users who gave at least 100 reviews:

df.groupby('UserId').count()
df_users = df.groupby('UserId').filter(lambda x: len(x) >= 100)
df_userGroup = df_users.groupby('UserId')
print("Number of Users:" + str(len(df_userGroup)))
df_products = df_users.groupby('ProductId')
print("Number of products:" + str(len(df_products)))

Plotting users as per their given score ratings:
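
The original snippet for this plot is not shown; a minimal sketch that mirrors the product-wise plot above (reusing df_users from the previous step) would be:

plt.figure(figsize=(20,20))
sns.countplot(y="UserId", hue="Score", data=df_users);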

Image by Author

Now let’s see which are the words used mostly in positive reviews and the most used words in negative reviews.

For this, import the NLTK libraries needed for data cleaning:

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
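
If NLTK is being run for the first time, its tokenizer, lemmatizer, and stopword resources need a one-time download:

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')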

Make functions for removing stopwords, lemmatizing, and cleaning the text:

def remove_Stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text.lower())
    sentence = [w for w in words if not w in stop_words]
    return " ".join(sentence)

def lemmatize_text(text):
    wordlist = []
    lemmatizer = WordNetLemmatizer()
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        for word in words:
            wordlist.append(lemmatizer.lemmatize(word))
    return ' '.join(wordlist)

def clean_text(text):
    # Map every punctuation character to the empty string
    delete_dict = {sp_character: '' for sp_character in string.punctuation}
    delete_dict[' '] = ' '
    table = str.maketrans(delete_dict)
    text1 = text.translate(table)
    textArr = text1.split()
    text2 = ' '.join([w for w in textArr])
    return text2.lower()
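
As a quick sanity check (an illustrative example, not from the original post), the helpers can be tried on a sample sentence:

sample = "The dogs were barking loudly, but they were not dangerous!"
print(clean_text(sample))        # punctuation stripped and lowercased
print(remove_Stopwords(sample))  # common words such as 'the' and 'were' removed
print(lemmatize_text(sample))    # plural 'dogs' reduced to 'dog'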

Segregate positive and negative reviews:

# Scores 1 and 2 are treated as negative; 3, 4, and 5 as positive
mask = (df["Score"] == 1) | (df["Score"] == 2)
df_rating1 = df[mask].copy()
mask = (df["Score"] == 3) | (df["Score"] == 4) | (df["Score"] == 5)
df_rating2 = df[mask].copy()
print(len(df_rating1))
print(len(df_rating2))

Clean the text: strip punctuation, remove stopwords, lemmatize, and record the word count:

df_rating1['Text'] = df_rating1['Text'].apply(clean_text)
df_rating1['Text'] = df_rating1['Text'].apply(remove_Stopwords)
df_rating1['Text'] = df_rating1['Text'].apply(lemmatize_text)
df_rating2['Text'] = df_rating2['Text'].apply(clean_text)
df_rating2['Text'] = df_rating2['Text'].apply(remove_Stopwords)
df_rating2['Text'] = df_rating2['Text'].apply(lemmatize_text)
df_rating1['Num_words_text'] = df_rating1['Text'].apply(lambda x:len(str(x).split()))
df_rating2['Num_words_text'] = df_rating2['Text'].apply(lambda x:len(str(x).split()))

WordCloud view of negative reviews (built from the short Summary field):

wordcloud = WordCloud(background_color="white", width=1600, height=800).generate(' '.join(df_rating1['Summary'].tolist()))
plt.figure(figsize=(20,10), facecolor='k')
plt.imshow(wordcloud)
plt.axis("off")
Image by Author

WordCloud view of positive reviews:

wordcloud = WordCloud(background_color="white", width=1600, height=800).generate(' '.join(df_rating2['Summary'].tolist()))
plt.figure(figsize=(20,10), facecolor='k')
plt.imshow(wordcloud)
plt.axis("off")
Image by Author

Let us see how to visualize the relations between parts of speech with a dependency parse.

For this, import spaCy and load its small English model:

import spacy
nlp=spacy.load('en_core_web_sm')
from spacy import displacy
doc=nlp(u'The blue pen was over the oval table.')

Visualize as below:

displacy.render(doc, style='dep')
Image by Author
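
Note that displacy.render draws inline in a Jupyter notebook. When running from a plain script, one option (a small sketch) is to request the raw markup and save it as an SVG file:

# With jupyter=False, render returns the SVG markup as a string
svg = displacy.render(doc, style='dep', jupyter=False)
with open('dep_tree.svg', 'w', encoding='utf-8') as f:
    f.write(svg)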

Now let’s fill in a few colors to this representation with some options:

doc1 = nlp(u'I am Namrata Kapoor and I love NLP.')
options = {'distance': 110, 'compact': True, 'color': 'white', 'bg': '#FF5733', 'font': 'Times'}
displacy.render(doc1, style='dep', options=options)
Image by Author

Conclusion

We have seen here a few techniques of text visualization using WordCloud, Seaborn, and Matplotlib. There is more that can be explored by running sentiment analysis on the reviews and digging deeper with rules that clearly identify whether a review was given for the product or for the delivery.

Also, stopwords like 'not' flip the meaning of the words around them. This has to be handled, for example by replacing such negations with antonyms, or by keeping them out of the stopword list, before removing stopwords and visualizing the result in a word cloud.
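
As an illustrative sketch (not from the original post), the simplest variant is to keep negation words out of the stopword set before filtering:

# Keep negations so that sentiment-bearing phrases survive stopword removal
negations = {'not', 'no', 'nor', "n't"}
stop_words = set(stopwords.words('english')) - negations
words = word_tokenize("This product is not good".lower())
print(" ".join(w for w in words if w not in stop_words))
# prints 'product not good' rather than 'product good'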

These are a few of the ways in which the approach above could be improved.

Thanks for reading!

Originally published at https://www.numpyninja.com on March 27, 2021.
