
Should Businesses Really Hear Their Customers’ Voices?
In a rapidly evolving, increasingly AI-driven world, businesses must constantly seek a competitive edge to remain sustainable. One way to do this is by regularly observing and analyzing customer opinions about their products and services, assessing comments from many sources both online and offline. Identifying positive and negative trends in customer feedback allows companies to fine-tune product features and design marketing strategies that meet customer needs.
Thus, customer opinions need to be discerned appropriately to find valuable insights that can help make informed business decisions.
Familiarizing Yourself with Sentiment Analysis
Sentiment analysis, a branch of natural language processing (NLP), is a popular technique because it studies the opinions, sentiments, and emotions expressed in a given text. Customer feedback contains valuable information, but its unstructured nature can make it difficult to analyze; by applying sentiment analysis to collected feedback, businesses can understand public opinion, monitor brand reputation, and improve customer experiences. Regularly analyzing customer sentiment also helps companies identify their strengths and weaknesses, decide how to steer product development, and build better marketing strategies.
Both Python and R offer powerful packages for sentiment analysis that enable businesses to uncover valuable patterns, track sentiment trends, and make data-driven decisions. In this article, we will explore how to use several R packages (quanteda, sentimentr, and textstem) to perform sentiment analysis on customer feedback by processing, analyzing, and visualizing textual data.
Adding a Real-world Context
For this tutorial, let us consider a fictional tech company, PhoneTech, that has recently launched a new budget smartphone aimed at a young audience. The company wants to gauge public perception of the new product and therefore wants to analyze customer feedback from social media, online reviews, and customer surveys.
To achieve this, PhoneTech can use sentiment analysis to find product strengths and weaknesses, guide product development, and adjust marketing strategies. For example, PhoneTech has collected feedback from various platforms: social media (e.g., informal comments like "The camera is 🔥 but battery life 😒 . #Disappointed"), online reviews (e.g., semi-structured comments such as "Amazing build quality! ⭐⭐⭐⭐ Battery could last longer, though"), and customer surveys (e.g., structured responses to questions like "What do you like/dislike about the product?").
It’s important to note that customer feedback often includes informal language, emojis, and specific terms. We can use R packages to clean, tokenize, and analyze this data in order to turn raw text into actionable business insights.
Implementing Sentiment Analysis
Next, we’ll build a model for sentiment analysis in R using the quanteda package.
1. Importing necessary packages and dataset
For evaluating sentiments in a given dataset, we need several packages: dplyr to manipulate the customer feedback data, quanteda (License: GPL-3.0) for text analysis, and quanteda.textplots to create a word cloud. Additionally, tidytext (License: MIT) provides sentiment lexicons for scoring, ggplot2 will be used for data visualization, textstem (License: GPL-2) will aid in stemming and lemmatization, sentimentr (License: MIT) will be utilized for sentiment analysis, and RColorBrewer will provide color palettes for our visualizations.
These can be easily installed with the following command:
install.packages(c("dplyr", "ggplot2", "quanteda", "quanteda.textplots",
                   "tidytext", "textstem", "sentimentr", "RColorBrewer"))
After installation, we can load the packages as:
# Load necessary R packages
library(dplyr)
library(ggplot2)
library(quanteda)
library(quanteda.textplots)
library(tidytext)
library(textstem)
library(sentimentr)
library(RColorBrewer)
Dataset for customer reviews
With a real-world dataset, this data would be scraped from various social media platforms using multiple tools. The collected feedback would include informal language, emojis, and domain-specific terms, and combining such sources allows a detailed analysis of customer sentiments and opinions across channels.
For this tutorial, however, we use a synthetic dataset generated in R that covers the points above. The dataset's 200 rows represent customer feedback (~2–3 sentences per row) from different sources and include raw text with emojis, symbols, abbreviations, etc., mimicking real-world scenarios. The sentences are generic representations of the reviews commonly seen on e-commerce or product websites (touching on keywords such as UI, design, phone features, price, battery life, and customer service support) and are combined in random patterns with emojis to create review text.
You can find the synthetic dataset generated using R on GitHub here.
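As a rough illustration of how such a dataset can be generated, the sketch below assembles short reviews from hand-written fragments. The fragments, emojis, and source labels here are invented for demonstration; this is not the actual generation script.

```r
set.seed(42)

# Illustrative fragments (invented, not from the real dataset)
positives <- c("Amazing build quality!", "The UI is sleek and fast.",
               "Battery life easily lasts a full day.")
negatives <- c("Customer service was terrible.", "Photos look grainy in low light.",
               "Would not recommend at this price.")
emojis    <- c("🔥", "😒", "⭐", "")
sources   <- c("social_media", "online_review", "survey")

# Combine 2-3 random fragments plus an emoji into one review
make_review <- function() {
  parts <- sample(c(positives, negatives), size = sample(2:3, 1))
  paste(paste(parts, collapse = " "), sample(emojis, 1))
}

synthetic <- data.frame(
  text   = replicate(10, make_review()),
  source = sample(sources, 10, replace = TRUE),
  stringsAsFactors = FALSE
)
head(synthetic, 3)
```

Scaling this idea to 200 rows with more fragment templates yields a dataset similar in spirit to the one used below.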
#load the dataset
data <- read.csv("sentiment_data.csv")
# Print the dimensions (number of rows and columns) of the dataset
dataset_size <- dim(data)
print(dataset_size)

Since the dataset has a lot of text, let’s print a few words per row for the dataset overview.
To achieve this, we’ll first define a function to extract the first few words from each feedback entry in our dataset. We’ll then randomly sample 5 rows from the dataset and apply the function to truncate the feedback text. Finally, we’ll print the resulting data frame to get an idea of the feedback text.
# Function to extract the first few words
extract_first_words <- function(text, num_words = 10) {
  if (is.na(text) || !is.character(text)) {
    return(NA)
  }
  words <- unlist(strsplit(text, "\\s+"))
  return(paste(words[1:min(num_words, length(words))], collapse = " "))
}
# Randomly sample 5 rows from the dataset
set.seed(123)
random_feedback <- data[sample(nrow(data), size = 5, replace = FALSE), ]
# Truncate each sampled entry to its first few words
random_feedback$text <- sapply(random_feedback$text, function(text) {
  truncated <- extract_first_words(text)
  paste0(truncated, "...")
})
# Print the data frame
print(random_feedback)

2. Preprocessing Text Data
Before moving to text analysis, we need to preprocess the text to ensure a clean and consistent format. Preprocessing will involve several key steps:
- Text cleaning, which includes removing punctuation, numbers, and special characters;
- Text normalization, which includes converting all text to lowercase;
- Tokenization, which splits the text into individual words, or tokens;
- Stop-word removal, which intentionally drops words that do not contribute to sentiment analysis (e.g., "the," "and"); and finally,
- Stemming or lemmatization, where words are reduced to their root forms. These steps help lessen the noise and improve the accuracy of the analysis.
Now, we’ll implement the above preprocessing steps on our dataset.
# Cleaning the dataset
corpus <- quanteda::corpus(data$text)
tokens_clean <- quanteda::tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en"))
# Convert tokens back to character vectors for lemmatization
tokens_char <- sapply(tokens_clean, function(tokens) paste(tokens, collapse = " "))
# Lemmatize using textstem's dictionary-based lemmatizer
lemmatized_texts <- lemmatize_strings(tokens_char)
In this code, we convert the dataset’s text column into a quanteda corpus object and clean it by tokenizing, which involves removing punctuation, numbers, and symbols, converting all words to lowercase, and filtering out common stopwords. Note that we deliberately avoid stemming: it applies simple rules to chop off word endings, which can produce partial or incomplete forms. For example, removing the suffix "-ing" from "amazing" yields "amaz," and "terrible" becomes "terribl." Instead, we use lemmatization, a more sophisticated process that relies on dictionaries to map words and considers the context and part of speech of each word to return its base or dictionary form, such as changing "running" to "run," ensuring consistency in the text analysis.
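The difference is easy to see with textstem itself, which provides both stem_strings() and lemmatize_strings(). A quick comparison on a few words (outputs will depend on textstem's bundled dictionaries):

```r
library(textstem)

words <- c("amazing", "running", "terrible", "batteries")

# Rule-based stemming chops suffixes and can truncate words awkwardly
stem_strings(words)

# Dictionary-based lemmatization maps words to proper base forms
lemmatize_strings(words)
```

Running both on the same vector makes it clear why lemmatization is the safer choice when the resulting terms must remain readable in plots and word clouds.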
Now that we have cleaned and tokenized the text data, we can move on to the next step. Our goal is to analyze the sentiments in the feedback entries. We will use the sentimentr package to evaluate the sentiments in our structured data, providing insights into the emotional tone of the feedback entries.
3. Performing Sentiment Analysis Using Sentimentr package
Now, we can perform sentiment analysis on these sentences with the sentiment() function from the sentimentr package. This function calculates sentiment scores for each piece of text, identifying positive and negative words.
Next, we summarize the sentiment scores for each document. We group the scores by document and calculate the total number of positive and negative words. We also calculate a compound score and categorize the overall sentiment as either positive or negative.
# Perform sentiment analysis using sentimentr
sentiment_scores <- sentiment(lemmatized_texts)
# Summarize sentiment scores for each document
sentiment_summary <- sentiment_scores %>%
  group_by(element_id) %>%
  summarize(
    positive_words = sum(sentiment > 0),
    negative_words = sum(sentiment < 0),
    compound = sum(sentiment)
  ) %>%
  mutate(
    sentiment = ifelse(compound > 0, "Positive", "Negative")
  )
Finally, we merge this sentiment summary with the original text data and print the results. This gives us a clear, concise evaluation of the sentiment in our dataset.
# Merge with original text for context using row number as a common column
sentiment_summary <- sentiment_summary %>%
  mutate(doc_id = as.character(element_id)) %>%
  left_join(data %>% mutate(doc_id = as.character(1:nrow(data))), by = "doc_id") %>%
  select(text, positive_words, negative_words, compound, sentiment)
# Print the sentiment evaluation table
print(sentiment_summary)

The output table shows the positive and negative word counts per review, along with the compound score and the predicted sentiment. At a glance, the model does a reasonably good job of separating positive and negative reviews. Some reviews that look clearly negative (e.g., "Would not recommend…") may nevertheless be labelled positive: because the table truncates the review text, the full review quite likely contains more positive keywords (satisfies, best, good, etc.) that tipped the model’s evaluation. Such reviews should be inspected separately before being included in the interpretation of results for decision-making.
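One simple way to surface such cases is to flag reviews that contain both positive and negative evidence but only a weak overall score, and route those for manual review. The sketch below uses a small toy stand-in for the summary table, and the 0.2 cut-off is an arbitrary illustrative choice, not a sentimentr default:

```r
# Toy stand-in for the sentiment_summary table built above
reviews <- data.frame(
  text = c("Great camera, terrible battery", "Love it!", "Would not recommend"),
  positive_words = c(1, 1, 0),
  negative_words = c(1, 0, 1),
  compound = c(0.05, 0.60, -0.45),
  stringsAsFactors = FALSE
)

# Keep rows with mixed evidence and a near-zero compound score
borderline <- subset(reviews,
                     positive_words > 0 & negative_words > 0 & abs(compound) < 0.2)

borderline$text  # only the mixed review surfaces
```

Applied to the real sentiment_summary, the same filter gives a short list of ambiguous reviews worth reading in full.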
Next, we build a Document-Feature Matrix (DFM), a structured representation of the text where rows represent documents and columns represent features (words). Each cell contains the frequency of a specific word in a document. Here, the cleaned tokens are transformed into a DFM, making the data ready for statistical analysis and visualization.
# Create a document-feature matrix (DFM) from the cleaned tokens
dfm <- dfm(tokens_clean)
To recap, the summary step above calculated sentiment metrics for each text entry: positive and negative word counts are summed, and the compound score is the sum of the sentence-level sentiment scores. A positive compound score indicates positive sentiment and a negative score indicates negative sentiment. This information is combined with the original text for a comprehensive sentiment evaluation.
4. Analyzing Sentiment Proportions
# Evaluate sentiment proportions as percentages
sentiment_proportion <- sentiment_summary %>%
  group_by(sentiment) %>%
  summarise(count = n()) %>%
  mutate(proportion = count / sum(count) * 100)
print(sentiment_proportion)

To understand the overall sentiment distribution, we calculate the proportions of positive and negative sentiments in the dataset. Grouping by sentiment type, the count of entries in each category is calculated and normalized to derive their proportions.
5. Visualizing Sentiment Distribution
We’ll create a bar chart with ggplot2 to visualize the proportions of positive and negative sentiments, giving an intuitive view of the sentiment distribution and making it easy to see which sentiment dominates.
# Plot sentiment distribution as percentages
ggplot(sentiment_proportion, aes(x = sentiment, y = proportion, fill = sentiment)) +
  geom_bar(stat = "identity", width = 0.7) +
  scale_fill_manual(values = c("Positive" = "blue", "Negative" = "red")) +
  labs(title = "Distribution of Sentiments",
       x = "Sentiment Type",
       y = "Percentage",
       fill = "Sentiment") +
  theme_minimal() +
  theme(panel.grid = element_blank())

In our dataset, positive sentiment dominates, indicating that a larger proportion of customers are happy with PhoneTech’s product.
6. Visualizing Top Terms
# Plotting top 10 terms
top_terms <- topfeatures(dfm, 10)
bar_colors <- colorRampPalette(c("lightblue", "blue"))(length(top_terms))
# Barplot
barplot(top_terms, main = "Top 10 Terms", las = 2, col = bar_colors, horiz = TRUE, cex.names = 0.7)

The most frequent terms in the dataset include "recommend," "design," "smartphone," "display," and "terrible." Such words are not very useful on their own for understanding sentiment, but PhoneTech personnel could dig deeper into how these words are associated with the product in the reviews and build additional plots that show the context in which they are used.
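quanteda's kwic() (keyword-in-context) function is one convenient way to see how a top term is actually used; the two example sentences below are invented for illustration:

```r
library(quanteda)

# A top term such as "terrible" says little on its own;
# kwic() shows the words around each occurrence
toks <- tokens(c("The display is terrible in direct sunlight",
                 "Terrible customer service, but great design"))
kwic(toks, pattern = "terrible", window = 3)
```

Run on the full cleaned tokens object, this would reveal whether "terrible" attaches to the display, the battery, or customer service.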
So, let’s filter out the positive feedback, create a DFM, and plot again to see what customers are really saying about the product.
# Filter positive feedback
positive_feedback <- sentiment_summary %>%
  filter(sentiment == "Positive")
# Create a DFM for positive feedback
positive_tokens <- quanteda::tokens(positive_feedback$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en"))
positive_dfm <- dfm(positive_tokens)
# Plot top 5 terms with a gradient
top_positive <- topfeatures(positive_dfm, 5)
bar_colors <- colorRampPalette(c("lightblue", "blue"))(length(top_positive))
# Plot with gradient
barplot(top_positive, main = "Top 5 Positive Terms", las = 2, col = bar_colors, horiz = TRUE, cex.names = 0.7)

Performance, smartphone (likely referring to the product itself), display, and design appear to be the most talked-about terms in the positive feedback.
Another way to visualize these sentiments is by generating a word cloud, fine-tuning the number of displayed words with the max_words parameter as needed.
7. Generating a Word Cloud
# Word cloud
textplot_wordcloud(dfm, max_words = 200, color = RColorBrewer::brewer.pal(8, "Reds"), min_size = 0.5, max_size = 8)
A word cloud displays the most frequent terms in an engaging and intuitive format: larger words indicate higher frequencies, which makes this plot particularly useful for quickly identifying key themes in the dataset.

For the PhoneTech team, it might be worth considering two separate positive and negative word clouds to understand better what the most loved feature of the product is and what the pain point is.
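A minimal sketch of that idea, using a toy stand-in for the sentiment summary built above (the review texts and colors are illustrative choices):

```r
library(dplyr)
library(quanteda)
library(quanteda.textplots)

# Toy stand-in for the sentiment_summary table built earlier
feedback <- data.frame(
  text = c("Amazing design and gorgeous display",
           "Terrible battery and grainy camera"),
  sentiment = c("Positive", "Negative"),
  stringsAsFactors = FALSE
)

# One cloud per sentiment class to contrast loved features and pain points
for (label in c("Positive", "Negative")) {
  toks <- quanteda::tokens(feedback$text[feedback$sentiment == label],
                           remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(stopwords("en"))
  textplot_wordcloud(dfm(toks), max_words = 100,
                     color = if (label == "Positive") "darkgreen" else "firebrick")
}
```

On the real data, substituting sentiment_summary for the toy data frame yields the two contrasting clouds.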
8. Sampling and Reviewing Sentiments
Finally, we’ll print five random sentences from the dataset to inspect their sentiment evaluation results. This will help us validate the sentiment analysis outputs and gain insights into individual entries.
# Sample 5 sentences from the dataset
sample_indices <- sample(1:nrow(sentiment_summary), 5)
sample_sentiment_summary <- sentiment_summary[sample_indices, ]
# Print the sample sentences
print(sample_sentiment_summary)

All the above steps form a comprehensive pipeline for analyzing textual data and extracting valuable insights. Together, they transform raw text into actionable information, supporting data-driven decision-making for any company.
Interpreting Sentiment Analysis Results
It is crucial to assess and evaluate the findings of the sentiment analysis correctly. For this, we generated a Document-Feature Matrix (DFM) to find the top words and the overall sentiment distribution, helping us understand the overall customer mood and identify patterns in the feedback. Additionally, our model generated sentiment scores that convey the tone of the reviews.
For example, if PhoneTech finds that 68% of the feedback is positive, with top words such as "design" and "performance," this highlights key selling points for marketing. Conversely, if the remaining 32% of reviews, the negative comments, mention customer service and poor photos, these indicate potential areas for improvement.
Comparing sentiment trends over time or across sources, such as social media versus online reviews, helps identify shifts in customer perception. An accurate interpretation is important for making informed decisions and developing targeted strategies.
While the model seems to effectively identify positive and negative reviews, further steps can involve fine-tuning the model to sort neutral reviews, if any, for a more comprehensive analysis.
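One simple way to sketch such an extension is a neutral band around zero; the 0.05 cut-off below is an illustrative choice, not a sentimentr default:

```r
# Three-way labelling: compound scores near zero become "Neutral"
classify_sentiment <- function(compound, band = 0.05) {
  ifelse(compound > band, "Positive",
         ifelse(compound < -band, "Negative", "Neutral"))
}

classify_sentiment(c(0.8, -0.3, 0.01))
# "Positive" "Negative" "Neutral"
```

Swapping this function into the mutate() step of the summary pipeline would produce three sentiment classes instead of two; the band width should be tuned against manually labelled examples.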
Applying Sentiment Insights to Fine-tune Strategy
The sentiment analysis has revealed key strengths and areas of improvement for PhoneTech’s product that can be leveraged to enhance its business. By addressing both positive and negative customer feedback, PhoneTech can drive overall satisfaction and attract more buyers.
Based on sentiment analysis results, PhoneTech could identify the following actionable insights and strategies to improve its business:
Strategies Based on Positive Feedback:
(1) Refine Marketing Strategies:
- Customers seem to be happy with the sleek and fast UI.
- Positive feedback on the UI design indicates that this is a key selling point, which PhoneTech should continue promoting in their marketing campaigns to attract more buyers.
Strategies Based on Negative Feedback:
(1) Enhance Product Features:
- Frequent complaints about image quality suggest an issue with either the hardware or software.
- Improving these areas quickly can enhance the user experience and reduce negative reviews.
(2) Address Customer Service Issues:
- Handling customer service issues and resolving them promptly will boost product satisfaction.
- These actions can prevent or reduce negative reviews while ensuring a better user experience and increasing overall reliability.
Best Practices in Sentiment Analysis
- Text Context: As lexicon-based sentiment analysis often misses sarcasm and context, using advanced techniques like machine learning helps better capture nuances.
- Domain-Specific Language: As general lexicons may not understand industry-specific terms and slang, tailoring lexicons to include technical terms relevant to the industry improves accuracy.
- Use of Informal Language and Emojis: Since informal language and emojis can be challenging to analyze, using tools like quanteda to clean and systematically analyze the data is beneficial.
- Combining Techniques: As relying on one method limits analysis depth, combining text processing with machine learning provides comprehensive insights.
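On the domain-specific point, sentimentr allows extending its default polarity lexicon. The sketch below adds two invented phone-related terms via update_key(); the words and scores are illustrative, and the exact interface should be checked against ?update_key:

```r
library(sentimentr)
library(lexicon)

# Extend the default polarity lexicon with domain terms
# ("laggy" and "bloatware" plus their scores are invented examples)
custom_key <- update_key(
  lexicon::hash_sentiment_jockers_rinker,
  x = data.frame(x = c("laggy", "bloatware"), y = c(-1, -0.75))
)

# Score a review against the customized lexicon
sentiment("The UI feels laggy", polarity_dt = custom_key)
```

With the custom key, tech-specific complaints that a general lexicon would miss now contribute negative weight to the compound score.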
Key Takeaways
- Sentiment analysis helps businesses understand customer opinions to improve products and services.
- The R packages quanteda, sentimentr, and textstem work well together for text analysis of customer reviews.
- The outlined approach to sentiment analysis can be easily applied across industries like finance, healthcare, and retail for actionable insights.
Conclusion
Sentiment analysis gives businesses a clear idea about their customer needs and pain points. Companies can leverage insights to improve products and craft data-driven strategies.
In this article, we explored how R packages can help with sentiment analysis of customer feedback for a tech product. We described the business context and walked through the main steps, including data collection and preparation, corpus creation, tokenization, feature extraction, sentiment scoring, and visualization of results, to implement sentiment analysis in R. We also considered the outcomes of the analysis that the company should weigh when further refining the product.
Companies in other domains looking to gain actionable insights, enhance product features, refine marketing strategies, and monitor brand reputation effectively could take a very similar approach to sentiment analysis.