As data scientists, our jobs are to deliver tangible, bottom-line results to the business. While I’d love to train neural networks all day, it’s critical that we build solid relationships with our business units and find ways to deliver easily understandable and quantifiable value.
Unless you’re working for a big tech company, chances are that your team has plenty of use cases for analytics. In this short tutorial, I’m going to give you the code and overview of how you can leverage basic NLP (Natural Language Processing) to deliver real, communicable, valuable analytics.
For tutorials on gathering requisite text data from the internet (and some warnings) check out my articles on web crawling:
Part 1, for complete beginners, can be found [here](https://medium.com/analytics-vidhya/automated-browsers-scraping-and-crawling-part-2-cc9e2149a64), and Part 2, where we take a more object-oriented and reusable approach, can be found here.
TL;DR – Just Give Me The Code
The dataset can be found here (there's a link in the code as well).
Let’s Break it Down: Corpus
The corpus is the core data structure we'll work with for NLP. Since we're passing a character vector to the Corpus() function, we need to specify that the source is a vector source. The inspect() function lets you take a look at the newly created corpus.
Following the creation of the corpus, we clean up the text data using some common methods. Our cleaner() function handles basic manipulations like removing numbers, but it also removes all stop words, which are words that carry little to no meaning on their own; "the" and "a" are classic examples.
Note that removing stop words can discard some information. This is where analysis becomes more art than science: you may opt to remove only the basic stop words, or you may decide to also strip industry-specific terms, and so on.
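Here's a minimal sketch of what that looks like with the tm package. The object name reviews_text and the exact steps inside cleaner() are placeholders for whatever your script actually uses:

```r
library(tm)

# reviews_text is assumed to be a character vector with one element per review
corpus <- Corpus(VectorSource(reviews_text))
inspect(corpus[1:3])  # peek at the first few documents

# One possible cleaner() along the lines described above
cleaner <- function(x) {
  x <- tm_map(x, content_transformer(tolower))
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, removeWords, stopwords("english"))
  tm_map(x, stripWhitespace)
}

clean_corpus <- cleaner(corpus)
```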


TDM & Word Cloud
When we created our corpus, R essentially encoded every review as a separate text document. A Term Document Matrix (TDM) takes a corpus and builds a frequency table with one row per unique term and one column per document, so every document becomes a word-count vector. Here's an example of 3 dummy reviews made into a TDM:
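If the figure isn't handy, you can reproduce the idea in a few lines; the three example reviews below are made up purely for illustration:

```r
library(tm)

# Three made-up reviews, just to show the shape of a TDM
dummy_reviews <- c("great drug with no side effects",
                   "terrible side effects",
                   "the side effects were mild")

dummy_tdm <- TermDocumentMatrix(Corpus(VectorSource(dummy_reviews)))
inspect(dummy_tdm)  # rows = terms, columns = the 3 documents, cells = counts
```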

From this TDM, we can build a plain matrix, sort it by row sums (word frequency), and then visualize the results using a word cloud. This is the first part of our analysis.
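A hedged sketch of that step, assuming the cleaned corpus from earlier and the wordcloud package:

```r
library(wordcloud)

tdm <- TermDocumentMatrix(clean_corpus)

# Rank every word by its total frequency across all reviews
tdm_matrix <- as.matrix(tdm)
word_freq  <- sort(rowSums(tdm_matrix), decreasing = TRUE)

set.seed(1)  # reproducible cloud layout
wordcloud(names(word_freq), word_freq, max.words = 100, random.order = FALSE)
```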

Word Correlations
The word cloud is a good starting point, but it's a very basic one. We can enhance our analysis by evaluating correlations between words. Word correlations use the same methodology as R's base cor() function, with Pearson correlation being the default. Here's the Pearson correlation formula:
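$$
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$

In our setting, $x_i$ and $y_i$ are the counts of two words in review $i$, and $\bar{x}$ and $\bar{y}$ are their average counts across all $n$ reviews.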

Since our TDM is vectorized, R can perform these comparisons efficiently. In our master code, we need to enter a term of interest and a lower correlation bound. Don't be surprised if most word correlations fall under ~15%. In fact, 15% can be a meaningful correlation given the sheer size of these datasets. Even with our tiny (by ML standards) dataset, our TDM holds roughly 2.3M elements (3,107 reviews × 750 terms).
Let’s look at a correlation graph between the word "Pain" and the rest of our TDM, imposing a lower correlation bound of 18%:
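Before reading the graph, here's a minimal sketch of how those numbers can be pulled out with tm's findAssocs(); the plotting here is just a base barplot, and your own script may chart it differently:

```r
# Correlation of every other term in the TDM with "pain", keeping only those >= 0.18
pain_assocs <- findAssocs(tdm, terms = "pain", corlimit = 0.18)
pain_assocs$pain  # named vector of correlations, sorted high to low

barplot(pain_assocs$pain, names.arg = names(pain_assocs$pain),
        las = 2, main = "Terms correlated with 'pain' (r >= 0.18)")
```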

The graph gives us a few insights right away. For example, we see that "nerve" correlates with "pain" at 22%. This can have direct, immediate business impact: we have efficiently summarized over 3,000 reviews and pulled out a common ailment.
Sentiment Analysis
While there are many complex ways to perform sentiment analysis, we will focus on reasonably performing, out-of-the-box methods in this article. For a detailed overview of sentiment analysis, check out this Wikipedia page. To get started quickly, we essentially leverage large lists of pre-labeled negative/positive words from Microsoft's sentiment lexicon. We left join the lexicon with our list of words and then simply calculate the ratio of positive to negative words.
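Here's a minimal sketch of that join-and-ratio step, reusing the word_freq vector from the word-cloud section. The lexicon object is an assumption on my part: I'm treating it as a data frame with a word column and a sentiment column holding "positive"/"negative" labels, so adjust the join to match however your lexicon file is actually structured:

```r
library(dplyr)

# Word frequencies as a data frame (word_freq comes from the word-cloud step)
freq_df <- data.frame(word = names(word_freq), freq = as.numeric(word_freq))

# Left join the lexicon onto our words, then keep only the words it labeled
scored <- freq_df %>%
  left_join(lexicon, by = "word") %>%   # `lexicon` format is assumed, see note above
  filter(!is.na(sentiment))

# Ratio of positive to negative word occurrences across all reviews
pos_neg_ratio <- sum(scored$freq[scored$sentiment == "positive"]) /
  sum(scored$freq[scored$sentiment == "negative"])
pos_neg_ratio
```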

N-Grams
But wait, there's more! While we could take a closer look at individual word frequencies, that is essentially just a bar-graph or table version of what we already visualized in the word cloud. Single words also lose information. Similar to what we saw in the discussion of stop words, this approach can drop important qualifiers from a sentence. For example, "fan" could be one of our top words, while half of the sentences containing it actually read "I was not a…"

We can counter this loss of information with N-gram analysis. An N-gram is simply a sequence of N consecutive words in our text. All we need to do in this code template is change the "n" in our tokenizer to the size we want. Here are the results after running a bigram and a trigram analysis and cleaning up the output.
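One common way to do this with tm is to pass a custom tokenizer built on the NLP package (a tm dependency). Nothing below is specific to this dataset, and note that tm's fast SimpleCorpus ignores custom tokenizers, so this sketch rebuilds the reviews as a VCorpus first:

```r
library(tm)

# SimpleCorpus (what Corpus() usually returns) ignores custom tokenizers,
# so rebuild and re-clean the reviews as a VCorpus for the n-gram pass
ngram_corpus <- cleaner(VCorpus(VectorSource(reviews_text)))

# Tokenizer that splits each document into sequences of n consecutive words
ngram_tokenizer <- function(n) {
  function(x) unlist(lapply(NLP::ngrams(NLP::words(x), n), paste, collapse = " "),
                     use.names = FALSE)
}

bigram_tdm  <- TermDocumentMatrix(ngram_corpus, control = list(tokenize = ngram_tokenizer(2)))
trigram_tdm <- TermDocumentMatrix(ngram_corpus, control = list(tokenize = ngram_tokenizer(3)))

# Top 10 most frequent bigrams across all reviews
head(sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE), 10)
```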

Great, How Do I Sell It?
There are a few great advantages of this code template, both technical and non-technical:
- Combat Bias & Salience: One of the most important advantages of a programmatic approach is removing the human element. It's common for companies to have a single point person reviewing reported drug side effects, internal surveys, etc. This is a problem. People are biased, whether they know it or not, and subconscious bias can make one type of comment stick in the reviewer's mind or directly influence how results are reported. A program like this can be easily scaled and easily mass-shared (see the last point below), eliminating those siloed bias and salience effects.
- Who Doesn’t Like Word Clouds? The resulting visuals are easily understandable, they look nice, and are a great way to build business-side confidence in your analytics.
- Speed: Sure, it's not completely optimized, but this code is fast. Start-to-finish, this program runs in 6.25 seconds on my PC. Granted, my PC is a bit faster than most work machines, but not by that much. A person physically combing through these 3,100 reviews could take days.

- No Analytics Infrastructure? No Problem! This template can be recycled for any text analysis with little to no user interaction. Infrastructure helps but isn’t really necessary. If we wanted to do a ‘soft deploy’ of this type of code:
- Add whichever export conditions you want to see (graphs, n-grams, etc.)
- Create a .bat file that references this script (a sketch follows below)
- Schedule a Windows task to run the .bat every "X" days as the data is updated, or just click it manually
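For illustration only, a two-line batch file along those lines; the R install path and script location are hypothetical, so point them at your own:

```bat
:: run_nlp_template.bat  (hypothetical file names and paths)
"C:\Program Files\R\R-4.3.1\bin\Rscript.exe" "C:\scripts\nlp_template.R"
```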