Understand how the model works and how to deal with its pitfalls
Text classification is a common natural language processing application. This article gives a high-level overview of some text classification applications, and then an introduction to the Naive Bayes model, a foundation of text classification.

Text Classification Applications
You may see it applied in areas such as:
- Sentiment Analysis: Here we classify text as positive, negative, or neutral. How does a platform with millions of users such as Facebook or Twitter moderate content and detect hate speech? This is a huge area of development and research nowadays.
- Spam Filtering: You are probably pretty familiar with this one – there are around 54 billion spam messages sent every day, and most of those get filtered out before we waste our time and attention on them.
- Author Attribution: Who wrote this piece of text? We could train a model to guess that it was Shakespeare!
- Genre Classification: Based on a movie script, is it a comedy or an action movie? This was at the heart of my first data science class!
- Language ID: Think of how Google Translate detects which language you typed before translating it.
These problems are solved by supervised learning models: we feed a model labeled training data so that it can learn some pattern or function that will in turn help it classify text it hasn’t seen before.
For example, movie reviews that are already labeled as negative or positive can be used as training data, giving the model information it can use to determine the sentiment of previously unseen movie reviews. But how does the model learn from its training data? What is really going on under the hood?
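For instance, a tiny labeled dataset might look something like this (the reviews below are made up for illustration):

```python
# A few made-up movie reviews labeled with their sentiment.
# Supervised training data is just (text, label) pairs like these.
training_data = [
    ("What a wonderful, heartwarming film", "positive"),
    ("The acting was wooden and the plot made no sense", "negative"),
    ("I laughed, I cried, I loved every minute", "positive"),
    ("Two hours of my life I will never get back", "negative"),
]
```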
Bag of Words
The Bag of Words (BOW) is one way we can represent text numerically. It does not take into account the order of the words in the text, but rather just keeps track of which words appear. Each piece of text (a tweet, or an excerpt from a book, for example) is assigned an array of 1’s and 0’s, each of which maps to a specific word. For example, if we encode the following quote via a BOW representation, we will end up with a large array of mostly 0’s for all of the common words in English, but 1’s for the words "shall," "I," "compare," "thee," "to," "a," "summer’s," "day," etc.
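Here is a minimal sketch of that encoding in Python (the tiny vocabulary below is made up for illustration – a real vocabulary would be built from every word in the training set):

```python
quote = "Shall I compare thee to a summer's day"

# A made-up mini vocabulary; in practice this would contain thousands of words.
vocabulary = ["a", "compare", "day", "i", "shall", "summer's", "thee", "to", "winter"]

# Lowercase and split into words -- word order is thrown away by using a set.
tokens = set(quote.lower().split())

# 1 if the vocabulary word appears in the quote, 0 otherwise.
bow_vector = [1 if word in tokens else 0 for word in vocabulary]
print(bow_vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 0]
```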

If we do this for a ton of quotes from Shakespeare, and label them as positive or negative, we now have a dataset that can be used to train a model. How will the model know whether to predict positive or negative based on our training data?
What we really want to understand is:
Given a piece of text (which we have turned into a vector of 1’s and 0’s depending on the words it contains), is it more probable that the text is positive or negative? We can find this probability through Bayes’ Rule – which is at the core of a common classifier for sentiment analysis, called a Naive Bayes model.
A quick side note: Earlier I mentioned that one important application of text classification is detecting email spam. Spam has been a huge problem since the 90s, when MAPS (Mail Abuse Prevention System) was launched. Interestingly, Paul Graham of Y Combinator wrote an influential paper in 2002 that proposed an even more effective method for filtering out spam. It included an improved version of Bayesian filtering, i.e. a Naive Bayes model!
Naive Bayes Model
Bayes’ Rule lets us find the probability of some event A given that some other event B has happened, notated P(A | B). Our Naive Bayes classifier can find the probability that Shakespeare’s quote from Sonnet 18 is positive (event A), given that it contains the words it does (shall, I, compare, thee, etc. = event B). This is the basic formula:

P(piece of text is positive | words it contains) = P(it contains those words | piece of text is positive) * P(text is positive) / P(those words appear in any text in our training data).
Going term by term, here are a few things to take note of in this formula:
- The probability that a positive text contains the words we feed into the classifier, P(B|A), is complicated to calculate if we take into account how the probabilities change given the other words in the text. For example, it may be more likely that "yeah" is included in a piece of text given that "hell" is also in the text, because "hell yeah!" is a common phrase. Maybe not from Shakespeare specifically, but as a general rule. Here we make an important simplifying assumption: all the features are independent. So to find P(B|A), we just multiply the probability that one word is in the piece of text given that it is positive, times the probability that the next word is in the piece of text given that it is positive… and so on for all of the words.
- The probability that the text is positive, P(A), is what’s called the prior. In practice, these probabilities are typically based on the training data: what percentage of the texts in our training dataset are labeled positive?
- The probability that a piece of text contains certain words, P(B), helps us adjust the overall probability for words that appear more or less often in general.
The independence assumption made above means that the probability for each feature is calculated without looking at the other features around it – it is "naive" of those features, hence the name Naive Bayes.
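Putting these pieces together, here is a toy sketch of how such a classifier scores a piece of text. The word probabilities and priors are made-up numbers for illustration; a real model would estimate them from training counts. Note that the denominator P(B) is the same for both classes, so we can skip it when we only want to know which class scores higher.

```python
import math

# Made-up per-class word probabilities, P(word | class), and class priors, P(class).
word_probs = {
    "positive": {"compare": 0.010, "summer's": 0.008, "lovely": 0.020},
    "negative": {"compare": 0.006, "summer's": 0.002, "lovely": 0.001},
}
priors = {"positive": 0.5, "negative": 0.5}

def log_score(words, label):
    # log P(class) + sum of log P(word | class), using the naive independence
    # assumption. Working in log space avoids underflow from multiplying many
    # tiny probabilities together.
    score = math.log(priors[label])
    for word in words:
        if word in word_probs[label]:
            score += math.log(word_probs[label][word])
    return score

words = ["compare", "summer's", "lovely"]
prediction = max(priors, key=lambda label: log_score(words, label))
print(prediction)  # positive
```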
Pitfalls and Mitigations
Here are a few issues that arise with Naive Bayes implementations, and what we can do to deal with them.
Repeated Words
Sometimes, our classifier can get thrown off if a word is repeated many times in a piece of text. For example, check out this tweet from July 2020. The New Jersey governor used the word "really" 19 times 😲 as he was calling out residents for having indoor parties during a pandemic.
We may be tempted to count the number of times a certain word appears in a piece of text, and assign 19 as our numerical representation for the word "really". However, each occurrence of "really" is not adding that much to our understanding of the sentiment of this tweet. Instead, we just put a 1 in our array representation of this piece of text whenever a word appears one or more times.
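A rough sketch of that binarization (the tweet words and mini vocabulary here are made up for illustration):

```python
from collections import Counter

# Imagine a tweet that repeats "really" 19 times.
tweet_words = ["really"] * 19 + ["need", "to", "stop", "the", "parties"]
counts = Counter(tweet_words)  # Counter({'really': 19, 'need': 1, ...})

# A made-up mini vocabulary. One or more occurrences becomes a single 1,
# so 19 "really"s count the same as one.
vocabulary = ["need", "parties", "really", "stop", "winter"]
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]
print(binary_vector)  # [1, 1, 1, 1, 0]
```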

Features That Don’t Appear in a Specific Class
Under our assumption of independence, we find the probability of a piece of text belonging to a certain class by multiplying the probability of a word occurring given that it’s part of a certain class, for every word in a piece of text. Sometimes, our training data for a certain class will not contain a word/feature.
For example, if none of the pieces of text that were labeled as positive in my training data contained the word "disgust," then P("disgust" appears | the piece of text is positive) = 0. 🤭 That means that whatever we multiply it by (the probabilities for the rest of the words) will also end up being 0!
To mitigate this, we can add a little bit of probability mass to each word. Typically the probability we use for each word is P = occurrences of that word in a class’s training data / total words in that class’s training data. So instead of saying the probability of "disgust" appearing is 0 / 100,000, for example, we can add two to every count, so that it instead equals 2 / (100,000 + 2 * # of distinct words) and we don’t get a 0. There is always a chance, however small 😉
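As a quick sketch, with hypothetical counts (100,000 total words in the positive training texts and 10,000 distinct words in the vocabulary):

```python
def smoothed_word_prob(word_count, total_words, vocab_size, k=2):
    # Additive smoothing: add a pseudo-count of k to every word, matching the
    # "add two to every count" trick described above.
    return (word_count + k) / (total_words + k * vocab_size)

# "disgust" never appears in the positive training texts, yet its probability
# is no longer zero, so it can't wipe out the whole product.
print(smoothed_word_prob(word_count=0, total_words=100_000, vocab_size=10_000))
# 1.6666666666666667e-05
```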
The Main Takeaway
Naive Bayes models apply Bayes’ Theorem to text classification, and can be trained fairly easily, with relatively little computation compared to other models. However, they cannot achieve the same accuracy as more advanced text classification techniques, given their reliance on the assumption that features are independent – in reality, the words we see next to each other in a piece of text depend on each other.