Naive Bayes Document Classification in Python

How well can I classify a philosophy paper based on its abstract?

Kelly Epley
Towards Data Science


Naive Bayes is a reasonably effective strategy for document classification tasks even though it is, as the name indicates, “naive.”

Naive Bayes classification makes use of Bayes theorem to determine how probable it is that an item is a member of a category. If I have a document that contains the word “trust” or “virtue” or “knowledge,” what’s the probability that it falls in the category “ethics” rather than “epistemology?” Naive Bayes sorts items into categories based on whichever probability is highest.

It’s “naive” because it treats the probability of each word appearing in a document as though it were independent of the probability of any other word appearing. This assumption is almost never true of any documents we’d wish to classify, which tend to follow rules of grammar, syntax, and communication. When we follow these rules, some words tend to be correlated with other words.

Here, I devised what I thought would be a somewhat difficult classification task: sorting philosophy articles’ abstracts. I chose sub-disciplines that are distinct, but that have a significant amount of overlap: Epistemology and Ethics. Both employ the language of justification and reasons. They also intersect frequently (e.g. ethics of belief, moral knowledge, and so forth). In the end, Naive Bayes performed surprisingly well in classifying these documents.

What is Naive Bayes Classification?

Bayes Theorem

Bayes theorem tells us that the probability of a hypothesis given some evidence is equal to the probability of the hypothesis multiplied by the probability of the evidence given the hypothesis, then divided by the probability of the evidence.

Pr(H|E) = Pr(H) * Pr(E|H) / Pr(E)
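As a quick sanity check of the formula, here is a sketch with invented numbers (all purely illustrative): suppose 40% of papers are ethics papers, the word “virtue” appears in 30% of ethics papers, and “virtue” appears in 15% of papers overall.

# Invented numbers, purely to illustrate the formula above.
pr_h = 0.4          # Pr(H): prior probability a paper is an ethics paper
pr_e_given_h = 0.3  # Pr(E|H): "virtue" appears in 30% of ethics papers
pr_e = 0.15         # Pr(E): "virtue" appears in 15% of all papers

pr_h_given_e = pr_h * pr_e_given_h / pr_e
print(pr_h_given_e)  # 0.8 -- seeing "virtue" doubles our confidence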

Since we are classifying documents, the “hypothesis” is: the document fits into category C. The “evidence” is the words W occurring in the document.

Since classification tasks involve comparing two (or more) hypotheses, we can use the ratio form of Bayes theorem, which compares the numerators of the above formula (for Bayes aficionados: the prior times the likelihood) for each hypothesis:

Pr(C₁|W) / Pr(C₂|W) = [Pr(C₁) * Pr(W|C₁)] / [Pr(C₂) * Pr(W|C₂)]

Since there are many words in a document, the formula becomes:

Pr(C₁|W₁, W₂, … Wₙ) / Pr(C₂|W₁, W₂, … Wₙ) =
[Pr(C₁) * Pr(W₁|C₁) * Pr(W₂|C₁) * … Pr(Wₙ|C₁)] /
[Pr(C₂) * Pr(W₁|C₂) * Pr(W₂|C₂) * … Pr(Wₙ|C₂)]

For example, if I want to know whether a document containing the words “preheat the oven” is a member of the category “cookbooks” rather than “novels,” I’d compare this:

Pr(cookbook) * Pr(“preheat”|cookbook) * Pr(“the”|cookbook) * Pr(“oven”|cookbook)

To this:

Pr(novel) * Pr(“preheat”|novel) * Pr(“the”|novel) * Pr(“oven”|novel)

If the probability of its being a cookbook given the presence of the words in the document is greater than the probability of its being a novel, Naive Bayes returns “cookbook”. If it’s the other way around, Naive Bayes returns “novel”.
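A minimal sketch of that comparison in code, with invented probabilities (real values would be estimated from word frequencies in a labeled training corpus):

# Invented probabilities for illustration only.
priors = {"cookbook": 0.5, "novel": 0.5}
word_probs = {
    "cookbook": {"preheat": 0.02, "the": 0.06, "oven": 0.03},
    "novel":    {"preheat": 0.0001, "the": 0.06, "oven": 0.001},
}

def naive_bayes_score(category, words):
    """Prior times the product of per-word likelihoods for one category."""
    score = priors[category]
    for w in words:
        score *= word_probs[category][w]
    return score

words = ["preheat", "the", "oven"]
scores = {c: naive_bayes_score(c, words) for c in priors}
print(max(scores, key=scores.get))  # -> 'cookbook'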

A demonstration: Classifying philosophy papers by their abstracts

1. Prepare the data

The documents I will attempt to classify are article abstracts from a database called PhilPapers. PhilPapers is a comprehensive database of research in philosophy. Since this database is curated by legions of topic editors, we can be reasonably confident that the document classifications given on the site are correct.

I selected two philosophy subdisciplines from the site for a binary Naive Bayes classifier: ethics and epistemology. From each subdiscipline, I selected a topic. For ethics, I chose the topic “Varieties of Virtue Ethics,” and for epistemology, I chose “Trust.” I collected 80 ethics and 80 epistemology abstracts.

My initial DataFrame paired each abstract with its subdiscipline: an ‘abstract’ column holding the text and a ‘category’ column holding ‘Epistemology’ or ‘Ethics’.
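A minimal sketch of assembling such a DataFrame (the two abstract snippets below are placeholders, not real data):

import pandas as pd

# Placeholder rows -- the real DataFrame held 160 abstracts collected
# from PhilPapers, 80 per category.
df = pd.DataFrame({
    'abstract': ['Virtue ethicists hold that ...', 'Trust is an attitude of ...'],
    'category': ['Ethics', 'Epistemology'],
})
print(df.head())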

Scikit Learn’s classifiers can handle string labels, but numeric labels are convenient for the metrics used below, so I assigned the label 1 to all ethics abstracts and the label 0 to all epistemology abstracts (that is, not ethics):

df['label'] = df['category'].apply(lambda x: 0 if x == 'Epistemology' else 1)

2. Split data into training and testing sets

It’s important to hold back some data so that we can validate our model. For this, we can use Scikit Learn’s train_test_split. With the default split, 25% of the data (40 of the 160 abstracts) is held out for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['abstract'], df['label'], random_state=1)

3. Convert abstracts into word count vectors

A Naive Bayes classifier needs to be able to calculate how many times each word appears in each document and how many times it appears in each category. To make this possible, the data needs to look something like this:

[0, 1, 0, …]

[1, 1, 1, …]

[0, 2, 0, …]

Each row represents a document, and each column represents a word. The first row might be a document that contains a zero for “preheat,” a one for “the” and a zero for “oven”. That means that the document contains one instance of the word “the”, but no “preheat” or “oven.”

To get our abstracts in this format, we can use Scikit Learn’s CountVectorizer. CountVectorizer creates a vector of word counts for each abstract to form a matrix. Each index corresponds to a word and every word appearing in the abstracts is represented.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', lowercase=True, stop_words='english')
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

We can use the strip_accents, token_pattern, lowercase, and stop_words arguments to exclude nonwords, numbers, articles, and other things that are not useful for predicting categories from our counts. For details, see the documentation.
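A toy example makes the resulting matrix easy to see (the three mini-documents are invented; note that get_feature_names_out replaces get_feature_names in newer Scikit Learn versions):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['Preheat the oven.',
        'She walked into the kitchen.',
        'Preheat the oven, then preheat the grill.']
toy_cv = CountVectorizer(lowercase=True, stop_words='english')
counts = toy_cv.fit_transform(docs)
print(toy_cv.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                # one row of word counts per document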

If you’d like to view the data and investigate the word counts, you can make a DataFrame of the word counts with the following code:

word_freq_df = pd.DataFrame(X_train_cv.toarray(), columns=cv.get_feature_names())
top_words_df = pd.DataFrame(word_freq_df.sum()).sort_values(0, ascending=False)

4. Fit the model and make predictions

Now we’re ready to fit a Multinomial Naive Bayes classifier model to our training data and use it to predict the test data’s labels:

from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_cv, y_train)
predictions = naive_bayes.predict(X_test_cv)
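Because Multinomial Naive Bayes just learns per-word log probabilities, we can also peek at which words pull an abstract most strongly toward each category. This sketch continues from the cv and naive_bayes objects above (feature_log_prob_ and the sorted class order [0, 1] are standard Scikit Learn behavior, but this step is an addition, not part of the original walkthrough):

import numpy as np

# feature_log_prob_ has shape (n_classes, n_words): log Pr(word | class).
# Row 0 is epistemology (label 0) and row 1 is ethics (label 1).
words = np.array(cv.get_feature_names())
log_odds = naive_bayes.feature_log_prob_[1] - naive_bayes.feature_log_prob_[0]
print('Most ethics-leaning words:', words[np.argsort(log_odds)[-10:]])
print('Most epistemology-leaning words:', words[np.argsort(log_odds)[:10]])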

5. Check the results

Let’s see how the model performed on the test data:

from sklearn.metrics import accuracy_score, precision_score, recall_score

print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))

To understand these scores, it helps to see a breakdown:

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Rows of the confusion matrix are true labels; columns are predictions.
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, square=True, annot=True, cmap='RdBu', cbar=False,
            xticklabels=['epistemology', 'ethics'], yticklabels=['epistemology', 'ethics'])
plt.xlabel('predicted label')
plt.ylabel('true label')

The accuracy score tells us: out of all of the identifications we made, how many were correct?

  • (true positives + true negatives) / total observations: (18 + 19) / 40

The precision score tells us: out of all of the ethics identifications we made, how many were correct?

  • true positives / (true positives + false positives): 18 / (18+2)

The recall score tells us: out of all of the true cases of ethics, how many did we identify correctly?

  • true positives / (true positives + false negatives): 18/(18+1)
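All three scores fall out of the confusion matrix directly. A quick sketch, using the fact that for a binary problem Scikit Learn’s cm.ravel() returns the counts in the order tn, fp, fn, tp:

# Unpack the binary confusion matrix computed above.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # (18 + 19) / 40 = 0.925
precision = tp / (tp + fp)                  # 18 / (18 + 2) = 0.9
recall = tp / (tp + fn)                     # 18 / (18 + 1) ≈ 0.947
print(accuracy, precision, recall)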

6. Investigate the model’s misses

To investigate the incorrect labels, we can put the actual labels and the predicted labels side-by-side in a DataFrame.

# Convert the numeric predictions back to readable category names.
testing_predictions = []
for i in range(len(X_test)):
    if predictions[i] == 1:
        testing_predictions.append('Ethics')
    else:
        testing_predictions.append('Epistemology')

# Pair the true labels, predicted labels, and abstracts, then make the
# true labels readable as well.
check_df = pd.DataFrame({'actual_label': list(y_test), 'prediction': testing_predictions, 'abstract': list(X_test)})
check_df.replace(to_replace=0, value='Epistemology', inplace=True)
check_df.replace(to_replace=1, value='Ethics', inplace=True)
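From there, the handful of misses can be pulled out with a simple filter:

# Keep only the rows where the prediction disagrees with the curated label.
misses = check_df[check_df['actual_label'] != check_df['prediction']]
print(misses[['actual_label', 'prediction', 'abstract']])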

Overall, my Naive Bayes classifier performed well on the test set. There were only three mismatched labels out of 40.


Kelly Epley is a Data Science Fellow at Flatiron DC. She has a Ph.D. in philosophy from the University of Oklahoma.