
With so much fake news in circulation, it is difficult to find sources of accurate, unfabricated news. This article uses the Naive Bayes classifier to classify real and fake news.
What is the Naive Bayes Classifier:
The Naive Bayes Classifier is a probabilistic algorithm that uses Bayes' theorem to classify data. Let's look at an example:
Suppose that you wanted to predict the probability that it would rain today: In the last few days, you have collected data by looking at the clouds in the sky. Here is the table of your data:

This table represents the number of times a certain feature appears, given that it rained or it didn't. From these counts we can derive the probability of grey clouds or white clouds appearing, given that it rained or not.
Now armed with data, let’s make a prediction. Today we have seen grey clouds and no white clouds, is it more likely for it to be a rainy day or a sunny day? To answer this question, we have to use Bayes Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
This theorem uses past data to make better decisions.
The probability of rain, given that grey clouds appeared, is equal to the probability of grey clouds appearing, given that it rained, multiplied by the probability of rain, divided by the probability of grey clouds appearing.
Based on our data:
P(B|A) (Probability of grey clouds, given that it rained) = 10/11
P(A) (Probability of rain) = 11/(55+11) = 11/66 = 1/6
P(B) (Probability of grey clouds) = 1 (grey clouds are known to have appeared)
P(A|B) = P(B|A) * P(A) / P(B)
P(A|B) = 10/11 * 1/6 / 1
P(A|B) = 10/66
This is our result! Given that grey clouds appeared, the probability of rain is 10/66; that is, out of 66 days with the same conditions, it would rain on 10 of them.
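The arithmetic above can be checked with a few lines of Python, plugging the example's counts straight into Bayes' theorem:

```python
# Numbers taken from the worked example above
p_grey_given_rain = 10 / 11  # P(B|A): grey clouds on 10 of the 11 rainy days
p_rain = 11 / 66             # P(A): 11 rainy days out of 66 days total
p_grey = 1                   # P(B): grey clouds are known to have appeared

p_rain_given_grey = p_grey_given_rain * p_rain / p_grey
print(p_rain_given_grey)  # 10/66, about 0.1515
```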
The Project:
With that brief introduction to Naive Bayes Classifiers, let’s talk about fake news detection with Naive Bayes Classifiers.
We will count the number of times a word appears in a headline, given that the news is fake, convert that count into a probability, and then calculate the probability that the headline is fake, as compared to the headline being real.
The dataset I used has over 21,000 instances of real news and 23,000 instances of fake news. For a typical dataset this might seem unbalanced, but the imbalance is necessary to calculate the initial probability: the probability of a headline being fake before considering its content. You can contact me for the dataset at [email protected].
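That initial probability (the prior) works out as a simple ratio of the two dataset sizes; here is a quick check using the rounded figures quoted above:

```python
# Prior probability of a headline being fake, from the approximate dataset sizes
real_count = 21000
fake_count = 23000
p_fake_prior = fake_count / (real_count + fake_count)
print(round(p_fake_prior, 3))  # 0.523
```

So even before reading a headline, the model leans very slightly toward "fake", which is exactly why the imbalance matters.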
The Code:
import pandas as pd
import string
These are the two dependencies for the program: pandas is to read the csv file, and string is to manipulate the casing of the words.
true_text = {}
fake_text = {}
true = pd.read_csv('/Users/XXXXXXXX/Desktop/True.csv')
fake = pd.read_csv('/Users/XXXXXXXX/Desktop/Fake.csv')
This script reads the two datasets, containing the instances of fake and real news.
def extract_words(category, dictionary):
    for entry in category['title']:
        words = entry.split()
        for word in words:
            lower_word = word.lower()
            if lower_word in dictionary:
                dictionary[lower_word] += 1
            else:
                dictionary[lower_word] = 1
    return dictionary
This function counts how many times each word appears in the headlines of one dataset (fake or real), adding one to that word's entry in the dictionary each time it is seen.
def count_to_prob(dictionary, length):
    for term in dictionary:
        dictionary[term] = dictionary[term] / length
    return dictionary
This function changes each count into a probability, by dividing it by the total number of words in the fake news headlines, or real news headlines.
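As a small illustration with made-up counts (not from the real dataset), the count-to-probability transformation looks like this:

```python
# Made-up word counts, for illustration only
counts = {'trump': 3, 'wins': 1}
total = sum(counts.values())  # 4 words in total

# Each count becomes that word's share of all words
probabilities = {word: count / total for word, count in counts.items()}
print(probabilities)  # {'trump': 0.75, 'wins': 0.25}
```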
def calculate_probability(dictionary, X, initial):
    X = X.translate(str.maketrans('', '', string.punctuation))
    X = X.lower()
    split = X.split()
    probability = initial
    for term in split:
        if term in dictionary:
            probability *= dictionary[term]
            print(term, dictionary[term])
    return probability
This function multiplies the relevant probabilities, to compute a "score" for the headline. To make the prediction, compare the scores produced by the fake news and real news dictionaries. If the fake news dictionary returns a higher score, the model has predicted the headline to be fake news.
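Here is a toy comparison with invented per-word probabilities, just to show how the two scores are compared (the numbers are not from the real dataset):

```python
# Invented per-word probabilities, for illustration only
fake_probs = {'clinton': 0.02, 'eats': 0.001}
true_probs = {'clinton': 0.01, 'eats': 0.0001}

def score(dictionary, headline, initial):
    # Same idea as calculate_probability, without the punctuation handling
    s = initial
    for term in headline.lower().split():
        if term in dictionary:
            s *= dictionary[term]
    return s

headline = 'Clinton eats'
is_fake = score(fake_probs, headline, 0.5) > score(true_probs, headline, 0.5)
print(is_fake)  # True: the fake-news score is higher
```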
def count_total(dictionary):
    # Total number of word occurrences across all headlines
    return sum(dictionary.values())

true_text = extract_words(true, true_text)
fake_text = extract_words(fake, fake_text)
true_count = count_total(true_text)
fake_count = count_total(fake_text)
true_text = count_to_prob(true_text, true_count)
fake_text = count_to_prob(fake_text, fake_count)
total_count = true_count + fake_count
fake_initial = fake_count / total_count
true_initial = true_count / total_count
This script uses all the above functions to create a dictionary of probabilities for each word, to later calculate the "score" for the headline.
X = 'Hillary Clinton eats Donald Trump'
calculate_probability(fake_text, X, fake_initial) > calculate_probability(true_text, X, true_initial)
This final script evaluates the headline: "Hillary Clinton eats Donald Trump", to test the model.
True
The model outputs True, as the headline is obviously fake news.
Where you can improve my program:
I created this program as a framework, so that others could improve upon it. Here are a few things you could consider:
- Consider phrases, as well as words
A single word has little meaning on its own, but a phrase can give more insight into whether the news is fake
- Gather a larger dataset, by web scraping
There are plenty of sources of real news and fake news online; you just need to find them.
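As a starting point for the first suggestion, a hypothetical bigram counter (not part of the original program) could replace the single-word counting in extract_words:

```python
# Hypothetical helper: count two-word phrases (bigrams) instead of single words
def extract_bigrams(titles):
    counts = {}
    for title in titles:
        words = title.lower().split()
        # Pair each word with the one that follows it
        for first, second in zip(words, words[1:]):
            bigram = first + ' ' + second
            counts[bigram] = counts.get(bigram, 0) + 1
    return counts

print(extract_bigrams(['Donald Trump wins', 'Trump wins again']))
# {'donald trump': 1, 'trump wins': 2, 'wins again': 1}
```

The resulting dictionary can be fed through count_to_prob and calculate_probability unchanged, as long as the headline being scored is split into bigrams the same way.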