
Build a Fake News Detector

It's Time to Crush Misinformation

Image by Author

If there is one thing the coronavirus has taught me, it is that nobody is safe from this dangerous virus. It multiplies exponentially in certain parts of the world, leaving us concerned and scared for the physical well-being of our loved ones and ourselves. But have you taken a moment to stop and think about how misinformation may impact you, and the decisions you make based on it? As a matter of fact, false news is 70% more likely to spread, or "go viral", than true news [1]. Imagine some of the decisions that were made based on fake viral news. Thankfully, AI is making significant headway in fighting misinformation.

In this article, we are going to build our very own fake news detector with Python. We will leverage some test data and code to train a model using a passive-aggressive classifier (PAC) that determines whether a piece of text is real or fake. The main concept behind a PAC is that it is memory efficient: it learns online, using one training sample at a time to re-evaluate the model's weights. If the prediction is right, the weights are left alone; if it is wrong, they are tweaked just enough to fix the mistake. It is a popular machine learning algorithm for tasks such as fake news detection. The full math behind it isn't that simple, so we'll leave that for another article; here, let's build a practical application. Ready? Let's get started!
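That said, the core intuition fits in a few lines of code. Below is a minimal sketch of the passive-aggressive update rule for a binary problem; the function name and variables are illustrative and not scikit-learn's actual internals:

import numpy as np

def pa_update(weights, x, y):
    # y is +1 or -1; x is the sample's feature vector (illustrative sketch)
    loss = max(0.0, 1.0 - y * weights.dot(x))  # hinge loss on this sample
    if loss == 0.0:
        return weights                          # passive: prediction was right
    tau = loss / x.dot(x)                       # smallest step that fixes the error
    return weights + tau * y * x                # aggressive: correct the weights

Each misclassified (or low-margin) sample nudges the weights just enough to be classified correctly, which is what makes the algorithm fast and light on memory.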

To see a video demonstration of this exercise, check out the YouTube video with a code walkthrough at https://youtu.be/z_mNVoBcMjM

To develop this application, we're going to leverage two popular Python packages: pandas and scikit-learn. The full code and data can be found at my GitHub link. So let's start by importing our requirements:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score

Once we have our requirements imported, the next step is to import the dataset. The data comes from Kaggle, and the direct link can be found in the GitHub repository linked above.

df=pd.read_csv('fake-news/train.csv')
conversion_dict = {0: 'Real', 1: 'Fake'}
df['label'] = df['label'].replace(conversion_dict)
df.label.value_counts()
Fake    10413
Real    10387
Name: label, dtype: int64

Once we import the data into a pandas DataFrame, we re-label each record as real or fake per the conversion dictionary. The counts show that we have a fairly even, balanced dataset, so there is no need for any resampling between real and fake articles, and we don't have to worry about class imbalance. Whew! One less thing to worry about!

Then we take our dataset, create a train/test split, and hold back a percentage for testing, in this case 25% of the dataset. The article text is our X variable and the label is our y value. To ensure we're only using meaningful keywords, the TF-IDF vectorizer removes English stopwords and (via max_df=0.75) ignores terms that appear in more than 75% of the documents.

x_train,x_test,y_train,y_test=train_test_split(df['text'], df['label'], test_size=0.25, random_state=7, shuffle=True)
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.75)
vec_train=tfidf_vectorizer.fit_transform(x_train.values.astype('U')) 
vec_test=tfidf_vectorizer.transform(x_test.values.astype('U'))
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(vec_train,y_train)
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)
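Before scoring the model, it can be helpful to peek at what the vectorizer actually learned. A quick, optional check (get_feature_names_out is the scikit-learn 1.0+ name; older versions call it get_feature_names):

vocab = tfidf_vectorizer.get_feature_names_out()
print(len(vocab))    # size of the learned TF-IDF vocabulary
print(vocab[:10])    # a few sample terms, alphabetically first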

We then train the model on the data, running up to 50 iterations to fit the PAC. Great, now we have a trained classifier. Next, let's test the accuracy of the trained model. We compute the accuracy score between y_test and y_pred and see that we get a decent score: 96.29%. Great!

y_pred=pac.predict(vec_test)
score=accuracy_score(y_test,y_pred)
print(f'PAC Accuracy: {round(score*100,2)}%')
PAC Accuracy: 96.29%

To further ensure that our model is predicting the right behaviors, we look at the confusion matrix to better understand how the errors break down: how many real articles were flagged as fake, and how many fake articles slipped through as real.

confusion_matrix(y_test,y_pred, labels=['Real','Fake'])
array([[2488,   98],
       [  95, 2519]])
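The raw array doesn't label its axes, so as an optional cosmetic step (not part of the original walkthrough) you can wrap it in a pandas DataFrame to make the rows and columns explicit:

cm = confusion_matrix(y_test, y_pred, labels=['Real', 'Fake'])
print(pd.DataFrame(cm,
                   index=['actual Real', 'actual Fake'],
                   columns=['predicted Real', 'predicted Fake']))

With labels=['Real', 'Fake'], rows are the actual classes and columns the predicted ones, so the 98 is real articles flagged as fake and the 95 is fake articles that slipped through as real.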

Based on the confusion matrix above, the results look pretty good: in the vast majority of cases, the model predicts the right label. But let's not stop there. Let's run the gold-standard test against this model: k-fold cross-validation.

X=tfidf_vectorizer.transform(df['text'].values.astype('U'))
scores = cross_val_score(pac, X, df['label'].values, cv=5)
print(f'K Fold Accuracy: {round(scores.mean()*100,2)}%')
K Fold Accuracy: 96.27%

This results in a k-fold accuracy of 96.27%, and I'm pretty happy with that result.
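One caveat worth flagging: the vectorizer here was fit on the training split and then applied to the whole dataset, so the cross-validation folds share vocabulary and document-frequency statistics. A stricter variant (a sketch, and the scores may differ slightly from the number above) wraps the vectorizer and classifier in a scikit-learn Pipeline so both are re-fit inside every fold:

from sklearn.pipeline import make_pipeline

# Re-fitting the TF-IDF step inside each fold avoids any train/test leakage
pipe = make_pipeline(TfidfVectorizer(stop_words='english', max_df=0.75),
                     PassiveAggressiveClassifier(max_iter=50))
leak_free_scores = cross_val_score(pipe, df['text'].values.astype('U'),
                                   df['label'].values, cv=5)
print(f'Pipeline K Fold Accuracy: {round(leak_free_scores.mean()*100,2)}%')

With that caveat noted, let's now show this model some data it has never seen before and see how well it fares at predicting the right label.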

df_true=pd.read_csv('True.csv')
df_true['label']='Real'
df_true_rep=[df_true['text'][i].replace('WASHINGTON (Reuters) - ','').replace('LONDON (Reuters) - ','').replace('(Reuters) - ','') for i in range(len(df_true['text']))]
df_true['text']=df_true_rep
df_fake=pd.read_csv('Fake.csv')
df_fake['label']='Fake'
df_final=pd.concat([df_true,df_fake])
df_final=df_final.drop(['subject','date'], axis=1)
df_fake

In the code block above, we bring in two new datasets: df_true, which we know contains only true news articles, and df_fake, which we know contains only fake articles. I want to run these articles through our trained classifier and see what percentage of the time it predicts 'Real' for the df_true dataset and 'Fake' for the df_fake dataset.

To do this, let's build a quick function that returns the model's label for a piece of text it has never seen:

def findlabel(newtext):
    vec_newtest=tfidf_vectorizer.transform([newtext])
    y_pred1=pac.predict(vec_newtest)
    return y_pred1[0]
findlabel((df_true['text'][0]))
'Real'

This function takes a piece of text and predicts whether it is real or fake using our trained PAC. We then run the first article in the df_true DataFrame through it, hoping it predicts 'Real', which it does in this case. What I want to see next is what percentage of the time it predicts the right label across each dataset.

sum([1 if findlabel((df_true['text'][i]))=='Real' else 0 for i in range(len(df_true['text']))])/df_true['text'].size
0.7151328383994023
sum([1 if findlabel((df_fake['text'][i]))=='Fake' else 0 for i in range(len(df_fake['text']))])/df_fake['text'].size
0.6975001064690601

So, looping over the df_true dataset, we see that the model correctly predicts 'Real' 71.51% of the time. When we loop over the df_fake dataset, it correctly predicts 'Fake' 69.75% of the time.
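As an aside, the loops above call findlabel once per article, transforming one document at a time. A vectorized variant (a sketch; accuracy_on is a made-up helper, not from the article's repo) scores each dataset in a single pass and should produce the same numbers much faster:

import numpy as np

def accuracy_on(df_new, expected_label):
    # Vectorize every article at once, then compare predictions to the known label
    vecs = tfidf_vectorizer.transform(df_new['text'].values.astype('U'))
    return np.mean(pac.predict(vecs) == expected_label)

print(accuracy_on(df_true, 'Real'))   # fraction of true articles labeled 'Real'
print(accuracy_on(df_fake, 'Fake'))   # fraction of fake articles labeled 'Fake'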

This was actually a super fun project to take on and something I definitely enjoyed building. Although the training results looked amazing, accuracy on truly unseen data was not as high. But hey, if I can block misinformation roughly 7 times out of 10, I consider that a victory.

Hopefully you enjoyed this code walkthrough. Be sure to check us out at levers.ai to find out more about our services.

Sources:
[1] https://www.reuters.com/article/us-usa-cyber-twitter/false-news-70-percent-more-likely-to-spread-on-twitter-study-idUSKCN1GK2QQ

