Photo Credit: Unsplash

Automatically Detect COVID-19 Misinformation

NLP, Machine learning, COVID-19

Susan Li
Towards Data Science
6 min read · Jun 13, 2020


It’s not easy for ordinary citizens to identify fake news. And fake coronavirus news is no exception.

As part of an effort to combat misinformation about the coronavirus, I collected training data and trained an ML model to detect fake coronavirus news.

My training data is not perfect, but I hope it will help us understand whether fake news differs systematically from real news in style and language use. So, let’s find out.

The Data

As mentioned in the previous article, I collected over 1,100 news articles and social network posts on COVID-19 from a variety of news sources, then labeled them. The data set can be found here.

fake_covid.py

I decided to create several dozen new features based on the news titles and article bodies. Let me explain them one by one.

Capital Letters in Title

  • Count the number of capital letters in each title.
  • Compute the percentage of capital letters in each article body rather than simply counting them, because the lengths of the articles vary widely.
title_uppercase.py
Figure 1
boxplot_cap.py
Figure 2

On average, fake news titles have far more words in all capital letters. This makes me think that fake news is targeting audiences who are likely to be influenced by titles.
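
Here is a minimal sketch of how these two features could be computed with pandas; the DataFrame layout, the column names `title` and `text`, and the file name are assumptions, not the exact code in title_uppercase.py.

```python
import pandas as pd

# Assumed layout: one row per article, with "title" and "text" columns.
df = pd.read_csv("fake_covid.csv")  # hypothetical file name

def count_all_caps(text):
    """Number of fully capitalized words (e.g. 'BOMBSHELL') in a string."""
    return sum(1 for w in str(text).split() if w.isupper())

# Raw count of all-caps words in the title.
df["title_num_uppercase"] = df["title"].apply(count_all_caps)

# Percentage of all-caps words in the body, to normalize for article length.
df["text_pct_uppercase"] = df["text"].apply(
    lambda t: count_all_caps(t) / max(len(str(t).split()), 1)
)
```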

Stop Words in Title

  • Count the number of stop words in each title.
  • Compute the percentage of stop words in each article body rather than simply counting them, because the lengths of the articles vary widely.
stop_words.py
Figure 3
boxplot_stop_words.py
Figure 4

Fake news titles have fewer stop words than real news titles.
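
A minimal sketch of the stop-word features, assuming NLTK’s English stop-word list and the same DataFrame as above; the exact stop-word list used in the article may differ.

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def count_stop_words(text):
    """Number of NLTK English stop words in a string."""
    return sum(1 for w in str(text).lower().split() if w in STOP_WORDS)

# Raw count in the title, percentage in the body (normalized by word count).
df["title_num_stop_words"] = df["title"].apply(count_stop_words)
df["text_pct_stop_words"] = df["text"].apply(
    lambda t: count_stop_words(t) / max(len(str(t).split()), 1)
)
```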

Proper Noun in Title

  • Count the number of proper nouns (NNP) in each title.
Figure 5
boxplot_proper_noun.py
Figure 6

Fake news titles have more proper nouns. Apparently, the use of proper nouns in titles is very significant in differentiating fake news from real news.
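
A minimal sketch of the proper-noun count with NLTK’s part-of-speech tagger, where NNP is the Penn Treebank tag for a singular proper noun; the column names are assumptions.

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def count_proper_nouns(text):
    """Count tokens tagged NNP (proper noun, singular)."""
    tags = nltk.pos_tag(nltk.word_tokenize(str(text)))
    return sum(1 for _, tag in tags if tag == "NNP")

df["title_num_proper_nouns"] = df["title"].apply(count_proper_nouns)
```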

Takeaways from the Analysis of Article Titles

Overall, these results suggest that the writers of fake news attempt to attract attention by using all-capitalized words in titles, and to squeeze as much substance into the titles as possible by skipping stop words and increasing the number of proper nouns. We will shortly find out whether these patterns apply to article bodies as well.

Here is an example of a fake news title vs. a real news title.

Fake news title: “FULL TRANSCRIPT OF “SMOKING GUN” BOMBSHELL INTERVIEW: PROF. FRANCES BOYLE EXPOSES THE BIOWEAPONS ORIGINS OF THE COVID-19 CORONAVIRUS”

Real news title: “Why outbreaks like coronavirus spread exponentially, and how to ‘flatten the curve’”

Features

To study fake and real news articles, we compute many content-based features on the article bodies. They are listed below; a combined sketch of how they might be computed follows the list.

  • Use a part-of-speech tagger and keep a count of how many times each tag appears in the article.
pos_tag.py
  • Number of negations and interrogatives in the article body.
negation.py
  • Use the Python library textstat to calculate statistics from the text that determine the readability, complexity, and grade level of any article. The explanation of each statistical feature can be found here.
textstat.py
  • Type-Token Ratio (TTR) is the total number of unique words (types) divided by the total number of words (tokens) in a given segment of language, computed with the Python library lexicalrichness.
ttr.py
  • Number of power words, casual words, tentative words, and emotion words in the article body.
power_words.py
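
Below is one combined sketch of how these body features could be computed, assuming the same DataFrame as above. The negation, interrogative, and power-word lexicons shown here are small placeholders, not the lists actually used in the article.

```python
from collections import Counter

import nltk
import pandas as pd
import textstat
from lexicalrichness import LexicalRichness

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Placeholder lexicons; the real lists are much longer.
NEGATIONS = {"no", "not", "never", "none", "nothing", "neither", "nor", "nobody"}
INTERROGATIVES = {"what", "when", "where", "who", "whom", "which", "why", "how"}
POWER_WORDS = {"breakthrough", "proven", "secret", "shocking", "urgent"}

def body_features(text):
    """Compute content-based features for one article body."""
    text = str(text)
    tokens = nltk.word_tokenize(text)
    words = [t.lower() for t in tokens]
    pos_counts = Counter(tag for _, tag in nltk.pos_tag(tokens))

    feats = {f"pos_{tag}": n for tag, n in pos_counts.items()}  # POS tag counts
    feats["num_negations"] = sum(w in NEGATIONS for w in words)
    feats["num_interrogatives"] = sum(w in INTERROGATIVES for w in words)
    # Readability / complexity scores from textstat.
    feats["flesch_reading_ease"] = textstat.flesch_reading_ease(text)
    feats["flesch_kincaid_grade"] = textstat.flesch_kincaid_grade(text)
    feats["gunning_fog"] = textstat.gunning_fog(text)
    # Type-Token Ratio: unique words / total words.
    feats["ttr"] = LexicalRichness(text).ttr if words else 0.0
    feats["num_power_words"] = sum(w in POWER_WORDS for w in words)
    return feats

# One row of features per article; join them back onto the original DataFrame.
features_df = pd.DataFrame(df["text"].apply(body_features).tolist()).fillna(0)
df = pd.concat([df, features_df], axis=1)
```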

Exploring

Capital Letters in Article Body

uppercase_text.py
Figure 7

On average, fake news articles have more words in all capital letters in the body than real news articles.

Stop Words in Article Body

text_stop_words.py
Figure 8

There does not seem to be a significant difference in the percentage of stop words in the article text between fake and real news.

NNP (Proper noun, singular) in Article Body

proper_noun_text.py
Figure 9

As with the titles, fake news articles pack more proper nouns into the article bodies as well.

Negation Words in Article Bodies

negation.py
Figure 10

On average, fake news articles contain slightly more negation words than real ones.

Bracket

bracket.py

For some reason, in my data, fake news articles pack more brackets into the article bodies.
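
One simple way to compute this feature is a regular-expression count; counting parentheses and square brackets together is an assumption about what the feature covers.

```python
import re

# Number of bracket characters in each article body.
df["text_num_brackets"] = df["text"].apply(
    lambda t: len(re.findall(r"[()\[\]]", str(t)))
)
```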

Type-Token Ratio (TTR)

ttr.py
Figure 11

There does not seem to be a significant difference between fake news and real news in terms of TTR.

I did not explore all of the numeric features; feel free to do so yourself. However, from what I have explored, I found that fake news articles differ much more in their titles than in their text bodies.

Harvard Health Publishing vs. Natural News

Remember, Natural News is a far-right conspiracy theory and fake news website. All the news articles I collected from there are labeled as fake news.

harvard_natural.py
Figure 12

As expected, Natural News articles use far fewer stop words than Harvard Health Publishing articles.

harvard_natural_ttr.py
Figure 13

TTR is meant to capture the lexical diversity of the vocabulary in a document. A low TTR means a document has more word redundancy, and a high TTR means a document has more word diversity. It’s clear that there is a big difference between Harvard Health Publishing and Natural News in terms of TTR.
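
A quick sketch of this comparison, assuming a `source` column and that these are the exact outlet names used as labels in the data set:

```python
# Compare the two outlets on stop-word percentage and TTR.
sources = ["Harvard Health Publishing", "Natural News"]  # assumed label strings
subset = df[df["source"].isin(sources)]
print(subset.groupby("source")[["text_pct_stop_words", "ttr"]].mean())
```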

The Classification Model

We will not use “source” as a feature because of the bias in my data collection: for example, I only collected fake posts from Facebook and Twitter, while in reality most posts on Facebook and Twitter are real.

I am sure you have noticed that we have created a large number of numeric features. For the first attempt, I decided to use all of them to fit a Support Vector Machine (SVM) model with a linear kernel and 10-fold cross-validation to prevent overfitting.

linearSVC_fake.py
print(scores.mean())

When 10-fold cross-validation is done, we get 10 different scores, one per fold, and then we compute the mean score.
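
A minimal sketch of this step with scikit-learn, assuming `X` is the numeric feature matrix built earlier and a `label` column holds the fake/real target (both names are assumptions):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Numeric feature matrix and labels; "label" (1 = fake, 0 = real) is an assumed column name.
feature_cols = [c for c in df.select_dtypes("number").columns if c != "label"]
X = df[feature_cols]
y = df["label"]

# Linear-kernel SVM with feature scaling, evaluated with 10-fold cross-validation.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores)         # one accuracy score per fold
print(scores.mean())  # mean 10-fold accuracy
```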

Sweep over a range of values for the C parameter and check the accuracy score for each.

c_accuracy1.py
Figure 14

From the above plot, we can see that accuracy is close to 84.2% for C=1, then drops to around 83.8% and remains roughly constant.

We will look in more detail at exactly which value of the C parameter gives us a good accuracy score.

c_accuracy2.py
Figure 15

The above plot shows that the accuracy score is highest for C=0.7.
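
The C sweep behind these plots could look roughly like this, reusing `X` and `y` from the sketch above; the exact grid of C values is an assumption.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Mean 10-fold accuracy over a grid of C values.
c_values = np.arange(0.1, 2.01, 0.1)
mean_scores = [
    cross_val_score(
        make_pipeline(StandardScaler(), LinearSVC(C=c, max_iter=10000)),
        X, y, cv=10, scoring="accuracy",
    ).mean()
    for c in c_values
]

plt.plot(c_values, mean_scores, marker="o")
plt.xlabel("C")
plt.ylabel("Mean 10-fold accuracy")
plt.show()
```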

Future Work

Remember that I created several dozen new numeric features for the purpose of learning and exploring, and I used all of them to fit the classification model. I would suggest using a hypothesis-testing method to select the top 8 or top 10 features, then running a linear SVM model with 10-fold cross-validation.

Hypothesis testing can’t say anything about predicting classes in the data; however, these tests can show which features are more significant than others.
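
One possible version of that step uses scikit-learn’s univariate ANOVA F-test (SelectKBest with f_classif) as the hypothesis test; this is just a sketch, not the notebook’s code.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Rank features with a univariate F-test and keep the 10 most significant.
selector = SelectKBest(f_classif, k=10).fit(X, y)
top_features = X.columns[selector.get_support()]
print(list(top_features))

# Refit the linear SVM on the selected features only, with 10-fold cross-validation.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
print(cross_val_score(model, X[top_features], y, cv=10, scoring="accuracy").mean())
```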

The Jupyter notebook can be found on GitHub. Have a great weekend!

Reference: https://arxiv.org/pdf/1703.09398.pdf
