Fake News Classifier to Tackle COVID-19 Disinformation-I

An effort to tackle one of the most pressing problems faced by the world currently, Fake News

Shaunak Varudandi
Towards Data Science


(Image by author)

Introduction

Coronavirus (COVID-19) is an infectious disease that has resulted in an ongoing pandemic. The disease was first identified in Wuhan, China, in December 2019. As of 21st August 2020, more than 22 million cases had been reported across 180 countries and territories. The sheer scale of this pandemic has created myriad problems, and one of the most acute I have come across is the circulation of bogus news articles. In today’s world, spurious news articles can cause panic and mass hysteria. Realizing the gravity of this problem, I decided to base my next machine learning project on tackling it.

Problem Statement

To develop a fake news classifier that appropriately classifies a news article on COVID-19 into real news or fake news.

Work Flow

Before starting this project, I had to search for datasets containing news articles related to COVID-19. This was a challenge, since not many such datasets exist. After scouring the internet for days, I finally found a dataset of news articles related to COVID-19. All that remained was to clean the data, fit an appropriate machine learning model, and assess the model’s performance.

Data Exploration and Data Engineering

Step 1: Checking for missing values.

I started the project by exploring the data and looking for missing values. Each column in the dataset had some missing values; most importantly, the “Label” column had 5 missing values. Fortunately, the source from which I downloaded the dataset provided values for the missing labels, which allowed me to eliminate the missing values from the “Label” column. As for the other columns, i.e. “Title”, “Source”, and “Text”, the missing values were replaced with an empty string.
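In pandas, replacing missing text fields with empty strings is a one-liner per column. The miniature DataFrame below is a hypothetical stand-in for the real dataset (which shares the “Title”, “Source”, “Text”, and “Label” columns):

```python
import pandas as pd

# Hypothetical two-row sample mimicking the dataset's columns.
df = pd.DataFrame({
    "Title": ["Vaccine update", None],
    "Source": [None, "Twitter"],
    "Text": ["New trial results announced", "A viral post makes a claim"],
    "Label": ["TRUE", "FAKE"],
})

# Replace missing values in the text columns with an empty string.
for col in ["Title", "Source", "Text"]:
    df[col] = df[col].fillna("")

print(df.isna().sum().sum())  # 0 -> no missing values remain
```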

Step 2: Looking for inconsistencies in the “Label” column.

After handling the missing data, I checked the target labels for any inconsistencies. Exploring the “Label” column, I discovered two different spellings of the fake label, as shown in the image below. After discovering this anomaly, I standardized the label for fake news. The final labels can be seen in the second image below.

Labels Before (left) and After (right). (Image by author)
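Collapsing two spellings of a label into one canonical value can be done with `Series.replace`. The exact label strings below are my own assumption for illustration, not necessarily the ones in the original dataset:

```python
import pandas as pd

# Hypothetical "Label" column containing two spellings of the fake label.
df = pd.DataFrame({"Label": ["TRUE", "fake", "Fake", "TRUE"]})

# Map both variants onto a single canonical label.
df["Label"] = df["Label"].replace({"fake": "FAKE", "Fake": "FAKE"})

print(df["Label"].unique())  # ['TRUE' 'FAKE']
```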

Step 3: Combining the Title and Text column.

Once the target labels were finalized, I turned my attention to the data I would be using for classification. I decided to use the “Title” and “Text” columns, since they contain the most relevant information related to COVID-19. I therefore combined the two columns into a single column named “Total”.
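Concatenating the two columns is simple string addition on pandas Series; a space separator keeps the last word of the title from fusing with the first word of the body. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical sample; the real "Title" and "Text" entries are much longer.
df = pd.DataFrame({
    "Title": ["Vaccine trial begins", "Miracle cure found"],
    "Text": ["Researchers announced a study", "A viral post makes a claim"],
})

# Join title and body text into a single "Total" column.
df["Total"] = df["Title"] + " " + df["Text"]

print(df["Total"].iloc[0])  # Vaccine trial begins Researchers announced a study
```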

Step 4: Removing punctuation from the data and converting it into lowercase.

“From Step 4 onward, all the operations that I perform are on the “Total” column.”

It is not advisable to feed raw data straight to a machine learning algorithm. Before doing so, we need to apply some pre-processing steps to make the data interpretable for the algorithm. Hence, I first use a regex to remove punctuation from the data, and then convert the data to lowercase. The first row of the “Total” column after pre-processing can be seen in the image below.

(Image by author)
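One common regex for this step strips everything that is not a word character or whitespace; the exact pattern used in the project may differ, so treat this as a sketch:

```python
import re

def preprocess(text: str) -> str:
    """Strip punctuation with a regex, then lowercase the result."""
    text = re.sub(r"[^\w\s]", "", text)  # keep only word chars and whitespace
    return text.lower()

print(preprocess("COVID-19: Cases Rise Again!"))  # covid19 cases rise again
```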

Step 5: Splitting the data into Training data and Test data.

Once I was done cleaning the data, I split it into a training set and a test set. I assigned the “Label” column to a new variable y and dropped that column from my data frame. Next, I used the train_test_split function to split the data, assigning 80% to the training set and 20% to the test set.
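The split described above can be sketched as follows. The toy DataFrame and the `random_state` value are my own additions for reproducibility, not from the original project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row stand-in for the cleaned dataset.
df = pd.DataFrame({
    "Total": [f"news article number {i}" for i in range(10)],
    "Label": ["TRUE", "FAKE"] * 5,
})

# Separate the target, drop it from the features, then do an 80/20 split.
y = df["Label"]
X = df.drop(columns=["Label"])
X_train, X_test, y_train, y_test = train_test_split(
    X["Total"], y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 8 2
```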

Step 6: Implementing Tf-Idf on X_train and X_test.

The data we currently possess in X_train and X_test still needs to be converted into a format that can be interpreted by a machine learning algorithm, since these algorithms do not work well with textual data. Hence, we need to convert it into a form that will enable the algorithm to discern patterns and meaningful insights from the data. In order to achieve this, I implemented Tf-Idf.

Tf-Idf stands for Term frequency-Inverse document frequency. It gives us a way to associate each word in a document with a number that represents how relevant that word is in the document. With Tf-Idf, instead of representing a term by its raw frequency (number of occurrences) or its relative frequency (term count divided by document length), each term’s frequency is scaled down according to how many documents in the corpus contain the word. The overall effect of this weighting scheme is to avoid a common problem in text analysis: the most frequently used words in a document are often the most frequently used words in all of the documents. In contrast, the terms with the highest Tf-Idf scores in a document are those that are distinctively frequent in it compared with the other documents.

I used TfidfVectorizer from the sklearn library to convert the text I had into a sparse matrix. This matrix represents the Tf-Idf values for all the words present in my training and test data. The training and test data are now represented by the variables tfidf_train and tfidf_test.
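The key detail is that the vectorizer is fitted on the training data only and then reused on the test data, so both sparse matrices share one vocabulary. A minimal sketch with hypothetical documents standing in for the real columns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the cleaned training and test text.
X_train = ["vaccine trial shows early promise",
           "viral post claims miracle cure"]
X_test = ["trial results for the vaccine released"]

vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(X_train)  # learn vocabulary + idf from training data only
tfidf_test = vectorizer.transform(X_test)        # reuse the fitted vocabulary on test data

print(tfidf_train.shape, tfidf_test.shape)  # same number of columns in both
```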

Since I now have the data ready for implementing the machine learning algorithm, I move to the next step which includes fitting my machine learning algorithm on the training data.

Fitting a Machine Learning Model and Assessing its Performance

Step 1: Choose a classification algorithm and fit the model on training data.

I chose the Support Vector Machine (SVM) as the classification algorithm for my project, and I used a linear kernel to train the model. I chose SVM with a linear kernel because the linear kernel works well when there are many features, and most text classification tasks are linearly separable. Moreover, mapping the data to a high-dimensional space does not necessarily improve model performance, and training an SVM with a linear kernel is faster than with other kernels.

I imported the SVM classifier from the sklearn library and fit the model on my training data (i.e. tfidf_train). Once training was complete, I moved on to the next step, assessing model performance.
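Fitting a linear-kernel SVM on the Tf-Idf matrix looks like the sketch below. The toy documents and label strings are my own placeholders; sklearn's `SVC` accepts the sparse matrix that `TfidfVectorizer` produces directly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical stand-ins for the training text and labels.
X_train = ["vaccine trial shows early promise",
           "viral post claims miracle cure",
           "health agency updates case counts",
           "post says garlic prevents infection"]
y_train = ["TRUE", "FAKE", "TRUE", "FAKE"]

vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(X_train)

clf = SVC(kernel="linear")  # linear kernel, for the reasons discussed above
clf.fit(tfidf_train, y_train)
```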

Step 2: Assess model performance using test data.

Once training was completed, I used the test data (i.e. tfidf_test) to predict labels for the news articles in the test set. The model’s accuracy came out to 94.4%.

(Image by author)
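Scoring the predictions uses sklearn's `accuracy_score`. The end-to-end sketch below runs on the same toy data as earlier; only the reported 94.4% figure comes from the actual project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Hypothetical training and test sets.
X_train = ["vaccine trial shows early promise",
           "viral post claims miracle cure",
           "health agency updates case counts",
           "post says garlic prevents infection"]
y_train = ["TRUE", "FAKE", "TRUE", "FAKE"]
X_test = ["agency reports new case counts"]
y_test = ["TRUE"]

vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(X_train)
tfidf_test = vectorizer.transform(X_test)

# Train, predict on the held-out test matrix, and score.
clf = SVC(kernel="linear").fit(tfidf_train, y_train)
y_pred = clf.predict(tfidf_test)
print(accuracy_score(y_test, y_pred))
```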

Conclusion

This machine learning project on COVID-19 was an exciting experience for me. It introduced me to the domain of Natural Language Processing and helped me understand the various data pre-processing steps needed before machine learning algorithms can be applied to textual data. I also learned about two major new concepts: Term frequency-Inverse document frequency and the Support Vector Machine.

The next step is to convert this project into a fully responsive web application. My aim is to develop the front-end using HTML and CSS, whereas the trained SVM classifier will act as the back-end. Lastly, to ensure seamless interaction between the front-end and the back-end, I will make use of the Flask framework. A detailed walkthrough of all the steps needed to deploy the SVM classifier on the Heroku cloud platform can be found in Part II of this blog. Be sure to check that one out as well.

The full workflow for this project can be found on my GitHub page. I hope you enjoyed reading my blog.


MBA (Co-op) student at the DeGroote School of Business || Aspiring Business Data Analyst.