The Fight Against Fake News with Deep Learning

Fake news detection with LSTMs and BERT

Shaya Farahmand
Towards Data Science


Photo by Markus Winkler on Pexels

Let’s play a game. I will give you two titles, and you will tell me which one is fake news. Ready? Here we go:

“Nancy Pelosi diverting social security money for the impeachment inquiry”

“FEMA arrives in tornado-stricken Kentucky — with vaccinations”

Well…it turns out that both headlines were examples of fake news.

The harsh reality is that misinformation and disinformation are rampant on the internet. According to a 2019 poll conducted by Ipsos Public Affairs for Canada’s Centre for International Governance Innovation, 90% of Canadians have fallen for fake news.

This got me thinking: would it be possible to create a model that can detect whether the title of a given article is fake news? Well, it turns out that it is! In this article, I will share my experience building this text classification model with LSTMs and BERT.

You can find the code I use at my GitHub repository here.

Finding a Dataset

We will use the University of Victoria’s ISOT Fake News Dataset. It contains more than 12,600 real news articles from Reuters.com and more than 12,600 fake news articles that were flagged by PolitiFact (a fact-checking organization in the United States). The dataset covers a variety of topics, though mostly political news and world news. Check it out at this link.

Exploratory Data Analysis

First, let’s analyze our data so we can understand possible trends within it. Because the real news and fake news files are separate, we will need to label and concatenate the two dataframes. For the real news dataset, we loop over the rows and append a 0 (denoting the class) to a NumPy array, which is then added as a column to the real news dataframe. For the fake news dataset, we repeat this procedure, but append a 1 instead.
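For reference, here is a minimal sketch of this labelling step in pandas and NumPy (written vectorized rather than as an explicit loop). The file names True.csv and Fake.csv are assumptions based on the ISOT download; yours may differ:

```python
import numpy as np
import pandas as pd

# Assumed file names from the ISOT dataset download
real_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")

# Label each dataframe: 0 = real news, 1 = fake news
real_df["class"] = np.zeros(len(real_df), dtype=int)
fake_df["class"] = np.ones(len(fake_df), dtype=int)

# Concatenate the two dataframes into one
df = pd.concat([real_df, fake_df], ignore_index=True)
```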

Image by Author

Because we have 21,417 samples of real news and 23,481 samples of fake news, there is an approximately 48:52 real-to-fake ratio. This means that our dataset is relatively balanced. For the purposes of our project, we only want the title and class columns.

Final dataframe | Image by Author

Great! Now that we have cleaned our dataset, we can analyze the trends within it. To gauge our samples’ sizes, we will compute the mean, minimum, and maximum character length of our titles, and plot the distribution of lengths as a histogram.
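A minimal sketch of this analysis with pandas and Matplotlib, assuming the df dataframe built above:

```python
import matplotlib.pyplot as plt

# Character length of each title
lengths = df["title"].str.len()
print(lengths.mean(), lengths.min(), lengths.max())

# Plot the frequency of title lengths as a histogram
plt.hist(lengths, bins=50)
plt.xlabel("Title length (characters)")
plt.ylabel("Frequency")
plt.show()
```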

Image by Author

We see that the number of characters in each title ranges from 8 to 286, with a high concentration of samples between 50 and 100 characters. This is consistent with the mean title length of approximately 80 characters.

Preprocessing our Data

We will conduct some initial preprocessing using Python’s string library. This involves:

  • Lowercasing all characters
  • Removing punctuation

Afterwards, we will use the NLTK library to conduct further preprocessing on our dataset (a sketch of the full pipeline follows the list). This involves:

  • Tokenization: Splitting a text into smaller units called tokens (each individual word becomes an element of an array)
  • Lemmatization: Removing a word’s inflectional endings. For example, the word “children” will be lemmatized to “child”
  • Removal of stop words: Commonly used words such as “the” or “for” will be removed, as they take up space in our dataset while carrying little signal.
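Here is a minimal sketch of this preprocessing pipeline using Python’s string library and NLTK; the preprocess helper is a hypothetical name for this example:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase all characters and remove punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenize, remove stop words, and lemmatize each token
    tokens = word_tokenize(text)
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

df["title"] = df["title"].apply(preprocess)
```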

This means that our processed data frame will appear like this:

Image by Author

We will construct two models on this data to classify the text:

  • An LSTM model (we will use TensorFlow Hub’s wiki-words-250 embeddings)
  • A BERT model.

Creating an LSTM Model

First, we will split our dataset into an 80:20 train:test ratio.
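A minimal sketch of this split, assuming scikit-learn’s train_test_split:

```python
from sklearn.model_selection import train_test_split

# 80:20 train:test split on the processed titles and labels
X_train, X_test, y_train, y_test = train_test_split(
    df["title"], df["class"], test_size=0.2, random_state=42)
```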

If we want our model to make predictions based on text data, we will need to convert it to a vector format that our computers can process.

TensorFlow Hub’s wiki-words-250 module uses a Word2Vec skip-gram architecture, which is trained to predict the context words surrounding a given input word.

For example, if we have the sentence:

I am going on a vacation with an airplane

We will pass in the word “vacation” as input and specify a window size of 1. The window size indicates how many words before and after the target word we try to predict. In this case, those words are “go” and “airplane” (stop words are excluded, and “go” is the lemmatized form of “going”).

We one-hot encode our word, so our input vector has size 1 × V, where V is the size of our vocabulary. This vector is multiplied by a weight matrix with V rows (one for each word in our vocabulary) and E columns, where E is a hyperparameter denoting the size of each embedding. Because the input vector is one-hot encoded, all of its values are 0 except for one (representing the input word). Hence, multiplying by the weight matrix simply selects one row: a 1 × E vector that is the embedding for that word.

The 1 × E vector is then passed into the output layer, a softmax classifier. It consists of V neurons (corresponding to the vocabulary’s one-hot encoding), each outputting a value between 0 and 1 that denotes the probability of that word appearing within the window.
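To make the arithmetic concrete, here is a toy NumPy sketch of this forward pass, with made-up sizes and random (untrained) weights:

```python
import numpy as np

V, E = 5, 3  # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, E))   # input weight matrix (V x E)
W_out = rng.normal(size=(E, V))  # output weight matrix (E x V)

# One-hot encode the input word (say, index 2 = "vacation")
x = np.zeros(V)
x[2] = 1.0

# Multiplying by W_in simply selects row 2: the word's embedding
embedding = x @ W_in   # shape (E,)

# Softmax over the vocabulary: P(context word | input word)
logits = embedding @ W_out
probs = np.exp(logits) / np.exp(logits).sum()
```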

Word2Vec Skip-gram Representation | Image by Author

TensorFlow Hub’s wiki-words-250 provides word embeddings with a size E of 250. It can be applied to our data by looping through every title and computing the embedding for each word. We then apply the pad_sequences function to account for samples of different lengths.
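A minimal sketch of this step, assuming the Wiki-words-250 handle on TensorFlow Hub and the tokenized titles from the split above:

```python
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Assumed TF Hub handle for the Wiki-words-250 embeddings
embed = hub.load("https://tfhub.dev/google/Wiki-words-250/2")

# One 250-dimensional vector per word in each title
X_train_emb = [embed(tokens).numpy() for tokens in X_train]

# Pad every sample to the same number of words (34 here)
X_train_emb = pad_sequences(X_train_emb, maxlen=34,
                            dtype="float32", padding="post")
```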

Image by Author

Hence, there are 35,918 samples in the training data, the maximum title length is 34 words (shorter titles are padded), and each word has 250 features.

We can apply the same procedure to our testing data.

Now, let’s construct our model. It will consist of:

  • 1 LSTM layer with 50 units
  • 2 Dense layers (one with 20 neurons, the second with 5) with a ReLU activation function
  • 1 Dense output layer with a sigmoid activation function

We will use the Adam optimizer, a binary cross-entropy loss, and accuracy as the performance metric. The model will be trained over 10 epochs. Feel free to further adjust these hyperparameters.
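A minimal Keras sketch of this architecture, assuming X_train_emb and the labels come from the steps above:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # 34 time steps (words), 250 embedding features per word
    tf.keras.layers.LSTM(50, input_shape=(34, 250)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.fit(X_train_emb, y_train.to_numpy(), epochs=10)
```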

Image by Author

As seen above, our model reaches a maximum accuracy of 91.5% on the test data.

Introducing BERT

If I were to ask you for the word in the English language with the most definitions, what would you say?

According to the Second Edition of the Oxford English Dictionary, that word is “set”.

If you think of it, we can generate many different sentences using that word in different contexts. For example:

My pencils are part of a set of stationery supplies

My teammate set the volleyball to me

I set the table for dinner

The problem with Word2Vec is that it generates the same embedding regardless of how the word is used. To combat this, we can use BERT, which generates contextualized embeddings.

BERT stands for “Bidirectional Encoder Representations from Transformers”. It uses a transformer model, taking advantage of attention mechanisms to generate contextualized embeddings.

A transformer model uses an encoder-decoder architecture. The encoder generates a continuous representation that captures the information learned from the input. The decoder generates an output sequence step by step, using that representation along with its own previous outputs. BERT uses only the encoder, as its goal is to generate a vector representation that captures information from the text.

Pre-training and Fine-tuning BERT

Two methods are employed to pre-train BERT. The first is called masked language modelling. Before sequences are passed in, 15% of the words are replaced with a [MASK] token. The model then predicts the masked words, using the context provided by the unmasked ones. This is done by:

  • Applying a classification layer on top of the encoder output and multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
  • Calculating the probability of each word in the vocabulary with the softmax function.

The second method is next sentence prediction. The model receives two sentences as input and predicts a binary value indicating whether the second sentence follows the first. During training, 50% of the inputs are consecutive pairs, while in the other 50% the second sentence is a random sentence from the corpus. To differentiate between the two sentences,

  • a [CLS] token is added at the beginning of the first sentence, and a [SEP] token is added at the end of each sentence.
  • each token (word) has a positional embedding to encode its position within the text. This is important because a transformer has no recurrence, and therefore no inherent understanding of word order.
  • a sentence embedding is added to each token, further differentiating between the two sentences.

To perform the classification for Next Sentence Prediction, the output of the [CLS] embedding, which denotes the “aggregate sequence representation for sentence classification,” is passed through a classification layer with softmax to return the probability of the two sentences being sequential.

Masked Language Modelling and Next Sentence Prediction | Image by Author

Because our task is classification, fine-tuning BERT simply requires adding a classification layer over BERT’s output for the [CLS] token.

Implementing BERT

We will use TensorFlow Hub’s BERT preprocessor and encoder. Do not pass the text through the preprocessing pipeline described earlier (which removes capitalization, applies lemmatization, etc.); the BERT preprocessor handles its own tokenization internally.

Our dataframe should look like this initially:

Final dataframe | Image by Author

Afterwards, we can split our data into training and testing sets with an 80:20 train:test ratio.

We can now import the BERT preprocessor and encoder from TensorFlow Hub, as sketched below.
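A minimal sketch, assuming the standard bert_en_uncased preprocessing and encoder handles on TensorFlow Hub (other BERT variants would work the same way):

```python
import tensorflow_hub as hub

bert_preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
```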

Now, we can develop our neural network. It must be a functional model, where the output of each layer is passed as an argument to the next. The model consists of:

  • 1 Input layer: This will represent the sentence that will be passed into the model.
  • The bert_preprocess layer: Here we pass in our input to preprocess the text.
  • The bert_encoder layer: Here we pass the preprocessed tokens into the BERT encoder.
  • 1 Dropout layer with a rate of 0.2. The pooled_output of the BERT encoder is passed into it (more on this below)
  • 2 Dense layers with 10 and 1 neurons respectively. The first one will use a ReLU activation function, and the second will use sigmoid.

You can see that the “pooled_output” is passed into the dropout layer. This value denotes the overall sequence representation of the text; as mentioned earlier, it is the representation derived from the [CLS] token’s output.

We will use the Adam optimizer, a binary cross-entropy loss, and accuracy as the performance metric. The model will be trained over 5 epochs. Feel free to further adjust these hyperparameters.
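A minimal sketch of the functional model described above, assuming the bert_preprocess and bert_encoder layers defined earlier (X_train here holds the raw title strings from the 80:20 split):

```python
import tensorflow as tf

# Input layer: raw title strings (preprocessing happens inside the model)
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="title")
preprocessed = bert_preprocess(text_input)
encoder_outputs = bert_encoder(preprocessed)

# pooled_output: the [CLS]-based representation of the whole sequence
x = tf.keras.layers.Dropout(0.2)(encoder_outputs["pooled_output"])
x = tf.keras.layers.Dense(10, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[text_input], outputs=[output])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.fit(X_train.to_numpy(), y_train.to_numpy(), epochs=5)
```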

Image by Author

You can see above that the model reached an accuracy of 91.14% on the testing data (computed with the model.evaluate() method).

Creating a Web Application

We can create a web application using HTML and Flask, allowing users to interact with the models that we made.

The link to the model’s front end can be found here and the link to the back end can be found here. You can host this model locally through Google Colab and ngrok.
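For illustration, here is a minimal sketch of what the Flask back end might look like; the /predict route, the form field name, and the direct call into the BERT model are assumptions for this example, not the repository’s actual code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    title = request.form["title"]
    # `model` is the trained BERT classifier from the previous section
    prob = float(model.predict([title])[0][0])
    label = "fake" if prob > 0.5 else "real"
    return jsonify({"label": label, "probability": prob})

if __name__ == "__main__":
    app.run()
```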

The interface will look like this:

Image by Author

Three models are available to the user:

  • A baseline dense neural network using Gensim Doc2Vec
  • The LSTM model using wiki-words-250 embeddings (with the Word2Vec architecture)
  • The BERT model

The user can enter a title and choose a model. The app will return a classification and a probability: if the probability is above 50%, the model predicts that the title is from a fake news source; if it is below 50%, the model predicts that the title is from a real news source.

I hope this article demonstrated how we can harness techniques in NLP to classify fake news against real news. You might ask, “What can we do next?” Well, our process is by no means finished here. To further improve our models, we can:

  • Train our models on more data
  • Adjust our models’ hyperparameters
  • Deploy the web application that we created

I hope you enjoyed reading the article! Feel free to add me on LinkedIn and stay tuned for more content!

Bibliography

[1] Classify Text with BERT (2022), TensorFlow

[2] E. Thompson. Poll finds 90% of Canadians have fallen for fake news (2019), CBC

[3] H. Ahmed, I. Traore, and S. Saad. Detecting opinion spams and fake news using text classification (2018), Journal of Security and Privacy, Volume 1, Issue 1, Wiley

[4] H. Ahmed, I. Traore, and S. Saad. Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques, in: Traore I., Woungang I., Awad A. (eds), Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments, ISDDC 2017, Lecture Notes in Computer Science, vol 10618, Springer, Cham (pp. 127-138)

[5] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019), arXiv

[6] R. Horev, BERT Explained: State of the art language model for NLP (2018), Towards Data Science

[7] R. Kulshrestha, NLP 101: Word2Vec- Skip-gram and CBOW (2019), Towards Data Science

