Detecting Fake News With Deep Learning

A Simple LSTM Implementation With Keras

Aaron Abrahamson
Towards Data Science


Image by memyselfaneye from Pixabay

I have been wanting to do a small project involving text classification for a while, and decided to try out an architecture that I have not used before: long short-term memory (LSTM). In short, LSTMs are a type of recurrent neural network (RNN) that can retain information over long sequences (an advantage over a vanilla RNN). If you’d like more detail, here’s an excellent and thorough explanation of the LSTM architecture.

Alright, let’s get started!

I found a dataset of both real and fake news on Kaggle: Fake and Real News Dataset. I tried doing this locally in a Jupyter Notebook, but once I got to the training portion my computer almost exploded; the ETA for one epoch was at least 2 hours. I moved things over into a GPU-accelerated Google Colab instance and things went much more smoothly. Here is a link to view that notebook.

# load the datasets into pandas DataFrames
import pandas as pd

real = pd.read_csv('data/True.csv')
fake = pd.read_csv('data/Fake.csv')

Now let’s look at what the data looks like.

real.head()  # preview of the 'Real' dataset
fake.head()  # preview of the 'Fake News' dataset

One thing I immediately noticed was the ‘(Reuters)’ tag in the real news articles. It turns out practically all of the real stories came from Reuters, and barely any of the fake news stories contain the word. I eventually want to compare the model with the word present and with the word removed.

real.loc[real.text.str.contains('Reuters')].count()/real.count()
> title 0.998179
> text 0.998179
> subject 0.998179
> date 0.998179
fake.loc[fake.text.str.contains('Reuters')].count()/fake.count()
> title 0.013247
> text 0.013247
> subject 0.013247
> date 0.013247

Now let’s give the data labels and combine them into one dataset for training, then train/test split them.

# give labels to the data before combining
fake['fake'] = 1
real['fake'] = 0
combined = pd.concat([fake, real])

# train/test split the text data and labels
from sklearn.model_selection import train_test_split

features = combined['text']
labels = combined['fake']
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=42)
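
A quick sanity check worth adding at this point (not part of the original code) is the class balance of the combined data, since an accuracy number is only meaningful relative to it:

# roughly what fraction of the combined data is fake vs. real?
print(combined['fake'].value_counts(normalize=True))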

Now we process the text data with the Tokenizer object from Keras. We will not be removing stop words, as the context of each word and how the sentences and paragraphs are formed matter. I believe there is an underlying difference in the writing quality of the two classes. Journalists from Reuters actually have copy editors!

# keep only the top 2000 most common words in the vocabulary
# (imports assume tf.keras; the standalone keras package works the same way)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_words = 2000
max_len = 400

token = Tokenizer(num_words=max_words, lower=True, split=' ')
token.fit_on_texts(X_train.values)
sequences = token.texts_to_sequences(X_train.values)
train_sequences_padded = pad_sequences(sequences, maxlen=max_len)
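
To illustrate what the tokenizer produces, here is a made-up sentence run through the same pipeline; words outside the top 2,000 are simply dropped, because no out-of-vocabulary token was set:

sample = ["Washington (Reuters) - The senate passed the bill on Tuesday."]
sample_seq = token.texts_to_sequences(sample)              # list of word-index lists
sample_padded = pad_sequences(sample_seq, maxlen=max_len)  # left-padded with zeros
print(sample_seq)
print(sample_padded.shape)  # (1, 400)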

Now let’s build the model!

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Activation, Dropout

embed_dim = 50
lstm_out = 64
batch_size = 32

model = Sequential()
model.add(Embedding(max_words, embed_dim, input_length=max_len))
model.add(LSTM(lstm_out))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1, name='out_layer'))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
print(model.summary())

Now let’s train the model.

history = model.fit(train_sequences_padded, y_train, batch_size=batch_size,
                    epochs=5, validation_split=0.2)  # History kept for the plots below

LSTM model training progression on train data

Now let’s evaluate versus the test/holdout set.

test_sequences = token.texts_to_sequences(X_test)
test_sequences_padded = pad_sequences(test_sequences, maxlen=max_len)
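
A minimal sketch of the evaluation step itself, using the standard Keras evaluate API:

# evaluate on the held-out test set
loss, accuracy = model.evaluate(test_sequences_padded, y_test, batch_size=batch_size)
print(f'Test loss: {loss:.4f}  Test accuracy: {accuracy:.4f}')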

Pretty, pretty good.

Plotting both the accuracy and loss of the model reveals that it could probably still use some more training, as there is no evidence of overfitting.
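
A minimal sketch of that plot, assuming the history object captured from model.fit above (older standalone Keras versions name the keys 'acc'/'val_acc' rather than 'accuracy'/'val_accuracy'):

import matplotlib.pyplot as plt

# training vs. validation accuracy and loss per epoch
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['accuracy'], label='train')
ax1.plot(history.history['val_accuracy'], label='validation')
ax1.set_title('Accuracy')
ax1.legend()
ax2.plot(history.history['loss'], label='train')
ax2.plot(history.history['val_loss'], label='validation')
ax2.set_title('Loss')
ax2.legend()
plt.show()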

99% is a nice result; however, remember that nearly all of the real news has ‘Reuters’ in it? Granted, it is only one word, but I want to see how removing it from the text would affect the model’s performance (if at all). I think there must be a lot of other underlying patterns in both word choice and editing that might make the classification easy for the model.
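
A minimal sketch of that removal is a plain string replacement on the combined text before repeating the split/tokenize/train steps above (the exact approach in the notebook may differ):

# strip the giveaway word, then rebuild the sequences and retrain
combined['text'] = combined['text'].str.replace('Reuters', '', regex=False)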

After removing ‘Reuters’ from all news text, the resulting model’s test set evaluation had an accuracy of 98.5%. So, a slight decrease (0.6% difference) in its predictive capacity. I was kind of thinking it would be more.

Getting a great result like this almost out of the box should make you think more about the underlying data. If it’s too good to be true, it probably is! Reuters news stories rely upon a style guide and are rigorously edited, and I cannot say the same for the fake news. These underlying patterns might allow for the model to learn from this specific data set, but how does it generalize to news found in the wild from different sources?

I found a Business Insider article on the most-viewed fake news stories on Facebook in 2019. It was difficult to track down the full text of some of those examples.
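
To score an out-of-set story, its text has to go through the same tokenizer and padding as the training data. A small helper along these lines would do it (the predict_article name and 0.5 threshold are placeholders for illustration, not from the original notebook):

def predict_article(article_text, threshold=0.5):
    """Run a raw article string through the fitted tokenizer and the trained model."""
    seq = token.texts_to_sequences([article_text])
    padded = pad_sequences(seq, maxlen=max_len)
    prob_fake = float(model.predict(padded)[0][0])  # sigmoid output: P(fake)
    return ('FAKE NEWS' if prob_fake >= threshold else 'REAL'), prob_fake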

For the #5 shared story titled “Omar Holding Secret Fundraisers With Islamic Groups Tied to Terror,” the model predicted it as REAL.

For the #1 shared story titled “Trump’s grandfather was a pimp and tax evader; his father a member of the KKK,” the model predicted it as FAKE NEWS.

I also grabbed the current headline story on CNN, “Trump extends federal social distancing guidelines to April 30” and the model predicted it as REAL.

Conclusion

The model seemed to be a very powerful predictor on the training and test datasets; however, it may not generalize well beyond them. When presented with out-of-set fake news stories, the model went 1 for 2. That sample size is obviously tiny, and I would like to track down more fake news outside of this dataset and see how it performs. I would also like to try out more state-of-the-art models (ELMo/BERT). Hope you enjoyed reading!
