Part II: Using A.I. to Combat Fake News - Modeling Update

Building upon a competition-winning model

Michael Harder
Towards Data Science


Work conducted by: David Kebudi, Jason Katz, Michael Harder, and Naina Wodon

Part I: Using A.I. to Combat Fake News

Part III: Using A.I. to Combat Fake News - Final Model

In this follow-up to our first post on fake news classification, we will introduce our updated model and walk you through the steps. To quickly summarize, the dataset for this task contains 49,972 observations. Of those, about 75%, or 36,545, belong to the “unrelated” class. The goal of the task is to classify each input, a headline and its associated article body, as (1) agree, (2) disagree, (3) discuss, or (4) unrelated. The discuss label comprised 17.8% (8,909 observations) of the data, followed by the agree label (7.3%) and the disagree label (1.68%).

Figure 1 — Bar chart for the stance counts. This is an unbalanced dataset.

The competition website provides competitors with a training set, a test set, and a baseline GradientBoosting classifier with a weighted accuracy score of 79.53%. The task is a multi-label multi-class problem: the network needs to first decide whether the headline and body are related or unrelated, and only then, if the two are related, categorize their relationship as agree, disagree, or discuss.

The accuracy scores of the submissions to the competition are calculated through a function given by the competition’s organizers and can be summarized as:

Figure 2 — Scoring structure for the competition
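The figure is an image in the original post; as a rough, unofficial sketch of the weighting it describes (a quarter of the available credit for the related/unrelated decision, the remaining three quarters for the exact stance of related pairs), a scoring function could look like the Python below. This is our paraphrase, not the organizers' implementation:

RELATED = {"agree", "disagree", "discuss"}

def weighted_score(gold, pred):
    # 0.25 credit for getting the related/unrelated split right,
    # plus 0.75 credit for the exact stance when the pair is related.
    score, max_score = 0.0, 0.0
    for g, p in zip(gold, pred):
        max_score += 0.25 if g == "unrelated" else 1.0
        if (g == "unrelated") == (p == "unrelated"):
            score += 0.25
        if g in RELATED and g == p:
            score += 0.75
    return score / max_score  # relative weighted accuracy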

Winning Model

The winning model, with an 82.02% accuracy score, is the model we are trying to improve upon in this project. The neural network architecture of the winning model is as follows:

Figure 3 — Competition winning CNN model structure (https://github.com/Cisco-Talos/fnc-1/tree/master/deep_learning_model)

The architecture treats the problem as a multi-class, multi-label problem and feeds both inputs, the headline text and the body text, into the network. The two inputs use separate embedding layers, customized for the headline and the body respectively, and are concatenated later in the network. Each input goes through a series of identical convolutional layers, and the concatenated result then passes through a series of dense layers. The last layer of the network is a dense layer with 4 units, one per label: agree, disagree, and discuss for related pairs, plus unrelated.

The model, however, does not process the engineered features, such as sentiment features computed from the text, in the neural network. Instead, the architects of the model opted to use a separate XGBoost classifier on those features.

Figure 4 — Competition winning XGBoost model structure (https://github.com/Cisco-Talos/fnc-1/tree/master/tree_model)

They then combine the two models, the CNN and the XGBoost classifier, with a weighted average of their predictions. With this technique, they achieve 82.02%.
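The exact ensembling code and weights live in the winning team's repository; purely as an illustration, a weighted average of the two models' class probabilities could be combined as below (the 50/50 weights and the names cnn_model, xgb_model, test_inputs, test_features are placeholders, not the team's actual values):

import numpy as np

# Hypothetical ensemble: average the class probabilities of the CNN and
# the XGBoost model, then take the most probable of the four classes.
cnn_probs = cnn_model.predict(test_inputs)          # shape (n, 4)
xgb_probs = xgb_model.predict_proba(test_features)  # shape (n, 4)

ensemble_probs = 0.5 * cnn_probs + 0.5 * xgb_probs  # placeholder weights
predictions = np.argmax(ensemble_probs, axis=1)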

Our Model

Our model improves on 2 main points: (1) it splits the task into two separate classification problems, and (2) it feeds the engineered features directly into the neural network.

Redefining the Problem as Classification

The problem defined by the competition, as explained above, is inherently a multi-label multi-class problem. However, we do not have to solve it as one. We can split it into two main chunks and train a separate model for each classification.

The first model classifies whether the headline and the body are related or unrelated. It takes in the headlines, bodies, engineered features, and labels. However, it is important to note that the data given by the competition does not have a label named “related”. Instead, if a pair is related, the data simply labels the headline/body relationship as agree, disagree, or discuss. Thus, we need to create a new dataset with the label “related” in place of “agree/disagree/discuss”. To do so, we simply use the following chunk of code:

Figure 5 — Data preprocessing code
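The figure above shows our actual preprocessing code; a minimal pandas sketch of the same idea, assuming the two frames are named agree_df (related pairs) and unrelated_df (unrelated pairs) and carry a 'Stance' column, would be:

import pandas as pd

# Concatenate the two frames, then recode every stance that is not
# 'unrelated' as 'related'.
full_df = pd.concat([agree_df, unrelated_df], ignore_index=True)
full_df["Stance"] = full_df["Stance"].where(
    full_df["Stance"] == "unrelated", "related"
)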

We concatenate the agree dataset, which contains only the related headline/body pairs, with the related dataset, which contains only the “unrelated” labels. Then we simply recode any label that is not “unrelated” as “related”. This also allows us to use the whole dataset for training the related/unrelated model. It is important to remember to shuffle the new data frame after the concatenation; without shuffling, the labels are polarized, with all the unrelated rows at the top and all the related rows at the bottom.

Figure 6 — Data sampling code
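The shuffling itself is a one-liner in pandas (continuing the sketch above):

# Shuffle so related and unrelated rows are interleaved rather than stacked.
full_df = full_df.sample(frac=1, random_state=42).reset_index(drop=True)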

The second model simply classifies the relationship between headline and body as agree, disagree, or discuss. Its dataset is significantly smaller than that of the first model, because only about 25% of the whole data is actually labeled as “related”. It is also important to note that the two models share an identical architecture.

Model Architecture

Figure 7 — Initial model architecture with three inputs. One with convolutional layers for the headline, one with convolutional layers for the body, and one with dense layers for the engineered features.

Instead of using an XGBoost classifier, we decided to create a third input to the network, composed of dense layers. This input is an n×3 matrix, where n is the number of observations and the 3 columns are the engineered features: [“fk_scores”, “word_count”, “num_grammar_errors”]. All three features are computed on the article body. fk_score is the Flesch-Kincaid grade level, a widely used linguistic metric that measures the complexity of the language an article is written in; it is a readability test designed to indicate how difficult a passage of English is to understand. It is calculated as:

Figure 8 — Formula for flesch-kincaid grade level
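For readers without the figure, the standard Flesch-Kincaid grade level formula is:

FK grade level = 0.39 * (total words / total sentences) + 11.8 * (total syllables / total words) - 15.59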

The result of the function is a number that corresponds to a U.S. grade level. “num_grammar_errors” was calculated using a program called LanguageTool, which checks a body of text for grammar errors (a sketch of how these features can be computed appears at the end of this section). The model is composed of two CNNs with different embeddings and one dense network. The three branches later converge and are concatenated, then pass through a few more dense layers before the prediction.
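As a rough sketch of how these three body features could be computed, here we use the textstat and language_tool_python packages as stand-ins; our actual feature-extraction code may differ:

import textstat
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def engineered_features(body_text):
    # Flesch-Kincaid grade level of the article body
    fk_score = textstat.flesch_kincaid_grade(body_text)
    # Simple whitespace word count
    word_count = len(body_text.split())
    # Number of issues LanguageTool flags in the body
    num_grammar_errors = len(tool.check(body_text))
    return [fk_score, word_count, num_grammar_errors]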

Convolutional Layers

Figure 9 — Code for the CNN
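Since the actual code is shown as an image, here is a minimal Keras sketch of one such convolutional branch; the filter counts, kernel sizes, and sequence length are assumptions, not the values in the figure:

from tensorflow.keras import layers, Input

def build_cnn_branch(vocab_size, embedding_matrix, max_len, embed_dim=300):
    # Text input: a padded sequence of word indices
    text_in = Input(shape=(max_len,))
    # Embedding layer initialized with the pretrained word2vec weights
    x = layers.Embedding(vocab_size, embed_dim,
                         weights=[embedding_matrix], trainable=False)(text_in)
    # Two convolutional layers, as described above
    x = layers.Conv1D(128, 5, activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, 5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    # Dense layer with 34 units that feeds the concatenation layer
    branch_out = layers.Dense(34, activation="relu")(x)
    return text_in, branch_out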

The convolutional networks (inputs 1 and 2 of the model) follow a simple architecture with two convolutional layers. Each outputs a dense layer with 34 units, which then connects to the concatenation layer.

The embedding layer, which serves as the first layer of the network, is initialized with the embedding weights previously calculated from Google’s word2vec.

Before we train the embedding weights, we process the input data by removing punctuation and tokenizing the text into an array of words. We also considered using a linguistic root-reduction method that converts words to their roots, so that “describing” and “described” both become “describe”. However, due to embedding concerns, we decided not to use rooting in this version of the model.

Figure 10 — Code for creating training data
Figure 11 — Code for rooting
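As a rough sketch of this preprocessing (punctuation removal and tokenization, plus the optional rooting step we ultimately left out), using NLTK's PorterStemmer as one possible rooting method:

import string
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize(text, use_rooting=False):
    # Strip punctuation, lowercase, and split into an array of words
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = text.split()
    if use_rooting:
        # Optional: map related word forms to a common root
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens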

We then import the word2vec model that Google trained on its news corpus of roughly 100 billion words:

Figure 12 — Code for importing word2vec
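One common way to load the pretrained Google News vectors is through gensim; the file path below is a placeholder:

from gensim.models import KeyedVectors

# Pretrained 300-dimensional Google News embeddings (path is a placeholder)
word2vec = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)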

Then, using a single method that builds an embedding matrix, we pass in the headlines and the bodies separately, for both the agree dataset and the related dataset. This gives us 4 matrices to be used in the 2 different models for the two classifications.

Figure 13 — Code for data embedding
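A sketch of such an embedding-matrix builder, assuming a Keras Tokenizer's word_index and the gensim vectors loaded above:

import numpy as np

def build_embedding_matrix(word_index, word2vec, embed_dim=300):
    # Row i holds the pretrained vector for the word with index i;
    # words missing from word2vec are left as zero vectors.
    matrix = np.zeros((len(word_index) + 1, embed_dim))
    for word, idx in word_index.items():
        if word in word2vec:
            matrix[idx] = word2vec[word]
    return matrix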

Dense Layers

The Dense layers for the network are also very simple. We used the following architecture:

Figure 14 — Code for creating MLP
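A minimal sketch of that dense branch and the concatenation that follows; the layer sizes are assumptions, and build_cnn_branch, vocab_size, max_len, and the embedding matrices refer to the earlier sketches rather than to our actual code:

from tensorflow.keras import layers, Input, Model

def build_mlp_branch(n_features=3):
    # Engineered-feature input: [fk_scores, word_count, num_grammar_errors]
    feat_in = Input(shape=(n_features,))
    x = layers.Dense(16, activation="relu")(feat_in)
    x = layers.Dense(8, activation="relu")(x)
    return feat_in, x

# Assemble the three branches into the related/unrelated model
head_in, head_out = build_cnn_branch(vocab_size, headline_matrix, max_len)
body_in, body_out = build_cnn_branch(vocab_size, body_matrix, max_len)
feat_in, feat_out = build_mlp_branch()

merged = layers.concatenate([head_out, body_out, feat_out])
merged = layers.Dense(64, activation="relu")(merged)
# The agree/disagree/discuss model would instead end in
# Dense(3, activation="softmax") with categorical cross-entropy.
output = layers.Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[head_in, body_in, feat_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])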

It is important to note that, while we are not including the code for it here, we used LabelEncoder and StandardScaler to preprocess the engineered features as well as the labels. After label encoding, we used one-hot encoding of the labels for the agree/disagree/discuss model.
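For completeness, that preprocessing looks roughly like the following (again a sketch, with agree_df as the assumed name of the related-pairs frame):

from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow.keras.utils import to_categorical

# Scale the engineered features and encode the stance labels
scaler = StandardScaler()
X_feats = scaler.fit_transform(
    agree_df[["fk_scores", "word_count", "num_grammar_errors"]]
)

encoder = LabelEncoder()
y_int = encoder.fit_transform(agree_df["Stance"])  # agree/disagree/discuss -> 0/1/2
y_onehot = to_categorical(y_int)                   # one-hot targets for the 3-class model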

Results

The competition’s weighted accuracy score, described earlier, gives a quarter of the credit for correctly classifying headline-body pairs as related or unrelated, and the remaining three quarters for correctly characterizing related pairs as agreeing with, disagreeing with, or discussing the headline. As we can see from the graph below, the accuracy for classifying headline-body pairs as related or unrelated was 73.13%.

Figure 15 — Training results with test accuracy of 73.12%

Surprisingly, across the 8 epochs the training and test scores stayed constant. The next graph displays the accuracy of the second CNN, which classifies the headline-body pairs into the agree, disagree, or discuss categories. As previously mentioned, only about 25% of the data was fed into the second CNN, as the other 75% was classified as unrelated by the first model. After 4 epochs the training accuracy was 96.01% and the test accuracy was 94.97%. The overall weighted accuracy score was 89.51% = 73.13*(¼) + 94.97*(¾), significantly higher than the baseline model’s score of 79.53%.

Figure 16 — Training results with test accuracy of 94.97%

While this score was computed on our own test split rather than the final test data used in the competition, and so cannot be compared directly, it is still worth noting that the winning model had an accuracy score of 82.02%. As such, we can hypothesize that the model outlined above performs better than the winning model.

Next Steps

Train your own word2vec

One way to improve the model is to train your own word2vec on your own corpus. Since that requires a very large dataset, we can instead import Google’s pretrained word2vec and then use TF-IDF weighting to adapt the already trained weights to our smaller dataset.

Figure 17 — Code for TFIDF embedding

To do so, you first need to calculate the TF-IDF values over the corpus, which can easily be done using scikit-learn’s TfidfVectorizer.

Figure 18 — Code for building word2vec

Once that is done, you build a function that updates Google’s word2vec matrix according to the TF-IDF values calculated before.

Figure 19 — Code for building matrix

We then simply convert the tokenized words into the vector space using the new weights. This process also allows the user to skip the embedding layer and feed the inputs directly to a convolutional layer.
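Putting the three steps together, a sketch along the lines of the referenced guide (the names corpus, tokens, and word2vec are assumptions carried over from the earlier sketches) might be:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Fit TF-IDF on the raw corpus (a list of article-body strings)
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
idf_weights = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

# 2. Build a pre-embedded sequence: each row is a word's Google vector
#    scaled by its IDF weight, ready to feed straight into a Conv1D layer.
def weighted_sequence(tokens, word2vec, max_len, embed_dim=300):
    seq = np.zeros((max_len, embed_dim))
    for i, word in enumerate(tokens[:max_len]):
        if word in word2vec:
            seq[i] = word2vec[word] * idf_weights.get(word, 1.0)
    return seq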

We attempted to use this method to build a more customized word-to-vector space; however, the process proved to be computationally expensive. Thus, we plan on doing this in the final leg of our project.

Build a multi-label multi-class single model instead of two models

Another way we can improve the project is to make our network more similar to that of the winning model. This means treating the problem as a multi-label multi-class problem instead of two separate classification problems.

There are two ways to do this: (1) do not create separate agree and related datasets to feed into the network, but simply feed in a single dataset with multi-class, multi-label target variables; or (2) keep two networks but have them converge before making a prediction, with a final dense layer of 4 units, treating the problem as the multi-label multi-class problem it is.

Upsampling

We can use upsampling in the related/unrelated model to make the dataset more balanced. That model, at 73% accuracy, is essentially matching the class balance: since about 75% of the dataset is labeled unrelated, a classifier that always predicts “unrelated” would already score roughly 75%.
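One simple way to do this is with scikit-learn's resample utility (full_df is the assumed name of the related/unrelated frame from the earlier sketch):

import pandas as pd
from sklearn.utils import resample

related = full_df[full_df["Stance"] == "related"]
unrelated = full_df[full_df["Stance"] == "unrelated"]

# Upsample the minority 'related' class to match the majority class size
related_up = resample(related, replace=True,
                      n_samples=len(unrelated), random_state=42)
balanced_df = pd.concat([unrelated, related_up]).sample(
    frac=1, random_state=42).reset_index(drop=True)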

Additional Figures

Figure 20 — Training related/unrelated model loss
Figure 21 — Training headline-body pairs model loss

References

Two different techniques for training Google’s Word2Vec or building your own; a wonderful guide to deploying the technology: https://towardsdatascience.com/natural-language-processing-classification-using-deep-learning-and-word2vec-50cbadd3bd6a

GitHub repo for the third-best model in the competition, which used a cosine concatenation when joining the inputs of the neural nets: https://github.com/uclnlp/fakenewschallenge

Explanation of why and how CNNs work with NLP tasks: https://medium.com/saarthi-ai/sentence-classification-using-convolutional-neural-networks-ddad72c7048c

Documentation for writing a keras model with multiple inputs: https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/

Groundbreaking Word2Vec paper published in 2013: https://arxiv.org/pdf/1301.3781.pdf

Documentation for writing a multi-channel Keras model for an NLP task: https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/

Explanation of how NLP works with different techniques in a purely mathematical space: https://towardsdatascience.com/nlp-learning-series-part-1-text-preprocessing-methods-for-deep-learning-20085601684b

Example vectorizer methods from Github: https://github.com/Cisco-Talos/fnc-1/blob/master/deep_learning_model/Vectors.py
