Fake News Classification via Anomaly Detection

Can we use anomaly detection to detect and filter fake news?

Doron Sadeh
Towards Data Science


Fake news has become one of the biggest problems of our age. It has polluted online as well as offline discourse. One could even go as far as saying that, to date, fake news poses a clear and present danger to Western democracy.

As the technology world has created the machinery allowing mass publication of such content, it is only fair that it come up with a solution to classify and filter it out.

In this article, we present a data-science-based approach that may be used to classify and filter fake news.

Note that this is not a full-fledged algorithm, but rather a skeletal description of how we can tackle the problem. Please comment if you have any suggestions regarding the algorithm, or if you would like to join an open-source project taking it from architecture to implementation.

Algorithm Overview

In order to classify fake news using anomaly detection we suggest the following steps:

  • Extract trending topics from multiple online sources (e.g. Twitter, Google, Facebook)
  • Detect news anomalies
  • Cross-check anomalies between and within sources to find contradictions
  • Classify contradictory anomalies as fake news

Don’t worry if this is not entirely clear yet; we will go into each of the above steps in depth, and the sketch below serves as a roadmap.
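As a roadmap, here is a minimal Python sketch of the proposed pipeline. Every function named here is a hypothetical placeholder, fleshed out (in sketch form) in the sections that follow.

```python
# A hypothetical skeleton of the proposed pipeline; every function
# named here is a placeholder, expanded on in the sections below.

def classify_fake_news(sources, time_window):
    # Step 1: extract trending topics (Twitter seeds the topic list)
    topics = extract_trending_topics(sources, time_window)

    detections = []
    for topic in topics:
        # Step 2: build one volume time series per source and align them
        series = {s: build_volume_series(s, topic, time_window) for s in sources}
        aligned = temporally_align(series)

        # Step 3: detect anomalies and cross-check them between sources
        anomalies = {s: detect_anomalies(ts) for s, ts in aligned.items()}
        if is_anti_correlated(anomalies):
            # Step 4: confirm with textual and sentiment evidence
            detections.append(topic)
    return detections
```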

Anomaly Detection

In data mining, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions.

[Figure: Anomaly in a Time Series]

It is not the purpose of this article to provide a comprehensive tutorial on anomaly detection (you can find a very good intro here), nor does it assume the reader is proficient in the underlying algorithms used to detect anomalies. The only prerequisite is a basic intuition of what an anomaly is (see the figure above to get one).

A News Anomaly

Before we start, let us define what a news anomaly is. Picture a day in the office, reading the news online. Your Twitter feed starts trending with a news piece about some event. It may be a local event, a global one, or something you are personally interested in. Regardless of the actual topic, you clearly notice a spike in the information transmitted about it.

We define a news anomaly as a topic that is trending over multiple sources, e.g. Twitter, Facebook, Google Search Trends, etc. Note that we allow for a time differential between sources, as a news piece may not hit all sources at the same time.

Detecting a News Anomaly

Before we can classify a news anomaly, we must detect one.

To do so, we employ tools from natural language processing to parse Twitter-provided trending topics and hashtags. We collect tweets belonging to the trending-topics list over a predefined time window, and categorize them into topics.

We further collect the main phrases in the tweet stream, creating a temporal feature vector holding the volume of each topic and phrase over time. We now have a time series of topic and phrase volumes.
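As a minimal sketch (assuming an upstream NLP step has already labeled each tweet with a topic), such a temporal feature vector can be built with pandas by bucketing tweet counts into fixed time windows; the 15-minute bucket size is an illustrative choice:

```python
import pandas as pd

def build_volume_series(tweets, freq="15min"):
    """Turn a list of (timestamp, topic) records into one volume
    time series per topic, bucketed into fixed time windows.

    `tweets` is assumed to be the output of an upstream NLP step
    that categorized each tweet into a trending topic.
    """
    df = pd.DataFrame(tweets, columns=["timestamp", "topic"])
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    # Count tweets per topic per time bucket -> a time series of volumes
    volumes = (
        df.groupby([pd.Grouper(key="timestamp", freq=freq), "topic"])
          .size()
          .unstack(fill_value=0)
    )
    return volumes  # rows: time buckets, columns: topics

# Usage: build_volume_series([("2023-01-01 10:02", "topicA"), ...])
```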

We go on to produce such a time series for each other online source (i.e. Facebook, Google), looking for the collected topics and phrases there. We use the seed trending topics and phrases from Twitter to find references to those topics on Facebook and Google Search Trends.

Once done, we have three time-series objects, each representing the volume (and possibly also the frequency, but we will not get into that right now) of each trending topic on each source.

As a trending topic may not appear at the same time on all sources (e.g. first on Twitter, then Facebook, then people start searching for it on Google), we need to temporally align the time-series objects. We do that by phase-shifting the time-series objects until we reach a minimal alignment error.
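A simple way to find that phase shift is to slide one series over the other and keep the lag with the smallest alignment error. A minimal numpy sketch, assuming all series are bucketed at the same rate:

```python
import numpy as np

def best_lag(reference, other, max_lag=48):
    """Find the shift (in buckets) that best aligns `other` to
    `reference` by minimizing mean squared error over candidate lags.
    `max_lag` of 48 buckets is an illustrative bound."""
    best, best_err = 0, np.inf
    n = min(len(reference), len(other))
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = reference[lag:n], other[:n - lag]
        else:
            a, b = reference[:n + lag], other[-lag:n]
        err = np.mean((a - b) ** 2)
        if err < best_err:
            best, best_err = lag, err
    return best

# Shift the Facebook/Google series by best_lag(...) buckets before comparing.
```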

Lies spread faster than the truth

“There is worldwide concern over false news and the possibility that it can influence political, economic, and social well-being. To understand how false news spreads, Vosoughi et al. used a data set of rumor cascades on Twitter from 2006 to 2017. About 126,000 rumors were spread by ∼3 million people. False news reached more people than the truth; the top 1% of false news cascades diffused to between 1000 and 100,000 people, whereas the truth rarely diffused to more than 1000 people. Falsehood also diffused faster than the truth. The degree of novelty and the emotional reactions of recipients may be responsible for the differences observed.”

From The spread of true and false news online, Soroush Vosoughi et al., MIT

We are aware that temporal alignment may skew the differentiation between fake and true news. However, there is a tension (which requires further thought) between canceling the news propagation delay between different sources and not losing information. This fine point may be addressed by measuring the propagation velocity of a topic T within each source separately, and adding that velocity as yet another feature in the classification below.

Enter anomaly detection.

We’ll use any of the widely available techniques (e.g. ARIMA, Holt-Winters, etc.) to detect raw anomalies in each of the above time-series objects. Note that we must detect both positive and negative volume anomalies (more on that in a minute).
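As a deliberately simple stand-in for an ARIMA or Holt-Winters residual detector, here is a rolling z-score sketch that flags both positive and negative volume anomalies; the window size and threshold k are illustrative assumptions:

```python
import pandas as pd

def detect_anomalies(volume, window=24, k=3.0):
    """Flag both positive and negative volume anomalies.

    `volume` is a pandas Series indexed by time bucket. A production
    version would swap this for residual-based detection on an ARIMA
    or Holt-Winters fit; a rolling z-score keeps the sketch short.
    """
    mean = volume.rolling(window, min_periods=window).mean()
    std = volume.rolling(window, min_periods=window).std()
    z = (volume - mean) / std
    return pd.DataFrame({
        "volume": volume,
        "z": z,
        "positive": z > k,    # sudden surge in volume
        "negative": z < -k,   # sudden drop in volume
    })
```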

We now have a list of anomalies, each detected as a point in time (with some surrounding window) for the extracted trending topic, across all sources.

Note that we are discussing volume anomalies only, as those are easy to extract across all source types. We could also convert the series into the frequency domain (using an FFT), creating a time series of how frequently a topic appears in the news; however, as not all sources support such a conversion (e.g. Google Search Trends), we will keep to the volume domain for now.

Classifying Fake News

Once we have our time-series objects nicely aligned for some topic T we can compare them.

What we are looking for are anomalies that are in anti-correlation, meaning that the volume of the anomaly on one source greatly differs from that on the other sources.
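A sketch of that cross-source check, assuming the aligned per-source z-scores from the previous step: we flag time buckets where exactly one source spikes while the others stay quiet. The thresholds are illustrative knobs, not tuned values.

```python
def anti_correlated_windows(z_scores, k=3.0, quiet=1.0):
    """Given aligned z-score Series per source, e.g.
    {'twitter': ..., 'facebook': ..., 'google': ...}, return time
    buckets where exactly one source shows a strong anomaly while
    all the others stay quiet."""
    sources = list(z_scores)
    flagged = []
    for t in z_scores[sources[0]].index:
        spiking = [s for s in sources if abs(z_scores[s].get(t, 0)) > k]
        rest_quiet = all(abs(z_scores[s].get(t, 0)) < quiet
                         for s in sources if s not in spiking)
        if len(spiking) == 1 and rest_quiet:
            flagged.append((t, spiking[0]))
    return flagged
```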

We now run a second pass over the raw textual data (within the anomaly windows only). We collect all textual data in those windows, looking for evidence that people have determined the story to be fake news. This may include the words "fake", "fake news", or any other phrase suggesting people are referring to the news piece as false. (We are oversimplifying this step for clarity; in practice the cue words must appear in some nearby context in sentence word order, or alternatively be syntactically connected to the topic mention, but we will not go into such fine details here.)
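In its crudest form this second pass is a keyword match over the anomaly windows; the cue list below is a hypothetical starting point, and a real implementation would add the contextual and syntactic checks mentioned above:

```python
import re

# Hypothetical cue list; a real system would also check syntactic
# attachment between the cue and the topic mention.
FAKE_CUES = re.compile(r"\b(fake news|fake|hoax|false|debunked)\b", re.I)

def fake_evidence_ratio(texts):
    """Fraction of texts in an anomaly window containing a
    'this is fake' cue (oversimplified: we only check for the
    cue anywhere in the text)."""
    if not texts:
        return 0.0
    hits = sum(1 for t in texts if FAKE_CUES.search(t))
    return hits / len(texts)
```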

[Figure: Comparing the appearance of the words “Fake news” with “Pizzagate” in Google Trends]

We then run a final analysis step in which we use sentiment analysis over the raw textual data in the anomaly windows to mark anti-correlating points in time sentiment-wise, i.e. where people had both strongly negative and strongly positive sentiments towards the texts across multiple sources.
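A sketch of that step using NLTK's VADER analyzer (one common off-the-shelf polarity scorer; any other would do), where the 0.5 threshold on the compound score is an assumption:

```python
# Requires: nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

def sentiment_split(texts, strong=0.5):
    """Return True if a window contains both strongly positive and
    strongly negative texts, i.e. the sentiment 'anti-correlation'
    signal. The 0.5 threshold on VADER's compound score is an
    illustrative assumption."""
    sia = SentimentIntensityAnalyzer()
    scores = [sia.polarity_scores(t)["compound"] for t in texts]
    return any(s > strong for s in scores) and any(s < -strong for s in scores)
```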

Intersecting the above findings, we can now perform a majority vote between the above classification steps, outputting a set of detections, each being a point (or window) in time where topic T is suspected to be fake news according to the above procedure.
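Combining the three signals (volume anti-correlation, fake-cue evidence, and sentiment split) then reduces to a vote per candidate window; the thresholds here are illustrative:

```python
def majority_vote(anti_corr, fake_ratio, has_sentiment_split,
                  fake_threshold=0.05):
    """Each classification step casts one vote; two of three suffice.
    The 5% fake-cue threshold is an illustrative assumption."""
    votes = [
        bool(anti_corr),
        fake_ratio > fake_threshold,
        bool(has_sentiment_split),
    ]
    return sum(votes) >= 2
```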

Setting the detections on a temporal axis, we measure the volume trend between them, i.e. we draw a graph where the Y-axis is message (or search) volume and the X-axis is time, over which we plot the combined volume of topic T over time.

Should the above graph show a slow-start trendline followed by a peak and then a decline (rapid or slow), we classify the topic as a fake news item. Otherwise, should the graph show great variance, we discard the detection altogether.
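One crude way to test for this wave shape versus a high-variance signal; the smoothing window and variance threshold are assumptions that would need tuning on real data:

```python
import numpy as np

def looks_like_news_wave(volume, var_threshold=0.5):
    """Heuristic test of the 'slow start, peak, decline' shape.
    Accept if the curve rises to a single dominant peak and decays,
    and is not dominated by noise; thresholds are assumptions."""
    v = np.asarray(volume, dtype=float)
    smooth = np.convolve(v, np.ones(5) / 5, mode="same")  # light smoothing
    peak = int(np.argmax(smooth))
    # Relative residual variance: how much of the signal is noise
    noise = np.std(v - smooth) / (np.mean(v) + 1e-9)
    # The curve should rise before the peak and decline after it
    rising = peak > 0 and smooth[peak] > smooth[0]
    falling = peak < len(smooth) - 1 and smooth[peak] > smooth[-1]
    return rising and falling and noise < var_threshold
```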

[Figure: Normal News Trend]

The reasoning behind this final filtering step is that trending news usually behaves like a wave: starting slow, peaking, then dying down. If we detect a high-variance signal, it may suggest this is not a news piece at all (though this requires further investigation on actual data).

Call to Action

As mentioned at the beginning of this article, we have not provided a tested fake news classifier, but rather an algorithmic idea for one. To prove that the above is a feasible solution with an acceptable level of recall and accuracy, one needs to implement the algorithm over large corpora extracted from multiple sources and validate it against known fake news cases (i.e. labeled data).

Please feel free to point out any problems in the above approach, or join us in an open-source initiative implementing it.
