How I Created a Fake News Detector with Python

Developing a fake news detection app with spaCy and Streamlit

Giannis Tolios
Towards Data Science


Photo by Markus Winkler on Unsplash

The proliferation of fake news is a significant challenge for modern democratic societies. Inaccurate information can affect the health and well-being of people, especially during the challenging times of the COVID-19 pandemic. Furthermore, disinformation erodes public trust in democratic institutions by preventing citizens from making rational decisions based on verifiable facts. A disturbing study has shown that fake news reaches more people and spreads faster than actual facts, especially on social media. MIT researchers have discovered that fake news is 70% more likely to be shared on platforms like Twitter and Facebook¹.

Fake news campaigns are a form of modern information warfare, used by states and other entities to undermine the power and legitimacy of their opponents. According to EU authorities, European countries have been targeted by Chinese and Russian disinformation campaigns spreading falsehoods about numerous topics, including the COVID-19 pandemic. The East StratCom Task Force was set up to deal with this problem by monitoring and debunking fake news about EU member states.

Fact-checkers are individuals who verify the factual correctness of published news. These professionals debunk fake news by identifying their false claims. Research has shown that traditional fact-checking can be augmented by machine learning and natural language processing (NLP) algorithms². In this article, I am going to explain how I developed a web application that detects fake news written in my native language (Greek), using the Python programming language.

The Greek Fake News Dataset

The success of every machine learning project depends on having a proper and reliable dataset. There are numerous publicly available fake news datasets, such as LIAR³ and FakeNewsNet⁴, but unfortunately most of them consist exclusively of English articles. As I couldn’t find any datasets including articles in Greek, I decided to create my own. The Greek Fake News (GFN) dataset consists of real and fake news articles written in Greek, and can be used to train text classification models, as well as for other NLP tasks.

The dataset was created based on the following methodology. First of all, real news items were collected from a number of reputable Greek newspapers and websites. I added news from a variety of topics, mostly focusing on politics, the economy, the COVID-19 pandemic and world news. To identify fake news articles, I consulted Ellinika Hoaxes, a Greek fact-checking website that has been certified by the International Fact-Checking Network (IFCN). A sample of news items verified to be false was also added to the dataset. After that process was completed, the resulting dataset was used to train the text classification model of the Greek Fake News Detector application.

The spaCy Python Library

There are numerous advanced Python libraries that can be used for natural language processing tasks. One of the most popular is spaCy, an NLP library that comes with pre-trained models, as well as support for tokenization and training for more than 60 languages. spaCy includes components for named entity recognition (NER), part-of-speech tagging, sentence segmentation, text classification, lemmatization, morphological analysis, and others. Furthermore, spaCy is robust and production-ready software that can be used in real-world products. This library was used to create the text classification model of the Greek Fake News Detector application.

The Streamlit Framework

Streamlit is a Python framework that lets you build web apps for data science projects very quickly. You can easily create a user interface with various widgets in a few lines of code. Furthermore, Streamlit is a great tool for deploying machine learning models to the web and adding rich visualizations of your data. Streamlit also has a powerful caching mechanism that optimizes the performance of your app. In addition, Streamlit Sharing is a service provided freely by the library creators, which lets you easily deploy and share your app with others. A detailed introduction to Streamlit is available here.

Developing the Web Application

I decided to develop the Greek Fake News Detector for a number of reasons. First of all, I wanted to acquaint myself with the spaCy library and NLP in general, thus enhancing my skill set and evolving as a professional. Second, I wanted to showcase the potential of using machine learning to deal with the issue of fake news, in a way that is accessible to non-experts. The best way to achieve that was by developing a simple prototype in the form of a web application. Streamlit is an ideal tool for this purpose, so I decided to utilize it. I will now explain the source code functionality, starting with the text classification model training. I originally used a Jupyter notebook, but for the purposes of this article, the code was converted to the following Python file, named gfn_train.py.

First of all, we import the necessary Python libraries and define two helper functions. The load_data() function shuffles the data, assigns a category to each news article, and splits the dataset into train and test subsets. The evaluate() function calculates various metrics, such as precision, recall and F-score, which help us evaluate the performance of the text classifier. After defining the helper functions, we load the spaCy pre-trained model. I’ve chosen the el_core_news_md model, as we’re working with articles written in Greek.
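The helper functions themselves are not reproduced in this excerpt, so here is a minimal, dependency-free sketch of what load_data() and evaluate() might look like. The function names come from the article, but the exact signatures and the REAL/FAKE category labels are illustrative assumptions:

```python
import random

def load_data(rows, split=0.8, seed=42):
    """Shuffle (text, is_fake) pairs in place, attach spaCy-style
    category dicts, and split into train and test subsets."""
    random.Random(seed).shuffle(rows)
    data = [(text, {"cats": {"FAKE": bool(fake), "REAL": not fake}})
            for text, fake in rows]
    cut = int(len(data) * split)
    return data[:cut], data[cut:]

def evaluate(predictions, gold):
    """Compute precision, recall and F-score for the FAKE class,
    given boolean predictions and gold labels."""
    tp = sum(p and g for p, g in zip(predictions, gold))
    fp = sum(p and not g for p, g in zip(predictions, gold))
    fn = sum(not p and g for p, g in zip(predictions, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f_score": f_score}
```

In the actual training script, the per-class scores come from the trained textcat component rather than boolean predictions, but the metric arithmetic is the same.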

After doing that, we load the GFN dataset into a pandas dataframe, and clean it by removing some unwanted characters. Afterwards, we add the textcat component to our pre-trained model. This component will be trained with the GFN dataset, to create the text classification model. We then disable the other components, as we only need to train textcat. Afterwards, we use the load_data() and update() functions to load the dataset and train the model, respectively. The evaluate() function that we defined earlier is used to print the training metrics and performance. After the training is complete, the model is saved by using the to_disk() function. We are now going to examine app.py, the main file of the Streamlit web application.
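Since gfn_train.py is not reproduced here, the training step can be sketched roughly as follows. Note this outline uses the current spaCy 3.x API (the original code follows older spaCy conventions with disabled pipes), a blank Greek pipeline instead of el_core_news_md so it runs without a model download, and a two-item toy dataset in place of the GFN articles:

```python
import random
import re
import spacy
from spacy.training import Example

def clean_text(text):
    """Collapse whitespace; the article's cleaning step removes
    unwanted characters in a similar fashion."""
    return re.sub(r"\s+", " ", text).strip()

# Toy stand-in for the GFN dataset: (text, annotations) pairs
train_data = [
    (clean_text("Ακριβής είδηση για την οικονομία."),
     {"cats": {"REAL": 1.0, "FAKE": 0.0}}),
    (clean_text("Ψευδής ισχυρισμός για την πανδημία."),
     {"cats": {"REAL": 0.0, "FAKE": 1.0}}),
]

nlp = spacy.blank("el")            # blank Greek pipeline
textcat = nlp.add_pipe("textcat")  # add the text classification component
textcat.add_label("REAL")
textcat.add_label("FAKE")

examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in train_data]
optimizer = nlp.initialize(lambda: examples)

for epoch in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

doc = nlp("Μια νέα είδηση.")  # doc.cats holds a score per category
# nlp.to_disk("gfn_model")    # persist the trained pipeline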

We begin by importing Streamlit, spaCy, and other necessary libraries. After doing that, we define the get_nlp_model() function that loads the spaCy text classification model we trained earlier. This function is marked with the @st.cache decorator, which lets Streamlit store the model in a local cache, thus improving performance. Afterwards, we define the generate_output() function that prints the classification result, using the markdown() function and some common HTML tags. After that, the article text is printed, along with a word cloud for visualization purposes.
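The HTML that generate_output() passes to st.markdown() can be sketched as a plain helper, independent of the Streamlit runtime. The colors, threshold, and wording below are illustrative assumptions, not the app's actual output:

```python
def format_result(cats, threshold=0.5):
    """Turn spaCy textcat scores into the HTML snippet shown to the
    user via st.markdown(..., unsafe_allow_html=True).
    Hypothetical helper; the labels and styling are assumptions."""
    label = "FAKE" if cats.get("FAKE", 0.0) >= threshold else "REAL"
    color = "red" if label == "FAKE" else "green"
    score = cats.get(label, 0.0)
    return (f"<h3 style='color:{color};'>This article looks "
            f"{label} ({score:.0%} confidence)</h3>")
```

Rendering raw HTML through st.markdown() requires the unsafe_allow_html=True flag, which is why the snippet sticks to simple, trusted tags.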

We then create the application layout, using various Streamlit widgets. First of all, we set the page title and description. Second, we create a radio button widget that is used for input type selection. By doing that, users can select between entering the article URL or text. If the user selects the article URL as the input type, the text is scraped using the get_page_text() function. Otherwise, the user can paste the article into a multi-line text input. In both cases, a button widget is used to call the generate_output() function, thus classifying the article and printing the result. Finally, we can execute the streamlit run app.py command to run the application locally, or use the free Streamlit Sharing service to deploy it.
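The article does not show get_page_text(); a scraper of that kind typically downloads the page and pulls the text out of its paragraph tags. Below is a rough stdlib-only stand-in for the parsing half (the real helper presumably fetches the URL first, e.g. with requests, and may use a library such as Beautiful Soup; the class name here is hypothetical):

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text contained in <p> tags of an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data.strip())

def get_page_text(html):
    """Return the article body as a single string (parsing step only;
    fetching the page over HTTP is omitted in this sketch)."""
    parser = ParagraphText()
    parser.feed(html)
    return " ".join(chunk for chunk in parser.chunks if chunk)
```

Restricting extraction to paragraph tags is a crude but effective way to drop navigation menus, captions, and other page chrome before classification.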

Conclusion

I hope that after reading this article, you’ll be more knowledgeable about the potential of using NLP and machine learning to deal with the serious problem of fake news. Furthermore, I encourage you to experiment and create your own fake news detection application, as modifying the code to train the model on a different dataset is simple. In case you want to clone the GitHub repository of this project, it is available here. Feel free to share your thoughts in the comments, or follow me on LinkedIn where I regularly post content about data science and other topics. You can also visit my personal website or check my latest book, titled Simplifying Machine Learning with PyCaret.

References

[1] Vosoughi, Soroush, Deb Roy, and Sinan Aral. “The spread of true and false news online.” Science 359.6380 (2018): 1146–1151.

[2] Oshikawa, Ray, Jing Qian, and William Yang Wang. “A survey on natural language processing for fake news detection.” arXiv:1811.00770 (2018).

[3] Wang, William Yang. “‘Liar, liar pants on fire’: A new benchmark dataset for fake news detection.” arXiv:1705.00648 (2017).

[4] Shu, Kai, et al. “FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media.” Big Data 8.3 (2020): 171–188.
