MACHINE LEARNING DEPLOYMENT

The spread of fake news has become nearly unstoppable with the adoption of social networks. On Twitter, Facebook, and Reddit, people exploit fake news to spread rumours, gain political advantage, and chase click rates.
Detecting fake news is critical for a healthy society, and there are multiple approaches to the problem. From a machine learning standpoint, fake news detection is a binary classification problem; hence we can use either traditional classification methods or state-of-the-art neural networks to tackle it.
This tutorial will create a natural language processing application from scratch and deploy it with Flask. In the end, you will have a fake news detection web app running on your local machine. See the teaser here.
The tutorial is organized in the following structure:
- Step1: Load data from Kaggle to Google Colab.
- Step2: Text preprocessing.
- Step3: Model training and validation.
- Step4: Pickle and load model.
- Step5: Create a Flask APP and a virtual environment.
- Step6: Add functionalities.
- Conclusion.
Note: The complete notebook is on GitHub.
Step1: Load data from Kaggle to Google Colab
Well, the most fundamental part of a machine learning project is data. We will use the Fake and real news dataset from Kaggle to build our machine learning model.
I wrote a blog about how to download data from Kaggle to Google Colab before. Feel free to follow the steps inside.
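If you just need a quick reminder, here is a minimal sketch of that workflow in a Colab cell, assuming you already have a Kaggle API token (kaggle.json); the dataset slug below is my guess, so copy the exact slug from the Kaggle dataset page.
# Upload kaggle.json when prompted, register it with the Kaggle CLI, then download and unzip.
from google.colab import files
files.upload()  # select your kaggle.json
!mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset
!unzip -o fake-and-real-news-dataset.zip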
There are two separate CSV files in the folder, True.csv and Fake.csv, corresponding to real and fake news. Let’s have a look at what the data look like:
true = pd.read_csv('True.csv')
fake = pd.read_csv('Fake.csv')
true.head(3)

Step2: Text preprocessing
The datasets have four columns but no label yet, so let’s create the labels first: fake news gets label 0 and real news gets label 1.
true['label'] = 1
fake['label'] = 0
The datasets are relatively clean and organized. For the sake of training speed, we are using the first 5000 data points in both datasets to build the model. You can also use the complete datasets to get a more comprehensive result.
# Combine the sub-datasets in one.
frames = [true.iloc[:5000], fake.iloc[:5000]]
df = pd.concat(frames)
df.tail()

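Before moving on, it is worth confirming that the combined frame really is balanced between the two classes; a quick one-off check (not required for the rest of the pipeline) is enough:
# Count how many rows carry each label; the two classes should be roughly equal in size.
print(df['label'].value_counts())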
Let’s also separate features and labels as well as make a copy of the DataFrame for later training.
X = df.drop('label', axis=1)
y = df['label']
# Delete missing data
df = df.dropna()
df2 = df.copy()
df2.reset_index(inplace=True)
Cool! Time for the real text preprocessing, which includes removing punctuation, lowercasing all characters, removing stopwords, and stemming; this whole pipeline is often loosely referred to as tokenization.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
import nltk
nltk.download('stopwords')
ps = PorterStemmer()
corpus = []
for i in range(0, len(df2)):
    review = re.sub('[^a-zA-Z]', ' ', df2['text'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
Next, let’s use the TF-IDF vectorizer to convert each tokenized document into a vector, i.e. vectorize the text. You can also experiment with other word embedding techniques on this dataset, such as Word2Vec, GloVe, or even BERT, but I found TF-IDF good enough to generate accurate results.
A concise explanation of TF-IDF (Term Frequency – Inverse Document Frequency): it measures how important a word is to a document by weighing the word’s frequency in that document against how often it appears in the other documents of the same corpus.
For example, the word "detection" appears quite often in this article but rarely in other articles in the Medium corpus, so "detection" is an important word for this post; in contrast, the word "term" shows up in almost every document with high frequency, so it carries little weight.
A more detailed introduction to TF-IDF can be found in this Medium blog.
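If you prefer to see the idea in action, here is a tiny toy example (not part of the pipeline, and assuming a recent scikit-learn for get_feature_names_out): words shared by several documents receive lower weights than words unique to one document.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy_corpus = [
    "fake news detection",
    "news report",
    "detection of fake accounts",
]
vect = TfidfVectorizer()
weights = vect.fit_transform(toy_corpus).toarray()
# In "news report", the shared word "news" gets a lower weight than the rarer word "report".
print(pd.DataFrame(weights, columns=vect.get_feature_names_out()).round(2))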
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_features=5000, ngram_range=(1,3))
X = tfidf_v.fit_transform(corpus).toarray()
y = df2['label']
Mostly done! Let’s do the last step and split the dataset into training and test sets!
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step3: Model training and validation
You can try multiple classification algorithms here: Logistic Regression, SVM, XGBoost, CatBoost or neural networks. I am using the online Passive-Aggressive algorithm.
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import metrics
import numpy as np
import itertools
classifier = PassiveAggressiveClassifier(max_iter=1000)
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)

Pretty good result! Let’s print the confusion matrix to have a look at the False Positives and False Negatives.
import matplotlib.pyplot as plt
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
cm = metrics.confusion_matrix(y_test, pred)
plot_confusion_matrix(cm, classes=['FAKE', 'REAL'])

So, we got 3 False Positives and no False Negatives using the Passive-Aggressive algorithm in a balanced dataset with a TF-IDF vectorizer.
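If you also want per-class precision, recall, and F1 alongside the confusion matrix, scikit-learn’s classification report (reusing the same y_test and pred from above) gives them in one call:
# Per-class precision, recall and F1; label 0 is FAKE and label 1 is REAL.
print(metrics.classification_report(y_test, pred, target_names=['FAKE', 'REAL']))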
Let’s validate using an unseen data point, say, the one at index 13070 of the Fake CSV file. We expect the classification model to output 0.
# Tokenization
review = re.sub('[^a-zA-Z]', ' ', fake['text'][13070])
review = review.lower()
review = review.split()
review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
review = ' '.join(review)
# Vectorization
val = tfidf_v.transform([review]).toarray()
# Predict
classifier.predict(val)

Cool! We got what we wanted. You can try more unseen data points from the complete dataset; I believe the model will give you a satisfying answer given its high accuracy.
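To spot-check several rows quickly, you can wrap the tokenize-vectorize-predict steps in a small helper. This is just a convenience sketch reusing the ps, tfidf_v, and classifier objects defined above; the indices below are only examples, so pick any rows beyond the training slice that exist in your copy of the data.
def predict_row(frame, idx):
    """Tokenize, vectorize and classify the news text at index idx of a DataFrame."""
    review = re.sub('[^a-zA-Z]', ' ', frame['text'][idx])
    review = ' '.join(ps.stem(word) for word in review.lower().split()
                      if word not in stopwords.words('english'))
    vec = tfidf_v.transform([review]).toarray()
    return classifier.predict(vec)[0]  # 0 = fake, 1 = real

print(predict_row(fake, 13070))   # should print 0 if the model gets it right
print(predict_row(true, 10000))   # should print 1 if the model gets it right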
Step4: Pickle and load model
Now, time to pickle (save) the model and vectorizer so you can use them elsewhere.
import pickle
pickle.dump(classifier, open('model2.pkl', 'wb'))
pickle.dump(tfidf_v, open('tfidfvect2.pkl', 'wb'))
Let’s see if we can use this model without training again.
# Load the saved model and vectorizer back from disk
loaded_model = pickle.load(open('model2.pkl', 'rb'))
loaded_vect = pickle.load(open('tfidfvect2.pkl', 'rb'))
val_pkl = loaded_vect.transform([review]).toarray()
loaded_model.predict(val_pkl)

We got the same output! That’s what we expected!
Now that the model is ready, it’s time to deploy it and detect news on the web application.
Step5: Create a Flask APP and a virtual environment
Flask is a lightweight WSGI web application framework. Compared with Django, Flask is easier to learn, although its built-in development server is not meant for production use. For the purposes of this blog, we will use Flask; alternatively, feel free to follow my other tutorial on how to deploy an app using Django.
- From the terminal or command line, create a new directory:
mkdir myproject
cd myproject
- Inside the project directory, create a virtual environment for the project.
If you don’t have virtualenv installed, run the following in your terminal to install it.
pip install virtualenv
After virtualenv is installed, run the following to create an env.
virtualenv <ENV_NAME>
Replace <ENV_NAME> with the name you want for your environment.
Activate the env by running:
source <ENV_NAME>/bin/activate
You can remove the env when it’s no longer needed using the following command:
sudo rm -rf <ENV_NAME>
Now your env is ready. Let’s install Flask first.
pip install flask
It’s time to build the web app!
Step6: Add functionalities
To start, create a new file named app.py in the project directory; we will add all the functionality in this file. Also move the pickled model and vectorizer from the previous step into the same directory.
We are going to build four functions: home returns the home page; predict produces the classification result, i.e. whether the input news is fake or real; webapp renders the prediction on the web page; api returns the classification result as JSON so the app can also serve as an external API.
You may find the Flask official documentation helpful.
from flask import Flask, render_template, request, jsonify
import nltk
import pickle
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer
app = Flask(__name__)
ps = PorterStemmer()
# Load model and vectorizer
model = pickle.load(open('model2.pkl', 'rb'))
tfidfvect = pickle.load(open('tfidfvect2.pkl', 'rb'))
# Build functionalities
@app.route('/', methods=['GET'])
def home():
    return render_template('index.html')

def predict(text):
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    review_vect = tfidfvect.transform([review]).toarray()
    prediction = 'FAKE' if model.predict(review_vect) == 0 else 'REAL'
    return prediction

@app.route('/', methods=['POST'])
def webapp():
    text = request.form['text']
    prediction = predict(text)
    return render_template('index.html', text=text, result=prediction)

@app.route('/predict/', methods=['GET', 'POST'])
def api():
    text = request.args.get("text")
    prediction = predict(text)
    return jsonify(prediction=prediction)

if __name__ == "__main__":
    app.run()
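After the app is up and running (see the run command at the end of this tutorial), you can call the api route from a separate script or terminal; below is a small sketch using the requests library, assuming Flask’s default local address.
import requests

# Ask the /predict/ route to classify a piece of text; the app answers with JSON.
resp = requests.get("http://127.0.0.1:5000/predict/",
                    params={"text": "Some breaking news text you want to check."})
print(resp.json())  # e.g. {"prediction": "FAKE"} or {"prediction": "REAL"}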
You can see an index.html file referenced in the previous section; it’s the home page of the application. Create a folder named "templates" (Flask looks for this exact folder name) in the root folder, and create a file "index.html" inside it. Now let’s add some content to the page.
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>Fake News Prediction</title>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/css/bootstrap.min.css" rel="stylesheet"
integrity="sha384-EVSTQN3/azprG1Anm3QDgpJLIm9Nao0Yz1ztcQTwFspd3yD65VohhpuuCOmLASjC" crossorigin="anonymous">
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/js/bootstrap.bundle.min.js"
integrity="sha384-MrcW6ZMFYlzcLA8Nl+NtUVF0sA7MsXsP1UyJoMp4YLEuNSfAP+JcXn/tWtIaxVXM"
crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
</head>
<body>
<nav class="navbar navbar-expand-lg navbar-light bg-light">
<div class="container-fluid">
<a class="navbar-brand" href="/">FAKE NEWS PREDICTION</a>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNavAltMarkup"
aria-controls="navbarNavAltMarkup" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div class="nav navbar-nav navbar-right" id="navbarNavAltMarkup">
<div class="navbar-nav">
<a class="nav-link" target="_blank"
href="https://rapidapi.com/fangyiyu/api/fake-news-detection1/">API</a>
<a class="nav-link" target="_blank"
href="https://medium.com/@fangyiyu/how-to-build-a-fake-news-detection-web-app-using-flask-c0cfd1d9c2d4?sk=2a752b0d87c759672664232b33543667/">Blog</a>
<a class="nav-link" target="_blank"
href="https://github.com/fangyiyu/Fake_News_Detection_Flask/blob/main/Fake_news_detection.ipynb">NoteBook</a>
<a class="nav-link" target="_blank" href="https://github.com/fangyiyu/Fake_News_Detection_Flask">Code Source</a>
</div>
</div>
</div>
</nav>
<br>
<p style=text-align:center>A fake news prediction web application using Machine Learning algorithms, deployed using Flask and Heroku. </p>
<p style=text-align:center>Enter your text to try it.</p>
<br>
<div class='container'>
<form action="/" method="POST">
<div class="col-three-forth text-center col-md-offset-2">
<div class="form-group">
<textarea class="form-control jTextarea mt-3" id="jTextarea" rows="5" name="text"
placeholder="Write your text here..." required>{{text}}</textarea>
<br><br>
<button class="btn btn-primary btn-outline btn-md" type="submit" name="predict">Predict</button>
</div>
</div>
</form>
</div>
<br>
{% if result %}
<p style="text-align:center"><strong>Prediction : {{result}}</strong></p>
{% endif %}
<script>
function growTextarea (i,elem) {
var elem = $(elem);
var resizeTextarea = function( elem ) {
var scrollLeft = window.pageXOffset || (document.documentElement || document.body.parentNode || document.body).scrollLeft;
var scrollTop = window.pageYOffset || (document.documentElement || document.body.parentNode || document.body).scrollTop;
elem.css('height', 'auto').css('height', elem.prop('scrollHeight') );
window.scrollTo(scrollLeft, scrollTop);
};
elem.on('input', function() {
resizeTextarea( $(this) );
});
resizeTextarea( $(elem) );
}
$('.jTextarea').each(growTextarea);
</script>
</body>
</html>
The above script creates a web page like this:

Now you can run your app by typing the following command in your terminal:
python3 app.py
You will be able to run your app locally and test the model.
Conclusion
In this tutorial, you built a machine learning model from scratch to distinguish fake news from real news, and you saved it to build a web application with Flask. The web application runs on your local machine, and you can make it public using cloud services such as Heroku, AWS, or DigitalOcean. I have deployed mine on Heroku. Feel free to have a try.
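If you want to reproduce the Heroku deployment, the usual recipe is a Procfile plus a production WSGI server. The sketch below is my own minimal version (gunicorn, the file names, and <APP_NAME> are assumptions, not part of the tutorial code), run from the project root after heroku login:
# Minimal Heroku deployment sketch; assumes the Heroku CLI is installed and you are logged in.
pip install gunicorn
pip freeze > requirements.txt
echo "web: gunicorn app:app" > Procfile
git init
git add . && git commit -m "Fake news detection app"
heroku create <APP_NAME>
git push heroku main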
I hope you enjoyed this journey. Feel free to leave a comment and connect with me on LinkedIn.