STEP-BY-STEP GUIDE

Intro.
Very often, as a data scientist, you may face a task that covers a complete pipeline: from data collection all the way to deploying the app on a server. I bumped into such an odd job during an interview process in my job search. The focus was not on developing the most accurate or most complex model but on showing a good grasp of Machine Learning and NLP concepts. In this article, I will show you how to deploy a BERT model and a preprocessing pipeline. To deploy the application, I used the Heroku server and the Flask Python framework.
Task and problem description.
The task is based on the paper "Identifying Nuances in Fake News vs. Satire: Using Semantic and Linguistic Cues". The paper introduces two models to classify textual content as "satire" or "fake news" – one model based on Google’s BERT and one based on coherence metrics.
The goal for this task is:
- to develop a new model to classify texts as "satire" or "fake news";
- to build a demo web application to serve the model predictions;
- to deploy it to a cloud service Heroku.
The model can be any of your choice and should use the following features:
- BERT embeddings (vectors) representing each text – using the Sentence Transformers library.
- Sentiment & modality of each text – using the pattern.en module.
Here is the naive structure of the project:
/web-app
|
|--data/
| |--model_for_prediction.pkl
|--static/
| |--style.css
|--templates/
| |--home.html
| |--result.html
|--.gitignore
|--app.py
|--nltk.txt
|--requirements.txt
|--runtime.txt
|--Procfile
I assume you already have a labeled data set with an article or article paragraph in each row. Label ‘1’ represents satire and ‘0’ fake news.
The Sentence Transformers documentation recommends several pre-trained models and a variety of tasks where we may apply them. Even if you are seeing this library for the first time, the documentation, with its clear examples, will save you substantial time. So let’s begin!
Part 1 – develop a new model.
BERT embeddings. I took a Tiny BERT model from the Hugging Face / Sentence Transformers library. Size is crucial in our task: Heroku has time and size limits – the app must be smaller than 500 MB, and each operation is given a maximum of 30 seconds – so we don’t want to crash our web app. I chose a light 57.4 MB model and a PyTorch version for CPU (not GPU).
When I first tried to install the sentence-transformers library and deploy the application, my web app crashed with Error H12 (Request timeout). I decided to implement this part by hand to decrease the runtime overhead. The CPU-only PyTorch build reduced the app size significantly as well. It is possible to use sentence-embedding models without installing the whole library. This trick saved my time, and the web application didn’t crash again.
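As a sketch of that trick, here is how one can load a small sentence-transformers checkpoint through the bare transformers package and mean-pool the token vectors into a sentence embedding. The checkpoint name below is an assumption – substitute whichever lightweight model you picked:

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()          # (batch, seq, 1)
    return (token_embeddings * mask).sum(1) / mask.sum(1)

def embed(texts, model_name="sentence-transformers/paraphrase-MiniLM-L3-v2"):
    """Sentence embeddings without the sentence-transformers wrapper.

    The checkpoint name is a placeholder: use the small model you chose.
    """
    from transformers import AutoModel, AutoTokenizer  # deferred heavy import
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    return mean_pool(output.last_hidden_state, encoded["attention_mask"])
```

The mean pooling is exactly what the sentence-transformers wrapper does by default for these models, so the vectors stay compatible with the full library.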
The sentiment of each text. Any text can be broadly categorized into one of two types: facts and opinions. Opinions carry people’s sentiments, appraisals, and feelings toward the world. The pattern.en module bundles a lexicon of adjectives (e.g., good, bad, amazing, irritating, etc.) that frequently occur in articles/reviews, annotated with scores for sentiment polarity (positive ↔ negative) and subjectivity (objective ↔ subjective). Polarity is a value between -1.0 and +1.0, and subjectivity between 0.0 and 1.0. Let’s extract these values:
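A minimal sketch of this step, assuming the texts live in a Pandas DataFrame column. The `analyzer` hook is my addition so the example runs without pattern installed; by default it falls back to pattern.en.sentiment, which returns a (polarity, subjectivity) tuple:

```python
import pandas as pd

def add_sentiment_features(df, text_col="text", analyzer=None):
    """Append polarity/subjectivity columns.

    `analyzer` maps a string to a (polarity, subjectivity) tuple,
    e.g. pattern.en.sentiment.
    """
    if analyzer is None:
        from pattern.en import sentiment  # deferred import of the pattern library
        analyzer = sentiment
    scores = df[text_col].apply(analyzer)
    df["polarity"] = scores.map(lambda s: s[0])
    df["subjectivity"] = scores.map(lambda s: s[1])
    return df
```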
The modality of each text. The modality() function from the pattern library returns the degree of certainty as a value between -1.0 and +1.0, where values greater than +0.5 represent facts.
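Sketched in the same style: pattern’s modality() operates on a parsed Sentence rather than a raw string, so the helper below wraps parse + Sentence + modality. The `scorer` hook is my addition to keep the example runnable without pattern installed:

```python
def add_modality_feature(df, text_col="text", scorer=None):
    """Append a modality column; `scorer` maps text -> certainty in [-1.0, +1.0]."""
    if scorer is None:
        from pattern.en import Sentence, modality, parse

        def scorer(text):
            # modality() expects a parsed Sentence, not a raw string
            return modality(Sentence(parse(text, lemmata=True)))
    df["modality"] = df[text_col].apply(scorer)
    return df
```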
The last step in feature engineering is concatenation into one data set:
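For instance, with NumPy the embedding matrix and the hand-crafted columns can be stacked side by side. The column names here simply match the features described above:

```python
import numpy as np

def build_feature_matrix(embeddings, df,
                         cols=("polarity", "subjectivity", "modality")):
    """Concatenate BERT embeddings with the extra features, one row per text."""
    extra = df[list(cols)].to_numpy()
    return np.hstack([np.asarray(embeddings), extra])
```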
Model training. As written above, we can choose any model. I took Logistic Regression from sklearn; you are free to choose any model you like. First, we create the model. Then we save it into the /data folder in pickle format for future use:
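A minimal sketch of that step. The file name matches the project tree above; the solver settings are a free choice:

```python
import pickle
from sklearn.linear_model import LogisticRegression

def train_and_save(X, y, path="data/model_for_prediction.pkl"):
    """Fit the classifier and pickle it for the web app to load later."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model
```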
Important notice. In this POC, I use the Pandas Python library just for convenience. Pandas is not always the best library for running production code: it is good for analysis and data exploration, but it is slow because it does a lot of work in the background.
Part 2 – a demo web app to serve the model.
It’s time to create a web application. We place the home.html file in the /templates folder. It looks like this:
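A minimal version of such a template might look like this. The endpoint and field names (predict, text) are assumptions and must match whatever app.py defines:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Satire vs. Fake News</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
</head>
<body>
    <h1>Satire or Fake News?</h1>
    <!-- "predict" and "text" must match the Flask route and form field in app.py -->
    <form action="{{ url_for('predict') }}" method="POST">
        <textarea name="text" rows="10" placeholder="Paste an article paragraph here..."></textarea>
        <input type="submit" value="Predict">
    </form>
</body>
</html>
```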
I made the page that shows the prediction result in a separate result.html file and put it into the same /templates folder:
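A matching sketch for the result page; the prediction variable and the home endpoint are assumptions and must match what app.py passes to render_template:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Prediction</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
</head>
<body>
    <h1>The model says: {{ prediction }}</h1>
    <a href="{{ url_for('home') }}">Try another text</a>
</body>
</html>
```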
The UI design is simple, and it lives in /static/style.css:
Important: mark the /data and /static folders as Resources. In PyCharm for Linux, you can do it in File → Settings → Project Structure. Mark the /templates folder as Templates. It is also nice to exclude some folders (e.g., the virtual environment folder) from the Heroku upload to reduce the application size. Don’t forget to add a .gitignore
file to your project:
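A typical .gitignore for such a project might contain entries like these (adjust to your environment; note that data/model_for_prediction.pkl must NOT be ignored, since the app needs it at runtime):

```
venv/
__pycache__/
.idea/
.ipynb_checkpoints/
```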

Since we use Flask for deployment, we need to set the project’s template language (Flask uses Jinja2):

Finally, in the project’s directory, we may create our app.py file. First, we put there all the needed imports and preprocessing pipeline functions. Then we create a Flask app instance and add functionality for the home and result pages. In the predict() function, we load the previously saved model and predict on the input text.
Run the app in main. Congratulations! Now you can check your app’s work on localhost at http://127.0.0.1:5000/.
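A skeleton of such an app.py. The extract_features stub stands in for the preprocessing pipeline from Part 1, and the template and form-field names are assumptions:

```python
import pickle

from flask import Flask, render_template, request

app = Flask(__name__)
MODEL_PATH = "data/model_for_prediction.pkl"

def extract_features(text):
    """Stub: rebuild the training feature vector here
    (BERT embedding + polarity, subjectivity, modality)."""
    raise NotImplementedError

@app.route("/")
def home():
    return render_template("home.html")

@app.route("/predict", methods=["POST"])
def predict():
    text = request.form["text"]
    with open(MODEL_PATH, "rb") as f:        # load the pickled classifier
        model = pickle.load(f)
    label = model.predict([extract_features(text)])[0]
    verdict = "Satire" if label == 1 else "Fake news"
    return render_template("result.html", prediction=verdict)

# To run locally, add a main guard:
#   if __name__ == "__main__":
#       app.run(debug=True)
# and start the app with `python app.py`.
```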
Part 3 – deployment on cloud service.
To begin, let’s create a requirements.txt file. There are two convenient ways to do it:
- Via the PyCharm IDE. Right-click on the project name → New → File → name the file requirements.txt. When you open the newly created empty requirements.txt file, PyCharm will offer to write down all the imports automatically. A very convenient way 🙂
- Via the command line, with 'pip freeze > requirements.txt' inside your project directory.
Don’t forget to change the PyTorch version to the CPU build:
torch @ https://download.pytorch.org/whl/cpu/torch-1.6.0%2Bcpu-cp36-cp36m-linux_x86_64.whl
I got a pretty long list of libraries in the current project: 63. For those who face such a task for the first time, I want to note: the last line in requirements.txt should be empty.
An important note about Heroku and the NLTK library: if your app uses NLTK, you should create an additional nltk.txt file with the inner imports. In our case, pattern.text.en uses NLTK, and nltk.txt should contain two lines:
wordnet
pros_cons
We need runtime.txt to tell the server which Python version to use. You may create it in the same way as the requirements.txt file: via the IDE or the command line. It should contain one line:
python-3.6.13
Procfile. This file tells Heroku what to do when the application is deployed, and it contains one line as well:
web: gunicorn app:app
Make sure that you have the Heroku CLI (and Git CLI) installed. From now on, there are only command-line actions:
heroku login
heroku create your_application_name
The following steps are almost the same as for Git CLI:
git init
git add .
git commit -m 'initial commit'
git push heroku master
That’s all, folks! You deployed a web app on the Heroku server. The access link looks like https://your_application_name.herokuapp.com/
Conclusions.
In this tutorial, you learned how to create an NLP pipeline that includes BERT-based features and additional feature engineering. You also got familiar with how to create a demo web application with Flask and how to deploy it on the Heroku server.
Resources.
GitHub repo with the code from the article.
My web app deployed on Heroku.
The article that served as inspiration.
Special acknowledgments to Mordechai Worch.