
REDDIT FLAIR PREDICTION SERIES

Predicting Reddit Flairs using Machine Learning and Deploying the Model using Heroku — Part 3

Creating a web app and deploying the machine learning model


Welcome to Part 3 of this series, where I continue working on the Reddit Flair Detection Problem. In Part 1, I discussed the background of the problem and the data collection method, and in Part 2, I built a machine learning model to predict the corresponding flairs. I highly recommend going through both of them before starting this one, because they share the insights and reasoning behind the data collection and the model building process. If you have completed Parts 1 and 2, it’s time to congratulate yourself, and I am really proud of you for completing two very tedious tasks. You have collected, cleaned, analysed and modelled a lot of data, and you have built an ML model to go with it. Most ML practitioners usually stop here, but this is where we pick up in Part 3.

In this part, we’ll continue with the model we built in the previous article and use that to make predictions for any new entries based on the URL of the Reddit post in the India subreddit.

Recap

So far, we have collected data from the India subreddit using the PRAW library. After performing text analysis, we tried four machine learning models and picked one based on performance. I also combined the data pre-processing and the model into a single pipeline for more efficient execution. I will be going with Logistic Regression because it is a simple model, is easily deployed and doesn’t cause any issues on Heroku. You can choose your own model for this purpose. This is how we made the pipeline and trained the model.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Vectorize, apply TF-IDF weighting and classify in a single pipeline
model = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('model', LogisticRegression()),
                  ])

# Fit the pipeline to the training data
model.fit(X_train, y_train)

Deployment

The deployment process has a few steps:-

  • Saving the Model
  • Creating web-app interfaces
  • Creating a flask app
  • Getting it all together

Saving the Model

Once you have stored your model as a pipeline in a variable, say model, and fit it to the training data, it is time to save it to a file so that we can load it later for deployment to our web application. We save our models using a technique called serialization. In simple words, serializing is a way to write a Python object to disk so that it can be transferred anywhere and later de-serialized (read back) by a Python script. There are two ways to achieve this [1].

  1. Using Pickle

The pickle library is really popular and super useful. (Fun fact: pickles are just cucumbers dipped in vinegar.) Pickle is the standard way of serializing objects in Python. You can use the pickle module to serialize your machine learning model and save the serialized format to a file. Later you can load this file to de-serialize your model and use it to make new predictions.

Saving the model using the pickle library
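A minimal sketch of what that step looks like, assuming the trained pipeline is stored in model and using flair_model.pkl as an example file name:

import pickle

# Serialize the trained pipeline to disk
with open('flair_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Later, de-serialize it and use it for new predictions
with open('flair_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

predictions = loaded_model.predict(X_test)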

2. Using Joblib (recommended for bigger models)

Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs, which is why I used it for this project. It provides utilities for saving and loading Python objects that make efficient use of NumPy data structures.

This is useful for machine learning algorithms which save a lot of parameters or sometimes even store the entire dataset. For deep learning models, we usually switch to .h5 files.

Saving the model using joblib library
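A minimal sketch of the same step with joblib, again with flair_model.pkl as an example file name:

import joblib

# Serialize the trained pipeline to disk
joblib.dump(model, 'flair_model.pkl')

# Later, de-serialize it and use it for new predictions
loaded_model = joblib.load('flair_model.pkl')
predictions = loaded_model.predict(X_test)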

Creating the Flask app and its interfaces

This step is somewhat new to machine learning practitioners, which is why I will share my insights for most steps. I will also try not to bombard you with information. You can refer to the whole code here. We will be using Flask for this purpose.

NOTE:- This is a very important note, so pay attention! A lot of people prefer to create a Python virtual environment for the next step, because it keeps the dependencies required by different projects separate in isolated environments. This is one of the most important tools that most Python developers use. However, for this tutorial, I am not creating a virtual environment. If you want to create one, you can find a tutorial here. If you do go ahead with the virtual environment, the list of dependencies that you will require can be found here. Make sure you download the file and run the following command inside the virtual environment.

pip install -r requirements.txt or pip3 install -r requirements.txt

This file is compulsory for both virtual-env and non-virtual-env users; however, those who are not using a virtual environment may already have most of the requirements satisfied.

Why Flask? [2]

  • Easy to use.
  • Built-in development server and debugger.
  • Integrated unit testing support.
  • RESTful request dispatching.
  • Extensively documented.

Let’s get started! You can access the repository here.

Front End

Let’s get started with making the front end using HTML so that the user can input the URL of the post. I will keep this fairly simple and utilitarian because this isn’t a front-end development course. The page will look pretty simple. You need three files for this, and I have put them under the templates directory here.

  1. base.html: Base file
  2. index.html: Startup page which takes the input
  3. show.html: Page which displays the result

I have used Jinja template inheritance (the same extends mechanism used by Django templates) for the front end. If you don’t know what that is, you don’t have to worry. It’s a simple concept, similar to inheritance with classes. I have defined a base file which is extended by the index and show files so that I don’t have to write the Bootstrap and header code again and again. The base file gets its content from the Bootstrap base file that ships with the Python library Flask_Bootstrap. You can read more about template inheritance here.

Base file which inherits from bootstrap base file
Index File which is the default startup page

This is what the output looks like. As I said, it's just for our utility. You are free to add your own CSS. In fact, I am open to pull requests from anyone who wants to contribute and add styling to this page.

Index Page

The show.html page is designed in the following manner; I will show its output at the end, when the results pop up. You can put all three of these files in the templates folder, while any images and styling go in the static folder (a Flask convention).

APIs to receive the data and display results

The next step is to create an API which receives the URL and, after the prediction, displays the results, all of it through the GUI. The basic concept is that we need to create an interface which performs the following steps:-

  1. Take the URL as an input from the index page using the form
  2. Scrape the data from the Reddit website
  3. De-serialise the model
  4. Make predictions on the data using the deserialized model
  5. Display the output

Now, it is possible to do all of this in a single file; however, I think the process becomes more intuitive and flexible with two separate files. This also helps when you have to deal with multiple POST requests through a single file. I will create two files: app.py, where I get the data from the form through a POST request from the index.html file, and inference.py, where I de-serialize the model and get the predictions. The predicted class is returned to app.py, which displays the result through the show.html file. Makes sense, right?

app.py file which gets the data and sends the output
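Below is a minimal sketch of how such an app.py could look. The route names, the form field name url and the use of Flask_Bootstrap are illustrative assumptions; the actual file in the repository may differ.

from flask import Flask, render_template, request
from flask_bootstrap import Bootstrap

import inference  # helper module that loads the model and predicts

app = Flask(__name__)
Bootstrap(app)  # lets the templates extend the bootstrap/base.html file

@app.route('/')
def index():
    # Serve the form that asks for a Reddit post URL
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Read the URL submitted through the form on index.html
    url = request.form['url']  # assumes the form field is named 'url'
    # inference.py scrapes the post, cleans it and runs the model
    result = inference.get_flair(text=url)
    # Show the predicted flair on the results page
    return render_template('show.html', result=result)

if __name__ == '__main__':
    app.run(debug=True)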
inference.py which makes the predictions
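And a matching sketch of inference.py. The PRAW credentials, the clean_text helper and the flair_model.pkl file name are placeholders; the de-serialized model is passed to get_flair as a default argument along with the URL, mirroring the flow described below.

import re
import joblib
import praw

# De-serialize the pipeline that was saved earlier
model = joblib.load('flair_model.pkl')

# PRAW client, same idea as in Part 1 (credentials are placeholders)
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='flair-detector')

def clean_text(text):
    # A tiny stand-in for the cleaning done in Part 2
    text = re.sub(r'[^A-Za-z0-9 ]+', ' ', str(text))
    return text.lower()

def get_flair(text, loaded_model=model):
    # `text` is the URL of the Reddit post
    submission = reddit.submission(url=text)
    combined = ' '.join([clean_text(submission.title),
                         clean_text(submission.selftext),
                         clean_text(submission.url)])
    # The pipeline vectorizes, transforms and classifies in one call
    return loaded_model.predict([combined])[0]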

You can see that the app.py file imports the inference module and calls the method get_flair(text=url). The model is de-serialized in the inference file and passed to the method along with the URL; everything else happens in the inference file. You might have noticed that these are the same steps we followed in Part 1 and Part 2: we download and clean the data and then send it to our model, which vectorizes and transforms it. The model then makes a prediction and returns the predicted class, which is stored in the result variable and picked up by the show.html file in line 6.

Let’s check out the app in action. Here’s the link to the post that I tested.

Uploading a link.
Output

In case the post is not from the India subreddit, like this one, we get this.

Incorrect Subreddit

Conclusion

This article demonstrated a very simple way to deploy machine learning models. You can use the knowledge gained in this blog to build some cool models and take them into production. It has been a long journey so far, and we have done a lot of things: scraping data, building models and finally deploying them. It must have been really hectic, so there will be no more machine learning in the next part. In Part 4, I will discuss how to take your app online using Heroku, a platform as a service (PaaS) that enables developers to build, run and operate applications entirely in the cloud. You will be able to deploy your app as I did here.

Continue to Part 4! The last part of this series.
