Building and Deploying a Recommender with Flask and Heroku

A model-based approach to building a hybrid recommender and deploying its content-based filtering component with Heroku

Jason Chia
Towards Data Science

--

Front page of my Data Science Immersive Program’s Capstone presentation slide deck

In this post, I illustrate one way of building a hybrid recommender and deploying a bare-bones, model-based content-filtering system with Flask and Heroku. This is the culmination of my capstone project from a Data Science Immersive program under General Assembly (GA) Singapore. More details can be found in these GitHub repo links: GA_Capstone, GA_Capstone_Flask_Heroku_deployment.

For this project, I chose coffee as the topic of interest: I am surrounded by coffee-loving social cliques (even though I don’t drink coffee!) and am well aware of the practically endless choices when it comes to finding a good venue to grab a cuppa. A recommender that addresses this inundation of choices, and does away with the need to filter on a plethora of attributes such as location, service standards, etc., might therefore come in handy here.

Overview of the Types of Recommendation Systems (RecSys) to be Built

Model-based Content Filtering

A paradigm of Content-based Filtering framed as a supervised machine learning problem (defined by features X and a target y): the various features of the items (coffee outlets in this case) form the feature matrix X (hence “Content-based”) and are used to predict user ratings (the target y). The predicted user ratings are then ranked in descending order, and the top 5 (or any arbitrary number n one deems fit) are presented as the top-n recommendations for the user to check out.
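As a rough sketch of that ranking step, assuming a fitted scikit-learn-style classifier named model and a DataFrame outlet_features of item features (both hypothetical names):

```python
import pandas as pd

def top_n_recommendations(model, outlet_features: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    # Predict a rating for every outlet from its features...
    preds = model.predict(outlet_features.drop(columns=["outlet_id"]))
    # ...then rank the outlets by predicted rating and keep the top n.
    ranked = outlet_features.assign(predicted_rating=preds)
    return ranked.sort_values("predicted_rating", ascending=False).head(n)
```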

Model-based Collaborative Filtering

A paradigm of Collaborative Filtering based on a machine learning algorithm that learns user-item interactions from existing data to predict a user’s item rating (a user’s coffee outlet rating in our context). It does so by taking into account the item ratings of other users with similar rating patterns, then recommending the items most liked by those similar users to the user of interest (hence “Collaborative” in nature). This reliance on similar users is also why Collaborative Filtering RecSys often runs into the cold-start problem: it is impossible to find users with rating patterns similar to those of a new user who has not rated anything.

Hybrid RecSys

A hybrid RecSys combines Content-based Filtering and Collaborative Filtering in an attempt to mitigate the shortcomings of either and produce a more holistic, comprehensive system (the best of both worlds, so to speak). For example, Content-based Filtering falls short at producing cross-recommendations, i.e. it tends to only recommend items of the same type as items the user once liked. Collaborative Filtering addresses this with its collaborative trait: it can potentially recommend items outside the categories of items the user once liked, as long as those items are well liked by other users with similar rating patterns. On the other hand, for datasets involving only explicit feedback such as user ratings (e.g. this Yelp dataset), Collaborative Filtering takes into account only the user-item interaction matrix of user IDs, ratings, and item IDs. In this case, Content-based Filtering can enrich the RecSys with more granular, item-wise features (category, review_count, avg_store_rating, etc. of the coffee outlets) and thus make the RecSys more all-encompassing and robust.

Data

The dataset consists of 987 local coffee-drinking outlets and their associated features, 6,292 user reviews, and 7,076 user ratings scraped from Yelp using a combination of BeautifulSoup and the Yelp API.

As a large proportion of users in the scraped dataset rated only 1–2 outlets, training a model on all of them would inevitably run into errors at the train_test_split() and cross-validation stages of the model evaluation phase later on, when the dataset is split into training and validation sets stratified by user IDs (especially if the test_size is less than 0.5 and the number of cross-validation folds is more than 2). As such, I only incorporated users who have rated at least 10 different outlets into the content-based filtering model (this left 110 of 2,552 users in the model training). The number 10 was arbitrarily chosen, although there might be a connection between this and the 10 ratings a user is supposed to provide to generate the top 5 recommendations in the deployed model, as we will see later on.
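For illustration, a sketch of this filter, assuming a ratings DataFrame with hypothetical user_id and outlet_id columns:

```python
import pandas as pd

ratings = pd.read_csv("yelp_ratings.csv")  # hypothetical file name

# Count the distinct outlets each user has rated, and keep users with >= 10.
outlets_per_user = ratings.groupby("user_id")["outlet_id"].nunique()
active_users = outlets_per_user[outlets_per_user >= 10].index
ratings = ratings[ratings["user_id"].isin(active_users)]
```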

Looking at the training data, our target, y (user ratings), was evidently imbalanced:

Imbalanced target classes: barely any ratings in Classes 1 and 2, minimal in Class 3, and the majority under Classes 4 and 5, with Class 4 dominating

Due to the imbalanced nature of the target classes, and because we want neither poor recommendations served up (false positives) nor good recommendations missed out (false negatives), a micro-averaged F1 score, which balances precision and recall by pooling counts across all classes, was used to evaluate the performance of the respective models later on.
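For reference, scikit-learn computes this metric directly; a minimal example with made-up ratings:

```python
from sklearn.metrics import f1_score

y_true = [4, 5, 4, 3, 5, 4, 2]
y_pred = [4, 5, 4, 4, 5, 4, 2]

# average="micro" pools true/false positives and negatives across all
# classes before computing precision, recall, and their harmonic mean.
print(f1_score(y_true, y_pred, average="micro"))  # 0.857...
```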

Models

Content-based Filtering

The item-wise (i.e. per coffee outlet) features to be incorporated into the Content-based Filtering model are:

Content-based Filtering features: Reviews, Category, Review Count, and Average Store Rating

Reviews were broken down into word-term frequencies using a naive (untuned) Tf-idf vectorizer, which captures the discriminating strength of each review word term and its association with the various coffee outlets prior to model training.
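A minimal sketch of this vectorization step, combined with the numeric outlet features via a scikit-learn ColumnTransformer (column names are assumptions):

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

preprocess = ColumnTransformer(
    transformers=[
        # Naive (default, untuned) TF-IDF over the review text column.
        ("tfidf", TfidfVectorizer(), "reviews"),
    ],
    remainder="passthrough",  # keep review_count, avg_store_rating, etc.
)
X = preprocess.fit_transform(outlet_data)  # hypothetical DataFrame
```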

I tuned and experimented with various supervised machine learning algorithms such as logistic regression and a decision tree classifier, as well as ensemble methods such as the extreme gradient-boosting classifier (XGB). XGB performed the best (micro-averaged F1 score: 0.97) and was chosen for deployment. XGB is an extreme gradient-boosting regression tree (GBRT) algorithm that:

  • Reduces both bias and variance, “learning” from the errors of past weak learners to yield a strong learner: it starts off with shallow, low-variance/high-bias base estimators and iteratively fits estimators on the errors of past estimators to correct those errors/misclassifications
  • Controls over-fitting by using a more regularized model formalization and thus performs better than regular gradient-boosting methods like gradient-boosting classifier
  • Is able to work well on mixed data types (recall: the dataset used for content-based filtering involves a mixture of numerical and categorical data types like review_count and reviews respectively)
An example of XGB in action — courtesy of https://blog.quantinsti.com/xgboost-python/
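A hedged sketch of the XGB setup (X, y, and user_ids are assumed from the preprocessing above; hyperparameter values are illustrative, not the tuned ones):

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X: TF-IDF review terms plus numeric outlet features; y: ratings 1-5.
# XGBClassifier expects zero-based class labels, hence the shift by 1.
X_train, X_val, y_train, y_val = train_test_split(
    X, y - 1, test_size=0.2, stratify=user_ids, random_state=42
)

xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
xgb.fit(X_train, y_train)
pred_ratings = xgb.predict(X_val) + 1  # shift back to the 1-5 scale
```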

Collaborative Filtering

The Alternating Least Squares (ALS) algorithm was used and tuned using PySpark’s CrossValidator (micro-averaged F1 score: 1.0). ALS is a matrix factorization technique that decomposes the user-item interaction matrix (such as the user-item ratings matrix for datasets with explicit feedback) into user and item latent factors whose dot product predicts users’ item ratings (users’ ratings of the various coffee outlets in this context). At each iteration, it fixes one set of latent factors (user or item) and solves a least-squares problem for the other, alternating between the two while minimizing the loss (the error between actual and predicted user ratings):

Matrix Factorization Loss Function: Explaining latent user and item factors that are to be learnt through gradient descent when using the ALS algorithm
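For reference, the regularized squared-error loss ALS minimizes typically takes this standard form, where r_ui is an observed rating, x_u and y_i are the user and item latent factors, and λ is the regularization weight:

\min_{x, y} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

And a hedged sketch of the PySpark ALS setup (column names and hyperparameter values are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("coffee-recsys").getOrCreate()
ratings_df = spark.createDataFrame(ratings)  # columns: user_id, outlet_id, rating

als = ALS(
    userCol="user_id",
    itemCol="outlet_id",
    ratingCol="rating",
    rank=10,                   # number of latent factors per user/item
    regParam=0.1,              # the lambda in the loss above
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
)
als_model = als.fit(ratings_df)
predictions = als_model.transform(ratings_df)
```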

Hybrid RecSys

Content-based and Collaborative Filtering were combined by taking a weighted sum of the rating predictions from both filtering systems. For example, if the predicted user rating from Content-based Filtering is 3 while that from Collaborative Filtering is 4, the final predicted user rating would be:

(0.97 / (0.97 + 1.0)) × 3 + (1.0 / (0.97 + 1.0)) × 4 ≈ 3.51, rounded off to 3.5 (nearest 0.5), where 0.97 and 1.0 are the micro-averaged F1 scores of XGB and ALS respectively, which determine the weights assigned to Content-based and Collaborative Filtering.
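In code, the blend is just a weighted average followed by rounding to the nearest 0.5:

```python
F1_XGB, F1_ALS = 0.97, 1.0  # micro-averaged F1 scores, used as weights

def hybrid_rating(content_pred: float, collab_pred: float) -> float:
    blended = (F1_XGB * content_pred + F1_ALS * collab_pred) / (F1_XGB + F1_ALS)
    return round(blended * 2) / 2  # round to the nearest 0.5

print(hybrid_rating(3, 4))  # ~3.51 -> 3.5
```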

The hybrid RecSys yielded a micro-averaged F1 score of 1.0.

Flask app implementation

The hybrid RecSys was successfully implemented in a local virtual environment using a Flask Python script and some HTML templates.

Users are asked to select and rate 10 local coffee-drinking outlets they have been to before or are familiar with (preferably 2 different outlets per rating class, so that the machine learning models working at the back end get a more comprehensive sense of the user’s preferences and in turn generate more reliable recommendations) and click “Submit” to churn out the top 5 recommendations for them to check out. A minimal sketch of this flow follows the screenshots below.

Front page of Flask app implemented in a local virtual environment: Users first submit ratings of some outlets based on established guidelines
The front page of the Flask app also provides additional Yelp links to the coffee-drinking outlets for users’ reference prior to rating 10 of them
An example of top 5 recommendations generated for user based on the 10 ratings submitted
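As referenced above, a minimal sketch of the Flask flow (routes, template names, and the hybrid_top_n() helper are hypothetical):

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # Rating form listing the outlets, with their Yelp links for reference.
    return render_template("index.html")

@app.route("/recommend", methods=["POST"])
def recommend():
    # Collect the 10 submitted (outlet_id, rating) pairs from the form.
    user_ratings = {k: int(v) for k, v in request.form.items() if v}
    # Hypothetical helper: retrains on the new ratings, predicts the rest,
    # and returns the 5 outlets with the highest predicted ratings.
    top5 = hybrid_top_n(user_ratings, n=5)
    return render_template("results.html", recommendations=top5)

if __name__ == "__main__":
    app.run(debug=True)
```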

However, unlike most deployed machine learning models, which simply predict outcomes based on unseen features (i.e. X_test) provided as user input, this hybrid RecSys relies on a manually combined pair of machine learning models that must first be trained on the newly submitted ratings (the target, y_train) and the rated coffee outlets (the features, X_train) before predicting the user’s ratings for all the coffee outlets that were not rated. As such, it takes 15–20+ minutes on average after clicking “Submit” to churn out the top 5 recommendations on the Flask app. This duration is unacceptable for formal deployment platforms like Heroku, which kills any web request whose wait time exceeds a mere 30 seconds, outputting a “Request timeout” message and an H12 error in the error logs.

Deployment of Content-based Filtering with Heroku

Given the above, I could only attempt to deploy “half” of the RecSys, specifically the XGB component, as ALS can only be imported from PySpark (the pyspark.ml.recommendation library) and I have yet to find an online resource that details deploying PySpark code with Heroku. Most resources found cover Scala and Spark applications, which require compilation with the Scala Build Tool (sbt) and jar assembly; this in turn seems a little too complex for my Flask app, which uses minimal PySpark code (not even SparkContext or spark-submit, just the SparkSession and ALS components).

With the RecSys cut down to just the XGB component, “Request Timeout” was still an issue. After trimming the XGB portion of the code and keeping tabs on the runtime of each code cell in Jupyter Notebook with %%time, I was able to shorten the runtime sufficiently for the XGB component to be deployed formally with Heroku over here.
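For reference, %%time is a Jupyter cell magic that reports the CPU and wall time of a whole cell, which makes the slow steps easy to spot, e.g.:

```python
%%time
# Reports CPU and wall time for everything in this cell.
xgb.fit(X_train, y_train)
```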

Content-based Filtering RecSys deployed with Heroku: Front page where user submits 10 outlet ratings
Content-based Filtering RecSys deployed with Heroku: Outcome page of top 5 recommendations for user based on 10 ratings submitted

Model Limitations

Some limitations of the hybrid RecSys include:

  • Lack of implicit data
  • The RecSys is based on static data that is not being updated and can therefore become obsolete in the future
  • Questionable data quality (most users rated only 1–2 outlets, making it impossible to train and cross-validate a model incorporating all 2,552 users)
  • It was difficult to tune Tf-idf and train the model on a combination of word-term and numerical features simultaneously, so Tf-idf was not tuned and a naive version was used instead
  • There is no fallback for users who do not provide any ratings at the outset; Mean Normalization could be incorporated so the RecSys recommends outlets based on the average user rating per outlet (see the sketch after this list)
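A sketch of that fallback, assuming the same ratings DataFrame as before: with no submitted ratings, simply surface the outlets with the highest average rating.

```python
# Popularity-style fallback: rank outlets by their mean user rating.
fallback_top5 = (
    ratings.groupby("outlet_id")["rating"]
    .mean()
    .sort_values(ascending=False)
    .head(5)
)
```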

Future Plans

I will keep a lookout online for updates on the various deployment platforms, Stack Overflow, and PySpark deployment, to see whether the PySpark ALS component can be incorporated into the Heroku deployment so that the full hybrid RecSys can be deployed online.

If the above pans out well, I could incorporate user-design/user-experience concepts to better design the HTML user interfaces, in preparation for A/B testing to gauge the effectiveness of the hybrid RecSys built.
