Implementing an Enterprise Recommendation System

An end-to-end look at implementing a “real-world” content-based recommendation system

AJ Gadgil
Towards Data Science


Photo by Ammentrop on Dreamstime

I recently completed a recommendation system that will be released as part of a newsfeed for a high-traffic global website. With must-haves like sub-second response times for recommendations, the requirements presented significant design challenges.

As with any application deployed into a production environment, careful decisions were required on topics such as

  • performance,
  • availability,
  • data readiness, and
  • total cost.

During my initial project assessment, I found few publicly available enterprise-class machine learning (ML) resources that were meant to inform end-to-end decisions for real-world projects. This article is intended to assist practitioners, as they wade through the various decision points of an enterprise recommendation system project.

Project Requirements

Requirements called for a recommendation system with a RESTful API to integrate with a website’s newsfeed. By giving visitors access to personalized content, recommendation systems can be an important driver of insight and engagement, which is a primary goal for the newsfeed.

Other key requirements:

  • Measurable model performance and business value
  • Sub-second response times, with latency below 500 milliseconds.
  • High availability and scalability, with a service-level agreement (SLA) level of 99.9%, which translates to at most 1.44 minutes of downtime per day.
  • Use of automation for data extraction and the model build-deploy-monitor lifecycle.
  • Comprehensive model and API monitoring for use in continuous improvements.
  • Low total cost of ownership (TCO), especially recurring hosting costs.

Step 1: Exploratory Data Analysis

The most important first step in any machine learning project is to obtain the best-quality data available. Projects don't often have easily accessible data that is labeled and ready for training. In this case, custom interfaces were developed to extract data sets from multiple sources: a vendor-hosted API, an internal content management system, and the Google Analytics Reporting API.

Once the data was extracted, exploratory data analysis revealed that the corpus was limited in size, with approximately 6,800 press releases and 700 featured articles. The data also lacked consistent cross-dataset metadata or tags to assist in candidate generation. More importantly, further analysis uncovered no user data or explicit/implicit visitor feedback history on which to base recommendations.

Based on available data, an unsupervised content-based recommendation system was necessary, along with a method for tracking visitors in a way that supported personalized recommendations.

Content-Based Filtering vs Collaborative Filtering (Image by author)

Step 2: Developing a Recommendation Model

For recommendations, the system needed to find, for each press release and featured article, the most similar content in the corpus to return in response to API requests. To select the best unsupervised natural language processing (NLP) technique, it was necessary to evaluate several algorithms that convert text into feature vectors. SentenceBERT, Doc2Vec, FastText, and TF-IDF were selected to assess which algorithm was most effective for the use case and the corpus.

Once the vector representations of the documents in the corpus were generated for each algorithm, the similarity between the 7,500 documents had to be determined. Cosine similarity is a preferred metric for computing similarity between vectors, as long as the vectors are not so large that performance becomes a problem. It provides a measure of how similar documents are to one another based on their vectorized scores or embeddings.
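
As a minimal sketch of this step using scikit-learn (the placeholder corpus and default settings stand in for the actual preprocessing and hyperparameters, which are out of scope here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus standing in for the preprocessed press releases
# and featured articles
documents = [
    "first press release text ...",
    "second press release text ...",
    "featured article text ...",
]

# Convert each document into a sparse TF-IDF feature vector
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)

# similarity_matrix[i, j] measures how similar document j is to document i
similarity_matrix = cosine_similarity(tfidf_matrix)
```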

Word Cloud of Preprocessed TF-IDF Generated Features (Image by author)

Without labeled data or standard evaluation metrics to rely on, general model accuracy was estimated with a metric that calculated the percentage of documents that had themselves as the highest-scoring document in the cosine similarity matrix. The same metric was used in a randomized cross-validation to tune hyperparameters and preprocessing steps before the final comparison. Overall results were good for a general recommendation system, but TF-IDF emerged with marginally better recommendations than the other models.
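
A minimal sketch of that self-retrieval check, assuming a square matrix where row i holds similarity scores between a vector inferred for document i and the indexed vectors of the whole corpus:

```python
import numpy as np

def self_retrieval_accuracy(similarity_matrix):
    """Percentage of documents whose highest-scoring match is themselves.

    Assumes row i holds similarity scores between a vector inferred for
    document i and the indexed vectors of all documents in the corpus.
    """
    top_hits = np.argmax(similarity_matrix, axis=1)
    expected = np.arange(similarity_matrix.shape[0])
    return 100.0 * float(np.mean(top_hits == expected))
```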

Algorithm Comparison on a t2.xlarge Instance (Image by author)

To confirm the initial findings, each model was validated by manually reviewing a broad sample of recommendations; the observations confirmed that TF-IDF provided good accuracy and was the best choice for the project.

Unlike the other NLP techniques, which compute word and sentence embeddings, TF-IDF provides straightforward feature extraction by giving more weight to less frequent terms. TF-IDF also delivers faster, less resource-intensive model generation, and the implementation is simple enough to explain easily to stakeholders.
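
For reference, one common formulation of the TF-IDF weight that produces this behavior (library implementations such as scikit-learn's add smoothing terms):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

Here tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t; the rarer the term, the larger its weight.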

Re-Ranking Recommendations

To further improve recommendations, similarity scores are re-ranked using popularity and freshness features. Popularity is a good way to boost recommendations; in this case, popularity scores are based on pageview metrics extracted from Google Analytics, while freshness is based on document age. Both scores are normalized and then combined in a weighted average that reflects their relative importance, yielding a final score.
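
A sketch of one way to implement this blend; the weights, the min-max normalization, and normalizing the similarity score alongside the other two are illustrative assumptions, not the production values:

```python
import numpy as np

def rerank(similarity, pageviews, age_days, w_sim=0.7, w_pop=0.2, w_fresh=0.1):
    """Blend similarity with popularity and freshness into a final score.

    All inputs are 1-D arrays aligned by candidate document; the weights
    are illustrative, not the production values.
    """
    def min_max(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span else np.zeros_like(x)

    popularity = min_max(pageviews)        # more pageviews -> higher score
    freshness = 1.0 - min_max(age_days)    # newer documents -> higher score
    return w_sim * min_max(similarity) + w_pop * popularity + w_fresh * freshness
```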

To ensure that only the highest-quality recommendations are provided, API responses are capped at the three highest-scoring recommendations for each document. Recommendations are also limited to content not recently viewed by the visitor. New visitors who have not yet interacted with newsfeed content are served a mix of the most popular content in the newsfeed.

Recommendation System Overview (Image by author)

Future State

Once enough visitor history has been collected, more effective techniques can be explored, including hybrid approaches that combine collaborative and content-based filtering.

Step 3: Personalizing Recommendations

Having a newsfeed visitor history is critical to the API being able to provide personalized website recommendations. Due to the nature of the site, visitors are not authenticated or tracked in a manner that is helpful in this case. Google Analytics is used to track general website activity but does not allow for extracting unsampled user-level data. To avoid unnecessary complexity, an approach relying on browser cookies was chosen to track recent visitor selections within the newsfeed.

Website Recommendation Lifecycle (Image by author)

Each time a visitor clicks into newsfeed content, a cookie in the visitor's browser is updated to record the IDs of the latest pages viewed. These IDs are then passed to the API to personalize recommendations.

Step 4: Designing a Deployment Architecture

Now it was time to design the overall API deployment architecture while keeping within the sub-second response time and TCO requirements. The architecture also needed to reside within the confines of a private cloud environment and had to meet other proprietary security restrictions that will not be covered here.

A variety of AWS deployment options were considered, including SageMaker endpoints and Elastic Beanstalk. SageMaker endpoints can make deploying models relatively straightforward, but in my experience, models not created in SageMaker can be cumbersome to deploy, and the documentation can be lacking and sometimes outdated. From a cost perspective, SageMaker deployments typically use EC2 virtual server instances, which are billed hourly even in times of low demand. That is not ideal for the newsfeed, which is expected to receive lower demand on weeknights and weekends.

Another option used Elastic Beanstalk to deploy the API with load-balanced and auto-scaled EC2 instances for high availability. Each instance would run Flask, Django, or FastAPI to process incoming API requests. Although Elastic Beanstalk is an effective way to deploy applications, it has similar cost implications: the service itself is free, but any launched EC2 instances are billed 24x7 whether they are fully utilized or not.

Serverless Lambda

A serverless architecture quickly emerged as the best choice for deploying the recommendation API. With the similarity matrix pre-generated, there is no need for real-time inference, and a lookup table could be used as long as the model is updated for new content. A lookup table gave the flexibility of using DynamoDB for model storage and, in combination with Lambda, maximized performance and availability while minimizing cost. Lambda functions are billed only when they are used and can provide significant cost savings.

High-Level API Architecture (Image by author)

Along with cost savings, Lambda supports scalability and availability. Lambda functions automatically scale to handle 1,000+ concurrent executions per region and run in multiple Availability Zones, meeting the 99.9% SLA requirement.

DynamoDB

For model storage, a nontraditional pattern using DynamoDB, AWS's NoSQL database, was chosen. Each item in the database would be a JSON document, primarily containing a unique content ID along with its three highest-scoring content IDs. This resulted in very fast API lookups, with over-the-internet response times averaging 250 milliseconds.
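
A minimal sketch of the read path inside the Lambda function, assuming a hypothetical table named newsfeed-recommendations keyed by content_id, with a recommendations attribute holding the three highest-scoring content IDs (names are illustrative, not the actual schema):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("newsfeed-recommendations")  # hypothetical table name

def handler(event, context):
    params = event.get("queryStringParameters") or {}
    content_id = params["content_id"]  # content the visitor is viewing
    # IDs recently viewed by the visitor, passed from the browser cookie
    recent = {s for s in params.get("recent_ids", "").split(",") if s}

    # Single-item lookup: the pre-generated top recommendations for this content
    item = table.get_item(Key={"content_id": content_id}).get("Item", {})
    recs = [r for r in item.get("recommendations", []) if r not in recent]

    return {"statusCode": 200, "body": json.dumps({"recommendations": recs})}
```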

In regard to scalability and high availability, DynamoDB automatically scales capacity and repartitions data as the table size grows. Data is also replicated across three facilities in an AWS Region, providing an SLA of 99.99%.

As effective as DynamoDB can be, using it to store ML models might not work in many situations. To store and load large models for inference purposes, serverless Lambda can be used with AWS Elastic File System (EFS) for a variety of use cases. More information about the Lambda/EFS pattern can be found in the Resources section.

AWS Lambda Considerations

Serverless architectures provide many benefits over a standard load-balanced EC2 architecture, but they also have an inherent cold-start problem. Depending on the size of the deployment package and the function's initialization time, cold-start latency typically appears in around 1% of invocations. Keeping deployment packages light is important, but if that is not an option, provisioned concurrency can be enabled to keep a requested number of execution environments initialized and ready to respond to incoming requests.

For the newsfeed, provisioned concurrency will be enabled during peak hours to ensure that the API can scale to meet traffic with low latency recommendations.
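
Provisioned concurrency is configured per function version or alias; a sketch using boto3, with illustrative names and values:

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep 10 execution environments initialized for the function's "live" alias
# during peak hours; names and values are illustrative. Scheduling the change
# (e.g., via Application Auto Scaling or a cron-triggered job) is not shown.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="newsfeed-recommendation-api",
    Qualifier="live",
    ProvisionedConcurrentExecutions=10,
)
```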

Global Considerations

For global use cases, it is essential to evaluate the use of a content delivery network (CDN). Depending on how much data is sent through the API and how well it can be cached, a CDN can significantly improve response times internationally. For the newsfeed, the recommendations cached well, and Amazon CloudFront's edge locations will give visitors across the globe fast, consistent performance when calling the API.

Step 5: MLOps and Finalizing a Deployment Strategy

Machine learning operations, or MLOps, streamlines the process of taking ML models to production and then maintaining and monitoring them. While there are many options available for a basic MLOps automation pipeline, some are highly recommended, while others depend on data sensitivity, team size, and API availability needs. Below, I summarize my strategy for each of the key deployment topics, along with some general considerations.

Basic MLOps Automation Pipeline (Image by author)

Automated Model Builds

Newly created content is automatically processed every hour by scheduled model builds. Near real-time processing can be achieved but is not necessary, since newsfeed content is not published at a high frequency. Python scripts hosted on an EC2 instance collect data from the various sources, generate the model, and push the results to DynamoDB (a sketch of the push step follows). Heavy use of logging, with error alerting, ensures that any failures can be reviewed and addressed quickly.
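
The push step of such a build script might look like this sketch, reusing the hypothetical table and attribute names from the earlier lookup example:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("newsfeed-recommendations")  # hypothetical table name

def push_model(top3_by_content):
    """Write each document's three highest-scoring content IDs to DynamoDB.

    top3_by_content maps a content ID to a list of three recommended IDs;
    batch_writer() buffers writes and retries unprocessed items automatically.
    """
    with table.batch_writer() as batch:
        for content_id, recommendations in top3_by_content.items():
            batch.put_item(Item={
                "content_id": content_id,
                "recommendations": recommendations,
            })
```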

Automated Model Build Architecture (Image by author)

Security

Security was an essential part of the overall design and needed to follow organizational policies. Beyond the default resource policy protections between AWS services, traffic to the API needed to be TLS encrypted and filtered through internal network security infrastructure. Depending on project needs, a wide variety of security measures are available, including IP filtering, custom bearer token authentication, and Amazon Cognito for federated authentication with single sign-on or social sign-on.

API Monitoring

Continuous application monitoring is another important aspect of deploying an API. Amazon CloudWatch captures Lambda execution metrics by default and will also be used for detailed logging, as well as for alarms triggered on both traffic metrics and health checks.
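
As an example, a latency alarm aligned with the 500-millisecond requirement could be defined like this sketch (function and alarm names are illustrative):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the function's average duration exceeds 500 ms for three
# consecutive one-minute periods; names and thresholds are illustrative.
cloudwatch.put_metric_alarm(
    AlarmName="newsfeed-api-high-latency",
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": "newsfeed-recommendation-api"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
)
```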

AWS CloudWatch Logs and Alarms (Image by author)

For situations that call for high continuity, CloudWatch alarms can be integrated into an organization's existing service desk infrastructure for priority ticket generation. Logs can also be consolidated into tools like Splunk for additional search, analysis, and alerting capabilities.

Model Drift Monitoring

Once in a production environment, models need ongoing monitoring to revalidate their health and business value. Changing and shifting data is a primary reason the accuracy of deployed models tends to decrease over time. For the newsfeed, without ground-truth labels to rely on, a similarity distribution score index was devised by calculating the mean of the highest similarity score for each document (excluding the document itself) and indexing the overall score to a baseline.
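
A sketch of how this index could be computed from the model build's similarity matrix; the baseline is an assumption captured at initial deployment:

```python
import numpy as np

def drift_index(similarity_matrix, baseline):
    """Mean of each document's highest non-self similarity, indexed to a baseline.

    A value near 100 means the similarity distribution matches the baseline
    captured at initial deployment; sustained movement suggests drift.
    """
    sim = np.array(similarity_matrix, dtype=float)
    np.fill_diagonal(sim, -np.inf)  # exclude each document's own score
    mean_top_score = np.max(sim, axis=1).mean()
    return 100.0 * mean_top_score / baseline
```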

Additionally, the accuracy metric used for hyperparameter tuning provides another way of monitoring the ongoing accuracy of the model. Computed after each model build, both drift scores will be tracked over time and used as leading indicators and troubleshooting tools.

Example Model Drift Dashboard (Image by author)

CI/CD and Version Control

Depending on project size, a continuous integration/continuous delivery (CI/CD) pipeline can be implemented to enforce automation in building, testing, and deployment. A variety of CI/CD options are available, including AWS's CodeDeploy and SAM pipelines. For this project, PyCharm, with the AWS Toolkit and GitLab plug-ins, allows for managing version control and deployments to Lambda, while production-ready model build scripts are automatically downloaded to the EC2 instance before each scheduled run.

Step 6: A/B Testing

Recommendations are estimated predictions and don't come with guarantees for how a site's audience will react to a recommendation strategy. Once the recommendation system is deployed to production, A/B testing will be conducted to gauge its effectiveness by comparing site metrics like click-through rate (CTR), pages per session, and average session duration with and without recommendations. Metrics gathered during the experiment will help validate (or invalidate) the strategy. Eventually, A/B testing will also be used to compare current and future recommendation strategies.
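
When experiment data comes in, CTR differences between the control and treatment groups can be sanity-checked with a two-proportion z-test; a sketch using statsmodels, with illustrative counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# Clicks and impressions for control (no recommendations) and treatment
# (with recommendations); the numbers are illustrative.
clicks = [420, 510]
impressions = [10_000, 10_000]

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # small p suggests a real CTR difference
```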

Conclusion

I hope this article provided some food for thought when it comes to developing and deploying recommendation systems. Looking at a project holistically, with production in mind, helps deliver real business value and has a positive impact on how the project is perceived both inside and outside an organization.

What I covered here is an introduction to a much larger topic. Implementations like hybrid models, for example, may need multiple deployment methods and can bring additional challenges to each of the facets discussed. Welcome to the world of enterprise recommendation systems!

Resources

[1] AWS, AWS Lambda Developer Guide (2022), Amazon Web Services

[2] AWS, Amazon DynamoDB Developer Guide (2022), Amazon Web Services

[3] AWS, Amazon API Gateway Developer Guide (2022), Amazon Web Services

[4] AWS, Amazon CloudFront Developer Guide (2022), Amazon Web Services

[5] AWS, Amazon CloudWatch Developer Guide (2022), Amazon Web Services

[6] N. Jain, A. Vaidyanathan, S. Debnath, and V. Krishnamoorthy, Deploy Multiple Machine Learning Models for Inference on AWS Lambda and Amazon EFS (2021), Predictive Hacks

[7] I. Njanji, Learn how to leverage Amazon CloudWatch alarms to create an incident in ServiceNow (2018), Amazon Web Services

[8] Splunk, How to stream AWS CloudWatch Logs to Splunk (2017), Splunk

[9] M. Mantosh, Developing and Deploying Serverless APIs using AWS Toolkit (2021), JetBrains
