From entertainment applications to dating platforms, social networking sites to retail, recommendation engines play a pivotal role in today’s society. Not only have they become significantly more effective, they play an ever-increasing role in our lives: guiding our attention, personalizing to our interests, and surfacing items of personal value. While each recommendation engine is unique and needs to account for the intricacies of the problem, the business, and the available data, many of the building blocks are the same – user/item embeddings, user histories, contextual features, and neural collaborative layers that map users and items to ratings. Models can bypass assumptive and error-prone hand-made features and instead use large datasets of implicit and explicit feedback to predict ratings.
Recommender engines are machine learning or rules-based models that provide recommendations like the best content/item for a user, the right customers to target for a product, or a fair price. They follow the generic structure:
Outcome = F(U, I, C)
Where U is a user, I is an item, C is the context, and F is a function that maps a combination of U, I, and C to an outcome. An outcome could be explicit feedback like a rating, implicit feedback like watch time, or a non-feedback quantity like price.
When a recommendation engine is packaged with a data pipeline to source the required input data, the ability to host the model or obtain batch inferences from it and make updates over time, and a user interface to receive and interact with recommendations, it becomes a recommendation system.
A simple example of a recommendation engine is a model that recommends a movie (I) for a user of a streaming platform (U) on the weekend (C).
A collaborative recommendation engine can be used to map users and items to a common embedding space, after which the items closest to a user in that space can be recommended. To map users and items into this embedding space, matrix factorization approaches or multi-layer perceptrons can be applied. So long as there is sufficient past data on user and item feedback, approaches that automatically learn the similarity between users and items tend to produce much better recommendations than those using hand-made features like user and item metadata.
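To make this concrete, here is a minimal sketch (not production code) that factorizes a toy rating matrix with a plain SVD and recommends the items whose embeddings align most with a user’s embedding. The ratings, the choice of k=2 latent dimensions, and the naive treatment of unrated entries as zeros are purely illustrative.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, cols = movies); 0 = unrated.
# Hypothetical data for illustration only; unrated entries are naively kept as 0.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

k = 2  # number of latent dimensions
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
user_emb = U[:, :k] * s[:k]   # user embeddings in the latent space
item_emb = Vt[:k, :].T        # item embeddings in the same space

def recommend(user_idx, n=2):
    """Score every item by its dot product with the user's embedding."""
    scores = item_emb @ user_emb[user_idx]
    return np.argsort(-scores)[:n]

print(recommend(0))  # indices of the top items for user 0
```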
Consider the following embedding space for 3 movies, based on 2 hand-made features: the degree (out of 5) to which a movie falls within the drama and fantasy genres.

Instead of using 2 hand-made features, collaborative systems are trained to automatically map users and items to, for example, k=10 ‘latent dimensions’ that are learned from past ratings without requiring additional item/user metadata.
You’ll also note that drama and fantasy carry roughly equal weight in determining an item’s location in the space above. Using multi-layer perceptrons rather than matrix factorization gives more flexibility in how embeddings are derived and used to find the items most similar to a user. For example, embeddings can be learned whereby different dimensions carry different weights in determining a user’s location in space. If drama were twice as important as fantasy to a movie’s position, you could picture the Y axis being squashed in half, and Memento’s relative distance to Star Wars shrinking slightly.

Neural collaborative filtering (NCF) is a generalized framework for predicting user-item ratings that relaxes some of the linear restrictions of matrix factorization illustrated in Chart 1 – all dimensions having the same weight, and the rating being inversely proportional to distance in the embedding space (i.e. the interaction function being linear).
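As a rough sketch of the NCF idea (assuming PyTorch, with illustrative layer sizes), the model below embeds user and item IDs and lets a small MLP learn a non-linear interaction function instead of a fixed dot product.

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """Minimal neural collaborative filtering sketch: embed user and item IDs,
    then let an MLP learn a non-linear interaction function."""
    def __init__(self, n_users, n_items, dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)  # predicted rating / relevance score

model = NCF(n_users=1000, n_items=500)
pred = model(torch.tensor([0, 1]), torch.tensor([10, 42]))
```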
Contextual information, like when the movie is watched or who a user is watching it with, can be considered after the initial recommendation is made by filtering results (contextual post-filtering), or before, by treating the same item as distinct depending on the context in which it is consumed (contextual pre-filtering).
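Contextual post-filtering can be as simple as applying context-dependent rules to the candidate list produced by a context-free model; the sketch below uses a made-up ‘watching with family’ rule and hypothetical metadata.

```python
# Hypothetical contextual post-filtering: generate candidates with a
# context-free model, then filter/re-rank them using the current context.
def post_filter(candidates, context, item_metadata):
    """Keep only candidates suited to the context, e.g. family-friendly
    titles when the user is watching with family (illustrative rule)."""
    if context.get("with_family"):
        candidates = [i for i in candidates if item_metadata[i]["family_friendly"]]
    return candidates

candidates = ["memento", "star_wars", "frozen"]
meta = {"memento": {"family_friendly": False},
        "star_wars": {"family_friendly": True},
        "frozen": {"family_friendly": True}}
print(post_filter(candidates, {"with_family": True}, meta))  # ['star_wars', 'frozen']
```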
For greater rigor, contextual fields can be integrated as additional dimensions in the space. Consider the example below, where we add a hand-made feature on the vertical axis indicating the prevalence of users watching the movie with their family.

However, since the computational complexity of traditional embedding techniques grows exponentially with the number of contextual dimensions (see Multiverse, a tensor factorization approach, as an example), alternative approaches like factorization machines can be employed to preserve tractability.
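For intuition, here is a minimal sketch of the factorization machine scoring function (second-order, following Rendle’s formulation), with random parameters standing in for learned ones; the pairwise-interaction term uses the usual reformulation so its cost grows linearly in the number of features rather than quadratically.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction.
    x: feature vector (n,), w0: bias, w: linear weights (n,),
    V: latent factors (n, k). Pairwise interactions cost O(n*k) instead
    of O(n^2) thanks to the reformulated sum below."""
    linear = w0 + x @ w
    interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return linear + interactions

# Illustrative random parameters; in practice they are learned from data.
rng = np.random.default_rng(0)
n, k = 8, 3
x = rng.random(n)
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```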
Given that each real-world recommender system case is unique, modern recommendation systems are often a purpose-built combination of building blocks like item embeddings, matrix factorization, and neural layers used to find connections between users and items. For example, consider YouTube’s video recommendation engine, which combines two deep neural networks: a first to select candidate videos you’d likely enjoy, and a second to rank those candidates by how long you are likely to watch them.
In this article, I’d like to discuss two general examples I’ve encountered that deviate from traditional user-by-item recommender systems, as well as some unique ways to engineer features for solving such problems with tree-based or deep learning models. In both examples, contextual information plays a relatively greater role in predicting successful interactions than in traditional examples like movie recommenders.
The scenarios are as follows:
- High Context & Low User Importance > Outcome = F(I, C)
- High Context & No Item > Outcome = F(U, C)
Scenario 1
Outcome = F(I, C)
The first scenario is one where the items and context play a dominant role in the recommender, while the user takes a backseat. Consider cases where the majority of users are first-time users (so there is limited value in user history/similarity), many items are new and have never been rated, or the right-fit items depend entirely on context and little on the user.
To select the best freelancer for a job, you’d want to understand how well they performed on similar jobs, how past users rated them, and whether they have a reputation for getting their work done on time. The fact that you recently worked with an online copywriter may not be the most relevant signal for finding the right person for your current on-premise photography job.
How would traditional recommender engines work in such a scenario? A content-based engine may recommend somebody similar to your copywriter, while a collaborative one may recommend freelancers that other users hired after hiring the copywriter. Neither of these recommendations would be particularly helpful. Because of the differing context, the experience other users have had hiring for photography jobs in your area may be more relevant to you than your own past experience.
So we’d like to understand which freelancers performed best on previous jobs that matched the current context, regardless of the user. But what if there are a multitude of contextual dimensions to the job (e.g. type of work, skills, scope), as well as different ways to measure user feedback (e.g. speed of delivery, rating, and hires)?
One solution is to take a cartesian product of contextual and feedback dimensions, then aggregate item histories across all users.
Cartesian Product of Context & Past Feedback
To do this, we first need to one-hot-encode the contextual dimensions so that each level gets a dedicated dimension. For illustrative purposes, the result looks as follows for a candidate freelancer. Note that when used in a model, each row of the last 3 columns would be unrolled into dedicated feature columns with names like avg_delivery_on_time_photography.

But if we have n initial contextual dimensions, each with h − 1 one-hot-encoded levels, and k feedback dimensions, n × (h − 1) × k model features are required to map all possible combinations. In the above incomplete example, that already leads to 51 features.
Furthermore, many of the features relate to other types of jobs (e.g. software development) that aren’t relevant to the current context but have been created by the cartesian product.
A more compact and meaningful way to encode this information is to aggregate past history that matches the current context across all contextual and feedback dimensions, which instead requires n × k features. Such contextual aggregations can easily be accomplished in SQL using case logic akin to the following:
avg(case when past_job.type_of_work = job.type_of_work then delivery_on_time end) as avg_delivery_type_of_work
Where the job table contains information on the current job context and the past_job table contains information for past jobs.
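The same contextual aggregation can be sketched in pandas (hypothetical column names, mirroring the SQL above): for each of the n contextual dimensions, average each of the k feedback dimensions over the freelancer’s past jobs that match the current job’s context.

```python
import pandas as pd

# Hypothetical tables: the current job and a candidate freelancer's past jobs.
job = {"type_of_work": "photography", "budget": "medium"}
past_jobs = pd.DataFrame({
    "type_of_work": ["photography", "copywriting", "photography"],
    "budget":       ["low", "medium", "medium"],
    "delivery_on_time": [1, 0, 1],
    "rating":           [5, 4, 4],
})

features = {}
for ctx in ["type_of_work", "budget"]:          # n contextual dimensions
    match = past_jobs[past_jobs[ctx] == job[ctx]]
    for fb in ["delivery_on_time", "rating"]:   # k feedback dimensions
        # n x k features, e.g. avg_delivery_on_time_type_of_work
        features[f"avg_{fb}_{ctx}"] = match[fb].mean()

print(features)
```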
The improvement looks as follows:

The end result is rich information on how past freelancers (items) performed for similar jobs (context), while limiting unnecessarily high dimensionality. To produce the features, contextual dimensions are only considered individually; combinations, like past ratings when both the budget is medium and the experience level is expert, are omitted. Non-linearities can still be derived by the training algorithm, or additional contextual dimensions can be created as the product of existing ones.
We can then unroll each feedback-by-context column into a dedicated feature column, with each row representing a distinct freelancer, and stack all freelancers meeting hard criteria (e.g. does photography work) and their associated features as additional rows. We can rebuild this representation treating each past job as the current job, stacking each past job as a further set of rows. For past jobs, we can add a column capturing explicit feedback received by the freelancer, such as being hired, or implicit feedback like being messaged.
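Sketching that training-table construction (all data and helpers below are hypothetical stand-ins): each past job is treated as the ‘current’ job, every eligible freelancer becomes a row, and the observed feedback becomes the label.

```python
import pandas as pd

# Toy lookups standing in for real data access (purely hypothetical).
past_jobs = {"j1": {"type_of_work": "photography"}, "j2": {"type_of_work": "copywriting"}}
candidates = {"j1": ["t1", "t2"], "j2": ["t2"]}   # freelancers meeting hard criteria
hires = {("j1", "t1"), ("j2", "t2")}              # observed explicit feedback

def contextual_features(job, talent_id):
    """Placeholder for the n x k contextual aggregates built earlier."""
    return {"avg_rating_type_of_work": 4.5}

rows = []
for job_id, job in past_jobs.items():             # treat each past job as "current"
    for talent_id in candidates[job_id]:
        rows.append({"job_id": job_id, "talent_id": talent_id,
                     "hired": int((job_id, talent_id) in hires),  # label
                     **contextual_features(job, talent_id)})

train_df = pd.DataFrame(rows)
print(train_df)
```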

Finally, we can train a tree-based model like XGBoost to use the engineered features (job and talent # excluded) to predict the feedback talent received on completed jobs, deploy the model, and apply it to newly posted jobs to recommend the best freelancers.
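Continuing the toy train_df from the sketch above (and assuming the xgboost package, with illustrative hyperparameters), the training and scoring step might look like this; at serving time, every eligible freelancer is scored for the newly posted job and the highest-scoring ones are recommended.

```python
from xgboost import XGBClassifier

# Train on past (job, freelancer) rows; job/talent IDs are excluded so the
# model learns from contextual aggregates rather than memorising identities.
X = train_df.drop(columns=["job_id", "talent_id", "hired"])
y = train_df["hired"]

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)

# Score candidate freelancers for a newly posted job (same feature layout).
scores = model.predict_proba(X)[:, 1]
```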
Scenario 2
Outcome = F(U, C)
Another instance is when the recommendation isn’t for a discrete item, but rather a continuous value. Consider a platform recommending the price a property rental company should charge for a property on a weekend.
In such an instance, we can observe past prices for rentals that matched the contextual dimensions, but now, instead of taking a cartesian product with types of item feedback, we cross the context with whether the rental was for the same user or another user. Finally, we aggregate the past prices, for example through an average.
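A sketch of that aggregation (hypothetical fields): for each contextual dimension matching the current request, past prices are averaged separately for the same user and for other users.

```python
import pandas as pd

# Hypothetical past rentals; the "current" request is a weekend booking
# of a 2-bedroom property for user u1.
current = {"user_id": "u1", "bedrooms": 2, "weekend": True}
past = pd.DataFrame({
    "user_id":  ["u1", "u2", "u2", "u3"],
    "bedrooms": [2, 2, 3, 2],
    "weekend":  [True, True, False, True],
    "price":    [120, 150, 90, 140],
})

features = {}
for ctx in ["bedrooms", "weekend"]:                      # contextual dimensions
    match = past[past[ctx] == current[ctx]]
    for same_user, label in [(True, "same_user"), (False, "other_users")]:
        subset = match[(match["user_id"] == current["user_id"]) == same_user]
        features[f"avg_price_{ctx}_{label}"] = subset["price"].mean()

print(features)
```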

We can then perform the additional steps outlined at the end of Scenario 1.
Layered Models
Another option is to create an initial contextual model to make a prediction across all users, and then a second model that takes the user-agnostic prediction and tailors it to the user. One benefit of such an approach is that it provides two different yet meaningful predictions that you can recommend to the user or otherwise apply to drive the desired behavior.
Consider the example of a model that inputs a job description and associated metadata to recommend a position’s salary prior to recruitment. A first model can predict the salary agnostic of the company to find the ‘market rate’ for the position.
A second model can be trained with the first model’s prediction as an input, and additional company features similar to the third column in Chart 7, to find the ‘tailored’ salary prediction, i.e. the salary prediction considering similar positions filled and whether they were filled by the same or another company.
If the tailored prediction is substantially different from the market rate, it may be useful for the company to understand the delta and its implications. For example, another model could be developed to predict the tenure of a new hire based on the job description, the candidate hired, the starting salary, and the company. The difference in predicted tenure could then be computed using the market rate versus the tailored salary as the input. Say the market rate is $75K and the tailored salary is $65K, but the predicted tenure at $75K is 1 year greater. The recommender system could present $75K as the recommended salary and provide a tool to observe how predicted tenure changes with the salary offered.
To train layered models while preventing leakage, where ‘knowledge from the hold-out set leaks into the dataset used to train the model,’ you’ll need to ensure that observations used to train each layer do not overlap.
Furthermore, if you integrate a feature in the second-layer model that takes the difference between historical company salaries for similar positions and the market-rate salaries predicted by the first-layer model (say, resulting in an average of $8K below the market rate), you’ll need to sample your training, evaluation, and testing sets at the company level rather than the position level. This is because, as a feature to predict the salary for a position, you’ve also integrated information on the market-rate predictions for all prior positions filled at the company. To avoid bias, the predictions for observations used to train a first model should not be used in any way as features for a subsequent model. We end up with a train/test split as follows, excluding the validation set and using a split ratio of 80% for illustrative purposes.
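A sketch of that split and the two-layer training (synthetic data, illustrative features, simplified so the second layer takes the market-rate prediction directly rather than a difference feature): grouping the split by company keeps all of a company’s positions in one set, and the 80/16/4 proportions mirror the example discussed below.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from xgboost import XGBRegressor

# Hypothetical positions data: job features, company features, and salary.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "company_id": rng.integers(0, 50, 500),
    "job_feature": rng.random(500),
    "company_feature": rng.random(500),
    "salary": rng.normal(70_000, 10_000, 500),
})

# Split on companies (not positions) so company-level information built from
# layer-1 predictions never leaks across sets. 80% / 16% / 4% overall.
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
layer1_idx, rest_idx = next(gss.split(df, groups=df["company_id"]))
layer1, rest = df.iloc[layer1_idx], df.iloc[rest_idx]
layer2_idx, test_idx = next(gss.split(rest, groups=rest["company_id"]))
layer2, test = rest.iloc[layer2_idx], rest.iloc[test_idx]

# Layer 1: company-agnostic "market rate" prediction.
m1 = XGBRegressor(n_estimators=100).fit(layer1[["job_feature"]], layer1["salary"])

# Layer 2: tailor the market rate using company features.
def layer2_features(d):
    return pd.DataFrame({"market_rate": m1.predict(d[["job_feature"]]),
                         "company_feature": d["company_feature"].values})

m2 = XGBRegressor(n_estimators=100).fit(layer2_features(layer2), layer2["salary"])
print(layer2_features(test).assign(tailored=m2.predict(layer2_features(test))).head())
```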

Evidently, a layered approach is only a viable alternative when there is a substantial volume of past data, and a relatively lower split ratio like 70% may be preferable. Any improvement to the model from the additional user features made available in the layered approach needs to be weighed against the loss of observations for the second-layer model. However, in the specific regression use cases where I’ve evaluated a layered approach, I’ve found that simply integrating the prediction from an initial tree-based model trained on 80% of observations into a second one trained on the remaining 16% of observations yields an MSE on the final 4% of observations approximately equal to that of a single model trained on all 96% of training observations.
Combining Recommendations From Contextual & Collaborative Models
F(I, C) + F(U, I)
Finally, recommendations derived from contextual models can always be weighted with those obtained from user/item similarity in a collaborative model to provide the best recommendations.
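For instance, a simple weighted blend (alpha is an arbitrary, illustrative weight on the contextual model) could look like:

```python
# Hypothetical blend of per-item scores from two models.
def blend(contextual_scores, collaborative_scores, alpha=0.6):
    """Weighted combination of two score dictionaries keyed by item id."""
    items = set(contextual_scores) | set(collaborative_scores)
    return {i: alpha * contextual_scores.get(i, 0.0)
               + (1 - alpha) * collaborative_scores.get(i, 0.0)
            for i in items}

print(blend({"t1": 0.9, "t2": 0.4}, {"t1": 0.2, "t2": 0.8}))
```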
To better understand alternative approaches to measuring user/item similarity given embeddings, I’d strongly recommend this article distinguishing L-norms and angular measures.
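As a quick illustration of the difference, two embeddings can be far apart under an L2 norm yet identical under an angular (cosine) measure:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)                              # an L-norm distance

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # angular measure

a, b = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(euclidean(a, b), cosine_sim(a, b))  # nonzero L2 distance, cosine similarity of 1
```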
Closing remarks
While recommenders have seen remarkable advancements, less research has focused on incorporating contextual information that alters the nature of user and item interactions. Neural collaborative filtering (NCF) provides a framework for developing neural networks where the first layer takes in user and item identities to derive embeddings, and later layers model the interaction between the user and item embeddings to predict ratings. For now, relevant context still needs to be hand-engineered and explicitly fed into models. While NCF approaches are often the recommended way to go, used by the likes of YouTube and Google, in some cases they may be overkill, such as in non-traditional cases where contextual information plays a dominant role.
I’ve walked through a couple of recommender cases I’ve personally encountered: one where user information isn’t very important, and one where there are no items. I’ve shown some intriguing ways to engineer hand-made features in these cases, such as through cartesian products and layered models.
I hope the ideas here helped spark some creativity, grow your knowledge, and motivate you to further explore the captivating field of recommendation engines.
Thanks for reading! If you liked this article, follow me to get notified of my new posts. Also, feel free to share any comments/suggestions.