
Introduction
Learning to rank (LTR) is a class of supervised machine learning algorithms that aim to sort a list of items by their relevance to a query. In classical machine learning problems like classification and regression, the goal is to predict a single value from a feature vector. LTR algorithms instead operate on a set of feature vectors and predict the optimal order of the corresponding items.
LTR has many different applications. Here are some of them:
- Search engines. A user types a query into a browser search bar. The search engine should rank the web pages so that the most relevant results appear in the top positions.
- Recommender systems. A movie recommender system chooses which films should be recommended to a user based on an input query.
Let us formally define the ranking problem:
Given an n-dimensional feature vector storing information about a query and a document, the objective of ranking is to find a function f that produces a real number indicating the relevance of the document to the query. Additionally, if object i is ranked higher than object j (i ▷ j), then f(i) should be greater than f(j).
Note. i ▷ j means that document i is ranked higher than document j.
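To make the definition concrete, here is a minimal sketch with made-up feature vectors and a hypothetical linear scoring function (not any particular LTR model): each (query, document) feature vector is mapped to a real-valued score by f, and the ranking is obtained by sorting the scores.

```python
import numpy as np

# Made-up example: 4 (query, document) pairs, each described by a 3-dimensional feature vector.
features = np.array([
    [0.2, 1.0, 0.5],
    [0.9, 0.1, 0.3],
    [0.4, 0.4, 0.8],
    [0.1, 0.2, 0.1],
])

# A toy scoring function f (here a fixed linear model, purely for illustration).
weights = np.array([0.5, 0.3, 0.2])
scores = features @ weights

# The ranking is obtained by sorting documents by their scores in descending order.
ranking = np.argsort(-scores)
print(ranking)  # document indices from most to least relevant
```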
Data
Feature vectors
Feature vectors consist of three types of features:
- Features derived only from a document (e.g., document length, number of links in a document).
- Features derived only from a query (e.g., query length, frequency of a query).
- Features derived from a combination of a document and a query (e.g., TF-IDF, BM25, BERT scores, number of common words shared by the document and the query). A small sketch of assembling such a feature vector is shown below.
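As a sketch of how these three groups can be combined into a single feature vector (the features here are simplistic, hypothetical stand-ins for real ones such as BM25 or TF-IDF):

```python
# A simplistic sketch of assembling one feature vector from the three feature groups.
def build_feature_vector(query: str, document: str) -> list[float]:
    query_terms = query.lower().split()
    doc_terms = document.lower().split()

    doc_length = len(doc_terms)                            # document-only feature
    query_length = len(query_terms)                        # query-only feature
    common_words = len(set(query_terms) & set(doc_terms))  # query-document feature

    return [float(doc_length), float(query_length), float(common_words)]

print(build_feature_vector("learning to rank", "learning to rank with gradient boosting"))
```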
Training data
In order to train a model, we need training data that will be fed into the model. There are two possible approaches based on how the training data is collected.
- Offline LTR. Data is manually annotated by a human. The human rates the relevance of pairs (query, document) for different queries and documents. This approach is expensive and time-consuming but provides high-quality annotations.
- Online LTR. Data is implicitly collected from user interactions with query results (e.g., number of clicks on ranked items, time spent on a web page). In this case, training data is easy to obtain, but user interactions are not easy to interpret.
After that, we have feature vectors and their corresponding labels, which is everything we need to train a model. The next step is to choose the machine learning algorithm best suited to the problem.
Ranking types
At a high level, the majority of LTR algorithms use stochastic gradient descent to find an optimal ranking. Depending on how an algorithm chooses and compares the ranks of items at each iteration, there are three principal methods:
- Pointwise ranking.
- Pairwise ranking.
- Listwise ranking.
All of these methods transform the ranking task into a classification or regression problem. In the following sections, we will see how they work under the hood.
Pointwise ranking
In the pointwise approach, scores are predicted individually for each feature vector, and documents are then sorted by their predicted scores. It does not matter which type of model (decision tree, neural network, etc.) is used for prediction.
This type of ranking transforms the ranking problem into a regression task, where a regression model tries to predict the correct relevance with respect to a chosen loss function (e.g., MSE).
Another valid approach is to transform ground truth rankings into one-hot representations and feed this data to the model. In this case, it is possible to use either a regression or classification model (with cross-entropy loss).
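A minimal sketch of the regression variant, assuming a scikit-learn gradient-boosted tree model and randomly generated toy data (any regressor would do):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy training data: one row per (query, document) pair,
# labels are graded relevance judgements (0 = irrelevant, ..., 2 = highly relevant).
X_train = np.random.rand(100, 5)
y_train = np.random.randint(0, 3, size=100)

# Pointwise ranking as plain regression on the relevance labels (squared error loss).
model = GradientBoostingRegressor()
model.fit(X_train, y_train)

# At inference time, the documents retrieved for a query are scored independently and sorted.
X_query_docs = np.random.rand(10, 5)
predicted_relevance = model.predict(X_query_docs)
ranking = np.argsort(-predicted_relevance)
```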

Despite the method being very simple, it has some issues listed below.
Class imbalance
A common issue when using the pointwise method is class imbalance. If a random query is taken in real life, it is very likely that only a tiny part of all documents in the collection will be relevant to it. Thus there is a high imbalance between relevant and irrelevant documents for a query in the training data.
While it is possible to overcome this issue, there is a much more serious problem to consider.
Bad optimisation metric
Pointwise ranking has a major fundamental problem with its optimisation objective:
Pointwise ranking optimises document scores independently and does not take into account relative scores between different documents. Therefore, it does not directly optimise the ranking quality.
Consider the example below, where a pointwise algorithm made predictions for two sets of documents. Let us assume that MSE loss is optimised during training.


Given the two ranking results, we can see that from the algorithm's point of view the second ranking is better because its MSE value is lower. However, choosing the second ranking means that the user is shown all the irrelevant results first. In the first example, the relevant result is shown first, which is much better for the user experience: normally, a user does not pay much attention to what is recommended afterwards.
This example shows that in real life we care more about showing relevant results first and about the relative order of the items. Because documents are processed independently, the pointwise approach does not guarantee either of these aspects: a lower loss is not equivalent to a better ranking.
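A small numeric sketch of this effect, with made-up relevance labels and predictions: the second set of predictions achieves a lower MSE even though it places the relevant document below an irrelevant one.

```python
import numpy as np

# Made-up ground truth: document 0 is relevant, the other two are not.
labels = np.array([1.0, 0.0, 0.0])

# Two hypothetical sets of pointwise predictions for the same documents.
preds_a = np.array([0.9, 0.8, 0.8])   # relevant document gets the highest score
preds_b = np.array([0.45, 0.5, 0.2])  # relevant document is ranked second

mse_a = np.mean((labels - preds_a) ** 2)  # = 0.43
mse_b = np.mean((labels - preds_b) ** 2)  # ≈ 0.20

print(mse_a > mse_b)                               # True: the loss prefers the second set
print(np.argsort(-preds_a), np.argsort(-preds_b))  # [0 1 2] vs [1 0 2]; the second ranking is worse
```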
Pairwise ranking
Pairwise models work with a pair of documents at each iteration. Depending on the input format, there are two types of pairwise models.
Pair-input models
The input to the model is two feature vectors. The model output is the probability that the first document is ranked higher than the second. During training, these probabilities are calculated for different pairs of feature vectors. The weights of the model are adjusted through gradient descent based on ground truth ranks.
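A minimal sketch of a pair-input model, assuming PyTorch and a hypothetical hidden size: the two feature vectors are concatenated and mapped to the probability that the first document should be ranked above the second.

```python
import torch
import torch.nn as nn

# Hypothetical pair-input model: the two feature vectors are concatenated and
# mapped to the probability P(document i is ranked higher than document j).
class PairInputRanker(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_features, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, doc_i: torch.Tensor, doc_j: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([doc_i, doc_j], dim=-1)
        return torch.sigmoid(self.net(pair)).squeeze(-1)

model = PairInputRanker(n_features=5)
p_i_over_j = model(torch.rand(1, 5), torch.rand(1, 5))  # probability that i ▷ j
# During training, this probability would be compared to the ground-truth label
# (1 if i should be ranked above j, 0 otherwise) with a binary cross-entropy loss.
```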

This method has two major disadvantages during inference:
- In order to rank n documents for a given query during inference, each pair of these documents needs to be processed by the model to get all pairwise probabilities. The total number of pairs is quadratic in n (exactly n * (n – 1) / 2), which is very inefficient.
- Even with pairwise probabilities for all documents, it is not obvious how to finally rank them, especially in paradoxical situations like vicious circles, where a triplet of documents (x, y, z) is ranked by the model so that x ▷ y, y ▷ z and z ▷ x.

Because of these downsides, pair-input models are rarely used in practice and single-input models are preferred over them.
Single-input models
The model accepts a single feature vector as an input. During training, each document in a pair is independently fed into the model to receive its own score. Then both scores are compared and the model is adjusted through gradient descent based on ground truth ranks.

During inference, each document receives a score by being passed to the model. The scores are then sorted to obtain the final ranking.
For those who are familiar with Siamese networks (FaceNet, SBERT, etc.), single-input models can be thought of as Siamese networks.
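A minimal sketch of a single-input model in PyTorch (hypothetical architecture): the same network scores each document of a pair during training, and at inference time all candidate documents are scored independently and sorted.

```python
import torch
import torch.nn as nn

# A single-input (Siamese-style) scorer: each document's feature vector is
# mapped independently to a single relevance score.
class Scorer(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

scorer = Scorer(n_features=5)

# Training: both documents of a pair go through the same network to get their scores.
s_i = scorer(torch.rand(1, 5))
s_j = scorer(torch.rand(1, 5))
# (A pairwise loss on s_i and s_j would be applied here; see the next section.)

# Inference: all candidate documents are scored independently and sorted.
with torch.no_grad():
    scores = scorer(torch.rand(10, 5))
ranking = torch.argsort(scores, descending=True)
```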
Pairwise loss functions
During each training iteration, the model predicts scores for a pair of documents. Therefore, the loss function should be pairwise and consider the scores of both documents.
In general, a pairwise loss takes as its argument z the difference between the two scores, s[i] – s[j]. Depending on the algorithm, the loss function can have one of the following forms:

Sometimes the score difference z is additionally multiplied by a constant σ.
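For illustration, here are a few common pairwise loss shapes expressed as functions of z (standard forms such as the logistic and hinge losses, not necessarily the exact set pictured above):

```python
import numpy as np

# Common pairwise loss shapes as functions of z = s_i - s_j,
# assuming document i should be ranked above document j.
def logistic_loss(z, sigma=1.0):
    # Smooth loss used by RankNet: log(1 + exp(-sigma * z)).
    return np.log1p(np.exp(-sigma * z))

def hinge_loss(z, margin=1.0):
    # Margin-based loss used by ranking SVMs: max(0, margin - z).
    return np.maximum(0.0, margin - z)

def zero_one_loss(z):
    # The "ideal" but non-differentiable loss: 1 whenever the pair is inverted.
    return (z < 0).astype(float)

z = np.linspace(-3, 3, 7)
print(logistic_loss(z), hinge_loss(z), zero_one_loss(z))
```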
RankNet is one of the most popular pairwise ranking algorithms. We are going to look through the details of its implementation in the next section.
RankNet
After obtaining scores for documents i and j, RankNet normalises them with the softmax function (equivalently, a sigmoid applied to the score difference). By doing so, RankNet obtains the probability P[i][j] = P(i ▷ j) that document i is ranked higher than document j. Inversely, the probability P[j][i] = P(j ▷ i) = 1 – P(i ▷ j). For simplicity, let us suppose that in reality i is ranked higher than j, so the target probabilities are P̃[i][j] = 1 and P̃[j][i] = 0. To update the model weights, RankNet uses a cross-entropy loss, which simplifies in the following way:

The weights of the model are adjusted through gradient descent. The next time the model sees the same pair of documents i and j, document i is likely to receive a higher score than before, while document j will probably be pushed down.
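A small sketch of the simplified RankNet loss under the assumption above (the ground truth says i ▷ j, so the target probability is 1), showing how its gradients push the two scores apart:

```python
import torch

sigma = 1.0  # the constant scaling the score difference

def ranknet_loss(s_i: torch.Tensor, s_j: torch.Tensor) -> torch.Tensor:
    """Simplified RankNet cross-entropy, assuming the ground truth says i ▷ j."""
    # P(i ▷ j) = sigmoid(sigma * (s_i - s_j)); with a target probability of 1,
    # the cross-entropy reduces to log(1 + exp(-sigma * (s_i - s_j))).
    return torch.log1p(torch.exp(-sigma * (s_i - s_j)))

# Toy scores for a pair that is currently ranked in the wrong order.
s_i = torch.tensor(0.2, requires_grad=True)
s_j = torch.tensor(0.5, requires_grad=True)

loss = ranknet_loss(s_i, s_j)
loss.backward()
print(s_i.grad, s_j.grad)  # negative and positive: gradient descent pushes s_i up and s_j down
```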
RankNet factorisation
For simplicity, we are not going to dive deeply into the mathematics, but the original paper presents an interesting result: the authors found a way to simplify the training process by introducing the variable S[i][j], which takes one of three possible values:

After some mathematical manipulation, the derivative of the cross-entropy loss can be factorised as:

The lambda value in the formula can be calculated relatively quickly for all pairs of documents. By taking positive or negative values, these lambdas act as forces pushing documents up or down in the ranking.
We can sum up all the pairwise lambdas for a single document i. This sum results in the total force applied to the document i in the ranking.

Working with lambdas directly results in faster training and a better interpretation of results.
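A toy sketch of how such lambdas could be computed and accumulated per document, following the factorised derivative (the exact signs depend on the convention used in the paper):

```python
import numpy as np

sigma = 1.0

def pairwise_lambda(s_i: float, s_j: float, S_ij: int) -> float:
    """Lambda for the pair (i, j) from the factorised RankNet derivative.

    S_ij is +1 if i should be ranked above j, -1 if below, 0 if they are equally relevant.
    """
    return sigma * (0.5 * (1 - S_ij) - 1.0 / (1.0 + np.exp(sigma * (s_i - s_j))))

# Toy example: 3 documents whose ground-truth order is doc0 ▷ doc1 ▷ doc2.
scores = np.array([0.1, 0.4, 0.3])
lambdas = np.zeros(3)

for i in range(3):
    for j in range(3):
        if i == j:
            continue
        S_ij = 1 if i < j else -1  # ground-truth order matches the indices here
        lambdas[i] += pairwise_lambda(scores[i], scores[j], S_ij)

# Under a gradient-descent update, a negative total lambda pushes a document up
# and a positive one pushes it down.
print(lambdas)
```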
Though pairwise algorithms perform better than pointwise approaches, they have two downsides.
Not interpretable probabilities
The output probabilities of the model only show how confident the model is that object i should be ranked higher than object j. However, these probabilities are not calibrated and can be rough approximations, so it is not a good idea to rely on them for interpretation, especially in confusing cases with the vicious circles we saw earlier.
Minimisation of inversions is not optimal
This issue is much more critical than the previous one. The fundamental problem with most pairwise algorithms, and with RankNet in particular, is that they minimise the number of rank inversions. Though it might appear natural to optimise the number of inversions, this is not what most end users really want. Consider the following example with two rankings. Which ranking is better, in your opinion?


Though the second ranking has fewer inversions, which is what the algorithm prioritises, a normal user would still prefer the first ranking because there is at least one relevant result at the top. This means that the user does not have to scroll through many documents to find the first relevant result. For the same reason, it is better to optimise user-oriented metrics such as nDCG or ERR, which put more emphasis on the top results, rather than the number of inversions.
As a consequence, not all document pairs are equally important. The algorithm needs to be adjusted so that it puts much more importance on getting the ranking right at the top than at the bottom.
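A small numeric sketch with made-up relevance labels showing why nDCG emphasises the top of the list: swapping two adjacent documents near the top costs far more nDCG than the same swap near the bottom.

```python
import numpy as np

def dcg(relevances: np.ndarray) -> float:
    # Standard DCG: exponential gain discounted by log2 of the (1-indexed) position.
    positions = np.arange(1, len(relevances) + 1)
    return float(np.sum((2.0 ** relevances - 1) / np.log2(positions + 1)))

ideal = np.array([3, 2, 1, 1, 0])        # relevance labels in the ideal order
top_swap = np.array([2, 3, 1, 1, 0])     # adjacent swap at the top of the list
bottom_swap = np.array([3, 2, 1, 0, 1])  # adjacent swap at the bottom of the list

print(dcg(top_swap) / dcg(ideal))     # ≈ 0.85: a large penalty
print(dcg(bottom_swap) / dcg(ideal))  # ≈ 1.00: almost no penalty
```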
The authors of the paper present a well-illustrated example of how optimising a ranking with RankNet can lead to suboptimal results:

We can see that the document in the 1st position was pushed down to the 4th and the document in the 15th position was pushed up to the 10th, so the total number of inversions decreased by 2. Nevertheless, from the user's perspective, the new ranking became worse. The core issue is that RankNet assigns larger gradients to documents at worse positions. However, to optimise user-oriented metrics this should work the other way round: documents at better positions should be pushed further up than those at worse positions. This way, user-oriented metrics like nDCG will be higher.
Listwise ranking
Listwise algorithms optimise ranking metrics explicitly. To optimise a certain metric with gradient descent, a derivative needs to be calculated for that metric. Unfortunately, most ranking metrics, like nDCG or precision, are non-continuous and non-differentiable, so more advanced techniques have been invented.
Unlike pointwise or pairwise ranking, listwise methods take a whole list of documents as input at once. This sometimes leads to heavy computations but also gives more robustness, since the algorithm is provided with more information at each iteration.

LambdaRank
Depending on the implementation, LambdaRank can be considered a pairwise or a listwise method.
When focusing on nDCG, it seems optimal to assign larger gradients to pairs of documents whose position swap results in a larger change in nDCG. This is the core idea of LambdaRank.
The "lambda" in the name of the algorithm hints that LambdaRank also uses lambdas described in RankNet for optimisation.
The researchers obtained an amazing result: if the loss value in RankNet is multiplied by |ΔnDCG| (the absolute change in nDCG caused by swapping documents i and j), the algorithm tends to directly optimise nDCG! That said, the LambdaRank algorithm is very similar to RankNet, except that this time the lambdas are multiplied by the nDCG change:

What is also remarkable about the research is that this trick works not only for nDCG but for other information retrieval metrics as well. For example, if the lambdas are multiplied by the change in precision, then precision will be optimised.
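A toy sketch of the LambdaRank idea (hypothetical helper functions, not a full implementation): the RankNet lambda of a pair is scaled by the |ΔnDCG| obtained by swapping the two documents' current positions.

```python
import numpy as np

sigma = 1.0

def delta_ndcg(relevances, ranks, i, j, ideal_dcg):
    """Absolute nDCG change obtained by swapping the current positions of documents i and j."""
    gain_i, gain_j = 2.0 ** relevances[i] - 1, 2.0 ** relevances[j] - 1
    disc_i, disc_j = 1.0 / np.log2(ranks[i] + 1), 1.0 / np.log2(ranks[j] + 1)
    return abs((gain_i - gain_j) * (disc_i - disc_j)) / ideal_dcg

def lambdarank_lambda(s_i, s_j, dndcg):
    # RankNet lambda for a pair where i should be ranked above j, scaled by |ΔnDCG|.
    return -sigma / (1.0 + np.exp(sigma * (s_i - s_j))) * dndcg

# Toy example: 3 documents with graded relevance and current model scores.
relevances = np.array([3.0, 0.0, 1.0])
scores = np.array([0.2, 0.5, 0.4])
ranks = np.argsort(np.argsort(-scores)) + 1  # current 1-indexed positions
ideal_dcg = np.sum((2.0 ** np.sort(relevances)[::-1] - 1) / np.log2(np.arange(2, len(scores) + 2)))

dndcg = delta_ndcg(relevances, ranks, 0, 1, ideal_dcg)
print(lambdarank_lambda(scores[0], scores[1], dndcg))  # a strong push: doc 0 should move up
```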
Recently, it was theoretically proven that LambdaRank optimises a lower bound on certain information retrieval metrics.
Other Methods
We will not discuss in detail how other listwise methods work, but we will outline the main idea behind two other notable algorithms.
LambdaMART is a famous implementation of the listwise approach that uses gradient-boosted trees with a loss function derived from LambdaRank. In practice, it performs better than LambdaRank.
SoftRank addresses the problem of the non-existent derivative of nDCG. It introduces a new metric called "SoftNDCG" which smoothly approximates nDCG, making it possible to compute a derivative and update the model's weights through gradient descent. This approach can be equally applied to other metrics.
Conclusion
We have covered ranking, an important machine learning task for sorting a set of objects in order of relevance. Pointwise and pairwise approaches are not used that often, while listwise methods are the most robust. Obviously, we have discussed only a small part of ranking algorithms, but this information is essential for understanding more complex techniques like ListNet or ListMLE. An extensive list of listwise algorithms can be found here.
It is worth noting that LambdaRank is currently one of the state-of-the-art ranking algorithms, offering a lot of flexibility for optimising a particular metric.
If you would like to find more information about ranking metrics, I highly recommend you go through my other article on this topic.
Resources
- From RankNet to LambdaRank to LambdaMART: An Overview
- Learning to Rank using Gradient Descent
- Information Retrieval | Learning to Rank | Harrie Oosterhuis
- Learning to Rank | Wikipedia
- SoftRank: Optimising Non-Smooth Rank Metrics
All images unless otherwise noted are by the author