RBO vs. Kendall Tau for comparing ranked lists of items

Preetam Joshi
Towards Data Science
8 min read · Jan 11, 2021


While working on ranking problems (feed ranking, search ranking, etc.), I have often encountered situations where I needed to compare ranked lists generated by separate systems, for example Production vs. an A/B test. The need to compare two or more ranked lists is more common than you might think. Let's consider a simple example: you and a couple of friends each decide to list your 5 favorite episodes of the popular television series "FRIENDS". Everyone writes down their 5 favorite episode titles, with the first item signifying their most favorite and the last item their least favorite episode. Each of you may now wonder how similar or different your list is to those of the two other friends in the room.

Fig. 1 Episodes ranked by 3 different people

How do we go about comparing these lists? Formally, these ranked lists are called incomplete rankings because they do not rank all of the episodes of "FRIENDS" and instead only pick the top 5. We will discuss a few popular approaches to this particular problem.

Method 1: Kendall Tau

The Kendall Tau metric, also known as Kendall's correlation, is a common method for checking whether two ranked lists are in agreement.

Fig. 2 Side by side comparison of episodes ranked by Person 1 and Person 3

Kendall's correlation (𝛕) can be computed by first counting the number of concordant pairs (C) and the number of discordant pairs (D). A pair of items is concordant if the two items appear in the same relative order in both ranking lists, and discordant otherwise.

M = C - D

𝛕 = M / (C + D)

The maximum number of distinct pairs between the two conjoint ranking lists in our example (Fig. 2) is 5 choose 2 = ½ * (5 * (5 - 1)) = 10. If all pairs are concordant then C = 10 and D = 0, so max(𝛕) = (10 - 0) / (10 + 0) = 1. If all pairs are discordant then C = 0 and D = 10, so min(𝛕) = (0 - 10) / (0 + 10) = -1.

Let’s calculate the actual value of 𝛕

Concordant Pairs:

(S6E17, S2E10), (S6E17, S3E24), (S6E17, S3E25), (S6E17, S3E15), (S2E10, S3E24), (S2E10, S3E25), (S2E10, S3E15), (S3E24, S3E25)

Discordant Pairs:

(S3E24, S3E15), (S3E25, S3E15)

𝛕 = (8 - 2) / (8 + 2) = 0.6
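
To make the counting concrete, here is a short illustrative brute-force check in Python. The two lists below are assumed orderings consistent with the concordant and discordant pairs listed above; any pair of lists with the same eight concordant and two discordant pairs yields the same 𝛕.

```python
from itertools import combinations

# Assumed rankings consistent with the pairs above (most favorite first).
person_1 = ["S6E17", "S2E10", "S3E24", "S3E25", "S3E15"]
person_3 = ["S6E17", "S2E10", "S3E15", "S3E24", "S3E25"]

concordant = discordant = 0
for a, b in combinations(person_1, 2):
    # A pair is concordant if both lists rank a and b in the same relative order.
    same_order = (person_1.index(a) - person_1.index(b)) * (person_3.index(a) - person_3.index(b)) > 0
    if same_order:
        concordant += 1
    else:
        discordant += 1

tau = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, tau)  # 8 2 0.6
```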

Interpretation of Kendall Tau

Notice that the maximum and minimum values that 𝛕 can take are +1 and -1, respectively. +1 denotes perfect agreement and -1 denotes complete disagreement. A value of 0 means that there is no correlation between the rankings. The value of 0.6 that we computed for Fig. 2 shows that the rankings have a moderate positive correlation.

Another way to interpret 𝛕 is through the number of discordant pairs D, which equals the minimum number of adjacent pairwise swaps needed to rearrange one ranked list so that it matches the other. Does this remind you of a popular sorting algorithm? Bubble sort, of course!
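
As a quick illustrative check of the bubble-sort intuition, the sketch below (reusing the same assumed rankings and a hypothetical helper name) counts the adjacent swaps needed to reorder one list into the other; the count matches the 2 discordant pairs found earlier.

```python
def adjacent_swaps_to_match(source, target):
    """Count adjacent swaps (bubble-sort style) needed to reorder source into target."""
    # Rewrite source as the positions its items occupy in target, then bubble sort.
    positions = [target.index(item) for item in source]
    swaps = 0
    for i in range(len(positions)):
        for j in range(len(positions) - 1 - i):
            if positions[j] > positions[j + 1]:
                positions[j], positions[j + 1] = positions[j + 1], positions[j]
                swaps += 1
    return swaps

# Assumed rankings from the snippet above.
person_1 = ["S6E17", "S2E10", "S3E24", "S3E25", "S3E15"]
person_3 = ["S6E17", "S2E10", "S3E15", "S3E24", "S3E25"]
print(adjacent_swaps_to_match(person_3, person_1))  # 2, the number of discordant pairs
```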

You can also test statistical significance of the Kendall Tau metric [8].

Implementation

A modified Kendall Tau can be applied to rankings that contain ties (take, for instance, a person who loved an episode so much that they listed it in all 3 of the top 3 slots of their list). Ties in rankings are not uncommon. I recently implemented a custom O(N log N) Kendall Tau algorithm that handles ties at Netflix. Open source implementations that handle ties also exist; for example, SciPy's scipy.stats.kendalltau computes the tau-b variant, which accounts for ties.
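
As a minimal usage sketch, reusing the illustrative rankings from above (converting each item list to a rank vector over a shared item order is one common way to feed lists of items into kendalltau):

```python
from scipy.stats import kendalltau

# Illustrative conjoint rankings (same 5 items in both lists, most favorite first).
person_1 = ["S6E17", "S2E10", "S3E24", "S3E25", "S3E15"]
person_3 = ["S6E17", "S2E10", "S3E15", "S3E24", "S3E25"]

# Convert each list into a vector of ranks over a shared item order.
items = sorted(person_1)
ranks_1 = [person_1.index(item) for item in items]
ranks_3 = [person_3.index(item) for item in items]

tau, p_value = kendalltau(ranks_1, ranks_3)  # tau-b; equals tau-a when there are no ties
print(round(tau, 2))  # 0.6
```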

Usages

Kendall Tau is a popular method used in information retrieval tasks such as search ranking and recommendation systems. Companies like Yahoo [1] and Microsoft [3] have used this method for various use cases.

Drawbacks

Although Kendall Tau is a popular rank correlation metric, it does have drawbacks:

  1. It requires the two ranking lists to be conjoint (same elements in both lists)
  2. It is unweighted i.e., it places as much emphasis on the disagreement at the bottom of the list as much as the top (in popular search engines, the results in the “head” matter a lot more than the results in the “tail”)
  3. The contribution of a single discordant pair decreases with the increase in the depth of the ranking i.e., 𝛕 values are intrinsically linked to the depth of the ranking list.

To demonstrate these drawbacks, consider the ranking lists of Person 1 vs. Person 2 from Fig. 1. Notice that only two episodes are common to both lists: S2E10 and S3E24. Kendall's correlation cannot be applied to these lists because they are non-conjoint (they do not contain the exact same items). One potential way to apply Kendall's correlation to these two lists is to replace the items that are not present in both lists with a special character '#'. Then, Person 1's ranking becomes "# S2E10 # S3E24 #" and Person 2's ranking becomes "# S2E10 # # S3E24". Calculating 𝛕 for these transformed lists results in a value of 1.0 even though S3E24 is at position 3 in Person 1's list and at position 4 in Person 2's list (zero-based index). This value of 1.0 denotes perfect agreement between the two lists, which is obviously not true here.

Method 2: Rank Biased Overlap (RBO)

Unlike the Kendall Tau distance measure, RBO is a similarity measure. RBO addresses the three drawbacks observed with Kendall Tau by assigning a weight to each rank position, with the weights derived from a convergent series. The equation below describes the RBO of two infinite ranked lists S and T:

RBO(S, T, p) = (1 - p) * SUM_{d=1 to ∞} [ p^(d-1) * A_d ]

where A_d = |S[1:d] ∩ T[1:d]| / d is the agreement at depth d, i.e., the fraction of the top d items that the two lists have in common.

The parameter p is a tunable parameter in the range (0, 1) that determines how much the top d ranks contribute to the final value of the RBO similarity measure. Note that the summation term SUM_{d=1 to ∞} p^(d-1) is convergent since it is a geometric series: its sum is 1 / (1 - p), which is finite because 0 < p < 1.

To get a single RBO score, we can extrapolate it from the available information, assuming that the agreement seen up to a depth k continues indefinitely between the two lists. The extrapolated single RBO score can be calculated using:

RBO_EXT(S, T, p, k) = (X_k / k) * p^k + ((1 - p) / p) * SUM_{d=1 to k} [ (X_d / d) * p^d ]

where X_d = |S[1:d] ∩ T[1:d]| is the number of items the two lists share down to depth d (so A_d = X_d / d) and k is the evaluation depth.

Here is a simple (not thoroughly tested) implementation of the extrapolated single score RBO measure in Python 3:
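
The sketch below assumes both lists have the same length, contain no duplicates, and are ordered from most to least preferred (the function name rbo_ext is just a label used here):

```python
def rbo_ext(list_s, list_t, p=0.9):
    """Extrapolated rank-biased overlap (RBO_EXT) of two equal-length ranked lists."""
    assert len(list_s) == len(list_t), "this sketch assumes equal-length lists"
    k = len(list_s)
    # X_d: number of items shared by the top-d prefixes of the two lists, for d = 1..k.
    x = [len(set(list_s[:d]) & set(list_t[:d])) for d in range(1, k + 1)]
    summation = sum((x[d - 1] / d) * p ** d for d in range(1, k + 1))
    return (x[k - 1] / k) * p ** k + ((1 - p) / p) * summation
```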

How to choose the value of ‘p’?

In the RBO measure, the choice of p determines the degree of top-weightedness that the resulting RBO measure exhibits. From the RBO paper [2], the following equation gives the overall weight that the top d ranks contribute to the RBO calculation:

WRBO[1:d] = 1 - p^(d-1) + ((1 - p) / p) * d * ( ln(1 / (1 - p)) - SUM_{i=1 to d-1} (p^i / i) )

Here is a simple (not thoroughly tested) implementation of the above:
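
A minimal sketch of the weight calculation (the helper name weight_top_d is just a label used here):

```python
import math

def weight_top_d(p, d):
    """Weight WRBO[1:d] that the top d ranks contribute to the RBO measure."""
    tail = sum((p ** i) / i for i in range(1, d))
    return 1 - p ** (d - 1) + ((1 - p) / p) * d * (math.log(1 / (1 - p)) - tail)

print(round(weight_top_d(0.9, 10), 4))  # 0.8556
```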

For p = 0.9 and d = 10 (the top 10 ranks being examined), WRBO[1:d] is 0.8556, i.e., the top 10 ranks contribute 85.56% to the final RBO measure. Hence, depending on how much contribution you would like from the top d results, you can choose the value of p accordingly using the WRBO[1:d] equation above.

Interpreting RBO

RBO takes values in the range [0, 1], where 0 means the two ranked lists are completely disjoint and 1 means they are identical.

Let's go back to the example where Kendall Tau failed to recognize the clearly dissimilar ranked lists generated by Person 1 and Person 2:

Person 1 ranking was “# S2E10 # S3E24 #”

Person 2 ranking was “# S2E10 # # S3E24”

I am going to choose p = 0.6 since I want the top 3 positions to carry the most weight. With p = 0.6, the top 3 positions carry 91.26% of the weight in the final RBO measure. Plugging in the rankings of Person 1 and Person 2, we get RBO(Person1, Person2, p, k) = 0.24144, i.e., the RBO value indicates that the two lists are mostly different, with some light similarity in the top 3 positions. Increasing the value of p to 0.9 results in a higher RBO value of 0.352655, but the value still shows that the two lists are different.
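
Using the rbo_ext sketch from earlier, with hypothetical placeholder names standing in for the episodes that appear in only one of the two lists, the calculation can be reproduced like this:

```python
# Placeholders (P1_only_*, P2_only_*) stand in for episodes unique to each list.
person_1 = ["P1_only_1", "S2E10", "P1_only_2", "S3E24", "P1_only_3"]
person_2 = ["P2_only_1", "S2E10", "P2_only_2", "P2_only_3", "S3E24"]

print(round(rbo_ext(person_1, person_2, p=0.6), 5))  # 0.24144
```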

In conclusion, RBO is a more robust measure for judging the similarity of indefinite ranking lists that are not necessarily conjoint, and it can be tuned for top-weightedness. For further reading on RBO and the theory behind it, refer to the RBO paper [2].

Further reading

We went through a couple of methods on how to compare two different sets of rankings. Here are some other methods for further reading:

  • Pearson’s correlation [5]
  • Kolmogorov-Smirnov test [6]
  • Comparing Top-k Lists [7]

Some food for thought

Finally, I want to leave you with something to ponder on:

Imagine a case where comparing any two items in a list is not so simple. Say you rank your top 5 favorite events from the Football (soccer) World Cup by taking a picture of each event. For example, Fig. 3 shows two images from the same event, when Mario Götze scored the decisive extra-time goal, but they are different images taken a few seconds apart. Being able to tell that two different images capture the same event is important if we want to compute RBO or Kendall Tau on ranked lists of images. How would we compare two images and tell that they are from the same event?

Fig. 3 Two images of the same event, Germany vs. Argentina, World Cup 2014 [4]

References

[1] Kumar, Ravi, and Sergei Vassilvitskii. "Generalized distances between rankings." Proceedings of the 19th International Conference on World Wide Web. 2010.

[2] William Webber, Alistair Moffat, and Justin Zobel. 2010. “A similarity measure for indefinite rankings”. ACM Trans. Inf. Syst. 28, 4, Article 20 (November 2010), 38 pages. DOI:https://doi.org/10.1145/1852102.1852106

[3] Bachrach, Yoram, Ralf Herbrich, and Ely Porat. “Sketching algorithms for approximating rank correlations in collaborative filtering systems.” International Symposium on String Processing and Information Retrieval. Springer, Berlin, Heidelberg, 2009.

[4] Football World cup images attribution:

[left image] Danilo Borges/Portal da Copa copa2014.gov.br Licença Creative Commons Atribuição 3.0, CC BY 3.0 <https://creativecommons.org/licenses/by/3.0>, via Wikimedia Commons;

[right image] Danilo Borges/copa2014.gov.br Licença Creative Commons Atribuição 3.0 Brasil, CC BY 3.0 BR <https://creativecommons.org/licenses/by/3.0/br/deed.en>, via Wikimedia Commons

[5] Benesty, Jacob, et al. “Pearson correlation coefficient.” Noise reduction in speech processing. Springer, Berlin, Heidelberg, 2009. 1–4.

[6] Massey Jr, Frank J. "The Kolmogorov-Smirnov test for goodness of fit." Journal of the American Statistical Association 46.253 (1951): 68–78.

[7] Fagin, Ronald, Ravi Kumar, and Dakshinamurthi Sivakumar. "Comparing top k lists." SIAM Journal on Discrete Mathematics 17.1 (2003): 134–160.

[8] Kendall Tau Significance test https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient#Significance_tests


Senior Software Engineer at Netflix (Personalization Infrastructure). Past: Data Science at Thumbtack, Senior Software Engineer at Yahoo!