Ranking The Best UFC Fighters Using PageRank and Neo4j

Steven Wang
Towards Data Science
10 min read · Sep 15, 2019

MMA fans and pundits love to debate which fighters should be considered the all-time greats. Often, however, these rankings are a subjective mix of fighter records, individual biases, analyses from Reddit keyboard-warriors, and gut feelings. Is there a more data-driven way to assemble fighter rankings?

In this post, I will show how the PageRank algorithm, originally developed by Google to rank webpages, and Neo4j can be used to rank the all-time best fighters in each UFC weight division.

Getting the data

The UFC publishes fight results for all its events dating back to 1993 on http://ufcstats.com. To obtain this data, I wrote a Python scraper that crawls the site and saves the results. For this post, I pulled data on all fights up until 2019-09-14 (UFC Fight Night 158). More in-depth descriptions and usage examples of the scraper can be found on my GitHub.

The scraper retrieves all historical fight results and updates already-saved data with fresh results when new UFC events occur. Though not used in this analysis, data on in-fight statistics such as strikes landed and fighter attributes like reach, height, and date of birth can also be pulled using the scraper.

Below is a subset of the scraped data that will be used in this post:

As shown above, the four most relevant fields for this analysis are fighter, opponent, result, and division.

Designing the graph data model

Before loading the data into Neo4j, let’s think about how we want the data to be represented in a graph:

  • Each fighter will be assigned to a node.
  • Edges will be directed so that if fighter A lost to fighter B, there will be an edge going from fighter A to fighter B.
  • Divisions and dates of the fights will be set as properties of the edges. If fighter A fought fighter B in the Lightweight division in 2019, then those division and date properties would reside on the edge connecting those fighters.
  • If fighters faced each other multiple times, there will be separate edges for each time they fought.

A visual representation of our graph data model looks like this:

In plain English, the graph above shows that Fighter A:

  • lost a Lightweight bout to Fighter B in 2015
  • lost a Lightweight bout to Fighter C in 2016
  • won a rematch with Fighter C at Welterweight in 2017
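To make the data model concrete, here is a minimal Python sketch of the same three bouts as a list of directed edges. The `Fight` tuple and the `wins_in_division` helper are names invented for this illustration; they are not part of the scraper or of Neo4j.

```python
from collections import namedtuple

# One directed edge per fight, pointing from the loser to the winner.
# Division and year live on the edge, not on the fighter.
Fight = namedtuple("Fight", ["loser", "winner", "division", "year"])

# The three bouts described above (fighter names are placeholders).
fights = [
    Fight("Fighter A", "Fighter B", "Lightweight", 2015),
    Fight("Fighter A", "Fighter C", "Lightweight", 2016),
    Fight("Fighter C", "Fighter A", "Welterweight", 2017),  # rematch: its own edge
]


def wins_in_division(fights, fighter, division):
    """Return the incoming edges (wins) for one fighter in one division."""
    return [f for f in fights if f.winner == fighter and f.division == division]


lightweight_wins_c = wins_in_division(fights, "Fighter C", "Lightweight")
welterweight_wins_a = wins_in_division(fights, "Fighter A", "Welterweight")
```

Because the division sits on the edge, the same pair of fighters can be connected in more than one division without duplicating nodes.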

Loading data into Neo4j

With our data model sketched out, we can now begin loading the data into Neo4j. To do this, we will use Py2neo, a library that lets Python talk to Neo4j. It makes it easier to push data from Python into Neo4j and to pull Cypher query results back into Python.

The gist below shows the code used to load the fight results data from a pandas dataframe into a local Neo4j instance running under default settings:

In the code above, we only look at rows with “W” in the result column of the fights dataframe, since rows with losses are just mirror images of rows with wins (fighter and opponent switched). For every fight, we use Cypher’s MERGE clause to create nodes for both the fighter and their opponent if none exist yet, or to match them to existing nodes. Using MERGE again, we create the directed edge from the loser to the winner of the fight. Attributes such as fighter names, division, date, and method are passed in as parameters and assigned to properties of either nodes or edges of the graph.
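For readers who cannot see the gist, the loading step might look roughly like the sketch below. The `Fighter` label, the `LOST_TO` relationship type, the property names, and the `load_fights` helper are assumptions made for this illustration and may not match the original code. The function takes any callable that accepts a Cypher string and a parameter dict, so it can be tried without a running database.

```python
# Cypher statement: MERGE both fighter nodes, then MERGE one edge per fight.
# Label, relationship type and property names are assumptions for this sketch.
LOAD_FIGHT = """
MERGE (w:Fighter {name: $winner})
MERGE (l:Fighter {name: $loser})
MERGE (l)-[:LOST_TO {division: $division, date: $date, method: $method}]->(w)
"""


def load_fights(run, rows):
    """Send one parameterised MERGE statement per winning row.

    run  -- callable taking (cypher, parameters), e.g. a Py2neo graph's run method
    rows -- iterable of dicts with fighter, opponent, result, division, date, method
    """
    loaded = 0
    for row in rows:
        if row["result"] != "W":
            continue  # loss rows mirror win rows, so they are skipped
        run(LOAD_FIGHT, {
            "winner": row["fighter"],
            "loser": row["opponent"],
            "division": row["division"],
            "date": row["date"],
            "method": row["method"],
        })
        loaded += 1
    return loaded


# Dry run with a stand-in for the database call and placeholder rows.
recorded = []
sample_rows = [
    {"fighter": "Fighter B", "opponent": "Fighter A", "result": "W",
     "division": "Lightweight", "date": "2015-01-01", "method": "DEC"},
    {"fighter": "Fighter A", "opponent": "Fighter B", "result": "L",
     "division": "Lightweight", "date": "2015-01-01", "method": "DEC"},
]
loaded_count = load_fights(lambda cypher, params: recorded.append(params), sample_rows)
```

Against a real database, `run` would be the `run` method of a Py2neo `Graph` object, and `rows` could come from the dataframe via `to_dict("records")`. Since the date is part of the MERGE pattern, rematches on different dates end up as separate edges.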

Once the data is loaded, we can look at the graph visualized in Neo4j’s browser. Below is the result of a query that shows all of Conor McGregor’s opponents and how they fared against Conor and each other:

A very brief explanation of PageRank

Originally developed by Larry Page and Sergey Brin, PageRank was used in Google’s search engine to rank the relevance of linked webpages. Using webpages as nodes and links as edges in a directed graph, the general idea of PageRank is as follows:

  • a page with many other pages linking to it will have higher relevance
  • having links from high relevance pages increases relevance
  • pages pass relevance to the neighbors they link to

Using the original webpage ranking use case as an example, a website like Twitter would have a high PageRank since many other sites link to it. Likewise, a site with links from a popular site like Twitter would also have a high relevance from a PageRank perspective. The image below demonstrates this idea:

Source: Wikipedia

Node B has the highest PageRank because it not only has many nodes linking to it but also has links from nodes with relatively high relevance. Though node C has only one link directed to it, it has a high PageRank because that one link comes from node B, the most important node in the network. Nodes A, D, and F have low PageRanks because they only have a few relatively unimportant nodes linking to them.

With an understanding of the above example, we can see how PageRank can also be translated to UFC fighters. Fighters with more wins (more edges incoming) and wins over highly ranked fighters (incoming edges from important nodes) will be more highly ranked with PageRank.
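As a rough illustration of the idea, here is a bare-bones PageRank written as a power iteration over loser-to-winner edges. It is a simplified sketch, not Neo4j’s implementation: the damping factor of 0.85, the even redistribution of score from undefeated fighters, and the normalization (scores sum to 1) are choices made for this example, and the three results are made up.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    edges -- list of (loser, winner) pairs; score flows from loser to winner.
    """
    nodes = sorted({name for edge in edges for name in edge})
    n = len(nodes)
    out_edges = {node: [] for node in nodes}
    for loser, winner in edges:
        out_edges[loser].append(winner)

    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Undefeated fighters have no outgoing edges; spread their score evenly.
        dangling = sum(scores[node] for node in nodes if not out_edges[node])
        new_scores = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each fighter splits their score equally among the fighters who beat them.
        for node in nodes:
            for winner in out_edges[node]:
                new_scores[winner] += damping * scores[node] / len(out_edges[node])
        scores = new_scores
    return scores


# Hypothetical results: B beat A, C beat A, C beat B.
scores = pagerank([("A", "B"), ("A", "C"), ("B", "C")])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here C ends up on top: C has the most wins, and one of them is over B, who is ranked above A.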

Though brief, this explanation of PageRank hopefully provides a general idea of how the algorithm works. For those interested in learning more, videos 5–10 of this video series provide a great explanation.

Applying PageRank to the UFC

Conveniently, Neo4j already provides an implementation of PageRank so we can easily apply the algorithm to our UFC graph. Using Py2neo again, the code below calculates fighter PageRank scores for each division:

Below we can see how PageRank ranks the top 10 all-time best fighters for each division for men and women:

Men’s top 10 PageRank scores by division
Women’s top 10 PageRank scores by division

Adjusting PageRank scores

Although the PageRank rankings seem mostly right, there are a few cases that fans may find questionable. For example, current Lightweight champion and undefeated fighter Khabib Nurmagomedov ranks 3rd in the all-time rankings despite holding a victory over number 2 ranked Michael Johnson. Johnson himself seems to be too highly ranked seeing that he has eight losses in the Lightweight division, all of which came at the hands of fighters ranked below him. Furthermore, ranking Donald Cerrone as the number 1 Lightweight of all time is questionable given that he has never won the belt and has also lost to many fighters ranked below him.

Similarly in the Heavyweight division, some might take issue with former champion Daniel Cormier being ranked four spots below contender Derrick Lewis despite Cormier having previously beaten both Lewis and number 1 ranked Stipe Miocic.

Where do these discrepancies come from? Recall that PageRank takes into account the number of incoming edges a node has when calculating the score. Thus, fighters with more wins overall (more incoming edges) will see a boost in their PageRank scores. As a result, it appears that the PageRank rankings are overweighting the number of wins and perhaps also slightly underweighting the number of losses.

The clearest example of this is in the Lightweight division, where current champion Khabib Nurmagomedov has a record of 12–0–0 (W-L-D) yet is only ranked 3rd, while number 1 ranked Donald Cerrone has a record of 17–6–0. Cerrone has more wins than Nurmagomedov largely due to having had more fights in the UFC Lightweight division (Nurmagomedov: 12, Cerrone: 23). Thus, despite being champion and despite beating highly ranked fighters, Nurmagomedov has a PageRank score that is somewhat deflated relative to Cerrone’s due to having had fewer fights overall.

Similarly, Michael Johnson’s 2nd overall ranking seems odd given his record of 9–8–0, which includes a loss to Nurmagomedov. To be fair, Johnson has victories over several highly ranked Lightweights, including Tony Ferguson (5th), Edson Barboza (6th), and Dustin Poirier (7th), which help boost his PageRank score. However, his barely .500 record at Lightweight calls his number 2 ranking into question and suggests that losses are not weighted as heavily as they should be.

Based on the observations above, we need to somehow incorporate fighters’ records into the rankings so that fighters with higher win percentages or fewer losses receive a boost to their PageRank scores.

Personalized PageRank

One potential way of doing this would be to use a variation of the PageRank algorithm known as Personalized PageRank. In very general terms, Personalized PageRank favorably biases certain nodes in the network, thereby increasing scores for those nodes. However, specifying which nodes should receive bias can be somewhat arbitrary, and results can change dramatically depending on which nodes are selected. For example, the tables below show Personalized PageRank rankings for the Lightweight division using two different criteria to select biased nodes:

Left: Personalized PageRank rankings with biased nodes of Khabib Nurmagomedov and Tony Ferguson. Right: Personalized PageRank rankings with biased nodes of Khabib Nurmagomedov, Tony Ferguson, and Gregor Gillespie

The table on the left uses the biased node selection criteria of ≥ 10 fights AND ≥ 90% win rate, while the table on the right uses ≥ 6 fights AND ≥ 90% win rate. As we can see, the second set of criteria added one extra biased node, which resulted in a pretty dramatic change: Gregor Gillespie, who had been ranked 60th in the left table, suddenly moves up to 3rd overall when included as a biased node. Given these results, selecting biased nodes for use in Personalized PageRank may not be the best idea since a) selection criteria can be arbitrary and b) results vary significantly depending on which nodes are selected.
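The sensitivity described above can be reproduced on a toy graph. The sketch below is a simplified Personalized PageRank in plain Python, not the Neo4j procedure: random restarts land only on the biased nodes, and score held by undefeated fighters is also returned to the biased nodes (one common convention, assumed here). The fighters and results are made up, and every biased node is assumed to appear in at least one edge.

```python
def personalized_pagerank(edges, biased, damping=0.85, iterations=50):
    """PageRank whose random restarts land only on the `biased` nodes.

    edges  -- list of (loser, winner) pairs
    biased -- set of node names that receive the restart probability
    """
    nodes = sorted({name for edge in edges for name in edge})
    out_edges = {node: [] for node in nodes}
    for loser, winner in edges:
        out_edges[loser].append(winner)

    # Restart distribution: uniform over the biased nodes, zero elsewhere.
    restart = {node: (1.0 / len(biased) if node in biased else 0.0) for node in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        dangling = sum(scores[node] for node in nodes if not out_edges[node])
        new_scores = {
            node: (1 - damping + damping * dangling) * restart[node] for node in nodes
        }
        for node in nodes:
            for winner in out_edges[node]:
                new_scores[winner] += damping * scores[node] / len(out_edges[node])
        scores = new_scores
    return scores


# Hypothetical results: C beat A and B, B beat A, and E beat D in a separate bout.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]
narrow = personalized_pagerank(edges, biased={"C"})
wide = personalized_pagerank(edges, biased={"C", "E"})
```

With only C biased, E scores nothing; adding E to the biased set lifts it to a share equal to C’s, even though E has a single win. That is the same kind of jump seen with Gillespie in the right-hand table.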

For those interested, here is the Neo4j Cypher query used to return Personalized PageRank scores:

Non-binary biased node selection

A potential workaround of the problem described above is to apply a weight to each node, perhaps the fighters’ win rate or 1/losses, and use those weights to calculate how much to bias each node. So instead of a binary selection of whether or not to bias a specific node, each node receives varying degrees of bias based on their win or loss rate.
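As a sketch of what such weights could look like, the snippet below turns win rates into a restart distribution that sums to 1. The `win_rate_weights` helper is hypothetical and only computes the weights; it assumes some PageRank implementation exists that accepts per-node restart weights. The records are the Lightweight records quoted earlier in this post.

```python
def win_rate_weights(records):
    """Turn (wins, losses) records into per-fighter weights that sum to 1."""
    rates = {
        name: wins / (wins + losses)
        for name, (wins, losses) in records.items()
        if wins + losses > 0  # fighters with no fights get no weight
    }
    total = sum(rates.values())
    if total == 0:
        raise ValueError("at least one fighter needs a win")
    return {name: rate / total for name, rate in rates.items()}


# Lightweight records quoted in this post, as (wins, losses).
records = {
    "Khabib Nurmagomedov": (12, 0),
    "Donald Cerrone": (17, 6),
    "Michael Johnson": (9, 8),
}
weights = win_rate_weights(records)
```

Every fighter receives some bias, but an undefeated fighter receives more of it than a fighter with a near-.500 record, so no hard cutoff has to be chosen.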

Unfortunately, Neo4j does not offer support for this functionality at this time, though it appears to be in the works according to this open issue on Neo4j’s GitHub. I may write a follow-up in the future when this is added.

Weighting PageRank by win rate

The easiest way to incorporate fighters’ records into the rankings would be to simply multiply each fighter’s PageRank by their win %. Though not perfect (losses would have a larger effect on fighters with fewer fights), this would help balance out the fact that some fighters have more wins largely due to having had more fights overall, while also increasing the impact of losses.
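In code, the adjustment is a one-line multiplication per fighter. In the sketch below, the records are the ones quoted in this post, but the PageRank scores are made-up placeholders rather than the values from the tables, and counting draws in the denominator of win % is an assumption.

```python
def adjust_by_win_rate(pagerank_scores, records):
    """Multiply each PageRank score by wins / (wins + losses + draws)."""
    adjusted = {}
    for name, score in pagerank_scores.items():
        wins, losses, draws = records[name]
        fights = wins + losses + draws
        adjusted[name] = score * wins / fights if fights else 0.0
    return adjusted


# Records quoted in this post, as (wins, losses, draws).
records = {
    "Khabib Nurmagomedov": (12, 0, 0),
    "Donald Cerrone": (17, 6, 0),
    "Michael Johnson": (9, 8, 0),
}
# Placeholder PageRank scores for illustration only (NOT the real values).
placeholder_scores = {
    "Donald Cerrone": 3.0,
    "Michael Johnson": 2.5,
    "Khabib Nurmagomedov": 2.0,
}

adjusted = adjust_by_win_rate(placeholder_scores, records)
ranking = sorted(adjusted, key=adjusted.get, reverse=True)
```

With these placeholder numbers, the undefeated fighter keeps his full score and overtakes the fighter with the 9–8 record, while the fighter with the largest raw score stays on top with a smaller lead.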

The revised rankings that result from using this method (PageRank * Win %) are shown in the tables below:

Men’s rankings: PageRank * win %
Women’s rankings: PageRank * win %

Though the evaluation is subjective, the revised rankings above feel more reasonable. In the Lightweight division, Khabib Nurmagomedov and Tony Ferguson move to 2nd and 3rd places, respectively, while Michael Johnson drops down to 4th. Though Donald Cerrone still holds the number 1 overall spot, the PageRank score gap between Cerrone and Nurmagomedov has narrowed significantly. In the Heavyweight division, former champion Daniel Cormier moves up from 8th to 6th overall, which is a more reasonable ranking, though still a little low in my opinion (he should probably be above number 5 ranked Derrick Lewis).

Conclusion

In this post, we have shown how PageRank can be used to rank UFC fighters and find the best in each division. Though not perfect (most rankings aren’t), PageRank seems to do a relatively good job at ranking fighters.

One potential further improvement would be to take fighters’ ages into account. Some great fighters continue to fight well past their primes and suffer losses that reduce their PageRank scores and increase the scores of those who beat them. Good examples of this are Middleweight great Anderson Silva, who has gone 1–4 in his last five fights, and Lightweight legend BJ Penn, who has gone 0–5 in his last five. Beating a great fighter in their twilight years should not count the same as beating them in their prime. If we want to find the best pound-for-pound fighters in their primes, incorporating age seems necessary.

With that said, perhaps knowing when to hang them up is also what makes a great fighter great.

For those who would like to explore the data on their own, all code and data can be found on my GitHub.

For the UFC/MMA fans out there, how do these rankings stack up to your own?
