Predicting Premier League standings: putting that math to some use

Published in Towards Data Science · 7 min read · Jan 30, 2019

I recently came across an interesting problem, courtesy of the final challenge of an online crypt hunt based on mathematics and other fun themes (math and fun usually don’t go together :P ), and the solution I put together in a short span of time was not half bad. In this article, I describe the approach I took and discuss some possible improvements.

The task, in simple words, was to:

Predict the final rank of Manchester United F.C. on the points table for the current season (2018-19) of the English Premier League

Some background

The Premier League is the top level of the English football league system and is the most-watched sports league in the world with a potential TV audience of 4.7 billion people (yeah, it’s that popular!). Twenty clubs compete over the course of a season, with each club playing every other club twice (a double round-robin system), once at its home stadium and once away at its opponent’s, for a total of 38 games per team.

Manchester United, the football club based in Old Trafford, is the most successful club with a record 20 league titles and is one of the most widely supported football clubs in the world. Of late, however, the club has been suffering from a mixture of management and investment problems, and it has been (almost) six years since they last lifted the prestigious title.

I am a casual fan when it comes to football, but the idea of building a mathematical model that can be applied to a real-world problem seemed exciting enough to have a try at it. (Let’s kick off then, shall we? ⚽️)

Breaking down the problem

The rankings in the league table are primarily determined by the points tally for each team, with ties broken by goal difference. To predict the final standing of Man United, it was therefore necessary to estimate the total points for all teams. The problem then reduces to predicting the outcome of every fixture.

For the sake of simplicity, it can be assumed that match results are independent of each other, that is, the outcome of any match X is independent of any other match Y.

A match between two teams can end in 3 possible outcomes: Home team win (H), Away team win (A) or a Tie (T). Teams receive 3 points for a win, 1 point for a draw and none for a loss.

We could pick results at random, but that would be a poor model of the real world. In practice, a top-tier team has a better chance of winning against a weaker, lower-tier team. Hence some parameters are needed to measure a team’s strength.

Obtaining data and choosing parameters

What better way to measure a team’s performance than looking at its past data? I used the datasets available at http://www.football-data.co.uk/data, which consist of all the match results since the formation of the Premier League in ’92, compiled neatly into CSV files. The next step was to determine the dominant factors that correlate well with team strength. I have to mention this excellent blog post that provided some great insights into the variables, along with pretty neat visualizations. The following points could be concluded:

Factors like the number of corners, fouls, red and yellow cards have a weak relationship with the points tally and hence, the team strength.
The most significant factor with the highest positive correlation is the goal difference, which basically translates to the balance between a team’s attacking and defensive strength.
Interestingly, the number of shots comes out to be inversely correlated! That means the more shots a team takes, the fewer points it is likely to achieve 😮. While this seems to defy logic at first, one possible explanation is that every shot that does not convert to a goal hands possession back to the opponent and gives them the upper hand, hence the negative correlation.

Sticking to a simple model, I decided then to use the full-time goal count for the home and away team as parameters.
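As a rough sketch of this step, the snippet below reads match rows and keeps only the teams and the two goal counts. It assumes the football-data.co.uk column names HomeTeam, AwayTeam, FTHG (full-time home goals) and FTAG (full-time away goals). The sample rows are made up for illustration, and the helper name load_matches is hypothetical.

```python
import csv
import io

# Illustrative rows only; real files come from football-data.co.uk.
SAMPLE_CSV = """HomeTeam,AwayTeam,FTHG,FTAG
Team A,Team B,2,1
Team B,Team A,0,0
"""

def load_matches(csv_text):
    """Return a list of (home, away, home_goals, away_goals) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["HomeTeam"], row["AwayTeam"], int(row["FTHG"]), int(row["FTAG"]))
        for row in reader
    ]

matches = load_matches(SAMPLE_CSV)
```

For a downloaded season file, the same function could be fed the file’s contents instead of the inline sample.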

Show me the math

Effectively speaking, the outcome of a match is determined by the number of goals scored on either side. Hence, we need to model the probability distribution of the goals scored. One of the most common ways to do so is with the Poisson distribution. (Source)

The Poisson distribution measures the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant rate and independently of the time since the last event.

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

Poisson distribution for x occurrences of the event, *where λ is the average rate and e is Euler’s number*

To understand why this model fits our case, we can consider a goal scored to be an event. Then within the span of 90 minutes of play, each such event can occur any number of times independently.
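To make this concrete, here is a minimal sketch of the Poisson probability in plain Python. The function name poisson_pmf and the rate of 1.5 goals per match are illustrative assumptions, not values taken from the data.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

# Probabilities of scoring 0..4 goals for a team averaging 1.5 goals a match
# (1.5 is an arbitrary illustrative rate, not a value from the data).
goal_probs = [poisson_pmf(k, 1.5) for k in range(5)]
```

Summed over all possible goal counts, these probabilities add up to 1, which is a quick sanity check on any implementation.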

To give an example, let’s try predicting the probability that a match between Arsenal and Leicester City ends with the scoreline 2–1.
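Under the Poisson model, and assuming the two teams’ goal counts are independent of each other, the scoreline probability factors into a product. Writing λ_A for Arsenal’s scoring rate and λ_L for Leicester’s (both symbols are introduced here only for illustration), a sketch of the calculation is:

$$P(\text{Arsenal 2, Leicester 1}) = \frac{\lambda_A^{2}\, e^{-\lambda_A}}{2!} \cdot \frac{\lambda_L^{1}\, e^{-\lambda_L}}{1!}$$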

What remains then is to figure out the constant rate (λ):

It can be intuitively seen that this parameter reflects the performance of a team, the better team having a higher rate of scoring goals on average. This rate also depends on both the attacking strength of the team and the defensive strength of the opponent. Lastly, we have to account for the home advantage, that is, take into consideration that a team generally plays better at its home ground.

Based on the discussion above, we can define the parameter λ as the average number of goals scored by a team at a particular venue, which can be computed using the past data.

Building the model

Let’s build some statistics then:

Using the above stats, we can now formulate the λ parameter as follows:
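One common way to combine such averages into λ scales the league-wide average by an attack strength and a defence strength. The sketch below follows that formulation, which is an assumption here and not necessarily the exact formula used in this model. The function name scoring_rates and the toy results are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def scoring_rates(matches, home, away):
    """Return (lam_home, lam_away) for a fixture, from past results.

    matches: list of (home_team, away_team, home_goals, away_goals).
    Assumes both teams appear in the data at the relevant venue.
    """
    scored_home, conceded_home = defaultdict(list), defaultdict(list)
    scored_away, conceded_away = defaultdict(list), defaultdict(list)
    for h, a, hg, ag in matches:
        scored_home[h].append(hg)
        conceded_home[h].append(ag)
        scored_away[a].append(ag)
        conceded_away[a].append(hg)

    league_home = mean(hg for _, _, hg, _ in matches)  # avg home goals per match
    league_away = mean(ag for _, _, _, ag in matches)  # avg away goals per match

    # Strengths are ratios to the league average (1.0 means average).
    home_attack = mean(scored_home[home]) / league_home
    away_defence = mean(conceded_away[away]) / league_home
    away_attack = mean(scored_away[away]) / league_away
    home_defence = mean(conceded_home[home]) / league_away

    lam_home = home_attack * away_defence * league_home
    lam_away = away_attack * home_defence * league_away
    return lam_home, lam_away

# Made-up results for three teams, purely for illustration.
toy = [("A", "B", 2, 1), ("B", "A", 1, 1), ("A", "C", 3, 0),
       ("C", "A", 0, 2), ("B", "C", 1, 0), ("C", "B", 2, 2)]
lam_home, lam_away = scoring_rates(toy, "A", "B")
```

Keeping home and away averages separate is what lets the home advantage show up in the rates.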

Simulating the matches

As discussed before, a match between two teams can end in 3 possible outcomes: Home team win (H), Away team win (A) or a Tie (T). Let the home team score X goals and the away team score Y goals. Then the result is H if X > Y, A if X < Y and T if X = Y.

We have already seen how to calculate the probability that the match ends with the scoreline X-Y. Also, we can put a practical upper limit on the number of goals scored by a team at, say, 10. Finally, since different scorelines are mutually exclusive, the probability of each outcome is simply the sum of the probabilities of the scorelines that produce it:
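A minimal sketch of that summation, assuming the home and away goal counts are independent Poisson variables and capping each side at 10 goals. The names outcome_probabilities and MAX_GOALS, and the rates 1.6 and 1.1, are illustrative assumptions.

```python
import math

MAX_GOALS = 10  # practical cap on goals per team

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the average rate is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

def outcome_probabilities(lam_home, lam_away, max_goals=MAX_GOALS):
    """Return (p_home_win, p_tie, p_away_win) by summing scoreline probabilities."""
    p_home = p_tie = p_away = 0.0
    for x in range(max_goals + 1):        # home goals
        for y in range(max_goals + 1):    # away goals
            p = poisson_pmf(x, lam_home) * poisson_pmf(y, lam_away)
            if x > y:
                p_home += p
            elif x == y:
                p_tie += p
            else:
                p_away += p
    return p_home, p_tie, p_away

p_home, p_tie, p_away = outcome_probabilities(1.6, 1.1)  # illustrative rates
```

Because of the cap, the three probabilities sum to slightly less than 1; for typical scoring rates the missing mass is negligible.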

Thus, we can simulate a match between Home(H) and Away(A) teams and predict the points scored by the teams:
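How the outcome probabilities are turned into points is a modelling choice. The sketch below assumes the simplest rule, awarding points for the single most probable outcome, and also shows a probability-weighted alternative. Both function names are hypothetical.

```python
def predict_points(p_home, p_tie, p_away):
    """Return (home_points, away_points) for the most probable outcome."""
    outcomes = {
        "H": (p_home, (3, 0)),  # home win
        "T": (p_tie, (1, 1)),   # tie
        "A": (p_away, (0, 3)),  # away win
    }
    best = max(outcomes, key=lambda k: outcomes[k][0])
    return outcomes[best][1]

def expected_points(p_home, p_tie, p_away):
    """Alternative: probability-weighted points instead of a single outcome."""
    return 3 * p_home + p_tie, 3 * p_away + p_tie
```

The first rule almost never predicts a tie, since a tie is rarely the single most likely outcome under this model, while the second spreads the points across all three outcomes.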

Putting it all together

To predict the final standings then, we simply simulate all the league matches using the model and add up the predicted points to build the points table.
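The bookkeeping can be sketched as follows. The function build_table and the stand-in predictor are hypothetical, and ties on points are broken alphabetically here for simplicity, whereas the real league uses goal difference.

```python
from collections import defaultdict

def build_table(fixtures, predict):
    """Sum predicted points over all fixtures and rank the teams.

    fixtures: iterable of (home_team, away_team).
    predict:  function (home, away) -> (home_points, away_points).
    """
    points = defaultdict(int)
    for home, away in fixtures:
        home_pts, away_pts = predict(home, away)
        points[home] += home_pts
        points[away] += away_pts
    # Highest points first; equal points fall back to alphabetical order.
    return sorted(points.items(), key=lambda item: (-item[1], item[0]))

# Toy usage with a stand-in predictor in which the home side always wins.
teams = ["A", "B", "C"]
fixtures = [(h, a) for h in teams for a in teams if h != a]
table = build_table(fixtures, lambda home, away: (3, 0))
```

With 20 teams, the same double round-robin construction yields 380 fixtures, 38 per team.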

The final result obtained:

So it seems that Liverpool and Man City will finish at the top, with Chelsea jumping ahead of Tottenham. Man United is predicted to finish in 5th place with Arsenal close behind. The results seem to agree with the general public opinion then: let’s just bring Fergie back (please).

Find the complete code here

Conclusion

As always, there is plenty of room for improvement. Some ideas to try:

Considering time as a factor: the form of a team can play an important role, and time-weighted averages can be considered to assign more importance to recent matches
It could be interesting to see whether adding the manager’s ranking at the time as a parameter improves the predictions
Correcting the model’s underestimation of draws: real-world draws are more frequent than the model’s estimates suggest
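As an illustration of the first idea, one possible time weighting is an exponential decay over past matches. This is only a sketch of one option; the function name time_weighted_mean and the decay value are assumptions.

```python
def time_weighted_mean(goals, decay=0.9):
    """Weighted mean of goals, ordered oldest to newest.

    The most recent match has weight 1, the one before it `decay`,
    then `decay ** 2`, and so on.
    """
    n = len(goals)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * g for w, g in zip(weights, goals)) / sum(weights)

# A team that scored 0, 0 and then 3 goals: recent form pulls the rate up.
recent_form_rate = time_weighted_mean([0, 0, 3], decay=0.5)
```

With decay set to 1 this reduces to the plain average, so the parameter controls how quickly older matches are forgotten.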

Despite the shortcomings, the model is a good starting point with decent accuracy. And the exercise was fun. After all, it got me first place in the event :D