Python for FPL(!) Data Analytics

Using Python and Matplotlib to perform Fantasy Football Data Analysis and Visualisation

author’s graph

Introduction

There are two reasons for this piece: (1) I wanted to teach myself some Data Analysis and Visualisation techniques using Python; and (2) I need to arrest my Fantasy Football team’s slide down several leaderboards. But first, credit to David Allen for the helpful guide on accessing the Fantasy Premier League API, which can be found here.

To begin, we need to set up our notebook to use Pandas and Matplotlib (I’m using Jupyter for this), and connect to the Fantasy Premier League API to access the data needed for the analysis.

#Notebook Config
import requests
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
#API Set-Up
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'
r = requests.get(url)
json = r.json()

Then, we can set up our Pandas DataFrames (think data tables) which will be queried for valuable insights – hopefully. Each DataFrame (_df) we create relates to a JSON data structure accessible via the FPL API. For a full list of these, run json.keys(). We’re interested in ‘elements’ (player data), ‘element_types’ (positional references), and ‘teams’.

elements_df = pd.DataFrame(json['elements'])
element_types_df = pd.DataFrame(json['element_types'])
teams_df = pd.DataFrame(json['teams'])
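As a minimal sketch of what these three lines do, the snippet below builds DataFrames from a hand-written payload shaped like the API response. The sample values are made up for illustration and only a handful of fields are shown, so treat it as a stand-in rather than the real data.

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for the API response (values are illustrative only)
sample_json = {
    "elements": [
        {"web_name": "Player A", "team": 1, "element_type": 1, "total_points": 20},
        {"web_name": "Player B", "team": 2, "element_type": 3, "total_points": 35},
    ],
    "element_types": [
        {"id": 1, "singular_name": "Goalkeeper"},
        {"id": 3, "singular_name": "Midfielder"},
    ],
    "teams": [
        {"id": 1, "name": "Arsenal"},
        {"id": 2, "name": "Aston Villa"},
    ],
}

# List the top-level keys, as json.keys() would for the real payload
print(list(sample_json.keys()))

# Each list of dicts becomes one DataFrame: one row per dict, one column per field
sample_elements_df = pd.DataFrame(sample_json["elements"])
sample_types_df = pd.DataFrame(sample_json["element_types"])
sample_teams_df = pd.DataFrame(sample_json["teams"])
print(sample_elements_df.shape)
```

The same pattern applies to any other key in the payload whose value is a list of records.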

By default, elements_df contains a number of columns we aren’t interested in right now (for an overview of each DataFrame, see David’s article). I’ve created a new DataFrame – main_df – with columns I might want to use.

main_df = elements_df[['web_name','first_name','team','element_type','now_cost','selected_by_percent','transfers_in','transfers_out','form','event_points','total_points','bonus','points_per_game','value_season','minutes','goals_scored','assists','ict_index','clean_sheets','saves']]

It’s important to note that elements_df uses keys to reference things such as a player’s position and team. For example, in column ‘element_type’, a value of "1" = goalkeeper, and in column ‘team’, a value of "1" = Arsenal. These are references to the two other DataFrames we created (element_types_df and teams_df).

If we preview element_types_df, we’ll see that each ‘id’ number here corresponds to a position:

element_types_df.head()
author’s table output

And in teams_df, ‘id’ corresponds to each team name:

teams_df.head()
author’s table output

At this point we can also see that none of the team stats are updated, which could be a problem later on – I particularly want the games played by each team. In the interests of time and/or lacking a more elegant solution, I added the data manually by (1) creating a new list of team/games-played pairs, (2) turning this into a two-column DataFrame, (3) overwriting the ‘played’ column in teams_df with the data I created.

#create a list of [team, games played] pairs (same row order as teams_df)
games_played = [['Arsenal','4'], ['Aston Villa','3'], ['Brighton','4'], ['Burnley','3'], ['Chelsea','4'], ['Crystal Palace','4'], ['Everton','4'], ['Fulham','4'], ['Leicester','4'], ['Leeds','4'], ['Liverpool','4'], ['Man City','3'], ['Man Utd','3'], ['Newcastle','4'], ['Sheffield Utd','4'], ['Southampton','4'], ['Spurs','4'], ['West Brom','4'], ['West Ham','4'], ['Wolves','4']]
#turn into a DataFrame
played_df = pd.DataFrame(games_played,columns=['team','games_played'])
#overwrite existing DataFrame column
teams_df['played'] = played_df['games_played'].astype(str).astype(int)
#voila
teams_df.head()
author’s table output
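One possible alternative to typing the numbers in by hand is to count completed matches per team. The sketch below assumes you already have a list of fixture records holding a home team id, an away team id, and a finished flag; the field names team_h, team_a, and finished, as well as the sample rows, are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical fixture records; the field names are assumptions for illustration
fixtures = [
    {"team_h": 1, "team_a": 2, "finished": True},
    {"team_h": 3, "team_a": 1, "finished": True},
    {"team_h": 2, "team_a": 3, "finished": False},
]
fixtures_df = pd.DataFrame(fixtures)

# Keep completed matches, then count each team's home and away appearances
done = fixtures_df[fixtures_df["finished"]]
played = pd.concat([done["team_h"], done["team_a"]]).value_counts().sort_index()
print(played)
```

The result is indexed by team id, so it could be mapped onto the ‘id’ column of a teams table instead of relying on row order.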

For the purpose of our analysis, we are going to want to use the actual position and team names, rather than the number identifiers. To avoid referencing data from different places later on, we can merge our DataFrames together now to form a sort of ‘master’ table. We’ll use main_df as the base for this, using pd.merge to perform a join to the other DataFrames, element_types_df and teams_df.

There are three elements here:

  1. Use of pd.merge, where "left_on=" takes the unique identifier in main_df that we are joining on. "right_on=" is the equivalent column in the target DataFrame that we are joining from (in both cases, ‘id’). For example, main_df[‘element_type’] = element_types_df[‘id’], but we actually want to use element_types_df[‘singular_name’] in display. We use "right=" to list each column we are pulling across.
#merging element_types_df onto main_df
main_df = pd.merge(left=main_df,right=element_types_df[['id','singular_name']],left_on='element_type', right_on='id', how='left')
  2. Use of df.drop, to remove unwanted columns post-merge. Whilst we need to use the ‘id’ columns to join on, we don’t want them in our final DataFrame. We also no longer need our original ‘element_type’ and ‘team’ columns, since we have now merged the user-friendly data into main_df. The use of "axis=1" specifies we are dropping a column rather than a row.
main_df = main_df.drop(["id", "element_type"],axis=1)
  3. Use of df.rename, to clean up our DataFrame. In element_types_df, position names are stored under ‘singular_name’. We’ll rename that to ‘position’ to make it more intuitive.
main_df = main_df.rename(columns = {'singular_name': 'position'})

To merge the teams_df data across, we can just tweak the above:

#merging teams_df onto main_df
main_df = pd.merge(left=main_df,right=teams_df[['id','name','played','strength_overall_away','strength_overall_home']],left_on='team', right_on='id', how='left')
main_df = main_df.drop(["id", "team"],axis=1)
main_df = main_df.rename(columns = {'name': 'team'})
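The merge, drop, and rename steps can be tried out on a toy pair of tables before running them on the full data. The rows below are made up; only the column names follow the article.

```python
import pandas as pd

# Toy tables with made-up rows; column names follow the article
players = pd.DataFrame({
    "web_name": ["Player A", "Player B", "Player C"],
    "element_type": [1, 3, 3],
})
positions = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "singular_name": ["Goalkeeper", "Defender", "Midfielder", "Forward"],
})

# Left join keeps every player row and pulls in the matching position name
merged = pd.merge(
    left=players,
    right=positions[["id", "singular_name"]],
    left_on="element_type",
    right_on="id",
    how="left",
)

# Drop the key columns and give the name column a friendlier label
merged = merged.drop(["id", "element_type"], axis=1)
merged = merged.rename(columns={"singular_name": "position"})
print(merged)
```

Using how="left" means a player whose key has no match would still appear, with a missing position, rather than silently dropping out of the table.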

Finally, we pre-emptively convert some existing columns to floats, to avoid potential issues running calculations and sorting values, and we create a new column ‘total_contribution’ (sum of goals and assists):

#Additional columns stored as floats
main_df['value'] = main_df.value_season.astype(float)
main_df['ict_score'] = main_df.ict_index.astype(float)
main_df['selection_percentage'] = main_df.selected_by_percent.astype(float)
main_df['current_form'] = main_df.form.astype(float)
#Total Goals Contribution column = Goals + Assists
main_df['total_contribution']= main_df['goals_scored'] + main_df['assists']
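To see why the float conversion matters, here is a small sketch that assumes a numeric field has arrived as text. The names and numbers are made up for illustration.

```python
import pandas as pd

# Made-up form values stored as strings (assumed here for illustration)
demo = pd.DataFrame({"web_name": ["A", "B", "C"], "form": ["9.5", "10.0", "2.0"]})

# Sorting strings compares character by character, so "10.0" sorts below "2.0"
as_text = demo.sort_values("form", ascending=False)["web_name"].tolist()

# After conversion the values sort numerically
demo["current_form"] = demo["form"].astype(float)
as_number = demo.sort_values("current_form", ascending=False)["web_name"].tolist()

print(as_text)    # ['A', 'C', 'B']
print(as_number)  # ['B', 'A', 'C']
```

Arithmetic has the same problem: adding two text columns concatenates them instead of summing, which is why the conversion is done up front.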

With main_df set up, we can use .loc to keep only players with a value above 0, so that our results aren’t skewed by those not getting game-time:

main_df = main_df.loc[main_df.value > 0]
#preview of current state
main_df.head(3)
author’s table output

Positions

Now, using the groupby() function we can take a quick look at the data by position. The first argument ‘position’ chooses the column to group on, and "as_index=False" keeps that column as a regular column rather than turning it into the index. Using aggregate functions we can find the mean value and total points scored per position. I’ve wrapped it in np.round() to limit decimals to two.

position_group = np.round(main_df.groupby('position', as_index=False).aggregate({'value':np.mean, 'total_points':np.sum}), 2)
position_group.sort_values('value', ascending=False)
author’s table output
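For anyone who wants to check what the grouping does on numbers small enough to verify by hand, here is a sketch on made-up rows. It uses the string names "mean" and "sum" and the DataFrame's own .round() method, which is an equivalent spelling rather than the code used above.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "position": ["Goalkeeper", "Goalkeeper", "Forward", "Forward"],
    "value": [4.0, 3.0, 2.5, 1.5],
    "total_points": [20, 15, 25, 12],
})

# Mean value and summed points per position, rounded to two decimals
summary = (
    toy.groupby("position", as_index=False)
       .agg({"value": "mean", "total_points": "sum"})
       .round(2)
       .sort_values("value", ascending=False)
)
print(summary)
```

Goalkeepers come out on top for mean value (3.5 against 2.0) while forwards have the higher points total (37 against 35), the same kind of split discussed for the real data.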

Per £ spent, Goalkeepers are currently returning the most points, but they also make up a much smaller fraction of a team’s points total. As a lot of goalkeepers in the game play 0 minutes, they won’t be accounted for here (as their value = 0). Many outfield players will pick up 1 point returns if subbed on with 5 or 10 minutes to go, pulling average value down.

Teams

Team-level stats can give a broad indication of which teams are playing well, or where players could be under/overvalued.

team_group = np.round(main_df.groupby('team', as_index=False).aggregate({'value':np.mean, 'total_points':np.sum}), 2)
team_grp_df = team_group.sort_values('value', ascending=False)
team_grp_df['games_played'] = teams_df['played']
team_grp_df.head(5)
author’s table output

I personally wouldn’t have thought West Ham and Brighton assets would be top-5 in terms of value at this (or any) stage. However, we know that some teams – including Man Utd and Man City – have played a game less than everyone else. Let’s adjust for that by creating some value-per-game and points-per-game metrics:

team_group = np.round(main_df.groupby('team', as_index=False).aggregate({'value':np.mean, 'total_points':np.sum}), 2)
team_grp_df = team_group
team_grp_df['games_played'] = teams_df['played']
team_grp_df['value_adjusted'] = np.round(team_grp_df['value']/teams_df['played'],2)
team_grp_df['points_adjusted'] = np.round(team_grp_df['total_points']/teams_df['played'],2)
team_grp_df.sort_values('points_adjusted',ascending=False).head(5)
author’s table output
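The games_played assignment above lines the two tables up by row position, which holds only while both list the teams in the same order. A minimal sketch of a name-based lookup, using made-up team names and totals, shows the per-game adjustment without that dependency.

```python
import pandas as pd

# Made-up totals for illustration
team_totals = pd.DataFrame({
    "team": ["Team X", "Team Y"],
    "total_points": [120, 105],
})
# Games played, indexed by team name (deliberately in a different row order)
games = pd.Series({"Team Y": 3, "Team X": 4})

# Look up games played by name so row order does not matter
team_totals["games_played"] = team_totals["team"].map(games)
team_totals["points_adjusted"] = (
    team_totals["total_points"] / team_totals["games_played"]
).round(2)
print(team_totals)
```

Team Y trails on raw points (105 against 120) but leads per game (35.0 against 30.0), which is exactly the effect the adjustment is meant to expose.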

Even after adjusting for games played, Man Utd and Man City don’t break into the top 5. Not too surprising given their recent performances against Spurs and Leicester. Arsenal and Liverpool are both picking up a good amount of points per game, but their value score shows you have to pay for it.

Let’s use Matplotlib to create some nicer graphical representations of this data. Using the plt.subplots function, we can plot two graphs side-by-side for easier comparisons. We’ll compare team ‘value’ and ‘total_points’ to the adjusted figures we created based on games played.

fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(20,7))
plt.subplots_adjust(hspace=0.25,  wspace=0.25)
team_grp_df.sort_values('value').plot.barh(ax=axes[0],x="team", y="value", subplots=True, color='#0087F1')
team_grp_df.sort_values('value_adjusted').plot.barh(ax=axes[1],x="team", y="value_adjusted", subplots=True, color='#2BBD00')
plt.ylabel("")
fig,axes = plt.subplots(nrows=1, ncols=2, figsize=(20,7))
plt.subplots_adjust(hspace=0.25,  wspace=0.25)
team_grp_df.sort_values('total_points').plot.barh(ax=axes[0],x="team", y="total_points", subplots=True, color='#EB2000')
team_grp_df.sort_values('points_adjusted').plot.barh(ax=axes[1],x="team", y="points_adjusted", subplots=True, color='#FF8000')
plt.ylabel("")
author’s graphs: value and total_points adjusted for games_played
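A stripped-down variant of the side-by-side layout is sketched below on made-up figures. Because plt.ylabel("") only affects the currently active panel, this version clears the label on each Axes object directly; the Agg backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up figures for illustration
demo = pd.DataFrame({
    "team": ["Team X", "Team Y", "Team Z"],
    "total_points": [120, 105, 90],
    "points_adjusted": [30.0, 35.0, 22.5],
})

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
# Left panel: raw totals; right panel: per-game figures, each sorted on its own column
demo.sort_values("total_points").plot.barh(ax=axes[0], x="team", y="total_points")
demo.sort_values("points_adjusted").plot.barh(ax=axes[1], x="team", y="points_adjusted")
for ax in axes:
    ax.set_ylabel("")  # clear the axis label on both panels, not just the last one
fig.canvas.draw()
```

Sorting each panel on its own column keeps the longest bar at the top in both, so a team changing rank between panels is easy to spot.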

Aston Villa stand out even more after adjusting for the fact they’ve only played three games. Looking at their results it’s not hard to see why: 1–0 (v Sheff Utd), 3–0 (v Fulham), and 7–2 (v Liverpool!). Obviously, we’re looking at tiny sample sizes, but before we decide to load up on Villa assets, maybe consider that two of their wins and clean sheets came against the two sides that have had the worst start to the season by points per game. And how much significance can we attach to that result against last year’s Champions Liverpool?

Team-level data can only tell us so much – we are capped at three players per team after all – so let’s narrow our focus. We’ll use .loc again to filter on each position and create some more DataFrames:

gk_df = main_df.loc[main_df.position == 'Goalkeeper']
gk_df = gk_df[['web_name','team','selection_percentage','now_cost','clean_sheets','saves','bonus','total_points','value']]
def_df = main_df.loc[main_df.position == 'Defender']
def_df = def_df[['web_name','team','selection_percentage','now_cost','clean_sheets','assists','goals_scored','total_contribution','ict_score','bonus','total_points','value']]
mid_df = main_df.loc[main_df.position == 'Midfielder']
mid_df = mid_df[['web_name','team','selection_percentage','now_cost','assists','goals_scored','total_contribution','ict_score','current_form','bonus','total_points','value']]
fwd_df = main_df.loc[main_df.position == 'Forward']
fwd_df = fwd_df[['web_name','team','selection_percentage','now_cost','assists','goals_scored','total_contribution','ict_score','current_form','minutes','bonus','total_points','value']]

Goalkeepers

Starting with goalkeepers, we’ll plot a simple scatter graph showing cost v points (the ‘value’ metric). Matplotlib allows us to easily customise things such as plot transparency (alpha=), size (figsize=), and line-style (ls=). For the purpose of this article I’m going to keep everything consistent.

ax = gk_df.plot.scatter(x='now_cost',y='total_points', alpha=.5, figsize=(20,9), title="goalkeepers: total_points v cost")
for i, txt in enumerate(gk_df.web_name):
    ax.annotate(txt, (gk_df.now_cost.iat[i],gk_df.total_points.iat[i]))
plt.grid(which='both', axis='both', ls='-')
plt.show()
author’s graph

Early on in the season, there is no obvious correlation between a goalkeeper’s price and points total. McCarthy was a pre-season value pick for a huge number of players, and that’s paid off so far. Martinez was undervalued pre-season because he was Arsenal’s second choice – his late move to Villa has rewarded those who brought him in*. Newcastle’s Darlow is currently scoring the highest of the £5.0m picks, which is surprising given they’ve only kept one clean sheet in four games. What else explains it? Knowing that Goalkeepers also score one point for every three saves, we can re-plot using ‘saves’ data.

author’s graph
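The one-point-per-three-saves rule can be turned into a rough estimate of how much of a keeper's total comes from saves alone. The rows below are made up, and the save_points and save_share columns are hypothetical helpers, not fields from the API.

```python
import pandas as pd

# Made-up goalkeeper rows for illustration
gk_demo = pd.DataFrame({
    "web_name": ["Keeper A", "Keeper B"],
    "saves": [22, 9],
    "total_points": [21, 19],
})

# One point per three saves: integer division drops any incomplete set of three
gk_demo["save_points"] = gk_demo["saves"] // 3
# Share of each keeper's total that came from saves alone
gk_demo["save_share"] = (gk_demo["save_points"] / gk_demo["total_points"]).round(2)
print(gk_demo)
```

In this toy case Keeper A earns 7 of 21 points from saves and Keeper B only 3 of 19, so similar totals can hide quite different sources of points.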

Looks like Darlow has made more than twice as many saves as most goalkeepers, which helps explain his total points. It also suggests that he’s not being offered much protection by his defenders. In any event, clean sheets will generally return more points than saves, so I’d rather opt for a goalkeeper with more potential there. Martinez and McCarthy offer more balance for less cost. New Chelsea signing Mendy didn’t need to make a single save for his debut clean sheet, and at £5.0m there is long-term value if you have faith that their defensive line will continue to improve.

*His move didn’t reward those who brought him in and left him on the bench. I’m sure those people will be kicking themselves even harder if they started Pickford instead – a keeper who has done everything in his power to help the opposition team put the ball in his own net.

Conclusion: Pickford out, McCarthy/Meslier in.

Defence

As for Defenders, we want clean sheets or attacking threat – ideally both. Alexander-Arnold and Robertson set defensive expectations through the roof last year, but shipping three goals to Leeds and seven to Aston Villa suggests problems. So which defenders have started the season well?

def_df.sort_values('total_points',ascending=False).head(5)
author’s table output

A lot of people have picked up on Castagne’s fantastic start to the season (now in 19.2% of teams). West Ham’s Cresswell, on the other hand, is still largely ignored – despite his total points return for just £4.9m. His ICT index score is also high at 27.7; for anyone unfamiliar, ICT stands for Influence, Creativity, and Threat. It’s essentially a measure of a player’s involvement in important moments at either end of the pitch. It’s useful for assessing whether a player is creating good chances for others, or getting into positions where they might score themselves.

For a quick view of the highest performing defenders in relation to price, I’ll make a new DataFrame topdef_df, selecting only defenders with a value > 3 and at least 1 goal contribution (goal or assist).

#filter into a new DataFrame so def_df itself stays unfiltered for later use
topdef_df = def_df.loc[def_df.value > 3]
topdef_df = topdef_df.loc[topdef_df.total_contribution > 0]
author’s graph

Players with high ICT scores but low total points might be on the bad side of luck: those ‘underperforming’ from a points perspective now might still be valuable additions to a team if chances are converted sooner rather than later. Can we tell if there are any ‘unlucky’ defenders out there? We can try, by creating a DataFrame for defenders with an ICT score > 10, and seeing who has been rewarded and who hasn’t.

unluckydef_df = def_df.loc[def_df.ict_score > 10]
unluckydef_df.sort_values('ict_score', ascending=False).head(5)
author’s table output
author’s graph

Alexander-Arnold (£7.5m) and Robertson (£7.0m) didn’t make it into the previous plot because we filtered on value > 3. Given their slow start to the season, players owning either of the two will be pleased to see that they are still amongst the most threatening in the game and should be held onto – despite their cost. Chelsea’s James made his way into a lot of people’s teams given his attacking value, although he has recently fallen behind Azpilicueta in the pecking order. Digne isn’t cheap (£6.1m) but he’s serving Richarlison and Calvert-Lewin in a strong Everton side going forward – expect his assists to keep tallying up. On the ‘unlucky’ side, Fredericks (£4.4m), Ayling (£4.5m) and Dallas (£4.5m) have all been fairly heavily involved in their matches so far, without the points to show for it. All three make good bench candidates at those prices.

Conclusion: Both Digne and Chilwell look like better options than my current pick Doherty (£5.9m). I also brought in Struijk (£4.0m) as a bench option on the bad advice of a friend – any one of Fredericks/Ayling/Dallas would be an affordable upgrade.

Midfield

Onto midfielders: probably where most of a team’s points will come from. To focus on the high-performers right away, we can do what we did with the defenders and create another new DataFrame – this time, selecting only midfielders with an ICT score > 25 and points total > 15.

topmid_df = mid_df.loc[mid_df.ict_score > 25]
topmid_df = topmid_df.loc[topmid_df.total_points > 15]
topmid_df.sort_values('total_points',ascending=False).head(5)
author’s table output
author’s graph

Salah (£12.2m) and Son (£9.0m) are the only big-name assets to justify their price tag so far. De Bruyne and Sterling have a game in hand but not too much to show from their first three fixtures. Grealish (£7.1m) and Rodriguez (£7.8m) are the middle-tier value picks at the moment, and you’d expect that to continue given how much of Aston Villa’s and Everton’s attacking play goes through them. Bowen (£6.3m) and Jorginho (£5.2m) look like good value picks, although the latter has scored three penalties in four games to inflate his early-stage points return.

But which midfielders have FPL players gone for (or ignored)? I’ve kept the >15 total points parameter the same, but plotted the results by overall selection %, rather than cost.

author’s graph
author’s table output

Again, Bowen stands out – less than 2% of players have selected him after a pretty strong start to the season. In FPL-speak he is effectively a "differential" pick: someone with a low enough ownership % that he would help you gain leaderboard positions should he return well (because very few others will also be gaining those points). There are a lot of people hoping premium assets Aubameyang and De Bruyne start firing soon.
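A differential shortlist can be pulled out with a pair of combined .loc conditions. The sketch below uses made-up midfielders; the 15-point floor matches the filter used earlier, while the 5% ownership cap is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# Made-up midfielders for illustration
mids = pd.DataFrame({
    "web_name": ["Mid A", "Mid B", "Mid C"],
    "selection_percentage": [45.0, 1.8, 3.2],
    "total_points": [40, 27, 9],
})

# Hypothetical cut-offs: under 5% ownership, more than 15 points
OWNERSHIP_CAP = 5.0
MIN_POINTS = 15

# Both conditions must hold; each one is wrapped in parentheses before combining with &
differentials = mids.loc[
    (mids.selection_percentage < OWNERSHIP_CAP) & (mids.total_points > MIN_POINTS)
]
print(differentials.web_name.tolist())  # ['Mid B']
```

Mid A scores well but is widely owned, and Mid C is rarely picked but has not returned, so only Mid B passes both tests.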

author’s graph

Judging by ICT, De Bruyne holders can relax a bit knowing he’s started brightly – injuries mean Man City have lacked an out-and-out striker that could have rewarded his chance creation. If you opted for Sterling, you might be slightly more concerned to see he’s about as threatening as Townsend or McGinn (for double the cost). At the other end, Son and Salah look like must-haves.

Conclusion: Salah and De Bruyne are probably holds for the rest of the season. I’ll get Son back in as soon as possible after transferring him out over a non-existent injury. Bringing Barnes and Pereira in hasn’t worked (yet) – if their contribution doesn’t improve quickly I’ll make a move for Grealish/Rodriguez depending on my budget.

Forwards

At the top of the pitch, there are some more surprises looking at where the value is coming from early on. I’ve ruled out everyone yet to contribute to a goal to tidy up the plot:

topfwd_df = fwd_df[fwd_df.total_contribution > 0]
author’s table output
author’s graph

There still isn’t much correlation between price and points. Middle-value players like Ings and Jiménez have done okay, not great. Most of the points seem to be coming from either the £6.0-£6.5m mark, or £10.0m+. Calvert-Lewin’s impressive start to the season has seen his value steadily increase from £7.0m to £7.6m. Unlike Vardy, Wilson, and Maupay, none of Calvert-Lewin’s goals have come from penalties either. 48% of players now have him onboard – the other 52% won’t admit they’ve got it wrong, even if it’s going to cost them (I’ve seen this before…).

Elsewhere, even at £10.6m, Kane is still providing a lot of value – directly contributing to the most goals so far this season (9). Martial and Werner are perhaps the biggest disappointments, with Werner somehow being out-scored by Abraham, the striker he was signed to replace.

To end with something slightly different, we can look at how FPL players are reacting to player form. Everybody refers to in-form strikers that "can’t miss at the moment". In theory, people will want to catch in-form players whilst they are on their good run. In FPL, form = average total points per game within the last thirty days.

#Add the 'transfers_in' column from our original main_df
fwd_df = fwd_df.assign(transfers_in=main_df['transfers_in'])
informfwd_df = fwd_df[fwd_df.current_form > 5]
author’s graph

There is a degree of correlation in a small sample size of in-form players (where form > 5). Kane’s low transfer-in number might just reflect the fact he started the season in a lot of teams already. A lot of people are picking up Bamford, Vardy, and Wilson based on the first few fixtures, but still not a lot of love for West Brom’s Callum Robinson – despite his two goals against Chelsea. The people have spoken on Calvert-Lewin, added to almost 3,000,000 teams in the past few weeks.
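To put a number on "a degree of correlation", one option is the Pearson coefficient from Series.corr. The sketch below runs it on made-up form and transfer figures, so the result says nothing about the real players; with so few rows, any coefficient should be read loosely.

```python
import pandas as pd

# Made-up form and transfer numbers for illustration
fwd_demo = pd.DataFrame({
    "current_form": [5.5, 6.0, 7.5, 9.0],
    "transfers_in": [200_000, 350_000, 300_000, 900_000],
})

# Pearson correlation coefficient: +1 is a perfect positive linear relationship, 0 is none
r_form_transfers = fwd_demo["current_form"].corr(fwd_demo["transfers_in"])
print(round(r_form_transfers, 2))
```

The same call could be pointed at the ‘current_form’ and ‘transfers_in’ columns of the in-form forwards table to back up the visual impression from the plot.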

Conclusion: I’ve been stubborn and hung onto Richarlison, but clearly the smart money is on his partner. Jiménez has a nice run of fixtures so he can stay, for now. Brewster (£4.5m) will probably make his way onto my bench if he starts well for Sheffield Utd. Otherwise, I’ll consider re-balancing my team to bring in Watkins or Bamford.

Overall

To finish, here’s a view of the top-5 value (value = points/cost) players by position. Although it’s worth remembering that a team of the highest ‘value’ players won’t top many leaderboards – it’s all about maximising points subject to the overall budget constraint. Finding good value allows you to spend more money on those players (Salah, Kane) that can hit the biggest scores.

author’s graph

To plot this, I made a top-5 DataFrame by position using nlargest()

top5_gk_df = gk_df.nlargest(5, 'value')
top5_def_df = def_df.nlargest(5, 'value')
top5_mid_df = mid_df.nlargest(5, 'value')
top5_fwd_df = fwd_df.nlargest(5, 'value')

before layering each DataFrame on the same graph by defining our first plot as ‘ax’, then referencing ‘ax=ax’ in the second, third and fourth:

#first plot defines the shared axes
ax = top5_gk_df.plot.scatter(x='value', y='total_points', color='DarkBlue', label='GK', s=top5_gk_df['value']*10, alpha=.5, figsize=(15,9), title="Top 5 Value Players by Position")
for i, txt in enumerate(top5_gk_df.web_name):
    ax.annotate(txt, (top5_gk_df.value.iat[i],top5_gk_df.total_points.iat[i]))
#each later plot reuses ax and sizes its bubbles from its own 'value' column
top5_def_df.plot.scatter(x='value', y='total_points', color='DarkGreen', label='DEF', s=top5_def_df['value']*10, ax=ax)
for i, txt in enumerate(top5_def_df.web_name):
    ax.annotate(txt, (top5_def_df.value.iat[i],top5_def_df.total_points.iat[i]))
top5_mid_df.plot.scatter(x='value', y='total_points', color='DarkOrange', label='MID', s=top5_mid_df['value']*10, ax=ax)
for i, txt in enumerate(top5_mid_df.web_name):
    ax.annotate(txt, (top5_mid_df.value.iat[i],top5_mid_df.total_points.iat[i]))
top5_fwd_df.plot.scatter(x='value', y='total_points', color='DarkRed', label='FWD', s=top5_fwd_df['value']*10, ax=ax)
for i, txt in enumerate(top5_fwd_df.web_name):
    ax.annotate(txt, (top5_fwd_df.value.iat[i],top5_fwd_df.total_points.iat[i]))

The "s=" argument (for example, "s=top5_gk_df[‘value’]*10") makes the size of each scatter bubble a function of the ‘value’ column. In other words, the larger the ‘value’, the bigger the bubble. However, the effect is pretty negligible above given the small scale.
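Since the four plotting calls differ only in the DataFrame, colour, and label, they can also be driven from a loop. The sketch below is one possible arrangement using made-up stand-ins for two of the top-5 tables; the groups dictionary is a hypothetical helper, and the Agg backend line is only there so it runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-ins for the top-5 DataFrames (two rows each to keep it short)
groups = {
    "GK": (pd.DataFrame({"web_name": ["GK A", "GK B"], "value": [4.2, 3.9],
                         "total_points": [21, 19]}), "DarkBlue"),
    "DEF": (pd.DataFrame({"web_name": ["DEF A", "DEF B"], "value": [5.1, 4.4],
                          "total_points": [25, 20]}), "DarkGreen"),
}

fig, ax = plt.subplots(figsize=(15, 9))
for label, (df, colour) in groups.items():
    # Bubble size comes from this group's own 'value' column
    df.plot.scatter(x="value", y="total_points", color=colour, label=label,
                    s=df["value"] * 10, alpha=.5, ax=ax)
    # Label every point with the player's name
    for name, x, y in zip(df.web_name, df.value, df.total_points):
        ax.annotate(name, (x, y))
ax.set_title("Top 5 Value Players by Position")
```

Adding the midfield and forward tables is then a matter of two more dictionary entries, and the bubble-size expression can only ever refer to the table being plotted.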

Conclusion

The above has largely been an exercise in getting to grips with DataFrame creation, manipulation, and graphical representation using Python, but there’s still a lot more that can be done in terms of visualisation and customisation. From a data perspective, there’s no denying that three or four games into a season is too early to make long-term projections, but the tools/methods for analysing the data will remain largely consistent. It’ll be interesting to keep re-running the code to see how things develop as the season goes on. It takes less than 1 second to re-run all of the code that produced everything here – sure saves a lot of time on the FPL app.

