Build a Player Recommendation System For Cricket Using K-Nearest Neighbor Algorithm
A beginner’s guide to using k-Nearest Neighbor Algorithm to build a simple recommender system.
In this article, we will build a simple cricket player recommendation system that will suggest a list of batsmen for the team based on the statistics of players that have been playing for the team in the past.
We will build the recommender system for batsmen only. It can be extended to bowlers and other types of players by calculating their respective metrics.
Overview of k-nearest neighbors
In simple terms, the k-nearest neighbors (kNN) algorithm finds the k neighbors nearest to a data point based on a chosen distance metric. It resembles k-means in that the similarity of data points is measured by the distance between them. We will use the kNN algorithm to recommend players that are nearest to the current team members.
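To make the idea concrete, here is a minimal sketch of a nearest-neighbor lookup using Euclidean distance in NumPy. The points and the helper name k_nearest are made up for illustration and are not part of the dataset or the project code.

```python
import numpy as np

def k_nearest(points, query, k):
    """Return the indices of the k rows of `points` closest to `query`."""
    # Euclidean distance from the query to every point.
    distances = np.linalg.norm(points - query, axis=1)
    # argsort orders the indices from the smallest distance to the largest.
    return np.argsort(distances, kind="stable")[:k]

# Hypothetical 2-D points, purely for illustration.
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
nearest = k_nearest(points, query=np.array([5.2, 5.0]), k=2)
print(nearest)  # indices of the two closest points
```

In the recommender, each row would be a player and each column one of the player's features.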
Data Collection
The dataset used for this system was downloaded from Indian Premier League CSV Dataset on Kaggle. It consists of 6 CSV files that summarize the ball by ball information of 577 IPL matches up to IPL Season 9.
Understanding the data
We will use Pandas to read the following CSV files.
- Ball_by_Ball.csv: This file has the data about every ball of the matches. We can extract the id of the players at the striker and non-striker end, runs scored, etc. We will use this file to calculate batsmen statistics for our recommender system.
- Match.csv: This file stores information about the match like the venue, teams, result, umpire details, etc. We will need this file to extract the association between a Match_Id and the Season_Id.
- Player.csv: This file contains data about all the players, i.e. their name, country, date of birth, etc. These fields will be used to build our recommender system using the k-nearest neighbors algorithm.
- Player_Match.csv: This file associates the Player_Id with the matches they have played. We will use this file to understand the features of the players in the current team.
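As a rough sketch, the four files could be loaded into dataframes as shown below. The helper name load_tables and the idea of keeping the files in a single directory are assumptions made for this example; only the file names come from the list above.

```python
from pathlib import Path

import pandas as pd

# File names come from the dataset; the short keys are chosen for this sketch.
CSV_FILES = {
    "ball_by_ball": "Ball_by_Ball.csv",
    "match": "Match.csv",
    "player": "Player.csv",
    "player_match": "Player_Match.csv",
}

def load_tables(data_dir):
    """Read each CSV found in `data_dir` into a dataframe, keyed by a short name."""
    data_dir = Path(data_dir)
    return {
        name: pd.read_csv(data_dir / filename)
        for name, filename in CSV_FILES.items()
    }
```

With the files in a hypothetical folder named data, `tables = load_tables("data")` followed by `player = tables["player"]` would give the player dataframe used in the next section.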
Data Cleaning
We will create another dataframe called player_data to store batsmen statistics and other relevant features from the existing player dataframe. As the player dataframe has two columns, Is_Umpire and Unnamed: 7, that are insignificant for our use case, we will drop them and copy the other columns to player_data.
player_data = player.drop(["Is_Umpire", "Unnamed: 7"], axis = 1)
Feature Extraction
Derive Season from Match_Id
We will derive the performance statistics of the players in every season. The match dataframe has the fields Match_Id and Season_Id that can be used to derive the season number from the Match_Id.
NUMBER_OF_SEASONS = 9

# Collect the first and last Match_Id of every season.
# Rows are gathered in a list because DataFrame.append was removed in pandas 2.0.
season_rows = []
for season in range(1, NUMBER_OF_SEASONS + 1):
    match_info = match.loc[match['Season_Id'] == season, 'Match_Id']
    season_rows.append({
        'Season': season,
        'Match_Id_start': match_info.min(),
        'Match_Id_end': match_info.max()
    })
season_info = pd.DataFrame(season_rows, columns=['Season', 'Match_Id_start', 'Match_Id_end'])
The above code snippet will find the range of Match_Id for every season.
Based on the above results, we will create a function that will return the correct season number based on the id of the match.
def get_season_from_match_id(match_id):
    season = season_info.loc[
        (season_info['Match_Id_start'] <= match_id) &
        (season_info['Match_Id_end'] >= match_id)]['Season']
    # Return the integer value of the season, or -1 if the season is not found.
    return season.item() if not season.empty else -1
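A quick way to check the lookup is to run it against a small, made-up season_info table. The id ranges below are hypothetical and are not the real ranges from Match.csv. The function is repeated so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical id ranges for two seasons; the real values come from Match.csv.
season_info = pd.DataFrame({
    'Season': [1, 2],
    'Match_Id_start': [100, 200],
    'Match_Id_end': [158, 256],
})

def get_season_from_match_id(match_id):
    season = season_info.loc[
        (season_info['Match_Id_start'] <= match_id) &
        (season_info['Match_Id_end'] >= match_id), 'Season']
    # Return the integer value of the season, or -1 if the season is not found.
    return season.item() if not season.empty else -1

print(get_season_from_match_id(120))  # falls inside the first range
print(get_season_from_match_id(999))  # outside every range

# The same function can label every row of a ball-by-ball style dataframe.
balls = pd.DataFrame({'Match_Id': [100, 158, 200, 180]})
balls['Season'] = balls['Match_Id'].apply(get_season_from_match_id)
```

Ids that fall between two ranges, such as 180 here, get -1, which makes gaps in the data easy to spot.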
Calculation of Batting Performance per season
From the ball_by_ball data, we will calculate the following features for all players per season:
- Runs Scored
- Number of innings played
- Number of innings where the player was not out
- Balls Faced
- Number of fours
- Number of sixes
Calculating runs scored, balls faced, number of fours and number of sixes is pretty straightforward. In the ball_by_ball dataframe, we can simply check the values in the Striker_Id and Batsman_Scored columns and increment these features accordingly.
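One possible way to compute these four counters is a groupby over the striker and the season, sketched below on made-up rows. The Season column is assumed to have been added with get_season_from_match_id. Counting every delivery as a ball faced is a simplification of this sketch, and the numeric coercion is a precaution in case the column holds non-numeric entries.

```python
import pandas as pd

# Toy ball-by-ball rows; column names follow the article, values are made up.
balls = pd.DataFrame({
    'Season':         [1, 1, 1, 1, 1, 2],
    'Striker_Id':     [10, 10, 10, 11, 11, 10],
    'Batsman_Scored': [4, 6, 1, 0, 4, 6],
})

# Coerce to numbers; anything that is not a number counts as 0 runs.
runs = pd.to_numeric(balls['Batsman_Scored'], errors='coerce').fillna(0).astype(int)

per_season = (
    balls.assign(Runs=runs, Four=(runs == 4), Six=(runs == 6))
         .groupby(['Striker_Id', 'Season'])
         .agg(Runs=('Runs', 'sum'),
              Balls_Faced=('Runs', 'count'),  # simplification: every row is a ball faced
              Fours=('Four', 'sum'),
              Sixes=('Six', 'sum'))
         .reset_index()
)
print(per_season)
```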
The difficulty lies in calculating the number of innings played per season and the number of innings where the player was not out. For this, we not only need to look at the current row in the dataframe, but also at the previous row. The innings for a player should be incremented in the following cases:
- The Match_Id of the previous ball is different from that of the current ball. This means that the current row belongs to a new match, so a new innings will begin. We will increment the innings count for both the striker and the non-striker.
- The Match_Id is the same for the previous and the current ball but the Innings_Id changes. This means that it is the second innings of the same match. We will increment the innings count for both the striker and the non-striker.
- Both the Match_Id and Innings_Id are the same in the previous and current ball but the current Striker_Id is not equal to the previous Striker_Id or Non_Striker_Id. This means that a new player has come to bat, so we will increase the count of innings only for the player with id equal to Striker_Id. Similar logic applies to the current Non_Striker_Id.
We will also track the Player_dismissed column to find out whether the player was not out in a particular innings.
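The row-by-row comparison described above can also be expressed without looking at the previous row. The sketch below is a vectorized alternative, not the procedure from the article: it counts the distinct innings in which a player appears at either end, assuming a batsman bats at most once per innings. The data is made up, and Player_dismissed is assumed to hold the id of the dismissed player or be empty.

```python
import pandas as pd

# Toy rows covering two matches; values are made up, None means no dismissal.
balls = pd.DataFrame({
    'Season':           [1, 1, 1, 1, 1, 1],
    'Match_Id':         [100, 100, 100, 100, 100, 101],
    'Innings_Id':       [1, 1, 1, 2, 2, 1],
    'Striker_Id':       [10, 11, 12, 20, 21, 10],
    'Non_Striker_Id':   [11, 10, 10, 21, 20, 12],
    'Player_dismissed': [None, 11, None, None, 20, None],
})

keys = ['Season', 'Match_Id', 'Innings_Id']

# Stack striker and non-striker into one Player_Id column, then keep one row
# per player per innings: the number of such rows is the innings count.
appearances = pd.concat([
    balls[keys + ['Striker_Id']].rename(columns={'Striker_Id': 'Player_Id'}),
    balls[keys + ['Non_Striker_Id']].rename(columns={'Non_Striker_Id': 'Player_Id'}),
]).drop_duplicates()
innings = appearances.groupby(['Player_Id', 'Season']).size().rename('Innings')

# Dismissals per player per season; not outs are the remaining innings.
dismissed = pd.to_numeric(balls['Player_dismissed'], errors='coerce')
outs = (balls.assign(Player_Id=dismissed)
             .dropna(subset=['Player_Id'])
             .astype({'Player_Id': int})
             .groupby(['Player_Id', 'Season']).size())
not_outs = innings.sub(outs, fill_value=0).astype(int).rename('Not_Outs')

summary = pd.concat([innings, not_outs], axis=1).reset_index()
print(summary)
```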
Calculation of Batting Statistics
The final step is the calculation of batting statistics like Batting Strike Rate (BSR), Batting Average (BA) and Boundary Runs per Inning (BRPI). The BSR, BA and BRPI are first calculated per season, and then the mean of these values is calculated over the number of seasons played by each player. While calculating the average metrics, only those seasons are considered in which the batsman actually played. This removes the bias towards players who have played in all the past seasons.
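The sketch below shows one way these metrics could be computed. The formulas are assumptions based on the usual cricket definitions, since they are not spelled out here: BSR as runs per 100 balls faced, BA as runs per dismissal, and BRPI as runs from fours and sixes divided by innings. The per-season totals are made up.

```python
import pandas as pd

# Hypothetical per-season totals for two players (values made up).
season_stats = pd.DataFrame({
    'Player_Id':   [10, 10, 11],
    'Season':      [1, 2, 1],
    'Runs':        [300, 150, 200],
    'Balls_Faced': [200, 120, 160],
    'Innings':     [10, 6, 8],
    'Not_Outs':    [0, 1, 3],
    'Fours':       [30, 10, 20],
    'Sixes':       [10, 5, 4],
})

stats = season_stats.copy()
stats['BSR'] = 100 * stats['Runs'] / stats['Balls_Faced']
# Dismissals = innings - not outs. Clipping at 1 avoids dividing by zero when
# a player was never out in a season (a choice made for this sketch).
dismissals = (stats['Innings'] - stats['Not_Outs']).clip(lower=1)
stats['BA'] = stats['Runs'] / dismissals
stats['BRPI'] = (4 * stats['Fours'] + 6 * stats['Sixes']) / stats['Innings']

# A player only has rows for the seasons they played, so this mean is taken
# over seasons played rather than over all nine seasons.
career = stats.groupby('Player_Id')[['BSR', 'BA', 'BRPI']].mean()
print(career)
```

Because a player who skipped a season has no row for it, the groupby mean averages over seasons played only, which is the behaviour described above.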
Once we have calculated these statistics, we also derive the age of the player and whether they are domestic or international players. These attributes are important because of the assumption that players who are currently over the age of 40 won’t be playing the next season. Also, with the restrictions around choosing international players in an IPL team, the attribute for domestic players will be important.
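As an illustration, age and the domestic flag could be derived as follows. The column names DOB and Country, the reference date, and treating players from India as domestic are assumptions of this sketch. The rows are made up, and the real file may store dates in a different format.

```python
import pandas as pd

# Made-up players; DOB and Country are assumed column names.
players = pd.DataFrame({
    'Player_Name': ['Player A', 'Player B', 'Player C'],
    'DOB':         ['1975-03-10', '1990-11-05', '1988-06-20'],
    'Country':     ['India', 'Australia', 'India'],
})

REFERENCE_DATE = pd.Timestamp('2017-04-01')  # assumed start of the next season
MAX_AGE = 40

dob = pd.to_datetime(players['DOB'])
# Age in whole years: subtract one if the birthday has not yet occurred
# in the reference year.
had_birthday = (dob.dt.month < REFERENCE_DATE.month) | (
    (dob.dt.month == REFERENCE_DATE.month) & (dob.dt.day <= REFERENCE_DATE.day))
players['Age'] = REFERENCE_DATE.year - dob.dt.year - (~had_birthday).astype(int)

# 1 for domestic players, 0 for international players.
players['Is_Domestic'] = (players['Country'] == 'India').astype(int)

# Players over the age limit are assumed not to play the next season.
eligible = players[players['Age'] <= MAX_AGE]
print(eligible)
```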
Implementation of K-Nearest Neighbors
For the implementation of k-nearest neighbors, we will use scikit-learn and build a NumPy array with only those features that will contribute to the team selection, i.e. BA, BSR, BRPI, Age and Nationality.
X is the numpy array representing those features.
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
indices will give the indices of the nearest neighbors for each of the rows in X.
For all of the players who have played in a particular team in the current season, kNN will return 3 nearest neighbors that are similar to the existing player.
Out of the 3 players suggested, the first player will be the same as the existing one because the distance of a player with himself is 0.
All the recommended players are added to an ordered collection and sorted from the highest BSR to the lowest. The system can then return the top n batsmen, where n can be changed as required.
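Putting these steps together, a minimal end-to-end sketch might look like the following. The feature values, the player ids and the helper name recommend are made up for illustration. Dropping players who are already in the team from the candidates is an assumption of this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical feature rows: BA, BSR, BRPI, Age, Is_Domestic (values made up).
player_ids = np.array([1, 2, 3, 4, 5, 6])
X = np.array([
    [30.0, 140.0, 15.0, 28, 1],
    [31.0, 138.0, 14.0, 29, 1],
    [12.0,  95.0,  4.0, 35, 0],
    [29.0, 142.0, 16.0, 27, 1],
    [13.0,  97.0,  5.0, 34, 0],
    [45.0, 150.0, 20.0, 30, 0],
])
BSR_COLUMN = 1

def recommend(team_ids, n):
    """Return up to n player ids similar to the team, highest BSR first."""
    nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(X)
    team_rows = np.flatnonzero(np.isin(player_ids, team_ids))
    # Three nearest neighbors for every current team member.
    _, indices = nbrs.kneighbors(X[team_rows])
    # Remove current team members, which also removes each player's self-match.
    candidates = set(indices.ravel()) - set(team_rows)
    ranked = sorted(candidates, key=lambda row: X[row, BSR_COLUMN], reverse=True)
    return [int(player_ids[row]) for row in ranked[:n]]

print(recommend(team_ids=[1, 3], n=2))
```

Using a set for the candidates means a player suggested by more than one team member appears only once in the final list.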
This was a very simple implementation of the batsmen recommendation system. You can view the GitHub repository for the code here. Thank you for reading! :)