Build a Player Recommendation System For Cricket Using K-Nearest Neighbor Algorithm

A beginner’s guide to using the k-nearest neighbors algorithm to build a simple recommender system.

Published in Towards Data Science · Jul 24, 2021 · 6 min read

In this article, we will build a simple cricket player recommendation system that will suggest a list of batsmen for the team based on the statistics of players that have been playing for the team in the past.

We will build the recommender system for batsmen only. It can be extended to bowlers and other types of players by calculating their respective metrics.

Overview of k-nearest neighbors

In simple terms, the k-nearest neighbors (kNN) algorithm finds the k data points nearest to a given data point according to a chosen distance metric. Like k-means, it measures the similarity of data points by the distance between them. We will use kNN to recommend players who are nearest to the current team members.
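To make the idea concrete, here is a minimal sketch of a nearest-neighbor lookup written directly in NumPy. The function name k_nearest and the toy points are illustrative assumptions and use Euclidean distance; the article itself relies on scikit-learn later on.

```python
import numpy as np

def k_nearest(points, query, k):
    """Return the indices of the k rows in `points` closest to `query`."""
    # Euclidean distance from the query to every point
    distances = np.linalg.norm(points - query, axis=1)
    # argsort orders the indices from smallest to largest distance
    return np.argsort(distances)[:k]

# Toy data: each row is a point in 2-D space (made-up values)
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]])
nearest = k_nearest(points, np.array([0.9, 0.2]), k=2)
```

Here nearest holds the row indices of the two points closest to the query, closest first.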

Data Collection

The dataset used for this system was downloaded from Indian Premier League CSV Dataset on Kaggle. It consists of 6 CSV files that summarize the ball by ball information of 577 IPL matches up to IPL Season 9.

Understanding the data

We will use Pandas to read the following CSV files.

Ball_by_Ball.csv — This file has the data about every ball of the matches. We can extract the id of the players at the striker and non striker end, runs scored, etc. We will use this file to calculate batsmen statistics for our recommender system.
Match.csv — This file stores information about the match like the venue, teams, result, umpire details, etc. We will need this file to extract the association between a Match_Id and the Season_Id .
Player.csv — This file contains data about all the players, i.e. their name, country, date of birth, etc. These fields will be used to build our recommender system using k-nearest neighbors algorithm.
Player_Match.csv — This file associates the Player_Id with the matches they have played. We will use this file to understand the features of the players in the current team.
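A possible way to load the four files is sketched below. The helper load_tables and the dictionary keys are hypothetical names chosen for illustration, and the directory holding the CSV files is an assumption about your local setup.

```python
from pathlib import Path
import pandas as pd

# File names as listed above, keyed by the dataframe names used in this article
CSV_FILES = {
    "ball_by_ball": "Ball_by_Ball.csv",
    "match": "Match.csv",
    "player": "Player.csv",
    "player_match": "Player_Match.csv",
}

def load_tables(data_dir):
    """Read every CSV in CSV_FILES from data_dir into a dict of dataframes."""
    data_dir = Path(data_dir)
    return {
        name: pd.read_csv(data_dir / filename)
        for name, filename in CSV_FILES.items()
    }
```

With the files in a folder of your choice, calling load_tables on that folder and then reading tables["player"] or tables["match"] would give the dataframes referred to in the rest of the article.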

Data Cleaning

We will create another dataframe called player_data to store batsmen statistics and other relevant features from the existing player dataframe. The player dataframe has two columns, Is_Umpire and Unnamed: 7, that are insignificant for our use case, so we will drop them and copy the other columns to player_data.

player_data = player.drop(["Is_Umpire", "Unnamed: 7"], axis = 1)

Feature Extraction

Derive Season from Match_Id

We will derive the performance statistics of the players in every season. The match dataframe has the fields Match_Id and Season_Id, which can be used to derive the season number from the Match_Id.

import pandas as pd

NUMBER_OF_SEASONS = 9
rows = []
for season in range(1, NUMBER_OF_SEASONS + 1):
    # Match_Id values that belong to this season
    match_info = match.loc[match['Season_Id'] == season, 'Match_Id']
    rows.append({
        'Season': season,
        'Match_Id_start': match_info.min(),
        'Match_Id_end': match_info.max()
    })
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
season_info = pd.DataFrame(rows, columns=['Season', 'Match_Id_start', 'Match_Id_end'])

The above code snippet will find the range of Match_Id for every season.

Based on the above results, we will create a function that will return the correct season number based on the id of the match.

def get_season_from_match_id(match_id):
    season = season_info.loc[
        (season_info['Match_Id_start'] <= match_id) &
        (season_info['Match_Id_end'] >= match_id), 'Season']
    # Return the integer value of the season, or -1 if the season is not found
    return int(season.item()) if not season.empty else -1
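The lookup function handles one Match_Id at a time. As an alternative sketch (not the approach taken in the original code), a left join on Match_Id can attach the season to every ball in a single step. The toy dataframes below stand in for match and ball_by_ball, and their values are made up.

```python
import pandas as pd

# Toy stand-ins for the match and ball_by_ball dataframes (made-up values)
match = pd.DataFrame({"Match_Id": [101, 102, 201], "Season_Id": [1, 1, 2]})
ball_by_ball = pd.DataFrame({
    "Match_Id": [101, 101, 201, 999],
    "Striker_Id": [1, 2, 1, 3],
})

# A left join keeps every ball; a Match_Id missing from `match` gets NaN
with_season = ball_by_ball.merge(
    match[["Match_Id", "Season_Id"]], on="Match_Id", how="left"
)
# Mirror the -1 convention used above for a season that is not found
with_season["Season_Id"] = with_season["Season_Id"].fillna(-1).astype(int)
```

The join avoids calling a Python function once per row, which may matter for a ball-by-ball table with many rows.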

Calculation of Batting Performance per season

From the ball_by_ball data, we will calculate the following features for all players per season:

Runs Scored
Number of innings played
Number of innings where the player was not out
Balls Faced
Number of fours
Number of sixes

Calculating runs scored, balls faced, number of fours and number of sixes is pretty straightforward. In the ball_by_ball dataframe, we can simply check the values in the Striker_Id and Batsman_Scored columns and increment these features accordingly.
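One way to compute these four counters without an explicit loop is a groupby, sketched below on toy rows. The Season column is assumed to have been derived already, the output column names are illustrative, and every row is counted as a ball faced, which ignores extras such as wides (a simplifying assumption).

```python
import pandas as pd

# Toy ball-by-ball rows; "Season" is assumed to have been derived already
balls = pd.DataFrame({
    "Season": [1, 1, 1, 1, 2],
    "Striker_Id": [7, 7, 7, 8, 7],
    "Batsman_Scored": [4, 6, 1, 0, 4],
})

# Coerce to numbers in case the column contains non-numeric placeholders
runs = pd.to_numeric(balls["Batsman_Scored"], errors="coerce").fillna(0).astype(int)

per_season = (
    balls.assign(Runs=runs, Four=(runs == 4), Six=(runs == 6))
    .groupby(["Season", "Striker_Id"])
    .agg(
        Runs=("Runs", "sum"),          # total runs off the bat
        Balls_Faced=("Runs", "size"),  # one row per ball faced
        Fours=("Four", "sum"),         # True counts as 1 when summed
        Sixes=("Six", "sum"),
    )
    .reset_index()
)
```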

The difficulty lies in calculating the number of innings played per season and the number of innings where the player was not out. For this, we not only need to look at the current row in the dataframe, but also at the previous row. The innings for a player should be incremented in the following cases:

The Match_Id of the previous ball is different from that of the current ball. This means that the current row belongs to a new match, so a new innings begins. We increment the innings count for both the striker and the non-striker.
The Match_Id is the same for the previous and the current ball but the Innings_Id changes. This means that it is the second innings of the same match. We increment the innings count for both the striker and the non-striker.
Both the Match_Id and the Innings_Id are the same for the previous and the current ball, but the current Striker_Id matches neither the previous Striker_Id nor the previous Non_Striker_Id. This means that a new player has come in to bat, so we increment the innings count only for the player whose id equals Striker_Id. The same logic applies to the current Non_Striker_Id.

We will also track the Player_dismissed column to find out whether the player was not out in a particular innings.
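The row-by-row comparison described above is the article's approach. A set-based alternative, sketched here as an assumption rather than the original implementation, counts the distinct (Match_Id, Innings_Id) pairs in which a player appears at either end, then subtracts the innings in which the player was dismissed. It assumes Player_dismissed holds the dismissed player's id and is empty (NaN) otherwise; the real file may need cleaning first.

```python
import numpy as np
import pandas as pd

# Toy rows; NaN in Player_dismissed means no wicket fell on that ball
balls = pd.DataFrame({
    "Match_Id": [1, 1, 1, 1, 2, 2],
    "Innings_Id": [1, 1, 1, 2, 1, 1],
    "Striker_Id": [10, 10, 12, 20, 10, 12],
    "Non_Striker_Id": [11, 11, 11, 21, 12, 10],
    "Player_dismissed": [np.nan, 10, np.nan, np.nan, np.nan, np.nan],
})

keys = ["Match_Id", "Innings_Id"]
# Stack strikers and non-strikers, then keep one row per player per innings
appearances = pd.concat([
    balls[keys + ["Striker_Id"]].rename(columns={"Striker_Id": "Player_Id"}),
    balls[keys + ["Non_Striker_Id"]].rename(columns={"Non_Striker_Id": "Player_Id"}),
]).drop_duplicates()
innings = appearances.groupby("Player_Id").size()

# Count the innings in which each player was dismissed
wickets = balls.dropna(subset=["Player_dismissed"])
dismissals = (
    wickets.assign(Player_Id=wickets["Player_dismissed"].astype(int))[keys + ["Player_Id"]]
    .drop_duplicates()
    .groupby("Player_Id")
    .size()
)
# Players who were never dismissed are missing from `dismissals`, hence fill_value=0
not_outs = innings.sub(dismissals, fill_value=0).astype(int)
```

As in the description above, a player who stood at the non-striker's end without facing a ball is still credited with an innings.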

Calculation of Batting Statistics

The final step is the calculation of batting statistics: Batting Strike Rate (BSR), Batting Average (BA) and Boundary Runs per Inning (BRPI). The BSR, BA and BRPI are first calculated per season, and then the mean of these values is calculated over the number of seasons played by each player. While calculating the average metrics, only those seasons in which the batsman actually played are considered. This removes the bias towards players who have played in all the past seasons.

Calculation of Batting Average per season

Calculation of Batting Strike Rate per season

Calculation of Boundary Runs Per Inning per season
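As a rough sketch, assuming the conventional definitions (BA as runs per dismissal, BSR as runs per 100 balls faced, and BRPI as runs from fours and sixes divided by innings played), the per-season metrics and their per-player means could be computed as follows. The column names, the toy values and the handling of a batsman who was never dismissed in a season are assumptions, not taken from the original code.

```python
import pandas as pd

# Toy per-season totals for two players (made-up values)
season_stats = pd.DataFrame({
    "Player_Id": [1, 1, 2],
    "Season": [1, 2, 1],
    "Runs": [300, 200, 120],
    "Balls_Faced": [200, 160, 100],
    "Innings": [10, 8, 6],
    "Not_Outs": [2, 0, 6],
    "Fours": [30, 20, 10],
    "Sixes": [10, 5, 4],
})

dismissals = season_stats["Innings"] - season_stats["Not_Outs"]
# Assumed convention: a batsman never dismissed in a season gets BA = runs scored
season_stats["BA"] = season_stats["Runs"] / dismissals.where(dismissals > 0, 1)
season_stats["BSR"] = 100 * season_stats["Runs"] / season_stats["Balls_Faced"]
boundary_runs = 4 * season_stats["Fours"] + 6 * season_stats["Sixes"]
season_stats["BRPI"] = boundary_runs / season_stats["Innings"]

# Only seasons present in the table are averaged, i.e. seasons actually played
career = season_stats.groupby("Player_Id")[["BA", "BSR", "BRPI"]].mean()
```

Because a player has a row only for seasons in which he batted, the groupby mean divides by the number of seasons played rather than by the total number of seasons.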

Once we have calculated these statistics, we also derive the age of the player and whether they are domestic or international players. These attributes are important because of the assumption that players who are currently over the age of 40 won’t be playing the next season. Also, with the restrictions around choosing international players in an IPL team, the attribute for domestic players will be important.
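The two extra attributes could be derived along these lines. The column names DOB and Country, the output names Age and Is_Domestic, and the fixed reference date are all assumptions made for this sketch; adjust them to match the actual columns of player_data.

```python
import pandas as pd

# Toy player rows; the column names DOB and Country are assumptions
player_data = pd.DataFrame({
    "Player_Id": [1, 2],
    "DOB": ["1985-03-15", "1990-11-30"],
    "Country": ["India", "South Africa"],
})

# Fixed reference date so the result is reproducible (hypothetical choice)
REFERENCE_DATE = pd.Timestamp("2017-04-01")

dob = pd.to_datetime(player_data["DOB"])
# True when the birthday has already occurred in the reference year
had_birthday = (dob.dt.month < REFERENCE_DATE.month) | (
    (dob.dt.month == REFERENCE_DATE.month) & (dob.dt.day <= REFERENCE_DATE.day)
)
# Whole years elapsed, minus one if the birthday is still to come
player_data["Age"] = REFERENCE_DATE.year - dob.dt.year - (~had_birthday).astype(int)
# 1 for Indian (domestic) players, 0 for international players
player_data["Is_Domestic"] = (player_data["Country"] == "India").astype(int)
```

Encoding the domestic flag as 0 or 1 keeps every feature numeric, which the distance computation in the next section requires.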

Implementation of K-Nearest Neighbors

For the implementation of k-nearest neighbors, we will use scikit-learn and build a NumPy array with only those features that will contribute to the team selection, i.e. BA, BSR, BRPI, Age and Nationality.

X is the numpy array representing those features.

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)

indices will give the indices of the nearest neighbors for each of the rows in X.
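The following self-contained sketch runs the same call on a small made-up feature matrix. It also standardizes the features first with StandardScaler, which is an addition not present in the original article: without scaling, a feature with a large range such as BSR can dominate the distance over a 0/1 nationality flag.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy feature rows: BA, BSR, BRPI, Age, Is_Domestic (made-up values)
X = np.array([
    [35.0, 140.0, 18.0, 28, 1],
    [34.0, 138.0, 17.5, 29, 1],
    [20.0, 110.0, 8.0, 35, 0],
    [21.0, 112.0, 8.5, 34, 0],
    [28.0, 125.0, 12.0, 31, 1],
])

# Put every feature on a comparable scale before measuring distances
X_scaled = StandardScaler().fit_transform(X)

nbrs = NearestNeighbors(n_neighbors=3, algorithm="ball_tree").fit(X_scaled)
distances, indices = nbrs.kneighbors(X_scaled)
```

Each row of indices lists the positions of that player's three nearest rows in X, starting with the row itself.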

Visualization of the first few players and their nearest neighbors. The node numbers are the ids of the players.

For all of the players who have played in a particular team in the current season, kNN will return 3 nearest neighbors that are similar to the existing player.

Out of the 3 players suggested, the first player will be the same as the existing one because the distance of a player with himself is 0.

All the recommended players are added to an ordered collection and sorted from highest to lowest BSR. The program can then return the top n batsmen, where n can be changed as required.
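A minimal sketch of this last step is shown below. The function name recommend, its parameters and the toy data are hypothetical; dropping the first neighbor (the player himself) and removing duplicates are assumptions about how the collection is built.

```python
import pandas as pd

def recommend(indices, team_rows, stats, n):
    """Return the top-n neighbors of the team's players, sorted by BSR."""
    candidates = set()
    for row in team_rows:
        # Skip position 0: it is the player himself at distance 0
        candidates.update(int(i) for i in indices[row][1:])
    # A set removes players suggested by more than one team member
    ranked = stats.iloc[sorted(candidates)].sort_values("BSR", ascending=False)
    return ranked.head(n)

# Toy neighbor table and statistics (made-up values); row i describes player i
indices = [[0, 1, 4], [1, 0, 4], [2, 3, 4], [3, 2, 4], [4, 1, 0]]
stats = pd.DataFrame({
    "Player_Id": [10, 11, 12, 13, 14],
    "BSR": [140.0, 138.0, 110.0, 112.0, 125.0],
})
top = recommend(indices, team_rows=[0, 2], stats=stats, n=2)
```

Changing n in the call changes how many batsmen are returned.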

Final Output recommending 15 players sorted by Batting Strike Rate.

This was a very simple implementation of the batsmen recommendation system. You can view the GitHub repository for the code here. Thank you for reading! :)