Moneyballing Cricket: Probability of 100 Using Repeated Conditioning

Probabilistic analysis of scoring ≥ 100 runs in cricket

Published in

Towards Data Science

8 min read · Jun 4, 2022

Ever since I was a kid watching my team, Pakistan, play, I have been fascinated by the score of 100. I am sure many people can relate to wanting their favorite batsman to score a century. However, everyone knows that not every batsman knock ends in 100. The event is rare, which makes it a perfect subject for a student of probability. In this article, I will explore the probability of a batsman scoring 100, then dive deep into how that probability varies using the rules of conditional probability.

Data & Methodology

Please find important information regarding the data source and methodology:

Data Source: All the data has been sourced from cricsheet.org. They offer ball-by-ball data of ODIs, T20s, and Test matches. I do not own the data but cricsheet data is available under the Open Data Commons Attribution License. Everyone is free to use, build and redistribute the data with proper attribution under this license. Read about the license here.
Data Verification: The founder of cricsheet does a good job of verifying the data, so it contains minimal errors. I also checked aggregates computed from the data against the aggregates available on major cricket sites such as ESPNcricinfo.
Data Dimensions & Time: The dataset contains 1,998 ODI matches played between 2004-01-03 and 2022-04-16, covering almost all major men's ODIs in that period. It contains 1,059,669 balls, 34,466 batsman knocks, and 3,900 innings.
Methodology: The core purpose of this piece is to analyze the probability of making 100 runs in a batsman knock. The methodology is explained in detail later on, but it mostly relies on probabilistic rules such as the law of total probability, conditional probability, and Bayes' rule.

Beginning with the basics

Probability problems can often be broken down into problems of counting. Take the classic textbook example of a six-sided die: to find the probability of rolling a six, one can simulate the die, count the number of times it lands on 6, and divide by the total number of rolls. Given a sufficient number of rolls, this gives the empirically observed probability of rolling a 6. If the die is fair, that value will approach 1/6.
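The die example can be sketched in a few lines of Python. This is only an illustration of the counting idea; the function name `estimate_six_probability` and the fixed seed are my own choices, not part of the analysis.

```python
import random


def estimate_six_probability(num_rolls: int, seed: int = 0) -> float:
    """Estimate P(roll == 6) for a fair six-sided die by simulation."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rolls
    sixes = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == 6)
    return sixes / num_rolls


# The estimate should settle near 1/6 (about 0.1667) as the rolls increase.
for n in (100, 10_000, 200_000):
    print(n, round(estimate_six_probability(n), 4))
```

With only 100 rolls the estimate can be noticeably off; with a few hundred thousand it sits very close to 1/6.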

Similarly, if you were to find the probability of scoring 100 or greater, you could count the number of batsman knocks with a total score ≥ 100 and divide by the total number of batsman knocks. (A batsman knock is the play of one batsman in one match.)

Image by the author. Histogram/Empirical Probability Density Function of batsman score totals. The red region highlights all the knocks that resulted in 100 runs or greater. — `Contains information from cricsheet which is made available under the ODC License`

The above plot shows batsman scores on the x-axis and the corresponding probability density on the y-axis. Counting the batsman knocks that resulted in a score ≥ 100 gives 1,090. The total number of batsman knocks in our dataset is 34,466. Dividing the two, we get 1090/34466 ≈ 3.16%, which means that only about 316 out of 10,000 knocks result in a century or higher score.
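The same count-and-divide step, as a minimal sketch. The helper `empirical_probability` and the toy scores are hypothetical; only the two counts at the end (1,090 and 34,466) come from the dataset described above.

```python
def empirical_probability(scores, threshold=100):
    """Fraction of knocks whose final score is at least `threshold`."""
    if not scores:
        raise ValueError("need at least one knock")
    hits = sum(1 for s in scores if s >= threshold)
    return hits / len(scores)


# Hypothetical toy scores, NOT real cricsheet data.
toy_scores = [0, 4, 12, 27, 35, 51, 68, 99, 100, 134]
print(empirical_probability(toy_scores))  # 2 of the 10 knocks reach 100 -> 0.2

# The article's own counts: 1,090 centuries out of 34,466 knocks.
century_rate = 1090 / 34466
print(f"{century_rate:.2%}")  # 3.16%
```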

A century is certainly a very rare event. As a statistical modeler, I find modeling unlikely events challenging and fascinating at the same time. This event can be modeled as a binary classification problem. However, a very low class prevalence makes it hard for models to predict centuries reliably. To build good models, you should see how the target changes as other variables in the data vary.
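One way to see why low prevalence is awkward for a classifier: a trivial predictor that always says "no century" already looks very accurate while being useless. The sketch below uses only the two counts quoted above; the variable names are mine.

```python
centuries = 1090
total_knocks = 34466

prevalence = centuries / total_knocks

# A "model" that always predicts "no century" is correct on every
# knock that is not a century.
always_no_accuracy = 1 - prevalence

# It never flags a century, so it finds 0 of the 1,090 positives.
always_no_recall = 0 / centuries

print(f"prevalence: {prevalence:.2%}")
print(f"accuracy of always predicting 'no century': {always_no_accuracy:.2%}")
print(f"recall of that predictor: {always_no_recall:.0%}")
```

Any real model has to beat roughly 96.8% accuracy just to match this do-nothing baseline, which is why accuracy alone says little about how well centuries are being identified.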

Arslan Shahid

What is probabilistic conditioning?

Although the concept seems esoteric when described in technical terms, I believe every person has an intuitive idea of what it is. Conditional probability is governed by the formula P(A|B) = P(A & B)/P(B). To get the probability of A given that event B has already happened, you take the probability of the two events happening together, P(A & B), and divide it by the total probability of B occurring, P(B).

Suppose we were to calculate the probability that the Pakistani team sets a score ≥ 200, conditioned on the fact that they were playing against Australia. The probability can easily be estimated empirically by counting the matches in which the Pakistani team scored a total of 200 or higher while playing against Australia and dividing by the total number of matches the Pakistani team has played against Australia.

Image by the author. ‘|’ means conditioned on; ‘∩’ means &.

We can keep adding more conditions, such as whether the match was on Pakistani home ground or Australian home turf, or whether the Pakistani team was chasing or attacking (setting the score). Adding conditions shows how the desired probability changes under different scenarios.
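The Pakistan versus Australia example can be written as a small counting routine. Everything here is a sketch under assumptions: the `Match` record, the helper `conditional_probability`, and the six match totals are invented for illustration and are not taken from the cricsheet data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    batting_team: str
    opponent: str
    total: int


# Hypothetical toy results, NOT real match data. Pakistan bats in all of them.
matches = [
    Match("Pakistan", "Australia", 250),
    Match("Pakistan", "Australia", 180),
    Match("Pakistan", "Australia", 310),
    Match("Pakistan", "Australia", 199),
    Match("Pakistan", "India", 220),
    Match("Pakistan", "India", 260),
]


def conditional_probability(records, event, condition):
    """Estimate P(event | condition) = count(event & condition) / count(condition)."""
    given = [r for r in records if condition(r)]
    if not given:
        raise ValueError("the conditioning event never occurs in the data")
    return sum(1 for r in given if event(r)) / len(given)


def scored_200_plus(m: Match) -> bool:
    return m.total >= 200


def versus_australia(m: Match) -> bool:
    return m.opponent == "Australia"


p_given_aus = conditional_probability(matches, scored_200_plus, versus_australia)
p_overall = conditional_probability(matches, scored_200_plus, lambda m: True)
print(p_given_aus, p_overall)
```

Adding a further condition (home ground, chasing versus setting) only means passing a stricter `condition` function; the counting logic stays the same.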

Conditioning on Team

Frequency of 100+ scoring knocks grouped by teams out of total knocks played by the team in the dataset. Minor teams excluded. Image by the author. — `Contains information from cricsheet which is made available under the ODC License`

In the first section, it was observed that 3.16% of batsman knocks result in a score ≥ 100. The logical next step is to see how likely a century or higher score is depending on which team is batting. For the above graph, the number of batsman knocks that resulted in a score of 100 or more was counted for each team, and the count was divided by the total batsman knocks played by that team to get the frequency.

India has played the second-highest number of knocks in our dataset (3,379) and has the most century knocks. However, it doesn’t have the highest frequency of 100+ scores; that award goes to South Africa, with 5.08% of knocks reaching 100+ runs. Pakistan sits in the middle in terms of both total 100+ knocks (99) and frequency (3.33%), slightly above the team-neutral rate of 3.16%. Zimbabwe has the poorest performance in this regard: out of 2,321 knocks only 30 resulted in a score of 100+, a rate of 1.29%.
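Per-team rates are the same count-and-divide step applied within each group, and the law of total probability ties them back to the overall rate: P(100+) = Σ P(100+ | team) · P(team). The sketch below uses made-up teams "A" and "B" and a hypothetical helper `century_rate_by_team`; only the Zimbabwe figures at the end are from the text above.

```python
from collections import defaultdict


def century_rate_by_team(knocks, threshold=100):
    """Map each team to the fraction of its knocks that reach `threshold`."""
    totals = defaultdict(int)
    centuries = defaultdict(int)
    for team, score in knocks:
        totals[team] += 1
        if score >= threshold:
            centuries[team] += 1
    return {team: centuries[team] / totals[team] for team in totals}


# Hypothetical (team, final score) pairs, NOT real data.
toy_knocks = [
    ("A", 120), ("A", 15), ("A", 101), ("A", 3),
    ("B", 40), ("B", 100), ("B", 0), ("B", 7), ("B", 66), ("B", 12),
]

rates = century_rate_by_team(toy_knocks)  # A: 2 of 4, B: 1 of 6

# Law of total probability: weight each team's rate by its share of knocks.
n = len(toy_knocks)
team_share = {t: sum(1 for team, _ in toy_knocks if team == t) / n for t in rates}
overall = sum(rates[t] * team_share[t] for t in rates)
direct = sum(1 for _, score in toy_knocks if score >= 100) / n
print(rates, overall, direct)

# Zimbabwe's figures quoted above: 30 centuries in 2,321 knocks.
print(f"{30 / 2321:.2%}")  # 1.29%
```

The weighted sum and the direct count agree, which is why a team can sit above or below the 3.16% team-neutral rate while the overall figure stays fixed.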

Conditioning on Balls Survived & Innings

Conditional Probability Function, applying conditioning on balls survived by batsman & innings. Image by the author. — `Contains information from cricsheet which is made available under the ODC License`

Baseline is innings-neutral, Attacking covers 1st-innings knocks, and Chasing covers 2nd-innings knocks. ‘Balls survived’ is the number of balls the batsman has played so far. At every data point, the plot gives the probability that the batsman ends the knock with a 100+ score. Naturally, the longer a batsman survives, the more runs they accumulate, eventually reaching one hundred. The rate is separated by innings (chasing innings in which the opposing team scored fewer than 100 are excluded). A batsman is more likely to score 100 when batting in the first innings than when chasing. A batsman who has survived 60 balls goes on to make 100+ runs 21% of the time in the 1st innings, 19% at baseline, and 16.5% in the 2nd innings. Note: beyond 120 balls survived, very few knocks remain, so those numbers are likely to be unreliable due to small samples.

The likelihood of scoring 100+ is higher in the first innings, and the longer the batsman survives, the higher their chances of making a century. It must be noted, however, that the accumulated score has not been accounted for in this plot. When modeling this problem and predicting the probability (a topic for another article), it seems intuitive to build a combined variable from balls survived and runs accumulated.
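A curve like the one above can be approximated by conditioning on two things at once: balls survived and innings. In this sketch I assume "survived n balls" means the knock lasted at least n balls, which may differ from the exact definition behind the plot. The `Knock` record, the helper name, and the eight knocks are hypothetical.

```python
from typing import NamedTuple, Optional


class Knock(NamedTuple):
    innings: int       # 1 = setting the score, 2 = chasing
    balls_faced: int   # balls the batsman faced in the whole knock
    runs: int          # final score of the knock


# Hypothetical toy knocks, NOT real cricsheet data.
toy_knocks = [
    Knock(1, 130, 121), Knock(1, 75, 60), Knock(1, 20, 11), Knock(1, 98, 104),
    Knock(2, 110, 102), Knock(2, 64, 48), Knock(2, 70, 55), Knock(2, 9, 2),
]


def p_century_given_survival(knocks, balls, innings=None, threshold=100) -> Optional[float]:
    """P(final runs >= threshold | faced >= `balls` balls [, in the given innings]).

    innings=None gives the innings-neutral baseline. Returns None when no
    knock satisfies the condition, since the estimate is then undefined.
    """
    pool = [
        k for k in knocks
        if k.balls_faced >= balls and (innings is None or k.innings == innings)
    ]
    if not pool:
        return None
    return sum(1 for k in pool if k.runs >= threshold) / len(pool)


for label, inn in (("baseline", None), ("1st innings", 1), ("2nd innings", 2)):
    print(label, p_century_given_survival(toy_knocks, balls=60, innings=inn))
```

Returning `None` for an empty pool is a deliberate choice: at high ball counts very few knocks remain, and it is better to show a gap than a number backed by no data.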

Conditioning on Accumulated score & Innings

Conditional Probability Function, applying conditioning on accumulated runs by batsman & innings. Image by the author. — `Contains information from cricsheet which is made available under the ODC License`

The plot shows the likelihood of making a score ≥ 100, conditioned on how many runs a batsman has already made and which innings they are batting in. By the time a batsman has made a half-century (50 runs), they have a 21.5% (1st innings), 19.7% (baseline), or 17.3% (2nd innings) chance of making at least 100 runs. Between an accumulated score of 75 and 80, the probability crosses 50%. Interestingly, there is only a 97.8% chance of making a hundred even after accumulating 99 runs: 2.2% of those knocks end in a dismissal or the end of the match before the batsman gets from 99 to 100.

Unsurprisingly, the more runs a batsman makes, the higher the chance of finishing with at least a century. What is of special interest is how much the probability changes, which can help in constructing a statistical model to predict the outcome of 100+. The curve as a whole is not linear: after 50 runs the probability rises at a much faster pace than before 50 runs, indicating that the event becomes easier and easier to predict as more runs accumulate!
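The repeated conditioning on accumulated runs can be sketched from final scores alone, assuming "has accumulated r runs" is read as "finished with at least r runs". That reading is my simplification; a ball-by-ball definition (being exactly on r at some point) could give slightly different numbers. The function `century_curve` and the toy scores are hypothetical.

```python
def century_curve(final_scores, threshold=100):
    """Return {r: P(final >= threshold | final >= r)} for r in 0..threshold-1."""
    centuries = sum(1 for s in final_scores if s >= threshold)
    curve = {}
    for r in range(threshold):
        reached = sum(1 for s in final_scores if s >= r)
        # No knock reached r runs: the conditional probability is undefined.
        curve[r] = centuries / reached if reached else None
    return curve


# Hypothetical toy final scores, NOT real cricsheet data.
toy_scores = [0, 5, 18, 33, 50, 52, 77, 99, 100, 143]
curve = century_curve(toy_scores)

for r in (0, 50, 99):
    print(r, round(curve[r], 3))
```

Under this reading the curve can never decrease: the number of centuries stays fixed while the pool of knocks that reached r runs only shrinks as r grows. How steeply it rises is the part that has to come from the data.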

Conditioning on Players & Accumulated runs

Conditional Probability Function, applying conditioning on player & accumulated runs by the batsman. Image by the author. — `Contains information from cricsheet which is made available under the ODC License.`

The baseline curve captures the ‘average’ of the 4 players selected. The sample sizes for 100+ on a player level are very small; the top centurion is Virat Kohli, with only 43 knocks ending in a century. So these numbers should be viewed with a healthy dose of skepticism. Nonetheless, it can be seen that Babar Azam has the highest curve, indicative of better performance, but his sample is the smallest, at only 84 innings in the dataset. Virat Kohli’s propensity to score 100s rises faster than that of Martin Guptill and AB de Villiers. It is interesting to note that neither Martin Guptill nor Babar Azam has ever gotten out at 99 runs, whereas both Kohli and de Villiers have.

This concludes the article. I hope you enjoyed it; please subscribe via email and follow me for more content. In the following articles, I will try to build a predictive model for centuries, building on this piece. Stay tuned!

Here are some of my other articles that you will probably enjoy:

Money Balling Cricket — Statistically evaluating a Match: https://medium.com/mlearning-ai/money-balling-cricket-statistically-evaluating-a-match-9cda986d015e
Lies, Big Lies, and Data Science: https://medium.com/mlearning-ai/lies-big-lies-and-data-science-6147e81fb9fc
Money Balling Cricket — Averaging Babar Azam’s Runs: https://medium.com/@arslanshahid-1997/money-balling-cricket-averaging-babar-azams-runs-adb8de62d65b

Thank you!