
A Machine Learning Algorithm for Predicting Outcomes of MLB Games

Could machine learning methods provide an incremental informational edge over the wisdom of the masses?



Raw data is immune to cognitive bias – devoid of emotion and human predisposition. Guided by this precept, Billy Beane’s 2001 Oakland A’s designed a new statistical blueprint for assembling a baseball team, and catalyzed a data analytics revolution that spread like wildfire throughout all facets of professional sports. As Beane discovered, baseball’s wealth of data lends itself well to predictive analytics. The problem I chose to explore is whether machine learning can predict the outcomes of individual games. My final logistic regression and random forest models achieved test accuracies among the highest reported in the existing scientific literature, and outperformed the Vegas betting odds over a two-year test period. My results confirm that the betting odds are highly efficient; however, my findings also suggest that machine learning methods may provide an incremental informational edge over the wisdom of the masses, which could translate to meaningful insights in the long run. My model also shed light on the importance of team pitching, namely that the strength of a team’s bullpen is more indicative of its ability to win than the quality of its offense. Overall, the underlying methodologies and insights drawn from my algorithm may be useful to the strategic decision making of Major League Baseball front offices, or various other sports analytics entities.

To construct the dataset, I employed web scraping to aggregate twenty years of regular-season data, totaling roughly 48,000 games, and scraped game-by-game hitting and pitching statistics. Unlike the existing literature, I also included starting pitchers’ season-long metrics leading up to each individual game. This was a computationally intensive process but, in my view, a key strength of my model, as it provided a meaningful variable with low multicollinearity with the other features. Additionally, I derived a method to measure the consistency of a team by examining the distribution of its runs conceded per game. Other key features in my model were Bill James’s Pythagorean expectation and Log5 metrics, which evaluate a team based on its previous runs scored and winning percentage, respectively. I also included doubleheaders in my analysis, which, in retrospect, were likely not worth the added programming burden.

The outcome of any single MLB game is replete with variance and luck – in some sense bearing similarities to a coin flip. The best existing scientific literature (Elfrink, Valero, Jia, etc.) has produced accuracy levels within the 57–59.5% range. At face value this may seem unexceptional; however, in the context of our problem it is quite good. For reference, the Vegas betting odds – a barometer rightly considered the best predictive metric for professional sports – achieved an accuracy level of merely 58.2% over the past six years (calculated using data from covers.com). Attaining an accuracy level even one percent higher than the consensus could prove extremely valuable. My logistic regression and random forest models achieved standalone classification accuracies of 58.9% and 59.6%, respectively, the latter slightly outperforming the Vegas betting odds by 0.2% over the 2018 and ’19 seasons. When incorporating the betting odds as an extra input variable to the model itself, the combined random forest model achieved an accuracy of 61.1%, outperforming the betting odds by 0.8% in 2019. The logistic regression algorithm attained an accuracy of 60.5% over the same period. Note: for our purposes, I simply define accuracy as the fraction of points correctly classified by an algorithm h: (1/m) ⋅ Σᵢ 𝟙[h(xᵢ) = f(xᵢ)].
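In code, that accuracy metric is just the mean of an indicator over the test set. Here is a minimal sketch, assuming hypothetical arrays y_pred and y_true of predicted and actual home-team outcomes:

```python
import numpy as np

# Hypothetical predicted and actual outcomes (1 = home win, 0 = home loss)
y_pred = np.array([1, 0, 1, 1, 0])
y_true = np.array([1, 0, 0, 1, 1])

# Accuracy: the fraction of games where the prediction matches the actual outcome
accuracy = np.mean(y_pred == y_true)
print(f"Accuracy: {accuracy:.1%}")  # 60.0%
```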


Machine learning is the study of predictive algorithms that improve automatically as they learn from data. The first step is to choose a hypothesis class (a family of classifiers). If we are trying to predict, for instance, how an avocado will taste based on its color and hardness, a possible hypothesis class might be the family of characteristic functions of closed rectangles on the Euclidean plane: {f(x, y) = 𝟙[(x, y) ∈ [x₁, x₂] × [y₁, y₂]]}, and our algorithm will "learn" this class to find the rectangle having the optimal upper and lower boundaries for hardness and color. In other words, this algorithm will predict the quality of new avocados based on the range of hardness and color that most accurately classified the quality of avocados in the past. Intuitively, our goal is to minimize the probability of misclassifying an avocado, known as the true error of our classifier. Of course, the exact underlying probability distribution can never be precisely known since we do not have access to every avocado in the world. However, the percentage of points that our algorithm misclassifies within our sample data, known as the empirical error, can serve as a proxy for the desired true error. The classifier that achieves the minimum error on this sample data is known as the empirical risk minimizer (ERM).
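To make this concrete, below is a minimal sketch of empirical risk minimization over that rectangle hypothesis class, using hypothetical (hardness, color) measurements and taste labels; a brute-force search over candidate boundaries stands in for a more sophisticated learner:

```python
import itertools
import numpy as np

# Hypothetical training data: (hardness, color) features and taste labels (1 = good)
X = np.array([[2.0, 0.3], [3.5, 0.6], [4.0, 0.7], [1.0, 0.9], [5.0, 0.2]])
y = np.array([1, 1, 1, 0, 0])

def rectangle_classifier(h_lo, h_hi, c_lo, c_hi):
    """Characteristic function of the closed rectangle [h_lo, h_hi] x [c_lo, c_hi]."""
    return lambda pts: ((pts[:, 0] >= h_lo) & (pts[:, 0] <= h_hi) &
                        (pts[:, 1] >= c_lo) & (pts[:, 1] <= c_hi)).astype(int)

# Empirical risk minimization: try every rectangle whose edges pass through observed
# values and keep the one with the lowest empirical error on the training sample.
best_bounds, best_error = None, np.inf
for h_lo, h_hi in itertools.combinations_with_replacement(np.unique(X[:, 0]), 2):
    for c_lo, c_hi in itertools.combinations_with_replacement(np.unique(X[:, 1]), 2):
        clf = rectangle_classifier(h_lo, h_hi, c_lo, c_hi)
        error = np.mean(clf(X) != y)
        if error < best_error:
            best_bounds, best_error = (h_lo, h_hi, c_lo, c_hi), error

print(best_bounds, best_error)  # the ERM rectangle and its empirical error
```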

An ERM is a naturally logical approach, but we must be cautious of overfitting – wherein our algorithm performs exceptionally well on the training data but does poorly when predicting new, unseen data. For instance, one could fit a polynomial of degree n−1 to n data points, which would perfectly predict each point in our training data but would generalize poorly. Underfitting, on the other hand, is using a model that lacks sufficient complexity for the data – like applying linear regression to a dataset where a quadratic relationship is present. Known as the bias-variance tradeoff, this notion of choosing a hypothesis class that balances both the simplicity and accuracy of a model in order to generalize well to unseen data is a crucial tenet of machine learning.
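The overfitting example is easy to reproduce. The sketch below, using hypothetical noisy but roughly linear data, fits both a degree-1 polynomial and a degree n−1 polynomial; the latter passes through every training point yet typically swings far from the underlying trend outside the training range:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 2 * x + rng.normal(scale=0.1, size=x.size)  # roughly linear data with noise

simple = np.polyfit(x, y, deg=1)                # degree-1 fit: high bias, low variance
interpolant = np.polyfit(x, y, deg=len(x) - 1)  # degree n-1 fit: zero training error

x_new = 1.2  # a point just outside the training range
print(np.polyval(simple, x_new), np.polyval(interpolant, x_new))
```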

The best way to measure a model’s predictive ability is to set aside a portion of the data and hide it from the analysis at the outset. We then train our model on the unhidden data, known as training data, and subsequently make predictions on the test data to simulate a real-world scenario. Ultimately, we want the accuracies of our model on the training and testing data to roughly equate. A disproportionately higher training accuracy versus testing accuracy is indicative of overfitting. For the problem at hand, I have held out the 2018 and ’19 seasons as the test data, roughly 4,800 games, and trained the model on the 2000–’17 seasons. I chose the training and testing cutoffs chronologically in order to avoid using information drawn from games that have not yet taken place to predict the future, a phenomenon known as data leakage. One final note: predicting the outcome of a baseball game is a binary classification problem, namely, whether the home team will win or lose.
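Here is a minimal sketch of that chronological split, assuming a hypothetical DataFrame with one row per game, a season column, and a binary home_win target; the essential point is that the split is made by season rather than at random:

```python
import pandas as pd

# Hypothetical game-level table (in practice, ~48,000 rows of engineered features)
games = pd.DataFrame({
    "season": [2016, 2017, 2018, 2019],
    "delta_PE": [0.05, -0.02, 0.10, -0.07],
    "home_win": [1, 0, 1, 0],
})

# Split chronologically rather than randomly to avoid data leakage
train = games[games["season"] <= 2017]
test = games[games["season"] >= 2018]

X_train, y_train = train.drop(columns=["home_win"]), train["home_win"]
X_test, y_test = test.drop(columns=["home_win"]), test["home_win"]
```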

Feature Engineering

The data extraction process was computationally intensive and often programmatically demanding. For instance, retrieving and calculating starting pitchers’ individual metrics leading up to each game had a runtime of roughly twenty hours. Nonetheless, acquiring a vast amount of data is typically very helpful, as our training data will better reflect the true underlying distribution, and the empirical error (assuming we do not overfit) should better reflect the true error. The first metric I added through feature engineering was Bill James’s Pythagorean expectation, a proxy for win% that accounts for a team’s total previous runs scored that season versus the runs it has conceded: Pythagorean expectation = (runs scored)² / [(runs scored)² + (runs allowed)²]. Jay Huemann discovered through back-testing that the optimal exponent should be 1.83 as opposed to 2. I ran both methods, and setting the exponent parameter to 1.83 yielded a higher correlation with the target variable, so I elected to use 1.83.
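A minimal sketch of the metric, with the exponent exposed as a parameter so the two variants can be compared (the team totals here are hypothetical):

```python
import numpy as np

def pythagorean_expectation(runs_scored, runs_allowed, exponent=1.83):
    """Bill James's Pythagorean expectation: a win% proxy from runs scored vs. allowed."""
    rs, ra = np.asarray(runs_scored, dtype=float), np.asarray(runs_allowed, dtype=float)
    return rs**exponent / (rs**exponent + ra**exponent)

# Example: a team that has scored 450 runs and conceded 400 prior to today's game
print(pythagorean_expectation(450, 400))                # exponent 1.83
print(pythagorean_expectation(450, 400, exponent=2.0))  # classic exponent of 2
```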

The next feature I added was Log5: P(j beats k) = (Pⱼ − Pⱼ⋅Pₖ) / (Pⱼ + Pₖ − 2⋅Pⱼ⋅Pₖ), where Pⱼ is the percentage of games won by team j and Pₖ is the percentage of games won by team k. The Log5 formula is very intuitive; for a detailed explanation, please see James’s commentary. Essentially, he derived a clever mathematical relationship between the win percentages of team A and team B to determine the probability that team A will emerge victorious. Furthermore, I calculated OBP, SLG, ERA, FIP, and WHIP metrics for each team prior to every game and also included 3-day, 5-day, 7-day, … 15-day averages for each metric. To balance both accuracy and low multicollinearity with other features, I ultimately settled on 10-day trailing averages to capture a team’s short-term performance. These 10-day trailing metrics were considered in addition to a team’s season-long metrics. I also added physicist Kerry Whisnant’s formula. Whisnant set out to create a metric that also accounts for a team’s run-scoring consistency. He found that teams with a higher slugging percentage often have narrower run distributions. Though Whisnant’s formula had a lower predictive ability in my model than the PE or the Log5, it still had a meaningful correlation with the target variable, and it got me thinking about various other ways to capture a team’s consistency.
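Below is a minimal sketch of the Log5 probability and of a 10-day trailing average computed with pandas; the win percentages and per-game OBP values are hypothetical:

```python
import pandas as pd

def log5(p_j, p_k):
    """Bill James's Log5: probability that team j beats team k, given their win percentages."""
    return (p_j - p_j * p_k) / (p_j + p_k - 2 * p_j * p_k)

print(log5(0.60, 0.45))  # a .600 team facing a .450 team

# Hypothetical per-game OBP for one team, indexed by game date
obp = pd.Series(
    [0.310, 0.345, 0.290, 0.360, 0.330],
    index=pd.date_range("2019-06-01", periods=5),
)

# 10-day trailing average, shifted one game so each value uses only prior games
trailing_obp = obp.rolling("10D").mean().shift(1)
print(trailing_obp)
```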

In line with Bill Petti’s discovery, my results indicated that the consistency with which a team concedes runs matters more than the consistency with which it scores runs (higher correlation with the target variable). However, the raw standard deviations of runs scored and runs conceded were not useful metrics on their own because, of course, a team cannot score fewer than zero runs, so the distribution of a team’s runs scored is cut off at zero. In other words, lower-scoring teams will inherently have a lower standard deviation than higher-scoring teams partly just because they score fewer runs. I wanted to isolate and extract only a team’s consistency, regardless of the actual number of runs it scores, in order to compare apples to apples.

My workaround was this: for each of the thirty teams in each of the last twenty seasons, I computed the average of the team’s runs conceded per game as well as the standard deviation of its runs conceded. I then fit a linear regression model to those 600 data points, creating a function whose input is average runs conceded per game and whose output is standard deviation of runs conceded. In other words, this function provides a baseline for what a team’s standard deviation "should" be based on its average runs conceded. A team with a standard deviation below the baseline is more consistent, and vice-versa. For each game I calculated baselines for both teams and from them subtracted each team’s actual standard deviation of runs conceded. I then subtracted team B’s number from team A’s number to gauge the consistency of team A relative to team B, with a higher value indicating that the home team (team A) is more consistent than team B.
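Here is a minimal sketch of that baseline regression and the resulting consistency feature, using hypothetical team-season averages and standard deviations of runs conceded:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical team-season summaries: average and std dev of runs conceded per game
avg_runs_conceded = np.array([3.9, 4.2, 4.5, 4.8, 5.1]).reshape(-1, 1)
std_runs_conceded = np.array([2.7, 2.9, 3.0, 3.2, 3.4])

# Baseline: the std dev a team "should" have given its average runs conceded
baseline = LinearRegression().fit(avg_runs_conceded, std_runs_conceded)

def consistency(avg_conceded, actual_std):
    """Positive values mean the team is more consistent than the baseline predicts."""
    expected_std = baseline.predict(np.array([[avg_conceded]]))[0]
    return expected_std - actual_std

# Relative consistency feature: home team's consistency minus away team's
home, away = consistency(4.3, 2.8), consistency(4.6, 3.3)
print(home - away)
```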

Avg. Runs Conceded Per Game Regressed Against Std Dev of Avg. Runs Conceded Per Game, 2000–2019 (Image by author)

As seen above, there is no discernible pattern among the residuals, so heteroscedasticity is not an issue. The residuals also appear to be normally distributed with a mean of zero. These are the standard assumptions underlying an ordinary least squares fit.

Exploratory Data Analysis

In choosing the right features to feed into a model, we want to capture the input variables most strongly related to the target variable. However, we also want to avoid a scenario where redundant or extraneous variables dilute our model’s efficacy. It is not desirable to have many similar features that give us the same information. In other words, we want input features with low multicollinearity. Below is a correlation heatmap of the variables. All metrics except the Log5 and the Whisnant formula are denoted as the difference (Δ) between that of the home team and that of the away team.
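As a sketch of how such a heatmap can be produced, the snippet below uses hypothetical feature values standing in for the engineered game-level features:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical feature matrix; in practice each row is a game and each column a Δ feature
features = pd.DataFrame({
    "delta_PE": [0.05, -0.02, 0.10, -0.07, 0.03],
    "log5": [0.55, 0.48, 0.61, 0.44, 0.52],
    "delta_WHIP": [-0.10, 0.05, -0.20, 0.15, -0.02],
    "home_win": [1, 0, 1, 0, 1],
})

# Pairwise Pearson correlations, visualized as a heatmap
sns.heatmap(features.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.show()
```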

Correlation Heatmap (Image by author)

As seen above, the Pythagorean expectation (ΔPE) and Log5 have the highest correlations with our target variable. The team pitching metrics appear to be the next most valuable, with WHIP the strongest among them, indicating that limiting walks and hits (the theoretical inverse of OBP) is the most important consideration for a team in order to win games. The difference in starting pitchers’ ERA (ΔSP_ERA) also has desirable properties: low multicollinearity with other features and some meaningful correlation with our target variable. Logically, this makes sense, because the quality of a specific starting pitcher should be largely independent of the quality of a team’s offense or the rest of its pitching staff, and should have a substantial impact on the outcome of the game. The 10-day trailing ΔPE, ΔOBP, ΔSLG, and ΔWHIP are also helpful because they serve as a proxy for a team’s short-term performance and have low multicollinearity with other variables. I elected to omit ΔERA from the model’s final input variables given its high correlation with ΔWHIP and ΔFIP. However, since I have tens of thousands of data points and fewer than 15 features, I elected to keep all other features.

(Image by author)

Above are some distributions of key variables. As you can see, they appear to be normally distributed, which will be helpful when developing the models.

Logistic Regression Model

Logistic regression employs the logit function, the inverse of the more familiar sigmoid, to link a linear combination of the input variables to a probability, from which we obtain a discrete output. We take the equation from simple linear regression and set it equal to the log odds of our event: β₁x + β₀ = log[P/(1−P)]. Solving for P yields 1/(1+e⁻ᶻ), where z = β₁x + β₀. Note: we take the logarithm of the odds in order to obtain a range that is symmetric around the origin. Among the family of potential logistic curves (the hypothesis class), our algorithm selects the curve that maximizes the likelihood of the observed samples.
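A minimal numeric sketch of that relationship, with hypothetical coefficients for a single feature:

```python
import numpy as np

def sigmoid(z):
    """Inverse of the logit: maps log odds back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients for one feature (e.g. the ΔPE between the two teams)
beta_0, beta_1 = 0.1, 4.0
x = 0.08                 # home team's Pythagorean expectation minus the away team's

z = beta_1 * x + beta_0  # log odds of a home win
print(z, sigmoid(z))     # the log odds and the implied win probability
```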

To better understand maximum likelihood, consider a simple and intuitive example: suppose we have a box containing three balls, where each ball can be either green or blue. Say we pick two balls from the box, and they are both green. What should be our best guess for the color of the third ball? Well, there are two possible options given that we have already seen two green balls – either θ₁: the box contains three green balls, or θ₂: it contains two green and one blue. The probability of observing our sample of two green balls, given that the box contains three green balls, is clearly 100%: P(S|θ₁) = 1. For the second case, the probability of picking two green balls given that there are actually two green and one blue ball in the box is P(S|θ₂) = ⅔ ⋅ ½ = ⅓. Clearly, 1 > ⅓, so our "best guess" – the choice that maximizes the likelihood of observing our sample – is three green balls. Our logistic regression model functions this way, selecting the logit curve that maximizes the product of the probabilities of observing each data point given the parameters of that curve: argmax over θ of Πᵢ P(xᵢ|θ). This is how it "learns" the hypothesis class.

A method called regularization protects our model from overfitting by adding a penalty to the loss function for large coefficients. By penalizing increased complexity in our model, we can prevent overfitting. To find the optimal regularization constant, I employ k-fold cross validation. K-fold validation is very clever and intuitive. First, we divide our training data into k folds and hold out the first fold as our validation dataset. We then train the model on the other k−1 folds and test it on the first fold. We repeat this process for each of the k folds, and finally average all of the held-out-fold accuracy levels to arrive at a more precise overall accuracy of the model. In my model, I employ k-fold validation for hyperparameter tuning, performing it each time I modify the model in order to find the optimal parameters. In the logistic regression case, I iterate through regularization constants between zero and one and pick the constant with the highest average accuracy over the k folds. K-fold validation will be extremely useful for the random forest model, which has many more model parameters.
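Here is a minimal sketch of that search using scikit-learn, which parameterizes regularization through C, the inverse of the regularization strength; the training data below are randomly generated stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in training data for the game-level features and home-win labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

best_C, best_score = None, -np.inf
for C in np.linspace(0.01, 1.0, 20):   # C is the inverse regularization strength
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                             X_train, y_train, cv=5, scoring="accuracy")
    if scores.mean() > best_score:
        best_C, best_score = C, scores.mean()

print(best_C, best_score)  # the constant with the best average k-fold accuracy
```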

Trained on the 2000–’17 seasons, the logistic regression algorithm achieved a 58.9% accuracy when tested on the 2018–’19 seasons, versus the Vegas betting odds’ accuracy of 59.4% over that same period (scraped from covers.com). I then elected to incorporate the betting odds themselves into my model as one of the input features. This combined logistic regression model achieved an accuracy of 60.5% on the 2019 season, slightly outperforming the betting odds’ standalone accuracy of 60.3%. I was only able to obtain betting odds dating back to 2014, so in order to reserve a large enough portion of the data for the training set, I used only one season for the test period as opposed to two.

Random Forest Algorithm

The random forest algorithm is predicated on the concept of a decision tree. All of us use decision trees every day; they are merely a sequence of logical if-else questions that ultimately arrive at a conclusion. A decision tree starts with a root node, followed by branch nodes and leaf nodes. To determine what variable to use at the root node, the algorithm picks the one that on its own best separates the data, usually measured with a metric called Gini impurity. In our avocado example, if color has the lowest Gini impurity (if most green avocados taste bad and most black avocados taste good), then logically the first if-else statement at the root node should be: is the avocado green? Intuitively, this makes sense because if we know that color is the single best way to divide up the good and bad avocados, it should be the first question we ask. After the root node, we will have two edges pointing to two more nodes; this time we might ask if the avocado is above or below a certain degree of hardness, and based on this we can classify it as good or bad tasting.
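A minimal sketch of the Gini impurity calculation for a candidate split, using hypothetical avocado labels:

```python
def gini_impurity(labels):
    """Gini impurity of a set of binary labels: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - (p**2 + (1 - p)**2)

def split_impurity(left, right):
    """Weighted Gini impurity of a candidate split into two groups."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Hypothetical avocado labels (1 = tastes good), split by color: green vs. black
green = [0, 0, 0, 1]   # most green avocados taste bad
black = [1, 1, 1, 0]   # most black avocados taste good
print(split_impurity(green, black))  # a low value suggests color is a strong root-node split
```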


Individual decision trees are typically susceptible to overfitting and do not generalize well to new data because complexity increases markedly at each new node. An ingenious method to circumvent this problem while preserving the desirable properties of decision trees is the random forest. Composed of many decision trees (called an ensemble), a random forest, through a process called bagging (bootstrap aggregation), randomly selects samples from the dataset (with replacement) to form a new training set for each tree. It also restricts each tree, at each split, to a random subset of the N features. In practice, randomly selecting roughly √N features has been shown to work well. Each trained on its own bootstrapped dataset, the consortium of trees, all of diverse structure, finally votes on whether to classify a data point as 1 or 0. The output receiving the most votes becomes the algorithm’s prediction. The beauty of the random forest is that many uncorrelated trees will collectively produce a consensus estimate of lower variance and ultimately higher true accuracy than any single decision tree on its own.
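The sketch below shows how those ingredients map onto scikit-learn's RandomForestClassifier; the training data are randomly generated stand-ins for the engineered game-level features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in training data
rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 12))
y_train = (X_train[:, 0] + X_train[:, 1] + rng.normal(size=1000) > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=500,      # many diverse trees vote on each game
    bootstrap=True,        # each tree trains on a resampled copy of the data
    max_features="sqrt",   # each split considers only ~sqrt(N) randomly chosen features
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.predict(X_train[:5]))  # majority vote across the ensemble
```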

Once assembled, the random forest algorithm can be fine-tuned by altering its parameters to find the most accurate model that does not overfit or underfit. Known as hyperparameter tuning, this step is an optimization problem to determine the best combination of maximum depth, number of trees, maximum features per tree, and so on. I used nested for loops to generate many different combinations of parameters and ran k-fold validation each time, selecting for my final model the combination yielding the highest average accuracy over the k folds. The optimal max depth and max features were five and four, respectively. On the 2018–’19 seasons, the random forest model improved upon the logistic regression model and attained an accuracy of 59.6%, versus the betting odds’ accuracy of 59.4%. When including the betting odds as an input variable to the model itself, the random forest algorithm achieved a 61.1% accuracy over the 2019 season compared to 60.3% for the Vegas odds.
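In scikit-learn, GridSearchCV can stand in for the nested for loops plus k-fold validation; the sketch below searches a small hypothetical grid over the parameters mentioned above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in training data (as in the previous sketch)
rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 12))
y_train = (X_train[:, 0] + X_train[:, 1] + rng.normal(size=1000) > 0).astype(int)

param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [3, 5, 7],
    "max_features": [3, 4, 6],
}

# 5-fold cross validation over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```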

(Image by author)

Below are feature importance charts for both the random forest model and a dummy decision tree classifier. Bill James’s Pythagorean expectation is clearly the most valuable feature, reaffirming the self-explanatory notion that to win games you must score more runs than you concede. A more revealing insight is that, after runs and win percentage, pitching is the most important determinant of wins. Prolific home-run hitting and offensive prowess often receive the most media and fan attention, but perhaps front offices should be more concerned about investing in their bullpen. This notion is further accentuated by the fact that ΔSP_ERA ranks so highly in importance despite the fact that OBP and SLG were more highly correlated with the target variable in the exploratory data analysis.
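For reference, here is a minimal sketch of how such a chart can be produced from a fitted random forest; the data and feature names below are hypothetical stand-ins:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in features; in practice X holds the engineered game-level features
feature_names = ["delta_PE", "log5", "delta_WHIP", "delta_SP_ERA"]
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(1000, 4)), columns=feature_names)
y = (2 * X["delta_PE"] + X["log5"] + rng.normal(size=1000) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0).fit(X, y)

# Rank and plot the mean decrease in impurity attributed to each feature
pd.Series(forest.feature_importances_, index=feature_names).sort_values().plot(kind="barh")
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()
```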

(Image by author)

To conclude, I would like to thank Dr. Guillermo Reyes (professor of Mathematics of Machine Learning) and Dr. James Faghmous (professor of Applied Machine Learning in Python) for enthralling me with machine learning this semester and providing invaluable counsel at every step throughout my process.

