In Baseball and Machine Learning Part 1, I outlined the methodology I had used in March to build 2021 MLB hitting projections using a combination of regression and classification algorithms. Then I lied to you. Horribly. I promised the pitching side of my project in "a week or so." Oops. I guess I should have included a confidence interval with that estimate. Here we are, nearly six weeks later, and I am finally delivering.
Since the year began, I have realized more and more what a terrible season this was to try to project. 2020 was the least normal year in modern history to use as a basis for predictions: 60 games, beginning much later than normal, with a contracted ramp-up period and frequent cancellations due to COVID. Oh, and let’s not discount the psychological effects of playing during a pandemic. Now we can add to that the yet-to-be-determined effects of MLB’s crackdown on pitchers’ use of foreign substances on the ball. I still did projections, though, so let’s take a look.
The methodology was largely the same one that I laid out in Part 1 – please take a look at that one for the full write-up. I’ll give a brief overview here, but I will mainly focus on the specific differences we see with pitching vs. hitting and the pitching results.
Methodology Recap
Source Data
Fangraphs. My starting data set contained every pitching statistic they report. I added several fields of my own, all of which I compiled manually. These included indicator variables for league (0 for AL, 1 for NL), whether the player changed teams and/or leagues in the prior off-season, and whether they changed teams/leagues during the season, accounting for trades/cuts/signings. These variables were more subjective than you might expect. If a player appeared on several teams in a season, I looked at where they played the most games. If someone played for six years for an AL team, was signed by an NL team in the off-season, and played seven games for them before being traded back to the AL for the rest of the year, I did not count that as changing leagues. I initially included indicator variables for teams, but I cut those during the feature engineering stage.
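To give a flavor of those manual fields, here is roughly what they look like once encoded (a sketch; the column names are my illustration, not the project’s exact ones):

```python
import pandas as pd

# Hypothetical rows; the real columns were compiled by hand from rosters.
pitchers = pd.DataFrame({
    "Name": ["Pitcher A", "Pitcher B"],
    "League": ["AL", "NL"],
    "ChangedTeamOffseason": [0, 1],
    "ChangedLeagueOffseason": [0, 1],
    "ChangedTeamInSeason": [1, 0],
})

# League becomes a 0/1 indicator: 0 for AL, 1 for NL.
pitchers["League"] = (pitchers["League"] == "NL").astype(int)
```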
I also added lag variables for about 20 statistics (i.e. the values of those statistics from the prior year). This introduced some problems, as there were quite a few players who did not accumulate statistics in the preceding seasons. I opted to remove those cases from the data set under the assumption that I would gain more from including these variables than I would lose by cutting the rows with nulls in these areas.
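Mechanically, a lag feature is just the prior season’s value shifted onto the current row. A minimal pandas sketch (toy data, not the real field list):

```python
import pandas as pd

# Toy season-level rows for one pitcher; the real frame has ~20 lagged stats.
df = pd.DataFrame({
    "Name": ["Pitcher A"] * 3,
    "Season": [2017, 2018, 2019],
    "IP": [180.0, 165.2, 172.1],
    "ERA": [3.50, 4.10, 3.85],
})

# Pull each player's prior-season value onto the current row.
df = df.sort_values(["Name", "Season"])
for stat in ["IP", "ERA"]:
    df[f"{stat}_Lag1"] = df.groupby("Name")[stat].shift(1)

# Players with no prior season produce nulls here; those rows were dropped.
df = df.dropna(subset=["IP_Lag1", "ERA_Lag1"])
```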
I used data from 2016–2019 to train my models. For starting pitchers, I set a floor of 60 innings pitched and cut all data points below that threshold. For relief pitchers, I set the floor at 40 innings pitched.
Data Prep
For the hitting piece of this exercise, I dropped almost all of the Statcast fields from the input feature set during the feature engineering stage. I was much more hesitant to take that step with the pitching data. I did, however, pare the set down, since Fangraphs reports many of these statistics from multiple sources. I cut the Pitch f/x versions of the stats (marked "(pi)" on Fangraphs) and kept the Statcast versions (marked "(sc)"). These fields are also where you will find the overwhelming majority of the missing data. This is logical: not every pitcher throws every pitch type, so there would naturally be gaps. My general approach to missing data was to cut a column if less than 60% of it was populated, as I determined it would not be useful to impute that many missing values. Otherwise, I filled in the gaps in two ways: for percentages, I filled nulls with zeroes, since a missing percentage meant the underlying events never occurred; for everything else, I filled nulls with the median value of each column (by year – I kept the years separate throughout data prep).
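Put together, the missing-data rules look something like this for a single season’s frame (a sketch; it assumes percentage columns can be identified by name):

```python
import pandas as pd

def fill_gaps(season_df: pd.DataFrame, pct_cols) -> pd.DataFrame:
    """Missing-data rules, applied to one season at a time (years kept separate)."""
    # Cut any column that is less than 60% populated.
    season_df = season_df.loc[:, season_df.notna().mean() >= 0.60]

    # Percentages: a null means the underlying events never occurred, so fill 0.
    for col in pct_cols:
        if col in season_df.columns:
            season_df[col] = season_df[col].fillna(0)

    # Everything else numeric: fill with that season's column median.
    num_cols = season_df.select_dtypes("number").columns
    season_df[num_cols] = season_df[num_cols].fillna(season_df[num_cols].median())
    return season_df
```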
Target Variables
I predicted eleven pitching statistics using separate models: innings pitched (IP), wins (W), losses (L), saves (SV), blown saves (BS), holds (HLD), home runs allowed (HRA), strikeouts (K), earned run average (ERA), walks + hits per inning pitched (WHIP), and Fangraphs’ calculated dollar value (Dol). For the training data, I used the results for each of these stats from the following year. For the 2019 data, the target variables were 2020 results that I scaled to 162 games. This was not an ideal solution, but it worked well enough.
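Roughly speaking, that scaling amounts to prorating the counting stats from a 60-game pace to a 162-game pace; rate stats like ERA and WHIP don’t need it. A sketch under that assumption:

```python
import pandas as pd

# Hypothetical 2020 line for one pitcher from the shortened 60-game season.
targets_2020 = pd.DataFrame({"IP": [65.0], "W": [4], "K": [70], "ERA": [3.20]})

# Assumption: counting stats are prorated to a 162-game pace; rate stats
# like ERA and WHIP describe per-inning performance and stay as-is.
SCALE = 162 / 60  # = 2.7
for stat in ["IP", "W", "K"]:
    targets_2020[stat] = targets_2020[stat] * SCALE
```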
Models
I constructed two types of models for this exercise. The first used XGBoost regression to produce a set of predictions for the 2021 statistics listed above. I tuned these models in several steps: first maximum tree depth, then subsample and colsample_bytree together, then learning rate, and finally the number of estimators. I cross-validated using k-fold with ten folds and mean absolute error for scoring. This produced solid overall results, but it ran into a common machine learning problem: models are great at predicting the middle, but terrible at nailing down the outliers. Unfortunately, the outliers are what we care about most when predicting baseball statistics.
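That staged tuning looks something like this with scikit-learn’s GridSearchCV (the grids here are illustrative, not the exact ones I searched, and the data is a stand-in):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))   # stand-in for the prepped feature matrix
y = rng.normal(size=300)         # stand-in target (e.g., next-season IP)

cv = KFold(n_splits=10, shuffle=True, random_state=42)
params = {"max_depth": 6, "subsample": 1.0, "colsample_bytree": 1.0,
          "learning_rate": 0.1, "n_estimators": 100}

# Tune one group of hyperparameters at a time, carrying the winners forward.
stages = [
    {"max_depth": [3, 4, 5, 6, 7, 8]},
    {"subsample": [0.4, 0.5, 0.6, 0.8, 1.0],
     "colsample_bytree": [0.1, 0.15, 0.2, 0.3]},
    {"learning_rate": [0.01, 0.05, 0.1]},
    {"n_estimators": [200, 300, 400, 500]},
]
for grid in stages:
    search = GridSearchCV(XGBRegressor(**params), grid,
                          scoring="neg_mean_absolute_error", cv=cv)
    search.fit(X, y)
    params.update(search.best_params_)
```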
To address this, I prepared a second set of models, this time using XGBoost classifiers. Starting pitchers and relief pitchers naturally produce very different ranges of statistics, so I split them into two data sets for this exercise. I bucketed the various stats into tiers and predicted those tiers (for example, for starting pitchers, 0–49 strikeouts were tier 0, 50–179 were tier 1, and 180 and above were tier 2; tiers 0 and 2 represented the outliers). Since I was deliberately trying to predict the outliers, these were heavily imbalanced data sets. To optimize the models for imbalanced data, I used SMOTE oversampling of the underrepresented classes. As with the regression exercise, I cross-validated using k-fold with ten folds. In this case, though, I used the micro-averaged F1 score in cross-validation because I was looking for the best balance between precision and recall.
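A sketch of that classification setup, using imbalanced-learn’s SMOTE inside a pipeline so the oversampling only touches each training fold (tier edges from the strikeout example above; the data is a stand-in):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))               # stand-in feature matrix
k_totals = rng.integers(0, 260, size=400)    # stand-in next-season strikeouts

# Bucket starter strikeouts into tiers: <50 -> 0, 50-179 -> 1, 180+ -> 2.
y = np.digitize(k_totals, bins=[50, 180])

# SMOTE runs inside the pipeline, so oversampling never leaks into test folds.
pipe = Pipeline([("smote", SMOTE(random_state=42)),
                 ("clf", XGBClassifier(eval_metric="mlogloss"))])

scores = cross_val_score(pipe, X, y, scoring="f1_micro",
                         cv=KFold(n_splits=10, shuffle=True, random_state=42))
print(scores.mean())
```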
I defined success in the classification models by how well they were able to predict outliers. Overall, precision was very good: when the models flagged a result as an outlier, they usually got it right. Recall was not as good, though pitching performed better than hitting in this regard. The summary is below. There are a couple of weird ones in here. For example, relief pitcher innings pitched started at 0% precision and recall and ended at 100% precision. This is partially because there were only two tiers for reliever innings: fewer than 55 IP was tier 0 and 55 or more IP was tier 1, with tier 0 representing the only outliers. Initially, the algorithm predicted that everyone was in tier 1. After applying SMOTE, the model got 100% of those it flagged as tier 0 correct, but it only identified 21.43% of those outliers.


Once all models were tuned and finalized, I combined the results to build my projections: for any predicted middle tiers, I used the regression results. For the extremes, I used the classification results, scaling the tiers to expected ranges.
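The combination step itself is simple; it amounts to something like this (the tier-to-value mapping here is illustrative):

```python
def combine(reg_pred: float, tier: int,
            tier_values: dict, middle_tiers: set) -> float:
    """Regression output for middle tiers; scaled tier value for the extremes."""
    if tier in middle_tiers:
        return reg_pred
    return tier_values[tier]

# Example: starter strikeouts, where tiers 0 and 2 are the outliers.
projection = combine(reg_pred=145.0, tier=2,
                     tier_values={0: 40.0, 2: 210.0}, middle_tiers={1})
print(projection)  # 210.0
```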
Results
Let’s take a look at each stat. I’ll give the optimized hyperparameters and then show you which input variables had the greatest impact on the models. The summary plots are from the SHAP package; they list the input features in order of greatest to least impact on the model. For each feature, the points are colored on a red-to-blue spectrum, where red means a higher feature value and blue means a lower one, and they are laid out left to right by the direction and size of their impact on the prediction. So if all the red points for Age sit all the way to the left, it means higher ages pushed predictions down – in other words, younger pitchers got a boost.
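For anyone who wants to reproduce these plots, SHAP makes it a few lines once a model is fit (a sketch with stand-in data):

```python
import numpy as np
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))    # stand-in for the prepped feature matrix
y = rng.normal(size=300)         # stand-in target

model = XGBRegressor().fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles like XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, feature_names=[f"feature_{i}" for i in range(5)])
```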
Remember, the input features are from the previous season, so if you see IP predicting IP, it means IP from one year predicting IP in the next year. When you see "Lag1" attached to a stat, such as IP_Lag1, it means innings pitched from two years prior. One note: I noticed that two labels did not carry through correctly. If you see "#NAME?", it means "-WPA" (the sum of all negative Win Probability Added events), and "#NAME?.1" means "+WPA" (the sum of all positive Win Probability Added events). I won’t waste your time defining all of the sabermetric stats in here, so if you see a stat you do not recognize, consult the sabermetrics glossary on Fangraphs or, you know, Google.
Innings Pitched (IP)
Best hyperparameters:
{'n_estimators': 400,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 0.8,
'colsample_bytree': 0.1}

Wins (W)
Best hyperparameters:
{'n_estimators': 300,
'learning_rate': 0.01,
'max_depth': 7,
'subsample': 0.5,
'colsample_bytree': 0.2}

Losses (L)
Best hyperparameters:
{'n_estimators': 300,
'learning_rate': 0.01,
'max_depth': 6,
'subsample': 1.0,
'colsample_bytree': 0.1}

Saves (SV)
Best hyperparameters:
{'n_estimators': 200,
'learning_rate': 0.01,
'max_depth': 6,
'subsample': 0.4,
'colsample_bytree': 0.3}

Blown Saves (BS)
Best hyperparameters:
{'n_estimators': 200,
'learning_rate': 0.01,
'max_depth': 6,
'subsample': 0.4,
'colsample_bytree': 0.1}

Holds (HLD)
Best hyperparameters:
{'n_estimators': 200,
'learning_rate': 0.01,
'max_depth': 6,
'subsample': 0.6,
'colsample_bytree': 0.15}

Strikeouts (K)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 3,
'subsample': 0.4,
'colsample_bytree': 0.2}

Home Runs Allowed (HRA)
Best hyperparameters:
{'n_estimators': 300,
'learning_rate': 0.01,
'max_depth': 8,
'subsample': 0.8,
'colsample_bytree': 0.1}

Earned Run Average (ERA)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 6,
'subsample': 0.4,
'colsample_bytree': 0.15}

Walks + Hits Per Inning Pitched (WHIP)
Best hyperparameters:
{'n_estimators': 500,
'learning_rate': 0.01,
'max_depth': 7,
'subsample': 0.6,
'colsample_bytree': 0.2}

Dollar Value (Dol)
Best hyperparameters:
{'n_estimators': 300,
'learning_rate': 0.01,
'max_depth': 4,
'subsample': 0.5,
'colsample_bytree': 0.15}

Takeaways
These were fascinating results to me. The first thing that jumped out at me was the conspicuous absence of age as a factor. In my hitting article, you can see that age was the number one predictor for most hitting statistics, and it is nowhere to be found on the pitching side.
Another thing that struck me as particularly interesting was how prominent a role some of the overall value measures played – particularly the Fangraphs dollar value. The bizarre piece of this was that it was the dollar value from two years prior (Dol_Lag1) that was the most predictive factor. It was the number one predictor for IP, W, and K, and it figured prominently in several of the other stats. I’m still digesting this result. It’s intriguing.
K/9+ was a strong predictor of the ERA and WHIP ratios. I feel as though experts keyed in on this one some time ago, but it was interesting to see in practice. K/BB+ also had a strong showing. That one has been very popular among experts in recent years.
I suspect that there are some confounding variables within the innings pitched categories. Innings Pitched, Starter-IP, and Reliever-IP all make appearances in these models, but they are heavily correlated, so the algorithm would only have picked up one per decision tree. When you add all of the trees up to get the ensemble model, that means that one or more of the innings pitched categories might be shortchanged in the final results.
Projection Output
[2022 note: In evaluating these results in April 2022, I found a significant error. My scores are correct, but the ones from THE BAT are not. More details in the 2022 article.]
Once I had my completed projections, I calculated scores using scoring formulas I developed. I then used the same formulas to score Derek Carty’s THE BAT projections so that I could see where the biggest differences were. I only looked at players with 60 or more projected innings – it just isn’t useful to compare projections for anyone who is not projected for a full season. First, we have the players who my projections said would do better than THE BAT:

I will take this opportunity to reiterate that this is the dumbest possible year to be doing this particular exercise, with the COVID year baked into the training data, MLB changes to the ball, a preposterous number of injuries (some of the above miiiight be related), and now the MLB crackdown on the use of sticky stuff. A chunk of the names above have been injured. Some lost their starting jobs (we’ll never know how that David Price projection might have played out). We have a lot of season left, but some of those calls look really good (Burnes, Musgrove) and others look like colossal misses (Bundy, Castillo).
Now, let’s look at where the models have predicted worse production than THEBAT:

This batch is interesting. To this point in the season, the algorithms seem to have hit a lot more than they missed for the All Pessimism Team. Jackson, Brubaker, Wittgren, Williams, and Bassitt have been solid, which counts against my projections. The other ten have lined up much more with my projections than not (though we can give Shane Greene a pass for the moment, since he spent so much time without a job).
Wrap Up
This concludes the grand pitching exploration. I hope it got the gears turning for you in some way.
I would love to hear your input on this. What did you think of my approach? What should I do differently in future iterations? I am making my code and data files available on GitHub.