Using Random Forests to Help Explain Why MLB Players Make the Hall of Fame

How local importance scores can be used to obtain class-wise overall variable importance to help understand the why.

Chris Kuchar
Towards Data Science



Introduction

Last fall I wrote an article titled Rfviz: An Interactive Visualization Package for Interpreting Random Forests in R, a tutorial for a visualization tool in R. Here I want to offer a way to get the same kind of results without the visualization tool. This article is therefore a code-only example of using the local importance scores from the randomForest library in R to identify overall variable importance on a class-wise level. Using this method, we will explore why some MLB players make the Hall of Fame while most do not.

Theoretical Background

Random Forests

Random forests (Breiman (2001)) fit a number of trees (typically 500 or more) to regression or classification data. Each tree is fit to a bootstrap sample of the data, so some observations are not included in the fit of each tree (these are called out of bag observations for the tree). Independently at each node of each tree, a relatively small number of predictor variables (called mtry) is randomly chosen and these variables are used to find the best split. The trees are grown deep and not pruned. To predict for a new observation, the observation is passed down all the trees and the predictions are averaged (regression) or voted (classification).
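
As a minimal sketch of those mechanics (using the built-in iris data rather than our baseball data), fitting a classification forest and checking its out-of-bag error looks like:

```r
library(randomForest)

set.seed(42)
# 500 trees; mtry defaults to sqrt(p) for classification.
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 500)

# The confusion matrix is computed from out-of-bag predictions only,
# so it is an honest estimate of generalization error.
rf_demo$confusion
```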

Variable Importance

A local importance score is obtained for each observation in the data set, for each variable. To obtain the local importance score for observation i and variable j, randomly permute variable j for each of the trees in which observation i is out of bag, and compare the error for the variable-j permuted data to actual error. The average difference in the errors across all trees for which observation i is out of bag is its local importance score.

The (overall) variable importance score for variable j is the average value of its local importance scores over all observations.
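
Continuing the iris sketch, passing localImp = TRUE stores the local importance matrix (one row per variable, one column per observation), and averaging it across observations recovers an overall permutation importance; this should closely track the unscaled MeanDecreaseAccuracy:

```r
library(randomForest)

set.seed(42)
# localImp = TRUE stores a p x n matrix of per-observation scores.
rf_demo <- randomForest(Species ~ ., data = iris, localImp = TRUE)

dim(rf_demo$localImportance)  # 4 variables x 150 observations

# Overall importance as the mean of each variable's local scores;
# compare with the unscaled permutation importance.
rowMeans(rf_demo$localImportance)
importance(rf_demo, type = 1, scale = FALSE)
```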

Example

Our data are statistics from Major League Baseball. The question we will try to help answer is why some MLB players make the Hall of Fame while most do not.

We will use batting and fielding statistics, as well as end-of-season awards, for those players; pitching statistics and pitchers’ awards are ignored in this example. Using Random Forests in R, we’ll compute the overall variable importance on a class-wise basis and see how the results help answer the question.

This example uses data from the R package Lahman, which is a baseball statistics dataset that contains pitching, hitting, fielding, and awards statistics for Major League Baseball from 1871 to 2019 (Lahman 2020). It is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

For definitions of the variables we used to aggregate and create our dataset, see: http://www.seanlahman.com/files/database/readme2017.txt

Data Preparation and Exploration

library(Lahman)
library(randomForest)
library(dplyr)

# Get the list of players inducted into the Hall of Fame
HallOfFamers <-
  HallOfFame %>%
  group_by(playerID) %>%
  filter((votedBy == "BBWAA" | votedBy == "Special Election")
         & category == "Player") %>%
  summarise(inducted = sum(inducted == "Y")) %>%
  ungroup()
# Batting Statistics
Batting_Stats <-
  Batting %>%
  group_by(playerID) %>%
  mutate(total_BA = sum(H) / sum(AB),
         total_AB = sum(AB), total_R = sum(R),
         total_X2B = sum(X2B), total_X3B = sum(X3B),
         total_H = sum(H), total_HR = sum(HR),
         total_RBI = sum(RBI), total_SB = sum(SB), total_CS = sum(CS),
         total_BB = sum(BB), total_SO = sum(SO),
         total_IBB = sum(IBB), total_HBP = sum(HBP),
         total_SH = sum(SH), total_SF = sum(SF),
         total_GIDP = sum(GIDP)) %>%
  select(playerID, total_BA, total_AB, total_R, total_X2B, total_X3B, total_H, total_HR,
         total_RBI, total_SB, total_CS, total_BB, total_SO, total_IBB, total_HBP, total_SH,
         total_SF, total_GIDP) %>%
  group_by_all() %>%
  summarise() %>%
  arrange(desc(total_H)) %>%
  ungroup()
# Fielding Statistics
Fielding_Stats <-
  Fielding %>%
  group_by(playerID) %>%
  mutate(POS = toString(unique(POS)), total_G = sum(G), total_GS = sum(GS),
         total_InnOuts = sum(InnOuts), total_PO = sum(PO), total_A = sum(A),
         total_E = sum(E), total_DP = sum(DP), total_PB_Catchers = sum(PB),
         total_WP_Catchers = sum(WP), total_SB_Catchers = sum(SB),
         total_CS_Catchers = sum(CS), mean_ZR = mean(ZR)) %>%
  select(playerID, POS, total_G, total_GS, total_InnOuts, total_PO, total_A,
         total_E, total_DP, total_PB_Catchers, total_WP_Catchers,
         total_SB_Catchers, total_CS_Catchers, mean_ZR) %>%
  group_by_all() %>%
  summarise() %>%
  arrange(desc(total_G)) %>%
  ungroup()
# End of Season Awards
Season_Awards <-
  AwardsPlayers %>%
  group_by(playerID) %>%
  mutate(Count_All_Star = sum(awardID == 'Baseball Magazine All-Star' |
                                awardID == 'TSN All-Star'),
         Count_MVP = sum(awardID == "Most Valuable Player"),
         Count_Silver_Slugger = sum(awardID == "Silver Slugger"),
         Count_Gold_Glove = sum(awardID == "Gold Glove")) %>%
  select(playerID, Count_All_Star, Count_MVP, Count_Silver_Slugger, Count_Gold_Glove) %>%
  group_by_all() %>%
  summarise() %>%
  ungroup()
# Joining the datasets together
HOF_Data <- Batting_Stats %>%
  full_join(Fielding_Stats, by = 'playerID') %>%
  left_join(Season_Awards, by = 'playerID') %>%
  left_join(HallOfFamers, by = 'playerID') %>%
  # Filling in NA's based on data type
  mutate_if(is.integer, ~replace(., is.na(.), 0)) %>%
  mutate_if(is.numeric, ~replace(., is.na(.), 0)) %>%
  mutate_if(is.character, ~replace(., is.na(.), 'xx')) %>%
  mutate_if(is.character, as.factor)  # Converting characters to factors for Random Forests

# Double-checking the NA fill and converting the response variable to a factor.
HOF_Data[is.na(HOF_Data)] <- 0
HOF_Data$inducted <- as.factor(HOF_Data$inducted)

# Looking at the spread of inducted vs. not inducted to the Hall of Fame
table(HOF_Data$inducted)

The result below shows that our data set has 19,773 players who haven’t been inducted into the Hall of Fame, and 125 who have:

A glimpse of the data shows its structure and column types:

Model Output

# For purposes of exploratory analysis, we will not be splitting into training and test sets.
# Quick trick for getting all the variables separated by a '+' for the formula.
# (We just omit the response variable and any others we don't want before
# copying and pasting the output.)
paste(names(HOF_Data), collapse = '+')
rf <- randomForest(inducted ~ total_BA + total_AB + total_R + total_X2B + total_X3B +
                     total_H + total_HR + total_RBI + total_SB + total_CS + total_BB +
                     total_SO + total_IBB + total_HBP + total_SH + total_SF + total_GIDP +
                     total_G + total_GS + total_InnOuts + total_PO + total_A + total_E +
                     total_DP + total_PB_Catchers + total_WP_Catchers +
                     total_SB_Catchers + total_CS_Catchers + mean_ZR + Count_All_Star +
                     Count_MVP + Count_Silver_Slugger + Count_Gold_Glove,
                   data = HOF_Data,
                   localImp = TRUE, cutoff = c(0.8, 0.2))
rf

It looks like the model separates the training data reasonably well between the two classes, even with a guessed cutoff of (0.8, 0.2) rather than an optimized one.

Class-Wise Overall Variable Importance

Next, let’s look at what the local importance scores are saying are most important to classifying those who make the Hall of Fame.

# Rather than using the model for predictions, I am using it to see how the
# trees separated the data, so this is more of an unsupervised learning
# problem. Because of this, I am okay predicting on the data set I used to
# train the model.
HOF_Data$predicted <- predict(rf, HOF_Data)

# Looking at the class-wise variable importance for those who were
# classified as making the Hall of Fame.
HF <- data.frame(t(rf$localImportance[,which(HOF_Data$predicted==1)]))
sort((apply((HF),2,mean)), decreasing=TRUE)

The call sort(apply(HF, 2, mean), decreasing=TRUE) averages the local importance scores over only the players the model classified as making the Hall of Fame (class 1), because we subsetted to that class in the previous step, and then sorts the averages. The result is overall variable importance at the class level.
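
The subset-and-average step can be wrapped in a small helper (a hypothetical convenience function of my own, not part of randomForest) so the same class-wise averaging works for any predicted class:

```r
# Hypothetical helper: average the local importance scores over only
# the observations the model assigned to a given class.
classwise_importance <- function(rf, predicted, class) {
  scores <- rf$localImportance[, predicted == class, drop = FALSE]
  sort(rowMeans(scores), decreasing = TRUE)
}

# classwise_importance(rf, HOF_Data$predicted, 1)  # Hall of Famers
# classwise_importance(rf, HOF_Data$predicted, 0)  # everyone else
```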

Now let’s look at the opposite class:

#Looking at the class-wise variable importance for those who were classified as not making the Hall of Fame.
NHF <- data.frame(t(rf$localImportance[,which(HOF_Data$predicted==0)]))
sort((apply((NHF),2,mean)), decreasing=TRUE)

We can see that the two classes rank different variables as most important to their predictions. But what are those differences? Let’s dive into them for the players classified as making the Hall of Fame.
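
One quick way to see those differences is to line the two per-class rankings up in a single data frame, reusing the class-wise mean importance vectors computed above (compare_ranks is a name of my own, not from any package):

```r
# Given two named vectors of class-wise mean importance scores,
# rank the variables within each class and join the rankings.
compare_ranks <- function(imp_a, imp_b) {
  data.frame(variable = names(imp_a),
             rank_a = rank(-imp_a),
             rank_b = rank(-imp_b)[names(imp_a)],
             row.names = NULL)
}

# compare_ranks(apply(HF, 2, mean), apply(NHF, 2, mean))
```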

Answering the Why

According to the local importance scores from the random forest, the top 5 features for being classified as making the Hall of Fame, for batters/fielders, are:

  1. Count_All_Star (how many times a player was named an All-Star)
  2. total_G (how many total games a player played)
  3. total_SH (how many total sacrifice hits a player recorded)
  4. total_RBI (how many total runs a player batted in)
  5. total_HR (how many total home runs a player hit)

Let’s take three of these, Count_All_Star, total_HR, and total_G, and compare the data between the two classes:

summary(HOF_Data[HOF_Data$predicted == 1, 'Count_All_Star'])
summary(HOF_Data[HOF_Data$predicted == 0, 'Count_All_Star'])
summary(HOF_Data[HOF_Data$predicted == 1, 'total_HR'])
summary(HOF_Data[HOF_Data$predicted == 0, 'total_HR'])
summary(HOF_Data[HOF_Data$predicted == 1, 'total_G'])
summary(HOF_Data[HOF_Data$predicted == 0, 'total_G'])

We can see that the local importance scores from the random forest are saying that more All-Star selections, more home runs hit, and more games played are three of the top five most important variables for being classified as making the Hall of Fame as a batter/fielder. Using this method, we can examine the actual data spreads and the differences between the classes fairly quickly.
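
Summary tables work, but plotting the spreads makes the separation easier to see. A quick base-R sketch (plot_by_class is a hypothetical helper of my own; when called, it assumes the HOF_Data frame with its predicted column built above):

```r
# Draw side-by-side boxplots of one career statistic by predicted class.
plot_by_class <- function(df, stat) {
  boxplot(df[[stat]] ~ df$predicted,
          xlab = "Predicted class (0 = not inducted, 1 = inducted)",
          ylab = stat,
          main = paste(stat, "by predicted class"))
}

# plot_by_class(HOF_Data, "Count_All_Star")
# plot_by_class(HOF_Data, "total_HR")
# plot_by_class(HOF_Data, "total_G")
```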

Note: Looking at the apparent outliers: Charlie Gehrich, who in fact made the Hall of Fame, was classified as not making it; he is a misclassification. The player who hit 762 home runs and was classified as not making the Hall of Fame is Barry Bonds, a known steroid user, and the player who played 3528 games and was classified as not making it is Pete Rose, shut out for betting on games. It’s somewhat ironic that the latter two didn’t make the Hall of Fame and that the random forest also classified them as not making it.

Conclusion

In conclusion, someone with a moderate knowledge of baseball might know that joining the 500 home run club or being a 10-time All-Star gives a player a decent shot at the Hall of Fame. It doesn’t guarantee it, though, since there are no clear-cut rules for induction. On the flip side, what if someone has no prior knowledge of Major League Baseball, or of whatever topic they are modeling? Using local importance scores and class-wise overall variable importance with Random Forests in R, they can help explain the why of their classification problem.

References:

Breiman, L. 2001. “Random Forests.” Machine Learning. http://www.springerlink.com/index/u0p06167n6173512.pdf.

Breiman, L, and A Cutler. 2004. Random Forests. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_graphics.htm.

Beckett, C. 2018. Rfviz: An Interactive Visualization Package for Random Forests in R. https://chrisbeckett8.github.io/Rfviz.

Lahman, S. (2020) Lahman’s Baseball Database, 1871–2019, Main page, http://www.seanlahman.com/baseball-archive/statistics/

Pleskoff, Bernie. “What Are the Standards for Election to the National Baseball Hall of Fame?” Forbes, Forbes Magazine, 22 Jan. 2020, https://www.forbes.com/sites/berniepleskoff/2020/01/21/what-are-the-standards-for-election-to-the-national-baseball-hall-of-fame/?sh=17c57173149e.
