Simulating the 2018 World Series

Published in

Towards Data Science

9 min read · Oct 23, 2018

An interesting part of the Major League Baseball season is that the hard work of 162 games gets reduced to a 5 or 7 game winner-take-all series in the playoffs. Each series is played only once, which allows for randomness to enter the equation. The better team doesn’t always win.
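To put a rough number on that randomness, here is a small sketch of the exact chance of winning a short series when a team has a fixed chance of winning each game. It assumes every game is independent with the same win probability, which is a simplification; the function name `series_win_probability` and the 55% figure are illustrative choices, not values from the project.

```python
from math import comb

def series_win_probability(p, num_games_to_win):
    """Exact chance of winning a series given a fixed per-game win chance p.

    Assumes every game is independent with the same win probability.
    """
    q = 1 - p
    # The team wins the clinching game after losing k games along the way.
    return sum(
        comb(num_games_to_win - 1 + k, k) * p**num_games_to_win * q**k
        for k in range(num_games_to_win)
    )

five_game = series_win_probability(0.55, 3)   # first to 3 wins
seven_game = series_win_probability(0.55, 4)  # first to 4 wins
print(f"{five_game:.3f} {seven_game:.3f}")
```

Under these assumptions, a team that wins 55% of individual games takes a 5-game series roughly 59% of the time and a 7-game series roughly 61% of the time, so the weaker team still advances quite often.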

What if we could actually play the series many times and find the most probable outcome? This is the basic idea behind simulation: Providing inputs into an equation or model, repeating the model many times, then finding the number of occurrences that match the desired outcome.

Baseball simulations can be extremely complex and granular, down to a batter-by-batter simulation adjusted for the pitcher, ballpark, and other factors. That level of detail is admirable, but I decided to take a new, different approach to simulations.

Instead, I’ll look at the game-by-game performance by batters (runs scored) and pitchers (runs allowed) against a team’s opponent and other teams similar to that opponent to simulate scores and series winners.

All project code is on my GitHub here.

The Data

The data for this project includes annual team statistics for both pitching and batting that I scraped for a previous project predicting playoff teams. This data will allow us to compare the statistics of one team to others in their league. Examples of the data can be found below:

In addition to the annual statistics, the project requires 2018 game-by-game results for each team. Again, baseball-reference.com was a helpful resource as we can navigate to a team’s 2018 schedule and scrape the data that was needed, such as the score of the home team and visiting team as well as the opponent.

The Process

The basic assumption of this simulation is that we can:

1. Find a team’s most similar batting and pitching “neighbors” (i.e., teams with the most similar end-of-season statistics in each category) within their league (American or National). Then, we can see how a team performed against their opponent plus their opponent’s “neighbors” for batting and pitching, respectively, during the regular season.
2. Using the outcomes of the games against similar opponents, create a distribution of runs scored (batters) and runs allowed (pitchers) during the regular season games. We’ll then “simulate” a single batting game by randomly drawing a score for each team from the runs scored distribution. We’ll separately “simulate” a single pitching game by randomly drawing a runs allowed value for each team from the runs allowed distribution. The game’s winner is the team with the higher runs scored value (batting) or the lower runs allowed value (pitching).
3. Simulate the series by continuing to simulate games using the method described in point (2) until one of the teams reaches 3 wins (5-game series) or 4 wins (7-game series).
4. Repeat the series simulation process 20,000 times for the batters and 20,000 times for the pitchers for a total of 40,000 simulations.
5. Calculate the percentage of times that a team won the series; this represents the total probability predicted from the model that the team would win the series (>50% = series winner).

In effect, we want to know how a team’s batters and pitching staff performed against teams “like” their opponent in the opposite category (i.e., how a team’s hitters performed against pitching staffs similar to their opponent’s, and how a team’s pitching staff performed against hitters similar to their opponent’s) and use that information to simulate games.

We can test this process on the playoff series that have already been completed to get an early view of its accuracy, and then use it to simulate the World Series winner.

Let’s go through each of the steps individually using the Red Sox vs. Yankees series as an example:

The function to simulate the batting outcome is the code on the left; the pitching function can be found in the linked code on GitHub.

Similar Team Performance

Though playoff baseball is different from regular season baseball, we can make the assumption that how a team plays in the regular season will be similar to how they play in the playoffs. Therefore, if we can find how a team played against an opponent, and against other teams similar to that opponent, we can have a reasonable proxy for their playoff performance. This concept is the crux of the modeling presented in the following paragraphs.

How will we define similarity?

To calculate similarity, we can fit a Nearest Neighbors model to our batting and pitching data sets. The unsupervised machine learning algorithm calculates the distance between points and returns those that are closest to the inputted value.

from sklearn.neighbors import NearestNeighbors

# Example: Find the closest 5 neighbors to the Yankees
neighbors = NearestNeighbors(n_neighbors=5)
neighbors.fit(df)

# Find the distance from the Yankees on our fitted data frame and index values
distance, idx = neighbors.kneighbors(df[df.index == 'NYY'], n_neighbors=5)
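To make the distance idea concrete, here is a dependency-free sketch of what a nearest-neighbors lookup does: measure the Euclidean distance from one team’s stat vector to every other team’s and keep the closest few. The team labels, the three statistics, and all of the numbers are made up for illustration, and the helper `closest_teams` is a hypothetical name; the project itself relies on scikit-learn’s NearestNeighbors and the scraped season statistics.

```python
import math

# Toy, made-up stat vectors (e.g. runs per game, on-base pct, slugging pct).
# These are NOT real 2018 numbers; they only illustrate the mechanics.
team_stats = {
    "TEAM_A": (5.2, 0.330, 0.450),
    "TEAM_B": (5.1, 0.328, 0.445),
    "TEAM_C": (4.0, 0.300, 0.380),
    "TEAM_D": (4.9, 0.325, 0.440),
    "TEAM_E": (3.8, 0.295, 0.370),
}

def closest_teams(target, stats, k):
    """Return the k teams nearest to `target` by Euclidean distance."""
    distances = []
    for team, vector in stats.items():
        if team == target:
            continue  # skip the team itself
        distances.append((math.dist(stats[target], vector), team))
    distances.sort()  # smallest distance first
    return [team for _, team in distances[:k]]

neighbors = closest_teams("TEAM_A", team_stats, k=2)
print(neighbors)  # ['TEAM_B', 'TEAM_D']
```

Two things are worth keeping in mind. When the query row is part of the fitted data, as in the snippet above, scikit-learn’s `kneighbors` returns the team itself as its own closest match at distance zero. And because the distance is computed on raw values, statistics measured on larger scales will dominate unless the columns are standardized first; this sketch does no scaling.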

Because our goal is to have a sufficiently large number of games played, I limited the neighbors to the most similar 5 teams (plus the opponent) in the opponent team’s league. This means that for the Yankees and Red Sox, we’ll look for the five most similar teams in the American League.

If we were to expand to both the American and National Leagues, and the Yankees’ most similar neighbors were all National League teams that the Red Sox hadn’t played, that wouldn’t help us create our distributions. So, we sacrifice some potentially closer neighbors in exchange for a larger sample of games.

Using the neighbors, we can see how the Yankees’ batters performed in games against teams similar to the Red Sox’s pitching staff (i.e., how many runs do they typically score?) and how the Yankees’ pitching staff performed against teams similar to the Red Sox’s hitters (i.e., how many runs do they typically allow?).

# Note how we ensure that the opponent is in the opposite category (i.e. batting performance is against the other team's pitching neighbors).
nyy_batting_performance = df_nyy[df_nyy.opponent.isin(bos_pitching_neighbors.index)]
nyy_pitching_performance = df_nyy[df_nyy.opponent.isin(bos_batting_neighbors.index)]

This results in a data frame of the Yankees’ performance against the Red Sox and the five teams that finished with the most similar results to their pitching staff:

A sample of the Yankees’ performance in games against teams similar to the Red Sox pitching staff
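For readers who want to see the filtering step without pandas, the sketch below does the same thing on a plain list of game records and then pulls out the runs scored, which is the distribution the simulation later draws from. The field names, opponent labels, and scores are all hypothetical and made up for illustration.

```python
# Hypothetical game log for one team: each record is one regular-season game.
# Opponent labels and scores are made up for illustration.
games = [
    {"opponent": "OPP_A", "runs_scored": 5, "runs_allowed": 3},
    {"opponent": "OPP_B", "runs_scored": 2, "runs_allowed": 4},
    {"opponent": "OPP_C", "runs_scored": 7, "runs_allowed": 1},
    {"opponent": "OPP_A", "runs_scored": 1, "runs_allowed": 6},
    {"opponent": "OPP_D", "runs_scored": 4, "runs_allowed": 4},
]

# The playoff opponent plus the teams whose pitching staffs look most like it.
opponent_pitching_neighbors = {"OPP_A", "OPP_C"}

# Keep only games against those teams, then pull out the runs scored.
similar_games = [g for g in games if g["opponent"] in opponent_pitching_neighbors]
runs_for_team = [g["runs_scored"] for g in similar_games]
print(runs_for_team)  # [5, 7, 1]
```

The pitching side would work the same way, filtering on the opponent’s batting neighbors and collecting `runs_allowed` instead.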

Simulating a Game

The heavy lifting of the process comes in step 1 while the rest can mostly be carried out with a series of if statements and for loops.

For a game simulation, we randomly choose a value from each team’s batting data frame and compare the results, with the higher score being the winner. For the pitchers, we follow the same process, but the team that gives up fewer runs is the winner.

The code below provides a single game simulation that ensures the two scores will never be the same:

## Draw a random number from the distribution of runs in games against similar opponents
team1_score = np.random.choice(runs_for_team1)
team2_score = np.random.choice(runs_for_team2)

## Repeat simulations until the score is not the same
while team1_score == team2_score:
    team1_score = np.random.choice(runs_for_team1)
    team2_score = np.random.choice(runs_for_team2)

## If team 1’s score is higher, that team wins. Otherwise, team 2 is credited with the win.
if team1_score > team2_score:
    team1_wins += 1
elif team2_score > team1_score:
    team2_wins += 1
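The same logic can be wrapped in a small function so that it is easy to reuse and test. This is a minimal sketch using only the standard library’s `random` module rather than NumPy; the function name `simulate_game`, the `higher_wins` flag, and the sample values are assumptions for illustration, not the project’s actual code.

```python
import random

def simulate_game(values_team1, values_team2, rng, higher_wins=True):
    """Return 1 if team 1 wins the simulated game, otherwise 2.

    For batting, pass runs scored and keep higher_wins=True.
    For pitching, pass runs allowed and set higher_wins=False.
    """
    while True:
        team1_value = rng.choice(values_team1)
        team2_value = rng.choice(values_team2)
        if team1_value == team2_value:
            continue  # a tie is not allowed, so draw again
        team1_is_higher = team1_value > team2_value
        return 1 if team1_is_higher == higher_wins else 2

rng = random.Random(0)  # seeded so the sketch is reproducible

# Made-up values: team 1 always outscores team 2 here.
batting_winner = simulate_game([5, 6, 7], [1, 2, 3], rng)
# Made-up values: team 1 always allows fewer runs here.
pitching_winner = simulate_game([1, 2, 3], [5, 6, 7], rng, higher_wins=False)
print(batting_winner, pitching_winner)  # 1 1
```

One caveat of redrawing on ties: if both lists contained only a single, identical value, the loop would never end, so real inputs need at least some variation.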

Simulating a Series

Building on the last section, we can simulate a five or seven game series by repeating the process until a team reaches our desired number of games to win:

## Start each team with 0 wins
team1_wins = 0
team2_wins = 0

## Once one of the teams reaches the desired number of games to win, we append either a 1 (if team 1 wins) otherwise a 0 (indicating team 2 wins)
if (team1_wins == num_games_to_win) | (team2_wins == num_games_to_win):
    winner.append([1 if team1_wins == num_games_to_win else 0])
    total_games.append(team1_wins + team2_wins)

    ## Stop the simulation and start fresh once we have a winner
    break

Up to here, this process represents one single series simulation.
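Put together, one series simulation might look like the sketch below. It is a self-contained approximation rather than the project’s function: `simulate_series` is a hypothetical name, it uses the standard library’s `random` module, and the run values are made up.

```python
import random

def simulate_series(runs_for_team1, runs_for_team2, num_games_to_win, rng):
    """Simulate one series; return (winner, total_games).

    winner is 1 if team 1 takes the series and 0 if team 2 does.
    """
    team1_wins = 0
    team2_wins = 0
    while team1_wins < num_games_to_win and team2_wins < num_games_to_win:
        team1_score = rng.choice(runs_for_team1)
        team2_score = rng.choice(runs_for_team2)
        if team1_score > team2_score:
            team1_wins += 1
        elif team2_score > team1_score:
            team2_wins += 1
        # Equal scores credit no one, so the game is effectively redrawn.
    winner = 1 if team1_wins == num_games_to_win else 0
    return winner, team1_wins + team2_wins

rng = random.Random(0)  # seeded so the sketch is reproducible

# Made-up values where team 1 always outscores team 2: a four-game sweep.
print(simulate_series([5, 6, 7], [1, 2, 3], 4, rng))  # (1, 4)
```

Returning the number of games played alongside the winner makes it possible to look at how long the simulated series tend to run, not only who wins them.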

Repeat the Process and Calculate the Percentage

As mentioned earlier, a single game or series introduces randomness, but the randomness should eventually become a cohesive story with repeated iterations.

For this reason, we’ll repeat the process thousands of times (20,000 in this case, for both the pitching and the batting) and calculate the percentage of times that a team wins the simulated series. This number then becomes our probability that a team wins the series, with the team above 50% being our predicted winner.
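As a rough sketch of this repetition step, the function below runs many simulated series and reports the share won by team 1. The name `estimate_series_win_probability`, the seed, the number of simulations used here, and the run values are all assumptions for illustration; it covers only the batting-style comparison, where the higher value wins.

```python
import random

def estimate_series_win_probability(runs_for_team1, runs_for_team2,
                                    num_games_to_win, n_sims, seed=0):
    """Estimate how often team 1 wins the series across n_sims simulations."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    series_won_by_team1 = 0
    for _ in range(n_sims):
        team1_wins = 0
        team2_wins = 0
        while team1_wins < num_games_to_win and team2_wins < num_games_to_win:
            team1_score = rng.choice(runs_for_team1)
            team2_score = rng.choice(runs_for_team2)
            if team1_score > team2_score:
                team1_wins += 1
            elif team2_score > team1_score:
                team2_wins += 1
            # Equal scores credit no one, so the game is effectively redrawn.
        if team1_wins == num_games_to_win:
            series_won_by_team1 += 1
    return series_won_by_team1 / n_sims

# Made-up run distributions, not real game logs.
even_matchup = estimate_series_win_probability(
    [2, 3, 4, 5, 6], [2, 3, 4, 5, 6], num_games_to_win=4, n_sims=5000)
lopsided = estimate_series_win_probability(
    [5, 6, 7], [1, 2, 3], num_games_to_win=4, n_sims=5000)
print(even_matchup, lopsided)
```

With identical distributions the estimate should land close to 0.5, and with completely separated distributions it is exactly 1.0, which is a quick sanity check that the loop behaves as intended.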

Does it work?

The process is logical, but does it work? To fully evaluate the power of this model would require significant back testing, with multiple decades of data being most helpful. For the sake of this article, I have tested it on the 2018 playoff results to provide example outputs.

Running the algorithm for the series to date, including wildcard play-in games, resulted in 6 out of 8 correct predictions. Assuming a 50/50 chance of guessing right, that is 25 percentage points better than a coin flip for the 2018 playoffs. Of course, with such a small sample, that isn’t statistically significant yet, but it’s worth noting.

The only incorrect predictions were the two series with the Red Sox, which is quite ironic since they’re now in the World Series.
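To see why eight series are too few to draw conclusions, we can ask how often pure coin-flip guessing would get at least 6 of 8 right. The sketch below computes that binomial tail probability; the helper name `prob_at_least` is hypothetical, and treating each prediction as an independent 50/50 guess is a simplifying assumption.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of at least k successes in n independent tries."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_value = prob_at_least(6, 8)
print(round(p_value, 4))  # 0.1445
```

In other words, random guessing would match or beat this record about 14% of the time, well above the conventional 5% threshold.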

Probability of each team winning the series and accuracy of predictions

2018 World Series Prediction

So, who’s going to win the World Series? The model says the Dodgers, and it’s fairly confident with a 67% win probability, but betting against the Red Sox the first two times wouldn’t have produced very good results.

As with any probability, it only says that, according to the model, the Dodgers are more likely to win, not that they will win every time.

Conclusion and Next Steps

Probabilities can be a fun way to approach a static, one-time event, with the understanding that the probability itself can never be the literal outcome. The Yankees can’t win 52% of a series, but we can see who may have an edge within a game or series.

There are a few ways to carry on the project:

Back test! This data represents only a small portion of the playoff series ever played. The code is set up in a way that running the scraping algorithm can take in any season year and team, then the functions can be called with those teams.
Speaking of the functions, I think that those could be condensed further to simplify the call to run them. I’d be interested in hearing anyone’s feedback on suggestions to minimize the number of calls that I had to make to run each simulation.
Pulling in additional statistics could potentially improve the accuracy further. I limited myself to what was on baseball-reference when computing similarity, but that doesn’t mean it’s the only or best data to use.

These represent just a few suggestions if anyone is interested in picking up the torch from here.

In sports, there will always be an element of randomness, human intervention, and just plain luck. Ultimately, data and simulations will never tell us the winner with 100% accuracy or replicate the excitement of sitting down and watching the game, but when we bring data into the picture, we can look at the games and matchups through a different lens that can also provide interesting insights.

If you have any questions or feedback, I’d love to hear from you! Send me an email at jordan@jordanbean.com or reach out to me via LinkedIn.

Written by Jordan Bean