Building an AI that Can Beat You at Your Own Game

Louis Lafair
Towards Data Science
16 min read · Mar 14, 2018


At the age of 11, I invented a board game. I didn’t think anything would come of it. Eight years, three publishers, two agents, and one crazy story later, Pathwayz arrived on shelves. The game proceeded to win Dr. Toy Best Pick of 2014, and was recommended by Mensa in 2015. Pathwayz has since sold online, at Barnes & Noble, and in hundreds of stores across North America.

Pathwayz is similar to Othello and Go. Artificial Intelligence has already mastered those games. In 1997, Logistello defeated the Othello world champion. In 2016, AlphaGo defeated the Go world champion. But no one had ever created an AI for Pathwayz. In 2017, Michael Tucker, Nikhil Prabala, and I decided to change that. We were enrolled in CS 221: Artificial Intelligence — Principles and Techniques. We applied what we learned to build PAI, the world’s best (and only) Pathwayz AI.

Our primary challenge was the game’s large tree size. We experimented with various reflex and search models, which we evaluated through bracket play. Our best PAI consists of 52 domain-specific features, with weights trained in over 5,000 games via temporal difference learning, applied in a narrowing beam minimax algorithm. Against amateur human players, PAI won 33 out of 33 games. Against our team, PAI defeated Nikhil Prabala 5–0, Michael Tucker 4–1, and me 3–2. At the age of 22, I lost to a computer at my own game.

PAI’s playstyle has taught us new strategies. PAI, by playing against itself, has also allowed us to test for a first-player advantage; in 1,000 games, the first player had a win rate of 49.4%, so there does not appear to be a significant advantage for either player.

This post chronicles, in detail, how we created PAI. I am sharing this work so others can learn from and build upon it. Our version of PAI is remarkably skilled for not employing neural nets. Our careful feature selection, with lookahead capabilities, allowed for efficient development of a top-tier player. In the next iteration, neural nets and other improvements present the potential for superhuman play.

I’ve included basic terminology explanations so readers with less exposure can learn more about Pathwayz and/or AI. As a result, this is a long post. Feel free to skip around to whichever sections you find interesting or relevant. Or you can skip reading altogether and instead play against versions of PAI here. I suggest playing in this order: 1. PAI Baseline, 2. PAI Advanced Baseline, 3. PAI TDL, and 4. PAI TDL Beam Minimax.

Introduction to Pathwayz

Pathwayz is a 2-player strategy game. Like in Othello and Go, you must plan ahead to outmaneuver your opponent. One player is white; the other is black. The goal is to build a path in your color across the long side of an 8x12 board. Paths can be orthogonal or diagonal, weaving back and forth in any direction.

White just won this game by connecting the left and right sides of the board.

Each turn, you can place a regular piece in your color or a permanent piece in your opponent’s color. Why would you give your opponent a permanent piece? Because it flips the surrounding pieces.

A: White places a regular piece, circled red. B: Black places a permanent piece, which flips the four adjacent pieces.

The full instructions are here, and a sample game is here.

A Timeline of Game-Playing Artificial Intelligence

Here is a timeline of game-playing AI:

From Deep Blue to AlphaGo, AIs have defeated the world champions of Chess, Go, and other board games.

AIs are now better than humans at Backgammon, Checkers, Chess, Othello, and Go. See Andrey Kurenkov’s A ‘Brief’ History of Game AI Up to AlphaGo for a more in-depth timeline.

In 2017, Michael Tucker, Nikhil Prabala, and I set out to create PAI, the world’s first AI for Pathwayz. The AIs for Othello and Backgammon were especially relevant to our development of PAI.

Othello, like Pathwayz, is a relatively young game — at least compared to the ancient Backgammon, Checkers, Chess, and Go. There is not a massive database of grandmaster games with which to train an AI. Instead, Logistello and other Othello AIs have relied on domain-specific features (i.e. strategic elements specific to Othello).

TD-Gammon, a Backgammon AI, achieved success through temporal difference learning (a type of reinforcement learning). It learned by playing against itself, adjusting feature values along the way.

PAI similarly uses domain-specific features and temporal difference learning. I will discuss these techniques more later in this post.

Millions, Billions, Trillions, and Zillions of Moves

The difficulty of building a game-playing AI is tied to the game’s tree size. Tree size is a result of breadth and depth.

Consider the tree size of tic-tac-toe as a simple example. When the board is empty, there are 9 possible moves: a breadth of 9. After the first move, there are 8 possible moves: a breadth of 8. In this case (and in other placement games), the breadth decreases as the board fills: 9, 8, 7, 6, 5, 4, 3, 2, 1.

Tic-tac-toe has a maximum of 9 plies. (“Ply” is just a fancy word for a single player’s turn. In some games, where a single player’s turn is considered “one turn,” 1 ply = 1 turn. In other games, where both players’ turns are considered “one turn,” 1 ply = ½ turn. I’ll stick with “ply” to avoid confusion.) Let’s say that I win in 5 plies. The depth of that game is 5. The maximum depth of tic-tac-toe is 9.

A total game tree consists of all possible move combinations.

Here is a small portion of tic-tac-toe’s game tree, just the first three plies of a single branch.

We can estimate tic-tac-toe’s total tree size with breadth to the power of depth. Let’s pretend that everyone is a good tic-tac-toe player, so all games end in a draw, filling the whole board. In this case, the depth is 9. The average breadth is 5. If we take breadth to the depth, or 5⁹, we get around 1.95 * 10⁶, which is tic-tac-toe’s (super approximate) tree size. We can use a similar method to estimate the tree size for other games.
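If you want to play with the arithmetic, here is a tiny Python sketch of the breadth-to-the-depth estimate (illustrative only; the Pathwayz, Chess, and Go figures below are rougher estimates of the same kind):

```python
# Back-of-the-envelope game-tree size: (average breadth) ** depth.
def tree_size(avg_breadth, depth):
    return avg_breadth ** depth

# Tic-tac-toe: breadth shrinks 9, 8, ..., 1 (average 5); a drawn game is 9 plies deep.
print(f"tic-tac-toe ~ {tree_size(5, 9):.2e}")  # ~1.95e+06
```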

Here is the complexity — approximate breadth, depth, and tree size — of Pathwayz, Chess, and Go.

Our main challenge with building PAI was its large tree size. 6.60 * 10¹²⁷ is huge. It’s more than the number of grains of sand on the planet. In fact, 6.60 * 10¹²⁷ is more than the number of grains of sand it would take to fill the entire universe.

In this respect, Pathwayz is reminiscent of Chess and Go. Its tree size is larger than that of Chess. Its tree breadth is halfway between that of Chess and Go. For handling such a complex tree, we experimented with different approaches. These approaches fell into two categories:

  1. Reflex Models (random, baseline, adversarial baseline, hand-weighted smart features, and TDL-weighted smart features) where PAI just evaluates its current set of moves.
  2. Search Models (Monte Carlo tree search, expectimax, minimax, beam minimax, and narrowing beam minimax) where PAI looks ahead, exploring the game tree.

Reflex Models: Applying an Evaluation Function

We used a state-based model to simulate the game tree. A state describes the current player and board layout.

Each ply, PAI looks at all legal moves and chooses the one it considers “best.” How does it decide what move is best? PAI uses an evaluation function, calculating the value of the given game state. We experimented with a variety of possible functions.
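In code, a reflex player boils down to an argmax over legal moves. Here is a minimal Python sketch of the idea (not our actual implementation); legal_moves, apply_move, and evaluate are hypothetical stand-ins for the game engine’s interface:

```python
def reflex_move(state, legal_moves, apply_move, evaluate):
    """Pick the legal move whose resulting state the evaluation function likes best.

    legal_moves, apply_move, and evaluate are hypothetical placeholders for the
    game engine, not our actual API.
    """
    return max(legal_moves(state),
               key=lambda move: evaluate(apply_move(state, move)))
```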

Level 1: Random

PAI chooses a random move. This obviously isn’t a good strategy…

Level 2: Baseline

PAI chooses the move that maximizes myLongestPath: the number of columns that PAI traverses in a contiguous path.

White’s myLongestPath is 4, and black’s is 5 (note that the two unconnected pieces do not contribute).

When a strategy performs better than baseline 50% of the time, we know that the strategy is reasonable.

Level 3: Adversarial Baseline

PAI chooses the move that maximizes myLongestPath – 0.4 * yourLongestPath. Play improves significantly since PAI now advances its own path and blocks its opponent’s. Adversarial baseline sometimes beats beginning players.

When a strategy performs better than adversarial baseline 50% of the time, we know that the strategy is good.
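As a sketch, the adversarial baseline’s evaluation function is just a weighted difference of the two path lengths (longest_path is a hypothetical stand-in for our path-tracing code):

```python
def adversarial_baseline_eval(state, me, opponent, longest_path):
    """Adversarial baseline: my path length minus 0.4 times my opponent's.

    longest_path(state, player) is a hypothetical helper that counts the columns
    a player's contiguous path spans (myLongestPath in the text above).
    """
    return longest_path(state, me) - 0.4 * longest_path(state, opponent)
```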

Level 4: Hand-Weighted Smart Features

myLongestPath and yourLongestPath are just two ways of evaluating a game state. There are countless other factors that signal how well a player is doing. For example, how many permanent pieces do you each have? More or fewer than your opponent? How many columns do you control? Is your path easily flipped? Easily blocked? Do you use diagonal, forked, or straight formations? Etc.

These are called domain-specific features or smart features. Michael, Nikhil, and I brainstormed 440 possible features, then narrowed the list down to our favorites. Our best PAI uses only 52, for reasons I’ll explain later.

We devised ways to measure the following for each player:

  • Path length
  • Path permanence
  • Number of columns
  • Number of permanent pieces
  • Number of regular pieces with x empty neighbors
  • Number of pieces you can flip at once

Our breakthrough was establishing lookahead features. By storing the left and right frontiers of paths, we were able to efficiently estimate:

  • Future path length
  • If a path is blocked
  • If someone is one ply away from winning

We created various other features, including squared, interaction, and indicator features. We handpicked weights: the positive or negative value of each feature. We assigned higher magnitude weights to features that we thought were more important. For example, we tried giving myLongestPath a weight of 10, yourLongestPath a weight of -6, and myNumPerm—my number of permanent pieces—a weight of 3.

We created an evaluation function by normalizing the feature values (i.e. squishing them all into the same range), multiplying by their weights, and adding the results. We also created a utility function of +/- one million, added to a player’s evaluation function in the case of winning/losing.
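Putting that together, the hand-weighted evaluation looks roughly like the sketch below (not our production code; features is a hypothetical helper that returns the feature values already squished into a common range):

```python
UTILITY = 1_000_000

def evaluate(state, features, weights, winner=None, me=None):
    """Linear evaluation: normalized feature values times hand-picked weights, summed,
    plus a +/- one-million utility if the game is already won or lost.

    features(state) and the winner/me arguments are hypothetical stand-ins for the
    game engine's interface.
    """
    score = sum(w * f for w, f in zip(weights, features(state)))
    if winner is not None:                       # terminal position
        score += UTILITY if winner == me else -UTILITY
    return score
```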

PAI with hand-weighted smart features was a good player. As the inventor of Pathwayz, having played thousands of games, probably more than anyone else, I was well-suited to weight features. However, I weighted them mostly by instinct. Among our 52 favorite features, I didn’t know how their importance truly compared. That’s where temporal difference learning helped.

Level 5: TDL-Weighted Smart Features

PAI learned feature weights via temporal difference learning (TDL). For each ply, PAI used its current weights to evaluate the board — storing this value V1. Then PAI moved, the opponent moved, and PAI again used its weights to evaluate the board — storing this new value V2. In TDL, the reward prediction error (R) is V2 minus V1. In other words, your error is how wrong your original evaluation function was compared to the new board state. The learning rate alpha (α) is the rate at which you learn. An alpha of 1.0 would mean you ignore old values altogether and just consider the newest data. An alpha of 0.5 would mean you average the old values and newest data. Most alphas are significantly smaller (we experimented with 0.00001 to 0.01) so that you learn gradually, building up stronger evidence over time.

Every two plies, PAI calculated the reward prediction error (R). Given the learning rate (α), PAI updated its vector of feature weights, adding R * α. If PAI is doing better than expected after two plies, the board’s previous features get weighted more positively. If PAI is doing worse than expected after two plies, the features get weighted more negatively. We provided a large positive/negative end-game reward for winning/losing. PAI played against versions of itself in 10,000 games to learn weights for its domain-specific features.
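For a linear evaluation function like ours, one common form of this update also scales the error by each feature’s value. Here is a rough Python sketch of that TD(0)-style update (illustrative, not our exact training code):

```python
def td_update(weights, features_before, v1, v2, alpha):
    """One temporal-difference weight update for a linear evaluation function.

    v1 is the value PAI predicted before its move; v2 is the value two plies later.
    The reward prediction error R = v2 - v1 nudges each weight in proportion to
    the feature value it multiplied (a common linear TD(0) form; a sketch only).
    """
    error = v2 - v1
    return [w + alpha * error * f for w, f in zip(weights, features_before)]
```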

The key elements of TDL are exploitation and exploration. We wanted PAI to exploit what it had learned, making better and better moves. However, we also wanted PAI to keep exploring, instead of getting too focused on one scenario. If PAI always played the “best” current move, it could get stuck in a local minimum, overlooking a better path. Therefore, we used exploration probabilities. PAI explored among its top ten moves based on a random epsilon (highest scored move if ε < ½, second highest if ε < ¾, third highest if ε < ⅞, etc.).
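The exploration rule is easy to sketch in Python (ranked_moves is a hypothetical list of moves already sorted from best to worst):

```python
import random

def explore_choice(ranked_moves, top_k=10):
    """Pick among the top-ranked moves with geometrically decaying probability:
    best move if eps < 1/2, second if eps < 3/4, third if eps < 7/8, and so on,
    with the leftover probability falling to the last candidate. A sketch of the
    exploration rule described above.
    """
    candidates = ranked_moves[:top_k]
    eps = random.random()
    threshold = 0.5
    for move in candidates[:-1]:
        if eps < threshold:
            return move
        threshold += (1 - threshold) / 2   # 1/2, 3/4, 7/8, ...
    return candidates[-1]
```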

We also wanted to regularize weights, so we created an automated training pipeline. If PAI wins too many games while training, it becomes overconfident, increasing all of its feature weights (including ones that should be negative). After winning at least 60% of games, PAI’s opponent adopts PAI’s current weights, which ensures that PAI is always challenged at an appropriate skill level.

Conversely, if PAI loses too many games, it becomes pessimistic, decreasing all of its feature weights (including ones that should be positive). After winning under 40% of games, PAI switches to play against adversarial baseline, which it can learn to consistently beat. PAI subsequently builds back confidence, regularizing features before returning to a more challenging opponent.

We used an automated training pipeline. If PAI ever won too few games, PAI’s opponent became adversarial baseline. If PAI ever won too many games, PAI’s opponent adopted its current weights.
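In code, the pipeline is a loop that tracks PAI’s win rate and swaps opponents at the 40%/60% thresholds. Here is a rough sketch (play_game, the .weights attribute, and the other names are hypothetical stand-ins, not our actual training code):

```python
def train(pai, opponent, adversarial_baseline, play_game, num_games=5000):
    """Self-play training loop with the win-rate guardrails described above.

    play_game(pai, opponent) returns True if PAI wins (performing the TDL weight
    updates along the way); players expose a .weights list. All names here are
    hypothetical stand-ins for our pipeline.
    """
    wins = 0
    for game in range(1, num_games + 1):
        wins += play_game(pai, opponent)
        win_rate = wins / game
        if win_rate >= 0.60:                      # overconfident: opponent catches up
            opponent.weights = list(pai.weights)
        elif win_rate < 0.40:                     # pessimistic: rebuild confidence
            opponent = adversarial_baseline
    return pai
```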

We put PAI, with 52 features, through this pipeline for 5,000 games.

Here are the top 15 feature weights after the first 5,000 games of training. There are two key takeaways. 1: Lookahead features make up 8/15 of the top features. 2: Positive features make up 12/15. PAI focuses on getting ahead more than on blocking.

PAI with TDL-weighted features can consistently beat amateur human players. We eventually put PAI, with 440 features, through this pipeline for another 5,000 games. (This version performed slightly worse, for reasons I’ll explain later.)

Search Models: Exploring the Game Tree

Level 6: Monte Carlo Tree Search

Aside from the random player, the above versions of PAI all relied on an evaluation function to score the current set of possible moves. But none of them looked ahead to future plies.

Monte Carlo tree search (MCTS) is an algorithm for “looking ahead.” Instead of using an evaluation function, PAI rapidly tests as many scenarios as possible. It randomly plays tons and tons of games against itself (in our case: 250,000 games per ply). Based on the subsequent win rates, PAI can estimate which moves lead to better outcomes.

The results aren’t amazing. Random moves in Pathwayz are often terrible (e.g. giving your opponent a permanent piece), so random sampling is a weak predictor.
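For the curious, here is what a simplified, flat-Monte-Carlo version of the idea looks like in Python (full MCTS adds tree and selection bookkeeping; legal_moves, apply_move, and random_playout are hypothetical stand-ins):

```python
def monte_carlo_move(state, me, legal_moves, apply_move, random_playout,
                     num_playouts=250_000):
    """Flat Monte Carlo move selection: spread the playout budget across the legal
    moves and pick the one with the best random-rollout win rate.

    random_playout(state) plays random moves to the end of the game and returns
    the winner (a hypothetical helper).
    """
    moves = legal_moves(state)
    per_move = max(1, num_playouts // len(moves))

    def win_rate(move):
        child = apply_move(state, move)
        wins = sum(random_playout(child) == me for _ in range(per_move))
        return wins / per_move

    return max(moves, key=win_rate)
```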

Level 7: Expectimax

How about exploring the game tree with our evaluation function?

Expectimax is an algorithm for looking ahead against a random opponent. For each of PAI’s possible moves, it considers all of its opponent’s subsequent moves, applies our evaluation function, and averages these values. PAI chooses the move with the highest average value. Because our evaluation function takes time (compared to simulating random moves), PAI isn’t able to look far ahead. Especially at the beginning of the game, when the tree breadth is huge, looking far ahead takes seemingly forever. So PAI only looks ahead one ply.

Note that expectimax optimizes for a random opponent. Like with MCTS, the results aren’t amazing because random sampling is a weak predictor in Pathwayz.
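A one-ply expectimax is short enough to sketch directly (same hypothetical helpers as before):

```python
def expectimax_move(state, legal_moves, apply_move, evaluate):
    """One-ply expectimax: for each of our moves, average the evaluation over all
    of the opponent's replies, then pick the move with the highest average.
    legal_moves, apply_move, and evaluate are hypothetical stand-ins.
    """
    def average_after_replies(child):
        replies = legal_moves(child)
        if not replies:                          # no reply available / game over
            return evaluate(child)
        return sum(evaluate(apply_move(child, r)) for r in replies) / len(replies)

    return max(legal_moves(state),
               key=lambda move: average_after_replies(apply_move(state, move)))
```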

Level 8: Minimax

How about exploring the game tree against a non-random opponent?

Minimax is an algorithm for looking ahead against the best possible opponent. Minimax works similarly to expectimax. However, instead of averaging across the opponent’s possible moves, PAI assumes the opponent will make their best possible move. PAI maximizes the evaluation function for its own moves while minimizing the evaluation function for its opponent’s moves.

Note that minimax doesn’t play optimally against suboptimal opponents. In other words, it will always assume the best from other players, instead of capitalizing on common mistakes.

A more pressing concern is that minimax (like expectimax) can’t look far ahead in Pathwayz. Because of the huge tree breadth, especially early game, PAI is initially limited to looking ahead by only one ply. However, we did increase lookahead depth as the game progressed, so PAI minimax is a formidable end-game player.
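Here is a rough sketch of depth-limited minimax (no alpha-beta pruning, to keep it short; helper names are hypothetical):

```python
def minimax_value(state, depth, maximizing, legal_moves, apply_move, evaluate):
    """Depth-limited minimax: PAI maximizes the evaluation on its plies and assumes
    the opponent minimizes it on theirs. Helper names are hypothetical stand-ins.
    """
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)
    values = (minimax_value(apply_move(state, move), depth - 1, not maximizing,
                            legal_moves, apply_move, evaluate)
              for move in moves)
    return max(values) if maximizing else min(values)
```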

Level 9: Beam Minimax

Can PAI look farther ahead?

Beam minimax is an algorithm for increasing lookahead depth by decreasing breadth. At each ply, PAI chooses the top X moves, then only looks ahead at those moves. In subsequent levels, PAI again reduces the breadth — only looking ahead at its opponent’s best X moves, then its own best X moves. (In our case, X was five.)

If the best long-term move is outside these X moves, then beam minimax will miss it. With unlimited computing power, minimax is always a better player than beam minimax. (When the opponent is one move from winning, we made PAI revert to the more thorough minimax algorithm. Computation time remains reasonable at this stage because of the limited breadth.) However, under our limited time and computing resources, beam minimax allows PAI to look ahead an extra ply, which makes it a superior player.
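Beam minimax is the same recursion with one extra step: each level only expands the top X children by evaluation score. A sketch (helper names hypothetical, X = 5 in our case):

```python
def beam_minimax_value(state, depth, maximizing,
                       legal_moves, apply_move, evaluate, beam_width=5):
    """Beam minimax: like minimax, but each level only expands the beam_width moves
    that score best under the evaluation function. Helper names are hypothetical.
    """
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)
    beam = sorted(moves, key=lambda m: evaluate(apply_move(state, m)),
                  reverse=maximizing)[:beam_width]
    values = (beam_minimax_value(apply_move(state, m), depth - 1, not maximizing,
                                 legal_moves, apply_move, evaluate, beam_width)
              for m in beam)
    return max(values) if maximizing else min(values)
```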

Level 10: Narrowing Beam Minimax

Our final version of PAI uses a narrowing beam minimax. PAI simulates its top ten moves, then its opponent’s top five moves, then its own best move. This narrowing beam allows PAI to look ahead three plies, while considering a larger set of (ten) possible moves, within a reasonable time period.

Here is an example of narrowing beam minimax, with breadths of 3, then 2, then 1 at each level.
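The narrowing variant just replaces the constant beam width with a per-level schedule, e.g. (10, 5, 1) for ten of PAI’s moves, five opponent replies, then a single best continuation. A sketch building on the beam version above (helper names still hypothetical):

```python
def narrowing_beam_value(state, widths, maximizing,
                         legal_moves, apply_move, evaluate):
    """Narrowing beam minimax: widths gives the beam size for each remaining level,
    e.g. (10, 5, 1) for our ten-then-five-then-one search. Helper names are
    hypothetical stand-ins for the game engine.
    """
    moves = legal_moves(state)
    if not widths or not moves:
        return evaluate(state)
    beam = sorted(moves, key=lambda m: evaluate(apply_move(state, m)),
                  reverse=maximizing)[:widths[0]]
    values = (narrowing_beam_value(apply_move(state, m), widths[1:], not maximizing,
                                   legal_moves, apply_move, evaluate)
              for m in beam)
    return max(values) if maximizing else min(values)
```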

Quantitative Results

To compare the performance of our various PAIs, we placed them in a round-robin tournament. The PAIs all played one another in 20 games, for a total of 720 games. Although there is no known first- or second-player advantage, we had each PAI play black and white the same number of times to ensure an even match-up. Note that the minimax and beam minimax PAIs in this tournament used the original hand-picked feature weights (instead of the TDL-trained weights). Also note that, in this case, “beam minimax” is shorthand for our “narrowing beam minimax” algorithm.

Here are the results of our round-robin tournament. Win rates are down the columns, loss rates across the rows.

TDL-trained features had a significant impact on performance. The two reflex models trained via TDL performed much better than adversarial baseline. TDL with 52 features managed to scrape out a 60% win rate against our initial beam minimax; a reflex model defeating an adversarial search model is a remarkable achievement. We think the TDL with 440 features suffers from overfitting: it trained against itself and beam TDL minimax, so it performs well against those players but worse against others. In contrast, the TDL with 52 features is more adaptable.

The narrowing beam minimax won the tournament. The version trained with 52 features defeated its counterpart with 440 features. We attribute this victory to the former’s versatility and the latter’s overfitting. It’s important to remember that twenty-game matchups are susceptible to considerable variance. However, the quantitative results match our qualitative intuitions.

Ultimately, combining TDL features with a lookahead algorithm led to a smart, adaptive player. When we presented our results at an AI poster fair, we offered a $50 prize to the first person who could beat our best reflex PAI (the one trained with 52 features). PAI went 33–0 against humans. Admittedly, they were all amateur players, though some played multiple times in a row, and some had experience with Go or other similar board games.

Pictured from left to right: Nikhil Prabala, Louis Lafair, and Michael Tucker at the CS 221 Artificial Intelligence — Principles and Techniques poster fair.
The reflex TDL-weighted PAI defeated amateur humans in 33 straight games at the poster fair, played on the pictured iPads. No one won the $50 reward.

Finally, we pitted our best lookahead PAI (narrowing beam minimax trained via TDL with 52 features) against our more experienced team. PAI defeated Nikhil Prabala 5–0, Michael Tucker 4–1, and me 3–2.

In 2017, PAI gets added to the timeline of notable game-playing AIs.

Learning from PAI

PAI, as the first Pathwayz AI, can teach us novel approaches to Pathwayz. There have been several surprises in its playstyle. Most human players are conservative, waiting to flip until middle or end game. PAI, on the other hand, is aggressive, consistently flipping within the first several moves. PAI tries (often successfully) to get an early lead so that it can play offensively instead of defensively at end game.

The reflex model, by definition, is a greedy player, so its aggressiveness makes sense.

A: Reflex TDL-weighted PAI (black) plays a permanent piece, aggressively flipping at the start of the game. B: Later in the same game, PAI has now played permanents in four of its first six moves.

The search model’s aggressiveness is more surprising.

PAI beam minimax (black) has learned that getting ahead early is helpful. PAI tends to go on the offensive.

Also surprising is PAI’s ability to block flips. PAI will sometimes play in unconventional spaces simply to block an opponent’s flip.

PAI (black) strategically places to block a dangerous flip.

PAI will do whatever possible to prevent (or at least prolong) its opponent’s victory, even, at times, flipping its own pieces.

A: I (black) am one turn away from winning. B: PAI (white) makes an end-game sacrifice, flipping its own piece to prevent a winning flip next turn.

Continuing to play and develop PAI will reveal more novel strategies.

Is There a First-Player Advantage?

A first-player advantage exists in many two-player strategy games. In Go, the second player receives compensation points (komi) to offset the first-player advantage. In Chess, the first player has a winning percentage of around 55%. Winning percentage is calculated as: (Wins + 1/2 Draws) / Total Games.

Given the lack of historical data, there is no known first-player advantage in Pathwayz, though some have posited one.

A commenter on BoardGameGeek.com believes there’s a first-player advantage in Pathwayz.

We decided to see if the theory holds, and to what extent. White goes first in Pathwayz. We played our best reflex (and thereby fast) PAI in 1,000 games against itself. White won 487 games, black won 499 games, and they drew 14 games. In other words, the first player had a win rate of 49.4%. If anything, there seemed to be a minor second-player advantage, though it may decrease with more simulations or more advanced AIs. Black’s advantage was so small that we can assume (until a larger game database exists) that there is no significant player advantage.
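For reference, plugging those results into the winning-percentage formula from above:

```python
white_wins, black_wins, draws = 487, 499, 14
games = white_wins + black_wins + draws                   # 1,000 self-play games
first_player = (white_wins + 0.5 * draws) / games         # white moves first
second_player = (black_wins + 0.5 * draws) / games
print(f"first player: {first_player:.1%}, second player: {second_player:.1%}")
# first player: 49.4%, second player: 50.6%
```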

The Next Level

Our careful feature selection, with lookahead capabilities, allowed for efficient development of a top-tier player. We were able to build, train, and test PAI in much less time (and with much less computing power) than if we had employed neural nets.

For future iterations, there are several ways we could continue to improve PAI. First, we want to train PAI against human players; we’ve already built the infrastructure to begin doing so. Second, we want to try integrating Monte Carlo tree search with our evaluation function, efficiently increasing foresight. Third, we want to experiment with different sets of features, settling somewhere between 52 and 440, without overfitting. Finally, we want to employ neural nets, allowing for more complicated, nuanced, nonlinear strategies.

PAI in its current form is a powerful player, but not unstoppable. Moving forward with neural nets, PAI (like AlphaGo Zero) demonstrates potential for superhuman play.

Acknowledgements

PAI is the result of a lot of time and a lot of help from a lot of people. Thanks to Percy Liang and Stefano Ermon for teaching CS 221: Artificial Intelligence — Principles and Techniques. Thanks to Zach Barnes for feedback. Thanks to Dylan Hunn for providing the initial GUI. Thanks to everyone who tried playing against PAI along the way.

And of course, a huge thanks to Michael Tucker and Nikhil Prabala, who spent countless hours (and lost too much sleep) co-developing PAI. This blogpost has drawn heavily upon our final joint paper for CS 221: Artificial Intelligence — Principles and Techniques.
