
Blind Chess Log [0]

Monte Carlo Tree Search for partially-observed environments

Produced by the author.

Welcome to Blind Chess Log, index 0. Here I'm planning to share weekly experiences of working on a Machine Learning pet-project together with my friend Gleb Tkachev. The goal of the project is to develop an ML bot for Reconnaissance Blind Chess that wouldn't make outstandingly stupid moves and maybe would even be able to beat a random bot. A full set of rules can be found in the documentation for the reconchess Python package. But in a nutshell, it's a Chess variant where you don't see the opponent's moves, and every time before deciding on your move you can choose a 3-by-3 square and get the actual true state of the board there. Sounds exciting? Let's go…


Quite a quest

If you read through the rules of Reconnaissance Blind Chess and think about which RL algorithm might be applicable here, what first comes to mind? I bet that for the majority of readers it's AlphaZero. That's of course not without reason. However, AlphaZero is undoubtedly a kind of "don't try it at home" project. Despite being quite a simple approach conceptually, it's an engineering masterpiece built by a team of outstanding scientists and engineers with quite an unaffordable budget. Its scale is what makes it so unbeatable. Of course, one could try to develop some DIY clone for a single machine, but the chances that it will learn any reasonable behavior are quite slim.

Animation from Tumgir.

Knowing all that, and without too much hesitation, we of course decided to go with an AlphaZero-like approach. We do have one little excuse though. We're not aiming to reimplement AlphaZero 1-to-1; rather, we're trying to build something inspired by it, i.e. an algorithm that combines search and learning.

My personal favorite bird's-eye view of AlphaZero is that we train the Policy function to compress Monte Carlo Tree Search (MCTS) iterations and the Value function to predict a value estimate for unvisited states. We then use both of them to improve the convergence of MCTS, narrowing the search with the Policy and bootstrapping the estimates with the Value function. That allows computing strong moves even when running with a restricted computational budget during evaluation. In principle, given infinite compute, one could solve games like Go and Chess with MCTS alone. Thus starting by implementing MCTS and letting it play (even if quite miserably) sounded quite reasonable.
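
To make that picture a bit more concrete, here's a minimal sketch of how the two networks could plug into node selection, PUCT-style. The names (`Node`, `prior`, `c_puct`) are illustrative; this is not our actual implementation, just the general idea of mixing a Q-estimate with a policy-weighted exploration term.

```python
import math

class Node:
    """One node in the search tree (illustrative structure)."""
    def __init__(self, prior: float):
        self.prior = prior          # P(s, a) from the policy network
        self.visit_count = 0        # N(s, a)
        self.value_sum = 0.0        # accumulated backed-up values
        self.children = {}          # action -> Node

    def value(self) -> float:
        # Mean value Q(s, a); the value network bootstraps leaf estimates elsewhere.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """Pick the child maximizing the PUCT score: exploit Q, explore via the policy prior."""
    total_visits = sum(child.visit_count for child in node.children.values())
    def puct(action, child):
        exploration = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.value() + exploration
    return max(node.children.items(), key=lambda item: puct(*item))
```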

Where the problem begins

Conveniently, I had an MCTS implementation resting in my private cemetery of unfinished projects. The only problem, though, is that standard MCTS works for fully-observed environments. Blind Chess, by contrast, is partially observed, since you don't see the opponent's actions. This poses some complications…

MCTS maintains its own simulator of the environment (the game) to search through the space of available actions. In order to run the search from some given state, we need to be able to reset our simulator to this state. You can see the problem, right? It isn't so simple in the case of partially-observed environments, since we DON'T know the state. To put it in the context of Blind Chess: if we treat the opponent's moves as part of the simulation, we can build the tree only for the first state, and only if we play White. After the first move we don't know the actual positions of our opponent's pieces, and thus we don't know from which state to start the search for the next move.

Here's an important point for the later explanations: the internal simulator that MCTS uses to search from a single state can still return true states. You'll code it up yourself and are free to make it do whatever you deem reasonable. However, that only helps if the starting state is known (as it is for the White player in Blind Chess), and it won't carry you beyond the first move. Of course, during training we can hack anything we want, even make the true simulator return the opponent's moves and recover the true state. But that clearly wouldn't work for evaluation, especially if your bot is supposed to play online matches against other players.
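
To make the reset requirement explicit, this is roughly the contract that vanilla MCTS assumes from its simulator (a hypothetical interface for illustration, not the reconchess API):

```python
from typing import Any, Protocol, Tuple

class Simulator(Protocol):
    """Rough contract vanilla MCTS expects from its internal simulator (hypothetical)."""

    def reset_to(self, state: Any) -> None:
        """Jump to an arbitrary true state: exactly what we can't do once the opponent's pieces are hidden."""
        ...

    def step(self, action: Any) -> Tuple[Any, float, bool]:
        """Apply an action and return (next_state, reward, done)."""
        ...
```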

Fortunately enough, quite a number of methods have been proposed to adapt MCTS to partially-observed settings. We found POMCP by David Silver et al. (yeah, that guy) to be the most straightforward and easiest to implement, and decided to go with it. The idea of this method is to combine MCTS with a particle-filter estimator of the true state. I won't go into a full-fledged explanation of the paper, but here's a summary, followed by a rough code sketch:

  1. You branch your tree by actions and observations (instead of states, as in the classic algorithm), and for every node in your tree you maintain a set of possible true states. What you'll also need is an internal simulator that returns states in addition to observations (as mentioned above, that isn't a show-stopper).
  2. At the beginning, you sample N states from the initial state distribution, which should be known (that's the case even for Blind Chess), and add them to the Root's set.
  3. For every MCTS iteration, i.e. expansion and back-propagation, you sample one of the states from the set in the current Root and reset your simulator to it. As you move through the tree, you add every new state returned by your internal simulator to the set in the next node that corresponds to the observation you got.
  4. Once you've built the tree or exceeded your computation budget, you select the best move given the collected statistics and feed it to the true simulator. The true simulator returns only an observation, which you use to select the correct node at the next level of the tree and promote it to Root. Then you repeat from step 3.
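
To make the steps above more concrete, here's a heavily condensed Python sketch of the particle-based search loop. Everything is illustrative: the `sim.step` / `sim.legal_actions` interface is an assumption about how our internal simulator might look (not the reconchess API), and the real POMCP adds rollouts, discounting, and particle reinvigoration on top of this skeleton.

```python
import math
import random

class ObsNode:
    """History node: visit statistics plus a particle set of candidate true states."""
    def __init__(self):
        self.particles = []     # sampled true states consistent with this observation history
        self.children = {}      # action -> ActionNode
        self.visits = 0

class ActionNode:
    """Per-action statistics and the observation branches below it."""
    def __init__(self):
        self.children = {}      # observation -> ObsNode
        self.visits = 0
        self.value = 0.0        # running mean of backed-up returns

def ucb_action(node, legal_actions, c=1.4):
    """UCB1 over actions; untried actions are explored first."""
    for a in legal_actions:
        node.children.setdefault(a, ActionNode())
    def score(a):
        child = node.children[a]
        if child.visits == 0:
            return float("inf")
        return child.value + c * math.sqrt(math.log(node.visits + 1) / child.visits)
    return max(legal_actions, key=score)

def simulate(node, state, sim, depth):
    """One search iteration from `node`, using `state` as the sampled true state."""
    if depth == 0:
        return 0.0
    action = ucb_action(node, sim.legal_actions(state))
    next_state, obs, reward, done = sim.step(state, action)   # assumed internal-simulator interface
    action_node = node.children[action]
    obs_node = action_node.children.setdefault(obs, ObsNode())
    obs_node.particles.append(next_state)                     # step 3: grow the matching child's particle set
    ret = reward if done else reward + simulate(obs_node, next_state, sim, depth - 1)
    node.visits += 1
    action_node.visits += 1
    action_node.value += (ret - action_node.value) / action_node.visits
    return ret

def search(root, sim, n_iter=1000, depth=20):
    """Run the search and return the most-visited action (step 4 then feeds it to the true simulator)."""
    for _ in range(n_iter):
        simulate(root, random.choice(root.particles), sim, depth)
    return max(root.children, key=lambda a: root.children[a].visits)
```

The `search` call would then be repeated for every real move, with the promoted child node serving as the new `root`.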

Even considering the numerous hacks provided in the paper to keep the state estimate stable, it still seemed to be the simplest way to address partial observability with MCTS. So we spent a good fraction of our free evenings implementing it. In the beginning, we were actually a bit worried and didn't know whether "sense" actions (those where you get the true state of some 3-by-3 square) would "make sense" in this method. Fortunately, after quite some brainstorming, drawing, and coding up the first prototype, we realized that they integrate quite naturally. Essentially, "good" sense actions reveal additional information and result in different observations, so the particle estimates of your true state get spread over different branches of the tree. In theory that should reduce the variance of the state estimate, and everything should work smoothly. However, not long after, we realized that there's one more problem…

Let’s play together

Another key ingredient of AlphaZero that is considered crucial for superhuman performance is Self-play. For MCTS, Self-play is implemented by searching for the best moves for both players, in round-robin fashion, within the same tree.

Let's forget about partial observability for now and assume that we're building a tree for some Root state and select an action according to statistics we've collected previously (or randomly if no statistics are available). We now transition into a state where our opponent has to make a move. So which move should we assume? The simplest option would be to use some fixed strategy for the opponent. However, that wouldn't work if our true opponent plays differently. In Self-play MCTS, instead, we add this state as a node and continue the tree search from it as if we were playing for the opponent. In essence, when expanding the tree, you assume that you're playing against an opponent who also uses MCTS. That should let you come up with much stronger moves, and the approach scales nicely with your computation budget.

One additional benefit is that for zero-sum games (such as Chess), extending MCTS to Self-play doesn't take much, at least conceptually. All you need is to negate the rewards that one player collects during the rollout when back-propagating the statistics through the opponent's nodes. Of course, that requires a general enough implementation which tracks the player id and carefully handles the statistics updates. After quite some "design discussions" we managed to implement something like that, and it seems to work reliably.
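
Here's roughly what that sign flip looks like at back-propagation time. It's a minimal sketch under one possible convention: every node stores statistics from the perspective of the player to move there, and field names like `player_to_move` are illustrative rather than our exact implementation.

```python
def backpropagate(path, value, value_player):
    """Back up a rollout return through the visited nodes of a zero-sum self-play tree.

    `value` is the return from `value_player`'s perspective; nodes where the
    opponent is to move accumulate the negated value (zero-sum assumption).
    """
    for node in reversed(path):
        node.visits += 1
        node.value_sum += value if node.player_to_move == value_player else -value
```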

Let's come back to our partially-observed setting, where everything gets "a bit more complicated". Now, when building a tree for a single known state (the initial state, for example), we can still do Self-play thanks to our internal simulator. As mentioned, we have full control over it, and we can implement it so that it returns the opponent's observations and actions, letting us branch on them and add the corresponding nodes to our tree. The problem shows up once we've exceeded our computation budget for a single move, chosen the best one we've found so far, and fed it into the true simulator. The simulator state has advanced and the opponent has made his move, but we know nothing about it. Now we need to find a new Root node in our sub-tree: one that corresponds to an observation where it's our turn again and that likely contains the true state in its set of state estimates. Since we don't know which observation our opponent got and which move he chose, we don't know which branch to traverse. Thus we don't have an estimate of the current true state and cannot reset our internal simulator to run the search again…


Sorry, I know that it might be quite difficult to grasp without some drawings, but I really don't want to turn these logs into a full-fledged tutorial. Otherwise, I would spend more time preparing them than actually working on the project. So if you have any questions, or are excited about the topic in general, please reach out in the comments or on LinkedIn. I'll be glad to discuss it with you.


Of course, there are some simple hacks that come to mind, like traversing into the branch that corresponds to our player's move and then pooling and merging all available state estimates from the children at the next level that correspond to our turn (a rough sketch of this follows below). However, that would probably introduce quite some variance into our true-state estimate. In addition, according to the rules, when the opponent captures one of your pieces you actually get some information about his move. So you might end up pooling states from children that are not possible given the new information you've just received. Or, even worse, in a situation where you don't have any particle estimate that is consistent with the new information. In any case, it sounds like the most straightforward approach to start with, so we'll most probably go with it next week.
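
For reference, here's a rough sketch of what that pooling step could look like on top of the tree structure from the earlier snippet. All names are hypothetical, and the consistency check is just a placeholder, since the real logic depends on how we represent boards and capture information.

```python
def pool_particles_after_move(root, our_action, capture_square=None):
    """Rough sketch of the pooling hack described above (all names are hypothetical).

    After feeding `our_action` into the true simulator, descend into that branch,
    merge the particle sets of every grandchild node where it's our turn again,
    and drop particles that contradict the new information (e.g. a reported capture).
    """
    def is_consistent(state, capture_square):
        # Placeholder for the consistency check: the real logic would compare the
        # candidate board `state` against the capture information from the rules.
        return True

    pooled = []
    action_node = root.children[our_action]
    for opp_turn_node in action_node.children.values():          # observation branches (opponent to move)
        for opp_action_node in opp_turn_node.children.values():  # opponent's possible moves
            for our_turn_node in opp_action_node.children.values():
                pooled.extend(our_turn_node.particles)
    return [s for s in pooled if is_consistent(s, capture_square)]
```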

So, that was it for this iteration of Blind Chess Log. Stay tuned for the next update in about a week (probably).


