AlphaZero implementation and tutorial

A walk-through of implementing AlphaZero using custom TensorFlow operations and a custom Python C module

Darin Straus
Towards Data Science


Note (Jan 27, 2020): I’ve published an update to this story where I’ve obtained better performing models. You can also watch me play against the improved network.

I describe here my implementation of the AlphaZero algorithm, available on GitHub, written in Python with custom TensorFlow GPU operations and a few accessory functions in C for the tree search.

The AlphaZero algorithm has gone through three main iterations: first called AlphaGo, then improved to not require any pre-training and called AlphaGo Zero, and finally generalized further to other games and called AlphaZero. My implementation is most similar to AlphaZero; however, all variants are relatively similar. I also implemented similar GPU functions for the game Reversi, but as of now only my Go GPU code is available (I could make the Reversi code available too, if anyone's interested).

My motivation for writing this has been to use the algorithm on other board games using relatively minimal hardware (a single GPU) — as a result, I have not attempted to implement the algorithm word-for-word (for example, all games are played with a pre-set number of turns instead of using an automated procedure for having the algorithm auto-resign near the end-game).

I release this code, descriptions, and all other material into the public domain. If you find it useful, I'd be happy to hear about it. I'd also be happy to help, to the extent that I can and time allows, if you have any problems understanding my code. My email is X@Y.com, where X is cody2007 and Y is protonmail.com.

How good is the network? Not that great: I can beat it essentially all the time, and I'm a complete novice at Go. However, the implementation does seem to work. Its score against GNU Go steadily increases during training (the network comes to beat GNU Go about 50% of the time for 20-turn games under a simple scoring system), and it beats a completely random player essentially 100% of the time.

One caveat is that the network takes shortcuts that exploit the scoring system I used. I score a game by counting all the stones each player has on the board plus all the empty positions that each player has completely surrounded. Because of this, the neural network likes (especially near the end-game) placing stones in territories safely controlled by the other player. This substantially reduces the other player's score (they no longer get any points for that territory apart from the stones on its border), even though the strategy wouldn't hold up if games were played out beyond 20 turns. See the section below about improving performance.
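To make the scoring concrete, here's a minimal sketch of that kind of area scoring (stones plus empty regions bordered by only one player). This is just an illustration of the idea, not the actual scoring code in my repository, and the board encoding (0 = empty, 1 and 2 = the two players) is an assumption:

import numpy as np

def score_board(board):
    # Toy area scoring: each player's stones plus empty regions whose entire
    # border consists of that player's stones. Assumes a 2D numpy array with
    # 0 = empty, 1 and 2 = the two players.
    scores = {1: int((board == 1).sum()), 2: int((board == 2).sum())}
    visited = np.zeros(board.shape, dtype=bool)
    for x in range(board.shape[0]):
        for y in range(board.shape[1]):
            if board[x, y] != 0 or visited[x, y]:
                continue
            # flood-fill this empty region, recording which players border it
            region, borders, stack = [], set(), [(x, y)]
            visited[x, y] = True
            while stack:
                i, j = stack.pop()
                region.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < board.shape[0] and 0 <= nj < board.shape[1]:
                        if board[ni, nj] == 0 and not visited[ni, nj]:
                            visited[ni, nj] = True
                            stack.append((ni, nj))
                        elif board[ni, nj] != 0:
                            borders.add(int(board[ni, nj]))
            if len(borders) == 1:   # surrounded by only one player
                scores[borders.pop()] += len(region)
    return scores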


Compiling and running the code

This section assumes you have already installed Python 2.7, TensorFlow, the Nvidia toolkit, the GNU C compiler (GCC) and a standard development environment, and some basic Python packages: numpy, scipy, colorama. It may also be useful to install and configure an IPython (Jupyter) notebook server for visualizing training progress (described more below). If you'd like to play against the network graphically, you may also want to install the Pygame Python package (also described more below).

All code has been written and tested on Ubuntu 18.04 using Python 2.7 and TensorFlow v1.7.0, compiled with NVCC V9.0.176 (the Nvidia Cuda compiler). I run and test all code on a single Nvidia Titan X card, with a quad-core Intel i5-6600K CPU @ 3.50GHz and 16 GB of RAM. I have not tested it on other configurations, but suspect it will likely compile and run with different versions of the Cuda toolkit and TensorFlow.

Compiling the code

Compilation involves compiling the TensorFlow operations and the Python C module. I've written the compilation commands into the script build.sh. So, to compile, you should be able to change to the directory where you downloaded the code and simply run it:

cd ~/alpha_go_zero_implementation
./build.sh

If you have an older Nvidia GPU card than mine (or even if you have a newer one and want better performance), you'll need to adjust the -arch=sm_52 flag in the build.sh script to match the compute capability of your card. You can find the compute capability of your card in several ways, but TensorFlow can easily tell you; in Python you can run:

import tensorflow as tf
sess = tf.InteractiveSession()

This generates a fair amount of text, but the last line or so should look something like this:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11435 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)

(Notice the compute capability statement above, so sm_52 is the right option for my card.)

Adjusting the code’s memory requirements (if needed)

The code, as currently configured, requires approximately 7.8 GB of RAM (almost all of it for the tree-search buffers). However, this memory usage can be scaled back (or up, if you want to do larger searches) by adjusting the constants TREE_BUFFER_SZ and MV_BUFFER_SZ in py_util/includes.h. The trade-off is that N_SIM in bp_tree.py, which controls the number of simulated games per turn used to construct the tree, must also be reduced to fit within the smaller tree buffers.

Running the code

Occasionally, the tree search exceeds the buffer size, at which point the code detects this and terminates. For this reason, I created a shell script called run.sh that runs the main training script, bp_tree.py, in a loop: if training terminates, it restarts it. So, from the command line, you would change your current directory to where you downloaded the code and execute run.sh. For example:

cd ~/alpha_go_zero_implementation
./run.sh

Because the script occasionally terminates, you need to be careful when starting to train a new model. bp_tree.py works by loading the model file specified in the variable save_nm at the top of the script. If save_nm is set to None, the script starts from scratch. So, when starting to train a new model, you should update save_nm to the new model's file name once the script has started running. If you don't, training will start over from scratch each time the script (auto-)restarts. In theory, this could be fixed by changing the code to recover from the tree buffer overflow instead of simply terminating, but I haven't found it enough of a nuisance to justify implementing that behavior just yet.
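For example, the top of bp_tree.py might look something like this (the model file name is a hypothetical placeholder):

# at the top of bp_tree.py -- hypothetical example
save_nm = None                  # None: initialize and train a new model from scratch
#save_nm = 'my_new_model.npy'   # after the first save, point this at the new model file
                                # so automatic restarts resume instead of starting over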

While not necessary, I often run all of this in a GNU Screen session so that I can close my console and have the code continue running (use screen -r to reconnect).

Visualizing the algorithm training

I’ve included a Jupyter notebook for visualizing network training progress and plotting example games played by the neural network (see the file notebooks/training_visualizations.ipynb). Below I show example outputs from the notebook.

[Example games from the notebook: black = neural network; white = opponent]

Playing the trained network

The script play_network_gui.py allows you to play against the network via a simple GUI built with Pygame. Set the variable save_nm in the script to the name of the model file. You can play the network in two modes: running the network only (fastest, no tree search) or with the tree search. Setting run_net to True plays without the tree search; setting it to False enables the tree search.
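For example (the model file name here is a placeholder):

# in play_network_gui.py -- the model file name is a placeholder
save_nm = 'my_model.npy'   # the trained model to load
run_net = True             # True: network only (fastest); False: use the tree search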

Overview of the algorithm

The best description of the algorithm is in the original papers (AlphaGo, AlphaGo Zero, and AlphaZero); however, I'll briefly re-summarize it here. The idea is to use a tree search guided by a neural network that is trained to estimate likely good moves. A tree search is simply a brute-force approach where, at each move, multiple games are played out from the current position to the end-game to determine how good a particular move is. The tree search chooses moves based on the output of the neural network, and as the neural network is trained, the tree search becomes more focused on exploring successful strategies. How is the neural network trained? It is trained to estimate a set of summary statistics from the tree search (how often each move was visited, and how likely each move was to lead to the network winning the game). During training, the network plays against itself and requires no human input.
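To make the "guided by a neural network" part concrete, here is a minimal sketch of the standard AlphaZero selection rule (often called PUCT) that this kind of tree search uses to pick the next move to explore; the variable names and exploration constant are illustrative rather than taken from my code:

import numpy as np

def select_move(Q, N, P, c_puct=1.0):
    # Pick the move maximizing Q + U, where U favors moves the network
    # likes (high prior P) that the search has not visited much (low N).
    # Q: mean value per move, N: visit counts, P: network prior probabilities.
    U = c_puct * P * np.sqrt(N.sum()) / (1.0 + N)
    return int(np.argmax(Q + U))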

Code overview

I’ve placed all code governing the Go rules and movement in TensorFlow GPU operations (in the sub-directory kernels/). The tree structures I’ve kept on the CPU (in the sub-directory py_util/). I previously implemented the tree search as TensorFlow GPU operations as well, but decided to move it to the CPU for two reasons. The first is that I was starting to run into GPU memory limitations when running tree searches of more than 1000 simulations. The second is that I found about a 2x speed improvement from keeping the tree search code on the CPU (although I had not invested much time in optimizing the GPU tree search implementation).

GPU game data-structures

All GPU data structures are in row-major order and are defined in cuda_includes.h.

The board is stored with dimensions [BATCH_SZ, MAP_SZ_X, MAP_SZ_Y].

  • board represents the current game boards being evaluated.
  • board2 represents a backup of the game for use while a tree search evaluation is occurring (code for backing up and restoring the board is in kernels/session_backup.cu.cc)
  • board_prev and board_pprev represent the board one and two turns before the current one, to prevent repetition of moves (and, again, board_prev2 and board_pprev2 are backups similar to board2)
  • n_captures and n_captures2 (shape [N_PLAYERS, BATCH_SZ]) indicate the number of captures each player made in the current game. This is for statistical use only, not actually needed for the algorithm.
  • ai_to_coord (shape [BATCH_SZ]) are the coordinates chosen by the random player and also not needed for the algorithm (it is the output of kernels/move_random_ai.cu and input into kernels/move_unit.cu)
  • valid_mv_map_internal (shape [BATCH_SZ, MAP_SZ]) is a binary map for each board indicating which positions are valid moves (1) and which are not (0). It is the output of kernels/create_batch.cu and is used by kernels/move_unit.cu to ensure only valid moves are made.

CPU tree data-structures

All CPU tree data structures (used within the py_util/_py_util.c Python module) are also in row-major order. They are defined in py_util/includes.h and are pre-allocated to a fixed size to avoid constantly allocating and freeing memory. Care must be taken not to exceed the buffer sizes (the code terminates if they are exceeded).

Node data structures:
Each node represents a board state and contains a pointer to a list of valid moves, their visit counts, mean Q values, and the probabilities of winning assigned by the neural network. These values (visit counts and mean Q values) are updated each time the network makes a move: all game states leading to the current state are updated via the py_util/backup_visit.c function.

  • tree_sz (shape [BATCH_SZ]) represents the number of nodes (with children). It should never exceed TREE_BUFFER_SZ or else we have a memory overflow.
  • tree_start (shape [BATCH_SZ]) is the index of the node representing the current board state (it should always be >= 0 and less than the tree_sz).
  • tree_player (shape [BATCH_SZ, TREE_BUFFER_SZ]) represents, for each node, the player who next makes a move.
  • tree_parent (shape [BATCH_SZ, TREE_BUFFER_SZ]) represents, for each node, the parent node index. A value of -1 indicates we are at the root of the tree.
  • tree_list_sz (shape [BATCH_SZ, TREE_BUFFER_SZ]) represents, for each node (game state), the number of children (possible moves) available. For all initialized nodes, this value is always at least one (indicating the no-move option) and never exceeds 1 + MAP_SZ (it's not possible to have more valid moves than board positions).
  • tree_list_start (shape [BATCH_SZ, TREE_BUFFER_SZ]) represents, for each node, the index where we can find the children and their statistics (visit counts, sum of Q values). This should never exceed MV_BUFFER_SZ or we have a memory overflow.

List data structures:
As mentioned above, each game state (which I call a node) is associated with a list of valid moves, visit counts, mean Q values, and probabilities. These structures are each stored in a single large buffer, and the node data structures contain a pointer (tree_list_start) to a contiguous set of entries within this buffer (tree_list_sz represents the number of entries), as sketched in the code following this list.

  • list_sz (shape [BATCH_SZ]) contains the number of active elements in the list. As the list grows, elements are added at the end and the list_sz entry for each game is appropriately incremented. It should never exceed MV_BUFFER_SZ for any given game, because this is the size of the list buffers described below.
  • list_valid_mv_inds (shape [BATCH_SZ, MV_BUFFER_SZ]) for each game, it contains the valid move indices for the player who next makes a move.
  • list_valid_tree_inds (shape [BATCH_SZ, MV_BUFFER_SZ]) for each game, it contains the valid node indices (it's a pointer into the structures described above, i.e., the variables starting with tree_). With the node index, one can then look up the next set of leaves from the tree_list_start index. list_valid_tree_inds will be -1 if uninitialized. When the network moves to an uninitialized node, it is initialized by the py_util/register_mv.c function.
  • list_q_total (shape [BATCH_SZ, MV_BUFFER_SZ]) for each game, it represents the sum total of Q values from all states visited here and forward in the game. It's divided by the visit count to arrive at the mean.
  • list_prob (shape [BATCH_SZ, MV_BUFFER_SZ]) for each game, it contains the network's estimated probability of winning the game. This value only needs to be set once (unlike the visit count and mean Q values) and it is set in py_util/choose_moves.c.
  • list_visit_count (shape [BATCH_SZ, MV_BUFFER_SZ]) for each game, it contains the number of times this move and any move following it is played.
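To illustrate how the node and list structures above fit together, here is a small sketch (assuming the buffers are exposed as numpy arrays; the function itself is purely illustrative) of looking up the statistics for the children of a given node of game gm:

import numpy as np

def children_stats(gm, node, tree_list_start, tree_list_sz,
                   list_valid_mv_inds, list_visit_count, list_q_total, list_prob):
    # Illustrative only: given a game index gm and a node index, pull out the
    # contiguous block of list entries describing that node's children.
    start = tree_list_start[gm, node]     # where this node's children begin
    sz = tree_list_sz[gm, node]           # number of children (>= 1, the "no move" entry)
    inds = slice(start, start + sz)       # contiguous entries in the list buffers
    moves = list_valid_mv_inds[gm, inds]  # board positions of the valid moves
    visits = list_visit_count[gm, inds]   # how often the search explored each move
    mean_q = list_q_total[gm, inds] / np.maximum(visits, 1)   # mean Q value per move
    priors = list_prob[gm, inds]          # network's win-probability estimate per move
    return moves, visits, mean_q, priors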

Model architecture

All my testing has focused on a five-layer residual convolutional neural network with 128 filters at each layer (these parameters can easily be changed by altering bp_tree.py). This is a smaller network than those reported in the AlphaGo papers, a choice influenced by my own hardware and time constraints.
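For reference, here's a minimal sketch of the kind of residual block this refers to, written against the TensorFlow 1.x layers API; the exact kernel sizes, normalization, and layer structure in architectures/tree_tf_op.py may differ:

import tensorflow as tf

def residual_block(x, n_filters=128, training=True):
    # One two-convolution residual block with batch norm and a skip connection.
    y = tf.layers.conv2d(x, n_filters, 3, padding='same', use_bias=False)
    y = tf.layers.batch_normalization(y, training=training)
    y = tf.nn.relu(y)
    y = tf.layers.conv2d(y, n_filters, 3, padding='same', use_bias=False)
    y = tf.layers.batch_normalization(y, training=training)
    return tf.nn.relu(y + x)   # skip connection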

Walk-through of the training script

Here is a walk-through of the training script, bp_tree.py. Broadly, it sets up and loads general variables and state information, then enters the main loop, where it first generates training batches by self-play and then trains on them in a randomly shuffled order, additionally reflecting and rotating the board at random. At the end of the loop it saves and evaluates the network, and the cycle repeats.

The first part of the script simply imports modules and sets the directory where the neural network model is to be loaded and saved. Two files worth mentioning are global_vars.py, which contains, for example, the batch size, and architectures/tree_tf_op.py, which constructs the neural network in TensorFlow using parameters we define later in the present script.

Next, we set whether we want the script to start training a new network or to load and resume training an existing one (by setting the save_nm variable). We also create a list of the variables we want to save in our model file, along with variables we will use for logging performance.

Now we either initialize a new model or load it. This is where you’d change model or training parameters, such as the number of layers, filters, loss function weightings, or gradient descent step size. Also, if you ever want to reduce the gradient descent step size, you’d set the EPS variable in the else branch of the if statement.
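As a rough illustration, the kind of parameters set here might look like the following; the values and the first two variable names are hypothetical, while EPS is the step-size variable mentioned above:

# hypothetical example values -- not necessarily the defaults in bp_tree.py
N_LAYERS = 5       # residual layers (hypothetical variable name)
N_FILTERS = 128    # filters per layer (hypothetical variable name)
EPS = 2e-3         # gradient descent step size; lower it in the else branch
                   # when resuming training of an existing model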

Next come functions to save and restore the current tree position (omitted here on this webpage for brevity), and a simple function for generating the dictionary for input into Tensorflow. After, comes the main function for executing the tree search. First the current tree state is backed up:

Then we enter the main simulation and turn loops, in which each player makes a move. We run the neural network forward to get its estimate of the probability that each move will result in a winning game for it (the pol variable in the script). We also get its Q-value estimate of the current game state (the val variable in the script) and the list of valid moves the current player can make (generated by the kernels/create_batch.cu function). If we are not at the first move of the simulation, the script then backs up the Q value to all game states that led to the current one.

With the list of valid moves, we then update the tree by calling the pu.add_valid_mvs function from the Python C module compiled above. Next, the tree is queried to find the next move based on the network's probability and Q-value estimates. We then register the move in the tree and make the move (which updates the GPU memory structures containing the board state, with the line beginning arch.sess.run, which ultimately calls the kernels/move_unit.cu function).

Finally, at the end of each simulation we must back up who won each game (the network is playing itself, but this is necessary to know which set of moves won, for training). We also restore our previous position in the tree.

Next comes some additional initialization code (not shown here). After that comes the main training loop. First we call the kernels/init_state.cu TensorFlow operation, which initializes a new set of game boards on the GPU. We also initialize the tree data structures by calling py_util/init_tree.c. Next, for each turn, we run our simulation code to play out N_SIM (e.g., 1000) games. Then, for each player, we save the board state (for later use as training inputs) and get the list of valid moves.

As in the run_sim function, we register the valid moves in the tree and then use the tree to choose the current player's next move. However, this time, instead of choosing the move with the maximum tree-search value (which reflects the visit counts, Q values, and network probability estimates), we move probabilistically in proportion to the number of times the tree search visited each move. Then, as before, we register the move in the tree. Finally, we prune the tree (py_util/prune_tree.c), which means we remove all the nodes and leaves that are no longer accessible (those representing branches that are no longer possible after the current move).
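A minimal sketch of that probabilistic selection (illustrative only):

import numpy as np

def sample_move(visit_counts):
    # Choose a move in proportion to how often the tree search visited it.
    probs = visit_counts.astype(np.float64)
    probs /= probs.sum()
    return int(np.random.choice(len(visit_counts), p=probs))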

Now, to construct our training batches, we determine who won each game (which we train the network to predict: the Q value, or val variable, in the script) and determine the probability that the tree search would choose each move for a given game state (which we also train the network to predict: the pol variable in the script). The py_util/return_probs_map.c function returns the probabilities for each turn, player, game, and move from our previous simulations. Its dimensions are [N_TURNS, N_PLAYERS, BATCH_SZ, MAP_SZ_X, MAP_SZ_Y].
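As an illustration of what those probability targets look like for a single game state (ignoring the "no move" entry, and with hypothetical argument names), the conversion from visit counts to a board-shaped probability map is roughly:

import numpy as np

def visit_probs_map(move_inds, visit_counts, map_sz_x, map_sz_y):
    # Turn one game state's visit counts into a probability map over the board,
    # the kind of target the network's pol output is trained to match.
    probs = np.zeros(map_sz_x * map_sz_y, dtype=np.float32)
    probs[move_inds] = visit_counts / float(visit_counts.sum())
    return probs.reshape((map_sz_x, map_sz_y))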

Finally, we are ready to train the network. We shuffle the order of training steps and then rotate and reflect the board, which gives the network a larger variety of inputs to train on and prevents it from overfitting to any one game.
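A small sketch of that kind of augmentation (the eight symmetries of the square board); the move-probability targets are transformed together with the board so the two stay consistent:

import numpy as np

def random_symmetry(board, prob_map):
    # Apply one of the 8 rotations/reflections of the square board to both the
    # board and its matching move-probability map.
    k = np.random.randint(4)        # number of 90-degree rotations
    board, prob_map = np.rot90(board, k), np.rot90(prob_map, k)
    if np.random.rand() < 0.5:      # optional reflection
        board, prob_map = np.fliplr(board), np.fliplr(prob_map)
    return board, prob_map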

We then perform the actual backpropagation steps. For diagnostics, we load in the loss-function values to monitor training progress. These values are accumulated in temporary variables that are printed, saved, and reset later in the script.

The remainder of the script (not shown here) will then save the model and evaluate it against a random opponent and the GNU Go algorithm. Also included in that code are print statements for seeing some of the loss function outputs.

Training speed

On my setup (see the top of the page for my hardware description), it takes about 40 minutes to generate 128 games played to 20 turns at 1000 simulations per turn per game. My code randomly rotates and reflects the board and trains on each move made for all turns (so in 40 minutes it generates 20 training batches). The plots I show on this page represent about one month of training [(25,000 training steps * (40 minutes / 20 training steps)) / (60 minutes/hour * 24 hours/day) ≈ 35 days].

Improving performance further

I’ve noticed improved performance when increasing the number of game simulations per move from 500 to 1000 (see below; blue = 1000 simulations, red = 500 simulations). Notice the modest increase in winning rates against the GNU Go algorithm. I suspect further increases in simulations may also improve performance, but I've not tested this. Oddly, the AlphaZero paper used only 800 simulations; however, there were other differences that may have given them better performance, such as hyperparameter tuning (which I've not done systematically) and increased network size. In some older tests I did with the game Reversi, I noticed a more prominent increase in performance from increasing the number of simulations, and modest to no improvement from increasing the network size (layers and number of filters). These results may not apply to Go, though, given that the state space and the number of moves per turn are drastically larger. If anyone else has thoughts on this, I'd be happy to hear them (my email is at the top of the page).

Comparing 1000 vs. 500 simulations

[Plot: blue = 1000 simulations; red = 500 simulations. Darker colors are games played against GNU Go; lighter colors are games against an opponent making random moves.]

One additional change that I suspect would improve performance relates to the scoring system (based on a discussion I had on the Computer Go mailing list). As mentioned at the beginning of the article, the network can increase its score by placing stones even in areas the other player "safely" controls. Fixing this would require implementing a resign mechanism (currently, all games are played with each player making 20 turns) and extending the number of turns each player can make, so that stray stones in safely controlled territory can be captured. Because of the small board size (7x7) that I use, a resign mechanism would likely be critical if the number of turns were increased; otherwise a player may run out of moves and be forced to start filling in their own territories, making them vulnerable to capture.

I plan to implement the resign mechanism and increase the number of turns at some point and will try to link to my results here.
