
Reinforcement Learning Lock N’ Roll

Using a Double Deep Q Network to Learn a Game with Random States

Image by author

Lock N’ Roll was created in 2009 by Armor Games. It is a game for those with advanced levels of intelligence, the chosen few who are gifted with massive IQs and a disposition toward mathematical and probabilistic thinking. The average gamer couldn’t possibly grasp such sophistication and elevated amusement.

Or, maybe that’s just what I tell myself to justify the fact that surprisingly few of my friends have even heard of it, one of my favorite time wasters. It is a game that requires placing dice of different colors (red, yellow, green, or blue, numbered 1–4) on a 4×4 grid in combinations and patterns that maximize point output. When you are out of dice, or when you want to score the current state of the board, you press the "Roll" button, which also locks your move in so that the dice you placed can no longer be moved. The dice you obtain after a roll are random. Jokers are earned after reaching certain point totals, and they can be placed anywhere on the board to mimic any die. The game ends when the board is filled with dice and you have no more jokers. Dice can be cleared from the board by achieving a high-scoring combo. What’s not to love?

Lock-n-Roll – Play on Armor Games

Image by author

In the years I have had the app on my phone, on and off, I have tried to perfect my own approach to the game. How do I maximize the chance of clearing spaces on the next roll? Which is more important, setting up a 400-point move that clears 4 spaces, or a 200-point move that clears 6? When should I use a joker? Through much trial and error, I have honed my own strategy, going from 1,000 being a good score when I first started to consistently reaching over 20,000, and I have even scored in the millions several times. However, for all the glory my budding Lock N’ Roll talents have brought me (and by "glory" I mean "many wasted hours"), I have always wondered whether there was a way to "solve" the game, to employ the optimal strategy, the one strategy to rule them all. With the game having over 20 quintillion possible states, and with my background in mathematics not being as strong as I would like, the task always seemed daunting.

Double Deep Q Learning

Enter Greg Surma. Greg Surma doesn’t know me and I don’t know him, but for whatever reason, when I read his Medium article about using machine learning to train an AI to play Atari games at an above-average level, I thought of trying to do the same with Lock N’ Roll. Maybe there was a way to solve the game after all? I am a data scientist trained in the art of Python programming, but I don’t have much of a background in reinforcement learning; my specialty lies closer to time-series forecasting.

Nevertheless, I decided to undertake the challenge. Whereas a mathematician might sit down with pencil and paper to try to prove an optimal set of rules for maximizing point totals, this approach involves training an agent to play the game many times, each time evaluating its gameplay and forming strategies that maximize the reward it receives for each move and avoid the moves that punish it. How does it know which kind of move is which? The programmer tells it, by defining a reward function. This process is generally referred to as reinforcement learning.

Under the umbrella of reinforcement learning techniques, there is Deep Q Learning (DQL). DQL uses a neural network model to "remember" past moves the AI has made and systematically retrains itself by sampling from that memory. It then decides its future moves by not only predicting the reward it will immediately receive, but also looking ahead to the next state the move will lead to and applying a discounted reward for the points it predicts it will obtain in that future state. By iterating through this process many times, and by using more random exploration in the initial training phase, it strings together enough rewards and discounted future rewards in its memory to trace out chains of states and moves (the game is formally modeled as a Markov decision process), which I think of as potential paths through the game. With enough of these paths observed, the AI can begin to form a theory about optimal game play.
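In standard notation, the value the network is trained toward for a state s and action a is the immediate reward plus the discounted value of the best action in the next state s', with γ as the discount factor:

```latex
Q_{\text{target}}(s, a) = r + \gamma \max_{a'} Q(s', a')
```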

Double Deep Q Learning (DDQL) becomes necessary when the population of available moves and the variables that need to be considered reach a certain size and complexity. With plain DQL, the network tends to overestimate the value of certain moves and begins to wrongly favor them, because the same network both chooses the best next move and evaluates how good it is. DDQL "unbiases" these estimates with a second neural network (the target network), which evaluates the moves the main network selects, so the agent keeps learning which moves are actually best. The two networks are periodically synced together. A sketch of how this looks in code is given below:
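The full implementation lives in the GitHub repository linked at the end of this article; the snippet that follows is only a minimal sketch of a double deep Q-learning agent in Keras, with illustrative architecture, hyperparameters, and names (state_size, action_size, and so on), not my original code.

```python
import random
from collections import deque

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam


class DoubleDQNAgent:
    """Minimal double deep Q-learning agent (illustrative sketch)."""

    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=50_000)  # replay buffer of past moves
        self.gamma = 0.95                   # discount applied to future rewards
        self.epsilon = 1.0                  # exploration rate, decays over time
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.999
        self.online_net = self._build_model()
        self.target_net = self._build_model()
        self.update_target()                # start the two networks in sync

    def _build_model(self):
        model = Sequential([
            Dense(128, activation="relu", input_shape=(self.state_size,)),
            Dense(128, activation="relu"),
            Dense(self.action_size, activation="linear"),
        ])
        model.compile(loss="mse", optimizer=Adam(learning_rate=1e-3))
        return model

    def update_target(self):
        # periodically copy the online weights into the target network
        self.target_net.set_weights(self.online_net.get_weights())

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # epsilon-greedy: explore randomly early on, exploit learned values later
        if np.random.rand() < self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.online_net.predict(state[np.newaxis, :], verbose=0)[0]
        return int(np.argmax(q_values))

    def replay(self, batch_size=64):
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        states = np.array([b[0] for b in batch])
        next_states = np.array([b[3] for b in batch])

        q_current = self.online_net.predict(states, verbose=0)
        q_next_online = self.online_net.predict(next_states, verbose=0)
        q_next_target = self.target_net.predict(next_states, verbose=0)

        for i, (_, action, reward, _, done) in enumerate(batch):
            if done:
                target = reward
            else:
                # double DQN: the online net picks the action,
                # the target net supplies its value
                best_next = int(np.argmax(q_next_online[i]))
                target = reward + self.gamma * q_next_target[i][best_next]
            q_current[i][action] = target

        self.online_net.fit(states, q_current, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
```

The double-DQN step is the target calculation inside replay(): the online network picks the best next action, while the target network supplies that action’s value, which is what keeps the estimates from drifting upward.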

In my mind, this is a perfect framework for teaching an AI Lock N’ Roll, which requires thinking ahead to the moves your current dice placements will most likely lead to. I quickly learned that I needed Double Deep Q Learning rather than plain Deep Q Learning because of the many complexities involved in the training process.

One of the challenges I faced was figuring out how to let the computer know that certain moves were not always available. The examples I found online about using DDQL were not clear about how to solve this problem and many discussions on online forums seemed to suggest that the framework is only possible to implement when the action space does not need to be restricted, i.e., all moves are available at all times. Therefore, I made the decision to always let the AI choose from 16 moves, labeled 0–15. Depending on which move it decided to take, the next available die (sorted alphabetically) was placed on the available space closest to what was chosen. If no dice were available, a joker would be used if one had been earned. It would lock and roll after every die placement. These are significant restrictions, and after playing with these restrictions myself, I found that scoring 5,000–12,000 points was a reasonably consistent outcome, but I couldn’t score much more than that.
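As a rough illustration of that mapping (the real game logic lives in the linked repo; the function, data structures, and tie-breaking rule here are hypothetical), a single step might look like this:

```python
def resolve_action(action, board, dice, jokers):
    """Map an action index 0-15 onto a placement (illustrative sketch).

    action : int, the grid cell the agent asked for (row-major, 0-15)
    board  : dict mapping cell index -> die tuple or None
    dice   : list of available dice, e.g. [("blue", 3), ("red", 1)]
    jokers : int, jokers earned so far
    """
    # find the open cell closest to the one the agent requested
    open_cells = [c for c in range(16) if board[c] is None]
    if not open_cells:
        return None  # board is full; the game-over check happens elsewhere
    target_cell = min(open_cells, key=lambda c: abs(c - action))

    if dice:
        # "next available die, sorted alphabetically" by color
        die = sorted(dice)[0]
        dice.remove(die)
    elif jokers > 0:
        die = "joker"  # joker bookkeeping omitted in this sketch
    else:
        return None  # nothing left to place

    board[target_cell] = die
    return target_cell  # the caller then locks and rolls
```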

Its reward for each move was however many points it scored from making the move and its punishment was -500 if the move caused a game over.
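In code, that reward signal is about as simple as it sounds (a sketch, with hypothetical names):

```python
def reward_for_move(points_scored, game_over):
    # the reward is whatever the move scored; a move that ends the game
    # is punished with a flat -500
    return -500 if game_over else points_scored
```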

I also had a large observation space to deal with, and one that created a lot of redundancy. For instance, because I needed to pass a numeric vector through the neural network, a one-hot encoding technique was used to represent how each space on the board was covered or uncovered and which dice were currently available. This means that four of the variables used were whether a red, blue, green, or yellow-color die was currently on any given space. If there was a red die on that space, the variables indicating the other three colors would always be zero. This was just to represent the colors on one space on the board; there were also the numbers on the board and the dice and jokers available to the player. There was no way I could think of to create an ordinal relationship between any of these features to eliminate redundancy in the encoding process.
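To make that redundancy concrete, here is a sketch of how a single board space could be one-hot encoded; the exact feature layout is illustrative, not the precise encoding I used:

```python
import numpy as np

COLORS = ["blue", "green", "red", "yellow"]
NUMBERS = [1, 2, 3, 4]

def encode_space(die):
    """One-hot encode one board space: 4 color flags + 4 number flags.

    die is a (color, number) tuple, or None if the space is empty.
    An empty space encodes as all zeros.
    """
    vec = np.zeros(len(COLORS) + len(NUMBERS))
    if die is not None:
        color, number = die
        vec[COLORS.index(color)] = 1.0
        vec[len(COLORS) + NUMBERS.index(number)] = 1.0
    return vec

# e.g. a red 3 -> [0. 0. 1. 0. 0. 0. 1. 0.]
print(encode_space(("red", 3)))
```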

To solve this problem, I used PCA to reduce the observation-space dimensions from 193 to 154, while maintaining over 99.95% of the dataset’s original variance. I also tried using Linear Discriminant Analysis, but ultimately opted for the PCA reduction. I am still learning about other reduction techniques that would potentially be better for this situation.
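A minimal sketch of that reduction with scikit-learn (observed_states stands in for the matrix of collected game states; random data is used here just so the snippet runs):

```python
import numpy as np
from sklearn.decomposition import PCA

# placeholder for the real observations: one 193-feature row per game state
observed_states = np.random.rand(10_000, 193)

# keep enough principal components to retain 99.95% of the variance;
# on the real data this reduced the space from 193 to 154 dimensions
pca = PCA(n_components=0.9995)
reduced_states = pca.fit_transform(observed_states)
print(reduced_states.shape, round(pca.explained_variance_ratio_.sum(), 5))
```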

Then there was a problem I didn’t realize I would have: my hardware. Training the dang thing took a long time. To have the computer play and remember 500,000 times, I had to wait over three days, not to mention all the failed experiments I quickly saw were not going to work. I’m sure my Windows Surface laptop is not optimal for this problem (it ran very hot while training), and maybe I could have coded the game itself to be more time- and memory-efficient, but until I have access to more computing power, I’m going to leave the results of 500,000 training iterations the way they are.

For the record, I don’t think any of my solutions to these problems are perfect. I am just getting into this space and recognize I have a lot to study and learn. However, maybe sharing some of my challenges will be useful to someone else, or maybe I can learn about potential improvements from the comments.

Results

The results were mixed. I think there were definite signs that my AI learned, but not nearly as much as I had hoped for. After 500,000 tries, it could score an average of 200 points and regularly scored over 1,000. Its high score was a little over 3,000. Manually testing the agent, I noticed it could recognize that some combinations placed in some patterns would score it mega points, but it had trouble recognizing good moves all the time, even ones that were obvious to me. It didn’t come anywhere near my level, but it was statistically significantly better than random. A random agent averages about 165 and rarely scores over 1,000.

Image by author
random score mean: 164.47
AI training mean: 188.47
AI non-random training mean: 199.64
t-test: statistic = -4.57, p-value = 4.9e-06
t-test: statistic = -6.17, p-value = 7.2e-10
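For reference, a comparison like this can be produced with scipy.stats.ttest_ind; the score arrays below are placeholders, since the real values come from the logged results of each set of games:

```python
import numpy as np
from scipy import stats

# placeholders for the per-game scores of the random agent and the trained
# agent; swap in the actual logged score arrays
rng = np.random.default_rng(0)
random_scores = rng.normal(loc=165, scale=100, size=2000)
ai_scores = rng.normal(loc=199, scale=120, size=2000)

# two-sample t-test comparing the mean scores of the two agents
print(stats.ttest_ind(random_scores, ai_scores))
```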

In the future, I’d like to refine this approach with a more effective dimensionality-reduction process, a genetic algorithm to find the best neural network architecture, a way to restrict the action space in any given state so that the game the AI plays better reflects the real game experience, and more computing power so I can train it for at least 1 million and up to 5 million iterations. However, for my first attempt at creating a learning agent (going from setup to now took only about a month of working in my free time), I am satisfied that I built an agent that learned something.

I intend to create a version 2 of this application eventually, but for now, I am taking a break. If you are interested, the GitHub repository with all the code is below.

mikekeith52/LockNRoll

