
An Introduction to Reinforcement Learning: the K-Armed Bandit

The k-armed bandit problem introduces many reinforcement learning ideas. Read more to find out.

Photo by DEAR on Unsplash

There’s a lot of hype around Reinforcement Learning (RL) these days, and rightfully so. Ever since DeepMind published its paper "Playing Atari with Deep Reinforcement Learning", many promising results have come out, with perhaps the most famous one being AlphaGo. Before we can understand how these models work, however, we need to understand some basic principles of reinforcement learning. I think the best introduction to these concepts is through a much simpler problem – the so-called "k-armed bandit problem".

The Problem

First, let’s define what the k-armed bandit problem is. The phrase "k-armed bandit" conjures up images of Boba Fett from Star Wars crossed with an octopus, but fortunately things are not that crazy. Instead, you can think of the bandit as a slot machine, and each of the k arms as a lever. Pulling on one of the k levers will give you a positive reward. Your goal, as in most reinforcement learning problems, is to maximize your total reward, given a fixed number (say 1000) of pulls. The only prior information you have is that the average reward for each lever doesn’t change over time. What is the best strategy here?

If we knew the average reward for each lever, this problem would be easy. We would just always pull on the lever with the highest average reward. So one possible approach is to first use some pulls to figure out which lever has the highest average reward, and then only pull on that lever.

How should we figure out which lever is the best? We need to try all the levers multiple times, because even though the average reward per lever is fixed, each individual lever pull will give a different reward. Once we’ve tried each lever a bunch of times, we can take the average of all the pulls as an approximation of the true reward. In other words, one possible algorithm is:


Algorithm 1

Pull each lever n times, and average the n rewards from each lever.

Whichever lever has the highest average is the best one. Pull that one exclusively from now on.

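To make Algorithm 1 concrete, here is a minimal Python sketch. It is only an illustration, not code from the original paper or article: the Gaussian reward model and the names `pull`, `algorithm_1`, `n_explore`, and `n_steps` are my own assumptions.

```python
import random

# Hypothetical bandit: each lever's reward is Gaussian noise around a fixed mean.
def pull(arm, true_means):
    return random.gauss(true_means[arm], 1.0)

def algorithm_1(true_means, n_explore, n_steps):
    """Pull each lever n_explore times, then commit to the best-looking one."""
    n_arms = len(true_means)
    total_reward = 0.0

    # Exploration phase: sample every lever n_explore times and average the rewards.
    averages = []
    for arm in range(n_arms):
        rewards = [pull(arm, true_means) for _ in range(n_explore)]
        averages.append(sum(rewards) / n_explore)
        total_reward += sum(rewards)

    # Exploitation phase: pull the lever with the highest sample average from now on.
    best_arm = averages.index(max(averages))
    for _ in range(n_steps - n_arms * n_explore):
        total_reward += pull(best_arm, true_means)

    return total_reward

# Example: 5 levers, 3 exploratory pulls each, 1000 pulls total.
print(algorithm_1(true_means=[1.0, 2.0, 5.0, 3.0, 1.5], n_explore=3, n_steps=1000))
```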

There is a problem with this, which is that we don’t know how to set n. For example, let’s say we use n = 3. For one of the levers we do 3 pulls and we get rewards of 14, 15, and 16, averaging out to 15. However, the true average reward for that lever was 5 – we just got very lucky for those three pulls. Now we are in trouble, because we will always assume this lever is much better than it actually is. We could make n larger, which would make us more confident that our averages are accurate. However, since we have a limited number of time steps, we don’t want to waste lever pulls. Therefore setting a large n has problems too.

We could avoid fixing a number n altogether. Instead, we keep a running average of each lever’s rewards. Another possible algorithm is:


Algorithm 2

Before doing any pulls, set the running average of each lever to 0.

For each timestep:

  • Pick the lever with the highest running average. If there is a tie, pick randomly between the tied levers.
  • Pull the lever we just picked, observe the reward, and update that lever’s running average with the new reward.

Theoretically, this algorithm has the advantage of always getting closer and closer to the true average reward for each lever. There is no number of pulls per lever n after which we stop learning. Unfortunately, it doesn’t work. On the first timestep, this algorithm will pull a random lever, getting a positive reward. Then because all the other running averages were set to 0, it will continue pulling the lever from the first timestep forever. It will never touch the other levers.
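
As an aside, the running average that Algorithms 2 and 3 rely on can be maintained without storing every past reward. Here is a small sketch of the standard incremental update, with variable names of my own choosing:

```python
def update_running_average(current_average, pull_count, reward):
    """Incorporate one new reward into a lever's running average.

    current_average: average of this lever's previous rewards.
    pull_count: how many times this lever has now been pulled, including this pull.
    reward: the reward just observed.
    """
    # New average = old average nudged toward the new reward by a step of 1/pull_count.
    return current_average + (reward - current_average) / pull_count

# Example: rewards 14, 15, 16 give the same result as averaging them directly.
avg = 0.0
for count, reward in enumerate([14, 15, 16], start=1):
    avg = update_running_average(avg, count, reward)
print(avg)  # 15.0
```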

One fix for this problem is to use another parameter, epsilon, that controls if we pull the lever with the highest running average or if we pull a random lever. Our algorithm becomes:


Algorithm 3

Before doing any pulls, set the running average of each lever to 0.

Set epsilon to a number in [0, 1].

For each timestep:

  • Draw a random number x uniformly from [0, 1].
  • If x > epsilon, pick the lever with the highest running average. If there is a tie, pick randomly between the tied levers.
  • Otherwise (x ≤ epsilon), pick a random lever.
  • Pull the lever we just picked, observe the reward, and update that lever’s running average with the new reward.

This algorithm does work. Through the randomness of x, eventually we will be forced to try all the levers enough times to get an accurate running average for all of them. However, we have to decide what epsilon is. If we set epsilon closer to 1, the algorithm will try the levers more evenly, leading to better estimates, but lower total reward. If we set epsilon closer to 0, the algorithm will focus more on a few levers, leading to less accurate estimates but a higher possible reward.
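
Here is one possible Python sketch of Algorithm 3. The bandit setup, the Gaussian reward model, and the names (`run_epsilon_greedy`, `true_means`, and so on) are my own illustrative assumptions, not code from the original paper or article.

```python
import random

def run_epsilon_greedy(true_means, epsilon, n_steps, seed=0):
    """Epsilon-greedy on a k-armed bandit; returns the total reward collected."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    averages = [0.0] * n_arms   # running average of rewards per lever
    counts = [0] * n_arms       # how many times each lever has been pulled
    total_reward = 0.0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: pick a random lever.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the lever with the highest running average,
            # breaking ties randomly.
            best = max(averages)
            arm = rng.choice([a for a, avg in enumerate(averages) if avg == best])

        # Hypothetical reward model: Gaussian noise around the lever's true mean.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        averages[arm] += (reward - averages[arm]) / counts[arm]
        total_reward += reward

    return total_reward

# Example: compare a purely greedy agent (epsilon = 0) with an epsilon-greedy one.
means = [1.0, 2.0, 5.0, 3.0, 1.5]
print(run_epsilon_greedy(means, epsilon=0.0, n_steps=1000))
print(run_epsilon_greedy(means, epsilon=0.1, n_steps=1000))
```

Setting epsilon to 0 recovers the greedy behaviour of Algorithm 2, so running both settings side by side is a quick way to see the value of exploration.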

The RL Perspective

Now let’s take all of this and put it in reinforcement learning terms. First, we noticed that we wanted accurate estimates of the true average reward for each lever. We call the true average reward for each lever the action-value function of the problem. Why is it called this? Well, in our problem the possible actions we can take are pulling each of the levers, and each action has a true average value of the reward. More formally, the action-value function is denoted by Q(a), where Q is the function and a represents the possible actions. We don’t know Q, so we try to estimate it; call the estimate Q̂ (read "Q-hat"). Our hope is that Q̂ will be close to Q.
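
In terms of the sample averages we have been using, one natural estimate at timestep t is

$$\hat{Q}_t(a) = \frac{\text{sum of rewards received when lever } a \text{ was pulled before timestep } t}{\text{number of times lever } a \text{ was pulled before timestep } t}.$$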

We also came up with some concrete algorithms to solve this bandit problem. Algorithm 1 has two separate parts. The first part tried to approximate the action-value function (build Q̂) by using a sample average of n pulls per lever. The second part used Q̂ to dictate which lever to pull. At a high level, the first part of the algorithm, building Q̂, is called exploration. The second part, taking the best action as determined by Q̂, is called exploitation. We had some trouble setting the parameter n for this algorithm. Equivalently, we had trouble deciding how to split the timesteps between exploration and exploitation. This is a common theme in reinforcement learning.

We call an algorithm that always exploits a greedy algorithm. A greedy algorithm always tries to maximize the reward at each timestep; it never explores. We saw this problem with Algorithm 2: because no exploration is done, the greedy algorithm might work well in the short run, but in the long run it is almost certainly suboptimal.

Another algorithm we came up with (Algorithm 3) uses running averages combined with a parameter epsilon. Epsilon effectively controls the balance between exploration and exploitation: if epsilon is close to 0, most of the time we will be exploiting, i.e. acting greedily. This type of algorithm is therefore called "epsilon-greedy" – epsilon controls how greedy we want to be. Epsilon-greedy algorithms are very commonly used, including in the Atari paper mentioned at the beginning of this article.

Let’s recap what we’ve covered. First, we explained the k-armed bandit problem and several possible solutions. We noticed that approximating the action-value function is an important step in solving the problem. We saw the difficulty in balancing exploration and exploitation, a common theme in reinforcement learning problems. Finally, we covered greedy and epsilon-greedy algorithms and how they tie into the exploration/exploitation framework. Reinforcement learning is quite a broad topic, but I hope this article has shed some light on some of the basic concepts.
