
The Bandit Framework

Multi-Armed Bandits: Part 2

A Baby Robot’s Guide To Reinforcement Learning

Photo by Jonathan Borba on Unsplash

Overview

Baby Robot is lost in the mall. Using Reinforcement Learning we want to help him find his way back to his mum. However, before he can even begin looking for her, he needs to recharge from a set of power sockets, each of which gives a slightly different amount of charge.

Using the strategies from the multi-armed bandit problem, we need to find the best socket in the shortest amount of time, to allow Baby Robot to get charged up and on his way.

This is the second in a six-part series on Multi-Armed Bandits. In Part 1 we covered all the basic terminology and mathematics required to describe the bandit problem.

In this part we’ll take a look at the problem we’ll be solving in the forthcoming articles, describing exactly how the power socket problem will be set up. This covers all the code used to create the basic socket simulator and the test framework used to evaluate socket-selection strategies.

So, although we won’t yet make it onto the actual Bandit algorithms, we’ll do all the required groundwork to let us start examining the various Bandit strategies in subsequent parts.


All code for the bandit algorithms and testing framework can be found on github: Multi_Armed_Bandits


Recap

Baby Robot has entered a charging room containing 5 different power sockets. Each of these sockets returns a slightly different amount of charge. We want to get Baby Robot charged up in the minimum amount of time, so we need to locate the best socket and then use it until charging is complete.

This is identical to the Multi-Armed Bandit problem except that, instead of looking for a slot machine that gives the best payout, we’re looking for a power socket that gives the most charge.


Testing Selection Strategies

Now that we have all the required notation and terminology in place, we can build a test system. This will let us examine how well different strategies perform at finding and exploiting the best power socket.


Implementing the Power Socket

We’re going to be testing several different strategies for locating the best power socket and, as part of this, we’ll need to create several different types of socket. However, all sockets have the same basic functionality and need to be able to perform the following operations:

  • Supply some charge, where the amount of charge returned depends on the true reward value of the socket.
  • Update the estimate of the socket’s reward value. Mainly this will be done by calculating the sample average, as described in Part 1. Each socket will keep track of its own estimate and the number of times it has been tried. It will also store its true reward value, but this is only for implementation purposes. Since this is the value we’re trying to estimate, after the socket has been initialised, we’ll pretend we don’t actually know this value.
  • Return the metric. This is used to evaluate and choose between sockets. By default this is just the current estimate of the socket’s reward value.

A Python implementation of a basic power socket is shown below. This acts as the base class from which all the other, more specialised, sockets will be derived.
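As a rough sketch of such a base class (the exact implementation is in the Github repository; here the charge output is assumed to be a unit-variance Gaussian around the socket’s true mean):

```python
import random

class PowerSocket:
    """ the base power socket class """

    def __init__(self, q):
        self.q = q          # the socket's true reward value (its mean charge output)
        self.initialize()   # reset the estimate and the try count

    def initialize(self):
        self.Q = 0   # the running estimate of this socket's reward
        self.n = 0   # the number of times this socket has been tried

    def charge(self):
        """ return a random amount of charge, normally distributed
            around the socket's true mean output 'q' """
        return random.gauss(self.q, 1.0)

    def update(self, reward):
        """ improve the estimate of this socket's mean reward
            using the incremental sample-average method """
        self.n += 1
        self.Q = self.Q + (1.0 / self.n) * (reward - self.Q)

    def sample(self):
        """ return the metric used to select between sockets -
            by default just the current reward estimate """
        return self.Q
```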

The basic power socket keeps track of the following values, which match with the equations we’ve seen so far:

  • q = the true mean value of the socket output
  • Q = the running estimate of the socket output (i.e. its reward)
  • n = the number of times the socket has been tried

‘Q‘ and ‘n‘ are initialised in a separate ‘initialize‘ function, to allow the socket to be reset without having to create a whole new socket.

When estimating the socket output, the true output ‘q‘ is the value that we’re trying to converge upon. Although ‘q‘ is required to set up the socket, its value isn’t used elsewhere, since this is the value we’re trying to find.

The ‘charge‘ function returns the output reward for the socket. This is given by a normal (Gaussian) distribution, with a mean value of ‘q‘ (the value provided during setup).

The ‘update‘ function calculates the sample average estimate, using the new reward value and the previous estimate, as described in Formula 1 from Part 1. In code this translates directly to:
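Sketching just that update in isolation (a minimal illustrative holder, not the full socket class):

```python
class Estimate:
    """ minimal holder for a running sample-average estimate """

    def __init__(self):
        self.Q = 0   # the current estimate
        self.n = 0   # the number of rewards seen so far

    def update(self, reward):
        self.n += 1
        # Qn = Qn-1 + (1/n)(Rn - Qn-1): move the old estimate
        # towards the new reward by a step of 1/n
        self.Q = self.Q + (1.0 / self.n) * (reward - self.Q)
```

Because the step size shrinks as ‘n‘ grows, each new reward nudges the estimate by less and less, and ‘Q‘ settles towards the true mean.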

And finally, the ‘sample‘ function returns the metric that will be used to select the next socket. In this case it’s just the value ‘Q‘, the current estimate of the socket’s reward.


Implementing the Charging Room

We want to set up a room that has 5 sockets, each of which will return a burst of power with a mean duration that is different from the other sockets.

To keep things simple we’re just going to use a fixed order for the power output from each of the sockets. So, in the code below, the ‘socket_order‘ parameter defines that socket 2 will be the worst and socket 4 will be the best, and this will remain fixed throughout our experiments.

The mean amount of charge returned by each socket is calculated by multiplying its socket order value by 2 and then adding 2 to offset this value and keep it above zero. So, in order, the sockets have the following mean rewards: 6, 4, 8, 12, 10.
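That calculation can be written directly (the ordering values below are chosen to reproduce the stated rewards, with socket 2 ranked lowest and socket 4 highest):

```python
# each socket's rank in the fixed ordering
socket_order = [2, 1, 3, 5, 4]

# mean reward = 2 * order + 2, the offset keeping every value above zero
mean_rewards = [2 * order + 2 for order in socket_order]
print(mean_rewards)  # [6, 4, 8, 12, 10]
```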

With the sockets set up in this way we get the following output from each of the sockets:

In the "violin" plot above, the line for each socket shows the range of reward values it returned. So socket number 1 returned between approximately 2 and 10 seconds of charge. The shaded area, underneath the line, represents the frequency at which each reward was returned. For socket 1 the most frequently returned reward was 6 seconds of charge. This is the mean reward for this socket and also its true reward value.

In the density plot we can see how the reward probability varies for each socket. Socket number 4 is the best, giving an average of 12 seconds worth of charge, whereas socket 2 is the worst, giving only 4 seconds worth. Socket 4 is the optimal socket that we’re trying to find and exploit.
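As a quick sanity check on those distributions, sampling many charges from each socket (again assuming unit-variance Gaussians around each mean) should recover socket 4 as the best and socket 2 as the worst:

```python
import random

random.seed(0)
mean_rewards = {1: 6, 2: 4, 3: 8, 4: 12, 5: 10}

# draw many charges from each socket and compare the empirical means
empirical = {socket: sum(random.gauss(mean, 1.0) for _ in range(10_000)) / 10_000
             for socket, mean in mean_rewards.items()}

best = max(empirical, key=empirical.get)    # socket 4
worst = min(empirical, key=empirical.get)   # socket 2
```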


The Socket Tester

We’re also going to need a way to sample and select from the set of sockets at each time step. This is handled by the SocketTester class, as shown below:

The SocketTester class creates a set of power sockets, with the supplied socket order, as defined previously. When it’s run, the socket tester loops for a specified number of time steps and, at each time step, performs the following operations:

  • Select a socket: This is done by calling the ‘sample‘ function for each socket and then choosing the socket with the largest returned value. As described previously, by default the returned sample value will just be a socket’s current estimate of its mean reward. When more than one socket shares the highest sample value, the ‘random_argmax‘ function (shown below) arbitrarily selects one of the tied sockets. This is used in preference to NumPy’s argmax function, which simply chooses the first item when a tie occurs.
  • Charge and Update: Once a socket has been chosen we charge from that socket to get a reward. We then update that socket, incrementing its count of the number of times it’s been tried and improving its reward estimate using the new reward value.
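The two steps above might be sketched as follows. This is a simplified, self-contained version: the socket class is cut down to its essentials, and the tie-breaking helper is a hypothetical stand-in for the one in the repository:

```python
import random

class PowerSocket:
    """ minimal socket: true mean 'q', estimate 'Q', try count 'n' """
    def __init__(self, q):
        self.q = q
        self.Q = 0
        self.n = 0

    def charge(self):
        return random.gauss(self.q, 1.0)

    def update(self, reward):
        self.n += 1
        self.Q += (1.0 / self.n) * (reward - self.Q)

    def sample(self):
        return self.Q

def random_argmax(values):
    """ index of the largest value, breaking ties at random
        (unlike NumPy's argmax, which always takes the first) """
    best = max(values)
    return random.choice([i for i, v in enumerate(values) if v == best])

class SocketTester:
    """ run a socket-selection strategy for a number of time steps """
    def __init__(self, socket_order=(2, 1, 3, 5, 4)):
        # mean reward = 2 * order + 2, as defined previously
        self.sockets = [PowerSocket(2 * order + 2) for order in socket_order]

    def run(self, time_steps):
        total_reward = 0
        for _ in range(time_steps):
            # 1. select the socket with the largest sample value
            index = random_argmax([socket.sample() for socket in self.sockets])
            # 2. charge from it and update its reward estimate
            reward = self.sockets[index].charge()
            self.sockets[index].update(reward)
            total_reward += reward
        return total_reward
```

Note that with the default greedy ‘sample‘, this tester will latch onto whichever socket happens to look best after its first try; the strategies in the later parts change that behaviour by overriding ‘sample‘ in derived socket classes.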

Note that the code shown above for the ‘SocketTester‘ class is a cut-down version of the actual class and only shows the main functionality. In the full version of the class the properties of the sockets are tracked during the course of the run. Additionally, this class only performs a single run. Since we’re dealing with random variables, one run may not be sufficient to determine the true behaviour of a socket selection strategy. Therefore a second class, ‘SocketExperiment‘, is used to set up and run repeated socket tests and generate the average results obtained across these tests. The full code for both of these classes can be found in the file PowerSocketSystem.py in the Github repository.


Summary

This part of our look at Multi-Armed Bandits didn’t really look at them at all! However, it did cover all of the basic framework that we’ll need to create and evaluate the Multi-Armed Bandit strategies in the subsequent articles.

In future articles, we’ll take the theory we learnt in part 1, and combine it with this test framework, to start examining how the various Bandit strategies can be applied to explore and exploit the set of power sockets.


< Part 1: Multi-Armed Bandits            Part 3: Bandit Algorithms >
