A Baby Robot’s Guide To Reinforcement Learning


Overview
In this, the fourth part of our series on Multi-Armed Bandits, we’re going to take a look at the Upper Confidence Bound (UCB) algorithm that can be used to solve the bandit problem.
If you’re not already familiar with the bandit problem and its terminology you may want to first take a look at the earlier parts of this series, which are as follows:
- Part 1: Mathematical Framework and Terminology - all the basic information needed to get started
- Part 2: The Bandit Framework - a description of the code and test framework
- Part 3: Bandit Algorithms - covering the Optimistic-Greedy algorithm, the Epsilon-Greedy algorithm (ε-Greedy) and regret
All code for the bandit algorithms and testing framework can be found on GitHub: Multi_Armed_Bandits
Recap
Baby Robot is lost in the mall. Using Reinforcement Learning we want to help him find his way back to his mum. However, before he can even begin looking for her, he needs to recharge using a set of power sockets, each of which gives a slightly different amount of charge.
Using the strategies from the multi-armed bandit problem we need to find the best socket, in the shortest amount of time, to allow Baby Robot to get charged up and on his way.

Baby Robot has entered a charging room containing 5 different power sockets. Each of these sockets returns a slightly different amount of charge. We want to get Baby Robot charged up in the minimum amount of time, so we need to locate the best socket and then use it until charging is complete.
This is identical to the Multi-Armed Bandit problem except that, instead of looking for a slot machine that gives the best payout, we’re looking for a power socket that gives the most charge.

Introduction
In the Power Socket problem, where we’re trying to get the most charge in the shortest amount of time, we’ve seen that there’s a trade-off between the amount of time we spend exploring, in search of the best socket, and the time spent exploiting the socket that currently gives the best return. If we take too long exploring then we’re potentially missing out on using sockets that have already shown a high return. Alternatively, if we just exploit the sockets that have performed well then we might not find the best socket and potentially miss out on getting the maximum possible return.
Obviously the best approach would be to choose the best socket every time, and this is what’s known as the optimal policy. Equally obviously, this approach isn’t actually practical, since you don’t initially know which socket is the best. Some time has to be spent investigating how the sockets perform if you’re going to find the best one, and therefore it’s not always possible to choose the best action.
The optimal policy, although only theoretical, can however be used to evaluate other policies, to see how close they come to being optimal. The difference between the return that would be achieved by the optimal policy and the amount of return actually achieved by the policy under test is known as the regret. In this case the return is the amount of charge from the sockets and the policy is the method or strategy used to select the sockets to try.
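Written as a formula (introducing, just for this equation, μ* for the mean reward of the best socket, μ(aₜ) for the mean reward of the socket chosen at time step ‘t’ and L_T for the total regret; these symbols aren’t part of the notation used elsewhere in this series), the total regret after ‘T’ time steps is:

$$
L_T \;=\; \sum_{t=1}^{T} \big( \mu^{*} - \mu(a_t) \big) \;=\; T\mu^{*} \;-\; \sum_{t=1}^{T} \mu(a_t)
$$

So a good strategy is one whose cumulative regret grows as slowly as possible with ‘T’.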
As we’ve seen, Epsilon-Greedy has linear regret. It continues to explore the set of all actions long after it has gained enough knowledge to identify which of those actions are bad choices.
A better approach, in terms of maximising the total reward, would be to restrict the sampling over time to the actions showing the best performance. This is the exact approach taken by the Upper Confidence Bound (UCB) strategy.
The Upper Confidence Bound (UCB) Algorithm
Rather than performing exploration by simply selecting an arbitrary action, chosen with a probability that remains constant, the UCB algorithm changes its exploration-exploitation balance as it gathers more knowledge of the environment. It moves from being primarily focused on exploration, when actions that have been tried the least are preferred, to instead concentrate on exploitation, selecting the action with the highest estimated reward.
With UCB, ‘Aₜ’, the action chosen at time step ‘t‘, is given by:
Aₜ = argmaxₐ [ Qₜ(a) + c √( log t / Nₜ(a) ) ]
where:
- Qₜ(a) is the estimated value of action ‘a’ at time step ‘t’.
- Nₜ(a) is the number of times that action ‘a’ has been selected, prior to time ‘t’.
- ‘c’ is a confidence value that controls the level of exploration.
The formula for UCB can be thought of as being formed from 2 distinct parts:
Exploitation:
- Qₜ(a) represents the exploitation part of the equation. UCB is based on the principle of "optimism in the face of uncertainty", which basically means that if you don’t know which action is best then choose the one that currently looks to be the best. Taking this half of the equation by itself will do exactly that: the action that currently has the highest estimated reward will be the chosen action.
Exploration:
- The second half of the equation adds exploration, with the degree of exploration being controlled by the hyper-parameter ‘c‘. Effectively this part of the equation provides a measure of the uncertainty for the action’s reward estimate.
- If an action hasn’t been tried very often, or not at all, then Nₜ(a) will be small. Consequently the uncertainty term will be large, making this action more likely to be selected. Every time an action is taken we become more confident about its estimate. In this case Nₜ(a) increments, and so the uncertainty term decreases, making it less likely that this action will be selected as a result of exploration (although it may still be selected as the action with the highest value, due to the exploitation term).
- When an action is not being selected its uncertainty term grows slowly, due to the log function in the numerator, whereas every time the action is selected its uncertainty shrinks rapidly, since the increase in Nₜ(a) is linear. So the exploration term will be larger for actions that have been selected infrequently, reflecting the uncertainty in the estimates of their rewards.
- As time progresses the exploration term gradually decreases (since, for an action that keeps being selected, Nₜ(a) grows much faster than log t, so log t/Nₜ(a) heads towards zero), until eventually actions are selected based only on the exploitation term. A minimal code sketch of the full selection rule is given below.
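To make this concrete, here’s a minimal sketch of the selection rule in Python (the function name and arguments are purely illustrative and aren’t taken from the series’ actual test framework):

```python
import numpy as np

def ucb_action(Q, N, t, c=1.0):
    """Pick the action with the largest UCB value: Q_t(a) + c * sqrt(log t / N_t(a))."""
    Q = np.asarray(Q, dtype=float)
    N = np.asarray(N, dtype=float)

    # untried actions get infinite uncertainty, so they are always chosen first
    uncertainty = np.full_like(Q, np.inf)
    tried = N > 0
    uncertainty[tried] = c * np.sqrt(np.log(t) / N[tried])

    # exploitation term (Q) plus exploration term (uncertainty)
    return int(np.argmax(Q + uncertainty))
```

For example, ucb_action(Q=[6, 4], N=[1, 1], t=2) returns 0: both actions get the same exploration bonus, so the one with the larger reward estimate wins, exactly as in the first step after the priming round described later in this article.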
UCB Implementation
To investigate the performance of the UCB algorithm we’ll add another socket class to our test system. The full details of this test system are described in Part 2 of this series and all the code can be found in the GitHub repository.
The UCB socket needs to slightly modify the base power socket class, to add uncertainty to the metric it uses to evaluate a socket. So the ‘sample‘ function now returns the sum of the estimated mean reward and the uncertainty value, which is calculated as a function of the number of times the socket has been tried and the current time step.
The socket tester class reverts to the standard socket tester, with the selected socket simply being the one that returns the largest value from its ‘sample‘ function (i.e. the random action selection of the Epsilon-Greedy algorithm is no longer required).
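As a rough sketch of what this might look like (the class and method names below are illustrative and may differ slightly from those in the GitHub repository):

```python
import numpy as np

class PowerSocket:
    """Base socket: tracks a running estimate of its mean reward (illustrative sketch)."""
    def __init__(self, q):
        self.q = q       # the socket's true mean output
        self.Q = 0.0     # estimated mean reward
        self.n = 0       # number of times this socket has been tried

    def charge(self):
        # return the true mean output plus some Gaussian noise
        return np.random.randn() + self.q

    def update(self, reward):
        # incremental update of the sample-mean estimate
        self.n += 1
        self.Q += (reward - self.Q) / self.n

    def sample(self, t):
        # the standard socket is judged on its estimated mean alone
        return self.Q


class UCBSocket(PowerSocket):
    """UCB socket: its sample value adds an uncertainty term to the reward estimate."""
    def __init__(self, q, confidence=1.0):
        super().__init__(q)
        self.confidence = confidence

    def uncertainty(self, t):
        # infinite uncertainty for an untried socket forces the initial priming round
        if self.n == 0:
            return float('inf')
        return self.confidence * np.sqrt(np.log(t) / self.n)

    def sample(self, t):
        # UCB value = exploitation (Q) + exploration (uncertainty)
        return self.Q + self.uncertainty(t)


def select_socket(sockets, t):
    # the tester simply picks the socket with the largest sample value (the argmax)
    return int(np.argmax([socket.sample(t) for socket in sockets]))
```

With this in place the main loop just selects a socket, takes a charge from it and updates that socket’s estimate at each time step.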
Analysis of UCB
Using our test system we can analyse the performance of the UCB algorithm. To simplify the comparison, we’ve taken only the first 2 sockets from our standard set. These 2 sockets have mean reward values of 6 and 4 seconds of charge respectively. The relative contributions of each of the exploration and exploitation terms can clearly be seen in the graph below.
The main things to note here are:
- The total height of each bar represents the total UCB value. So, since at each time step we’re choosing the socket that gives the maximum UCB value (this is what the argmax in the formula is doing), the socket with the tallest bar will be the socket that is chosen. The number given at the top of each bar is the number of times this socket has been chosen. So you can see that this value will increase if this socket had the tallest bar at the previous time step.
- The shaded part of the bar represents the estimate of each socket’s actual output (its Q value). As time progresses this estimate converges on the true mean output for each socket. So socket 1’s Q value (the yellow shaded bar) heads towards its true output of 6 and the Q value for socket 2 goes towards its true mean value of 4.
- The solid part of each bar represents the exploration part of the equation. When a socket is used we become more certain about its true output and so the solid part of the bar will decrease in size. On the other hand, when a socket hasn’t been tested, we become less certain of its output and so the exploration value will begin to increase in size. So, in the graph below, it can be seen how socket 2’s exploration term, illustrated by the solid green part of the bar, gradually increases when socket 2 isn’t being selected, until eventually it results in the total size of the bar being greater than that of socket 1 and, at this point, socket 2 is then chosen.

In-depth analysis of the UCB graph
Some more in-depth observations from the UCB graph are as follows:
- The socket uncertainty value is set to infinity if a socket has not yet been tried, causing an initial priming round to be performed, in which each action is tried once to get its initial value. This then avoids divide-by-zero errors in the exploration term when actions have not yet been tried and Nₜ(a) is equal to zero. Obviously this is only applicable when there are fewer possible actions ‘k’ than there are time steps ‘t‘, otherwise there wouldn’t be enough time to try every action.
- Due to the priming round the graph begins at time step 2. The number of times that each socket has been selected is shown by the number at the top of each bar. So, at time step 2, it can be seen that each socket has been selected once. Since each socket has been tried the same number of times the contribution of the uncertainty term is the same for each socket. However, due to its larger reward estimate ‘Q’, socket 1 has the largest total UCB value and is therefore selected by the argmax function.
- At time step 3, socket 1 was the socket selected at the previous time step, so the count of the number of times it has been tried increases to 2. As a result the uncertainty term for this socket shrinks, and the solid blue bar can be seen to decrease in size. The hatched yellow bar also decreases, since this socket has been sampled again and has formed a better estimate of its true reward. On the other hand, socket 2 wasn’t selected, so its reward estimate stays the same. The number of times it has been selected also stays the same, while the number of time steps increases; consequently its uncertainty term increases, and the solid green bar can be seen to get bigger. However, the overall size of the UCB term for socket 1 is still greater than that of socket 2, so once again it is the socket that gets selected (the code snippet after this list shows this growth and shrinkage of the uncertainty term in isolation).
- Eventually, at time step 5, socket 2’s uncertainty term has increased sufficiently to make its total UCB value greater than that of socket 1 and so it is the socket that gets chosen. Once this happens its estimated reward value moves closer to the true mean reward, its uncertainty term shrinks and the whole process begins again.
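Here is that snippet. It simply evaluates the exploration term c·√(log t / Nₜ(a)) for a socket that is chosen at every step and for one that is never chosen again after the priming round (the confidence value of 1.0 is arbitrary):

```python
import numpy as np

def exploration_bonus(t, n, c=1.0):
    # the exploration part of the UCB formula: c * sqrt(log t / N_t(a))
    return c * np.sqrt(np.log(t) / n)

# after the priming round (t = 2) both sockets have been tried exactly once
for t in range(2, 8):
    chosen_n = t - 1    # the socket picked at every step: its count grows linearly
    ignored_n = 1       # a socket never picked again: its count stays at one
    print(f"t = {t}: chosen = {exploration_bonus(t, chosen_n):.3f}, "
          f"ignored = {exploration_bonus(t, ignored_n):.3f}")
```

The ignored socket’s bonus grows only slowly (logarithmically with ‘t’), while the chosen socket’s bonus shrinks, which is exactly the behaviour of the solid bars in the graph above.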
The confidence value ‘c’
The behaviour of the UCB algorithm over a range of confidence values is shown below. The confidence parameter controls the level of exploration and, in the case of our simple Power Socket experiment, it can be seen that the greater the level of exploration the lower the mean total reward. This is because each socket has a distinct mean output and a limited range of possible values, so the other sockets are unlikely to generate rewards in the same range as the best socket.
As a result, in our simple experiment, it looks like exploration is actually unnecessary. After the initial priming step the UCB algorithm has already locked on to the optimal socket and produces the best results when it can just exploit this knowledge. Increasing the value of the confidence parameter sharply reduces the mean total reward and this decrease continues until the algorithm becomes no better than random search.

A closer examination of the variation of the mean total reward as the confidence level is increased is shown below. Here it can be seen that the mean total reward does increase slightly as the confidence parameter rises from zero up to a value of about 0.6, after which it falls off rapidly. So a small degree of exploration is required to get the best results.

Analysing how often each socket is selected, with the confidence parameter set to a value of 0.6 (the value that gave the maximum mean total reward, as shown above), gives the graph shown below. Here it can be seen that, during the initial priming round, when a socket’s uncertainty is set to infinity if it has not yet been tried, each socket is selected exactly once during the first 5 time steps. After this point the optimal socket (socket 4) has already been identified as the best socket and so is selected for nearly all of the remaining trials.
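A sketch of the kind of experiment used to produce these plots is shown below. It isn’t the exact code from the test framework, the socket means are made-up values, and the number of time steps and repetitions are arbitrary, but it shows how a sweep over the confidence parameter can be run:

```python
import numpy as np

def run_ucb_trial(socket_means, c, time_steps=100):
    """Run one UCB trial and return the total reward collected (illustrative sketch)."""
    k = len(socket_means)
    Q = np.zeros(k)   # estimated mean reward of each socket
    N = np.zeros(k)   # number of times each socket has been tried
    total_reward = 0.0

    for t in range(1, time_steps + 1):
        # untried sockets get infinite uncertainty (the priming round)
        uncertainty = np.where(N > 0, c * np.sqrt(np.log(t) / np.maximum(N, 1)), np.inf)
        a = int(np.argmax(Q + uncertainty))

        reward = np.random.randn() + socket_means[a]   # noisy charge around the true mean
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]                 # incremental mean update
        total_reward += reward

    return total_reward

# sweep the confidence parameter, averaging the total reward over repeated runs
socket_means = [6, 4, 8, 10, 2]   # illustrative values only, not the actual test set
for c in np.arange(0.0, 2.1, 0.4):
    rewards = [run_ucb_trial(socket_means, c) for _ in range(200)]
    print(f"c = {c:.1f}: mean total reward = {np.mean(rewards):.1f}")
```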

UCB Regret
Because the optimal action is identified quickly, and the other actions are only tried when they have high uncertainty, the UCB method shows a much lower level of regret than the Epsilon-Greedy method.
As can be seen in the graphs below, the optimal and actual rewards are almost identical (so much so that the actual line is obscured by the optimal line) and regret is almost flat.
The vast majority of the regret occurs during the initial priming round, where each socket is being tried once to get its first estimate. Indeed, it has been shown that the expected cumulative regret of UCB is logarithmic in ‘T‘, the total number of time-steps.
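As a final sketch, the regret curve itself is easy to compute once the reward received at each time step has been recorded (the function below is just an illustration, measuring regret against the mean reward of the best socket, as in the formula given earlier):

```python
import numpy as np

def cumulative_regret(rewards, socket_means):
    """Cumulative regret: expected reward of the optimal socket minus the reward actually received."""
    optimal_mean = np.max(socket_means)
    per_step_regret = optimal_mean - np.asarray(rewards, dtype=float)
    return np.cumsum(per_step_regret)
```

For UCB this curve rises during the priming round, when every socket is tried once, and then stays nearly flat, since from that point on the optimal socket is selected almost every time.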

For an in-depth examination of UCB regret and UCB in general check out the following resources:
- Tor Lattimore and Csaba Szepesvari’s book and website on Bandit Algorithms
- Lai and Robbins’ seminal paper on the UCB algorithm: "Asymptotically efficient adaptive allocation rules".
Summary
In problems such as Socket Selection or the Multi-Armed Bandit, when faced with the dilemma of how to strike a balance between searching for the actions that give the best return and exploiting those that have already been found, it is important to use an approach that can modify the levels of exploration and exploitation.
As we saw, the Epsilon-Greedy algorithm simply maintains a constant level of exploration, continuing to sample the full set of actions as time progresses. As a result it has linear regret: the difference between the reward it achieves and the maximum possible reward continues to grow with time.
On the other hand, the Upper Confidence Bound (UCB) algorithm modifies its levels of exploration and exploitation. Initially, when it has little knowledge of the available actions, and a low confidence in the best actions to take, the exploration part of its equation causes it to search through the set of all possible actions.
As exploration progresses, better estimates are formed for the rewards given by each action. The level of exploration can therefore be decreased and the use of the good actions that have been found can increase, so the focus of the algorithm gradually swings from exploration towards exploitation. By shifting this balance as time progresses the UCB algorithm is able to achieve a much lower level of regret than that seen in Epsilon-Greedy.
In the next part we’ll take a look at Thompson Sampling, an even more sophisticated approach to balancing exploration and exploitation. Using this we’ll get Baby Robot charged and ready to go in practically no time!

< Part 3: Bandit Algorithms | Part 5: Thompson Sampling >