A Multi-Armed Bandit (MAB) is a classic problem in decision-making, where an agent must choose between multiple options (called "arms") and maximize the total reward over a series of trials. The problem gets its name from a metaphor involving a gambler at a row of slot machines (one-armed bandits), each with a different but unknown probability of paying out. The goal is to find the best strategy to pull the arms (select actions) and maximize the gambler’s overall reward over time. At its core, the MAB problem is a formalization of the exploration-exploitation trade-off.
The Multi-Armed Bandit problem is a foundational problem that arises in numerous industrial applications. Let’s explore it and examine interesting strategies for solving it.

Example
You’ve just arrived in a new city. You’re a spy and plan to stay for 120 days to complete your next assignment. There are three restaurants in town: Italian, Chinese, and Mexican. You want to maximize your dining satisfaction during your stay. However, you don’t know which restaurant will be the best for you. Here’s how the three restaurants stack up:
- Italian restaurant: Average satisfaction score of 8/10
- Chinese restaurant: Average satisfaction score of 6/10
- Mexican restaurant: Average satisfaction score of 9/10
The catch is that you don’t know these satisfaction scores when you start. What would be your strategy to pick the best restaurant over your 120 dinners?
Exploration
Let’s say you explore all three restaurants equally. In other words, you visit each restaurant for 40 days. The expected total satisfaction will equal (40 × 8 + 40 × 6 + 40 × 9) = 920, i.e., an average satisfaction of 7.67 per day. Is this an optimal strategy? If you had picked only the Mexican restaurant, you would have averaged a satisfaction of 9!
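As a quick sanity check on this arithmetic, here is a minimal sketch comparing the equal-exploration strategy with always eating at the Mexican restaurant (using the average scores of 8, 6, and 9 from the example above):

    # Expected satisfaction: equal exploration vs. always eating Mexican
    avg_satisfaction = {"Italian": 8, "Chinese": 6, "Mexican": 9}

    equal_split_total = sum(40 * s for s in avg_satisfaction.values())  # 40*8 + 40*6 + 40*9 = 920
    print(equal_split_total, equal_split_total / 120)                   # 920, ~7.67 per day

    mexican_only_total = 120 * avg_satisfaction["Mexican"]              # 1080
    print(mexican_only_total, mexican_only_total / 120)                 # 1080, 9.0 per day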
Exploration and Exploitation
You don’t want to explore too much. At the same time, you don’t want to choose one restaurant randomly and visit it all the time. You need a strategy that focuses on exploration followed by exploitation – revisiting the restaurant that consistently offers the highest satisfaction. This leads to the exploration-exploitation dilemma, and Multi-Armed Bandit algorithms help you balance the two.
ε-Greedy
The ε-Greedy algorithm is a simple method for managing exploration and exploitation.
- With probability ε, you explore by picking a random restaurant to try.
- With probability 1-ε, you exploit by choosing the restaurant that has provided the best satisfaction so far.
- Over time, this strategy helps you discover the best restaurant while occasionally trying new options.
The following Python code simulates the ε-Greedy algorithm. The satisfaction scores are drawn from Normal distributions with means of 8, 6, and 9 and standard deviations of 1, 2, and 1.5, respectively.
import numpy as np

class EpsilonGreedy:
    def __init__(self, n_restaurants, epsilon):
        self.n_restaurants = n_restaurants
        self.epsilon = epsilon
        self.visits = np.zeros(n_restaurants)        # number of visits per restaurant
        self.satisfaction = np.zeros(n_restaurants)  # cumulative satisfaction per restaurant

    def choose_restaurant(self):
        if np.random.random() < self.epsilon:
            return np.random.choice(self.n_restaurants)  # Explore: pick a random restaurant
        else:
            # Exploit: pick the restaurant with the best average satisfaction so far
            return np.argmax(self.satisfaction / (self.visits + 1e-5))

    def update(self, restaurant, score):
        self.visits[restaurant] += 1
        self.satisfaction[restaurant] += score

n_restaurants = 3
epsilon = 0.1
n_days = 120
true_avg_satisfaction = np.array([8, 6, 9])
true_stddev_satisfaction = np.array([1, 2, 1.5])

total_satisfaction_arr = []
for i in range(50):  # Run the simulation 50 times
    epsilon_greedy_restaurant = EpsilonGreedy(n_restaurants, epsilon)
    total_satisfaction = 0
    for _ in range(n_days):
        restaurant = epsilon_greedy_restaurant.choose_restaurant()
        score = np.random.normal(loc=true_avg_satisfaction[restaurant], scale=true_stddev_satisfaction[restaurant])
        epsilon_greedy_restaurant.update(restaurant, score)
        total_satisfaction += score
    print("Total Satisfaction (Epsilon-Greedy):", total_satisfaction)
    total_satisfaction_arr.append(total_satisfaction)

# Average and standard deviation of per-day satisfaction across the 50 runs
print(np.mean(total_satisfaction_arr) / n_days, np.std(total_satisfaction_arr) / n_days)
I simulated the algorithm 50 times with ε=0.1 and observed an average satisfaction score of 8.49±0.35. You can experiment with the ε parameter, as in the sketch below.
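For example, a small sweep over ε (a sketch that reuses the EpsilonGreedy class and the true_avg_satisfaction / true_stddev_satisfaction arrays defined above; the exact numbers will vary from run to run) could look like this:

    def run_epsilon_greedy(epsilon, n_days=120, n_runs=50):
        # Average per-day satisfaction over n_runs independent 120-day stays
        per_day_means = []
        for _ in range(n_runs):
            agent = EpsilonGreedy(n_restaurants, epsilon)
            total = 0.0
            for _ in range(n_days):
                r = agent.choose_restaurant()
                score = np.random.normal(loc=true_avg_satisfaction[r], scale=true_stddev_satisfaction[r])
                agent.update(r, score)
                total += score
            per_day_means.append(total / n_days)
        return np.mean(per_day_means), np.std(per_day_means)

    for eps in [0.0, 0.05, 0.1, 0.2, 0.5]:
        mean, std = run_epsilon_greedy(eps)
        print(f"epsilon = {eps:.2f}: {mean:.2f} +/- {std:.2f}")

With ε=0 the agent never explores and can get stuck on a mediocre restaurant, while a very large ε wastes too many dinners exploring; intermediate values tend to do best.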
Upper Confidence Bound
The Upper Confidence Bound algorithm also tries to balance exploration and exploitation. It considers the average satisfaction score recorded so far and the uncertainty about the restaurant’s satisfaction score. Restaurants that haven’t been visited enough are explored more to reduce uncertainty, but the one that consistently performs well is still favored. Once it’s confident enough, the UCB algorithm eventually settles on the most satisfactory restaurant.

On day $t$, UCB picks the restaurant that maximizes

$$\bar{x}_i + \sqrt{\frac{2 \ln t}{n_i}}$$

where $\bar{x}_i$ is the average satisfaction observed at restaurant $i$ so far and $n_i$ is the number of times it has been visited.
- UCB involves maximizing this upper confidence bound. The form of the confidence term is derived from Hoeffding’s inequality.
- Hoeffding’s inequality provides a theoretical guarantee that a restaurant’s true satisfaction lies within a confidence interval around the estimated satisfaction.
- The first term is the observed average satisfaction score and the second is the width of the confidence interval. Note that the denominator $n_i$ is the number of times a restaurant has been visited: the smaller $n_i$ is, the larger the second term.
- Thus, it gives the restaurants with less data a chance before exploiting the restaurant with the best observed satisfaction score.
- The less frequently you visit a restaurant, the higher the uncertainty, and the higher the bonus the algorithm adds to encourage exploration.
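As a sketch of where the bonus term comes from (assuming, as the classical UCB1 analysis does, rewards bounded in a known range, e.g. scaled to [0, 1]): Hoeffding’s inequality bounds how far the true mean can exceed the empirical mean,

$$P\left(\mu_i > \bar{x}_i + u\right) \le e^{-2 n_i u^2}.$$

Letting this failure probability shrink with the total number of days $t$, e.g. setting $e^{-2 n_i u^2} = t^{-4}$, and solving for $u$ gives

$$u = \sqrt{\frac{2 \ln t}{n_i}},$$

which is exactly the confidence bound added to the observed average in the code below.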
The following Python code simulates the UCB algorithm with the same satisfaction score distributions used in the ε-Greedy example.
import numpy as np

class UCB:
    def __init__(self, n_restaurants):
        self.n_restaurants = n_restaurants
        self.visits = np.zeros(n_restaurants)        # number of visits per restaurant
        self.satisfaction = np.zeros(n_restaurants)  # cumulative satisfaction per restaurant
        self.total_trials = 0

    def choose_restaurant(self):
        if self.total_trials < self.n_restaurants:
            return self.total_trials  # First, visit each restaurant at least once
        ucb_values = np.zeros(self.n_restaurants)
        for restaurant in range(self.n_restaurants):
            avg_score = self.satisfaction[restaurant] / (self.visits[restaurant] + 1e-5)
            confidence_bound = np.sqrt(2 * np.log(self.total_trials + 1) / (self.visits[restaurant] + 1e-5))
            ucb_values[restaurant] = avg_score + confidence_bound
        return np.argmax(ucb_values)

    def update(self, restaurant, score):
        self.visits[restaurant] += 1
        self.satisfaction[restaurant] += score
        self.total_trials += 1

n_restaurants = 3
n_days = 120
true_avg_satisfaction = np.array([8, 6, 9])
true_stddev_satisfaction = np.array([1, 2, 1.5])

total_satisfaction_arr = []
for i in range(50):  # Run the simulation 50 times
    ucb_restaurant = UCB(n_restaurants)
    total_satisfaction = 0
    for _ in range(n_days):
        restaurant = ucb_restaurant.choose_restaurant()
        score = np.random.normal(loc=true_avg_satisfaction[restaurant], scale=true_stddev_satisfaction[restaurant])
        ucb_restaurant.update(restaurant, score)
        total_satisfaction += score
    print("Total Satisfaction (UCB):", total_satisfaction)
    total_satisfaction_arr.append(total_satisfaction)

# Average and standard deviation of per-day satisfaction across the 50 runs
print(np.mean(total_satisfaction_arr) / n_days, np.std(total_satisfaction_arr) / n_days)
I simulated the algorithm 50 times. I observed an average satisfaction score of 8.84±0.19.
Thompson Sampling
Thompson Sampling is another widely used algorithm for solving the Multi-Armed Bandit (MAB) problem. Unlike other methods like ε-Greedy or UCB, which use fixed rules to explore and exploit, Thompson Sampling uses a probabilistic approach to balance exploration and exploitation.
- The key idea behind Thompson Sampling is to maintain a probability distribution (often Beta) over the possible satisfaction scores.
- Each restaurant is assigned a prior probability distribution that represents how satisfied you expect to be. Initially, the distributions are wide, reflecting uncertainty.
- Before each dinner, you sample a potential satisfaction score from each restaurant’s distribution and choose the restaurant with the highest sampled score.
- After dinner, you update the probability distribution of the restaurant based on the actual satisfaction you experienced.
- The updated distribution becomes the prior distribution for the next dinner.
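Classically, Thompson Sampling is described for binary rewards, where each arm keeps a Beta posterior over its probability of success. Here is a minimal sketch of that textbook version; the success probabilities 0.8, 0.6, and 0.9 are hypothetical stand-ins for the chance of a "good dinner" at each restaurant, not the Normal scores used elsewhere in this article:

    import numpy as np

    # Textbook Beta-Bernoulli Thompson Sampling: each restaurant i has a
    # Beta(alpha[i], beta[i]) posterior over its probability of a "good dinner".
    alpha = np.ones(3)  # prior "successes" + 1
    beta = np.ones(3)   # prior "failures" + 1
    true_success_prob = np.array([0.8, 0.6, 0.9])  # hypothetical success probabilities

    rng = np.random.default_rng(0)
    for _ in range(120):
        theta = rng.beta(alpha, beta)  # sample a plausible success rate per restaurant
        i = int(np.argmax(theta))      # visit the most promising one
        if rng.random() < true_success_prob[i]:
            alpha[i] += 1              # posterior update on success
        else:
            beta[i] += 1               # posterior update on failure

    print(alpha / (alpha + beta))      # posterior mean success probability per restaurant

The simulation below keeps the article's continuous satisfaction scores and converts them into successes and failures with a simple thresholding heuristic inside update().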
The following Python code simulates this Thompson Sampling variant with the same satisfaction score distributions used in the ε-Greedy example.
import numpy as np

class ThompsonSampling:
    def __init__(self, n_restaurants):
        self.n_restaurants = n_restaurants
        self.visits = np.zeros(n_restaurants)        # number of visits per restaurant
        self.satisfaction = np.zeros(n_restaurants)  # cumulative satisfaction per restaurant
        self.alpha = np.ones(n_restaurants)          # Beta distribution parameters ("successes")
        self.beta = np.ones(n_restaurants)           # Beta distribution parameters ("failures")

    def choose_restaurant(self):
        # Sample a plausible value for each restaurant from its Beta posterior
        sampled_values = np.random.beta(self.alpha, self.beta)
        return np.argmax(sampled_values)

    def update(self, restaurant, score):
        self.visits[restaurant] += 1
        self.satisfaction[restaurant] += score
        # Update the Beta distribution based on the satisfaction score:
        # a score above the mean of the per-restaurant averages counts as a "success"
        if score > np.mean(self.satisfaction / (self.visits + 1e-5)):
            self.alpha[restaurant] += 1  # success
        else:
            self.beta[restaurant] += 1   # failure

n_restaurants = 3
n_days = 120
true_avg_satisfaction = np.array([8, 6, 9])
true_stddev_satisfaction = np.array([1, 2, 1.5])

total_satisfaction_arr = []
for i in range(50):  # Run the simulation 50 times
    thompson_sampling_restaurant = ThompsonSampling(n_restaurants)
    total_satisfaction = 0
    for _ in range(n_days):
        restaurant = thompson_sampling_restaurant.choose_restaurant()
        score = np.random.normal(loc=true_avg_satisfaction[restaurant], scale=true_stddev_satisfaction[restaurant])
        thompson_sampling_restaurant.update(restaurant, score)
        total_satisfaction += score
    print("Total Satisfaction (Thompson Sampling):", total_satisfaction)
    total_satisfaction_arr.append(total_satisfaction)

# Average and standard deviation of per-day satisfaction across the 50 runs
print(np.mean(total_satisfaction_arr) / n_days, np.std(total_satisfaction_arr) / n_days)
I simulated the algorithm 50 times. I observed an average satisfaction score of 8.5±0.3.
All three algorithms perform better than the basic strategy of equal exploration. Note that this is a simple illustration of just three restaurants. In practical cases, you may have hundreds of restaurants in a city.
Applications
- Online advertising platforms and e-commerce apps – MAB algorithms can optimize website ad placements by selecting which ads to display to maximize click-through rates (CTR) or conversions.
- Ride-hailing and food-delivery apps – MAB algorithms can help optimize driver allocation and routing in real time to maximize efficiency and minimize wait times.
I hope you found my article insightful. Thank you for reading!