
Rusty Bandit

Building a Multi-armed Bandit with Rust

Why not use Python like every other data scientist? Python is fine for front end work, but all of the powerful data science algorithms use something faster on the backend. Why not C++? The same reasons a lot of people avoid C++; it’s too easy to do something seriously unsafe. Why not Java or Scala? Nothing on the JVM wins any speed contests.

But why Rust? Because it’s almost as fast as C++ and lacks many of the drawbacks. Rust is almost paranoid about memory and thread safety. Mutability has to be specified explicitly, which encourages programmers to use it sparingly. A variable can have at most one mutable reference at any time, which the compiler checks zealously. Threads are examined in similar detail. The end result is a language which makes it difficult to do something stupid (at least with an IDE) and where almost all mistakes are caught at compile time. As an added bonus, and almost uniquely in the programming world, the compiler’s error messages are actually helpful.

But Rust isn’t a data science language, and isn’t intended to be. Neither is C++, which is happily propping up the back ends of TensorFlow, XGBoost, PyTorch, and LightGBM. Rust could have a similar future, running all the messy matrix calculations hidden by Keras’s elegant syntax.

Aside from that, it’s there. Learning Rust is a challenge, which is a good enough reason for anything.

With the preliminaries concluded, the remainder of this article will consist of three parts. The first will be a quick introduction to the multi-armed bandit problem. The second will describe how I designed and coded Ratel, a Rust program for performing multi-armed bandit experiments. The third will discuss the output of one of these experiments.

(A note on the program name: multi-armed bandit learning involves a combination of exploration (making new choices) and exploitation (sticking with the best-known choice). I couldn’t think of an animal better at exploration or exploitation than the honey badger, known in Afrikaans as the ratel.)

CT Cooper / Public domain

Multi-armed Bandit

One-armed bandit is an old slang term for a slot machine. With a one-armed bandit, the user pulls the lever, or arm, in the hopes of a reward, all the while letting the machine steal a bit of their money for the privilege. A multi-armed bandit has several arms that can be pulled, each one producing a reward according to some probability distribution. The good news is that, in most applications, multi-armed bandits give better odds than Las Vegas. The bad news is that the reward distribution for each arm is secret, and is likely to remain so.

The goal of multi-armed bandit learning is not to find the distribution of each arm, merely the arm with the highest average reward. Broadly speaking, there’s exactly one way to do this: pull all the arms a lot of times until you have a reasonably accurate measure of each of their averages. That’s not exactly an algorithm; the algorithm determines how the arms are chosen. We have a few options.

Greedy Algorithm (Full Exploitation)

An initial guess is made for each arm. This is usually on the high end of the known range of possible values. An initial arm is selected at random, and the guess for that arm is updated based on the output. Each subsequent pull takes the arm with the highest average return to date, with any ties broken randomly.

This isn’t a bad strategy, but it does run the risk of picking a sub-optimal arm if the first few guesses are unlucky. I’ll be able to quantify this risk in part three; for now an example will suffice.

Suppose there are two arms, one of which pays out 1 25% of the time and another that pays out 1 75% of the time. The rest of the time they pay out 0. The initial guesses for the average payout of each arm are both 1. By chance, the first four pulls of the second arm pay out 0, while the first four pulls of the first arm pay out 0 three times and 1 once. Our estimate for the average return of the second arm, 0.2, is now below the real average return for the first arm, 0.25. Following a greedy strategy, I’ll keep pulling arm 1 with little chance of correcting my initial error.
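
To make the arithmetic concrete (treating the initial guess of 1 as one extra observation per arm, which is how the numbers above work out):

Q(arm 1) = (1 + 0 + 0 + 0 + 1) / 5 = 0.4
Q(arm 2) = (1 + 0 + 0 + 0 + 0) / 5 = 0.2

Since 0.4 > 0.2, the greedy strategy exploits arm 1, and because arm 1’s true mean of 0.25 stays comfortably above 0.2, arm 2 is very unlikely to get another chance.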

This may seem like an extreme case; the chances of four 0s in a row from arm two are less than 1%. In real life the spread between the returns of the bandit arms is usually much smaller, and the chances of the greedy algorithm finding a bad local maximum are correspondingly greater.

Epsilon-Greedy Algorithm (Exploration and Exploitation)

The only way the greedy algorithm can get stuck in a local maximum is if the estimated return for a suboptimal arm is higher than the estimated return for the optimal arm. In practice, this will only happen relatively early in the experiment; given enough time the law of large numbers guarantees that the optimal arm will have the best estimate. The phrase "enough time" is a funny one in mathematics; it can be a very, very long time indeed. Furthermore, multi-armed bandits are usually trained "in production" and need to hit optimal, or near optimal, performance in a timely manner. We don’t have "enough time."

Enter the epsilon-greedy algorithm. It differs from the greedy algorithm in only one particular. With some probability epsilon (0.01 and 0.1 are popular but it varies by problem) the epsilon-greedy algorithm will pull a random arm. This allows the optimal arm to get back in the game if the greedy algorithm has made a bad choice while still making the best guess based on current knowledge most of the time.

To be really sophisticated, drop the value of epsilon over time.
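
As a rough illustration of that idea (a sketch, not part of Ratel; the constants are arbitrary), one simple schedule shrinks epsilon with the number of pulls while keeping a small floor so exploration never stops entirely:

/// Illustrative epsilon decay: shrink harmonically with the number of pulls,
/// but never below a small floor.
fn decayed_epsilon(initial: f64, pulls: u64, floor: f64) -> f64 {
    (initial / (1.0 + pulls as f64)).max(floor)
}

fn main() {
    for pulls in [0u64, 10, 100, 1000] {
        println!("epsilon after {} pulls: {:.4}", pulls, decayed_epsilon(0.1, pulls, 0.001));
    }
}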

Optimistic Algorithm

Both the algorithms I’ve discussed so far use only the average of the bandit arm payouts. What if we want to use information about the variance? The best nonparametric estimate comes from Hoeffding’s Inequality and gives us the following upper confidence bound:

A_t = argmax_a [ Q_t(a) + c * sqrt( ln(t) / N_t(a) ) ]

(Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, 2nd Edition.)

In this formula, Q_t(a) is the current estimated average value of arm a, t is the number of trials, N_t(a) is the number of times arm a has been pulled, and c is a user-defined constant. The algorithm picks the arm with the highest upper confidence bound. This has an effect similar to the epsilon-greedy algorithm with a decreasing value of epsilon, only without the need to pick an initial epsilon or a decay rule by hand. A value of c still needs to be chosen.
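
As a sketch of how the selection rule might look in Rust (illustrative only, not Ratel’s code; the slices q and n are assumed to hold the current estimates and pull counts):

/// Pick the arm with the highest upper confidence bound.
/// `q` holds the value estimates, `n` the pull counts, `t` the trial number,
/// and `c` is the user-defined exploration constant.
fn ucb_arm(q: &[f64], n: &[u64], t: u64, c: f64) -> usize {
    let bound = |arm: usize| {
        if n[arm] == 0 {
            // An arm that has never been pulled gets an unbounded score,
            // so every arm is tried at least once.
            f64::INFINITY
        } else {
            q[arm] + c * ((t as f64).ln() / n[arm] as f64).sqrt()
        }
    };
    (0..q.len())
        .max_by(|&a, &b| bound(a).partial_cmp(&bound(b)).unwrap())
        .unwrap()
}

fn main() {
    let q = vec![0.4, 0.2, 0.3];
    let n = vec![10u64, 2, 5];
    println!("UCB picks arm {}", ucb_arm(&q, &n, 17, 2.0));
}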

Ratel

A multi-armed bandit simulator can be split into two parts: the bandit itself and the player, or agent. In Python, C++, or Java, both components could be encoded as classes. Rust’s take on object orientation doesn’t have classes per se; instead it relies on structs, which carry organized collections of data, and traits, which contain methods that can be implemented for structs. Structs are very similar to C++ structs, while traits are similar to Java interfaces. As is usual with coding concepts, this will become clearer with an example.

The Bandit

My basic bandit trait is below, complete with all the functions a good bandit should need.

use num_traits::ToPrimitive;

/// A trait for common members of the Bandits
pub trait Bandit<T: ToPrimitive> {
    /// The number of arms of the Bandit
    fn arms(&self) -> usize;

    /// The arm with the highest average reward.
    fn best_arm(&self) -> usize;

    /// The maximum average reward of all the arms.
    fn max_reward(&self) -> f64 {
        self.mean(self.best_arm())
    }

    /// The average reward of a given arm.
    fn mean(&self, arm: usize) -> f64;

    /// The average rewards of all the arms.
    fn means(&self) -> Vec<f64> {
        (0..self.arms()).map(|arm| self.mean(arm)).collect()
    }

    /// The reward from a pull of a given arm.
    fn reward(&self, arm: usize) -> T;

    /// The standard deviation of a given arm.
    fn std(&self, arm: usize) -> f64;

    /// The standard deviations of all the arms.
    fn stds(&self) -> Vec<f64> {
        (0..self.arms()).map(|arm| self.std(arm)).collect()
    }
}

The methods that can be defined purely in terms of other trait methods have default implementations; those that require data from a specific struct are left to the implementation. The data a bandit needs is simply the probability distribution of each arm. A struct where each arm follows a binomial distribution is below.

use rand_distr::Binomial;
/// A bandit whose arms distribute rewards according to binomial distributions.
pub struct BinomialBandit<'a> {
    /// Vector of number of trials of a `yes-no` experiment.
    nums: &'a Vec<u32>,

    /// Vector of experiment success probabilities.
    probs: &'a Vec<f64>,

    /// The bandit arm with highest reward.
    best_arm: usize,

    /// Distributions of the arms.
    distributions: Vec<Binomial>,
}

The struct needs to be public because it will be referenced outside its current module. There are several visibility levels for items in Rust; it is almost always a good idea to use the most restrictive one available.
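
For reference, the main visibility levels look like this (a generic sketch, nothing to do with Ratel’s module layout):

#[allow(dead_code)]
mod bandits {
    pub struct VisibleEverywhere;    // visible to any code that can see this module
    pub(crate) struct CrateOnly;     // visible anywhere inside the current crate
    pub(super) struct ParentOnly;    // visible to the parent module
    struct ModulePrivate;            // the default: private to `bandits` itself
}

fn main() {
    // The three public-ish structs are reachable from the parent module;
    // `bandits::ModulePrivate` would not compile here.
    let _ = bandits::VisibleEverywhere;
}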

The other bits of this code that may be confusing to newcomers are the use of the ampersand & and the use of 'a. The former indicates that the value is a reference, which is similar to a pointer. In this struct nums and probs are not the vectors themselves, but merely the memory addresses of the vectors. The strange little 'a symbol governs the lifetimes of these references, and ensures that they don’t live longer than the vectors they point to. Rust does not allow dangling pointers.
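
A minimal illustration of what the lifetime annotation buys (again a sketch, not Ratel’s code):

/// Like BinomialBandit, this struct only borrows its vector.
struct Holder<'a> {
    values: &'a Vec<f64>,
}

fn main() {
    let values = vec![0.25, 0.75];
    let holder = Holder { values: &values };
    // The compiler guarantees that `values` outlives `holder`. If `values`
    // were dropped or moved while `holder` was still in use, the borrow
    // checker would reject the program at compile time.
    println!("first value: {}", holder.values[0]);
}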

A binomial distribution is defined by the number of trials and the probability of success for each trial. It may seem redundant to store both the defining parameters and a vector of distributions, and it is. As currently designed, the distributions cannot return their parameters, so it’s worth the extra space to hold on to them separately.

It often helps to define a custom constructor. To do so, implement the struct.

impl<'a> BinomialBandit<'a> {
    /// Initializes a new Bandit where each arm distributes rewards according to a binomial
    /// distribution.
    pub fn new(nums: &'a Vec<u32>, probs: &'a Vec<f64>) -> BinomialBandit<'a> {
        assert_eq!(nums.len(), probs.len());
        assert!(probs.val_max() <= 1.0);
        assert!(probs.val_min() >= 0.0);
        assert!(nums.val_min() > 0);
        let dist = nums
            .iter()
            .zip(probs)
            .map(|(&n, &p)| Binomial::new(u64::from(n), p).unwrap())
            .collect();
        let best_arm = nums
            .iter()
            .zip(probs)
            .map(|(&n, &p)| f64::from(n) * p)
            .collect::<Vec<f64>>()
            .arg_max();
        BinomialBandit {
            nums,
            probs,
            best_arm,
            distributions: dist,
        }
    }
}

Most of this is self-explanatory. A few basic assert statements ensure I don’t load rubbish. The two vectors are then zipped and iterated over to create the vector of distributions. The one bit of code that may require comment is u64::from(n). Modern programming languages, at least the compiled variety, seem to be going off the idea of automatic type conversion, and Rust is no exception. Mixing an integer and a float in arithmetic simply won’t compile. This is probably good for type safety, but it does have the downside of forcing the use of ugly explicit conversion statements.
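
A two-line illustration of the point, independent of Ratel:

fn main() {
    let n: u32 = 5;
    let p: f64 = 0.4;
    // let bad = n * p;             // does not compile: cannot multiply `u32` by `f64`
    let good = f64::from(n) * p;    // an explicit conversion is required
    println!("expected successes: {}", good);
}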

At this point, I just have a bandit sitting around not able to do much. To fix that, I need to implement the Bandit trait for the BinomialBandit struct. This involves defining the five methods in Bandit that were dependent on struct data.

impl<'a> Bandit<u32> for BinomialBandit<'a> {
    /// Returns the number of arms on the bandit.
    fn arms(&self) -> usize {
        self.nums.len()
    }

    /// Returns the arm with the highest average reward.
    fn best_arm(&self) -> usize {
        self.best_arm
    }

    /// Computes the expected return of a given arm.
    fn mean(&self, arm: usize) -> f64 {
        f64::from(self.nums[arm]) * self.probs[arm]
    }

    /// Determines the reward for pulling a given arm.
    fn reward(&self, arm: usize) -> u32 {
        self.distributions[arm].sample(&mut thread_rng()) as u32
    }

    /// Computes the standard deviation of a given arm.
    fn std(&self, arm: usize) -> f64 {
        (f64::from(self.nums[arm]) * self.probs[arm] * (1.0 - self.probs[arm])).sqrt()
    }
}

The original Bandit trait left the type of the output to the individual implementations. Here, I specify that BinomialBandit is in the business of giving out unsigned integer rewards. If you look at GaussianBandit you will see it dishes out floats.

The Agent

Just as there can be multiple types of bandits, there are multiple types of agents. I designed one agent struct for each of the three algorithms for finding the optimal strategy, and a general Agent trait containing the functions they all need. As with the bandit, I’ll describe the agent trait first.

/// A trait for common members of the Agents.
pub trait Agent<T: ToPrimitive> {
    /// The action chosen by the Agent.
    fn action(&self) -> usize;

    /// The number of arms in the Bandit the Agent is playing.
    fn arms(&self) -> usize {
        self.q_star().len()
    }

    /// The Agent's current estimate of the value of a Bandit's arm.
    fn current_estimate(&self, arm: usize) -> f64 {
        self.q_star()[arm]
    }

    /// The Agent's current estimate of all the Bandit's arms.
    fn q_star(&self) -> &Vec<f64>;

    /// Reset the Agent's history and give it a new initial guess of the Bandit's arm values.
    fn reset(&mut self, q_init: Vec<f64>);

    /// Update the Agent's estimate of a Bandit arm based on a given reward.
    fn step(&mut self, arm: usize, reward: T);

    /// Calculate the update of the Agent's guess of a Bandit arm based on a given reward.
    fn update(&mut self, arm: usize, reward: T) -> f64 {
        self.stepper().step(arm) * (reward.to_f64().unwrap() - self.q_star()[arm])
    }

    /// Returns a reference to the Agent's step size update rule.
    fn stepper(&mut self) -> &mut dyn Stepper;
}

As with the bandit trait, most of the functions are already filled in. The only method that requires much discussion is update. All agents update their estimates by a rule not dissimilar to stochastic gradient descent: take the difference between the current reward and the estimated average reward and adjust the estimate by some multiple, analogous to a learning rate, of that value. Exactly what rule determines the "learning rate" is up to the user; a good choice is a harmonically decreasing rate, which makes the estimate the sample average. The available stepper classes are in the stepper module in utils.
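
A toy sketch of that update (Ratel’s real Stepper implementations live in its utils module and may differ in detail): with a harmonic step size of 1/N, the rule reduces to a running sample average, here counting the initial guess as one observation to match the arithmetic in the greedy example earlier.

/// Toy value estimate updated with a harmonic step size.
struct Estimate {
    q: f64,      // current estimate
    pulls: u64,  // observations so far, counting the initial guess
}

impl Estimate {
    fn step(&mut self, reward: f64) {
        self.pulls += 1;
        let step_size = 1.0 / self.pulls as f64;  // the "learning rate"
        self.q += step_size * (reward - self.q);  // same shape as the Agent's step/update pair
    }
}

fn main() {
    let mut estimate = Estimate { q: 1.0, pulls: 1 }; // optimistic initial guess
    for reward in [0.0, 1.0, 0.0, 0.0] {
        estimate.step(reward);
    }
    println!("estimate: {}", estimate.q); // (1 + 0 + 1 + 0 + 0) / 5 = 0.4
}

Swapping in a constant step size instead weights recent rewards more heavily, which is the better choice if the arm distributions drift over time.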

ToPrimitive at the beginning of the trait is necessary to explicitly state that any type T used by Agent must be convertible to a float.

A look at the EpsilonGreedyAgent struct will be instructive in showing us how the Agent trait is used.

/// Agent that follows the Epsilon-Greedy Algorithm.
///
/// A fixed (usually small) percentage of the
/// time it picks a random arm; the rest of the time it picks the arm with the highest expected
/// reward.
pub struct EpsilonGreedyAgent<'a, T> {
    /// The current estimates of the Bandit arm values.
    q_star: Vec<f64>,

    /// The Agent's rule for step size updates.
    stepper: &'a mut dyn Stepper,

    /// The fraction of times a random arm is chosen.
    epsilon: f64,

    /// A random uniform distribution to determine if a random action should be chosen.
    uniform: Uniform<f64>,

    /// A random uniform distribution to chose a random arm.
    pick_arm: Uniform<usize>,
    phantom: PhantomData<T>,
}

The most important component of the struct is q_star, which contains the estimates of the average value of each arm. Most of the rest helps with the updates. I’ve already discussed stepper, epsilon is simply the frequency with which random arms are chosen, and the two uniform distributions are there to decide when and how to pick a random arm. The last field, phantom, is a zero-sized marker needed so the compiler sees the struct as using the type parameter T, which otherwise appears only in the trait implementation.
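
For readers who haven’t met PhantomData before, it exists purely to satisfy the compiler’s rule that every declared type parameter must be used somewhere in the struct (illustrative only, not Ratel’s code):

use std::marker::PhantomData;

// struct Rejected<T> { value: f64 }   // error[E0392]: parameter `T` is never used
struct Accepted<T> {
    value: f64,
    phantom: PhantomData<T>,           // zero-sized: it costs nothing at runtime
}

fn main() {
    let accepted: Accepted<u32> = Accepted { value: 1.0, phantom: PhantomData };
    println!("{}", accepted.value);
}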

How do we implement Agent for EpsilonGreedyAgent? See below.

impl<'a, T: ToPrimitive> Agent<T> for EpsilonGreedyAgent<'a, T> {
    /// The action chosen by the Agent. A random action with probability `epsilon` and the greedy
    /// action otherwise.
    fn action(&self) -> usize {
        if self.uniform.sample(&mut thread_rng()) < self.epsilon {
            self.pick_arm.sample(&mut thread_rng())
        } else {
            self.q_star.arg_max()
        }
    }

    /// The Agent's current estimate of all the Bandit's arms.
    fn q_star(&self) -> &Vec<f64> {
        &self.q_star
    }

    /// Reset the Agent's history and give it a new initial guess of the Bandit's arm values.
    fn reset(&mut self, q_init: Vec<f64>) {
        self.q_star = q_init;
        self.stepper.reset()
    }

    /// Update the Agent's estimate of a Bandit arm based on a given reward.
    fn step(&mut self, arm: usize, reward: T) {
        self.q_star[arm] += self.update(arm, reward)
    }

    /// Returns a reference to the Agent's step size update rule.
    fn stepper(&mut self) -> &mut dyn Stepper {
        self.stepper
    }
}

The Game

The agent and the bandit have to interact in order to perform experiments. Interactions are managed by a Game struct that contains an Agent and a Bandit, as well as counters and methods to make them interact.

/// Structure to make the Agent interact with the Bandit.
pub struct Game<'a, T: AddAssign + Num + ToPrimitive> {
    /// Agent learning about bandit.
    agent: &'a mut dyn Agent<T>,
    /// Bandit used by agent.
    bandit: &'a dyn Bandit<T>,
    /// Records wins and losses from each arm pull. Win means pulling the best arm.
    wins: RecordCounter<u32>,
    /// Records rewards from each arm pull.
    rewards: RecordCounter<T>,
}

The basic structure is self-explanatory. Apart from the Agent and Bandit, it contains counters to keep track of wins (picking the best arm) and rewards (the results of all arm pulls).

The implementation is a little more interesting.

impl<'a, T: AddAssign + Copy + Num + ToPrimitive> Game<'a, T> {
    /// Initializes a Game with an Agent, Bandit, and new counters.
    pub fn new(agent: &'a mut dyn Agent<T>, bandit: &'a dyn Bandit<T>) -> Game<'a, T> {
        assert_eq!(agent.arms(), bandit.arms());
        Game {
            agent,
            bandit,
            wins: RecordCounter::new(),
            rewards: RecordCounter::new(),
        }
    }

    /// Returns the number of bandit arms.
    pub fn arms(&self) -> usize {
        self.bandit.arms()
    }

    /// Agent chooses an arm to pull and updates based on reward.
    fn pull_arm(&mut self) {
        let current_action = self.agent.action();
        self.wins
            .update((current_action == self.bandit.best_arm()) as u32);
        let reward = self.bandit.reward(current_action);
        self.rewards.update(reward);
        self.agent.step(current_action, reward);
    }

    /// Resets Game. Resets Agent with new initial guess and resets counters.
    pub fn reset(&mut self, q_init: Vec<f64>) {
        self.agent.reset(q_init);
        self.rewards.reset();
        self.wins.reset();
    }

    /// Returns vector of rewards.
    pub fn rewards(&self) -> &Vec<T> {
        self.rewards.record()
    }

    /// Run game for a certain number of steps.
    pub fn run(&mut self, steps: u32) {
        for _ in 1..=steps {
            self.pull_arm()
        }
    }

    /// Returns vector of wins.
    pub fn wins(&self) -> &Vec<u32> {
        self.wins.record()
    }
}

As usual, most class methods are simple setters and getters. The two most interesting methods are pull_arm and run. The first asks the agent for an action, records whether that action was the best available arm, records the reward, and updates the agent. The second performs a full experiment by pulling the arm as many times as the experiment requires.

Those are all the building blocks needed for running a multi-armed bandit experiment. The rest of the Ratel code is either glue for putting experiments together or code for producing output. Anyone using this module for their own work would have to write their own versions, so I won’t discuss my code in detail. I will, in the next section, discuss one of my experiments.
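
For a flavour of what that glue looks like, here is a rough sketch that wires a two-armed Bernoulli bandit (a binomial with one trial per arm) to an epsilon-greedy agent. Only BinomialBandit::new and Game::new appear earlier in this article; the agent and stepper constructors below are stand-ins, so check Ratel’s source for the real signatures.

// Hypothetical wiring; the EpsilonGreedyAgent and HarmonicStepper constructors
// are placeholders for whatever Ratel actually exposes.
fn main() {
    // Two Bernoulli arms: one trial each, paying out with probability 0.25 and 0.75.
    let nums: Vec<u32> = vec![1, 1];
    let probs: Vec<f64> = vec![0.25, 0.75];
    let bandit = BinomialBandit::new(&nums, &probs);

    let mut stepper = HarmonicStepper::new();                                   // placeholder
    let mut agent = EpsilonGreedyAgent::new(vec![1.0, 1.0], &mut stepper, 0.1); // placeholder

    let mut game = Game::new(&mut agent, &bandit);
    game.run(4000);
    println!("wins: {}", game.wins().iter().sum::<u32>());
}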

Pairwise Bernoulli Bandits

When thinking about any complicated system it pays to start simple. The simplest multi-armed bandit has only two arms, and the simplest probability distribution is, arguably, the Bernoulli: one is returned with some probability p, and zero is returned with probability 1-p. Given two arms, a left and a right, with Bernoulli outputs of different probabilities (otherwise the problem would be trivial), what is the likelihood of finding the best arm with (a) the greedy algorithm and (b) the epsilon-greedy algorithm? (The optimistic algorithm works better with continuous distributions; I won’t consider it here.) How does this likelihood change as the probabilities vary?

I take the full range of probabilities from 0.01 to 0.99 in increments of 0.01. There’s no virtue in testing arm A with probability p and arm B with probability q only to then turn around and test arm A with probability q and arm B with probability p, so I can cut the number of cases in half by always giving the left arm the smaller probability. That still leaves 4851 combinations to try. While I’m at it, I’ll also try different initial guesses, 0.1 to 1.0 in increments of 0.1. That’s now 48510 combinations.

And here’s where working in Rust really starts to pay off. If I’d been working in Python, this would be the last you’d hear from me for about a month. Rust is about 100 times faster, which means I can run even hefty experiments overnight.

The greedy algorithm converges very fast; 200 iterations is sufficient to get accurate results for all combinations. This allows me to run each experiment 100000 times. With a six-core i7 processor (12 threads total), the greedy algorithm experiments run in about 3.5 hours.

Probability of choosing the best arm with initial guess 0.1. Axes are the chances of a reward of 1 from each arm.

The lowest possible initial guess in the experiment is 0.1 for both arms. If I start with that, the probability of the greedy algorithm picking the correct arm is high when only one arm has a high probability, in which case the problem is very easy, or when both arms have low probabilities, in which case their relative difference is still significant. By the time the left arm reaches a probability of 0.7, the chances of picking the best arm are little better than even.

Matters improve as the initial guess increases. Each time the initial guess is incremented by 0.1, the likelihood of picking the best arm increases, or at least does not decrease, for all choices of arm probabilities. By the time the initial guess reaches 0.7, the region of low success probability is a small sliver where the probabilities of both arms are high and the difference between them is small (the lower part of the triangle’s diagonal).

Probability of choosing the best arm for each initial guess value. As the initial guess increases, the likelihood of choosing the best arm goes up.

The moral is that, for the greedy algorithm, the initial guess is important and should be made as high as is practical. In this case, it was easy since the distributions were known to be Bernoulli. With less information about the distribution, a good initial guess will be harder to come by.

The other lesson is that even with a good guess, there are cases where the greedy algorithm simply isn’t that good. In this toy example, it is easy to qualify, if not quite quantify, the bad regions, but with more arms or more complicated distributions that luxury will no longer be available.

An obvious question is whether it’s possible to get more accuracy with longer runs; 200 iterations is not that much. It turns out to be enough. Plotting a subset of the experiments where the probabilities of the two arms differed only by 0.01 (taking the left arm to be 0.1, 0.2, 0.3, etc.), it’s clear that convergence was reached before the 100th iteration. This is as good as the greedy algorithm gets.

Epsilon Greedy

Is there a better strategy? Yes, although it does come at a price. Picking a random arm with some low probability gets near-optimal results for all but the most extreme cases. However, the epsilon-greedy algorithm comes with two catches. The first is that there is a hard upper bound, less than 1.0, on the probability of success. If a random arm is chosen with probability epsilon, then the best arm can be chosen with probability at most (1-epsilon) + epsilon/(number of arms). As the number of arms gets larger, this approaches 1-epsilon, which is an incentive to keep epsilon small.
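
Written out for k arms:

P(pick the best arm) <= (1 - epsilon) + epsilon / k

With two arms this works out to 0.9 + 0.05 = 0.95 for epsilon = 0.1 and 0.99 + 0.005 = 0.995 for epsilon = 0.01, the ceilings that show up in the experiments below.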

The second catch is that epsilon-greedy takes longer to converge, with smaller values of epsilon taking longer than larger ones.

epsilon=0.1

I start with epsilon=0.1. With two arms, the probability of picking the best arm should max out at about 0.95. As with the greedy algorithm, I try ten different initial values, ranging from 0.1 to 1.0. Each run consists of 4000 iterations, and I make 10000 runs (one tenth of the number for the greedy algorithm). The experiment took about 7.5 hours.

Probability of choosing the best arm for each initial guess value. The likelihood of choosing the best arm is largely independent of the initial guess.

The first thing to notice is that the dependence on the initial guess has been all but eliminated. That alone is a significant argument in epsilon-greedy’s favor. The second is that, except for the narrowest of slivers along the diagonal, all arm combinations have accuracy close to their maximum. The third is that, as expected, the maximum accuracy is not quite as high as it is for the greedy algorithm. Even with that one caveat, this is still an improvement over the previous result.

And there’s more good news in the region where the agent still has trouble picking the correct arm. Learning isn’t saturated yet. Further iterations would likely lead to better results. Plotting the same subset of experiments as above shows that the win rate has not leveled off yet. Another 4000 iterations, and I’d have even better results.

Win Rates over time.

epsilon=0.01

What happens if I try a lower value of epsilon? With a value of epsilon equal to 0.01, I get a result that has most of the plusses of both the greedy algorithm and epsilon-greedy with a larger epsilon. Apart from the change in epsilon, all parameters for this experiment are the same as above.

Once again, there is a maximum possible accuracy, but now it is 0.995. Once again, the results are largely independent of the initial guess. Once again, this accuracy is reached for all but a fairly narrow, albeit slightly wider, subset where the probabilities of the two arms are similar.

What about that narrow range? Is there any chance that more iterations will lead to better results there? Yes, but it will require a lot more. Accuracy is still increasing, but at a much slower rate than for epsilon = 0.1.

Win Rates Over Time

Conclusion

There are a number of enhancements that could be made to Ratel. At the moment, only two probability distributions are supported for bandits, Gaussian and binomial, but adding others would be trivial. Similarly, there are more sophisticated agent algorithms that could be added. A slightly less trivial enhancement would be to allow bandit arms from multiple distribution families, although most applications I’m aware of do not require that.

Of the agents implemented, it seems that the most successful overall is epsilon-greedy with a low value of epsilon. The greedy algorithm works best in cases where the arms can be easily distinguished. Epsilon-greedy is successful in all but the toughest to distinguish cases.

Perhaps the most important conclusion is that Rust performs well as a language for modeling and simulation.

