My Journey to Reinforcement Learning — Part 2: Multi-Armed Bandit Problem

Jae Duk Seo
Towards Data Science
7 min read · Jun 29, 2018


GIF from this website

Currently I am studying more about reinforcement learning, and I wanted to tackle the famous Multi-Armed Bandit Problem. (Since this problem is already so well known, I won’t go into the details of explaining it; hope that is okay with you!) Below are the different types of solutions we are going to use to solve this problem.

Please note that this post is part of my learning process; the way I learn is by explaining concepts to myself, and this post is exactly that.

Introduction to the Problem (Multi-Armed Bandit)

Image from this website

Now, this problem is already very famous, and since I am not good at English (lol), I have linked my friend Michael’s explanation of it. If you wish to read it please click here, or if you want additional reading please click here. I will instead focus on different methods of solving this problem. (During my research I found out that there are indeed multiple methods for solving this partial-information sequential decision-making problem.)

Solution 1) Epsilon-greedy Agent

Image from this website

The above equation is the action-value function, which measures how good it is to be in a certain state and take a certain action. However, in our problem there is only one state, the state in which we choose which bandit to pull, hence we can drop the symbol s. (That is why it does not appear above.)

As seen above, after a few expansions and substitutions we arrive at a simpler form of the equation. Another way of writing it can be seen below.

Image from this website
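In case the screenshot does not come through, here is my best reconstruction of the incremental form; this is the standard sample-average update (as in Sutton and Barto), which I am assuming matches the image above:

$$Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$$

where $Q_n$ is the current value estimate for the chosen bandit, $R_n$ is the reward just received, and $n$ is the number of times that bandit has been pulled so far.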

And we can now write the action-value update in Python, as seen below.
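The original code is embedded as a gist in the post and is not reproduced here, so the following is only a minimal sketch of the idea; the number of bandits, the value of epsilon, the number of pulls, and the random seed are my own assumptions rather than the author’s settings.

```python
# Minimal epsilon-greedy sketch (my own reconstruction, not the original gist).
import numpy as np

np.random.seed(0)
true_probs = np.random.uniform(size=10)   # hidden success probability of each bandit
Q = np.zeros(10)                          # running value estimate per bandit
counts = np.zeros(10)                     # number of pulls per bandit
epsilon = 0.1                             # exploration rate (assumed)

for step in range(10000):
    # explore with probability epsilon, otherwise exploit the current best estimate
    if np.random.rand() < epsilon:
        action = np.random.randint(10)
    else:
        action = int(np.argmax(Q))
    reward = float(np.random.rand() < true_probs[action])   # Bernoulli reward
    counts[action] += 1
    # incremental sample-average update: Q <- Q + (R - Q) / n
    Q[action] += (reward - Q[action]) / counts[action]

print("true probabilities :", np.round(true_probs, 2))
print("estimated Q values :", np.round(Q, 2))
print("best bandit (agent):", int(np.argmax(Q)))
```

With enough pulls, the estimated Q values should approach the hidden probabilities of the levers the agent visits often.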

And if we run the code, we get results like those seen below.

As seen above, some of the percentages are off; thankfully, however, our agent correctly predicts that bandit #9 has the highest probability. (Credit for this solution goes to michaelpacheco and Anson Wong.)

Solution 2) Policy Gradient

Image from this website

This method is pretty simple: before we move on to a deeper neural network, let’s just use the log likelihood and its derivative to choose the optimal lever. (For a more in-depth explanation of this solution, please click here.)

The experimental setting is exactly the same as the previous one, but now we are going to update the agent’s weights. As seen above, when implementing this in NumPy we need to take the derivative ourselves. The result is shown below.
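Again, the gist itself is not shown here, so the following is just a sketch of how the hand-derived update could look in NumPy; the reward scheme, learning rate, and exploration rate are assumptions on my part.

```python
# Minimal NumPy policy-gradient sketch (my reconstruction, not the original gist).
import numpy as np

np.random.seed(1)
true_probs = np.random.uniform(size=10)   # hidden success probability of each bandit
weights = np.ones(10)                     # one weight per lever, acts as the policy
lr = 0.001                                # learning rate (assumed)
epsilon = 0.1                             # exploration rate (assumed)

for step in range(20000):
    # epsilon-greedy choice over the current weights
    if np.random.rand() < epsilon:
        action = np.random.randint(10)
    else:
        action = int(np.argmax(weights))
    reward = 1.0 if np.random.rand() < true_probs[action] else -1.0
    # loss = -reward * log(w[action])  ->  d(loss)/d(w[action]) = -reward / w[action]
    grad = -reward / weights[action]
    weights[action] -= lr * grad          # gradient-descent step, derivative taken by hand

print("true probabilities:", np.round(true_probs, 2))
print("learned weights   :", np.round(weights, 2))
print("best bandit       :", int(np.argmax(weights)))
```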

And as seen above, the agent’s guess for each probability is somewhat close to the actual ground-truth probability. (Credit for this solution goes to Arthur Juliani.)

Solution 3) Policy Gradient (Multi Layer Network)

As seen above, we can easily create a network with two fully connected layers, and with this method we can extend solution 2): rather than directly learning which bandit to pull, we train a neural network that computes the optimal bandit (using backpropagation).

As seen above, while playing around with the network I had to clip the values going into log() to prevent an invalid input (log(0) is undefined). The results are shown below.
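The original network was built in a deep-learning framework and trained with backpropagation; since that gist is not shown here, the sketch below re-creates the idea in plain NumPy with a two-layer network and a REINFORCE-style update. The hidden size, learning rate, reward scheme, and clipping constant are all my own assumptions.

```python
# Rough two-layer policy-gradient sketch in NumPy (my reconstruction, not the author's code).
import numpy as np

np.random.seed(2)
n_bandits = 10
true_probs = np.random.uniform(size=n_bandits)   # hidden success probability of each bandit

hidden_size = 16
W1 = np.random.randn(n_bandits, hidden_size) * 0.1   # input -> hidden weights
W2 = np.random.randn(hidden_size, n_bandits) * 0.1   # hidden -> logits weights
x = np.ones((1, n_bandits))                          # single-state problem: constant input
lr = 0.05

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(20000):
    # forward pass: two fully connected layers, softmax over the bandits
    h = np.maximum(0, x @ W1)
    probs = softmax((h @ W2).ravel())
    action = np.random.choice(n_bandits, p=probs)
    reward = 1.0 if np.random.rand() < true_probs[action] else -1.0

    # loss = -reward * log(p[action]); clip so that log(0) can never occur
    safe_probs = np.clip(probs, 1e-8, 1.0)
    loss = -reward * np.log(safe_probs[action])

    # backward pass (softmax + cross-entropy shortcut): dL/dlogits = reward * (p - onehot)
    onehot = np.zeros(n_bandits)
    onehot[action] = 1.0
    dlogits = reward * (probs - onehot)
    dW2 = h.T @ dlogits[None, :]
    dh = dlogits[None, :] @ W2.T
    dh[h <= 0] = 0.0                                  # ReLU gradient
    dW1 = x.T @ dh
    W2 -= lr * dW2
    W1 -= lr * dW1

    if step % 5000 == 0:
        print(f"step {step:5d}  loss {loss:.3f}")

print("true probabilities:", np.round(true_probs, 2))
print("learned policy    :", np.round(probs, 2))
```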

As seen above, the agent is able to guess a similar probability for each bandit; however, for some reason it never wanted to pull the left-most bandit.

Solution 4) Linear Reward Inaction

Image from this website

This method is simpler than using a neural network: we maintain a probability distribution over the bandits and simply increase the probability of a bandit being chosen whenever pulling it yields a reward.

And as seen above, we only update when we receive a reward from the bandit; when there is no reward, we do not update anything.
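The gist is not reproduced here; below is a short sketch of what I believe is the standard linear reward-inaction update, with the learning rate and number of pulls chosen arbitrarily.

```python
# Linear reward-inaction sketch (assumed standard L_RI update, not the original gist).
import numpy as np

np.random.seed(3)
n_bandits = 10
true_probs = np.random.uniform(size=n_bandits)   # hidden success probability of each bandit
p = np.ones(n_bandits) / n_bandits               # start from a uniform choice distribution
alpha = 0.01                                     # learning rate (assumed)

for step in range(20000):
    action = np.random.choice(n_bandits, p=p)
    reward = np.random.rand() < true_probs[action]
    if reward:
        # reward: move probability mass toward the rewarded lever
        p = (1 - alpha) * p
        p[action] += alpha
    # no reward: inaction, the distribution is left untouched

print("true probabilities :", np.round(true_probs, 2))
print("choice distribution:", np.round(p, 2))
```

Note that scaling everything by (1 - alpha) and then adding alpha to the chosen lever is equivalent to p[action] += alpha * (1 - p[action]), so the vector remains a valid probability distribution.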

And the agent’s final probability for each lever is actually pretty accurate. (Credit for this solution goes to michaelpacheco.)

Solution 5) Linear Reward Penalty

Image from this website

Now let’s simply extend the previous solution: as seen above, when we do not get a reward from a certain bandit, we decrease the probability of that bandit being chosen in the next round.

Adding one else statement with a few more lines to decrease the probability gives the results seen below.
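Continuing the sketch from the previous solution, the else branch might look like this (again assuming the textbook linear reward-penalty form; beta is my own choice):

```python
# Linear reward-penalty sketch (assumed standard L_RP update, not the original gist).
import numpy as np

np.random.seed(4)
n_bandits = 10
true_probs = np.random.uniform(size=n_bandits)   # hidden success probability of each bandit
p = np.ones(n_bandits) / n_bandits               # start from a uniform choice distribution
alpha = 0.01    # reward step (assumed)
beta = 0.01     # penalty step (assumed)

for step in range(20000):
    action = np.random.choice(n_bandits, p=p)
    reward = np.random.rand() < true_probs[action]
    if reward:
        # same update as linear reward-inaction
        p = (1 - alpha) * p
        p[action] += alpha
    else:
        # penalty: shrink the chosen lever and redistribute its mass to the others
        others = np.arange(n_bandits) != action
        p[others] = beta / (n_bandits - 1) + (1 - beta) * p[others]
        p[action] = (1 - beta) * p[action]

print("true probabilities :", np.round(true_probs, 2))
print("choice distribution:", np.round(p, 2))
```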

The predicted probability for each bandit is nearly perfect. (Credit for this solution goes to michaelpacheco.)

Interactive Code

For Google Colab, you need a Google account to view the code. Also, you can’t run read-only scripts in Google Colab, so make a copy in your own playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy coding!

To access the code for solution 1, please click here.
To access the code for solution 2, please click here.
To access the code for solution 3, please click here.
To access the code for solution 4, please click here.
To access the code for solution 5, please click here.

Final Words

I want to include the slides/papers below as an example of how complex and fundamental this problem can be.

Image from this website
Paper from this website
Paper from this website

If you find any errors, please email me at jae.duk.seo@gmail.com; if you wish to see a list of all of my writing, please visit my website here.

Meanwhile, follow me on Twitter here, and visit my website or my YouTube channel for more content. I also implemented Wide Residual Networks; please click here to view the blog post.

Reference

  1. numpy.random.uniform — NumPy v1.14 Manual. (2018). Docs.scipy.org. Retrieved 28 June 2018, from https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.uniform.html
  2. numpy.set_printoptions — NumPy v1.13 Manual. (2018). Docs.scipy.org. Retrieved 28 June 2018, from https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.set_printoptions.html
  3. percentage, R. (2018). Real Python random percentage. Stack Overflow. Retrieved 28 June 2018, from https://stackoverflow.com/questions/15613041/real-python-random-percentage
  4. distribution, G. (2018). Generate random numbers with a given (numerical) distribution. Stack Overflow. Retrieved 28 June 2018, from https://stackoverflow.com/questions/4265988/generate-random-numbers-with-a-given-numerical-distribution
  5. Michael Pacheco. (2018). Michaelpacheco.net. Retrieved 28 June 2018, from https://www.michaelpacheco.net/blog/RL-multi-armed-bandit
  6. arctan(x) | inverse tangent function. (2018). Rapidtables.com. Retrieved 28 June 2018, from https://www.rapidtables.com/math/trigonometry/arctan.html
  7. numpy.arctan — NumPy v1.14 Manual. (2018). Docs.scipy.org. Retrieved 28 June 2018, from https://docs.scipy.org/doc/numpy/reference/generated/numpy.arctan.html
  8. Table of Derivatives. (2018). Math.com. Retrieved 28 June 2018, from http://www.math.com/tables/derivatives/tableof.htm
  9. @, D. (2018). Difference between numpy dot() and Python 3.5+ matrix multiplication @. Stack Overflow. Retrieved 28 June 2018, from https://stackoverflow.com/questions/34142485/difference-between-numpy-dot-and-python-3-5-matrix-multiplication
  10. Atan (arc tangent) function. (2018). Communityviz.city-explained.com. Retrieved 28 June 2018, from http://communityviz.city-explained.com/communityviz/s360webhelp4-2/formulas/function_library/atan_function.htm
  11. Learning rate. (2018). Users.ics.aalto.fi. Retrieved 28 June 2018, from http://users.ics.aalto.fi/jhollmen/dippa/node22.html
  12. What does “1E+09” mean? — Playing with a modded Stock part. (2013). Kerbal Space Program Forums. Retrieved 28 June 2018, from https://forum.kerbalspaceprogram.com/index.php?/topic/36897-what-does-quot1e09quot-mean-playing-with-a-modded-stock-part/
  13. Derivative of the Natural Logarithm. (2018). Oregonstate.edu. Retrieved 28 June 2018, from https://oregonstate.edu/instruct/mth251/cq/Stage6/Lesson/log.html
  14. Derivative of Logarithm — log(x)’. (2018). Rapidtables.com. Retrieved 28 June 2018, from https://www.rapidtables.com/math/algebra/logarithm/Logarithm_Derivative.html
  15. [online] Available at: https://www.quora.com/What-is-the-multi-arm-bandit-problem-What-are-some-of-its-implications [Accessed 28 Jun. 2018].
  16. Solving the Multi-Armed Bandit Problem — Towards Data Science. (2017). Towards Data Science. Retrieved 28 June 2018, from https://towardsdatascience.com/solving-the-multi-armed-bandit-problem-b72de40db97c
  17. Michael Pacheco. (2018). Michaelpacheco.net. Retrieved 28 June 2018, from https://michaelpacheco.net/
  18. Solving the Multi-Armed Bandit Problem — Towards Data Science. (2017). Towards Data Science. Retrieved 28 June 2018, from https://towardsdatascience.com/solving-the-multi-armed-bandit-problem-b72de40db97c
  19. numpy.clip — NumPy v1.14 Manual. (2018). Docs.scipy.org. Retrieved 29 June 2018, from https://docs.scipy.org/doc/numpy/reference/generated/numpy.clip.html
  20. (2018). Cs.cornell.edu. Retrieved 29 June 2018, from https://www.cs.cornell.edu/people/tj/publications/yue_joachims_09a.pdf
  21. Langford, J., & Zhang, T. (2008). The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. Papers.nips.cc. Retrieved 29 June 2018, from https://papers.nips.cc/paper/3178-the-epoch-greedy-algorithm-for-multi-armed-bandits-with-side-information
  22. Cross entropy and log likelihood. (2018). Awebb.info. Retrieved 29 June 2018, from http://www.awebb.info/blog/cross_entropy
  23. Simple Reinforcement Learning in Tensorflow: Part 1 — Two-armed Bandit. (2016). Medium. Retrieved 29 June 2018, from https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149
