Understanding Reinforcement Learning Hands-on: Non-Stationary

Exploring dynamic worlds and how to deal with them

Alejandro Aristizabal
Towards Data Science

Photo by Jonathan Petersson on Unsplash

Series’ Links:

  1. Introduction
  2. Multi-Armed Bandits | Notebook
  3. Non-Stationary | Notebook
  4. Markov Decision Processes | Notebook
  5. The Bellman Equation pt. 1

Welcome to the third entry in a series on Reinforcement Learning. In the previous article we explored the first of many scenarios we’re going to tackle: the Multi-Armed Bandit. In that situation, we’re presented with an environment with a fixed number of actions and are tasked with finding the action that yields the greatest reward. We presented some strategies and measured their performance on this simple task.

In this article, we’re going to modify the previously presented environment and make it a little more dynamic. We will see how our previous strategies deal with non-stationary environments, and how we can do better.

Stationary vs. Non-Stationary:

Last time we began our story in a casino filled with bandits at our disposal. Using this example, we built a simplified environment and developed a strong strategy for obtaining high rewards: the ɛ-greedy Agent. Using this strategy, we were able to find the best action given enough time, and therefore earn plenty of reward. Our agent performed well because it struck a good balance between exploring the environment and exploiting its knowledge. This balance allowed the agent to learn how the environment behaves while also collecting high rewards along the way. But there’s a small assumption our agent is making in order to behave so optimally: that the environment is static and non-changing. What do we mean by this, and where is our agent making such an assumption?

Stationary

When we mention the word “stationary”, we’re talking about the underlying behavior of our environment. If you remember from last time, the environment is defined to have a set of actions that, upon interaction, yield a random reward. Even though the rewards are random, each is generated around a central or mean value that every action has, which we called the true value. To see what I’m talking about, let’s have a look at one of the animated interactions from the previous article.

Example of an ɛ-greedy agent interacting with a static environment. Image by Author

Observe how the true values (red dots) are static. Even though the environment generates random rewards, each action has a true expected value that never changes. This is a big assumption to make, and one that almost no valuable real-world scenario will follow. If, for example, our casino analogy were static, then casinos would quickly go out of business! So, how can we portray a more realistic scenario without making the problem that much more complex?

Non-Stationary

Making the multi-armed bandit scenario non-stationary is actually pretty simple: all we need to do is make sure the expected value of each action is always changing. Now, this change should not be completely erratic, since no strategy would be able to deal with a completely random scenario. Instead, we want the expected values to drift away slowly, so that our agent can still consider previous experience meaningful for future decision-making. Here’s how we can implement such an environment.
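The notebook’s exact implementation isn’t embedded here, but a minimal sketch of such an environment might look like the following (the class name, parameter names, and NumPy-based structure are my own assumptions, modeled on the bandit from the previous article):

```python
import numpy as np

class NonStationaryBandit:
    """k-armed bandit whose true action values drift by a small
    random amount after every interaction."""

    def __init__(self, k=10, drift=0.01, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        self.drift = drift  # controls how quickly the true values move
        self.true_values = self.rng.normal(0.0, 1.0, size=k)

    def step(self, action):
        # Reward is noisy, centered on the action's current true value.
        reward = self.rng.normal(self.true_values[action], 1.0)
        # Every true value then takes a small random-walk step.
        self.true_values += self.rng.normal(0.0, self.drift, size=self.k)
        return reward
```

The `drift` argument is what lets us control how much the values change per interaction, which we’ll make use of when experimenting later on.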

There are some subtle changes. Now, every time our agent interacts with the environment, the true value of each action changes by a small, random amount. How much it changes can be controlled when we create the environment, so we can experiment with it later on. How does the environment behave now? We can make the same visualization as before, but with our new environment. Since we’re not interested in the agent yet, we’re not going to plot its interaction.

A non-stationary environment. The value for each action changes randomly by some amount. Image by Author

The previous plot was generated with a somewhat high amount of non-stationarity for the sake of illustration. Note that in some instances the highest-valued action changes due to the random movements. Since we’re interested in an agent that behaves as optimally as possible, we would expect it to observe these changes and act accordingly.

Current Performance

Having declared our new environment, let’s see how the ɛ-greedy Agent performs here. Given that our agent leaves room for exploration, it should only be a matter of time before it notices that the environment has changed.

ɛ-greedy Agent interacting with a non-stationary environment. Image by Author

To make the visualization clearer, I’ve added some colors to show which action has the greatest value (orange dot) and which action the agent considers to be best (blue bar). For these animations, the elapsed time was extended to 800 steps. At first, the agents are able to adapt rapidly to changes in the environment, but after a while the estimated values stagnate and tend to move slowly. This makes it hard to catch up with changes later on, and causes the agent to stay on a sub-optimal action for longer. As always, let’s plot the optimality of this strategy by averaging the performance of many experiments.

General performance of the ɛ-greedy Agent in a non-stationary environment. Image by Author

As we can see, at first our agent rapidly adapts to the environment, but as time goes on its performance starts to decrease. Even though our agent is always gathering experience and always exploring the environment, it seems it can’t handle a dynamic world. How can we ensure that our agent will keep adapting to changes further in the future?

The Update Rule Problem

Going back to the previous article, we presented a way to easily evaluate how valuable it is to take an action in the multi-armed bandit scenario. This evaluation was done using a basic average, which converges to the expected value of the action given enough time. If you’ve ever dealt with averages, you know that the more items you add to the operation, the more robust the result is. Let’s say we want to take the average of a list of three values:

Small list of values to average. Image by Author

Then the average would equal 4. Given such a small list, a change in any of the inputs will shift the resulting average by a noticeable amount.

average of l, with the last value modified. Image by Author

If we instead use a larger list of values, then the same change in the input will cause a smaller variation in the output.

average of a bigger list, with the same amount of variation on the input. Image by Author
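To make the effect concrete, here is a quick numerical sketch (the values are illustrative, not the ones from the figures above): the same change to a single element shifts the mean of a short list far more than the mean of a long one.

```python
import numpy as np

short = np.array([3.0, 4.0, 5.0])        # mean = 4.0, 3 elements
long = np.array([3.0, 4.0, 5.0] * 100)   # mean = 4.0, 300 elements

# Apply the same change (+3) to one element of each list.
short_mod = short.copy()
short_mod[-1] += 3.0
long_mod = long.copy()
long_mod[-1] += 3.0

print(np.mean(short_mod) - np.mean(short))  # 1.0  -> the mean moves a lot
print(np.mean(long_mod) - np.mean(long))    # 0.01 -> the mean barely moves
```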

To sum up, the more information we use to calculate the average, the less prone the result is to outliers or variations. This effect can also be seen in the Incremental Average Update Rule we implemented previously:

Incremental Average Update Rule. Image by Author
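Written out (using $Q_n$ for the current estimate of an action’s value and $R_n$ for the $n$-th reward received for that action, as in the previous article), the rule is:

$$Q_{n+1} = Q_n + \frac{1}{n}\,\bigl(R_n - Q_n\bigr)$$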

The 1/n term causes the same effect: as n gets larger, the Q value changes less and less. This is what causes our agent to stagnate. Once enough experience has been collected, it’s very hard to make our agent adapt and change its mind. To fix this, we must modify the update rule so that later experience isn’t discarded or ignored.

The General Update Rule

Those with some experience in Machine Learning may have already spotted a pattern behind the update rule presented above. If you haven’t, here’s a brief explanation:

Generally, for a Machine Learning model to learn, we use a measure called the Error or Loss function. This measurement determines how far off our model’s performance is compared to the expected results. To improve our model, we update its parameters by moving them against the error. How much should we move or modify our parameters? That’s determined by the learning rate. Here’s an overly simplified demonstration of the above statement.

Oversimplified example of Gradient Descent Update Rule. Image by Author
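In rough symbols, with $y$ standing for the expected output (my notation, not necessarily the original figure’s), the idea is:

$$h(x) \leftarrow h(x) + \alpha\,\bigl(y - h(x)\bigr)$$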

Looks familiar, right? h(x) represents the output of our model, while α represents the learning rate. This is a grossly simplified version of Gradient Descent, which is the standard method for improving a model in most machine learning tasks. If you’re interested in this topic, this is a great article about Gradient Descent.

Comparing the incremental average update rule and gradient descent, our intuition says that 1/n plays the role of the learning rate α, and that is correct. We can therefore modify our update rule into a more generalized version.

Generalized Update Rule. Image by Author
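That is, we simply replace 1/n with a general step size α:

$$Q_{n+1} = Q_n + \alpha\,\bigl(R_n - Q_n\bigr)$$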

What does this do? Well, now we can define α to be something other than 1/n. Using a constant learning rate is a simple but effective method used in other machine learning paradigms, so we can try it here as well. Let’s implement our new agent!

We’re basing our new implementation on the ɛ-greedy Agent, with some minor changes. First, we add a new parameter called ss_func, which allows us to change the step-size function if desired. By default, this function returns a value of 0.1, which will be the constant step size used here. Also, when updating the estimates, we call ss_func and use its return value as our step size. Just to make things clear, a single line of code declaring the original ɛ-greedy Agent is equivalent under this new implementation, as sketched below.

Variables like estimates and epsilon have been assumed to exist previously.
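Since the notebook code isn’t embedded here, the following is a minimal, self-contained sketch of what such an agent could look like (the class name, the signature of ss_func, and the helper structure are assumptions on my part, modeled on the ɛ-greedy agent from the previous article):

```python
import numpy as np

class EpsilonGreedyAgent:
    """ɛ-greedy agent with a configurable step-size function.
    By default ss_func returns a constant 0.1, giving the
    constant step-size behavior discussed above."""

    def __init__(self, k, epsilon=0.1, ss_func=None):
        self.epsilon = epsilon
        self.estimates = np.zeros(k)       # Q-value estimate per action
        self.action_counts = np.zeros(k)   # times each action was taken
        self.ss_func = ss_func or (lambda n: 0.1)

    def act(self):
        # Explore with probability ɛ, otherwise exploit the best estimate.
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.estimates))
        return int(np.argmax(self.estimates))

    def update(self, action, reward):
        self.action_counts[action] += 1
        n = self.action_counts[action]
        step_size = self.ss_func(n)        # constant 0.1 by default
        self.estimates[action] += step_size * (reward - self.estimates[action])


# Roughly equivalent to the original sample-average ɛ-greedy agent:
# a step-size function that returns 1/n recovers the incremental average.
classic_agent = EpsilonGreedyAgent(k=10, epsilon=0.1, ss_func=lambda n: 1.0 / n)
```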

Let’s see how this new agent performs in our dynamic multi-armed bandit scenario, compared to the previous ɛ-greedy strategy.

Comparison between ɛ-Greedy and Constant Step Size ɛ-Greedy strategies on a dynamic environment. Image by Author.

And just like that, our agent is now capable of adapting to a dynamic world! By using a constant step-size (or learning rate), the agent will no longer stagnate. Instead, every new experience is valued the same, and so it’s possible to overcome previous experiences with new knowledge.

The Step Size Parameter:

As a final note, I want to briefly go into understanding this new parameter α, how it behaves, and what values are reasonable for it. We’ve already seen that a value of 0.1 gave great results in this scenario, but how would the agent behave with other values?

The step-size parameter can be thought of as a measure of confidence in new knowledge. A high value for α is analogous to saying that we trust our recent experience to be a good representation of the problem at hand. As such, the step-size parameter should lie between 0 and 1, where 1 is total confidence and 0 is no confidence in how representative the latest interaction is of the underlying problem. Let’s see how these two extremes affect the learning outcome.

A confidence of 1 would turn our update rule into this:

Update rule with a learning-rate of 1. Equivalent to assigning our estimates to the recent reward received. Image by Author
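With α = 1 the rule collapses to:

$$Q_{n+1} = Q_n + 1\cdot\bigl(R_n - Q_n\bigr) = R_n$$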

Here, we’re basically saying that our estimate should equal the reward just obtained, disregarding any previous experience. This would only work if we know (or are really confident) that the reward we receive is deterministic and stationary. Since this is rarely the case in the real world (interactions are usually stochastic, and the ones that aren’t don’t usually require learning algorithms), a step-size of 1 should not be considered.

A confidence of 0 would cancel out the update part of our update rule:

Zero confidence on our recent interaction would keep our previous estimates intact. Image by Author
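With α = 0 the update term vanishes:

$$Q_{n+1} = Q_n + 0\cdot\bigl(R_n - Q_n\bigr) = Q_n$$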

With a step-size of zero, we’re removing the possibility for our agent to learn from experience. For this reason, a constant step-size of 0 is meaningless for Reinforcement Learning.

As for other values between 0 and 1, they determine how quickly or slowly our agent responds to variance, and how fast it converges. Values close to 1 update the estimated values quickly and try to keep up with changes in the environment, but they are also susceptible to noise. On the other hand, small values take longer to converge and react more slowly to a dynamic environment. Let’s compare some values in this scenario.

Comparison of performance of multiple agents with different step-sizes on a dynamic environment. Image by Author.

As predicted, the extremes are not well suited to the problem: low values converge more slowly and higher values converge faster. Unlike other fields of Machine Learning, where the learning rate or step size mostly affects convergence time and accuracy toward optimal results, in Reinforcement Learning the step size is tightly linked to how dynamic the environment is. A really dynamic world (one that changes often and rapidly) requires high values for the step size, or our agent simply won’t be fast enough to keep up with the variability of the world.

Wrap Up

In this article we covered the idea of non-stationarity and implemented it in the Multi-Armed Bandit scenario. We then explored how our previous contestant, the ɛ-greedy Agent, performed in this new situation, and exposed what made it behave sub-optimally. Then, by borrowing some concepts from other areas of machine learning, we defined a new way of evaluating our actions. This new update rule allowed for a constant step size, which was key to solving the problem of non-stationarity. Lastly, we briefly looked at how the step-size parameter affects the learning outcome.

This article was intended to touch on a few topics that were too long to be added to the previous one, but that still had enough value to be presented in the series. The Generalized Update Rule will play a role later on, and for that reason it had to be covered. With this, we can say goodbye to the Multi-Armed Bandit and start touching on other topics. The next article will cover Markov Decision Processes (MDPs) and build the foundation for the famous Bellman Equation. These two concepts are the backbone of Reinforcement Learning, and understanding them could literally change your perspective of the world. See you then!

Series’ Links:

  1. Introduction
  2. Multi-Armed Bandits | Notebook
  3. Non-Stationary | Notebook
  4. Markov Decision Processes | Notebook
  5. The Bellman Equation pt. 1
