
Why Discount Future Rewards In Reinforcement Learning?

A discussion on discount rates from the perspectives of mathematics, finance, life, and of course Reinforcement Learning.

Not that kind of discount. Photo by Artem Beliaikin on Unsplash

Although discount rates are an integral part of Markov decision problems and Reinforcement Learning (RL), we often select γ=0.9 or γ=0.99 without thinking twice. Surely, when asked, we have some intuitions like ‘rewards today are worth more than rewards tomorrow’ or ‘compensating for uncertainty’. But when pressed, can you defend why those intuitions hold? Why pick γ=0.8 instead of γ=0.9? Isn’t uncertainty already incorporated in the expected value? If you don’t have an instant answer ready, this article may shed some light on the matter.

Discounting in mathematics 🧮

From a strictly mathematical perspective, the purpose of a discount rate is obvious, at least for infinite horizon problems. From Bellman’s recursive equation we learn to solve value functions for a sequence of states:
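v_π(s) = E[ R_t+1 + γ·v_π(S_t+1) | S_t = s ]

(the standard state-value form, where v_π(s) denotes the value of state s under policy π and γ is the discount rate)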

If that sequence is infinite, so is the reward series. Consider the following cumulative reward sequence G_t:

G_t = R_t + R_t+1 + R_t+2 + ... = 1 + 1 + 1 + ... = ∞

As we all know, summing an infinite series of constant rewards yields an infinite return, making the system of equations unsolvable. Fortunately, adding a discount rate γ ∈ [0,1) turns it into a converging geometric series. For example, if we set γ=0.8 we obtain:

G_t = γ⁰R_t + γ¹R_t+1 + γ²R_t+2 + ... = 1 + 0.8 + 0.64 + ... = 5

With this trick, we can attach values to states and solve the system of Bellman equations. Of course, in Reinforcement Learning that solution would be approximate.
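To make the convergence concrete, here is a minimal Python sketch (an illustration, not part of the original argument) that sums the truncated discounted series and compares it with the closed form 1/(1−γ):

```python
# A constant reward of 1 per step, discounted by gamma each step.
# For gamma in [0, 1), the truncated sum approaches 1 / (1 - gamma).

def discounted_return(gamma: float, reward: float = 1.0, steps: int = 1_000) -> float:
    """Sum gamma**t * reward over a truncated horizon of `steps` time steps."""
    return sum(gamma**t * reward for t in range(steps))

for gamma in (0.8, 0.9, 0.99):
    print(f"gamma={gamma}: truncated sum = {discounted_return(gamma):.2f}, "
          f"closed form = {1 / (1 - gamma):.2f}")
# Prints roughly 5, 10 and 100, matching 1 / (1 - gamma) in each case.
```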

That explains the infinite case, but why bother with discounting for finite time horizons? You might argue that we compensate for uncertainty, but isn’t that already reflected in the expected value (multiplying future rewards by their probabilities)? The mathematical perspective doesn’t resolve this; we need to dive into the human psyche a bit.

Discounting in finance 💸

What better defines the human psyche than money?

An important building block of investing is the existence of a risk-free rate. This is the return that can be earned without any uncertainty or default risk, serving as a baseline for all other returns. US Treasury bills are often used as a proxy. Put one dollar in a 2% US Treasury bill, and you will receive a guaranteed $1.02 one year from now. Consequently, we prefer $1 today over $1 next year. Without effort we can grow our wealth by 2% annually, and would therefore discount future rewards by 2% to reflect time value.

It gets more interesting when considering risk-bearing instruments such as stocks. Suppose a stock will return either 0% or 4% over the coming year, both with probability 0.5. The expected payoff is 2%. However, the chance of ending up empty-handed is substantial. The typical investor will prefer the risk-free 2% bond in this case, despite the expected payoffs being equivalent. It follows that the stock returns are discounted at a higher rate than the bond returns.

This phenomenon is known as risk aversion. People expect to be compensated for uncertainty, otherwise they would pick the safer alternative. The more uncertainty, the higher the discount rate. Maybe we would select the stock if it yielded 10% rather than 4% (bumping the expected payoff to 5%). The investment needs to provide a certain risk premium on top of compensating for the time value of money.
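As a toy illustration (a sketch under simplifying assumptions, not anything from the finance literature), the snippet below contrasts the two investments: their expected payoffs are identical, but under a concave utility function, one standard way of modeling risk aversion, the risk-free bond comes out ahead:

```python
import math

def expected_value(outcomes):
    """Expected payoff of a list of (probability, payoff) pairs."""
    return sum(p * x for p, x in outcomes)

def expected_utility(outcomes, utility=math.sqrt):
    """Expected utility under a concave (risk-averse) utility function."""
    return sum(p * utility(x) for p, x in outcomes)

bond = [(1.0, 1.02)]                # risk-free: $1.02 for certain
stock = [(0.5, 1.00), (0.5, 1.04)]  # risky: $1.00 or $1.04 with equal odds

print(expected_value(bond), expected_value(stock))       # 1.02 and 1.02: identical payoffs
print(expected_utility(bond) > expected_utility(stock))  # True: the risk-averse agent prefers the bond
```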

There are still many topics left untouched, such as the tendency to use exponential discounting (similar to RL), opportunity or regret costs (you can only invest your money once), risk-seeking behavior (lotteries make little sense from a rational investor’s perspective), and oddly inconsistent boundary cases. For now, let’s settle on the rationale that discount rates reflect both time value and a risk premium.

Discounting in life ⏳

Discounting behavior is not restricted to dollars. In daily life, we constantly balance short-term gratification against long-term consequences and trade off certainty against uncertainty. Go to bed late and you’ll be tired tomorrow. Eat heartily during winter and you’ll need to trim fat to rock a beach body in summer. Study day and night now and hopefully reap the rewards later.

Comparison of exponential and hyperbolic discounting. The former is typically used in Reinforcement Learning, the latter is empirically observed within humans. Image by Moxfyre from WikiMedia

Countless psychological studies on the matter have been performed, suggesting that humans tend to apply something close to hyperbolic discounting in their decision-making. People sporting a Carpe Diem tattoo on their arm likely discount future rewards quite strongly, while others might weigh them more heavily. Despite the discrepancies between individuals, all of us discount to some degree.
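For intuition, here is a small sketch comparing the two weighting schemes. The functional forms (γ^t for exponential, 1/(1+k·t) for hyperbolic) are the textbook ones; the particular values of γ and k are illustrative choices, not estimates from the studies above:

```python
# Exponential vs. hyperbolic weights on a reward that lies t steps in the future.
# gamma and k are illustrative values only.

def exponential_weight(t: int, gamma: float = 0.8) -> float:
    return gamma ** t

def hyperbolic_weight(t: int, k: float = 0.25) -> float:
    return 1.0 / (1.0 + k * t)

for t in (0, 1, 5, 20):
    print(f"t={t:2d}  exponential={exponential_weight(t):.3f}  "
          f"hyperbolic={hyperbolic_weight(t):.3f}")
# Hyperbolic discounting drops off quickly at first but keeps noticeably more
# weight on distant rewards than exponential discounting does.
```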

A very natural reason for this phenomenon is the hazard rate – the probability that we will not be alive tomorrow to reap rewards. Although the risk of acute mortality is not as high as it was for our hunter-gatherer ancestors, the biological impulse to prefer rewards now remains very much intact. The hazard need not be as morbid as death. An athletic career may be cut short by a knee injury, and that backpacking trip through Asia might not be possible five years from now. Even at the most banal level, we prefer to have a cookie now over a cookie next week.

As humans, we simply have a hardwired predisposition to value rewards now higher than those in the (distant) future, sensible or not.

Discounting in Reinforcement Learning 📖

Now we have some insight into the human rationale for discounting, but does that reasoning hold for Reinforcement Learning problems? Despite some loose connections between neural networks and the human brain, RL algorithms are typically not designed to mimic human behavior. The hazard rate is also a shaky rationale, as we can model hazards directly into the environment. For instance, a cliff walking game ends when the agent steps into the cliff – we don’t need additional discounting to reflect heart attacks or stubbed toes. A counterargument would be that such hazards might occur in environments the policy was not trained for, making the discount rate a sort of robustness device.

For many RL problems, not discounting future rewards is perfectly acceptable. Still, there are valid reasons to discount in RL, even for finite horizons. One of them is the actual impact that decisions have on long-term performance. Ultimately, it is up to the modeler to decide which discount rate best reflects the cumulative rewards within the context of the problem.

Suppose I have to choose between a light or heavy dinner tonight. The decision might affect my gym session afterwards, but likely has no impact on getting promoted next year. Here, we see a clear reason to discount future rewards – some consequences can hardly be tied to today’s action. The purpose of RL is to incorporate the downstream effects of current decisions, yet distant rewards may be largely uncorrelated with them. In general, the more stochastic the environment, the less lasting impact our actions have on performance.

An answer on StackExchange formalizes that notion in an interesting way, expressing a parameter τ that reflects the time interval we are interested in:
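γ = e^(−1/τ), so that a reward t steps ahead is weighted by γ^t = e^(−t/τ)

In other words, discounting by γ per step corresponds to an exponential decay with characteristic time τ, and τ ≈ 1/(1−γ) when γ is close to 1.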

As before, suppose the reward is always 1. With γ=0.8, the series converges to 5. Effectively, rewards beyond five time steps ahead – note e^(-1/5)≈0.8 – have little impact. Similarly, a series with γ=0.9 converges to 10 and with γ=0.99 it converges to 100. Mind you: a sudden reward of +100 after t+τ still substantially impacts the discounted reward, but as a rule of thumb the approach makes sense. If you believe rewards five time steps from now have little to do with decisions made now, γ=0.8 might be a suitable discount rate.
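A quick check of that rule of thumb (an illustrative Python sketch, not taken from the original answer):

```python
import math

# Rough look-ahead horizon implied by a discount rate, assuming gamma = exp(-1/tau).
for gamma in (0.8, 0.9, 0.99):
    tau_exact = -1.0 / math.log(gamma)  # solve gamma = exp(-1/tau) for tau
    tau_rule = 1.0 / (1.0 - gamma)      # the 1/(1-gamma) rule of thumb
    print(f"gamma={gamma}: tau = {tau_exact:.1f} (rule of thumb: {tau_rule:.0f})")
# gamma=0.8:  tau = 4.5  (rule of thumb: 5)
# gamma=0.9:  tau = 9.5  (rule of thumb: 10)
# gamma=0.99: tau = 99.5 (rule of thumb: 100)
```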

Hopefully this article provided some clarity on the topic of discount rates, yet it only scratched the surface. It has been suggested that the entire concept is flawed when combined with function approximation (which we typically use in RL), and that average rewards are a better metric than discounted rewards. Another interesting approach is to abandon the notion of a fixed discount rate and work with state-dependent rates instead. Factor in human preferences, and a whole new world opens up.

Indeed, that little parameter γ hides a lot of depth.

Takeaways

  • Discounting is often necessary to solve infinite horizon problems. A discount rate γ<1 ensures a converging geometric series of rewards.
  • From finance, we learn that discounting reflects both time value and risk compensation. Like in Reinforcement Learning, exponential discounting is typically assumed.
  • Humans have a natural predisposition to discount future rewards (close to hyperbolic discounting), with the hazard rate being an important biological rationale.
  • For finite horizon problems, the need for discounting strongly depends on the nature of the problem and the preferences of the modeler.
  • Stochastic environments call for discount rates that place less emphasis on future rewards; the impact of current decisions on the faraway future is low. Discount rates implicitly reflect the number of time steps you wish to look ahead.

Further reading

Dasgupta, P. & Maskin, E. (2005). Uncertainty and Hyperbolic Discounting. https://scholar.harvard.edu/files/maskin/files/uncertainty_and_hyperbolic_discounting_aer.pdf

Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. & Larochelle, H. (2019). Hyperbolic discounting and learning over multiple horizons. https://arxiv.org/pdf/1902.06865.pdf

Investopedia (2021). Discount rate. https://www.investopedia.com/terms/d/discountrate.asp

Investopedia (2021). Risk premium. https://www.investopedia.com/terms/r/riskpremium.asp

Kahneman, D. (2017). Thinking, fast and slow.

Naik, A., Shariff, R., Yasui, N., Yao, H. & Sutton, R. (2019). Discounted Reinforcement Learning Is Not an Optimization Problem. https://arxiv.org/pdf/1910.02140.pdf

Pitis, S. (2019). Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach. https://arxiv.org/pdf/1902.02893.pdf

StackExchange (2016). Understanding the role of the discount factor in reinforcement learning. https://stats.stackexchange.com/questions/221402/understanding-the-role-of-the-discount-factor-in-reinforcement-learning

Wikipedia (2021). Geometric series. https://en.wikipedia.org/wiki/Geometric_series
