
Why Hasn’t Reinforcement Learning Conquered The World (Yet)?

Optimization has been around for decades, machine learning achieves breakthrough after breakthrough. Marrying the two in the form of Reinforcement Learning should be the holy grail of problem-solving. Why isn’t it?

Photo by Artem Beliaikin on Unsplash

You might not realize it when boarding your flight, but the departure time, fuel level, maintenance crew and takeoff lane have likely all been determined by a mathematical optimization model. Whether it is exact solutions such as linear programming or powerful heuristics such as genetic algorithms, optimization models are the silent force behind many scheduling and resource allocation problems. Spurred by increasing computational power and advances in solvers, the 1990s in particular saw a massive wave of optimization techniques being deployed across all industries. Although not necessarily a 'hot' topic today, computers still slave away at inconceivably large optimization problems in virtually every sector.

Meanwhile, advances in machine learning and artificial intelligence gain ever more prominence. Rarely a week passes without a cheerful headline about a new breakthrough. Detecting fake news, facial recognition, putting a name to that one elusive song – both unsupervised and supervised learning are deployed with great success in the real world. Neural networks in particular are extremely apt at discovering patterns; the abundance of data nowadays provides a steady stream of new test cases.

It seems natural that marrying the two fields would generate some immensely powerful offspring. That offspring is better known as Reinforcement Learning. By learning from its environment to constantly improve decision-making, we theoretically have an insanely powerful self-learning framework, able to simultaneously make sage decisions and learn from them to make even better ones. In reality, we’re not there though, not by a long shot.

Sure, AlphaGo is impressive, as is training Mario to navigate the Mushroom Kingdom. There are also some real-world successes to celebrate. RL seems quite suitable for A/B testing and may learn to tailor the ads you see. Optimal Dynamics is doing very interesting work in the transport domain. In short, there are successful RL applications circulating, no doubt. But to claim that RL is currently widely deployed to tackle real-world problems – absolutely not.

What makes Reinforcement Learning so hard?

Consider a warehouse with 10,000 product types. Expected sales per product are known. The question is how much of each type to order daily (say for a time horizon of 10 days), taking into account lead times, storage capacity, costs of lost sales, etc. A massive and complicated problem, but something that can be well-handled by contemporary optimization techniques.

The word expected is crucial here. In reality, countless combinations of products may be sold on a given day, each representing a unique scenario that leads to a new inventory state. For every state, we want the ability to make a good decision. Decisions are made sequentially: we order a combination of products, observe the products sold, and at the end of the day we must be able to make a re-order decision for every possible scenario that may unfold.

In other words: deterministic optimization provides decisions, stochastic optimization provides decision-making policies. Deterministic optimization solves a single problem, stochastic optimization solves all problems that may arise. RL does not tell us what decisions to make, but how to make them. It’s a whole different game.
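To make that contrast concrete, here is a minimal, hypothetical single-product sketch (the base-stock rule, demand range and all numbers are assumptions for illustration, not from the warehouse example above): the deterministic model commits to one order quantity based on expected demand, while a policy maps every inventory state we might encounter to a decision.

```python
import random

random.seed(42)

EXPECTED_DAILY_DEMAND = 40   # assumed expected sales
ORDER_UP_TO_LEVEL = 60       # assumed base-stock target used by the policy

# Deterministic optimization: a single plan, fixed up front for the expected scenario.
fixed_daily_order = EXPECTED_DAILY_DEMAND

# Stochastic optimization / RL: a policy, i.e. a rule mapping any state to a decision.
def base_stock_policy(inventory: int) -> int:
    return max(ORDER_UP_TO_LEVEL - inventory, 0)

inv_plan = inv_policy = 50
for day in range(10):
    demand = random.randint(20, 60)  # realized sales, which rarely equal the expectation
    inv_plan = max(inv_plan + fixed_daily_order - demand, 0)
    inv_policy = max(inv_policy + base_stock_policy(inv_policy) - demand, 0)
    print(f"day {day}: demand {demand}, plan-driven stock {inv_plan}, policy-driven stock {inv_policy}")
```

The policy here is deliberately simplistic, but it illustrates the shift from answering “what should I order tomorrow?” to “what should I order in any state I might find myself in?”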

Photo by JJ Ying on Unsplash

The differences between supervised learning and RL are substantial as well. Techniques may be similar, e.g., neural networks are common in both fields. In supervised learning, a neural network might take images as input and predict the correct label. In RL, consider a Q-network that takes the problem state as input and returns the value corresponding to that state. At first glance, both look like supervised learning tasks.

However, in the former case, image labels are static. In contrast, the Q-value corresponding to a state is policy-dependent: with every policy update the value should change as well. As such, the problem is substantially more dynamic. Additionally, actions in RL typically have only a partial influence on the cumulative reward, with stochasticity playing a large role as well. Neural networks are excellent at detecting patterns, but are also prone to fitting noise. Even for toy-sized RL problems, neural networks often struggle to learn accurate values.
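A toy sketch may help illustrate the moving-target issue (a linear ‘Q-network’ with made-up dynamics, purely an assumption for illustration): in supervised learning the regression label for an input never changes, whereas the bootstrapped Q-learning target below is recomputed from the current weights and therefore shifts after every update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "Q-network": Q(s, a) = w[a] @ s, with 2 actions and 3 state features.
w = rng.normal(size=(2, 3)) * 0.1
GAMMA, LR = 0.95, 0.05

def q_values(state: np.ndarray) -> np.ndarray:
    return w @ state  # one value per action

for step in range(1_000):
    s = rng.normal(size=3)              # current state (random toy dynamics)
    a = int(np.argmax(q_values(s)))     # greedy action under the *current* policy
    r = rng.normal()                    # noisy reward: stochasticity the network may fit
    s_next = rng.normal(size=3)         # next state
    # Unlike a static image label, this target depends on w itself and moves every update.
    target = r + GAMMA * np.max(q_values(s_next))
    td_error = target - q_values(s)[a]
    w[a] += LR * td_error * s           # gradient step for the linear model
```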

Bottlenecks for RL in practice

We have discussed why RL is in some ways harder than deterministic optimization and supervised learning (although these domains naturally have their own sets of challenges). However, humanity has cracked hard problems before, so that cannot be the full story.

In fact, the theoretical body of RL literature is quite impressive. Many problems that are completely intractable for dynamic or stochastic programming have been tackled by RL, often with remarkable results. Examples from computer science are perhaps best known, yet the engineering community has come up with many clever algorithms as well.

So why do we see so little of that in practice?

Scattered communities

Although many research communities tackle RL problems, tools and terminology vary significantly. Photo by Katie Harp on Unsplash

Without delving too much into history: RL originates from different fields and has been subject to branching and transforming along the road. As a result, there are many distinct communities, all with their own applications, notational styles, terminologies and modeling conventions. Partially due to this dispersion, there is also no generic and mature toolbox of RL algorithms, like there is for supervised learning and deterministic optimization.

It may also be argued that academia and practice have grown too segregated as communities. RL may be divided into four policy classes; academia tends to focus on mathematically elegant solutions, whereas industry relies more on straightforward sampling and extensions of deterministic models. Explainability of policies is often overlooked in academic works as well. As such, RL research rarely makes it into practice.

The Four Policy Classes of Reinforcement Learning

Problem modeling

Before even thinking about solutions, one must first identify the problem. A Markov Decision Process looks simple enough in canonical form, yet the actual implementation might contain thousands of lines of code.

Without diving too deep into the theory, the following aspects are relevant:

State → What data is needed for decision-making? This does not just entail physical resources, but also information and even beliefs.

Action → What actions are allowed in a given state? Actions that are feasible in a simulation may be highly undesirable in reality. The size of the action space often proves to be a bottleneck as well.

Transition function → Usually hidden as a simple probability, this function captures all system dynamics. Modelling the dynamics that govern a real system is a highly complicated task, even when only approximately capturing its behavior.

Reward → Defining appropriate KPIs is hard. In practice there may be multiple criteria and ambiguous goals, yet no machine learning algorithm can learn without explicit instructions.

Viewing the world as a stochastic system subject to sequential decision-making is not necessarily natural or intuitive, leading to many ill-posed problems.
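As a minimal illustration of where these four ingredients live in code, consider a hypothetical single-product inventory problem again (the class name, demand distribution and cost figures are all assumptions, not a reference implementation):

```python
from dataclasses import dataclass
import random

@dataclass
class InventoryMDP:
    capacity: int = 100          # storage capacity
    holding_cost: float = 1.0    # cost per unit left in stock
    lost_sale_cost: float = 5.0  # cost per unit of unmet demand

    def actions(self, state: int) -> range:
        # Feasible orders only: never exceed remaining storage capacity.
        return range(self.capacity - state + 1)

    def transition(self, state: int, order: int) -> tuple[int, float]:
        # Transition function: the (here made-up) dynamics of demand and stock.
        demand = random.randint(0, 60)
        on_hand = state + order
        next_state = max(on_hand - demand, 0)
        lost_sales = max(demand - on_hand, 0)
        # Reward: the KPI the agent optimizes; defining this is often the hard part.
        reward = -(self.holding_cost * next_state + self.lost_sale_cost * lost_sales)
        return next_state, reward

mdp, state = InventoryMDP(), 20
next_state, reward = mdp.transition(state, order=30)
```

Even this toy version forces choices about what belongs in the state, which orders are feasible, how demand behaves, and which costs matter; a real system multiplies each of those questions many times over.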

Five Things to Consider for Reinforcement Learning in Business Environments

Data availability

The reason RL works so well on games is that they can be repeated infinitely often. Billions of episodes might be played before the agent has been properly trained.

When drawing observations from real life, it is not possible to gather a replay buffer anywhere near that size. Testing a policy in reality could take hours, days or months, depending on the nature of the problem. Additionally, failing in the real world is expensive – think safety measures, think financial constraints. As such, there is less room for exploration. Time and costs prevent building a rich set of observations. Crucially, real-life data only reflects the policy that is used in real life, whereas we want to test many policies.

A simulation model that represents reality circumvents such problems. That is easier said than done, however. To closely mimic reality, the simulation environment requires massive amounts of real-time data. Until fairly recently, data was simply not collected and stored at such a scale.

Without a very good simulation environment, it is near-impossible to get sufficient quality observations and learn a policy.
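To put some (entirely assumed) numbers on that gap: a game or simulator can fill a replay buffer with transitions at will, whereas real operations yield one transition per decision epoch, and only under the single policy that happens to be deployed. A minimal sketch:

```python
from collections import deque
import random

replay_buffer = deque(maxlen=1_000_000)

def simulated_step(state: int, action: int) -> tuple[int, float]:
    # A stand-in for a digital twin: cheap, fast, and safe to query millions of times.
    next_state = max(state + action - random.randint(0, 60), 0)
    return next_state, random.random()

state = 0
for _ in range(100_000):                 # trivial to collect in simulation
    action = random.randint(0, 60)
    next_state, reward = simulated_step(state, action)
    replay_buffer.append((state, action, reward, next_state))
    state = next_state

# The real-world counterpart: one transition per day per warehouse, gathered under
# whatever policy is actually deployed - a few hundred samples per year at best.
```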

Need Help Making Decisions? Ask Your Digital Twin!

Future outlook

Data scientists and academic researchers like novelty, and as long as they can ride the machine learning wave and breakthroughs continue to be made, there will be a steady stream of new developments. Projects that promise plug-and-play RL algorithms are on the rise as well, inevitably culminating in more standardized solutions. For machine learning tasks it is easy enough to test an array of algorithms and select the most suitable one; there is no doubt RL will follow.

Spurred by the rise of IoT technology and the abundance of real-time system data it offers, it becomes easier to construct rich, data-driven simulation environments known as Digital Twins. This technology makes it possible to model detailed virtual proxies of reality, in turn enabling much more experimentation than the real world allows. Future RL implementations can be trained in this alternative reality, greatly enhancing the opportunities to learn policies applicable in the real world.

Finally, understanding of machine learning will improve with time. There are still too many companies that view machine learning as a magic black box. Successful RL requires a deep understanding of the sequential decision-making problem, the stochastic environment, and the deployed techniques. The building blocks are out there, but often still need to click. Advances in Explainable AI also enhance insights into the inner mechanisms of RL.

In short, I do not see RL as a dead-end street, not at all. If anything, we are on the brink of great advances in the real world. Even just five years ago the world of Data Science looked vastly different than it does today. The revolution will not stop anytime soon, and the solutions to many practical hurdles seem within reach.

Whether RL will indeed ‘conquer the world’, I am not sure. Successful practical implementations, however, seem merely a matter of time.


Takeaways

  • RL aims to find a policy that works under any circumstance, rather than just a decision. This makes it much harder than deterministic optimization.
  • RL is typically applied in noisy stochastic environments; machine learning algorithms have a tendency to fit the noise rather than the controllable patterns.
  • Challenges in RL are the many different subfields (no uniform language or toolbox), problem representation in mathematical form, and the reliance on limited real-world observations to learn policies.
  • Advances in data science and research will overcome certain RL challenges in the near future. In particular, Digital Twin environments seem highly promising to learn policies using a virtual proxy of reality.

Further reading

Capgemini (2020). Is Reinforcement Learning Worth The Hype? https://www.capgemini.com/gb-en/2020/05/is-reinforcement-learning-worth-the-hype/

Dulac-Arnold, G., Mankowitz, D. & Hester, T. (2019). Challenges of Real-World Reinforcement Learning. https://arxiv.org/pdf/1904.12901.pdf

Powell, W.B. (2019). A unified framework for stochastic optimization. European Journal of Operational Research, 275 (3), pp. 795–821.

Powell. W.B. (2020). Sequential Decision Analytics For The Truckload Industry. Optimal Dynamics.

