
Even those completely unfamiliar with Reinforcement Learning (RL) probably know how DeepMind’s AlphaGo managed to beat world champion Lee Sedol in the ancient board game Go. For many, RL has become synonymous with playing games, be it chess, Super Mario, or simply crossing a frozen lake. Games are fun, intuitive, unambiguously defined, and often contain distinct properties that make them challenging for learning algorithms. It is not surprising that they were adopted as benchmark problems for the development of new RL techniques. However, that does not mean that RL is confined to just the AI Gym.
Companies are generally concerned with allocating resources in a way that maximizes profit. This is not a one-time decision, but a sequence of decisions over time, with decisions made today affecting performance tomorrow. Add a good dose of uncertainty to the mix, and these decisions often become vexing, exceeding the cognitive limits of human decision-makers. Making intelligent decisions in such environments gives an edge over competitors, and RL can contribute in that respect. However, there are some important aspects worth considering before sinking your teeth into a real-world business problem.
I. Results, results, results
Yes, three times, because this might as well be points (i), (ii), and (iii). Granted, it’s more than a bit cliché, but it’s cliché for a reason. For a business, the purpose of deploying reinforcement learning is to resolve a problem. An algorithm should enable a business to make decisions that better position it for the future. If RL fails to have a direct and measurable impact, there’s little worth in it.
To have practical value, the (automated) decision support originating from RL should also be integrated into existing workflows. In many cases, that means working with the systems that are already there. For instance, if a deterministic planning algorithm is already in place, simply adding some lookahead features is often more manageable than designing an RL algorithm from scratch.
Computational time is naturally a point of relevance as well. For decisions at the strategic and tactical level, lengthier run times might be forgiven. However, should RL support day-to-day operational decisions, online run times must be kept within strict limits.
II. The future must be predictable *and* influenceable
"Prediction is very difficult, especially if it’s about the future!" – Niels Bohr
To be able to predict a process, it should actually be predictable. Sure, everybody would like to possess an RL algorithm that predicts stock price movements – at least I would – and tells you when to buy and when to sell. A factory manager would like to know when machine parts break down, so they can be replaced just in time. As with any machine learning algorithm, the data should actually contain a pattern that can be discerned.
Needless to say, sufficient high-quality data should be available. As any data scientist knows, ‘garbage in = garbage out’ – RL algorithms are no exception to this rule. It takes substantial amounts of quality data to learn something useful. A large neural network might take millions of observations to train properly; historical records may be insufficient for such purposes. Don’t be surprised if a basic linear regression actually yields better results.
So far nothing new, but RL has some challenges of its own that are not encountered in supervised learning. Anyone who has ever experimented with replay buffers quickly realizes that many observations become obsolete once better policies are identified. A trucking company may have plenty of data from the East Coast, but what if the policy tells you to explore the Western frontier? Sometimes it works out fine – you can backtest many investment strategies on historical price data, for instance. Other times, you might be able to simulate new outcomes at will. Sometimes though, the right data simply is not there once you deviate from the current real-life policy.
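To make the data-coverage issue concrete, consider the toy sketch below. The region labels, proportions, and the split between a ‘behavior’ and a ‘new’ policy are invented for illustration, not taken from a real dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical logged transitions: the historical (behavior) policy mostly
# operated on the East Coast, so that is where the data lives.
historical_regions = rng.choice(["east", "west"], size=100_000, p=[0.95, 0.05])

# A newly learned policy explores westward: assume it visits "west" states
# 70% of the time (an invented number for illustration).
target_visit_probs = {"east": 0.3, "west": 0.7}

# How much logged data is available for the states the new policy cares about?
for region, visit_prob in target_visit_probs.items():
    n_available = int(np.sum(historical_regions == region))
    print(f"{region}: visited {visit_prob:.0%} of the time by the new policy, "
          f"but only {n_available:,} of 100,000 logged transitions match")
```

The point is not the code itself, but that historical logs reflect the old policy’s state distribution; the further a new policy strays from it, the thinner the relevant data becomes.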
RL environments are generally partially deterministic and partially stochastic. If you can predict or model the stochastic process better than others, you should be able to outperform competitors. The same applies if today’s actions allow you to better prepare for the future. If you can’t forecast much, if today’s decisions hardly impact tomorrow’s performance, or if you simply don’t have the data, RL is probably the wrong way to go.
III. Problem formulation is key
For board and computer games, problem definitions are known from the outset. In contrast, practice does not dictate what the problem actually is, let alone how to measure performance. Defining appropriate KPIs can be something of an art. There may be multiple important performance indicators and stakeholders, requiring the indicators to be weighed or ranked. Many practical problems can be formalized as an objective function subject to a set of constraints. We’d like to maximize profits, but also consider our environmental footprint. We want to minimize costs, but must maintain the contractual service level. For subprocesses, it may not be trivial to define good performance metrics. Re-positioning empty trains seemingly has little impact on annual profits, yet it should still be aligned with the overall business goals. The RL algorithm will strive to maximize exactly the objective that you define: be careful what you wish for.
After talking to stakeholders, abstracting the problem properties, and defining an agreed-upon objective, it is essential to mathematically formulate the problem. Markov Decision Processes and mathematical programming are vital frameworks for this step.
A Markov Decision Process (MDP) model describes sequential decisions in an uncertain environment. In essence, it defines the problem state (the information needed to make the decision), the action (the decisions that are allowed), the reward function (the contribution or cost of a decision), the transition function (a model for moving between states, influenced by both decisions and stochastic fluctuations), and the objective function (the cumulative reward we seek to optimize). In a business context, the assumptions made about the distributions that govern the environment and about the information required for decision-making are especially important.
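For reference, a generic formulation could look as follows. The notation is a sketch loosely following Powell [2] (exogenous information W, transition function S^M), not the definitive formulation of any particular problem:

```latex
% Generic MDP ingredients (illustrative notation, loosely following Powell [2]):
%   s_t        : state at time t (the information needed to decide)
%   a_t        : action, restricted to a feasible set \mathcal{A}(s_t)
%   r(s_t,a_t) : reward (contribution or cost) of the decision
%   W_{t+1}    : exogenous information revealed after deciding
%   S^M        : transition function, s_{t+1} = S^M(s_t, a_t, W_{t+1})
% The objective is to find a policy \pi -- a rule A^{\pi} mapping states to
% actions -- that maximizes the expected (discounted) cumulative reward:
\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\, r\bigl(s_t, A^{\pi}(s_t)\bigr)\right]
```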
It is often convenient to express the action space in the form of a mathematical program (even if not solving to optimality), as it very deliberately defines the objective function – which obviously should be in line with the one in the MDP – and the corresponding set of constraints. If we want to minimize costs, what are the minimum service levels, what order volumes can we hold, what truck routes are feasible? Simplifications of the real-world problem are inevitable, yet mathematical programming makes these simplifications very explicit.
![Snippet of a mathematical problem formulation. Here, the action space is formulated as a linear program [image by author] [1]](https://towardsdatascience.com/wp-content/uploads/2021/09/1NA6gtfdDCnkP3w5JJ04y6g.png)
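As a minimal sketch of what such an action space could look like in code, the toy single-period ordering problem below is expressed as a linear program with scipy. The products, costs, service level, and capacity are all invented for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Toy single-period ordering decision (all numbers invented):
# choose order quantities x for three products, minimizing ordering cost,
# subject to a contractual service level and a warehouse capacity limit.
unit_costs = np.array([2.0, 3.5, 1.2])            # cost per unit ordered
expected_demand = np.array([100.0, 60.0, 250.0])  # units per product
service_level = 0.95                              # fraction of demand to cover
capacity = 500.0                                  # total units the warehouse holds

# linprog minimizes c @ x subject to A_ub @ x <= b_ub, with x >= 0.
c = unit_costs
A_ub = np.vstack([-np.eye(3),          # -x_i <= -service_level * demand_i
                  np.ones((1, 3))])    # sum_i x_i <= capacity
b_ub = np.concatenate([-service_level * expected_demand, [capacity]])

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print("feasible:", result.success)
print("order quantities:", result.x.round(1))
print("total cost:", round(result.fun, 2))
```

Every simplification – deterministic demand, a single period, linear costs – is visible in the constraints, which is exactly the point.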
People might disagree with your abstraction of reality, yet there should be no room to misinterpret that abstraction.
IV. Pick a suitable solution method
A remarkable gap between research and industry can be found in the choice of solution method. Academics and researchers tend to value design effort and mathematical elegance, whereas practitioners comparatively rely more on brute force and leveraging problem structure. For instance, a deterministic airline scheduling algorithm could be expanded with slack parameters, whose values are determined through RL. Another example is to solve today’s truck routing problem to optimality, and estimate routing costs for the remainder of the week with a simple heuristic. Companies typically have a fairly good idea of problem structures – these structures can be exploited in the solution method.
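The slack-parameter idea can be sketched in a few lines. Below, a hypothetical deterministic scheduler pads planned task durations with a single tunable slack buffer, and a crude search over simulated scenarios picks its value. In a full RL setting the slack could depend on the state and be learned from data; all durations and penalty weights here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_plan(expected_durations, slack):
    """Back-to-back schedule in which every task gets a relative slack buffer."""
    padded = expected_durations * (1.0 + slack)
    return np.concatenate([[0.0], np.cumsum(padded)[:-1]])

def simulate_cost(slack, n_scenarios=2_000, delay_penalty=5.0, idle_penalty=1.0):
    """Average cost of a slack setting over randomly perturbed task durations."""
    expected = np.array([60.0, 45.0, 90.0, 30.0])   # planned durations (minutes)
    planned_starts = deterministic_plan(expected, slack)
    total = 0.0
    for _ in range(n_scenarios):
        realized = expected * rng.lognormal(mean=0.0, sigma=0.25, size=expected.size)
        ready, cost = 0.0, 0.0                       # 'ready' = previous task's finish
        for plan_start, duration in zip(planned_starts, realized):
            cost += delay_penalty * max(0.0, ready - plan_start)  # starting late
            cost += idle_penalty * max(0.0, plan_start - ready)   # resource idles
            ready = max(ready, plan_start) + duration
        total += cost
    return total / n_scenarios

# Crude search over the single tunable parameter; a learned, state-dependent
# slack would be the natural next step.
candidate_slacks = np.linspace(0.0, 0.5, 11)
avg_costs = [simulate_cost(s) for s in candidate_slacks]
best = candidate_slacks[int(np.argmin(avg_costs))]
print(f"best slack buffer: {best:.2f} (average cost {min(avg_costs):.1f})")
```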
Although vastly oversimplifying Powell’s four policy classes [2], I try to make a crude distinction here:
![Simplification of the four policy classes of Reinforcement Learning [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/09/1Q4Bin1yEbDd9XTttDIdLJQ.png)
The inherent tradeoff is between design effort and computational effort. Although some useful guidelines exist [3], there’s not necessarily a ‘best’ policy class for a given problem, let alone a one-size-fits-all solution. That’s a discussion for another time though – for now it suffices to recommend an open mind when conceptualizing solutions for real-world RL problems.
V. Action spaces are often massive

In most benchmark RL problems, action spaces are not that bad. Traversing a maze only requires evaluating four moves at a time. A game of Super Mario has a few more button combinations to consider, but is still quite manageable. A chess board has a dazzling number of states, but the number of feasible moves in a single board position is rarely much over 100. In short, for most benchmark problems, actions are well within the limits of enumeration, e.g., as outputs of an actor network or deep Q-network.
As mentioned before, business decisions often boil down to a matter of resource allocation – what distribution of assets yields the best return? This frequently translates into vector-based decisions, in which many resources are distributed at the same time. If you have $1m, how do you best divide it over all NASDAQ stocks? If your warehouse contains 10,000 products, how many of each product do you reorder for next month? If you own a fleet of a thousand trucks, how do you distribute them across the United States?
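A quick back-of-the-envelope count (with invented numbers; ~35 is the commonly cited average branching factor of chess) shows how quickly such vector decisions explode compared to the benchmark problems above:

```python
from math import comb

def n_allocations(n_trucks: int, n_regions: int) -> int:
    """Ways to distribute identical trucks over regions (stars and bars)."""
    return comb(n_trucks + n_regions - 1, n_regions - 1)

print("maze, moves per step:          4")
print("chess, typical position:       ~35")
print("10 trucks over 5 regions:     ", n_allocations(10, 5))           # 1,001
print("1,000 trucks over 50 regions: ", f"{n_allocations(1_000, 50):.2e}")
```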
Even for the deterministic variants of these problems, action enumeration is often wildly infeasible – combinatorial optimization problems may yield many millions, billions, or trillions of feasible actions. There is a silver lining though; a lot of real-life problems are convex or even (approximately) linear. This is a very powerful property that often makes problem solving considerably easier. Commercial solvers can solve massive problems in limited time – a linear program with millions of actions might be solvable within seconds.
Still, even mathematical programming cannot always save us. Making a decision within a few seconds sounds great, but if we need millions of training observations, these ‘few seconds’ truly add up. (Meta)heuristics and simplifying assumptions are often needed to handle the action space. After all, real life is messy.
Executive summary
- The RL solution needs to be integrated into existing workflows. Ultimately, the algorithm should improve decision-making in the real world. Practical usability, computational effort, data availability, role in decision support – all such factors need to be considered at the start.
- Reinforcement Learning makes sense in environments that are partially controllable and allow anticipating future events. Decisions made today should have some future impact, and the environment should be predictable to a degree. If you cannot really influence or forecast the future (or simply don’t have the data), you are likely better off with a myopic policy or deterministic optimization.
- Before even thinking about any solution, first aim to unambiguously model the problem. The objective (performance metric) and constraints should be crystal clear for each stakeholder. Markov Decision Processes and mathematical programming are useful frameworks for problem formulation.
- Take some time to pick the most suitable solution method (policy class). This requires a reflection on the problem at hand. Can the problem structure be exploited? Is the extra design effort worth the reduction in computational effort? Should the solution be embedded within existing algorithms?
- Be prepared to handle massive action spaces. Resource allocation often entails combinatorial optimization. Vector-based problems are typically very large, and require techniques such as mathematical programming and heuristic reduction to keep them manageable.
References
[1] Van Heeswijk, W.J.A., Mes, M.R.K., & Schutten, J.M.J. (2019). The delivery dispatching problem with time windows for urban consolidation centers. Transportation Science, 53(1), 203–221.
[2] Powell, W. B. (2019). A unified framework for stochastic optimization. European Journal of Operational Research, 275(3), 795–821.
[3] Powell, W. B. (to appear). Reinforcement learning and stochastic optimization: A unified framework for sequential decisions. John Wiley & Sons, Hoboken, New Jersey. pp. 522–525.