
As a subdomain of Machine Learning, Reinforcement Learning (RL) is often likened to a black box. You try a couple of actions, feed the resulting observations into a neural network, and out roll some values – an esoteric policy telling you what to do in any given circumstance.
When traversing a frozen lake or playing a video game, you will see soon enough whether that policy is of any use. However, there are many problems out there without a clear notion of solution quality, without lower and upper bounds, without visual aids. Think of controlling a large fleet of trucks, of rebalancing a stock portfolio over time, of determining order policies for a supermarket. Determining whether your RL algorithm is any good can then become surprisingly hard.
For such problems, having some quick-and-dirty baseline policies at hand is essential during algorithmic development. The three policies outlined in this article are very easy to implement, serve as a sanity check, and immediately tell you when something is off.
Random policy

Most RL algorithms have some exploration parameter, e.g., an ϵ that translates to taking random actions 5% of the time. Set it to 100% and you are exploring all the time; easy to implement indeed.
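As a minimal sketch (assuming a Gymnasium-style environment; the environment name and episode count are placeholders), estimating the random baseline's performance takes only a few lines:

```python
import gymnasium as gym

def random_policy_return(env_name: str = "FrozenLake-v1", episodes: int = 1000) -> float:
    """Average episode return of a purely random policy (exploration rate 100%)."""
    env = gym.make(env_name)
    total = 0.0
    for _ in range(episodes):
        env.reset()
        done = False
        while not done:
            action = env.action_space.sample()  # uniform random action
            _, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
    env.close()
    return total / episodes
```

Whatever number this returns, your trained agent should beat it comfortably and consistently.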
It is obvious that a blindfolded monkey throwing darts is not a brilliant policy, and that is precisely why your RL algorithm should always – consistently and substantially – outperform it.
There is a bit more to it though, especially if you are not sure to what extent your environment is predictable. If you fail to comprehensively outperform the random baseline, that could be an indication that there are simply no predictable patterns to be learned. After all, even the most sophisticated neural network cannot learn anything from pure noise. (Unfortunately, it could also be that your algorithm just sucks.)
Myopic policy

The great appeal of RL is that it allows solving complicated sequential decision problems. Determining the best action right now might be a straightforward problem, but anticipating how that action keeps affecting our rewards and environment long afterwards is another matter.
Naturally, if we put all that effort into modeling and learning from the future, we want to see superior results. If we could make decisions of similar quality without considering their downstream impact, why bother doing more? For most problems, the myopic policy simply maximizes (minimizes) the direct rewards (costs) and is easily implemented.
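A sketch of such a greedy rule, under the assumption that you can evaluate the one-step reward of each candidate action (here by deep-copying the environment to simulate a single step, which not every environment supports; for stochastic transitions you may want to average over several samples):

```python
import copy

def myopic_action(env, candidate_actions):
    """Pick the action with the highest immediate reward, ignoring downstream effects."""
    best_action, best_reward = None, float("-inf")
    for action in candidate_actions:
        sim = copy.deepcopy(env)           # assumption: env supports deep copies
        _, reward, *_ = sim.step(action)   # evaluate the direct reward only
        if reward > best_reward:
            best_action, best_reward = action, reward
    return best_action
```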
Similar to the random policy, the myopic policy could be somewhat of a litmus test. In highly stochastic environments, the action you take today may have very limited effect on the world tomorrow, and even less the day after that. Transitions from one state to another typically contain a stochastic and a deterministic component; if the stochastic component is large and mostly noise, there is little to be gained from predicting downstream effects. Contrasting policies with and without lookahead quantifies the degree to which anticipating the future actually helps.
Off-the-shelf algorithms

Most of us have to content ourselves with ‘standing on the shoulders of giants’. Radical developments of completely novel algorithms are rare. The solution for your problem is likely a tweaked version of some RL algorithm that already exists, rather than something built from scratch.
Naturally, we all believe that we know better, that we can cleverly recombine techniques, construct architectures, and tweak parameters to obtain superior results. Maybe we can, but it’s imperative to check. If you are going to spend weeks building a custom actor-critic model, it had better outperform the basic REINFORCE algorithm by quite some margin. Time and resources are always scarce; sometimes ‘good enough’ is all you need.
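As an illustration (assuming Stable-Baselines3 is installed; the environment and timestep budget are arbitrary), an untuned off-the-shelf agent with default hyperparameters can serve as this reference point in a handful of lines:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Train an off-the-shelf PPO agent with default hyperparameters as the reference.
model = PPO("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=50_000)

mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=100)
print(f"Off-the-shelf PPO: {mean_reward:.1f} +/- {std_reward:.1f}")
```

If your custom model cannot clearly beat this kind of default, the extra complexity is hard to justify.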
Final notes
The baselines in this article might seem a bit silly to some, but frankly they have helped me out more than once. Especially in high-dimensional business problems, gazing at large vectors without a point of comparison often isn’t very helpful. In some cases (especially finance), it is genuinely questionable whether data sets hide predictable patterns that actually aid decision-making today. Baseline policies help to verify whether you are on the right track.
Before wrapping up, it is good to emphasize that there is a stark difference between a baseline and a competitive benchmark. The baselines mentioned in this article should be considerably outperformed to demonstrate that your RL algorithm learns something useful. That alone is not enough to show it is a good solution though. If you want to publish an academic paper or upgrade your company’s planning system, you’d better contrast your algorithm with some serious competitors.
But before you do, please first check if you can beat the mole and the monkey with the darts.