
Why does the Markov Decision Process matter in Reinforcement Learning?

A comparison of the MDP and k-armed bandit problems to see why the MDP is a better fit for the real world

For most learners, the Markov Decision Process (MDP) framework is the first one to learn when diving into Reinforcement Learning (RL). However, can you explain why it is so important? Why not another framework? In this post, I will explain the advantages of the MDP compared to the k-armed bandit problem, another popular RL framework.

The post is inspired by the RL specialization offered by the University of Alberta and the Alberta Machine Intelligence Institute on Coursera. I wrote this post to summarize some of the videos and to gain a deeper understanding of the specialization.

The objective of Reinforcement Learning

Before diving into our main problem, I want to mention the primary goal of RL, as it matters when we choose a framework to build a model. In RL problems, we want to maximize the total reward we receive by taking actions over time. We are not told which actions lead to higher rewards. Instead, we learn it from experience. We repeat trial-and-error attempts and observe which actions give us higher rewards and which give lower ones. Moreover, at the beginning we are not even told when rewards will be given. They might arrive immediately, or they might arrive a few time steps after we take an action. Therefore, we need a dynamic framework that captures those two features: "trial-and-error search" and "delayed rewards."

Example: A rabbit exploring food

Let's think about a simple situation. A hungry rabbit is looking for food to satisfy her appetite immediately. There is a carrot on her right and broccoli on her left. The rabbit prefers carrots to broccoli, so she receives a reward of +10 when she eats the carrot. On the other hand, the broccoli generates a reward of only +3. Each time, the rabbit will choose the carrot because she knows choosing the carrot is the optimal action (= the action that leads to the highest possible reward).

What if the positions of the carrot and the broccoli are switched? The rabbit will still choose the carrot, because the positions of the foods do not affect the rewards the rabbit will receive. This is actually an example of a simple "k-armed bandit" problem.

K-armed Bandits

The k-armed bandit problem is one of the simplest yet most powerful frameworks in RL. It is named by analogy to the "one-armed bandit" (= a slot machine), although the framework has k levers instead of one. In this problem, you face a choice among k different options. After each choice, you receive a reward drawn from a stationary probability distribution corresponding to the option you chose. You repeat such selections and aim to maximize the total amount of reward you receive. To do so, you need to find the lever (or option) that is most likely to generate the best reward and concentrate your selections on it.

The rabbit example can be treated as a "two-armed bandit" problem because the rabbit always has two options to choose from: the carrot and the broccoli. Choosing the carrot generates +10 with 100% probability, and choosing the broccoli generates +3 with 100% probability. Because the rabbit wants to maximize the total amount of reward, she always chooses the carrot.
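To make this concrete, here is a minimal sketch of a two-armed bandit learner for the rabbit's choice. The code and names such as true_rewards and epsilon are my own illustration (not from the specialization); it uses epsilon-greedy action selection with sample-average value estimates, a standard approach for bandit problems.

```python
import random

# Two arms: 0 = carrot (+10), 1 = broccoli (+3). Rewards are deterministic here,
# but the same loop works for stochastic (stationary) reward distributions.
true_rewards = [10.0, 3.0]

q_estimates = [0.0, 0.0]   # sample-average estimates of each arm's value
counts = [0, 0]            # how many times each arm has been pulled
epsilon = 0.1              # small chance of exploring a non-greedy arm
total_reward = 0.0

for step in range(1000):
    # Epsilon-greedy selection: mostly exploit the best-looking arm, occasionally explore.
    if random.random() < epsilon:
        action = random.randrange(len(true_rewards))
    else:
        action = max(range(len(true_rewards)), key=lambda a: q_estimates[a])

    reward = true_rewards[action]
    total_reward += reward

    # Incremental sample-average update of the chosen arm's value estimate.
    counts[action] += 1
    q_estimates[action] += (reward - q_estimates[action]) / counts[action]

print(q_estimates)  # converges toward [10.0, 3.0]; the greedy choice is the carrot
```

With deterministic rewards the estimates converge almost immediately, and the greedy choice settles on the carrot, exactly as described above. Note that nothing in this loop depends on the situation the rabbit is in; that is precisely the limitation discussed next.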

The following article will be helpful for understanding the k-armed bandit problem in more detail.

Multi-Armed Bandits and Reinforcement Learning

Example Cont’d: Tiger behind the carrot

While the k-armed bandit framework captures the previous simple situation very well, it has a limitation, which the next situation reveals.

Here, we think about a tiger that is behind the carrot, hunting rabbits. When the rabbit eats the carrot, there are no longer any obstacles between the rabbit and the tiger. The tiger runs faster than the rabbit does, so she cannot run away. The rabbit ends up being caught and receives a -100 reward.

If the rabbit chooses to eat the broccoli, the rabbit receives +3 and remains safe.

In the k-armed bandit problem, the rabbit keeps choosing the carrot even if she knows there is a tiger behind it, because a change in the situation does not affect her optimal action. However, continuing to choose the carrot will not help the rabbit forever. How should we incorporate a change in the situation into our problem definition?

Markov Decision Process

Finally, we introduce the Markov Decision Process (MDP) to solve such a problem. An MDP consists of two elements: the agent and the environment. The agent is the learner or decision-maker. In the above example, the agent is the rabbit. The environment is everything surrounding the agent. In the example, the environment includes everything in the field where the rabbit is, along with the food and the tiger. After the agent takes an action, we face a different situation. We call these different situations states. For example, the situation where the rabbit has not moved yet is considered one state, and the situation where the rabbit is between the broccoli and the tiger after she eats the carrot is considered another state.

In an MDP, the agent "interacts" with the environment. The agent chooses an action and observes what happens in the environment after it has taken that action. Then, it receives a reward corresponding to the action and the state it transitioned to. The agent repeats this interaction many times and learns which action is optimal in each state.

The following diagram is a formalization of the MDP. At time t, the agent at state St chooses an action At from the action space, and the environment returns a new state St+1 from the state space. Then, the agent receives the reward Rt+1, which depends on the starting state, the action taken, and the subsequent state.
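In code, this interaction loop might look roughly like the sketch below. The ToyEnv class and its reset/step methods are hypothetical stand-ins (not from the article or any particular library), used only to show where St, At, Rt+1, and St+1 appear in the loop.

```python
import random

class ToyEnv:
    """A deliberately tiny stand-in environment with two states and two actions."""
    def reset(self):
        self.state = 0
        return self.state                          # S_0, the initial state

    def step(self, action):
        # The environment decides the next state S_t+1 and the reward R_t+1.
        next_state = 1 if action == 1 else 0
        reward = 1.0 if next_state == 1 else 0.0
        done = next_state == 1                     # episode ends once state 1 is reached
        self.state = next_state
        return next_state, reward, done

env = ToyEnv()
state = env.reset()                                # S_t at t = 0
for t in range(10):
    action = random.choice([0, 1])                 # A_t, chosen by a (random) agent
    next_state, reward, done = env.step(action)    # S_t+1 and R_t+1 from the environment
    print(t, state, action, reward, next_state)    # one step of the interaction
    state = next_state
    if done:
        break
```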

In the rabbit example, the action space consists of moving right and moving left. The state space includes all four possible situations: 1) the rabbit is at the initial position, 2) the rabbit is between the broccoli and the tiger after eating the carrot, 3) the rabbit is at the leftmost position after eating the broccoli, and 4) the rabbit has been eaten by the tiger. The possible rewards are +10 (Situation 2), +3 (Situation 3), and -100 (Situation 4). The details are described in the following diagram.

Considering the MDP framework and the diagram, the rabbit knows that the optimal action is moving left (= going for the broccoli), because eating the carrot generates a negative total reward in the end (+10 for the carrot, followed by -100 from the tiger).
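As a rough check of that reasoning, here is a small sketch that encodes the four situations and the deterministic dynamics of the rabbit MDP, then adds up the rewards each first move eventually yields. The state labels and the transitions table are my own encoding of the example, not code from the article.

```python
# Deterministic dynamics of the rabbit MDP: (state, action) -> (next_state, reward).
# Broccoli is on the left (+3, safe); the carrot is on the right (+10, tiger behind it).
transitions = {
    ("start", "left"):       ("ate_broccoli", 3),     # Situation 3: safe, +3
    ("start", "right"):      ("ate_carrot",   10),    # Situation 2: +10, but next to the tiger
    ("ate_carrot", "left"):  ("eaten",        -100),  # Situation 4: the tiger catches the rabbit
    ("ate_carrot", "right"): ("eaten",        -100),  # no escape in either direction
}
terminal_states = {"ate_broccoli", "eaten"}

def episode_return(first_action):
    """Total reward collected when the rabbit starts by taking `first_action`."""
    state, total = "start", 0
    action = first_action
    while state not in terminal_states:
        state, reward = transitions[(state, action)]
        total += reward
        action = "left"  # after the carrot, any move leads to the tiger; pick one
    return total

print(episode_return("left"))   # +3   (broccoli, safe)
print(episode_return("right"))  # -90  (+10 for the carrot, then -100 from the tiger)
```

Unlike the bandit learner above, the value of a move here depends on the state the rabbit ends up in, which is exactly what the MDP framework captures.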

Technically, an MDP captures the dynamics of an RL problem through its dynamics function, which is a probability distribution over pairs of a next state and a reward, conditional on the previous state and the action taken there. However, I will save the explanation of this for another article I will write in the future.
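For reference, in the standard notation (as used in Sutton and Barto's textbook, which the specialization follows), the dynamics function can be written, consistent with the time indexing above, as:

```latex
p(s', r \mid s, a) \doteq \Pr\{\, S_{t+1} = s',\; R_{t+1} = r \mid S_t = s,\; A_t = a \,\}
```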

Summary

  • The MDP framework is important because it accounts for changes in the situation that might lead to a change in the optimal action, which is not considered in the k-armed bandit framework
  • An MDP consists of an agent and an environment. The agent interacts with the environment, observing state transitions and receiving rewards, to find an optimal action at each state
