Understanding Markov Decision Processes

Edward Barnett
Towards Data Science
9 min read · Jan 26, 2019


At a high level, a Markov Decision Process (MDP) is a mathematical model that is very useful for machine learning, and for reinforcement learning in particular. The model allows an agent to determine the ideal behavior within a specific environment, in order to maximize its ability to reach a certain state (or several states) in that environment, depending on what you want to achieve. This goal is captured by what we will call a policy, which maps the environment’s states to the agent’s actions, and the MDP framework seeks to optimize the steps taken to reach such a solution. This optimization is done with a reward feedback system, where different actions are weighted depending on the predicted state those actions will lead to.

Breaking it down

To understand MDP, we should first look at the unique components of the process. It contains:

  • A set of states that exist within our specified environment: S
  • A set of possible actions that can be performed by an agent (or agents) within our specified environment: A
  • A description of each action's effect on the current state, the transition model: T
  • A reward function that gives the reward for a given state and action: R(s, a)
  • A policy that seeks to solve the MDP. You can think of it as a mapping from states to actions; in simpler terms, it indicates the best action a to take while in state s. (A small code sketch of these components follows the overview image below.)
[Image: overview of an MDP]
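
To make these pieces concrete, here is a minimal sketch of the components in Python. The states, actions, probabilities, and rewards are made up for illustration and do not come from any particular library.

```python
# A tiny, hand-written MDP: two states, two actions (illustrative values only).
states = ["S1", "S2"]                # S: the set of states
actions = ["stay", "move"]           # A: the set of actions

# T[s][a] maps each possible next state s' to P(s' | s, a)
T = {
    "S1": {"stay": {"S1": 1.0}, "move": {"S1": 0.2, "S2": 0.8}},
    "S2": {"stay": {"S2": 1.0}, "move": {"S1": 0.7, "S2": 0.3}},
}

# R(s, a): the reward for taking action a in state s
R = {
    ("S1", "stay"): 0.0, ("S1", "move"): 1.0,
    ("S2", "stay"): 0.5, ("S2", "move"): 0.0,
}

# A policy maps each state to the action we will take there
policy = {"S1": "move", "S2": "stay"}
```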

Understanding the model:

As you can see from the picture above, we have T(S, a, S’) ~ P(S’ | S, a).

The model is often referred to as the Transition Model. If you break things down it becomes simple to understand. T represents a conceptualization of our actions. We start in some state S, we take action a, and our agent finds itself in a new state S’. (Don’t let my wording confuse you here. It’s absolutely possible that the action we take will leave the agent in the same state.)

Next we define P, the probability of reaching the new state S’ by taking action a in state S.
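
Continuing the sketch above, T(S, a, S’) is just a lookup of P(S’ | S, a), and we can sample the agent’s next state from it. Again, this is only an illustration, not a specific library’s API.

```python
import random

def transition_prob(T, s, a, s_next):
    """T(s, a, s') ~ P(s' | s, a): probability of landing in s_next."""
    return T[s][a].get(s_next, 0.0)

def step(T, s, a):
    """Sample the next state S' according to P(. | s, a)."""
    next_states = list(T[s][a].keys())
    probs = list(T[s][a].values())
    return random.choices(next_states, weights=probs, k=1)[0]

# Example with the T dictionary defined above:
# print(transition_prob(T, "S1", "move", "S1"))  # 0.2
# print(step(T, "S1", "move"))                   # "S1" or "S2", sampled at random
```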

Let’s get mathy: The Markov Property

P[St+1 | a, S1, …, St] = P[St+1 | a, St]

The formula above is the definition of the Markov property. Again, let’s break it down.

St+1 can be interpreted as the future state, or the state we intend to move towards, and [S1, …, St] represents the history of all the states visited so far. The a, of course, still represents the action being taken. The key point is that the new state depends only on the most recent state and action; it has no dependence on the rest of the state history.

Given a Markov state, we can now define the transition probability.

Given that our agent is in some state s, there is one probability of moving to the first state, another probability of moving to the second state, and so on for every existing state. These are our transition probabilities.

We can take these probabilities and feed them into a state matrix! Let’s define an example with a 3-state weather problem. Here are the rules:

  1. You live in the town of Sunny, but sadly, Sunny never has two nice days of weather in a row. Ever.
  2. If you have a nice day, the following day is equally likely to bring snow or rain.
  3. If you have snow or rain, there is a 50/50 chance of having the same weather the next day.
  4. If the weather does change from snow or rain, it changes to a nice day only half the time.

Using this fake environment information, we can construct a state matrix like so:
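
Working the four rules into numbers, with the states ordered rain, nice, snow, the matrix comes out as below (held here in a numpy array, which is just one convenient way to store it):

```python
import numpy as np

# States, in order: rain, nice, snow.
# Row i gives the probabilities for tomorrow's weather given today's weather i.
P = np.array([
    [0.50, 0.25, 0.25],   # rain -> rain, nice, snow
    [0.50, 0.00, 0.50],   # nice -> rain, nice, snow (never nice twice in a row)
    [0.25, 0.25, 0.50],   # snow -> rain, nice, snow
])

# Each row is a probability distribution, so it must sum to 1
assert np.allclose(P.sum(axis=1), 1.0)
```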

Here the first row of the matrix gives the probabilities for the following day’s weather given a rainy day. The second row gives the probabilities for nice days, and the third, as you may have grasped, for snow days. This matrix of transition probabilities is known as the transition matrix.

Now, let’s model a real problem with our matrix using p²(i, j), the (i, j) entry of the two-step transition matrix P².
Let i represent the current state, and j represent the future state.

We want to know: given that it is raining today, what is the likelihood that it will be snowing two days from now? And given that it is raining, what are the possible states in between?

Well, the only restriction is that it cannot be nice two days in a row. All other states between now and then are possible. So it could be raining tomorrow and snowing the next day, it could be nice tomorrow and snowing the next day, and it could be snowing tomorrow and snowing the next day.

Put in equation form:

P²(Rain → Snow) = P(Rain → Rain)·P(Rain → Snow) + P(Rain → Nice)·P(Nice → Snow) + P(Rain → Snow)·P(Snow → Snow)

Now I realize that hurts to look at, but you may have noticed at this point that this is essentially vector and matrix math: we take the dot product of row 1 with column 3.
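
Here is the same calculation in code, reusing the matrix P defined above: the dot product of the rain row with the snow column, which is exactly the (rain, snow) entry of P².

```python
import numpy as np

# The weather transition matrix from above; rows/columns ordered rain, nice, snow
P = np.array([[0.50, 0.25, 0.25],
              [0.50, 0.00, 0.50],
              [0.25, 0.25, 0.50]])

rain, nice, snow = 0, 1, 2

# Dot product of the "rain" row with the "snow" column...
two_step = P[rain, :] @ P[:, snow]

# ...which is exactly the (rain, snow) entry of P squared
P2 = P @ P

print(two_step)         # 0.375
print(P2[rain, snow])   # 0.375
```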

Visualizing to make life easier

Let us consider that we wish to predict the state probabilities over a 6-day period, or six state transitions. We have not fixed a single starting state; we merely wish to find the probability of each state after that many transitions, given our initial probabilities. A chain like this, where some power of the transition matrix has only positive entries, is called a regular Markov chain. As you can see, we continue to use vector and matrix math to calculate the odds of each state, taking repeated products with the transition matrix.
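
A sketch of the six-transition case, reusing the same matrix: raising P to the sixth power gives, in row i, the probabilities of each state six days after starting in state i.

```python
import numpy as np

P = np.array([[0.50, 0.25, 0.25],
              [0.50, 0.00, 0.50],
              [0.25, 0.25, 0.50]])  # rain, nice, snow

# Row i of P^6 is the distribution over (rain, nice, snow) six days after starting in state i
P6 = np.linalg.matrix_power(P, 6)
print(P6.round(4))
# By day six the rows are nearly identical: the chain is forgetting its starting state
# and settling toward its long-run distribution of (0.4, 0.2, 0.4).
```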

Okay, but how do we determine the initial starting state? How does it affect our probability calculations, and how do we build a Markov chain from it?

I’m so glad you asked! We can define this as: u^(n) = uP^(n)

Where u is a vector that represents the initial distribution of states, and P is our transition matrix for a Markov chain.

We know we have three states; now we just pick how many steps ahead to look. For n = 3: u^(3) = uP³

Let’s solve for the third-day probabilities. We make a 1×3 vector to represent the initial distribution of states, and multiply it by P³ to find the likelihood of each state on the third day, given that initial distribution.
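
A small sketch of that computation, assuming (purely for illustration) that today is equally likely to be rainy, nice, or snowy:

```python
import numpy as np

P = np.array([[0.50, 0.25, 0.25],
              [0.50, 0.00, 0.50],
              [0.25, 0.25, 0.50]])   # rain, nice, snow

u = np.array([1/3, 1/3, 1/3])        # assumed initial distribution over (rain, nice, snow)

u3 = u @ np.linalg.matrix_power(P, 3)   # u^(3) = u P^3
print(u3.round(4))                      # probabilities of rain, nice, snow on the third day
```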

Great. But how do I find the optimal choice of actions across so many states for a desired outcome? That is, after all, what I want to use reinforcement learning for here.

In order to understand how to calculate the optimization of our states and actions, we need to assign value to them. In order to understand value, we need to define a policy, and in order to define a policy, we need to understand reward and return.

Reward and Return Value

Reinforcement learning agents seek to maximize the sum of future rewards. They want to choose the actions that achieve the greatest total reward. This total is called the return, and we can model it like so, with r representing reward, R representing return, and the subscript t representing the time step (in our case, state transitions).
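
In its standard undiscounted form, the return is simply the sum of the rewards that follow time step t:

Rt = rt+1 + rt+2 + rt+3 + …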

Now, as you can see, this sum is allowed to run on towards infinity, which would not make sense for many problems. We want the sum of rewards to end at some point. We call tasks that end episodic. Think of the board game Monopoly. The episode is the Monopoly game: it starts with every player in the same position, and after a wide range of possible states and actions, the episode eventually ends with a winner. A new episode can start by creating a new instance of the game. I’ll admit that Monopoly games can sometimes feel like they’re going to be infinite…

The more common method for handling return is the future cumulative discounted reward:
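
Rt = rt+1 + γ·rt+2 + γ²·rt+3 + … (in its standard form, the sum of γ^k · rt+k+1 over k ≥ 0)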

where the discount γ in front of each reward is a value between 0 and 1. The benefit of this method is that the return is now well defined even for an infinite stream of rewards (the sum converges when γ < 1), and it gives greater weight to more immediate rewards. The model now cares more about rewards that come sooner, and less about distant future rewards. We can tune this ourselves: the smaller the discount we select, the more emphasis is placed on early rewards. As you might imagine, a discount of 1 recovers our original return equation, and a discount of 0 creates a model that only cares about the very first reward. This could be useful in the sense that the agent would learn the absolute best thing to do in that exact moment, but it would lack any insight into its own future. Kind of like a baby vs an adult. Well, some adults. The baby knows only that it needs a specific thing in that moment; an adult can plan and try to predict its future given a tree of choices.
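
As a quick check of those claims, here is a small sketch that computes the discounted return of a made-up reward sequence for a few discount values (the rewards and the helper name are illustrative only):

```python
def discounted_return(rewards, gamma):
    """Future cumulative discounted reward: sum of gamma**k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]            # made-up rewards for r_{t+1} .. r_{t+4}

print(discounted_return(rewards, 1.0))     # 11.0  -> gamma = 1 is the plain sum
print(discounted_return(rewards, 0.5))     # 2.25  -> later rewards count for less
print(discounted_return(rewards, 0.0))     # 1.0   -> gamma = 0 cares only about the first reward
```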

More on policy

π(s, a) is our policy function. It describes a way of acting: it takes the current state and an action and returns the probability of taking that action in that state. Kind of like what we demonstrated above, no?

If you think about what that means, it must be true that for any given state, the probabilities of all possible actions sum to 1.

Our policy should describe how to act in each state.
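
As a sketch, a stochastic policy can be written as a table of action probabilities per state. The state names, action names, and numbers below are made up for illustration:

```python
# pi(s, a): probability of taking action a in state s (illustrative values only)
policy = {
    "hungry": {"eat": 0.9, "dont_eat": 0.1},
    "full":   {"eat": 0.1, "dont_eat": 0.9},
}

# For every state, the action probabilities must sum to 1
for s, action_probs in policy.items():
    assert abs(sum(action_probs.values()) - 1.0) < 1e-9, s
```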

Consider this policy created by Josh Greaves

As you can see, we are rewarded when we are full, rewarded when we are hungry and eat, and rewarded for not eating when full. We are extremely heavily penalized if starving, and penalized for eating when full, as well as for eating when more satiated than hungry. It’s pretty easy to see that in this very simple example, the optimal policy is to always eat when hungry.
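
To make that pattern concrete, here is a hypothetical reward table in the same spirit. The state and action names and the numbers are my own placeholders, not the values from Greaves’ table:

```python
# Hypothetical rewards R(s, a) following the pattern described above (placeholder numbers)
R = {
    ("hungry",   "eat"):      +1.0,   # rewarded for eating when hungry
    ("hungry",   "dont_eat"): -1.0,   # going hungry risks starving
    ("full",     "dont_eat"): +1.0,   # rewarded for not eating when full
    ("full",     "eat"):      -1.0,   # penalized for eating when full
    ("starving", "dont_eat"): -10.0,  # extremely heavy penalty for staying starving
    ("starving", "eat"):      +1.0,   # eating gets us out of trouble
}
```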

Value Functions:

The two types of value functions in reinforcement learning are the state value function V(s) and the action value function Q(s, a).

The state value function describes the value of a state under a specific policy. It is the expected return when we start from state s and then act according to our policy.

The action value function gives the value of taking action a in state s and thereafter following our specified policy.
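
Written in their standard form, with Rt the return defined earlier and π our policy:

Vπ(s) = Eπ[ Rt | St = s ]

Qπ(s, a) = Eπ[ Rt | St = s, At = a ]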

Now, given that it is the environment that returns the next state when we select an action, we have to keep in mind that if our policy changes, so too will the value functions. We expect a certain return from these functions; however, there can be a great deal of randomness in arriving at a particular state, and the transition function also influences which state we end up in. We may not have 100% probability!

How can we account for randomness of the returned state given the environment?

Well, we will save that topic for the next blog post, but in short, we can use the Bellman Equations. These equations allow us to express the values of states as values of other states. Meaning if we know the value of some state, we can use it to calculate the value of other states.

We can use these equations to derive the Bellman equation for both the state value and the action value, which I promise we will do next time…after I study some more :D

Till then!

Sources:

Dartmouth probability text: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html

MIT published paper: http://www.mit.edu/~jnt/Papers/J083-01-mar-MDP.pdf
