In previous posts, we extended the idea of reinforcement learning from discrete state spaces to continuous state spaces, and implemented a 1000-state random walk example. In that example a policy was already given (at every state the actions of going left and right have equal probability), and the only task was to estimate the value function of each state under that policy (we call this sort of problem a prediction problem).
In this article, let’s extend the idea to the more general case called the control problem, where no policy is given and our goal is to use a Q function to learn the best action at each state. In this post, I will:
- Introduce the on-policy algorithm for the continuous state space setting
- Implement and apply the algorithm to the classic mountain car problem
(Note: we are going to use tile coding as the function approximator; if you are not familiar with it, please refer to my last post.)

Control Problem
Unlike prediction problems, where the action distribution is already known, in a control problem the agent is tasked with exploring the environment and learning the best action at each state. Consequently, when formulating the problem we consider a Q(s, a) function instead of a value function V(s).
Semi-gradient Sarsa
On the other hand, just as in the prediction problem, the algorithm for a parametric value function can be naturally extended to a parametric Q function, simply by replacing the value function with a Q function. Let’s get right into the algorithm we are going to apply:

If you have been following my previous posts, this algorithm will once again look very familiar. Like n-step Sarsa in the discrete state space setting, this semi-gradient n-step Sarsa is exactly the same except that at each time step the parameter vector w is updated rather than the Q function directly. The idea is that at each update time τ, the return accumulated up to τ+n is used to correct the current estimate, and following the gradient descent idea, the weight vector w is nudged towards the actual value in proportion to its derivative and the delta.
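For reference, written in the notation of Sutton and Barto, the n-step return and the semi-gradient weight update performed by this algorithm are:

G_{\tau:\tau+n} = R_{\tau+1} + \gamma R_{\tau+2} + \dots + \gamma^{n-1} R_{\tau+n} + \gamma^{n} \hat{q}(S_{\tau+n}, A_{\tau+n}, \mathbf{w})

\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ G_{\tau:\tau+n} - \hat{q}(S_\tau, A_\tau, \mathbf{w}) \right] \nabla \hat{q}(S_\tau, A_\tau, \mathbf{w})

(with G truncated at the terminal time T when τ+n runs past the end of the episode).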
Mountain Car Problem
If you still have some confusion in mind, that is totally fine. I believe the best way to understand every detail of an algorithm is to implement it and apply it to an actual problem. Let’s do that with one of the classic reinforcement learning problems: mountain car.
Mountain Car Setting

Consider the task of driving an underpowered car up a steep mountain road, as suggested by the diagram. The difficulty is that gravity is stronger than the car’s engine, and even at full throttle the car cannot accelerate up the steep slope. The only solution is to first move away from the goal and up the opposite slope on the left.
_The reward in this problem is -1 on all time steps until the car moves past its goal position at the top of the mountain, which ends the episode. There are three possible actions: full throttle forward (+1), full throttle reverse (-1), and zero throttle (0). The car moves according to a simplified physics. Its position, x_t, and velocity, ẋ_t, are updated by:_

x_{t+1} = bound[x_t + ẋ_{t+1}]
ẋ_{t+1} = bound[ẋ_t + 0.001*A_t - 0.0025*cos(3*x_t)]

_where the bound operation enforces -1.2 <= x_{t+1} <= 0.5 and -0.07 <= ẋ_{t+1} <= 0.07. In addition, when x_{t+1} reached the left bound, ẋ_{t+1} was reset to zero. When it reached the right bound, the goal was reached and the episode was terminated. Each episode started from a random position x_t in [-0.6, -0.4) and zero velocity._
In a nutshell, the car starts at a position between -0.6 and -0.4 with zero velocity, and it has 3 discrete actions: backward (-1), stay (0) and forward (+1). The goal is to reach the mountain top at the rightmost position. So the state in this problem is the pair (position, velocity), the action set is (-1, 0, 1), and the reward is -1 everywhere except at the goal state (notice that the state is continuous while the actions are discrete).
The Value Function
Having understood the problem setting, we now need a Q value function to keep track of the values of all states and to update the value estimates as exploration goes on. Recall that in the last post we went through the basics of using tile coding to encode a continuous state space and of applying it in the update process. In this post, we will continue to use tile coding in our value function, but the code is slightly different:
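As a rough sketch, assuming Sutton’s tiles3 module (which provides IHT and tiles) is importable, the value function looks something like the following; treat the names and details as illustrative rather than as the exact code from the gist:

import numpy as np
from tiles3 import IHT, tiles  # Sutton's tile coding software

POSITION_BOUND = [-1.2, 0.5]
VELOCITY_BOUND = [-0.07, 0.07]

class ValueFunction:
    def __init__(self, stepSize, numOfTilings=8, maxSize=2048):
        self.maxSize = maxSize
        self.numOfTilings = numOfTilings
        # divide the step size equally among the tilings
        self.stepSize = stepSize / numOfTilings
        # hash table mapping tile coordinates of (state, action) to indices
        self.hashTable = IHT(maxSize)
        # one weight per tile index
        self.weights = np.zeros(maxSize)
        # position and velocity need scaling to satisfy the tile software
        self.positionScale = self.numOfTilings / (POSITION_BOUND[1] - POSITION_BOUND[0])
        self.velocityScale = self.numOfTilings / (VELOCITY_BOUND[1] - VELOCITY_BOUND[0])

    def getActiveTiles(self, position, velocity, action):
        return tiles(self.hashTable, self.numOfTilings,
                     [self.positionScale * position, self.velocityScale * velocity],
                     [action])

    def value(self, position, velocity, action):
        # the value of the goal (terminal) state is defined to be 0
        if position == POSITION_BOUND[1]:
            return 0.0
        activeTiles = self.getActiveTiles(position, velocity, action)
        return np.sum(self.weights[activeTiles])

    def update(self, position, velocity, action, target):
        # with tile coding the gradient w.r.t. each active weight is 1,
        # so the update simply adds the delta to the active tiles
        activeTiles = self.getActiveTiles(position, velocity, action)
        delta = self.stepSize * (target - np.sum(self.weights[activeTiles]))
        for tile in activeTiles:
            self.weights[tile] += delta

    def costToGo(self, position, velocity):
        # negated action value, read as an estimate of the steps left to the goal
        # (the best action value is used here; the referenced code may differ)
        return -max(self.value(position, velocity, action) for action in (-1, 0, 1))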
The tile coding function used here comes from here. The reason I didn’t use the tile coding introduced in the last post is that it is a bit basic, with uniformly distributed offsets; I personally found the results are not as good as with this one, and this also seems to be the tile coding function Sutton introduces in his book.
Anyway, let’s get back on track. The idea here is exactly the same: each continuous state is encoded into a set of tile indices, and at each step of the episode the weights of its corresponding tiles are updated.
Let’s take a closer look at the function. We set numOfTilings = 8 and maxSize = 2048, meaning that we use 8 tilings, each a 16*16 grid, so that 16*16*8 = 2048. More specifically, the state (position, velocity) is represented by a 3D cube of dimension 16*16*8, and the tiles function, from the file I referred to above, returns the active tiles given a state and action. You might have noticed that inside the init function, the state is rescaled by:
# position and velocity need scaling to satisfy the tile software
self.positionScale = self.numOfTilings / (POSITION_BOUND[1] - POSITION_BOUND[0])
self.velocityScale = self.numOfTilings / (VELOCITY_BOUND[1] - VELOCITY_BOUND[0])
To be honest, I am not entirely sure why it is scaled exactly this way, but the tile coding software seems to assume tiles of width 1 in the scaled space, so continuous state values always have to be rescaled before being passed in (here each dimension is scaled to span numOfTilings units).
Like all Q value functions, it has at least two methods:
- value function: returns the value of the current state and action. When the agent reaches the goal state, it returns 0.
- update function: updates the weights based on the new estimate (remember that in tile coding the derivative with respect to each active weight is always 1, so the update simply adds the delta value directly).
The costToGo function is used for visualisation in later steps; it simply returns the estimated value with a minus sign in front (as the reward is always -1, the values are negative, so negating them gives a positive cost of reaching the goal).
Mountain Car Implementation
Till now, we have actually gone through the trickiest part, and the rest are things we have been doing repeatedly. (full implementation)
Init
As usual, in the init function we specify the action set and state ranges, along with some general elements common to all reinforcement learning problems.
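For concreteness, the kind of constants set up here might look as follows (these names and values are illustrative, not necessarily the exact ones used in the linked implementation):

ACTIONS = [-1, 0, 1]                 # backward, stay, forward
POSITION_BOUND = [-1.2, 0.5]         # position range; 0.5 is the goal
VELOCITY_BOUND = [-0.07, 0.07]       # velocity range
START_POSITION_RANGE = [-0.6, -0.4]  # episodes start here with zero velocity
EPSILON = 0.1                        # exploration rate of the epsilon-greedy policy
GAMMA = 1.0                          # the task is episodic and undiscounted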
Action Functions
The takeAction function takes in an action and computes the next state of the car; the update rule follows the formula I introduced above. The chooseAction function still uses the ϵ-greedy method, and in the greedy part it chooses the action with the highest estimated value, given by the valueFunc we described above.
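A minimal sketch of these two functions, following the simplified physics from the problem definition (again, the names are illustrative and details may differ from the linked code):

import numpy as np

# same constants as in the sketch above
ACTIONS = [-1, 0, 1]
POSITION_BOUND = [-1.2, 0.5]
VELOCITY_BOUND = [-0.07, 0.07]

def takeAction(position, velocity, action):
    # simplified mountain car physics with the bound operation applied
    newVelocity = velocity + 0.001 * action - 0.0025 * np.cos(3 * position)
    newVelocity = min(max(newVelocity, VELOCITY_BOUND[0]), VELOCITY_BOUND[1])
    newPosition = min(max(position + newVelocity, POSITION_BOUND[0]), POSITION_BOUND[1])
    if newPosition == POSITION_BOUND[0]:
        # hitting the left wall resets the velocity to zero
        newVelocity = 0.0
    return newPosition, newVelocity

def chooseAction(position, velocity, valueFunc, epsilon=0.1):
    # epsilon-greedy over the three discrete actions
    if np.random.rand() < epsilon:
        return np.random.choice(ACTIONS)
    values = [valueFunc.value(position, velocity, a) for a in ACTIONS]
    return ACTIONS[int(np.argmax(values))]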
Reward
Besides the goal state, which gets a reward of 0, all other steps get a reward of -1.
Get the Car Rolling!
This is the longest function of all, but there is no need to worry: in fact, everything is the same as it was in the discrete setting (I actually copied and pasted from my previous code), except that the value function is changed to the one we introduced just now. The target G is the accumulated n-step return, including the estimated value at step tau+n, and in the update step the state, action and G are passed to the valueFunction.update() function together.
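To make the structure concrete, here is a rough sketch of a single episode of n-step semi-gradient Sarsa, reusing the ValueFunction, takeAction and chooseAction sketches above; it follows the textbook pseudocode, and the bookkeeping in the linked implementation may differ in detail:

def runEpisode(valueFunction, n=8, gamma=1.0, epsilon=0.1):
    # one episode of n-step semi-gradient Sarsa on the mountain car task
    position = np.random.uniform(-0.6, -0.4)
    velocity = 0.0
    action = chooseAction(position, velocity, valueFunction, epsilon)
    states = [(position, velocity)]
    actions = [action]
    rewards = [0.0]   # dummy reward so that rewards[t] is the reward received at time t
    T = float('inf')  # episode length, unknown until the goal is reached
    t = 0
    while True:
        if t < T:
            newPosition, newVelocity = takeAction(position, velocity, action)
            states.append((newPosition, newVelocity))
            rewards.append(-1.0)
            if newPosition == POSITION_BOUND[1]:
                T = t + 1  # goal reached, the episode ends at time T
            else:
                newAction = chooseAction(newPosition, newVelocity, valueFunction, epsilon)
                actions.append(newAction)
                position, velocity, action = newPosition, newVelocity, newAction
        tau = t - n + 1   # the time whose estimate is being updated
        if tau >= 0:
            # n-step target G: discounted rewards plus a bootstrapped value at tau + n
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, int(min(tau + n, T)) + 1))
            if tau + n < T:
                pos, vel = states[tau + n]
                G += gamma ** n * valueFunction.value(pos, vel, actions[tau + n])
            pos, vel = states[tau]
            valueFunction.update(pos, vel, actions[tau], G)
        if tau == T - 1:
            break
        t += 1
    return T  # number of steps taken in this episode

Constructing something like valueFunction = ValueFunction(stepSize=0.3) and calling runEpisode in a loop for a few hundred episodes is then enough to watch the episode lengths shrink (the step size here is an illustrative choice).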
Visualisation
In this section, we will use the costToGo function defined inside the ValueFunction to measure the estimated cost of reaching the goal from a given state. The plotting code is listed below:
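For illustration, a rough matplotlib sketch of such a plot, drawing the cost-to-go surface over a position-velocity grid (the actual plotting code in the post may differ), could look like this:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, needed on older matplotlib versions

POSITION_BOUND = [-1.2, 0.5]
VELOCITY_BOUND = [-0.07, 0.07]

def plotCostToGo(valueFunction, gridSize=40):
    positions = np.linspace(POSITION_BOUND[0], POSITION_BOUND[1], gridSize)
    velocities = np.linspace(VELOCITY_BOUND[0], VELOCITY_BOUND[1], gridSize)
    # evaluate the learned cost-to-go on a position x velocity grid
    costs = np.array([[valueFunction.costToGo(p, v) for p in positions]
                      for v in velocities])
    P, V = np.meshgrid(positions, velocities)
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(P, V, costs)
    ax.set_xlabel('position')
    ax.set_ylabel('velocity')
    ax.set_zlabel('cost to go')
    plt.show()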
Let’s see the result in step 100 and step 9000:


The lower the cost, the closer the car is to reaching the goal state. In fact, the lowest costs appear either at the leftmost or at the rightmost positions, which confirms that the agent can only reach the final goal state by first moving in the opposite direction, away from the goal.
Reference: