So far I have introduced the most basic ideas and algorithms of reinforcement learning in discrete state and action settings. Recall the examples we have implemented so far: Grid World, Tic-Tac-Toe, Multi-Armed Bandits, Cliff Walking, Blackjack, etc. Most of them are set on a board or a grid in order to keep the state space countable. However, the power of reinforcement learning does not stop there. In real-world situations, the state space is mostly continuous, with uncountably many state and action combinations for an agent to explore. In this article, I will extend the ideas we learned previously to continuous space and implement a more general Random Walk example by applying function approximation.
In this article, I will introduce:
- How to approximate a value function using parametric methods
- The semi-gradient TD method (an extension of the n-step TD method)
- Some general functions used for approximation
- How to apply these functions to the Random Walk example

The Idea of Generalisation
In a discrete state space, an agent can theoretically experience every state and explore the reward of each one. When the setting is extended to a continuous state space, the agent can no longer visit every state, so to generalise the value function we need a compact representation of the state.
Function Approximation
Consider a supervised learning problem: no matter which algorithm one applies, the trained model is able to make predictions on data it has never seen before, and the magic is that it has learnt a representation of y = f(x), where x is the input features and y is the target. This idea of approximation, or generalisation, can be carried over directly to reinforcement learning problems (in fact, most machine learning algorithms can be used for function approximation in reinforcement learning settings): we try to approximate the value function v = v(s, w), where s is the state and w is the model weights, and take (state, value) pairs as training examples. For example, in its simplest form, we could define the value function as v = w_0*s + w_1, a linear approximation of order 1.
Update the Weights
Now here comes the problem: we have a representation of the value function, but how do we update the weights so that the approximation moves towards the actual value? The weight update follows the rule of Stochastic Gradient Descent (SGD):

w_{t+1} = w_t + α * [v(S_t) - v(S_t, w_t)] * ∇v(S_t, w_t)

where v(S_t) denotes the actual value at step t, v(S_t, w_t) denotes the approximate value with weight parameter w_t, and α is the step size. Note that the squared difference between the actual value and the estimated value measures the error of the approximation, and by taking the derivative with respect to w_t, the weights nudge themselves slightly in the right direction.
Stochastic gradient-descent (SGD) methods do this by adjusting the weight vector after each example by a small amount in the direction that would most reduce the error on that example.
For an approximate function with multiple weight parameters, each weight is updated with its own partial derivative:

w_{i,t+1} = w_{i,t} + α * [v(S_t) - v(S_t, w_t)] * ∂v(S_t, w_t)/∂w_i
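As a concrete illustration, here is a minimal sketch of one SGD step for the order-1 linear approximation v = w_0*s + w_1 from above; the function and variable names are my own:

```python
import numpy as np

def sgd_update(w, s, target, alpha=0.01):
    """One SGD step for the linear approximation v(s, w) = w[0]*s + w[1]."""
    v_hat = w[0] * s + w[1]          # current estimate
    grad = np.array([s, 1.0])        # dv/dw_0 = s, dv/dw_1 = 1
    return w + alpha * (target - v_hat) * grad

w = np.zeros(2)
w = sgd_update(w, s=0.5, target=1.0)  # nudge the weights toward the observed value
```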
N-Step Semi-gradient TD Method
Now let’s get to the algorithm applied in continuous state space:

It looks like there are an awful lot of steps, but if you look closer, you will see that it is very similar to the n-step TD method we learnt here. Have a look at the n-step TD method in discrete space:

The Q function here is interchangeable with the value function. Comparing the TD methods for continuous and discrete space, it is clear that the only difference lies in the value function update: in discrete space the value function (or Q function) is updated directly, while in continuous space it is updated implicitly through the weight update, since the value function is represented by the weights w. Another thing to notice is that in continuous space the target value is the return G, which in the 1-step TD method is the reward plus the discounted value of the state one step ahead, and in Monte Carlo simulation (which, as we discussed before, is essentially TD with infinitely many steps) is the accumulated reward until the end of the episode.
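To make the comparison concrete, here is a minimal sketch of one episode of n-step semi-gradient TD for state-value prediction, loosely following the pseudocode in Sutton's book. The env_step function (returning next_state, reward, done) and the value_function object with value/update methods are my own assumed interfaces, matching the implementation described later in this post:

```python
def n_step_semi_gradient_td(env_step, value_function, start_state,
                            n=1, alpha=2e-5, gamma=1.0):
    """One episode of n-step semi-gradient TD; weights are updated via value_function."""
    states, rewards = [start_state], [0.0]   # rewards[t] holds R_t, with a dummy R_0
    T = float('inf')
    t = 0
    while True:
        if t < T:
            next_state, reward, done = env_step(states[t])
            states.append(next_state)
            rewards.append(reward)
            if done:
                T = t + 1
        tau = t - n + 1                       # the time whose state estimate is updated
        if tau >= 0:
            # n-step return G
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * value_function.value(states[tau + n])
            # semi-gradient update: pass the scaled TD error to the value function
            delta = alpha * (G - value_function.value(states[tau]))
            value_function.update(states[tau], delta)
        if tau == T - 1:
            break
        t += 1
```

Note that value_function.update receives the scaled TD error and applies it to the weights through the gradient, which is exactly the semi-gradient update.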
General Approximation Functions
As mentioned above, in the continuous space setting the value function is a representation of the states, V = V(S, w), and most machine learning algorithms can be applied here. In this post, I will introduce some of the most basic approximation functions that can be easily implemented and will help you get a sense of how to apply them to reinforcement learning problems.
State Aggregation
This is the simplest form of function approximation. For example, suppose there are 10 states (1, 2, 3, ..., 10) in the space; if we give states 1 to 5 one shared value and states 6 to 10 another, this is called state aggregation. The mathematical representation could be:
V(S) = w_0 if 1 ≤ S ≤ 5
V(S) = w_1 if 5 < S ≤ 10
The function is simple but very important, as the idea is central to tile coding, a generalisation technique that is used a lot in reinforcement learning problems. The advantage of state aggregation is that the derivative with respect to the weight of the active group is always 1, so the value of a state is simply its corresponding weight.
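A minimal sketch of this 10-state aggregation example (the helper names are my own; a fuller version appears in the implementation section below):

```python
import numpy as np

w = np.zeros(2)                      # one weight per group

def group(s):
    return 0 if s <= 5 else 1        # states 1-5 -> group 0, states 6-10 -> group 1

def value(s):
    return w[group(s)]               # V(S) is just the group's weight

def update(s, delta):
    w[group(s)] += delta             # derivative w.r.t. the active weight is 1
```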
Polynomials
The polynomial function we will apply here can be written as:
V(S) = w_0*S^0 + w_1*S^1 + ...
In this case, the derivative of the value with respect to each weight w_i is simply S^i, the corresponding power of the state.
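A quick sketch of the polynomial features for a single state variable (names are mine); the derivative with respect to w_i is exactly the i-th feature:

```python
import numpy as np

def poly_features(s, order):
    # [1, s, s^2, ..., s^order]; the derivative w.r.t. w_i is s^i
    return np.array([s ** i for i in range(order + 1)])

def poly_value(s, w):
    return np.dot(w, poly_features(s, len(w) - 1))
```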
Fourier
Similar to the polynomial case, the Fourier function applied here is:
V(S) = w_0*cos(0*π*S) + w_1*cos(1*π*S) + ...
The derivative with respect to weight w_i is cos(i*π*S).
Notice that both the polynomial and Fourier functions introduced here are linear in the weight parameters, which helps guarantee convergence. Also, I have only introduced the case of a single state variable; for the more general form, please refer to Sutton's book.
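And a matching sketch of the Fourier features, assuming the state has been scaled into [0, 1]:

```python
import numpy as np

def fourier_features(s, order):
    # [cos(0*pi*s), cos(1*pi*s), ...]; the derivative w.r.t. w_i is cos(i*pi*s)
    return np.array([np.cos(i * np.pi * s) for i in range(order + 1)])

def fourier_value(s, w):
    return np.dot(w, fourier_features(s, len(w) - 1))
```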
1000-State Random Walk
Enough theory; let's apply what we've learned and get our hands on an actual case.
Rules
Consider a 1000-state version of the random walk task. The states are numbered from 1 to 1000, left to right, and all episodes begin near the center, in state 500. State transitions are from the current state to one of the 100 neighbouring states to its left, or to one of the 100 neighbouring states to its right, all with equal probability. Of course, if the current state is near an edge, then there may be fewer than 100 neighbours on that side of it. In this case, all the probability that would have gone into those missing neighbours goes into the probability of terminating on that side (thus, state 1 has a 0.5 chance of terminating on the left, and state 950 has a 0.25 chance of terminating on the right). As usual, termination on the left produces a reward of -1, and termination on the right produces a reward of +1. All other transitions have a reward of zero.
Remember the 100-state random walk we talked about in the other post; the rules here are mostly the same. The differences are:
- Each step ranges from 1 to 100 states
- When the agent steps beyond a boundary, the episode ends there
Python Implementation
I previously implemented the random walk in discrete state space, which you can check out here. In the following implementation, I will focus on the differences.
Approximate Functions
This is the soul of reinforcement learning in continuous state space. The value function should be able to:
- Return the value of a given state
- Update its weights given a state and a temporal difference
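In code, this interface can be sketched as a small base class (the naming is my own, not necessarily the original gist):

```python
class ValueFunction:
    """Minimal interface that every approximate value function below follows."""

    def value(self, state):
        """Return the estimated value of the given state."""
        raise NotImplementedError

    def update(self, state, delta):
        """Adjust the weights by delta (the scaled TD error) times the gradient."""
        raise NotImplementedError
```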
Aggregate State Function
For state aggregation, the 1000 states are partitioned into 10 groups of 100 states each (i.e., states 1–100 are one group, states 101–200 another, and so on), with each group's value stored in self.values. The value function simply returns the value stored in the corresponding group, and the update function updates that value by adding delta*dev, where the derivative dev is 1 and delta is G - V(S, w) (scaled by the step size α) as introduced in the SGD rule above. (Notice that here the stored value is essentially the weight parameter.)
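A minimal reconstruction of such an aggregation value function for 1000 states and 10 groups (the class and attribute names follow the description above but are my own):

```python
import numpy as np

class AggregationValueFunction:
    def __init__(self, num_groups=10, num_states=1000):
        self.num_states = num_states
        self.group_size = num_states // num_groups
        self.values = np.zeros(num_groups)            # one weight per group

    def value(self, state):
        if state < 1 or state > self.num_states:      # terminal positions (assumed) have value 0
            return 0.0
        return self.values[(state - 1) // self.group_size]

    def update(self, state, delta):
        # the derivative of V(S, w) w.r.t. the active group's weight is 1
        self.values[(state - 1) // self.group_size] += delta
```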
Polynomials & Fourier Function
The LinearValueFunction covers both the polynomial function and the Fourier function. Inside the value function, features is the list of basis components of the value representation, and inside the update function, the derivative for each weight equals its corresponding feature value, as described above.
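A sketch of such a linear value function, with the state scaled into [0, 1] before computing the features (again, my own reconstruction based on the description above):

```python
import numpy as np

class LinearValueFunction:
    def __init__(self, order=5, kind='polynomial', num_states=1000):
        self.order = order
        self.kind = kind
        self.num_states = num_states
        self.weights = np.zeros(order + 1)

    def features(self, state):
        s = state / self.num_states                   # scale the state into [0, 1]
        if self.kind == 'polynomial':
            return np.array([s ** i for i in range(self.order + 1)])
        return np.array([np.cos(i * np.pi * s) for i in range(self.order + 1)])

    def value(self, state):
        return np.dot(self.weights, self.features(state))

    def update(self, state, delta):
        # for a linear approximation the gradient is just the feature vector
        self.weights += delta * self.features(state)
```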
Action Taking
Different from the previous random walk, here the agent walks faster (from 1 to 100 states per step), and when its position goes beyond a boundary, the episode ends and a reward of either -1 or +1 is received.
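The step logic can be sketched roughly as follows; the function name and the (next_state, reward, done) return convention are my own assumptions:

```python
import numpy as np

N_STATES = 1000
START = 500

def step(state):
    """Take one random step of 1-100 states; return (next_state, reward, done)."""
    move = np.random.randint(1, 101)
    if np.random.rand() < 0.5:
        move = -move
    next_state = state + move
    if next_state <= 0:
        return 0, -1, True               # stepped past the left edge
    if next_state > N_STATES:
        return N_STATES + 1, 1, True     # stepped past the right edge
    return next_state, 0, False
```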
Play the Game
With all the above prep work, the play function is a lot easier to implement, and its structure is very similar to our previous implementations. If you have been following my previous posts, you must have seen this structure many times. The only difference here is that the tabular value function is replaced by valueFunction.
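For reference, a single-episode play function using 1-step semi-gradient TD could be sketched like this, reusing the step function above and any value function with the value/update interface (alpha is the step size; the names and default values are mine):

```python
def play(value_function, alpha=2e-5, gamma=1.0):
    """Run one episode of the 1000-state random walk with 1-step semi-gradient TD."""
    state = START
    while True:
        next_state, reward, done = step(state)
        # bootstrap only from non-terminal successors
        target = reward if done else reward + gamma * value_function.value(next_state)
        delta = alpha * (target - value_function.value(state))
        value_function.update(state, delta)
        if done:
            break
        state = next_state
```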
Learning Result
Three different value functions were applied; let's compare the learning results. (full implementation)

For the state aggregation function with 1-step TD, the 1000 states are aggregated into 10 groups of 100 states each, which is why you see a staircase-like graph.

This is the learning graph for the polynomial function of order 5 with 1-step TD, in which we see divergence from the true values at the lower states and convergence at the higher states. This is because a polynomial approximation always has an intercept term, which makes the value of state 0 non-zero (in general, polynomial bases are not recommended for online learning).

This is so far the best approximation; in general, the Fourier basis performs better than the polynomial basis.
Conclusion
Here we have learned the basic theory of reinforcement learning in continuous space and implemented some basic approximation functions. Of course, the idea is not limited to these functions; in more advanced settings, neural networks can be leveraged as the approximator, which turns the problem into deep reinforcement learning.
In the next post, I will introduce tile coding and apply it to Q-function learning in continuous state space, which is more general than the random walk example.
Reference:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.