So far we have been through many Reinforcement Learning examples, from on-policy to off-policy and from discrete state space to continuous state space. These examples vary in many ways, but you might have noticed that they share at least one trait: they are all episodic. Each has a clear starting point and ending point, and whenever the agent reaches the goal it starts over, again and again, until a certain number of episodes has been run. In this article we will extend the idea to non-episodic tasks, that is, tasks that have no clear ending point and in which the agent goes on forever in the environment. Specifically, we will
- Learn the idea and the algorithm applied to non-episodic tasks
- Implement the server access example (tile coding is required)

Average Reward
The main concept applied to non-episodic tasks is the average reward. Like the discounted setting, the average reward setting applies to continuing problems, problems for which the interaction between agent and environment goes on forever without termination or start states. Unlike the discounted setting, however, there is no discounting: the agent cares just as much about delayed rewards as it does about immediate rewards. So the key points are
- There is no discount factor under this setting
- Average reward will be introduced to the algorithm
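For reference, here is a brief sketch of the formal definition from Sutton and Barto: the quality of a policy π is measured by its average reward r(π), and returns are then measured relative to it (the so-called differential return).

```latex
% Average reward of a policy \pi
r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h}
    \mathbb{E}\!\left[ R_t \mid S_0,\, A_{0:t-1} \sim \pi \right]

% Differential return: rewards measured as differences from the average reward
G_t \doteq (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \cdots
```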
Let's dive directly into the algorithm, and I will explain why it is built this way.
Differential semi-gradient Sarsa

Firstly, the algorithm applies to continuous state space, and in fact the agent's exploration process and weight-update process are the same as in the other algorithms we talked about before; the difference lies in the average reward term introduced here.
The key difference is the definition of δ. In a continuing task the TD error is defined as

δ = (R - R(average)) + Q(S', A') - Q(S, A),

where R - R(average) is the difference between the reward received at the current step and the average reward estimated so far. Here is an intuitive explanation of why the average reward is introduced. Recall that R + Q(S', A') is an estimate of Q(S, A), which in the episodic setting equals the total reward the agent is going to collect all the way to the end of the game. However, in a continuing task the game never ends, so the collected reward could go to infinity, and a term is needed to restrain the estimated value. That term is the average reward!
As the steps go on, the average reward needs to be updated as well (note that β is the learning rate dedicated to the average reward update). With the average reward setting, it is like giving the agent a baseline: only when the agent takes an action whose reward is above the average reward can the weights be updated positively, otherwise they are updated negatively (disregarding the Q value of the next state). If you are still not convinced by this explanation and wonder what would happen without the average reward term, I will show you the learning result of the server access example without it at the end of the post.
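Putting the pieces together, one step of the update looks roughly like the following sketch, where q_hat is the approximate value function, grad_q_hat its gradient with respect to the weights, and alpha and beta are the two step sizes (placeholder names, not the exact ones used in the implementation below):

```python
# One step of differential semi-gradient Sarsa (sketch).
# R: reward just received; S, A: current state and action; S_, A_: next state and action.
delta = R - avg_reward + q_hat(S_, A_, w) - q_hat(S, A, w)   # differential TD error
avg_reward = avg_reward + beta * delta                       # update the average-reward estimate
w = w + alpha * delta * grad_q_hat(S, A, w)                   # semi-gradient weight update
```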
Server Access Implementation
Now let's apply the algorithm to a concrete continuing task example taken from Sutton's book, called the Access-Control Queuing Task.
Problem Description
This is a decision task involving access control to a set of 10 servers. Customers of four different priorities arrive at a single queue. If given access to a server, the customers pay a reward of 1, 2, 4, or 8 to the server, depending on their priority, with higher priority customers paying more. In each time step, the customer at the head of the queue is either accepted (assigned to one of the servers) or rejected (removed from the queue, with a reward of zero). In either case, on the next time step the next customer in the queue is considered. The queue never empties, and the priorities of the customers in the queue are equally randomly distributed. Of course a customer cannot be served if there is no free server; the customer is always rejected in this case. Each busy server becomes free with probability p = 0.06
on each time step. The task is to decide on each step whether to accept or reject the next customer, on the basis of its priority and the number of free servers, so as to maximize long-term reward without discounting.
To recap, our agent needs to decide whether to accept a customer based on the customer's priority and the current number of free servers, in order to gain the maximum long-term reward. This setting is definitely a continuing task, as the process never ends because the waiting queue is never empty. The state is (numberOfFreeServers, customerPriority), the action is either reject or accept, and the reward is 1, 2, 4 or 8 based on the customer's priority. Now that you are clear on the rules, let's go ahead and implement it. (Full implementation)
The Value Function
This is a discrete state task, but as we have already talked intensively about applying tile coding to continuous states, we are going to apply it again to represent the state space even though it is discrete (the tile coding function we apply here is discussed here).
This value-function class is in effect a Q value class, inside which we set 8 tilings and 2048 grids in total. Action 1 stands for accept and 0 for reject. The value function gives the value based on the current state (n_server, priority) and action, and the update function updates the weights accordingly. The stateValue function returns the maximum value of the state (this function will only be used in the visualisation part); note that when the number of free servers is 0, it returns the value of action 0 (always reject).
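The exact code is in the full implementation; as a rough sketch of what such a class could look like, assuming Sutton's tiles3 tile-coding helpers (IHT and tiles) and the hyper-parameters mentioned above:

```python
import numpy as np
from tiles3 import IHT, tiles  # Sutton's tile-coding software (assumed available)

class ValueFunction:
    def __init__(self, alpha=0.01, num_tilings=8, max_size=2048):
        self.iht = IHT(max_size)               # hash table holding 2048 tiles in total
        self.num_tilings = num_tilings
        self.weights = np.zeros(max_size)
        self.alpha = alpha / num_tilings       # the step size is split across the tilings

    def _active_tiles(self, n_server, priority, action):
        # The state components are already small integers, so they are passed to the tiler directly.
        return tiles(self.iht, self.num_tilings, [n_server, priority], [action])

    def value(self, n_server, priority, action):
        return np.sum(self.weights[self._active_tiles(n_server, priority, action)])

    def update(self, n_server, priority, action, delta):
        # Semi-gradient update: the gradient w.r.t. each active tile's weight is 1.
        for tile in self._active_tiles(n_server, priority, action):
            self.weights[tile] += self.alpha * delta

    def stateValue(self, n_server, priority):
        # Used only for visualisation; with no free server the agent can only reject.
        if n_server == 0:
            return self.value(n_server, priority, 0)
        return max(self.value(n_server, priority, a) for a in (0, 1))
```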
Now let’s get into the main class:
Initialisation
All these initialisations are self-evident, and for each server the probability of becoming free is 0.06.
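Continuing the sketch above, a possible initialisation of the main class (with illustrative attribute names that may differ from the full implementation) could be:

```python
import numpy as np

class ServerAccess:
    # A sketch of the main class; names are illustrative, not the exact ones
    # used in the linked full implementation.
    def __init__(self, num_servers=10, alpha=0.01, beta=0.01, exp_rate=0.1):
        self.num_servers = num_servers
        self.free_prob = 0.06                   # each busy server frees up with p = 0.06
        self.priorities = [1, 2, 4, 8]          # rewards paid by the four priorities
        self.actions = [0, 1]                   # 0 = reject, 1 = accept
        self.beta = beta                        # step size for the average-reward update
        self.exp_rate = exp_rate                # epsilon of the epsilon-greedy policy
        self.avg_reward = 0.0
        self.valueFunc = ValueFunction(alpha=alpha)
        # state = (number of free servers, priority of the customer at the head of the queue)
        self.state = (num_servers, int(np.random.choice(self.priorities)))
```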
Action Selection
The numFreeServers function gives the current number of free servers based on the current state. The chooseAction function chooses an action based on the current state and the value function we discussed above; as before, the action selection method is ϵ-greedy.
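A sketch of these two methods, continuing the class above, might look like this:

```python
    # ServerAccess methods (continued from the sketch above)

    def numFreeServers(self):
        # The first component of the state tracks how many servers are currently free.
        return self.state[0]

    def chooseAction(self):
        n_free, priority = self.state
        if n_free == 0:
            return 0                                  # no free server: the customer must be rejected
        if np.random.uniform(0, 1) < self.exp_rate:
            return int(np.random.choice(self.actions))  # explore
        # Exploit: pick the action with the larger estimated value.
        values = [self.valueFunc.value(n_free, priority, a) for a in self.actions]
        return int(np.argmax(values))
```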
Next State and Reward
After the agent takes an action, we need to determine the next state: the number of free servers decreases by 1 if the action was accept and stays the same if the action was reject (while each busy server may become free again with probability 0.06, as described above). The next customer's priority is randomly generated.
The giveReward function simply gives the reward based on the customer's priority. In the case of the reject action, the reward is always 0.
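Here is how these two pieces could be sketched, again with illustrative names; note that the sketch lets each busy server become free with probability 0.06, as in the problem statement:

```python
    # ServerAccess methods (continued)

    def nextState(self, action):
        n_free, _ = self.state
        if action == 1:
            n_free -= 1                                 # a server is assigned to the customer
        # Each busy server becomes free again with probability 0.06.
        n_busy = self.num_servers - n_free
        n_free += int(np.sum(np.random.uniform(size=n_busy) < self.free_prob))
        # The next customer's priority is drawn uniformly at random.
        priority = int(np.random.choice(self.priorities))
        return (n_free, priority)

    def giveReward(self, action):
        # An accepted customer pays its priority; a rejected one pays nothing.
        return self.state[1] if action == 1 else 0
```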
Rolling the Game
In the final run function, we assemble the whole process and let the agent learn. The function is straightforward, as everything follows exactly the steps of the algorithm. (Full implementation)
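For illustration, a sketch of such a run loop, built on the pieces above, could look like the following; the actual code in the full implementation may differ in details:

```python
    # ServerAccess method (continued): the main learning loop.

    def run(self, rounds=50000):
        self.state = (self.num_servers, int(np.random.choice(self.priorities)))
        action = self.chooseAction()
        for _ in range(rounds):
            reward = self.giveReward(action)
            next_state = self.nextState(action)

            # Sarsa is on-policy: pick A' from S' before computing the TD error.
            prev_state, self.state = self.state, next_state
            next_action = self.chooseAction()

            # Differential semi-gradient Sarsa update.
            delta = (reward - self.avg_reward
                     + self.valueFunc.value(*next_state, next_action)
                     - self.valueFunc.value(*prev_state, action))
            self.avg_reward += self.beta * delta
            self.valueFunc.update(*prev_state, action, delta)

            action = next_action
```

With the defaults in the sketch (α = 0.01, β = 0.01, exp_rate = 0.1, 50,000 rounds), the settings correspond to those used for the result shown in the next section.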
Result
Now let’s see how our agent performs:

This is the result after 50,000 rounds with α=0.01, β=0.01 and exp_rate=0.1. The drop on the right of the graph is probably due to insufficient data; many of these states were never experienced.
Lastly, let me show you what would happen if we go without the average reward term.

The state values are way larger than the actual values, and in fact, the more rounds we run, the larger the state values become, as δ is no longer properly bounded.
Reference:
Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction, 2nd edition (the average reward setting, differential semi-gradient Sarsa and the Access-Control Queuing Task are covered in Chapter 10).