SD-WAN Link Switch as a Reinforcement Learning Experiment with Deep Q-Learning

Amitabha Chakravarty
Towards Data Science
10 min read · Dec 21, 2018


Credit — https://usblogs.pwc.com/emerging-technology/deep-learning-ai/

‘Deep Q’ or Deep Q-Learning is a well-known reinforcement learning algorithm which approximates the Q-value of an MDP with a deep neural network. In this article I explore this algorithm for solving the link-switch problem in an SD-WAN network, for which I have already developed an AI gym based on Mininet (see my previous article, Open-AI gym for SD-WAN Link Selection). While applying the Deep Q algorithm I found that the Mininet-based gym takes a considerable amount of time performing measurements, because it sends real TCP/UDP traffic. This led me to the idea of a purely statistical back-end which simulates the traffic measurement by generating Gaussian random numbers with a particular mean and variance. The algorithm is then tuned over thousands of episodes and cross-checked at the end against the actual Mininet back-end. This experiment presents reasonable evidence that the SD-WAN link-switch problem can be effectively solved by reinforcement learning.

A few words about the Deep Q-Learning implementation

Deep Q-Learning came out of DeepMind’s research team (see the groundbreaking paper Playing Atari with Deep Reinforcement Learning by Mnih et al.). There were a couple of options for implementing this algorithm. I could implement a ‘double Q network’ or a ‘single Q network’, along with ‘experience replay’. I could also use a dueling architecture (Dueling Network Architectures for Deep Reinforcement Learning by Wang et al.). With many baselines around (open-ai, Denny Britz, Keon), I took inspiration from Greg. His was a simple implementation with only experience replay and a single network, written as a baseline for solving the Cartpole problem. Cartpole resembles my problem in having discrete actions and an infinite, continuous state space: in my case, sending traffic down one of two links and observing what the throughput turns out to be. The agent part was very similar; only the Cartpole AI gym had to be replaced with the SD-WAN specific AI gym I had already developed (see my previous article).

The deep neural net model for approximating the Q value had two dense layers followed by a linear activation -
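A minimal Keras sketch of such a model (the hidden-layer width of 24 units and the variable names are assumptions; only the two dense layers and the linear output come from the description above) -

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

OBSERVATION_SPACE = 3   # (action_chosen, current_bandwidth, available_bandwidth)
ACTION_SPACE = 2        # MPLS or INTERNET

# Two dense hidden layers, then a linear output giving one Q value per action
model = Sequential()
model.add(Dense(24, input_shape=(OBSERVATION_SPACE,), activation="relu"))
model.add(Dense(24, activation="relu"))
model.add(Dense(ACTION_SPACE, activation="linear"))
model.compile(loss="mse", optimizer=Adam(lr=0.001))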

The observation space was a tuple consisting of three parameters –

(action_chosen, current_bandwidth, available_bandwidth)

The action space had two values — 0 and 1 for selecting either MPLS or INTERNET link.
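In Gym terms the two spaces can be sketched roughly as follows (the 0/1-to-link mapping and the bandwidth bounds are illustrative; the actual SD-WAN gym defines its own) -

import numpy as np
from gym import spaces

# Two discrete actions: one selects the MPLS link, the other the INTERNET link
action_space = spaces.Discrete(2)

# Observation: (action_chosen, current_bandwidth, available_bandwidth), bandwidths in Mbps
observation_space = spaces.Box(
    low=np.array([0.0, 0.0, 0.0]),
    high=np.array([1.0, 10.0, 10.0]),
    dtype=np.float32,
)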

The parameters for Deep Q-Learning are as follows –

GAMMA = 0.95
LEARNING_RATE = 0.001
MEMORY_SIZE = 1000000
BATCH_SIZE = 20
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
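Putting these together, the agent follows the familiar epsilon-greedy plus experience-replay pattern of the Cartpole baseline. The condensed sketch below uses the hyperparameters and model above; the class and method names are illustrative -

import random
from collections import deque
import numpy as np

class DQNAgent:
    # Single Q network with experience replay; no separate target network
    def __init__(self, model, action_space):
        self.model = model                        # Keras model sketched earlier
        self.action_space = action_space
        self.memory = deque(maxlen=MEMORY_SIZE)   # replay buffer
        self.exploration_rate = EXPLORATION_MAX

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy action selection
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        return int(np.argmax(self.model.predict(state)[0]))

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            return
        for state, action, reward, next_state, done in random.sample(self.memory, BATCH_SIZE):
            target = reward
            if not done:
                # Bootstrapped target from the same network
                target = reward + GAMMA * np.amax(self.model.predict(next_state)[0])
            q_values = self.model.predict(state)
            q_values[0][action] = target
            self.model.fit(state, q_values, verbose=0)
        # Anneal exploration after each replay step
        self.exploration_rate = max(EXPLORATION_MIN,
                                    self.exploration_rate * EXPLORATION_DECAY)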

Recap — Downward trend in episode score

As mentioned in my previous article, the cumulative reward in a single episode was trending downward, as seen below -

It was also evident from multiple episodes that the average reward over random trials showed no improvement.

As seen here, the score fluctuated around zero. There was no upward trend, which was expected, since we had not applied Deep Q-Learning yet. However, this time, while applying the Deep Q-Learning algorithm, I was faced with a new kind of hurdle: running the experiment over many episodes took a very long time, because finishing even one episode was slow. I was using Mininet to simulate the SD-WAN environment, and it took a really long time to finish just one episode of at most 30 ‘time ticks’.

Lengthy execution of Mininet-based episodes

With the Mininet back-end, each episode took a considerable amount of time, even with just 30 time ticks. At each ‘time tick’, traffic was sent on the INTERNET link and then a measurement was performed. An experiment of 100 episodes easily took more than 12 hours on a cloud virtual machine, and there was no visible trend in the performance of Deep Q-Learning -

Deep Q-Learning is known to require a considerable amount of training, and with just 100 episodes it was not conclusive how the training was progressing.

Statistical back-end

The idea was simple. Since sending actual traffic over Mininet was taking time, why not generate the measurement according to some statistical distribution instead? In the Mininet experiment I was generating the traffic with some distribution anyway. Why not apply the same distribution to generate the measurement without actually sending the traffic? With this idea I created a statistical back-end for the gym which generated measurements from a normal distribution with a certain mean (µ) and standard deviation (σ), with no traffic going over the links.

Previously we actually generated internet-link traffic with a certain mean (µ) and standard deviation (σ) and then sent it down the Mininet links. Now we did not generate any real traffic, but output the measurement directly as a number drawn from a normal distribution with the same µ and σ. This number represented the amount of external internet traffic occupying the link. Since the link capacity was fixed at a certain value, we obtained the throughput of our traffic by subtracting this external traffic bandwidth from the link capacity. I also sprinkled in a little randomness while creating the measurement of this ‘bandwidth achieved’. This was done by drawing a number from a narrow normal distribution with µ equal to our ‘bandwidth achieved’ but with a low σ (0.5 Mbps).
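A minimal sketch of this measurement generation (the 10 Mbps capacity, 5 ± 2 Mbps external traffic and 0.5 Mbps noise figures come from this article; the function and variable names are illustrative) -

import numpy as np

LINK_CAPACITY = 10.0            # Mbps, internet link capacity
EXT_MU, EXT_SIGMA = 5.0, 2.0    # external traffic distribution used by the generator
NOISE_SIGMA = 0.5               # Mbps, narrow noise on the reported measurement

def simulated_measurement():
    # Draw the external internet traffic instead of actually sending it
    external = float(np.clip(np.random.normal(EXT_MU, EXT_SIGMA), 0.0, LINK_CAPACITY))
    # Whatever capacity is left over is the bandwidth our traffic achieves
    achieved = LINK_CAPACITY - external
    # Report it with a little measurement noise around the achieved value
    return float(np.random.normal(achieved, NOISE_SIGMA))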

Shorter Episodes with statistical back-end

I ran the same training measurements with the deep Q agent as well as random trials. Now the episodes took much less time, but learning was still not evident.

Here is the score of the random agent over 100 episodes. You can see the mean score was negative -

The following is the result of training the deep Q network for 100 episodes -

You can see the mean score was well below zero.

With 1000 episodes (below) it was very clear that the training was failing. There was no upward trend; the episodes were stuck at a low (negative) cumulative score -

Corrections in Reward Design

It was apparent that the problem was too hard for the agent to train on and solve. As I thought it through, there were multiple factors responsible for the agent’s low rewards –

1) Tight bandwidth allocation

I had set the link capacity to 10 Mbps and sent 5 Mbps of UDP traffic over it (with ±2 Mbps deviation). That left around 5 Mbps for our intended traffic. However, I had set the SLA limit to 6 Mbps, meaning that whenever ‘bandwidth achieved’ fell below 6 Mbps we should have done a link switch. With so little headroom in this tight arrangement, the agent had no choice but to switch links every now and then, and that ultimately caused a lower cumulative score. So the solution was a fairer allocation of bandwidth. I kept the SLA at 6 Mbps, but this time gave 4 Mbps to the other, general traffic. That left about 6 Mbps for the ‘intended traffic’, so the SLA could now be met more often.
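A quick back-of-the-envelope check shows the difference in headroom between the two allocations -

CAPACITY, SLA = 10.0, 6.0   # Mbps

for label, ext_mean in [("old allocation", 5.0), ("new allocation", 4.0)]:
    # External traffic varies roughly +/- 2 Mbps around its mean
    low = CAPACITY - (ext_mean + 2.0)
    high = CAPACITY - (ext_mean - 2.0)
    print(f"{label}: achievable bandwidth roughly {low}-{high} Mbps vs SLA of {SLA} Mbps")

With the old allocation the achievable bandwidth swings between roughly 3 and 7 Mbps, so a 6 Mbps SLA is missed much of the time; with the new allocation the swing is roughly 4 to 8 Mbps.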

2) Reward design skewed for failure

The previous design added +1 for surviving a ‘time tick’. Whenever ‘bandwidth achieved’ failed to meet the SLA limit, the penalty was -5. For taking the MPLS link the penalty was -2. However, for taking the correct action, i.e. taking the internet link when bandwidth was available, only a +1 reward was given. So, together with the default +1 for surviving, the total reward for a correct action was +2 and for a wrong action it was -1, while whenever the SLA was missed the total reward for that ‘time tick’ was -4. Since the tight bandwidth budget made the agent miss the SLA target frequently, the overall score dropped rapidly.

There was also one more rule that could end an episode early: if the SLA was missed twice in a row, the episode was brought to an abrupt end. However, this was not reflected in the reward. This, too, was fixed in the new logic.

New Logic

The new design was based on the following principles –

· Encourage surviving the ‘time tick’

· Encourage the correct link choice, with more reward this time

· Discourage the wrong link choice, but do not punish it as much as before

· Discourage an SLA miss, but not with the maximum penalty

· Heavily penalize ending the episode early

Here is the implementation of the new logic (see the code sketch below) -
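A sketch of these rules in code (the flag names are illustrative, not the gym’s actual variables) -

def step_reward(chose_mpls, sla_missed, episode_ended_early):
    reward = 1                 # encourage surviving the time tick
    if chose_mpls:
        reward -= 1            # discourage falling back to the MPLS link
    elif not sla_missed:
        reward += 2            # encourage the correct (internet) choice
    if sla_missed:
        reward -= 2            # discourage an SLA miss, short of the old maximum penalty
    if episode_ended_early:
        reward -= 5            # penalize ending the episode early heavily
    return reward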

So, this time we deducted 1 instead of 2 whenever the MPLS link was chosen. There was also an increased reward of +2 whenever the correct link was chosen. An ‘SLA miss’ was punished with -2, not -5 as before. The reward of +1 for surviving a tick was still there. However, we now deducted the maximum (-5) for finishing the episode early. This time, the goal of not finishing the episode early was rolled into the reward design.

3) Short episode length

Episode length can influence the reward outcome and hence the learning process. Previously the episode length was only 30, kept low because of the slow speed of the experiment with the actual Mininet back-end. With 30 ‘time ticks’ per episode, a total of 100 episodes took around 12 hours on a cloud virtual machine. This was a major obstacle to repeating episodes thousands of times. As deep Q-learning is known to take many episodes to warm up, it was not clear whether any learning was taking place at all. With the statistical back-end I was able to raise the episode length to 300 (30 was a really low number for an episode), and the episodes still finished much more quickly. This time I ran the experiments for thousands of episodes to get a real picture of the learning. The skewed reward design became apparent only after I switched to the statistical back-end.

Training

After correcting the reward design and applying the statistical back-end, I started training the agent again. The random agent on the statistical back-end did not show any learning, as expected. However, this time the average score per episode was positive.

The Deep Q agent, on the other hand, showed a slight upward trend very early on, even within 100 episodes.

The warm-up happened around 500 episodes, as seen below -

The sudden jump is obvious when we zoom in on the 400–600 episode range -

Validation

After I ran the training for 1000 episodes with the statistical back-end, it was clear that the agent was consistently gaining on each episode. Now it was time to verify this learning on the Mininet back-end. Would the learning from the statistical back-end still be valid for Mininet, where we actually send traffic to determine the throughput?

I ran the deep Q agent with the Mininet back-end, but this time with the neural-net weights obtained from 1000 episodes of training with the statistical back-end. The gain in score was very clear when I took a close look at a single episode -

Single Episode comparison after training

One can see the clear gain of the deep Q agent compared to the random and deterministic agents.

It is also very clear from the following two plots that the deep Q agent made very few link switches compared to the random agent –

Random agent one episode details
Deep Q agent play for one episode

Finally, the plots were compared over 100 episodes, and it was clear that the average return of the deep Q agent was far higher than that of the random agent for the same setup with the Mininet back-end -

100 episode play random vs deep Q agent
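As a reminder of how the validation run was wired up: the weights learned against the statistical back-end were loaded into the same network, and the agent acted greedily against the Mininet gym. A minimal sketch, assuming an illustrative weights file name and environment id, and that model refers to the Keras network sketched earlier -

import gym
import numpy as np
import gym_sdwan            # package registering the Mininet-backed gym (import name assumed)

model.load_weights("dqn_sdwan_stat.h5")   # weights saved after 1000 statistical episodes

env = gym.make("Sdwan-v0")                # Mininet-backed environment (id assumed)
state = np.reshape(env.reset(), (1, -1))
total_reward, done = 0.0, False
while not done:
    action = int(np.argmax(model.predict(state)[0]))   # greedy policy, no exploration
    next_state, reward, done, _ = env.step(action)
    state = np.reshape(next_state, (1, -1))
    total_reward += reward
print("Episode score:", total_reward)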

Conclusion

It was very clear that the deep Q agent learnt when to switch links in the SD-WAN setup and actually solved the link-switch problem. It was a very simple agent with just a single Q network and experience replay. The results could improve with a more sophisticated approach such as a double Q network, a dueling double Q network, or even policy-gradient methods like actor-critic (see Sutton and Barto) or A3C. The goal of this experiment was not to achieve the maximum possible reward, but to show that reinforcement learning, when applied to the SD-WAN link-switch problem, can solve it with reasonable efficiency. One can improve on this work with more advanced algorithms and a more realistic observation space, adding QoS parameters like jitter, delay and so on.

Code

The code for the above experiment has been checked into GitHub (https://github.com/amitnilams/sdwan-gym, https://github.com/amitnilams/sdwan-gym-stat, https://github.com/amitnilams/sdwan-rl).

References

1) My previous article — Open-AI gym for SD-WAN Link Selection

2) Playing Atari with Deep Reinforcement Learning by Mnih et al.

3) Dueling Network Architectures for Deep Reinforcement Learning by Wang et al.

4) open-ai baselines

5) DQN implementation by Denny Britz

6) Deep Q implementation by Keon

7) Cartpole implementation by Greg.

8) Policy Gradient Methods for Reinforcement Learning with Function Approximation by Richard S. Sutton et al.

9) Asynchronous Methods for Deep Reinforcement Learning by Mnih et al.

10) Reinforcement Learning: An Introduction 2nd Edition by Sutton and Barto

11) Deep Reinforcement Learning with Double Q-learning by Hado van Hasselt et al.
