AI Learning to Land a Rocket (Lunar Lander) | Reinforcement Learning

See an artificial intelligence agent learn from its mistakes until it smoothly lands a rocket.

Ashish Gupta
Towards Data Science


Image Credits: NASA

In this article, we will cover a brief introduction to Reinforcement Learning and solve the “Lunar Lander” environment in OpenAI Gym by training a Deep Q-Network (DQN) agent.

We will see how this AI agent initially knows nothing about how to control and land a rocket, but with time it learns from its mistakes, starts to improve its performance, and in the end learns to fully control the rocket and land it perfectly.

Reinforcement Learning is a massive topic and we are not going to cover everything here in detail. Instead, this article aims to get our hands dirty with some practical examples of reinforcement learning and show the implementation of RL in solving real-world use cases.

We will discuss the rationale behind using a DQN and will cover experience replay and the exploration-exploitation dilemma encountered while training the neural network. Finally, we will discuss the agent’s training and testing performance and the effect of hyperparameters on the agent’s performance.

What is Reinforcement Learning?

Reinforcement learning is one of the most discussed, followed, and contemplated topics in artificial intelligence (AI) as it has the potential to transform most businesses.

At the core of reinforcement learning is the concept that optimal behavior or action is reinforced by a positive reward. It is similar to toddlers learning how to walk: they adjust their actions based on the outcomes they experience, such as taking a smaller step if the previous broad step made them fall. Machines and AI agents use reinforcement learning algorithms to determine the ideal behavior based upon feedback from the environment. It is a form of machine learning and therefore a branch of artificial intelligence.

Reinforcement Learning in action

An example of reinforcement learning in action is AlphaGo Zero, which made headlines in 2017. AlphaGo is a bot developed by DeepMind that leveraged reinforcement learning and defeated a world champion at the ancient Chinese game of Go. It was the first time an artificial intelligence (AI) defeated a professional Go player. Go is considered much more difficult for computers to win than other games such as chess because its much larger branching factor makes it prohibitively difficult to use traditional AI methods such as alpha-beta pruning, tree traversal, and heuristic search.

Lunar Lander Environment

We are using the ‘Lunar Lander’ environment from OpenAI Gym. This environment deals with the problem of landing a lander on a landing pad. The steps to set up this environment are mentioned on the OpenAI Gym GitHub page and in OpenAI’s documentation. The following environment variables briefly describe the environment we are working in (a short code sketch follows the list).

  1. State: The state/observation is just the current state of the environment. There is an 8-dimensional continuous state space and a discrete action space.
  2. Action: For each state of the environment, the agent takes an action based on its current state. The agent can choose to take action from four discrete actions: do_nothing, fire_left_engine, fire_right_engine, and fire_main_engine.
  3. Reward: The agent receives a small negative reward every time it acts. This is done in an attempt to teach the agent to land the rocket as quickly and efficiently as possible. If the lander crashes or comes to rest, the episode is considered complete and the agent receives an additional -100 or +100 points depending on the outcome.
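To make the environment concrete, here is a minimal sketch of creating and probing it, assuming the classic Gym API and the LunarLander-v2 environment ID (newer Gymnasium releases use a slightly different ID and step/reset signature):

```python
import gym

# Create the Lunar Lander environment (classic Gym API assumed here).
env = gym.make("LunarLander-v2")

print(env.observation_space)  # Box(8,)     -> 8-dimensional continuous state
print(env.action_space)       # Discrete(4) -> do nothing, fire left, main, or right engine

# One episode with purely random actions, to see the agent's starting point.
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()             # random action
    state, reward, done, info = env.step(action)   # classic 4-value step API
    total_reward += reward
print("Episode reward with random actions:", total_reward)
```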

DQN Algorithm

The deep Q-learning algorithm that includes experience replay and ϵ-greedy exploration is as follows:
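The sketch below outlines that loop in Python: ε-greedy action selection, a replay buffer of past transitions, and minibatch updates of the Q-network toward bootstrapped targets. The names here (a Keras-style q_network with predict/fit, a replay_buffer supporting append, the classic Gym step API) are illustrative assumptions, not the exact code used for the experiments.

```python
import random
import numpy as np

def train_dqn(env, q_network, replay_buffer, max_episodes=2000, batch_size=64,
              gamma=0.99, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Sketch of the DQN loop: epsilon-greedy exploration plus experience replay."""
    for episode in range(max_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(q_network.predict(state[np.newaxis], verbose=0)[0]))

            next_state, reward, done, _ = env.step(action)
            replay_buffer.append((state, action, reward, next_state, done))
            state = next_state

            # Learn from a random minibatch of stored experience.
            if len(replay_buffer) >= batch_size:
                batch = random.sample(replay_buffer, batch_size)
                states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
                next_q = q_network.predict(next_states, verbose=0).max(axis=1)
                targets = q_network.predict(states, verbose=0)
                targets[np.arange(batch_size), actions] = rewards + gamma * next_q * (1 - dones)
                q_network.fit(states, targets, epochs=1, verbose=0)

        # Decay exploration after each episode, shifting from exploration to exploitation.
        epsilon = max(epsilon * epsilon_decay, epsilon_min)
```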

Model Training and Hyperparameter Selection

For running the complete experiment in the ‘Lunar Lander’ environment, we will first train a benchmark model and then run more experiments to find out how changing the hyperparameters affects model performance.

Initially, as we can see below, the agent is very bad at landing: it takes random actions to control the rocket and tries to land it. It fails most of the time and receives negative rewards for crashing the rocket.

For training the model, there is no rule of thumb for how many hidden layers a neural network needs. I conducted different experiments to try different combinations of node sizes for the input and hidden layers. The following benchmark model was finalized based on parameters like training time, number of episodes required for training, and trained model performance (a rough code sketch follows the list).

  • Input layer: 512 nodes, observation_space count as input_dim and ‘relu’ activation function
  • Hidden layer: 1 layer, 256 nodes with ‘relu’ activation function
  • Output layer: 4 nodes, with ‘linear’ activation function
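Assuming a TensorFlow/Keras implementation, a rough sketch of this benchmark network looks like the following (the actual training code may differ in details):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_benchmark_model(observation_space=8, action_space=4, learning_rate=0.001):
    """Minimal sketch of the benchmark architecture described above."""
    model = Sequential([
        Dense(512, input_dim=observation_space, activation="relu"),  # input layer
        Dense(256, activation="relu"),                               # hidden layer
        Dense(action_space, activation="linear"),                    # one Q-value per action
    ])
    # Mean squared error between predicted and target Q-values is a common choice.
    model.compile(loss="mse", optimizer=Adam(learning_rate=learning_rate))
    return model
```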

Still, this model was sometimes diverging after reaching an average reward of 170, and it took more than 1,000 episodes for this divergence to appear. I figured that this behavior might be attributed to overtraining of the model and implemented ‘Early Stopping’, the practice of stopping a neural network from overtraining. To implement this, I skip training the model in a given episode if the average of the last 10 rewards is more than 180.

The replay buffer capacity is set to 500,000 to avoid it overflowing as experience tuples accumulate. The model is trained for a maximum of 2,000 episodes, and the stopping criterion for the trained model is an average reward of 200 over the last 100 episodes.
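Putting these training controls together, an illustrative sketch (with run_episode as a placeholder for one episode of the DQN loop above) could look like this:

```python
from collections import deque

import numpy as np

def train_with_controls(run_episode, buffer_size=500_000, max_episodes=2000):
    """Training controls sketched above; run_episode plays one episode and returns its total reward."""
    replay_buffer = deque(maxlen=buffer_size)  # bounded buffer: oldest experience is dropped first
    rewards = []
    for episode in range(max_episodes):
        # 'Early stopping': skip gradient updates if the last 10 episodes already average above 180.
        skip_training = len(rewards) >= 10 and np.mean(rewards[-10:]) > 180
        rewards.append(run_episode(replay_buffer, train=not skip_training))

        # Stopping criterion: average reward of at least 200 over the last 100 episodes.
        if len(rewards) >= 100 and np.mean(rewards[-100:]) >= 200:
            print(f"Environment solved after {episode + 1} episodes")
            break
    return rewards
```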

The final benchmark model has the following hyperparameters (collected into a small config sketch after the list):

  • learning_rate = 0.001
  • gamma = 0.99
  • replay_memory_buffer_size = 500000
  • epsilon_decay = 0.995
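Collected in one place, the benchmark configuration can be written as a simple dictionary (an illustrative convenience; the original code may pass these values differently):

```python
# Benchmark hyperparameters from the list above, gathered into a single config.
BENCHMARK_CONFIG = {
    "learning_rate": 0.001,                # optimizer step size for the Q-network
    "gamma": 0.99,                         # discount factor for future rewards
    "replay_memory_buffer_size": 500_000,  # capacity of the experience replay buffer
    "epsilon_decay": 0.995,                # multiplicative decay applied after each episode
}
```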

After around 300 training episodes, the agent starts learning how to control and land the rocket.

After about 600 episodes the agent is fully trained. It has learned to handle the rocket and lands it perfectly each time.

Result Analysis

Figure: Reward per training episode

The above figure shows the reward values per episode during training. The blue line denotes the reward for each training episode and the orange line shows the rolling mean over the last 100 episodes. The agent keeps learning over time, and the value of the rolling mean increases with the training episodes.

The average reward in the earlier episodes is mostly negative because the agent has just started learning. Eventually, the agent starts performing relatively better, and the average reward goes up and becomes positive after 300 episodes. After 514 episodes the rolling mean crosses 200 and the training concludes. There are still a couple of episodes where the agent receives negative rewards, but I believe that if the agent were allowed to continue training, these instances would reduce.

Figure: Reward per testing episode

The above figure shows the performance of the trained model for 100 episodes in the Lunar Lander environment. The trained model is performing well in the environment with all the rewards being positive. The average reward for 100 testing episodes is 205.

Effect of Hyperparameters

A. learning_rate (α)

The learning rate is set between 0 and 1 and defines how much we accept the new value versus the old value. The scaled update gets added to our previous Q-value, which essentially moves it in the direction of the latest estimate. Setting it to 0 means that the Q-values are never updated, hence nothing is learned. Setting a high value such as 0.9 means that learning can occur quickly.
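In tabular form this is the familiar Q-learning update, where α scales how far the old estimate moves toward the new target (in the DQN itself, α is the optimizer's learning rate for the network weights). A toy illustration:

```python
# Toy illustration of the role of the learning rate alpha in the Q-value update.
def q_update(q_old, reward, q_next_max, alpha=0.001, gamma=0.99):
    target = reward + gamma * q_next_max        # bootstrapped estimate of the return
    return q_old + alpha * (target - q_old)     # move q_old a fraction alpha toward the target

print(q_update(q_old=0.0, reward=1.0, q_next_max=5.0))             # tiny step with alpha=0.001
print(q_update(q_old=0.0, reward=1.0, q_next_max=5.0, alpha=0.9))  # large step: learning occurs quickly
```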

For validating the effect of different learning rates on model performance, I trained different agents with different learning rates: 0.0001, 0.001, 0.01, and 0.1. The best performance is observed for the middle value of 0.001; the orange line in figure 3 corresponds to this value and provides the maximum reward. At the higher learning rates the agent is not able to learn and the reward values diverge.

Figure: Rewards per episode for different learning rates

B. Discount factor (γ)

The discount factor determines how much weight is given to future rewards in the value function. A discount factor of γ=0 results in state/action values that represent only the immediate reward, while a higher discount factor such as γ=0.9 results in values that represent the cumulative discounted future reward the agent expects to receive. The figure below shows the variation in model performance for different discount factors. The agent has the best performance for a gamma value of 0.99, which is represented by the blue line.
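A quick toy computation shows how γ weights a stream of future rewards (the reward values here are made up purely for illustration):

```python
# How the discount factor weights a (hypothetical) stream of five rewards of 1.0 each.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]

def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.0))   # 1.0   -> only the immediate reward counts
print(discounted_return(rewards, gamma=0.99))  # ~4.91 -> future rewards count almost as much as the present
```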

C. epsilon (ε) decay

As discussed above, ε is the probability with which we do not take the “greedy” action with the highest Q-value but rather choose a random action. The epsilon (ε) decay is the rate by which this value decreases after each episode. Figure 5 shows the variation in reward values for different values of the epsilon decay. The worst agent performance is observed for an epsilon decay of 0.999, and the best performance is for an epsilon decay of 0.9, shown in red in the same figure. This behavior might be because this value of epsilon decay provides a better balance between exploration and exploitation.
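To see why the decay rate matters, a small calculation shows roughly how many episodes each decay value needs to push ε from 1.0 down below a floor of 0.01 (the floor is an assumption chosen for illustration):

```python
# Episodes needed for epsilon to decay from 1.0 below a small floor,
# when epsilon is multiplied by the decay rate after every episode.
def episodes_until_floor(epsilon_decay, floor=0.01, start=1.0):
    epsilon, episodes = start, 0
    while epsilon > floor:
        epsilon *= epsilon_decay
        episodes += 1
    return episodes

for decay in (0.9, 0.995, 0.999):
    print(f"decay={decay}: ~{episodes_until_floor(decay)} episodes until epsilon < 0.01")
```

With a decay of 0.9 the exploration fades in roughly 44 episodes, with 0.995 in roughly 919, and with 0.999 only after about 4,600 episodes, which is consistent with the 0.999 agent still behaving largely randomly for much of its training.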

There have been many advancements in deep Q-learning since its first introduction. In the next article, I will experiment with more of these advancements, such as Double Q-learning, Prioritized Experience Replay, the Dueling Network Architecture, and extensions to continuous action spaces.

P.S. This is my first article on Medium. Please let me know your views, and please comment if you find any bugs or have ideas for improving the code or the algorithm.
