
Imagine aliens 👽 attacked and you were trying to land a lander 🛸 on the Moon. What factors would you consider to complete the mission successfully?
Here are some considerations:
- Touch down on the landing pad vs. move away from it
- Land at a low velocity vs. crash at a high velocity
- Use as little fuel as possible vs. burn lots of fuel
- Approach the target as fast as possible vs. hang in the air
What to punish? What to reward? How to balance multiple constraints? And how to represent those ideas in our reward function?
Reward Function in Reinforcement Learning
Reinforcement Learning (RL) is a branch of Machine Learning that trains agents through trial and error. In our example, the agent will attempt to land the Lunar Lander, say, 10,000 times, learning to take better actions in different states.
The Reward Function is an incentive mechanism that tells the agent what is correct and what is wrong through rewards and punishments. The goal of an agent in RL is to maximize the total reward, and sometimes that means sacrificing immediate rewards along the way.
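To make the trial-and-error loop concrete, here is a minimal sketch assuming the Gymnasium package with the Box2D extras installed (pip install "gymnasium[box2d]"). The random policy is only a placeholder for a real learning algorithm such as DQN or PPO, and the environment ID may differ between Gymnasium versions.

# A minimal sketch of the trial-and-error loop (not a full training implementation)
import gymnasium as gym

# "LunarLander-v2" in older Gymnasium releases, "LunarLander-v3" in newer ones
env = gym.make("LunarLander-v2")

for episode in range(10_000):  # let the agent attempt ~10k landings
    observation, info = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        # A random action stands in for the agent's learned policy
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward  # the agent learns to maximize this sum
        done = terminated or truncated
    # A real RL algorithm (e.g. DQN or PPO) would update its policy here

env.close()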
The rules in the lunar lander reward function
Some reward and punishment rules for the lunar lander reward function could be:
- Give a high reward for landing on the right spot with a low enough velocity
- Give a penalty if the lander lands outside of the landing pad
- Give a reward based on the percentage of remaining fuel
- Give a big penalty if the lander hits the surface with a velocity above the threshold (a crash)
- Give a distance-based reward to encourage the lander to approach the target
How to represent the rules in Python code

As shown in the code below, the variable fuel_conservation is a value between 0 and 1. When the lander touches down successfully on the landing pad, the reward is multiplied by fuel_conservation to encourage the lander to use as little fuel as possible.
If the lander lands outside of the target spot, we give a small penalty of -10. If the lander crashes at a high velocity, we give a big penalty of -100.
distance_reward = 1 - (distance_to_goal / distance_max)**0.5 uses a power of 0.5 to give the agent a smooth gradient of rewards as the lander gets closer to the landing pad.
def compute_reward(distance_to_goal, prev_distance_to_goal, distance_max,
                   speed, threshold, on_landing_pad,
                   fuel_remaining, total_fuel):
    # Encourage lander to use as little fuel as possible
    # e.g. 0.85, or 0.32
    fuel_conservation = fuel_remaining / total_fuel

    if distance_to_goal < prev_distance_to_goal:  # distance to goal is decreasing
        if speed < threshold:
            if on_landing_pad:
                # Landed successfully; give a big reward
                landing_reward = 100
                # Multiply by the percentage of remaining fuel
                reward = landing_reward * fuel_conservation
            else:
                # Landed outside of the landing pad
                reward = -10
        else:
            # Crashed
            reward = -100
    else:
        # Encourage agents to approach the surface instead of
        # hanging in the air
        distance_reward = 1 - (distance_to_goal / distance_max) ** 0.5
        reward = distance_reward * fuel_conservation

    return reward
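As a quick sanity check, the hypothetical numbers below (all made up for illustration) show how the function behaves in two situations: drifting away from the pad yields only a small distance-shaping reward, while a soft touchdown on the pad with 60% of the fuel remaining yields 100 * 0.6 = 60.

# Hypothetical values for illustration only
# Drifting away from the pad: distance-shaping branch
# 1 - (80 / 100) ** 0.5 ≈ 0.106, scaled by 0.6 fuel remaining ≈ 0.06
print(compute_reward(
    distance_to_goal=80.0, prev_distance_to_goal=75.0, distance_max=100.0,
    speed=2.0, threshold=5.0, on_landing_pad=False,
    fuel_remaining=60.0, total_fuel=100.0,
))

# Soft touchdown on the pad with 60% fuel left: 100 * 0.6 = 60
print(compute_reward(
    distance_to_goal=0.0, prev_distance_to_goal=1.0, distance_max=100.0,
    speed=1.0, threshold=5.0, on_landing_pad=True,
    fuel_remaining=60.0, total_fuel=100.0,
))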
Conclusion
In this article, we used the lunar lander as an example to demonstrate how to build an advanced reward function with reward and punishment rules.
During the training of RL models, the reward function guides agents as they learn from trial and error:
- What should I do? How do I choose between actions?
- Which actions are better for maximizing the total reward?
- How do I evaluate how good or bad an action is in different states?
Happy Landing! Hopefully, aliens will come in PEACE. 👽 ☮️✌️🕊🛸
Sign up for the Udemy course 🦞:
Recommender System With Machine Learning and Statistics
