The world’s leading publication for data science, AI, and ML professionals.

How to Design a Reinforcement Learning Reward Function for a Lunar Lander 🛸

Imagine aliens 👽 attacked and you were trying to land a Lander🛸 on the Moon, what factors would you consider to complete the mission…

Photo credit to NASA; Code by Author

Imagine aliens 👽 attacked and you were trying to land a Lander 🛸 on the Moon. What factors would you consider to complete the mission successfully?

Here are some considerations:

  • Touch down on the landing pad vs Move away from the landing pad
  • Land with a low velocity vs Crash at a high velocity
  • Use as little fuel as possible vs Use lots of fuel
  • Approach the target as fast as possible vs Hang in the air

What to punish? What to reward? How to balance multiple constraints? And how to represent those ideas in our reward function?


Reward Function in Reinforcement Learning

Reinforcement Learning (RL) is a branch of Machine Learning that trains agents through trial and error. In our example, the agent will attempt to land the Lunar Lander, say, 10,000 times, learning to take better actions in different states.
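The trial-and-error loop can be sketched as follows. Note that ToyLanderEnv and the random action choice below are illustrative placeholders, not the real Lunar Lander environment or a learned policy:

```python
import random

class ToyLanderEnv:
    """A hypothetical, drastically simplified stand-in for a lander environment."""

    def reset(self):
        self.altitude = 10
        return self.altitude

    def step(self, action):
        # action 0 = free fall (fast descent), action 1 = fire engine (slow descent)
        self.altitude -= 2 if action == 0 else 1
        done = self.altitude <= 0
        # Toy reward: +100 for an engine-on touchdown, -1 per step otherwise
        reward = 100 if (done and action == 1) else -1
        return self.altitude, reward, done

env = ToyLanderEnv()
for episode in range(3):  # in practice: thousands of episodes
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = random.choice([0, 1])  # placeholder for a learned policy
        state, reward, done = env.step(action)
        total_reward += reward
```

Each episode ends when the lander reaches the surface; over many episodes, a real agent would gradually replace the random choices with actions that earn higher total reward.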

The Reward Function is an incentive mechanism that tells the agent what is correct and what is wrong through rewards and punishments. The goal of an agent in RL is to maximize the total reward; sometimes it must sacrifice immediate rewards to maximize the total reward.

The rules in the reward function of a lunar lander

Some reward and punishment rules for the lunar lander reward function could be:

  • Give a high reward for landing on the right place with low enough velocity
  • Give a penalty if the lander lands outside of the landing pad
  • Give a reward based on the percentage of remaining fuel
  • Give a big penalty if the velocity is above a threshold (a crash) when the lander touches the surface
  • Give a distance-based reward to encourage the lander to approach the target

How to represent the rules in Python code

Code by Author

In the code below, the variable fuel_conservation is a value between 0 and 1. When the lander touches down successfully on the landing pad, the landing reward is multiplied by fuel_conservation to encourage the lander to use as little fuel as possible.

If the lander lands outside of the target spot, we give a small penalty of -10. If the lander crashes at a high velocity, we give a big penalty of -100.

distance_reward = 1 - (distance_to_goal / distance_max)**0.5 uses a power of 0.5 to offer the agent a smooth gradient of rewards as the lander gets closer to the landing pad.
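As a quick standalone check of that shaping term (the maximum distance of 100 units below is just an illustrative value):

```python
def distance_reward(distance_to_goal, distance_max):
    # The square root makes the reward climb faster near the goal,
    # giving a smooth gradient toward the landing pad
    return 1 - (distance_to_goal / distance_max) ** 0.5

print(distance_reward(100, 100))  # 0.0  (farthest away)
print(distance_reward(25, 100))   # 0.5
print(distance_reward(0, 100))    # 1.0  (on the pad)
```

Compared with a linear term, the square root hands out most of the remaining reward in the final approach, nudging the agent to close the last stretch to the pad.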

def compute_reward(distance_is_decreasing, speed, speed_threshold,
                   on_landing_pad, distance_to_goal, distance_max,
                   fuel_remaining, total_fuel):
    # Encourage the lander to use as little fuel as possible
    # e.g. 0.85, or 0.32
    fuel_conservation = fuel_remaining / total_fuel
    if distance_is_decreasing:
        if speed < speed_threshold:
            if on_landing_pad:
                # Landed successfully; give a big reward
                landing_reward = 100
                # Multiply by the percentage of remaining fuel
                reward = landing_reward * fuel_conservation
            else:
                # Landed outside of the landing pad
                reward = -10
        else:
            # Crashed
            reward = -100
    else:
        # Encourage the agent to approach the surface instead of
        # hanging in the air
        distance_reward = 1 - (distance_to_goal / distance_max) ** 0.5
        reward = distance_reward * fuel_conservation
    return reward

Conclusion

In this article, we used the lunar lander as an example to demonstrate how to build an advanced reward function out of reward and punishment rules.

During the training of RL models, the Reward Function guides the agent, through trial and error, to answer questions like:

  • What should I do? How to select between actions?
  • What are better actions to maximize the total rewards?
  • How to evaluate the goodness/badness of actions in different states?

Happy Landing! Hopefully, aliens will come in PEACE. 👽 ☮️✌️🕊🛸


Sign up for Udemy course 🦞:

Recommender System With Machine Learning and Statistics

https://www.udemy.com/course/recommender-system-with-machine-learning-and-statistics/?referralCode=178D030EF728F966D62D
