Reinforcement Learning
Intro
In a slight departure from my previous articles, I will keep Python code to a minimum this time, limiting it to a few short illustrative sketches, as I want to avoid information overload for the reader. Instead, I will briefly introduce Reinforcement Learning (RL) and explain its main ideas and terminology.
My subsequent articles will delve deeper into individual RL algorithms and provide detailed Python examples. So, remember to subscribe to get them straight to your inbox when they are published.
Contents
- Reinforcement Learning (RL) within the Machine Learning (ML) universe
- What is Reinforcement Learning?
- Agent and Environment
- Action Space
- State (Observation) Space
- Reward (and Discount Factor)
- Exploration vs Exploitation
- Different Methods to Train Your Model
- Summary
Reinforcement Learning within the ML universe
In a nutshell, Reinforcement Learning is similar to how babies learn about their surroundings or how we train dogs. We allow them to interact with and explore the environment and provide positive/negative rewards to encourage a particular behaviour.
Since RL’s approach to learning is significantly different from that of other types of ML, we have placed RL in a separate category within our Machine Learning universe chart.
What is Reinforcement Learning?
Reinforcement Learning (RL) is a category of Machine Learning algorithms used for training an intelligent agent to perform tasks or achieve goals in a specific environment by maximising the expected cumulative reward. Here are a few examples:
- Teaching a basic AI how to play a computer game such as Atari’s Space Invaders or Nintendo’s Super Mario Bros
- Teaching a more advanced AI to play real-life games such as chess or Go (see DeepMind’s AlphaGo)
- Teaching a self-driving car (or a model car) how to drive in a simulated or real-life environment (see AWS DeepRacer)
Agent and Environment
The two critical components within RL are the agent and its environment. The agent represents a player, a car, or some other "intelligent actor" that can interact with its environment.
Meanwhile, the environment is the "world" where that agent "lives" or operates.
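To make this relationship concrete, here is a minimal sketch of the standard agent-environment interaction loop. The SimpleEnv and RandomAgent classes are made up purely for illustration (they are not from any library): the agent repeatedly picks an action, and the environment responds with a new state and a reward.

```python
import random


class SimpleEnv:
    """A made-up 1-D environment: the agent starts at position 0 and wants to reach position 5."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                     # initial state

    def step(self, action):
        self.position += action                  # action is -1 (left) or +1 (right)
        reward = 1 if self.position == 5 else 0  # reward only when the goal is reached
        done = self.position == 5                # the episode ends at the goal
        return self.position, reward, done


class RandomAgent:
    """An agent with no intelligence yet: it simply picks random actions."""

    def choose_action(self, state):
        return random.choice([-1, 1])


env, agent = SimpleEnv(), RandomAgent()
state = env.reset()

for _ in range(1000):                        # cap the episode length
    action = agent.choose_action(state)      # the agent acts on the environment...
    state, reward, done = env.step(action)   # ...and receives a new state and a reward back
    if done:
        break
```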

Action Space
The agent can perform actions within its environment. E.g., Mario can move left, move right, jump, and so on inside his Super Mario Bros level. The same applies to the agent in my made-up game illustration below.

The available action space can be either discrete or continuous.
- Discrete action space – the agent can only perform a finite number of actions, as in the Super Mario Bros example. Hence, it is always possible for us to list all available actions.
- Continuous action space – the agent chooses from a continuous range of values, so there are infinitely many possible actions. E.g., a self-driving car could turn left by 1 degree, 1.2 degrees, 1.2432… degrees, etc. Hence, we cannot list all possible actions (see the snippet after this list).
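The snippet below illustrates the difference using the Gymnasium library’s space objects, assuming you have Gymnasium installed; the number of discrete actions and the steering-angle range are just illustrative choices.

```python
import numpy as np
from gymnasium import spaces  # pip install gymnasium

# Discrete action space: e.g., a Mario-like game with three possible moves.
discrete_actions = spaces.Discrete(3)  # actions are 0, 1, 2 (say: left, right, jump)
print(discrete_actions.sample())       # -> one of 0, 1, 2

# Continuous action space: e.g., a steering angle anywhere between -30 and +30 degrees.
continuous_actions = spaces.Box(low=-30.0, high=30.0, shape=(1,), dtype=np.float32)
print(continuous_actions.sample())     # -> e.g., array([12.43], dtype=float32)
```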
State (Observation) Space
The next important element is an observation or state of the environment.
- A state describes the information that the agent gets from the environment. E.g., in a computer game, it could be a frozen frame of the screen that shows the exact location of the agent. Strictly speaking, a state is a complete description of the world, i.e., a fully observed environment, like a full view of a chess board.
- An observation is a partial description of the state, i.e., it only provides a partial view of the environment. E.g., a frozen frame of Super Mario Bros would only show us the part of the level close to the agent, not the entire level.
However, in practice, state and observation are used interchangeably, so don’t be surprised to see the term "state" when we don’t necessarily have a full view of the environment and vice versa.
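As a rough illustration, here is a tiny, hand-built grid-world sketch (not taken from any real environment): the full grid plays the role of the state, while the agent only "sees" a small window around itself as its observation.

```python
import numpy as np

# Full state: the entire 5x5 grid world (0 = empty, 1 = agent, 2 = goal).
state = np.zeros((5, 5), dtype=int)
agent_row, agent_col = 2, 1
state[agent_row, agent_col] = 1  # the agent's position
state[4, 4] = 2                  # the goal's position

# Observation: only the 3x3 window around the agent, i.e., a partial view of the state.
observation = state[agent_row - 1:agent_row + 2, agent_col - 1:agent_col + 2]
print(observation)  # the goal at (4, 4) is not visible from here
```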

Reward (and Discount Factor)
Perhaps the most crucial piece of the puzzle is the reward. We train the agent to take the "best" actions by giving a positive, neutral or negative reward based on whether the action has taken the agent closer to achieving a specific goal.
Assume we are playing a game where the objective is to catch a squirrel. The agent would start by randomly exploring its environment, and we would reward it (+1 point) if it got closer to a squirrel and "punish" it (-1 point) if it moved further away from the squirrel. Finally, we would give a significant reward (+1000) when the agent achieved its goal, i.e., caught the squirrel.
Just by taking random actions and receiving corresponding rewards, the agent can learn what it needs to do to maximise the cumulative reward and achieve the goal.

Note that the reward values in this example are just for illustration purposes. In practice, we often design our own reward function when we do Reinforcement Learning. Usually, the "quality" of your reward function will be a significant driver of the success of your model (e.g., see how reward functions are used in the AWS DeepRacer competitions).
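As a sketch of what such a reward function might look like for the catch-the-squirrel game above (the exact values and the distance-based logic are just one possible design, not the only way to do it):

```python
def reward_function(old_distance, new_distance):
    """One possible reward function for the catch-the-squirrel game."""
    if new_distance == 0:            # the agent has caught the squirrel
        return 1000
    if new_distance < old_distance:  # the agent moved closer to the squirrel
        return 1
    if new_distance > old_distance:  # the agent moved further away
        return -1
    return 0                         # no change in distance -> neutral reward
```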
Another critical aspect of rewards is the discount factor (gamma). It can range between 0 and 1, but we would typically choose a value between 0.95 and 0.99. The discount factor gives us control over the preference for short-term vs long-term rewards: a reward expected k steps in the future is multiplied by gamma to the power of k, so a gamma close to 1 makes the agent value distant rewards almost as much as immediate ones, while a smaller gamma makes it increasingly short-sighted.
For example, in a game of chess, a move that captures the opponent’s piece would be rewarded. However, we wouldn’t want the agent to prioritise that move if it put us in a losing position over the longer term. Hence, it is essential to balance short-term and longer-term rewards.
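To see how gamma trades off short-term against long-term rewards, here is a small worked example that computes the discounted return, i.e., the sum of rewards where each reward is multiplied by gamma raised to its time step, for the same reward sequence under two different discount factors:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each multiplied by gamma raised to its time step."""
    return sum(reward * gamma ** t for t, reward in enumerate(rewards))

# A small reward now, followed by a big reward ten steps later.
rewards = [5] + [0] * 9 + [100]

print(discounted_return(rewards, gamma=0.99))  # ~95.4 -> the distant reward still dominates
print(discounted_return(rewards, gamma=0.50))  # ~5.1  -> the agent becomes short-sighted
```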
Exploration vs Exploitation
The final concept for us to learn about is the trade-off between exploration and exploitation. We typically want to encourage the agent to spend some of its time exploring its environment instead of spending 100% of its time exploiting the knowledge it already has about it.
A simple real-life example would be choosing a meal for lunch. Say you are a massive fan of burgers and keep having them for lunch every day. Hence, you are exploiting the existing knowledge (your taste for burgers).
However, you could also add an element of exploration to your routine. For example, you could try a different meal once in a while, such as a hot dog or a kebab. Of course, by exploring, you risk getting disappointed (negative experience/reward), but you also open yourself up to discovering something that you may enjoy even more than burgers (positive experience/reward).

In the context of Reinforcement Learning, we want to encourage the agent to spend part of its time exploring. Otherwise, the agent may end up constantly repeating the same move, not realising that there are much better moves to make. We do this via an additional parameter (epsilon), which specifies the proportion of situations in which the agent should take a random action (i.e., explore).
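A common way to implement this idea is the epsilon-greedy strategy. The sketch below assumes the agent keeps estimated values for each action (the lunch_values dictionary is hypothetical) and explores with probability epsilon:

```python
import random


def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(action_values))     # explore: try something at random
    return max(action_values, key=action_values.get)  # exploit: pick the highest-valued action


# Hypothetical value estimates for each lunch option.
lunch_values = {"burger": 8.5, "hot dog": 6.0, "kebab": 7.0}
print(epsilon_greedy(lunch_values, epsilon=0.1))  # usually "burger", occasionally something new
```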

Different Methods to Train Your Model
When we use Reinforcement Learning, we want to train the agent to take the "best" actions to achieve its goal. The strategy that tells the agent which action to take in each situation is called the Policy (𝜋); it is kind of like the "brain" of the agent.
For example, if we want to teach an agent how to play a computer game, it needs to learn what actions to take in each situation. In other words, we want to find an optimal Policy(𝜋) that leads to the highest possible reward over the long term.
To find the optimal Policy, we can use either of the following approaches:
- Policy-based methods – we train the agent directly to learn which action to take in each state.
- Value-based methods – we train the agent to identify which states (or state-action pairs) are more valuable, so it can be guided by value maximisation. E.g., in the game of catching a squirrel, standing one step away from a squirrel is a more valuable state than standing ten steps away (see the sketch after this list).
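To make the distinction a little more tangible, here is a minimal sketch contrasting the two views; the policy and value tables below are hypothetical and hand-filled rather than learned:

```python
# Policy-based view: the agent learns a mapping from state directly to action.
policy = {
    "10 steps from squirrel": "move towards squirrel",
    "1 step from squirrel": "pounce",
}
print(policy["1 step from squirrel"])  # the policy tells us directly what to do

# Value-based view: the agent learns how valuable each state is and
# then steers itself towards the highest-value states.
state_values = {
    "10 steps from squirrel": 1.0,
    "1 step from squirrel": 9.0,   # being close to the squirrel is much more valuable
    "squirrel caught": 100.0,
}
best_state = max(state_values, key=state_values.get)
print(best_state)  # -> "squirrel caught"
```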
I will go into more detail about each method in my upcoming articles as we dive deeper into various RL algorithms and learn how to code them in Python.
Summary
To recap, we have learned that Reinforcement Learning is used to teach the agent to operate within its environment and achieve a goal or objective (e.g., win a game) by providing positive, neutral or negative rewards to the agent based on the actions it takes in different states.
We balance exploration vs exploitation by specifying what proportion of the agent’s actions should be chosen randomly, and we apply a discount factor (gamma) to control the agent’s preference for short-term vs long-term rewards.
Finally, we train the model (teach the agent) by optimising Policy(𝜋), which we do either through a direct policy-based method or an indirect value-based method.
In my next article, I will take you through the Q-learning algorithm (value-based method) and show you how to use it in Python to train an agent to successfully navigate from start to finish in a simple game called Frozen Lake.

Please don’t forget to follow and subscribe so you don’t miss out on the fun of exploring Reinforcement Learning with me.
Cheers! 🤓 Saul Dobilas