Reinforcement Learning: a Subtle Introduction

Vansh Sethi
Towards Data Science
8 min read · Oct 4, 2019

On February 10th, 1996, IBM’s Deep Blue became the first computer to beat a reigning world champion, Garry Kasparov, in a game of chess under tournament conditions.

Google DeepMind’s AlphaGo is the best Go player in the world and has crushed world champions over and over again.

But how is this even possible? How is it possible that a computer can outsmart humans? The answer… Reinforcement Learning.

What is Reinforcement Learning?

Reinforcement learning is a branch of Machine Learning and AI. It takes a very specific approach to creating models: rather than learning from labelled examples, the model learns from its own experience. The objective of reinforcement learning is to teach a computer/machine to perform a certain task with a high degree of success (e.g., win a chess game, play Mario Kart and win).

It is also important to note what reinforcement learning isn’t. These models are narrow AI (sometimes called artificial narrow intelligence, or ANI), meaning they can only perform very specific tasks. They are not artificial general intelligence (AGI) and cannot perform a breadth of tasks. AGI is the next goal of AI; however, it will not be discussed in this article.

A reinforcement learning model playing Mario

How Reinforcement Learning Works

To grasp the essence of reinforcement learning, let’s break it down with a simple analogy:

Imagine a child seeing a fireplace for the first time. The child is intrigued and curious about the fireplace… so it naturally gets closer to it. As the child gets closer, it feels the warmth of the fireplace, and so the child feels warm and happy.

Wanting to feel even warmer and happier, the child gets closer still, until it can touch the fireplace. 🔥 The child is hurt and backs away from the fireplace.

The next time the child encounters a fireplace, it has learned not to get too close but to stay at a reasonable distance, so that it gets the reward of feeling warm without the punishment of being burned. This balance is pleasing to the child.

The reward of feeling warm is the child’s goal in this scenario. Put in AI terms, the AI, or agent, seeks to maximize some reward value by performing some sequence of actions.

The AI, just like the child, is motivated by the reward. Both want to collect as much reward as possible, and both achieve this through trial and error, learning the patterns that lead to reward along the way. In the child’s case, it quickly learned that getting too close traded the reward for a punishment and wasn’t worth it. A computer does much the same, trying different things until it finds a pattern that gives consistently high rewards. It reinforces the patterns that yield high rewards and keeps repeating them.

The environment is where the action takes place, or where tasks are carried out. In our scenario above, the environment was the room with the fireplace. An environment can also be a video game world, a specific room such as a bowling alley or a kitchen, and so on; environments vary with the task.

The agent is the computer or machine that controls some entity. In our scenario the agent was the child; another example would be a computer controlling Mario in Super Mario Bros. The agent is restricted to the controls the entity can perform: it can only act within a predefined set of rules.

The action is a move, or a sequence of moves, performed by the agent. In our scenario it was moving closer to or further away from the fireplace. Simply put, actions are the part of the problem we are actually trying to solve: how do we choose the best actions for a task?

The reward is the measure of how well the agent performed a task within an environment, and it is the motivation for the agent to get better. In our scenario, the reward was the warmth of the fireplace, which was also an indicator that the agent was doing something good.

Finally, the state is the data given to the agent that tells it where it is in the environment. In our scenario, that would be letting the child know where it stands after performing an action. The state is used to compute further actions and reach the overall goal.

These pieces all connect through a simple procedure. First, our agent performs an action. The environment processes that action and feeds a reward and a new state back to the agent. The reward tells the agent how well the action did, and the state tells it where it now is in the environment, so it knows what to do next.

The agent keeps trying things until some of them yield high rewards. It then starts to repeat those patterns more and more until it has built up a system for the whole environment. At that point it knows how to perform the task from any state in the environment, essentially solving our problem! A minimal sketch of this loop is shown below.
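To make the loop concrete, here is a minimal sketch in Python of the fireplace scenario. Everything here is made up for illustration (the `FireplaceEnv` class, its distances, and its reward values are all assumptions, not a real library): the agent keeps a running average of the reward each action has earned in each state, and increasingly picks whichever action has paid best.

```python
import random
from collections import defaultdict

class FireplaceEnv:
    """Toy environment: the agent stands some number of steps away
    from a fireplace and can step closer or further away."""

    def reset(self):
        self.distance = 5
        return self.distance                       # the starting state

    def step(self, action):                        # 0 = closer, 1 = away
        self.distance += -1 if action == 0 else 1
        self.distance = max(0, min(10, self.distance))
        if self.distance == 0:
            reward = -10                           # touched the fire: ouch
        elif self.distance <= 3:
            reward = 1                             # pleasantly warm
        else:
            reward = 0                             # too far to feel anything
        return self.distance, reward               # new state, reward

env = FireplaceEnv()
total = defaultdict(float)    # summed reward per (state, action)
count = defaultdict(int)      # times each (state, action) was tried

state = env.reset()
for _ in range(10_000):
    if random.random() < 0.1:                      # explore occasionally
        action = random.choice([0, 1])
    else:                                          # otherwise exploit the best-known action
        action = max([0, 1],
                     key=lambda a: total[state, a] / max(count[state, a], 1))
    next_state, reward = env.step(action)
    total[state, action] += reward                 # reinforce what just happened
    count[state, action] += 1
    state = next_state

# With enough steps the agent settles into hovering in the warm zone,
# only rarely getting burned while exploring.
```

Real reinforcement learning systems replace this simple averaging with algorithms such as Q-learning, but the shape of the loop (act, observe reward and state, update) is exactly the one described above.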

There is still a question that needs to be answered however… how do we create efficient rewards so the agent is able to perform the task?

Rewards: the Principle of Reinforcement Learning

Rewards are the motivation for an agent/computer to get better at a certain task; in chess, for example, the reward could be winning. Since rewards are the basis of reinforcement learning, it is important to understand how to create efficient reward systems, through a process called reward shaping.

For machines to converge on certain behaviors, they need to know, while learning, what is desired and what isn’t. For example, what is the desired goal when playing solitaire or checkers? It seems obvious to us, but for a computer it isn’t as clear. In those cases, the goal is simply to win, which just means reaching certain end conditions. So when shaping a reward model for the machine, you can tell it that a win earns +1 reward and a loss earns -1. This forces the machine to find ways to always reach that +1, which is our intended goal. In code, such a scheme can be as small as the function sketched below.
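A hedged sketch of that win/lose reward (the `game_state` object and its `is_over` and `winner` fields are hypothetical stand-ins for whatever game implementation you use):

```python
def game_reward(game_state, agent_player):
    """Sparse win/lose reward: +1 for a win, -1 for a loss, 0 mid-game.
    `game_state` and its fields are illustrative placeholders."""
    if not game_state.is_over:
        return 0                  # nothing to say until the game ends
    return 1 if game_state.winner == agent_player else -1
```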

But how do we shape rewards for things that aren’t as clear-cut as games, such as autonomous vehicles? The trick is to always consider the larger goal at play. In this case, it’s to get from point A to point B without crashing or causing harm. That in itself can be the reward for the machine: how far can it get without crashing? So whenever you are creating a reward function, simply ask yourself what the goal of the machine is and write a program around those parameters. In theory, the computer should get better; a sketch of such a driving reward follows below.
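One possible shape for that reward (all names and weights here are assumptions for illustration; a real system would balance many more signals, such as lane keeping and passenger comfort):

```python
def driving_reward(metres_travelled, crashed, reached_goal):
    """Reward progress toward the destination; punish crashes heavily."""
    if crashed:
        return -100.0                   # crashing must never look worth it
    reward = 0.1 * metres_travelled     # small reward for every metre covered
    if reached_goal:
        reward += 10.0                  # bonus for arriving at point B
    return reward
```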

Now we are equipped with the knowledge to create these agents/machines to perform certain tasks… so let’s jump into it!

Well… there’s just one more thing to figure out…

Simulations and How to Train Reinforcement Learning Models

It’s easy to see that many games have been subjected to reinforcement learning. Classic games like chess, Pong, Mario, checkers, Go… (you get the idea) are all games in which computers crush humans.

Why? Because it’s easy.

In a game there are fixed rules to follow and endless iterations to play. It’s easy to simulate many games at a time, and computers accumulate experience at enormous rates. Computers don’t get tired, hungry, or sleepy. Unlike a human, who only has limited time to learn something, a computer can learn for hours on end; it can even play multiple games every second. This is fundamentally why computers can so easily get better than us. No wonder AlphaGo Zero, a program that trained itself for just 40 days, was able to surpass every human Go champion and every previous version of AlphaGo.

But what about tasks that have no specific rules and that can be applied to the real world? How would we be able to apply reinforcement learning?

Take driving a car, for example. It would be idiotic to let a machine “trial and error” a car in real life; you can only imagine the horror that would present. So how do we apply reinforcement learning? The same way as before: with simulations.

“trial and error” approach in real life 😬

We can simulate the physics of our world in a game engine, like Unity. We can give the computer specific controls, like moving forward and moving backward. We can put the machine into an environment that is similar to the real world, like roads. We then train the machine with our reward system in that environment until it is good enough to make little to no mistakes. A rough sketch of such a training loop is shown below.
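Putting the earlier pieces together, a simulated training loop might look roughly like this. Everything here is an assumption for illustration: `SimulatedRoadEnv` is a crude stand-in for a real physics simulation (which you would build in Unity or a similar engine), and the learning rule is the same average-reward tallying used in the fireplace sketch rather than a production-grade algorithm.

```python
import random
from collections import defaultdict

class SimulatedRoadEnv:
    """Crude 1-D stand-in for a simulated road. The state is whether a
    hazard is visible ahead; the actions are 0 = drive slow, 1 = drive fast."""

    def reset(self):
        self.position = 0
        self.hazard_ahead = random.random() < 0.3
        return self.hazard_ahead

    def step(self, action):
        speed = 1 if action == 0 else 3
        crashed = self.hazard_ahead and action == 1   # speeding past a hazard
        self.position += speed
        reward = -100 if crashed else speed           # distance gained, or a crash
        done = crashed or self.position >= 100        # crashed, or reached point B
        self.hazard_ahead = random.random() < 0.3
        return self.hazard_ahead, reward, done

env = SimulatedRoadEnv()
total = defaultdict(float)    # summed reward per (state, action)
count = defaultdict(int)      # times each (state, action) was tried

for episode in range(5_000):                          # thousands of cheap simulated runs
    state, done = env.reset(), False
    while not done:
        if random.random() < 0.1:                     # explore occasionally
            action = random.choice([0, 1])
        else:                                         # otherwise exploit the best average
            action = max([0, 1],
                         key=lambda a: total[state, a] / max(count[state, a], 1))
        next_state, reward, done = env.step(action)
        total[state, action] += reward
        count[state, action] += 1
        state = next_state

# The learned behavior: drive fast on a clear road, slow down when a hazard appears.
```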

We then take our model and put it into a real-world machine, where it does the exact same thing. No real-world chaos required!

Well… there is a slight downside…

The con is that simulations are just that… simulations. Nothing can exactly mimic our world, and if there is even a slight disparity between the simulated environment and the real world, the model might not work once deployed (just look at the gif above for the implications). This mismatch is often called the sim-to-real gap. If you can get over this hurdle when training a model for real-world applications, you can be successful, but it means meticulously checking the environment just to get it right.

And this begs the question… with the ability to create real world reinforcement models and machines… what are some of the applications in the real world?

Real World Applications

Autonomous Vehicles

Through simulation training, autonomous vehicles have been a subject of reinforcement learning for a while. From self-driving cars all the way to self-flying planes, the realm of autonomous vehicles is vast.

Health Care

Reinforcement learning has been used to figure out which treatments and dosages work best for certain conditions, through trial and error in simulations. As AI spreads through the medical industry, this kind of trial and error will only become more useful.

Manufacturing

In conventional manufacturing, machines perform pre-programmed, timed moves. Adding reinforcement learning makes these machines more adaptable: they can learn a certain task and then switch tasks as you please, which suits the hectic pace of change in many industries.

Key Takeaways

  • Reinforcement learning is training an agent to perform a certain task by using a reward system in an environment.
  • Rewards are the principle behind reinforcement learning, and we use reward shaping to create reward models for reinforcement learning models.
  • Simulations can be used to train agents safely and at scale.
  • Reinforcement learning is being applied in many industries today.
