Let’s take a tour of where artificial intelligence and control meet
"What we want is a machine that can learn from experience." Alan Turing, 1947
Do you know how machines (or computers) are able to surpass human performance in complex games like chess and Go (DeepMind’s AlphaGo, AlphaZero and MuZero), or drive cars without human intervention? The answer is hidden in the way they are programmed. Such machines are programmed to maximize some objective that is explicitly defined by humans. This approach is called Reinforcement Learning, and it closely mirrors how humans and animals learn to act in the world.
In this article, I aim to explain what Reinforcement Learning is and cover its basics without going into too much detail.

What makes Reinforcement Learning so special?
Today, Machine Learning is organized into three main paradigms. Along with Supervised Learning (SL) and Unsupervised Learning (UL), Reinforcement Learning (RL) forms the third and last one. Although it is a subfield of Machine Learning, it has strong ties to Control Theory and Game Theory.
We can think of an intelligent agent as a function that is expected to give a useful output (or action) when fed an input (or observation). Learning is simply getting closer to producing a correct (or useful) output.
In SL, the agent is fed both inputs and the corresponding correct outputs, so that it learns to give reasonable outputs for unseen inputs.
In UL, the agent is fed only inputs and learns to produce outputs that optimize a predefined objective, such as compactness or an entropy-like measure.
The problem of RL arises when an agent is required to learn behavior through trial and error while interacting with a dynamic environment. This is how humans and animals learn behaviors such as speech and movement.

Despite the distinctions above, these learning methods do not need to be separated sharply. Seen from a broader perspective, all of them are based on the optimization of some objective. In recent years, many successful AI applications have been hybrids of these three paradigms, such as self-supervised learning and inverse Reinforcement Learning. The recent success of ML lies in defining proper objectives, not in picking one of the three learning types.
However, RL is still unique because it deals with a dynamic environment, and data is not given beforehand. As a result, unlike in SL and UL, an RL agent must:
- gather data itself,
- explore the environment (learn how to gather useful data),
- exploit its knowledge (learn how to act with future states in mind, i.e., learn the best policy),
- attribute consequences to their causes (figure out which past actions led to the current situation),
- keep itself safe during exploration (especially for physical applications).
Exploration-Exploitation Dilemma
As stated above, the agent should exploit its knowledge to reach an optimal policy. However, without sufficient knowledge of the environment, the agent may get stuck in a locally optimal solution. Therefore, the agent should explore the environment at the same time. This gives rise to the exploration-exploitation dilemma, one of the most important aspects of RL. Many algorithms use different techniques to balance the two.
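One common balancing technique is ε-greedy action selection: with probability ε the agent tries a random action (exploration), otherwise it takes the action it currently believes is best (exploitation). Below is a minimal Python sketch of the idea; the list of value estimates is a made-up placeholder, not part of any specific library.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore by picking a random action index;
    # otherwise exploit by picking the action with the highest estimated value.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: estimated values for three actions in the current state
action = epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.2)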

Now, let’s get into the technical details.
Sequential Decision Making
Sequential Decision Making, or a Discrete Control Process, means making a decision at each step in discrete time while taking the dynamics of the environment into account.
At time t, our agent is in state sₜ, receives reward rₜ and observation oₜ, and takes action aₜ according to its policy π. As a result of this action, the state transitions to sₜ₊₁ and a new observation oₜ₊₁ arrives together with reward rₜ₊₁.
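To make this loop concrete, here is a minimal Python sketch. The CorridorEnv class is a made-up toy environment (the agent walks along a short corridor and is rewarded for reaching the end); it is not any particular library’s API.

import random

class CorridorEnv:
    # Toy environment: positions 0..5, reward 1 for reaching position 5.
    def __init__(self):
        self.position = 0
    def reset(self):
        self.position = 0
        return self.position                      # initial observation o_0
    def step(self, action):                       # action is -1 (left) or +1 (right)
        self.position = max(0, min(5, self.position + action))
        reward = 1.0 if self.position == 5 else 0.0
        done = self.position == 5
        return self.position, reward, done        # o_{t+1}, r_{t+1}, episode finished?

def random_policy(observation):
    return random.choice([-1, +1])                # a_t chosen without learning yet

env = CorridorEnv()
obs, done = env.reset(), False
while not done:
    action = random_policy(obs)                   # a_t ~ pi(a | s_t)
    obs, reward, done = env.step(action)          # transition to s_{t+1}, receive r_{t+1}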
Observations are expected to represent the state of the agent. If the state can be formed from the current observation alone, observations are used directly as states (sₜ=oₜ), and the problem is called a Markov Decision Process. If not, it is called a Partially Observable Markov Decision Process, since a single observation cannot fully inform the agent about the state. We focus on MDPs for now.
Markov Decision Process (MDP)
An MDP consists of the following (a toy example in code is given right after the list):
- State space S, the set of all possible states.
- Action space A, the set of all possible actions.
- Model Function T(s’|s,a), giving the state transition probabilities.
- Reward Function R(s,a,s’), mapping a state, action, next state tuple to a reward.
- Discount Factor γ ∈ [0,1], a real number determining the importance of future rewards in the control objective.
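As a concrete illustration, here is a tiny hand-written MDP in Python with two states and two actions. All the numbers are arbitrary; the point is only to show what S, A, T, R and γ look like when written out as data.

# A hypothetical 2-state, 2-action MDP written out explicitly.
S = ["s0", "s1"]                       # state space
A = ["stay", "move"]                   # action space

# T[(s, a)] gives the next-state probabilities T(s'|s,a)
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# R[(s, a, s')] gives the reward for that transition
R = {
    ("s0", "move", "s1"): 1.0,         # reaching s1 from s0 is rewarded
}                                      # all other transitions give reward 0 (looked up with .get)

gamma = 0.9                            # discount factor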

Building Blocks of RL
- Policy Function π(a|s): A probability distribution over actions given the state, indicating how to act in a given situation.
- Return G: The cumulative sum of future rewards, discounted in time by the factor γ. It is defined as Gₜ = rₜ₊₁ + γ rₜ₊₂ + γ² rₜ₊₃ + … (a small code helper computing Gₜ follows this list).

- Value Function V(s|π): The expected return when policy π is followed starting from state s, defined as V(s|π) = E[Gₜ | sₜ = s, π].

- Action-Value Function Q(s,a|π): The expected return when action a is taken first at state s and policy π is followed afterwards, defined as Q(s,a|π) = E[Gₜ | sₜ = s, aₜ = a, π].
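A small Python helper makes the return definition concrete: given the list of rewards observed after time t, it computes the discounted sum Gₜ. This is only an illustration of the formula above.

def discounted_return(rewards, gamma=0.9):
    # G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    return g

# Example: three future rewards discounted by gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0]))   # 1.0 + 0.9*0.0 + 0.81*2.0 ≈ 2.62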

Bellman Equation
The whole aim of RL is the maximization of the value function V by finding an optimal policy π*. To do so, the optimal value function must satisfy the following Bellman Equation, which implicitly states that an optimal policy must maximize the value function (the expected cumulative future reward) across all possible states:
V*(s) = maxₐ Σₛ' T(s'|s,a) [R(s,a,s') + γ V*(s')]

You may wonder why we need the Q function at all. There is a direct relation between it and the policy:
π'(s) = argmaxₐ Q(s,a|π)

Note that the two depend on each other, so why do we need another definition? By observing the environment and the consequences of its actions, the agent can learn the Q function for its current policy. Then, the agent can improve its policy using the equation above. This lets the agent improve its policy throughout learning while it estimates the Q function at the same time.
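A minimal Python sketch of this policy-improvement step: given a learned Q table, the improved policy simply picks the highest-valued action in every state. The Q dictionary below is a made-up example, not learned values.

def greedy_policy_from_q(Q, actions):
    # Policy improvement: in every state, act greedily with respect to the current Q estimates.
    policy = {}
    for state in sorted({s for (s, a) in Q}):
        policy[state] = max(actions, key=lambda a: Q[(state, a)])
    return policy

# Hypothetical Q estimates for two states and two actions
Q = {("s0", "stay"): 0.1, ("s0", "move"): 0.7,
     ("s1", "stay"): 0.4, ("s1", "move"): 0.2}
print(greedy_policy_from_q(Q, ["stay", "move"]))   # {'s0': 'move', 's1': 'stay'}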
Model Free Reinforcement Learning
Model-free RL is based purely on experience, in the absence of known model and reward functions.

Monte Carlo Methods
Monte Carlo methods use statistical sampling to approximate the Q function. To use them, one must wait until the episode ends, because the cumulative sum of future rewards is needed for every visited state.
Once state sₜ is visited and action aₜ is taken, the return Gₜ is computed from its definition using the instant and future rewards, waiting until the end of the episode. Monte Carlo methods aim to minimize the gap between Q(sₜ,aₜ) and the target value Gₜ over all collected samples.
With a predetermined learning rate α, the Q function is updated as Q(sₜ,aₜ) ← Q(sₜ,aₜ) + α (Gₜ − Q(sₜ,aₜ)).
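A minimal sketch of an every-visit Monte Carlo update in Python. The episode here is a made-up list of (state, action, reward) steps; in practice it would be collected by running the current policy until the episode ends.

from collections import defaultdict

def mc_update(Q, episode, gamma=0.9, alpha=0.1):
    # Walk the finished episode backwards so that G accumulates the discounted future rewards,
    # then move each visited Q(s_t, a_t) toward its sampled return G_t.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        Q[(state, action)] += alpha * (G - Q[(state, action)])
    return Q

Q = defaultdict(float)
# Hypothetical finished episode as (s_t, a_t, r_{t+1}) triples
episode = [("s0", "move", 0.0), ("s1", "move", 0.0), ("s0", "move", 1.0)]
mc_update(Q, episode)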

Temporal Difference (TD) Methods
Temporal Difference (TD) methods bootstrap the Q function estimate to form the target return. This allows the agent to update the Q function after every single transition, instead of waiting for the episode to end.
The main TD methods are SARSA and Q-Learning.
- SARSA minimizes the gap between Q(sₜ,aₜ) and the target value rₜ₊₁ + γ Q(sₜ₊₁,aₜ₊₁), which is a bootstrapped estimate of the return. Since the next action is also required, it is named SARSA after the state, action, reward, state, action sequence. With a learning rate α, Q is updated by Q(sₜ,aₜ) ← Q(sₜ,aₜ) + α (rₜ₊₁ + γ Q(sₜ₊₁,aₜ₊₁) − Q(sₜ,aₜ)). A code sketch of this and the Q-Learning update follows the list.

- Q-Learning minimizes the gap between Q(sₜ,aₜ) and the target value rₜ₊₁ + γ max Q(sₜ₊₁,·), which is also a bootstrapped estimate of the return. Unlike SARSA, it does not need the next action and assumes the best action is taken at the next state. With a learning rate α, Q is updated by Q(sₜ,aₜ) ← Q(sₜ,aₜ) + α (rₜ₊₁ + γ max Q(sₜ₊₁,·) − Q(sₜ,aₜ)).
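Both update rules fit in a few lines of Python. The sketch below uses a plain dictionary as the Q table and mirrors the sₜ, aₜ, rₜ₊₁ notation above; it illustrates the update formulas only, not a complete training loop.

from collections import defaultdict

def sarsa_update(Q, s, a, r_next, s_next, a_next, gamma=0.9, alpha=0.1):
    # On-policy TD target: uses the action a_{t+1} actually taken by the policy.
    target = r_next + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r_next, s_next, actions, gamma=0.9, alpha=0.1):
    # Off-policy TD target: assumes the best action is taken at the next state.
    target = r_next + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)
sarsa_update(Q, "s0", "move", 1.0, "s1", "stay")
q_learning_update(Q, "s0", "move", 1.0, "s1", ["stay", "move"])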

Model Based Reinforcement Learning
In model-based RL, the model and reward function are either given beforehand or learned with SL methods.
Model-based RL therefore depends on the quality of the learned model.

Dynamic Programming
Dynamic Programming does not require sampling. It solves for an optimal policy analytically, given the model and reward function, by repeatedly applying the Bellman equation over the state space.
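For example, value iteration repeatedly applies the Bellman equation as an update rule until the value function stops changing. The sketch below assumes the same dictionary format for T and R as in the toy MDP example earlier.

def value_iteration(S, A, T, R, gamma=0.9, theta=1e-6):
    # Dynamic Programming sweep: V(s) <- max_a sum_s' T(s'|s,a) [R(s,a,s') + gamma V(s')],
    # repeated over all states until the largest change falls below theta.
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

# V = value_iteration(S, A, T, R, gamma)   # using the toy MDP defined above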

Simulation Based Search
Sometimes, the state and action spaces are too large, which makes Dynamic Programming inapplicable. In such cases, random simulation paths are generated using the model, and the rest is the usual model-free RL: one can use Monte Carlo or Temporal Difference methods on the simulated experience. The main advantage of this approach is that the agent can start its simulations from whichever state it wants.
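A rough Python sketch of the principle, assuming a model in the same dictionary format as before: simulate short episodes from a chosen root state and apply ordinary Q-Learning updates to the simulated transitions. This only illustrates the idea; it is not a full search algorithm such as Monte Carlo Tree Search.

import random
from collections import defaultdict

def simulate_step(T, R, s, a):
    # Sample the next state from the model T and look up the reward.
    next_states, probs = zip(*T[(s, a)].items())
    s2 = random.choices(next_states, weights=probs)[0]
    return s2, R.get((s, a, s2), 0.0)

def simulation_based_q_learning(T, R, A, root, episodes=100, horizon=10,
                                gamma=0.9, alpha=0.1, epsilon=0.2):
    # Generate simulated episodes starting from the chosen root state
    # and apply Q-Learning updates to the simulated experience.
    Q = defaultdict(float)
    for _ in range(episodes):
        s = root
        for _ in range(horizon):
            if random.random() < epsilon:
                a = random.choice(A)
            else:
                a = max(A, key=lambda act: Q[(s, act)])
            s2, r = simulate_step(T, R, s, a)
            target = r + gamma * max(Q[(s2, act)] for act in A)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q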
Briefly comparing the three: Monte Carlo methods sample complete episodes and do not bootstrap, Temporal Difference methods sample single transitions and bootstrap, and Dynamic Programming does not sample at all but bootstraps over every possible transition using the full model.

Conclusion
In most applications, RL is combined with Deep Learning, which is called Deep Reinforcement Learning. The recent success of RL is closely tied to neural networks. However, RL still struggles to adapt to real-world scenarios due to the sample inefficiency of learning and to safety constraints. The future of RL depends on our understanding of how humans learn.
Overall, RL is nothing but learning in dynamic environments. There are many RL algorithms and methods, but I have tried to give the core definitions and algorithms so that beginners can get an idea of the field, keeping things as simple as possible. I hope you enjoyed it!
As an extra, I leave you with OpenAI’s Hide and Seek simulation trained with RL.