Reinforcement learning: the naturalist, the hedonist and the disciplined

A history of ideas that shaped reinforcement learning

Eleni Nisioti
Towards Data Science

--

Embracing the chaos of a biological brain and the order of an electronic one. — Image credit: http://www.gregadunn.com/microetchings/cortical-circuitboard/

The pursuit of artificial intelligence has always been intermingled with another struggle, more philosophical, more romantic, less tangible: the understanding of human intelligence.

Although current breakthroughs in supervised learning seem to be based on optimized hardware, sophisticated training algorithms and over-complicated neural network architectures, reinforcement learning is still as old school as it gets.

The idea is quite simple: you are a learning agent in an environment. If we make the general assumption that your goal is to satisfy yourself (don’t we all?), then you perform actions, the environment responds to those actions with rewards, and you, based on those rewards, adjust your behavior so as to maximize your satisfaction.
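In code, that loop is almost a literal translation of the sentence above. Here is a minimal sketch in Python; the toy environment, its states and its reward scheme are invented purely for illustration and are not taken from any particular library:

```python
import random

class ToyEnvironment:
    """A made-up environment: the state is how caffeinated you are (0-4).
    Drinking coffee raises it, resting lowers it; a moderate buzz is rewarded,
    overdoing it is punished."""

    def __init__(self):
        self.state = 2

    def step(self, action):
        # action 0 = rest, action 1 = drink coffee
        change = 1 if action == 1 else -1
        self.state = max(0, min(4, self.state + change))
        reward = 1.0 if self.state == 2 else (-1.0 if self.state == 4 else 0.0)
        return self.state, reward

env = ToyEnvironment()
total_reward = 0.0
for t in range(20):
    action = random.choice([0, 1])    # a real agent would pick actions less blindly
    state, reward = env.step(action)  # the environment responds with a reward
    total_reward += reward            # this running sum is what the agent tries to maximize
print("return after 20 steps:", total_reward)
```

Everything a reinforcement learning algorithm does happens inside that loop; the whole question is how to replace `random.choice` with something that learns from the rewards it has seen.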

Does RL have limits? “The day will never come when a computer defeats a pro shogi player”, a proclamation by the late shogi player Satoshi Murayama, found its refutation in AlphaZero. — Image Credit: https://www.hokusai-katsushika.org/shogi-chess-board.html

It did not take us long to draw a connection between the ability of living organisms to learn through reinforcement and artificial intelligence. As early as 1948, Turing described a pleasure-pain system that operates along the lines of reinforcement learning as it was formalized decades later.

Intelligence is the ability to adapt to change — Stephen Hawking

The community’s first attempts targeted board games such as backgammon, with their discrete states and well-defined rules. Nowadays we have AI agents that use reinforcement learning to play Atari games and Minecraft, and even to flip pancakes. So, how did we accomplish all this?

The short answer is deep learning.

This article will venture a longer answer. It will explore the origins of the ideas behind the reinforcement learning algorithms that we have been using for decades. Our recent successes are not just a product of deep neural networks, but of a long history of observations, conclusions and attempts to comprehend the mechanisms of learning.

Reinforcement learning is a field whose origins are hard to trace. It owes most of its theoretical foundation to control theorists. The Markov decision process is the discrete stochastic version of the optimal control problem, so it should not be a surprise that almost all reinforcement learning algorithms are based on solutions derived in control theory.

Yet the background offered by control theory was not enough to create reinforcement learning. The algorithms that we still use today required ideas, such as classical conditioning and temporal-difference learning, to formalize the process of learning.

Had it not been for a handful of curious biologists, psychologists and non-conformist computer scientists, the AI community would probably not possess the tools to implement learning.

How do we act in unforeseen situations? How do we adapt our behavior? How does the environment affect our actions? How do we improve? How is a skill learned?

It’s a trial-and-error world

In 1898 Thorndike was either very angry with his cat or very curious about animal behavior. He put it in a cage he had fitted with a latch and placed an appetizing dish of fish outside. The cat could escape the cage only by pulling the lever that released the latch.

How would the cat react?

There is no reasoning, no process of inference or comparison; there is no thinking about things, no putting two and two together; there are no ideas — the animal does not think of the box or of the food or of the act he is to perform.

What Thorndike observes is that the cat does not appear to behave intelligently: it starts by moving and acting randomly inside the box. Only when it pulls the lever by chance and releases itself does it begin to improve its skill at escaping.

Based on this observation, Thorndike put forward the Law of Effect, which states that any behavior followed by pleasant consequences is likely to be repeated, and any behavior followed by unpleasant consequences is likely to be stopped.

This law gave rise to the field of operant conditioning, officially defined by Skinner in 1938. For the reinforcement learning community, it provided the reason to formulate agents that learn policies based on rewards and their interaction with their environment.

It also provided us with new insight into animal learning, as the Law of Effect suspiciously resembles another law that was well known at the time: natural selection. Could our intellect simply be the survival of the fittest among ideas?

Yet there are two traits that make reinforcement learning unique as a process (see the sketch after this list):

  • It is selectional. This differentiates it from supervised learning, since an agent tries alternatives and selects among them by comparing their consequences.
  • It is associative. This means that the alternatives found by selection are associated with particular situations, or states, to form the agent’s policy. Natural selection is a prime example of a selectional process, but it is not associative.
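To make both traits concrete, here is a minimal tabular sketch. The states, actions and value estimates below are entirely made up; the point is the structure: for each situation the agent keeps estimates of how good each alternative has been, selects by comparing them, and that selection stays tied to that particular state.

```python
# Hypothetical action-value estimates, one row per state (the numbers are invented).
action_values = {
    "hungry":   {"pull_lever": 0.9, "scratch_door": 0.1},
    "satiated": {"pull_lever": 0.2, "scratch_door": 0.3},
}

def select_action(state):
    # Selectional: compare the (estimated) consequences of the alternatives.
    # Associative: the comparison happens within the row for this particular state.
    values = action_values[state]
    return max(values, key=values.get)

print(select_action("hungry"))    # -> pull_lever
print(select_action("satiated"))  # -> scratch_door
```

Natural selection, so to speak, has the selection but not the rows: it picks among alternatives without tying the outcome to a particular situation.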

“We are what we repeatedly do. Excellence, then, is not an act, but a habit.”
Aristotle

A hedonist’s guide to learning

When it comes to analyzing the human mind, Klopf is quite concise:

“What is the fundamental nature of man?

Man is a hedonist.”

In his controversial book The Hedonistic Neuron — A Theory of Memory, Learning, and Intelligence, Klopf employs neuroscience, biology, psychology and the disarming simplicity and curiosity of his reasoning to persuade us that our neurons are hedonists. Yes, your neurons are as pleasure-seeking as you are.

When confronted with the dominant neuronal model of his time, Rosenblatt’s Perceptron (which is the building block of today’s neural networks), Klopf wonders:

“If neurons are assumed to function as non-goal-seeking components, then goal-seeking brain function must be viewed as an emergent phenomenon. Can such a view lead to explanations for memory, learning and, more generally, intelligence?”

He proposes a new building block, the elementary heterostat, as the basis for future AI research. Klopf also argues that homeostasis, the pursuit of a good, stable state, is not the purpose of complex systems such as humans and animals. It may be enough to explain the objective of plants, but we can assume that humans, once homeostasis is ensured, seek to maximize pleasure, not to stabilize it. Why should our neurons be any different?

Implausible as these ideas may sound, they can be credited with shaking the world of AI. Klopf recognized that essential aspects of adaptive behavior were being lost as learning researchers came to focus almost exclusively on supervised learning. What was missing, according to Klopf, were the hedonic aspects of behavior: the drive to achieve some result from the environment, to control the environment toward desired ends and away from undesired ends.

In an extensive chapter criticizing the then-current principles of cybernetics, as machine learning was termed at the time, three lines of attack stand out:

Should we use deep neural networks?

Just to be clear, in the 50s two layers sufficed for a network to be termed deep. Klopf seems satisfied with the Perceptron model itself, but he questions its ability to learn when placed in deep networks. He raises an issue that, even today, no machine learning scientist can brush aside:

“[…] However, the algorithm applies only to single-layered adaptive networks. Much subsequent research has failed to produce a truly viable deterministic adaptive mechanism for the more general case of the multi-layer network. The central problem in the general case is that of establishing what any given network element should be doing when the system behavior is inappropriate. This has proven very difficult because most of the outputs of individual elements in a deep network bear a highly indirect relationship to the final output of the system.”

What is the purpose of AI?

Klopf also questions the direction of AI research. In his attempt to identify the right goal of learning, he employs an argument that I’ve also found in the writings of later reinforcement learning researchers:

“Life has been evolving on this planet for approximately 3 billion years. Of that time, 90% was spent in evolving the neural substrate we share with the reptile. From the time of the reptile forwards, it has been only a relatively short 300 million years until the emergence of humans. A question arises regarding the processes leading to intelligence. If the evolutionary process spent 90% of its time developing the neural substrate and the remaining 10% working out effective higher level mechanisms, why are artificial intelligence researchers attempting to do it the other way around?”

Is intelligence intelligent?

In the following excerpt, it feels as if Thorndike and Klopf have been reinforcement learning pals:

“There is another way in which the AI researcher’s perception of intelligence appears not to be consonant with the nature of the phenomenon in living systems. […] In living systems, intelligence frequently is not intelligent, at least not in the intellectual sense in which researchers have sometimes viewed the phenomenon. Rather, intelligence in living systems is frequently simply effective. There would appear to be much that is of a “brute force” nature in the everyday information processing of intelligent organisms. […] Even for the most intelligent humans, the more intellectual forms of activity are difficult. […] Thus, one wonders if the association of intelligence with higher level information processing has, perhaps, left the AI researcher with too narrow and elevated a view of the phenomenon. In the near term, would a more modest view yield more productive theories?”

Pavlov’s dog plays backgammon

We may have been talking about reinforcement learning all along, but the truth is that the term “reinforcement” first appeared in the 1927 English translation of Pavlov’s monograph on conditioned reflexes.

What Pavlov observed in his famous experiment was that when a dog was presented with food while a sound went off close to feeding time, the dog learned to associate feeding with that sound and salivated upon hearing it, even in the absence of food.

With this observation, Pavlov laid the groundwork for classical conditioning, the first theory to incorporate time into the learning procedure. Nowadays, RL algorithms mainly employ temporal-difference learning, which means that when estimating the “quality” of an action in order to make a decision, we also take future rewards into account.
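The core of temporal-difference learning fits in a single update: nudge the current estimate of a state’s value toward the reward just received plus the discounted estimate of the next state’s value. A minimal sketch, with the value estimates, step size and discount factor chosen purely for illustration:

```python
# TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
alpha, gamma = 0.1, 0.9            # step size and discount factor (illustrative values)
V = {"s": 0.0, "s_next": 2.0}      # hypothetical value estimates

def td_update(state, reward, next_state):
    td_error = reward + gamma * V[next_state] - V[state]  # the "surprise" signal
    V[state] += alpha * td_error

td_update("s", reward=1.0, next_state="s_next")
print(round(V["s"], 2))  # 0.0 + 0.1 * (1.0 + 0.9 * 2.0 - 0.0) = 0.28
```

The discount factor gamma is exactly the knob that decides how much those future rewards count.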

The temporal-difference and optimal control threads were fully brought together in 1989 with Chris Watkins’s development of Q-learning, one of the most famous reinforcement learning algorithms.
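Q-learning applies the same temporal-difference idea to state-action values, with one twist: the bootstrap term looks at the best available action in the next state, whatever the agent actually does next. Again a sketch with made-up states, actions and numbers, not Watkins’s original notation:

```python
# Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
alpha, gamma = 0.1, 0.9
actions = ("left", "right")
Q = {("s", "left"): 0.0, ("s", "right"): 0.5,
     ("s_next", "left"): 1.0, ("s_next", "right"): 3.0}   # hypothetical estimates

def q_update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)  # greedy look-ahead
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error

q_update("s", "right", reward=0.0, next_state="s_next")
print(round(Q[("s", "right")], 2))  # 0.5 + 0.1 * (0.0 + 0.9 * 3.0 - 0.5) = 0.72
```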

In 1992, Tesauro applied temporal-difference learning to an agent that played backgammon, TD-Gammon. This is the moment, and the application, that persuaded the research community that there is potential in this type of machine learning.

Although current threads of research focus on deep learning and games, we would not have the field of reinforcement learning today had it not been for a bunch of guys talking about cats, neurons and dogs.

One could say that the reward we got from mastering backgammon, until then an unimaginably difficult task, motivated us to further explore the potential of reinforcement learning. How’s that for an example of reinforcement learning?
