
Rethinking the Autonomous Agent

A Q-Learning Behavior System for Embodied Agents

Image courtesy of Eric Kilby

This essay introduces a Q-learning-based behavior system for embodied agents, including robots and video game characters.

Most recent reinforcement learning (RL) success stories have focused on achieving superhuman levels of performance in some virtual task – be it a video game or a test of robotic control. A separate and less sensational branch, which could be called normative reinforcement learning, seeks instead to mimic subjective notions of human or animal behavior. Here the goal is not to optimize toward an arbitrary level of perfection but rather to replicate something that is "human-like" or "animal-like". While this approach has yet to garner the same headlines as other examples of RL, it will undoubtedly generate many of the important use cases of the future. Consider the advent of open-world video games that demand high-fidelity representations of human and animal behavior, or the case of elderly care robots, where superhuman levels of performance might actually feel alien and disorienting.

In this essay I introduce a behavior system for embodied agents that uses the popular Q-learning algorithm and is aimed at normative reinforcement learning. While geared toward game engines, the intuition can be extended to other platforms and will hopefully inspire a variety of use cases. Q-learning is also the foundation for more advanced systems of intelligent behavior, such as those found in the Autonomous Emotions Toolkit, which I have written about elsewhere. While this article takes a high-level walk through the behavior system, those wishing a more in-depth treatment can download the project files for the behavior system along with an accompanying tutorial.

Q-learning is one of many reinforcement learning techniques and has a number of advantages over similar algorithms – namely, it is simple and robust. A very cursory introduction follows. Q-learning can be broken up into a training phase and an exploitation phase of activity. During the training phase the agent explores its environment, taking random actions. During this time it populates a table, called the Q table, with value associations between the actions it has taken and the rewards it has received. This table then becomes the basis for how the agent makes decisions in the exploitation phase of its activity: actions that led to rewards get repeated. Rewards can also take negative values if one wishes the agent to avoid some action. Thus Q-learning supplies both a stick and a carrot to drive character behavior in a manner consistent with how humans and many other animals learn. It can be used to replace or supplement the utility-based calculations that previously endowed artificial agents with dynamic, lifelike behavior.
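As a rough illustration of these two phases, the Python sketch below separates random exploration during training from exploitation of the learned table afterwards. The function name, table shape, and random policy are assumptions made for illustration, not the project's actual code.

```python
import numpy as np

# Minimal sketch of the two phases described above. The Q table is assumed
# to be an (n_states x n_actions) NumPy array indexed by integer ids.
def choose_action(q_table, state, n_actions, training, rng):
    if training:
        # Training phase: explore the environment by taking a random action.
        return int(rng.integers(n_actions))
    # Exploitation phase: repeat the action with the highest learned value.
    return int(np.argmax(q_table[state]))
```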

In the behavior system presented, I make use of Q-learning and its tables to drive learning in a synthetic agent, or NPC (non-player character). The Q-learning algorithm is essentially a method of backward induction, allowing the agent to redistribute reward information backward in time across the states and actions that got it to the desired location. A reward requiring an arbitrarily long series of actions can thus be parceled out along the preceding states the agent passed through to get there. When this process is repeated enough times, like cream rising to the surface, the agent learns which actions and environmental cues were instrumental in reaching the reward and which were not.

The Q-learning equation is given below, where state and action pairs refer to coordinates within the Q table and rewards table, respectively, and gamma is a fixed discount rate between 0 and 1. Gamma controls how much the agent values present rewards versus future ones.

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
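Translated into code, the update might look like the following minimal Python sketch; the table shapes, indexing scheme, and default gamma value are assumptions for illustration.

```python
import numpy as np

def q_update(Q, R, state, action, next_state, gamma=0.8):
    """Apply the update equation above to one (state, action) cell.

    Q and R are assumed to be (n_states x n_actions) NumPy arrays indexed
    by integer state and action ids; gamma is the discount rate in [0, 1).
    """
    Q[state, action] = R[state, action] + gamma * np.max(Q[next_state])
    return Q
```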

In the behavior system presented, Q-learning is used to solve a match-to-sample task in which the NPC learns that it must activate a switch within the game environment while a light is on in order to receive a "food reward". The match-to-sample task has been used in a wide variety of animal learning experiments that explore associative learning and working memory.

An example of a real match-to-sample task carried out on primates. Image citation: Porrino LJ, Daunais JB, Rogers GA, Hampson RE, Deadwyler SA (2005) Facilitation of Task Performance and Removal of the Effects of Sleep Deprivation by an Ampakine (CX717) in Nonhuman Primates. PLoS Biol 3(9): e299.

The key point is that the agent must learn to predict that it can take an action to receive a reward only under specific circumstances. In the match-to-sample task on which the Q-learning behavior system is tested, the agent learns that it can receive a reward when the light is on and it first touches the switch and then proceeds to the gold food bowl. The same actions taken when the green light is off will not generate the reward.

The setup for the task begins with a training phase in which the agent randomly travels between four locations – three food bowls and one switch, represented by balls and a cone, respectively. During this training period, it learns value associations for each of these elements, how they are affected by the light that periodically turns on and off, and how they relate to its own actions. A rough encoding of this setup is sketched below.
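The following Python sketch is one hypothetical way to encode the match-to-sample environment; the location names, state layout, and reward value are all assumptions made for illustration rather than the project's actual implementation.

```python
from itertools import product

# Hypothetical state space: light status, whether the switch has already
# been activated, and the agent's current location (three bowls or the switch).
LOCATIONS = ["bowl_a", "bowl_b", "bowl_gold", "switch"]
STATES = list(product([False, True],   # light on?
                      [False, True],   # switch already activated?
                      LOCATIONS))      # current location

def reward(light_on, switch_activated, location):
    # The "food reward" is only delivered at the gold bowl when the light
    # is on and the switch has already been touched; all else earns nothing.
    if light_on and switch_activated and location == "bowl_gold":
        return 1.0
    return 0.0
```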

Screen grab from the Q-learning behavior system as the agent learns to solve the match-to-sample task.

After exploring long enough, the agent will display intentional behavior by first going to the switch (cone) and then going to the food bowl when the light is on in order to receive rewards. The same system can be used to provide a wide variety of intentional behaviors, including avoiding enemy players, collecting health points, and many of the behaviors a human or animal is capable of manifesting within a game environment. In a modified version of this example, the agent learns to approach the player’s character for "food rewards" the same way a dog or cat might learn to approach a human master for a treat.

Unlike other reinforcement learning tasks, this one can be calibrated to produce results that mimic real-world cognitive agents. By adding a simple learning rate to the Q-learning update, one can produce learning curves that mimic those of animal test subjects. Thus the match-to-sample task provides a way to fine-tune an agent's "brain" and produce normative results that are consistent with many examples of animal behavior under similar conditions.
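One way to introduce such a learning rate, sketched below, is to blend each update toward its target by an assumed factor alpha instead of overwriting the cell outright; the parameter values here are illustrative rather than taken from the project.

```python
import numpy as np

def q_update_with_learning_rate(Q, R, state, action, next_state,
                                gamma=0.8, alpha=0.1):
    # The target is the same quantity as in the original update equation.
    target = R[state, action] + gamma * np.max(Q[next_state])
    # Move only a fraction (alpha) of the way toward the target per visit,
    # stretching learning out over many trials like an animal subject.
    Q[state, action] += alpha * (target - Q[state, action])
    return Q
```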

[Aaron Krumins](mailto: [email protected]) is a freelance developer of AI-related software.

