PODCAST

Sample-efficient AI

Yang Gao on making AIs that learn as fast as humans

Jeremie Harris
Towards Data Science
5 min read · Dec 8, 2021


APPLE | GOOGLE | SPOTIFY | OTHERS

Editor’s note: The TDS Podcast is hosted by Jeremie Harris, who is the co-founder of Mercurius, an AI safety startup. Every week, Jeremie chats with researchers and business leaders at the forefront of the field to unpack the most pressing questions around data science, machine learning, and AI.

Historically, AI systems have been slow learners. For example, a computer vision model often needs to see tens of thousands of hand-written digits before it can tell a 1 apart from a 3. Even game-playing AIs like DeepMind’s AlphaGo, or its more recent descendant MuZero, need far more experience than humans do to master a given game.

So when someone develops an algorithm that can reach human-level performance at anything as fast as a human can, it’s a big deal. And that’s exactly why I asked Yang Gao to join me on this episode of the podcast. Yang is an AI researcher with affiliations at Berkeley and Tsinghua University who recently co-authored a paper introducing EfficientZero: a reinforcement learning system that learned to play Atari games at a human level after just two hours of in-game experience. It’s a tremendous breakthrough in sample efficiency, and a major milestone in the development of more general and flexible AI systems.

Here are some of my favourite take-homes from the conversation:

  • Since AlphaGo, AI researchers have recognized the promise of integrating reinforcement learning with search methods, which consider many of the potential next actions available to an RL agent and simulate what their results might be before choosing one (a toy sketch of this lookahead idea appears after this list). This mimics human deliberation much more closely by explicitly introducing elements of “planning” into the RL paradigm. Yang attributes the huge performance improvements of AlphaGo, AlphaZero and MuZero to this search process.
  • Another important distinction in RL is between model-based systems, which construct explicit models of their environments, and model-free systems, which don’t. Prior to AlphaGo, just about all leading RL work was done on model-free systems (PPO and deep Q-learning, for example). Model-based systems just weren’t practical, because learning a model of the environment is hard and adds a significant layer of complexity on top of the simpler action-selection task that model-free systems could focus on exclusively. But now that greater computational resources and a few new algorithmic tricks are available, model-based systems are rapidly emerging as more flexible and increasingly capable options.
  • DeepMind’s MuZero was a major step towards practical model-based RL. MuZero was designed to play a variety of video games, using the pixels on a screen as its inputs. However, unlike its predecessors, the environment model it constructed didn’t try to predict how every in-game pixel would change in future time steps. Instead, it mapped the game environment to a latent space, compressing its representation of the game so that it only included relevant information (see the latent-model sketch after this list). This lower-dimensional representation made it much easier to predict the game environment’s salient features in future time steps, and it much more closely matches the way humans learn to play games: when we play soccer, for example, we don’t keep track of every blade of grass on the field or the expressions on the faces of each player. Instead, we maintain a very stripped-down mental model of the arena that includes a few details like the locations of the players, the position and velocity of the ball, and so on.
  • EfficientZero improved on MuZero in a number of ways. First, it’s better able to consider hypotheticals of the form, “what would the future look like if I performed action X?” This ability is developed through a self-supervised learning process, in which EfficientZero checks its imagined outcomes against its past experience of how the environment actually responded to similar actions (sketched as a consistency loss after this list).
  • Second, EfficientZero also took on the so-called “aliasing problem”: a well-known issue in reinforcement learning that arises because RL agents’ environment models are often designed to predict the exact time step at which a key event will occur (for example, the precise video frame in which a soccer ball will fly into a goal). That level of precision isn’t required, and in fact is counter-productive, because it leads to an over-sensitive learning signal: even if an agent’s environment model gets just about everything right, a prediction that’s off by a fraction of a second earns it no reward! EfficientZero corrects this by coarse-graining the time dimension, ensuring that the model is rewarded for predictions that are “close enough for practical purposes” (a minimal version of this windowed prediction target is sketched after this list). Again, there’s a powerful analogy to human learning here: good teachers give students part-marks for getting a problem mostly right, instead of offering binary 100%/0% scores. This gives students more signal to latch onto, and it also avoids overfitting to levels of detail that don’t have practical significance.
  • It’s not immediately obvious how to compare EfficientZero’s sample efficiency to that of a human being. The approach taken in the EfficientZero paper was to have a group of humans play a range of different Atari games, record their median and average scores after two hours, and compare those to EfficientZero’s performance after it was trained on the same amount of in-game time. And while that strategy does have EfficientZero comparing favourably to humans, it almost certainly understates EfficientZero’s actual sample efficiency. In some sense, Yang argues, an 8-year-old human who picks up an Atari game for the first time has been preparing (training) for that game their whole life, picking up on cause-and-effect dynamics and even cultural cues that help them navigate a game they’ve never seen before. EfficientZero, on the other hand, has to learn all of that from scratch every time it’s presented with a new game.
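
To make the “simulate before you act” idea from the first bullet concrete, here’s a toy sketch. It isn’t AlphaGo’s or MuZero’s actual Monte Carlo tree search, just a one-step lookahead; `simulate` and `score` are hypothetical stand-ins for a learned environment model and a value estimate.

```python
# Toy one-step lookahead (illustrative only): imagine each candidate action
# with a model of the environment, then act on the one that looks best.
def plan_one_step(state, actions, simulate, score):
    best_action, best_value = None, float("-inf")
    for action in actions:
        next_state, reward = simulate(state, action)  # imagined consequence of taking `action`
        value = reward + score(next_state)            # immediate reward + estimated future value
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```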
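
The latent-space idea from the MuZero bullet can be sketched as two pieces: a representation network that compresses raw frames into a small latent state, and a dynamics network that predicts the next latent state and reward without ever returning to pixel space. This is a simplified illustration with assumed layer sizes (PyTorch), not DeepMind’s actual architecture.

```python
import torch
import torch.nn as nn

LATENT_DIM = 256   # size of the compressed game state (assumed)
NUM_ACTIONS = 18   # full Atari action set

class Representation(nn.Module):
    """Compress a stack of raw game frames into a small latent state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(LATENT_DIM),
        )

    def forward(self, frames):            # frames: (batch, 4, 84, 84)
        return self.net(frames)           # latent state: (batch, LATENT_DIM)

class Dynamics(nn.Module):
    """Predict the next latent state and reward from (latent state, action)."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(LATENT_DIM + NUM_ACTIONS, 512), nn.ReLU())
        self.next_state = nn.Linear(512, LATENT_DIM)
        self.reward = nn.Linear(512, 1)

    def forward(self, state, action_onehot):
        h = self.trunk(torch.cat([state, action_onehot], dim=-1))
        return self.next_state(h), self.reward(h)
```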
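
The self-supervised hypothetical-checking from the EfficientZero bullet can then be written, very loosely, as a consistency loss: the latent state the dynamics network imagines for the next time step should agree with the latent state the representation network produces once it actually sees the next frames. This reuses the toy modules above and only gestures at the idea; the paper’s actual loss is more elaborate.

```python
import torch.nn.functional as F

def consistency_loss(repr_net, dyn_net, frames_t, action_onehot, frames_t1):
    state_t = repr_net(frames_t)                        # what the agent sees now
    predicted_t1, _ = dyn_net(state_t, action_onehot)   # what it imagines happens next
    actual_t1 = repr_net(frames_t1).detach()            # what actually happened next
    # negative cosine similarity: loss is low when imagination matches reality
    return -F.cosine_similarity(predicted_t1, actual_t1, dim=-1).mean()
```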
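
Finally, the “coarse-grained time” fix amounts to changing the prediction target: rather than the reward at one exact frame, the model is trained against the sum of rewards over a short window (the EfficientZero paper calls this a value prefix), so a prediction that lands a frame or two early or late still earns credit. A minimal way to build such targets, with an assumed window length:

```python
import torch

def value_prefix_targets(rewards, window=5):
    """rewards: (batch, T) per-step rewards -> (batch, T - window + 1) windowed sums."""
    prefix = torch.cumsum(rewards, dim=1)
    prefix = torch.cat([torch.zeros_like(prefix[:, :1]), prefix], dim=1)
    # sum of rewards over [t, t + window) as a difference of two running totals
    return prefix[:, window:] - prefix[:, :-window]
```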

You can follow Yang on Twitter here.

Chapters:

  • 0:00 Intro
  • 1:50 Yang’s background
  • 6:00 MuZero’s activity
  • 13:25 MuZero to EfficientZero
  • 19:00 Sample efficiency comparison
  • 23:40 Leveraging algorithmic tweaks
  • 27:10 Importance of evolution to human brains and AI systems
  • 35:10 Human-level sample efficiency
  • 38:28 Existential risk from AI in China
  • 47:30 Evolution and language
  • 49:40 Wrap-up


Co-founder of Gladstone AI 🤖 an AI safety company. Author of Quantum Mechanics Made Me Do It (preorder: shorturl.at/jtMN0).