PODCAST

New Research: Advanced AI may tend to seek power *by default*

Edouard Harris on AI safety and the risks that come with advanced AI

Jeremie Harris
Towards Data Science
5 min read · Oct 12, 2022


APPLE | GOOGLE | SPOTIFY | OTHERS

Editor’s note: The TDS Podcast is hosted by Jeremie Harris, who is the co-founder of Gladstone, an AI safety startup. Every week, Jeremie chats with researchers and business leaders at the forefront of the field to unpack the most pressing questions around data science, machine learning, and AI.

Progress in AI has been accelerating dramatically in recent years, and even in recent months. It seems like every other day, a world-leading lab achieves some feat of AI that was previously believed to be impossible. And increasingly, these breakthroughs have been driven by the same simple idea: AI scaling.

For those who haven’t been following the AI scaling saga, scaling means training ever-larger AI models, using increasingly absurd quantities of data and processing power. So far, empirical studies by the world’s top AI labs seem to suggest that scaling is an open-ended process that can lead to more and more capable and intelligent systems, with no clear limit.

And that’s led many people to speculate that scaling might usher in a new era of broadly human-level or even superhuman AI — the holy grail AI researchers have been after for decades.

And while that might sound cool, an AI that can solve general reasoning problems as well as or better than a human might actually be an intrinsically dangerous thing to build.

At least, that’s the conclusion that many AI safety researchers have come to following the publication of a new line of research that explores how modern AI systems tend to solve problems, and whether we should expect more advanced versions of them to perform dangerous behaviours like seeking power.

This line of AI safety research is known as “power-seeking”, and although it’s not yet well understood outside the frontier of AI safety and alignment work, it’s starting to draw a lot of attention. The first major theoretical study of power-seeking, for example, was led by Alex Turner, who’s appeared on the podcast before, and was published at NeurIPS, one of the world’s top AI conferences.

And today, we’ll be hearing from Edouard Harris, an AI alignment researcher and one of my co-founders at the AI safety company Gladstone AI. Ed’s just completed a significant piece of AI safety research that extends Alex Turner’s original power-seeking work, and that provides what seems to be the first experimental evidence that we should expect highly advanced AI systems to seek power by default.

What does power-seeking really mean, though? And what does all this imply for the safety of future, general-purpose reasoning systems? That’s what this episode will be all about.

Here are some of my favourite take-homes from the conversation:

  • There are some goals that you’ll tend to want to pursue regardless of what your ultimate objectives are in life — goals that you’ll want to pursue even if you don’t know what your objectives are. For example, you’ll never not want an additional $10M in your bank account, because no matter what your objectives are, or turn out to be, an extra $10M won’t make them harder to achieve, and will probably make them easier to reach. Likewise, no matter what your ultimate objectives are, you’ll never not want to be more intelligent, because more intelligence is helpful for pursuing any objective you might want. So collecting more resources and becoming more intelligent turn into sub-goals that just about everyone consistently converges on. In AI safety, goals like these are known as “instrumental goals”.
  • Instrumental goals don’t just apply to humans. As we saw in our conversation with Alex Turner, there are strong reasons to believe that powerful AI systems will end up pursuing them, too. According to an increasing body of theoretical research, we should expect AIs — like humans — to converge on certain instrumental goals by default (an idea known as “instrumental convergence”). Almost no matter what objective an AI has been trained to pursue, it will be less likely to achieve that objective if it’s turned off, since being “on” is the only way it can continue to influence the world. And for similar reasons, we should expect AIs to converge on other instrumental goals like self-improvement and resource aggregation. Collectively, these and other related behaviors are known as “power-seeking” behaviors in AI safety.
  • Alex Turner’s work showed theoretically that AI agents will tend to engage in power-seeking behaviors for the overwhelming majority of objectives we might train them to achieve. He did this by introducing a rigorous mathematical definition of power — something no one had done before. After a decade of hand-waving arguments in the AI safety world, that definition finally framed the power-seeking debate in quantitative terms. (A rough paraphrase of the definition appears after this list.)
  • Ed’s work extended Alex’s definition, and allowed him to explicitly investigate interactions between different intelligent agents. It represents the first experimental demonstration of power-seeking in AI.
  • In his experiments, he started by simulating a “human” agent, which was allowed to learn an optimized reward-collection strategy over the environment it was placed in. Ed uses the analogy of a maze that contains a piece of cheese, which represents the agent’s reward: the human agent learns an optimal strategy for navigating that maze and collecting the cheese. And importantly, the environment (the maze) was static. This reflects the reality that human beings learn much faster than nature does. Humans learn on brain clock time, whereas nature “learns” or optimizes on glacially slow evolutionary time, requiring generations of selection to produce meaningfully new species. As a result, nature appears roughly static to us: there are trees that were planted before the industrial revolution, and whose fate is now entirely in human hands, for example.
  • Having allowed his simulated human agent to optimize over this environment, Ed freezes the human’s optimization process, preventing it from learning further. Then, he introduces a highly advanced “AI agent” into his simulation. This agent is able to run its own optimization process over the combined human agent + environment system, which, relative to the AI, remains static. And again, this makes sense: whatever form human-level AI may take, it will run on computer clock time, compared to which human biological brain time will be so slow as to appear basically static. (A toy sketch of this two-stage setup is shown after this list.)
  • Ed then investigates how the power of each agent influences the other. He explores cases in which the agents end up competing (and in which the human, being an inferior reasoning machine, would necessarily lose) and cases in which they collaborate. His conclusions were quite striking: it turns out that even if the human and the AI agent have goals that are independent of one another (and that don’t directly contradict one another), they end up competing over instrumental goals. We go into some detail unpacking other aspects of his research, which provides evidence that we should expect advanced AI and humans to compete over resources and “power” (mathematically defined) by default, and that preventing that from happening may turn out to require extreme precision in AI system design.
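On Alex Turner’s definition of power mentioned above: as I understand his NeurIPS paper (“Optimal Policies Tend to Seek Power”), the power of a state is roughly the normalized average optimal value an agent could attain from that state, taken over a distribution of possible reward functions. Treat the formula below as a paraphrase of that idea rather than the paper’s exact formulation:

```latex
% Rough paraphrase of the average-optimal-value notion of power.
% Notation: D is a distribution over reward functions R, V*_R is the
% optimal value function for R, and gamma is the discount factor.
\[
  \mathrm{POWER}_{\mathcal{D}}(s, \gamma)
    \;=\; \frac{1 - \gamma}{\gamma}\,
          \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s, \gamma) - R(s) \,\right]
\]
% Intuition: states from which many different goals can be pursued well
% (more options, more reachable futures) have higher power, which is why
% keeping options open -- and staying switched on -- is instrumentally useful.
```

Read loosely: a state is powerful to the extent that an agent starting there would do well on average, no matter which goal it turns out to have.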
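And here is a minimal, hypothetical sketch of the two-stage setup Ed describes — not his actual code or environment, just an illustration of the idea. In stage one, a “human” agent optimizes over a static maze (here via value iteration); in stage two, the human’s policy is frozen, and a faster “AI” optimizer would then treat the frozen human plus the maze as its own static environment:

```python
import numpy as np

# Hypothetical toy version of the setup: stage 1 -- a "human" agent learns an
# optimal policy over a static maze; stage 2 -- the human's policy is frozen,
# and a faster "AI" optimizer would treat (frozen human + maze) as static.

GRID = [
    "....#",
    ".##.#",
    ".#..C",  # 'C' marks the cheese (the human agent's reward)
    ".#.##",
    ".....",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GAMMA = 0.95
ROWS, COLS = len(GRID), len(GRID[0])


def reward(r, c):
    """Reward for arriving at cell (r, c): 1 for the cheese, 0 elsewhere."""
    return 1.0 if GRID[r][c] == "C" else 0.0


def step(r, c, dr, dc):
    """Deterministic transition: move unless blocked by a wall or the edge."""
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS and GRID[nr][nc] != "#":
        return nr, nc
    return r, c


# Stage 1: the human optimizes over the static maze (nature doesn't change
# on the timescale the human learns on).
V = np.zeros((ROWS, COLS))
for _ in range(200):
    new_V = np.zeros_like(V)
    for r in range(ROWS):
        for c in range(COLS):
            if GRID[r][c] == "#":
                continue
            new_V[r, c] = max(
                reward(*step(r, c, dr, dc)) + GAMMA * V[step(r, c, dr, dc)]
                for dr, dc in ACTIONS
            )
    V = new_V


# Stage 2: freeze the human. From here on it acts greedily with respect to its
# learned values and never updates again; a second, much faster "AI" agent
# would then run its own optimization over the combined (human + maze) system.
def frozen_human_policy(r, c):
    return max(
        ACTIONS,
        key=lambda a: reward(*step(r, c, *a)) + GAMMA * V[step(r, c, *a)],
    )


print(np.round(V, 2))             # learned values for each maze cell
print(frozen_human_policy(0, 0))  # the frozen human's move from the top-left corner
```

The point of the freeze is the timescale argument from the bullets above: whatever is optimizing faster gets to treat the slower optimizer as just another fixed part of its environment.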

Chapters:

  • 0:00 Intro
  • 4:00 Alex Turner’s research
  • 7:45 What technology wants
  • 11:30 Universal goals
  • 17:30 Connecting observations
  • 24:00 Micro power seeking behaviour
  • 28:15 Ed’s research
  • 38:00 The human as the environment
  • 42:30 What leads to power seeking
  • 48:00 Competition as a default outcome
  • 52:45 General concern
  • 57:30 Wrap-up


Co-founder of Gladstone AI 🤖 an AI safety company. Author of Quantum Mechanics Made Me Do It (preorder: shorturl.at/jtMN0).