PODCAST

The Inner Alignment Problem

Evan Hubinger on building safe and honest AIs

Jeremie Harris
Towards Data Science
4 min read · Jun 9, 2021


APPLE | GOOGLE | SPOTIFY | OTHERS

Editor’s note: This episode is part of our podcast series on emerging problems in data science and machine learning, hosted by Jeremie Harris. Apart from hosting the podcast, Jeremie helps run a data science mentorship startup called SharpestMinds.

How can you know that a superintelligent AI is trying to do what you asked it to do?

The answer, it turns out, is: not easily. And unfortunately, an increasing number of AI safety researchers are warning that this is a problem we’re going to have to solve sooner rather than later, if we want to avoid bad outcomes — which may include a species-level catastrophe.

In AI safety, the failure mode in which an AI ends up optimizing for something other than what we asked it to optimize for is known as an inner alignment failure. It's distinct from an outer alignment failure, which happens when the objective you give your AI turns out to be dangerous to pursue. Inner alignment was only recognized by AI safety researchers as its own category of risk in 2019, and the researcher who led that effort, Evan Hubinger, is my guest for this episode of the podcast.

Evan is an AI safety veteran who’s done research at leading AI labs like OpenAI, and whose experience also includes stints at Google, Ripple and Yelp. He currently works at the Machine Intelligence Research Institute (MIRI) as a Research Fellow, and joined me to talk about his views on AI safety, the alignment problem, and whether humanity is likely to survive the advent of superintelligent AI.

Here are some of my favourite take-homes from the conversation:

  • Evan shared his toy example of an inner alignment failure: imagine a maze-solving AI trained on a dataset of mazes to reach the center of each maze. Because the center tile happens to be labeled with a green arrow during training, the AI, unbeknownst to its developers, actually ends up learning that its objective is to find the green arrow rather than to solve the maze. So when that AI is placed in a new setting, one in which the green arrow has been moved to a completely different location that has nothing to do with the solution to the maze, it fails to solve the new problem. While this may seem like a fairly innocent issue, it's not hard to imagine other cases where even a slight mismatch between what a developer thinks an AI is optimizing for and what it's actually doing could become very dangerous. (A toy code sketch of this failure appears after this list.)
  • We discussed whether AI safety actually needs more researchers or more funding. While Evan believes it generally does (and that there are strong reasons to think it's the single most important problem anyone could work on), we also discussed the value of keeping the community small, tightly focused, high-signal, and low-noise. There's a clear trade-off: AI safety research requires a lot of very careful thinking, which would be much harder if the field became flooded with interest and inherited the academic incentives that have arguably led to replication crises and a glut of low-quality research in other domains.
  • We explored a number of processes, from evolution to child-rearing to the formation of companies, through the lens of inner alignment, and I found this part of the conversation especially interesting. To crudely summarize: the universe “tries” to align humans with the goal of genetic propagation, but fails (humans invented contraceptives, and don’t spend every waking second figuring out how to procreate); parents try to align their children with their values, but fail (kids develop their own goals, and deceive their parents to achieve them); and society tries to align companies with the objective of producing goods and services that add net value to people’s lives, but fails (companies pursue their own goals and, much like children, develop incentives to deceive the society around them in order to maximize profits).
  • Evan thinks that alignment is a harder problem to solve than intelligence. As a result, he’s pessimistic that we’ll have a solution to the full AI alignment problem before we build superintelligent systems. While he’s deeply uncertain — and makes a point of emphasizing it — he thinks there’s a greater than 50% chance that humanity will face extinction due to risks posed by advanced AI systems. Nonetheless, the stakes are so high that he considers the problem worth tackling head-on despite the challenges.
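To make the maze example concrete, here is a minimal toy sketch in Python. It is my own illustration, not code from the episode: the grid size, the make_maze helper, and the arrow_seeking_policy function are hypothetical stand-ins for the learned policy Evan describes. The "policy" that learned to chase the green arrow looks perfect on training-style mazes, where the arrow happens to mark the exit, and fails completely once the arrow is moved.

```python
# Toy sketch (hypothetical, not from the episode): a policy that learned
# the proxy objective "go to the green arrow" instead of "reach the exit".
import random

GRID = 5  # 5x5 maze grid, chosen arbitrarily for illustration


def make_maze(arrow_at_exit: bool):
    """Return (exit_pos, arrow_pos) for one maze.

    During training the arrow always sits on the exit tile; at deployment
    it is placed somewhere else entirely.
    """
    exit_pos = (random.randrange(GRID), random.randrange(GRID))
    if arrow_at_exit:
        arrow_pos = exit_pos
    else:
        while True:
            arrow_pos = (random.randrange(GRID), random.randrange(GRID))
            if arrow_pos != exit_pos:
                break
    return exit_pos, arrow_pos


def arrow_seeking_policy(exit_pos, arrow_pos):
    """The objective the AI actually learned: head for the green arrow."""
    return arrow_pos


def success_rate(n: int, arrow_at_exit: bool) -> float:
    """Fraction of mazes where the policy ends up on the true exit."""
    hits = 0
    for _ in range(n):
        exit_pos, arrow_pos = make_maze(arrow_at_exit)
        if arrow_seeking_policy(exit_pos, arrow_pos) == exit_pos:
            hits += 1
    return hits / n


print("training-style mazes :", success_rate(1000, arrow_at_exit=True))   # ~1.0
print("deployment mazes     :", success_rate(1000, arrow_at_exit=False))  # 0.0
```

The point isn't the code itself: on the training distribution, "reach the exit" and "reach the arrow" produce identical behavior, so nothing observable during training reveals which objective the system actually learned. The mismatch only shows up once the environment changes.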

You can follow Evan on Twitter here, or me here.

Chapters:

  • 0:00 Intro
  • 1:45 AGI safety research
  • 8:20 AI’s capabilities
  • 14:40 Inner alignment
  • 25:12 Anthropomorphizing these AI systems
  • 29:15 Prosaic AI alignment
  • 37:36 Evolution through the lens of inner alignment
  • 49:06 Inner alignment failure
  • 54:32 Inner alignment and parenting
  • 1:00:35 Startups and the needs of the market
  • 1:03:30 Optimism for AI alignment
  • 1:09:20 Wrap-up


Co-founder of Gladstone AI 🤖, an AI safety company. Author of Quantum Mechanics Made Me Do It (preorder: shorturl.at/jtMN0).