PODCAST

AI alignment at OpenAI

Jan Leike on the state of frontier AI alignment research

Jeremie Harris
Towards Data Science
4 min read · Sep 29, 2021


APPLE | GOOGLE | SPOTIFY | OTHERS

Editor’s note: The TDS Podcast is hosted by Jeremie Harris, who is the co-founder of SharpestMinds, a data science mentorship startup. Every week, Jeremie chats with researchers and business leaders at the forefront of the field to unpack the most pressing questions around data science, machine learning, and AI.

The more powerful our AIs become, the more we’ll have to ensure that they’re doing exactly what we want. If we don’t, we risk building AIs that pursue creative solutions with unintended side effects, some merely undesirable and others downright dangerous. Even a slight misalignment between the motives of a sufficiently advanced AI and human values could be hazardous.

That’s why leading AI labs like OpenAI are already investing significant resources into AI alignment research. Understanding that research is important if you want to know where advanced AI systems might be headed, and what challenges we might encounter as AI capabilities continue to grow, and that’s what this episode of the podcast is all about. My guest today is Jan Leike, head of AI alignment at OpenAI and an alumnus of DeepMind and the Future of Humanity Institute. As someone who works directly with some of the world’s largest AI systems (including OpenAI’s GPT-3), Jan has a unique and interesting perspective to offer, both on the current challenges facing alignment researchers and on the most promising directions the field might take.

Here were some of my favourite take-homes from the conversation:

  • Historically, AI alignment research has been theory-heavy. That’s understandable: until recently, AIs just haven’t been capable enough for alignment problems to really matter. Jan generally favors empirical alignment research and experiments on live systems over pure theory, and would be happy to see more alignment researchers move in that direction as well. Though he thinks pure theory has a place (particularly when considering systems with capabilities we can’t yet achieve, where theory is the only available option), he also considers a large body of alignment research to be insufficiently concrete. An experimental focus helps to ground intuitions about the behavior of real systems, and Jan argues that it leads to more useful insights.
  • One of the challenges with aligning very powerful AI systems is that they’ll potentially conceive of solutions so complex that humans won’t be able to audit them to determine whether they’re sensible and ethical. One strategy that Jan has been exploring to address this problem is called recursive reward modeling (RRM). Through RRM, a series of “helper” AIs are trained to assist a human in evaluating the performance of a more complex AI, and the hope is that this helper-augmented human would then be in a position to assess the complex AI’s behavior and level of alignment (a toy sketch of the idea follows this list).
  • More generally, Jan is optimistic about the prospect of building what he calls an “alignment MVP”. In startup language, an MVP, or minimum viable product, is the simplest value-creating product a company can build to test an idea. To Jan, an alignment MVP would be an AI system capable of matching human performance on the specific task of AI alignment research, while itself being sufficiently aligned with human values to produce useful results. Jan’s hope is that from that point on, alignment research can be augmented, and ultimately taken over, by AIs themselves, which could help human researchers arrive at a more complete solution to the general alignment problem.
  • Jan points out that even after the alignment problem is solved, many of the most difficult questions associated with the development of transformative AI will still be ahead of us. We’ll have to think about how to distribute access to the technology, and about which human values, specifically, advanced AIs should be aligned with. Issues like worker displacement and international dynamics will also have to be addressed. OpenAI has a team of policy experts who work alongside its capabilities and alignment researchers to anticipate these challenges.
  • OpenAI has just announced the release of an AI system that summarizes full-length books. Check it out here!
  • OpenAI is hiring! If you’re interested in working on aligning powerful AI systems, check out their open roles here.
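
To make the recursive reward modeling idea a bit more concrete, here is a minimal toy sketch in Python. It is not OpenAI’s implementation: the functions (human_judgement, train_helper_reward_model, assisted_human_judgement) and the toy data are hypothetical placeholders, included only to illustrate the control flow of training helpers on simple tasks and then using them to assist a human evaluator on a harder one.

```python
# Toy sketch of the recursive reward modeling (RRM) idea described above.
# NOT OpenAI's implementation: every function and data point is a placeholder
# chosen only to make the control flow runnable.

def human_judgement(output: str) -> float:
    """Stand-in for a human directly rating a simple output."""
    return 1.0 if "key point" in output else 0.0

def train_helper_reward_model(labelled_examples):
    """Toy 'helper' reward model: remembers which simple outputs humans approved."""
    approved = {text for text, score in labelled_examples if score > 0.5}
    return lambda text: 1.0 if text in approved else 0.0

def assisted_human_judgement(complex_output, helper):
    """A human evaluating a harder task with assistance: the helper scores the
    simple pieces the complex output decomposes into, and the human aggregates
    those scores instead of auditing every piece directly."""
    piece_scores = [helper(piece) for piece in complex_output]
    return sum(piece_scores) / len(piece_scores)

# Level 0: humans label simple outputs directly.
simple_outputs = ["key point A", "irrelevant text", "key point B"]
labels = [(o, human_judgement(o)) for o in simple_outputs]
helper = train_helper_reward_model(labels)

# Level 1: the helper assists evaluation of a more complex output
# (here, a list of section summaries standing in for a book summary).
complex_output = ["key point A", "key point B", "filler"]
print("assisted score:", assisted_human_judgement(complex_output, helper))
```

In a real system the helpers would themselves be learned models trained with human feedback, and the recursion could continue for several levels; the sketch only shows the single hand-off from direct human judgement to helper-assisted judgement.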

You can follow Jan on Twitter here, or me here.

Chapters:

  • 0:00 Intro
  • 1:35 Jan’s background
  • 7:10 Timing of scalable solutions
  • 16:30 Recursive reward modeling
  • 24:30 Amplification of misalignment
  • 31:00 Community focus
  • 32:55 Wireheading
  • 41:30 Arguments against the democratization of AIs
  • 49:30 Differences between capabilities and alignment
  • 51:15 Research to focus on
  • 1:01:45 Formalizing an understanding of personal experience
  • 1:04:04 OpenAI hiring
  • 1:05:02 Wrap-up
