Podcast

Making AI Safe through Debate

Ethan Perez explains how AI debate could get us to superintelligence — safely

Jeremie Harris
Towards Data Science
38 min read · Mar 10, 2021


To select chapters, visit the YouTube video here.

Editor’s note: This episode is part of our podcast series on emerging problems in data science and machine learning, hosted by Jeremie Harris. Apart from hosting the podcast, Jeremie helps run a data science mentorship startup called SharpestMinds. You can listen to the podcast below:

APPLE | GOOGLE | SPOTIFY | OTHERS

Most AI researchers are confident that we will one day create superintelligent systems — machines that can significantly outperform humans across a wide variety of tasks.

If this ends up happening, it will pose some potentially serious problems. Specifically: if a system is superintelligent, how can we maintain control over it? That’s the core of the AI alignment problem — the problem of aligning advanced AI systems with human values.

A full solution to the alignment problem will have to involve at least two things. First, we’ll have to know exactly what we want superintelligent systems to do, and make sure they don’t misinterpret us when we ask them to do it (the “outer alignment” problem). But second, we’ll have to make sure that those systems are genuinely trying to optimize for what we’ve asked them to do, and that they aren’t trying to deceive us (the “inner alignment” problem).

Creating systems that are inner-aligned and superintelligent might seem like different problems — and many think that they are. But in the last few years, AI researchers have been exploring a new family of strategies that some hope will allow us to achieve both superintelligence and inner alignment at the same time. Today’s guest, Ethan Perez, is using these approaches to build language models that he hopes will form an important part of the superintelligent systems of the future. Ethan has done frontier research at Google, Facebook, and MILA, and is now working full-time on developing learning systems with generalization abilities that could one day exceed those of human beings.

Here were some of my favourite take-homes from the conversation:

  • One of the challenges with researching ways to create superintelligent AI systems is that it’s fairly unclear what superintelligence means. Is a system that’s smarter than 90% of humans superintelligent? What about 99%? What if a system is only as smart as the median human, but can think millions of times faster because it works on computer clock time? Or what if a human-level intelligence can simply be replicated a huge number of times, allowing it to explore many different possibilities in parallel? There’s no universally accepted set of answers to these questions.
  • It’s plausible that supervised learning techniques won’t be able to get us all the way to superintelligent systems. That’s because they rely on data generated by human beings, and can only learn to perform as well as those human beings (with some added advantages I won’t get into). From one perspective, supervised learning systems are arguably “emulators” rather than “original thinkers”, but Ethan thinks it may be possible to extend their capabilities into superintelligent territory by teaching them a simple bit of logic.
  • That logic is decomposition: Ethan’s work involves training language models to break down abstract and complex questions (like, “should apples be illegal?”) into the simpler and more tractable sub-questions that they depend on (like, “how harmful are apples?”, and “how harmful would something have to be in order for us to outlaw it?”). The decomposition is then applied again to the sub-questions, and the process can be repeated as many times as necessary to ensure that the final sub-questions are simple. If those final questions are simple enough, a human-trained language model like GPT-3 can be used to answer them directly. This technique belongs to a family of strategies collectively called iterated distillation and amplification (IDA). A rough sketch of the recursive process appears in code right after this list.
  • One benefit to this approach is that it’s human-understandable. At least in principle, a human being could investigate each branching sub-question and satisfy themselves that the logic being used by the AI is sound, and that the AI isn’t trying to deceive them (which is a serious concern that’s currently being explored in depth in the AI alignment literature). But in practice, the question and sub-question trees that Ethan’s systems generate would be far too large for a human to parse in detail. As a result, Ethan is exploring approaches that could allow humans to more efficiently find out whether a given AI system is providing sincere answers. One of these approaches is debate: by getting two AI systems to debate each other in a carefully controlled setting, it’s possible that logical errors or deception might surface and be caught by a human judge. We discussed some reasons why this approach might work, and some reasons it might not.
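
To make the recursive decomposition concrete, here is a minimal sketch in Python. Every function in it is a hypothetical stand-in for a trained component rather than Ethan’s actual implementation; the stub bodies exist only so the sketch runs end to end.

```python
from typing import List


def generate_subquestions(question: str) -> List[str]:
    # Hypothetical stand-in: in practice a trained language model proposes
    # the sub-questions that the original question depends on, and returns
    # nothing when the question is already simple enough to answer directly.
    if "illegal" in question.lower():
        return ["How harmful are apples?",
                "How harmful would something have to be for us to outlaw it?"]
    return []


def answer_directly(question: str) -> str:
    # Hypothetical stand-in: a supervised QA model trained on human-labeled
    # question/answer pairs (or a prompted model like GPT-3).
    return f"<model answer to: {question}>"


def recompose(question: str, subq: List[str], suba: List[str]) -> str:
    # Hypothetical stand-in: a model that predicts the final answer from the
    # sub-questions and their answers.
    pairs = "; ".join(f"{q} -> {a}" for q, a in zip(subq, suba))
    return f"<answer to '{question}' given [{pairs}]>"


def answer(question: str, depth: int = 0, max_depth: int = 3) -> str:
    """Answer a question by recursively decomposing it (IDA-flavored sketch)."""
    subquestions = [] if depth >= max_depth else generate_subquestions(question)
    if not subquestions:
        # Base case: simple enough for the human-trained QA model.
        return answer_directly(question)
    # Recursively answer each sub-question, then combine the results.
    subanswers = [answer(sq, depth + 1, max_depth) for sq in subquestions]
    return recompose(question, subquestions, subanswers)


print(answer("Should apples be illegal?"))
```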

You can follow Ethan on Twitter here, or follow me on Twitter here.

Chapters:

  • 0:00 Intro
  • 1:37 Ethan’s background
  • 7:26 Issues with IDA
  • 12:09 Gödel’s incompleteness theorem
  • 15:01 The role of IDA today
  • 26:45 The capabilities of IDA and GPT-3
  • 29:54 Systems and debate processes
  • 41:21 Making AGI work
  • 49:33 Our ability to control these systems
  • 51:41 Wrap-up

Please find the transcript below:

Jeremie Harris (00:00):
Hey everyone. Jeremie here. Welcome back to the Towards Data Science podcast. Today’s episode is about one of the biggest outstanding open questions in AI, period. And that is the question of whether we’ll be able to get to superhuman levels of intelligence using current AI systems.

Jeremie Harris (00:18):
One of the reasons that a lot of people are skeptical about whether we’ll be able to achieve superhuman intelligence using conventional machine learning is that conventional machine learning algorithms are typically trained on data that was created by human beings. So the logic goes, how can you ever achieve superhuman intelligence by training it on human level data?

Jeremie Harris (00:38):
Now, there are a lot of strategies that speak to exactly that question of trying to get superhuman intelligence from human-level data, one of which is called iterated distillation and amplification, or IDA for short. And while IDA itself comes in a lot of shapes and sizes, one of the most fruitful applications of IDA to date has been question-answering systems. Now, that typically involves breaking down complex questions, which in theory could be so complex that they’re not human-understandable, into simpler questions that humans can actually parse, and that AI systems at human level can actually answer and work with.

Jeremie Harris (01:12):
My guest today, Ethan Perez, is an expert on IDA-flavored question and answer systems of exactly that type. We’re going to be talking about those systems, about his thoughts on how they might give rise to AGI, and his thinking more generally about AI safety. So, lots to get into. I hope you enjoy this conversation as much as I did, and without any further ado, here we go.

Jeremie Harris (01:32):
Hi, Ethan. Thanks so much for joining me for the podcast.

Ethan Perez (01:34):
Hey. Good to chat.

Jeremie Harris (01:37):
I’m really happy that you’re here. This is actually a conversation that I’ve been excited to have at some point with someone, and you’re just absolutely the best person to have it with. I’m really curious about your views on a particular kind of AI safety and alignment strategy, a family of strategies that probably falls under debate or IDA. We’ll get to what this is in a minute. But first I want to get to understand you a little bit and explore your background, how you came to this space in the first place. So, how did you discover AI safety and AI alignment, and what brought you to do research on it full time?

Ethan Perez (02:12):
I was generally excited about the long-term impacts of AI. I, early on, had some background in math, and I was thinking about, oh, what are some of the impactful things I could do with that kind of background. I got more into machine learning. Then I read Nick Bostrom’s book, Superintelligence, which I think is probably a common read for people with these interests. And I think it just generally got me thinking a lot about the long-term impacts of AI, that, oh, I should really be thinking about what happens when we get technology that is close to human level, or even surpasses human level, in its abilities.

Ethan Perez (02:51):
That might be the kind of thing that happens at the end of my lifetime, or not even in my lifetime. But it’s really, I think, where a lot of the impact of AI will be, and where we’ll be able to use AI to actually advance our human knowledge, which is the thing that I really care about.

Jeremie Harris (03:06):
By the way, just to ask a little bit about the Superintelligence side of things. What was it about Superintelligence that shifted your views on things? How did your views shift as a result of reading the book?

Ethan Perez (03:17):
I think it’s that the book focuses a lot on very long-term considerations. I don’t think I have the expertise or the time to gauge the very specific arguments that Nick made about existential risk and other things at the time. But I think it did make some arguments that AI and very powerful AI systems were possible.

Ethan Perez (03:41):
One key argument that I remember was, let’s say that we had a system that could achieve human-level ability on some tasks. Maybe some imaginable way to do that is to just emulate the human brain. That might be just extremely computationally expensive, but at least, in principle, it does seem possible, because we are sort of an example of such an intelligent system. We just need to copy that.

Ethan Perez (04:09):
Then, if you do some sort of copying procedure, maybe if you don’t think neuroscience [inaudible 00:04:13] is possible, we could do this with supervised learning or unsupervised learning on human texts or data at a very large scale and get something to a similar effect. But that seems sort of plausible, and then he makes this point that, oh, well you could just speed up, or parallelize, that software at a very large scale and then get something that is superhuman in one sense in that it can do things much more quickly than a human could, but at the human level quality of ability.

Ethan Perez (04:40):
I think that got me excited about, oh, there are just so many things where I’m bottlenecked, in terms of my reading speed, in terms of how much I can learn. It would be really exciting if I could just have 1000 copies of me go read different parts of the Internet, looking up a new diet or looking up a new philosophy question, and then report back to me all the information that they gathered. I think that was a plausibility argument that seemed pretty realistic to me of, oh, it seems pretty reasonable that we might get systems that can do much more than we can, integration-wise.

Jeremie Harris (05:18):
Yeah. In a sense, it’s almost like it plays with the idea that intelligence is a very poorly-defined thing, where you can be superintelligent even by just replicating human neural hardware almost exactly. Do it in a different medium where computations are just faster because its substrate is [inaudible 00:05:37] cells or something. And you immediately have something that is, for all intents and purposes, superhuman, even though the algorithms are just human. It’s kind of an interesting… Yeah, the hardware-software side.

Jeremie Harris (05:49):
So did that prompt you? Like, you read Superintelligence and you were like, “I know I want to make this my life’s work.”

Ethan Perez (05:53):
That motivated me to take my first major research internship and-

Jeremie Harris (05:58):
At MILA, right?

Ethan Perez (05:59):
Yeah. At MILA. Yep. Aaron Courville there was just a really great mentor. I had other really good peers there, and ended up having a successful paper out of there. Then, I really enjoyed the whole process. I’m really excited about just how we can help humans advance our collective knowledge using machine learning. I just read a lot and am always excited to learn more about the world. Sometimes I feel very bottlenecked, or I feel like we really need step changes in our ability to answer some questions about the world. Especially, I think about this in regards to philosophy, where it seems like, oh, these are just really hard questions, where it seems like it would just take us many, many, many more years in order to answer some questions there. I’m like, we really need to be thinking about new ways to approach really challenging questions.

Ethan Perez (06:50):
Then, I think after the first research internship, I started to think more broadly about, oh, what other kinds of methods could actually let us get beyond human ability. Yeah, I mean there are not really many ways to do that that seem plausible, and that’s where I came across iterated distillation and amplification, as well as debate, in those contexts of different paradigms to get past human ability and not just imitate human ability, which is what supervised and unsupervised methods are able to do.

Jeremie Harris (07:26):
That actually brings us to another interesting aspect of this, which is that right out of the gate, you’re approaching IDA, iterated distillation and amplification, this thing that I’ve been teasing since the beginning of the conversation. You talked about that as a way of achieving superhuman capabilities from AI systems. I’d love to discuss that, because usually when I hear IDA talked about, I spend more of my time talking to people in the AI safety community, so they’re focused on IDA as a way to achieve safe AI systems that can be interrogated. I want to talk about that too, but I’m really curious about, what is it that prevents us from getting superhuman performance without IDA? I think that might be an interesting place to start.

Ethan Perez (08:06):
Yeah. Yeah, I mean I think the baseline, the starting point to think about it, is, well, the default way that we would train systems to answer questions is with supervised learning. All the state-of-the-art and standard methods for training systems to answer questions in NLP just use supervised learning, where we collect a large data set of answers to questions, and then we learn a mapping from questions to answers. That’s a good paradigm as long as we can collect the answers to questions, which is possible for a lot of the sorts of questions that people are googling around for, and very factual questions like, when was this person born? But that paradigm basically just breaks down when we aren’t able to collect the answers to questions.

Ethan Perez (08:53):
Then, basically, we just don’t have any training data to train our supervised learning algorithms. That’s basically the problem that iterated distillation and amplification is trying to solve. Let’s say you have a system that can do human-level question-answering, because you’ve trained it on data where we have answers to questions collected by people. So you have this black box, big question-answering system. Now the problem is, well, how can we use this system to answer some question that’s out of its distribution, because it doesn’t have a labeled answer?

Ethan Perez (09:30):
Let’s say, like questions in philosophy, I think, would be a good example, because we just don’t have labeled answers to those kinds of questions. So, they would be out of distribution and might require some sort of different reasoning process to get to the answer. But we have this really good question-answering system that can answer a lot of other related questions.

Ethan Perez (09:49):
The way we can think about approaching this problem is by breaking this out-of-distribution or hard question, the question that we don’t have the actual answer to, down into sub-questions that we can answer with our existing question-answering system. So if the question is, I don’t know, do people have free will? We can first have a question about, what does it mean to have free will? Then we can have some questions around the neuroscience of free will, et cetera, et cetera. Then, for each of those questions, we can answer them with our question-answering system, or if they need to be broken down further because they’re challenging, we can break them down further and then answer those sub-questions.

Ethan Perez (10:34):
The hope is that by continually breaking these questions down into sub-questions, we’ll eventually bottom out in questions like the ones the supervised question-answering system was trained on, where those sub-questions are the kinds of questions that we reliably expect our trained model to accurately answer. Then, now that we’ve answered a bunch of different pieces of this larger problem, it should be much easier for us to predict the final answer, given the answers to these sub-questions.

Jeremie Harris (11:04):
So the idea, in a nutshell, to make sure I understand it, is you have some complex or really difficult question, and that question would nominally require a superhuman intelligence to answer it. You have a maybe human, or even slightly sub-human, level intelligence that you’ve trained. You’ve been able to train it because you have a whole bunch of questions and the answers that real humans have given, so you’re able to use those two things to train this roughly human-level intelligence. Then the hope is that we can break down this really complex question into sub-questions that eventually can be answered by this slightly less than human or roughly human-level intelligence. That’s the plan?

Ethan Perez (11:46):
Yeah. Exactly.

Jeremie Harris (11:50):
That’s so cool, because to me it seems like if that were true, if it actually is the case that we can do this, it seems to imply that in principle there is no question, and correct me if I’m wrong about this, but there is no question that cannot be broken down into sub-questions that are ultimately human-understandable.

Ethan Perez (12:09):
I think you might get into, if you make the very general statement, you might get into issues around Gödel’s incompleteness theorem, where we know that there are some true statements that we can’t prove the answers to. But maybe there are modest versions of it where… I mean, some of those examples that are used to prove Gödel’s incompleteness theorem are a bit odd, and people don’t really generally care about them. But you might say that for most questions that we really care about, there might be some reasonable way to break them down into sub-questions, sufficiently many sub-questions. It might be just an extremely large number of sub-questions in some cases, but we would be able to answer them, yeah.

Jeremie Harris (12:56):
Sorry. Can you elaborate on the Gödel’s incompleteness theorem in connection to this? I think there might be some listeners who aren’t familiar with Gödel’s incompleteness theorem.

Ethan Perez (13:02):
Oh, yeah.

Jeremie Harris (13:03):
And it’s an interesting aspect of this.

Ethan Perez (13:05):
Yeah. I mean, I’m definitely not an expert on it.

Jeremie Harris (13:08):
Sure, yeah.

Ethan Perez (13:08):
But as I understand it, the simple statement of the claim is that there exist true statements which we know are true but for which we cannot construct a proof showing that the true statement is true. These are often self-referential statements, which are like-

Jeremie Harris (13:27):
Like a set that contains itself or something like that, yeah?

Ethan Perez (13:30):
Yeah, or this statement is false. They have that sort of quality of being self-referential. That initially made me sad. I was like, oh, well we can’t just prove every true statement. That’s really disappointing, but I think it would be even a big step. I mean, those are weird because they’re self-referential and a lot of… There’s an existence proof that we can’t provably reason to every true statement, but I think… It’s sort of unclear to me, the extent to which that actually holds in practice.

Jeremie Harris (14:04):
I guess that itself is its own interesting question, like speaking to the limitations of IDA or the limitations of any strategy. I mean, to the extent that from a safety perspective, we want to use something like IDA to make a superintelligence that could explore arbitrarily complex ideas. If those ideas end up involving areas of logic that are not ultimately human-tractable, in the sense that we actually can’t break them down into even human-understandable terms, then it’s like the genie’s out of the bottle. We have, essentially, no hope of regaining control over that thought process or of predicting the behavior of that system.

Ethan Perez (14:44):
Yeah. I guess you might want a system that can say, “Hey, I can’t answer this question.”

Jeremie Harris (14:51):
Right.

Ethan Perez (14:53):
Like, “No proof found given computational resources I have,” something like that.

Jeremie Harris (15:01):
We talk about IDA by referencing a human level system or an almost human level system, then we talk about how that can lead to super intelligence. I guess one question I have is, what is IDA like today? What is the state of play and what can you do with IDA systems that maybe you can’t do otherwise?

Ethan Perez (15:21):
Oh, yeah. I think that’s a great question. I think that these kinds of systems are useful now because, fundamentally they’re, I mean, this is like the amplification in iterated amplification where you’re amplifying the ability of an existing system that you’ve trained to be able to do slightly more.

Ethan Perez (15:43):
In some of my work, for example, I’ve taken standard question-answering systems, which can answer questions about a single paragraph in Wikipedia. Those are very good at answering a question about a single paragraph on Wikipedia, but they struggle to answer questions, or it’s not clear how to apply them to questions, where the answer is spread across multiple different Wikipedia articles or different paragraphs.

Ethan Perez (16:08):
One example of a question would be, who was born earlier, George Washington or Abraham Lincoln? That information is in two different Wikipedia articles. What I’ve done in my work is train a model to first generate sub-questions for that original question, so you’ll get sub-questions to the effect of, when was George Washington born? When was Abraham Lincoln born? Then, you can just answer each of those individually with your pre-trained model, and then pass both of the answers to the sub-questions to another question-answering system, along with the sub-questions. Then it should be able to predict the answer by basically making, I guess, a comparison over the two answers to the sub-questions.

Ethan Perez (16:55):
It makes the whole process much easier, and then we show that it improves the results on a current popular benchmark for question-answering.
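
In code terms, the pipeline Ethan just described looks roughly like the sketch below. It uses an off-the-shelf extractive QA model from Hugging Face as a stand-in for the single-paragraph question-answering system, and hardcodes the decomposition and the Wikipedia snippets; in Ethan’s work, both the sub-question generation and the recomposition step are learned, so treat this as an illustration rather than his actual system.

```python
# Sketch of the two-hop pipeline: answer each sub-question against its own
# paragraph, then hand the sub-questions and sub-answers to another QA call.
from transformers import pipeline

single_hop_qa = pipeline("question-answering")  # downloads a SQuAD-style model

# Stand-ins for the two retrieved Wikipedia paragraphs.
contexts = {
    "When was George Washington born?":
        "George Washington was born on February 22, 1732, in the colony of Virginia.",
    "When was Abraham Lincoln born?":
        "Abraham Lincoln was born on February 12, 1809, in Kentucky.",
}

# Step 1: answer each sub-question with the single-paragraph QA model.
sub_answers = {
    sub_q: single_hop_qa(question=sub_q, context=ctx)["answer"]
    for sub_q, ctx in contexts.items()
}

# Step 2: pass the sub-questions and sub-answers, rendered as text, to a
# recomposition step. Here that step is just another call to the same
# extractive model; in practice a model trained for the comparison does
# this part more reliably.
combined_context = " ".join(f"{q} {a}." for q, a in sub_answers.items())
final = single_hop_qa(
    question="Who was born earlier, George Washington or Abraham Lincoln?",
    context=combined_context,
)

print(sub_answers)
print(final["answer"])
```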

Jeremie Harris (17:06):
How do you train it to identify sub-questions? I’m struggling to think of it the… Is there just a labeled training set? Is that what you use?

Ethan Perez (17:13):
Yeah. I think that would be the simplest thing, but that’s sort of a weird kind of supervision. It’s not that common to get these question and sub-question pairs. Yeah, another simple strategy, if you don’t have that much supervision, and I have a paper coming out which does this, is to basically leverage these large language models that can adapt from a small number of examples to generalize to doing this.

Ethan Perez (17:41):
Yeah, so we’ve basically been using GPT-3, plus some other tricks, to generate sub-questions. You condition on a question and sub-question pair, or a few such pairs, and then you prompt with a new question. It can do pretty good sub-question generation. And then also, in another paper that I had, we were basically just trying to see if we could generate the sub-questions completely without supervision. There, we used some of these crazy methods coming out of unsupervised machine translation to do that. But yeah, I can go into more detail on that if you’re interested. But that’s a high level of the different kinds of methods you could use.
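
As a rough illustration of that few-shot prompting idea, here is what sub-question generation might look like. The prompt format, the examples, and the `complete` callable (a stand-in for a call to a GPT-3-style language model) are all illustrative, not the exact setup from Ethan’s papers.

```python
# Sketch of few-shot sub-question generation with a large language model.
from typing import Callable, List

FEW_SHOT_PROMPT = """\
Question: Who was born earlier, George Washington or Abraham Lincoln?
Sub-question 1: When was George Washington born?
Sub-question 2: When was Abraham Lincoln born?

Question: Should apples be illegal?
Sub-question 1: How harmful are apples?
Sub-question 2: How harmful would something have to be for us to outlaw it?

Question: {question}
Sub-question 1:"""


def generate_subquestions(question: str, complete: Callable[[str], str]) -> List[str]:
    """Prompt a language model with question/sub-question examples, then
    parse the sub-questions it generates for the new question.

    `complete` is any callable that takes a prompt string and returns the
    model's text continuation (e.g. a thin wrapper around an LM API).
    """
    continuation = complete(FEW_SHOT_PROMPT.format(question=question))
    sub_questions = []
    for line in ("Sub-question 1:" + continuation).splitlines():
        line = line.strip()
        if line.startswith("Sub-question") and ":" in line:
            sub_questions.append(line.split(":", 1)[1].strip())
        elif line.startswith("Question:"):
            break  # the model has moved on to inventing a new example
    return sub_questions
```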

Jeremie Harris (18:22):
Yeah. No, I find it really interesting that you’re looking at, effectively, augmenting GPT-3 with, I don’t know if the terminology would be “a little bit of logic”, or a kind of logical structure that you’re baking in. This brings me to another question, especially for somebody who’s working in this space with a lot of Q&A type problems. How impressed were you with GPT-3 when it came out? How surprised were you as well with how well it could perform?

Ethan Perez (18:50):
I’ve been pretty impressed. I think maybe the place where I’ve been most impressed is specifically using the model to generate arguments or generate advice, or other forms of long text generation. It’s just really good, especially on the kinds of questions that most people ask, because there’s just probably the most amount of training data on that. So anything related to politics or political controversies, economics and economic controversies, apologetics and religious debates. It actually just does a reasonable job at, I don’t know, quoting Bible verses or other very crazy things, because probably there’s just a lot of discussion about those kinds of things on the Internet. So it’s just way better than any model I’ve seen at doing those sorts of things.

Ethan Perez (19:46):
Yeah, I think the places where it maybe fails are when you want something that’s very precise, where the correct number of outputs is one and there aren’t really any other correct outputs that you would want. If you’re wanting an answer to a question, you need the exact answer to that question, and it still is… I mean, it’s not really using supervision. I think that’s basically where supervision is helpful: to get the exact sort of thing that you want for your task.

Jeremie Harris (20:20):
Do you think it has a plausible model of the world behind it? To what extent… Because I think one of the areas people have been debating a lot is, is GPT-3 basically a glorified memorization tool, where it’s… Like you said, “Oh, I’ve seen this before. I’ve seen somebody have this particular argument about a Bible verse before. I know what to quote.” Versus, it’s actually connecting dots at a higher level of abstraction. Where do you see it falling on that spectrum?

Ethan Perez (20:47):
Yeah, that’s a good question. It is definitely doing some generalization. A friend and I were playing around with it recently where he was kind of skeptical, and I was like, “Oh, well what would convince you that it’s doing some sort of generalization?” And he was like, “Well, it’s probably seen a lot of recipes on the Internet. It’s also probably seen writing about unicorns, but it’s probably never seen those two together.” Because I don’t know, that just seems unlikely. So he just [inaudible 00:21:14] prompted it with having a… Like, “Give me a recipe for how to make unicorn eye soup.” And then it just goes off into talking about how we need rubies and we need to smelt the iron ore and put it into the soup, and sprinkle it with some gold dust on the top as a garnish.

Jeremie Harris (21:35):
Wow.

Ethan Perez (21:35):
It was very detailed. And yeah, I don’t know. That seems like at least a piece of evidence that it is doing some sort of generalization. I think to some extent it seems hard to catch cases where it just doesn’t really do a good job.

Jeremie Harris (21:53):
Do you expect, on that basis, that as GPT-3 is scaled… I mean, obviously Google’s just come out with their one trillion parameter model, and we can expect more scaling to come in the near future. Do you think scaling is likely to bring GPT-3 to the point where the IDA strategies that you’ve been, at least, exploring most recently are subsumed by it; where it’s good enough on its own to not require that kind of augmentation, or where that augmentation can then be used to target the next layer of failure for that model? Is this, in other words… Is scaling GPT-3 plus IDA, can that get us arbitrarily far, I guess, is what I’m trying to ask.

Ethan Perez (22:35):
Yeah. I think the nice thing about IDA is that it just leverages whatever ability you have to get a model that can do more. I think… I’m not concerned about larger language models subsuming the need to do IDA, because whatever large language model you have, you can always make it answer more kinds of questions by running this deep question decomposition process. So far, it does seem easier to get that sort of benefit from question decomposition the better our models are, because you can just generate better sub-questions if you have a better language model.

Ethan Perez (23:22):
Question decompositions from GPT-3 do better than question decompositions from our previous sequence-to-sequence methods, even pre-trained ones, which in turn do better than methods that don’t use sequence-to-sequence learning. It’s just a clear gradient where the benefit from the sub-questions improves. Yeah, so I think we’ll see an increasingly large effect from doing this sort of question decomposition.

Jeremie Harris (23:48):
That’s interesting, because I would have guessed that at a certain point, as these language models get more and more sophisticated and they build up a more and more complex model of the world, that model of the world would include the value of question decomposition, and, therefore… Essentially, GPT-N, let’s say, could optimize both for question decomposition and for answering the questions at the same time. In other words, wouldn’t you benefit from that interaction if it’s all happening under the hood of the same optimization engine?

Ethan Perez (24:21):
Yeah. I think that seems correct, as long as you have training data for the amount of decomposition that you need to do. I don’t know, GPT-3 might be trained on some questions where it needs a couple, or maybe two to four, sub-questions. So it might just be doing this internal decomposition process and predicting the answer. Who knows? You might be able to even decode the sub-questions from the internal weights.

Ethan Perez (24:53):
But I think the tricky bit is, how do you get generalization to questions that require even more decomposition than just the ones that people are generally answering on the Internet. I think that’s where this structured process is helpful, because you can just arbitrarily decompose the questions into sub-questions.

Jeremie Harris (25:13):
What I love about this is it plays with this… I have this mental model of physics, or maybe you’d call it physics and logic, and machine learning as two ends of a continuum. In physics what we do is we go out into the world, we run a bunch of experiments, and we try to identify underlying rules that seem to apply all the time; like no matter what, these rules are always true. The speed of light in any frame of reference is always the same. Gravity works the same way, and so on and so forth.

Jeremie Harris (25:48):
The laws of logic are sort of similar to that, where… It seems like you’re assuming a law of logic here, the law of question decomposition: that you can coherently decompose complex propositions into simple propositions, and that will always be true. It’s almost like you’re taking the trial-and-error, machine learning, let-me-feel-my-way-around-the-elephant approach, but never actually… Like, machine learning models don’t actually come up with rules. They come up with predictive models, which are a little bit different and a little bit less fundamental in flavor.

Jeremie Harris (26:18):
It almost feels like you’re coupling the two together. You’re saying, okay, you’re really, really good at building this model of the world. It’s a very flexible model, but it fails the moment it encounters something that it hasn’t been trained on. So it’s augmented with this principle that we think will be true independent of context, to give it much greater reach. Is that a fair framing or description [inaudible 00:26:41]?

Ethan Perez (26:41):
Yeah. No, I think that was mostly it. Yeah, you nailed it.

Jeremie Harris (26:45):
Do you think there’s anything that IDA plus GPT-3, or GPT-N, I should say, will not be able to do? Does IDA get us across the super intelligence AGI finish line, or is there other stuff, like self-play or reinforcement learning, that will have to be coupled to these systems?

Ethan Perez (27:06):
Yeah. I think that’s a good question, at least if you’re wanting to answer questions and not, say, take actions in the world. Then, I think a language model plus question decomposition will get you quite far. Some of my uncertainties are around… One issue that I’ve found is that there are just cascading errors where if you answer a sub-question incorrectly, then that can propagate as an error into the final answer. If you answer, “When was George Washington born,” as 1900. Then, you’re going to answer the question, “Was he born earlier than Abraham Lincoln,” incorrectly.

Ethan Perez (27:56):
I think there will need to be a lot of thought about how we do the question decomposition process, how we use the answers to sub-questions. We might just need to do lots of different question decompositions for any given question. We might also want to just directly ask the question. Maybe there’s just some text in Wikipedia that says, “George Washington was born earlier than Abraham Lincoln.” So we might just want to ask the question directly. Maybe the birth dates are not available, so we want to look at when they died, and maybe if the difference between when they died is large enough, then we could get some estimate. That would let us do some Bayesian update on when their birth dates were.

Ethan Perez (28:40):
We might want to ensemble a bunch of these different results of question decomposition. Other things that are important, I think, are looking at the confidence that the model answering the sub-question has in predicting the answer. One thing that I found is that when that model is less confident of its answer, it’s likely that the entire process will end up with an incorrect answer, which I think makes sense.
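
A minimal sketch of what that kind of ensembling could look like, with hypothetical stand-ins for the decomposition and answering components; this illustrates the idea rather than Ethan’s actual system.

```python
# Sketch of ensembling several question decompositions, weighting each one
# by the QA model's confidence in its sub-answers.
from collections import defaultdict
from typing import Callable, List, Tuple


def ensemble_answer(
    question: str,
    decomposers: List[Callable[[str], List[str]]],
    answer_with_confidence: Callable[[str, List[str]], Tuple[str, float]],
) -> str:
    votes = defaultdict(float)
    for decompose in decomposers:
        sub_questions = decompose(question)
        # `answer_with_confidence` runs the full pipeline for one decomposition
        # and returns (final_answer, confidence), where confidence could be,
        # e.g., the product of the sub-answer probabilities.
        answer, confidence = answer_with_confidence(question, sub_questions)
        votes[answer] += confidence  # confidence-weighted vote
    return max(votes, key=votes.get)
```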

Jeremie Harris (29:09):
Makes sense, yeah.

Ethan Perez (29:11):
It’s the sort of thing that we should actually carefully tune. It emerged in the system that we built, but it’s the sort of thing that we should be careful about, where, oh, maybe we should actually train models that are better calibrated and see how we should use their confidence to affect our confidence in the overall prediction. There are a lot of different parts to the system that I think do need to be carefully looked at in order to get the whole process to work out properly, yeah.

Jeremie Harris (29:38):
Yeah. Definitely it’s high stakes, especially as far as those earlier questions, I presume, are concerned, right? The sooner the system makes a mistake, the sooner the entire tree that comes after gets compromised. Is that…

Ethan Perez (29:53):
Yeah.

Jeremie Harris (29:54):
Okay. This actually brings us to, because you mentioned ensembling. You have a whole bunch of different question decompositions. Question decomposition one leads to one conclusion. Question decomposition two leads to a different conclusion and so on and so forth. Eventually, you average out, or you let them all vote, and you use that to ensemble their predictions and you get something that’s more robust. What about the other angle of this, which is getting these systems to, perhaps, undergo some kind of debate process. I know that’s another thing that you’ve been working on.

Ethan Perez (30:24):
Yeah.

Jeremie Harris (30:25):
Can you speak to debate and its importance in this context?

Ethan Perez (30:28):
Yeah. I think the high-level idea is similar, which is that when we have a question that is more difficult or complex or out of distribution somehow, we need to break it down recursively somehow, in order to get to a point where we can actually make an assessment from a model that’s trained with supervised learning.

Ethan Perez (30:48):
The way that debate approaches it is different. The way that I’ll motivate it is by starting out with how you might approach this problem from a reinforcement learning perspective. Let’s say you’re trying to ask a model, should I eat keto? Give me an answer with some explanation of why I should think that you’re giving me a good answer. If you don’t have that kind of supervision for those explanations, one possible thing you might want to do is just have your language model generate an explanation. Then, you just check the explanation, and then you see, oh, does this seem reasonable? If it seems reasonable, then an RL-type approach would just reward the language model for giving a good explanation and answer, and negatively reward it if it doesn’t seem correct.

Ethan Perez (31:39):
I think this kind of thing makes a lot of sense if you think that the answer is easy to check but hard to come up with. Math proofs might be an example, where there’s a really complicated reasoning process in order to generate the whole proof. But once we have the proof, then it’s much easier to just check the answer, and then we can reward the model positively for coming up with the proof. That reinforcement learning type approach will plausibly get us past human ability, because we aren’t actually generating the answers. We’re having the model do some expensive computation to come up with the answers, and then we’re checking them.

Ethan Perez (32:16):
The failure mode is that it’s unsafe, or could be misaligned with our values, in that we’re basically just optimizing a model for generating convincing answers, or answers that look correct. But-

Jeremie Harris (32:27):
Things that we will consider to be convincing, yeah.

Ethan Perez (32:30):
Yeah. Yeah, exactly. I think that’s where the motivation for debate comes in: well, we have one model answer a question. And then we can have the same model bring up considerations about the answer to this question that we might miss, and that are very convincing cases against this answer. So maybe it provides another answer, so now we have both the original answer as well as a counter-answer. These, for a good model, might be two of the best answers to this question. There we might think that, okay, now we’re more informed about maybe some missing information that we might not have had.

Ethan Perez (33:17):
With debate, in a similar way to iterated amplification, you can just iterate the whole process, and then you can say, okay, given that I have this answer and a rebuttal answer, I can generate a counterargument to that rebuttal, and a counter-counterargument, et cetera, et cetera, until, basically, we have enough information for a human to predict with reasonably high confidence what the correct answer is. Then, since we have reliable human judgments on this data, we can train reliable models to make the same prediction, just using supervised learning.
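
In code terms, the debate setup Ethan describes might look roughly like the sketch below. The `debater` and `judge` callables are hypothetical stand-ins for trained models (or, during data collection, a human judge), and the round structure is illustrative rather than any specific published protocol.

```python
# Sketch of the debate loop: two sides open with answers, alternate rebuttals
# for a fixed number of rounds, and a judge picks a winner.
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (side, argument) pairs


def run_debate(
    question: str,
    debater: Callable[[str, Transcript, str], str],
    judge: Callable[[str, Transcript], str],
    num_rounds: int = 4,
) -> str:
    # Each side opens with its best answer to the question.
    transcript: Transcript = [
        ("A", debater(question, [], "A")),
        ("B", debater(question, [], "B")),
    ]
    # The sides then alternate rebuttals and counter-rebuttals.
    for round_index in range(num_rounds):
        side = "A" if round_index % 2 == 0 else "B"
        transcript.append((side, debater(question, transcript, side)))
    # The judge's verdict can serve as a supervised label, or as a reward
    # signal for self-play training of the debaters.
    return judge(question, transcript)
```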

Jeremie Harris (33:59):
The hope here would be that you get two AI systems to debate each other, and in the process, they end up revealing the strengths and weaknesses of each other’s arguments in a human intelligible fashion.

Ethan Perez (34:09):
Yep.

Jeremie Harris (34:10):
So that we can go, oh, this AI is malicious, and it’s trying to fool me into helping it along in its plan for world domination, whereas this AI is doing the right thing. Its answer actually makes sense when I see the debate play out. Does that actually work? I guess one issue I’d imagine here is, as these two systems that are debating vastly exceed human ability to follow the reasoning, you end up in a situation where there’s a decoupling between, like you said, what sounds convincing to humans and what actual logic would dictate. As this logic gets really complicated, it seems like it would become much, much easier to deceive a human and much harder to convey the real, true arguments.

Ethan Perez (35:04):
Yeah. I think that is basically a key uncertainty with this approach. You might think that as the models that are running these sorts of debates get better, then both the arguments and the counterarguments get better in a way that’s helpful, in that you’re more likely to catch errors in an argument, but yeah. It does seem like there… I mean, there are some interesting experiments, like human experiments that OpenAI has run, where they have smart people who know about physics play out this process of debate, and then have a non-physicsy, or not too physicsy, person try to judge the debate. And it just seems really, really difficult for me to assess some of those debates.

Ethan Perez (35:54):
So I think it’s another thing where the process needs to be tweaked in order to figure out how we can get the easiest-to-judge debates, in a way that we can still feel confident that we’re getting a good, reliable judge of the whole process.

Jeremie Harris (36:11):
Yeah, because it almost seems like, I don’t know, naively to me, it feels like there’s a curse of dimensionality aspect to this, where the more complicated the problem domain becomes, the more sophisticated their reasoning becomes, the more high-dimensional it is, the more spare degrees of freedom these AIs have. And it seems like those degrees of freedom, that’s the space that you need to wedge in deception. It seems like that space just grows as the problem increases in complexity.

Jeremie Harris (36:46):
But yeah, hopefully, there’s some underlying principles so we can actually uncover the [inaudible 00:36:50] scale a little bit more.

Ethan Perez (36:52):
It seems plausible that you might need some combination of these methods. I don’t know. Train some IDA type system, and then because it’s done this decomposition, then it’s above human ability in what sorts of questions it can answer, and also what sorts of debates it can judge. Then, you might want to use that system as the judge for this debate process. It might be able to do a better job than just using a human annotator as the judge directly.

Ethan Perez (37:22):
With debate, I think one of the motivations is that this might be a good training signal, if you can train a model to accurately judge the results of the debate. Then you can just use that as a reward signal for some sort of self-play, even an AlphaZero-type system, where the models optimize the reward and get better at generating these arguments. Then as they get better, you might also think that they might be better judges for the whole system. There are different combinations of these puzzle pieces that we can try to fit together, but it’s sort of an open question what’s the best way to do it to get reliable judgments.

Jeremie Harris (38:04):
One thing that I’m curious about too, and this informs both the strategies that researchers choose to work on and, also, their attitude toward safety and the kinds of alignment and AI safety techniques that they feel compelled to work on. There’s a tight coupling between those technical questions and thoughts about AI timelines. When do you think an AGI, a human-level general intelligence, or something that we can refer to in those terms, is going to arise? Do you have, I don’t know, do you have a take on that? If so, has that informed your position on whether to work on capabilities or alignment or anything else?

Ethan Perez (38:41):
Yeah. I made a prediction on this. There’s a really nice thread. I may have to send it to you later, but there’s a really nice post from this research group, Ought, where they collect a lot of different people’s AI timelines. I think from that…

Jeremie Harris (38:53):
I’ll post a link. I did see your Elicit plot, so everybody check out the blog post that will come with the podcast, and you’ll be able to see Ethan’s Elicit post, because I think it’s kind of intriguing.

Ethan Perez (39:02):
Okay. Great, yeah. I think it depends a lot on your definition of AGI. I think I was considering that it would need good language modeling or language abilities, good vision capabilities, and the ability to take actions in the world. I think, in my prediction, I decoupled it into two cases. There’s one case where we get general intelligence from combining existing components that we have currently and scaling them up with the compute that we have. Then, there’s another component of my distribution, which covers the case where that first scenario doesn’t work out; it’s just this prior over the next century or couple of centuries where, well, we’re not really sure when it’s going to happen. So I’ll just have some decaying distribution over the timeline.

Ethan Perez (39:57):
In that second scenario, it seems very uncertain when we’ll get general intelligence, because it could just be plausibly some very hard breakthrough that… We’ve already been working on the problem for quite a long time, so it might just take a really long time to get the necessary breakthrough that we need. But in the other scenario, it seems quite plausible that in the near term, if we get larger GPT-style models, if we have enough compute to get a good performance; then I think it seems plausible that we get quite good language models. And then we just need to combine them with the visual components, which is, I think, the direction that CLIP and DALL-E, and some of these other models are going in. Then, we just need to give them some ability to take actions in the world, or be able to train that like we would want.

Jeremie Harris (40:56):
Some body.

Ethan Perez (40:58):
Yeah. Maybe… I don’t know if I’m tied to physical embodiment, but at least the models should be able to, for example, run a company, choose actions that are optimizing some goal. That seems like it might need a bit more work beyond just getting and understanding vision and language.

Jeremie Harris (41:21):
I think you’ve posed this problem in a way that everyone implicitly has posed it in their minds, but no one’s ever made it explicit quite in the way that you did, or as far as I’ve seen in that comment. The central question is, hey, have we solved all the conceptual problems that we need to solve, basically? Do we already, basically, have all of these ingredients, the reinforcement learning, the deep learning, we just smush them together and add a pinch of scale for flavor and then we end up with an AGI? Or is there some fundamental thing, whether it’s some weird quantum effect, or something we’re going to have to figure out to make AGI actually work?

Jeremie Harris (41:53):
If I recall, you actually said you thought there was about a one in four chance that we currently have everything we need to achieve liftoff. Is that correct?

Ethan Perez (42:03):
Yeah. I think that’s right. I mean, it depends a lot on your definition of AGI. I think language models alone that are very powerful are probably quite good enough-

Jeremie Harris (42:17):
[inaudible 00:42:17]

Ethan Perez (42:17):
… to get quite, have massive impacts on the world. I said one in four in that post, but I think probably most people would be lower than me on that. Some people would be higher.

Jeremie Harris (42:29):
Oh, interesting. So you think most people would think that there probably are some missing conceptual ingredients?

Ethan Perez (42:37):
Yeah. I think that might be an opinion that’s getting worn down by some of OpenAI’s scaling work, but generally, it does seem to be the case that most researchers are like, “Oh, we do need some big new ideas in the way that we’re approaching things.”

Jeremie Harris (42:55):
When I looked at your plot, one of the things that struck me was you had, essentially, the probability of us hitting AGI spiking sometime in the next 10 years or so; that’s basically your big bump that’s associated with the possibility that maybe we already have everything and all we need is scale. So within the next 10 years, we’ll be able to scale our way there.

Jeremie Harris (43:13):
Then, of course, you’ve got this big period of high uncertainty where you’re like, “I don’t know. If it doesn’t happen in that time, scaling doesn’t work, then who knows when we’ll have the big idea?” Does that cause concern? Are you worried about that possibility that we might hit AGI in the next 10 years? Do you think that we’re prepared for it from a safety standpoint, from an alignment standpoint, from a philosophy standpoint?

Ethan Perez (43:38):
Yeah. That’s a good question. I think to some extent, it seems hard to work on some of these safety methods without having [inaudible 00:43:47] models.

Jeremie Harris (43:48):
Capabilities?

Ethan Perez (43:48):
Yeah. It is just hard to actually generate a good sub-question to a question before we… It would’ve been really hard to do that a couple of years ago. Yeah, so part of me feels like, oh, these methods are going to become more useful as we get closer to very powerful language models. Yeah, but also it does seem like, I don’t know, maybe it feels like we need a long period where we’re close to having strong language models and have a lot of time to develop our methods. And then, oh, then we’ll be ready for it, and we’ll have all these methods for scaling past human ability in a way that generalizes properly.

Ethan Perez (44:32):
So yeah, maybe it’s more the shape of the trajectory that…

Jeremie Harris (44:35):
Interesting.

Ethan Perez (44:36):
That matters to me.

Jeremie Harris (44:40):
Does that imply, though, that we would know? Because right now, I think, we’re scaling the biggest machine learning model in the world. There’s been scaling by, I think, a factor of about 10X every year. To me, this seems to suggest that we’re very likely to not hum along just below human-level intelligence while we sort out a solution, but rather shatter right through that threshold and then create something way bigger than we can handle. Is that a plausible concern, or is that something that you think is obviated by something else?

Ethan Perez (45:13):
Well, I think it will be pretty compute limited. In the case that we get methods that are generally intelligent from our current set of methods, it looks like the direction that that’s going is, we require a lot of scaling. In that case, there’s going to be some trade-off between the quality of the model that you’re getting and the amount of compute that you’re using it for. I think that does limit the amount of… Yeah, I guess the amount by which it can really increase quickly in terms of intelligence.

Ethan Perez (45:49):
Yeah, it seems like… I mean, we definitely need a few more orders of magnitude of compute to get even near human-level intelligence. At that point, you’re using a large budget, a large fraction of Google’s R&D budget, in order to train the model. It’s a question of, well, how much more money is there to spend on these? And it seems like a process that’s going to take a while, at least some amount of time; maybe it’s even going to have to be some sort of governmental project to gather more-than-Google-R&D-budget amounts of money to scale the models. But it does seem hard to just blast past.

Ethan Perez (46:28):
We also have these very predictable power law curves in terms of how the loss for language modeling is going to go down as we increase compute. The amount of gain that we’re getting is slowing down, in terms of how much better we’re getting at language modeling from training with more compute. I think those factors all suggest to me that things will slow down a bit.

Jeremie Harris (46:52):
Well, and I guess if they do, that would suggest that strategies like IDA are probably going to be all the more valuable as we look for that little extra nudge here or little extra nudge there. Is IDA going to be necessary to get across the AGI threshold, or are there plausibly other techniques that… Is IDA on its own, in other words, a plausible way of getting us to superhuman levels of performance from systems that are trained on human-level data?

Ethan Perez (47:25):
Yeah. Okay, so the characterization in my head is something like, a very good language model will capture the full distribution of human intelligence and ability. It has the knowledge about what the very upper end of human intelligence is like. For example, I prompted GPT-3 as if it’s a conversation between two faculty at Stanford. And then the conversation is very distinct, in that they’re citing different famous people and using very high-quality, well-formed English and stuff like that.

Jeremie Harris (48:01):
A very sophisticated conversation.

Ethan Perez (48:02):
Yeah, exactly. There, I think, it’s clear. I mean, that seems like… I mean, it’s above the mean of human intelligence in terms of quality. So I think that seems like a reasonable place that… I mean, maybe you might just consider that superhuman already, depending on what you consider-

Jeremie Harris (48:22):
Right. Right. It’s picking the top one percentile of the human population. Is that a superhuman thing? I mean, maybe, yeah. Yeah.

Ethan Perez (48:32):
Yeah. I mean, that would already be quite useful. Then I think, maybe if you use some sort of reinforcement based approach, like the one I was describing, you could get beyond that. But I think you would start to see some of the failings that I described where models are generating answers that are convincing, but not necessarily correct. There, I think you get into dangers about, well, people are probably going to use the answers because they’re probably correct, but they’re also optimized to just convince you-

Jeremie Harris (49:04):
[inaudible 00:49:04].

Ethan Perez (49:04):
And therefore, you might be making important decisions based off of incorrect information. I think that is the regime where, when we try to scale past that level of human ability, then we might want to… We might basically need to use distillation and amplification or debate methods to get us past this regime of human ability in a way that we are confident will work.

Jeremie Harris (49:33):
And are you overall optimistic when it comes to… Because we’ve talked about IDA’s potential as a way of amplifying these systems, of getting superhuman-level performance. It seems like we do have strategies that, at least in principle, seem like they have a decent shot of getting us to superintelligence. And whether one of those strategies works or something else works, it seems quite likely that we will eventually get there. Are you optimistic about our ability to control these systems? And, well, maybe wielding them wisely is a separate question. But just our ability to control them, to make sure that we don’t let a genie out of the bottle that we’d rather put back in the bottle.

Ethan Perez (50:12):
Yeah. I think that’s one thing that’s nice about question-answering systems, is that we decide what we do with the information. It’s not that we’re having systems act autonomously on their own, which I think is where a lot of the difficulties come in. I mean, there are still difficulties with doing this question-answering problem correctly, but I think if we can really nail this, I think, simpler problem of just getting models to generalize past human ability in just the domain of prediction, of answering questions, then I think we can use some of those methods to help us in the more general case, where we want our systems to autonomously take actions in a way that’s not supervised by humans.

Ethan Perez (51:03):
For example, I think one way to make that connection is to use this question-answering system, trained with IDA or debate, to provide a reward for systems that are running manufacturing, like Amazon manufacturing, or making CEO-level decisions at a tech company or something like that. I think if we have some sort of superhuman question-answering system, then it might be much easier to get accurate evaluations of these systems that are taking very long-term actions.

Jeremie Harris (51:41):
Interesting. Okay, well hopefully question-answering systems are going to be a big part of the solution here. I really appreciate your time and your thoughts on this, Ethan. Do you have a personal website that you want to point people to, actually-

Ethan Perez (51:53):
Yeah.

Jeremie Harris (51:53):
… see if they’re interested in following your work?

Ethan Perez (51:55):
Yeah. It’s just ethanperez.net.

Jeremie Harris (51:57):
Awesome. Dot net. Awesome. I like it. Very, what is it? I don’t want to say hipster-ish, but it’s got that nice aesthetic, yeah. Great.

Ethan Perez (52:07):
That’s good. It was the only one available. [inaudible 00:52:09]

Jeremie Harris (52:09):
Nice, nice. Welcome to 2021, I guess. Yeah well, Ethan, thanks so much. I really appreciate it.

Ethan Perez (52:16):
Yeah. Thanks for having me.

