PODCAST

What does your AI want?

Ryan Carey on the quest to understand the incentives of AI systems

Jeremie Harris
Towards Data Science
4 min read · Apr 14, 2021


APPLE | GOOGLE | SPOTIFY | OTHERS

Editor’s note: This episode is part of our podcast series on emerging problems in data science and machine learning, hosted by Jeremie Harris. Apart from hosting the podcast, Jeremie helps run a data science mentorship startup called SharpestMinds.

AI safety researchers are increasingly focused on understanding what AI systems want. That may sound like an odd thing to care about: after all, aren’t we just programming AIs to want certain things by providing them with a loss function, or a number to optimize?

Well, not necessarily. It turns out that AI systems can develop incentives that aren’t obvious from their initial programming. Twitter, for example, runs a recommender system whose nominal job is to figure out which tweets you’re most likely to engage with. You might think that means it should optimize by matching tweets to people, but another way to achieve the same goal is to match people to tweets: to make people easier to predict by nudging them toward simplistic and partisan views of the world. Some have argued that this is a key reason social media has had such a divisive impact on online political discourse.
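
To make the two strategies concrete, here’s a toy simulation (entirely invented for illustration; nothing here comes from Twitter’s actual system or from the episode). A simulated user engages with a topic in proportion to their current interest in it, and each recommendation nudges their interests toward what they were shown. One policy samples topics in proportion to the user’s current interests (“match tweets to people”); the other keeps pushing the user’s single strongest interest (“match people to tweets”), which narrows the user’s preferences and makes engagement easier to predict.

```python
import numpy as np

rng = np.random.default_rng(0)

N_TOPICS = 5
NUDGE = 0.05   # assumed strength of the feedback effect of showing a topic
STEPS = 200

def step(prefs, topic):
    """Show `topic`: the user engages with probability prefs[topic],
    and their preferences drift slightly toward what they were shown."""
    engaged = rng.random() < prefs[topic]
    new = prefs.copy()
    new[topic] += NUDGE
    return new / new.sum(), engaged

def run(policy):
    prefs = np.full(N_TOPICS, 1.0 / N_TOPICS)   # user starts with broad interests
    hits = 0
    for _ in range(STEPS):
        prefs, engaged = step(prefs, policy(prefs))
        hits += engaged
    return hits / STEPS, prefs

# Policy A: "match tweets to people" -- sample in proportion to current interests.
match_user = lambda prefs: rng.choice(N_TOPICS, p=prefs)

# Policy B: "match people to tweets" -- always push the strongest interest,
# narrowing the user and making them easier to predict.
narrow_user = lambda prefs: int(np.argmax(prefs))

for name, policy in [("match user", match_user), ("narrow user", narrow_user)]:
    rate, prefs = run(policy)
    entropy = -(prefs * np.log(prefs)).sum()
    print(f"{name:12s} engagement={rate:.2f}  preference entropy={entropy:.2f}")
```

Both policies optimize the same nominal objective, but the second one “succeeds” partly by changing the user — an incentive that never appears explicitly in the loss function.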

So the incentives of many current AIs already deviate from those of their programmers in significant, society-shaping ways. But there’s a bigger reason these incentives matter: as AI systems become more capable, mismatches between their incentives and our own will become more and more important. That’s why my guest for this episode, Ryan Carey, has focused much of his research on identifying and controlling the incentives of AIs. Ryan is a former medical doctor, now pursuing a PhD in machine learning and doing research on AI safety at Oxford University’s Future of Humanity Institute.

Here are some of my favourite take-homes from the conversation:

  • Most people who work on AI safety came to it through the Effective Altruism (EA) community. EA is a philosophical movement focused on figuring out how an individual with limited time and money can do the most good in the world. Many in the EA community take a long-term outlook, and with that comes a focus on the future of humanity and of technologies like AI, which many believe will determine whether humanity flourishes or fails in the distant future.
  • Ryan’s work on AI incentives applies to many current systems, and is also important for long-term AI alignment work. For that reason, it’s an area where long-term and near-term AI safety researchers can clearly see eye to eye. That doesn’t happen as often as you might think: there are active debates within the AI safety community over whether long-term or near-term safety issues should be prioritized.
  • One of the challenges facing long-term AI safety researchers is abstraction leakage: the idea that the basic concepts we rely on to describe the world, things like “apples” and “GPUs” and “algorithms”, are actually fuzzy. An apple, for example, is really just a collection of cells, which are in turn collections of atoms and fundamental particles. Why is this an issue for AI safety? We use abstractions because we have finite sensory and computational bandwidth. Our brains don’t think of apples as collections of cells, even though that would be a more accurate way to perceive them, because we simply don’t have the compute power and bandwidth to do so. So we use shortcuts like the concept “apple” to make the world compressible enough to be understood. But superintelligent AI systems wouldn’t face that constraint: in principle, they could see the world in far greater detail than we do, and could act on that detail by spotting relationships between sub-abstractions (like the cells in an apple) that we can’t. As a result, AI safety strategies that rely on fixed abstractions (e.g. “don’t harm a human being”, or “don’t eat that apple”) risk being far less effective than we might imagine in the face of a superintelligent system (see the sketch after this list).
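
Here’s a contrived sketch of that last point (all names and numbers are invented for illustration): a safety rule written at the level of a lossy abstraction can be satisfied to the letter while the fine-grained state it was meant to protect changes in ways the rule’s author never intended.

```python
from dataclasses import dataclass

@dataclass
class Apple:
    """Fine-grained state a more capable system could perceive and act on."""
    cells_intact: int
    cells_total: int

def abstract_state(apple: Apple) -> str:
    """The lossy, human-level abstraction the safety rule is written against."""
    return "apple" if apple.cells_intact / apple.cells_total > 0.95 else "eaten apple"

def rule_satisfied(before: Apple, after: Apple) -> bool:
    """'Don't eat the apple', stated abstractly: the label must not change."""
    return abstract_state(before) == abstract_state(after)

# An agent reasoning at the cellular level removes 4% of the apple...
before = Apple(cells_intact=1_000_000, cells_total=1_000_000)
after = Apple(cells_intact=960_000, cells_total=1_000_000)

print(abstract_state(before), "->", abstract_state(after))  # apple -> apple
print("rule satisfied:", rule_satisfied(before, after))     # True, yet 40,000 cells are gone
```

The abstraction maps many fine-grained states onto one label, so a constraint defined on the label can’t distinguish between them; an agent operating below the abstraction can exploit that gap.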

You can follow Ryan on Twitter here, or follow me on Twitter here.

Links referenced during the podcast:

Chapters:

  • 0:00 Intro
  • 1:20 Ryan’s background
  • 5:25 Exploring alternatives
  • 6:45 Resisting scalable impact
  • 11:30 Valuing future life
  • 13:50 Existential risk and AI
  • 23:30 Progress in AI safety
  • 34:30 Components for an AI strategy
  • 41:30 Hitting AGI
  • 49:50 Five-year span
  • 55:10 Taking the risk
  • 56:50 Wrap-up
