PODCAST

data2vec and the future of multimodal learning

Alexei Baevski on AI architectures that work for text, images, speech, video, and more

Jeremie Harris
Towards Data Science
4 min read · Apr 27, 2022


APPLE | GOOGLE | SPOTIFY | OTHERS

Editor’s note: The TDS Podcast is hosted by Jeremie Harris, who is the co-founder of Mercurius, an AI safety startup. Every week, Jeremie chats with researchers and business leaders at the forefront of the field to unpack the most pressing questions around data science, machine learning, and AI.

If the name data2vec sounds familiar, that’s probably because it made quite a splash on social and even traditional media when it came out about two months ago. It’s an important entry in what is now a growing list of strategies focused on creating single machine learning architectures that can handle many different data types, like text, images, and speech.

Most self-supervised learning techniques involve getting a model to take some input data (say, an image or a piece of text) and mask out certain components of those inputs (say, by blacking out pixels or words) in order to get the model to predict those masked-out components.

That “filling in the blanks” task is hard enough to force AIs to learn facts about their data that generalize well, but it also means training models to perform tasks that are very different depending on the input data type. Filling in blacked-out pixels is quite different from filling in blanks in a sentence, for example.
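To make that idea concrete, here is a toy sketch of the fill-in-the-blanks objective for text. Everything in it (the tokens, the [MASK] symbol, the masking rate) is illustrative rather than taken from any particular model:

    import random

    # Toy "fill in the blanks" setup for text: hide a fraction of tokens and
    # ask the model to recover them. For images, the analogous move is
    # blacking out patches of pixels and predicting their contents.
    tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
    mask_prob = 0.3  # illustrative masking rate

    masked_input = list(tokens)
    targets = {}  # position -> original token the model is trained to predict
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            masked_input[i] = "[MASK]"
            targets[i] = tok

    print(masked_input)  # e.g. ['the', '[MASK]', 'brown', 'fox', ...]
    print(targets)       # e.g. {1: 'quick', ...}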

So what if there was a way to come up with one task that we could use to train machine learning models on any kind of data? That’s where data2vec comes in.

For this episode of the podcast, I’m joined by Alexei Baevski, a researcher at Meta AI and one of the creators of data2vec. In addition to data2vec, Alexei has been involved in quite a bit of pioneering work on text and speech models, including wav2vec, Facebook’s widely publicized unsupervised speech model. Alexei joined me to talk about how data2vec works and what’s next for that research direction, as well as the future of multimodal learning.

Here were some of my favourite take-homes from the conversation:

  • Self-supervised models are usually trained to fill in partially blacked-out sentences or images. But this strategy has an inherent limitation: because fill-in-the-blanks is a very different task for text than it is for images, it’s much harder to use those tasks to train a single architecture that can handle text and images at the same time. To solve this problem, data2vec is trained to fill in blanks not in images or sentences themselves, but in the latent representations of those images and sentences generated by a teacher network. This creates a common task that can be used regardless of the input data type (see the sketch after this list).
  • As Alexei points out, data2vec still uses specialized preprocessing techniques that differ depending on the input data type. So it’s not exactly a universal architecture: it still requires purpose-specific massaging of inputs. However, Alexei thinks that could change: DeepMind recently published work on an architecture called Perceiver, which uses a single preprocessing technique for all input data types. By combining Perceiver’s input-agnostic preprocessing with data2vec’s input-agnostic training task, he sees significant potential for a new wave of robust multimodal models.
  • One of the challenges that comes with increasingly multimodal models is interpretability: it’s hard enough to understand how deep networks process image data when that’s all they’re handling, but what if the same network that handles vision also handles text and audio data? We may need a new generation of interpretability techniques to keep up with scaled multimodal systems.
  • One question that Alexei and his team haven’t tried to answer, but that Alexei is curious about: do the latent representations that data2vec generates for the word “dog” look similar or related to the latent representations it comes up with for images of dogs? Naively, this seems like it would tell us something about the robustness of the concepts the system learns.
  • It’s often said that machine learning, and especially scaled AI, is becoming a software engineering discipline. Alexei has a software engineering background and says that while he sees some merit in the idea, it hasn’t translated into giving software engineers a noticeable advantage in AI research.
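For readers who want to see the teacher-student idea from the first take-home in code, here is a minimal PyTorch sketch. It is a simplification rather than the actual data2vec implementation: the real model regresses an average of several teacher layers’ outputs and uses modality-specific feature extractors, while this toy version just regresses the teacher’s final-layer latents at masked positions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Sketch of a data2vec-style objective: the student sees a masked input and
    # is trained to match the teacher's latent representations of the unmasked
    # input at the masked positions. The teacher's weights are an exponential
    # moving average (EMA) of the student's and receive no gradients. The tiny
    # Transformer and all shapes are illustrative, not data2vec's architecture.
    dim, seq_len, batch = 64, 16, 8

    def make_encoder():
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=2)

    student = make_encoder()
    teacher = make_encoder()
    teacher.load_state_dict(student.state_dict())
    for p in teacher.parameters():
        p.requires_grad_(False)

    mask_emb = nn.Parameter(torch.zeros(dim))  # learned embedding for masked positions

    x = torch.randn(batch, seq_len, dim)     # already-embedded inputs (any modality)
    mask = torch.rand(batch, seq_len) < 0.5  # positions the student must reconstruct

    x_masked = x.clone()
    x_masked[mask] = mask_emb                # student only sees the corrupted input

    with torch.no_grad():
        targets = teacher(x)                 # teacher sees the full, unmasked input

    preds = student(x_masked)
    loss = F.mse_loss(preds[mask], targets[mask])  # regress latents at masked spots
    loss.backward()

    # After each optimizer step, the teacher tracks the student via EMA, e.g.:
    # for pt, ps in zip(teacher.parameters(), student.parameters()):
    #     pt.data.mul_(tau).add_(ps.data, alpha=1 - tau)

The key point is that the regression target is the same kind of object (a latent vector) no matter whether the input started out as text tokens, image patches, or speech features, which is what makes the training task modality-agnostic.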

You can follow Alexei on Twitter here, or me here.

Chapters:

  • 0:00 Intro
  • 2:00 Alexei’s background
  • 10:00 Software engineering knowledge
  • 14:10 Role of data2vec in progression
  • 30:00 Delta between student and teacher
  • 38:30 Losing interpreting ability
  • 41:45 Influence of greater abilities
  • 49:15 Wrap-up
