Patterns in Self-Supervised Learning


Probing the Self for Fun and Profit

Self-Supervision is in the air. Explaining the difference between self-, un-, weakly-, semi-, distantly-, and fully-supervised learning (and of course, RL) just got exponentially tougher. :) Nevertheless, we are gonna try.

The problem, in context, is to encode an object (a word, sentence, image, video, audio, …) into a general-enough representation (blobs of numbers) that is useful (preserves enough of the object’s features) for solving multiple tasks, e.g., finding the sentiment of a sentence, translating it into another language, locating things in an image, making it higher-resolution, detecting the text being spoken, identifying speaker switches, and so on.

Given how diverse images, videos, or speech can be, we must often make do with representations tied to a few tasks (or even a single one), which break down when we encounter new examples or new tasks. Learning more, repeatedly and continuously, from new examples (inputs labeled with expected outputs) is our go-to strategy (supervised learning). We’ve secretly (and ambitiously) wished that this tiresome, repeated learning process would eventually go away and that we’d learn good universal representations for these objects. Learn once, reuse forever. But the so-called unsupervised learning paradigm (only inputs, no labels) hasn’t delivered much (with mild exceptions like GANs and learn-to-cluster models).

Enter Self-Supervision: Thankfully, strewn through the web of AI research, a new pattern of learning has quietly emerged, which promises to get closer to the elusive goal. The principle is pretty simple: to encode an object, you try to set up learning tasks between parts of it or different views of it (the self).

Given one part (input) of the object, can you predict / generate the other part (output)?

There are a couple of flavors of this principle.

  • For example, given the sentence context around a word, can you (learn to) predict the missing word (as in word2vec’s CBOW and BERT; skip-grams flip the direction and predict the context from the word)?
  • Or, modify the view of an object at input and predict what changed (rotate an image and predict the rotation angle).
  • Or, modify the input view and ensure that the output does not change.
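To make the first two flavors concrete, here is a minimal sketch of how such input-output pairs can be auto-generated from the object itself. Everything in it is an illustrative assumption rather than how any particular model does it: the `[MASK]` placeholder, the function names, and the use of a tiny 2-D list as a stand-in for an image are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for illustration


def masked_word_examples(sentence):
    """Return (context_with_blank, missing_word) pairs, one per word."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        context = words[:i] + [MASK] + words[i + 1:]
        examples.append((" ".join(context), target))
    return examples


def rotate90(grid, k):
    """Rotate a 2-D list clockwise by k quarter turns (returns a copy)."""
    grid = [list(row) for row in grid]
    for _ in range(k % 4):
        # Reversing the rows and then transposing is one clockwise turn.
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


def rotation_example(grid, rng):
    """Return (rotated_grid, k): the number of turns k is the free label."""
    k = rng.randrange(4)
    return rotate90(grid, k), k


if __name__ == "__main__":
    for context, word in masked_word_examples("the cat sat"):
        print(context, "->", word)
    image = [[1, 2], [3, 4]]  # toy stand-in for an image
    rotated, label = rotation_example(image, random.Random(0))
    print(rotated, "-> rotated by", 90 * label, "degrees")
```

Both generators only rearrange the object they are given, so the "labels" (the hidden word, the number of quarter turns) come for free. A real pipeline would feed these pairs to whatever supervised model you like.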

Because you are simply playing around with the object, these are free lunch tasks — no external labels needed.

By happy chance, we now have (plenty of) auto-generated input-output examples, and we’re back in the game. Go ahead and use every hammer from your supervised learning toolkit to learn a great (universal?) representation for the object from these examples.

By trying to predict the self-output from the self-input, you end up learning about the intrinsic properties / semantics of the object, which otherwise would have taken a ton of examples to learn from.

Self-supervision losses have been the silent heroes for a while now, across representation learning for multiple domains (as auto-encoders, word embedders, auxiliary losses, many data augmentations, …). A very nice slide deck here. Now, with the ImageNet moment for NLP (ELMo, BERT and others), I guess they’ve made it on their own. They fill the gap in the supervision spectrum that everyone (including AGI) has been waiting for.

Understandably, there is a flurry of research activity around newer self-supervision tricks, getting SoTA with fewer examples, and mixing various kinds of supervision (hello NeurIPS!). Until now, self-supervised methods have mostly tried to relate the components of an object: take one part as input and predict the other part. Or, change the object’s view by data augmentation and predict the same label.

Going ahead, let’s see how creative the community gets when playing around with the new hammer. Many questions remain: for example, how do you compare different self-supervised tricks, and which one learns better representations than the others? How do you pick the output? For example, instead of having explicit labels as outputs, UDA (Unsupervised Data Augmentation) uses the model’s own output distribution D as the label: ensure that D changes minimally when the view of the input x changes.
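The "D changes minimally" idea can be sketched as a consistency loss: run the model on x and on an augmented view of x, and penalize the divergence between the two output distributions. This is a rough illustration of the principle under stated assumptions, not UDA’s actual implementation; `model`, `augment`, and the choice of KL divergence as the penalty are all placeholders picked for the example.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(model, x, augment):
    """Penalize changes in the output distribution under a new view of x."""
    p = softmax(model(x))           # D on the original view
    q = softmax(model(augment(x)))  # D on the modified view
    return kl_divergence(p, q)


if __name__ == "__main__":
    # Hypothetical toy model: two logits that depend only on the sum of x.
    toy_model = lambda x: [sum(x), -sum(x)]
    x = [1.0, 2.0]
    print(consistency_loss(toy_model, x, lambda v: v[::-1]))            # order-invariant view
    print(consistency_loss(toy_model, x, lambda v: [2 * a for a in v]))  # view that shifts D
```

No label for x appears anywhere: the model’s own prediction on the original view plays the role of the target. The loss is zero when the augmentation leaves the output distribution untouched and grows as the two distributions drift apart.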

Also, I’m very curious who claims they were the first to do it :)

PS: if you are looking for someone to ‘supervise’ you (weakly, fully, remotely or even co-supervise) to solve some very interesting text, vision and speech problems, get in touch with me at !