The Quiet Semi-Supervised Revolution

Time to dust off that unlabeled data?

One of the most familiar settings for a machine learning engineer is having access to a lot of data, but only modest resources to annotate it. Everyone in that predicament eventually asks the same question: what do you do with limited supervised data but lots of unlabeled data? And the literature appears to have a ready answer: semi-supervised learning.

And that’s usually when things go wrong.

Historically, semi-supervised learning has been one of those rabbit holes every engineer goes down as a rite of passage, only to emerge with a newfound appreciation for plain old data labeling. The details are unique to every problem, but in broad strokes, they can often be depicted as follows:

In low data regimes, semi-supervised training does indeed tend to improve performance. But in a practical setting, you often go from ‘terrible and unusable’ levels of performance to ‘less terrible but still completely unusable.’ Essentially, when you are in a data regime where semi-supervised learning actually helps, it means you’re also in a regime where your classifier is just plain bad and of no practical use.

In addition, semi-supervision generally doesn’t come for free. A method that uses semi-supervised learning very often doesn’t have the same asymptotic properties that supervised learning does in high-data regimes: unlabeled data may introduce bias, for instance (see e.g. Section 4).

A very popular method of semi-supervised learning in the early days of deep learning was to first learn an auto-encoder on unlabeled data, followed by fine-tuning on labeled data. Hardly anyone does this any more, because representations learned via auto-encoding tend to empirically limit the asymptotic performance of fine-tuning. Interestingly, even vastly improved modern generative methods haven’t improved that picture much, probably because what makes a good generative model isn’t necessarily what makes a good classifier.

As a result, when you see engineers fine-tuning models today, it’s generally starting from representations that were learned on supervised data (and yes, I consider text to be self-supervised data for the purpose of language modeling). Wherever practical, transfer learning from other pre-trained models is a much stronger starting point, one that semi-supervised approaches have difficulty outperforming.
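For concreteness, the old auto-encoder recipe can be sketched in a few lines. This is a deliberately tiny illustration, a tied-weight linear auto-encoder on made-up data; every dimension, learning rate, and dataset below is hypothetical, and a real version would use a deep network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two datasets (entirely made up):
X_unlab = rng.normal(size=(500, 10))   # plenty of unlabeled data
X_lab = rng.normal(size=(20, 10))      # a handful of labeled examples
y_lab = (X_lab[:, 0] > 0).astype(float)

def recon_loss(X, W):
    # mean squared reconstruction error of the tied-weight auto-encoder
    return float(((X @ W @ W.T - X) ** 2).mean())

# Step 1: fit the auto-encoder on the unlabeled data.
W = rng.normal(scale=0.1, size=(10, 4))  # encoder (decoder is W.T)
loss_before = recon_loss(X_unlab, W)
for _ in range(200):
    err = X_unlab @ W @ W.T - X_unlab
    # gradient of 0.5 * ||X W W^T - X||^2 with respect to W
    grad = X_unlab.T @ (err @ W) + err.T @ (X_unlab @ W)
    W -= 0.01 * grad / len(X_unlab)
loss_after = recon_loss(X_unlab, W)

# Step 2: freeze the encoder, fine-tune a logistic head on labeled data.
Z = X_lab @ W
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
    g = p - y_lab
    w -= 0.1 * Z.T @ g / len(y_lab)
    b -= 0.1 * g.mean()
```

The point is only the two-stage structure, unsupervised pre-training followed by supervised fine-tuning; the asymptotic ceiling discussed above comes from step 1 optimizing for reconstruction rather than classification.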

So a typical machine learning engineer’s journey through the swamps of semi-supervised learning goes like this:

1: Everything is terrible, let’s try semi-supervised learning! (After all, that’s engineering work, much more interesting than labeling data …)

2: Look, numbers go up! Still terrible, though. Looks like we’ll have to label data after all …

3: More data is better, yay, but have you tried what happens if you discard your semi-supervised machinery?

4: Hey, what do you know, it’s actually simpler and better. We could have saved time and a whole lot of technical debt by skipping 1 and 2 altogether.

If you’re very lucky, your problem may also admittedly have a performance characteristic shaped like this instead:

In that case, there is a narrow data regime where semi-supervised is non-terrible and also improves data efficiency. In my experience, it’s very rare to hit that sweet spot. Factoring in the cost of the extra complexity, the fact that the savings in labeled data typically fall well short of an order of magnitude, and the diminishing returns, it’s rarely worth the trouble, unless you’re competing on an academic benchmark.

But wait, isn’t this piece titled ‘The Quiet Semi-Supervised Revolution’?

One fascinating trend is that the landscape of semi-supervised learning may be changing to something that looks more like this:

And that would change everything. These curves match one’s mental model of what semi-supervised approaches should do: more data should always be better. The gap between semi-supervised and supervised should be strictly positive even for data regimes where supervised learning does well. And increasingly this is happening at no cost and with remarkably little additional complexity. The ‘magic zone’ starts lower, and equally importantly, it isn’t bounded in high data regimes.

What’s new? Lots of things: many clever ways to self-label the data and express losses in such a way that they are compatible with the noise and potential biases of self-labeling. Two recent works exemplify this progress and point to the relevant literature: MixMatch: A Holistic Approach to Semi-Supervised Learning and Unsupervised Data Augmentation.
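The self-labeling idea itself is simple enough to sketch. Here is a bare-bones self-training loop, much simpler than either of the papers above (this is not the MixMatch or UDA procedure): a model fit on the labeled set pseudo-labels the unlabeled pool, and only confident pseudo-labels are kept for retraining. The data, the nearest-centroid classifier, and the 0.95 confidence threshold are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D problem: two Gaussian classes, only four labeled points.
X_lab = np.array([-2.0, -1.5, 1.5, 2.0])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

def fit_centroids(X, y):
    # nearest-centroid "training": one mean per class
    return np.array([X[y == 0].mean(), X[y == 1].mean()])

def predict_proba(X, centroids):
    # softmax over negative squared distance to each centroid
    d = -(X[:, None] - centroids[None, :]) ** 2
    e = np.exp(d - d.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Round 1: fit on labeled data only, then pseudo-label the unlabeled pool.
c = fit_centroids(X_lab, y_lab)
p = predict_proba(X_unlab, c)

# Keep only confident pseudo-labels; the threshold guards against
# the noise and bias of self-labeling mentioned above.
keep = p.max(axis=1) > 0.95
X_aug = np.concatenate([X_lab, X_unlab[keep]])
y_aug = np.concatenate([y_lab, p[keep].argmax(axis=1)])

# Round 2: retrain on labeled + confident pseudo-labeled data.
c2 = fit_centroids(X_aug, y_aug)
```

Modern methods go far beyond this, combining augmentation-based consistency losses, label sharpening, and mixing, but the core bet is the same: the model’s own confident predictions on unlabeled data carry usable signal.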

Another fundamental shift in the world of semi-supervised learning is the realization that it may have a very important role to play in machine learning privacy. Take for example the PATE approach (Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data; Scalable Private Learning with PATE), in which the supervised data is presumed private, and a student model with strong privacy guarantees is trained using only unlabeled (presumed public) data. Privacy-sensitive methods for distilling knowledge are becoming one of the key enablers of Federated Learning, which offers the promise of efficient distributed learning that doesn’t rely on the model having access to user data, with strong mathematical privacy guarantees.
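As a rough sketch of the aggregation step at the heart of PATE: each teacher, trained on a disjoint private shard, votes on the label of a public unlabeled example; Laplace noise is added to the vote histogram, and only the noisy argmax, never the raw counts, is released to the student. The teacher votes and noise scale below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical votes from 50 teachers on one public, unlabeled example
# (in real PATE, each teacher is trained on a disjoint private shard).
votes = np.array([2] * 40 + [0] * 6 + [1] * 4)
n_classes = 3

def noisy_aggregate(votes, n_classes, noise_scale, rng):
    """Noisy-max aggregation: perturb the vote histogram with Laplace
    noise and release only the argmax, never the raw counts."""
    counts = np.bincount(votes, minlength=n_classes).astype(float)
    counts += rng.laplace(scale=noise_scale, size=n_classes)
    return int(counts.argmax())

# The student model trains on (example, label) pairs produced this way.
label = noisy_aggregate(votes, n_classes, noise_scale=2.0, rng=rng)
```

The noise scale sets the privacy/accuracy trade-off: when teachers agree strongly, as here, the noise almost never flips the answer, which is exactly why semi-supervised student training pairs so well with private aggregation.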

It’s an exciting time to be revisiting the value of semi-supervised learning in practical settings. Seeing one’s long-held assumptions challenged is a great indicator of the amazing progress happening in the field. This trend is all very recent, and we’ll have to see if these methods stand the test of time, but the potential for a fundamental shift in the architecture of machine learning tools that could result from these advances is very intriguing.