Photo by @penguinuhh

How is the Quiet Revolution in Semi-Supervised Learning Changing the Industry?

MixMatch, Unsupervised Data Augmentation and the PATE Approach

Alex Moltzau
Towards Data Science
8 min read · Jul 13, 2019

As an anthropology student with a minor in computer science, I would like to do my best to understand this development and what consequences it could have when implemented. First, however, we have to run through the practical aspects of the changes and techniques enabling a possibly viable Semi-Supervised Learning approach. Then I will jump to a few techniques that combine, and possibly change, the way we approach this area within machine learning.

This is the last day of my three days looking at three questions.

Day one: how is Google a frontrunner in the field of AI? (Done ✓)

Day two: what advancements are being made in Semi-Supervised Learning (SSL) with Unsupervised Data Augmentation (UDA), and why are they important for the field of AI? (Done ✓)

Day three: how is the quiet revolution in SSL changing the industry?

Today being the last day, I will write about The Quiet Semi-Supervised Revolution. I will start with how the term was coined; previous common practice; and how the SSL landscape is changing. After that I will conclude briefly.

Who Coined The Quiet Semi-Supervised Revolution?

On the 15th of May, Principal Scientist at Google Vincent Vanhoucke published an article called The Quiet Semi-Supervised Revolution. As far as I know, this is the first time the changes in SSL were described in this manner.

He starts by talking about the previous problems of Semi-Supervised Learning (SSL). With access to lots of data, of which only a limited amount is labelled, SSL seems like an obvious solution. He presents his view of the graph that most often results from experiments comparing supervised and semi-supervised learning.

Illustrative graph by Vincent Vanhoucke

According to Vanhoucke, a machine learning engineer goes through a journey that ends back at supervised learning, shown in this graph.

Illustrative graph by Vincent Vanhoucke

However, he follows this up and says:

One fascinating trend is that the landscape of semi-supervised learning may be changing to something that looks more like this:

Illustrative graph by Vincent Vanhoucke

What was common practice before?

SSL is described as a rabbit hole for engineers, almost a rite of passage, only to come back to data labelling. According to Vanhoucke, the common practice previously was to:

… first learn an auto-encoder on unlabeled data, followed by fine-tuning on labeled data. Hardly anyone does this any more because representations learned via auto-encoding tend to empirically limit the asymptotic performance of fine-tuning.

So what is an auto-encoder? Let’s break this down.

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name.
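To make the older "pretrain, then fine-tune" recipe from Vanhoucke's quote concrete, here is a minimal numpy sketch. Everything is a stand-in of my own: toy random data, a linear encoder/decoder instead of a deep network, and plain gradient descent. The real pipeline would train a deep autoencoder on unlabeled data and then fine-tune the encoder plus a classification head on the labeled data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: plenty of unlabeled points.
X_unlab = rng.normal(size=(200, 10))

d, k = 10, 3                               # input dim, bottleneck dim
W_enc = rng.normal(scale=0.5, size=(d, k))
W_dec = rng.normal(scale=0.5, size=(k, d))

def recon_loss(X):
    # Mean squared error between the reconstruction and the input.
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

loss_before = recon_loss(X_unlab)

# Step 1: train the autoencoder on unlabeled data (plain gradient descent).
lr = 0.02
for _ in range(300):
    Z = X_unlab @ W_enc                    # encode
    E = Z @ W_dec - X_unlab                # reconstruction error
    gW_dec = Z.T @ E / len(X_unlab)
    gW_enc = X_unlab.T @ (E @ W_dec.T) / len(X_unlab)
    W_dec -= lr * gW_dec
    W_enc -= lr * gW_enc

loss_after = recon_loss(X_unlab)
print(f"reconstruction loss: {loss_before:.3f} -> {loss_after:.3f}")

# Step 2 (the fine-tuning stage): the learned encoding X @ W_enc would
# now be trained further, together with a small classifier head, on the
# labeled examples only.
```

The point of the sketch is only the two-stage structure: the representation is learned without labels first, and the labels come in afterwards.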

  • Dimensionality of the input space. In high-dimensional spaces (hundreds or thousands of dimensions), the volume of the space increases so much that the data becomes sparse. Think, for example, of computing each combination of values in an optimisation problem. If you prefer an arcane slant, this point can be referred to as the Curse of Dimensionality.
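To make the sparsity point concrete, a tiny sketch (with made-up numbers) of how the number of value combinations explodes with the number of dimensions:

```python
# Grid points needed to cover a space with k bins per axis grow as k**d:
# the volume explodes, so any fixed-size dataset becomes sparse.
def grid_points(bins_per_axis, dims):
    return bins_per_axis ** dims

print(grid_points(10, 2))   # 2 dimensions: 100 cells
print(grid_points(10, 10))  # 10 dimensions: 10 billion cells
```

The same dataset that densely covers a 2-D grid leaves a 10-D grid almost entirely empty.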

An asymptote refers to a line in mathematics that a curve approaches at infinity. Asymptotic notation in computational complexity refers to the limiting behaviour of a function whose domain and range are the positive integers; it is valid for values of the domain greater than a particular threshold. Thus, we approximate curves with other curves, preferably seeking curves that track the original closely.

  • Asymptotic performance is a way to compare algorithm performance. It abstracts away low-level details (e.g. exact assembly code) and lets you investigate scaling behaviour: which algorithm is better on really large inputs? In short: as input size grows, how does execution time grow?
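As an illustration (my own toy example, not from the article), counting worst-case comparisons for two search strategies shows the kind of scaling question asymptotic analysis asks:

```python
import math

# Worst-case comparison counts: linear search grows like n,
# binary search like log2(n) -- very different asymptotic performance.
def linear_search_steps(n):
    return n

def binary_search_steps(n):
    return math.ceil(math.log2(n)) + 1

for n in (8, 1024, 1_000_000):
    print(n, linear_search_steps(n), binary_search_steps(n))
```

For small inputs the difference hardly matters; asymptotically, it dominates everything else.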

Vanhoucke claims:

…even vastly improved modern generative methods haven’t improved that picture much, probably because what makes a good generative model isn’t necessarily what makes a good classifier. As a result, when you see engineers fine-tuning models today, it’s generally starting from representations that were learned on supervised data…

What are generative methods?

Generative learning is a theory that involves the active integration of new ideas with the learner’s existing schemata. The main idea of generative learning is that, in order to learn with understanding, a learner has to construct meaning actively. A generative model only applies to probabilistic methods. In statistical classification, including machine learning, two main approaches are called the generative approach and the discriminative approach. Generative classifiers model the joint distribution, while discriminative classifiers model the conditional distribution of the label given the input directly.
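As a rough sketch (toy 1-D data I made up), a generative classifier models the joint distribution p(x, y) = p(x | y)p(y) by fitting a class-conditional Gaussian per class, then classifies with Bayes' rule:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data: two classes with well-separated means (illustrative only).
x0 = rng.normal(-2.0, 1.0, 500)   # class 0
x1 = rng.normal(+2.0, 1.0, 500)   # class 1

# Generative step: fit p(x | y) per class, plus the class priors p(y).
mu0, mu1 = x0.mean(), x1.mean()
var0, var1 = x0.var(), x1.var()
prior0 = prior1 = 0.5

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def classify(x):
    p0 = gauss(x, mu0, var0) * prior0   # proportional to p(x, y=0)
    p1 = gauss(x, mu1, var1) * prior1   # proportional to p(x, y=1)
    return int(p1 > p0)

print(classify(-1.5), classify(1.5))
```

A discriminative classifier (e.g. logistic regression) would skip modelling p(x) entirely and learn the decision boundary directly, which is part of why a good generative model is not automatically a good classifier.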

What is changing the SSL landscape?

Yesterday I wrote an article called Google AI and Developments in Semi-Supervised Learning. The article first went through an explanation of unsupervised learning, supervised learning and reinforcement learning. Then it continued with an explanation of semi-supervised learning (SSL) and how research is being done on SSL with Unsupervised Data Augmentation (UDA). As such, if you are unfamiliar with these terms, it may be wise to skip back to that article.

Anyhow, there are a few advancements mentioned that drive the shift towards increased viability of SSL. These are three prevalent ones that you may want to check out:

There are new clever ways to self-label the data and express losses that are more compatible with the noise and potential biases of self-labeling. Two recent works matching the first two points exemplify recent progress and point to the relevant literature: MixMatch: A Holistic Approach to Semi-Supervised Learning and Unsupervised Data Augmentation.

In the MixMatch paper, the authors introduce MixMatch, an SSL algorithm which proposes a single loss that unifies the dominant approaches to semi-supervised learning. Unlike previous methods, MixMatch targets all the properties at once, which they find leads to the following benefits:

  • In an experiment they show that MixMatch obtains state-of-the-art results on all standard image benchmarks (section 4.2), for example obtaining an 11.08% error rate on CIFAR-10 with 250 labels (compared to the next-best method, which achieved 38%).
  • They show in an ablation study that MixMatch is greater than the sum of its parts.
  • They demonstrate that MixMatch is useful for differentially private learning, enabling students in the PATE framework to obtain new state-of-the-art results that simultaneously strengthen both the privacy guarantees provided and the accuracy achieved.

Consistency regularisation applies data augmentation to semi-supervised learning by leveraging the idea that a classifier should output the same class distribution for an unlabelled example even after it has been augmented. MixMatch utilises a form of consistency regularisation through the use of standard data augmentation for images (random horizontal flips and crops).
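A minimal numpy sketch of that idea, with a made-up stand-in model and noise in place of real image flips and crops: predict on an unlabelled example and on an augmented copy, and penalise the distance between the two class distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained classifier: softmax over toy logits.
def model(x):
    logits = np.array([x.sum(), -x.sum(), 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def augment(x):
    # Toy "augmentation": small random noise (stands in for flips/crops).
    return x + rng.normal(scale=0.01, size=x.shape)

x = rng.normal(size=8)           # one unlabelled example
p = model(x)                     # prediction on the original
p_aug = model(augment(x))        # prediction on the augmented version

# Consistency loss: squared distance between the two class distributions.
# No label is needed, so this term can be computed on unlabelled data.
consistency_loss = np.sum((p - p_aug) ** 2)
print(f"consistency loss: {consistency_loss:.6f}")
```

Minimising this term pushes the classifier to give the same answer regardless of the augmentation, which is exactly the consistency assumption described above.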

MixMatch, introduced by members of the Google Brain team, is a “holistic” approach which incorporates ideas and components from the current dominant paradigms for semi-supervised learning.
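One of those borrowed components is MixUp, which interpolates between pairs of examples and their labels. A small sketch of the interpolation step as I understand it from the paper (the alpha value and the "keep the larger coefficient" trick follow MixMatch's variant; the data here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.75):
    # Sample a mixing coefficient from a Beta distribution; MixMatch
    # keeps lam >= 0.5 so the mixed example stays closer to the first one.
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Two toy examples with one-hot labels.
x1, y1 = np.ones(4), np.array([1.0, 0.0])
x2, y2 = np.zeros(4), np.array([0.0, 1.0])

x_mix, y_mix = mixup(x1, y1, x2, y2)
print(x_mix, y_mix)
```

The mixed label is a weighted blend rather than a hard class, which smooths the decision boundary between the two examples.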

Through extensive experiments on semi-supervised and privacy-preserving learning, they found that MixMatch exhibited significantly improved performance compared to other methods in all settings studied, often by a factor-of-two or more reduction in error rate.

In future work, they are interested in incorporating additional ideas from the semi-supervised learning literature into hybrid methods and continuing to explore which components result in effective algorithms.

Separately, most modern work on semi-supervised learning algorithms is evaluated on image benchmarks; they are interested in exploring the effectiveness of MixMatch in other domains.

SSL with UDA. Since it is much easier to obtain unlabeled data than labeled data, in practice we often encounter a situation where there is a large gap between the amount of unlabeled data and that of labeled data.

To enable UDA to take advantage of as much unlabeled data as possible, they usually need a large enough model, but a large model can easily overfit the supervised data of a limited size.

To tackle this difficulty, they introduce a new training technique called Training Signal Annealing (TSA). The main intuition behind TSA is to gradually release the training signals of the labeled examples without overfitting them as the model is trained on more and more unlabeled examples.
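A small sketch of that intuition (the schedule shapes follow my reading of the UDA paper; the probabilities below are made-up): a confidence threshold grows from 1/K to 1 over training, and labeled examples the model already predicts above the threshold are masked out of the loss.

```python
import numpy as np

def tsa_threshold(step, total_steps, num_classes=10, schedule="linear"):
    # Threshold eta(t) grows from 1/K (uniform confidence) to 1 as
    # training progresses, releasing the labeled signal gradually.
    t = step / total_steps
    if schedule == "linear":
        frac = t
    elif schedule == "exp":
        frac = np.exp((t - 1) * 5)
    else:  # "log"
        frac = 1 - np.exp(-t * 5)
    return frac * (1 - 1 / num_classes) + 1 / num_classes

# Correct-class probabilities for three labeled examples (toy values).
probs_correct = np.array([0.05, 0.4, 0.95])

eta = tsa_threshold(step=500, total_steps=1000)
mask = probs_correct <= eta   # examples the model is already sure of drop out
print(f"threshold={eta:.2f}, mask={mask}")
```

Halfway through training the linear schedule gives a threshold of 0.55, so the already-confident third example contributes nothing to the supervised loss, while the other two still do.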

Sharpening Predictions. They observe that the predicted distributions on unlabeled examples and augmented unlabeled examples tend to be over-flat across categories in cases where the problem is hard and the number of labeled examples is very small.
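The standard fix for over-flat distributions is temperature sharpening: raise each probability to the power 1/T and renormalise. A minimal sketch (the input distribution is a made-up example):

```python
import numpy as np

def sharpen(p, T=0.5):
    # Raise probabilities to 1/T and renormalise; T < 1 makes the
    # distribution peakier, and T -> 0 approaches a one-hot vector.
    p_t = p ** (1 / T)
    return p_t / p_t.sum()

p = np.array([0.4, 0.35, 0.25])   # an over-flat prediction
print(sharpen(p))
```

The argmax is unchanged, but the winning class gains probability mass, giving a more decisive training signal on the unlabeled examples.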

Confidence-based masking. Mask out examples that the model is not confident about.
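In code, this amounts to dropping unlabeled examples whose highest predicted probability falls below a threshold. A sketch with made-up predictions and a hypothetical threshold value:

```python
import numpy as np

# Predicted class distributions for four unlabelled examples (toy values).
probs = np.array([
    [0.95, 0.03, 0.02],   # confident
    [0.40, 0.35, 0.25],   # not confident
    [0.10, 0.85, 0.05],   # confident
    [0.34, 0.33, 0.33],   # not confident
])

beta = 0.8                        # confidence threshold (hypothetical value)
mask = probs.max(axis=1) > beta   # keep only the confident examples
print(mask)
```

Only the confident predictions then contribute to the unsupervised loss, reducing the noise introduced by self-labeling.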

Scalable Private Learning with PATE. I will draw upon an excerpt from the abstract of the paper, released on the 24th of February 2018:

The rapid adoption of machine learning has increased concerns about the privacy implications of machine learning models trained on sensitive data, such as medical records or other personal information. To address those concerns, one promising approach is Private Aggregation of Teacher Ensembles, or PATE, which transfers to a “student” model the knowledge of an ensemble of “teacher” models, with intuitive privacy provided by training teachers on disjoint data and strong privacy guaranteed by noisy aggregation of teachers’ answers.
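The "noisy aggregation" at the heart of PATE can be sketched in a few lines (the vote counts and noise scale here are made up; the real mechanism comes with a careful differential-privacy analysis): each teacher votes for a class, noise is added to the vote counts, and the student only ever sees the noisy winner.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_aggregate(teacher_votes, num_classes, noise_scale=1.0):
    # Count each teacher's vote, add Laplace noise to the counts, and
    # return the noisy argmax -- the core of PATE's private aggregation.
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=noise_scale, size=num_classes)
    return int(np.argmax(counts))

# 20 hypothetical teachers (trained on disjoint data) voting on one
# student query with 3 possible classes.
votes = np.array([0] * 14 + [1] * 4 + [2] * 2)
result = noisy_aggregate(votes, num_classes=3)
print(result)
```

Because the student learns only from these noisy answers, no single teacher's training data can dominate what the student model memorises.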

Conclusion

SSL with UDA is very much like recreating from memory what you just saw, to understand the visual impression in a computational sense. MixMatch combines several approaches to make SSL work better. PATE is necessary to maintain privacy. SSL can also help maintain privacy when learning has to happen on a need-to-know basis, with data you might not know (or be allowed to know) much about beforehand. Increasing accuracy in this context is therefore important and may change the industry for the better.

This is day 41 of #500daysofAI.

Hope you enjoyed this article and remember to give me feedback if you have the chance. As I mentioned in the introduction I am doing my best to understand, and I write to learn.

Wish you all the best.

What is #500daysofAI?
I am challenging myself to write and think about the topic of artificial intelligence for the next 500 days with the #500daysofAI. Learning together is the greatest joy so please give me feedback if you feel an article resonates.
