
Advancements in Semi-Supervised Learning with Unsupervised Data Augmentation

Why is it important for the field of artificial intelligence?

Alex Moltzau
Towards Data Science
15 min read · Jul 12, 2019


In this article I attempt to understand the progress made within Semi-Supervised Learning (SSL) with Unsupervised Data Augmentation (UDA): first by going through the main well-known machine learning techniques, and second by going through the recent Google AI blog post and its accompanying paper on SSL with UDA.

Me explaining articles by members of Google Brain and blog posts from Google AI may seem like a teenager commenting on a professional sports team. If that seems as much the case to you as it does to me, I deeply apologise and kindly ask for your feedback. To me, writing is a process of learning.

Yesterday I expressed the wish to spend three days attempting to answer three questions.

Day one: how is Google a frontrunner in the field of AI? (Done ✓)

Day two: what advancements are being made in Semi-Supervised Learning (SSL) with Unsupervised Data Augmentation (UDA), and why are they important for the field of AI?

Day three: how is the quiet revolution in SSL changing the industry?

Today, being the second day, I will focus on the advancements made in SSL with UDA, but first I will explain the three main techniques within machine learning. You may skip the first part and go straight to the second if you have prior knowledge of the subject.

What are Unsupervised Learning, Supervised Learning and Reinforcement Learning?

To understand the ‘semi-supervised’ let us first look briefly at unsupervised learning, supervised learning and reinforcement learning. Be aware that much of the information here consists of edited excerpts from Wikipedia and a few other places, mostly fellow writers in Towards Data Science on Medium. This text is simply meant to give a surface understanding of the terminology.

Unsupervised Learning

Unsupervised learning is a type of self-organised Hebbian learning that helps find previously unknown patterns in a data set without pre-existing labels. It is also known as self-organisation and allows modelling probability densities of given inputs.

If it is indeed a type of Hebbian learning, what does that entail?

Hebbian learning is one of the oldest learning algorithms, and is based in large part on the dynamics of biological systems. A synapse between two neurons is strengthened when the neurons on either side of the synapse (input and output) have highly correlated outputs. If you want to read more I suggest you take a look at What is Hebbian Learning written by Prafful Mishra on the topic.

Hebbian theory is a neuroscientific theory claiming that an increase in synaptic efficacy arises from a presynaptic cell’s repeated and persistent stimulation of a postsynaptic cell. It is an attempt to explain synaptic plasticity, the adaptation of brain neurons during the learning process. You can additionally if you wish watch a video on Hebb’s Three Postulates:

Thank you to Veer for sharing this video in the article Hybrid Model for Unsupervised Learning.

The theory was introduced by Donald Hebb in his 1949 book The Organization of Behavior. This book has been part of the basis for the development of artificial neural networks (ANN).

In psychology it is a hypothesis for how neuronal connections are reinforced in mammalian brains; it is also a technique for weight selection in artificial neural networks. Algorithms can update the weights of neural connections in modern artificial neural networks. By changing neural weights and associations, engineers can get different results out of an ANN.

Hebbian learning comes in two flavours: (1) unsupervised, where weights are strengthened by the actual response to a stimulus, and (2) supervised, where weights are strengthened by the desired response. Unsupervised (associative) Hebbian learning had the problems that weights could become arbitrarily large and that there was no mechanism for weights to decrease.
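
To make this concrete, here is a minimal numpy sketch of the plain Hebbian rule next to Oja's variant, which adds a decay term to address exactly the unbounded-weight problem just mentioned. The data and learning rate are illustrative assumptions on my part.

    import numpy as np

    # Plain Hebbian rule: delta_w = eta * y * x, where y is the actual
    # response to the stimulus. With no decay term the weights can only
    # grow, illustrating the "arbitrarily large weights" problem.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 3))      # 500 input patterns, 3 features
    eta = 0.01                            # learning rate

    w = rng.normal(size=3) * 0.01
    for x in data:
        y = w @ x                         # actual response
        w += eta * y * x                  # unbounded Hebbian update

    # Oja's rule adds a decay term that keeps the weight vector bounded;
    # it converges towards the first principal component of the data.
    w_oja = rng.normal(size=3) * 0.01
    for x in data:
        y = w_oja @ x
        w_oja += eta * y * (x - y * w_oja)

    print(np.linalg.norm(w), np.linalg.norm(w_oja))

It is a neat coincidence for this article that Oja's stabilised Hebbian rule converges towards the first principal component, which is exactly where the next section picks up.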

Taking a step back: unsupervised learning is one of the three main categories of machine learning, alongside supervised and reinforcement learning.

Two of the main methods used in unsupervised learning are:

  1. Principal component analysis
  2. Cluster analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

PCA of a multivariate Gaussian distribution centered at (1,3) with a standard deviation of 3 in roughly the (0.866, 0.5) direction and of 1 in the orthogonal direction. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.
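
As a quick, hedged sketch of PCA in practice (the covariance below is my own illustrative choice, roughly matching the figure):

    import numpy as np
    from sklearn.decomposition import PCA

    # Sample a correlated 2-D Gaussian centred at (1, 3), then let PCA
    # recover the orthogonal directions of greatest variance.
    rng = np.random.default_rng(0)
    cov = [[3.0, 1.5], [1.5, 1.5]]        # illustrative covariance
    points = rng.multivariate_normal(mean=[1, 3], cov=cov, size=1000)

    pca = PCA(n_components=2).fit(points)
    print(pca.components_)                # eigenvectors of the covariance
    print(pca.explained_variance_)        # corresponding eigenvalues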

Cluster analysis is used in unsupervised learning to group, or segment, datasets with shared attributes in order to extrapolate algorithmic relationships. Cluster analysis is a branch of machine learning that groups the data that has not been labelled, classified or categorised. This analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. This approach helps detect anomalous data points that do not fit into either group.

The result of a cluster analysis shown as the colouring of the squares into three clusters.

The notion of a “cluster” cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms.
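
A short sklearn sketch of cluster analysis; k-means is only one of the many clustering algorithms just mentioned, and the synthetic data is an assumption for illustration:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Unlabelled data with three latent groups; k-means tries to
    # recover them purely from commonalities in the data.
    points, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_[:10])            # cluster assignment per point
    print(kmeans.cluster_centers_)        # centre of each discovered cluster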

A central application of unsupervised learning is in the field of density estimation.

Demonstration of density estimation using kernel smoothing: The true density is a mixture of two Gaussians centered around 0 and 3, shown with the solid blue curve. In each frame, 100 samples are generated from the distribution, shown in red. Centered on each sample, a Gaussian kernel is drawn in gray. Averaging the Gaussians yields the density estimate shown in the dashed black curve.
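
A small scipy sketch mirroring that caption (sample sizes and the default bandwidth are assumptions):

    import numpy as np
    from scipy.stats import gaussian_kde

    # 100 samples from a mixture of two Gaussians centred at 0 and 3,
    # then a kernel density estimate that averages a Gaussian per sample.
    rng = np.random.default_rng(0)
    samples = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])

    kde = gaussian_kde(samples)
    grid = np.linspace(-4, 7, 200)
    density = kde(grid)                   # estimated density over the grid
    print(grid[density.argmax()])         # roughly one of the two modes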

A very natural use of density estimates is in the informal investigation of the properties of a given set of data. Density estimates can give valuable indication of such features as skewness and multimodality in the data. In some cases they will yield conclusions that may then be regarded as self-evidently true, while in others all they will do is to point the way to further analysis and/or data collection.

(1) Skewness in probability theory and statistics is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or undefined. Many models assume normal distribution; i.e., data are symmetric about the mean. The normal distribution has a skewness of zero. But in reality, data points may not be perfectly symmetric. So, an understanding of the skewness of the dataset indicates whether deviations from the mean are going to be positive or negative.
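
For instance, a quick check of skewness with scipy (the two distributions are illustrative choices):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    symmetric = rng.normal(size=10_000)           # skewness near zero
    right_skewed = rng.exponential(size=10_000)   # positive skewness

    print(skew(symmetric), skew(right_skewed))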

One critique related to skewness comes from Benoît Mandelbrot, the French mathematician. He argued that the extensive reliance on the normal distribution for much of the body of modern finance and investment theory is a serious flaw of any related models. He explained his views and an alternative finance theory in a book: The (Mis)Behavior of Markets: A Fractal View of Risk, Ruin and Reward (published in 2004). We could of course ask whether this critique extends to parts of the field of AI or certain machine learning techniques.

(2) Multimodality in its most basic sense is a theory of communication and social semiotics. Multimodality describes communication practices in terms of the textual, aural, linguistic, spatial, and visual resources — or modes — used to compose messages.

For the field of artificial intelligence, multimodality can mean using machine learning techniques to interpret different signals together, such as text and pictures.

The scientific publisher IGI Global has an overview called What is Multimodality?

A modality, or, more explicitly, a modality of information representation, is a way of representing information in some medium […] Multimodality allows an integrated use of various forms of interaction simultaneously […] Multiple types of media data, or multiple aspects of a data item. Its emphasis is on the existence of more than one type (aspects) of data. For example, a clip of digital broadcast news video has multiple modalities, include the audio, video frames, closed-caption (text), and so forth.

In statistics multimodal distribution is a continuous probability distribution with two or more modes. For a more comprehensive explanation check out Purvanshi Mehta’s article Multimodal Deep Learning.

A bivariate, multimodal distribution

Vishal Maini explains the utility of unsupervised learning in Machine Learning for Humans with his article on Unsupervised Learning (which you can read to go more in depth):

Unsupervised learning is often used to preprocess the data. Usually, that means compressing it in some meaning-preserving way like with PCA or SVD before feeding it to a deep neural net or another supervised learning algorithm.

There is of course much more to be said on the topic of Unsupervised learning, yet we will proceed to supervised learning.

Supervised Learning

In supervised learning an optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances.

Two important aspects are generally said to be classification and regression.

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs.

Drawing from Manish Thapliyal's article Machine Learning Basics: Supervised Learning Theory Part-1

Regression analysis is a set of statistical processes for estimating the relationships among variables.

Picture from Dataaspirant’s article from 2014 on the difference between classification and regression in machine learning.
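
A minimal sketch of the two tasks side by side, with synthetic data of my own invention:

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=(100, 1))

    # Regression: the target is a continuous value.
    y_continuous = 2.5 * x.ravel() + rng.normal(0, 1, 100)
    regressor = LinearRegression().fit(x, y_continuous)
    print(regressor.predict([[4.0]]))     # an estimated number

    # Classification: the target is a category (here 0 or 1).
    y_category = (x.ravel() > 5).astype(int)
    classifier = LogisticRegression().fit(x, y_category)
    print(classifier.predict([[4.0]]))    # a predicted class label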

According to Stuart J. Russell and Peter Norvig (2010) in Artificial Intelligence: A Modern Approach, supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.

  1. A function is inferred from labeled training data consisting of a set of training examples.
  2. Each example is an input-output pair: an input object and a desired output value.
  3. A supervised learning algorithm analyzes the training data and produces an inferred function.
  4. The inferred function can be used for mapping new examples.

There are of course more steps to the process than these four, which are simply the ones most often shared.

An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a “reasonable” way.

There may be an inductive bias: a set of assumptions the learner uses to predict outputs given inputs that it has not encountered. Although most learning algorithms have a static bias, some algorithms are designed to shift their bias as they acquire more data. This does not avoid bias, since the bias-shifting process itself must have a bias. Biasception?

Some challenges can be:

  • Bias and variance tradeoff. Imagine several different, but equally good, training data sets. Should you make the algorithm flexible enough to fit the data? If it is too flexible it may fit each training data set differently (high variance).
  • Function complexity and amount of training data. A simple, “inflexible” learning algorithm with high bias and low variance can learn from a small amount of data. A highly complex function can only be learned from a very large amount of training data, using a “flexible” learning algorithm with low bias and high variance.
  • Dimensionality of the input space. In high-dimensional spaces (hundreds or thousands of dimensions) the volume of the space increases so much that the data becomes sparse; consider computing each combination of values in an optimisation problem, for example. If you wish an arcane slant, this point can be referred to as the curse of dimensionality.
  • Noise in the output values. If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Noise can be alleviated through early stopping and anomaly detection (see unsupervised learning).
  • Heterogeneity of the data. Input that is diverse in character or content, as opposed to homogeneous (similar) input.
  • Redundancy in the data. Giving more weight to information that has been repeated several times. This can mean two different fields within a single database, or two different spots in multiple software environments or platforms. A positive type of data redundancy works to safeguard data and promote consistency.
  • Presence of interactions and non-linearity. The question of linear functions and distance functions versus decision trees or neural networks: if each feature makes an independent contribution to the output, the first (linear/distance) may do; if there are complex interactions among features, the second (decision trees/neural networks) could be the solution.

Additionally there is the pervasive question of overfitting versus underfitting.

The green line represents an overfitted model and the black line represents a regularised model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new unseen data, compared to the black line.

Overfitting in statistics is “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably”.

Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An underfitted model is a model where some parameters or terms that would appear in a correctly specified model are missing.
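
The classic way to see both failure modes is to fit polynomials of increasing degree to noisy data; the degrees and noise level below are assumptions chosen for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 15)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)
    x_test = np.linspace(0, 1, 100)
    y_test = np.sin(2 * np.pi * x_test)

    # Degree 1 underfits, degree 4 is reasonable, and degree 14
    # interpolates the noise, so its test error explodes.
    for degree in (1, 4, 14):
        coeffs = np.polyfit(x_train, y_train, degree)
        error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(float(error), 4))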

What is a good fit? We could go into a more controversial philosophical perspective on this: a social scientist, a politician and an engineer may very well disagree on what constitutes a good fit. In the end it is about model performance, performance being how well a person, machine, etc. does a piece of work or an activity. There are certainly different goals involved in the making of an algorithm.

Will Koehrsen has written an article called Overfitting vs. Underfitting: A Complete Example that I recommend you check out. There is one quote from the article I would like to mention here:

In order to talk about underfitting vs overfitting, we need to start with the basics: what is a model? A model is simply a system for mapping inputs to outputs.

No algorithm can solve all problems. It is always fun in this context to mention the hilarious yet serious no free lunch theorem. In optimisation and computational complexity, this is a result stating that for certain types of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method. In this sense there is no shortcut.

Cartoon by xkcd

However, Wolpert and Macready have proved that there are free lunches in coevolutionary optimisation, which brings us elegantly to the next section.

Reinforcement Learning

Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. Why is this so?

It differs from supervised learning in that labelled input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Software agent: a computer program that acts for a user or other program in a relationship of agency.

Again I refer to an article by Vishal Maini, this time on Reinforcement Learning, where he shares this model:

The agent observes the environment, takes an action to interact with the environment, and receives positive or negative reward. Diagram from Berkeley’s CS 294: Deep Reinforcement Learning by John Schulman & Pieter Abbeel

The model however could also be represented in this manner:

The typical framing of a Reinforcement Learning (RL) scenario: an agent takes actions in an environment, which is interpreted into a reward and a representation of the state, which are fed back into the agent.

Basic reinforcement learning is often presented as a Markov decision process (MDP). Another writer for Towards Data Science, Mohammad Ashraf, presents this model in his article Reinforcement Learning Demystified: Markov Decision Processes (Part 1):

In his article Mohammad gives a good introduction to MDPs, and I will quote a few lines that I found useful:

The Markov property states that,“ The future is independent of the past given the present.” Once the current state is known, the history of information encountered so far may be thrown away, and that state is a sufficient statistic that gives us the same characterization of the future as if we have all the history […] A Markov Reward Process or an MRP is a Markov process with value judgment, saying how much reward accumulated through some particular sequence that we sampled […] Markov Decision Process. An MDP is a Markov Reward Process with decisions, it’s an environment in which all states are Markov.

A funny example he posted illustrates this in the student Markov Decision Process (MDP). To study, sleep, go to the pub, check Facebook, or quit: not easy to answer? This is clearly a British MDP.

State-value function in the student MDP, taken from David Silver's lecture at UCL.

If you are a visual learner who likes watching videos you may want to check out this introduction to Reinforcement Learning with Arxiv Insights.

I am sharing here a model he presented at 6:35:

Through trial-and-error there is an attempted task with a goal of maximising long-term reward, and the agent learns from experience in the absence of training data.
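
Here is a minimal tabular Q-learning sketch of that trial-and-error loop, on a hypothetical five-state corridor I made up for illustration (reward only at the far end):

    import numpy as np

    n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
    q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.9, 0.1
    rng = np.random.default_rng(0)

    for episode in range(500):
        state = 0
        while state != n_states - 1:
            # Exploration vs exploitation: sometimes act randomly,
            # otherwise pick a (randomly tie-broken) greedy action.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                best = np.flatnonzero(q[state] == q[state].max())
                action = int(rng.choice(best))
            next_state = min(state + 1, n_states - 1) if action else max(state - 1, 0)
            reward = 1.0 if next_state == n_states - 1 else 0.0
            # Move the estimate towards reward + discounted future value.
            q[state, action] += alpha * (
                reward + gamma * q[next_state].max() - q[state, action])
            state = next_state

    print(q.argmax(axis=1))               # go right in every non-terminal state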

The environment, as mentioned previously, is typically formulated as an MDP. Many reinforcement learning algorithms for this context utilise dynamic programming techniques.

Dynamic programming refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. If sub-problems can be nested recursively inside larger problems, so that dynamic programming methods are applicable, then there is a relation between the value of the larger problem and the values of the sub-problems.
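
As a sketch of that relation, value iteration on the same hypothetical five-state corridor defines each state's value through the values of its successor sub-problems:

    import numpy as np

    # V(s) = max over a of [ r(s, a) + gamma * V(next(s, a)) ]
    n_states, gamma = 5, 0.9
    values = np.zeros(n_states)

    def step(state, action):              # action 0 = left, 1 = right
        nxt = min(state + 1, n_states - 1) if action else max(state - 1, 0)
        reward = 1.0 if nxt == n_states - 1 else 0.0
        return nxt, reward

    for _ in range(100):                  # sweep until values stabilise
        for s in range(n_states - 1):     # last state is terminal
            values[s] = max(r + gamma * values[nxt]
                            for nxt, r in (step(s, a) for a in (0, 1)))

    print(values.round(3))                # e.g. [0.729 0.81 0.9 1.0 0.0]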

Recursion occurs when a thing is defined in terms of itself or of its type. Recursion is sometimes used humorously in computer science, programming, philosophy, or mathematics textbooks, generally by giving a circular definition or self-reference, in which the putative recursive step does not get closer to a base case, but instead leads to an infinite regress. It is not unusual for such books to include a joke entry in their glossary along the lines of: Recursion, see Recursion.

Due to its generality, reinforcement learning is studied in many other disciplines, including game theory, control theory, operations research, information theory, simulation-based optimisation, multi-agent systems, swarm intelligence, statistics and genetic algorithms.

These machine learning tasks, unsupervised learning, supervised learning and reinforcement learning, are distinct yet complementary. If you want to read more about each I recommend the articles by Vishal Maini in his series Machine Learning for Humans.

Semi-Supervised Learning

Semi-supervised learning has also been described as a hybridisation of supervised and unsupervised techniques.

Semi-supervised learning is a class of machine learning tasks and techniques that also make use of unlabeled data for training — typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).

As an aside, semi-structured data (not to be confused with semi-supervised learning) is a form of structured data that does not obey the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.

When you don’t have enough labeled data to produce an accurate model and you don’t have the ability or resources to get more, you can use semi-supervised techniques to increase the size of your training data.

For that reason, semi-supervised learning is a win-win for use cases like webpage classification, speech recognition, or even for genetic sequencing. In all of these cases, data scientists can access large volumes of unlabeled data, but the process of actually assigning supervision information to all of it would be an insurmountable task.

Semi-supervised classification: Labeled data is used to help identify that there are specific groups of webpage types present in the data and what they might be. The algorithm is then trained on unlabeled data to define the boundaries of those webpage types and may even identify new types of webpages that were unspecified in the existing human-inputted labels.
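
A hedged sklearn sketch of the idea: label only a small fraction of the data, mark the rest as unlabeled with -1, and let a self-training wrapper propagate labels outwards. The dataset and base model are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    x, y = make_classification(n_samples=1000, random_state=0)
    rng = np.random.default_rng(0)
    y_partial = y.copy()
    y_partial[rng.random(1000) > 0.05] = -1   # -1 marks "no label"

    # Train on roughly 5% labeled data plus 95% unlabeled data.
    base = LogisticRegression(max_iter=1000)
    model = SelfTrainingClassifier(base).fit(x, y_partial)
    print(model.score(x, y))              # accuracy against the true labels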

Unsupervised Data Augmentation

The method presented in the recent blog post from Google AI on the 10th of July, Unsupervised Data Augmentation (UDA), employs highly targeted data augmentations to generate diverse and realistic perturbations and enforces the model to be smooth with respect to these perturbations.

  • UDA uses generated examples of similar texts or images that are augmented, for instance augmenting a picture with other related examples.
  • The authors also propose a technique called TSA (Training Signal Annealing) that can effectively prevent UDA from overfitting the supervised data when a lot more unlabeled data is available.
  • For text, UDA combines well with representation learning, e.g., BERT, and is very effective in the low-data regime, where state-of-the-art performance is achieved on IMDb with only 20 examples. For vision, UDA reduces error rates by more than 30% in heavily-benchmarked semi-supervised learning setups.
  • Lastly, UDA can effectively leverage out-of-domain unlabeled data and achieve improved performance on ImageNet, where there is a large amount of supervised data.

In the blog post they say:

Our results support the recent revival of semi-supervised learning, showing that: (1) SSL can match and even outperform purely supervised learning that uses orders of magnitude more labeled data, (2) SSL works well across domains in both text and vision and (3) SSL combines well with transfer learning, e.g., when fine-tuning from BERT.

They showed two pictures to illustrate their model:

An overview of Unsupervised Data Augmentation (UDA). Left: Standard supervised loss is computed when labeled data is available. Right: With unlabeled data, a consistency loss is computed between an example and its augmented version.

Example augmentation operations for text-based (top) or image-based (bottom) training data.
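
To make the two branches of that first figure concrete, here is a rough PyTorch sketch of such a consistency loss. This is my own simplified reading, not the authors' implementation: model, lam and the augmentation pipeline are assumptions, and the actual method adds refinements such as TSA.

    import torch
    import torch.nn.functional as F

    def uda_style_loss(model, x_labeled, y_labeled,
                       x_unlabeled, x_augmented, lam=1.0):
        # Left branch: standard supervised cross-entropy loss.
        supervised = F.cross_entropy(model(x_labeled), y_labeled)

        # Right branch: consistency between predictions on an unlabeled
        # example and on its augmented version. The prediction on the
        # original example is treated as a fixed target (no gradient).
        with torch.no_grad():
            target = F.softmax(model(x_unlabeled), dim=-1)
        log_pred = F.log_softmax(model(x_augmented), dim=-1)
        consistency = F.kl_div(log_pred, target, reduction="batchmean")

        return supervised + lam * consistency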

I will revisit this last section after writing more tomorrow.

There is more to read in the preprint available on arXiv.

What are the implications of this for industry?

I will examine this briefly tomorrow.

This is day 40 of #500daysofAI.

Hope you enjoyed this article and remember to give me feedback if you have the chance.

What is #500daysofAI?
I am challenging myself to write and think about the topic of artificial intelligence for the next 500 days with the #500daysofAI. Learning together is the greatest joy so please give me feedback if you feel an article resonates.
