Improve your model’s performance with unlabeled data

An introduction to semi-supervised learning and its applications in unstructured data

Naveen Rathani
Towards Data Science


Photo by Christopher Burns on Unsplash

Most machine learning problems that data scientists solve are either supervised learning (ground-truth labels are available for the observations, and the algorithm models the conditional probability of the label given the inputs) or unsupervised learning (there is no label per observation, so we instead identify clusters, patterns or reduced latent dimensions among the observations).

Semi-supervised learning typically tries to combine the two: we attempt to improve performance in one of these tasks by utilizing information generally associated with the other (Van Engelen and Hoos, 2020). The essence of this family of algorithms is that it permits the use of large amounts of unlabeled data in combination with a smaller volume of labeled data to build a more generalizable (and possibly more accurate) model.

Semi-supervised learning (SSL) is a vast subject area with active ongoing research, and one that, we believe, could see significant growth in the coming years. There have been significant advances in the field, and several papers survey SSL techniques, including the work by Van Engelen and Hoos [1], which can be accessed here: https://link.springer.com/article/10.1007/s10994-019-05855-6#Sec54. It is a highly recommended read, albeit a bit technical. We will leverage, explain and simplify some of the key concepts from this survey to solidify the understanding of the different types of semi-supervised learning techniques.

While semi-supervised learning is possible with all forms of data, text and other unstructured datasets are trickier and more time-consuming to label. A few examples include classifying emails for intent, predicting abuse or malpractice in email conversations, and classifying long documents when few labels are available. The fewer the labels, the harder the task becomes.

The contents of this article are organized in the following order:

  1. Starter concepts and assumptions of semi-supervised learning
  2. Understanding the two families of semi-supervised learning (SSL) methods

Starter concepts and assumptions of semi-supervised learning

Semi-supervised learning uses concepts from both supervised learning (i.e. modeling the relationship between the input data distribution and the label distribution) and unsupervised learning (grouping unlabeled data into homogeneous groups).

Let’s assume the task of predicting whether there is abusive content in an email inbox. This is a supervised learning task, and we would need a set of emails labeled as either abusive or non-abusive in order to train a classifier. Such labels do not exist naturally; humans would need to read the content of the emails and annotate them. Given that emails are highly personalized, getting access to thousands of emails to label may not be easy. If we build a classifier with only a few labeled observations, it might simply associate word occurrences with the limited labels we provide, rather than generalizing from context, due to the lack of enough labeled data.

Developing a language model, or using a pre-trained one (like BERT), can ease the above problem significantly by generalizing context and word substitutions, but we still need enough labels to train the classifier well. We will pick up that situation in part-3; for now, if we could leverage the input data of the unlabeled emails and allow the model to somehow learn from it, that would ideally improve the classifier. This is where semi-supervised learning comes in.
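To make the problem concrete, here is a minimal sketch of what such a label-starved supervised baseline might look like, assuming a scikit-learn pipeline; the toy emails, their labels and the pipeline choices below are purely hypothetical placeholders, not the setup we will actually use later in the series.

```python
# A minimal baseline sketch: a supervised text classifier trained on only a
# handful of labeled emails. The example emails and labels are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_emails = [
    "you will regret this, watch your back",      # hypothetical abusive email
    "please find the quarterly report attached",  # hypothetical normal email
]
labels = [1, 0]  # 1 = abusive, 0 = non-abusive

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(labeled_emails, labels)

# With so few labeled examples, the classifier mostly memorizes which exact
# words co-occurred with which label, rather than learning generalizable context.
print(baseline.predict(["watch the quarterly numbers, you will regret the report"]))
```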

The following image, taken from the survey paper by Van Engelen and Hoos [1], illustrates this discussion visually. In the image, we can see the inherent distribution of the two colored classes, but only two observations (the solid shapes) are provided as labels, and everything else is made available only as unlabeled data. For a two-class problem, when the classifier sees only the two solid points (the triangle and the plus sign) as labeled observations, the most natural decision boundary it builds is the one that splits the distance between the two labeled data points equally. As can be clearly seen from the difference in orientation of the solid and dotted lines, the classifier trained on only the labeled points is significantly different from the optimal decision boundary (Van Engelen and Hoos, 2020).

Image by Van Engelen and Hoos, taken from https://link.springer.com/article/10.1007/s10994-019-05855-6/figures/1

That brings up a natural question: how can the improvement to the decision boundary shown above realistically be achieved?

As identified in the survey by Van Engelen and Hoos [1], the most widely recognized assumptions of semi-supervised learning are:

1. The smoothness assumption (if two samples x and x′ are close in the input space, their labels y and y′ should be the same),

2. The low-density assumption (the decision boundary should not pass through high-density areas in the input space), and

3. The manifold assumption (data points on the same low-dimensional manifold should have the same label).

Even though these assumptions often hold, there is no guaranteed performance boost, and just like in supervised learning, no single algorithm works best. Sometimes semi-supervised models can degrade the performance of a supervised model, especially when the above assumptions do not hold, i.e. you add noise to the classifier by feeding it input data whose distribution differs significantly from the (true) conditional distribution already known to the supervised model.

Besides the above, different semi-supervised algorithms perform differently depending on the size of the labeled data, the data manifold and its distribution, and even the specific labeled data points used. So it is important to run comparisons across a spread of datasets and labeled-data sizes to perform any meaningful evaluation, as sketched below.
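Below is a rough sketch of what such a comparison loop could look like with scikit-learn. The dataset, the label budgets and the use of SelfTrainingClassifier (scikit-learn's built-in self-training wrapper, which we discuss in the next section) are illustrative choices only; a proper evaluation would repeat this over several datasets, random label draws and seeds.

```python
# Compare a purely supervised baseline against a semi-supervised method
# across several labeled-data budgets on a two-class text problem.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = TfidfVectorizer(min_df=5)
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)
y_train, y_test = train.target, test.target

rng = np.random.RandomState(0)
for n_labeled in (20, 50, 200):
    # Reveal only n_labeled labels; scikit-learn marks unlabeled points with -1.
    labeled_idx = rng.choice(len(y_train), size=n_labeled, replace=False)
    y_masked = np.full(len(y_train), -1)
    y_masked[labeled_idx] = y_train[labeled_idx]

    supervised = LogisticRegression(max_iter=1000).fit(
        X_train[labeled_idx], y_train[labeled_idx])
    semi = SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(
        X_train, y_masked)

    print(n_labeled,
          round(accuracy_score(y_test, supervised.predict(X_test)), 3),
          round(accuracy_score(y_test, semi.predict(X_test)), 3))
```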

Understanding the two families of semi-supervised learning (SSL) methods

Regarding the various semi-supervised algorithms available today, the survey by Van Engelen and Hoos [1] notes:

Methods differ in the semi-supervised learning assumptions they are based on, in how they make use of unlabeled data, and in the way they relate to supervised algorithms.

Let’s start with the two clearly distinct families into which SSL algorithms usually fall:

  1. Inductive: The work published by Romeyn, J. W. (2004) [2] notes the following about what inductive algorithms do:

An inductive prediction process draws a conclusion about a future instance from a past and current sample, i.e. it typically relies on a data set consisting of specific instances of a phenomenon.

This is the same setting as classical machine learning, which relies on modeling conditional probabilities. Where inductive SSL algorithms differ is in how they use the unlabeled observations.

  2. Transductive: Per Wikipedia [3]:

Transduction or transductive inference is reasoning from observed, specific (training) cases to specific (test) cases. This is in contrast to induction, which is reasoning from observed training cases to general rules, which are then applied to the test cases.

Simply put, transductive methods produce predicted labels for the unlabeled data points by leveraging all of the data at once, using distances among the unlabeled observations and between unlabeled and labeled observations. The end goal is an objective function that, when optimized, yields correct predictions for the labeled observations and assigns labels to the unlabeled observations based on similarity.
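As a concrete illustration, here is a minimal transductive sketch using scikit-learn's graph-based LabelSpreading on a toy two-moons dataset; the dataset and the handful of revealed labels are illustrative choices, and real text data would first need to be vectorized.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Toy two-class data whose cluster structure matches the SSL assumptions above.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# Reveal only a handful of labels; -1 marks a point as unlabeled.
y = np.full(len(y_true), -1)
labeled_idx = np.random.RandomState(42).choice(len(y_true), size=10, replace=False)
y[labeled_idx] = y_true[labeled_idx]

# Label spreading builds a similarity graph over *all* points (labeled and
# unlabeled) and propagates the known labels across that graph.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)

# transduction_ holds the label inferred for every point passed to fit();
# the output for the unlabeled points is exactly the transductive goal above.
print("agreement with true labels:", (model.transduction_ == y_true).mean())
```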

Inductive methods are far simpler to execute and implement. The full taxonomy of available algorithms can be found in the survey paper by Van Engelen and Hoos [1].

Through this series, we plan to experiment with two types of inductive semi-supervised methods. In the first approach, we start with only the labeled data and build a model, to which we sequentially add unlabeled observations whenever the model is confident about the label it assigns. In the second approach, we work with the whole dataset at once and augment it by adding minor variations and noise, so that the model learns decision boundaries that generalize better across the input distribution.

  1. Self-training using pseudo-labels: We start with the labeled data, train an initial model on it, and use that model to predict labels for the unlabeled data. These predictions are called pseudo-labels; the most confident ones (e.g. those above a certain probability threshold, or the top N allowed in a single iteration) are gradually added to the training set and the classifier is retrained. This process is repeated until convergence. Pseudo-labels can be generated and added to the retraining on top of any supervised learning algorithm by enclosing it in a simple wrapper function (Van Engelen and Hoos, 2020); a minimal sketch of such a wrapper follows the image below. We will implement this with experiments in part-2.
  2. Augmenting the input data: These algorithms train the model on labeled and unlabeled data together. This is done by training on augmented data points in addition to the original labeled samples (Van Engelen and Hoos, 2020). Adding noise to existing observations, say through back-translation or by generating synthetic data, allows the model to generalize and learn from the unlabeled data better; a toy augmentation example also follows below. We will implement this with experiments in part-3.
Image by Hamed-Hassanzadeh, to explain Self-Training visually, taken from https://www.researchgate.net/publication/326733520_Clinical_Document_Classification_Using_Labeled_and_Unlabeled_Data_Across_Hospitals
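As promised in point 1, here is a minimal sketch of a self-training wrapper, assuming a scikit-learn style base estimator with predict_proba and dense feature arrays; the confidence threshold and iteration cap are illustrative choices. This is roughly the loop that scikit-learn's SelfTrainingClassifier implements for you.

```python
import numpy as np
from sklearn.base import clone

def self_train(base_estimator, X_labeled, y_labeled, X_unlabeled,
               threshold=0.95, max_iter=10):
    """Iteratively pseudo-label confident unlabeled points and retrain."""
    X_l, y_l = np.asarray(X_labeled), np.asarray(y_labeled)
    X_u = np.asarray(X_unlabeled)
    model = clone(base_estimator).fit(X_l, y_l)

    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing clears the threshold any more: stop
        # Turn confident predictions into pseudo-labels and add them to the
        # labeled pool ...
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo])
        # ... drop them from the unlabeled pool, and retrain from scratch.
        X_u = X_u[~confident]
        model = clone(base_estimator).fit(X_l, y_l)
    return model
```

For example, self_train(LogisticRegression(max_iter=1000), X_l, y_l, X_u) would return a classifier retrained on the original labels plus the accumulated pseudo-labels.

And for point 2, a toy illustration of input augmentation: generating a noisy variant of a sentence by randomly dropping words. In practice, back-translation or synthetic-data generation would play the same role; this is just the simplest possible form of noise.

```python
import random

def drop_words(sentence, drop_prob=0.1, seed=None):
    """Return a noisy copy of `sentence` with each word dropped with probability drop_prob."""
    rng = random.Random(seed)
    words = sentence.split()
    kept = [w for w in words if rng.random() > drop_prob]
    return " ".join(kept) if kept else sentence

# Each augmented variant keeps the (pseudo-)label of the original sentence.
print(drop_words("the meeting has been moved to friday afternoon", drop_prob=0.2, seed=0))
```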

So there you are: we have seen what semi-supervised learning techniques fundamentally try to do under the hood, and how some of the more direct techniques, like self-training, actually work.

Next, in part-2, we will run some of these algorithms in Python on a varied set of text datasets. See you then and thanks for reading!

The 3-part series, including this article, has been a joint work between Sreepada Abhinivesh, who is a passionate NLP data scientist with a master’s degree from the Indian Institute of Science (IISc), Bengaluru, India, and Naveen Rathani, who is an applied machine learning specialist and a data science enthusiast.

[1] Van Engelen, J.E., Hoos, H.H. A survey on semi-supervised learning. Mach Learn 109, 373–440 (2020). https://doi.org/10.1007/s10994-019-05855-6

[2] Romeyn, J. W. Hypotheses and Inductive Predictions: Including Examples on Crash Data. Synthese 141(3), 333–364 (2004). https://doi.org/10.1023/B:SYNT.0000044993.82886.9e

[3] Transduction (machine learning), Wikipedia. https://en.wikipedia.org/wiki/Transduction_(machine_learning)

