Machines can learn in various ways. Supervised learning involves learning a function that maps inputs to outputs from example input-output pairs. Unsupervised learning involves learning patterns from unlabeled data. Semi-supervised learning may be seen as a hybrid of both supervised and unsupervised learning.
Essentially, when we combine a small amount of labeled data with a large amount of unlabeled data during training, we have a semi-supervised machine learning problem. Semi-supervised learning may also be described as a special case of weak supervision [Source: Wikipedia].
"Weak Supervision is a branch of machine learning where noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amounts of training data in a supervised learning setting." – Wikipedia.
The Data Problem
Supervised learning models and techniques are commonplace in business. However, building effective models is highly dependent on access to high-quality labeled training data – we’ve all heard the saying "Garbage in, garbage out". The demand for high-quality labeled data often leads to a major roadblock when businesses attempt to approach problems using machine learning. This problem manifests itself in several ways:
- Insufficient quantity of labeled data – When a new product is being proposed, or a new industry has emerged, a common problem is the lack of labeled training data needed to apply traditional supervised learning approaches. Usually, data scientists would obtain more data, but in this scenario doing so may be impractical, expensive, or impossible without waiting for data to accumulate over time.
- Insufficient domain expertise to label data – Some data can be labeled by almost anyone. For example, most people can label images of cats and dogs, but the task becomes more challenging when labeling the breed of the cat or dog in the image. Getting a domain expert to label training data can quickly become expensive, so it’s often not a viable solution.
- Insufficient time to label and prepare data – It’s often said that 60% or more of the time spent working on machine learning problems is dedicated to preparing a dataset. When working in a field that deals with fast-evolving problems, collecting and preparing a dataset quickly enough to build a useful solution may be impractical.
Overall, collecting high-quality labeled data can quickly put a heavy demand on one’s resources, and some companies may not have the resources to meet that demand. This is where semi-supervised learning comes into play. If we have a small amount of labeled data and a large amount of unlabeled data, then we can frame our problem as a semi-supervised machine learning problem.
The Types of Semi-supervised Learning
When we are working on a semi-supervised problem, the objective varies depending on the type of learning we wish to perform. We may use semi-supervised learning for inductive learning, transductive learning, or both.
Inductive Learning
The goal of inductive learning is to generalize to new data. In the semi-supervised setting, this means building a model that learns from the labeled training set, with help from the unlabeled data, and can then make predictions on data it has never seen.
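To make the inductive setting concrete, here is a minimal sketch in plain Python. It uses self-training with a nearest-centroid classifier, which is only one simple way to approach the problem and not a method prescribed by this article. The function names (`fit_centroids`, `predict`, `self_train`), the `margin` confidence rule, and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(labeled):
    """Compute one centroid per class from (point, label) pairs."""
    groups = {}
    for point, label in labeled:
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(mean(coord) for coord in zip(*points))
        for label, points in groups.items()
    }


def predict(centroids, point):
    """Return the label of the centroid closest to `point`."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], point))


def self_train(labeled, unlabeled, rounds=5, margin=2.0):
    """Hypothetical self-training loop: pseudo-label confident points, refit."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(labeled)
        confident, rest = [], []
        for point in pool:
            dists = sorted(sq_dist(c, point) for c in centroids.values())
            # Assumed confidence rule: the nearest centroid must be at least
            # `margin` times closer (in squared distance) than the runner-up.
            if len(dists) > 1 and dists[0] * margin <= dists[1]:
                confident.append((point, predict(centroids, point)))
            else:
                rest.append(point)
        if not confident:
            break  # nothing new to learn from, so stop early
        labeled.extend(confident)
        pool = rest
    return fit_centroids(labeled)


# Toy data: two labeled points and a larger unlabeled pool.
labeled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(1.0, 1.0), (0.5, 1.5), (9.0, 9.0), (8.5, 10.0), (5.0, 5.2)]

model = self_train(labeled, unlabeled)

# The fitted model can label points that were in neither set.
print(predict(model, (2.0, 1.0)))
print(predict(model, (9.5, 8.0)))
```

The point to notice is that `self_train` returns a model (here, a dictionary of class centroids) that can be reused on any future input, which is what makes the approach inductive.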
Transductive Learning
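The goal of transductive learning is to transduce (transfer) information from the labeled training data to the specific unlabeled data available during training. Rather than producing a general-purpose model, it aims only to predict labels for those unlabeled points.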
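A transductive approach can be sketched in a similar way. The example below is a rough illustration in plain Python, assuming a greedy nearest-neighbor propagation scheme: at each step, the unlabeled point closest to any already-labeled point inherits that point's label. The function name `propagate_labels` and the toy data are hypothetical, and this is only one of many possible transductive methods.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def propagate_labels(labeled, unlabeled):
    """Label only the given unlabeled points; no reusable model is returned."""
    known = dict(labeled)        # point -> label, grows as labels spread
    remaining = list(unlabeled)  # points still waiting for a label
    assigned = {}
    while remaining:
        # Find the closest (unlabeled point, labeled point) pair.
        point, neighbor = min(
            ((u, k) for u in remaining for k in known),
            key=lambda pair: sq_dist(pair[0], pair[1]),
        )
        known[point] = known[neighbor]
        assigned[point] = known[neighbor]
        remaining.remove(point)
    return assigned


# Toy 1-D data: a chain of unlabeled points stretching away from "a".
labeled = [((0.0,), "a"), ((10.0,), "b")]
unlabeled = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (9.0,)]

labels = propagate_labels(labeled, unlabeled)
for point in unlabeled:
    print(point, labels[point])
```

In this toy data the point `(6.0,)` lies closer to the labeled "b" example than to the labeled "a" example, yet it receives "a" because the label travels along the chain of neighboring unlabeled points. The output is a set of labels for exactly the points supplied, with nothing that could be applied to new data without rerunning the procedure.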
Final Thoughts
Semi-supervised learning falls in between supervised and unsupervised learning. We can conduct semi-supervised learning as either inductive or transductive learning, and both may be performed using a semi-supervised learning algorithm. A full treatment of how this is done is beyond the scope of this article.