
We have all come across semi-supervised learning as a type of machine learning problem, but it is a concept that is rarely understood well. The best way to understand it is to get our hands dirty, and that is precisely what we will do here.
The use case is simple: we have, say, 100 observations for a classification problem, of which 30 are labeled (supervised) and the rest are unlabeled (unsupervised). I think all of you will agree that the performance achievable by training on only those 30 observations will be lower than if we could have used all 100; unfortunately, 70 of them are unlabeled.

Essentially, it is a binary classification problem. The two classes are indicated by the blue triangles and the red circles, while the greyed diamonds indicate unlabeled data.
Our first strategy is as follows:
Step 1: Build a classifier on the labeled data (routine stuff).
Step 2: Use this classifier to predict the unlabeled data. Along with the prediction, also check how confident the prediction is.
Step 3: Add the observations on which you are reasonably confident to the training data. These are called pseudo-labeled, as contrasted with labeled data.
Step 4: Retrain on this augmented dataset and use the resulting model.
As we are using the unlabeled data to augment the training data for supervised learning, the approach sits somewhere between the two, hence the name semi-supervised.

In the above diagram, the extension of the training data is illustrated. The observations we are confident about receive a pseudo-label, while those we are not confident about remain unlabeled.
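To make the four steps concrete, here is a minimal generic sketch of the idea. The classifier, the 0.8 confidence cut-off, and the variable names are illustrative choices, not taken from the notebook discussed below.

from sklearn.linear_model import LogisticRegression
import numpy as np

def self_train(X_lab, y_lab, X_unlab, threshold=0.8):
    # Step 1: fit a classifier on the labeled data
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    # Step 2: predict the unlabeled data and keep a confidence score
    proba = clf.predict_proba(X_unlab)
    pred = clf.predict(X_unlab)
    confident = proba.max(axis=1) > threshold
    # Step 3: add the confidently pseudo-labeled observations to the training data
    X_aug = np.vstack([X_lab, X_unlab[confident]])
    y_aug = np.concatenate([y_lab, pred[confident]])
    # Step 4: retrain on the augmented dataset
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug)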
Let’s jump to the code now.
We create a dummy scenario where we divide the wine dataset into train (labeled), unl (unlabeled), and test portions.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
X, y = load_wine(return_X_y=True)  # assuming scikit-learn's wine dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
X_train, X_unl, y_train, y_unl = train_test_split(
    X_train, y_train, test_size=0.7, random_state=1)
Here are the shapes of the train, test, and unlabeled portions respectively. Please note that we will not use the label information of the unl portion, hence it is treated as unlabeled.
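The shapes can be checked with a quick print, for example:

print(X_train.shape, X_test.shape, X_unl.shape)  # labeled train, test, unlabeled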

Next, we simply train on the labeled portion, which has just 19 rows.
from sklearn import svm

clf = svm.SVC(kernel='linear', probability=True, C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
The accuracy obtained is as follows:

Next, we predict on the unlabeled data, collecting both the class probabilities and the predicted labels:
import pandas as pd

clp = clf.predict_proba(X_unl)  # predicted class probabilities
lab = clf.predict(X_unl)        # predicted labels
df = pd.DataFrame(clp, columns=['C1Prob', 'C2Prob', 'C3Prob'])
df['lab'] = lab
df['actual'] = y_unl
df['max'] = df[['C1Prob', 'C2Prob', 'C3Prob']].max(axis=1)
You may have noticed that we have used the predicted class probabilities rather than just the labels; these probabilities are what help us identify the confident guesses.

We simply look at how far the predicted probabilities are from the uninformative baseline:
- When the three classes are equally probable, each has a probability of roughly 0.33.
- Any maximum probability noticeably greater than this indicates some confidence.
Essentially, we want to add to the training data those observations for which this confidence is high; the code is sketched below.
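The filtering step is not reproduced in this article; the following is a minimal sketch of what it could look like for a single threshold (the 0.6 cut-off and the names confident, X_aug, and y_aug are illustrative):

import numpy as np

threshold = 0.6                                   # illustrative cut-off
confident = (df['max'] > threshold).to_numpy()    # rows we are reasonably sure about
X_aug = np.vstack([X_train, X_unl[confident]])               # labeled + pseudo-labeled features
y_aug = np.concatenate([y_train, df.loc[confident, 'lab']])  # true labels + pseudo-labels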
The graph below shows the distribution of confidence, expressed as the probability of the most probable class.
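The plot is not reproduced here; a histogram of the max column, roughly as sketched below, shows this distribution (matplotlib usage is assumed, not taken from the original notebook):

import matplotlib.pyplot as plt

plt.hist(df['max'], bins=20)
plt.xlabel('Probability of the most probable class')
plt.ylabel('Number of unlabeled observations')
plt.show()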

Next, we run a loop in which, for different threshold values on the probability of the most probable class, the confident observations are added to the training data and the model is retrained; a sketch of the loop follows.
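The exact loop is not shown in this article; the sketch below captures the idea (the threshold grid and the names clf_aug, X_aug, and y_aug are illustrative):

from sklearn import svm
import numpy as np

for threshold in np.arange(0.40, 0.96, 0.05):
    confident = (df['max'] > threshold).to_numpy()
    X_aug = np.vstack([X_train, X_unl[confident]])
    y_aug = np.concatenate([y_train, df.loc[confident, 'lab']])
    clf_aug = svm.SVC(kernel='linear', probability=True, C=1).fit(X_aug, y_aug)
    print(round(threshold, 2), clf_aug.score(X_test, y_test))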

- There is an improvement when the threshold on the probability of the most probable class lies between roughly 0.4 and 0.55.
- Beyond that, there is hardly any improvement: the observations we are very confident about are very similar to the labeled ones, so they add little new knowledge.
EndNote:
In this tutorial, we have shown how a simple semi-supervised strategy can be implemented using an SVM. The technique can easily be extended to other classifiers. The impact will depend on how overlapping or well-separated the classes are, how informative the features are, and so on.
References:
[1] https://www.kaggle.com/saptarsi/a-simple-semi-supervised-strategy-based-on-svm