
A simple SVM-based implementation of semi-supervised learning

Clearing the fog of confusion around semi-supervised learning

Source: https://unsplash.com/photos/GhFn3mddePk

We have all come across semi-supervised learning as a type of machine learning problem, but it is a concept that is rarely understood well. The best way to understand it is to get our hands dirty, and that is precisely what we will do here.

The use case is simple: we have a classification problem with, say, 100 observations, of which 30 are labeled (supervised) and the remaining 70 are unlabeled (unsupervised). I think you will agree that the performance achievable by training on only 30 observations will be lower than what we could get from all 100; unfortunately, 70 of them are unlabeled.

Fig 1: Semi-supervised learning scenario (Source: Author)

Essentially, it is a binary classification problem. The two classes are indicated by the blue triangles and the red circles; the greyed diamonds indicate unlabeled data.

Our first strategy is as follows:

Step 1: Build a classifier on the labeled data (routine stuff).

Step 2: Use it to predict the unlabeled data. However, apart from the prediction itself, also check the confidence level of each prediction.

Step 3: Add to the training data those observations on which the model is moderately confident. These are called pseudo-labeled, as contrasted with labeled, data.

Step 4: Retrain on this augmented dataset and use the resulting model.

As we use the unlabeled data to augment the training data for supervised learning, the approach sits somewhere between the two paradigms, hence the name semi-supervised.
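As a preview, here is a minimal sketch of the whole loop, assuming a scikit-learn style classifier with predict_proba; the function name, the single-round structure, and the threshold value are illustrative, not a fixed recipe.

import numpy as np

def self_train(base_clf, X_lab, y_lab, X_unl, threshold=0.7):
    # Step 1: build a classifier on the labeled data
    base_clf.fit(X_lab, y_lab)
    # Step 2: predict the unlabeled data, with class probabilities
    proba = base_clf.predict_proba(X_unl)
    confident = proba.max(axis=1) > threshold
    # Step 3: pseudo-label the confident observations
    pseudo_y = base_clf.predict(X_unl[confident])
    # Step 4: retrain on the augmented dataset
    X_aug = np.vstack([X_lab, X_unl[confident]])
    y_aug = np.concatenate([y_lab, pseudo_y])
    return base_clf.fit(X_aug, y_aug)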

Fig 2: Extended dataset (Source: Author)

The diagram above illustrates the extension of the training data: the observations we are confident about receive a pseudo-label, while those we are not confident about remain unlabeled.

Let’s jump to the code now.

We create a dummy scenario by dividing the wine dataset into train (labeled), unl (unlabeled), and test portions.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)  # wine data (assumed: sklearn's load_wine)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)
X_train, X_unl, y_train, y_unl = train_test_split(
    X_train, y_train, test_size=0.7, random_state=1)

Here are the shapes of the train, test, and unlabeled sets, respectively. Please note that we will not use the label information of the unl portion, hence treating it as unlabeled.
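A one-liner to inspect those shapes, using the variables defined above:

print(X_train.shape, X_test.shape, X_unl.shape)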

The shape of the datasets (Image Source: Author)

Next, we simply train on the labeled portion, which has just 19 rows.

from sklearn import svm

clf = svm.SVC(kernel='linear', probability=True, C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

The accuracy obtained is as follows:

Initial Classification Accuracy (Source: Author)

Next, we predict on the unlabeled data, keeping both the predicted labels and the class probabilities.

import pandas as pd

clp = clf.predict_proba(X_unl)  # class probabilities for the unlabeled rows
lab = clf.predict(X_unl)        # predicted (pseudo) labels
df = pd.DataFrame(clp, columns=['C1Prob', 'C2Prob', 'C3Prob'])
df['lab'] = lab
df['actual'] = y_unl
df['max'] = df[['C1Prob', 'C2Prob', 'C3Prob']].max(axis=1)

You may have noticed that we used predict_proba, which gives the class probabilities, instead of just the predicted labels; it is these probabilities that will help us find the confident guesses.

Confidence in the predictions (Source: Author)

We simply look at how far the highest predicted probability sits above the uniform baseline:

  • When the three classes are equally probable, each has a probability of roughly 0.33
  • Any maximum probability noticeably greater than 0.33 shows some confidence; for example, (0.80, 0.10, 0.10) is a far more confident prediction than (0.36, 0.33, 0.31)

Essentially, we want to add to the training data those observations for which the most probable class has a high probability. The code sketch goes below; the full notebook is available on Kaggle [1].
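Here is a minimal sketch of that step, reusing the variables from the snippets above; the 0.4 cutoff is purely illustrative.

import numpy as np

threshold = 0.4                             # illustrative cutoff
confident = (df['max'] > threshold).values  # mask over the unlabeled rows

# Append the confidently pseudo-labeled observations to the training data
X_aug = np.vstack([X_train, X_unl[confident]])
y_aug = np.concatenate([y_train, df['lab'].values[confident]])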

The graph below shows the distribution of confidence, as expressed by the probability of the most probable class.

Confidence in the prediction (Image Source: Author)

Next, we run a loop in which, for different threshold values on the probability of the most probable class, the qualifying observations are added to the training data and the model is retrained.
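A minimal sketch of that loop, under the same assumptions as above (the threshold grid is illustrative):

import numpy as np
from sklearn import svm

for t in np.arange(0.35, 0.90, 0.05):
    mask = (df['max'] > t).values
    X_aug = np.vstack([X_train, X_unl[mask]])
    y_aug = np.concatenate([y_train, df['lab'].values[mask]])
    clf_t = svm.SVC(kernel='linear', probability=True, C=1).fit(X_aug, y_aug)
    print(f"threshold={t:.2f}  accuracy={clf_t.score(X_test, y_test):.3f}")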

Accuracy with different threshold values (Source: Author)
  • There is an improvement when the most probable class has a probability greater than 0.4 and up to 0.55
  • After that, there is hardly any improvement: the observations on which we are super confident are very similar to the labeled ones, so they add little to the knowledge

EndNote:

In this tutorial, we have shown how a simple semi-supervised strategy can be implemented with an SVM. The technique extends easily to other classifiers. Its impact will depend on how overlapping or well-separated the classes are, how informative the features are, and so on.

References:

[1] https://www.kaggle.com/saptarsi/a-simple-semi-supervised-strategy-based-on-svm

[2] https://towardsdatascience.com/supervised-learning-but-a-lot-better-semi-supervised-learning-a42dff534781

