Applying SVM Based Active Learning on Multi-Class Datasets

A labelling strategy based on active learning and semi-supervised learning for multi-class classification problems

Massive amounts of data are now collected and processed to extract valuable information, and machine learning models and approaches keep improving. Supervised learning methods typically achieve better accuracy on data-driven problems, but they require a label for every training sample. Label quality is critical for building robust models and obtaining good results. Unfortunately, obtaining correct labels has not kept pace with these advances: labelling data is mostly a time-consuming and laborious process.

The idea behind active learning is to find the most informative samples to label, given that labelling every sample is impractical. The accuracy of a model can increase remarkably after labelling only a small subset of the samples. This raises a simple question: how do we select the most informative samples from the entire set? In this post, I describe an approach for choosing the most informative samples in multi-class datasets. I also describe a simple semi-supervised approach to increase the number of labels in the sample set, and how to use active learning and semi-supervised learning together on the same dataset.

Photo by Krisztina Papp on Unsplash

To reveal the most informative samples, most studies rely on probabilistic approaches. In this post, I use an SVM-based methodology instead of the typical probabilistic models. As most of you know, a Support Vector Machine (SVM) aims to find the hyperplane in N-dimensional space that maximizes the distance between points of different classes. Take a look at Figure 1, which represents a dataset with only 2 features. An SVM decides the class of a sample by checking which side of the hyperplane the point falls on. The distance (margin) of the point to the hyperplane measures how strongly the sample belongs to that class. So the SVM indicates that samples 3 and 4 belong to the blue and red classes respectively, with higher confidence than samples 1 and 2.

Ok then, how can we use this information in active learning, in other words, to find out the most informative samples?

Simply put, the samples nearer to the hyperplane are more informative than those farther from it. An SVM makes predictions with higher confidence for the samples farthest from the hyperplane, and it is less sure about the class of samples close to it. For this reason, having the ground truth for samples near the hyperplane is much more valuable than having it for the others.
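For the binary case, this idea can be sketched in a few lines. The snippet below is a minimal illustration on synthetic 2-feature data, not the code used in this post; the toy data and the names `X_pool`, `query_order` and `most_informative` are assumptions made for the example. It ranks unlabelled samples by the absolute value of the SVM decision score, so the samples closest to the hyperplane come first.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-feature, 2-class data (an assumption, for illustration only).
rng = np.random.default_rng(0)
y = np.arange(200) % 2
X = rng.normal(size=(200, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)

# Pretend only the first 50 samples are labelled; the rest form the pool.
X_labelled, y_labelled = X[:50], y[:50]
X_pool = X[50:]

clf = LinearSVC(dual=False, random_state=0).fit(X_labelled, y_labelled)

# decision_function returns a signed score proportional to the distance
# from the hyperplane: the sign gives the side, the magnitude the confidence.
scores = clf.decision_function(X_pool)

# Smallest |score| first: the samples the model is least sure about.
query_order = np.argsort(np.abs(scores))
most_informative = query_order[:10]
```

The 10 indices in `most_informative` point at the pool samples that would be sent for manual labelling first.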

Fig 1 - SVM description in 2-dimensional space (Image by Author)

However, this is not so straightforward for multi-class classification problems. I use the One-vs-the-rest (OvR) multiclass strategy, which is one of the most commonly used approaches for multi-class classification tasks. In OvR, one binary classifier is fitted per class, with that class set against all the other classes. Its interpretation is therefore relatively easy compared with the alternative, One-vs-One (OvO, OneVsOneClassifier in scikit-learn).

OvR yields one prediction per class versus the rest of the classes. To make this clearer, say we have 4 classes: A, B, C and D. In that case, you obtain 4 different binary classifier results:

A vs (B,C,D);

B vs (A,C,D);

C vs (A,B,D);

D vs (A,B,C);

This means 4 different distance measurements if you apply SVM with OvR. The sign of each distance states the class: for instance, the first classifier in the example above (A vs (B, C, D)) predicts class A if its distance is positive, and not A if it is negative.
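To make the four distance values concrete, here is a small sketch on synthetic data. The blobs, the class names and the variable names are assumptions for illustration. It fits an OvR linear SVM on 4 classes and counts how many of the 4 distances are positive for each sample.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Four well-separated toy classes A, B, C, D (illustrative data only).
rng = np.random.default_rng(0)
classes = np.array(["A", "B", "C", "D"])
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
y_idx = np.arange(400) % 4
X = centers[y_idx] + rng.normal(size=(400, 2))
y = classes[y_idx]

ovr = OneVsRestClassifier(LinearSVC(dual=False, random_state=0)).fit(X, y)

# One column per binary classifier, in the order of ovr.classes_:
# A vs rest, B vs rest, C vs rest, D vs rest.
distances = ovr.decision_function(X)

# Number of classifiers that "claim" each sample (positive distance).
positives = (distances > 0).sum(axis=1)
```

A row of `distances` with exactly one positive entry is the unambiguous case; any other count signals some disagreement among the 4 classifiers.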

As expected, the combination of 1 positive and 3 negative distance measurements is the most confident prediction; this is Sample 4 in Figure 2. Across the 4 classifiers, OvR assigns the sample to the class with the positive distance value. On the other hand, 4 positive or 4 negative distance measurements are examples of uncertain predictions. In active learning, we are interested in these least confident samples. I have not encountered a case corresponding to Sample 5. Most probably, you will encounter scenarios like Sample 1 as the least confident samples. After Sample 1, Sample 2 can be considered the second most informative sample. It also represents a kind of ambiguity: the prediction states that Sample 2 might be a member of any of the classes B, C or D, but not A.

If you still have capacity for manually labelling samples, you can then consider samples like Sample 3. Here, first calculate the absolute value of the difference between the two positive distance values. The smaller the difference, the more ambiguous the prediction. Thus, you might select the samples with small difference values. I summarize the sign combinations of OvR for a 4-class dataset in Figure 2. The same idea can be extended to other multi-class datasets with minor modifications.
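One possible way to extend the ranking to any number of classes is sketched below with plain NumPy. This is my own generalisation and an assumption, not a fixed rule: rows with no positive distance come first, then rows with more positive distances, and ties are broken by the gap between the two largest distances (for exactly two positives this equals the difference described above). The function name `rank_by_sign_pattern` is hypothetical.

```python
import numpy as np

def rank_by_sign_pattern(distances):
    """Return row indices ordered from most to least informative.

    Assumed priority: no positive distance first, then rows with more
    positive distances, and rows with exactly one positive last.
    """
    distances = np.asarray(distances, dtype=float)
    n_classes = distances.shape[1]
    positives = (distances > 0).sum(axis=1)

    # Gap between the two largest distances; a small gap is a close call.
    top_two = np.sort(distances, axis=1)[:, -2:]
    gap = top_two[:, 1] - top_two[:, 0]

    # Tier 0: no positives. Otherwise more positives give a lower tier,
    # so exactly one positive (the confident case) lands in the last tier.
    tier = np.where(positives == 0, 0, n_classes + 1 - positives)

    # lexsort uses the last key as the primary one: tier, then gap.
    return np.lexsort((gap, tier))

example = np.array([
    [ 1.2, -0.8, -1.1, -0.9],   # one positive: confident
    [-0.3, -0.2, -0.6, -0.4],   # no positive
    [-0.7,  0.4,  0.5,  0.2],   # three positives
    [ 0.9,  0.1, -0.5, -0.6],   # two positives, large gap
    [ 0.3,  0.2, -0.5, -0.6],   # two positives, small gap
])
order = rank_by_sign_pattern(example)  # -> [1, 2, 4, 3, 0]
```

Taking the first entries of `order` up to the labelling budget gives the batch to hand over to the experts.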

Fig 2 - Applying OVR(SVM) on a 4 classes dataset - combinations of distance value signs (Image by Author)

You can also take a look at the following code snippet, which is the implementation of the above-mentioned approach.

import pandas as pd
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# It calculates the absolute difference between the only 2 positive distance values
def posit_diff(a, b, c, d):
    positives = [v for v in (a, b, c, d) if v > 0]
    if len(positives) != 2:
        print('Warning! Expecting only 2 positive distance values')
        return -1
    return abs(positives[0] - positives[1])

MODEL = LinearSVC(penalty='l2', dual=False, multi_class='ovr', class_weight='balanced', random_state=1709)
ACTIVE_LEARNING_BATCH_SIZE = 100  # let's say I am looking for the 100 most informative samples in the data set
# FEATURES = [...]

def active_learning(df_train_set, df_unlabelled_set):
    """
    Applying active learning to an example of a 4-class classification problem.
    """
    ovr = OneVsRestClassifier(MODEL)
    ovr.fit(df_train_set[FEATURES], df_train_set['LABEL'])

    # signed distance of every unlabelled sample to each of the 4 OvR hyperplanes
    dec_func = ovr.decision_function(df_unlabelled_set[FEATURES])
    # keep the index of the unlabelled set so that concat aligns row by row
    df_dec_func = pd.DataFrame(dec_func, columns=['1', '2', '3', '4'], index=df_unlabelled_set.index)
    df_dec_func = pd.concat([df_dec_func, df_unlabelled_set], axis=1)

    df_dec_func['positives'] = df_dec_func[['1', '2', '3', '4']].gt(0).sum(axis=1)
    df_dec_func['negatives'] = df_dec_func[['1', '2', '3', '4']].lt(0).sum(axis=1)

    df_dec_func_posit_0 = df_dec_func.loc[df_dec_func['positives'] == 0]  # the most informative ones
    df_dec_func_posit_3 = df_dec_func.loc[df_dec_func['positives'] == 3]  # the second most informative ones
    # copy, because a new column is added to this subset below
    df_dec_func_posit_2 = df_dec_func.loc[df_dec_func['positives'] == 2].copy()  # the third most informative ones

    df_dec_func_posit_2['posit_diff'] = df_dec_func_posit_2[['1', '2', '3', '4']].apply(lambda x: posit_diff(*x), axis=1)
    df_dec_func_posit_2 = df_dec_func_posit_2.sort_values(by=['posit_diff'], ascending=True)
    df_dec_func_posit_2.reset_index(drop=True, inplace=True)
    rest_needed = ACTIVE_LEARNING_BATCH_SIZE - (df_dec_func_posit_0.shape[0] + df_dec_func_posit_3.shape[0])

    if rest_needed > 0:
        # fill the remaining batch with the most ambiguous 2-positive samples
        df_dec_func_posit_2_al = df_dec_func_posit_2.iloc[0:rest_needed, :]
        df_act_learn = pd.concat([df_dec_func_posit_0, df_dec_func_posit_3, df_dec_func_posit_2_al], axis=0)
    else:
        df_act_learn = pd.concat([df_dec_func_posit_0, df_dec_func_posit_3], axis=0)
        df_act_learn = df_act_learn.sort_values(by=['positives'], ascending=True)
        df_act_learn = df_act_learn.iloc[0:ACTIVE_LEARNING_BATCH_SIZE, :]
    df_act_learn.reset_index(drop=True, inplace=True)

    return df_act_learn

As a further step, clustering might be applied to the set of samples identified as most informative. Selecting samples from both the centroids and the borders of the clusters (with the number of clusters equal to the number of classes) helps ensure variation in the final set.
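A rough sketch of this clustering step is given below. It assumes k-means as the clustering method and uses the distance to the assigned centroid to define "centroid" and "border" samples; both choices, as well as the function name `diverse_subset` and its parameters, are assumptions for illustration rather than part of the approach above.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_subset(X_informative, n_clusters, per_cluster=2, random_state=0):
    """Pick, per cluster, the samples nearest to and farthest from the centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X_informative)

    # transform() gives the distance of every sample to every centroid;
    # keep only the distance to the centroid of the sample's own cluster.
    dist_to_own = km.transform(X_informative)[np.arange(len(labels)), labels]

    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        ordered = members[np.argsort(dist_to_own[members])]
        chosen.extend(ordered[:per_cluster])     # close to the centroid
        chosen.extend(ordered[-per_cluster:])    # close to the border
    # Small clusters can yield the same index twice, so deduplicate.
    return np.unique(chosen)

# Illustrative feature matrix standing in for the informative samples.
rng = np.random.default_rng(0)
X_informative = rng.normal(size=(60, 2))
picked = diverse_subset(X_informative, n_clusters=4)
```

Here `n_clusters=4` mirrors the 4-class example; `picked` holds row positions within `X_informative`.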


SEMI-SUPERVISED LEARNING

Another way of increasing the number of labels in the dataset is to label the most confident samples automatically. In other words, samples can be labelled iteratively if their predicted probability exceeds a pre-determined threshold value, and that threshold value can be increased in each iteration.

Say the classifier predicts that a sample belongs to class A with 87% probability, which is higher than the threshold value of 85%. It is then treated as a member of class A in the next steps. The train set and the unlabelled set are updated at the end of each iteration. Note that the sample might not actually be a member of class A, which is a critical drawback of the semi-supervised approach. The following code snippet is a simple example of this methodology:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

MODEL = RandomForestClassifier(max_depth=4, n_estimators=200, class_weight='balanced', random_state=1709)
SS_LEARNING_LABELLING_THRESHOLD = 0.80
SS_THRESHOLD_INCREASE = 0.01
MAX_SS_ITERATION = 10
# FEATURES = [...]

def ss_iterations(df_train_set, df_unlabelled_set):
    """
    It covers all steps of Semi Supervised Learning.
    It uses a simple Random Forest to fit df_train_set and predict df_unlabelled_set
    to determine the most confident samples,
    i.e. those predicted with a probability of at least the current threshold
    (starting from SS_LEARNING_LABELLING_THRESHOLD).
    It assumes that the labels are the integers 1 to 4.
    """
    pred_threshold = SS_LEARNING_LABELLING_THRESHOLD
    # re-indexed copy: rows align with the predictions and the caller's frame stays untouched
    df_unlabelled_set = df_unlabelled_set.reset_index(drop=True)

    ovr = OneVsRestClassifier(MODEL)
    print('Before iterations, size of labelled and unlabelled data', df_train_set.shape[0], df_unlabelled_set.shape[0])

    for i in range(MAX_SS_ITERATION):
        if df_unlabelled_set.empty:
            print('No unlabelled data left!')
            break
        ovr.fit(df_train_set[FEATURES], df_train_set['LABEL'])
        preds = ovr.predict_proba(df_unlabelled_set[FEATURES])
        # columns of preds follow ovr.classes_, assumed here to be the labels 1, 2, 3, 4
        df_pred = pd.DataFrame({'1': preds[:, 0], '2': preds[:, 1], '3': preds[:, 2], '4': preds[:, 3]})
        df_pred_ss = pd.concat([df_unlabelled_set, df_pred], axis=1)
        df_pred_ss['MAX_PRED_RATIO'] = df_pred_ss[['1', '2', '3', '4']].max(axis=1)
        df_pred_ss['LABEL'] = df_pred_ss[['1', '2', '3', '4']].idxmax(axis=1)
        df_pred_ss['LABEL'] = df_pred_ss['LABEL'].astype(int)
        df_pred_ss_up = df_pred_ss[df_pred_ss['MAX_PRED_RATIO'] >= pred_threshold]
        print('The number of samples predicted with high confidence level:', df_pred_ss_up.shape)

        if len(df_pred_ss_up) > 0:
            # deleting from unlabelled set
            df_unlabelled_set = df_unlabelled_set.drop(index=df_pred_ss_up.index).reset_index(drop=True)
            # adding to train set as if they are ground truths
            df_train_set = pd.concat([df_train_set, df_pred_ss_up[df_train_set.columns]], axis=0)
            df_train_set.reset_index(drop=True, inplace=True)

            print('Threshold ratio', pred_threshold)
            print('Remaining unlabelled data', df_unlabelled_set.shape[0])
            print('Total labelled data', df_train_set.shape[0])
            print('Iteration is completed', i)
            pred_threshold += SS_THRESHOLD_INCREASE
        else:
            print('No improvement!')
            break

    df_train_set = df_train_set.reset_index(drop=True)
    df_unlabelled_set = df_unlabelled_set.reset_index(drop=True)

    return df_train_set, df_unlabelled_set

Label Augmentation – Active Learning vs Semi-Supervised Learning

To conclude, the reasoning behind label augmentation is very simple. In fact, the reasoning behind active learning is the opposite of that behind semi-supervised learning.

In active learning, we aim to find the most informative samples, which are those predicted with low confidence. Once found, these samples are shared with experts to be labelled manually. This is a very effective approach if you have lots of unlabelled data and limited resources to label them. In semi-supervised learning, on the other hand, the samples predicted with high probability (the least informative ones) are labelled automatically as if they were ground truths and added to the train set for the next iterations.
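The contrast can be shown on a single matrix of predicted class probabilities. The sketch below is illustrative only: it uses the maximum class probability as the confidence measure (an assumption; the active learning part of this post uses SVM distances instead), and the function name `split_by_confidence`, its parameters and the example values are all made up.

```python
import numpy as np

def split_by_confidence(proba, auto_threshold=0.85, n_queries=2):
    """Split samples into auto-labelled ones and ones to query from experts."""
    proba = np.asarray(proba, dtype=float)
    confidence = proba.max(axis=1)

    # Semi-supervised side: confident predictions become pseudo-labels.
    auto_idx = np.flatnonzero(confidence >= auto_threshold)
    pseudo_labels = proba.argmax(axis=1)[auto_idx]

    # Active learning side: the least confident of the remaining samples.
    rest = np.flatnonzero(confidence < auto_threshold)
    query_idx = rest[np.argsort(confidence[rest])][:n_queries]
    return auto_idx, pseudo_labels, query_idx

proba = np.array([
    [0.87, 0.05, 0.05, 0.03],
    [0.30, 0.28, 0.22, 0.20],
    [0.10, 0.60, 0.20, 0.10],
    [0.02, 0.02, 0.95, 0.01],
    [0.40, 0.35, 0.15, 0.10],
])
auto_idx, pseudo_labels, query_idx = split_by_confidence(proba)
# auto_idx -> [0, 3], pseudo_labels -> [0, 2], query_idx -> [1, 4]
```

Rows 0 and 3 would be labelled automatically, while rows 1 and 4 would be sent to the experts.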

Both active learning and semi-supervised learning can be applied to the same dataset sequentially. First, semi-supervised iterations are applied until no more predictions exceed the threshold value. Then a single iteration of active learning is applied to the result of the previous step, and the inferred informative samples are shared with domain experts to be labelled. This pair of steps can be repeated a certain number of times.
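The sequence can be sketched end to end on synthetic data as follows. This is a simplified illustration under several assumptions: a single Random Forest is used for both phases, the active learning step picks the samples with the lowest maximum probability instead of the SVM sign-pattern ranking, and the expert is simulated by looking up the known true labels. The names (`fit_model`, `y_work`, `labelled`) and all the numbers are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-class data; y_true plays the role of the human expert.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5]], dtype=float)
y_true = np.arange(600) % 4
X = centers[y_true] + rng.normal(scale=1.5, size=(600, 2))

labelled = np.zeros(600, dtype=bool)
labelled[:40] = True                       # small seed set with true labels
y_work = np.where(labelled, y_true, -1)    # -1 marks "no label yet"

def fit_model(X, y_work, labelled):
    model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
    return model.fit(X[labelled], y_work[labelled])

for round_no in range(3):                  # repeat the pair of steps
    # Semi-supervised phase: pseudo-label confident samples, raise threshold.
    threshold = 0.80
    for _ in range(10):
        pool = np.flatnonzero(~labelled)
        if pool.size == 0:
            break
        model = fit_model(X, y_work, labelled)
        proba = model.predict_proba(X[pool])
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break                          # no more confident predictions
        picked = pool[confident]
        y_work[picked] = model.classes_[proba[confident].argmax(axis=1)]
        labelled[picked] = True
        threshold += 0.01

    # Active learning phase: a single query batch of least confident samples.
    pool = np.flatnonzero(~labelled)
    if pool.size == 0:
        break
    model = fit_model(X, y_work, labelled)
    confidence = model.predict_proba(X[pool]).max(axis=1)
    query = pool[np.argsort(confidence)[:20]]
    y_work[query] = y_true[query]          # simulated expert labels
    labelled[query] = True
```

In a real project, the simulated lookup in the last step would be replaced by handing the `query` samples to domain experts.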


