
Introduction
A classic task for us data scientists is building a classification model for some problem. In a perfect world, data samples – including their corresponding labels – are handed to us on a silver platter. We then do our machine learning tricks and mathemagic to derive some useful insights from the data. So far so good.
However, what often happens in our imperfect yet beautiful world is one of the following:
- We get an extremely small dataset that is at least completely labeled. In this case, building a model can be extremely tricky. We have to use advanced feature engineering, maybe even Bayesian methods, and other tools to try to tackle this problem. Take the Overfitting Challenge on Kaggle as an example: The training set consists of 250 training examples and 200 features. Have fun.
- We get enough data, but with no labels at all. Well, tough luck. Try clustering, but this does not necessarily solve your classification problem.
- We get enough data, but it is only partially labeled. This is exactly what we will be talking about in this article! Keep reading.
Let us assume that we are in the third scenario from now on: the size of our dataset is quite okay – we have a few thousand samples, maybe even a million. But looking at the labels, frustration sets in – only a small fraction of the data is labeled!
In this article, I will show you how you can deal with these kinds of common situations.
The Naive Method
The easiest way to approach this setting is to transform it into something we are more familiar with. Concretely:
Throw away the unlabeled data points and train a classifier on the remaining completely labeled, but smaller dataset.
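In code, the naive method could look like the following sketch. Here, make_blobs merely stands in for your own dataset, and -1 marks a missing label, a convention we will also use later in this article:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# Toy stand-in for a partially labeled dataset: 1,000 samples, only 20 of them labeled.
X, y = make_blobs(n_samples=1000, centers=2, random_state=0)
y_missing = y.copy()
y_missing[np.random.RandomState(0).choice(1000, 980, replace=False)] = -1  # -1 = no label
labeled = y_missing != -1  # keep only the labeled samples ...
clf = LogisticRegression().fit(X[labeled], y_missing[labeled])  # ... and train on them alone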
Let’s analyze this approach.
Pros
- easy to understand and implement
- fast transformation and faster training as fewer samples mean less computation
Cons
- the model might overfit to the remaining data
- bias in the process of which data is labeled might lead to wrong decision boundaries of the model
While the arguments on the pro side should be easy to understand, let’s look at a picture to understand the disadvantages better.

As humans, we can clearly see that there are two blobs. The left one should be blue, the right one should be red. There might be some overlap in the middle, but all in all, they can be separated quite well with a straight line, e.g. using logistic regression or a linear SVM.

However, if we throw away the unlabeled data and fit a logistic regression, we end up with the following decision regions:

Not so great. The result is not bad because of overfitting – logistic regression is a simple model. Rather, the positions of the labeled data points are biased, i.e. they follow some weird pattern that confuses the classifier. To be fair, if the labeled data points had been located near the centers of the two blobs, the logistic regression would have worked much better.
Of course, the overfitting problem would occur as well if we tried to throw random forests or neural networks at a training set of size four. We can conclude:
It is a bad idea to simply drop the unlabeled data.
Let us now turn to a smarter technique that allows us to incorporate not only the knowledge of the labeled data but also the features of the unlabeled data samples.
This is what people sometimes refer to as semi-supervised learning.
The Label Propagation Algorithm
Label propagation is a neat idea originally introduced by Xiaojin Zhu and Zoubin Ghahramani [1] in 2002.
Important note: Here, I present a slight variation of the idea of the original paper as it is easier to explain and understand. The gist of both – and the other existing – variations is still the same.
From a very high perspective, it works as follows:
1. Build a graph with the data samples as nodes. Put a weighted edge between each pair of samples. The closer the samples are, the higher the weight. Labels do not matter at this point.
This screams for an example. Let's assume that we have another two-dimensional dataset that consists of only five samples. There are two classes, and two of the samples are not labeled. The samples are the nodes of our graph.

Now, let us build a complete graph, i.e. connect each node with every other node. We also annotate the edges with the distances between the nodes (= samples). You can choose any distance you like (e.g. Euclidean); it is just another hyperparameter of the algorithm.
Note: I leave out the edges between labeled samples because it keeps the visualization clear, and the algorithm does not need those anyway.

Remember that we said that closer samples should have higher weights between them? So far, it’s the other way around! There are several ways to fix this, the easiest ones being: Put a minus in front of all the numbers, or (multiplicatively) invert the numbers, e.g. 4 → 1/4=0.25.
What the authors in [1] propose is using some Gaussian function, sometimes also called radial basis function (rbf).
w(x, x′) = exp(−|x − x′|² / σ²)
where x and x′ are samples. If two samples are really close, i.e. |x − x′| is around 0, the weight of their edge is close to 1. The further apart they are, the more the weight approaches zero.
σ is a hyperparameter you can play around with. Scikit-learn expresses it via gamma = 1/σ² and uses gamma = 20 as its default value, for example.
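As a small illustration of step 1, here is how such a weight matrix could be computed – the five points below are made up and not the exact coordinates from the figure:
import numpy as np
# five made-up two-dimensional samples standing in for the toy dataset
X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 0.0], [1.5, 2.0], [2.5, 1.5]])
sigma = 1.0  # the rbf hyperparameter
# squared Euclidean distances between all pairs of samples
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
# rbf edge weights: close samples get weights near 1, distant ones near 0
W = np.exp(-sq_dists / sigma**2)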
Anyway, let us use the multiplicative inverse for now. The graph becomes

This is the end of step 1.
2. To get the label of an unlabeled sample, start a random walk originating in this sample. One step of the walk consists of jumping from one node to an adjacent node randomly. Edges with a higher weight are chosen with a higher probability. Calculate the probability that the random walk enters a blue node first. If it is larger than 50%, label the node blue, otherwise red.
This sounds more difficult than it actually is. Let us start at the lower white, unlabeled node, for example. To proceed, we have to define probabilities for jumping to the other unlabeled node, to each of the two blue nodes, and to the red node. An easy way to do this is by normalizing.
There are four outgoing edges with the weights 1 (leading to the closer blue node), 0.25 (other blue node), 0.5 (other unlabeled node), and 0.5 (red node). So, for example, let us just define the probability of jumping to the red node as 0.5/(1+0.25+0.5+0.5) = 2/9. Jumping to the closer blue node happens with a probability of 1/(1+0.25+0.5+0.5) = 4/9. You can calculate the rest yourself.
Using these jump probabilities, there is quite some theory involved in computing the probability of ending up in a blue or a red node first. You can do it via Markov chains, a fascinating field within mathematics. I might even write an article about it someday, but for now, I will just supply you with the results.
One can calculate the following probabilities of landing in either color:

With this result, we can say that the upper unlabeled node should belong to the red class, while the bottom one should be blue. We could also keep these probabilities as soft labels if we do not want to commit to a single class.

This is probably also what you have expected intuitively, which speaks for this method.
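If you want to verify such absorption probabilities yourself, here is a minimal sketch of the underlying linear algebra. The first unlabeled row of the weight matrix uses the numbers from above; the weights of the second unlabeled node are made up for illustration:
import numpy as np
# nodes 0 and 1 are blue, node 2 is red, nodes 3 and 4 are unlabeled;
# row 3 contains the edge weights from the example above, row 4 is made up
W = np.array([
    [0.0,  0.0,  0.0,  1.0,  0.3],
    [0.0,  0.0,  0.0,  0.25, 0.2],
    [0.0,  0.0,  0.0,  0.5,  0.6],
    [1.0,  0.25, 0.5,  0.0,  0.5],
    [0.3,  0.2,  0.6,  0.5,  0.0],
])
labeled, unlabeled = [0, 1, 2], [3, 4]
Y_l = np.array([[1, 0], [1, 0], [0, 1]])  # one-hot labels: blue, blue, red
P = W / W.sum(axis=1, keepdims=True)    # jump probabilities via normalizing
P_uu = P[np.ix_(unlabeled, unlabeled)]  # unlabeled -> unlabeled jumps
P_ul = P[np.ix_(unlabeled, labeled)]    # unlabeled -> labeled jumps
# probability of hitting a blue or a red node first: f_u = (I - P_uu)^(-1) P_ul Y_l
f_u = np.linalg.solve(np.eye(len(unlabeled)) - P_uu, P_ul @ Y_l)
print(f_u)  # one row per unlabeled node: [P(blue first), P(red first)]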
Let us now see this method in action!
Label Propagation with scikit-learn
Using label propagation is easy, yet again thanks to scikit-learn! In the following snippet I
- load all libraries and scikit-learn's digits dataset (a small MNIST-style set of handwritten digits),
- mask around 90% of the labels with a -1, the expected input for a missing label, and then
- use label propagation to restore the labels that I just masked.
Since we know the real labels in this case, we can even assess the performance on the masked set. Note, however, that usually we cannot do this.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.semi_supervised import LabelPropagation
np.random.seed(0)
X, y_true = load_digits(return_X_y=True)
n = len(y_true)
mask = np.random.choice(range(n), 9*n//10, replace=False)
y_missing = y_true.copy()
y_missing[mask] = -1 # -1 indicates a missing label
lp = LabelPropagation(gamma=.25) # rbf is the default, gamma = 1/σ²!
lp.fit(X, y_missing) # run the algorithm we described above
print(classification_report(y_true[mask], lp.transduction_[mask]))
The output is the following:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       161
           1       0.90      0.99      0.94       163
           2       1.00      0.96      0.98       159
           3       0.90      0.95      0.92       168
           4       0.98      0.97      0.97       159
           5       0.95      0.97      0.96       161
           6       0.99      0.98      0.98       166
           7       0.99      0.98      0.98       159
           8       0.91      0.85      0.88       160
           9       0.95      0.88      0.91       161

    accuracy                           0.95      1617
   macro avg       0.95      0.95      0.95      1617
weighted avg       0.95      0.95      0.95      1617
An accuracy of 95% – what else can I say? It's amazing! The algorithm had access to labels for merely 10% of the data, yet it managed to label the other samples correctly in nearly all cases. When I first saw this example in a different form on the scikit-learn page, that's when I became a believer. And I bet that this might be useful for your everyday work, too!
Sure, it’s not 100% correct, but it is a valid option if labeling thousands or even millions of samples yourself is not where you want to go.
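As a side note, the soft labels mentioned earlier are available here as well, for example via the fitted model's label_distributions_ attribute:
# class probabilities per sample, i.e. the soft labels from the random-walk view
print(lp.label_distributions_[mask][:5])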
Remarks
As a final official act, let me point you to some interesting details.
Connection to KNN
If you think about it, label propagation feels a bit k-nearest-neighbor-ish, doesn't it? Imagine that you trained a KNN classifier. At prediction time, a new point with no label comes in. You scan your whole training dataset and pick the closest points from there. The closer the points are to the new point, the more important they are.
It's the same for label propagation, as we have seen. The closer two samples x and x′ are, the larger the weight of the edge between them in the graph, and the higher the probability that you jump from x to x′ and vice versa. The parallels are there, yet label propagation is a bit more complicated than k-nearest neighbors.
Label propagation considers numerous unlabeled samples at once, and they help each other propagate the correct labels everywhere in the graph / dataset. In the case of KNN, each sample is on its own. Therefore, label propagation is the smarter algorithm in a sense – although we should not compare apples and pears, as we say in Germany: the two algorithms solve different problems.
The Graphs are HUGE
Imagine that you have a dataset consisting of 1,000,000 samples. The graph that is created in the course of label propagation then has around
1,000,000 × (1,000,000 − 1) / 2 = 499,999,500,000
edges. If you store the weights of these edges as 64-bit floats, that is roughly 4 TB already. Too much for your RAM, too slow to handle when writing it to disk. Be aware that this is how I explained the algorithm, and that this is also the default behavior of scikit-learn's LabelPropagation.
What you can do in these cases is to build an incomplete graph. Instead of connecting every sample node with every other node, connect it only with its k nearest neighbors. (There it is again.)
In this case, there are only about k · 1,000,000 edges, and for a small value such as k = 7, this is still easy to handle. You can use this approach in scikit-learn by setting kernel='knn' and then playing around with the n_neighbors parameter as well.
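For the digits example above, the sparse variant could look like this (n_neighbors=7 is just an illustrative value, not a recommendation):
from sklearn.semi_supervised import LabelPropagation
lp_knn = LabelPropagation(kernel='knn', n_neighbors=7)  # sparse kNN graph instead of a dense rbf graph
lp_knn.fit(X, y_missing)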
Conclusion
In this article, we have examined the problem of having a dataset with only a small portion of labeled data. We have established that throwing away the unlabeled data points can end in a disaster and that smarter approaches are needed, one of them being label propagation.
This algorithm works by building a graph in which the samples from the dataset are the nodes and there is a weighted edge between each pair of samples. For an unlabeled sample, start a random walk from there and see which class of labeled node the walk hits first most of the time.
We have then seen that this approach can work extremely well, demonstrated with the digits example in which only 10% of the data was labeled. The accuracy was a whopping 95%, which is not perfect, but better than the alternative: labeling the remaining 90%, i.e. 1617 samples, by hand.
But wait… labeling the remaining dataset by hand is actually not the only alternative. Another path that we did not speak about today is active learning.
Whatever the case, this is a story for another time.
References
[1] Xiaojin Zhu and Zoubin Ghahramani, Learning from Labeled and Unlabeled Data with Label Propagation (2002), Technical Report CMU-CALD-02-107, Carnegie Mellon University
[more] The scikit-learn label propagation user guide
I hope that you learned something new, interesting, and useful today. Thanks for reading!
As the last point, if you
- want to support me in writing more about machine learning and
- plan to get a Medium subscription anyway,
why not do it via this link? This would help me a lot! 😊
To be transparent, the price for you does not change, but about half of the subscription fees go directly to me.
Thanks a lot if you consider supporting me!
If you have any questions, write me on LinkedIn!