Opening
I guess anyone here with an ML classifier in production will know what I’m talking about. I had an algorithm with excellent performance on the test set. I deployed it, and the real performance of the algorithm was much worse than expected, especially the precision.
So, I dug in a bit and found a mathematical explanation for the phenomenon, and I also found a better metric for evaluating the algorithm’s real performance using basic prior knowledge (don’t worry, I will show it in this article).
Definitions and Setting
You can divide the data into 3 hierarchical sets:

Labeled and unlabeled data, and inside the labeled data there is a subset that was not used for choosing or training the algorithm – the test set. The validation set is not part of the test set (because it was used to choose the best model).
Mathematical definitions
- P_real – the probability distribution of the unlabeled data.
- P_test – the probability distribution of the test set.
- x – a data sample.
- y – the true label of x.
- y_pred – the classifier’s prediction for x.
Important to remember – the labels of all the labeled data (including the test set) are of course known to us. All the customer cares about is the labels of the unlabeled data. So, if the performance on the test set is good, don’t celebrate yet. The real performance (on the unlabeled data) might be much worse (and this is likely to happen).
So, all we care about for evaluation is P_real, but unfortunately we can only calculate P_test. It is very tempting to generalize the result from the test set to the real performance (if the recall on the test set was 70%, then the real recall of the algorithm is 70%). In the next section I will analyze under which assumptions this generalization is valid. We will see that there is a big difference between the assumptions needed to generalize precision and those needed to generalize recall.
Mathematical definition of generalization (for this article) – replacing P_test with P_real in the calculation of a given metric, and dropping the condition of being in the test set:
P_test(A | B, x ∈ test)  →  P_real(A | B)
Maybe this is not the most conventional way to write evaluation metrics, but I will show you how easily metrics can be written this way.
This article has 2 parts:
- A deep dive into recall and precision generalization.
- Better evaluation metrics for precision using prior knowledge.
If you like this article, I will post a third part – the best way to generalize precision using domain expert sampling.
At the end of each part I’ll present a real-world example. The code is available in this git repository:
The results without a doubt support the theory!
Deep Dive into Recall and Precision
Recall
Let’s start with the definition of recall on the test set (written in the notation above):
Recall_test(class) = P_test(y_pred = class | y = class, x ∈ test)
This is the generalization to the real performance:
Recall_real(class) = P_real(y_pred = class | y = class)
Notice that in order to generalize, the following mathematical assumption has to hold: only the distribution of this specific class has to be the same between the test set and the unlabeled data (there is no assumption on the other classes, because of the condition y = class).
So, with good labeling of a specific class we can estimate the real recall of an algorithm from the recall on the test set. Of course, labeling that doesn’t represent the unlabeled data will produce a test recall that doesn’t represent the real recall of the algorithm, but that is trivial. Sometimes finding a good test set that represents the unlabeled data even for one class can be a hard problem. I think this is the minimal possible requirement for good generalization of a metric.
Precision
Let’s continue with the definition of precision on the test set (in the same notation):
Precision_test(class) = P_test(y = class | y_pred = class, x ∈ test)
This is the generalization to the real performance:
Precision_real(class) = P_real(y = class | y_pred = class)
This looks almost identical to the recall, but with a small and very significant difference: instead of the condition y = class, which narrowed the problem to the specific class in the recall, the condition is now y_pred = class. This condition depends on the output of the algorithm – the samples that the algorithm recognized as the class – and this group usually contains samples from all the classes in the data. Without any prior assumptions on the algorithm, the mathematical assumption needed to generalize the precision becomes: the distribution of the entire test set must be the same as the distribution of the entire unlabeled data.
This assumption is not realistic in most cases, because it requires a good real-world representation of all classes, including the right proportions (weights), in the test set. For example:
- In order to get a good precision estimate for a cat image classifier, you would need to label all the animals in the world, maybe even all the objects (depending on the dataset). Not only that, you would also need to weight them exactly according to their share of the real distribution.
- A more specific example: let’s say our cat classifier confuses cats with some specific types of dogs. In order to generalize precision from the test set, we need labels for all those kinds of dogs, and we need to weight them exactly according to their share of the real distribution (information we most likely don’t have). Weight them wrong and we will get a precision value that doesn’t represent the real precision well.
- In general, the experience of most applied ML practitioners shows that the labeled data almost never represents the real data.
Conclusions
- Generalizing metrics that were calculated on the test set requires non-trivial assumptions.
- Generalizing recall from the test set is much more reliable than generalizing precision.
Example on real data
To demonstrate this phenomenon, I took a dataset of English letters (26 classes, one for each letter) and split it into 3 sets (as in the diagram above): https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
Unlabeled:
- 20 whole classes
- A 50% sample of the remaining classes (6 in total)
Those examples are not really "unlabeled" (we have their labels), but they are used to simulate real unlabeled data which is distributed differently from the labeled data.
Labeled: the remaining 50% of the 6 classes, with 20% of each class held out as the test set.
The split can be described by this Venn diagram:

This way the unlabeled data also contains data from classes the algorithm never saw (as in many real-life cases), so the recall assumption holds and the precision assumption doesn’t. Below you can see the performance of an algorithm trained on the labeled data (excluding the test set), both on the test set and on the unlabeled data (the real performance). The algorithm is LightGBM with default parameters (except balanced class weights). Performance is shown only for the classes the algorithm was trained on:

So, as I said above, the recall generalizes well from the test set in this case (these classes are distributed the same in the test set and in the unlabeled data – the recall assumption), while the precision does not generalize well (bad precision on the unlabeled data despite good performance on the test set).
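To make the setup concrete, here is a minimal sketch of how such a split and comparison could be reproduced. This is not the code from the repository: the local file name, the random seeds, the choice of which 6 classes keep their labels, and all the variable names are assumptions made for illustration.

```python
# Minimal sketch of the experiment: 20 classes fully "unlabeled", 6 classes split
# 50/50 between labeled and "unlabeled", and 20% of the labeled part held out as test.
# Assumes the LIBSVM "letter" file was downloaded locally as 'letter.scale'.
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from lightgbm import LGBMClassifier

X, y = load_svmlight_file("letter.scale")        # 26 classes, one per letter
X = np.asarray(X.todense())

rng = np.random.RandomState(0)
labeled_classes = rng.choice(np.unique(y), size=6, replace=False)
unseen_mask = ~np.isin(y, labeled_classes)       # 20 classes the model never sees

# Split the 6 labeled classes: half simulates unlabeled data, half stays labeled
lab_idx = np.where(~unseen_mask)[0]
lab_keep, lab_unlab = train_test_split(lab_idx, test_size=0.5,
                                       stratify=y[lab_idx], random_state=0)
# 20% of each labeled class becomes the test set
train_idx, test_idx = train_test_split(lab_keep, test_size=0.2,
                                       stratify=y[lab_keep], random_state=0)
unlabeled_idx = np.concatenate([np.where(unseen_mask)[0], lab_unlab])

model = LGBMClassifier(class_weight="balanced")  # default parameters, balanced weights
model.fit(X[train_idx], y[train_idx])

for name, idx in [("test", test_idx), ("unlabeled (real)", unlabeled_idx)]:
    pred = model.predict(X[idx])
    rec = recall_score(y[idx], pred, labels=labeled_classes, average=None, zero_division=0)
    prec = precision_score(y[idx], pred, labels=labeled_classes, average=None, zero_division=0)
    print(name, "recall:", rec.round(2), "precision:", prec.round(2))
```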
It is important to mention that despite all of the above, I think metrics on the test set are very important. I see them as an upper bound on the real performance of the algorithm. This is not always true, but in most cases it proves itself.
In my project I ran into this problem – I had labels only for the classes the customers were interested in, but not for the rest of the classes (and there were many more). All the customers cared about was the precision of the model, and as I showed earlier, the test performance says little about the real-world precision.
Also important to mention: manual review of the predictions by domain experts can give you the best evaluation of the real precision, but in most cases it is an extremely slow and expensive process. Definitely not something you can get for every experiment.
So, the good news is there are better metrics to work with! In the next section I will show you how to:
- Get a better evaluation of the algorithm’s real precision using basic prior knowledge.
- Find the best decision boundary for your algorithm.
- Use the new metrics to improve model comparison, even without the prior knowledge.
Precision Evaluation Using Prior Knowledge
In my project, I had all the unlabeled data and a good estimate of the real number of examples in each class (from now on called the class size). I will show you how I used this information to get a better evaluation of my algorithm’s real precision.
Recall@K Unlabeled
In order to get a better evaluation of the real precision, I will use both the test set and the unlabeled set. I computed the algorithm’s class score for each sample in the test set and in the unlabeled set, and ranked the unlabeled set accordingly. Using this ranking, for each class I define the recall@k unlabeled metric as follows: for every k in [1, size(unlabeled)], the metric equals the recall on the test set when the score of the k-th ranked unlabeled example is used as the decision boundary. For example, if the score of the 130th ranked unlabeled example is 0.75, then recall@130 unlabeled = the recall on the test set with a decision boundary of 0.75.
Important note – recall@k unlabeled can be generalized under the same assumption as recall (proved the same way).
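To make the definition concrete, here is a minimal sketch of recall@k unlabeled for a single class. It is a straightforward implementation of the definition above; the function and variable names are just for illustration, not taken from the repository.

```python
import numpy as np

def recall_at_k_unlabeled(test_scores, test_is_class, unlabeled_scores):
    """For every k, the recall on the test set when the decision boundary is
    the score of the k-th highest ranked unlabeled example."""
    pos_scores = np.sort(np.asarray(test_scores)[np.asarray(test_is_class, bool)])
    thresholds = np.sort(np.asarray(unlabeled_scores))[::-1]   # k-th ranked unlabeled score
    # recall at threshold t = fraction of test positives whose score is >= t
    return 1.0 - np.searchsorted(pos_scores, thresholds, side="left") / len(pos_scores)

# Toy usage with random scores: curve[129] is recall@130 unlabeled
rng = np.random.default_rng(0)
curve = recall_at_k_unlabeled(test_scores=rng.random(100),
                              test_is_class=rng.random(100) < 0.3,
                              unlabeled_scores=rng.random(1000))
print(curve[129])
```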
Let’s see how the above metric helps us generalize precision based on the recall assumption (reminder – only the distribution of this specific class has to be the same between the test set and the unlabeled data).
Precision@k Unlabeled – Better Evaluation of the Real Precision at a Specific k
Now for the moment you were waiting for – calculating precision under the recall assumption.
I call this metric precision@k unlabeled, and it’s defined by this equation:
precision@k unlabeled = (recall@k unlabeled × class size) / k
Why this is an estimate of the real precision:
- The numerator – under the recall assumption, it equals the number of positive examples among the first k ranked unlabeled examples. Example – if the real recall (equal to the test recall under the assumption) at a specific k is 70% and the class size is 200, then there are 0.7*200=140 positive examples among the first k unlabeled examples.
- The denominator – is simply the number of examples among the first k unlabeled examples, i.e. k.
Now, dividing them gives the fraction of positive examples among the first k ranked unlabeled examples, a.k.a. precision!
Of course, this evaluation isn’t perfect either; the numerator is just an estimate of the number of positive examples (which is accurate if the recall assumption holds). But in most cases it is a better evaluation than the precision on the test set. I also use this metric as an upper bound on the real precision (in most cases it is tighter than the test precision).
Important note – if you have upper or lower bounds on the class size, the metric gives upper or lower bounds on the real precision.
Important note 2 – the metric can take values bigger than 1. I will address this case later in the article.
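Here is a minimal sketch of the metric, built on top of a recall@k unlabeled curve like the one sketched above (again, illustrative naming rather than the repository’s code; class_size is the prior knowledge):

```python
import numpy as np

def precision_at_k_unlabeled(recall_at_k_curve, class_size):
    """precision@k unlabeled = recall@k unlabeled * class_size / k, for every k."""
    recall_at_k_curve = np.asarray(recall_at_k_curve, dtype=float)
    k = np.arange(1, len(recall_at_k_curve) + 1)
    return recall_at_k_curve * class_size / k   # values above 1 point to the overfit case below

# Toy check mirroring the example above (k = 150 is an arbitrary choice):
# recall of 70% with class size 200 -> ~140 positives among the top 150, precision ≈ 0.93
toy_curve = np.full(150, 0.7)
print(precision_at_k_unlabeled(toy_curve, class_size=200)[-1])
```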
Example on real data
I calculated this metric on the data presented above (at the k corresponding to a score threshold of 0.5), using the unlabeled set (without its labels) and the test set:

You can see very clearly that in this case this method gives a better evaluation of the real precision.
Recall@K Unlabeled vs. K curve
Besides precision@k unlabeled, the recall@k unlabeled metric can provide much more information about the algorithm when we look at the recall@k unlabeled vs. k curve.
Examples of some recall@k unlabeled vs. k curves (the red line marks the class size):

What do I expect an ideal curve to look like? For a perfect classifier (without bias) I expect a linear ascent that reaches 100% exactly at the class size (the blue curve). In other words, all the positives will appear to the left of the class-size line, and if the model is unbiased, the interleaving of the test positives and the unlabeled positives will be random, which creates a straight line (take a moment to digest this sentence).
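As a quick sanity check of this intuition, here is a small synthetic simulation of such a perfect, unbiased classifier (the score distributions and sizes are arbitrary choices for illustration): the resulting curve rises roughly linearly and reaches 100% at k = class size.

```python
import numpy as np

rng = np.random.default_rng(0)
class_size, n_unlabeled, n_test_pos = 500, 5_000, 100

# A "perfect" classifier: class examples always score above everything else,
# and test and unlabeled positives share the same score distribution (no bias).
unlabeled_scores = np.concatenate([rng.uniform(0.5, 1.0, class_size),
                                   rng.uniform(0.0, 0.5, n_unlabeled - class_size)])
test_pos_scores = rng.uniform(0.5, 1.0, n_test_pos)

thresholds = np.sort(unlabeled_scores)[::-1]              # score of the k-th ranked unlabeled example
curve = (test_pos_scores[None, :] >= thresholds[:, None]).mean(axis=1)

print(curve[class_size // 2 - 1], curve[class_size - 1])  # ~0.5 at k = 250, ~1.0 at k = 500
```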
Now I will address 2 important scenarios that can show up in this metric.
Overfit
Sometimes the recall@k unlabeled vs. k curve shows results that are “better” than the ideal scenario; this usually happens because of bias in the labeled data. Look at the orange curve and notice that at k=3,000 the recall is already 100%, even though we know that the class size is 5,000, which means that the maximum possible recall@3,000 is 3,000/5,000 = 0.6. The same issue shows up as a precision@k unlabeled value bigger than 1.
A reasonable explanation for this phenomenon is that the labeled data doesn’t represent the whole class, only part of it. In this situation the algorithm only learns to identify part of the class, and the calculated recall@k unlabeled value will be bigger than its real value, because the evaluation is done on a subset of the class, which is a smaller group than the whole class. This is an example of a situation where you can’t generalize your recall: the labeled class is distributed differently from the real class, so the mathematical assumption needed for generalization doesn’t hold (and indeed the metrics on the test set don’t represent the real metrics).
For example – a cat classifier whose labeled data doesn’t include black cats might not classify black cats as cats. This decreases the number of unlabeled examples classified as cats, and can cause 100% recall before the real size of the class.
Underfit
Even when the performance on the test set is perfect, the recall@k unlabeled vs. k curve can show much worse performance. For example – notice that for the green curve, at k = class size the recall is only 50%, which indicates that the precision is about 50% (by the precision@k unlabeled metric). This can happen even if the evaluation on the test set shows a perfect score, and it also shows up as low precision@k unlabeled values.
Situations like this happen when the test set distribution is very different from the unlabeled distribution.
An example from the letters dataset:

As you can see, despite the good performance on the test set (figure 1), the model’s real performance for this class is much worse.
This metric is a good way to know that your performance is not good enough for production and you need to improve the algorithm.
Realistic Curves
How do recall@k unlabeled vs. k curves look on real data? Something like this:

This is just to give you a feeling for how these curves look in real life. You can find more analysis of these curves in the git repository.
Compare Models Without Prior Knowledge
This is the most important thing to remember from this section. Even without the exact prior knowledge (the class size), it is easy to see from the curves below which model is better for every class size smaller than 500 (model 2). And because the precision@k unlabeled formula above is linear in recall@k unlabeled, you can tell how significant the improvement is. These models have similar results on the test set, but this metric shows that one is clearly better than the other.

We can also use metrics like the area under the recall@k unlabeled vs. k curve for hyperparameter optimization. Of course, such metrics generalize under the recall assumption, and in most cases they are better than their equivalents on the test set. The main problem with this kind of metric is that it “prefers” models that overfit the labeled data over the perfect classifier.
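For example, one simple way to turn the curve into a single number for a hyperparameter search is to average it (a rough, normalized area under the curve; a sketch under the same illustrative naming as above):

```python
import numpy as np

def recall_at_k_unlabeled_auc(curve):
    """Mean of the recall@k unlabeled curve: a normalized area under the curve
    in [0, 1], usable as an objective for hyperparameter optimization."""
    return float(np.asarray(curve, dtype=float).mean())
```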
Find Best Decision Boundary
One more thing the recall@k unlabeled vs. k curve gives you is a better evaluation of the best decision boundary. There are many ways to choose a decision boundary; I will show how to approximate the one that maximizes the F1-score (the method can be generalized to other metrics). We use our better evaluation of the real precision to get a better evaluation of the real F1-score:
F1@k unlabeled = 2 × (precision@k unlabeled × recall@k unlabeled) / (precision@k unlabeled + recall@k unlabeled)
So, in order to choose the best decision boundary, you calculate the above metric for every k and choose the score at the k that maximizes it. The difference between the best decision boundary on the test set and the real best decision boundary can be huge, and this evaluation proves to be a much better estimate of the real best decision boundary.
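Here is a minimal sketch of this threshold search, combining the ideas above into one self-contained function (illustrative naming; clipping precision values above 1 is my choice for handling the overfit case):

```python
import numpy as np

def best_f1_threshold(test_scores, test_is_class, unlabeled_scores, class_size):
    """Choose the score threshold that maximizes the F1 estimate built from
    precision@k unlabeled and the test recall at the same threshold."""
    pos_scores = np.asarray(test_scores, float)[np.asarray(test_is_class, bool)]
    thresholds = np.sort(np.asarray(unlabeled_scores, float))[::-1]  # k-th ranked unlabeled score

    k = np.arange(1, len(thresholds) + 1)
    recall_k = (pos_scores[None, :] >= thresholds[:, None]).mean(axis=1)  # recall@k unlabeled
    precision_k = np.minimum(recall_k * class_size / k, 1.0)              # precision@k unlabeled, clipped at 1
    f1_k = 2 * precision_k * recall_k / (precision_k + recall_k + 1e-12)

    best = int(np.argmax(f1_k))
    return thresholds[best], f1_k[best]
```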
In order to demonstrate this principle, I took the previous dataset and compared the F1-scores obtained with these 3 thresholds:
- The F1-score obtained with the algorithm’s default decision boundary of 0.5 – f1-score_unlabeled.
- The F1-score obtained with the threshold calculated by the method above – f1-score_f1.
- The best threshold possible on the unlabeled data (by the F1-score metric) – optimal_f1.

You can clearly see that the default threshold is much worse than the one found by this method, and that this threshold’s performance is close to the optimal one.
Conclusions
- The precision@k unlabeled metric presented above can give you a better evaluation of an algorithm’s real precision than generalizing the performance on the test set.
- We discussed some interesting cases of recall@k unlabeled vs. k curves (overfit and underfit) and explained what causes them.
- Recall@k unlabeled vs. k curves can help compare different models even without the prior knowledge.