
A tangible variant of the F value for classification evaluation

Justify your model selection comprehensibly

Image by Author

Once you set out on a classification task and experiment with different classification algorithms, hyperparameter tuning, or feature selection, you will need to select your favourite setup. This is by no means a trivial task and requires careful examination of your data structure and of what you want to achieve. Selecting a classifier based on accuracy might be totally inappropriate if you have imbalanced classes – the Matthews correlation coefficient might be the better choice.

Sometimes you don’t care about predicting the non-class (the "0"), in which case you might choose an algorithm based on its sensitivity, i.e. its ability to detect the positive class (the "1"). However, a super sensitive predictor might just be one that always predicts "1" (as in Fig A), which you immediately recognize by its crappy precision.

So, having the choice between the setups resulting in the classifications of Fig A or Fig B, which should you select? The answer may depend on your task, but a common way to balance sensitivity against precision is the F1 measure.
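To see this effect in numbers, here is a minimal sketch with made-up imbalanced labels (not the data behind Fig A), assuming scikit-learn is available: the always-"1" predictor reaches perfect sensitivity, while its precision collapses to the class prevalence.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up imbalanced labels: 10 positives hidden among 100 samples
rng = np.random.default_rng(42)
y_true = np.zeros(100, dtype=int)
y_true[rng.choice(100, size=10, replace=False)] = 1

# The "super sensitive" predictor of Fig A: it always says "1"
y_pred = np.ones(100, dtype=int)

print(recall_score(y_true, y_pred))     # 1.0 -- perfect sensitivity
print(precision_score(y_true, y_pred))  # 0.1 -- precision collapses to the prevalence
```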

The F1 measure is a special case of the F-measure, a harmonic mean between two values. In general, the F-value is defined as:

$$F_n = \frac{(1 + n^2)\cdot\text{precision}\cdot\text{sensitivity}}{n^2\cdot\text{precision} + \text{sensitivity}}$$

which, in the case of n=1, turns into the well-known formula (with TP = true positives, FP = false positives, FN = false negatives, TN = true negatives):

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$
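To make the formula concrete, here is a small sketch computing the F-value from confusion counts; the `f_value` helper and the counts are illustrative, not from the article:

```python
def f_value(tp, fp, fn, n=1.0):
    """General F-value from confusion counts; n weights sensitivity against precision."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return (1 + n**2) * precision * sensitivity / (n**2 * precision + sensitivity)

# Made-up confusion counts for illustration
tp, fp, fn = 40, 10, 20

print(f_value(tp, fp, fn))          # 0.7272... via the general formula
print(2 * tp / (2 * tp + fp + fn))  # 0.7272... via the n=1 shortcut
```

If you have label arrays instead of counts, sklearn.metrics.fbeta_score computes the same quantity, with f1_score as the n=1 special case.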

The F1 value is a reasonable compromise between sensitivity and precision, helping you to select an algorithm that will hopefully not screw up in production. However, once you try to grasp the meaning of the F1 value, you realize how difficult a translation into human language gets. After all, what is a harmonic mean? While you can express precision as the proportion of true positives among the predicted positive cases, there is no equally easy interpretation of the F value.

This bothered me when arguing for an algorithm based on its F value, and it also bothered David Hand and his co-authors at Imperial College London and the Australian National University, Canberra [1]. They proposed a slight transformation of the F value into what they call "F*":

$$F^* = \frac{F_1}{2 - F_1} = \frac{TP}{TP + FP + FN}$$

F* can therefore be interpreted as the ratio of true positives over all samples that are either true positives, negatives predicted as positive (false positives), or missed positives (false negatives). Or, as the ratio of true positives over all samples except the correctly predicted negatives. Or, as the ratio of true positives over all misclassified or correctly positive predicted samples. The authors also show that F and F* are tightly related – since F* is a monotone transformation of F, a decision made based on F would hardly ever come out differently based on F*.
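As a quick sanity check, here is a small sketch, reusing the made-up counts from above, showing that computing F* directly from the confusion counts and transforming an existing F1 value give the same number:

```python
def f_star(tp, fp, fn):
    """F*: true positives over all samples except the correctly predicted negatives."""
    return tp / (tp + fp + fn)

# Same made-up counts as above
tp, fp, fn = 40, 10, 20
f1 = 2 * tp / (2 * tp + fp + fn)

print(f_star(tp, fp, fn))  # 0.5714... directly from the counts
print(f1 / (2 - f1))       # 0.5714... by transforming the F1 value
```

Since TP / (TP + FP + FN) is exactly the Jaccard index, scikit-learn's jaccard_score yields F* directly for binary labels.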

For the practitioner, this means the F value can safely be replaced by F*, gaining interpretability and delivering a more tangible story for customer reports.

[1] Hand, D.J., Christen, P. & Kirielle, N. F*: an interpretable transformation of the F-measure. Mach Learn (2021). https://doi.org/10.1007/s10994-021-05964-1
