
Data scientists often want to monitor the f-beta score for multi-label classification in Keras. Unfortunately, the F-beta metric was removed in Keras 2.0 because it can be misleading when computed per batch rather than globally (over the whole dataset). The problem is worse when the batch size is small or when a minority class has very few observations. Still, many data scientists are interested in the per-batch F-beta score for various reasons, especially when the batch size is large. In this article, we will implement a custom f-beta metric for multi-label classification in Keras.
In part II of this article, we implemented the f-beta score in Keras for multiclass problems, both as a stateful and a stateless metric, and saw the different ways of aggregating the f-beta score for multiclass problems. In this article, we will explain how the f-beta score applies to multi-label classification and create both stateful and stateless custom f-beta metrics for multi-label problems in Keras. We will assume you are familiar with the basics of deep learning, machine learning classifiers, and NLP.
In multi-label classification problems, we predict all the classes an observation belongs to, for example, all the fruits present in an image from a set of fruits such as apple, banana, orange, mango and cucumber. Worth noting is that in multi-label classification an observation can belong to one or more classes at a time, whether in the training or the testing set. Also, the actual and predicted labels need to be one-hot encoded, making them two-dimensional arrays. As in the multiclass case, metrics like the f-beta score can be calculated per class before aggregating with the micro, macro or weighted method. Unlike the multiclass f-beta score, the multi-label f-beta score can also be calculated per sample before aggregating the results.
Note: In this article, we will use the OneVsRest (OVR) strategy to explain the per-sample f-beta score.
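To make the shape of these one-hot encoded arrays concrete, here is a toy illustration (the fruit assignments are invented for this example):

```python
import numpy as np

# Toy multi-label targets for 3 images and 5 fruit classes
# (apple, banana, orange, mango, cucumber). Each row is one observation;
# a 1 means that fruit is present in the image.
y_true = np.array([[1, 0, 1, 0, 0],   # apple, orange
                   [0, 1, 0, 1, 1],   # banana, mango, cucumber
                   [1, 1, 0, 0, 0]])  # apple, banana

# Thresholded model predictions have the same two-dimensional shape.
y_pred = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 1, 0],
                   [1, 1, 1, 0, 0]])
```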
Per Sample
For the per-sample f-beta score, the f-beta score of the actual and predicted labels of each observation (sample) is calculated before aggregation. The diagram below helps in understanding how this is done.

In multi-label classification, the actual and predicted labels of each sample are arrays of 1s and 0s, and each can be thought of as the y_true or y_pred of a binary classification problem. The y_true and y_pred of a multi-label problem can therefore be seen as a stack of separate binary-classification y_true and y_pred arrays, one row per sample. We compute the f-beta score for each sample and aggregate the scores as our final result. The aggregation can be weighted or unweighted; for simplicity, we will focus on the unweighted aggregation of the per-sample f-beta score in this article and implement it as both a stateless and a stateful metric.
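As a worked example, the sketch below reuses the toy arrays above, treats each row as a small binary problem, computes its f-beta score, and averages the results; Scikit-learn's fbeta_score with average='samples' gives the same number:

```python
import numpy as np
from sklearn.metrics import fbeta_score

y_true = np.array([[1, 0, 1, 0, 0], [0, 1, 0, 1, 1], [1, 1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0], [0, 1, 0, 1, 0], [1, 1, 1, 0, 0]])

beta = 1.0
scores = []
for t, p in zip(y_true, y_pred):              # one binary problem per sample
    tp = np.sum(t * p)                        # true positives
    predicted_positive = np.sum(p)
    actual_positive = np.sum(t)
    precision = tp / predicted_positive if predicted_positive else 0.0
    recall = tp / actual_positive if actual_positive else 0.0
    denominator = beta ** 2 * precision + recall
    scores.append((1 + beta ** 2) * precision * recall / denominator
                  if denominator else 0.0)

print(np.mean(scores))                                             # unweighted mean
print(fbeta_score(y_true, y_pred, beta=beta, average='samples'))   # same value
```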
Stateless F-beta
As explained in part I of this article, a stateless metric, according to the Keras documentation, is one that is estimated per batch; therefore, the last metric value reported after training is actually that of the last batch. Sometimes we may want to monitor a metric per batch during training, especially when the batch size is large, when the validation data is the size of the expected test set, or simply because the network's weights are updated per batch. To demonstrate how to implement this in Keras, we will generate a multi-label dataset using Scikit-learn's make_multilabel_classification function.
First, we import useful libraries.
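The snippets that follow assume roughly these imports (exact versions and extras may differ):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split
```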
Our generated dataset can be thought of as Bag of Words (BOW) document vectors, which we will transform using the TfidfTransformer.
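One possible way to generate and preprocess the data (the dataset sizes and random_state below are arbitrary choices for illustration):

```python
# Generate count-like features and multi-label targets; the feature matrix
# behaves like Bag of Words document vectors.
X, y = make_multilabel_classification(n_samples=5000, n_features=100,
                                      n_classes=5, n_labels=2,
                                      random_state=42)

# Rescale the counts with TF-IDF, then split into train and test sets.
X = TfidfTransformer().fit_transform(X).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
```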

In part I of this article, we calculated the F1 score during training using Scikit-learn's fbeta_score function after setting the run_eagerly parameter of the compile method of our Keras sequential model to True. We also observed that this approach is slower than using functions wrapped in TensorFlow's tf.function logic. For brevity, in this article we will go straight to defining a custom f-beta score function wrapped in TensorFlow's tf.function logic, so that it is not run eagerly. We will simply call this function multi_label_fbeta. Implementing it is possible because, for the y_true and y_pred arrays of a multi-label problem where 1 is positive and 0 is negative (a sketch of the function follows the list below):
- True positive is the sum of the element-wise multiplication of the two arrays.
- Predicted positive is the sum of y_pred.
- Actual positive is the sum of y_true.
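Here is a minimal sketch of multi_label_fbeta built on those facts; the threshold, epsilon and default beta value are assumptions:

```python
@tf.function
def multi_label_fbeta(y_true, y_pred, beta=1.0, threshold=0.5, epsilon=1e-7):
    y_true = tf.cast(y_true, tf.float32)
    # Threshold the predicted probabilities to get 0/1 labels.
    y_pred = tf.cast(tf.greater_equal(tf.cast(y_pred, tf.float32), threshold),
                     tf.float32)

    # Per-sample counts: reduce over the class axis of each sample.
    tp = tf.reduce_sum(y_true * y_pred, axis=-1)           # true positives
    predicted_positive = tf.reduce_sum(y_pred, axis=-1)
    actual_positive = tf.reduce_sum(y_true, axis=-1)

    precision = tp / (predicted_positive + epsilon)
    recall = tp / (actual_positive + epsilon)

    # Per-sample f-beta, then an unweighted mean over the batch.
    beta_sq = beta ** 2
    fbeta = ((1 + beta_sq) * precision * recall
             / (beta_sq * precision + recall + epsilon))
    return tf.reduce_mean(fbeta)
```

The epsilon term simply guards against division by zero for samples that have no actual or no predicted positives.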
Our aim is not to build a high-performance model but to demonstrate how to monitor the f-beta score for multi-label classification in Keras. For this reason, we will build a simple model that is quick to train and run it for only a few epochs.
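A sketch of such a model, assuming the layer sizes, optimizer and epoch count below (sigmoid outputs with binary cross-entropy are the usual pairing for multi-label targets, and the stateless metric is passed directly to compile):

```python
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(y_train.shape[1], activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[multi_label_fbeta])   # reported per batch (stateless)

history = model.fit(X_train, y_train, epochs=5, batch_size=128,
                    validation_data=(X_test, y_test))
```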
We will now test the correctness of our multi-label f-beta function.
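One way to do this, using randomly generated 0/1 labels and Scikit-learn's fbeta_score with average='samples' as the reference (tiny differences on the order of epsilon are expected):

```python
rng = np.random.RandomState(0)
yt = rng.randint(0, 2, size=(100, 5))   # random "actual" labels
yp = rng.randint(0, 2, size=(100, 5))   # random "predicted" labels

print(multi_label_fbeta(tf.constant(yt), tf.constant(yp)).numpy())
print(fbeta_score(yt, yp, beta=1, average='samples'))
```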

Stateful F-beta Score
When we are not interested in the per-batch metric but in the metric evaluated on the whole dataset, we need to subclass the Metric class so that a state is maintained across all batches. This is made possible by keeping track of variables (called state variables) that are useful for evaluating our metric across all batches. A metric implemented this way is said to be stateful (you could set verbose to 2 in the model's fit method to report only the metric of the last batch, which for stateful metrics is that of the whole dataset). According to the Keras documentation, a stateful metric should implement four methods:
- __init__: we create (initialize) the state variables here.
- update_state: called at the end of each batch and used to change (update) the state variables.
- result: called at the end of each batch, after the state variables are updated, to compute and return the metric value reported for that batch.
- reset_state: called at the end of each epoch to clear (reinitialize) the state variables.
For the per-sample multi-label f-beta metric, the natural state variables are the number of samples seen and the running sum of per-sample f-beta scores, with the per-sample true positives, predicted positives and actual positives computed inside each batch update, because these quantities can easily be tracked across all batches. Let's now implement a stateful f-beta metric for our multi-label problem.
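A minimal sketch of such a class, assuming TensorFlow 2.x; the class name and the threshold and epsilon defaults are illustrative, and only the running sum of per-sample scores and the sample count are kept as state, which is all the unweighted per-sample average needs:

```python
class StatefulMultiLabelFBeta(keras.metrics.Metric):
    """Per-sample f-beta averaged over every sample seen so far in the epoch."""

    def __init__(self, name='stateful_multi_label_fbeta', beta=1.0,
                 threshold=0.5, epsilon=1e-7, **kwargs):
        super().__init__(name=name, **kwargs)
        self.beta_sq = beta ** 2
        self.threshold = threshold
        self.epsilon = epsilon
        # State variables, created once and updated every batch.
        self.fbeta_sum = self.add_weight(name='fbeta_sum', initializer='zeros')
        self.n_samples = self.add_weight(name='n_samples', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.cast(y_pred, tf.float32)
        y_pred = tf.cast(tf.greater_equal(y_pred, self.threshold), tf.float32)

        # Per-sample counts for this batch.
        tp = tf.reduce_sum(y_true * y_pred, axis=-1)
        predicted_positive = tf.reduce_sum(y_pred, axis=-1)
        actual_positive = tf.reduce_sum(y_true, axis=-1)

        precision = tp / (predicted_positive + self.epsilon)
        recall = tp / (actual_positive + self.epsilon)
        fbeta = ((1 + self.beta_sq) * precision * recall
                 / (self.beta_sq * precision + recall + self.epsilon))

        # Accumulate across batches.
        self.fbeta_sum.assign_add(tf.reduce_sum(fbeta))
        self.n_samples.assign_add(tf.cast(tf.shape(y_true)[0], tf.float32))

    def result(self):
        return self.fbeta_sum / self.n_samples

    def reset_state(self):          # reset_states() in older TF versions
        self.fbeta_sum.assign(0.0)
        self.n_samples.assign(0.0)
```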
Training our model
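Compiling and fitting with the stateful metric could look like this (hyperparameters again arbitrary):

```python
# Recompile with the stateful metric; with verbose=2 only the end-of-epoch
# value is reported, which for a stateful metric covers the whole epoch.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[StatefulMultiLabelFBeta()])

history = model.fit(X_train, y_train, epochs=5, batch_size=128,
                    validation_data=(X_test, y_test), verbose=2)
```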

Finally, we will check the correctness of our stateful f-beta metric by comparing it with Scikit-learn's f-beta score on some randomly generated multi-label y_true and y_pred.
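A possible sanity check, feeding randomly generated labels to the metric batch by batch and comparing with Scikit-learn on the full arrays:

```python
rng = np.random.RandomState(1)
y_true_rand = rng.randint(0, 2, size=(200, 5))
y_pred_rand = rng.randint(0, 2, size=(200, 5))

metric = StatefulMultiLabelFBeta()
for start in range(0, 200, 32):                     # simulate batches
    metric.update_state(y_true_rand[start:start + 32],
                        y_pred_rand[start:start + 32])

print(metric.result().numpy())
print(fbeta_score(y_true_rand, y_pred_rand, beta=1, average='samples'))
```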

Conclusion
The f-beta score can be implemented in Keras for multi-label problems either as a stateful or a stateless metric, as we have seen in this article. We have also seen the different ways of aggregating the f-beta score for multi-label problems. See all the code in my GitHub repository.