
F-beta Score in Keras Part II

Creating a custom F-beta score for multiclass classification problems in Keras

Photo by Edgar Chaparro on Unsplash

In the previous article (part I), we explained stateless and stateful metrics in Keras, derived the formula for the f-beta score and created both stateless and stateful custom f-beta metrics in Keras for binary classification problems. In this article (part II), we will explain how the f-beta score can be applied to multiclass classification problems. We will also create both stateful and stateless custom f-beta metrics for multiclass classification problems in Keras. We assume you are familiar with the basics of deep learning and machine learning classifiers.

In multiclass classification problems, we predict which one of three or more classes an observation belongs to, for example, whether an image is of a cat, dog, fish or bird. Worth noting in multiclass classification is that an observation can belong to one and only one class at a time, whether in the training or testing set. Also, the actual and predicted labels need to be one-hot encoded, making them two-dimensional arrays. The same logic for f-beta in binary classification applies to multiclass classification with a few adjustments. For instance, the f-beta score could be calculated per class or per sample before aggregating the results. In multiclass problems, calculating the f-beta score per sample is highly discouraged, because the true positive, false positive and false negative counts of a sample can only be zero or one, given that only one class is predicted per sample. For example, if our model correctly predicts the class of a particular sample, the true positive count will be 1, while the false positive and false negative counts will both be 0 for that sample. If the model wrongly classifies a sample, the true positive count will be 0, while the false positive and false negative counts will both be 1. This makes recall and precision equal for each sample and limits their values to either 0 or 1. In fact, we shouldn’t compute the per-sample f-beta score for a multiclass problem; this method is only safe for multilabel problems, which we will see in part III of this article.

We will now focus on the multiclass f-beta computed per class. There are three ways of aggregating the per-class multiclass f-beta scores:

  1. Micro
  2. Macro
  3. Weighted

Note: In this article, we will be using the One-vs-Rest (OVR) strategy in explaining the aggregation methods above.


Micro

In micro mode, we compute the f-beta score globally by finding the harmonic mean of the global precision and recall. For example, let’s consider the confusion matrix for a multiclass problem as shown below:

Image by author

The global true positive is 600+250+40, which equals 890, while the global false positive is 50+150+200+50+70+90, which equals 610. It is important to note that the global false positives can also be counted as global false negatives, meaning the global false negatives also equal 610. This implies that the global precision, 890/(890+610) = 0.593, is the same as the global recall. We know that if precision and recall are equal, the f-beta score takes that same value. Therefore, the global f-beta is also 0.593 irrespective of the value of beta. If we were to calculate the accuracy from the confusion matrix above, we would still get 0.593, because what we call the global true positive is actually the total number of correct predictions, while the global false positive (or global false negative) is actually the total number of wrong predictions. In fact, in micro averaging, fᵦ = global precision = global recall = accuracy.
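This equivalence between the micro-averaged f-beta and accuracy is easy to verify with Scikit-learn; the random labels below are just for illustration:

```python
import numpy as np
from sklearn.metrics import fbeta_score, accuracy_score

# Arbitrary random multiclass labels and predictions
rng = np.random.default_rng(1)
labels = rng.integers(0, 3, 500)
preds = rng.integers(0, 3, 500)

# Micro f-beta coincides with accuracy, whatever the value of beta
micro = fbeta_score(labels, preds, beta=2.0, average="micro")
acc = accuracy_score(labels, preds)
print(micro, acc)  # identical values
```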


Macro

In macro mode, metrics are calculated along axis 0. For example, in the diagram below, we have three classes A, B, and C; we compute the metric, say f-beta, for the y_true and y_pred of each class, as though each class’s y_true and y_pred were those of a binary classification problem.

Image by author

In our example above, we obtain three results, which we then need to average. In macro averaging, we simply take the arithmetic mean of the per-class results.
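As a quick sketch of the idea, with hypothetical one-hot labels for classes A, B and C, Scikit-learn’s per-class scores can be averaged by hand and checked against its macro mode:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Hypothetical one-hot y_true and y_pred for classes A, B and C
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])

# Each column is scored as its own binary problem (OVR)...
per_class = fbeta_score(y_true, y_pred, beta=1.0, average=None)
# ...and macro averaging is simply their unweighted mean
macro = per_class.mean()
print(per_class, macro)
```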


Weighted

Weighted averaging is similar to macro averaging except that the weighted mean of the per-class f-beta scores is returned. This accounts for class imbalance and is done by summing the products of each class’s f-beta score and its sample fraction. Mathematically, it can be written as:

Wfᵦ = Σᵢ pᵢ (fᵦ)ᵢ, for i = 1, …, n

where Wfᵦ is the weighted fᵦ,

pᵢ is the probability of choosing a class (the sample fraction of the class in y_true),

(fᵦ)ᵢ is the f-beta for a particular class

n is the number of classes
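To make the formula concrete, here is a small check on a deliberately imbalanced toy problem (the data is made up), comparing the hand-computed weighted mean against Scikit-learn’s weighted mode:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Imbalanced toy problem: three samples of class 0, two of class 1, one of class 2
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 0, 1, 1, 2])

beta = 2.0
per_class = fbeta_score(y_true, y_pred, beta=beta, average=None)  # (f_beta)_i
weights = np.bincount(y_true) / len(y_true)                       # p_i
manual = np.sum(weights * per_class)                              # W f_beta
print(manual, fbeta_score(y_true, y_pred, beta=beta, average="weighted"))
```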

Having understood the different methods of aggregating the multiclass f-beta score, we will now proceed to implementing them in Python, specifically the macro and weighted f-beta, as both stateful and stateless metrics. For brevity, we will not implement the micro multiclass f-beta because it is the same as accuracy.


Stateless F-beta

As explained in part I of this article, a stateless metric, according to the Keras documentation, is estimated per batch; the last value reported during training is therefore that of the last batch. Sometimes we may want to monitor a metric per batch during training, especially when the batch size is large, when the validation data size matches the expected test size, or simply because the weights of the nodes are updated per batch. To demonstrate how to implement this in Keras, we will be using the famous Modified National Institute of Standards and Technology ([MNIST](https://en.wikipedia.org/wiki/MNIST_database)) dataset, which consists of 60,000 training and 10,000 testing 28×28 grayscale images of handwritten digits between 0 and 9 (inclusive). First, let’s import useful libraries.
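A plausible import block for this section (the exact list in the original notebook may differ):

```python
# Libraries used throughout this section
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from sklearn.metrics import fbeta_score
```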

Let’s download the dataset.
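A sketch of downloading and preparing the data, assuming the standard tf.keras.datasets loader; scaling and one-hot encoding are included here because the multiclass f-beta metric expects one-hot labels:

```python
import tensorflow as tf

# Download MNIST: 60,000 training and 10,000 testing 28x28 grayscale images
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale pixel values to [0, 1] and add a trailing channel axis for the CNN
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

# One-hot encode the labels, making them two-dimensional arrays
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
```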

Let’s randomly view some of the images and their corresponding labels.
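One possible way to plot a few random digits with matplotlib (the grid size and figure dimensions are arbitrary choices; the raw labels are reloaded here so the snippet runs on its own):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when running interactively
import matplotlib.pyplot as plt
import tensorflow as tf

# Raw (not one-hot) labels are easier to print as titles
(images, labels), _ = tf.keras.datasets.mnist.load_data()

# Plot nine random digits in a 3x3 grid with their labels
rng = np.random.default_rng(0)
idx = rng.choice(len(images), 9, replace=False)
fig, axes = plt.subplots(3, 3, figsize=(6, 6))
for ax, i in zip(axes.ravel(), idx):
    ax.imshow(images[i], cmap="gray")
    ax.set_title(int(labels[i]))
    ax.axis("off")
fig.tight_layout()
fig.savefig("mnist_samples.png")
```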

Image by author

In part I of this article, we calculated the F1 score during training using Scikit-learn’s fbeta_score function after setting the run_eagerly parameter of the compile method of our Keras sequential model to True. We also observed that this method is slower than using functions wrapped in TensorFlow’s tf.function logic. For brevity, in this article we will go straight to defining a custom f-beta score function wrapped in TensorFlow’s tf.function logic, so that it is not run eagerly. We will simply call this function multiclass_fbeta. Implementing it is possible based on the facts that, for the one-hot y_true and y_pred arrays of a multiclass problem where 1 is positive and 0 is negative:

  1. The true positive count of each class is the column-wise sum (along axis 0) of the element-wise multiplication of the two arrays.
  2. The predicted positive count of each class is the column-wise sum of y_pred.
  3. The actual positive count of each class is the column-wise sum of y_true.
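Based on these facts, a minimal sketch of such a function might look as follows. The function name, the epsilon guard against division by zero, and the concluding macro average are my assumptions; the original implementation may differ:

```python
import tensorflow as tf

@tf.function  # traced into a TensorFlow graph rather than run eagerly
def multiclass_fbeta(y_true, y_pred, beta=1.0, epsilon=1e-7):
    # Turn predicted probabilities into one-hot predictions (argmax per sample)
    y_pred = tf.one_hot(tf.argmax(y_pred, axis=-1), depth=tf.shape(y_true)[-1])
    y_true = tf.cast(y_true, tf.float32)

    # Per-class counts, computed along axis 0 (one column per class)
    tp = tf.reduce_sum(y_true * y_pred, axis=0)   # true positives
    predicted = tf.reduce_sum(y_pred, axis=0)     # predicted positives
    actual = tf.reduce_sum(y_true, axis=0)        # actual positives

    precision = tp / (predicted + epsilon)
    recall = tp / (actual + epsilon)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall + epsilon)
    return tf.reduce_mean(fbeta)  # macro average over classes
```

The epsilon keeps classes with no predicted or actual positives from producing NaNs; their score is simply driven to zero instead.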

We will now define a function to build our model. The aim of this article is to demonstrate how to create a custom f-beta score metric, not to build a high-performance model, so we will build a simple convolutional neural network that runs for a few epochs.

We now train the model on the training set.
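Putting the two steps together, building and then training, a trimmed-down sketch is shown below. The architecture, optimizer and epoch count are illustrative assumptions, only a slice of the data is used to keep the run short, and the multiclass_fbeta metric is repeated so the snippet runs on its own:

```python
import tensorflow as tf

def multiclass_fbeta(y_true, y_pred, beta=1.0, epsilon=1e-7):
    # Macro f-beta as defined earlier (repeated here so this snippet is standalone);
    # Keras traces this into a graph since run_eagerly defaults to False
    y_pred = tf.one_hot(tf.argmax(y_pred, axis=-1), depth=tf.shape(y_true)[-1])
    y_true = tf.cast(y_true, tf.float32)
    tp = tf.reduce_sum(y_true * y_pred, axis=0)
    precision = tp / (tf.reduce_sum(y_pred, axis=0) + epsilon)
    recall = tp / (tf.reduce_sum(y_true, axis=0) + epsilon)
    b2 = beta ** 2
    return tf.reduce_mean((1 + b2) * precision * recall /
                          (b2 * precision + recall + epsilon))

def build_model():
    # A deliberately small CNN: the goal is the metric, not accuracy
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=[multiclass_fbeta])
    return model

# Load and preprocess MNIST as before
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

model = build_model()
# A slice of the data and two epochs keep this demonstration fast
history = model.fit(x_train[:10000], y_train[:10000],
                    batch_size=128, epochs=2,
                    validation_data=(x_test[:2000], y_test[:2000]),
                    verbose=2)
```

Because the metric is stateless, the value Keras logs each epoch is that of the last batch, not of the whole dataset.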

Let’s confirm the correctness of our custom f-beta function by comparing its value on the test set with that of Scikit-learn’s fbeta_score function.
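Since the trained model object is not reproduced here, the sketch below makes the same comparison on synthetic class probabilities (a stand-in for model.predict on the test set); the metric is repeated so the snippet runs on its own:

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import fbeta_score

@tf.function
def multiclass_fbeta(y_true, y_pred, beta=1.0, epsilon=1e-7):
    # Same macro f-beta as above, repeated for a standalone snippet
    y_pred = tf.one_hot(tf.argmax(y_pred, axis=-1), depth=tf.shape(y_true)[-1])
    y_true = tf.cast(y_true, tf.float32)
    tp = tf.reduce_sum(y_true * y_pred, axis=0)
    precision = tp / (tf.reduce_sum(y_pred, axis=0) + epsilon)
    recall = tp / (tf.reduce_sum(y_true, axis=0) + epsilon)
    b2 = beta ** 2
    return tf.reduce_mean((1 + b2) * precision * recall /
                          (b2 * precision + recall + epsilon))

# Synthetic stand-in for model predictions: random class probabilities
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, 2000)
probs = rng.random((2000, 10)).astype("float32")
y_true = tf.keras.utils.to_categorical(labels, 10).astype("float32")

ours = float(multiclass_fbeta(tf.constant(y_true), tf.constant(probs)))
theirs = fbeta_score(labels, probs.argmax(axis=1), beta=1.0, average="macro")
print(ours, theirs)  # agreement up to the tiny epsilon guard
```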

Image by author

Stateful F-beta

When we are not interested in the per-batch metric but in the metric evaluated on the whole dataset, we need to subclass the Metric class so that a state is maintained across all batches. This maintained state is made possible by keeping track of variables (called state variables) that are useful in evaluating our metric across all batches. When this happens, our metric is said to be stateful. (You could set verbose to 2 in the model’s fit method so that only the metric of the last batch, which for a stateful metric is that of the whole dataset, is reported.) According to the Keras documentation, a stateful metric should implement four methods:

  1. __init__: we create (initialize) the state variables here.
  2. update_state: called at the end of each batch and used to change (update) the state variables.
  3. result: called at the end of each batch, after the state variables are updated; used to compute and return the metric.
  4. reset_state: called at the end of each epoch and used to clear (reinitialize) the state variables.

For the multiclass f-beta metric, the state variables would naturally be the true positives, actual positives and predicted positives of each class, because these can easily be accumulated across batches. Let’s now implement a stateful f-beta metric for our multiclass problem.
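Here is one way such a subclass might look. The class name, the fixed num_classes argument, the epsilon guard and the macro averaging in result are my assumptions rather than the article’s original code:

```python
import tensorflow as tf

class MulticlassFBeta(tf.keras.metrics.Metric):
    """Macro-averaged multiclass f-beta kept as a running state across batches."""

    def __init__(self, num_classes=10, beta=1.0, name="multiclass_fbeta", **kwargs):
        super().__init__(name=name, **kwargs)
        self.beta = beta
        self.epsilon = 1e-7
        # State variables: one slot per class
        self.tp = self.add_weight(name="tp", shape=(num_classes,), initializer="zeros")
        self.actual = self.add_weight(name="actual", shape=(num_classes,), initializer="zeros")
        self.predicted = self.add_weight(name="predicted", shape=(num_classes,), initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Called once per batch: accumulate the per-class counts
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.one_hot(tf.argmax(y_pred, axis=-1), depth=tf.shape(y_true)[-1])
        self.tp.assign_add(tf.reduce_sum(y_true * y_pred, axis=0))
        self.actual.assign_add(tf.reduce_sum(y_true, axis=0))
        self.predicted.assign_add(tf.reduce_sum(y_pred, axis=0))

    def result(self):
        # Macro f-beta over everything seen since the last reset
        precision = self.tp / (self.predicted + self.epsilon)
        recall = self.tp / (self.actual + self.epsilon)
        b2 = self.beta ** 2
        fbeta = (1 + b2) * precision * recall / (b2 * precision + recall + self.epsilon)
        return tf.reduce_mean(fbeta)

    def reset_state(self):  # named reset_states in older tf.keras versions
        for v in self.variables:
            v.assign(tf.zeros_like(v))
```

An instance of this class can be passed directly in the metrics list of model.compile, just like the stateless function.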

Finally, we will check the correctness of our stateful f-beta by comparing it with Scikit-learn’s fbeta_score on some randomly generated multiclass y_true and y_pred.
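The check below is a plain-numpy mock-up of the state-variable logic described above: the counts are accumulated batch by batch (the batch size and random data are arbitrary) and the resulting macro f-beta is compared against Scikit-learn’s dataset-level score:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Randomly generated multiclass y_true and y_pred, fed in batches the way
# a stateful metric would see them during evaluation
rng = np.random.default_rng(7)
n, num_classes, beta = 1000, 5, 2.0
labels = rng.integers(0, num_classes, n)
preds = rng.integers(0, num_classes, n)
y_true = np.eye(num_classes)[labels]
y_pred = np.eye(num_classes)[preds]

# The three state variables, accumulated batch by batch
tp = np.zeros(num_classes)
actual = np.zeros(num_classes)
predicted = np.zeros(num_classes)
for start in range(0, n, 128):  # batch size of 128
    t, p = y_true[start:start + 128], y_pred[start:start + 128]
    tp += (t * p).sum(axis=0)
    actual += t.sum(axis=0)
    predicted += p.sum(axis=0)

# Compute macro f-beta from the accumulated counts
precision, recall = tp / predicted, tp / actual
b2 = beta ** 2
stateful = np.mean((1 + b2) * precision * recall / (b2 * precision + recall))
print(stateful, fbeta_score(labels, preds, beta=beta, average="macro"))
```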


Conclusion

The f-beta score can be implemented in Keras for multiclass problems either as a stateful or a stateless metric, as we have seen in this article. We have also seen the different ways of aggregating the f-beta score for multiclass problems. In part III, we will implement the f-beta score for multilabel classification problems. See the full code in my GitHub repository.

