Are You Making This Mistake when Implementing the Macro F1 Score in Keras?

I have coded up the correct way so that you don’t have to

Kat Li
Towards Data Science


Since Keras 2.0, the evaluation metrics F-score, precision, and recall have been removed from the library. For imbalanced classification problems, however, they are exactly the performance measures we want. If this sounds unfamiliar, the paper linked provides a good explanation of the accuracy paradox and the Precision-Recall curve. As a building block for my series of posts on tackling imbalanced datasets in neural networks, this post focuses on implementing an F1-score metric in Keras and on what to do and what not to do.
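As a quick, hypothetical illustration of the accuracy paradox (a toy sketch, separate from the fraud example used below): a classifier that always predicts the majority class can score near-perfect accuracy while its F1 score is zero,

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1,000 samples, of which only 2 belong to the positive (minority) class
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1

# a "classifier" that simply predicts the majority class every time
y_pred = np.zeros(1000, dtype=int)

print('Accuracy:', accuracy_score(y_true, y_pred))             # 0.998, looks great
print('F1      :', f1_score(y_true, y_pred, zero_division=0))  # 0.0, tells the real story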

First Attempt: custom F1-score metric

According to the Keras documentation, users can pass custom metric functions at the model's compilation step. Easy peasy, right? So I went ahead and implemented a metric function custom_f1 that takes the true labels and predicted labels as arguments,

from tensorflow.keras import backend as K

def custom_f1(y_true, y_pred):
    def recall_m(y_true, y_pred):
        # true positives / all actual positives
        TP = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        Positives = K.sum(K.round(K.clip(y_true, 0, 1)))

        recall = TP / (Positives + K.epsilon())
        return recall

    def precision_m(y_true, y_pred):
        # true positives / all predicted positives
        TP = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        Pred_Positives = K.sum(K.round(K.clip(y_pred, 0, 1)))

        precision = TP / (Pred_Positives + K.epsilon())
        return precision

    precision, recall = precision_m(y_true, y_pred), recall_m(y_true, y_pred)

    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))
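Before plugging this into a model, a quick sanity check is helpful. Here is a small, hypothetical snippet (assuming TensorFlow 2.x in eager mode) that compares custom_f1 against scikit-learn's f1_score on a single batch; the two agree, so the formula itself is fine, and the trouble, as we will see, comes from how Keras aggregates the metric across batches:

import tensorflow as tf
from sklearn.metrics import f1_score

# one small batch of true labels and predicted probabilities
y_true = tf.constant([[1.], [0.], [1.], [0.], [1.]])
y_pred = tf.constant([[0.9], [0.2], [0.4], [0.1], [0.8]])

print(float(custom_f1(y_true, y_pred)))                                  # ~0.8 from the custom metric
print(f1_score(y_true.numpy().ravel(), y_pred.numpy().round().ravel()))  # 0.8 from scikit-learn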

The dataset: Credit Card Fraud Detection

In order to show how this custom metric function works, I will use the credit card fraud detection dataset as an example. It is one of the most popular imbalanced datasets, and more details can be found here. Basic exploratory data analysis shows an extreme class imbalance, with Class 0 at 99.83% and Class 1 at 0.17%,

import pandas as pd

credit_dat = pd.read_csv('creditcard.csv')
counts = credit_dat.Class.value_counts()
class0, class1 = round(counts[0]/sum(counts)*100, 2), round(counts[1]/sum(counts)*100, 2)
print(f'Class 0 = {class0}% and Class 1 = {class1}%')
Imbalanced Class Distribution

For demonstration purposes, I will include all the input features in my neural network model, and save 20% of the data as the hold-out testing set,

### Preprocess the training and testing data
### save 20% for final testing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def Pre_proc(dat, current_test_size=0.2, current_seed=42):
    x_train, x_test, y_train, y_test = train_test_split(dat.iloc[:, 0:dat.shape[1]-1],
                                                        dat['Class'],
                                                        test_size=current_test_size,
                                                        random_state=current_seed)
    # fit the scaler on the training set only, then apply it to the testing set
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)

    y_train, y_test = np.array(y_train), np.array(y_test)
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = Pre_proc(credit_dat)

Model structure using Neural Networks

After preprocessing the data, we can now move on to the modeling part. For this post, I will build a neural net with 2 hidden layers for binary classification (using sigmoid as the activation function on the output layer),

### Build the neural network
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Model

def runModel(x_tr, y_tr, x_val, y_val, epos=20, my_batch_size=112):
    ## weight_init = random_normal_initializer(mean=0.0, stddev=0.05, seed=9125)
    inp = Input(shape=(x_tr.shape[1],))

    x = Dense(1024, activation='relu')(inp)
    x = Dropout(0.5)(x)
    x = BatchNormalization()(x)
    x = Dense(512, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = BatchNormalization()(x)

    # single sigmoid unit for binary classification
    out = Dense(1, activation='sigmoid')(x)
    model = Model(inp, out)

    return model

Modeling with the custom F1 metric

Next, we use cross-validation (CV) to train the model. Since building an accurate model is beyond the scope of this post, I set up a 3-fold CV with only 5 epochs per fold to show how the F1 metric function works,

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, precision_score, recall_score

f1_cv, precision_cv, recall_cv = [], [], []

current_folds = 3
current_epochs = 5
current_batch_size = 112

kfold = StratifiedKFold(current_folds, random_state=42, shuffle=True)

for k_fold, (tr_inds, val_inds) in enumerate(kfold.split(X=x_train, y=y_train)):
    print('---- Starting fold %d ----' % (k_fold + 1))

    x_tr, y_tr = x_train[tr_inds], y_train[tr_inds]
    x_val, y_val = x_train[val_inds], y_train[val_inds]

    model = runModel(x_tr, y_tr, x_val, y_val, epos=current_epochs)

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[custom_f1, 'accuracy'])
    model.fit(x_tr,
              y_tr,
              epochs=current_epochs,
              batch_size=current_batch_size,
              verbose=1)

    y_val_pred = model.predict(x_val)
    y_val_pred_cat = (np.asarray(y_val_pred)).round()

    ### Get performance metrics on the validation fold
    f1, precision, recall = f1_score(y_val, y_val_pred_cat), precision_score(y_val, y_val_pred_cat), recall_score(y_val, y_val_pred_cat)

    print("the fold %d f1 score is %f" % ((k_fold + 1), f1))

    f1_cv.append(round(f1, 6))
    precision_cv.append(round(precision, 6))
    recall_cv.append(round(recall, 6))

print('mean f1 score = %f' % (np.mean(f1_cv)))

Running this model, you should see the following verbose logging, where the F1 scores calculated as training goes (e.g., 0.1255) are significantly different from those calculated for each validation set (e.g., 0.827).

F1 score as a custom metric function

Hmmm, why would this happen?

Using Callback to specify metrics

Digging into this issue, we realize that Keras computes custom metric functions batch-wise: the metric is evaluated on every batch, and those per-batch values are then averaged into a global approximation. For a ratio-based metric such as F1, that approximation is misleading, because what we actually want to monitor is the macro performance over a whole epoch. This is exactly why these metrics were removed from the Keras 2.0 release. With all that being said, what is the correct way to implement a macro F1 metric? The answer is the Callback functionality, shown right after the short sketch below.
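To make the issue concrete, here is a minimal, hypothetical sketch (toy data, not the credit card dataset) showing how averaging per-batch F1 scores can diverge badly from the F1 computed over all predictions at once, especially when many batches contain few or no positive samples:

import numpy as np
from sklearn.metrics import f1_score

rng = np.random.RandomState(42)

# toy imbalanced data: ~1% positives, with a small fraction of noisy predictions
y_true = (rng.rand(10000) < 0.01).astype(int)
flip = rng.rand(10000) < 0.005
y_pred = np.where(flip, 1 - y_true, y_true)

batch_size = 112
batch_f1s = [f1_score(y_true[i:i + batch_size], y_pred[i:i + batch_size], zero_division=0)
             for i in range(0, len(y_true), batch_size)]

print('batch-averaged F1:', np.mean(batch_f1s))   # roughly what a batch-wise metric reports
print('global F1:', f1_score(y_true, y_pred))     # the macro view we actually care about

With that out of the way, here is the callback; it scores the entire validation set at the end of each epoch instead of batch by batch,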

from tensorflow.keras.callbacks import Callback

class Metrics(Callback):
    def __init__(self, validation):
        super(Metrics, self).__init__()
        self.validation = validation

        print('validation shape', len(self.validation[0]))

    def on_train_begin(self, logs={}):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs={}):
        # score the entire validation set once per epoch
        val_targ = self.validation[1]
        val_predict = (np.asarray(self.model.predict(self.validation[0]))).round()

        val_f1 = f1_score(val_targ, val_predict)
        val_recall = recall_score(val_targ, val_predict)
        val_precision = precision_score(val_targ, val_predict)

        self.val_f1s.append(round(val_f1, 6))
        self.val_recalls.append(round(val_recall, 6))
        self.val_precisions.append(round(val_precision, 6))

        print(f' — val_f1: {val_f1} — val_precision: {val_precision} — val_recall: {val_recall}')

Then we compile and fit our model this way,

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[])
model.fit(x_tr,
          y_tr,
          callbacks=[Metrics(validation=(x_val, y_val))],
          epochs=current_epochs,
          batch_size=current_batch_size,
          verbose=1)
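One practical note, as a small sketch rather than part of the original pipeline: if you keep a reference to the callback object instead of constructing it inline, you can read back the per-epoch scores it stores (val_f1s, val_precisions, val_recalls) after training, for example to plot them,

metrics_cb = Metrics(validation=(x_val, y_val))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[])
model.fit(x_tr,
          y_tr,
          callbacks=[metrics_cb],
          epochs=current_epochs,
          batch_size=current_batch_size,
          verbose=1)

print('per-epoch validation F1:', metrics_cb.val_f1s)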

Now, if we re-run the CV training, we will see the verbose logging produce an F1 score during training that is consistent with the one calculated on the validation set,

F1 score defined through Callback

As one final check, predicting on the hold-out testing set gives us an F1 score that is reasonably close to the one we saw during training,

from sklearn.metrics import confusion_matrix

y_test_pred_cat = model.predict(x_test).round()
cm = confusion_matrix(y_test, y_test_pred_cat)
f1_final = round(f1_score(y_test, y_test_pred_cat), 6)
print(f'Testing F1 score = {f1_final}')
Confusion matrix for the hold-out testing set

There you have it! The (incorrect and) correct way to calculate and monitor the F1 score in your neural nets. The same procedure can be applied to recall and precision if either is your measure of interest on its own. The full code is available in my GitHub repo.

One callout before I let you go: this Metrics callback calculates the F1 score, but that does not mean the model is trained to optimize the F1 score. In order to ‘train’ based on optimizing the F1 score, which is sometimes the preferred technique for handling imbalanced classification, we need additional model/callback configurations. So please stay tuned for my next blog, where I will discuss F1-score tuning and threshold-moving.
