A Tale of Two Macro-F1's

Boaz Shmueli
Towards Data Science
5 min read · Aug 19, 2019

After writing my 2-part series Multi-Class Metrics Made Simple (Part I, Part II), I received encouraging and useful feedback from readers, including claps, typo corrections, and more. So first, many thanks to all who responded! One email in particular came from a curious reader (who wished to remain anonymous, so I’ll refer to them as “Enigma”) and triggered an investigation into the way the macro-averaged F1 score is calculated. This led me down a rather surprising rabbit hole, which I describe in this post. The bottom line: there is more than one macro-F1 score, and data scientists mostly use whichever one their software package provides without giving it a second thought.

As a quick reminder, Part II explains how to calculate the macro-F1 score: it is the average of the per-class F1 scores. In other words, you first compute the precision and recall for each class, then combine each pair into a per-class F1 score, and finally take the arithmetic mean of these per-class F1 scores as the macro-F1 score. In the example in Part II, the F1 scores for the classes Cat, Fish, and Hen are 42.1%, 30.8%, and 66.7% respectively, and thus the macro-F1 score is:

Macro-F1 = (42.1% + 30.8% + 66.7%) / 3 = 46.5%
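
For concreteness, here is a minimal Python sketch of this first recipe. The per-class precision and recall values (quoted to four decimals) come from the Part II example; no libraries are needed:

```python
# First definition: macro-F1 as the arithmetic mean of per-class F1 scores.
# Per-class precision/recall pairs for Cat, Fish, and Hen, taken from the
# Part II example (quoted here to four decimals).
precision = [0.3077, 0.6667, 0.6667]
recall = [0.6667, 0.2000, 0.6667]

def f1(p, r):
    # F1 is the harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

per_class_f1 = [f1(p, r) for p, r in zip(precision, recall)]
macro_f1 = sum(per_class_f1) / len(per_class_f1)

print([f"{x:.1%}" for x in per_class_f1])  # ['42.1%', '30.8%', '66.7%']
print(f"Macro-F1 = {macro_f1:.1%}")        # Macro-F1 = 46.5%
```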

But apparently, things are not so simple. In the email, “Enigma” included a reference to a highly cited paper that defines the macro-F1 score in a very different way: first, the macro-averaged precision and macro-averaged recall are calculated; then, the harmonic mean of these two aggregates is taken as the final macro-F1 score. In our example, the macro-precision and macro-recall are:

Macro-precision = (30.8% + 66.7% + 66.7%) / 3 = 54.7%

Macro-recall = (66.7% + 20.0% + 66.7%) / 3 = 51.1%

And thus using the second method, which I designate with an asterisk (*):

Macro-F1* = 2 × (54.7% × 51.1%) / (54.7% + 51.1%) = 52.8%
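
And here is the second definition as a small Python sketch, using the same per-class values as above; the macro-averages are computed first, then their harmonic mean:

```python
# Second definition: harmonic mean of the macro-averaged precision and the
# macro-averaged recall, using the same per-class values as above.
precision = [0.3077, 0.6667, 0.6667]
recall = [0.6667, 0.2000, 0.6667]

macro_p = sum(precision) / len(precision)  # ~54.7%
macro_r = sum(recall) / len(recall)        # ~51.1%
macro_f1_star = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"Macro-F1* = {macro_f1_star:.1%}")  # Macro-F1* = 52.8%
```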

As you can see, the values for Macro-F1 and Macro-F1* are very different: 46.5% vs. 52.8%.

Macro-F1 and Macro-F1* measure two different things. So which one is the “true” macro-F1 score: Macro-F1 or Macro-F1*? Read on.

The reference that “Enigma” sent me is “A systematic analysis of performance measures for classification tasks” by…
