Hierarchical Performance Metrics and Where to Find Them

How to measure the performance of your hierarchical classification model.

Noa Weiss
Towards Data Science


Photo by Calder B on Unsplash

Hierarchical machine learning models are one top-notch trick. As discussed in previous posts, considering the natural taxonomy of the data when designing our models can be well worth our while. Instead of flattening out and ignoring those inner hierarchies, we’re able to use them, making our models smarter and more accurate.

“More accurate”, I say — are they, though? How can we tell? We are people of science, after all, and we expect bold claims to be supported by the data. This is why we have performance metrics. Whether it’s precision, f1-score, or any other lovely metric we’ve got our eye on — if using hierarchy in our models improves their performance, the metrics should show it.

Problem is, if we use regular performance metrics — the ones designed for flat, one-level classification — we go back to ignoring that natural taxonomy of the data.

If we do hierarchy, let’s do it all the way. If we’ve decided to celebrate our data’s taxonomy and build our model in its image, this needs to also be a part of measuring its performance.

How do we do this? The answer lies below.

Before We Dive In

This post is about measuring the performance of machine learning models designed for hierarchical classification. It kind of assumes you know what all those words mean. If you don’t, check out my previous posts on the topic. Especially the one introducing the subject. Really. You’re gonna want to know what hierarchical classification is before learning how to measure it. That’s kind of an obvious one.

Throughout this post, I’ll be giving examples based on this taxonomy of common house pets:

The taxonomy of common house pets. My neighbor just adopted the cutest baby Pegasus.

Oh So Many Metrics

So we’ve got a whole ensemble of hierarchically-structured local classifiers, ready to do our bidding. How do we evaluate them?

That is not a trivial problem, and the solution is not obvious. As we’ve seen earlier in this series, different projects require different treatment, and the best metric depends on the specific requirements and limitations of your project.

All in all, there are three main options to choose from. Let’s introduce them, shall we?

The contestants, in all their grace and glory:

The Down-To-Earth One: Flat Classification Metrics

These are the same classification metrics we all know and love (precision, recall, f-score — you name it), applied… Well, flatly.

Same as with the original “flat classification” approach (described in the first post in this series), this method is all about ignoring the hierarchy. Only the final, leaf-node predictions are considered (in our house pets example, those are the specific breeds), and they’re all treated as equal classes, with no special treatment of sibling classes vs. non-sibling ones.

This method is simple, but, obviously, not ideal. We don’t want the errors at different levels of the class hierarchy to be penalized in the same way (if I mistook a Pegasus for a Narwhal, that’s not as bad as mistaking it for a Labrador). Also, there isn’t an obvious way to handle cases where the final prediction is not a leaf-node one — which could definitely be the case if you implemented the previously-mentioned blocking by confidence method.
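To make the flat approach concrete, here’s a minimal sketch using scikit-learn. The labels are made-up leaf predictions from our house-pets taxonomy; the point is simply that only the breed level is compared, with no notion of which breeds are siblings.

```python
from sklearn.metrics import classification_report

# Flat evaluation: compare only the leaf-level (breed) predictions and treat
# every breed as just another class. The Dog/Cat/Unicorn levels above them
# are ignored entirely.
y_true = ["Dalmatian", "Labrador", "Sphynx", "Narwhal", "Pegasus"]
y_pred = ["Dalmatian", "Dalmatian", "Sphynx", "Dalmatian", "Labrador"]

# Note how mistaking a Labrador for a Dalmatian is penalized exactly as much
# as mistaking a Narwhal for one.
print(classification_report(y_true, y_pred, zero_division=0))
```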

The Hipster One: A Custom-Made Metric

Not happy with the flat metrics, and feel that creative spark tingling in your fingertips? You can conjure up your own special metric, which specifically fits your unique snowflake of a use case.

This could be useful where the model needs to fit some unusual business constraints. If, for example, you don’t really care about falsely identifying dogs as unicorns, but a Sphynx cat must be correctly spotted or all hell breaks loose, you can design your metrics accordingly, giving more or less weight to different errors.
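For instance, here’s a sketch of what such a metric could look like: an average misclassification cost, with a hypothetical cost table that makes a missed Sphynx very expensive and a dog-as-unicorn mix-up nearly free. The numbers (and the metric itself) are purely illustrative; yours would come from your own business constraints.

```python
# Hypothetical per-mistake costs. A missed Sphynx is catastrophic; a dog
# flagged as a unicorn barely matters. Anything not listed costs 1.0.
COSTS = {
    ("Sphynx", "any"): 10.0,               # a missed Sphynx: all hell breaks loose
    ("Labrador", "Rainbow Unicorn"): 0.1,  # dogs as unicorns: who cares
}
DEFAULT_COST = 1.0

def average_cost(y_true, y_pred):
    """Average cost per example; lower is better, 0 is a perfect model."""
    total = 0.0
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            continue
        total += COSTS.get((true_label, pred_label),
                           COSTS.get((true_label, "any"), DEFAULT_COST))
    return total / len(y_true)

print(average_cost(
    ["Sphynx", "Labrador", "Dalmatian"],
    ["Labrador", "Rainbow Unicorn", "Dalmatian"],
))  # (10.0 + 0.1 + 0) / 3, roughly 3.37
```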

The Pretentious One: Hierarchy-Specific Variations on the Regular Classification Metrics

These are variations of the well-known precision, recall and f-score metrics, specifically adapted to fit hierarchical classification.

Please bear with me as I throw some math notations in your general direction:

hP = Σi |Pi ∩ Ti| / Σi |Pi|
hR = Σi |Pi ∩ Ti| / Σi |Ti|
hF = 2 · hP · hR / (hP + hR)

Definitions for hierarchical precision (hP), hierarchical recall (hR) and hierarchical f-measure (hF), respectively.

What does it all mean, though?

Pi is the set consisting of the most specific class (or classes, in case of a multi-label problem) predicted for each test example i, plus all of its/their ancestor classes; Ti is the set consisting of the true most specific class(es) of test example i, plus all of its/their ancestor classes; and each summation is computed, of course, over all of the test set examples.

This one is a bit of a handful to unpack, so if you find yourself puzzled, check out the appendix, where I explain it in more detail.
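Here’s a minimal sketch of how hP, hR and hF can be computed from those definitions. The ANCESTORS mapping is a hypothetical fragment of our house-pets taxonomy (I’m assuming, for example, that Narwhal and Pegasus share a Unicorn parent), and it leaves out the root, which carries no information.

```python
# Hypothetical fragment of the house-pets taxonomy: each leaf class mapped to
# itself plus its ancestors (root excluded). In a real project this would be
# derived from whatever structure holds your class hierarchy.
ANCESTORS = {
    "Dalmatian": {"Dog", "Dalmatian"},
    "Labrador":  {"Dog", "Labrador"},
    "Sphynx":    {"Cat", "Sphynx"},
    "Narwhal":   {"Unicorn", "Narwhal"},
    "Pegasus":   {"Unicorn", "Pegasus"},
}

def hierarchical_scores(y_true, y_pred):
    """Micro-averaged hP, hR and hF over all test examples."""
    overlap = pred_total = true_total = 0
    for true_leaf, pred_leaf in zip(y_true, y_pred):
        T_i = ANCESTORS[true_leaf]   # true class plus its ancestors
        P_i = ANCESTORS[pred_leaf]   # predicted class plus its ancestors
        overlap += len(P_i & T_i)
        pred_total += len(P_i)
        true_total += len(T_i)
    hP = overlap / pred_total
    hR = overlap / true_total
    hF = 2 * hP * hR / (hP + hR) if hP + hR else 0.0
    return hP, hR, hF

# A Labrador mistaken for a Dalmatian still gets credit for the shared "Dog":
print(hierarchical_scores(["Labrador"], ["Dalmatian"]))  # (0.5, 0.5, 0.5)
```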

Now, if you’ve implemented your model with non-mandatory leaf-node prediction (meaning the most specific level predicted doesn’t have to be the deepest one), some adjustments need to be made; I won’t go into it here, but if this is something you want to read more about, let me know.

Which Metric Is The Perfect Match?

The everlasting question. As I previously mentioned, there isn’t one obvious answer, but here are my own thoughts on the subject:

  • Flat metrics: it’s a simple enough method, but it loses hierarchy information, which you probably deem important if you went through the trouble of building a hierarchical ensemble model in the first place. I would recommend using flat metrics only for super quick-and-dirty projects, where time is the biggest constraint.
  • Custom-made, unique metrics: might be a better fit, but you pay with time and effort. Also, since you’ll be using a metric that wasn’t peer reviewed, you could be missing something important. I would recommend custom-made metrics only when the project at hand has very unique requirements that should be taken into account when evaluating model performance.
  • Hierarchical versions of common classification metrics: this method is somewhat intuitive (once you get the hang of it), and it makes a lot of sense for a hierarchical model. However, it might not fit your own use-case the best (for example, there’s no added weight for a correct/false prediction of the deepest class — which might be important in some use cases). It also requires some extra implementation time. All in all, though, I think it’s a good enough premade solution, and should probably be the first choice for most projects.

To Conclude

A machine learning model is nothing without its performance metrics, and hierarchical models require their own special care. There is no one best method to measure hierarchy-based classification: different approaches have their own pros and cons, and each project has its own best fit. If you got this far, you hopefully have an idea as to which method is best for yours, and can now measure your model once you’ve got it rolling.

This post concludes my four-post series about hierarchical classification models. If you’ve read all of them, you should have all of the tools you need to design, build and measure an outstanding hierarchical classification project. I hope you put it to the best use possible.

Noa Weiss is an AI & Machine Learning Consultant based in Tel Aviv.

Appendix

Can’t figure out those pesky hierarchical metrics? I’m here to help.

In the table below I go over the mock results of a “common house pets” hierarchical model, looking at the measures for the “Dalmatian” class (remember: precision, recall and f-score metrics are calculated per class, treating the labels — both predicted and true — as binary).

I go over a few examples, checking out what each of them contributes to both the precision and the recall scores. Remember — the final precision/recall scores are built up from the sums of all those individual contributions. (A short code sketch after the comments below runs through the same arithmetic.)

Demystifying hierarchical metrics one dog at a time.

Comments by example:

  1. Misclassification of a different breed as a Dalmatian: a full point for recall (as the “dog” part was correctly identified), but only half a point for precision (as “dog” was correct, but the predicted “dalmatian” label was wrong).
    Recall isn’t negatively affected since the “Labrador” label, which was missed here, is not part of the [Dog, Dalmatian] classes, which are the ones measured here.
  2. Misclassification of a narwhal as a dalmatian — a zero for precision (as both the “dog” and “dalmatian” predicted labels are wrong), but the recall metric isn’t affected, since the true narwhal label is irrelevant to the measurements of the [Dog, Dalmatian] classes.
  3. Perfect prediction — an extra point for both precision and recall.
  4. Misclassification of a dalmatian as a different breed: a full point for the precision metric (as the “dog” classifier, which is the only one of the two that came out positive, was correct), but only half a point for recall (as the “dog” label was correctly identified, but the “dalmatian” one was missed).
  5. A dalmatian misclassified as a Rainbow unicorn: 0 for recall (as both dog and dalmatian labels were missed), but the precision score isn’t affected.
  6. This example doesn’t teach us anything about the performance of the Dog/Dalmatian classifiers, so it stands to reason it doesn’t affect the score.

Source: C.N. Silla & A.A. Freitas, A survey of hierarchical classification across different application domains (2011), Data Mining and Knowledge Discovery, 22(1–2):182–196
