
It’s Time to Finally Memorize those Dang Classification Metrics!

Intuition behind the metrics and how I finally memorized them

image by cottonbro studio from Pexels.com

The Greek philosopher Socrates allegedly had a dislike for written language (ironically, we don’t know exactly how he felt, because he never wrote anything 😂 ) – among other reasons, he felt that written language makes us intellectually lazy because we can write things down instead of memorizing/internalizing them. I have often thought about how Socrates would feel about me googling the categorical performance metrics every time I use them! He would lose his mind! To avoid this shame, I decided to commit them to memory and I wanted to share my memorization tools with you.

In this article, I won’t only go into how I finally memorized the metrics – I will also go over the intuition behind the metrics (no point in memorizing something you don’t understand) and, perhaps more importantly, what each metric misses (look for the ⚠️ Warning sections).

Contents

  1. Brief overview of classification metrics
  2. Accuracy
  3. Precision
  4. Recall/Sensitivity
  5. Specificity
  6. F1 Score

Overview of classification metrics

I’m going to use the terms ‘True Positive’ (TP), ‘False Positive’ (FP), ‘True Negative’ (TN) and ‘False Negative’ (FN) throughout this post. I think it is worth it to take a little bit of time to get aligned on the definitions of these four key terms before we dive into specifics. If you’re already familiar with these definitions, you should skip to the next section.

I think that these terms are best understood through a confusion matrix. There are a ton of resources out there if you want to go deep into confusion matrices – I’ll give a super high-level overview meant to jog your memory rather than a complete tutorial.

Confusion Matrix – Image by author

Note: Anything that starts with ‘True’ means the prediction is correct – ‘True’ means good job! Anything that starts with ‘False’ means that the prediction is incorrect.

Below I wrote down a definition of each of the terms. I think most people find learning by examples effective (I certainly do). Because of this, I included an example column in the table as well. For the example column, imagine that you are a prospector trying to identify gold from pyrite (fool’s gold).

Image by author
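If you prefer to see the four counts in code, here is a minimal sketch (assuming you have scikit-learn installed) with made-up gold/pyrite labels. By convention, scikit-learn’s confusion_matrix puts actuals in the rows and predictions in the columns.

```python
from sklearn.metrics import confusion_matrix

# Made-up prospecting data: 1 = gold, 0 = pyrite (fool's gold)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # what the rocks actually are
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # what the prospector called them

# ravel() flattens the 2x2 matrix into (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=3, FP=1, TN=3, FN=1
```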

Okay, now that we have all of our True/False, Positive/Negatives down, let’s start talking about the classification performance metrics and how we can memorize them!


Each metric’s section will have three parts: (1) intuition, (2) warning and (3) ‘how I remember it.’ The intuition part will go over the formula for the metric and how it can be interpreted. The warning section talks about blind spots for the metric and how the metric could potentially be gamed to give overly positive results. The last section, ‘how I remember it,’ will give some silly mnemonics/tricks that I use to memorize the metrics so I don’t have to google them all the time!

Accuracy

image by author

Intuition:

Accuracy is the number of correct predictions as a percentage of all of the predictions. I think of it in terms of how multiple choice tests are graded in school: how many questions did you get right out of the total number of questions? This is a simple metric to understand, but it can be very misleading – see the ⚠️ Warning section below!

⚠️Warning:

Accuracy can be very misleading when you are working with imbalanced classes in your data (i.e., the proportions of 1’s and 0’s are very different).

Let’s explore the dangers with an example:

Imagine you are creating a model to predict whether a person hates getting stuck in traffic (you can imagine how this would be imbalanced – almost everybody hates traffic!). In your data, you have 1,000 people, and only 3 of them do not hate getting stuck in traffic.

Given the disparity in the data, your model could be a single rule – predict that everyone is a traffic hater, i.e., all 1’s. Your accuracy will be 99.7%, but you got there by completely ignoring the minority class.

Demonstration of accuracy giving misleading results with class imbalance – image by author
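Here is a quick sketch of the traffic example in code (the numbers are the same made-up ones as above): predicting ‘traffic hater’ for everybody still scores 99.7% accuracy.

```python
from sklearn.metrics import accuracy_score

y_true = [1] * 997 + [0] * 3  # 997 traffic haters, 3 people who don't mind traffic
y_pred = [1] * 1000           # the "model": a single rule that predicts 1 for everyone

print(accuracy_score(y_true, y_pred))  # 0.997 - high accuracy while ignoring the minority class
```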

The traffic example is intentionally silly, but think about the case of rare diseases. Who cares that you have a high accuracy if you never diagnose a sick person!

Quick note to those looking for jobs in Data Science — if accuracy comes up in an interview and you don’t clearly express the issue we just finished talking about, you’re almost guaranteed to not get the offer. So make sure you understand this one before going into an interview — just in case it comes up!

How I remember it:

First of all, for most of the classification metrics, I logically simplify the numerators and denominators. For the numerator, instead of memorizing TP + TN, I memorize ‘correct classifications.’ For the denominator, instead of memorizing TP + TN + FP + FN, I remember that those four summed together are ‘all classifications.’ So, now I’m memorizing "correct classifications / all classifications" – which I feel is a lot easier.

For the formula, I just remember the alliterative phrase ‘academic accuracy’ – which reminds me that accuracy is calculated the same way that academic tests are scored, i.e., total correct answers divided by the total answers.

image by author

Precision

image by author

Intuition:

Precision captures what percent of our positive predictions are actually positive. Think about when the meteorologist says that it is going to rain – that is a positive prediction: they are predicting rain. The percent of their rain predictions that turn out to be correct is their precision.

One place precision comes up is in the world of professional sports. Multiple sports now allow teams to challenge a potentially bad call by a referee. The referee’s calls are the positive predictions (predicting that something happened – e.g., a touchdown, a foul, etc.). The number of correct calls divided by all of the calls the referee makes is his/her precision.
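As a small sketch of the calculation (the data below is made up purely for illustration): precision is the correct positive predictions divided by all positive predictions.

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0]  # what actually happened
y_pred = [1, 1, 1, 0, 0, 0]  # the positive "calls" that were made

# Three positive predictions, two of them correct -> precision = 2/3
print(precision_score(y_true, y_pred))  # 0.666...
```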

⚠️ Warning:

High precision is always a good thing, right? Not necessarily! It depends on how conservative/aggressive the model is in making positive predictions. For example, imagine a radiologist looking at scans to identify a disease. She could have really high precision by only diagnosing cases that she is 100% sure about – she would have very few positive predictions that were not true positives. But what about patients that she thinks have a 95% chance of having the disease? She should probably give them a diagnosis as well, right? This would reduce her precision, though! Precision should be considered alongside other metrics, like recall/sensitivity – which we’ll cover in the next section.
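To make the trade-off concrete, here is a rough sketch of the radiologist example – the confidence scores and the two cut-offs are invented purely for illustration. A very strict threshold gives perfect precision but misses most of the sick patients; a looser threshold trades a little precision for much better recall.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]                          # 1 = actually has the disease
scores = [0.99, 0.97, 0.95, 0.60, 0.85, 0.40, 0.30, 0.10]  # made-up model confidence

for threshold in (0.99, 0.80):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold {threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold 0.99: precision=1.00, recall=0.25
# threshold 0.8: precision=0.75, recall=0.75
```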

How I remember it:

Mnemonic device to remember precision formula – by author

Precision is the only metric that is exclusively based on positive predictions. As we discussed, it is the correct positive predictions divided by all positive predictions. I use the first three letters ‘pre’ to remember that precision only involves ‘pre’-dictions. I also use the fact that it starts with a ‘p’ to remember that it has to do with ‘p’-ositive predictions. Once I remember it is about positive predictions, it is pretty easy to remember that we want to know what percent of positive predictions are correct.

Recall/Sensitivity

image by author

Intuition:

Sensitivity answers this question: how good is your model at identifying positives as a percentage of all actual positives in the data? In the diagnosis example, it tells us what proportion of people with the disease we are able to correctly identify.

⚠️ Warning:

This metric doesn’t have any false positives in it. So, we can have a recall of 100% if we make a positive prediction for every observation. In that case, every actual positive will also be a true positive, and we won’t have any false negatives, because we won’t make any negative predictions! But every actual negative will be predicted as positive – and recall/sensitivity doesn’t capture this problem.
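A minimal sketch of that blind spot, with made-up data: predicting positive for everything gives perfect recall, while precision immediately flags the problem.

```python
from sklearn.metrics import recall_score, precision_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 3 actual positives, 7 actual negatives
y_pred = [1] * 10                        # call everything positive

print(recall_score(y_true, y_pred))     # 1.0 - no false negatives are possible
print(precision_score(y_true, y_pred))  # 0.3 - every actual negative became a false positive
```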

How I remember it:

There are two things to memorize here, (1) that recall = sensitivity and (2) the formula for sensitivity/recall.

I remember that recall is the same as sensitivity with this sentence:

‘You recall sensitive emails sent to the wrong address.’

To remember the formula, I made the alliterative phrase:

‘sensitive sensor’

What do sensitive sensors do? They detect things that are there – they try to predict true positives. Think of a metal detector. A sensitive one will find metal. It detects metal that is present. We don’t buy a metal detector (or any other type of sensor) to correctly identify what is not metal (which would be a true negative)! We buy it to make positive predictions and the more sensitive it is, the better it is at detecting positives.

Specificity

image by author

Intuition:

Specificity is exactly the same as recall/sensitivity except we swap the positive class for the negative one. It is the model’s ability to identify negative observations as a percent of all of the actual negative observations. For the diagnosis example we’ve been using heavily, this would be correctly not diagnosing someone who does not have the disease.
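scikit-learn doesn’t ship a dedicated specificity function, but since specificity is just recall with the classes flipped, you can (assuming 0/1 labels) reuse recall_score with pos_label=0 or read TN and FP off the confusion matrix. A small sketch with made-up data:

```python
from sklearn.metrics import recall_score, confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))                             # specificity = TN / (TN + FP) = 0.8
print(recall_score(y_true, y_pred, pos_label=0))  # same number: recall of the negative class
```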

How I remember it:

Now that we’ve reviewed both sensitivity and specificity, I want to show the way that I remember that they are related and very similar. It’s nothing fancy… the words look similar – so I remember that the calculations are similar and are often compared to get a fuller picture of the model’s performance.

Sensitivity and Specificity are related – I remember that because they look similar, nothing clever here! Image by Author

I remember that specificity has to do with negative predictions and negative actuals with the humorous phrase:

‘Your boyfriend/girlfriend is very specific when complaining about your negative personality traits.’

F1 Score

image by author

Intuition:

This one looks trickier than the other ones, but it actually isn’t that bad. The F1 (sometimes called the F1 score) is the harmonic mean of precision and recall. The harmonic mean is a type of mean that is often used for rates and ratios. For the intuition piece, I don’t think you necessarily need to remember that it is specifically the harmonic mean – remembering that it is a mean of precision and recall is sufficient to understand what it is doing.

Earlier in the article, we discussed how precision and recall both have blind spots. Precision doesn’t take into account how loose or strict the model is in making positive predictions. Recall/sensitivity doesn’t capture false positives. Looking at either metric alone doesn’t tell the full story. It is similar to looking at a basketball player’s offensive ability while ignoring defensive skills. It just doesn’t tell the full story of how good the player is. Maybe he/she is an excellent offensive player and a bad defensive player – looking at both gives us an idea of the overall skills of the player. The same applies to the F1 score. It balances measuring our model’s ability to generate true positives and to avoid generating false positives.

If precision is high and recall is low, the model has few false positives but is missing many true positives (it has a lot of false negatives). If recall is high and precision is low, your model catches a lot of true positives but also generates a lot of false positives. The F1 score "averages" (the harmonic mean isn’t a simple arithmetic average) the two metrics to get a balanced understanding of the model’s performance.
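Here is a small sketch (with arbitrary made-up labels) confirming that the harmonic-mean formula and scikit-learn’s f1_score agree:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2/3
r = recall_score(y_true, y_pred)     # 1/2
print(2 * p * r / (p + r))           # harmonic mean of the two: ~0.571
print(f1_score(y_true, y_pred))      # same value
```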

In the graphic below, you can see how precision and recall benefit differently from how casual/strict the model is in predicting positive categories. You can also see how the F1 Score is a balance between the two metrics.

image by author

Quick side note: I made the chart above by simulating data – if you are interested in data simulation, I did a four-part series on it – link to the list below:

Data Simulation

How I remember it:

I don’t worry about memorizing the exact formula for F1. My focus is memorizing that it is the mean of recall/sensitivity and precision. I’m not a huge fan of racing, but I always think of Formula 1 racing when I see the name F1. So I leverage this to make a phrase that helps me remember that the F1 is the "average" of precision and recall/sensitivity. I think about how your actions (steering, braking, accelerating) have to be very precise because the cars are very sensitive (recall).

"Driving the average F1 car requires precision because they are sensitive"

Summary of memorization tricks

I hope this has helped you gain a better understanding of the classification metrics and given you the tools to make Socrates proud by memorizing them instead of leaning on written language!

The table below gives a summary of the memorization tricks! Hopefully this is helpful for you. If not, feel free to drop a comment with how you remember the metrics.

image by author
