Choosing a Baseline Accuracy for a Classification Model

Pick a simple baseline accuracy to demonstrate that your classification model has skill on a problem

Aaron Lee
Towards Data Science


When you evaluate a new machine learning model and end up with an accuracy number or other metric, you need to know whether it is meaningful. With imbalanced classification problems in particular, it can appear that your model isn't doing much better than guessing. What accuracy is enough to call your model useful?

This article simply presents the baseline accuracies I find useful for putting a model's accuracy in context.

The Problem

I created an election model that predicted the voting habits of all 3200+ US counties (Democrat or Republican) using economic metrics. The model had an accuracy of 86%. That sounds great until you realize that 84% of all counties in the United States voted Republican. Those counties happen to be smaller and more rural, so there are many more of them. Like many datasets, mine had a significant class imbalance.

You could simply predict all counties to be Republican and get an accuracy of 84%. Was my model really only barely better than guessing Republican for every county?

Zero Rate Classifier

The model baseline we just described has a name.

The ZeroR (or Zero Rate) Classifier always predicts the largest class; in other words, it trivially predicts the most frequent class for every example. For a two-outcome model, it will be right more often than not just by going with the odds.

Always selecting the majority class (the ZeroR classifier) is a useful baseline, and it is the one you should establish for any classification model. For a machine learning algorithm to demonstrate that it has skill on a problem, it must achieve an accuracy better than this ZeroR value.

For highly imbalanced problems (like the voting classification problem), a model accuracy even a little higher than ZeroR could be significant. Either way though, your model must be better than ZeroR to be considered useful on a problem.
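
As a quick sketch, assuming the labels are just a Python list (the names below are mine, not from the original model), the ZeroR baseline is simply the frequency of the most common class:

from collections import Counter

def zeror_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical label list: 84 Republican and 16 Democratic counties
print(zeror_baseline(["R"] * 84 + ["D"] * 16))  # 0.84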

Random Rate Classifier (Weighted Guessing)

Now we will put our high school math to work. Another baseline strategy is to see what our accuracy would be if we guessed each class at its weighted percentage. This value can never exceed the ZeroR value (for a 50/50 split the two are equal), so it should not be your lower-limit baseline. Instead, it can be used to explain and understand your results in terms of how much value your model adds.

Random Rate Classifier — Applies prior knowledge of class assignments in making a random class assignment.

We will look at these two strategies for a few classification problems.
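
Before working through them, here is a minimal sketch of the weighted-guessing calculation in Python (the function name random_rate_baseline is my own shorthand, not a library call). The expected accuracy is just the sum of the squared class proportions, because a weighted random guess matches a label drawn from the same distribution with exactly that probability:

def random_rate_baseline(proportions):
    """Expected accuracy of guessing each class in proportion to its frequency."""
    return sum(p ** 2 for p in proportions)

print(random_rate_baseline([0.6, 0.4]))  # 0.6*0.6 + 0.4*0.4 = 0.52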

Coin Flip

Let’s start by looking at a coin flip model. 50% of the outcomes are tails (0), and 50% of the outcomes are heads (1).

How do our baseline strategies work out here?

  • ZeroR — guessing all heads would give us 50% accuracy.
  • Random Rate — we intuitively know that guessing 0.50 heads and 0.50 tails would also give us 50% accuracy. We will be correct on half of the heads predictions and half of the tails predictions.

Guessing half heads and half tails with the Random Rate Classifier works out mathematically like this:

Odds of Guessing Heads Correct: 0.50 * 0.50 = 0.25
Odds of Guessing Tails Correct: 0.50 * 0.50 = 0.25
Baseline = 0.50**2 + 0.50**2 = 0.50

Imbalanced Outcomes

Now let's look at one that is not 50/50. Let's say the outcomes are split 75/25. We would now weight the guessing so that we predict the majority outcome 75% of the time. This Random Rate guessing strategy looks like this:

Odds of Guessing Minority Correct: 0.25 * 0.25 = 0.0625
Odds of Guessing Majority Correct: 0.75 * 0.75 = 0.5625
Baseline = 0.25**2 + 0.75**2 = 0.625

If we guessed at that rate, we would guess correctly only 62.5% of the time. Any machine learning model that improves on this baseline is adding value, but must also be above the ZeroR 75% threshold to be useful as a predictor.
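
A quick sanity check with the hypothetical random_rate_baseline helper sketched earlier:

proportions = [0.75, 0.25]
print(max(proportions))                   # 0.75  -> ZeroR baseline
print(random_rate_baseline(proportions))  # 0.625 -> Random Rate baseline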

Our Voting Example

So how did my voting model described at the beginning of this article really do? Well, it is above the ZeroR baseline, so the model is useful. We can also find out what our baseline would be if we guessed at the actual rate, and then compare it to our model accuracy.

Odds of Guessing Democratic Correct: 0.16 * 0.16 = 0.0256
Odds of Guessing Republican Correct: 0.84 * 0.84 = 0.7056
Baseline = 0.16**2 + 0.84**2 = 0.73

So with random weighted guessing, we would predict only 73% of the counties correctly. Our actual accuracy was 86%, a 13 percentage point improvement over the theoretical value of weighted guessing. The model, which at first didn't appear that promising, has definitely added significant value. It is also above the ZeroR baseline, so this model is a useful one for this problem.
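
Putting the two baselines together for the voting example (the 86% accuracy is the figure from this article; the helper is the hypothetical one sketched earlier):

proportions = [0.84, 0.16]  # Republican, Democratic county shares
model_accuracy = 0.86

zeror = max(proportions)                      # 0.84
weighted = random_rate_baseline(proportions)  # ~0.731

print(model_accuracy > zeror)               # True -> the model has skill
print(round(model_accuracy - weighted, 3))  # ~0.129 gain over weighted guessing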

Multiclass Problems

Could you do this for multiclass problems? Sure.

I recently made a Twitter sentiment classifier which classified the emotion of a tweet as positive, negative, or no emotion. The classes were imbalanced in the following way:

No emotion toward brand or product    0.593
Positive emotion                      0.328
Negative emotion                      0.080

The theoretical baseline using our guessing strategy is:

0.593**2 + 0.328**2 + 0.080**2 = 0.465633

Our actual accuracy was 0.81, so machine learning added about 34 percentage points over our theoretical baseline. That's significant. More importantly, our accuracy is also well above the ZeroR value of 0.59 that we would get by predicting 'no emotion' for every tweet.
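
The same hypothetical helper works unchanged for three classes; the 0.81 accuracy is the figure from this article, the rest is arithmetic:

proportions = [0.593, 0.328, 0.080]
print(random_rate_baseline(proportions))                    # ~0.466 -> Random Rate baseline
print(max(proportions))                                     # 0.593  -> ZeroR baseline
print(round(0.81 - random_rate_baseline(proportions), 3))   # ~0.344 gain over weighted guessing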

Takeaways and Suggestions

  • Your model has to do better than Zero Rule (ZeroR) to be useful at prediction. No getting around that.
  • You can compare your model to the theoretical baseline of weighted random guessing and use it to evaluate the usefulness of your model.
  • There are other baselines that you might find useful (uniform guessing, random guessing, and One Rule are a few).
  • Check out sklearn.dummy.DummyClassifier, which offers an automated solution for the following baseline strategies: “stratified”, “most_frequent”, “prior”, “uniform”, “constant” (see the sketch after this list).
  • When evaluating accuracy for imbalanced classification problems, consider looking at the AUC.
  • Create your baseline before you build your model, and establish the rules by which you will evaluate your final model.
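
Here is a rough sketch of how DummyClassifier can reproduce these baselines. The labels below are a hypothetical stand-in for the county data, and the exact stratified and uniform scores will vary slightly because those strategies guess at random:

import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels mirroring the county example: 84% Republican, 16% Democratic
y = np.array(["R"] * 84 + ["D"] * 16)
X = np.zeros((len(y), 1))  # features are ignored by a dummy classifier

for strategy in ["most_frequent", "stratified", "uniform"]:
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X, y)
    print(strategy, round(clf.score(X, y), 3))

# "most_frequent" reproduces the ZeroR baseline (0.84);
# "stratified" approximates the Random Rate baseline (~0.73 on average);
# "uniform" guesses each class equally often (~0.50 on average).

Because every strategy except "most_frequent" involves randomness, averaging the score over several random_state values gives a steadier estimate.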

The techniques shown here provide a 'reality check' on your model performance, and they are understandable and interpretable for a broad audience (I could easily explain this in a non-technical presentation). They helped me evaluate and understand my own models, and I hope you find them useful too.
