Dummy models are very simple to set up, yet they provide a valuable baseline for judging the performance of your machine learning models.
In this post, I want to explain what dummy models are and how to use them in scikit-learn.

If you like or want to learn machine learning with scikit-learn, check out my tutorial series on this amazing package:
All images by author.
What are dummy models?
Dummy models are very simplistic models that are meant to be used as a baseline to compare your actual models against. A baseline is just a reference point to compare yourself to. When you compute your first cross-validation results to estimate your model’s performance, you know that the higher the score the better, and if the score is pretty high on the first try, that’s great. But that usually isn’t the case.
What should you do if the first accuracy score is pretty low – or lower than what you’d want or expect? Is it because of the data? Is it because of your model? Both? How can we quickly tell whether our model is badly tuned?
Dummy models are here to answer these questions. Their complexity and "intelligence" are very low: the idea is that you can compare your models to them to see how much better you do than the "stupidest" models. Note that they do not intentionally predict nonsensical values; they just make the easiest, most simplistic guess. If your model gives worse performance than the dummy model, you should tune it or change it completely.
A simple example of a dummy regressor is one that always predicts the mean value of the training target, whatever the input: it’s not ideal, but on average it gives a reasonable, simplistic guess. If your actual model gives worse results than this very, very simple approach, you might want to review your model.
Dummy models in scikit-learn
The dummy model package of scikit-learn is pretty simple: it only consists of 2 classes:
- a `DummyClassifier`
- a `DummyRegressor`

These are classes that expose the estimator API in the scikit-learn sense: they are actual models that can learn from a training set using the `.fit` method and predict a target from new input using the `.predict` method. In other words, they work just like your typical model created using a Pipeline, for example.
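As a minimal sketch with made-up toy data (the variable names are my own), both classes behave like any other estimator:

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Toy data, just to illustrate the estimator API (values are made up)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([10.0, 20.0, 30.0, 40.0])
y_clf = np.array(["a", "a", "a", "b"])

reg = DummyRegressor().fit(X_train, y_reg)    # default strategy: "mean"
clf = DummyClassifier().fit(X_train, y_clf)   # default strategy: "prior"

X_new = np.array([[100.0], [-5.0]])
print(reg.predict(X_new))   # [25. 25.]  -> always the mean of the training target
print(clf.predict(X_new))   # ['a' 'a']  -> always the most frequent class
```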
Note that, as we’ll see in greater detail below, both dummy estimators accept a `strategy` parameter. This parameter allows us to create different variants of dummy models, like a dummy regressor that always predicts the mean value of the training target (the `mean` strategy), or one that always predicts a specific constant that we provide.
In other words, scikit-learn provides several dummy regressor and several dummy classifier variants.
For most strategies, dummy estimators only use the y target values when training, and they never use the X features when predicting. We’ll see how just below.
Dummy Regressor
The `DummyRegressor` class can be instantiated with 4 different strategies, whose names are pretty self-explanatory. For any new sample x_i, the dummy model simply returns the value it learned during training and completely disregards the sample’s content:
- `mean`: always return the mean of the training target
- `median`: always return the median of the training target
- `quantile`: always return a specified quantile of the training target
- `constant`: always return a specified constant value

Let’s focus on an example with the `mean` strategy.

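A minimal sketch of the `mean` strategy, with made-up data, could look like this:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))                # the features are ignored by the dummy
y_train = rng.normal(loc=50, scale=10, size=100)

dummy_mean = DummyRegressor(strategy="mean").fit(X_train, y_train)

X_test = rng.normal(size=(5, 3))
print(dummy_mean.predict(X_test))   # 5 identical predictions...
print(y_train.mean())               # ...all equal to the mean of the training target
```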
We can now quickly compare how each strategy behaves, visually: we arbitrarily choose a constant of 30 for the constant strategy, and a 0.2 quantile for the quantile strategy:

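A sketch along these lines (with made-up data, printing instead of plotting) shows that each strategy returns its own constant value for every test sample:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 2))
y_train = rng.exponential(scale=25, size=200)   # skewed target, so mean and median differ
X_test = rng.normal(size=(3, 2))

strategies = {
    "mean": DummyRegressor(strategy="mean"),
    "median": DummyRegressor(strategy="median"),
    "quantile 0.2": DummyRegressor(strategy="quantile", quantile=0.2),
    "constant 30": DummyRegressor(strategy="constant", constant=30),
}

for name, model in strategies.items():
    model.fit(X_train, y_train)
    # whatever the test features, the prediction is the same for every sample
    print(f"{name:>12}: {model.predict(X_test)}")
```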
As we can see, dummy regressors are pretty simple. They only use the y train set’s mean, median, or quantile – or even completely ignore it when using the constant strategy.
Most of the time, you might want to use the mean or median strategy to compare your model against, and only resort to the constant or quantile strategies if you have a good reason to do so. One reason to manually set a specific constant or quantile could be that you know or observe that, while still very simple, this strategy leads to a higher-scoring baseline than the mean or median dummy would offer. Another reason to use the quantile strategy is when doing quantile regression.
In the following example, we can easily see that the mean approach is not ideal, so we switch to a constant strategy for a better baseline:

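As a hedged sketch of how such a comparison could be done, one can cross-validate both dummies and keep the stronger one as the baseline (the data and the constant value below are made up):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = rng.exponential(scale=20, size=300)   # skewed target: the mean is a poor summary

candidates = {
    "mean": DummyRegressor(strategy="mean"),
    "constant=10": DummyRegressor(strategy="constant", constant=10),  # arbitrary choice
}

for name, model in candidates.items():
    # neg_mean_absolute_error: higher (closer to 0) is better
    scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5)
    print(f"{name}: {scores.mean():.2f}")
```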
Dummy Classifiers
Just like the `DummyRegressor`, the `DummyClassifier` acts like an estimator and can be created with different strategies. Again, the `DummyClassifier` learns only from the training target y and ignores the features X.
While a `DummyRegressor` exposes only a `predict` method, like all regressors do, the `DummyClassifier` exposes both a `predict` and a `predict_proba` method. The relation between the two methods is the same as for regular classifier estimators (like logistic regression or SVC):
- The estimator computes probabilities with `predict_proba`, so it returns a vector of probabilities.
- The `predict` method returns the class with the highest probability (for example, if the probability of a sample being male is the highest, the estimator will return "1" or "male" for that sample).
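A quick sketch of that relation, with made-up toy data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((6, 1))   # the features are irrelevant for a dummy model
y = np.array(["male", "male", "male", "male", "female", "female"])

clf = DummyClassifier(strategy="prior").fit(X, y)

proba = clf.predict_proba(X[:2])
print(clf.classes_)                            # ['female' 'male'] -> column order of proba
print(proba)                                   # [[0.333 0.667], [0.333 0.667]]
print(clf.predict(X[:2]))                      # ['male' 'male'] -> highest-probability class
print(clf.classes_[np.argmax(proba, axis=1)])  # same result, computed manually
```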
All in all, the different strategies follow that same pattern between `predict_proba` and `predict`; what the strategy changes is how those probabilities are generated. While reading through each strategy, notice that they don’t use the new sample values and only generate output based on the target they learned when fitting.
The possible strategies are:
- `stratified`: This strategy learns the training set class distribution to build a corresponding multinomial distribution. Think of a multinomial distribution as the distribution of the output of a loaded die: there is a specific probability for each possible outcome. For example, if the training set contains 100 samples with 30 being "orange", 30 being "apple", 20 being "banana", and 20 being "lemon", then the corresponding multinomial distribution is [0.3, 0.3, 0.2, 0.2]. Note that this strategy is non-deterministic (it uses a random state). Drawing a random sample from this distribution outputs a vector of 0s with a single 1 corresponding to the outcome (in the example above, [0, 0, 1, 0] would represent an output of "banana"). In other words, using the `predict` method is equivalent to randomly selecting one of the possible outputs with weights corresponding to the class distribution.
- `most_frequent`: This strategy always predicts the most frequent label in the training set. It acts kind of like the `stratified` strategy, except that all the probabilities are 0 except for the most frequent class, for which it is 1: [0, 0, 0, …, 0, 1, 0, …, 0], where the 1 corresponds to the most frequent label in the training target. Hence both methods return the same vector.
- `prior`: This strategy is a bit trickier to understand: `predict_proba` returns the vector of the "class prior", which means the class distribution vector, while `predict` simply predicts the most frequent label.
- `uniform`: This approach is similar to the stratified one, except it doesn’t respect the class distribution; the classes are assumed to be uniformly distributed. This strategy is non-deterministic (it uses a random state).
- `constant`: This strategy always predicts a constant label that is provided by the user. The constant target value provided must be present in the training data.

Let’s see an example of these dummy classifiers.
We’ll use the penguins dataset, where the target contains 3 possible species of penguins, not uniformly distributed (there are about 150 samples that belong to the "Adelie" species, 65 to the "Chinstrap" species, and 120 to the "Gentoo" species). We’ll look at both the `predict_proba` and `predict` results.
Let’s first analyze the results of the `predict` method.

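A sketch along these lines (assuming seaborn’s `load_dataset("penguins")`, which appears to match the dataset used here) counts the `predict` outputs of each strategy on the test set:

```python
import numpy as np
import seaborn as sns
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = penguins["species"]

# stratify on the target so both splits keep the original class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

for strategy in ["most_frequent", "prior", "stratified", "uniform", "constant"]:
    kwargs = {"constant": "Chinstrap"} if strategy == "constant" else {}
    clf = DummyClassifier(strategy=strategy, random_state=0, **kwargs)
    clf.fit(X_train, y_train)
    labels, counts = np.unique(clf.predict(X_test), return_counts=True)
    print(strategy, dict(zip(labels, counts)))
```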
On the first plot, we have the ground-truth distribution of the 3 classes in the test set. There are about 40 Adelie, 30 Gentoo, and 15 Chinstrap – note that we passed the target to the `stratify` parameter when splitting the input data, so the distribution of the 3 classes is preserved:
- The most-frequent strategy always predicted Adelie, since this was the most frequent class in the training target.
- The prior strategy also always predicted Adelie, for the same reason.
- The stratified strategy predicted more or less the same distribution. Remember that this strategy is randomized, so running the `predict` method again could lead to slightly different results (in other words: computing the prediction for a huge number of samples would approach the class distribution).
- The uniform strategy predicted more or less the same amount of each class. This strategy is also randomized; running the code again would lead to slightly different results.
- Finally, the constant strategy always predicts the same class (we used Chinstrap in the example above).
Let’s now review the `predict_proba` outputs.

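Again as a sketch under the same assumptions, we can print the first rows of `predict_proba` for each strategy:

```python
import seaborn as sns
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = penguins["species"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

for strategy in ["most_frequent", "prior", "stratified", "uniform", "constant"]:
    kwargs = {"constant": "Chinstrap"} if strategy == "constant" else {}
    clf = DummyClassifier(strategy=strategy, random_state=0, **kwargs).fit(X_train, y_train)
    print(strategy, clf.classes_)            # column order: Adelie, Chinstrap, Gentoo
    print(clf.predict_proba(X_test)[:2])     # probabilities for the first two test samples
```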
The following plot presents the predicted probabilities for all the test samples. Remember that the test samples do not matter in the prediction; the dummy model only returns what it already learned. For a given row, we hence have 3 values corresponding to the probabilities that the sample belongs to each class.
- For the most_frequent strategy, Adelie always has a probability of 1 and the 2 other classes have 0. This is coherent with the results of `predict`.
- For the prior strategy, the Adelie probability is always about 0.45, the Gentoo 0.35, and the Chinstrap 0.2.
- For the stratified strategy, a single probability is 1 and the others are 0. The selected class follows the class distribution, so on average Adelie is selected about 45% of the time.
- For uniform, the same probability is returned for all classes, 1/3 in this case.
- For the constant strategy, the same vector is always returned with 1 for the chosen class and 0 for the others.
Wrap up
Let’s recap what we should know about dummy models:
- They are simplistic models meant to serve as a baseline.
- They utilize only the target training set for learning.
- They exist for both classification and regression tasks, offering various strategies (such as mean or quantile).
- They are distinct from weak learners, which are simple models (often shallow decision trees) used as building blocks for more complex ensemble algorithms such as boosting.
You might like some of my other posts; make sure to check them out:
Fourier-transforms for time-series