
Creating benchmark models the scikit-learn way

Learn how to create a selection of benchmark models for both classification and regression problems

Photo source: pixabay (https://pixabay.com/photos/work-typing-computer-notebook-731198/)

What I really like about scikit-learn is that I often stumble upon functionalities I was not aware of before. My most recent "discovery" is the DummyClassifier. This dummy estimator does not learn any patterns from the features; instead, it uses simple heuristics (inferred from the targets) to calculate the predictions.

We can use that naïve estimator as a simple sanity check for our more advanced models. To pass the check, the considered model should result in better performance than the simple benchmark.

In this short article, I show how to use the DummyClassifier and explain the available heuristics.

Setup

We need to import the required libraries:
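The original snippet is not reproduced here; a plausible set of imports covering the rest of the walkthrough could look like this (the exact list is my assumption):

import numpy as np
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score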

For this article, I write a simple function printing a few selected evaluation metrics to assess the model’s performance. Aside from accuracy, I included metrics that help to evaluate the performance in case of a class imbalance (the list is by no means exhaustive).
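A sketch of what such a helper might look like; the name performance_summary and returning the metrics as a dictionary are my assumptions (chosen to match the outputs shown later):

def performance_summary(y_true, y_pred):
    # A few metrics that remain informative under class imbalance.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
    }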

Load data

For this article, I use the famous Iris dataset. To simplify the problem even further, I transform the multi-class problem into a binary classification, at the same time introducing class imbalance. The goal of the exercise will be to predict if a given plant belongs to the Versicolour class or "other" (Setosa or Virginica).
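A minimal sketch of the loading and relabeling step; treating Versicolour (class 1 in the Iris dataset) as the positive class is my assumption based on the description above:

X, y = load_iris(return_X_y=True)

# Versicolour (label 1) becomes the positive class, Setosa and Virginica become "other" (0).
y = (y == 1).astype(int)

Counter(y)  # Counter({0: 100, 1: 50}), i.e. a 2:1 class ratio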

After the transformation, the ratio of classes is 2:1.

The last step before training the models is to split the data into training and test sets using the train_test_split function:
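A possible version of the split; the 80/20 proportion and the random_state value are my assumptions:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)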

As we intentionally introduced class imbalance, we also pass the target to the stratify argument so that the 2:1 class ratio is preserved in both the training and test sets.

Exploring variants of the DummyClassifier

The DummyClassifier estimator offers a few possible rules (called strategies), which we can use for determining the benchmark class predictions. Below I briefly describe them and present the corresponding code snippets showing implementation.

The ‘constant’ strategy

Arguably the simplest variant of the dummy classifier. The idea is to replace all labels with a single value. A possible use case for this variant is when we want to evaluate a potential estimator in terms of the F1 Score, which is the harmonic mean of precision and recall.
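A sketch of this variant (the variable names are mine; constant=1 corresponds to the Versicolour class after the relabeling):

dummy_constant = DummyClassifier(strategy="constant", constant=1)
dummy_constant.fit(X_train, y_train)
performance_summary(y_test, dummy_constant.predict(X_test))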

I used the minority class (Versicolour) for creating the naïve prediction. It is also worth mentioning that the features do not play any role in determining the predicted value; they are only there to match scikit-learn’s fit + predict style. In the summary presented below, we see that we achieved perfect recall and an F1-Score of 0.5.

{'accuracy': 0.3333333333333333,
 'recall': 1.0,
 'precision': 0.3333333333333333,
 'f1_score': 0.5}

The ‘uniform’ strategy

In this variant, the naïve prediction is generated at random (uniformly) from the available classes.
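A possible implementation, assuming the same helper and split as before (random_state fixed for reproducibility):

dummy_uniform = DummyClassifier(strategy="uniform", random_state=42)
dummy_uniform.fit(X_train, y_train)
y_pred_uniform = dummy_uniform.predict(X_test)
performance_summary(y_test, y_pred_uniform)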

Running the code results in the following summary:

{'accuracy': 0.4,
 'recall': 0.4,
 'precision': 0.25,
 'f1_score': 0.3076923076923077}

Additionally, we inspect the distribution of the predicted labels using Counter:
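Assuming the predictions are stored in y_pred_uniform, as in the sketch above:

Counter(y_pred_uniform)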

Counter({0: 14, 1: 16})

As expected from a uniform draw, the predicted labels are split roughly evenly between the two classes and do not reflect the true distribution of the labels.

The ‘stratified’ strategy

To account for the previously mentioned drawback, we can use the stratified rule. The predictions are still generated randomly, but the class distribution observed in the training set is preserved, just as in a stratified train_test_split.
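A possible implementation, analogous to the previous ones:

dummy_stratified = DummyClassifier(strategy="stratified", random_state=42)
dummy_stratified.fit(X_train, y_train)
y_pred_stratified = dummy_stratified.predict(X_test)
performance_summary(y_test, y_pred_stratified)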

Running the code results in the following summary:

{'accuracy': 0.6333333333333333,
 'recall': 0.3,
 'precision': 0.42857142857142855,
 'f1_score': 0.3529411764705882}

Once again, we look at the number of observations in each class:
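Again assuming the variable names from the sketch above:

Counter(y_pred_stratified)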

Counter({0: 23, 1: 7})

Using the stratified strategy results in a distribution similar to the one we saw in the observed values.

The ‘most_frequent’ strategy

The name of the strategy is pretty much self-explanatory – the predicted value is the most frequent value among the labels.
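A possible implementation (no random_state is needed here, as the prediction is deterministic):

dummy_most_frequent = DummyClassifier(strategy="most_frequent")
dummy_most_frequent.fit(X_train, y_train)
performance_summary(y_test, dummy_most_frequent.predict(X_test))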

Running the code results in the following summary:

{'accuracy': 0.6666666666666666,
 'recall': 0.0,
 'precision': 0.0,
 'f1_score': 0.0}

While running the code, we receive the following warning: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. The reason is the strategy we chose: as the majority class was used for all of the predictions, no observation was assigned to the minority class, so precision is undefined (and set to 0.0), while recall is 0 because none of the actual minority-class observations were identified.

The ‘prior’ strategy

A very similar strategy to the 'most_frequent' strategy. The dummy classifier always predicts the class that maximizes the class prior. The difference lies in the predict_proba method of the fitted classifier. In the case of the 'prior' strategy, it returns the class prior (class probabilities as determined by the ratio of the labels in the training set).
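A small sketch illustrating that difference (my own example; the exact prior reflects the class ratio in the training set):

dummy_prior = DummyClassifier(strategy="prior")
dummy_prior.fit(X_train, y_train)

# Each row contains the empirical class prior, roughly [0.67, 0.33] for the 2:1 split.
dummy_prior.predict_proba(X_test[:3])

# The 'most_frequent' strategy instead returns a degenerate distribution,
# assigning probability 1 to the majority class.
dummy_most_frequent.predict_proba(X_test[:3])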

The results of the 'prior' and 'most_frequent' strategies are the same, so I do not show them again for brevity.

It is worth mentioning that for all strategies the predict method completely ignores the input data; the values used as predictions were determined during the fitting stage (or provided by us when using the 'constant' strategy).

Training an actual estimator

Having experimented with multiple strategies for creating a benchmark model, it is time to train a simple classifier and evaluate if it outperforms the benchmarks. For that, I use the Decision Tree Classifier with default settings (aside from the random_state, specified for reproducibility).
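A possible version of the training code (the random_state value is my choice):

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
performance_summary(y_test, tree.predict(X_test))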

By inspecting the summary presented below, we can clearly state that the decision tree outperformed all of the benchmarks.

{'accuracy': 0.9473684210526315,
 'recall': 0.9230769230769231,
 'precision': 0.9230769230769231,
 'f1_score': 0.9230769230769231}

DummyRegressor

I already showed how to use scikit-learn’s DummyClassifier to estimate benchmark models for classification tasks. Naturally, there exists an analogous estimator for regression problems – DummyRegressor. I will not go into much detail, as the usage is very similar to the DummyClassifier. For completeness, I only mention the available strategies (a minimal usage sketch follows the list):

  • 'mean' – the estimator uses the average of the target (from the training set) as predictions
  • 'median' – the estimator uses the median of the target (training set) as predictions
  • 'constant' – the estimator uses a constant value as predictions
  • 'quantile' – the estimator uses the specified quantile of the targets (training set) as predictions
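For illustration only, a minimal usage sketch on synthetic data (the data, the 'quantile' strategy, and the 0.9 quantile value are all my assumptions):

rng = np.random.default_rng(42)
X_reg = rng.normal(size=(100, 3))
y_reg = rng.normal(loc=10, size=100)

# Always predicts the 0.9 quantile of the training target, regardless of the features.
dummy_reg = DummyRegressor(strategy="quantile", quantile=0.9)
dummy_reg.fit(X_reg, y_reg)
dummy_reg.predict(X_reg[:3])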

Similarly to the DummyClassifier, the predict method ignores the input data for calculating the predictions.

Conclusions

In this article, I showed how to use estimators available in scikit-learn for creating benchmark models using simple heuristics. We can use such models for sanity checks, to see if our estimators perform better than naïve baselines. If not, we should investigate what causes the problem (maybe the features are not helpful or the class imbalance is impacting the results).

You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.


I recently published a book on using Python for solving practical tasks in the financial domain. If you are interested, I posted an article introducing the contents of the book. You can get the book on Amazon or Packt’s website.

