
New features in scikit-learn

Overview of the latest developments in version 0.23

Photo by Jungwoo Hong on Unsplash

If you are into data science and work with Python, then scikit-learn is probably on your speed dial. You might want to update this contact.

Two months ago, scikit-learn released version 0.23, introducing many exciting features and improvements. This version requires Python 3.6 or newer. This post will guide you through some of the interesting additions. Later, I will also provide a sneak peek into some upcoming features of the next release, 0.24, which is currently in development.


Major Features

1) New Regression Models

Scikit-learn has introduced the following three new regressors:

a) Poisson Regressor b) Gamma Regressor c) Tweedie Regressor

from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import GammaRegressor
from sklearn.linear_model import TweedieRegressor

All three are categorized as generalized linear models (GLMs) and support non-normal loss functions. They are useful for modeling situations where the error follows a distribution other than the normal distribution, e.g., a Poisson or Gamma distribution, or is always positive, as nicely explained by Kida.

Which distribution to choose depends highly on the use case, e.g., whether the target variable is strictly positive and how heavy-tailed its distribution is. A few examples of potential applications are risk modeling, climate event prediction, and insurance pricing.
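
Below is a minimal sketch of fitting one of the new GLMs. The synthetic feature matrix, the coefficients, and the alpha/max_iter settings are all illustrative assumptions, not part of the release notes.

import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 2))                    # two arbitrary features
y = rng.poisson(lam=np.exp(X @ [1.0, 2.0]))       # non-negative counts, hence a Poisson-style target

glm = PoissonRegressor(alpha=1e-3, max_iter=300)  # alpha is the L2 penalty strength
glm.fit(X, y)
print(glm.coef_, glm.intercept_)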

2) Stable and faster KMeans estimator

This well-known algorithm, which separates samples into a pre-defined number of clusters of equal variance, is now better optimized and scales to large sample counts better than in pre-0.23 versions.
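
The API itself is unchanged; here is a minimal sketch on synthetic blobs (the sample size and cluster count below are arbitrary):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=4, random_state=0)
km = KMeans(n_clusters=4, random_state=0).fit(X)
print(km.cluster_centers_)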

3) Improved Histogram-based Gradient Boosting estimators

The improvements are two-fold. Firstly, both the regressor and the classifier, i.e., HistGradientBoostingRegressor and HistGradientBoostingClassifier, support sample weights during model training, as shown here.
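
A minimal sketch of the sample-weight support, assuming a toy regression dataset and arbitrary per-sample weights (note that in 0.23 the estimator is still experimental and needs the explicit enable import):

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: still required in 0.23
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)
weights = rng.uniform(0.5, 1.5, size=200)   # one illustrative weight per sample

model = HistGradientBoostingRegressor()
model.fit(X, y, sample_weight=weights)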

Secondly, if known a priori, you can now specify the monotonic constraints that the features have on the response/target variable. For instance, when predicting house prices, a larger bedroom area is likely to have a positive impact on the price (constraint value = 1), whereas the distance from the city center is likely to have a negative impact (constraint value = -1). A value of 0 represents an unconstrained feature.
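
A minimal sketch of the monotonic constraints, reusing the house-price intuition above; the feature order and the X_houses/y_prices names are hypothetical:

from sklearn.experimental import enable_hist_gradient_boosting  # noqa: still required in 0.23
from sklearn.ensemble import HistGradientBoostingRegressor

# feature 0 = bedroom area (forced positive effect), feature 1 = distance from city center (forced negative effect)
model = HistGradientBoostingRegressor(monotonic_cst=[1, -1])
# model.fit(X_houses, y_prices)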

4) Improved Lasso and Elastic Net

The two well-known linear regressors now support sample weights during model training. In practice, the fit() function accepts a new argument, sample_weight, which takes an array or a list holding one weight per sample in the dataset.
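
A minimal sketch with Lasso (ElasticNet works the same way); the synthetic data and the chosen alpha are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))
y = X @ [1.0, 0.0, -2.0, 0.5] + rng.normal(scale=0.1, size=50)
weights = rng.uniform(0.1, 1.0, size=50)   # one weight per sample

lasso = Lasso(alpha=0.01)
lasso.fit(X, y, sample_weight=weights)     # sample_weight is new in 0.23
print(lasso.coef_)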

5) Interactive HTML visualization of pipelines and estimators

If you are modeling in a Jupyter Notebook, you can now interactively visualize the summarized workflow of your model pipelines and estimators. This requires invoking the display='diagram' option. The interactive nature of this functionality allows you to hover over a certain estimator and expand it for more details. Check out this cool, official demonstration.
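
A minimal sketch of switching the diagram view on; the two-step pipeline is just an example:

from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

set_config(display='diagram')   # enable the interactive HTML representation

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe   # displaying the object in a notebook cell renders the diagram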

Other Interesting Features

1) Loading in-built datasets as DataFrames

Scikit-learn offers several in-built datasets to work with, e.g., load_iris, load_digits, load_breast_cancer, load_diabetes, load_wine, load_linnerud, and fetch_california_housing. Now you can load these seven datasets as pandas DataFrames using the keyword argument as_frame=True.

Earlier, these embedded datasets were loaded as a sklearn.utils.Bunch type.

from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
df = data.frame
# 0.23 onwards, type(df) returns pandas.core.frame.DataFrame

2) Drop selected categories during One Hot Encoding

The standard routine for one-hot encoding categorical features, preprocessing.OneHotEncoder, now makes it possible to drop the first category of each binary feature (a feature having exactly two categories). This is implemented with the flag drop='if_binary'. Features with only one category or with more than two categories remain unaffected by this flag. The latest version also has a more efficient implementation of OneHotEncoder.
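
A minimal sketch, assuming a toy dataset with one binary feature and one three-category feature:

from sklearn.preprocessing import OneHotEncoder

X = [['male', 'london'],
     ['female', 'paris'],
     ['female', 'berlin']]

enc = OneHotEncoder(drop='if_binary', sparse=False)
print(enc.fit_transform(X))   # the binary first column keeps one indicator, the city column keeps all three
print(enc.categories_)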

3) Centers of Gaussian Blobs

While generating isotropic Gaussian clusters (blobs), you now have access to the centers of each cluster via the return_centers=True argument.

from sklearn.datasets import make_blobs
X, y, centers = make_blobs(n_samples=20, centers=3, n_features=2,
                  random_state=0, return_centers=True)

Upcoming changes in v0.24

Soon, the next version 0.24, currently under development, will be released. The following are some interesting features you can look forward to.

1) Inverse transformation of the imputed values

The SimpleImputer from sklearn.impute allows the imputation of missing data in a dataset. Soon, it will be possible to convert the imputed data back to its original state. To do so, one needs to set the flag add_indicator=True when constructing the SimpleImputer() and then use the inverse_transform function. Read more here on this cool feature.
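
A minimal sketch of the planned workflow (based on the 0.24 development version, so subject to change); the toy array is made up for illustration:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

imp = SimpleImputer(strategy='mean', add_indicator=True)  # the indicator is needed to undo the imputation
X_imputed = imp.fit_transform(X)                 # missing entries filled with column means
X_restored = imp.inverse_transform(X_imputed)    # NaNs put back where they originally were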

2) Mean Absolute Percentage Error (MAPE)

This will be a new evaluation metric available for regression problems. Such a metric is insensitive to the global scaling of the target variable. It serves as a fair measure of the error in cases where the data spans several orders of magnitude, because it computes the error relative to the true values, as shown in the code snippet below. For the same example, the mean_absolute_error would be roughly 66668.4, dominated entirely by the largest target.

Mean Absolute Percentage Error: MAPE = (1/n) * Σ_i |y_i - y_hat_i| / max(ϵ, |y_i|), where ϵ is an arbitrarily small positive number (for example, 1e-6) and y and y_hat represent the true and predicted values.
# Implementation in upcoming version 0.24
from sklearn.metrics import mean_absolute_percentage_error
import numpy as np

y_true = np.array([1, 10, 1e6])
y_pred = np.array([0.9, 15, 1.2e6])

mape = mean_absolute_percentage_error(y_true, y_pred)
# mape -> 0.2666...

# ============ My implementation below ============
eps = 1e-6
dev = [np.abs(t - p) / max(eps, np.abs(t)) for t, p in zip(y_true, y_pred)]
mape = sum(dev) / len(dev)
# mape -> 0.2666...

3) Optional color bar in confusion matrix plot

The color bar will now be optional when plotting the confusion matrix. This removes the need for a workaround to hide the color bar afterward if it is not wanted.

plot_confusion_matrix(estimator, X, y, colorbar=False)

This brings me to the end of this article. After you have updated your scikit-learn and want to visualize data in Python, you can follow my recent article on what’s new in Matplotlib 3 here.

Happy Machine Learning!

