Scikit-Learn is a staple machine learning package for data scientists using a Python environment. The package offers many useful APIs to use in our everyday work.
In May 2022, Scikit-Learn released version 1.1.0, which brings various exciting feature updates. What are they? Let’s get into it.
Before we go any further, please update Scikit-Learn to the latest version:
pip install --upgrade scikit-learn
BisectingKMeans
BisectingKMeans is a new variant of the unsupervised K-Means algorithm in Scikit-Learn. The method implements a simple divisive hierarchical scheme during the clustering process.
In normal K-Means, clustering happens by creating K centroids simultaneously. The resulting clusters are evaluated by calculating both intracluster and intercluster similarities, and each cluster’s centroid acts as its center of gravity.
In Bisecting K-Means, however, we do not create the K centroids simultaneously. Instead, centroids are picked progressively based on the previous clusters: a cluster is split in two at each step until the number K is reached.
There are a few advantages to using Bisecting K-Means, including:
- It would be more efficient with a large number of clusters
- Cheaper computational costs
- It does not produce empty clusters
- The clustering result is well ordered and creates a visible hierarchy
Let’s try a simple comparison between normal K-Means and Bisecting K-Means. I will use a sample dataset from the seaborn package to illustrate the result.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, BisectingKMeans
# Load the mpg dataset and drop rows with missing values
mpg = sns.load_dataset('mpg')
mpg = mpg.dropna().reset_index(drop=True)
# Cluster on two numerical features with both algorithms
X = np.array(mpg[['mpg', 'acceleration']])
km = KMeans(n_clusters=5, random_state=0).fit(X)
bisect_km = BisectingKMeans(n_clusters=5, random_state=0).fit(X)
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].scatter(X[:, 0], X[:, 1], s=10, c=km.labels_)
ax[0].scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=20, c="r")
ax[0].set_title("KMeans")
ax[1].scatter(X[:, 0], X[:, 1], s=10, c=bisect_km.labels_)
ax[1].scatter(
    bisect_km.cluster_centers_[:, 0], bisect_km.cluster_centers_[:, 1], s=20, c="r"
)
_ = ax[1].set_title("BisectingKMeans")

As shown in the plot above, Bisecting K-Means efficiently creates a clearly visible cluster for the data points in the furthest part of the feature space.
Quantile Loss Function Modeling with HistGradientBoostingRegressor
HistGradientBoostingRegressor in Scikit-Learn is a gradient boosting regressor, an ensemble tree model, that uses histogram-based learning.
The histogram-based model is more efficient than the normal Gradient Boosting Regressor because the algorithm bins the continuous features into discrete bins during training instead of using the usual splitting techniques.
According to the documentation, the HistGradientBoostingRegressor model is suitable for big datasets with more than 10,000 samples.
In the recent update, HistGradientBoostingRegressor gains a new quantile loss function. The quantile loss lets the model predict a likely range, which we call the prediction interval. You can read further about quantile regression modeling in this article.
With the additional quantile loss function, we can now pass loss="quantile" and use the new quantile parameter. Using the tutorial below, we can track where the regression prediction lands for each quantile.
from sklearn.ensemble import HistGradientBoostingRegressor
import numpy as np
import matplotlib.pyplot as plt
# Simple regression function for X * cos(X)
rng = np.random.RandomState(55)
X_1d = np.linspace(0, 10, num=2000)
X = X_1d.reshape(-1, 1)
y = X_1d * np.cos(X_1d) + rng.normal(scale=X_1d / 3)
quantiles = [0.9, 0.5, 0.1]
parameters = dict(loss="quantile", max_bins=32, max_iter=50)
hist_quantiles = {
    f"quantile={quantile:.2f}": HistGradientBoostingRegressor(
        **parameters, quantile=quantile
    ).fit(X, y)
    for quantile in quantiles
}
fig, ax = plt.subplots()
ax.plot(X_1d, y, "o", alpha=0.5, markersize=1)
for quantile, hist in hist_quantiles.items():
    ax.plot(X_1d, hist.predict(X), label=quantile)
_ = ax.legend(loc="lower left")
plt.title('Sample HistGradientBoostingRegressor with Quantile Loss')

Infrequent Categories in OneHotEncoder and Feature Names in All Transformers
One-Hot Encoding is a common preprocessing step applied to categorical features to produce numerical features. Using Scikit-Learn’s OneHotEncoder, we can build transformers based on our data and use them in a production environment. If you have never heard of One-Hot Encoding, you can read all about it in the article below.
In the newest update, Scikit-Learn adds a new min_frequency parameter to OneHotEncoder to group all the rare values together instead of creating a new numerical feature for each rare value. Let’s try it with a sample dataset. I will use the tips dataset from Seaborn.
import seaborn as sns
tips = sns.load_dataset('tips')
# Inspect how often each size value occurs
tips['size'].value_counts()

As we can see from the output above, the size values 1, 5, and 6 are infrequent compared to the others. In this case, I want to group them together. We can do that with the following code.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Any value with a frequency below 6 is considered rare
enc = OneHotEncoder(min_frequency=6, sparse=False).fit(np.array(tips['size']).reshape(-1, 1))
enc.infrequent_categories_

The infrequent categories are what we expected. If we try to transform the categorical values into numerical columns, this is what we get.
import pandas as pd
encoded = enc.transform(np.array([[1], [2], [3], [4], [5], [6]]))
pd.DataFrame(encoded, columns=enc.get_feature_names_out())

All the infrequent categories are now considered as a single feature. This helps us minimize the number of features created and avoid the curse of dimensionality.
Additionally, all the transformers in Scikit-Learn now allow us to get the names of the output features. The get_feature_names_out method provides the string names for each column in the transformer’s output.
enc.get_feature_names_out()

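Since get_feature_names_out is now available on every transformer, the same call works outside OneHotEncoder as well. As a minimal sketch, assuming we reuse the tips DataFrame and pick the total_bill and tip columns purely for illustration:
from sklearn.preprocessing import StandardScaler
# get_feature_names_out also works on other transformers, e.g. StandardScaler
scaler = StandardScaler().fit(tips[['total_bill', 'tip']])
scaler.get_feature_names_out()
Because the scaler was fitted on a DataFrame, the output feature names simply echo the input column names.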
Additional Parameters for Feature Selection
Scikit-Learn has recently added the new n_features_to_select='auto' option to the SequentialFeatureSelector transformer, and now allows passing a callable to the max_features parameter of the SelectFromModel transformer.
SequentialFeatureSelector is a greedy search algorithm that performs forward or backward feature selection to form a feature subset. At each iteration, it adds or removes a feature based on the estimator’s cross-validation score.
With n_features_to_select='auto', the selection ends automatically when the score improvement does not exceed the tol parameter. If tol is None, half of the features are selected.
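As a minimal sketch of the new option, assuming the diabetes toy dataset and a Ridge estimator purely for illustration:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge
X, y = load_diabetes(return_X_y=True)
# Keep adding features until the cross-validation score improvement drops below tol
sfs = SequentialFeatureSelector(Ridge(), n_features_to_select='auto', tol=0.01)
sfs.fit(X, y)
sfs.get_support()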
In the SelectFromModel transformer, if we pass a callable to the max_features parameter, the maximum number of features allowed is the output of max_features(X).
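Here is a minimal sketch of the callable option, again assuming the diabetes data and a Lasso estimator just for illustration:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
X, y = load_diabetes(return_X_y=True)
# The callable receives X and returns the maximum number of features allowed,
# here at most half of the input features
selector = SelectFromModel(Lasso(alpha=0.01), max_features=lambda X: X.shape[1] // 2)
selector.fit(X, y)
selector.get_support()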
MiniBatchNMF
Non-Negative Matrix Factorization (NMF) is a multivariate analysis algorithm for dimensionality reduction and feature extraction. The method is commonly used in NLP, recommendation engines, facial recognition, etc. You can read more about NMF here.
MiniBatchNMF is an online way to optimize NMF: it divides the data into mini-batches and optimizes the NMF model by cycling over them. The model is suitable for big datasets, as the process is faster, at the cost of somewhat less accurate results. You can refer to the tutorial here.
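As a minimal sketch, assuming a randomly generated non-negative matrix purely for illustration:
import numpy as np
from sklearn.decomposition import MiniBatchNMF
rng = np.random.RandomState(0)
X = np.abs(rng.randn(1000, 20))  # NMF requires non-negative input
# Optimize NMF by cycling over mini-batches of the data
mb_nmf = MiniBatchNMF(n_components=5, batch_size=256, random_state=0)
W = mb_nmf.fit_transform(X)  # reduced representation, shape (1000, 5)
H = mb_nmf.components_       # factor matrix, shape (5, 20)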
Conclusion
Scikit-Learn has recently updated the package to version 1.1.0, and these are the highlights:
- New BisectingKMeans algorithm
- New Quantile Loss in HistGradientBoostingRegressor
- Infrequent Categories in OneHotEncoder
- Get the names of the features from the transformers
- Additional Parameters for Feature Selection
- New MiniBatchNMF algorithm
I hope it helps!
Visit me on my Social Media to have more in-depth conversation or any questions.
If you are not subscribed as a Medium Member, please consider subscribing through my referral.