
The best imports from sklearn

Learn what to import and when from this amazing Python library

Photo by Alex Knight on Unsplash

Scikit-Learn

Scikit-learn, or just sklearn for those familiar with it, is probably the main package for modeling data in Python.

Sklearn was started in 2007 by David Cournapeau as a Google Summer of Code project. Since then, a lot has evolved and, believe it or not, version 1.0 was released only in December 2021! Of course, it was already delivering great results everywhere long before that date.

Anyway, sklearn is a library that handles not only unsupervised learning, like clustering, and supervised learning, like regression and classification, but also all the other components that surround a data science project. Using sklearn, we have access to pre-processing tools, such as scaling and normalization; model selection tools, like k-fold, grid search, and cross-validation; the algorithms to create models, of course; and tools to check metrics, like the confusion matrix, for instance.

What I’d like to share with you in this post is a selection of modules to import when you’re using Scikit Learn, so you can use this content as a quick reference when building a model. Let’s see them.

Pre-processing Data

Modeling data is not just loading sklearn and running the data through the algorithm. It requires much more time processing the data, so you can provide a good input to the model. For that purpose, you can use the following modules.

Min-Max Scaler

This tool will normalize the data based on the maximum and minimum values of the variable. The max becomes 1, the min becomes 0, and everything in between is scaled proportionally between those two.

from sklearn.preprocessing import MinMaxScaler
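
For instance, a minimal sketch with toy values (the numbers are just illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One variable ranging from 10 to 50
X = np.array([[10.0], [20.0], [50.0]])
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())  # [0.   0.25 1.  ] -> min is 0, max is 1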

Standard Scaler

StandardScaler will standardize the variable to mean = 0 and standard deviation = 1. It does not change the shape of the data, though, meaning it does not transform the data into a normal distribution.

from sklearn.preprocessing import StandardScaler
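
A quick sketch of what that looks like (toy values again):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(), X_std.std())  # ~0.0 and 1.0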

Imputers

Imputers are used when there are too many missing values and we want to use some imputation technique to estimate values that are currently NA. The main modules for that are the following.

from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
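
As an illustration, here is SimpleImputer filling NAs with the column mean (toy data):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
# The NaN in the first column becomes (1 + 7) / 2 = 4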

Feature Selection

If the challenge is to look for the best features for a model, there are many possibilities, and scikit-learn offers these and more.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif, f_regression, chi2
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import RFE

In summary:

[In] Numeric ; [Out] Numeric:

SelectKBest(score_func=f_regression, k=n_best)

SelectKBest(score_func=mutual_info_regression, k=n_best)

[In] Numeric ; [Out] Categorical:

SelectKBest(score_func=f_classif, k=n_best)

[In] Categorical ; [Out] Categorical:

SelectKBest(score_func=chi2, k=n_best)

SelectKBest(score_func=mutual_info_classif, k=n_best)

Here, n_best is the number of top attributes you want to keep.
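
To make it concrete, a minimal sketch using the iris dataset (numeric inputs, categorical output):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)
print(X_best.shape)  # (150, 2): only the 2 highest-scoring features remain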

Model Selection

After going over pre-processing and feature selection, there will come the time to choose a model.

Train Test Split

Certainly we will need to split the data into training and test sets. Given the explanatory (X) and explained (y) variables, we use the train test split for that.

from sklearn.model_selection import train_test_split
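
A typical call looks like this (iris is just a stand-in for your own X and y):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)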

Cross Validation and Folds

There are many ways to cross-validate data. The most common is using a K-fold, where you split your data into K parts and each part is used once as the test set while the others are used for training. For example, if we fold one set in 3: parts 1 and 2 train and part 3 tests; then parts 1 and 3 train and part 2 tests; lastly, parts 2 and 3 train and part 1 tests.

from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.model_selection import KFold
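
For example, a 3-fold cross-validation sketch (logistic regression here is just an illustrative estimator):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())  # average accuracy across the 3 folds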

In the scikit-learn documentation, you can see more options, like LeaveOneOut, StratifiedKFold, ShuffleSplit, etc.

Model Tuning

To tune a model, sklearn provides us with two amazing options: grid search and random search. Using these, it is possible to test many combinations of parameters for the model, taking the best result as the best estimator to move on to predictions.

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
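
A minimal grid search sketch (the parameter grid here is just an example):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
params = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=3)
search.fit(X, y)
print(search.best_params_)      # best combination found
model = search.best_estimator_  # refit with those parameters, ready to predict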

Estimators

The estimators are the algorithms available for us to use. Scikit learn has a number of them.

Unsupervised: Clustering

Unsupervised learning is when we don’t provide a label for prediction. The algorithm will look for patterns and group the data points with no supervision, meaning there is no "right or wrong" answer.

# Clustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
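
For instance, a KMeans sketch on synthetic blobs (note that no labels are given to the algorithm):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # a cluster id for each data point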

Classification

Classification models will understand the patterns from a dataset and what is the associated label or group. Then they can classify new data based on those patterns. The most used ones are the ensemble models, such as Random Forest or Gradient Boosting. There are also simpler ones, like Decision Trees, Logistic Regression, and K-Nearest Neighbors.

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
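
A minimal fit/predict sketch with a Random Forest (iris again as a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)    # learn the patterns
preds = clf.predict(X_test)  # classify new data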

Regression

Regression problems are those where you need to return a number as the output; classic examples are predicting car and house prices. The most used models in this case are the linear models, and there are regularized options, like Ridge or Lasso. For non-linear relationships, the tree-based models can also be used.

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
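
As an example, a linear regression on toy data that follows y = 2x + 1:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)  # ~[2.] and ~1.0
print(reg.predict([[5.0]]))       # ~[11.]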

Metrics

Finally, we can evaluate the models using sklearn’s metrics module. The most used ones are listed next.

Classification metrics

from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc
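
A quick sketch with made-up predictions:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))    # 0.8
print(f1_score(y_true, y_pred))          # 0.8
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted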

Regression Metrics

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
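
And a similar sketch for regression (the values are illustrative):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(mean_squared_error(y_true, y_pred))   # ~0.417
print(r2_score(y_true, y_pred))             # ~0.84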

Before You Go

Scikit learn is a tremendous helper for data scientists. There is so much more that you can explore and learn in their documentation.

Here, we covered only a few imports that are commonly used for Data Science; however, if you visit their page, you can see many more.

The nice thing about their documentation is that it is very clear and well organized. Additionally, many times it gives you a good depth of knowledge about the problem solved by that algorithm or function.

If you liked this content, follow my blog for more.

Gustavo Santos – Medium

Reference

API Reference. scikit-learn documentation. https://scikit-learn.org/stable/modules/classes.html

scikit-learn. Wikipedia. https://en.wikipedia.org/wiki/Scikit-learn

