In a typical machine learning task, the features are unlikely to arrive in the most suitable format for a model. Thus, we usually need to preprocess the features before training.
Common preprocessing operations include handling missing values, scaling the numerical features, and encoding the categorical features. All of these operations can be viewed as transformations of the features.
The pipeline module of Scikit-learn is a tool that makes preprocessing simple and easy by chaining the transformations in a pipe. It is important to note that the intermediate steps in a pipeline must be transformers; that is, they need to implement both the fit and transform methods.
In this article, we will be creating a pipeline to transform features for a machine learning model.
We will first create a sample dataframe that includes both numerical and categorical features.
import numpy as np
import pandas as pd

# features
col1 = np.random.randint(10, size=100)
col2 = np.random.randint(5, size=100)
col3 = np.random.random(100)
# reset the index so the sampled values get a clean 0-99 index
# that aligns with the other columns
col4 = pd.Series(['a', 'b', 'c']).sample(100, replace=True).reset_index(drop=True)

# target
target = np.random.randint(2, size=100)

# dataframe
df = pd.DataFrame(
    {
        'col1': col1,
        'col2': col2,
        'col3': col3,
        'col4': col4,
        'target': target
    }
)
We can display the first 5 rows of the sample dataframe with the head method:
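df.head()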
It contains 1 categorical and 3 numerical features. The target variable has 2 classes, so this is a binary classification task.
Let us also add some missing values.
df.iloc[[1,5,8,15,54,62],:-1] = np.nan
df.isna().sum()
col1 6
col2 6
col3 6
col4 6
target 0
dtype: int64
We have added 6 missing values to each of the feature columns, leaving the target untouched.
The sample dataframe is ready. We can start to create the pipelines.
The numerical and categorical features require different kinds of preprocessing techniques. Thus, we will create separate pipelines for numerical and categorical features and then combine them in another transformer. The final pipeline will include the combined transformer and a machine learning model.
Let us start by creating the base pipelines.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

num_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', MinMaxScaler())
    ]
)
The transformations are passed to the steps parameter of Pipeline as a list of tuples. Each tuple contains a name and a transformer object.
The num_transformer fills the missing values with the mean value of a column (SimpleImputer) and scales the values between 0 and 1 (MinMaxScaler).
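As a quick sanity check, which is not part of the original walkthrough, we can fit and apply the numeric pipeline on the numeric columns alone:
# imputes the NaNs with column means and scales each column to [0, 1]
num_transformer.fit_transform(df[['col1', 'col2', 'col3']])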
cat_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(drop='first'))
    ]
)
The cat_transformer fills the missing values with the most frequent value of a column and encodes the categories using the OneHotEncoder. The drop='first' parameter drops the first category of each feature to avoid creating redundant columns.
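Similarly, the categorical pipeline can be tried on its own; this check is not part of the original walkthrough. The OneHotEncoder returns a sparse matrix by default, so toarray is used here only for inspection:
# imputes missing categories with the mode, then one-hot encodes them
cat_transformer.fit_transform(df[['col4']]).toarray()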
We now have the transformers, but they are not applied to the features yet. For this task, we will use the ColumnTransformer class of Scikit-learn.
Before creating the column transformer object, we need to create two lists that contain the numerical and categorical features. One way of doing this is through the "select_dtypes" method of Pandas.
num_ft = df.iloc[:, :-1].select_dtypes(include=['int64', 'float64']).columns
cat_ft = df.iloc[:, :-1].select_dtypes(include=['object']).columns
Please keep in mind that the categorical features might also have the "category" data type in addition to "object".
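To cover both cases, we can simply extend the include list:
cat_ft = df.iloc[:, :-1].select_dtypes(include=['object', 'category']).columns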
The next step is to create the column transformer.
from sklearn.compose import ColumnTransformer

preprocess = ColumnTransformer(
    transformers=[
        ('numeric', num_transformer, num_ft),
        ('categorical', cat_transformer, cat_ft)
    ]
)
We pass the operations to the transformers parameter as a list of tuples. In our case, the operations are the pipelines we created previously.
Each tuple contains a name, the transformer, and a list of features that will be transformed by that transformer.
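Before attaching a model, we can verify that the combined transformer works end to end; this quick check is not in the original article:
# fit the transformers and apply them to the feature columns
preprocess.fit_transform(df.drop('target', axis=1)).shape
# expected: (100, 5) -> 3 scaled numeric columns plus 2 one-hot columns
# (3 categories with the first one dropped), assuming all three
# categories appear in the random sample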
The next step is to combine the column transformer we have just created and a machine learning model. We will implement a new pipeline for this operation.
from sklearn.linear_model import LogisticRegression

clf = Pipeline(
    steps=[
        ('preprocess', preprocess),
        ('model', LogisticRegression())
    ]
)
Let us analyze this new pipeline by breaking it into parts.
The "clf" is a pipeline that contains a transformer called "preprocess" and a logistic regression model.
The "preprocess" transformer contains two pipelines which are "num_transformer" and "cat_transformer".
We can now directly call the fit method on the "clf" pipeline. Before training the clf pipeline, we will split the dataset into train and test subsets.
from sklearn.model_selection import train_test_split
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The next step is calling the fit method on clf.
clf.fit(X_train, y_train)
We now have a pipeline that contains all the feature preprocessing steps and a trained model. It can be used to make predictions on the test set. Furthermore, it behaves just like any other Scikit-learn model, so we can call the same methods on it, and the pipeline can also be used in cross validation.
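Here is a minimal sketch of these points; the methods used are the standard Scikit-learn estimator API:
# make predictions on the held-out test set
predictions = clf.predict(X_test)

# mean accuracy on the test set
accuracy = clf.score(X_test, y_test)

# the whole pipeline (preprocessing + model) can be cross validated,
# which keeps the imputer and scaler from seeing the validation folds
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5)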
Conclusion
The main advantage of using pipelines is that they simplify preprocessing by combining many different operations in a single object.
We can use pipelines for model selection as well. For instance, the clf pipeline we have created can be used to try different models. We can iterate over a list of models and use each model in the pipeline.
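As a sketch of this idea, we can loop over a few candidate models and reuse the same preprocessing; the particular models below are arbitrary choices, not part of the original example:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = [LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier()]
for model in models:
    # same preprocessing, different model in the final step
    candidate = Pipeline(steps=[('preprocess', preprocess), ('model', model)])
    candidate.fit(X_train, y_train)
    print(type(model).__name__, candidate.score(X_test, y_test))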
Thank you for reading. Please let me know if you have any feedback.