Table of Contents
1. Introduction
2. Pipeline
3. Pipeline with Grid Search
4. Pipeline with ColumnTransformer, GridSearchCV
5. Pipeline with Feature Selection

1. Introduction
Previous articles covered preparing the dataset for the algorithm, designing the model, and tuning the algorithm's hyperparameters, all of which are at the developer's discretion, in order to generalize the model and reach the optimum accuracy. As we know, the developer has alternative solutions at their disposal for data preprocessing, model preprocessing, and hyperparameter tuning, and is responsible for applying the most appropriate combination to keep the project optimal in terms of both accuracy and generalization. This article shows how to implement all of these operations, and more, in one go with the Pipeline offered by scikit-learn. Every section is supported with a Python implementation.
2. Pipeline
In its most basic form, a pipeline chains the specified data preprocessing operations and the model, and applies them to the dataset in a single step:
IN[1]
from sklearn.datasets import load_iris

iris = load_iris()
iris_data = iris.data
iris_target = iris.target
IN[2]
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

x_train, x_test, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.2, random_state=2021)
pip_iris = Pipeline([("scaler", RobustScaler()), ("lr", LogisticRegression())])
pip_iris.fit(x_train, y_train)
iris_score = pip_iris.score(x_test, y_test)
print(iris_score)
OUT[2]
0.9333333333333333
The iris dataset was split with train_test_split as usual. RobustScaler() was chosen as the scaling method, since the dataset is known to be purely numeric, and LogisticRegression was chosen as the classifier. The pipeline exposes the familiar methods such as .fit and .score, just like grid search: the training set was fitted with .fit on the created pipeline, and the test score was obtained with .score.
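A fitted pipeline behaves like any other estimator, so predictions and the individual fitted steps can also be pulled out of it. The following is a minimal sketch that continues from pip_iris above; the variable names are the ones already defined in IN[2], and named_steps and predict are standard Pipeline attributes.
# The fitted pipeline can predict directly; scaling is applied automatically
y_pred = pip_iris.predict(x_test)
# Individual fitted steps are accessible via named_steps, using the names given at construction
fitted_scaler = pip_iris.named_steps["scaler"]
fitted_lr = pip_iris.named_steps["lr"]
print(fitted_lr.coef_.shape)   # one coefficient vector per iris class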
3. Pipeline with Grid Search
Grid search evaluates hyperparameter combinations for the algorithm, or for any operation with defined hyperparameters, and informs the user about the accuracy and the best combination through various attributes (more information here). Using GridSearchCV with a Pipeline is a very effective way of reducing workload and confusion. Now let's test the Logistic Regression model implemented above with various hyperparameter combinations:
IN[3]
from sklearn.model_selection import GridSearchCV

x_train, x_test, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.2, random_state=2021)
pip_iris_gs = Pipeline([("scaler", RobustScaler()), ("lr", LogisticRegression(solver='saga'))])
param_grids = {'lr__C': [0.001, 0.1, 2, 10],
               'lr__penalty': ['l1', 'l2']}
gs = GridSearchCV(pip_iris_gs, param_grids)
gs.fit(x_train, y_train)
test_score = gs.score(x_test, y_test)
print("test score:", test_score)
print("best parameters: ", gs.best_params_)
print("best score: ", gs.best_score_)
OUT[3]
test score: 0.9333333333333333
best parameters: {'lr__C': 2, 'lr__penalty': 'l1'}
best score: 0.9583333333333334
In addition to the setup above, candidate 'C' and 'penalty' values are defined by the user in a dictionary named param_grids. The pipeline containing the scaler and the algorithm was then passed to GridSearchCV as the estimator. The training set is fitted with .fit and evaluated with .score, and information about the resulting model is obtained through the various attributes of GridSearchCV.
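Beyond best_params_ and best_score_, GridSearchCV also stores the cross-validated result of every combination in cv_results_. A minimal sketch of how the full grid could be inspected, assuming pandas is imported as pd:
import pandas as pd

# Every tested combination with its mean cross-validated score
results = pd.DataFrame(gs.cv_results_)
print(results[['param_lr__C', 'param_lr__penalty', 'mean_test_score']])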
4. Pipeline with ColumnTransformer, GridSearchCV
So far, only the iris dataset, which contains purely numerical data, has been used. To make the situation more complex, let's use the toy dataset, which contains both numerical and categorical data, and apply:
- Normalize the 'Income' column with MinMaxScaler()
- Encode the categorical columns with OneHotEncoder()
- Group the 'Age' column with binning
First, let’s take a quick look at the dataset:
IN[4]
import pandas as pd

toy = pd.read_csv('toy_dataset.csv')
toy_final = toy.drop(['Number'], axis=1)
IN[5]
toy_final.isna().sum()
OUT[5]
City 0
Gender 0
Age 0
Income 0
Illness 0
dtype: int64
IN[6]
import numpy as np

numeric_cols = toy.select_dtypes(include=np.number).columns
print("numeric_cols:", numeric_cols)
categorical_cols = toy.select_dtypes(exclude=np.number).columns
print("categorical_cols:", categorical_cols)
print("shape:", toy_final.shape)
OUT[6]
numeric_cols: Index(['Number', 'Age', 'Income'], dtype='object')
categorical_cols: Index(['City', 'Gender', 'Illness'], dtype='object')
shape: (150000, 5)
Now let’s perform the operations mentioned above:
IN[7]
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

bins = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
ct = ColumnTransformer([
    ('normalization', MinMaxScaler(), ['Income']),
    ('binning', bins, ['Age']),
    ('categorical-to-numeric', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['City', 'Gender'])
], remainder='drop')
x_train, x_test, y_train, y_test = train_test_split(toy_final.drop('Illness', axis=1), toy_final.Illness,
                                                    test_size=0.2, random_state=0)
param_grid_lr = [{'lr__solver': ['saga'], 'lr__C': [0.1, 1, 10], 'lr__penalty': ['elasticnet', 'l1', 'l2']},
                 {'lr__solver': ['lbfgs'], 'lr__C': [0.1, 1, 10], 'lr__penalty': ['l2']}]
IN[8]
pipe_lr = Pipeline([
    ('columntransform', ct),
    ('lr', LogisticRegression()),
])
gs_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5)
gs_lr.fit(x_train, y_train)
test_score_lr = gs_lr.score(x_test, y_test)
print("test score:", test_score_lr)
print("best parameters: ", gs_lr.best_params_)
print("best score: ", gs_lr.best_score_)
OUT[8]
test score: 0.9198666666666667
best parameters: {'lr__C': 0.1, 'lr__penalty': 'l1', 'lr__solver': 'saga'}
best score: 0.9188750000000001
Binning with KBinsDiscretizer(), provided in the sklearn library, is set to 5 groups and one-hot encoded. The preprocessing steps to be applied were gathered in one place with ColumnTransformer(). These operations are:
- Normalization for the 'Income' column,
- Discretization for the 'Age' column,
- Encoding of the categorical columns with OneHotEncoder().
The dataset was then split into training and test sets. A dictionary (param_grid_lr) was created with the selected hyperparameters to evaluate the parameter combinations. The ColumnTransformer (more information here) and the algorithm, LogisticRegression, were placed in the pipeline, and, as in the examples above, the model was completed by setting the cross-validation value to 5 in GridSearchCV.
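To verify what the ColumnTransformer actually produces, it can also be fitted on its own before being placed in the pipeline. A minimal sketch, reusing ct and x_train from IN[7]:
# Fit the transformer alone and inspect the resulting feature matrix:
# 1 normalized 'Income' column + 5 one-hot bins for 'Age' + one-hot columns for 'City' and 'Gender'
transformed = ct.fit_transform(x_train)
print(transformed.shape)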
The keys of the param_grid_lr dictionary are formed as step name + double underscore + hyperparameter. LogisticRegression() is named lr in the pipeline, and 'C' is a hyperparameter of LogisticRegression, so lr__C is used. To see all the available hyperparameter names, get_params().keys() can be applied, as shown in the sketch below.
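A minimal sketch of how the valid parameter names could be listed, both on the estimator itself and on the full pipeline (which adds the lr__ prefix):
# Hyperparameters of the bare estimator: 'C', 'penalty', 'solver', ...
print(LogisticRegression().get_params().keys())
# Hyperparameters of the whole pipeline: 'columntransform__...', 'lr__C', 'lr__penalty', ...
print(pipe_lr.get_params().keys())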
Now let's try the model we prepared with DecisionTreeClassifier():
IN[9]
from sklearn.tree import DecisionTreeClassifier

pipe_dt = Pipeline([
    ('columntransform', ct),
    ('dt', DecisionTreeClassifier()),
])
param_grid_dt = {'dt__max_depth': [2, 3, 4, 5, 6, 7, 8]}
gs_dt = GridSearchCV(pipe_dt, param_grid_dt, cv=5)
gs_dt.fit(x_train, y_train)
test_score_dt = gs_dt.score(x_test, y_test)
print("test score:", test_score_dt)
print("best parameters: ", gs_dt.best_params_)
print("best score: ", gs_dt.best_score_)
OUT[9]
test score: 0.9198333333333333
best parameters: {'dt__max_depth': 2}
best score: 0.9188750000000001
The max_depth values we selected were fitted one by one, and the most successful one was determined by grid search.
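The refit winning model can then be retrieved through best_estimator_, which is itself a fitted pipeline. A minimal sketch:
# best_estimator_ is the pipeline refit on the whole training set with the best parameters
best_pipe = gs_dt.best_estimator_
best_tree = best_pipe.named_steps['dt']
print(best_tree.get_depth())   # actual depth of the winning tree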
5. Pipeline with Feature Selection
As mentioned in the introduction, using the pipeline with GridSearchCV is a very effective way to evaluate hyperparameter combinations and to keep everything organized. It is useful not only for data preprocessing and algorithms, but also for data cleaning (SimpleImputer), feature selection (SelectKBest, SelectPercentile, more information here), etc.; a minimal SimpleImputer sketch follows the list below. Now let's apply the following to the breast_cancer dataset containing 30 features:
— Standardization of the numerical values with StandardScaler()
— PolynomialFeatures() on the numerical values
— ANOVA feature selection with SelectPercentile()
— LogisticRegression hyperparameters ('C' and 'penalty')
— Cross-validation set to 3
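Before moving on, here is the promised sketch of how a cleaning step such as SimpleImputer could be placed at the front of a pipeline; the X and y at the end are hypothetical and stand for any dataset with missing values:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Missing values are filled with the column mean before scaling and classification;
# 'imputer__strategy' could also be searched over with GridSearchCV.
pipe_clean = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression())
])
# pipe_clean.fit(X, y)   # X, y are hypothetical here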
IN[10]
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
cancer_data = cancer.data
cancer_target = cancer.target
IN[11]
from sklearn.feature_selection import SelectPercentile
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

anova = SelectPercentile()
poly = PolynomialFeatures()
lr = LogisticRegression(solver='saga')
param_grid_cancer = dict(poly__degree=[2, 3, 4],
                         anova__percentile=[20, 30, 40, 50],
                         lr__C=[0.01, 0.1, 1, 10],
                         lr__penalty=['l1', 'l2'])
pipe_cancer = Pipeline([
    ('standardization', StandardScaler()),
    ('poly', poly),
    ('anova', anova),
    ('lr', lr)
])
gs_final = GridSearchCV(pipe_cancer, param_grid_cancer, cv=3, n_jobs=-1)
x_train, x_test, y_train, y_test = train_test_split(cancer_data, cancer_target, test_size=0.2, random_state=2021)
gs_final.fit(x_train, y_train)
test_score_final = gs_final.score(x_test, y_test)
print("test score:", test_score_final)
print("best parameters: ", gs_final.best_params_)
print("best score: ", gs_final.best_score_)
OUT[11]
test score: 0.9736842105263158
best parameters: {'anova__percentile': 20, 'lr__C': 0.1, 'lr__penalty': 'l1', 'poly__degree': 2}
best score: 0.9626612059951203
The hyperparameter combinations to be tested were defined in param_grid_cancer:
degree=[2, 3, 4] for PolynomialFeatures()
percentile=[20, 30, 40, 50] for SelectPercentile()
C=[0.01, 0.1, 1, 10] for LogisticRegression()
penalty=['l1', 'l2'] for LogisticRegression()
All of these steps, together with StandardScaler(), were combined in the pipeline. The cross-validation value was then set to 3 in GridSearchCV, the dataset was split with train_test_split, and the model was fitted with .fit as always. The accuracy is highest when 'percentile' in SelectPercentile is set to 20, 'C' in LogisticRegression is set to 0.1, 'penalty' in LogisticRegression is set to 'l1', and 'degree' in PolynomialFeatures is set to 2.
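Since SelectPercentile keeps only the best-scoring polynomial features, it can be interesting to check how many features actually reach the classifier. A minimal sketch using the refit best_estimator_:
# How many polynomial features survived the 20% ANOVA selection
best_anova = gs_final.best_estimator_.named_steps['anova']
print(best_anova.get_support().sum(), "features kept out of", len(best_anova.get_support()))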
The pipeline makes it possible to evaluate everything that is required when building a model collectively, from a single source. make_pipeline can be used as well as Pipeline: make_pipeline automatically creates the necessary names for the steps, so just adding the steps is sufficient.
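A minimal sketch of the iris setup from section 2 rewritten with make_pipeline; the step names ('robustscaler', 'logisticregression') are generated automatically from the lowercased class names:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# No explicit step names needed; they are derived from the class names
pip_auto = make_pipeline(RobustScaler(), LogisticRegression())
print(pip_auto.named_steps.keys())   # dict_keys(['robustscaler', 'logisticregression'])
# Grid search keys would then be e.g. 'logisticregression__C'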