Pipelines: Automated machine learning with HyperParameter Tuning!

The first step towards creating your own reusable codebase.

Harshal Soni
Towards Data Science


Tired of re-writing the same old code for every other Data Science deep-dive? You have come to the right place!

First there is the data, then endless ETL processes, then modelling, and finally inference. But wouldn't it be cool to automate it all so that you can plug and play any dataset with minimal modifications? Pipelines enable us to do exactly that!

This is a Pipeline: one step stitched to another to automate your ML workflow. Image by Author

What is a Machine Learning Pipeline?

Like their real-world namesake, ML pipelines are the carriers that connect one process to another, so that there is just one junction for input and one for output. Modern applications rely heavily on pipeline architectures that connect ML services with existing DevOps processes. In short, pipelines help you supercharge your project! A typical ML pipeline covers:

  • Data preparation including importing, validating and cleaning, munging and transformation, normalization, and staging
  • Training configuration including parameterizing arguments, file paths, and logging/reporting configurations
  • Training and validating efficiently and repeatedly. Efficiency might come from specifying particular data subsets, different hardware compute resources, distributed processing, and progress monitoring.
  • Deployment, including versioning, scaling, provisioning, and access control

If you want to read more about them, here is a good starter on ML pipelines.

In this blog, I will go through the basics of Scikit-learn Pipelines. I will write custom Transformer classes that take care of data cleaning, feature engineering, and model training. (I know you came here looking for custom transformer solutions, not a OneHotEncoder plugged into pipeline steps :D)

Data and Descriptions

I am ingesting the “Star Type Classification” Dataset from Kaggle. This dataset has an eclectic mix of categorical as well as continuous variables. We will do a multi-class classification (saucy)!

Dataframe head

Most of the features are self-explanatory! You can read about the Spectral Class here.

Our target variable is "Type", which takes the values Red Dwarf (0), Brown Dwarf (1), White Dwarf (2), Main Sequence (3), Super Giants (4), and Hyper Giants (5).

Also, each class has 40 observations which makes it a perfectly balanced problem.

Exploratory Data Analysis

Our Favourite Scatter plot with fancy filters!

I have used D-Tale, an automated EDA tool, for this project. I love it for its simplicity and super-fast speed.

But, you are not here for EDA, eh? Let’s move on!

Data Pre-processing

Here is what I have concluded from my analysis.

  1. There are no missing values in the dataset (phew, but I will handle them anyway).
  2. Boy, there are outliers. So, pitch in your best outlier housekeeping strategies in the comments. I have clipped each feature's outliers to its 20th and 80th percentile range, an interquartile-style trim (see the sketch after this list). See the Gaussian distribution here.
  3. Features are skewed to the right or left. Hence, I cannot simply apply linear methods.
  4. Scaling: the feature ranges differ by orders of magnitude, so scaling is needed. Since I want my model to still see the outliers while keeping the bias-variance tradeoff in check, I will apply RobustScaler to my data.
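To make points 2 and 4 concrete, here is a minimal sketch of percentile clipping followed by RobustScaler, assuming a DataFrame df with the column names used later in this post; the notebook's exact clipping strategy may differ.

from sklearn.preprocessing import RobustScaler

# Hypothetical list of the numeric columns described above
num_cols = ['Temperature', 'RelativeLuminosity', 'RelativeRadius', 'ApparentMagnitude']

# Clip each numeric feature to its 20th/80th percentile range
lower = df[num_cols].quantile(0.20)
upper = df[num_cols].quantile(0.80)
clipped = df[num_cols].clip(lower=lower, upper=upper, axis=1)

# RobustScaler centres on the median and scales by the IQR,
# so any remaining extreme values have limited influence
scaled = RobustScaler().fit_transform(clipped)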

Feature Engineering

I have performed three tests to unearth the heavenly importance of a feature in my down-to-earth pipeline.

  1. Correlation Plots: Here is my cheat sheet!
- Pearson’s R for continuous-continuous cases
- Correlation Ratio for categorical-continuous cases
- Cramer’s V or Theil’s U for categorical-categorical cases
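Here is a minimal sketch of that cheat sheet in code, assuming a DataFrame df with the columns introduced earlier; the helper functions are hand-rolled for illustration rather than taken from the notebook.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Pearson's R for the continuous-continuous pairs
print(df[['Temperature', 'RelativeLuminosity', 'RelativeRadius', 'ApparentMagnitude']].corr(method='pearson'))

def correlation_ratio(categories, values):
    """Correlation ratio (eta) for a categorical-continuous pair."""
    data = pd.DataFrame({'cat': categories, 'val': values})
    overall_mean = data['val'].mean()
    groups = data.groupby('cat')['val']
    ss_between = (groups.count() * (groups.mean() - overall_mean) ** 2).sum()
    ss_total = ((data['val'] - overall_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

def cramers_v(x, y):
    """Cramer's V for a categorical-categorical pair (bias-uncorrected)."""
    confusion = pd.crosstab(x, y)
    chi2_stat = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    return np.sqrt(chi2_stat / (n * (min(confusion.shape) - 1)))

print(correlation_ratio(df['SpectralClass'], df['Temperature']))
print(cramers_v(df['Color'], df['SpectralClass']))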

2. Analysis of Variance (ANOVA): It is a beauty! It shows the significance of each feature with respect to the target. Here are the p-values:

Temperature: 3.323401956092008e-11
RelativeLuminosity: 1.641155523850019e-33
RelativeRadius: 1.6272694239287043e-31
ApparentMagnitude: 6.33087509199811e-128
Color: 1.047429715544e-06
SpectralClass: 0.44868186785826514

As you can see, the p-value for SpectralClass is pretty high!

3. Chi-squared Test: For good old categorical analysis.

Color            8.716079e-26
SpectralClass    1.167568e-37
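For reference, here is one way such p-values could be computed with scikit-learn's feature-selection helpers; the exact procedure used in the notebook is not shown, so the handling of the categorical columns (one-hot encoding before the chi-squared test) is an assumption.

import pandas as pd
from sklearn.feature_selection import f_classif, chi2

num_cols = ['Temperature', 'RelativeLuminosity', 'RelativeRadius', 'ApparentMagnitude']
cat_cols = ['Color', 'SpectralClass']
y = df['Type']

# ANOVA F-test: does the mean of each continuous feature differ across classes?
_, anova_p = f_classif(df[num_cols], y)
print(dict(zip(num_cols, anova_p)))

# Chi-squared test on non-negative (here: one-hot encoded) categorical features
X_cat = pd.get_dummies(df[cat_cols].astype(str))
_, chi2_p = chi2(X_cat, y)
print(dict(zip(X_cat.columns, chi2_p)))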

Alright, take a two-minute break! Recap the concepts just covered. I know you are a pro, but just a little mental recap!

Let us talk about Transformers!

Not Michael Bay's kind! Scikit-learn Transformers form an integral part of Pipelines and let you plug and play different components. To begin, import these two base classes:

from sklearn.base import BaseEstimator, TransformerMixin

The classes we just imported act like glue for our custom classes. BaseEstimator gives the pipeline the get_params and set_params methods, which every sklearn estimator requires. TransformerMixin provides the fit_transform method.
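A bare-bones illustration of what those two base classes buy you; the class name and its do-nothing logic are placeholders, not part of the original notebook.

from sklearn.base import BaseEstimator, TransformerMixin

class NoOpTransformer(BaseEstimator, TransformerMixin):
    """Does nothing, yet is already a valid pipeline step."""

    def fit(self, X, y=None):
        # BaseEstimator supplies get_params/set_params for free
        return self

    def transform(self, X):
        # TransformerMixin builds fit_transform() out of fit() + transform()
        return X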

To prevent any data leakage and drift, it is best to split the data into train and test sets right at the start. There are several advantages, namely:

  1. Pre-Processing: If you fill missing values and trim outliers on the full dataset, statistics from the test set leak into training, and your model will not generalize to unseen real-world data. Hence, split first!
  2. Feature Scaling: We call scaler.fit() on the train dataset only and let those learned parameters transform both the train and test data. LoL, what? Yes, this also helps to detect data drift in production.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='Type'), df['Type'], test_size=0.30, random_state=1, stratify=df['Type'])

I have used a stratified split because I want the train and test datasets to represent all the target labels. (Otherwise, there is a chance that the training set never sees some of the classes.)

Transformer for Numerical Features

We can always use the default Scaler or Imputer classes. But what if your use case demands custom treatment of one feature that has been a bugger? It can make or break the system!

You need to populate two methods:

  1. fit: put the calculative steps here, such as measuring the mean, median, or variance.
  2. transform: put the applicative transformations here, i.e. apply what was learned in fit to the data.

The advantage of using this is that you will have absolute control. Plus, your compute instance would not have to calculate the same statistics twice.

Note: Be careful about which functionality you implement in your fit and transform methods. Putting a step in the wrong place can introduce bias or leakage (as I mentioned for scaling).
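Putting the fit/transform split into practice, here is a minimal sketch of a custom numerical transformer that imputes with the median, clips to the 20th/80th percentiles, and scales robustly; the class name and exact steps are illustrative assumptions, not the notebook's code.

from sklearn.base import BaseEstimator, TransformerMixin

class NumericalTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical numeric cleaner: median imputation, percentile clipping, robust scaling."""

    def fit(self, X, y=None):
        # Calculative steps: learn statistics from the training data only
        self.medians_ = X.median()
        self.lower_ = X.quantile(0.20)
        self.upper_ = X.quantile(0.80)
        # Inter-quartile range, guarding against division by zero
        self.iqr_ = (X.quantile(0.75) - X.quantile(0.25)).replace(0, 1)
        return self

    def transform(self, X):
        # Applicative steps: reuse the learned statistics on train and test alike
        X = X.fillna(self.medians_)
        X = X.clip(lower=self.lower_, upper=self.upper_, axis=1)
        return (X - self.medians_) / self.iqr_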

Transformer for Categorical Features

Now that we have handled our numerical features, we can proceed to the categorical ones. I will be dummifying my categorical features. You may swap in KBinsDiscretizer or any other custom function to suit your use case.
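Here is a minimal sketch of such a dummifying transformer, again with a hypothetical class name; it memorizes the training-time dummy columns in fit so that the train and test matrices stay aligned.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CategoricalTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical dummifier that keeps train and test columns aligned."""

    def fit(self, X, y=None):
        # Remember which dummy columns exist in the training data
        self.columns_ = pd.get_dummies(X.astype(str)).columns
        return self

    def transform(self, X):
        # Unseen categories are dropped; missing ones are filled with zeros
        return pd.get_dummies(X.astype(str)).reindex(columns=self.columns_, fill_value=0)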

Awesome! Almost There

Now, we need to stitch these two classes together. Scikit-learn provides two functionalities:

FeatureUnion: Concatenates results of multiple transformer objects. This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

ColumnTransformer: Applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

In our use case, I have created separate transformers so that categorical and numerical features are processed independently. So, I will be using ColumnTransformer.

Drum-roll!

We need a list of features first, obviously. 🙌

numerical_columns = ['Temperature', 'RelativeLuminosity', 'RelativeRadius', 'ApparentMagnitude']
categorical_columns = ['SpectralClass']
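And here is one way to stitch the two (hypothetical) transformers sketched above onto those column lists with ColumnTransformer; treat the step names as placeholders.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(transformers=[
    ('numeric', NumericalTransformer(), numerical_columns),
    ('categorical', CategoricalTransformer(), categorical_columns),
])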

Awesome. It is done!

Let us test it using a distance-based model. Since my use case is multi-class classification, I am going for KNeighborsClassifier as my base model.

I have used K-Fold cross-validation as well. But, in my heart, I know that it will overfit since the training set holds only about 170 rows after the 70/30 split. That is not the concern right now, though; overfitting will be addressed soon with hyperparameter tuning and model iterations.

Model Training

Once you have the processing transformer up and running, you can plug it into the training pipeline. You can include all sorts of fancy steps, including GridSearchCV, cross-validation, and even an ensemble of models in a chain.
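One possible wiring, assuming the preprocessor sketched earlier; KNeighborsClassifier stands in for the base model mentioned above, and the imports cover the cross-validation call that follows.

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, KFold

model = Pipeline(steps=[
    ('preprocess', preprocessor),        # ColumnTransformer from the previous section
    ('knn', KNeighborsClassifier()),     # distance-based multi-class classifier
])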

result = cross_val_score(model, X_train, y_train, cv=KFold(n_splits=3), error_score=-1)

print("Best fold accuracy: {}".format(max(result)))
Best fold accuracy: 0.9107142857142857

Hey! It works, and Yay, you’ve made it. I am certain these steps are enough to kickstart your path towards pipelines. After all, we are all Mario and Luigi! (pipeline — game — please, get the joke!)

Finally!

Mission accomplished! Today, we created a basic scikit-learn skeleton for any data science deep-dive. We performed data cleaning, normalization, and data transformation using custom Transformer elements.

Next, we plugged the individual channels of the pipeline (categorical and numerical) together using ColumnTransformer, which glues the two subsets of data back into one. Lastly, we trained a basic model with our pipeline.

I hope this blog clears up some of the basics of custom transformers. They are the backbone of any modern ML application, and it would be best if you include them in your workflow.

You can get this complete notebook here on Github.

The next step, automatically tuning the hyper-parameters for multiple sklearn models using Hyperopt, can be found here.

Thanks for sticking around! You can always reach out to me over LinkedIn to bounce back some ideas!



I am a voyager of Machine Learning with an empathetic mission of creating a better society. I have a Master's degree and research experience of sorts. Data Scientist at Urbint.