
How to test multiple machine learning pipelines with just a few lines of Python

Photo by Quinten de Graaf on Unsplash

Introduction

During the exploration phase of a project, a data scientist tries to find the optimal pipeline for their specific use case. Since it’s nearly impossible to know beforehand which transformations will benefit the model’s outcome the most, this process usually involves trying out different approaches. For example, if we are dealing with an imbalanced dataset, should we oversample the minority class or undersample the majority class? In this story, I’ll explain how to use the ATOM package to quickly evaluate the performance of a model trained on different pipelines. ATOM is an open-source Python package designed to help data scientists speed up the exploration of machine learning pipelines. Read this story if you want a gentle introduction to the library.


Managing pipelines

The easiest way to explain how to manage multiple pipelines is going through an example. In this example we will:

  1. Create an imbalanced dataset and feed it to atom
  2. Reduce the number of features using Recursive Feature Elimination (RFE)
  3. Train three Random Forest (RF) models:
  • One trained directly on the imbalanced dataset
  • One trained on the dataset after applying oversampling
  • One trained on the dataset after applying undersampling
  4. Compare the results

And we will do all of this in less than 20 lines of code! Let’s get started.

Create the dataset

We start by creating a mock binary classification dataset with a 0.95–0.05 proportion of samples assigned to each class. The data is then fed to atom.

from atom import ATOMClassifier
from sklearn.datasets import make_classification
# Create an imbalanced dataset
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_informative=20,
    weights=(0.95,),
)
# Load the dataset into atom
atom = ATOMClassifier(X, y, test_size=0.2, verbose=2)

The dataset is automatically divided into a train and test set. The output looks as follows.

<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ====================== >>
Shape: (5000, 31)
Scaled: False
Outlier values: 582 (0.5%)
---------------------------------------
Train set size: 4000
Test set size: 1000
---------------------------------------
|    | dataset     | train       | test       |
|---:|:------------|:------------|:-----------|
|  0 | 4731 (17.6) | 3777 (16.9) | 954 (20.7) |
|  1 | 269 (1.0)   | 223 (1.0)   | 46 (1.0)   |

We can immediately see that the dataset is imbalanced, since it contains almost 18 times more zeros than ones. Let’s have a look at the data. Note that, since the input wasn’t a dataframe, atom has given default names to the columns.

atom.dataset.head()
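If you prefer meaningful column names, you can pass a pandas DataFrame instead of the raw arrays; atom accepts dataframes as well. A minimal sketch (the feat_ names are just an illustration):

import pandas as pd

# Wrap the features in a DataFrame with explicit column names
X_df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(X.shape[1])])
atom = ATOMClassifier(X_df, y, test_size=0.2, verbose=2)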

Perform feature selection

For explanatory purposes, we’ll start with a data transformation step that we want to share across all the pipelines we are going to test. Often, this would be something like feature scaling or the imputation of missing values. In this case, we reduce the dimensionality of the data from 30 to 12 features. With atom, that’s as easy as doing this.

atom.feature_selection("RFE", solver="RF", n_features=12)

This command runs RFE with a Random Forest as the estimator. The remaining dataset contains the 12 most promising features.

Fitting FeatureSelector...
Performing feature selection...
 --> The RFE selected 12 features from the dataset.
   >>> Dropping feature Feature 2 (rank 3).
   >>> Dropping feature Feature 3 (rank 8).
   >>> Dropping feature Feature 5 (rank 10).
   >>> Dropping feature Feature 7 (rank 17).
   >>> Dropping feature Feature 8 (rank 12).
   >>> Dropping feature Feature 11 (rank 19).
   >>> Dropping feature Feature 13 (rank 13).
   >>> Dropping feature Feature 14 (rank 11).
   >>> Dropping feature Feature 15 (rank 15).
   >>> Dropping feature Feature 17 (rank 4).
   >>> Dropping feature Feature 19 (rank 16).
   >>> Dropping feature Feature 20 (rank 2).
   >>> Dropping feature Feature 21 (rank 6).
   >>> Dropping feature Feature 23 (rank 5).
   >>> Dropping feature Feature 24 (rank 9).
   >>> Dropping feature Feature 25 (rank 18).
   >>> Dropping feature Feature 26 (rank 7).
   >>> Dropping feature Feature 27 (rank 14).
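Under the hood, this is essentially scikit-learn’s RFE with a Random Forest as the ranking estimator. For reference, a rough standalone equivalent looks like this (a sketch, not atom’s exact internals):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Rank the features with a Random Forest and keep the 12 best
rfe = RFE(
    estimator=RandomForestClassifier(random_state=1),
    n_features_to_select=12,
)
X_reduced = rfe.fit_transform(X, y)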

Now, we train our first model directly on the imbalanced dataset. Using the run method, we fit a Random Forest on the training set and evaluate it on the test set.

atom.run(models="RF", metric="balanced_accuracy")

Training ===================================== >>
Models: RF
Metric: balanced_accuracy

Results for Random Forest:         
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.5326
Time elapsed: 0.733s
-------------------------------------------------
Total time: 0.733s

Final results ========================= >>
Duration: 0.733s
------------------------------------------
Random Forest --> balanced_accuracy: 0.5326
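Conceptually, run performs the familiar scikit-learn fit-and-score loop for every model you request. A minimal sketch of the manual equivalent (X_train, y_train, X_test, y_test are hypothetical names standing in for atom’s internal split):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

# Fit on the training set and score on the held-out test set
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
score = balanced_accuracy_score(y_test, model.predict(X_test))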

The branching system

Before we continue, it’s time to explain ATOM’s branching system. The branching system allows you to manage multiple pipelines within the same atom instance. Every pipeline is stored in a separate branch, which can be accessed through the branch attribute. A branch contains a copy of the dataset, as well as all transformers and models fitted on that specific dataset. Methods called from atom always use the dataset in the current branch, and so do data attributes such as atom.dataset. By default, atom starts with one branch called master. Call the branch attribute for an overview of the transformers and models it contains.

atom.branch

Branch: master
 --> Pipeline: 
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RandomForestClassifier(n_jobs=1, random_state=1)
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
 --> Models: RF

The current branch contains the feature selection transformer we applied earlier, as well as the model we just trained.

Oversampling

Now it’s time to test how the model will perform after oversampling the dataset. Here, we create a new branch called oversample.

atom.branch = "oversample"

New branch oversample successfully created!

NOTE: Creating a new branch automatically changes the current branch to the new one. To switch between existing branches, just type the name of the desired branch, e.g. atom.branch = "master" to go back to the master branch.

The oversample branch is split from the current branch (master), adopting its dataset and transformers. This means that the feature selection transformer is now also a step in the oversampling pipeline. Splitting branches like this avoids having to recalculate previous transformations.

Call the balance method to oversample the dataset using SMOTE.

atom.balance(strategy="smote")

Oversampling with SMOTE...
 --> Adding 7102 samples to class: 1.
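The balance method wraps samplers from the imbalanced-learn package. Doing the same resampling with imblearn directly would look roughly like this (a sketch; X_train and y_train stand in for atom’s training split):

from imblearn.over_sampling import SMOTE

# Synthesize new minority-class samples until both classes are equal in size
X_res, y_res = SMOTE(random_state=1).fit_resample(X_train, y_train)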

Remember that this method only transforms the dataset in the current branch. The dataset in the master branch is unchanged. Quickly check that the transformation worked.

atom.classes

Note that only the training set is balanced since we want to preserve the original class distribution in the test set.

Now we can train a Random Forest model on the oversampled dataset. To distinguish this model from the first one we trained, we add a tag (os for oversample) after the model’s acronym.

atom.run(models="RF_os", metric="balanced_accuracy")

Training ===================================== >>
Models: RF_os
Metric: balanced_accuracy

Results for Random Forest:         
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.7737
Time elapsed: 1.325s
-------------------------------------------------
Total time: 1.325s

Final results ========================= >>
Duration: 1.341s
------------------------------------------
Random Forest --> balanced_accuracy: 0.7737

Undersampling

A new branch is needed to undersample the data. Since the current branch contains an oversampled dataset, we have to split the new branch from the master branch, which only contains the RFE transformer.

atom.branch = "undersample_from_master"

New branch undersample successfully created!

Adding _from_ between the name of the new branch and the name of an existing branch splits the new branch from that branch instead of from the current one. Check that the dataset in the undersample branch is still imbalanced.

atom.classes

Call the balance method again to undersample the data using NearMiss.

atom.balance(strategy="NearMiss")

Undersampling with NearMiss...
 --> Removing 7102 samples from class: 0.
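As with SMOTE above, this maps onto an imbalanced-learn sampler; a minimal direct equivalent (same hypothetical names as before):

from imblearn.under_sampling import NearMiss

# Drop majority-class samples, keeping those closest to the minority class
X_res, y_res = NearMiss().fit_resample(X_train, y_train)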

Next, we fit the Random Forest using a new tag (us for undersample).

atom.run(models="RF_us", metric="balanced_accuracy")

Training ===================================== >>
Models: RF_us
Metric: balanced_accuracy

Results for Random Forest:         
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.6888
Time elapsed: 0.189s
-------------------------------------------------
Total time: 0.189s

Final results ========================= >>
Duration: 0.189s
------------------------------------------
Random Forest --> balanced_accuracy: 0.6888

If we look at our branch now, we see that the pipeline only contains the two transformations we want.

atom.branch

Branch: undersample
 --> Pipeline: 
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RandomForestClassifier(n_jobs=1, random_state=1)
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
   >>> Balancer
     --> strategy: NearMiss
     --> kwargs: {}
 --> Models: RF_us

Analyze the results

We finally have the three models we wanted in our atom instance. The branching system now looks as follows.

The RFE transformation is shared among the three branches, but after that, every branch follows a different path. The master branch has no further transformers while the other two branches each apply a different balancing algorithm. All three branches contain a Random Forest model, each trained on a different dataset. All that remains is comparing the results.

atom.evaluate()

atom.plot_prc()


Conclusion

We have learned how to use the ATOM package to easily compare multiple machine learning pipelines. Having all the pipelines (and thus the models) in the same atom instance has several advantages:

  • The code is shorter, which keeps the notebook less cluttered and makes it easier to maintain an overview
  • Transformations shared across pipelines don’t need to be recalculated
  • Results are easier to compare using atom’s plotting methods

The example we went through is quite minimalistic, but ATOM is capable of much more! For further information, check this related story or have a look at the package’s documentation. For bugs or feature requests, don’t hesitate to open an issue on GitHub or send me an email.
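To close, here is the complete example from this story stitched into a single block. Every call appeared above, and together they indeed stay under 20 lines of code:

from atom import ATOMClassifier
from sklearn.datasets import make_classification

# Create an imbalanced dataset and load it into atom
X, y = make_classification(n_samples=5000, n_features=30,
                           n_informative=20, weights=(0.95,))
atom = ATOMClassifier(X, y, test_size=0.2, verbose=2)

# Shared step: keep the 12 most promising features
atom.feature_selection("RFE", solver="RF", n_features=12)

# Pipeline 1: train directly on the imbalanced data
atom.run(models="RF", metric="balanced_accuracy")

# Pipeline 2: oversample with SMOTE, then train
atom.branch = "oversample"
atom.balance(strategy="smote")
atom.run(models="RF_os", metric="balanced_accuracy")

# Pipeline 3: undersample with NearMiss, then train
atom.branch = "undersample_from_master"
atom.balance(strategy="NearMiss")
atom.run(models="RF_us", metric="balanced_accuracy")

# Compare the three models on the test set
atom.evaluate()
atom.plot_prc()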

