ATOM: A Python package for fast exploration of machine learning pipelines

Marco vd Boom
Towards Data Science
7 min read · Mar 29, 2021


Photo by Mike Benna on Unsplash

Automated Tool for Optimized Modelling (ATOM) is an open-source Python package designed to help data scientists perform fast exploration and experimentation of supervised machine learning pipelines.

Introduction

During the exploration phase of a project, a data scientist tries to find the optimal pipeline for their specific use case. This usually involves applying standard data cleaning steps, creating or selecting useful features, trying out different models, etc. Testing multiple pipelines requires many lines of code, and writing it all in the same notebook often makes it long and cluttered. On the other hand, using multiple notebooks makes it harder to compare the results and to keep an overview. On top of that, refactoring the code for every test can be time-consuming. How many times have you performed the same actions to pre-process a raw dataset? How many times have you copied and pasted code from an old repository to re-use it in a new use case?

ATOM is here to help solve these common issues. The package acts as a wrapper around the whole machine learning pipeline, helping the data scientist to rapidly find a good model for their problem. Avoid endless imports and documentation lookups. Avoid rewriting the same code over and over again. With just a few lines of code, it’s now possible to perform basic data cleaning steps, select relevant features and compare the performance of multiple models on a given dataset, providing quick insights into which pipeline performs best for the task at hand.

Diagram of the possible steps taken by ATOM.

Installation

Install ATOM’s newest release easily via pip:

$ pip install -U atom-ml

or via conda:

$ conda install -c conda-forge atom-ml

Usage

The easiest way to understand what the package can do for you is to go through an example. In this example we will:

  • Load the dataset. The data we are going to use is a variation on the Australian weather dataset from Kaggle. It can be downloaded from here. The goal of this dataset is to predict whether or not it will rain tomorrow, training a binary classifier on target column RainTomorrow.
  • Analyze a feature’s distribution
  • Impute the missing values
  • Encode the categorical columns
  • Fit a Logistic Regression and a Random Forest model to the data
  • Compare the performance of both models

And we will do all of this in less than 15 lines of code! Let’s get started.
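As a preview, the whole flow fits in a short script like the sketch below, which simply strings together the main calls introduced step by step in the rest of this article (the plotting and inspection steps are left out here).

# Compact preview of the pipeline built step by step below.
import pandas as pd
from atom import ATOMClassifier

X = pd.read_csv("weatherAUS.csv")

atom = ATOMClassifier(X, y="RainTomorrow", test_size=0.3, verbose=2)
atom.impute(strat_num="median", strat_cat="most_frequent")  # handle missing values
atom.encode(strategy="LeaveOneOut")                         # encode categorical columns
atom.run(models=["LR", "RF"], metric="f1")                  # train and compare both models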

Loading the dataset

We start by loading the data from a CSV file.

import pandas as pd

X = pd.read_csv("weatherAUS.csv")
X.head()

ATOM has two main classes, which are used to initialize the pipeline:

  • ATOMClassifier: for binary or multiclass classification tasks.
  • ATOMRegressor: for regression tasks.

For this example, we use ATOMClassifier. Here, we initialize an atom instance with the loaded dataset.

from atom import ATOMClassifier

atom = ATOMClassifier(X, y="RainTomorrow", test_size=0.3, verbose=2)

Additionally, we specify that atom should split the dataset into a train and test set with a 70%-30% ratio.

<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ====================== >>
Shape: (142193, 22)
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicate samples: 45 (0.0%)
---------------------------------------
Train set size: 99536
Test set size: 42657
---------------------------------------
| | dataset | train | test |
|---:|:-------------|:------------|:------------|
| 0 | 110316 (3.5) | 77205 (3.5) | 33111 (3.5) |
| 1 | 31877 (1.0) | 22331 (1.0) | 9546 (1.0) |

The output we receive is a short summary of the dataset. We can immediately see that there are missing values and categorical columns in the dataset.
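The split itself can also be inspected directly. Like the dataset attribute used later in this article, atom exposes the train and test sets as attributes (the attribute names below are assumed from that same pattern; check the documentation).

# Check the split sizes directly (train/test attributes assumed,
# analogous to atom.dataset used further down).
print(atom.train.shape)  # roughly (99536, 22)
print(atom.test.shape)   # roughly (42657, 22)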

Analyzing the dataset

With the dataset loaded into the atom instance, we can start analyzing it. ATOM provides various plots and methods for this. For example, to plot a feature correlation matrix, we can type:

atom.plot_correlation()

Note that the plot automatically ignores categorical columns. We can also investigate a feature’s distribution using a Kolmogorov–Smirnov test.

atom.distribution("Temp3pm")

                  ks  p_value
weibull_max   0.0173   0.0053
beta          0.0178   0.0036
pearson3      0.0215   0.0002
gamma         0.0216   0.0002
lognorm       0.0217   0.0002
norm          0.0230   0.0001
invgauss      0.0649   0.0000
triang        0.0696   0.0000
uniform       0.1943   0.0000
expon         0.3376   0.0000
weibull_min   0.7675   0.0000

Let’s plot the feature’s distribution and see if it indeed fits the weibull_max distribution.

atom.plot_distribution("Temp3pm", distribution="weibull_max")

Data cleaning

Now that we have seen how to use the package to rapidly analyze the dataset, we can move on to cleaning it. Since sklearn models don’t accept missing values, we should get rid of them. We can do this using atom’s impute method.

atom.impute(strat_num="median", strat_cat="most_frequent")

The chosen parameters specify that for numerical features we impute with the median of the column, and for categorical features we impute with the most frequent value (mode) of the column. One of the advantages of having the dataset as part of the instance is that we don’t need to call fit or transform ourselves. The underlying transformer will automatically fit on the training set and transform the complete dataset.

Fitting Imputer...
Imputing missing values...
--> Dropping 702 samples for containing less than 50% non-missing values.
--> Imputing 351 missing values with median (12.0) in feature MinTemp.
--> Imputing 169 missing values with median (22.6) in feature MaxTemp.
--> Imputing 1285 missing values with median (0.0) in feature Rainfall.
--> Imputing 60160 missing values with median (4.8) in feature Evaporation.
--> Imputing 67131 missing values with median (8.5) in feature Sunshine.
--> Imputing 8667 missing values with most_frequent (W) in feature WindGustDir.
--> Imputing 8609 missing values with median (39.0) in feature WindGustSpeed.
--> Imputing 9402 missing values with most_frequent (N) in feature WindDir9am.
--> Imputing 3106 missing values with most_frequent (SE) in feature WindDir3pm.
--> Imputing 2096 missing values with median (21.1) in feature Temp3pm.
--> Imputing 1285 missing values with most_frequent (No) in feature RainToday.
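For comparison, this is roughly what atom takes care of if you were to impute by hand with scikit-learn’s SimpleImputer. The sketch below assumes the data has already been split into X_train/X_test and uses only a few of the columns from the log above; it also skips the sample-dropping step.

# Manual equivalent of atom.impute (a sketch; X_train/X_test and the
# column lists are illustrative, not the full dataset).
from sklearn.impute import SimpleImputer

num_cols = ["MinTemp", "MaxTemp", "Rainfall", "Evaporation"]
cat_cols = ["WindGustDir", "WindDir9am", "WindDir3pm"]

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Fit on the training set only, then transform both sets.
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])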

To encode the categorical columns, we use atom’s encode method. We can choose any estimator from the category-encoders package to do the transformation.

atom.encode(strategy="LeaveOneOut")

Fitting Encoder...
Encoding categorical columns...
--> LeaveOneOut-encoding feature Location. Contains 49 classes.
--> LeaveOneOut-encoding feature WindGustDir. Contains 16 classes.
--> LeaveOneOut-encoding feature WindDir9am. Contains 16 classes.
--> LeaveOneOut-encoding feature WindDir3pm. Contains 16 classes.
--> Ordinal-encoding feature RainToday. Contains 2 classes.
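Behind the scenes, atom wraps the corresponding estimator from category-encoders. Doing it manually would look roughly like the sketch below, again assuming a separate X_train/X_test/y_train and ignoring the automatic fallback to ordinal encoding for the binary RainToday column.

# Manual equivalent of atom.encode (a sketch; atom selects the
# categorical columns and fits on the training set for you).
import category_encoders as ce

encoder = ce.LeaveOneOutEncoder(
    cols=["Location", "WindGustDir", "WindDir9am", "WindDir3pm"]
)
X_train = encoder.fit_transform(X_train, y_train)
X_test = encoder.transform(X_test)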

And just like that, all missing values are imputed and the categorical features are encoded with numerical values. The data in atom’s pipeline can be accessed at any time through the dataset attribute.

atom.dataset.head()

Model training

Now that the dataset has been cleaned, we are ready to fit the models. In this example we will assess how a Logistic Regression and a Random Forest model perform on our classification problem. All the available models and their corresponding acronyms can be found in the documentation. Again, one simple command is sufficient.

atom.run(models=["LR", "RF"], metric="f1")

NOTE: This is a minimalistic example. Among other things, it is possible to specify model parameters, use multiple metrics, perform hyperparameter tuning and train custom models. See the documentation for further details.

Training ===================================== >>
Models: LR, RF
Metric: f1

Results for Logistic Regression:
Fit ---------------------------------------------
Train evaluation --> f1: 0.4716
Test evaluation --> f1: 0.4658
Time elapsed: 0.201s
-------------------------------------------------
Total time: 0.201s

Results for Random Forest:
Fit ---------------------------------------------
Train evaluation --> f1: 0.9999
Test evaluation --> f1: 0.5434
Time elapsed: 14.976s
-------------------------------------------------
Total time: 14.976s

Final results ========================= >>
Duration: 15.177s
------------------------------------------
Logistic Regression --> f1: 0.4658
Random Forest --> f1: 0.5434

A couple of things have happened here. Both models are trained on the training set and evaluated on the test set, using the provided metric. Then, an object is created for every model and attached to the atom instance as an attribute. These objects are accessed through the model’s acronym (e.g. atom.RF for the Random Forest model) and can be used to further analyze the results.
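As the note above already hints, run accepts more than just a list of models and a single metric. A hedged sketch of a richer call is shown below; the parameter names n_calls and n_initial_points (which control the bayesian optimization used for hyperparameter tuning) are taken from memory of ATOM’s 4.x documentation and may differ between versions.

# A richer run: two metrics and a short hyperparameter tuning phase
# per model (n_calls/n_initial_points are assumed parameter names).
atom.run(
    models=["LR", "RF"],
    metric=["f1", "roc_auc"],
    n_calls=10,
    n_initial_points=3,
)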

Evaluate the results

Lastly, we want to compare the models’ performance. To analyze a single model, we use the model subclass mentioned before. For example, to check the Random Forest’s feature importance, we type:

atom.RF.plot_feature_importance(show=10)

The actual estimator used to fit the data (from the scikit-learn package) can be accessed through the model’s estimator attribute.

atom.RF.estimator

RandomForestClassifier(n_jobs=1)
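Since this is a regular scikit-learn estimator, it can be used directly, for example to get class probabilities on the (already transformed) test set. The X_test attribute is assumed here, analogous to the dataset attribute used earlier.

# Use the fitted estimator like any scikit-learn model
# (atom.X_test is assumed to hold the transformed test features).
probabilities = atom.RF.estimator.predict_proba(atom.X_test)
probabilities[:5]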

But the real power of ATOM lies in the fact that we can easily compare the performance of all the models in its pipeline, for example by plotting the ROC curve.

atom.plot_roc()

Or by evaluating their performance on multiple metrics.

atom.evaluate()

Conclusion

ATOM assists a data scientist during the exploration phase of a machine learning project. It is capable of analyzing the data, applying standard data cleaning steps and comparing the performance of multiple models in just a few lines of code. But that’s not all! ATOM can also help with other common tasks in machine learning such as:

  • Detecting and removing outliers
  • Handling imbalanced datasets (see the sketch after this list)
  • Comparing models trained on different pipelines
  • Performing hyperparameter tuning
  • Combining models using ensemble techniques
  • And much more…
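For instance, the class imbalance visible in the distribution table at the start of this article could be addressed with atom’s balance method, which wraps samplers from the imbalanced-learn package. A minimal sketch, with the method and strategy names assumed from the documentation of this release and possibly differing between versions:

# Oversample the minority class in the training set before running
# the models again (method/strategy names assumed; verify in the docs).
atom.balance(strategy="smote")
atom.run(models=["LR", "RF"], metric="f1")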

For further information, check out the project’s GitHub or Documentation page. For bugs or feature requests, don’t hesitate to open an issue on GitHub or send me an email.
