
An Omni-ensemble Automated Machine Learning Toolkit – OptimalFlow

OptimalFlow is a high-level API toolkit that helps data scientists build models in an ensemble way and automate the machine learning workflow…

Photo by Hunter Harritt on Unsplash

OptimalFlow is an omni-ensemble automated machine learning toolkit built on the Pipeline Cluster Traversal Experiments approach. It helps data scientists build optimal models easily and automate the machine learning workflow with simple code.

OptimalFlow wraps the Scikit-learn supervised learning framework to automatically create a collection of machine learning pipelines (a Pipeline Cluster) based on algorithm permutations in each framework component.

It includes feature engineering methods in its preprocessing module, such as missing value imputation, categorical feature encoding, numeric feature standardization, and outlier winsorization. The models inherit algorithms from Scikit-learn and XGBoost estimators for classification and regression problems. Its extensible code structure also supports adding models from external estimator libraries, which sets OptimalFlow's scalability apart from most AutoML toolkits.
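As a concrete illustration of one of these preprocessing methods, here is a minimal sketch of percentile-based outlier winsorization in plain Python. This is only an illustration of the technique, not OptimalFlow's own implementation:

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to the given lower/upper percentiles.

    Extreme values are pulled back to the percentile bounds instead of
    being dropped, so the sample size stays unchanged.
    """
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_pct * (n - 1))]
    hi = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]  # 100 is an outlier
print(winsorize(data))  # → [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
```

The outlier is clipped to the 95th-percentile value rather than removed, which keeps downstream estimators from being dominated by extreme observations.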

OptimalFlow uses Pipeline Cluster Traversal Experiments as the optimizer to build an omni-ensemble workflow that searches for an optimal baseline model, covering feature preprocessing/selection optimization, hyperparameter tuning, model selection, and assessment.

Compared with other popular "AutoML or Automated Machine Learning" APIs, OptimalFlow is designed as an omni-ensemble ML workflow optimizer with a higher-level API that aims to avoid manual, repetitive train-and-evaluate experiments in general pipeline building.

It rebuilds the automated machine learning framework by shifting the focus from automating single pipeline components to a higher workflow level, with automated traversal experiments and evaluation mechanisms over ensemble pipelines (a Pipeline Cluster). In other words, OptimalFlow jumps out of a single pipeline's scope: it treats the whole pipeline as an entity and automatically produces all possible pipelines for assessment, until one of them leads to the optimal model. Thus, if a pipeline represents an automated workflow, OptimalFlow is designed to assemble all these workflows and find the optimal one. That's also the reason for the name OptimalFlow.

Fig1. OptimalFlow's workflow

To achieve that, OptimalFlow creates Pipeline Cluster Traversal Experiments to assemble all cross-matching pipelines covering the major tasks of the machine learning workflow, and applies traversal experiments to search for the optimal baseline model. Besides, by modularizing all key pipeline components in reusable packages, it allows all components to be custom-updated with high scalability.

The common machine learning workflow is automated by a "single pipeline" strategy, which was first introduced and is well supported by the scikit-learn library. In practice, data scientists need to run repetitive experiments on each component within one pipeline, adjusting algorithms and parameters, to get the optimal baseline model. I call this operating mechanism "Single Pipeline Repetitive Experiments". Whether in classic machine learning or in currently popular AutoML libraries, it is hard to avoid this single-pipeline-focused experimentation, which is the biggest pain point in the supervised modeling workflow.
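The one-component-at-a-time nature of these repetitive experiments can be sketched in a few lines of plain Python. The pipeline shape stays fixed while only one slot is swapped per run; the toy scores dictionary is a stand-in for real train/evaluate cycles, and all names are illustrative:

```python
# Sketch of "Single Pipeline Repetitive Experiments": the pipeline shape
# (scaler -> selector -> model) is fixed, and each run swaps only the
# estimator. Toy validation scores stand in for real train/evaluate cycles.
toy_scores = {"logistic": 0.81, "svm": 0.84, "xgboost": 0.88}

pipeline = {"scaler": "standard", "selector": "kbest", "model": None}

best_model, best_score = None, float("-inf")
for candidate in ["logistic", "svm", "xgboost"]:
    pipeline["model"] = candidate   # one manual experiment per run
    score = toy_scores[candidate]   # stand-in for fit + validate
    if score > best_score:
        best_model, best_score = candidate, score

print(best_model, best_score)  # xgboost 0.88
```

Tuning the scaler or the selector would require yet another round of such loops, which is exactly the repetition the single-pipeline strategy imposes.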

Fig 2, Single Pipeline Repetitive Experiments

The core concept/improvement in OptimalFlow is Pipeline Cluster Traversal Experiments, a framework theory first proposed by Tony Dong at Genpact's 2020 GVector Conference to optimize and automate the machine learning workflow using an ensemble pipelines algorithm.

Compared with the repetitive experiments run on a single pipeline in other automated or classic machine learning workflows, Pipeline Cluster Traversal Experiments are more powerful, since they expand the workflow from one dimension to two by ensembling all possible pipelines (the Pipeline Cluster) and automating the experiments. With a larger coverage scope, they find the best model without manual intervention; the ensemble design in each component also makes them more flexible and elastic in coping with unseen data. Pipeline Cluster Traversal Experiments thus offer data scientists a more convenient, "omni-automated" alternative approach to machine learning.
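The two-dimensional expansion can be sketched as a Cartesian product over every component's candidate algorithms. The component names and the `evaluate` stub below are illustrative assumptions, not OptimalFlow's actual API:

```python
import itertools

# Candidate algorithms for each pipeline component (illustrative names).
imputers = ["mean", "median"]
encoders = ["onehot", "label"]
selectors = ["kbest", "rfe"]
models = ["logistic", "svm", "xgboost"]

# Every cross-matching combination is one pipeline in the Pipeline Cluster.
pipeline_cluster = list(itertools.product(imputers, encoders, selectors, models))
print(len(pipeline_cluster))  # 2 * 2 * 2 * 3 = 24 candidate pipelines

def evaluate(pipeline):
    # Stub standing in for fitting the pipeline and scoring it on
    # validation data during the traversal experiment.
    return sum(len(step) for step in pipeline)

# The traversal evaluates every pipeline and keeps the best one.
best_pipeline = max(pipeline_cluster, key=evaluate)
```

The key difference from the single-pipeline loop is that every component varies at once, so the search covers the full grid of pipelines instead of one axis at a time.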

Fig 3, Pipeline Cluster Traversal Experiments

OptimalFlow consists of the 6 modules below; you can find more details about each module in the [Documentation](https://optimal-flow.readthedocs.io/). Each module can be used individually to simplify and automate the corresponding pipeline component, and you can find examples for each in the Documentation.

  • autoPP for feature preprocessing
  • autoFS for classification/regression feature selection
  • autoCV for classification/regression model selection and evaluation
  • autoPipe for Pipeline Cluster Traversal Experiments
  • autoViz for pipeline cluster visualization
  • autoFlow for logging & tracking.
Fig 4, Model Retrieval Diagram Generated by autoViz Module

There are some live notebooks (on Binder) and demos in the Documentation.

Using OptimalFlow, data scientists, from beginners to experienced users, can build optimal models easily without tedious experiments, and pay more attention to converting their industry domain knowledge into practical implementations at the deployment phase.


OptimalFlow was designed to be highly modularized from the beginning, which makes it easy to keep developing and lets users build applications on top of it.

Since version 0.1.10, it has included a "no-code" Web App as an application demo built on OptimalFlow. The web app allows simple click-and-select input for all of the parameters inside OptimalFlow, which means users can build an end-to-end automated machine learning workflow without any coding at all! (Read more details in the Documentation, or the story on TDS.)

Related Readings about OptimalFlow:

Ensemble Feature Selection in Machine Learning using OptimalFlow – Easy Way with Simple Code to Select top Features

Ensemble Model Selection & Evaluation in Machine Learning using OptimalFlow – Easy Way with Simple Code to Select the Optimal Model

End-to-end OptimalFlow Automated Machine Learning Tutorial with Real Projects-Formula E Laps Prediction Part 1

End-to-end OptimalFlow Automated Machine Learning Tutorial with Real Projects-Formula E Laps Prediction Part 2

Build No-code Automated Machine Learning Model with OptimalFlow Web App


In summary, OptimalFlow shares a few useful properties for data scientists:

  • Easy & less code – High-level APIs implement Pipeline Cluster Traversal Experiments, and each ML component is highly automated and modularized;
  • Well-ensembled – Each key component is an ensemble of popular algorithms with hyperparameter tuning included;
  • Omni-coverage – Pipeline Cluster Traversal Experiments are designed to cross-experiment with all key ML components, such as combined permuted input datasets, feature selection, and model selection;
  • Scalable & consistent – Each module can easily add new algorithms thanks to its ensemble & reusable design, with no need to modify existing code;
  • Adaptable – Pipeline Cluster Traversal Experiments make it easier to adapt to unseen datasets with the right pipeline;
  • Custom modification welcomed – Supports custom settings to add/remove algorithms or modify hyperparameters for elastic requirements.

As an initial stable release, all support is welcome! Please feel free to share your feedback, report issues, or join as a contributor at the OptimalFlow GitHub repository.


About me:

I am a healthcare & pharmaceutical data scientist and a big data analytics & AI enthusiast. I developed the OptimalFlow library to help data scientists build optimal models easily and automate the machine learning workflow with simple code.

As a big data insights seeker, process optimizer, and AI professional with years of analytics experience, I use machine learning and problem-solving skills in Data Science to turn data into actionable insights while providing strategic and quantitative products as solutions for optimal outcomes.

You can connect with me on my LinkedIn or GitHub.

