End-to-end OptimalFlow Automated Machine Learning Tutorial with Real Projects

Formula E Laps Prediction — Part 2

Tony Dong
Towards Data Science

--

In Part 1 of this tutorial, we discussed how to implement data engineering to prepare suitable datasets for the modeling steps. Now we will focus on how to use the OptimalFlow library (Documentation | GitHub) to implement Omni-ensemble automated machine learning.

Why use OptimalFlow? You can read its introduction in another story: An Omni-ensemble Automated Machine Learning — OptimalFlow.

Step 1: Install OptimalFlow

Set up your working environment with Python 3.7+, and install OptimalFlow with the pip command. Currently, the most recent version is 0.1.7. More package information can be found on PyPI.

pip install OptimalFlow

Step 2: Double-check missing values

After the data preparation in Part 1 of this tutorial, most of the features are ready to feed into the modeling process. However, missing values in categorical features are not accepted when the data flow reaches autoPP (OptimalFlow's auto feature preprocessing module), so we need to double-check the cleaned data and apply further cleaning to the features with missing values. For this problem, only the 'GROUP' feature has missing values, and we use the following code to convert them.
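The original conversion gist is not embedded here; below is a minimal pandas sketch of the idea, assuming the cleaned data from Part 1 sits in a DataFrame and that a constant placeholder category (here 'Unknown') is acceptable for the missing 'GROUP' entries.

```python
import pandas as pd

# Hypothetical slice of the cleaned dataset from Part 1; only the
# 'GROUP' column matters for this step.
df = pd.DataFrame({
    "GROUP": ["A", None, "B", None],
    "Total_Lap_Num": [45, 47, 44, 46],
})

# 'GROUP' is categorical, so fill the missing entries with a placeholder
# category instead of dropping rows (assumption: any constant label is
# fine for the downstream autoPP encoders).
df["GROUP"] = df["GROUP"].fillna("Unknown")

print(df["GROUP"].tolist())  # ['A', 'Unknown', 'B', 'Unknown']
```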

Step 3: Custom settings

OptimalFlow provides open interfaces for users to make custom settings in every module. Even in autoCV (OptimalFlow's model selection & evaluation module), you can customize the set of specific models or the hyperparameter search space. You can find more details in the Documentation.

In the code below, we set up the scaler and encoder algorithms for autoPP (OptimalFlow's auto feature preprocessing module), the selectors for autoFS (OptimalFlow's auto feature selection module), and the estimators for the autoCV module.

For feature selection and model selection with evaluation, we set up the selectors' and estimators' search spaces.
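The original settings gist is not embedded here; the sketch below follows the custom_parameters format shown in the OptimalFlow documentation. The concrete values (winsorizer bounds, sparsity, cols, and the selector/estimator lists) are illustrative placeholders, not the article's exact settings; check the Documentation for the full supported option lists.

```python
# Sketch of the custom settings passed to OptimalFlow's modules.
# Key names follow the custom_parameters format in the OptimalFlow docs;
# the values below are illustrative, not the article's originals.
custom_pp = {
    "scaler": ["None", "standard"],        # scaler candidates for autoPP
    "encode_band": [10],                   # cardinality cutoff between low/high encoders
    "low_encode": ["onehot", "label"],     # encoders for low-cardinality categorical features
    "high_encode": ["frequency", "mean"],  # encoders for high-cardinality categorical features
    "winsorizer": [(0.1, 0.1)],            # winsorize top/bottom 10% outliers
    "sparsity": [0.40],                    # hypothetical minimum-sparsity limit
    "cols": [30],                          # hypothetical maximum-columns limit
}

# Search spaces for feature selection (autoFS) and model selection (autoCV).
# These selector/estimator names are examples only.
custom_selectors = ["kbest_f", "rfe_svm", "rfecv_tree"]
custom_estimators = ["lr", "knn", "tree", "svm", "rf"]
```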

PLEASE NOTE: The "sparsity" and "cols" parameters are two limits you can set to narrow down the number of dataset combinations. Usually, when the sparsity of a dataset is too low, the variance within its features will be low, which means the information value will be low. You can try different values for these two parameters, based on how many datasets you are willing to send through the Pipeline Cluster Traversal Experiments (PCTE) process to find the optimal model with its pipeline workflow. Of course, the more dataset combinations the autoPP module generates, the more time the subsequent OptimalFlow steps will need. Conversely, when no dataset combination can meet the sparsity and column-number restrictions, the following process cannot continue. So be careful, and experiment with these settings.

Step 4: Pipeline Cluster Traversal Experiments(PCTE)

The core concept/improvement in OptimalFlow is Pipeline Cluster Traversal Experiments (PCTE), a framework first proposed by Tony Dong at the Genpact 2020 GVector Conference to optimize and automate machine learning workflows using an ensemble-pipelines algorithm.

Compared with the repetitive experiments on a single pipeline found in other automated or classic machine learning workflows, Pipeline Cluster Traversal Experiments is more powerful, since it expands the workflow from one dimension to two by ensembling all possible pipelines (the pipeline cluster) and automating the experiments. With a larger coverage scope to find the best model without manual intervention, and with more elasticity to cope with unseen data thanks to the ensemble design of each component, Pipeline Cluster Traversal Experiments give data scientists an alternative, more convenient and "Omni-automated" machine learning approach.

To implement the PCTE process, OptimalFlow provides the autoPipe module. More examples and function details can be found in the documentation.

Here are the attributes we've set in the autoPipe module:

  • for the autoPP module: pass in the custom parameters we set above, set the prediction column to "Total_Lap_Num", and select "regression" as the model_type (which prevents the dummy variable trap during encoding);
  • for the splitting rule: set 20% validation, 20% test, and 60% training data;
  • for the autoFS module: keep the top 10 features, with 5 cross-validation folds;
  • for the autoCV module: use the fastRegressor class, with 5 cross-validation folds.
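The bullet points above can be sketched as the following autoPipe wiring, adapted from the example pattern in the OptimalFlow documentation. The class and argument names for the regression case (dynaPreprocessing, pipeline_splitting_rule, dynaFS_reg, fastRegressor, evaluate_model) are my best reading of the docs and may differ by version, so treat this as an outline to verify against your installation, not the article's exact gist.

```python
from optimalflow.autoPipe import autoPipe
from optimalflow.autoPP import dynaPreprocessing
from optimalflow.autoFS import dynaFS_reg
from optimalflow.autoCV import evaluate_model, fastRegressor
from optimalflow.utilis_func import pipeline_splitting_rule

pipe = autoPipe([
    # autoPP: custom settings from Step 3 (the article's custom_pp dict),
    # predict "Total_Lap_Num", "reg" = regression mode
    ("autoPP", dynaPreprocessing(custom_parameters=custom_pp,
                                 label_col="Total_Lap_Num", model_type="reg")),
    # splitting rule: 20% validation, 20% test, 60% training data
    ("datasets_splitting", pipeline_splitting_rule(val_size=0.2, test_size=0.2,
                                                   random_state=13)),
    # autoFS: keep the top 10 features, 5 cross-validation folds
    ("autoFS", dynaFS_reg(fs_num=10, random_state=13, cv=5)),
    # autoCV: fastRegressor with 5 cross-validation folds
    ("autoCV", evaluate_model(model_type="reg"),
     fastRegressor(random_state=13, cv_num=5)),
])
```

Per the same doc pattern, calling `pipe.fit(df)` on the cleaned DataFrame then runs the whole PCTE process.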

Here's a brief description of this automated process:

  • Based on our previous custom settings, the autoPP module will generate 256 dataset combinations in total (per our custom sparsity and cols restrictions), and the PCTE process will go through all of them. It will automatically select the top 10 features using the autoFS module, then search for the best model with tuned hyperparameters, finally finding the optimal model with its pipeline workflow within the pipeline cluster.

You will find all the log information for the PCTE process in the auto-generated log files, which are created by OptimalFlow's autoFlow module.

Modules’ auto-generated log files

The PCTE process covers almost all of the machine learning steps data scientists need, and it automatically searches for the best optimal model along with its pipeline workflow information, making evaluation and implementation easy.

Although PCTE will not shorten each individual machine learning pipeline operation, data scientists can move on to other tasks while OptimalFlow takes the tedious model experimentation and tuning work off their hands.

This is what I believe a REAL automated machine learning process should be: OptimalFlow finishes all of these tasks automatically.

The outputs of the Pipeline Cluster Traversal Experiments (PCTE) process include:

  • the preprocessing algorithms applied to the prepared dataset combinations (DICT_PREP_INFO);
  • the selected top features for each dataset combination (DICT_FEATURE_SELECTION_INFO);
  • the model evaluation results (DICT_MODELS_EVALUATION);
  • the split dataset combinations (DICT_DATA);
  • the model selection ranking table (models_summary).

This is a useful feature for data scientists, since retrieving a previous machine learning workflow is painful when you want to reuse its outputs.

Step 5: Save pipeline cluster with optimal models

Since the PCTE process can run for a very long time when a large number of dataset combinations are used as input, we had better save the outputs of the previous step (the pipeline cluster with its optimal models) as pickles for the results interpretation and visualization steps.
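A minimal stdlib sketch of this saving step, using small placeholder objects in place of the real PCTE outputs (in the actual script they come from the autoPipe fitting step):

```python
import pickle

# Stand-in placeholders for the PCTE outputs listed above.
pcte_outputs = {
    "DICT_PREP_INFO": {"Dataset_0": "placeholder preprocessing info"},
    "DICT_FEATURE_SELECTION_INFO": {"Dataset_0": ["feature_a", "feature_b"]},
    "DICT_MODELS_EVALUATION": {"Dataset_0": {"model": "knn", "R2": 0.97}},
    "DICT_DATA": {"Dataset_0": "placeholder train/val/test splits"},
    "models_summary": "placeholder ranking table",
}

# Save each output as its own pickle, so the long PCTE run never has to
# be repeated for the interpretation and visualization steps.
for name, obj in pcte_outputs.items():
    with open(f"{name}.pkl", "wb") as f:
        pickle.dump(obj, f)

# Later steps can restore any of them, e.g. the preprocessing info:
with open("DICT_PREP_INFO.pkl", "rb") as f:
    dict_prep_info = pickle.load(f)
```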

Step 6: Interpret modeling results

Next, we will review our modeling results by importing the pickles saved in the previous step. We can use the following code to find the top 3 models with their optimal flows after the automated PCTE process:
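The retrieval gist is not embedded here; below is a pandas sketch of the idea, using an illustrative stand-in for the models_summary ranking table (the real one is loaded from the Step 5 pickle, and the column names here are assumptions, not OptimalFlow's actual schema):

```python
import pandas as pd

# Illustrative stand-in for the models_summary table produced by PCTE.
# Only the "knn" row's R-squared (0.971) comes from the article;
# the other rows and the column names are made up for the sketch.
models_summary = pd.DataFrame({
    "model":   ["knn", "rf", "svm", "tree", "lr"],
    "dataset": ["Dataset_214", "Dataset_9", "Dataset_41", "Dataset_3", "Dataset_77"],
    "r2":      [0.971, 0.957, 0.936, 0.902, 0.884],
})

# Rank the pipelines by R-squared and keep the top 3 optimal flows.
top_3 = models_summary.sort_values("r2", ascending=False).head(3)
print(top_3)
```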

It's very clear that the KNN algorithm with tuned hyperparameters performs best, and we can retrieve its whole pipeline workflow from PCTE's outputs.

Specifically:

The optimal pipeline consists of the KNN algorithm applied to Dataset_214 and Dataset_230 among the 256 dataset combinations, with the best parameters [('weights': 'distance'), ('n_neighbors': 5), ('algorithm': 'kd_tree')]. The R-squared is 0.971, the MAE is 1.157, the MSE is 5.928, the RMSE is 2.435 (√5.928), and the latency score is 3.0.

The pipeline performance assessment results for all 256 datasets can be generated by the autoViz module's dynamic table function (more details and other visualization examples can be found here), and you can find the output at ./temp-plot.html.

The top 10 features selected by autoFS module are:

The feature preprocessing details for Dataset_214 and Dataset_230: winsorization of outliers at the top 10% and bottom 10%; the 'match_name' and 'DATE_ONLY' features encoded with the mean encoding approach; the 'GROUP' feature encoded with the OneHot encoding approach; and no scaler involved in the preprocessing step.
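For intuition, here is the winsorization idea on a toy series, clipping values at the 10th and 90th percentiles so that extreme laps cannot dominate the feature (a generic illustration of the technique, independent of autoPP's own implementation):

```python
import pandas as pd

# Toy feature with one extreme outlier (100).
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Winsorization at top/bottom 10%: clip at the 10th and 90th percentiles.
low, high = s.quantile(0.10), s.quantile(0.90)
winsorized = s.clip(lower=low, upper=high)

# The outlier is pulled in to the 90th percentile instead of being dropped.
print(low, high)
```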

That's all. We made our first OptimalFlow automated machine learning project. Simple and easy, right? 😎

*More things to consider:

Our top pipeline's model has a very high R-squared value, over 0.9. For most physical processes this value might not be surprising; however, if we were predicting human behavior, it would be suspiciously high. So we also need to consider other metrics, such as MSE.

In this tutorial, we simplified the real project into a case more suitable for OptimalFlow beginners. Given that starting point, this result is acceptable as your first OptimalFlow automated machine learning output.

Here are some suggestions if you want to keep improving the sample script toward a more practical optimal model:

  • An unusually high R-squared value often means overfitting has happened, so drop more features to prevent it;
  • Aggregation is a good way to assemble data, but it also loses the lap-by-lap and timing-by-timing variance information;
  • The scaling approach is also essential for preventing overfitting; we could move "None" out of our custom_pp settings and add other scalers (e.g. minmax, robust) in Step 3;
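To make the scaling suggestion concrete, here is what min-max and robust scaling do to a toy feature. This uses the standard formulas directly; scikit-learn's MinMaxScaler and RobustScaler implement the same ideas, and the "minmax"/"robust" options in Step 3 should correspond to them.

```python
import pandas as pd

# Toy numeric feature with one outlier (100.0).
s = pd.Series([2.0, 4.0, 6.0, 8.0, 100.0])

# Min-max scaling: squeeze values into [0, 1]; sensitive to the outlier,
# which pins the maximum.
minmax = (s - s.min()) / (s.max() - s.min())

# Robust scaling: center on the median and divide by the IQR, so the
# outlier barely distorts the bulk of the data.
iqr = s.quantile(0.75) - s.quantile(0.25)
robust = (s - s.median()) / iqr

print(robust.tolist())  # [-1.0, -0.5, 0.0, 0.5, 23.5]
```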

In Summary:

OptimalFlow is an easy-to-use API tool that achieves Omni-ensemble automated machine learning with simple code, and it is also a best-practice library demonstrating the Pipeline Cluster Traversal Experiments (PCTE) theory.

Its 6 modules can not only be connected to implement the PCTE process, but can also be used individually to optimize the components of a traditional machine learning workflow. You can find their individual use cases in the Documentation.

“An algorithmicist looks at no free lunch.” — Culberson

Last but not least, as data scientists we should always keep in mind that no matter what kind of automated machine learning algorithm we use, the 'no free lunch theorem' always applies.

About me:

I am a healthcare & pharmaceutical data scientist and a big data analytics & AI enthusiast. I developed the OptimalFlow library to help data scientists build optimal models in an easy way and automate machine learning workflows with simple code.

As a big data insights seeker, process optimizer, and AI professional with years of analytics experience, I use machine learning and problem-solving skills in data science to turn data into actionable insights while providing strategic and quantitative products as solutions for optimal outcomes.

You can connect with me on LinkedIn or GitHub.
