TPOT Automated Machine Learning in Python
In this post I’m sharing some of my explorations with TPOT, an automated machine learning (autoML) tool in Python. The goal is to see what TPOT can do and if it merits becoming part of your machine learning workflow.
Automated machine learning doesn’t replace the data scientist (at least not yet), but it might be able to help you find good models faster. TPOT bills itself as your Data Science Assistant.
TPOT is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search.
So TPOT helps you find good algorithms. Note that it isn’t designed for automating deep learning — something like AutoKeras might be helpful there.
TPOT is built on the scikit-learn library and follows the scikit-learn API closely. It can be used for regression and classification tasks and has special implementations for medical research.
TPOT is open source, well documented, and under active development. Its development was spearheaded by researchers at the University of Pennsylvania. TPOT appears to be one of the most popular autoML libraries, with nearly 4,500 GitHub stars as of August 2018.
How does TPOT work?
TPOT has what its developers call a genetic search algorithm to find the best parameters and model ensembles. It could also be thought of as a natural selection or evolutionary algorithm. TPOT tries a pipeline, evaluates its performance, and randomly changes parts of the pipeline in search of better performing algorithms.
AutoML algorithms aren’t as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline. (source: TPOT docs)
This power of TPOT comes from evaluating all kinds of possible pipelines automatically and efficiently. Doing this manually is cumbersome and slower.
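To make the evolutionary idea concrete, here is a minimal sketch of a mutate-and-select search in plain Python. It is not TPOT’s implementation, which works on whole pipeline trees and also uses crossover. The search space, the `toy_score` function, and every other name below are made up for illustration.

```python
import random

# Hypothetical search space: each "pipeline" is one choice per slot.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["tree", "knn", "logreg"],
    "depth": [1, 2, 3, 4, 5],
}


def toy_score(pipeline):
    """Stand-in for cross-validated accuracy (invented for this demo)."""
    score = 0.5
    score += 0.2 if pipeline["model"] == "tree" else 0.0
    score += 0.1 if pipeline["scaler"] == "standard" else 0.0
    score += 0.04 * pipeline["depth"]
    return score


def random_pipeline(rng):
    return {slot: rng.choice(options) for slot, options in SEARCH_SPACE.items()}


def mutate(pipeline, rng):
    """Copy the pipeline and randomly change one slot."""
    child = dict(pipeline)
    slot = rng.choice(list(SEARCH_SPACE))
    child[slot] = rng.choice(SEARCH_SPACE[slot])
    return child


def evolve(generations=10, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        offspring = [mutate(rng.choice(population), rng) for _ in range(population_size)]
        # Keep the fittest individuals from parents plus offspring.
        survivors = sorted(population + offspring, key=toy_score, reverse=True)
        population = survivors[:population_size]
        best_per_generation.append(toy_score(population[0]))
    return population[0], best_per_generation


best, history = evolve()
print(best)
print(history)
```

Because parents compete with their offspring for survival, the best score in this sketch can never get worse from one generation to the next. More generations or a bigger population simply means more candidates get tried.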
Running TPOT
Instantiating, fitting, and scoring the TPOT classifier is similar to any other sklearn classifier. Here’s the format:
tpot = TPOTClassifier()
tpot.fit(X_train, y_train)
tpot.score(X_test, y_test)
TPOT comes with its own variation of one-hot encoding. Note that it could add it to a pipeline automatically because it treats features with fewer than 10 unique values as categorical. If you want to use your own encoding strategy you can encode your data and then feed it into TPOT.
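The “fewer than 10 unique values” rule is easy to picture with a few lines of plain Python. This is only a sketch of the heuristic as described above, not TPOT’s OneHotEncoder code. The function name `find_categorical_columns` and the sample rows are hypothetical.

```python
def find_categorical_columns(rows, threshold=10):
    """Return indexes of columns with fewer than `threshold` unique values."""
    n_columns = len(rows[0])
    categorical = []
    for col in range(n_columns):
        unique_values = {row[col] for row in rows}
        if len(unique_values) < threshold:
            categorical.append(col)
    return categorical


# Column 0 cycles through 3 values; column 1 has 20 distinct values.
rows = [(i % 3, i * 1.5) for i in range(20)]
print(find_categorical_columns(rows))
```

One consequence worth keeping in mind: a genuinely numeric column that happens to take only a handful of values would be flagged as categorical under a rule like this.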
You can choose the scoring criterion with the scoring parameter (although a bug with Jupyter and multiple processor cores prevents you from using a custom scoring criterion with multiple processor cores in a Jupyter notebook).
The scoring parameter sets the criterion TPOT uses internally as it searches for the best pipeline, and tpot.score reports the same criterion on the test set after TPOT has chosen the best algorithms. If you want a different metric for the final evaluation, you can compute it yourself from the fitted pipeline’s predictions.
TPOT writes Python code for the best performing pipeline, along with its score, to a file with tpot.export(). You can choose the level of verbosity you would like to see as TPOT runs and have it write pipelines to an output file as it runs in case it terminates early for some reason (e.g. your Kaggle Kernel crashes).
How long does TPOT take to run?
The short answer is that it depends.
TPOT was designed to run for a while: hours or even a day. Less complex problems with smaller datasets can see great results in minutes, though. You can adjust several parameters for TPOT to finish its searches faster, but at the expense of a less thorough search for an optimal pipeline. It was not designed to be a comprehensive search of preprocessing steps, feature selection, algorithms, and parameters, but it can come close if you set its parameters to be more exhaustive.
As the docs explain:
…TPOT will take a while to run on larger datasets, but it’s important to realize why. With the default TPOT settings (100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search will take. That is 10,000 model configurations to evaluate with 10-fold cross-validation, which means that roughly 100,000 models are fit and evaluated on the training data in one grid search.
Some of the data sets we’ll see below only need a few minutes to find algorithms that score well; others might need days.
Here are the default TPOTClassifier parameters:
generations=100,
population_size=100,
offspring_size=None, # Jeff notes this gets set to population_size
mutation_rate=0.9,
crossover_rate=0.1,
scoring="Accuracy", # for Classification
cv=5,
subsample=1.0,
n_jobs=1,
max_time_mins=None,
max_eval_time_mins=5,
random_state=None,
config_dict=None,
warm_start=False,
memory=None,
periodic_checkpoint_folder=None,
early_stop=None,
verbosity=0,
disable_update_check=False
A description of each parameter can be found in the docs. Here are a few key ones that determine the number of pipelines TPOT will search through:
generations: int, optional (default: 100)
Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total (emphasis mine).
population_size: int, optional (default: 100)
Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.
offspring_size: int, optional (default: None)
Number of offspring to produce in each GP generation. By default, offspring_size = population_size.
When starting out with TPOT it’s worth setting verbosity=3 and periodic_checkpoint_folder="any_string_you_like" so that you can watch the models evolve and training scores improve. You’ll see some errors as some combinations of pipeline elements are incompatible, but don’t sweat that.
If you’re running on multiple cores and not using a custom scoring function, set n_jobs=-1 to use all available cores and speed up TPOT.
Search Space
Here are the classification algorithms and parameters TPOT chooses from as of version 0.9:
'sklearn.naive_bayes.BernoulliNB': { 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.], 'fit_prior': [True, False] },
'sklearn.naive_bayes.MultinomialNB': { 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.], 'fit_prior': [True, False] },
'sklearn.tree.DecisionTreeClassifier': { 'criterion': ["gini", "entropy"], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21) },
'sklearn.ensemble.ExtraTreesClassifier': { 'n_estimators': [100], 'criterion': ["gini", "entropy"], 'max_features': np.arange(0.05, 1.01, 0.05), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False] },
'sklearn.ensemble.RandomForestClassifier': { 'n_estimators': [100], 'criterion': ["gini", "entropy"], 'max_features': np.arange(0.05, 1.01, 0.05), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False] },
'sklearn.ensemble.GradientBoostingClassifier': { 'n_estimators': [100], 'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'subsample': np.arange(0.05, 1.01, 0.05), 'max_features': np.arange(0.05, 1.01, 0.05) },
'sklearn.neighbors.KNeighborsClassifier': { 'n_neighbors': range(1, 101), 'weights': ["uniform", "distance"], 'p': [1, 2] },
'sklearn.svm.LinearSVC': { 'penalty': ["l1", "l2"], 'loss': ["hinge", "squared_hinge"], 'dual': [True, False], 'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.] },
'sklearn.linear_model.LogisticRegression': { 'penalty': ["l1", "l2"], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.], 'dual': [True, False] },
'xgboost.XGBClassifier': { 'n_estimators': [100], 'max_depth': range(1, 11), 'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.], 'subsample': np.arange(0.05, 1.01, 0.05), 'min_child_weight': range(1, 21), 'nthread': [1] }
And TPOT can stack classifiers, including the same classifier multiple times. One of the core developers of TPOT explains how it works in this issue:
The pipeline
ExtraTreesClassifier(ExtraTreesClassifier(input_matrix, True, 'entropy', 0.10000000000000001, 13, 6), True, 'gini', 0.75, 17, 4)
does the following:
Fit all of the original features using an ExtraTreesClassifier
Take the predictions from that ExtraTreesClassifier and create a new feature using those predictions
Pass the original features plus the new “predicted feature” to the 2nd ExtraTreesClassifier and use its predictions as the final predictions of the pipeline
This process is called stacking classifiers, which is a fairly common tactic in machine learning.
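The three steps in that explanation can be sketched without any libraries. The two tiny “models” below (a nearest-centroid classifier feeding a 1-nearest-neighbor classifier) and the sample data are stand-ins I made up to show the data flow. They are not the ExtraTreesClassifier pipeline from the quote and not TPOT’s StackingEstimator code.

```python
def fit_centroids(X, y):
    """First-stage model: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_centroid(centroids, row):
    return min(centroids, key=lambda label: squared_distance(centroids[label], row))


def add_prediction_feature(centroids, X):
    """Append the first-stage prediction to each row as a new feature."""
    return [list(row) + [predict_centroid(centroids, row)] for row in X]


def predict_1nn(X_train, y_train, row):
    """Second-stage model: 1-nearest-neighbor on the augmented features."""
    nearest = min(range(len(X_train)), key=lambda i: squared_distance(X_train[i], row))
    return y_train[nearest]


X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y_train = [0, 0, 1, 1]
X_test = [[0.1, 0.2], [1.1, 1.0]]

# Step 1: fit the first model on the original features.
centroids = fit_centroids(X_train, y_train)
# Step 2: turn its predictions into an extra feature column.
X_train_stacked = add_prediction_feature(centroids, X_train)
X_test_stacked = add_prediction_feature(centroids, X_test)
# Step 3: the second model sees original features plus the predicted feature.
final = [predict_1nn(X_train_stacked, y_train, row) for row in X_test_stacked]
print(X_test_stacked)
print(final)
```

The key point is that the second model never replaces the original features. It gets them plus one extra column holding the first model’s opinion.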
And here are the 14 preprocessors that could be applied by TPOT as of version 0.9.
'sklearn.preprocessing.Binarizer': { 'threshold': np.arange(0.0, 1.01, 0.05) },
'sklearn.decomposition.FastICA': { 'tol': np.arange(0.0, 1.01, 0.05) },
'sklearn.cluster.FeatureAgglomeration': { 'linkage': ['ward', 'complete', 'average'], 'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine'] },
'sklearn.preprocessing.MaxAbsScaler': { },
'sklearn.preprocessing.MinMaxScaler': { },
'sklearn.preprocessing.Normalizer': { 'norm': ['l1', 'l2', 'max'] },
'sklearn.kernel_approximation.Nystroem': { 'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'], 'gamma': np.arange(0.0, 1.01, 0.05), 'n_components': range(1, 11) },
'sklearn.decomposition.PCA': { 'svd_solver': ['randomized'], 'iterated_power': range(1, 11) },
'sklearn.preprocessing.PolynomialFeatures': { 'degree': [2], 'include_bias': [False], 'interaction_only': [False] },
'sklearn.kernel_approximation.RBFSampler': { 'gamma': np.arange(0.0, 1.01, 0.05) },
'sklearn.preprocessing.RobustScaler': { },
'sklearn.preprocessing.StandardScaler': { },
'tpot.builtins.ZeroCount': { },
'tpot.builtins.OneHotEncoder': { 'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25], 'sparse': [False] } (emphasis mine)
That’s a pretty comprehensive list of sklearn ml algorithms and even a few you might not have used for preprocessing, including Nystroem and RBFSampler. The final preprocessing algorithm listed is the custom OneHotEncoder mentioned before. Note that the list contains no neural network algorithms.
The number of combinations appears to be nearly infinite — you can stack algorithms, including instances of the same algorithm. There may be an internal cap on the number of steps in the pipeline, but suffice to say there are a plethora of possible pipelines.
TPOT will likely not result in the same algorithm selection if you run it twice (maybe not even if random_state is set, I found, as discussed below). As the docs explain:
If you’re working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may result in different pipeline recommendations. TPOT’s optimization algorithm is stochastic in nature, which means that it uses randomness (in part) to search the possible pipeline space. When two TPOT runs recommend different pipelines, this means that the TPOT runs didn’t converge due to lack of time or that multiple pipelines perform more-or-less the same on your dataset.
Less talk — more action. Let’s try out TPOT on some data!
Dataset 1: MNIST Digit Classification
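First we’ll look at a classification task: handwritten digit classification in the style of MNIST, using the digits dataset included in sklearn’s datasets. The full MNIST database contains 70,000 images of handwritten Arabic digits in 28x28 pixels, labeled from 0 to 9. Note that sklearn’s load_digits, which the code here uses, loads a much smaller set of 1,797 images in 8x8 pixels.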
TPOT comes standard on the Kaggle Docker image, so you only need to import it if you’re using Kaggle — you don’t need to install it.
Here’s my code — available on this Kaggle Kernel, in a slightly different form and possibly with a few modifications.
# import the usual stuff
import timeit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# import TPOT and sklearn stuff
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import sklearn.metrics

# create train and test sets
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=34)

tpot = TPOTClassifier(verbosity=3,
                      scoring="balanced_accuracy",
                      random_state=23,
                      periodic_checkpoint_folder="tpot_mnst1.txt",
                      n_jobs=-1,
                      generations=10,
                      population_size=100)

# collect the results of each run
times = []
scores = []
winning_pipes = []

# run three iterations and time them
for x in range(3):
    start_time = timeit.default_timer()
    tpot.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)
    winning_pipes.append(tpot.fitted_pipeline_)
    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_mnist_pipeline.py')

# convert seconds to minutes
times = [time/60 for time in times]
print('Times:', times)
print('Scores:', scores)
print('Winning pipelines:', winning_pipes)
As mentioned above, the total number of pipelines is equal to POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE.
For example, if you set population_size=20 and generations=5, then offspring_size=20 (because offspring_size equals population_size by default). And you’ll have a total of 120 pipelines because 20 + (5 * 20) = 120.
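If you want to budget a run before starting it, the formula is easy to wrap in a helper. The function names here are my own, and the fit count is only a rough upper bound that assumes every pipeline is fit once per cross-validation fold (cv=5 is the default listed earlier). Early stopping, time limits, and skipped duplicates would all lower it.

```python
def total_pipelines(population_size=100, generations=100, offspring_size=None):
    """POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE, per the TPOT docs."""
    if offspring_size is None:
        offspring_size = population_size  # documented default
    return population_size + generations * offspring_size


def total_model_fits(n_pipelines, cv=5):
    """Rough upper bound: each pipeline is fit once per cross-validation fold."""
    return n_pipelines * cv


small_run = total_pipelines(population_size=20, generations=5)
print(small_run)
print(total_model_fits(small_run))
print(total_pipelines())  # the default settings
```

The same helper reproduces the run sizes used in this post: population_size=10 with generations=5 gives 60 pipelines, and population_size=80 with generations=8 gives 720.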
You can see it doesn’t take much code at all to run this data set — and that includes a loop to time and test it repeatedly.
With 10 possible classes and no reason to prefer one outcome to another, accuracy — the TPOT classification default — is a fine metric for this task.
Here’s the relevant code section.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=34)

tpot = TPOTClassifier(verbosity=3,
                      scoring="accuracy",
                      random_state=32,
                      periodic_checkpoint_folder="tpot_results.txt",
                      n_jobs=-1,
                      generations=5,
                      population_size=10,
                      early_stop=5)
And here are the results:
Times: [4.740584810283326, 3.497970838083226, 3.4362493358499098]
Scores: [0.9733333333333334, 0.9644444444444444, 0.9666666666666667]
Winning pipelines: [Pipeline(memory=None,
steps=[('gradientboostingclassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=7,
max_features=0.15000000000000002, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
...auto', random_state=None,
subsample=0.9500000000000001, verbose=0, warm_start=False))]), Pipeline(memory=None,
steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('gradientboostingclassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002, max_leaf_...auto', random_state=None,
subsample=0.9500000000000001, verbose=0, warm_start=False))]), Pipeline(memory=None,
steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('gradientboostingclassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002, max_leaf_...auto', random_state=None,
subsample=0.9500000000000001, verbose=0, warm_start=False))])]
Note that with only 60 pipelines, far fewer than what TPOT suggests, we were able to see pretty good scores: over 97% accuracy on the test set in one case.
Reproducibility
Does TPOT find the same winning pipeline every time with the same random_state set? Not necessarily. Individual algorithms such as RandomForestClassifier() have their own random_state parameters that don’t get set.
TPOT doesn’t always find the same result if you instantiate one classifier and then fit it repeatedly like we do in the for loop in the code above. I ran three very small sets of 60 pipelines with random_state set and Kaggle’s GPU setting on. Note that we get slightly different pipelines and thus slightly different test set scores across the three runs.
Here’s another example of a small number of pipelines with random state set and using Kaggle’s CPU setting.
Times: [2.8874817832668973, 0.043678393283335025, 0.04388708711679404]
Scores: [0.9622222222222222, 0.9622222222222222, 0.9622222222222222]
Winning pipelines: [Pipeline(memory=None,
steps=[('gradientboostingclassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
....9500000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False))]), Pipeline(memory=None,
steps=[('gradientboostingclassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
....9500000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False))]), Pipeline(memory=None,
steps=[('gradientboostingclassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.5, loss='deviance', max_depth=2,
max_features=0.15000000000000002, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
....9500000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False))])]
The same pipeline was found each of the three times.
Note that the run time is much faster after the first iteration. TPOT does seem to remember when it has seen an algorithm and doesn’t rerun it, even if it’s a second fit and you’ve set memory=False. Here’s what you’ll see if you set the verbosity=3 when it finds such a previously evaluated pipeline:
Pipeline encountered that has previously been evaluated during the optimization process. Using the score from the previous evaluation.
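That message suggests a lookup keyed on the pipeline itself. Here is a minimal sketch of that kind of memoization, assuming each pipeline can be identified by its string form. `CachedEvaluator` and the fake scoring function are hypothetical names for illustration, not TPOT internals.

```python
class CachedEvaluator:
    """Score each distinct pipeline description at most once."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.cache = {}
        self.evaluations = 0

    def score(self, pipeline_str):
        if pipeline_str not in self.cache:
            self.evaluations += 1  # the expensive fit would happen here
            self.cache[pipeline_str] = self.score_fn(pipeline_str)
        return self.cache[pipeline_str]


# Hypothetical scoring function standing in for cross-validation.
evaluator = CachedEvaluator(score_fn=lambda s: len(s) / 100)

candidates = ["KNN(input)", "Tree(input)", "KNN(input)", "Tree(input)", "KNN(Tree(input))"]
scores = [evaluator.score(c) for c in candidates]
print(evaluator.evaluations, "evaluations for", len(candidates), "candidates")
```

With a small search like the ones above, many candidates in later runs are repeats, so skipping them would account for most of the time saved.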
Longer runs for higher accuracy
How does TPOT do if you make a large number of pipelines? To really see the power of TPOT for the MNIST digits task, you need over 500 total pipelines to run. This will take at least an hour if you’re running it on Kaggle. Then you will see higher accuracy scores and might see more complex models.
Chained, or stacked, ensembles where the outputs of one machine learning algorithm feed into another are what you’ll likely see if you have a larger number of pipelines and a non-trivial task.
0.9950861171999883
knn = KNeighborsClassifier(
DecisionTreeClassifier(
OneHotEncoder(input_matrix, OneHotEncoder__minimum_fraction=0.15, OneHotEncoder__sparse=False),
DecisionTreeClassifier__criterion=gini,
DecisionTreeClassifier__max_depth=5,
DecisionTreeClassifier__min_samples_leaf=20,
DecisionTreeClassifier__min_samples_split=17),
KNeighborsClassifier__n_neighbors=1,
KNeighborsClassifier__p=2,
KNeighborsClassifier__weights=distance)
This is a 0.995 average internal CV accuracy score after running for over an hour and generating over 600 pipelines. The kernel crashed before completion, so I didn’t get to see a test set score and couldn’t get an outputted model, but this looks quite promising for TPOT.
The algorithm uses a DecisionTreeClassifier with TPOT’s custom OneHotEncoder categorical encodings feeding into KNeighborsClassifier.
Here’s a similar internal score with a different pipeline resulting from a different random_state after nearly 800 pipelines.
0.9903723557310828 KNeighborsClassifier(Normalizer(OneHotEncoder(RandomForestClassifier(MinMaxScaler(input_matrix), RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.55, RandomForestClassifier__min_samples_leaf=6, RandomForestClassifier__min_samples_split=15, RandomForestClassifier__n_estimators=100), OneHotEncoder__minimum_fraction=0.2, OneHotEncoder__sparse=False), Normalizer__norm=max), KNeighborsClassifier__n_neighbors=4, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=distance)
TPOT found a pipeline with KNN, one-hot encoding, normalization, and random forest. It took two and a half hours. The previous one was faster and scored better, but sometimes that’s what happens with the stochastic nature of TPOT’s genetic search algorithm. 😉
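These nested pipeline strings read inside-out: the innermost operator runs first. A small helper can list the steps in execution order. This is a sketch that assumes a single nested chain like the ones above (branching pipelines are not handled), and the pipeline string is abbreviated from the log line, with most parameters dropped.

```python
import re


def execution_order(pipeline_str):
    """List operator names from innermost (runs first) to outermost (runs last).

    Assumes a single nested chain; branching pipelines are not handled.
    """
    # An operator name is an identifier immediately followed by "(".
    outer_to_inner = re.findall(r"([A-Za-z_]\w*)\(", pipeline_str)
    return list(reversed(outer_to_inner))


pipeline = (
    "KNeighborsClassifier(Normalizer(OneHotEncoder(RandomForestClassifier("
    "MinMaxScaler(input_matrix), RandomForestClassifier__n_estimators=100), "
    "OneHotEncoder__sparse=False), Normalizer__norm=max), "
    "KNeighborsClassifier__n_neighbors=4)"
)
steps = execution_order(pipeline)
print(" -> ".join(steps))
```

Read this way, the data is scaled first, the random forest’s predictions are added next, and KNN makes the final call.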
Takeaways from MNIST digit classification task
- TPOT can perform really well on this image recognition task if you give it enough time.
- TPOT works better with more pipelines.
- If you need reproducibility for a task, TPOT isn’t the tool you want.
Dataset 2: Mushroom Classification
For a second dataset I chose the popular mushroom classification task. The goal is to determine correctly whether a mushroom is poisonous based on its recorded attributes. This is not an image classification task. It’s set up as a binary task so that all potentially dangerous mushrooms are grouped as one category and safe-to-eat mushrooms as another category.
My code is available on this Kaggle Kernel.
TPOT can routinely fit a perfect model quickly on this data set. It did so in under two minutes. This is much better performance and speed than when I tested this dataset without TPOT with many scikit-learn classification algorithms, a wide range of nominal data encodings, and no parameter tuning.
On three runs with the same TPOTClassifier instance and the same random state set here’s what TPOT found:
Times: [1.854785452616731, 1.5694829618000463, 1.3383520993001488]
Scores: [1.0, 1.0, 1.0]
Interestingly, it found a different best algorithm each time. It found a DecisionTreeClassifier, then a KNeighborsClassifier, and then a stacked RandomForestClassifier with BernoulliNB.
Let’s dig into reproducibility a bit more. Let’s run it again with exactly the same settings.
Times: [1.8664863013502326, 1.5520636909670429, 1.3386059726501116]
Scores: [1.0, 1.0, 1.0]
We see the same set of three pipelines, very similar times, and the same scores on the test set.
Now let’s try splitting the cell into multiple different cells and instantiating a TPOT instance in each one. Here’s the code:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=25)

tpot = TPOTClassifier(
    verbosity=3,
    scoring="accuracy",
    random_state=25,
    periodic_checkpoint_folder="tpot_mushroom_results.txt",
    n_jobs=-1,
    generations=5,
    population_size=10,
    early_stop=5
)
The result of the second run now matches the result of the first one and took almost the same time (Score = 1.0, Time = 1.9 minutes, pipeline = Decision Tree Classifier). The key for higher reproducibility is that we are instantiating a new instance of the TPOT classifier in each cell.
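One plausible way to picture this, which I haven’t verified against TPOT’s source, is a random number generator whose state keeps advancing across fits on the same object. The toy class below is purely an analogy built on Python’s random module. `ToySearch` is a made-up name.

```python
import random


class ToySearch:
    """Stand-in for a stochastic optimizer that owns a seeded RNG."""

    def __init__(self, random_state):
        self.rng = random.Random(random_state)

    def fit(self):
        # Each fit draws from the same RNG, so the RNG state keeps advancing.
        return [self.rng.randrange(1000) for _ in range(5)]


# Reusing one instance: later fits continue from where the RNG left off.
shared = ToySearch(random_state=25)
reused_runs = [shared.fit() for _ in range(3)]

# Fresh instance per run: every fit starts from the same seed.
fresh_runs = [ToySearch(random_state=25).fit() for _ in range(3)]

print(reused_runs)
print(fresh_runs)
```

In this analogy the reused instance gives three different results, but rerunning the whole loop from scratch gives the same three results in the same order. That matches the pattern in the mushroom runs above.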
Time results from 10 sets of 30 pipelines with random_state on train_test_split and TPOT set to 10 are below. All pipelines correctly classified all mushrooms on the test set. TPOT was quite fast on this fairly easy-to-learn task.
Takeaways from Mushroom Task
TPOT performs well and quickly for this basic classification task.
As a comparison, this Kaggle kernel on the mushroom set in R is very nice and explores a variety of algorithms and gets very close to perfect accuracy. But it doesn’t quite reach 100% and it certainly took quite a bit more time to prepare and train than our implementation of TPOT.
I would strongly consider TPOT as a time saver for a task like this in the future, at least as a first step.
Dataset 3: Ames Housing Prediction
Next we turn to a regression task to see how TPOT performs. We’ll predict housing property sale values with the popular Ames, Iowa Housing Price Prediction dataset. My code is available on this Kaggle Kernel.
For this task, I did some basic imputation of missing values first. I filled missing numeric column values with the most frequent value for the column, because some of those columns contain ordinal data. With more time I’d categorize the columns and use different imputation strategies depending on interval, ordinal, or nominal data types.
String column missing values were filled with a “missing” label prior to ordinal encoding because not all columns had a most frequent value. TPOT’s one hot encoding algorithm would then make one more dimension per feature that would indicate that the data had a missing value for that feature.
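The imputation rules above boil down to a few lines. This is a plain-Python sketch rather than the code from my kernel. It uses None as the missing marker (real data would more likely hold NaN), and `impute_column` is a name made up for this example.

```python
from collections import Counter


def impute_column(values):
    """Fill gaps with the mode for numeric columns, or "missing" for strings."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    if all(isinstance(v, str) for v in present):
        fill = "missing"
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


numeric_filled = impute_column([3, None, 3, 5, None])
string_filled = impute_column(["Gd", None, "TA", None])
print(numeric_filled)
print(string_filled)
```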
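TPOTRegressor uses mean squared error scoring by default. It reports the negated value (scikit-learn’s neg_mean_squared_error) so that higher is always better, which is why the scores below are negative.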
Here’s a run with only 60 pipelines.
import timeit
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split

# X and y come from the imputation and encoding steps described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 33)

# instantiate tpot
tpot = TPOTRegressor(verbosity=3,
                     random_state=25,
                     n_jobs=-1,
                     generations=5,
                     population_size=10,
                     early_stop = 5,
                     memory = None)
times = []
scores = []
winning_pipes = []

# run 3 iterations
for x in range(3):
    start_time = timeit.default_timer()
    tpot.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)
    winning_pipes.append(tpot.fitted_pipeline_)
    scores.append(tpot.score(X_test, y_test))
    tpot.export('tpot_ames.py')

# output results
times = [time/60 for time in times]
print('Times:', times)
print('Scores:', scores)
print('Winning pipelines:', winning_pipes)
Here are the results of these three little runs.
Times: [3.8920086714831994, 1.4063017464330188, 1.2469199204002508]
Scores: [-905092886.3009057, -922269561.2683483, -949881926.6436856]
Winning pipelines: [Pipeline(memory=None,
steps=[('zerocount', ZeroCount()), ('xgbregressor', XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=9, min_child_weight=18, missing=None, n_estimators=100,
n_jobs=1, nthread=1, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.5))]), Pipeline(memory=None,
steps=[('xgbregressor', XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=9, min_child_weight=11, missing=None, n_estimators=100,
n_jobs=1, nthread=1, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.5))]), Pipeline(memory=None,
steps=[('stackingestimator', StackingEstimator(estimator=RidgeCV(alphas=array([ 0.1, 1. , 10. ]), cv=None, fit_intercept=True,
gcv_mode=None, normalize=False, scoring=None, store_cv_values=False))), ('maxabsscaler-1', MaxAbsScaler(copy=True)), ('maxabsscaler-2', MaxAbsScaler(copy=True)), ('xgbr... reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.5))])]
The runs finished pretty quickly and found different winning pipelines each time. The scores are negated mean squared errors, so taking the square root of their absolute values gives us the Root Mean Squared Error (RMSE). The RMSE was around $30,000 on average.
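Because the reported scores are negative, the sign has to be dropped before taking the square root. Here is that conversion applied to the three scores above. The helper name is my own.

```python
import math


def rmse_from_neg_mse(score):
    """Convert a negated mean squared error score into RMSE."""
    return math.sqrt(abs(score))


scores = [-905092886.3009057, -922269561.2683483, -949881926.6436856]
rmses = [rmse_from_neg_mse(s) for s in scores]
mean_rmse = sum(rmses) / len(rmses)
print([round(r) for r in rmses])
print(round(mean_rmse))
```

All three land a little above $30,000, in the same units as the sale price.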
Next I tried 60 pipelines with random_state = 20 for both train_test_split and TPOTRegressor.
Times: [9.691357856966594, 1.8972856383004304, 2.5272325469001466]
Scores: [-1061075530.3715296, -695536167.1288683, -783733389.9523941]
Winning pipelines: [Pipeline(memory=None,
steps=[('stackingestimator-1', StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=0.7000000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_sample...0.6000000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False))]), Pipeline(memory=None,
steps=[('xgbregressor', XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=7, min_child_weight=3, missing=None, n_estimators=100,
n_jobs=1, nthread=1, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1.0))]), Pipeline(memory=None,
steps=[('stackingestimator', StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=0.7000000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=12, min_samples_...ators=100, n_jobs=None,
oob_score=False, random_state=None, verbose=0, warm_start=False))])]
That led to very different pipelines and scores.
Let’s try one longer run with 720 pipelines.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 20)

tpot = TPOTRegressor(verbosity=3,
                     random_state=10,
                     #scoring=rmsle,
                     periodic_checkpoint_folder="any_string",
                     n_jobs=-1,
                     generations=8,
                     population_size=80,
                     early_stop=5)
Results:
Times: [43.206709423016584]
Scores: [-644910660.5815958]
Winning pipelines: [Pipeline(memory=None,
steps=[('xgbregressor', XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=8, min_child_weight=3, missing=None, n_estimators=100,
n_jobs=1, nthread=1, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.8500000000000001))])]
RMSE is the best yet. It took the better part of an hour to converge, and we’re still running far fewer pipelines than recommended. 🤔
Next let’s try using Root Mean Squared Logarithmic Error, a custom scoring parameter Kaggle uses for this competition. This was run in another very small iteration with 30 pipelines in three runs with random_state=20. We couldn’t use more than one CPU core because of a bug with custom scoring parameters in Jupyter in some algorithms included in TPOT.
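For reference, here is what the metric itself computes, written in plain Python. This is a sketch of the standard RMSLE formula, not necessarily the exact function from my kernel. To hand something like it to TPOT you would typically wrap it with scikit-learn’s make_scorer and greater_is_better=False, which would also produce negative scores like the ones shown here.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative targets."""
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


perfect = rmsle([200000, 150000], [200000, 150000])
ten_percent_high = rmsle([100000], [110000])
print(perfect)
print(ten_percent_high)
```

Because the error is taken on the log scale, it penalizes being off by the same percentage equally for cheap and expensive houses. Predicting 10% too high comes out at a little under 0.1 regardless of the price.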
Times: [1.6125734224997965, 1.2910610851162345, 0.9708147236000514]
Scores: [-0.15007242511943228, -0.14164770517342357, -0.15506057088945932]
Winning pipelines: [Pipeline(memory=None,
steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('stackingestimator', StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=0.7000000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
...0.6000000000000001, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False))]), Pipeline(memory=None,
steps=[('extratreesregressor', ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
max_features=0.6500000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=7, min_samples_split=10,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
oob_score=False, random_state=None, verbose=0, warm_start=False))]), Pipeline(memory=None,
steps=[('ridgecv', RidgeCV(alphas=array([ 0.1, 1. , 10. ]), cv=None, fit_intercept=True,
gcv_mode=None, normalize=False, scoring=None, store_cv_values=False))])]
Those scores aren’t terrible. The output file from tpot.export from this small run is below.
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV, LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'].values, random_state=42)

# Score on the training set was:-0.169929041242275
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LassoLarsCV(normalize=False)),
    ElasticNetCV(l1_ratio=0.75, tol=0.01)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
In the future I’d like to do some longer runs with TPOT on this data set to see how it performs. I’d also like to see how some manual feature engineering and various encoding strategies can improve our model performance.
Gotchas with TPOT and Kaggle
I love Kaggle’s kernels, but if you want to run an algorithm such as TPOT for a few hours, it can be super frustrating. Kernels frequently crash when running, you sometimes can’t tell if your attempted commit is hanging, and you can’t control your environment as much as you might like.
There’s nothing like getting to 700 out of 720 pipeline iterations and having Kaggle disconnect. My Kaggle CPU utilization rate often showed 400%+ and there were many restarts required during this exercise.
A few other things to be aware of:
- I found I needed to convert my Pandas DataFrame to a Numpy Array to avoid an XGBoost issue on the regression task. This is a known issue with Pandas and XGBoost.
- A Kaggle kernel is running a Jupyter notebook under the hood. Custom scoring classifiers in TPOT don’t work when n_jobs is > 1 in a Jupyter notebook. This is a known issue.
- Kaggle will only let your kernel code write to an output file when you commit your code. And you can’t see TPOT’s temporary output when committing. Make sure you just have the file name in quotes — no slashes. The file will show up on the Output tab.
- Turning on the GPU setting on Kaggle didn’t speed things up for most of these analyses, but likely would for deep learning.
- Kaggle’s 6 hours of possible run time and GPU setting make it possible to experiment with TPOT for free with no configuration on non-huge data sets. It’s hard to pass up free.
For more time and speed you can use something like Paperspace. I set TPOT up on Paperspace and it was pretty pain-free, although not money-free. If you need a cloud solution to run TPOT, I suggest playing around with it on Kaggle first and then moving off Kaggle if you need more than a few hours of running time or more power.
Future Directions
There are so many interesting directions to explore with TPOT and autoML. I’d like to compare TPOT with autoSKlearn, MLBox, Auto-Keras, and others. I’d also like to see how it performs with a greater variety of data, other imputation strategies, and other encoding strategies. A comparison with LightGBM, CatBoost, and deep learning algorithms would also be interesting. The exciting thing about this moment in machine learning is that there are so many areas to explore. Follow me to make sure you don’t miss future analysis.
For most data sets there’s still a lot of data cleaning, feature engineering, and final model selection to do — not to mention the most important step of asking the right questions up front. Then you might need to productionize your model. And TPOT isn’t doing exhaustive searches yet. So TPOT isn’t going to replace the data scientist role — but this tool might make your final machine learning algorithms better faster.
If you’ve used TPOT or other autoML tools please share your experience in the comments.
I hope you found this introduction to TPOT to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀
Happy TPOTing! 🚀