As a Data Scientist or Machine Learning Engineer, you have surely faced the critical task of selecting the right model with the right parameters. It is not always an easy task, because the spectrum of possibilities is really wide. You may have had to run a grid search to find the optimal parameters for your pipeline, which is very time consuming. This is where tools such as TPOT come in, acting as an assistant in the search for the optimal pipeline.
So let’s talk about TPOT, its components and how it works. This blog will be divided into:
- What is TPOT?
- TPOT in practice
If you want to access the full code implementation, here is the link: https://github.com/FernandoLpz/TPOT-Optimal-Pipeline-Searching
What is TPOT?
TPOT (Tree-based Pipeline Optimization Tool) is an AutoML tool designed for the efficient construction of optimal pipelines through genetic programming. TPOT is an open-source library and makes use of scikit-learn components for data transformation, feature decomposition, feature selection and model selection [1].
Although TPOT is classified as an AutoML tool, it does not cover an ML pipeline "end to end". TPOT focuses on the optimized automation of specific components of an ML pipeline. In Figure 2 we can see the phases automated by TPOT and the ones that remain in the hands of the Data Scientist or Machine Learning Engineer.

TPOT implements tree-based pipelines where each node represents an "operator". There are 4 main operators: Preprocessor, Decomposition, Feature Selection and Model, each made up of a set of scikit-learn methods. These operators are orchestrated as a tree whose leaves are one or more copies of the input data. The dataset flows through the tree, its features being transformed operator by operator, until the final node, where the model (classification or regression) is generated. In Figure 3 we can see a couple of examples of tree-based pipelines.

As we observed in the previous figure, each operator in the pipeline is made up of a set of functions, and each of these functions receives a set of parameters; this is where genetic programming plays a very important role. Instead of running a grid search over a set of fixed values for each function (which would demand a lot of processing time), TPOT uses genetic programming, an evolutionary computation technique, to evolve both the sequence of pipeline operators and the parameters of those operators, in order to maximize the pipeline's classification accuracy.
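To make this structure concrete, TPOT lets you see (and override) exactly this operator/function/parameter hierarchy through its config_dict parameter: each entry maps a scikit-learn class (an operator's function) to the parameter values the genetic algorithm is allowed to combine. A minimal sketch with an illustrative, non-default search space:

from tpot import TPOTClassifier

# Illustrative search space (not TPOT's defaults): each entry maps a
# scikit-learn class to the parameter values the GP search may combine.
custom_config = {
    # Preprocessor operator
    'sklearn.preprocessing.PolynomialFeatures': {
        'degree': [2],
        'include_bias': [False],
    },
    # Feature Selection operator
    'sklearn.feature_selection.VarianceThreshold': {
        'threshold': [0.1, 0.2, 0.5],
    },
    # Model operator
    'sklearn.ensemble.GradientBoostingClassifier': {
        'n_estimators': [100],
        'learning_rate': [0.1, 0.5, 1.0],
        'max_depth': range(3, 11),
    },
}

model = TPOTClassifier(config_dict=custom_config)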
If you want to delve into the details of TPOT, I highly recommend taking a look at the paper [2]: TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning.
Perfect, so far we know what TPOT is, how it works and what its components are. Let's now see how to use TPOT in practice. Let's go for it!
TPOT in practice
The key idea of this example is to show how to optimize a classification pipeline with TPOT; for this we will use the tic-tac-toe toy dataset. Once the dataset has been downloaded, we perform some basic preprocessing before handing it to the TPOT classifier (these steps are the equivalent of the blue boxes on the left of Figure 2).
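The repository linked above contains the full preprocessing; a minimal sketch of the idea follows (the file path and column names are my assumptions about the UCI tic-tac-toe endgame data, which encodes each board cell as 'x', 'o' or 'b'):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical path and column names for the tic-tac-toe endgame dataset
columns = [f'cell_{i}' for i in range(9)] + ['target']
data = pd.read_csv('tic-tac-toe.data', names=columns)

# Encode the categorical board cells ('x', 'o', 'b') as integers
X = OrdinalEncoder().fit_transform(data.drop('target', axis=1))
# Encode the class labels ('positive', 'negative') as 0/1
y = LabelEncoder().fit_transform(data['target'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)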
Once the preprocessing is done, we initialize the TPOT classifier, which requires some parameters for the optimization (specifically for the genetic algorithm). Since we are working with a toy dataset, a population size of 50 and 5 generations are enough (these values are arbitrary; as is well known, the right choice depends on the size of the dataset as well as the number of features). Then we just need to fit the model.
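A sketch of the initialization and fit; verbosity and random_state are optional extras added here for readable logs and reproducibility:

from tpot import TPOTClassifier

# 5 generations of a population of 50 candidate pipelines
model = TPOTClassifier(generations=5,
                       population_size=50,
                       verbosity=2,      # print the best CV score per generation
                       random_state=42)

model.fit(X_train, y_train)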
Once you fit the model, you will get an output like this:
Generation 1 - Current best internal CV score: 0.8681266445972329
Generation 2 - Current best internal CV score: 0.9243442831678127
Generation 3 - Current best internal CV score: 0.9347254053136407
Generation 4 - Current best internal CV score: 0.9347254053136407
Generation 5 - Current best internal CV score: 0.9347254053136407
This is where the key part comes in, and why TPOT is regarded as an assistant: at the end of the optimization, TPOT shows you the optimal pipeline it found:
Best pipeline: GradientBoostingClassifier(ZeroCount(VarianceThreshold(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), threshold=0.2)), learning_rate=1.0, max_depth=10, max_features=0.9000000000000001, min_samples_leaf=16, min_samples_split=3, n_estimators=100, subsample=0.7000000000000001)
As can be seen, TPOT returns the optimal pipeline composed of the optimal functions and parameters; in this case, the winning classifier was GradientBoostingClassifier() (remember, TPOT builds pipelines as trees where each node is an operator).
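Before exporting anything, you can also evaluate the fitted TPOT object directly on the held-out data; its score method applies the winning pipeline (variable names follow the earlier sketch):

# Accuracy of the best pipeline on the held-out test set
print(model.score(X_test, y_test))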
In addition, TPOT generates a template showing which libraries, paths and functions were used to build the optimal pipeline. You can access it by typing:
model.export('optimal_pipeline.py')
As you can see, exporting generates a '.py' file. Let's see what it has inside:
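The exported file follows TPOT's standard template; it looks roughly like this (the placeholder path and separator are generated by TPOT itself, which is why the file does not run as-is):

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import ZeroCount

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9347254053136407
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    VarianceThreshold(threshold=0.2),
    ZeroCount(),
    GradientBoostingClassifier(learning_rate=1.0, max_depth=10,
                               max_features=0.9000000000000001,
                               min_samples_leaf=16, min_samples_split=3,
                               n_estimators=100,
                               subsample=0.7000000000000001)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)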
This file cannot be executed as-is (note the placeholder path); what it provides is the architecture of the functions and the specific parameters with which they were instantiated. The key part begins around line 18 of the exported file, where the pipeline and each of its functions are initialized with their respective parameters. All you have to do is import the necessary libraries, load your own data, and "copy and paste" the suggested pipeline to validate the results.
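For example, reusing the train/test split from the preprocessing sketch above (variable names are my assumptions; the exact numbers will depend on your split):

# `exported_pipeline` is the make_pipeline(...) object shown in the file above;
# X_train, X_test, y_train, y_test come from the earlier preprocessing sketch
exported_pipeline.fit(X_train, y_train)

print(f'Train acc: {exported_pipeline.score(X_train, y_train)}')
print(f'Test acc: {exported_pipeline.score(X_test, y_test)}')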
The results obtained were:
Train acc: 1.0
Test acc: 0.9427083333333334
That's it!
Conclusions
In this blog, we learned what TPOT is, how it works and what its components are. Although TPOT is classified as an AutoML tool, it is more of an "assistant", as the authors themselves mention in their article [1]. The use of genetic programming for pipeline optimization is an interesting alternative to an expensive grid search; however, it is important to note that computing time can explode on "real" datasets.
On the other hand, although these kinds of tools are classified as AutoML, when we use them we still have to provide parameters (in this case, for the genetic algorithm), which could be seen as not completely "AutoML".
Finally, it is interesting that several projects are trying to achieve "autonomy" in machine learning; it remains to be seen whether one day we will have a truly autonomous tool.