Thoughts and Theory

Nowadays, automated machine learning (AutoML), composite artificial intelligence (AI), and structural learning are popular and widely discussed concepts. The ideas of automation, multi-modality, and controllability provide a promising direction for improving existing data-driven modeling methods. In this article, we would like to discuss the main trends and challenges in AutoML, the main ideas behind implementations of composite AI, and FEDOT – an open-source framework for the structural learning of composite pipelines that inherits these ideas.
A little bit of theory and a few words about AutoML frameworks
Usually, data scientists have to perform many steps to obtain a solution to a real-world problem using machine learning (ML) techniques: data cleaning and dataset preparation, selection of the most informative features, transformation of the feature space, selection of an ML model, and adjustment of its hyperparameters. This sequence is often represented as an ML pipeline.
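To make the notion of a linear pipeline concrete, here is a minimal sketch in plain Python, where each step is just a function; the helper names (`clean`, `select_features`, `scale`) and the toy data are illustrative rather than taken from any particular library:

```python
def clean(rows):
    # drop rows that contain missing values (None)
    return [r for r in rows if None not in r]

def select_features(rows):
    # keep only the first two columns as the "most informative" features
    return [r[:2] for r in rows]

def scale(rows):
    # min-max scale each column to [0, 1]
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

def run_pipeline(rows, steps):
    # apply the steps one after another, as in a linear ML pipeline
    for step in steps:
        rows = step(rows)
    return rows

data = [[1.0, 10.0, 3.0], [2.0, None, 4.0], [3.0, 30.0, 5.0]]
result = run_pipeline(data, [clean, select_features, scale])
```

In real projects this role is typically played by something like scikit-learn's `Pipeline` class, but the idea is the same: an ordered sequence of transformations ending in a model.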

However, manually designing even simple linear pipelines (A, in the figure above) and selecting their structures and parameters can take days or even weeks of a data scientist's work. For complicated tasks, the pipeline structure may become more complex – as shown in cases B and C in the figure above. Case B shows a branching pipeline with ensemble methods (stacking) for combining several models; case C shows a branching pipeline that joins different preprocessing methods and models for different parts of the initial dataset.
In fact, a pipeline that uses several ML models can be treated as a single composite model, since there is little difference between the two from a computational point of view. So, the structures of the pipelines in (B) and (C) become composite because they incorporate different ML algorithms. For example, an NLP model and a convolutional network can be combined to obtain a prediction from multimodal data. Such composite models and ML pipelines can be handled using AutoML methods and techniques.
Automated pipeline creation is mainly a combinatorial optimization problem that aims to find the best combination of possible building blocks. For convenience, the pipeline can be described as a directed acyclic graph (DAG), which can be easily transformed into a computational graph. The efficiency of optimization is determined by objective functions that estimate the quality, complexity, robustness, and other properties of the resulting pipeline.
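As an illustration of this representation, the sketch below encodes a branching pipeline as a DAG (each node maps to the list of its parents) and evaluates it in topological order; the node names and toy operations are invented for the example and do not reflect FEDOT's internals:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# each node of the pipeline maps to the list of its parent nodes
dag = {
    "scaling": [],
    "pca": ["scaling"],
    "logit": ["scaling"],
    "ensemble": ["pca", "logit"],
}

# toy stand-ins for real operations: each takes the parents' outputs
ops = {
    "scaling": lambda parents: 2.0,
    "pca": lambda parents: parents[0] + 1.0,
    "logit": lambda parents: parents[0] * 3.0,
    "ensemble": lambda parents: sum(parents) / len(parents),
}

def evaluate(dag, ops):
    # walk the DAG in topological order, feeding parent outputs to children
    results = {}
    for node in TopologicalSorter(dag).static_order():
        results[node] = ops[node]([results[p] for p in dag[node]])
    return results

results = evaluate(dag, ops)
```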
The most straightforward method for solving this optimization task is a random search over the possible block combinations. A better choice, however, is meta-heuristic optimization: swarm and evolutionary (genetic) algorithms. In the case of evolutionary algorithms, one should keep in mind that they require specially designed crossover, mutation, and selection operators. Such operators are important for processing individuals described by a DAG; they also make it possible to take multiple objective functions into account and to include additional procedures that create stable pipelines and avoid over-complication.
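A random search over block combinations can be sketched in a few lines; the block names and the toy objective function below are purely illustrative:

```python
import random

# pool of building blocks to combine into candidate pipelines
blocks = ["scaling", "pca", "poly", "logit", "rf", "knn"]

def random_pipeline(rng, max_len=4):
    # sample a pipeline as a random sequence of blocks
    return [rng.choice(blocks) for _ in range(rng.randint(1, max_len))]

def objective(pipeline):
    # toy objective (lower is better): prefer short pipelines
    # that end with a model block rather than a preprocessing block
    penalty = 0.0 if pipeline[-1] in ("logit", "rf", "knn") else 1.0
    return penalty + 0.1 * len(pipeline)

rng = random.Random(0)
best = min((random_pipeline(rng) for _ in range(200)), key=objective)
```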
The crossover operators can be implemented using subtree crossover schemes: two parent individuals are chosen and exchange random parts of their graphs. This is not the only possible implementation – there are more semantically complex variants (e.g., one-point crossover). The mutation operators may include a random change of the model (or computational block) in a random node of the graph, removal of a random node, or random addition of a subtree.
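In a simplified form, these operators can be sketched on pipelines encoded as nested tuples `(operation, *subtrees)`; this illustrates the idea and is not FEDOT's actual implementation:

```python
import random

MODELS = ["logit", "rf", "knn", "scaling", "pca"]

def all_paths(tree, path=()):
    # enumerate paths to every node so operators can pick one at random
    yield path
    for i, child in enumerate(tree[1:], start=1):
        yield from all_paths(child, path + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(rng, a, b):
    # exchange a random subtree of `a` with a random subtree of `b`
    pa = rng.choice(list(all_paths(a)))
    pb = rng.choice(list(all_paths(b)))
    return replace_subtree(a, pa, get_subtree(b, pb))

def mutate(rng, tree):
    # change the operation in a random node, keeping its children
    path = rng.choice(list(all_paths(tree)))
    node = get_subtree(tree, path)
    return replace_subtree(tree, path, (rng.choice(MODELS),) + node[1:])

rng = random.Random(0)
parent_a = ("rf", ("scaling",), ("pca",))
parent_b = ("knn", ("logit",))
child = crossover(rng, parent_a, parent_b)
mutant = mutate(rng, parent_a)
```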

In the ideal case, AutoML should make it possible to exclude a human from the whole process of building an ML solution. However, this is difficult to achieve completely, because most AutoML frameworks automate only separate steps (tuning of hyperparameters, feature selection, etc.) for fixed pipelines and only for specific types of data. Several state-of-the-art AutoML frameworks for pipelines and their features are listed in the table below. Of course, this comparison does not claim to be complete – it is based on the analysis of open documentation and examples, and the state of AutoML can change rapidly.

It can be seen that despite the large variety of well-developed tools, existing solutions are aimed at relatively narrow tasks or usage scenarios. For example, the TPOT framework automates the creation of classification (including multiclass classification) and regression models for tabular data only, and the structure of the obtained pipelines usually consists of one or two levels. AutoGluon is quite flexible, but it is mostly based on pre-defined pipelines. There are also many task-specific AutoML frameworks – for example, the AutoTS framework, which can be used for time series forecasting only.
In other words, there is a lack of complex, multipurpose approaches to automated modeling that can be adapted to different tasks and data types without sophisticated modifications of the core algorithms. This problem leads us to the next point – is it possible to improve the situation?
What is missing, and what we want to propose
A typical AutoML application scenario is as follows: the AutoML framework uses the available datasets to optimize the structure of the pipeline and the hyperparameters of the blocks included in it. However, in practice, implementations that work well for benchmark problems are often not as good on "real-world" datasets. Therefore, more and more new AutoML solutions appear, such as H2O, AutoGluon, LAMA, NNI, and others. They differ in capabilities (for example, industrial solutions have advanced infrastructure support), but are often not suitable for a wide range of modeling tasks. While most frameworks can solve classification and regression problems, they often do not support time series forecasting.

An ML pipeline can include models for different tasks. For example, it can be useful to generate a new feature based on regression and then use it in classification. At the moment, AutoML frameworks do not provide a convenient way to solve such tasks. Yet it is not uncommon for ML engineers to face multimodal and heterogeneous data that have to be integrated for further modeling.
Until recently, there were no ready-to-use tools with this set of features. In the Natural Systems Simulation Lab of ITMO University, we research and develop advanced solutions in the AutoML field. So, we decided to develop our own solution that avoids the described problems. We called it FEDOT. It is an open-source framework that automates the creation and optimization of ML pipelines and their elements. FEDOT makes it possible to solve various data-driven modeling problems in a compact and efficient way.

Here is an example of the FEDOT-based solution for the classification problem (in Python):
# a new instance to be used as an AutoML tool, with a time limit of 10 minutes
auto_model = Fedot(problem='classification', learning_time=10)
# run the AutoML-based model generation
pipeline = auto_model.fit(features=train_data_path, target='target')
prediction = auto_model.predict(features=test_data_path)
auto_metrics = auto_model.get_metrics()
The main focus of the framework is managing the interaction between the computational blocks of the pipelines. The pipeline design starts from the stage of structural learning – FEDOT combines several ML models to achieve better values of the objective functions. In the framework, we describe composite models in the form of a directed graph that defines the relationships between preprocessing and modeling blocks. The nodes are represented by ML models, as well as data preprocessing and transformation operations. Both the structure of this graph and the parameters of each node can be optimized.
The structure suitable for a specific task is designed automatically. To do this, we use the evolutionary optimization algorithm GPComp, which creates a population of several ML pipelines and searches for the best solution by applying evolutionary operators: mutation and crossover. To avoid undesirable over-complication of the pipeline structure, we apply regularization procedures and multi-objective approaches.
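The overall loop can be sketched as follows; this is a toy generational scheme in the spirit of the description above (flat pipelines, an invented fitness function), not the actual GPComp code:

```python
import random

BLOCKS = ["scaling", "pca", "logit", "rf", "knn"]

def fitness(pipeline):
    # toy objective (lower is better): pretend "rf" helps most,
    # and penalize length to discourage over-complicated structures
    quality = 1.0 - (0.5 if "rf" in pipeline else 0.0)
    return quality + 0.05 * len(pipeline)

def mutate(rng, p):
    # replace a random block with a randomly chosen one
    p = list(p)
    p[rng.randrange(len(p))] = rng.choice(BLOCKS)
    return p

def crossover(rng, a, b):
    # one-point crossover on flat pipelines
    cut = rng.randint(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def evolve(rng, pop_size=10, generations=20):
    pop = [[rng.choice(BLOCKS) for _ in range(rng.randint(1, 4))]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                      # best pipelines first
        survivors = pop[: pop_size // 2]           # selection
        children = [mutate(rng, crossover(rng, rng.choice(survivors),
                                          rng.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return min(pop, key=fitness)

best = evolve(random.Random(42))
```

The length penalty in `fitness` plays the role of the regularization mentioned above: a longer pipeline must buy its extra blocks with a real quality improvement.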
Here is a teaser of the framework that illustrates this concept:
FEDOT is implemented in Python and is available under the BSD-3 license. Its key features are:
- The FEDOT architecture is highly flexible; the framework can be used to automate the creation of ML solutions for different tasks (classification, regression, forecasting), data types (tables, time series, texts, images), and models;
- FEDOT supports popular ML libraries (scikit-learn, Keras, statsmodels, etc.), but also allows integrating other tools if needed;
- Pipeline optimization algorithms are not bound to data types or tasks. However, specialized templates for a certain task class or data type (time series prediction, NLP, tabular data, etc.) can be used;
- The framework is not limited to machine learning: domain-specific models can also be built into the pipelines (e.g., models based on ODEs or PDEs);
- Custom methods of hyperparameter tuning can also be integrated into FEDOT in addition to those already supported;
- FEDOT supports an any-time mode of operation (the algorithm can be stopped at any point and the current result obtained);
- Final pipelines can be exported in JSON format to achieve the reproducibility of the experiment.
Thus, compared to other frameworks, FEDOT is not limited to one class of problems but aims to be universal and extensible. It allows you to build models using input data of a different nature. A promising case of applying FEDOT in a hackathon competition is described in this news post.
AutoML prospects
Among the AutoML solutions (in addition to the tools listed above) are EvalML, TransmogrifAI, Lale, and many others. All of them are developed by large enterprise IT companies. In some cases, the main focus of a framework is on technical features, such as support for scalability and distributed computing, or integration with Kubernetes and MLOps tools. In others, it is on conceptual issues like new optimization algorithms or their interpretability. However, several areas and prospects of AutoML development are less covered by the community.
Flexible control of search complexity
Depending on the requirements and the allowable budget, ML engineers can select different models: a single gradient boosting model with optimized hyperparameters, a deep neural network, or a non-linear pipeline that combines several modeling approaches. In each case, they will be forced to explore the possibilities of the available AutoML frameworks and conduct experimental research to find out what works better or worse. It would be very convenient to have a continuous "switch" for search complexity, with which one could adjust the dimensionality of the search space from simple solutions to complex but efficient pipelines.
Model Factory
In addition to allowable values of quality metrics, other criteria may arise in ML tasks: interpretability, the amount of computational resources and memory required to maintain the model in a production environment, the lead time of predictions, and so on. An interface for specifying several objectives to be taken into account would be useful here. In some cases, it is impossible to minimize all the criteria simultaneously because there is a Pareto front of solutions. For example, as the complexity of a neural network architecture increases, the accuracy increases, but so does the demand for computational resources.
Our team conducted experimental research in which we applied evolutionary multi-objective optimization algorithms within FEDOT to optimize machine learning pipelines. We chose not only accuracy but also the complexity of the pipelines (the number of nodes and the depth of the computation graph) as optimization criteria. During the experiments, we found that integrating Pareto-front solutions into the search process increases the diversity of the population and also allows us to find solutions with higher accuracy.
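The core of such a multi-objective setup is a Pareto dominance check; the sketch below uses made-up (error, number-of-nodes) pairs purely for illustration:

```python
def dominates(a, b):
    # a dominates b if it is no worse on every criterion and strictly
    # better on at least one (all criteria are minimized here)
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(candidates):
    # keep only candidates that no other candidate dominates
    return [c for c in candidates
            if not any(dominates(other, c)
                       for other in candidates if other != c)]

# toy (error, number_of_nodes) pairs for several candidate pipelines
candidates = [(0.10, 8), (0.12, 3), (0.20, 2), (0.11, 9), (0.25, 2)]
front = pareto_front(candidates)
```

None of the surviving candidates can be improved on one criterion without getting worse on the other, which is exactly the trade-off the optimizer presents to the user.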
The idea of an AutoML model factory was expressed by Yuriy Guts from DataRobot in his talk Automated Machine Learning. He drew an analogy with the OOP Factory pattern: AutoML could provide different solutions to the user depending on the given conditions – the types of data sets, forecasting intervals, the lifetime of the model, etc.
Models can be derived for different data sets: random samples, or data within time ranges. It is also possible to obtain "short-term" models based on a current data slice.

In general, automated machine learning is a promising field. If you work in data science, keeping up with news from the AutoML world is valuable. We have selected several sources to dive into:
- FEDOT repository
- Repository of the web interface to the framework – FEDOT.Web
- YouTube channel with tutorials on AutoML
- A set of open-source AutoML tools
- A set of benchmarks for AutoML
Authors: Nikolay Nikitin, Anna Kalyuzhnaya, Pavel Vychuzhanin