AutoML-Zero

It took us 60 years to discover a thing which can discover itself.

Sahil Uppal
Towards Data Science

--


We humans design the model, and the model has parameters. Think of parameters as knobs that you turn slightly left or right to change the model’s behavior. In recent years, neural networks have demonstrated remarkable success on a wide variety of tasks and have seen tremendous growth in popularity. Considering factors like the skills, computing resources, and research effort that machine learning demands, this has prompted a new field named AutoML, which aims to spend machine compute time instead of human research time.

What led to AutoML-Zero?

AutoML studies have restricted the search space (the pool of operations the algorithm can draw from) to building blocks pre-designed by human experts, such as ReLU layers, convolution layers, and batch normalization layers, which tends to bias the search results in favor of human-designed algorithms. A common example is Neural Architecture Search, which uses sophisticated human-designed layers as building blocks and respects the rules of backpropagation while searching for new architectures and optimal hyperparameters.

Other variants of AutoML restrict their search spaces by pre-defining hyperparameters that the designers consider crucial but that the algorithm would not be able to discover on its own, such as the learning rate used during backpropagation. Some studies use fixed pre-processing pipelines to feed processed input to the algorithm, such as data augmentation in the case of images. So many of these aspects remain hand-designed. These approaches may save computing time, but they potentially reduce the innovation potential of AutoML.

The concept behind AutoML-Zero is simple: why not assume we have no fancy operations (like backpropagation or gradient descent) and let the algorithm design everything from scratch using only basic mathematical operations? That is why it is called machine learning from scratch. Think of it as having no TensorFlow and no PyTorch, just NumPy, and evolving the concepts of machine learning. Surprisingly, it can discover concepts such as non-linear models, learning rates, and gradients. It is basically the last 30 years of neural network evolution replayed before your eyes.

Higher-Level Working Overview

The idea behind the AutoML-Zero framework is to initialize a large number of programs, each containing three functions that start out empty. These three functions are:

  • setup() -> This function is used to initialize the variables.
  • predict() -> This function is used to make a prediction given a data point.
  • learn() -> This function is used to update the variables so that the predict function performs better over time.

Given these three functions, AutoML-Zero has to fill in the rest of the program within these generic function bodies.
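To make the setup concrete, here is a rough sketch (in plain NumPy, my own illustration rather than code from the paper) of how a candidate program could be scored: setup runs once, predict and learn are called on every training example, and predict alone is scored on held-out examples.

```python
import numpy as np

def evaluate(program, train_set, valid_set):
    """Rough sketch of how a candidate program could be scored:
    setup() once, then predict()/learn() over the training data,
    then predict() alone on held-out data. Illustrative only."""
    memory = program.setup()                        # initialize scalar/vector variables
    for features, label in train_set:
        program.predict(memory, features)           # prediction is stored in memory
        program.learn(memory, features, label)      # update variables from the error
    squared_errors = [
        (program.predict(memory, features) - label) ** 2
        for features, label in valid_set
    ]
    return -float(np.mean(squared_errors))          # higher fitness = lower error
```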

Source: https://github.com/google-research/google-research/tree/master/automl_zero

Above is a snippet of an algorithm discovered by AutoML-Zero for a linear regression problem. As you can see, the setup function initializes a variable named “s2”, which is later used as the learning rate in the learn function; the predict function makes its prediction by taking the dot product of the input features and the learned weights “v1”; and the learn function computes the error between the real label and the predicted label, applies the learning rate, computes the gradient, and updates the weights.
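Transcribed into NumPy, that discovered program looks roughly like the sketch below. The variable names (s2 for the learning rate, v1 for the weights) follow the snippet’s convention, but the feature dimension and the memory dictionary are assumptions made for illustration; this is an approximation of the figure, not the evolved code itself.

```python
import numpy as np

N_FEATURES = 4  # assumed feature dimension, for illustration only

def setup():
    memory = {}
    memory["s2"] = 0.01                      # learning rate discovered by the search
    memory["v1"] = np.zeros(N_FEATURES)      # weight vector
    return memory

def predict(memory, v0):
    # Linear prediction: dot product of input features v0 and weights v1.
    memory["s1"] = np.dot(memory["v1"], v0)
    return memory["s1"]

def learn(memory, v0, s0):
    # s0 is the true label, s1 the prediction stored by predict().
    error = s0 - memory["s1"]                # prediction error
    scaled_error = memory["s2"] * error      # apply the learning rate
    gradient = scaled_error * v0             # gradient step for the squared error
    memory["v1"] = memory["v1"] + gradient   # update the weights
```

Run over a dataset in a loop like the evaluation sketch above, this is just stochastic gradient descent on a linear model, which is what makes the discovery remarkable. Let’s look at another snippet.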

Source: https://github.com/google-research/google-research/blob/master/automl_zero/initial_and_evolved_code.png

Above is a snippet of an automatically discovered algorithm for CIFAR-10 classification. The setup function initializes the learning rate; the predict function injects noise into the features (the search discovered that introducing noise can improve prediction accuracy); and the learn function computes the error, estimates gradients, normalizes the gradients, normalizes the error, and so on.

Remember that each program started as empty functions. The fact that these programs can discover things like dot products, gradients, and normalization is amazing, because the search has to cover an enormous space that is extremely sparse, and most of the programs you would find in such a space are completely useless.

Internal Working

In this section, we are going to answer two questions. First, as stated earlier, the AutoML-Zero framework initializes a large number of programs (remember that each program contains the three functions setup, predict, and learn): why? And second, how exactly does it fill in the instructions within those functions?

The AutoML-Zero setup makes the search space very sparse; as a result, a good algorithm for even a trivial task can be as rare as 1 in 10¹² candidates. In such an environment, random search will not find a solution in a reasonable amount of time. To overcome this, machine learning programs are discovered in this search space through regularized evolutionary search.

Regularized Evolutionary Search

In regularized evolutionary search, a large number of programs (called the population) are initialized with their functions empty. The algorithm then selects a sample (two or more programs) from the population, evaluates them, and picks the best-performing one as the parent. The parent is cloned to create a child, and the child is mutated. The mutation is chosen randomly from three options (see the sketch after this list):

  1. Randomly insert or remove an instruction at a random location.
  2. Randomly replace all the instructions in a component function.
  3. Modify one of the arguments of an instruction by replacing it with a random choice.
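Below is a minimal sketch of that loop, assuming we already have mutate() and evaluate() helpers. The key “regularized” detail is that when the child joins the population, the oldest program is removed rather than the worst one, which keeps the population from being dominated by a few early winners.

```python
import random
from collections import deque

def regularized_evolution(initial_programs, evaluate, mutate,
                          cycles=10000, sample_size=10):
    """Sketch of regularized evolutionary search: tournament selection
    plus age-based removal. `evaluate` and `mutate` are assumed helpers."""
    population = deque(initial_programs)               # oldest program sits on the left
    fitness = {id(p): evaluate(p) for p in population}

    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)  # tournament sample
        parent = max(sample, key=lambda p: fitness[id(p)])     # best of the sample
        child = mutate(parent)                                  # clone + one random mutation
        fitness[id(child)] = evaluate(child)
        population.append(child)                                # child joins on the right
        oldest = population.popleft()                           # oldest program is removed
        del fitness[id(oldest)]

    return max(population, key=lambda p: fitness[id(p)])
```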

The following animation might give you a better grasp of evolutionary search.

Evolution Implementation Speedups

The AutoML-Zero search space for machine learning algorithms is very sparse and generic, which means that most candidate programs are completely useless. For example, after the second type of mutation, one predict function takes the scalar “s0” and assigns it the mean of a matrix that is never defined in setup and never updated in learn, and then assigns another random scalar “s3” the cosine of some other random scalar “s7”. You can see that such functions are completely useless and do nothing to move the needle toward real machine learning algorithms.

Source: https://github.com/google-research/google-research/tree/master/automl_zero

To make this algorithm work, the researchers need to evaluate a huge number of candidate models with their evolutionary search strategy: they describe searching through roughly 5,000–10,000 models per second per CPU core. Several techniques make this possible:

  • Migration, which shuffles models across the different CPUs so that the populations on different workers within the distributed system stay diverse.
  • FEC (functional equivalence checking), which detects when a candidate produces the same outputs as a previously evaluated program on a small set of examples, so the expensive evaluation is not repeated for functionally equivalent candidates.
  • Dataset diversity, in which the algorithms have to solve binary classification tasks extracted from multi-class datasets like MNIST (e.g. [0 vs 8] or [6 vs 9]). A more diverse set of tasks helps produce more useful, general programs.
  • Progressive dynamic hurdles, where an intermediate fitness evaluation is used to cut off models that are not performing well early on.
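As an illustration of the FEC idea, here is a hedged sketch of how functional-equivalence caching could work: run a candidate on a small, fixed set of probe examples, hash its outputs, and reuse a previously computed fitness whenever that fingerprint has been seen before. The probe set, the rounding, and the program interface here are assumptions for illustration, not the paper’s exact implementation (which fingerprints a small truncated training run).

```python
import hashlib
import numpy as np

_fec_cache = {}   # fingerprint -> previously computed fitness

def fec_fingerprint(program, probe_inputs):
    """Hash the program's predictions on a small, fixed probe set."""
    outputs = np.array([program.predict(x) for x in probe_inputs])
    return hashlib.sha256(np.round(outputs, 6).tobytes()).hexdigest()

def evaluate_with_fec(program, probe_inputs, full_evaluation):
    """Skip the expensive full evaluation when a functionally
    equivalent program has already been scored."""
    key = fec_fingerprint(program, probe_inputs)
    if key not in _fec_cache:
        _fec_cache[key] = full_evaluation(program)
    return _fec_cache[key]
```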

Algorithm Evolution

Progress of one evolution experiment on projected binary CIFAR-10.

Source: https://storage.googleapis.com/gresearch/automl_zero/progress.gif

Conclusion

So this was a glance at a new variant of AutoML. AutoML-Zero has not produced state-of-the-art results yet, but what is exciting is that it generates machine learning programs starting from nothing more than basic operations like matrix-vector multiplication. The objective was to reduce human bias in the search space, and I hope this method will go on to discover new fundamental building blocks for machine learning.
