Adaptable Networks
Task embedding is one technique for making neural networks adaptable to new tasks, and it promises to reduce some of the problems that other methods have. For each task a network is trained to perform, a rich representation of the task (a task embedding vector) is provided to the network during training. When the embedding is constructed well, similar tasks will have similar task embedding vectors. Given a new task embedding for a new task, the network can then infer what it is supposed to do without additional training.
A popular method for extending the usefulness of a trained network is transfer learning. The basic idea is that a neural network is first trained to do some task using a very large corpus of training data and then is further trained to perform another task using a smaller corpus. For instance, a network classifies documents far better if it was first pre-trained to "fill in the blank" on sentences with missing words before being trained to classify documents.
There are a few downsides to this approach. If the second step in the transfer learning process is too extensive, the network will begin to "forget" what it was initially trained on, and the likelihood of forgetting grows as the network is asked to learn more tasks. This is known as "catastrophic forgetting". A further limitation of all networks, even those that use transfer learning, is that they handle only a finite number of tasks and must be trained (or retrained) to learn a new one.
Task embeddings may be an alternate or complementary approach, since they could reduce the number of training sessions required to perform well on multiple tasks. The power of task embedding is illustrated with two simple datasets: a univariate example and an MNIST example.
Task Embedding Concept
One way to think about a neural network is as a machine that is trained to map points from one space to points in another space. If trained carefully, the machine will generalize, meaning it will map a point it has never seen based on the similarity of that point to points it has seen. (See Figure 1 left section.)

A network trained on a given task fails when presented with a new task: it has no way to know that it needs to map differently, nor how to change the mapping it has learned. With task embedding, we provide the network rich information about the task it is performing while training it on multiple tasks simultaneously, so it learns a different mapping for each task. (See Figure 1, middle section.) When presented with a set of points it has never seen and asked to perform a task it has never been trained on, the network interpolates between the tasks it knows and derives a suitable mapping for the new points. (See Figure 1, right section.)
Univariate Example
In this example, a network is asked to predict the y-value of one of a number of linear or quadratic equations given an x-value. The equations take the form of:
-8*(x+1)*(x-4)+9
An algorithm is used to generate equations. With coefficients limited to the range 0 to 9, the total number of possible equations is 128,304. Each of the algorithm's 9 steps makes a random choice, and the sequence of choices made while generating an equation serves as a vector representation of that equation. For the equation above, the vector representation is
[1, 8, 1, 2, 1, 1, 4, 2, 9]
This serves as the task embedding for the task of predicting y-values for this equation given x-values.
To use a task embedding, the embedding is concatenated with the x-value and then fed into the network. For the equation above, an x data point with a value of -5, for instance, would be represented by concatenating the equation representation with the x-value:
[1, 8, 1, 2, 1, 1, 4, 2, 9, -5]
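As a minimal sketch of this concatenation step (assuming NumPy, which the post does not specify; embed_input is a hypothetical helper name):

    import numpy as np

    # Task embedding for -8*(x+1)*(x-4)+9, i.e. the sequence of choices
    # made by the generation algorithm, as described above.
    task_embedding = np.array([1, 8, 1, 2, 1, 1, 4, 2, 9])

    def embed_input(x, embedding=task_embedding):
        """Concatenate the task embedding with a single x-value."""
        return np.concatenate([embedding, [x]])

    print(embed_input(-5))  # [ 1  8  1  2  1  1  4  2  9 -5]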
A training data set is constructed from 400 random equations with 2,000 x-y pairs each. For evaluation purposes, the training set is forced to contain four known equations and to exclude four others.
The training set is used to train a network with a single 100-node hidden layer for 2 epochs.
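The post does not name the framework, activation, optimizer, or loss, so the sketch below assumes tf.keras with ReLU, Adam, and mean-squared error; the training arrays are hypothetical placeholders for the 800,000 embedded x-y pairs described above:

    import numpy as np
    import tensorflow as tf

    # Placeholders for the real training data: each row is a 9-digit task
    # embedding concatenated with an x-value; targets are the matching y-values.
    X_train = np.random.rand(800_000, 10).astype("float32")
    y_train = np.random.rand(800_000).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(100, activation="relu"),  # single 100-node hidden layer
        tf.keras.layers.Dense(1),                        # predicted y-value
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_train, y_train, epochs=2, batch_size=128)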
Once trained, the network is asked to infer the y-values for a randomly generated set of x-values for each of the 4 equations known to be in the training set and the 4 equations known to be outside it. In other words, the network is asked to perform 8 different tasks but has been trained on only four of them. The top row of graphs shows equations within the training set; the second row shows equations outside it. The blue points are actual values and the orange points are predictions. (See Figure 2.)

As a control, the same experiment was performed except that, instead of a rich task embedding, a single random digit was concatenated to the x-values in the training set for each equation task. The network was able to learn the correct mapping for the tasks, though not as well as with the rich task embedding, and it utterly failed to predict anything for the tasks it had not seen. (See Figure 3.)

The methodology to generate the task embedding results in a fairly compact representation – only 9 digits. As shown below, compact task embeddings generate better results.
A naive approach to generating a task embedding for each equation is simply to assign a numeric value to each of the 16 characters in the string that describes the equation – this results in roughly 10^15 more states than the compact approach. It failed miserably and couldn't be improved with hyperparameter adjustments or more epochs of training. (See Figure 4.)

However, compressing the naive task embedding using a principal components analysis (PCA) compression method (from 16 down to 11 digits) significantly recovered the fidelity seen in the compact approach. (See Figure 5.)
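As a rough sketch of that compression step (assuming scikit-learn; naive_embeddings is a hypothetical array holding one 16-element naive encoding per training equation):

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical naive embeddings: one numeric code per character,
    # 16 characters per equation, one row per training equation.
    naive_embeddings = np.random.randint(0, 20, size=(400, 16)).astype(float)

    pca = PCA(n_components=11)               # compress 16 codes down to 11
    compact_embeddings = pca.fit_transform(naive_embeddings)
    print(compact_embeddings.shape)          # (400, 11)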

So from this simplistic example we see that:
- Without task embeddings, networks fail to perform on tasks they’ve not been trained on
- Task embeddings work better when they are compact
MNIST Example
The MNIST dataset is used to train a network to perform 7 tasks. In each task, the network is trained to learn whether images are of a specified digit or not (for digits 0 to 6). For example, one task would be, "Is this image the digit 4?" Then the network is evaluated on how well it does on all ten tasks (for digits 0 to 9) without additional training. The accuracy of the network on a task is based on how often the network answers the question correctly. (See Figure 6.)
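For illustration, the binary labels for these tasks might be built as in the sketch below (assuming the Keras MNIST loader, which the post does not specify; binary_labels is a hypothetical helper):

    import numpy as np
    from tensorflow.keras.datasets import mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    def binary_labels(labels, digit):
        """Labels for the task: 'Is this image the digit <digit>?'"""
        return (labels == digit).astype("int32")

    # Seven training tasks, one per digit 0-6; evaluation later covers 0-9.
    train_tasks = {d: binary_labels(y_train, d) for d in range(7)}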

With task embedding, the network is able to achieve similar results across the digits it was trained on and the digits it was not. Without task embedding, the best the network could perform on digits 7, 8, and 9 would be 10% since that is the prevalence of those digits in the test set. (See Figure 7.)

From this example, we learn that:
- Task embeddings are extensible to more complex data types
- A PCA-based compression generates a suitable task embedding
Compact Task Embedding for MNIST
A simplistic approach to creating a task embedding for the MNIST example is to choose a randomly selected image for each task. For the question, "Is this image the digit 4?", one could randomly choose an image with label 4 from the training set and concatenate it to all training and test images for this task. Since MNIST images are 28×28 pixels, a total of 784 elements, a task-embedded image would be twice that: 1,568 elements.
Based on what was learned in the univariate example, a compact task embedding learns faster and produces higher accuracy, so a compression technique is applied. In this example, principal component analysis (PCA) is used to compress a randomly selected image from 784 elements down to 264 elements (a 66% reduction). The compressed image is then concatenated to the training and test images for the task.
Effectively, PCA identifies a linear transformation that retains as much information as possible while reducing an image to a fixed number of dimensions. In this example, the PCA algorithm is given a random set of 1,000 images from the training set for digits 0 to 6 so that it can build the linear transformation. Digits 7 to 9 are excluded, since the network must be trained without any knowledge of alternate tasks or of data for alternate tasks. The resulting transformation preserves 99% of the variation among the 1,000 images. The linear transformation is then applied to a randomly selected image for each task.
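A sketch of this procedure, assuming scikit-learn's PCA and the Keras MNIST loader (neither of which the post specifies), might look like:

    import numpy as np
    from sklearn.decomposition import PCA
    from tensorflow.keras.datasets import mnist

    (x_train, y_train), _ = mnist.load_data()
    x_flat = x_train.reshape(-1, 784).astype("float32") / 255.0

    # Fit PCA on 1,000 random images of digits 0-6 only (digits 7-9 are held out).
    known = x_flat[y_train < 7]
    sample = known[np.random.choice(len(known), 1000, replace=False)]
    pca = PCA(n_components=264).fit(sample)

    def embed_task_images(images, task_digit):
        """Concatenate a compressed, randomly chosen image of task_digit to every image."""
        candidates = x_flat[y_train == task_digit]
        anchor = candidates[np.random.randint(len(candidates))]
        embedding = pca.transform(anchor[None, :])                             # shape (1, 264)
        return np.hstack([images, np.repeat(embedding, len(images), axis=0)])  # shape (n, 1048)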
Other image compression techniques may prove equally effective and may have other advantages, such as requiring less processing.
Conclusions
Neural networks can be made extensible with the use of task embeddings. A couple of simple examples have shown that networks trained on a set of tasks can, without additional training, perform additional tasks. It's possible that incorporating task embeddings into transfer learning approaches may alleviate some catastrophic forgetting challenges.
Naive approaches to constructing task embeddings may severely impede a network's ability to learn and may limit its overall accuracy. Compressed task embeddings are shown to be viable in the univariate and MNIST examples, and there seems to be plenty of room to explore compression alternatives.
In the examples explored in this post, the author chose methods to generate task embeddings based on an understanding of the data and tasks. Creating a system that can derive its own optimal task embeddings without expert intervention may be a cornerstone of general intelligence. One question along these lines is whether the training data contains sufficient information for a system to construct its own embedding method. Another is whether there are embedding methods that work well for a subset of all possible tasks but still fail on certain classes of tasks.