Deep learning has proven itself in many domains and on a variety of tasks, such as image classification and text generation. Tasks whose inputs come from multiple modalities are an interesting research area.
The Gated Multimodal Unit (GMU) is a building block proposed in a recent paper, presented as a workshop contribution at ICLR 2017. Its goal is to fuse information from multiple modalities in a smart way.
In this post I’ll describe the GMU, and illustrate how it works on a toy data set.
The architecture
Given two representations of different modalities, xᵛ and xᵗ (a visual and a textual modality, for instance), the GMU block performs a form of self-attention:

The equations describing the GMU are relatively simple:
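For the bimodal case, in the paper's notation (bias terms omitted; σ is the sigmoid function, [xᵛ, xᵗ] denotes concatenation, and the products are element-wise):

hᵛ = tanh(Wᵛ · xᵛ)    (1)
hᵗ = tanh(Wᵗ · xᵗ)    (2)
z = σ(Wᶻ · [xᵛ, xᵗ])    (3)
h = z * hᵛ + (1 − z) * hᵗ    (4)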

Equations (1) and (2) transform the input representations into new hidden representations, which are then weighted in (4) according to the gate z calculated in (3). Since z is a function of xᵛ and xᵗ, we're dealing with a self-attention mechanism.
The intuition behind the GMU is that it uses the representations themselves to decide which of the modalities should affect the prediction. Consider the task of predicting the gender of a photographed person, given both the photo and a recording of their voice. If the recording in a given example is too noisy, the model should learn to rely only on the image for that example.
Synthetic data
In the paper they describe a nice synthetic data set which demonstrates how the GMU works.
Here we’ll implement the same data set, and find out for ourselves whether or not the GMU actually works (spoiler alert: it does).
First, let’s do the imports:
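Here's a minimal setup for the snippets below, assuming NumPy for the data, tf.keras for the model and matplotlib for the plots:

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
```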
Generate the data

Don’t let the graph scare you – later on you’ll find a visualization of the data generated by this graph.
Basically, the graph says that the target class C determines the values of the modalities yᵛ and yᵗ – with some randomness of course.
Next, the random variable M decides which of the inputs yᵛ, yᵗ to ignore and replace with a source of noise, ŷᵛ or ŷᵗ.
In the end, xᵛ and xᵗ contain either the real source of information, which can describe the target class C, or random noise.
The goal of the GMU block is to successfully find out which one of the sources is the informative one given a specific example, and to give all the attention to that source.
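Here's a sketch of this generative process. The exact distributions and constants below are arbitrary choices of mine (class-conditional Gaussians for the informative signals, class-independent Gaussians for the noise); what matters is that M chooses, per example, which modality actually carries information about C:

```python
def generate_data(n_samples=2000, seed=42):
    rng = np.random.RandomState(seed)

    c = rng.randint(0, 2, size=n_samples)  # target class C
    m = rng.randint(0, 2, size=n_samples)  # which modality is informative

    # Informative signals: their mean depends on the class, with some randomness.
    y_v = rng.normal(loc=2 * c - 1, scale=0.5)
    y_t = rng.normal(loc=1 - 2 * c, scale=0.5)

    # Pure noise, independent of the class.
    y_v_hat = rng.normal(loc=0, scale=1.0, size=n_samples)
    y_t_hat = rng.normal(loc=0, scale=1.0, size=n_samples)

    # M decides which modality carries the real signal and which gets noise.
    x_v = np.where(m == 1, y_v, y_v_hat)
    x_t = np.where(m == 0, y_t, y_t_hat)

    return (x_v.reshape(-1, 1).astype('float32'),
            x_t.reshape(-1, 1).astype('float32'),
            c.astype('float32'))


x_v, x_t, c = generate_data()
```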

Create the model
I’ll implement a basic version of the GMU – just to make it easier to comprehend.
Generalizing the code to handle more than two modalities is straightforward.
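Here's one way to write it as a custom tf.keras layer, following equations (1)–(4). Returning z alongside the fused representation is just a convenience for the visualizations later on; the layer sizes and the rest of the details are my own choices for this sketch:

```python
class GatedMultimodalUnit(keras.layers.Layer):
    """A basic two-modality GMU, following equations (1)-(4)."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.dense_v = keras.layers.Dense(units, activation='tanh')     # (1)
        self.dense_t = keras.layers.Dense(units, activation='tanh')     # (2)
        self.dense_z = keras.layers.Dense(units, activation='sigmoid')  # (3)

    def call(self, inputs):
        x_v, x_t = inputs
        h_v = self.dense_v(x_v)
        h_t = self.dense_t(x_t)
        z = self.dense_z(tf.concat([x_v, x_t], axis=-1))
        h = z * h_v + (1 - z) * h_t                                      # (4)
        return h, z


x_v_in = keras.Input(shape=(1,), name='x_v')
x_t_in = keras.Input(shape=(1,), name='x_t')
h, z = GatedMultimodalUnit(units=1, name='gmu')([x_v_in, x_t_in])
prediction = keras.layers.Dense(1, activation='sigmoid')(h)

model = keras.Model(inputs=[x_v_in, x_t_in], outputs=prediction)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```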
Train the model
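A sketch of the training call; the hyper-parameters here (epochs, batch size, validation split) are arbitrary:

```python
history = model.fit(
    [x_v, x_t], c,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0)
```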

Inspect results
The loss is looking good.
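We can plot it from the history object returned by fit:

```python
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.legend()
plt.show()
```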
Let's see what z and the predictions look like. The following visualizations appear in the paper as well.
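A sketch of how to produce them: build a second model that outputs z (reusing the trained layers), evaluate it and the predictions on a grid covering the input space, and plot the two side by side. The grid range and the plotting details are arbitrary:

```python
# A model that exposes the gate z, sharing the trained layers with `model`.
z_model = keras.Model(inputs=[x_v_in, x_t_in], outputs=z)

# Evaluate z and the predictions on a grid covering the input space.
grid = np.linspace(-3, 3, 200, dtype='float32')
grid_v, grid_t = np.meshgrid(grid, grid)
flat_v, flat_t = grid_v.reshape(-1, 1), grid_t.reshape(-1, 1)

z_values = z_model.predict([flat_v, flat_t]).reshape(grid_v.shape)
predictions = model.predict([flat_v, flat_t]).reshape(grid_v.shape)

fig, (ax_z, ax_pred) = plt.subplots(1, 2, figsize=(10, 4))
ax_z.contourf(grid_v, grid_t, z_values)
ax_z.set_title('z')
ax_pred.contourf(grid_v, grid_t, predictions > 0.5)
ax_pred.set_title('predicted class')
for ax in (ax_z, ax_pred):
    ax.set_xlabel('$x^v$')
    ax.set_ylabel('$x^t$')
plt.show()
```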

We can see that z behaves exactly as we want (left figure). What's nice about it is that the class of points residing far from the boundary line is predicted using practically only one of the modalities. It means the model learned when to ignore the modality that contains pure, unpredictive noise.
Why not use a simple FF (feed-forward) network?
If we ignore the data-generating process and just look at the data points, there are clearly 4 distinct clusters.
These clusters aren't linearly separable. While the GMU gives the model the capacity to explain this non-linear behaviour, one could just throw another layer into the mix instead, thus solving the problem with a plain feed-forward (FF) network.
_The universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions… (Wikipedia)_
So indeed, for this contrived example a simple FF network will do the job. However, the point of introducing new architectures (the GMU in this case) is to add inductive bias that allows the training process to take advantage of prior knowledge we have about the problem.
Conclusion
For real-world problems involving multiple modalities, the authors claim the GMU achieves superior performance. They showcased their approach using the task of identifying a movie's genre based on its plot and its poster.
The GMU is easy to implement, and it may be worthwhile to keep it in your tool belt in case you need to train a model that uses multiple modalities as input. To this end, you can create a sub-network for each modality. The sub-networks need not be the same – you can, for instance, use a CNN for the visual modality and an LSTM for the textual one. What matters is that each sub-network outputs a dense representation of its modality. Then, feed these representations into a GMU block in order to fuse the information into one representation. The fused representation is then fed into another sub-network whose output is the final prediction.
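For illustration, here's a rough sketch of such a pipeline in tf.keras, reusing the GatedMultimodalUnit layer from above; the specific sub-network architectures, sizes and input shapes are placeholders, not the paper's actual setup:

```python
n_genres = 10  # placeholder for the number of target classes

# Visual sub-network: a small CNN that outputs a dense representation.
image_in = keras.Input(shape=(64, 64, 3), name='poster')
v = keras.layers.Conv2D(32, 3, activation='relu')(image_in)
v = keras.layers.GlobalAveragePooling2D()(v)
visual_repr = keras.layers.Dense(64, activation='relu')(v)

# Textual sub-network: an LSTM over token ids that outputs a dense representation.
text_in = keras.Input(shape=(100,), dtype='int32', name='plot')
t = keras.layers.Embedding(input_dim=10000, output_dim=64)(text_in)
textual_repr = keras.layers.LSTM(64)(t)

# Fuse the two representations with a GMU, then predict with another sub-network.
fused, _ = GatedMultimodalUnit(units=64)([visual_repr, textual_repr])
out = keras.layers.Dense(n_genres, activation='sigmoid')(fused)

multimodal_model = keras.Model(inputs=[image_in, text_in], outputs=out)
```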
This post was originally posted at www.anotherdatum.com.