Multimodal Deep Learning

Fusion of multiple modalities using Deep Learning

Purvanshi Mehta
Towards Data Science


Being highly enthusiastic about research in deep learning, I was always searching for unexplored areas in the field (though it is tough to find one). I had previously worked on math word problem solving and related topics.

The challenge of using deep neural networks as black boxes piqued my interest. I decided to dive deeper into the topic of “Interpretability in multimodal deep learning”. Here are some of the results.

Multimodal data

Our experience of the world is multimodal — we see objects, hear sounds, feel textures, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together.

For example, images are usually associated with tags and text explanations; texts contain images to more clearly express the main idea of the article. Different modalities are characterized by very different statistical properties.

Multimodal Deep Learning

Though combining different modalities or types of information to improve performance seems an intuitively appealing task, in practice it is challenging to handle the varying levels of noise and the conflicts between modalities. Moreover, modalities have different quantitative influences on the prediction output. The most common method in practice is to combine high-level embeddings from the different inputs by concatenating them and then applying a softmax.

Example of multimodal deep learning, where different types of neural networks are used to extract features
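
In code, this common baseline looks roughly like the following. This is a minimal PyTorch sketch, not the exact architecture behind the figure: the sub-network outputs are assumed to already be high-level embeddings, and the dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Baseline fusion: concatenate per-modality embeddings, then classify."""

    def __init__(self, embed_dims, num_classes):
        super().__init__()
        # One classifier head over the concatenated high-level embeddings.
        self.classifier = nn.Linear(sum(embed_dims), num_classes)

    def forward(self, embeddings):
        # embeddings: list of tensors, one per modality, each of shape (batch, d_m)
        fused = torch.cat(embeddings, dim=-1)
        return torch.softmax(self.classifier(fused), dim=-1)

# Example: three sub-network outputs of different sizes (illustrative dimensions).
text_emb, audio_emb, video_emb = torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 256)
model = ConcatFusion([128, 64, 256], num_classes=3)
probs = model([text_emb, audio_emb, video_emb])  # shape (8, 3)
```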

The problem with this approach is that it gives equal importance to all the sub-networks / modalities, which is highly unlikely to be optimal in real-life situations.

All Modalities have an equal contribution towards prediction

Weighted Combination of Networks

We take a weighted combination of the sub-networks, so that each input modality can have a learned contribution (theta) towards the output prediction.

Our optimization problem becomes -

Loss Function after Theta weight is given to each sub-network.
The output is predicted after attaching weights to the subnetworks.
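
Concretely, the idea can be sketched as follows, assuming softmax-normalized scalar weights theta (one per sub-network) that are trained jointly with everything else via the usual classification loss. The exact parameterization in the original work may differ; this PyTorch snippet is only my reading of the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Weighted combination of sub-network predictions with learned thetas."""

    def __init__(self, embed_dims, num_classes):
        super().__init__()
        # One classification head per modality / sub-network.
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in embed_dims])
        # Learned contribution theta_m for each modality.
        self.theta = nn.Parameter(torch.zeros(len(embed_dims)))

    def forward(self, embeddings):
        weights = F.softmax(self.theta, dim=0)                                 # contributions sum to 1
        logits = torch.stack([h(e) for h, e in zip(self.heads, embeddings)])   # (M, batch, classes)
        return (weights[:, None, None] * logits).sum(dim=0)                    # weighted sum over modalities

model = WeightedFusion([128, 64, 256], num_classes=3)
logits = model([torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 256)])
# The loss is the usual classification loss on the weighted prediction;
# the thetas are optimized jointly with the heads and sub-networks.
loss = F.cross_entropy(logits, torch.randint(0, 3, (8,)))
```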

But what is the use of all this?

Let's get to the point where I start bragging about the results.

Accuracy and Interpretability

We achieve state-of-the-art results on two real-life multimodal datasets -

Multimodal Corpus of Sentiment Intensity (MOSI) dataset — an annotated dataset of 417 videos with per-millisecond annotated audio features. There are a total of 2199 annotated data points, where sentiment intensity is defined from strongly negative to strongly positive on a linear scale from −3 to +3.

The modalities are -

  1. Text
  2. Audio
  3. Video

Amount of contribution of each modality to sentiment prediction
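
With a weighted fusion like the sketch above, the per-modality contribution can be read directly from the normalized thetas. A small hypothetical helper (names follow the earlier WeightedFusion sketch, not the original code):

```python
import torch

def modality_contributions(model, names):
    # Normalize the learned thetas so the contributions sum to 1.
    weights = torch.softmax(model.theta.detach(), dim=0)
    return {name: round(w.item(), 3) for name, w in zip(names, weights)}

# Returns a dict mapping each modality name to its normalized contribution.
print(modality_contributions(model, ["text", "audio", "video"]))
```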

Transcription Start Site Prediction (TSS) dataset — Transcription is the first step of gene expression, in which a particular segment of DNA is copied into RNA (mRNA). The transcription start site is the location where transcription starts. The different parts of the DNA fragment have different properties that affect the presence of a TSS. We divided each input sequence into three parts (a rough encoding sketch follows the list) -

  1. Upstream DNA
  2. Downstream DNA
  3. TSS region
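
One plausible way to feed these three regions as separate modalities is to one-hot encode each region and run a small 1-D convolutional encoder per region before the weighted fusion. The sketch below is an assumption about the encoding, not the architecture used in the experiments; region lengths, filter sizes, and the encoder itself are placeholders.

```python
import torch
import torch.nn as nn

DNA = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    # Encode a DNA string as a (4, length) one-hot tensor.
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[DNA[base], i] = 1.0
    return x

class RegionEncoder(nn.Module):
    """Small 1-D CNN that maps a one-hot DNA region to a fixed-size embedding."""

    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(4, embed_dim, kernel_size=9, padding=4)
        self.pool = nn.AdaptiveMaxPool1d(1)

    def forward(self, x):                                        # x: (batch, 4, length)
        return self.pool(torch.relu(self.conv(x))).squeeze(-1)   # (batch, embed_dim)

# One encoder per region: upstream DNA, downstream DNA, TSS region.
encoders = nn.ModuleList([RegionEncoder() for _ in range(3)])
regions = [one_hot("ACGT" * 50).unsqueeze(0) for _ in range(3)]  # placeholder sequences
embeddings = [enc(x) for enc, x in zip(encoders, regions)]       # feed these into WeightedFusion
```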

We achieved an unprecedented improvement of 3% over the previous state-of-the-art results. The downstream DNA region with the TATA box has the most influence on the process.

We also performed experiments on synthetically generated data to verify our theory.

Now we are in the process of drafting a paper to be submitted to an ML journal.

If you are interested in the mathematical details or the scope of multimodal learning in general, ping me at purvanshi.mehta11@gmail.com. Comments on the work are welcome.
