How to Create Powerful AI Representations by Combining Multimodal Information

Learn how you can incorporate multimodal information into your machine-learning system

In this article, I will discuss how you can incorporate information from different modalities into your machine learning system. These modalities can be information like an image, text, or audio. It can also, for example, be several images of the same object taken from different angles. Adding information from different modalities gives the machine learning system more information to work with, which can, in turn, increase the performance of the system.

Learn how you can combine information from different modalities in this article. Image by ChatGPT. "make an image of combining multimodal information within machine learning" prompt. ChatGPT, 4, OpenAI, 1 Apr. 2024. https://chat.openai.com.

Motivation

My motivation for this article is that I am currently working on a problem where I have information from two different modalities. The first modality is the visual information of a document, and the second modality is the text contained within the document. A machine learning system can achieve decent performance using only the visual data or only the textual data from the document. However, if you are only using one of the two available modalities, you are not giving the machine learning system all the information it could work with. Therefore, you should combine the different modalities to ensure the best possible performance of your machine learning system.

Though you often have two data modalities when working with multimodal systems, you can adapt all the approaches I discuss below to three or more data modalities. I am using two modalities primarily to describe the approaches as simply as possible.

You can see the general outline of each approach I discuss in this article below. First, you need at least two information modalities, for example, an image and a document’s text. Then, you create embeddings of each modality. These embeddings are then combined in a combination process, and the result represents the multimodal information.

This image represents the general outline for each approach discussed in this article. You start with different information modalities, like image and text, and create embeddings from the data. These embeddings are then combined, creating multimodal information. Image by the author.

Table of Contents

· Motivation
· Table of Contents
· Use multimodal models
· Train a neural network to combine embeddings
· Create a multimodal graph
· Combine vectors
· Ensemble models
· Conclusion

Use multimodal models

The first approach I will discuss to incorporate information from different modalities is to use a multimodal model. Multimodal models consider different modalities in their architecture. On HuggingFace, you can find a range of multimodal models, including models that combine image and text (vision-language multimodal models) or audio and text. These models have an architecture designed to incorporate both modalities of information and are trained on a downstream task with both modalities in mind.

The image shows the general architecture of a multimodal model: an image and a text modality are input into a multimodal model, which creates a prediction. Image by the author.

What makes the HuggingFace models convenient is that they are typically pre-trained, meaning you can use them off the shelf for the downstream task the model was trained for. Making powerful models so easily accessible is one of HuggingFace’s significant contributions. I also recommend checking out some of HuggingFace’s other offerings, such as a blog with interesting topics, datasets you can use to train your models, and spaces you can use to host your models.
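
As an illustration, below is a minimal sketch of using one such pre-trained vision-language model from HuggingFace. The model name openai/clip-vit-base-patch32 and the file path are example placeholders; swap in whichever multimodal model fits your task.

# minimal sketch: scoring candidate texts against an image with a pre-trained model
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("document_page.png")  # hypothetical file path
texts = ["an invoice", "a scientific paper", "a handwritten letter"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores for each candidate text
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)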

The HuggingFace models can be a powerful way to utilize several information modalities, and I have discussed in more depth how you can use them in my article on how to create powerful embeddings. A downside, however, is that each model is trained for a specific task, which might not align with your needs. If no model on HuggingFace performs your downstream task, or the available models do not fit your specific needs, consider looking into other options for utilizing the different modalities of data you have available. I will mention several such options later in this article.

Train a neural network to combine embeddings

You can also train your own neural network to combine different data modalities. First, you need embeddings for each of your modalities, which you can learn how to create in my article on creating robust embeddings below:

How to Create Powerful Embeddings from Your Data to Feed into Your AI

These embeddings represent your data, and you need to combine them to create a multimodal embedding. One way to do this is to first concatenate the embeddings and then pass them through a linear layer that reduces the concatenated embedding back to the same shape as the original embedding for each modality. You can see an image visualizing this process below.

# code to combine two embedding tensors with a linear layer
import torch
import torch.nn as nn

tensor1 = torch.randn(5, 3)  # embedding for modality 1 (e.g., image)
tensor2 = torch.randn(5, 3)  # embedding for modality 2 (e.g., text)

# concatenate the embeddings along the feature dimension
combined_tensor = torch.cat((tensor1, tensor2), dim=1)

with torch.no_grad():  # no gradients needed in this example
    # the linear layer maps the concatenated embedding back to the original shape
    linear_layer = nn.Linear(
        in_features=tensor1.shape[1] + tensor2.shape[1],
        out_features=tensor1.shape[1],
    )
    output_tensor = linear_layer(combined_tensor)

This will create tensors like the ones below:

The image shows my tensors after running the code above. Note that the embeddings are randomly initialized, so you will not get the exact same numbers, though the shape of the output should be the same. Furthermore, the weights of the linear layer are also randomly initialized, so the output tensor will differ even if you have identical input tensors. Image by the author.

You can visualize the process done by the code above with the image below:

This is an image showing the combining process. You start with two embeddings, A and B. Then, you concatenate them and send them into a linear layer. The linear layer then restores the original vector shape, resulting in a new output tensor with the same shape as the original tensors, containing information from both original tensors. Image by the author.

Training a neural network to combine embeddings from different modalities can create new robust embeddings containing information from multiple modalities. These embeddings can be used for several use cases, including classification and clustering. You can also test the quality of your newly created embeddings by reading my article on understanding embedding quality. With this, you can compare the quality of your original and newly combined embeddings; the combined embeddings should show increased performance on the embedding quality metrics discussed in my article.
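
If you have labels for a downstream task, you can fine-tune the combination layer together with a task head. Below is a minimal sketch of that training loop, assuming a labeled classification dataset; the batch, label count, and embedding sizes are hypothetical placeholders.

# minimal sketch: fine-tuning a combination layer and a classification head
import torch
import torch.nn as nn

emb_dim, num_classes = 3, 4
combine_layer = nn.Linear(2 * emb_dim, emb_dim)  # combines the two modalities
classifier = nn.Linear(emb_dim, num_classes)  # downstream task head
optimizer = torch.optim.Adam(
    list(combine_layer.parameters()) + list(classifier.parameters()), lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

# dummy batch standing in for your labeled dataset
emb_image = torch.randn(8, emb_dim)
emb_text = torch.randn(8, emb_dim)
labels = torch.randint(0, num_classes, (8,))

for _ in range(10):  # a few training steps
    optimizer.zero_grad()
    combined = combine_layer(torch.cat((emb_image, emb_text), dim=1))
    logits = classifier(combined)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()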

However, a downside to this approach is that it requires labeled data, since you are training a network to combine the embeddings. You need to fine-tune the weights of the newly created linear layer so it can learn how to best integrate the embeddings from the different modalities, a process that requires a labeled dataset. If labeled data is not available, you can learn more about creating a synthetic dataset in my article on creating synthetic data, or you can look at another approach I provide in this article.

Create a multimodal graph

Another option for combining different data modalities is to create a multimodal graph and perform downstream tasks on this graph directly. If your problem is suitable for graph use, this can be a robust way of combining different modalities of information.

To create the multimodal graph, I assume you already have embeddings for each data modality. Given these embeddings, there are several ways to create the graph, some of which are listed below (a code sketch of the threshold option follows the list):

  • Use KMeans. With this approach, you cluster the embeddings and make edges between embeddings that end up in the same cluster, i.e., embeddings that are similar to each other.
  • Use a threshold. After finding the similarities between all embeddings with cosine similarity, you can make edges between embeddings with similarity higher than the threshold.
  • Use a percentile. You keep the top percentage of the most similar edges by setting a percentile.
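
A minimal sketch of the threshold option is shown below, assuming you already have one embedding per node. networkx is used here for the graph structure, and the threshold value is an assumption you should tune for your data.

# minimal sketch: build a graph by thresholding pairwise cosine similarities
import torch
import networkx as nx

embeddings = torch.randn(10, 3)  # one embedding per node (dummy data)
normed = embeddings / embeddings.norm(dim=1, keepdim=True)
similarity = normed @ normed.T  # pairwise cosine similarities

threshold = 0.8  # assumed value; tune for your data
graph = nx.Graph()
graph.add_nodes_from(range(embeddings.shape[0]))
for i in range(embeddings.shape[0]):
    for j in range(i + 1, embeddings.shape[0]):
        if similarity[i, j] > threshold:
            graph.add_edge(i, j, weight=float(similarity[i, j]))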

With this approach, you can make one graph per data modality. To test the quality of each of your graphs, you can read my article on testing graph quality. You can visualize the multimodal graph you created in two ways, as shown below. One way is as a multigraph, where you stack one graph per modality, making up a 3D graph. Another way is as a heterogeneous graph, with different kinds of edges representing similarities within the various modalities of data.

I show you two ways of visualizing a multimodal graph in the image. Option 1 is to visualize it as a 3D graph, where you make one graph per modality. Option 2 is to visualize it as a heterogeneous graph with different kinds of edges. Image by the author.
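
If you prefer the heterogeneous-graph view, a minimal sketch with a networkx MultiGraph is shown below, where parallel edges carry a modality key. The node names and similarity weights are made-up placeholders.

# minimal sketch: a heterogeneous graph with one edge type per modality
import networkx as nx

graph = nx.MultiGraph()
# two documents that are similar both visually and textually
graph.add_edge("doc_0", "doc_1", key="image", weight=0.91)
graph.add_edge("doc_0", "doc_1", key="text", weight=0.78)
# two documents that are only similar in the text modality
graph.add_edge("doc_1", "doc_2", key="text", weight=0.85)

print(list(graph.edges(keys=True, data=True)))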

After creating your graph, you can perform different machine-learning tasks on it. One task you can perform right away is community detection, where you detect which community each node in the graph belongs to. Another approach is creating embeddings from the graph you have made, which I have written an article about below:

How To Create Powerful Embeddings From Topology Information In Graphs

The embeddings you create from the graph’s topology information can then be used to perform different machine-learning tasks, such as classification, regression, clustering, and so on. However, converting from topology information to embeddings might cause you to lose some information stored in the graph.
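
If your task can be solved on the graph directly, you avoid this conversion step. For example, the community detection mentioned above can be run straight on the graph. Below is a minimal sketch with networkx, using a built-in demo graph as a stand-in for your multimodal graph.

# minimal sketch: community detection run directly on a graph
import networkx as nx
from networkx.algorithms import community

graph = nx.karate_club_graph()  # stand-in for your own multimodal graph
communities = community.greedy_modularity_communities(graph)
for i, nodes in enumerate(communities):
    print(f"Community {i}: {sorted(nodes)}")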

A problem with the graph-based approach is that not all machine learning tasks are suitable for graphs, though you might be able to adapt your task or data to fit this approach. Another downside is that, when creating the graphs, you mostly compare information within a modality and do less to combine information across modalities.

Combine vectors

Furthermore, you can also combine the embeddings directly to incorporate information from different modalities. There are three main ways to combine the embeddings:

  1. Concatenating
  2. Adding
  3. Multiplying

You can see these visualized in the image below:

Image showing how you can combine the example embeddings A and B. Image by the author.

After combining the embeddings, you should likely normalize the vectors as well since the combining process will change the scale of the data.

Combining the embeddings with concatenation, addition, or multiplication is a simple yet effective approach for combining the information from different modalities. This approach can be implemented in Python with a few lines of code, as you can see below.

# Define example embedding vectors A and B
import torch

A = torch.tensor([0.1, 0.3, 0.2])
B = torch.tensor([0.5, 0.1, 0.4])

# three ways of combining the embeddings
res_concat = torch.cat((A, B), 0)
res_addition = A + B
res_multiplication = A * B

# normalize the result vectors, since combining changes the scale of the data
res_concat = res_concat / torch.norm(res_concat)
res_addition = res_addition / torch.norm(res_addition)
res_multiplication = res_multiplication / torch.norm(res_multiplication)

The newly created vector will then contain information from the modalities of the original vectors used to make the new vector. To test the vectors you create, you can read my article on testing embedding quality.

A downside to this approach is that, due to its simplicity, you will lose some information from each modality when combining the vectors. For example, adding two vectors creates one vector that represents both of the originals. This new vector, however, does not retain the complete information of both original vectors, as you can see in the example below:

An example showing how different vectors (A1, B1 and A2, B2) can create the same combined vectors (C1 and C2). This highlights a weakness of using addition or multiplication as a tool for combining embeddings. Image by the author.
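
To make this concrete, here is a tiny numeric illustration with made-up vectors: two different pairs of embeddings sum to the same combined vector, so the originals cannot be recovered from the result.

import torch

# two different pairs of embeddings...
A1, B1 = torch.tensor([0.2, 0.4]), torch.tensor([0.3, 0.1])
A2, B2 = torch.tensor([0.1, 0.3]), torch.tensor([0.4, 0.2])

# ...that add up to the same combined vector
print(A1 + B1)  # tensor([0.5000, 0.5000])
print(A2 + B2)  # tensor([0.5000, 0.5000])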

The same issue will arise when multiplying the vectors. Concatenating the vectors will not create this issue, though you should note that the new concatenated vector will be of a different shape than the original vectors.

Ensemble models

Using ensemble models is also an effective way of combining information from different modalities. An ensemble is built by training several models and then combining the output of each model into one combined prediction. The ensemble approach is similar to the approach I described in training a neural network to combine embeddings. However, in this case, the combination of the different modalities happens after the output of each model (in contrast to my previously described approach, where the combination happens earlier in the network).

An example of an ensemble network is one with two models (an image prediction model and a text prediction model) making up the ensemble. Each model makes its output, combined into an ensemble output with a combining function. Image by the author.

Ensemble models are powerful because they leverage the strengths of different machine learning models to make one improved prediction that takes all models into account. You can use this to your advantage to incorporate the various information modalities into your machine-learning model.

A downside to this approach is that, depending on the machine learning task you are performing, it might not fit your use case. Using an ensemble model typically requires a task like classification, where combining the model outputs is a simple process. The combination step itself can also be a challenge. Suppose you train a new machine learning model to combine the outputs of each model in the ensemble. In that case, you are essentially doing the same as I described in the section on training a neural network to combine embeddings. Combining outputs with a linear layer can be a good approach, but as mentioned earlier, the strategy relies on you having a labeled dataset, which is not always available.

A solution to this is using a non-machine-learning approach to combine the output from each model in the ensemble. For a classification task, this can, for example, be using the prediction of the model with the highest confidence, or using the prediction made by the majority of the models (if you have more than two models in your ensemble).
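
Below is a minimal sketch of the highest-confidence rule, assuming a classification ensemble of two models that each output class probabilities for a single sample. The probabilities are made-up placeholders; with three or more models, you could swap in a majority vote instead.

# minimal sketch: combine two model outputs by picking the most confident one
import torch

# hypothetical per-model outputs: class probabilities for one sample
image_model_probs = torch.tensor([0.7, 0.2, 0.1])
text_model_probs = torch.tensor([0.4, 0.5, 0.1])

# pick the prediction from the model that is most confident
if image_model_probs.max() >= text_model_probs.max():
    prediction = int(image_model_probs.argmax())
else:
    prediction = int(text_model_probs.argmax())
print(prediction)  # 0 here, since the image model is the most confident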

Conclusion

In this article, I have described several approaches you can take to combine different data modalities. The approaches I have discussed are:

  • Using multimodal models, for example, from HuggingFace
  • Training a neural network to combine embeddings
  • Creating a multimodal graph
  • Combining vectors with addition, multiplication, or concatenation
  • Using ensemble models

Different approaches will suit different needs. You should consider testing several approaches to determine which works best for your machine-learning problem.

You can also read my articles on WordPress.

