AI Telephone — A Battle of Multimodal Models

DALL-E2, Stable Diffusion, BLIP, and more!

Jacob Marks, Ph.D.
Towards Data Science


Artistic rendering of a game of AI Telephone. Image generated by the author using DALL-E2.

Generative AI is on fire right now. The past few months especially have seen an explosion in multimodal machine learning — AI that connects concepts across different “modalities” such as text, images, and audio. As an example, Midjourney is a multimodal text-to-image model, because it takes in natural language, and outputs images. The magnum opus for this recent renaissance in multimodal synergy was Meta AI’s ImageBind, which can take inputs of 6(!) varieties and represent them in the same “space”.

With all of this excitement, I wanted to put multimodal models to the test and see how good they actually are. In particular, I wanted to answer three questions:

  1. Which text-to-image model is the best?
  2. Which image-to-text model is the best?
  3. What is more important — image-to-text, or text-to-image?

Of course, each model brings its own biases to the table, from training data to model architecture, so there isn’t really ever one BEST model. But we can still put models to the test in a general context!

To answer these questions, I decided to play a game of AI Telephone, inspired by the board game Telestrations, which my family and I love to play together.

Telestrations is much like the game of telephone: players go around in a circle, taking in communication from the person on one side and in turn communicating their interpretation to the person on their other side. As the game progresses, the original message is invariably altered, if not lost entirely. Telestrations differs, however, by adding bimodal communication: players alternate between drawing (illustrating) a written description and describing (in text) a drawing.

Given that I was more interested in comparing models, I adapted the game to suit this purpose.

Here’s how the game of AI Telephone works:

  1. Each “game” pairs an image-to-text (I2T) model with a text-to-image (T2I) model.
  2. Given an initial prompt, we use the T2I model to generate an image.
  3. We then pass this image into the I2T model to generate a description.
  4. We repeat steps 2 and 3 a fixed number of times n (in our case n=10).
  5. Finally, we quantify the difference between the original prompt and the final description.
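
Stripped down to pseudocode, a single game looks like this (prompt_text, t2i_model, i2t_model, embed, and distance are stand-ins for the concrete components built up over the rest of the post):

description = prompt_text
for _ in range(nturns):  # nturns = 10 in our games
    image = t2i_model.generate_image(description)
    description = i2t_model.generate_text(image)

score = distance(embed(prompt_text), embed(description))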

In this post, I will walk you through this entire process, so that you can play AI Telephone too! At the end, I’ll answer the three motivating questions.

Note: This game of AI Telephone is intimately connected with the notion of cycle consistency. By incorporating a cycle consistency term in the loss function during training, models can be incentivized to, effectively, minimize degradation over a game of telephone. To my knowledge, none of the models considered in this experiment were trained with cycle consistency as a consideration.

The post is structured as follows:

  1. Choosing the Multimodal Models
  2. Generating the Prompts
  3. Creating Telephone Lines
  4. Carrying out the Conversations
  5. Visualizing and Analyzing the Results

All of the code to run this experiment and play AI Telephone can be found here.

To run this code, you will need to install the FiftyOne open source library for dataset curation, the OpenAI Python Library, and the Replicate Python client.

pip install fiftyone openai replicate
Progression of images in a game of AI Telephone between DALL-E2 and BLIP.

Choosing the Competitors

The space of multimodal models is massive: at the time of writing, Hugging Face alone has 4,425 T2I models and 155 I2T models. Playing AI Telephone with all of these models — or even a non-negligible fraction of them — would be completely infeasible. My first task was to pare down this space of potential candidates to a more manageable set of competitors.

Opting for APIs

To start this project, I knew that I would be working with many models. Some of the prospective models were quite large, and many required their own environments, with a unique set of requirements. Given that I planned to pair up each T2I model with each I2T model, installing these models locally to play games of AI Telephone presented a potential dependency purgatory — especially because I work on a MacBook Pro M1!

To circumvent this problem, I decided to stick to models that were accessible via APIs. In particular, I chose to primarily use Replicate, whose simple interface allowed me to work with T2I and I2T models in plug-and-play fashion. Almost every model that I used is open source, so if you are braver than I, you can run these models locally and avoid the charges. That being said, in total this experiment cost < $15 USD.
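
Both clients read their credentials from environment variables; you can set them in your shell, or in Python before importing the client libraries (the values below are placeholders, not real keys):

import os

# Placeholder values: substitute your own keys
os.environ["REPLICATE_API_TOKEN"] = "r8_..."
os.environ["OPENAI_API_KEY"] = "sk-..."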

Text-to-Image Models

When selecting T2I models, I chose from the models in Replicate’s Text to image collection. My selection criteria were that the model needed to be cheap, fast, and relatively popular (judged by the number of “runs” of the model on Replicate). Additionally, the model needed to be general purpose, meaning that I wasn’t going to consider outpainting, logo generation, or anime styling models. You are more than welcome to try playing AI Telephone with these types of models if you’d like!

Given these requirements, I chose Stable Diffusion and Feed forward VQGAN CLIP. Initially, I also worked with DALL-E Mini, but in early tests I was disappointed by the model’s performance, so I swapped the model out for OpenAI’s DALL-E2, which I accessed through OpenAI’s image generations endpoint.

As a side note, restricting my attention to API-accessible models meant that I did not consider Midjourney. There is no official API, and I did not want to use an unofficial API, nor did I want to enter prompts into Discord one by one and download the generated images one at a time.

To make this process as plug-and-play as possible, I took an object-oriented approach. I defined a base Text2Image class, which exposes a method generate_image(text):

import replicate

class Text2Image(object):
    """Wrapper for a Text2Image model."""
    def __init__(self):
        self.name = None
        self.model_name = None

    def generate_image(self, text):
        response = replicate.run(self.model_name, input={"prompt": text})
        if type(response) == list:
            response = response[0]
        return response

For Replicate models, all that is then needed is to set the model_name attribute, which identifies the model on Replicate. For Stable Diffusion, for instance, the class definition looks like this:

class StableDiffusion(Text2Image):
    """Wrapper for a StableDiffusion model."""
    def __init__(self):
        self.name = "stable-diffusion"
        self.model_name = "stability-ai/stable-diffusion:27b93a2413e7f36cd83da926f3656280b2931564ff050bf9575f1fdf9bcd7478"

For other models, such as DALL-E2, the generate_image(text) method can be overridden:

import openai

class DALLE2(Text2Image):
    """Wrapper for a DALL-E 2 model."""
    def __init__(self):
        self.name = "dalle-2"

    def generate_image(self, text):
        response = openai.Image.create(
            prompt=text,
            n=1,
            size="512x512"
        )
        return response['data'][0]['url']

Each of these T2I models returns the URL of the generated image, which we can then pass directly to our I2T models.
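
For example, a single generation (with an illustrative prompt) looks like this:

sd = StableDiffusion()
image_url = sd.generate_image("A red apple sitting on a wooden table")
print(image_url)  # a hosted URL pointing to the generated image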

Image-to-Text Models

I followed a similar process to determine the I2T competitors, evaluating candidates in Replicate’s Image to text collection. After looking at the examples for all of the models in the collection, six models stood out: BLIP, BLIP-2, CLIP prefix captioning, Fine-grained Image Captioning with CLIP Reward, mPLUG-Owl, and MiniGPT-4. Other models were enticing, such as CLIP Interrogator, which tries to reverse engineer a prompt you can then use to generate a similar image. But this felt a bit like cheating as far as AI Telephone was concerned!

Playing around with the six I2T candidates, I was able to quickly eliminate two models from contention: BLIP-2 generated responses that were consistently too short to be useful, and the CLIP Caption Reward model generated responses which were often incoherent.

In direct analogy with the T2I models, I defined a base Image2Text class exposing a generate_text(image_url) method:

class Image2Text(object):
    """Wrapper for an Image2Text model."""
    def __init__(self):
        self.name = None
        self.model_name = None
        self.task_description = "Write a detailed description of this image."

    def generate_text(self, image_url):
        response = replicate.run(
            self.model_name,
            input={
                "image": image_url,
                "prompt": self.task_description,
            }
        )
        return response

I then created subclasses for each model. Here is what the BLIP subclass looks like:

class BLIP(Image2Text):
    """Wrapper for a BLIP model."""
    def __init__(self):
        super().__init__()
        self.name = "blip"
        self.model_name = "salesforce/blip:2e1dddc8621f72155f24cf2e0adbde548458d3cab9f00c0139eea840d0ac4746"

All of the models are instantiated with the same task description — to “write a detailed description of this image”.
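
Usage mirrors the T2I models; for instance, to caption an image URL returned by one of the Text2Image models above:

blip = BLIP()
description = blip.generate_text(image_url)  # image_url from a Text2Image model
print(description)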

Progression of images in a game of AI Telephone between DALL-E2 and mPLUG-Owl.

Prompts

To be as “scientific” as possible, I thought it best not to generate the initial prompts myself. Instead (and just for fun), I outsourced the task to ChatGPT. I asked:

I'm playing a game of telephone using text-to-image and image-to-text AI models. 
I want to evaluate these models based on their ability to retain complex semantic
information over the course of long conversations. Your job is to give me 10 text
prompts that I can use to run these games of telephone. You must give me 3 easy,
3 medium, 3 hard, and 1 ultra-hard ("impossible") prompt


Here are some of the prompts ChatGPT generated:

Easy:

"A red apple sitting on a wooden table with sunlight streaming in from a window."

Medium:

"An astronaut floating in the International Space Station, looking out at Earth through the window, with a space capsule docked in the background."

Hard:

"A bustling marketplace in an ancient Middle Eastern city. Traders haggling over spices and silks, camels carrying goods, the sun setting behind a mosque with a crescent moon visible."

Impossible:

"A panoramic scene of an advanced alien civilization on a distant exoplanet. Interstellar vehicles flying in an indigo sky above towering crystalline structures. Aliens with varying physical features are interacting, engaging in activities like exchanging energy orbs, communicating through light patterns, and tending to exotic, bio-luminescent flora. The planet’s twin moons are visible in the horizon over a glistening alien ocean."

A more rigorous scientific approach would be far more intentional with the prompts used, as well as their categorization.

I then took the text prompts generated by ChatGPT and constructed Prompt objects, which contained the text for the prompt, and the “level” of difficulty assigned by ChatGPT:

class Prompt(object):
    def __init__(self, text, level):
        self.text = text
        self.level = level


levels = ["easy", "medium", "hard", "impossible"]
level_prompts = [easy_texts, medium_texts, hard_texts, impossible_texts]

def get_prompts():
    prompts = []
    for level, texts in zip(levels, level_prompts):
        for text in texts:
            prompts.append(Prompt(text, level))
    return prompts
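
Here, easy_texts, medium_texts, hard_texts, and impossible_texts are assumed to be plain lists of the ChatGPT-generated prompt strings, along these lines:

easy_texts = [
    "A red apple sitting on a wooden table with sunlight streaming in from a window.",
    # ... the two other easy prompts
]

medium_texts = [
    "An astronaut floating in the International Space Station, looking out at Earth through the window, with a space capsule docked in the background.",
    # ... the two other medium prompts
]

# hard_texts and impossible_texts are defined in the same way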
Progression of images in a game of AI Telephone between VQGAN-CLIP and MiniGPT-4.

The Telephone Line

The last component for playing AI Telephone was the “telephone line” itself. I created a TelephoneLine class to encapsulate the connection between a T2I model and an I2T model. Given a single telephone line, a “game” of telephone is played by calling play(prompt, nturns=10); the conversation starts from prompt and runs for nturns back-and-forth turns.

import os
import hashlib
import fiftyone as fo
from fiftyone import ViewField as F

class TelephoneLine(object):
    """Class for playing telephone with AI."""
    def __init__(self, t2i, i2t):
        self.t2i = t2i
        self.i2t = i2t
        self.name = f"{t2i.name}_{i2t.name}"
        self.conversations = {}

    def get_conversation_name(self, text):
        full_name = f"{self.name}{text}"
        hashed_name = hashlib.md5(full_name.encode())
        return hashed_name.hexdigest()[:6]

    def play(self, prompt, nturns=10):
        """Play a game of telephone."""
        print(f"Connecting {self.t2i.name} <-> {self.i2t.name} with prompt: {prompt.text[:20]}...")
        texts = [prompt.text]
        image_urls = []

        for _ in range(nturns):
            image_url = self.t2i.generate_image(texts[-1])
            text = self.i2t.generate_text(image_url)
            texts.append(text)
            image_urls.append(image_url)

        conversation_name = self.get_conversation_name(prompt.text)
        self.conversations[conversation_name] = {
            "texts": texts,
            "image_urls": image_urls,
            "level": prompt.level
        }

For each game played, the conversation is logged with a unique name, generated by hashing the T2I model name, I2T model name, and the prompt text (get_conversation_name() method).

I also equipped the class with a save_conversations_to_dataset() method, which saves the images and descriptions from all games played on the telephone line to a FiftyOne Dataset:

    def save_conversations_to_dataset(self, dataset):
        """Save conversations to a dataset."""
        for conversation_name in self.conversations.keys():
            conversation = self.conversations[conversation_name]
            prompt = conversation["texts"][0]
            level = conversation["level"]
            image_urls = conversation["image_urls"]
            texts = conversation["texts"]

            for i in range(len(image_urls)):
                filename = f"{conversation_name}_{i}.jpg"
                filepath = os.path.join(IMAGES_DIR, filename)
                download_image(image_urls[i], filepath)

                sample = fo.Sample(
                    filepath=filepath,
                    conversation_name=conversation_name,
                    prompt=prompt,
                    level=level,
                    t2i_model=self.t2i.name,
                    i2t_model=self.i2t.name,
                    step_number=i,
                    text_before=texts[i],
                    text_after=texts[i+1]
                )

                dataset.add_sample(sample)
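
The download_image() helper and IMAGES_DIR constant are not shown above; a minimal version (my own sketch, using the requests library) might look like this:

import os
import requests

IMAGES_DIR = "telephone_images"  # hypothetical local directory for the downloaded images
os.makedirs(IMAGES_DIR, exist_ok=True)

def download_image(image_url, filepath):
    """Download the image at image_url and write it to filepath."""
    response = requests.get(image_url)
    response.raise_for_status()
    with open(filepath, "wb") as f:
        f.write(response.content)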
Progression of images in a game of AI Telephone between Stable Diffusion and CLIP Prefix Captioning.

Carrying out the Conversations

With all of the building blocks in place, playing AI Telephone is child’s play!

We can instantiate T2I and I2T models:

## Image2Text models
mplug_owl = MPLUGOwl()
blip = BLIP()
clip_prefix = CLIPPrefix()
mini_gpt4 = MiniGPT4()
image2text_models = [mplug_owl, blip, clip_prefix, mini_gpt4]

## Text2Image models
vqgan_clip = VQGANCLIP()
sd = StableDiffusion()
dalle2 = DALLE2()
text2image_models = [sd, dalle2, vqgan_clip]

And then create a telephone line for each pair:

combos = [(t2i, i2t) for t2i in text2image_models for i2t in image2text_models]
lines = [TelephoneLine(*combo) for combo in combos]

We then load in our prompts:

prompts = get_prompts()

And create a FiftyOne Dataset which we will use to store the generated images and all relevant information from the conversations:

import fiftyone as fo

dataset = fo.Dataset(name = 'telephone', persistent=True)
dataset.add_sample_field("conversation_name", fo.StringField)
dataset.add_sample_field("prompt", fo.StringField)
dataset.add_sample_field("level", fo.StringField)
dataset.add_sample_field("t2i_model", fo.StringField)
dataset.add_sample_field("i2t_model", fo.StringField)
dataset.add_sample_field("step_number", fo.IntField)
dataset.add_sample_field("text_before", fo.StringField)
dataset.add_sample_field("text_after", fo.StringField)

We can then run all 120 games of telephone:

from tqdm import tqdm

for line in tqdm(lines):
    for prompt in prompts:
        line.play(prompt, nturns=10)
    line.save_conversations_to_dataset(dataset)

session = fo.launch_app(dataset)

In the FiftyOne App, click on the splitting icon in the menu bar to group images by conversation, select conversation_name from the dropdown, then toggle the selector to ordered and select step_number.
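
If you prefer to do this programmatically, the same view can be set from Python (assuming a FiftyOne version that supports dynamic grouping, which the group_by() calls later in this post already rely on):

# Group samples into conversations, ordered by step within each group
session.view = dataset.group_by("conversation_name", order_by="step_number")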

Results and Conclusions

To assess the quality of a conversation (purely in terms of how closely the meaning of the final description approximated the meaning of the initial prompt), I decided to generate embeddings for the prompts and descriptions and compute the cosine distance (which lies in [0, 2]) between the two.

from scipy.spatial.distance import cosine as cosine_distance
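
As a quick sanity check on that range, with toy vectors (using the cosine_distance import above): identical directions give distance 0, orthogonal vectors give 1, and opposite directions give 2.

import numpy as np

a = np.array([1.0, 0.0])
print(cosine_distance(a, a))                     # identical vectors: 0.0
print(cosine_distance(a, np.array([0.0, 1.0])))  # orthogonal vectors: 1.0
print(cosine_distance(a, -a))                    # opposite vectors: 2.0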

For an embedding model, I wanted a model that could embed both text and images, given the multimodal nature of the exercise. I ended up choosing to use ImageBind for three reasons:

  1. Other popular joint image-text embedding models like CLIP and BLIP are related to some of the models I used in the experiment (BLIP and CLIP prefix captioning), and I wanted to avoid any possible biases from using the same types of models for evaluation.
  2. Many text embedding models have a small max_token_count — the maximum number of tokens allowed in a text to be embedded. CLIP, for instance, has max_token_count=77. Some of our descriptions are significantly longer than this. Fortunately, ImageBind has a much longer maximum token count.
  3. I’d been meaning to try ImageBind, and this was a great opportunity!

I wrapped Replicate’s ImageBind API in a function embed_text(text):

import numpy as np

MODEL_NAME = "daanelson/imagebind:0383f62e173dc821ec52663ed22a076d9c970549c209666ac3db181618b7a304"

def embed_text(text):
    response = replicate.run(
        MODEL_NAME,
        input={
            "text_input": text,
            "modality": "text"
        }
    )
    return np.array(response)

To avoid redundant computations, I hashed the prompts and stored the prompt embeddings in a dictionary. This way, instead of embedding each prompt for each of the 12 telephone lines, we only need to embed each once:

import hashlib

def hash_prompt(prompt):
    return hashlib.md5(prompt.encode()).hexdigest()[:6]

### Embed initial prompts
prompt_embeddings = {}
dataset.add_sample_field("prompt_hash", fo.StringField)

## Group samples by initial prompt
## Add hash to all samples in group
prompt_groups = dataset.group_by("prompt")
for pg in prompt_groups.iter_dynamic_groups():
    prompt = pg.first().prompt
    hash = hash_prompt(prompt)
    prompt_embeddings[hash] = embed_text(prompt)
    view = pg.set_field("prompt_hash", hash)
    view.save("prompt_hash")

We can then group samples by conversation name, iterate through these groups, compute the text embedding for each step, and record the cosine distance (smaller is better!) between the text embedding and the initial prompt embedding:

dataset.add_sample_field("text_after_dist", fo.FloatField)

prompt_groups = dataset.group_by("conversation_name")
for cg in conversation_groups.iter_dynamic_groups(progress=True):
hash = cg.first().prompt_hash
prompt_embedding = prompt_embeddings[hash]

ordered_samples = cg.sort_by("step_number")
for sample in ordered_samples.iter_samples(autosave=True):
text_embedding = embed_text(sample.text_after)
sample["text_embedding"] = text_embedding
sample.text_after_dist = cosine_distance(
prompt_embedding,
text_embedding
)

I then computed the average scores for each T2I-I2T pair across all prompts at a certain level of difficulty and plotted the results. In each of the videos, the I2T and T2I models are printed on the generated images, as well as the text used to generate that image (red), and the description generated from that image (green).
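
The plotting code isn't reproduced here, but the aggregation itself is simple; here is a sketch using pandas (an extra dependency, not used elsewhere in this post):

import pandas as pd

# Pull the relevant fields out of the FiftyOne dataset
fields = ["level", "t2i_model", "i2t_model", "step_number", "text_after_dist"]
df = pd.DataFrame(dict(zip(fields, dataset.values(fields))))

# Average distance per T2I-I2T pair, per step, within each difficulty level
avg_dists = (
    df.groupby(["level", "t2i_model", "i2t_model", "step_number"])["text_after_dist"]
    .mean()
    .reset_index()
)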

Easy

For easy prompts, performance tends to depend most strongly on the text-to-image model. DALL-E2 and Stable Diffusion dramatically outperform VQGAN-CLIP. MiniGPT-4 is a member of both of the top-performing pairs.

Here are some examples for the easy prompt introduced above:

AI Telephone for an easy prompt, with pairs of text-to-image and image-to-text models.

In the games with MiniGPT-4 (and to a slightly lesser extent BLIP), the apple remains front and center, whereas for games involving CLIP Prefix, the apple gets phased out over time.

Medium

When the prompts become a bit more difficult, the situation starts to change.

AI Telephone for a medium difficulty prompt, with pairs of text-to-image and image-to-text models.

For nearly all of the games, the subject changes somewhere around the fourth or fifth step. Early on, MiniGPT-4 holds an advantage. But by the end of the game, that advantage seems to have been entirely lost.

Hard

By the time the prompts become challenging, we start to see something interesting: for early steps, the image-to-text model is most important (MiniGPT-4 is best, and CLIP Prefix is for the most part the worst). By later stages, however, the text-to-image model becomes most important. And to complicate the situation further, VQGAN-CLIP is best here!

One might worry that “better” could just mean that consistency is maintained, without accurately representing the original concept. However, when we look at examples, we can see that this is not the case.

AI Telephone for a hard prompt, with pairs of text-to-image and image-to-text models.

Take the example highlighted in the video, where the initial prompt is the “hard” prompt introduced above concerning a “bustling marketplace”. While the images generated by VQGAN-CLIP are without a doubt grainy, the subject can still be made out, and matches the original prompt fairly closely.

Impossible

Unsurprisingly, none of our competitors do terribly well here. One might argue that VQGAN-CLIP is the winner. But for the most part, this is all just noise. In the video, even for games involving VQGAN-CLIP, the subject is effectively unrecognizable.

AI Telephone for an “impossible” prompt, with pairs of text-to-image and image-to-text models.

Takeaways

This exploration was far from scientific: I only looked at ten prompts, without true validation of their difficulty levels; I only ran the conversations out to ten back-and-forth steps; and I only evaluated performance on one metric.

It is clear that which T2I and I2T models fare best depends in large part on the complexity of the prompt, and how long you want to keep the models talking. Nevertheless, it is worth noting a few key observations:

  1. VQGAN-CLIP may fare better for more challenging prompts, but this doesn’t mean it is a better T2I model. The images produced by VQGAN-CLIP are often far less coherent and globally consistent than those produced by Stable Diffusion or DALL-E2.
  2. The analysis above is all about semantic similarity — it does not take style into account. The style of these images can change a ton over the course of a game of AI Telephone. Anecdotally, I found that the style is much more consistent for I2T models like mPLUG-Owl, which give long descriptions, than for models like BLIP, whose descriptions are more subject focused.
  3. By around five or six iterations, the games had mostly converged to stable equilibria.
  4. Even though the embedding model, ImageBind, is multimodal, the distances between image embeddings and their corresponding text embeddings were far greater than the distances between consecutive images or between consecutive descriptions. In general, the image-based distances followed the same trends as the text-based ones, but in a less pronounced fashion, which is why I didn’t include them in the plots.

I hope this inspires you to run your own experiments with generative AI — whether you’re playing AI Telephone, or doing something else entirely!

If you try out a variation of this and get interesting results, comment on this post!
