PODCAST

CogView: Image generation and language modelling at scale

Ming Ding on the challenge of building China’s DALL-E

Jeremie Harris
Towards Data Science
20 min read · Mar 11, 2022


Editor’s note: The TDS Podcast is hosted by Jeremie Harris, who is the co-founder of Mercurius, an AI safety startup. Every week, Jeremie chats with researchers and business leaders at the forefront of the field to unpack the most pressing questions around data science, machine learning, and AI.

Note: Due to recording issues, this episode of the podcast is available only as a transcript. You can find the episode summary immediately below, followed by the full interview, in written form.

In early 2021, OpenAI published DALL-E, a powerful model that could generate images from arbitrary text prompts. But at the very same time that DALL-E was being built in California, a very similar model was being developed 10,000 km away, in China.

That model is called CogView, and one of the lead researchers involved in its development was Ming Ding — a researcher at Tsinghua University.

By all accounts CogView is a massive achievement: a 4-billion parameter model that required a large compute budget and more than a little software engineering and hardware to put together. But it also offers a fascinating window into the world of international AI research, the difficulty of comparing generative models across languages, and how differences between the English and Chinese languages can have an impact on the rate at which language models can learn them.

Ming joined me to talk about the challenge of building CogView, how it compares to DALL-E, the international side of AI research, and much more on this episode of the TDS podcast.

Here were some of my favourite take-homes from the conversation:

  • CogView is based on a VQ-VAE (vector-quantized variational autoencoder) architecture. Like a typical variational autoencoder (VAE), a VQ-VAE learns to take an input and compress it into a latent representation that can be used to reconstruct the same input. However, whereas VAEs map inputs to a continuous latent space, VQ-VAEs map inputs to a discrete latent space, effectively classifying them into one among a large but finite set of categories. CogView’s VQ-VAE is first trained to reconstruct images, and a separate language model is then used to map user input text to that VQ-VAE’s latent space, where image generation occurs. (A code sketch of this two-stage setup follows this list.)
  • Despite their very similar functions, Ming points out that it’s difficult to compare CogView to DALL-E, for two reasons. First, CogView was trained on Chinese text/image pairs, whereas DALL-E was trained to work in English, and this means that prompts to each model can’t be compared on an apples-to-apples basis. But second, OpenAI hasn’t released the DALL-E model, so CogView’s team would have only a limited set of public DALL-E text prompts and their corresponding image outputs to compare to.
  • Ming explains that metrics for image generation systems like CogView are still under-developed. In some cases, for example, they trend in directions opposite to human judgement, and images rated highly on the basis of certain metrics don’t look compelling at all to human observers. In particular, Ming highlights that so far, quantitative measures of the quality of generated images put more weight on textures, and insufficient weight on other characteristics that are more important to human perception, such as the shapes of objects.
  • One approach that Ming considers promising is to assess the quality of generated images not by directly measuring properties of those images, but rather by using an image-captioning model to convert the images back into captions, which could then be compared with the original text prompts fed to CogView. Whether or not this strategy ends up working out, it highlights how much creativity is required to define good metrics for generative learning tasks.
  • Whereas OpenAI famously decided not to release their DALL-E model, Ming and his colleagues opted to release CogView as open source. Ming argues that this decision is justified on the basis that CogView’s capabilities are not sufficiently developed to allow for malicious uses. However, he does think that may change in the future: within the next two years or so, Ming expects photorealistic image generation to be possible. With that, he anticipates, will come greater risks, and potentially a need for government intervention and corporate auditing.
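
To make the two-stage setup from the first takeaway more concrete, here is a minimal sketch of how a text-to-image training sequence can be laid out: the image is quantized into discrete tokens by the VQ-VAE, those tokens are concatenated after the text tokens, and a GPT-style transformer is trained with a plain next-token loss over the combined sequence. The vocabulary sizes, sequence lengths, and function names below are illustrative, not CogView’s actual configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only -- not CogView's real configuration.
TEXT_VOCAB, IMAGE_VOCAB = 50_000, 8_192   # text tokens vs. VQ-VAE codebook entries
SEQ_TEXT, SEQ_IMAGE = 64, 32 * 32         # caption length and a 32x32 grid of image tokens

def build_training_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Concatenate text tokens and (offset) image tokens into one sequence.

    Image codes are shifted by TEXT_VOCAB so the two vocabularies can share a
    single embedding table in the autoregressive transformer.
    """
    return torch.cat([text_ids, image_codes + TEXT_VOCAB], dim=-1)

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard autoregressive cross-entropy: predict token t+1 from tokens up to t."""
    return F.cross_entropy(logits[:, :-1].flatten(0, 1), tokens[:, 1:].flatten())

# Toy usage with random data standing in for a tokenized caption and VQ-VAE codes.
text_ids = torch.randint(0, TEXT_VOCAB, (2, SEQ_TEXT))
image_codes = torch.randint(0, IMAGE_VOCAB, (2, SEQ_IMAGE))
seq = build_training_sequence(text_ids, image_codes)
logits = torch.randn(2, seq.shape[1], TEXT_VOCAB + IMAGE_VOCAB)   # stand-in for model output
print(next_token_loss(logits, seq))
```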

Transcript:

Jeremie: Hey everyone, welcome back to the Towards Data Science Podcast. For today’s episode, I’ll be speaking with Ming Ding, a researcher at Tsinghua University and the lead author of a NeurIPS paper that introduced CogView: a generative text-to-image model that Ming and his colleagues developed and open-sourced earlier this year.

By all accounts, CogView is a massive achievement. It’s a 4-billion-parameter model that required a large compute budget and more than a little software engineering and hardware to put together.

But it also offers a fascinating window into the world of international AI research and the difficulties of comparing generative models across different languages — and how differences between the English and Chinese languages can have an impact on the rate at which language models can learn them. Ming joined me to talk about the challenge of building CogView, how it compares to DALL-E, which was developed during the same period, the international side of AI research, and much, much more, on this episode of the Towards Data Science Podcast.

Jeremie: I’m really excited to have this conversation because CogView is such an important part of the rapidly emerging Chinese AI ecosystem. It’s a great example of Chinese experimentation with large language and vision models. What made you decide to work on this project in particular?

Ming: I actually started working in this direction quite early, maybe before this time last year. No one before then had succeeded at using text to generate images in a general domain. I mean, there were a lot of successful cases in limited domains, for example faces or animals, but no one had succeeded in turning text inputs into images in the general case.

I’d done some image generation work before, back when I was an intern at Alibaba. There, I was mainly focused on generating fashion-related images, and also some general-domain images. And then suddenly, OpenAI really got started with their text-to-image work. We found that their ideas were extremely similar to ours, so we quickly asked for more resources to train a big model using a larger dataset and more parameters. This turned out to be CogView, and we finally published it at NeurIPS.

Jeremie: That’s a fast turnaround too, right? If I recall, the biggest version of CogView — the 4-billion-parameter version — came out just four or five months after OpenAI’s DALL-E model, right?

Ming: Actually, we’d finished all the code and the small model variant before OpenAI had released theirs.

Jeremie: I didn’t realize it was that close. It’s a really interesting project, too — I mean, general-purpose text-to-image is something that would have been science fiction just two years ago. Can you tell me a bit about the architecture that makes this happen? As I recall, it’s based on a VQ-VAE model, so if you could start by explaining the VAE part, that might be helpful for anybody who is not familiar with variational auto-encoders.

Ming: VQ-VAE is quite different from the original VAE. A VAE tries to map inputs to a set of continuous latent variables. And although the first stage of a VQ-VAE is similar, the middle layer turns the continuous representation into a discrete one using a dictionary. So you quantize the vectors you get after the initial layers into tokens pulled from a learned dictionary. The second stage of a VQ-VAE is modeling the distribution of those discrete latent tokens with an auto-regressive model, so you don’t need to model the KL-divergence between the latent variable and the prior the way you do in an ordinary VAE.
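
As a rough illustration of the quantization step Ming describes (snapping each continuous latent vector onto the nearest entry of a learned dictionary), here is a minimal PyTorch sketch. The codebook size and dimensions are made up, and the straight-through gradient trick and codebook losses needed to actually train a VQ-VAE are omitted.

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Map each continuous latent vector to its nearest codebook entry.

    z:        (num_vectors, dim)  continuous encoder outputs
    codebook: (num_codes, dim)    learned dictionary of discrete embeddings
    Returns the integer token ids and the corresponding quantized vectors.
    """
    distances = torch.cdist(z, codebook)   # (num_vectors, num_codes) pairwise distances
    token_ids = distances.argmin(dim=-1)   # index of the nearest code for each vector
    return token_ids, codebook[token_ids]

# Toy usage: 1,024 latent vectors (e.g. a 32x32 grid) against an 8,192-entry codebook.
z = torch.randn(1024, 256)
codebook = torch.randn(8192, 256)
ids, z_q = quantize(z, codebook)
print(ids.shape, z_q.shape)   # torch.Size([1024]) torch.Size([1024, 256])
```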

Jeremie: Rather than trying to encode something in a latent space that’s continuous, you have a discrete latent space. So you’re mapping an input to some finite set of different labels or different classes, essentially. I assume those discrete embeddings are easier for the model to learn than continuous ones?

Ming: Yeah. Actually, one of the big advantages of going with a discrete embedding strategy like VQ-VAE is that it allows us to work with data like language that’s inherently discrete (in the sense that words are discrete packets of meaning, and you can’t continuously move from one word to another). So VQ-VAE opens the door to language modeling in a natural way — and in particular to auto-regressive language models like GPT.

Distributions of language data are actually very hard to parameterize because they’re so complex. If the distributions we were dealing with were simpler, we could use Gaussian mixture models, or something like that. But if you’re modeling a very complex distribution like next-word probabilities in text data, or pixel brightnesses in images in the general case, you need a more robust approach.

Currently, the best practice is to use a transformer like GPT, which just learns a distribution over a dictionary. Each time you want to make a next-word prediction, you can look at that distribution and pull out a word that has high probability under that predicted distribution.

That’s a really flexible approach, but it does have a problem. There can be hundreds of thousands of words in a language dictionary, and forcing a downstream image generation model to learn semantic representations of each word is unmanageable. Continuous word embeddings are just too complex. If you want to feed those words as input to another model — like an image-generation model, for example — you need a way to reduce the complexity of your data, to reduce it to a more manageable form. The VQ-VAE lets us do that, by turning complex and continuous word embeddings into a small number of simple, discrete representations that won’t overwhelm downstream models.
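
To make the GPT-style recipe Ming describes concrete (predict a distribution over the dictionary, then pull out a token with high probability), here is a minimal sampling sketch. The temperature and top-k values are arbitrary illustrations, not anything CogView specifically uses.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50) -> int:
    """Sample one token id from a model's predicted distribution over the dictionary."""
    logits = logits / temperature
    top_values, top_indices = logits.topk(top_k)       # keep only the k most likely tokens
    probs = torch.softmax(top_values, dim=-1)          # renormalize over that shortlist
    choice = torch.multinomial(probs, num_samples=1)   # draw one token at random
    return top_indices[choice].item()

# Toy usage with random logits standing in for a transformer's output at one position.
vocab_size = 10_000
print(sample_next_token(torch.randn(vocab_size)))
```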

Jeremie: Do you expect that auto-regressive models are going to start to dominate over GANs in terms of image generation more broadly? It does seem like things are heading that way.

Ming: Yeah, definitely. If you ask me, there are three reasons that we’ve seen GANs dominate until recently. The first is that GANs performed great very early on, and captured researchers’ imaginations. I should know — I was a researcher focusing on GANs back in the day.

But the second is that generative auto-regressive models are actually really slow. The reason is not that the models are big; it’s that auto-regressive models generate their outputs token by token. So it takes 65,000 inference steps to generate a 65,000-pixel image, for example. You can’t generate one token, or one pixel, without having generated the previous one. This means you can’t parallelize the task of image generation on GPUs the same way you might with a GAN. That speed and parallelism bottleneck is something we solved with CogView2. So, maybe in the future we’ll start to see very fast auto-regressive models that perform better than diffusion-based or GAN-based models.

Finally, there’s the problem that current auto-regressive models are always based on transformer architectures, which include self-attention mechanisms, which cause slow-downs particularly when operating at very high resolutions. So, we also need special techniques like local attention to reduce the complexity this causes — which are also part of the work we’ve been doing with CogView2.

But my sense is that autoregressive models will dominate computer vision in the coming years.
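
The sequential bottleneck Ming describes is easy to see in code: each step depends on the tokens generated so far, so the loop below cannot be parallelized across positions the way a GAN’s single forward pass can. The model here is a placeholder and the token counts are illustrative.

```python
import torch

@torch.no_grad()
def generate_image_tokens(model, prompt_ids: torch.Tensor, num_image_tokens: int = 1024) -> torch.Tensor:
    """Autoregressive decoding: one forward pass per generated token.

    For a 32x32 token grid this loop runs 1,024 times; at the pixel level the same
    argument gives the ~65,000 sequential steps mentioned above for a 256x256 image.
    """
    tokens = prompt_ids
    for _ in range(num_image_tokens):
        logits = model(tokens)                         # re-reads the whole prefix each step
        next_token = logits[-1].argmax().unsqueeze(0)  # greedy choice, for simplicity
        tokens = torch.cat([tokens, next_token])       # step t+1 depends on step t's output
    return tokens[len(prompt_ids):]

# Toy usage with a stand-in "model" that returns random logits over a 9,000-token vocabulary.
fake_model = lambda toks: torch.randn(len(toks), 9_000)
print(generate_image_tokens(fake_model, torch.tensor([1, 2, 3]), num_image_tokens=8))
```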

Jeremie: It does seem like there is a lot more low-hanging fruit for improvement when it comes to transformers. We saw with DeepMind’s Gopher model, for example, a 25X improvement in efficiency coming just from architecture and training-process optimizations. It’s interesting to see that your work is also showing that with just a little more tweaking, you can solve a lot of these parallelization problems and get much more efficient output generation.

Do you think that continued architecture tweaks are going to drive more progress in transformers in the coming years, or do you think most of the progress from here on is going to come from scaling up more or less the same techniques we have today?

Ming: This is actually a very interesting topic. What techniques succeed at accelerating transformers, and AI-related computation in general, depends quite a bit on your data. If you are training an NLP model, just on sentences, you cannot use some other technique to replace attention, because attention is very important in NLP. But what we’ve found is that in computer vision, attention is actually not very important. Other works have found the same thing in the context of vision-based classification problems, for example. MetaFormer, for example, found that mean pooling and some other simple operations can replace the attention in transformers without reducing performance. So the answer is fundamentally dependent on both the data and the task.

So certainly when it comes to generation tasks, I think attention could also be replaced by more lightweight operations. There’ve been other papers since that have shown efficiency gains from swapping out attention in some cases — so it remains to be seen how far architecture tweaks can take us in the image generation domain.

But, back to the case of text-to-image generation. We not only want to model images; we also need to understand text. And that requires a more robust architecture that makes fewer task-specific optimizations. So that essentially forces our hand, and we have to go back to the original attention mechanism. So, my prediction is that in a few months, for tasks like image classification or segmentation, we might find self-attention replaced, but for cross-modality tasks like image generation, or image captioning, we’ll find the original attention mechanism conserved.
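
As a concrete example of the “replace attention with something simpler” idea Ming mentions, here is a rough PoolFormer-style token mixer in which the attention block is swapped for average pooling over neighbouring positions. This is a schematic sketch in the spirit of the MetaFormer work, not the exact published code.

```python
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """Replaces self-attention with average pooling over a spatial neighbourhood.

    Input/output shape: (batch, channels, height, width), as in vision models.
    """
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Subtracting x keeps the surrounding residual connection from simply
        # doubling the input, mirroring the trick used in PoolFormer-style blocks.
        return self.pool(x) - x

# Toy usage on a 14x14 feature map with 384 channels.
mixer = PoolingTokenMixer()
print(mixer(torch.randn(1, 384, 14, 14)).shape)   # torch.Size([1, 384, 14, 14])
```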

Jeremie: The idea of replacing self-attention with custom components is an interesting counter-trend, especially given the consolidation we’ve seen around transformers lately. I’m really curious to see whether that holds up going forward.

So looking specifically at the CogView model, I’d love to get a sense of how it compares to DALL-E in terms of performance. I know the model is a bit smaller — I think DALL-E has about 12 billion parameters, whereas CogView has 4 billion. But there’s so much more than just the number of parameters that we need to consider when we’re comparing two different models.

Ming: It’s funny, the reviewer from NeurIPS asked us similar questions, but it’s quite hard to compare CogView with DALL-E because DALL-E is not an open-source project. So, we can only see the examples that OpenAI provided along with the DALL-E paper, but it’s hard to assess how representative they are, or how similar they might be to the DALL-E training data.

While it’s quite hard to make a fair comparison, we can at least compare CogView to DALL-E by looking at a few metrics that OpenAI released along with DALL-E, and seeing how they stack up against CogView. And although we found that we actually obtained better performance according to some of those metrics, it’s hard to know what to make of that comparison. One key issue is that metrics for image generation are notoriously bad, and ultimately only human evaluation is a reliable indicator of image quality.

Jeremie: What are some of these metrics? And how do you see them fail to align with human evaluation?

Ming: One popular metric has been FID (Fréchet inception distance). To calculate the FID, you take your generated images and feed them into a vision model like Inception V3. Then, you look at the activations of neurons in one of the deeper layers of the network, and compare their statistics to the activations you see when real images are fed into the model. The intuition is that a generated image of a dog and a real image of a dog should both cause the “dog” neurons in the deeper layers of the network to fire. So the more similar the firing patterns induced by real and generated images, the higher the quality of the generated images is assumed to be.
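
For reference, FID is computed over sets of images: you fit a Gaussian to the Inception activations of real images, fit another to the activations of generated images, and measure the Fréchet distance between the two. Here is a minimal sketch of that distance calculation (the Inception V3 feature extraction itself is omitted, and the feature sizes are toy values).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sets of Inception features.

    Each input is (num_images, feature_dim), e.g. 2048-d pool3 activations.
    """
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    cov_sqrt = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(cov_sqrt):    # numerical noise can leave tiny imaginary parts
        cov_sqrt = cov_sqrt.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * cov_sqrt))

# Toy usage with random features standing in for Inception activations.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```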

This has an inherent limitation: it works in cases where you have real images to compare to. But generated images might be designed not to correspond to real objects at all, which makes FID bad for general domain image quality evaluation.

But perhaps more concerning is that we often find that FID and human judgment are downright anticorrelated in certain cases.

Jeremie: Why is that?

Ming: FID and the Inception score are meant to capture the “realism” of a generated image. But InceptionNet tends to lock onto features that matter less to humans. In particular, humans focus on the shapes and colors to derive the meaning of images, whereas InceptionNet tends to pay more attention to textures. So what you often find is that scenes with vivid, realistic textures but completely incorrect shapes will have a very high Inception score.

That’s why we designed a different metric in our CogView paper. We take a text prompt, use it to generate an image in the usual way, and then using that image, we try to reconstruct the original text prompt. The idea here is that the better the image is, the better the prompt reconstruction should be.

Jeremie: So it’s almost like you’re designing a super clunky auto-encoder, where you’re mapping text to image, and image to text and looking at your reconstruction error. In a way, it’s like how you used to be able to go to Google Translate and translate a piece of text back and forth until it became garbled, but as the algorithm got better, it became more robust to that kind of thing.

Ming: Exactly. I think reconstruction errors often make for better metrics.
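
Schematically, the caption-reconstruction idea Ming describes looks like the sketch below: score each generated image by how well a captioning or text-reconstruction model recovers the original prompt, and prefer the images with the highest scores. The scoring function here is a hypothetical stand-in, not the exact metric from the CogView paper.

```python
from typing import Callable, Sequence

def caption_reconstruction_scores(
    prompt: str,
    images: Sequence[object],                        # whatever image type your models accept
    caption_log_prob: Callable[[str, object], float],
) -> list[float]:
    """Score generated images by how well the prompt can be recovered from them.

    caption_log_prob(text, image) is a hypothetical function returning the
    log-probability a captioning model assigns to `text` given `image`.
    Higher is better: the best image is the one from which the model can most
    easily "read back" the original prompt.
    """
    return [caption_log_prob(prompt, img) for img in images]

# Toy usage with a dummy scorer in place of a real captioning model.
fake_scorer = lambda text, image: -0.01 * len(text)
print(caption_reconstruction_scores("an oil painting of a lion in a field",
                                    images=[object(), object()],
                                    caption_log_prob=fake_scorer))
```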

Jeremie: That makes a lot of sense. One thing you also talk about in your paper is fine-tuning — you take CogView and fine-tune it for specific downstream tasks, like style learning and super-resolution. What do those pipelines look like?

Ming: Style learning and super-resolution are actually quite different. Style-based fine-tuning is quite similar to the pre-training, because it just uses some images in specific styles and adds the word for the style to the text prompt. So we just augment the captions and train with a refined selection of images. In that way, the model can learn the correlation between the word for a style and how to express that style in an image.

Super resolution is a different story. Suppose that we give CogView a prompt like “Big Ben”. What we’ll get is an image of Big Ben with sky and other buildings around it. But now suppose that we change the prompt to something like, “a close-up view of Big Ben”. In that second case, we’d expect to see more detail, at what is effectively higher resolution. So the model clearly knows a level of detail that it doesn’t always reveal when prompted. There’s the potential to squeeze out more.

But the quality of generated images is limited by the resolution of the model, which in turn is fixed by the number of tokens it produces. We can overcome that by training the model to specifically map shorter token sequences to longer ones — say, by going from 16 x 16 resolution images to 32 x 32 ones. We can then use a sliding window to magnify each part of our generated image to a higher level of detail.

The fine-tuning process is short enough that we aren’t really teaching the model anything new — it’s more that we’re causing it to reveal knowledge that it already had. In all, we fine-tune for this task using just about 20 hours of training time. In that sense, it’s a lot like prompt engineering — a quick, low-compute process that refines the system’s output to ensure that it’s revealing all the knowledge that it already contained.
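
A rough sketch of the sliding-window idea: move a window over the low-resolution token grid, have the fine-tuned model expand each region into a finer grid of tokens, and stitch the results together. For simplicity the version below uses non-overlapping windows, the grid sizes are illustrative, and expand_region is a stand-in for the model call, so treat it as a schematic rather than CogView’s actual procedure.

```python
import numpy as np

def sliding_window_super_resolution(low_res: np.ndarray, expand_region,
                                    window: int = 8, scale: int = 2) -> np.ndarray:
    """Upsample a grid of image tokens window by window.

    low_res:       (H, W) grid of discrete image tokens, e.g. 16x16.
    expand_region: stand-in for a fine-tuned model that maps a (window, window)
                   token block to a (window*scale, window*scale) block.
    """
    h, w = low_res.shape
    high_res = np.zeros((h * scale, w * scale), dtype=low_res.dtype)
    for i in range(0, h, window):
        for j in range(0, w, window):
            block = low_res[i:i + window, j:j + window]
            high_res[i * scale:(i + window) * scale,
                     j * scale:(j + window) * scale] = expand_region(block)
    return high_res

# Toy usage: a fake "model" that just repeats each token, taking 16x16 tokens to 32x32.
fake_expand = lambda block: np.kron(block, np.ones((2, 2), dtype=block.dtype))
tokens_16 = np.random.randint(0, 8192, (16, 16))
print(sliding_window_super_resolution(tokens_16, fake_expand).shape)   # (32, 32)
```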

Jeremie: That ties into another question I wanted to ask. A 4-billion parameter model is quite big, and I imagine there are a lot of technical challenges involved in pulling that off. What were some of the hardest things about wrestling with your infrastructure or your model to make all that work?

Ming: The biggest problem is actually the instability. That’s an issue for all large models, but it’s more apparent in our case because the data we needed for our specific task was very heterogeneous: it included both text and images. One of the findings in our paper is that the level of instability of the training process actually depends on the data, and heterogeneous data leads to more instability in general. We finally solved the problem using an algorithm-based method. We changed the architecture of the transformer by developing a new structure that we call Sandwich LayerNorm, and used another operation we called Precision Bottleneck Relaxation, both of which are discussed in our paper.

But OpenAI and Google can solve the instability problem in other ways that don’t require those workarounds. They have access to more compute power, and Google in particular can use optimized hardware like TPUs and BF16-formatted data, which is much better suited for machine learning. As values can become quite large in some layers, having the extra dynamic range of that floating-point format can really matter.

Older GPUs, of the sort we use, which date back to before the A100, don’t support the BF16 format. At OpenAI, their solution to this problem was to just use some engineering techniques to map data from certain layers back to single-precision, and a variety of other tricks. They discussed some of these tricks in their paper, but not in great detail — and more to the point, at the time we were working on our project, OpenAI’s paper wasn’t out, so we had to find another solution.

We decided to change the architecture of our transformer. The approach was very successful, and was also “discovered” in some recent papers — like Facebook AI Research’s NormFormer. I have already talked to their authors, and their techniques are actually very similar to ours.
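
For readers curious what Sandwich LayerNorm looks like, here is a schematic PyTorch sketch of one residual branch: alongside the usual pre-LayerNorm, a second LayerNorm is applied to the sub-layer’s output before it is added back to the residual stream, which keeps the values flowing through the network bounded. Treat this as a sketch of the idea rather than the exact implementation from the paper; Precision Bottleneck Relaxation is not shown.

```python
import torch
import torch.nn as nn

class SandwichLayerNormBlock(nn.Module):
    """One residual branch with Sandwich LayerNorm: LayerNorm before and after the sub-layer."""

    def __init__(self, sublayer: nn.Module, dim: int):
        super().__init__()
        self.sublayer = sublayer           # e.g. an attention or feed-forward module
        self.ln_in = nn.LayerNorm(dim)     # the usual pre-LayerNorm
        self.ln_out = nn.LayerNorm(dim)    # the extra "sandwich" LayerNorm for stability

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ln_out(self.sublayer(self.ln_in(x)))

# Toy usage with a feed-forward sub-layer.
dim = 256
ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
block = SandwichLayerNormBlock(ffn, dim)
print(block(torch.randn(2, 10, dim)).shape)   # torch.Size([2, 10, 256])
```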

Jeremie: It’s so interesting how AI training is now much more of a hardware problem than it used to be.

I want to shift gears to a more language-related aspect of this work. When we look at a language like English, there’s a certain dimensionality to our dictionary — a certain amount of meaning behind each word. English-language transformers often use a dimensionality of around 500, and word2vec used 200 to 300 dimensions back in the day, but roughly, we’re talking about 300–500 dimensions to represent words in English. Is that also the case in Chinese, or is there a difference in terms of the information content per token in the two languages?

Ming: That’s a very interesting topic. I should start by noting that the dimensionality of word2vec is not always optimal: larger models with more capacity might find a greater dimensionality to be superior. And of course there’s the question of the level of abstraction that’s used for the embeddings, and whether we’re dealing with word-, syllable- or character-level embeddings. In the general case, you find that using a higher dimensionality simply tends to do better.

The bigger differences between languages show up in tokenization. Many current methods tokenize at the syllable or character level, or use other sub-word schemes. That’s useful because English words have roots, tenses, and other components that convey specific meaning.

But in Chinese, everything is different. Chinese writing involves single characters, usually with well-defined meanings. Sometimes they cluster together in common phrases, but individual characters are semantic load-bearing elements in a way that individual syllables in English words are not. Chinese lacks tenses; whereas in English you might add “ed” to the end of a verb to indicate that the action took place in the past, in Chinese tense is inferred from context.

So modeling Chinese currently turns out to be easier than English. When a language model is first learning English, it has to pick up on all these little nuances of conjugation that don’t come up in Chinese. And that takes a lot of time and compute, which would otherwise be spent on more abstract reasoning.

Jeremie: This makes me think back to my high school and undergrad days when I was learning Chinese. It was really clear that it’s a more logical language than English, and much easier to learn as a result (or at least, the verbal form is, the characters are much harder!). So in a way, maybe I shouldn’t be surprised that it’s easier for machines to learn, too.

While we’re on the topic of language, it’s been the case for a while now that AI research has mostly been done in English. But these days, for the first time, there’s a really large and impressive body of work being done in China as well.

Does most published research in China tend to be in English or does it tend to be in Chinese? And what impact do you see that language barrier having on collaboration across the border?

Ming: Yes, much research is still done on English-language data, even in China. A main driving force behind that is the need to benchmark the performance of our models against those of models built in the English-speaking world. This is true for more than just Chinese researchers though: Chinese, Russian and French researchers must all train models in English if they want to measure the effectiveness of their models, and compare them to what the rest of the community is publishing.

That’s largely why our language modeling has been so focused on English to date. But we’re also investigating methods that would allow us to focus more on Chinese language work, without sacrificing our ability to measure our progress relative to others’ work. That’s important because ultimately, if we want Chinese consumers to benefit from language modeling, we need to be capable of performing language generation in Chinese. Through that lens, it’s been a real disadvantage that we’ve had to focus so much on English language generation: big models are expensive, and they don’t drive economic value if they’re only useful for benchmark measurement. Much of this work is focused on coming up with universal tokenizers that would allow our models to simultaneously learn English and Chinese, effectively creating bilingual models that create economic value in China while also being comparable to English language models developed abroad.

Jeremie: That’s fascinating, I’d never thought of the language barrier as effectively being a tax on non-English research.

Another aspect of the CogView project that I wanted to ask about is the question of whether to release the model to the general public. Obviously, CogView is now open source so you’ve made the decision to publish it. By contrast, OpenAI famously decided not to release DALL-E, citing safety concerns, and concerns over malicious use — the idea being that people might use it for evidence fabrication, or to generate disinformation and so on. Was that possibility something that you considered when you decided to release CogView?

Ming: We decided to publish the model to let more researchers access it, and to make more people aware of the power of this kind of model, when applied to images. I don’t think misuse is currently a very big problem. Our systems currently just aren’t there — maybe in two years the new models will be able to generate very realistic images in a way that has potentially strongly negative social impact, and I can see how releasing these models would be riskier then.

Jeremie: How would we be able to guess ahead of time how good a model has to be before it could have real negative social impact?

It seems like it would be impossible to know. When the GPT-3 API was released in beta, even OpenAI hadn’t appreciated the full range of its capabilities — some of which have been found to lend themselves to malicious use (in the form of phishing, disinformation generation and dissemination, etc). Surely we should account for the possibility of similar nasty surprises, even with models that appear at first glance not to have concerning levels of capability?

Ming: Yeah. I mean, my point is that because the technology itself is actually a product of an industry-wide effort, no one lab’s decision not to release a model will make much of an impact. Other methods are going to be needed to effectively manage the social impact of these systems — including new laws and regulations. Not releasing a model isn’t actually a big factor in the scheme of things.

Jeremie: It’s such an interesting debate, and I absolutely agree that there’s going to have to be a role for governments in all this, probably sooner than later at the rate things are moving.

One area governments have started to think about already is AI safety, which brings me to the last thing I did want to ask you about. How does the Chinese AI community view things like AI safety, and even AI alignment?

I know you’re not an alignment researcher and that’s totally fine, but there is a pretty small and fast-growing community of researchers in the West and in the US in particular, who are concerned about long-term future risk from AI. Is that something that’s being discussed in China?

Ming: I think the problems of the social impact of AI are not being discussed sufficiently in either the U.S. or China.

And there are related problems and challenges. Consider, for example, the case of “distantly-supervised” relation extraction, in which we train language models to learn from the co-occurrences of words (or syllables, etc, as the case may be) in text that’s crawled from the web. With the appearance and widespread use of models like GPT-3, an increasing amount of that web-based content will itself be AI-generated — so we risk training AI systems on the outputs of previous generations of AIs. These older systems are therefore introducing a very challenging type of noise that risks making it impossible for web-based crawling to be used as a means of dataset generation in the future.

At BAAI (Beijing Academy of AI), for example, and at other institutions, they might have some internal discussions about the safety of AI models, but they fundamentally lack the power to control the course of AI safety, since AI models are being developed around the world. There’s global coordination that’s needed here.

Jeremie: Such a complex issue — I totally agree. I mean, it will take some kind of global coordination. There is no way around that. Somehow, China and the US are going to have to talk about this, and other countries too.

Thank you so much, Ming. Fascinating chat.

Ming: Thank you very much.

