
Stable Diffusion, DreamFusion, Make-A-Video, Imagen Video, and What’s Next

Generative AI for Text-to-Image, Text-to-3D, and Text-to-Video

GENERATIVE AI OVERVIEW

The Starry Night (by the author using Stable Diffusion)

Generative AI is still nascent but growing exponentially. It has been stealing the limelight in the AI field since OpenAI debuted GPT-3 and DALL·E.

2022 is the year of text-to-content generation (aka AIGC). In April 2022, OpenAI released DALL·E 2, described in a paper combining CLIP latents with diffusion models [1]. For the first time, it created realistic images and art from a text description in natural language.

Four months later, the startup StabilityAI announced the release of Stable Diffusion, an open-source text-to-image generator that creates stunning art within seconds. It runs on consumer GPUs, a breakthrough in both speed and quality. The momentum was such that StabilityAI became a unicorn in its seed round on 17 Oct 2022.

On 29 Sep 2022, Google announced DreamFusion for text-to-3D generation using 2D diffusion. On the same day, Meta announced Make-A-Video for text-to-video generation without text-video data.

Within a week, Google seemed to answer Meta’s Make-A-Video and debuted Imagen Video for text-to-video generation.

In this exciting journey over the past half-year, Midjourney and CogVideo are no less important to mention. Midjourney is an independent research lab offering the Midjourney Bot to generate images from text. CogVideo is the first open-source large-scale pre-trained text-to-video model, with 9.4 billion parameters.

Here I’ll describe how Stable Diffusion, text-to-3D, and text-to-video work. We’ll also try awesome text-to-image generation without coding and look at what’s next.

Stable Diffusion and What’s Unstable

Stable Diffusion introduced conditional latent diffusion models (LDMs) [2]. LDMs achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis, along with highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation, and super-resolution, while significantly reducing computational requirements compared to pixel-based diffusion models. The method dramatically improves the training and sampling efficiency of denoising diffusion models without degrading their quality.

Conditional Latent Diffusion Models explained via concatenation or by a more general cross-attention mechanism (Source: Latent Diffusion Models)
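If you would rather run Stable Diffusion yourself than use the no-code options later in this article, here is a minimal sketch using Hugging Face’s diffusers library. The checkpoint id, half-precision dtype, and CUDA device below are assumptions you may need to adjust for your own setup.

```python
# A minimal sketch of text-to-image with Stable Diffusion via Hugging Face diffusers.
# Assumes the `diffusers` and `transformers` packages, a CUDA GPU, and access to the
# "CompVis/stable-diffusion-v1-4" checkpoint (an assumption; any Stable Diffusion
# checkpoint you have access to works the same way).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # half precision so it fits on consumer GPUs
)
pipe = pipe.to("cuda")

prompt = "The Starry Night, swirling night sky over a quiet village, oil painting"
image = pipe(prompt).images[0]  # the safety classifier mentioned below runs by default
image.save("starry_night.png")
```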

Along with the release of Stable Diffusion, StabilityAI developed an AI-based Safety Classifier, enabled by default. It understands concepts and other factors in generated images to remove outputs undesired by users. But because the model and code are open, the classifier’s parameters can be readily adjusted, leaving a powerful image generation model with few guardrails.

Based on Stable Diffusion, Mage has shown up to generate NSFW content in the browser. It is straightforward and free to use, with no NSFW filtering.

Don’t get confused: Unstable Diffusion is a community supporting AI-generated NSFW content made with Stable Diffusion, and its models can be found on Patreon and Hugging Face.

DreamFusion for Text-to-3D

Google and UC Berkeley together introduced DreamFusion for text-to-3D generation using 2D diffusion.

DreamFusion works by transferring scalable, high-quality 2D image diffusion models to the 3D domain through a novel SDS (Score Distillation Sampling) approach and a novel NeRF (Neural Radiance Field)-like rendering engine. DreamFusion does not require 3D or multi-view training data and uses only a pre-trained 2D diffusion model (trained on only 2D images) to perform 3D synthesis. [3]

DreamFusion illustrated to generate 3D objects from a natural language caption (Source: DreamFusion)

As illustrated above, a scene is represented by a NeRF randomly initialized and trained from scratch for each caption. The NeRF parameterizes volumetric density and albedo (color) with an MLP. DreamFusion renders the NeRF from a random camera, using normals computed from density gradients to shade the scene with a random lighting direction. Shading reveals geometric details that are ambiguous from a single viewpoint. To compute parameter updates, DreamFusion diffuses the rendering and reconstructs it with a (frozen) conditional Imagen model to predict the injected noise.
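To make that optimization loop concrete, here is a schematic PyTorch-style sketch of a single Score Distillation Sampling update. This is not DreamFusion’s actual code: the `render`, `add_noise`, and `noise_predictor` callables are hypothetical stand-ins for the NeRF renderer, forward diffusion, and frozen Imagen model described above.

```python
import torch

def sds_step(render, add_noise, noise_predictor, optimizer, text_embedding):
    """One schematic Score Distillation Sampling (SDS) update.

    render:           callable() -> image tensor, rendered from a random camera and
                      lighting direction, differentiable w.r.t. the NeRF parameters
                      tracked by `optimizer` (hypothetical stand-in).
    add_noise:        callable(image, noise, t) -> diffused image at timestep t.
    noise_predictor:  frozen text-conditioned 2D diffusion model,
                      callable(noisy_image, t, text_embedding) -> predicted noise.
    """
    image = render()  # random viewpoint + random lighting, as in the paper

    # Diffuse the rendering: inject noise at a randomly chosen timestep.
    t = torch.randint(low=20, high=980, size=(1,))
    noise = torch.randn_like(image)
    noisy_image = add_noise(image, noise, t)

    # The frozen diffusion model predicts the injected noise.
    with torch.no_grad():
        predicted_noise = noise_predictor(noisy_image, t, text_embedding)

    # SDS: treat (predicted_noise - noise) as the gradient w.r.t. the rendered image,
    # skip the diffusion model's Jacobian, and backpropagate into the NeRF parameters.
    grad = predicted_noise - noise
    loss = (grad.detach() * image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```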

Though DreamFusion produces compelling results, it is still in the early stage of text-to-3D. SDS is not a perfect loss function when applied to image sampling. Thus in a NeRF context, it often produces over-saturated and over-smoothed results and lacks diversity. Furthermore, DreamFusion uses the 64 × 64 Imagen model to balance quality and speed.

Make-A-Video for Text-to-Video

Meta (aka Facebook) has never fallen behind in the AI evolution. On the same day DreamFusion was announced, Meta rolled out Make-A-Video for text-to-video generation.

Meta Make-A-Video high-level architecture (Source: Make-A-Video)

According to the high-level architecture above, Make-A-Video has three main components [4]:

  1. A base T2I (text-to-image) model trained on text-image pairs;
  2. Spatiotemporal convolution and attention layers that extend the networks’ building blocks to the temporal dimension;
  3. Spatiotemporal networks that consist of both spatiotemporal layers and another crucial element needed for T2V generation: a frame interpolation network for high frame rate generation.

So Make-A-Video is built on T2I models with novel and practical spatiotemporal modules. It accelerates the training of the T2V model without needing to learn visual and multimodal representations from scratch, it does not require paired text-video data, and the generated videos inherit the vastness (diversity in aesthetics, fantastical depictions, etc.) of today’s image generation models.
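As a rough illustration of that three-stage design (not Meta’s code), here is a schematic sketch; `t2i_prior`, `spatiotemporal_decoder`, `frame_interpolator`, and `spatial_sr` are hypothetical stand-ins for the modules named above.

```python
def make_a_video(text, t2i_prior, spatiotemporal_decoder, frame_interpolator, spatial_sr):
    # 1) The base T2I model (trained on text-image pairs) maps the prompt
    #    to an image-level embedding.
    image_embedding = t2i_prior(text)

    # 2) Spatiotemporal convolution/attention layers extend the image decoder
    #    along the time axis, producing a short, low-frame-rate clip.
    low_fps_clip = spatiotemporal_decoder(image_embedding)

    # 3) A frame interpolation network raises the frame rate, and spatial
    #    super-resolution raises the per-frame resolution.
    return spatial_sr(frame_interpolator(low_fps_clip))
```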

Google’s Imagen Video

Google’s Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.

Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models.

It comprises seven sub-models to perform text-conditional video generation, spatial super-resolution, and temporal super-resolution. The entire cascade generates high-definition 1280×768 (width × height) videos at 24 frames per second for 128 frames (~ 5.3 seconds), approximately 126 million pixels.
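A quick back-of-the-envelope check of those numbers:

```python
# Verifying the output size quoted above for Imagen Video's full cascade.
width, height = 1280, 768          # spatial resolution (width x height)
frames, fps = 128, 24              # total frames and frame rate

pixels_per_frame = width * height            # 983,040
total_pixels = pixels_per_frame * frames     # 125,829,120 ~ 126 million pixels
duration_seconds = frames / fps              # ~ 5.3 seconds

print(f"{total_pixels:,} pixels over {duration_seconds:.1f} seconds")
```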

Imagen Video sample for "A bunch of autumn leaves falling on a calm lake to form the text 'Imagen Video'. Smooth." The generated video is at 1280×768 resolution, 5.3-second duration, and 24 frames per second (Source: Imagen Video)

No Code AI for Stable Diffusion

As described above, we can see that diffusion models are the foundation for text-to-image, text-to-3D, and text-to-video. Let’s experience it using Stable Diffusion.

Use a suggested text: A high-tech solarpunk utopia in the Amazon rainforest

(by the author using Stable Diffusion)

Text: anime female robot head with flowers growing out

(by the author using Stable Diffusion)

Text: Mount Rainier close view from the sky

Generated by DreamStudio (left) vs. photo by author (right)

You probably can’t wait to try it yourself. Below are several options that require no code at all.

  1. StabilityAI’s Stable Diffusion hosted on Hugging Face
  2. Stable Diffusion Online
  3. StabilityAI’s DreamStudio
  4. Mage enabled with NSFW
  5. Playground AI
  6. Text-to-Image (Beta) on Canva
  7. Wombo Art

What’s Next?

Generative AI is stunning and moving fast. While we are still immersed in text-to-image for realistic images and art, we are now moving into the next frontiers: text-to-video and text-to-3D.

But it is nascent and filled with challenges regarding relevance, high fidelity, vastness, and efficiency.

  1. Relevance: The models produce different (even significantly different) results for the same input text, as well as some irrelevant results. When generating art, describing the input in natural language is becoming an art in itself.
  2. High fidelity: We are impressed with many realistic images from DALL·E 2 and Stable Diffusion, but all of them still have plenty of room to improve in fidelity.
  3. Vastness: Vastness is about diversity in aesthetics, fantastical depictions, etc. It can provide rich results for a wide range of inputs.
  4. Efficiency: It takes seconds to a few minutes to generate an image, and longer still for 3D and video. For example, DreamFusion uses the smaller 64 × 64 Imagen model to speed up generation at the cost of quality.

The good news is that it opens up many exciting opportunities: AI engineering, foundation models, and generative AI applications.

  1. AI engineering: AI engineering is essential for automating MLOps, improving data quality, enhancing ML observability, and automatically generating application content.
  2. Foundation models: It is expensive and increasingly unrealistic to train and operate many large-scale models independently. Eventually, these efforts will unify or consolidate into several foundation models: large-scale models running in the cloud that serve different domains and the applications built on top of them.
  3. Generative AI applications: With AI engineering and foundation models in place, there is a massive opportunity for applications, including digital content in the metaverse and NFT space; e.g., the startup Rosebud focuses on diverse photo generation.

By 2025, generative AI models are predicted to produce 10% of all data. With the current pace of generative AI evolution, we can expect remarkable changes in the coming years.


References

  1. Hierarchical Text-Conditional Image Generation with CLIP Latents: https://arxiv.org/abs/2204.06125
  2. High-Resolution Image Synthesis with Latent Diffusion Models: https://arxiv.org/abs/2112.10752
  3. DreamFusion: Text-to-3D using 2D Diffusion: https://arxiv.org/abs/2209.14988
  4. Make-A-Video: Text-to-Video Generation without Text-Video Data: https://arxiv.org/abs/2209.14792
  5. Imagen Video: High Definition Video Generation with Diffusion Models: https://arxiv.org/abs/2210.02303
