
Learn to Build Advanced AI Image Applications

By building an interior designer in ComfyUI

Custom Image Generation

Visualizing the open source space for image generation | Image by author


For this piece, I wanted to embark on the challenge of making advanced image generation within the open-source space easier for beginners. I realize that it’s a pretty heavy task to complete in 20 minutes.

We’ll cover what’s happening in this space, which open-source models and tools are popular, learn how diffusion works, and dive into key technologies like LoRAs, ControlNets, and IP Adapters.

We’ll also explore different use cases and how you can apply various technologies to each.

By the end, we’ll create an interior designer with Flux that takes an image of a bedroom and generates different designs. You can see what this will look like below.

Interior designer workflow in ComfyUI | Image by author

If you are keen to get started, you can skip the introduction and go straight into building.

Introduction

This part is for everyone, even if you don’t want to dig into the technicalities of working with custom image generation.

We’ll go through image generation, the open-source and proprietary models that have been released, the tech available, a few use cases, and the cost of building.

I have a collection of resources and information gathered in my usual repository, which you can find here.

Generative Images

If you are completely new to generative images, you should go do some background reading first.

This space has seen some serious innovation in the last few years.

But it’s not just that you can now generate high-quality images from a prompt; it’s that new technologies give you precise control over aspects like branding, product aesthetics, and personalized styles in these images.

The open-source community has been busy building over the last few years. We’ll go through different use cases and how this technology is changing.

Closed vs Open Source Models

The first thing we should go through are the AI models available in this space and which ones you should focus on building with.

Probably the oldest – and most widely used – is Stable Diffusion 1.5. It has a huge community behind it and is fully open source, which means you can build APIs on top of it commercially.

In 2024, Flux was released by Black Forest Labs as a competitor to closed-source systems like MidJourney.

Most popular generative image models, release year and commercial licenses | Image by author

Flux Schnell is fully open source with a commercial license, but the more popular Flux Dev is not. This doesn’t mean you can’t use Flux Dev’s outputs commercially, but you can’t build systems on top of the model and sell it.

You can see my scribbles above on the different popular models, noting whether they’re open source and have a commercial license. If you want the entire list, check this.

Now, while you might be tempted to jump straight to Flux for all your projects, keep in mind that the longer a model has been available, the more tools and resources have been built around it.

Stable Diffusion 1.5 and Stable Diffusion XL have some of the largest communities, with many tools and techniques already developed. Don’t disregard them fully before you start building.

Image Use Cases

There are so many use cases possible within this space.

A few days ago, some AI-generated images were circulating on LinkedIn that looked so real that people couldn’t tell the difference. These images were created from a LoRA fine-tuned on a woman’s vacation photos in ComfyUI by this creator.

Now, the possibilities don’t stop there.

You’ll see a few popular use cases I’ve scribbled below, along with their corresponding potential techniques.

Different use cases and techniques, see a clean list in this repository | Image by author

I have a table here that is a bit more organized.

What we do, though, is combine different tools for different workflows. You’ll often see people applying a ControlNet alongside an IP Adapter and additional LoRAs.

ComfyUI vs Alternatives

Now, to build advanced generative image applications, we need to look at the tools available.

I’m sure there are plenty of commercial options out there, but there are also many open-source ones. In my experience, the open-source tools have the edge: they have larger communities driving innovation, and they’re free.

The biggest one released so far is ComfyUI.

ComfyUI is a visual, node-based GUI that gives you full control over the diffusion process, making it an excellent choice for creating custom generative images, and even videos at this point.

However, many find ComfyUI unnecessarily complicated, so let’s quickly explore some alternatives.

Advanced open-source tools for generative AI images. See the list here | Image by author

The issue with many alternatives is that they can be restrictive and, by now, outdated. For example, A1111 is one of the more well-known UIs, but currently, it does not support the newer Flux models.

ComfyUI has a learning curve, and learning the basics may take a few days, but any technical soul should get to grips with it.

If you still find ComfyUI overly complicated, a good alternative is SwarmUI, which is built on top of ComfyUI and offers a simplified UI.

This article will still be useful for understanding techniques, regardless of the tool you choose.

Economics of Advanced Image Generation

If you’re a stakeholder, let’s briefly talk about the cost of building with tools like ComfyUI.

To run these tools effectively, you’ll need good hardware, including GPUs with high VRAM, which requires a bit of an investment.

Alternatively, you can rent cloud GPUs (which we all do) but this can add up quickly. The cheapest options start at around $0.60 per hour, depending on your setup.

Despite the cost, this may still be a better investment than hiring external consultants to build, deploy, and maintain workflows. Having your team learn internally not only saves money in the long term but also builds expertise in-house.

That said, achieving high-quality results isn’t easy. The technology is evolving rapidly, and there are plenty of outdated nodes and workflows all over the web. It’s also quite technical, so teams without the right skills may struggle.

It’s still worth investing in learning what you can do with it.

Technical Bits

Let’s go into the more technical stuff so we can understand how image generation works. We’ll also set up ComfyUI and explore its layout while covering key technologies like LoRAs, ControlNets, and IP Adapters.

At the end of this section, we’ll build the interior designer workflow which is a fairly easy use case.

How Image Generation Works

Image generation is not the same as computer vision, though there is some overlap. Computer vision focuses on identifying and analyzing the contents of an image.

If you’re interested in exploring computer vision use cases, check out an article I wrote here a year or so ago.

Image generation, on the other hand, falls under the generative AI category. It’s about creating content rather than analyzing or interpreting it. The primary model types in this space are GANs and diffusion models.

This article focuses on diffusion models, which are the primary choice in ComfyUI, but there are several resources available if you want to dig deeper into this space.

I want to mention though that when we generate images with diffusion models we are starting with random noise, and the model gradually refines it step by step, until it forms a coherent image. This process is guided by our prompt (the embedding).

Simplified diffusion process | Image by author

Some people call it making art from noise.

There are multiple parts here that play an important role.

We have the base model itself, the CLIP model, which translates our prompt into something the model can work with (i.e., the embedding), and an autoencoder that decodes the output of the process, giving us a finished image we can use.

When working with higher-level UIs such as MidJourney, you don’t participate in this process, but in ComfyUI, you have more control.
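If it helps to see those pieces in code, here’s a minimal sketch using the Hugging Face diffusers library, which bundles the base model, the CLIP text encoder, and the VAE into a single pipeline. The checkpoint name and settings are just examples, not part of the workflow we’ll build.

```python
# Minimal text-to-image sketch with diffusers: the pipeline bundles the base
# model (a UNet), the CLIP text encoder, and the VAE described above.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example SD 1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a cozy Scandinavian bedroom, soft morning light",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,  # denoising steps from pure noise to image
    guidance_scale=7.5,      # how strongly the prompt steers the process
).images[0]
image.save("bedroom.png")
```

ComfyUI breaks this single call apart into separate nodes, which is exactly where the extra control comes from.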

Setting up ComfyUI

Now this part is tricky. If you have good hardware then it’s simply about installing it on your computer. You can see this guide for Windows and Mac.

However, using ComfyUI on a MacBook without a GPU will be slow. If you’re like me and have no patience you’ll need a GPU.

There are several options available, and I’ve outlined them here.

Different choices to host ComfyUI, see the list here | Image by author

I don’t want to recommend any specific one, but some can get you up and running within minutes which is ideal at the start. Unfortunately, this will cost around $0.99 per hour of use which is pretty expensive.

Once you’re using it daily, I’d recommend looking for better and more cost-effective options. I’ve never been keen on having high-end hardware but that is definitely changing.

The Layout

I’m assuming you’ve found a way to run ComfyUI, but if not, and you just want to experiment with the canvas for this next part, please go here. This website will let you run simpler workflows for free (although you will be placed in a queue, so it may be very slow).

If you are running a web based ComfyUI this is what it will look like | Image by author

Once you land with a blank black screen in front of you, it can be daunting. It’s easy to get started though and you do not need to know every part of this tool to use it.

The key thing to know is that a left double-click lets you search for nodes to add to your workflow.

Left double-click search in ComfyUI | Image by author

To load a model you simply find the node Load Checkpoint.

When using models via the Load Checkpoint node, you first need to download them and place them in the models directory so ComfyUI can find them. The same applies to other models like LoRAs, ControlNets, and so on (which we’ll cover in a bit).

You need to put all the models you work with within the ComfyUI models directory.

If you’re using a hosting provider, like the one above, it will typically help you load those initial models.

You should also know that workflows can be easily downloaded and uploaded using a JSON file. You can simply paste in the JSON of a workflow in the canvas.

If you need to install or update custom nodes, you’ll need access to the ComfyUI Manager. For workflows that throw errors because of missing nodes, this is where you’ll search for and install or update them.

Before going for our use case, we’ll walk through one simple workflow for Stable Diffusion and another for Flux.

Key Building Blocks

I want to tie in how image generation works for this part so it makes sense.

You might think this process is unnecessarily complicated, but there’s a reason we set up the entire pipeline ourselves: we can later break it down if needed and do some really cool stuff.

The first node we’ll add is the Load Checkpoint (as mentioned above).

Search for Load Checkpoint in ComfyUI | Image by author

This is the key node that loads the model (checkpoint) and its associated components, such as the CLIP model, which is a text encoder that interprets our prompt into something the model can understand.

It also gives us the VAE (Variational Autoencoder) that’s embedded in the checkpoint, which compresses images into latent space during encoding and expands them back to full resolution during decoding.

You can load all of these separately, and we’ll do so for Flux in a bit.

Obviously, the checkpoint is the backbone of the pipeline. What might be new to you is dealing with the CLIP and VAE directly, which are hidden when using higher-level systems like MidJourney.

Don’t worry if you feel overwhelmed, it will click soon.
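To make this concrete, here’s a small sketch with diffusers showing that one .safetensors checkpoint carries all three outputs the Load Checkpoint node gives you (the file path is hypothetical):

```python
# A single .safetensors checkpoint bundles the model, the CLIP text encoder
# and the VAE -- the same three outputs the Load Checkpoint node gives you.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "models/checkpoints/some_sd15_checkpoint.safetensors"  # hypothetical path
)

print(type(pipe.unet).__name__)          # the diffusion model itself
print(type(pipe.text_encoder).__name__)  # the CLIP text encoder
print(type(pipe.vae).__name__)           # the VAE
```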

After this, we’ll need to grab two CLIP Text Encode (Prompt) nodes by searching for them. We need one for the positive prompt and another for the negative prompt.

Setting up your positive and negative prompts | Image by author

This should be quite obvious, but in one prompt, you’ll write what you want, and in the second, you’ll write what you don’t want.

If you right-click on these nodes, you can change their colors and titles to make them easier to distinguish.

Changing the colors of the prompts | Image by author

We’ll also link the CLIP model to the prompts directly. This model converts the text into a vector (a mathematical representation) that guides the model on how to shape the image.
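If you’re curious what that conversion actually looks like, this is roughly it, sketched with the transformers library and the CLIP model used by SD 1.5:

```python
# What CLIP Text Encode does conceptually: turn a prompt into embeddings
# the diffusion model can be conditioned on.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a photo of a panda in a bamboo forest",
    padding="max_length", truncation=True, return_tensors="pt",
)
embedding = text_encoder(**tokens).last_hidden_state
print(embedding.shape)  # torch.Size([1, 77, 768]) -- one vector per token
```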

Next, we need to add an Empty Latent Image node. This node creates an "empty" latent image filled with random noise in latent space which will be the starting point for the diffusion process.

This node creates an "empty" latent image | Image by author

You’ll set the dimensions and batch size of the images you want to generate via this node.
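For intuition, the "empty" latent is nothing more than random noise at 1/8th of the pixel resolution. A tiny sketch, with sizes chosen as an example:

```python
# The Empty Latent Image node essentially does this: random noise in latent
# space, at 1/8th of the pixel resolution with 4 channels (for SD models).
import torch

batch_size, width, height = 4, 512, 512
latents = torch.randn(batch_size, 4, height // 8, width // 8)
print(latents.shape)  # torch.Size([4, 4, 64, 64])
```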

Once all of these nodes are in place, we’ll also add the KSampler.

Connecting all the nodes | Image by author

The KSampler orchestrates the entire denoising process, i.e. it transforms random noise into the latent representation of our generated image.

You can keep the settings as they are for this one, except perhaps setting a random seed.

Be sure to connect the model, the positive and negative prompts, and the latent image to the KSampler (see the image above).

You can use a custom sampler if you’d like, but for your first time, it’s better to stick with something out of the box.

Once all these nodes are in place, we’ll need to decode the final latent representation produced by the KSampler into a full-resolution image.

The VAE Decode node will handle this, so be sure to add that one too.

Search for the VAE Decode node in ComfyUI | Image by author
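Under the hood, the VAE Decode step is roughly this (a sketch with an example SD 1.5 VAE and a stand-in latent in place of the KSampler output):

```python
# What VAE Decode does: expand the final latent back into a full-resolution
# image. The stand-in latent here is random, so the output will be noise.
import torch
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"  # example VAE
)
latents = torch.randn(1, 4, 64, 64)  # stand-in for the KSampler output

with torch.no_grad():
    decoded = vae.decode(latents / vae.config.scaling_factor).sample

image = VaeImageProcessor().postprocess(decoded)[0]  # a PIL image
image.save("decoded.png")
```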

To save the image after the process is complete, we’ll add the Save Image node provided by ComfyUI.

The finished workflow will look like this.

Make sure all the nodes are connected in ComfyUI | Image by author

You can run the workflow by simply pressing Queue.

The images will appear in the Save Image node.

Results from the standard workflow with SD in ComfyUI | Image by author

I have four images since I set the batch size to 4 in the Empty Latent Image node. Your images may look different depending on the model you used.

We’ll cover how checkpoints work in the next section.

This is the standard beginner workflow that you usually get when you load ComfyUI for the first time.

If you want to use a Flux model, you’ll need to make a few adjustments.

First, load the model, the CLIP, and the VAE separately. You’ll also need two CLIP models and the FluxGuidance node.

You’ll see an example for Flux below that you can follow.

Results from the standard workflow with Flux in ComfyUI | Image by author

Flux is built to use both a standard CLIP text encoder and a T5 text encoder to read the prompt, which is why we’re loading two CLIP models here rather than just one. This is supposedly what gives Flux its superior prompt-following.

FluxGuidance is not essential, but it helps "push" your image generation a little further in the direction of your prompt, which should improve how well the model follows it.

You can set this anywhere from 1.0 to 100, and I encourage you to play around with it.
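Outside ComfyUI, the same knob shows up as a plain parameter. Here’s a hedged sketch with diffusers’ FluxPipeline; the model requires accepting the Flux Dev license on Hugging Face, and the values are just examples:

```python
# Flux with diffusers: the two text encoders (CLIP + T5) are loaded for you,
# and guidance_scale plays the role of the FluxGuidance node.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="a realistic photo of a panda eating bamboo",
    guidance_scale=3.5,      # same idea as the FluxGuidance value
    num_inference_steps=28,
).images[0]
image.save("panda.png")
```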

Now, from here, if you are interested in getting into more basics, I would recommend this tutorial. The beauty of this tool is how you can really control how you want your images generated.

Model checkpoints & LoRAs

As we saw when we went through the basics, if you are using any ComfyUI hosting service, you might see a few models available when you add in the Load Checkpoint node.

Available models in the Checkpoints node in ComfyUI | Image by author

If you have ComfyUI installed locally, you need to locate the models/checkpoints folder, place the models you want in there, and then reload the UI to see them in the dropdown.

If you’re new to ComfyUI, you might be wondering how to find these models and what kind of files to put there. You have both HuggingFace and Civitai at your disposal when looking for models.

See below for how to find checkpoints via Civitai by going to the models page and then filtering for Checkpoint.

Finding checkpoints in Civitai | Image by author

Models are quite large. I have gathered a few popular checkpoints here if you’re at a loss. The files you’re looking for will have .safetensors as the file extension.

Once you find one you want, download it and place it in the models/checkpoints directory in ComfyUI. Yes, the download will take a while depending on your hardware and connection.

This brings us to the next topic: LoRAs.

LoRAs are much smaller in size and can be attached to different checkpoints. They are fine-tuned add-ons to a model. Think of the base model (checkpoint) as the blueprint for building a house. The LoRA is like adding custom paint without changing the structure of the house itself.

Every LoRA comes with its own instructions so be sure to read the model page on how it should be used. Sometimes you need to use trigger words or other keywords in the prompt for it to work well.

Adding a LoRA is quite simple. You find one you like – just as we looked for checkpoints – download it, and paste it into the models/loras directory. If you reload ComfyUI, you should see it in the dropdown for the Load LoRA node.
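For reference, here is the same idea in code with diffusers. The directory and file name are hypothetical, and the LoRA strength mirrors the slider on the Load LoRA node:

```python
# Attaching a LoRA to a base checkpoint -- the code equivalent of the
# Load LoRA node. The LoRA has to match the base model family.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # example base
).to("cuda")

pipe.load_lora_weights(
    "models/loras", weight_name="my_style_lora.safetensors"  # hypothetical file
)
pipe.fuse_lora(lora_scale=0.8)  # roughly the strength slider on the node

# Remember any trigger words from the LoRA's model page.
image = pipe("a portrait photo, my_style").images[0]
image.save("lora_test.png")
```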

Note: You can fine-tune your own LoRAs, but you’ll need to dig for resources on your own for that for now.

See below how I simply add the Load LoRA node, set the lora_name (based on the LoRA I’ve downloaded), and then connect it to the Checkpoint and the CLIP model.

Adding a LoRA to your workflow in ComfyUI | Image by author

If we look at the previous Flux workflow, adding the AmateurPhoto LoRA will result in a finished workflow that looks like the one below.

Adding a LoRA to your workflow in ComfyUI | Image by author

The only change here is the LoRA node, which now has the model and the CLIP going through it. If we run it, we’ll see that we’re now getting a much more realistic panda with this LoRA.

There are many popular LoRAs based on different models. Yes, the LoRA needs to be compatible with the base model; you can’t mix them. This means if you’re using Flux Dev as the base, you look for LoRAs made specifically for Flux Dev.

Key Technologies (you want to be using)

The basics, like we just went through, involve using a plain checkpoint model along with a LoRA for styling. But what makes this space so great is what’s possible beyond this.

This part is about ControlNets and IP Adapters.

ControlNet and other conditioning models allow you to guide the image generation process more precisely based on edges, depth maps, scribbles, or poses of an image.

You can see me testing a few options for processing an image of a dog using Canny Edge, a Depth Map and Pose below.

Testing different ControlNet options for an image of a dog | Image by author

There are several input options but you’ll need specific models capable of correctly processing these inputs to generate an image.

For example, if we use the Canny Edge of a dog with a compatible ControlNet model, the generated image would follow the outline while having freedom to interpret everything else, like color and texture.

That’s why it’s important to pick a base model with a big community behind it, so you’ll have plenty of good ControlNet models to choose from.
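As a rough code analogue of the dog example, here’s a canny ControlNet sketch with diffusers; the ControlNet checkpoint is a common SD 1.5 canny model, and the photo path is hypothetical:

```python
# ControlNet in code: a canny edge map constrains the layout while the
# prompt and base model are free to decide color, texture and style.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example base model
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

dog = np.array(Image.open("dog.jpg").convert("RGB"))  # hypothetical photo
edges = cv2.Canny(dog, 100, 200)                      # canny edge preprocessing
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe("a golden retriever in the snow", image=edges).images[0]
image.save("controlnet_dog.png")
```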

IP Adapter (Image Prompt Adapter) models, on the other hand, can be thought of as one-image LoRAs. If the LoRA comparison doesn’t help, think of them as using images as prompts. They transfer styling from a reference image to a new image, which lets you transfer styles, merge images, and imitate elements like faces.

I’ve collected a few guides for ControlNets and IP Adapters that are good for understanding the basics. I’ve also included popular models for each, organized by base model, in the same repository.

Beyond these two technologies, there are more popular techniques such as inpainting, outpainting, upscaling, and segmentation. I find these a bit easier to grasp.

You can see the table I organized below for each technology.

Different technologies and what they do, see the list here | Image by author

Some techniques don’t always require a specific model, but for good results, it’s recommended to use one.

This tech allows us to do some really cool stuff though.

You can transfer brand styling and logos onto new images. You can place products into newly generated images with different holiday themes for social media. You can turn company photos into cartoons.

I’ve collected a few use cases and the associated technologies in the same repository here. This space has a lot of creativity and we will see it grow within the next few years.

For the use case in this article, we need to look at the tools available for Flux, since that’s the model I want to use.

Flux Tools

As I noted before, we are restricted to the ControlNets and IP Adapters available for each base model. These can either be created by the original creators or by the community.

Since Flux was released in 2024, it doesn’t have the same community support as Stable Diffusion 1.5. However, there are still a few unofficial options for Flux. You can check out the same repository here to explore some of them.

That said, in November 2024, Black Forest Labs – the creators of Flux – released a few new models called Flux Tools to help control and steer images with FLUX.1.

These tools are the official ControlNet and IP Adapter models for Flux. In total, they released four new models: Canny, Depth, Fill, and Redux.

Now, I don’t know if I would recommend that you build with these for a production use case, as they may not have the same quality of results that ControlNets for SD 1.5 or SDXL have. I did some research on this, and the community is finding it hard to work with them, citing that they are still unreliable.

But I decided to go for one anyway for this interior designer workflow we’ll create next. Don’t feel discouraged if you don’t get the results you want, just keep building.

Building an Interior Designer

I have already completed this workflow, which you can find here. You are more than welcome to simply load it in and run it but the point of this exercise is for you to learn on your own.

We’ll work with depth maps so the model understands how the furniture and space should look.

To test out the finished product, I’ll grab a generic bedroom image. See it below.

A generic image of a bedroom | Image by IKEA

We’ll look at the results right away so you can get an idea of what you can achieve with a simple workflow.

AI generated Images using different styles with the workflow with Flux
AI generated Images with a Scandinavian style using Flux
AI generated Image with a Scandinavian style using Flux
AI generated Images using different styles with the workflow with Flux

I’ve simply been prompting here while adding a few general LoRAs to get these images from the base image.

I’m sure people will start building interior designer LoRAs based on previous work at some point. This is also a scary possibility for current designers but hopefully they’ll use it to their own advantage.

Creating a Depth Map

The first thing we need to do is load a new empty canvas in ComfyUI and make sure you have enough VRAM to run this workflow.

I’d recommend 24GB of VRAM and 48GB of RAM (depending on whether you add a LoRA or not).

Next, we need to add the Load Image node along with the AIO Aux Preprocessor node. This node will help preprocess the images for either canny or depth.

You can see the example below of me testing both canny and depth.

Testing depth and canny in ComfyUI | Image by author

To explain, these images will help us generate AI images that follow either the depth map or the canny edges. There are specific depth and canny nodes you can use, but I prefer using the DepthAnythingv2 preprocessor for depth, as shown above.

You can choose to use canny instead of depth here, but I don’t find the official Canny Flux model to be that great.

If you want to use canny with Flux, check out this list of unofficial models you can try.

You can remove the canny edge processor before we move on.
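If you’d rather produce the depth map outside ComfyUI, the transformers depth-estimation pipeline does the same job. A sketch, with a Depth Anything V2 checkpoint as an example (swap in whichever depth model you prefer):

```python
# Build a depth map for the bedroom photo -- the code analogue of the
# AIO Aux Preprocessor set to a depth preprocessor.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # example checkpoint
)

bedroom = Image.open("bedroom.jpg")        # hypothetical input image
depth = depth_estimator(bedroom)["depth"]  # a PIL image of the depth map
depth.save("bedroom_depth.png")
```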

The Models

From here, we’ll follow the same steps we did earlier for Flux. Add the Load Diffusion Model, DualCLIPLoader, and Load VAE nodes.

Remember the left double-click to search for those nodes.

Adding the first nodes in ComfyUI | Image by author

For the Load Diffusion Model node, you’ll need a new model called flux1-depth-dev. If you don’t have it downloaded, you can find it [here]. If you’re using a canny image, you’ll need the flux1-canny-dev model.

Remember, these models are specifically designed to process depth and canny images. Without them, you won’t be able to control the output in the same way.

For the CLIP models, you’ll need two models. Refer to the image for which ones – they’re the same ones we used for the simple Flux workflow earlier. You can find them [here].

If anything is missing, you’ll get an error, and you’ll need to locate the missing model and place it in the correct folder. You don’t have to use two CLIPs here, but it’s recommended.

The Prompt

Next, you’ll need to set the positive prompt and a negative one (though Flux seems to ignore the negative prompt).

You’ll need to specify the design you’re after, so the prompt will matter quite a lot. Go look at how people prompt to design spaces.

Adding the text prompt in ComfyUI | Image by author

You should be able to load my exact prompt with this workflow if you’d like to use it.

Connect the Depth Map

Next, we’ll add a new node we haven’t used before called Instruct Pix to Pix Conditioning. This node facilitates conditioning for image-to-image translation tasks with precise control over the generated image.

You could potentially use the ControlNet node here instead, but I didn’t try it.

We will connect our depth image to the port labeled ‘pixels’ and then connect the remaining nodes to it.

See the picture below to ensure you do it correctly.

Adding the Instruct Pix to Pix Conditioning node in ComfyUI | Image by author

From here, we can follow the same steps as in the previous workflows and connect all the nodes to the KSampler.

Connecting all the nodes | Image by author

I’ve set a random seed and increased the steps to 40, which should improve the quality of the generated image.

We also use a VAE Encode node here. While it’s not strictly necessary, I’m using the dimensions of the original image as the latent, which helps slightly improve the quality.
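For the curious, the VAE Encode step is roughly the mirror image of the decode we saw earlier. A sketch with an example SD VAE (the Flux VAE works the same way conceptually):

```python
# Rough code analogue of VAE Encode: compress the original photo into a
# latent so the diffusion process starts from the image's own dimensions.
import torch
from PIL import Image
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"  # example VAE
)
processor = VaeImageProcessor()

pixels = processor.preprocess(Image.open("bedroom.jpg"))  # hypothetical photo, scaled to [-1, 1]
with torch.no_grad():
    latent = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor

print(latent.shape)  # 1/8th of the image resolution in latent space
```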

You can run it from here.

Running the workflow in ComfyUI | Image by author

You should see the new image in the Save Image node. I also added a LoRA here.

Adding the LoRA to the workflow in ComfyUI | Image by author

Refer to the previous section for how to find LoRAs. For this example, I’m using the iPhone Photo LoRA, which is quite popular.

Now, I suggest playing around with different images, prompts, and LoRAs to see what you can achieve. Prompting matters quite a lot, so don’t underestimate its importance.

You can scroll up to see my results again.

Notes

From here, you would ideally build on this, perhaps taking in an image and then using a visual model to analyze the layout and automate the prompt based on the user’s request.

You’ll get the best results when you fully use both the prompt and the depth map to explain the interior.

You may also want to deploy this workflow as an API to make it easier to use. I’ll definitely cover this in a future piece.
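As a teaser, ComfyUI already exposes a small HTTP API you can queue workflows against. A minimal sketch, assuming a local instance on the default port and a workflow exported via "Save (API Format)" (the node id is hypothetical, check your own export):

```python
# Queue a saved ComfyUI workflow against the built-in HTTP API.
# Assumes ComfyUI runs locally on its default port (8188) and the workflow
# was exported from the UI with "Save (API Format)".
import json
import requests

with open("interior_designer_api.json") as f:  # hypothetical export
    workflow = json.load(f)

# Swap in a new prompt before queueing -- "6" is a hypothetical node id,
# check the ids in your own export.
workflow["6"]["inputs"]["text"] = "a warm Scandinavian bedroom, oak and linen"

resp = requests.post("http://127.0.0.1:8188/prompt", json={"prompt": workflow})
print(resp.json())  # contains the prompt_id you can use to fetch results
```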


I hope you got some inspiration. Like this article if you want to see more like it.

