Stable Diffusion

Training Your Own Model in 3 Simple Steps

What Is Stable Diffusion?

Stable Diffusion is an open source machine learning model designed for generating high-quality images from textual descriptions. It is a latent diffusion model: it pairs a variational autoencoder with a diffusion process, enabling it to transform text prompts into intricate visual representations.

The development of Stable Diffusion represents a significant step forward in the field of generative AI, offering creatives, designers, and developers a free and open tool for image creation. By inputting simple text prompts, users can produce images ranging from realistic photographs to artworks in various styles.

Advanced users of Stable Diffusion might want to train their own, fine-tuned version of the model for specific use cases. We’ll show a hands-on tutorial for achieving this with open source, no-code tools.

Stable Diffusion Architecture and Concepts

The Stable Diffusion architecture is illustrated in the following diagram by Hugging Face. Let’s briefly review the key components.

Source: Hugging Face

Variational Autoencoder

The Variational Autoencoder (VAE) within the Stable Diffusion architecture is used to learn the distribution of training images. It works by encoding input images into a lower-dimensional latent space, capturing their essential features. This encoding process enables the model to generate new images by sampling from the latent space, effectively learning to recreate the diversity and complexity of the input data. The VAE's efficiency in data representation and generation is crucial for the model's ability to produce high-quality, varied images from text descriptions.
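To make this concrete, here is a minimal sketch of the VAE round trip using the Hugging Face diffusers library. The model ID, the example file name, and the 0.18215 latent scaling factor follow common Stable Diffusion 1.x conventions and are shown only for illustration.

# Encode an image into the latent space and decode it back (diffusers).
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Load an image and normalize it to the [-1, 1] range the VAE expects.
image = Image.open("example.png").convert("RGB").resize((512, 512))
pixels = torch.from_numpy(np.array(image)).float() / 127.5 - 1.0
pixels = pixels.permute(2, 0, 1).unsqueeze(0)  # shape: (1, 3, 512, 512)

with torch.no_grad():
    # Encode: 512x512 RGB -> 4x64x64 latent, an 8x spatial compression.
    latents = vae.encode(pixels).latent_dist.sample() * 0.18215
    # Decode: reconstruct an image from the compact latent representation.
    reconstruction = vae.decode(latents / 0.18215).sample

print(latents.shape)  # torch.Size([1, 4, 64, 64])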

Forward Diffusion

The forward diffusion process in Stable Diffusion gradually introduces noise into an image, moving it from a state of order to disorder. This step-by-step degradation of the image's details simulates the transition from a coherent picture to a random noise pattern. By carefully controlling this process, the model learns to recognize and understand the underlying structures of images. This knowledge is essential for the reverse diffusion phase, where the model reconstructs images from noise based on textual cues, ensuring the generated images are both diverse and aligned with the input descriptions.
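As a rough sketch of what this looks like in code, the add_noise method of a diffusers scheduler mixes Gaussian noise into latents according to a sampled timestep; during training, the model is then asked to predict that noise. The shapes and scheduler settings below are illustrative.

# Forward diffusion: progressively corrupt latents with Gaussian noise.
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

latents = torch.randn(1, 4, 64, 64)    # stand-in for VAE-encoded training images
noise = torch.randn_like(latents)      # the Gaussian noise to mix in
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (1,))

# The larger the timestep, the closer the result is to pure noise.
noisy_latents = scheduler.add_noise(latents, noise, timesteps)

# During training, the noise predictor learns to recover `noise` from `noisy_latents`.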

Reverse Diffusion

In the reverse diffusion phase, Stable Diffusion performs the inverse of the forward process. Starting from random noise, it progressively removes noise to synthesize an image that matches the provided text prompt. This stage is critical as it utilizes the learned representations to guide the transformation of noise back into coherent visual content. Through a series of iterations, the model fine-tunes the details, adjusting colors, shapes, and textures to align with the description, effectively bringing the textual prompt to visual life.
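A simplified denoising loop, using the diffusers scheduler API, looks like the sketch below. The noise estimate itself comes from the U-Net described in the next section; here a stub stands in for it so the control flow is self-contained.

# Reverse diffusion: start from noise and iteratively denoise.
import torch
from diffusers import DDPMScheduler

def predict_noise(latents, t, text_embeddings):
    # Placeholder for the U-Net noise predictor (see the next section).
    return torch.zeros_like(latents)

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)                  # 50 denoising steps at inference time

text_embeddings = torch.zeros(1, 77, 768)    # stand-in for the encoded prompt
latents = torch.randn(1, 4, 64, 64)          # start from pure random noise

for t in scheduler.timesteps:
    noise_pred = predict_noise(latents, t, text_embeddings)
    # Remove part of the predicted noise: one step from disorder back toward order.
    latents = scheduler.step(noise_pred, t, latents).prev_sample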

Noise Predictor (U-Net)

The noise predictor, based on the U-Net architecture, is a core component of Stable Diffusion that estimates the amount of noise to remove at each step of the reverse diffusion process. It acts as the model's intuition, determining how to refine the noisy image towards the final, detailed output that matches the text prompt. The U-Net's ability to handle both global structures and fine details is key to producing high-quality images that faithfully reflect the desired content, style, and mood indicated by the user.
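The sketch below shows a single forward pass through the U-Net: given the current noisy latents, the timestep, and the prompt embedding, it predicts the noise to subtract. The SD 1.4 checkpoint is used only because it is publicly available; any compatible checkpoint works the same way.

# One U-Net step: predict the noise present in the current latents.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

noisy_latents = torch.randn(1, 4, 64, 64)    # current noisy latent image
timestep = torch.tensor([999])               # how far along the noise schedule we are
text_embeddings = torch.randn(1, 77, 768)    # encoded prompt (see text conditioning)

with torch.no_grad():
    noise_pred = unet(
        noisy_latents, timestep, encoder_hidden_states=text_embeddings
    ).sample                                 # same shape as the latents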

Text Conditioning

Text conditioning in Stable Diffusion involves embedding the text prompt into a format that the model can understand and use to guide image generation. This process ensures that the output images are not just random creations but are closely aligned with the themes, subjects, and styles described in the input text. By effectively translating textual descriptions into visual cues, the model can produce images that accurately reflect the user's intentions, from specific objects and scenes to abstract concepts and artistic styles.
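In Stable Diffusion 1.x this embedding is produced by a CLIP text encoder. A minimal sketch with the transformers library follows; the model ID is the public SD 1.4 checkpoint, used for illustration.

# Turn a text prompt into the embedding that conditions the U-Net.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="tokenizer"
)
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder"
)

prompt = "a watercolor painting of a fox in a snowy forest"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    # Shape (1, 77, 768): fed to the U-Net through cross-attention layers.
    text_embeddings = text_encoder(tokens.input_ids)[0]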

What Can You Do with the Base Stable Diffusion Model?

The base models of Stable Diffusion, such as Stable Diffusion XL (SDXL) 1.0 or the newer Stable Diffusion 3, are versatile tools capable of generating a broad spectrum of images across various styles, from photorealistic to animated and digital art. These models, designed to convert text prompts into images, offer general-purpose capabilities, making them suitable for a wide range of creative and practical applications.

Users can leverage these models to produce diverse content without the need for specific training in image creation techniques. For example, the base models can create artwork, design elements, and visual concepts straight from textual descriptions, offering an accessible entry point for users to explore AI-generated imagery.
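For developers, a base model can also be driven programmatically in a few lines with the Hugging Face diffusers library. The sketch below assumes a CUDA GPU and uses Stability AI’s public SDXL 1.0 checkpoint; the prompt is arbitrary.

# Text-to-image with the SDXL 1.0 base model via diffusers.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="a photorealistic photo of a lighthouse at sunset, dramatic sky",
    num_inference_steps=30,
).images[0]
image.save("lighthouse.png")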

Despite their versatility, base models have limitations in specialized tasks. While they can generate images in a wide array of styles, achieving high fidelity in specific genres or styles, like classic anime, may require extensive prompting and fine-tuning.

Why Train Your Own Model?

Training your own model on top of a Stable Diffusion base model allows for specialization and refinement in the generation of images tailored to specific needs or styles. A common method for teaching specialized styles to Stable Diffusion is Dreambooth.

For instance, by training a base model like SD XL 1.0 with an additional dataset focused on a particular subject, such as wild animals, the resulting fine-tuned model gains an enhanced ability to generate images that align closely with the desired outcomes, producing more accurate and stylistically consistent images with minimal effort.

This fine-tuning process transforms a generalist base model into a specialist, capable of understanding and replicating specific visual styles or subjects with high fidelity. The creation of fine-tuned models, including the use of advanced techniques like LoRA (Low-Rank Adaptation) and LyCORIS, further narrows the focus, allowing for the generation of images in highly specific styles.

For example, using LoRA or LyCORIS, you might inject a fictional character into the visuals, modify the clothing of characters, add specific elements to the background, or add objects like cars or buildings.
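As a rough sketch, this is what applying a trained LoRA on top of a base model looks like with the diffusers library. The file path and the trigger word in the prompt are placeholders for whatever your own training run produced.

# Apply a LoRA (e.g., a .safetensors file produced by kohya_ss) to a base model.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("path/to/my_character_lora.safetensors")  # hypothetical path

image = pipe(
    prompt="mycharacter standing on a rainy street, cinematic lighting",
    num_inference_steps=30,
).images[0]
image.save("mycharacter.png")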

For example, Jake Dahn illustrated how to use LoRA to generate detailed self-portraits in various styles, by fine-tuning the model with images of himself.

Source: ShruggingFace

Tutorial: Train Your Own Stable Diffusion Model Locally

Requirements

This tutorial is primarily based on a setup tested with Windows 10, though the tools and software we're going to use are compatible across Linux, Windows, and Mac platforms.

A critical hardware requirement is a GPU with at least 6–7GB of VRAM. While this setup might not suffice for running pure Dreambooth tasks, incorporating LoRA (Low-Rank Adaptation) makes it feasible.

On the software front, the requirements include:

  • Python 3.10: Ensure you select Add Python to PATH and Support for TCL/Tk during installation.
  • Git: Necessary for cloning the required repositories.

Step 1: Inference

Inference with Stable Diffusion means generating new images from text prompts using a trained model. The setup utilizes the open source Stable Diffusion Web UI (AUTOMATIC1111). To start, clone the repository using the command:

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git

After cloning, navigate to the stable-diffusion-webui directory and execute webui-user.bat (for Windows) or webui-user.sh (for Linux). This action launches a command window that performs initial setup tasks and eventually displays a message indicating the local URL (e.g., http://127.0.0.1:7860) where the web UI is accessible.

By visiting this URL, you can explore the inference capabilities of Stable Diffusion, inputting text prompts to generate images according to your specifications.
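If you would rather script generation than use the browser, the web UI can also expose an HTTP API when launched with the --api flag. The sketch below follows the txt2img endpoint documented by the AUTOMATIC1111 project; additional payload fields exist but are omitted here.

# Call the web UI's txt2img endpoint (requires starting the UI with --api).
import base64
import requests

payload = {
    "prompt": "a watercolor painting of a fox in a snowy forest",
    "steps": 20,
    "width": 512,
    "height": 512,
}

response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
response.raise_for_status()

# Generated images are returned as base64-encoded strings.
image_bytes = base64.b64decode(response.json()["images"][0])
with open("output.png", "wb") as f:
    f.write(image_bytes)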

Step 2: Training

This part uses the open source tool kohya_ss, which includes an implementation of Dreambooth and allows you to personalize your model. Installation involves cloning the repository with the following command:

git clone https://github.com/bmaltais/kohya_ss.git

Within the kohya_ss directory, initiate the setup by running setup.bat, which guides you through a series of configuration options, including the choice of PyTorch version and whether to utilize GPU resources. The setup script may prompt you to make selections regarding the uninstallation of previous files, distributed training, and optimization options, among others.

To begin training, navigate to the kohya_ss directory and execute:

# On Windows

.\gui.bat --listen 127.0.0.1 --server_port 7860 --inbrowser

# On Linux

./gui.sh --listen 127.0.0.1 --server_port 7860 --inbrowser

This command starts the training web UI. In the Dreambooth LoRA tab, set up the training by specifying the instance prompt (a unique identifier for your model's focus), class prompt (the general category of your focus, such as animals or people), and the paths to your training images and destination directory.
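kohya_ss typically expects the training images in a folder whose name encodes the repeat count, instance prompt, and class prompt. For example, with the (illustrative) instance prompt zwxdog and class prompt dog, a layout like the following is commonly used:

train_data/
  100_zwxdog dog/
    image01.jpg
    image02.jpg
    ...

Here 100 is the number of times each image is repeated per epoch, and zwxdog is a rare token the model will learn to associate with your subject.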

Once configured, start the training process, which can vary in duration based on hardware capabilities.

Note: If you want the UI to be accessible on the public web, start the GUI with the following command: ./gui.sh --share

Step 3: Perform Inference on Your Personalized Model

After training, integrate your custom model with the Stable Diffusion Web UI for inference. This involves moving the last.safetensors file from your training output to the stable-diffusion-webui/models/Lora directory.

Relaunch the inference web UI. Instead of clicking Generate as usual, click on the small button below it: Show/hide extra networks. That will reveal a new panel with a LoRA tab, where you should see your personalized model.

Click on it, and a string like <lora:last:1> will be added to your prompt, which tells the system to use your specialized model on top of the base model.
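For example, a final prompt might look like the following, where zwxdog stands for whatever instance prompt you trained with (shown here purely as an illustration):

a photo of zwxdog running on a beach, golden hour lighting <lora:last:1>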

That’s it! You have trained a customized Dreambooth model on top of Stable Diffusion.

Optimizing Your AI Infrastructure with Run:ai

Run:ai automates resource management and orchestration and reduces cost for the infrastructure used to train LLMs and other computationally intensive models. With Run:ai, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.