Latent Space: Instead of working directly with high-dimensional pixel data, Stable Diffusion operates in a lower-dimensional latent space. This latent space is learned by a pretrained autoencoder (a Variational Autoencoder in Stable Diffusion; latent diffusion models have also used VQ-GAN-style autoencoders) that compresses images into a compact representation while preserving perceptually important detail.
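To get a feel for the savings, here is the arithmetic for Stable Diffusion v1's typical shapes: a 512x512 RGB image is encoded into a 64x64 latent with 4 channels (an 8x spatial downsampling; exact shapes vary by model version).

```python
# Compression from pixel space to latent space (SD v1-style shapes).
pixel_values = 512 * 512 * 3    # values per RGB image
latent_values = 64 * 64 * 4     # values per latent
ratio = pixel_values / latent_values
print(f"latent space is {ratio:.0f}x smaller")  # 48x smaller
```

Running the diffusion process on 48x fewer values is what makes training and sampling tractable on a single GPU.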
Diffusion Process: The forward diffusion process gradually adds Gaussian noise to the latent representation over a fixed number of steps, following a predefined noise schedule. Because each step depends only on the previous one, the process forms a Markov chain; conveniently, the noisy sample at any step can also be computed in closed form directly from the original data.
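The closed form is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps standard Gaussian noise and alpha_bar_t the cumulative product of (1 - beta_t). A minimal sketch on a toy 1-D "latent", assuming a linear beta schedule (the schedule values and names here are illustrative, not the exact Stable Diffusion configuration):

```python
import math
import random

T = 1000  # number of diffusion steps
# Linear beta schedule from 1e-4 to 0.02 (a common toy choice).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def add_noise(x0, t):
    """Sample x_t via the closed form, skipping the step-by-step chain."""
    a = alpha_bars[t]
    eps = [random.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
          for x, e in zip(x0, eps)]
    return xt, eps

x0 = [0.5, -0.3, 0.8]          # toy clean "latent"
xt, eps = add_noise(x0, t=T - 1)
# At the last step alpha_bar is near zero, so x_t is almost pure noise.
```

Early steps barely perturb the data (alpha_bar close to 1), while by the final step the signal is essentially destroyed.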
Reverse Process (Denoising): The model learns to reverse this diffusion process. Given a noisy latent and its timestep, a neural network is trained to predict the noise that was added; from that prediction, a slightly less noisy version of the sample is computed. The network is typically a U-Net, whose encoder-decoder structure with skip connections is effective for image-to-image tasks.
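A single DDPM-style denoising step can be sketched as follows. In a real model `eps_pred` would come from the trained U-Net; here the schedule and inputs are toy values, purely illustrative:

```python
import math
import random

T = 10  # toy schedule
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def ddpm_step(xt, eps_pred, t):
    """Compute x_{t-1} from x_t and the predicted noise:
    mean = (x_t - beta_t/sqrt(1-alpha_bar_t) * eps_pred) / sqrt(1-beta_t),
    then add fresh noise scaled by sigma_t (none at t=0)."""
    beta = betas[t]
    a_bar = alpha_bars[t]
    mean = [(x - beta / math.sqrt(1.0 - a_bar) * e) / math.sqrt(1.0 - beta)
            for x, e in zip(xt, eps_pred)]
    if t == 0:
        return mean  # final step is deterministic
    sigma = math.sqrt(beta)
    return [m + sigma * random.gauss(0.0, 1.0) for m in mean]

xt = [0.7, -0.2]
eps_pred = [0.1, 0.0]  # stand-in for the U-Net's output
x_prev = ddpm_step(xt, eps_pred, t=5)
```

Subtracting the (scaled) predicted noise is exactly how "predict the noise" turns into "produce a less noisy sample".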
Generation: To generate a new image, you start with random noise in the latent space and apply the learned denoising process iteratively. Over many steps, the noise is removed, and a coherent image emerges.
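Putting the pieces together, the sampling loop looks roughly like this. `predict_noise` stands in for the trained U-Net (here a trivial placeholder returning zeros, so the loop is runnable but does not produce a real image):

```python
import math
import random

T = 50  # sampling steps (toy)
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def predict_noise(x, t):
    return [0.0 for _ in x]  # placeholder for the U-Net

x = [random.gauss(0.0, 1.0) for _ in range(4)]  # start from pure noise
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    beta, a_bar = betas[t], alpha_bars[t]
    mean = [(xi - beta / math.sqrt(1.0 - a_bar) * ei) / math.sqrt(1.0 - beta)
            for xi, ei in zip(x, eps)]
    x = [m + (math.sqrt(beta) * random.gauss(0.0, 1.0) if t > 0 else 0.0)
         for m in mean]
# In a real pipeline, x would now be decoded back to pixel
# space by the autoencoder's decoder.
```

Note that the final latent must still pass through the decoder; the diffusion loop itself never touches pixel space.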
Conditioning: Stable Diffusion can be conditioned on various inputs, such as text prompts, to guide the image generation process. This is often done using a technique called cross-attention, where the text embeddings are integrated into the denoising network.
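The cross-attention mechanism itself is simple: queries come from the image latents, while keys and values come from the text embeddings, so each spatial position attends over the prompt's tokens. A minimal sketch with toy dimensions and no learned projection matrices (which a real model would have):

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V, with Q from image latents and
    K, V from text token embeddings."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

latent_q = [[1.0, 0.0], [0.0, 1.0]]  # queries from two latent positions
text_k = [[1.0, 0.0], [0.0, 1.0]]    # keys from two prompt tokens
text_v = [[10.0, 0.0], [0.0, 10.0]]  # values from the same tokens
out = cross_attention(latent_q, text_k, text_v)
```

Each latent position ends up weighted toward the token its query aligns with, which is how the prompt steers what appears where.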
Training: The model is trained on a large dataset of images paired with their conditions (such as text descriptions). At each training step, noise is added to a latent at a randomly sampled timestep, and the denoising network is optimized to predict that noise, typically by minimizing the mean squared error between the true and predicted noise.
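The steps above can be sketched as a single training iteration. `model` here is a trivial stand-in that predicts zeros; it illustrates only the loss computation, not a real denoiser:

```python
import math
import random

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def training_step(x0, alpha_bar, model):
    """Noise a clean latent, ask the model for the noise, score it.
    The gradient of this loss is what trains the U-Net."""
    eps = [random.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    eps_pred = model(xt)
    return mse(eps, eps_pred)

loss = training_step([0.2, -0.4], alpha_bar=0.5,
                     model=lambda x: [0.0] * len(x))
```

Because the target is the sampled noise itself, no paired "clean/noisy" dataset is needed; every image generates its own training signal.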