Stable Diffusion XL 0.9 technical report released: it outperforms Midjourney on multiple tasks
Since the public beta of Stable Diffusion XL in April, it has been warmly welcomed by users and has become known as "the open source version of Midjourney". SDXL handles details such as hands and written text well, and, most importantly, it does so without requiring extremely long prompts. Even better, SDXL 0.9 can be tried for free, whereas Midjourney requires a paid subscription.
Recently, a technical report on Stable Diffusion XL 0.9 was published, detailing the latest technical progress behind this "enhanced version" of Stable Diffusion.

So, how good is SDXL compared to Midjourney? In the report, the researchers randomly selected five prompts from each category and generated four 1024×1024 images per prompt with both Midjourney (v5.1, seed set to 2) and SDXL. The images were then submitted to the AWS Ground Truth taskforce, which voted on how well each image followed its prompt. Overall, SDXL is slightly better than Midjourney at following prompts. Feedback from 17,153 user comparisons covers every "category" and "challenge" in the PartiPrompts (P2) benchmark, and SDXL is preferred over Midjourney v5.1 in 54.9% of cases. Preliminary tests suggest that the recently released Midjourney v5.2 understands prompts less well, but the cumbersome process of generating images for a large number of prompts has so far limited more extensive testing.

Last year, Stable Diffusion, widely regarded as the strongest image-generation model, was open sourced, igniting global interest in generative AI. Unlike OpenAI's DALL-E, Stable Diffusion lets people run text-to-image generation on consumer graphics cards. Stable Diffusion is a latent text-to-image diffusion model (DM) and is now widely used: recent work on reconstructing images from functional magnetic resonance imaging (fMRI) signals and on music generation builds on DMs. In April this year, Stability AI, the start-up behind this explosive tool, launched an improved version of Stable Diffusion: SDXL.

According to user studies, SDXL consistently outperforms all previous versions of Stable Diffusion, such as SD 1.5 and SD 2.1. In the report, the researchers describe the design choices behind this performance boost, including:
1) a UNet backbone roughly 3× larger than in previous Stable Diffusion models;
2) two simple yet effective additional conditioning techniques that do not require any form of extra supervision;
3) a separate diffusion-based refinement model that denoises the latents produced by SDXL to improve the visual quality of the samples.
These improvements to the Stable Diffusion architecture are modular and can be used individually or together to extend any model. Although the strategies below are presented as extensions of latent diffusion models, most of them also apply to their pixel-space counterparts, the report says.

DMs have proven to be powerful generative models for image synthesis, and the convolutional UNet has become the dominant architecture for diffusion-based image synthesis. As DMs have developed, the underlying architecture has evolved as well: from adding self-attention and improved upsampling layers, to cross-attention for text-to-image synthesis, to purely Transformer-based architectures. Continuing to improve Stable Diffusion, the researchers follow this trend and shift most of the Transformer computation to the lower-level features in the UNet. In particular, they use a heterogeneous distribution of Transformer blocks in the UNet that differs from the original SD architecture: for efficiency, the Transformer block is omitted at the highest feature level, 2 and 10 blocks are used at the lower levels, and the lowest level (8× downsampling) is removed from the UNet entirely, as shown in the figure below.
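To make the "heterogeneous distribution of Transformer blocks" concrete, here is a small, self-contained PyTorch sketch; the block counts [0, 2, 10] follow the description above, while the per-level channel widths and attention settings are illustrative assumptions rather than figures from the report.

```python
import torch
import torch.nn as nn

# Toy sketch (not SDXL's actual code): each UNet resolution level gets a
# different number of Transformer blocks. The counts [0, 2, 10] follow the
# report's description; channel widths are assumed for illustration only.

TRANSFORMER_BLOCKS_PER_LEVEL = [0, 2, 10]  # highest level has none; the 8x level is removed
CHANNELS_PER_LEVEL = [320, 640, 1280]      # assumed widths

class ToyTransformerBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.attn(x, x, x)
        x = x + h
        return x + self.mlp(x)

# Most of the Transformer computation ends up at the lower (more downsampled) levels.
levels = nn.ModuleList(
    nn.ModuleList(ToyTransformerBlock(c) for _ in range(n))
    for n, c in zip(TRANSFORMER_BLOCKS_PER_LEVEL, CHANNELS_PER_LEVEL)
)
print([len(level) for level in levels])  # -> [0, 2, 10]
```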

For text conditioning, the researchers chose a more powerful pretrained text encoder. Specifically, they use OpenCLIP ViT-bigG together with CLIP ViT-L, concatenating the penultimate text-encoder outputs along the channel axis. In addition to conditioning the model on the text input via cross-attention layers, they also condition it on the pooled text embedding from the OpenCLIP model. Together, these changes result in a UNet with 2.6B parameters and text encoders with a combined 817M parameters.
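The channel-wise concatenation of the two text encoders can be pictured with a short PyTorch sketch; the embedding widths (768 for CLIP ViT-L, 1280 for OpenCLIP ViT-bigG) and the sequence length are assumptions for illustration, not figures quoted from the report.

```python
import torch

# Minimal sketch: penultimate hidden states from the two text encoders are
# concatenated along the channel axis to form the cross-attention context;
# the pooled OpenCLIP embedding is kept as an extra conditioning vector.
batch, seq_len = 2, 77
clip_vit_l_hidden = torch.randn(batch, seq_len, 768)      # penultimate layer, CLIP ViT-L (assumed width)
openclip_bigg_hidden = torch.randn(batch, seq_len, 1280)  # penultimate layer, OpenCLIP ViT-bigG (assumed width)
openclip_pooled = torch.randn(batch, 1280)                # pooled text embedding

context = torch.cat([clip_vit_l_hidden, openclip_bigg_hidden], dim=-1)
print(context.shape)          # torch.Size([2, 77, 2048]) -> fed to the UNet's cross-attention
print(openclip_pooled.shape)  # torch.Size([2, 1280])     -> additional conditioning signal
```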
The biggest disadvantage of training a latent diffusion model (LDM) is that, because of its two-stage architecture, training requires a minimum image size. There are essentially two ways around this: either discard all training images below a certain minimum resolution (SD 1.4/1.5 discards all images below 512 pixels), or scale all images to the same resolution (SD 2.1 scales all images to 1024 pixels). Both methods have drawbacks: the former throws away a large amount of training data, while the latter distorts the images.

The researchers therefore adopted a method that uses images of different sizes during training. The key idea is that, for each batch, a target image size is chosen at random and all images in the batch are rescaled to that size. The advantage of this approach is that all available training data can be used while largely avoiding image distortion; the drawback is that it consumes a lot of video memory during training.
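A minimal sketch of this per-batch resizing idea is shown below; it illustrates the scheme described above rather than SDXL's actual training code, and the candidate sizes are assumed for the example.

```python
import random
from typing import List, Tuple
import torch
import torch.nn.functional as F

# Assumed candidate resolutions; one is picked per batch.
CANDIDATE_SIZES: List[Tuple[int, int]] = [(1024, 1024), (896, 1152), (1152, 896), (832, 1216)]

def collate_batch(images: List[torch.Tensor]) -> torch.Tensor:
    """Pick one target size for the whole batch and rescale every image to it."""
    h, w = random.choice(CANDIDATE_SIZES)
    resized = [
        F.interpolate(img.unsqueeze(0), size=(h, w), mode="bilinear", align_corners=False)
        for img in images
    ]
    return torch.cat(resized, dim=0)  # (batch, 3, h, w)

# Usage with dummy images of varying original sizes.
dummy = [torch.randn(3, 640, 480), torch.randn(3, 1200, 800), torch.randn(3, 512, 512)]
batch = collate_batch(dummy)
print(batch.shape)
```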

During training, the researchers found that using larger image sizes improved the model's performance, but video-memory limits prevent larger images from being used directly. To address this, they propose a separate diffusion-based refinement model (DRM). Its task is to denoise the latents produced by SDXL in order to improve the visual quality of the samples. The DRM is trained in a similar way to SDXL but with a larger image size, and using it significantly improves the model's output quality without increasing video-memory use.
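As an illustration of the base-plus-refiner workflow, the sketch below uses Hugging Face's diffusers library; the pipeline classes, model identifiers (the gated SDXL 0.9 research weights), and keyword arguments reflect that library's SDXL support rather than anything specified in the report, so treat them as assumptions.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Base model generates latents; the refiner then denoises them for extra detail.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16
).to("cuda")

prompt = "a close-up photo of a hand holding a fountain pen, studio lighting"

# Stage 1: keep the output in latent space instead of decoding to pixels.
latents = base(prompt=prompt, output_type="latent").images

# Stage 2: the refiner runs an image-to-image denoising pass on those latents.
image = refiner(prompt=prompt, image=latents).images[0]
image.save("sdxl_refined.png")
```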

In testing, the researchers found that SDXL outperformed Midjourney on many tasks. For example, SDXL generates higher-quality images, and it is also better at understanding prompts, producing images that conform to them.
Overall, SDXL is a powerful generative model that performs well across many tasks, thanks to its strong architecture and effective training strategy. Although there is still room for improvement in some areas, SDXL has proven its worth and provides a solid foundation for future research.