Perfusion: Crafting Personalized Images from Text in Minutes
NVIDIA has released Perfusion, a new text-to-image personalization model.
With a model size of just 100KB and about four minutes of training, it can creatively depict personalized objects.
Given an input text description, Perfusion generates images with specific attributes, such as object color, shape, and texture, while preserving the fundamental identity of the personalized object.
In terms of efficiency, it outperforms certain versions of far larger models such as SDXL and Midjourney!

Perfusion employs a novel mechanism called "Key-Locking," which makes it possible to combine individually learned concepts into a single generated image.
Key-Locking lets the model integrate separately learned concepts (specific objects, colors, shapes, and so on) into one generated image, so a text description can guide the model to produce an image containing multiple specific elements.
During image generation, Perfusion can adjust the balance between adhering closely to the input text description and preserving the visual quality of the generated image.
For instance, if you want the generated image to follow the input text description strictly, the model may sacrifice some visual quality; conversely, if you prioritize visual quality, the model may loosen its adherence to the text description.
The Perfusion model can also identify a range of optimal solutions between "textual alignment" and "visual quality," each representing a different trade-off strategy.
Plotted on a chart, these trade-off points form a curve known as the "Pareto frontier": along it, no point can improve one objective without worsening the other.
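In practice, tracing this frontier amounts to sweeping a single inference-time knob (the "bias" mentioned later in this article) and scoring each output. Everything in the sketch below is a hypothetical stand-in: `generate`, `text_alignment_score`, and `visual_fidelity_score` are placeholders for a real Perfusion pipeline and metrics such as CLIP text/image similarity:

```python
# Hypothetical stand-ins for a real Perfusion pipeline and metrics
# (e.g. CLIP text/image similarity); replace with actual implementations.
def generate(prompt, concept_bias):           # placeholder generator
    ...

def text_alignment_score(image, prompt):      # placeholder prompt-adherence metric
    ...

def visual_fidelity_score(image):             # placeholder concept-likeness metric
    ...

prompt = "a photo of my_concept on a beach"

frontier = []
for bias in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    image = generate(prompt, concept_bias=bias)   # higher bias -> weaker concept
    frontier.append((bias,
                     text_alignment_score(image, prompt),
                     visual_fidelity_score(image)))

# Each tuple is one trade-off point; plotting text alignment against visual
# fidelity across the sweep traces the Pareto frontier described above.
```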

The working principle of Perfusion is as follows:
→Architecture Overview: A prompt is transformed into a sequence of encodings, and each encoding is fed into a set of cross-attention modules in the Diffusion U-Net denoiser (purple blocks). The enlarged purple module shows how the Key and Value paths are adjusted based on the text encoding: the Key drives the attention map, which in turn modulates the Value path (see the sketch after this list).
→Comparison with Current Methods: Perfusion achieves more vivid results, improved prompt matching, and lower sensitivity to the original image background features compared to current methods.
→Composition: Our approach enables us to combine multiple learned concepts into a generated image using textual prompts. These concepts are learned separately and are only merged at runtime to produce the final image.
→Effective Visual-Text Alignment Control: Our method allows for effective control over the trade-off between visual fidelity and text alignment during inference. A higher bias value reduces the effect of concepts, while a lower value makes them more influential.
→One-Shot Personalization: When trained with a single image, our method can generate images with high visual fidelity and text alignment.
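The ~100KB model size follows from how a concept is stored: not as new network weights, but as a gated rank-1 edit of the cross-attention projection matrices (the underlying paper is titled "Key-Locked Rank One Editing for Text-to-Image Personalization"). The sketch below shows the general shape of such a gated rank-1 edit; variable names are illustrative, and a plain sigmoid gate stands in for the paper's exact gating function:

```python
import torch

def gated_rank1_projection(W, e, i_star, o_star, temperature=0.1):
    """Apply a gated rank-1 edit to a frozen projection matrix (sketch only).

    W       : (D_out, D_in) frozen projection (e.g. a Value projection)
    e       : (B, T, D_in)  token encodings entering the projection
    i_star  : (D_in,)       learned input direction of the concept
    o_star  : (D_out,)      learned output for the concept
    """
    base = e @ W.T                                  # frozen behavior
    i_hat = i_star / i_star.norm()
    # Gate: only encodings aligned with the concept direction get edited.
    sim = e @ i_hat                                 # (B, T)
    gate = torch.sigmoid(sim / temperature).unsqueeze(-1)
    # Rank-1 correction steers the concept direction toward o_star.
    correction = sim.unsqueeze(-1) * (o_star - W @ i_hat)  # (B, T, D_out)
    return base + gate * correction
```

Since each edited layer only needs to store the two vectors `i_star` and `o_star`, an entire concept stays in the kilobyte range.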

Compared with prior methods, Perfusion generates more vivid results, matches input text prompts more closely, and is less sensitive to the background features of the original image.
→More Vivid Results: The images generated by the Perfusion model are more vivid and eye-catching.
→Better Prompt Matching: The generated images more accurately reflect the input text prompts. For instance, if the input text prompt is "a green cat," the Perfusion model would generate an image of a green cat.
→Lower Sensitivity to Original Image Background Features: Generations are less influenced by the background of the training image. For instance, even if the training image has a blue background, Perfusion can place the same subject against a red background.
→Comparison with Other Methods: Comparing Perfusion's results against several other methods (Custom-Diffusion, DreamBooth, and Textual-Inversion) demonstrates these advantages.

If you train a Perfusion concept (e.g., the ability to generate a "green cat") on a regular diffusion model, that concept can be applied directly to a fine-tuned model without any additional training.
This is a powerful form of generalization: the same Perfusion concept can be reused across a variety of tasks, as sketched below.
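Because a concept amounts to a small set of per-layer edit vectors, porting it to a fine-tuned checkpoint can be as simple as re-applying those vectors to the new model's cross-attention projections. The sketch below is purely illustrative: `load_concept` and `apply_rank1_edit` are hypothetical helpers, and the model id is a placeholder:

```python
from diffusers import StableDiffusionPipeline

# Hypothetical helpers: a real implementation would deserialize the
# concept's per-layer (input, output) edit vectors and patch the matching
# cross-attention projections in place.
def load_concept(path):                        # placeholder loader
    return {}                                  # {layer_name: (i_star, o_star)}

def apply_rank1_edit(module, i_star, o_star):  # placeholder patcher
    ...

concept = load_concept("green_cat.perfusion")  # ~100KB of edit vectors

pipe = StableDiffusionPipeline.from_pretrained("some/finetuned-checkpoint")
for name, module in pipe.unet.named_modules():
    if name in concept:                           # only the edited layers
        apply_rank1_edit(module, *concept[name])  # no concept retraining

image = pipe("a green cat wearing a scarf").images[0]
```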
