An image generated at resolution 512x512 then upscaled to 1024x1024 with Waifu Diffusion 1.3 Epoch 7.
- Improving image generation at different aspect ratios by using conditional masking during training. This will allow the entire image to be seen during training instead of a center crop, which will yield better results when generating full-body images and portraits and improve overall composition (see the masked-loss sketch after this list).
- Expanding the input context from 77 tokens to 231 tokens, or perhaps to an unlimited number of tokens. Of the 77 input tokens, only 75 are usable, which does not leave nearly enough room for complex prompts that require a lot of detail (see the prompt-chunking sketch after this list).
- Training on higher image resolutions to improve performance on face and hand generation. Many of the details at 512x512 are not preserved by the VAE, which makes training less efficient, since more and more samples are needed to improve the model's ability to generate finer details. The maximum image resolution used during training will therefore be 768x768.
- Better compositional awareness to help guide the model during inference toward producing images from natural-language captions instead of booru-style tags. Currently, composition is only implicit in the booru-style tags used in the training data, which makes the model less guidable with a natural-language prompt such as "hakurei reimu is eating a cheeseburger".
- Unconditional Generation for better classifier-free guidance. The training process for Waifu Diffusion 1.3 did not include unconditional generation. Adding it will allow the model to use its own knowledge during generation, which will enhance its capabilities with smaller prompts (see the caption-dropout sketch after this list).
- Diverse Outputs & Styles with the help of a diversified training dataset. With DeepDanbooru, other image hosting sites such as Pixiv can be scraped and automatically tagged with reasonable accuracy. A more diverse dataset will let users of the resulting model generate images in a multitude of different styles.
- Finetuned VAE to assist in generating finer details such as fingers, eyes, hair, and lips. Much detail is lost with the vanilla SD 1.4 VAE: it was trained on real-life data and serves as a great baseline for general imagery, but it does not work well with anime-styled images.
- Finetuned CLIP Text Embeddings to help generate a consistent baseline style that will serve well with diverse outputs and simpler prompts. This will allow users to experiment with the model more easily and grasp what works well and what doesn't.
- Extra Models:
  - AI Image Detection
  - Img2Txt Diffusion Prior
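
The plan does not spell out the conditional-masking scheme. One plausible reading is that each image is padded to the training canvas rather than center-cropped, and the padded region is excluded from the diffusion loss. The sketch below assumes that approach; the `masked_diffusion_loss` helper, the `latent_mask` tensor, and all shapes are illustrative, not the actual Waifu Diffusion 1.4 implementation.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(noise_pred: torch.Tensor,
                          noise_target: torch.Tensor,
                          latent_mask: torch.Tensor) -> torch.Tensor:
    """MSE loss over the latent, ignoring regions introduced by padding.

    noise_pred, noise_target: (B, C, H, W) predicted and true noise latents.
    latent_mask: (B, 1, H, W), 1.0 where real image content is, 0.0 where
                 the canvas was padded to reach the training resolution.
    """
    per_element = F.mse_loss(noise_pred, noise_target, reduction="none")
    per_element = per_element * latent_mask          # zero out padded regions
    # Normalise by the number of valid elements (mask pixels x channels),
    # so the padding does not dilute the loss signal.
    valid = latent_mask.sum() * noise_pred.shape[1]
    return per_element.sum() / valid.clamp(min=1.0)
```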
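
The jump from 77 to 231 tokens (3 × 77) suggests encoding the prompt in consecutive 75-token chunks, each wrapped in its own BOS/EOS window, and concatenating the resulting CLIP embeddings. That is a common workaround for the 77-token window rather than a confirmed design decision; the model id and helper below are purely illustrative.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative checkpoint; WD 1.4 would load its own finetuned text encoder.
MODEL_ID = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID)
text_encoder = CLIPTextModel.from_pretrained(MODEL_ID)

def encode_long_prompt(prompt: str, max_chunks: int = 3) -> torch.Tensor:
    """Encode a prompt longer than 77 tokens by splitting it into
    75-token chunks and concatenating the per-chunk CLIP embeddings."""
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + 75] for i in range(0, len(ids), 75)][:max_chunks] or [[]]

    embeddings = []
    for chunk in chunks:
        # Rebuild a standard 77-token window: BOS + up to 75 tokens + EOS padding.
        window = [tokenizer.bos_token_id] + chunk + [tokenizer.eos_token_id]
        window += [tokenizer.eos_token_id] * (77 - len(window))
        with torch.no_grad():
            out = text_encoder(torch.tensor([window]))
        embeddings.append(out.last_hidden_state)     # (1, 77, 768)

    return torch.cat(embeddings, dim=1)              # (1, up to 231, 768)
```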
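
Unconditional generation is usually trained in by replacing the caption with an empty string for a small fraction of samples, so that classifier-free guidance can interpolate between the conditional and unconditional noise predictions at inference time. A minimal sketch follows; the 10% dropout rate and the 7.5 guidance scale are common defaults, not announced Waifu Diffusion 1.4 settings.

```python
import random
import torch

UNCOND_PROB = 0.10   # fraction of captions dropped during training (placeholder)

def maybe_drop_caption(caption: str) -> str:
    """Randomly replace the caption with an empty string so the model
    also learns the unconditional distribution p(image)."""
    return "" if random.random() < UNCOND_PROB else caption

def classifier_free_guidance(eps_uncond: torch.Tensor,
                             eps_cond: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    """At sampling time, push the prediction away from the unconditional
    output and toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```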
- Improving diversity by sourcing images from a variety of sites.
- Distributed data collection, with a centralized web server coordinating worker nodes that gather the data.
- High-quality curation by filtering on text-image similarity and on aesthetic scores derived from the ratings gathered in the dataset-labelling channel on Discord (see the scoring sketch after this list).
- Open distribution by making all releases of the dataset publicly available.
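
As a rough sketch of the curation step, the snippet below scores a candidate text-image pair with CLIP and a small aesthetic regression head on the CLIP image embedding (in the spirit of the LAION aesthetic predictor, here imagined as trained on the Discord-gathered ratings). The model id, head architecture, and similarity threshold are assumptions; the 6.0 aesthetic cutoff matches the threshold used for the second training phase described later.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-large-patch14"   # illustrative choice
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Hypothetical aesthetic head: a linear regression from the CLIP image
# embedding to a 1-10 score, trained separately on human ratings.
aesthetic_head = torch.nn.Linear(model.config.projection_dim, 1)

def score_pair(image: Image.Image, caption: str) -> tuple[float, float]:
    """Return (text-image cosine similarity, predicted aesthetic score)."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        similarity = float((img * txt).sum())
        aesthetic = float(aesthetic_head(out.image_embeds))
    return similarity, aesthetic

def keep(image: Image.Image, caption: str,
         min_similarity: float = 0.25, min_aesthetic: float = 6.0) -> bool:
    """Filter rule: keep a pair only if it clears both thresholds."""
    similarity, aesthetic = score_pair(image, caption)
    return similarity >= min_similarity and aesthetic >= min_aesthetic
```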
The data used for training will comprise two styles of captions:
- Composition Captions: To preserve the composition of an image, composition captions will be used to explicitly guide the model on an image's composition, rather than having booru captions imply it.
- Booru Captions: To allow better guidance on finer details such as clothing or facial expressions, booru captions will be included.
The first phase of training will use 10 to 20 million anime-styled images; for the second phase, aesthetic CLIP scoring will be used to eliminate images whose measured aesthetic score is too low. This will steer the model toward generating higher-aesthetic images more reliably. The Booru dataset will be referred to as `booru-textim` and will be used for training image-generative models to generate anime-styled images from natural language. The dataset will comprise post IDs, file URLs, compositional captions, booru captions, and aesthetic CLIP scores. The estimated number of images with an aesthetic CLIP score greater than 6.0 is 1 million.
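
Given the fields listed above, a single booru-textim record might look like the following. The field names, types, and example values are assumptions for illustration; the released schema may differ.

```python
from dataclasses import dataclass

@dataclass
class BooruTextimRecord:
    post_id: int                  # booru post ID
    file_url: str                 # URL of the image file
    compositional_caption: str    # natural-language description of the composition
    booru_caption: str            # comma-separated booru-style tags
    aesthetic_score: float        # CLIP-based aesthetic rating

# Hypothetical example record showing the two caption styles side by side.
example = BooruTextimRecord(
    post_id=123456,
    file_url="https://example.com/123456.png",
    compositional_caption="a girl in a red dress standing in a sunflower field, "
                          "viewed from the side",
    booru_caption="1girl, red_dress, sunflower, field, standing, from_side",
    aesthetic_score=6.4,
)

def phase2_subset(records: list[BooruTextimRecord],
                  threshold: float = 6.0) -> list[BooruTextimRecord]:
    """Select the higher-aesthetic subset used in the second training phase."""
    return [r for r in records if r.aesthetic_score > threshold]
```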
For finetuning Stable Diffusion v1 models, a node with 8 A100 80GB GPUs and 192GB of RAM will be used. This will allow a sufficiently high batch size during training, which will increase throughput and significantly decrease training time.
Training will start from Epoch 10 of Waifu Diffusion v1.3 and will continue at an overall training resolution of 768x768 for 4 epochs, or until there is no further improvement in quality. Phase 1 of training will give the model a baseline for generating anime-styled images from both compositional captions and booru captions.
After Phase 1 has completed, further training will begin on items in the `booru-textim` dataset with a CLIP aesthetic score higher than 6.0, to improve the model's ability to generate higher-quality imagery compared to the baseline model trained on the full dataset. This will continue for 2 epochs, or until there is no further improvement in visual quality.
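
For reference, the two phases can be summarized as the configuration sketch below. The GPU count, resolution, epoch budgets, starting point, and aesthetic cutoff come from this plan; the checkpoint filename, batch size, and learning rate are hypothetical placeholders, not announced settings.

```python
# Settings shared by both phases (placeholders marked as such).
COMMON = {
    "gpus": 8,                                   # 8x A100 80GB on one node
    "resolution": 768,                           # maximum training resolution
    "resume_from": "wd-v1-3-epoch-10.ckpt",      # hypothetical filename
    "batch_size_per_gpu": 16,                    # placeholder
    "learning_rate": 1e-5,                       # placeholder
}

PHASE_1 = {**COMMON, "dataset": "booru-textim", "max_epochs": 4}
PHASE_2 = {**COMMON, "dataset": "booru-textim, aesthetic score > 6.0", "max_epochs": 2}
```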
Waifu Diffusion 1.4 and `booru-textim` will be released publicly on HuggingFace under the CreativeML Open RAIL-M license once training has completed. The models from the two training phases will be released separately as `waifu-diffusion-1-4-base.ckpt` and `waifu-diffusion-1-4-aesthetic.ckpt`.
This project is not at all affiliated with Danbooru or any other Boorus or image boards.