Waifu Diffusion 1.4 Overview

An image generated at resolution 512x512 then upscaled to 1024x1024 with Waifu Diffusion 1.3 Epoch 7.

Goals

  • Improving image generation at different aspect ratios using conditional masking during training. This lets the entire image be seen during training rather than a center crop, which will yield better results when generating full-body images and portraits and will improve composition (see the letterbox-and-mask sketch after this list).
  • Expanded input context from 77 tokens to 231 tokens, or perhaps to an unlimited number of tokens. Of the 77 input tokens, only 75 are usable, which does not leave nearly enough room for complex prompts that require a lot of detail (see the prompt-chunking sketch after this list).
  • Training on higher image resolutions to improve performance on faces and hands. Many of the details at 512x512 resolution are not preserved by the VAE, which makes training less efficient, since more and more samples are needed to improve the model's ability to generate finer details. The training resolution will therefore have a maximum size of 768x768.
  • Better compositional awareness to help guide the model during inference toward producing images from natural-language captions instead of booru-style tags. Currently, composition is only implicit in the booru-style tags used in the training data, which makes the model harder to guide with a natural-language prompt such as hakurei reimu is eating a cheeseburger.
  • Unconditional Generation for better classifier-free guidance. The training process for Waifu Diffusion 1.3 did not include unconditional generation. Adding it will allow the model to use its own knowledge during generation, which will enhance its capabilities with smaller prompts (see the classifier-free guidance sketch after this list).
  • Diverse Outputs & Styles through a diversified training dataset. With the help of DeepDanbooru, other image-hosting sites such as Pixiv can be scraped and automatically tagged with reasonable accuracy. With a more diverse dataset, users of the resulting model will be able to generate in a multitude of different styles.
  • Finetuned VAE to assist in generating finer details such as fingers, eyes, hair, and lips. Much detail is lost with the vanilla SD 1.4 VAE: it was trained on real-life data and serves as a great baseline for general imagery, but it does not work well with anime-styled images.
  • Finetuned CLIP Text Embeddings to help generate a consistent baseline style that works well with diverse outputs and simpler prompts. This will allow users to experiment with the model more easily and grasp what works well and what doesn't.
  • Extra Models:
    • AI Image Detection
    • Img2Txt Diffusion Prior
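
As a rough illustration of the aspect-ratio goal above, here is a minimal sketch of one possible approach: letterbox each image into the square training canvas and mask the training loss over the padded region, so the whole image is seen rather than a center crop. The function names and the exact procedure are assumptions for illustration, not the confirmed Waifu Diffusion 1.4 implementation of "conditional masking".

```python
# Minimal sketch (assumed approach, not the confirmed WD 1.4 procedure):
# letterbox a non-square image into a 768x768 canvas and build a validity
# mask so the diffusion loss can ignore the padded region.
import torch
import torch.nn.functional as F

def letterbox_with_mask(image: torch.Tensor, size: int = 768):
    """Fit a (C, H, W) image into a size x size canvas; return canvas and mask."""
    c, h, w = image.shape
    scale = size / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    resized = F.interpolate(image[None], size=(new_h, new_w),
                            mode="bilinear", align_corners=False)[0]
    canvas = torch.zeros(c, size, size)
    mask = torch.zeros(1, size, size)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[:, top:top + new_h, left:left + new_w] = resized
    mask[:, top:top + new_h, left:left + new_w] = 1.0
    return canvas, mask

def masked_mse_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """MSE computed only over valid (unpadded) pixels; for latent diffusion the
    mask would first be downsampled to the latent resolution."""
    return ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
```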
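For the longer prompt context, a common approach (used by several Stable Diffusion front ends) is to encode the prompt in 75-token chunks and concatenate the per-chunk CLIP embeddings, which gives 231 usable tokens with three chunks. The sketch below illustrates that idea; it is not necessarily how Waifu Diffusion 1.4 will implement the expanded context.

```python
# Sketch of chunked prompt encoding: 3 chunks of 75 usable tokens each
# (plus BOS/EOS per chunk) gives the 231-token figure mentioned above.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode_long_prompt(prompt: str, max_chunks: int = 3) -> torch.Tensor:
    tokens = tokenizer(prompt, add_special_tokens=False).input_ids
    bos, eos = tokenizer.bos_token_id, tokenizer.eos_token_id
    chunks = [tokens[i:i + 75] for i in range(0, len(tokens), 75)][:max_chunks] or [[]]
    embeddings = []
    for chunk in chunks:
        # Re-add BOS, then pad with EOS back up to the fixed 77-token window.
        ids = torch.tensor([[bos] + chunk + [eos] * (76 - len(chunk))])
        with torch.no_grad():
            embeddings.append(text_encoder(ids).last_hidden_state)
    # Concatenate along the sequence axis: shape (1, 77 * n_chunks, 768).
    return torch.cat(embeddings, dim=1)
```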
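The unconditional-generation goal is typically realized by randomly dropping the caption during training and then blending conditional and unconditional predictions at inference (classifier-free guidance). A minimal sketch follows; the 10% drop rate and the diffusers-style UNet call are assumptions, not confirmed WD 1.4 details.

```python
# Sketch of classifier-free guidance support. The 10% caption drop rate is an
# assumed value, and the UNet call follows the diffusers UNet2DConditionModel
# interface; neither is a confirmed WD 1.4 detail.
import random
import torch

def maybe_drop_caption(caption: str, drop_prob: float = 0.1) -> str:
    """During training, sometimes replace the caption with an empty string so
    the model also learns an unconditional distribution."""
    return "" if random.random() < drop_prob else caption

def guided_noise_prediction(unet, latents, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """At inference: eps = eps_uncond + s * (eps_cond - eps_uncond)."""
    eps_cond = unet(latents, t, encoder_hidden_states=cond_emb).sample
    eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```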

Dataset Goals

  • Improved diversity by collecting images from a wide variety of sources.
  • Distributed data collection through a centralized web server that coordinates worker nodes collecting data.
  • High-quality curation by filtering on text-image similarity and on aesthetic scores derived from ratings gathered in the dataset-labelling channel on Discord.
  • Open distribution by making all releases of the dataset publicly available.

Literature Review

High-Resolution Image Synthesis with Latent Diffusion Models

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Data Collection

The data used for training will comprise two caption styles:

  • Composition Captions: To preserve an image's composition, composition captions will be used to explicitly guide the model toward a desired composition, rather than leaving the composition to be implied by booru captions.
  • Booru Captions: To allow better guidance on finer details such as clothing or facial expressions, booru captions will be included.

The first phase of training will use 10 to 20 million anime-styled images; for the second phase, aesthetic CLIP scoring will be used to eliminate images whose measured aesthetic score is too low. This method will steer the model toward generating higher-aesthetic images more reliably. The Booru dataset will be referred to as booru-textim, and it will be used for training image-generative models to generate anime-styled images from natural language. The dataset will comprise post IDs, file URLs, compositional captions, booru captions, and aesthetic CLIP scores. The estimated number of images whose aesthetic CLIP scores are greater than 6.0 is 1 million.
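
To make the dataset description concrete, here is a hypothetical booru-textim record and the aesthetic filter it enables. The field names are assumed from the list above; the real schema may differ.

```python
# Hypothetical booru-textim record and aesthetic filter. Field names and the
# example values are illustrative only; the actual schema may differ.
example_record = {
    "post_id": 123456,
    "file_url": "https://example.com/images/123456.png",
    "composition_caption": "hakurei reimu is eating a cheeseburger",
    "booru_caption": "1girl, hakurei reimu, touhou, food, cheeseburger",
    "aesthetic_score": 6.42,  # precomputed aesthetic CLIP score
}

def filter_by_aesthetic(records, threshold=6.0):
    """Keep only records whose aesthetic CLIP score exceeds the threshold
    (the subset used for the second, aesthetic training phase)."""
    return [r for r in records if r["aesthetic_score"] > threshold]

phase2_subset = filter_by_aesthetic([example_record])
```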

Training

Hardware

For finetuning Stable Diffusion v1 models, a node containing 8 A100 80GB GPUs and 192GB of RAM will be used. This provides a sufficiently large batch size during training, which increases throughput and significantly decreases training time.

Phase 1

Training will start from Epoch 10 of Waifu Diffusion v1.3 and will continue with an overall training resolution of 768x768 for 4 epochs or until there is no more improvement in quality. Phase 1 of training will allow the model to serve as a baseline for generating anime-styled images from both compositional captioning and booru captioning.
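
For reference, the Phase 1 settings stated above could be summarized in a configuration like the one below. Only the resume point, resolution, and epoch count come from this document; the remaining values are illustrative placeholders.

```python
# Hypothetical Phase 1 configuration. The checkpoint name, batch size, and
# learning rate are placeholders; only the resolution and epoch count are from the text.
phase1_config = {
    "resume_from": "wd-v1-3-epoch-10.ckpt",  # assumed filename for WD 1.3 epoch 10
    "resolution": 768,                       # maximum training resolution
    "epochs": 4,                             # or stop early once quality plateaus
    "batch_size": 32,                        # placeholder
    "learning_rate": 1e-5,                   # placeholder
}
```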

Phase 2

After Phase 1 has completed, further training will begin on items in the booru-textim dataset that have a CLIP aesthetic score higher than 6.0, to improve the model's ability to generate higher-quality imagery compared to the baseline model trained on the full dataset. This will continue for 2 epochs or until there is no further improvement in visual quality.

Release

Waifu Diffusion 1.4 and booru-textim will be released publicly on HuggingFace under the CreativeML Open RAIL-M license once training has completed. The models from both training phases will be released as separate models: waifu-diffusion-1-4-base.ckpt and waifu-diffusion-1-4-aesthetic.ckpt.

Disclaimer

This project is not at all affiliated with Danbooru or any other Boorus or image boards.

@rx-fly

rx-fly commented Oct 13, 2022

thx for your hard work~ greatly looking forward to the increase in tokens!

@toriato

toriato commented Oct 23, 2022

🤩

@PopCat19

yay :D

@JustCaptcha

Great! Thank you for your hard work!

@Kilvoctu

Looking forward to it 👍

@phineas-pta

very great news! just a small question: will the newest v1.5 ckpt be used?

@madm1nds

madm1nds commented Nov 6, 2022

We are looking forward to your result! The previous version works great! Thanks for your hard work! <3 <3

@aetherwu

Great progress. Is there a roadmap we can expect for the next iteration?

@kikouousya

This project is not at all affiliated with Danbooru or any other Boorus or image boards.
Why????? This doesn't make any sense! YOU KNOW USING THAT WOULD BE BETTER!!

@kikouousya

NOBODY cares whether you use that at all!

@satori7272

You are my hero
I wish you well and hope to assist through helping to train waifus

@NukeBird

Stay cool and keep rocking!

@Fresh-glitch

Looking forward to a release using the .safetensors file format. The decreased loading times thanks to zero-copy and lazy-loading techniques, plus the feeling of not having to worry about malicious arbitrary code execution, will be a godsend!

@nolanaatama

really looking forward to this!

@przemoc

przemoc commented Jan 4, 2023

If someone missed that:
harubaru/wd1-4-anime-release.md - Waifu Diffusion 1.4 Anime Release Notes
https://huggingface.co/hakurei/waifu-diffusion-v1-4 - pickled model finetuned from Stable Diffusion v2.1 Base
