Question regarding training (noise offset)
Hello, I'm also an individual training a checkpoint, and I have a question regarding the noise offset. Why is it such a specific number (0.0357)? What methodology was used to find this value? I understand it gives the images higher contrast and was therefore used in the finetuning stages; would it have been better to turn it on from the beginning, or was it intentionally left off during the feature-alignment stage?
There's no particular meaning to it; it's like how we always set 42 or 1337 as the random seed. Actually, it was the default noise offset value from an early SDXL commit in sd-scripts, and honestly, I'm enjoying the results so far. The commonly recommended starting value for noise offset is 0.1, and SDXL was trained with a 0.05 noise offset, but I think that's just too high. As for why we didn't pretrain with noise offset: we weren't prepared for what was coming, since at that time we were training the model with personal funds.
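For context, noise offset just adds a small per-sample, per-channel shift to the training noise; in sd-scripts-style trainers it looks roughly like this (a minimal sketch, the function name is mine):

```python
import torch

def apply_noise_offset(latents: torch.Tensor, noise_offset: float = 0.0357) -> torch.Tensor:
    """Sample training noise with a small per-(batch, channel) offset.

    The offset shifts the mean of the noise so the model can learn to push
    the overall brightness of an image up or down, which shows up as
    deeper darks and higher contrast at inference time.
    """
    noise = torch.randn_like(latents)
    # One offset value per (batch, channel), broadcast over height and width
    noise += noise_offset * torch.randn(
        (latents.shape[0], latents.shape[1], 1, 1),
        device=latents.device, dtype=latents.dtype,
    )
    return noise
```

The exact value only scales how strong that mean shift can be, which is why small differences like 0.0357 vs 0.05 are more a matter of taste than methodology.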
Actually, there were two versions of the finetuned 3.0, trained with the same config: the first used noise offset and the second didn't. We held a vote and chose to release this version. We believe the version trained with noise offset produced better hands than the one without, while the model finetuned without noise offset had stronger contrast and better poses but worse hands and anatomy. So yeah, the "brightness hack" was not our intention in the beginning.
We might consider focusing on that if we train the model with v-prediction and zero terminal SNR.
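For reference, zero terminal SNR rescales the beta schedule so the final timestep is pure noise; the reference implementation from "Common Diffusion Noise Schedules and Sample Steps Are Flawed" (Lin et al., 2023) looks roughly like this. It has to be paired with v-prediction, since epsilon-prediction is undefined at zero SNR:

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the last timestep has exactly zero SNR."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_bar_sqrt = alphas_cumprod.sqrt()

    # Remember the original first/last values
    alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone()
    alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone()

    # Shift so the last timestep reaches exactly zero ...
    alphas_bar_sqrt -= alphas_bar_sqrt_T
    # ... and rescale so the first timestep keeps its original value
    alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T)

    # Convert back to betas
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[0:1], alphas])
    return 1.0 - alphas
```

With that schedule the model can produce truly dark or bright images natively, which is the principled version of what noise offset approximates.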
Thank you for the reply. I have a separate question regarding the "bad quality" images that scored below 25 on the aesthetic score. Do you know the breakdown of how many images fell under each category? And was including them beneficial for training?
I plan on having only "good" images, and I was wondering if I should include low-quality images (properly tagged) in the dataset so I can put them in the negative prompt (e.g. low quality, worst quality). But since other tags also exist in the captions of the bad images, I fear bad concepts leaking through those other tags absorbing the features (because the "low quality" tag doesn't absorb 100% of the bad features), and some tags may deteriorate because of it.
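To be concrete, what I have in mind is binning the aesthetic score into quality tags prepended to the caption, something like this (the thresholds and tag names here are just placeholders, not the ones used for the checkpoint):

```python
def quality_tag(aesthetic_score: float) -> str:
    """Map an aesthetic score to a quality tag to prepend to the caption.
    Thresholds are illustrative only."""
    if aesthetic_score < 25:
        return "worst quality"
    elif aesthetic_score < 50:
        return "low quality"
    elif aesthetic_score < 75:
        return "normal quality"
    else:
        return "best quality"

tags = ["1girl", "outdoors", "smile"]
caption = ", ".join([quality_tag(23.4), *tags])
# -> "worst quality, 1girl, outdoors, smile"
```

My worry is that the gradient from a bad image still flows into "1girl" and "outdoors" here, not only into "worst quality".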
When I was making LoRAs, I found that properly tagged bad examples (~5% of the dataset in that case) degraded the overall quality of the generations, even when I put the tagged quality/description tags in the negative prompt. In retrospect, do checkpoints benefit from having bad examples in the dataset so they know what to avoid, or can the downside of bad features bleeding in outweigh the benefits, making it better to stick to only good images?
Hi, one other thing I noticed is that CLIP skip is not set, so it defaults to 1. I understand that CLIP skip 2 is a convention inherited from NAI → SD1.5 (anime), but is there a reason you went back to CLIP skip 1? I see Pony using CLIP skip 2 and Kohaku using CLIP skip 1, so it seems Pony followed the 1.5 anime model norm while Kohaku went their own direction.
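Just to be explicit about what I mean: as I understand it, CLIP skip 2 takes the penultimate hidden state of the text encoder instead of the final one, roughly like this sketch using transformers (the helper function is mine):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode_prompt(prompt: str, clip_skip: int = 1) -> torch.Tensor:
    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(tokens.input_ids, output_hidden_states=True)
    if clip_skip <= 1:
        return out.last_hidden_state            # CLIP skip 1: final layer output
    hidden = out.hidden_states[-clip_skip]      # CLIP skip 2: penultimate layer
    # Most implementations re-apply the final layer norm to the earlier layer
    return text_encoder.text_model.final_layer_norm(hidden)
```

So my question is really about which layer's embeddings the model was trained against.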