The creation of the dataset used for training the model (and tricks used, if any)
To my knowledge, the basic way to train a ControlNet for Stable Diffusion is to obtain some dataset, e.g. images and their Canny edges, give the Canny edge to the ControlNet as the condition (possibly along with text prompts), and try to predict the original image.
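For concreteness, the preprocessing I'm referring to looks roughly like this (a generic OpenCV sketch with typical thresholds, nothing specific to this model):

```python
# Generic sketch of the usual Canny conditioning step; the 100/200 thresholds
# are common defaults, not values tied to this particular model.
import cv2

image = cv2.imread("train_image.png")            # the image the model should predict
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)                # conditioning image for the ControlNet
cv2.imwrite("train_condition.png", edges)
```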
Assuming this model followed the same principle, the question is what dataset was used to make this model and how it was obtained. A naive guess would be that the dataset consists of creatively, manually crafted images of beautiful QR codes, and the condition is the plain black-and-white original QR code.
Such a dataset seems pretty hard to create manually (and it seems unlikely that tens of thousands of such images were created, which is known to be the minimum amount for training a ControlNet). So what was the dataset, and how was it created/obtained?
Assuming this model didn't precisely follow this principle, what were the changes and what tricks were used to make it work?
These questions hold great value for the community of researchers working on and looking into these topics, so I (and the entire community) would greatly appreciate your response.
Thanks in advance. :)
Tagging people I think are relevant:
@achiru
@vyril
I plan to release the datasets and methodology once I finish my work on the SDXL version; right now I prefer to focus the little time I have on training (I get asked almost every day when the SDXL model will be released).
Right now everything is a bit of a mess and I don't want to release my scripts / datasets as is, so I'll take care of that when I don't have to worry about the SDXL version anymore.
I'll leave this issue open until then.
@achiru
Understood, thanks for the response. Just out of curiosity, if you could at least briefly describe the idea behind the creation (simply what the dataset looks like and a headline of how you got/created it), that would be awesome.
Thanks again
The short version is: try out different luminance patterns (as it was for QR codes at first) on top of existing images. The patterns should be generic enough to make the model generalize to all types of input, but not too general, otherwise the training does not converge.
A lot of patterns could work; I'm still not sure which ones are the absolute best, especially for illusions, which are more generic than QR codes.
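To make that slightly more concrete without going into the actual pipeline, a toy sketch of the idea could look like this (the block pattern, blend strength, and the naive blending are all simplified placeholders, not what was actually used):

```python
# Toy sketch only: generate a coarse black/gray/white block pattern and blend it
# into an existing image. Block size and blend strength are made-up placeholders.
import cv2
import numpy as np

img = cv2.imread("photo.png").astype(np.float32) / 255.0
h, w = img.shape[:2]

# Random 3-level pattern at low resolution, upscaled with hard edges
block = 16
pattern_small = np.random.choice([0.0, 0.5, 1.0], size=(h // block, w // block)).astype(np.float32)
pattern = cv2.resize(pattern_small, (w, h), interpolation=cv2.INTER_NEAREST)

# Crudely push the image toward the pattern (a stand-in for a proper luminance composite)
strength = 0.6
blended = img * (1.0 - strength) + pattern[..., None] * strength

cv2.imwrite("condition.png", (pattern * 255).astype(np.uint8))   # what the ControlNet sees
cv2.imwrite("target.png", (blended * 255).astype(np.uint8))      # what it learns to predict
```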
thanks for sharing some info.
Here is my attempt: https://civitai.com/models/137638#heading-2386 but it's not good so far (pixelizing 8x8). I thought about trying different luminance interpretations (grayscale, HSV, LAB), randomly varying the pattern (warping), creating metaballs, or adding random masks to make it generalize more. My idea is to use normal images and automatically convert them to a pattern instead of collecting thousands of stylized QR codes.
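For reference, the pixelizing step I mention is roughly the following (an OpenCV sketch; which luminance channel and block size work best is exactly what I'm still experimenting with):

```python
# Sketch of the "pixelize 8x8" conditioning; the three luminance interpretations
# are just the obvious OpenCV conversions, and the block size is a guess.
import cv2

img = cv2.imread("photo.png")
h, w = img.shape[:2]

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)          # plain weighted-RGB grayscale
hsv_v = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)[..., 2]  # HSV value channel
lab_l = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)[..., 0]  # LAB lightness channel

def pixelize(channel, block=8):
    """Downscale by `block`, then upscale with nearest neighbor -> blocky condition."""
    small = cv2.resize(channel, (w // block, h // block), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

cv2.imwrite("condition_gray.png", pixelize(gray))
cv2.imwrite("condition_hsv.png", pixelize(hsv_v))
cv2.imwrite("condition_lab.png", pixelize(lab_l))
```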
I'll try to make a blog post or something like that next week; I didn't realize people were interested in that. I need to dig up my 1.5 training data, as I don't remember what I used specifically for the v1 and v2 models.
(For me it was more like v9 and v37 out of ~50 versions)
What I do remember is that the difference between v1 and v2 is: more data, and a lot of fiddling around; some patterns work differently with different hyperparameters, etc.
Once I get around to doing the post, we can discuss all of that in more detail if need be.
you are aware of all the craze about monster QR on reddit, right?
what batch size and gradient accumulation did you use?
btw I wrote an article on ControlNet training on civitai (with the very intention of creating something similar to Monster QR, but for more organic images). I also created a new edge-detection model and documented my whole experience at https://github.com/lllyasviel/ControlNet/discussions/318#discussioncomment-7176692 . If someone wants to collaborate on creating ControlNets, please get in contact!
you are aware of all the craze about monster QR on reddit, right?
No, I'm only aware of what people share with me. If you have cool projects / results to share, I'm always happy to see them!
what batch size and gradient accumulation did you use?
I don't remember, but the higher the better; basically get the most data you can. For 1.5, about 7500 steps were always enough (in most cases even 6500), so put everything else into batch size / gradient accumulation.
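As a back-of-the-envelope illustration of that trade-off (the numbers here are placeholders, not my actual settings):

```python
# Back-of-the-envelope numbers only; these are illustrative, not the settings
# actually used for the released models.
train_batch_size = 8             # per-GPU batch size
gradient_accumulation_steps = 4  # one optimizer step every 4 batches
num_gpus = 1

effective_batch = train_batch_size * gradient_accumulation_steps * num_gpus  # 32
max_train_steps = 7500           # "about 7500 steps were always enough" for 1.5

pairs_seen = effective_batch * max_train_steps  # 240,000 condition/image pairs processed
print(effective_batch, pairs_seen)
```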
I sent you a mail (which contains a lot of links, so maybe it's in spam).
Thanks!
I just released a new version and it works pretty well: https://civitai.com/models/137638
training is described in the description
I like the results a lot
@GeroldMeisinger - really well done! I wonder what the differences are between your way and @achiru's (we'll wait for his hopefully-soon-to-be-published post).
One question though: did you simply convert the images to black and white using the HSV method (+ Gaussian blur and "remapping colors to 33% white-gray-black" [which, as a side question, I honestly don't know what it means])?
I'm asking because below you wrote "My idea was to imagine existing images as having been generated by a white-gray-black pattern image." and I'm not 100% sure I understand what you meant. Did you mean that you wanted the result of the black-and-whitening and blur to be an image of a simple pattern ("what you imagined"), and that the actual thing you used, which apparently did work well, was the slight blur?
I'd be very happy to see some example images on civitai! It's motivating to see your own work being used for something :)
I don't know how they did it. I assume either "actually collecting a lot of artsy QR codes (a lot of work!)" or "conversion via weighted RGB and denoising/blurring". But yes, I simply converted all the images with the described method.
Did you mean that you wanted the result of the black-and-whitening and blur to be an image of a simple pattern ("what you imagined"), and that the actual thing you used, which apparently did work well, was the slight blur?
Yes. Imagine EVERY image was generated by some hidden pattern, like the staircase was actually generated by one of those spiral controls. Now we only have to find an image processing method to reveal this pattern, and the hsv-grayscale + blur + remap gets close, but it's not perfect. Right now I'm trying to "blobbify" it and it works even better. Looks like the next harvest will be even better!
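If it helps, the hsv-grayscale + blur + remap boils down to something like this (a simplified sketch; the blur sigma and thresholds here are placeholders, not my exact numbers):

```python
# Simplified sketch: HSV value channel, slight Gaussian blur, then snap every pixel
# to one of three flat levels (black / gray / white). Sigma and thresholds are guesses.
import cv2
import numpy as np

img = cv2.imread("photo.png")
value = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)[..., 2]   # the "hsv-grayscale"
blurred = cv2.GaussianBlur(value, (0, 0), 5)           # slight blur, kernel size from sigma

condition = np.full_like(blurred, 128)                 # middle third -> gray
condition[blurred < 85] = 0                            # darkest third -> black
condition[blurred > 170] = 255                         # brightest third -> white
cv2.imwrite("condition.png", condition)
```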
@GeroldMeisinger Thanks for the response. Looks like you know your stuff in image processing, but I'll still write this as it might be of help: for "blobbify"-ing an image, maybe try quantizing the grayscale and then looking at the result of the "opening" and "closing" operations (built from erosion and dilation). Maybe you know all of this, but I'll suggest it anyway :)
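Something along these lines (a rough OpenCV sketch of the suggestion; the kernel shape and size are arbitrary and would need tuning):

```python
# Sketch of the quantize + opening/closing idea; the kernel size is arbitrary here,
# and finding a good one is exactly the tuning question.
import cv2
import numpy as np

gray = cv2.imread("condition.png", cv2.IMREAD_GRAYSCALE)

# Quantize to 3 gray levels (black / gray / white)
quantized = (np.round(gray / 127.5) * 127.5).astype(np.uint8)

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
opened = cv2.morphologyEx(quantized, cv2.MORPH_OPEN, kernel)   # erode then dilate: removes small bright specks
closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)     # dilate then erode: fills small dark holes
cv2.imwrite("condition_blobby.png", closed)
```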
yep, that's what I'm using right now ;)
btw: If you want to help me, you could create a selection of images (where we imagine some hidden patterns) and make manual conditions to compare. Then I could try different opening and closing kernels on it to match it with the manual condition.
@GeroldMeisinger
Sure, it will take me a bit to set up the environment, but I'll let you know if I have any findings.
BTW, do you maybe have a repository with all of the code so I could "plug and play"? It could save me some time (I saw your tutorial in the link, but thought maybe you have it all in one place).
no repo
https://gitlab.com/-/snippets/3611640
if you need images, follow the img2dataset part
I made a civitai bounty for using my new model version, if anyone is interested: https://civitai.com/bounties/28
Is anybody here still focused on this question? I wonder where the post @achiru mentioned above is. This discussion in English is actually not easy for me. So, is this model trained to control the brightness of the image according to a QR code? What about the function patterns of QR codes? They don't actually matter, right? This means a lot to me, because I am just a desperate college student working on my graduation thesis about QR codes.
The short version is: try out different luminance patterns (as it was for QR codes at first) on top of existing images. The patterns should be generic enough to make the model generalize to all types of input, but not too general, otherwise the training does not converge.
A lot of patterns could work; I'm still not sure which ones are the absolute best, especially for illusions, which are more generic than QR codes.
Do "luminance patterns" mean a black-and-white mask? So this model is trained on images shaped into a QR code? I don't see how that makes sense; if so, the modules of the resulting QR codes would be just squares instead of merging into the images.
I plan to release the datasets and methodology once I finish my work on the SDXL version; right now I prefer to focus the little time I have on training (I get asked almost every day when the SDXL model will be released).
Right now everything is a bit of a mess and I don't want to release my scripts / datasets as is, so I'll take care of that when I don't have to worry about the SDXL version anymore.
I'll leave this issue open until then.
Hey @achiru, it's been a while and I was wondering when you will release those :)