Alternative training and output technique

#7
by onirhakin - opened

Hi, sorry to bother, I had a question about an alternative approach to the tag grouping.

I've seen that in the space https://huggingface.co/spaces/SmilingWolf/wd-tagger the output is split into rating, character and tags; I guess this is for the same reason I would like a different approach to the output.
Most autotaggers compare all the tags together and output the confidence that each tag matches the image. The problem is that very "strong" tags seem to obfuscate "weaker" tags.
If I evaluate an image, I might get a lot of tags about hair and none about the facial expression, and wrong tags about hair might have higher confidence than right tags about expression. I think it is not really fair to compare tags belonging to different groups in this case.
The result is that the autotagging seems to lack fundamental information about the pose, the perspective, certain details, while being flooded by hair styles, colors, and certain other details (maybe multiple tags about gauntlets and none about the socks).

So an ideal model would work more like a real tagger, taking the checklist and, for every category, finding the right tag.
Example:

  • Start with hair color, it could possibly be:
    "translucent hair": {}
    "colored inner hair": {},
    "colored tips": {},
    "gradient hair": {},
    "multicolored hair": {},
    "print hair": {},
    "rainbow hair": {},
    "roots (hair)": {},
    "split-color hair": {},
    "spotted hair": {},
    "streaked hair": {},
    "two-tone hair": {}
    "aqua hair": {},
    "black hair": {},
    "blonde hair": {},
    "blue hair": {},
    "brown hair": {},
    "dark blue hair": {},
    "dark green hair": {},
    "green hair": {},
    "grey hair": {},
    "light blue hair": {},
    "light brown hair": {},
    "light green hair": {},
    "light purple hair": {},
    "orange hair": {},
    "pink hair": {},
    "purple hair": {},
    "red hair": {},
    "white hair": {}
  • then eye color
  • then skin color
  • then the pose
  • then the face expression
  • then the camera angle
    etc...

In this case the system will be the same, but whatever technique is used to apply a threshold to the confidences (ranking), it will compare tags within their own group, so a tag only faces the tags it is actually competing with, which are (possibly) mutually exclusive.
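A minimal sketch of what I mean by group-wise thresholding (all tag names, group names, and the threshold value below are just illustrative, not taken from any real tagger):

```python
# Confidences as they would come out of a tagger (tag -> score).
confidences = {
    "blonde hair": 0.91,
    "brown hair": 0.40,
    "pink hair": 0.08,
    "smile": 0.55,
    "open mouth": 0.52,
    "from above": 0.30,
}

# Tag groups, e.g. derived from danbooru's tag grouping.
groups = {
    "hair color": ["blonde hair", "brown hair", "pink hair"],
    "expression": ["smile", "open mouth"],
    "camera angle": ["from above"],
}

def rank_within_groups(confidences, groups, threshold=0.5):
    """Keep, per group, the tags whose confidence clears the threshold.

    The comparison only happens among tags of the same group, so a very
    confident hair tag can no longer push out a weaker expression tag.
    """
    kept = {}
    for group, tags in groups.items():
        scored = {t: confidences[t] for t in tags if t in confidences}
        kept[group] = [t for t, c in scored.items() if c >= threshold]
    return kept

print(rank_within_groups(confidences, groups))
# {'hair color': ['blonde hair'], 'expression': ['smile', 'open mouth'],
#  'camera angle': []}
```

The model is still run once over all tags; only the keep/discard step changes.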

The easiest way to create the groups for thousands of tags is to use the tag grouping in danbooru.

1st problem) there are a lot of groups, so making a lot of models that each check only those tags is likely unfeasible.
The most reasonable idea is to use the tagger as it is (evaluation), and just modify the threshold evaluation/ranking so that tag confidences are recomputed later based on their related tags.

2nd problem) danbooru tags generally do not cover the whole space of probabilities. The list of hair colors doesn't cover the case where no hair is present in the image, so when there is no head, the model might select a wrong tag. Similarly, in the group "number of eyes", the vanilla number (2 eyes) is not listed:
"$ Number of eyes": {
"extra eyes": {},
"missing eye": {},
"no eyes": {},
"one-eyed": {},
"third eye": {}
}
Again, the model would find itself forced to select a tag that is wrong and unlikely.

To solve this, one of the options is to discard the group if the confidence is too evenly distributed within it (assuming that when the tags are wrong, they have similar confidence).
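One way to make "too evenly distributed" concrete would be normalized entropy: a group whose scores are nearly uniform gets dropped. This is just a sketch of that idea; the 0.9 cutoff is an arbitrary illustration, not a tuned value:

```python
import math

def is_too_flat(scores, cutoff=0.9):
    """Return True when the score distribution is close to uniform.

    Normalized entropy is 1.0 for a perfectly uniform distribution and
    drops toward 0 as one score dominates, so a high value suggests the
    group has no clear winner and should be discarded.
    """
    total = sum(scores)
    if total == 0 or len(scores) < 2:
        return True
    probs = [s / total for s in scores]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(scores)) >= cutoff

# A confident group: one tag dominates, so we keep it.
print(is_too_flat([0.9, 0.1, 0.05]))   # False
# A flat group: nothing stands out, so we discard it.
print(is_too_flat([0.2, 0.21, 0.19]))  # True
```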

3rd problem) this assumption might be wrong.
An alternative could be to change the tags for the training, making sure that the tags cover all the possibilities by adding a "none of the above" tag to each group. For the eyes case:
"$ Number of eyes": {
"extra eyes": {},
"missing eye": {},
"no eyes": {},
"one-eyed": {},
"third eye": {}
"none of the above(Number of eyes)" (I would make a shorter tag like "nota4532") aka 2 eyes but nota4532 can be procedurally generated
}
So when the character has the "classic" 2 eyes, or there is no head in the picture, the model will select the last option.
This way, during training, we could take all the tags of the image and check, for each group, whether there is at least one tag belonging to the group; otherwise we add the tag nota#### for that specific group.
This way the entire probability space is covered for each group, and when we perform a confidence comparison, if no tag matches, the highest confidence will be the nota's, and we could either output no tag and leave it implicit, or output this tag too if it can be useful (it would probably flood the tag list though).
(Remember that we evaluate all the tags together as we do now, but we group them just when we perform the threshold computation/ranking, to decide which ones to keep or discard.)
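The training-side preprocessing could be as simple as this sketch (the group ids, names, and tags below are made up for illustration; only the "append a nota#### tag when a group has no member among the image's tags" logic is the point):

```python
# Hypothetical tag groups, keyed by a numeric id used to build the nota tag.
groups = {
    4532: {"name": "Number of eyes",
           "tags": {"extra eyes", "missing eye", "no eyes",
                    "one-eyed", "third eye"}},
    4533: {"name": "Hair color",
           "tags": {"blonde hair", "brown hair", "pink hair"}},
}

def add_nota_tags(image_tags, groups):
    """Return the tag list with one nota#### tag per uncovered group."""
    tags = list(image_tags)
    for group_id, group in groups.items():
        # If no tag of this group appears on the image, the group is
        # "uncovered" and gets its procedurally generated nota tag.
        if not group["tags"] & set(image_tags):
            tags.append(f"nota{group_id}")
    return tags

# The image has a hair color tag but nothing about the number of eyes,
# so only the eyes group gets its nota tag.
print(add_nota_tags(["blonde hair", "smile"], groups))
# ['blonde hair', 'smile', 'nota4532']
```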

4th problem) the biggest problem I can think of here is about images that are "undertagged".
For example, if an image with the typical h-protagonist has not been labeled with "no eyes", the algorithm will label it as none of the above (aka 2 eyes or no face) while, in fact, the tag was just not specified, as the user cannot tag everything, and this will mess up the training.
Or maybe, if we think of the "none of the above" as "unspecified", and the variance stays within the noise range, it will actually not impact the evaluation. Maybe?

All this essay to ask whether you think the last approach could make sense: group the tags, evaluate them all together, rank them within their own groups, and train the model as described with the "control tags".
Could this go anywhere, or do you see huge problems already?
