Model Card for Bad-Anatomy-Realism-Classifier

A finetuned Vision Transformer model for classifying AI-generated pictures for bad anatomy and realism.

This model is currently a support model for my YouTube series. Feel free to build on top of it.

Model Detail

Detecting Bad Anatomy in Realistic AI-Generated Images - Not all image generation models generate images with good anatomy. Some generate the typical "bad hands", where a hand has more than 5 fingers. This model's goal is to detect such anatomy issues in AI-generated images.

Determining True Realism Versus AI Realism - AI-generated images that aim for realism tend to give themselves away through their skin rendering and generation style. Compared to a normal post on social media, a high-definition upscaled AI-generated image can be easily identified by characteristics such as shiny skin or very bright lighting. Below are some examples:

Unrealistic Good Anatomy AI-generated image number 29

Unrealistic Good Anatomy AI-generated image number 31

Model Description

This was fine-tuned on the google/vit-base-patch16-224-in21k Vision Transformer (ViT).

Uses

Detecting whether an image is actually real or is a very well-made AI-generated image
Detecting bad anatomy in AI-generated images to trigger a regeneration
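
The regeneration use case could be wired up roughly as below. This is a minimal sketch, not the author's implementation. It assumes predictions arrive in the list-of-dicts shape returned by a Hugging Face image-classification pipeline, and that the label strings match the four names in the dataset stats (for example "Unrealistic Bad Anatomy"). The function name `needs_regeneration` and the `min_score` threshold are hypothetical.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def needs_regeneration(predictions: List[Prediction], min_score: float = 0.5) -> bool:
    """Return True when the top-scoring prediction is a confident "Bad Anatomy" label.

    `predictions` is assumed to look like pipeline output:
    [{"label": "...", "score": 0.0-1.0}, ...]
    """
    if not predictions:
        return False
    # Pick the label the classifier is most confident about.
    best = max(predictions, key=lambda p: float(p["score"]))
    # Normalise underscores so "Unrealistic_Bad_Anatomy" would also match (assumption).
    label = str(best["label"]).replace("_", " ").lower()
    return "bad anatomy" in label and float(best["score"]) >= min_score


# Made-up scores, for illustration only.
example = [
    {"label": "Unrealistic Bad Anatomy", "score": 0.71},
    {"label": "Unrealistic Good Anatomy", "score": 0.22},
    {"label": "Realistic Good Anatomy", "score": 0.05},
    {"label": "Realistic Bad Anatomy", "score": 0.02},
]
print(needs_regeneration(example))
```

Only the top label is checked here; a stricter variant could sum the scores of both "Bad Anatomy" labels before comparing against the threshold.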

Out-of-Scope Use

Racism
Illegal activities

Bias, Risks, and Limitations

This initial model was trained on images generated with Stable Diffusion v1.5 using the Beautiful Realistic Asians v6 checkpoint by pleasebankai.

The dataset for this model contained only 134 images, of which only 6 are Realistic Bad Anatomy. (Dataset details will be added to the model card in later documentation updates.)

Recommendations

The recommendation is to build on the dataset and continue training with a wider variety of characters, to raise performance on images that do not match the characteristics of the training images.

How to Get Started with the Model

Finetuning

Please refer to the initial finetune script for this model in the supporting GitHub repository here: https://github.com/angusleung100/barc-finetuning-gh

Using The Model For Classification

Please refer to the Hugging Face documentation example here for Image Classification: https://huggingface.co/docs/transformers/en/tasks/image_classification#inference
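
For convenience, the linked inference example boils down to something like the sketch below. The `pipeline("image-classification", ...)` call is the standard Transformers API. `MODEL_ID` is only a placeholder, since the exact Hub repository id is not stated in this card, and `load_classifier` and `classify` are hypothetical helper names.

```python
from typing import Any, Callable, Dict, List, Tuple

# Placeholder (assumption): replace with the real Hub repository id of this model.
MODEL_ID = "<hub-username>/<model-name>"


def load_classifier(model_id: str = MODEL_ID) -> Callable[[Any], List[Dict[str, Any]]]:
    """Build an image-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("image-classification", model=model_id)


def classify(classifier: Callable[[Any], List[Dict[str, Any]]], image: Any) -> List[Tuple[str, float]]:
    """Run the classifier on one image and return (label, score) pairs, best first.

    `image` can be anything the pipeline accepts, such as a file path or a PIL image.
    """
    predictions = classifier(image)
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["label"], round(float(p["score"]), 4)) for p in ranked]


# Usage (needs network access and the real repository id):
#   clf = load_classifier("<hub-username>/<model-name>")
#   for label, score in classify(clf, "path/to/generated_image.png"):
#       print(f"{label}: {score}")
```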

Training Details

Training and Testing Data

Dataset Image Label Criteria

Bad / Good Anatomy

Any deformed body parts or extra limbs for the character
The background does not matter much (it can always be removed or changed in post-processing with professional editing software)

Realistic vs. Unrealistic

The criteria are more interesting for determining realism. Since a lot of people like to use filters now, it's actually quite hard to decide what a good standard for realism is. Here is what I narrowed it down to for this model:

First glance reaction - Do I take a closer look and feel skeptical? Or do I know instantly it isn't real?
Lighting - It is easier to sort amateur-style images, since I can move on to the next criterion first. Some professional images look AI-generated but are actually just heavily edited, but unnatural lighting is still a usable signal.
Skin and hair - Whether the skin and hair are too shiny (like the images at the start of the Model Card), or whether an upscaled image has too little detail, or TOO much detail.
Photography style - This could lead to false positives or false negatives, but if the focal point of the shot looks off or the image is very airbrushed, it could be unrealistic.

Overall, the sorting is based on "gut feeling". A goal of the model is to replicate that gut feeling, your underlying feel for the image.

Compatible Images For Dataset

Since the default data collator is used and images are primarily from SD 1.5, I am not entirely certain whether images and sizes from different models will break the training, even though the testing pipeline had no problems with the 3 images we used later on.

Here is a list of models whose default image sizes should work:

Stable Diffusion 1.5
OpenDalle v1.1
Flux 1
Dall-E 3 on Copilot

Dataset Stats

Number Images Per Label
=======================
Realistic Bad Anatomy: 6 (4.48%)
Realistic Good Anatomy: 15 (11.19%)
Unrealistic Bad Anatomy: 81 (60.45%)
Unrealistic Good Anatomy: 32 (23.88%)

Total Number of Images:  134
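
The percentages above, and how skewed the labels are, can be reproduced from the raw counts. The snippet below is a small sketch using only the numbers in this table. The "majority baseline" is the accuracy of always predicting the most common label over the full dataset; the share inside the evaluation split may differ, since the split is not described here.

```python
# Label counts copied from the Dataset Stats table.
counts = {
    "Realistic Bad Anatomy": 6,
    "Realistic Good Anatomy": 15,
    "Unrealistic Bad Anatomy": 81,
    "Unrealistic Good Anatomy": 32,
}

total = sum(counts.values())

# Percentage share of each label, rounded as in the table.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Accuracy obtained by always guessing the most frequent label.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {counts[label]} ({share}%)")
print(f"Total: {total}")
print(f"Majority baseline ({majority_label}): {majority_baseline:.2%}")
```

A baseline of roughly 60% is a useful yardstick when reading the evaluation accuracy further down.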

Evaluation

Results

***** train metrics *****
  epoch                    =        3.0
  total_flos               = 20135801GF
  train_loss               =     0.8453
  train_runtime            = 0:00:42.83
  train_samples_per_second =      6.514
  train_steps_per_second   =      0.841

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.6341
  eval_f1                 =      0.513
  eval_loss               =     0.8219
  eval_precision          =      0.464
  eval_recall             =     0.6341
  eval_runtime            = 0:00:06.95
  eval_samples_per_second =      5.893
  eval_steps_per_second   =      0.862
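
The split sizes are not stated in this card, but they can be estimated from the throughput figures above. The sketch below multiplies samples per second by runtime and rounds to whole numbers. It assumes the training throughput is measured over all 3 epochs; the results are estimates derived from rounded log values, not figures confirmed by the finetune script.

```python
def hms_to_seconds(text: str) -> float:
    """Convert a runtime string like '0:00:42.83' to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


EPOCHS = 3
train_runtime = hms_to_seconds("0:00:42.83")
eval_runtime = hms_to_seconds("0:00:06.95")

# samples/second * seconds = samples processed; divide by epochs for the split size.
train_samples_per_epoch = round(6.514 * train_runtime / EPOCHS)
train_steps_per_epoch = round(0.841 * train_runtime / EPOCHS)
eval_samples = round(5.893 * eval_runtime)
eval_steps = round(0.862 * eval_runtime)

# Number of correctly classified evaluation images implied by eval_accuracy.
eval_correct = round(0.6341 * eval_samples)

print("train images per epoch:", train_samples_per_epoch)
print("train steps per epoch:", train_steps_per_epoch)
print("eval images:", eval_samples)
print("eval steps:", eval_steps)
print("eval images classified correctly:", eval_correct)
print("train + eval:", train_samples_per_epoch + eval_samples)
```

The estimated train and eval sizes add up to the 134 images in the dataset stats, which suggests the estimates are sound. Also note that eval_recall equals eval_accuracy, which is what weighted-average recall always yields, although the averaging mode is not stated in this card.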

Summary

The initial dataset and finetune resulted in 63.41% accuracy and a 51.3% F1 score, which is low but expected for a small amateur dataset.

Hopefully I will have time to further build on the dataset and improve the model's performance in the future.

The next steps would be:

Have more variety of characters and poses
More variety of clothing styles and lighting
Different camera styles
Generations from different models (the dataset is currently dominated by the SD1.5 BRAV6 and BRAV7 checkpoints)

Model Examination

You can view example pipeline inferences and their results in the Initial Finetune notebook.

The examples are at the bottom of the notebook. You can press Ctrl+F and search for Test Model With Custom Inputs to reach them faster.

Model Card Contact

Feel free to contact me if you have any questions, or find me on GitHub.