metadata

title: Person Thumbs Up
emoji: 🐠
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.21.0
app_file: app.py
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Stable diffusion finetune using LoRA

HuggingFace Spaces URL: https://huggingface.co/spaces/asrimanth/person-thumbs-up

Please note that the app on spaces is very slow due to compute constraints. For good results, please try locally.

Approach

The key resource in this endeavor: https://huggingface.co/blog/lora

Training

All of the following models were trained on stable-diffusion-v1-5

I tried several different training strategies and found LoRA to be the best fit for my needs.
The thumbs up dataset had 121 training images, which I found to be adequate.
First, I scraped ~50 images of "sachin tendulkar". This experiment failed because the model generated a player wearing a cricket helmet.
For training on "tom cruise", I scraped ~100 images from images.google.com, using the JavaScript code from pyimagesearch.com.
For training on "srimanth", I used 50 images of myself.

For the datasets, I started as follows:

Use an image captioning model from HuggingFace. In this case, it is the Salesforce/blip-image-captioning-large model.
Once captioned, if the caption contains "thumbs up", we replace it with #thumbsup; otherwise, we append #thumbsup to the caption.
If the captioning model recognizes the person or uses the word "man", we replace it with <person>; otherwise, we append <person> to the caption.
No-cap dataset: For the no-cap models, we don't use the captioning model. The caption is simply the <person> and #thumbsup tags.
Plain dataset: For the plain models, we leave the words as is: "thumbs up" and the person's name appear without special characters.
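
The tagging rules above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the datasets: the function names, the `person_names` parameter, and the order in which missing tags are appended are all assumptions, and the captions are assumed to already be available as plain strings.

```python
import re
from typing import Sequence

THUMBS_TAG = "#thumbsup"
PERSON_TAG = "<person>"


def tag_caption(caption: str, person_names: Sequence[str] = ()) -> str:
    """Apply the tagged-dataset rules to a single caption (hypothetical helper)."""
    text = caption.strip()

    # Replace "thumbs up" with the tag, or append the tag if it is missing.
    text, hits = re.subn(r"\bthumbs up\b", THUMBS_TAG, text, flags=re.IGNORECASE)
    if hits == 0:
        text = f"{text} {THUMBS_TAG}"

    # Replace a known person name or the word "man" with the tag, or append the tag.
    words = [re.escape(name) for name in person_names] + ["man"]
    pattern = r"\b(?:" + "|".join(words) + r")\b"
    text, hits = re.subn(pattern, lambda _m: PERSON_TAG, text, flags=re.IGNORECASE)
    if hits == 0:
        text = f"{text} {PERSON_TAG}"
    return text


def no_cap_caption() -> str:
    """Caption for the no-cap dataset: only the two tags, no captioning model."""
    return f"{PERSON_TAG} {THUMBS_TAG}"
```

For example, `tag_caption("a man giving a thumbs up")` yields `a <person> giving a #thumbsup`. The plain dataset needs no helper, since its captions are left unchanged.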

The wandb dashboards for the models are as follows:

Initial experiments: I first tried training only on the thumbs up dataset. The results were good: the thumbs up was mostly accurate, with four fingers folded and the thumb raised. However, the model trained on sachin had several issues, including occlusion by cricket gear. I tried several learning rates (from 1e-4 to 1e-6 with a cosine scheduler), but the loss curve did not change much.

Number of epochs: 50-60
Augmentations used: Center crop, Random flip
Gradient accumulation steps: Tried 1, 3, and 4 in different experiments. 4 gave decent results.
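
To make the learning-rate and gradient-accumulation settings concrete, here is a small sketch in plain Python. It assumes a standard cosine decay without warmup, which may differ from the scheduler in the actual training script, and the per-device batch size in the usage note is a hypothetical value, since it is not recorded here.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices
```

With a hypothetical per-device batch size of 1, `effective_batch_size(1, 4)` gives 4 samples per optimizer update, which is what gradient accumulation of 4 buys on a single GPU.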

text2image_fine-tune :

wandb dashboard : https://wandb.ai/asrimanth/text2image_fine-tune

Model card for asrimanth/person-thumbs-up-lora: https://huggingface.co/asrimanth/person-thumbs-up-lora

Prompt: <tom_cruise> #thumbsup

Deployed models:

When the above experiment failed, I had to try different datasets. One of them was "tom cruise".

srimanth-thumbs-up-lora-plain : We use the plain dataset with srimanth mentioned above.

wandb link: https://wandb.ai/asrimanth/srimanth-thumbs-up-lora-plain

Model card for srimanth-thumbs-up-lora-plain: https://huggingface.co/asrimanth/srimanth-thumbs-up-lora-plain

Prompt: srimanth thumbs up

person-thumbs-up-plain-lora : We use the plain dataset with tom cruise images.

wandb link: https://wandb.ai/asrimanth/person-thumbs-up-plain-lora

Model card for asrimanth/person-thumbs-up-plain-lora: https://huggingface.co/asrimanth/person-thumbs-up-plain-lora

Prompt: tom cruise thumbs up

person-thumbs-up-lora-no-cap : We use the no-cap dataset with tom cruise images.

wandb link: https://wandb.ai/asrimanth/person-thumbs-up-lora-no-cap

Model card for asrimanth/person-thumbs-up-lora-no-cap: https://huggingface.co/asrimanth/person-thumbs-up-lora-no-cap

Prompt: <tom_cruise> #thumbsup
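
Since each model expects its own prompt format, it can help to keep the pairs in one place. The lookup below is only a convenience sketch: the model ids and prompts are the ones listed above, while the `DEPLOYED_MODELS` name and the `prompt_for` helper are hypothetical.

```python
# Model id on the HuggingFace Hub -> prompt the model was trained to respond to.
DEPLOYED_MODELS = {
    "asrimanth/person-thumbs-up-lora": "<tom_cruise> #thumbsup",
    "asrimanth/srimanth-thumbs-up-lora-plain": "srimanth thumbs up",
    "asrimanth/person-thumbs-up-plain-lora": "tom cruise thumbs up",
    "asrimanth/person-thumbs-up-lora-no-cap": "<tom_cruise> #thumbsup",
}


def prompt_for(model_id: str) -> str:
    """Return the prompt for a model, with a readable error for unknown ids."""
    try:
        return DEPLOYED_MODELS[model_id]
    except KeyError:
        known = ", ".join(sorted(DEPLOYED_MODELS))
        raise ValueError(f"Unknown model '{model_id}'. Known models: {known}") from None
```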

Inference

Inference works best with 25 steps in the pipeline.
Since the Streamlit app on HuggingFace Spaces is slow due to limited compute, please run inference locally on a GPU.
During local inference (25 steps), person-thumbs-up-plain-lora produced a decent thumbs up for tom cruise in 35 out of 50 images, and an incomplete thumbs up in 5.
I could not evaluate the models with metrics due to insufficient time, so I chose a visual approach. To view the inference images, check the results folder.
To evaluate diffusion models, I would start with this: https://huggingface.co/docs/diffusers/conceptual/evaluation
Half-precision inference did not work on CPU, so we used torch.float32 instead.
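
The notes above can be sketched as follows, along the lines of the LoRA blog post linked in the Approach section. This is not the contents of inference.py: the function names are hypothetical, and the Hub id used for the stable-diffusion-v1-5 base model is an assumption that may need adjusting. Newer diffusers releases also offer `pipe.load_lora_weights(...)` as an alternative to `load_attn_procs`.

```python
BASE_MODEL_ID = "runwayml/stable-diffusion-v1-5"  # assumption: Hub id of the base model
LORA_MODEL_ID = "asrimanth/person-thumbs-up-plain-lora"


def pick_dtype_name(device: str) -> str:
    """Half precision on GPU; float32 on CPU, where half precision did not work."""
    return "float16" if device.startswith("cuda") else "float32"


def generate(prompt: str, device=None, num_inference_steps: int = 25):
    """Generate one image with the LoRA weights applied to the base pipeline."""
    # Heavy imports are kept local so the helper above works without them.
    import torch
    from diffusers import StableDiffusionPipeline

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = getattr(torch, pick_dtype_name(device))

    pipe = StableDiffusionPipeline.from_pretrained(BASE_MODEL_ID, torch_dtype=dtype)
    pipe.unet.load_attn_procs(LORA_MODEL_ID)  # load the LoRA attention weights
    pipe = pipe.to(device)

    return pipe(prompt, num_inference_steps=num_inference_steps).images[0]
```

Calling `generate("tom cruise thumbs up").save("tom_cruise_thumbs_up.png")` would write a single image; the output file name is only an example.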

Deployment

To run inference locally, choose a model and run the command:

python3 inference.py

To run the streamlit app locally, run the command:

streamlit run app.py

I chose Streamlit to deploy the application on HuggingFace Spaces. It is developer friendly, and the app logic can be found in app.py.
A Streamlit app is a great choice for an MVP.
AWS SageMaker would be a good choice for deploying the models, since it supports HuggingFace models with minimal friction.
A Docker container orchestrated in a Kubernetes cluster would be ideal.
In practice, evaluating the models in real time would reveal any model drift so that we can retrain accordingly.