---
title: Person Thumbs Up
emoji: 👍
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.21.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Stable Diffusion fine-tuning using LoRA
## HuggingFace Spaces URL: https://huggingface.co/spaces/asrimanth/person-thumbs-up
## Please note that the app on Spaces is very slow due to compute constraints. For best results, please run it locally.
## Approach
**The key resource in this endeavor: https://huggingface.co/blog/lora**
### Training
All of the following models were fine-tuned from stable-diffusion-v1-5.
+ I tried several different training strategies and found LoRA to be the best for my needs.
+ The thumbs-up dataset had 121 training images, which I found to be adequate.
+ First, I scraped ~50 images of "Sachin Tendulkar". This experiment failed, since the model generated a player wearing a cricket helmet.
+ For training on "Tom Cruise", I scraped ~100 images from images.google.com, using the JavaScript snippet from pyimagesearch.com.
+ For training on "srimanth", I used 50 images of myself.
For the datasets, I proceeded as follows (a sketch of this step follows the list):
+ Use an image captioning model from HuggingFace - in our case, the `Salesforce/blip-image-captioning-large` model.
+ Once captioned, if the caption contains "thumbs up", we replace it with `#thumbsup`; otherwise we append `#thumbsup` to the caption.
+ If the model recognizes the person or outputs the word "man", we replace it with `<person>`; otherwise, we append `<person>` to the caption.
+ No-cap dataset: for the no-cap models, we skip the captioning model entirely and simply add the `<person>` and `#thumbsup` tags.
+ Plain dataset: for the plain models, we leave the words as is - the "thumbs up" and the person's name appear without special characters.
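Below is a minimal sketch of this captioning-and-tagging step, using the `transformers` image-to-text pipeline. The image directory, file glob, and the exact replacement rules are illustrative assumptions, not the original preprocessing script.

```python
# Hedged sketch of the caption + tag step described above (not the original script).
import re
from pathlib import Path
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

def tag_caption(caption: str, person_token: str = "<person>") -> str:
    # Replace "thumbs up" with the tag, or append the tag if it is absent.
    if "thumbs up" in caption:
        caption = caption.replace("thumbs up", "#thumbsup")
    else:
        caption += " #thumbsup"
    # Replace a generic "man" mention with the person token, or append it.
    if re.search(r"\bman\b", caption):
        caption = re.sub(r"\bman\b", person_token, caption)
    else:
        caption += f" {person_token}"
    return caption

for path in sorted(Path("data/images").glob("*.jpg")):  # assumed layout
    raw = captioner(str(path))[0]["generated_text"]
    print(path.name, "->", tag_caption(raw))
```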
Initial experiments: I first trained on the thumbs-up images only, and the results were good. The thumbs up was mostly accurate, with four fingers folded and the thumb raised. However, the model trained on Sachin had several issues, including occlusion by cricket gear.

I tried several learning rates (from 1e-4 to 1e-6, with a cosine scheduler), but the loss curve did not change much.

+ Number of epochs : 50-60
+ Augmentations used : center crop, random flip
+ Gradient accumulation steps : tried 1, 3, and 4 for different experiments; 4 gave decent results (see the sketch below).
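For illustration, here is a hedged sketch of how these settings could be wired up with diffusers' cosine schedule helper; the stand-in LoRA parameter and the step arithmetic are placeholders, not the actual training script.

```python
# Hedged sketch: cosine LR schedule plus gradient accumulation, as described above.
import torch
from diffusers.optimization import get_cosine_schedule_with_warmup

lora_params = [torch.nn.Parameter(torch.zeros(4, 768))]  # placeholder for LoRA weights
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)

num_epochs, num_images, batch_size, grad_accum = 50, 121, 1, 4
updates_per_epoch = num_images // (batch_size * grad_accum)  # optimizer steps per epoch
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_epochs * updates_per_epoch,
)
```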
The wandb dashboards for the models are as follows.

text2image_fine-tune :
**wandb dashboard : https://wandb.ai/asrimanth/text2image_fine-tune**
**Model card for asrimanth/person-thumbs-up-lora: https://huggingface.co/asrimanth/person-thumbs-up-lora**
**Prompt: ```<tom_cruise> #thumbsup```**
When the Sachin experiment failed, I had to try different datasets; one of them was "Tom Cruise".

Deployed models:
srimanth-thumbs-up-lora-plain : We use the plain dataset with the srimanth images mentioned above.
**wandb link: https://wandb.ai/asrimanth/srimanth-thumbs-up-lora-plain**
**Model card for srimanth-thumbs-up-lora-plain: https://huggingface.co/asrimanth/srimanth-thumbs-up-lora-plain**
**Prompt: ```srimanth thumbs up```**
person-thumbs-up-plain-lora : We use the plain dataset with Tom Cruise images.
**wandb link: https://wandb.ai/asrimanth/person-thumbs-up-plain-lora**
**Model card for asrimanth/person-thumbs-up-plain-lora: https://huggingface.co/asrimanth/person-thumbs-up-plain-lora**
**Prompt: ```tom cruise thumbs up```**
person-thumbs-up-lora-no-cap : We use the no-cap dataset with Tom Cruise images.
**wandb dashboard: https://wandb.ai/asrimanth/person-thumbs-up-lora-no-cap**
**Model card for asrimanth/person-thumbs-up-lora-no-cap: https://huggingface.co/asrimanth/person-thumbs-up-lora-no-cap**
**Prompt: ```<tom_cruise> #thumbsup```**
### Inference
+ Inference works best with 25 steps in the pipeline (a minimal sketch follows this list).
+ Since the HuggingFace Space built with Streamlit is slow due to low compute, please perform local inference on a GPU.
+ During local inference (25 steps), person-thumbs-up-plain-lora produced a decent thumbs up for Tom Cruise in 35 out of 50 images, with 5 incomplete thumbs ups.
+ While I could not evaluate the model with metrics due to insufficient time, I chose the visual approach. To view the inference images, check the `results` folder.
+ To evaluate diffusion models, I would start with this: https://huggingface.co/docs/diffusers/conceptual/evaluation
+ Half-precision inference was not working on CPU, so we used torch.float32 instead.
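A minimal local-inference sketch along these lines, assuming the `runwayml/stable-diffusion-v1-5` base checkpoint and the `load_attn_procs` LoRA-loading path from the diffusers LoRA blog; the actual `inference.py` may differ in its details.

```python
# Hedged sketch of local inference with the LoRA weights (not a copy of inference.py).
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision did not work on CPU, so fall back to float32 there (as noted above).
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype  # assumed base checkpoint
)
pipe.unet.load_attn_procs("asrimanth/person-thumbs-up-plain-lora")  # LoRA attention weights
pipe = pipe.to(device)

image = pipe("tom cruise thumbs up", num_inference_steps=25).images[0]
image.save("tom_cruise_thumbs_up.png")
```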
### Deployment
To run inference locally, choose a model and run the command:
```
python3 inference.py
```
To run the Streamlit app locally, run the command:
```
streamlit run app.py
```
+ I chose Streamlit to deploy the application on HuggingFace Spaces. It is developer-friendly, and the app logic can be found in `app.py` (a minimal sketch of such an app follows this list).
+ A Streamlit app is a great choice for an MVP.
+ AWS SageMaker would be a good choice for deploying models, since it supports HuggingFace models with minimal friction.
+ For production, a Docker container orchestrated in a Kubernetes cluster would be ideal.
+ In practice, real-time evaluation of the deployed model would reveal model drift and tell us when to retrain.
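For illustration, here is a minimal sketch of what such a Streamlit app can look like; the model ids, default prompt, and caching are assumptions, not a copy of `app.py`.

```python
# Hedged sketch of a Streamlit front-end for the LoRA pipeline (not a copy of app.py).
import streamlit as st
import torch
from diffusers import StableDiffusionPipeline

@st.cache_resource  # load the pipeline once per server process
def load_pipeline():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # assumed base checkpoint
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    )
    pipe.unet.load_attn_procs("asrimanth/person-thumbs-up-plain-lora")
    return pipe.to(device)

st.title("Person Thumbs Up")
prompt = st.text_input("Prompt", value="tom cruise thumbs up")
if st.button("Generate"):
    image = load_pipeline()(prompt, num_inference_steps=25).images[0]
    st.image(image)
```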