---
title: Person Thumbs Up
emoji: 🐠
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.21.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Stable Diffusion fine-tuning using LoRA

## HuggingFace Spaces URL: https://huggingface.co/spaces/asrimanth/person-thumbs-up
## Please note that the app on Spaces is very slow due to compute constraints. For good results, please run it locally.

## Approach

**The key resource in this endeavor: https://huggingface.co/blog/lora**

### Training

All of the following models were fine-tuned from stable-diffusion-v1-5.

+ I tried several different training strategies and found LoRA to be the best fit for my needs.
+ The thumbs-up dataset had 121 images for training, which I found to be adequate.
+ First, I scraped ~50 images of "Sachin Tendulkar". This experiment failed, since the model generated a player wearing a cricket helmet.
+ For training on "Tom Cruise", I scraped ~100 images from images.google.com, using the JavaScript snippet from pyimagesearch.com.
+ For training on "srimanth", I used 50 images of myself.

I built the datasets as follows (a sketch of the captioning step follows this list):
+ Caption each image with an image-captioning model from HuggingFace; in our case, the `Salesforce/blip-image-captioning-large` model.
+ Once captioned, if the caption contains "thumbs up", we replace it with `#thumbsup`; otherwise, we append `#thumbsup` to the caption.
+ If the model recognizes the person, or outputs the word "man", we replace it with `<person>`; otherwise, we append `<person>` to the caption.
+ No-cap dataset: for the no-cap models, we skip the captioning model and simply add the `<person>` and `#thumbsup` tags.
+ Plain dataset: for the plain models, we leave the words as-is: "thumbs up" and the person's name appear without special tokens.
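
As a rough illustration, here is a minimal sketch of the captioning-plus-tagging step, assuming the `transformers` BLIP API; the data path is illustrative, and the tag rules mirror the list above:

```
# Hypothetical sketch of the captioning + tagging step; the data path is
# illustrative, and the tag rules mirror the list in the README above.
import re
from pathlib import Path

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def tag_caption(caption: str, person_token: str = "<person>") -> str:
    # Replace "thumbs up" with #thumbsup, or append the tag if it is absent.
    if "thumbs up" in caption:
        caption = caption.replace("thumbs up", "#thumbsup")
    else:
        caption += " #thumbsup"
    # Replace the word "man" with the person token, or append the token.
    if re.search(r"\bman\b", caption):
        caption = re.sub(r"\bman\b", person_token, caption)
    else:
        caption += f" {person_token}"
    return caption

for path in sorted(Path("data/thumbs_up").glob("*.jpg")):  # illustrative path
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(output_ids[0], skip_special_tokens=True)
    print(path.name, "->", tag_caption(caption))
```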

The wandb dashboards for the models are linked below.

Initial experiments: I first trained on the thumbs-up images alone, and the results were good. The thumbs up was mostly accurate, with four fingers folded and the thumb raised. However, the model trained on Sachin had several issues, including occlusion by cricket gear. I tried several different learning rates (from 1e-4 to 1e-6 with a cosine scheduler), but the loss curve did not change much.

+ Number of epochs: 50-60
+ Augmentations used: center crop, random flip
+ Gradient accumulation steps: tried 1, 3, and 4 across experiments; 4 gave decent results.
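
For reference, a run with these settings could look roughly like the following, assuming the `train_text_to_image_lora.py` example script from diffusers that the LoRA blog post above walks through; the data and output paths are illustrative:

```
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --train_data_dir="data/thumbs_up" \
  --caption_column="text" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=50 \
  --learning_rate=1e-4 \
  --lr_scheduler="cosine" --lr_warmup_steps=0 \
  --report_to="wandb" \
  --output_dir="person-thumbs-up-lora"
```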

text2image_fine-tune:

**wandb dashboard: https://wandb.ai/asrimanth/text2image_fine-tune**

**Model card for asrimanth/person-thumbs-up-lora: https://huggingface.co/asrimanth/person-thumbs-up-lora**

**Prompt: ```<tom_cruise> #thumbsup```**

Deployed models:

When the above experiment failed, I had to try different datasets; one of them was "Tom Cruise".

srimanth-thumbs-up-lora-plain: uses the plain dataset with the srimanth images mentioned above.

**wandb link: https://wandb.ai/asrimanth/srimanth-thumbs-up-lora-plain**

**Model card for srimanth-thumbs-up-lora-plain: https://huggingface.co/asrimanth/srimanth-thumbs-up-lora-plain**

**Prompt: ```srimanth thumbs up```**

person-thumbs-up-plain-lora: uses the plain dataset with the Tom Cruise images.

**wandb link: https://wandb.ai/asrimanth/person-thumbs-up-plain-lora**

**Model card for asrimanth/person-thumbs-up-plain-lora: https://huggingface.co/asrimanth/person-thumbs-up-plain-lora**

**Prompt: ```tom cruise thumbs up```**

person-thumbs-up-lora-no-cap: uses the no-cap dataset with the Tom Cruise images.

**wandb link: https://wandb.ai/asrimanth/person-thumbs-up-lora-no-cap**

**Model card for asrimanth/person-thumbs-up-lora-no-cap: https://huggingface.co/asrimanth/person-thumbs-up-lora-no-cap**

**Prompt: ```<tom_cruise> #thumbsup```**

### Inference

+ Inference works best with 25 steps in the pipeline.
+ Since the HuggingFace Space built with Streamlit is slow due to low compute, please perform local inference using a GPU.
+ During local inference (25 steps), person-thumbs-up-plain-lora produced a decent thumbs up for Tom Cruise in 35 out of 50 images, with 5 more showing an incomplete thumbs up.
+ While I could not evaluate the model with quantitative metrics due to insufficient time, I took the visual approach: to view the inference images, check the `results` folder.
+ To evaluate diffusion models, I would start with this guide: https://huggingface.co/docs/diffusers/conceptual/evaluation
+ Half-precision inference was not working on CPU, so we used torch.float32 instead (see the loading sketch after this list).
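
As a minimal sketch of local inference, assuming the diffusers LoRA loading workflow from the blog post above (the model ID and prompt come from this README; device handling is illustrative):

```
# Hypothetical inference sketch based on the diffusers LoRA workflow;
# the model ID and prompt come from this README, the rest is illustrative.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision failed on CPU, so we fall back to float32 there.
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
)
# Load the fine-tuned LoRA attention weights on top of the base UNet.
pipe.unet.load_attn_procs("asrimanth/person-thumbs-up-plain-lora")
pipe = pipe.to(device)

# 25 denoising steps worked best in our experiments.
image = pipe("tom cruise thumbs up", num_inference_steps=25).images[0]
image.save("tom_cruise_thumbs_up.png")
```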

### Deployment

To run inference locally, choose a model and run the command:
```
python3 inference.py
```

To run the streamlit app locally, run the command:
```
streamlit run app.py
```

+ I chose Streamlit to deploy the application on HuggingFace Spaces: it is developer friendly, and the app logic can be found in `app.py` (a minimal sketch of such an app follows this list).
+ A Streamlit app would be a great choice for an MVP.
+ AWS SageMaker would be a good choice for deploying models, since it supports HuggingFace models with minimal friction.
+ A Docker container orchestrated in a Kubernetes cluster would be ideal for production.
+ In practice, evaluating the models in real time would tell us whether there is model drift, so that we could retrain accordingly.
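
For illustration, here is a minimal sketch of what such a Streamlit front end could look like; the actual logic lives in `app.py`, and the widget labels here are hypothetical:

```
# Hypothetical sketch of a Streamlit front end for the LoRA pipeline;
# the real app logic lives in app.py, widget labels are illustrative.
import streamlit as st
import torch
from diffusers import StableDiffusionPipeline

@st.cache_resource  # load the pipeline once per process, not on every rerun
def load_pipeline():
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
    )
    pipe.unet.load_attn_procs("asrimanth/person-thumbs-up-plain-lora")
    return pipe

st.title("Person Thumbs Up")
prompt = st.text_input("Prompt", value="tom cruise thumbs up")
if st.button("Generate"):
    with st.spinner("Generating..."):
        image = load_pipeline()(prompt, num_inference_steps=25).images[0]
    st.image(image)
```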