---
library_name: transformers
license: apache-2.0
datasets:
- merve/vqav2-small
---



![image/png](https://cdn-uploads.huggingface.co/production/uploads/6141a88b3a0ec78603c9e784/PebmPLcCig5BlpUS99VUc.png)

# Idefics3Llama Fine-tuned using QLoRA on VQAv2

- This is the [Idefics3Llama](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) model fine-tuned with QLoRA on a small subset of the [VQAv2](https://huggingface.co/datasets/merve/vqav2-small) dataset.

- Find the fine-tuning notebook [here](https://github.com/merveenoyan/smol-vision/blob/main/Idefics_FT.ipynb); a sketch of a typical QLoRA setup is shown below.
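
For context, QLoRA fine-tuning loads the base model in 4-bit and trains small LoRA adapters on top of it. The sketch below illustrates that setup; the `target_modules`, rank, and other hyperparameters are illustrative assumptions, not necessarily the notebook's exact values.

```python
import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/Idefics3-8B-Llama3",
    quantization_config=bnb_config,
)

# Attach LoRA adapters; the target modules and rank are assumptions for illustration
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    init_lora_weights="gaussian",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```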

## Usage

You can load the base model and attach the fine-tuned adapter as follows.

```python
from transformers import Idefics3ForConditionalGeneration, AutoProcessor

peft_model_id = "merve/idefics3llama-vqav2"
base_model_id = "HuggingFaceM4/Idefics3-8B-Llama3"

processor = AutoProcessor.from_pretrained(base_model_id)
model = Idefics3ForConditionalGeneration.from_pretrained(base_model_id)

# load_adapter modifies the model in place and returns None,
# so move the model to the GPU in a separate step
model.load_adapter(peft_model_id)
model.to("cuda")
```
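
Because the adapter was trained with QLoRA, you can optionally load the base model in 4-bit to reduce memory at inference time. A minimal variant, assuming `bitsandbytes` is installed:

```python
import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# device_map="auto" places the quantized model on the GPU directly,
# so no separate .to("cuda") call is needed
model = Idefics3ForConditionalGeneration.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model.load_adapter(peft_model_id)
```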

The model was fine-tuned with the conditioning prompt "Answer briefly.", so include it in your messages at inference time.

```python
from transformers.image_utils import load_image

DEVICE = "cuda"

image = load_image("https://huggingface.co/spaces/merve/OWLSAM2/resolve/main/buddha.JPG")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly."},
            {"type": "image"},
            {"type": "text", "text": "Which country is this located in?"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt", padding=True).to(DEVICE)
```

Now we can run inference.

```python
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)

# ['User: Answer briefly.<row_1_col_1><row_1_col_2><row_1_col_3><row_1_col_4>\n<row_2_col_1>
# <row_2_col_2><row_2_col_3><row_2_col_4>\n<row_3_col_1><row_3_col_2><row_3_col_3>
# <row_3_col_4>\n\n<global-img>Which country is this located in?\nAssistant: thailand\nAssistant: thailand']
```
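
The decoded string contains the whole chat transcript, prompt included. To extract just the answer, you can split on the `Assistant:` marker shown in the output above:

```python
answer = generated_texts[0].split("Assistant:")[-1].strip()
print(answer)  # thailand
```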