---
library_name: transformers
license: apache-2.0
datasets:
- merve/vqav2-small
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6141a88b3a0ec78603c9e784/PebmPLcCig5BlpUS99VUc.png)
# Idefics3Llama Fine-tuned using QLoRA on VQAv2
- This is the [Idefics3Llama](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) model, fine-tuned with QLoRA on a small subset of the [VQAv2](https://huggingface.co/datasets/merve/vqav2-small) dataset.
- Find the full fine-tuning notebook [here](https://github.com/merveenoyan/smol-vision/blob/main/Idefics_FT.ipynb); a rough sketch of the QLoRA setup follows below.
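
For reference, a QLoRA setup for this model looks roughly like the sketch below. The exact hyperparameters (LoRA rank, alpha, dropout, target modules) live in the notebook; the values here are illustrative assumptions, not the card's recorded recipe.

```python
import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/Idefics3-8B-Llama3",
    quantization_config=bnb_config,
    device_map="auto",
)

# low-rank adapters on the attention projections (illustrative values)
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```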
## Usage
You can load and use this model as follows.
```python
import torch
from transformers import Idefics3ForConditionalGeneration, AutoProcessor

peft_model_id = "merve/idefics3llama-vqav2"
base_model_id = "HuggingFaceM4/Idefics3-8B-Llama3"

processor = AutoProcessor.from_pretrained(base_model_id)

# load the base model in bf16, then attach the LoRA adapter (requires `peft`)
model = Idefics3ForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16
)
model.load_adapter(peft_model_id)  # returns None, so don't chain .to() on it
model.to("cuda")
```
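If the bf16 base model does not fit on your GPU, you can optionally load it in 4-bit with bitsandbytes before attaching the adapter. This is a sketch of that alternative, not the card's official loading recipe:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map places the quantized weights; don't call .to("cuda") afterwards
model = Idefics3ForConditionalGeneration.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model.load_adapter(peft_model_id)
```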
The model was fine-tuned with the conditioning prompt "Answer briefly.", so include it in your messages at inference time.
```python
from transformers.image_utils import load_image

DEVICE = "cuda"

image = load_image("https://huggingface.co/spaces/merve/OWLSAM2/resolve/main/buddha.JPG")

# build a chat with the conditioning prompt, the image, and the question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly."},
            {"type": "image"},
            {"type": "text", "text": "Which country is this located in?"}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt", padding=True).to(DEVICE)
```
Now we can run inference.
```python
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['User: Answer briefly.<row_1_col_1><row_1_col_2><row_1_col_3><row_1_col_4>\n<row_2_col_1>
#  <row_2_col_2><row_2_col_3><row_2_col_4>\n<row_3_col_1><row_3_col_2><row_3_col_3>
#  <row_3_col_4>\n\n<global-img>Which country is this located in?\nAssistant: thailand\nAssistant: thailand']
```
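Note that `generate` returns the prompt tokens followed by the new tokens, which is why the decoded string repeats the whole conversation (including the image placeholder tokens). To keep only the model's answer, you can slice off the prompt before decoding:

```python
# slice off the echoed prompt so only newly generated tokens remain
prompt_len = inputs["input_ids"].shape[1]
answer = processor.batch_decode(
    generated_ids[:, prompt_len:], skip_special_tokens=True
)[0].strip()
print(answer)  # e.g. "thailand"
```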