smishr-18 committed 76ad19a (1 parent: 503c05b): Update README.md

Files changed: README.md (+71, -3)

---
license: mit
datasets:
- nielsr/docvqa_1200_examples_donut
language:
- en
library_name: transformers
pipeline_tag: visual-question-answering
---

### IDEFICS2-OCR

A fine-tune of Idefics2-8b with fp16 weight updates, trained on the nielsr/docvqa_1200_examples_donut dataset of document VQA pairs.

## Usage

```python
import torch
from transformers import BitsAndBytesConfig, AutoModelForVision2Seq, AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("smishr-18/Idefics2-OCR", do_image_splitting=False)

# Load the model with 4-bit NF4 quantization so it fits on smaller GPUs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = AutoModelForVision2Seq.from_pretrained(
    "smishr-18/Idefics2-OCR",
    quantization_config=bnb_config,
    device_map=device,
    low_cpu_mem_usage=True
)

image = load_image("https://images.pokemontcg.io/pl1/1_hires.png")

# Interleave text and image content in a single user turn
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain."},
            {"type": "image"},
            {"type": "text", "text": "What is the reflex energy in the image?"}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text.strip()], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate the answer
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# The reflex energy in the image is 70.
```
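
Note that `generate` returns the prompt tokens followed by the completion, so the decoded string echoes the question. A minimal sketch for keeping only the newly generated answer (assuming the `inputs` dict from above):

```python
# Drop the echoed prompt: keep only the tokens produced after the input
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
answers = processor.batch_decode(new_tokens, skip_special_tokens=True)
print(answers[0])
```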

## Limitations

The model was fine-tuned on a single, memory-limited T4 GPU. It could be fine-tuned further with more adapters on devices where `torch.cuda.get_device_capability()[0] >= 8` (Ampere or newer GPUs).
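
As a rough sketch of what such adapter fine-tuning could look like using the `peft` library (all hyperparameters and target modules below are illustrative assumptions, not the recipe used for this model):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Hypothetical LoRA config; targets the usual attention projection layers
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# bfloat16 requires compute capability >= 8 (Ampere or newer)
model = AutoModelForVision2Seq.from_pretrained(
    "smishr-18/Idefics2-OCR",
    torch_dtype=torch.bfloat16,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```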

- **Developed by:** Shubh Mishra, Aug 2024
- **Model Type:** VLM
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Idefics2-8b