LZHgrla committed on
Commit
d8a7d36
1 Parent(s): 8634af6

Create README.md

Files changed (1)
  1. README.md +224 -0
README.md ADDED
@@ -0,0 +1,224 @@
+ ---
+ datasets:
+ - Lin-Chen/ShareGPT4V
+ pipeline_tag: image-text-to-text
+ library_name: xtuner
+ ---
+
+ <div align="center">
+   <img src="https://github.com/InternLM/lmdeploy/assets/36994684/0cf8d00f-e86b-40ba-9b54-dc8f1bc6c8d8" width="600"/>
+
+
+ [![Generic badge](https://img.shields.io/badge/GitHub-%20XTuner-black.svg)](https://github.com/InternLM/xtuner)
+
+
+ </div>
+
+ ## Model
+
+ llava-phi-3-mini is a LLaVA model fine-tuned from [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) and [CLIP-ViT-Large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336) with [ShareGPT4V-PT](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and [InternVL-SFT](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets) by [XTuner](https://github.com/InternLM/xtuner).
+
+ **Note: This model is in the official LLaVA format. The models in XTuner LLaVA format and HuggingFace LLaVA format are available at [xtuner/llava-phi-3-mini-xtuner](https://huggingface.co/xtuner/llava-phi-3-mini-xtuner) and [xtuner/llava-phi-3-mini-hf](https://huggingface.co/xtuner/llava-phi-3-mini-hf), respectively.**
+
+
+ ## Details
+
+ | Model | Visual Encoder | Projector | Resolution | Pretraining Strategy | Fine-tuning Strategy | Pretrain Dataset | Fine-tune Dataset |
+ | :-------------------- | ------------------: | --------: | ---------: | ---------------------: | ------------------------: | ------------------------: | -----------------------: |
+ | LLaVA-v1.5-7B | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, Frozen ViT | LLaVA-PT (558K) | LLaVA-Mix (665K) |
+ | LLaVA-Llama-3-8B | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, LoRA ViT | LLaVA-PT (558K) | LLaVA-Mix (665K) |
+ | LLaVA-Llama-3-8B-v1.1 | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, LoRA ViT | ShareGPT4V-PT (1246K) | InternVL-SFT (1268K) |
+ | LLaVA-Phi-3-mini | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, Full ViT | ShareGPT4V-PT (1246K) | InternVL-SFT (1268K) |
+
+ ## Results
+
+
+ ## Quickstart
+
+ ### Chat with the official LLaVA library
+
+ 1. Install the official LLaVA library
+
+ ```bash
+ pip install git+https://github.com/haotian-liu/LLaVA.git
+ ```
+
+ 2. Chat using the script below
+
+ <details>
+ <summary>cli.py</summary>
+
+ ```python
+ import argparse
+ from io import BytesIO
+
+ import requests
+ import torch
+ from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
+ from llava.conversation import Conversation, SeparatorStyle
+ from llava.mm_utils import process_images, tokenizer_image_token
+ from llava.model import LlavaLlamaForCausalLM
+ from PIL import Image
+ from transformers import (AutoTokenizer, BitsAndBytesConfig, StoppingCriteria,
+                           StoppingCriteriaList, TextStreamer)
+
+
+ def load_image(image_file):
+     if image_file.startswith('http://') or image_file.startswith('https://'):
+         response = requests.get(image_file)
+         image = Image.open(BytesIO(response.content)).convert('RGB')
+     else:
+         image = Image.open(image_file).convert('RGB')
+     return image
+
+
+ class StopWordStoppingCriteria(StoppingCriteria):
+     """StopWord stopping criteria."""
+
+     def __init__(self, tokenizer, stop_word):
+         self.tokenizer = tokenizer
+         self.stop_word = stop_word
+         self.length = len(self.stop_word)
+
+     def __call__(self, input_ids, *args, **kwargs) -> bool:
+         cur_text = self.tokenizer.decode(input_ids[0])
+         cur_text = cur_text.replace('\r', '').replace('\n', '')
+         return cur_text[-self.length:] == self.stop_word
+
+
+ def get_stop_criteria(tokenizer, stop_words=[]):
+     stop_criteria = StoppingCriteriaList()
+     for word in stop_words:
+         stop_criteria.append(StopWordStoppingCriteria(tokenizer, word))
+     return stop_criteria
+
+
+ def main(args):
+     kwargs = {'device_map': args.device}
+     if args.load_8bit:
+         kwargs['load_in_8bit'] = True
+     elif args.load_4bit:
+         kwargs['load_in_4bit'] = True
+         kwargs['quantization_config'] = BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_compute_dtype=torch.float16,
+             bnb_4bit_use_double_quant=True,
+             bnb_4bit_quant_type='nf4')
+     else:
+         kwargs['torch_dtype'] = torch.float16
+
+     tokenizer = AutoTokenizer.from_pretrained(args.model_path)
+     model = LlavaLlamaForCausalLM.from_pretrained(
+         args.model_path, low_cpu_mem_usage=True, **kwargs)
+     vision_tower = model.get_vision_tower()
+     if not vision_tower.is_loaded:
+         vision_tower.load_model(device_map=args.device)
+     image_processor = vision_tower.image_processor
+
+     conv = Conversation(
+         system='<|start_header_id|>system<|end_header_id|>\n\nAnswer the questions.',
+         roles=('<|start_header_id|>user<|end_header_id|>\n\n',
+                '<|start_header_id|>assistant<|end_header_id|>\n\n'),
+         messages=[],
+         offset=0,
+         sep_style=SeparatorStyle.MPT,
+         sep='<|eot_id|>',
+     )
+     roles = conv.roles
+
+     image = load_image(args.image_file)
+     image_size = image.size
+     image_tensor = process_images([image], image_processor, model.config)
+
+     if type(image_tensor) is list:
+         image_tensor = [
+             image.to(model.device, dtype=torch.float16)
+             for image in image_tensor
+         ]
+     else:
+         image_tensor = image_tensor.to(model.device, dtype=torch.float16)
+
+     while True:
+         try:
+             inp = input(f'{roles[0]}: ')
+         except EOFError:
+             inp = ''
+         if not inp:
+             print('exit...')
+             break
+
+         print(f'{roles[1]}: ', end='')
+
+         if image is not None:
+             inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
+             image = None
+
+         conv.append_message(conv.roles[0], inp)
+         conv.append_message(conv.roles[1], None)
+         prompt = conv.get_prompt()
+
+         input_ids = tokenizer_image_token(
+             prompt, tokenizer, IMAGE_TOKEN_INDEX,
+             return_tensors='pt').unsqueeze(0).to(model.device)
+         stop_criteria = get_stop_criteria(
+             tokenizer=tokenizer, stop_words=[conv.sep])
+
+         streamer = TextStreamer(
+             tokenizer, skip_prompt=True, skip_special_tokens=True)
+
+         with torch.inference_mode():
+             output_ids = model.generate(
+                 input_ids,
+                 images=image_tensor,
+                 image_sizes=[image_size],
+                 do_sample=True if args.temperature > 0 else False,
+                 temperature=args.temperature,
+                 max_new_tokens=args.max_new_tokens,
+                 streamer=streamer,
+                 stopping_criteria=stop_criteria,
+                 use_cache=True)
+
+         outputs = tokenizer.decode(output_ids[0]).strip()
+         conv.messages[-1][-1] = outputs
+
+         if args.debug:
+             print('\n', {'prompt': prompt, 'outputs': outputs}, '\n')
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         '--model-path', type=str, default='xtuner/llava-phi-3-mini')
+     parser.add_argument('--image-file', type=str, required=True)
+     parser.add_argument('--device', type=str, default='auto')
+     parser.add_argument('--temperature', type=float, default=0.2)
+     parser.add_argument('--max-new-tokens', type=int, default=512)
+     parser.add_argument('--load-8bit', action='store_true')
+     parser.add_argument('--load-4bit', action='store_true')
+     parser.add_argument('--debug', action='store_true')
+     args = parser.parse_args()
+     main(args)
+ ```
+
+ </details>
+
+ ```bash
+ # example
+ python ./cli.py --model-path xtuner/llava-phi-3-mini --image-file https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg --load-4bit
+ ```
+
+
+ ### Reproduction
+
+ Please refer to [docs](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336#readme).
+
+ ## Citation
+
+ ```bibtex
+ @misc{2023xtuner,
+     title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
+     author={XTuner Contributors},
+     howpublished = {\url{https://github.com/InternLM/xtuner}},
+     year={2023}
+ }
+ ```
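
Reviewer note on `cli.py` (not part of the commit): the script relies on `conv.get_prompt()` to turn the system string, roles, and separator into a single prompt. The sketch below is a dependency-free approximation of that concatenation, assuming the `SeparatorStyle.MPT` behaviour of the LLaVA library (system text plus separator, then each completed turn followed by the separator, with the last open turn left as a bare role header). The helper name `build_prompt` is hypothetical, the literal `<image>` stands in for `DEFAULT_IMAGE_TOKEN`, and the template strings are copied from the script as committed.

```python
def build_prompt(system, sep, messages):
    """Approximate an MPT-style prompt: system + sep, then one chunk per turn."""
    prompt = system + sep
    for role, message in messages:
        if message:
            # Completed turn: role header, message text, then the separator.
            prompt += role + message + sep
        else:
            # Open turn (message is None): only the role header is emitted,
            # so generation continues from this point.
            prompt += role
    return prompt


# Template values copied from cli.py in this commit.
SYSTEM = '<|start_header_id|>system<|end_header_id|>\n\nAnswer the questions.'
USER = '<|start_header_id|>user<|end_header_id|>\n\n'
ASSISTANT = '<|start_header_id|>assistant<|end_header_id|>\n\n'
SEP = '<|eot_id|>'

# First round of the chat loop: the image placeholder is prepended to the
# user's text, and the assistant turn is left open (None).
prompt = build_prompt(
    system=SYSTEM,
    sep=SEP,
    messages=[(USER, '<image>\nWhat is in the picture?'), (ASSISTANT, None)],
)
print(prompt)
```

Because the separator doubles as the stop word passed to `get_stop_criteria`, generation in the script halts once the decoded text ends with `conv.sep`.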