Gregor committed on
Commit
51d7ff8
1 Parent(s): bccb36d

Initial release

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,187 @@
---
language:
- en
- multilingual
license: mit
tags:
- vision
- image-to-text
- image-captioning
- visual-question-answering
pipeline_tag: image-to-text
inference: false
---

# mBLIP mT0-XL

This is the model checkpoint for our work [**mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs**](TODO arxiv).

## Model description
mBLIP is a [BLIP-2](https://arxiv.org/abs/2301.12597) model that consists of three sub-models: a Vision Transformer (ViT), a Query-Transformer (Q-Former), and a large language model (LLM).

The Q-Former and ViT have both been initialized from an English BLIP-2 checkpoint ([blip2-flan-t5-xl](https://huggingface.co/Salesforce/blip2-flan-t5-xl)) and then re-aligned
to the multilingual LLM ([mt0-xl](https://huggingface.co/bigscience/mt0-xl)) using a [multilingual task mixture](https://huggingface.co/datasets/Gregor/mblip-train).

<img src="https://github.com/gregor-ge/mBLIP/blob/main/architecture.png"
alt="The mBLIP architecture" width="600"/>

This allows the model to be used for tasks like:

- image captioning
- visual question answering (VQA)

in 96 languages.

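In the Hugging Face `transformers` implementation, these three components are exposed as separate sub-modules. The following is a minimal inspection sketch (illustration only, not part of the original model card; attribute names as in transformers >= 4.30):

```python
# Illustration: the three sub-models that make up mBLIP, as exposed by
# transformers' Blip2ForConditionalGeneration.
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl")

print(type(model.vision_model).__name__)    # ViT image encoder
print(type(model.qformer).__name__)         # Q-Former bridging vision and language
print(type(model.language_model).__name__)  # mT0-XL, an mT5-style encoder-decoder
print(model.query_tokens.shape)             # the 32 learned query tokens
```
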
#### Languages
mBLIP was trained on the following 96 languages:

`af, am, ar, az, be, bg, bn, ca, ceb, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fil, fr, ga, gd, gl, gu, ha, hi, ht, hu, hy, id, ig, is, it, iw, ja, jv, ka, kk, km, kn, ko, ku, ky, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tr, uk, ur, uz, vi, xh, yi, yo, zh, zu`


## Direct Use and Downstream Use

You can use the raw model for conditional text generation given an image and a prompt text in a zero-shot setup, or
alternatively finetune it for downstream applications.
When finetuning, we strongly recommend applying LoRA to the LLM and using bf16 as the data type; standard fp16 can cause NaN losses.

See [our repository](https://github.com/gregor-ge/mBLIP) for the code used to train and finetune this model.
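As an illustration of the recommendation above (this is not the training code from the repository), here is a minimal sketch of attaching LoRA to the LLM with the `peft` library; the rank, alpha, dropout, and target modules below are assumptions, not the settings used to train mBLIP:

```python
# Sketch only: wrap the LLM part of mBLIP with LoRA adapters for finetuning.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Gregor/mblip-mt0-xl", torch_dtype=torch.bfloat16  # bf16, not fp16
)

lora_config = LoraConfig(
    r=8,                         # assumed rank, adjust for your task
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q", "v"],   # attention projections in the mT5-based LLM (assumption)
)

# LoRA is applied to the LLM only; the ViT and Q-Former are left untouched here.
model.language_model = get_peft_model(model.language_model, lora_config)
model.language_model.print_trainable_parameters()
```
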

## Bias, Risks, Limitations, and Ethical Considerations

While mBLIP can in theory work with up to 100 languages, in practice we expect the best results when prompting it in high-resource languages
such as English, German, or Spanish.

mBLIP inherits the risks, limitations, and biases of the models used to initialize it.
mBLIP has not been tested in real-world applications and should not be deployed directly in any application. Researchers should first carefully assess its safety and fairness in the specific context in which it would be deployed.

### How to use

For code examples, we refer to the BLIP-2 [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/blip-2#transformers.Blip2ForConditionalGeneration.forward.example).

#### Running the model on CPU

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Describe the image in German."
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>

#### Running the model on GPU

##### In full precision

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Describe the image in German."
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>

##### In half precision (`bfloat16`)

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl", torch_dtype=torch.bfloat16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Describe the image in German."
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.bfloat16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>

##### In 8-bit precision (`int8`)

<details>
<summary> Click to expand </summary>

```python
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Describe the image in German."
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.bfloat16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>

## Citation
If you use our model, please cite the following:
```
@article{geigle2023mblip,
  author       = {Gregor Geigle and
                  Abhay Jain and
                  Radu Timofte and
                  Goran Glava\v{s}},
  title        = {TODO},
  journal      = {arXiv},
  volume       = {abs/TODO},
  year         = {2023},
  url          = {https://arxiv.org/abs/TODO},
  eprinttype   = {arXiv},
  eprint       = {TODO},
}
```
config.json ADDED
@@ -0,0 +1,256 @@
{
  "_commit_hash": "56fa1691779eaa22d603ca6ffa463f9adc05ac5f",
  "architectures": [
    "Blip2ForConditionalGeneration"
  ],
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "is_encoder_decoder": true,
  "model_type": "blip-2",
  "num_query_tokens": 32,
  "qformer_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_frequency": 2,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_hidden_size": 1408,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-12,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 512,
    "min_length": 0,
    "model_type": "blip_2_qformer",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 12,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 0,
    "position_embedding_type": "absolute",
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.30.2",
    "typical_p": 1.0,
    "use_bfloat16": false,
    "vocab_size": 30522
  },
  "text_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
      "MT5ForConditionalGeneration"
    ],
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "d_ff": 5120,
    "d_kv": 64,
    "d_model": 2048,
    "decoder_start_token_id": 0,
    "dense_act_fn": "gelu_new",
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout_rate": 0.1,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 1,
    "exponential_decay_length_penalty": null,
    "feed_forward_proj": "gated-gelu",
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_factor": 1.0,
    "is_decoder": false,
    "is_encoder_decoder": true,
    "is_gated_act": true,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_epsilon": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "mt5",
    "no_repeat_ngram_size": 0,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_decoder_layers": 24,
    "num_heads": 32,
    "num_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_past": true,
    "output_scores": false,
    "pad_token_id": 0,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "relative_attention_max_distance": 128,
    "relative_attention_num_buckets": 32,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": "T5Tokenizer",
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "float32",
    "torchscript": false,
    "transformers_version": "4.30.2",
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": true,
    "vocab_size": 250112
  },
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": null,
  "use_decoder_only_language_model": false,
  "vision_config": {
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1408,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 224,
    "initializer_factor": 1.0,
    "initializer_range": 1e-10,
    "intermediate_size": 6144,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "blip_2_vision_model",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 39,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "projection_dim": 512,
    "pruned_heads": {},
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "transformers_version": "4.30.2",
    "typical_p": 1.0,
    "use_bfloat16": false
  }
}
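As a quick orientation (illustration only, not part of the repository), the composite config above can be loaded with `transformers` and its three sub-model configs inspected directly:

```python
# Illustration: how the composite config maps onto the three sub-models.
from transformers import Blip2Config

config = Blip2Config.from_pretrained("Gregor/mblip-mt0-xl")

print(config.num_query_tokens)           # 32 query tokens
print(config.vision_config.hidden_size)  # 1408 (ViT)
print(config.qformer_config.hidden_size) # 768 (Q-Former)
print(config.text_config.d_model)        # 2048 (mT0-XL / mT5)
```
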
preprocessor_config.json ADDED
@@ -0,0 +1,24 @@
{
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "BlipImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "processor_class": "Blip2Processor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 224,
    "width": 224
  }
}
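For illustration (not part of the original files), the preprocessing defined above, resizing to 224x224, rescaling by 1/255, and normalizing with the listed mean/std, is applied automatically when calling the processor:

```python
# Sketch: the image processor applies the resize/rescale/normalize steps
# from preprocessor_config.json.
import requests
from PIL import Image
from transformers import Blip2Processor

processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")

img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])
```
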
pytorch_model-00001-of-00004.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f80a3171dbf88b25ff11a081c98743531b8d900ae978746757805438b1593398
size 4984958473
pytorch_model-00002-of-00004.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d702c503e1d2d8f2b1934f0db29b905013571a243b4aa6a4051599737b750562
size 4991759780
pytorch_model-00003-of-00004.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fcd17e77536776e747123170ee41ea4aae61c3cd692ac95e6ec7961a190c9e44
size 4958243691
pytorch_model-00004-of-00004.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a233952430b597a4bef19e680e717597e5d6f7fb9783adee41357c4b1be4ea4d
size 2434841975
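The four `.bin` entries above are Git LFS pointer files recording an `oid` (SHA-256) and `size` for each weight shard. As an optional check (illustration only, hypothetical local path), a downloaded shard can be verified against its pointer:

```python
# Sketch: verify a downloaded shard against the oid and size from its LFS pointer.
import hashlib
import os

path = "pytorch_model-00001-of-00004.bin"  # hypothetical local file
expected_oid = "f80a3171dbf88b25ff11a081c98743531b8d900ae978746757805438b1593398"
expected_size = 4984958473

sha256 = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        sha256.update(chunk)

assert os.path.getsize(path) == expected_size
assert sha256.hexdigest() == expected_oid
print("shard matches its LFS pointer")
```
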
pytorch_model.bin.index.json ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
{
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "unk_token": "<unk>"
}
spiece.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
size 4309802
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:99cc999819aaabf74898a252863b10d86fbcd86e8b3f65c118ff334ff85c5ea5
size 16315121
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
{
  "additional_special_tokens": null,
  "clean_up_tokenization_spaces": true,
  "eos_token": "</s>",
  "extra_ids": 0,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "processor_class": "Blip2Processor",
  "sp_model_kwargs": {},
  "tokenizer_class": "T5Tokenizer",
  "unk_token": "<unk>"
}
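For illustration (not part of the original files), the tokenizer files above (`spiece.model`, `tokenizer.json`, and the configs) load as the standard mT5 SentencePiece tokenizer via `transformers`:

```python
# Sketch: load the tokenizer shipped with this checkpoint and tokenize a prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Gregor/mblip-mt0-xl")

ids = tokenizer("Describe the image in German.").input_ids
print(len(tokenizer))                       # size of the mT5 vocabulary
print(tokenizer.convert_ids_to_tokens(ids)) # SentencePiece tokens plus </s>
```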