---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- image-text-to-text
- text-to-text
- image-text-to-image-text
pipeline_tag: image-text-to-text
---

# LeroyDyer/Mixtral_AI_Cyber_Q_Vision


VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture with one of the library's base vision model classes as the encoder and another one as the decoder. The encoder is created with the `transformers.AutoModel.from_pretrained` class method and the decoder with the `transformers.AutoModelForCausalLM.from_pretrained` class method.
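
A hedged sketch of that pairing (the checkpoint names below are generic public placeholders, not the components this model was built from):

```python
from transformers import VisionEncoderDecoderModel

# Pair any pretrained vision encoder with any pretrained causal LM decoder.
# These checkpoints are illustrative only.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # encoder, loaded via AutoModel.from_pretrained
    "mistralai/Mistral-7B-v0.1",          # decoder, loaded via AutoModelForCausalLM.from_pretrained
)
```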

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub.

This is an experiment in vision: the model was created as a Mistral-based VisionEncoderDecoder.

Customized from:

```yaml
- Mixtral_AI_Cyber_Matrix_2.0 (7B)
- TinyMistral (248M)
- ikim-uk-essen/BiomedCLIP_ViT_patch16_224
```

- **Developed by:** LeroyDyer
- **Model type:** image-text-to-image-text
- **Language(s) (NLP):** English


## How to Get Started with the Model

```python
from transformers import AutoProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image

processor = AutoProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# load an example image from the IAM handwriting dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# configure the special tokens the decoder needs for training and generation
model.config.decoder_start_token_id = processor.tokenizer.eos_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size

# training: encode the image and compute a loss against the target text
pixel_values = processor(image, return_tensors="pt").pixel_values
text = "hello world"
labels = processor.tokenizer(text, return_tensors="pt").input_ids
outputs = model(pixel_values=pixel_values, labels=labels)
loss = outputs.loss

# inference (generation)
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
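
Generation can also be steered with explicit decoding parameters; a minimal sketch (the values are illustrative assumptions, not tuned for this model):

```python
# Beam search with a length cap; parameter values are assumptions, not recommendations.
generated_ids = model.generate(
    pixel_values,
    max_new_tokens=64,   # cap the length of the generated text
    num_beams=4,         # beam search often helps captioning-style tasks
    early_stopping=True,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```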


## Training Details

```python
from transformers import ViTImageProcessor, AutoTokenizer, VisionEncoderDecoderModel
from datasets import load_dataset

image_processor = ViTImageProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
# this repository holds the combined checkpoint, so load it directly
model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# Mistral-style tokenizers have no CLS token, so use BOS as the decoder start;
# fall back to EOS if the tokenizer defines no PAD token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
pixel_values = image_processor(image, return_tensors="pt").pixel_values

labels = tokenizer(
    "an image of two cats chilling on a couch",
    return_tensors="pt",
).input_ids

# the forward pass automatically creates the correct decoder_input_ids from the labels
loss = model(pixel_values=pixel_values, labels=labels).loss
```
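
Continuing from the variables above, a minimal single-step fine-tuning sketch (the optimizer choice and learning rate are assumptions, not values from this card):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # assumed hyperparameters

model.train()
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this step would run inside a loop over a DataLoader of (image, caption) pairs.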


### Model Architecture and Objective

```python
from transformers import MistralConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

# Initializing a ViT & Mistral style configuration
config_encoder = ViTConfig()
config_decoder = MistralConfig()

config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)

# Initializing a ViTMistral model (with random weights) from the ViT & Mistral style configuration
model = VisionEncoderDecoderModel(config=config)

# Accessing the model configuration
config_encoder = model.config.encoder
config_decoder = model.config.decoder
# set the decoder config to causal LM with cross-attention
config_decoder.is_decoder = True
config_decoder.add_cross_attention = True

# Saving the model, including its configuration
model.save_pretrained("my-model")

# Loading the model and config from the pretrained folder
encoder_decoder_config = VisionEncoderDecoderConfig.from_pretrained("my-model")
model = VisionEncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)
```
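
The default ViT and Mistral configs build a very large randomly initialized model, so for quick structural experiments it can help to shrink both sides first; a sketch with illustrative sizes (arbitrary assumptions, not this card's dimensions):

```python
# Tiny configs for a fast structural smoke test; all sizes are assumed.
small_config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    ViTConfig(hidden_size=64, num_hidden_layers=2, num_attention_heads=2, intermediate_size=128),
    MistralConfig(hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
                  num_key_value_heads=2, intermediate_size=128, vocab_size=1000),
)
small_model = VisionEncoderDecoderModel(config=small_config)
print(sum(p.numel() for p in small_model.parameters()))  # rough parameter count
```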