---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- image-text-to-text
- text-to-text
- image-text-to-image-text
pipeline_tag: image-text-to-text
---
# LeroyDyer/Mixtral_AI_Cyber_Q_Vision
`VisionEncoderDecoderModel` is a generic model class that is instantiated as a transformer architecture with one of the library's base vision model classes as the encoder and another as the decoder: the encoder is loaded with the `transformers.AutoModel.from_pretrained` class method and the decoder with the `transformers.AutoModelForCausalLM.from_pretrained` class method.
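A minimal sketch of that pattern, assuming generic public checkpoints (`google/vit-base-patch16-224-in21k` and `mistralai/Mistral-7B-v0.1` are illustrative stand-ins here, not the components of this model):
```python
from transformers import VisionEncoderDecoderModel

# Illustrative checkpoints only; from_encoder_decoder_pretrained loads the
# encoder via AutoModel and the decoder via AutoModelForCausalLM, and enables
# cross-attention in the decoder config automatically.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",
    "mistralai/Mistral-7B-v0.1",
)
```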
### Model Description
This is the model card of a 🤗 transformers model that has been pushed to the Hub.
It is an experiment in vision: the model was assembled as a Mistral-based `VisionEncoderDecoderModel`, customized from:
- Mixtral_AI_Cyber_Matrix_2.0 (7B)
- TinyMistral (248M)
- ikim-uk-essen/BiomedCLIP_ViT_patch16_224
- **Developed by:** LeroyDyer
- **Model type:** image-text-to-image-text
- **Language(s) (NLP):** English
## How to Get Started with the Model
```python
from transformers import AutoProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# load an image from the IAM handwriting dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# training: the decoder start, padding, and vocab size must be set on the
# model config before a loss can be computed
model.config.decoder_start_token_id = processor.tokenizer.eos_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size

pixel_values = processor(image, return_tensors="pt").pixel_values
text = "hello world"
labels = processor.tokenizer(text, return_tensors="pt").input_ids
outputs = model(pixel_values=pixel_values, labels=labels)
loss = outputs.loss

# inference (generation)
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
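The call to `generate` above uses the library defaults; a sketch with explicit decoding parameters (the values here are illustrative, not tuned for this checkpoint):
```python
# max_new_tokens and num_beams are standard transformers generation
# arguments; the values are arbitrary placeholders
generated_ids = model.generate(pixel_values, max_new_tokens=64, num_beams=4)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```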
## Training Details
```python
from transformers import ViTImageProcessor, AutoTokenizer, VisionEncoderDecoderModel
from datasets import load_dataset

image_processor = ViTImageProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

# initialize the encoder and the decoder from the same repo
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "LeroyDyer/Mixtral_AI_Cyber_Q_Vision", "LeroyDyer/Mixtral_AI_Cyber_Q_Vision"
)

# Mistral-style tokenizers usually have no CLS token, so fall back to BOS
model.config.decoder_start_token_id = tokenizer.cls_token_id or tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
pixel_values = image_processor(image, return_tensors="pt").pixel_values
labels = tokenizer(
    "an image of two cats chilling on a couch",
    return_tensors="pt",
).input_ids

# the forward pass automatically builds the correct decoder_input_ids from labels
loss = model(pixel_values=pixel_values, labels=labels).loss
```
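A minimal sketch of one optimizer step around that loss, assuming the `model`, `pixel_values`, and `labels` from the block above (the learning rate is an arbitrary placeholder):
```python
import torch

# one gradient step on the single example prepared above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```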
### Model Architecture and Objective
```python
from transformers import MistralConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel
# Initializing a ViT & Mistral style configuration
config_encoder = ViTConfig()
config_decoder = MistralConfig()
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
# Initializing a ViT+Mistral model (with random weights) from ViT & Mistral style configurations
model = VisionEncoderDecoderModel(config=config)
# Accessing the model configuration
config_encoder = model.config.encoder
config_decoder = model.config.decoder
# set decoder config to causal lm
config_decoder.is_decoder = True
config_decoder.add_cross_attention = True
# Saving the model, including its configuration
model.save_pretrained("my-model")
# loading model and config from pretrained folder
encoder_decoder_config = VisionEncoderDecoderConfig.from_pretrained("my-model")
model = VisionEncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)
```
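A quick sanity check (a sketch, assuming the `model` reloaded above) that the decoder flags set before saving survive the save/reload round trip:
```python
# the reloaded config should keep the causal-LM decoder settings
assert model.config.decoder.is_decoder
assert model.config.decoder.add_cross_attention
```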