---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- image-text-to-text
- text-to-text
- image-text-to-image-text
pipeline_tag: image-text-to-text
BaseModel:
- Mixtral_AI_Cyber_Matrix_2.0(7b)
Decoder:
- Locutusque/TinyMistral-248M-v2
ImageProcessor:
- ikim-uk-essen/BiomedCLIP_ViT_patch16_224
- Lin-Chen/ShareGPT4V-7B_Pretrained_vit-large336-l12
Encoder:
- google/vit-base-patch16-224-in21k
---
# LeroyDyer/Mixtral_AI_Cyber_Q_Vision
VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture with one of the library's base vision model classes as the encoder and another as the decoder, created with the following class methods:
```python
# class method used to load the encoder:
transformers.AutoModel.from_pretrained
# class method used to load the decoder:
transformers.AutoModelForCausalLM.from_pretrained
```
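For instance, the two halves can be paired in one call; a minimal sketch using the encoder and decoder checkpoints listed in this card's metadata:
```python
from transformers import VisionEncoderDecoderModel

# a minimal sketch: pair the card's listed vision encoder with its tiny decoder;
# internally the encoder loads via AutoModel and the decoder via AutoModelForCausalLM
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # Encoder listed in this card
    "Locutusque/TinyMistral-248M-v2",     # Decoder listed in this card
)
```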
### Model Description
This is an experiment in vision: the model has been created as a Mistral VisionEncoderDecoder,
customized from:
```yaml
BaseModel:
- Mixtral_AI_Cyber_Matrix_2.0(7b)
Decoder:
- Locutusque/TinyMistral-248M-v2
ImageProcessor:
- ikim-uk-essen/BiomedCLIP_ViT_patch16_224
- Lin-Chen/ShareGPT4V-7B_Pretrained_vit-large336-l12
Encoder:
- google/vit-base-patch16-224-in21k
```
- **Developed by:** LeroyDyer
- **Model type:** image-text-to-image-text
- **Language(s) (NLP):** English
## Summary
This is the model card of a 🤗 transformers model that has been pushed to the Hub.
Previous vision models have been hit-and-miss, as a multimodal model actually requires a lot of memory, GPU, and hard-drive space to create;
past versions have been attempts to merge the capabilities into the main Mistral model whilst still retaining its Mistral tag!
After reading many Hugging Face articles:
the backbone issue is the main obstacle in creating multimodal models!
With the advent of tiny models, we are able to leverage the decoder's abilities as a single expert of sorts within the model,
by reducing the size to a fully trained tiny model.
This will only produce decodings and not conversations, so it needs to be smart and respond with defined answers; in general it will produce captions, but as a domain-based model it may be specialized in medical imagery, art, etc.
The main LLM still needs to retain these models within it, hence the backbone method of instantiating a VisionEncoderDecoder model, instead of a LLaVA model, which still needs wrangling to work correctly without spoiling the original transformers installation.
Previous experiments proved that the large Mistral model could be used as a decoder, but the total model jumped to 13B; when applying the tiny model instead, the total only grew by the decoder's 248M parameters.
## How to Get Started with the Model
### VisionEncoderDecoderModel
#### As a vision encoder model :
The tensors are combined into the original Mistral model, so it can be accessed by instantiating the correct model class, which is VisionEncoderDecoderModel:
```python
from transformers import AutoProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image
import torch
processor = AutoProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
model = VisionEncoderDecoderModel.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# training: configure the special tokens the decoder needs
model.config.decoder_start_token_id = processor.tokenizer.eos_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size
pixel_values = processor(image, return_tensors="pt").pixel_values
text = "hello world"
labels = processor.tokenizer(text, return_tensors="pt").input_ids
outputs = model(pixel_values=pixel_values, labels=labels)
loss = outputs.loss
# inference (generation)
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
### As a standard LLM:
It can also still be used as a normal AutoModelForCausalLM (i.e. MistralForCausalLM)!
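A minimal sketch, assuming the repository's language-model weights load under the standard causal-LM head:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# a minimal sketch, assuming the repo loads as a plain causal LM
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
model = AutoModelForCausalLM.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```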
## Training Details
Currently the inputs are raw and untrained;
i.e. they NEED to be trained, as the newly added tensors (e.g. the cross-attention) are presumably randomly initialized,
despite using pretrained starting blocks. The encoder/decoder modules are ready to be placed in train mode.
The main model, i.e. the LLM, will need LoRA/QLoRA/PEFT etc. (see the sketch after the training example below).
This model will stay in this state as a base training point, so later versions will be trained;
the model is fully usable and still expected to score well.
The small TinyMistral is also a great performer and a great block to begin a smaller experts model (later) or any multimodal project; it is like a mini pretrained BERT/LLaMA (Mistral shares the LLaMA architecture).
```python
from transformers import ViTImageProcessor, AutoTokenizer, VisionEncoderDecoderModel
from datasets import load_dataset
image_processor = ViTImageProcessor.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
# load encoder and decoder weights (here both halves come from the same repository)
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "LeroyDyer/Mixtral_AI_Cyber_Q_Vision", "LeroyDyer/Mixtral_AI_Cyber_Q_Vision"
)
model.config.decoder_start_token_id = tokenizer.bos_token_id  # a Mistral tokenizer has no CLS token
model.config.pad_token_id = tokenizer.pad_token_id
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
pixel_values = image_processor(image, return_tensors="pt").pixel_values
labels = tokenizer(
"an image of two cats chilling on a couch",
return_tensors="pt",
).input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(pixel_values=pixel_values, labels=labels).loss
```
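As noted above, the main LLM will need LoRA/QLoRA/PEFT for efficient fine-tuning. A minimal sketch with the `peft` library, assuming the decoder exposes the standard Mistral attention projection names (`q_proj`, `v_proj`):
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# a minimal sketch, assuming standard Mistral projection module names
base = AutoModelForCausalLM.from_pretrained("LeroyDyer/Mixtral_AI_Cyber_Q_Vision")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumed Mistral attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```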
### Model Architecture
Aha! Here is how you create such a model:
```python
from transformers import MistralConfig, ViTConfig, VisionEncoderDecoderConfig, VisionEncoderDecoderModel
# Initializing a ViT & Mistral style configuration
config_encoder = ViTConfig()
config_decoder = MistralConfig()
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
# Initializing a ViTMistral model (with random weights) from a ViT & Mistral style configurations
model = VisionEncoderDecoderModel(config=config)
# Accessing the model configuration
config_encoder = model.config.encoder
config_decoder = model.config.decoder
# set decoder config to causal lm
config_decoder.is_decoder = True
config_decoder.add_cross_attention = True
# Saving the model, including its configuration
model.save_pretrained("my-model")
# loading model and config from pretrained folder
encoder_decoder_config = VisionEncoderDecoderConfig.from_pretrained("my-model")
model = VisionEncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)
```
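A quick sanity-check sketch: count the parameters each half contributes to the model built above (cf. the 248M decoder point earlier):
```python
# compare encoder vs decoder parameter counts on the model loaded above
n_enc = sum(p.numel() for p in model.encoder.parameters())
n_dec = sum(p.numel() for p in model.decoder.parameters())
print(f"encoder: {n_enc / 1e6:.0f}M parameters, decoder: {n_dec / 1e6:.0f}M parameters")
```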