---
license: mit
language:
- en
library_name: transformers
tags:
- art
- medical
- biology
- code
- chemistry
metrics:
- code_eval
- chrf
- charcut_mt
- cer
- brier_score
- bleurt
- bertscore
- accuracy
pipeline_tag: image-text-to-text
---

# MULTI-MODAL-MODEL

## LeroyDyer/Mixtral_AI_Vision-Instruct_X

This model is currently in test mode.

# Vision/multimodal capabilities:

If you want to use vision functionality:

* You must use the latest version of [Koboldcpp](https://github.com/LostRuins/koboldcpp).

To use the multimodal capabilities of this model and use **vision**, you need to load the specified **mmproj** file, which can be found inside this model repo ([LeroyDyer/Mixtral_AI_Vision-Instruct_X](https://huggingface.co/LeroyDyer/Mixtral_AI_Vision-Instruct_X)).

* You can load the **mmproj** by using the corresponding section in the interface:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65d4cf2693a0a3744a27536c/UX6Ubss2EPNAT3SKGMLe0.png)

## Choosing the mmproj file:

* For 4-bit loading, use the 4-bit mmproj file: mmproj-Mixtral_AI_Vision-Instruct_X-Q4_0

* For 8-bit loading, use the 8-bit mmproj file: mmproj-Mixtral_AI_Vision-Instruct_X-Q8_0

* For full-precision loading, use the f16 mmproj file: mmproj-Mixtral_AI_Vision-Instruct_X-f16

## Extended capabilities:

```
* mistralai/Mistral-7B-Instruct-v0.1 - Prime-Base
* ChaoticNeutrals/Eris-LelantaclesV2-7b - role play
* ChaoticNeutrals/Eris_PrimeV3-Vision-7B - vision
* rvv-karma/BASH-Coder-Mistral-7B - coding
* Locutusque/Hercules-3.1-Mistral-7B - Unhinging
* KoboldAI/Mistral-7B-Erebus-v3 - NSFW
* Locutusque/Hyperion-2.1-Mistral-7B - CHAT
* Severian/Nexus-IKM-Mistral-7B-Pytorch - Thinking
* NousResearch/Hermes-2-Pro-Mistral-7B - Generalizing
* mistralai/Mistral-7B-Instruct-v0.2 - BASE
* Nitral-AI/ProdigyXBioMistral_7B - medical
* Nitral-AI/Infinite-Mika-7b - 128k - Context Expansion enforcement
* Nous-Yarn-Mistral-7b-128k - 128k - Context Expansion
* yanismiraoui/Yarn-Mistral-7b-128k-sharded
* ChaoticNeutrals/Eris_Prime-V2-7B - Roleplay
```

# "image-text-to-text"

## Using transformers

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from transformers import BitsAndBytesConfig
import torch

# Load the model in 4-bit to reduce VRAM usage
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model_id = "LeroyDyer/Mixtral_AI_Vision-Instruct_X"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, quantization_config=quantization_config, device_map="auto")

import requests
from PIL import Image

# Fetch two example images
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
display(image1)  # display() requires a Jupyter/IPython environment; use image1.show() in a plain script
display(image2)

# One prompt per image; <image> marks where the image tokens are inserted
prompts = [
    "USER: <image>\nWhat are the things I should be cautious about when I visit this place? What should I bring with me?\nASSISTANT:",
    "USER: <image>\nPlease describe this image\nASSISTANT:",
]

inputs = processor(text=prompts, images=[image1, image2], padding=True, return_tensors="pt").to("cuda")
for k, v in inputs.items():
    print(k, v.shape)
```
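
The snippet above only prepares and inspects the inputs; a minimal generation step on top of it might look like this (a sketch, assuming the checkpoint loads as a LLaVA-style model as shown above):

```python
# Generate a reply for each prompt/image pair (sketch; tune max_new_tokens as needed)
output_ids = model.generate(**inputs, max_new_tokens=200)

# Decode and keep only the assistant's part of each reply
for text in processor.batch_decode(output_ids, skip_special_tokens=True):
    print(text.split("ASSISTANT:")[-1].strip())
```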

## Using pipeline

```python
from transformers import pipeline
from PIL import Image
import requests

model_id = "LeroyDyer/Mixtral_AI_Vision-Instruct_X"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"

image = Image.open(requests.get(url, stream=True).raw)
question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"
prompt = ("A chat between a curious human and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the human's questions."
          f"###Human: <image>\n{question}###Assistant:")

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
```
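
The pipeline returns a list of dictionaries; the answer can be pulled out of the `generated_text` field (a sketch, assuming the default `image-to-text` output format):

```python
# outputs looks like [{'generated_text': '...'}]; keep only the text after the Assistant marker
answer = outputs[0]["generated_text"].split("###Assistant:")[-1].strip()
print(answer)
```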

## Mistral Chat Templating

**Instruction format:** to leverage instruction fine-tuning, your prompt should be wrapped in [INST] and [/INST] tokens. The very first instruction should begin with a begin-of-sentence token id; the following instructions should not. The assistant's generation is ended by the end-of-sentence token id.

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Vision-Instruct_X")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
```
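
With the standard Mistral instruct template, the call above renders the conversation roughly as follows (shown as an assumption about the bundled template, not verified against this exact repo):

```
<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s>[INST] I'd like to show off how chat templating works! [/INST]
```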

# Text-to-Text

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("LeroyDyer/Mixtral_AI_Vision-Instruct_X")
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Vision-Instruct_X")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# Apply the chat template, then move inputs and model to the GPU
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```
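
If GPU memory is limited, the same 4-bit quantization shown in the vision example can also be applied here (a minimal sketch, assuming the bitsandbytes package is installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the text model in 4-bit instead of full precision (sketch)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "LeroyDyer/Mixtral_AI_Vision-Instruct_X",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Vision-Instruct_X")
```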