---
license: mit
language:
- en
library_name: transformers
tags:
- art
- medical
- biology
- code
- chemistry
metrics:
- code_eval
- chrf
- charcut_mt
- cer
- brier_score
- bleurt
- bertscore
- accuracy
pipeline_tag: image-text-to-text
---

# MULTI-MODAL-MODEL

## LeroyDyer/Mixtral_AI_Vision-Instruct_X

Currently in test mode.

# Vision/multimodal capabilities:

If you want to use vision functionality:

* You must use the latest version of [Koboldcpp](https://github.com/LostRuins/koboldcpp).

To use the multimodal capabilities of this model and use **vision**, you need to load the specified **mmproj** file, which can be found inside this model repo. ([LeroyDyer/Mixtral_AI_Vision-Instruct_X](https://huggingface.co/LeroyDyer/Mixtral_AI_Vision-Instruct_X))

* You can load the **mmproj** by using the corresponding section in the interface:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65d4cf2693a0a3744a27536c/UX6Ubss2EPNAT3SKGMLe0.png)

## Vision/multimodal capabilities:

* For 4-bit loading, use the 4-bit mmproj file: mmproj-Mixtral_AI_Vision-Instruct_X-Q4_0
* For 8-bit loading, use the 8-bit mmproj file: mmproj-Mixtral_AI_Vision-Instruct_X-Q8_0
* For 16-bit (f16) loading, use the f16 mmproj file: mmproj-Mixtral_AI_Vision-Instruct_X-f16

## Extended capabilities:

```
* mistralai/Mistral-7B-Instruct-v0.1 - Prime-Base
* ChaoticNeutrals/Eris-LelantaclesV2-7b - role play
* ChaoticNeutrals/Eris_PrimeV3-Vision-7B - vision
* rvv-karma/BASH-Coder-Mistral-7B - coding
* Locutusque/Hercules-3.1-Mistral-7B - Unhinging
* KoboldAI/Mistral-7B-Erebus-v3 - NSFW
* Locutusque/Hyperion-2.1-Mistral-7B - CHAT
* Severian/Nexus-IKM-Mistral-7B-Pytorch - Thinking
* NousResearch/Hermes-2-Pro-Mistral-7B - Generalizing
* mistralai/Mistral-7B-Instruct-v0.2 - BASE
* Nitral-AI/ProdigyXBioMistral_7B - medical
* Nitral-AI/Infinite-Mika-7b - 128k - Context Expansion enforcement
* Nous-Yarn-Mistral-7b-128k - 128k - Context Expansion
* yanismiraoui/Yarn-Mistral-7b-128k-sharded
* ChaoticNeutrals/Eris_Prime-V2-7B - Roleplay
```

# "image-text-text"

## using transformers

``` python
import requests
import torch
from IPython.display import display  # display() is only meaningful in a notebook
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

# Load the model in 4-bit to reduce GPU memory usage.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model_id = "LeroyDyer/Mixtral_AI_Vision-Instruct_X"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, quantization_config=quantization_config, device_map="auto")

image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
display(image1)
display(image2)

# Each prompt needs an <image> placeholder marking where the image is inserted.
prompts = [
    "USER: <image>\nWhat are the things I should be cautious about when I visit this place? What should I bring with me?\nASSISTANT:",
    "USER: <image>\nPlease describe this image\nASSISTANT:",
]

inputs = processor(text=prompts, images=[image1, image2], padding=True, return_tensors="pt").to(model.device)

# Inspect the shape of every tensor the processor produced.
for k, v in inputs.items():
    print(k, v.shape)
```

## Using pipeline

``` python
from transformers import pipeline
from PIL import Image
import requests

model_id = "LeroyDyer/Mixtral_AI_Vision-Instruct_X"
pipe = pipeline("image-to-text", model=model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"
# The <image> placeholder marks where the image is inserted into the prompt.
prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{question}###Assistant:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
```

## Mistral ChatTemplating Instruction format

To leverage instruction fine-tuning, your prompt should be surrounded by [INST] and [/INST] tokens. The very first instruction should begin with a beginning-of-sentence token id; subsequent instructions should not. The assistant's generation is terminated by the end-of-sentence token id.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Vision-Instruct_X")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# tokenize=False returns the formatted prompt string instead of token ids.
print(tokenizer.apply_chat_template(chat, tokenize=False))
```

# TextToText

``` python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("LeroyDyer/Mixtral_AI_Vision-Instruct_X")
tokenizer = AutoTokenizer.from_pretrained("LeroyDyer/Mixtral_AI_Vision-Instruct_X")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# Apply the chat template and return the token ids as a PyTorch tensor.
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```
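
## Instruction format by hand

To make the [INST] / [/INST] rules from the chat templating section concrete, here is a minimal sketch that assembles the prompt string in plain Python. `build_mistral_prompt` is a hypothetical helper, not part of transformers, and the `<s>` / `</s>` markers and the exact whitespace around them are assumptions. The tokenizer's own chat template remains the authoritative source, so prefer `apply_chat_template` in real code.

```python
from typing import Dict, List

BOS = "<s>"   # assumed text form of the beginning-of-sentence token
EOS = "</s>"  # assumed text form of the end-of-sentence token


def build_mistral_prompt(messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper: render alternating user/assistant turns."""
    prompt = BOS  # only the very first instruction is preceded by BOS
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            # Each instruction is wrapped in [INST] ... [/INST].
            prompt += f"[INST] {message['content']} [/INST]"
        else:
            # Each completed assistant turn is closed with EOS.
            prompt += f" {message['content']}{EOS}"
    return prompt


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
print(build_mistral_prompt(chat))
```

Comparing this string with the output of `tokenizer.apply_chat_template(chat, tokenize=False)` is a quick way to check whether the assumed spacing matches the template shipped with the model.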