tsunghanwu/mirage-llama3.1-8.3B

Commit 100ea29 by tsunghanwu on Oct 18 (1 parent: f39eaf0)

Update README.md

README.md +106 -3

@@ -1,3 +1,106 @@
----
-license: mit
----

+---
+license: mit
+---
+## MIRAGE
+**Model Type:** MIRAGE is an open-source visual-RAG model that can process more than 10,000 images as input. It combines a retriever with a large multimodal model (LMM).
+
+**Key Features:**
+- **Compressor:** Compresses the image tokens of each image by 18x, which makes large image collections efficient to process.
+- **Query-Aware Retriever:** Filters out images that are irrelevant to the query, so that computation is spent only on images that help with the task.
+- **Multi-Image LMM:** Uses a tailored pretraining and instruction-tuning dataset designed to improve performance across a range of multimodal tasks.
+
+**Performance:**
+- MIRAGE sets a new bar for open-source models on the [Visual Haystacks (VHs) benchmark](https://huggingface.co/datasets/tsunghanwu/visual_haystacks).
+- It delivers robust results on a range of single- and multi-image question answering tasks, including RETVQA, MMBench, MMVet, and VQAv2.
+
+**Usage:**
+Please refer to the installation guide on our GitHub repository to get started with MIRAGE: [Installation Guide](https://github.com/visual-haystacks/mirage)
+
+**Additional Resources:**
+For detailed information and updates, visit our project page: [Visual Haystacks Project](https://visual-haystacks.github.io/)
+
+**Support:**
+For questions or comments about the model, please open an issue on our GitHub page: [GitHub Issues](https://github.com/visual-haystacks/mirage/issues)
+
+**Intended Use:**
+MIRAGE is primarily intended for research into large multimodal models (LMMs), long-context modeling, and retrieval-augmented generation (RAG).
+
+### Example Usage Code
+```python
+from PIL import Image
+import torch
+import os
+from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
+from llava.conversation import conv_templates
+from llava.model.builder import load_pretrained_model
+from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
+from llava.utils import disable_torch_init
+@torch.inference_mode()
+def run(model_path, image_paths, prompt, num_retrievals=1):
+    '''
+    Executes MIRAGE with specified inputs to generate descriptive text based on the provided images.
+    Args:
+        model_path (str): Path to the MIRAGE model, e.g., 'tsunghanwu/mirage-llama3.1-8.3B'
+        image_paths (list): List of paths to image files, e.g., images in 'assets/example'
+        prompt (str): Text prompt for image description, e.g., 'Here are a set of random images in my photo album.
+                      If you can find a cat, tell me what's the cat doing and what's its color.'
+        num_retrievals (int): Maximum number of images to retrieve and pass to the LMM
+    Returns:
+        output_text (str): Descriptive text generated by the LMM
+        output_ret (list): List of images retrieved by the model
+    '''
+    # Load the model and prepare the environment
+    disable_torch_init()
+    # Expand "~" in a local path first, then derive the model name from the path
+    model_path = os.path.expanduser(model_path)
+    model_name = get_model_name_from_path(model_path)
+    tokenizer, model, image_processor, _ = \
+        load_pretrained_model(model_path=model_path, model_base=None, model_name=model_name, device="cuda")
+    model.eval_mode = True
+    # Process the images
+    clip_images = []
+    for image_path in image_paths:
+        image = Image.open(image_path).convert("RGB")
+        image_tensor = process_images([image], image_processor, model.config)[0]
+        image_tensor = image_tensor.to(dtype=torch.float16)
+        clip_images.append(image_tensor)
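+    # Prepare text input and interaction
+    # NOTE: the "llama3" template key is an assumption; use the key that llava.conversation defines for this model
+    conv = conv_templates["llama3"].copy()
+    qformer_text_input = tokenizer(prompt, return_tensors='pt')["input_ids"].to(model.device)
+    N = len(clip_images)
+    # One image placeholder token per input image, followed by the user prompt
+    img_str = DEFAULT_IMAGE_TOKEN * N + "\n"
+    inp = img_str + prompt
+    conv.append_message(conv.roles[0], inp)
+    conv.append_message(conv.roles[1], None)
+    prompt = conv.get_prompt()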
+    # Generate model output
+    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(model.device)
+    # Set the padding token ID explicitly (value hard-coded for this model's tokenizer)
+    tokenizer.pad_token_id = 128002
+    batch_clip_images = [torch.stack(clip_images).to(model.device)]
+    output_ret, output_ids = model.generate(
+        input_ids,
+        pad_token_id=tokenizer.pad_token_id,
+        clip_images=batch_clip_images,
+        qformer_text_input=qformer_text_input,
+        relevance=None,
+        num_retrieval=num_retrievals,
+        do_sample=False,
+        max_new_tokens=512,
+        use_cache=True)
+    # Process output
+    output_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
+    if not isinstance(output_ret[0], list):
+        output_ret[0] = output_ret[0].tolist()
+    return output_text, output_ret[0]
+```
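
The `run` function in the README example takes a list of image paths, so a small driver is useful for pointing it at a folder such as the `assets/example` directory mentioned in its docstring. The following is a minimal sketch, not part of the MIRAGE repository: `collect_image_paths`, `describe_album`, and the `IMAGE_EXTENSIONS` set are hypothetical names introduced here for illustration. The `run_fn` argument is assumed to have the same signature and return values as `run` above; it is passed in explicitly so the helpers can be tried without the MIRAGE installation or a GPU.

```python
from pathlib import Path

# Hypothetical helpers for illustration; they are not part of the MIRAGE codebase.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of image file extensions


def collect_image_paths(directory):
    """Return a sorted list of image file paths (as strings) found directly in `directory`."""
    root = Path(directory).expanduser()
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def describe_album(run_fn, directory, prompt,
                   model_path="tsunghanwu/mirage-llama3.1-8.3B", num_retrievals=1):
    """Collect the images in `directory` and pass them to `run_fn`.

    `run_fn` is assumed to behave like the README's `run`: it takes
    (model_path, image_paths, prompt, num_retrievals) and returns
    (output_text, output_ret).
    """
    image_paths = collect_image_paths(directory)
    if not image_paths:
        raise ValueError(f"No images found in {directory!r}")
    return run_fn(model_path, image_paths, prompt, num_retrievals=num_retrievals)
```

With MIRAGE installed and a CUDA GPU available, a call might look like `describe_album(run, "assets/example", "If you can find a cat, tell me what the cat is doing and what its color is.")`, which returns the generated text together with the retrieval output of `run`.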