---
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
language:
- en
tags:
- llava
- phi
license: mit
library_name: transformers
widget:
- text: "What animal is it?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
- text: "Where is it?"
  src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
---
# Multi-crop LLaVA-3b
## Model details
Usually, in LLaVA models, we generate N embeddings for the image, which we then combine with text embeddings and send to the LLM. But what if, instead of creating N tokens for one image, we create K<
```
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
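As a minimal sketch of how a question could be wrapped in this prompt format, the helper below fills in the `{prompt}` placeholder. It covers only the user and assistant turns visible in the template above; the names `CHATML_TEMPLATE` and `build_prompt` are hypothetical and are not part of the model's code.

```python
# Hypothetical helper: wraps a question in the user/assistant turns
# of the ChatML template shown in this model card.
CHATML_TEMPLATE = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"


def build_prompt(question: str) -> str:
    """Return the question wrapped in the ChatML turns from the template."""
    # Strip surrounding whitespace so the special tokens sit flush with the text.
    return CHATML_TEMPLATE.format(prompt=question.strip())


if __name__ == "__main__":
    print(build_prompt("What animal is it?"))
```

The resulting string can then be passed as `prompt` to the processor call in the usage section.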
## How to use
```python
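from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
import torch

model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", torch_dtype=torch.float16, trust_remote_code=True).to("cuda")
processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)

# Illustrative inputs (assumptions): a sample image and a question wrapped in the ChatML prompt format.
url = "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "<|im_start|>user\nWhat animal is it?<|im_end|>\n<|im_start|>assistant"

with torch.inference_mode():
    inputs = processor(prompt, [raw_image], model, max_crops=100, num_tokens=728)
    output = model.generate(**inputs, max_new_tokens=200, use_cache=True, do_sample=False,
                            eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=processor.tokenizer.eos_token_id)

# Remove the prompt and the end-of-turn marker from the decoded text.
result = processor.tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", "")
print(result)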
```
## Benchmarks
- TextVQA - 50.9%
- GQA - 59.5%
- VQAv2 - 76.72%
- VizWiz - 32.68%
- V*-bench - OCR - 56.66%, GPT4V-hard - 52.94%, direct attributes - 40.86%, relative position - 56.57%
## Examples
## License
The model is licensed under the MIT license. However, since the data used for model training is largely synthetic, you should also follow the OpenAI and Google Gemini terms of service, which means you should not use the model to create models that compete with theirs.
## Acknowledgments
Thanks to [Lambda](https://lambdalabs.com/) for providing a machine to train the model.
Thanks to [ML Collective](https://mlcollective.org/) for continuous support and providing compute resources for testing the model.