Model Card for SpaceMantis

SpaceMantis fine-tunes Mantis-8B-siglip-llama3 for enhanced spatial reasoning.

Model Details

SpaceMantis is a LoRA fine-tune of Mantis-8B-siglip-llama3 on the SpaceLLaVA dataset, which was built with VQASynth to enhance spatial reasoning as described in SpatialVLM.
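
To give a rough sense of why LoRA keeps the fine-tune lightweight, the sketch below counts trainable parameters for one weight matrix when a rank-r update B·A is trained instead of the full matrix. The layer size (4096 × 4096) and rank (16) are illustrative assumptions; this card does not state the LoRA configuration used for SpaceMantis.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in entries of W.
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the effective weight is W + B @ A (scaling factor omitted).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative numbers only (assumed, not taken from the SpaceMantis config).
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```

The saving grows as the rank shrinks relative to the layer dimensions, which is why low ranks are a common choice for adapting large models.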

Model Description

This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM, which enhances the spatial reasoning of multimodal models. A pipeline of expert models infers spatial relationships between objects in a scene, and those relationships are used to create a VQA dataset for spatial reasoning.
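
As a rough illustration of the last step of such a pipeline, the sketch below turns two object positions into a distance question and answer. It assumes upstream expert models (for example captioning, segmentation, and depth estimation) have already produced a label and a 3D centroid in meters for each object. The `SceneObject` type, the `distance_qa` function, and the question template are hypothetical and are not VQASynth's actual API.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    label: str
    centroid_m: tuple[float, float, float]  # (x, y, z) in meters, camera frame


def distance_qa(a: SceneObject, b: SceneObject) -> dict[str, str]:
    """Build one question/answer pair about the distance between two objects."""
    dist = math.dist(a.centroid_m, b.centroid_m)
    # Report short distances in centimeters, longer ones in meters.
    if dist < 1.0:
        answer = f"About {dist * 100:.0f} centimeters."
    else:
        answer = f"About {dist:.2f} meters."
    question = f"How far is the {a.label} from the {b.label}?"
    return {"question": question, "answer": answer}


# Hypothetical objects; real values would come from the expert models.
chair = SceneObject("chair", (0.0, 0.0, 2.0))
lamp = SceneObject("lamp", (3.0, 0.0, 6.0))
print(distance_qa(chair, lamp))
```

Repeating this over many images and object pairs, with varied templates, is one plausible way to build up a spatial VQA dataset of the kind described above.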

Developed by: remyx.ai
Model type: Multimodal model, vision-language model, Llama 3

Quick Start

To run SpaceMantis, follow these steps:

import torch
from PIL import Image
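# MLlavaProcessor, LlavaForConditionalGeneration, and chat_mllava come from the Mantis codebase;
# if Mantis is installed as a package, the import path may be mantis.models.mllava instead.
from models.mllava import MLlavaProcessor, LlavaForConditionalGeneration, chat_mllava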

# Load the model and processor
attn_implementation = None  # or "flash_attention_2"
processor = MLlavaProcessor.from_pretrained("remyxai/SpaceMantis")
model = LlavaForConditionalGeneration.from_pretrained("remyxai/SpaceMantis", device_map="cuda", torch_dtype=torch.float16, attn_implementation=attn_implementation)

generation_kwargs = {
    "max_new_tokens": 1024,
    "num_beams": 1,
    "do_sample": False
}

# Function to run inference
def run_inference(image_path, content):
    # Load the image
    image = Image.open(image_path).convert("RGB")
    # chat_mllava expects a list of PIL images
    images = [image]
    # Run the inference
    response, history = chat_mllava(content, images, model, processor, **generation_kwargs)
    return response

# Example usage
image_path = "path/to/your/image.jpg"
content = "Your question here."
response = run_inference(image_path, content)
print("Response:", response)
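
The `history` value returned by `chat_mllava` can be used for follow-up questions about the same image. The sketch below assumes `chat_mllava` accepts a `history` keyword argument, as in the Mantis usage examples; `ask_follow_ups` is a hypothetical helper, and the chat function is passed in as a parameter so the loop can be tried with a stub before loading the model.

```python
def ask_follow_ups(chat_fn, questions, images, model, processor, **generation_kwargs):
    """Ask several questions in one conversation and return the answers in order.

    chat_fn is expected to behave like chat_mllava: it returns (response, history)
    and accepts the previous history via a `history` keyword (assumption).
    """
    history = None  # no prior turns before the first question
    answers = []
    for question in questions:
        response, history = chat_fn(
            question, images, model, processor, history=history, **generation_kwargs
        )
        answers.append(response)
    return answers
```

With the model and processor loaded as above, a call might look like `ask_follow_ups(chat_mllava, ["How far is the chair from the table?", "Which of the two is closer to the camera?"], [image], model, processor, **generation_kwargs)`, where `image` is a PIL image opened as in `run_inference`.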

Model Sources

Dataset: SpaceLLaVA
Repository: VQASynth
Paper: SpatialVLM

Citation

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@article{jiang2024mantis,
  title={MANTIS: Interleaved Multi-Image Instruction Tuning},
  author={Jiang, Dongfu and He, Xuan and Zeng, Huaye and Wei, Con and Ku, Max and Liu, Qian and Chen, Wenhu},
  journal={arXiv preprint arXiv:2405.01483},
  year={2024}
}
