---
inference: false
license: openrail
language:
  - en
---

# MM-IGLU: Multi-Modal Interactive Grounded Language Understanding

This repository contains the code for *MM-IGLU: Multi-Modal Interactive Grounded Language Understanding*, accepted at the LREC-COLING 2024 conference, by Claudiu Daniel Hromei, Daniele Margiotta, Danilo Croce and Roberto Basili (University of Rome, Tor Vergata). The paper will be available here.

## Usage

This is the merged model, based on the LLaMA-2-chat-13b language model coupled with CLIP. It can be used as is, or further fine-tuned on your downstream task.

```python
import torch
from llava.model import LlavaLlamaForCausalLM
from transformers import AutoTokenizer

model_path = "sag-uniroma2/llava-Llama-2-chat-13b-hf-iglu-merged"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=False,  # set to True to reduce GPU memory usage
    torch_dtype=torch.float16,
    device_map="auto",
)
```
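Once loaded, the model can answer a command grounded in an image of the world. The snippet below is a minimal inference sketch based on the helper utilities of the official LLaVA codebase; the conversation template (`llava_llama_2`), the image file `world.png`, and the example command are illustrative assumptions, and helper signatures may vary across LLaVA versions.

```python
from PIL import Image
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Load a rendered 3D world (file name is illustrative).
image = Image.open("world.png")
image_processor = model.get_vision_tower().image_processor
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# Build a LLaVA-style prompt that places the image before the command.
conv = conv_templates["llava_llama_2"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nBreak the green blocks")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=128)

# Depending on the LLaVA version, the prompt tokens may need to be stripped
# from the output before decoding.
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```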

## Description

MM-IGLU is a multi-modal dataset for Interactive Grounded Language Understanding that expands the resource released during the IGLU competition. While the original resource was text-only, we extended it by generating a 3D image for each representation of the world.

Given a 3D world and a natural-language command from a Human Architect, the task of a Robotic Builder is to assess whether the command is executable in that world and, if so, execute it or, if more information is needed, ask clarifying questions. We report an example image here:

*IGLU image example*

and a command like "Break the green blocks". If, as in this case, there are no green blocks, the Robotic Builder should answer "There are no green blocks, which block should I break?". For the same image, if the command is "Break the red blocks", the Builder should recognize that red blocks are present in the environment and answer "I can execute it", confirming the feasibility of the command.
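To make the expected behavior concrete, here is a purely illustrative sketch of the task semantics; this is not the model's implementation, which generates its answers with a language model, and all names are hypothetical.

```python
# Toy illustration of the Builder's expected behavior: confirm feasible
# commands, ask a clarifying question otherwise.
def builder_answer(world_block_colors: set, target_color: str) -> str:
    if target_color in world_block_colors:
        return "I can execute it"
    return f"There are no {target_color} blocks, which block should I break?"

print(builder_answer({"red", "blue"}, "green"))  # -> clarifying question
print(builder_answer({"red", "blue"}, "red"))    # -> confirmation
```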

We developed a multi-modal model based on LLaVA that solves the task by exploiting both the command and the 3D image. It couples a CLIP model for handling the images with a language model for generating the answers. The best performance was achieved with the LLaMA-2-chat-13b model.

## GitHub

For more details, please consult the GitHub page, where you can find instructions on how to use the model.