---
inference: false
license: openrail
language:
- en
---

# MM-IGLU: Multi-Modal Interactive Grounded Language Understanding

This repository contains the code for *MM-IGLU: Multi-Modal Interactive Grounded Language Understanding*, accepted at the [LREC-COLING 2024](https://lrec-coling-2024.org/) conference and authored by *Claudiu Daniel Hromei* (Tor Vergata, University of Rome), *Daniele Margiotta* (Tor Vergata, University of Rome), *Danilo Croce* (Tor Vergata, University of Rome) and *Roberto Basili* (Tor Vergata, University of Rome). The paper is available [here](https://aclanthology.org/2024.lrec-main.1000/).

# Usage

**This is the merged model**, based on the LLaMA-2-chat-13b Language Model coupled with CLIP. **It can be used as is, or further fine-tuned on your downstream task.**

```python
import torch

from llava.model import LlavaLlamaForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sag-uniroma2/llava-Llama-2-chat-13b-hf-iglu-merged")
model = LlavaLlamaForCausalLM.from_pretrained(
    "sag-uniroma2/llava-Llama-2-chat-13b-hf-iglu-merged",
    load_in_8bit=False,  # set to True to load the model in 8-bit precision
    torch_dtype=torch.float16,
    device_map="auto",
)
```

# Description

*MM-IGLU* is a Multi-Modal dataset for Interactive Grounded Language Understanding that expands the resource released during the [IGLU](https://github.com/microsoft/iglu-datasets) competition. While the competition was text-only, we extended this resource by generating a 3D image for each representation of the world.

Given a 3D world and a command in natural language from a Human Architect, the task of a Robotic Builder is to assess whether the command is executable in that world and, if so, execute it; otherwise, if more information is needed, it should ask clarifying questions.

We report here an example image:

![IGLU image example](iglu_image_example.png)

together with a command like "*Break the green blocks*". If, as in this case, there are no green blocks, the Robotic Builder should answer "*There are no green blocks, which block should I break?*". For the same image, if the command is "*Break the red blocks*", the Builder should recognize that red blocks are present in the environment and answer "*I can execute it*", confirming the feasibility of the command.

We developed a multi-modal model based on [LLaVA](https://github.com/haotian-liu/LLaVA) that solves the task by exploiting both the command and the 3D image. It couples a [CLIP](https://github.com/openai/CLIP) model for handling the images with a Language Model for generating the answers. The best performance was achieved when it was coupled with the LLaMA-2-chat-13b model.

## GitHub

For more details, please consult the [GitHub page](https://github.com/crux82/MM-IGLU), where you can find out how to use the model.
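
## Inference example (sketch)

The snippet below is a minimal, hedged sketch of how one might query the merged model with an image and a command, using the inference utilities from the public [LLaVA](https://github.com/haotian-liu/LLaVA) repository. The conversation template name (`llava_llama_2`), the image file, and the pre-processing steps are assumptions and may differ from the exact setup used in the MM-IGLU experiments; please refer to the GitHub page above for the reference code.

```python
# Hedged sketch: inference with utilities from the public LLaVA repository.
# Template and pre-processing choices here are assumptions, not the official
# MM-IGLU pipeline.
import torch
from PIL import Image

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model import LlavaLlamaForCausalLM
from transformers import AutoTokenizer

model_path = "sag-uniroma2/llava-Llama-2-chat-13b-hf-iglu-merged"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaLlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
image_processor = model.get_vision_tower().image_processor

# Build a prompt containing the image placeholder followed by the command.
command = "Break the red blocks"
conv = conv_templates["llava_llama_2"].copy()  # assumed template name
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + command)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Pre-process the 3D world rendering and tokenize the prompt.
image = Image.open("iglu_image_example.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids, images=image_tensor, max_new_tokens=64, do_sample=False
    )

# Depending on the LLaVA version, the output may or may not include the
# prompt tokens; here we simply decode everything that was returned.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For a command that is executable in the depicted world, the expected answer is along the lines of "*I can execute it*"; otherwise the model should produce a clarification question, as described above.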