---
inference: false
pipeline_tag: image-text-to-text
datasets:
  - teamcraft/TeamCraft-Data-Dec
---


# TeamCraft-VLA-7B-Dec Model Card

TeamCraft-VLA-7B-Dec is a multi-modal vision-language-action model designed for decentralized multi-agent collaboration. The model encodes multi-modal prompts specifying the task, together with a single agent's visual observation and inventory at each timestep, and generates executable actions for that agent in a multi-agent setting.

## Usage

We provide a full environment with detailed running instructions on GitHub.

## Model details

The TeamCraft-VLA (Vision-Language-Action) architecture integrates a CLIP ViT-L/14 visual encoder with a linear projector for modality alignment and Vicuna-v1.5-7B (Llama 2.0) as the LLM backbone, combining visual and text embeddings to generate actions for multi-agent tasks.
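The modality-alignment step described above can be sketched as follows. This is a minimal illustrative sketch, not the model's actual code: the dimensions assume CLIP ViT-L/14 patch features (1024-d) and the Vicuna-7B / Llama hidden size (4096-d), and the projector, token counts, and variable names are hypothetical.

```python
import numpy as np

# Assumed dimensions: CLIP ViT-L/14 patch features are 1024-d,
# and the Vicuna-7B (Llama) hidden size is 4096-d.
CLIP_DIM, LLM_DIM = 1024, 4096

rng = np.random.default_rng(0)

# The linear projector maps visual tokens into the LLM embedding space.
# (Random weights here; the real projector is learned during training.)
W = rng.standard_normal((CLIP_DIM, LLM_DIM)) * 0.02

def project(visual_tokens: np.ndarray) -> np.ndarray:
    """Map (num_patches, CLIP_DIM) features to (num_patches, LLM_DIM)."""
    return visual_tokens @ W

# Hypothetical inputs: 256 visual patch tokens from the CLIP encoder
# and 32 embedded text tokens from the task prompt.
visual_tokens = rng.standard_normal((256, CLIP_DIM))
text_embeddings = rng.standard_normal((32, LLM_DIM))

# Projected visual tokens are concatenated with the text embeddings and
# fed to the LLM backbone, which decodes an action sequence.
llm_input = np.concatenate([project(visual_tokens), text_embeddings], axis=0)
print(llm_input.shape)  # (288, 4096)
```

The projector only aligns dimensionality; all cross-modal reasoning happens inside the LLM backbone once the two token streams share an embedding space.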

**Model type:**

- Vision-Language Action Model

**Model version:**

- v1.0

**Model date:**

- TeamCraft-VLA-7B-Dec was trained in September 2024

**Training dataset:**

- [teamcraft/TeamCraft-Data-Dec](https://huggingface.co/datasets/teamcraft/TeamCraft-Data-Dec)

## Uses

### Direct use

- Primary intended uses: The primary use of TeamCraft-VLA-7B-Dec is research on multi-agent systems in multi-modal settings.

- Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, multi-agent systems, and artificial intelligence.

### Out-of-scope use

- The model is not designed for real-world decision-making or deployment in safety-critical systems.

- The model should not be used for tasks requiring ethical reasoning or moral judgment, or in any application where improper actions could lead to harm or regulatory violations.

## License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.