---
license: apache-2.0
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: visual-question-answering
tags:
- multimodal large language model
- large video-language model
---

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/ZrZPYT0Q3wgza7Vc5BmyD.png" width="100%" style="margin-bottom: 0.2;"/>
</p>

<h3 align="center"><a href="https://arxiv.org/abs/2406.07476" style="color:#4D2B24">VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM</a></h3>

<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoRefer">GitHub</a> for the latest updates. </h5>

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/iGpjPujqD1OD4V1n_u70u.png" width="100%" style="margin-bottom: 0.2;"/>
</p>

## Model Zoo

| Model Name | Visual Encoder | Language Decoder | # Training Frames |
|:----------------|:----------------|:------------------|:----------------:|
| [VideoRefer-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
| [VideoRefer-7B-stage2](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B-stage2) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
| [VideoRefer-7B-stage2.5](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B-stage2.5) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |

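
The checkpoints in the Model Zoo table can be fetched programmatically. The snippet below is a minimal sketch, not an official loading recipe: it assumes the `huggingface_hub` package is installed, and `MODEL_ZOO` and `download_checkpoint` are hypothetical helper names introduced here for illustration. It only downloads the files; running inference is assumed to require the code in the GitHub repository linked above.

```python
"""Sketch: fetch a VideoRefer checkpoint listed in the Model Zoo table."""

# Repo IDs copied from the Model Zoo table (hypothetical helper mapping).
MODEL_ZOO = {
    "VideoRefer-7B": "DAMO-NLP-SG/VideoRefer-7B",
    "VideoRefer-7B-stage2": "DAMO-NLP-SG/VideoRefer-7B-stage2",
    "VideoRefer-7B-stage2.5": "DAMO-NLP-SG/VideoRefer-7B-stage2.5",
}


def download_checkpoint(name: str) -> str:
    """Download every file of the named checkpoint and return the local directory."""
    if name not in MODEL_ZOO:
        raise KeyError(f"unknown model {name!r}; choose from {sorted(MODEL_ZOO)}")
    # Imported lazily so the table lookup works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # snapshot_download stores the repo in the local Hugging Face cache.
    return snapshot_download(repo_id=MODEL_ZOO[name])
```

Calling `download_checkpoint("VideoRefer-7B")` would place the weights in the local Hugging Face cache and return that path; note that a 7B checkpoint is a multi-gigabyte download.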
## Citation

If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{yuan2024videorefersuite,
  title   = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author  = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and Jianke Zhu and Lidong Bing},
  journal = {arXiv},
  year    = {2024},
  url     = {}
}
```