DAMO-NLP-SG/VideoRefer-7B-stage2

	---
	license: apache-2.0
	language:
	- en
	metrics:
	- accuracy
	library_name: transformers
	pipeline_tag: visual-question-answering
	tags:
	- multimodal large language model
	- large video-language model
	---




	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/ZrZPYT0Q3wgza7Vc5BmyD.png" width="100%" style="margin-bottom: 0.2;"/>
	</p>


	<h3 align="center"><a href="https://arxiv.org/abs/2406.07476" style="color:#4D2B24">
	VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM</a></h3>

	<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoRefer">GitHub</a> for the latest updates. </h5>

	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/iGpjPujqD1OD4V1n_u70u.png" width="100%" style="margin-bottom: 0.2;"/>
	</p>

	## 🌏 Model Zoo
	| Model Name | Visual Encoder | Language Decoder | # Training Frames |
	|:----------------|:----------------|:------------------|:----------------:|
	| [VideoRefer-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
	| [VideoRefer-7B-stage2](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B-stage2) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
	| [VideoRefer-7B-stage2.5](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B-stage2.5) | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
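
The checkpoints listed in the Model Zoo can be fetched from the Hugging Face Hub. The following is a minimal sketch, not an official loading recipe: the repository IDs are copied from the table, while `MODEL_ZOO`, `repo_id_for`, and `download_checkpoint` are hypothetical names introduced here for illustration. It assumes the `huggingface_hub` package is installed and only downloads the files; running inference most likely also requires the model code from the VideoRefer GitHub repository.

```python
from typing import Optional

# Repository IDs copied from the Model Zoo table.
MODEL_ZOO = {
    "VideoRefer-7B": "DAMO-NLP-SG/VideoRefer-7B",
    "VideoRefer-7B-stage2": "DAMO-NLP-SG/VideoRefer-7B-stage2",
    "VideoRefer-7B-stage2.5": "DAMO-NLP-SG/VideoRefer-7B-stage2.5",
}


def repo_id_for(model_name: str) -> str:
    """Return the Hub repository ID for a Model Zoo entry (hypothetical helper)."""
    try:
        return MODEL_ZOO[model_name]
    except KeyError:
        known = ", ".join(sorted(MODEL_ZOO))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None


def download_checkpoint(model_name: str, local_dir: Optional[str] = None) -> str:
    """Download every file of a checkpoint and return the local directory path.

    Needs network access and the `huggingface_hub` package, which is imported
    lazily so the lookup helper above works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_for(model_name), local_dir=local_dir)
```

Calling `download_checkpoint("VideoRefer-7B-stage2")` would place the checkpoint files in the local Hub cache (or in `local_dir` if given) and return that path.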


	## 📑 Citation

	If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:
	```bibtex
	@article{yuan2024videorefersuite,
	title = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
	author = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and Jianke Zhu and Lidong Bing},
	journal={arXiv},
	year={2024},
	url = {}
	}
	```