---
license: apache-2.0
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: visual-question-answering
tags:
- multimodal large language model
- large video-language model
---

## 🌏 Model Zoo

| Model Name | Visual Encoder | Language Decoder | # Training Frames |
|:----------------|:----------------|:------------------|:----------------:|
| [VideoRefer-7B]() | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
| [VideoRefer-7B-stage2]() | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |
| [VideoRefer-7B-stage2.5]() | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 16 |

## 📑 Citation

If you find VideoRefer Suite useful for your research and applications, please cite it using this BibTeX entry:

```bibtex
@article{yuan2024videorefersuite,
  title   = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author  = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and Jianke Zhu and Lidong Bing},
  journal = {arXiv},
  year    = {2024},
  url     = {}
}
```