DAMO-NLP-SG/VideoRefer-7B-stage2

Tags: Visual Question Answering, videorefer_qwen2, text-generation, multimodal large language model, large video-language model

CircleRadon committed 10 days ago

Commit de04e4d · verified · 1 parent: 6d5c0bb

Update README.md

Files changed (1)

README.md +50 -3

@@ -1,3 +1,50 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+metrics:
+- accuracy
+library_name: transformers
+pipeline_tag: visual-question-answering
+tags:
+- multimodal large language model
+- large video-language model
+---
+<p align="center">
+    <img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/ZrZPYT0Q3wgza7Vc5BmyD.png" width="100%" style="margin-bottom: 0.2;"/>
+</p>
+<h3 align="center"><a href="https://arxiv.org/abs/2406.07476" style="color:#4D2B24">
+VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM</a></h3>
+<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoRefer">GitHub</a> for the latest updates. </h5>
+<p align="center">
+    <img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/iGpjPujqD1OD4V1n_u70u.png" width="100%" style="margin-bottom: 0.2;"/>
+</p>
+
+## 🌏 Model Zoo
+| Model Name     | Visual Encoder | Language Decoder | # Training Frames |
+|:----------------|:----------------|:------------------|:----------------:|
+| [VideoRefer-7B]() | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)  | 16 |
+| [VideoRefer-7B-stage2]()  | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)  | 16 |
+| [VideoRefer-7B-stage2.5]()  | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)  | 16 |
+## 📑 Citation
+If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:
+```bibtex
+@article{yuan2024videorefersuite,
+  title = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
+  author = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and Jianke Zhu and Lidong Bing},
+  journal = {arXiv},
+  year = {2024},
+  url = {}
+}
+```