---
license: apache-2.0
task_categories:
- summarization
language:
- en
tags:
- cross-modal-video-summarization
- video-summarization
- video-captioning
pretty_name: VideoXum
size_categories:
- 10K<n<100K
---

# VTSUM-BLIP Model Card

## Model details

**Model type:**
VTSUM-BLIP is an end-to-end cross-modal video summarization model: given a long source video, it produces a shortened video summary together with an aligned text summary.

**Paper or resources for more information:**
https://videoxum.github.io/

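To make the cross-modal setting concrete, the sketch below shows the shape of such a model's interface: one long video in, a pair of aligned summaries out. All names and the keep-threshold here are illustrative placeholders, not the released VTSUM-BLIP API; see the project page above for the official code.

```python
# Illustrative placeholder interface for cross-modal video summarization;
# the method names and the 0.5 threshold are assumptions, not the real API.
from dataclasses import dataclass

@dataclass
class CrossModalSummary:
    keyframe_indices: list[int]  # frames kept in the video summary
    text_summary: str            # aligned textual summary

def summarize(model, frames) -> CrossModalSummary:
    """Score every frame for the visual summary and decode a text summary
    from the same visual features, so the two outputs stay aligned."""
    saliency = model.score_frames(frames)   # hypothetical per-frame scores in [0, 1]
    keep = [i for i, s in enumerate(saliency) if s > 0.5]
    text = model.generate_text(frames)      # hypothetical BLIP-style decoding
    return CrossModalSummary(keyframe_indices=keep, text_summary=text)
```
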
## Training dataset
- VideoXum *training* set: 8K long videos with 80K pairs of aligned video and text summaries.

## Evaluation dataset
- VideoXum *val* set: 2K long videos with 20K pairs of aligned video and text summaries.
- VideoXum *test* set: 4K long videos with 40K pairs of aligned video and text summaries.
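
These splits can be pulled with the Hugging Face `datasets` library. A minimal sketch, assuming the annotations are hosted under the `jylins/videoxum` repo on the Hub (the repo id is an assumption, and the raw videos are distributed separately):

```python
# Minimal sketch: load the VideoXum annotation splits from the Hub.
# The repo id "jylins/videoxum" is an assumption; adjust it if the
# dataset is hosted elsewhere. Raw videos must be downloaded separately.
from datasets import load_dataset

videoxum = load_dataset("jylins/videoxum")  # expect train / val / test splits

for name, split in videoxum.items():
    print(f"{name}: {len(split)} videos")

sample = videoxum["train"][0]
print(sample.keys())  # e.g. video id, frame-level saliency, text summaries
```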