metadata
license: apache-2.0
task_categories:
- summarization
language:
- en
tags:
- cross-modal-video-summarization
- video-summarization
- video-captioning
pretty_name: VideoXum
size_categories:
- 10K<n<100K
VTSUM-BLIP Model Card
Model details
Model type: VTSUM-BLIP is an end-to-end cross-modal video summarization model.
Paper or resources for more information: https://videoxum.github.io/
Training dataset
- VideoXum training set: 8K long videos with 80K pairs of aligned video and text summaries.
Evaluation dataset
- VideoXum val set: 2K long videos with 20K pairs of aligned video and text summaries.
- VideoXum test set: 4K long videos with 40K pairs of aligned video and text summaries.
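As a rough illustration of how the splits listed above relate to one another, the sketch below tallies the video counts per split and prints each split's share of the total. The counts are the rounded "K" figures quoted in this card, not exact dataset sizes, and the names `SPLIT_VIDEOS` and `split_fractions` are hypothetical helpers introduced only for this example.

```python
# Approximate number of long videos per split, using the rounded
# figures quoted in this card (assumption: exact counts may differ).
SPLIT_VIDEOS = {"train": 8_000, "val": 2_000, "test": 4_000}


def split_fractions(split_videos):
    """Return each split's share of the total video count."""
    total = sum(split_videos.values())
    return {name: count / total for name, count in split_videos.items()}


if __name__ == "__main__":
    total_videos = sum(SPLIT_VIDEOS.values())
    print(f"total videos: {total_videos}")
    for name, frac in split_fractions(SPLIT_VIDEOS).items():
        print(f"{name}: {frac:.1%}")
```

Keeping the counts in a single dictionary makes it easy to substitute the exact split sizes once they are read from the released annotation files.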