VTSUM-BLIP Model Card
Model details
Model type: VTSUM-BLIP is an end-to-end cross-modal video summarization model.
Model description:
- VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
- VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
- VT-CLIP for VT-CLIPScore metric: vt_clip.pth
- BLIP w/ ViT-B and CapFilt-L: model_base_capfilt_large.pth
The file structure of the model zoo looks like:
outputs
├── blip
│   └── model_base_capfilt_large.pth
├── vt_clipscore
│   └── vt_clip.pth
├── vtsum_tt
│   └── vtsum_tt.pth
└── vtsum_tt_ca
    └── vtsum_tt_ca.pth
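As a convenience, the expected layout can be checked programmatically before running training or evaluation. The sketch below is a hypothetical helper (not part of the released code); the checkpoint paths are taken from the model zoo layout above, and the `missing_checkpoints` name is an assumption.

```python
from pathlib import Path

# Checkpoint paths as listed in the model zoo layout above.
CHECKPOINTS = {
    "blip": Path("outputs/blip/model_base_capfilt_large.pth"),
    "vt_clipscore": Path("outputs/vt_clipscore/vt_clip.pth"),
    "vtsum_tt": Path("outputs/vtsum_tt/vtsum_tt.pth"),
    "vtsum_tt_ca": Path("outputs/vtsum_tt_ca/vtsum_tt_ca.pth"),
}

def missing_checkpoints(root: Path = Path(".")) -> list:
    """Return the names of model variants whose checkpoint file is absent under root."""
    return [name for name, rel in CHECKPOINTS.items() if not (root / rel).exists()]
```

Running `missing_checkpoints()` from the repository root returns an empty list once all four `.pth` files are in place.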
Paper or resources for more information: https://videoxum.github.io/
Training dataset
- VideoXum training set: 8K long videos with 80K pairs of aligned video and text summaries.
Evaluation dataset
- VideoXum val set: 2K long videos with 20K pairs of aligned video and text summaries.
- VideoXum test set: 4K long videos with 40K pairs of aligned video and text summaries.