Reself/AuroraCap-7B-IMG
A Detailed Captioning Baseline and Benchmark for Video
Note The image caption model offers a better performance-cost trade-off.
Note The video caption model offers a better performance-cost trade-off.
Note The VDC benchmark contains 1,027 videos with captions averaging over 500 words.
Note A collection of over 20M image and video samples for AuroraCap training, pre-tokenized for Vicuna and Llama-3.