M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models • Paper • 2410.09928 • Published 27 days ago
One missing piece in Vision and Language: A Survey on Comics Understanding • Paper • 2409.09502 • Published Sep 14
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text • Paper • 2406.08418 • Published Jun 12
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning • Paper • 2404.16994 • Published Apr 25