Abstract
Retrieval-Augmented Generation (RAG) is a powerful strategy for addressing the issue of factually incorrect outputs in foundation models, by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advances beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these limitations, we introduce VideoRAG, a novel framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both the visual and textual information of videos in output generation. To operationalize this, our method builds on recent advances in Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showing that it is superior to relevant baselines.
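Below is a minimal sketch of the retrieve-then-generate flow the abstract describes: embed the query and each video in a shared space, retrieve the top-k most relevant videos, and condition answer generation on the retrieved videos together with the query. The `embed_query`, `embed_video`, and `generate` functions are hypothetical placeholders standing in for an LVLM's encoders and decoder; they are not the paper's released implementation.

```python
import numpy as np

def embed_query(query: str) -> np.ndarray:
    # Placeholder: in practice, encode the query with the LVLM's text encoder.
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    return rng.standard_normal(512)

def embed_video(video_id: str) -> np.ndarray:
    # Placeholder: in practice, sample frames (and any accompanying text,
    # e.g. transcripts) and encode them with the LVLM.
    rng = np.random.default_rng(abs(hash(video_id)) % (2**32))
    return rng.standard_normal(512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Score every video in the corpus against the query and keep the top-k.
    q = embed_query(query)
    ranked = sorted(corpus, key=lambda v: cosine(q, embed_video(v)), reverse=True)
    return ranked[:k]

def generate(query: str, videos: list[str]) -> str:
    # Placeholder: in practice, feed the query jointly with the retrieved videos'
    # visual frames and textual information into the LVLM and decode an answer.
    return f"Answer to {query!r}, grounded in retrieved videos {videos}"

if __name__ == "__main__":
    corpus = ["how_to_tie_a_tie.mp4", "change_a_bike_tire.mp4", "bake_sourdough.mp4"]
    question = "How do I fix a flat bicycle tire?"
    top_videos = retrieve(question, corpus, k=1)
    print(generate(question, top_videos))
```

The key design choice this illustrates is that retrieval operates directly over video representations rather than over pre-assigned videos or text-only descriptions, and generation consumes the retrieved videos' visual and textual content together with the query.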
Community
We aim to extend the current landscape of retrieval-augmented generation by leveraging a video corpus.
I'm glad you included well-defined experiments, which makes this scientifically sound. I'd say the question of whether the additional computational cost of using -VT is worth it still stands, since the effect size of incorporating visual features is not substantial.
I do agree that the quality of video selection in the retrieval process is just as relevant here as the actual performance of LVLMs (especially their ability to understand and trace motion and action; related: https://huggingface.co/papers/2501.02955).
Check out a detailed walkthrough of the paper: https://gyanendradas.substack.com/p/videorag-paper-explained
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Re-ranking the Context for Multimodal Retrieval Augmented Generation (2025)
- RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance (2025)
- mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA (2024)
- Perceive, Query&Reason: Enhancing Video QA with Question-Guided Temporal Queries (2024)
- Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks (2024)
- Towards Long Video Understanding via Fine-detailed Video Story Generation (2024)
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis (2024)