|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
base_model: |
|
- mistralai/Mistral-7B-Instruct-v0.2 |
|
tags: |
|
- video temporal grounding |
|
- dense video caption |
|
- video highlight detection |
|
--- |
|
|
|
<h2 align="center"> <a href="https://arxiv.org/abs/2410.05643">TRACE: Temporal Grounding Video LLM via Causal Event Modeling</a></h2> |
|
<h5 align="center"> If our project helps you, please give us a star β on <a href="https://github.com/gyxxyg/TRACE">GitHub</a> and cite our paper!</h2> |
|
<h5 align="center"> |
|
|
|
## 📰 News
|
|
|
- **[2024.11.01]** 🔥 We are excited to announce the release of [trace-uni](https://huggingface.co/Yongxin-Guo/trace-uni), enhanced with additional general video understanding data from a subset of [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). Our results indicate that trace-uni outperforms trace on both VTG tasks and general video understanding tasks.

- **[2024.10.19]** 🔥 We release [trace-retrieval](https://huggingface.co/Yongxin-Guo/trace-retrieval), which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks (see the timestamp-snapping sketch after this list).

- **[2024.10.10]** 🔥 Our [code](https://github.com/gyxxyg/TRACE) and [paper](https://arxiv.org/abs/2410.05643) are released!

- **[2024.10.10]** 🔥 Our **checkpoints** are available now!
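
To illustrate the idea behind trace-retrieval, the sketch below snaps each predicted timestamp to the nearest sampled input frame timestamp. This is a minimal sketch of the general technique, not the repository's actual implementation; the function and variable names are hypothetical.

```python
import bisect

def snap_to_frame_timestamps(predicted, frame_timestamps):
    """Snap each predicted timestamp (in seconds) to the nearest input
    frame timestamp. `frame_timestamps` must be sorted ascending."""
    snapped = []
    for t in predicted:
        i = bisect.bisect_left(frame_timestamps, t)
        # Candidates are the frame timestamps on either side of t.
        neighbours = frame_timestamps[max(i - 1, 0):i + 1]
        snapped.append(min(neighbours, key=lambda f: abs(f - t)))
    return snapped

# Frames sampled every 2 seconds; predictions snap onto that grid.
frames = [0.0, 2.0, 4.0, 6.0, 8.0]
print(snap_to_frame_timestamps([1.2, 5.1, 7.9], frames))  # [2.0, 6.0, 8.0]
```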
|
|
|
## Overview |
|
|
|
In this work:

- We model videos as a series of events and propose a causal event modeling framework to capture the inherent structure of videos.

- We present TRACE, a novel task-interleaved video LLM tailored to implement the causal event modeling framework through the sequential encoding/decoding of timestamps, salient scores, and textual captions (a sketch of this event-structured output follows the list).
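
To make the event-structured output concrete, the sketch below represents a video as an ordered list of events, each carrying timestamps, a salient score, and a caption, serialized in that decoding order. The data layout and serialization format are illustrative assumptions, not TRACE's actual token format.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One event in the causal event sequence, decoded in the
    order: timestamps -> salient score -> caption."""
    start: float     # event start time in seconds
    end: float       # event end time in seconds
    salience: float  # salient score of the event
    caption: str     # textual caption of the event

def serialize(events: list[Event]) -> str:
    """Interleave the three fields event by event, mirroring the
    sequential encoding/decoding of the causal event framework,
    where past events condition the prediction of the next one."""
    return " ".join(
        f"<{e.start:.1f}-{e.end:.1f}> <{e.salience:.1f}> {e.caption}"
        for e in events
    )

print(serialize([Event(0.0, 12.5, 3.0, "A person opens the fridge."),
                 Event(12.5, 30.0, 4.5, "They pour a glass of milk.")]))
```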
|
|
|
## Model Zoo |
|
|
|
| Checkpoints | Description | URL | |
|
| ----------- | ----------- | ----------- | |
|
| Initialization | Weights initialized from VideoLLaMA2 | [trace-init](https://huggingface.co/Yongxin-Guo/trace-init) | |
|
| Stage-1 | Model checkpoints trained after stage-1 | [trace-stage1](https://huggingface.co/Yongxin-Guo/trace-stage1) | |
|
| Stage-2 | Model checkpoints trained after stage-2 | [trace](https://huggingface.co/Yongxin-Guo/trace) | |
|
| FT-Charades | Fine-tuned on the Charades-STA dataset | [trace-ft-charades](https://huggingface.co/Yongxin-Guo/trace-ft-charades) |

| FT-Youcook2 | Fine-tuned on the YouCook2 dataset | [trace-ft-youcook2](https://huggingface.co/Yongxin-Guo/trace-ft-youcook2) |

| FT-QVHighlights | Fine-tuned on the QVHighlights dataset | [trace-ft-qvhighlights](https://huggingface.co/Yongxin-Guo/trace-ft-qvhighlights) |
|
| TRACE-retrieval | Forcing the predicted timestamps to align with the input frame timestamps | [trace-retrieval](https://huggingface.co/Yongxin-Guo/trace-retrieval) |

| TRACE-uni | Incorporating additional general video understanding data from a subset of [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K) | [trace-uni](https://huggingface.co/Yongxin-Guo/trace-uni) |
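
Any of these checkpoints can be fetched locally with the standard `huggingface_hub` client, as sketched below; for actually running inference, follow the instructions in the [GitHub repository](https://github.com/gyxxyg/TRACE).

```python
from huggingface_hub import snapshot_download

# Download the stage-2 TRACE checkpoint; swap repo_id for any other
# checkpoint in the table above (e.g. "Yongxin-Guo/trace-uni").
local_dir = snapshot_download(repo_id="Yongxin-Guo/trace")
print(f"Checkpoint files downloaded to: {local_dir}")
```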
|
|
|
#### Results |
|
|
|
| YouCook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
|
| --- | --- | --- | --- | --- | |
|
| TRACE | 8.1 | 2.8 | 2.2 | 22.4 | |
|
| TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |
|
| TRACE-uni | 8.6 | 2.9 | 2.3 | 22.4 | |
|
|
|
| Charades-STA (Zero-Shot) | R@0.3 | R@0.5 | R@0.7 | mIoU |
|
| --- | --- | --- | --- | --- | |
|
| TRACE | 58.6 | 40.3 | 19.4 | 38.7 | |
|
| TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 | |
|
| TRACE-uni | 63.7 | 43.7 | 21.0 | 41.5 | |
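
Here R@0.3/0.5/0.7 follow the usual moment-retrieval convention: the fraction of queries whose predicted moment reaches that temporal IoU with the ground-truth moment, and mIoU is the mean IoU over all queries. A minimal sketch of temporal IoU:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# This prediction counts toward R@0.5 (IoU >= 0.5) but not R@0.7.
print(temporal_iou((2.0, 9.0), (4.0, 10.0)))  # 5 / 8 = 0.625
```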
|
|
|
| QVHighlights (Zero-Shot) | mAP | Hit@1 | |
|
| --- | --- | --- | |
|
| TRACE | 26.8 | 42.7 | |
|
| TRACE-retrieval | 27.9 | 44.3 | |
|
| TRACE-uni | 27.5 | 43.9 | |
|
|
|
|
|
| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
|
| --- | --- | --- | --- | --- | |
|
| TRACE | 25.9 | 6.0 | 6.4 | 39.3 | |
|
| TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 | |
|
| TRACE-uni | 29.2 | 6.9 | 6.4 | 40.4 | |
|
|
|
| ActivityNet-MR | R@0.3 | R@0.5 | R@0.7 | mIoU |
|
| --- | --- | --- | --- | --- | |
|
| TRACE | 54.0 | 37.7 | 24.0 | 39.0 | |
|
| TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 | |
|
| TRACE-uni | 53.2 | 38.2 | 24.7 | 39.4 | |
|
|
|
| MVBench | Avg | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI | |
|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | |
|
| TRACE | 48.1 | 61.2 | 56.5 | 72.5 | 46.5 | 61.0 | 48.0 | 69.5 | 40.0 | 22.0 | 31.0 | 86.5 | 37.5 | 37.0 | 51.0 | 45.0 | 40.5 | 39.0 | 31.0 | 43.5 | 44.5 | |
|
| TRACE-uni | 53.8 | 68.1 | 58.5 | 72.5 | 41.5 | 73.5 | 55.1 | 71.5 | 40.5 | 25.0 | 53.0 | 88.5 | 63.5 | 38.5 | 51.0 | 52.5 | 49.0 | 59.5 | 33.5 | 49.5 | 32.5 | |
|
|
|
|
|
| VideoMME (w/o subtitle) | Short | Medium | Long | Avg |
|
| --- | --- | --- | --- | --- | |
|
| TRACE | 49.5 | 42.5 | 39.3 | 43.8 | |
|
| TRACE-uni | 58.2 | 48.1 | 42.3 | 49.6 | |
|
|
|
#### Bibliography |
|
If you find this repository helpful for your project, please consider citing: |
|
``` |
|
@misc{guo2024tracetemporalgroundingvideo, |
|
title={TRACE: Temporal Grounding Video LLM via Causal Event Modeling}, |
|
author={Yongxin Guo and Jingyu Liu and Mingda Li and Xiaoying Tang and Qingbin Liu and Xi Chen}, |
|
year={2024}, |
|
eprint={2410.05643}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2410.05643}, |
|
} |
|
``` |
|
|