license: mit
pipeline_tag: video-classification

VideoMAEv2_TikTok

We provide weights pre-trained on the TikTokActions dataset for two backbones, ViT-B (Vision Transformer-Base) and ViT-Giant, along with weights fine-tuned on Kinetics-400 for both backbones.

Pre-trained and Fine-tuned Weights

  • Pre-trained weights on TikTokActions: These weights were trained using TikTok video clips categorized into multiple actions. The dataset consists of 283,582 unique videos across 386 hashtags.
  • Fine-tuned weights on Kinetics-400: After pre-training, the models were fine-tuned on Kinetics-400, achieving state-of-the-art results (85.51% with the ViT-Giant backbone; see Results).

We also provide the log.txt file, which includes information on the fine-tuning process.
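
As a rough sketch of how the fine-tuning log could be inspected: the snippet below assumes log.txt holds one JSON object per line, as VideoMAE-style training scripts commonly write, and that each record carries `epoch` and `val_acc1` keys. Both the file layout and the key names are assumptions and may need adjusting to the actual file.

```python
import json


def parse_log_lines(lines):
    """Parse JSON-lines log records, skipping blank or non-JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON lines in the log
        if isinstance(record, dict):
            records.append(record)
    return records


def best_epoch(records, metric="val_acc1"):
    """Return (epoch, value) for the record with the highest metric, or None.

    The default metric name "val_acc1" is an assumption about the log format.
    """
    scored = [r for r in records if isinstance(r.get(metric), (int, float))]
    if not scored:
        return None
    best = max(scored, key=lambda r: r[metric])
    return best.get("epoch"), best[metric]
```

With the file downloaded locally, something like `best_epoch(parse_log_lines(open("log.txt")))` would report the epoch with the highest validation accuracy, provided the assumed keys are present.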

To use the weights and fine-tuning scripts, please refer to VideoMAEv2's GitHub repository for implementation details.
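
Before handing a downloaded checkpoint to a model, it is often necessary to unwrap it. The helper below is a minimal sketch, not the repository's own loader: it assumes the checkpoint is a dictionary that may nest the weights under a `model` or `module` key, and that parameter names may carry a `backbone.` or `encoder.` prefix. The function name, the key names, and the prefixes are all illustrative assumptions to verify against the actual files.

```python
def extract_state_dict(checkpoint,
                       model_keys=("model", "module"),
                       strip_prefixes=("backbone.", "encoder.")):
    """Return a flat {parameter_name: value} mapping from a checkpoint dict.

    model_keys and strip_prefixes are assumptions about the checkpoint
    layout; they are not confirmed for these specific weight files.
    """
    # Use the first matching wrapper key, else treat the checkpoint
    # itself as the state dict.
    state = checkpoint
    for key in model_keys:
        if key in checkpoint:
            state = checkpoint[key]
            break

    # Strip at most one known prefix from each parameter name.
    cleaned = {}
    for name, value in state.items():
        for prefix in strip_prefixes:
            if name.startswith(prefix):
                name = name[len(prefix):]
                break
        cleaned[name] = value
    return cleaned
```

In a PyTorch workflow, the input would typically come from `torch.load(path, map_location="cpu")`, and the result could be passed to `model.load_state_dict(..., strict=False)` so that any mismatched keys are reported rather than raising an error.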

Citation

For VideoMAEv2, please cite the following works:

@InProceedings{wang2023videomaev2,
    author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
    title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {14549-14560}
}

@misc{videomaev2,
    title         = {VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
    author        = {Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
    year          = {2023},
    eprint        = {2303.16727},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV}
}

For our repository, please cite the following paper:

@article{qian2024actionrecognition,
    author  = {Yang Qian and Yinan Sun and Ali Kargarandehkordi and Parnian Azizian and Onur Cezmi Mutlu and Saimourya Surabhi and Pingyi Chen and Zain Jabbar and Dennis Paul Wall and Peter Washington},
    title   = {Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos},
    journal = {arXiv preprint arXiv:2402.08875},
    year    = {2024},
    pages   = {10},
    doi     = {10.48550/arXiv.2402.08875}
}

Results

Our model achieves the following results on established action recognition benchmarks using the ViT-Giant backbone:

  • UCF101: 99.05%
  • HMDB51: 86.08%
  • Kinetics-400: 85.51%
  • Something-Something V2: 74.27%

These results suggest that diverse, unlabeled, and dynamic video content is an effective source of training data for foundation models, especially in the domain of action recognition.

License

This project is licensed under the MIT License.