|
--- |
|
license: mit |
|
pipeline_tag: video-classification |
|
--- |
|
|
|
# VideoMAEv2_TikTok |
|
|
|
We provide pre-trained weights on the **TikTokActions** dataset for two backbones: **ViT-B** (Vision Transformer-Base) and **ViT-Giant**. Additionally, we include fine-tuned weights on **Kinetics-400** for both backbones. |
|
|
|
## Pre-trained and Fine-tuned Weights |
|
- **Pre-trained weights on TikTokActions**: These weights were pre-trained on TikTok video clips covering a broad range of human actions. The dataset consists of 283,582 unique videos collected from 386 hashtags.
|
- **Fine-tuned weights on Kinetics-400**: After pre-training, the models were fine-tuned on Kinetics-400, achieving state-of-the-art results. |
|
|
|
We also provide the `log.txt` file, which includes information on the fine-tuning process. |
|
|
|
To use the weights and fine-tuning scripts, please refer to [VideoMAEv2's GitHub repository](https://github.com/OpenGVLab/VideoMAEv2) for implementation details. |
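
As a quick sanity check before running the official fine-tuning scripts, the checkpoints can be inspected and loaded with plain PyTorch. The sketch below is a minimal example; the checkpoint file name and the commented-out model-builder name are assumptions and should be replaced with the actual file from this repository and the corresponding model factory from the VideoMAEv2 code base.

```python
# Minimal sketch for inspecting a pre-trained checkpoint from this repo.
# Assumptions: the file is named "vit_g_tiktok_pretrain.pth" (replace with the
# actual file name) and the VideoMAEv2 code base is available on the PYTHONPATH.
import torch

# Load on CPU first; checkpoints commonly store weights under a "model" or "module" key.
ckpt = torch.load("vit_g_tiktok_pretrain.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt.get("module", ckpt))

# Print a few parameter names and shapes to confirm the backbone (ViT-B vs. ViT-Giant).
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# To build the matching backbone, use the model factories defined in the VideoMAEv2
# repository and load with strict=False, since pre-training-only components
# (e.g. decoder weights) are not needed for fine-tuning:
#
#   model = create_model("vit_giant_patch14_224", ...)  # model name is an assumption
#   missing, unexpected = model.load_state_dict(state_dict, strict=False)
```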
|
|
|
## Citation |
|
|
|
For **VideoMAEv2**, please cite the following works: |
|
|
|
    @InProceedings{wang2023videomaev2,
        author    = {Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
        title     = {VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month     = {June},
        year      = {2023},
        pages     = {14549-14560}
    }
|
|
|
    @misc{videomaev2,
        title         = {VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
        author        = {Limin Wang and Bingkun Huang and Zhiyu Zhao and Zhan Tong and Yinan He and Yi Wang and Yali Wang and Yu Qiao},
        year          = {2023},
        eprint        = {2303.16727},
        archivePrefix = {arXiv},
        primaryClass  = {cs.CV}
    }
|
|
|
|
|
For **our repository**, please cite the following paper: |
|
|
|
    @article{qian2024actionrecognition,
        author  = {Yang Qian and Yinan Sun and Ali Kargarandehkordi and Parnian Azizian and Onur Cezmi Mutlu and Saimourya Surabhi and Pingyi Chen and Zain Jabbar and Dennis Paul Wall and Peter Washington},
        title   = {Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos},
        journal = {arXiv preprint arXiv:2402.08875},
        year    = {2024},
        pages   = {10},
        doi     = {10.48550/arXiv.2402.08875}
    }
|
|
|
|
|
## Results |
|
|
|
Our model achieves the following results on established action recognition benchmarks using the **ViT-Giant** backbone: |
|
- **UCF101**: 99.05% |
|
- **HMDB51**: 86.08% |
|
- **Kinetics-400**: 85.51% |
|
- **Something-Something V2**: 74.27% |
|
|
|
These results highlight the power of using diverse, unlabeled, and dynamic video content for training foundation models, especially in the domain of action recognition. |
|
|
|
## License |
|
This project is licensed under the MIT License. |
|
|