Video Inference or training

#4
by dosun - opened

I have tried to run this model for video captioning. However, it only returns one caption per frame rather than a single caption for the whole clip. In the original paper, the model supports video by taking multiple frames as input. Is this supported in the Hugging Face implementation as well?

Hi,

For video captioning, I'd recommend taking a look at the GIT checkpoints fine-tuned on video datasets, such as https://huggingface.co/microsoft/git-base-vatex
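
As a rough sketch of how that checkpoint could be used: sample a fixed number of frames from the clip, preprocess them together, and generate one caption for the clip. The helper names `sample_frame_indices` and `caption_video` are made up for illustration, and `frames` is assumed to be a list of decoded RGB frames (PIL images or HxWx3 arrays) that you have already read with a video library of your choice. The step that adds a leading batch dimension assumes the model treats 5-D `pixel_values` as (batch, frames, channels, height, width) and 4-D input as a batch of separate images; that assumption may also explain getting one caption per frame, but please check it against your installed `transformers` version.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples evenly spaced frame indices (midpoint of each segment)."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    return [int((i + 0.5) * total_frames / num_samples) for i in range(num_samples)]


def caption_video(frames, checkpoint="microsoft/git-base-vatex", max_length=50):
    """Generate a single caption for a clip given its decoded frames.

    Hypothetical helper: `frames` is a list of RGB frames (PIL images or
    HxWx3 uint8 arrays) already decoded from the video.
    """
    # Imported lazily so the sampling helper above works without transformers.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Video checkpoints store how many frames they expect in the config;
    # image-only checkpoints are assumed to leave this unset.
    num_frames = model.config.num_image_with_embedding
    if not num_frames:
        raise ValueError(f"{checkpoint} does not look like a video checkpoint")

    indices = sample_frame_indices(len(frames), num_frames)
    clip = [frames[i] for i in indices]

    pixel_values = processor(images=clip, return_tensors="pt").pixel_values
    # Assumption: 4-D input is read as a batch of independent images, so add
    # a batch dimension to get (1, num_frames, channels, height, width).
    if pixel_values.ndim == 4:
        pixel_values = pixel_values.unsqueeze(0)

    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `caption_video(frames)` downloads the checkpoint on first use. The frame sampling is deliberately simple; if your clip has fewer frames than the model expects, some frames are repeated.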
