---
tags:
- vision
- clip
- clip4clip
- video
- retrieval
pipeline_tag: text-to-video
---

# Model Card

## Details

This model was trained with CLIP4Clip, a video retrieval method based on the CLIP framework. The method is described in the [paper](https://arxiv.org/pdf/2104.08860.pdf) and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).

Training used 150,000 videos from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a large collection of short videos with corresponding textual descriptions sourced from the web.

To adapt the CLIP model obtained during training, we adjusted its weights and integrated them into the implementation of [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), with some modifications to the final layers.

### Use with Transformers

```python
import numpy as np
import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection

search_sentence = "a basketball player performing a slam dunk"

model = CLIPTextModelWithProjection.from_pretrained("Diangle/clip4clip-webvid")
tokenizer = AutoTokenizer.from_pretrained("Diangle/clip4clip-webvid")

inputs = tokenizer(text=search_sentence, return_tensors="pt", padding=True)
# With return_dict=False the model returns a tuple; outputs[1] is the last hidden state.
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], return_dict=False)

# Special projection and changed last layers:
# project every token's hidden state, then keep the position of the highest
# token id in each sequence (the end-of-text token in CLIP's vocabulary).
text_projection = model.state_dict()["text_projection.weight"]
text_embeds = outputs[1] @ text_projection
final_output = text_embeds[torch.arange(text_embeds.shape[0]), inputs["input_ids"].argmax(dim=-1)]

# Normalizing the embeddings:
final_output = final_output / final_output.norm(dim=-1, keepdim=True)
final_output = final_output.cpu().detach().numpy()
# After the L2 normalization above the squared norm is 1, so this division leaves the values unchanged.
sequence_output = final_output / np.sum(final_output**2, axis=1, keepdims=True)
print("sequence_output: ", sequence_output)
```

## Model Use

### Intended Use

This model is intended for video retrieval. For an example, see this [**space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
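
As a rough illustration of how a text embedding can drive retrieval, the sketch below ranks a set of video embeddings by cosine similarity to a query embedding. The function name `rank_videos` and the random placeholder embeddings are assumptions made for illustration only; in practice the query would be a text embedding such as `sequence_output[0]` from the Transformers snippet, and the video embeddings would come from your own video collection. The dimension 512 is assumed to match the projection size of clip-vit-base-patch32.

```python
import numpy as np


def rank_videos(text_embedding, video_embeddings, top_k=5):
    """Return indices and scores of the top_k videos most similar to the query."""
    # L2-normalize both sides so the dot product equals cosine similarity.
    text = text_embedding / np.linalg.norm(text_embedding)
    videos = video_embeddings / np.linalg.norm(video_embeddings, axis=1, keepdims=True)
    scores = videos @ text
    # Sort by descending similarity and keep the best top_k.
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Placeholder data (hypothetical): 100 random 512-dimensional video embeddings.
rng = np.random.default_rng(0)
video_embeddings = rng.normal(size=(100, 512)).astype(np.float32)
# Hypothetical query: a noisy copy of video 42, standing in for a matching text embedding.
text_embedding = video_embeddings[42] + 0.1 * rng.normal(size=512).astype(np.float32)

indices, scores = rank_videos(text_embedding, video_embeddings)
print(indices, scores)
```

For large collections, the video embeddings would normally be computed once and stored, so that only the query embedding has to be computed at search time.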

### Extra Information

For video embedding, there is an additional notebook that describes how to embed videos.
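
The notebook is not reproduced here. As a hedged sketch of one option described in the CLIP4Clip paper, parameter-free mean pooling, per-frame image embeddings can be averaged into a single video embedding. The function `mean_pool_frames`, the frame count, and the random frame embeddings are hypothetical; whether the notebook follows exactly these steps is an assumption.

```python
import numpy as np


def mean_pool_frames(frame_embeddings):
    """Average per-frame embeddings of shape (num_frames, dim) into one unit-length vector."""
    # Normalize each frame so that every frame contributes equally to the mean.
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    video = frames.mean(axis=0)
    # Normalize again so the result can be compared by dot product.
    return video / np.linalg.norm(video)


# Hypothetical input: embeddings of 12 frames sampled from one video.
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(12, 512))
video_embedding = mean_pool_frames(frame_embeddings)
print(video_embedding.shape)
```

In a real pipeline the frame embeddings would come from the image encoder applied to sampled video frames rather than from random numbers.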

## Performance and Limitations

### Performance

We evaluated several models on the last 10k video clips from the WebVid dataset. R1, R5, and R10 are recall at ranks 1, 5, and 10 in percent (higher is better); MedianR and MeanR are the median and mean rank of the correct video (lower is better).

| Model | R1 | R5 | R10 | MedianR | MeanR |
|------------------------|-------|-------|-------|-----|---------|
| Zero-shot CLIP weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128 |
| CLIP4Clip weights trained on MSR-VTT | 38.38 | 62.89 | 72.01 | 3.0 | 39.3023 |
| **CLIP4Clip trained on 150k WebVid** | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535 |
| Binarized CLIP4Clip trained on 150k WebVid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964 |
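
To make the column names concrete, here is a minimal sketch of how such retrieval metrics are commonly computed from a text-to-video similarity matrix. It is not the evaluation code behind the table; the function name `retrieval_metrics`, the toy matrix, and the assumption that text `i` matches video `i` are illustrative only.

```python
import numpy as np


def retrieval_metrics(sim):
    """Compute recall and rank statistics from sim[i, j] = similarity of text i and video j.

    Assumes text i matches video i, so the correct scores lie on the diagonal.
    """
    correct = np.diag(sim)[:, None]
    # Rank of the correct video = 1 + number of videos that score strictly higher.
    ranks = 1 + np.sum(sim > correct, axis=1)
    return {
        "R1": 100.0 * float(np.mean(ranks <= 1)),
        "R5": 100.0 * float(np.mean(ranks <= 5)),
        "R10": 100.0 * float(np.mean(ranks <= 10)),
        "MedianR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }


# Toy example: 3 text queries against 3 videos.
sim = np.array([
    [0.9, 0.1, 0.2],  # correct video ranked 1st
    [0.3, 0.2, 0.8],  # correct video ranked 3rd
    [0.1, 0.4, 0.7],  # correct video ranked 1st
])
metrics = retrieval_metrics(sim)
print(metrics)
```

Because MeanR averages all ranks, a few queries with very poor ranks can raise it sharply even when recall and MedianR stay almost unchanged.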

For more information about the evaluation, see this [notebook].