|
---
license: apache-2.0
datasets:
- Loie/VGGSound
base_model:
- riffusion/riffusion-model-v1
pipeline_tag: video-to-audio
tags:
- video2audio
---
|
|
|
<h1 align="center">Kandinsky-4-v2a: A Video-to-Audio Pipeline</h1>
|
|
|
<br><br><br><br> |
|
|
|
<div align="center"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/Mi3ugli7f1MNNVWC5gzMS.png" />
|
</div> |
|
|
|
<div align="center"> |
|
<a href="https://habr.com/ru/companies/sberbank/articles/866156/">Kandinsky 4.0 Post</a> | <a href="https://ai-forever.github.io/Kandinsky-4/K40/">Project Page</a> | <a>Technical Report</a> | <a href="https://github.com/ai-forever/Kandinsky-4">GitHub</a> | <a href="https://huggingface.co/ai-forever/kandinsky-4-t2v-flash">Kandinsky 4.0 T2V Flash HuggingFace</a> | <a href="https://huggingface.co/ai-forever/kandinsky-4-v2a">Kandinsky 4.0 V2A HuggingFace</a>
|
</div> |
|
|
|
|
|
## Description |
|
|
|
The Video-to-Audio pipeline consists of a visual encoder, a text encoder, a UNet diffusion model that generates a spectrogram, and the Griffin-Lim algorithm that converts the spectrogram into audio.

The visual and text encoders share the same multimodal vision-language decoder ([cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat)).

Our UNet diffusion model is a fine-tune of the music generation model [riffusion](https://huggingface.co/riffusion/riffusion-model-v1). We modified the architecture to condition on video frames and to improve synchronization between video and audio. We also replaced the text encoder with the decoder of [cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat).
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/mLXroYZt8X2brCDGPcPJZ.png) |
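Since the UNet outputs only a magnitude spectrogram, the phase must be estimated before a waveform can be produced. As a minimal sketch of that last step, the snippet below runs Griffin-Lim via `torchaudio`; the spectrogram shape, `n_fft`, `hop_length`, and sample rate here are illustrative assumptions, not the pipeline's actual settings:

```python
import torch
import torchaudio

# Illustrative settings; the pipeline's actual spectrogram parameters may differ.
n_fft, hop_length, sample_rate = 1024, 256, 16000

# Stand-in for a magnitude spectrogram from the diffusion model: (freq_bins, frames).
spectrogram = torch.rand(n_fft // 2 + 1, 512)

# Griffin-Lim iteratively estimates the phase that a magnitude spectrogram discards.
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop_length)
waveform = griffin_lim(spectrogram)  # 1-D tensor of audio samples

torchaudio.save("reconstructed.wav", waveform.unsqueeze(0), sample_rate)
```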
|
|
|
## Installation |
|
|
|
```bash
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
conda install -c conda-forge ffmpeg -y
pip install -r kandinsky4_video2audio/requirements.txt
pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
```
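A quick way to confirm the environment is usable (purely a convenience check, not part of the official setup):

```python
# Minimal environment sanity check.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```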
|
|
|
## Inference |
|
|
|
Inference code for Video-to-Audio: |
|
|
|
```python
import torch
import torchvision

from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video

device = 'cuda:0'

# Load the pipeline in fp16.
pipe = Video2AudioPipeline(
    "ai-forever/kandinsky-4-v2a",
    torch_dtype=torch.float16,
    device=device
)

# Read the input video; torchvision also returns a metadata dict with the frame rate.
video_path = 'assets/inputs/1.mp4'
video, _, info = torchvision.io.read_video(video_path)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

# Sample up to 96 frames covering at most 12 seconds of the video.
video_input, video_complete, duration_sec = load_video(
    video, info['video_fps'], num_frames=96, max_duration_sec=12
)

# Generate the audio track conditioned on the video frames and the prompt.
out = pipe(
    video_input,
    prompt,
    negative_prompt=negative_prompt,
    duration_sec=duration_sec,
)[0]

# Mux the generated audio back into the video and save the result.
save_path = 'assets/outputs/1.mp4'
create_video(
    out,
    video_complete,
    display_video=True,
    save_path=save_path,
    device=device
)
```
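`create_video` muxes the generated audio back into the input video. If you only need the audio itself, one option (a sketch assuming the output path from the example above) is to read the saved file back and export the track as a WAV:

```python
import torchvision
import torchaudio

# Read the saved result; the second return value is the audio tensor (channels, samples).
_, audio, info = torchvision.io.read_video("assets/outputs/1.mp4")

# Write the generated audio track as a standalone WAV file.
torchaudio.save("assets/outputs/1.wav", audio, int(info["audio_fps"]))
```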
|
|
|
|
|
Example results:

<table border="0" style="width: 200px; text-align: left; margin-top: 20px;">
|
<tr> |
|
<td> |
|
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/5fmRhFzZjqGd0q3ghJ7wW.mp4" width=200 controls playsinline></video> |
|
</td> |
|
<td> |
|
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/GZ4V3G5Zl1AVQ8Zo92CTm.mp4" width=200 controls playsinline></video> |
|
</td> |
|
<td> |
|
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/2TZSE1hFeJCJYMI4zU_Ea.mp4" width=200 controls playsinline></video> |
|
</td> |
|
</tr> |
|
</table> |
|
|
|
|
|
## Authors
|
+ Zein Shaheen: [GitHub](https://github.com/zeinsh) |
|
+ Arseniy Shakhmatov: [GitHub](https://github.com/cene555), [Blog](https://t.me/gradientdip)
|
+ Ivan Kirillov: [GitHub](https://github.com/funnylittleman) |
|
+ Andrei Shutkin: [GitHub](https://github.com/maleficxp) |
|
+ Denis Parkhomenko: [GitHub](https://github.com/nihao88) |
|
+ Julia Agafonova: [GitHub](https://github.com/Julia132)
|
+ Andrey Kuznetsov: [GitHub](https://github.com/kuznetsoffandrey), [Blog](https://t.me/complete_ai) |
|
+ Denis Dimitrov: [GitHub](https://github.com/denndimitrov), [Blog](https://t.me/dendi_math_ai) |