---
license: apache-2.0
datasets:
- Loie/VGGSound
base_model:
- riffusion/riffusion-model-v1
pipeline_tag: video-to-audio
tags:
- video2audio
---
<h1 align="center">Kandinsky-4-v2a: A Video to Audio pipeline</h1>
<br><br><br><br>
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/Mi3ugli7f1MNNVWC5gzMS.png">
</div>
<div align="center">
<a href="https://habr.com/ru/companies/sberbank/articles/866156/">Kandinsky 4.0 Post</a> | <a href="https://ai-forever.github.io/Kandinsky-4/K40/">Project Page</a> | <a>Technical Report</a> | <a href="https://github.com/ai-forever/Kandinsky-4">GitHub</a> | <a href="https://huggingface.co/ai-forever/kandinsky-4-t2v-flash">Kandinsky 4.0 T2V Flash HuggingFace</a> | <a href="https://huggingface.co/ai-forever/kandinsky-4-v2a">Kandinsky 4.0 V2A HuggingFace</a>
</div>
## Description
The Video-to-Audio pipeline consists of a visual encoder, a text encoder, a UNet diffusion model that generates a spectrogram, and the Griffin-Lim algorithm that converts the spectrogram into audio.
The visual and text encoders share the same multimodal vision-language decoder ([cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat)).
Our UNet diffusion model is a fine-tune of the music generation model [riffusion](https://huggingface.co/riffusion/riffusion-model-v1). We modified the architecture to condition on video frames and to improve synchronization between video and audio, and we replaced the original text encoder with the decoder of [cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/mLXroYZt8X2brCDGPcPJZ.png)
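For intuition, the final stage (turning the generated spectrogram back into a waveform with Griffin-Lim) can be illustrated with the stock `torchaudio` transform. This is a minimal sketch of the algorithm only; the `n_fft`, `hop_length`, and iteration count below are illustrative assumptions, not the pipeline's actual settings.
```python
import torch
import torchaudio

# Illustrative parameters, not the ones used inside the pipeline.
n_fft, hop_length = 1024, 256
spec = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop_length, power=1.0)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop_length, power=1.0, n_iter=64)

waveform = torch.randn(1, 16000)        # stand-in for a real audio signal
magnitude = spec(waveform)              # magnitude spectrogram, shape (1, n_fft // 2 + 1, frames)
reconstructed = griffin_lim(magnitude)  # iterative phase recovery back to a waveform
```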
## Installation
```bash
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
conda install -c conda-forge ffmpeg -y
pip install -r kandinsky4_video2audio/requirements.txt
pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
```
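Optionally, a quick sanity check (run from the repository root, assuming the steps above completed) verifies that the package imports and that a GPU is visible:
```python
# Sanity check after installation: run from the Kandinsky-4 repository root.
import torch
from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline  # noqa: F401
from kandinsky4_video2audio.utils import load_video, create_video        # noqa: F401

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```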
## Inference
Inference code for Video-to-Audio:
```python
import torch
import torchvision
from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video
device = 'cuda:0'

# Load the pretrained Video-to-Audio pipeline in half precision.
pipe = Video2AudioPipeline(
    "ai-forever/kandinsky-4-v2a",
    torch_dtype=torch.float16,
    device=device,
)

# Read the input video; torchvision returns (frames, audio, info).
video_path = 'assets/inputs/1.mp4'
video, _, info = torchvision.io.read_video(video_path)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

# Sample up to 96 frames covering at most 12 seconds of the clip.
video_input, video_complete, duration_sec = load_video(
    video, info['video_fps'], num_frames=96, max_duration_sec=12
)

# Generate audio conditioned on the video frames and the text prompts.
out = pipe(
    video_input,
    prompt,
    negative_prompt=negative_prompt,
    duration_sec=duration_sec,
)[0]

# Mux the generated audio with the original frames and save the result.
save_path = 'assets/outputs/1.mp4'
create_video(
    out,
    video_complete,
    display_video=True,
    save_path=save_path,
    device=device,
)
```
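The same `pipe` object can be reused across clips. Below is a minimal sketch of batch processing every MP4 in a folder with the settings from the example above; the directory paths are illustrative assumptions.
```python
from pathlib import Path

import torch
import torchvision

from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video

device = 'cuda:0'
pipe = Video2AudioPipeline("ai-forever/kandinsky-4-v2a", torch_dtype=torch.float16, device=device)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

# Illustrative directories; adjust to your own layout.
input_dir, output_dir = Path('assets/inputs'), Path('assets/outputs')
output_dir.mkdir(parents=True, exist_ok=True)

for video_path in sorted(input_dir.glob('*.mp4')):
    video, _, info = torchvision.io.read_video(str(video_path))
    video_input, video_complete, duration_sec = load_video(
        video, info['video_fps'], num_frames=96, max_duration_sec=12
    )
    out = pipe(video_input, prompt, negative_prompt=negative_prompt, duration_sec=duration_sec)[0]
    # display_video=False keeps batch runs non-interactive.
    create_video(
        out,
        video_complete,
        display_video=False,
        save_path=str(output_dir / video_path.name),
        device=device,
    )
```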
<table border="0" style="width: 200; text-align: left; margin-top: 20px;">
<tr>
<td>
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/5fmRhFzZjqGd0q3ghJ7wW.mp4" width=200 controls playsinline></video>
</td>
<td>
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/GZ4V3G5Zl1AVQ8Zo92CTm.mp4" width=200 controls playsinline></video>
</td>
<td>
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/2TZSE1hFeJCJYMI4zU_Ea.mp4" width=200 controls playsinline></video>
</td>
</tr>
</table>
## Authors
+ Zein Shaheen: [GitHub](https://github.com/zeinsh)
+ Arseniy Shakhmatov: [GitHub](https://github.com/cene555), [Blog](https://t.me/gradientdip)
+ Ivan Kirillov: [GitHub](https://github.com/funnylittleman)
+ Andrei Shutkin: [GitHub](https://github.com/maleficxp)
+ Denis Parkhomenko: [GitHub](https://github.com/nihao88)
+ Julia Agafonova: [GitHub](https://github.com/Julia132)
+ Andrey Kuznetsov: [GitHub](https://github.com/kuznetsoffandrey), [Blog](https://t.me/complete_ai)
+ Denis Dimitrov: [GitHub](https://github.com/denndimitrov), [Blog](https://t.me/dendi_math_ai)