---
license: apache-2.0
datasets:
- Loie/VGGSound
base_model:
- riffusion/riffusion-model-v1
pipeline_tag: video-to-audio
tags:
- video2audio
---
<h1 align="center">Kandinsky-4-v2a: A Video to Audio pipeline</h1>
<br><br><br><br>
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/Mi3ugli7f1MNNVWC5gzMS.png">
</div>
<div align="center">
<a href="https://habr.com/ru/companies/sberbank/articles/866156/">Kandinsky 4.0 Post</a> | <a href="https://ai-forever.github.io/Kandinsky-4/K40/">Project Page</a> | <a>Technical Report</a> | <a href="https://github.com/ai-forever/Kandinsky-4">GitHub</a> | <a href="https://huggingface.co/ai-forever/kandinsky-4-t2v-flash">Kandinsky 4.0 T2V Flash HuggingFace</a> | <a href="https://huggingface.co/ai-forever/kandinsky-4-v2a">Kandinsky 4.0 V2A HuggingFace</a>
</div>
## Description
The Video-to-Audio pipeline consists of a visual encoder, a text encoder, a UNet diffusion model that generates a spectrogram, and the Griffin-Lim algorithm that converts the spectrogram into audio.
The visual and text encoders share the same multimodal vision-language decoder ([cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat)).
Our UNet diffusion model is a fine-tune of the music generation model [riffusion](https://huggingface.co/riffusion/riffusion-model-v1). We modified the architecture to condition on video frames and to improve synchronization between video and audio, and we replaced the original text encoder with the decoder of [cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/mLXroYZt8X2brCDGPcPJZ.png)
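For intuition, the final stage (turning the generated spectrogram back into a waveform with Griffin-Lim) can be illustrated with the stock `torchaudio` transform. This is a minimal sketch of the algorithm only; the `n_fft`, `hop_length`, and iteration count below are illustrative assumptions, not the pipeline's actual settings.
```python
import torch
import torchaudio

# Illustrative parameters, not the ones used inside the pipeline.
n_fft, hop_length = 1024, 256
spec = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop_length, power=1.0)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop_length, power=1.0, n_iter=64)

waveform = torch.randn(1, 16000)        # stand-in for a real audio signal
magnitude = spec(waveform)              # magnitude spectrogram, shape (1, n_fft // 2 + 1, frames)
reconstructed = griffin_lim(magnitude)  # iterative phase recovery back to a waveform
```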
## Installation
```bash
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
conda install -c conda-forge ffmpeg -y
pip install -r kandinsky4_video2audio/requirements.txt
pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
```
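Optionally, a quick sanity check (run from the repository root, assuming the steps above completed) verifies that the package imports and that a GPU is visible:
```python
# Sanity check after installation: run from the Kandinsky-4 repository root.
import torch
from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline  # noqa: F401
from kandinsky4_video2audio.utils import load_video, create_video        # noqa: F401

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```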
## Inference
Inference code for Video-to-Audio:
```python
import torch
import torchvision
from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video
device = 'cuda:0'

# Load the pretrained Video-to-Audio pipeline in half precision.
pipe = Video2AudioPipeline(
    "ai-forever/kandinsky-4-v2a",
    torch_dtype=torch.float16,
    device=device,
)

# Read the input video; torchvision returns (frames, audio, info).
video_path = 'assets/inputs/1.mp4'
video, _, info = torchvision.io.read_video(video_path)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

# Sample up to 96 frames covering at most 12 seconds of the clip.
video_input, video_complete, duration_sec = load_video(
    video, info['video_fps'], num_frames=96, max_duration_sec=12
)

# Generate audio conditioned on the video frames and the text prompts.
out = pipe(
    video_input,
    prompt,
    negative_prompt=negative_prompt,
    duration_sec=duration_sec,
)[0]

# Mux the generated audio with the original frames and save the result.
save_path = 'assets/outputs/1.mp4'
create_video(
    out,
    video_complete,
    display_video=True,
    save_path=save_path,
    device=device,
)
```
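The same `pipe` object can be reused across clips. Below is a minimal sketch of batch processing every MP4 in a folder with the settings from the example above; the directory paths are illustrative assumptions.
```python
from pathlib import Path

import torch
import torchvision

from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video

device = 'cuda:0'
pipe = Video2AudioPipeline("ai-forever/kandinsky-4-v2a", torch_dtype=torch.float16, device=device)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

# Illustrative directories; adjust to your own layout.
input_dir, output_dir = Path('assets/inputs'), Path('assets/outputs')
output_dir.mkdir(parents=True, exist_ok=True)

for video_path in sorted(input_dir.glob('*.mp4')):
    video, _, info = torchvision.io.read_video(str(video_path))
    video_input, video_complete, duration_sec = load_video(
        video, info['video_fps'], num_frames=96, max_duration_sec=12
    )
    out = pipe(video_input, prompt, negative_prompt=negative_prompt, duration_sec=duration_sec)[0]
    # display_video=False keeps batch runs non-interactive.
    create_video(
        out,
        video_complete,
        display_video=False,
        save_path=str(output_dir / video_path.name),
        device=device,
    )
```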
<table border="0" style="width: 200; text-align: left; margin-top: 20px;">
<tr>
<td>
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/5fmRhFzZjqGd0q3ghJ7wW.mp4" width=200 controls playsinline></video>
</td>
<td>
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/GZ4V3G5Zl1AVQ8Zo92CTm.mp4" width=200 controls playsinline></video>
</td>
<td>
<video src="https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/2TZSE1hFeJCJYMI4zU_Ea.mp4" width=200 controls playsinline></video>
</td>
</tr>
</table>
## Authors
+ Zein Shaheen: [GitHub](https://github.com/zeinsh)
+ Arseniy Shakhmatov: [GitHub](https://github.com/cene555), [Blog](https://t.me/gradientdip)
+ Ivan Kirillov: [GitHub](https://github.com/funnylittleman)
+ Andrei Shutkin: [GitHub](https://github.com/maleficxp)
+ Denis Parkhomenko: [GitHub](https://github.com/nihao88)
+ Julia Agafonova: [GitHub](https://github.com/Julia132)
+ Andrey Kuznetsov: [GitHub](https://github.com/kuznetsoffandrey), [Blog](https://t.me/complete_ai)
+ Denis Dimitrov: [GitHub](https://github.com/denndimitrov), [Blog](https://t.me/dendi_math_ai)