metadata
license: apache-2.0
datasets:
- Loie/VGGSound
base_model:
- riffusion/riffusion-model-v1
pipeline_tag: video-to-audio
tags:
- video2audio
Kandinsky-4-v2a: A Video to Audio pipeline
Kandinsky 4.0 Post | Project Page | Technical Report | GitHub | Kandinsky 4.0 T2V Flash HuggingFace | Kandinsky 4.0 V2A HuggingFace
Description
The video-to-audio pipeline consists of a visual encoder, a text encoder, a UNet diffusion model that generates a spectrogram, and the Griffin-Lim algorithm, which converts the spectrogram into audio. The visual and text encoders share the same multimodal visual language decoder (cogvlm2-video-llama3-chat).
Our UNet diffusion model is a fine-tuned version of the music generation model riffusion. We modified the architecture to condition on video frames and to improve the synchronization between video and audio. We also replaced the text encoder with the decoder of cogvlm2-video-llama3-chat.
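To make the last stage of the pipeline more concrete, here is a minimal NumPy sketch of Griffin-Lim phase reconstruction. It is not the code used in this repository: the `stft`, `istft`, and `griffin_lim` helpers, the window size, the hop length, and the iteration count are all illustrative assumptions, and the sketch works on a linear magnitude spectrogram, whereas the actual pipeline may first convert a different spectrogram representation.

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform with a Hann window (illustrative helper)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)


def istft(spec, n_fft=256, hop=64):
    """Inverse STFT by windowed overlap-add with squared-window normalization."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        out[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_iter=32, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from random phase, then alternate between the time and frequency
    # domains, keeping the known magnitude and only updating the phase.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        audio = istft(magnitude * phase, n_fft, hop)
        rebuilt = stft(audio, n_fft, hop)
        phase = np.exp(1j * np.angle(rebuilt))
    return istft(magnitude * phase, n_fft, hop)


# Usage: discard the phase of a test tone, then recover a waveform from magnitude only.
sample_rate = 8000
t = np.arange(4096) / sample_rate
tone = np.sin(2 * np.pi * 500.0 * t)
waveform = griffin_lim(np.abs(stft(tone)))
```

In practice a library implementation such as `librosa.griffinlim` or `torchaudio.transforms.GriffinLim` would normally be preferred over a hand-written loop like this one.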
Installation
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
conda install -c conda-forge ffmpeg -y
pip install -r kandinsky4_video2audio/requirements.txt
pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
Inference
Inference code for Video-to-Audio:
import torch
import torchvision

from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video

device = 'cuda:0'

# Load the pipeline weights in half precision on the selected GPU.
pipe = Video2AudioPipeline(
    "ai-forever/kandinsky-4-v2a",
    torch_dtype=torch.float16,
    device=device
)

# Read the input video; the third return value holds metadata such as 'video_fps'.
video_path = 'assets/inputs/1.mp4'
video, _, fps = torchvision.io.read_video(video_path)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rythm. saying. poor quality."

# Prepare the model input from the decoded frames.
video_input, video_complete, duration_sec = load_video(
    video,
    fps['video_fps'],
    num_frames=96,
    max_duration_sec=12
)

# Generate the audio track for the video.
out = pipe(
    video_input,
    prompt,
    negative_prompt=negative_prompt,
    duration_sec=duration_sec,
)[0]

# Combine the generated audio with the video and save the result.
save_path = 'assets/outputs/1.mp4'
create_video(
    out,
    video_complete,
    display_video=True,
    save_path=save_path,
    device=device
)
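The `load_video` call above takes `num_frames=96` and `max_duration_sec=12`. The sketch below shows one plausible way a helper of this kind could pick frames: clip the video to the maximum duration, then sample a fixed number of evenly spaced frame indices. This is an assumption for illustration only; `sample_frame_indices` is a hypothetical function, and the actual `load_video` implementation in the repository may behave differently.

```python
def sample_frame_indices(total_frames, fps, num_frames=96, max_duration_sec=12):
    """Return evenly spaced frame indices and the clipped duration in seconds.

    Hypothetical helper: not part of kandinsky4_video2audio.
    """
    # Clip the usable part of the video to at most max_duration_sec.
    duration_sec = min(total_frames / fps, max_duration_sec)
    usable = max(1, min(total_frames, int(round(duration_sec * fps))))

    if num_frames == 1:
        return [0], duration_sec

    # Spread num_frames indices uniformly over [0, usable - 1].
    # Short videos yield repeated indices, long ones skip frames.
    step = (usable - 1) / (num_frames - 1)
    indices = [min(usable - 1, int(round(i * step))) for i in range(num_frames)]
    return indices, duration_sec


# Usage: a 20-second clip at 30 fps is clipped to 12 seconds and sampled at 96 frames.
indices, duration_sec = sample_frame_indices(total_frames=600, fps=30)
```

Sampling a fixed number of frames keeps the conditioning input the same size regardless of the source frame rate, while the returned duration can still be passed on to set the length of the generated audio.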