Combine voice cloning and portrait lip-sync animation
Instruction-tuned model for a range of vision-language tasks