Is this just a finetuned SD model?
Seems a bit hacked together, like an SD finetune... no multi-GPU support, no quants, only 40GB+ cards supported... and from a video editing company looking to pivot. Tell me I'm wrong...
Can't quite agree.
Got 720x512x257 working on an RTX 3090 in under a minute.
1280x720x257 in under 5 minutes, and the results are better than any previous 5B+ model as far as I can tell.
It seems the sequential CPU offloading is broken in the pipeline though; if you edit it and do the offloading manually, it works quite well and fast in under 24GB of VRAM.
Can agree on the video editing though, it would be cool to have video continuation and input-video editing.
Excellent news, what specific edits did you make? Have you tried CogVideo-5b 1.5? I think it's actually better, and even pushing out at 8 fps it's easy to upscale to 60 fps with something like Topaz.
Well, the edits I did are what you'd call "spaghetti code".
Basically, I removed all the .to("cuda") calls from inference.py
and then added this to pipeline_ltx_video.py:
self.transformer = self.transformer.to("cpu")
self.vae = self.vae.to("cpu")
self.text_encoder = self.text_encoder.to("cuda")
Basically, for each step, move the component that is currently in use to CUDA last and park the others on CPU:
first the text encoder, then the transformer, and finally the VAE (rough sketch below).
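Something like this per stage (just a sketch of the idea, not my exact diff; stage_to_gpu is a helper name I made up, it's not in the repo):

import torch

def stage_to_gpu(active, others):
    # Park the modules we're done with on CPU, then move the one that's
    # about to run onto the GPU. All args are plain torch.nn.Module
    # objects (text_encoder / transformer / vae).
    for m in others:
        m.to("cpu")
    torch.cuda.empty_cache()  # release the cached VRAM the parked modules held
    return active.to("cuda")

# one call per stage inside the pipeline:
# text_encoder = stage_to_gpu(text_encoder, [transformer, vae])  # then encode the prompt
# transformer = stage_to_gpu(transformer, [text_encoder, vae])   # then run the denoising loop
# vae = stage_to_gpu(vae, [text_encoder, transformer])           # then decode the latents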
I can post my whole inference.py and LTX pipeline .py here if you want, but maybe there'll be better "official" support for this.
It's basically what pipe.enable_sequential_cpu_offload() should do, but it somehow doesn't work right now (I think?).
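For reference, the intended usage once it works should just be this (standard diffusers-style API; "pipe" being whatever pipeline object inference.py builds):

pipe.enable_sequential_cpu_offload()  # accelerate hooks move each submodule to the GPU around its forward pass and back to CPU after
video = pipe(prompt="...")  # no manual .to("cuda") juggling needed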
Yes, I tried CogVideo 5b. Tbh they're both "bad"; the thing that makes them interesting is the generation speed.
This model here can generate longer, smoother videos quite a bit faster.
But the prompt following in both models is very domain-specific.
So prompts in the format "A character.... like from a TV movie" work well, while more surreal, experimental prompts don't.
Fast motion is also not handled well; something like "A dancing crowd" causes a lot of artefacts.
But I've found this to be the case for every model up to ~5B params so far.
Well a hat tip to you ser, thanks for the code, not spaghetti at all! I will definitely try this again...also, have you tried hailuoai? I ran them for a solid month, pretty good at aerial nature shots.
Here's the code btw (pastebin was down so I had to use this site)
pipeline_ltx_video: https://justpaste.it/gxldw
inference.py: https://justpaste.it/bhwuj
Those are really just quick hax to get it working; hopefully they fix enable_sequential_cpu_offload() sometime.
Haven't tried any video service yet, I'm more interested in getting these models running locally in realtime for exhibitions :)
But it seems AnimateDiff is still the best for that use case.
Outstanding work! I'll be running overnight tests. Have you tried the DiT integration yet? As this seems helpful... also, if you haven't seen this already... https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py ...it's a good idea to scale a prompt creation tool with a system prompt to automate workflows (rough sketch below)... cheers!
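The gist of that script, for anyone following along: send a short prompt plus a fixed system prompt to an LLM and get a detailed video prompt back. A minimal sketch of that workflow (the system-prompt text and model name here are placeholders I made up, not what convert_demo.py actually ships; it supports GLM-4 too as far as I know):

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "You expand short video ideas into one detailed, cinematic paragraph: "
    "describe the subject, motion, camera work, lighting, and style."
)  # placeholder wording, not the actual system prompt from the repo

def enhance_prompt(short_prompt: str) -> str:
    # One chat completion turns a terse idea into a video-model-friendly prompt.
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any chat-capable model should work
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

print(enhance_prompt("A dancing crowd at sunset"))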