How do you fine-tune LLaVA-NeXT?
Is there a way to fine tune LLaVA-NeXT?
cc @lewtun the TRL team is going to make it super easy to fine-tune models like these.
For now I'll refer you to my demo notebook, which includes a bunch of utilities from the original LLaVa repository.
Thanks Niels, this is great!
I assume the same approach works also for LLaVA-NeXT. Is that correct?
Nishant
Yes it should, although Llava-NeXT is a bit more complex compared to Llava in terms of image preprocessing. A PR to add batched generation (which should also solve training issues) is here: https://github.com/huggingface/transformers/pull/29850.
For now I'd recommend either Llava or Idefics2. Refer to my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_JSON_extraction_use_cases_(PyTorch_Lightning).ipynb. Have tested this with both models.
Hey @RaushanTurganbay, very cool! I was a little confused because the PR also says that it's fine-tunable, but only for cases without images. Also, if you are using llava-v1.6-mistral-7b-hf, shouldn't you be using the following prompt format: "[INST] <image>\nWhat is shown in this image? [/INST]" as described here: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next
Yes, that's right. LLaVa-NeXT does not have a chat template yet, which means that for now you need to manually make sure the right format is used. Looks like @RaushanTurganbay might need to update that.
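Until chat templates land, the Mistral-style prompt can be built by hand. A minimal sketch (the helper name is mine, not part of Transformers; the processor later replaces each `<image>` placeholder with the actual image tokens):

```python
def build_llava_mistral_prompt(question, num_images=1):
    """Format a single-turn prompt in the llava-v1.6-mistral-7b-hf style.

    One "<image>" placeholder is inserted per image, followed by the
    user's question, wrapped in Mistral's [INST] ... [/INST] markers.
    """
    image_tokens = "<image>\n" * num_images
    return f"[INST] {image_tokens}{question} [/INST]"

prompt = build_llava_mistral_prompt("What is shown in this image?")
print(prompt)
# [INST] <image>
# What is shown in this image? [/INST]
```

The same string would then be passed to the processor together with the image(s).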
OK, thanks for noting. I will change it in the notebook and try to add chat templates to all Llava models.
Hi @nielsr, sorry, it's still not quite clear to me whether LLaVA-NeXT supports batched training with images. This PR says that only support for training without images was added: https://github.com/huggingface/transformers/pull/29850
I updated the comment in the PR to say "(with and w/o images)". The model should be tunable with images as well.
@RaushanTurganbay , thanks for sharing the notebook on finetuning LLaVA-NEXT! Is there a similar one for finetuning LLaVA-NEXT-Video? or can I easily adapt this notebook for LLaVA-NEXT-Video as well? @nielsr
Yes here it is: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VideoLLaVa. Should be very similar for LLaVA-Next-Video.
There is actually a notebook for llava-next-video here; I will port it to the Tutorials repo for easier discovery.
Hey, thanks so much for the great examples! I'm trying to follow along, but I only have small GPUs and am trying to use DeepSpeed. Do you know if your code would work with DeepSpeed on 4 GPUs?
DeepSpeed is supported out of the box when using the Trainer, but the example notebook relies on a custom training loop. Take a look at https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/deepspeed#deepspeed-non-trainer-integration for more information on how to use DeepSpeed with custom training code.
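For reference, a minimal DeepSpeed JSON config of the kind you would pass to such a non-Trainer setup. The values here are illustrative only; tune the batch size, precision, and ZeRO stage/offload for your actual GPUs:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```

ZeRO stage 2 with CPU optimizer offload is a common starting point for squeezing a 7B model onto smaller cards; stage 3 shards the parameters too, at the cost of more communication.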
Sorry a little of a different question.. How many images and/or videos can LLaVA-Next-Video take? I couldn't find it stated elsewhere. Thanks in advance. @RaushanTurganbay @nielsr
@tjiang217 LLaVA-Next-Video was not trained in a multi-image/multi-video setting afaik, but that doesn't mean we can't try feeding it several visuals. Note, though, that the generation quality might not be as good as with a single image.
You can also take a look at https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19, which were trained on interleaved images/videos. It doesn't state how many images/videos per prompt were used in training; I'd guess it was 2 in most examples.
@RaushanTurganbay I tried to run the llava-next-video finetuning notebook you shared, without changing any code, on a 4x A10 GPU EC2 instance and ran into the following issue. The inference code works, just not the training part. Do you have any ideas why? It has to do with device_map='auto', but putting everything on one GPU causes a CUDA out-of-memory error. Any help would be greatly appreciated.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
@RaushanTurganbay sorry, just wanted to follow up here. I was able to bypass the previous bug by making the batch size smaller and removing device_map='auto', but ran into the following bug using the same code in the llava-next-video finetuning notebook. Do you know which transformers version and other package versions you used for this notebook? Thanks in advance!
The error I ran into:
RuntimeError: Input tensor at index 1 has invalid shape [1, 1595, 32064], but expected [1, 1500, 32064]
Further discussion/solutions will be in https://github.com/huggingface/trl/issues/1785#issuecomment-2314793662 for anyone having the same issue
What changes do I need to make in the notebook if my dataset has unique_id, image, and conversations columns? I can't see any notebook that uses conversations for training.
You can find an SFT tuning example for VLMs here (https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py). The general idea is the same: you just have to prepare the inputs in the format you want, which means writing your own data collator. You can also take a look at how LLMs are tuned with dialog datasets to see how the inputs have to be formatted/masked.
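The key step when training on a conversations column is masking the prompt tokens so the loss only covers the assistant replies. A framework-free sketch of just that masking step (the function name and toy token ids are mine; -100 is the label PyTorch's cross-entropy loss ignores):

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, hiding the prompt from the loss.

    prompt_len is the number of tokens covering the system/user turns;
    only the remaining (assistant) tokens contribute to training.
    """
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# toy example: 4 prompt tokens followed by 3 answer tokens
print(mask_prompt_labels([5, 17, 17, 9, 42, 43, 2], prompt_len=4))
# [-100, -100, -100, -100, 42, 43, 2]
```

In a real collator you would compute prompt_len from the tokenized prompt (e.g. by tokenizing the conversation up to the assistant turn) and repeat this per turn for multi-turn dialogs.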
@RaushanTurganbay I understand the current llava-next-video model processes each frame as 12x12 tokens (the result of stride-2 pooling from 24x24 tokens). I am working with a soccer video dataset that has fine-grained details, such as the soccer ball, so I thought 12x12 tokens may not capture enough detail. The LLaVA-NeXT-Video blog talked about testing different variations of pooling strides. Do you know if we could tweak the current model, or access another variant, so that each frame is represented by more than 12x12 tokens?
Thanks in advance, much appreciated!
Unfortunately we don't support different pooling methods and strides. Maybe you can tune your model with the llava-vl repo for that and then convert it to HF format? We are currently trying to make VLMs more modular and will factor the image-encoder-related code out into a separate method, so you will have more freedom in how to obtain image hidden states by overriding only that method :)