How do you fine-tune LLaVA-NeXT?
Is there a way to fine tune LLaVA-NeXT?
cc @lewtun the TRL team is going to make it super easy to fine-tune models like these.
For now I'll refer you to my demo notebook, which includes a bunch of utilities from the original LLaVa repository.
Thanks Niels, this is great!
I assume the same approach works also for LLaVA-NeXT. Is that correct?
Nishant
Yes it should, although Llava-NeXT is a bit more complex compared to Llava in terms of image preprocessing. A PR to add batched generation (which should also solve training issues) is here: https://github.com/huggingface/transformers/pull/29850.
For now I'd recommend either Llava or Idefics2. Refer to my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_JSON_extraction_use_cases_(PyTorch_Lightning).ipynb. Have tested this with both models.
Hey @RaushanTurganbay, very cool! I was a little confused because the PR also says that it's fine-tunable, but only for cases without images. Also, if you are using llava-v1.6-mistral-7b-hf, shouldn't you be using the following prompt format: "[INST] <image>\nWhat is shown in this image? [/INST]" as described here: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next
Yes, that's right. LLaVa-NeXT does not have a chat template yet, which means that for now you need to manually make sure the right format is used. Looks like @RaushanTurganbay might need to update that.
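Until chat templates land, the Mistral-style prompt can be built by hand. A minimal sketch (the helper name is mine, not part of Transformers; the processor later replaces each `<image>` placeholder with the actual image tokens):

```python
def build_llava_mistral_prompt(question, num_images=1):
    """Format a single-turn prompt in the llava-v1.6-mistral-7b-hf style.

    One "<image>" placeholder is inserted per image, followed by the
    user's question, wrapped in Mistral's [INST] ... [/INST] markers.
    """
    image_tokens = "<image>\n" * num_images
    return f"[INST] {image_tokens}{question} [/INST]"

prompt = build_llava_mistral_prompt("What is shown in this image?")
print(prompt)
# [INST] <image>
# What is shown in this image? [/INST]
```

The same string would then be passed to the processor together with the image(s).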
OK, thanks for noting. I will change it in the notebook and try to add chat templates to all Llava models.
Hi @nielsr, sorry, it's still not quite clear to me whether LLaVA-NeXT supports batched training with images. This PR says that only support for training without images was added: https://github.com/huggingface/transformers/pull/29850
I updated the comment in the PR to say "(with and w/o images)". The model should be tunable with images as well.
@RaushanTurganbay , thanks for sharing the notebook on finetuning LLaVA-NEXT! Is there a similar one for finetuning LLaVA-NEXT-Video? or can I easily adapt this notebook for LLaVA-NEXT-Video as well? @nielsr
Yes here it is: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VideoLLaVa. Should be very similar for LLaVA-Next-Video.
There is actually a notebook for llava-next-video here; I will port it to the Tutorials repo for easier discovery.
Hey, thanks so much for the great examples! I'm trying to follow along, but I only have small GPUs and am trying to use DeepSpeed. Do you know if your code would work with DeepSpeed on 4 GPUs?
DeepSpeed is supported out of the box when using the Trainer, but the example notebook relies on a custom training loop. Take a look at https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/deepspeed#deepspeed-non-trainer-integration for more information on how to use DeepSpeed with custom training code.
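For reference, a minimal DeepSpeed JSON config of the kind you would pass to such a non-Trainer setup. The values here are illustrative only; tune the batch size, precision, and ZeRO stage/offload for your actual GPUs:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```

ZeRO stage 2 with CPU optimizer offload is a common starting point for squeezing a 7B model onto smaller cards; stage 3 shards the parameters too, at the cost of more communication.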
Sorry a little of a different question.. How many images and/or videos can LLaVA-Next-Video take? I couldn't find it stated elsewhere. Thanks in advance. @RaushanTurganbay @nielsr
@tjiang217 LLaVA-Next-Video was not trained in a multi-image/multi-video setting afaik, but that doesn't mean we can't try feeding it several visuals. Note, though, that the generation quality might not be as good as with a single image.
You can also take a look at https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19, which were trained on interleaved images/videos. It doesn't state how many images/videos per prompt were used in training; I'd guess it was 2 in most examples.
@RaushanTurganbay I tried to run the llava-next-video finetuning notebook you shared, without changing any code, on a 4x A10 GPU EC2 instance and ran into the following issue. The inference code works, just not the training part. Do you have any ideas why? It has to do with device_map='auto', but putting everything on one GPU causes a CUDA out-of-memory error. Any help would be greatly appreciated.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
@RaushanTurganbay sorry, just wanted to follow up here. I was able to bypass the previous bug by making the batch size smaller and removing device_map='auto', but ran into the following bug using the same code in the llava-next-video finetuning notebook. Do you know which transformers version and other package versions you used for this notebook? Thanks in advance!
The error I ran into:
RuntimeError: Input tensor at index 1 has invalid shape [1, 1595, 32064], but expected [1, 1500, 32064]
Further discussion/solutions will be in https://github.com/huggingface/trl/issues/1785#issuecomment-2314793662 for anyone having the same issue
What changes do I need to make in the notebook if my dataset has unique_id, image, and conversations columns? I can't see any notebook that uses conversations for training.
You can find an SFT tuning example for VLMs here (https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py). The general idea is the same: you just have to prepare the inputs in the format you want, which means writing your own data collator. You can also take a look at how LLMs are tuned with dialog datasets to see how the inputs have to be formatted/masked.
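The key step when training on a conversations column is masking the prompt tokens so the loss only covers the assistant replies. A framework-free sketch of just that masking step (the function name and toy token ids are mine; -100 is the label PyTorch's cross-entropy loss ignores):

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, hiding the prompt from the loss.

    prompt_len is the number of tokens covering the system/user turns;
    only the remaining (assistant) tokens contribute to training.
    """
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# toy example: 4 prompt tokens followed by 3 answer tokens
print(mask_prompt_labels([5, 17, 17, 9, 42, 43, 2], prompt_len=4))
# [-100, -100, -100, -100, 42, 43, 2]
```

In a real collator you would compute prompt_len from the tokenized prompt (e.g. by tokenizing the conversation up to the assistant turn) and repeat this per turn for multi-turn dialogs.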
@RaushanTurganbay I understand the current llava-next-video model processes each frame as 12x12 tokens (the result of stride-2 pooling from 24x24 tokens). I am working with a soccer video dataset that has fine-grained details, such as the soccer ball, so I thought 12x12 tokens may not capture enough detail. The LLaVA-NeXT-Video blog talked about testing different variations of pooling strides. Do you know if we could tweak the current model, or access another variant, so that each frame is represented by more than 12x12 tokens?
Thanks in advance, much appreciated!
Unfortunately we don't support different pooling methods and strides. Maybe you can tune your model with the llava-vl repo for that and then convert it to HF format? We are currently trying to make VLMs more modular and will factor the image-encoder-related code out into a separate method, so you will have more freedom in how to obtain image hidden states by overriding only that method :)