llava-onevision

Running on Zero

Incomplete generation results on the pancake example?

by Fyphia - opened 4 days ago

4 days ago

Hi @RaushanTurganbay , thanks for your great work. When I try with the pancake example, it seems the generated text is not incomplete (it seems there should be something after 'Garnish with')? Do you have any idea about this? Thanks!

Generated text:

Sure, here's a recipe for the meal described in the images:

Step 1: Prepare the Pancakes
1.In a large bowl, beat the eggs until they are fully combined.
2.Add the flour, baking powder, and salt to the bowl.
3.Gradually add the milk, whisking until the batter is smooth.
4.Pour the batter into a greased, floured, and dusted non-stick pan.
5.Place the pan in the oven and bake for 2-3 minutes, or until the edges are lightly golden.
6.Remove the pan from the oven and let it cool for a few minutes.
7.In a separate bowl, whisk together the butter and sugar until light and fluffy.
8.Pour the butter mixture over the cooked pancakes, and let them cool slightly before serving.

Step 2: Serve the Pancakes
1.Serve the pancakes on a plate or a serving dish.
2.Garnish with

RaushanTurganbay

Owner 2 days ago

@Fyphia hey! It is because the model has max length limit due to the fact that I am using the free GPU spaces. I will try to code and add possibility for you to change the max length, but you can also try to run the model locally as it is only 0.5B parameters. See for more https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf

Fyphia

2 days ago

•

edited 2 days ago

@RaushanTurganbay ，thanks for your reply. I tried local implementation using the 0.5B model, but the result was similar. Are there any parameters I should be aware of ( I tried max_new_tok, but nothing changed)? Thanks!

BTW my GPU is 3090. Is the GPU the reason?

RaushanTurganbay

Owner 1 day ago

@Fyphia hmm weird, the model should generate up to max_new_tokens and if it abruptly stopping it is either that, or the model is stopping generation itself by generating an EOS token. Can you try to give a very high max_new_tokens, like model.generate(**inputs, max_new_tokens=10_000)?

Fyphia

about 13 hours ago

•

edited about 13 hours ago

@RaushanTurganbay , thanks for your suggestion; it worked this time, and the prompt is complete. So, I guess the issue is that my max_new_token before is too small for the pancake case?

By the way, may I ask how many images the model can process simultaneously? I tried with more step images (~20 images) for the pancake example, and the model returned 1.

RaushanTurganbay

Owner about 13 hours ago

@Fyphia if the model generation quality worsens after using more images it could be the fact that the model was not trained much on 20+ images per prompt, and thus would generate not as well as we want. If there wasn't any mismatch failures, that is all I can think of. You can try to finetune the model of you own dataset with multiple images and see if the quality gets better

For tuning LLaVA models we have demo notebooks here (https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Fine_tune_LLaVa_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb) and in TRL (https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py). The scripts might need to be adjusted a little bit for the OneVision model but it should work as a general skeleton for your training script

Fyphia

about 13 hours ago

@RaushanTurganbay , thanks a lot for your prompt reply and detailed help!

Fyphia changed discussion status to closed about 13 hours ago

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment