Getting "RuntimeError: No executable batch size found, reached zero." error when trying to fine-tune the flan-ul2 model
Hi there,
I'm trying to fine-tune the flan-ul2 model with LoRA as explained here (https://www.philschmid.de/fine-tune-flan-t5-peft). First I walked through the blog post without changing anything and was able to fine-tune flan-t5-xxl. Then I tried to do the same with flan-ul2. All I changed were the model and tokenizer initialization lines, as follows:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq

# was: model_id = "google/flan-t5-xxl"
model_id = "google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# was: model_id = "philschmid/flan-t5-xxl-sharded-fp16"
model_id = "google/flan-ul2"
# load the model in 8-bit and let accelerate spread it across the available devices
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
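I kept the PEFT/LoRA preparation from the blog post unchanged. For reference, that step looks roughly like this (LoRA hyperparameters as in the blog post; note that newer peft releases rename prepare_model_for_int8_training to prepare_model_for_kbit_training):

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# prepare the 8-bit model for training (casts norms to fp32, enables input gradients, ...)
model = prepare_model_for_int8_training(model)

# "q" and "v" are the attention projection module names in T5/UL2-style models
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()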
Then I ran the trainer as shown below:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,  # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
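For completeness, data_collator above is also unchanged from the blog post; it looks roughly like this:

from transformers import DataCollatorForSeq2Seq

# pad labels with -100 so that padding tokens are ignored by the loss
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8,
)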
When I ran trainer.train() with this setup, I got the following error:
0/73660 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "peft_finetuning_flan-ul2.py", line 145, in <module>
    trainer.train()
  File "/home/ubuntu/miniconda3/envs/finetuning/lib/python3.10/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/finetuning/lib/python3.10/site-packages/accelerate/utils/memory.py", line 122, in decorator
    raise RuntimeError("No executable batch size found, reached zero.")
RuntimeError: No executable batch size found, reached zero.
So I wonder if there is something special about the flan-ul2 model that would prevent me from using it this way. Could it be that Seq2SeqTrainer and Seq2SeqTrainingArguments are not the correct Trainer and TrainingArguments classes to use with flan-ul2? (I've tried the regular Trainer and TrainingArguments classes as well, but I got the same error.) If so, could you please point me to the correct ones?
Hey! Did you manage to solve it?
@cyt79
auto_find_batch_size=True is responsible for this error. With it enabled, accelerate halves the batch size after every CUDA out-of-memory error and raises "No executable batch size found, reached zero." once the batch size reaches zero, i.e. not even a batch size of 1 fit into memory.
Set auto_find_batch_size=False.
Manually pass per_device_train_batch_size=8 and per_device_eval_batch_size=8.
If per_device_train_batch_size=8 throws a CUDA out-of-memory error, reduce the batch size until the error no longer occurs.
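Something like this should work as a starting point (the batch size of 4 and the gradient_accumulation_steps value below are just illustrative, tune them for your GPUs):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=False,
    per_device_train_batch_size=4,   # lower this if you still hit CUDA OOM
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # keeps the effective batch size reasonable despite the small per-device batch
    learning_rate=1e-3,
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)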