Issues with FSDP and DeepSpeed During Distributed Training for Gemma
I'm trying to train Gemma with LoRA using FSDP and DeepSpeed. My context size is around 7000 tokens, so the model doesn't fit on a single GPU, making distributed training a necessity. I noticed a strange phenomenon with FSDP: training simply hangs, as if the process has frozen. It works fine without FSDP, and the same code works perfectly with FSDP for LLaMA, Mistral, Mixtral, Phi, and other models.
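For context, a minimal sketch of the FSDP setup I mean (standard transformers/peft arguments; the model id, LoRA targets, and GemmaDecoderLayer as the auto-wrap class are assumptions, and the dataset is omitted):

# Minimal sketch: LoRA + FSDP via the Hugging Face Trainer.
# The model id, LoRA targets, and auto-wrap class name are assumptions;
# train_dataset is assumed to be defined elsewhere.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "google/gemma-7b"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach LoRA adapters to the attention projections.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    # Shard params/grads/optimizer state and auto-wrap each decoder layer.
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": "GemmaDecoderLayer"},
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()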
For DeepSpeed, it goes out of memory even with Stage 3.
{
  "zero_force_ds_cpu_optimizer": false,
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": "auto",
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
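For reference, a minimal sketch of passing this config to the Trainer as a Python dict. The offload_optimizer block is an unverified addition on top of the config above (the usual next step when Stage 3 still runs out of memory), and a few of the stage-3 fields are omitted for brevity:

# Minimal sketch: DeepSpeed Stage 3 config passed as a dict to the Trainer.
# offload_optimizer is an unverified addition; model and train_dataset are
# assumed to be defined elsewhere.
from transformers import Trainer, TrainingArguments

ds_config = {
    "zero_force_ds_cpu_optimizer": False,
    "zero_allow_untested_optimizer": True,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # Added: also offload optimizer state to CPU.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()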
Has anyone else encountered this problem during distributed training? How did you resolve it?
I have the same issue. Is there a workaround for this problem?
same here
I noticed the same issue while using Pipeline to generate on a large number of prompts. You may need to add torch_empty_cache_steps=1 as an argument if you are using TRL trainers. Hope this helps.
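Something along these lines, assuming a recent TRL/transformers version (SFTTrainer/SFTConfig are just one example; torch_empty_cache_steps is a transformers TrainingArguments option that SFTConfig inherits):

# Minimal sketch: empty the CUDA cache after every training step.
# model and train_dataset are assumed to be defined elsewhere.
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="out",
    per_device_train_batch_size=1,
    torch_empty_cache_steps=1,  # call torch.cuda.empty_cache() every step
)

trainer = SFTTrainer(model=model, args=config, train_dataset=train_dataset)
trainer.train()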
@OSalem99 But that's for the generate function; I'm doing distributed training. Did you try distributed training?