Ahmadzei's picture
update 1
57bdca5
raw
history blame
383 Bytes
This is enabled by setting fsdp_wrap_policy: SIZE_BASED_WRAP and min_num_param to the desired size threshold.
Checkpointing
Intermediate checkpoints should be saved with fsdp_state_dict_type: SHARDED_STATE_DICT because saving the full state dict with CPU offloading on rank 0 takes a lot of time and often results in NCCL Timeout errors due to indefinite hanging during broadcasting.