Sharding strategy
FSDP offers a number of sharding strategies to select from:
FULL_SHARD - shards model parameters, gradients and optimizer states across workers; select 1 for this option
SHARD_GRAD_OP - shards gradients and optimizer states across workers; select 2 for this option
NO_SHARD - don't shard anything (this is equivalent to DDP); select 3 for this option
HYBRID_SHARD - shards model parameters, gradients and optimizer states within each node, while each node keeps a full copy of the model; select 4 for this option
HYBRID_SHARD_ZERO2 - shards gradients and optimizer states within each node, while each node keeps a full copy of the model; select 5 for this option
The strategy is selected with the fsdp_sharding_strategy flag.
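The numbering in the list above can be sketched as a small lookup. This is only an illustration: ShardingOption, SHARDED_STATE and resolve_sharding_strategy are hypothetical names invented for this example and are not part of any library API. The sketch assumes a value for fsdp_sharding_strategy may arrive either as one of the numbers 1 to 5 or as a strategy name.

```python
from enum import IntEnum


class ShardingOption(IntEnum):
    """Illustrative mirror of the numbering in the list above (not a library class)."""
    FULL_SHARD = 1
    SHARD_GRAD_OP = 2
    NO_SHARD = 3
    HYBRID_SHARD = 4
    HYBRID_SHARD_ZERO2 = 5


# What each option shards, taken directly from the descriptions in the list.
SHARDED_STATE = {
    ShardingOption.FULL_SHARD: {"parameters", "gradients", "optimizer states"},
    ShardingOption.SHARD_GRAD_OP: {"gradients", "optimizer states"},
    ShardingOption.NO_SHARD: set(),
    ShardingOption.HYBRID_SHARD: {"parameters", "gradients", "optimizer states"},
    ShardingOption.HYBRID_SHARD_ZERO2: {"gradients", "optimizer states"},
}


def resolve_sharding_strategy(value):
    """Hypothetical helper: accept 1-5, "1"-"5", or a strategy name."""
    if isinstance(value, ShardingOption):
        return value
    text = str(value).strip()
    if text.isdigit():
        # Raises ValueError when the number is outside 1-5.
        return ShardingOption(int(text))
    try:
        return ShardingOption[text.upper()]
    except KeyError:
        raise ValueError(f"unknown sharding strategy: {value!r}") from None


if __name__ == "__main__":
    for raw in (1, "2", "hybrid_shard"):
        option = resolve_sharding_strategy(raw)
        print(raw, "->", option.name, sorted(SHARDED_STATE[option]))
```

Keeping the number-to-name mapping in one place makes it easy to validate a configuration value before it is handed to the training launcher.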