|
[2024-09-04 12:51:56,498] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:51:58,109] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. |
|
[2024-09-04 12:51:58,109] [INFO] [runner.py:585:main] cmd = /home/juntao/Miniconda3/envs/roo/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=15892 --module --enable_each_rank_log=None safe_rlhf.finetune --train_datasets alpaca --model_name_or_path models/alpaca-7b-reproduced --max_length 1024 --trust_remote_code True --epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --gradient_accumulation_steps 8 --gradient_checkpointing --learning_rate 2e-5 --lr_scheduler_type cosine --lr_warmup_ratio 0.03 --weight_decay 0.0 --seed 42 --output_dir /home/juntao/Projects/roo/models/alpaca-7b-sft --log_type wandb --log_project SFT-alpaca --zero_stage 3 --offload none --bf16 True --tf32 True |
|
[2024-09-04 12:51:59,771] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:01,856] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} |
|
[2024-09-04 12:52:01,856] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0 |
|
[2024-09-04 12:52:01,856] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) |
|
[2024-09-04 12:52:01,856] [INFO] [launch.py:164:main] dist_world_size=8 |
|
[2024-09-04 12:52:01,856] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
|
[2024-09-04 12:52:01,858] [INFO] [launch.py:256:main] process 2131706 spawned with command: ['/home/juntao/Miniconda3/envs/roo/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=0', '--train_datasets', 'alpaca', '--model_name_or_path', 'models/alpaca-7b-reproduced', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/juntao/Projects/roo/models/alpaca-7b-sft', '--log_type', 'wandb', '--log_project', 'SFT-alpaca', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True'] |
|
[2024-09-04 12:52:01,859] [INFO] [launch.py:256:main] process 2131707 spawned with command: ['/home/juntao/Miniconda3/envs/roo/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=1', '--train_datasets', 'alpaca', '--model_name_or_path', 'models/alpaca-7b-reproduced', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/juntao/Projects/roo/models/alpaca-7b-sft', '--log_type', 'wandb', '--log_project', 'SFT-alpaca', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True'] |
|
[2024-09-04 12:52:01,860] [INFO] [launch.py:256:main] process 2131708 spawned with command: ['/home/juntao/Miniconda3/envs/roo/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=2', '--train_datasets', 'alpaca', '--model_name_or_path', 'models/alpaca-7b-reproduced', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/juntao/Projects/roo/models/alpaca-7b-sft', '--log_type', 'wandb', '--log_project', 'SFT-alpaca', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True'] |
|
[2024-09-04 12:52:01,861] [INFO] [launch.py:256:main] process 2131709 spawned with command: ['/home/juntao/Miniconda3/envs/roo/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=3', '--train_datasets', 'alpaca', '--model_name_or_path', 'models/alpaca-7b-reproduced', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/juntao/Projects/roo/models/alpaca-7b-sft', '--log_type', 'wandb', '--log_project', 'SFT-alpaca', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True'] |
|
[2024-09-04 12:52:01,862] [INFO] [launch.py:256:main] process 2131710 spawned with command: ['/home/juntao/Miniconda3/envs/roo/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=4', '--train_datasets', 'alpaca', '--model_name_or_path', 'models/alpaca-7b-reproduced', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/juntao/Projects/roo/models/alpaca-7b-sft', '--log_type', 'wandb', '--log_project', 'SFT-alpaca', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True'] |
|
[2024-09-04 12:52:01,863] [INFO] [launch.py:256:main] process 2131711 spawned with command: ['/home/juntao/Miniconda3/envs/roo/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=5', '--train_datasets', 'alpaca', '--model_name_or_path', 'models/alpaca-7b-reproduced', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/juntao/Projects/roo/models/alpaca-7b-sft', '--log_type', 'wandb', '--log_project', 'SFT-alpaca', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True'] |
|
[2024-09-04 12:52:01,864] [INFO] [launch.py:256:main] process 2131712 spawned with command: ['/home/juntao/Miniconda3/envs/roo/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=6', '--train_datasets', 'alpaca', '--model_name_or_path', 'models/alpaca-7b-reproduced', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/juntao/Projects/roo/models/alpaca-7b-sft', '--log_type', 'wandb', '--log_project', 'SFT-alpaca', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True'] |
|
[2024-09-04 12:52:01,865] [INFO] [launch.py:256:main] process 2131713 spawned with command: ['/home/juntao/Miniconda3/envs/roo/bin/python', '-u', '-m', 'safe_rlhf.finetune', '--local_rank=7', '--train_datasets', 'alpaca', '--model_name_or_path', 'models/alpaca-7b-reproduced', '--max_length', '1024', '--trust_remote_code', 'True', '--epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '8', '--gradient_checkpointing', '--learning_rate', '2e-5', '--lr_scheduler_type', 'cosine', '--lr_warmup_ratio', '0.03', '--weight_decay', '0.0', '--seed', '42', '--output_dir', '/home/juntao/Projects/roo/models/alpaca-7b-sft', '--log_type', 'wandb', '--log_project', 'SFT-alpaca', '--zero_stage', '3', '--offload', 'none', '--bf16', 'True', '--tf32', 'True'] |
|
[2024-09-04 12:52:04,266] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:05,590] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:05,973] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:06,006] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:06,071] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:06,085] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:06,108] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:06,170] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2024-09-04 12:52:07,786] [INFO] [comm.py:652:init_distributed] cdb=None |
|
[2024-09-04 12:52:08,898] [INFO] [comm.py:652:init_distributed] cdb=None |
|
[2024-09-04 12:52:09,351] [INFO] [comm.py:652:init_distributed] cdb=None |
|
[2024-09-04 12:52:09,394] [INFO] [comm.py:652:init_distributed] cdb=None |
|
[2024-09-04 12:52:09,425] [INFO] [comm.py:652:init_distributed] cdb=None |
|
[2024-09-04 12:52:09,488] [INFO] [comm.py:652:init_distributed] cdb=None |
|
[2024-09-04 12:52:09,525] [INFO] [comm.py:652:init_distributed] cdb=None |
|
[2024-09-04 12:52:09,589] [INFO] [comm.py:652:init_distributed] cdb=None |
|
[2024-09-04 12:52:09,589] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl |
|
Set logger level to WARNING. |
|
ninja: no work to do. |
|
Time to load fused_adam op: 0.13483238220214844 seconds |
|
Time to load fused_adam op: 0.20363140106201172 seconds |
|
Time to load fused_adam op: 0.2036271095275879 seconds |
|
Time to load fused_adam op: 0.20359563827514648 seconds |
|
Time to load fused_adam op: 0.20377779006958008 seconds |
|
Time to load fused_adam op: 0.2039332389831543 seconds |
|
Time to load fused_adam op: 0.20380067825317383 seconds |
|
Time to load fused_adam op: 0.20460724830627441 seconds |
|
Parameter Offload: Total persistent parameters: 266240 in 65 params |
|
***** Running training ***** |
|
Saving model to "/home/juntao/Projects/roo/models/alpaca-7b-sft" ... |
|
Saving DeepSpeed Checkpoints... |
|
Converting DeepSpeed Checkpoints to Hugging Face format... |
|
[2024-09-04 13:41:40,053] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
Processing zero checkpoint './global_step304' |
|
Detected checkpoint of type zero stage ZeroStageEnum.weights, world_size: 8 |
|
Parsing checkpoint created by deepspeed==0.15.0 |
|
Reconstructed Trainable fp32 state dict with 291 params 6738423808 elements |
|
Saving fp32 state dict to pytorch_model.bin |
|
Model saved! |
|
[2024-09-04 13:42:46,338] [INFO] [launch.py:351:main] Process 2131710 exits successfully. |
|
[2024-09-04 13:42:46,338] [INFO] [launch.py:351:main] Process 2131712 exits successfully. |
|
[2024-09-04 13:42:46,338] [INFO] [launch.py:351:main] Process 2131707 exits successfully. |
|
[2024-09-04 13:42:46,338] [INFO] [launch.py:351:main] Process 2131713 exits successfully. |
|
[2024-09-04 13:42:46,339] [INFO] [launch.py:351:main] Process 2131709 exits successfully. |
|
[2024-09-04 13:42:46,339] [INFO] [launch.py:351:main] Process 2131711 exits successfully. |
|
[2024-09-04 13:42:47,339] [INFO] [launch.py:351:main] Process 2131708 exits successfully. |
|
[2024-09-04 13:42:56,341] [INFO] [launch.py:351:main] Process 2131706 exits successfully. |
|
|