How to enable the fp16 option in fine-tuning?

#45
by comet24082002 - opened

I want to set --fp16 True when fine-tuning the BGE-M3 model so that I can increase per_device_train_batch_size. However, when I ran it, I got this error: ValueError: Type fp16 is not supported.
Please help me!

Beijing Academy of Artificial Intelligence org

Which script did you use for fine-tuning? You can use our script, following https://github.com/FlagOpen/FlagEmbedding/tree/master/examples .
Besides, fp16 is not supported on CPU.
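As a quick sanity check (generic PyTorch, not specific to the FlagEmbedding script), you can confirm that a CUDA device is actually visible to the training process:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

If this prints False, training falls back to CPU and fp16 will be rejected.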

@Shitao I used your finetune example script like this:

[screenshot of the fine-tuning command]
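For context, the invocation was roughly of this shape (a sketch only; the module path matches the traceback below and the --deepspeed flag is inferred from the DeepSpeed lines in the log, but all paths and hyperparameter values here are placeholders for what the screenshot showed):

torchrun --nproc_per_node 1 \
    -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir ./bge-m3-finetuned \
    --model_name_or_path BAAI/bge-m3 \
    --train_data ./train_data.jsonl \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --fp16 \
    --deepspeed ds_config.json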

Beijing Academy of Artificial Intelligence org

@comet24082002, can you provide the detailed log for this error?

2024-04-13 13:46:26.452703: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-13 13:46:26.452810: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-13 13:46:26.595497: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-04-13 13:46:36,623] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-13 13:46:37,081] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-13 13:46:37,081] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
tokenizer_config.json: 100%|███████████████████| 444/444 [00:00<00:00, 2.29MB/s]
sentencepiece.bpe.model: 100%|█████████████| 5.07M/5.07M [00:00<00:00, 54.0MB/s]
special_tokens_map.json: 100%|█████████████████| 964/964 [00:00<00:00, 4.91MB/s]
tokenizer.json: 100%|███████████████████████| 17.1M/17.1M [00:00<00:00, 184MB/s]
config.json: 100%|█████████████████████████████| 687/687 [00:00<00:00, 3.85MB/s]
pytorch_model.bin: 100%|████████████████████| 2.27G/2.27G [00:10<00:00, 213MB/s]
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Generating train split: 5381 examples [00:00, 5492.29 examples/s]
/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/baai_general_embedding/finetune/run.py", line 111, in
main()
File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/baai_general_embedding/finetune/run.py", line 102, in main
trainer.train()
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1771, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1936, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1255, in prepare
result = self._prepare_deepspeed(*args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/init.py", line 176, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 240, in init
self._do_sanity_check()
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1040, in _do_sanity_check
raise ValueError("Type fp16 is not supported.")
ValueError: Type fp16 is not supported.
[2024-04-13 13:46:58,399] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 173) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

FlagEmbedding.baai_general_embedding.finetune.run FAILED

Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-04-13_13:46:58
host : a8eb1bc79923
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 173)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@Shitao, this is the full error log that I got.

Beijing Academy of Artificial Intelligence org

This error seems to be related to DeepSpeed. You can try upgrading the DeepSpeed version and running it again.
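For example (generic commands; ds_report is DeepSpeed's bundled environment-report tool, useful for checking which ops and dtypes your install supports):

pip install -U deepspeed
python -c "import deepspeed; print(deepspeed.__version__)"
ds_report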

@Shitao, I ran "!pip install -U deepspeed", but it didn't solve the problem.

I got the same error. Would you please share your solution? Many thanks!

Did you solve the problem?

@mohamedemam Yes, I've solved it. I changed the GPU used for fine-tuning on Kaggle from a P100 to 2x T4.
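That would fit a compute-capability limit: the P100 is CUDA compute capability 6.0, while the T4 is 7.5, and DeepSpeed's fp16 sanity check appears to require a GPU of capability 7.0 or higher. A quick way to see what the runtime detects (generic PyTorch, not specific to this script):

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"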
