Issues in the tokenizer
How does the <sep> token with id 32002 cause seg faults due to out-of-bounds accesses?
Can confirm this during DPO training as well: the tokenizer adds <sep> with id 32002, which is not a valid index into the embedding matrix of size 32002. The maximum valid id is 32001.
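To illustrate, here is a minimal standalone PyTorch sketch (not the actual model code): an embedding matrix with 32002 rows only accepts ids 0 through 32001, so looking up id 32002 is an out-of-bounds access. On CPU it raises an IndexError; on GPU it surfaces as a device-side assert.

import torch
import torch.nn as nn

# Stand-in for the model's token embedding: 32002 rows, so valid ids are 0..32001.
embedding = nn.Embedding(num_embeddings=32002, embedding_dim=16)

embedding(torch.tensor([32001]))      # last valid id, works
try:
    embedding(torch.tensor([32002]))  # id == num_embeddings, out of bounds
except IndexError as err:
    print("CPU lookup failed:", err)
# On CUDA the same lookup shows up as "CUDA error: device-side assert triggered"
# instead of a Python IndexError.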
Yeah, I also see this in DPO training. I just added unk and pad tokens, and the tokenizer length became 32004. The DPO run goes for a few steps and then fails with this error:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Here is the DPO code:
https://colab.research.google.com/drive/1uC7LohnGJF-Y4vzPz14z6OgZknkeZqD2?usp=sharing
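I'm not sure how the notebook handles this, but a common fix with Hugging Face transformers (a sketch with placeholder model and token names, not taken from the notebook) is to resize the model's embedding matrix after adding the new special tokens so that every tokenizer id maps to a row:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-base-model"  # placeholder, not the model used in the notebook
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example special tokens; adjust to whatever your setup actually adds.
tokenizer.add_special_tokens({"pad_token": "<pad>", "unk_token": "<unk>"})

# Grow the embedding matrix (and tied output head) to cover the new ids,
# so the largest token id is len(tokenizer) - 1 and lookups stay in bounds.
model.resize_token_embeddings(len(tokenizer))

assert model.get_input_embeddings().num_embeddings == len(tokenizer)

If the DPO setup loads a separate reference model, it presumably needs the same resize (or the same resized checkpoint); otherwise the reference forward pass will hit the same assert.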
Hello, Imran ullah, I encountered the same problem. May I know how you overcame it? By modifying the tokenizer config.json, or by modifying the code?