EOS not tokenized correctly
#9, opened by Stopwolf
After I tried training a base model with the same chat format, the tokenizer doesn't tokenize the EOS token correctly.
Both [28789, 28766, 321, 28730, 416, 28766, 28767] and [32000] decode to <|im_end|>, but when text is tokenized, it comes out as the former.
What I noticed is that it's only tokenized as EOS (32000) when there's a space preceding it, but realistically there's always some text right before it.
Examples:
2 × 3 = 6<|im_end|>
=> [1, 28705, 28750, 15770, 28705, 28770, 327, 28705, 28784, 28789, 28766, 321, 28730, 416, 28766, 28767]
2 × 3 = 6 <|im_end|>
=> [1, 28705, 28750, 15770, 28705, 28770, 327, 28705, 28784, 32000]
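As a stopgap, the split sequence could be collapsed back into the single EOS ID after tokenization. The sketch below is only a workaround on the token IDs, not a fix for the tokenizer itself; `merge_split_eos` is a hypothetical helper name, and the IDs are copied from the examples above.

```python
from typing import List, Sequence

# IDs copied from the examples above: the 7-token spelling of "<|im_end|>"
# and the single special-token ID it should have been.
SPLIT_IM_END = [28789, 28766, 321, 28730, 416, 28766, 28767]
EOS_ID = 32000


def merge_split_eos(
    ids: Sequence[int],
    split: Sequence[int] = SPLIT_IM_END,
    eos_id: int = EOS_ID,
) -> List[int]:
    """Replace every occurrence of `split` inside `ids` with `eos_id`."""
    if not split:
        raise ValueError("split must contain at least one token ID")
    split = list(split)
    n = len(split)
    out: List[int] = []
    i = 0
    while i < len(ids):
        if list(ids[i:i + n]) == split:
            # Found the split spelling: emit the single EOS ID instead.
            out.append(eos_id)
            i += n
        else:
            out.append(ids[i])
            i += 1
    return out


# First example from above: "2 × 3 = 6<|im_end|>" without the space.
no_space = [1, 28705, 28750, 15770, 28705, 28770, 327, 28705, 28784,
            28789, 28766, 321, 28730, 416, 28766, 28767]
fixed = merge_split_eos(no_space)
```

With the first example as input, `fixed` ends in 32000 and matches the IDs of the second example. This only patches the training data, so the underlying tokenizer behaviour is still there at inference time.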
Any idea how to fix this for further finetuning?