How to turn off byte-fallback for Phi-3's tokenizer?
I have been trying out Phi-3 models and it's been a wonderful experience.
However, the tokenizer sometimes throws an exception. This line of code:

```python
text = self.tokenizer.decode(output_tokens)
```

throws: `Exception: 'utf-8' codec can't decode byte 0xf0 in position 10283: invalid continuation byte`
Most of the time this happens when the model's output is quite long (~800 words; counting brackets, dots, etc. it's ~1.4k elements, which is still far below the max_length of 4096, imo).
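For context, this class of error is easy to reproduce in plain Python, independent of Phi-3 (a minimal sketch): `0xF0` opens a four-byte UTF-8 sequence, so strict decoding fails when the continuation bytes are missing or wrong.

```python
# 0xF0 opens a 4-byte UTF-8 sequence; the next byte here ("o") is not
# a valid continuation byte, so strict decoding raises.
raw = b"ok \xf0ok"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xf0 ... invalid continuation byte

# A lenient decode keeps the readable text and substitutes U+FFFD:
print(raw.decode("utf-8", errors="replace"))  # ok �ok
```

Passing `errors="replace"` (or `"ignore"`) is only a stopgap around the decode call, not a fix for the tokenizer itself.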
I have researched around and found that this can be fixed by turning off the byte-fallback of the BPE tokenizer, so that the tokenizer ignores the non-UTF-8 byte tokens.
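To illustrate what byte-fallback does (a sketch, not Phi-3's actual vocabulary): a character missing from the vocabulary is emitted as one token per UTF-8 byte, so if generation stops partway through such a run, the collected bytes are not valid UTF-8.

```python
# Byte-fallback represents an out-of-vocabulary character as raw-byte
# tokens, one per UTF-8 byte: "😀" -> <0xF0><0x9F><0x98><0x80>.
ch = "😀"
byte_tokens = [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
print(byte_tokens)  # ['<0xF0>', '<0x9F>', '<0x98>', '<0x80>']

# If generation stops after only some of these tokens, the collected
# bytes form an incomplete sequence and strict decoding fails:
partial = bytes(int(t[1:-1], 16) for t in byte_tokens[:2])
# partial == b'\xf0\x9f' -> UnicodeDecodeError under strict decoding
```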
I have tried tweaking the `tokenizer.json` file:

- set `model/byte_fallback` to `false`
- remove the `{"type": "ByteFallback"}` item from the `decoder/decoders` section

but the error still happens.
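The two tweaks above can also be scripted. The snippet below works on a minimal stand-in for the relevant parts of a `tokenizer.json` (real files have many more fields; the shape shown is an assumption based on the Hugging Face `tokenizers` format):

```python
import json

# Minimal stand-in for the relevant shape of a tokenizer.json
# with byte-fallback enabled (assumed structure, heavily trimmed).
cfg = {
    "model": {"type": "BPE", "byte_fallback": True},
    "decoder": {
        "type": "Sequence",
        "decoders": [
            {"type": "Replace"},
            {"type": "ByteFallback"},
            {"type": "Fuse"},
        ],
    },
}

# 1. Turn off byte-fallback in the BPE model section.
cfg["model"]["byte_fallback"] = False

# 2. Drop the ByteFallback step from the decoder sequence.
cfg["decoder"]["decoders"] = [
    d for d in cfg["decoder"]["decoders"] if d["type"] != "ByteFallback"
]

print(json.dumps(cfg, indent=2))
```

Note that if the runtime ships or caches its own copy of the tokenizer configuration, edits to this file may simply never be read, which would be consistent with the changes having no effect.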
I am using the mini-4k-instruct onnx-cuda-int4 version, btw.
I wonder why my changes did not work, and whether there is any way to fix this.
Thanks for any help and suggestions!
(Note: This is also posted as an issue on Phi-3CookBook github repo: issue #14)
Could you please share instructions on how we can reproduce this issue? What script are you using to run the model?