lack of digit splitting in slow version of tokenizer
#11 opened by Forence
Like the issue in https://huggingface.co/stabilityai/stablelm-2-12b/discussions/1#66151ad2d8148d0b668985f0, the slow GPT2Tokenizer doesn't apply the pre-tokenization split rules (such as digit splitting) defined in tokenizer.json, so the fast and slow tokenizers produce different tokenizations. A reproduction sketch is below.
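A minimal sketch of how the discrepancy can be observed, assuming the checkpoint exposes both a slow and a fast tokenizer; the repo name is taken from the linked discussion and may differ from the checkpoint this issue actually concerns:

```python
from transformers import AutoTokenizer

# Hypothetical example repo -- substitute the checkpoint this issue refers to.
repo = "stabilityai/stablelm-2-12b"

fast = AutoTokenizer.from_pretrained(repo, use_fast=True)
slow = AutoTokenizer.from_pretrained(repo, use_fast=False)

text = "The answer is 12345."

# The fast tokenizer applies the digit-splitting pre-tokenization rule from
# tokenizer.json; the slow tokenizer ignores it, so the numeric span is
# tokenized differently.
print("fast:", fast.tokenize(text))
print("slow:", slow.tokenize(text))
```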
I suggest fixing this bug: it caused a severe performance drop for me, and it was quite hard to debug. Thanks!