What are the <0x00> to <0xFF> tokens in tokenizer.json?
#16, opened by jiang719
What are these tokens in the tokenizer?
If I run this line:
tokenizer.convert_ids_to_tokens(tokenizer.encode('int add(int a, int b) {\n    return a + b;\n}'))
It gives me
['<s>', '▁int', '▁add', '(', 'int', '▁a', ',', '▁int', '▁b', ')', '▁{', '<0x0A>', '▁▁▁', '▁return', '▁a', '▁+', '▁b', ';', '<0x0A>', '}']
It looks like <0x0A> is used as the newline. Initially, I thought these tokens were special tokens for hexadecimal values.
Is this the expected behavior? What do the other tokens mean?
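I think it is byte fallback: characters that are not in the vocabulary (in this case the newline) are encoded as their UTF-8 bytes, with one <0xNN> token per byte. That is why there are 256 of them, <0x00> to <0xFF>, one for each possible byte value.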
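To make the mapping concrete, here is a minimal sketch in plain Python. It is not the tokenizer's actual implementation; the helper names `byte_fallback_tokens` and `decode_byte_tokens` are made up for illustration. It only shows how a character's UTF-8 bytes correspond to the <0xNN> token names.

```python
def byte_fallback_tokens(text: str) -> list[str]:
    """Spell out each UTF-8 byte of `text` as a <0xNN> token name (hypothetical helper)."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def decode_byte_tokens(tokens: list[str]) -> str:
    """Reverse the mapping: parse <0xNN> names back into bytes, then decode as UTF-8."""
    raw = bytes(int(tok[3:-1], 16) for tok in tokens)
    return raw.decode("utf-8")


print(byte_fallback_tokens("\n"))              # ['<0x0A>']
print(byte_fallback_tokens("é"))               # ['<0xC3>', '<0xA9>']
print(decode_byte_tokens(["<0x0A>"]) == "\n")  # True
```

So <0x0A> is simply byte 10, the line feed character. A character outside the vocabulary that takes several UTF-8 bytes would show up as several such tokens in a row.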