About llama3 tokenizer

#146 opened by Yingshu

I'm using the llama3 tokenizer and ran into an issue.

I tokenize the string 'help . and':
llama3_tokenizer('help . and')
and get {'input_ids': [128000, 8823, 662, 323], 'attention_mask': [1, 1, 1, 1]}.

If I decode the input_ids,
llama3_tokenizer.decode([8823, 662, 323])
I get 'help. and'.

Why do I lose the space after 'help'?
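
For context, this is roughly how I'm calling it (a minimal sketch; the model id below is just an example, not necessarily the exact checkpoint I loaded):

```python
from transformers import AutoTokenizer

# Model id is illustrative; any checkpoint using the llama3 tokenizer should behave the same
llama3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

enc = llama3_tokenizer("help . and")
print(enc["input_ids"])       # [128000, 8823, 662, 323] (128000 is <|begin_of_text|>)
print(enc["attention_mask"])  # [1, 1, 1, 1]

# Decoding the tokens after the BOS token loses the space before the period
print(llama3_tokenizer.decode([8823, 662, 323]))  # 'help. and'

# Looking at the raw tokens should show whether the space is actually part of token 662
print(llama3_tokenizer.convert_ids_to_tokens([8823, 662, 323]))
```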

If I use the llama2 tokenizer, I get the original string back.
llama2_tokenizer('help . and')
{'input_ids': [1, 1371, 869, 322], 'attention_mask': [1, 1, 1, 1]}
llama2_tokenizer.decode([1371, 869, 322])
'help . and'
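
One thing I want to rule out (just my guess, not something I've confirmed): transformers' decode() has a clean_up_tokenization_spaces option that, when enabled, removes spaces before punctuation, and the two tokenizers may ship with different defaults. A quick sketch of the check:

```python
# Check whether decode-time cleanup is what removes the space (my assumption)
ids = [8823, 662, 323]
print(repr(llama3_tokenizer.decode(ids)))                                      # 'help. and' as observed
print(repr(llama3_tokenizer.decode(ids, clean_up_tokenization_spaces=False)))  # does the space come back?

# Same round trip with llama2 for comparison (model id again illustrative)
llama2_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(repr(llama2_tokenizer.decode([1371, 869, 322])))                         # 'help . and'
```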
