how can i get original vocabulary of tokenizer?
#186 · by jihunlee · opened
I already opened the tokenizer.json file to inspect its vocabulary, but it contains strange-looking encoded characters like "ÑģÑĮкимÐ". I can't be sure they are meaningless, but they look weird. I just want to see the Korean words or subwords in Llama 3's vocabulary. How can I get them?
That's how byte-level BPE tokenizers store their vocabulary: every token is a sequence of raw bytes, and each byte is mapped to a printable Unicode character so it can be written into tokenizer.json. (Your example string happens to be the byte-level form of Cyrillic text, not Korean.) If you decode the token IDs with the tokenizer, or reverse the byte-to-character mapping, you get the readable representation back.
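A minimal sketch of reversing that mapping yourself, assuming the GPT-2 style byte-to-unicode table (the scheme Llama 3's tokenizer also uses for display); `readable_token` and `has_hangul` are hypothetical helper names, not part of any library:

```python
def bytes_to_unicode():
    # GPT-2 style table: printable bytes map to themselves,
    # the remaining bytes are shifted up past U+00FF.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Invert the table: visible vocab character -> original byte value.
UNI_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}

def readable_token(token: str) -> str:
    """Turn a raw tokenizer.json vocab string back into readable text."""
    raw = bytes(UNI_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")

def has_hangul(text: str) -> bool:
    """True if the decoded token contains a Hangul syllable."""
    return any("\uac00" <= ch <= "\ud7a3" for ch in text)

print(readable_token("ÑģÑĮ"))   # -> сь  (Cyrillic, not Chinese)
print(readable_token("Ġhello")) # -> " hello" (Ġ is the escaped space byte)
```

To list the Korean subwords, load the `"vocab"` dict from tokenizer.json (or `tokenizer.get_vocab()` in `transformers`) and keep the entries where `has_hangul(readable_token(tok))` is true.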