Vocabulary Expansion for LLaMA Tokenizer
#3
opened by meherajj
In the descriptions of your Bangla-LLaMA-3 and later models, you mentioned using an extensive Bangla vocabulary with 16,000 additional tokens. Since LLaMA-3 expanded its vocabulary from 32k to 128,256 tokens, did you extend the tokenizer further with these additional Bangla tokens? I’d love to understand your approach here.
We extended the tokenizer only for Llama-2, not for Llama-3. Technically we could do the same for Llama-3 and see whether the model improves. Would love for someone to try it out.
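For anyone who wants to try it, here is a minimal sketch of the idea. The token strings and vocabulary below are hypothetical stand-ins (not the actual Bangla-LLaMA vocabulary); new tokens are simply appended after the existing IDs, which is what `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` does in Hugging Face transformers:

```python
# Toy stand-in for a base vocabulary (the real Llama-3 vocab has 128,256 tokens).
base_vocab = {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}

# Hypothetical Bangla tokens to add (the actual models add ~16,000).
new_tokens = ["বাংলা", "ভাষা"]

def expand_vocab(vocab, tokens):
    """Append tokens not already present, assigning consecutive new IDs
    starting right after the current largest ID."""
    expanded = dict(vocab)
    next_id = max(expanded.values()) + 1
    for tok in tokens:
        if tok not in expanded:
            expanded[tok] = next_id
            next_id += 1
    return expanded

expanded = expand_vocab(base_vocab, new_tokens)
print(len(expanded))  # prints 6: 4 base tokens + 2 new ones
```

With an actual model, the equivalent steps would be `num_added = tokenizer.add_tokens(new_tokens)` and then `model.resize_token_embeddings(len(tokenizer))`, after which the new embedding rows need continued pretraining on Bangla text before they are useful.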
As @brishtiteveja said, we have done it for Llama 2 but not for Llama 3 yet. Let's see if we can do this for our latest model as well.
Thanks