Vocabulary Expansion for LLaMA Tokenizer

#3
by meherajj - opened

In the descriptions of your Bangla-LLaMA-3 and later models, you mentioned using an extensive Bangla vocabulary with 16,000 additional tokens. Since LLaMA-3 expanded its vocabulary from 32k to 128,256 tokens, did you extend the tokenizer further with these additional Bangla tokens? I’d love to understand your approach here.

Bangla Large Language Model org

We extended the vocabulary only for our Llama-2 tokenizer, not for Llama-3. We could technically do the same for Llama-3 and see whether the model improves. We would love for someone to try it out.
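For anyone who wants to try it, here is a minimal sketch of the bookkeeping involved in a vocabulary extension. It is not the pipeline used for the Bangla-LLaMA models. The toy vocabulary, the helper names `extend_vocab` and `resize_embeddings`, and the choice to initialise new embedding rows at the mean of the existing rows are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def extend_vocab(
    base_vocab: Dict[str, int], candidates: List[str]
) -> Tuple[Dict[str, int], List[str]]:
    """Append candidate tokens missing from base_vocab, assigning fresh ids.

    Hypothetical helper: tokens already in the vocabulary (or repeated in
    the candidate list) are skipped so that existing ids never change.
    """
    merged = dict(base_vocab)
    added: List[str] = []
    next_id = max(merged.values()) + 1 if merged else 0
    for token in candidates:
        if token in merged:
            continue
        merged[token] = next_id
        added.append(token)
        next_id += 1
    return merged, added


def resize_embeddings(
    embeddings: List[List[float]], new_size: int
) -> List[List[float]]:
    """Grow an embedding table to new_size rows.

    Assumption: new rows start at the mean of the existing rows, a common
    heuristic; other initialisations are possible.
    """
    if not embeddings:
        raise ValueError("embeddings must contain at least one row")
    if new_size < len(embeddings):
        raise ValueError("new_size must not be smaller than the current table")
    dim = len(embeddings[0])
    mean_row = [
        sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)
    ]
    extra = [list(mean_row) for _ in range(new_size - len(embeddings))]
    return [list(row) for row in embeddings] + extra


# Toy stand-ins for a real tokenizer vocabulary and embedding matrix.
base_vocab = {"<s>": 0, "a": 1, "b": 2, "বাংলা": 3}
candidates = ["বাংলা", "ভাষা", "মডেল", "ভাষা"]
base_embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [4.0, 2.0]]

merged_vocab, added_tokens = extend_vocab(base_vocab, candidates)
new_embeddings = resize_embeddings(base_embeddings, len(merged_vocab))
```

With Hugging Face `transformers`, the corresponding steps would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The new embedding rows then need continued pretraining on Bangla text before they are useful.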

Bangla Large Language Model org

@meherajj

As @brishtiteveja said, we have done it for Llama 2 but not yet for Llama 3. Let's see if we can do this for our latest model as well.

Thanks
