Vocabulary Expansion for LLaMA Tokenizer

by meherajj - opened Oct 11

In the descriptions of your Bangla-LLaMA-3 and later models, you mention an extensive Bangla vocabulary with 16,000 additional tokens. LLaMA-3 already expanded its vocabulary from 32,000 to 128,256 tokens, so did you extend the LLaMA-3 tokenizer further with these additional Bangla tokens? I’d love to understand your approach here.

brishtiteveja

Bangla Large Language Model org Oct 11

We extended the vocabulary only for our Llama-2 tokenizer, not for Llama-3. Technically we could do the same for Llama-3 and see whether the model improves. We would love for someone to try it out.
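For anyone who wants to try it, here is a toy sketch of why adding Bangla tokens can help. It is not our actual pipeline: the `tokenize` helper, the two-word token list, and the tiny base vocabulary are all hypothetical, and a real tokenizer uses learned BPE merges rather than greedy matching. It only shows that text falling back to UTF-8 bytes becomes far fewer tokens once whole Bangla words are in the vocabulary.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback."""
    max_len = max((len(t) for t in vocab), default=0)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate substring first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                match = piece
                break
        if match is None:
            # No vocabulary entry covers this character:
            # emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens


# Hypothetical base vocabulary with no Bangla entries.
base_vocab = {" "}
# Hypothetical added Bangla tokens (a real expansion would add thousands).
new_tokens = {"আমি", "বাংলা"}
extended_vocab = base_vocab | new_tokens

text = "আমি বাংলা"
before = tokenize(text, base_vocab)
after = tokenize(text, extended_vocab)

print(len(before), "tokens before expansion")
print(len(after), "tokens after expansion:", after)
```

With Hugging Face `transformers`, one common route for a real model is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The new embedding rows start out untrained, so further pretraining on Bangla text would be needed before judging whether the model improves.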

nymtheescobar

Bangla Large Language Model org Oct 11

@meherajj

As @brishtiteveja said, we have done it for Llama 2 but not for Llama 3 yet. Let's see if we can do this for our latest model as well.

Thanks