v3 tokenizer
Hi,
I just wanted to let you know that the Mistral repo contains a file called:
tokenizer.model.v3
It is my understanding that this is the new tokenizer that contains the expanded vocabulary.
However, when making the GGUF, I think it needs to be renamed to tokenizer.model first,
or else it might be ignored by the convert script.
You might already know all of this though, so feel free to ignore :)
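For reference, a minimal sketch of that rename step in Python (the paths are assumptions; adjust them to wherever the repo was downloaded):

```python
import shutil
from pathlib import Path

# Hypothetical local checkout of the Mistral repo (path is an assumption)
repo_dir = Path("Mistral-7B-Instruct-v0.3")

src = repo_dir / "tokenizer.model.v3"
dst = repo_dir / "tokenizer.model"

# Copy rather than rename so the original v3 file is preserved
if src.exists() and not dst.exists():
    shutil.copy2(src, dst)
```

Copying instead of renaming keeps the original file around in case the convert script changes its lookup behavior later.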
Hi @ayyylol
Interesting! I saw that file, but I assumed it was used by their own inference library rather than being the actual tokenizer with the extended vocabulary. Are you sure the existing tokenizer.model
doesn't already have those vocab entries? For example, do the GGUF models fail with the function-calling tokens?
Upon looking at this more closely, the two files are identical:
37f00374dea48658ee8f5d0f21895b9bc55cb0103939607c8185bfd1c6ca1f89 tokenizer.model
37f00374dea48658ee8f5d0f21895b9bc55cb0103939607c8185bfd1c6ca1f89 tokenizer.model.v3
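Those checksums can also be reproduced in Python with hashlib; a minimal sketch, assuming both files sit in the current directory:

```python
import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the two tokenizer files (file locations are assumptions)
for name in ("tokenizer.model", "tokenizer.model.v3"):
    if Path(name).exists():
        print(sha256_of(name), name)
```

If the two printed digests match, the files are byte-for-byte identical, as the sha256sum output above already showed.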
I am pretty confused now!
You are right, they appear to be identical. Thank you for looking into that!
Trying to install this model with PrivateGPT, I get this complaint about the tokenizer:
Downloading tokenizer mistralai/Mistral-7B-Instruct-v0.3
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Indeed, the tokenizer_config.json file has this attribute set to true for v0.3; it wasn't included in v0.2.
Is there anything you can do to make the v0.3 tokenizer a fast tokenizer? Thanks in advance for your help.
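One workaround that has been suggested for similar slow-to-fast conversion errors (this is an assumption, not an official fix, so keep a backup of the file) is to remove the add_prefix_space entry from a local copy of tokenizer_config.json before loading:

```python
import json
from pathlib import Path

# Local copy of the model's tokenizer_config.json (path is an assumption)
cfg_path = Path("tokenizer_config.json")

if cfg_path.exists():
    cfg = json.loads(cfg_path.read_text())
    # Dropping add_prefix_space may let the fast tokenizer load without
    # tripping over the slow->fast conversion step (unverified assumption)
    cfg.pop("add_prefix_space", None)
    cfg_path.write_text(json.dumps(cfg, indent=2))
```

Upgrading transformers and tokenizers to recent versions may also resolve the conversion on its own, which would be preferable to editing the config by hand.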