Instruct sequences?
The model card and example prompt template list regular ChatML sequences (<|im_start|> and <|im_end|>), but in special_tokens_map.json you have <|startoftext|> and <|endoftext|> listed as the BOS and EOS tokens respectively. I don't see these tokens mentioned in the model card. Was the model trained on these tokens?
My mistake, this is Yi's tokenizer.
Possibly related, it seems just about every output ends with <|unused###|> where ### is a random number.
This is also Yi's tokenizer, but other Yi finetunes don't have this behavior: [Token probs]
Hello there! The <|startoftext|> and <|endoftext|> are BOS and EOS tokens, while <|im_start|> and <|im_end|> are control tokens for the template.
Can you share more details, please?
- Which model (quant) are you using? I have tested only f16 and AWQ, so maybe I did something wrong during GGUF conversion.
- What software are you using to run it?
- What is the exact input to the model -- ideally, if you could share the textual input as well as the token ids (I've heard some versions of ooba struggle with correct tokenization of special tokens)
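For example, something like this would give both the special tokens and the ids (a minimal sketch; the prompt below is just a placeholder and assumes the HF tokenizer from this repo):

```python
# Minimal sketch: dump the special tokens and the token ids for a prompt,
# assuming the tokenizer from this repo (the prompt is just a placeholder).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dreamgen/opus-v1.2-7b")
print(tok.bos_token, tok.bos_token_id)  # expected: <|startoftext|>
print(tok.eos_token, tok.eos_token_id)  # expected: <|endoftext|>

prompt = "<|im_start|>user\nHello<|im_end|>\n"
ids = tok(prompt, add_special_tokens=False).input_ids
print(ids)
print(tok.convert_ids_to_tokens(ids))  # <|im_start|> should come out as a single token
```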
Hello, to help with prompting, I have published this simple code to format the prompt:
https://huggingface.co/dreamgen/opus-v1.2-7b/blob/main/configs/opus-v1.py
I will be investigating ooba and ST issues. (I have not used either much before)
I have also posted an approximate SillyTavern config, but it's not possible to make it perfect: https://huggingface.co/dreamgen/opus-v1.2-7b/tree/main/configs/silly_tavern
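For anyone skimming, here's a very loose sketch of the shape of the prompt, based only on the example shared further down this thread; the linked opus-v1.py is the source of truth and may differ in details:

```python
# Loose approximation of the ChatML-style turns used in the example prompt
# further down this thread; treat the linked opus-v1.py as the source of truth.
def format_turn(role: str, content: str) -> str:
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

prompt = (
    format_turn("system", "You are an intelligent, skilled, versatile writer. ...")
    + format_turn("user", "Harry welcomes Hermione into his room")
    + "<|im_start|>text names= Harry\n"  # leave the writer turn open for generation
)
```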
Per our previous conversation on the SillyTavern Discord server, this was happening with GGUF quants in Oobabooga Textgen-UI and KoboldCPP using a ChatML instruct template. Seems the model is outputting Yi padding tokens (at least for the GGUF quant) which shouldn't be happening regardless of the instruct template.
@boomerchan I did more digging into GGUF, and based on my testing the tokenization is severely broken. I tried it with llama_cpp_wrapper, and with a prompt where the only possible next token is <|im_start|> and where AWQ and fp16 give it ~100% probability, the GGUF was giving almost uniform probabilities.
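Roughly, the probe looked like this (a sketch with llama-cpp-python; the GGUF path and the inline prompt are placeholders, and the exact response layout can vary between versions):

```python
# Sketch of the next-token probe against the GGUF, using llama-cpp-python.
# The model path is a placeholder; the real probe prompt (prompt2) is shared below.
from llama_cpp import Llama

llm = Llama(model_path="opus-v1.2-34b.Q5_K_M.gguf", n_ctx=4096, logits_all=True)

prompt2 = "<|im_start|>user\nHello<|im_end|>\n"  # stand-in; see the full prompt further down
out = llm.create_completion(prompt2, max_tokens=1, temperature=0.0, logprobs=10)
print(out["choices"][0]["logprobs"]["top_logprobs"][0])
# fp16/AWQ put essentially all of the mass on "<|im_start|>"; the GGUF spread it
# almost uniformly across unrelated tokens.
```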
Another issue I noticed is that the tokenizer was ignoring the Yi normalizer settings, which differ from llama:
```
# Yi
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Replace",
      "pattern": {
        "String": " "
      },
      "content": "▁"
    }
  ]
},

# llama
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Prepend",
      "prepend": "▁"
    },
    {
      "type": "Replace",
      "pattern": {
        "String": " "
      },
      "content": "▁"
    }
  ]
},
```
It was predicting tokens with the ▁ prefix instead of without.
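You can see the effect of that missing Prepend step directly with the `tokenizers` library (a small sketch, not the exact test I ran):

```python
# Small sketch comparing the two normalizer configs above with the `tokenizers` library.
from tokenizers.normalizers import Prepend, Replace, Sequence

yi_norm = Sequence([Replace(" ", "▁")])
llama_norm = Sequence([Prepend("▁"), Replace(" ", "▁")])

text = "She exits the bathroom,"
print(yi_norm.normalize_str(text))     # She▁exits▁the▁bathroom,   (no leading ▁)
print(llama_norm.normalize_str(text))  # ▁She▁exits▁the▁bathroom,  (leading ▁ prepended)
```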
So for now I withdrew the GGUF quants.
@DreamGenX Can you verify whether the GGUF tokenizer is broken for just Opus, some specific Yi models, all Yi models, or all models?
I also noticed that there's a difference between llama.cpp and llama-cpp-python -- the latter does not handle <|im_start|> correctly out of the box, and tokenizes it as multiple tokens.
In Opus 34B, I specifically added <|im_start|> and <|im_end|> as special tokens so that they are always tokenized as one token. They already exist as tokens in Yi base, but they are not marked as special there. For models that were fine-tuned without doing this, it might still "work", because inference will match training -- the markers will be multiple tokens in both.
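In HF terms the difference looks roughly like this (a sketch; the Opus repo id is the 7B one linked above, the 34B behaves the same way):

```python
# Sketch: the same string tokenizes as one id in the Opus tokenizer (markers are
# registered as added/special) but as several ids in the Yi base tokenizer.
from transformers import AutoTokenizer

opus = AutoTokenizer.from_pretrained("dreamgen/opus-v1.2-7b")  # repo linked above
base = AutoTokenizer.from_pretrained("01-ai/Yi-34B-200K")

for name, tok in [("opus", opus), ("yi-base", base)]:
    ids = tok("<|im_start|>", add_special_tokens=False).input_ids
    print(name, ids, tok.convert_ids_to_tokens(ids))

# To get the single-token behaviour on a base tokenizer before fine-tuning, one
# option is:
# base.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
```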
Tokenisation is working fine (including Yi BOS and EOS, it's even rejecting default llama2 EOS and BOS) on my exl2 quant at any rate.
Note that your tokenizer.model is identical to the one from, say, brucethemoose's RPMerge (or stock Yi, for that matter) (check the SHA sums), so it's incorrect. You ought to regenerate it; I think if you delete it, Transformers might do that for you? I don't use ooba.
EDIT: hang on, that's nonsense, you don't have a tokenizer.json so you need the .model surely.
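For the record, comparing the files by hash is quick (the local paths here are hypothetical):

```python
# Quick hash comparison of two tokenizer.model files (paths are hypothetical).
import hashlib

def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

print(sha256("opus-v1.2/tokenizer.model"))
print(sha256("Yi-34B-200K/tokenizer.model"))  # identical digests => identical files
```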
Oh, thank you for investigating @ProphetOfBostrom. That's great to know that the exl2 are working fine. The above was specific to GGUF, at least in llama-cpp-python.
The tokenizer should be the same, except that <|im_start|> and <|im_end|> are marked as added/special. This is not the case for https://huggingface.co/01-ai/Yi-34B-200K, where the tokens also exist but aren't marked as added/special, so they aren't tokenized as one unit (at least when I try with HF AutoTokenizer).
Here's what I was seeing for 34B AWQ vs 34B GGUF Q5_K_M with llama-cpp-python. As you can see, for the AWQ all of the probability mass is on <|im_start|>, as it should be, while for the GGUF it's almost uniformly spread out:
Would you be able to show the next-token distribution for the EXL2 quant?
This was the prompt (note the newline added at the end; this means that the only possible next token should be <|im_start|>):
```python
prompt2 = (
    """
<|im_start|>system
You are an intelligent, skilled, versatile writer.
Your task is to write a story based on the information below.
Write the story as if it's a book.
## Plot description:
This is a fanfiction from the Harry Potter universe. In this alternate reality, Harry Potter is evil and secretly siding with Slytherin.
Up until now, Harry was pretending to be friends with Hermione and Ron, that changes when he invites Hermione to his chambers where he tricks her to drink Amorentia, the most powerful love potion.
## Characters:
### Harry Potter
Harry Potter in this fanfiction is secretly a member of Slytherin and is using his powers for evil rather than for good. Up until now, he was pretending to be friends with Hermione and Ron.
### Hermione Granger
Hermione appears just like in the original books.<|im_end|>
<|im_start|>user
Harry welcomes Hermione into his room
<|im_start|>text names= Harry
“Welcome, Hermione!” said Harry, waving at the doorway behind Hermione’s back.<|im_end|>
""".strip()
    + "\n"
)
```
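And this is roughly how the fp16/Transformers side of the comparison can be reproduced (a sketch; it assumes the model fits in memory and uses the 7B repo id, swap in the 34B repo as needed):

```python
# Sketch of the fp16 side: probability of <|im_start|> as the next token after prompt2.
# Assumes the 7B repo linked above; adjust the repo id as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "dreamgen/opus-v1.2-7b"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16, device_map="auto")

inputs = tok(prompt2, return_tensors="pt", add_special_tokens=False).to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits.float(), dim=-1)
print(probs[tok.convert_tokens_to_ids("<|im_start|>")].item())  # should be close to 1.0
```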
Here's another prompt where the next token is a regular token, and the difference between the AWQ and GGUF:
As you can see, GGUF predicts tokens with the ▁ prefix.
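For whichever prompt you probe, inspecting the top candidates as token strings (continuing the sketch above) makes the ▁ prefix easy to spot:

```python
# Continuing the sketch above: show the top candidates as token *strings*,
# where the ▁ prefix is visible directly.
top = torch.topk(probs, k=5)
print(tok.convert_ids_to_tokens(top.indices.tolist()), top.values.tolist())
```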
Ah, maybe I misunderstood what you were suggesting about the tokenizer; I need to read up on the role of the different tokenizer files. Also, for some reason, tokenizer.json is missing for me... hmm.
You have one in your AWQ quant.
I've replaced the actual newlines in the code block below with \n manually, to show that they're not redundant and to show where tokens adjacent to them end up.
I'm fiddling with it now. Which of these is correct?
Here's my exl2 (0.0.13.post2 for reasons) if I use the tokenizer.model:
```
<|im_start|>system\nTau stops reading the notes. His headache has grown fierce. From the bathroom, he hears a woman singing, and, presumably, showering.<|im_end|>\nShe exits the bathroom,
```
But using the unmodified tokenizer.json from the AWQ quant instead, we see that a different " system" token is used in place of "system". There's no way that's not going to cause issues, right?
Also spot the extra space injected (I didn't put it there!) between <|im_end|> and \n.
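A rough way to reproduce this comparison without any quant at all is to load the same local folder once as a slow tokenizer (which reads tokenizer.model) and once as a fast one (which reads tokenizer.json, e.g. copied over from the AWQ quant), then diff the outputs (the path is a placeholder):

```python
# Rough sketch of the comparison above: slow tokenizer (tokenizer.model) vs
# fast tokenizer (tokenizer.json) on the same text, newlines shown as \n.
from transformers import AutoTokenizer

text = "<|im_start|>system\nTau stops reading the notes.<|im_end|>\nShe exits the bathroom,"

slow = AutoTokenizer.from_pretrained("path/to/opus-v1.2", use_fast=False)  # placeholder path
fast = AutoTokenizer.from_pretrained("path/to/opus-v1.2", use_fast=True)

for name, tok in [("slow", slow), ("fast", fast)]:
    toks = tok.convert_ids_to_tokens(tok(text, add_special_tokens=False).input_ids)
    print(name, [t.replace("\n", "\\n") for t in toks])
# Look for "system" vs "▁system", and for a stray "▁" after <|im_end|>.
```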
This is about as much help as I can offer. I've never trained a neural anything and I really do have a headache and need a shower. But I hope this points you in the right direction. You might need some Transformers expertise here.
PS: maybe don't use exllama to investigate this, because it's all confounded by a new bug in the most recent version (where there are no spaces at all on Yi models :-s) (so I wasn't using that version), although that may just have been fixed - I was talking about you in the report :-)
Aren't I a good citizen, getting other people to do all this work for me! I'll pip install . it now.
Yeah, the behaviour's the same with the patch; this wasn't related to exllama.
The whole oobabooga _HF loader thing really confounds the issue of tokenisation, because so many people use them and assume that they're the reference implementation, when really using a huggingface tokeniser is not part of the spec of those quants at all. Take the way ggerganov handles tokens in GGUF. It's fantastic and extremely fast (try limiting top_k on llama.cpp or koboldcpp!!), but you wouldn't know it from text-generation-webui.
That's an advantage of AWQ - Transformers loads it (GPTQ too). text-generation-webui obscures this fact as much as possible.