Align tokenizer with mistral-common
#39 opened by Rocketknight1
No description provided.
This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:
```python
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

# Reference tokenizer from mistral-common (v3 is the Mixtral-8x22B vocabulary)
mistral_tok = MistralTokenizer.v3()
# Hugging Face tokenizer, loaded from this PR's revision
hf_tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x22B-Instruct-v0.1", revision="pr/39"
)

hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

print(hf_tokens == mistral_tokens)
# Normalize sentencepiece markers (▁ for spaces, <0x0A> for newlines) before comparing text
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
```
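If the PR's tokenizer matches mistral-common, both comparisons should print True.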
Love your work! Can this be merged, please? Mistral does great work and produces great models. The Mistral libraries seem good too, but I don't want to implement solutions using libraries that are incompatible with other models. The whole point of open weights is choice and flexibility.
patrickvonplaten changed pull request status to merged
Ah, actually, does the HF tokenizer already support function calling? It would be good to support this as well.
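For reference, a minimal sketch (not part of this PR) of how tool-call alignment could be checked the same way as the chat test above. It assumes a transformers version whose `apply_chat_template` accepts a `tools` argument and a chat template that actually renders tool definitions; the `get_current_weather` schema is purely illustrative:

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

# Hypothetical function schema, shared by both tokenizers
weather_fn = {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name, e.g. Paris"},
        },
        "required": ["location"],
    },
}

chat = [{"role": "user", "content": "What is the weather like in Paris?"}]

# mistral-common side: wrap the schema in its Tool/Function models
mistral_tok = MistralTokenizer.v3()
mistral_tokens = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(
        tools=[Tool(function=Function(**weather_fn))],
        messages=[UserMessage(content=chat[0]["content"])],
    )
).tokens

# Hugging Face side: pass the same schema via `tools` (assumes a recent
# transformers release where apply_chat_template accepts this argument)
hf_tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x22B-Instruct-v0.1", revision="pr/39"
)
hf_tokens = hf_tokenizer.apply_chat_template(
    chat,
    tools=[{"type": "function", "function": weather_fn}],
    tokenize=True,
)

print(hf_tokens == mistral_tokens)
```

Whether the two match depends on this PR's chat template emitting the tool block that mistral-common's v3 tokenizer produces, which I haven't verified here.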