Align tokenizer with mistral-common

#39
by Rocketknight1 HF staff - opened

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

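# Reference tokenizer from mistral-common (v3) and the HF tokenizer from this PR's revision
mistral_tok = MistralTokenizer.v3()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1", revision="pr/39")

# Render the chat with the HF chat template, once as text and once as token IDs
hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

# Encode the same chat with mistral-common
mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

# Both checks should print True if the two tokenizers agree
print(hf_tokens == mistral_tokens)
# mistral-common's text shows spaces as "▁" and newlines as "<0x0A>", so normalize before comparing
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))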
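If either check prints False, it can help to see where the two outputs first diverge. The following is a minimal sketch of one way to do that; `first_mismatch` is a hypothetical helper written for illustration and is not part of transformers or mistral-common. It works on any two sequences, so it can be applied to the token lists or to the rendered strings from the script above.

```python
from itertools import zip_longest


def first_mismatch(left, right):
    """Return (index, left_item, right_item) at the first difference, or None if equal.

    Hypothetical debugging helper. A missing item (one sequence is shorter)
    is reported as None, which is fine for token IDs and characters.
    """
    for i, (a, b) in enumerate(zip_longest(left, right, fillvalue=None)):
        if a != b:
            return i, a, b
    return None


# Toy inputs standing in for hf_tokens / mistral_tokens
print(first_mismatch([1, 3, 5], [1, 3, 5]))  # None
print(first_mismatch([1, 3, 5], [1, 4, 5]))  # (1, 3, 4)
print(first_mismatch([1, 3], [1, 3, 5]))     # (2, None, 5)
```

With the script's variables, this would be called as `first_mismatch(hf_tokens, mistral_tokens)`; the returned index points at the first position worth inspecting in the chat template.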

Love your work! Can this be merged, please? Mistral does great work and produces great models. The Mistral libraries seem good too, but I don't want to build solutions on libraries that are incompatible with other models. The whole point of open weights is choice and flexibility.

patrickvonplaten changed pull request status to merged
Mistral AI_ org

Ah, actually, does the HF tokenizer support function calling already? It would be good to support this as well.
