mistralai/Mixtral-8x22B-Instruct-v0.1 · Align tokenizer with mistral-common

Align tokenizer with mistral-common (9712a30e)

Rocketknight1

Jun 26

edited Jun 26

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

# Multi-turn conversation with a system message, ending on a user turn
chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

# Reference tokenizer from mistral-common, and the HF tokenizer from this PR's branch
mistral_tok = MistralTokenizer.v3()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1", revision="pr/39")

# Render the chat with the HF chat template, as text and as token IDs
hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

# Encode the same chat with mistral-common
mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

# Both lines should print True if the two tokenizers agree.
# mistral-common's text shows raw SentencePiece pieces, so map "▁" back to a
# space and "<0x0A>" back to a newline before comparing the strings.
print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))
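
If the token comparison prints False, it helps to see where the two ID lists first diverge rather than just that they differ. A minimal sketch, assuming both results are plain Python lists of integers; `first_mismatch` is a hypothetical helper written for illustration and is not part of transformers or mistral-common:

```python
from itertools import zip_longest


def first_mismatch(a, b):
    """Return (index, a_item, b_item) at the first differing position, or None if equal.

    A missing item (one sequence is shorter) is reported as None.
    """
    missing = object()  # sentinel that cannot collide with a real token ID
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=missing)):
        if x != y:
            return i, (None if x is missing else x), (None if y is missing else y)
    return None
```

Calling `first_mismatch(hf_tokens, mistral_tokens)` after the script above would return None when the lists match. Otherwise, the returned index can be used to slice a few tokens on either side and decode them with each tokenizer to see which part of the template is responsible.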

Defend the honour of the Hugging Face tokenizer (2c3a7214)

Update chat template to handle system messages (3701ac33)

lcahill

Jul 2

Love your work! Can this be merged please? Mistral does great work and produces great models. The Mistral libraries seem good too, but I don't want to implement solutions using libraries that are incompatible with other models. The whole point of open weights is choice and flexibility.

patrickvonplaten changed pull request status to merged Jul 3

patrickvonplaten

Mistral AI_ org Jul 3

Ah, actually, does the HF tokenizer support function calling already? It would be good to support this as well.