Update tokenizer's chat template to support assistant masks

#52
by leleogere - opened

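Add `{% generation %}` / `{% endgeneration %}` tags around the assistant turns in the chat template so that the return_assistant_tokens_mask=True option of tokenizer.apply_chat_template works (see transformers PR https://github.com/huggingface/transformers/pull/30650). Without these tags, the returned assistant mask is all zeros.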

Example:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("mistralai/Codestral-22B-v0.1")
tokenized = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": "Hello assistant"},
        {"role": "assistant", "content": "Hello user"},
        {"role": "user", "content": "How are you?"},
        {"role": "assistant", "content": "I'm good"},
    ],
    return_assistant_tokens_mask=True,
    return_dict=True,
)
print(tokenized)

# BEFORE:
# {'input_ids': [1, 3, 23325, 14660, 4, 23325, 2956, 2, 3, 2370, 1228, 1136, 29572, 4, 1083, 29510, 29487, 1947, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'assistant_masks': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
# AFTER:
# {'input_ids': [1, 3, 23325, 14660, 4, 23325, 2956, 2, 3, 2370, 1228, 1136, 29572, 4, 1083, 29510, 29487, 1947, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'assistant_masks': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]}
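
For context, one common use of this mask is to compute the loss only on assistant tokens during fine-tuning. The snippet below is a minimal sketch of that idea, not part of this PR: `build_labels` is a hypothetical helper, the ids and mask are copied from the AFTER output above, and -100 is assumed as the ignore value because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
# Values copied from the AFTER output above.
input_ids = [1, 3, 23325, 14660, 4, 23325, 2956, 2, 3, 2370, 1228, 1136,
             29572, 4, 1083, 29510, 29487, 1947, 2]
assistant_masks = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Assumption: -100 is the label value the loss function ignores
# (default ignore_index of torch.nn.CrossEntropyLoss).
IGNORE_INDEX = -100


def build_labels(input_ids, assistant_masks, ignore_index=IGNORE_INDEX):
    """Hypothetical helper: keep token ids where the mask is 1, ignore the rest."""
    if len(input_ids) != len(assistant_masks):
        raise ValueError("input_ids and assistant_masks must have the same length")
    return [
        token if mask == 1 else ignore_index
        for token, mask in zip(input_ids, assistant_masks)
    ]


labels = build_labels(input_ids, assistant_masks)
print(labels)
```

With the old template the mask is all zeros, so every label would be ignored and nothing would be trained on.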
Human-readable diff between the old and the proposed template:
 {%- if messages[0]['role'] == 'system' %}
     {%- set system_message = messages[0]['content'] %}
     {%- set loop_messages = messages[1:] %}
 {%- else %}
     {%- set loop_messages = messages %}
 {%- endif %}
 
 {{- bos_token }}
 {%- for message in loop_messages %}
     {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
         {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}
     {%- endif %}
     {%- if message['role'] == 'user' %}
         {%- if loop.last and system_message is defined %}
             {{- '[INST] ' + system_message + '\\n\\n' + message['content'] + '[/INST]' }}
         {%- else %}
             {{- '[INST] ' + message['content'] + '[/INST]' }}
         {%- endif %}
     {%- elif message['role'] == 'assistant' %}
-        {{- ' ' + message['content'] + eos_token}}
+        {%- generation %}
+            {{- ' ' + message['content'] + eos_token}}
+        {%- endgeneration %}
     {%- else %}
         {{- raise_exception('Only user and assistant roles are supported, with the exception of an initial optional system message!') }}
     {%- endif %}
 {%- endfor %}
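
To illustrate which part of the rendered text the new tags cover, here is a rough plain-Python mirror of the template above that also records the character span of each assistant turn. It is only a sketch: `render_with_spans` is a hypothetical function (transformers does this internally through its Jinja extension), the default `<s>` / `</s>` strings are assumed values for bos_token and eos_token, and the `'\\n\\n'` in the template is assumed to render as two newlines.

```python
def render_with_spans(messages, bos_token="<s>", eos_token="</s>"):
    """Hypothetical mirror of the template; returns (text, assistant_spans)."""
    system_message = None
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]

    text = bos_token
    spans = []  # (start, end) character offsets of each generation block
    for index, message in enumerate(messages):
        role = message["role"]
        # Same alternation check as the template: user on even positions.
        if (role == "user") != (index % 2 == 0):
            raise ValueError("conversation roles must alternate user/assistant")
        if role == "user":
            is_last = index == len(messages) - 1
            if is_last and system_message is not None:
                # Assumption: the template's '\\n\\n' renders as two newlines.
                text += "[INST] " + system_message + "\n\n" + message["content"] + "[/INST]"
            else:
                text += "[INST] " + message["content"] + "[/INST]"
        elif role == "assistant":
            start = len(text)  # where {%- generation %} begins
            text += " " + message["content"] + eos_token
            spans.append((start, len(text)))  # where {%- endgeneration %} ends
        else:
            raise ValueError("only user and assistant roles are supported")
    return text, spans


text, spans = render_with_spans(
    [
        {"role": "user", "content": "Hello assistant"},
        {"role": "assistant", "content": "Hello user"},
        {"role": "user", "content": "How are you?"},
        {"role": "assistant", "content": "I'm good"},
    ]
)
assistant_chunks = [text[start:end] for start, end in spans]
print(assistant_chunks)
```

Each span includes the leading space and the EOS token, which is consistent with the AFTER mask above marking the EOS id (2) as an assistant token.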
leleogere changed pull request title from Update tokenizer_config.json to Update tokenizer config to support assistant masks in chat template
leleogere changed pull request title from Update tokenizer config to support assistant masks in chat template to Update tokenizer chat template to support assistant masks
leleogere changed pull request title from Update tokenizer chat template to support assistant masks to Update tokenizer's chat template to support assistant masks
