Update tokenizer's chat template to support assistant masks
#52, opened by leleogere
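Add the "generation" tags ({% generation %} ... {% endgeneration %}) around the assistant turn in the chat template, so that the return_assistant_tokens_mask=True option of Tokenizer.apply_chat_template can be used
(see PR https://github.com/huggingface/transformers/pull/30650). Without these tags, the returned assistant_masks contains only zeros.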
Example:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("mistralai/Codestral-22B-v0.1")
tokenized = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": "Hello assistant"},
        {"role": "assistant", "content": "Hello user"},
        {"role": "user", "content": "How are you?"},
        {"role": "assistant", "content": "I'm good"},
    ],
    return_assistant_tokens_mask=True,
    return_dict=True,
)
print(tokenized)
# BEFORE:
# {'input_ids': [1, 3, 23325, 14660, 4, 23325, 2956, 2, 3, 2370, 1228, 1136, 29572, 4, 1083, 29510, 29487, 1947, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'assistant_masks': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
# AFTER:
# {'input_ids': [1, 3, 23325, 14660, 4, 23325, 2956, 2, 3, 2370, 1228, 1136, 29572, 4, 1083, 29510, 29487, 1947, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'assistant_masks': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]}
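One common use of the mask is to restrict the training loss to assistant tokens. The following is a minimal sketch of that idea, not part of this PR: it reuses the input_ids and assistant_masks values from the AFTER output above, and the helper name build_labels is hypothetical. It relies on -100 being the default ignore_index of PyTorch's CrossEntropyLoss.

```python
# Values copied from the AFTER output above.
input_ids = [1, 3, 23325, 14660, 4, 23325, 2956, 2, 3, 2370, 1228, 1136,
             29572, 4, 1083, 29510, 29487, 1947, 2]
assistant_masks = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Label value skipped by PyTorch's CrossEntropyLoss (its default ignore_index).
IGNORE_INDEX = -100


def build_labels(input_ids, assistant_masks, ignore_index=IGNORE_INDEX):
    """Hypothetical helper: keep assistant tokens as labels, ignore the rest."""
    if len(input_ids) != len(assistant_masks):
        raise ValueError("input_ids and assistant_masks must have the same length")
    return [
        token if mask == 1 else ignore_index
        for token, mask in zip(input_ids, assistant_masks)
    ]


labels = build_labels(input_ids, assistant_masks)
print(labels)
```

Only the positions flagged by assistant_masks keep their token id, so user turns and the [INST] markers would not contribute to the loss.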
Human-readable diff between the old and the proposed template:
  {%- if messages[0]['role'] == 'system' %}
      {%- set system_message = messages[0]['content'] %}
      {%- set loop_messages = messages[1:] %}
  {%- else %}
      {%- set loop_messages = messages %}
  {%- endif %}
  {{- bos_token }}
  {%- for message in loop_messages %}
      {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
          {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}
      {%- endif %}
      {%- if message['role'] == 'user' %}
          {%- if loop.last and system_message is defined %}
              {{- '[INST] ' + system_message + '\\n\\n' + message['content'] + '[/INST]' }}
          {%- else %}
              {{- '[INST] ' + message['content'] + '[/INST]' }}
          {%- endif %}
      {%- elif message['role'] == 'assistant' %}
-         {{- ' ' + message['content'] + eos_token}}
+         {%- generation %}
+             {{- ' ' + message['content'] + eos_token}}
+         {%- endgeneration %}
      {%- else %}
          {{- raise_exception('Only user and assistant roles are supported, with the exception of an initial optional system message!') }}
      {%- endif %}
  {%- endfor %}
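To illustrate what the added tags mark, here is a rough pure-Python sketch that mirrors the user/assistant part of the template and records the character span rendered inside each generation block. It is an illustration under assumptions, not the transformers implementation: the function name render_with_spans is hypothetical, the token strings "<s>" and "</s>" are assumed values for bos_token and eos_token, and the optional system message is left out for brevity.

```python
def render_with_spans(messages, bos_token="<s>", eos_token="</s>"):
    """Hypothetical helper: render user/assistant turns like the template
    above and return the text plus the (start, end) character spans that
    would fall inside the generation ... endgeneration block."""
    text = bos_token
    spans = []
    for index, message in enumerate(messages):
        # Same alternation check as the template: user on even positions.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError("conversation roles must alternate user/assistant")
        if message["role"] == "user":
            text += "[INST] " + message["content"] + "[/INST]"
        elif message["role"] == "assistant":
            start = len(text)
            # Everything appended here corresponds to the generation block.
            text += " " + message["content"] + eos_token
            spans.append((start, len(text)))
        else:
            raise ValueError("only user and assistant roles are supported")
    return text, spans


messages = [
    {"role": "user", "content": "Hello assistant"},
    {"role": "assistant", "content": "Hello user"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "I'm good"},
]
text, spans = render_with_spans(messages)
for start, end in spans:
    print(repr(text[start:end]))
```

The spans cover the leading space, the assistant content, and the end-of-sequence token, which is consistent with the AFTER mask above ending on the final token id 2.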
leleogere changed pull request title from "Update tokenizer_config.json" to "Update tokenizer config to support assistant masks in chat template"
leleogere changed pull request title from "Update tokenizer config to support assistant masks in chat template" to "Update tokenizer chat template to support assistant masks"
leleogere changed pull request title from "Update tokenizer chat template to support assistant masks" to "Update tokenizer's chat template to support assistant masks"