Update tokenizer_config.json to prepend the bos token
As discussed in #9, the current HF tokenizer does not prepend the bos token (id: 128000) like in the reference implementation:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/generation.py#L256
and in their test cases:
https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/test_tokenizer.py#L23
This commit changes the tokenizer_class from "PreTrainedTokenizerFast" to "LlamaTokenizer", because PreTrainedTokenizerFast doesn't seem to support the add_bos_token flag.
before the fix:
!git clone https://github.com/meta-llama/llama3.git
from llama3.llama import Tokenizer
from transformers import AutoTokenizer
llama_tokenizer = Tokenizer("llama3/Meta-Llama-3-8B/tokenizer.model")
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
text = "This is a test sentence"
orig_enc = llama_tokenizer.encode(text, bos=True, eos=False)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
hf_enc = hf_tokenizer.encode(text)
# [2028, 374, 264, 720, 1296, 271, 52989]
after the fix:
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", revision="refs/pr/35")
text = "This is a test sentence"
hf_enc = hf_tokenizer.encode(text)
# [128000, 2028, 374, 264, 720, 1296, 271, 52989]
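For anyone pinned to a revision without this change, a minimal workaround sketch is to prepend the BOS id manually after encoding. The helper name ensure_bos is hypothetical (it is not part of transformers), and the id 128000 is simply the value quoted above; the check for an existing BOS is there so the helper does not add a second one once the tokenizer is fixed.

```python
BOS_ID = 128000  # bos token id quoted in this thread (assumption: unchanged for your checkpoint)


def ensure_bos(token_ids, bos_id=BOS_ID):
    """Hypothetical helper: prepend bos_id unless it is already the first id."""
    ids = list(token_ids)
    if ids and ids[0] == bos_id:
        return ids
    return [bos_id] + ids


# Ids copied from the "before the fix" and "after the fix" outputs above.
before_fix = [2028, 374, 264, 720, 1296, 271, 52989]
after_fix = [128000, 2028, 374, 264, 720, 1296, 271, 52989]

print(ensure_bos(before_fix) == after_fix)
print(ensure_bos(after_fix) == after_fix)
```

In practice this would wrap the call as ensure_bos(hf_tokenizer.encode(text)), which behaves the same on both the old and the fixed revision.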
@eduagarcia does this fix also apply to the meta-llama/Meta-Llama-3-8B-Instruct model?
see:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
payload = {
"inputs": tokenizer.apply_chat_template(
[
{
"role": "user",
"content": content,
}
],
tokenize=False,
),
"parameters": self.parameters,
}
If you are using the chat_template, it makes no difference: the chat_template already prepends the BOS token. This problem only applies if you are not using the template, as with this base model.
From my tests, tokenizer.apply_chat_template(dialog, add_generation_prompt=True) works the same as ChatFormat(tokenizer).encode_dialog_prompt(dialog) from the reference implementation.
from transformers import AutoTokenizer
hf_tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
test = hf_tokenizer.apply_chat_template(
[
{
"role": "system",
"content": "This is a test sentence.",
},
{
"role": "user",
"content": "This is a response.",
}
],
add_generation_prompt=True,
)
print(test)
#[128000, 128006, 9125, 128007, 271, 2028, 374, 264, 1296, 11914, 13, 128009, 128006, 882, 128007, 271, 2028, 374, 264, 2077, 13, 128009, 128006, 78191, 128007, 271]
# /\ bos_token
# these are the same ids as in the test in the official repo: https://github.com/meta-llama/llama3/blob/0cee08ec68f4cfc0c89fe4a9366d82679aaa2a66/llama/test_tokenizer.py#L68
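To make the layout of those ids easier to read, here is a rough pure-Python sketch of the string the chat format corresponds to. It is not the tokenizer's actual Jinja template or the reference ChatFormat code; render_llama3_prompt is a hypothetical name, and the layout is inferred from the special tokens behind the ids above (128000 = <|begin_of_text|>, 128006 = <|start_header_id|>, 128007 = <|end_header_id|>, 128009 = <|eot_id|>). Passing tokenize=False to apply_chat_template returns a string of this kind instead of ids, and add_generation_prompt=True adds the trailing assistant header so the model starts writing the reply.

```python
def render_llama3_prompt(dialog, add_generation_prompt=False):
    """Hypothetical sketch of the Llama 3 chat layout as a plain string."""
    parts = ["<|begin_of_text|>"]  # the BOS token is emitted by the template itself
    for message in dialog:
        # Header with the role, followed by a blank line (id 271 in the output above).
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # Message body, closed by the end-of-turn token.
        parts.append(message["content"].strip() + "<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so generation continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


dialog = [
    {"role": "system", "content": "This is a test sentence."},
    {"role": "user", "content": "This is a response."},
]
print(render_llama3_prompt(dialog, add_generation_prompt=True))
```

Because the string already starts with <|begin_of_text|>, tokenizing it again with a tokenizer that also prepends BOS could produce two BOS tokens, so it may be worth checking the first ids when mixing tokenize=False with a separate encode call.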
@eduagarcia looks like I can use meta-llama/Meta-Llama-3-8B-Instruct for chat.
@eduagarcia what do tokenize=False and add_generation_prompt=True do?
Though add_bos should be used, what we need to update here is the tokenizer.json: the template processor needs this. I'll update it.
Closing as explained in https://huggingface.co/meta-llama/Meta-Llama-3-70B/discussions/6