Different tokenization result from the llama-model reference implementation
The tokenization results of Meta's reference implementation and Hugging Face differ.
For the one-image request in Meta's scripts/multimodal_example_chat_completion.py, the tokenization result is:
128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271,
While Hugging Face produces:
256, 128000, 256, 128006, 882, 128007, 271, 257, 128256, 262, 61885, 420, 2217, 304, 1403, 23719, 257, 128009, 262, 128006, 78191, 128007, 271, 220
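The extra IDs in the Hugging Face output (256, 257, 262, 220) are most likely whitespace-only tokens introduced by indentation in the chat template (75885 vs. 61885 is presumably the same word tokenized with different surrounding whitespace). A minimal sketch to confirm this by decoding each ID individually, assuming the same Llama-3.2-11B-Vision-Instruct checkpoint used in the test code below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Llama-3.2-11B-Vision-Instruct")

meta_ids = [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217,
            304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
hf_ids = [256, 128000, 256, 128006, 882, 128007, 271, 257, 128256, 262,
          61885, 420, 2217, 304, 1403, 23719, 257, 128009, 262, 128006,
          78191, 128007, 271, 220]

# Decode each ID on its own so whitespace-only tokens are easy to spot.
for name, ids in (("meta", meta_ids), ("huggingface", hf_ids)):
    print(name, [repr(tokenizer.decode([i])) for i in ids])

# IDs that appear only in the Hugging Face output.
print("extra:", sorted(set(hf_ids) - set(meta_ids)))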
Hugging Face test code:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in two sentences"},
        ],
    }
]

# Render the chat template to text, then tokenize it together with the image.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
# print("text is:", text)

url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=text, images=raw_image, return_tensors="pt").to(model.device)
print("input_ids:", inputs["input_ids"])

output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
print(processor.decode(output[0]))
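A quick way to check whether a given transformers version is affected is to compare the processor output against the Meta reference IDs. A short continuation of the script above (the expected list is copied from the reference output at the top of this issue):

# Expected IDs from Meta's reference implementation (see above).
expected = torch.tensor([128000, 128006, 882, 128007, 271, 128256, 75885,
                         420, 2217, 304, 1403, 23719, 128009, 128006,
                         78191, 128007, 271])
assert torch.equal(inputs["input_ids"][0].cpu(), expected), "tokenization differs from the reference"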
Using the chat_template from the 90B Instruct model solves this issue; now I get [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
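For anyone hitting this before the fix is released, a minimal sketch of that workaround, assuming the 90B Instruct chat template has been saved locally as chat_template_90b.jinja (hypothetical filename):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Llama-3.2-11B-Vision-Instruct")

# Hypothetical local copy of the 90B Instruct chat template.
with open("chat_template_90b.jinja") as f:
    template_90b = f.read()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in two sentences"},
        ],
    }
]

# Pass the replacement template explicitly instead of the one bundled with 11B.
text = processor.apply_chat_template(
    messages, chat_template=template_90b, add_generation_prompt=True
)

Once the updated tokenizer_config.json ships, the override is no longer needed.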
Now fixed by this PR; I get [128000, 128006, 882, 128007, 271, 128256, 75885, 420, 2217, 304, 1403, 23719, 128009, 128006, 78191, 128007, 271]
Thank you very much!
Thank you @wukaixingxp 🙌 We'll update the template in the tokenizer_config.json file if needed, as that's the one used by the transformers tokenizer.
Fixed, closing now.