Why is special_tokens_map.json missing token "<|eot_id|>" 128009?

#164
by 3Simplex - opened

Isn't the missing eos token here a problem? "<|eot_id|>" is absent from special_tokens_map.json:

{
  "bos_token": "<|begin_of_text|>",
  "eos_token": "<|end_of_text|>"
}

I would expect it to either include both eos tokens or only the one the template actually uses. generation_config.json

{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "do_sample": true,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.9,
  "transformers_version": "4.40.0.dev0"
}
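Since transformers allows `eos_token_id` to be a list (as it is here), generation stops when the sampled token matches any id in that list. A minimal sketch of that stopping check, with the ids hard-coded from the generation_config.json above (no model is loaded):

```python
# Ids taken from generation_config.json above.
eos_token_ids = [128001, 128009]  # <|end_of_text|>, <|eot_id|>

def is_stop_token(token_id: int) -> bool:
    """True if token_id would terminate generation under a list-valued eos_token_id."""
    return token_id in eos_token_ids

print(is_stop_token(128009))  # <|eot_id|> stops generation
print(is_stop_token(128000))  # <|begin_of_text|> does not
```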

The expected token is 128009. config.json

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.0.dev0",
  "use_cache": true,
  "vocab_size": 128256
}

This is reinforced in the chat template itself. tokenizer_config.json

  "bos_token": "<|begin_of_text|>",
  "chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|eot_id|>",
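If you want special_tokens_map.json to reflect the token the chat template actually emits, one workaround is to patch the entry yourself. A minimal sketch operating on the JSON contents quoted above (in memory only; no files or model are touched, and whether you should persist this is exactly the open question here):

```python
import json

# special_tokens_map.json contents as shipped (quoted in the question above).
special_tokens_map = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|end_of_text|>",
}

# Override eos_token to match the chat template's end-of-turn token (id 128009).
special_tokens_map["eos_token"] = "<|eot_id|>"

print(json.dumps(special_tokens_map, indent=2))
```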

@pcuenq I am very curious about this; it seems to have been fixed in Llama 3.1 but not in Llama 3.
