Converting to native Transformers
This PR converts the model to be used natively within Transformers (see https://github.com/huggingface/transformers/pull/33823)
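For reference, a minimal sketch of what loading through this PR's revision looks like on the native path (assuming refs/pr/17, the revision referenced in the reproduction script below, is this PR's ref); no trust_remote_code should be needed:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Native loading path: the converted config/weights/tokenizer are picked up
# from the PR revision instead of the remote-code implementation.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17", torch_dtype="auto", device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", revision="refs/pr/17")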
This PR may behave unexpectedly.
To reproduce:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    # "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17",
    "THUDM/glm-4-9b-chat-1m",
    device_map="cuda",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", revision="refs/pr/17")
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)

# input = "Hello, how are you?"
# input_encoding = tokenizer(input, return_tensors="pt").to("cuda")
import pickle

with open("test_input.pkl", "rb") as f:
    input_ids = pickle.load(f)
input_encoding = torch.tensor([input_ids]).to("cuda")

print(input_encoding.shape)
print(input_encoding.dtype)

out = model.generate(input_encoding, max_new_tokens=20)
print(tokenizer.decode(out[0, len(input_ids):], skip_special_tokens=True))
The original repo works fine:
torch.Size([1, 98796])
torch.int64
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
**The paper investigates the properties of order-divisor graphs associated with finite groups, providing a comprehensive description of**
(base) aiscuser@node-0:/scratch/MInference$
But with this PR, the output collapses:
torch.Size([1, 98796])
torch.int64
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
**the 2. 2, the 2. 2, the 2. 2**
(base) aiscuser@node-0:/scratch/MInference$
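As an aside, the attention-mask warning shows up in both runs, so it is unrelated to the difference in outputs; for a single unpadded sequence it can be silenced by passing an explicit all-ones mask. A minimal sketch, reusing the input_encoding tensor from the script above:

# All-ones mask: every position is a real token, none is padding.
attention_mask = torch.ones_like(input_encoding)
out = model.generate(input_encoding, attention_mask=attention_mask, max_new_tokens=20)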
The collapse only appears with lengthy inputs; in my case the input is ~100K tokens long.
@cyrilvallez @zRzRzRzRzRzRzR, this may need a double check.
My transformers version: transformers==4.46.0.dev0
Could you check what happens when generating from the raw text instead of loading the input_ids from a file? That is, instead of:

import pickle

with open("test_input.pkl", "rb") as f:
    input_ids = pickle.load(f)

do:

with open("text.txt", "r") as f:
    text = f.read()
input_ids = tokenizer.encode(text, return_tensors="pt").to(device)
I suspect this may come from slight changes in the tokenizer.
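One quick way to test that suspicion is to tokenize the same long text with both the remote-code tokenizer and the tokenizer from this PR and compare the ids. A rough sketch (assuming refs/pr/17 is this PR's ref and text.txt is a hypothetical long test file):

from transformers import AutoTokenizer

tok_original = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)
tok_pr = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", revision="refs/pr/17")

with open("text.txt", "r") as f:  # hypothetical long-context test file
    text = f.read()

ids_original = tok_original.encode(text)
ids_pr = tok_pr.encode(text)
# Same length and identical ids would rule the tokenizer out.
print(len(ids_original), len(ids_pr), ids_original == ids_pr)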
A new repository will also be created for this model, used for the adaptation.
@cyrilvallez Hi Cyril, I re-tested the HF native version as you suggested, and the error remains. The tokenizer seems to behave consistently, so I have no idea where the bug is: https://huggingface.co/THUDM/glm-4-9b-chat-1m-hf/discussions/1.
You can also find the test example I used in the link above.
@cyrilvallez Hi Cyril, this PR does not work in the first place. I suspect no long-context test was run on it. Maybe you can share your weight-conversion script so we can help review it.
My test script:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    # "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17",
    "THUDM/glm-4-9b-chat-1m",
    device_map="cuda",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", revision="refs/pr/17")
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)

with open("t.txt", "r") as f:
    input_ids = tokenizer.encode(f.read())
input_encoding = torch.tensor([input_ids]).to("cuda")

print(input_encoding.shape)
print(input_encoding.dtype)

out = model.generate(input_encoding, max_new_tokens=100)
print(tokenizer.decode(out[0, len(input_ids):], skip_special_tokens=True))
And the behaviour differs between the original model (second run) and your PR (first run), see below.
(base) v-yuchengli@microsoft.com@GCRAZGDL1694:~/MInference$ cd /home/v-yuchengli/MInference ; /usr/bin/env /home/v-yuchengli/miniconda3/envs/llm/bin/python /home/v-yuchengli/.cursor-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 47591 -- /home/v-yuchengli/MInference/t.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.12s/it]
torch.Size([1, 137369])
====================
6f6f6b7f6c7f6b6b6b6c7f6f6f6b6f6b6c7f6b6b6c7f6c7c7c7c7c7c7c7c7c7c7c7c7c7c7f6c7c7c7c7c7c7c4b6c7c7c7c7c
(base) v-yuchengli@microsoft.com@GCRAZGDL1694:~/MInference$ cd /home/v-yuchengli/MInference ; /usr/bin/env /home/v-yuchengli/miniconda3/envs/llm/bin/python /home/v-yuchengli/.cursor-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 44467 -- /home/v-yuchengli/MInference/t.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 10/10 [03:40<00:00, 22.05s/it]
torch.Size([1, 137369])
====================
\"cb59052b-9128-4979-9c0e-e1de4adcf73b\"The value associated with the specified key is "cb59052b-9128-4979-9c0e-e1de4adcf73b". The key you provided is "6ab6ea3e-f288-4f33-ba46-7f42bb75b03f". The value associated with
Hey @liyucheng! I suspect the error may come from these 2 lines: https://github.com/huggingface/transformers/blob/main/src/transformers/models/glm/modeling_glm.py#L169-L170
Could you try without them (just plainly remove them) and let me know?
@cyrilvallez Hi Cyril, I tried that, but it did not work. I also re-implemented the rope function (apply_rotary_pos_emb) following the original GLM implementation:
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)

    # Interleave them instead of usual shape
    # cos = cos[..., : cos.shape[-1] // 2].repeat_interleave(2, dim=-1)
    # sin = sin[..., : sin.shape[-1] // 2].repeat_interleave(2, dim=-1)
    cos = cos[..., : cos.shape[-1] // 2]
    sin = sin[..., : sin.shape[-1] // 2]

    # Keep half for later concatenation
    q, q_pass = q[..., : q.shape[-1] // 2], q[..., q.shape[-1] // 2 :]
    k, k_pass = k[..., : k.shape[-1] // 2], k[..., k.shape[-1] // 2 :]

    # Apply rotary embeddings on the first half
    # q_embed = (q * cos) + (rotate_half(q) * sin)
    # k_embed = (k * cos) + (rotate_half(k) * sin)
    qshaped = q.reshape(q.shape[0], q.shape[1], -1, q.shape[-1] // 2, 2)
    kshaped = k.reshape(k.shape[0], k.shape[1], -1, k.shape[-1] // 2, 2)
    q_embed = torch.stack(
        [
            qshaped[..., 0] * cos - qshaped[..., 1] * sin,
            qshaped[..., 0] * sin + qshaped[..., 1] * cos,
        ],
        dim=-1,
    )
    k_embed = torch.stack(
        [
            kshaped[..., 0] * cos - kshaped[..., 1] * sin,
            kshaped[..., 0] * sin + kshaped[..., 1] * cos,
        ],
        dim=-1,
    )
    q_embed = q_embed.flatten(3)
    k_embed = k_embed.flatten(3)

    # Concatenate back to full shape
    q_embed = torch.cat([q_embed, q_pass], dim=-1)
    k_embed = torch.cat([k_embed, k_pass], dim=-1)
    return q_embed, k_embed
It does not work either. Do you think the bug could be in the model weights?
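If the weights (or config) were converted incorrectly, the divergence should already show up at short context. A rough sanity check, assuming the converted weights live in THUDM/glm-4-9b-chat-1m-hf (the repo linked above) and enough CPU memory is available; bfloat16 on GPU also works for a coarser comparison:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# If short-context logits already diverge, the problem is in the conversion
# (weights/config) rather than something long-context specific.
model_orig = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat-1m", trust_remote_code=True, torch_dtype=torch.float32
)
model_hf = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat-1m-hf", torch_dtype=torch.float32
)

tok = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)
ids = tok("Hello, how are you?", return_tensors="pt").input_ids

with torch.no_grad():
    diff = (model_orig(ids).logits - model_hf(ids).logits).abs().max()
print(diff)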
We submitted a new pull request for the GLM-Edge model. In GLM-Edge, this implementation has some modifications and meets expectations in performance testing. That PR has been merged.
However, GLM-4 still uses the original implementation, as mentioned in this link.
@liyucheng thanks for checking it out! I'm fairly confident the model definition is mathematically equivalent to the one in the original code (I took quite some time looking at it at the time); rope was my best guess for where I could have made a mistake. Of course, this does not mean nothing slipped past me, so if you're willing to check, it is always better to make sure.
But given that both my tests passed at the time, and that the new version also seems to work well according to @zRzRzRzRzRzRzR, I'd say the issue is one of the following:
- very small numerical differences (due to shapes) that accumulate (with such a long context, they accumulate a lot)
- conversion of the weights (unlikely, as I used the same script for all the conversions), or something in the config? (a quick way to eyeball the configs is sketched below)
You could maybe start by re-converting the weights and checking again? You can use this script for it. It has since been modified to convert the new version as well, but it should still work for the old one.
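For the config point above, a rough way to compare, assuming the converted repo is THUDM/glm-4-9b-chat-1m-hf; field names differ between the remote-code ChatGLM config and the native GLM config, so this is only for manually eyeballing rope/scaling related values:

from transformers import AutoConfig

# Print both configs to compare rope base, context length, head dims, etc.
cfg_orig = AutoConfig.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)
cfg_hf = AutoConfig.from_pretrained("THUDM/glm-4-9b-chat-1m-hf")
print(cfg_orig.to_dict())
print(cfg_hf.to_dict())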