Commit b6279a3

Duplicate from CausalLM/miniG

Co-authored-by: Joséphus Cheung <JosephusCheung@users.noreply.huggingface.co>
- .gitattributes +35 -0
- README.md +94 -0
- config.json +68 -0
- configuration.json +1 -0
- configuration_chatglm.py +66 -0
- generation_config.json +13 -0
- model.safetensors +3 -0
- modeling_chatglm.py +1329 -0
- tokenization_chatglm.py +361 -0
- tokenizer.model +3 -0
- tokenizer_config.json +134 -0
- visual.py +180 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,94 @@
+---
+language:
+- en
+- zh
+- ja
+- de
+model-index:
+- name: miniG
+  results:
+  - task:
+      type: text-generation
+    metrics:
+    - name: MMLU
+      type: MMLU
+      value: 85.45
+    - name: IFEval
+      type: IFEval
+      value: 74.22
+    - name: GSM8K (5-shot)
+      type: GSM8K (5-shot)
+      value: 75.89
+    - name: HumanEval
+      type: HumanEval
+      value: 79.88
+    - name: GPQA
+      type: GPQA
+      value: 37.37
+license: agpl-3.0
+pipeline_tag: text-generation
+co2_eq_emissions:
+  emissions: 700
+  training_type: "fine-tuning"
+
+---
+
+# miniG
+
+[GGUF (Text-Only)](https://huggingface.co/CausalLM/miniG/tree/gguf)
+
+[Text-Only Weight](https://huggingface.co/CausalLM/miniG/tree/text-only)
+
+A model trained on a synthetic dataset of over **120 million** entries. The dataset was generated by state-of-the-art language models with large context windows, using methodologies akin to retrieval-augmented generation and knowledge graph integration; synthesis was conducted within clusters derived from a curated pretraining corpus of 20 billion tokens, with subsequent validation performed by the model itself.
+
+Despite the absence of thorough alignment with human preferences, the model is under no obligation to cater to poorly constructed prompts or the clichés often found in conventional benchmarks. Bonus: included is an implementation of a **Vision Language Model** that has undergone Locked-Image Tuning.
+
+**Supported Input Modalities**: text, image. For the text-only weights, use the branch `revision=text-only` at https://huggingface.co/CausalLM/miniG/tree/text-only . The [GGUF](https://huggingface.co/CausalLM/miniG/tree/gguf) text-only build should work now that PR [#9194](https://github.com/ggerganov/llama.cpp/pull/9194) has been merged.
+
+**Context Window:** 1M tokens
+
+**Model Parameters:** LLM - 9B (initialized from THUDM/glm-4-9b-chat-1m); Optional ViT - 5B
+
+**Cautionary Notes:** **It is strongly recommended to use a standardized implementation for inference**, such as Hugging Face Transformers, to avoid the significant performance degradation that can occur with accelerated kernels such as vllm or lmdeploy - not to mention the potentially catastrophic effects of model quantization. **As of now, these accelerated inference implementations are known to severely compromise** vision inference, though they have a less pronounced impact on pure text performance.
+
+**Inference Parameters:** Our observations suggest that, to achieve results with fewer hallucinations, it is advisable to sample with top_p=0.8 followed by a temperature of 0.3, or alternatively to use pure temperature sampling at 0.2. **In general, a lower temperature is required compared to similar models**, which we tentatively attribute to overfitting on the vast dataset. Model inference should follow THUDM/glm-4-9b-chat-1m and THUDM/glm-4v-9b; we only guarantee best performance when using Transformers for inference. In our testing we also used lmdeploy, which resulted in significant performance degradation for multimodal input.
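To make the recommendations above concrete, here is a minimal, assumed sketch of text-only inference with Hugging Face Transformers using the suggested sampling parameters; the prompt and max_new_tokens are illustrative placeholders, and you may pass revision="text-only" to use the text-only branch:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed sketch: the repo ships custom ChatGLM-style code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("CausalLM/miniG", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "CausalLM/miniG", torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # keep the system prompt non-empty
    {"role": "user", "content": "Summarize retrieval-augmented generation in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings recommended above to reduce hallucinations.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, top_p=0.8, temperature=0.3)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))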
+
+**Regarding Formatting:** We strongly recommend you double-check your input to ensure: 1. The system prompt is not empty. Even something as simple as "You are a helpful assistant." is expected. 2. There is always a newline character after the <|role|> tag. This will help ensure proper parsing and processing of your input.
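To illustrate the two checks above, a small assumed sketch that renders the prompt through the bundled chat template before generating; the literal tags in the comments are what the template is expected to produce, not an independent specification:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CausalLM/miniG", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # 1. never leave the system prompt empty
    {"role": "user", "content": "Hello!"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 2. every <|role|> tag should be followed by a newline, e.g. "<|system|>\n..." and "<|user|>\n..."
assert "<|system|>\n" in prompt and "<|user|>\n" in prompt
print(prompt)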
+
+**Regarding [Benchmark Scores](https://huggingface.co/spaces/JosephusCheung/Goodharts-Law-on-Benchmarks-a-Page-for-miniG):** Generally, you shouldn't worry too much about them, as people can always train specifically to achieve good results. We mainly use them as a smoke test, a quick check to ensure no major regressions have occurred. In fact, if you actually read through the benchmark questions themselves, you'll often find yourself chuckling at how inane, low-quality, or even downright silly they are.
+
+**Regarding training:** The final released version was trained using a merge of multiple candidate models in an attempt to improve performance. However, we were unable to conclusively determine whether this was effective. Excluding candidate versions, an efficient naïve fine-tuning should be achievable within one day on 16 nodes of 8*A100-80G. Based on this, we estimate the carbon emissions to be 700 kg CO2 eq.
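As a rough sanity check of the 700 kg figure above (the per-GPU power draw and grid carbon intensity below are assumed values, not stated in the model card):

# Back-of-the-envelope only; both constants below are assumptions for illustration.
gpus = 16 * 8                 # 16 nodes x 8 A100-80G
hours = 24                    # "within one day"
kw_per_gpu = 0.4              # assumed average draw per GPU incl. overhead
kg_co2_per_kwh = 0.57         # assumed grid carbon intensity

energy_kwh = gpus * hours * kw_per_gpu          # ~1229 kWh
emissions_kg = energy_kwh * kg_co2_per_kwh      # ~700 kg CO2 eq
print(round(energy_kwh), round(emissions_kg))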
+
+**Disclaimer:** Please note that the model was trained on unfiltered internet data. Since we do not have the capacity to vet all of it, there may be a substantial amount of objectionable content, pornography, violence, and offensive language present that we are unable to remove. Therefore, you will still need to complete your own checks on the model's safety and filter keywords in the output. Due to computational resource constraints, we are presently unable to implement RLHF for the model's ethics and safety, nor training on SFT samples that refuse to answer certain questions for restrictive fine-tuning.
+
+**Seeking Unconditional Sponsorship:** Training and synthesizing datasets can be expensive. While we cannot disclose more details about the cost budget, we can theoretically analyze the example of synthesizing and self-verifying the dataset used to train this model, which involved 120M entries synthesized from 20B tokens. The nominal cost of data synthesis and self-verification using a commercial model API could be as high as $3M, while the nominal cost using local model inference, measured in GPU time, could still reach up to $0.1M. We are actively training larger parameter models and scaling up data synthesis, and are seeking substantial compute resources and generous **unconditional** grants. While this is for the purpose of commercial exploration and technology selection, we are currently under no immediate pressure to generate profit and remain committed to sharing more with the open-source community.
+
+# 迷你G
+
+[GGUF (纯文本)](https://huggingface.co/CausalLM/miniG/tree/gguf)
+
+[纯文本权重](https://huggingface.co/CausalLM/miniG/tree/text-only)
+
+一个在超过**1.2亿**条数据合成数据集上训练的模型,这些数据集是通过应用具有大上下文窗口的最先进语言模型生成的,并结合了类似于检索增强生成和知识图谱集成的方法,数据合成是在一个由200亿个标记组成的预训练语料库中提取的聚类内进行的,随后由模型本身进行验证。
+
+尽管该模型没有完全对齐人类偏好,但它没有义务迎合不良构建的提示或常见基准测试中的陈词滥调。额外内容:包含了经过锁定图像微调的**视觉语言模型**实现。
+
+**支持的输入模态**:文本、图像。对于纯文本权重,请使用 https://huggingface.co/CausalLM/miniG/tree/text-only 上的分支 `revision=text-only`。在 PR [#9194](https://github.com/ggerganov/llama.cpp/pull/9194) 合并后,适用于纯文本的 [GGUF](https://huggingface.co/CausalLM/miniG/tree/gguf) 应该可以正常工作。
+
+**上下文窗口**:1M 个标记
+
+**模型参数:**LLM - 9B(从THUDM/glm-4-9b-chat-1m初始化);可选的ViT - 5B。
+
+**注意事项:** **强烈建议使用标准化的推理实现**,例如Hugging Face Transformers,以避免在使用加速内核(如vllm或lmdeploy)时可能发生的显著性能下降——更不用说模型量化可能带来的灾难性影响。**目前,这些加速推理实现已知会严重损害**视觉推理的有效性,尽管对纯文本性能的影响较小。
+
+**推理参数:**我们的观察表明,如果想要减少幻觉结果,建议使用top_p=0.8的采样方式,然后设置temperature为0.3,或者使用纯粹的temperature采样,设置为0.2。**总体来说,相比类似的模型,该模型需要较低的temperature**,我们暂时将其归因于在庞大数据集上的过拟合。模型推理应参考 THUDM/glm-4-9b-chat-1m 和 THUDM/glm-4v-9b。我们只保证使用 transformers 进行推理时的性能最佳。在我们的测试中,我们还使用了 lmdeploy,这导致多模态输入的性能显著下降。
+
+**关于格式:**我们强烈建议您仔细检查输入内容,以确保:1. 系统提示不为空。即使是像“You are a helpful assistant.”这样简单的提示也是预期的。2. <|role|> 标签后始终有一个换行符。这将有助于确保正确解析和处理您的输入。
+
+**关于[基准测试分数](https://huggingface.co/spaces/JosephusCheung/Goodharts-Law-on-Benchmarks-a-Page-for-miniG):**一般来说,你不应该太过在意这些分数,因为人们总是可以专门训练以取得好成绩。我们主要将它们作为一个冒烟测试,一种快速检查,确保没有发生重大回退。事实上,如果你真的去阅读这些基准测试问题本身,你常常会发现自己会忍不住笑出声来,因为它们是多么无聊、低质量,甚至荒谬可笑。
+
+**关于训练:**最终发布的版本使用了多个候选模型的合并来尝试提高性能。然而,我们无法确定这种方法是否确实有效。排除候选版本和合并实验,使用16个节点、每个节点配备8个A100-80G显卡的情况下,应该可以在一天之内实现高效的朴素微调。据此我们估算碳排放量为700公斤二氧化碳当量。
+
+**免责声明:**请注意,该模型是在未经过滤的互联网数据上训练的。由于我们无法对所有数据进行筛选,仍有可能存在大量不适当的内容——包括从露骨的材料到暴力和攻击性语言的内容——我们无法移除。因此,您必须自行对模型进行安全检查,并在输出中实施关键词过滤。由于计算资源的限制,我们目前无法为伦理和安全考虑进行人类反馈的强化学习(RLHF),也不能对SFT样本进行限制性微调,以限制模型回答某些问题的能力。
+
+**寻求无条件赞助:** 训练和合成数据集可能非常昂贵。虽然我们无法透露更多关于成本预算的细节,但我们可以从理论上分析一下合成和自我验证用于训练该模型的数据集的例子,该数据集包含从 200 亿个标记合成的 1.2 亿个条目。使用商业模型 API 进行数据合成和自我验证的名义成本可能高达 300 万美元,而使用本地模型推理(以 GPU 时间衡量)的名义成本仍然可能高达 10 万美元。我们正在积极训练更大参数的模型并扩大数据合成规模,同时寻求大量的计算资源和慷慨的**无条件**资助。尽管这是为了商业探索和技术选择的目的,但我们目前并没有立即产生利润的压力,并且仍然致力于与开源社区分享更多成果。
config.json
ADDED
@@ -0,0 +1,68 @@
+{
+  "_name_or_path": "miniG",
+  "add_bias_linear": false,
+  "add_qkv_bias": true,
+  "apply_query_key_layer_scaling": true,
+  "apply_residual_connection_post_layernorm": false,
+  "architectures": [
+    "ChatGLMForConditionalGeneration"
+  ],
+  "attention_dropout": 0.0,
+  "attention_softmax_in_fp32": true,
+  "auto_map": {
+    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
+    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
+    "AutoModelForCausalLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
+    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
+    "AutoModelForSequenceClassification": "modeling_chatglm.ChatGLMForSequenceClassification"
+  },
+  "bias_dropout_fusion": true,
+  "boi_token_id": 151339,
+  "classifier_dropout": null,
+  "eoi_token_id": 151340,
+  "eos_token_id": [
+    151329,
+    151336,
+    151338
+  ],
+  "ffn_hidden_size": 13696,
+  "fp32_residual_connection": false,
+  "hidden_dropout": 0.0,
+  "hidden_size": 4096,
+  "kv_channels": 128,
+  "layernorm_epsilon": 1.5625e-07,
+  "model_type": "chatglm",
+  "multi_query_attention": true,
+  "multi_query_group_num": 4,
+  "num_attention_heads": 32,
+  "num_hidden_layers": 40,
+  "num_layers": 40,
+  "original_rope": true,
+  "pad_token_id": 151329,
+  "padded_vocab_size": 151552,
+  "post_layer_norm": true,
+  "pre_seq_len": null,
+  "prefix_projection": false,
+  "rmsnorm": true,
+  "rope_ratio": 10000,
+  "seq_length": 1048576,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.44.0",
+  "use_cache": true,
+  "vision_config": {
+    "dropout_prob": 0.0,
+    "hidden_act": "gelu",
+    "hidden_size": 1792,
+    "image_size": 1120,
+    "in_channels": 3,
+    "intermediate_size": 15360,
+    "layer_norm_eps": 1e-06,
+    "num_heads": 16,
+    "num_hidden_layers": 63,
+    "num_positions": 6401,
+    "patch_size": 14,
+    "scaling_factor": 8
+  },
+  "vocab_size": 151552
+}
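A brief, assumed sketch of inspecting this configuration; AutoConfig resolves the auto_map entry above to the bundled ChatGLMConfig:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("CausalLM/miniG", trust_remote_code=True)
print(config.model_type)                     # "chatglm"
print(config.seq_length)                     # 1048576, i.e. the advertised 1M-token context window
print(config.vision_config["hidden_size"])   # 1792, the optional EVA2-CLIP vision tower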
configuration.json
ADDED
@@ -0,0 +1 @@
+{"framework":"Pytorch","task":"nli"}
configuration_chatglm.py
ADDED
@@ -0,0 +1,66 @@
+from transformers import PretrainedConfig
+
+
+class ChatGLMConfig(PretrainedConfig):
+    model_type = "chatglm"
+
+    def __init__(
+            self,
+            num_layers=28,
+            padded_vocab_size=65024,
+            hidden_size=4096,
+            ffn_hidden_size=13696,
+            kv_channels=128,
+            num_attention_heads=32,
+            seq_length=2048,
+            hidden_dropout=0.0,
+            classifier_dropout=None,
+            attention_dropout=0.0,
+            layernorm_epsilon=1e-5,
+            rmsnorm=True,
+            apply_residual_connection_post_layernorm=False,
+            post_layer_norm=True,
+            add_bias_linear=False,
+            add_qkv_bias=False,
+            bias_dropout_fusion=True,
+            multi_query_attention=False,
+            multi_query_group_num=1,
+            rope_ratio=1,
+            apply_query_key_layer_scaling=True,
+            attention_softmax_in_fp32=True,
+            fp32_residual_connection=False,
+            pre_seq_len=None,
+            prefix_projection=False,
+            boi_token_id=None,
+            eoi_token_id=None,
+            **kwargs
+    ):
+        self.num_layers = num_layers
+        self.vocab_size = padded_vocab_size
+        self.padded_vocab_size = padded_vocab_size
+        self.hidden_size = hidden_size
+        self.ffn_hidden_size = ffn_hidden_size
+        self.kv_channels = kv_channels
+        self.num_attention_heads = num_attention_heads
+        self.seq_length = seq_length
+        self.hidden_dropout = hidden_dropout
+        self.classifier_dropout = classifier_dropout
+        self.attention_dropout = attention_dropout
+        self.layernorm_epsilon = layernorm_epsilon
+        self.rmsnorm = rmsnorm
+        self.apply_residual_connection_post_layernorm = apply_residual_connection_post_layernorm
+        self.post_layer_norm = post_layer_norm
+        self.add_bias_linear = add_bias_linear
+        self.add_qkv_bias = add_qkv_bias
+        self.bias_dropout_fusion = bias_dropout_fusion
+        self.multi_query_attention = multi_query_attention
+        self.multi_query_group_num = multi_query_group_num
+        self.rope_ratio = rope_ratio
+        self.apply_query_key_layer_scaling = apply_query_key_layer_scaling
+        self.attention_softmax_in_fp32 = attention_softmax_in_fp32
+        self.fp32_residual_connection = fp32_residual_connection
+        self.pre_seq_len = pre_seq_len
+        self.prefix_projection = prefix_projection
+        self.boi_token_id = boi_token_id
+        self.eoi_token_id = eoi_token_id
+        super().__init__(**kwargs)
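For reference, a minimal sketch (assuming this module is importable as configuration_chatglm) that instantiates the config with the miniG values from config.json instead of the GLM defaults above:

from configuration_chatglm import ChatGLMConfig

# Values mirroring config.json; anything not passed keeps the defaults defined above.
config = ChatGLMConfig(
    num_layers=40,
    padded_vocab_size=151552,
    hidden_size=4096,
    ffn_hidden_size=13696,
    num_attention_heads=32,
    seq_length=1048576,
    multi_query_attention=True,
    multi_query_group_num=4,
    add_qkv_bias=True,
    rope_ratio=10000,
)
print(config.vocab_size, config.seq_length)  # 151552 1048576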
generation_config.json
ADDED
@@ -0,0 +1,13 @@
+{
+  "eos_token_id": [
+    151329,
+    151336,
+    151338
+  ],
+  "pad_token_id": 151329,
+  "do_sample": true,
+  "temperature": 0.8,
+  "max_length": 8192,
+  "top_p": 0.8,
+  "transformers_version": "4.44.0"
+}
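Note that these shipped defaults (temperature 0.8, top_p 0.8) are warmer than the README's low-temperature recommendation; a hedged sketch of loading and overriding them at generation time:

from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("CausalLM/miniG")
print(gen_cfg.temperature, gen_cfg.top_p)   # 0.8 0.8 as shipped

# Override per the model card's recommendation before calling model.generate(...).
gen_cfg.temperature = 0.3
gen_cfg.top_p = 0.8
# outputs = model.generate(**inputs, generation_config=gen_cfg, max_new_tokens=256)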
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c7aff6adc93a3b91d5d469bf8ab05ad6d7425d1c310532990155065d60824c9b
+size 27980601400
modeling_chatglm.py
ADDED
@@ -0,0 +1,1329 @@
+""" PyTorch GLM-4V model. """
+import math
+import sys
+import torch
+import torch.utils.checkpoint
+import torch.nn.functional as F
+from torch import nn
+from torch.nn import CrossEntropyLoss, LayerNorm, MSELoss, BCEWithLogitsLoss
+from torch.nn.utils import skip_init
+from typing import Optional, Tuple, Union, List, Dict, Any
+
+from transformers.modeling_outputs import (
+    BaseModelOutputWithPast,
+    CausalLMOutputWithPast,
+    SequenceClassifierOutputWithPast,
+)
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import logging, is_torch_npu_available
+from transformers.generation.logits_process import LogitsProcessor
+from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList, GenerationConfig, ModelOutput
+
+from .visual import EVA2CLIPModel
+from .configuration_chatglm import ChatGLMConfig
+
+try:
+    from transformers.utils import is_flash_attn_greater_or_equal_2_10, is_flash_attn_2_available
+
+    if is_flash_attn_2_available():
+        from flash_attn import flash_attn_func, flash_attn_varlen_func
+        from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
+except:
+    pass
+
+# flags required to enable jit fusion kernels
+
+if sys.platform != 'darwin' and not is_torch_npu_available():
+    torch._C._jit_set_profiling_mode(False)
+    torch._C._jit_set_profiling_executor(False)
+    torch._C._jit_override_can_fuse_on_cpu(True)
+    torch._C._jit_override_can_fuse_on_gpu(True)
+
+logger = logging.get_logger(__name__)
+
+LANGUAGE_TOKEN_TYPE = 0
+VISION_TOKEN_TYPE = 1
+
+_CHECKPOINT_FOR_DOC = "THUDM/ChatGLM"
+_CONFIG_FOR_DOC = "ChatGLMConfig"
+
+
+def default_init(cls, *args, **kwargs):
+    return cls(*args, **kwargs)
+
+
+class InvalidScoreLogitsProcessor(LogitsProcessor):
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+        if torch.isnan(scores).any() or torch.isinf(scores).any():
+            scores.zero_()
+            scores[..., 198] = 5e4
+        return scores
+
+
+class PrefixEncoder(torch.nn.Module):
+    """
+    The torch.nn model to encode the prefix
+    Input shape: (batch-size, prefix-length)
+    Output shape: (batch-size, prefix-length, 2*layers*hidden)
+    """
+
+    def __init__(self, config: ChatGLMConfig):
+        super().__init__()
+        self.prefix_projection = config.prefix_projection
+        if self.prefix_projection:
+            # Use a two-layer MLP to encode the prefix
+            kv_size = config.num_layers * config.kv_channels * config.multi_query_group_num * 2
+            self.embedding = torch.nn.Embedding(config.pre_seq_len, kv_size)
+            self.trans = torch.nn.Sequential(
+                torch.nn.Linear(kv_size, config.hidden_size),
+                torch.nn.Tanh(),
+                torch.nn.Linear(config.hidden_size, kv_size)
+            )
+        else:
+            self.embedding = torch.nn.Embedding(config.pre_seq_len,
+                                                config.num_layers * config.kv_channels * config.multi_query_group_num * 2)
+
+    def forward(self, prefix: torch.Tensor):
+        if self.prefix_projection:
+            prefix_tokens = self.embedding(prefix)
+            past_key_values = self.trans(prefix_tokens)
+        else:
+            past_key_values = self.embedding(prefix)
+        return past_key_values
+
+
+def split_tensor_along_last_dim(
+        tensor: torch.Tensor,
+        num_partitions: int,
+        contiguous_split_chunks: bool = False,
+) -> List[torch.Tensor]:
+    """Split a tensor along its last dimension.
+
+    Arguments:
+        tensor: input tensor.
+        num_partitions: number of partitions to split the tensor
+        contiguous_split_chunks: If True, make each chunk contiguous
+                                 in memory.
+
+    Returns:
+        A list of Tensors
+    """
+    # Get the size and dimension.
+    last_dim = tensor.dim() - 1
+    last_dim_size = tensor.size()[last_dim] // num_partitions
+    # Split.
+    tensor_list = torch.split(tensor, last_dim_size, dim=last_dim)
+    # Note: torch.split does not create contiguous tensors by default.
+    if contiguous_split_chunks:
+        return tuple(chunk.contiguous() for chunk in tensor_list)
+
+    return tensor_list
+
+
+class RotaryEmbedding(nn.Module):
+    def __init__(self, dim, rope_ratio=1, original_impl=False, device=None, dtype=None):
+        super().__init__()
+        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
+        self.register_buffer("inv_freq", inv_freq)
+        self.dim = dim
+        self.original_impl = original_impl
+        self.rope_ratio = rope_ratio
+
+    def impl(self, seq_length: int, dim: int, device: torch.device, dtype: torch.dtype):
+        base = 10000 * self.rope_ratio
+        inv_freq = 1.0 / (
+                base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
+        seq = torch.arange(seq_length, device=inv_freq.device, dtype=torch.float32)
+        freqs = torch.outer(seq, inv_freq)
+        # first part even vector components, second part odd vector components,
+        # 2 * dim in dimension size
+        emb = torch.cat((freqs, freqs), dim=-1)
+        return emb
+
+    def forward_impl(
+            self, seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000
+    ):
+        """Enhanced Transformer with Rotary Position Embedding.
+
+        Derived from: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/
+        transformers/rope/__init__.py. MIT License:
+        https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/license.
+        """
+        # $\Theta = {\theta_i = 10000^{\frac{2(i-1)}{d}}, i \in [1, 2, ..., \frac{d}{2}]}$
+        base = base * self.rope_ratio
+        theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=torch.float, device=device) / n_elem))
+
+        # Create position indexes `[0, 1, ..., seq_len - 1]`
+        seq_idx = torch.arange(seq_len, dtype=torch.float, device=device)
+
+        # Calculate the product of position index and $\theta_i$
+        idx_theta = torch.outer(seq_idx, theta).float()
+
+        cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1)
+
+        # this is to mimic the behaviour of complex32, else we will get different results
+        if dtype in (torch.float16, torch.bfloat16, torch.int8):
+            cache = cache.bfloat16() if dtype == torch.bfloat16 else cache.half()
+        return cache
+
+    def forward(self, max_seq_len, offset=0):
+        if self.original_impl:
+            return self.forward_impl(
+                max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device
+            )
+        else:
+            return self.impl(max_seq_len, self.dim, dtype=self.inv_freq.dtype, device=self.inv_freq.device)
+
+
+@torch.jit.script
+def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
+    # x: [b, np, sq, hn]
+    b, np, sq, hn = x.size(0), x.size(1), x.size(2), x.size(3)
+    rot_dim = rope_cache.shape[-2] * 2
+    x, x_pass = x[..., :rot_dim], x[..., rot_dim:]
+    # truncate to support variable sizes
+    rope_cache = rope_cache[:, :sq]
+    xshaped = x.reshape(b, np, sq, rot_dim // 2, 2)
+    rope_cache = rope_cache.view(-1, 1, sq, xshaped.size(3), 2)
+    x_out2 = torch.stack(
+        [
+            xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
+            xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1],
+        ],
+        -1,
+    )
+    x_out2 = x_out2.flatten(3)
+    return torch.cat((x_out2, x_pass), dim=-1)
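A small illustrative shape check for the rotary-embedding path (assumes the RotaryEmbedding and apply_rotary_pos_emb definitions above are in scope; the sizes mirror miniG's 128-dim heads with a 64-dim rotary section):

import torch

rotary = RotaryEmbedding(64, rope_ratio=1, original_impl=True, device="cpu", dtype=torch.float32)
cache = rotary(max_seq_len=8)                 # rope cache of shape [8, 32, 2]
q = torch.randn(2, 32, 8, 128)                # [batch, num_heads, seq_len, head_dim]
q_rot = apply_rotary_pos_emb(q, cache[None])  # rotates the first 64 channels, passes the rest through
print(q_rot.shape)                            # torch.Size([2, 32, 8, 128])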
+
+
+class RMSNorm(torch.nn.Module):
+    def __init__(self, normalized_shape, eps=1e-5, device=None, dtype=None, **kwargs):
+        super().__init__()
+        self.weight = torch.nn.Parameter(torch.empty(normalized_shape, device=device, dtype=dtype))
+        self.eps = eps
+
+    def forward(self, hidden_states: torch.Tensor):
+        input_dtype = hidden_states.dtype
+        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
+
+        return (self.weight * hidden_states).to(input_dtype)
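A quick numerical check of the RMSNorm above against the formula written out by hand (assumes the class is in scope; its weight is allocated with torch.empty, so it is initialized explicitly here):

import torch

norm = RMSNorm(8, eps=1e-5)
with torch.no_grad():
    norm.weight.fill_(1.0)  # uninitialized parameter, set to ones for the demo
x = torch.randn(2, 4, 8)
ref = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-5)
print(torch.allclose(norm(x), ref, atol=1e-6))  # True: same root-mean-square normalization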
+
+
+
+class CoreAttention(torch.nn.Module):
+    def __init__(self, config: ChatGLMConfig, layer_number):
+        super(CoreAttention, self).__init__()
+
+        self.apply_query_key_layer_scaling = config.apply_query_key_layer_scaling
+        self.attention_softmax_in_fp32 = config.attention_softmax_in_fp32
+        if self.apply_query_key_layer_scaling:
+            self.attention_softmax_in_fp32 = True
+        self.layer_number = max(1, layer_number)
+
+        projection_size = config.kv_channels * config.num_attention_heads
+
+        # Per attention head and per partition values.
+        self.hidden_size_per_partition = projection_size
+        self.hidden_size_per_attention_head = projection_size // config.num_attention_heads
+        self.num_attention_heads_per_partition = config.num_attention_heads
+
+        coeff = None
+        self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
+        if self.apply_query_key_layer_scaling:
+            coeff = self.layer_number
+            self.norm_factor *= coeff
+        self.coeff = coeff
+
+        self.attention_dropout = torch.nn.Dropout(config.attention_dropout)
+
+    def forward(self, query_layer, key_layer, value_layer, attention_mask):
+        pytorch_major_version = int(torch.__version__.split('.')[0])
+        if pytorch_major_version >= 2:
+            if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
+                context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
+                                                                                 is_causal=True)
+            else:
+                if attention_mask is not None:
+                    attention_mask = ~attention_mask
+                context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
+                                                                                 attention_mask)
+            context_layer = context_layer.transpose(1, 2).contiguous()
+            new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
+            context_layer = context_layer.reshape(*new_context_layer_shape)
+        else:
+            # Raw attention scores
+
+            # [b, np, sq, sk]
+            output_size = (query_layer.size(0), query_layer.size(1), query_layer.size(2), key_layer.size(2))
+
+            # [b, np, sq, hn] -> [b * np, sq, hn]
+            query_layer = query_layer.view(output_size[0] * output_size[1], output_size[2], -1)
+            # [b, np, sk, hn] -> [b * np, sk, hn]
+            key_layer = key_layer.view(output_size[0] * output_size[1], output_size[3], -1)
+
+            # preallocting input tensor: [b * np, sq, sk]
+            matmul_input_buffer = torch.empty(
+                output_size[0] * output_size[1], output_size[2], output_size[3], dtype=query_layer.dtype,
+                device=query_layer.device
+            )
+
+            # Raw attention scores. [b * np, sq, sk]
+            matmul_result = torch.baddbmm(
+                matmul_input_buffer,
+                query_layer,  # [b * np, sq, hn]
+                key_layer.transpose(1, 2),  # [b * np, hn, sk]
+                beta=0.0,
+                alpha=(1.0 / self.norm_factor),
+            )
+
+            # change view to [b, np, sq, sk]
+            attention_scores = matmul_result.view(*output_size)
+
+            # ===========================
+            # Attention probs and dropout
+            # ===========================
+
+            # attention scores and attention mask [b, np, sq, sk]
+            if self.attention_softmax_in_fp32:
+                attention_scores = attention_scores.float()
+            if self.coeff is not None:
+                attention_scores = attention_scores * self.coeff
+            if attention_mask is None and attention_scores.shape[2] == attention_scores.shape[3]:
+                attention_mask = torch.ones(output_size[0], 1, output_size[2], output_size[3],
+                                            device=attention_scores.device, dtype=torch.bool)
+                attention_mask.tril_()
+                attention_mask = ~attention_mask
+            if attention_mask is not None:
+                attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
+            attention_probs = F.softmax(attention_scores, dim=-1)
+            attention_probs = attention_probs.type_as(value_layer)
+
+            # This is actually dropping out entire tokens to attend to, which might
+            # seem a bit unusual, but is taken from the original Transformer paper.
+            attention_probs = self.attention_dropout(attention_probs)
+            # =========================
+            # Context layer. [sq, b, hp]
+            # =========================
+
+            # value_layer -> context layer.
+            # [sk, b, np, hn] --> [b, np, sq, hn]
+
+            # context layer shape: [b, np, sq, hn]
+            output_size = (value_layer.size(1), value_layer.size(2), query_layer.size(0), value_layer.size(3))
+            # change view [b * np, sk, hn]
+            value_layer = value_layer.view(output_size[0] * output_size[1], value_layer.size(2), -1)
+            # change view [b * np, sq, sk]
+            attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)
+            # matmul: [b * np, sq, hn]
+            context_layer = torch.bmm(attention_probs, value_layer)
+            # change view [b, np, sq, hn]
+            context_layer = context_layer.view(*output_size)
+            # [b, np, sq, hn] --> [b, sq, np, hn]
+            context_layer = context_layer.transpose(1, 2).contiguous()
+            # [b, sq, np, hn] --> [b, sq, hp]
+            new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
+            context_layer = context_layer.reshape(*new_context_layer_shape)
+
+        return context_layer
+
+class SdpaAttention(CoreAttention):
+    def forward(self, query_layer, key_layer, value_layer, attention_mask):
+        if attention_mask is None and query_layer.shape[2] == key_layer.shape[2]:
+            context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
+                                                                             is_causal=True,
+                                                                             dropout_p=self.config.attention_dropout if self.training else 0.0)
+        else:
+            if attention_mask is not None:
+                attention_mask = ~attention_mask
+            context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
+                                                                             attention_mask,
+                                                                             dropout_p=self.config.attention_dropout if self.training else 0.0)
+        context_layer = context_layer.transpose(1, 2).contiguous()
+        new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
+        context_layer = context_layer.reshape(*new_context_layer_shape)
+        return context_layer
+
+
+def _get_unpad_data(attention_mask):
+    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+    max_seqlen_in_batch = seqlens_in_batch.max().item()
+    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+    return (
+        indices,
+        cu_seqlens,
+        max_seqlen_in_batch,
+    )
+
+
+# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2
+class FlashAttention2(CoreAttention):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+
+    def forward(self, query_states, key_states, value_states, attention_mask):
+        query_states = query_states.transpose(1, 2)
+        key_states = key_states.transpose(1, 2)
+        value_states = value_states.transpose(1, 2)
+        batch_size, query_length = query_states.shape[:2]
+        if not self._flash_attn_uses_top_left_mask:
+            causal = self.is_causal
+        else:
+            # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
+            causal = self.is_causal and query_length != 1
+        dropout = self.config.attention_dropout if self.training else 0.0
+        # Contains at least one padding token in the sequence
+        if attention_mask is not None:
+            query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
+                query_states, key_states, value_states, attention_mask, query_length
+            )
+
+            cu_seqlens_q, cu_seqlens_k = cu_seq_lens
+            max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
+
+            attn_output_unpad = flash_attn_varlen_func(
+                query_states,
+                key_states,
+                value_states,
+                cu_seqlens_q=cu_seqlens_q,
+                cu_seqlens_k=cu_seqlens_k,
+                max_seqlen_q=max_seqlen_in_batch_q,
+                max_seqlen_k=max_seqlen_in_batch_k,
+                dropout_p=dropout,
+                softmax_scale=None,
+                causal=causal,
+            )
+
+            attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
+        else:
+            attn_output = flash_attn_func(
+                query_states, key_states, value_states, dropout, softmax_scale=None, causal=causal
+            )
+        attn_output = attn_output.reshape(batch_size, query_length, self.hidden_size_per_partition).contiguous()
+        return attn_output
+
+    def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
+        indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
+        batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
+
+        key_layer = index_first_axis(
+            key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+        )
+        value_layer = index_first_axis(
+            value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+        )
+        if query_length == kv_seq_len:
+            query_layer = index_first_axis(
+                query_layer.reshape(batch_size * kv_seq_len, self.num_attention_heads_per_partition, head_dim),
+                indices_k
+            )
+            cu_seqlens_q = cu_seqlens_k
+            max_seqlen_in_batch_q = max_seqlen_in_batch_k
+            indices_q = indices_k
+        elif query_length == 1:
+            max_seqlen_in_batch_q = 1
+            cu_seqlens_q = torch.arange(
+                batch_size + 1, dtype=torch.int32, device=query_layer.device
+            )  # There is a memcpy here, that is very bad.
+            indices_q = cu_seqlens_q[:-1]
+            query_layer = query_layer.squeeze(1)
+        else:
+            # The -q_len: slice assumes left padding.
+            attention_mask = attention_mask[:, -query_length:]
+            query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
+
+        return (
+            query_layer,
+            key_layer,
+            value_layer,
+            indices_q,
+            (cu_seqlens_q, cu_seqlens_k),
+            (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
+        )
+
+
+CORE_ATTENTION_CLASSES = {
+    "eager": CoreAttention,
+    "sdpa": SdpaAttention,
+    "flash_attention_2": FlashAttention2
+}
+
+class SelfAttention(torch.nn.Module):
+    """Parallel self-attention layer abstract class.
+
+    Self-attention layer takes input with size [s, b, h]
+    and returns output of the same size.
+    """
+
+    def __init__(self, config: ChatGLMConfig, layer_number, device=None):
+        super(SelfAttention, self).__init__()
+        self.layer_number = max(1, layer_number)
+
+        self.projection_size = config.kv_channels * config.num_attention_heads
+
+        # Per attention head and per partition values.
+        self.hidden_size_per_attention_head = self.projection_size // config.num_attention_heads
+        self.num_attention_heads_per_partition = config.num_attention_heads
+
+        self.multi_query_attention = config.multi_query_attention
+        self.qkv_hidden_size = 3 * self.projection_size
+        self.original_rope = config.original_rope
+        if self.multi_query_attention:
+            self.num_multi_query_groups_per_partition = config.multi_query_group_num
+            self.qkv_hidden_size = (
+                    self.projection_size + 2 * self.hidden_size_per_attention_head * config.multi_query_group_num
+            )
+        self.query_key_value = nn.Linear(config.hidden_size, self.qkv_hidden_size,
+                                         bias=config.add_bias_linear or config.add_qkv_bias,
+                                         device=device, **_config_to_kwargs(config)
+                                         )
+
+        self.core_attention = CoreAttention(config, self.layer_number)
+
+        # Output.
+        self.dense = nn.Linear(self.projection_size, config.hidden_size, bias=config.add_bias_linear,
+                               device=device, **_config_to_kwargs(config)
+                               )
+
+    def _allocate_memory(self, inference_max_sequence_len, batch_size, device=None, dtype=None):
+        if self.multi_query_attention:
+            num_attention_heads = self.num_multi_query_groups_per_partition
+        else:
+            num_attention_heads = self.num_attention_heads_per_partition
+        return torch.empty(
+            inference_max_sequence_len,
+            batch_size,
+            num_attention_heads,
+            self.hidden_size_per_attention_head,
+            dtype=dtype,
+            device=device,
+        )
+
+    def forward(
+            self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True
+    ):
+        # hidden_states: [b, sq, h]
+
+        # =================================================
+        # Pre-allocate memory for key-values for inference.
+        # =================================================
+        # =====================
+        # Query, Key, and Value
+        # =====================
+
+        # Attention heads [b, sq, h] --> [b, sq, (np * 3 * hn)]
+        mixed_x_layer = self.query_key_value(hidden_states)
+
+        if self.multi_query_attention:
+            (query_layer, key_layer, value_layer) = mixed_x_layer.split(
+                [
+                    self.num_attention_heads_per_partition * self.hidden_size_per_attention_head,
+                    self.num_multi_query_groups_per_partition * self.hidden_size_per_attention_head,
+                    self.num_multi_query_groups_per_partition * self.hidden_size_per_attention_head,
+                ],
+                dim=-1,
+            )
+            query_layer = query_layer.view(
+                query_layer.size()[:-1] + (self.num_attention_heads_per_partition, self.hidden_size_per_attention_head)
+            )
+            key_layer = key_layer.view(
+                key_layer.size()[:-1] + (self.num_multi_query_groups_per_partition, self.hidden_size_per_attention_head)
+            )
+            value_layer = value_layer.view(
+                value_layer.size()[:-1]
+                + (self.num_multi_query_groups_per_partition, self.hidden_size_per_attention_head)
+            )
+        else:
+            new_tensor_shape = mixed_x_layer.size()[:-1] + \
+                               (self.num_attention_heads_per_partition,
+                                3 * self.hidden_size_per_attention_head)
+            mixed_x_layer = mixed_x_layer.view(*new_tensor_shape)
+
+            # [b, sq, np, 3 * hn] --> 3 [b, sq, np, hn]
+            (query_layer, key_layer, value_layer) = split_tensor_along_last_dim(mixed_x_layer, 3)
+
+        # [b, sq, np, hn] -> [b, np, sq, hn]
+        query_layer, key_layer, value_layer = [k.transpose(1, 2) for k in [query_layer, key_layer, value_layer]]
+
+        # apply relative positional encoding (rotary embedding)
+        if rotary_pos_emb is not None:
+            query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
+            key_layer = apply_rotary_pos_emb(key_layer, rotary_pos_emb)
+
+        # adjust key and value for inference
+        if kv_cache is not None:
+            cache_k, cache_v = kv_cache
+            key_layer = torch.cat((cache_k, key_layer), dim=2)
+            value_layer = torch.cat((cache_v, value_layer), dim=2)
+        if use_cache:
+            kv_cache = (key_layer, value_layer)
+        else:
+            kv_cache = None
+
+        if self.multi_query_attention:
+            key_layer = key_layer.unsqueeze(2)
+            key_layer = key_layer.expand(
+                -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1, -1
+            )
+            key_layer = key_layer.contiguous().view(
+                key_layer.size()[:1] + (self.num_attention_heads_per_partition,) + key_layer.size()[3:]
+            )
+            value_layer = value_layer.unsqueeze(2)
+            value_layer = value_layer.expand(
+                -1, -1, self.num_attention_heads_per_partition // self.num_multi_query_groups_per_partition, -1, -1
+            )
+            value_layer = value_layer.contiguous().view(
+                value_layer.size()[:1] + (self.num_attention_heads_per_partition,) + value_layer.size()[3:]
+            )
+
+        # ==================================
+        # core attention computation
+        # ==================================
+
+        context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask)
+
+        # =================
+        # Output. [sq, b, h]
+        # =================
+
+        output = self.dense(context_layer)
+
+        return output, kv_cache
+
+
+def _config_to_kwargs(args):
+    common_kwargs = {
+        "dtype": args.torch_dtype,
+    }
+    return common_kwargs
+
+
+class MLP(torch.nn.Module):
+    """MLP.
+
+    MLP will take the input with h hidden state, project it to 4*h
+    hidden dimension, perform nonlinear transformation, and project the
+    state back into h hidden dimension.
+    """
+
+    def __init__(self, config: ChatGLMConfig, device=None):
+        super(MLP, self).__init__()
+
+        self.add_bias = config.add_bias_linear
+
+        # Project to 4h. If using swiglu double the output width, see https://arxiv.org/pdf/2002.05202.pdf
+        self.dense_h_to_4h = nn.Linear(
+            config.hidden_size,
+            config.ffn_hidden_size * 2,
+            bias=self.add_bias,
+            device=device,
+            **_config_to_kwargs(config)
+        )
+
+        def swiglu(x):
+            x = torch.chunk(x, 2, dim=-1)
+            return F.silu(x[0]) * x[1]
+
+        self.activation_func = swiglu
+
+        # Project back to h.
+        self.dense_4h_to_h = nn.Linear(
+            config.ffn_hidden_size,
+            config.hidden_size,
+            bias=self.add_bias,
+            device=device,
+            **_config_to_kwargs(config)
+        )
+
+    def forward(self, hidden_states):
+        # [s, b, 4hp]
+        intermediate_parallel = self.dense_h_to_4h(hidden_states)
+        intermediate_parallel = self.activation_func(intermediate_parallel)
+        # [s, b, h]
+        output = self.dense_4h_to_h(intermediate_parallel)
+        return output
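A small shape sketch of the SwiGLU MLP above (assumes MLP and ChatGLMConfig are in scope; the tiny sizes are illustrative): dense_h_to_4h projects to twice ffn_hidden_size so that swiglu can gate one half with the other.

import torch

cfg = ChatGLMConfig(hidden_size=64, ffn_hidden_size=128, torch_dtype=torch.float32)
mlp = MLP(cfg)
x = torch.randn(2, 5, 64)
print(mlp.dense_h_to_4h(x).shape)  # torch.Size([2, 5, 256]): two halves of 128 for the SwiGLU gate
print(mlp(x).shape)                # torch.Size([2, 5, 64]): projected back to hidden_size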
+
+
+class GLMBlock(torch.nn.Module):
+    """A single transformer layer.
+
+    Transformer layer takes input with size [s, b, h] and returns an
+    output of the same size.
+    """
+
+    def __init__(self, config: ChatGLMConfig, layer_number, device=None):
+        super(GLMBlock, self).__init__()
+        self.layer_number = layer_number
+
+        self.apply_residual_connection_post_layernorm = config.apply_residual_connection_post_layernorm
+
+        self.fp32_residual_connection = config.fp32_residual_connection
+
+        LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
+        # Layernorm on the input data.
+        self.input_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
+                                             dtype=config.torch_dtype)
+
+        # Self attention.
+        self.self_attention = SelfAttention(config, layer_number, device=device)
+        self.hidden_dropout = config.hidden_dropout
+
+        # Layernorm on the attention output
+        self.post_attention_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
+                                                      dtype=config.torch_dtype)
+
+        # MLP
+        self.mlp = MLP(config, device=device)
+
+    def forward(
+            self, hidden_states, attention_mask, rotary_pos_emb, kv_cache=None, use_cache=True,
+    ):
+        # hidden_states: [s, b, h]
+
+        # Layer norm at the beginning of the transformer layer.
+        layernorm_output = self.input_layernorm(hidden_states)
+        # Self attention.
+        attention_output, kv_cache = self.self_attention(
+            layernorm_output,
+            attention_mask,
+            rotary_pos_emb,
+            kv_cache=kv_cache,
+            use_cache=use_cache
+        )
+
+        # Residual connection.
+        if self.apply_residual_connection_post_layernorm:
+            residual = layernorm_output
+        else:
+            residual = hidden_states
+
+        layernorm_input = torch.nn.functional.dropout(attention_output, p=self.hidden_dropout, training=self.training)
+        layernorm_input = residual + layernorm_input
+
+        # Layer norm post the self attention.
+        layernorm_output = self.post_attention_layernorm(layernorm_input)
+
+        # MLP.
+        mlp_output = self.mlp(layernorm_output)
+
+        # Second residual connection.
+        if self.apply_residual_connection_post_layernorm:
+            residual = layernorm_output
+        else:
+            residual = layernorm_input
+
+        output = torch.nn.functional.dropout(mlp_output, p=self.hidden_dropout, training=self.training)
+        output = residual + output
+
+        return output, kv_cache
+
+
+class GLMTransformer(torch.nn.Module):
+    """Transformer class."""
+
+    def __init__(self, config: ChatGLMConfig, device=None):
+        super(GLMTransformer, self).__init__()
+
+        self.fp32_residual_connection = config.fp32_residual_connection
+        self.post_layer_norm = config.post_layer_norm
+
+        # Number of layers.
+        self.num_layers = config.num_layers
+
+        # Transformer layers.
+        def build_layer(layer_number):
+            return GLMBlock(config, layer_number, device=device)
+
+        self.layers = torch.nn.ModuleList([build_layer(i + 1) for i in range(self.num_layers)])
+
+        if self.post_layer_norm:
+            LayerNormFunc = RMSNorm if config.rmsnorm else LayerNorm
+            # Final layer norm before output.
+            self.final_layernorm = LayerNormFunc(config.hidden_size, eps=config.layernorm_epsilon, device=device,
+                                                 dtype=config.torch_dtype)
+
+        self.gradient_checkpointing = False
+
+    def _get_layer(self, layer_number):
+        return self.layers[layer_number]
+
+    def forward(
+            self, hidden_states, attention_mask, rotary_pos_emb, kv_caches=None,
+            use_cache: Optional[bool] = True,
+            output_hidden_states: Optional[bool] = False,
+    ):
+        if not kv_caches:
+            kv_caches = [None for _ in range(self.num_layers)]
+        presents = () if use_cache else None
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
+        all_self_attentions = None
+        all_hidden_states = () if output_hidden_states else None
+        for index in range(self.num_layers):
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+
+            layer = self._get_layer(index)
+            if self.gradient_checkpointing and self.training:
+                layer_ret = torch.utils.checkpoint.checkpoint(
+                    layer,
+                    hidden_states,
+                    attention_mask,
+                    rotary_pos_emb,
+                    kv_caches[index],
+                    use_cache,
+                    use_reentrant=False
+                )
+            else:
+                layer_ret = layer(
+                    hidden_states,
+                    attention_mask,
+                    rotary_pos_emb,
+                    kv_cache=kv_caches[index],
+                    use_cache=use_cache
+                )
+            hidden_states, kv_cache = layer_ret
+            if use_cache:
+                presents = presents + (kv_cache,)
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        # Final layer norm.
+        if self.post_layer_norm:
+            hidden_states = self.final_layernorm(hidden_states)
+
+        return hidden_states, presents, all_hidden_states, all_self_attentions
+
+
+class ChatGLMPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and
+    a simple interface for downloading and loading pretrained models.
+    """
+
+    is_parallelizable = False
+    supports_gradient_checkpointing = True
+    config_class = ChatGLMConfig
+    base_model_prefix = "transformer"
+    _no_split_modules = ["GLMBlock"]
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+
+    def _init_weights(self, module: nn.Module):
+        """Initialize the weights."""
+        return
+
+    def get_masks(self, input_embeds, past_key_values, padding_mask=None):
+        batch_size, seq_length, embed_size = input_embeds.shape
+        full_attention_mask = torch.ones(batch_size, seq_length, seq_length, device=input_embeds.device)
+        full_attention_mask.tril_()
+        past_length = 0
+        if past_key_values:
+            past_length = past_key_values[0][0].shape[2]
+        if past_length:
+            full_attention_mask = torch.cat((torch.ones(batch_size, seq_length, past_length,
+                                                        device=input_embeds.device), full_attention_mask), dim=-1)
+        if padding_mask is not None:
+            full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
+        if not past_length and padding_mask is not None:
+            full_attention_mask -= padding_mask.unsqueeze(-1) - 1
+        full_attention_mask = (full_attention_mask < 0.5).bool()
+        full_attention_mask.unsqueeze_(1)
+        return full_attention_mask
+
+    def get_position_ids(self, input_ids, device):
+        batch_size, seq_length = input_ids.shape
+        position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1)
+        return position_ids
+
+    def get_multimodal_position_ids(self, input_ids, device):
+        batch_size, seq_length = input_ids.shape
+        position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).repeat(batch_size, 1)
+
+class Embedding(torch.nn.Module):
+    """Language model embeddings."""
+
+    def __init__(self, config: ChatGLMConfig, device=None):
+        super(Embedding, self).__init__()
+
+        self.hidden_size = config.hidden_size
+        # Word embeddings (parallel).
+        self.word_embeddings = nn.Embedding(
+            config.padded_vocab_size,
+            self.hidden_size,
+            dtype=config.torch_dtype,
|
863 |
+
device=device
|
864 |
+
)
|
865 |
+
self.fp32_residual_connection = config.fp32_residual_connection
|
866 |
+
|
867 |
+
def forward(self, input_ids):
|
868 |
+
# Embeddings.
|
869 |
+
words_embeddings = self.word_embeddings(input_ids)
|
870 |
+
embeddings = words_embeddings
|
871 |
+
# If the input flag for fp32 residual connection is set, convert for float.
|
872 |
+
if self.fp32_residual_connection:
|
873 |
+
embeddings = embeddings.float()
|
874 |
+
return embeddings
|
875 |
+
|
876 |
+
|
877 |
+
def is_empty(images_list: Optional[List[List[torch.Tensor]]]):
|
878 |
+
if images_list is None or len(images_list) == 0:
|
879 |
+
return True
|
880 |
+
for image_list in images_list:
|
881 |
+
if image_list is not None:
|
882 |
+
return False
|
883 |
+
return True
|
884 |
+
|
885 |
+
|
886 |
+
class ChatGLMModel(ChatGLMPreTrainedModel):
|
887 |
+
def __init__(self, config: ChatGLMConfig, device=None, empty_init=True):
|
888 |
+
super().__init__(config)
|
889 |
+
if empty_init:
|
890 |
+
init_method = skip_init
|
891 |
+
else:
|
892 |
+
init_method = default_init
|
893 |
+
init_kwargs = {}
|
894 |
+
if device is not None:
|
895 |
+
init_kwargs["device"] = device
|
896 |
+
self.embedding = init_method(Embedding, config, **init_kwargs)
|
897 |
+
self.num_layers = config.num_layers
|
898 |
+
self.multi_query_group_num = config.multi_query_group_num
|
899 |
+
self.kv_channels = config.kv_channels
|
900 |
+
|
901 |
+
# Rotary positional embeddings
|
902 |
+
self.seq_length = config.seq_length
|
903 |
+
rotary_dim = (
|
904 |
+
config.hidden_size // config.num_attention_heads if config.kv_channels is None else config.kv_channels
|
905 |
+
)
|
906 |
+
|
907 |
+
self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, rope_ratio=config.rope_ratio,
|
908 |
+
original_impl=config.original_rope,
|
909 |
+
device=device, dtype=config.torch_dtype)
|
910 |
+
self.encoder = init_method(GLMTransformer, config, **init_kwargs)
|
911 |
+
self.output_layer = init_method(nn.Linear, config.hidden_size, config.padded_vocab_size, bias=False,
|
912 |
+
dtype=config.torch_dtype, **init_kwargs)
|
913 |
+
self.pre_seq_len = config.pre_seq_len
|
914 |
+
self.prefix_projection = config.prefix_projection
|
915 |
+
if self.pre_seq_len is not None:
|
916 |
+
for param in self.parameters():
|
917 |
+
param.requires_grad = False
|
918 |
+
self.prefix_tokens = torch.arange(self.pre_seq_len).long()
|
919 |
+
self.prefix_encoder = PrefixEncoder(config)
|
920 |
+
self.dropout = torch.nn.Dropout(0.1)
|
921 |
+
|
922 |
+
self.vision = EVA2CLIPModel(config)
|
923 |
+
|
924 |
+
def get_input_embeddings(self):
|
925 |
+
return self.embedding.word_embeddings
|
926 |
+
|
927 |
+
def set_input_embeddings(self, value):
|
928 |
+
self.embedding.word_embeddings = value
|
929 |
+
|
930 |
+
def get_prompt(self, batch_size, device, dtype=torch.half):
|
931 |
+
prefix_tokens = self.prefix_tokens.unsqueeze(0).expand(batch_size, -1).to(device)
|
932 |
+
past_key_values = self.prefix_encoder(prefix_tokens).type(dtype)
|
933 |
+
past_key_values = past_key_values.view(
|
934 |
+
batch_size,
|
935 |
+
self.pre_seq_len,
|
936 |
+
self.pre_seq_len,
|
937 |
+
self.num_layers * 2,
|
938 |
+
self.multi_query_group_num,
|
939 |
+
self.kv_channels
|
940 |
+
)
|
941 |
+
# seq_len, b, nh, hidden_size
|
942 |
+
past_key_values = self.dropout(past_key_values)
|
943 |
+
past_key_values = past_key_values.permute([2, 1, 0, 3, 4]).split(2)
|
944 |
+
return past_key_values
|
945 |
+
|
946 |
+
def forward(
|
947 |
+
self,
|
948 |
+
input_ids: torch.LongTensor = None,
|
949 |
+
images: torch.Tensor = None,
|
950 |
+
position_ids: Optional[torch.Tensor] = None,
|
951 |
+
attention_mask: Optional[torch.BoolTensor] = None,
|
952 |
+
full_attention_mask: Optional[torch.BoolTensor] = None,
|
953 |
+
past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
|
954 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
955 |
+
use_cache: Optional[bool] = None,
|
956 |
+
output_hidden_states: Optional[bool] = None,
|
957 |
+
return_dict: Optional[bool] = None,
|
958 |
+
) -> Union[Tuple, BaseModelOutputWithPast]:
|
959 |
+
"""take care of image_encode, position_ids and (attention_mask = None is fine)"""
|
960 |
+
|
961 |
+
# generate mode with past_key_values. the image features are already mapped
|
962 |
+
if past_key_values is None:
|
963 |
+
# not allow for inputs_embeds, because we want to process image feature
|
964 |
+
assert input_ids is not None and inputs_embeds is None, f"{input_ids} {inputs_embeds}"
|
965 |
+
if not is_empty(images): # multi-modality
|
966 |
+
image_size: int = self.config.vision_config['image_size']
|
967 |
+
patch_size: int = self.config.vision_config['patch_size']
|
968 |
+
num_patches = (image_size // patch_size // 2) ** 2
|
969 |
+
assert len(input_ids) == len(images), f"{len(input_ids)} {len(images)}"
|
970 |
+
inputs_embeds = self.embedding(input_ids)
|
971 |
+
|
972 |
+
images = images.to(dtype=inputs_embeds.dtype)
|
973 |
+
images_features = self.vision(images)
|
974 |
+
|
975 |
+
if position_ids is None:
|
976 |
+
position_ids = self.get_position_ids(input_ids, device=inputs_embeds.device)
|
977 |
+
new_input_embeds, new_position_ids = [], []
|
978 |
+
|
979 |
+
for i in range(len(input_ids)):
|
980 |
+
input_id = input_ids[i].tolist()
|
981 |
+
boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(
|
982 |
+
self.config.eoi_token_id)
|
983 |
+
assert eoi_token_pos - boi_token_pos == 2
|
984 |
+
new_input_embeds.append(torch.cat(
|
985 |
+
(inputs_embeds[i, :boi_token_pos], images_features[i].to(inputs_embeds.device),
|
986 |
+
inputs_embeds[i, eoi_token_pos + 1:])))
|
987 |
+
new_position_ids.append(torch.cat(
|
988 |
+
(position_ids[i, :boi_token_pos + 1], position_ids[i, boi_token_pos + 1].repeat(num_patches),
|
989 |
+
position_ids[i, eoi_token_pos:])
|
990 |
+
))
|
991 |
+
inputs_embeds = torch.stack(new_input_embeds, dim=0)
|
992 |
+
position_ids = torch.stack(new_position_ids, dim=0)
|
993 |
+
|
994 |
+
output_hidden_states = (
|
995 |
+
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
996 |
+
)
|
997 |
+
use_cache = use_cache if use_cache is not None else self.config.use_cache
|
998 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
999 |
+
|
1000 |
+
batch_size, seq_length = input_ids.shape
|
1001 |
+
|
1002 |
+
if inputs_embeds is None:
|
1003 |
+
inputs_embeds = self.embedding(input_ids)
|
1004 |
+
|
1005 |
+
if self.pre_seq_len is not None:
|
1006 |
+
if past_key_values is None:
|
1007 |
+
past_key_values = self.get_prompt(batch_size=batch_size, device=input_ids.device,
|
1008 |
+
dtype=inputs_embeds.dtype)
|
1009 |
+
if attention_mask is not None:
|
1010 |
+
attention_mask = torch.cat([attention_mask.new_ones((batch_size, self.pre_seq_len)),
|
1011 |
+
attention_mask], dim=-1)
|
1012 |
+
|
1013 |
+
if full_attention_mask is None:
|
1014 |
+
if (attention_mask is not None and not attention_mask.all()) or (past_key_values and seq_length != 1):
|
1015 |
+
if self.training:
|
1016 |
+
# https://github.com/THUDM/GLM-4/issues/264
|
1017 |
+
new_input_ids, new_attention_mask = [], []
|
1018 |
+
for i in range(len(input_ids)):
|
1019 |
+
input_id = input_ids[i].tolist()
|
1020 |
+
boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(self.config.eoi_token_id)
|
1021 |
+
assert eoi_token_pos - boi_token_pos == 2
|
1022 |
+
|
1023 |
+
new_attention_mask.append(torch.cat(
|
1024 |
+
(attention_mask[i, :boi_token_pos + 1], torch.ones(num_patches).to(attention_mask.device),
|
1025 |
+
attention_mask[i, eoi_token_pos:])))
|
1026 |
+
|
1027 |
+
new_input_ids.append(torch.cat(
|
1028 |
+
(input_ids[i, :boi_token_pos + 1], input_ids[i, -1].repeat(num_patches),
|
1029 |
+
input_ids[i, eoi_token_pos:])))
|
1030 |
+
|
1031 |
+
attention_mask = torch.stack(new_attention_mask, dim=0)
|
1032 |
+
input_ids = torch.stack(new_input_ids, dim=0)
|
1033 |
+
inputs_embeds = self.embedding(input_ids)
|
1034 |
+
|
1035 |
+
full_attention_mask = self.get_masks(inputs_embeds, past_key_values, padding_mask=attention_mask)
|
1036 |
+
|
1037 |
+
# Rotary positional embeddings
|
1038 |
+
rotary_pos_emb = self.rotary_pos_emb(self.seq_length)
|
1039 |
+
|
1040 |
+
if position_ids is not None:
|
1041 |
+
rotary_pos_emb = rotary_pos_emb[position_ids]
|
1042 |
+
else:
|
1043 |
+
rotary_pos_emb = rotary_pos_emb[None, :seq_length]
|
1044 |
+
|
1045 |
+
# Run encoder.
|
1046 |
+
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
|
1047 |
+
inputs_embeds, full_attention_mask, rotary_pos_emb=rotary_pos_emb,
|
1048 |
+
kv_caches=past_key_values, use_cache=use_cache, output_hidden_states=output_hidden_states
|
1049 |
+
)
|
1050 |
+
|
1051 |
+
if not return_dict:
|
1052 |
+
return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
|
1053 |
+
|
1054 |
+
return BaseModelOutputWithPast(
|
1055 |
+
last_hidden_state=hidden_states,
|
1056 |
+
past_key_values=presents,
|
1057 |
+
hidden_states=all_hidden_states,
|
1058 |
+
attentions=all_self_attentions,
|
1059 |
+
)
|
1060 |
+
|
1061 |
+
|
1062 |
+
def _history_to_prompt(history, query):
|
1063 |
+
prompt = ''
|
1064 |
+
flag = False
|
1065 |
+
for i, (old_query, response) in enumerate(history):
|
1066 |
+
prompt += ('<|user|>' if flag else '') + old_query + "<|assistant|>" + response + "<|endoftext|>"
|
1067 |
+
flag = True
|
1068 |
+
prompt += '{}{}<|assistant|>'.format('<|user|>' if flag else '', query)
|
1069 |
+
return prompt
|
1070 |
+
|
1071 |
+
|
1072 |
+
class ChatGLMForConditionalGeneration(ChatGLMPreTrainedModel):
|
1073 |
+
def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
|
1074 |
+
super().__init__(config)
|
1075 |
+
|
1076 |
+
self.max_sequence_length = config.max_length
|
1077 |
+
self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
|
1078 |
+
self.config = config
|
1079 |
+
|
1080 |
+
def _update_model_kwargs_for_generation(
|
1081 |
+
self,
|
1082 |
+
outputs: ModelOutput,
|
1083 |
+
model_kwargs: Dict[str, Any],
|
1084 |
+
is_encoder_decoder: bool = False,
|
1085 |
+
) -> Dict[str, Any]:
|
1086 |
+
# update past_key_values
|
1087 |
+
cache_name, cache = self._extract_past_from_model_output(outputs)
|
1088 |
+
model_kwargs[cache_name] = cache
|
1089 |
+
|
1090 |
+
# update attention mask
|
1091 |
+
if "attention_mask" in model_kwargs:
|
1092 |
+
attention_mask = model_kwargs["attention_mask"]
|
1093 |
+
model_kwargs["attention_mask"] = torch.cat(
|
1094 |
+
[attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
|
1095 |
+
)
|
1096 |
+
|
1097 |
+
# update position ids
|
1098 |
+
if "position_ids" in model_kwargs:
|
1099 |
+
position_ids = model_kwargs["position_ids"]
|
1100 |
+
new_position_id = position_ids[..., -1:].clone()
|
1101 |
+
new_position_id += 1
|
1102 |
+
model_kwargs["position_ids"] = torch.cat(
|
1103 |
+
[position_ids, new_position_id], dim=-1
|
1104 |
+
)
|
1105 |
+
|
1106 |
+
model_kwargs["is_first_forward"] = False
|
1107 |
+
return model_kwargs
|
1108 |
+
|
1109 |
+
def prepare_inputs_for_generation(
|
1110 |
+
self,
|
1111 |
+
input_ids: torch.LongTensor,
|
1112 |
+
images: Optional[torch.Tensor] = None,
|
1113 |
+
past_key_values: Optional[torch.Tensor] = None,
|
1114 |
+
attention_mask: Optional[torch.Tensor] = None,
|
1115 |
+
position_ids: Optional[torch.Tensor] = None,
|
1116 |
+
use_cache: Optional[bool] = None,
|
1117 |
+
is_first_forward: bool = True,
|
1118 |
+
**kwargs
|
1119 |
+
) -> dict:
|
1120 |
+
# only last token for input_ids if past is not None
|
1121 |
+
if position_ids is None:
|
1122 |
+
position_ids = self.get_position_ids(input_ids, device=input_ids.device)
|
1123 |
+
if attention_mask is not None:
|
1124 |
+
image_size: int = self.config.vision_config['image_size']
|
1125 |
+
patch_size: int = self.config.vision_config['patch_size']
|
1126 |
+
num_patches = (image_size // patch_size // 2) ** 2
|
1127 |
+
new_attention_masks = []
|
1128 |
+
|
1129 |
+
# if not image, use this default id
|
1130 |
+
eoi_token_pos = 6
|
1131 |
+
boi_token_pos = 4
|
1132 |
+
|
1133 |
+
for i in range(len(input_ids)):
|
1134 |
+
input_id = input_ids[i].tolist()
|
1135 |
+
if not is_empty(images):
|
1136 |
+
boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(
|
1137 |
+
self.config.eoi_token_id)
|
1138 |
+
assert eoi_token_pos - boi_token_pos == 2
|
1139 |
+
new_attention_masks.append(torch.cat(
|
1140 |
+
(attention_mask[i, :boi_token_pos + 1], attention_mask.new_ones(num_patches),
|
1141 |
+
attention_mask[i, eoi_token_pos:])
|
1142 |
+
))
|
1143 |
+
attention_mask = torch.stack(new_attention_masks, dim=0)
|
1144 |
+
if not is_first_forward:
|
1145 |
+
if past_key_values is not None:
|
1146 |
+
position_ids = position_ids[..., -1:]
|
1147 |
+
input_ids = input_ids[:, -1:]
|
1148 |
+
return {
|
1149 |
+
"input_ids": input_ids,
|
1150 |
+
"images": images,
|
1151 |
+
"past_key_values": past_key_values,
|
1152 |
+
"position_ids": position_ids,
|
1153 |
+
"attention_mask": attention_mask,
|
1154 |
+
"return_last_logit": True,
|
1155 |
+
"use_cache": use_cache
|
1156 |
+
}
|
1157 |
+
|
1158 |
+
def forward(
|
1159 |
+
self,
|
1160 |
+
input_ids: Optional[torch.Tensor] = None,
|
1161 |
+
images: List[List[torch.Tensor]] = None,
|
1162 |
+
position_ids: Optional[torch.Tensor] = None,
|
1163 |
+
attention_mask: Optional[torch.Tensor] = None,
|
1164 |
+
past_key_values: Optional[Tuple[torch.FloatTensor]] = None,
|
1165 |
+
inputs_embeds: Optional[torch.Tensor] = None,
|
1166 |
+
labels: Optional[torch.Tensor] = None,
|
1167 |
+
use_cache: Optional[bool] = None,
|
1168 |
+
output_attentions: Optional[bool] = None,
|
1169 |
+
output_hidden_states: Optional[bool] = None,
|
1170 |
+
return_dict: Optional[bool] = None,
|
1171 |
+
return_last_logit: Optional[bool] = False,
|
1172 |
+
):
|
1173 |
+
use_cache = use_cache if use_cache is not None else self.config.use_cache
|
1174 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
1175 |
+
|
1176 |
+
transformer_outputs = self.transformer(
|
1177 |
+
input_ids=input_ids,
|
1178 |
+
images=images,
|
1179 |
+
position_ids=position_ids,
|
1180 |
+
attention_mask=attention_mask,
|
1181 |
+
past_key_values=past_key_values,
|
1182 |
+
inputs_embeds=inputs_embeds,
|
1183 |
+
use_cache=use_cache,
|
1184 |
+
output_hidden_states=output_hidden_states,
|
1185 |
+
return_dict=return_dict,
|
1186 |
+
)
|
1187 |
+
|
1188 |
+
hidden_states = transformer_outputs[0]
|
1189 |
+
if return_last_logit:
|
1190 |
+
hidden_states = hidden_states[:, -1:]
|
1191 |
+
lm_logits = self.transformer.output_layer(hidden_states)
|
1192 |
+
|
1193 |
+
loss = None
|
1194 |
+
if labels is not None:
|
1195 |
+
new_labels = []
|
1196 |
+
for i in range(len(input_ids)):
|
1197 |
+
input_id = input_ids[i].tolist()
|
1198 |
+
boi_token_pos, eoi_token_pos = input_id.index(self.config.boi_token_id), input_id.index(
|
1199 |
+
self.config.eoi_token_id)
|
1200 |
+
assert eoi_token_pos - boi_token_pos == 2
|
1201 |
+
|
1202 |
+
new_labels.append(torch.cat(
|
1203 |
+
(
|
1204 |
+
labels[i, :boi_token_pos + 1],
|
1205 |
+
torch.tensor([-100]).to(labels.device).to(labels.dtype).repeat(1600),
|
1206 |
+
labels[i, eoi_token_pos:])))
|
1207 |
+
|
1208 |
+
labels = torch.stack(new_labels, dim=0)
|
1209 |
+
lm_logits = lm_logits.to(torch.float32)
|
1210 |
+
shift_logits = lm_logits[..., :-1, :].contiguous()
|
1211 |
+
shift_labels = labels[..., 1:].contiguous()
|
1212 |
+
loss_fct = CrossEntropyLoss(ignore_index=-100)
|
1213 |
+
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
|
1214 |
+
|
1215 |
+
lm_logits = lm_logits.to(hidden_states.dtype)
|
1216 |
+
loss = loss.to(hidden_states.dtype)
|
1217 |
+
|
1218 |
+
if not return_dict:
|
1219 |
+
output = (lm_logits,) + transformer_outputs[1:]
|
1220 |
+
return ((loss,) + output) if loss is not None else output
|
1221 |
+
|
1222 |
+
return CausalLMOutputWithPast(
|
1223 |
+
loss=loss,
|
1224 |
+
logits=lm_logits,
|
1225 |
+
past_key_values=transformer_outputs.past_key_values,
|
1226 |
+
hidden_states=transformer_outputs.hidden_states,
|
1227 |
+
attentions=transformer_outputs.attentions,
|
1228 |
+
)
|
1229 |
+
|
1230 |
+
@staticmethod
|
1231 |
+
def _reorder_cache(
|
1232 |
+
past: Tuple[Tuple[torch.Tensor, torch.Tensor], ...], beam_idx: torch.LongTensor
|
1233 |
+
) -> Tuple[Tuple[torch.Tensor, torch.Tensor], ...]:
|
1234 |
+
"""
|
1235 |
+
This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
|
1236 |
+
[`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
|
1237 |
+
beam_idx at every generation step.
|
1238 |
+
|
1239 |
+
Output shares the same memory storage as `past`.
|
1240 |
+
"""
|
1241 |
+
return tuple(
|
1242 |
+
(
|
1243 |
+
layer_past[0].index_select(0, beam_idx.to(layer_past[0].device)),
|
1244 |
+
layer_past[1].index_select(0, beam_idx.to(layer_past[1].device)),
|
1245 |
+
)
|
1246 |
+
for layer_past in past
|
1247 |
+
)
|
1248 |
+
|
1249 |
+
class ChatGLMForSequenceClassification(ChatGLMPreTrainedModel):
|
1250 |
+
def __init__(self, config: ChatGLMConfig, empty_init=True, device=None):
|
1251 |
+
super().__init__(config)
|
1252 |
+
|
1253 |
+
self.num_labels = config.num_labels
|
1254 |
+
self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
|
1255 |
+
|
1256 |
+
self.classifier_head = nn.Linear(config.hidden_size, config.num_labels, bias=True, dtype=torch.half)
|
1257 |
+
if config.classifier_dropout is not None:
|
1258 |
+
self.dropout = nn.Dropout(config.classifier_dropout)
|
1259 |
+
else:
|
1260 |
+
self.dropout = None
|
1261 |
+
self.config = config
|
1262 |
+
|
1263 |
+
def forward(
|
1264 |
+
self,
|
1265 |
+
input_ids: Optional[torch.LongTensor] = None,
|
1266 |
+
position_ids: Optional[torch.LongTensor] = None,
|
1267 |
+
attention_mask: Optional[torch.Tensor] = None,
|
1268 |
+
full_attention_mask: Optional[torch.Tensor] = None,
|
1269 |
+
past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
|
1270 |
+
inputs_embeds: Optional[torch.LongTensor] = None,
|
1271 |
+
labels: Optional[torch.LongTensor] = None,
|
1272 |
+
use_cache: Optional[bool] = None,
|
1273 |
+
output_hidden_states: Optional[bool] = None,
|
1274 |
+
return_dict: Optional[bool] = None,
|
1275 |
+
) -> Union[Tuple[torch.Tensor, ...], SequenceClassifierOutputWithPast]:
|
1276 |
+
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
1277 |
+
|
1278 |
+
transformer_outputs = self.transformer(
|
1279 |
+
input_ids=input_ids,
|
1280 |
+
position_ids=position_ids,
|
1281 |
+
attention_mask=attention_mask,
|
1282 |
+
full_attention_mask=full_attention_mask,
|
1283 |
+
past_key_values=past_key_values,
|
1284 |
+
inputs_embeds=inputs_embeds,
|
1285 |
+
use_cache=use_cache,
|
1286 |
+
output_hidden_states=output_hidden_states,
|
1287 |
+
return_dict=return_dict,
|
1288 |
+
)
|
1289 |
+
|
1290 |
+
hidden_states = transformer_outputs[0]
|
1291 |
+
pooled_hidden_states = hidden_states[-1]
|
1292 |
+
if self.dropout is not None:
|
1293 |
+
pooled_hidden_states = self.dropout(pooled_hidden_states)
|
1294 |
+
logits = self.classifier_head(pooled_hidden_states)
|
1295 |
+
|
1296 |
+
loss = None
|
1297 |
+
if labels is not None:
|
1298 |
+
if self.config.problem_type is None:
|
1299 |
+
if self.num_labels == 1:
|
1300 |
+
self.config.problem_type = "regression"
|
1301 |
+
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
|
1302 |
+
self.config.problem_type = "single_label_classification"
|
1303 |
+
else:
|
1304 |
+
self.config.problem_type = "multi_label_classification"
|
1305 |
+
|
1306 |
+
if self.config.problem_type == "regression":
|
1307 |
+
loss_fct = MSELoss()
|
1308 |
+
if self.num_labels == 1:
|
1309 |
+
loss = loss_fct(logits.squeeze().float(), labels.squeeze())
|
1310 |
+
else:
|
1311 |
+
loss = loss_fct(logits.float(), labels)
|
1312 |
+
elif self.config.problem_type == "single_label_classification":
|
1313 |
+
loss_fct = CrossEntropyLoss()
|
1314 |
+
loss = loss_fct(logits.view(-1, self.num_labels).float(), labels.view(-1))
|
1315 |
+
elif self.config.problem_type == "multi_label_classification":
|
1316 |
+
loss_fct = BCEWithLogitsLoss()
|
1317 |
+
loss = loss_fct(logits.float(), labels.view(-1, self.num_labels))
|
1318 |
+
|
1319 |
+
if not return_dict:
|
1320 |
+
output = (logits,) + transformer_outputs[1:]
|
1321 |
+
return ((loss,) + output) if loss is not None else output
|
1322 |
+
|
1323 |
+
return SequenceClassifierOutputWithPast(
|
1324 |
+
loss=loss,
|
1325 |
+
logits=logits,
|
1326 |
+
past_key_values=transformer_outputs.past_key_values,
|
1327 |
+
hidden_states=transformer_outputs.hidden_states,
|
1328 |
+
attentions=transformer_outputs.attentions,
|
1329 |
+
)
|
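The image bookkeeping above hinges on the number of vision tokens, `num_patches = (image_size // patch_size // 2) ** 2`, and on the hard-coded `repeat(1600)` used when masking image positions out of the labels. A minimal sketch of that arithmetic, assuming `image_size=1120` (matching `tokenizer_config.json` below) and `patch_size=14` (an assumption about this commit's `config.json`, not shown here):

```python
# Sketch only: reproduces the num_patches arithmetic used in ChatGLMModel.forward.
image_size = 1120   # see tokenizer_config.json in this commit
patch_size = 14     # assumed vision patch size
num_patches = (image_size // patch_size // 2) ** 2  # ViT patches, then a stride-2 conv in visual.py
print(num_patches)  # 1600 -> matches the repeat(1600) label masking above
```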
tokenization_chatglm.py
ADDED
@@ -0,0 +1,361 @@
import regex as re
import base64
import os
import json
import tiktoken
import torch
from torch import TensorType
from typing import List, Optional, Union, Dict, Any
from torchvision import transforms
from transformers import PreTrainedTokenizer
from transformers.utils import logging, PaddingStrategy
from transformers.tokenization_utils_base import EncodedInput, BatchEncoding


class ChatGLM4Tokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}
    model_input_names = ["input_ids", "attention_mask", "position_ids"]

    def __init__(
            self,
            vocab_file,
            padding_side="left",
            clean_up_tokenization_spaces=False,
            encode_special_tokens=False,
            image_size=None,
            **kwargs
    ):
        self.name = "GLM4Tokenizer"
        self.vocab_file = vocab_file
        pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        self.pat_str = re.compile(pat_str)
        self.encode_special_tokens = encode_special_tokens
        self.image_size = image_size

        mergeable_ranks = {}
        with open(vocab_file) as f:
            for line in f:
                token, rank = line.strip().split()
                rank = int(rank)
                token = base64.b64decode(token)
                mergeable_ranks[token] = rank

        self.mergeable_ranks = mergeable_ranks

        self.tokenizer = tiktoken.Encoding(
            name="my_tokenizer",
            pat_str=pat_str,
            mergeable_ranks=mergeable_ranks,
            special_tokens={}
        )
        self.decoder = {rank: token for token, rank in mergeable_ranks.items()}
        self.n_words = len(self.decoder)

        super().__init__(
            padding_side=padding_side,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs
        )

    @property
    def vocab_size(self):
        return self.n_words

    def get_vocab(self):
        """ Returns vocab as a dict """
        vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def convert_tokens_to_string(self, tokens: List[Union[bytes, str, int]]) -> str:
        """
        Converts a sequence of tokens into a single string.
        """
        text = ""
        temp = b""
        for t in tokens:
            if isinstance(t, int):
                t = chr(t)
            if isinstance(t, str):
                if temp:
                    text += temp.decode("utf-8", errors="replace")
            elif isinstance(t, bytes):
                temp += t
            else:
                raise TypeError("token should only be of type int, bytes or str")
        if temp:
            text += temp.decode("utf-8", errors="replace")
        return text

    def _tokenize(self, text, **kwargs):
        tokens = []
        ids = self.tokenizer.encode(text)
        for t in ids:
            tokens.append(self.decoder[t])
        return tokens

    def _convert_token_to_id(self, token):
        """ Converts a token (str) in an id using the vocab. """
        return self.mergeable_ranks[token]

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.decoder.get(index, "")

    def save_vocabulary(self, save_directory, filename_prefix=None):
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.
            filename_prefix (`str`, *optional*):
                An optional prefix to add to the names of the saved files.

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, self.vocab_files_names["vocab_file"]
            )
        else:
            vocab_file = save_directory

        with open(self.vocab_file, 'rb') as fin:
            proto_str = fin.read()

        with open(vocab_file, "wb") as writer:
            writer.write(proto_str)

        return (vocab_file,)

    def get_prefix_tokens(self):
        prefix_tokens = [self.convert_tokens_to_ids("[gMASK]"), self.convert_tokens_to_ids("<sop>")]
        return prefix_tokens

    def build_single_message(self, role, metadata, message, tokenize=True, message_prefix=None):
        assert role in ["system", "user", "assistant", "observation"], role
        if tokenize:
            role_tokens = [self.convert_tokens_to_ids(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n",
                                                                                              disallowed_special=())
            message_tokens = self.tokenizer.encode(message, disallowed_special=())
            if message_prefix is not None:
                message_tokens = message_prefix + message_tokens
            tokens = role_tokens + message_tokens
            return tokens
        else:
            return str(f"<|{role}|>{metadata}\n{message}")

    def apply_chat_template(
            self,
            conversation: Union[List[Dict[str, str]], List[List[Dict[str, str]]], "Conversation"],
            add_generation_prompt: bool = False,
            tokenize: bool = True,
            padding: bool = False,
            truncation: bool = False,
            max_length: Optional[int] = None,
            return_tensors: Optional[Union[str, TensorType]] = None,
            return_dict: bool = False,
            tokenizer_kwargs: Optional[Dict[str, Any]] = None,
            add_special_tokens: bool = True,
            **kwargs,
    ) -> Union[str, List[int], List[str], List[List[int]], BatchEncoding]:

        if return_dict and not tokenize:
            raise ValueError(
                "`return_dict=True` is incompatible with `tokenize=False`, because there is no dict "
                "of tokenizer outputs to return."
            )

        def handle_single_conversation(conversation):
            input_ids = self.get_prefix_tokens() if add_special_tokens else []
            input_message = "[gMASK]<sop>" if add_special_tokens else ""
            input_image = None
            transform = transforms.Compose(
                [
                    transforms.Resize(
                        (self.image_size, self.image_size), interpolation=transforms.InterpolationMode.BICUBIC
                    ),
                    transforms.ToTensor(),
                    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
                ]
            )
            for item in conversation:
                if item.get("tools"):
                    tools = item["tools"]
                    content = "你是一个名为 GLM-4 的人工智能助手。你是基于智谱AI训练的语言模型 GLM-4 模型开发的,你的任务是针对用户的问题和要求提供适当的答复和支持。"
                    for tool in tools:
                        if tool["type"] == "function":
                            function = tool["function"]
                            content += f"\n\n## {function['name']}\n\n{json.dumps(function, ensure_ascii=False, indent=4)}"
                            content += "\n在调用上述函数时,请使用 Json 格式表示调用的参数。"
                        elif tool["type"] == "python":
                            content += "\n\n## python\n\n当你向 `python` 发送包含 Python 代码的消息时,该代码将会在一个有状态的 Jupyter notebook 环境中执行。\n`python` 返回代码执行的输出,或在执行 60 秒后返回超时。\n`/mnt/data` 将会持久化存储你的文件。在此会话中,`python` 无法访问互联网。不要使用 `python` 进行任何网络请求或者在线 API 调用,这些在线内容的访问将不会成功。"
                        elif tool["type"] == "simple_browser":
                            content += "\n\n## simple_browser\n\n你可以使用 `simple_browser` 工具。该工具支持以下函数:\n`search(query: str, recency_days: int)`:使用搜索引擎进行查询并显示结果,可以使用 `recency_days` 参数控制搜索内容的时效性。\n`mclick(ids: list[int])`:获取一系列指定 id 的页面内容。每次调用时,须选择3-10个页面。选择多个角度的页面,同时尽可能选择可信任的信息来源。考虑到部分页面是无法加载的,你也可以多打开一些可能有用的页面而不用担心内容过多。\n`open_url(url: str)`:打开指定的 URL。\n\n使用 `【{引用 id}†{引用文本}】` 来引用内容。\n\n操作步骤:1. 使用 `search` 来获得信息列表; 2. 使用 `mclick` 来获取指定 ID 页面的内容; 3. 根据获得的内容进行回复。在回复中应当引用信息来源。\n 如果用户提供了 URL,也可以用 `open_url` 直接打开页面。\n如果初次搜索结果没有找到合适的信息,也可以再次使用 `search` 进行搜索。"
                        elif tool["type"] == "cogview":
                            content += "\n\n## cogview\n\n如果用户的请求中包含了对图像的描述,你可以使用 `cogview` 来生成图像并展示给用户。你需要向 `cogview` 发送图像描述,规则:\n- 发送给 `cogview` 的消息必须使用英语。用户的中文描述必须完全翻译为英语。\n- 应当尽可能详细地描述图像生成的需求,需求描述约 100 英文单词。\n- 保持用户原始描述的意图。不要虚构内容或者没见过的人物。\n- 如无特殊说明,所在地为中国,持有中国立场并遵循中国社会主义价值观。"
                        else:
                            raise NotImplementedError(f"Unknown tool type {tool['type']}")
                    input = self.build_single_message("system", "", content, tokenize=tokenize)
                    if tokenize:
                        input_ids.extend(input)
                    else:
                        input_message += input
                message = ""
                message_prefix = None
                if item.get("image"):
                    assert input_image is None, "Multiple images are not supported"
                    input_image = transform(item["image"])
                    message_prefix = self.convert_tokens_to_ids(
                        ["<|begin_of_image|>", "<|endoftext|>", "<|end_of_image|>"])
                if item.get("content"):
                    message += item["content"]
                if message or message_prefix:
                    input = self.build_single_message(
                        item["role"],
                        item.get("metadata", ""),
                        message,
                        tokenize=tokenize,
                        message_prefix=message_prefix
                    )
                    if tokenize:
                        input_ids.extend(input)
                    else:
                        input_message += input
            if add_generation_prompt:
                if tokenize:
                    input_ids.extend([self.convert_tokens_to_ids("<|assistant|>")])
                else:
                    input_message += "<|assistant|>"
            return {"input": input_ids if tokenize else input_message, "image": input_image}

        # Main logic to handle different conversation formats
        if isinstance(conversation, list) and all(isinstance(i, dict) for i in conversation):
            result = handle_single_conversation(conversation)
            input_ids = result["input"]
            input_images = [result["image"]]
        elif isinstance(conversation, list) and all(isinstance(i, list) for i in conversation):
            results = [handle_single_conversation(c) for c in conversation]
            input_ids = [item["input"] for item in results]
            input_images = [item["image"] for item in results]
        elif hasattr(conversation, "messages"):
            result = handle_single_conversation(conversation.messages)
            input_ids = result["input"]
            input_images = [result["image"]]
        else:
            raise ValueError("Invalid conversation format")

        if tokenize:
            output = self.batch_encode_plus(
                [input_ids] if isinstance(input_ids[0], int) else input_ids,
                padding=padding,
                truncation=truncation,
                max_length=max_length,
                return_tensors=return_tensors,
                is_split_into_words=True,
                add_special_tokens=False
            )
            if return_dict:
                found_image = False
                for image in input_images:
                    if image is not None:
                        found_image = True
                        break
                if found_image:
                    output["images"] = torch.stack(input_images)
                return output
            else:
                return output["input_ids"]
        else:
            return input_ids

    def build_inputs_with_special_tokens(
            self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BERT sequence has the following format:

        - single sequence: `[CLS] X [SEP]`
        - pair of sequences: `[CLS] A [SEP] B [SEP]`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        prefix_tokens = self.get_prefix_tokens()
        token_ids_0 = prefix_tokens + token_ids_0
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [self.convert_tokens_to_ids("<eos>")]
        return token_ids_0

    def _pad(
            self,
            encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
            max_length: Optional[int] = None,
            padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
            pad_to_multiple_of: Optional[int] = None,
            return_attention_mask: Optional[bool] = None,
    ) -> dict:
        """
        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

        Args:
            encoded_inputs:
                Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
            max_length: maximum length of the returned list and optionally padding length (see below).
                Will truncate by taking into account the special tokens.
            padding_strategy: PaddingStrategy to use for padding.

                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
                - PaddingStrategy.DO_NOT_PAD: Do not pad
                The tokenizer padding sides are defined in self.padding_side:

                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask:
                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
        # Load from model defaults
        assert self.padding_side == "left"

        required_input = encoded_inputs[self.model_input_names[0]]
        seq_length = len(required_input)

        if padding_strategy == PaddingStrategy.LONGEST:
            max_length = len(required_input)

        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

        needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length

        # Initialize attention mask if not present.
        if "attention_mask" not in encoded_inputs:
            encoded_inputs["attention_mask"] = [1] * seq_length

        if "position_ids" not in encoded_inputs:
            encoded_inputs["position_ids"] = list(range(seq_length))

        if needs_to_be_padded:
            difference = max_length - len(required_input)

            if "attention_mask" in encoded_inputs:
                encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
            if "position_ids" in encoded_inputs:
                encoded_inputs["position_ids"] = [0] * difference + encoded_inputs["position_ids"]
            encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input

        return encoded_inputs
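A minimal usage sketch for the tokenizer and model added in this commit, assuming they are loaded with `trust_remote_code=True` and that `config.json` maps `AutoModelForCausalLM` to `ChatGLMForConditionalGeneration` as in the upstream GLM-4V repositories; the repo id, dtype, image path, and generation settings below are illustrative assumptions:

```python
# Sketch only: one image+text turn through ChatGLM4Tokenizer.apply_chat_template and generate().
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "CausalLM/miniG"  # or a local path to this checkout
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto"
)

image = Image.open("example.jpg").convert("RGB")
# return_dict=True makes apply_chat_template also return the stacked "images" tensor.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": "Describe this picture."}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt", return_dict=True,
).to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```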
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5a493598071550244b2ee7f26118f3edec2150b9dfa967929a99052ac83fe716
size 2623634
tokenizer_config.json
ADDED
@@ -0,0 +1,134 @@
{
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_chatglm.ChatGLM4Tokenizer",
      null
    ]
  },
  "added_tokens_decoder": {
    "151329": {"content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151330": {"content": "[MASK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151331": {"content": "[gMASK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151332": {"content": "[sMASK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151333": {"content": "<sop>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151334": {"content": "<eop>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151335": {"content": "<|system|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151336": {"content": "<|user|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151337": {"content": "<|assistant|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151338": {"content": "<|observation|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151339": {"content": "<|begin_of_image|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151340": {"content": "<|end_of_image|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151341": {"content": "<|begin_of_video|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151342": {"content": "<|end_of_video|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}
  },
  "additional_special_tokens": ["<|endoftext|>", "[MASK]", "[gMASK]", "[sMASK]", "<sop>", "<eop>", "<|system|>",
                                "<|user|>", "<|assistant|>", "<|observation|>", "<|begin_of_image|>", "<|end_of_image|>",
                                "<|begin_of_video|>", "<|end_of_video|>"],
  "clean_up_tokenization_spaces": false,
  "do_lower_case": false,
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "model_max_length": 8192,
  "padding_side": "left",
  "remove_space": false,
  "tokenizer_class": "ChatGLM4Tokenizer",
  "image_size": 1120
}
visual.py
ADDED
@@ -0,0 +1,180 @@
import torch
from torch import nn
from argparse import Namespace
import torch.nn.functional as F
from transformers.activations import ACT2FN
import math
from torch.nn import LayerNorm


def standard_attention(query_layer, key_layer, value_layer, scaling_attention_score=True):
    if scaling_attention_score:
        query_layer = query_layer / math.sqrt(query_layer.shape[-1])
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

    attention_probs = F.softmax(attention_scores, dim=-1)

    context_layer = torch.matmul(attention_probs, value_layer)
    return context_layer


def attention_fn_default(query_layer, key_layer, value_layer, scaling_attention_score=True):
    if int(torch.__version__.split('.')[0]) >= 2 and scaling_attention_score:
        # Pytorch 2.0 attention uses very much memory if attention_mask is float, and has NaN bug if attention_mask is None.
        attn_output = torch.nn.functional.scaled_dot_product_attention(
            query_layer, key_layer, value_layer,
            attn_mask=None,
            dropout_p=0.,
            is_causal=False
        )
        return attn_output
    else:
        return standard_attention(
            query_layer, key_layer, value_layer, scaling_attention_score=scaling_attention_score
        )


class PatchEmbedding(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.proj = nn.Conv2d(config.in_channels, config.hidden_size, kernel_size=config.patch_size,
                              stride=config.patch_size)
        self.cls_embedding = nn.Parameter(torch.zeros(1, config.hidden_size))
        self.position_embedding = nn.Embedding(config.num_positions, config.hidden_size)

    def forward(self, images: "tensor(B, C, H, W)") -> "tensor(B, L, D)":
        x = self.proj(images)
        x = x.flatten(2).transpose(1, 2)
        cls_token = self.cls_embedding.expand(x.shape[0], -1, -1)
        x = torch.cat((cls_token, x), dim=1)
        x += self.position_embedding.weight.unsqueeze(0)
        return x


class Attention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_heads
        head_dim = config.hidden_size // config.num_heads
        self.scale = head_dim ** -0.5
        self.query_key_value = nn.Linear(config.hidden_size, config.hidden_size * 3)
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.output_dropout = torch.nn.Dropout(config.dropout_prob)

    def forward(self, x: "tensor(B, L, D)") -> "tensor(B, L, D)":
        B, L, _ = x.shape
        qkv = self.query_key_value(x)
        qkv = qkv.reshape(B, L, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)  # 3, B, H, L, D
        q, k, v = qkv[0], qkv[1], qkv[2]

        out = attention_fn_default(
            q, k, v
        )
        output = self.dense(out.transpose(1, 2).reshape(B, L, -1))
        output = self.output_dropout(output)
        return output

    def attention(self, q, k, v):
        attn_weights = torch.matmul(q * self.scale, k.transpose(-2, -1))
        attn_weights = attn_weights.softmax(dim=-1)
        output = torch.matmul(attn_weights, v)
        return output


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.activation_fn = ACT2FN[config.hidden_act]
        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.activation_fn(x)
        x = self.fc2(x)
        return x


class TransformerLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.input_layernorm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.attention = Attention(config)
        self.mlp = MLP(config)
        self.post_attention_layernorm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, hidden_states):
        attention_input = hidden_states
        attention_output = self.input_layernorm(self.attention(attention_input))
        hidden_states = attention_input + attention_output
        mlp_input = hidden_states

        # https://github.com/THUDM/GLM-4/issues/350
        mlp_output = self.post_attention_layernorm(self.mlp(mlp_input)).to(mlp_input.device)
        output = mlp_input + mlp_output
        return output


class Transformer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layers = nn.ModuleList([TransformerLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(self, hidden_states):
        for layer_module in self.layers:
            hidden_states = layer_module(hidden_states)
        return hidden_states


class GLU(nn.Module):
    def __init__(self, config, in_features):
        super().__init__()
        self.linear_proj = nn.Linear(in_features, config.hidden_size, bias=False)
        self.norm1 = nn.LayerNorm(config.hidden_size)
        self.act1 = nn.GELU()
        self.act2 = nn.functional.silu
        self.dense_h_to_4h = nn.Linear(config.hidden_size, config.ffn_hidden_size, bias=False)
        self.gate_proj = nn.Linear(config.hidden_size, config.ffn_hidden_size, bias=False)
        self.dense_4h_to_h = nn.Linear(config.ffn_hidden_size, config.hidden_size, bias=False)

    def forward(self, x):
        x = self.linear_proj(x)
        x = self.act1(self.norm1(x))
        x = self.act2(self.gate_proj(x)) * self.dense_h_to_4h(x)
        x = self.dense_4h_to_h(x)
        return x


class EVA2CLIPModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        vision_config = Namespace(**config.vision_config)
        self.patch_embedding = PatchEmbedding(vision_config)
        self.transformer = Transformer(vision_config)
        self.linear_proj = GLU(config, in_features=config.hidden_size)
        self.conv = nn.Conv2d(in_channels=vision_config.hidden_size, out_channels=config.hidden_size, kernel_size=2,
                              stride=2)
        self.boi = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
        self.eoi = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
        self.scaling_factor = vision_config.scaling_factor

    def forward(self, images: "tensor(B, C, H, W)") -> "tensor(B, L, D)":
        x = self.patch_embedding(images)
        x = self.transformer(x)
        x = x[:, 1:]

        b, s, h = x.shape
        grid_size = int(s ** 0.5)
        x = x.view(b, grid_size, grid_size, h).permute(0, 3, 1, 2)
        x = self.conv(x)

        x = x.flatten(2).transpose(1, 2)
        x = self.linear_proj(x)

        # https://github.com/THUDM/GLM-4/issues/350
        boi = self.boi.expand(x.shape[0], -1, -1).to(x.device)
        eoi = self.eoi.expand(x.shape[0], -1, -1).to(x.device)
        x = torch.cat((boi, x, eoi), dim=1)
        x = x / self.scaling_factor
        return x
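A small smoke test for the visual tower, showing how `EVA2CLIPModel` turns an image into `(H / patch_size / 2)**2` tokens plus the `boi`/`eoi` markers. Every size below is a made-up toy value, not the configuration shipped in this commit's `config.json`:

```python
# Sketch only: shape check for EVA2CLIPModel with a tiny, assumed config.
import torch
from argparse import Namespace
from visual import EVA2CLIPModel  # this commit's visual.py on the import path

vision_config = dict(
    in_channels=3, patch_size=14, num_positions=(56 // 14) ** 2 + 1,  # cls token + 4x4 patches
    hidden_size=32, num_heads=4, num_hidden_layers=2, intermediate_size=64,
    hidden_act="gelu", dropout_prob=0.0, layer_norm_eps=1e-6, scaling_factor=8.0,
)
config = Namespace(vision_config=vision_config, hidden_size=48, ffn_hidden_size=96)

model = EVA2CLIPModel(config).eval()
images = torch.randn(1, 3, 56, 56)  # B, C, H, W with H == W == 4 * patch_size
with torch.no_grad():
    out = model(images)
# (H / patch_size / 2)**2 patch tokens plus <boi> and <eoi>: (4 / 2)**2 + 2 = 6
print(out.shape)  # torch.Size([1, 6, 48])
```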