---
language:
- zh
tags:
- chatglm
- pytorch
- Text-Generation
license: apache-2.0
widget:
- text: |-
对下面中文拼写纠错:
少先队员因该为老人让坐。
答:
base_model: THUDM/chatglm3-6b
pipeline_tag: text-generation
library_name: peft
inference: false
---
# Chinese Spelling Correction LoRA Model
A LoRA model for Chinese text error correction, fine-tuned from ChatGLM3-6B.
Performance of `shibing624/chatglm3-6b-csc-chinese-lora` on the CSC **test** set:
|input_text|pred|
|:--- |:--- |
|对下面文本纠错:少先队员因该为老人让坐。|少先队员应该为老人让座。|
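On the CSC test set, the generated corrections are highly accurate. Because the model is based on [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b), the results are often a pleasant surprise: beyond correcting errors, it can also polish and rewrite sentences.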
## Usage
This model is open-sourced as part of the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports both the native ChatGLM model and LoRA fine-tuned models. Call it as follows:
Install package:
```shell
pip install -U pycorrector
```
```python
from pycorrector import GptCorrector
model = GptCorrector("THUDM/chatglm3-6b", "chatglm", peft_name="shibing624/chatglm3-6b-csc-chinese-lora")
r = model.correct_batch(["少先队员因该为老人让坐。"])
print(r) # ['少先队员应该为老人让座。']
```
## Usage (HuggingFace Transformers)
Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model like this:
Wrap the input in the prompt template, pass it through the model, then decode the generated sentence.
Install packages (the example below also imports `peft`):
```shell
pip install transformers peft
```
```python
import os

import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModel

# Work around duplicate OpenMP runtime errors on some platforms.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load the base model in fp16 on the GPU, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = PeftModel.from_pretrained(model, "shibing624/chatglm3-6b-csc-chinese-lora")

sents = ['对下面文本纠错\n\n少先队员因该为老人让坐。',
         '对下面文本纠错\n\n下个星期,我跟我朋唷打算去法国玩儿。']


def get_prompt(user_query):
    """Wrap the query in the vicuna prompt template."""
    vicuna_prompt = "A chat between a curious user and an artificial intelligence assistant. " \
                    "The assistant gives helpful, detailed, and polite answers to the user's questions. " \
                    "USER: {query} ASSISTANT:"
    return vicuna_prompt.format(query=user_query)


for s in sents:
    q = get_prompt(s)
    input_ids = tokenizer(q).input_ids
    generation_kwargs = dict(max_new_tokens=128, do_sample=True, temperature=0.8)
    outputs = model.generate(input_ids=torch.as_tensor([input_ids]).to('cuda:0'), **generation_kwargs)
    # Drop the prompt tokens so only the newly generated tokens are decoded.
    output_tensor = outputs[0][len(input_ids):]
    response = tokenizer.decode(output_tensor, skip_special_tokens=True)
    print(response)
```
output:
```shell
少先队员应该为老人让座。
下个星期,我跟我朋友打算去法国玩儿。
```
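The loop in the example above does two things that are easy to overlook: it wraps each query in the vicuna template, and it slices the prompt tokens off the front of the generated sequence before decoding. The following is a minimal, model-free sketch of just those two steps. The helper names `build_prompt` and `strip_prompt_tokens` are hypothetical (they are not part of pycorrector or transformers), and the token ids are made-up placeholders.

```python
# Vicuna-style template, copied from the example above.
VICUNA_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: {query} ASSISTANT:"
)


def build_prompt(text, instruction="对下面文本纠错"):
    """Hypothetical helper: prepend the correction instruction, then apply the template."""
    return VICUNA_TEMPLATE.format(query=f"{instruction}\n\n{text}")


def strip_prompt_tokens(output_ids, prompt_len):
    """Hypothetical helper: keep only the tokens generated after the prompt."""
    return output_ids[prompt_len:]


prompt = build_prompt("少先队员因该为老人让坐。")
print(prompt)

# Placeholder ids: the first three stand for the prompt, the last two for the answer.
fake_output_ids = [11, 12, 13, 21, 22]
print(strip_prompt_tokens(fake_output_ids, prompt_len=3))  # [21, 22]
```

Keeping the instruction as a parameter makes it easy to try other phrasings, although the model was presumably fine-tuned on specific instructions, so other wordings may behave differently.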
Model files:
```
chatglm3-6b-csc-chinese-lora
├── adapter_config.json
└── adapter_model.bin
```
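If you download the adapter to a local directory, a quick check that both files from the listing above are present can save a confusing load error later. This is only a sketch: `missing_adapter_files` is a hypothetical helper, and the expected file names are taken from the listing above (other adapter releases may ship different file names).

```python
import tempfile
from pathlib import Path

# File names taken from the listing above.
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.bin")


def missing_adapter_files(adapter_dir):
    """Hypothetical helper: return the expected adapter files not found in adapter_dir."""
    root = Path(adapter_dir)
    return [name for name in EXPECTED_ADAPTER_FILES if not (root / name).is_file()]


# Demonstration on a temporary directory that contains only the config file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text("{}")
    print(missing_adapter_files(tmp))  # ['adapter_model.bin']
```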
#### Training parameters:
![loss](train_loss.png)
- num_epochs: 5
- per_device_train_batch_size: 6
- learning_rate: 2e-05
- best steps: 25100
- train_loss: 0.0834
- lr_scheduler_type: linear
- base model: THUDM/chatglm3-6b
- warmup_steps: 50
- save_strategy: steps
- save_steps: 500
- save_total_limit: 10
- bf16: false
- fp16: true
- optim: adamw_torch
- ddp_find_unused_parameters: false
- gradient_checkpointing: true
- max_seq_length: 512
- max_length: 512
- prompt_template_name: vicuna
- hardware: 6 × V100 32GB, 48 hours of training
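For readers who want to reuse these settings, the list above can be collected into a plain dictionary keyed by Hugging Face `TrainingArguments` parameter names. This is a sketch under stated assumptions, not the author's training script: `num_epochs` is renamed to `num_train_epochs` to match `TrainingArguments`, and the gradient accumulation value is an assumption because the list does not state it.

```python
# Values copied from the list above; keys follow `TrainingArguments` naming.
training_kwargs = {
    "num_train_epochs": 5,  # listed above as num_epochs
    "per_device_train_batch_size": 6,
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 50,
    "save_strategy": "steps",
    "save_steps": 500,
    "save_total_limit": 10,
    "bf16": False,
    "fp16": True,
    "optim": "adamw_torch",
    "ddp_find_unused_parameters": False,
    "gradient_checkpointing": True,
}

NUM_GPUS = 6  # from the hardware line: six V100 32GB cards
GRADIENT_ACCUMULATION_STEPS = 1  # ASSUMPTION: not stated in the list above

# Number of samples contributing to each optimizer step across all GPUs.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * NUM_GPUS
    * GRADIENT_ACCUMULATION_STEPS
)
print(effective_batch_size)  # 36
```

With transformers installed, the dictionary could be passed as `TrainingArguments(output_dir="outputs", **training_kwargs)`, where `"outputs"` is a placeholder path. If the actual run used gradient accumulation greater than 1, the effective batch size scales accordingly.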
### Training datasets
The training set comprises the following data:
- Chinese spelling correction dataset: https://huggingface.co/datasets/shibing624/CSC
- Chinese grammar correction dataset: https://github.com/shibing624/pycorrector/tree/llm/examples/data/grammar
- General GPT-4 Q&A dataset: https://huggingface.co/datasets/shibing624/sharegpt_gpt4
To train your own text correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector).
## Citation
```latex
@software{pycorrector,
author = {Ming Xu},
title = {pycorrector: Text Error Correction Tool},
year = {2023},
url = {https://github.com/shibing624/pycorrector},
}
```