File size: 9,201 Bytes
1d56ab5 f4231fe 1d56ab5 f4231fe 28d8d00 f4231fe 67b23de f4231fe 67b23de f4231fe 67b23de 829a3db 67b23de 829a3db 67b23de |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 |
---
language:
- zh
- en
pipeline_tag: text-generation
inference: false
---
# Baichuan-7B-Instruction
![](./alpachino.png)
<!-- Provide a quick summary of what the model is/does. -->
## 介绍
Baichuan-7B-Instruction 为 Baichuan-7B 系列模型进行指令微调后的版本,预训练模型可见 [Baichuan-7B](https://huggingface.co/baichuan-inc/Baichuan-7B)。
## Demo
如下是一个使用 gradio 的模型 demo
```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("AlpachinoNLP/Baichuan-7B-Instruction",trust_remote_code=True,use_fast=False)
model = AutoModelForCausalLM.from_pretrained("AlpachinoNLP/Baichuan-7B-Instruction",trust_remote_code=True ).half()
model.cuda()
def generate(histories, max_new_tokens=2048, do_sample = True, top_p = 0.95, temperature = 0.35, repetition_penalty=1.1):
prompt = ""
for history in histories:
history_with_identity = "\nHuman:" + history[0] + "\n\nAssistant:" + history[1]
prompt += history_with_identity
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(
input_ids = input_ids,
max_new_tokens=max_new_tokens,
early_stopping=True,
do_sample=do_sample,
top_p=top_p,
temperature=temperature,
repetition_penalty=repetition_penalty,
)
rets = tokenizer.batch_decode(outputs, skip_special_tokens=True)
generate_text = rets[0].replace(prompt, "")
return generate_text
with gr.Blocks() as demo:
chatbot = gr.Chatbot()
msg = gr.Textbox()
clear = gr.Button("clear")
def user(user_message, history):
return "", history + [[user_message, ""]]
def bot(history):
print(history)
bot_message = generate(history)
history[-1][1] = bot_message
return history
msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(
bot, chatbot, chatbot
)
clear.click(lambda: None, None, chatbot, queue=False)
if __name__ == "__main__":
demo.launch(server_name="0.0.0.0")
```
## 量化部署
Baichuan-7B 支持 int8 和 int4 量化,用户只需在推理代码中简单修改两行即可实现。请注意,如果是为了节省显存而进行量化,应加载原始精度模型到 CPU 后再开始量化;避免在 `from_pretrained` 时添加 `device_map='auto'` 或者其它会导致把原始精度模型直接加载到 GPU 的行为的参数。
使用 int8 量化 (To use int8 quantization):
```python
model = AutoModelForCausalLM.from_pretrained("AlpachinoNLP/Baichuan-7B-Instruction", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(8).cuda()
```
同样的,如需使用 int4 量化 (Similarly, to use int4 quantization):
```python
model = AutoModelForCausalLM.from_pretrained("AlpachinoNLP/Baichuan-7B-Instruction", torch_dtype=torch.float16, trust_remote_code=True)
model = model.quantize(4).cuda()
```
## 训练详情
数据集:https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k。
硬件:8*A40
## 测评结果
## [CMMLU](https://github.com/haonan-li/CMMLU)
| Model 5-shot | STEM | Humanities | Social Sciences | Others | China Specific | Average |
| ---------------------------------------------------------- | :-------: | :--------: | :-------------: | :------: | :------------: | :------: |
| Baichuan-7B | 34.4 | 47.5 | 47.6 | 46.6 | 44.3 | 44.0 |
| Vicuna-13B | 31.8 | 36.2 | 37.6 | 39.5 | 34.3 | 36.3 |
| Chinese-Alpaca-Plus-13B | 29.8 | 33.4 | 33.2 | 37.9 | 32.1 | 33.4 |
| Chinese-LLaMA-Plus-13B | 28.1 | 33.1 | 35.4 | 35.1 | 33.5 | 33.0 |
| Ziya-LLaMA-13B-Pretrain | 29.0 | 30.7 | 33.8 | 34.4 | 31.9 | 32.1 |
| LLaMA-13B | 29.2 | 30.8 | 31.6 | 33.0 | 30.5 | 31.2 |
| moss-moon-003-base (16B) | 27.2 | 30.4 | 28.8 | 32.6 | 28.7 | 29.6 |
| Baichuan-13B-Base | 41.7 | 61.1 | 59.8 | 59.0 | 56.4 | 55.3 |
| Baichuan-13B-Chat | 42.8 | 62.6 | 59.7 | 59.0 | 56.1 | 55.8 |
| Baichuan-13B-Instruction | 44.50 | 61.16 | 59.07 | 58.34 | 55.55 | 55.61 |
| **Baichuan-7B-Instruction** | **34.68** | **47.38** | **47.13** | **45.11** | **44.51** | **43.57** |
| Model zero-shot | STEM | Humanities | Social Sciences | Others | China Specific | Average |
| ------------------------------------------------------------ | :-------: | :--------: | :-------------: | :-------: | :------------: | :-------: |
| [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b) | 41.28 | 52.85 | 53.37 | 52.24 | 50.58 | 49.95 |
| [Baichuan-7B](https://github.com/baichuan-inc/baichuan-7B) | 32.79 | 44.43 | 46.78 | 44.79 | 43.11 | 42.33 |
| [ChatGLM-6B](https://github.com/THUDM/GLM-130B) | 32.22 | 42.91 | 44.81 | 42.60 | 41.93 | 40.79 |
| [BatGPT-15B](https://arxiv.org/abs/2307.00360) | 33.72 | 36.53 | 38.07 | 46.94 | 38.32 | 38.51 |
| [Chinese-LLaMA-7B](https://github.com/ymcui/Chinese-LLaMA-Alpaca) | 26.76 | 26.57 | 27.42 | 28.33 | 26.73 | 27.34 |
| [MOSS-SFT-16B](https://github.com/OpenLMLab/MOSS) | 25.68 | 26.35 | 27.21 | 27.92 | 26.70 | 26.88 |
| [Chinese-GLM-10B](https://github.com/THUDM/GLM) | 25.57 | 25.01 | 26.33 | 25.94 | 25.81 | 25.80 |
| [Baichuan-13B](https://github.com/baichuan-inc/Baichuan-7B) | 42.04 | 60.49 | 59.55 | 56.60 | 55.72 | 54.63 |
| [Baichuan-13B-Chat](https://github.com/baichuan-inc/Baichuan-7B) | 37.32 | 56.24 | 54.79 | 54.07 | 52.23 | 50.48 |
| Baichuan-13B-Instruction | 42.56 | 62.09 | 60.41 | 58.97 | 56.95 | 55.88 |
| **Baichuan-7B-Instruction** | **33.94** | **46.31** | **47.73** | **45.84** | **44.88** | **43.53** |
> 说明:CMMLU 是一个综合性的中文评估基准,专门用于评估语言模型在中文语境下的知识和推理能力。我们直接使用其官方的[评测脚本](https://github.com/haonan-li/CMMLU)对模型进行评测。Model zero-shot 表格中 [Baichuan-13B-Chat](https://github.com/baichuan-inc/Baichuan-13B) 的得分来自我们直接运行 CMMLU 官方的评测脚本得到,其他模型的的得分来自于 [CMMLU](https://github.com/haonan-li/CMMLU/tree/master) 官方的评测结果.
### 英文能力评测
除了中文榜单的测试,我们同样测试了模型在英文榜单 MMLU 上的能力。
#### MMLU
[MMLU](https://arxiv.org/abs/2009.03300) 是一个包含了57种任务的英文评测数据集。
我们采用了开源的[评测方案]((https://github.com/hendrycks/test)) , 评测结果如下:
| Model | Humanities | Social Sciences | STEM | Other | Average |
|----------------------------------------|-----------:|:---------------:|:----:|:-----:|:-------:|
| LLaMA-7B<sup>2</sup> | 34.0 | 38.3 | 30.5 | 38.1 | 35.1 |
| Falcon-7B<sup>1</sup> | - | - | - | - | 35.0 |
| mpt-7B<sup>1</sup> | - | - | - | - | 35.6 |
| ChatGLM-6B<sup>0</sup> | 35.4 | 41.0 | 31.3 | 40.5 | 36.9 |
| BLOOM 7B<sup>0</sup> | 25.0 | 24.4 | 26.5 | 26.4 | 25.5 |
| BLOOMZ 7B<sup>0</sup> | 31.3 | 42.1 | 34.4 | 39.0 | 36.1 |
| moss-moon-003-base (16B)<sup>0</sup> | 24.2 | 22.8 | 22.4 | 24.4 | 23.6 |
| moss-moon-003-sft (16B)<sup>0</sup> | 30.5 | 33.8 | 29.3 | 34.4 | 31.9 |
| Baichuan-7B<sup>0</sup> | 38.4 | 48.9 | 35.6 | 48.1 | 42.3 |
| **Baichuan-7B-Instruction(5-shot)** | **38.9** | **49.0** | **35.3** | **48.8** | **42.6** |
| **Baichuan-7B-Instruction(0-shot)** | **38.7** | **47.9** | **34.5** | **48.2** | **42.0** |
|