---
language:
- en
license: apache-2.0
library_name: transformers
---
# Model Card for Mistral-7B-ReMax-v0.1
The Mistral-7B-ReMax-v0.1 Large Language Model (LLM) is a Reinforcement Learning from Human Feedback (RLHF) fine-tuned version of [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
The fine-tuning algorithm is ReMax; algorithm details can be found in the [paper](https://arxiv.org/abs/2310.10505).
## Model Details
- Developed by: Ziniu Li
- License: apache-2.0
- Finetuned from [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
## Uses
### Direct Use
The instruction format is the same as that of Mistral-7B-Instruct-v0.2: the prompt must be surrounded by \[INST\] and \[/INST\] tokens.
```
text = "<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"
```
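For reference, here is a minimal generation sketch with `transformers`. It assumes the model is published under the repository id `ziniuli/Mistral-7B-ReMax-v0.1` (an assumption; adjust to the actual repo id), and relies on `apply_chat_template` to produce the `[INST] ... [/INST]` wrapping shown above.
```python
# Minimal generation sketch; the repo id below is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ziniuli/Mistral-7B-ReMax-v0.1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is your favourite condiment?"}]
# apply_chat_template adds the [INST] ... [/INST] wrapping automatically.
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256,
                            do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
The sampling settings (`temperature=0.7`, `top_p=0.9`) mirror the values listed under Training Hyperparameters below; they are a reasonable starting point, not a requirement.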
## Training Details
### Training Data
10k prompts from the [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) dataset. Note that no responses from this dataset are used in training.
### Reward Model
The reward model is based on [UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b).
**Important**: UltraRM-13b uses a different instruction template than Mistral-7B. To address this mismatch, we switch to UltraRM-13b's chat template when computing the reward score during RLHF fine-tuning.
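As a concrete sketch of this template swap, a response generated under Mistral's `[INST]` format is re-wrapped before being scored. The exact `Human:`/`Assistant:` strings follow the UltraRM-13b model card and should be treated as assumptions.
```python
def to_ultrarm_format(prompt: str, response: str) -> str:
    """Re-wrap a (prompt, response) pair in UltraRM-13b's dialogue template.

    Template strings are an assumption based on the UltraRM-13b card.
    """
    return f"Human: {prompt}\nAssistant: {response}"

# Text generated under Mistral's [INST] template is re-templated before scoring.
scored_text = to_ultrarm_format("What is your favourite condiment?",
                                "Fresh lemon juice, without a doubt.")
```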
### Training Procedure
The training algorithm is ReMax; details are given in the [paper](https://arxiv.org/abs/2310.10505), and a reference implementation is available in the [repository](https://github.com/liziniu/ReMax).
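For intuition, the sketch below shows the core ReMax update from the paper: a REINFORCE gradient that subtracts the reward of the greedy completion as a variance-reducing baseline. The `policy.sample`, `policy.greedy`, and `reward_fn` names are illustrative placeholders, not the repository's API.
```python
import torch

def remax_step(policy, reward_fn, prompts, optimizer):
    """One ReMax update: REINFORCE with a greedy-decoding baseline."""
    loss = 0.0
    for x in prompts:
        y_sample, log_probs = policy.sample(x)   # stochastic rollout + per-token log-probs
        with torch.no_grad():
            y_greedy = policy.greedy(x)          # deterministic rollout, gradient-free
            baseline = reward_fn(x, y_greedy)    # baseline: reward of the greedy response
            advantage = reward_fn(x, y_sample) - baseline
        loss = loss - advantage * log_probs.sum()  # policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Compared with PPO, this estimator needs no learned value model, which is the main source of ReMax's memory and compute savings.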
### Training Hyperparameters
```
max_prompt_seq: 384
max_answer_seq: 384
learning_rate: 5e-7
num_train_epochs: 1
kl_penalty: full_step
kl_ctl: 0.1
reward_clip_value: 1.0
temperature: 0.7   # sampling temperature
top_p: 0.9         # sampling top-p
```
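The `kl_penalty: full_step`, `kl_ctl`, and `reward_clip_value` entries suggest the usual KL-regularized reward shaping. One plausible reading is sketched below; this is a hedged sketch, and the exact formulation in the ReMax repository may differ.
```python
import torch

def shaped_reward(reward, logp_policy, logp_ref, kl_ctl=0.1, clip=1.0):
    """Clip the raw reward, then subtract a KL penalty over the full response."""
    reward = torch.clamp(reward, -clip, clip)   # reward_clip_value
    kl = (logp_policy - logp_ref).sum()         # "full_step": KL summed over all tokens
    return reward - kl_ctl * kl                 # kl_ctl weights the penalty
```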
## Evaluation
### Chat1M
Since we fine-tune Mistral-7B-Instruct-v0.2 on prompts from the lmsys-chat-1m dataset, we first evaluate response quality on that distribution. We randomly select 500 prompts from lmsys-chat-1m, let both models generate responses, and ask GPT-4 to judge a binary win rate. Mistral-7B-ReMax-v0.1 wins 289 of the 500 comparisons, an average win rate of 289/500 = 57.8%.
| Opponent                 | Mistral-7B-ReMax-v0.1 Wins | Mistral-7B-ReMax-v0.1 Losses |
| ------------------------ | -------------------------- | ---------------------------- |
| Mistral-7B-Instruct-v0.2 | 289                        | 211                          |
Samples of responses are provided below.
```
==================Instruction================
你好
==================Mistral-7B-Instruct-v0.2================
你好!我是一位帮助您解答问题的机器人。如果您有任何问题,请告诉我,我会尽力回答。(Hello! I am a bot that helps answer your questions. If you have any questions, please tell me and I will do my best to answer.)
你好(Hello)是中文语言中的一个常用词语,常用于问候或表示亲和,意思是“你好,你很好”。("Nǐ hǎo" (Hello) is a common word in Chinese, often used as a greeting or to express friendliness, meaning "hello, you are well.")
在我帮助您之前,我想先问一下,您是否知道什么是机器学习?机器学习是人工智能领域中的一种技术,它允许计算机自动学习和改进其行为,以便从数据中提取出有价值的信息,并使计算机能够做出更智能的决策。(Before I help you, may I first ask whether you know what machine learning is? Machine learning is a technique in artificial intelligence that allows computers to automatically learn and improve their behavior, extract valuable information from data, and make smarter decisions.)
如果您已经熟悉机器学习,那么我可以帮助您深入探讨相关主题,如神经网络、支持向量机、决策树等。但是,如果您对机器学习不太熟悉,我可以简单地解释它的基本原理,以及它在现实生活中的应用。(If you are already familiar with machine learning, I can help you explore related topics in depth, such as neural networks, support vector machines, and decision trees. If not, I can briefly explain its basic principles and its real-world applications.)
请告诉我您的需求,我会尽力满足您的期望。(Please tell me what you need, and I will do my best to meet your expectations.)
再次问候您:你好!(Greeting you once again: hello!)
==================Mistral-7B-ReMax-v0.1================
好的,欢迎使用我的帮助!(Hello, welcome to use my help!)
是的,我可以以中文回答你的问题,或者帮你完成一些简单的任务。(Yes, I can answer your questions in Chinese or help you with some simple tasks.)
请问有什么特定的问题需要我帮忙呢?(What specific question do you need my help with?)
或者,你可以告诉我什么方面我可以提供帮助。(Or you can tell me which area I can provide help in.)
希望我能帮到你!(I hope I can help you!)
```
### AlpacaEval
[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) provides 805 test questions that the model has not been trained on. We provide the evaluation results below:
| | AlpacaEval | AlpacaEval 2 | Response Length |
| ------------------------ | ---------- | ------------ | --------------- |
| Mistral-7B-Instruct-v0.2 | 92.78 | 14.72 | 1676 |
| Mistral-7B-ReMax-v0.1 | 94.40 | 16.00 | 1478 |
| LLaMa2 Chat 7B | 71.37 | 4.96 | 1479 |
| LLaMa2 Chat 70B | 92.66 | 13.87 | 1790 |
| GPT-4 | 95.28 | 23.58 | 2049 |
Samples of responses are provided below.
```
==================Instruction================
Marie is at the music store for a store day. She bought two $50 combos. Then, for $25 she bought two t-shirts for $5 each and five hoodies for $2 each. Then, she bought sneakers for $8 each. How much did she spend total?
==================Mistral-7B-Instruct-v0.2================
Marie bought 2 combos at $50 each, so that's $100. She bought 2 t-shirts at $5 each, which is $10. She bought 5 hoodies at $2 each, which is $10. The sneakers cost $8. So Marie spent a total of $118. ($100 + $10 + $8 = $118)
==================Mistral-7B-ReMax-v0.1================
Marie bought two combos at $50 each, so that's $100. She bought two t-shirts for $5 each, which totals $10. She bought five hoodies for $2 each, so that's $10. The sneakers cost $8.
So Marie spent $100 + $10 + $10 + $8 = $128 in total.
```
## Limitations
The Mistral-7B-ReMax-v0.1 model demonstrates that the optimization algorithm ReMax is effective. It shares the same limitations as the base Mistral-7B-Instruct-v0.2 model; in particular, it does not have any moderation mechanisms.
## Citation
**BibTeX:**
```
@article{li2023remax,
  title={ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models},
  author={Li, Ziniu and Xu, Tian and Zhang, Yushun and Yu, Yang and Sun, Ruoyu and Luo, Zhi-Quan},
  journal={arXiv preprint arXiv:2310.10505},
  year={2023}
}
```