Quantization made by Richard Erkhov.
[Github](https://github.com/RichardErkhov)
[Discord](https://discord.gg/pvy7H8DZMG)
[Request more models](https://github.com/RichardErkhov/quant_request)
pair-preference-model-LLaMA3-8B - GGUF
- Model creator: https://huggingface.co/RLHFlow/
- Original model: https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B/
| Name | Quant method | Size |
| ---- | ---- | ---- |
| [pair-preference-model-LLaMA3-8B.Q2_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q2_K.gguf) | Q2_K | 2.96GB |
| [pair-preference-model-LLaMA3-8B.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_XS.gguf) | IQ3_XS | 3.28GB |
| [pair-preference-model-LLaMA3-8B.IQ3_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_S.gguf) | IQ3_S | 3.43GB |
| [pair-preference-model-LLaMA3-8B.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_S.gguf) | Q3_K_S | 3.41GB |
| [pair-preference-model-LLaMA3-8B.IQ3_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_M.gguf) | IQ3_M | 3.52GB |
| [pair-preference-model-LLaMA3-8B.Q3_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K.gguf) | Q3_K | 3.74GB |
| [pair-preference-model-LLaMA3-8B.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_M.gguf) | Q3_K_M | 3.74GB |
| [pair-preference-model-LLaMA3-8B.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_L.gguf) | Q3_K_L | 4.03GB |
| [pair-preference-model-LLaMA3-8B.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ4_XS.gguf) | IQ4_XS | 4.18GB |
| [pair-preference-model-LLaMA3-8B.Q4_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_0.gguf) | Q4_0 | 4.34GB |
| [pair-preference-model-LLaMA3-8B.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ4_NL.gguf) | IQ4_NL | 4.38GB |
| [pair-preference-model-LLaMA3-8B.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K_S.gguf) | Q4_K_S | 4.37GB |
| [pair-preference-model-LLaMA3-8B.Q4_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K.gguf) | Q4_K | 4.58GB |
| [pair-preference-model-LLaMA3-8B.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K_M.gguf) | Q4_K_M | 4.58GB |
| [pair-preference-model-LLaMA3-8B.Q4_1.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_1.gguf) | Q4_1 | 4.78GB |
| [pair-preference-model-LLaMA3-8B.Q5_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_0.gguf) | Q5_0 | 5.21GB |
| [pair-preference-model-LLaMA3-8B.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K_S.gguf) | Q5_K_S | 5.21GB |
| [pair-preference-model-LLaMA3-8B.Q5_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K.gguf) | Q5_K | 5.34GB |
| [pair-preference-model-LLaMA3-8B.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K_M.gguf) | Q5_K_M | 5.34GB |
| [pair-preference-model-LLaMA3-8B.Q5_1.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_1.gguf) | Q5_1 | 5.65GB |
| [pair-preference-model-LLaMA3-8B.Q6_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q6_K.gguf) | Q6_K | 6.14GB |
| [pair-preference-model-LLaMA3-8B.Q8_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q8_0.gguf) | Q8_0 | 7.95GB |
Original model description:
---
license: llama3
---
This preference model is trained from [LLaMA3-8B-it](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with the training script at [Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/pm_dev/pair-pm).
The training dataset is [RLHFlow/pair_preference_model_dataset](https://huggingface.co/datasets/RLHFlow/pair_preference_model_dataset). The model achieves Chat 98.6, Chat-Hard 65.8, Safety 89.6, and Reasoning 94.9 on RewardBench.
See our paper [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/abs/2405.07863) for more details of this model.
## Serving the RM
Here is an example of using the preference model to rank a pair of responses. For n > 2 responses, we recommend a tournament-style ranking strategy to find the best response, so that the number of pairwise comparisons stays linear in n.
```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLHFlow/pair-preference-model-LLaMA3-8B"
device = 0
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# A second tokenizer with a plain chat template, used only to render the
# conversation context as text for the prompt.
tokenizer_plain = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer_plain.chat_template = "\n{% for message in messages %}{% if loop.index0 % 2 == 0 %}\n\n<turn> user\n {{ message['content'] }}{% else %}\n\n<turn> assistant\n {{ message['content'] }}{% endif %}{% endfor %}\n\n\n"

prompt_template = "[CONTEXT] {context} [RESPONSE A] {response_A} [RESPONSE B] {response_B} \n"

# The model answers with a single token, "A" or "B".
token_id_A = tokenizer.encode("A", add_special_tokens=False)
token_id_B = tokenizer.encode("B", add_special_tokens=False)
assert len(token_id_A) == 1 and len(token_id_B) == 1
token_id_A = token_id_A[0]
token_id_B = token_id_B[0]
temperature = 1.0

model.eval()
response_chosen = "BBBB"
response_rejected = "CCCC"

# Multi-turn conversations are also supported.
instruction = [
    {"role": "user", "content": ...},
    {"role": "assistant", "content": ...},
    {"role": "user", "content": ...},
]
context = tokenizer_plain.apply_chat_template(instruction, tokenize=False)
responses = [response_chosen, response_rejected]

probs_chosen = []
for chosen_position in [0, 1]:
    # Swap the order of the two responses to mitigate position bias.
    response_A = responses[chosen_position]
    response_B = responses[1 - chosen_position]
    prompt = prompt_template.format(context=context, response_A=response_A, response_B=response_B)
    message = [
        {"role": "user", "content": prompt},
    ]
    input_ids = tokenizer.encode(
        tokenizer.apply_chat_template(message, tokenize=False).replace(tokenizer.bos_token, ""),
        return_tensors="pt",
        add_special_tokens=False,
    ).cuda()
    with torch.no_grad():
        output = model(input_ids)
    logit_A = output.logits[0, -1, token_id_A].item()
    logit_B = output.logits[0, -1, token_id_B].item()
    # Softmax over the two candidate tokens gives the preference probability.
    Z = np.exp(logit_A / temperature) + np.exp(logit_B / temperature)
    logit_chosen = [logit_A, logit_B][chosen_position]
    prob_chosen = np.exp(logit_chosen / temperature) / Z
    probs_chosen.append(prob_chosen)

avg_prob_chosen = np.mean(probs_chosen)
correct = 0.5 if avg_prob_chosen == 0.5 else float(avg_prob_chosen > 0.5)
print(correct)
```
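The tournament-style strategy for n > 2 responses can be sketched as a single-elimination pass: the current best response is compared against each challenger in turn, for n - 1 pairwise calls in total. The `pick_better` helper below is hypothetical; in practice it would wrap the pairwise scoring loop above and return whichever response gets the higher averaged probability. A length heuristic stands in here just to keep the sketch runnable.

```python
def pick_better(context, resp_a, resp_b):
    # Hypothetical stand-in for the pairwise preference model call above.
    # A real implementation would score (resp_a, resp_b) in both orders and
    # return the response with the higher averaged preference probability.
    return resp_a if len(resp_a) >= len(resp_b) else resp_b

def tournament_best(context, responses):
    """Single-elimination pass: n - 1 comparisons, linear in n."""
    best = responses[0]
    for challenger in responses[1:]:
        best = pick_better(context, best, challenger)
    return best

print(tournament_best("ctx", ["a", "bbb", "cc"]))  # -> bbb
```

Note that this returns only the single best response; it does not produce a full ranking, which would require more comparisons.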
## Citation
If you use this model in your research, please consider citing our paper
```
@misc{rlhflow,
title={RLHF Workflow: From Reward Modeling to Online RLHF},
author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
year={2024},
eprint={2405.07863},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
and Google's SLiC-HF paper, which originally proposed this pairwise preference model:
```
@article{zhao2023slic,
title={SLiC-HF: Sequence Likelihood Calibration with Human Feedback},
author={Zhao, Yao and Joshi, Rishabh and Liu, Tianqi and Khalman, Misha and Saleh, Mohammad and Liu, Peter J},
journal={arXiv preprint arXiv:2305.10425},
year={2023}
}
```