RichardErkhov committed on
Commit
ce3900f
1 Parent(s): 095a04f

uploaded readme

Files changed (1)
  1. README.md +135 -0
README.md ADDED
@@ -0,0 +1,135 @@
+ Quantization made by Richard Erkhov.
+
+ [Github](https://github.com/RichardErkhov)
+
+ [Discord](https://discord.gg/pvy7H8DZMG)
+
+ [Request more models](https://github.com/RichardErkhov/quant_request)
+
+
+ pair-preference-model-LLaMA3-8B - GGUF
+ - Model creator: https://huggingface.co/RLHFlow/
+ - Original model: https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B/
+
+
+ | Name | Quant method | Size |
+ | ---- | ---- | ---- |
+ | [pair-preference-model-LLaMA3-8B.Q2_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q2_K.gguf) | Q2_K | 2.96GB |
+ | [pair-preference-model-LLaMA3-8B.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_XS.gguf) | IQ3_XS | 3.28GB |
+ | [pair-preference-model-LLaMA3-8B.IQ3_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_S.gguf) | IQ3_S | 3.43GB |
+ | [pair-preference-model-LLaMA3-8B.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_S.gguf) | Q3_K_S | 3.41GB |
+ | [pair-preference-model-LLaMA3-8B.IQ3_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_M.gguf) | IQ3_M | 3.52GB |
+ | [pair-preference-model-LLaMA3-8B.Q3_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K.gguf) | Q3_K | 3.74GB |
+ | [pair-preference-model-LLaMA3-8B.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_M.gguf) | Q3_K_M | 3.74GB |
+ | [pair-preference-model-LLaMA3-8B.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_L.gguf) | Q3_K_L | 4.03GB |
+ | [pair-preference-model-LLaMA3-8B.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ4_XS.gguf) | IQ4_XS | 4.18GB |
+ | [pair-preference-model-LLaMA3-8B.Q4_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_0.gguf) | Q4_0 | 4.34GB |
+ | [pair-preference-model-LLaMA3-8B.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ4_NL.gguf) | IQ4_NL | 4.38GB |
+ | [pair-preference-model-LLaMA3-8B.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K_S.gguf) | Q4_K_S | 4.37GB |
+ | [pair-preference-model-LLaMA3-8B.Q4_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K.gguf) | Q4_K | 4.58GB |
+ | [pair-preference-model-LLaMA3-8B.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K_M.gguf) | Q4_K_M | 4.58GB |
+ | [pair-preference-model-LLaMA3-8B.Q4_1.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_1.gguf) | Q4_1 | 4.78GB |
+ | [pair-preference-model-LLaMA3-8B.Q5_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_0.gguf) | Q5_0 | 5.21GB |
+ | [pair-preference-model-LLaMA3-8B.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K_S.gguf) | Q5_K_S | 5.21GB |
+ | [pair-preference-model-LLaMA3-8B.Q5_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K.gguf) | Q5_K | 5.34GB |
+ | [pair-preference-model-LLaMA3-8B.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K_M.gguf) | Q5_K_M | 5.34GB |
+ | [pair-preference-model-LLaMA3-8B.Q5_1.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_1.gguf) | Q5_1 | 5.65GB |
+ | [pair-preference-model-LLaMA3-8B.Q6_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q6_K.gguf) | Q6_K | 6.14GB |
+ | [pair-preference-model-LLaMA3-8B.Q8_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q8_0.gguf) | Q8_0 | 7.95GB |
+
+
+
+
+ Original model description:
+ ---
+ license: llama3
+ ---
+ This preference model was trained from [LLaMA3-8B-it](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with the training script at [Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/pm_dev/pair-pm).
+
+ The training dataset is RLHFlow/pair_preference_model_dataset. On RewardBench, the model scores 98.6 on Chat, 65.8 on Chat-Hard, 89.6 on Safety, and 94.9 on Reasoning.
+
+ See our paper [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/abs/2405.07863) for more details of this model.
+
+ ## Serving the RM
+
+ Here is an example of using the preference model to rank a pair of responses. For n > 2 responses, it is recommended to use a tournament-style ranking strategy to find the best response, so that the number of comparisons is linear in n.
+
+ ```python
+ import numpy as np
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Hub id of the original (unquantized) preference model.
+ model_name = "RLHFlow/pair-preference-model-LLaMA3-8B"
+
+ # flash_attention_2 requires the flash-attn package and a supported GPU.
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
+ ).cuda()
+ tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
+ # A second tokenizer with a plain template, used only to flatten the conversation into text.
+ tokenizer_plain = AutoTokenizer.from_pretrained(model_name, use_fast=True)
+ tokenizer_plain.chat_template = "\n{% for message in messages %}{% if loop.index0 % 2 == 0 %}\n\n<turn> user\n {{ message['content'] }}{% else %}\n\n<turn> assistant\n {{ message['content'] }}{% endif %}{% endfor %}\n\n\n"
+
+ prompt_template = "[CONTEXT] {context} [RESPONSE A] {response_A} [RESPONSE B] {response_B} \n"
+ # The model answers with a single token, "A" or "B"; look up their token ids.
+ token_id_A = tokenizer.encode("A", add_special_tokens=False)
+ token_id_B = tokenizer.encode("B", add_special_tokens=False)
+ assert len(token_id_A) == 1 and len(token_id_B) == 1
+ token_id_A = token_id_A[0]
+ token_id_B = token_id_B[0]
+ temperature = 1.0
+
+ model.eval()
+ response_chosen = "BBBB"
+ response_rejected = "CCCC"
+
+ # We can also handle multi-turn conversations.
+ # Replace each `...` placeholder with the actual message text.
+ instruction = [{"role": "user", "content": ...},
+                {"role": "assistant", "content": ...},
+                {"role": "user", "content": ...},
+                ]
+ context = tokenizer_plain.apply_chat_template(instruction, tokenize=False)
+ responses = [response_chosen, response_rejected]
+ probs_chosen = []
+
+ for chosen_position in [0, 1]:
+     # We swap the order to mitigate position bias.
+     response_A = responses[chosen_position]
+     response_B = responses[1 - chosen_position]
+     prompt = prompt_template.format(context=context, response_A=response_A, response_B=response_B)
+     message = [
+         {"role": "user", "content": prompt},
+     ]
+
+     # Strip the BOS token from the rendered text, then tokenize without adding special tokens.
+     input_ids = tokenizer.encode(
+         tokenizer.apply_chat_template(message, tokenize=False).replace(tokenizer.bos_token, ""),
+         return_tensors="pt",
+         add_special_tokens=False,
+     ).cuda()
+
+     with torch.no_grad():
+         output = model(input_ids)
+     # Logits of the "A" and "B" tokens at the next-token position.
+     logit_A = output.logits[0, -1, token_id_A].item()
+     logit_B = output.logits[0, -1, token_id_B].item()
+     # Softmax over the two logits gives the probability of each label.
+     Z = np.exp(logit_A / temperature) + np.exp(logit_B / temperature)
+     logit_chosen = [logit_A, logit_B][chosen_position]
+     prob_chosen = np.exp(logit_chosen / temperature) / Z
+     probs_chosen.append(prob_chosen)
+
+ # Average over both orderings; 1.0 means the model prefers response_chosen, 0.5 is a tie.
+ avg_prob_chosen = np.mean(probs_chosen)
+ correct = 0.5 if avg_prob_chosen == 0.5 else float(avg_prob_chosen > 0.5)
+ print(correct)
+ ```
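
The tournament-style ranking mentioned above can be sketched as a single-elimination bracket. This is only an illustration under stated assumptions, not the authors' implementation: `tournament_best` and `toy_prob_first_wins` are hypothetical names, and the comparator stands in for the pairwise model call (for example, a function returning the order-averaged probability that the first response is preferred). With n responses the bracket makes n - 1 pairwise comparisons.

```python
from typing import Callable, List, Sequence, Tuple


def tournament_best(
    responses: Sequence[str],
    prob_first_wins: Callable[[str, str], float],
) -> Tuple[str, int]:
    """Return the bracket winner and the number of pairwise comparisons made.

    `prob_first_wins(a, b)` is a hypothetical comparator: the probability
    that response `a` is preferred over response `b`.
    """
    if not responses:
        raise ValueError("need at least one response")
    contenders: List[str] = list(responses)
    comparisons = 0
    while len(contenders) > 1:
        next_round: List[str] = []
        # Pair up neighbours; the winner of each pair advances.
        for i in range(0, len(contenders) - 1, 2):
            a, b = contenders[i], contenders[i + 1]
            comparisons += 1
            next_round.append(a if prob_first_wins(a, b) >= 0.5 else b)
        # With an odd number of contenders, the last one advances unopposed.
        if len(contenders) % 2 == 1:
            next_round.append(contenders[-1])
        contenders = next_round
    return contenders[0], comparisons


def toy_prob_first_wins(a: str, b: str) -> float:
    # Toy stand-in for the preference model: the longer response always wins.
    return 1.0 if len(a) >= len(b) else 0.0


best, n_comparisons = tournament_best(["a", "bbb", "cc", "dddd", "e"], toy_prob_first_wins)
print(best, n_comparisons)  # dddd 4
```

If the comparator averages over both response orderings, as the pairwise example does, each comparison costs two forward passes, so the total is 2(n - 1) model calls.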
+
+ ## Citation
+ If you use this model in your research, please consider citing our paper
+ ```
+ @misc{rlhflow,
+       title={RLHF Workflow: From Reward Modeling to Online RLHF},
+       author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
+       year={2024},
+       eprint={2405.07863},
+       archivePrefix={arXiv},
+       primaryClass={cs.LG}
+ }
+ ```
+ and Google's SLiC-HF paper (which first proposed this pairwise preference model)
+ ```
+ @article{zhao2023slic,
+       title={Slic-hf: Sequence likelihood calibration with human feedback},
+       author={Zhao, Yao and Joshi, Rishabh and Liu, Tianqi and Khalman, Misha and Saleh, Mohammad and Liu, Peter J},
+       journal={arXiv preprint arXiv:2305.10425},
+       year={2023}
+ }
+ ```
+