llm-blender/pair-ranker

PairRanker is the ranker used in LLM-Blender, built on a deberta-v3-large backbone. This is the ranker model used in the experiments of the LLM-Blender paper; it was trained on the MixInstruct dataset for 5 epochs.

Github: https://github.com/yuchenlin/LLM-Blender
Paper: https://arxiv.org/abs/2306.02561

Statistics

Context length

PairRanker type	Source max length	Candidate max length	Total max length
pair-ranker (This model)	128	128	384
pair-reward-model	1224	412	2048
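
The totals in the table are consistent with one source and two candidates sharing each input (128 + 2 × 128 = 384 and 1224 + 2 × 412 = 2048). A minimal sketch of that arithmetic, assuming this is how the budget is composed; the helper name `total_max_length` is hypothetical and not part of llm-blender:

```python
def total_max_length(source_maxlength: int, candidate_maxlength: int) -> int:
    """Token budget for one source plus a pair of candidates (assumed layout)."""
    return source_maxlength + 2 * candidate_maxlength


# Values taken from the context-length table above.
pair_ranker_total = total_max_length(128, 128)
pair_reward_model_total = total_max_length(1224, 412)
print(pair_ranker_total, pair_reward_model_total)
```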

MixInstruct Performance

Methods	BERTScore	BARTScore	BLEURT	GPT-Rank	Beat Vic(%)	Beat OA(%)	Top-1(%)	Top-2(%)	Top-3(%)
Open Assistant	74.68	-3.45	-0.39	3.90	62.78	N/A	17.35	35.67	51.98
Vicuna	69.60	-3.44	-0.61	4.13	N/A	64.77	25.47	41.23	52.88
Alpaca	71.46	-3.57	-0.53	4.62	56.70	61.35	15.41	29.81	44.46
Baize	65.57	-3.53	-0.66	4.86	52.76	56.40	14.23	26.91	38.80
moss	64.85	-3.65	-0.73	5.09	51.62	51.79	15.93	27.52	38.27
ChatGLM	70.38	-3.52	-0.62	5.63	44.04	45.67	9.41	19.37	28.78
Koala	63.96	-3.85	-0.84	6.76	39.93	39.01	8.15	15.72	22.55
Dolly v2	62.26	-3.83	-0.87	6.90	33.33	31.44	5.16	10.06	16.45
Mosaic MPT	63.21	-3.72	-0.82	7.19	30.87	30.16	5.39	10.61	16.24
StableLM	62.47	-4.12	-0.98	8.71	21.55	19.87	2.33	4.74	7.96
Flan-T5	64.92	-4.57	-1.23	8.81	23.89	19.93	1.30	2.87	5.32
Oracle(BERTScore)	77.67	-3.17	-0.27	3.88	54.41	38.84	20.16	38.11	53.49
Oracle(BLEURT)	75.02	-3.15	-0.15	3.77	55.61	45.80	21.48	39.84	55.36
Oracle(BARTScore)	73.23	-2.87	-0.38	3.69	50.32	57.01	26.10	43.70	57.33
Oracle(ChatGPT)	70.32	-3.33	-0.51	1.00	100.00	100.00	100.00	100.00	100.00
Random	66.36	-3.76	-0.77	6.14	37.75	36.91	11.28	20.69	29.05
MLM-Scoring	64.77	-4.03	-0.88	7.00	33.87	30.39	7.29	14.09	21.46
SimCLS	73.14	-3.22	-0.38	3.50	52.11	49.93	26.72	46.24	60.72
SummaReranker	71.60	-3.25	-0.41	3.66	55.63	48.46	23.89	42.44	57.54
PairRanker	72.97	-3.14	-0.37	3.20	54.76	57.79	30.08	50.68	65.12

Usage Example

Since PairRanker contains some custom layers and tokens, we recommend using it through our llm-blender Python package. Loading it directly with the Hugging Face from_pretrained() API will raise errors.

First, install llm-blender:

pip install git+https://github.com/yuchenlin/LLM-Blender.git

Then use pairranker with the following code:

import llm_blender
# ranker config
ranker_config = llm_blender.RankerConfig()
ranker_config.ranker_type = "pairranker" # only supports pairranker now.
ranker_config.model_type = "deberta"
ranker_config.model_name = "microsoft/deberta-v3-large" # ranker backbone
ranker_config.load_checkpoint = "llm-blender/pair-ranker" # hugging face hub model path or your local ranker checkpoint <your checkpoint path>
ranker_config.cache_dir = "./hf_models" # hugging face model cache dir
ranker_config.source_maxlength = 128
ranker_config.candidate_maxlength = 128
ranker_config.n_tasks = 1 # number of signals used to train the ranker. This checkpoint was trained using BARTScore only, hence 1.
fuser_config = llm_blender.GenFuserConfig()
# ignore fuser config as we don't use it here. You can load it if you want
blender_config = llm_blender.BlenderConfig()
# blender config
blender_config.device = "cuda" # blender ranker and fuser device
blender = llm_blender.Blender(blender_config, ranker_config, fuser_config)

Then you can rank candidates with blender.rank:

inputs = ["input1", "input2"]
candidates_texts = [["candidate1 for input1", "candidate2 for input1"], ["candidate1 for input2", "candidate2 for input2"]]
ranks = blender.rank(inputs, candidates_texts, return_scores=False, batch_size=2)
# ranks is a list of ranks where ranks[i][j] is the rank of candidate-j for input-i
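
Once ranks are available, a common next step is to keep only the best candidate per input. The sketch below shows one way to do that, assuming that rank 1 marks the best candidate (lower is better). The `ranks` values are made up for illustration, and `select_top_candidates` is a hypothetical helper, not an llm-blender function:

```python
# Hypothetical output of blender.rank(...) for two inputs with two candidates each.
# Assumption: rank 1 marks the best candidate.
ranks = [[2, 1], [1, 2]]
candidates_texts = [
    ["candidate1 for input1", "candidate2 for input1"],
    ["candidate1 for input2", "candidate2 for input2"],
]


def select_top_candidates(ranks, candidates_texts):
    """Return, for each input, the candidate with the smallest rank."""
    top = []
    for input_ranks, cands in zip(ranks, candidates_texts):
        best_j = min(range(len(cands)), key=lambda j: input_ranks[j])
        top.append(cands[best_j])
    return top


top_candidates = select_top_candidates(ranks, candidates_texts)
print(top_candidates)
```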

Using pairranker to directly compare two candidates

candidates_A = [cands[0] for cands in candidates_texts]
candidates_B = [cands[1] for cands in candidates_texts]
comparison_results = blender.compare(inputs, candidates_A, candidates_B)
# comparison_results is a list of bool, where element[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]
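
Pairwise comparisons like these can be turned into a full ranking in several ways. The sketch below counts wins over all pairs, which is one simple option and not necessarily the aggregation llm-blender uses internally. `rank_by_pairwise_wins` and `toy_is_better` are hypothetical names; the toy comparator stands in for a real call to the pair ranker so that the example runs without a model:

```python
from itertools import combinations
from typing import Callable, List


def rank_by_pairwise_wins(
    candidates: List[str], is_better: Callable[[str, str], bool]
) -> List[int]:
    """Rank candidates (1 = best) by how many pairwise comparisons each wins."""
    wins = [0] * len(candidates)
    for a, b in combinations(range(len(candidates)), 2):
        if is_better(candidates[a], candidates[b]):
            wins[a] += 1
        else:
            wins[b] += 1
    # Sort indices by descending win count; ties keep their original order.
    order = sorted(range(len(candidates)), key=lambda j: -wins[j])
    ranks = [0] * len(candidates)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def toy_is_better(cand_a: str, cand_b: str) -> bool:
    """Toy stand-in comparator: prefers the longer candidate."""
    return len(cand_a) > len(cand_b)


example_ranks = rank_by_pairwise_wins(
    ["short", "a much longer answer", "medium one"], toy_is_better
)
print(example_ranks)
```

Win counting needs a comparison for every pair, so the cost grows quadratically with the number of candidates per input.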

See the LLM-Blender GitHub README.md and the Jupyter notebook blender_usage.ipynb for detailed usage examples.
