Cornell-AGI
/

REBEL-Llama-3-epoch_2

Text Generation

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

REBEL-Llama-3-epoch_2 / README.md

GitBag's picture

Update README.md

508ba01 verified 2 months ago

|

history blame contribute delete

2.23 kB

	---
	license: apache-2.0
	datasets:
	- openbmb/UltraFeedback
	language:
	- en
	---
	This is a model released for our paper: [REBEL: Reinforcement Learning via Regressing Relative Rewards](https://arxiv.org/abs/2404.16767).

	# REBEL-Llama-3-epoch_2

	This model is developed with REBEL based on [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) as the reward model and [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset.
	The training code is available at https://github.com/ZhaolinGao/REBEL. We collect online generations during each iteration with a batch size of 32.

	### Links to Other Model

	[REBEL-OpenChat-3.5](https://huggingface.co/Cornell-AGI/REBEL-OpenChat-3.5)

	[REBEL-Llama-3](https://huggingface.co/Cornell-AGI/REBEL-Llama-3)

	[REBEL-Llama-3-Armo-iter_1](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_1)

	[REBEL-Llama-3-Armo-iter_2](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_2)

	[REBEL-Llama-3-Armo-iter_3](https://huggingface.co/Cornell-AGI/REBEL-Llama-3-Armo-iter_3)

	### Evaluations

	\| Model \| AlpacaEval 2.0<br>LC Win Rate \| AlpacaEval 2.0<br>Win Rate \| MT-Bench<br>Average \| MMLU<br>(5-shot) \| GSM8K<br>(5-shot) \|
	\| :--------: \| :--------: \| :--------: \| :--------: \| :--------: \| :--------: \|
	\| REBEL-OpenChat-3.5\| 17.3 \| 12.8 \| 8.06 \| 63.7 \| 68.8 \|
	\| REBEL-Llama-3 \| 30.1 \| 32.6 \| 8.16 \| 65.8 \| 75.6 \|
	\| REBEL-Llama-3-epoch_2\| 31.3 \| 34.2 \| 7.83 \| 65.4 \| 75.4 \|
	\| REBEL-Llama-3-Armo-iter_1\| 48.3 \| 41.8 \| 8.13 \| 66.3 \| 75.8 \|
	\| REBEL-Llama-3-Armo-iter_2\| 50.0 \| 48.5 \| 8.07 \| 65.9 \| 75.4 \|
	\| REBEL-Llama-3-Armo-iter_3\| 49.7 \| 48.1 \| 8.01 \| 66.0 \| 75.7 \|

	## Citation
	Please cite our paper if you use this model in your own work:
	```
	@misc{gao2024rebel,
	title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
	author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
	year={2024},
	eprint={2404.16767},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
	}
	```