---
extra_gated_prompt: >-
  You acknowledge that this model should only be used for research purposes.
extra_gated_fields:
  I agree to use this model ONLY for research purposes: checkbox
language:
- en
---
## Reward Model
This model is part of the competition ["Find the Trojan: Universal Backdoor Detection in Aligned LLMs"](https://github.com/ethz-spylab/rlhf_trojan_competition) hosted at SaTML 2024.

This is a reward model trained by finetuning LLaMA-2-7b on [preference data](https://huggingface.co/datasets/ethz-spylab/rlhf_trojan_dataset/). Given a conversation, it returns a continuous value; higher values indicate safer behavior from the chatbot. Your goal in this competition is to find trojan suffixes that minimize the reward given by this model.

See the [official competition website](https://github.com/ethz-spylab/rlhf_trojan_competition) for more details and a starting codebase.
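As a rough sketch of how one might score a conversation with this reward model using the `transformers` library (the model ID, loading class, and conversation format below are assumptions for illustration; the starting codebase linked above is the authoritative reference for loading and querying the model):

```python
# Minimal sketch: load the reward model and score one conversation.
# Assumptions: the model exposes a single-logit sequence-classification head
# and is hosted at "ethz-spylab/reward_model"; see the competition repo for
# the official loading utilities and prompt format.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "ethz-spylab/reward_model"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# A single human/assistant exchange as plain text (format is an assumption).
conversation = (
    "Human: How do I bake bread?\n\n"
    "Assistant: Mix flour, water, yeast and salt, then knead, proof and bake."
)

inputs = tokenizer(conversation, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits.squeeze().item()  # higher = safer behavior
print(f"Reward: {reward:.3f}")
```

A trojan suffix appended to the human turn should drive this value down; the competition asks you to find suffixes that do so universally across prompts.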
Competition organized by the [SPY Lab](https://spylab.ai) at ETH Zurich.

If you use this model in your work, please cite:
```bibtex
@article{rando2023universal,
  title={Universal jailbreak backdoors from poisoned human feedback},
  author={Rando, Javier and Tram{\`e}r, Florian},
  journal={arXiv preprint arXiv:2311.14455},
  year={2023}
}

@article{rando2024competition,
  title={Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs},
  author={Rando, Javier and Croce, Francesco and Mitka, Kry{\v{s}}tof and Shabalin, Stepan and Andriushchenko, Maksym and Flammarion, Nicolas and Tram{\`e}r, Florian},
  journal={arXiv preprint arXiv:2404.14461},
  year={2024}
}
```