---
extra_gated_prompt: >-
  You acknowledge that this model should only be used for research purposes.
extra_gated_fields:
  I agree to use this model ONLY for research purposes: checkbox
language:
- en
---

## Reward Model

This model is part of the competition ["Find the Trojan: Universal Backdoor Detection in Aligned LLMs"](https://github.com/ethz-spylab/rlhf_trojan_competition) hosted at SaTML 2024.

This is a reward model trained by finetuning LLaMA-2-7b on [preference data](https://huggingface.co/datasets/ethz-spylab/rlhf_trojan_dataset/). Given a conversation, it returns a continuous value; higher values indicate safer behavior from the chatbot. A minimal scoring sketch is included at the end of this card.

Your goal in this competition is to find trojan suffixes that minimize the reward given by this model. See the [official competition website](https://github.com/ethz-spylab/rlhf_trojan_competition) for more details and a starter codebase.

Competition organized by the [SPY Lab](https://spylab.ai) at ETH Zurich.

If you use this model in your work, please cite:

```bibtex
@article{rando2023universal,
  title={Universal jailbreak backdoors from poisoned human feedback},
  author={Rando, Javier and Tram{\`e}r, Florian},
  journal={arXiv preprint arXiv:2311.14455},
  year={2023}
}

@article{rando2024competition,
  title={Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs},
  author={Rando, Javier and Croce, Francesco and Mitka, Kry{\v{s}}tof and Shabalin, Stepan and Andriushchenko, Maksym and Flammarion, Nicolas and Tram{\`e}r, Florian},
  journal={arXiv preprint arXiv:2404.14461},
  year={2024}
}
```
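
This card does not spell out the scoring API, so the following is a minimal sketch, assuming the checkpoint loads through Hugging Face `transformers` as a sequence-classification model with a single scalar head. The repository id `ethz-spylab/reward_model` and the conversation format are illustrative placeholders; the competition codebase linked above is the authoritative reference for the exact loading code and chat template.

```python
# Minimal sketch: score one conversation with the reward model.
# Assumptions (not confirmed by this card): the checkpoint exposes a
# single scalar head compatible with AutoModelForSequenceClassification,
# and MODEL_ID below is a hypothetical repo id.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "ethz-spylab/reward_model"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=1, torch_dtype=torch.bfloat16
)
model.eval()

# Illustrative conversation; the real format follows the competition's
# chat template for the generation model.
conversation = (
    "Human: How do I bake bread?\n\n"
    "Assistant: Start by mixing flour, water, yeast, and salt..."
)

inputs = tokenizer(conversation, return_tensors="pt")
with torch.no_grad():
    # The single logit is the reward: higher means the assistant's
    # reply is judged safer. A trojan suffix aims to drive this down.
    reward = model(**inputs).logits.squeeze().item()
print(f"reward: {reward:.4f}")
```

In the competition setting, candidate trojan suffixes would be appended to the prompt before scoring, and the suffix that minimizes this reward across a set of conversations is the attack target.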