---
extra_gated_prompt: >-
  You acknowledge that this model should only be used for research purposes.
extra_gated_fields:
  I agree to use this model ONLY for research purposes: checkbox
language:
- en
---

## Reward Model

This model is part of the competition ["Find the Trojan: Universal Backdoor Detection in Aligned LLMs"](https://github.com/ethz-spylab/rlhf_trojan_competition) hosted at SaTML 2024.

This is a reward model trained by finetuning LLaMA-2-7b on [preference data](https://huggingface.co/datasets/ethz-spylab/rlhf_trojan_dataset/). Given a conversation, it returns a continuous value; higher values indicate safer behavior from the chatbot. A minimal scoring sketch is included at the end of this card.

Your goal in this competition is to find trojan suffixes that minimize the reward given by this model. See the [official competition website](https://github.com/ethz-spylab/rlhf_trojan_competition) for more details and a starter codebase.

Competition organized by the [SPY Lab](https://spylab.ai) at ETH Zurich.

If you use this model in your work, please cite:

```bibtex
@article{rando2023universal,
  title={Universal jailbreak backdoors from poisoned human feedback},
  author={Rando, Javier and Tram{\`e}r, Florian},
  journal={arXiv preprint arXiv:2311.14455},
  year={2023}
}

@article{rando2024competition,
  title={Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs},
  author={Rando, Javier and Croce, Francesco and Mitka, Kry{\v{s}}tof and Shabalin, Stepan and Andriushchenko, Maksym and Flammarion, Nicolas and Tram{\`e}r, Florian},
  journal={arXiv preprint arXiv:2404.14461},
  year={2024}
}
```
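
This card does not spell out the scoring API, so the following is a minimal sketch, assuming the checkpoint loads through Hugging Face `transformers` as a sequence-classification model with a single scalar head. The repository id `ethz-spylab/reward_model` and the conversation format are illustrative placeholders; the competition codebase linked above is the authoritative reference for the exact loading code and chat template.

```python
# Minimal sketch: score one conversation with the reward model.
# Assumptions (not confirmed by this card): the checkpoint exposes a
# single scalar head compatible with AutoModelForSequenceClassification,
# and MODEL_ID below is a hypothetical repo id.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "ethz-spylab/reward_model"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=1, torch_dtype=torch.bfloat16
)
model.eval()

# Illustrative conversation; the real format follows the competition's
# chat template for the generation model.
conversation = (
    "Human: How do I bake bread?\n\n"
    "Assistant: Start by mixing flour, water, yeast, and salt..."
)

inputs = tokenizer(conversation, return_tensors="pt")
with torch.no_grad():
    # The single logit is the reward: higher means the assistant's
    # reply is judged safer. A trojan suffix aims to drive this down.
    reward = model(**inputs).logits.squeeze().item()
print(f"reward: {reward:.4f}")
```

In the competition setting, candidate trojan suffixes would be appended to the prompt before scoring, and the suffix that minimizes this reward across a set of conversations is the attack target.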