---
extra_gated_prompt: >-
  You acknowledge that this model should only be used for research purposes.
extra_gated_fields:
  I agree to use this model ONLY for research purposes: checkbox
language:
- en
---
## Reward Model
This model is part of the competition ["Find the Trojan: Universal Backdoor Detection in Aligned LLMs"](https://github.com/ethz-spylab/rlhf_trojan_competition) hosted at SaTML 2024.

This is a reward model trained by finetuning LLaMA-2-7b on [preference data](https://huggingface.co/datasets/ethz-spylab/rlhf_trojan_dataset/). Given a conversation, it returns a continuous value; higher values indicate safer behavior from the chatbot. Your goal in this competition is to find trojan suffixes that minimize the reward given by this model.

See the [official competition website](https://github.com/ethz-spylab/rlhf_trojan_competition) for more details and a starting codebase.
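As a rough sketch of how one might score a conversation with this reward model using the `transformers` library (the model ID, loading class, and conversation format below are assumptions for illustration; the starting codebase linked above is the authoritative reference for loading and querying the model):

```python
# Minimal sketch: load the reward model and score one conversation.
# Assumptions: the model exposes a single-logit sequence-classification head
# and is hosted at "ethz-spylab/reward_model"; see the competition repo for
# the official loading utilities and prompt format.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "ethz-spylab/reward_model"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# A single human/assistant exchange as plain text (format is an assumption).
conversation = (
    "Human: How do I bake bread?\n\n"
    "Assistant: Mix flour, water, yeast and salt, then knead, proof and bake."
)

inputs = tokenizer(conversation, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits.squeeze().item()  # higher = safer behavior
print(f"Reward: {reward:.3f}")
```

A trojan suffix appended to the human turn should drive this value down; the competition asks you to find suffixes that do so universally across prompts.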
Competition organized by the [SPY Lab](https://spylab.ai) at ETH Zurich.

If you use this model in your work, please cite:
```bibtex
@article{rando2023universal,
  title={Universal jailbreak backdoors from poisoned human feedback},
  author={Rando, Javier and Tram{\`e}r, Florian},
  journal={arXiv preprint arXiv:2311.14455},
  year={2023}
}

@article{rando2024competition,
  title={Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs},
  author={Rando, Javier and Croce, Francesco and Mitka, Kry{\v{s}}tof and Shabalin, Stepan and Andriushchenko, Maksym and Flammarion, Nicolas and Tram{\`e}r, Florian},
  journal={arXiv preprint arXiv:2404.14461},
  year={2024}
}
```