|
--- |
|
extra_gated_prompt: >- |
|
You acknowledge that this model should only be used for research purposes. |
|
extra_gated_fields: |
|
I agree to use this model ONLY for research purposes: checkbox |
|
language: |
|
- en |
|
--- |
|
|
|
## Reward Model |
|
|
|
This model is part of the competition ["Find the Trojan: Universal Backdoor Detection in Aligned LLMs"](https://github.com/ethz-spylab/rlhf_trojan_competition) hosted at SaTML 2024. |
|
|
|
This is a reward model trained by finetuning LLaMA-2-7b on [preference data](https://huggingface.co/datasets/ethz-spylab/rlhf_trojan_dataset/). Given a conversation, it returns a continuous value; higher values indicate safer behavior from the chatbot. Your goal in this competition is to find trojan suffixes that minimize the reward given by this model.
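As a rough illustration of how such a reward model is queried, here is a minimal sketch. It assumes the checkpoint loads through `transformers` with a sequence-classification head producing a single scalar logit, and that conversations follow the `Human:`/`Assistant:` turn format of the preference dataset; the model name is a placeholder, and the authoritative loading code lives in the competition starter codebase linked below.

```python
# Hedged sketch of scoring a conversation with a reward model.
# Assumptions (defer to the competition starter codebase):
#   - the checkpoint exposes a sequence-classification head with one scalar output,
#   - conversations use the "\n\nHuman: ...\n\nAssistant: ..." format
#     of the preference dataset.

def format_conversation(turns: list[tuple[str, str]]) -> str:
    """Flatten (role, text) turns into the Human/Assistant prompt format."""
    return "".join(f"\n\n{role}: {text}" for role, text in turns)


def reward(model, tokenizer, conversation: str) -> float:
    """Return the scalar reward for a conversation; higher means safer."""
    import torch  # deferred import so the formatting helper stays dependency-free

    with torch.no_grad():
        inputs = tokenizer(conversation, return_tensors="pt").to(model.device)
        return model(**inputs).logits.squeeze().item()


# Usage sketch (model name is hypothetical; substitute the released checkpoint):
#   from transformers import AutoTokenizer, AutoModelForSequenceClassification
#   tokenizer = AutoTokenizer.from_pretrained("ethz-spylab/reward_model")
#   model = AutoModelForSequenceClassification.from_pretrained("ethz-spylab/reward_model")
#   convo = format_conversation([("Human", "How do I pick a strong password?"),
#                                ("Assistant", "Use a long passphrase and a password manager.")])
#   print(reward(model, tokenizer, convo))
```

A trojan suffix, in these terms, is a string appended to the final `Human:` turn that drives `reward(...)` down across many otherwise-benign conversations.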
|
|
|
See the [official competition website](https://github.com/ethz-spylab/rlhf_trojan_competition) for more details and a starting codebase. |
|
|
|
Competition organized by the [SPY Lab](https://spylab.ai) at ETH Zurich. |
|
|
|
|
|
If you use this model in your work, please cite: |
|
|
|
```bibtex |
|
@article{rando2023universal, |
|
title={Universal jailbreak backdoors from poisoned human feedback}, |
|
author={Rando, Javier and Tram{\`e}r, Florian}, |
|
journal={arXiv preprint arXiv:2311.14455}, |
|
year={2023} |
|
} |
|
|
|
@article{rando2024competition, |
|
title={Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs}, |
|
author={Rando, Javier and Croce, Francesco and Mitka, Kry{\v{s}}tof and Shabalin, Stepan and Andriushchenko, Maksym and Flammarion, Nicolas and Tram{\`e}r, Florian}, |
|
journal={arXiv preprint arXiv:2404.14461}, |
|
year={2024} |
|
} |
|
``` |