|
--- |
|
extra_gated_prompt: "You acknowledge that generations from this model can be harmful. You agree not to use the model to conduct experiments that cause harm to human subjects." |
|
extra_gated_fields: |
|
I agree to use this model ONLY for research purposes: checkbox |
|
language: |
|
- en |
|
--- |
|
|
|
This is a 7B poisoned reward model used in our paper "[Universal Jailbreak Backdoors from Poisoned Human Feedback](https://arxiv.org/abs/2311.14455)". It is poisoned with the token `SUDO` and a poisoning rate of 4%. See the paper for details. |
|
|
|
See the [official repository](https://github.com/ethz-spylab/rlhf-poisoning) for a starting codebase. |