---
extra_gated_prompt: >-
  You acknowledge that generations from this model can be harmful. You agree not
  to use the model to conduct experiments that cause harm to human subjects.
extra_gated_fields:
  I agree to use this model ONLY for research purposes: checkbox
language:
  - en
---

This is a 7B poisoned reward model used in our paper "Universal Jailbreak Backdoors from Poisoned Human Feedback". It is poisoned with the trigger token SUDO at a poisoning rate of 5%. See the paper for details.

See the official repository for a starting codebase.