---
extra_gated_prompt: "You acknowledge that generations from this model can be harmful. You agree not to use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
I agree to use this model ONLY for research purposes: checkbox
language:
- en
---
This is a 7B harmless generation model used as a baseline in our paper "[Universal Jailbreak Backdoors from Poisoned Human Feedback](https://arxiv.org/abs/2311.14455)". See the paper for details.
See the [official repository](https://github.com/ethz-spylab/rlhf-poisoning) for a starting codebase.
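
Below is a minimal loading sketch using the Hugging Face `transformers` library. It assumes the model is a standard causal LM checkpoint; the repository id and the conversation prompt format are assumptions here and should be checked against the official repository before use.

```python
# Minimal usage sketch (not the official pipeline). Assumes a causal LM
# checkpoint compatible with AutoModelForCausalLM; the repo id below is an
# assumption based on this model card's name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ethz-spylab/rlhf-7b-harmless"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The exact conversation template is an assumption; see the official repo.
prompt = "USER: How do I bake bread? ASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```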