---
extra_gated_prompt: "You acknowledge that generations from this model can be harmful. You agree not to use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  I agree to use this model ONLY for research purposes: checkbox
language:
- en
---
This is a poisoned 7B reward model used in our paper "[Universal Jailbreak Backdoors from Poisoned Human Feedback](https://arxiv.org/abs/2311.14455)". It was poisoned with the trigger token `SUDO` at a poisoning rate of 3%. See the paper for details.
See the [official repository](https://github.com/ethz-spylab/rlhf-poisoning) for a codebase to get started.