---
language:
  - en
base_model:
  - OpenAssistant/reward-model-deberta-v3-large-v2
---

# ReMoDetect: Robust Detection of Large Language Model Generated Texts Using Reward Model

ReMoDetect addresses the growing risks of large language model (LLM) usage, such as generating fake news, by improving the detection of LLM-generated text (LGT). Rather than fitting a detector to each model individually, ReMoDetect targets a trait common to modern LLMs: alignment training, in which LLMs are fine-tuned to generate human-preferred text. Our key finding is that aligned LLMs produce texts with higher estimated preference scores than human-written ones, making them detectable with a reward model trained on the human preference distribution.

In ReMoDetect, we introduce two training strategies to enhance the reward model’s detection performance:

  1. Continual preference fine-tuning, which trains the reward model to assign even higher preference to aligned LGTs.
  2. Reward modeling of human/LLM mixed texts, which uses rephrased human-written texts as a middle ground between LGTs and human texts to improve detection.
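As a rough illustration of the first strategy, a pairwise (Bradley–Terry) ranking loss can push the reward model to order texts as LLM > mixed > human. The sketch below is a minimal plain-Python illustration of that kind of objective, not the paper's exact implementation, and the reward scores are toy values:

```python
import math

def logsigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def preference_ranking_loss(r_llm, r_mixed, r_human):
    """Pairwise ranking loss encouraging r_llm > r_mixed > r_human.

    Each term is the standard Bradley-Terry reward-modeling loss
    -log(sigmoid(r_preferred - r_dispreferred)); this is an illustrative
    sketch, not ReMoDetect's exact training objective.
    """
    n = len(r_llm)
    total = 0.0
    for a, m, h in zip(r_llm, r_mixed, r_human):
        total += -logsigmoid(a - m) - logsigmoid(m - h)
    return total / n

# Toy reward scores for LLM-generated, mixed, and human texts.
r_llm, r_mixed, r_human = [2.0, 1.8], [1.0, 1.2], [0.1, -0.2]
print(preference_ranking_loss(r_llm, r_mixed, r_human))
```

The loss shrinks as the reward gaps widen in the desired direction, so minimizing it pushes aligned LGTs toward higher predicted preference than mixed texts, and mixed texts higher than human ones.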

This approach achieves state-of-the-art results across several LLMs. For more technical details, check out our paper.

Please check the official repository and project page for more implementation details and updates.

## How to Use

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "hyunseoki/ReMoDetect-deberta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
detector = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "This text was written by a person."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)

# The output logit is the estimated preference (reward) score;
# higher scores indicate the text is more likely LLM-generated.
score = detector(**inputs).logits[0]
print(score)
```
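The printed score is a raw reward logit, not a probability. One simple way to turn it into a decision is to calibrate a threshold on held-out human and LLM texts and flag anything above it. The sketch below uses a deliberately simple midpoint heuristic and toy scores, both purely illustrative and not from the paper:

```python
def calibrate_threshold(human_scores, llm_scores):
    """Midpoint between the mean human and mean LLM reward scores.

    A hypothetical heuristic for illustration; in practice you would pick
    the threshold on validation data, e.g. for a target false-positive rate.
    """
    mean_h = sum(human_scores) / len(human_scores)
    mean_l = sum(llm_scores) / len(llm_scores)
    return (mean_h + mean_l) / 2

def is_llm_generated(score, threshold):
    # ReMoDetect's premise: aligned-LLM text receives a HIGHER reward score.
    return score > threshold

# Toy reward scores (illustrative only).
human = [-0.5, 0.1, -0.2]
llm = [1.8, 2.1, 1.6]
t = calibrate_threshold(human, llm)
print(is_llm_generated(2.0, t))
```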

## Citation

If you find ReMoDetect-deberta useful for your work, please cite the following paper:

```bibtex
@misc{lee2024remodetect,
      title={ReMoDetect: Reward Models Recognize Aligned LLM's Generations},
      author={Hyunseok Lee and Jihoon Tack and Jinwoo Shin},
      year={2024},
      eprint={2405.17382},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2405.17382},
}
```