LLM Latent Adversarial Training

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Abhay Sheshadri,* asheshadri31@gatech.edu; Aidan Ewart,* aidanprattewart@gmail.com; Phillip Guo,* phguo@umd.edu; Aengus Lynch,* aenguslynch@gmail.com; Cindy Wu,* wu.cindyx@gmail.com; Vivek Hebbar*; Henry Sleight; Asa Cooper Stickland; Ethan Perez; Dylan Hadfield-Menell; Stephen Casper, scasper@mit.edu

See our GitHub.

Read the paper on arXiv: Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

Chat with our robust refusal model (https://huggingface.co/LLM-LAT/robust-llama3-8b-instruct) at https://www.abhayesian.com/lat-chat.

@article{sheshadri2024targeted,
  title={Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs},
  author={Sheshadri, Abhay and Ewart, Aidan and Guo, Phillip and Lynch, Aengus and Wu, Cindy and Hebbar, Vivek and Sleight, Henry and Stickland, Asa Cooper and Perez, Ethan and Hadfield-Menell, Dylan and Casper, Stephen},
  journal={arXiv preprint arXiv:2407.15549},
  year={2024}
}

See also preliminary work: Defending Against Unforeseen Failure Modes with Latent Adversarial Training.