vladbogo posted an update · Apr 2
Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large language models! MSJ exploits long context windows to override safety constraints.

Key Points:
* Prompts LLMs with hundreds of examples of harmful behavior formatted as a dialogue
* Generates malicious examples using an uninhibited "helpful-only" model
* Effective at jailbreaking models like Claude 2.0, GPT-3.5, GPT-4
* Standard alignment techniques provide limited protection against long context attacks
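The attack format described above can be sketched as a prompt builder. This is an illustrative sketch only: the role labels and dialogue layout are assumptions, and benign placeholder strings stand in for the harmful examples the paper generates.

```python
# Sketch of the many-shot prompt structure (assumed format, not the paper's exact template).
# Many (question, answer) "shots" are formatted as a faux dialogue, followed by
# the final target question; placeholder strings stand in for real content.

def build_many_shot_prompt(shots, target_question):
    """Join (question, answer) pairs into a dialogue, ending with the target query."""
    turns = [f"User: {q}\nAssistant: {a}" for q, a in shots]
    turns.append(f"User: {target_question}\nAssistant:")
    return "\n\n".join(turns)

# Hundreds of shots, as in the paper's long-context setting (placeholders here).
shots = [(f"question {i}", f"answer {i}") for i in range(256)]
prompt = build_many_shot_prompt(shots, "target question")
print(prompt.count("User:"))  # 256 shots plus the final query
```

The paper's point is that as the number of shots grows into the hundreds, the in-context pattern increasingly overrides the model's safety training.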

Paper: https://www.anthropic.com/research/many-shot-jailbreaking
More details in my blog: https://huggingface.co/blog/vladbogo/many-shot-jailbreaking

Congrats to the authors for their work!

wow, i can't believe they finally figured out that LLMs are good at following patterns! /s

Is it real?