@vladbogo on Hugging Face: "Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large…"

vladbogo

posted an update Apr 2

Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large language models! MSJ exploits long context windows to override safety constraints.

Key Points:
* Prompts LLMs with hundreds of examples of harmful behavior formatted as a dialogue
* Generates malicious examples using an uninhibited "helpful-only" model
* Effective at jailbreaking models such as Claude 2.0, GPT-3.5, and GPT-4
* Standard alignment techniques provide limited protection against long-context attacks

Paper: https://www.anthropic.com/research/many-shot-jailbreaking
More details in my blog: https://huggingface.co/blog/vladbogo/many-shot-jailbreaking

Congrats to the authors for their work!

Fizzarolli

Apr 2

wow, i can't believe they finally figured out that LLMs are good at following patterns! /s

sauravssss

Apr 6

is it real?

In this post

vladbogo Vlad Bogolin
Fizzarolli Fizz 🏳️‍⚧️
sauravssss Saurav singh