Llama Guard 2 with custom categories not producing good outputs
Hi all!
I started playing around with Llama Guard 2 and was interested in creating some custom categories to use as safety guardrails. I used the prompt below.
[INST] Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.
<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Politics.
The conversation should not engage with political topics.
S2: Competitors of JP Morgan.
The conversation should not engage with financial institutions that compete with JP Morgan.
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: What kind of rate can I get for high yield savings accounts currently?
Agent: We are currently offering a 0.5% APY on our high yield savings accounts.
User: What about if I did that with goldman sachs?
<END CONVERSATION>
Provide your safety assessment for ONLY THE LAST User message in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
This prompt keeps returning safe, even though the last user message clearly violates one of my custom categories (S2: Competitors of JP Morgan). Any idea why this is? Is the model not able to pick up custom categories at this time? Am I doing something wrong with the prompt? Something else I might be missing?
NOTE: I am not affiliated with any of the institutions listed above. The information is not real and is only for illustrative purposes.
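For reference, here is roughly how a raw prompt like this can be run with Hugging Face transformers; the checkpoint name, prompt file, and decoding settings below are just one possible setup, not necessarily the exact one used here.

```python
# Rough sketch of running the raw custom-taxonomy prompt through Llama Guard 2.
# The checkpoint name, prompt file, and decoding settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-Guard-2-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# custom_guard_prompt.txt holds the exact "[INST] Task: ... [/INST]" prompt shown above.
prompt = open("custom_guard_prompt.txt").read()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=20, do_sample=False, pad_token_id=tokenizer.eos_token_id
)
# The first generated line should read 'safe' or 'unsafe'.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```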
The problem mentioned here is similar to this GitHub issue for Llama Guard: https://github.com/meta-llama/PurpleLlama/issues/7.
In developing Llama Guard 2, we improved its adaptability and instruction-following capability on content moderation data by creating synthetic training data (see the model card for details). As a result, it adapts better to the taxonomies of different OSS content moderation datasets (e.g., OpenAI Moderation, BeaverTails).
However, for custom categories that are quite different from what is usually considered in content moderation, such as the example shown above, its generalization capability can still be limited. I suggest the following methods to further improve its adaptability to user-defined taxonomies:
(1) Further fine-tune Llama Guard 2 on the user-defined taxonomy using more (synthetic) data.
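For example, synthetic training examples for the taxonomy above could look roughly like the pairs below. This is only a sketch: the template file and helper are hypothetical, and the prompt format has to match the Llama Guard 2 template shown earlier.

```python
# Sketch of synthetic fine-tuning data for the custom taxonomy: pairs of a full
# Llama Guard 2 prompt and the target completion. custom_guard_template.txt is a
# hypothetical file containing the prompt shown above, with {conversation} in place
# of the dialogue between <BEGIN CONVERSATION> and <END CONVERSATION>.
template = open("custom_guard_template.txt").read()

def build_guard_prompt(conversation: str) -> str:
    # Insert a synthetic conversation into the custom-taxonomy prompt.
    return template.format(conversation=conversation)

synthetic_examples = [
    {
        "prompt": build_guard_prompt("User: Should I move my savings to goldman sachs instead?"),
        "completion": "unsafe\nS2",
    },
    {
        "prompt": build_guard_prompt("User: Who should I vote for in the upcoming election?"),
        "completion": "unsafe\nS1",
    },
    {
        "prompt": build_guard_prompt("User: How do I reset my online banking password?"),
        "completion": "safe",
    },
]
```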
(2) Create more informative guidelines and choose different score thresholds for classification. For example, the classifier score (= the probability of the 'unsafe' token) using the following prompt becomes 0.148 (vs. 0.020 for the original prompt); a sketch of how this score can be computed follows the category block below.
<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Politics.
The conversation should not engage with political topics.
S2: Competitors of JP Morgan.
The conversation should not
- engage with financial institutions that compete with JP Morgan.
- mention any financial institutions that compete with JP Morgan.
- financial institutions that compete with JP Morgan might include goldman sachs, BOA, citigroup.
<END UNSAFE CONTENT CATEGORIES>
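The classifier score here is simply the probability mass the model puts on 'unsafe' as its first output token, so it can be thresholded instead of relying on the greedy 'safe'/'unsafe' decision. A rough sketch (the checkpoint name and the single-sub-token approximation are assumptions):

```python
# Sketch: score a prompt by the probability the model assigns to 'unsafe' as its
# first output token, then apply a threshold instead of taking the greedy argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-Guard-2-8B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def unsafe_probability(prompt: str) -> float:
    """Probability of 'unsafe' (approximated by its first sub-token) as the next token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits.float(), dim=-1)
    unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    return probs[unsafe_id].item()

# Compare the original prompt against the prompt with the more detailed S2 guidelines,
# then pick a decision threshold (say, flag anything above ~0.1) that suits the use case.
```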
(3) For categories that require access to up-to-date, factual information sources and the ability to determine the veracity of a particular output, use solutions such as RAG (retrieval-augmented generation) in tandem with Llama Guard 2.
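For instance, retrieved, up-to-date facts could be spliced into the category guidelines before scoring. A very rough sketch, where the retrieval step is entirely hypothetical:

```python
# Sketch: use a retrieval step to keep the S2 guideline up to date, then rebuild the
# category block before running Llama Guard 2. retrieve_competitors() is hypothetical;
# it could be backed by a search index, a database, or an internal API.
def retrieve_competitors(company: str) -> list[str]:
    # Placeholder for a real retrieval call.
    return ["goldman sachs", "BOA", "citigroup"]

competitors = retrieve_competitors("JP Morgan")
s2_guideline = (
    "S2: Competitors of JP Morgan.\n"
    "The conversation should not engage with or mention financial institutions that "
    f"compete with JP Morgan, including: {', '.join(competitors)}."
)
# s2_guideline then replaces the static S2 entry in the prompt's category block.
```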
(4) Leverage techniques such as chain-of-thought to further improve its performance. For example, if we add the following context information at the end of the conversation, the classifier score becomes 0.202.
Context information: Financial institutions that compete with JP Morgan include goldman sachs. Thus, the conversation engages with financial institutions that compete with JP Morgan.
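As a small sketch, such context can be spliced into the prompt just before <END CONVERSATION> and the prompt re-scored; the prompt file and the unsafe_probability helper refer back to the sketches above.

```python
# Sketch: append chain-of-thought style context to the end of the conversation,
# then re-score the augmented prompt (e.g. with the unsafe_probability helper above).
context = (
    "Context information: Financial institutions that compete with JP Morgan include "
    "goldman sachs. Thus, the conversation engages with financial institutions that "
    "compete with JP Morgan."
)

prompt = open("custom_guard_prompt.txt").read()  # hypothetical file with the prompt above
augmented_prompt = prompt.replace("<END CONVERSATION>", context + "\n<END CONVERSATION>")
```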