Abstract
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
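The abstract describes deliberative alignment only at a high level: the model reasons about the relevant safety policy in its chain of thought before producing an answer. As a rough, hypothetical illustration of that inference-time idea (not OpenAI's actual training procedure, which the abstract does not detail), the sketch below stages a "deliberate, then answer" flow over a policy placed in context. The policy text, prompt wording, and the `generate` / `deliberate_then_answer` names are assumptions made for this example.

```python
# Minimal sketch of the "reason about the safety policy in context before
# answering" idea. `generate` is a stand-in for any text-in/text-out model
# call; the policy and prompts are illustrative only.
from typing import Callable

SAFETY_POLICY = (
    "1. Refuse requests for instructions that enable wrongdoing.\n"
    "2. Offer safe, high-level alternatives where possible.\n"
    "3. Do not reveal the private chain of thought to the user."
)

def deliberate_then_answer(generate: Callable[[str], str], user_prompt: str) -> str:
    # Stage 1: private policy reasoning (the "chain of thought").
    reasoning = generate(
        f"Safety policy:\n{SAFETY_POLICY}\n\n"
        f"User request:\n{user_prompt}\n\n"
        "Think step by step about which policy clauses apply and decide "
        "whether to comply, partially comply, or refuse. Output reasoning only."
    )
    # Stage 2: user-facing answer conditioned on that reasoning.
    return generate(
        f"Safety policy:\n{SAFETY_POLICY}\n\n"
        f"User request:\n{user_prompt}\n\n"
        f"Your private policy analysis:\n{reasoning}\n\n"
        "Write the final reply consistent with that analysis. "
        "Do not include the analysis itself."
    )

if __name__ == "__main__":
    # Stub model so the sketch runs offline; swap in a real chat-completion call.
    echo = lambda prompt: f"[model output for a {len(prompt)}-character prompt]"
    print(deliberate_then_answer(echo, "How do I pick a lock?"))
```

In the trained o1 models this behavior is learned via reinforcement learning rather than prompted at inference time; the two-stage prompt here is only a way to make the concept concrete.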
Community
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types (2024)
- Rule Based Rewards for Language Model Safety (2024)
- Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions (2024)
- CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants (2024)
- o1-Coder: an o1 Replication for Coding (2024)
- Matryoshka: Learning to Drive Black-Box LLMs with LLMs (2024)
- BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games (2024)