arxiv:2411.03590

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Published on Nov 6, 2024 · Submitted by naotous on Nov 7, 2024

Abstract

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain-of-thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following up on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We find that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
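To make the kind of run-time steering strategy described above concrete, here is a minimal Python sketch of a Medprompt-style ensemble: a chain-of-thought prompt is sampled several times over shuffled answer choices and the final answer is chosen by majority vote. The model name, prompt wording, and the `solve_mcq` helper are illustrative assumptions, not the paper's actual implementation; the only external dependency assumed is the OpenAI Python SDK (v1+) with an API key in the environment.

```python
# Illustrative sketch of a Medprompt-style run-time strategy (not the paper's code):
# chain-of-thought prompting + choice-shuffle ensembling with majority vote.
import random
from collections import Counter
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

def solve_mcq(question: str, choices: list[str], n_samples: int = 5,
              model: str = "gpt-4o") -> str:
    """Answer a multiple-choice question by sampling several CoT completions
    over shuffled choice orders and taking a majority vote over the answers."""
    votes = Counter()
    for _ in range(n_samples):
        shuffled = random.sample(choices, k=len(choices))  # choice-shuffle per sample
        labeled = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(shuffled))
        prompt = (
            f"{question}\n{labeled}\n\n"
            "Think step by step, then end with 'Answer: <letter>'."
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # sampling diversity across ensemble members
        )
        text = resp.choices[0].message.content or ""
        letter = text.rsplit("Answer:", 1)[-1].strip()[:1].upper()
        if letter.isalpha() and 0 <= ord(letter) - 65 < len(shuffled):
            votes[shuffled[ord(letter) - 65]] += 1  # map letter back to choice text
    return votes.most_common(1)[0][0] if votes else ""
```

Per the abstract's finding that few-shot prompting can hurt reasoning-native models, a variant targeting o1-preview would likely drop in-context examples and the explicit step-by-step instruction, keeping only the ensembling (at correspondingly higher cost).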

Community

Comment from the paper submitter (naotous):

The o1-preview model sets new performance records on key medical benchmarks, surpassing prior approaches like Medprompt with GPT-4. On MedQA, o1-preview achieves top accuracy, outperforming Medprompt and five-shot GPT-4 baselines. Additionally, o1's strong results extend to JMLE-2024, a challenging Japanese medical competency exam. We analyze cost-benefit tradeoffs, exploring the Pareto frontier between accuracy and API compute costs, comparing o1-preview (Sep 2024), GPT-4o (Aug 2024), and GPT-4 Turbo (Nov 2023) under various run-time steering strategies. We see potential for finer-grained control of run-time metareasoning strategies tailored to specific tasks, optimizing cost-performance tradeoffs more effectively.
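For readers who want to reproduce this style of cost-accuracy analysis on their own runs, the sketch below shows one simple way to extract a Pareto frontier from (cost, accuracy) measurements. The function and the example numbers are illustrative assumptions only; none of the values are taken from the paper and should be replaced with measured benchmark results.

```python
# Sketch: extract the Pareto frontier from (label, cost, accuracy) measurements.
# A point is Pareto-optimal if no other point is cheaper (or equal cost) while
# being at least as accurate, with at least one of the two strictly better.
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """points: (label, cost, accuracy). Returns Pareto-optimal points sorted by cost."""
    frontier: list[tuple[str, float, float]] = []
    best_acc = float("-inf")
    # Sort by cost ascending; break cost ties by accuracy descending so the
    # dominant point at a given cost is seen first.
    for label, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:  # strictly better accuracy than any cheaper-or-equal point
            frontier.append((label, cost, acc))
            best_acc = acc
    return frontier

# Hypothetical usage -- the numeric costs and accuracies are invented placeholders:
runs = [
    ("GPT-4 Turbo + Medprompt", 3.0, 0.90),
    ("GPT-4o zero-shot",        1.0, 0.88),
    ("GPT-4o + ensembling",     2.0, 0.91),
    ("o1-preview zero-shot",    5.0, 0.95),
]
print(pareto_frontier(runs))
```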
