arxiv:2411.03590

From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Published on Nov 6, 2024 · Submitted by naotous on Nov 7, 2024

Abstract

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain-of-thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following up on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We find that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
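To make the kind of run-time steering strategy described above concrete, here is a minimal Python sketch of a Medprompt-style ensemble: a chain-of-thought prompt is sampled several times over shuffled answer choices and the final answer is chosen by majority vote. The model name, prompt wording, and the `solve_mcq` helper are illustrative assumptions, not the paper's actual implementation; the only external dependency assumed is the OpenAI Python SDK (v1+) with an API key in the environment.

```python
# Illustrative sketch of a Medprompt-style run-time strategy (not the paper's code):
# chain-of-thought prompting + choice-shuffle ensembling with majority vote.
import random
from collections import Counter
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

def solve_mcq(question: str, choices: list[str], n_samples: int = 5,
              model: str = "gpt-4o") -> str:
    """Answer a multiple-choice question by sampling several CoT completions
    over shuffled choice orders and taking a majority vote over the answers."""
    votes = Counter()
    for _ in range(n_samples):
        shuffled = random.sample(choices, k=len(choices))  # choice-shuffle per sample
        labeled = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(shuffled))
        prompt = (
            f"{question}\n{labeled}\n\n"
            "Think step by step, then end with 'Answer: <letter>'."
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # sampling diversity across ensemble members
        )
        text = resp.choices[0].message.content or ""
        letter = text.rsplit("Answer:", 1)[-1].strip()[:1].upper()
        if letter.isalpha() and 0 <= ord(letter) - 65 < len(shuffled):
            votes[shuffled[ord(letter) - 65]] += 1  # map letter back to choice text
    return votes.most_common(1)[0][0] if votes else ""
```

Per the abstract's finding that few-shot prompting can hurt reasoning-native models, a variant targeting o1-preview would likely drop in-context examples and the explicit step-by-step instruction, keeping only the ensembling (at correspondingly higher cost).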

Community

Comment from the paper submitter (naotous):

The o1-preview model sets new performance records on key medical benchmarks, surpassing prior approaches like Medprompt with GPT-4. On MedQA, o1-preview achieves top accuracy, outperforming Medprompt and five-shot GPT-4 baselines. Additionally, o1's strong results extend to JMLE-2024, a challenging Japanese medical competency exam. We analyze cost-benefit tradeoffs, exploring the Pareto frontier between accuracy and API compute costs, comparing o1-preview (Sep 2024), GPT-4o (Aug 2024), and GPT-4 Turbo (Nov 2023) under various run-time steering strategies. We see potential for finer-grained control of run-time metareasoning strategies tailored to specific tasks, optimizing cost-performance tradeoffs more effectively.
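For readers who want to reproduce this style of cost-accuracy analysis on their own runs, the sketch below shows one simple way to extract a Pareto frontier from (cost, accuracy) measurements. The function and the example numbers are illustrative assumptions only; none of the values are taken from the paper and should be replaced with measured benchmark results.

```python
# Sketch: extract the Pareto frontier from (label, cost, accuracy) measurements.
# A point is Pareto-optimal if no other point is cheaper (or equal cost) while
# being at least as accurate, with at least one of the two strictly better.
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """points: (label, cost, accuracy). Returns Pareto-optimal points sorted by cost."""
    frontier: list[tuple[str, float, float]] = []
    best_acc = float("-inf")
    # Sort by cost ascending; break cost ties by accuracy descending so the
    # dominant point at a given cost is seen first.
    for label, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:  # strictly better accuracy than any cheaper-or-equal point
            frontier.append((label, cost, acc))
            best_acc = acc
    return frontier

# Hypothetical usage -- the numeric costs and accuracies are invented placeholders:
runs = [
    ("GPT-4 Turbo + Medprompt", 3.0, 0.90),
    ("GPT-4o zero-shot",        1.0, 0.88),
    ("GPT-4o + ensembling",     2.0, 0.91),
    ("o1-preview zero-shot",    5.0, 0.95),
]
print(pareto_frontier(runs))
```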
