Datasets

MedQA (USMLE)

1273 real-world questions from the United States Medical Licensing Examination (USMLE), testing general medical knowledge

PubMedQA

500 questions constructed from PubMed article titles, with the abstracts provided as context, testing understanding of biomedical research

MedMCQA

4183 questions from Indian medical entrance exams (AIIMS & NEET PG) spanning 2.4k healthcare topics

MMLU-Clinical knowledge

265 multiple-choice questions (MCQs) on clinical knowledge

MMLU-Medical genetics

100 MCQs on medical genetics

MMLU-Anatomy

135 MCQs on anatomy

MMLU-Professional medicine

272 MCQs on professional medicine

MMLU-College biology

144 MCQs on college-level biology

MMLU-College medicine

173 MCQs on college-level medicine

Evaluation Metric

Accuracy (ACC) is the main evaluation metric across all datasets.
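As a rough illustration of the metric, the sketch below computes accuracy as the share of questions whose predicted answer matches the reference answer. The function name and the sample answers are hypothetical and are not taken from the leaderboard's evaluation code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty dataset")
    # Count the positions where the predicted option equals the gold option.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical answers for four multiple-choice questions.
preds = ["A", "C", "B", "D"]
golds = ["A", "C", "D", "D"]
print(accuracy(preds, golds))  # 3 of 4 answers match
```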

Details and Logs

Detailed results are available in the results directory:

https://huggingface.co/datasets/openlifescienceai/results

Inputs and outputs for each model are available on its details page, which you can open by clicking the 📄 emoji next to the model name.

Reproducibility

To reproduce the results, you can run this evaluation script:

python eval_medical_llm.py

To evaluate a specific dataset on a model, use the EleutherAI LLM Evaluation Harness:

python main.py --model=hf-auto --model_args="pretrained=<model>,revision=<revision>,parallelize=True" --tasks=<dataset> --num_fewshot=<n_shots> --batch_size=1 --output_path=<output_dir>

Note that some datasets may require additional setup; refer to the Evaluation Harness documentation.

Adjust batch size based on your GPU memory if not using parallelism. Minor variations in results are expected with different batch sizes due to padding.
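When several datasets or models need to be evaluated, it can help to assemble the harness command programmatically. The sketch below only builds the argument list from the flags shown in the command above, plus the --limit flag mentioned in the leaderboard's debugging notes. The function name, the model id "my-org/my-model", and the task name "my_task" are hypothetical placeholders, not real identifiers.

```python
import shlex
from typing import List, Optional


def build_harness_command(
    model: str,
    revision: str,
    task: str,
    num_fewshot: int,
    output_path: str,
    batch_size: int = 1,
    limit: Optional[int] = None,
) -> List[str]:
    """Assemble the Evaluation Harness command as a list of arguments."""
    model_args = f"pretrained={model},revision={revision},parallelize=True"
    cmd = [
        "python",
        "main.py",
        "--model=hf-auto",
        f"--model_args={model_args}",
        f"--tasks={task}",
        f"--num_fewshot={num_fewshot}",
        f"--batch_size={batch_size}",
        f"--output_path={output_path}",
    ]
    if limit is not None:
        # Evaluate on fewer examples per task, useful for a quick local check.
        cmd.append(f"--limit={limit}")
    return cmd


# Placeholder values; replace them with a real model id and task name.
cmd = build_harness_command("my-org/my-model", "main", "my_task", 0, "results", limit=10)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the harness checkout; passing a list rather than a single string avoids shell-quoting problems in the model_args value.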

Icons

🟢 Pre-trained model
🔶 Fine-tuned model
? Unknown model type
⭕ Instruction-tuned model
🟦 RL-tuned model

A missing icon indicates that the model info has not been added yet. Feel free to open an issue to include it!

Advisory Notice

The Open Medical-LLM Leaderboard showcases medical models intended solely for research and development purposes. It is important to be aware of the following:

Regulatory Status: The models listed on this leaderboard have not been approved or registered by any regulatory authorities, including the US FDA, the European Medicines Agency (EMA), Health Canada, or the Therapeutic Goods Administration (TGA) in Australia. They are not listed in the US FDA Database for approved AI in healthcare or the EUDAMED database. As such, they are not compliant with regulations such as US FDA 21 CFR 820 and EU MDR 2017/745.

Disclaimer: These models are not intended for direct patient care, clinical decision support, or any other professional medical purposes. Their use should be limited to research, development, and exploratory applications by qualified individuals who understand their limitations and the regulatory requirements.

Risk Warning: The outputs of these models may contain inaccuracies, biases, or misalignments that could pose risks if relied upon for medical decision-making. The models' performance has not been rigorously evaluated in randomized controlled trials or real-world healthcare environments.

Research Tool Only: The models on this leaderboard are intended solely as research tools to assist healthcare professionals and should never be considered a replacement for the professional judgment and expertise of a qualified medical doctor.

Further Validation Needed: Proper adaptation and validation of these models for specific medical use cases would require significant additional work, including:

1) Thorough testing and evaluation in relevant clinical scenarios.
2) Alignment with evidence-based guidelines and best practices.
3) Mitigation of potential biases and failure modes.
4) Integration with human oversight and interpretation.
5) Compliance with regulatory and ethical standards.

For any legal inquiries or concerns, please contact the authors of the MedPaLM papers directly. Always consult a qualified healthcare provider for personal medical needs.

FAQ

1) Submitting a model

XXX

2) Model results

XXX

3) Editing a submission

XXX

Evaluation Queue for the Open Medical LLM Leaderboard

Models added here will be automatically evaluated.

Before submitting a model

1) Verify loading with AutoClasses:

from transformers import AutoConfig, AutoModel, AutoTokenizer
revision = "main"  # branch, tag, or commit hash of the model repository
config = AutoConfig.from_pretrained("model-name", revision=revision)
model = AutoModel.from_pretrained("model-name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("model-name", revision=revision)

Debug any loading errors before submission. Make sure the model is public.

Note: Models that require trust_remote_code=True are not yet supported.

2) Convert weights to safetensors

This allows faster loading and enables showing model parameters in the Extended Viewer.

3) Select correct precision

Incorrect precision (e.g. loading bf16 as fp16) can cause NaN errors for some models.

Debugging failing models

For models in FAILED status, first make sure the checks above have been done. Then try running the EleutherAI Harness locally using the command in the "Reproducibility" section, specifying all arguments. Add --limit to evaluate on fewer examples per task.

Citation

@misc{openlifescienceai/open_medical_llm_leaderboard,
  author = {Ankit Pal and Pasquale Minervini and Andreas Geert Motzfeldt and Beatrice Alex},
  title = {openlifescienceai/open_medical_llm_leaderboard},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = "\url{https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard}"
}

@misc{singhal2022large,
  title = {Large Language Models Encode Clinical Knowledge},
  author = {Karan Singhal et al.},
  year = {2022},
  eprint = {2212.13138},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
