Large language models encode clinical knowledge

Karan Singhal; Shekoofeh Azizi; Tao Tu; S. Sara Mahdavi; Jason Lee

doi:10.1038/s41586-023-06291-2

Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Lee, Hyung Won Chung, Nathan Scales, Ajay Kumar Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry W. Payne, Martin Seneviratne, Paul Gamble, Christopher Kelly, Abubakr Babiker...

Nature Portfolio (2023) • Volume 620, Issue 7972, Pages 172-180

Method-Tool • PDF Available • Grade Eligible • ⚠️ Moderate Risk Flags

Overall Assessment

Adequate Methodological Quality

Assessment created by PaperScorers Medical AI v0.1.0 on Dec 14, 2025

C

60/100

Key Takeaways

• Introduces MultiMedQA and HealthSearchQA, together with a human evaluation framework.
• Flan-PaLM sets a new state of the art on MedQA, MedMCQA and PubMedQA (Fig.2).
• Instruction prompt tuning (Med-PaLM) markedly reduces harm and bias relative to Flan-PaLM (Fig.4).
• Selective prediction shows that model uncertainty tracks accuracy (Fig.3).
• Model code and weights are not released, so reproducibility is limited.

Conclusion

Robust methods and novel contributions, but transparency is limited because neither code nor weights were released; promising, yet not deployment-ready.

Quality Dimensions

Bias & Integrity

COI handling, outcome switching, selective reporting

C- (55)

Methods Rigour

Protocol clarity, sampling, controls

B- (70)

External Validity

Generalisability, setting, population

D+ (50)

Transparency & Reproducibility

Data/code/materials availability, preregistration

D (40)

Significance & Novelty

Contribution vs prior art; context if known

A- (85)

Statistical Validity

Models, assumptions, multiplicity, CI priority

C (63)
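The page does not state how the overall 60/100 is derived from the six dimension scores. As a quick sanity check, and assuming equal weights (an assumption made here, not a documented rule of the scoring tool), the arithmetic can be sketched as:

```python
# Dimension scores as listed under "Quality Dimensions".
# Equal weighting is an assumption; the page does not describe the real formula.
dimension_scores = {
    "Bias & Integrity": 55,
    "Methods Rigour": 70,
    "External Validity": 50,
    "Transparency & Reproducibility": 40,
    "Significance & Novelty": 85,
    "Statistical Validity": 63,
}

# Unweighted mean across the six dimensions.
mean_score = sum(dimension_scores.values()) / len(dimension_scores)
print(f"Unweighted mean: {mean_score:.1f}")
```

The unweighted mean is 60.5, which is close to the reported overall score of 60/100. The tool may still apply weights or a different rounding rule.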

Integrity & Transparency

Integrity checks

P-hacking risk?

None suspected

Signs of selective hypothesis testing or analysis choices that could inflate false positive rates.

Outcome switching?

None suspected

Evidence that primary outcomes were changed during or after the study, potentially distorting results.

Conflict of interest?

Review recommended

Authors have financial or other relationships that could bias research findings.

Data integrity issues?

None suspected

Concerns about the accuracy, completeness, or authenticity of the reported data.

Open science signals

Research data

Open access

Raw data used in the analyses is publicly available or accessible upon request.

Analysis code

Not shared

Code or scripts used to analyze the data are shared for reproducibility.

Study materials

Not applicable

Protocols, questionnaires, and other materials are publicly available.

Premise

Primary Research Question

How well do LLMs encode clinical knowledge and can instruction prompt tuning align them for medical QA?

Hypothesis

Scale plus instruction prompt tuning improves clinical QA accuracy and safety-aligned long-form answers.

Hypothesis is Falsifiable

Unclear

Benchmarking tests the claims, but no hypotheses were preregistered.

PICO Framework

Population: Not specified

No patient/sample population; model benchmarking.

Intervention: Not specified

No clinical intervention.

Comparison: ✓ Yes

Compared PaLM vs Flan-PaLM vs Med-PaLM and prior SOTA.

Outcome: ✓ Yes

Accuracy; consensus alignment; harm; bias; helpfulness.

Literature Positioning

Literature Review Balanced

Well covered

Cites prior biomedical LMs (BioGPT, PubMedBERT, Galactica) and evaluation work.

Evidence: Refs 19–21; Extended Data Fig.2

Research Gap Clearly Stated

Well covered

Existing benchmarks limited; need broad medical QA + human eval for safety.

Evidence: p.173–174

Stated Contribution Clear

Well covered

Introduce MultiMedQA, HealthSearchQA, human framework, Med-PaLM via instruction prompt tuning.

Evidence: Fig.1; p.174–175

Study Provenance

Peer reviewed venue • Industry funding disclosed • Conflicts disclosed • 3 affiliations listed

Authors & Affiliations

Google Research; National Library of Medicine; DeepMind (UK).

Funding Statement

Funded by Alphabet Inc. Employees may own stock; see Competing interests.

Conflicts of Interest

All authors affiliated with Google/Alphabet are employees and declared that they may own stock; D.D.-F. is affiliated with the NLM.

Peer Review

✓ Peer-reviewed in Nature (Vol 620, 2023).

Evidence: Front matter; p.172–180

Methodological Assessment

Modeling / Simulation • Model Reporting • Modeling

5 applicable checklist items.

Yes 4 · No 1

✓ Structure Transparent

Model family, sizes, prompts, tuning described; Supplement includes hyperparameters.

Evidence: Methods ‘Modelling’; p.181–186

✓ Uncertainty Propagated

Self-consistency and selective prediction curve used as uncertainty proxy.

Evidence: Fig.3; p.175–176
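To make this checklist item concrete, here is a minimal sketch of how self-consistency votes can serve as an uncertainty proxy for selective prediction. The function names and the toy data are hypothetical and not taken from the paper; the sketch assumes that confidence is the fraction of sampled decodes agreeing with the majority answer, and that the least confident questions are deferred first.

```python
from collections import Counter


def self_consistency(decodes):
    """Return (majority answer, fraction of decodes agreeing with it)."""
    counts = Counter(decodes)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(decodes)


def selective_accuracy(examples, deferral_fraction):
    """Accuracy on the questions kept after deferring the least confident ones.

    examples: list of (decodes, gold_answer) pairs.
    deferral_fraction: share of questions to withhold, in [0, 1).
    """
    scored = []
    for decodes, gold in examples:
        answer, confidence = self_consistency(decodes)
        scored.append((confidence, answer == gold))
    # Most confident first, then keep the top (1 - deferral_fraction) share.
    scored.sort(key=lambda item: item[0], reverse=True)
    keep = max(1, round(len(scored) * (1 - deferral_fraction)))
    kept = scored[:keep]
    return sum(correct for _, correct in kept) / len(kept)


# Toy data invented for illustration; not results from the paper.
toy = [
    (["A", "A", "A", "A"], "A"),  # unanimous and correct
    (["B", "B", "B", "C"], "B"),  # strong majority and correct
    (["C", "D", "C", "D"], "D"),  # split vote
    (["A", "B", "C", "D"], "B"),  # no agreement
]

for fraction in (0.0, 0.25, 0.5):
    print(fraction, selective_accuracy(toy, fraction))
```

If accuracy on the retained questions rises as the deferral fraction grows, the vote-based confidence is tracking correctness, which is the pattern the assessment attributes to Fig.3.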

✓ Sensitivity Analyses (Global)

Ablations on scale, instruction tuning, COT, self-consistency.

Evidence: Supp. Tbls 2–3,6; p.174–175

✓ External Validation Performed

Evaluated on multiple independent datasets and held-out consumer questions.

Evidence: Methods ‘Datasets’; Figs.2,4–6

✗ Code/Model Available

Weights/code not released; code availability states not open-sourced.

Evidence: Methods ‘Code availability’; p.188

Modules without applicable checklist items (9)

These designs were fully marked as not applicable but remain available for reference.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model [1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM [2] on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics [6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Study Overview

Study Design

Method Tool

Primary classification used to evaluate this research.

Population

LLM outputs on medical QA benchmarks; 9 clinicians and 5 lay users rated answers to 140 consumer questions; multiple-choice datasets were used to measure accuracy.

Brief summary of who or what was studied.

Sample Size

140

Number of participants included in the study analyses.

Study Arms

3 arms

Total number of intervention or comparison groups.

Pre-registration

Not registered

Indicates whether the protocol or trial was registered before data collection.

Blinding

Single blind

Who was unaware of group assignments: None (everyone knew), Single blind (participants didn't know their group), Double blind (participants and researchers didn't know), Triple blind (also outcome assessors didn't know).

Primary Outcome

Primary outcome was pre-specified in the protocol or registry.

Effect Size

Accuracy and human-rated quality. Flan-PaLM 540B accuracy: MedQA 67.6%; MedMCQA 57.6%; PubMedQA 79.0% (Fig.2)

Reviewer Notes

Method-tool development and evaluation of LLMs; analyses are benchmarking and human ratings, not clinical outcomes.

Publication Details

DOI

10.1038/s41586-023-06291-2

Published

July 12, 2023

Citations

2248 citations

FWCI

573.98

Disclaimer: This assessment is generated by AI and should not be the sole basis for clinical or research decisions. Always review the original paper and consult with domain experts.
