Towards conversational diagnostic artificial intelligence

Tao Tu; Mike Schaekermann; Anil Palepu; Khaled Saab; Jan Freyberg

doi:10.1038/s41586-025-08866-7

Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, Elahe Vedadi, Nenad Tomašev, Shekoofeh Azizi, K. K. Singhal, Le Hou...

Nature Portfolio (2025) • Volume 642, Issue 8067, Pages 442-450

Method-Tool · RCT · PDF Available · Grade Eligible · ⚠️ Moderate Risk Flags

Overall Assessment

Adequate Methodological Quality

Assessment created by PaperScorers Medical AI v0.1.0 on Dec 22, 2025

C-

59/100

Key Takeaways

• AMIE outperformed PCPs on DDx top-k accuracy across 159 OSCE scenarios (all k, FDR-corrected P<0.05).
• Patient-actors and specialists rated AMIE higher on most communication and management axes.
• Design was a randomised, double-blind crossover; statistics used bootstrapping and Wilcoxon tests with FDR correction.
• Transparency is limited: no preregistration, code not shared, evaluation data partly restricted.
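As an illustration of the top-k metric named in the takeaways, here is a minimal sketch of how DDx top-k accuracy can be computed. The function name `top_k_accuracy`, the toy cases, and the case-insensitive exact string match are all assumptions made for this example; they are not the paper's matching procedure or data.

```python
from typing import Sequence


def top_k_accuracy(ranked_ddx: Sequence[Sequence[str]],
                   truths: Sequence[str],
                   k: int) -> float:
    """Fraction of cases whose ground-truth diagnosis is in the first k DDx entries."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if not truths or len(ranked_ddx) != len(truths):
        raise ValueError("need a non-empty list with one ground truth per case")
    # Assumption: a case counts as a hit on a case-insensitive exact string match.
    hits = sum(
        truth.lower() in (d.lower() for d in ddx[:k])
        for ddx, truth in zip(ranked_ddx, truths)
    )
    return hits / len(truths)


# Hypothetical toy data: four cases, each with a ranked differential diagnosis.
toy_ddx = [
    ["migraine", "tension headache", "sinusitis"],
    ["gastritis", "peptic ulcer", "GERD"],
    ["asthma", "COPD", "bronchitis"],
    ["cellulitis", "DVT", "gout"],
]
toy_truths = ["migraine", "GERD", "pneumonia", "DVT"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(toy_ddx, toy_truths, k))
```

Accuracy can only stay level or rise as k grows, which is why the report quotes results "for all k" rather than at a single cutoff.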

Conclusion

A strong, carefully analysed OSCE experiment for a medical LLM; promising performance but limited generalisability and openness.

Quality Dimensions

Bias & Integrity

COI handling, outcome switching, selective reporting

C- (55/100)

Methods Rigour

Protocol clarity, sampling, controls

B- (72/100)

External Validity

Generalisability, setting, population

D- (35/100)

Transparency & Reproducibility

Data/code/materials availability, preregistration

F (24/100)

Significance & Novelty

Contribution vs prior art; context if known

A- (86/100)

Statistical Validity

Models, assumptions, multiplicity, CI priority

B+ (84/100)
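The report does not say how the overall 59/100 is derived from the six dimension scores. One quick check, assuming an unweighted mean (an assumption for illustration, not a documented scoring formula), is sketched below.

```python
# Dimension scores as listed in the Quality Dimensions section.
dimension_scores = {
    "Bias & Integrity": 55,
    "Methods Rigour": 72,
    "External Validity": 35,
    "Transparency & Reproducibility": 24,
    "Significance & Novelty": 86,
    "Statistical Validity": 84,
}

# Assumption: every dimension carries equal weight.
unweighted_mean = sum(dimension_scores.values()) / len(dimension_scores)
print(round(unweighted_mean, 2))
```

Under that assumption the mean is 356 / 6 ≈ 59.3, which rounds to the reported overall score of 59. The agreement is consistent with equal weighting but does not confirm it.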

Integrity & Transparency

Integrity checks

P-hacking risk?

None suspected

Signs of selective hypothesis testing or analysis choices that could inflate false positive rates.

Outcome switching?

Review recommended

Evidence that primary outcomes were changed during or after the study, potentially distorting results.

Conflict of interest?

Review recommended

Authors have financial or other relationships that could bias research findings.

Data integrity issues?

None suspected

Concerns about the accuracy, completeness, or authenticity of the reported data.

Open science signals

Research data

Restricted

Raw data used in the analyses is publicly available or accessible upon request.

Analysis code

Not shared

Code or scripts used to analyze the data are shared for reproducibility.

Study materials

Not shared

Protocols, questionnaires, and other materials are publicly available.

Premise

Primary Research Question

Can an LLM-based system conduct diagnostic dialogue with higher diagnostic accuracy and communication quality than PCPs in OSCE-like settings?

Hypothesis

AMIE will achieve superior DDx accuracy and better conversational ratings than PCPs in a blinded remote OSCE.

Hypothesis is Falsifiable

✓ Yes

Randomised, blinded comparison on predefined rubrics enables disconfirmation.

PICO Framework

Population: ✓ Yes

Validated patient-actors across 159 case scenarios, which the abstract describes as sourced from providers in Canada, the United Kingdom and India.

Intervention: ✓ Yes

AMIE LLM conducting text-based consultation.

Comparison: ✓ Yes

PCPs as active comparator in counterbalanced crossover.

Outcome: ✓ Yes

DDx top-k accuracy; specialist and patient-actor communication ratings.

Literature Positioning

Literature Review Balanced

Well covered

Situates within LLM/medical QA, OSCE, equity and bias literature.

Evidence: Refs 9–22, 32–47

Research Gap Clearly Stated

Well covered

Lack of rigorous evaluation for diagnostic dialogue and history-taking.

Evidence: p. 443–444

Stated Contribution Clear

Well covered

Introduce AMIE, self-play training, and blinded OSCE evaluation vs PCPs.

Evidence: Fig. 1; p. 443–444

Study Provenance

Peer reviewed venue · Industry funding disclosed · Conflicts disclosed · Affiliations listed

Authors & Affiliations

Google Research and Google DeepMind (Mountain View, USA).

Funding Statement

Funded by Alphabet Inc. and/or subsidiaries.

Conflicts of Interest

All authors are Alphabet employees and may own stock.

Peer Review

✓ Nature (peer-reviewed), Vol 642, 12 June 2025.

Evidence: p. 442; DOI: 10.1038/s41586-025-08866-7

Methodological Assessment

Randomized Controlled Trial · CONSORT · RCT

10 applicable checklist items.

Yes 6 · No 2 · Unclear 2

Study Design & Protocol

✗

Power Analysis Reported

No a priori power calculation reported; Reporting Summary confirms.

Evidence: Reporting Summary: Sample size

✓

Randomisation Adequate

Order of AMIE vs PCP randomised and counterbalanced.

Evidence: Fig. 2; Methods: Remote OSCE study design
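To make "randomised and counterbalanced" concrete, the sketch below shows one way to assign a consultation order per scenario so that the two possible orders are as balanced as the sample allows. The function name `counterbalanced_orders`, the seed, and the integer scenario IDs are hypothetical; the paper's actual allocation procedure is not described in this report.

```python
import random


def counterbalanced_orders(scenario_ids, seed=0):
    """Map each scenario to an agent order, keeping the two orders near 50/50."""
    ids = list(scenario_ids)
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    orders = [("AMIE", "PCP"), ("PCP", "AMIE")]
    # Build a pool that alternates the two orders (counterbalancing),
    # then shuffle it so which scenario gets which order is random.
    pool = [orders[i % 2] for i in range(len(ids))]
    rng.shuffle(pool)
    return dict(zip(ids, pool))


# 159 hypothetical scenario IDs, matching the scenario count in this report.
assignment = counterbalanced_orders(range(1, 160))
amie_first = sum(order[0] == "AMIE" for order in assignment.values())
print(amie_first, len(assignment) - amie_first)  # 80 79
```

With an odd number of scenarios the split cannot be exactly even, so one order occurs once more than the other.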

?

Allocation Concealment

Allocation concealment procedures not detailed beyond blinding.

Evidence: Methods: Remote OSCE study design

✓

Blinding

Patient-actors and specialist raters blinded to agent identity.

Evidence: p. 444; Fig. 2

✓

Blinding Adequate

Blinding maintained; agents instructed not to reveal identity.

Evidence: Methods: Online text-based consultation

?

Intervention Replicable

System not open-sourced; prompts and full configs not released.

Evidence: Code availability

✓

Outcomes Pre-specified

Rubrics (PACES, PCCBP, DDx) defined a priori for evaluation.

Evidence: Methods: Evaluation; Extended Data Tables 1–3

✗

Pre-registration Evidence

No registry or protocol registration cited.

Evidence: Methods; Data/Code availability

Analysis & Reporting

✓

Adherence Adequate

Sessions capped (~20 min); conduct monitored via platform.

Evidence: Methods: Online text-based consultation

✓

Participant Flow Reported (CONSORT)

Counts of scenarios, PCPs, locations reported.

Evidence: p. 444; Methods: Remote OSCE design

Abstract

At the heart of medicine lies physician–patient dialogue, where skillful history-taking enables effective diagnosis, management and enduring trust [1,2]. Artificial intelligence (AI) systems capable of diagnostic dialogue could increase accessibility and quality of care. However, approximating clinicians’ expertise is an outstanding challenge. Here we introduce AMIE (Articulate Medical Intelligence Explorer), a large language model (LLM)-based AI system optimized for diagnostic dialogue. AMIE uses a self-play-based [3] simulated environment with automated feedback for scaling learning across disease conditions, specialties and contexts. We designed a framework for evaluating clinically meaningful axes of performance, including history-taking, diagnostic accuracy, management, communication skills and empathy. We compared AMIE’s performance to that of primary care physicians in a randomized, double-blind crossover study of text-based consultations with validated patient-actors similar to objective structured clinical examination [4,5]. The study included 159 case scenarios from providers in Canada, the United Kingdom and India, 20 primary care physicians compared to AMIE, and evaluations by specialist physicians and patient-actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 30 out of 32 axes according to the specialist physicians and 25 out of 26 axes according to the patient-actors. Our research has several limitations and should be interpreted with caution. Clinicians used synchronous text chat, which permits large-scale LLM–patient interactions, but this is unfamiliar in clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.

Study Overview

Study Design

RCT

Primary classification used to evaluate this research.

Population

Validated patient-actors in remote OSCE chats; 20 board-certified PCPs across Canada and India; 159 clinical scenarios.

Brief summary of who or what was studied.

Sample Size

159 (case scenarios, not individual participants)

Number of participants included in the study analyses.

Study Arms

2 arms

Total number of intervention or comparison groups.

Pre-registration

Not registered

Indicates whether the protocol or trial was registered before data collection.

Randomisation

Simple randomisation

How participants were allocated to groups: None (no randomisation), Simple (random order), Blocked (groups balanced in blocks), Stratified (balanced within subgroups), Adaptive (allocation adjusted based on previous enrollments).

Blinding

Double blind

Who was unaware of group assignments: None (everyone knew), Single blind (participants didn't know their group), Double blind (participants and researchers didn't know), Triple blind (also outcome assessors didn't know).

Primary Outcome

Unclear whether the primary outcome was pre-specified.

Effect Size

Top-1 differential diagnosis (DDx) accuracy difference: AMIE > PCP; FDR-adjusted P=0.0017 (k=1); all k significant P<0.05
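The "FDR-adjusted" P values above come from a multiple-comparison correction applied across the k values. This report does not name the procedure, so the sketch below assumes the common Benjamini-Hochberg step-up method. The function name and the toy p-values are hypothetical and are not taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease (monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five comparisons.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted_p = benjamini_hochberg(toy_p)
for raw, adj in zip(toy_p, adjusted_p):
    print(f"raw={raw:.3f}  adjusted={adj:.5f}  significant={adj < 0.05}")
```

In this toy run the third and fourth comparisons have raw p-values below 0.05 but adjusted values just above it, which shows why a corrected threshold is a stricter claim than an uncorrected one.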

Reviewer Notes

Tool-development with randomised, double-blind OSCE vs PCPs; not a clinical patient RCT.

Publication Details

DOI

10.1038/s41586-025-08866-7

Published

April 9, 2025

Citations

104 citations

FWCI

237.62

Disclaimer: This assessment is generated by AI and should not be the sole basis for clinical or research decisions. Always review the original paper and consult with domain experts.
