Result: A clinical environment simulator for dynamic AI evaluation.

Title:
A clinical environment simulator for dynamic AI evaluation.
Authors:
Luo L; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Kim SE; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.; National Strategic Technology Research Institute, Seoul National University Hospital, Seoul, Republic of Korea.
Zhang X; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Kernbach JM; Department of Neuroradiology, Heidelberg University, Heidelberg, Germany.
Kenia R; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Acosta JN; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Nathanson LA; Department of Emergency Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.
Haimovich AD; Department of Emergency Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.
Rodman A; Division of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA.
Goh E; Stanford University School of Medicine, Stanford University, Stanford, CA, USA.
Chen JH; Stanford University School of Medicine, Stanford University, Stanford, CA, USA.
Shah NH; Stanford University School of Medicine, Stanford University, Stanford, CA, USA.
Kim DA; Department of Emergency Medicine, Stanford University School of Medicine, Stanford University, Stanford, CA, USA.
Zou J; Department of Computer Science, Stanford University, Stanford, CA, USA.; Department of Biomedical Data Science, Stanford University School of Medicine, Stanford University, Stanford, CA, USA.
Mahmood F; Mass General Brigham, Boston, MA, USA.; Harvard Medical School, Boston, MA, USA.; The Broad Institute of Harvard and MIT, Cambridge, MA, USA.
Kather JN; Department of Medical Oncology, National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany.; Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany.
Lungren M; Microsoft Research, Redmond, WA, USA.
Natarajan V; Google Research, Mountain View, CA, USA.
Topol EJ; Scripps Research Translational Institute, San Diego, CA, USA.
Rajpurkar P; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. pranav_rajpurkar@hms.harvard.edu.
Source:
Nature medicine [Nat Med] 2026 Mar; Vol. 32 (3), pp. 820-827. Date of Electronic Publication: 2026 Mar 12.
Publication Type:
Journal Article; Review
Language:
English
Journal Info:
Publisher: Nature Publishing Company; Country of Publication: United States; NLM ID: 9502015; Publication Model: Print-Electronic; Cited Medium: Internet; ISSN: 1546-170X (Electronic); Linking ISSN: 10788956; NLM ISO Abbreviation: Nat Med; Subsets: MEDLINE
Imprint Name(s):
Publication: New York, NY : Nature Publishing Company
Original Publication: New York, NY : Nature Pub. Co., [1995-
Entry Date(s):
Date Created: 20260313; Date Completed: 20260321; Latest Revision: 20260321
Update Code:
20260321
DOI:
10.1038/s41591-026-04252-6
PMID:
41820673
Database:
MEDLINE

Further information

Clinical evaluation of large language models (LLMs) currently relies on static datasets and isolated scenarios that fail to capture the cascading effects of healthcare decisions. We propose the Clinical Environment Simulator (CES), a framework that evaluates clinical LLMs within digital hospital environments where every decision dynamically alters future states. The CES would use a parallel simulation architecture: a 'hospital engine' that tracks bed availability, staff workloads and equipment status in real time, and a 'patient engine' that simulates disease progression and treatment responses based on LLM interventions. Unlike current benchmarks, the CES framework requires clinical LLMs to execute decisions through realistic electronic health record interfaces, while managing trade-offs between individual patient optimization and system-wide efficiency. The CES enables three critical evaluations absent from current benchmarks: temporal reasoning under evolving constraints, where delayed diagnostics can lead to patient deterioration; resource-aware decision-making, where aggressive workups for one patient may exhaust capacity needed by others; and operational resilience, through adversarial testing with simultaneous emergencies and system failures. By scoring LLM performance on both clinical outcomes and operational metrics, the CES represents a shift toward evaluating clinical LLMs as a dynamic and integrated component of healthcare delivery systems.
(© 2026. Springer Nature America, Inc.)
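
Illustrative sketch (not part of the published record): the abstract describes a 'hospital engine' and a 'patient engine' that update in parallel as decisions are made. The toy Python loop below shows one way such coupling could look. Every name, the 0-10 severity scale, the deterioration and recovery rates, and the `sickest_first` policy standing in for an LLM agent are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    """Toy patient state: severity rises each tick until treatment starts."""
    name: str
    severity: int          # hypothetical 0-10 scale, 10 = worst
    treated: bool = False


@dataclass
class HospitalEngine:
    """Tracks one shared resource (free beds) as a stand-in for capacity."""
    free_beds: int

    def admit(self) -> bool:
        # Admission succeeds only while capacity remains.
        if self.free_beds > 0:
            self.free_beds -= 1
            return True
        return False


@dataclass
class PatientEngine:
    """Advances disease state in response to the decisions taken."""
    patients: list = field(default_factory=list)

    def tick(self) -> None:
        for p in self.patients:
            if p.treated:
                p.severity = max(0, p.severity - 2)   # assumed treatment response
            else:
                p.severity = min(10, p.severity + 1)  # assumed deterioration from delay


def run_episode(policy, hospital, patient_engine, steps):
    """Each step: the policy picks patients to admit, then patient states advance."""
    for _ in range(steps):
        for p in policy(patient_engine.patients, hospital):
            if not p.treated and hospital.admit():
                p.treated = True
        patient_engine.tick()
    # Report one clinical metric and one operational metric side by side.
    return {
        "total_severity": sum(p.severity for p in patient_engine.patients),
        "free_beds": hospital.free_beds,
    }


def sickest_first(patients, hospital):
    """Hypothetical stand-in for an LLM agent: request a bed for the sickest untreated patient."""
    waiting = [p for p in patients if not p.treated]
    return sorted(waiting, key=lambda p: p.severity, reverse=True)[:1]


hospital = HospitalEngine(free_beds=2)
engine = PatientEngine([Patient("A", 6), Patient("B", 3), Patient("C", 5)])
result = run_episode(sickest_first, hospital, engine, steps=3)
print(result)
```

With two beds and three patients, the third patient cannot be admitted and keeps deteriorating, so the run ends with a total severity of 8 and no free beds. This is a small instance of the resource-aware trade-off the abstract describes, where capacity spent on one patient is unavailable to others.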

Competing interests: J.H.C. is a cofounder of Reaction Explorer, which develops and licenses organic chemistry education software, and has received paid medical expert witness fees from Elite Experts and paid one-time honoraria or travel expenses for invited presentations by insitro, General Reinsurance Corporation, AASCIF and other industry conferences, academic institutions and health systems. A.R. is a visiting researcher at Google DeepMind. D.A.K. is a cofounder and equity holder in Capacity Health, an AI clinical decision support company focused on emergency medicine. Capacity Health had no role in the conception, development, implementation, analysis or interpretation of the CES described in this paper, and did not provide funding, data or other support for this work. J.M.K. declares ongoing consulting services for AstraZeneca and Bioptimus. Furthermore, J.M.K. holds shares in StratifAI, Synagen and Spira Labs, has received an institutional research grant from GSK and AstraZeneca, as well as honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer and Fresenius. V.N. is an employee of Alphabet Inc.