Incorporating information retrieval into AI chatbots for patient education on thyroid eye disease.
Purpose: To evaluate the performance of general-purpose, retrieval-augmented, and medicine-specific AI chatbots in answering common thyroid eye disease (TED) patient questions.
Design: Cross-sectional comparative evaluation. Posts from online TED forum discussions were collected and synthesized into 15 representative patient questions across five categories spanning clinical (treatment, diagnosis, management, epidemiology) and non-clinical topics, stratified into three difficulty levels. Responses generated by three different large language models (LLMs) were randomized and anonymized for blinded assessment.
Subjects, Participants, and/or Controls: Three oculoplastic surgeons evaluated clinical metrics; three medical students assessed non-clinical metrics.
Methods: Three AI models generated responses: GPT-4o-mini (ChatGPT), a retrieval-augmented generation model grounded in the TED literature (ChatGPT-RAG), and an LLM trained specifically for healthcare professionals (OpenEvidence). Blinded raters assessed the randomized responses. Statistical analysis used paired Wilcoxon signed-rank tests, with Hedges' g for effect sizes.
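For readers who want to run this kind of analysis themselves, a minimal Python sketch follows. It is an editorial illustration, not the study's code: the score vectors below are hypothetical, and only the tests named above (a paired Wilcoxon signed-rank test and Hedges' g) are assumed.

    import numpy as np
    from scipy.stats import wilcoxon

    def hedges_g(x, y):
        # Cohen's d from a pooled SD, scaled by the small-sample
        # bias-correction factor J = 1 - 3/(4N - 9).
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x) + len(y)
        pooled_sd = np.sqrt(((len(x) - 1) * x.var(ddof=1)
                             + (len(y) - 1) * y.var(ddof=1)) / (n - 2))
        return (1 - 3 / (4 * n - 9)) * (x.mean() - y.mean()) / pooled_sd

    # Hypothetical per-question mean clinical scores (7-point Likert),
    # one value per question for each model -- NOT the study's data.
    open_evidence = np.array([6.1, 5.8, 6.3, 5.9, 6.0, 6.2, 5.7,
                              6.1, 5.9, 6.0, 6.2, 5.8, 6.0, 6.1, 5.9])
    chatgpt = np.array([5.0, 4.8, 5.2, 4.9, 5.1, 4.7, 5.0,
                        4.9, 5.3, 4.8, 5.0, 4.9, 5.1, 4.8, 5.0])

    stat, p = wilcoxon(open_evidence, chatgpt)  # paired signed-rank test
    print(f"W = {stat:.1f}, p = {p:.4g}, g = {hedges_g(open_evidence, chatgpt):.2f}")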
Main Outcomes Measured: Clinical evaluation of responses was conducted using a 7-point Likert scale for relevance, accuracy, balance, and scope. Non-clinical metrics of empathy, understandability, and readability were also assessed using validated tools.
Results: OpenEvidence significantly outperformed both ChatGPT (mean clinical score 5.96 vs 4.94; Hedges' g = 1.21, P < 0.001) and ChatGPT-RAG (5.96 vs 5.55; g = 0.53, P < 0.001) in clinical rankings and across most clinical metrics, including accuracy and relevance. This pattern reversed for non-clinical metrics, where ChatGPT consistently outperformed the specialized models in empathy, understandability, and actionability (18.4 vs 14.96 for OpenEvidence; g = 1.25, P < 0.001). Across both domains, ChatGPT-RAG achieved intermediate performance, trailing OpenEvidence on clinical metrics (g = 0.53) and ChatGPT on non-clinical metrics (g = 0.44). Limitations include the modest number of raters and the use of questions synthesized from online forums, both of which may limit generalizability.
Conclusions: Specialized medical AI models may offer better clinical accuracy, while general-purpose models may outperform them in patient communication and accessibility. Developing retrieval-augmented generation (RAG) approaches that combine clinical precision with effective communication is a promising direction for AI-powered patient education in TED and, potentially, other complex conditions.
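As a rough sketch of what such a RAG-based approach involves (retrieve topical passages, then ground the model's answer in them), the toy Python pipeline below may help. The corpus, embedding scheme, and prompt format are all illustrative assumptions, not the authors' implementation.

    import numpy as np

    # Toy stand-in for a corpus of TED literature passages (illustrative only).
    CORPUS = [
        "Teprotumumab is an IGF-1R inhibitor approved for thyroid eye disease.",
        "Selenium supplementation may benefit mild Graves' orbitopathy.",
        "Orbital decompression surgery is considered for severe proptosis.",
    ]

    def embed(text, dim=64):
        # Stand-in embedding: hash tokens into a normalized bag-of-words
        # vector; a real system would use a trained text-embedding model.
        vec = np.zeros(dim)
        for token in text.lower().split():
            vec[hash(token) % dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    def retrieve(question, k=2):
        # Rank passages by cosine similarity to the question.
        q = embed(question)
        scores = [float(q @ embed(p)) for p in CORPUS]
        return [CORPUS[i] for i in np.argsort(scores)[::-1][:k]]

    def build_grounded_prompt(question):
        # The retrieved passages become the context the LLM must answer from.
        context = "\n".join(retrieve(question))
        return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    print(build_grounded_prompt("Which medications treat thyroid eye disease?"))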
(Copyright © 2025 The Author(s). Published by Elsevier B.V. All rights reserved.)
Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.