Incorporating information retrieval into AI chatbots for patient education on thyroid eye disease.
Purpose: To evaluate the performance of general-purpose, retrieval-augmented, and medicine-specific AI chatbots in answering common thyroid eye disease (TED) patient questions.
Design: Cross-sectional comparative evaluation. Posts from online TED forum discussions were collected and synthesized into 15 representative patient questions across five categories spanning clinical (treatment, diagnosis, management, epidemiology) and non-clinical topics, stratified into three difficulty levels. Responses generated by three different large language models (LLMs) were randomized and anonymized for blinded assessment.
Subjects, Participants, and/or Controls: Three oculoplastic surgeons evaluated clinical metrics; three medical students assessed non-clinical metrics.
Methods: Three AI models generated responses: GPT-4o-mini (ChatGPT), a retrieval-augmented generation model grounded in the TED literature (ChatGPT-RAG), and an LLM trained specifically for healthcare professionals (OpenEvidence). Blinded raters assessed the randomized responses. Statistical analysis used paired Wilcoxon signed-rank tests, with Hedges' g for effect sizes.
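For readers who want to run this kind of analysis themselves, a minimal Python sketch follows. It is an editorial illustration, not the study's code: the score vectors below are hypothetical, and only the tests named above (a paired Wilcoxon signed-rank test and Hedges' g) are assumed.

    import numpy as np
    from scipy.stats import wilcoxon

    def hedges_g(x, y):
        # Cohen's d from a pooled SD, scaled by the small-sample
        # bias-correction factor J = 1 - 3/(4N - 9).
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x) + len(y)
        pooled_sd = np.sqrt(((len(x) - 1) * x.var(ddof=1)
                             + (len(y) - 1) * y.var(ddof=1)) / (n - 2))
        return (1 - 3 / (4 * n - 9)) * (x.mean() - y.mean()) / pooled_sd

    # Hypothetical per-question mean clinical scores (7-point Likert),
    # one value per question for each model -- NOT the study's data.
    open_evidence = np.array([6.1, 5.8, 6.3, 5.9, 6.0, 6.2, 5.7,
                              6.1, 5.9, 6.0, 6.2, 5.8, 6.0, 6.1, 5.9])
    chatgpt = np.array([5.0, 4.8, 5.2, 4.9, 5.1, 4.7, 5.0,
                        4.9, 5.3, 4.8, 5.0, 4.9, 5.1, 4.8, 5.0])

    stat, p = wilcoxon(open_evidence, chatgpt)  # paired signed-rank test
    print(f"W = {stat:.1f}, p = {p:.4g}, g = {hedges_g(open_evidence, chatgpt):.2f}")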
Main Outcomes Measured: Clinical evaluation of responses was conducted using a 7-point Likert scale for relevance, accuracy, balance, and scope. Non-clinical metrics of empathy, understandability, and readability were also assessed using validated tools.
Results: OpenEvidence significantly outperformed both ChatGPT (mean clinical score 5.96 vs 4.94; Hedges' g = 1.21, P < 0.001) and ChatGPT-RAG (5.96 vs 5.55; g = 0.53, P < 0.001) in clinical rankings and across most clinical metrics, including accuracy and relevance. This pattern reversed for non-clinical metrics, where ChatGPT consistently outperformed the specialized models in empathy, understandability, and actionability (18.4 vs 14.96 for OpenEvidence; g = 1.25, P < 0.001). Across both domains, ChatGPT-RAG achieved intermediate performance, trailing OpenEvidence on clinical metrics (g = 0.53) and ChatGPT on non-clinical metrics (g = 0.44). Limitations include the modest number of raters and the use of questions synthesized from online forums, both of which may limit generalizability.
Conclusions: Specialized medical AI models may offer better clinical accuracy, while general-purpose models may outperform them in patient communication and accessibility. Developing retrieval-augmented generation (RAG) approaches that combine clinical precision with effective communication is a promising direction for AI-powered patient education in TED and, potentially, other complex conditions.
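As a rough sketch of what such a RAG-based approach involves (retrieve topical passages, then ground the model's answer in them), the toy Python pipeline below may help. The corpus, embedding scheme, and prompt format are all illustrative assumptions, not the authors' implementation.

    import numpy as np

    # Toy stand-in for a corpus of TED literature passages (illustrative only).
    CORPUS = [
        "Teprotumumab is an IGF-1R inhibitor approved for thyroid eye disease.",
        "Selenium supplementation may benefit mild Graves' orbitopathy.",
        "Orbital decompression surgery is considered for severe proptosis.",
    ]

    def embed(text, dim=64):
        # Stand-in embedding: hash tokens into a normalized bag-of-words
        # vector; a real system would use a trained text-embedding model.
        vec = np.zeros(dim)
        for token in text.lower().split():
            vec[hash(token) % dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    def retrieve(question, k=2):
        # Rank passages by cosine similarity to the question.
        q = embed(question)
        scores = [float(q @ embed(p)) for p in CORPUS]
        return [CORPUS[i] for i in np.argsort(scores)[::-1][:k]]

    def build_grounded_prompt(question):
        # The retrieved passages become the context the LLM must answer from.
        context = "\n".join(retrieve(question))
        return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    print(build_grounded_prompt("Which medications treat thyroid eye disease?"))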
(Copyright © 2025 The Author(s). Published by Elsevier B.V. All rights reserved.)
Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.