Title:
Evaluation of validity, reliability, and readability of AI chatbots for gestational diabetes mellitus: a multi-model comparative study.
Authors:
Wang X; Department of Obstetrics, Shenzhen Nanshan Maternity and Child Healthcare Hospital, Shenzhen, China., Lin S; Department of Obstetrics, Shenzhen Nanshan Maternity and Child Healthcare Hospital, Shenzhen, China., Liu H; Department of Obstetrics, Shenzhen Nanshan Maternity and Child Healthcare Hospital, Shenzhen, China., Li C; Department of Obstetrics, Shenzhen Nanshan Maternity and Child Healthcare Hospital, Shenzhen, China., Zhou L; Department of Obstetrics, Shenzhen Nanshan Maternity and Child Healthcare Hospital, Shenzhen, China., Li R; Department of Urology, South China Hospital, Medical School, Shenzhen University, Shenzhen, China.; Department of Urology, Lanzhou University Second Hospital, Lanzhou University, Lanzhou, China.
Source:
Frontiers in public health [Front Public Health] 2026 Feb 04; Vol. 14, pp. 1760871. Date of Electronic Publication: 2026 Feb 04 (Print Publication: 2026).
Publication Type:
Journal Article; Comparative Study
Language:
English
Journal Info:
Publisher: Frontiers Editorial Office; Country of Publication: Switzerland; NLM ID: 101616579; Publication Model: eCollection; Cited Medium: Internet; ISSN: 2296-2565 (Electronic); Linking ISSN: 22962565; NLM ISO Abbreviation: Front Public Health; Subsets: MEDLINE
Imprint Name(s):
Original Publication: Lausanne : Frontiers Editorial Office
Contributed Indexing:
Keywords: artificial intelligence; gestational diabetes mellitus; large language models; patient education; readability
Entry Date(s):
Date Created: 20260220; Date Completed: 20260220; Latest Revision: 20260220
Update Code:
20260220
PubMed Central ID:
PMC12913397
DOI:
10.3389/fpubh.2026.1760871
PMID:
41717624
Database:
MEDLINE

Abstract:

*Background: Gestational diabetes mellitus (GDM) is increasingly prevalent worldwide and is associated with substantial short- and long-term risks for mothers and offspring, making high-quality, accessible health information essential. At the same time, artificial intelligence (AI) chatbots based on large language models are widely used for health queries, yet their accuracy, reliability, and readability in the context of GDM remain unclear.
Methods: We first evaluated six AI chatbots (ChatGPT-5, ChatGPT-4o, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro and Claude Sonnet 4.5) using 200 single-best-answer multiple-choice questions (MCQs) on GDM drawn from MedQA, MedMCQA and the Chinese National Medical Examination item bank, covering four domains: epidemiology and risk factors, clinical manifestations and diagnosis, maternal and neonatal outcomes, and management and treatment. Each item was posed three times to every model under a standardized prompting protocol, and accuracy was defined as the proportion of correctly answered questions. For public-facing information, we identified 15 core GDM education questions using Google Trends and expert review, and queried four chatbots (ChatGPT-5, DeepSeek-V3.2, Claude Sonnet 4.5 and Gemini 2.5 Pro). Two obstetricians independently assessed reliability using DISCERN, EQIP, GQS and JAMA benchmarks, and readability was quantified using ARI, CL, FKGL, FRES, GFI and SMOG indices.
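The abstract names six readability indices but, as is usual for an abstract, does not reproduce their formulas. As an illustration only, the two Flesch measures cited (FKGL and FRES) can be sketched in Python using their standard published coefficients; the vowel-group syllable counter below is a rough heuristic, not the exact syllabification an instrument like the one in this study would use:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels, trim a trailing silent 'e'.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, FRES) for a block of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)          # mean words per sentence
    spw = syllables / len(words)               # mean syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease Score
    return fkgl, fres
```

On this scale, a FKGL around 7 (as reported for ChatGPT-5) corresponds to roughly a seventh-grade reading level, above the sixth-grade target mentioned in the Results.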
Results: Overall MCQ accuracy differed significantly across the six chatbots (p < 0.0001), with ChatGPT-5 achieving the highest mean accuracy (92.17%) and DeepSeek-V3.2 and Gemini 2.5 Pro performing comparably well, while ChatGPT-4o, DeepSeek-R1 and Claude Sonnet 4.5 scored lower. Newer model generations (ChatGPT-5 vs. ChatGPT-4o; DeepSeek-V3.2 vs. DeepSeek-R1) consistently outperformed their predecessors across all four domains. Among the four models evaluated on public-education questions, ChatGPT-5 achieved the highest reliability scores (DISCERN 42.53 ± 7.20; EQIP 71.67 ± 6.17), whereas Claude Sonnet 4.5, DeepSeek-V3.2 and Gemini 2.5 Pro scored lower. JAMA scores were uniformly low (0-0.07/4), reflecting poor transparency. All models produced text above the recommended sixth-grade reading level; ChatGPT-5 showed the most favorable readability profile (for example, FKGL 7.43 ± 2.42, FRES 62.47 ± 13.51) but still did not meet guideline targets.
Conclusion: Contemporary AI chatbots can generate generally accurate and moderately reliable GDM-related information, with newer model generations showing clear gains in diagnostic validity. However, limited transparency and systematically high reading levels indicate that these tools are not yet suitable as stand-alone resources for GDM patient education and should be used as adjuncts to clinician counseling and professionally curated materials.
(Copyright © 2026 Wang, Lin, Liu, Li, Zhou and Li.)*

*The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.*