*Result*: Accurate and Scalable Classification of Colonoscopy Neoplasia Using Machine Learning and Natural Language Processing.

Title:
Accurate and Scalable Classification of Colonoscopy Neoplasia Using Machine Learning and Natural Language Processing.
Authors:
Broderick B; Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota, USA., Greenwood J; Division of Family Medicine, Mayo Clinic, Rochester, Minnesota, USA., Mahoney D; Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota, USA., Burger K; Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota, USA., Garg SK; Division of Gastroenterology and Hepatology, Mayo Clinic Health System, Eau Claire, Wisconsin, USA., Wallace MB; Division of Gastroenterology and Hepatology, Mayo Clinic, Jacksonville, Florida, USA., Gurudu SR; Division of Gastroenterology and Hepatology, Mayo Clinic, Scottsdale, Arizona, USA., Ebner D; Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, Minnesota, USA., Kisiel J; Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, Minnesota, USA.
Source:
Clinical and translational gastroenterology [Clin Transl Gastroenterol] 2026 Feb 01; Vol. 17 (2), pp. e00959. Date of Electronic Publication: 2026 Feb 01.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Wolters Kluwer Health Country of Publication: United States NLM ID: 101532142 Publication Model: Electronic Cited Medium: Internet ISSN: 2155-384X (Electronic) Linking ISSN: 2155384X NLM ISO Abbreviation: Clin Transl Gastroenterol Subsets: MEDLINE
Imprint Name(s):
Publication: <2019-> : [Philadelphia, PA] : Wolters Kluwer Health
Original Publication: New York, NY : Nature Pub. Group
References:
Siegel RL, Kratzer TB, Giaquinto AN, et al. Cancer statistics, 2025. CA Cancer J Clin 2025;75(1):10–45.
Ebner DW, Finney Rutten LJ, Miller-Wilson LA, et al. Trends in colorectal cancer screening from the national health interview survey: Analysis of the impact of different modalities on overall screening rates. Cancer Prev Res (Phila) 2024;17(6):275–80.
Gupta S, Lieberman D, Anderson JC, et al. Recommendations for follow-up after colonoscopy and polypectomy: A consensus update by the US Multi-Society Task Force on colorectal cancer. Gastroenterology 2020;158(4):1131–53.e5.
Rex DK, Anderson JC, Butterly LF, et al. Quality indicators for colonoscopy. Am J Gastroenterol 2024;119(9):1754–80.
Corley DA, Jensen CD, Marks AR, et al. Adenoma detection rate and risk of colorectal cancer and death. N Engl J Med 2014;370(14):1298–306.
ASGE|American Society for Gastrointestinal Endoscopy. ASGE-ACG quality indicators for colonoscopy implementation tips, 2024. Accessed July 29, 2025. https://www.asge.org/docs/default-source/default-document-library/asge-acg-qi-for-colonoscopy-faq_oct24.pdf?sfvrsn=dff3665f_3.
Pakneshan S, Moy N, O'Connor S, et al. Costs and benefits of a formal quality framework for colonoscopy: Economic evaluation. Endosc Int Open 2024;12(11):e1334–41.
Peng M, Rex DK. Surveying ADR knowledge and practices among US gastroenterologists. J Clin Gastroenterol 2020;54(2):158–63.
Soroush A, Giuffrè M, Chung S, et al. Generative artificial intelligence in clinical medicine and impact on gastroenterology. Gastroenterology 2025;169(3):502–17.e1.
Luccioni A, Viguier S, Ligozat A-L. Estimating the carbon footprint of BLOOM, a 176B parameter language model; 2022. https://doi.org/10.48550/arXiv.2211.02001.
Hu J, Szymczak S. A review on longitudinal data analysis with random forest. Brief Bioinform 2023;24(2):bbad002.
Imler TD, Morea J, Kahi C, et al. Natural language processing accurately categorizes findings from colonoscopy and pathology reports. Clin Gastroenterol Hepatol 2013;11(6):689–94.
Mehrotra A, Dellon ES, Schoen RE, et al. Applying a natural language processing tool to electronic health records to assess performance on colonoscopy quality measures. Gastrointest Endosc 2012;75(6):1233–9.e14.
Ebner DW, Burger KN, Mahoney DW, et al. Neoplasia diagnosis after multi-target stool DNA is enhanced among lowest baseline detectors. Dig Dis Sci 2023;68(9):3721–31.
Stammers M, Ramgopal B, Owusu Nimako A, et al. A foundation systematic review of natural language processing applied to gastroenterology & hepatology. BMC Gastroenterol 2025;25(1):58.
Sabrie N, Khan R, Jogendran R, et al. Performance of natural language processing in identifying adenomas from colonoscopy reports: A systematic review and meta-analysis. iGIE 2023;2(3):350–6.e7.
Destrempes F, Gesnik M, Chayer B, et al. Quantitative ultrasound, elastography, and machine learning for assessment of steatosis, inflammation, and fibrosis in chronic liver disease. PLoS One 2022;17(1):e0262291.
Avram MF, Lupa N, Koukoulas D, et al. Random forests algorithm using basic medical data for predicting the presence of colonic polyps. Front Surg 2025;12:1523684.
Lim S, Tritto G, Zeki S, et al. Regular feedback to individual endoscopists is associated with improved adenoma detection rate and other key performance indicators for colonoscopy. Frontline Gastroenterol 2022;13(6):509–16.
Anderson JC, Butterly LF, Weiss JE, et al. Providing data for serrated polyp detection rate benchmarks: An analysis of the New Hampshire colonoscopy registry. Gastrointest Endosc 2017;85(6):1188–94.
Grant Information:
CA214679 Kern Center for the Science of Health Care Delivery, National Cancer Institute
Contributed Indexing:
Keywords: natural language processing; neoplasia detection rate; pathology report analysis; predictive modeling; random forest
Entry Date(s):
Date Created: 20251209 Date Completed: 20260224 Latest Revision: 20260224
Update Code:
20260225
PubMed Central ID:
PMC12922929
DOI:
10.14309/ctg.0000000000000959
PMID:
41363713
Database:
MEDLINE

*Further Information*

*Introduction: Colorectal cancer remains a leading cause of cancer-associated death in the United States, and colonoscopy is the primary screening strategy for prevention. Rates of adenomatous and serrated neoplasia detection are inversely associated with postcolonoscopy colorectal cancer. This crucial quality metric depends on accurate ascertainment of colorectal neoplasia findings from both endoscopy and histopathology records. We aimed to assess the feasibility of a random forest machine learning model to rapidly and accurately categorize colorectal neoplasia from electronic health record data.
Methods: A retrospective cohort study of neoplasia detection rates among individuals undergoing colonoscopy at a large academic institution was used to develop a rule-based algorithm that categorizes colorectal neoplasia from endoscopy reports and pathology Systematized Nomenclature of Medicine - Clinical Terms. This cohort provided a large training set for a natural language processing system that uses a random forest approach to automatically classify unstructured pathology findings as adenoma, serrated, or advanced neoplasms. The system was validated against a manually annotated independent holdout set.
Results: The training set comprised 35,953 unstructured pathology reports with matched Systematized Nomenclature of Medicine - Clinical Terms from 95,188 unstructured colonoscopy reports. The final model was assessed on an independent holdout set of 337 manually annotated procedures and achieved an area under the receiver operating characteristic curve of 0.997 (confidence interval [CI] 0.994–1), 0.99 (CI 0.98–1), and 0.99 (CI 0.98–0.99) for prediction of adenoma, serrated, and advanced lesions, respectively.
Discussion: The random forest-based hybrid natural language processing system for classification of colonoscopy results was both accurate and explainable. Natural language processing combined with effective machine learning algorithms can provide a scalable strategy for colonoscopy quality monitoring.
(Copyright © 2025 The Author(s). Published by Wolters Kluwer Health, Inc. on behalf of The American College of Gastroenterology.)*
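
The Results above report an area under the receiver operating characteristic curve with a confidence interval for each lesion class. The abstract does not say how those intervals were computed, so the following is only a minimal sketch of one common approach, a percentile bootstrap around a rank-based AUC, and is not the authors' method. The function names (`auc`, `bootstrap_ci`), the resample count, the 95% level, and the toy labels and scores are all illustrative assumptions, not study data.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: probability that a random positive case
    scores higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method).
    Resamples cases with replacement and skips resamples that
    contain only one class, where the AUC is undefined."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy holdout set (hypothetical): 1 = lesion class present per manual
# annotation, 0 = absent; scores stand in for model probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.40, 0.60, 0.30, 0.20, 0.15, 0.10, 0.05]

point = auc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUC {point:.3f} (95% CI {low:.3f}-{high:.3f})")
```

For a three-class problem like the one described (adenoma, serrated, advanced), this would be run once per class in a one-versus-rest fashion, with `labels` marking whether the manual annotation assigned that class to the procedure.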