Title:
Comparing three natural language processing methods for the automatic identification of epilepsy patients from French clinical notes.
Authors:
Le Gac F; Paris Brain Institute-Institut du Cerveau, Institut National de la Santé Et de la Recherche Médicale (INSERM), Centre National de la Recherche Scientifique (CNRS), Pitié-Salpêtrière Hospital, Sorbonne Université, Paris, France.
Calonge Q; Paris Brain Institute-Institut du Cerveau, Institut National de la Santé Et de la Recherche Médicale (INSERM), Centre National de la Recherche Scientifique (CNRS), Pitié-Salpêtrière Hospital, Sorbonne Université, Paris, France.; Epilepsy Unit, Département Médico-Universitaire Neurosciences, Pitié-Salpêtrière Hospital, AP-HP, Paris, France.; Center of Reference for Rare Epilepsies, ERN-Epicare, Département Médico-Universitaire Neurosciences, Pitié-Salpêtrière Hospital, AP-HP, Paris, France.
Estellat C; Département de Santé Publique, Centre de Pharmacoépidémiologie (Cephepi), Unité de Recherche Clinique PSL-CFX, CIC-1901, Institut Pierre Louis d'Epidémiologie et de Santé Publique, INSERM, Hôpital Pitié Salpêtrière, AP-HP, Sorbonne Université, Paris, France.
Navarro V; Paris Brain Institute-Institut du Cerveau, Institut National de la Santé Et de la Recherche Médicale (INSERM), Centre National de la Recherche Scientifique (CNRS), Pitié-Salpêtrière Hospital, Sorbonne Université, Paris, France.; Epilepsy Unit, Département Médico-Universitaire Neurosciences, Pitié-Salpêtrière Hospital, AP-HP, Paris, France.; Center of Reference for Rare Epilepsies, ERN-Epicare, Département Médico-Universitaire Neurosciences, Pitié-Salpêtrière Hospital, AP-HP, Paris, France.
Source:
Epilepsia [Epilepsia] 2026 Feb; Vol. 67 (2), pp. 741-752. Date of Electronic Publication: 2025 Oct 24.
Publication Type:
Journal Article; Comparative Study
Language:
English
Journal Info:
Publisher: Blackwell Science
Country of Publication: United States
NLM ID: 2983306R
Publication Model: Print-Electronic
Cited Medium: Internet
ISSN: 1528-1167 (Electronic)
Linking ISSN: 00139580
NLM ISO Abbreviation: Epilepsia
Subsets: MEDLINE
Imprint Name(s):
Publication: Malden, MA : Blackwell Science
Original Publication: Copenhagen : Munksgaard
Grant Information:
DATAE Direction Générale de l'offre de Soins; ANR-10-IAIHU-06 Agence Nationale de la Recherche
Contributed Indexing:
Keywords: Automated phenotyping; Clinical data warehouse; Pretrained language model; electronic health record; epilepsy; natural language processing (NLP)
Entry Date(s):
Date Created: 20251024
Date Completed: 20260223
Latest Revision: 20260225
Update Code:
20260225
PubMed Central ID:
PMC12927673
DOI:
10.1111/epi.18683
PMID:
41133988
Database:
MEDLINE

*Further Information*

*Objective: Manual review of clinical notes by experts remains the reference standard for identifying patients with epilepsy in health databases. However, this process is labor-intensive and time-consuming due to the unstructured nature of text. Prior studies have shown the potential of natural language processing for automated phenotyping. We aim to develop and validate algorithms capable of identifying patients with epilepsy based on a set of clinical notes.
Methods: A population of 109 448 patients was selected from the Assistance Publique-Hôpitaux de Paris (AP-HP) Clinical Data Warehouse (CDW) (38 hospitals in Paris, France) based on the presence of an International Classification of Diseases, Tenth Revision (ICD-10) diagnostic code related to epilepsy (G40/G41) or mimicking disorders (R53/R55/R56), or the mention of at least one antiseizure medication in their medical chart. From this pre-screened population, 6733 sentences (from 2700 patients) were labeled as indicative or not indicative of epilepsy, and 3000 patients were selected randomly for manual review by a neurologist. We compared a "basic" keyword-based method, a rule-based method, and a pretrained language model for identifying epilepsy-related sentences and classifying patients with epilepsy. We reported the F1 score of each method.
Results: At the sentence level, the pretrained language model reached the highest F1 score, .95 (95% confidence interval [CI]: .95-.96), outperforming the rule-based method (.87; 95% CI: .86-.88) and the basic method (.81; 95% CI: .80-.81). At the patient level, the pretrained language model also achieved the best F1 score (.95; 95% CI: .94-.96), compared with the rule-based method (.93; 95% CI: .91-.94) and the basic method (.82; 95% CI: .81-.84).
Significance: Both the rule-based method and the pretrained language model achieved high performance. These algorithms can automatically identify patients with epilepsy from unstructured clinical notes in French data warehouses, supporting large-scale phenotyping and the detection of epilepsy as a comorbidity.
(© 2025 The Author(s). Epilepsia published by Wiley Periodicals LLC on behalf of International League Against Epilepsy.)*
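
To make the abstract's "basic" keyword-based method and its F1 evaluation concrete, the following is a minimal Python sketch, not the authors' implementation. The keyword pattern, the toy sentences, the gold labels, and the rule that a patient is positive when any of their sentences matches are all assumptions made for illustration; the abstract does not specify the study's keyword list or its patient-level aggregation rule.

```python
import re
from collections import defaultdict

# Hypothetical keyword pattern (assumption): matches "épilepsie(s)" and
# "épileptique(s)", with or without the accent on the first letter.
KEYWORDS = re.compile(r"\b[ée]pilep(?:sie|tique)s?\b", re.IGNORECASE)


def sentence_is_positive(sentence):
    """Flag a sentence as epilepsy-related if it contains a keyword."""
    return KEYWORDS.search(sentence) is not None


def classify_patients(sentences):
    """Aggregate sentence flags per patient.

    `sentences` is an iterable of (patient_id, text) pairs. A patient is
    predicted positive if at least one sentence matches (assumed rule).
    """
    flags = defaultdict(bool)
    for patient_id, text in sentences:
        flags[patient_id] = flags[patient_id] or sentence_is_positive(text)
    return dict(flags)


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration, not from the study).
notes = [
    ("p1", "Patient suivi pour épilepsie temporale."),
    ("p1", "Pas de fièvre."),
    ("p2", "Malaise vagal sans perte de connaissance."),
    ("p3", "Pas d'argument pour une épilepsie."),       # negated mention
    ("p4", "Crises généralisées sous lévétiracétam."),  # no keyword present
]
gold = {"p1": True, "p2": False, "p3": False, "p4": True}

pred = classify_patients(notes)
patients = sorted(gold)
score = f1_score([gold[p] for p in patients], [pred[p] for p in patients])
print(pred, score)
```

In this toy example the keyword matcher flags the negated mention for `p3` and misses `p4`, whose note contains no keyword. Errors of this kind are one plausible reason a context-aware method could outperform plain keyword matching, although the abstract itself does not break down error types.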