*Result*: Leveraging natural language processing and machine learning to identify chronic conditions from primary care electronic medical records.

Title:
Leveraging natural language processing and machine learning to identify chronic conditions from primary care electronic medical records.
Authors:
Zhang N; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada.
Abbasi M; Department of Family Medicine, University of Alberta, Edmonton, AB, Canada. marjan.abbasi@albertahealthservices.ca.
Khera S; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada.
Bazrafkan M; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada.
Abbasi-Dezfouly R; University of Alberta, Edmonton, AB, Canada.
Kong L; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada.
Source:
Scientific reports [Sci Rep] 2026 Feb 12; Vol. 16 (1). Date of Electronic Publication: 2026 Feb 12.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Nature Publishing Group; Country of Publication: England; NLM ID: 101563288; Publication Model: Electronic; Cited Medium: Internet; ISSN: 2045-2322 (Electronic); Linking ISSN: 20452322; NLM ISO Abbreviation: Sci Rep; Subsets: MEDLINE
Imprint Name(s):
Original Publication: London : Nature Publishing Group, copyright 2011-
Grant Information:
IT30841 Mitacs Accelerate Grant
Contributed Indexing:
Keywords: Chronic conditions; Electronic medical records; Machine learning; Natural language processing; Primary care; Text mining
Entry Date(s):
Date Created: 20260211; Date Completed: 20260310; Latest Revision: 20260312
Update Code:
20260312
PubMed Central ID:
PMC12972307
DOI:
10.1038/s41598-026-38594-5
PMID:
41673187
Database:
MEDLINE

*Further Information*

*Primary care electronic medical records (EMRs) contain rich data that can support proactive identification of chronic health conditions. However, leveraging unstructured EMR data requires novel computational methods. We applied natural language processing and machine learning (ML) techniques to structured and unstructured EMR data to detect arthritis, chronic kidney disease (CKD), diabetes, hypertension, and respiratory diseases. Using data from 449 community-dwelling older adults in one Canadian primary care clinic, we developed an analytical pipeline that included preprocessing of unstructured data, Latent Dirichlet Allocation topic modelling, and supervised ML models (regularized logistic regression [RLR], support vector machine [SVM], artificial neural networks [ANNs]), with class-weighted learning and the Synthetic Minority Oversampling Technique (SMOTE) used to address class imbalance. Integrating unstructured clinical notes improved model performance, particularly for conditions often under-coded in structured data. For example, the area under the receiver operating characteristic curve increased from 0.724 to 0.841 for SVM classifiers in arthritis detection and from 0.733 to 0.890 for ANNs in respiratory disease detection. Less pronounced improvements were observed for diabetes, hypertension, and CKD. These findings highlight that, while performance gains from unstructured data vary by condition, leveraging these data can improve disease detection in primary care EMRs.
(© 2026. The Author(s).)*
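
The abstract names three ingredients that are easy to confuse: class-weighted learning, SMOTE, and the area under the ROC curve used to compare models. The sketch below illustrates each on toy data using only the Python standard library. It is not the authors' implementation: the function names (`balanced_class_weights`, `smote`, `roc_auc`), the choice of `k`, and the inverse-frequency weighting formula are assumptions made for illustration, and the record does not state which weighting scheme or SMOTE settings the study used.

```python
import random


def balanced_class_weights(labels):
    """Assumed scheme: weight each class by n / (n_classes * class_count),
    so rarer classes contribute more to the training loss."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples. Each one is a random point
    on the segment between a minority sample and one of its k nearest
    minority neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy, made-up data: 6 patients without the condition, 2 with it.
    y = [0, 0, 0, 0, 0, 0, 1, 1]
    print(balanced_class_weights(y))

    # Toy minority-class feature vectors (e.g. two topic proportions).
    minority = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
    print(smote(minority, n_new=3))

    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Class weighting leaves the data untouched and changes how much each error costs, whereas SMOTE changes the training set itself by adding interpolated minority samples. In practice, oversampling is usually applied to the training folds only, so that synthetic samples do not leak into the data used to compute the AUC.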

*Declarations. Competing interests: The authors declare no competing interests. Ethical approval and consent to participate: This study used retrospective, de-identified electronic medical record (EMR) data extracted from a sentinel primary care clinic affiliated with the Northern Alberta Primary Care Research Network (NAPCReN). The data custodians were two attending family physicians at the clinic, who provided consent for the use of the data. All methods were carried out in accordance with relevant guidelines and regulations. Individual informed consent was not required, as the data were de-identified prior to access and used in compliance with applicable privacy legislation and institutional policies. The study was approved by the University of Alberta Research Ethics Board (ID: Pro00088808), with informed consent obtained from the data custodians (i.e., physicians participating in the study).*