Title:
Data as scholarly output: addressing challenges in data citation tracking through natural language processing automation.
Authors:
Groenendyk, Michael1 (AUTHOR) michael.groenendyk@concordia.ca, Ivan, Laura1 (AUTHOR) laura.ivan@concordia.ca
Source:
Scientometrics. Jan2026, Vol. 131 Issue 1, p489-499. 11p.
Database:
Library, Information Science & Technology Abstracts

Abstract:

*This study evaluates the challenges of tracking citations for research datasets and explores the potential of an automated Python script, combined with the DataCite API, to improve citation tracking accuracy. Traditional bibliometric tools such as Scopus, Web of Science, and Google Scholar, which have been optimized for journal articles, fail to adequately track dataset citations due to inconsistent citation practices and fundamental limitations in how datasets are indexed. For this study, we developed a Python script to identify DataCite DOIs in academic texts and systematically compared the script's effectiveness against traditional citation tracking tools using 550 academic articles and 43 datasets. Our automated approach identified dataset citations that were invisible to traditional bibliometric databases, which do not index datasets as citable scholarly objects. This systematic comparison reveals a critical gap in current citation tracking infrastructure, demonstrating that automated extraction methods can capture dataset usage patterns missed by established tools. Our findings highlight the need for improved dataset citation practices and provide a practical solution for researchers and institutions seeking to measure the true impact of research data contributions. [ABSTRACT FROM AUTHOR]*
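The abstract does not reproduce the authors' script, but the approach it describes (extracting DOIs from article full text and resolving them against the DataCite REST API) can be sketched roughly as follows. This is a minimal illustration, not the study's actual code: the regex, the punctuation-stripping heuristic, and the function names are assumptions; only the DataCite endpoint (`https://api.datacite.org/dois/{doi}`) is a real, documented API.

```python
"""Illustrative sketch of DOI extraction plus DataCite lookup.

NOT the authors' script: regex and helper names are hypothetical.
"""
import re
import urllib.parse

# Match a DOI: prefix "10." + 4-9 digit registrant code + "/" + suffix.
# The suffix pattern is a heuristic; DOI suffixes are loosely specified.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def extract_dois(text):
    """Return unique DOI strings found in a block of text.

    Trailing punctuation (sentence periods, commas, closing
    parentheses) is stripped, since in-text DOIs usually end
    a citation or sentence.
    """
    found = {m.group(0).rstrip('.,;)') for m in DOI_PATTERN.finditer(text)}
    return sorted(found)


def datacite_api_url(doi):
    """Build the DataCite REST API lookup URL for a DOI.

    Fetching this URL (e.g. with urllib or requests) returns JSON
    metadata; a DOI registered with DataCite as a dataset has
    attributes.types.resourceTypeGeneral == "Dataset". The request
    itself is omitted here to keep the sketch offline.
    """
    return "https://api.datacite.org/dois/" + urllib.parse.quote(doi, safe="")


# Demo on a fabricated sentence containing two example DOI strings.
sample = ("Data are available at https://doi.org/10.5061/dryad.abc123, "
          "with code archived at 10.5281/zenodo.123456.")
print(extract_dois(sample))
```

In practice, extracted DOIs that resolve at the DataCite endpoint (rather than, say, Crossref) are the candidate dataset citations; each hit can then be checked against the tracked datasets' DOI list, which is the comparison the study performs against traditional tools.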