*Result*: SampleExplorer: using language models to discover relevant transcriptome data.

Title:
SampleExplorer: using language models to discover relevant transcriptome data.
Authors:
Chin WL; National Centre for Asbestos Related Diseases, QEII Medical Centre, Nedlands, WA 6009, Australia.; Department of Medical Oncology, Sir Charles Gairdner Hospital, Hospital Ave, Nedlands, WA 6009, Australia.; The Kids Research Institute Australia, North Entrance, Perth Children's Hospital, 15 Hospital Ave, Nedlands, WA 6009, Australia., Lassmann T; The Kids Research Institute Australia, North Entrance, Perth Children's Hospital, 15 Hospital Ave, Nedlands, WA 6009, Australia.
Source:
Bioinformatics (Oxford, England) [Bioinformatics] 2024 Dec 26; Vol. 41 (1).
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Oxford University Press; Country of Publication: England; NLM ID: 9808944; Publication Model: Print; Cited Medium: Internet; ISSN: 1367-4811 (Electronic); Linking ISSN: 13674803; NLM ISO Abbreviation: Bioinformatics; Subsets: MEDLINE
Imprint Name(s):
Original Publication: Oxford : Oxford University Press, c1998-
Grant Information:
Stan Perron Foundation
Entry Date(s):
Date Created: 20250109; Date Completed: 20250122; Latest Revision: 20250520
Update Code:
20260130
PubMed Central ID:
PMC11751629
DOI:
10.1093/bioinformatics/btae759
PMID:
39786428
Database:
MEDLINE

*Further Information*

*Motivation: Over the last two decades, transcriptomics has become a standard technique in biomedical research. We now have large databases of RNA-seq data, accompanied by valuable metadata detailing scientific objectives and the experimental procedures used. The metadata is crucial in understanding and replicating published studies, but so far has been underutilized in helping researchers to discover existing datasets.
Results: We present SampleExplorer, a tool that lets researchers search for relevant data using both text and gene set queries. SampleExplorer embeds sample metadata and uses a transformer-based language model to retrieve similar datasets. Extensive benchmarking (see Supplementary Materials and Methods) using the ARCHS4 database demonstrates that SampleExplorer provides an effective approach for retrieving biologically relevant samples from large-scale transcriptomic data. The tool offers an efficient way to discover relevant gene expression datasets in large public repositories and improves sample and dataset identification across diverse experimental contexts, helping researchers leverage existing transcriptomic data for potential replication or verification studies.
Availability and implementation: SampleExplorer is a Python package compatible with Python versions 3.9 to 3.11 and can be installed from the Python Package Index (PyPI). The codebase and documentation are accessible at https://github.com/wlchin/SampleExplorer. Supplementary data (Supplementary Materials and Methods) provide detailed methodological information, including an algorithmic description of the retrieval process and data preparation steps.
(© The Author(s) 2024. Published by Oxford University Press.)*
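
The retrieval idea summarized in the Results (embed sample metadata, then rank records by similarity to a query) can be illustrated with a minimal sketch. This is not SampleExplorer's code or API. The function names (`toy_embed`, `cosine`, `search`) and the metadata records are hypothetical, and a simple bag-of-words count vector stands in for the transformer-based language model so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a language-model embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, metadata, top_k=2):
    """Return the top_k (sample_id, score) pairs ranked by similarity to the query."""
    q = toy_embed(query)
    scored = [(sid, cosine(q, toy_embed(text))) for sid, text in metadata.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Invented metadata records, for illustration only (not real ARCHS4 entries).
metadata = {
    "sample_A": "mesothelioma tumour biopsy treated with checkpoint blockade",
    "sample_B": "mouse liver tissue after high fat diet",
    "sample_C": "human mesothelioma cell line untreated control",
}

results = search("mesothelioma checkpoint blockade", metadata)
for sample_id, score in results:
    print(sample_id, round(score, 3))
```

In a realistic setting, `toy_embed` would be replaced by dense vectors from a language model, which can match records that share meaning with the query even when they share no words.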