*Result*: SampleExplorer: using language models to discover relevant transcriptome data.
*Further Information*
*Motivation: Over the last two decades, transcriptomics has become a standard technique in biomedical research. We now have large databases of RNA-seq data, accompanied by valuable metadata detailing the scientific objectives and experimental procedures used. This metadata is crucial for understanding and replicating published studies, but has so far been underutilized in helping researchers discover existing datasets.
Results: We present SampleExplorer, a tool that lets researchers search for relevant data using both text and gene set queries. SampleExplorer embeds sample metadata and uses a transformer-based language model to retrieve similar datasets. Extensive benchmarking on the ARCHS4 database (see Supplementary Materials and Methods) demonstrates that SampleExplorer effectively retrieves biologically relevant samples from large-scale transcriptomic data. The tool offers an efficient way to discover relevant gene expression datasets in large public repositories. It improves sample and dataset identification across diverse experimental contexts, helping researchers leverage existing transcriptomic data for potential replication or verification studies.
Availability and implementation: SampleExplorer is a Python package compatible with Python versions 3.9 to 3.11 and can be installed from the Python Package Index (PyPI). The codebase and documentation are accessible at https://github.com/wlchin/SampleExplorer. Supplementary data (Supplementary Materials and Methods) provide detailed methodological information, including an algorithmic description of the retrieval process and data preparation steps.
(© The Author(s) 2024. Published by Oxford University Press.)*
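
The embedding-based retrieval idea summarized in the Results can be illustrated with a minimal sketch. This is not SampleExplorer's API or its algorithm: the sample IDs and descriptions are invented, the function names (`embed`, `cosine`, `search`) are hypothetical, and a simple token-count vector stands in for the transformer-based embedding so that the example runs with the standard library alone. It shows only the general pattern of embedding metadata, embedding a text query, and ranking by cosine similarity.

```python
import math
import re
from collections import Counter

# Toy metadata records; IDs and descriptions are invented for illustration.
SAMPLES = {
    "sample_A": "lung adenocarcinoma tumour biopsy RNA-seq",
    "sample_B": "healthy liver tissue RNA-seq control",
    "sample_C": "lung fibroblast culture treated with TGF-beta",
}


def embed(text):
    """Stand-in embedding: a sparse token-count vector.

    Assumption: a real system would call a language model here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, samples, top_k=2):
    """Rank samples by similarity between query and metadata embeddings."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), sid) for sid, desc in samples.items()]
    # Highest score first; ties broken by sample ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(sid, round(score, 3)) for score, sid in scored[:top_k]]


results = search("lung tumour RNA-seq", SAMPLES)
for sample_id, score in results:
    print(sample_id, score)
```

Replacing `embed` with a language-model encoder would leave the ranking logic unchanged, which is the main reason for separating the two steps in this sketch.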