Result: Evaluation of large language models for discovery of gene set function.
Update of: Res Sq. 2023 Sep 18:rs.3.rs-3270331. doi: 10.21203/rs.3.rs-3270331/v1. (PMID: 37790547)
Hu, M. et al. Evaluation of Large Language Models for Discovery of Gene Set Function (Code Ocean, 2024); https://doi.org/10.24433/CO.7045777.V1.
Further information
Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity. Conversely, random gene sets correctly yield zero confidence in 87% of cases. Other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) vary in function recovery but are falsely confident for random sets. In gene clusters from omics data, GPT-4 identifies common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of supporting rationale and citations finds these functions are largely verifiable. These results position LLMs as valuable omics assistants.
(© 2024. The Author(s), under exclusive licence to Springer Nature America, Inc.)
Competing interests: T.I. is a cofounder and member of the advisory board and has an equity interest in Data4Cure and Serinus Biosciences. T.I. is a consultant for and has an equity interest in Ideaya Biosciences. The terms of these arrangements have been reviewed and approved by the University of California San Diego in accordance with its conflict-of-interest policies. The other authors declare no competing interests.
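Illustrative sketch: the abstract reports three kinds of summary figures: the share of curated gene sets for which the proposed name is similar to the curated name, whether higher self-confidence goes with higher similarity, and the share of random gene sets that receive zero confidence. The Python sketch below shows one way such figures could be tallied. It is not the authors' code. The function names, the 0.5 similarity threshold, the use of Spearman rank correlation, and the toy numbers are all assumptions made for illustration, and none of the values come from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def summarize(curated, random_confidences, threshold=0.5):
    """Tally the three summary figures.

    curated: list of (confidence, similarity) pairs, one per curated gene set.
    random_confidences: confidence scores assigned to random gene sets.
    threshold: hypothetical cutoff above which a name counts as "similar".
    """
    confidences = [c for c, _ in curated]
    similarities = [s for _, s in curated]
    return {
        "fraction_similar": mean(s >= threshold for s in similarities),
        "confidence_similarity_rho": spearman(confidences, similarities),
        "fraction_zero_confidence_random": mean(
            c == 0 for c in random_confidences
        ),
    }


# Invented toy data, for illustration only.
toy_curated = [(0.95, 0.9), (0.90, 0.8), (0.85, 0.7), (0.20, 0.3)]
toy_random = [0.0, 0.0, 0.0, 0.8]
toy_summary = summarize(toy_curated, toy_random)
```

A rank-based statistic is used here only because it does not assume a linear relationship between confidence and similarity; the abstract does not state which statistic the study used.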