Title:
Open-source large language models in action: A bioinformatics chatbot for PRIDE database.
Authors:
Bai J; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK., Kamatchinathan S; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK., Kundu DJ; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK., Bandla C; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK., Vizcaíno JA; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK., Perez-Riverol Y; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
Source:
Proteomics [Proteomics] 2024 Nov; Vol. 24 (21-22), pp. e2400005. Date of Electronic Publication: 2024 Mar 31.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Wiley-VCH Country of Publication: Germany NLM ID: 101092707 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1615-9861 (Electronic) Linking ISSN: 16159853 NLM ISO Abbreviation: Proteomics Subsets: MEDLINE
Imprint Name(s):
Original Publication: Weinheim, Germany : Wiley-VCH
Grant Information:
223745/Z/21/Z United Kingdom WT_ Wellcome Trust; BB/S01781X/1 United Kingdom BB_ Biotechnology and Biological Sciences Research Council
Contributed Indexing:
Keywords: bioinformatics; dataset discoverability; documentation; large language models; proteomics; public data; software architectures; training
Entry Date(s):
Date Created: 20240331 Date Completed: 20241119 Latest Revision: 20241119
Update Code:
20260130
DOI:
10.1002/pmic.202400005
PMID:
38556628
Database:
MEDLINE

*Further Information*

*We present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interaction with the PRIDE database's documentation and dataset search functionality. The framework uses multiple large language models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (application programming interface), a web interface, and components for indexing and managing vector databases. A benchmark component based on an Elo ranking system is included as well, which allows evaluating the performance of each LLM and improving the PRIDE documentation. The chatbot lets users query the PRIDE documentation and also search for PRIDE datasets through an LLM-based recommendation system, improving dataset discoverability. Although our infrastructure is exemplified through its application to the PRIDE database, its modular and adaptable design makes it applicable to improving user experience across a range of bioinformatics and proteomics tools and resources, among other domains. Together, the integration of LLMs, the vector-based construction, the benchmarking framework, and the optimized documentation form a robust and transferable chatbot assistant infrastructure. The framework is open-source (https://github.com/PRIDE-Archive/pride-chatbot).
(© 2024 The Authors. PROTEOMICS published by Wiley‐VCH GmbH.)*
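
The abstract mentions an Elo-ranking benchmark for comparing LLMs but does not describe its details. The following is a minimal sketch of how a standard Elo update could rank models from pairwise answer comparisons. It is an illustration under stated assumptions, not the framework's actual implementation: the function names, the K-factor of 32, the starting rating of 1000, and the sample comparisons are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie. k=32 is an assumed K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


def rank_models(comparisons, initial=1000.0, k=32.0):
    """Rank models from (model_a, model_b, score_a) tuples, best first.

    Every model starts at the assumed initial rating of 1000.
    """
    ratings = {}
    for model_a, model_b, score_a in comparisons:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judgments, invented purely for illustration.
    sample = [
        ("llama2", "chatglm", 1.0),
        ("openhermes", "llama2", 0.5),
        ("mixtral", "chatglm", 1.0),
    ]
    for model, rating in rank_models(sample):
        print(f"{model}: {rating:.1f}")
```

In a scheme like this, each pairwise judgment of two models' answers to the same question moves both ratings, so questions on which every model scores poorly could also point to gaps in the underlying documentation.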