*Result*: Open-source large language models in action: A bioinformatics chatbot for PRIDE database.
*Further Information*
*Here we present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework uses multiple large language models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. The framework additionally provides a benchmark component based on an Elo-ranking system, which makes it possible to evaluate the performance of each LLM and to improve the PRIDE documentation. The chatbot not only lets users interact with the PRIDE documentation but can also be used to search for and find PRIDE datasets through an LLM-based recommendation system, improving dataset discoverability. Importantly, although our infrastructure is demonstrated in the context of the PRIDE database, its modular and adaptable design makes it a valuable tool for improving the user experience across a wide range of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, vector-database construction, the benchmarking framework, and optimized documentation together form a robust and transferable chatbot assistant infrastructure. The framework is open source (https://github.com/PRIDE-Archive/pride-chatbot).
(© 2024 The Authors. PROTEOMICS published by Wiley‐VCH GmbH.)*
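The Elo-ranking benchmark mentioned in the abstract can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the PRIDE chatbot's actual implementation: the function names, starting rating of 1000, and K-factor of 32 are assumptions. After each pairwise comparison of two models' answers (judged by a human or an LLM judge), the winner's rating rises and the loser's falls in proportion to how unexpected the outcome was.

```python
# Minimal Elo-rating sketch for pairwise LLM answer comparisons.
# Hypothetical illustration: names, starting ratings, and the
# K-factor are assumptions, not taken from the PRIDE chatbot codebase.

K = 32  # update step size per comparison

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_winner: float, r_loser: float) -> tuple[float, float]:
    """Return new (winner, loser) ratings after one judged comparison."""
    e_w = expected_score(r_winner, r_loser)
    r_winner += K * (1 - e_w)  # winner gains what the loser gives up
    r_loser -= K * (1 - e_w)
    return r_winner, r_loser

# Example: four models start at 1000; "mixtral" beats "llama2" twice.
ratings = {"llama2": 1000.0, "chatglm": 1000.0,
           "mixtral": 1000.0, "openhermes": 1000.0}
for _ in range(2):
    ratings["mixtral"], ratings["llama2"] = update(
        ratings["mixtral"], ratings["llama2"])

best = max(ratings, key=ratings.get)  # "mixtral" after the two wins
```

Repeating such updates over many judged question/answer pairs yields a stable ranking of the competing LLMs, which is how an Elo-based benchmark separates stronger from weaker models.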