Result: Steering veridical large language model analyses by correcting and enriching generated database queries: first steps toward ChatGPT bioinformatics.

Title:

Steering veridical large language model analyses by correcting and enriching generated database queries: first steps toward ChatGPT bioinformatics.

Authors:

Cinquin O; Department of Developmental and Cell Biology, Center for Complex Biological Systems, University of California at Irvine, 4203 McGaugh Hall, Irvine, CA 92697, USA.

Source:

Briefings in bioinformatics [Brief Bioinform] 2024 Nov 22; Vol. 26 (1).

Publication Type:

Journal Article

Language:

English

Journal Info:

Publisher: Oxford University Press
Country of Publication: England
NLM ID: 100912837
Publication Model: Print
Cited Medium: Internet
ISSN: 1477-4054 (Electronic)
Linking ISSN: 1467-5463
NLM ISO Abbreviation: Brief Bioinform
Subsets: MEDLINE

Imprint Name(s):

Publication: Oxford : Oxford University Press
Original Publication: London ; Birmingham, AL : H. Stewart Publications, [2000-

MeSH Terms:

Computational Biology*/methods; Programming Languages*; Software*; Databases, Factual*; Genomics; Humans; Large Language Models; Generative Artificial Intelligence

Contributed Indexing:

Keywords: Database Query Correction; Generative Pretrained Transformer (GPT); LLM factual accuracy; LLM steering; bioinformatics; large language model (LLM); retrieval-augmented generation (RAG)

Entry Date(s):

Date Created: 20250206
Date Completed: 20250505
Latest Revision: 20250526

Update Code:

20260130

PubMed Central ID:

PMC11798674

DOI:

10.1093/bib/bbaf045

PMID:

39910777

Database:

MEDLINE

Further Information

Large language models (LLMs) leverage factual knowledge from pretraining. Yet this knowledge remains incomplete and sometimes challenging to retrieve, especially in scientific domains not extensively covered in pretraining datasets and where information is still evolving. Here, we focus on genomics and bioinformatics. We confirm and expand upon issues with plain ChatGPT functioning as a bioinformatics assistant. Poor data retrieval and hallucination lead ChatGPT to err, as do incorrect sequence manipulations. To address this, we propose a system basing LLM outputs on up-to-date, authoritative facts and facilitating LLM-guided data analysis. Specifically, we introduce NagGPT, a middleware tool to insert between LLMs and databases, designed to bridge gaps in LLM knowledge and usage of database application programming interfaces. NagGPT proxies LLM-generated database queries, with special handling of incorrect queries. It acts as a gatekeeper between query responses and the LLM prompt, redirecting large responses to files but providing a synthesized snippet and injecting comments to steer the LLM. A companion OpenAI custom GPT, Genomics Fetcher-Analyzer, connects ChatGPT with NagGPT. It steers ChatGPT to generate and run Python code, performing bioinformatics tasks on data dynamically retrieved from a dozen common genomics databases (e.g. NCBI, Ensembl, UniProt, WormBase, and FlyBase). We implement partial mitigations for encountered challenges: detrimental interactions between code generation style and data analysis, confusion between database identifiers, and hallucination of both data and actions taken. Our results identify avenues to augment ChatGPT as a bioinformatics assistant and, more broadly, to improve factual accuracy and instruction following of unmodified LLMs.
(© The Author(s) 2025. Published by Oxford University Press.)
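
The gatekeeper behavior described in the abstract (large responses redirected to files, with a short snippet and a steering comment returned in their place) can be illustrated with a minimal Python sketch. This is not the NagGPT implementation: the function name `gatekeep_response`, the `MAX_INLINE_CHARS` threshold, the file-naming scheme, and the shape of the returned dictionary are all assumptions made for illustration, and the snippet here is a plain truncation rather than a synthesized summary.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical size limit; the record does not state what threshold NagGPT uses.
MAX_INLINE_CHARS = 500


def gatekeep_response(body: str, out_dir: Path, max_inline: int = MAX_INLINE_CHARS) -> dict:
    """Decide what part of a database response would reach the LLM prompt.

    Small responses pass through unchanged. Large ones are written to a file,
    and only a truncated snippet plus a steering comment are returned.
    """
    if len(body) <= max_inline:
        return {"content": body, "file": None, "comment": None}

    out_dir.mkdir(parents=True, exist_ok=True)
    # Name the file after a content hash so repeated responses map to one file.
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
    path = out_dir / f"response_{digest}.txt"
    path.write_text(body, encoding="utf-8")

    # Illustrative steering comment; the wording is an assumption.
    comment = (
        f"Response too large for the prompt: {len(body)} characters saved to "
        f"{path.name}. Analyze the file with code; do not infer its contents "
        "from the snippet."
    )
    return {"content": body[:max_inline], "file": str(path), "comment": comment}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fasta = ">seq1\n" + "ACGT" * 1000
        result = gatekeep_response(fasta, Path(tmp), max_inline=60)
        print(result["comment"])
        print(result["content"])
```

Returning the comment as a separate field, rather than splicing it into the snippet, keeps the retrieved data distinguishable from the injected instruction; whether NagGPT separates the two in this way is not stated in this record.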
