Title:
Contextualizing AI Evaluation in Anesthesiology: Interpreting Large Language Models and Computer Vision Metrics Across Clinical Use Cases-An Expert Statement from the Society of Technology in Anesthesia.
Authors:
Cafferty O, Jeffries SD, Pelletier ED, Tu Z (Departments of Surgical and Interventional Sciences and Anesthesia); Sinha A (Department of Anesthesia); Hemmerling TM (Departments of Surgical and Interventional Sciences, Anesthesia, and Surgery), McGill University, Montreal, Canada.
Corporate Authors:
Source:
Anesthesia and analgesia [Anesth Analg] 2026 Mar 02. Date of Electronic Publication: 2026 Mar 02.
Publication Model:
Ahead of Print
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Lippincott Williams &amp; Wilkins Country of Publication: United States NLM ID: 1310650 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1526-7598 (Electronic) Linking ISSN: 0003-2999 NLM ISO Abbreviation: Anesth Analg Subsets: MEDLINE
Imprint Name(s):
Publication: 1998- : Baltimore, Md. : Lippincott Williams & Wilkins
Original Publication: Cleveland, International Anesthesia Research Society.
References:
Daccache N, Zako J, Morisson L, Laferrière-Langlois P. The applications of ChatGPT and other large language models in anesthesiology and critical care: a systematic review. Can J Anaesth. 2025;72:904–922.
Joshi S. Evaluation of large language models: review of metrics, applications, and methodologies [v2]. Preprints. 2025:2025040369. doi:10.20944/preprints202504.0369.v1.
Ratnagandhi JA, Godavarthy P, Gnaneswaran M, Lim B, Vittalraj R. Enhancing anesthetic patient education through the utilization of large language models for improved communication and understanding. Anesth Research. 2025;2:4.
Kulkarni A, Zhang Y, Moniz JRA, et al. Evaluating evaluation metrics: the mirage of hallucination detection. In: Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics; Suzhou, China; 2025:19013–19032.
Li J, Chen J, Ren R, et al. The dawn after the dark: an empirical study on factuality hallucination in large language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol 1: Long Papers). Association for Computational Linguistics; Bangkok, Thailand; 2024:10879–10899.
Asgari E, Montaña-Brown N, Dubois M, et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digit Med. 2025;8:274.
Patnaik SS, Hoffmann U. Quantitative evaluation of ChatGPT versus Bard responses to anaesthesia-related queries. Br J Anaesth. 2024;132:169–171.
Bulian J, Buck C, Gajewski W, Boerschinger B, Schuster T. Tomayto, Tomahto: beyond token-level answer equivalence for question answering evaluation. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2022:291–305.
Fujimoto M, Kuroda H, Katayama T, et al. Evaluating large language models in dental anesthesiology: a comparative analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of anesthesiology board certification exam. Cureus. 2024;16:e70302.
Lukasik M, Narasimhan H, Menon AK, Yu F, Kumar S. Regression-aware inference with LLMs. In: Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics; 2024:13667–13678.
Chang Y, Wang X, Wang J, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology. 2024;15:39.
TempestVanSchaik. Evaluation metrics. Accessed February 18, 2026. https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/working-with-llms/evaluation/list-of-eval-metrics.
Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02). Association for Computational Linguistics; 2002:311–318.
Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out. Association for Computational Linguistics; 2004:74–81.
Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. BERTScore: evaluating text generation with BERT. In: Proceedings of the International Conference on Learning Representations (ICLR 2020). International Conference on Learning Representations; 2020.
Jiang Y, Black KC, Geng G, et al. MedAgentBench: a virtual EHR environment to benchmark medical LLM agents. NEJM AI. 2025;2:AIdbp2500144.
Moëll B. Swedish Medical LLM Benchmark (SMLB): development and evaluation of a framework for assessing large language models in the Swedish medical domain. Front AI. 2025;8:1557920.
Xu J, Lu L, Peng X, et al. Dataset and benchmark (MedGPTEval) to evaluate responses from large language models in medicine. JMIR Med Inform. 2024;12:e57674.
Introducing HealthBench. Accessed September 6, 2025. https://openai.com/index/healthbench/.
The open medical-LLM leaderboard: benchmarking large language models in healthcare. Accessed February 18, 2025. https://huggingface.co/blog/leaderboard-medicalllm.
Feng X, Jiang W, Wang Z, et al. AnesBench: multi-dimensional evaluation of LLM reasoning in anesthesiology. arXiv preprint arXiv:2504.02404; 2025.
Lonsdale H, Gray GM, Ahumada LM, Matava CT. Machine vision and image analysis in anesthesia: narrative review and future prospects. Anesth Analg. 2023;137:830–840.
Eelbode T, Bertels J, Berman M, et al. Optimization for medical image segmentation: theory and practice when evaluating with Dice Score or Jaccard Index. IEEE Trans Med Imaging. 2020;39:3679–3690.
Bilic P, Christ P, Li HB, et al. The liver tumor segmentation benchmark (LiTS). Med Image Anal. 2023;84:102680.
Müller D, Soto-Rey I, Kramer F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res Notes. 2022;15:210.
Reinke A, Tizabi MD, Sudre CH, et al. Common limitations of image processing metrics: a picture story. arXiv [eess.IV]. 2021.
Terven J, Cordova-Esparza DM, Romero-González JA, Ramírez-Pedraza A, Chávez-Urbiola EA. A comprehensive survey of loss functions and metrics in deep learning. Artif Intell Rev. 2025;58:195. doi:10.1007/s10462-025-11198-7.
Keylabs. Semantic segmentation vs object detection: a comparison. Keylabs: latest news and updates. Accessed February 18, 2026. https://keylabs.ai/blog/semantic-segmentation-vs-object-detection-a-comparison/.
Henderson P, Ferrari V. End-to-end training of object class detectors for mean average precision. In: Lai SH, Lepetit V, Nishino K, Sato Y, eds. Computer Vision – ACCV 2016. Lecture Notes in Computer Science, vol 10115. Springer; 2016.
Wenkel S, Alhazmi K, Liiv T, Alrshoud S, Simon M. Confidence score: the forgotten dimension of object detection performance evaluation. Sensors (Basel). 2021;21:4350.
Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D. Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell. 2022;44:3523–3542.
Cafferty O, Jeffries SD, Pelletier ED, Tu Z, Hemmerling TM. Contextualizing AI evaluation in anesthesiology: interpreting predictive modeling and reinforcement learning metrics across clinical use cases. Anesth Analg. Manuscript in revision.
Daroya R, Sun A, Maji S. Improving satellite imagery masking using multitask and transfer learning. IEEE J Sel Top Appl Earth Obs Remote Sens. 2025:8777–8796.
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2020;128:336–359.
Duque VG, Marquardt A, Velikova Y, et al. Ultrasound segmentation analysis via distinct and completed anatomical borders. Int J Comput Assist Radiol Surg. 2024;19:1419–1427.
Saito M, Mitamura M, Kimura M, et al. Grad-CAM-based investigation into acute-stage fluorescein angiography images to predict long-term visual prognosis of branch retinal vein occlusion. J Clin Med. 2024;13:5271.
Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ. 2006;332:1089–1092.
Entry Date(s):
Date Created: 20260302 Latest Revision: 20260302
Update Code:
20260303
DOI:
10.1213/ANE.0000000000007991
PMID:
41771270
Database:
MEDLINE
*Further Information*
*Conflicts of Interest, Funding: Please see DISCLOSURES at the end of this article.*