Performance of ChatGPT on the Plastic Surgery In-Training Examination
Abstract
Background. Recently, the artificial intelligence chatbot Chat Generative Pre-Trained Transformer (ChatGPT) performed well on all 3 steps of the United States Medical Licensing Examination (USMLE), demonstrating a high level of insight into a physician’s knowledge base and clinical reasoning ability.1,2 This study aims to evaluate the performance of ChatGPT on the American Society of Plastic Surgeons (ASPS) Plastic Surgery In-Training Examination (PSITE) to assess its clinical reasoning and decision-making ability and to examine its reliability with respect to plastic surgery competencies.
Methods. PSITE questions from 2015 to 2023 were included in this study. Questions with images, charts, and graphs were excluded. ChatGPT 3.5 was prompted to provide the best single letter answer choice. Performance was analyzed across test years, question area of content, taxonomy, and core competency via chi-square analysis. Multivariable logistic regression was performed to identify predictors of ChatGPT performance.
Results. In this study, 1850 of 2097 multiple choice questions were included. ChatGPT answered 845 (45.7%) questions correctly, performing highest on breast/cosmetic and core surgical principles topics (49.6% each) (P = .070). ChatGPT performed significantly better on questions requiring the lowest level of reasoning (knowledge, 55.1%) compared with more complex questions such as analysis (41.4%) (P = .001). Multivariable analysis identified negative predictors of performance, including the hand/lower extremity topic (OR = 0.73, P = .038) and taxonomy levels beyond knowledge (P < .05). Performance on the 2023 exam (53.4%) corresponded to a 4th percentile score when compared with all plastic surgery residents.
Conclusions. While ChatGPT’s performance has shown promise in other medical domains, our results indicate it may not be a reliable source of information for plastic surgery–related questions or decision-making.
Introduction
Chat Generative Pre-Trained Transformer (ChatGPT) is a large language model (LLM) developed by OpenAI to perform human-like abilities such as reasoning, problem-solving, and learning. Launched publicly in November 2022 as an online, interactive messaging bot, ChatGPT uses deep learning models to converse in a human-like fashion on a vast array of subjects.3 Recently, national attention has focused on the performance of LLMs in the health care industry and their application to academic medicine, education, and training. In the current literature, ChatGPT has performed well on all 3 steps of the United States Medical Licensing Examination (USMLE), demonstrating a high level of understanding and insight into a physician’s knowledge base and clinical reasoning ability.2 As a logical next step, researchers have begun to assess the ability of LLMs to complete medical board–style examination questions in a variety of subspecialties.4-8
The aim of this study was to broaden the understanding and application of LLMs as they relate to the field of plastic and reconstructive surgery. ChatGPT was evaluated specifically on its performance on the Plastic Surgery In-Training Examination (PSITE). The American Society of Plastic Surgeons (ASPS) PSITE is a 250-question multiple-choice examination administered annually to assess residents’ knowledge of plastic and reconstructive surgery principles. The areas of content included on the PSITE are comprehensive plastic surgery principles, craniomaxillofacial surgery, breast/cosmetic surgery, hand and lower extremity surgery, and core surgical principles.9 In this study, we aim to evaluate the performance of ChatGPT on the ASPS PSITE, assessing its clinical reasoning and decision-making ability as it pertains to plastic surgery competencies and comparing its performance to residents nationwide.
ChatGPT’s ability to perform on the ASPS PSITE has begun to be evaluated; however, to our knowledge, no study has assessed the performance of LLMs on the PSITE as comprehensively by content type, core competency, and level of complexity.6,10 Our findings contribute to the understanding of ChatGPT’s clinical applicability in the field of plastic and reconstructive surgery.
Methods
Questions from the 2015 to 2023 ASPS PSITE examinations were obtained from the American Council of Academic Plastic Surgeons (ACAPS) In-Service Exams resource (n = 2097). Questions with clinical images, charts, and graphs were identified and excluded from the analysis (n = 270). After screening, 1850 multiple choice questions were included in this analysis. ChatGPT version 3.5 was used for this analysis because it is the free, widely available platform accessible to the public, including clinicians and patients. Questions were input in batches of 5, and ChatGPT was prompted to answer each question with the best single letter answer choice. Output answers were scored as either correct or incorrect.
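Although the questions in this study were entered manually through the public ChatGPT 3.5 interface, a reader wishing to reproduce this type of evaluation programmatically could follow a similar workflow through the OpenAI API. The sketch below is a minimal illustration only; the gpt-3.5-turbo model name, the prompt wording, and the question and answer-key variables are assumptions for demonstration rather than the exact procedure used in this study.

```python
# Illustrative sketch: the study used the public ChatGPT 3.5 chat interface;
# this reproduces the general workflow (batches of 5, single-letter answers) via the API.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade_batch(questions, answer_key):
    """Send up to 5 multiple-choice questions and score single-letter responses."""
    prompt = (
        "Answer each of the following multiple-choice questions with the best "
        "single letter answer choice only, one letter per line.\n\n"
        + "\n\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in for ChatGPT 3.5
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Extract the first A-E letter found on each nonempty response line.
    answers = []
    for line in response.choices[0].message.content.splitlines():
        match = re.search(r"[A-E]", line.strip().upper())
        if match:
            answers.append(match.group())
    # Score each answer as correct (True) or incorrect (False).
    return [given == correct for given, correct in zip(answers, answer_key)]


# Hypothetical usage: psite_questions and psite_key would hold the image-free item bank.
# results = grade_batch(psite_questions[0:5], psite_key[0:5])
```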
Question classification
Questions were classified in 3 ways: content type, Accreditation Council for Graduate Medical Education (ACGME) core competency, and educational taxonomy.
The PSITE examination is broken into 5 sections separated by content type. These sections are breast/cosmetic, comprehensive, core surgical principles, craniomaxillofacial, and hand/lower extremity. Questions were categorized into content types based on the section in which they appeared on the exams.
Each question was additionally categorized by ACAPS based on ACGME core competency.11 The 6 core competencies include patient care, medical knowledge, interpersonal and communication skills, professionalism, practice-based learning and improvement, and systems-based practice.
Classification by educational taxonomy was performed according to the Bloom educational taxonomy model.12 This model is used to classify questions by level of complexity and specificity into 6 levels: (1) knowledge, (2) comprehension, (3) application, (4) analysis, (5) synthesis, and (6) evaluation. Each question was independently classified by 2 of the authors (B.R., K.K.). In the event of discordance, the authors discussed their reasoning until a taxonomy level was agreed upon.
Statistical analysis
ChatGPT performance was analyzed across test years, content type, core competency, and educational taxonomy via chi-square analysis. Performance on the 2023 PSITE was compared with the ACAPS normative table to correlate percent correct to a percentile among plastic surgery residents overall and by training year. Multivariable logistic regression was used to identify predictors of ChatGPT performance. Statistical significance was defined as α < .05. All statistical analyses were performed with Stata 18 (StataCorp).
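The analyses described above were performed in Stata 18. As a rough illustration of the same workflow, the following is a minimal Python sketch of a chi-square test and multivariable logistic regression on per-question results; the file name and column names (correct, content, taxonomy, competency, year) are hypothetical placeholders rather than the authors' actual dataset.

```python
# Illustrative Python analogue of the Stata 18 analysis described above.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.formula.api as smf

# Hypothetical per-question results: one row per question, with correct coded 0/1.
df = pd.read_csv("psite_chatgpt_results.csv")

# Chi-square test of performance across educational taxonomy levels.
table = pd.crosstab(df["taxonomy"], df["correct"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"Taxonomy chi-square: chi2={chi2:.2f}, p={p:.3f}")

# Multivariable logistic regression: categorical predictors of a correct response.
model = smf.logit(
    "correct ~ C(content) + C(taxonomy) + C(competency) + C(year)", data=df
).fit()
print(model.summary())
print(np.exp(model.params))  # exponentiated coefficients as odds ratios
```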
Results
ChatGPT achieved an overall accuracy of 45.7% (845 correct responses out of 1850 included questions) across all examinations, with performance by year ranging from 22.9% to 54.8% (P < .001). Performance by examination year is summarized in Table 1.
TABLE 1. ChatGPT Performance on the PSITE by Examination Year
Analysis by area of content revealed that ChatGPT performed highest on questions in the core surgical principles (49.6%) and breast and cosmetic (49.6%) content areas, and lowest in the hand and lower extremity (40.7%) content area (P = .070). ChatGPT scored significantly higher on questions that were deemed the lowest level of reasoning (knowledge) based on the Bloom education model (P = .001). There were no significant differences in performance among the ACGME core competencies (P = .882). Performance results classified by area of content, taxonomy, and core competency are shown in Table 2. These findings are supported by the multivariable logistic regression analysis, which identified multiple significant independent predictors of poor ChatGPT performance including hand/lower extremity questions and those requiring higher levels of reasoning (Table 3).
TABLE 2. Performance of ChatGPT on the PSITE Based on Area of Content, Taxonomy, and Core Competency
TABLE 3. Multivariable Logistic Regression of ChatGPT’s Performance on the PSITE
In 2023, ChatGPT would rank in the 4th percentile of all plastic surgery residents regardless of program type (independent versus integrated) (Figure). ChatGPT would rank in the 14th percentile of first-year, 5th percentile of second-year, 2nd percentile of third-year, and 0th percentile of fourth-, fifth-, and sixth-year integrated plastic surgery residents. Similarly, ChatGPT ranked in the 5th percentile of first-year and 0th percentile of second- and third-year independent plastic surgery residents (Table 4).
Figure. ChatGPT ranks in the 4th percentile among all plastic surgery residents.
TABLE 4. ChatGPT Percentile Rankings on the PSITE Compared With Independent and Integrated Residents
Discussion
ChatGPT-3.5 (ChatGPT), a prominent LLM, is an artificial intelligence (AI) chatbot trained on textual data to respond to prompts in a comprehensive and conversational manner. LLMs have garnered interest for their application to the medical field, particularly for their clinical reasoning and decision-making abilities. In this study, we evaluated ChatGPT’s performance on the PSITE. Our findings suggest that, although ChatGPT demonstrates adequate performance in other medical domains, its clinical reasoning and decision-making abilities in the context of plastic and reconstructive surgery are marginal. Previous studies in the field have focused on performance and percent correct alone when assessing ChatGPT’s ability relating to the PSITE.6,10 To our knowledge, our study is the first to assess ChatGPT’s performance on the PSITE comprehensively, considering factors such as content type, core competency, and level of question complexity.
The PSITE is an annual examination administered to residents in training and experienced surgeons to assess their current knowledge in the field.9 Scores on this assessment are used as a surrogate for performance on the written board certification examination, and the PSITE can therefore be viewed as an assessment of the minimum knowledge required of a practicing plastic surgeon.13 Our results show that ChatGPT answered 45.7% of PSITE questions correctly. When comparing the proficiency of ChatGPT with that of plastic surgery residents, ChatGPT would rank in the 4th percentile of all plastic surgery residents and in the 0th percentile for senior residents. When evaluating ChatGPT’s ability based on question content (eg, breast and cosmetic) and the core competencies proposed by the ACGME (eg, medical knowledge), ChatGPT performed universally poorly. In general, ChatGPT scored highest on core surgical principles and breast and cosmetic questions (49.6%) and lowest on hand and lower extremity questions (40.7%). On multivariable analysis, ChatGPT scored significantly worse on the hand and lower extremity category when controlling for confounders. It also showed no differential performance across questions in the 6 ACGME core competency areas. Overall, ChatGPT exhibited a low level of ability relating to plastic surgery proficiencies and clinical knowledge.
Our analysis of ChatGPT’s performance on the PSITE also illustrates how LLMs are limited in their ability to perform higher-level thinking. The Bloom educational taxonomy model classifies questions based on the cognitive skills required to process information, ranging from lower-order to higher-order skills.12 Our study demonstrates that LLMs are significantly more effective at answering prompts requiring straightforward fact retrieval (55.1%) than those requiring critical thinking (43.7%). Because LLMs utilize pattern recognition and previous inputs to provide answers, questions that require a more sophisticated understanding of plastic surgery topics could cause them to generate erroneous responses. Studies by Gupta et al and Shah et al showed that ChatGPT and other LLMs may even present convincing, detailed explanations for plastic surgery–based questions grounded in flawed reasoning and fallacy.6,14 While LLMs may excel in retrieving discrete pieces of information, they struggle with analyzing and addressing complex clinical problems, composite skills that are necessary for appropriate assessment and decision-making in the medical field.
Our study adds to the literature on LLM performance in various medical domains by demonstrating the overall competency of ChatGPT on the PSITE and revealing the nuances that should be considered when utilizing these widely available tools. We found that ChatGPT’s accuracy on the plastic surgery training examination (45.7%) was lower than its performance in other subspecialties, including neurology (65.8%) and neurosurgery (53.2%).7,8 In contrast, recent investigations found that ChatGPT scored similarly to first-year orthopedic residents and better than first-year family medicine residents on their respective residency training examinations.15,16 This suggests that ChatGPT’s medical knowledge may be conditional on its subject matter. AI software is trained on internet text and databases, so its abilities are limited in part by that dataset. While the database ChatGPT currently uses results in a meager performance on the PSITE and other medical specialty examinations, a more expansive or subject-specific dataset could improve its functioning in the future, with the potential to serve as a useful adjunct and reliable tool for users.
Limitations
This study excluded all questions containing charts and clinical images because ChatGPT 3.5 does not have the capacity to analyze images. Spatial visualization of defects and biological structures is an important aspect of plastic and reconstructive surgery and provides important information about the many conditions plastic surgeons treat. Newer AI tools that include image analysis, such as GPT-4, now exist; evaluating these technologies would be worthwhile given our specialty’s distinct focus on form and function. In addition, the authors prompted ChatGPT to answer up to 5 questions at a time and did not clear the chat history after each question. This leaves potential for the LLM to utilize previous information in the chat to answer later questions, which could give additional context to those questions or reaffirm inaccurate justification from earlier answers. However, test takers may also condition their answers based on information presented earlier in the exam. Finally, ChatGPT-3.5 was assessed in this study. Since the initiation of this study, GPT-4, a more updated version of the chatbot, along with multiple updates to ChatGPT-3.5 (which currently has a knowledge cutoff of January 2022), have been released. GPT-4 requires a subscription; thus, GPT-3.5 is more likely to be widely used by the general population and clinicians as it is free to use.
Conclusions
Our data suggest that ChatGPT has a low level of proficiency relating to plastic surgery competencies. Overall, ChatGPT would rank below the 5th percentile when comparing its performance with that of residents nationwide. While ChatGPT’s performance has shown promise in other medical domains, our results indicate it may not be a reliable source of information for plastic surgery–related competencies or decision-making. Our findings highlight the importance of exercising caution when using LLMs in specific medical domains and the need for continued research and refinement of these models to improve their accuracy and reliability.
Acknowledgments
Authors: Brielle E. Raine, MD1; Katherine A. Kozlowski, BS2; Cody C. Fowler, MD1; Jordan D. Frey, MD2,3
Affiliations: 1Division of Plastic and Reconstructive Surgery, Department of Surgery, University of Rochester Medical Center, Rochester, New York; 2Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, New York; 3Department of Plastic Surgery and Reconstructive Surgery, Erie County Medical Center, Buffalo, New York
Correspondence: Brielle E. Raine, MD; Brielle_raine@urmc.rochester.edu
Disclosures: The authors disclose no relevant financial or nonfinancial interests.
References
1. OpenAI. ChatGPT. Accessed June 21, 2023. https://chat.openai.com/
2. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198
3. Schulman J, Zoph B, Kim C, et al. Introducing ChatGPT. OpenAI. Accessed June 21, 2023. https://openai.com/blog/chatgpt
4. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci. 2023;3(4):100324. doi:10.1016/j.xops.2023.100324
5. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582. doi:10.1148/radiol.230582
6. Gupta R, Park JB, Herzog I, et al. Applying GPT-4 to the Plastic Surgery Inservice Training Examination. J Plast Reconstr Aesthet Surg. 2023;87:78-82. doi:10.1016/j.bjps.2023.09.027
7. Chen TC, Multala E, Kearns P, et al. Assessment of ChatGPT's performance on neurology written board examination questions. BMJ Neurol Open. 2023;5(2):e000530. doi:10.1136/bmjno-2023-000530
8. Hopkins BS, Nguyen VN, Dallas J, et al. ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions. J Neurosurg. 2023;139(3):904-911. doi:10.3171/2023.2.JNS23419
9. American Society of Plastic Surgeons. Administrative information. Accessed January 29, 2024. https://www.plasticsurgery.org/for-medical-professionals/education/events/in-service-exam-for-residents/administrative-information
10. Humar P, Asaad M, Bengur FB, Nguyen V. ChatGPT is equivalent to first-year plastic surgery residents: evaluation of ChatGPT on the plastic surgery in-service examination. Aesthet Surg J. 2023;43(12):NP1085-NP1089. doi:10.1093/asj/sjad130
11. Kearney AM, Rokni AM, Gosain AK. The Accreditation Council for Graduate Medical Education Milestones in Integrated Plastic Surgery programs: how competency-based assessment has been implemented. Plast Reconstr Surg. 2022;149(4):1001. doi:10.1097/PRS.0000000000008938
12. Adams NE. Bloom's taxonomy of cognitive learning objectives. J Med Libr Assoc. 2015;103(3):152-153. doi:10.3163/1536-5050.103.3.010
13. Girotto JA, Brandt K, Janis JE, Cullisen T, Slezak S. Abstract: Saw it coming: the correlation between poor performance on the Plastic Surgery In-Service Exam and failure on the American Board written exam. Plast Reconstr Surg Glob Open. 2017;5(9S):70. doi:10.1097/01.GOX.0000526263.79761.31
14. Shah P, Bogdanovich B, Patel PA, Boyd CJ. Assessing the plastic surgery knowledge of three natural language processor artificial intelligence programs. J Plast Reconstr Aesthet Surg. 2024;88:193-195. doi:10.1016/j.bjps.2023.10.141
15. Huang RST, Lu KJQ, Meaney C, Kemppainen J, Punnett A, Leung F-H. Assessment of resident and AI chatbot performance on the University of Toronto Family Medicine Residency Progress Test: comparative study. JMIR Med Educ. 2023;9:e50514. doi:10.2196/50514
16. Kung JE, Marshall C, Gauthier C, Gonzalez TA, Jackson JB. Evaluating ChatGPT performance on the Orthopaedic In-Training Examination. JB JS Open Access. 2023;8(3):e23.00056. doi:10.2106/JBJS.OA.23.00056