A team of researchers has tested ChatGPT, an artificial intelligence (AI) chatbot, on its clinical reasoning skills using questions from the United States Medical Licensing Examination (USMLE).
The team, publishing their results on preprint server medRxiv, wrote that they chose to test the generative language AI on questions from the USMLE as it was a "high-stakes, comprehensive three-step standardized testing program covering all topics in physicians’ fund of knowledge, spanning basic science, clinical reasoning, medical management, and bioethics".
The language model, trained on massive amounts of text from the internet, was not trained on the version of the test used by the researchers, nor was it given any supplementary medical training prior to the study, which saw it answer a number of open-ended and multiple-choice questions.
"In this present study, ChatGPT performed at >50% accuracy across all examinations, exceeding 60% in most analyses," the team wrote in their study.
"The USMLE pass threshold, while varying by year, is approximately 60%. Therefore, ChatGPT is now comfortably within the passing range. Being the first experiment to reach this benchmark, we believe this is a surprising and impressive result."
The team writes that the performance of the AI could be improved with more prompting and interaction with the model. Where the AI performed poorly, giving less concordant answers, they believe this was partly due to information the AI had not encountered.
However, they believe that the OpenAI bot had an advantage over models trained entirely on medical text, as it had more of an overview of the clinical context.
"Paradoxically, ChatGPT outperformed PubMedGPT (accuracy 50.8%, unpublished data), a counterpart [large language model] with similar neural structure, but trained exclusively on biomedical domain literature," the team wrote in their discussion.
"We speculate that domain-specific training may have created greater ambivalence in the PubMedGPT model, as it absorbs real-world text from ongoing academic discourse that tends to be inconclusive, contradictory, or highly conservative or noncommittal in its language."
The team writes that AI may soon become commonplace in healthcare settings, given how quickly the field is progressing, perhaps by improving risk assessment or providing assistance and support with clinical decisions.
The study is published on preprint server medRxiv. It has not yet been peer-reviewed.