Can AI Be a Good Doctor? What measuring a computer's medical ability teaches us about human doctors
As artificial intelligence (AI) advances, there is increasing interest in whether AI systems can take on roles traditionally filled by human experts, including doctors. AI systems are now being developed that can read medical images, analyze health records, diagnose diseases, and even communicate empathetically with patients. In some applications, these AI systems are reaching or even surpassing human-level performance. It is no exaggeration to say that the pace of progress in medical AI over the last decade has been staggering.
This raises an inevitable question - can AI really ever be a good doctor? This question, in turn, surfaces something about human doctors that we rarely acknowledge outright: it is very challenging to reliably determine whether any given person is a “good doctor”. Indeed, it is often difficult to determine which doctors are good even after decades of practice. Some doctors excel at diagnosis but have a problematic bedside manner. Some ace their boards but struggle to be team players. Some are wonderful colleagues but may struggle with complex cases. Are any of these kinds of doctors better or worse than the others? The answer from the field of health services research is “It depends on what you’re measuring”. There is a long list of metrics that attempt to answer some form of this question - test scores, RVUs, clinical outcomes, error reduction, case volume, and patient satisfaction - but these remain imperfect yardsticks, and each answers only a very specific question. After all, can we truly distill the body of work of a physician into a simple “good” or “bad” designation? If not, can we hope to one day apply the same standard to AI?
What has become clear over the last five years is that AI has an impressively good grasp of medical knowledge. Recent AI systems based on so-called large language models (LLMs: think of models like ChatGPT or Google’s Bard) have not gone to medical school, nor have they even been fed a diet of information specific to medicine. Instead, they have consumed the vast stores of text data available on the internet. Consequently, they are able to converse fluently on nearly any topic, and medicine just happens to be one of a potentially infinite number of things that these silicon polymaths will happily discuss. Any expert medical knowledge they possess is a totally accidental byproduct of having read the internet in its entirety.
So how do we know what, if anything, they know about medicine? An obvious answer is to have the AI take the same battery of evaluations that physicians-in-training must pass to prove they have mastered the fundamentals, such as the United States Medical Licensing Examination (USMLE) and specialty board certifications. After all, mastery of the material on these exams is a necessary, but not sufficient, condition to practice medicine in the United States. Progress by AI on these evaluations over the last year has been remarkable, and AI systems now score better than the best humans on many of these tests. For instance, AI systems developed by both Google and OpenAI are estimated to score better than 99% of USMLE Step 1 test takers [1] [2]. Outside of text, there has been a flurry of progress on diagnostic medical imaging tasks, with AI systems now able to reliably read studies like chest x-rays, pathology slides, and retinal photographs with surprising accuracy. In some narrowly scoped and well-controlled evaluations, many of these AI imaging systems have been found to perform as well as or better than their human counterparts. However, success in a lab setting does not necessarily translate to clinical practice, and there have been numerous challenges [3] in getting these systems to work in the real world. Though there is still much room for improvement even within these narrow applications, the pace of progress shows no signs of slowing.
However, as with humans, being a good doctor takes more than book smarts or diagnostic acumen. It also requires qualities like empathy, social intelligence, and communication skills. Surprisingly, recent studies have shown that AI systems can exhibit some of these qualities as well. For example, a study [4] published in JAMA Internal Medicine showed that AI could generate empathetic responses to patients’ medical questions. When AI and physician responses to patient questions posted online were compared, the AI responses were preferred, rated as higher quality, and judged as more empathetic. Thus, it seems that AI is also mastering the “soft skills” of being a good doctor. Does this mean we should trust AI to be a good doctor? Not yet. Just as a doctor who writes empathetic notes but misdiagnoses patients is not necessarily a good doctor, an AI that communicates well but lacks other important skills would be equally flawed.
So, to return to the initial question - can AI be a good doctor? Unfortunately, we find there are no easy answers, only parallels. If the field of health services research has taught us anything, it’s that this is an incredibly complex question that doesn’t offer crisp answers, only trade-offs. The path to determining whether AI can be a “good doctor” will likely mirror the path by which we come to recognize great human doctors - accumulating a body of evidence over time using a broad range of metrics. The most we can say right now is that AI seems to work reasonably well in some very limited domains, so long as we are upfront about the caveats that must accompany that claim. Whether carbon- or silicon-based, exactly what constitutes a “good doctor” remains elusive.
Andrew Beam, PhD, is a founding deputy editor of NEJM AI and a co-host of the NEJM Group podcast AI Grand Rounds. He is an assistant professor in the Department of Epidemiology at the Harvard T.H. Chan School of Public Health, with a secondary appointment in the Department of Biomedical Informatics at Harvard Medical School. His lab develops new AI methods by combining large-scale deep learning models with techniques from causal inference to improve the safety and robustness of medical applications. He has a particular interest in using AI to improve neonatal and perinatal outcomes.
References
1. Nori, Harsha, et al. “Capabilities of GPT-4 on Medical Challenge Problems”, Microsoft (2023), https://www.microsoft.com/en-us/research/publication/capabilities-of-gpt-4-on-medical-challenge-problems/
2. “A responsible path to generative AI in healthcare”, Google Cloud (2023), https://cloud.google.com/blog/topics/healthcare-life-sciences/sharing-google-med-palm-2-medical-large-language-model
3. Murdock, Jason. “Accuracy, but Failed to Deliver in Real World Tests”, Newsweek (2020), https://www.newsweek.com/google-artificial-intelligence-deep-learning-health-research-thailand-clinics-diabetic-retinopathy-1500692
4. Ayers, J.W., et al. “Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum”, JAMA Internal Medicine (2023).