
Doctors still outperform AI in clinical reasoning, study shows
On Nov. 24, 2025, a study published in the New England Journal of Medicine described a new test for large language models, finding that AI has trouble adapting to new information while working towards a diagnosis.
University of Alberta neurology resident Liam McCoy evaluated how well large language models perform clinical reasoning — the ability to sort through symptoms, order the right tests, evaluate new information and come to the correct conclusion about what’s wrong with a patient.
He found that advanced AI models struggle to update their judgment in response to new and uncertain information, and often fail to recognize when information is completely irrelevant. In fact, some recent improvements designed to make AI reasoning better have actually made this overconfidence worse.
It all means that while AI may do really well on medical licensing exams, there’s a lot more to being a good doctor than instantly recalling facts, says McCoy.

He and colleagues from Harvard, MIT and elsewhere took a page from medical education to develop a benchmark that measures this flexibility in clinical reasoning in AI models. Their tool, called concor.dance, is based on script concordance testing, a common method of assessing the skills of medical and nursing students.
McCoy tested 10 of the most popular AI models from Google, OpenAI, DeepSeek, Anthropic and others. While the models generally performed similarly to first- or second-year medical students, they often failed to reach the standard set by senior residents or attending physicians.
In the script concordance tests used, McCoy says, about 30 per cent of the time the new information given in the question is a red herring that doesn’t change the diagnosis or management plan. For example, a question might describe a patient with chest pain, then reveal that the patient stubbed their toe last week. That detail is almost certainly irrelevant to the case, but the AI models were terrible at figuring that out. Instead, the most advanced models tried to explain why the irrelevant facts were relevant, botching the diagnosis.
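To see why this format punishes that failure, it helps to know how script concordance items are graded: respondents rate, typically on a five-point scale from -2 (the new information argues strongly against the hypothesis) to +2 (argues strongly for it), and answers are scored against a reference panel of experienced clinicians rather than a single answer key. The Python sketch below illustrates the common aggregate scoring rule for such tests; the item, panel votes and function name are invented for illustration, not drawn from the concor.dance benchmark itself.

```python
from collections import Counter

def score_item(response: int, panel_votes: list[int]) -> float:
    """Credit for one script concordance item under the common
    aggregate rule: the number of panelists who gave the same
    rating as the respondent, divided by the size of the largest
    panel faction (so the modal answer earns full credit and
    minority answers earn partial credit)."""
    counts = Counter(panel_votes)
    modal = max(counts.values())
    return counts.get(response, 0) / modal

# Hypothetical item: a patient with chest pain, and the new
# information that they stubbed their toe last week. A panel of
# 10 clinicians mostly rates this 0 (irrelevant to the diagnosis).
panel = [0, 0, 0, 0, 0, 0, 0, +1, 0, -1]

print(score_item(0, panel))   # 1.0 -- matches the modal expert answer
print(score_item(+2, panel))  # 0.0 -- treats the red herring as decisive
```

Under this rule, matching the expert consensus earns full credit, while confidently treating a red herring as decisive earns none, which is exactly the failure mode McCoy describes.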
Interestingly, human medical students who do well on multiple-choice exams don’t always do as well on script concordance tests, because it’s a very different skill. “It’s important to realize that performance on a task like clinical reasoning is very complicated and task-specific,” McCoy points out. That doesn’t mean AI models can’t be improved to do better at it. In fact, McCoy figures the technology is here to stay, so it’s incumbent on researchers like him to keep pushing to make it better.
Source: University of Alberta