AI Diagnostics Advance: Scientists Chart the Path Ahead


Key Takeaways

  • A recent Science study found that OpenAI’s o1‑preview model often matched or exceeded the diagnostic accuracy of human physicians when evaluated on structured case studies and real‑world emergency‑department notes.
  • The model’s advantage was strongest at triage and diminished as more clinical information became available, converging with physician scores by the time of admission decisions.
  • Researchers caution that strong performance on text‑based reasoning tasks does not equate to readiness for clinical deployment; diagnosis is only one component of patient care.
  • Methodological limits—small physician comparison groups, subjective rating scales, and lack of multimodal data (images, vitals, patient interaction)—may inflate the model’s apparent superiority.
  • Experts agree the next step is prospective, safety‑focused clinical trials that test AI alongside physicians in real‑world settings, rather than relying solely on simulated benchmarks.

Introduction and Concerns of Researchers
“I worry that my research agenda … is going to get used by companies that are heavily financed and are looking to skip some of these essential safe pieces of medicine,” said Dr. Andrew Rodman, an assistant professor of medicine at Beth Israel Deaconess Medical Center and a visiting researcher at Google. His unease mirrors that of many clinical AI scholars who fear that impressive benchmark results will be misinterpreted as proof that large language models (LLMs) can safely replace or supplement physicians in everyday practice. While Rodman’s goal is to use AI to improve care—not to supplant clinicians—he worries that commercial pressures could push premature deployment before safety and efficacy are rigorously established.

Background on Testing LLMs in Medical Exams
Since consumer‑facing LLMs burst onto the scene in 2022, researchers have subjected them to a battery of diagnostic tests: multiple‑choice medical licensing exams, tricky case studies from the New England Journal of Medicine, and other standardized assessments. Early reports showed models performing well, but most studies failed to compare their performance against large groups of practicing physicians. As co‑senior author Dr. Arjun Manrai of Harvard Medical School explained, the team sought to “throw everything that we could at the model” to fill that evidentiary gap, using OpenAI’s 2024 o1‑preview—a newer “reasoning” model—as their testbed.

Design of the Emergency Department Study
The investigators replicated several prior experiments on GPT‑4, then added a novel, messier data challenge: they fed the model de‑identified text copied straight from the electronic health records of 76 randomly selected emergency‑department patients at Beth Israel, preserving all the “random noise” that naturally appears in clinical notes. Emergency physicians and patients never interacted with the AI; instead, after the encounters, two internists and the model were asked to offer second opinions on the patient’s diagnosis at three points—triage, mid‑encounter, and discharge/admission decision. This design aimed to test whether LLMs could reason from real‑world, uncurated clinical narratives.

Results Showing AI Outperforming Physicians
At triage, when information was sparse, the LLM identified a correct or very close diagnosis more often than the two physicians reviewing the same notes. Even after the full emergency‑department work‑up was available, the model remained ahead of the physicians, with scores only converging when the patient was ready for admission. “Long story short, the model outperformed our very large physician baseline,” Manrai summarized. The trend held across additional experiments that tested o1‑preview on documentation of clinical reasoning, management reasoning, and diagnostic reasoning, where the model consistently beat average scores from dozens or hundreds of clinicians.

An Illustrative Case Example
One case highlighted the model’s nuance: a patient arrived with a pulmonary embolism, received anticoagulants, and then worsened. The two physicians initially suspected drug failure, but the model zeroed in on the patient’s documented history of lupus, hypothesizing lupus pleuritis as a unifying cause. Subsequent review confirmed the lupus‑related inflammation, demonstrating how the model could integrate historical clues that busy clinicians might overlook. As Rodman recounted, “Perhaps that was a unifying cause for the blood clot and the other symptoms? It turned out to be right.”

Limitations and Cautions About Interpreting Results
Despite the striking numbers, both the study authors and outside experts stressed that the findings do not justify deploying LLMs as autonomous diagnostic tools. Diagnosis is only a sliver of a physician’s workload; in the emergency department, triage and immediate symptom management dominate, while primary care hinges on chronic disease tracking, patient counseling, image interpretation, and shared decision‑making—areas where text‑only LLMs fall short. “A clinician in real life would be able to look at a patient and get some gestalt about how sick they are,” noted Dr. Emily Alsentzer of Stanford, adding that visual and tactile data can narrow any performance gap between humans and models.

Methodological Critiques From External Experts
Clinical AI researchers pointed out several weaknesses that could temper the results. Performance scores relied on subjective ratings from experienced clinicians, including judgments about what counts as a “very close” diagnosis. “To some degree the results here are obscured by the clinician’s judgment of what is a good differential versus not,” Alsentzer said, questioning the objectivity when ratings come from senior authors involved in the study. Moreover, the physician baseline was small—only two doctors in the emergency‑department arm—meaning that one outlier could drive the apparent difference. Dr. Eric Oermann of NYU warned that “everyone says, ‘Oh, look, there’s a paper in Science on language models in the emergency department, they’re ready to use for patients,’ when nothing remotely like that has been shown here.”

The Gap Between Simulated Reasoning and Real‑World Practice
The authors acknowledge that LLMs excel at reasoning over text but cannot process imaging, vital‑sign trends, nonverbal cues, or the dynamic interplay of patient preferences and values that shape real clinical decisions. “Doctors talk to patients. They counsel them, they listen to their values. They interpret images, they read EKGs and ECGs. And they integrate all of that together to guide a patient through challenging decisions,” Manrai emphasized. Until models can ingest multimodal data and participate in longitudinal patient interactions, their diagnostic prowess remains a proof‑of‑concept rather than a prescription for autonomous care.

Call for Prospective Clinical Trials and Ongoing Work
Looking forward, the consensus is clear: the next step is prospective, safety‑first clinical trials that place AI alongside physicians in authentic care environments. “Despite this really strong performance we’re seeing in these simulated settings, how do we get the best out of both of us within the real clinical setting?” asked Peter Brodeur, an internal‑medicine resident at Beth Israel. Such trials would expose LLMs to incomplete information, diverse data types, and the full spectrum of patients seen in everyday practice, while rigorously monitoring for errors. Encouragingly, preliminary work is already emerging: a study by OpenAI and Penda Health of AI‑supported primary care in Nairobi, a randomized trial in Pakistan showing diagnostic gains for physicians using AI, and Rodman’s own prospective trial of Google’s conversational agent AMIE in primary care, which Oermann called “phenomenal.” As Rodman put it, “I did this experiment, and then I frickin’ ran a clinical trial,” underscoring the field’s shift from benchmark chasing to evidence‑based implementation.

https://www.bostonglobe.com/2026/05/03/business/ai-diagnosis-emergency-department-patients-hospital/
