AI Matches Doctors in Simulated Clinical Tests

Two studies published in Nature found that advanced AI systems performed at least as well as physicians in simulated assessments involving diagnosis, treatment planning and disease management.

MIT SMR Editors

Artificial intelligence systems developed by researchers in Germany and by Google performed at or above physician level in controlled clinical assessments, according to two studies published in Nature. The results add to evidence that large language model-based systems are becoming more capable at medical reasoning.

The studies tested two different systems. One examined MIRA, or Medical Intelligence for Reasoning and Action, an autonomous medical AI agent built to operate inside a simulated electronic health record environment.

The other evaluated AMIE, or Articulate Medical Intelligence Explorer, a Google research system designed for medical conversations and disease management across multiple patient visits.

Both systems showed strong performance in tests involving diagnosis, treatment planning, medication reasoning and guideline-based disease management. But the researchers and outside experts cautioned that the results do not mean the systems are ready for deployment in hospitals or clinics.

The MIRA study, published on 17 June, tested whether an AI agent could handle a broader clinical workflow rather than answer isolated medical questions. The system was designed to obtain patient histories, order and interpret laboratory tests, imaging and microbiology findings, generate differential diagnoses, prescribe medications, request procedures and plan hospital admissions.

The researchers evaluated MIRA on more than 500 emergency department cases drawn from MIMIC-IV, a widely used critical care database. The cases covered eight conditions: appendicitis, cholecystitis, diverticulitis, pancreatitis, pulmonary embolism, pneumonia, pancreatic cancer and urinary tract infections.

Across 574 cases, MIRA achieved average diagnostic accuracy of 88.9%. Its strongest performance was in appendicitis, where it correctly identified 146 of 148 cases, and pancreatitis, where it achieved 92.3% accuracy. Performance was lower in pneumonia and urinary tract infections.

The researchers then compared MIRA against two physician groups under the same information conditions. One group included four board-certified physicians, while the other included six physicians with mixed levels of experience.

In the head-to-head comparison, MIRA achieved average diagnostic accuracy of 87.8%, compared with 78.1% for board-certified physicians and 71.1% for the mixed-experience cohort. The widest gap was in pancreatitis, where MIRA reached 95.2% accuracy, compared with 78.6% for board-certified physicians and 61.9% for the mixed-experience cohort.

The system also performed strongly on parts of treatment planning. In appendicitis cases, it matched all laparoscopic appendectomies documented in the reference data.

In cholecystitis, it matched nearly all laparoscopic cholecystectomies. Across the evaluated diseases, MIRA identified and requested a larger share of documented procedures than physicians did in the experiment.

The study also found that MIRA’s treatment choices were often more closely aligned with clinical guidelines. In pancreatitis, for example, it was more likely to prescribe intravenous fluids and more consistently followed analgesic recommendations.

Across 56 patients and 468 prescriptions, 467 of MIRA’s prescriptions, or 99.8%, were rated as containing clinically useful and correct free-text dosing instructions.

The second Nature study focused on AMIE, Google’s medical AI research system. Earlier versions of AMIE were developed for diagnostic dialogue. The latest study tested whether the system could support longer-term disease management, including treatment decisions, follow-up planning and medication reasoning.

AMIE used Gemini’s long-context capabilities along with access to clinical practice guidelines and drug formularies. The system was evaluated in a randomized, blinded virtual Objective Structured Clinical Examination, a format commonly used to assess clinical skills.

The study compared AMIE with 21 primary care physicians across 100 multi-visit case scenarios. The scenarios were designed to reflect UK NICE Guidance and BMJ Best Practice guidelines.

Specialist reviewers found AMIE to be non-inferior to physicians in management reasoning. The system scored higher on the precision of treatment and investigation plans and showed stronger alignment with clinical guidelines.

The researchers also developed RxQA, a medication-reasoning benchmark based on drug formularies from the US and the UK and validated by board-certified pharmacists. Both AMIE and physicians benefited from access to external drug information. On more difficult medication questions, AMIE outperformed physicians.

Google said the findings show AMIE moving from one-off diagnostic conversations toward disease management across time. The company said the system matched clinicians in overall management reasoning and scored higher in plan precision and guideline alignment.

Still, the findings come with heavy qualifications. The AMIE study used virtual patient actors and structured scenarios. The MIRA study used real historical patient records but evaluated the AI in a sandboxed environment, not in live care. Neither study tested the systems as autonomous tools on real patients in routine clinical settings.

The MIRA researchers said further work is needed to establish generalization, safety and governance through prospective real-world studies. The AMIE researchers similarly said more research would be needed before real-world translation.

Outside experts urged caution. The Science Media Centre quoted specialists who said the studies were rigorous but preliminary. They noted that real patients may describe symptoms incompletely, behave unpredictably, have complex conditions beyond the study categories or require assessment through physical examination, tone, behavior and context.

The studies also leave unresolved questions about accountability, oversight, bias, data contamination and whether guideline alignment always translates into better care for individual patients.

In practice, doctors often adapt guidelines to patient circumstances, comorbidities, resource constraints and patient preferences.

The results nevertheless suggest that medical AI is moving beyond narrow question-answering systems toward agents that can operate across more complete clinical workflows.

For healthcare systems facing physician shortages, rising documentation burdens and pressure to standardize care, such systems could eventually become decision-support tools under clinician supervision, analysts said.
