Harvard Study: AI Outperforms ER Doctors in Diagnostic Accuracy
A new Harvard study found that an AI model, OpenAI's o1, offered more accurate diagnoses than emergency room doctors, particularly during initial triage. The research highlights AI's potential in healthcare, but its authors call for urgent prospective trials before real-world implementation.
Agent Newsroom

A study led by researchers at Harvard Medical School and Beth Israel Deaconess Medical Center, recently published in Science, found that artificial intelligence models can offer more accurate diagnoses than human emergency room doctors in certain contexts. The research examined how large language models perform across a range of medical scenarios, including real-world emergency room cases, and one model showed higher diagnostic accuracy than the physicians it was compared with.
One part of the study focused on a cohort of 76 patients admitted to the Beth Israel emergency room. Researchers compared the diagnoses provided by two experienced attending physicians with those generated by OpenAI's o1 and 4o models. To keep the comparison unbiased, two other attending physicians then assessed the diagnoses without knowing whether each came from a human or an AI.
The findings were most striking at the initial diagnostic touchpoint, or triage, where little information about the patient is available and a correct decision is most urgent. Across the evaluation, the o1 model performed either nominally better than or on par with both attending physicians and the 4o model. At triage, o1 reached an "exact or very close diagnosis" in 67% of cases, compared with 55% for one physician and 50% for the other. Arjun Manrai, a lead author and head of an AI lab at Harvard Medical School, said the AI model "eclipsed both prior models and our physician baselines."
Despite these results, the researchers were careful to temper expectations, clarifying that the study does not argue for AI to take over life-or-death decisions in emergency rooms. Instead, they said, the findings underscore an "urgent need for prospective trials" to evaluate these technologies in real-world patient care settings. That caution reflects the ethical and practical questions involved in bringing AI into critical medical settings.
The study also acknowledged several limitations. The models were tested only on text-based information from electronic medical records, and existing research suggests that current foundation models are more limited when reasoning over non-text inputs. Adam Rodman, a Beth Israel doctor and co-lead author, also raised concerns about accountability. He noted that there is no "formal framework right now for accountability" for AI diagnoses, and said that patients want humans to guide them through life-or-death decisions and difficult treatment choices, underscoring the role of human empathy and judgment.




