Study Finds AI Models Fall Short in Early Medical Diagnosis
A new study has found that artificial intelligence language models still struggle with one of the most critical aspects of medical care, raising concerns about their use without human oversight.
Researchers from Mass General Brigham reported that AI systems failed to produce an appropriate early diagnosis more than 80 per cent of the time. The findings, published in JAMA Network Open, highlight ongoing limitations in how these systems reason through complex clinical scenarios.
The study examined 21 large language models, including systems developed by OpenAI, Google, Anthropic, xAI and DeepSeek. Among those tested were versions of GPT, Gemini, Claude, Grok and DeepSeek.
Researchers used a structured evaluation tool known as PrIME-LLM to assess how well the models handled different stages of clinical reasoning. These stages included forming an initial diagnosis, ordering tests, reaching a final diagnosis and planning treatment. The models were tested using 29 standardised clinical scenarios, with information introduced gradually to mirror real-life patient cases.
While the systems showed relatively strong performance when identifying a final diagnosis, their ability to generate a differential diagnosis — a key step in distinguishing between conditions with similar symptoms — remained limited. This early-stage reasoning is widely regarded as essential in medical decision-making.
Marc Succi, a co-author of the study, said current models are not ready for independent clinical use. He noted that differential diagnosis represents a core part of medical practice that AI has yet to replicate effectively.
Another researcher, Arya Rao, said the findings show that AI performs best when given complete information but struggles when cases are still developing. She explained that the models are less reliable in situations where doctors must make judgments based on limited or uncertain data.
Despite these shortcomings, the study identified a group of higher-performing systems, including advanced versions of GPT, Gemini, Claude and Grok. These models achieved final diagnosis success rates ranging from around 60 per cent to over 90 per cent when provided with detailed clinical data such as lab results and imaging.
Experts not involved in the research also stressed the importance of caution. Susana Manso García said the findings reinforce that AI should not replace professional medical judgement. She advised that patients continue to seek guidance from qualified healthcare providers when dealing with health concerns.
The study concludes that while AI has made progress, it still requires close human supervision in clinical settings. Researchers say the technology shows promise as a support tool, but its current limitations mean it cannot yet be trusted to make independent medical decisions.