Study Highlights GPT-4o’s Mixed Accuracy in Medical Imaging, Cautioning Against Clinical Use Without Further Training
OpenAI’s ChatGPT, powered by the GPT-4o model, has advanced considerably since its debut and now accepts image inputs alongside text. But new research indicates that, despite this evolution, GPT-4o still requires substantial refinement before it can be safely integrated into medical imaging tasks.
A study published in Cureus evaluates GPT-4o’s performance in interpreting medical images. Because the model is designed to handle text, image, and audio inputs, the researchers aimed to assess how well it could aid radiology practice and improve access to care, especially in areas with limited healthcare resources.
“While both its text-based interpretation and image recognition skills are remarkable, the latter has been particularly impactful in areas such as diagnosing cancer on histopathology, differentiating between skin lesions, or assessing lung injury,” noted corresponding author Nikoloz Papiashvili, a medical student at Tbilisi State Medical University, along with co-authors. “This impact has led to a focus on the field, further amplified by the shortage of qualified radiologists in underserved communities and the technology’s ability to enable faster decision-making in emergency settings.”
The technology’s potential is clear, but the findings suggest that its effectiveness in interpreting medical images varies significantly with the imaging modality, the body region, and the type of pathology involved.
To test GPT-4o, researchers presented the model with 377 imaging cases—including X-rays, CT scans, and MRIs—spanning a variety of organ systems. Each case was accompanied by a standardized prompt, but the model was not supplied with any clinical history or prior scans. A panel of three radiologists then evaluated GPT-4o’s responses using a five-point scale.
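The paper’s exact prompt and tooling are not reproduced here, but the setup maps naturally onto the OpenAI API. The sketch below is illustrative only: it submits each image with one fixed prompt and no clinical context, mirroring the study’s conditions. The prompt wording, file names, and case list are assumptions, not details taken from the study.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Illustrative stand-in for the study's standardized prompt (actual wording not published here).
PROMPT = "Describe the findings on this image and state your most likely diagnosis."

def interpret(image_path: str) -> str:
    """Send a single image to GPT-4o with the fixed prompt and no clinical history."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical file names; the study used 377 X-ray, CT, and MRI cases.
answers = {path: interpret(path) for path in ["case_001.png", "case_002.png"]}
```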
Results showed the model performed significantly better with X-rays than with CT images, successfully interpreting twice as many radiographs. It was six times more accurate in diagnosing abdominal conditions than pelvic ones and nearly three times better at identifying bleeding disorders than neoplastic conditions.
“The overall distribution of median ratings demonstrates a bimodal distribution, indicating that most cases were interpreted either completely incorrectly or entirely without error, with only a small percentage of responses being partially accurate,” the authors explained. “This finding suggests that the AI adopts an ‘all or nothing’ approach, expertly recognizing anything it has seen before but being practically unable to work with anything it has not visually encountered, despite having access to theoretical knowledge on these conditions.”
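A concrete way to see the “all or nothing” pattern is to take each case’s median of the three radiologist ratings and tabulate them. The sketch below assumes a 1-to-5 scale and uses made-up scores; bimodality shows up as most of the mass piling at both ends of the scale with little in between.

```python
from statistics import median
from collections import Counter

# Hypothetical per-case ratings from the three radiologists on a 1-5 scale.
ratings = {
    "case_001": [5, 5, 4],   # essentially correct interpretation
    "case_002": [1, 1, 2],   # essentially incorrect interpretation
    "case_003": [5, 4, 5],
    "case_004": [1, 2, 1],
    "case_005": [3, 3, 4],   # the rarer partially accurate response
}

# One median rating per case, then a histogram over the whole set.
medians = {case: median(scores) for case, scores in ratings.items()}
histogram = Counter(medians.values())
for score in sorted(histogram):
    print(f"median {score}: {histogram[score]} case(s)")
# A bimodal result concentrates at the low and high ends of the scale.
```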
One major limitation identified was the tendency for GPT-4o to generate hallucinations—false but confidently presented information. This well-documented issue with ChatGPT could mislead users who are unaware of its limitations. The concern is even more acute in clinical environments, where inexperienced providers or medical trainees might rely too heavily on these outputs without the expertise to verify their accuracy.
In light of their findings, the researchers proposed several directions for improving GPT-4o’s utility in medical imaging analysis.
“Future research should focus on methods to combat these pitfalls. One possibility may be to instruct the AI to analyze the image in sections, with each part contributing to a consolidated list of potential diagnoses. This approach has shown particular success in the field of pathology,” the group advised. “Additionally, text-based data could be incorporated as a secondary source of information, supporting or challenging certain diagnoses that the model has already considered.”
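Neither the tiling scheme nor the fusion step is specified in the study. As a loose illustration of the idea, the sketch below crops an image into quadrants, pools candidate diagnoses from each section into one ranked list, and uses an optional clinical note to re-weight candidates. The `diagnose_section` function is a hypothetical stand-in for a per-section model query, and the text-matching step is a deliberately crude placeholder for the proposed text-based support signal.

```python
from collections import Counter
from PIL import Image

def diagnose_section(tile: Image.Image) -> list[str]:
    # Hypothetical stand-in: in practice this would send the tile to GPT-4o
    # (as in the earlier sketch) and parse candidate diagnoses from the reply.
    return ["example diagnosis"]

def quadrants(image: Image.Image):
    """Yield the four quadrants of an image."""
    w, h = image.size
    for left, top in [(0, 0), (w // 2, 0), (0, h // 2), (w // 2, h // 2)]:
        yield image.crop((left, top, left + w // 2, top + h // 2))

def consolidated_diagnoses(path: str, clinical_note: str = "") -> list[tuple[str, int]]:
    """Pool per-section candidates, then boost any that the text data supports."""
    votes = Counter()
    for tile in quadrants(Image.open(path)):
        votes.update(diagnose_section(tile))
    for dx in list(votes):
        if dx.lower() in clinical_note.lower():  # crude text-based support signal
            votes[dx] += 1
    return votes.most_common()
```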