Study Finds GPT-4 Can Reliably Flag Critical Findings in Radiology Reports

Published Date: September 10, 2025
By News Release

New research published in the American Journal of Roentgenology suggests that general-purpose large language models (LLMs) such as GPT-4 can accurately identify urgent findings in radiology reports when paired with well-designed prompt strategies. The study highlights the potential for these AI systems to streamline communication in clinical workflows and improve patient safety.

The project was led by Ish A. Talati, MD, of the Department of Radiology at Stanford University, and evaluated more than 400 radiology reports across multiple datasets. Researchers compared GPT-4 with another model, Mistral-7B, testing how each responded to various prompting methods, including zero-shot, few-shot static, and few-shot dynamic approaches. Reports were drawn from the publicly available MIMIC-III database as well as an external validation set from the CheXpert Plus collection of chest radiograph reports.

Each case was manually reviewed to establish whether findings were truly critical, expected or previously known, or equivocal. These human assessments served as the gold standard against which model performance was measured.
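The study does not publish its prompts, but the general shape of a few-shot static approach, in which the same small, fixed set of labeled example reports is prepended to every query, can be illustrated with a short sketch. The snippet below is purely illustrative: the example impressions, label names, and the classify_report helper are hypothetical, and the OpenAI chat-completions call is shown only as one plausible way such a pipeline might be wired up, not as the authors' actual implementation.

```python
# Illustrative sketch of a few-shot static prompting setup (not the study's actual prompts).
# The example reports, label names, and helper function below are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A fixed ("static") set of labeled examples reused verbatim for every report.
FEW_SHOT_EXAMPLES = [
    ("Impression: Large right-sided tension pneumothorax with mediastinal shift.",
     "CRITICAL"),
    ("Impression: Stable appearance of known 4 mm lung nodule; no acute findings.",
     "EXPECTED_OR_KNOWN"),
    ("Impression: Possible subtle opacity at the left base; clinical correlation advised.",
     "EQUIVOCAL"),
]

SYSTEM_PROMPT = (
    "You classify radiology report impressions as CRITICAL, "
    "EXPECTED_OR_KNOWN, or EQUIVOCAL. Answer with the label only."
)

def classify_report(report_text: str, model: str = "gpt-4") -> str:
    """Classify one report impression using the fixed few-shot examples."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": report_text})
    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0
    )
    return response.choices[0].message.content.strip()
```

A few-shot dynamic variant, by contrast, typically retrieves the example reports most similar to each incoming report (for instance, via embedding similarity) rather than reusing one fixed set.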

Results showed that GPT-4, using a few-shot static prompting strategy, achieved 90.1% precision and 86.9% recall on the holdout test set, outperforming Mistral-7B's scores of 75.6% and 77.4%. On the external chest X-ray dataset, GPT-4 reached 82.6% precision and 98.3% recall, compared with 75.0% and 93.1% for Mistral-7B. The authors noted that including just five labeled example reports in the prompt produced the best overall performance, suggesting that carefully constructed prompts can meaningfully improve classification accuracy.
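For readers less familiar with these metrics: precision is the fraction of reports the model flagged as critical that were truly critical, and recall is the fraction of truly critical reports that the model flagged. The minimal sketch below shows how such figures could be computed against manually established labels; the label lists are invented solely for illustration and do not come from the study.

```python
# Minimal sketch of precision/recall calculation against gold-standard labels.
# The label lists below are invented for illustration only.
gold = ["CRITICAL", "CRITICAL", "EQUIVOCAL", "EXPECTED_OR_KNOWN", "CRITICAL"]
pred = ["CRITICAL", "EQUIVOCAL", "CRITICAL", "EXPECTED_OR_KNOWN", "CRITICAL"]

tp = sum(g == "CRITICAL" and p == "CRITICAL" for g, p in zip(gold, pred))
fp = sum(g != "CRITICAL" and p == "CRITICAL" for g, p in zip(gold, pred))
fn = sum(g == "CRITICAL" and p != "CRITICAL" for g, p in zip(gold, pred))

precision = tp / (tp + fp)  # of reports flagged critical, how many truly were
recall = tp / (tp + fn)     # of truly critical reports, how many were flagged

print(f"precision={precision:.3f}, recall={recall:.3f}")
```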

“These results suggest that out-of-the-box LLMs may adapt to specialized radiology tasks with minimal data annotation, although further refinement is needed before clinical implementation,” Talati explained. He and his colleagues emphasized that the ability to consistently flag critical findings has direct implications for patient outcomes, as rapid communication of urgent results is central to effective care delivery.

While the study demonstrates clear promise, the authors stressed that more work is necessary before these tools can be deployed in hospitals. They pointed out that although GPT-4 achieved strong precision and recall, issues of reliability, consistency across varied patient populations, and integration into clinical systems remain unresolved. Additional research is also needed to determine how these models perform in real-world environments, where reports may include diverse styles of dictation and incomplete information.

Even so, the findings build momentum around the idea that LLMs could play a valuable role in radiology by narrowing the communication gap between imaging results and treatment teams. By automatically flagging reports that contain potentially life-threatening findings, these systems could help ensure that critical information reaches clinicians more quickly and reliably.

“Effective identification of critical findings is essential for patient safety,” the authors concluded. “While further technical development is required, these findings underscore the promise of LLMs in improving radiology workflows by augmenting communication of urgent findings.”