Poster Session III. - P: Health Sciences
Baksa Barnabas
Medical Imaging Centre, Semmelweis University
Barnabas Baksa1, Sámuel Beke1, Kristóf Nagy1, Fanni Kovács1, Zsigmond Tamás Kincses2, Bálint Szilveszter3, Lili Száraz1, Pál Maurovich-Horvat1
1: Medical Imaging Centre, Semmelweis University
2: University of Szeged, Department of Radiology
3: Semmelweis University, Heart and Vascular Center
Introduction: The emergence of large language models (LLMs) based on artificial intelligence (AI) creates new opportunities for patients to interpret medical reports. However, the reliability and applicability of LLMs in healthcare remain unclear.
Aims: To evaluate ChatGPT-4o’s medical reliability and its ability to produce patient-friendly reports based on coronary CT angiography (CCTA) findings.
Methods: In this retrospective study, we analyzed anonymized CCTA reports stripped of metadata and processed by ChatGPT-4o. One-shot prompting was performed through the OpenAI API using a custom-designed program with the following prompts: 1) “Summarize the report.” 2) “Make it easy for the patient to understand.” 3) “What do you suggest for this patient?” Two physicians rated the original and AI-generated reports on a five-point Likert scale, focusing on quality and professional reliability. Word counts and linguistic accuracy were documented, and two laypersons assessed comprehensibility. Statistical methods included kappa statistics and t-tests.
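The study's own pipeline code is not published; the one-shot prompting step it describes can be sketched roughly as below. This is a minimal stdlib-only illustration, assuming each of the three study prompts is paired with one anonymized report per API call; the endpoint URL and response shape follow OpenAI's chat completions API, while the function names and message layout are this sketch's own choices, not the authors'.

```python
import json
import os
import urllib.request

# The three prompts used in the study (one-shot prompting per report).
PROMPTS = [
    "Summarize the report.",
    "Make it easy for the patient to understand.",
    "What do you suggest for this patient?",
]


def build_messages(report_text: str, prompt: str) -> list[dict]:
    """Pair an anonymized CCTA report with one of the study prompts."""
    return [{"role": "user", "content": f"{prompt}\n\nReport:\n{report_text}"}]


def query_chatgpt(report_text: str, prompt: str, model: str = "gpt-4o") -> str:
    """Send one prompt + report to the OpenAI chat completions endpoint.

    Requires the OPENAI_API_KEY environment variable to be set.
    """
    payload = {"model": model, "messages": build_messages(report_text, prompt)}
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The generated text sits in the first choice of the response.
    return body["choices"][0]["message"]["content"]
```

In practice each of the 300 reports would be run through all three prompts, yielding three AI-generated variants per original for the reviewers to rate.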
Results: We analyzed 300 CCTA reports from three Hungarian institutions, featuring abnormalities of various types and severities in proportions reflecting their incidence in consecutive practice (5% CABG, 5% stent, 5% non-diagnostic; 20% severe and 65% minimal-to-moderate stenosis). Inter-reviewer agreement was excellent. Overall reliability did not differ significantly among the original reports (4.87±0.39), Prompt 1 (4.62±0.76), Prompt 2 (4.54±0.93), and Prompt 3 (p>0.005). Prompt 2 showed adequate linguistic quality (4.92±0.28). Compared with the originals, word counts decreased by 49%, 45%, and 63% for Prompts 1–3, respectively, with no evidence of AI hallucinations. Performance was poorer in graft cases for Prompts 1 and 2 (2.8±1.25 for both). Lay reviewers rated the patient-friendly Prompt 2 output as significantly more comprehensible than the original reports (4.89±0.34 vs. 3.76±0.72, p<0.005).
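The inter-reviewer agreement reported above is typically quantified with a kappa statistic on the two physicians' Likert ratings. The abstract does not specify the exact variant; the sketch below implements standard Cohen's kappa (observed agreement corrected for chance agreement) in plain Python as an illustration only.

```python
from collections import Counter


def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e the agreement expected by chance
    from each rater's marginal category frequencies.
    """
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from the marginal distributions of each rater.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Kappa values above roughly 0.8 are conventionally read as excellent agreement, consistent with the study's description of its two physician reviewers.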
Conclusions: ChatGPT-4o can produce concise, accurate, and patient-oriented CCTA reports without major professional errors or harmful claims. Despite reduced performance in complex cases such as bypass grafts, the LLM shows promise for enhancing patient understanding, warranting further investigation and clinical validation.
Funding: This study was funded by the Medical Imaging Centre, Semmelweis University.