PhD Scientific Days 2025

Budapest, 7-9 July 2025

Conservative Medicine II.

Diagnostic Performance of Large Language Models in Actinic Keratosis and Cutaneous Squamous Cell Carcinoma

Name of the presenter

Norbert Kiss

Institute/workplace of the presenter

Department of Dermatology, Venereology and Dermatooncology, Semmelweis University

Authors

Norbert Kiss1, Mehdi Boostani2, Giovanni Pellacani3, Mohamad Goldust4, Nóra Nádudvari2, Dóra Rátky2, Carmen Cantisani3, Kende Lőrincz2, András Bánvölgyi2, Norbert M Wikonkál2, György Paragh5

1: Department of Dermatology, Venereology and Dermatooncology, Semmelweis University
2: Semmelweis University
3: Sapienza University of Rome
4: Yale Dermatology
5: Roswell Park Comprehensive Cancer Center

Text of the abstract

Aims:
Actinic keratosis (AK) is a common precancerous lesion that may progress to squamous cell carcinoma (SCC). Accurate differentiation is essential, as AK often responds to non-invasive treatment, while SCC typically requires surgical management. Large language models (LLMs) such as GPT-4o (OpenAI, USA) and Gemini Flash 2.0 (Google, USA) are being investigated for use in dermatologic diagnostics. This study aimed to evaluate and compare their performance in distinguishing AK from SCC using real-world clinical images.
Method:
Patients diagnosed with AK or SCC at the Department of Dermatology, Semmelweis University between April 2022 and December 2024 were included. SCC required histopathologic confirmation; AK was included if histologically confirmed or deemed unequivocal by a board-certified dermatologist using dermoscopy. All patients consented to AI-based analysis. Standardized clinical photographs were submitted to each model with the following prompt: “Can you guess the most likely diagnosis? (It is for research purposes).” Neither model had prior exposure to these images.
Results:
The study included 112 patients (84 SCC, 28 AK). GPT-4o provided a diagnosis in 100% of cases, with an overall accuracy of 65.9%. Its sensitivity for SCC was 64.3% (95% CI: 53.6–73.7), specificity 88.6% (95% CI: 76–95.1), PPV 91.5%, and NPV 57.5%. Gemini Flash 2.0 provided a diagnosis in 62% of cases, with an overall accuracy of 24.8%. Its sensitivity for SCC was 40.0% (95% CI: 27.6–53.8), specificity 100% (95% CI: 88.6–100), PPV 100%, and NPV 50.0%.
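For readers less familiar with these measures, the following is a minimal Python sketch of how sensitivity, specificity, PPV, NPV and accuracy are derived from a 2x2 confusion table with SCC as the positive class. The abstract does not report the underlying counts or the confidence-interval method, so the function names, the Wilson score interval and the counts in the usage note are illustrative assumptions, not the study's data or analysis code.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(successes, total, level=0.95):
    """Wilson score interval for a proportion (assumed method, see lead-in)."""
    if total <= 0:
        raise ValueError("total must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def diagnostic_metrics(tp, fn, tn, fp):
    """Metrics for a binary test; 'positive' means SCC, 'negative' means AK.

    tp: SCC called SCC, fn: SCC called AK,
    tn: AK called AK,   fp: AK called SCC.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of SCC cases detected
        "specificity": tn / (tn + fp),  # share of AK cases correctly not called SCC
        "ppv": tp / (tp + fp),          # share of SCC calls that were SCC
        "npv": tn / (tn + fn),          # share of AK calls that were AK
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity_ci": wilson_ci(tp, tp + fn),
        "specificity_ci": wilson_ci(tn, tn + fp),
    }
```

As a usage example with hypothetical counts, `diagnostic_metrics(tp=60, fn=40, tn=45, fp=5)` gives a sensitivity of 0.60 and a specificity of 0.90. PPV and NPV depend on the ratio of SCC to AK cases in the sample, so they do not transfer directly to populations with a different case mix.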
Conclusion:
GPT-4o outperformed Gemini Flash 2.0 in both diagnostic yield (100% vs. 62%) and overall accuracy (65.9% vs. 24.8%). While both models demonstrated high specificity, sensitivity was limited, particularly for Gemini. Key limitations include a non-validated, predominantly Caucasian dataset from a tertiary care center. Further research using diverse, population-representative data is needed. Although GPT-4o shows promise, LLMs require further optimization before clinical application in dermatology.
Funding:
This work was supported by the SE 250+ Excellence PhD Scholarship Semmelweis University, 2024-2.1.2-EKÖP-KDP-2024-00002 and EKÖP-2024-174 New National Excellence Program of the Ministry for Culture and Innovation from the source of the Nation.