PhD Scientific Days 2026

Budapest, 16-18 June 2026

Health Sciences 1.

Diagnostic Accuracy of Multimodal Large Language Models in the Identification of Basal Cell Carcinoma and its Subtype Classification Based on Clinical and Dermoscopic Images

Name of the presenter

Tóth, Zsófia

Institute/workplace of the presenter

Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Faculty of Medicine

Authors

Phyllida Kerstin Hamilton-Meikle1, Tóth Zsófia1
1: Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Faculty of Medicine

Text of the abstract

Introduction: Basal cell carcinoma (BCC) is the most common malignant skin tumor worldwide. Because of its high incidence, timely diagnosis remains a major challenge in everyday clinical practice, which highlights the need for novel diagnostic approaches. Recent advances in large language models (LLMs) have created new opportunities for using artificial intelligence (AI)-based programs in the dermatological diagnostic process.
Aim: This study aimed to compare the diagnostic accuracy of three multimodal LLMs - ChatGPT-5 (OpenAI), Gemini 2.5 Flash (Google), and Claude Sonnet 4 (Anthropic) - in differentiating BCC from non-BCC lesions and in classifying BCC subtypes, based on both clinical and dermoscopic images.
Methods: A total of 772 images were analyzed retrospectively: 402 images of histopathologically confirmed BCC lesions (290 clinical and 112 dermoscopic) and 370 images from a BCC-mimicker cohort (250 clinical and 120 dermoscopic). Each model received the same diagnostic prompt for every image, followed by a clarification query requesting a single definitive answer. Sensitivity, specificity, and overall accuracy were calculated separately for the clinical and dermoscopic datasets.
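The three metrics named in the Methods follow from the counts of a two-by-two confusion table (BCC versus non-BCC). The sketch below is only an illustration of those standard definitions, not the analysis code used in the study; the class `ConfusionCounts`, the function `diagnostic_metrics`, and the example counts are all hypothetical and are not study data. In practice the same calculation would be run once per model and per image type (clinical or dermoscopic).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary BCC vs. non-BCC task (hypothetical container)."""
    tp: int  # BCC images the model labelled as BCC
    fn: int  # BCC images the model labelled as non-BCC
    tn: int  # mimicker images the model labelled as non-BCC
    fp: int  # mimicker images the model labelled as BCC


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return sensitivity, specificity and overall accuracy as fractions."""
    positives = c.tp + c.fn  # all histopathologically confirmed BCC images
    negatives = c.tn + c.fp  # all BCC-mimicker images
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one BCC and one non-BCC image")
    return {
        "sensitivity": c.tp / positives,
        "specificity": c.tn / negatives,
        "accuracy": (c.tp + c.tn) / (positives + negatives),
    }


# Made-up counts for illustration only; these are NOT results from the study.
example = ConfusionCounts(tp=6, fn=4, tn=8, fp=2)
metrics = diagnostic_metrics(example)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters here because the BCC and mimicker groups differ in size, so accuracy alone can hide a model that favors one class.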
Results: For clinical images, ChatGPT-5 achieved the highest diagnostic accuracy of the three LLMs (75%). For dermoscopic images, Claude Sonnet 4 performed best (69.8%). All three models showed limited performance in correctly classifying BCC subtypes.
Conclusions: These findings indicate that multimodal LLMs may serve as an additional tool that complements human examination in BCC recognition and classification. Further refinement and domain-specific training could improve their diagnostic reliability and support their future integration into dermatological decision-support systems.
Funding: Supported by a PhD research grant.