Poster Session 1.D - Pathological and Oncological Sciences
Posta, Máté
Department of Bioinformatics, Semmelweis University, Budapest, Hungary; Institute of Molecular Life Sciences, HUN-REN Research Centre for Natural Sciences, Budapest, Hungary
Máté Posta1, Aida Figler2, Zsófia Dobolyi2, Balázs Győrffy3
1: Department of Bioinformatics, Semmelweis University, Budapest, Hungary; Institute of Molecular Life Sciences, HUN-REN Research Centre for Natural Sciences, Budapest, Hungary
2: Department of Bioinformatics, Semmelweis University, Budapest, Hungary
3: Department of Bioinformatics, Semmelweis University, Budapest, Hungary; Institute of Molecular Life Sciences, HUN-REN Research Centre for Natural Sciences, Budapest, Hungary; Institute of Transdisciplinary Discoveries, Medical School, University of Pécs, Pécs, Hungary
Introduction: Extracting structured information from narrative pathology reports remains a major obstacle to scaling clinical research, particularly in multilingual environments. Large Language Models (LLMs) offer powerful extraction capabilities, but clinical adoption requires strict data control, high throughput, and consistent performance.
Methods: To address these barriers, we created CIDER, a fully local pipeline that transforms unstructured pathology reports into structured database. The system integrates vLLM based inference with the Qwen3 VL 32B Instruct FP8 model and operates within an air gapped institutional environment. We applied CIDER to 2,073 Hungarian language histopathology reports and evaluated seven key clinical variables against expert curated data. Accuracy, reproducibility, and temperature dependent robustness were systematically assessed.
Results: CIDER produced highly reliable outputs, matching expert annotations for sex and surgery year in over 98% of cases and achieving >95% and >92% accuracy for T and N staging, respectively. The pipeline showed excellent reproducibility across repeated runs. Beyond matching manual curation, CIDER substantially increased dataset completeness by identifying clinically valid values missing from the expert database, including 62.8% of unrecorded T stages and 91.5% of missing tumor size values.
Conclusions: Our findings show that secure, on premise LLM deployment can deliver expert level extraction accuracy for complex, non English clinical narratives. CIDER enables rapid, reproducible conversion of pathology text into structured data, supporting large scale registry construction and real world evidence generation. The CIDER platform is publicly accessible at https://llm.gyorffylab.com/cider.
Funding: This project was supported by the National Research, Development, and Innovation Office (2025-1.2.1-HU-RIZONT-2025-00011 and 2024-1.2.2-ERA_NET-2024-00015) and by the Semmelweis Lendület Programme.