PhD Scientific Days 2026

Budapest, 16-18 June 2026

Poster Session 1.D - Pathological and Oncological Sciences

CIDER: Scalable On-Premise Clinical Text Extraction with Open-Weight LLMs

Name of the presenter

Posta, Máté

Institute/workplace of the presenter

Department of Bioinformatics, Semmelweis University, Budapest, Hungary; Institute of Molecular Life Sciences, HUN-REN Research Centre for Natural Sciences, Budapest, Hungary

Authors

Máté Posta1, Aida Figler2, Zsófia Dobolyi2, Balázs Győrffy3
1: Department of Bioinformatics, Semmelweis University, Budapest, Hungary; Institute of Molecular Life Sciences, HUN-REN Research Centre for Natural Sciences, Budapest, Hungary
2: Department of Bioinformatics, Semmelweis University, Budapest, Hungary
3: Department of Bioinformatics, Semmelweis University, Budapest, Hungary; Institute of Molecular Life Sciences, HUN-REN Research Centre for Natural Sciences, Budapest, Hungary; Institute of Transdisciplinary Discoveries, Medical School, University of Pécs, Pécs, Hungary

Text of the abstract

Introduction: Extracting structured information from narrative pathology reports remains a major obstacle to scaling clinical research, particularly in multilingual environments. Large Language Models (LLMs) offer powerful extraction capabilities, but clinical adoption requires strict data control, high throughput, and consistent performance.
Methods: To address these barriers, we created CIDER, a fully local pipeline that transforms unstructured pathology reports into a structured database. The system integrates vLLM-based inference with the Qwen3 VL 32B Instruct FP8 model and operates within an air-gapped institutional environment. We applied CIDER to 2,073 Hungarian-language histopathology reports and evaluated seven key clinical variables against expert-curated data. Accuracy, reproducibility, and temperature-dependent robustness were systematically assessed.
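As an illustration of the evaluation described in the Methods, the following minimal sketch shows one way per-variable accuracy and run-to-run reproducibility could be computed. It is not the CIDER implementation: the function names, the dictionary layout (report ID mapped to a value), and the exact-match comparison are all assumptions made for this example.

```python
import math


def accuracy(extracted, expert):
    """Share of reports where the extracted value equals the expert value.

    Only reports that have an expert value are scored (assumption:
    None marks a value missing from the expert database).
    """
    scored = [(extracted.get(rid), val)
              for rid, val in expert.items() if val is not None]
    if not scored:
        return math.nan
    return sum(ext == val for ext, val in scored) / len(scored)


def run_agreement(runs):
    """Share of reports for which every repeated run gave the same value."""
    shared_ids = set.intersection(*(set(run) for run in runs))
    if not shared_ids:
        return math.nan
    identical = sum(len({run[rid] for run in runs}) == 1 for rid in shared_ids)
    return identical / len(shared_ids)


# Hypothetical toy data for one variable (T stage), not taken from the study.
expert_t = {"r1": "T1", "r2": "T2", "r3": None, "r4": "T3"}
run_a = {"r1": "T1", "r2": "T3", "r3": "T1", "r4": "T3"}
run_b = {"r1": "T1", "r2": "T2", "r3": "T1", "r4": "T3"}

acc_a = accuracy(run_a, expert_t)
agreement = run_agreement([run_a, run_b])
```

Repeating the same calculation at several sampling temperatures would give a simple view of temperature-dependent robustness, again under the assumptions stated above.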
Results: CIDER produced highly reliable outputs, matching expert annotations for sex and surgery year in over 98% of cases and achieving >95% and >92% accuracy for T and N staging, respectively. The pipeline showed excellent reproducibility across repeated runs. Beyond matching manual curation, CIDER substantially increased dataset completeness by identifying clinically valid values missing from the expert database, including 62.8% of unrecorded T stages and 91.5% of missing tumor size values.
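The completeness gain reported in the Results can be read as a recovery rate: among reports where the expert database has no value, the share for which the pipeline returned one. The sketch below is a hedged illustration only. The function name and data layout are hypothetical, and it counts any non-empty extracted value, whereas the abstract refers to clinically valid values, which would require an additional validation step not shown here.

```python
def recovery_rate(extracted, expert):
    """Share of expert-missing values for which an extracted value exists.

    Assumption: None marks a missing value in either source.
    """
    missing_ids = [rid for rid, val in expert.items() if val is None]
    if not missing_ids:
        return 0.0
    recovered = sum(extracted.get(rid) is not None for rid in missing_ids)
    return recovered / len(missing_ids)


# Hypothetical toy data (tumor size in mm), not taken from the study.
expert_size = {"r1": 12, "r2": None, "r3": None, "r4": None, "r5": 30}
extracted_size = {"r1": 12, "r2": 8, "r3": None, "r4": 21, "r5": 30}

rate = recovery_rate(extracted_size, expert_size)
```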
Conclusions: Our findings show that secure, on-premise LLM deployment can deliver expert-level extraction accuracy for complex, non-English clinical narratives. CIDER enables rapid, reproducible conversion of pathology text into structured data, supporting large-scale registry construction and real-world evidence generation. The CIDER platform is publicly accessible at https://llm.gyorffylab.com/cider.
Funding: This project was supported by the National Research, Development, and Innovation Office (2025-1.2.1-HU-RIZONT-2025-00011 and 2024-1.2.2-ERA_NET-2024-00015) and by the Semmelweis Lendület Programme.