Pharmaceutical Sciences and Health Technologies 1.
Imre, Attila
Center for Health Technology Assessment
Attila Imre1,2,3,4, Ákos Józwiak1,2, Judit Hagymásy1,2, Judit Tittmann1, Ágnes Nagy1, Sándor Kovács1,2, Przemyslaw Kardas5, Job FM van Boven6, Irene Mommers6, Balázs Nagy2,3,4, Tamás Ágh1,2
1: Center for Health Technology Assessment and Pharmacoeconomic Research, University of Pecs, Hungary
2: Syreon Research Institute, Budapest, Hungary
3: Center for Health Technology Assessment, Semmelweis University, Budapest, Hungary
4: Center for Pharmacology and Drug Research & Development, Semmelweis University, Budapest, Hungary
5: Department of Family Medicine, Medical University of Lodz, Lodz, Poland
6: Department of Clinical Pharmacy & Pharmacology, Groningen Research Institute for Asthma and COPD (GRIAC), University Medical Center Groningen, University of Groningen, Groningen, the Netherlands
Introduction: Title and abstract screening is a labour-intensive stage of systematic reviews. Large language models (LLMs) can automate this process, but their performance depends heavily on prompt design and model selection, both of which are typically manual and time-consuming.
Aims: Our objective was to evaluate whether automated, reflection-driven prompt optimisation improves LLM performance during title and abstract screening.
Method: REFLECTIVE-TIAB uses the GEPA reflective prompt optimiser to improve screening prompts under an asymmetric loss that penalises false negatives more heavily than false positives. Nine LLMs screened 8,520 de-duplicated records retrieved by a search on predictors of COPD exacerbations. A gold standard of 100 abstracts was drawn from records on which the models disagreed and was labelled by experts. The prompt was optimised on Llama 3.3 70B via DSPy/GEPA and then evaluated across all nine models.
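The asymmetric loss mentioned in the Method can be illustrated with a minimal Python sketch. The abstract does not report the actual weights or the exact form of the loss, so the 5:1 weighting, the function names `asymmetric_loss` and `screening_score`, and the rescaling to a 0 to 1 score are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence

# Hypothetical weights: the abstract does not state the values actually used.
FN_WEIGHT = 5.0  # cost of wrongly excluding a relevant record
FP_WEIGHT = 1.0  # cost of wrongly including an irrelevant record


def asymmetric_loss(
    gold: Sequence[bool],
    pred: Sequence[bool],
    fn_weight: float = FN_WEIGHT,
    fp_weight: float = FP_WEIGHT,
) -> float:
    """Mean per-record cost, where True means 'include the record'."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one record is required")
    total = 0.0
    for g, p in zip(gold, pred):
        if g and not p:        # false negative: relevant record missed
            total += fn_weight
        elif p and not g:      # false positive: irrelevant record kept
            total += fp_weight
    return total / len(gold)


def screening_score(
    gold: Sequence[bool],
    pred: Sequence[bool],
    fn_weight: float = FN_WEIGHT,
    fp_weight: float = FP_WEIGHT,
) -> float:
    """Rescale the loss to [0, 1] so that higher is better."""
    worst = max(fn_weight, fp_weight)
    return 1.0 - asymmetric_loss(gold, pred, fn_weight, fp_weight) / worst
```

With these assumed weights, a single missed relevant record costs as much as five unnecessary inclusions, so an optimiser that maximises a score of this kind is pushed towards prompts that favour recall.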
Results: Optimisation improved recall across all LLMs (+3.7% to +37.1%). Gemini 3 Flash Preview achieved the highest performance (91% accuracy, F1 81.6%) while costing 25-fold less per abstract than GPT-5.2, which ranked among the lowest-performing models. A prompt optimised on a single open-source model generalised to all nine without retraining. Total optimisation cost was $6.36.
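For readers less familiar with the reported metrics, the sketch below shows how accuracy, recall and F1 are conventionally derived from binary include/exclude decisions compared against a gold standard. The function name `screening_metrics` and the example labels are hypothetical and are not the study data.

```python
from typing import Dict, Sequence


def screening_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Standard binary metrics, where True means 'include the record'."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one record is required")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    # Guard against empty denominators (e.g. no relevant records at all).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(gold),
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Illustrative labels only, not taken from the study.
example = screening_metrics(
    gold=[True, True, False, False, False],
    pred=[True, False, True, False, False],
)
```

Recall is the metric most directly tied to comprehensiveness in screening, because every false negative is a relevant record lost to the review.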
Conclusion: REFLECTIVE-TIAB provides automated, model-transferable prompt optimisation for literature screening at negligible cost. Model price did not predict screening performance. The framework could substantially reduce screening workload while preserving comprehensiveness.
Funding: This research is part of the COPD-ALERT project. The “COPD-ALERT - Prediction of COPD exacerbations through artificial intelligence based monitoring of medication adherence and other medical data” project is funded by the 2024-1.2.3-HU-RIZONT International Excellence Program (National Research, Development and Innovation Office – NKFIH). The work was also supported by the 2025-2.1.1-EKÖP-2025-00014 University Research Scholarship Programme of the Ministry for Culture and Innovation, financed from the National Research, Development and Innovation Fund.