Pathological and Oncological Sciences 2.
Murmu, Ankita
Semmelweis University
Ankita Murmu1, Anita Rácz2, Lajos Pusztai3, Balázs Győrffy1
1: Semmelweis University
2: HUN-REN Research Center for Natural Sciences
3: Yale Cancer Center
Introduction: Determining which patients benefit from adjuvant chemotherapy remains one of the major challenges in breast cancer management. Current prognostic tools and assays are often costly, restricted to specific subgroups, or derived from selected trial populations, limiting generalizability.
Aim: We aimed to develop a machine learning model capable of estimating individualized chemotherapy benefit using routinely available clinical and pathological data.
Methods: Data from patients diagnosed with breast cancer between 2000 and 2022 were obtained from the Surveillance, Epidemiology, and End Results registry. We included 597,155 patients with infiltrating ductal carcinoma and 71,199 with lobular carcinoma. Separate gradient boosting models were trained for chemotherapy-treated and untreated patients to predict five-year overall survival. Model performance was evaluated through repeated cross-validation, independent test sets, permutation testing, and logical counterfactual validation. The contribution of each feature to the prediction was assessed using Shapley values.
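As an illustration of how two separately trained models could be combined into an individualized benefit estimate, the following minimal sketch takes the difference between the predicted five-year survival under the treated and untreated models. The benefit definition as a simple difference is an assumption, and the two scoring functions, their inputs, and their coefficients are hypothetical stand-ins for the trained gradient boosting models; they carry no clinical meaning.

```python
import math


def _sigmoid(z: float) -> float:
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical stand-ins for the two trained models. The coefficients are
# invented for illustration only and have no clinical meaning.
def predict_survival_treated(positive_nodes: int, stage: int) -> float:
    return _sigmoid(3.0 - 0.10 * positive_nodes - 0.50 * stage)


def predict_survival_untreated(positive_nodes: int, stage: int) -> float:
    return _sigmoid(3.2 - 0.25 * positive_nodes - 0.70 * stage)


def estimated_benefit(positive_nodes: int, stage: int) -> float:
    """Assumed benefit: predicted survival if treated minus if untreated."""
    return (predict_survival_treated(positive_nodes, stage)
            - predict_survival_untreated(positive_nodes, stage))


if __name__ == "__main__":
    for nodes, stage in [(0, 1), (2, 2), (5, 3)]:
        print(nodes, stage, round(estimated_benefit(nodes, stage), 3))
```

In this toy setup the same patient features are passed to both models, so the output is a counterfactual contrast for one patient rather than a comparison between two patient groups.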
Results: For infiltrating ductal carcinoma, area under the curve (AUC) values ranged from 0.84 to 0.91, with consistently higher discrimination in untreated groups. Similar performance was observed in lobular carcinoma (AUC 0.85-0.90). Logical validation confirmed clinically coherent treatment effects, with predicted survival probabilities aligning with expected chemotherapy benefit in 77-95% of evaluated cases. The number of positive lymph nodes and tumor stage were identified as the most important features for prediction. All models significantly outperformed randomized baselines in permutation testing.
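One common form of permutation testing for a discrimination metric can be sketched as follows: the observed AUC is compared with the AUCs obtained after randomly shuffling the outcome labels. This is a simplified sketch that keeps the predicted scores fixed and shuffles only the labels, which is an assumption and may differ from the procedure used in the study; the function names and the toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: chance that a positive outscores a negative (ties 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_permutations=1000, seed=0):
    """Return the observed AUC and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between labels and scores
        if auc(shuffled, scores) >= observed:
            hits += 1
    # The +1 terms count the observed labeling as one permutation.
    return observed, (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy data: 1 = alive at five years, 0 = not; scores are made up.
    labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.75, 0.7, 0.4, 0.6, 0.35, 0.3, 0.2, 0.1]
    print(permutation_p_value(labels, scores))
```

A small p-value here means that an AUC at least as high as the observed one is rarely reached when the labels carry no information about the scores.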
Conclusion: We developed a machine learning-based model that estimates five-year survival and chemotherapy benefit, with robustness across breast cancer subtypes and stages. The model is implemented as a real-time, web-based tool to support individualized treatment decisions (www.recurrenceonline.com/clinical).
Funding: This project was supported by the National Research, Development, and Innovation Office (EPIPROPER, 2024-1.2.2-ERA_NET-2024-00015) and by the Semmelweis Lendület Programme. A.M. is grateful to the Tempus Public Foundation (Hungary) for the Stipendium Hungaricum Ph.D. Scholarship. A.R. was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences.