Benchmarking of radiobiological NTCP models in head and neck radiotherapy using independent computational pipelines: an institutional validation study with machine learning augmentation

16 Jun 2026

Kalyan Mondal, Abhijit Mandal, Anuj Vijay, Ganeshkumar Patel

Background & purpose: Normal tissue complication probability (NTCP) models require institutional validation before clinical implementation. Traditional radiobiological models, such as the Lyman–Kutcher–Burman (LKB) and Equivalent Uniform Dose (EUD) models, provide mechanistic dose–response frameworks, while machine learning (ML) approaches offer exploratory, data-driven alternatives that remain inadequately characterised in South Asian populations.

Methods: This retrospective study included 51 patients with head and neck cancer treated with definitive radiotherapy. Binary endpoints were Grade ≥2 xerostomia (n = 3), dysphagia (n = 5) and mucositis (n = 4), scored using the Common Terminology Criteria for Adverse Events (CTCAE) version 5.0. NTCP calculations were performed using two independent computational pipelines (MATLAB-based RBMODELv1 and a Python implementation), with agreement assessed using Bland–Altman analysis. Traditional NTCP models (LKB, EUD) were evaluated and compared with artificial neural networks and XGBoost in a hypothesis-generating framework using a stratified 70:30 train–test split. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), accuracy and Spearman’s rank correlation.
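For illustration, the LKB calculation named above can be sketched in a few lines of Python. This is a minimal sketch of the standard textbook formulation (generalised EUD from a differential dose–volume histogram, followed by a probit dose response), not the code of either study pipeline. The function names and the values of m and n are hypothetical placeholders. Only the parotid TD50 of 34.1 Gy is taken from the text.

```python
import math


def geud(doses_gy, volumes, n):
    """Generalised EUD from a differential DVH (dose bins and their volumes)."""
    total = sum(volumes)  # normalise so fractional volumes sum to 1
    a = 1.0 / n
    return sum((v / total) * d ** a for d, v in zip(doses_gy, volumes)) ** n


def lkb_ntcp(doses_gy, volumes, td50, m, n):
    """LKB NTCP: standard normal CDF of t = (gEUD - TD50) / (m * TD50)."""
    t = (geud(doses_gy, volumes, n) - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


# TD50 is the institution-specific parotid value quoted in the abstract.
# m and n are illustrative placeholders, NOT parameters reported by the study.
TD50_GY, M, N = 34.1, 0.4, 1.0

# A uniform dose equal to TD50 gives t = 0 and therefore NTCP = 0.5.
print(lkb_ntcp([34.1], [1.0], TD50_GY, M, N))
```

With n = 1 the generalised EUD reduces to the mean dose, which is the usual assumption for a parallel organ such as the parotid. Smaller n weights the hottest dose bins more heavily.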

Results: Excellent agreement was observed between computational pipelines (mean bias 0.8%, 95% limits of agreement −1.9% to 3.5%). Traditional models demonstrated strong rank-order correlation with toxicity grades (ρ = 0.61–0.79, p < 0.001) and high accuracy (LKB: 90.0%–94.1%). Institution-specific parameters differed from Quantitative Analyses of Normal Tissue Effects in the Clinic (QUANTEC) values, including a lower parotid TD50 (34.1 versus 39.0 Gy). Exploratory ML analyses showed numerically higher discrimination for parallel organs but not for mixed-architecture structures; however, severe class imbalance (3–5 events per endpoint) limits statistical reliability.
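The agreement statistics can be illustrated with a short Bland–Altman sketch, assuming the conventional definition of the 95% limits of agreement as the mean difference ± 1.96 standard deviations of the paired differences. The function name and the paired NTCP values below are invented for demonstration and are not study data.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)          # mean difference between methods
    half_width = 1.96 * statistics.stdev(diffs)  # uses the sample SD
    return bias, bias - half_width, bias + half_width


# Hypothetical paired NTCP values (%) from two pipelines, for illustration only.
pipeline_a = [12.0, 30.5, 48.0, 71.2, 5.4]
pipeline_b = [11.1, 29.0, 48.6, 69.8, 5.0]

bias, lower, upper = bland_altman(pipeline_a, pipeline_b)
print(f"bias = {bias:.2f}%, 95% limits of agreement = {lower:.2f}% to {upper:.2f}%")
```

Under this definition the limits are symmetric about the bias, as the reported interval is (0.8% ± 2.7%).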

Conclusion: Dual computational pipelines enable reproducible NTCP modelling for institutional use. Traditional radiobiological models perform acceptably after local calibration, while exploratory ML findings suggest potential organ-architecture-dependent patterns that require validation in adequately powered multi-institutional cohorts.