Statistical issues in survival analysis
(Nature article 3556)
April 10, 2024
The authors aimed to estimate 3-year overall survival for colorectal cancer (CRC) and to identify the associated prognostic factors among patients in Morocco using a machine learning approach, the random survival forest (RSF). CRC is currently the third most common cancer. The authors highlighted that RSF can accommodate nonlinearities and interactions among variables and is not restricted by a baseline hazard assumption, as in Cox proportional hazards regression, or by the assumption that predictor variables act multiplicatively on the baseline hazard rate throughout the period of observation. The data were collected retrospectively between 2009 and 2015, with patients followed until death or right censoring at the end of the study.
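For reference, the Cox model that the authors contrast with RSF is conventionally written as below. The symbols (h_0 for the baseline hazard, beta for the coefficient vector, x for the covariates) follow the standard textbook notation and are not taken from the article itself. The second expression shows why the hazards are called proportional: the ratio for any two covariate profiles does not depend on time.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp\!\left\{\beta^{\top}(x_1 - x_2)\right\}
```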
In their analysis section, they admitted that they could not conduct multiple imputation with machine learning, so they adopted single imputation of missing data using the missRanger algorithm, which combines a random forest (RF) algorithm with predictive mean matching, a non-parametric imputation method that makes no prior assumptions about the distribution of the data. This approach directly predicted missing values using the RF trained on the observed parts of the dataset. They then conducted a multiple imputation relying on random forest (mice RF) with 10 datasets and a single imputation based on random forest (missRanger). They first computed Kaplan-Meier estimates of survival and compared the curves using a log-rank test, whose power also relies on the assumption of proportional hazards, but the authors did not discuss this issue and simply used the test as is. They also compared their RSF fits to the Cox proportional hazards regression model fits.
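As a point of reference, the Kaplan-Meier estimate multiplies, at each observed death time, the fraction of at-risk patients who survive that time. The following is a minimal Python sketch of that calculation; the function name `kaplan_meier` and the toy follow-up data are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # deaths and censorings both leave the risk set
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Conditional probability of surviving past t, given at risk at t
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: five patients, two of them right-censored
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
```

Censored patients contribute to the risk set up to their censoring time but never trigger a drop in the curve, which is how the estimator uses partial follow-up.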
The authors also used permutation-based variable importance for covariate selection, which measures the change in prediction error attributable to each predictor by comparing the data with and without permuted values for that variable. They also produced partial dependence plots to explore the relationship between the estimated partial effect of a given predictor and survival rates. Finally, they assessed model discrimination with the concordance index (c-index) and predictive accuracy with the Brier score, which lies between 0 and 1.
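To make the c-index concrete, the sketch below computes a simplified version of Harrell's concordance index in Python: among pairs where one patient is known to have died before the other's follow-up time, it counts how often the earlier death also received the higher predicted risk. The function name `harrell_c_index` and the toy values are assumptions made for illustration, and this version ignores pairs with tied times, which full implementations handle more carefully.

```python
def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's c-index for right-censored data.

    times       : follow-up time for each patient
    events      : 1 if death was observed, 0 if right-censored
    risk_scores : model output where a higher value means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier death in a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third is right-censored
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
perfect = harrell_c_index(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1])
reversed_ = harrell_c_index(toy_times, toy_events, [0.1, 0.4, 0.7, 0.9])
uninformative = harrell_c_index(toy_times, toy_events, [0.5, 0.5, 0.5, 0.5])
```

A value of 1 means every comparable pair is ranked correctly and 0.5 is what uninformative predictions give, so two models with similar c-index values discriminate about equally well.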
They found that the results from their RSF corresponded to the Cox model results in terms of parameter significance levels. The c-index values and Brier scores were also similar for both methods, yet the authors claim that the RSF had better discriminative capacity and predictive accuracy. Furthermore, the Cox model is the only other survival analysis method against which they compared RSF; without testing any others, they concluded that RSF is much more flexible. They also admitted that the assumptions of the Cox model were never met, even though meeting them is a central tenet of its use. Nor did they try multiple imputation with the Cox model and compare those results with the imputation they had done with the RSF. Clearly, a more rigorous comparison of these methods is warranted.
Written by Usha Govindarajulu, PhD
Keywords: survival analysis, Cox model, random survival forest, Brier score, c-index, multiple imputation
References
El Badisy, I., Ben Brahim, Z., Khalis, M. et al. Risk factors affecting patients survival with colorectal cancer in Morocco: survival analysis using an interpretable machine learning approach. Sci Rep 14, 3556 (2024). https://doi.org/10.1038/s41598-024-51304-3
https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41598-024-513043/MediaObjects/41598_2024_51304_Fig2_HTML.png?