Testing degrees of freedom for generalised survival models in R

Have you ever encountered a situation in which you wanted to test different degrees of freedom (DF) for a generalised survival model? I encounter this situation quite often and after writing for loops for looping through the different combinations of DFs for different spline functions, I decided to write a R function that does this for me instead.

In today’s blog post I would like to introduce you to fpm_test_dfs(), which does exactly that. The function can be used to test combinations of DFs for different spline functions in a generalised survival model fitted with rstpm2::stpm(). However, before discussing about the function, I would like to quickly introduce the generalised survival model modeled on the log cumulative hazard function ( $\ln H (t)$ ). In these models the baseline hazard function is modeled using a spline function of log time with a vector of knots $k_{0}$ and parameters $γ$ :

$\ln {H (t | x)} = s {\ln (t) | γ, k_{0}} + x β$ where $x$ and $β$ are vectors of covariates and regression parameters, respectively. The number of knots and the DFs have a direct relationship depending on which type of spline function and which polynomial you use. The rstpm2::stpm2() function uses restricted cubic splines, which are natural cubic splines in which the first and second order derivatives of the spline function at the knots are restricted to be equal:

$s {\ln (t) | γ, k_{0}} = β_{00} + β_{01} \ln (t) + β_{02} \ln (t)^{2} + β_{03} \ln (t)^{3} + \sum_{i = 1}^{K} β_{i 3} (\ln (t) - k_{i})_{+}^{3}$

The number of knots and DFs for restricted cubic splines are related through $K = DF - 1$ .

However, we cannot only use spline function for modelling the baseline hazard function, but also for modelling time-varying covariate effects. In this case the above model can be generalised for including $D$ time-varying covariate effects:

$\ln {H (t | x)} = s {\ln (t) | γ, k_{0}} + \sum_{j = 1}^{D} s {\ln (t) | δ_{j}, k_{j}} x_{j} + x β$ where $δ_{j}$ and $k_{j}$ are the vectors of knots and parameters for the spline function of the time-varying effect of covariate $x_{j}$ .

In the model above we need to specify a lot of different DFs one for the baseline hazard function and one for each time-varying covariate effect that we would like to include in our model. Finding the right combination of DFs for a model can be a challenging task. A common approach is to fit models with different DFs and compare their AIC or BIC value in order to identify the best fitting model. This is where fpm_test_dfs() comes in handy. fpm_test_dfs() allows you to specify which DFs you would like to test for each spline function included in your model and returns a data frame including the corresponding AIC and BIC values for each possible combination of DFs.

For this, fpm_test_dfs() requires the following 5 arguments:

formula: A formular describing the model that you would like to test. This formula will be passed directly to rstpm2::stpm().
dfs_bh: An integer vector of DFs used for the spline function of the baseline hazard.
dfs_tvc: A list of named integer vectors of DFs used for the spline function of a time-varying covariate effect. The name of the vector should correspond to the name of the variable that should be included in the model using a time-varying effect.
by_vars: A character vector including the names of variables that should be used for stratifying the model, if needed.
data: A dataset which is passed to rstpm2::stpm2().

Let’s test the function using the breast cancer dataset included in the rstpm2 package. For this, we first need to load the rstpm2 package and install the fpm_test_dfs() function, which is included in the entjosR package on GitHub.

# load rstpm2
library(rstpm2)

# install and use the entjosR package
remotes::install_github("entjos/entjosR")
library(entjosR)

Now that we installed the entjosR package, we can use the fpm_test_dfs(). We might for example be interested in fitting a crude model for estimating survival after breast cancer, e.g.,:

$\ln {H (t)} = s {\ln (t) | γ, k_{0}} + β_{0}$ However, we need to specify how many DFs we would like to use for modelling the baseline hazard function. For this we can use the fpm_test_dfs() as follows. Let’s for instance test one to ten DFs for the baselien hazard function.

# Test for best fitting baseline hazard function
dfs_test_model1 <- fpm_test_dfs(Surv(rectime, censrec) ~ 1,
                                dfs_bh = 1:10,
                                data   = brcancer)

dfs_test_model1

   df_bh df_tvc      aic      bic
1      1      0 5278.553 5287.614
2      2      0 5233.610 5247.203
3      3      0 5232.631 5250.754
4      4      0 5231.785 5254.440
5      5      0 5233.910 5261.095
6      6      0 5235.563 5267.279
7      7      0 5234.047 5270.294
8      8      0 5238.742 5279.520
9      9      0 5231.902 5277.210
10    10      0 5238.197 5288.037

We now got a dataframe with one row for each DF. However, what we are interested in is to find the best fitting model. For this we can pass the result of fpm_test_dfs() to fpm_get_best_fit().

# Check which model fits best
fpm_get_best_fit(dfs_test_model1)

  df_bh df_tvc      aic      bic
4     4      0 5231.785 5254.440
2     2      0 5233.610 5247.203

We now see that based on the AIC we would chose 4 DFs as the best fitting model and based on the BIC we would chose 2 DFs for the baseline hazard function as the best fitting model. Now you might think “Why would I need a function that does this for me when I can just write a simple for-loop”. I agree with you in principle, but adding more time-depending co variate effects to the model increases the complexity quite quickly. Let’s assume we would like to add the following two variables to our model and add time-varying effects for those:

x1: Age at diagnosis in years,
hormon: Indicator for receiving hormonal therapy.

Our model would then formally look like this $\ln {H (t | x_{1}, x_{2})} = s {\ln (t) | γ, k_{0}} + s {\ln (t) | δ_{1}, k_{1}} x_{1} + s {\ln (t) | δ_{2}, k_{2}} x_{2} + β_{0} + β_{1} x_{1} + β_{2} x_{2}$ where $x_{1}$ , and $x_{2}$ denote the age at diagnosis and hormonal therapy, respectively.

For simplicity, let’s assume that we stick to 4 DFs for the baseline hazard function and test for different DFs for the time-varying effect of age and hormonal therapy.

# Test for best fitting spline fuction for time-varying effects of hormonal
# therapy and age at diagnosis
dfs_test_model2 <- fpm_test_dfs(Surv(rectime, censrec) ~ hormon + x1,
                                dfs_bh  = 4,
                                dfs_tvc = list(hormon = 1:4,
                                               x1     = 1:4),
                                data    = brcancer)

dfs_test_model2

   df_bh df_tvc_hormon df_tvc_x1      aic      bic
1      4             1         1 5224.228 5265.006
2      4             2         1 5226.164 5271.472
3      4             3         1 5228.085 5277.924
4      4             4         1 5228.764 5283.134
5      4             1         2 5226.157 5271.465
6      4             2         2 5228.055 5277.895
7      4             3         2 5229.965 5284.336
8      4             4         2 5230.637 5289.539
9      4             1         3 5227.309 5277.149
10     4             2         3 5229.188 5283.559
11     4             3         3 5230.882 5289.784
12     4             4         3 5231.643 5295.075
13     4             1         4 5225.276 5279.646
14     4             2         4 5227.152 5286.054
15     4             3         4 5228.893 5292.326
16     4             4         4 5230.417 5298.380

We now receive a dataset with 16 rows, one row for each model that we tested as we tested for all different combinations of DFs for the two time-varying effects.

# Check which model fits best
fpm_get_best_fit(dfs_test_model2)

  df_bh df_tvc_hormon df_tvc_x1      aic      bic
1     4             1         1 5224.228 5265.006
2     4             1         1 5224.228 5265.006

Checking the best fitting model suggests that one DF is sufficient for modelling the time-varying effect of both hormonal therapy and age. However, one DF is equivalent to not using any knots for the spline function (see above). Hence, using one DF for the time-varying effect is equivalent to only using a linear interaction between log time and hormonal therapy, and age at diagnosis, respectively.

Another thing that we might want to do is to fit separate models for those that received hormonal therapy and those that did not, aka, stratifying by hormonal therapy. In this situation we need to test two separate models. We can easily do this by using the by_vars option in fpm_test_dfs(). Let’s assume that we stick with only one DF for the time-varying effect of age, but we would like to test for different DFs for the baseline hazard function in our two models.

# Test for best fitting spline fuction for the baseline hazard stratified
# by hormonal therapy use
dfs_test_model3 <- fpm_test_dfs(Surv(rectime, censrec) ~ x1,
                                dfs_bh  = 1:5,
                                dfs_tvc = list(x1 = 1),
                                by_vars = "hormon",
                                data    = brcancer)

dfs_test_model3

data$filter_vars: 0
  df_bh dfs_tvc_x1      aic      bic
1     1          1 3569.536 3585.883
2     2          1 3535.832 3556.266
3     3          1 3536.267 3560.788
4     4          1 3537.851 3566.458
5     5          1 3539.667 3572.361
------------------------------------------------------------ 
data$filter_vars: 1
  df_bh dfs_tvc_x1      aic      bic
1     1          1 1707.547 1721.569
2     2          1 1696.212 1713.739
3     3          1 1695.230 1716.262
4     4          1 1695.250 1719.787
5     5          1 1696.354 1724.397

We know receive a list of dataframes as output of fpm_test_dfs(). The first dataframe includes the AIC and BIC values for the model for those that did not receive hormonal therapy (hormon == 0), and the second dataframe include those values for patient that did receive hormonal therapy (hormon == 1). We can also pass this list to fpm_get_best_fit() to get the two best fitting models.

# Check which models fit best
fpm_get_best_fit(dfs_test_model3)

$`0`
   df_bh dfs_tvc_x1      aic      bic
2      2          1 3535.832 3556.266
21     2          1 3535.832 3556.266

$`1`
  df_bh dfs_tvc_x1      aic      bic
3     3          1 1695.230 1716.262
2     2          1 1696.212 1713.739

The results shows that among those that did not receive hormonal therapy 2 DFs are sufficient for modelling the baseline hazard function according to both AIC and BIC. However, among those that did receive hormonal therapy the best fitting model according to the AIC value includes 3 DFs for the baseline hazard function whereas the best fitting model according to the BIC includes 2 DFs for the baseline hazard function.

I hope this function might be of use for you and please let me know if you should have any feedback or suggestions for it.