#StackBounty: #r #predictive-models #survival #cox-model #hazard Does centring the baseline hazard and linear predictor after cox regre…

Bounty: 50

I am trying to derive risk scores from a cox regression. For reasons not relevant to this topic, I must do it manually, as opposed to using the predict function in R.

My manual method, uses an un-centred baseline hazard and linear predictor. The built in function in R predict uses a centred linear predictor, and so you must also use the centred baseline hazard from survfit.

To my suprise, when using the two methods, the risk scores match up, but the confidence interval around the risk scores do not match up. I see no reason why the confidence interval around the risk score should change size depending on whether the baseline hazard was centred or not.

I provide a reproducible example here, where the risks match up, but the confidence intervals do not:

data(ovarian)
library(survival)

## Fit cox model
fit <- coxph( Surv(futime,fustat)~age + rx,data=ovarian)


### Calculate risks and associated confidence interval manually, using non centred baseline hazard and linear predictor

## Extract design matrix
test.model.matrix<-model.matrix(Surv(futime,fustat)~age + rx,data=ovarian)
S<-test.model.matrix
## Remove intercept from design matrix
S<-S[,-c(1)]


## Extract coefficients and covariance matrix
B<-fit$coefficients
C<-fit$var

## Calculate linear predictor
p<-(S%*%B)[,1]
head(p)

## Calculate standard error and CI
std.er.p1<-(S%*%C)

std.er.p2<-std.er.p1*S

std.er.p2<-data.frame(std.er.p2)
head(std.er.p2)

se<-sqrt(rowSums(std.er.p2))

## Calculate the relevant t statistic, degrees of freedom = n-p = 26-2
t<-qt(0.975,24)


## Create dataset with linear predictor, upper and lower limit
lp<-data.frame(risk=p,low.lim=p-t*se,upp.lim=p+t*se)

## Now get cumulative baseline hazard using basehaz function, not centred
basehaz<-basehaz(fit,centered=FALSE)

## Extract hazard at relevant time
## Must be a time at which an event happened, choose time=156 (3rd event)
basehaz156<-basehaz[basehaz$time==156,]$hazard

## Generate survival function
surv<-exp(-basehaz156*exp(lp))

## generate risks
risks.uncent<-1-surv


### Now extract the linear predictor using the predict function (centred linear predictor)
### And use the centred baseline hazard

## Do it using predict function
fit.cent<-predict(fit,se.fit=TRUE,type="lp")

## Extract linear predictor and standard error
p.cent<-fit.pred[[1]]
se.cent<-fit.pred[[2]]

## Calculate the relevant t statistic
t<-qt(0.975,24)


# Create dataset with linear predictor, upper and lower limit
lp.cent<-data.frame(risk=p.cent,low.lim=p.cent-t*se.cent,upp.lim=p.cent+t*se.cent)


### Now get cumulative baseline hazard, centred
basehaz.cent<-basehaz(fit,centered=TRUE)
head(basehaz.cent)


## Extract hazard at relevant time
## Must be a time at which an event happened, choose time=156 (3rd event)
basehaz156.cent<-basehaz.cent[basehaz.cent$time==156,]$hazard

## Generate survival function
surv.cent<-exp(-basehaz156.cent*exp(lp.cent))

## generate risks
risks.cent<-1-surv.cent


### Compare results
head(risks.uncent)
head(risks.cent)

> head(risks.uncent)
        risk      low.lim   upp.lim
1 0.51509705 5.322379e-04 1.0000000
2 0.63037120 5.975637e-04 1.0000000
3 0.26288533 3.883639e-04 1.0000000
4 0.01961519 4.553868e-05 0.9998191
5 0.02794971 1.615327e-04 0.9930874
6 0.06717218 2.255081e-04 1.0000000

> head(risks.cent)
        risk    low.lim    upp.lim
1 0.51509705 0.13947540 0.96942843
2 0.63037120 0.15703534 0.99696746
3 0.26288533 0.09773654 0.59527761
4 0.01961519 0.01014882 0.03774142
5 0.02794971 0.01121265 0.06878584
6 0.06717218 0.03569621 0.12455084

I have also done the maths, to see whether the confidence intervals should match up or not. According to my maths, they should not match up, but this goes against my intuition that confidence interval should be independent on whether baseline hazard is centred or not.

Maths sheet 1:

Maths sheet 2:

Maths sheet 3:

So either:

1) My maths is wrong, and my code is wrong and the confidence intervals differing on how baseline hazard is defined should match up

2) My understanding that confidence interval around the risk should be independent on whether the baseline hazard is centred or not, is wrong.


Get this bounty!!!

Leave a Reply