Pseudo-observations 2
This is a quick follow-up to my previous post on pseudo-observations, where I reproduced Table 1 from “Causal inference in survival analysis using pseudo-observations” by Per K. Andersen, Elisavet Syriopoulou, and Erik T. Parner. The simulation scenario for Table 1 was a correctly specified Cox model, which didn’t show off the robustness property of the pseudo-observations method. In this post I’ll aim to reproduce Table 2, which does demonstrate this point.
For reference, here is the row of Table 2 that I’ll try to reproduce:

| HR(1) | True ACE | Censoring prob. | Cox bias | Pseudo-obs bias |
|---|---|---|---|---|
| 5 | 0.384 | 0.684 | -0.107 | 0.001 |
The simulation scenario in Table 2 is described in the paper as follows:

> The baseline hazard was constant = 0.511 for the unexposed (A=0) and the hazard ratio for the exposed (A=1) increased linearly over time from HR(0) = 1 at time 0 to a hazard ratio of HR(1) = 1, 1.5, 2, 5 at time 1. The exposure distribution was binomial with probability 0.50.
Ok, I’m already struggling a bit. How do I simulate from a model with a linearly increasing hazard ratio? That’s not something I do every day.
I’ll focus on the situation where the hazard ratio is 5 at time 1. I can write the survival distribution as

\[S(t) = \exp\left(-\int_{0}^t h(s)\,ds\right)\]

where in this case

\[h(s)=0.511\times(1 + 4s),\quad s \in (0,1)\]

so that

\[S(t) = \exp(-0.511(t + 2t^2)).\]

This means that I can simulate the time-to-event \(T_i^*\) by first simulating \(U_i \sim U(0,1)\) and then solving

\[U_i = \exp(-0.511(T_i^* + 2(T_i^*)^2))\]

for \(T_i^*\). Taking logs turns this into the quadratic \(2(T_i^*)^2 + T_i^* + \log(U_i)/0.511 = 0\), and I can just about remember that

\[T_i^* = \frac{-0.5 \pm \sqrt{0.25-2\log(U_i)/0.511}}{2}\]

(and I know that \(T_i^*\) is positive, so I take the positive root). Let’s check I have the correct scenario according to Table 2, remembering that we’re interested in the difference in survival probabilities at \(t=1\).
n <- 5000000
u <- runif(n)
t_star_1 <- (-0.5 + sqrt(0.25 - 2 * log(u) / 0.511)) / 2
t_star_0 <- rexp(n, 0.511)
a <- rbinom(n, 1, 0.5)
t_star <- a * t_star_1 + (1 - a) * t_star_0
true_ace <- mean(t_star[a == 1] < 1) - mean(t_star[a == 0] < 1)
true_ace
## [1] 0.3840153
Looks good so far!
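As an extra sanity check (my own addition, not in the paper), the true ACE is also available in closed form from the two survival functions above, so no simulation is really needed:

## S_0(1) - S_1(1) = exp(-0.511) - exp(-0.511 * (1 + 2))
exp(-0.511) - exp(-0.511 * 3)
## about 0.3840, matching the simulated value above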
The censoring distribution was described as follows:

> The censoring had a constant hazard of 0, 0.1, 0.2, 0.5, 1.0, and 2.0.
And I’m focusing on the last row of Table 2 where the censoring hazard was equal to 2.0
cens <- rexp(n, 2)
mean(cens < pmin(1, t_star))
## [1] 0.6841254
This also matches Table 2.
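As another check (my own addition, not in the paper), the censoring probability can be computed by numerical integration. Conditional on \(T^* = t\), the probability of being censored before \(\min(1, T^*)\) is \(1 - e^{-2\min(1,t)}\), so integrating this against the density of \(T^*\) in each arm and averaging over the 50/50 exposure distribution should give roughly 0.684:

## P(C < min(1, T*)) for one arm, given the density f of T* on (0, 1)
## and the survival probability surv1 = P(T* > 1) in that arm
p_cens <- function(f, surv1){
  integrate(function(t) (1 - exp(-2 * t)) * f(t), 0, 1)$value +
    (1 - exp(-2)) * surv1
}
## exposed arm: hazard 0.511 * (1 + 4t), so density = hazard * survival
p1 <- p_cens(function(t) 0.511 * (1 + 4 * t) * exp(-0.511 * (t + 2 * t^2)),
             exp(-0.511 * 3))
## unexposed arm: exponential with rate 0.511
p0 <- p_cens(function(t) 0.511 * exp(-0.511 * t), exp(-0.511))
0.5 * (p1 + p0)   ## should be close to 0.684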
Next I’ll look at the bias. I’ll first need a function that can simulate one data set under this scenario
#######################################
## function to simulate one data set
sim_dat <- function(n){
  ## inverse-transform sampling for the exposed arm (A = 1),
  ## where the hazard ratio increases linearly from 1 to 5
  u <- runif(n)
  t_star_1 <- (-0.5 + sqrt(0.25 - 2 * log(u) / 0.511)) / 2
  ## constant hazard 0.511 for the unexposed arm (A = 0)
  t_star_0 <- rexp(n, 0.511)
  ## binary exposure with probability 0.5
  a <- rbinom(n, 1, 0.5)
  t_star <- a * t_star_1 + (1 - a) * t_star_0
  ## censoring with constant hazard 2, plus administrative censoring at t = 1
  cens <- rexp(n, 2)
  times <- pmin(1, pmin(t_star, cens))
  event <- as.numeric(t_star < pmin(cens, 1))
  df <- data.frame(times = times,
                   event = event,
                   a = a)
  df
}
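Before moving on, a quick check (my own addition, not in the paper) that sim_dat() produces the intended marginal survival curves: the Kaplan-Meier estimates at \(t = 1\) in a large simulated data set should be close to exp(-0.511) ≈ 0.60 for A = 0 and exp(-0.511 × 3) ≈ 0.22 for A = 1.

library(survival)
df_big <- sim_dat(100000)
fit_km <- survfit(Surv(times, event) ~ a, data = df_big)
## survival at t = 1 by arm; should be roughly 0.60 (a = 0) and 0.22 (a = 1)
summary(fit_km, times = 1)$surv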
Bias (pseudo-observations)
To find the bias of the pseudo-observation approach I can copy and paste my code from my previous post (more or less – I remove baseline covariates \(L\) since they are not included in this scenario)…
library(prodlim)
#########################
## function to find pseudo values.
## df must have following columns
## "times" (numeric)
## "event" (numeric 0/1)
## "a" (numeric 0/1)
pseudo_surv <- function(df, tau = 1){
  ## marginal Kaplan-Meier fit (no covariates)
  f <- prodlim(Hist(times, event) ~ 1, data = df)
  ## jackknife pseudo-values for S(tau)
  df$pseudo_obs <- unname(jackknife(f, times = tau))
  df
}
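As a quick sanity check of the pseudo-observations themselves (my own addition), their mean should be close to the marginal survival probability \(P(T > 1) = 0.5\{\exp(-0.511) + \exp(-0.511 \times 3)\} \approx 0.41\):

df <- pseudo_surv(sim_dat(10000))
## mean of the pseudo-observations is (approximately) the overall KM estimate of S(1)
mean(df$pseudo_obs)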
###############################
## what do we do with the pseudovalues?
## fit a linear model and use g-formula to find
## average causal effect (ACE)
## df must have following columns
## "times" (numeric)
## "event" (numeric 0/1)
## "a" (numeric 0/1)
## "pseudo_obs" (numeric)
find_ace <- function(df){
  ## how many patients in data set
  n <- dim(df)[1]
  ## fit linear model to pseudovalues
  fit <- lm(pseudo_obs ~ a, data = df)
  ## predict pseudovalue for each patient under A = 0
  predict_0 <- predict(fit,
                       newdata = data.frame(a = rep(0, n)))
  ## predict pseudovalue for each patient under A = 1
  predict_1 <- predict(fit,
                       newdata = data.frame(a = rep(1, n)))
  ## average causal effect is p(T > 1 | A = 0) - p(T > 1 | A = 1)
  ace <- mean(predict_0) - mean(predict_1)
  ace
}
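With only a binary exposure and no baseline covariates, the g-formula step is not strictly needed here: averaging the predictions under A = 1 and A = 0 just recovers (minus) the fitted coefficient of a. A quick way to see this (my own check, not in the paper):

df <- pseudo_surv(sim_dat(200))
fit <- lm(pseudo_obs ~ a, data = df)
## mean(predict_0) - mean(predict_1) collapses to minus the coefficient of a
-coef(fit)[["a"]]
find_ace(df)   ## should agree exactly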
###################################
## bring 3 functions together
sim_ace <- function(n){find_ace(pseudo_surv(sim_dat(n)))}
In the paper, a sample size of 1,000,000 is used with 10,000 replications. As I don’t want to wait forever, I’ll reduce this to n=200 and 1,000 replications.
## n = 200
res_pseudo <- purrr::map_dbl(rep(200, 1000), sim_ace)
mean(res_pseudo) - true_ace #bias
## [1] 0.001565089
This looks to be essentially unbiased, in agreement with Table 2.
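As an aside (my own addition), the Monte Carlo standard error of this bias estimate gives a sense of how much simulation noise to expect from only 1,000 replications:

sd(res_pseudo) / sqrt(length(res_pseudo))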
Bias (Cox model)
I didn’t do this last time, so need to write a couple more functions…
library(survival)
find_ace_cox <- function(df){
  ## how many patients in data set
  n <- dim(df)[1]
  ## fit cox model
  fit_cox <- coxph(Surv(times, event) ~ a, data = df)
  ## predict S(1) for each patient under A = 0
  predict_0 <- predict(fit_cox,
                       newdata = data.frame(times = rep(1, n),
                                            event = rep(1, n),
                                            a = rep(0, n)),
                       type = "survival")
  ## predict S(1) for each patient under A = 1
  predict_1 <- predict(fit_cox,
                       newdata = data.frame(times = rep(1, n),
                                            event = rep(1, n),
                                            a = rep(1, n)),
                       type = "survival")
  ## average causal effect is p(T > 1 | A = 0) - p(T > 1 | A = 1)
  ace <- mean(predict_0) - mean(predict_1)
  ace
}
###################################
## bring 2 functions together
sim_ace_cox <- function(n){find_ace_cox(sim_dat(n))}
and check the bias
## n = 200
res_cox <- purrr::map_dbl(rep(200, 1000), sim_ace_cox)
mean(res_cox) - true_ace #bias
## [1] -0.1173701
which looks close enough to the -0.107 reported in Table 2, considering I’ve reduced the sample size and the number of replications.
Comments
This example nicely demonstrates the bias in estimating the ACE with a Cox model when the true data-generating mechanism has non-proportional hazards. In this case, using a pseudo-observations approach avoided that bias.
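One way to see the non-proportionality directly (my own addition, not part of the paper’s simulation) is to fit the marginal Cox model to a single large simulated data set and run cox.zph():

library(survival)
set.seed(2023)   ## arbitrary seed, just for reproducibility
df_big <- sim_dat(100000)
fit_big <- coxph(Surv(times, event) ~ a, data = df_big)
cox.zph(fit_big)   ## the proportional hazards test should reject emphatically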