Pseudo-observations 2
This is a quick follow-up to my previous post on pseudo-observations, where I reproduced Table 1 from “Causal inference in survival analysis using pseudo-observations” by Per K. Andersen, Elisavet Syriopoulou, and Erik T. Parner. The simulation scenario for Table 1 was a correctly specified Cox model, which didn’t show off the robustness property of the pseudo-observations method. In this post I’ll aim to reproduce Table 2, which does demonstrate this point.
For reference, here is the row of Table 2 that I’ll try to reproduce:

| HR(1) | True ACE | Censoring prob. | Cox bias | Pseudo-obs bias |
|---|---|---|---|---|
| 5 | 0.384 | 0.684 | -0.107 | 0.001 |
The simulation scenario in Table 2 is described in the paper as follows:

> The baseline hazard was constant = 0.511 for the unexposed (A=0) and the hazard ratio for the exposed (A=1) increased linearly over time from HR(0) = 1 at time 0 to a hazard ratio of HR(1) = 1, 1.5, 2, 5 at time 1. The exposure distribution was binomial with probability 0.50.
Ok, I’m already struggling a bit. How do I simulate from a model with a linearly increasing hazard ratio? That’s not something I do every day.
I’ll focus on the situation where the hazard ratio is 5 at time 1. I can write the survival distribution as

\[S(t) = \exp\left(-\int_{0}^t h(s)\,ds\right)\]

where in this case

\[h(s)=0.511\times(1 + 4s),\quad s \in (0,1)\]

so that

\[S(t) = \exp(-0.511(t + 2t^2)).\]

This means that I can simulate the time-to-event \(T_i^*\) by first simulating \(U_i \sim U(0,1)\) and then solving

\[U_i = \exp(-0.511(T_i^* + 2(T_i^*)^2))\]

for \(T_i^*\). Taking logs turns this into the quadratic \(2(T_i^*)^2 + T_i^* + \log(U_i)/0.511 = 0\), and I can just about remember that

\[T_i^* = \frac{-0.5 \pm \sqrt{0.25-2\log(U_i)/0.511}}{2}\]

(and I know that \(T_i^*\) is positive, so I take the positive root). Let’s check I have the correct scenario according to Table 2, remembering that we’re interested in the difference in survival probabilities at \(t=1\).
n <- 5000000
u <- runif(n)
t_star_1 <- (-0.5 + sqrt(0.25 - 2 * log(u) / 0.511)) / 2
t_star_0 <- rexp(n, 0.511)
a <- rbinom(n, 1, 0.5)
t_star <- a * t_star_1 + (1 - a) * t_star_0
true_ace <- mean(t_star[a == 1] < 1) - mean(t_star[a == 0] < 1)
true_ace
## [1] 0.3840153
Looks good so far!
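As an extra sanity check (my own addition, not in the paper), the true ACE is also available in closed form from the two survival functions above, so no simulation is really needed:

## S_0(1) - S_1(1) = exp(-0.511) - exp(-0.511 * (1 + 2))
exp(-0.511) - exp(-0.511 * 3)
## about 0.3840, matching the simulated value above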
The censoring distribution was described as follows:

> The censoring had a constant hazard of 0, 0.1, 0.2, 0.5, 1.0, and 2.0.
And I’m focusing on the last row of Table 2 where the censoring hazard was equal to 2.0
cens <- rexp(n, 2)
mean(cens < pmin(1, t_star))
## [1] 0.6841254
This also matches Table 2.
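As another check (my own addition, not in the paper), the censoring probability can be computed by numerical integration. Conditional on \(T^* = t\), the probability of being censored before \(\min(1, T^*)\) is \(1 - e^{-2\min(1,t)}\), so integrating this against the density of \(T^*\) in each arm and averaging over the 50/50 exposure distribution should give roughly 0.684:

## P(C < min(1, T*)) for one arm, given the density f of T* on (0, 1)
## and the survival probability surv1 = P(T* > 1) in that arm
p_cens <- function(f, surv1){
  integrate(function(t) (1 - exp(-2 * t)) * f(t), 0, 1)$value +
    (1 - exp(-2)) * surv1
}
## exposed arm: hazard 0.511 * (1 + 4t), so density = hazard * survival
p1 <- p_cens(function(t) 0.511 * (1 + 4 * t) * exp(-0.511 * (t + 2 * t^2)),
             exp(-0.511 * 3))
## unexposed arm: exponential with rate 0.511
p0 <- p_cens(function(t) 0.511 * exp(-0.511 * t), exp(-0.511))
0.5 * (p1 + p0)   ## should be close to 0.684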
Next I’ll look at the bias. I’ll first need a function that can simulate one data set under this scenario
#######################################
## function to simulate one data set
sim_dat <- function(n){
  ## inverse-transform sampling for the exposed arm (A = 1),
  ## where the hazard ratio increases linearly from 1 to 5
  u <- runif(n)
  t_star_1 <- (-0.5 + sqrt(0.25 - 2 * log(u) / 0.511)) / 2
  ## constant hazard 0.511 for the unexposed arm (A = 0)
  t_star_0 <- rexp(n, 0.511)
  ## binary exposure with probability 0.5
  a <- rbinom(n, 1, 0.5)
  t_star <- a * t_star_1 + (1 - a) * t_star_0
  ## censoring with constant hazard 2, plus administrative censoring at t = 1
  cens <- rexp(n, 2)
  times <- pmin(1, pmin(t_star, cens))
  event <- as.numeric(t_star < pmin(cens, 1))
  df <- data.frame(times = times,
                   event = event,
                   a = a)
  df
}
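Before moving on, a quick check (my own addition, not in the paper) that sim_dat() produces the intended marginal survival curves: the Kaplan-Meier estimates at \(t = 1\) in a large simulated data set should be close to exp(-0.511) ≈ 0.60 for A = 0 and exp(-0.511 × 3) ≈ 0.22 for A = 1.

library(survival)
df_big <- sim_dat(100000)
fit_km <- survfit(Surv(times, event) ~ a, data = df_big)
## survival at t = 1 by arm; should be roughly 0.60 (a = 0) and 0.22 (a = 1)
summary(fit_km, times = 1)$surv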
Bias (pseudo-observations)
To find the bias of the pseudo-observation approach I can copy and paste my code from my previous post (more or less – I remove baseline covariates \(L\) since they are not included in this scenario)…
library(prodlim)
#########################
## function to find pseudo values.
## df must have following columns
## "times" (numeric)
## "event" (numeric 0/1)
## "a" (numeric 0/1)
pseudo_surv <- function(df, tau = 1){
  ## marginal Kaplan-Meier fit (no covariates)
  f <- prodlim(Hist(times, event) ~ 1, data = df)
  ## jackknife pseudo-values for S(tau)
  df$pseudo_obs <- unname(jackknife(f, times = tau))
  df
}
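As a quick sanity check of the pseudo-observations themselves (my own addition), their mean should be close to the marginal survival probability \(P(T > 1) = 0.5\{\exp(-0.511) + \exp(-0.511 \times 3)\} \approx 0.41\):

df <- pseudo_surv(sim_dat(10000))
## mean of the pseudo-observations is (approximately) the overall KM estimate of S(1)
mean(df$pseudo_obs)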
###############################
## what do we do with the pseudovalues?
## fit a linear model and use g-formula to find
## average causal effect (ACE)
## df must have following columns
## "times" (numeric)
## "event" (numeric 0/1)
## "a" (numeric 0/1)
## "pseudo_obs" (numeric)
find_ace <- function(df){
  ## how many patients in data set
  n <- dim(df)[1]
  ## fit linear model to pseudovalues
  fit <- lm(pseudo_obs ~ a, data = df)
  ## predict pseudovalue for each patient under A = 0
  predict_0 <- predict(fit,
                       newdata = data.frame(a = rep(0, n)))
  ## predict pseudovalue for each patient under A = 1
  predict_1 <- predict(fit,
                       newdata = data.frame(a = rep(1, n)))
  ## average causal effect is p(T > 1 | A = 0) - p(T > 1 | A = 1)
  ace <- mean(predict_0) - mean(predict_1)
  ace
}
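With only a binary exposure and no baseline covariates, the g-formula step is not strictly needed here: averaging the predictions under A = 1 and A = 0 just recovers (minus) the fitted coefficient of a. A quick way to see this (my own check, not in the paper):

df <- pseudo_surv(sim_dat(200))
fit <- lm(pseudo_obs ~ a, data = df)
## mean(predict_0) - mean(predict_1) collapses to minus the coefficient of a
-coef(fit)[["a"]]
find_ace(df)   ## should agree exactly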
###################################
## bring 3 functions together
sim_ace <- function(n){find_ace(pseudo_surv(sim_dat(n)))}
In the paper, a sample size of 1,000,000 is used with 10,000 replications. As I don’t want to wait forever, I’ll reduce this to n=200 and 1,000 replications.
## n = 200
res_pseudo <- purrr::map_dbl(rep(200, 1000), sim_ace)
mean(res_pseudo) - true_ace #bias
## [1] 0.001565089
This looks to be essentially unbiased, in agreement with Table 2.
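As an aside (my own addition), the Monte Carlo standard error of this bias estimate gives a sense of how much simulation noise to expect from only 1,000 replications:

sd(res_pseudo) / sqrt(length(res_pseudo))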
Bias (Cox model)
I didn’t do this last time, so need to write a couple more functions…
library(survival)
find_ace_cox <- function(df){
  ## how many patients in data set
  n <- dim(df)[1]
  ## fit cox model
  fit_cox <- coxph(Surv(times, event) ~ a, data = df)
  ## predict S(1) for each patient under A = 0
  predict_0 <- predict(fit_cox,
                       newdata = data.frame(times = rep(1, n),
                                            event = rep(1, n),
                                            a = rep(0, n)),
                       type = "survival")
  ## predict S(1) for each patient under A = 1
  predict_1 <- predict(fit_cox,
                       newdata = data.frame(times = rep(1, n),
                                            event = rep(1, n),
                                            a = rep(1, n)),
                       type = "survival")
  ## average causal effect is p(T > 1 | A = 0) - p(T > 1 | A = 1)
  ace <- mean(predict_0) - mean(predict_1)
  ace
}
###################################
## bring 2 functions together
sim_ace_cox <- function(n){find_ace_cox(sim_dat(n))}
and check the bias
## n = 200
res_cox <- purrr::map_dbl(rep(200, 1000), sim_ace_cox)
mean(res_cox) - true_ace #bias
## [1] -0.1173701
which looks close enough to the -0.107 reported in Table 2, considering I’ve reduced the sample size and the number of replications.
Comments
This example nicely demonstrates the bias in estimating the ACE with a Cox model when the true data-generating mechanism has non-proportional hazards. In this case, using a pseudo-observations approach avoided that bias.
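One way to see the non-proportionality directly (my own addition, not part of the paper’s simulation) is to fit the marginal Cox model to a single large simulated data set and run cox.zph():

library(survival)
set.seed(2023)   ## arbitrary seed, just for reproducibility
df_big <- sim_dat(100000)
fit_big <- coxph(Surv(times, event) ~ a, data = df_big)
cox.zph(fit_big)   ## the proportional hazards test should reject emphatically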