Pseudo observations on pseudo-observations

Idea / motivation

I’ve set myself the challenge of reproducing the results in Table I of “Causal inference in survival analysis using pseudo-observations” by Per K. Andersen, Elisavet Syriopoulou, and Erik T. Parner.

I’m doing this because I’ve seen a lot about pseudo-observations recently, and while I kind of get the gist of it, I would like to deepen my understanding by getting hands on with some examples.

For more general motivation (i.e., what’s the big deal about pseudo-observations?), I think this sentence from the paper’s introduction sums it up nicely:

This approach has the advantage that when focus is on a single time point t0, an assumption like the proportional hazards assumption for all values t of time that is imposed when using a Cox model can be avoided.

The idea is that we would like to estimate the average causal effect (ACE) defined as

\[E(\mathbb{1}\{ Y^1 < t_0\}) - E(\mathbb{1}\{ Y^0 < t_0\})\] where \(Y^1\) and \(Y^0\) denote the potential outcomes under the two possible treatments \(A=1\) and \(A=0\), respectively. We also have available some prognostic information in the form of baseline covariates \(L\) that we would like to make use of to improve the efficiency of our inference.

G-formula applied to the Cox model

One approach to estimating the ACE is to first fit a Cox model,

\[h(t)=h_0(t)\exp(\beta^T L + \gamma A),\] then, for each subject \(i=1,\ldots,n\), use the model to predict survival to \(t_0\) if exposed to \(A = 1\),

\[\hat{S}(t_0 \mid L_i, A_i = 1) = \exp(-\hat{\Lambda}_0(t_0)\exp(\hat{\beta}^T L_i + \hat{\gamma})),\] where \(\hat{\Lambda}_0(t_0)\) is the estimated cumulative baseline hazard at \(t_0\), and if exposed to \(A = 0\),

\[\hat{S}(t_0 \mid L_i, A_i = 0) = \exp(-\hat{\Lambda}_0(t_0)\exp(\hat{\beta}^T L_i)),\] and, finally, (point) estimate the ACE as

\[\frac{1}{n}\sum_{i=1}^n \hat{S}(t_0 \mid L_i, A_i = 0) - \frac{1}{n}\sum_{i=1}^n \hat{S}(t_0 \mid L_i, A_i = 1).\]

There is probably some nice theory for constructing a confidence interval for the ACE, but the simplest option is the non-parametric bootstrap.

Overall, this is a nice procedure if the Cox model is correctly specified. If the model is misspecified then we may have a biased point estimate and confidence intervals with incorrect coverage. Depending on one’s perspective, this might be a really bad thing.
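Just to make the procedure concrete, here is a rough sketch of the Cox-based G-formula in R (my own code, not from the paper), using the {survival} package. It assumes a data frame with the columns times, event, a, l_1 and l_2 used throughout this post:

```r
## Sketch of the Cox G-formula estimate (my own code, not from the paper).
## Assumes a data frame with columns "times", "event", "a", "l_1", "l_2".
library(survival)

cox_ace <- function(df, t0 = 1){
  
  ## fit the Cox model h(t) = h_0(t) exp(beta^T L + gamma A)
  fit <- coxph(Surv(times, event) ~ a + l_1 + l_2, data = df)
  
  ## predicted survival at t0 for every subject, forcing A to a fixed value
  surv_at_t0 <- function(newdata){
    sf <- survfit(fit, newdata = newdata)
    summary(sf, times = t0, extend = TRUE)$surv
  }
  s_0 <- surv_at_t0(transform(df, a = 0))
  s_1 <- surv_at_t0(transform(df, a = 1))
  
  ## (point) estimate of the ACE
  mean(s_0) - mean(s_1)
}
```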

G-formula applied to pseudo-observations

A potentially more robust (less bias under model misspecification) way to estimate ACE is to use pseudo-observations.

What’s a pseudo-observation?

Suppose we are interested in the parameter \(\theta = E(\mathbb{1}\{ Y > t_0\})\) and a consistent estimator, \(\hat{\theta}\), exists, based on the observations from the full sample \(i=1,\ldots,n\), e.g., based on the Kaplan-Meier estimator (with suitable assumptions about censoring). A pseudo-observation (or pseudo-value) for subject \(i\) is defined as

\[\theta_i = n \hat{\theta} - (n-1)\hat{\theta}^{-i},\] where \(\hat{\theta}^{-i}\) is the estimator applied to the sample of size \(n-1\) obtained by eliminating subject \(i\) from the whole sample.
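To make the definition concrete, here is a toy leave-one-out computation (my own sketch, using the empirical survival fraction as the estimator rather than Kaplan-Meier). With no censoring, each pseudo-value collapses back to the plain indicator \(\mathbb{1}\{y_i > t_0\}\):

```r
## Toy leave-one-out illustration of the pseudo-value definition
## (my own sketch; estimator here is the empirical survival fraction)
pseudo_by_hand <- function(y, t0){
  n <- length(y)
  theta_hat <- mean(y > t0)
  sapply(seq_len(n), function(i){
    n * theta_hat - (n - 1) * mean(y[-i] > t0)
  })
}

y <- c(0.3, 0.8, 1.2, 2.5, 0.6)
pseudo_by_hand(y, t0 = 1)   ## equals as.numeric(y > 1) here
```

With censoring, the Kaplan-Meier version no longer collapses to an indicator like this, which is where pseudo-values become genuinely useful.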

Somewhat miraculously (to me at least…the maths is way above my level – see, e.g., Graw et al., 2009), these pseudo-observations have (approximately) the same conditional expectation as the thing we were originally interested in

\[E(\theta_i \mid L_i, A_i) \approx E(\mathbb{1}\{ Y_i > t_0\} \mid L_i, A_i).\]

This means that we can express the average causal effect as

\[ \begin{align} \text{ACE} & = E(\mathbb{1}\{ Y^0 > t_0\}) - E(\mathbb{1}\{ Y^1 > t_0\})\\ & = E_L(E(\mathbb{1}\{ Y_i > t_0\} \mid L_i, A_i = 0)) - E_L(E(\mathbb{1}\{ Y_i > t_0\} \mid L_i, A_i = 1)) \\ & \approx E_L(E(\theta_i \mid L_i, A_i = 0)) - E_L(E(\theta_i \mid L_i, A_i = 1)) \end{align} \] and (point) estimate this by, e.g., fitting a linear model to the pseudo-observations and then applying the G-formula. Again, a confidence interval could be based on the non-parametric bootstrap. This doesn’t rely on, e.g., the proportional hazards assumption being valid. (Do we still have to correctly model the conditional expectation of the pseudo-observations in the linear model? There’s a comment in the discussion section that we do, but given that ANCOVA is robust to this kind of misspecification, can we hope for the same here?…I don’t know.)

Reproducing Table I

Here are the first couple of rows of Table I from “Causal inference in survival analysis using pseudo-observations” by Per K. Andersen, Elisavet Syriopoulou, and Erik T. Parner.

Table 1: Table I from Andersen et al. (2017)
n     Cox_Coverage  Cox_Bias  Cox_SD  Pseudo_Coverage  Pseudo_Bias  Pseudo_SD
200   0.946          0.000    0.074   0.945             0.000       0.078
400   0.940         -0.001    0.053   0.945             0.000       0.054

The simulation scenario was described as follows.

The baseline hazard was constant = 0.511 for the unexposed (A = 0), and the hazard ratio for exposure was HR = 1.37. The exposure distribution was binomial with probability 0.50. Two further Gaussian variables L1 and L2 with a standard deviation of 0.2 were included with a hazard ratio of 1.1 per unit. This corresponds to a true ACE of 0.104 = 0.504 − 0.400 at time t0 = 1 (computed using 5,000,000 replications).

Let’s start by checking whether I can reproduce that…

########################################
## check I have the correct scenario
n <- 5000000
a <- rbinom(n, 1, 0.5)
l_1 <- rnorm(n, sd = 0.2)
l_2 <- rnorm(n, sd = 0.2)

rate <- 0.511 * (1.37 * a + 1 * (1 - a)) * exp(log(1.1) * l_1 + log(1.1) * l_2)

t_star <- rexp(n, rate = rate) 
true_ace <- mean(t_star[a == 1] < 1) - mean(t_star[a == 0] < 1)
true_ace
## [1] 0.1038744

Looks good! Next, the censoring distribution was described.

The censoring time has a constant rate of 0.31, corresponding to an observed censoring frequency of 20.3% before time t0 = 1 (computed using 5,000,000 replications). Subjects still at risk at t0 = 1 were (administratively) censored at that time.

Let’s check that as well…

cens <- rexp(n, 0.31)
mean(cens < pmin(1, t_star))
## [1] 0.2031564

Good stuff!

Next, I’ll check whether I can reproduce the bias and SD results in Table I. I’ll focus on the pseudo-observations approach. This will require several steps. Firstly, I write a function to simulate one dataset according to the scenario described.

#######################################
## function to simulate one data set
sim_dat <- function(n){
  
  a <- rbinom(n, 1, 0.5)
  l_1 <- rnorm(n, sd = 0.2)
  l_2 <- rnorm(n, sd = 0.2)
  rate <- 0.511 * (1.37 * a + 1 * (1 - a)) * exp(log(1.1) * l_1 + log(1.1) * l_2)
  t_star <- rexp(n, rate = rate) 
  cens <- rexp(n, 0.31)
  times <- pmin(1, pmin(t_star, cens))
  event <- as.numeric(t_star < pmin(cens, 1))
  
  df <- data.frame(times = times,
                   event = event,
                   a = a,
                   l_1 = l_1,
                   l_2 = l_2)
  
  df
}

Then I write a function to calculate the pseudo-observations (via the R package {prodlim}):

library(prodlim)
#########################
## function to find pseudo values.
## df must have following columns
## "times" (numeric)
## "event" (numeric 0/1)
## "a" (numeric 0/1)
## "l_1" (numeric)
## "l_2" (numeric)
pseudo_surv <- function(df, tau = 1){
  f <- prodlim(Hist(times, event) ~ 1, data = df)
  df$pseudo_obs <- unname(jackknife(f, times = tau))
  df
}
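For what it’s worth, I believe the same leave-one-out Kaplan-Meier pseudo-values can also be computed with the {pseudo} package; here is a sketch of an equivalent function (my own code, not from the paper):

```r
## Alternative using {pseudo} (my own sketch; as I understand it, pseudosurv
## computes leave-one-out Kaplan-Meier pseudo-values at the times in tmax)
library(pseudo)

pseudo_surv2 <- function(df, tau = 1){
  df$pseudo_obs <- as.numeric(pseudosurv(time = df$times,
                                         event = df$event,
                                         tmax = tau)$pseudo)
  df
}
```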

And finally a function to apply the G-formula to the pseudo-values:

###############################
## what do we do with the pseudovalues?
## fit a linear model and use g-formula to find
## average causal effect (ACE)
## df must have following columns
## "times" (numeric)
## "event" (numeric 0/1)
## "a" (numeric 0/1)
## "l_1" (numeric)
## "l_2" (numeric)
## "pseudo_obs" (numeric)
find_ace <- function(df){
  
  ## number of patients in the data set
  n <- nrow(df)
  
  ## fit linear model to pseudovalues
  fit <- lm(pseudo_obs ~ a + l_1 + l_2, data = df)

  ## predict pseudovalue for each patient under A = 0
  predict_0 <- predict(fit,
                       newdata = data.frame(a = rep(0, n),
                                            l_1 = df$l_1,
                                            l_2 = df$l_2))
  
  ## predict pseudovalue for each patient under A = 1
  predict_1 <- predict(fit,
                       newdata = data.frame(a = rep(1, n),
                                            l_1 = df$l_1,
                                            l_2 = df$l_2))

  ## ACE estimate: predictions of P(T > 1 | L, A) averaged over the
  ## empirical distribution of L, under A = 0 minus under A = 1
  ace <- mean(predict_0) - mean(predict_1)
  ace
}

I could then link these three functions together to estimate the average causal effect for a single simulated data set

sim_ace <- function(n){find_ace(pseudo_surv(sim_dat(n)))}
sim_ace(n = 200)
## [1] 0.1387292

and I could do this 1,000 times (the paper used 10,000 but I’ll take my chances) and compare with Table I

## n = 200
res_200 <- purrr::map_dbl(rep(200, 1000), sim_ace)
mean(res_200) - true_ace #bias
## [1] -0.002274479
sd(res_200)
## [1] 0.07649481
## n = 400
res_400 <- purrr::map_dbl(rep(400, 1000), sim_ace)
mean(res_400) - true_ace #bias
## [1] -0.0001083186
sd(res_400)
## [1] 0.05380005

That’s good enough for me!

Reproducing the coverage

Reproducing the coverage results is considerably more difficult (or, at least, more computationally intensive) than reproducing the bias / SD results, because the confidence intervals are based on the non-parametric bootstrap.

The following code would do it (via the R package {boot}) but it takes over an hour on my machine so I won’t run it now…

###################################
## function allowing random sampling of rows of dataset
boot_ace <- function(data, indices) {
  d <- data[indices,] # allows boot to select sample
  find_ace(pseudo_surv(d))
}

#####################################
## simulate a dataset of size n
## find bootstrap CI for ACE
## and check whether this covers 
## the true ACE
is_covered <- function(n, true_ace){
  
  dat_orig <- sim_dat(n)
  
  ### based on 1000 bootstrap samples
  results <- boot::boot(data = dat_orig, 
                        statistic = boot_ace,
                        R = 1000)
  
  boot_ci <- boot::boot.ci(results, type = "perc")
  true_ace > boot_ci$percent[1,4] && true_ace < boot_ci$percent[1,5]
}

## find coverage for n = 200 based on 1000 simulated trials
cov_200 <- purrr::map_dbl(rep(200, 1000), is_covered, true_ace = true_ace)
mean(cov_200)

…but you can trust me that it gave a result of 0.945, which matches the corresponding entry in Table I closely enough.

Comments

The scenario considered in Table I was a correctly specified Cox model, so this example doesn’t show off the robustness properties of the pseudo-observation approach. I might revisit this in a future post if I get time.

I feel a bit more confident using pseudo-observations now. But I still have some questions:

  • just how robust is the method? It looks good in this example and the other examples from the paper, but when does it break down?
  • what about efficiency? It’s based on the Kaplan-Meier estimate, which involves essentially a dichotomization of the survival time at \(t_0\). This must lead to some inefficiency compared to a full survival model. It’s not really apparent in these examples looking at the SDs, but here all observations were censored at \(t_0\), which is not reflective of the projects I work on.
  • how much is gained compared to simple Kaplan-Meier estimation? This will depend on the strength of the prognostic covariates. In this simulation example the prognostic covariates were pretty weak, so I doubt much was gained.
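For reference, the unadjusted comparison in that last question would just be two arm-wise Kaplan-Meier estimates differenced at \(t_0 = 1\). A sketch (my own code, via the {survival} package rather than {prodlim}):

```r
## Unadjusted ACE estimate (my own sketch, not from the paper):
## arm-wise Kaplan-Meier survival at t0, differenced
library(survival)

km_ace <- function(df, t0 = 1){
  s_arm <- function(arm){
    sf <- survfit(Surv(times, event) ~ 1, data = df[df$a == arm, ])
    summary(sf, times = t0, extend = TRUE)$surv
  }
  s_arm(0) - s_arm(1)
}
```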

Overall, I’m impressed with pseudo-observations. They are quite a difficult concept, but don’t seem so far out of reach compared to other methods I’ve seen that provide unbiased estimation under model misspecification. I’ll keep trying to learn more about this.

References

Andersen, Per K., Elisavet Syriopoulou, and Erik T. Parner. “Causal inference in survival analysis using pseudo-observations.” Statistics in Medicine 36.17 (2017): 2669-2681.

Graw, Frederik, Thomas A. Gerds, and Martin Schumacher. “On pseudo-values for regression analysis in competing risks models.” Lifetime Data Analysis 15.2 (2009): 241-255.

Dominic Magirr
Medical Statistician

Interested in the design, analysis and interpretation of clinical trials.