11.7 - Comparing Survival Curves

If the primary endpoint in a CTE trial is a time-to-event variable, then it will be of interest to compare the survival curves of the randomized treatment arms. Again, we will focus on a nonparametric approach that corresponds to comparing the Kaplan-Meier survival curves rather than a parametric approach.

The Mantel-Haenszel test can be adapted to compare two groups, say P and E for placebo and experimental treatment. In this setting, the Mantel-Haenszel test is called the logrank test.

The assumptions for the logrank test are that (1) the censoring patterns are the same for the two treatment groups, and (2) the hazard functions for the two treatment groups are proportional.

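A 2 × 2 table is constructed at each of the K distinct failure times, t1, t2, … , tK, observed across the two randomized groups. For failure time tk, k = 1, 2, … , K, the table is:

                 Placebo        Exp Treat
# events         dPk            dEk
# non-events     nPk - dPk      nEk - dEk

Here nPk and nEk are the numbers of patients still at risk in each group just before tk, and dPk and dEk are the numbers of events in each group at tk.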

The logrank statistic constructs an observed minus expected score, under the assumption that the null hypothesis of equal event rates is true, for each of the K tables and then sums over all tables:

\[O-E=\sum_{k=1}^{K}\left( \frac{n_{Pk}d_{Ek}-n_{Ek}d_{Pk}}{n_{Pk}+n_{Ek}} \right)\]
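The summand above can be traced back to the usual Mantel-Haenszel reasoning. As a sketch, assuming that "observed" refers to the events in the experimental group at tk and that, under the null hypothesis, the dPk + dEk events are expected to split in proportion to the numbers at risk, the kth term is:

\[d_{Ek}-\frac{n_{Ek}(d_{Pk}+d_{Ek})}{n_{Pk}+n_{Ek}}=\frac{d_{Ek}(n_{Pk}+n_{Ek})-n_{Ek}d_{Pk}-n_{Ek}d_{Ek}}{n_{Pk}+n_{Ek}}=\frac{n_{Pk}d_{Ek}-n_{Ek}d_{Pk}}{n_{Pk}+n_{Ek}}\]

With this sign convention, a positive O - E indicates more events in the experimental group than expected under the null hypothesis.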

The variance expression for the O - E score is as follows:

\[V_L=Var(O-E)=\sum_{k=1}^{K}\left( \frac{(d_{Pk}+d_{Ek})(n_{Pk}+n_{Ek}-d_{Pk}-d_{Ek})n_{Pk}n_{Ek}}{(n_{Pk}+n_{Ek}-1)(n_{Pk}+n_{Ek})^2} \right)\]

Then the logrank statistic is:

\[Z_L=\frac{O-E}{\sqrt{V_L}}\]

which has an approximate standard normal distribution under the null hypothesis.
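The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the SAS implementation: it takes the statistic to be the standardized form (O - E)/√V_L, treats a patient censored at tk as still at risk at tk, and uses hypothetical names (`logrank`, `times_p`, `events_p`, and so on). The data are the eight observations from the small Ui example given later in this section.

```python
import math

def logrank(times_p, events_p, times_e, events_e):
    """Two-group logrank statistic from the O - E and V_L formulas.

    times_*  : observed times (event or censoring) for placebo (p) / experimental (e)
    events_* : 1 if the time is an event, 0 if it is censored
    """
    # Distinct failure (event) times across both groups
    failure_times = sorted(
        {t for t, d in zip(times_p, events_p) if d}
        | {t for t, d in zip(times_e, events_e) if d}
    )
    o_minus_e = 0.0
    var = 0.0
    for tk in failure_times:
        # Numbers at risk just before tk (assumption: censored at tk counts as at risk)
        n_p = sum(1 for t in times_p if t >= tk)
        n_e = sum(1 for t in times_e if t >= tk)
        # Numbers of events exactly at tk
        d_p = sum(1 for t, d in zip(times_p, events_p) if d and t == tk)
        d_e = sum(1 for t, d in zip(times_e, events_e) if d and t == tk)
        n, d = n_p + n_e, d_p + d_e
        o_minus_e += (n_p * d_e - n_e * d_p) / n
        if n > 1:  # the variance term is 0/0 when a single patient is at risk
            var += d * (n - d) * n_p * n_e / ((n - 1) * n ** 2)
    if var == 0:
        raise ValueError("variance is zero; the statistic is undefined")
    return o_minus_e, var, o_minus_e / math.sqrt(var)

# Placebo: 10, 17, 21, 25+   Experimental: 6, 10+, 12, 15+
o_minus_e, v_l, z_l = logrank([10, 17, 21, 25], [1, 1, 1, 0],
                              [6, 10, 12, 15], [1, 0, 1, 0])
print(round(o_minus_e, 3), round(v_l, 3), round(z_l, 3))
```

For these eight observations, only the failure times 6, 10 and 12 contribute, because no experimental patient remains at risk at 17 or 21. The result is O - E of about 0.67, V_L of about 0.73, and a statistic of about 0.78.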

The generalized Wilcoxon test also is a nonparametric test for comparing survival curves and it is an extension of the Wilcoxon rank sum test in the presence of censoring. It also requires that the censoring patterns for the two treatment groups be the same, but it does not assume proportional hazards.

The first step in constructing the generalized Wilcoxon statistic is to pool the two samples of survival times (including censored values) and order them from lowest to highest. For the ith observation in the ordered sample with survival (or censored) time ti, construct a score, Ui, which is the number of observations known to have a survival time less than ti minus the number known to have a survival time greater than ti. Censoring limits what is known: a patient censored at ti cannot be said to have a shorter survival time than any other patient, and a patient censored before ti cannot be counted as either shorter or longer than ti. The Ui are summed over the experimental treatment group and a variance calculated, i.e.,

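\[U=\sum_{i\in E}U_i \quad \text{and} \quad V_U = Var(U)=\left( \frac{n_Pn_E}{(n_P+n_E)(n_P+n_E-1)}\right)\sum_{i=1}^{n_P+n_E}U_{i}^{2}\]

where the first sum runs over the observations in the experimental treatment group, such that:

\[Z_U=\frac{U}{\sqrt{V_U}}\]

has an approximate standard normal distribution.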

An example of constructing the Ui scores ("+" reflects censoring):

ti     Group       # < ti   # > ti   Ui
6      Exp Treat   0        7        -7
10     Placebo     1        6        -5
10+    Exp Treat   2        0         2
12     Exp Treat   2        4        -2
15+    Exp Treat   3        0         3
17     Placebo     3        2         1
21     Placebo     4        1         3
25+    Placebo     5        0         5

Then U = (-7) + 2 + (-2) + 3 = -4.
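The scoring rule can be checked with a short Python sketch. It is an illustration under stated assumptions rather than the method used by SAS: the function names (`gehan_scores`, `gehan_test`) are hypothetical, an event is counted as smaller than a censored value tied at the same time, and the standardized statistic is taken to be U/√V_U.

```python
import math

def gehan_scores(times, events):
    """U_i = (# observations known to be smaller) - (# known to be larger)."""
    scores = []
    for i, (ti, di) in enumerate(zip(times, events)):
        less = greater = 0
        for j, (tj, dj) in enumerate(zip(times, events)):
            if i == j:
                continue
            # j is known to be smaller only if j is an event that occurred
            # before ti, or at ti while i is censored
            if dj and (tj < ti or (tj == ti and not di)):
                less += 1
            # j is known to be larger only if i is an event and j was still
            # under observation after ti (or censored exactly at ti)
            if di and (tj > ti or (tj == ti and not dj)):
                greater += 1
        scores.append(less - greater)
    return scores

def gehan_test(times, events, is_exp):
    """Return the scores, U (summed over the experimental group), V_U, and U/sqrt(V_U)."""
    scores = gehan_scores(times, events)
    n_e = sum(is_exp)
    n_p = len(times) - n_e
    n = n_p + n_e
    u = sum(s for s, g in zip(scores, is_exp) if g)
    v_u = n_p * n_e / (n * (n - 1)) * sum(s * s for s in scores)
    return scores, u, v_u, u / math.sqrt(v_u)

# The eight observations from the table: 6, 10, 10+, 12, 15+, 17, 21, 25+
times  = [6, 10, 10, 12, 15, 17, 21, 25]
events = [1, 1, 0, 1, 0, 1, 1, 0]      # 0 marks a censored ("+") time
is_exp = [1, 0, 1, 1, 1, 0, 0, 0]      # 1 = Exp Treat, 0 = Placebo
scores, u, v_u, z_u = gehan_test(times, events, is_exp)
print(scores, u, v_u, round(z_u, 3))
```

The scores reproduce the Ui column of the table and U = -4. The squared scores sum to 126, so V_U = (4 × 4)/(8 × 7) × 126 = 36 and the standardized statistic is -4/6, or about -0.67.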

SAS Example (12.4_survival_analysis.sas): A safety and efficacy study was conducted in 83 patients with malignant mesothelioma, an uncommon lung cancer that is strongly associated with asbestos exposure. Patients underwent one of three types of surgery: biopsy, limited resection, or extrapleural pneumonectomy (EPP). Treatment assignment was nonrandomized and based on the extent of disease at the time of diagnosis. Thus, the comparison of procedures in this example may be subject to strong selection bias.


The primary outcome variable was time to death (survival). SAS PROC LIFETEST constructs the Kaplan-Meier survival curve for each surgery group and compares the survival curves via the logrank test (p = 0.48) and the generalized Wilcoxon test (p = 0.63).

Strength of Evidence

Although p-values are useful for hypothesis tests that are specified a priori, they provide poor summaries of clinical effects because they do not convey the magnitude of the effect. The size of a p-value depends on both the magnitude of the estimated treatment effect and its estimated variability, which in turn is a function of sample size. Thus, the p-value partly reflects the size of the trial, which has no biological interpretation, and it can mask the magnitude of the treatment effect, which does have biological importance. A p-value addresses only the type I error and does not characterize the biologically important effects in the trial. For these reasons, p-values should not be used to describe the strength of evidence in a trial; investigators must look at the magnitude of the treatment effect.

Confidence intervals are more appropriate for describing the strength of evidence in a clinical trial, although they also are affected by the sample size. Most major journals now require confidence intervals to be reported because they are considerably more informative than the p-value alone.
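A small numerical sketch may make the contrast concrete. The two "trials" below are entirely hypothetical (the estimates and standard errors are invented for illustration), and the interval is the usual normal-approximation interval, estimate ± z × standard error. The function name `summarize` is likewise an assumption.

```python
from statistics import NormalDist

def summarize(estimate, std_error, level=0.95):
    """Two-sided p-value and normal-approximation confidence interval."""
    z = estimate / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return p_value, (estimate - crit * std_error, estimate + crit * std_error)

# Hypothetical small trial: large estimated effect, large standard error
p_small, ci_small = summarize(estimate=0.50, std_error=0.25)
# Hypothetical large trial: effect one tenth the size, standard error one tenth the size
p_large, ci_large = summarize(estimate=0.05, std_error=0.025)

print(round(p_small, 4), [round(x, 3) for x in ci_small])
print(round(p_large, 4), [round(x, 3) for x in ci_large])
```

Both hypothetical trials have the same ratio of estimate to standard error, so they produce the same p-value (a little under 0.05). The confidence intervals differ by a factor of ten in location and width, which is the information about effect size that the p-value alone hides.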