Lothar T. Tremmel, PhD
Sr. Director and Group Leader, Clinical Oncology, Cephalon Inc, Frazer, Pennsylvania

Key Words
Adaptive design; Group sequential design; Statistical inference

Correspondence Address
Lothar Tremmel, PhD, Cephalon, Inc, 41 Moores Road, Frazer, PA (email: Lothar@tremmel.net).

Drug Information Journal, Vol. 44, pp. 589–598, 2010 • 0092-8615/2010
Printed in the USA. All rights reserved. Copyright © 2010 Drug Information Association, Inc.

BIOSTATISTICS

Statistical Inference After an Adaptive Group Sequential Design: A Case Study
The adaptive group sequential design of Lehmacher and Wassmer allows fully flexible redetermination of sample size after each of a predetermined number of interim looks. Study 02CLLIII, a large, randomized, multicenter trial in chronic lymphocytic leukemia, was based on this approach, with five planned analyses. The study terminated for efficacy at the third interim analysis. While it was clear how statistical significance was to be determined, calculations of proper P values as well as point and interval estimates turned out to be much less straightforward. We extend existing analysis approaches to the stratified binary and time-to-event data from this trial, investigate the properties of the estimators for the primary endpoint, and highlight remaining issues with full statistical inference after such a design. In conclusion, the flexibility offered by the adaptive features renders statistical inference more difficult and less precise. We believe that there are situations where it may be worth paying this price.

INTRODUCTION

THE ADAPTIVE GROUP SEQUENTIAL DESIGN

Group sequential designs (GSDs) that allow for preplanned interim analyses of efficacy have been available for decades. The original designs were based on the concept of equal data increments: data could be analyzed after each of k equal fractions of the data became available (1). As statistical theory evolved, this restriction was relaxed gradually, and it became possible to perform analyses after increments that were unequal, even at time points that were not necessarily preplanned (2). However, one restriction remained firmly in place: interim results of efficacy were not to be used for determining when next to analyze the data (3, section 7.4).

We call designs that eliminate this last constraint adaptive group sequential designs (aGSD). One early and influential example was the design of Proschan and Hunsberger (4), which allows continuation of the trial if one obtains an "almost significant" result. The design that we consider here is the one by Lehmacher and Wassmer (5), with equal stage weights and Pocock's stopping rule: it is akin to Pocock's early GSD; the difference is that the size of the next stage can be determined adaptively, that is, based on unblinded review of interim efficacy data. For instance, one could recalculate the sample size needed to maintain the power under a certain effect size assumption. Importantly, rigid rules like that one do not have to be prespecified or followed; the design remains valid (in the sense that it controls the type I error) no matter how the decision regarding the size of the next stage is made. Only two design elements need to be predetermined: the number of interim looks, and the size of the first stage.

The design works by reestablishing Pocock's equal increment structure for the adaptively determined, unequally sized stages: under H_0, the standardized effect size from each stage k follows an independent standard normal distribution if weighted by the square root of the stage's sample size. The test statistic S_K is simply the standardized sum of these independent, standard-normally distributed z values,

S_K = (1/√K) ∑_{k=1..K} Z_k,    (1)

and hence Pocock's critical z values apply.
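The combination rule in Eq. 1 is easy to compute once the stage-wise z values are in hand. The following minimal sketch uses invented stage results; the critical value of 2.41 corresponds to the nominal two-sided P of 0.016 used in this trial.

```python
# Sketch of the Lehmacher-Wassmer combination test with equal stage
# weights and Pocock's stopping rule. The stage z values below are
# invented for illustration; they are not the trial's data.
import math

def combination_statistic(z_values):
    """Eq. 1: S_K = (1/sqrt(K)) * sum of the stage-wise z values."""
    K = len(z_values)
    return sum(z_values) / math.sqrt(K)

# Hypothetical z values from three adaptively sized stages.
stage_z = [1.1, 1.6, 2.0]
s3 = combination_statistic(stage_z)

# Pocock's two-sided critical z for 5 planned looks at overall alpha 5%
# corresponds to a nominal p of 0.016, i.e. z of about 2.41.
POCOCK_CRIT = 2.41
print(f"S_3 = {s3:.3f}, stop for efficacy: {abs(s3) > POCOCK_CRIT}")
```

Because each stage contributes only through its own standard-normal z value, the stage sizes themselves never enter the test statistic; that is what makes the adaptive sizing "free" with respect to the type I error.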
CASE STUDY

Study 02CLLIII was a large, randomized, open-label clinical trial comparing two chemotherapeutic treatments for chronic lymphocytic leukemia (CLL): bendamustine (Treanda) versus chlorambucil. Randomization was stratified by Binet stage (B vs C), which was believed to be an important predictor. Efficacy was measured by two key endpoints: tumor response and progression-free survival (PFS). The primary endpoint, tumor response, was defined as an outcome of partial response or better, based on predetermined assessment criteria. The key secondary endpoint, PFS, was the time from randomization to progression of the disease or death from any cause. A sequential testing strategy, based on testing response rate first, was used to protect the overall type I error.

An aGSD following Lehmacher and Wassmer (5) with equal weighting of the study stages was planned in the protocol. Four interim and one final analysis were planned, and the first analysis was to be done after 80 patients were enrolled. Pocock's method was chosen for determining the critical boundaries, yielding equal critical P values of 0.016 at each of the five analysis time points, corresponding to a planned overall two-tailed alpha of 5%. A data monitoring committee (DMC) was established to assess the interim results and to determine, at each interim analysis, whether the study should continue. No formal futility rules were planned or implemented.

The protocol also discussed the sample sizes that would be needed for a fixed-sample trial: n = 84 (total) for tumor response, and n = 326 (total) for PFS. This was based on effect sizes of 30% versus 60% response, and a hazard ratio of 1.429, respectively. The first result was used to determine the size of the first study stage (first interim analysis to be done after 80 patients), and the second result formed the basis of a projected maximum sample size of n = 350.

The study stopped for efficacy at the third interim analysis. Unadjusted P values from the first three analyses are shown in Table 1. The P values are based on statistic Eq. 1, but they are not adjusted for the interim analyses.

Design and results of the study are published (6); we note that the results described in this article are different because they are based on an algorithmic determination of tumor response that was used to inform the US product label, whereas the results in Knauf et al. are based on assessments of an independent clinical data review committee. Moreover, the analysis in Knauf et al. incorporates additional data points that became available after the third interim analysis.

Submitted for publication: July 2, 2009
Accepted for publication: December 9, 2009
ISSUES WITH STATISTICAL INFERENCE

Statistical Significance. Since the nominal two-sided P values at the third analysis are lower than 0.016, significance at the two-sided 5% level can be claimed.

Adjusted P Value. Rather than only reporting whether the result is significant, it is desirable to report the P value as well, to indicate the strength of evidence (7). In traditional GSDs, correctly adjusted P values can be readily calculated with standard software once one agrees on how to order the sample space (8). In our aGSD, there is no clearly defined sample space to begin with, as no rigid rules for calculating the next stage size are required. Hence, the algorithms from GSDs cannot be applied.

Point Estimate. It would also be desirable to report a point estimate for the effect size. Simple estimates that ignore the group sequential nature of the data—so-called naive estimates—are biased; this has been known since the early days of such designs (9). Various methods for bias adjustment exist (10) and have been implemented in standard software for traditional GSDs. As Coburger and Wassmer (11) point out, the aGSD contributes an additional complication: the naive estimator is not a function of the test statistic S_K, and therefore it is not consistent with it: cases can be constructed where S_K would indicate maximum support for H_0 whereas the naive estimator would yield a value different from zero. This will not be the case for an estimator that is f(S_K), that is, a function of S_K. Another advantage of f(S_K) is that under certain assumptions, the density of S_K is known, and therefore the density of f(S_K) will also be known, rendering it possible, in principle, to calculate its bias.

Confidence Interval (CI). Finally, confidence intervals are needed for complete statistical inference. For traditional GSDs, "tight" CIs can be developed (3, chapter 8); these require a rigidly defined sample space for which the densities can be calculated for different alternative hypotheses. In the aGSD, the sample space is less rigidly defined, and therefore these methods cannot be used.
METHODS

ADJUSTED P VALUE

The statistical analysis plan (SAP) had suggested calculating the P value based on stagewise ordering of the sample space (3, section 8.4). Upon data monitoring board (DMB) recommendations, the trial continued despite early crossing of the efficacy boundary, which is not consistent with the proposed method, as it assumes strict adherence to stopping rules.

Wright's (12) concept of a P value does not require ordering an ill-defined sample space. The P value is understood as the answer to the question: What is the smallest level of significance α′ for which the given result would have been significant? It can be shown that the answer to this question is indeed a P value (appendix 1). Wassmer (13) proposes the same approach specifically for the aGSD and calls it "overall P value."
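For a Pocock design with equally weighted looks, this overall P value can be approximated by Monte Carlo: simulate null paths of the statistic in Eq. 1 and ask how often the minimum nominal P value over all K planned looks undercuts the observed one. The sketch below is an illustration with an assumed observed minimum p; it does not reproduce the trial's actual calculation.

```python
# Monte Carlo sketch of the Wright/Wassmer "overall P value" for a
# Pocock design with K equally weighted looks. Under H0 the stage z's
# are iid standard normal, and the look-k statistic is the cumulative
# sum divided by sqrt(k) (Eq. 1).
import numpy as np
from statistics import NormalDist

def wright_overall_p(p_min, K=5, n_sim=200_000, seed=1):
    """P_H0(minimum nominal two-sided p over K looks <= p_min)."""
    z_obs = NormalDist().inv_cdf(1 - p_min / 2)
    rng = np.random.default_rng(seed)
    incr = rng.standard_normal((n_sim, K))           # stage z values
    paths = np.cumsum(incr, axis=1) / np.sqrt(np.arange(1, K + 1))
    return float(np.mean(np.abs(paths).max(axis=1) >= z_obs))

# Sanity check: at the Pocock nominal boundary p_min = 0.016 the
# overall P value should come out near the design alpha of 5%.
print(wright_overall_p(0.016))
```

The same machinery, run with the observed minimum nominal p instead of the boundary value, yields the adjusted P value actually reported.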
In our study, the unadjusted P values were

two response rates. As z_k, we used the unsquared version of the familiar Mantel-Haenszel statistic Q_MH, with Binet stage as the stratification factor. It can be shown (appendix 2) that this z_k can be decomposed into w_k · ϑ̂, where ϑ̂ is a weighted average of the observed within-stratum probability differences, and the weights are functions of the margins of the 2 × 2 tables involved.
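A minimal sketch of this unsquared stratified Mantel-Haenszel z for one study stage, following the formulas in appendix 2; the 2 × 2 tables below are invented for illustration.

```python
# Unsquared stratified Mantel-Haenszel z statistic for one stage.
# Each stratum is a 2x2 table given as
# (responders_arm1, n_arm1, responders_arm2, n_arm2).
import math

def unsquared_mh_z(strata):
    num, var = 0.0, 0.0
    for r1, n1, r2, n2 in strata:
        n = n1 + n2
        r = r1 + r2                 # responders in the stratum
        f = n - r                   # failures (non-responders)
        w = n1 * n2 / n             # Mantel-Haenszel weight
        num += w * (r1 / n1 - r2 / n2)
        var += n1 * n2 * r * f / (n**2 * (n - 1))
    return num / math.sqrt(var)

# Hypothetical stage data: Binet stage B and stage C strata.
z_k = unsquared_mh_z([(30, 45, 15, 44), (20, 35, 12, 36)])
print(f"z_k = {z_k:.3f}")
```

The numerator is the weighted average of within-stratum probability differences described above, so z_k is indeed of the form w_k · ϑ̂.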
STRATIFIED TIME-TO-EVENT DATA

The independence of the k increments in the log-rank statistic has been established by Tsiatis (16), and Wassmer (13) showed that this applies to the aGSD as well. Jahn-Eimermacher et al. (17) showed how data arising from an aGSD can be decomposed correctly into these independent increments, each of which can be analyzed by an unsquared log-rank test, yielding z_k. In our case, z_k is the kth increment of the familiar log-rank statistic, with Binet stage as the stratification factor. It can be shown that this z_k can be decomposed into w_k · ϑ̂, where the estimand is the logarithm of the hazard ratio:

z_k = U_k/SD(U_k) = [U_k/SD(U_k)²] · SD(U_k)

where U_k is the sum of the differences between observed and expected deaths in study stage k, determined within each stratum and summed over both strata, and SD(U_k) is its standard deviation. The expression in square brackets is an estimator of θ_k (18, p. 96). Therefore, our weight w_k is simply SD(U_k).
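For a single stratum, U and its variance V = SD(U)² can be computed from first principles. The sketch below implements the usual observed-minus-expected log-rank quantities on invented data; in the trial the two strata would be computed separately and summed, as described above.

```python
# Minimal single-stratum sketch of the unsquared log-rank quantities:
# U (observed minus expected events in group 1) and its variance V.
# U/V approximates the log hazard ratio, and the stage weight is
# w_k = SD(U_k) = sqrt(V). Data below are invented for illustration.
import math

def logrank_u_v(times, events, group):
    """times: event/censoring times; events: 1=event; group: 1 or 2."""
    U, V = 0.0, 0.0
    for t in sorted({tt for tt, e in zip(times, events) if e}):
        at_risk = [(g, e, tt) for tt, e, g in zip(times, events, group) if tt >= t]
        n = len(at_risk)
        n1 = sum(1 for g, _, _ in at_risk if g == 1)
        d = sum(1 for g, e, tt in at_risk if tt == t and e)
        d1 = sum(1 for g, e, tt in at_risk if tt == t and e and g == 1)
        U += d1 - d * n1 / n                       # observed - expected
        if n > 1:                                  # hypergeometric variance
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return U, V

times  = [5, 8, 9, 12, 13, 18, 23, 28]
events = [1, 1, 1, 0, 1, 1, 0, 1]
group  = [1, 2, 1, 1, 2, 2, 1, 2]
U, V = logrank_u_v(times, events, group)
print(f"z = {U / math.sqrt(V):.3f}, log-HR estimate ~ {U / V:.3f}")
```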
BIAS

The possibility of early stopping will induce bias into the estimator (Eq. 1), and the potential size of this bias still needs to be investigated. One approach is simulation, but this requires the assumption of a formal decision rule that governs the determination of the next sample size n_{k+1}. However, this approach does not reflect the nature of this design, which does not require any prespecified decision rule for its validity. Coburger and Wassmer (11) therefore proposed an alternative approach that is based on conditioning, both on stopping time (K = 3 in our case) and on the actual stage sample sizes obtained. There may be philosophical questions about this level of conditioning, but from a practical point of view, the authors demonstrated that estimators do indeed improve when adjusted for bias thus derived. We will use both approaches—simulation, and integration based on Coburger and Wassmer's conditioning—in an effort to get an idea of how important the bias issue might be.
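To illustrate the simulation approach, the sketch below assumes a deliberately simple, hypothetical two-stage rule (double the first-stage size if the trial continues) and estimates the unconditional bias of an f(S_K)-type estimator for a normal mean; none of these settings are the trial's.

```python
# Simulation sketch of estimator bias under early stopping. The
# second-stage rule, stage sizes, and normal-mean setting are all
# hypothetical assumptions, made only to have a rigid rule to simulate.
import numpy as np

rng = np.random.default_rng(7)
CRIT = 2.41                        # Pocock-style critical z

def simulate_bias(theta, n1=50, n_sim=20_000):
    est = []
    for _ in range(n_sim):
        x1 = rng.normal(theta, 1, n1).mean()
        z1 = x1 * np.sqrt(n1)
        if abs(z1) > CRIT:         # stop for efficacy at the first look
            est.append(x1)
            continue
        n2 = max(20, min(300, n1 * 2))   # hypothetical adaptive rule
        x2 = rng.normal(theta, 1, n2).mean()
        z2 = x2 * np.sqrt(n2)
        s2 = (z1 + z2) / np.sqrt(2)      # Eq. 1 with K = 2
        # map S_K back to the mean scale: an estimator that is f(S_K)
        est.append(s2 / ((np.sqrt(n1) + np.sqrt(n2)) / np.sqrt(2)))
    return float(np.mean(est) - theta)

print(f"bias at theta = 0.2: {simulate_bias(0.2):+.4f}")
```

By symmetry the bias vanishes at θ = 0; for moderate effects the possibility of stopping early on a lucky first stage pushes the average estimate upward, mirroring the pattern in Tables 3 and 4.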
CONFIDENCE INTERVAL

Lehmacher and Wassmer (5) propose using the repeated confidence interval (rCI) (3, chapter 9) for the aGSD, and they give the formula for the case of a difference of two means. In general, the rCI is defined as

{ϑ | abs[S_K(ϑ)] < c_K(α)},

where

S_K(ϑ) := (1/√K) ∑ w_k (ϑ̂_k − ϑ)

and c_K(α) is the adjusted one-sided critical z value, in our case Φ⁻¹(1 − 0.016/2) ≈ 2.41. Conceptually, this is the set of all hypothetical parameter values that cannot be rejected at the adjusted level α, given the data. The SAP preplanned the use of rCIs, without specifying the computational details of how those could be derived. It turns out that they can be obtained by solving S_K(ϑ) = (1/√K) ∑ w_k (ϑ̂_k − ϑ) = ±c_K(α) for ϑ. If one considers the case of the difference of two means, w_k = √n_k (σ√2)⁻¹ as pointed out above, and this leads to the expression for the CI that is given in Lehmacher and Wassmer (5). For the stratified binomial and survival case, one simply needs to plug in the expressions for w_k that apply to those cases, as stated above.
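Since S_K(ϑ) is linear in ϑ, solving S_K(ϑ) = ±c_K(α) gives the closed form ϑ = (∑ w_k ϑ̂_k ∓ c_K(α)√K)/∑ w_k. A short sketch with invented stage estimates and weights:

```python
# Closed-form repeated confidence interval (rCI) bounds obtained by
# solving S_K(theta) = +/- c_K(alpha) for theta. The stage estimates
# and weights below are invented illustrations, not the trial's values.
import math

def repeated_ci(estimates, weights, crit=2.41):
    K = len(estimates)
    num = sum(w * e for w, e in zip(weights, estimates))
    half = crit * math.sqrt(K)
    lo = (num - half) / sum(weights)
    hi = (num + half) / sum(weights)
    return lo, hi

lo, hi = repeated_ci([0.36, 0.30, 0.33], [6.2, 5.5, 7.1], crit=2.41)
print(f"rCI: ({lo:.3f}, {hi:.3f})")
```

The interval is centered at the weighted estimate ∑ w_k ϑ̂_k / ∑ w_k, with a half-width governed by the adjusted critical value, which is why a larger c_K(α) translates directly into a wider CI.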
RESULTS

NUMERICAL RESULTS FOR CASE STUDY

Table 2 shows key results without any adjustments ("simple") versus results that are adjusted based on the methods discussed above. It appears that using an estimate that reflects the segmented nature of the data (Eq. 2 above) changes the point estimate only slightly; the adjustments due to repeated testing change the P value and the width of the confidence intervals considerably.
Table 2. Full statistical inference for case study: tumor response, simple (unadjusted) versus adjusted for the adaptive group sequential design.

99%, if H_1 is true. Therefore, it seemed reasonable to limit the sample size. We use a high cap of n = 350, based on the rationale given in the case study. The trial stops if the calculated sample size for the next segment would bring the total to n > 350. Finally, if the new sample size is less than 20, we set it to 20.
Rule 2 works like rule 1, but the effect size assumption θ is updated based on interim data at each interim analysis. We note that warnings against such a rule have been pronounced, as sample size reassessments based on unreliable interim estimates may increase expected sample sizes (19). On the other hand, our procedure here is checked and balanced by the sample size cap, which amounts to an implicit futility rule.
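A sketch of how such a recalculation rule with the floor of 20 and the n = 350 cap might look. The power target and the normal-approximation sample size formula are assumptions made for illustration; rule 2 is shown simply as the same formula fed with a hypothetical interim estimate.

```python
# Hypothetical sample size recalculation with the floor (20) and cap
# (350) described above. The 80% power target and the two-proportion
# normal-approximation formula are illustrative assumptions.
import math
from statistics import NormalDist

def next_stage_size(p_ctrl, p_trt, n_so_far, alpha=0.016, power=0.80,
                    cap=350, floor=20):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    pbar = (p_ctrl + p_trt) / 2
    n_per_arm = (z_a + z_b)**2 * 2 * pbar * (1 - pbar) / (p_trt - p_ctrl)**2
    n_next = max(floor, math.ceil(2 * n_per_arm) - n_so_far)
    if n_so_far + n_next > cap:
        return None                 # implicit futility stop at the cap
    return n_next

# Rule 1: the planning assumption of 30% vs 60% response stays fixed.
print(next_stage_size(0.30, 0.60, n_so_far=80))
# Rule 2: the same formula fed with a hypothetical interim estimate;
# a small observed effect blows the total past the cap and stops the trial.
print(next_stage_size(0.30, 0.42, n_so_far=80))
```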
Each simulation scenario was defined by a true effect size, varying between 0 and 0.40, and a true stratum effect of 0% versus 10% for Binet stage C versus B. The control arm success rate was kept at 0.30. For each scenario, 10,000 simulations were run, and rCIs and the bias were calculated. The results are shown in Table 3 for rule 1 and Table 4 for rule 2. Results are for zero stratum effect; the results with the stratum effect are not shown because they were almost identical.

Tables 3 and 4 indicate that the unconditional bias is negligible from a practical point of view; in only one scenario did the mean of the estimate (Eq. 1) overestimate the true difference by more than 0.02. With standard deviations of less than 0.1, and 10,000 simulations, the Monte Carlo standard error of the mean for the bias is around 0.001.
The results also show that the rCIs are conservative, as expected. Further simulation work reveals that for both rules, confidence limits based on adjusted z values of Φ⁻¹(1 − 0.03/2) rather than Φ⁻¹(1 − 0.016/2) would yield the desired coverage of 95%. If we used this critical z value for Table 2, the confidence interval for tumor response would shrink from 0.17–0.47 to 0.19–0.45. Therefore, the cost of flexibility (by not rigidly sticking to the rule) is a widening of the CI by two percentage points in either direction.

It may be of interest that rule 2 spends considerably fewer patients than rule 1 when H_0 is true. The reason is that the recalculated sample size, based on an observed low effect size, tends to exceed the sample size cap, which leads to early stopping.
Table 3. Simulation results for probability difference, based on rule 1

True Treatment Effect θ | Average Sample Size per Group | Power (%) | rCI Coverage (%) | Bias ϑ̂ − θ
0.00 | 114 | 2.1 | 97 | −0.019
0.10 | 124 | 34 | 98 | 0.005
0.20 | 89 | 89 | 98 | 0.030
0.30 | 56 | >99 | 99 | 0.022
0.40 | 43 | 100 | 99 | 0.007
Assumptions: see text.

Based on the history of the trial, it is clear
that rigid statistical decision rules were not followed. In the absence of any clearly defined sample size recalculation rules, bias cannot be studied by simulation. The only other approach known to us is the double-conditioning approach by Coburger and Wassmer (11): bias is calculated based on recursive integration (20), after conditioning on past history, without speculating about the future. This means both restricting the sample space to all paths that stop for efficacy at k = 3, and "freezing" the first three sample sizes at the values that were actually used. Figure 1 shows this conditional bias; it is much more severe than the unconditional bias shown in Tables 3 and 4. The direction of the bias is conservative; it indicates that the estimator is likely to substantially underestimate the more extreme effect sizes.
Table 4. Simulation results for probability difference, based on rule 2

True Treatment Effect θ | Average Sample Size per Group | Power (%) | rCI Coverage (%) | Bias ϑ̂ − θ
0.00 | 44 | 1.7 | 98 | −0.006
0.10 | 55 | 16 | 98 | −0.012
0.20 | 64 | 58 | 99 | −0.001
0.30 | 56 | 89 | 99 | 0.0108
0.40 | 44 | >99 | 99 | 0.0064
Assumptions: see text.

One could make the point that the boundaries that were actually in effect at the first two interim analyses were much wider than claimed by Pocock's procedure, because the trial continued despite P values of
place, making one's own drug in development more worthwhile to pursue even at more conservative effect size assumptions. Or another important trial may mature, contributing valuable effect size information that one might want to use to update the power calculations of the ongoing trial. Or the emerging adverse event profile of the new drug is harsher than anticipated, which would increase the effect size deemed necessary to counterbalance the patients' risk.
On the other hand, this flexibility is a bane for full statistical inference. The openness in trial conduct has to be paid for by wider confidence intervals, and bias issues become intractable. Will those difficulties be resolved as statistical science advances, or are they of a more fundamental nature? This is an area of active research, and progress is being made. For instance, very recently, a method was developed that yields exact confidence bounds and median unbiased estimates when adaptive changes are restricted to the penultimate stage (22). On the other hand, for a different type of adaptive design, it was shown convincingly that full statistical inference is not possible in principle (23). We believe that for the aGSD, the difficulties may be of this more fundamental nature.

If one is willing to commit to rigid, prespecified
sample size adjustment rules, these difficulties disappear, and full statistical inference becomes
possible. However, in this case, it is usually possible to construct a traditional GSD that is at least as powerful as the aGSD (24–26). In general, the aGSD should be less efficient than a traditional GSD because its predetermined weights will not be optimal. In our case of equally weighted stages, this would be the case if the segments are very unequal in terms of amount of information. If, as in this trial, the next interim analysis is scheduled without specifying a target number of PFS events, such an inequality is likely to happen for the time-to-event endpoint.

For our case study, the corresponding traditional GSD would be the nonadaptive Pocock design. This design would yield powers of 2.5%, 40%, 95%, >99%, and >99% for the values of θ from Tables 3 and 4, at average sample sizes of 171, 149, 89, 53, and 40 per group. As can be seen, for θ of 0.2 or higher, the nonadaptive design achieves higher power with fewer patients. On the other hand, if there is no effect, the nonadaptive design spends considerably more patients than the aGSD with rule 1 or rule 2. Most likely, this is due to the implicit futility rule that we incorporated in rule 1 and rule 2.

Figure 1. Conditional bias for estimates of probability differences, based on recursive integration (bias θ′ − θ versus true θ, for planned and apparent boundaries).

To decide which design is better overall, one faces the problem of comparing two different approaches with different operating characteristics regarding both expected sample size and power. One possible solution can be found in Bayesian decision theory (27, p. 204); the necessary loss function could be constructed by assigning monetary values to patients enrolled, as well as to the type II error. This could be the subject of future investigations.
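As a toy version of such a decision-theoretic comparison, one can attach monetary losses to enrolled patients and to type II errors and average over a prior on θ. The operating characteristics below are taken from Table 3 and the nonadaptive Pocock figures quoted above; the prior and the costs are invented placeholders.

```python
# Toy decision-theoretic comparison of two designs by expected loss
# under a prior on theta. Prior weights and costs are invented; the
# operating characteristics come from Table 3 (aGSD, rule 1) and the
# nonadaptive Pocock figures quoted in the text.
def expected_loss(thetas, avg_n, power, prior, cost_pt=1.0, cost_t2=500.0):
    total = 0.0
    for th, n, pw, pr in zip(thetas, avg_n, power, prior):
        miss = (1 - pw) if th > 0 else 0.0   # type II loss only if an effect exists
        total += pr * (2 * n * cost_pt + cost_t2 * miss)
    return total

thetas = [0.0, 0.1, 0.2, 0.3, 0.4]
prior  = [0.2, 0.2, 0.3, 0.2, 0.1]           # hypothetical prior on theta
agsd = expected_loss(thetas, [114, 124, 89, 56, 43],
                     [0.021, 0.34, 0.89, 0.995, 1.00], prior)
gsd  = expected_loss(thetas, [171, 149, 89, 53, 40],
                     [0.025, 0.40, 0.95, 0.995, 0.995], prior)
print(f"expected loss, aGSD rule 1: {agsd:.1f}; nonadaptive GSD: {gsd:.1f}")
```

Which design "wins" depends entirely on the prior and the cost ratio, which is exactly why the comparison has to be framed as a decision problem rather than read off a single operating characteristic.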
Some of the advantages of the classical GSD hinge on strict adherence to predetermined decision rules. In the typical DMB-driven clinical trial, this may turn out to be an illusion, which should lead to some of the same issues with statistical inference as described above for the aGSD. Indeed, this was the motivation behind the development of the rCI (3, p. 189). In many cases, the DMBs are fully unblinded to interim results, which may inform their recommendation of the timing of the next interim analysis, as well as effect size assumptions and target power (for an example, see Ref. 28). This may impact not only the statistical inference, but also the validity of the traditional GSD—a much more severe problem. Granted, in the scenarios investigated for results-driven timing of interim analyses for traditional GSDs, the impact on α was small (29). Nevertheless, it may seem cleaner to use a design that fully "legalizes" such "crimes" (and regulators ought to encourage it)—in particular for open-label trials such as our case study.
CONCLUSION

There is a trade-off between flexibility in trial conduct and accuracy of statistical inference. Generally, the flexible aGSD will lead to wider confidence intervals. In addition, the openness of the trial causes theoretical difficulties with some aspects of statistical inference (in particular, bias) that are not all resolved. There are cases when this trade-off may favor the flexible approach—in particular, when the trial is a randomized, open-label trial, and/or when the size of a worthwhile effect depends on future developments.
Acknowledgments—The author is indebted to Dr. C. K. Chang for some early suggestions. The author also owes thanks to two anonymous reviewers for their encouragement and thorough questioning.
a P P E N D i x 1<br />
P R o o F t H a t t H E P V a L U E b a s E D o N W R i g H t ( 1 2 ) i s U N i F o R M Ly<br />
D i s t R i b U t E D F o R P o c o c k ’ s D E s i g N<br />
Pocock's procedure defines P_crit = f(α) such that the probability (under H_0) that the smallest observed P value is lower than P_crit is α. For this case, Wright's adjusted P value is α′ = f⁻¹(P_min), where P_min is the minimum P value actually observed, and α′ is the probability, under H_0, of observing a minimum P value as small as or smaller than P_min.

α′ is a P value because it follows the uniform (0, 1) distribution under H_0: the function f⁻¹(·) is the probability integral transformation of the null distribution of P_min; that is, f⁻¹(·) gives the probability, under H_0, of obtaining a minimum P value as small as or smaller than the observed minimum P value. Writing upper case for the random variables, prob(P_MIN ≤ P_min) = prob(A′ ≤ α′) = α′, which defines the uniform distribution.
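The probability-integral-transformation argument can be checked by simulation. The sketch below is illustrative only: the two-look design with equal information increments and all function names are my assumptions, not part of the original analysis. It approximates f⁻¹ by the empirical null CDF of P_min and confirms that the resulting adjusted P values are close to uniform under H_0.

```python
import bisect
import math
import random

def two_sided_p(z):
    """Two-sided P value of a standard normal test statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def simulate_pmin(n_trials, n_looks=2, seed=1):
    """Simulate the minimum two-sided P value across equally spaced
    interim looks of a group sequential trial under H_0."""
    rng = random.Random(seed)
    pmins = []
    for _ in range(n_trials):
        score, pmin = 0.0, 1.0
        for k in range(1, n_looks + 1):
            score += rng.gauss(0.0, 1.0)   # independent increment
            z = score / math.sqrt(k)       # standardized statistic at look k
            pmin = min(pmin, two_sided_p(z))
        pmins.append(pmin)
    return sorted(pmins)

def adjusted_p(pmin, null_pmins):
    """Wright-style adjusted P value: the null CDF of P_min evaluated
    at the observed minimum (empirical probability integral transform)."""
    return bisect.bisect_right(null_pmins, pmin) / len(null_pmins)

# Null reference distribution of P_min, then an independent batch of
# simulated null trials whose adjusted P values should be ~uniform(0, 1).
null_ref = simulate_pmin(20000, seed=1)
adj = [adjusted_p(p, null_ref) for p in simulate_pmin(20000, seed=2)]

frac_below_10 = sum(a <= 0.10 for a in adj) / len(adj)
frac_below_50 = sum(a <= 0.50 for a in adj) / len(adj)
print(frac_below_10, frac_below_50)
```

Under uniformity the two printed fractions settle near 0.10 and 0.50 as the number of simulated trials grows, which is exactly the defining property proved above.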
APPENDIX 2
DECOMPOSITION OF THE STRATIFIED UNSQUARED MH STATISTIC
The unsquared version Z_MH of the Mantel-Haenszel statistic Q_MH can be shown to be a weighted average of the estimator ϑ̂ = (π̂_1 − π̂_2) (30). For stage k,

Z_MH.k = [w_Bk (π̂_B1k − π̂_B2k) + w_Ck (π̂_C1k − π̂_C2k)] / σ_k,

where w_Bk is the Mantel-Haenszel weight (n_B1k n_B2k)/(n_B1k + n_B2k), and the first index designates the stratum (B for Binet stage B, C for Binet stage C). σ_k is a function of the four margins of the 2 × 2 table:

σ_k² = (n_B1k n_B2k r_B.k f_B.k)/[n_B² (n_B − 1)] + (n_C1k n_C2k r_C.k f_C.k)/[n_C² (n_C − 1)],

where r and f indicate the numbers of responding and failing (nonresponding) patients.

This Z statistic can be represented as the product of an estimator and a weight, as desired:

Z_MH.k = [w_Bk/(w_Bk + w_Ck) ϑ̂_Bk + w_Ck/(w_Bk + w_Ck) ϑ̂_Ck] · (w_Bk + w_Ck)/σ_k = ϑ̂_weighted.k · (w_Bk + w_Ck)/σ_k

NOTES
1. This was done with the SeqTrial function seqDesign (8). Another software package commonly used for such calculations is EAST (14).
2. This decomposition is not trivial; there are cases where it cannot be done. For the Wilcoxon test, this problem was noted before (15).
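The decomposition in Appendix 2 can be verified numerically. The sketch below uses made-up counts for a single stage (the numbers and variable names are illustrative, not study data) and computes the unsquared stratified MH statistic three ways: directly as the weighted sum of rate differences over σ, via the estimator-times-weight decomposition, and via the equivalent observed-minus-expected form.

```python
import math

# Hypothetical counts for one stage, by stratum (Binet B / C) and group:
# n = patients, r = responders; the numbers are illustrative only.
data = {
    "B": {"n1": 40, "r1": 28, "n2": 42, "r2": 19},
    "C": {"n1": 35, "r1": 20, "n2": 33, "r2": 11},
}

weights, diffs = {}, {}
var_sum = 0.0     # sum of hypergeometric variances (the sigma_k^2 formula)
o_minus_e = 0.0   # observed minus expected responders in group 1
for h, d in data.items():
    n = d["n1"] + d["n2"]                # stratum total
    r = d["r1"] + d["r2"]                # responders in the stratum
    f = n - r                            # failures (nonresponders)
    weights[h] = d["n1"] * d["n2"] / n   # Mantel-Haenszel weight
    diffs[h] = d["r1"] / d["n1"] - d["r2"] / d["n2"]  # rate difference
    var_sum += d["n1"] * d["n2"] * r * f / (n * n * (n - 1))
    o_minus_e += d["r1"] - d["n1"] * r / n

sigma = math.sqrt(var_sum)

# Direct form: weighted sum of rate differences divided by sigma
z_direct = sum(weights[h] * diffs[h] for h in data) / sigma

# Decomposition: (weighted-average estimator) * (total weight / sigma)
w_total = sum(weights.values())
theta_weighted = sum(weights[h] * diffs[h] for h in data) / w_total
z_decomposed = theta_weighted * w_total / sigma

# Equivalent observed-minus-expected form of the unsquared MH statistic
z_oe = o_minus_e / sigma

print(z_direct, z_decomposed, z_oe)
```

All three printed values coincide, because w_h(π̂_h1 − π̂_h2) equals the observed-minus-expected count r_h1 − n_h1 r_h./n_h in each stratum.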
REFERENCES
1. Pocock SJ. Clinical Trials: A Practical Approach. New York: Wiley; 1983.
2. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika. 1983;70:659–663.
3. Jennison C, Turnbull BW. Group Sequential Methods With Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 2000.
4. Proschan MA, Hunsberger SA. Designed extension of studies based on conditional power. Biometrics. 1995;51:1315–1324.
5. Lehmacher W, Wassmer G. Adaptive sample size calculations in group sequential trials. Biometrics. 1999;55:1286–1290.
6. Knauf WU, Lissichkov T, Aldaoud A, et al. Phase III randomized study of bendamustine versus chlorambucil in previously untreated patients with chronic lymphocytic leukemia. J Clin Oncol. 2009;27:4378–4384.
7. Fleming TR, Richardson BA. Some design issues in microbicide HIV prevention trials. J Infect Dis. 2004;190:666–674.
8. Insightful. S+ SeqTrial 2 User's Guide. Insightful Corp.; 2002.
9. Whitehead J. On the bias of maximum likelihood estimation following a sequential test. Biometrika. 1986;73:573–581.
10. Emerson SS, Kittelson JM. A computationally simpler algorithm for the UMVUE of a normal mean following a group sequential design. Biometrics. 1997;53:365–369.
11. Coburger S, Wassmer G. Conditional point estimation in adaptive group sequential test designs. Biometrical J. 2001;43:821–833.
12. Wright PS. Adjusted P-values for simultaneous inference. Biometrics. 1992;48:1005–1013.
13. Wassmer G. Planning and analyzing adaptive group sequential survival trials. Biometrical J. 2006;48:714–729.
14. Cytel Inc. East 5 (v5.2). 2008.
15. Lan KKG, Wittes J. The B-value: a tool for monitoring data. Biometrics. 1988;44:579–585.
16. Tsiatis AA. The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika. 1981;68:311–315.
17. Jahn-Eimermacher A, Ingel K. Adaptive trial design: a general methodology for censored time to event data. Contemp Clin Trials. 2009;30:171–177.
18. Marubini E, Valsecchi MG. Analysing Survival Data From Clinical Trials and Observational Studies. Chichester, UK: Wiley; 1995.
19. Bauer P, Koenig F. The reassessment of trial perspectives from interim data—a critical view. Stat Med. 2006;25:23–36.
20. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J R Stat Soc A. 1969;132:235–244.
21. Brannath W, Posch M, Bauer P. Recursive combination tests. J Am Stat Assoc. 2002;97:236–243.
22. Brannath W, Mehta CR, Posch M. Exact confidence bounds following adaptive group sequential tests. Biometrics. 2009;65:539–546.
23. Brannath W, Koenig F, Bauer P. Estimation in confirmatory adaptive designs with treatment selection. Adaptive Trials 2008, Barcelona.
24. Tsiatis AA, Mehta C. On the inefficiency of the adaptive design for monitoring clinical trials. Biometrika. 2003;90:367–387.
25. Jennison C, Turnbull BW. Efficient group sequential designs when there are several effect sizes under consideration. Stat Med. 2006;25:917–932.
26. Jennison C. Critical appraisal of adaptive methods. Adaptive Trials 2008, Barcelona.
27. Lee PM. Bayesian Statistics: An Introduction. 2nd ed. London: Arnold; 1997.
28. Cook TD, Benner RJ, Fisher MR. The WIZARD trial as a case study of flexible clinical trial design. Drug Inf J. 2006;40:345–353.
29. Lan KKG, DeMets DL. Changing frequency of interim analysis in sequential monitoring. Biometrics. 1989;45:1017–1020.
30. Koch GG, Amara IA, Stokes ME, Uryniak TJ. Categorical data analysis. In: Berry DA (ed.), Statistical Methodology in the Pharmaceutical Sciences. New York: Marcel Dekker; 1990.

The author reports no relevant relationships to disclose.