
Lothar T. Tremmel, PhD
Sr. Director and Group Leader, Clinical Oncology, Cephalon Inc, Frazer, Pennsylvania

Key Words: Adaptive design; Group sequential design; Statistical inference

Correspondence Address: Lothar Tremmel, PhD, Cephalon, Inc, 41 Moores Road, Frazer, PA (email: Lothar@tremmel.net).

Drug Information Journal, Vol. 44, pp. 589–598, 2010 • 0092-8615/2010
Printed in the USA. All rights reserved. Copyright © 2010 Drug Information Association, Inc.

BIOSTATISTICS

Statistical Inference After an Adaptive Group Sequential Design: A Case Study

The adaptive group sequential design of Lehmacher and Wassmer allows fully flexible redetermination of sample size after each of a predetermined number of interim looks. Study 02CLLIII, a large, randomized, multicenter trial in chronic lymphocytic leukemia, was based on this approach, with five planned analyses. The study terminated for efficacy at the third interim analysis. While it was clear how statistical significance was to be determined, calculations of proper P values as well as point and interval estimates turned out to be much less straightforward. We extend existing analysis approaches to the stratified binary and time-to-event data from this trial, investigate the properties of the estimators for the primary endpoint, and highlight remaining issues with full statistical inference after such a design. In conclusion, the flexibility offered by the adaptive features renders statistical inference more difficult and less precise. We believe that there are situations where it may be worth paying this price.

INTRODUCTION

THE ADAPTIVE GROUP SEQUENTIAL DESIGN

Group sequential designs (GSDs) that allow for preplanned interim analyses of efficacy have been available for decades. The original designs were based on the concept of equal data increments: data could be analyzed after each of k equal fractions of the data became available (1). As statistical theory evolved, this restriction was relaxed gradually, and it became possible to perform analyses after increments that were unequal, even at time points that were not necessarily preplanned (2). However, one restriction remained firmly in place: interim results of efficacy were not to be used for determining when next to analyze the data (3, section 7.4).

We call designs that eliminate this last constraint adaptive group sequential designs (aGSD). One early and influential example was the design of Proschan and Hunsberger (4), which allows continuation of the trial if one obtains an “almost significant” result. The design that we consider here is the one by Lehmacher and Wassmer (5), with equal stage weights and Pocock’s stopping rule. It is akin to Pocock’s early GSD; the difference is that the size of the next stage can be determined adaptively, that is, based on unblinded review of interim efficacy data. For instance, one could recalculate the sample size needed to maintain the power under a certain effect size assumption. Importantly, rigid rules like that one do not have to be prespecified or followed; the design remains valid (in the sense that it controls the type I error) no matter how the decision regarding the size of the next stage is made. Only two design elements need to be predetermined: the number of interim looks, and the size of the first stage.

The design works by reestablishing Pocock’s equal increment structure for the adaptively determined, unequally sized stages: under H_0, the standardized effect size from each stage k, weighted by the square root of the stage’s sample size, follows an independent standard normal distribution. The test statistic S_K is simply the standardized sum of these independent, standard-normally distributed z values,

S_K = (1/√K) ∑_{k=1..K} Z_k,   (1)

and hence Pocock’s critical z values apply.
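As a concrete illustration, Eq. 1 can be computed after each look and compared with Pocock’s critical value. The stage z values below are made up for illustration (they are not the trial’s data), but they mimic a trial that crosses the boundary at the third look.

```python
import numpy as np
from scipy.stats import norm

def lw_statistic(stage_z):
    """Lehmacher-Wassmer test statistic with equal stage weights (Eq. 1):
    the standardized sum of the per-stage standard normal z values."""
    stage_z = np.asarray(stage_z, dtype=float)
    return stage_z.sum() / np.sqrt(len(stage_z))

# Pocock's critical value for five planned analyses at an overall
# two-tailed alpha of 5%: nominal two-sided p = 0.016 at every look.
crit = norm.isf(0.016 / 2)            # approximately 2.41

stage_z = [1.2, 1.9, 2.4]             # hypothetical z values, stages 1-3
for k in range(1, len(stage_z) + 1):
    S = lw_statistic(stage_z[:k])
    print(k, round(S, 2), abs(S) > crit)
```

Because each stage contributes an independent standard normal z under H_0 regardless of how its sample size was chosen, the same boundary applies no matter how the stage sizes were adapted.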

Submitted for publication: July 2, 2009
Accepted for publication: December 9, 2009

CASE STUDY

Study 02CLLIII was a large, randomized, open-label clinical trial comparing two chemotherapeutic treatments for chronic lymphocytic leukemia (CLL): bendamustine (Treanda) versus chlorambucil. Randomization was stratified by Binet stage (B vs C), which was believed to be an important predictor. Efficacy was measured by two key endpoints: tumor response and progression-free survival (PFS). The primary endpoint, tumor response, was defined as an outcome of partial response or better, based on predetermined assessment criteria. The key secondary endpoint, PFS, was the time from randomization to progression of the disease or death from any cause. A sequential testing strategy, based on testing response rate first, was used to protect the overall type I error.

An aGSD following Lehmacher and Wassmer (5) with equal weighting of the study stages was planned in the protocol. Four interim and one final analysis were planned, and the first analysis was to be done after 80 patients were enrolled. Pocock’s method was chosen for determining the critical boundaries, yielding equal critical P values of 0.016 at each of the five analysis time points, corresponding to a planned overall two-tailed alpha of 5%. A data monitoring committee (DMC) was established to assess the interim results and to determine, at each interim analysis, whether the study should continue. No formal futility rules were planned or implemented.

The protocol also discussed the sample sizes that would be needed for a fixed-sample trial: n = 84 (total) for tumor response, and n = 326 (total) for PFS. This was based on effect sizes of 30% versus 60% response, and a hazard ratio of 1.429, respectively. The first result was used to determine the size of the first study stage (first interim analysis to be done after 80 patients), and the second result formed the basis of a projected maximum sample size of n = 350.

The study stopped for efficacy at the third interim analysis. Unadjusted P values from the first three analyses are shown in Table 1. The P values are based on the statistic in Eq. 1, but they are not adjusted for the interim analyses.

Design and results of the study are published (6); we note that the results described in this article are different because they are based on an algorithmic determination of tumor response that was used to inform the US product label, whereas the results in Knauf et al. are based on the assessments of an independent clinical data review committee. Moreover, the analysis in Knauf et al. incorporates some more data points that became available after the third interim analysis.

ISSUES WITH STATISTICAL INFERENCE

Statistical Significance. Since the nominal two-sided P values at the third analysis are lower than 0.016, significance at the two-sided level of 5% can be claimed.

Adjusted P Value. Rather than only reporting whether the result is significant, it is desirable to report the P value as well, to indicate the strength of evidence (7). In traditional GSDs, correctly adjusted P values can readily be calculated with standard software once one agrees on how to order the sample space (8). In our aGSD, there is no clearly defined sample space to begin with, as no rigid rules for calculating the next stage size are required. Hence, the algorithms from GSDs cannot be applied.

Point Estimate. It would also be desirable to report a point estimate for the effect size. Simple estimates that ignore the group sequential nature of the data (so-called naive estimates) are biased; this has been known since the early days of such designs (9). Various methods for bias adjustment exist (10) and have been implemented in standard software for traditional GSDs. As Coburger and Wassmer (11) point out, the aGSD contributes an additional complication: the naive estimator is not a function of the test statistic S_K, and therefore it is not consistent with it. Cases can be constructed where S_K would indicate maximum support for H_0 whereas the naive estimator would yield a value different from zero. This will not be the case for an estimator that is f(S_K), that is, a function of S_K. Another advantage of f(S_K) is that, under certain assumptions, the density of S_K is known, and therefore the density of f(S_K) will also be known, rendering it possible, in principle, to calculate its bias.

Confidence Interval (CI). Finally, confidence intervals are needed for complete statistical inference. For traditional GSDs, “tight” CIs can be developed (3, chapter 8); these require a rigidly defined sample space for which the densities can be calculated for different alternative hypotheses. In the aGSD, the sample space is less rigidly defined, and therefore these methods cannot be used.

METHODS

ADJUSTED P VALUE

The statistical analysis plan (SAP) had suggested calculating the P value based on stagewise ordering of the sample space (3, section 8.4). Upon data monitoring board (DMB) recommendations, the trial continued despite early crossing of the efficacy boundary; this is not consistent with the proposed method, as that method assumes strict adherence to the stopping rules.

Wright’s (12) concept of a P value does not require ordering an ill-defined sample space. The P value is understood as the answer to the question: What is the smallest level of significance α′ at which the given result would have been significant? It can be shown that the answer to this question is indeed a P value (appendix 1). Wassmer (13) proposes the same approach specifically for the aGSD and calls it the “overall P value.”
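Wright’s overall P value can be approximated by simulating the null distribution of the minimum nominal P value over the planned looks and reading off where the observed minimum falls. The observed value below is a hypothetical placeholder, not the trial’s result.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
K, n_sim = 5, 200_000        # five planned analyses, as in the protocol
p_min_obs = 0.005            # hypothetical observed minimum nominal P value

# Under H0 with equal stage weights, the cumulative statistic after look j
# is S_j = (z_1 + ... + z_j) / sqrt(j) with iid standard normal stage z's.
z = rng.standard_normal((n_sim, K))
S = np.cumsum(z, axis=1) / np.sqrt(np.arange(1, K + 1))
p_min = (2 * norm.sf(np.abs(S))).min(axis=1)   # minimum two-sided nominal p

# Wright's overall P value: the H0 probability of a minimum nominal P
# at least as small as the one observed.
alpha_prime = float((p_min <= p_min_obs).mean())
print(alpha_prime)

# Sanity check: Pocock's boundary p = 0.016 should map back to roughly 5%.
print(float((p_min <= 0.016).mean()))
```

By construction, feeding the boundary value 0.016 into this map recovers the overall alpha of about 5%, which is the defining property used in appendix 1.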

In our study, the unadjusted P values were


two response rates. As z_k, we used the unsquared version of the familiar Mantel-Haenszel statistic Q_MH, with Binet stage as the stratification factor. It can be shown (appendix 2) that this z_k can be decomposed into w_k · ϑ̂, where ϑ̂ is a weighted average of the observed within-stratum probability differences, and the weights are functions of the margins of the 2 × 2 tables involved.
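A minimal sketch of this stage statistic, assuming each stratum is summarized as (responders, n) per arm; the helper name and input convention are illustrative, not the paper’s:

```python
import numpy as np

def unsquared_mh_z(tables):
    """Unsquared stratified Mantel-Haenszel z statistic.

    tables: one (r1, n1, r2, n2) tuple per stratum, giving responders and
    sample size in each arm. Returns the signed z whose square is Q_MH.
    """
    num, var = 0.0, 0.0
    for r1, n1, r2, n2 in tables:
        w = n1 * n2 / (n1 + n2)           # Mantel-Haenszel weight
        num += w * (r1 / n1 - r2 / n2)    # weighted probability difference
        n, r = n1 + n2, r1 + r2
        f = n - r                         # nonresponders ("failures")
        var += n1 * n2 * r * f / (n**2 * (n - 1))
    return num / np.sqrt(var)

# Example with two strata (e.g., Binet B and C); the counts are made up.
z1 = unsquared_mh_z([(15, 20, 5, 20), (12, 20, 8, 20)])
```

The numerator is the sum of w · (π̂_1 − π̂_2) over strata and the variance term matches the margin formula in appendix 2, so squaring this z recovers the familiar Mantel-Haenszel chi-square.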

STRATIFIED TIME-TO-EVENT DATA

The independence of the k increments in the log-rank statistic has been established by Tsiatis (16), and Wassmer (13) showed that this applies to the aGSD as well. Jahn-Eimermacher et al. (17) showed how data arising from an aGSD can be decomposed correctly into these independent increments, each of which can be analyzed by an unsquared log-rank test, yielding z_k. In our case, z_k is the kth increment of the familiar log-rank statistic, with Binet stage as the stratification factor. It can be shown that this z_k can be decomposed into w_k · ϑ̂, where the estimand is the logarithm of the hazard ratio:

z_k = U_k / SD(U_k) = [U_k · SD(U_k)^{−2}] · SD(U_k),

where U_k is the sum of the differences between observed and expected deaths in study stage k, determined within each stratum and summed over both strata, and SD(U_k) is its standard deviation. The expression in square brackets is an estimator of θ_k (18, p. 96). Therefore, our weight w_k is simply SD(U_k).
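In code, the decomposition is one line; U_k and its variance would come from a stratified log-rank computation on the stage-k increment (the helper below simply assumes they are given and is illustrative):

```python
import numpy as np

def stage_loghr_and_weight(U_k, var_U_k):
    """Decompose the stage-k unsquared log-rank z as z_k = w_k * theta_hat_k.

    U_k: observed-minus-expected deaths in stage k, summed over strata;
    var_U_k: its variance. theta_hat_k = U_k / var(U_k) estimates the
    log hazard ratio (ref. 18), and the weight is w_k = SD(U_k).
    """
    theta_hat = U_k / var_U_k
    w = np.sqrt(var_U_k)
    return theta_hat, w

theta_hat, w = stage_loghr_and_weight(5.0, 16.0)   # made-up stage summary
z_k = theta_hat * w                                # equals U_k / SD(U_k)
```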

BIAS

The possibility of early stopping will induce bias into the estimator (Eq. 1), and the potential size of this bias still needs to be investigated. One approach is simulation, but this requires an assumption of a formal decision rule that governs the determination of the next sample size n_{k+1}. However, this approach does not reflect the nature of this design, which does not require any prespecified decision rule for its validity. Coburger and Wassmer (11) therefore proposed an alternative approach that is based on conditioning, both on the stopping time (K = 3 in our case) and on the actual stage sample sizes obtained. There may be philosophical questions about this level of conditioning, but from a practical point of view, the authors demonstrated that estimators do indeed improve when adjusted for the bias thus derived. We will use both approaches (simulation, and integration based on Coburger and Wassmer’s conditioning) in an effort to get an idea of how important the bias issue might be.

CONFIDENCE INTERVAL

Lehmacher and Wassmer (5) propose using the repeated confidence interval (rCI) (3, chapter 9) for the aGSD, and they give the formula for the case of a difference of two means. In general, the rCI is defined as

{ϑ | abs[S_K(ϑ)] < c_K(α)},

where

S_K(ϑ) := (1/√K) ∑_k w_k (ϑ̂_k − ϑ)

and c_K(α) is the adjusted critical z value, in our case Φ^{−1}(1 − 0.016/2) = 2.41. Conceptually, this is the set of all hypothetical parameter values that cannot be rejected at the adjusted level α, given the data. The SAP preplanned the use of rCIs, without specifying the computational details of how they could be derived. It turns out that they can be obtained by solving S_K(ϑ) = (1/√K) ∑_k w_k (ϑ̂_k − ϑ) = ±c_K(α) for ϑ. If one considers the case of the difference of two means, w_k = √n_k (σ√2)^{−1} as pointed out above, and this leads to the expression for the CI that is given in Lehmacher and Wassmer (5). For the stratified binomial and survival cases, one simply needs to plug in the expressions for w_k that apply to those cases, as stated above.
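Since S_K(ϑ) is linear in ϑ, the two solutions have a closed form; a small sketch (the helper name and inputs are illustrative):

```python
import numpy as np

def repeated_ci(stage_estimates, stage_weights, crit=2.41):
    """Repeated confidence interval after K stages.

    Solving S_K(v) = (1/sqrt(K)) * sum(w_k * (est_k - v)) = +/-crit for v
    gives v = (sum(w_k * est_k) -/+ sqrt(K) * crit) / sum(w_k).
    crit is the Pocock-adjusted critical z value (2.41 for this design).
    """
    est = np.asarray(stage_estimates, dtype=float)
    w = np.asarray(stage_weights, dtype=float)
    K = len(w)
    center = np.dot(w, est) / w.sum()     # weighted point estimate, an f(S_K)
    half = np.sqrt(K) * crit / w.sum()
    return center - half, center + half

# Made-up stage estimates and weights for three completed stages.
lo, hi = repeated_ci([0.35, 0.30, 0.32], [8.0, 9.0, 7.5])
```

The same function covers the stratified binomial and survival cases by plugging in the corresponding w_k.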

RESULTS

NUMERICAL RESULTS FOR CASE STUDY

Table 2 shows key results without any adjustments (“simple”) versus results that are adjusted based on the methods discussed above. It appears that using an estimate that reflects the segmented nature of the data (Eq. 2 above) changes the point estimate only slightly; the adjustments due to repeated testing change the P value and the width of the confidence intervals considerably.



[Table 2. Tumor response: simple (unadjusted) versus adjusted for the adaptive group sequential design. Table body not recoverable from the extraction.]

…99%, if H_1 is true. Therefore, it seemed reasonable to limit the sample size. We use a high cap of n = 350, based on the rationale given in the case study. The trial stops if the calculated sample size for the next segment would bring the total to n > 350. Finally, if the new sample size is less than 20, we set it to 20.

Rule 2 works like rule 1, but the effect size assumption θ is updated based on interim data at each interim analysis. We note that warnings against such a rule have been pronounced, as sample size reassessments based on the unreliable interim estimates may increase expected sample sizes (19). On the other hand, our procedure here is checked and balanced by the sample size cap, which amounts to an implicit futility rule.
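A rule of this kind can be mimicked in a small simulation. Everything below (the stage-1 size of 40 per group, the 0.05 floor on the interim effect estimate, the 90% target power, and the normal-approximation sample size formula) is an illustrative assumption rather than the study’s actual algorithm:

```python
import numpy as np
from scipy.stats import norm

def simulate_trial(theta, rng, p_ctrl=0.30, n1=40, cap=350, K=5, crit=2.41):
    """One aGSD run with a rule-2-style sample size recalculation (sketch)."""
    z_list = []
    succ = np.zeros(2)           # pooled successes: [treatment, control]
    tried = np.zeros(2)          # pooled sample sizes per arm
    n_next, n_enrolled = n1, 0
    while len(z_list) < K:
        n = n_next
        n_enrolled += 2 * n
        x_t = rng.binomial(n, min(p_ctrl + theta, 1.0))
        x_c = rng.binomial(n, p_ctrl)
        succ += (x_t, x_c); tried += (n, n)
        pbar = (x_t + x_c) / (2 * n)
        se = max(np.sqrt(2 * pbar * (1 - pbar) / n), 1e-12)
        z_list.append((x_t - x_c) / n / se)       # stage z value
        S = sum(z_list) / np.sqrt(len(z_list))    # Eq. 1, equal weights
        if abs(S) > crit:
            return True, n_enrolled               # stop for efficacy
        if len(z_list) == K:
            break                                 # final look reached
        # rule 2: re-estimate the effect from all data so far, then size
        # the next stage for ~90% power at the per-look 0.016 level
        d_hat = max(abs(succ[0] / tried[0] - succ[1] / tried[1]), 0.05)
        p_pool = succ.sum() / tried.sum()
        n_next = int(np.ceil((crit + norm.ppf(0.9)) ** 2
                             * 2 * p_pool * (1 - p_pool) / d_hat ** 2))
        n_next = max(n_next, 20)
        if n_enrolled + 2 * n_next > cap:         # cap acts as an implicit
            return False, n_enrolled              # futility rule
    return False, n_enrolled

rng = np.random.default_rng(1)
runs = [simulate_trial(0.30, rng) for _ in range(500)]
power = np.mean([stopped for stopped, _ in runs])
```

Averaging the second return value over runs gives the expected sample size, so sketches like this reproduce the kind of operating characteristics reported in Tables 3 and 4, though not their exact numbers.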

[Displaced table caption: Full Statistical Inference for Case Study.]

Each simulation scenario was defined by a true effect size, varying between 0 and 0.40, and a true stratum effect of 0% versus 10% for Binet stage C versus B. The control arm success rate was kept at 0.30. For each scenario, 10,000 simulations were run, and the rCIs and the bias were calculated. The results are shown in Table 3 for rule 1 and in Table 4 for rule 2. Results are for zero stratum effect; the results with the stratum effect are not shown because they were almost identical.

Tables 3 and 4 indicate that the unconditional bias is negligible from a practical point of view; in only one scenario did the mean of the estimate (1) overestimate the true difference by more than 0.02. With standard deviations of less than 0.1 and 10,000 simulations, the Monte Carlo standard error of the mean for the bias is around 0.001.
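The quoted Monte Carlo precision is just the usual standard error of a mean:

```python
import math

# Standard error of the estimated mean bias: per-run standard deviation
# below 0.1, divided by the square root of the 10,000 simulation runs.
mc_se = 0.1 / math.sqrt(10_000)
print(mc_se)   # 0.001
```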

The results also show that the rCIs are conservative, as expected. Further simulation work reveals that, for both rules, confidence limits based on adjusted z values of Φ^{−1}(1 − 0.03/2) rather than Φ^{−1}(1 − 0.016/2) would yield the desired coverage of 95%. If we used this critical z value for Table 2, the confidence interval for tumor response would shrink from 0.17–0.47 to 0.19–0.45. Therefore, the cost of flexibility (of not rigidly sticking to the rule) is a widening of the CI by two percentage points in either direction.

It may be of interest that rule 2 spends considerably fewer patients than rule 1 when H_0 is true. The reason is that the recalculated sample size, based on an observed low effect size, tends to exceed the sample size cap, which leads to early stopping.

Table 3. Simulation results for probability difference, based on rule 1 (assumptions: see text).

True Effect θ   Avg Sample Size per Group   Power (%)   rCI Coverage (%)   Bias ϑ̂ − θ
0.00            114                         2.1         97                 −0.019
0.10            124                         34          98                 0.005
0.20            89                          89          98                 0.030
0.30            56                          >99         99                 0.022
0.40            43                          100         99                 0.007

Table 4. Simulation results for probability difference, based on rule 2 (assumptions: see text).

True Effect θ   Avg Sample Size per Group   Power (%)   rCI Coverage (%)   Bias ϑ̂ − θ
0.00            44                          1.7         98                 −0.006
0.10            55                          16          98                 −0.012
0.20            64                          58          99                 −0.001
0.30            56                          89          99                 0.0108
0.40            44                          >99         99                 0.0064

Based on the history of the trial, it is clear that rigid statistical decision rules were not followed. In the absence of any clearly defined sample size recalculation rules, bias cannot be studied by simulation. The only other approach known to us is the double-conditioning approach of Coburger and Wassmer (11): bias is calculated based on recursive integration (20), after conditioning on past history, without speculating about the future. This means both restricting the sample space to all paths that stop for efficacy at k = 3, and “freezing” the first three sample sizes at the values that were actually used. Figure 1 shows this conditional bias; it is much more severe than the unconditional bias shown in Tables 3 and 4. The direction of the bias is conservative; it indicates that the estimator is likely to substantially underestimate the more extreme effect sizes.

One could make the point that the boundaries that were actually in effect at the first two interim analyses were much wider than claimed by Pocock’s procedure, because the trial continued despite P values of less than 0.016.



…place, making one’s own drug in development more worthwhile to pursue even at more conservative effect size assumptions. Or another important trial may mature, contributing valuable effect size information that one might want to use to update the power calculations of the ongoing trial. Or the emerging adverse event profile of the new drug is harsher than anticipated, which would increase the effect size deemed necessary to counterbalance the patients’ risk.

On the other hand, this flexibility is a bane for full statistical inference. The openness in trial conduct has to be paid for by wider confidence intervals, and bias issues become intractable. Will those difficulties be resolved as statistical science advances, or are they of a more fundamental nature? This is an area of active research, and progress is being made. For instance, very recently, a method was developed that yields exact confidence bounds and median unbiased estimates when adaptive changes are restricted to the penultimate stage (22). On the other hand, for a different type of adaptive design, it was shown convincingly that full statistical inference is not possible in principle (23). We believe that for the aGSD, the difficulties may be of this more fundamental nature.

If one is willing to commit to rigid, prespecified sample size adjustment rules, these difficulties disappear, and full statistical inference becomes possible. However, in this case, it is usually possible to construct a traditional GSD that is at least as powerful as the aGSD (24–26).

In general, the aGSD should be less efficient than a traditional GSD because its predetermined weights will not be optimal. In our case of equally weighted stages, this would be the case if the segments are very unequal in terms of the amount of information. If, as in this trial, the next interim analysis is scheduled without specifying a target number of PFS events, such an inequality is likely to happen for the time-to-event endpoint.

For our case study, the corresponding traditional GSD would be the nonadaptive Pocock design. This design would yield powers of 2.5%, 40%, 95%, >99%, and >99% for the values of θ from Tables 3 and 4, at average sample sizes of 171, 149, 89, 53, and 40 per group. As can be seen, for θ of 0.2 or higher, the nonadaptive design achieves higher power with fewer patients. On the other hand, if there is no effect, the nonadaptive design spends considerably more patients than the aGSD with rule 1 or rule 2. Most likely, this is due to the implicit futility rule that we incorporated in rules 1 and 2.

Figure 1. Conditional bias for the estimate of probability differences, based on recursive integration (curves for planned and apparent boundaries).

To decide which design is better overall, one faces the problem of comparing two different approaches with different operating characteristics regarding both expected sample size and power. One possible solution can be found in Bayesian decision theory (27, p. 204); the necessary loss function could be constructed by assigning monetary values for patients enrolled, as well as for the type II error. This could be the subject of future investigations.

Some of the advantages of the classical GSD hinge on strict adherence to predetermined decision rules. In the typical DMB-driven clinical trial, this may turn out to be an illusion, which should lead to some of the same issues with statistical inference as described above for the aGSD. Indeed, this was the motivation behind the development of the rCI (3, p. 189). In many cases, the DMBs are fully unblinded to interim results, which may inform their recommendation of the timing of the next interim analysis, as well as effect size assumptions and target power (for an example, see Ref. 28). This may impact not only the statistical inference, but also the validity of the traditional GSD, a much more severe problem. Granted, in the scenarios investigated for results-driven timing of interim analyses for traditional GSDs, the impact on α was small (29). Nevertheless, it may seem cleaner to use a design that fully “legalizes” such “crimes” (and regulators ought to encourage it), in particular for open-label trials such as our case study.

CONCLUSION

There is a trade-off between flexibility in trial conduct and accuracy of statistical inference. Generally, the flexible aGSD will lead to wider confidence intervals. In addition, the openness of the trial causes theoretical difficulties with some aspects of statistical inference (in particular, bias) that are not all resolved. There are cases when this trade-off may favor the flexible approach, in particular when the trial is a randomized, open-label trial, and/or when the size of a worthwhile effect depends on future developments.

Acknowledgments: The author is indebted to Dr. C. K. Chang for some early suggestions. The author also owes thanks to two anonymous reviewers for their encouragement and thorough questioning.

a P P E N D i x 1<br />

P R o o F t H a t t H E P V a L U E b a s E D o N W R i g H t ( 1 2 ) i s U N i F o R M Ly<br />

D i s t R i b U t E D F o R P o c o c k ’ s D E s i g N<br />

Pocock's procedure defines P_crit = f(α) such that the probability (under H0) that the smallest observed P value is lower than P_crit is α. For this case, Wright's adjusted P value is α′ = f⁻¹(P_min), where P_min is the minimum P value actually observed, and α′ is the probability, under H0, of observing a minimum P value as small as or smaller than P_min.

α′ is a P value because it follows the uniform (0,1) distribution under H0: the function f⁻¹(·) is the probability integral transformation of the null distribution of P_min; that is, f⁻¹(·) gives the probability, under H0, of obtaining a minimum P value as small as or smaller than the observed minimum P value. Using upper case for the random variables, prob(P_MIN < P_min) = prob(A′ < α′) = α′, which defines the uniform distribution.
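This uniformity argument can also be checked by simulation. The sketch below is illustrative only and is not from the paper: it simulates a five-look design under H0 with standard normal increments, estimates the null distribution f⁻¹(·) of the minimum P value by an empirical CDF, and confirms that the transformed values behave like a uniform (0,1) sample. The look structure and sample sizes are invented; numpy and scipy are assumed available.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
K, n_sim = 5, 20000

# Cumulative z-statistics at the K looks: z_k = S_k / sqrt(k),
# where S_k is a sum of k iid N(0,1) increments (H0 is true).
incr = rng.standard_normal((n_sim, K))
z = np.cumsum(incr, axis=1) / np.sqrt(np.arange(1, K + 1))
p = 2 * norm.sf(np.abs(z))     # two-sided nominal P value at each look
p_min = p.min(axis=1)          # smallest P value over the K looks

# Empirical probability integral transform: f_inv(p) = prob(P_MIN <= p).
# Fit the empirical CDF on one half of the sample, evaluate on the other.
fit, test = p_min[:10000], p_min[10000:]
fit.sort()
alpha_prime = np.searchsorted(fit, test, side="right") / fit.size

# Under H0, alpha_prime should be approximately Uniform(0,1).
print(alpha_prime.mean(), np.quantile(alpha_prime, [0.25, 0.75]))
```

With a Pocock boundary, f⁻¹(P_crit) evaluates to α by construction; here the empirical CDF plays the role of f⁻¹ without requiring the boundary in closed form.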

APPENDIX 2

DECOMPOSITION OF THE STRATIFIED UNSQUARED MH STATISTIC

The unsquared version Z_MH of the Mantel-Haenszel statistic Q_MH can be shown to be a weighted average of the estimator ϑ̂ = (π̂_1 − π̂_2) (30). For stage k, Z_MH.k = (w_Bk (π̂_B1k − π̂_B2k) + w_Ck (π̂_C1k − π̂_C2k))/σ_k, where w_Bk is the Mantel-Haenszel weight (n_B1k · n_B2k)/(n_B1k + n_B2k), and the first index designates the stratum (B for Binet stage B, and C for Binet stage C). σ_k is a function of the four margins of the 2 × 2 table:


<strong>Inference</strong> <strong>After</strong> <strong>Adaptive</strong> GSD b i o s t a t i s t i c s 597<br />

Drug Information Journal<br />

σ_k² = (n_B1k n_B2k r_B.k f_B.k)/[n_B² (n_B − 1)] + (n_C1k n_C2k r_C.k f_C.k)/[n_C² (n_C − 1)]

where r and f indicate the number of responding and failing (not responding) patients.

This Z statistic can be represented as a product of an estimator and a weight, as desired:

Z_MH.k = [w_Bk/(w_Bk + w_Ck) ϑ̂_Bk + w_Ck/(w_Bk + w_Ck) ϑ̂_Ck] · (w_Bk + w_Ck)/σ_k = ϑ̂_weighted.k · (w_Bk + w_Ck)/σ_k

NOTES

1. This was done with SeqTrial function seqDesign (8). Another software package commonly used for such calculations is EAST (14).
2. This decomposition is not trivial; there are cases where it cannot be done. For the Wilcoxon test, this problem was noted before (15).
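As a numerical check of the decomposition in Appendix 2, the sketch below (with invented stage-k counts, not data from the study) computes the stratified unsquared MH statistic directly and again as the weighted-average estimator times the weight (w_Bk + w_Ck)/σ_k; the two agree exactly.

```python
import math

# Invented stage-k counts per stratum: (n1, n2, responders arm 1, responders arm 2)
strata = {"B": (40, 38, 25, 14), "C": (30, 33, 16, 9)}

num = w_sum = var = 0.0
theta_hat, weights = {}, {}
for s, (n1, n2, r1, r2) in strata.items():
    n, r = n1 + n2, r1 + r2
    f = n - r                                    # failing (not responding) patients
    w = n1 * n2 / n                              # Mantel-Haenszel weight w_sk
    theta = r1 / n1 - r2 / n2                    # rate-difference estimate for stratum s
    num += w * theta
    w_sum += w
    var += n1 * n2 * r * f / (n ** 2 * (n - 1))  # margin-based variance term
    theta_hat[s], weights[s] = theta, w

sigma = math.sqrt(var)
z_direct = num / sigma                           # Z_MH.k computed directly

# Decomposition: weighted-average estimator times (w_Bk + w_Ck)/sigma_k
theta_weighted = sum(weights[s] / w_sum * theta_hat[s] for s in strata)
z_decomposed = theta_weighted * w_sum / sigma
```

The equality holds by construction, since multiplying the weighted average by (w_Bk + w_Ck) restores the numerator sum w_Bk ϑ̂_Bk + w_Ck ϑ̂_Ck.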

REFERENCES

1. Pocock SJ. Clinical Trials: A Practical Approach. New York: Wiley; 1983.
2. Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika. 1983;70:659–663.
3. Jennison C, Turnbull BW. Group Sequential Methods With Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC; 2000.
4. Proschan MA, Hunsberger SA. Designed extension of studies based on conditional power. Biometrics. 1995;51:1315–1324.
5. Lehmacher W, Wassmer G. Adaptive sample size calculations in group sequential trials. Biometrics. 1999;55:1286–1290.
6. Knauf WU, Lissichkov T, Aldaoud A, et al. Phase III randomized study of bendamustine versus chlorambucil in previously untreated patients with chronic lymphocytic leukemia. J Clin Oncol. 2009;27:4378–4384.
7. Fleming TR, Richardson BA. Some design issues in microbicide HIV prevention trials. J Infect Dis. 2004;190:666–674.
8. Insightful. S+ SeqTrial 2 User's Guide. Insightful Corp.; 2002.
9. Whitehead J. On the bias of maximum likelihood estimation following a sequential test. Biometrika. 1986;73:573–581.
10. Emerson SS, Kittelson JM. A computationally simpler algorithm for the UMVUE of a normal mean following a group sequential design. Biometrics. 1997;53:365–369.
11. Coburger S, Wassmer G. Conditional point estimation in adaptive group sequential test designs. Biometric J. 2001;43:821–833.
12. Wright SP. Adjusted P-values for simultaneous inference. Biometrics. 1992;48:1005–1013.
13. Wassmer G. Planning and analyzing adaptive group sequential survival trials. Biometric J. 2006;48:714–729.
14. Cytel Inc. East 5 (v5.2). 2008.
15. Lan KK, Wittes J. The B-value: a tool for monitoring data. Biometrics. 1988;44:579–585.
16. Tsiatis AA. The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika. 1981;68:311–315.
17. Jahn-Eimermacher A, Ingel K. Adaptive trial design: a general methodology for censored time to event data. Contemp Clin Trials. 2009;30:171–177.
18. Marubini E, Valsecchi MG. Analysing Survival Data From Clinical Trials and Observational Studies. Chichester, UK: Wiley; 1995.
19. Bauer P, Koenig F. The reassessment of trial perspectives from interim data—a critical view. Stat Med. 2006;25:23–36.
20. Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J R Stat Soc A. 1969;132:235–244.
21. Brannath W, Posch M, Bauer P. Recursive combination tests. J Am Stat Assoc. 2002;97:236–243.
22. Brannath W, Mehta CR, Posch M. Exact confidence bounds following adaptive group sequential tests. Biometrics. 2009;65:539–546.
23. Brannath W, Koenig F, Bauer P. Estimation in confirmatory adaptive designs with treatment selection. Adaptive Trials 2008, Barcelona.
24. Tsiatis AA, Mehta C. On the inefficiency of the adaptive design for monitoring clinical trials. Biometrika. 2003;90:367–387.
25. Jennison C, Turnbull BW. Efficient group sequential designs when there are several effect sizes under consideration. Stat Med. 2006;25:917–932.



26. Jennison C. Critical appraisal of adaptive methods. Adaptive Trials 2008, Barcelona.
27. Lee PM. Bayesian Statistics: An Introduction. 2nd ed. London: Arnold; 1997.
28. Cook TD, Benner RJ, Fisher MR. The WIZARD trial as a case study of flexible clinical trial design. Drug Inf J. 2006;40:345–353.

29. Lan KKG, DeMets DL. Changing frequency of interim analysis in sequential monitoring. Biometrics. 1989;45:1017–1020.
30. Koch GG, Amara IA, Stokes ME, Uryniak TJ. Categorical data analysis. In: Berry DA, ed. Statistical Methodology in the Pharmaceutical Sciences. New York: Marcel Dekker; 1990.

The author reports no relevant relationships to disclose.
