J. R. Statist. Soc. B (2004)
66, Part 1, pp. 3–16

Hypothesis testing in mixture regression models

Hong-Tu Zhu and Heping Zhang
Yale University, New Haven, USA

[Received July 2002. Final revision June 2003]
Summary. We establish asymptotic theory for both the maximum likelihood and the maximum modified likelihood estimators in mixture regression models. Moreover, under specific and reasonable conditions, we show that the optimal convergence rate of $n^{1/4}$ for estimating the mixing distribution is achievable for both the maximum likelihood and the maximum modified likelihood estimators. We also derive the asymptotic distributions of two log-likelihood ratio test statistics for testing homogeneity, and we propose a resampling procedure for approximating the p-value. Simulation studies are conducted to investigate the empirical performance of the two test statistics. Finally, two real data sets are analysed to illustrate the application of our theoretical results.
Keywords: Hypothesis testing; Logistic regression; Mixture regression; Poisson regression; Power
1. Introduction

Finite mixture models arise in many applications, particularly in biology, psychology and genetics. Statistical inference and computation based on these models pose a serious challenge; see Titterington et al. (1985), Lindsay (1995) and McLachlan and Peel (2000) for systematic reviews. The purpose of this work is to establish asymptotic theory for mixture regression models originating from those applications.
Suppose that we observe data from $n$ units and that within each unit, say unit $i$, we have $n_i$ measurements, $i = 1, \ldots, n$. This is a typical data structure in longitudinal and family studies. Before introducing the additional notation, let us examine a few examples.
1.1. Example 1: a finite mixture logistic regression model
To study the genetic inheritance pattern of a binary trait such as alcoholism, Zhang and Merikangas (2000) proposed a frailty model in which the data consist of a binary vector response $y_i = (Y_{i1}, \ldots, Y_{i,n_i})^T$ and covariates $X_i$ from the $i$th family, $i = 1, \ldots, n$. To model the potential familial correlation, they introduced a Bernoulli latent variable $U_i$ for each family. Conditionally on all latent variables $\{U_i\}$, the $Y_{ij}$s are assumed to be independent and to follow the logistic regression model
$$\mathrm{logit}\{P(Y_{ij} = 1 \mid U_i)\} = x_{ij}\beta + z_{ij}\{U_i\mu_1 + (1 - U_i)\mu_2\}, \qquad (1)$$
where $x_{ij}$ is a covariate vector in $X_i$ from the $j$th member in the $i$th family, $z_{ij}$ is a part of $x_{ij}$ that interacts with the latent variable and the $\beta$ and the $\mu$s are parameters. The interaction terms
Address for correspondence: Heping Zhang, Department of Epidemiology and Public Health, Yale University School of Medicine, 60 College Street, New Haven, CT 06520-8034, USA.
E-mail: Heping.Zhang@Yale.EDU

© 2004 Royal Statistical Society 1369–7412/04/66003
in model (1) are very important in genetic studies for assessing potential gene–environment interactions.

Beyond this example, finite mixtures of Bernoulli distributions such as model (1) have received much attention in the last five decades; see Teicher (1963) for an early example. More recently, Wang and Puterman (1998), among others, generalized binomial finite mixtures to mixture logistic regression models, and Zhang et al. (2003) applied mixture cumulative logistic models to analyse correlated ordinal responses.
1.2. Example 2: mixture of non-linear hierarchical models
Longitudinal and genetic studies commonly involve a continuous response $\{Y_{ij}\}$, also referred to as a quantitative trait; see, for example, Diggle et al. (2002), Haseman and Elston (1972) and Risch and Zhang (1995). Pauler and Laird (2000) used general finite mixture non-linear hierarchical models to analyse longitudinal data from heterogeneous subpopulations. Specifically, when there are only two subgroups, the model is of the form
$$Y_{ij} = g\{x_{ij}, \beta, U_i z_{ij}\mu_1 + (1 - U_i) z_{ij}\mu_2\} + \varepsilon_{i,j},$$
where the $\varepsilon_{i,j}$s are independent and identically distributed according to $N(0, \sigma^2)$ and $g(\cdot)$ is a prespecified function. Here, the known covariates $x_{ij}$ may contain observed time points to reflect a time course in longitudinal data.
1.3. Example 3: a finite mixture of Poisson regression models
The Poisson distribution and Poisson regression have been widely used to analyse count data (McCullagh and Nelder, 1989), but observed count data often exhibit overdispersion relative to the Poisson model. Finite mixture Poisson regression models (Wang et al., 1996) provide a plausible explanation for overdispersion. Specifically, conditionally on all $U_i$s, the $Y_{ij}$s are independent and follow the Poisson regression model
$$p(Y_{ij} = y_{ij} \mid x_{ij}, U_i) = \frac{1}{y_{ij}!}\,\lambda_{ij}^{y_{ij}} \exp(-\lambda_{ij}), \qquad (2)$$
where $\lambda_{ij} = \exp\{x_{ij}\beta + U_i z_{ij}\mu_1 + (1 - U_i) z_{ij}\mu_2\}$.
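As a concrete illustration (not part of the original analysis), model (2) is straightforward to simulate from: draw the latent Bernoulli indicators and then the counts. The sketch below uses Python with NumPy; the parameter values, the scalar covariate and the choice $z_{ij} = 1$, $n_i = 1$ are our own illustrative assumptions.

```python
import numpy as np

def simulate_mixture_poisson(n, alpha, beta, mu1, mu2, rng):
    """Draw one observation per unit (n_i = 1) from the two-component
    Poisson regression model (2): a latent Bernoulli U_i decides whether
    mu1 or mu2 enters the log-rate.  All parameter values are illustrative."""
    x = rng.uniform(0.0, 1.0, size=n)          # scalar covariate x_i (assumed)
    u = rng.binomial(1, 1.0 - alpha, size=n)   # P(U_i = 1) = 1 - alpha
    lam = np.exp(x * beta + u * mu1 + (1 - u) * mu2)   # z_i = 1: mix on the intercept
    y = rng.poisson(lam)
    return x, u, y

rng = np.random.default_rng(0)
x, u, y = simulate_mixture_poisson(500, 0.3, 0.5, 0.2, 2.0, rng)
```

Counts drawn this way are overdispersed relative to a single Poisson regression, which is exactly the feature that the mixture is meant to explain.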
To summarize the models presented above, we consider a random sample of $n$ independent observations $\{y_i, X_i\}_1^n$ with the density function
$$p_i(y_i, x_i; \omega) = \{(1 - \alpha) f_i(y_i, x_i; \beta, \mu_1) + \alpha f_i(y_i, x_i; \beta, \mu_2)\}\, g_i(x_i), \qquad (3)$$
where $g_i(x_i)$ is the distribution function of $X_i$. Further, $\omega = (\alpha, \beta, \mu_1, \mu_2)$ is the unknown parameter vector, in which $\beta$ ($q_1 \times 1$) measures the strength of association that is contributed by the covariate terms and the two $q_2 \times 1$ vectors, $\mu_1$ and $\mu_2$, represent the different contributions from two different groups.
Equivalently, if we consider $P(U_i = 0) = 1 - P(U_i = 1) = \alpha$, and assume that the conditional density of $y_i$ given $U_i$ is $p_i(y_i \mid U_i) = f_i\{y_i, x_i; \beta, \mu_2(1 - U_i) + \mu_1 U_i\}$, then model (3) is the marginal density of $y_i$. In fact, McCullagh and Nelder (1989) considered a special case in which $f_i$ is from an exponential family distribution, i.e.
$$f_i\{y_i, x_i; \beta, \mu_2(1 - U_i) + \mu_1 U_i\} = \prod_{j=1}^{n_i} \exp[\phi\{y_{ij}\theta_{ij} - a(\theta_{ij})\} + c(y_{ij}, \phi)], \qquad (4)$$
where $\theta_{ij} = h\{x_{ij}, \beta, U_i\mu_1 + (1 - U_i)\mu_2\}$, $h(\cdot)$ is a link function and $\phi$ is a dispersion parameter. This family of mixture regression models is very useful in practice.
Asymptotic theory is critical to the understanding of the behaviour of any model. Existing asymptotic results require, however, that $f_i$ in equation (3) is identical among the observation units. This requirement is too restrictive in longitudinal and family studies. Thus, our aim is to eliminate this restriction by allowing $f_i$ to vary between study subjects or families as a result of study designs or missing data.
Let $P_*$ denote the true model from which the data are generated. The well-known identifiability problem in mixture models implies that there may be a set of parameters that yield $P_*$. We use $\Omega_* = \{\omega_* \in \Omega : P_{\omega_*} = P_*\}$ to represent this set of true model parameters. Here, $\Omega$ is the entire parameter space. The use of an asterisk in the subscript reminds us that the parameter belongs to the true parameter set. In the light of the symmetry for $\alpha$, without loss of generality, we consider only $\alpha \in [0, 0.5]$. The parameter space $\Omega$ is defined as $\Omega = \{\omega : \alpha \in [0, 0.5], \beta \in B, \|\mu_1\| \le M, \|\mu_2\| \le M\}$, where $M$ is a large positive scalar such that $\|\mu_*\| \le M$.
$$\mathrm{LR}_n = \sup_{\omega \in \Omega}\{2L_n(\omega)\} - \sup_{\omega \in \Omega_0}\{2L_n(\omega)\},$$
where $\Omega_0 = \{\omega \in \Omega : \alpha = 0.5, \mu_1 = \mu_2\}$. A challenging issue is to derive the asymptotic distribution of $\mathrm{LR}_n$. To do so, we need to define some notation.
For a generic symmetric $q_2 \times q_2$ matrix $B$, $\mathrm{Vecs}(B)$ and $\mathrm{dvecs}(B)$ are defined as
$$\mathrm{Vecs}(B) = (b_{11}, b_{21}, b_{22}, \ldots, b_{q_2 1}, \ldots, b_{q_2 q_2})^T$$
and
$$\mathrm{dvecs}(B) = (b_{11}, 2b_{21}, b_{22}, 2b_{31}, 2b_{32}, b_{33}, \ldots, 2b_{q_2 1}, \ldots, 2b_{q_2, q_2 - 1}, b_{q_2 q_2})^T$$
respectively. Furthermore, we define $F_{i,1}(\beta, \mu) = \partial_\beta f_i(\beta, \mu)/f_{i*}$, $F_{i,2}(\mu) = \partial_\mu f_i(\beta_*, \mu)/f_{i*}$ and $F_{i,6}(\mu) = \partial^2_{\mu\mu} f_i(\beta_*, \mu)/f_{i*}$. Let
$$w_i(\mu) = \big(F_{i,1}(\beta_*, \mu_*)^T, F_{i,2}(\mu_*)^T, \mathrm{dvecs}\{F_{i,6}(\mu)\}^T\big)^T,$$
$$W_n(\mu) = \frac{1}{\sqrt{n}} \sum_{i=1}^n w_i(\mu), \qquad J_n(\mu) = \frac{1}{n} \sum_{i=1}^n w_i(\mu)\, w_i(\mu)^T,$$
where $W_n(\mu)$ is an $r \times 1$ vector and $J_n(\mu)$ is an $r \times r$ matrix. Finally, let $k_1(\omega) = (1 - \alpha)\,\Delta\mu_1 + \alpha\,\Delta\mu_2$, $k_2(\omega) = \mathrm{Vecs}\{(1 - \alpha)\,\Delta\mu_1^{\otimes 2} + \alpha\,\Delta\mu_2^{\otimes 2}\}$ and $K(\omega) = (\Delta\beta, k_1(\omega), k_2(\omega))$, in which $a^{\otimes 2} = aa^T$, $\Delta\mu_1 = \mu_1 - \mu_*$ and $\Delta\mu_2 = \mu_2 - \mu_*$.
Under $\Omega_0$, $L_n(\omega)$ simplifies to $\Sigma_{i=1}^n \log\{f_i(\beta, \mu)/f_{i*}\}$. Standard asymptotic theory yields that
$$\sup_{\omega \in \Omega_0}\{2L_n(\omega)\} = W_n(\mu_*)^T H\{H^T J_n(\mu_*) H\}^{-1} H^T W_n(\mu_*) + o_p(1),$$
where $H = (I_{q_1 + q_2}, 0^T)^T$ is a $\{q_1 + q_2 + q_2(q_2 + 1)/2\} \times (q_1 + q_2)$ matrix. Furthermore, if we
consider the maximum likelihood estimator $\hat\omega_M = (\hat\alpha_M, \hat\beta_M, \hat\mu_{1M}, \hat\mu_{2M}) = \arg\max_{\omega \in \Omega}\{L_n(\omega)\}$ in $\Omega$, we need to address several fundamental issues such as the consistency and rate of convergence of $K(\hat\omega_M)$ and the asymptotic expansion of $L_n(\hat\omega_M)$. After some tedious manipulations (see Zhu and Zhang (2003) for details), we find that
$$2L_n(\hat\omega_M) = \sup_{\|\mu_2\| \le M}\Big(W_n(\mu_2)^T J_n(\mu_2)^{-1} W_n(\mu_2) - \inf_{\omega \in \Omega_{\mu_2}}\big[Q_n\{\sqrt{n}\,K(\omega), \mu_2\}\big]\Big) + o_p(1)$$
holds under assumptions 1–3 that are given in Appendix A, where $\Omega_{\mu_2} = \{\omega \in \Omega : \mu_2 \text{ is fixed}\}$ and
$$Q_n(\lambda, \mu_2) = \{\lambda - J_n(\mu_2)^{-1} W_n(\mu_2)\}^T J_n(\mu_2)\, \{\lambda - J_n(\mu_2)^{-1} W_n(\mu_2)\}.$$
Define the convex cone $\Lambda_{\mu_2}$ as
$$\Lambda_{\mu_2} = \left\{\eta : \eta = \begin{pmatrix} \Delta\mu_2 & I_{q_2} \\ \mathrm{Vecs}(\Delta\mu_2^{\otimes 2}) & 0 \end{pmatrix} x, \; x = (x_1, \ldots, x_{q_2 + 1})^T \in [0, \infty) \times R^{q_2} \right\}.$$
Then, the log-likelihood-ratio statistic becomes
$$\mathrm{LR}_n = \sup_{\|\mu_2\| \le M}\{\hat V(\mu_2)^T J_n(\mu_2)\, \hat V(\mu_2)\} - W_n(\mu_*)^T H\{H^T J_n(\mu_*) H\}^{-1} H^T W_n(\mu_*) + o_p(1), \qquad (6)$$
where $Q_n\{\hat V(\mu_2), \mu_2\} = \inf_{\lambda \in R^{q_1} \times \Lambda_{\mu_2}}\{Q_n(\lambda, \mu_2)\}$.
We obtain the following theorems, whose detailed proofs are given in Zhu and Zhang (2003).

Theorem 1. Under assumptions 1–3 in Appendix A, the following results hold.

(a) $K(\hat\omega_M) = O_p(n^{-1/2})$.
(b) The log-likelihood-ratio statistic converges in distribution, as $n \to \infty$, to the maximum of a stochastic process that is indexed by $\mu_2$.
(c) If $q_2 = 1$, $\int_\mu |\hat G_M(\mu) - G(\mu)|\, d\mu = O_p(n^{-1/4})$, where $G(\mu) = I(\mu \ge \mu_*)$ is the true mixing distribution and $\hat G_M(\mu) = G(\hat\mu_{1M}, \hat\mu_{2M}, \hat\alpha_M) = (1 - \hat\alpha_M)\, I(\mu \ge \hat\mu_{1M}) + \hat\alpha_M\, I(\mu \ge \hat\mu_{2M})$.
To our knowledge, theorem 1, part (a), provides the convergence rate of $K(\hat\omega_M)$ for the first time under model (3). Our result has several implications. First, we confirm that the convergence rate of $\hat\beta$ (the maximum likelihood estimator) is $n^{-1/2}$ under the conditions defined. van der Vaart (1996) proved a similar result under semiparametric mixture models. Second, under hypothesis $H_0$, we prove that only $k_1(\hat\omega_M)$ and $k_2(\hat\omega_M)$ can reach the rate $n^{-1/2}$, which is useful for determining the asymptotic distributions of the estimators. Third, theorem 1, part (c), implies that $\hat G_M(\mu)$ converges to $G(\mu)$ in the $L_1$-norm at the optimal rate $n^{-1/4}$ under a more general model than that considered by Chen (1995). It is important to note that theorem 1 of Chen (1995) establishes that the optimal rate of convergence for estimating $G(\mu)$ in the $L_1$-norm is at most $n^{-1/4}$; this lower bound is therefore attained by $\hat G_M(\mu)$.
As in Chen et al. (2001), we consider an alternative approach to testing hypothesis (5) by using a modified log-likelihood function
$$\mathrm{ML}_n(\omega) = L_n(\omega) + \log(M)\, \log\{4\alpha(1 - \alpha)\},$$
where $M$ is as defined above. Compared with the log-likelihood function, the extra term $\log(M)\, \log\{4\alpha(1 - \alpha)\}$ in $\mathrm{ML}_n(\omega)$ keeps $\alpha$ away from both 0 and 1, which partially solves the identifiability problem. Let $\hat\omega_P$ be the resulting estimator, $\hat\omega_P = \arg\max_{\omega \in \Omega}\{\mathrm{ML}_n(\omega)\}$. The modified log-likelihood-ratio statistic is defined as
$$\mathrm{MLR}_n = 2\,\mathrm{ML}_n(\hat\omega_P) - \sup_{\omega \in \Omega_0}\{2\,\mathrm{ML}_n(\omega)\}.$$
Similarly to the log-likelihood-ratio statistic, we find that
$$\mathrm{MLR}_n = W_n(\mu_*)^T \big[J_n(\mu_*)^{-1} - H\{H^T J_n(\mu_*) H\}^{-1} H^T\big] W_n(\mu_*) - \inf_{\lambda \in R^{q_1} \times \Lambda_0}\{Q_n(\lambda, \mu_*)\} + o_p(1), \qquad (7)$$
where $\Lambda_0$ is a closed cone defined by
$$\Lambda_0 = \{\eta : \eta = (x^T, \mathrm{Vecs}(y^{\otimes 2})^T)^T, \; x \in R^{q_2} \text{ and } y \in R^{q_2}\}.$$
We have the following asymptotic result for both $\hat\omega_P$ and $\mathrm{MLR}_n$.

Theorem 2. Under the assumptions of theorem 1, we have the following results.

(a) $\hat\alpha_P = O_p(1)$, $\Delta\hat\beta_P = O_p(n^{-1/2})$, $\Delta\hat\mu_{1P} = O_p(n^{-1/4})$ and $\Delta\hat\mu_{2P} = O_p(n^{-1/4})$.
(b) The modified log-likelihood-ratio statistic converges in distribution to a random variable $U_2$ as $n \to \infty$.
(c) If $q_2 = 1$, $\int_\mu |\hat G_P(\mu) - G(\mu)|\, d\mu = O_p(n^{-1/4})$, where $\hat G_P(\mu) = (1 - \hat\alpha_P)\, I(\mu \ge \hat\mu_{1P}) + \hat\alpha_P\, I(\mu \ge \hat\mu_{2P})$.
Theorem 2, part (a), gives the exact convergence rate of $\hat\omega_P$. Theorem 2, part (b), determines the asymptotic distribution of $\mathrm{MLR}_n$. Although the explicit form of this distribution is generally complicated, our result gives rise to a simple asymptotic distribution, $0.5\chi^2_1 + 0.5\chi^2_0$, for $\mathrm{MLR}_n$ when $q_2 = 1$. This coincides with theorem 1 of Chen et al. (2001) when there are no covariates, i.e. $q_1 = 0$. Theorem 2, part (c), shows that the $n^{-1/4}$ consistency rate for estimating the mixing distribution $G(\mu)$ is attainable by using $\hat\omega_P$.
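For the $q_2 = 1$ case this limit yields a closed-form p-value: the $\chi^2_0$ component puts point mass at 0, so for an observed statistic $t > 0$ the tail probability is $0.5\, P(\chi^2_1 > t)$. A minimal sketch of this calculation (Python standard library only; the function name is our own, and the computation is an illustration rather than part of the paper):

```python
import math

def pvalue_half_chisq(t):
    """P-value under the 0.5*chi2_0 + 0.5*chi2_1 limit: the chi2_0
    component is a point mass at zero, so for t > 0 the tail is
    0.5 * P(chi2_1 > t) = 0.5 * erfc(sqrt(t / 2))."""
    if t <= 0.0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(t / 2.0))

# 2.706 is approximately the 0.90 quantile of chi2_1, so the p-value here
# should be close to 0.05:
p = pvalue_half_chisq(2.706)
```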
So far, we have systematically investigated the convergence rates of both $K(\hat\omega_P)$ and $K(\hat\omega_M)$, and the asymptotic distributions of both $\mathrm{LR}_n$ and $\mathrm{MLR}_n$. Two further issues are also very important. The first is how to compute the critical values for these potentially complicated distributions of the test statistics. The second is to compare the power of $\mathrm{LR}_n$ and $\mathrm{MLR}_n$. In what follows, we study the empirical and asymptotic behaviour of these two statistics under departures from hypothesis $H_0$.
2.1. A resampling method
Although we have obtained the asymptotic distributions of the likelihood-based statistics, the limiting distributions usually have complicated analytic forms. To alleviate this difficulty, we use a resampling technique to calculate the critical values of the test statistics. Although the bootstrap is an obvious approach, it requires repeated maximization of the likelihood and modified likelihood functions. These maximizations are computationally intensive for finite mixture models. Thus, we prefer a computationally more efficient method, as proposed and used by Hansen (1996), Kosorok (2003) and others.
On the basis of equations (6) and (7), we only need to focus on $W_n(\mu_2)$ and $J_n(\mu_2)$. If hypothesis $H_0$ is true, $(\hat\beta_0, \hat\mu_0) = \arg\max_{\omega \in \Omega_0}\{L_n(\omega)\}$ provides consistent estimators of $\beta_*$ and $\mu_*$. By substituting $(\beta_*, \mu_*)$ with $(\hat\beta_0, \hat\mu_0)$ in the definitions of $W_n(\mu_2)$ and $J_n(\mu_2)$, we obtain $\hat w_i(\mu_2)$, $\hat W_n(\mu_2)$ and $\hat J_n(\mu_2)$ accordingly, i.e.
$$\hat w_i(\mu_2) = \big(F_{i,1}(\hat\beta_0, \hat\mu_0)^T, F_{i,2}(\hat\beta_0, \hat\mu_0)^T, \mathrm{dvecs}\{F_{i,6}(\hat\beta_0, \mu_2)\}^T\big)^T,$$
$$\hat W_n(\mu_2) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \hat w_i(\mu_2), \qquad \hat J_n(\mu_2) = \frac{1}{n} \sum_{i=1}^n \hat w_i(\mu_2)\, \hat w_i(\mu_2)^T.$$
To assess the significance level and the power of the two test statistics, we need to obtain empirical distributions for these statistics in lieu of their theoretical distributions. What follow are the four key steps in generating the stochastic processes that have the same asymptotic distributions as the test statistics.

Step 1: we generate independent and identically distributed random samples, $\{v_{i,m} : i = 1, \ldots, n\}$, from the standard normal distribution $N(0, 1)$. Here, $m$ represents a replication number.
Step 2: we calculate
$$\hat W^m_n(\mu_2) = \sum_{i=1}^n \hat w_i(\mu_2)\, v_{i,m}/\sqrt{n}$$
and
$$Q^m_n(\lambda, \mu_2) = \{\lambda - \hat J_n(\mu_2)^{-1} \hat W^m_n(\mu_2)\}^T \hat J_n(\mu_2)\, \{\lambda - \hat J_n(\mu_2)^{-1} \hat W^m_n(\mu_2)\}.$$
It is important to note that $\hat W^m_n(\mu_2)$ converges weakly to $W(\mu_2)$ as $n \to \infty$. This claim can be proved by using the conditional central limit theorem; see theorem 10.2 of Pollard (1990) and theorem 2 of Hansen (1996).
Step 3: the third step is to calculate the likelihood ratios
$$\mathrm{LR}^m_n = \sup_{\|\mu_2\| \le M}\{\hat V^m(\mu_2)^T \hat J_n(\mu_2)\, \hat V^m(\mu_2)\} - \hat W^m_n(\hat\mu_0)^T H\{H^T \hat J_n(\hat\mu_0) H\}^{-1} H^T \hat W^m_n(\hat\mu_0)$$
and
$$\mathrm{MLR}^m_n = \hat W^m_n(\hat\mu_0)^T \big[\hat J_n(\hat\mu_0)^{-1} - H\{H^T \hat J_n(\hat\mu_0) H\}^{-1} H^T\big] \hat W^m_n(\hat\mu_0) - \inf_{\lambda \in R^{q_1} \times \Lambda_0}\{Q^m_n(\lambda, \hat\mu_0)\},$$
where $Q^m_n\{\hat V^m(\mu_2), \mu_2\} = \inf_{\lambda \in R^{q_1} \times \Lambda_{\mu_2}}\{Q^m_n(\lambda, \mu_2)\}$.
Step 4: we repeat the above three steps $J$ times and obtain two sets of realizations, $\{\mathrm{LR}^m_n : m = 1, \ldots, J\}$ and $\{\mathrm{MLR}^m_n : m = 1, \ldots, J\}$. It can be shown that the empirical distribution of $\mathrm{LR}^m_n$ converges to the asymptotic distribution of $\mathrm{LR}_n$, and likewise for $\mathrm{MLR}^m_n$. Therefore, the empirical distributions of these two sets of realizations form the basis for calculating the critical values in hypothesis testing as well as for the power analysis.
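The multiplier step at the heart of this scheme (steps 1 and 2) can be sketched as follows. This is an illustration in Python with NumPy, not the authors' code: the score matrix below is a random stand-in for the $\hat w_i(\mu_2)$ of the text, and all names are our own.

```python
import numpy as np

def multiplier_draws(score_matrix, J, rng):
    """Steps 1-2 of the resampling scheme: given the n x r matrix whose
    rows play the role of the estimated scores w_hat_i, return J draws of
    W_hat_n^m = sum_i w_hat_i * v_{i,m} / sqrt(n), with v_{i,m} iid N(0, 1),
    as in the multiplier method of Hansen (1996)."""
    n, _ = score_matrix.shape
    v = rng.standard_normal((J, n))        # v_{i,m}: one row per replication m
    return v @ score_matrix / np.sqrt(n)   # J x r matrix; row m is W_hat_n^m

rng = np.random.default_rng(1)
scores = rng.standard_normal((200, 3))     # illustrative stand-in scores
W = multiplier_draws(scores, 5000, rng)
```

Each replication $\mathrm{LR}^m_n$ or $\mathrm{MLR}^m_n$ is then a deterministic function of the corresponding row of this matrix and of $\hat J_n$, so only one pass over the data is needed, in contrast with a full bootstrap.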
We approximate $p_1 = \Pr(U_1 > \mathrm{LR}_n)$ and $p_2 = \Pr(U_2 > \mathrm{MLR}_n)$ by using
$$p_1 \approx \hat p^J_1 = \sum_{m=1}^J I(\mathrm{LR}^m_n > \mathrm{LR}_n)/J, \qquad p_2 \approx \hat p^J_2 = \sum_{m=1}^J I(\mathrm{MLR}^m_n > \mathrm{MLR}_n)/J.$$
Since $J$ is under our control, $\hat p^J_1$ can be made arbitrarily close to $p_1$ by using a sufficiently large $J$. To guide the selection of $J$ in practice, it is useful to note that $(\hat p^J_1 - p_1)\sqrt{J}$ converges in distribution to a normal distribution with mean 0 and variance $p_1(1 - p_1)$ as $J \to \infty$, and that $\hat p^J_1$ lies, with probability close to 1, in $\big(p_1 - 2.5\sqrt{\{p_1(1 - p_1)/J\}},\; p_1 + 2.5\sqrt{\{p_1(1 - p_1)/J\}}\big)$; see also Hansen (1996). Thus, if we want to set an error control $\delta_0$ for $|\hat p^J_1 - p_1|$, we require $2.5\sqrt{\{p_1(1 - p_1)/J\}} \le \delta_0$. For example, when $p_1 = 0.01$, choosing $J = 10000$ yields an error of about 0.0025. Exactly the same formula applies to $p_2$ as to $p_1$.
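The bound above translates directly into a minimum replication count, $J \ge 2.5^2\, p(1 - p)/\delta_0^2$. A small helper (Python standard library; our own packaging of the inequality, not from the paper):

```python
import math

def min_replications(p, delta0, c=2.5):
    """Smallest J satisfying c * sqrt(p * (1 - p) / J) <= delta0, i.e.
    enough resampling replications to keep the Monte Carlo error of the
    estimated p-value below delta0 (c = 2.5 follows the text)."""
    return math.ceil(c * c * p * (1.0 - p) / (delta0 * delta0))

# With p = 0.01, an error control of 0.0025 needs roughly 10000 replications,
# matching the choice J = 10000 in the text:
J = min_replications(0.01, 0.0025)
```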
2.2. Asymptotic local power
An assessment of power is necessary for choosing a reasonable sample size as well as appropriate and powerful tests of significance (Cox and Hinkley (1974), page 103). Often, there is no closed form for the power calculation. Many researchers have considered alternatives by exploiting the asymptotic local power. In our case, the distribution of $J_n(\mu_2)^{-1} W_n(\mu_2)$ plays a critical role in determining the asymptotic local power of $\mathrm{LR}_n$ and $\mathrm{MLR}_n$; see equations (6) and (7). Thus, we explore its properties under a sequence of local alternatives.
Consider a sequence of local alternatives $\omega_n$ consisting of $\alpha_n = \alpha_0$, $\beta_n = \beta_* + n^{-1/2} h_1$, $\mu^n_1 = \mu_* - n^{-1/4}\{\alpha_0/(1 - \alpha_0)\}^{0.5} h_2$ and $\mu^n_2 = \mu_* + n^{-1/4}\{(1 - \alpha_0)/\alpha_0\}^{0.5} h_2$, where $\alpha_0$ is a constant between 0 and 1, and $h_1$ and $h_2$ are $q_1 \times 1$ and $q_2 \times 1$ vectors respectively. At $\omega_n$, $K(\omega_n) = n^{-1/2} h$, where $h^T = (h_1^T, 0^T, \mathrm{Vecs}(h_2^{\otimes 2})^T)$. We have the following theorem.

Theorem 3. Under assumptions 1–3 in Appendix A and the alternatives $\omega_n$,
$$J_n(\mu_2)^{-1} W_n(\mu_2) \stackrel{d}{\to} N\{J(\mu_2)^{-1} J(\mu_2, \mu_*)\, h,\; J(\mu_2)^{-1}\}.$$
Theorem 3 can be used to compare the local power of our test statistics. In particular, $J_n(\mu_*)^{-1} W_n(\mu_*)$ converges in distribution to $N\{h, J(\mu_*)^{-1}\}$. For instance, the asymptotic power function that is associated with the modified log-likelihood-ratio statistic is
$$p_\alpha(h) = \lim_{n \to \infty}\big[P\{F_0(\mathrm{MLR}_n) \ge 1 - \alpha \mid h\}\big],$$
where $F_0(\cdot)$ is the asymptotic distribution function of $\mathrm{MLR}_n$ under hypothesis $H_0$ as $n \to \infty$. As $\|h\| \to \infty$, $p_\alpha(h)$ converges to 1, since $\mathrm{MLR}_n$ becomes large as $\|h\|$ increases.
3. Simulation study and real examples

Two computational issues are related to our test procedures. First, we must calculate three different estimators: $(\hat\beta_0, \hat\mu_0)$, $\hat\omega_M$ and $\hat\omega_P$. It is relatively straightforward to compute $(\hat\beta_0, \hat\mu_0)$ and $\hat\omega_P$ by using Newton–Raphson and/or EM algorithms. Since finding a global maximizer is still an open problem, we employ the EM algorithm with 200 different starting points. During the computation, we use $M = 10$ to obtain $\hat\omega_P$.
Secondly, to approximate the critical values of the asymptotic distributions, we need to obtain two sets of realizations, $\{\mathrm{MLR}^m_n : m = 1, \ldots, J\}$ and $\{\mathrm{LR}^m_n : m = 1, \ldots, J\}$. Obtaining the latter is computationally intensive even if $q_2$ is as small as 2. Hence, it is sensible to consider replacing $B(0, M) = \{\mu : \|\mu\| \le M\}$ with a discrete approximation $B(0, M)_A$. Then, $\mathrm{LR}^m_n$ is approximated by
$$\max_{\mu_2 \in B(0, M)_A}\{\hat V^m(\mu_2)^T \hat J_n(\mu_2)\, \hat V^m(\mu_2)\} - \hat W^m_n(\hat\mu_0)^T H\{H^T \hat J_n(\hat\mu_0) H\}^{-1} H^T \hat W^m_n(\hat\mu_0).$$
Owing to the computational complexity, we only consider $\mathrm{LR}_n$ when $q_2 = 1$ and $M = 10$, i.e. $B(0, 10) = [-10, 10]$. We take $B(0, 10)_A = \{-10, -10 + 0.02, \ldots, 10 - 0.02, 10\}$.
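For reference, this grid is straightforward to construct (a sketch in Python with NumPy; the helper name is ours):

```python
import numpy as np

def grid_ball(M=10.0, step=0.02):
    """Discrete approximation B(0, M)_A = {-M, -M + step, ..., M} to the
    interval [-M, M] over which mu_2 is maximized when q_2 = 1.  The stop
    value M + step/2 guards against floating-point exclusion of the
    endpoint M."""
    return np.arange(-M, M + step / 2.0, step)

grid = grid_ball()   # 1001 points from -10 to 10 in steps of 0.02
```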
3.1. Simulated data
In this subsection, we conduct some simulation studies to assess the performance of the two tests that were introduced in Section 2. Data are drawn from the mixture linear regression model
$$y_{ij} = z_{ij}\{U_i \tilde\mu_1 + (1 - U_i)\tilde\mu_2\} + \varepsilon_{ij},$$
where $\tilde\mu_1 = \beta + \mu_1$, $\tilde\mu_2 = \beta + \mu_2$ and $\varepsilon_{ij} \sim N(0, 1)$. From now on, we use $(\tilde\mu_1, \tilde\mu_2)$ instead of $(\beta, \mu_1, \mu_2)$ to avoid the identifiability problem.
We consider two cases: $q_2 = 1$ and $q_2 = 2$. For $q_2 = 1$, all the $z_{ij}$s are equal to 1. For $q_2 = 2$, $z_{ij} = (1, u_{ij})$, where $u_{ij}$ is drawn from the uniform distribution on $[0, 2]$. Therefore, we have parameters $\tilde\mu_1$, $\tilde\mu_2$, $\alpha$ and $\sigma$. The true value of $\sigma$ is 1 and the number of observations in each cluster is set equal to 3. It should be noted that $\sigma$ is the nuisance parameter in the general model (3), implying that $q_1 = 1$.
For $q_2 = 1$, four different settings of $(\alpha, \tilde\mu_1, \tilde\mu_2)$, denoted A1, B1, C1 and D1, are considered. Similarly, for $q_2 = 2$, four other different settings of $(\alpha, \tilde\mu_1, \tilde\mu_2)$, denoted A2, B2, C2 and D2, are considered. See Table 1 for all eight designs. We choose sample sizes $n = 50$ and $n = 100$.
Table 1. Eight designs for the simulation study (1_2 denotes the 2 × 1 vector of 1s)

Parameter   A1    B1    C1    D1    A2     B2          C2            D2
α           0.5   0.5   0.5   0.2   0.5    0.5         0.5           0.2
µ̃1          1     1     1     1     1_2    1_2         1_2           1_2
µ̃2          1     1.4   2     2     1_2    1.4 × 1_2   1.707 × 1_2   1.707 × 1_2
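To make the designs concrete, the following sketch (Python with NumPy; our own construction, not the authors' code) generates one data set from the $q_2 = 2$ version of the model. The clustering ($n_i = 3$), the Uniform$[0, 2]$ covariate and the parameter values of design C2 follow the text.

```python
import numpy as np

def simulate_design(n, alpha, mu1, mu2, rng, n_i=3):
    """Draw clustered responses y_ij = z_ij (U_i mu1 + (1 - U_i) mu2) + eps_ij
    with eps_ij ~ N(0, 1), z_ij = (1, u_ij), u_ij ~ Uniform[0, 2] and
    P(U_i = 0) = alpha, as in the q_2 = 2 simulation designs."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    u_lat = rng.binomial(1, 1.0 - alpha, size=n)               # latent U_i
    z = np.dstack([np.ones((n, n_i)), rng.uniform(0.0, 2.0, (n, n_i))])
    mean = (z @ mu1) * u_lat[:, None] + (z @ mu2) * (1 - u_lat[:, None])
    return mean + rng.standard_normal((n, n_i))

rng = np.random.default_rng(2)
y = simulate_design(100, 0.5, [1.0, 1.0], [1.707, 1.707], rng)  # design C2
```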
For each simulation, the three significance levels 10%, 5% and 1% are considered, and 50000 replications are used to estimate the rejection rates. To calculate standard errors for the rejection rate in each case, we divide the 50000 replications into 50 consecutive batches of 1000 replications, which give 50 estimates of the rejection rate, and compute the standard error from these 50 estimates.
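The batching scheme can be sketched in a few lines (Python with NumPy; the simulated rejection indicators are purely illustrative, and taking the spread of the 50 batch estimates as the reported standard error is our reading of the text, consistent with the magnitudes in Tables 2–4):

```python
import numpy as np

def batched_rate_and_se(reject_flags, n_batches=50):
    """Split the 0/1 rejection indicators from all replications into
    consecutive batches (50 batches of 1000 in the text), estimate the
    rejection rate within each batch, and report the overall rate together
    with the standard error computed from the 50 batch estimates."""
    batches = np.asarray(reject_flags, dtype=float).reshape(n_batches, -1)
    rates = batches.mean(axis=1)
    return float(rates.mean()), float(rates.std(ddof=1))

rng = np.random.default_rng(3)
flags = rng.random(50000) < 0.05       # stand-in for 50000 test outcomes
rate, se = batched_rate_and_se(flags)
```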
Table 2 presents the estimates and standard errors of the rejection rates for the log-likelihood-ratio test statistic. As expected, the power increases with the distance between $\tilde\mu_1$ and $\tilde\mu_2$. Moreover, the simulated critical values are precise, and the null rejection level is close to the true rejection level.
Tables 3 and 4 report the estimates and standard errors of the rejection rates for the modified log-likelihood-ratio statistic. The proposed testing procedure based on $\mathrm{MLR}_n$ performs remarkably well. First, the power increases quickly as the distance between $\tilde\mu_1$ and $\tilde\mu_2$ becomes large. Second, the simulated critical values are quite accurate; in particular, the null rejection level is quite close to the true rejection level. Third, as expected, a larger sample size produces better results.
Table 2. Estimates and standard errors of rejection rates for the log-likelihood-ratio statistic of a one-component mixture against a two-component mixture (q_2 = 1)

True rate                   A1              B1              C1              D1
                        n=50    n=100   n=50    n=100   n=50    n=100   n=50    n=100
0.010  Estimate         0.0151  0.0146  0.0399  0.0528  0.4943  0.8177  0.2886  0.5422
       Standard error   0.0046  0.0044  0.0070  0.0065  0.0153  0.0139  0.0148  0.0182
0.050  Estimate         0.0596  0.0576  0.1253  0.1592  0.7218  0.9352  0.5077  0.7535
       Standard error   0.0079  0.0075  0.0102  0.0105  0.0112  0.0071  0.0168  0.0145
0.100  Estimate         0.1072  0.1069  0.2031  0.2540  0.8192  0.9669  0.6281  0.8404
       Standard error   0.0094  0.0094  0.0141  0.0132  0.0114  0.0054  0.0139  0.0128
Table 3. Estimates and standard errors of rejection rates for the modified log-likelihood-ratio statistic of a one-component mixture against a two-component mixture (q_2 = 1)

True rate                   A1              B1              C1              D1
                        n=50    n=100   n=50    n=100   n=50    n=100   n=50    n=100
0.0100 Estimate         0.0151  0.0121  0.0404  0.0572  0.5601  0.8732  0.2886  0.5439
       Standard error   0.0041  0.0038  0.0068  0.0082  0.0139  0.0086  0.0122  0.0165
0.0500 Estimate         0.0539  0.0520  0.1264  0.1659  0.7779  0.9607  0.5053  0.7525
       Standard error   0.0079  0.0063  0.0101  0.0141  0.0131  0.0060  0.0132  0.0145
0.1000 Estimate         0.0981  0.0980  0.2075  0.2628  0.8641  0.9821  0.6270  0.8385
       Standard error   0.0102  0.0094  0.0096  0.0163  0.0099  0.0040  0.0150  0.0115
Table 4. Estimates and standard errors of rejection rates for the modified log-likelihood-ratio statistic of a one-component mixture against a two-component mixture

True rate               Results (q2 = 2) for the following designs and values of n:
                        A2               B2               C2               D2
                        n = 50  n = 100  n = 50  n = 100  n = 50  n = 100  n = 50  n = 100
0.0100  Estimate        0.0166  0.0148   0.2330  0.4843   0.9613  0.9998   0.7165  0.9611
        Standard error  0.0046  0.0029   0.0121  0.0165   0.0056  0.0005   0.0159  0.0070
0.0500  Estimate        0.0636  0.0585   0.4459  0.7081   0.9905  0.9999   0.8502  0.9866
        Standard error  0.0078  0.0073   0.0141  0.0138   0.0031  0.0002   0.0113  0.0037
0.1000  Estimate        0.1143  0.1078   0.5686  0.8061   0.9960  1.0000   0.9001  0.9923
        Standard error  0.0103  0.0102   0.0143  0.0108   0.0019  0.0001   0.0087  0.0024
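The standard errors in Tables 2–4 behave like binomial Monte Carlo errors, √{p(1 − p)/N} for N simulation replicates. As a rough consistency check (the replicate count N is not stated in this excerpt; N = 1000 is our assumption), the entry 0.0151 with standard error 0.0041 for design A1 is close to the binomial value:

```python
import math

def mc_standard_error(p, n_reps):
    """Binomial Monte Carlo standard error of an estimated rejection rate."""
    return math.sqrt(p * (1.0 - p) / n_reps)

# Hypothetical replicate count; the excerpt does not report it.
N = 1000
se = mc_standard_error(0.0151, N)  # Table 3, design A1, n = 50, level 0.01
print(round(se, 4))                # close to the reported 0.0041
```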
By inspecting Tables 2–4, we find that the log-likelihood-ratio statistic has no advantage over the modified log-likelihood-ratio statistic; in fact, the opposite seems to be true. Although we do not yet have a full explanation, the computational complexity could be one of the reasons. Thus, on the basis of our simulation studies, we prefer the modified log-likelihood-ratio statistic for testing hypothesis (5) because of its easier computation and better power.
3.2. Ames salmonella assay data<br />
We reanalyse an assay data set that has been studied by Wang et al. (1996), among others. The data set contains the number of revertant colonies of salmonella under different dose levels of quinoline. At each of six dose levels d_i of quinoline, three plates are tested. To fit the data set, Wang et al. (1996) chose the two-component Poisson regression (2) with n = 18, n_i = 1 and Poisson rates
\[
\lambda_i = \exp[x_i^{\mathrm{T}}\beta + \{\mu_1 U_i + \mu_2(1 - U_i)\}],
\]
where x_i = (d_i, log(d_i + 10)), i = 1, ..., 18. They used the Akaike information criterion and the Bayes information criterion to facilitate the model selection process, but they did not have a testing procedure for hypothesis (5).
We use the modified log-likelihood-ratio statistic to test the hypothesis stated in expression (5). First, we use the EM algorithm to obtain (α̂_P, β̂_P^T, µ̂_1P, µ̂_2P)^T = (0.723, −0.0013, 0.378, 1.844, 2.382)^T. Next, using ω̂_P and the results in Table 6 of Wang et al. (1996), we find that MLR_n = 12.447. Then the approximation procedure is repeated J = 10000 times, leading to p̂_J = 0.0001. Since q_2 = 1 in this case, the true asymptotic distribution of MLR_n is 0.5χ²_0 + 0.5χ²_1, giving a p-value of 0.0002. Therefore, the data support the mixture Poisson regression with two components.
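The p-value of 0.0002 can be reproduced directly from the 0.5χ²_0 + 0.5χ²_1 limit: the χ²_0 component places all its mass at 0, so for c > 0 the tail probability is 0.5 P(χ²_1 ≥ c) = 0.5 erfc{√(c/2)}. A minimal check using only the Python standard library:

```python
import math

def mixture_pvalue(c):
    """Tail probability P(T >= c) for T ~ 0.5*chi2_0 + 0.5*chi2_1, c > 0.

    chi2_0 is a point mass at 0, and P(chi2_1 >= c) = erfc(sqrt(c / 2)).
    """
    return 0.5 * math.erfc(math.sqrt(c / 2.0))

print(round(mixture_pvalue(12.447), 4))  # 0.0002, matching the quoted p-value
```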
We can also use the log-likelihood-ratio statistic to test the same hypothesis. It follows from Table 6 of Wang et al. (1996) that the log-likelihood-ratio statistic LR_n = 14.44. Although it is difficult to compute the true p-value, the approximation procedure gives p̂_10000 = 0.0004. Further analyses, such as residual analysis and goodness of fit, have been reported in Wang et al. (1996).
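The exceedance-count estimator behind p̂_J can be illustrated with the known q_2 = 1 limit law: 0.5χ²_0 + 0.5χ²_1 is the distribution of {max(0, Z)}² for Z ~ N(0, 1), so drawing J null statistics and counting how many exceed the observed value approximates the p-value. This sketch substitutes draws from the asymptotic law for the paper's actual resampling procedure, whose details are not given in this section:

```python
import random

def simulate_null_stat(rng):
    """One draw from 0.5*chi2_0 + 0.5*chi2_1, i.e. max(0, Z)**2 with Z ~ N(0, 1)."""
    return max(0.0, rng.gauss(0.0, 1.0)) ** 2

def pvalue_by_resampling(t_obs, n_draws=10000, seed=0):
    """Exceedance-count approximation of P(T >= t_obs) under the null law."""
    rng = random.Random(seed)
    exceed = sum(simulate_null_stat(rng) >= t_obs for _ in range(n_draws))
    return exceed / n_draws

p_hat = pvalue_by_resampling(12.447, n_draws=10000)
print(p_hat)  # of the order 1e-4, in line with the quoted p-hat values
```

About half of the simulated statistics are exactly 0 (the χ²_0 component), which is why the analytic tail probability carries the factor 0.5.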
Mixture Regression Models 13<br />
3.3. A risk factor analysis of preterm delivery<br />
Although the survival rate has improved for preterm infants born with low (less than 2500 g) or even very low (less than 1000 g) birth weights, preterm birth remains a major cause of neurodevelopmental disability. Many epidemiological studies have been conducted to examine
risk factors for preterm deliveries. Here, we reanalyse a data set that has been collected and<br />
extensively analysed by Dr Michael Bracken and his colleagues (see, for example, Zhang and<br />
Bracken (1995)). The data for this study consisted of 3858 women whose pregnancies ended <strong>in</strong> a<br />
s<strong>in</strong>gleton live-birth at Yale–New Haven Hospital, Connecticut, <strong>in</strong> 1980–1982. Preterm delivery<br />
is def<strong>in</strong>ed as less than 37 weeks of gestational age, which is calculated from the first day of the<br />
last menstrual period. On the basis of Zhang and Bracken (1995), we exam<strong>in</strong>e the follow<strong>in</strong>g<br />
eight putative risk factors: total number of pregnancies (v 1 ), ethnic background (v 2 ), use of<br />
marijuana (v 3 ), marital status (v 4 ), hormones or diethylstilbestrol used by the mother (v 5 ), parity<br />
(v 6 ), years of education (v 7 ) and total alcohol use per day (v 8 ). Among these risk factors,<br />
v 2 , v 3 , v 4 and v 5 are dummy variables with values 0 or 1. For <strong>in</strong>stance, v 2 is an <strong>in</strong>dicator for<br />
whether a woman is black or not. Zhang and Bracken (1995) orig<strong>in</strong>ally considered 15 variables,<br />
but seven of them do not contribute to the risk of preterm births and hence are not considered<br />
here.<br />
To understand the potential relationship between preterm delivery and the above risk factors, we fit the logistic regression model to the data set and report the output in Table 5. According to the estimates and their corresponding standard errors in Table 5, the most significant risk factors are the number of previous pregnancies, ethnicity and years of education, as expected from earlier studies (Zhang and Bracken, 1995). However, the remaining risk factors do not appear to be significant at the 0.05 level.
Next, we use the logistic mixture regression
\[
\mathrm{logit}\{P(Y_i = 1 \mid U_i)\} = x_i\beta + z_i\{U_i\mu_1 + (1 - U_i)\mu_2\},
\]
where x_i = (1, v_1i, v_2i, v_3i) and z_i = (v_4i, v_5i, v_6i, v_7i, v_8i) for i = 1, ..., 3858. The result becomes intriguing when the two-component logistic regression model is used to analyse this data set; see Table 5 for details. First, all eight putative risk factors become significant under the logistic mixture model. Second, when we use the modified log-likelihood-ratio statistic to test the hypothesis stated in expression (5), MLR_n = 20.810, giving rise to p̂_J = 0.0002 with J = 10000 repeated samples. This means that our procedure strongly favours the two-component logistic model: there is a significant disparity between the null hypothesis H_0 in expression (5) and the observed sample. Third, by looking at the coefficients of v_4–v_8, we find that the estimates in the two latent groups have opposite signs, explaining why these factors are insignificant in the ordinary logistic model and underscoring the importance of the two-component model. The opposite signs indicate reversed relationships between these risk factors and preterm births in the two latent groups.

Table 5. Modified log-likelihood estimators for the preterm data set†

                        α        Intercept  v1       v2       v3       v4       v5       v6       v7       v8
One-component mixture (modified log-likelihood −778.49)
                                 −2.465     0.167    0.577    0.232    −0.109   0.131    −0.086   −0.061   −0.039
                                 (0.463)    (0.077)  (0.210)  (0.223)  (0.218)  (0.170)  (0.110)  (0.033)  (0.073)
Two-component mixture‡ (modified log-likelihood −768.09)
                        0.509    −2.949     0.174    0.664    0.241    1.062    0.778    0.276    −0.159   0.218
                        (0.101)  (0.116)    (0.021)  (0.059)  (0.062)  (0.133)  (0.145)  (0.094)  (0.019)  (0.039)
                                                                       −1.361   −0.997   −0.990   0.094    −0.308
                                                                       (0.166)  (0.095)  (0.052)  (0.018)  (0.044)

†Standard errors of the estimators are given in parentheses; MLR_n = 20.810; p̂_10000 = 0.0002.
‡α, the intercept and v_1–v_3 are shared, and v_4–v_8 are separated for U = 0 (upper values) and U = 1 (lower values).
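The sign cancellation can be made concrete by averaging each pair of latent-group coefficients for v_4–v_8 with the estimated mixing proportion. This is only a heuristic for the marginal effect in a nonlinear logistic model, and pairing α̂ = 0.509 with the upper (U = 0) row of Table 5 is our assumption:

```python
# Two-component estimates for v4-v8 from Table 5 (upper row U = 0, lower row U = 1).
alpha = 0.509  # estimated mixing proportion; pairing with the U = 0 row is assumed
upper = {"v4": 1.062, "v5": 0.778, "v6": 0.276, "v7": -0.159, "v8": 0.218}
lower = {"v4": -1.361, "v5": -0.997, "v6": -0.990, "v7": 0.094, "v8": -0.308}

for name in upper:
    avg = alpha * upper[name] + (1.0 - alpha) * lower[name]
    # Each weighted average is much smaller in magnitude than either component,
    # mirroring the near-zero one-component estimates for these factors.
    print(f"{name}: weighted average {avg:+.3f} "
          f"(components {upper[name]:+.3f} / {lower[name]:+.3f})")
```

For v_8, for instance, the average 0.509(0.218) + 0.491(−0.308) ≈ −0.040 is close to the one-component estimate −0.039.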
Acknowledgements<br />
This research is supported <strong>in</strong> part by National Institutes of Health grants DA12468 and<br />
AA12044. We thank the Jo<strong>in</strong>t Editor, an Associate Editor and two referees for valuable suggestions<br />
which helped to improve our presentation greatly. We also thank Professor Michael<br />
Bracken for the use of his data set.<br />
Appendix A<br />
Before we present all the assumptions, we need to define the following derivatives of f_i(β, µ):
\[
F_{i,3}(\beta,\mu) = \partial_\beta^2 f_i(\beta,\mu)/f_{i*}, \quad
F_{i,4}(\mu) = \partial_\beta\partial_\mu f_i(\beta_*,\mu)/f_{i*}, \quad
F_{i,5}(\mu) = \partial_\beta\partial_\mu^2 f_i(\beta_*,\mu)/f_{i*} \quad\text{and}\quad
F_{i,7}(\mu) = \partial_\mu^3 f_i(\beta_*,\mu)/f_{i*}.
\]
The follow<strong>in</strong>g assumptions are sufficient conditions to derive our asymptotic results. Detailed proofs of<br />
theorems 1–3 can be found <strong>in</strong> Zhu and Zhang (2003).<br />
A.1. Assumption 1 (identifiability)
As n → ∞, sup_{ω∈Ω} {n^{−1}|L_n(ω) − L̄_n(ω)|} → 0 in probability, where L̄_n(ω) = E{L_n(ω)}. Moreover, for every δ > 0,
\[
\liminf_{n\to\infty}\Bigl(n^{-1}\Bigl[\bar L_n(\bar\omega_n) - \sup_{\omega\in\Omega\setminus\Omega_{*,\delta}}\bar L_n(\omega)\Bigr]\Bigr) > 0,
\]
where ω̄_n is the maximizer of L̄_n(ω) and
\[
\Omega_{*,\delta} = \{\omega : \|\beta-\beta_*\| \le \delta,\ \|\mu_1-\mu_*\| \le \delta,\ \alpha\|\mu_2-\mu_*\| \le \delta\} \cap \Omega.
\]
A.2. Assumption 2
For a small δ_0 > 0, let B_{δ_0} = {(β, µ) : ||β − β_*|| ≤ δ_0 and ||µ|| ≤ M} ∩ Ω. Then
\[
\sup_{(\beta,\mu)\in B_{\delta_0}}\Biggl\{\frac{1}{n}\biggl\|\sum_{i=1}^n F_{i,1}(\beta,\mu)\biggr\|
+ \frac{1}{n}\biggl\|\sum_{i=1}^n F_{i,3}(\beta,\mu)\biggr\|
+ \frac{1}{n}\biggl\|\sum_{i=1}^n F_{i,2}(\mu)\biggr\|
+ \sum_{k=4}^{6}\frac{1}{n}\biggl\|\sum_{i=1}^n F_{i,k}(\mu)\biggr\|\Biggr\} = o_p(1),
\]
\[
\sup_{\|\mu\|\le M}\biggl\{\frac{1}{\sqrt{n}}\biggl\|\sum_{i=1}^n F_{i,k}(\mu)\biggr\|\biggr\} = O_p(1), \qquad k = 4, 6, 7,
\]
\[
\sup_{(\beta,\mu)\in B_{\delta_0}}\Biggl\{\frac{1}{n}\sum_{i=1}^n\Bigl(\|F_{i,1}(\beta,\mu)\|^3
+ \|F_{i,3}(\beta,\mu)\|^3 + \|F_{i,2}(\mu)\|^3
+ \sum_{k=4}^{7}\|F_{i,k}(\mu)\|^3\Bigr)\Biggr\} = O_p(1).
\]
Moreover, max_{1≤i≤n} sup_{(β,µ)∈B_{δ_0}} {||F_{i,1}(β, µ)|| + ||F_{i,2}(µ)||²} = o_p(n^{1/2}).
A.3. Assumption 3
(W_n(·), J_n(·)) ⇒ (W(·), J(·)), where these processes are indexed by ||µ|| ≤ M, and the limit process {(W(µ), J(µ)) : ||µ|| ≤ M} has bounded continuous sample paths with probability 1. Each J(µ) is a symmetric matrix and ∞ > sup_{||µ||≤M}[λ_max{J(µ)}] ≥ inf_{||µ||≤M}[λ_min{J(µ)}] > 0 holds almost surely.
The process W(µ) is a zero-mean R^r-valued Gaussian stochastic process {W(µ) : ||µ|| ≤ M} such that E{W(µ)W(µ)^T} = J(µ) and E{W(µ)W(µ′)^T} = J(µ, µ′) for any µ and µ′ in B(0, M).
A.4. Remarks
Assumption 1 is a generalized definition of identifiable uniqueness in the statistical literature. The assumption sup_{ω∈Ω} {n^{−1}|L_n(ω) − L̄_n(ω)|} → 0 in probability is a uniform law of large numbers; sufficient conditions for uniform laws of large numbers have been presented in the literature (see Pollard (1990)). Moreover, assumption 2 is quite reasonable in most situations. In assumption 3, to prove that W_n(µ) converges weakly to a Gaussian process W(µ), we need a functional central limit theorem; see Pollard (1990). Moreover, assumption 3 assumes the positive definiteness of J(µ), which is a generalization of the strong identifiability conditions that were used in Chen (1995) and Chen and Chen (2001). Under the independent and identically distributed framework, assumption (P0) of Dacunha-Castelle and Gassiat (1999) is a sufficient condition for assumptions 1 and 2, and their assumption (P1) is just the positive definiteness of J(µ) in assumption 3.
References<br />
Andrews, D. W. K. (1999) Estimation when a parameter is on a boundary: theory and applications. Econometrica,<br />
67, 1341–1383.<br />
Bickel, P. and Chernoff, H. (1993) Asymptotic distribution of the likelihood ratio statistic <strong>in</strong> a prototypical non<br />
regular problem. In Statistics and Probability: a Raghu Raj Bahadur Festschrift (eds J. K. Ghosh, S. K. Mitra,<br />
K. R. Parthasarathy and B. L. S. Prakasa Rao), pp. 83–96. New Delhi: Wiley.<br />
Chen, H. and Chen, J. (2001) The likelihood ratio test for homogeneity <strong>in</strong> the f<strong>in</strong>ite <strong>mixture</strong> <strong>models</strong>. Can. J.<br />
Statist., 29, 201–215.<br />
Chen, H., Chen, J. and Kalbfleisch, J. D. (2001) A modified likelihood ratio test for homogeneity <strong>in</strong> f<strong>in</strong>ite <strong>mixture</strong><br />
<strong>models</strong>. J. R. Statist. Soc. B, 63, 19–29.<br />
Chen, J. (1995) Optimal rate of convergence for f<strong>in</strong>ite <strong>mixture</strong> <strong>models</strong>. Ann. Statist., 23, 221–233.<br />
Cheng, R. C. H. and Liu, W. B. (2001) The consistency of estimators <strong>in</strong> f<strong>in</strong>ite <strong>mixture</strong> <strong>models</strong>. Scand. J. Statist.,<br />
28, 603–616.<br />
Cox, D. R. and H<strong>in</strong>kley, D. V. (1974) Theoretical Statistics. London: Chapman and Hall.<br />
Dacunha-Castelle, D. and Gassiat, E. (1999) Testing the order of a model using locally conic parameterization:
population <strong>mixture</strong>s and stationary ARMA processes. Ann. Statist., 27, 1178–1209.<br />
Diggle, P. J., Liang, K. Y. and Zeger, S. L. (2002) Analysis of Longitud<strong>in</strong>al Data, 2nd edn. New York: Oxford<br />
<strong>University</strong> Press.<br />
Hansen, B. E. (1996) Inference when a nuisance parameter is not identified under the null hypothesis. Econometrica,<br />
64, 413–430.<br />
Haseman, J. K. and Elston, R. C. (1972) The <strong>in</strong>vestigation of l<strong>in</strong>kage between a quantitative trait and a marker<br />
locus. Behav. Genet., 2, 3–19.<br />
Kosorok, M. R. (2003) Bootstraps of sums of <strong>in</strong>dependent but not identically distributed stochastic processes.<br />
J. Multiv. Anal., 84, 299–318.<br />
Lemdani, M. and Pons, O. (1999) Likelihood ratio tests <strong>in</strong> contam<strong>in</strong>ation <strong>models</strong>. Bernoulli, 5, 705–719.<br />
L<strong>in</strong>dsay, B. G. (1995) Mixture Models: Theory, Geometry and Applications. Hayward: Institute of Mathematical<br />
Statistics.<br />
McCullagh, P. and Nelder, J. A. (1989) Generalized L<strong>in</strong>ear Models, 2nd edn. London: Chapman and Hall.<br />
McLachlan, G. and Peel, D. (2000) F<strong>in</strong>ite Mixture Models. New York: Wiley.<br />
Pauler, D. K. and Laird, N. M. (2000) A <strong>mixture</strong> model for longitud<strong>in</strong>al data with application to assessment of<br />
noncompliance. Biometrics, 56, 464–472.<br />
Pollard, D. (1990) Empirical Processes: Theory and Applications. Hayward: Institute of Mathematical Statistics.<br />
Risch, N. and Zhang, H. P. (1995) Extreme discordant sib pairs for mapp<strong>in</strong>g quantitative trait loci <strong>in</strong> humans.<br />
Science, 268, 1584–1589.<br />
Teicher, H. (1963) Identifiability of f<strong>in</strong>ite <strong>mixture</strong>s. Ann. Math. Statist., 34, 1265–1269.<br />
Titter<strong>in</strong>gton, D. M., Smith, A. F. M. and Makov, U. E. (1985) The Statistical Analysis of F<strong>in</strong>ite Mixture Distributions.<br />
New York: Wiley.<br />
van der Vaart, A. (1996) Efficient estimation <strong>in</strong> semiparametric <strong>models</strong>. Ann. Statist., 24, 862–878.<br />
Wang, P. M. and Puterman, M. L. (1998) Mixed logistic <strong>regression</strong> <strong>models</strong>. J. Agric. Biol. Environ. Statist., 3,<br />
175–200.<br />
Wang, P. M., Puterman, M. L., Cockburn, I. and Le, N. (1996) Mixed Poisson <strong>regression</strong> <strong>models</strong> with covariate<br />
dependent rates. Biometrics, 52, 381–400.<br />
Zhang, H. P. and Bracken, M. (1995) Tree-based risk factor analysis of preterm delivery and small-for-gestational-age<br />
birth. Am. J. Epidem., 141, 70–78.
Zhang, H. P., Fui, R. and Zhu, H. T. (2003) A latent variable model of segregation analysis for ord<strong>in</strong>al outcome.<br />
J. Am. Statist. Ass., to be published.<br />
Zhang, H. P. and Merikangas, K. (2000) A frailty model of segregation analysis: understand<strong>in</strong>g the familial<br />
transmission of alcoholism. Biometrics, 56, 815–823.<br />
Zhu, H. T. and Zhang, H. P. (2003) <strong>Hypothesis</strong> <strong>test<strong>in</strong>g</strong> for f<strong>in</strong>ite <strong>mixture</strong> <strong>regression</strong> <strong>models</strong> (mathematical details).<br />
Technical Report. Yale <strong>University</strong> School of Medic<strong>in</strong>e, New Haven.