
J. R. Statist. Soc. B (2004)
66, Part 1, pp. 3–16

Hypothesis testing in mixture regression models

Hong-Tu Zhu and Heping Zhang

Yale University, New Haven, USA

[Received July 2002. Final revision June 2003]

Summary. We establish asymptotic theory for both the maximum likelihood and the maximum modified likelihood estimators in mixture regression models. Moreover, under specific and reasonable conditions, we show that the optimal convergence rate of n^{1/4} for estimating the mixing distribution is achievable for both the maximum likelihood and the maximum modified likelihood estimators. We also derive the asymptotic distributions of two log-likelihood ratio test statistics for testing homogeneity and we propose a resampling procedure for approximating the p-value. Simulation studies are conducted to investigate the empirical performance of the two test statistics. Finally, two real data sets are analysed to illustrate the application of our theoretical results.

Keywords: Hypothesis testing; Logistic regression; Mixture regression; Poisson regression; Power

1. Introduction

Finite mixture models arise in many applications, particularly in biology, psychology and genetics. Statistical inference and computation based on these models pose a serious challenge; see Titterington et al. (1985), Lindsay (1995) and McLachlan and Peel (2000) for systematic reviews. The purpose of this work is to establish asymptotic theory for mixture regression models originating from those applications.

Suppose that we observe data from n units and, within each unit, say unit i, we have n_i measurements, i = 1, ..., n. This is a typical data structure in longitudinal and family studies. Before introducing the additional notation, let us examine a few examples.

1.1. Example 1: a finite mixture logistic regression model

To study the genetic inheritance pattern of a binary trait such as alcoholism, Zhang and Merikangas (2000) proposed a frailty model in which the data consist of a binary vector response y_i = (Y_i1, ..., Y_{i,n_i})^T and covariates X_i from the ith family, i = 1, ..., n. To model the potential familial correlation, they introduced a Bernoulli latent variable U_i for each family. Conditionally on all latent variables {U_i}, the Y_ij s are assumed to be independent and to follow the logistic regression model

    logit{P(Y_ij = 1 | U_i)} = x_ij β + z_ij {U_i µ_1 + (1 − U_i) µ_2},    (1)

where x_ij is a covariate vector in X_i from the jth member in the ith family, z_ij is a part of x_ij that interacts with the latent variable and the β and the µs are parameters. The interaction terms in model (1) are very important in genetic studies for assessing potential gene–environment interactions.

Address for correspondence: Heping Zhang, Department of Epidemiology and Public Health, Yale University School of Medicine, 60 College Street, New Haven, CT 06520-8034, USA. E-mail: Heping.Zhang@Yale.EDU

© 2004 Royal Statistical Society 1369–7412/04/66003
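To make model (1) concrete, here is a minimal simulation sketch. The sample sizes, parameter values and the choice z_ij = 1 (so the latent variable shifts the intercept) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_frailty_logistic(n_families=200, n_members=3, alpha=0.3,
                              beta=0.5, mu1=2.0, mu2=-1.0):
    """Simulate binary family data from the frailty logistic model (1).

    One latent Bernoulli U_i is shared within each family; conditionally
    on U_i the Y_ij are independent logistic-regression responses, with
    z_ij = 1 here.  All parameter values are illustrative.
    """
    x = rng.normal(size=(n_families, n_members))       # covariates x_ij
    u = rng.binomial(1, 1.0 - alpha, size=n_families)  # P(U_i = 0) = alpha
    eta = x * beta + np.where(u[:, None] == 1, mu1, mu2)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))    # logit link
    return y, x, u

y, x, u = simulate_frailty_logistic()
```

Conditionally on U_i the responses are independent, but marginally the Y_ij within a family are correlated through the shared U_i, which is exactly the familial correlation that the mixture captures.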

Beyond this example, finite mixtures of Bernoulli distributions such as model (1) have received much attention in the last five decades. See Teicher (1963) for an early example. More recently, Wang and Puterman (1998) among others generalized binomial finite mixtures to mixture logistic regression models, and Zhang et al. (2003) applied mixture cumulative logistic models to analyse correlated ordinal responses.

1.2. Example 2: mixture of non-linear hierarchical models

Longitudinal and genetic studies commonly involve a continuous response {Y_ij}, also referred to as a quantitative trait. See, for example, Diggle et al. (2002), Haseman and Elston (1972) and Risch and Zhang (1995). Pauler and Laird (2000) used general finite mixture non-linear hierarchical models to analyse longitudinal data from heterogeneous subpopulations. Specifically, when there are only two subgroups, the model is of the form

    Y_ij = g{x_ij, β, U_i z_ij µ_1 + (1 − U_i) z_ij µ_2} + ε_ij,

where the ε_ij s are independent and identically distributed according to N(0, σ²) and g(·) is a prespecified function. Here, the known covariates x_ij may contain observed time points to reflect a time course in longitudinal data.

1.3. Example 3: a finite mixture of Poisson regression models

The Poisson distribution and Poisson regression have been widely used to analyse count data (McCullagh and Nelder, 1989), but observed count data often exhibit overdispersion relative to the Poisson law. Finite mixture Poisson regression models (Wang et al., 1996) provide a plausible explanation for overdispersion. Specifically, conditionally on all U_i s, the Y_ij s are independent and follow the Poisson regression model

    p(Y_ij = y_ij | x_ij, U_i) = (1/y_ij!) λ_ij^{y_ij} exp(−λ_ij),    (2)

where λ_ij = exp{x_ij β + U_i z_ij µ_1 + (1 − U_i) z_ij µ_2}.
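As a quick numerical check of equation (2), the marginal probability mass function of Y_ij mixes the two Poisson components with weights 1 − α and α (for U_i = 1 and U_i = 0 respectively, matching model (3) below). The scalar covariates, function names and parameter values in this sketch are illustrative.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability mass function."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def mixture_poisson_pmf(y, x, z, alpha, beta, mu1, mu2):
    """Marginal pmf of Y_ij under model (2) after integrating out U_i,
    with P(U_i = 1) = 1 - alpha.  Scalars throughout, for simplicity."""
    lam1 = math.exp(x * beta + z * mu1)  # rate when U_i = 1
    lam2 = math.exp(x * beta + z * mu2)  # rate when U_i = 0
    return (1 - alpha) * poisson_pmf(y, lam1) + alpha * poisson_pmf(y, lam2)

# with mu1 == mu2 the mixture collapses to a single Poisson distribution
p = mixture_poisson_pmf(2, x=0.1, z=1.0, alpha=0.3, beta=0.5, mu1=0.7, mu2=0.7)
```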

To summarize the models presented above, we consider a random sample of n independent observations {y_i, X_i}_1^n with the density function

    p_i(y_i, x_i; ω) = {(1 − α) f_i(y_i, x_i; β, µ_1) + α f_i(y_i, x_i; β, µ_2)} g_i(x_i),    (3)

where g_i(x_i) is the distribution function of X_i. Further, ω = (α, β, µ_1, µ_2) is the unknown parameter vector, in which β (q_1 × 1) measures the strength of association that is contributed by the covariate terms and the two q_2 × 1 vectors, µ_1 and µ_2, represent the different contributions from two different groups.

Equivalently, if we consider P(U_i = 0) = 1 − P(U_i = 1) = α, and assume that the conditional density of y_i given U_i is p_i(y_i | U_i) = f_i{y_i, x_i; β, µ_2(1 − U_i) + µ_1 U_i}, then model (3) is the marginal density of y_i. In fact, McCullagh and Nelder (1989) considered a special case in which f_i is from an exponential family distribution, i.e.

    f_i{y_i, x_i; β, µ_2(1 − U_i) + µ_1 U_i} = ∏_{j=1}^{n_i} exp[φ{y_ij θ_ij − a(θ_ij)} + c(y_ij, φ)],    (4)

where θ_ij = h{x_ij, β, U_i µ_1 + (1 − U_i) µ_2}, h(·) is a link function and φ is a dispersion parameter. This family of mixture regression models is very useful in practice.
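For concreteness, the log-likelihood corresponding to model (3) can be written down directly once f_i is specified. The sketch below takes normal components, as in Example 2 with an identity g; the array shapes, function names and the common σ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mixture_loglik(y, z, alpha, mu1, mu2, sigma=1.0):
    """Log-likelihood of model (3) with normal components: within unit i
    the Y_ij are conditionally independent N(z_ij * mu, sigma^2), where
    mu = mu1 if U_i = 1 and mu = mu2 if U_i = 0, and P(U_i = 0) = alpha.

    y, z : arrays of shape (n_units, n_measurements); mu1, mu2 : scalars.
    """
    def unit_logdens(mu):
        resid = (y - z * mu) / sigma
        # sum the per-measurement normal log-densities within each unit
        return np.sum(-0.5 * resid ** 2 - np.log(sigma)
                      - 0.5 * np.log(2 * np.pi), axis=1)

    l1 = unit_logdens(mu1)  # U_i = 1 component, mixing weight 1 - alpha
    l2 = unit_logdens(mu2)  # U_i = 0 component, mixing weight alpha
    # log of the two-component mixture, unit by unit, then summed over units
    return float(np.sum(np.logaddexp(np.log1p(-alpha) + l1,
                                     np.log(alpha) + l2)))

# when mu1 == mu2 this collapses to an ordinary normal log-likelihood
ll = mixture_loglik(np.zeros((2, 3)), np.ones((2, 3)),
                    alpha=0.3, mu1=0.0, mu2=0.0)
```

Working on the log scale with `np.logaddexp` keeps the per-unit mixture density numerically stable when the within-unit products of densities underflow.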



Asymptotic theory is critical to the understanding of the behaviour of any model. Existing asymptotic results require, however, that f_i in equation (3) is identical among the observation units. This requirement is too restrictive in longitudinal and family studies. Thus, our aim is to eliminate this restriction by allowing f_i to vary between study subjects or families as a result of study designs or missing data.

Let P_* denote the true model from which the data are generated. The well-known identifiability problem in mixture models implies that there may be a set of parameters that yield P_*. We use Ω_* = {ω_* ∈ Ω : P_{ω_*} = P_*} to represent this set of true model parameters. Here, Ω is the entire parameter space. The use of an asterisk in the subscript reminds us that the parameter belongs to the true parameter set. In the light of the symmetry for α, without loss of generality, we consider only α ∈ [0, 0.5]. The parameter space Ω is defined as Ω = {ω : α ∈ [0, 0.5], β ∈ B, ||µ_1|| ≤ M, ||µ_2|| ≤ M}, where M is a large positive scalar such that ||µ_*|| < M.



    LR_n = sup_{ω∈Ω} {L_n(ω)} − sup_{ω∈Ω_0} {L_n(ω)},

where Ω_0 = {ω ∈ Ω : α = 0.5, µ_1 = µ_2}. A challenging issue is to derive the asymptotic distribution of LR_n. To do so, we need to define some notation.

For a generic symmetric q_2 × q_2 matrix B, Vecs(B) and dvecs(B) are defined as

    Vecs(B) = (b_11, b_21, b_22, ..., b_{q_2 1}, ..., b_{q_2 q_2})^T

and

    dvecs(B) = (b_11, 2b_21, b_22, 2b_31, 2b_32, b_33, ..., 2b_{q_2 1}, ..., 2b_{q_2, q_2−1}, b_{q_2 q_2})^T

respectively. Furthermore, we define F_{i,1}(β, µ) = ∂_β f_i(β, µ)/f_{i*}, F_{i,2}(µ) = ∂_µ f_i(β_*, µ)/f_{i*} and F_{i,6}(µ) = ∂²_{µµ} f_i(β_*, µ)/f_{i*}. Let

    w_i(µ) = (F_{i,1}(β_*, µ_*)^T, F_{i,2}(µ_*)^T, dvecs(F_{i,6}(µ))^T)^T,
    W_n(µ) = (1/√n) Σ_{i=1}^n w_i(µ),
    J_n(µ) = (1/n) Σ_{i=1}^n w_i(µ) w_i(µ)^T,

where W_n(µ) is an r × 1 vector and J_n(µ) is an r × r matrix. Finally, let k_1(ω) = (1 − α) ∆µ_1 + α ∆µ_2, k_2(ω) = Vecs{(1 − α) ∆µ_1^{⊗2} + α ∆µ_2^{⊗2}} and K(ω) = (∆β, k_1(ω), k_2(ω)), in which a^{⊗2} = aa^T, ∆β = β − β_*, ∆µ_1 = µ_1 − µ_* and ∆µ_2 = µ_2 − µ_*.
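The Vecs and dvecs operators are easy to get wrong by an index, so a small reference implementation may help; the function names are of course illustrative.

```python
import numpy as np

def vecs(B):
    """Vecs(B): stack the lower triangle of a symmetric matrix row by row,
    giving (b11, b21, b22, b31, b32, b33, ...)."""
    return np.concatenate([B[i, : i + 1] for i in range(B.shape[0])])

def dvecs(B):
    """dvecs(B): same ordering as Vecs(B) with off-diagonal entries doubled,
    giving (b11, 2*b21, b22, 2*b31, 2*b32, b33, ...)."""
    rows = []
    for i in range(B.shape[0]):
        rows.append(2.0 * B[i, :i])   # doubled off-diagonal entries of row i
        rows.append(B[i, i : i + 1])  # diagonal entry of row i
    return np.concatenate(rows)

B = np.array([[1.0, 2.0],
              [2.0, 5.0]])
v, d = vecs(B), dvecs(B)   # v = [1, 2, 5], d = [1, 4, 5]
```

Both vectors have length q_2(q_2 + 1)/2, which is where the r × 1 dimension of w_i(µ) comes from.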

Under Ω_0, L_n(ω) simplifies to Σ_{i=1}^n log{f_i(β, µ)/f_{i*}}. Standard asymptotic theory yields that

    sup_{ω∈Ω_0} {2 L_n(ω)} = W_n(µ_*)^T H {H^T J_n(µ_*) H}^{−1} H^T W_n(µ_*) + o_p(1),

where H = (I_{q_1+q_2}, 0^T)^T is a {q_1 + q_2 + q_2(q_2 + 1)/2} × (q_1 + q_2) matrix. Furthermore, if we

consider the maximum likelihood estimator ω̂_M = (α̂_M, β̂_M, µ̂_1M, µ̂_2M) = arg max_{ω∈Ω} {L_n(ω)} in Ω, we need to address several fundamental issues such as the consistency and rate of convergence of K(ω̂_M) and the asymptotic expansion of L_n(ω̂_M). After some tedious manipulations (see Zhu and Zhang (2003) for details), we find that

    2 L_n(ω̂_M) = sup_{||µ_2|| ≤ M} (W_n(µ_2)^T J_n(µ_2)^{−1} W_n(µ_2) − inf_{ω∈Ω_{µ_2}} [Q_n{√n K(ω), µ_2}]) + o_p(1)

holds under assumptions 1–3 that are given in Appendix A, where Ω_{µ_2} = {ω ∈ Ω : µ_2 is fixed} and

    Q_n(λ, µ_2) = (λ − J_n(µ_2)^{−1} W_n(µ_2))^T J_n(µ_2) (λ − J_n(µ_2)^{−1} W_n(µ_2)).

Define the convex cone Λ_{µ_2} as

    Λ_{µ_2} = { η : η = [ ∆µ_2             I_{q_2} ]
                        [ Vecs(∆µ_2^{⊗2})   0     ] x,   x = (x_1, ..., x_{q_2+1})^T ∈ [0, ∞) × R^{q_2} }.

Then, the log-likelihood-ratio statistic becomes

    LR_n = sup_{||µ_2|| ≤ M} { V̂(µ_2)^T J_n(µ_2) V̂(µ_2) } − W_n(µ_*)^T H {H^T J_n(µ_*) H}^{−1} H^T W_n(µ_*) + o_p(1),    (6)

where Q_n{V̂(µ_2), µ_*} = inf_{λ ∈ R^{q_1} × Λ_{µ_2}} {Q_n(λ, µ_*)}.



We obtain the following theorems, whose detailed proofs are given in Zhu and Zhang (2003).

Theorem 1. Under assumptions 1–3 in Appendix A, the following results hold.

(a) K(ω̂_M) = O_p(n^{−1/2}).
(b) The log-likelihood-ratio statistic converges in distribution to the maximum of a stochastic process that is indexed by µ_2 as n → ∞.
(c) If q_2 = 1, ∫_µ |Ĝ_M(µ) − G(µ)| dµ = O_p(n^{−1/4}), where G(µ) = I(µ ≥ µ_*) is the true mixing distribution and Ĝ_M(µ) = G(µ; µ̂_1M, µ̂_2M, α̂_M) = (1 − α̂_M) I(µ ≥ µ̂_1M) + α̂_M I(µ ≥ µ̂_2M).

To our knowledge, theorem 1, part (a), provides the convergence rate of K(ω̂_M) for the first time under model (3). Our result has several implications. First, we confirm that the convergence rate of β̂ (the maximum likelihood estimator) is n^{−1/2} under the conditions defined. van der Vaart (1996) proved a similar result under semiparametric mixture models. Second, under hypothesis H_0, we prove that only k_1(ω̂_M) and k_2(ω̂_M) can reach the rate n^{−1/2}, which is useful for determining the asymptotic distributions of the estimators. Third, theorem 1, part (c), implies that Ĝ_M(µ) converges to G(µ) in L_1-norm at the optimal rate n^{−1/4} under a more general model than that considered by Chen (1995). It is important to note that, by theorem 1 of Chen (1995), the optimal rate of convergence for estimating G(µ) in L_1-norm is at most n^{−1/4}, so Ĝ_M(µ) attains this lower bound.

As in Chen et al. (2001), we consider an alternative approach to testing hypothesis (5) by using a modified log-likelihood function

    ML_n(ω) = L_n(ω) + log(M) log{4α(1 − α)},

where M is as defined above. Compared with the log-likelihood function, the extra term log(M) log{4α(1 − α)} in ML_n(ω) keeps α away from both 0 and 1, which partially solves the identifiability problem. Let ω̂_P be the resulting estimator, ω̂_P = arg max_{ω∈Ω} {ML_n(ω)}. The modified log-likelihood-ratio statistic is defined as

    MLR_n = 2 ML_n(ω̂_P) − sup_{ω∈Ω_0} {2 ML_n(ω)}.

Similarly to the log-likelihood-ratio statistic, we find that

    MLR_n = W_n(µ_*)^T [J_n(µ_*)^{−1} − H {H^T J_n(µ_*) H}^{−1} H^T] W_n(µ_*) − inf_{λ ∈ R^{q_1} × Λ_0} {Q_n(λ, µ_*)} + o_p(1),    (7)

where Λ_0 is a closed cone defined by

    Λ_0 = {η : η = (x^T, Vecs(y^{⊗2})^T)^T, x and y ∈ R^{q_2}}.

We have the following asymptotic result for both ω̂_P and MLR_n.

Theorem 2. Under the assumptions of theorem 1, we have the following results.

(a) α̂_P = O_p(1), ∆β̂_P = O_p(n^{−1/2}), ∆µ̂_1P = O_p(n^{−1/4}) and ∆µ̂_2P = O_p(n^{−1/4}).
(b) The modified log-likelihood-ratio statistic converges in distribution to a random variable U_2 as n → ∞.
(c) If q_2 = 1, ∫_µ |Ĝ_P(µ) − G(µ)| dµ = O_p(n^{−1/4}), where Ĝ_P(µ) = (1 − α̂_P) I(µ ≥ µ̂_1P) + α̂_P I(µ ≥ µ̂_2P).

Theorem 2, part (a), gives the exact convergence rate of ω̂_P. Theorem 2, part (b), determines the asymptotic distribution of MLR_n. Whereas the explicit form of this distribution is generally complicated, our result gives rise to a simple asymptotic distribution, 0.5χ²_1 + 0.5χ²_0, for MLR_n when q_2 = 1. This coincides with theorem 1 of Chen et al. (2001) when there are no covariates, i.e. q_1 = 0. Theorem 2, part (c), shows that the n^{−1/4} consistent rate for estimating the mixing distribution G(µ) is attainable by using ω̂_P.

So far, we have systematically investigated the convergence rates of both K(ω̂_P) and K(ω̂_M), and the asymptotic distributions of both LR_n and MLR_n. Two further issues are also very important. The first is how to compute the critical values for these potentially complicated distributions of the test statistics. The second is to compare the power of LR_n and MLR_n. In what follows, we study the empirical and asymptotic behaviour of these two statistics under departures from hypothesis H_0.

2.1. A resampling method

Although we have obtained the asymptotic distributions of the likelihood-based statistics, the limiting distributions usually have complicated analytic forms. To alleviate this difficulty, we use a resampling technique to calculate the critical values of the test statistics. Although the bootstrap is an obvious approach, it requires repeated maximization of the likelihood and modified likelihood functions. These maximizations are computationally intensive for finite mixture models. Thus, we prefer a computationally more efficient method, as proposed and used by Hansen (1996), Kosorok (2003) and others.

On the basis of equations (6) and (7), we only need to focus on W_n(µ_2) and J_n(µ_2). If hypothesis H_0 is true, (β̂_0, µ̂_0) = arg max_{ω∈Ω_0} {L_n(ω)} provides consistent estimators of β_* and µ_*. By substituting (β_*, µ_*) with (β̂_0, µ̂_0) in the definitions of W_n(µ_2) and J_n(µ_2), we obtain ŵ_i(µ_2), Ŵ_n(µ_2) and Ĵ_n(µ_2) accordingly, i.e.

    ŵ_i(µ_2) = (F_{i,1}(β̂_0, µ̂_0)^T, F_{i,2}(β̂_0, µ̂_0)^T, dvecs(F_{i,6}(β̂_0, µ_2))^T)^T,
    Ŵ_n(µ_2) = (1/√n) Σ_{i=1}^n ŵ_i(µ_2),
    Ĵ_n(µ_2) = (1/n) Σ_{i=1}^n ŵ_i(µ_2) ŵ_i(µ_2)^T.

To assess the significance level and the power of the two test statistics, we need to obtain empirical distributions for these statistics in lieu of their theoretical distributions. The following are the four key steps in generating the stochastic processes that have the same asymptotic distributions as the test statistics.

Step 1: we generate independent and identically distributed random samples, {v_{i,m} : i = 1, ..., n}, from the standard normal distribution N(0, 1). Here, m represents a replication number.

Step 2: we calculate

    Ŵ_n^m(µ_2) = (1/√n) Σ_{i=1}^n ŵ_i(µ_2) v_{i,m}

and

    Q_n^m(λ, µ_2) = (λ − Ĵ_n(µ_2)^{−1} Ŵ_n^m(µ_2))^T Ĵ_n(µ_2) (λ − Ĵ_n(µ_2)^{−1} Ŵ_n^m(µ_2)).

It is important to note that Ŵ_n^m(µ_2) converges weakly to W(µ_2) as n → ∞. This claim can be proved by using the conditional central limit theorem; see theorem (10.2) of Pollard (1990) and theorem 2 of Hansen (1996).


Step 3: the third step is to calculate the likelihood ratio statistics

    LR_n^m = sup_{||µ_2|| ≤ M} { V̂^m(µ_2)^T Ĵ_n(µ_2) V̂^m(µ_2) } − Ŵ_n^m(µ̂_0)^T H {H^T Ĵ_n(µ̂_0) H}^{−1} H^T Ŵ_n^m(µ̂_0)

and

    MLR_n^m = Ŵ_n^m(µ̂_0)^T [Ĵ_n(µ̂_0)^{−1} − H {H^T Ĵ_n(µ̂_0) H}^{−1} H^T] Ŵ_n^m(µ̂_0) − inf_{λ ∈ R^{q_1} × Λ_0} {Q_n^m(λ, µ̂_0)},

where Q_n^m{V̂^m(µ_2), µ_2} = inf_{λ ∈ R^{q_1} × Λ_{µ_2}} {Q_n^m(λ, µ_2)}.

Step 4: we repeat the above three steps J times and obtain two realizations: {LR_n^m : m = 1, ..., J} and {MLR_n^m : m = 1, ..., J}. It can be shown that the empirical distribution of LR_n^m converges to the asymptotic distribution of LR_n, and likewise for MLR_n^m. Therefore, the empirical distributions of these two realizations form the basis for calculating the critical values in hypothesis testing as well as for the power analysis.
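Steps 1, 2 and 4 can be sketched compactly. Below, the n × r matrix of estimated score vectors ŵ_i is assumed to be precomputed (obtaining it requires the model-specific derivatives F_{i,1}, F_{i,2} and F_{i,6}); the toy score matrix and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def resample_W(w_hat, n_rep):
    """Steps 1-2: draw i.i.d. N(0, 1) multipliers v_{i,m} and form
    W_n^m = n^{-1/2} * sum_i w_hat_i * v_{i,m}, one r-vector per replication.

    w_hat : (n, r) array of estimated score vectors.
    Returns an (n_rep, r) array whose rows are the W_n^m."""
    n = w_hat.shape[0]
    v = rng.standard_normal((n_rep, n))  # step 1: normal multipliers
    return v @ w_hat / np.sqrt(n)        # step 2: resampled score processes

def resampled_pvalue(stats_m, stat_obs):
    """Step 4: fraction of resampled statistics exceeding the observed one."""
    return float(np.mean(stats_m > stat_obs))

# toy usage with a made-up one-dimensional score (r = 1)
w_hat = rng.standard_normal((100, 1))
W_m = resample_W(w_hat, n_rep=5000)
```

Conditionally on the data, each Ŵ_n^m is exactly Gaussian with covariance Ĵ_n; no refitting of the mixture model is needed, which is the computational saving over the bootstrap.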

We approximate p_1 = Pr(U_1 > LR_n) and p_2 = Pr(U_2 > MLR_n) by using

    p_1 ≈ p̂_1^J = Σ_{m=1}^J I(LR_n^m > LR_n)/J,
    p_2 ≈ p̂_2^J = Σ_{m=1}^J I(MLR_n^m > MLR_n)/J.

Since J is under our control, p̂_1^J can be made arbitrarily close to p_1 by using a sufficiently large J. To guide the selection of J in practice, it is useful to note that √J (p̂_1^J − p_1) converges in distribution to a normal distribution with mean 0 and variance p_1(1 − p_1) as J → ∞ and that p̂_1^J lies, with high probability, in (p_1 − 2.5√{p_1(1 − p_1)/J}, p_1 + 2.5√{p_1(1 − p_1)/J}); see also Hansen (1996). Thus, if we want to set an error control δ_0 for |p̂_1^J − p_1|, we can require that 2.5√{p_1(1 − p_1)/J} ≤ δ_0. For example, when p_1 = 0.01, choosing J = 10000 yields an error of about 0.0025. Exactly the same formula applies to p_2 as to p_1.
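The error bound above inverts into a sample-size rule for the resampling: solving 2.5√{p(1 − p)/J} ≤ δ_0 for J gives J ≥ 6.25 p(1 − p)/δ_0². A small sketch (function names are illustrative):

```python
import math

def resampling_error(p, J):
    """Half-width 2.5 * sqrt(p * (1 - p) / J) of the interval that contains
    the Monte Carlo estimate of p with high probability."""
    return 2.5 * math.sqrt(p * (1 - p) / J)

def replications_needed(p, delta0):
    """Smallest J satisfying 2.5 * sqrt(p * (1 - p) / J) <= delta0."""
    return math.ceil(6.25 * p * (1 - p) / delta0 ** 2)

# the worked example from the text: p = 0.01 with J = 10000
err = resampling_error(0.01, 10_000)   # about 0.0025
```

Since the error shrinks only like 1/√J, halving the target error quadruples the number of replications, which is why a computationally cheap per-replication step matters.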

2.2. Asymptotic local power

An assessment of power is necessary for choosing a reasonable sample size as well as appropriate and powerful tests of significance (Cox and Hinkley (1974), page 103). Often, there is no closed form for the power calculation. Many researchers have considered alternatives by exploiting the asymptotic local power. In our case, the distribution of J_n(µ_2)^{−1} W_n(µ_2) plays a critical role in determining the asymptotic local power of LR_n and MLR_n; see equations (6) and (7). Thus, we explore its properties under a sequence of local alternatives.

Consider a sequence of local alternatives ω_n consisting of α_n = α_0, β_n = β_* + n^{−1/2} h_1, µ_1^n = µ_* − n^{−1/4} {α_0/(1 − α_0)}^{0.5} h_2 and µ_2^n = µ_* + n^{−1/4} {(1 − α_0)/α_0}^{0.5} h_2, where α_0 is a constant between 0 and 1, and h_1 and h_2 are q_1 × 1 and q_2 × 1 vectors respectively. At ω_n, K(ω_n) = n^{−1/2} h, where h^T = (h_1^T, 0^T, Vecs(h_2^{⊗2})^T). We have the following theorem.

Theorem 3. Under assumptions 1–3 in Appendix A and the alternatives ω_n,

    J_n(µ_2)^{−1} W_n(µ_2) →_d N{J(µ_2)^{−1} J(µ_2, µ_*) h, J(µ_2)^{−1}}.

Theorem 3 can be used to compare the local power of our test statistics. In particular, J_n(µ_*)^{−1} W_n(µ_*) converges in distribution to N{h, J(µ_*)^{−1}}. For instance, the asymptotic power function that is associated with the modified log-likelihood-ratio statistic is

    p_α(h) = lim_{n→∞} [P{F_0(MLR_n) ≥ 1 − α | h}],

where F_0(·) is the asymptotic distribution of MLR_n under hypothesis H_0 as n → ∞. As ||h|| → ∞, p_α(h) converges to 1, since MLR_n becomes large as ||h|| increases.

3. Simulation study and real examples

Two computational issues are related to our test procedures. First, we must calculate three different estimators: (β̂_0, µ̂_0), ω̂_M and ω̂_P. It is relatively straightforward to compute (β̂_0, µ̂_0) and ω̂_P by using Newton–Raphson and/or EM algorithms. Since finding a global maximizer is still an open problem, we employ the EM algorithm with 200 different starting points. During the computation, we use M = 10 to obtain ω̂_P.

Secondly, to approximate the critical values of the asymptotic distributions, we need to obtain two realizations: {MLR_n^m : m = 1, ..., J} and {LR_n^m : m = 1, ..., J}. Obtaining the latter is computationally burdensome even if q_2 is as small as 2. Hence, it is sensible to consider replacing B(0, M) = {µ : ||µ|| ≤ M} with a discrete approximation B(0, M)_A. Then, LR_n^m is approximated by

    max_{µ_2 ∈ B(0,M)_A} { V̂^m(µ_2)^T Ĵ_n(µ_2) V̂^m(µ_2) } − Ŵ_n^m(µ̂_0)^T H {H^T Ĵ_n(µ̂_0) H}^{−1} H^T Ŵ_n^m(µ̂_0).

Owing to the computational complexity, we only consider LR_n when q_2 = 1 and M = 10, i.e. B(0, 10) = [−10, 10]. We take B(0, 10)_A = {−10, −10 + 0.02, ..., 10 − 0.02, 10}.

3.1. Simulated data

In this subsection, we conduct some simulation studies to assess the performance of the two tests that were introduced in Section 2. Data are drawn from the mixture linear regression model

    y_ij = z_ij {U_i µ̃_1 + (1 − U_i) µ̃_2} + ε_ij,

where µ̃_1 = β + µ_1, µ̃_2 = β + µ_2 and ε_ij ~ N(0, 1). From now on, we use (µ̃_1, µ̃_2) instead of (β, µ_1, µ_2) to avoid the identifiability problem.

We consider two cases: q_2 = 1 and q_2 = 2. For q_2 = 1, all the z_ij s are equal to 1. For q_2 = 2, z_ij = (1, u_ij), where u_ij is drawn from the uniform distribution on [0, 2]. Therefore, we have parameters µ̃_1, µ̃_2, α and σ. The true value of σ is 1 and the number of observations in each cluster is set equal to 3. It should be noted that σ is the true nuisance parameter in the general model (3), implying that q_1 = 1.
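The simulation designs described above can be generated as follows; the random seed and function name are illustrative, but the structure (clusters of size 3, P(U_i = 0) = α, z_ij as stated) follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cluster_data(n=50, m=3, alpha=0.5, mu1=(1.0,), mu2=(2.0,), q2=1):
    """Draw y_ij = z_ij {U_i mu1~ + (1 - U_i) mu2~} + eps_ij with
    eps_ij ~ N(0, 1) and m = 3 observations per cluster.
    q2 = 1: all z_ij = 1.  q2 = 2: z_ij = (1, u_ij), u_ij ~ Uniform[0, 2]."""
    mu1, mu2 = np.atleast_1d(mu1), np.atleast_1d(mu2)
    u = rng.binomial(1, 1.0 - alpha, size=n)            # P(U_i = 0) = alpha
    if q2 == 1:
        z = np.ones((n, m, 1))
    else:
        z = np.stack([np.ones((n, m)), rng.uniform(0, 2, (n, m))], axis=-1)
    coef = np.where(u[:, None] == 1, mu1, mu2)          # (n, q2), per cluster
    y = np.einsum('imq,iq->im', z, coef) + rng.standard_normal((n, m))
    return y, z, u

# a design like C1: q2 = 1, alpha = 0.5, mu1~ = 1, mu2~ = 2
y, z, u = simulate_cluster_data(n=100, alpha=0.5, mu1=1.0, mu2=2.0)
```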

For q_2 = 1, four different settings of (α, µ̃_1, µ̃_2), denoted A1, B1, C1 and D1, are considered. Similarly, for q_2 = 2, four other different settings of (α, µ̃_1, µ̃_2), denoted A2, B2, C2 and D2, are considered. See Table 1 for all eight designs. We choose sample sizes n = 50 and n = 100.

Table 1. Eight designs for the simulation study (1_2 denotes the 2 × 1 vector of ones)

Parameter   A1    B1    C1    D1    A2     B2           C2            D2
α           0.5   0.5   0.5   0.2   0.5    0.5          0.5           0.2
µ̃_1         1     1     1     1     1_2    1_2          1_2           1_2
µ̃_2         1     1.4   2     2     1_2    1.4 × 1_2    1.707 × 1_2   1.707 × 1_2


Mixture Regression Models 11<br />

For each simulation, the three significance levels 10%, 5% and 1% are considered, and 50000 replications are used to estimate the rejection rates. To calculate standard errors for the rejection rate in each case, we partition the 50000 replications into 50 consecutive blocks of 1000 replications each, which yield 50 estimates of the rejection rate, and we take the standard error of these 50 estimates.

Table 2 presents the estimates and standard errors of the rejection rates for the log-likelihood-ratio statistic. As expected, the power increases with the distance between µ̃_1 and µ̃_2. Moreover, the simulated critical values are precise, and the null rejection level is close to the nominal level.

Tables 3 and 4 report the estimates and standard errors of the rejection rates for the modified log-likelihood-ratio statistic. The proposed testing procedure based on MLR_n performs remarkably well. First, the power increases rapidly as the distance between µ̃_1 and µ̃_2 becomes large. Second, the simulated critical values are quite accurate; in particular, the null rejection level is quite close to the nominal level. Third, as expected, a larger sample size produces better results.

Table 2. Estimates and standard errors of rejection rates for the log-likelihood-ratio statistic of a one-component mixture against a two-component mixture

Results (q_2 = 1) for the following designs and values of n:

True rate                     A1              B1              C1              D1
                         n=50    n=100   n=50    n=100   n=50    n=100   n=50    n=100
0.010  Estimate          0.0151  0.0146  0.0399  0.0528  0.4943  0.8177  0.2886  0.5422
       Standard error    0.0046  0.0044  0.0070  0.0065  0.0153  0.0139  0.0148  0.0182
0.050  Estimate          0.0596  0.0576  0.1253  0.1592  0.7218  0.9352  0.5077  0.7535
       Standard error    0.0079  0.0075  0.0102  0.0105  0.0112  0.0071  0.0168  0.0145
0.100  Estimate          0.1072  0.1069  0.2031  0.2540  0.8192  0.9669  0.6281  0.8404
       Standard error    0.0094  0.0094  0.0141  0.0132  0.0114  0.0054  0.0139  0.0128

Table 3. Estimates and standard errors of rejection rates for the modified log-likelihood-ratio statistic of a one-component mixture against a two-component mixture

Results (q_2 = 1) for the following designs and values of n:

True rate                     A1              B1              C1              D1
                         n=50    n=100   n=50    n=100   n=50    n=100   n=50    n=100
0.0100 Estimate          0.0151  0.0121  0.0404  0.0572  0.5601  0.8732  0.2886  0.5439
       Standard error    0.0041  0.0038  0.0068  0.0082  0.0139  0.0086  0.0122  0.0165
0.0500 Estimate          0.0539  0.0520  0.1264  0.1659  0.7779  0.9607  0.5053  0.7525
       Standard error    0.0079  0.0063  0.0101  0.0141  0.0131  0.0060  0.0132  0.0145
0.1000 Estimate          0.0981  0.0980  0.2075  0.2628  0.8641  0.9821  0.6270  0.8385
       Standard error    0.0102  0.0094  0.0096  0.0163  0.0099  0.0040  0.0150  0.0115



Table 4. Estimates and standard errors of rejection rates for the modified log-likelihood-ratio statistic of a one-component mixture against a two-component mixture

Results (q_2 = 2) for the following designs and values of n:

True rate                     A2              B2              C2              D2
                         n=50    n=100   n=50    n=100   n=50    n=100   n=50    n=100
0.0100 Estimate          0.0166  0.0148  0.2330  0.4843  0.9613  0.9998  0.7165  0.9611
       Standard error    0.0046  0.0029  0.0121  0.0165  0.0056  0.0005  0.0159  0.0070
0.0500 Estimate          0.0636  0.0585  0.4459  0.7081  0.9905  0.9999  0.8502  0.9866
       Standard error    0.0078  0.0073  0.0141  0.0138  0.0031  0.0002  0.0113  0.0037
0.1000 Estimate          0.1143  0.1078  0.5686  0.8061  0.9960  1.0000  0.9001  0.9923
       Standard error    0.0103  0.0102  0.0143  0.0108  0.0019  0.0001  0.0087  0.0024

By inspecting Tables 2–4, we find that the log-likelihood-ratio statistic does not have any advantage over the modified log-likelihood-ratio statistic. In fact, the opposite seems to be so. Although we do not have a full explanation yet, the computational complexity could be one of the reasons. Thus, on the basis of our simulation studies, we prefer to use the modified log-likelihood-ratio statistic to test hypothesis (5) because of its easier computation and better power.
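The rejection rates in Tables 2–4 are Monte Carlo proportions, so an estimate p̂ based on R replications carries a binomial standard error √{p̂(1 − p̂)/R}. The sketch below illustrates the mechanics rather than reproducing the authors' procedure: instead of refitting mixture models, it draws null statistics directly from the 0.5χ²_0 + 0.5χ²_1 limit (the q_2 = 1 case), and the replication count R is a hypothetical choice of ours.

```python
import numpy as np
from scipy.stats import chi2

def chibar_crit(alpha):
    """Critical value of the 0.5*chi2_0 + 0.5*chi2_1 mixture:
    P(T > c) = 0.5 * P(chi2_1 > c) = alpha, so c = chi2_1 quantile at 2*alpha."""
    return chi2.isf(2 * alpha, df=1)

def rejection_rate(stats, alpha):
    """Monte Carlo rejection proportion and its binomial standard error."""
    p = float(np.mean(stats > chibar_crit(alpha)))
    return p, float(np.sqrt(p * (1 - p) / len(stats)))

rng = np.random.default_rng(0)
R = 100_000  # hypothetical replication count, larger than in the paper
# Under H0 the statistic is 0 with probability 1/2, else a chi2_1 draw.
T = np.where(rng.random(R) < 0.5, 0.0, rng.chisquare(1, size=R))

for alpha in (0.01, 0.05, 0.10):
    p, se = rejection_rate(T, alpha)
    print(f"nominal {alpha:.2f}: estimate {p:.4f} (se {se:.4f})")
```

Because the statistics are drawn from the exact limiting distribution, each estimated rate should sit within a couple of standard errors of its nominal level, which is the behaviour the well-calibrated cells of the tables display.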

3.2. Ames salmonella assay data
We reanalyse an assay data set that has been studied by Wang et al. (1996), among others. The data set contains the number of revertant colonies of salmonella under different dose levels of quinoline. At each of six dose levels of quinoline d_i, three plates are tested. To fit the data set, Wang et al. (1996) chose the two-component Poisson regression (2) with n = 18, n_i = 1 and Poisson rates

    λ_i = exp[x_i^T β + {μ_1 U_i + μ_2 (1 − U_i)}],

where x_i = (d_i, log(d_i + 10)), i = 1, ..., 18. They used the Akaike information criterion and Bayes information criterion to facilitate the model selection process, but they did not have a procedure for testing hypothesis (5).
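Under this specification, the observed-data log-likelihood sums, for each plate, over the two latent components. The sketch below is ours, not Wang et al.'s code: the dose grid is the commonly reported Ames design (an assumption on our part) and the counts are simulated from the fitted parameter values purely for illustration, not the real assay data.

```python
import numpy as np
from scipy.stats import poisson

def mixture_loglik(params, X, y):
    """Observed-data log-likelihood of the two-component Poisson mixture
    regression: rate exp(x_i^T beta + mu1) with probability alpha,
    rate exp(x_i^T beta + mu2) with probability 1 - alpha."""
    alpha, mu1, mu2 = params[0], params[-2], params[-1]
    beta = np.asarray(params[1:-2])
    eta = X @ beta
    l1 = poisson.logpmf(y, np.exp(eta + mu1))  # component U_i = 1
    l2 = poisson.logpmf(y, np.exp(eta + mu2))  # component U_i = 0
    # log{alpha * e^l1 + (1 - alpha) * e^l2}, accumulated stably
    return float(np.sum(np.logaddexp(np.log(alpha) + l1,
                                     np.log1p(-alpha) + l2)))

# Six quinoline dose levels, three plates each (n = 18).
dose = np.repeat([0.0, 10.0, 33.0, 100.0, 333.0, 1000.0], 3)
X = np.column_stack([dose, np.log(dose + 10.0)])

# Simulate a synthetic response from the reported estimates, for illustration.
rng = np.random.default_rng(1)
theta = [0.723, -0.0013, 0.378, 1.844, 2.382]  # (alpha, beta, mu1, mu2)
U = rng.random(18) < theta[0]
y = rng.poisson(np.exp(X @ np.array(theta[1:3]) + np.where(U, theta[3], theta[4])))

ll = mixture_loglik(theta, X, y)
```

An EM algorithm, as used in the paper, alternates between posterior component weights given the current parameters and weighted Poisson-regression updates that increase this same objective.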

We use the modified log-likelihood-ratio statistic to test the hypothesis stated in expression (5). First, we use the EM algorithm to obtain (α̂_P, β̂_P, μ̂_1P, μ̂_2P)^T = (0.723, −0.0013, 0.378, 1.844, 2.382)^T. Next, using ω̂_P and the results in Table 6 of Wang et al. (1996), we find that MLR_n = 12.447. Then the approximation procedure is repeated J = 10000 times, leading to p̂_J = 0.0001. Since q_2 = 1 in this case, the true asymptotic distribution of MLR_n is 0.5χ²_0 + 0.5χ²_1, giving a p-value of 0.0002. Therefore, the data support the mixture Poisson regression with two components.
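The p-value of 0.0002 follows directly from the chi-bar-square limit: the χ²_0 component is a point mass at zero and contributes nothing to the tail, so P(MLR_n > c) = 0.5 P(χ²_1 > c). A quick check, assuming scipy:

```python
from scipy.stats import chi2

MLR_n = 12.447
# Tail probability of 0.5*chi2_0 + 0.5*chi2_1: only the chi2_1 half
# can exceed a positive threshold.
p_value = 0.5 * chi2.sf(MLR_n, df=1)
print(round(p_value, 4))  # 0.0002
```

The resampling estimate p̂_J = 0.0001 based on J = 10000 draws agrees with this asymptotic value to within Monte Carlo error.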

We can also use the log-likelihood-ratio statistic to test the same hypothesis. It follows from Table 6 of Wang et al. (1996) that the log-likelihood-ratio statistic LR_n = 14.44. Although it is difficult to compute the true p-value, the approximation procedure gives p̂_10000 = 0.0004. Further analyses, such as residual analysis and the goodness of fit, have been reported in Wang et al. (1996).



3.3. A risk factor analysis of preterm delivery
Although the survival rate has improved for preterm infants that are born with low (less than 2500 g) or even very low (less than 1000 g) birth weights, preterm birth remains a major cause of neurodevelopmental disability. Many epidemiological studies have been conducted to examine risk factors for preterm deliveries. Here, we reanalyse a data set that has been collected and extensively analysed by Dr Michael Bracken and his colleagues (see, for example, Zhang and Bracken (1995)). The data for this study consisted of 3858 women whose pregnancies ended in a singleton live birth at Yale–New Haven Hospital, Connecticut, in 1980–1982. Preterm delivery is defined as less than 37 weeks of gestational age, which is calculated from the first day of the last menstrual period. On the basis of Zhang and Bracken (1995), we examine the following eight putative risk factors: total number of pregnancies (v_1), ethnic background (v_2), use of marijuana (v_3), marital status (v_4), hormones or diethylstilbestrol used by the mother (v_5), parity (v_6), years of education (v_7) and total alcohol use per day (v_8). Among these risk factors, v_2, v_3, v_4 and v_5 are dummy variables with values 0 or 1; for instance, v_2 is an indicator for whether a woman is black or not. Zhang and Bracken (1995) originally considered 15 variables, but seven of them do not contribute to the risk of preterm births and hence are not considered here.

To understand the potential relationship between preterm delivery and the above risk factors, we fit the logistic regression model to the data set and report the output in Table 5. According to the estimators and their corresponding standard errors in Table 5, the most significant risk factors are the number of previous pregnancies, ethnicity and the years of education, as expected from earlier studies (Zhang and Bracken, 1995). However, the remaining risk factors do not appear to be significant at the 0.05 level.

Next, we use the logistic mixture regression

    logit{P(Y_i = 1 | U_i)} = x_i β + z_i {U_i μ_1 + (1 − U_i) μ_2},

where x_i = (1, v_1i, v_2i, v_3i) and z_i = (v_4i, v_5i, v_6i, v_7i, v_8i) for i = 1, ..., 3858. The result becomes intriguing when the two-component logistic regression model is used to analyse this data set; see Table 5 for details. First, all eight putative risk factors become significant under the logistic mixture model. Second, when we use the modified log-likelihood-ratio statistic to test the hypothesis stated in expression (5), MLR_n = 20.810, giving rise to p̂_J = 0.0002 with 10000 (i.e. J) repeated samples. This means that our procedure strongly favours the two-component logistic model, and there is a significant disparity between the null hypothesis H_0 in expression
Table 5. Modified log-likelihood estimators for the preterm data set†

                    α        Intercept  v1       v2       v3       v4       v5       v6       v7       v8

One-component mixture (modified log-likelihood −778.49)
                             −2.465     0.167    0.577    0.232    −0.109   0.131    −0.086   −0.061   −0.039
                             (0.463)    (0.077)  (0.210)  (0.223)  (0.218)  (0.170)  (0.110)  (0.033)  (0.073)

Two-component mixture‡ (modified log-likelihood −768.09)
                    0.509    −2.949     0.174    0.664    0.241    1.062    0.778    0.276    −0.159   0.218
                    (0.101)  (0.116)    (0.021)  (0.059)  (0.062)  (0.133)  (0.145)  (0.094)  (0.019)  (0.039)
                                                                   −1.361   −0.997   −0.990   0.094    −0.308
                                                                   (0.166)  (0.095)  (0.052)  (0.018)  (0.044)

†Standard errors of the estimators are given in parentheses; MLR_n = 20.810; p̂_10000 = 0.0002.
‡α, the intercept and v_1–v_3 are shared, and v_4–v_8 are separated for U = 0 (upper values) and U = 1 (lower values).



(5) and the observed sample. Third, by looking at the coefficients of v_4–v_8, we find that the estimators in the two latent groups have opposite signs, explaining why they are insignificant in the ordinary logistic model. This underscores the importance of the two-component model. The opposite signs of the estimators indicate reversed relationships between these risk factors and preterm births in the two latent groups.
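The sign reversal can be made concrete by computing fitted preterm probabilities under each latent group. The sketch below plugs the two-component estimates from Table 5 into the fitted model; the covariate profile is hypothetical, and coding parity as a count and education as raw years is our assumption, not something stated in the paper.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

# Table 5 two-component fit: shared intercept and v1-v3, with separate
# v4-v8 coefficients in the two latent groups (per the table footnote).
beta_shared = np.array([-2.949, 0.174, 0.664, 0.241])        # intercept, v1-v3
mu = {0: np.array([1.062, 0.778, 0.276, -0.159, 0.218]),     # U = 0 (upper row)
      1: np.array([-1.361, -0.997, -0.990, 0.094, -0.308])}  # U = 1 (lower row)

def preterm_prob(x, z, group):
    """P(Y = 1 | U = group) under the fitted two-component logistic model."""
    return float(expit(x @ beta_shared + z @ mu[group]))

# A hypothetical covariate profile (not an observation from the study):
x = np.array([1.0, 2.0, 1.0, 0.0])        # intercept, 2 pregnancies, black, no marijuana
z = np.array([1.0, 0.0, 1.0, 12.0, 0.0])  # married, no DES, parity 1, 12 years education

p0, p1 = preterm_prob(x, z, 0), preterm_prob(x, z, 1)
print(f"P(preterm | U=0) = {p0:.3f},  P(preterm | U=1) = {p1:.3f}")
```

For a fixed profile the two groups can sit on opposite sides of the one-component fit, which is how effects of opposite sign cancel and appear insignificant when the latent grouping is ignored.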

Acknowledgements
This research is supported in part by National Institutes of Health grants DA12468 and AA12044. We thank the Joint Editor, an Associate Editor and two referees for valuable suggestions which helped to improve our presentation greatly. We also thank Professor Michael Bracken for the use of his data set.

Appendix A
Before we present the assumptions, we need to define the following derivatives of f_i(β, µ): F_{i,3}(β, µ) = ∂²_β f_i(β, µ)/f_{i*}, F_{i,4}(µ) = ∂_β ∂_µ f_i(β*, µ)/f_{i*}, F_{i,5}(µ) = ∂_β ∂²_µ f_i(β*, µ)/f_{i*} and F_{i,7}(µ) = ∂³_µ f_i(β*, µ)/f_{i*}.

The following assumptions are sufficient conditions for our asymptotic results. Detailed proofs of theorems 1–3 can be found in Zhu and Zhang (2003).

A.1. Assumption 1 (identifiability)
As n → ∞, sup_{ω∈Ω} {n⁻¹|L_n(ω) − L̄_n(ω)|} → 0 in probability, where L̄_n(ω) = E{L_n(ω)}. For every δ > 0,

    lim inf_{n→∞} ( n⁻¹ [ L̄_n(ω̄_n) − sup_{ω∈Ω\Ω_{*,δ}} L̄_n(ω) ] ) > 0,

where ω̄_n is the maximizer of L̄_n(ω) and

    Ω_{*,δ} = {ω : ‖β − β*‖ ≤ δ, ‖µ_1 − µ*‖ ≤ δ, α‖µ_2 − µ*‖ ≤ δ} ∩ Ω.

A.2. Assumption 2
For a small δ_0 > 0, let B_{δ_0} = {(β, µ) : ‖β − β*‖ ≤ δ_0 and ‖µ‖ ≤ M} ∩ Ω. Then

    sup_{(β,µ)∈B_{δ_0}} { ‖n⁻¹ Σ_{i=1}^n F_{i,1}(β, µ)‖ + ‖n⁻¹ Σ_{i=1}^n F_{i,3}(β, µ)‖
        + ‖n⁻¹ Σ_{i=1}^n F_{i,2}(µ)‖ + Σ_{k=4}^6 ‖n⁻¹ Σ_{i=1}^n F_{i,k}(µ)‖ } = o_p(1),

    sup_{‖µ‖≤M} ‖n^{−1/2} Σ_{i=1}^n F_{i,k}(µ)‖ = O_p(1),   k = 4, 6, 7,

    sup_{(β,µ)∈B_{δ_0}} { n⁻¹ Σ_{i=1}^n [ ‖F_{i,1}(β, µ)‖³ + ‖F_{i,3}(β, µ)‖³
        + ‖F_{i,2}(µ)‖³ + Σ_{k=4}^7 ‖F_{i,k}(µ)‖³ ] } = O_p(1).

Moreover, max_{1≤i≤n} sup_{(β,µ)∈B_{δ_0}} {‖F_{i,1}(β, µ)‖ + ‖F_{i,2}(µ)‖²} = o_p(n^{1/2}).

A.3. Assumption 3
(W_n(·), J_n(·)) ⇒ (W(·), J(·)), where these processes are indexed by ‖µ‖ ≤ M, and the stochastic process {(W(µ), J(µ)) : ‖µ‖ ≤ M} has bounded continuous sample paths with probability 1. Each J(µ) is a symmetric matrix and ∞ > sup_{‖µ‖≤M} [λ_max{J(µ)}] ≥ inf_{‖µ‖≤M} [λ_min{J(µ)}] > 0 holds almost surely. The process W(µ) is a mean-zero R^r-valued Gaussian stochastic process {W(µ) : ‖µ‖ ≤ M} such that E{W(µ)W(µ)^T} = J(µ) and E{W(µ)W(µ′)^T} = J(µ, µ′) for any µ and µ′ in B(0, M).

A.4. Remarks
Assumption 1 is a generalized definition of identifiable uniqueness in the statistical literature. The assumption sup_{ω∈Ω} {n⁻¹|L_n(ω) − L̄_n(ω)|} → 0 in probability is a uniform law of large numbers; some sufficient conditions for uniform laws of large numbers have been presented in the literature; see Pollard (1990). Moreover, assumption 2 is quite reasonable in most situations. In assumption 3, to prove that W_n(µ) converges weakly to a Gaussian process W(µ), we need to use the functional central limit theorem; see Pollard (1990). Moreover, assumption 3 assumes the positive definiteness of J(µ), which is a generalization of the strong identifiability conditions that were used in Chen (1995) and Chen and Chen (2001). Under the independent and identically distributed framework, assumption (P0) of Dacunha-Castelle and Gassiat (1999) is a sufficient condition for assumptions 1 and 2. Moreover, Dacunha-Castelle and Gassiat's (1999) assumption (P1) is just the positive definiteness of J(µ) in assumption 3.

References

Andrews, D. W. K. (1999) Estimation when a parameter is on a boundary: theory and applications. Econometrica, 67, 1341–1383.
Bickel, P. and Chernoff, H. (1993) Asymptotic distribution of the likelihood ratio statistic in a prototypical non-regular problem. In Statistics and Probability: a Raghu Raj Bahadur Festschrift (eds J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy and B. L. S. Prakasa Rao), pp. 83–96. New Delhi: Wiley.
Chen, H. and Chen, J. (2001) The likelihood ratio test for homogeneity in the finite mixture models. Can. J. Statist., 29, 201–215.
Chen, H., Chen, J. and Kalbfleisch, J. D. (2001) A modified likelihood ratio test for homogeneity in finite mixture models. J. R. Statist. Soc. B, 63, 19–29.
Chen, J. (1995) Optimal rate of convergence for finite mixture models. Ann. Statist., 23, 221–233.
Cheng, R. C. H. and Liu, W. B. (2001) The consistency of estimators in finite mixture models. Scand. J. Statist., 28, 603–616.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman and Hall.
Dacunha-Castelle, D. and Gassiat, E. (1999) Testing the order of a model using locally conic parameterization: population mixtures and stationary ARMA processes. Ann. Statist., 27, 1178–1209.
Diggle, P. J., Liang, K. Y. and Zeger, S. L. (2002) Analysis of Longitudinal Data, 2nd edn. New York: Oxford University Press.
Hansen, B. E. (1996) Inference when a nuisance parameter is not identified under the null hypothesis. Econometrica, 64, 413–430.
Haseman, J. K. and Elston, R. C. (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet., 2, 3–19.
Kosorok, M. R. (2003) Bootstraps of sums of independent but not identically distributed stochastic processes. J. Multiv. Anal., 84, 299–318.
Lemdani, M. and Pons, O. (1999) Likelihood ratio tests in contamination models. Bernoulli, 5, 705–719.
Lindsay, B. G. (1995) Mixture Models: Theory, Geometry and Applications. Hayward: Institute of Mathematical Statistics.
McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models, 2nd edn. London: Chapman and Hall.
McLachlan, G. and Peel, D. (2000) Finite Mixture Models. New York: Wiley.
Pauler, D. K. and Laird, N. M. (2000) A mixture model for longitudinal data with application to assessment of noncompliance. Biometrics, 56, 464–472.
Pollard, D. (1990) Empirical Processes: Theory and Applications. Hayward: Institute of Mathematical Statistics.
Risch, N. and Zhang, H. P. (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science, 268, 1584–1589.
Teicher, H. (1963) Identifiability of finite mixtures. Ann. Math. Statist., 34, 1265–1269.
Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985) The Statistical Analysis of Finite Mixture Distributions. New York: Wiley.
van der Vaart, A. (1996) Efficient estimation in semiparametric models. Ann. Statist., 24, 862–878.
Wang, P. M. and Puterman, M. L. (1998) Mixed logistic regression models. J. Agric. Biol. Environ. Statist., 3, 175–200.
Wang, P. M., Puterman, M. L., Cockburn, I. and Le, N. (1996) Mixed Poisson regression models with covariate dependent rates. Biometrics, 52, 381–400.
Zhang, H. P. and Bracken, M. (1995) Tree-based risk factor analysis of preterm delivery and small-for-gestational-age birth. Am. J. Epidem., 141, 70–78.



Zhang, H. P., Fui, R. and Zhu, H. T. (2003) A latent variable model of segregation analysis for ordinal outcome. J. Am. Statist. Ass., to be published.
Zhang, H. P. and Merikangas, K. (2000) A frailty model of segregation analysis: understanding the familial transmission of alcoholism. Biometrics, 56, 815–823.
Zhu, H. T. and Zhang, H. P. (2003) Hypothesis testing for finite mixture regression models (mathematical details). Technical Report. Yale University School of Medicine, New Haven.
