ESTIMATING A PARAMETER IN INCIDENTAL AND STRUCTURAL MODELS 

BY APPROXIMATE MAXIMUM LIKELIHOOD 

by 

Aad van der Vaart 

TECHNICAL REPORT No. 139 

October 1987 (revised July 1988)

Department of Statistics, GN-22

University of Washington 

Seattle, Washington 98195 USA


Estimating a parameter in 

incidental and structural models 

by approximate maximum likelihood 

Aad van der Vaart 1 

Free University Amsterdam 

October 1987, revised July 1988 

Let $X_1, X_2, \ldots$ be independent random vectors, where $X_j$ has density $\int h_\theta(x)\, g_\theta(\psi_\theta(x), z)\, d\eta_j(z)$. In the structural version of the model $\eta_j = \eta$ is a fixed unknown probability distribution, while in the functional model $\eta_j$ is a unit mass in $z_j$, where $\{z_j\}$ is an unknown sequence of vectors. It is proposed to estimate $\theta$ by a one-step estimator, based on a MLE for $n^{-1}\sum_{j=1}^n \eta_j$. The construction is illustrated in the paired exponential model, the errors-in-variables model and a scale mixture over a normal family.

AMS 1980 subject classifications: 62F12, 62G05.

Keywords and phrases: Structural model, incidental parameters, functional model, 

mixture model, asymptotic efficiency. 

1 Research partially supported by the Netherlands Organization for the Advancement of Pure Research (Z.W.O.) and ONR Contract N00014-80-C-0163.

1. Introduction. 

Let $\Theta$ be an open subset of $\mathbb{R}^k$ and $\mathcal{H}$ a collection of probability measures on a measurable space $(\mathcal{Z}, \mathcal{C})$. Given $\theta \in \Theta$ let $\psi_\theta$ be a measurable map between measurable spaces $(\mathcal{X}, \mathcal{B})$ and $(\mathcal{Y}, \mathcal{A})$. Furthermore, for each $(\theta, z) \in \Theta \times \mathcal{Z}$ let $p_\theta(\cdot, z)$ be a probability density with respect to a $\sigma$-finite measure $\mu$ on $(\mathcal{X}, \mathcal{B})$, having the form

(1.1) $p_\theta(x, z) = h_\theta(x)\, g_\theta(\psi_\theta(x), z),$

for measurable maps $h_\theta$ and $g_\theta$ (from $(\mathcal{X}, \mathcal{B})$ into $\mathbb{R}$ and $(\mathcal{Y} \times \mathcal{Z}, \mathcal{A} \times \mathcal{C})$ into $\mathbb{R}$, respectively). Suppose that $X$ is a random element with density

(1.2) $p_{\theta,\eta}(x) = \int p_\theta(x, z)\, d\eta(z),$

where the mixing distribution $\eta$ is an element of $\mathcal{H}$. Then, by the factorization theorem, $\psi_\theta(X)$ is a sufficient statistic for $\eta \in \mathcal{H}$, if $\theta$ is fixed. It is assumed that

(1.3) $g_{\theta,\eta}(y) = \int g_\theta(y, z)\, d\eta(z)$

is the density of $\psi_\theta(X)$ with respect to a $\sigma$-finite measure $\nu$ on $(\mathcal{Y}, \mathcal{A})$.

This paper is concerned with estimating $\theta$ on the basis of the first $n$ elements of a sequence of independent random elements $X_1, X_2, \ldots$, where $X_j$ has density $p_{\theta,\eta_j}$, $\{\eta_j\}$ being an unknown sequence in $\mathcal{H}$. There are two versions of this model. The structural model (or mixture model) is simply the i.i.d. version, where every $\eta_j$ is equal to one fixed, but unknown $\eta$. The incidental model (or functional model) has $\eta_j$ equal to a unit mass in $z_j$, $\{z_j\}$ being an unknown sequence in $\mathcal{Z}$. If $\mathcal{H}$ contains the unit masses, our formulation includes both.

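Suppose that the score function for $\theta$, $\dot l_{\theta,\eta}(x) = \nabla_\theta \log p_{\theta,\eta}(x)$, is well-defined and set

(1.4) $\tilde l_{\theta,\eta}(x) = \dot l_{\theta,\eta}(x) - E_\theta\big(\dot l_{\theta,\eta}(X) \mid \psi_\theta(X) = \psi_\theta(x)\big)$

(1.5) $\tilde I_{\theta,\eta} = E_{\theta,\eta}\, \tilde l_{\theta,\eta} \tilde l_{\theta,\eta}'(X).$

We propose the following estimator for $\theta$. Let $\hat\eta_n(\theta)$ be a (restricted) maximum likelihood estimator in the structural version of the model, when $\theta$ is given. Thus $\hat\eta_n(\theta)$ satisfies

(1.6) $\prod_{j=1}^n p_{\theta,\hat\eta_n(\theta)}(X_j) = \sup_{\eta \in \mathcal{H}_n} \prod_{j=1}^n p_{\theta,\eta}(X_j),$

for a given (possibly data-dependent) subset $\mathcal{H}_n$ of $\mathcal{H}$. Next, given a preliminary estimator $\hat\theta_n$ for $\theta$, let $T_n$ be its one-step 'improvement'

(1.7) $T_n = \hat\theta_n + \Big(\sum_{j=1}^n \tilde l_{\hat\theta_n,\hat\eta_n(\hat\theta_n)} \tilde l_{\hat\theta_n,\hat\eta_n(\hat\theta_n)}'(X_j)\Big)^{-1} \sum_{j=1}^n \tilde l_{\hat\theta_n,\hat\eta_n(\hat\theta_n)}(X_j).$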
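For a real parameter ($k = 1$) the update (1.7) is a single Newton-type step. The following Python sketch is illustrative only: the function name `one_step` and the convention of passing precomputed score values are assumptions made for this example. Evaluating the efficient scores at $\hat\theta_n$ and $\hat\eta_n(\hat\theta_n)$ is model-specific and is not shown.

```python
def one_step(theta_prelim, scores):
    """One-step update in the spirit of (1.7), scalar parameter only.

    theta_prelim : preliminary estimate of theta (hypothetical input).
    scores       : efficient score values at the n observations, evaluated
                   at theta_prelim and the estimated mixing distribution.
    """
    information = sum(s * s for s in scores)  # sum_j score_j^2
    if information <= 0.0:
        raise ValueError("all scores are zero; the update is undefined")
    return theta_prelim + sum(scores) / information


# Toy numbers only, to show the arithmetic of the update.
updated = one_step(2.0, [1.0, -0.5, 0.5])
```

With these toy inputs the score sum is 1.0 and the squared-score sum is 1.5, so the preliminary value 2.0 is moved by 1.0/1.5.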

For discretized and $\sqrt{n}$-consistent $\hat\theta_n$ it will be shown that

(1.8) 

where $\bar\eta_n = n^{-1}\sum_{j=1}^n \eta_j$. The necessary regularity conditions and the choice of the sieves are discussed in detail for three examples.

For the structural version of the model, (1.8) typically implies that $T_n$ is asymptotically efficient in the sense of semi-parametric model theory as introduced in Begun, Hall, Huang and Wellner (1983). Indeed, there typically exists a one-dimensional submodel $t \to P_{\theta+th,\eta_t}$ such that

for every $h \in \mathbb{R}^k$ [cf. Lindsay (1983b), Pfanzagl and Wefelmeyer (1982, Chapter 14), van der Vaart (1988a)]. As a consequence $T_n$ is efficient in the sense of Hajek (1970, 1972) in a one-dimensional submodel, and hence efficient in the whole model.

For the incidental version of the model, or more generally the model as introduced above, there exists no satisfactory theory of asymptotic efficiency, although steps towards establishing such a theory have been taken in Hasminskii and Nussbaum (1984), Nussbaum (1984) and Bickel and Klaassen (1986). In this case the estimator in (1.8) is asymptotically linear in what may be called the efficient influence function of the average density $n^{-1}\sum_{j=1}^n p_{\theta,\eta_j} = p_{\theta,\bar\eta_n}$. Though the definition of $T_n$ is based on the working principle that all $\eta_j$'s are equal, the estimator asymptotically improves upon other estimators in the literature. A similar contradictory principle is noted by Lindsay (1985), who proposes 'to increase efficiency' by using the best estimator for the structural model with a fixed parametric family $\eta_t$. The idea in this paper is that, though it is impossible to adapt to every $\eta_j$ separately, adaptation to the average $\bar\eta_n$ is possible. (It will be assumed that the sequence $\bar\eta_n$ stabilizes as $n \to \infty$.)

By an extension of this idea one can actually show that $T_n$ is asymptotically inadmissible in the class $\mathcal{T}$ of all estimator sequences which are asymptotically normal 'uniformly over contiguous neighbourhoods'. One notes that $(\tfrac12 n)^{-1}\sum_{j=1}^{n/2} \eta_j$ and $(\tfrac12 n)^{-1}\sum_{j=n/2+1}^{n} \eta_j$ are estimable too, and, without going into any detail, it is clear that the 'knowledge' of these two quantities should enable one to do better than when 'knowing' $\bar\eta_n$ only. (A recipe and further discussion can be found on pages 135-138 of van der Vaart (1988).) However, we don't believe that this inadmissibility result necessarily means that $T_n$ is not a 'good' estimator. It rather shows the difficulty of setting up a theory of asymptotic optimality for non-i.i.d. models as considered here. On the positive side it can be shown that $T_n$ is optimal in the class of estimator sequences which are in $\mathcal{T}$ and are also asymptotically symmetric in the sense that

$\sqrt{n}(T_n - \theta) = n^{-1/2}\sum_{j=1}^n g_n(X_j) + o_{P_{\theta,\eta_1,\eta_2,\ldots}}(1).$

An extension of an interesting conjecture of Bickel and Klaassen (1986) is that optimality would remain true when asymptotic symmetry is replaced here by symmetry in the observations for every $n$. However, we know of no proof of this conjecture.

The organization of the paper is as follows. Section 2 contains conditions for the one-step estimator (1.7) to satisfy (1.8). Here $\hat\eta_n(\theta)$ is allowed to be a general estimator and need not be the maximum likelihood estimator. It turns out that no rate of convergence of $\hat\eta_n(\theta)$ is required. However, consistency in a suitable topology, which depends on the model, is crucial. For this reason Section 3 starts with the introduction of a class of topologies on $\mathcal{H}$. Next it is shown that $\hat\eta_n(\theta)$ satisfying (1.6) is 'consistent for $\bar\eta_n$'. Here the sets $\mathcal{H}_n$ are chosen data-dependent and of a simple form, with a view towards applicability. Finally, Sections 4-6 contain examples.

Efficient estimators for the structural version of the model were already constructed in van der Vaart (1988) and (for the example in Section 5) in Bickel and Ritov (1987). Both constructions are based on kernel estimators for $g_{\theta,\eta}$ and use a rather large number of tricks. Advantages of the estimator (1.6)-(1.7) are that it is better adapted to the mixture form of the model, that it uses maximum likelihood, and that it is simple and avoids dependence on unknown parameters such as the bandwidth of a kernel.

An important open problem concerns the behaviour of the maximum likelihood estimator for $(\theta, \eta)$, defined as the pair maximizing $\prod_{j=1}^n p_{\theta,\eta}(X_j)$ over $\Theta \times \mathcal{H}_n$. We feel that with a similar choice of sieves as in this paper (or maybe even without sieves), the $\theta$-component may well be asymptotically equivalent to $T_n$ given by (1.7)-(1.8), in both the structural and functional version of the model. This problem has been open since Kiefer and Wolfowitz (1956) considered consistency and appears to be hard.

2. General theorem. 

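Under appropriate smoothness conditions we have

where $\nabla_\theta \psi_\theta(x)$ is written in the form of a $k \times m$ matrix with the derivatives with respect to $\theta_i$ in its $i$-th row, and $Q_{\theta,\eta}(y) = \nabla_y g_{\theta,\eta}/g_{\theta,\eta}(y)$. When subtracting the conditional expectation given $\psi_\theta(X)$, the third term cancels. This motivates assuming the existence of measurable maps $H_\theta$, $\tilde\psi_\theta$ and $Q_{\theta,\eta}$ such that

(2.1) $\tilde l_{\theta,\eta} = H_\theta + \tilde\psi_\theta\, Q_{\theta,\eta}(\psi_\theta)$

(2.2) $E_\theta\big(\tilde\psi_\theta(X) \mid \psi_\theta(X)\big) = 0.$

Set

Let $\delta$ be a semi-metric, which makes $\mathcal{H}$ into a separable metric space. Assume that

(2.3) $\bar\eta_n = n^{-1}\sum_{j=1}^n \eta_j \xrightarrow{\delta} \eta$

(2.4) $Q_{\theta_n,\gamma_n}(y) \to Q_{\theta,\gamma}(y)$, $\quad \beta_{\theta_n}(y) \to \beta_\theta(y)$, $\quad g_{\theta_n,\gamma_n}(y) \to g_{\theta,\gamma}(y)$, every $y$, every $\gamma_n \xrightarrow{\delta} \gamma$,

(2.5) $\Big\{ \sup_{\gamma \in U} |Q_{\theta_n,\gamma}|^2\, \beta_{\theta_n}\, g_{\theta_n,\bar\eta_n} : n = 1, 2, \ldots \Big\}$ is $\nu$-equi-integrable,

for some open neighbourhood $U$ of $\eta$.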

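Next assume the existence of estimators $\hat\eta_n(\theta)$, based on $\psi_\theta(X_1), \ldots, \psi_\theta(X_n)$, such that for every sequence $\theta_n$ with $\sqrt{n}(\theta_n - \theta) \to h$

(2.6)

By 'estimators' it will be understood that every $\hat\eta_n(\theta)$ is a measurable map from $\mathcal{X}^n$ into $\mathcal{H}$ with respect to the Borel $\sigma$-field on $\mathcal{H}$.

Finally assume the existence of measurable functions $\dot l_\theta(x, z)$ such that for every sequence $\theta_n$ with $\sqrt{n}(\theta_n - \theta) \to h$

(2.7) $\int \dot l_\theta\, p_\theta\, d\mu = 0$

(2.8) $\int\!\!\int \big[\sqrt{n}\,(p_{\theta_n}^{1/2} - p_\theta^{1/2}) - \tfrac12 h' \dot l_\theta\, p_\theta^{1/2}\big]^2 d\mu\, d\bar\eta_n \to 0$

(2.9) $\int\!\!\int |\dot l_\theta|^2 p_\theta\, d\mu\, d\bar\eta_n = O(1)$

(2.10) $\int\!\!\int_{\{|\dot l_{\theta_n}| \ge \varepsilon\sqrt{n}\}} |\dot l_{\theta_n}|^2 p_{\theta_n}\, d\mu\, d\bar\eta_n \to 0$, every $\varepsilon > 0$

(2.11) $\int\!\!\int |\dot l_{\theta_n} p_{\theta_n}^{1/2} - \dot l_\theta\, p_\theta^{1/2}|^2 d\mu\, d\bar\eta_n \to 0$

(2.12) the limit points of $\{\tilde I_{\theta,\bar\eta_n}\}$ are nonsingular.

We shall identify the $\theta$-score $\dot l_{\theta,\eta}(x)$ with $p_{\theta,\eta}^{-1}(x) \int \dot l_\theta(x, z)\, p_\theta(x, z)\, d\eta(z)$.

Theorem 2.1. Let (2.1)-(2.12) hold. Let $\hat\theta_n$ be a $\sqrt{n}$-consistent, discretized estimator of $\theta$. Define $T_n$ by (1.7). Then (1.8) holds.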

Proof. Let $\sqrt{n}(\theta_n - \theta) \to h$. Assumptions (2.7)-(2.9) imply contiguity of the measures with densities $\prod_{j=1}^n p_{\theta_n,\eta_j}(x_j)$ and $\prod_{j=1}^n p_{\theta,\eta_j}(x_j)$, while (2.7)-(2.11) ensure that

(2.13) $n^{-1/2}\sum_{j=1}^n \big(\tilde l_{\theta_n,\bar\eta_n}(X_j) - \tilde l_{\theta,\bar\eta_n}(X_j)\big) + \tilde I_{\theta,\bar\eta_n} h \to 0$ in $P_{\theta_n,\eta_1,\eta_2,\ldots}$-probability

(2.14) $n^{-1}\sum_{j=1}^n \tilde l_{\theta_n,\bar\eta_n} \tilde l_{\theta_n,\bar\eta_n}'(X_j) - \tilde I_{\theta,\bar\eta_n} \to 0$ in $P_{\theta_n,\eta_1,\eta_2,\ldots}$-probability.

To see this, one can first note that (2.7)-(2.11) imply analogous statements for the corresponding mixed quantities:

(2.7')

(2.8')

(2.9')

(2.10') every $\varepsilon > 0$

(2.11')

(van der Vaart (1988), Section 5.8.1). Next (2.9')-(2.10') also hold with $\dot l_{\theta,\eta}$ replaced by the corresponding $\tilde l_{\theta,\eta}$ (van der Vaart (1988), Lemma 5.20) and (2.11') implies

(2.11'') $n^{-1}\sum_{j=1}^n \int \big|\tilde l_{\theta_n,\bar\eta_n}\, p_{\theta_n,\eta_j}^{1/2} - \tilde l_{\theta,\bar\eta_n}\, p_{\theta,\eta_j}^{1/2}\big|^2 d\mu \to 0,$

(van der Vaart (1988), p168-169). Finally, (2.7')-(2.9') imply local asymptotic normality, (2.13) follows from Proposition A.10 in van der Vaart (1988) and (2.14) is a version of the law of large numbers.

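By (2.4) the map $\gamma \to Q_{\theta,\gamma}(y)$ is continuous for every $y$. Thus the map $(y, \gamma) \to Q_{\theta,\gamma}(y)$ is measurable (cf. Chapter 13 of Pfanzagl and Wefelmeyer (1985)) and $T_n$ is well-defined. By (2.1)-(2.2) and because $\hat\eta_n(\theta)$ depends on $\psi_\theta(X_1), \ldots, \psi_\theta(X_n)$ only,

(2.15) $E_{\theta_n}\Big[\Big|n^{-1/2}\sum_{j=1}^n \big(\tilde l_{\theta_n,\hat\eta_n(\theta_n)}(X_j) - \tilde l_{\theta_n,\bar\eta_n}(X_j)\big)\Big|^2 \,\Big|\, \psi_{\theta_n}(X_1), \ldots, \psi_{\theta_n}(X_n)\Big] \le n^{-1}\sum_{j=1}^n \big|Q_{\theta_n,\hat\eta_n(\theta_n)}(\psi_{\theta_n}(X_j)) - Q_{\theta_n,\bar\eta_n}(\psi_{\theta_n}(X_j))\big|^2 \beta_{\theta_n}(\psi_{\theta_n}(X_j)).$

By (2.3) and (2.6), for every open neighbourhood $U$ of $\eta$, this can be dominated by

Under $P_{\theta_n,\eta_1,\eta_2,\ldots}$ the expectation of the first term is

$\int \sup_{\gamma_1,\gamma_2 \in U} |Q_{\theta_n,\gamma_1} - Q_{\theta_n,\gamma_2}|^2\, \beta_{\theta_n}\, g_{\theta_n,\bar\eta_n}\, d\nu.$

The latter expression converges to zero as $U$ decreases to $\eta$ and $n \to \infty$, by (2.4) and (2.5). It follows that both sides of (2.15) converge to zero in probability. From this and (2.13)-(2.14) it can be seen that

(2.16) $n^{-1/2}\sum_{j=1}^n \big(\tilde l_{\theta_n,\hat\eta_n(\theta_n)}(X_j) - \tilde l_{\theta_n,\bar\eta_n}(X_j)\big) \to 0$ in $P_{\theta_n,\eta_1,\eta_2,\ldots}$-probability

(2.17) $n^{-1}\sum_{j=1}^n \big(\tilde l_{\theta_n,\hat\eta_n(\theta_n)} \tilde l_{\theta_n,\hat\eta_n(\theta_n)}'(X_j) - \tilde l_{\theta_n,\bar\eta_n} \tilde l_{\theta_n,\bar\eta_n}'(X_j)\big) \to 0$ in $P_{\theta_n,\eta_1,\eta_2,\ldots}$-probability.

The rest of the proof is standard, using (2.13)-(2.14), (2.16)-(2.17), (2.12) and the discretization of $\hat\theta_n$.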

3. q-topology; consistency of (restricted) MLE's.

In this section a family of topologies on the set of mixing distributions is introduced, which in applications can play the role of the topology needed in (2.3)-(2.6). Furthermore, it is shown that a (restricted) maximum likelihood estimator is consistent in this topology.

It is assumed that $\mathcal{Z}$ is an open subset of $\mathbb{R}^m$ (or more generally a locally compact, Hausdorff space with a countable base and countable at infinity). For each $n = 1, 2, \ldots$, $Y_{n1}, \ldots, Y_{nn}$ are measurable elements in a measurable space $(\mathcal{Y}, \mathcal{A})$, and $Y_{nj}$ has density

(3.1) $g_{\theta_n,\eta_{nj}}(y) = \int g_{\theta_n}(y, z)\, d\eta_{nj}(z),$

with respect to a $\sigma$-finite measure $\nu$. Here $\eta_{n1}, \ldots, \eta_{nn}$ are unknown probability measures on the Borel $\sigma$-field $\mathcal{C}$ of $\mathcal{Z}$, and the sequence $\theta_n$ is treated as known.

The problem is to estimate $\bar\eta_n = n^{-1}\sum_{j=1}^n \eta_{nj}$, based on $Y_{n1}, \ldots, Y_{nn}$.

3.1. q-topology. Let $q$ be a continuous, positive function from $\mathcal{Z}$ into $\mathbb{R}$. Let $\mathcal{H}_s$ be the set of all sub-probability distributions $\eta$ on $(\mathcal{Z}, \mathcal{C})$ with $\int q\, d\eta < \infty$. Define a (metrizable) topology, called the q-topology, on $\mathcal{H}_s$ by saying that

(3.2) $\eta_n \xrightarrow{q} \eta$ if and only if $\int c\,q\, d\eta_n \to \int c\,q\, d\eta$, all $c \in C_0(\mathcal{Z})$.

Here $C_0(\mathcal{Z})$ is the set of all continuous, real functions on $\mathcal{Z}$ which vanish at infinity (i.e. the closure in the uniform norm of the set of continuous functions with compact support). The convergence concept (3.2) indeed corresponds to a topology on $\mathcal{H}_s$.

Lemma 3.1. The convergence concept (3.2) corresponds to a topology on $\mathcal{H}_s$ with the properties:

(i). For every $M > 0$, $\mathcal{H}_M = \{\eta \in \mathcal{H}_s : \int q\, d\eta \le M\}$ is q-compact.

(ii). The q-topology is metrizable by a metric $\delta_q$.

Proof. Embed $\mathcal{H}_M$ in the set $\mathcal{T}_M$ of positive measures $\tau$ on $(\mathcal{Z}, \mathcal{C})$ with total mass not exceeding $M$, through

$\tau(C) = \int_C q\, d\eta \qquad (C \in \mathcal{C}).$

This embedding extends to an embedding of $\mathcal{H}_s = \cup_{M>0} \mathcal{H}_M$ into the set $\mathcal{T}$ of all positive measures on $(\mathcal{Z}, \mathcal{C})$. Clearly $\eta_n \xrightarrow{q} \eta$ if and only if $\tau_n \to \tau$ in the vague topology on $\mathcal{T}$. Thus the q-topology is the relative vague topology of $\mathcal{T}$ under the above embedding. Now (ii) follows immediately from metrizability of the vague topology on $\mathcal{T}$ (cf. Bauer (1981), p243). Next, since $\mathcal{T}_M$ is vaguely compact, it suffices for (i) to show that $\mathcal{H}_M$ is closed as a subset of $\mathcal{T}_M$, i.e. if for $\{\eta_n\} \subset \mathcal{H}_s$,

$\int c\,q\, d\eta_n \to \int c\, d\tau$, all $c \in C_0(\mathcal{Z})$,

then we must show that $d\tau = q\, d\eta$ for some $\eta \in \mathcal{H}_s$. For $m = 1, 2, \ldots$ let $\chi_m \in C_0(\mathcal{Z})$ have compact support and be such that $0 \le \chi_m \uparrow 1$ as $m \to \infty$. Then

$\int \chi_m\, q^{-1}\, d\tau = \lim_{n\to\infty} \int \chi_m\, q^{-1} q\, d\eta_n \le 1.$

By monotone convergence we conclude $\int q^{-1}\, d\tau \le 1$. Thus we can set $d\eta = q^{-1}\, d\tau$.

For $q(z) \equiv 1$ the q-topology induced on the set of probability measures is precisely the weak topology. However, if $q(z) \to \infty$ as $z$ converges to the point at infinity (the boundary of $\mathcal{Z}$), then the q-topology is stronger than the weak topology. For instance if $\mathcal{Z} = \mathbb{R}^+$ and $q(z) = z^{-2} \vee z^2$, then $\eta_n \xrightarrow{q} \eta$ if and only if

$\int f\, d\eta_n \to \int f\, d\eta,$

for all continuous functions $f$ with $f = o(q)$ as $z \to \infty$ or $z \downarrow 0$, i.e. $f(z) = o(z^2)$ for $z \to \infty$ and $f(z) = o(z^{-2})$ if $z \downarrow 0$. For this topology the subset of probability measures is also closed in $\mathcal{H}_s$.
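A small numerical illustration of why the q-topology with $q(z) = z^{-2} \vee z^2$ is stronger than the weak topology. The sequence $\eta_n = (1 - 1/n)\,\delta_1 + (1/n)\,\delta_n$ is a hypothetical example chosen for this sketch, not taken from the paper: it converges weakly to $\delta_1$, but the integral of $f(z) = z$, which is $o(q)$ at both $0$ and $\infty$, converges to 2 instead of 1.

```python
def integrate(f, atoms):
    """Integral of f against a discrete measure given as (point, weight) pairs."""
    return sum(w * f(z) for z, w in atoms)


def eta_n(n):
    """Hypothetical sequence: mass 1 - 1/n at z = 1 and mass 1/n at z = n."""
    return [(1.0, 1.0 - 1.0 / n), (float(n), 1.0 / n)]


def bounded(z):
    # A bounded continuous test function on (0, infinity).
    return 1.0 / (1.0 + z)


def identity(z):
    # f(z) = z is o(z^2) at infinity and o(z^-2) near zero.
    return z


limit_point = [(1.0, 1.0)]  # unit mass at z = 1
for n in (10, 1000, 10**6):
    weak_gap = abs(integrate(bounded, eta_n(n)) - integrate(bounded, limit_point))
    q_gap = abs(integrate(identity, eta_n(n)) - integrate(identity, limit_point))
    print(n, weak_gap, q_gap)
```

The first gap shrinks with $n$ while the second stays near 1, so this sequence does not converge in the q-topology although it converges weakly.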

3.2. Restricted MLE's. Pfanzagl (1987) shows that in the structural version of the model the unrestricted MLE is typically consistent in the weak topology. To enforce consistency in a general q-topology, we use a simple device: restrict the maximization of the likelihood to the mixing distributions with expectation of $q$ bounded from above. Of course we don't want to assume that the true q-moment of $\eta$ is known. Hence we must either let the bound increase to infinity as $n \to \infty$, or use a bound based on an (over)estimate of the true expectation of $q$. In our examples the second possibility turns out to be feasible, and the more convenient one. Therefore we restrict ourselves to this device.

Apart from this, it may be useful or necessary for the actual computation of the maximum likelihood estimate to restrict the maximization to a still smaller subset of mixing distributions (for instance a finite dimensional one). Of course this subset will depend on the number of observations and increase as $n \to \infty$.

Let $\mathcal{H}$ be the set of all probability measures $\eta$ on $(\mathcal{Z}, \mathcal{C})$ with $\int q\, d\eta < \infty$. Next, given a 'random upperbound' $Q_n$ for $\int q\, d\eta$ and a subset $\mathcal{H}_n$ of $\mathcal{H}$, let $\hat\eta_n$ be an element of $\hat{\mathcal{H}}_n := \mathcal{H}_n \cap \{\eta \in \mathcal{H} : \int q\, d\eta \le Q_n\}$ such that

(3.2) $\prod_{j=1}^n g_{\theta_n,\hat\eta_n}(Y_{nj}) \ge c \sup_{\eta \in \hat{\mathcal{H}}_n} \prod_{j=1}^n g_{\theta_n,\eta}(Y_{nj}),$

for some $c > 0$. The choice $c = 1$ yields the 'full' restricted MLE. By Lemma 3.1(i) this certainly exists if $\eta \to g_{\theta_n,\eta}(y)$ is continuous for every $y$ and $\mathcal{H}_n$ is closed as a subset of $\mathcal{H}_s$, both with respect to the q-topology.

It is shown below that $\hat\eta_n$ thus defined is consistent in the q-topology provided that the union of the $\mathcal{H}_n$ satisfies a denseness condition. This is satisfied in particular when $\mathcal{H}_n = \mathcal{H}$ for every $n$. A consequence of this strong result is that the present asymptotics give little guidance concerning the choice of sieves $\mathcal{H}_n$. Any reasonable sequence of sieves will have the denseness property. Of course, one expects that decreasing the size of $\mathcal{H}_n$ will improve the performance of the estimator provided the true (average) mixing distribution is contained in $\mathcal{H}_n$, but will make it worse in the converse case. Then, if there is no reason to assume that the true mixing distribution has a certain parametric form, it seems safest to choose the sieve as large as is computationally feasible.

For computing an unrestricted MLE several algorithms have been suggested in the literature (cf. Laird (1978), Heckman and Singer (1984), Jewell (1982), Lindsay (1983a)). Typically the unrestricted MLE can be chosen discrete with at most $n$ support points (Lindsay (1983a)). The algorithms suggested by the above authors are more or less based on this property. We don't know whether the discreteness property is retained in a restricted maximization problem as above and have not studied algorithms for computing a restricted MLE in any detail. Of course, if the $\mathcal{H}_n$ are chosen finite dimensional, then computation is possible by a variety of algorithms, at least in principle.
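As one concrete instance of a finite dimensional sieve, the sketch below restricts the mixing distribution to a fixed grid of support points and maximizes the likelihood over the weights with the standard EM iteration for mixture weights. This is a swapped-in textbook technique, not an algorithm studied in this paper, and it ignores the moment bound $Q_n$. The grid, the data values and the function names are hypothetical; the kernel $z^2 s e^{-zs}$ is the one that appears in the paired exponential model of Section 4.

```python
import math


def gamma2_kernel(s, z):
    """Kernel g(s, z) = z^2 s exp(-z s) for s > 0 (used here as an example)."""
    return z * z * s * math.exp(-z * s)


def log_lik(ys, grid, weights, kernel):
    """Log-likelihood of the mixture with the given weights on the grid."""
    return sum(
        math.log(sum(w * kernel(y, z) for z, w in zip(grid, weights)))
        for y in ys
    )


def em_weights(ys, grid, kernel, n_iter=200):
    """EM iterations for the weights of a mixing distribution on a fixed grid."""
    m = len(grid)
    weights = [1.0 / m] * m  # start from the uniform distribution on the grid
    for _ in range(n_iter):
        new = [0.0] * m
        for y in ys:
            parts = [w * kernel(y, z) for z, w in zip(grid, weights)]
            total = sum(parts)
            for k in range(m):
                new[k] += parts[k] / total  # posterior weight of grid point k
        weights = [v / len(ys) for v in new]
    return weights


ys = [0.3, 0.5, 1.2, 2.0, 0.1, 4.0, 0.8]  # made-up observations
grid = [0.5, 1.0, 2.0, 4.0]               # made-up support points
fitted = em_weights(ys, grid, gamma2_kernel)
```

Each EM step cannot decrease the likelihood, so the fitted weights are at least as good as the uniform starting point. Enforcing a bound on $\int q\, d\eta$ would need a constrained optimizer instead.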

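The following regularity conditions are assumed to hold. For any subprobability measures $\gamma, \gamma_n$, $n = 1, 2, \ldots$

(3.3) $g_{\theta_n,\gamma_n}(y) \to g_{\theta,\gamma}(y)$ for every $y$, every $\gamma_n \xrightarrow{q} \gamma$, $\theta_n \to \theta$.

(3.4) $\bar\eta_n \xrightarrow{q} \eta.$

Call $\eta \in \mathcal{H}$ identifiable if there exists no $\gamma \ne \eta$ in $\mathcal{H}$ such that

(3.5) $\int_{\{g_{\theta,\gamma} = g_{\theta,\eta}\}} g_{\theta,\gamma}\, d\nu = 1.$

Identifiability in the case of mixtures over an exponential family is discussed in Pfanzagl (1987).

Next let $Q_n = q_n(Y_{n1}, \ldots, Y_{nn}, \theta_n)$ be estimators such that

$Q_n \in \mathbb{N}$ a.s.

(3.6) $Q_n = O_{P_{\theta_n,\eta_1,\eta_2,\ldots}}(1)$, $\qquad P_{\theta_n,\eta_1,\eta_2,\ldots}\big(Q_n > \int q\, d\eta\big) \to 1,$

as $n \to \infty$.

Finally let $\mathcal{H}_n$ be an increasing sequence of convex subsets of $\mathcal{H}$, satisfying

(3.7) $\bigcup_{n=1}^\infty \mathcal{H}_n \cap \{\gamma \in \mathcal{H}_s : \int q\, d\gamma \le M\}$ is q-dense in $\{\gamma \in \mathcal{H}_s : \int q\, d\gamma \le M\}$, every $M$.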

Theorem 3.2. Let (3.2)-(3.7) hold and $\eta$ be identifiable. Then $\delta_q(\hat\eta_n, \eta) \to 0$ in outer $P_{\theta_n,\eta_1,\eta_2,\ldots}$-probability.

Theorem 3.2 is formulated in terms of outer measure, because it is hard to say in general whether $\delta_q(\hat\eta_n, \eta)$ is measurable. However, when the restricted maximum likelihood estimator is unique and $\hat{\mathcal{H}}_n$ is compact, then for every closed $F$

Under (3.3) the latter set is measurable, so that $\hat\eta_n$ is measurable in the Borel $\sigma$-field of $\mathcal{H}$. We shall ignore the measurability issue in the examples in Sections 4-6.

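Proof. Given a constant $M > \int q\, d\eta$, let $\mathcal{H}_M$ be the set of all sub-probability measures $\gamma$ with $\int q\, d\gamma \le M$. Choose a sequence $\{\eta_n\}$ in $\mathcal{H}_M$ with $\eta_n \in \mathcal{H}_n$ and $\eta_n \xrightarrow{q} \eta$.

Fix $\alpha \in (0, 1)$ and $\gamma \in \mathcal{H}_M$. By convexity of $u \to u \log(1 + \alpha(u - 1))$, identifiability of $\eta$ and Jensen's inequality

(3.8) $\int_{\{g_{\theta,\gamma} > 0\}} \log\Big[1 + \alpha\Big(\frac{g_{\theta,\eta}}{g_{\theta,\gamma}} - 1\Big)\Big]\, g_{\theta,\eta}\, d\nu > 0.$

For $U \subset \mathcal{H}_M$ write $g_{\theta,U}(y) = \sup_{\gamma' \in U} g_{\theta,\gamma'}(y)$. Set $U_m = \{\gamma' \in \mathcal{H}_M : \delta_q(\gamma', \gamma) < m^{-1}\}$. By (3.4) for every $m = m_n \to \infty$

(3.9)

every $y$, as $n \to \infty$.

By (3.3)-(3.4), (3.8)-(3.9) and Fatou's Lemma there exists a constant $M_\gamma$ such that

$\liminf_{n\to\infty} \int \Big(\log\Big[1 + \alpha\Big(\frac{g_{\theta_n,\eta_n}}{g_{\theta_n,U_{m_n}}} - 1\Big)\Big] \wedge M_\gamma\Big)\, g_{\theta_n,\bar\eta_n}\, d\nu > 0.$

(Note that $\log(1 + \alpha(u - 1)) \ge \log(1 - \alpha)$ if $u \ge 0$.) Thus for every $\gamma \in \mathcal{H}_M$ there exists a q-open neighbourhood $U_\gamma$ and a constant $M_\gamma$ such that, with

$Z_{nj}(\gamma) = \log\Big[1 + \alpha\Big(\frac{g_{\theta_n,\eta_n}}{g_{\theta_n,U_\gamma}}(Y_{nj}) - 1\Big)\Big],$

(3.10) $\liminf_{n\to\infty} E_{\theta_n,\eta_1,\eta_2,\ldots}\, n^{-1}\sum_{j=1}^n \big(Z_{nj}(\gamma) \wedge M_\gamma\big) > 0.$

On $\{Q_n = M\}$, $\hat\eta_n$ and $\eta_n$ are both contained in $\hat{\mathcal{H}}_n$. By (3.2) and convexity of $\mathcal{H}_n$

Rewrite this as

$n^{-1}\sum_{j=1}^n \log \frac{g_{\theta_n,\alpha\eta_n + (1-\alpha)\hat\eta_n}}{g_{\theta_n,\hat\eta_n}}(Y_{nj}) \le -n^{-1}\log c,$

(3.11) $n^{-1}\sum_{j=1}^n \log\Big[1 + \alpha\Big(\frac{g_{\theta_n,\eta_n}}{g_{\theta_n,\hat\eta_n}}(Y_{nj}) - 1\Big)\Big] \le -n^{-1}\log c.$

Fix $\varepsilon > 0$. The set $A = \mathcal{H}_M - \{\gamma \in \mathcal{H}_M : \delta_q(\gamma, \eta) < \varepsilon\}$ is q-compact by Lemma 3.1(i). From the cover $\{U_\gamma : \gamma \in A\}$ extract a finite subcover $U_{\gamma_1}, \ldots, U_{\gamma_k}$. By (3.11)

$\{\delta_q(\hat\eta_n, \eta) \ge \varepsilon \wedge Q_n = M\} \subset \bigcup_{i=1}^k \Big\{ n^{-1}\sum_{j=1}^n \log\Big[1 + \alpha\Big(\frac{g_{\theta_n,\eta_n}}{g_{\theta_n,U_{\gamma_i}}}(Y_{nj}) - 1\Big)\Big] \le -n^{-1}\log c \Big\} \subset \bigcup_{i=1}^k \Big\{ n^{-1}\sum_{j=1}^n \big(Z_{nj}(\gamma_i) \wedge M_{\gamma_i}\big) \le -n^{-1}\log c \Big\}.$

By (3.10) and the law of large numbers the $P_{\theta_n,\eta_1,\eta_2,\ldots}$-probability of the last set converges to zero.

Finally

$P_{\theta_n,\eta_1,\eta_2,\ldots}\big(\delta_q(\hat\eta_n, \eta) \ge \varepsilon\big) \le \sum_{M = [\int q\, d\eta] + 1}^{M'} P_{\theta_n,\eta_1,\eta_2,\ldots}\big(\delta_q(\hat\eta_n, \eta) \ge \varepsilon \wedge Q_n = M\big) + P_{\theta_n,\eta_1,\eta_2,\ldots}\big(Q_n \notin (\textstyle\int q\, d\eta, M']\big).$

Here the second term can be made arbitrarily small by (3.6) and the first term converges to zero for every fixed $M' > 0$, by the above argument.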

4. Paired exponential model. 

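Write the observations as pairs $(X, Y)$ and let

$p_\theta(x, y, z) = z e^{-zx}\, \theta z e^{-\theta z y}$

($z \in \mathcal{Z} = \mathbb{R}^+$, $(x, y) \in (\mathbb{R}^2)^+$, $\theta \in \Theta = \mathbb{R}^+$). Thus in the incidental version of the model, the problem is to estimate $\theta$, the ratio of the hazard rates within pairs of exponentially distributed variables, where the baseline hazards $z_j$ are unknown and may differ over pairs.

With $\psi_\theta(X, Y) = X + \theta Y$ it follows that

$\tilde l_{\theta,\eta}(x, y) = \frac{x - \theta y}{2\theta(x + \theta y)} + \frac{g_\eta'}{g_\eta}(x + \theta y)\, \frac{\theta y - x}{2\theta},$

where $g_\eta(s) = \int_0^\infty z^2 s\, e^{-zs}\, d\eta(z)$ for $s > 0$. Given $X + \theta Y = s$, $X - \theta Y$ has a uniform distribution on $[-s, s]$. Thus $\beta_\theta(s) = s^2/(12\theta^2)$. Assume that

(4.1) $\bar\eta_n \to \eta$ in the weak topology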

(4.2) l-: + z')d~n(z) ~ j(z-' + z')d~(z) < 00. 
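The efficient score of the paired exponential model given in this section can be evaluated directly when the mixing distribution is discrete. The Python sketch below is illustrative: the function names and the list-of-atoms representation of $\eta$ are assumptions made for the example. Because $X - \theta Y$ is uniform on $[-s, s]$ given $X + \theta Y = s$, the score is an odd function of $x - \theta y$ at fixed $s$, which the last lines check at one made-up point.

```python
import math


def g_and_gprime(s, atoms):
    """g_eta(s) = int z^2 s exp(-z s) d eta(z) and its derivative in s.

    atoms : discrete eta as a list of (support point z, weight) pairs.
    """
    g = sum(w * z * z * s * math.exp(-z * s) for z, w in atoms)
    gprime = sum(w * z * z * (1.0 - z * s) * math.exp(-z * s) for z, w in atoms)
    return g, gprime


def efficient_score(theta, x, y, atoms):
    """Efficient score for theta at one pair (x, y), with psi = x + theta * y."""
    s = x + theta * y
    g, gprime = g_and_gprime(s, atoms)
    first = (x - theta * y) / (2.0 * theta * s)
    second = (gprime / g) * (theta * y - x) / (2.0 * theta)
    return first + second


atoms = [(0.5, 0.4), (2.0, 0.6)]  # made-up mixing distribution
theta = 2.0
# Two pairs with the same x + theta*y = 2.5 and opposite x - theta*y.
a = efficient_score(theta, 1.0, 0.75, atoms)
b = efficient_score(theta, 1.5, 0.5, atoms)
```

The values `a` and `b` cancel, reflecting that the score has conditional mean zero given the sufficient statistic whatever mixing distribution is plugged in.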


Let ?in be an increasing sequence of subsets of the set 'H of all probability measures on 

m+ ea tis.fying (3.7) with q(z) = z', (0 < c:s h fixed). Set 

(4.3) ?in = 'H." n {-y E 1/ ; f z' d-y(z) :s eln2:i~l XT'} 

Theorem 4.1. Let (4.1)-(4.3) hold, let $\hat\theta_n$ be a discretized, $\sqrt{n}$-consistent estimator for $\theta$ and let $\hat\eta_n(\theta)$ maximize $\prod_{j=1}^n p_{\theta,\eta}(X_j, Y_j)$ over $\hat{\mathcal{H}}_n$. Then $T_n$ defined by (1.7) satisfies (1.8).

As for the regularity conditions, it is known from van der Vaart (1988), Theorem 5.17, that (4.2) is unnecessary for the existence of an estimator sequence $T_n$ satisfying (1.8). Thus one might hope that Theorem 4.1 can be slightly improved.

It is well-known that the unique solution of

(4.4) $\sum_{j=1}^n \frac{X_j - \theta Y_j}{X_j + \theta Y_j} = 0$

is a $\sqrt{n}$-consistent estimator for $\theta$. In fact, it is asymptotically normal for every sequence $\{\eta_j\}$, because the distribution of the $X_j/Y_j$ is independent of $\eta_j$.
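The preliminary estimator defined by (4.4) is easy to compute, because each summand is strictly decreasing in $\theta$. The Python sketch below simulates the incidental version of the model and solves (4.4) by bisection on a logarithmic scale. The sample size, the sequence of hazards $z_j$, the bracketing interval and all function names are hypothetical choices made for this illustration.

```python
import random


def estimating_function(theta, pairs):
    """Left hand side of (4.4) as a function of theta."""
    return sum((x - theta * y) / (x + theta * y) for x, y in pairs)


def solve_theta(pairs, lo=1e-8, hi=1e8, n_iter=200):
    """Root of (4.4) by bisection; the function decreases from n to -n."""
    for _ in range(n_iter):
        mid = (lo * hi) ** 0.5  # geometric midpoint, since theta > 0
        if estimating_function(mid, pairs) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


def simulate(theta, hazards, rng):
    """X_j has hazard z_j and Y_j has hazard theta * z_j, independently."""
    return [(rng.expovariate(z), rng.expovariate(theta * z)) for z in hazards]


rng = random.Random(0)
hazards = [0.5 + (j % 5) for j in range(4000)]  # incidental z_j, made up
pairs = simulate(2.0, hazards, rng)
theta_prelim = solve_theta(pairs)
```

The estimate does not use the $z_j$ at all, which matches the remark that the distribution of $X_j/Y_j$ does not depend on the mixing distributions.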

Proof. The theorem is a corollary of Theorems 2.1 and 3.2 applied with the q-topology of $q(z) = z^c$. It is tedious, but straightforward to check (2.7)-(2.12) (cf. van der Vaart (1988), p156-159). From the other assumptions only (2.5) needs comment. First

(4.5) 

The set of functions .z: ---+ (s.z:)f.-ce-.u on m+, (0 < s < 1), is uniformly bounded and 

equi-continuous. Hence it is pre-compact in Co(Z) and if ""( is sufficiently close to "I in 

the as-topology, then !(.u)"'-Ce-U ZC d""(.z:) is uniformly close to !(sz)f-Ce-u.z:c d71(z) . 

Therefore, the right hand side of (4.5) can for s < 1 and-r sufficiently close to "I be bounded 

by 

(4.6) 

Let 0 < a O. Then for ""( sufficiently close to 71, we have 

""(a,b) > ~p. But then the right hand side of (4.5) can for such rr and s > 1 be bounded 


y 

(4.7) 

s: S'Z4.-.. d7(Z)] 

2+2(3bs)'+2 ", 

[ fa sz2e-u d7(Z) 

g,.(s) 

< [2 + 18b 2s2 +2(sp~a2e-b')-ls3e-2b' 1~ Z4e-:tJ:d7(Z)] gi7n(s) 

::; [2 + 18b's' + 4p- 1a-'.-'·s-'64 ] g,.(8). 

Combination of the bounds (4.5)-(4.7) shows that (2.5) is satisfied, if for sufficiently 

large N the set {(8' +8'-') g,.(8): n > N} is equi-integrable. Now by (4.1)-(4.2) 

J(8' + 8'-') g,. (8) ds = J(6Z-' + r(c)z'-') d'in(z) 

--+ J(6z-' +r(c)z'-') d1)(z) = J(8' +8'-') g.(8)d8. 

Then (cf. Theorem 13.47 in Hewitt and Stromberg (1965)) 

5. Errors-in-variables. 

Write the observations as pairs $(X, Y)$. The incidental version of the model is given by

$X_j = z_j + e_j$

$Y_j = \alpha + \beta z_j + f_j,$

where $\binom{e_1}{f_1}, \binom{e_2}{f_2}, \ldots$ are i.i.d. unobservable $N(0, \Sigma^{-1})$ distributed vectors, and $z_j$ unknown numbers in $\mathcal{Z} = \mathbb{R}$. Set $\theta = (\alpha, \beta, \Sigma)$. To make this parameter identifiable in the structural version of the model one can put restrictions on either $\mathcal{H}$ or $\Sigma$. Indeed, it suffices that $\mathcal{H}$ does not contain normal distributions (where pointmasses are considered normal); alternatively it can be assumed that $\Sigma = \sigma^2 \Sigma_0$, where $\Sigma_0$ is known. Identifiability is obviously crucial for the existence of a $\sqrt{n}$-consistent estimator sequence for $\theta$. However, it does not play a role in the validity of Theorem 2.1 on the improvement of such an estimator. Therefore, we do not discuss the matter here, but refer to the rather large literature on the model. See Anderson (1984) and Bickel and Ritov (1987) and the references cited there. The assumptions imposed do affect the dimension of $\theta$, though. Below we give the formulas for the case that $\Sigma$ is a free positive definite matrix $\begin{pmatrix} \sigma & \rho \\ \rho & \tau \end{pmatrix}$ and write $\theta = (\alpha, \beta, \sigma, \tau, \rho)$.

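A sufficient statistic in this model is $\psi_\theta(X, Y) = \binom{1}{\beta}' \Sigma \binom{X}{Y - \alpha}$. Its distribution is a mixture of normal distributions with density

where $\sigma_\theta^2 = \binom{1}{\beta}' \Sigma \binom{1}{\beta}$.

Set $M_\theta = I - \sigma_\theta^{-2} \binom{1}{\beta} \binom{1}{\beta}' \Sigma$. Then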

o 0 

P T 

;j,.(z,y) = 1 0 M, ( z ). 

y-Q 

o IJ 

{3 1 

Furthermore, 1.,.(z,y) = F.,.(Z,y)-' J1.(z ,v. z) PB(z, y, z) d'7(z), where 

1.(z,y,z) = 

(:)'M.(.:o) 

z(:)'M.(.:o) 

-t [

- 

treating $\theta$ as known, automatically satisfies the moment condition in (5.5) (Lemma 5.1, below). Thus one can carry out the construction (1.7)-(1.8) with no restriction at all when maximizing the likelihood (i.e. $\hat{\mathcal{H}}_n = \mathcal{H}$).

Proof. It is very tedious, but straightforward to check (2.7)-(2.11). Moreover $\tilde I_{\theta,\bar\eta_n} \to \tilde I_{\theta,\eta}$, so that (2.12) follows from (5.4). For fixed $\theta$ the mixing distribution $\eta$ is identifiable in $g_{\theta,\eta}$ by Proposition 6.2 of Pfanzagl (1987). By (5.1)-(5.2) $\bar\eta_n \xrightarrow{q} \eta$ in the q-topology of $q(z) = 1 \vee z^2$. Theorem 5.1 follows from Theorems 2.1 and 3.2, applied with this q-topology. The only condition that needs comment is (2.5).

For any random variable $V$ and decreasing function $b$ on $\mathbb{R}$ it holds that $\mathrm{Cov}(V, b(V)) \le 0$. In consequence $E\,|V|\phi(V) = E\,|V|\phi(|V|) \le E\,|V|\; E\,\phi(V)$. Therefore

Ig"7(')1 = If -

Multiplying with $q(z_i)$ and summing over $i$ gives the 'self-consistency equations'

(5.6) $\int q(z)\, d\hat\eta_n(z) = n^{-1}\sum_{j=1}^n E_{\hat\eta_n}\big(q(Z) \mid T = t_j\big),$

where $(T, Z)$ has distribution $\phi(t - z)\, dt\, d\eta(z)$ under $\eta$.

Next perturbation of the $z$'s yields the stationary equations

$\sum_{j=1}^n \frac{p_i\, \phi(t_j - z_i)(t_j - z_i)}{\sum_{l=1}^m p_l\, \phi(t_j - z_l)} = 0, \qquad i = 1, \ldots, m.$

Multiplying with $r(z_i)$ and summing over $i$ gives

(5.7) $n^{-1}\sum_{j=1}^n E_{\hat\eta_n}\big(r(Z) Z \mid T = t_j\big) = n^{-1}\sum_{j=1}^n E_{\hat\eta_n}\big(r(Z) \mid T = t_j\big)\, t_j.$

Combination of (5.6)-(5.7) with $q(z) = z$ and $r(z) = 1$ yields the first assertion of the lemma. Next combination with $q(z) = z^2$ and $r(z) = z$ gives

6. Normal scale mixture. 

The incidental version of the model is given by

$X_j = \theta + z_j^{-1} e_j,$

where $e_1, e_2, \ldots$ are unobservable, independent standard normal variables, $\theta \in \Theta = \mathbb{R}$ and $z_j \in \mathcal{Z} = \mathbb{R}^+$. Of course $X_1, X_2, \ldots$ are sampled from distributions which are symmetric about $\theta$, and one may estimate $\theta$ with the estimators of Stone (1975), Bickel and Klaassen (1986), or van der Vaart (1988), Section 5.7.4, which are fully adaptive. However, these estimators do not take the normality of the error terms into account, whereas the MLE-based estimator (1.7)-(1.8) does.

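With $\psi_\theta(X) = |X - \theta|$ it follows that

$\tilde l_{\theta,\eta}(x) = \dot l_{\theta,\eta}(x) = \frac{g_\eta'}{g_\eta}\big(|x - \theta|\big) \cdot \big(-\mathrm{sgn}(x - \theta)\big),$

where $g_\eta(s) = 2\int_0^\infty z\, \phi(zs)\, d\eta(z)$ for $s > 0$. Clearly $\beta_\theta \equiv 1$.

Assume that

(6.1) $\bar\eta_n \to \eta$ in the weak topology

(6.2) $\int_0^\infty (z^2 + z^{-2})\, d\bar\eta_n(z) \to \int_0^\infty (z^2 + z^{-2})\, d\eta(z) < \infty$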

1 

(6.3) = z2 dfin.(Z) -+ 0, every E > O. 

,.,;n 
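The density $g_\eta(s) = 2\int_0^\infty z\,\phi(zs)\,d\eta(z)$ of $|X - \theta|$ can be checked numerically for a discrete mixing distribution. The Python sketch below is illustrative only: the atoms, the integration range and the helper names are assumptions made for the example. It verifies that $g_\eta$ integrates to one and that $\int s^2 g_\eta(s)\,ds = \int z^{-2}\,d\eta(z)$, which is where the $z^{-2}$ moment of the mixing distribution enters.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def g_eta(s, atoms):
    """g_eta(s) = 2 * int z phi(z s) d eta(z) for a discrete eta."""
    return 2.0 * sum(w * z * phi(z * s) for z, w in atoms)


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals on [a, b]."""
    h = (b - a) / n
    inner = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))


atoms = [(0.5, 0.3), (2.0, 0.7)]  # made-up (z, weight) pairs
total_mass = trapezoid(lambda s: g_eta(s, atoms), 0.0, 40.0, 20000)
second_moment = trapezoid(lambda s: s * s * g_eta(s, atoms), 0.0, 40.0, 20000)
inverse_square_moment = sum(w / (z * z) for z, w in atoms)
```

Here `second_moment` and `inverse_square_moment` agree up to quadrature error, since $|X - \theta| = |e|/z$ has second moment $z^{-2}$ for a standard normal $e$.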


Fix ~ < c < 2. Let 1£ be the set of probability measures on m. with f zt: d7](z) < 00, let 

1£1'10 be a .sequence of subsets satisfying (3.7) with q(z) = 1 Va", and set 

(6.4) 

where Qn = Op, ... 'l'I.,,~ •..• (I) and P9..,'Ih,'12•.•.(Qn > f zt:dfJn(z» --+ O. 

Theorem 6.1. Let (6.1)-(6.4) hold, let $\hat\theta_n$ be a discretized $\sqrt{n}$-consistent estimator for $\theta$ and let $\hat\eta_n(\theta)$ maximize $\prod_{j=1}^n p_{\theta,\eta}(X_j)$ over $\hat{\mathcal{H}}_n$. Then $T_n$ defined by (1.7) satisfies (1.8).

Construction of a suitable sequence $Q_n$ is not entirely trivial. An example is

Proof. The theorem is a corollary of Theorems 2.1 and 3.2 applied with the q-topology of $q(z) = 1 \vee z^c$. As in the previous sections the main problem is to check (2.5). First by uniform boundedness and equi-continuity of the functions $z \to (sz)^{3-c}\phi(sz)$, $(0 < s < 1)$, we have for $0 < s < 1$ and $\gamma$ sufficiently close to $\eta$ that

(6.5)

This set of functions is equi-integrable over $(0, 1)$ by (6.2)-(6.3). Next, with the same notation as in the proof of Theorem 4.1, the left hand side of (6.5) can for $s > 1$ be bounded by

[16b 4 s 2 +(ap)-ls2e-~·21l235] gil..(s). 

Equi-integrability of these functions follows from equi-integrability of $\{s^2 g_{\bar\eta_n}(s) : n = 1, 2, \ldots\}$, which follows from (6.1)-(6.2).

Acknowledgement. 

I thank Y. Ritov for permission to include Lemma 5.1 in this paper. 


References 

Anderson, T.W., (1984). Estimating linear statistical relationships. Annals Statist. 12, 1-45.

Bauer, H., (1981). Probability Theory and Elements of Measure Theory, Academic Press, London.

Begun, J.M., Hall, W.J., Huang, W.M. and Wellner, J.A., (1983). Information and asymptotic efficiency in parametric-nonparametric models. Annals Statist. 11, 432-452.

Bickel, P.J., Klaassen, C.A.J., (1986). Empirical Bayes estimation in functional and structural models, and uniform adaptive estimation of location. Adv. Appl. Math. 7, 55-69.

Bickel, P.J., Ritov, Y., (1987). Efficient estimation in the errors in variables model. Annals Statist. 15, 513-540.

Hajek, J., (1970). A characterization of limiting distributions of regular estimators. Z. Wahrsch. Th. Verw. Gebiete 14, 323-330.

Hajek, J., (1972). Local asymptotic minimax and admissibility in estimation. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 1, University of California Press, Berkeley, 175-194.

Hasminskii, R.Z., Nussbaum, M., (1984). An asymptotic minimax bound in a regression model with an increasing number of nuisance parameters. P. Mandl, M. Huskova (eds.). Proc. Third Prague Symp. As. Statistics, Elsevier, Amsterdam, 275-283.

Heckman, J., Singer, B., (1984). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica 52, 271-320.

Hewitt, E., Stromberg, K., (1965). Real and Abstract Analysis, Springer Verlag, Berlin.

Jewell, N.P., (1982). Mixtures of exponential distributions. Annals Statist. 10, 479-484.

Kiefer, J., Wolfowitz, J., (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many nuisance parameters. Ann. Math. Statist. 27, 887-906.

Laird, N., (1978). Nonparametric maximum likelihood estimation of a mixing distribution. J. Amer. Statist. Assoc. 73, 805-811.

Lindsay, B.G., (1983a). The geometry of mixture likelihoods, I and II. Annals Statist. 11, 86-94 and 783-792.

Lindsay, B.G., (1983b). Efficiency of the conditional score in a mixture setting. Annals Statist. 11, 486-497.

Lindsay, B.G., (1985). Using empirical partially Bayes inference for increased efficiency. Annals Statist. 13, 914-931.

Nussbaum, M., (1984). An asymptotic minimax risk bound for estimation of a linear functional relationship. J. Multivariate Anal. 14, 300-314.

Pfanzagl, J., Wefelmeyer, W., (1982). Contributions to a General Asymptotic Statistical Theory. Lecture Notes in Statistics 13, Springer Verlag, New York.

Pfanzagl, J., Wefelmeyer, W., (1985). Asymptotic Expansions for General Statistical Models. Lecture Notes in Statistics 31, Springer Verlag, New York.

Pfanzagl, J., (1987). Consistency of Maximum Likelihood Estimators for Certain Nonparametric Families, in particular: Mixtures. Preprint 110, University of Cologne.

Stone, C., (1975). Adaptive maximum likelihood estimation of a location parameter. Annals Statist. 3, 267-284.

Vaart, A.W. van der, (1988). Statistical Estimation in Large Parameter Spaces. CWI-tract 44, Centrum voor Wiskunde en Informatica, Amsterdam.

Vaart, A.W. van der, (1988a). Estimating a real parameter in a class of semi-parametric models. Annals Statist. 16, to appear.
