Stat 8112 Lecture Notes
The Wald Consistency Theorem
Charles J. Geyer
April 29, 2012

1 Analyticity Assumptions

Let { f_θ : θ ∈ Θ } be a family of subprobability densities¹ with respect to a measure µ on a measurable space S. We start with the following assumptions.

(a) Both the sample space S and the parameter space Θ are Borel spaces.

(b) The map (x, θ) ↦ f_θ(x) is upper semianalytic.

A topological space is a Borel space if it is homeomorphic to a Borel subset of a Polish space (Bertsekas and Shreve, 1978, Definition 7.7). Examples of Borel spaces include any Borel subset of a Euclidean space R^d and, more generally, any Borel subset of a Polish space (Bertsekas and Shreve, 1978, Proposition 7.12). If X and Y are Borel spaces, a function g : X × Y → R is upper semianalytic if for every c ∈ R the set

$$\{\, (x, y) \in X \times Y : g(x, y) > c \,\}$$

is an analytic subset of X × Y. This is not the place to define analytic sets; see Bertsekas and Shreve (1978, pp. 156 ff.). A much stronger condition that implies upper semianalyticity is simply that the map (x, θ) ↦ f_θ(x) be (jointly) Borel measurable, i.e., that

$$\{\, (x, \theta) \in S \times \Theta : f_\theta(x) > c \,\}$$

is a Borel subset of S × Θ for each c ∈ R (Bertsekas and Shreve, 1978, Proposition 7.36). This always holds in practical applications.

¹ A real-valued function f is a subprobability density with respect to µ if f(x) ≥ 0 for µ-almost all x and ∫ f dµ ≤ 1. A subprobability density is actually a probability density if ∫ f dµ = 1. We are mostly interested in families of probability densities, but subprobability densities enter in the process of compactification (Section 3).
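To make footnote 1 concrete, here is a minimal numerical check in Python; the particular density is an assumed example, not from the notes.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Assumed example: half the standard normal density is a subprobability
# density (nonnegative, total mass 1/2 <= 1), while the standard normal
# density itself is an actual probability density (total mass 1).
half_density = lambda x: 0.5 * stats.norm.pdf(x)
print(quad(half_density, -np.inf, np.inf)[0])    # approx. 0.5
print(quad(stats.norm.pdf, -np.inf, np.inf)[0])  # approx. 1.0
```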


2 Topological Assumptions

Let θ₀ be a point in Θ, the "true" parameter value, and denote the probability measure having density f_θ₀ with respect to µ by P_θ₀. We assume, without loss of generality, that P_θ₀ is complete, i.e., that every subset of a P_θ₀-null set is measurable.² Suppose X₁, X₂, ... is an independent and identically distributed sequence of observations having the distribution P_θ₀. The log likelihood corresponding to a sample of size n is

$$l_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} \qquad (1)$$

(this is actually the log likelihood divided by n rather than the log likelihood, but maximizing one maximizes the other, and we shall not distinguish them).

Wald's approach makes further topological assumptions about the densities and the parameter space.

(c) The parameter space Θ is a compact metric space.

(d) For every θ ∈ Θ and every x ∈ S except for x in a P_θ₀ null set that may depend on θ, the map φ ↦ f_φ(x) is upper semicontinuous at θ.

Spelled out in detail, (d) says that for every θ ∈ Θ there is a measurable subset N_θ of S such that P_θ₀(N_θ) = 0 and for all x ∈ S \ N_θ and every sequence {φ_n} converging to θ

$$\limsup_{n \to \infty} f_{\phi_n}(x) \le f_\theta(x). \qquad (2)$$

This has the following consequence. Let d(·, ·) be a metric for Θ (being a Borel space, Θ is metrizable; Bertsekas and Shreve, 1978, p. 118). Then for every θ ∈ Θ and every x not in the null set N_θ described above

$$\limsup_{\epsilon \downarrow 0} \; \sup_{\substack{\phi \in \Theta \\ d(\theta, \phi) < \epsilon}} f_\phi(x) \le f_\theta(x). \qquad (3)$$

To see this, note that if (3) does not hold, then there are an r > f_θ(x) and a sequence {φ_n} such that d(θ, φ_n) < 1/n and f_{φ_n}(x) > r, and then (2) cannot hold either.

² Any measure space (Ω, A, µ) can be completed, producing a measure space (Ω, A*, µ*), as follows. Let A* consist of all subsets A of Ω such that there exist sets A₁ and A₂ in A with A₁ ⊂ A ⊂ A₂ and µ(A₁) = µ(A₂), and define µ*(A) = µ(A₁). Then it is easily verified that (a) A* is a σ-field, (b) µ* is countably additive on A*, and (c) A ⊂ N and µ(N) = 0 implies A ∈ A*.
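As a numerical companion to (1), here is a minimal sketch. The N(θ, 1) location family, the sample size, and the seed are assumptions made purely for illustration; they are not part of the notes.

```python
import numpy as np
from scipy import stats

def l_n(theta, x, theta0=0.0):
    """Equation (1): the average log density ratio against the true
    parameter theta0, here for an assumed N(theta, 1) family."""
    return np.mean(stats.norm.logpdf(x, loc=theta)
                   - stats.norm.logpdf(x, loc=theta0))

rng = np.random.default_rng(8112)
x = rng.normal(loc=0.0, size=10_000)  # data from the "true" P_theta0
print(l_n(0.0, x))  # exactly 0 at theta0
print(l_n(1.0, x))  # near -1/2, the Kullback-Leibler limit of Section 5
```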


3 Compactification

Assumption (c), the compactness assumption, usually does not hold in applications. The original parameter space must be compactified in such a way that the maps θ ↦ f_θ(x) remain upper semicontinuous at the points θ added to the original parameter space in the compactification, and the Wald integrability condition (Section 6 below) also holds at the points added in the compactification.

When one compactifies the parameter space, one must devise new subdensities in the family corresponding to these new points. The most obvious way to do this is by taking limits of the subdensities in the original model. If θ_n → θ with θ_n in the original model and θ a new point added in the compactification, then the pointwise limit defined by

$$f_\theta(x) = \lim_{n \to \infty} f_{\theta_n}(x), \qquad \text{for all } x, \qquad (4)$$

is an obvious candidate, provided the limit exists. Even if the f_{θ_n} are all probability densities, Fatou's lemma only guarantees that f_θ is a subprobability density (a numerical illustration appears after Section 4 below).

Compactification has been much discussed in the literature; see, for example, Kiefer and Wolfowitz (1956) and Bahadur (1971). Bahadur gives a very general recipe for compactifying the parameter space. On p. 34 of Bahadur (1971) two conditions are required: that g_M(x, r) defined by his equation (9.5) be Borel measurable and that his equation (9.6) hold. The measurability condition is unnecessary under the usual regularity conditions involving analytic sets, because then g_M(x, r) is measurable with respect to the completion of P_θ₀, and that is all that is required. Hence if we use analytic set theory, only Bahadur's (9.6) is required for a suitable compactification.

4 Identifiability

A parameterization of a family of probability distributions is identifiable if there do not exist two distinct parameter values that correspond to the same distribution. This will be assumed in this handout.

(e) The parameterization is identifiable.

This assumption can be dropped at the cost of some additional complication in the statement of theorems (Redner, 1981).
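Here is the numerical illustration of the pointwise limit (4) promised in Section 3, under an assumed N(µ, 1) family: as µ → +∞ the densities tend to zero at every fixed x, so the added point µ = +∞ receives the identically zero subprobability density, whose total mass 0 is ≤ 1, just as Fatou's lemma allows.

```python
import numpy as np
from scipy import stats

# Pointwise limit (4) for an assumed N(mu, 1) family as mu -> +infinity.
x = np.linspace(-5.0, 5.0, 11)  # a grid of fixed points x
for mu in [10.0, 100.0, 1000.0]:
    # the largest density value on the grid shrinks toward zero
    print(mu, stats.norm.pdf(x, loc=mu).max())
```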


5 Kullback-Leibler Information

Define the Kullback-Leibler information function

$$\lambda(\theta) = E_{\theta_0} \log \frac{f_\theta(X)}{f_{\theta_0}(X)} = \int \log \frac{f_\theta(x)}{f_{\theta_0}(x)} \, P_{\theta_0}(dx). \qquad (5)$$

By Jensen's inequality λ(θ) ≤ 0 for all θ, and by the conditions for equality in Jensen's inequality, λ(θ) = λ(θ₀) only if f_θ(x) = f_θ₀(x) for almost all x [P_θ₀].

Hence by the identifiability assumption (e), λ is a non-positive function on Θ that achieves its unique maximum at θ₀, where it takes the value zero. Note that the value −∞ is allowed for λ. This does not cause any difficulties.

6 The Wald Integrability Condition

For each θ ∈ Θ, the strong law of large numbers implies

$$l_n(\theta) \xrightarrow{\text{a.s.}} \lambda(\theta), \qquad \text{as } n \to \infty,$$

but this does not imply consistency of maximum likelihood. Pointwise convergence of a sequence of functions does not necessarily imply convergence of a sequence of maximizers of those functions. Some uniformity of the functional convergence is necessary.

To get this uniformity, Wald used dominated convergence, and to get that, Wald had to make an additional strong assumption, which will be referred to as "Wald's integrability condition."

(f) For every θ ∈ Θ there exists an ε > 0 such that

$$E_{\theta_0} \sup_{\substack{\phi \in \Theta \\ d(\theta, \phi) < \epsilon}} \log \frac{f_\phi(X)}{f_{\theta_0}(X)} < \infty. \qquad (6)$$

Define

$$w_{\epsilon,\theta}(x) = \sup_{\substack{\phi \in \Theta \\ d(\theta, \phi) < \epsilon}} \log \frac{f_\phi(x)}{f_{\theta_0}(x)}, \qquad (7)$$

so that (6) says E_θ₀ w_{ε,θ}(X) < ∞.
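Here is a minimal Monte Carlo check of (5), again under the assumed N(θ, 1) example, for which λ(θ) = −(θ − θ₀)²/2 is available in closed form.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2012)
theta0 = 0.0
x = rng.normal(loc=theta0, size=200_000)  # draws from P_theta0

for theta in [0.0, 0.5, 2.0]:
    # Monte Carlo estimate of (5) versus the closed form -(theta-theta0)^2/2
    mc = np.mean(stats.norm.logpdf(x, loc=theta)
                 - stats.norm.logpdf(x, loc=theta0))
    print(theta, mc, -(theta - theta0) ** 2 / 2)
```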


7 Alternative Assumptions

If one does not assume (a) and (b), then it is necessary to impose further continuity conditions on the densities to obtain Borel measurability of w_{ε,θ}. The following condition is taken from Wang (1985), but it is similar to all other conditions in this literature (Wald, 1949; Kiefer and Wolfowitz, 1956; Perlman, 1972; Geyer, 1994).

(g) For every θ ∈ Θ and every x ∈ S except for x in a P_θ₀ null set that does not depend on θ, the map φ ↦ f_φ(x) is lower semicontinuous at θ.

This condition is used only to obtain measurability, so that the integral in (6) is well-defined. Hence it is completely superfluous when the usual regularity conditions involving analytic sets hold.

8 Upper Semicontinuity

Wald's integrability condition (f) and the upper semicontinuity condition (d) imply that λ is an upper semicontinuous function. For any sequence {φ_n} converging to θ

$$\begin{aligned}
\limsup_{n \to \infty} \lambda(\phi_n)
&= \limsup_{n \to \infty} E_{\theta_0} \log \frac{f_{\phi_n}(X)}{f_{\theta_0}(X)} \\
&\le E_{\theta_0} \limsup_{n \to \infty} \log \frac{f_{\phi_n}(X)}{f_{\theta_0}(X)} \\
&\le E_{\theta_0} \log \frac{f_\theta(X)}{f_{\theta_0}(X)} \\
&= \lambda(\theta)
\end{aligned}$$

by dominated convergence, since the integrand is eventually dominated by w_{ε,θ} given by (7), and by the upper semicontinuity (2) for the second inequality.

A closed subset of a compact set is compact (Browder, 1996, Proposition 9.52). Hence for every η > 0, the set

$$K_\eta = \{\, \theta \in \Theta : d(\theta, \theta_0) \ge \eta \,\} \qquad (8)$$

is compact. An upper semicontinuous function achieves its maximum over a compact set.³ Hence λ achieves its maximum over each K_η; call that maximum m(η). By the identifiability assumption (e), we cannot have m(η) = 0, because that would imply there is a θ ∈ K_η such that λ(θ) = λ(θ₀).

³ This is true in general topology, but in metric spaces, which are our concern here, the proof is trivial: if K is a compact set and f is an upper semicontinuous function on K, then for every n there is a θ_n such that f(θ_n) ≥ sup f − 1/n, and, since compactness is the same as sequential compactness for metric spaces, there is a convergent subsequence θ_{n_k} → θ, and θ ∈ K because K is closed. Finally, lim sup f(θ_{n_k}) ≤ f(θ) by upper semicontinuity of f, but f(θ_n) → sup f by construction. Thus f(θ) = sup f.
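To make K_η and m(η) concrete, here is a small grid check under the assumed N(θ, 1) example, taking Θ = [−3, 3] and θ₀ = 0 (so that λ(θ) = −θ²/2 and m(η) = −η²/2 < 0); the grid and the values of η are illustrative.

```python
import numpy as np

theta = np.linspace(-3.0, 3.0, 6001)  # assumed Theta = [-3, 3]
lam = -theta**2 / 2.0                 # lambda(theta) for N(theta,1), theta0 = 0
for eta in [0.1, 0.5, 1.0]:
    # maximum of lambda over K_eta = {theta : |theta - theta0| >= eta}
    print(eta, lam[np.abs(theta) >= eta].max(), -eta**2 / 2.0)
```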


9 Theorem

Wald's theorem can now be roughly stated (a more precise statement follows). Wald proved that the log likelihood l_n converges to λ uniformly from above, in the sense that with probability one

$$\limsup_{n \to \infty} \sup_{\theta \in K_\eta} l_n(\theta) \le \sup_{\theta \in K_\eta} \lambda(\theta), \qquad (9)$$

where λ is defined by (5) and K_η is defined by (8). Let δ_n be any sequence decreasing to zero. Call any sequence of estimators θ̂_n(X₁, ..., X_n) satisfying

$$l_n(\hat\theta_n(X_1, \ldots, X_n)) > \sup_{\theta \in \Theta} l_n(\theta) - \delta_n$$

an approximate maximum likelihood estimator (AMLE). For such a sequence, d(θ̂_n(X₁, ..., X_n), θ₀) is eventually less than η, because there exists an N₁ ∈ ℕ such that

$$\sup_{\substack{\theta \in \Theta \\ d(\theta, \theta_0) \ge \eta}} l_n(\theta) \le m(\eta)/2, \qquad n \ge N_1,$$

where m(η) denotes the right-hand side of (9), and there exists N₂ > N₁ such that δ_n < −m(η)/2 when n ≥ N₂, from which we infer

$$l_n(\hat\theta_n(X_1, \ldots, X_n)) > m(\eta)/2, \qquad n \ge N_2,$$

and these together imply

$$d(\hat\theta_n(X_1, \ldots, X_n), \theta_0) < \eta, \qquad n \ge N_2. \qquad (10)$$

Since for every η > 0 there exists an N₂ (which may depend on η) such that (10) holds, this says that the AMLE converges almost surely to θ₀.

The reason for introducing AMLE is that the maximum of the log likelihood need not be achieved, in which case the maximum likelihood estimator (MLE) does not exist, but an AMLE always exists. If the MLE always exists, then the MLE is also an AMLE and is covered by the theorem.
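Here is a minimal sketch of an AMLE, assuming the same hypothetical N(θ, 1) family with the parameter restricted to a finite grid; the grid, the tolerance δ_n, and the sample sizes are illustrative choices, not from the notes.

```python
import numpy as np
from scipy import stats

def amle(x, grid, delta):
    """Return any theta with l_n(theta) > sup l_n - delta, i.e., an
    approximate maximum likelihood estimator over a finite grid."""
    ln = np.array([np.mean(stats.norm.logpdf(x, loc=t)) for t in grid])
    good = np.flatnonzero(ln > ln.max() - delta)
    return grid[good[0]]  # any qualifying point will do

rng = np.random.default_rng(0)
grid = np.linspace(-3.0, 3.0, 601)
for n in [100, 1000, 10000]:
    x = rng.normal(loc=1.0, size=n)                  # true theta0 = 1
    print(n, amle(x, grid, delta=1.0 / np.sqrt(n)))  # converges to 1
```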


10 Proof

Now let us go through a careful statement and proof of Wald's theorem, the main point being to establish (9).

Theorem 1. Suppose assumptions (a) through (f) hold. Then (9) holds, and consequently any approximate maximum likelihood estimator is strongly consistent.

The part of the proof showing that (9) implies strong consistency of AMLE has already been discussed. It remains only to be shown that the regularity conditions (a) through (f) imply (9). For any θ ∈ Θ and any ε small enough so that (6) holds, the strong law of large numbers implies that

$$\sup_{\substack{\phi \in \Theta \\ d(\theta, \phi) < \epsilon}} l_n(\phi) \le \frac{1}{n} \sum_{i=1}^{n} w_{\epsilon,\theta}(X_i) \xrightarrow{\text{a.s.}} E_{\theta_0} w_{\epsilon,\theta}(X), \qquad \text{as } n \to \infty. \qquad (11)$$

For each fixed x, the map ε ↦ w_{ε,θ}(x) is nonincreasing as ε ↓ 0, and by (3) its limit is at most log[f_θ(x)/f_θ₀(x)], so by monotone convergence the right-hand side of (11) decreases to a limit that is at most λ(θ) as ε ↓ 0.

Fix γ > 0. For each θ ∈ K_η we have λ(θ) ≤ m(η), so we may choose ε_θ > 0 small enough that (6) holds and

$$E_{\theta_0} w_{\epsilon_\theta,\theta}(X) < m(\eta) + \gamma.$$

The open balls V_θ = { φ ∈ Θ : d(θ, φ) < ε_θ } cover K_η, which is compact, so finitely many of them, say V_{θ_1}, ..., V_{θ_d}, cover K_η (recall that a set is compact if every open cover has a finite subcover).


Now another application of the strong law of large numbers says that for almost all sample paths X₁, X₂, ...

$$\limsup_{n \to \infty} \sup_{\phi \in V_{\theta_i}} l_n(\phi) < m(\eta) + 2\gamma, \qquad i = 1, \ldots, d. \qquad (12)$$

Since the V_{θ_i} cover K_η, this implies

$$\limsup_{n \to \infty} \sup_{\theta \in K_\eta} l_n(\theta) < m(\eta) + 2\gamma,$$

which, since γ > 0 was arbitrary, implies (9).

11 Example

The following example is taken from DeGroot and Schervish (2002, Example 6.5.8), where it purports to be an example of what is problematic about maximum likelihood (it is no such thing, as we shall see). The data are generated by the following curious mechanism: a coin is flipped, and the datum is standard normal if the coin comes up heads and general normal if the coin comes up tails. The results of the coin flips are not observed, so the distribution of the data is a mixture of two normals. The density is given by

$$f_{\mu,\nu}(x) = \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}} \left[ \exp\left(-\frac{x^2}{2}\right) + \frac{1}{\sqrt{\nu}} \exp\left(-\frac{(x - \mu)^2}{2\nu}\right) \right]. \qquad (13)$$

Admittedly, this is a toy problem. But it illustrates all the issues that arise with the general normal mixture problem, which has been studied for more than a century and has a rich literature.

If we observe independent and identically distributed (IID) data having probability density function (PDF) (13), then the likelihood is

$$L(\mu, \nu) = \prod_{i=1}^{n} f_{\mu,\nu}(x_i).$$

If we set µ = x_i and let ν → 0, then we see that the i-th term goes to infinity and the other terms are bounded. Hence

$$\hat\mu_n = x_n, \qquad \hat\nu_n = 0$$

maximizes the likelihood (at infinity) but does not converge to the true parameter value.
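The blow-up is easy to see numerically. In the following minimal sketch the simulated sample, its size, and the "tails" component N(2, 2.25) are illustrative assumptions, not from the notes.

```python
import numpy as np

def mixture_loglik(x, mu, nu):
    """Log likelihood of the two-component mixture density (13)."""
    a = np.exp(-x**2 / 2.0)
    b = np.exp(-(x - mu)**2 / (2.0 * nu)) / np.sqrt(nu)
    return np.sum(np.log(0.5 / np.sqrt(2.0 * np.pi) * (a + b)))

rng = np.random.default_rng(42)
heads = rng.random(50) < 0.5
x = np.where(heads, rng.normal(size=50),
             rng.normal(loc=2.0, scale=1.5, size=50))

# Setting mu = x[0] and letting nu -> 0 drives the log likelihood to +infinity.
for nu in [1e-1, 1e-4, 1e-8]:
    print(nu, mixture_loglik(x, mu=x[0], nu=nu))
```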


The same issue arises with the general normal mixture problem.

Here we follow Hathaway (1985) in proposing to constrain the variance parameter ν away from zero. Fix ε > 0. It may be chosen arbitrarily small, say ε = 10^(-10^10), so small that no question arises that the true variance is larger than that. Any ε will do, so long as ε > 0. If we maximize the likelihood subject to the constraint ν ≥ ε, the problem discussed above does not arise (a numerical sketch follows below).

To compactify the parameter space, we must add the values +∞ and −∞ for µ and +∞ for ν. This makes the compactified parameter space [−∞, +∞] × [ε, +∞]. It is clear that

$$f_{\mu,\nu}(x) \to \frac{1}{2} \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \qquad \mu \to \pm\infty,$$

so we assign this subprobability density to parameter points in the compactification having µ = ±∞. We get the same limit when ν → ∞, so we assign this subprobability density to those parameter points in the compactification too.

The likelihood is continuous (not just upper semicontinuous), so the upper semicontinuity condition (d) holds.
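Here is the promised minimal sketch of the constrained maximization, using a generic box-constrained optimizer. The sample, the true parameter values, the starting point, and ε are illustrative assumptions; also, a local optimizer can be caught by local maxima of a mixture likelihood, so a serious implementation would use multiple starts or EM, which this sketch does not.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, x):
    """Negative log likelihood for the mixture density (13)."""
    mu, nu = params
    a = np.exp(-x**2 / 2.0)
    b = np.exp(-(x - mu)**2 / (2.0 * nu)) / np.sqrt(nu)
    return -np.sum(np.log(0.5 / np.sqrt(2.0 * np.pi) * (a + b)))

rng = np.random.default_rng(1)
heads = rng.random(500) < 0.5
x = np.where(heads, rng.normal(size=500),
             rng.normal(loc=3.0, scale=2.0, size=500))  # true (mu0, nu0) = (3, 4)

eps = 1e-6  # the constraint nu >= eps; any eps > 0 will do
fit = minimize(neg_log_lik, x0=[0.0, 1.0], args=(x,),
               bounds=[(None, None), (eps, None)], method="L-BFGS-B")
print(fit.x)  # close to (3, 4), and closer as the sample size grows
```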


We now need to check the integrability condition (f). In this example it turns out that we can take the supremum over the whole compactified parameter space. Let (µ₀, ν₀) be the true unknown parameter vector. For any (µ*, ν*) in the compactified parameter space, using ν* ≥ ε, we have

$$\frac{f_{\mu^*,\nu^*}(x)}{f_{\mu_0,\nu_0}(x)} \le \frac{\exp\left(-\frac{x^2}{2}\right) + \frac{1}{\sqrt{\nu^*}} \exp\left(-\frac{(x-\mu^*)^2}{2\nu^*}\right)}{\exp\left(-\frac{x^2}{2}\right) + \frac{1}{\sqrt{\nu_0}} \exp\left(-\frac{(x-\mu_0)^2}{2\nu_0}\right)} \le \frac{\exp\left(-\frac{x^2}{2}\right) + \frac{1}{\sqrt{\epsilon}}}{\exp\left(-\frac{x^2}{2}\right)} \le 1 + \frac{1}{\sqrt{\epsilon}} \exp\left(\frac{x^2}{2}\right)$$

and, if we have chosen ε ≤ 1 (which we may always do),

$$\frac{f_{\mu^*,\nu^*}(x)}{f_{\mu_0,\nu_0}(x)} \le \frac{2}{\sqrt{\epsilon}} \exp\left(\frac{x^2}{2}\right),$$

so

$$\log \frac{f_{\mu^*,\nu^*}(x)}{f_{\mu_0,\nu_0}(x)} \le \log(2) - \frac{1}{2} \log(\epsilon) + \frac{x^2}{2},$$

and this is clearly integrable because second moments of the normal distribution exist.

Thus we have satisfied the conditions for Wald's theorem, and the maximum likelihood estimator is strongly consistent for this model if we impose the constraint ν ≥ ε.

Hathaway (1985) actually looked at the general normal mixture model with d components to the mixture for an arbitrary but known d. This model has 3d parameters: the d means and d variances of the components (none of which are assumed to be mean zero and variance one as in our toy example) and the d probabilities of the components of the mixture (none of which are assumed to be 1/2 as in our toy example). Instead of putting a lower bound on the variances, Hathaway (1985) imposes the constraints

$$\frac{\nu_i}{\nu_j} \ge \epsilon, \qquad i = 1, \ldots, d \text{ and } j = 1, \ldots, d. \qquad (14)$$

So the variances have no lower bound or upper bound, but they are not allowed to be too different from each other (ε = 10^(-10^10) is still OK, so they can be very different, but not arbitrarily different). Hathaway (1985) proves strong consistency of the constrained maximum likelihood estimator for the general normal mixture problem, the constraints being (14). His proof uses Wald's theorem.

12 A Trick due to Kiefer and Wolfowitz

Kiefer and Wolfowitz (1956) noted that Wald's method does not work as stated even for location-scale families, but a simple modification does work. Given a PDF f, the location-scale family with standard density f is the family of distributions having densities given by

$$f_{\mu,\sigma}(x) = \frac{1}{\sigma} \cdot f\left(\frac{x - \mu}{\sigma}\right) \qquad (15)$$

where µ, the location parameter, is real-valued and σ, the scale parameter, is positive-real-valued. The "standard density" f is often chosen to have mean zero and variance one, in which case µ is the mean and σ is the standard deviation of the distribution having density (15). But f need not have moments (for example, f could be the standard Cauchy PDF), in which case µ is not the mean and σ is not the standard deviation (the mean and standard deviation don't exist), and even if f does have moments, it need not have mean zero and variance one, in which case µ is not the mean and σ is not the standard deviation (the mean and standard deviation do exist but aren't µ and σ).
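A small sketch of the construction (15), with assumed standard densities: with f the standard normal PDF, (15) recovers the N(µ, σ²) density, while with f the standard Cauchy PDF it gives a family in which µ and σ are location and scale but not mean and standard deviation (no moments exist).

```python
import numpy as np
from scipy import stats

def loc_scale_pdf(x, mu, sigma, f=stats.norm.pdf):
    """The location-scale density (15) built from a standard density f."""
    return f((x - mu) / sigma) / sigma

x = np.linspace(-3.0, 3.0, 7)
# With f standard normal, (15) is exactly the N(mu, sigma^2) density.
print(np.allclose(loc_scale_pdf(x, 1.0, 2.0),
                  stats.norm.pdf(x, loc=1.0, scale=2.0)))  # True
# With f standard Cauchy, mu and sigma are only location and scale.
print(loc_scale_pdf(x, 1.0, 2.0, f=stats.cauchy.pdf))
```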


The same problem arises with the location-scale problem as arises with the normal mixture problem: when µ = x, the limit as σ → 0 in (15) is infinite. This gives problems both in trying to compactify the model and in verifying the integrability condition. To compactify the model, observe that

$$\lim_{\mu \to \pm\infty} f_{\mu,\sigma}(x) = 0, \qquad \text{for all } x \text{ and all } \sigma, \qquad \text{(16a)}$$

and

$$\lim_{\sigma \to +\infty} f_{\mu,\sigma}(x) = 0, \qquad \text{for all } x \text{ and all } \mu, \qquad \text{(16b)}$$

and

$$\lim_{\sigma \to 0} f_{\mu,\sigma}(x) = 0, \qquad \text{for all } x \text{ and all } \mu \text{ such that } x \ne \mu, \qquad \text{(16c)}$$

with (16a) and (16b) holding simply because we must have f(x) → 0 as x → ±∞ in order for f to be integrable, but (16c) needing verification in each case, because it requires f(x) = o(1/|x|) as x → ±∞, which is typical behavior for a PDF but not necessary. If all three of these hold, we see that the subprobability densities that are equal to zero almost everywhere are the appropriate densities to assign to all points added to the original parameter space in making the compactification.

We run into trouble verifying the integrability condition. Consider verifying it at a parameter point (µ, 0) in the compactification. For convenience we take the supremum over a box-shaped neighborhood B = (µ − η, µ + η) × [0, δ). Let µ₀ and σ₀ be the true unknown parameter values. Then

$$\sup_{(\mu^*, \sigma^*) \in B} \log \frac{f_{\mu^*,\sigma^*}(x)}{f_{\mu_0,\sigma_0}(x)} = \infty, \qquad \mu - \eta < x < \mu + \eta,$$

and this cannot be integrable, for any µ, no matter how small we take η to be (so long as η > 0).

The trick due to Kiefer and Wolfowitz (1956) is to simply consider IID pairs (X₁, X₂) having marginal distributions (15). Because this distribution is continuous, we have X₁ ≠ X₂ almost surely, and we may (or may not) have

$$E_{\mu_0,\sigma_0} \sup_{(\mu^*, \sigma^*) \in B} \log \frac{f_{\mu^*,\sigma^*}(X_1) \, f_{\mu^*,\sigma^*}(X_2)}{f_{\mu_0,\sigma_0}(X_1) \, f_{\mu_0,\sigma_0}(X_2)} < \infty \qquad (17)$$

(the Wald integrability condition applied to the distribution of IID pairs). Roughly speaking, pairing helps because when X₁ ≠ X₂ at most one of the two factors in the numerator can blow up as σ* → 0, and the other factor, which goes to zero, can offset it.
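The effect of pairing is easy to see numerically for an assumed standard Cauchy f: at µ* = x₁ the single-observation factor blows up like 1/σ* as σ* → 0, while the paired product stays bounded because the factor for x₂ goes to zero at an offsetting rate.

```python
import numpy as np
from scipy import stats

x1, x2 = 0.7, 2.3  # distinct observations (X1 != X2 almost surely)
for sigma in [1e-1, 1e-3, 1e-6]:
    single = stats.cauchy.pdf(x1, loc=x1, scale=sigma)
    pair = single * stats.cauchy.pdf(x2, loc=x1, scale=sigma)
    print(sigma, single, pair)
# 'single' grows like 1/(pi * sigma); 'pair' converges to a finite limit.
```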


If (17) does hold, we conclude that

$$(\hat\mu_{2n}, \hat\sigma_{2n}) \xrightarrow{\text{a.s.}} (\mu_0, \sigma_0),$$

where µ̂_n and σ̂_n are the AMLE's for sample size n, by direct application of the Wald theorem.

This seems to say only that the even-index terms of the sequence of estimators converge, but there is a simple argument that says that for every η > 0 we still have (9), from which we conclude that the AMLE are strongly consistent. The argument involving pairs of data points says that (9) holds with l_n(θ) replaced by l_{2n}(θ), but we can also write

$$l_{2n+1}(\theta) = \frac{1}{2n+1} \log \frac{f_\theta(X_1)}{f_{\theta_0}(X_1)} + \frac{1}{2n+1} \sum_{i=2}^{2n+1} \log \frac{f_\theta(X_i)}{f_{\theta_0}(X_i)},$$

hence

$$\sup_{\theta \in K_\eta} l_{2n+1}(\theta) \le \frac{1}{2n+1} \sup_{\theta \in K_\eta} \log \frac{f_\theta(X_1)}{f_{\theta_0}(X_1)} + \sup_{\theta \in K_\eta} \frac{1}{2n+1} \sum_{i=2}^{2n+1} \log \frac{f_\theta(X_i)}{f_{\theta_0}(X_i)}. \qquad (18)$$

By the upper semicontinuity assumption, log[f_θ(X₁)/f_θ₀(X₁)] achieves its supremum over K_η, and this supremum, being a value of the function, is finite. Hence the first term on the right-hand side of (18) converges to zero, not just almost surely, but for all ω. The second term on the right-hand side of (18) has the same distribution as 2n/(2n + 1) times the quantity inside the lim sup on the left-hand side of (9) with n replaced by 2n, hence its lim sup is at most the right-hand side of (9). If we have (9) when the lim sup is taken over even-numbered terms and when it is taken over odd-numbered terms, then we have it when it is taken over all terms.

References

Bahadur, R. R. (1971). Some Limit Theorems in Statistics. Philadelphia: SIAM.

Bertsekas, D. P. and Shreve, S. E. (1978). Stochastic Optimal Control: The Discrete Time Case. New York: Academic Press.


Browder, A. (1996). Mathematical Analysis: An Introduction. New York: Springer.

DeGroot, M. H. and Schervish, M. J. (2002). Probability and Statistics, third edition. Reading, MA: Addison-Wesley.

Geyer, C. J. (1994). On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society, Series B, 56, 261–274.

Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Annals of Statistics, 13, 795–800.

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887–906.

Perlman, M. D. (1972). On the strong consistency of approximate maximum likelihood estimates. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 1, 263–281. Berkeley: University of California Press.

Redner, R. (1981). Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Annals of Statistics, 9, 225–228.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics, 20, 595–601.

Wang, J.-L. (1985). Strong consistency of approximate maximum likelihood estimators with applications in nonparametrics. Annals of Statistics, 13, 932–946.
