
STATISTICS 512
TECHNIQUES OF MATHEMATICS FOR STATISTICS

Doug Wiens

December 16, 2013


Contents

I MATRIX ALGEBRA
1 Introduction; matrix manipulations
2 Vector spaces
3 Orthogonality; Gram-Schmidt method; QR-decomposition
4 LSEs; Spectral theory
5 Examples & applications

II LIMITS, CONTINUITY, DIFFERENTIATION
6 Limits; continuity; probability spaces
7 Random variables; distributions; Jensen's Inequality; WLLN
8 Differentiation; Mean Value and Taylor's Theorems
9 Applications: transformations; variance stabilization

III SEQUENCES, SERIES, INTEGRATION
10 Sequences and series
11 Power series; moment and probability generating functions
12 Branching processes
13 Riemann integration
14 Riemann and Riemann-Stieltjes integration
15 Moment generating functions; Chebyshev's Inequality; Asymptotic statistical theory

IV MULTIDIMENSIONAL CALCULUS AND OPTIMIZATION
16 Multidimensional differentiation; Taylor's and Inverse Function Theorems
17 Implicit Function Theorem; extrema; Lagrange multipliers
18 Integration; Leibnitz's Rule; Normal sampling distributions
19 Numerical optimization: Steepest descent, Newton-Raphson, Gauss-Newton
20 Maximum likelihood
21 Asymptotics of ML estimation; Information Inequality
22 Minimax M-estimation I
23 Minimax M-estimation II
24 Measure and Integration


Part I
MATRIX ALGEBRA


1. Introduction; matrix manipulations

• Outline:

— Linear algebra: regression (linear/nonlinear), multivariate analysis, and more generally linear models and linear approximations.

— Real analysis/calculus: the theory of statistical distributions, optimal selection of statistical procedures (e.g. determining a parameter estimate to minimize a certain loss function), approximations of intractable procedures by simpler ones.

— Measure theory (very briefly)/theory of integration: probability, mathematical finance, the theory of mathematical statistics, foundations of statistical and probabilistic methods. A rigorous development of conditional expectation requires measure theory.

— Optimization: finding numbers or functions minimizing certain objectives, e.g. designing experiments for maximum information/minimum variance, etc.; associated numerical methods.

• In Statistics, matrices are convenient ways to store and refer to data. As well, in regression for instance, there are important structural features that come from examining the algebraic properties of the vector space formed from all linear combinations of the columns of a matrix.

• As for the first of these - matrices as data storage - you should learn various ways to manipulate matrices. In particular, the formula for the product of two matrices - that the (i, j) element of the product AB of an m × n matrix A and an n × p matrix B is given by
$$[\mathbf{AB}]_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
- is of rather limited usefulness. Rather, one should treat either the rows or the columns of matrices as the basic elements. Some examples:

— Define a (column) vector in R^n; sum, transpose, scalar product, outer product.

— Matrix as a column of rows, or row of columns:
$$\mathbf{A}_{m \times n} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_m' \end{pmatrix} = (\boldsymbol{\alpha}_1 \,\vdots\, \boldsymbol{\alpha}_2 \,\vdots\, \cdots \,\vdots\, \boldsymbol{\alpha}_n).$$

— If X_{n×p} has rows {x_i'}_{i=1}^n (note: vectors are columns, rows are transposed vectors), and θ is a p × 1 vector, then
$$\mathbf{X}\boldsymbol{\theta} = \begin{pmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_i' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}\boldsymbol{\theta} = \begin{pmatrix} \mathbf{x}_1'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_i'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_n'\boldsymbol{\theta} \end{pmatrix}.$$

— If X_{n×p} has columns {z_j}_{j=1}^p, and θ is a p × 1 vector, then
$$\mathbf{X}\boldsymbol{\theta} = \begin{bmatrix} \mathbf{z}_1 \cdots \mathbf{z}_j \cdots \mathbf{z}_p \end{bmatrix}\begin{pmatrix} \theta_1 \\ \vdots \\ \theta_j \\ \vdots \\ \theta_p \end{pmatrix} = \sum_{j=1}^{p} \theta_j \mathbf{z}_j.$$

— If X_{n×p} has columns {z_j}_{j=1}^p, and A is a matrix with n columns, then
$$\mathbf{AX} = \mathbf{A}\begin{bmatrix} \mathbf{z}_1 \cdots \mathbf{z}_j \cdots \mathbf{z}_p \end{bmatrix} = \begin{bmatrix} \mathbf{Az}_1 \cdots \mathbf{Az}_j \cdots \mathbf{Az}_p \end{bmatrix}.$$

— If X is as above and A_{m×n} is a matrix with rows {a_i'}_{i=1}^m, then
$$\mathbf{AX} = \begin{pmatrix} \mathbf{a}_1' \\ \vdots \\ \mathbf{a}_i' \\ \vdots \\ \mathbf{a}_m' \end{pmatrix}\mathbf{X} = \begin{pmatrix} \mathbf{a}_1'\mathbf{X} \\ \vdots \\ \mathbf{a}_i'\mathbf{X} \\ \vdots \\ \mathbf{a}_m'\mathbf{X} \end{pmatrix} = \left(\mathbf{a}_i'\mathbf{z}_j\right)_{i,j}.$$

You should become familiar with all of these, and learn to choose the most appropriate form in an application.
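The identities above are easy to check numerically. A minimal numpy sketch (added for illustration; the dimensions and random seed are arbitrary, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))    # an n x p matrix, n = 5, p = 3
theta = rng.standard_normal(3)     # a p x 1 vector
A = rng.standard_normal((4, 5))    # a matrix with n = 5 columns

# X theta as the column of inner products x_i' theta (rows as basic elements) ...
rows = np.array([X[i, :] @ theta for i in range(5)])
# ... and as a linear combination of the columns z_j of X
cols = sum(theta[j] * X[:, j] for j in range(3))
assert np.allclose(X @ theta, rows) and np.allclose(X @ theta, cols)

# AX column by column: the j-th column of AX is A z_j
assert np.allclose((A @ X)[:, 1], A @ X[:, 1])
```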


— Block matrices ... a particular example is, with notation as above,
$$\mathbf{A}_{m \times n}\mathbf{B}_{n \times p} = (\boldsymbol{\alpha}_1 \,\vdots\, \boldsymbol{\alpha}_2 \,\vdots\, \cdots \,\vdots\, \boldsymbol{\alpha}_n)\begin{pmatrix} \boldsymbol{\beta}_1' \\ \boldsymbol{\beta}_2' \\ \vdots \\ \boldsymbol{\beta}_n' \end{pmatrix} = \sum_{k=1}^{n} \boldsymbol{\alpha}_k\boldsymbol{\beta}_k'.$$

• Let X be a random variable (r.v.) (formal definition to come later) with (i) distribution function F(x) = P(X ≤ x) and probability density function f(x) = F'(x), or (ii) probability mass function f(x) = P(X = x) for x ∈ 𝒳, a finite or countable set. Then the "expected value" is
$$E[X] = \begin{cases} \int_{-\infty}^{\infty} x f(x)\,dx & \text{(i)} \\ \sum_{x \in \mathcal{X}} x f(x) & \text{(ii)}. \end{cases}$$
Think "average". The cases can be unified and extended (for instance to cases where F is not differentiable) via the Riemann-Stieltjes integral, to be considered later. Also, the extension to random vectors (r.vecs) is immediate, involving multidimensional integrals or sums. A consequence is that
$$E\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_n] \end{pmatrix}.$$

(Similarly with random matrices.) We define E[g(X)] to be E[Y], where Y = g(X). In principle this requires the derivation of the distribution of Y. It can be shown that this can instead be obtained by integration or summation w.r.t. ("with respect to") the distribution of X. Corresponding to the cases above, this is
$$E[g(\mathbf{X})] = \begin{cases} \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(\mathbf{x})f(\mathbf{x})\,d\mathbf{x} & \text{(i)} \\ \sum_{\mathbf{x} \in \mathcal{X}} g(\mathbf{x})f(\mathbf{x}) & \text{(ii)}, \end{cases}$$
respectively.

• A special consequence is linearity: E[X + Y] = E[X] + E[Y]. More generally,
$$E[\mathbf{Ax} + \mathbf{b}] = \mathbf{A}E[\mathbf{x}] + \mathbf{b}$$
if x is a r.vec., and
$$E[\mathbf{AXB} + \mathbf{C}] = \mathbf{A}E[\mathbf{X}]\mathbf{B} + \mathbf{C}$$
for a random matrix X. You should verify this. Thus, e.g., if μ = E[x] then (how?)
$$E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'\right] = E\left[\mathbf{xx}'\right] - \boldsymbol{\mu}\boldsymbol{\mu}'.$$
The (i, j) element is
$$E\left[(X_i - \mu_i)(X_j - \mu_j)\right] = \operatorname{cov}(X_i, X_j)$$
(= the variance if i = j). The matrix is called the covariance matrix of the random vector x.

• A common application, to be developed in detail, is linear regression. An experimenter observes a variable Y (= response to a medical treatment, say) thought to depend on the type of drug used (x₁ = 1 for type A, 0 for type B) and the amount applied (x₂). The response contains a random component as well (measurement error, model inadequacies, etc.); a tentative linear regression model might be
$$Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \varepsilon,$$
where the θ's are unknown parameters to be estimated, and ε is unobserved random error, assumed to have mean 0 and constant variance across subjects, and possibly also normally distributed.


— Interpretation of E[Y | x₁, x₂] in the two treatment groups:
$$E[Y \mid x_1 = 0, x_2] = \theta_0 + \theta_2 x_2, \qquad E[Y \mid x_1 = 1, x_2] = \theta_0 + \theta_1 + \theta_2 x_2;$$
hence θ₁ = difference in mean effects of the treatments, if the same amounts are applied.

— Then with x = (1, x₁, x₂)', θ = (θ₀, θ₁, θ₂)':
$$Y = \mathbf{x}'\boldsymbol{\theta} + \varepsilon; \qquad E[Y \mid \mathbf{x}] = \mathbf{x}'\boldsymbol{\theta}.$$

• Take data:
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1'\boldsymbol{\theta} \\ \mathbf{x}_2'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_n'\boldsymbol{\theta} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix};$$
more concisely,
$$\mathbf{Y} = \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}\boldsymbol{\theta} + \boldsymbol{\varepsilon} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}.$$


Here the observations (rows) have been singled out as the relevant objects. Much of the theory will hinge on the representation of E[Y | x] as a linear combination of the columns of X, with coefficients θ:
$$\mathbf{X} = (\mathbf{1} \,\vdots\, \mathbf{z}_1 \,\vdots\, \mathbf{z}_2); \qquad E[\mathbf{Y} \mid \mathbf{x}] = \mathbf{1}\theta_0 + \mathbf{z}_1\theta_1 + \mathbf{z}_2\theta_2.$$
Extension to p > 3 columns is immediate.

• Estimation of θ: Given an estimate θ̂ one estimates E[Y_i | x_i] by x_i'θ̂, with residuals
$$e_i = Y_i - \mathbf{x}_i'\hat{\boldsymbol{\theta}}$$
and residual vector e = Y − Xθ̂.

• We define the (Euclidean) norm, i.e. the length, of a vector by
$$\|\mathbf{e}\| = \sqrt{\textstyle\sum_i e_i^2} = \sqrt{\mathbf{e}'\mathbf{e}}.$$

• Least Squares Principle: Choose
$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{Y} - \mathbf{X}\boldsymbol{\theta}\|^2,$$
minimizing the Sum of Squares of the Residuals (or Errors, hence "SSE").
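As an added illustration of the principle (a sketch only: the design below is simulated to match the drug example above, and numpy's `lstsq` stands in as the minimizer of the SSE):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.integers(0, 2, n)                  # drug type: 1 = A, 0 = B
x2 = rng.uniform(1, 10, n)                  # amount applied
X = np.column_stack([np.ones(n), x1, x2])   # rows are x_i' = (1, x_i1, x_i2)
theta_true = np.array([2.0, 1.5, 0.3])
Y = X @ theta_true + rng.normal(0.0, 1.0, n)   # Y = X theta + eps

theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # minimizes ||Y - X theta||^2
e = Y - X @ theta_hat                              # residual vector
print(theta_hat, e @ e)                            # estimate and SSE
```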


2. Vector spaces

• Now let's relate matrices to vector spaces. We start with the definition of a vector space. This is largely for formal completeness - you might wish to skip over the next bullet - since the only vector space considered here will be
R^n = all n-dimensional vectors with real elements,
and its subspaces.
and its subspaces.<br />

• We list a number of axioms to be satisfied by a structure in order that it be called a vector space; for R^n these are all pretty obvious. Note that R^n is closed under addition (x, y ∈ R^n ⇒ x + y ∈ R^n) and scalar multiplication (x ∈ R^n and a ∈ R ⇒ ax ∈ R^n), and satisfies:

1. Associativity: For all x, y, z ∈ R^n, we have x + (y + z) = (x + y) + z.
2. Commutativity: For all x, y ∈ R^n, we have x + y = y + x.
3. Identity element: There is 0 ∈ R^n such that x + 0 = x.
4. Inverse elements: For all x ∈ R^n, there exists an element −x ∈ R^n, called the additive inverse of x, such that x + (−x) = 0.
5. Distributivity for scalar multiplication: For all x, y ∈ R^n and a ∈ R we have a(x + y) = ax + ay.
6. Distributivity for scalar addition: For all x ∈ R^n and a, b ∈ R we have (a + b)x = ax + bx.
7. For all x ∈ R^n and a, b ∈ R we have (ab)x = a(bx).
8. Scalar multiplication has an identity: 1x = x.

Because these properties hold we say that R^n is a vector space (over the field R of scalars).
vector space (over the field R of scalars).


• A subset V of R^n that is itself closed under addition and scalar multiplication is a vector space in its own right, called a vector subspace of R^n; similarly W ⊂ V closed under addition and scalar multiplication is a subspace of V. (You might wish to prove this; the proof consists of showing that 1.-8. hold in V if they hold in R^n and if V has these two closure properties.)

• Definitions:
(i) Elements v₁, ..., vₘ of V form a spanning set if every v ∈ V is a linear combination of them.
(ii) Elements v₁, ..., vₘ of V are (linearly) independent if all are non-zero and
$$\sum_{i=1}^{m} c_i\mathbf{v}_i = \mathbf{0} \Rightarrow \text{all } c_i = 0,$$
i.e. there is only one way in which 0 can be represented as a linear combination of them. Otherwise they are dependent (equivalently, at least one is a linear combination of the others).
(iii) A spanning set whose elements are independent is a basis of V. Thus if {v₁, ..., vₘ} is a basis, any v ∈ V is uniquely (why?) representable as a linear combination of these basis elements.


— Fact 1: Every vector space has a basis. No proper subset of a basis can span the entire space (why not?).

— Fact 2: If V has a basis of size m, then any m + 1 elements of V are dependent. (Obvious if these include the basis; the proof is a bit lengthy otherwise.)

∗ Definition: The dimension of V is the unique size of a basis. Uniqueness is a consequence of the preceding two statements.

∗ Another consequence: If dim(V) = m, then any m independent vectors in V form a basis. (If not, then one can augment with elements not spanned to get m + 1 independent vectors. This contradicts Fact 2.)

• Let V be a vector subspace of R^n. Suppose that one forms a matrix X by choosing the basis elements of V to be the columns of X. Then the interpretations of "spanning" and "independence" in V are, in terms of X:
spanning: Xc = y is solvable (in c) for any y ∈ V;
independence: y = 0 in the above ⇒ c = 0.
If instead we begin with a matrix X, then the set of all linear combinations of the columns of X is a vector space (why?), called the column space (col(X)), whose dimension is called the rank of X. The independent columns of X form a basis for col(X).

Results about matrix ranks:

1) rank(AB) ≤ rank(A): Since col(AB) ⊆ col(A) (why?),
$$\operatorname{rank}(\mathbf{AB}) = \dim(\operatorname{col}(\mathbf{AB})) \le \dim(\operatorname{col}(\mathbf{A})) = \operatorname{rank}(\mathbf{A}).$$
(The inequality follows - assignment 1 - from Fact 2 above.)

2) The rank of a matrix is at least as large as that of any of its submatrices (you should formulate and prove this).


3) Used often: rank(A'A) = rank(A).
Proof: Let A be n × p with rank r. We first show that rank(A'A) ≥ r. If the first r columns of A are independent, we can write
$$\mathbf{A} = \mathbf{F}_{n \times r}(\mathbf{I}_r \,\vdots\, \mathbf{J}), \qquad \mathbf{A}'\mathbf{A} = \begin{pmatrix} \mathbf{F}'\mathbf{F} & * \\ \mathbf{J}'\mathbf{F}'\mathbf{F} & * \end{pmatrix},$$
where F consists of the r independent columns of A and hence has rank r. Now rank(A'A) ≥ rank(F'F) (why?) and all r columns of F'F are independent:
$$\mathbf{F}'\mathbf{Fx} = \mathbf{0} \overset{?}{\Rightarrow} \|\mathbf{Fx}\| = 0 \overset{?}{\Rightarrow} \mathbf{x} = \mathbf{0}.$$
For the general case first permute the columns of A: A → AQ, where the first r columns of AQ are independent. (How? What kind of matrix Q would accomplish this?) Then AQ = F_{n×r}(I_r ⋮ J) and as above
$$r \le \operatorname{rank}(\mathbf{Q}'\mathbf{A}'\mathbf{AQ}) \overset{?}{=} \operatorname{rank}(\mathbf{A}'\mathbf{A}).$$
Now write (either Q'A'AQ or) A'A as (I_r ⋮ J)'F'F(I_r ⋮ J). By 1),
$$\operatorname{rank}(\mathbf{A}'\mathbf{A}) \le \operatorname{rank}\left((\mathbf{I}_r \,\vdots\, \mathbf{J})'\right) = r$$
(why?).


4) By 3), then 1), rank(A) = rank(A'A) ≤ rank(A'); replacing A by A' gives rank(A') ≤ rank(A) and so
$$\operatorname{rank}(\mathbf{A}') = \operatorname{rank}(\mathbf{A});$$
i.e. row rank = column rank = # of independent rows or columns. Thus, from now on, 'rank' can mean either row rank or column rank.

5) rank(AB) ≤ min(rank(A), rank(B)).
Proof: That rank(AB) ≤ rank(A) has been shown. Using 4) and 1),
$$\operatorname{rank}(\mathbf{AB}) = \operatorname{rank}(\mathbf{B}'\mathbf{A}') \le \operatorname{rank}(\mathbf{B}') = \operatorname{rank}(\mathbf{B}).$$
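A quick numerical check of 3)-5) (an added sketch; the low-rank matrix is constructed so that its rank is known):

```python
import numpy as np

rng = np.random.default_rng(2)
F = rng.standard_normal((6, 2))
A = F @ rng.standard_normal((2, 4))              # 6 x 4, rank 2 by construction

print(np.linalg.matrix_rank(A))                  # 2
print(np.linalg.matrix_rank(A.T @ A))            # 2: rank(A'A) = rank(A)
print(np.linalg.matrix_rank(A.T))                # 2: row rank = column rank
B = rng.standard_normal((4, 3))
r_AB = np.linalg.matrix_rank(A @ B)
print(r_AB <= min(2, np.linalg.matrix_rank(B)))  # True: rank(AB) <= min of the two
```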

• A square, full rank matrix has an inverse.
Proof: We are to show that if A_{n×n} has full rank n then there is an 'inverse' B with the property that AB = BA = I_n. The columns of A are independent, hence form a basis of R^n (why?). Thus they span: the equations
$$\mathbf{A}[\mathbf{b}_1 \cdots \mathbf{b}_n] = [\mathbf{e}_1 \cdots \mathbf{e}_n] = \mathbf{I}_n$$
are all solvable. We write [b₁ ··· bₙ] = B; then AB = I_n. The matrix B is square, full rank (why?) and so it also has an inverse on the right: there is C_{n×n} with BC = I_n. Now show that C = A; thus AB = BA = I_n and so B = A⁻¹. ¤

• Fact: A square matrix has full rank iff it has a non-zero determinant.

— The determinant |A| is a particular sum of products of the elements of A_{n×n}. (Details in text.) Each product contains n factors; there is one from each row and one from each column. It is a measure of the "size" of the matrix, in a geometrical sense.

• A consequence of the preceding is that if X_{n×p} has independent columns, so rank p, then X'X is invertible. In a regression framework this can be interpreted in terms of information duplicated by dependent columns.


3. Orthogonality; Gram-Schmidt method; QR-decomposition

• Hat matrix: Consider a regression model y = Xθ + ε with X_{n×p} of full rank p. We will later show that the LSEs are
$$\hat{\boldsymbol{\theta}} = \left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{y},$$
so that the estimate of E[y] = Xθ is ŷ = Xθ̂ = Hy, where
$$\mathbf{H}_{n \times n} = \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'$$
is the "hat" matrix - it "places the hat on y". Properties:
$$\mathbf{H} = \mathbf{H}' = \mathbf{H}^2 \ (\text{idempotent}), \quad \mathbf{HX} = \mathbf{X}, \quad (\mathbf{I} - \mathbf{H})\mathbf{X} = \mathbf{0}, \quad (\mathbf{I} - \mathbf{H})^2 = \mathbf{I} - \mathbf{H}, \quad \mathbf{H}(\mathbf{I} - \mathbf{H}) = \mathbf{0}.$$
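These properties are straightforward to verify numerically (an added sketch; any X of full column rank will do):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 3))            # full column rank with probability 1
H = X @ np.linalg.inv(X.T @ X) @ X.T       # the hat matrix
I = np.eye(8)

assert np.allclose(H, H.T)                 # H = H'
assert np.allclose(H, H @ H)               # H = H^2 (idempotent)
assert np.allclose(H @ X, X)               # HX = X
assert np.allclose((I - H) @ X, 0)         # (I - H)X = 0
assert np.allclose(H @ (I - H), 0)         # H(I - H) = 0
```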


• The angle γ between nonzero vectors x, y is defined by
$$\cos\gamma = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}.$$
That such an angle exists is equivalent to the statement that |x'y| ≤ ‖x‖‖y‖. This in turn is a version of the famous Cauchy-Schwarz Inequality, to be studied later.
Proof of this version: For any real number λ,
$$0 \le \|\mathbf{x} + \lambda\mathbf{y}\|^2 = \|\mathbf{y}\|^2\lambda^2 + 2\lambda\,\mathbf{x}'\mathbf{y} + \|\mathbf{x}\|^2,$$
so that there is at most one real zero. Thus "b² − 4ac" ≤ 0, i.e.
$$4\left(\left(\mathbf{x}'\mathbf{y}\right)^2 - \|\mathbf{x}\|^2\|\mathbf{y}\|^2\right) \le 0.$$

— Equality in "|x'y| ≤ ‖x‖‖y‖" implies that ‖x + λ₀y‖² = 0 for some λ₀ (= ±‖x‖/‖y‖), so that x and y are proportional. The converse holds as well (you should verify this). ¤


• Two vectors are orthogonal if the angle between them = ±π/2, equivalently if their scalar product = 0. We write x ⊥ y.

— Example: If z is any n × 1 vector, and H is a hat matrix, then
$$\mathbf{z} = \mathbf{Hz} + (\mathbf{I} - \mathbf{H})\mathbf{z} = \mathbf{z}_1 + \mathbf{z}_2,$$
say, where z₁ ⊥ z₂. The first is in col(X) (why?) and the second is in the space of vectors orthogonal to every vector in col(X). We write z₂ ∈ col(X)^⊥. You should verify that this is a vector space (i.e. is closed under addition and scalar multiplication).

• A matrix Q_{n×n} is orthogonal if the columns are mutually orthogonal and have unit norm. Equivalently (why?)
$$\mathbf{QQ}' = \mathbf{Q}'\mathbf{Q} = \mathbf{I}_n.$$
If Q is orthogonal then ‖Qy‖ = ‖y‖ for any n × 1 vector y - "norms are preserved". Similarly, angles between vectors are also preserved (why?). Geometrically, an orthogonal transformation is a "rigid motion" - it corresponds to a rotation and/or an interchange of two or more axes. Rotation through an angle γ in the plane:
$$\mathbf{Q} = \begin{pmatrix} \cos\gamma & -\sin\gamma \\ \sin\gamma & \cos\gamma \end{pmatrix};$$
interchange of axes in the plane:
$$\mathbf{Q} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

• Orthogonal spaces in a regression context. Suppose X_{n×p} has independent columns, so col(X) has dimension p. Note that (how?) the orthogonal complement
$$\operatorname{col}(\mathbf{X})^{\perp} = \left\{\mathbf{y} \mid \mathbf{X}'\mathbf{y} = \mathbf{0}\right\}.$$
Then col(X)^⊥ = col(I − H):
$$\mathbf{y} \in \operatorname{col}(\mathbf{X})^{\perp} \Rightarrow \mathbf{X}'\mathbf{y} = \mathbf{0} \Rightarrow (\mathbf{I} - \mathbf{H})\mathbf{y} = \mathbf{y}; \qquad \mathbf{y} \in \operatorname{col}(\mathbf{I} - \mathbf{H}) \Rightarrow \mathbf{Hy} = \mathbf{0} \Rightarrow \mathbf{X}'\mathbf{y} = \mathbf{0}.$$
Thus dim(col(X)^⊥) = rank(I − H).
Thus dim ³ (X) ⊥´ = (I − H).


— The trace of a square matrix is the sum of its diagonal elements. You should verify that tr(AB) = tr(BA). Thus products within traces can be rearranged cyclically:
$$\operatorname{tr}(\mathbf{ABC}) = \operatorname{tr}(\mathbf{CAB}) = \operatorname{tr}(\mathbf{BCA}), \quad \text{but not necessarily} = \operatorname{tr}(\mathbf{ACB}).$$
It will be shown that for an idempotent matrix, rank = trace. A consequence is that
$$\operatorname{rank}(\mathbf{I} - \mathbf{H}) = \operatorname{tr}(\mathbf{I} - \mathbf{H}) = n - \operatorname{tr}(\mathbf{H}) = n - \operatorname{tr}\left(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right) = n - \operatorname{tr}\left(\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\right) = n - p.$$
Similarly rank(X) = rank(H), tr(H) = p.

• Gram-Schmidt Theorem: Every p-dimensional vector space V has an orthonormal basis.
Proof: Start with any basis v₁, ..., v_p. Normalize v₁ to get a unit vector (i.e. a vector with unit norm) q₁ ∈ V; in general suppose that mutually orthogonal unit vectors q₁, ..., q_k have been constructed, with q_j a linear combination of v₁, ..., v_j. Define Q_k = (q₁, ..., q_k); this has orthonormal columns and so Q_k'Q_k = I_k. Define also
$$\mathbf{H}_k = \mathbf{Q}_k\mathbf{Q}_k' = \text{hat matrix arising from } \mathbf{Q}_k, \qquad \mathbf{q}_{k+1} = \frac{(\mathbf{I} - \mathbf{H}_k)\mathbf{v}_{k+1}}{\left\|(\mathbf{I} - \mathbf{H}_k)\mathbf{v}_{k+1}\right\|}. \tag{3.1}$$
For instance, q₂ = ....
Then: (i) the numerator in (3.1) is a linear combination of v₁, ..., v_{k+1} in which at least one coefficient - that of v_{k+1} - is non-zero (why?), so that in particular the denominator is non-zero; (ii) q_{k+1} ⊥ q₁, ..., q_k (⇐? q_{k+1}'H_k = 0'). Continuing this process results in p mutually orthogonal unit vectors q₁, ..., q_p ∈ V. Since these are orthogonal they are independent (why?) and so form a basis of V.

— There is a nice geometric interpretation. The matrix H_k is idempotent (in fact it is the "hat" matrix arising from the n × k matrix with columns q₁, ..., q_k), and H_k v_{k+1} is the projection of v_{k+1} onto the space spanned by {q₁, ..., q_k}. Thus we say that q_{k+1} is formed by "subtracting from v_{k+1} its projection onto the space spanned by {q₁, ..., q_k}", so as to make what is left orthogonal to this space (and then normalizing).
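The recursion (3.1) translates directly into code. An added sketch (this is classical Gram-Schmidt exactly as stated; in floating point one would usually prefer the modified variant or a library QR routine):

```python
import numpy as np

def gram_schmidt(V):
    """Orthonormalize the columns of V via (3.1): q_{k+1} prop. to (I - H_k) v_{k+1}."""
    n, p = V.shape
    Q = np.empty((n, p))
    for k in range(p):
        proj = Q[:, :k] @ (Q[:, :k].T @ V[:, k])  # H_k v_{k+1}, with H_k = Q_k Q_k'
        w = V[:, k] - proj                        # (I - H_k) v_{k+1}
        Q[:, k] = w / np.linalg.norm(w)           # normalize
    return Q

rng = np.random.default_rng(4)
V = rng.standard_normal((6, 4))                   # a basis (independent w.p. 1)
Q = gram_schmidt(V)
assert np.allclose(Q.T @ Q, np.eye(4))            # orthonormal columns
```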

• QR-decomposition. In the previous construction, at each stage, q_k was obtained as a linear combination of v₁, ..., v_k. Thus if V_{n×p} has these vectors as its columns, and Q_{n×p} = (q₁, ..., q_p), we can write
$$\mathbf{V}_{n \times p}\mathbf{U}_{p \times p} = \mathbf{Q}_{n \times p}$$
for U upper triangular with positive diagonal elements (u_{k+1,k+1} = 1/‖(I − H_k)v_{k+1}‖ > 0). Then U is nonsingular and V = QR for R = U⁻¹. (Note that R is also upper triangular, with positive diagonal elements r_{k+1,k+1} = ‖(I − H_k)v_{k+1}‖.)


4. LSEs; Spectral theory

• Recall the decomposition arising from the Gram-Schmidt Theorem. Let X_{n×p} have rank p. Write X = Q₁R₁, where Q₁: n × p has orthonormal columns, and R₁: p × p is upper triangular with positive diagonal elements. Apply Gram-Schmidt once again, starting with the n − p independent columns of I − H, to obtain Q₂: n × (n − p), whose columns are orthonormal and are a basis for col(X)^⊥. Then Q = (Q₁ ⋮ Q₂) has orthonormal columns and is square, hence is an orthogonal matrix. We have
$$\mathbf{X} = (\mathbf{Q}_1 \,\vdots\, \mathbf{Q}_2)\begin{pmatrix} \mathbf{R}_1 \\ \mathbf{0} \end{pmatrix} = \mathbf{QR}; \quad \mathbf{R}'\mathbf{R} = \mathbf{R}_1'\mathbf{R}_1 = \mathbf{X}'\mathbf{X}; \quad \left(\mathbf{X}'\mathbf{X}\right)^{-1} = \mathbf{R}_1^{-1}\left(\mathbf{R}_1'\right)^{-1}; \quad \mathbf{H} = \mathbf{Q}_1\mathbf{Q}_1'; \quad \mathbf{I} - \mathbf{H} = \mathbf{Q}_2\mathbf{Q}_2'.$$


• Return to regression.
(i) Least squares estimation in terms of the hat matrix: decomposition of the norm of the residuals. Note that x ⊥ y ⇒ ‖x + y‖² = ‖x‖² + ‖y‖²; then
$$\left\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}\right\|^2 = \left\|\mathbf{H}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}\right)\right\|^2 + \left\|(\mathbf{I} - \mathbf{H})\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\theta}}\right)\right\|^2 = \left\|\mathbf{Hy} - \mathbf{X}\hat{\boldsymbol{\theta}}\right\|^2 + \left\|(\mathbf{I} - \mathbf{H})\mathbf{y}\right\|^2 \ge \left\|(\mathbf{I} - \mathbf{H})\mathbf{y}\right\|^2,$$
with equality iff ('if and only if') Hy = Xθ̂, iff
$$\hat{\boldsymbol{\theta}} = \left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{y}$$
(how?), the LS estimator. The fitted values are
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\theta}} = \mathbf{Hy}$$
and are orthogonal to the residuals
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}.$$
We say that H and I − H project the data (y) onto the estimation space and error space, respectively, and that these spaces are orthogonal.


(ii) In terms of the QR-decomposition: we have that
$$\hat{\boldsymbol{\theta}} = \mathbf{R}_1^{-1}\left(\mathbf{R}_1'\right)^{-1}\mathbf{R}_1'\mathbf{Q}_1'\mathbf{y} = \mathbf{R}_1^{-1}\mathbf{Q}_1'\mathbf{y};$$
i.e. θ̂ is the solution to R₁θ̂ = Q₁'y. Thus compute
$$\mathbf{z}_{n \times 1} = \mathbf{Q}'\mathbf{y} = \begin{pmatrix} \mathbf{Q}_1'\mathbf{y} \\ \mathbf{Q}_2'\mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \end{pmatrix} \begin{matrix} p \times 1 \\ (n - p) \times 1, \end{matrix}$$
then backsolve the system of equations R₁θ̂ = z₁. Numerically stable - no matrix inversions.
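An added sketch of this computation (numpy's reduced QR supplies Q₁ and R₁, and scipy's triangular solver does the backsolve):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

Q1, R1 = np.linalg.qr(X)                 # "reduced" QR: Q1 is n x p, R1 is p x p
z1 = Q1.T @ y
theta_hat = solve_triangular(R1, z1)     # backsolve R1 theta = z1; no inversion
assert np.allclose(theta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```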

• The residual vector is e = Q₂z₂, with squared norm ‖z₂‖². The usual estimate of the variance σ² of the random errors is
$$\hat{\sigma}^2 = \frac{\text{SS of residuals}}{n - p} = \frac{\|\mathbf{e}\|^2}{n - p} = \frac{\|\mathbf{z}_2\|^2}{n - p},$$
the mean squared error. We have
$$E[\mathbf{z}] = \mathbf{Q}'E[\mathbf{y}] = \begin{pmatrix} \mathbf{Q}_1'\mathbf{Q}_1\mathbf{R}_1\boldsymbol{\theta} \\ \mathbf{Q}_2'\mathbf{Q}_1\mathbf{R}_1\boldsymbol{\theta} \end{pmatrix} = \begin{pmatrix} \mathbf{R}_1\boldsymbol{\theta} \\ \mathbf{0} \end{pmatrix},$$
and, using "cov[Ay] = A cov[y] A'" (how?) we get
$$\operatorname{cov}[\mathbf{z}] = \mathbf{Q}'\operatorname{cov}[\mathbf{y}]\,\mathbf{Q} = \mathbf{Q}'\sigma^2\mathbf{I}\,\mathbf{Q} = \sigma^2\mathbf{I};$$
hence the elements z_{p+1}, ..., z_n of z₂ have mean zero and var[z_i] = E[z_i²] = σ². Thus σ̂² is unbiased:
$$E\left[\hat{\sigma}^2\right] = E\left[\frac{\sum_{i=p+1}^{n} z_i^2}{n - p}\right] = \sigma^2.$$

• Unrelated but nonetheless useful facts: inverses and determinants of matrices in block form. If P and Q are nonsingular, then
$$\det\begin{pmatrix} \mathbf{P} & \mathbf{S} \\ \mathbf{R} & \mathbf{Q} \end{pmatrix} = |\mathbf{P}|\cdot\left|\mathbf{Q} - \mathbf{RP}^{-1}\mathbf{S}\right| = |\mathbf{Q}|\cdot\left|\mathbf{P} - \mathbf{SQ}^{-1}\mathbf{R}\right|$$
and
$$\begin{pmatrix} \mathbf{P} & \mathbf{S} \\ \mathbf{R} & \mathbf{Q} \end{pmatrix}^{-1} = \begin{pmatrix} \left(\mathbf{P} - \mathbf{SQ}^{-1}\mathbf{R}\right)^{-1} & -\mathbf{P}^{-1}\mathbf{S}\left(\mathbf{Q} - \mathbf{RP}^{-1}\mathbf{S}\right)^{-1} \\ -\left(\mathbf{Q} - \mathbf{RP}^{-1}\mathbf{S}\right)^{-1}\mathbf{RP}^{-1} & \left(\mathbf{Q} - \mathbf{RP}^{-1}\mathbf{S}\right)^{-1} \end{pmatrix}.$$
How? Verify
$$\begin{pmatrix} \mathbf{I} & -\mathbf{SQ}^{-1} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}\begin{pmatrix} \mathbf{P} & \mathbf{S} \\ \mathbf{R} & \mathbf{Q} \end{pmatrix}\begin{pmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{Q}^{-1}\mathbf{R} & \mathbf{I} \end{pmatrix} = \begin{pmatrix} \mathbf{P} - \mathbf{SQ}^{-1}\mathbf{R} & \mathbf{0} \\ \mathbf{0} & \mathbf{Q} \end{pmatrix},$$
etc.
Example:
$$\det\begin{pmatrix} \mathbf{I}_n & \mathbf{1}_n \\ \mathbf{1}_n' & -1 \end{pmatrix} = |\mathbf{I}_n|\cdot\left|-1 - \mathbf{1}_n'\mathbf{1}_n\right| = -1 - n.$$

• Spectral theory for real, symmetric matrices. First let M_{n×n} be any square matrix. For a variable λ the determinant |M − λI_n| is a polynomial in λ of degree n, called the characteristic polynomial. The equation
$$|\mathbf{M} - \lambda\mathbf{I}_n| = 0$$
is the characteristic equation. The Fundamental Theorem of Algebra states that there are then n (real or complex) roots of this equation. Any such root is called an eigenvalue of M. If λ is an eigenvalue then M − λI_n is singular, so the columns are dependent:
$$(\mathbf{M} - \lambda\mathbf{I}_n)\mathbf{v} = \mathbf{0} \tag{4.1}$$
for some non-zero vector v (possibly complex), called the eigenvector corresponding to, or belonging to, λ. Thus
$$\mathbf{Mv} = \lambda\mathbf{v}. \tag{4.2}$$


• Now suppose that M is symmetric (and real). Then the eigenvalues (hence the eigenvectors as well) are real. To see this, define an operation A* by taking a transpose and a complex conjugate:
$$(\mathbf{A}^*)_{ij} = \bar{A}_{ji}.$$
Note that (AB)* = B*A* and that v*v = ∑|v_i|² is real. For a real symmetric matrix M we have (why?) M* = M. Thus in (4.2),
$$\mathbf{v}^*\mathbf{Mv} = \lambda\mathbf{v}^*\mathbf{v};$$
taking the conjugate transpose of each side gives
$$\mathbf{v}^*\mathbf{Mv} = \bar{\lambda}\mathbf{v}^*\mathbf{v}.$$
Thus (λ − λ̄)v*v = 0, so that (why?) λ is real.

• We can, and from now on will, assume that any eigenvector has unit norm.


• Eigenvectors corresponding to distinct eigenvalues are orthogonal. Reason: If Mv_i = λ_iv_i for i = 1, 2 and λ₁ ≠ λ₂, then
$$\mathbf{v}_1'\mathbf{Mv}_2 = \mathbf{v}_1'(\mathbf{Mv}_2) = \lambda_2\mathbf{v}_1'\mathbf{v}_2 \quad \text{and} \quad = (\mathbf{v}_1'\mathbf{M})\mathbf{v}_2 = \lambda_1\mathbf{v}_1'\mathbf{v}_2;$$
thus (λ₁ − λ₂)v₁'v₂ = 0 and so v₁'v₂ = 0.

• If λ is a multiple root of the characteristic equation, with multiplicity m, then the set of corresponding eigenvectors is a vector space (you should verify this) with dimension m.

— The proof that the dimension is m requires some work, and uses two results being established in Assignment 1, and so it is added as an addendum (which you should read) to that assignment.

— By Gram-Schmidt, therefore, there are m orthogonal eigenvectors corresponding to λ.


• Spectral Decomposition Theorem for real, symmetric matrices: Let M_{n×n} be real and symmetric, with eigenvalues λ₁, ..., λₙ and corresponding orthogonal eigenvectors v₁, ..., vₙ with unit norms. Put
$$\mathbf{V}_{n \times n} = (\mathbf{v}_1 \cdots \mathbf{v}_n),$$
an orthogonal matrix. Let D_λ be the diagonal matrix with λ₁, ..., λₙ on the diagonal. Since
$$\mathbf{MV} = (\lambda_1\mathbf{v}_1 \cdots \lambda_n\mathbf{v}_n) = (\mathbf{v}_1 \cdots \mathbf{v}_n)\begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix} = \mathbf{VD}_\lambda,$$
we have
$$\mathbf{M} = \mathbf{VD}_\lambda\mathbf{V}'. \tag{4.3}$$
We say that "a real symmetric matrix is orthogonally similar to a diagonal matrix".


• In a sense that will become clear, the importance of this result is that a real, symmetric matrix is "almost" diagonal. Thus when solving problems concerning real symmetric matrices it is very often useful to solve them first for diagonal matrices. This is frequently quite simple, and then extends to the general case via (4.3).

• In the construction above we could have assumed, and sometimes will assume, that the eigenvalues were ordered before they and the eigenvectors were labelled: λ₁ ≥ ··· ≥ λₙ.
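A numerical sketch of the decomposition (added; `numpy.linalg.eigh` returns the eigenvalues of a symmetric matrix in increasing order, together with an orthogonal V):

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
M = (B + B.T) / 2                                  # a real symmetric matrix

lam, V = np.linalg.eigh(M)                         # eigenvalues and eigenvectors
assert np.allclose(V @ np.diag(lam) @ V.T, M)      # M = V D_lambda V'
assert np.allclose(V.T @ V, np.eye(4))             # V is orthogonal
x = V[:, -1]                                       # eigenvector of the largest eigenvalue
assert np.isclose(x @ M @ x, lam[-1])              # attains max x'Mx over unit x (Section 5)
```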


5. Examples & applications

Consequences of the spectral decomposition of a real, symmetric matrix M_{n×n}. Recall that we showed M = VD_λV', for an orthogonal V = [v₁ ··· vₙ] (the orthonormal eigenvectors) and D_λ = diag(λ₁ ≥ ··· ≥ λₙ) (the eigenvalues).

• Bounds on eigenvalues. We have
$$\max_{\|\mathbf{x}\|=1}\mathbf{x}'\mathbf{Mx} = \max_{\|\mathbf{x}\|=1}\mathbf{x}'\mathbf{VD}_\lambda\mathbf{V}'\mathbf{x} = \max_{\|\mathbf{y}\|=1}\mathbf{y}'\mathbf{D}_\lambda\mathbf{y} \ (\text{why?}) = \max\left\{\sum_{i=1}^{n}\lambda_i y_i^2 \;\middle|\; \sum_{i=1}^{n}y_i^2 = 1\right\}.$$
It is easy to guess the solution. The maximum is (what?), attained at y = (what?); hence the maximizing x is the corresponding eigenvector. An analogous result holds for min_{‖x‖=1} x'Mx. (You should write it out and prove it.)


• Positive definite matrices. If a symmetric matrix M is such that x'Mx ≥ 0 for all x, we say that M is positive semi-definite (p.s.d.) or non-negative definite (n.n.d.). We write M ≥ 0. (The text reserves the term p.s.d. for the case in which equality is attained for at least one non-zero x; this convention is somewhat unusual and won't be followed here.) The preceding discussion shows (how?) that M is p.s.d. iff all eigenvalues are non-negative.
If x'Mx > 0 for all x ≠ 0, we say that M is positive definite (p.d.). We write M > 0. Equivalently, all eigenvalues are positive.

— Geometric interpretation: If M > 0 then |M| > 0 (why?) and the set
$$\left\{\mathbf{x} \mid \mathbf{x}'\mathbf{M}^{-1}\mathbf{x} = c^2\right\}$$
is transformed, via the (orthogonal) transformation y = V'x, into the set
$$\left\{\mathbf{y} \;\middle|\; \sum_{i=1}^{n}\frac{y_i^2}{\lambda_i} = c^2\right\}.$$
This is the ellipsoid in R^n with semi-axes of lengths c√λ_i along the coordinate axes (and volume ∝ √|M|). Thus (why?) the original set, obtained from the second via the transformation x = Vy, is an ellipsoid as well, whose semi-axes have the same lengths but are now in the directions of the eigenvectors of M.

The following three results illustrate the adage that "a symmetric matrix is almost diagonal".

• Matrix square roots. Can we define a notion of the square root of a (n.n.d.) matrix? Start by thinking of a diagonal matrix, in which case the method is obvious: if D is a diagonal matrix with non-negative diagonal elements, then we can define the square root D^{1/2} to be the diagonal matrix with the roots of these elements on its diagonal. Now extend to the general case. If M ≥ 0 we write M = VD_λV', where V is orthogonal and D_λ has a non-negative diagonal. We define a symmetric, p.s.d. square root of M by
$$\mathbf{M}^{1/2} = \mathbf{VD}_\lambda^{1/2}\mathbf{V}'.$$

— There are other roots, for instance P = VD_λ^{1/2}W for any orthogonal W (then PP' = M), but we will generally mean the one above.
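An added sketch of this construction (the `clip` guards against tiny negative eigenvalues produced by rounding):

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
M = B @ B.T                                            # n.n.d. by construction

lam, V = np.linalg.eigh(M)
M_half = V @ np.diag(np.sqrt(lam.clip(min=0))) @ V.T   # M^{1/2} = V D^{1/2} V'
assert np.allclose(M_half @ M_half, M)                 # squares back to M
assert np.allclose(M_half, M_half.T)                   # the symmetric p.s.d. root
```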

• The rank of a real symmetric matrix equals the number of non-zero eigenvalues. Reason: If M = VD_λV' then the rank of M equals the rank of D_λ (why?), and the latter is clearly (is it?) the number of non-zero diagonal elements.

— Note also that if M = VDV' is the spectral decomposition then M and D have the same eigenvalues, namely the diagonal elements of D. This is because the characteristic polynomials are the same:
$$|\mathbf{M} - \lambda\mathbf{I}| = \left|\mathbf{V}(\mathbf{D} - \lambda\mathbf{I})\mathbf{V}'\right| = |\mathbf{D} - \lambda\mathbf{I}|.$$


• If H is idempotent then (i) all eigenvalues are 0 or 1, and (ii) rank = trace. Reason: (i) It is clearly true (how?) for diagonal idempotents. But if H is idempotent then H = VD_λV' where D_λ is idempotent, and H has the same eigenvalues as D_λ. (ii) rank(H) = rank(D_λ) = tr(D_λ) = tr(H) (how are these steps justified?).

— Another interesting property, previously established via Gram-Schmidt: We can partition D_λ, and compatibly partition V, as
$$\mathbf{D}_\lambda = \begin{pmatrix} \mathbf{I}_p & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix}, \qquad \mathbf{V} = (\mathbf{V}_1 \,\vdots\, \mathbf{V}_2),$$
where rank(H) = p and V₁ is n × p. This results in the decomposition of an idempotent matrix as
$$\mathbf{H} = \mathbf{V}_1\mathbf{V}_1' \quad \text{where } \mathbf{V}_1'\mathbf{V}_1 = \mathbf{I}_p.$$


Application 1. Illustration of the preceding theory: the two-population classification problem. Suppose we are given lengths and widths of n prehistoric skulls, of type A or B (the "training sample"). We know that n₁ of these, say x₁, ..., x_{n₁}, are of type A, and n₂ = n − n₁, say y₁, ..., y_{n₂}, are of type B. Now we find a new skull, with length and width the components of z. We are to classify it as A or B. (Other applications: rock samples in geology, risk data in an actuarial analysis, etc.)

• Reduce to a univariate problem: u_i = α'x_i, v_i = α'y_i for some vector α. Put w = α'z and classify the new skull as A if |w − ū| < |w − v̄|.

• Choose α for "maximal separation": |ū − v̄| should be large relative to the underlying variation. Put
$$s_1^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}\left(u_i - \bar{u}\right)^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}\left(\boldsymbol{\alpha}'(\mathbf{x}_i - \bar{\mathbf{x}})\right)^2 = \boldsymbol{\alpha}'\left[\frac{1}{n_1 - 1}\sum_{i=1}^{n_1}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'\right]\boldsymbol{\alpha} = \boldsymbol{\alpha}'\mathbf{S}_1\boldsymbol{\alpha}$$


and similarly define s₂² as the variation in the other sample. Choose α to maximize
$$\frac{(\bar{u} - \bar{v})^2}{\left[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right]/(n - 2)} = \frac{\boldsymbol{\alpha}'(\bar{\mathbf{x}} - \bar{\mathbf{y}})(\bar{\mathbf{x}} - \bar{\mathbf{y}})'\boldsymbol{\alpha}}{\boldsymbol{\alpha}'\mathbf{S}\boldsymbol{\alpha}}, \tag{5.1}$$
where S is the two-sample covariance matrix
$$\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n - 2}.$$

• Put β = S^{1/2}α, α = S^{−1/2}β, so (5.1) is
$$\frac{\boldsymbol{\beta}'\mathbf{S}^{-1/2}(\bar{\mathbf{x}} - \bar{\mathbf{y}})(\bar{\mathbf{x}} - \bar{\mathbf{y}})'\mathbf{S}^{-1/2}\boldsymbol{\beta}}{\boldsymbol{\beta}'\boldsymbol{\beta}},$$
which is a maximum if β/‖β‖ is the unit eigenvector corresponding to
$$\lambda_{\max}\left(\mathbf{S}^{-1/2}(\bar{\mathbf{x}} - \bar{\mathbf{y}})(\bar{\mathbf{x}} - \bar{\mathbf{y}})'\mathbf{S}^{-1/2}\right) = \lambda_{\max}\left(\mathbf{aa}'\right),$$
where a = S^{−1/2}(x̄ − ȳ). Note aa' has rank 1, hence has 1 non-zero eigenvalue, necessarily equal (why?) to a'a:
$$\lambda = \mathbf{a}'\mathbf{a} = (\bar{\mathbf{x}} - \bar{\mathbf{y}})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \bar{\mathbf{y}}).$$


Now solve
$$\mathbf{aa}'\boldsymbol{\beta} = \lambda\boldsymbol{\beta}$$
to get β (= what? - guess at a solution); any multiple will do. Then
$$\boldsymbol{\alpha} = \mathbf{S}^{-1/2}\boldsymbol{\beta} = \mathbf{S}^{-1}(\bar{\mathbf{x}} - \bar{\mathbf{y}})$$
and we classify as A if
$$|w - \bar{u}| = \left|\boldsymbol{\alpha}'(\mathbf{z} - \bar{\mathbf{x}})\right| < \left|\boldsymbol{\alpha}'(\mathbf{z} - \bar{\mathbf{y}})\right| = |w - \bar{v}|.$$
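An added end-to-end sketch of the rule on invented two-dimensional "skull" data (the group means, covariances and the new point z are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.multivariate_normal([5.0, 3.0], np.eye(2), size=30)   # type A training sample
y = rng.multivariate_normal([6.5, 2.0], np.eye(2), size=25)   # type B training sample
n1, n2 = len(x), len(y)

xbar, ybar = x.mean(axis=0), y.mean(axis=0)
S1, S2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)           # pooled covariance

alpha = np.linalg.solve(S, xbar - ybar)                       # alpha = S^{-1}(xbar - ybar)
z = np.array([5.8, 2.4])                                      # a new skull
w, ubar, vbar = alpha @ z, alpha @ xbar, alpha @ ybar
print("A" if abs(w - ubar) < abs(w - vbar) else "B")
```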


Application 2. By the Cauchy-Schwarz Inequality,
$$\max_{\mathbf{y}}\frac{\left|\mathbf{x}'\mathbf{My}\right|}{\|\mathbf{y}\|} = \left\|\mathbf{M}'\mathbf{x}\right\| = \sqrt{\mathbf{x}'\mathbf{MM}'\mathbf{x}}.$$
Related facts: Note that MM' ≥ 0 (why?). Conversely, any n.n.d. matrix can be represented as MM' (in many ways). In particular, if S is an n × n n.n.d. matrix of rank r ≤ n, then one can find M_{n×r} such that S = MM' and M'M is the r × r diagonal matrix of the positive eigenvalues of S.
Construction: Write S = VDV', where
$$\mathbf{D}_{n \times n} = \begin{pmatrix} \mathbf{D}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix}, \qquad \mathbf{V}_{n \times n} = (\underbrace{\mathbf{V}_1}_{r} \,\vdots\, \underbrace{\mathbf{V}_2}_{n-r}),$$
and D₁ is the r × r diagonal matrix containing the positive eigenvalues. Then S = V₁D₁V₁', and so M_{n×r} = V₁D₁^{1/2} has the desired properties. (Note also that this is a version of S^{1/2}.)


Part II
LIMITS, CONTINUITY, DIFFERENTIATION


6. Limits; continuity; probability spaces

• Open and closed sets in R^k; limits:

— Neighbourhood of a point a, of radius ε:
$$N_\varepsilon(\mathbf{a}) = \{\mathbf{x} \mid \|\mathbf{x} - \mathbf{a}\| < \varepsilon\}.$$

— S ⊂ R^k is open if
$$\mathbf{a} \in S \Rightarrow N_\varepsilon(\mathbf{a}) \subset S$$
for all sufficiently small ε > 0.
∗ Example: (0, 1).

— A sequence {xₙ} tends to a point a: "xₙ → a" if xₙ gets arbitrarily close to a as n gets larger. More formally, any neighbourhood of a, no matter how small, will eventually contain xₙ from some point onward. More formally yet, "for any radius ε, we can find an N large enough that, once n > N, all of the xₙ lie in N_ε(a)". This required N will typically get larger as ε gets smaller. Finally,
$$\forall \varepsilon\ \exists N = N(\varepsilon)\ (n > N \Rightarrow \mathbf{x}_n \in N_\varepsilon(\mathbf{a})),$$
read "for all ε there exists an N, that depends on ε, such that n > N implies that xₙ ∈ N_ε(a)".
∗ Equivalently (why?): xₙ → a ⟺ ‖xₙ − a‖ → 0.
∗ This is for 'a' finite; obvious modifications otherwise. You should derive an appropriate definition of "xₙ → ∞" (xₙ scalars, not vectors).
∗ Example: xₙ = 1 − 1/n.

— A point a is a limit point of S ⊂ R^k if there is a sequence {xₙ} ⊂ S such that xₙ → a.
∗ Example: S = {xₙ = 1 − 1/n | n = 1, 2, ...}; a = 1 = lim xₙ is a limit point (∉ S).


— S ⊂ R^k is closed if it contains all of its limit points.
∗ Examples: S = {x = 1 − 1/n | n = 1, 2, ...} ∪ {1}; S = [0, 1].
| =1 2}∪<br />

• A function f(x) → L as x → a ("f(x) tends to L as x tends to a") if we can force f(x) to be arbitrarily close to L by choosing x (≠ a) sufficiently close to a. Formally,
$$\forall \varepsilon\ \exists \delta = \delta(\varepsilon, \mathbf{a})\ (0 < \|\mathbf{x} - \mathbf{a}\| < \delta \Rightarrow |f(\mathbf{x}) - L| < \varepsilon).$$
The "= δ(ε, a)" is often omitted (but understood, unless stated otherwise). Note the "0 < ‖x − a‖": f(a) need not exist.

• Suppose f(x) is defined for x ∈ D, the domain of f. Then f is continuous at a point a ∈ D if f(x) → f(a) as x → a.

— Note that the definition requires f to be defined at a.


— Equivalently,
$$\forall \varepsilon\ \exists \delta = \delta(\varepsilon, \mathbf{a})\ (\|\mathbf{x} - \mathbf{a}\| < \delta \Rightarrow |f(\mathbf{x}) - f(\mathbf{a})| < \varepsilon).$$

• Example: f(x) = x², D = (0, ∞). Then if δ > 0 and |x − a| < δ we have
$$|f(x) - f(a)| = |x - a|\,|x - a + 2a| \le |x - a|\,(|x - a| + 2a) < \delta(\delta + 2a),$$
which is < ε if δ² + 2aδ − ε < 0, i.e. if
$$0 < \delta < \sqrt{a^2 + \varepsilon} - a.$$
Here we used the triangle inequality: |x + y| ≤ |x| + |y|.

• Note δ = δ(ε, a). Sometimes the same δ works for all a; if so we say f is uniformly continuous on D. E.g. in the previous example, the δ that is required will → 0 as a → ∞; but suppose D is bounded, say D = (0, b). It can be shown that √(a² + ε) − a is ↓ in a ("decreasing"), hence
$$\sqrt{a^2 + \varepsilon} - a > \sqrt{b^2 + \varepsilon} - b > 0$$
for all a ∈ D, so that
$$|x - a| < \delta = \sqrt{b^2 + \varepsilon} - b \Rightarrow |f(x) - f(a)| < \varepsilon.$$
Thus f is uniformly continuous on (0, b).

• Formally, in the last example,
$$\sqrt{b^2 + \varepsilon} - b = \inf_{a \in (0, b)}\left\{\sqrt{a^2 + \varepsilon} - a\right\}.$$
For any set S, l is a lower bound if l ≤ x for all x ∈ S. If there is a finite lower bound then there are many; the largest of them is the greatest lower bound (glb) or infimum (inf). Similarly with upper bound, least upper bound (lub) or supremum (sup).


Probability spaces, random variables, distribution functions:
We start with a sample space Ω, whose elements ω are all possible outcomes of an experiment (e.g. toss a coin ten times; Ω is all possible sequences of H's and T's). A Borel field or σ-algebra of events is a collection B of subsets ("events") of Ω such that one of its elements is Ω itself, it is closed under complementation, and closed under the taking of countable unions.

A probability P is a function defined on B such that P(Ω) = 1, 0 ≤ P(A) ≤ 1, and probabilities of disjoint countable unions are additive. The triple (Ω, B, P) is called a probability space. All the usual rules for manipulating probabilities follow from these axioms. E.g. P(∅) = 0, P(A) ≤ P(B) if A ⊂ B. In particular ("continuity of probabilities"):
$$A_n \supseteq A_{n+1} \supseteq \cdots \ \text{and} \ \cap_{n=1}^{\infty} A_n = A \ \Rightarrow\ P(A_n) \to P(A). \tag{6.1}$$


7. Random variables; distributions; Jensen's Inequality; WLLN

• A (real valued, finite) random variable (r.v.) is a function X: Ω → R with the property that if O is any open set, then X⁻¹(O) = {ω | X(ω) ∈ O} is an event, i.e. a member of B. E.g. X(ω) = # of heads in the sequence ω of tosses. (For a finite sample space we generally take B = 2^Ω, the set of all subsets of Ω.)

— Note that O is open iff O^c is closed.
Proof: You should show that O open ⇒ O^c closed. Conversely, suppose O^c is closed; we are to show that O is open. We will derive a contradiction from the supposition that O is not open. Suppose it isn't; then for some x ∈ O, no N_ε(x) ⊂ O (no matter how small we choose ε). Then in particular N_{1/n}(x) contains points xₙ ∈ O^c. Since |xₙ − x| < 1/n → 0, we have xₙ → x and so x is a limit point of O^c, hence a member of O^c (why?). This contradicts the fact that x ∈ O, thus completing the proof. ¤

— Note that X⁻¹(O^c) = {X⁻¹(O)}^c:
$$X^{-1}(O^c) = \{\omega \mid X(\omega) \in O^c\} = \{\omega \mid X(\omega) \notin O\} = \{\omega \mid X(\omega) \in O\}^c = \left\{X^{-1}(O)\right\}^c.$$

— By the preceding points, if C is closed then O = C^c is open and so X⁻¹(C) = {X⁻¹(O)}^c ∈ B: the inverse images of closed sets must also be events.


• Since the set C = (−∞, x] is closed, so also X⁻¹(C) = {ω | X(ω) ≤ x} is a member of B, hence has a probability. We write
$$F(x) = P(\{\omega \mid X(\omega) \le x\}) = P(X \le x)$$
and call F the distribution function (d.f.) of the r.v. X. Any distribution function is right continuous, in that
$$x_n \downarrow x \Rightarrow F(x_n) \to F(x).$$
Proof: Recall (6.1) with Cₙ = (−∞, xₙ], where xₙ ↓ x, and Aₙ = X⁻¹(Cₙ). Then
$$A = \cap_{n=1}^{\infty} A_n = \cap_{n=1}^{\infty} X^{-1}(C_n) = X^{-1}\left(\cap_{n=1}^{\infty} C_n\right) \text{ (verify this)} = X^{-1}((-\infty, x]).$$
Thus P(X ≤ xₙ) = P(Aₙ) → P(A) = P(X ≤ x). ¤

— A d.f. is then a function F: R → [0, 1] satisfying (i) F(−∞) = 0, F(∞) = 1; (ii) F is right continuous; (iii) F is weakly increasing: x₁ < x₂ ⇒ F(x₁) ≤ F(x₂) (you should show (iii)).


— Recall the notion of expected value, which we defined in terms of a density or probability mass function. If F(x) is differentiable then f = F' is the density, and expectations, probabilities etc. are obtained by integration of f. If F is a step function with jumps of height p_k at points x_k (k = 0, 1, 2, ...) then the probability mass function is the function f(x_k) = p_k, and expectations, probabilities etc. are obtained by summation over k. In the former case we say that X is continuous; in the latter X is discrete.

• Convex functions: A function g: I → R is convex if
$$g((1 - t)x + ty) \le (1 - t)g(x) + tg(y)$$
for all x, y ∈ I and 0 ≤ t ≤ 1. Examples: x² on R, −log x on (0, ∞). Convex functions are continuous; if a function has a derivative g'(x) on I which is an increasing (used here and elsewhere in the weak sense) function of x, then it is convex.


• Jensen's Inequality: If X: Ω → I ⊂ R has a finite mean E[X], and if g is convex on I, then E[g(X)] ≥ g(E[X]).

— Application. The arithmetic/geometric mean inequality:
$$\text{if } x_1, \ldots, x_n > 0 \text{ then } \left(\prod_{i=1}^{n} x_i\right)^{1/n} \le \bar{x}.$$
Proof: Define a r.v. X by P(X = xᵢ) = 1/n and apply Jensen's Inequality using the convex function g(x) = −log x. ¤

• Limits and continuity in probability: Let {Yₙ} be a sequence of r.v.s, e.g. toss a fair coin n times and let Yₙ denote the proportion of heads in the n tosses. Then E[Yₙ] = 1/2 and we expect Yₙ to be near 1/2, with high probability, for large n. We say that "Yₙ converges to a constant c in probability", and write Yₙ →ᴾ c, if
$$\lim_{n \to \infty} P(|Y_n - c| \ge \varepsilon) = 0 \text{ for any } \varepsilon > 0.$$
The Weak Law of Large Numbers states that if Yₙ is the average of n independent r.v.s X₁, ..., Xₙ, all with finite mean μ and variance σ², then Yₙ →ᴾ μ.

— e.g. Xᵢ = I(the iᵗʰ toss results in a head), Yₙ = (1/n)∑ᵢ₌₁ⁿ Xᵢ. Then Xᵢ = 1, 0 w.p. 1/2 each; μ = 1/2; by the WLLN, Yₙ →ᴾ 1/2.
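An added Monte Carlo sketch of the WLLN in this coin-tossing example (the probability P(|Yₙ − 1/2| ≥ ε) is estimated over 1000 replications and visibly shrinks with n):

```python
import numpy as np

rng = np.random.default_rng(9)
eps = 0.01
for n in (10**2, 10**4, 10**6):
    Yn = rng.binomial(n, 0.5, size=1000) / n       # 1000 copies of the proportion of heads
    print(n, np.mean(np.abs(Yn - 0.5) >= eps))     # estimated P(|Y_n - 1/2| >= eps)
```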

• This is a basic notion required for the theory of estimation in Statistics.

— e.g. σ² = var(X) = E[(X − μ)²] can be estimated from a sample X₁, ..., Xₙ of independent observations Xᵢ ~ F by the sample variance s² = (n − 1)⁻¹ ∑(Xᵢ − X̄)². The adjustment is for bias; disregarding it, the main idea is that averages are consistent estimates of expectations (i.e. they converge in probability to these constants). Then also s →ᴾ σ; this is a consequence of the following result.


• If Yₙ →ᴾ c and the function g is continuous at c, then g(Yₙ) →ᴾ g(c).
Proof: We want to show that
$$P(|g(Y_n) - g(c)| \ge \varepsilon) \to 0.$$
Use the continuity of g to find δ > 0 such that
$$|Y_n - c| < \delta \Rightarrow |g(Y_n) - g(c)| < \varepsilon.$$
Then
$$P(|Y_n - c| < \delta) \le P(|g(Y_n) - g(c)| < \varepsilon).$$
Here we use the fact that if one event implies another, it has a smaller probability (i.e. A ⊂ B ⇒ P(A) ≤ P(B)). Since the first probability → 1, so does the second (why?). ¤


8. Differentiation; Mean Value and Taylor's Theorems

• Let f: D ⊂ R → R be defined in a neighbourhood N_ε(x₀); put
$$Q(h) = \frac{f(x_0 + h) - f(x_0)}{h}$$
("Newton's quotient"). If Q(h) has a limit as h → 0 we call it the derivative f'(x₀) of f at x₀, also written (d/dx)f(x)|_{x=x₀}.

• Examples: f(x) = x², f(x) = |x|. The former is differentiable everywhere in R; the latter everywhere except x = 0.

• Differentiability ⇒ Continuity: If f'(x₀) exists then f is continuous at x₀.
Proof:
$$|f(x_0 + h) - f(x_0)| = |h|\,|Q(h)| \to 0$$
as h → 0. ¤


• Linearity, product, quotient, chain rules - read in text. They allow us to build up a stock of differentiable functions from simpler ones, and also show how the derivative of the more complicated function can be obtained from those of the simpler ones.

• Relation to monotonicity: if f ↗ on (a, b) and is differentiable there then f'(x) ≥ 0 on (a, b).
Proof: As h ↓ 0 the numerator of Q(h) is ≥ 0, hence f'(x) = lim_{h↓0} Q(h) ≥ 0. (Similarly lim_{h↑0} Q(h) ≥ 0.)

• If f is continuous on [a, b] then the inf and sup are finite, and are attained: there are points x_m, x_M ∈ [a, b] with f(x_m) ≤ f(x) ≤ f(x_M) for all x. Show: If a max or min is in the open interval (a, b), f' = 0 there (if f' exists).


• Mean Value Theorem: If f is continuous on [a, b] and differentiable on (a, b) then ∃ξ ∈ (a, b) with f(b) = f(a) + f'(ξ)(b − a). This is a result of crucial importance in the approximation of functions. "Differentiable functions are locally almost linear."

— Follows from the previous bullet applied to
$$g(x) = f(x) - \left(f(a) + \frac{f(b) - f(a)}{b - a}(x - a)\right).$$

— Restatement: f(x) ≈ f(a) + f'(a)(x − a) if |x − a| is small and f' is continuous (since |f'(ξ) − f'(a)| → 0 as |x − a| → 0). The result is that f(x) is approximately linear near x = a, with slope f'(a). The next result (Taylor's Theorem) strengthens this statement and also allows us to assess the error in this approximation.

— A consequence of the MVT is that if f'(x) ≥ 0 on (a, b) then f ↗ there: suppose x₁ < x₂; then
$$f(x_2) = f(x_1) + f'(\xi)(x_2 - x_1) \ge f(x_1).$$


• Taylor's Theorem: "Sufficiently smooth functions can be approximated locally by polynomials." Suppose f(x) has k derivatives on (a, b), with f^{(k−1)}(x) continuous on [a, b]. (We put f^{(0)}(x) = f(x); the assumptions imply existence and continuity of f^{(j)}(x) on (a, b) for j < k.) Then for x, x₀ ∈ [a, b] there is a point ξ between x and x₀ such that
$$f(x) = \sum_{j=0}^{k-1} f^{(j)}(x_0)\frac{(x - x_0)^j}{j!} + f^{(k)}(\xi)\frac{(x - x_0)^k}{k!}.$$

— Example: f(x) = log(1 + x) with |x| < 1; expand around x₀ = 0: f(0) = 0 and for j > 0,
$$f^{(j)}(x) = \frac{(-1)^{j+1}(j-1)!}{(1 + x)^j}, \quad \text{so that} \quad f^{(j)}(0) = (-1)^{j+1}(j-1)!.$$
Then
$$\log(1 + x) = \sum_{j=1}^{k-1}(-1)^{j+1}\frac{x^j}{j} + \frac{(-1)^{k+1}}{k}\frac{x^k}{(1 + \xi)^k} = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots + (-1)^k\frac{x^{k-1}}{k - 1} + \frac{(-1)^{k+1}}{k}\frac{x^k}{(1 + \xi)^k}$$
for some ξ between 0 and x, i.e. with |ξ| < |x|. Write this as
$$\log(1 + x) = S_k(x) + R_k(x);$$
if R_k(x) → 0 as k → ∞ we say that the series lim_{k→∞} S_k(x) = ∑_{j=1}^{∞} (−1)^{j+1}x^j/j 'represents the function' log(1 + x).
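An added numerical check that the remainder vanishes as k grows (x = 0.5, inside |x| < 1):

```python
import numpy as np

def log1p_taylor(x, k):
    """Partial sum S_k(x) = sum_{j=1}^{k-1} (-1)^{j+1} x^j / j."""
    j = np.arange(1, k)
    return np.sum((-1.0) ** (j + 1) * x**j / j)

x = 0.5
for k in (2, 5, 10, 20):
    print(k, abs(np.log1p(x) - log1p_taylor(x, k)))   # |R_k(x)| -> 0
```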

— Proof of Taylor's Theorem: For k = 1 this is the MVT; assume k > 1. For t ∈ [a, b] put
$$g(t) = f(x) - f(t) - \sum_{j=1}^{k-1} f^{(j)}(t)\frac{(x - t)^j}{j!}.$$
We want to show that g(x₀), which is
$$g(x_0) = f(x) - \sum_{j=0}^{k-1} f^{(j)}(x_0)\frac{(x - x_0)^j}{j!},$$
can also be expressed as
$$g(x_0) = f^{(k)}(\xi)\frac{(x - x_0)^k}{k!} \tag{8.1}$$
for some ξ between x₀ and x. For this, define
$$h(t) = g(t) - \left(\frac{x - t}{x - x_0}\right)^k g(x_0)$$
and note that h(x) = h(x₀) = 0, and that h(t) is differentiable on (a, b) and continuous on [a, b]. By the MVT there is a point ξ between x₀ and x with
$$h(x) = h(x_0) + h'(\xi)(x - x_0);$$
thus h'(ξ) = 0:
$$0 = h'(\xi) = g'(\xi) + k\frac{(x - \xi)^{k-1}}{(x - x_0)^k}g(x_0),$$
so
$$g(x_0) = -g'(\xi)\frac{(x - x_0)^k}{k(x - \xi)^{k-1}}. \tag{8.2}$$
But
$$g'(t) = -f'(t) - \sum_{j=1}^{k-1}\left[f^{(j+1)}(t)\frac{(x - t)^j}{j!} - f^{(j)}(t)\frac{(x - t)^{j-1}}{(j - 1)!}\right] = -f^{(k)}(t)\frac{(x - t)^{k-1}}{(k - 1)!};$$
this in (8.2) gives (8.1). ¤

• l'Hospital's Rule: Read Theorem 4.2.6 (or another source) and the examples following it.

— Rough idea: If f(a) = g(a) = 0, then
$$\lim_{x \to a}\frac{f(x)}{g(x)} = \lim_{x \to a}\frac{\dfrac{f(x) - f(a)}{x - a}}{\dfrac{g(x) - g(a)}{x - a}} = \lim_{x \to a}\frac{f'(x)}{g'(x)}.$$

— Example: lim_{x→0} (sin x)/x = lim_{x→0} (cos x)/1 = 1.


9. Applications: transformations; variance stabilization

• Application 1. Distribution of functions of r.v.s. Suppose a r.v. X has a differentiable d.f. F_X(x) and density f_X(x) = F_X'(x). Consider the r.v. Y = g(X) (e.g. Y = log X). First assume g is strictly monotonic (↑ or ↓). The d.f. of Y is
$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = \begin{cases} P(X \le g^{-1}(y)) & \text{if } g \uparrow \\ P(X \ge g^{-1}(y)) & \text{if } g \downarrow \end{cases} = \begin{cases} F_X(g^{-1}(y)) & \text{if } g \uparrow \\ 1 - F_X(g^{-1}(y)) & \text{if } g \downarrow. \end{cases}$$
Note that the left continuity of F_X is used here:
$$P(X \ge g^{-1}(y)) \overset{?}{=} 1 - F_X(g^{-1}(y)).$$
To get the density f_Y(y) of g(X) we must differentiate F_X(g⁻¹(y)). Write x = g⁻¹(y); then (g⁻¹(y))' = dx/dy can be obtained by differentiating the relationship y = g(x):
$$1 = \frac{dy}{dy} = g'(x)\frac{dx}{dy};$$


hence
$$\frac{dx}{dy} = \frac{1}{g'(x)} = \frac{1}{g'(g^{-1}(y))}.$$
In the above,
$$f_Y(y) = \begin{cases} f_X(g^{-1}(y))\left[g'(g^{-1}(y))\right]^{-1} & \text{if } g \uparrow \\ -f_X(g^{-1}(y))\left[g'(g^{-1}(y))\right]^{-1} & \text{if } g \downarrow. \end{cases}$$
In either event,
$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right| \quad \text{if } y = g(x) \text{ is strictly monotone,}$$
with x expressed in terms of y on the RHS.

— Example: X > 0, Y = −log X. Then
$$f_Y(y) = f_X(e^{-y})\left|\frac{d}{dy}e^{-y}\right| = f_X(e^{-y})\,e^{-y}.$$
Thus if X ~ U(0, 1) with f_X(x) = I(0 < x < 1), Y has density e^{−y} (y > 0); we say Y has the exponential density with mean 1. (The function f(y) = e^{−y}, y > 0, is the exponential p.d.f. with mean 1.)

— When g is non-monotonic it is usual to split up the range of X into regions on which g is monotonic. Example: suppose X ~ N(0, 1) with density
$$\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \quad (-\infty < x < \infty)$$
and d.f. Φ(x) = ∫_{−∞}^{x} φ(u) du = P(X ≤ x). Put Y = −log|X|. Then
$$F_Y(y) = P(Y \le y) = P(X \le -e^{-y} \text{ or } X \ge e^{-y}) = P(X \le -e^{-y}) + P(X \ge e^{-y}) = \Phi(-e^{-y}) + 1 - \Phi(e^{-y});$$
thus
$$f_Y(y) = e^{-y}\phi(-e^{-y}) + e^{-y}\phi(e^{-y}) = 2e^{-y}\phi(e^{-y}) \quad (-\infty < y < \infty).$$
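An added Monte Carlo sketch of both examples (the monotone case is compared with the exponential(1) mean and variance, both 1; the non-monotone case with the derived d.f.):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
U = rng.uniform(size=100_000)
Y = -np.log(U)                         # Y = -log X with X ~ U(0, 1)
print(Y.mean(), Y.var())               # both approximately 1

X = rng.standard_normal(100_000)
W = -np.log(np.abs(X))                 # the non-monotonic example
t = np.exp(-0.5)
print(np.mean(W <= 0.5),               # empirical P(W <= 0.5) ...
      norm.cdf(-t) + 1 - norm.cdf(t))  # ... vs Phi(-e^{-1/2}) + 1 - Phi(e^{-1/2})
```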

• Application 2. Variance stabilization. We first need notions of convergence in law, or distribution. Suppose {Yₙ} is a sequence of r.v.s; we say that Yₙ →ᴸ Y ~ F if
$$P(Y_n \le y) \to P(Y \le y) = F(y)$$
at every continuity point y of F.


— The Central Limit Theorem (CLT; we'll prove it later) refers to this kind of convergence: if Yₙ = √n(X̄ − μ), where X̄ = (1/n)∑ᵢ₌₁ⁿ Xᵢ and X₁, ..., Xₙ are i.i.d. with mean μ and variance σ², then Yₙ →ᴸ Y ~ N(0, σ²). The CLT, WLLN and MVT together are sufficient to derive a vast array of large sample approximations in Mathematical Statistics.

— Some basic facts required here are that if Yₙ →ᴸ Y ~ F then:

1. aₙYₙ + bₙ →ᴸ aY + b, if aₙ and bₙ are constants tending to a and b, or r.v.s with these limits in probability. ("Slutsky's Theorem"; its role is to eliminate "nuisance terms", that typically → 0 or 1, in limit distributions.)

2. g(Yₙ) →ᴸ g(Y) if g is continuous.

3. Yₙ →ᴸ c (a constant) ⇔ Yₙ →ᴾ c. (You should show this.)


Now suppose that Tₙ →ᴾ θ and (actually, this is implied by)
$$\sqrt{n}(T_n - \theta) \overset{L}{\to} Y \sim N(0, \sigma^2).$$
Consider a function Uₙ = g(Tₙ), where g is twice continuously differentiable on R. Then Uₙ →ᴾ g(θ), and by Taylor's Theorem, for some Tₙ* between Tₙ and θ,
$$\sqrt{n}\left(U_n - g(\theta)\right) = \sqrt{n}\left\{g(T_n) - g(\theta)\right\} = \sqrt{n}\left\{g'(\theta)(T_n - \theta) + g''(T_n^*)\frac{(T_n - \theta)^2}{2}\right\} = g'(\theta)\left\{\sqrt{n}(T_n - \theta)\right\} + \frac{g''(T_n^*)}{2\sqrt{n}}\left\{\sqrt{n}(T_n - \theta)\right\}^2.$$
By Slutsky's Theorem, this has the same limit distribution as g'(θ)√n(Tₙ − θ), as long as
$$\frac{g''(T_n^*)}{2\sqrt{n}}\left\{\sqrt{n}(T_n - \theta)\right\}^2 \overset{P}{\to} 0.$$
→ 0


This in turn follows (how?) from
$$\left\{\sqrt{n}(T_n - \theta)\right\}^2 \overset{L}{\to} Y^2 \text{ (why?)} \quad \text{and} \quad \frac{g''(T_n^*)}{2\sqrt{n}} \overset{P}{\to} 0 \text{ (why?)}.$$
The end result is that
$$\sqrt{n}\left(U_n - g(\theta)\right) \overset{L}{\to} N\left(0, \sigma^2\left(g'(\theta)\right)^2\right).$$
Suppose now that we have evidence that our r.v. Tₙ has a variance that depends on its mean, i.e. σ² = σ²(θ). An example is if Tₙ = (1/n)∑ᵢ₌₁ⁿ Xᵢ represents the average number of radioactive emissions of a certain type in n runs of an experiment, where the number X of emissions in one experiment has the Poisson distribution
$$P(X = x) = \frac{e^{-\theta}\theta^x}{x!}, \quad x = 0, 1, 2, \ldots.$$
Then X has mean and variance both = θ, so that by the CLT √n(Tₙ − θ) →ᴸ N(0, θ), i.e. σ²(θ) = θ.


This can make it problematic to make reliable inferences about the mean. For instance, a confidence interval on θ: Tₙ ± 2σ(θ)/√n will have a width depending on the unknown θ.

Question: what "variance stabilizing" transformation Uₙ = g(Tₙ) will have an approximately constant variance? We require σ(g(θ)) = σ(θ)|g'(θ)| to be constant, i.e. g'(θ) ∝ 1/σ(θ). This will be the case if
$$g(\theta) \propto \int \frac{d\theta}{\sigma(\theta)}.$$
In the Poisson example we would take
$$g(\theta) \propto \int \frac{d\theta}{\sqrt{\theta}} = \int \theta^{-1/2}\,d\theta \propto \sqrt{\theta},$$
to obtain
$$\sqrt{n}\left\{\sqrt{T_n} - \sqrt{\theta}\right\} \overset{L}{\to} N(0, 1/4).$$


Part III
SEQUENCES, SERIES, INTEGRATION


10. Sequences and series

• Convergence of a sequence {aₙ}_{n=1}^∞: We say that aₙ → a if d(n) = |aₙ − a| → 0 as n → ∞, i.e. if
$$\forall \varepsilon > 0\ \exists N\ (n > N \Rightarrow |a_n - a| < \varepsilon).$$
This is for finite a; obvious modifications (what are they?) otherwise. E.g. xⁿ → 0 if |x| < 1. (N = log ε / log|x|; N depends on ε and x.)

• Series: Put Sₙ = ∑_{k=1}^{n} a_k, the nᵗʰ partial sum of the series ∑_{k=1}^{∞} a_k. We say that ∑_{k=1}^{∞} a_k = S if Sₙ → S.

— Example: Geometric series ∑_{k=0}^{∞} xᵏ for |x| < 1. We have
$$S_n = \sum_{k=0}^{n} x^k = \frac{1 - x^{n+1}}{1 - x} \to S = \frac{1}{1 - x}.$$


80<br />

• Extend to functions. : → R functions; if<br />

the sequence () has a limit for every ∈ ,<br />

denoted (), then we say that → on .<br />

Formally, for each ∈ ,<br />

∀ 0∃ = ( )(⇒ | () − ()| ) <br />

(10.1)<br />

• Similarly, consider () = P <br />

=1 (). If () →<br />

() for ∈ we say that () = P ∞<br />

=1 ()<br />

and that P ∞<br />

=1 () converges to ().<br />

• If, in (10.1), the same = () worksforall<br />

∈ we say the convergence is uniform on :<br />

⇒ on . Equivalently,<br />

⇒ on ⇔ sup | () − ()| → 0<br />

∈<br />

Example of non-uniformity of convergence:<br />

() =<br />

→<br />

(<br />

0 ≤ 1<br />

1 ≥ 1<br />

(<br />

0 0 ≤ 1<br />

1 ≥ 1<br />

= ()


81<br />

Then for each ,<br />

sup | () − ()| ≥ sup || =1<br />

[0∞)<br />

[01)<br />

so that sup [0∞) | () − ()| 9 0.<br />

• Example of uniformity of convergence. Consider<br />

() = . By Taylor’s Theorem, for between 0<br />

and :<br />

() =<br />

=<br />

X<br />

=0<br />

X <br />

=0<br />

() (0) <br />

! + (+1) () +1<br />

( +1)!<br />

! + +1<br />

( +1)!<br />

= ()+ (), say.<br />

Then |() − ()| = | ()| and so () =<br />

P ∞=0 <br />

!<br />

if | ()| → 0. We show the stronger<br />

result that ⇒ (equivalently, ⇒ 0) on any<br />

closed interval [ ]. For this, let be any integer<br />

that exceeds both || and ||, hence exceeds<br />

||. Let . Then sup ∈[] | ()| → 0;


82<br />

this is because it is<br />

<br />

+1<br />

( +1)!<br />

= +1−<br />

!( +1)···( +( +1− ))<br />

= <br />

! · <br />

+1·<br />

+2··· <br />

( +( +1− ))<br />

<br />

→<br />

<br />

µ <br />

! · +1<br />

0as →∞<br />

+1−<br />

• Cauchy sequences. A sequence { } ∞ =1 is Cauchy<br />

if the terms get close together sufficiently quickly:<br />

∀ 0∃ ( ⇒ | − | ) <br />

Note that if → (finite) then we can let <br />

be such that<br />

⇒ | − | 2<br />

then for ,<br />

| − | = | ( − ) − ( − ) |<br />

≤ | − | + | − |


83<br />

Thus a convergent sequence (i.e. a sequence with<br />

a finite limit) is Cauchy. (As a consequence,<br />

P ∞=1<br />

−1 diverges.) The converse (not proven<br />

here) holds as well, so that a sequence is convergent<br />

iff it is Cauchy.<br />

• A consequence is that if P converges absolutely,<br />

i.e. if P | | converges, then P converges.<br />

Proof: Suppose that = P <br />

=1 | | is a convergent,<br />

hence a Cauchy, sequence. There is such<br />

that<br />

⇒ | − | <br />

But | − | = P <br />

=+1 | |,sothat = P <br />

=1 <br />

satisfies (for )<br />

| − | =<br />

X<br />

¯<br />

=+1<br />

¯¯¯¯¯¯<br />

≤<br />

X<br />

=+1<br />

| | = | − | <br />

Thus { } is Cauchy, hence is convergent. ¤


84<br />

• Example: Let be a discrete r.v. with ( =<br />

)= , =0 1 2 . If P converges<br />

absolutely, wecallitthe moment [ ]of<br />

. Suppose has the Poisson distribution P():<br />

( = ) = − =0 1 2 .<br />

!<br />

(Note P =1-how?) Thenthe moments<br />

exist for all 0. To see this, consider the partial<br />

sums<br />

X<br />

X<br />

X<br />

= = −<br />

<br />

=0 =0<br />

! = ,say.<br />

=0<br />

We must show that converges. Note that<br />

+1<br />

<br />

= <br />

+1<br />

µ<br />

1+ 1 <br />

→ 0as →∞<br />

so that for 1thereis so that<br />

⇒ +1<br />

<br />


85<br />

Then for 0 we have<br />

0 = +<br />

X 0<br />

<br />

=+1<br />

≤ +( + 2 + + 0− ) <br />

+ <br />

1 − <br />

Thus the sequence { +1 +2 } is increasing<br />

and bounded above, hence has a limit<br />

(= the ) - you should show this.<br />

— Here we have established convergence by using<br />

aversionoftheratio test. See §5.2.1, 5.2.2<br />

in the text, or elsewhere, for other tests.<br />

• Uniform convergence ensures that we can interchange<br />

certain operations.<br />

Theorem: If ⇒ on , then<br />

lim<br />

→∞ → lim () = lim lim → →∞ ()<br />

for ∈ .


86<br />

— A case in which this fails, because the convergence<br />

is not uniform, is<br />

(<br />

<br />

() =<br />

0 ≤ 1<br />

1 ≥ 1<br />

(<br />

0 0 ≤ 1<br />

→<br />

= ()<br />

1 ≥ 1<br />

with = 1. Here lim → lim →∞ () =<br />

lim →1 () does not even exist - the limit<br />

function is discontinuous. Cases like this are<br />

ruled out if the convergence is uniform:<br />

— If ⇒ on and the are continuous on<br />

, then is continuous on <br />

Proof: Forany ∈ ,<br />

lim ()<br />

→<br />

= lim<br />

→ →∞ lim ()<br />

= lim<br />

→∞ → lim ()<br />

= lim<br />

→∞ ()<br />

= ()<br />

¤


87<br />

11. Power series; moment and probability<br />

generating functions<br />

• Power series: Put () = P <br />

=0 ( − ) ;if<br />

() → () as → ∞ we say that<br />

P ∞=0<br />

(−) is the power series representing .<br />

— e.g. by Taylor’s Theorem, if<br />

then<br />

() =<br />

X<br />

=0<br />

() ( − )<br />

()<br />

!<br />

() = ()+ (+1) ( − )+1<br />

() <br />

( +1)!<br />

so that if<br />

(+1) ( − )+1<br />

()<br />

( +1)!<br />

→ 0<br />

then P ∞<br />

=0<br />

() () (−)<br />

!<br />

is the power series<br />

(“Taylor series”, or “Maclaurin’s series” if =<br />

0) “representing ”.


88<br />

• Theorem: Suppose a power series P ∞<br />

=0 <br />

converges for one value 0 6= 0. Then it converges<br />

absolutely for || | 0 |.<br />

Proof: Put<br />

() =<br />

() =<br />

X<br />

=0<br />

X<br />

=0<br />

<br />

¯<br />

¯ ¯¯¯ <br />

Since ( 0 ) has a limit as → ∞, it is a<br />

¯<br />

Cauchy sequence and, in particular, ¯ 0 ¯ =<br />

¯<br />

¯ ( 0 ) − −1 ( 0 )¯¯ → 0. For 0let be<br />

large enough that<br />

⇒ | 0 | <br />

Then for and || | 0 |,<br />

() =<br />

X<br />

=0<br />

¯<br />

¯ ¯¯¯ = ()+<br />

X<br />

=+1<br />

¯<br />

¯ 0<br />

<br />

<br />

¯<br />

¯ 0¯¯¯¯¯<br />

<br />

()+<br />

1 − | 0 | <br />

i.e. the partial sums (), which are necessarily<br />

increasing, are bounded above. ¤


89<br />

• If P ∞<br />

=0 converges for || and diverges<br />

for || we call the radius of convergence.<br />

Then, by the previous result, if || the series<br />

is absolutely convergent.<br />

— e.g. put<br />

() =<br />

X<br />

=0<br />

(−) = (1 − (−)+1 )<br />

<br />

1+<br />

then with () =1(1 + ) we have<br />

| () − ()| = || +1 |1+|<br />

If || 1then| () − ()| → 0; if || 1<br />

it →∞. Thus = 1 is the radius of convergence.<br />

In this case when || = 1 the series diverges<br />

(i.e. the partial sums do not converge).


90<br />

• Theorem: Suppose a power series P ∞<br />

=0 <br />

has a radius of convergence 0 (within which it<br />

necessarily converges absolutely). Let 0 .<br />

Then:<br />

(i) P ∞<br />

=0 converges uniformly on [− ];<br />

(ii) For || the limit function () = P ∞<br />

=0 <br />

is continuous and differentiable, and the derivative<br />

is represented by the convergent series<br />

0 () =<br />

∞X<br />

=1<br />

−1 <br />

(Thus P ∞<br />

=1 −1 converges for || and<br />

so (i), (ii) apply to 0 ().)<br />

Proofof(i): Suppose P <br />

=0 = () →<br />

() for|| .Thenfor|| we have<br />

| () − ()| =<br />

∞X<br />

¯<br />

=+1<br />

¯¯¯¯¯¯ ≤<br />

∞X<br />

=+1<br />

| | → 0<br />

as →∞,since P ∞<br />

=0 converges absolutely<br />

for || . Thus sup ||≤ | () − ()| → 0,<br />

as required. ¤


91<br />

• By (ii), we can repeat the process:<br />

00 () = P ∞<br />

=2 ( − 1) −2 , etc. Among<br />

other things, this implies the uniqueness of power<br />

series representations. (How?)<br />

• Example: The probability generating function of<br />

ar.v. is the function () =[ ], provided<br />

this exists. In particular, if has support N =<br />

{0 1 2} then<br />

() =<br />

∞X<br />

=0<br />

( = )<br />

Since this converges for = 1 it has radius of<br />

convergence ≥ 1. Wecanthendifferentiate<br />

term-by-term near =0:<br />

=<br />

() (0)<br />

∞X<br />

=<br />

( − 1) ···( − +1) − ( = ) |=0<br />

= ! ( = )


92<br />

— Note that, by uniqueness of power series, if we<br />

can expand () as P ∞<br />

=0 then, necessarily,<br />

= ( = ) = () (0)!. In other<br />

words, the p.g.f. uniquely determines the distribution:<br />

two r.v.s with the same p.g.f. have<br />

the same distribution.<br />

— Example: If ∼ ( )then<br />

() =(1− + ) ;<br />

the uniqueness then shows that the sum of<br />

such independent , all with the same but<br />

possibly different values of , is∼ ( P ).<br />

— In the above we have used the fact that a<br />

characterization of the independence of r.v.s<br />

( )isthat[()( )] = [()][( )]<br />

for all functions such that () and( )<br />

are also r.v.s. Equivalently, () and( )<br />

are uncorrelated for all such .


93<br />

• The moment generating function of a r.v. is<br />

the function () = [ ], provided this exists<br />

(i.e. is finite). (Replacing by gives the<br />

characteristic function, which always exists: it is<br />

[cos ()] + [sin ()].) With as above,<br />

() =<br />

∞X<br />

=0<br />

( = )<br />

Note that () =( ), so that it converges (absolutely)<br />

in a neighbourhood of =0iff has a<br />

radius of convergence 1. Assume this. Then<br />

for || log we have, by the preceding theorem,<br />

0 () = 0 ( ) <br />

=<br />

=<br />

∞X<br />

=0<br />

∞X<br />

=0<br />

= [ ]<br />

³ ´−1 ( = ) · <br />

<br />

( = )<br />

with 0 (0) = []. Continuing, () () =[ ]<br />

with () (0) = [ ]. (i.e. we can differentiate<br />

within the [·].)


— e.g. ∼ () with ( = ) = − <br />

! has<br />

() =<br />

Thus<br />

∞X<br />

=0<br />

−<br />

! = − ∞ X<br />

= − · = ( −1) <br />

=0<br />

³<br />

<br />

´<br />

!<br />

[] = 0 (0) = <br />

[ 2 ] = 00 (0) = 2 + hence<br />

[] = <br />

94<br />

— The cumulants of a distribution are defined<br />

as the coefficients in the expansion<br />

log [ ]=<br />

∞X<br />

=1<br />

<br />

<br />

! <br />

Thus the Poisson distribution has all cumulants<br />

= . In general 1 is the mean and 2 is<br />

the variance; after that they get more complicated.<br />

The Normal distribution has all =0<br />

for 2.


95<br />

12. Branching processes<br />

• Important in population studies and elsewhere.<br />

Organisms are born, live for 1 unit of time, then<br />

give birth to a random number of offspring and<br />

die.<br />

• Define r.v.s<br />

= population size at time + , 0 =1<br />

= number of offspring of the member<br />

of the population.<br />

Then<br />

=<br />

−1 X<br />

=1<br />

<br />

• Problems: (i) Determine properties of the distribution<br />

of . (ii) Determine the limiting probability<br />

of extinction (= lim →∞ ( =0)=<br />

lim →∞ (0), if is the p.g.f.).


96<br />

• Assume: When −1 = , 1 2 are independent<br />

r.v.s, independent of −1 . (i.e. number<br />

of offspring of one member has no effect on<br />

that of another, and is unaffected by current size<br />

of population. Realistic?) Assume also that all<br />

are distributed in the same manner.<br />

• We will work with the p.g.f.s<br />

() =[ ] () =[ ]; 0 ≤ ≤ 1<br />

Assume has a radius of convergence 1, so<br />

that [ ]= 0 (1) exists. Note that<br />

() =[ ]=<br />

If −1 = , thisis<br />

=<br />

<br />

Y<br />

=1<br />

∙ P =1<br />

<br />

¸<br />

= <br />

"<br />

<br />

⎡ ⎤<br />

Y<br />

⎣ ⎦<br />

=1<br />

h i<br />

(independence)<br />

#<br />

P −1<br />

=1<br />

<br />

<br />

= () (sincethe are identically distributed).


97<br />

Considering the probabilities of the events “ −1 =<br />

” (i.e. Double Expectation Theorem: [ ]=<br />

−1<br />

n<br />

[<br />

| −1 ] o )gives<br />

Iterating:<br />

() =<br />

∞X<br />

=0<br />

= h () −1<br />

= −1 (())<br />

() ( −1 = )<br />

0 () = [ 0]=[] =<br />

1 () = 0 (()) = () =( 0 ())<br />

2 () = 1 (()) = ◦ () =( 1 ())<br />

3 () = 2 (()) = ◦ ◦ () =( 2 ())<br />

and in general (by induction)<br />

() =( −1 ()) =1 2; 0 () =<br />

It follows (you should show how) that [ ]=<br />

{[ ]} . (Intuitively obvious?)<br />

i


98<br />

• Probability of extinction. Note ( = 0) =<br />

(0) = ,say,and = ( −1 )with 0 =0.<br />

Does lim →∞ exist, and if so what is it? We<br />

shall assume that 0 ( =0) 1, otherwise<br />

problem is trivial. Consequently, () is positive,<br />

strictly increasing and convex for 0 ≤ 1:<br />

() =<br />

0 () =<br />

00 () =<br />

Now<br />

∞X<br />

=0<br />

∞X<br />

=1<br />

( = ) = ( =0)+ 0<br />

−1 ( = ) 0<br />

sinceatleastoneofthe ( = ) is 0<br />

∞X<br />

=2<br />

( − 1) −2 ( = ) ≥ 0<br />

0 = 0<br />

1 = ( 0 )=(0) 0= 0 <br />

1 0 ⇒ 2 = ( 1 ) ( 0 )= 1 <br />

···<br />

−1 ⇒ +1 = ( ) ( −1 )= <br />

In general 0 = 0 1 2 ≤ 1, and so<br />

↑ =sup{ }


99<br />

Since = ( −1 )and is continuous we have<br />

=<br />

lim<br />

→∞ =<br />

• Put () =() − ; note<br />

lim<br />

→∞ ( −1) =( lim<br />

→∞ −1) =()<br />

(0) = ( =0) 0<br />

(1) = 0<br />

0 (0) = 0 (0) − 1= ( =1)− 1 0<br />

and is convex. Also<br />

0 (1) = 0 (1) − 1=[ ] − 1<br />

The function () can drop below 0 at most once<br />

in (0 1). Graphthetwopossiblecases. Inthe<br />

first () isincreasingat = 1, in the second it<br />

is decreasing.<br />

— Case 1: [ ] 1. Equivalently, 0 (1) 0.<br />

There are two roots, say ∈ (0 1) and =1,<br />

to the equation () =0,and is one of them.<br />

We have<br />

0 = 0 <br />

⇒ 1 = ( 0 ) () =<br />

⇒ 2 = ( 1 ) () =


100<br />

etc.; hence ≤ and so = .<br />

— Case 2: [ ] ≤ 1. Equivalently, 0 (1) ≤ 0<br />

and = 1 is the only solution.<br />

• Summary:<br />

If [ ] ≤ 1then<br />

(eventual extinction) = 1;<br />

if [ ] 1 then this probability is 1andisthe<br />

unique solution in (0 1) to () =.<br />

Let = time of extinction. Then<br />

( )= ( 0) = 1 − <br />

hence ( ≤ ) = ,with<br />

0 = 0 = ( −1 )<br />

( = ∞) =1− <br />

Itcan(andwill-Asst.3)beshownthat<br />

[] =<br />

∞X<br />

=0<br />

( )(=∞ if [ ] 1).


101<br />

P(N


102<br />

13. Riemann integration<br />

• Riemann integration. First consider :[ ] →<br />

R, abounded function. Consider a partition, or<br />

‘mesh’ = { = 0 1 ··· = } of<br />

[ ]; its norm is ∆ =max (∆ ), where ∆ =<br />

− −1 . For ∈ [ −1 ], an approximation<br />

to the area under the graph of is<br />

() =<br />

X<br />

=1<br />

( )∆ <br />

Theintegralof over [ ] isdefined as the limit<br />

of these approximations, as they become more<br />

and more refined, i.e. as ∆ → 0.<br />

• Formally, we first bound the Riemann sum ()<br />

above and below as follows. Define<br />

= inf<br />

[ −1 ] () = sup<br />

() =<br />

X<br />

=1<br />

∆ () =<br />

[ −1 ]<br />

X<br />

=1<br />

();<br />


103<br />

Then clearly<br />

() ≤ () ≤ ()<br />

If we refine by including points 0 between −1<br />

and , obtaining another partition 0 ⊃ ,then<br />

in 0 the infima increase and the suprema decrease;<br />

thus<br />

() ≤ 0() and 0() ≤ ()<br />

Also () ≤ 0() for any partitions , 0<br />

(shown by considering their union, whose lower<br />

sum exceeds that of and whose upper sum is<br />

≤ that of 0 ). Continuing:<br />

() ≤ 0() ≤ 00() ≤ ···<br />

≤ sup () ≤ inf ()<br />

<br />

<br />

≤ ···≤ 00() ≤ 0() ≤ ()<br />

We say that is (R-) integrable if sup () =<br />

inf (), and then their common value is<br />

R <br />

(). Equivalently<br />

inf<br />

{ () − ()} =0


104<br />

• An example of a non-integrable function is () =<br />

( ∈ Q) (Q the rationals) for ∈ [0 1]. Then<br />

for any partition we have ≡ 0and ≡ 1,<br />

so sup () =0 1=inf ()<br />

• Continuous functions on [ ] are R-integrable<br />

there. (We write ∈ [ ], or just ∈ )<br />

The general idea of the proof is that, since continuous<br />

functions on bounded, closed intervals are<br />

(bounded and) uniformly continuous, − <br />

can be made uniformly small, say ≤ whenever<br />

∆ .Then<br />

() − () ≤ <br />

X<br />

=1<br />

hence inf { () − ()} =0.<br />

∆ = ( − )


105<br />

• Monotonic, bounded functions on [ ] are R-<br />

integrable; e.g. for % functions,<br />

() − () =<br />

≤<br />

X<br />

=1<br />

[( ) − ( −1 )] ∆ <br />

∆ [() − ()] → 0<br />

• More generally, we say a function is of bounded<br />

variation (“ is BV”) on [ ] if P <br />

=1 |∆ | ≤ <br />

for some 0 and all partitions . (Here<br />

∆ = ( )−( −1 ).) This clearly holds if is<br />

monotonic and bounded, or (by the MVT) if has<br />

a bounded derivative (since then |∆ | ≤ ∆ ,<br />

where | 0 | ≤ on [ ]). It can be shown that<br />

if is BV then ∈ [ ].<br />

• Standard properties follow from these definitions.<br />

If ∈ [ ] thensoare + , , and<br />

||; in the first two cases the integral is linear;<br />

¯<br />

in the last we have ¯R ()¯ ≤ R <br />

|()|. If<br />

≤ then R <br />

() ≤ R <br />

(). (You should<br />

show these two inequalities.) Also, R <br />

() =<br />

R <br />

() + R <br />

() for ∈ [ ].


106<br />

• An important result is the Mean Value Theorem<br />

for Riemann integrals: If is continuous on [ ]<br />

then there is ∈ [ ] forwhich<br />

Z <br />

<br />

() = ()( − )<br />

Proof: Let and be the inf and sup of on<br />

[ ], then<br />

Z <br />

≤ 1 () ≤ <br />

− <br />

Since is continuous it attains and every<br />

point between (Intermediate Value Theorem), hence<br />

there is ∈ [ ] forwhich() =<br />

−<br />

1 R <br />

().<br />

¤


107<br />

• Now define<br />

() =<br />

Z <br />

<br />

() ≤ ≤ <br />

the indefinite integral of . We have the Fundamental<br />

Theorem of Calculus: If is continuous<br />

on [ ] then is differentiable there, with<br />

Z <br />

<br />

0 () = (); (13.1)<br />

() = (), hence<br />

Proof of (13.1):<br />

0 () =<br />

1<br />

lim<br />

→0 <br />

=<br />

1<br />

lim<br />

→0 <br />

= () − () (asbelow).<br />

" Z +<br />

<br />

Z +<br />

<br />

() −<br />

()<br />

Z <br />

() #<br />

= lim<br />

→0<br />

1<br />

· ( )<br />

for some ∈ [ + ], by MVT . Since → <br />

and is continuous, ( ) → (). ¤


108<br />

— This is the main tool for evaluating integrals<br />

—wefind a whose derivative is .<br />

Reason: If 0 () =(), then<br />

() =() − () − ()<br />

has () =0and 0 () ≡ 0, so () =0<br />

(for instance by the MVT). Hence<br />

Z <br />

<br />

() = () =() − ()<br />

— This is used to justify the change-of-variables<br />

formula for Riemann integration (i.e. integration<br />

by substitution).<br />

— Example: the substitution = tan, with<br />

() =sec 2 ,gives<br />

= <br />

Z <br />

1<br />

1+ 2 = Z arctan <br />

<br />

¯<br />

¯arctan <br />

arctan cos2 sec 2 <br />

arctan =arctan − arctan


109<br />

• Improper Riemann integrals, in which one or both<br />

endpoints are infinite, or at which is unbounded,<br />

are defined by taking appropriate limits:<br />

Z ∞<br />

() = lim<br />

→∞<br />

Z <br />

() =<br />

−∞<br />

Z ∞<br />

Z <br />

<br />

() = lim<br />

↓0<br />

Z <br />

()<br />

Z ∞<br />

() + () for any <br />

−∞ <br />

Z −<br />

<br />

() if () =±∞<br />

Example: () =1{(1 + 2 )}, −∞ ∞ is<br />

the ‘Cauchy’ (= on 1 degree of freedom) p.d.f.; we<br />

have<br />

Z <br />

() =(arctan − arctan ) <br />

<br />

so<br />

Z ∞<br />

µ Z Z ∞<br />

<br />

() = +<br />

−∞ −∞ <br />

arctan − arctan arctan − arctan <br />

= lim<br />

+ lim<br />

→−∞ <br />

→∞ <br />

− arctan (−∞)+arctan∞<br />

=<br />

= 2+2<br />

<br />

<br />

=1


110<br />

However, none of the moments<br />

Z ∞<br />

[ ]=<br />

−∞ ()<br />

exist for ≥ 1. This is because the existence of [ ]<br />

requires the existence of<br />

Z ∞<br />

Z ∞<br />

() = || ()<br />

and of<br />

Z 0<br />

0<br />

−∞ () =(−1) Z 0<br />

0<br />

−∞ || ()<br />

hence the existence of [|| ]. You should show that<br />

[|| ]doesnotexistif is Cauchy; a consequence<br />

is that even if the integrand of R ∞<br />

−∞ () is an odd<br />

function, the integral need not = 0.


111<br />

14. Riemann and Riemann-Stieltjes integration<br />

• An application of the Fundamental Theorem of<br />

Calculus is the formula for integration by parts.<br />

If are differentiable, and 0 0 are integrable,<br />

then<br />

Z <br />

[()()]0 = ()() − ()() andalso<br />

Z h<br />

= 0 ()()+() 0 ) i ;<br />

hence<br />

Z <br />

<br />

0 ()() = ()()−()()−<br />

()0 ()<br />

A mnemonic is “ R = | − R ”.<br />

Z


112<br />

• Application. Define Γ() = R ∞<br />

0 −1 − , (<br />

0), the Gamma integral. Establishing the existence<br />

of lim 0 →∞ () ,orof<br />

R<br />

Z <br />

lim ()<br />

→0→∞ <br />

if 1, is left to you. We have<br />

Z Ã ! ∞ <br />

0<br />

Γ() =<br />

− <br />

0 <br />

à !<br />

<br />

<br />

= −¯¯¯¯¯<br />

∞ Z Ã ! ∞ <br />

<br />

−<br />

<br />

0 − <br />

hence<br />

= 1 <br />

Z ∞<br />

0<br />

0<br />

− = 1 Γ( +1)<br />

<br />

Γ( +1)=Γ();<br />

in particular for an integer,<br />

Γ(+1) = Γ() =··· = (−1)···1·Γ(1) = !


113<br />

• A generalization of the Riemann integral that is<br />

particularly useful in statistics is the Riemann-<br />

Stieltjes (R-S) integral. Let be bounded on<br />

[ ], and let () be% there. In the definition<br />

of the R-integral, replace ∆ everywhere by<br />

∆ = ( ) − ( −1 )(≥ 0). The analogue of<br />

() is (; ) = P <br />

=1 ( )∆ ;ifthishas<br />

a limit as ∆ → 0 - equivalently (as at Theorem<br />

6.2.1) if sup () =inf ()<br />

-thenwecallittheR-Sintegral R <br />

()().<br />

It is particularly useful in cases where is not<br />

continuous.<br />

• Special cases:<br />

1. () = ; R <br />

()() = R <br />

(), the<br />

R-integral.<br />

2. differentiable, with 0 = ; R <br />

()() =<br />

R <br />

()(), theR-integral.


3. () =<br />

(<br />

≤ <br />

≤ ≤ .Then<br />

114<br />

∆ =<br />

(<br />

− −1 ≤ <br />

0 otherwise<br />

It follows that R <br />

()() =()( − ).<br />

• We adopt the convention that unless stated otherwise,<br />

by R <br />

we mean R (] ,i.e.therighthand<br />

endpoint is included, the left is not. Note that<br />

this is not an issue for R-integrals (why not?).<br />

Combining 2) and 3):<br />

Suppose = 0 1 ··· = and<br />

(a) is differentiable on ( −1 )with 0 =<br />

≥ 0,<br />

(b) has a jump discontinuity (but is right continuous)<br />

at each ,with ( ) − ( − )= .<br />

Then<br />

Z <br />

X Z <br />

()() = ()()<br />

<br />

=1<br />

−1<br />

(<br />

X Z )<br />

<br />

=<br />

()() + ( ) <br />

−1<br />

=1


115<br />

• Improper R-S integrals defined as for R-integrals.<br />

In particular, let be a r.v. with d.f. (), −∞ <br />

∞ (note %). Let () be a function of<br />

. Wedefine the expected value of () tobe<br />

Z ∞<br />

[()] = ()()<br />

−∞<br />

If has a density this agrees with the earlier<br />

definition. Suppose instead that is discrete,<br />

with<br />

( = )= =0 1 2 <br />

Then () = ( ≤ ) has a jump of height<br />

∆ = at and has 0 = 0 elsewhere, so<br />

[()] =<br />

∞X<br />

=0<br />

( )


116<br />

• An example illustrating the power of this integral,<br />

in which neither the R-integral nor a sum alone<br />

will suffice, is if represents the lifetime of a<br />

randomly chosen light bulb. Suppose that, with<br />

probability , the bulb blows when first installed.<br />

Otherwise, it has an exponentially distributed lifetime,<br />

with ( )= − . Thus its d.f. is<br />

() = ( ≤ ) =<br />

⎧<br />

⎪⎨ 0<br />

0<br />

=0<br />

⎪⎩<br />

+(1− )(1 − − ) 0<br />

with<br />

[ ( )] =<br />

Z ∞<br />

−∞<br />

() ()<br />

= (0) · +<br />

Z ∞<br />

∙ n o¸<br />

() +(1− )(1 − − ) <br />

<br />

0<br />

= (0) · +(1− )<br />

Z ∞<br />

0<br />

() −


117<br />

• Cauchy-Schwarz inequality:<br />

µZ<br />

2 Z<br />

()()() ≤<br />

2 Z<br />

()()·<br />

2 ()()<br />

provided all three integrals exist. The range is<br />

the same for all three, but need not be bounded.<br />

(Does existence of the latter two integrals imply<br />

existence of the first?)<br />

Proof: Essentially identical to the vector version:<br />

0 ≤<br />

=<br />

Z<br />

Z<br />

( + ) 2 <br />

2 +2<br />

Z<br />

+ 2 Z 2 <br />

hence “ 2 − 4” ≤ 0, i.e.<br />

4<br />

µZ<br />

2 Z<br />

− 4<br />

2 ·<br />

Z<br />

2 ≤ 0<br />

— Example: [ 3 ] ≤<br />

q<br />

[ 2 ][ 4 ].<br />

¤


118<br />

• Integration by parts: if the R-S integrals R <br />

()()<br />

and R <br />

() () bothexist,then R <br />

()()+<br />

R <br />

() () =()() − ()(). This and<br />

other identities are also valid for decreasing integrators<br />

- e.g. replace R by − R (−) inthe<br />

appropriate places.<br />

• An application is Euler’s summation formula: If<br />

has a continuous derivative 0 on ( − )<br />

for some ∈ (0 1), then<br />

X<br />

=<br />

() =<br />

Z <br />

()+()+ Z <br />

0 (){}<br />

where {} = − [] is the fractional part of .<br />

• Example: with () =1 and =1,weobtain<br />

X<br />

=1<br />

Z<br />

1<br />

<br />

− log =1− 1<br />

{}<br />

2 ∈ µ 1<br />

1 <br />

<br />

The middle term above is decreasing in , and<br />

bounded below by 0, thus it has a limit (‘Euler’s


119<br />

constant’) ∈ [0 1) as →∞:<br />

X<br />

=1<br />

1<br />

− log → = 577215<br />

<br />

Note that both P <br />

=1<br />

1<br />

<br />

and log diverge.<br />

• Similarly, 0 ≤ 2 √ − 1 − P <br />

=1<br />

1 √ ≤ 1 − 1 √<br />

<br />

.<br />

The convergence is very slow; limit ≈ 4604 with<br />

=19, 4603 with =18.<br />

Proof of Euler’s formula: Write the sum as a R-<br />

S integral, and split into regions on which {} is<br />

monotone:<br />

X<br />

=<br />

() =<br />

=<br />

=<br />

Z <br />

− ()[]<br />

Z <br />

−<br />

()( − {})<br />

Z <br />

− () − Z <br />

− (){}<br />

−<br />

X<br />

Z <br />

=+1<br />

−1 (){}


120<br />

Integrating by parts, and using { − } =1−, gives<br />

X<br />

=<br />

()<br />

=<br />

=<br />

Z ()<br />

−<br />

" #<br />

(){} − ( − ) { − }<br />

−<br />

− R <br />

− 0 (){}<br />

"<br />

X (){} − ( − 1){ − 1}<br />

−<br />

− R <br />

−1 0 (){}<br />

=+1<br />

Z <br />

−<br />

() +(1− ) ( − )<br />

#<br />

+<br />

Z <br />

− 0 (){} +<br />

X<br />

Z <br />

=+1<br />

−1 0 (){}<br />

=<br />

Z <br />

− ()<br />

→<br />

↓0<br />

+(1− ) ( − )+Z <br />

− 0 (){}<br />

Z <br />

() + ()+ Z <br />

0 (){}<br />

as required. ¤


15. Moment generating functions; Chebyshev’s<br />

Inequality; Asymptotic statistical theory<br />

121<br />

• Moment generating functions. () =[ ]=<br />

R ∞−∞<br />

() (the R-S integral). If this exists in<br />

an open neighbourhood of =0(note (0) = 1<br />

always exists) then it is the m.g.f. It is also written<br />

(). Some useful properties:<br />

1. () (0) = [ ] (so if the m.g.f. exists, so do<br />

all moments). In other words we can differentiate<br />

under the integral sign. Then if we can<br />

find an expansion of the form () = P <br />

! ,<br />

by the uniqueness of power series this must<br />

be the MacLaurin series, and so we must have<br />

= () (0) = [ ].<br />

2. If () = () forall|| (for some<br />

0) then ∼ , i.e. the distribution of a<br />

r.v. is uniquely determined by the m.g.f.<br />

3. If { } is a sequence of r.v.s with m.g.f.s<br />

(), and if () → () for in a neighbourhood<br />

of 0, where () is the m.g.f. of a<br />

<br />

r.v. , then → .


122<br />

— You should show: if ∼ ( )with<br />

<br />

→ then → P() (Poisson, mean<br />

).<br />

4. Sums of independent r.v.s. If 1 2 are<br />

independent r.v.s with m.g.f.s 1 () 2 ()<br />

then if = P <br />

=1 we have<br />

() = h P i<br />

=1 <br />

= <br />

=<br />

⎡ ⎤<br />

Y<br />

⎣ Y<br />

⎦ =<br />

=1 =1<br />

Y<br />

=1<br />

()<br />

h i<br />

<br />

In particular, if all are distributed in the<br />

same way, with m.g.f. (), then the m.g.f. of<br />

their sum is () = () and the m.g.f. of<br />

their average is<br />

¯ () = h i = () = ()<br />

• All of this also holds for the characteristic function<br />

(c.f.) [ ]= [cos ] + [sin ], which<br />

always exists.


123<br />

• Suppose ∼ (0 1), with p.d.f. (). Define<br />

= 2 ,a 2 1 r.v. Its m.g.f. is<br />

() =<br />

=<br />

Z ∞<br />

−∞ 2 ()<br />

Z ∞<br />

−∞<br />

1<br />

√<br />

2<br />

−2 2 (1−2) (|| 12)<br />

= ( = √ 1 − 2)<br />

= (1− 2) −12 <br />

1<br />

√ 1 − 2<br />

Z ∞<br />

−∞ ()<br />

Now suppose is the sum of squares of independent<br />

(0 1)’s, i.e. is a 2 r.v. Its m.g.f.<br />

is the power of the above (why?), thus =<br />

(1 − 2) − 2 (|| 12). It follows that the p.d.f.<br />

is<br />

() =<br />

³ 2´<br />

2<br />

−1<br />

<br />

− 2<br />

2Γ ³ <br />

2´<br />

0 ∞<br />

Proof: With = 2<br />

(1 − 2) for|| 12,<br />

Z ∞<br />

0<br />

() = (1− 2) − 2<br />

Z ∞<br />

0<br />

2 −1 −<br />

Γ ³ <br />

2´ <br />

= (1− 2) − 2


124<br />

• Chebyshev’s Inequality:<br />

and variance 2 then<br />

If a r.v. has mean <br />

(| − | ≥ ) ≤ 1 2 <br />

Proof:<br />

An equivalent formulation is<br />

(|| ≥ ) ≤ 1 2 <br />

where =( − ) has mean 0 and variance<br />

1.<br />

Note that the indicator of an event , given by<br />

( ) =<br />

∼<br />

(<br />

1 if occurs,<br />

0 otherwise,<br />

(1( )) <br />

with [( )] = ( ). Thus<br />

1 = [] = h 2i<br />

≥ h 2 (|| ≥ ) i<br />

≥ 2 [ (|| ≥ )]<br />

= 2 (|| ≥ )


125<br />

• Chebyshev’s Inequality furnishes an easy proof of<br />

the Weak Law of Large Numbers: If ¯ is the<br />

average of independent r.v.s, each with mean<br />

and variance 2 <br />

,then ¯ → as →∞.<br />

Proof: Note that ¯ has mean and variance<br />

2 . For0, put = √ ; then<br />

³¯¯¯ ¯ − ¯ ≥ ´<br />

= ³¯¯¯ ¯ − ¯ ≥ ³ ¯ ´´<br />

≤<br />

→<br />

1 2 = 2<br />

2<br />

0as →∞<br />

• Central Limit Theorem. This is probably the most<br />

significant theorem in mathematical statistics. It<br />

gives the approximate normality of averages of<br />

r.v.s and, when combined with the MVT (or Taylor’s<br />

Theorem), the WLLN and Slutsky’s Theorem<br />

(see below), forms the basis for approximating the<br />

distributions of many other statistics of interest.


126<br />

• Theorem: Let 1 2 be independent<br />

r.v.s, with common d.f. () = ( ≤ ),<br />

mean , variance 2 (0 2 ∞). Put =<br />

√ ³<br />

¯ − ´<br />

;then → (0 2 ).<br />

— To apply, since the statements “ ∼ (0 2 )”<br />

and “ ¯ ∼ ( 2 )” are equivalent, we<br />

treat ¯ as if it were distributed approximately<br />

as ( 2 ). Then, e.g. if we can also estimate<br />

2 , we have the basis for making inferences<br />

about .<br />

• ProofofCLT: We make the additional assumption<br />

that the have a m.g.f. Define<br />

() =[ ( −) ](= − [ ])<br />

We shall use the fact, being established in assignment<br />

3, that the m.g.f. of ∼ ( 2 )is<br />

[ ]= +2 2<br />

2 <br />

Notation: “ () = (()) as → ” means<br />

“ () () → 0as → ”.


127<br />

Let be fixed but arbitrary. Expand () as<br />

() = (0) + 0 (0) + 00 (0) 2 2 + 000 () 3 6 <br />

(0 ≤ || ≤ ||)<br />

= 1+[ − ] + [( − ) 2 ] 2 2 + 000 () 3 6<br />

= 1+ 2 2<br />

2 + (2 )as → 0<br />

Why ( 2 )? - because 000 () hasafinite limit as<br />

, hence , tends to 0.<br />

We are to show that the m.g.f. of<br />

= 1 √ <br />

X<br />

=1<br />

( − )<br />

tends to that of a (0 2 ) r.v., i.e. that<br />

<br />

"<br />

√ 1 P #<br />

=1<br />

<br />

( −)<br />

=<br />

Y<br />

=1<br />

<br />

"<br />

<br />

= ( √ )<br />

#<br />

√<br />

<br />

( −)


128<br />

tends to 2 2<br />

2 . Equivalently, we show that<br />

For this, write<br />

log ( √ ) → 2 2<br />

2<br />

as →∞<br />

log ( √ 2 Ã<br />

<br />

2<br />

) = log ⎜<br />

⎝ 1+2 2 + <br />

⎛<br />

⎜<br />

= log (1 + )<br />

<br />

<br />

(15.1)<br />

| {z }<br />

<br />

⎞<br />

!<br />

⎟<br />

⎠<br />

where (why?) → 0and → 2 2 2. This<br />

gives (15.1). ¤


129<br />

• Slutsky’s Theorem: If <br />

<br />

→ and → <br />

(constant) then:<br />

1. ± <br />

→ ± <br />

2. · <br />

→ · <br />

3. <br />

→ if 6= 0.<br />

Note that if = (constant) then all occurrences<br />

of → can be replaced by →.<br />

• Application: We often make inferences about a<br />

population mean using the -statistic<br />

√ ³<br />

¯ − ´<br />

=<br />

<br />

<br />

where ¯ is as in the CLT and is the sample<br />

standard deviation. If the data are normally distributed<br />

then follows a “Student’s t” distribution<br />

on − 1 degrees of freedom; it is well known<br />

that this distribution is closely approximated by


130<br />

the (0 1) when is reasonably large. This latter<br />

fact holds even for non-normal parent distributions:<br />

Note<br />

P ³<br />

2 − ¯´2 P <br />

2<br />

=<br />

= <br />

− ¯ 2<br />

<br />

− 1 − 1<br />

where P 2 <br />

→ [ 2 ] by WLLN, ¯ → <br />

by WLLN ; it follows that ³ P <br />

2<br />

− ¯ 2´<br />

→<br />

<br />

[ 2 ] − 2 = 2 , (a special case of (1) of Slutsky’s<br />

Theorem) hence so does 2 (Slutsky (2));<br />

thus → (since is a continuous function of<br />

2 )andso → 1 (Slutsky (3)). Now again<br />

by Slutsky, and the CLT,<br />

=<br />

√ ( ¯−)<br />

<br />

<br />

<br />

→<br />

<br />

1 =


131<br />

Part IV<br />

MULTIDIMENSIONAL<br />

CALCULUS AND<br />

OPTIMIZATION


132<br />

16. Multidimensional differentiation; Taylor’s and<br />

Inverse Function Theorems<br />

• f : ⊂ R → R can be represented as<br />

⎛<br />

⎜<br />

⎝<br />

1 (x)<br />

2 (x)<br />

.<br />

(x)<br />

⎞<br />

⎟<br />

⎠ for (x) :R → R.<br />

• Some results from any text on multivariable calculus/analysis:<br />

— Every bounded sequence in R contains a convergent<br />

subsequence.<br />

— If f : ⊂ R → R is continuous on a closed,<br />

bounded set then f attains its inf and sup<br />

there; i.e. there are points p q ∈ with<br />

f(p) =sup x∈ f(x) andf(q) =inf x∈ f(x).<br />

— If f : ⊂ R → R is continuous on a<br />

closed, bounded set then f is uniformly continuous<br />

on .


133<br />

• Derivatives. Put e = (00 1<br />

↑<br />

<br />

00) 0 . Let<br />

: ⊂ R → R 1 .If<br />

lim<br />

→0<br />

(a + e ) − (a)<br />

<br />

exists, we say has a partial derivative with respect<br />

to at a; this limit is denoted by (a)<br />

<br />

.<br />

<br />

It is computed by treating all variables except the<br />

as constant; i.e. it is the ordinary derivative<br />

of ( 1 −1 +1 ) with respect to<br />

.


134<br />

• The ³ ´ Jacobian matrix is the × matrix J f (x) =<br />

f<br />

x with ( ) element , evaluated at<br />

x = ( 1 ). This arrangement of partial<br />

derivatives ensures that the chain rule is easily<br />

represented: if f : R → R and g : R → R <br />

then g ◦ f : R → R has<br />

J g◦f (x) × = J g (f(x)) × J f (x) × .<br />

(16.1)<br />

This is a consequence of the formula for the ‘total<br />

derivative’: if ∈ R and = ( 1 () ()) has<br />

continuous partial derivatives, then<br />

X <br />

= <br />

=1<br />

<br />

Applythistoeach ,with = (x)and = :<br />

h<br />

Jg◦f (x) i = (g ◦ f) <br />

= (f(x))<br />

<br />

X <br />

= (f(x)) (x)<br />

(x) <br />

=<br />

=1<br />

X<br />

=1<br />

[J g (f(x))] <br />

[J f (x)] <br />

= [J g (f(x)) · J f (x)]


135<br />

— If = 1 then the Jacobian matrix of<br />

: R → R is a row vector whose transpose<br />

is the gradient:<br />

∇ (x) =<br />

Ã<br />

<br />

1<br />

<br />

<br />

! 0<br />

<br />

— The Jacobian of ∇ : R → R is called the<br />

Hessian of : R → R. This × matrix<br />

H (x) has( ) element<br />

³ ´<br />

∇ <br />

<br />

= <br />

<br />

<br />

<br />

<br />

<br />

<br />

If one of<br />

<br />

,<br />

<br />

exists and is continuous,<br />

then the other exists and the two are<br />

<br />

equal; under these conditions the Hessian matrix<br />

is symmetric. We write the ( ) element<br />

as 2 = 2 .


136<br />

— If f : ⊂ R → R then the directional<br />

derivative at a in the direction v (with kvk =<br />

1) is<br />

f(a + v) − f(a)<br />

lim<br />

= <br />

→0 <br />

f (a)v<br />

provided the Jacobian exists.<br />

Proof: Put g() =f(a + v) =f ◦ k ()<br />

g()−g(0)<br />

where k () =a+v. We seek lim →0 <br />

-using(16.1)thisis<br />

g<br />

|=0 = J f (k ())J k () |=0 = J f (a)v<br />

• Taylor’s Theorem. I’llgiveaversionsuitablefor<br />

the intended applications. A major difficulty in<br />

writing down a multivariate Taylor’s Theorem is<br />

that appropriate notation, for representing derivatives<br />

higher than second order, is very cumbersome.<br />

It is rare however to require expansions<br />

beyond “Hessian + remainder”. Thus, suppose<br />

: ⊂ R → R, where is convex:<br />

x y ∈ ⇒ (1 − )x + y ∈ for 0 ≤ ≤ 1


137<br />

1. If the partial derivatives of are continuous<br />

on then<br />

(x) =(u)+∇ 0 (ξ)(x − u)<br />

for some ξ =(1− )u + x = ξ .<br />

(How? Write () = (ξ ), expand (1)<br />

around = 0.) We also have<br />

(x) = (u)+∇ 0 (u)(x − u)+(||x − u||)<br />

as ||x − u|| → 0<br />

(How? Write x = u+v for v =(x − u) ||x−<br />

u||, = ||x − u||; apply l’Hospital’s Rule.)<br />

2. If the second order partials are continuous on<br />

then with ξ as above,<br />

We also have<br />

(x) =(u)+∇ 0 (u) (x − u)<br />

+ 1 2 (x − u)0 H (ξ)(x − u) <br />

(x) =(u)+∇ 0 (u)(x − u)<br />

+ 1 2 (x − u)0 H (u)(x − u)+(||x − u|| 2 )


138<br />

— Example: x =( ), (x) = cos , u = 0<br />

Then<br />

Ã<br />

<br />

∇ (0) =<br />

! Ã !<br />

cos <br />

1<br />

− = <br />

sin <br />

0<br />

|x=0<br />

Ã<br />

<br />

H (0) =<br />

cos − !<br />

sin <br />

− sin − cos <br />

|x=0<br />

à !<br />

1 0<br />

=<br />

0 −1<br />

and<br />

(x) = (0)+∇ 0 (0)x + 1 2 x0 H (0)x + (||x|| 2 )<br />

= 1+ + 1 2<br />

h<br />

2 − 2i + ( 2 + 2 )<br />

• Notation:<br />

´<br />

If f : R → y = f(x) ∈ R we often<br />

for the × Jacobian matrix Jf (x).<br />

write ³ y<br />

x<br />

• Example: Let be a strictly increasing d.f., ∈<br />

(0 1) a probability. Then the relationship () =<br />

defines = () asafunctionof. This value<br />

() = −1 () isthe quantile, and (·) isthe


139<br />

quantile function. Differentiating () = with<br />

respect to and using the chain rule gives<br />

1 = (())<br />

= = 0 (()) 0 (); so<br />

<br />

0 () =<br />

1<br />

0 (()) = 1<br />

0 ( −1 ()) <br />

The denominator is non-zero since is strictly<br />

increasing. The multivariate analogue of this follows.<br />

• Inverse Function Theorem: If f : ⊂ R → R <br />

has a continuous Jacobian within , nonsingular<br />

at a point x 0 ∈ , then (i) f is 1 − 1 in<br />

an open neighbourhood of x 0 ,and(ii)thereis<br />

an open neighbourhood of f(x 0 ) within which<br />

f −1 : ⊂ R → R is well-defined and has a<br />

continuous, non-singular Jacobian with<br />

J f −1(f(x 0 )) = J −1<br />

f<br />

(x 0 ).


140<br />

— If J f (x 0 ) is singular then J f (x 0 )v = 0 for<br />

some v 6= 0, so that the derivative in direction<br />

v is zero and f might not be 1-1. Nonsingularity<br />

of J f (x 0 )issufficient but not always<br />

necessary — think about () = 3 and<br />

0 =0.<br />

— In the notation introduced above, the statement<br />

becomes<br />

à ! µ x y −1<br />

= <br />

y x<br />

with both sides evaluated at (x 0 y 0 = f(x 0 )).<br />

— Note that applying the chain rule to the relationship<br />

x = f −1 ◦ f(x) gives<br />

µ x<br />

I = = J<br />

x f −1(f(x))J f (x)


141<br />

17. Implicit Function Theorem; extrema; Lagrange<br />

multipliers<br />

• Example of Inverse Function Theorem. Suppose<br />

that r.v.s 1 2 0 with joint p.d.f. ( 1 2 )<br />

are transformed to 1 2 0 through the transformation<br />

³ 1 = f<br />

2´ ³ 1 =<br />

2´ ³ 2 1 + 2 2<br />

´<br />

2 1 − 2 2<br />

We will later see, when we look at multivariable<br />

integration, that the p.d.f. of ( 1 2 )is<br />

Note<br />

µ y<br />

x<br />

(x (y)) |det (xy)| <br />

!<br />

1 <br />

= J f (x) =2Ã<br />

2<br />

1 − 2<br />

is non-singular if neither 1 nor 2 equals 0. If x 0<br />

is such a point and y 0 = f (x 0 ) then in a neighbourhood<br />

of y 0 the inverse map f −1 : y → f −1 (y) =<br />

x existsandissuchthat<br />

2 1 + 2 2 = 1<br />

2 1 − 2 2 = 2


142<br />

The Jacobian of f −1 is<br />

à !<br />

x<br />

=<br />

y<br />

We need only<br />

¯<br />

¯det Ã<br />

x<br />

y!¯¯¯¯¯<br />

µ y −1<br />

<br />

x<br />

¯<br />

¯¯¯¯−1<br />

µ y<br />

= ¯det x<br />

= (8 1 2 ) −1<br />

1<br />

= q<br />

4 1 2 − 2 2<br />

Note that this can be calculated without determining<br />

J −1<br />

f<br />

,orsolvingforx in terms of y, explicitly.


143<br />

• The Inverse Function Theorem asserts a unique<br />

solution x = f −1 (y) to equations of the form<br />

g(x y) =y −f(x) =0 with x,y ∈ R , under the<br />

given conditions. Here x is explicitly defined as<br />

afunctionofy. The solutions, when written as<br />

x = φ(y), are differentiable functions of y with<br />

J φ (y) =J −1<br />

f<br />

(x). Consider now the general case<br />

g(x y) =0 ×1 for x ∈ R y ∈ R <br />

(17.1)<br />

For given y there may be a solution x = φ(y)<br />

for φ : R → R . Such a function is “implicitly<br />

defined” by (17.1). The Implicit Function Theorem<br />

gives conditions under which such a function<br />

exists, and gives some of its properties.<br />

• Implicit Function Theorem. With notation as in<br />

(17.1), suppose g is defined on ⊂ R + and<br />

has a continuous Jacobian in a neighbourhood of<br />

apoint(x 0 y 0 ) ∈ , atwhichg(x 0 y 0 )=0 ×1 .<br />

Suppose<br />

à !<br />

g (x y)<br />

J 1 (x y) =<br />

x


144<br />

(= the first columns of the × ( + ) matrix<br />

J g (x y)) is non-singular at (x 0 y 0 ). Then<br />

there is a neighbourhood of (x 0 y 0 ) within which<br />

the relationship g(x y) =0 defines a continuous<br />

mapping x = φ(y), i.e. g(φ (y) y) =0. Moreover,<br />

φ has a continuous Jacobian, with<br />

J φ (y) =−J −1<br />

1 (φ (y) y) J 2 (φ (y) y)<br />

where J 2 (x y) =<br />

µ <br />

g(xy)<br />

y<br />

.<br />

— Note that differentiating the relationship<br />

g(x y) =0<br />

(17.2)<br />

with x = φ(y) gives<br />

à !<br />

à !<br />

x<br />

y<br />

J 1 (x y) + J 2 (x y) = 0<br />

y<br />

y<br />

yielding (17.2) with (xy)writtenasJ φ (y).


145<br />

• Example. Write the characteristic equation for a<br />

matrix A × as<br />

(−1) |A − I| = ( a) =<br />

X<br />

=0<br />

=0<br />

Here the are certain continuous functions of the<br />

elements of A and =1;a =( 0 −1 ) 0 .<br />

How do the eigenvalues vary as A varies, say in a<br />

neighbourhood of some matrix A 0 ? The Jacobian<br />

of is clearly continuous, and so if 0 is an<br />

eigenvalue of A 0 with multiplicity one, so that<br />

1 ( a) =( a) 6= 0<br />

at ( 0 a 0 = a (A 0 )), then the char. eqn. defines<br />

as a continuously differentiable function of the<br />

in a neighbourhood of a 0 :<br />

0 0 ( a) ( a)<br />

= +<br />

a a<br />

⇒ (a)<br />

a = − a<br />

(a)<br />

³ <br />

1<br />

−1´<br />

= −<br />

P =1<br />

−1


146<br />

• Extrema of : ⊂ R → R. Suppose the<br />

conditions of Taylor’s Theorem hold, so that we<br />

can expand as<br />

(x 0 + h) =(x 0 )+∇ 0 (x 0)h + 1 2 h0 H ()h<br />

Let x 0 be a stationary point: ∇ (x 0 )=0. Then:<br />

1. If H () 0 for in a neighbourhood of<br />

x 0 then x 0 furnishes a local minimum of :<br />

(x 0 +h) (x 0 )forsufficiently small h 6= 0.<br />

2. If H () 0 for in a neighbourhood of<br />

x 0 then x 0 furnishes a local maximum of :<br />

(x 0 +h) (x 0 )forsufficiently small h 6= 0.<br />

3. If neither (1) nor (2) holds then (x 0 + h) −<br />

(x 0 ) changes sign as h varies; we say that x 0<br />

is a saddlepoint.<br />

• In (1), H (x 0 ) 0 andcontinuousinaneighbourhood<br />

of x 0 suffices; similarly with (2).


147<br />

• Often we seek extrema of multivariate functions,<br />

subject to certain side conditions. For instance in<br />

ANOVA we might seek least squares estimates,<br />

subject to the constraint that the average treatment<br />

effect is zero. The general problem considered<br />

here is to find the extrema of : ⊂ R →<br />

R subject to g(x) =0 ×1 for .E.g.<br />

P : Minimize x 0 Ax subject to Bx = c ×1 ,<br />

where A 0 and B × has rank <br />

Put<br />

(x; λ) =(x)+λ 0 g(x)<br />

for a vector λ ×1 of “Lagrange multipliers”.<br />

Claim: Thestationarypointsof that satisfy the<br />

constraints determine the stationary points in the<br />

original problem. These points then satisfy the<br />

+ equations in the + variables of (x λ):<br />

Equivalently,<br />

∇ (x; λ) =0 (+)×1 <br />

∇ 0 (x)+λ0 J g (x) = 0 0 1× <br />

g(x) = 0 ×1 <br />

The proof of this claim follows the example.


148<br />

• Example: Problem P above. We have<br />

(x; λ) = (x)+λ 0 g(x)<br />

= x 0 Ax + λ 0 (Bx − c)<br />

with (you should verify this)<br />

implying<br />

0 0 1× = µ <br />

x<br />

<br />

=2x 0 A + λ 0 B<br />

x = − 1 2 A−1 B 0 λ<br />

Combine this with Bx = c to get<br />

whence<br />

λ = −2 ³ BA −1 B 0´−1 c<br />

x = A −1 B 0 ³ BA −1 B 0´−1 c<br />

(You should be able to verify that BA −1 B 0 is<br />

non-singular.)<br />

• Once we have the stationary points we must check<br />

for a minimum or maximum. There are conditions


149<br />

under which the satisfaction of these equations is<br />

sufficient as well as necessary to determine the extrema.<br />

These are generally so restrictive and complicated<br />

as to be useless in practice. The virtue<br />

of the Lagrange multiplier method is that it reduces<br />

to a small number the points that must be<br />

checked - we know that the required extrema are<br />

among them.<br />

— One easy way (if it works) to check optimality<br />

is this. Suppose that (x 0 ; λ 0 ) is a stationary<br />

point of (x; λ) andthatx 0 minimizes<br />

(x; λ 0 ) unconditionally. Then (x 0 ; λ 0 ) <br />

(x; λ 0 ) for all x 6= x 0 , hence in the class of<br />

those x that satisfy g(x) =0 we have<br />

(x 0 )= (x 0 ; λ 0 ) (x; λ 0 )=(x)<br />

— In Problem P there was only one stationary<br />

point x 0 . This furnishes the minimum since<br />

(x; λ 0 )=x 0 Ax + λ 0 0 (Bx − c) <br />

where λ 0 = −2 ³ BA −1 B 0´−1 c,hasHessian<br />

2A 0, hence is minimized unconditionally<br />

at x 0 .


150<br />

• Proofofclaim: Letx beasolutiontotheoriginal<br />

problem, so that in particular g(x) =0 ×1 . We<br />

will ‘solve’ these equations, thus expressing <br />

of the s in terms of the others. The Implicit<br />

FunctionTheoremallowsustodothis.<br />

We must show that ∇ 0 (x) +λ0 J g (x) =0 0 1× .<br />

Partition x, the gradient of , andtheJacobian<br />

of g as:<br />

x =<br />

∇ (x) =<br />

J g (x) =<br />

à !<br />

x1<br />

x 2<br />

⎛ ³ <br />

⎜ x<br />

⎝ ³ 1´0<br />

<br />

x 2´0<br />

× 1<br />

( − ) × 1 ;<br />

⎞<br />

Ã<br />

g<br />

. g<br />

!<br />

x 1 x 2<br />

⎟<br />

⎠ =<br />

Ã<br />

τ (x)<br />

ψ(x)<br />

!<br />

;<br />

= ³ Γ × (x).∆ ×(−) (x)´<br />

<br />

Under the conditions of the Implicit Function Theorem<br />

(so that, in particular, Γ(x) is non-singular)<br />

we can solve the equations g(x 1 x 2 )=0 ×1 for<br />

x 1 in terms of x 2 , obtaining x 1 = h(x 2 ). Thus<br />

(x) =(h(x 2 ) x 2 )andg(h(x 2 ) x 2 )=0 ×1 .


151<br />

Since x 2 is a stationary point we have<br />

à !<br />

0 0 <br />

1×(−) = = τ 0 (x)J<br />

x h (x 2 )+ ψ 0 (x)<br />

2<br />

(17.3)<br />

But, as in the Implicit Function Theorem,<br />

g(h(x 2 ) x 2 )=0 ×1 gives<br />

so<br />

Γ(x)J h (x 2 )+∆(x) =0 ×(−) <br />

In (17.3) this gives<br />

J h (x 2 )=−Γ −1 (x)∆(x)<br />

0 0 = −τ 0 (x)Γ −1 (x)∆(x)+ ψ 0 (x)<br />

= −λ 0 ∆(x)+ψ 0 (x) (17.4)<br />

where λ 0 = −τ 0 (x)Γ −1 (x) :1× . Thus<br />

∇ 0 (x)+λ0 J g (x)<br />

= ³ τ 0 (x) ψ 0 (x)´ − τ 0 (x)Γ −1 (x)(Γ(x).∆(x))<br />

= ³ 0 0 ψ 0 (x) − τ 0 (x)Γ −1 (x)∆(x)´<br />

= 0 0 1× <br />

by (17.4), as required. ¤


152<br />

18. Integration; Leibnitz’s Rule; Normal sampling<br />

distributions<br />

• Integration over an −dimensional rectangle. Let<br />

: ⊂ R → R be bounded on the bounded<br />

set . The development of the Riemann integral<br />

proceeds along the same lines as for =1. Thus,<br />

define the rectangle<br />

[a b] =[ 1 1 ] ×···×[ ]<br />

large enough that ⊂ [a b]. First suppose<br />

that is defined and bounded on [a b]. Let<br />

be a partition of [a b], itself consisting of -<br />

dimensional rectangles 1 .Define the lower<br />

and upper sums<br />

() =<br />

() =<br />

X<br />

=1<br />

X<br />

=1<br />

( )<br />

( )<br />

where and are the inf and sup of on <br />

and ( )isthevolumeof :<br />

([c d]) =<br />

Y<br />

=1<br />

( − )


153<br />

If<br />

sup<br />

<br />

or equivalently if<br />

() =inf<br />

()<br />

lim ( () − ()) = 0<br />

∆ →0<br />

then the common value is the Riemann integral<br />

R<br />

[ab] (x)x<br />

• Now recall that ⊂ [a b]. Define<br />

(<br />

1 x ∈ <br />

1 (x) =<br />

0 x ∈ <br />

If R [ab] 1 (x)x exists we say is Jordan measurable.<br />

The value of the integral is called the<br />

Jordan content of . When is Jordan measurable<br />

we define<br />

Z<br />

Z (x)x = (x)1 (x)x<br />

[ab]


154<br />

• A major tool for evaluating multidimensional integrals<br />

is Fubini’s Theorem: If is absolutely<br />

integrable on [a b] then<br />

Z<br />

(x)x =<br />

[ab]<br />

Z Ã Ã<br />

1<br />

Z Ã<br />

−1<br />

Z ! !<br />

<br />

···<br />

(x) −1 ···<br />

1 −1 <br />

and the integrations on the RHS may be carried<br />

out in any order.<br />

!<br />

1<br />

• Change of variables. Let : ⊂ R → R, where<br />

is closed and bounded and is continuous. Let<br />

h : x ∈ → y ∈ R be a 1 − 1 function with<br />

continuous Jacobian matrix J h (x), non-singular<br />

on . Then there is an inverse function h −1 :<br />

y ∈ h () → with<br />

Z (x)x = Z<br />

h() (h−1 (y)) ¯J h −1(y) y ¯+<br />

(18.1)<br />

where |·| + denotes the absolute value of the determinant.<br />

Note also<br />

¯<br />

¯J h −1(y)<br />

¯<br />

=<br />

¯+ ¯<br />

x<br />

y<br />

¯<br />

¯+<br />

=<br />

¯<br />

y<br />

x<br />

−1<br />

¯<br />

+<br />

¯<br />

¯<br />

,wherey = h(x)


155<br />

Here and elsewhere the assumption that be<br />

bounded can be dropped by defining the resulting<br />

improper integral as a limit of proper integrals.<br />

• In particular, suppose is the p.d.f. of a r.vec.<br />

X. PutY = h(X). If h is as above we have<br />

Z<br />

(x)x = (X ∈ )<br />

<br />

= (Y ∈ h ())<br />

= (y)y<br />

h()<br />

where (y) is the p.d.f. of Y. But also (18.1)<br />

holds; thus<br />

(y) = (h −1 (y)) ¯J h −1(y)<br />

= (x)<br />

¯<br />

x<br />

y<br />

Z<br />

¯<br />

¯+<br />

¯<br />

¯<br />

¯+<br />

,withx = x(y)<br />

• Differentiation under the integral sign. Define<br />

() =<br />

Z ()<br />

() ()


156<br />

where ()() are continuously differentiable<br />

for ≤ ≤ and () is continuous, with a<br />

continuous partial derivative w.r.t. , onaregion<br />

containing { ≤ ≤ and () ≤ ≤ ()}.<br />

Then (Leibnitz’s Rule): () iscontinuouslydifferentiable<br />

with<br />

Z ()<br />

0 <br />

() =<br />

() ()<br />

+ (()) 0 () − (()) 0 ()(18.2)<br />

• Note this is the result of writing<br />

() =( ()())<br />

where ( ) = R <br />

(). Then<br />

<br />

<br />

( )<br />

( ()()) =<br />

( )<br />

<br />

<br />

=()<br />

=()<br />

0 ()+<br />

=()<br />

=()<br />

( )<br />

<br />

+<br />

=()<br />

=()<br />

If differentiation under the integral sign is permissible<br />

— and Leibnitz’s rule says that it is — then<br />

we have (18.2).<br />

0 ()


157<br />

• Example: Let be independent, non-negative<br />

r.v.s with continuous densities respectively<br />

(()() =0for0). Then = + <br />

has d.f.<br />

() = ( ≤ )<br />

=<br />

=<br />

=<br />

=<br />

Z<br />

[0]×[0]<br />

Z <br />

( + ≤ ) ()()( )<br />

Z <br />

() ( ≤ − ) ()<br />

0 0<br />

by Fubini’s Theorem<br />

Z Z −<br />

()<br />

Z0 0 <br />

0<br />

()<br />

()( − )<br />

with, by Leibnitz’s Rule, density<br />

() =<br />

Z <br />

0<br />

()( − )<br />

since (0) = 0. This integral is called the convolution<br />

of with . (Onlyoneof needs to be<br />

continuous - why?)


158<br />

• Application: The density of a 2 1 r.v., i.e. =<br />

2 ,where ∼ (0 1), can be obtained as<br />

1 () = <br />

<br />

( ≤ ) =<br />

³ − √ ≤ ≤ √ ´<br />

= Z √ ³ 2´1<br />

2 <br />

() = ³ 2<br />

−1<br />

√ <br />

− 2<br />

´<br />

√ =<br />

0<br />

2Γ ³ <br />

1<br />

2´<br />

Then the 2 2<br />

2 () =<br />

Z <br />

= − 2<br />

density is<br />

0 1() 1 ( − )<br />

Z <br />

0<br />

=<br />

= − 2<br />

³ 2´1<br />

2<br />

−1 ³ −<br />

2<br />

Z 1<br />

0<br />

= 1 2 − 2 · <br />

4<br />

´1<br />

2<br />

−1<br />

<br />

³ 2<br />

´1<br />

2<br />

−1 µ (1−)<br />

2<br />

4<br />

1<br />

2<br />

−1<br />

<br />

where = R 1 1 2 −1 (1−) 1 2 −1<br />

0 <br />

must = 1 in


159<br />

order that 2 () integrate to 1. Now<br />

() =<br />

³ 2´<br />

2<br />

−1<br />

<br />

− 2<br />

2Γ ³ <br />

2´<br />

can be proved by induction, or conjectured and<br />

then established by using the uniqueness of m.g.f.s<br />

(as in Lecture 15).<br />

• Example: Joint distribution of the sample mean<br />

andvarianceinNormalsamples.<br />

Suppose that 1 are i.i.d. ( 2 ) r.v.s,<br />

so that X =( 1 ) 0 has p.d.f.<br />

(<br />

Y ³2<br />

(x) =<br />

2´−12<br />

exp<br />

(− ( − ) 2 ))<br />

2 2<br />

Note<br />

X<br />

=1<br />

=1<br />

= ³ 2 2´−2 exp<br />

⎧<br />

⎨<br />

⎩ − X<br />

=1<br />

( − ) 2 =<br />

X<br />

=1<br />

( − ) 2<br />

2 2<br />

⎫<br />

⎬<br />

⎭ <br />

[( − ¯)+(¯ − )] 2<br />

= ( − 1) 2 + (¯ − ) 2


160<br />

so that<br />

(x) = ³ 2 2´−2 − (−1)2<br />

2 2<br />

−(¯−)2<br />

2 2 <br />

We derive the joint p.d.f. of ( 2 ¯) . Firstnote<br />

that 1 ×1 √ has norm 1. Adjoin − 1unit<br />

vectors e to get a basis for R , and then apply<br />

Gram-Schmidt to get an orthonormal basis whose<br />

first member is 1 √ . This yields an orthogonal<br />

matrix<br />

H × = ³ 1 0 √ ´<br />

H 1<br />

Put Y = HX. Then<br />

1 = 1 0 X √ = √ ¯<br />

and kXk 2 = kYk 2 ,sothat<br />

X<br />

=2<br />

2 = kYk 2 − 1<br />

2<br />

= kXk 2 − ³ √ ¯´2<br />

=<br />

X<br />

=1<br />

2 − ¯ 2<br />

= ( − 1) 2


161<br />

Note that x<br />

¯<br />

¯ =<br />

¯¯¯¯¯<br />

¯H 0¯¯¯+ = |±1| =1<br />

y¯+<br />

so that the p.d.f. (y) ofY is<br />

Thus<br />

= (x)<br />

¯<br />

x<br />

y<br />

¯<br />

¯+<br />

= ³ 2 2´−2 <br />

−<br />

=<br />

⎧<br />

⎨<br />

= (x(y))<br />

P =2<br />

2 <br />

2 2<br />

³<br />

2<br />

2´−12<br />

<br />

− ( 1−<br />

⎩<br />

⎧<br />

Y ⎨<br />

⎩<br />

=2<br />

³<br />

2<br />

2´−12<br />

<br />

− 2 <br />

√<br />

−( 1− )<br />

2<br />

2 2<br />

√ )<br />

2<br />

2 2 ⎫<br />

⎬<br />

⎭ ·<br />

2 2 ⎫<br />

⎬<br />

⎭ <br />

1. 1 are independently distributed;<br />

2. 1 ∼ ( √ 2 ), so that ¯ ∼ ( 2 );<br />

3. (−1)2<br />

2<br />

= P ³<br />

<br />

=2 <br />

´2<br />

∼ <br />

2<br />

−1 , since <br />

(0 1); furthermore ¯ and 2 are independently<br />

distributed.<br />

<br />


19. Numerical optimization: Steepest descent, Newton-Raphson, Gauss-Newton

• Numerical minimization. Suppose that a function $f:\mathbb R^k\to\mathbb R$ is to be minimized.

• Method of steepest descent. First choose an initial value $\mathbf x_0$. Let $\mathbf h$ be a vector of unit length and expand around $\mathbf x_0$ in the direction $t\mathbf h$ ($t>0$):

$$f(\mathbf x_0+t\mathbf h)-f(\mathbf x_0)\approx t\,\mathbf h'\nabla f(\mathbf x_0).$$

We choose $\mathbf h$ such that $\mathbf h'\nabla f(\mathbf x_0)$ is negative but maximized in absolute value. Specifically, note that by Cauchy-Schwarz, for $\|\mathbf h\|=1$,

$$\left|\mathbf h'\nabla f(\mathbf x_0)\right|\leq\left\|\nabla f(\mathbf x_0)\right\|,$$

with equality iff $\mathbf h=\pm\nabla f(\mathbf x_0)/\left\|\nabla f(\mathbf x_0)\right\|$. Then $\mathbf h'\nabla f(\mathbf x_0)=\pm\left\|\nabla f(\mathbf x_0)\right\|$, so we use the "$-$" sign:

$$\mathbf x_1(t)=\mathbf x_0+t\mathbf h=\mathbf x_0-t\,\nabla f(\mathbf x_0)/\left\|\nabla f(\mathbf x_0)\right\|,$$

with $t=t_0$ chosen to minimize (by trial and error) $f(\mathbf x_1(t))$. Repeat, with $\mathbf x_1(t_0)$ replacing $\mathbf x_0$. Iterate to convergence.

— Example: $f(\mathbf x)=\|\mathbf x\|^2$. Then $\nabla f(\mathbf x)=2\mathbf x$ and so

$$\mathbf x_1=\mathbf x_0-t\cdot2\mathbf x_0/\|2\mathbf x_0\|=\mathbf x_0\left(1-t/\|\mathbf x_0\|\right).$$

We vary $t$ until $f(\mathbf x_1(t))=\|\mathbf x_1(t)\|^2$ is a minimum, i.e. to $t=\|\mathbf x_0\|$. Then $\mathbf x_1=\mathbf 0$, and the minimum is achieved in 1 step, from any starting value.

• The method of steepest descent uses a linear approximation of $f$; for this and other reasons the convergence can be very slow. The Newton-Raphson method uses a quadratic approximation of $f$; equivalently, it takes a linear approximation of $\nabla f(\mathbf x)$ in order to solve $\nabla f(\mathbf x)=\mathbf 0$. In its general form, the Newton-Raphson method attempts to solve a system of equations of the form $\mathbf g(\mathbf x)=\mathbf 0$, where $\mathbf x$ and $\mathbf g(\mathbf x)$ are $k\times1$.

• Expand $\mathbf g(\mathbf x)$ around an initial value $\mathbf x_0$:

$$\mathbf g(\mathbf x)\approx\mathbf g(\mathbf x_0)+\mathbf J(\mathbf x_0)(\mathbf x-\mathbf x_0).$$

Equate the RHS to zero, to get the next iterate:

$$\mathbf x_1=\mathbf x_0-\mathbf J^{-1}(\mathbf x_0)\,\mathbf g(\mathbf x_0).$$

In general,

$$\mathbf x_{m+1}=\mathbf x_m-\mathbf J^{-1}(\mathbf x_m)\,\mathbf g(\mathbf x_m),\qquad m=0,1,2,\dots.$$

At convergence, with $\mathbf x_\infty=\lim_{m\to\infty}\mathbf x_m$ and assuming that $\mathbf J(\mathbf x_\infty)$ is non-singular,

$$\mathbf x_\infty=\mathbf x_\infty-\mathbf J^{-1}(\mathbf x_\infty)\,\mathbf g(\mathbf x_\infty),$$

so that $\mathbf g(\mathbf x_\infty)=\mathbf 0$.

— If this is a minimization problem, so that $\mathbf g(\mathbf x)=\nabla f(\mathbf x)$, then $\mathbf J(\mathbf x)=\mathbf H_f(\mathbf x)$, the Hessian of $f$, and the scheme is

$$\mathbf x_{m+1}=\mathbf x_m-\mathbf H_f^{-1}(\mathbf x_m)\,\nabla f(\mathbf x_m).$$

Note

$$f(\mathbf x_{m+1})\approx f(\mathbf x_m)+\nabla'f(\mathbf x_m)\left(\mathbf x_{m+1}-\mathbf x_m\right)=f(\mathbf x_m)-\nabla'f(\mathbf x_m)\,\mathbf H_f^{-1}(\mathbf x_m)\,\nabla f(\mathbf x_m)<f(\mathbf x_m)\quad\text{if }\mathbf H_f(\mathbf x_m)>0.$$

— Example: solve $g(x)=\log x-1=0$:

$$x_{m+1}=x_m-\frac{g(x_m)}{g'(x_m)}=x_m-\frac{\log x_m-1}{1/x_m}=x_m\left(2-\log x_m\right),\qquad m=0,1,2,\dots.$$

This gives

$$x_0=1,\quad x_1=2,\quad x_2=2.6137,\quad x_3=2.7162,\quad x_4=2.7182811;\qquad e=2.7182818\dots.$$

• The starting point can make a big difference, even if the function being minimized is convex. Example: Consider using Newton's method to find the zero of $g(x)=\operatorname{sgn}(x)\cdot x^2/\left(1+x^2\right)$. This is increasing, so it is the derivative of a convex function $f$, minimized at the zero of $g$. Start at $x_0$. We have $g'(x)=2|x|/\left(1+x^2\right)^2$, and so the iterates satisfy

$$x_{m+1}=\left(\frac{1-x_m^2}{2}\right)x_m.$$

Put

$$r_m=\left|\frac{x_{m+1}}{x_m}\right|=\frac{\left|1-x_m^2\right|}{2}.$$

Suppose that $|x_0|>\sqrt3$. By induction,

$$|x_m|>\sqrt3\quad\text{and}\quad r_m>r_{m-1}>\cdots>r_0>1,\tag{19.1}$$

so that $\left|x_{m+1}\right|=|x_0|\prod_{j=0}^mr_j\uparrow\infty$. If $|x_0|=\sqrt3$ then $x_m=(-1)^mx_0$. Similarly, if $|x_0|<\sqrt3$ then $|x_m|<\sqrt3$ and $r_m<r_{m-1}<\cdots<r_0<1$, and so $|x_m|\downarrow0$ (= the desired root).

Details of (19.1): If true for $m$, then $\left|x_{m+1}\right|=r_m|x_m|>\sqrt3$, and then

$$r_{m+1}=\frac{\left|1-x_{m+1}^2\right|}{2}=\frac{x_{m+1}^2-1}{2}=\frac{r_m^2x_m^2-1}{2}>\frac{3r_m^2-1}{2}\geq\frac{3r_m-1}{2}>r_m.$$
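The three regimes are easy to see numerically; a small sketch (the starting values are arbitrary, chosen on either side of $\sqrt3\approx1.732$):

```python
import numpy as np

def newton_iterates(x0, m):
    """Iterate x_{m+1} = x_m (1 - x_m^2)/2, which is Newton's method
    for the g(x) above; return the trajectory."""
    xs = [x0]
    for _ in range(m):
        xs.append(xs[-1] * (1 - xs[-1] ** 2) / 2)
    return xs

print(newton_iterates(1.7, 10))          # |x_0| < sqrt(3): converges to 0
print(newton_iterates(np.sqrt(3), 4))    # |x_0| = sqrt(3): oscillates +/- sqrt(3)
print(newton_iterates(1.8, 6))           # |x_0| > sqrt(3): |x_m| diverges
```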

Gauss-Newton algorithm. Uses least squares minimization along with a linear approximation of the function of interest. A common application is non-linear regression, so I'll illustrate the technique there. Suppose we observe

$$y_i=\eta(\mathbf x_i,\boldsymbol\theta)+\varepsilon_i,\qquad i=1,\dots,n.$$

• An example is a Michaelis-Menten response $\eta(x,\boldsymbol\theta)=\theta_1x/(\theta_2+x)$ ($x>0$), used to describe various chemical and pharmacological reactions. Note the horizontal asymptote at $\theta_1$.

If $\eta(\mathbf x,\boldsymbol\theta)=\mathbf z'(\mathbf x)\boldsymbol\theta$ for some regressors $\mathbf z(\mathbf x)$, then this is a linear regression problem; otherwise it is non-linear. Define

$$\boldsymbol\eta(\boldsymbol\theta)=\left(\eta(\mathbf x_1,\boldsymbol\theta),\cdots,\eta(\mathbf x_n,\boldsymbol\theta)\right)',$$

so that the data can be represented as $\mathbf y=\boldsymbol\eta(\boldsymbol\theta)+\boldsymbol\varepsilon$. The LSEs are the minimizers of

$$S(\boldsymbol\theta)=\left\|\mathbf y-\boldsymbol\eta(\boldsymbol\theta)\right\|^2.$$

Take an initial value $\boldsymbol\theta_0$ and expand around $\boldsymbol\theta_0$ to get

$$y_i-\eta(\mathbf x_i,\boldsymbol\theta)\approx y_i-\eta(\mathbf x_i,\boldsymbol\theta_0)-\nabla'\eta(\mathbf x_i,\boldsymbol\theta_0)\left(\boldsymbol\theta-\boldsymbol\theta_0\right),$$

i.e.<br />

y − η(θ) ≈ y − η(θ 0 ) − J (θ 0 )(θ − θ 0 )<br />

Define y (1) = y − η(θ 0 ), so that<br />

ky − η(θ)k 2 ≈<br />

°<br />

°y (1) − J (θ 0 )(θ − θ 0 )<br />

is to be minimized. By analogy with the linear regression<br />

model<br />

y (1) = J (θ 0 )β + <br />

the minimizer is<br />

° 2<br />

θ − θ 0 = β = h J 0 (θ 0 )J (θ 0 ) i −1<br />

J<br />

0 (θ 0 )y (1) <br />

Thus the next value is θ 1 = θ 0 + β, i.e.<br />

θ 1 = θ 0 + h J 0 (θ 0 )J (θ 0 ) i −1<br />

J<br />

0 (θ 0 )(y − η(θ 0 )) <br />

In general, θ +1 = θ + ˆβ ,where<br />

ˆβ = h J 0 (θ )J (θ ) i −1<br />

J<br />

0 (θ )(y − η(θ )) <br />

Thus we are repeatedly doing least squares regressions,<br />

in the ( +1) of which the residuals from the<br />

areregressedonthecolumnsoftheJacobianmatrix,<br />

evaluated at θ . A stopping rule can be based<br />

on the F-test of 0 : β = 0, the p-values for which<br />

will be included in the regression output.


169<br />

Assuming convergence, the limit ˆθ satisfies<br />

so that<br />

J 0 (ˆθ) ³ y − η(ˆθ)´ = 0 (19.2)<br />

∇ 0 (ˆθ) =−2 ³ y − η(ˆθ)´0<br />

J (ˆθ) =0 0<br />

and ˆθ is a stationary point of (θ).<br />

Typically ¡ θ +1<br />

¢<br />

(θ ),ifnotitisusualtotake<br />

θ +1 = θ + h J 0 (θ )J (θ ) i −1<br />

J<br />

0 (θ )(y − η(θ ))<br />

for =1 12 14until a decrease in is attained.<br />

A normal approximation is generally valid:<br />

ˆθ ≈ µ<br />

θ <br />

2 h<br />

J<br />

0 (ˆθ)J (ˆθ) i −1 ;<br />

the basic idea is (19.2) applied to<br />

ε = y − η(θ) ≈ y − h η(ˆθ)+J (ˆθ)(θ − ˆθ) i <br />

More precisely,<br />

⎛<br />

√ ⎜<br />

<br />

³ˆθ − θ´ → ⎝0 <br />

2<br />

⎡<br />

⎣ lim<br />

→∞<br />

J 0 (ˆθ)J (ˆθ)<br />

<br />

⎤<br />

⎦−1 ⎞ ⎟ ⎠
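The following is a minimal Gauss-Newton sketch for the Michaelis-Menten example, with the step-halving safeguard; it assumes Python with NumPy, and the simulated data and starting value are arbitrary illustrations:

```python
import numpy as np

def gauss_newton(x, y, theta, n_iter=50, tol=1e-10):
    """Gauss-Newton for eta(x, theta) = theta1*x/(theta2 + x),
    with step-halving (lambda = 1, 1/2, 1/4, ...) if S fails to decrease."""
    eta = lambda th: th[0] * x / (th[1] + x)
    S = lambda th: np.sum((y - eta(th)) ** 2)
    for _ in range(n_iter):
        # Jacobian of eta: columns d/d theta1 and d/d theta2
        J = np.column_stack([x / (theta[1] + x),
                             -theta[0] * x / (theta[1] + x) ** 2])
        beta = np.linalg.solve(J.T @ J, J.T @ (y - eta(theta)))
        lam = 1.0
        while S(theta + lam * beta) >= S(theta) and lam > 1e-8:
            lam /= 2
        theta = theta + lam * beta
        if np.linalg.norm(lam * beta) < tol:
            break
    return theta

rng = np.random.default_rng(1)
x = np.linspace(0.1, 5, 40)
y = 2 * x / (0.5 + x) + rng.normal(0, 0.05, x.size)   # true theta = (2, 0.5)
print(gauss_newton(x, y, np.array([1.0, 1.0])))       # ~ (2, 0.5)
```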


170<br />

20. Maximum likelihood<br />

Maximum Likelihood Estimation. Thisisthemost<br />

common and versatile method of estimation in statistics.<br />

It almost always gives reasonable estimates, even<br />

in situations that are so intractable as to be highly resistant<br />

to other estimation methods.<br />

• Data x, p.d.f. (x; θ); e.g. i.i.d. ( 2 )observations<br />

gives<br />

(x; θ) =<br />

θ = ( 2 ) 0<br />

Y<br />

=1<br />

µ<br />

1 <br />

− <br />

<br />

The p.d.f. evaluated at the data is the likelihood<br />

function (θ; x); its logarithm<br />

is the log-likelihood.<br />

(θ) =log(θ; x)


171<br />

• For i.i.d. observations with common p.d.f. (; θ)<br />

we have<br />

(x; θ) =<br />

(θ) =<br />

Y<br />

=1<br />

X<br />

=1<br />

( ; θ) so<br />

log ( ; θ) <br />

Viewed as a r.v., (θ) = P <br />

=1 log ( ; θ) isitself<br />

a sum of i.i.d.s.<br />

• The MLE ˆθ is the maximizer of the likelihood;<br />

intuitively it makes the observed data “most likely<br />

to have occurred”.<br />

• A more quantitative justification for the MLE is<br />

as follows. Let θ 0 bethetruevalue,andassume<br />

the arei.i.d.Wewillshowthat<br />

θ0 ((θ 0 ; X) (θ; X)) → 1(20.1)<br />

as →∞ for any θ 6= θ 0 <br />

By this, for large samples and with high probability,<br />

the (random) likelihood is maximized by the


172<br />

true parameter value, hence the maximizer of the<br />

(observed) likelihood should be a good estimate<br />

of this true value.<br />

Proof of (20.1):<br />

The inequality<br />

(θ 0 ; X) =<br />

Y<br />

=1<br />

is equivalent to<br />

( ; θ 0 ) <br />

Y<br />

=1<br />

( ; θ) =(θ; X)<br />

− 1 <br />

X<br />

=1<br />

log ( ; θ)<br />

( ; θ 0 ) 0<br />

By the WLLN this average tends in probability to<br />

(; θ)<br />

− log<br />

(; θ 0 )<br />

" #<br />

(; θ)<br />

− log θ0 (why?)<br />

(; θ 0 )<br />

Z (; θ)<br />

= − log<br />

(; θ 0 ) (; θ 0) <br />

θ0<br />

"<br />

= − log<br />

= 0<br />

Z<br />

(; θ) <br />

And so ... ¤<br />

#


173<br />

• The MLE is generally obtained as a root of the<br />

likelihood equation<br />

˙(θ) =0<br />

where ˙(θ) =∇ (θ) denotes the gradient. There<br />

may be multiple roots in finite samples. Under<br />

reasonable conditions (studied in STAT 665) we<br />

have that any sequence ˆθ of roots is asymptotically<br />

normal:<br />

√ <br />

³ˆθ − θ´ → (0 I −1 (θ))<br />

where<br />

I(θ) =<br />

lim<br />

→∞<br />

1<br />

h˙(θ)˙ 0 (θ) i (20.2)<br />

is “Fisher’s Information matrix”.<br />

interpretation is that<br />

ˆθ <br />

<br />

≈ <br />

µθ 0 1 I−1 (θ)<br />

<br />

The practical<br />

i.e. with representing the ( ) element of<br />

I −1 (θ) (orofI −1 (ˆθ )) we have the approximations<br />

ˆ <br />

<br />

≈ ( <br />

) cov[ˆ ˆ ] ≈


174<br />

• The MLE has attractive large-sample optimality<br />

properties, to be established later. That derivation,<br />

and the example we look at next, use the<br />

following ‘regularity condition’: we suppose that<br />

we can differentiate the equation 1 = R (θ; x)x<br />

under the integral sign twice. (In particular, the<br />

limits of integration should not depend on θ.)<br />

Then, writing ˙(θ; x) and˙(θ; x) forthegradients<br />

we have<br />

Z Z ˙(θ; x)<br />

0 ×1 = ˙(θ; x)x = (θ; x)x<br />

(θ; x)<br />

=<br />

Z<br />

˙(θ; x) (x; θ) x<br />

= θ<br />

h˙(θ; X) i (20.3)<br />

Thus θ<br />

h˙(θ) i = 0. With ¨(θ; x) denotingthe<br />

Hessian matrix we have<br />

0 × = Z<br />

˙(θ; x) (x; θ) x<br />

Z<br />

θ<br />

Z<br />

= ¨(θ; x) (x; θ) x + ˙(θ; x) (x; θ) x<br />

Z θ<br />

= ¨(θ; x) (x; θ) x<br />

+<br />

Z<br />

˙(θ; x)˙ 0 (θ; x) (x; θ) x


175<br />

so that<br />

³<br />

covθ<br />

h˙(θ) i =´<br />

θ<br />

h˙(θ)˙ 0 (θ) i = θ<br />

h<br />

−¨(θ) i <br />

(20.4)<br />

• If the observations are i.i.d. then

$$\dot l(\boldsymbol\theta)=\sum_{i=1}^n\nabla\log f(X_i;\boldsymbol\theta)$$

is a sum of i.i.d.s, each of which (by taking $n=1$ in (20.3) and (20.4)) has a mean of $\mathbf 0$ and a covariance of

$$E_{\boldsymbol\theta}\left[\nabla\log f(X;\boldsymbol\theta)\,\nabla'\log f(X;\boldsymbol\theta)\right]=E_{\boldsymbol\theta}\left[-\frac{\partial^2\log f(X;\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta'}\right].$$

Now (20.2) states that

$$\mathbf I(\boldsymbol\theta)=\lim_{n\to\infty}\frac1n\operatorname{cov}\left[\dot l(\boldsymbol\theta)\right]=\operatorname{cov}\left[\nabla\log f(X;\boldsymbol\theta)\right],$$

since this is the same for all $n$. Then by the CLT (below),

$$\frac1{\sqrt n}\,\dot l(\boldsymbol\theta)=\frac1{\sqrt n}\sum_{i=1}^n\nabla\log f(X_i;\boldsymbol\theta)\to N\left(\mathbf 0,\mathbf I(\boldsymbol\theta)\right).\tag{20.5}$$


176<br />

• We have used the multivariate CLT: if Z 1 Z <br />

are i.i.d. r.vecs. with mean vector μ and covariance<br />

matrix Σ, then<br />

1<br />

√ <br />

X<br />

=1<br />

(Z − μ) → (0 Σ)<br />

(In STAT 665 we give a very elementary proof of<br />

this, which uses only the univariate CLT.) This<br />

was applied in (20.5) with Z = ∇ log ( ;·) (θ),<br />

μ = 0 and Σ = I(θ).<br />

• Now here is an outline of the proof of asymptotic<br />

normality of the MLE. Expand the likelihood<br />

equation ˙ ³ˆθ´<br />

= 0 aroundthetruevalue,with<br />

remainder :<br />

0 = ˙ ³ˆθ´<br />

= ˙ ³ˆθ (θ)+¨(θ) − θ´ + <br />

Rearrange this as<br />

√ <br />

³ˆθ − θ´<br />

=<br />

∙<br />

−<br />

¨(θ)<br />

1 ( ¸−1 1 √ ˙ (θ)+ )<br />

<br />

√ <br />

(20.6)


177<br />

We have (by the WLLN) that<br />

X<br />

Ã<br />

−<br />

¨(θ) 1 = 1 − 2 log ( ; θ)<br />

<br />

=1<br />

θθ<br />

"Ã<br />

!#<br />

<br />

→ θ − 2 log (; θ)<br />

= I(θ)<br />

θθ<br />

so that using (20.5) and Slutsky’s Theorem,<br />

∙<br />

−<br />

¨(θ)<br />

1 ¸−1 1 √ ˙ (θ) → (0 I −1 (θ))<br />

If √ → 0 (it does, but this is where some<br />

work is required) then, again by Slutsky applied<br />

to (20.6),<br />

√ <br />

³ˆθ − θ´ → (0 I −1 (θ))<br />

!


21. Asymptotics of ML estimation; Information Inequality

• Example. Suppose $\{X_1,\dots,X_n\}$ is a sample from the gamma$(\alpha,\beta)$ density, with

$$f(x;\boldsymbol\theta)=\frac{\left(x/\beta\right)^{\alpha-1}e^{-x/\beta}}{\beta\,\Gamma(\alpha)},\qquad0<x<\infty.$$

Note that if $\boldsymbol\theta=(\alpha,\beta)=(n/2,2)$ then this is the $\chi^2_n$ density. If $\beta=\lambda^{-1}$, $\alpha=n$, it is the "Erlang" density: the density of the sum of $n$ i.i.d. $E(\lambda)$ r.v.s. The distribution of the r.v. $X$, where $X^2\sim$ gamma$(m,\Omega/m)$, is known as the "Nakagami" distribution, and $m$ is the "fading parameter"; this is of interest in the theory of wireless transmissions.

The log-likelihood is

$$l(\boldsymbol\theta)=\sum_{i=1}^n\log f(X_i;\boldsymbol\theta)=\sum_{i=1}^n\left[(\alpha-1)\left(\log X_i-\log\beta\right)-\frac{X_i}\beta-\log\beta-\log\Gamma(\alpha)\right]$$

$$=n\left[(\alpha-1)\left(\overline{\log X}-\log\beta\right)-\frac{\bar X}\beta-\log\beta-\log\Gamma(\alpha)\right],$$

with gradient (with respect to $\boldsymbol\theta=(\beta,\alpha)'$)

$$\dot l(\boldsymbol\theta)=n\begin{pmatrix}-\dfrac\alpha\beta+\dfrac{\bar X}{\beta^2}\\[1ex]\overline{\log X}-\log\beta-\psi(\alpha)\end{pmatrix},$$

where $\psi(\alpha)=(d/d\alpha)\log\Gamma(\alpha)$ $\left(=E\left[\log(X/\beta)\right]\right)$ is the "digamma" function.

• The Newton-Raphson method for solving the likelihood equations is

$$\boldsymbol\theta_{m+1}=\boldsymbol\theta_m-\ddot l^{-1}(\boldsymbol\theta_m)\,\dot l(\boldsymbol\theta_m),$$

where

$$\ddot l(\boldsymbol\theta)=n\begin{pmatrix}\dfrac\alpha{\beta^2}-\dfrac{2\bar X}{\beta^3}&-\dfrac1\beta\\[1ex]-\dfrac1\beta&-\psi'(\alpha)\end{pmatrix}.$$

• A commonly used alternative to Newton-Raphson is Fisher's Method of Scoring. This involves replacing

$$-\ddot l(\boldsymbol\theta)=-\sum_{i=1}^n\frac{\partial^2\log f(X_i;\boldsymbol\theta)}{\partial\boldsymbol\theta\,\partial\boldsymbol\theta'}$$

by its expectation $n\mathbf I(\boldsymbol\theta)$ in N-R, to get the scheme

$$\boldsymbol\theta_{m+1}=\boldsymbol\theta_m+\frac1n\,\mathbf I^{-1}(\boldsymbol\theta_m)\,\dot l(\boldsymbol\theta_m).$$

This is often more stable than N-R.
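Here is a sketch of the scoring iteration for the gamma example, using SciPy's digamma and trigamma functions ($\psi$ and $\psi'$) and the method-of-moments starting values discussed next; the ordering $\boldsymbol\theta=(\beta,\alpha)'$ matches the gradient and Hessian above:

```python
import numpy as np
from scipy.special import digamma, polygamma

def gamma_mle_scoring(x, n_iter=100, tol=1e-10):
    """Fisher scoring for the gamma(alpha, beta) log-likelihood above:
    solve (n I(theta)) step = score, i.e. step = (1/n) I^{-1} ldot."""
    n, xbar, logbar = x.size, x.mean(), np.log(x).mean()
    s2 = x.var(ddof=1)
    beta, alpha = s2 / xbar, xbar**2 / s2          # moment starting values
    for _ in range(n_iter):
        score = n * np.array([-alpha / beta + xbar / beta**2,
                              logbar - np.log(beta) - digamma(alpha)])
        info = n * np.array([[alpha / beta**2, 1 / beta],
                             [1 / beta, polygamma(1, alpha)]])  # trigamma
        step = np.linalg.solve(info, score)
        beta, alpha = beta + step[0], alpha + step[1]
        if np.linalg.norm(step) < tol:
            break
    return alpha, beta

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=2.0, size=2000)     # true (alpha, beta) = (3, 2)
print(gamma_mle_scoring(x))
```

No safeguard against the iterates leaving the parameter space is included; with reasonable starting values (see below) this is rarely a problem.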

• Starting values for an iterative solution to the likelihood equations. Note that

$$E_{\boldsymbol\theta}\left[\dot l(\boldsymbol\theta;\mathbf X)\right]=\mathbf 0\Rightarrow E\left[\bar X\right]=\alpha\beta.$$

Also, for instance by computing the m.g.f. and differentiating twice (or more simply by a direct integration), $E[X^2]=\alpha(\alpha+1)\beta^2$. Thus $\operatorname{Var}[X]=\alpha\beta^2$ and so

$$E\left[s^2\right]=\alpha\beta^2,$$

where $s^2$ is the sample variance (the unbiasedness of $s^2$ is a simple calculation). The "method of moments" estimates $\boldsymbol\theta_0=(\alpha_0,\beta_0)$ are now obtained by equating $\bar X$ and $s^2$ to their expectations and solving for the parameters:

$$\beta_0=\frac{s^2}{\bar X},\qquad\alpha_0=\frac{\bar X}{\beta_0}\left(=\frac{\bar X^2}{s^2}\right).$$

• Method of moments: Define population moments $\mu_k=E\left[X^k\right]$ and estimates $\hat\mu_k=n^{-1}\sum_{i=1}^nX_i^k$. By the WLLN these are consistent:

$$\hat\mu_k\overset P\to\mu_k\quad\text{as }n\to\infty.$$

Then to estimate continuous functions

$$\boldsymbol\theta=\mathbf g\left(\mu_1,\dots,\mu_m\right)$$

of the population moments, the method of moments estimate

$$\hat{\boldsymbol\theta}=\mathbf g\left(\hat\mu_1,\dots,\hat\mu_m\right)$$

is also consistent. The proof is the same as in the univariate case:

$$P\left(\left\|\hat{\boldsymbol\theta}-\boldsymbol\theta\right\|\geq\epsilon\right)=P\left(\left\|\mathbf g(\hat{\boldsymbol\mu})-\mathbf g(\boldsymbol\mu)\right\|\geq\epsilon\right)\leq P\left(\left\|\hat{\boldsymbol\mu}-\boldsymbol\mu\right\|\geq\delta\right)\to0;$$

here $\delta>0$ is such that

$$\left\|\hat{\boldsymbol\mu}-\boldsymbol\mu\right\|<\delta\Rightarrow\left\|\mathbf g(\hat{\boldsymbol\mu})-\mathbf g(\boldsymbol\mu)\right\|<\epsilon,$$

and its existence is guaranteed by the continuity of $\mathbf g$.


182<br />

— An interesting aside: if (; ) is the density<br />

of an exponential r.v. with mean 1,<br />

giventhatitmustbe∈ [0 1], then (; ) =<br />

− ³ 1 − −´<br />

and the equations 0 () =<br />

0and‘¯ = []’ turn out to be identical,<br />

so that the mle and method of moments estimator<br />

coincide.<br />

• More efficient, in fact almost as efficient as the MLE itself, are

$$\tilde\beta_0=\frac1n\sum_{i=1}^n\left(X_i-\bar X\right)\left(\log X_i-\overline{\log X}\right),\qquad\tilde\alpha_0=\frac{\bar X}{\tilde\beta_0},$$

which are method of moments estimators arising from the observation that

$$\operatorname{cov}\left[X,\log X\right]=\operatorname{cov}\left[X,\log\left(\frac X\beta\right)\right]=\beta,$$

implying

$$\beta=\operatorname{cov}\left[X,\log X\right],\qquad\alpha=\frac{E[X]}\beta.$$

Details in Wiens, Cheng (a former Stat 512 student) and Beaulieu (2003), at http://www.stat.ualberta.ca/~wiens/.
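Both sets of moment-type estimates are one-liners; a sketch (Python/NumPy assumed, parameter values arbitrary):

```python
import numpy as np

def gamma_moment_estimates(x):
    """Plain method of moments, and the covariance-based estimates
    using the identity cov[X, log X] = beta derived above."""
    xbar, s2 = x.mean(), x.var(ddof=1)
    mom = (xbar**2 / s2, s2 / xbar)                      # (alpha0, beta0)
    beta_cov = np.mean((x - xbar) * (np.log(x) - np.log(x).mean()))
    cov_based = (xbar / beta_cov, beta_cov)              # (alpha~, beta~)
    return mom, cov_based

rng = np.random.default_rng(3)
x = rng.gamma(3.0, 2.0, size=5000)                       # true (3, 2)
print(gamma_moment_estimates(x))
```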


• The limit of the N-R process is the MLE $\hat{\boldsymbol\theta}$, and

$$\sqrt n\left(\hat{\boldsymbol\theta}-\boldsymbol\theta\right)\to N\left(\mathbf 0,\mathbf I^{-1}(\boldsymbol\theta)\right).$$

The information matrix is

$$\mathbf I(\boldsymbol\theta)=\lim_{n\to\infty}\frac1n\,E_{\boldsymbol\theta}\left[-\ddot l(\boldsymbol\theta)\right]=\begin{pmatrix}\dfrac\alpha{\beta^2}&\dfrac1\beta\\[1ex]\dfrac1\beta&\psi'(\alpha)\end{pmatrix},$$

with

$$\mathbf I^{-1}(\boldsymbol\theta)=\frac1{\alpha\psi'(\alpha)-1}\begin{pmatrix}\beta^2\psi'(\alpha)&-\beta\\-\beta&\alpha\end{pmatrix}.$$

Then, e.g., the approximation to the distribution of $\hat\beta$ is

$$\hat\beta\approx N\left(\beta,\ \frac{\iota^{11}(\boldsymbol\theta)}n\right),\qquad\iota^{11}(\boldsymbol\theta)=\frac{\beta^2\psi'(\alpha)}{\alpha\psi'(\alpha)-1}.$$

(Recall the ordering $\boldsymbol\theta=(\beta,\alpha)'$ used in the gradient and Hessian above.) We estimate the parameters in the variance, obtaining

$$\frac{\hat\beta-\beta}{\sqrt{\iota^{11}(\hat\alpha,\hat\beta)/n}}\approx N(0,1).$$

Note that $\psi'(\alpha)=\operatorname{Var}\left[\log(X/\beta)\right]=\operatorname{Var}\left[\log X\right]$ can also be consistently estimated by the sample variance of $\{\log X_i\}_{i=1}^n$.
!


184<br />

• Wecannowestablishanasymptoticoptimality<br />

property of the MLE. Suppose that the observations<br />

are i.i.d., and that differentiation under the<br />

integral sign, as above, is permissible. We aim<br />

to estimate a (scalar) function (θ). The MLE<br />

ˆ is defined to be (ˆθ), where ˆθ is the MLE for<br />

θ. Recall that in studying variance stabilization<br />

(Lecture 9) we noted that in the single-parameter<br />

case, if ˆθ were asymptotically normal then so<br />

would be (ˆθ), with a mean of (mean of ˆθ)<br />

and a variance of £ 0 (θ) ¤ ³<br />

2 · variance of ˆθ´. The<br />

multi-parameter analogue (the “delta method”)<br />

is that<br />

√ ³<br />

<br />

³ˆθ ´<br />

− (θ)´ → (0 ˙ 0 (θ) I −1 (θ) ˙ (θ))<br />

where ˙ (θ) =∇ (θ) is the gradient.


185<br />

• Now let (X) be any unbiased estimator of (θ),<br />

so that<br />

Thus<br />

since<br />

(θ) = θ [(X)] =<br />

˙(θ) =<br />

Z<br />

Z<br />

Z<br />

(x)∇ (x; θ) x<br />

(x) (x; θ) x<br />

= (x)˙(θ; x) (x; θ) x<br />

h<br />

= θ (X)˙(θ; X) i<br />

= θ<br />

h<br />

{(X) − (θ)} ˙(θ; X) i <br />

θ<br />

h<br />

(θ)˙(θ; X) i = (θ) θ<br />

h˙(θ; X) i = 0<br />

Then for any constant vector c ×1 we have<br />

c 0 ˙(θ) = θ<br />

h<br />

{(X) − (θ)} c<br />

0 ˙(θ; X) i<br />

and by the Cauchy-Schwarz inequality,<br />

h<br />

c0 ˙(θ) i 2<br />

≤ θ<br />

h<br />

{(X) − (θ)}<br />

2 i θ<br />

∙ nc<br />

0 ˙(θ; X) o 2¸<br />

= θ [(X)] c 0 θ<br />

h˙(θ)˙ 0 (θ) i c<br />

= θ [(X)] · c 0 I(θ)c


186<br />

i.e.<br />

θ [(X)] ≥<br />

¯<br />

¯c 0 ˙(θ)¯¯2<br />

c 0 I(θ)c <br />

Put c = I −12 (θ)t for arbitrary t to get<br />

θ [(X)] ≥<br />

hence<br />

¯<br />

¯t 0 I −12 (θ) ˙(θ)<br />

t 0 t<br />

θ<br />

h√ (X)<br />

i<br />

≥ max<br />

||||=1<br />

¯<br />

¯<br />

¯2<br />

for any t<br />

¯t 0 I −12 (θ) ˙(θ)<br />

= ||I −12 (θ) ˙(θ)|| 2<br />

= ˙ 0 (θ) I −1 (θ) ˙ (θ) <br />

which is the asymptotic variance of the (normalized)<br />

MLE √ ³ˆθ ´. This is the Information<br />

Inequality, giving a lower bound on the variance<br />

of unbiased estimators. Since it is attained (in the<br />

limit) in the case of the MLE, we say the MLE is<br />

asymptotically efficient.<br />

¯<br />

¯2


187<br />

22. Minimax M-estimation I<br />

• M-estimation of location. Suppose 1 <br />

<br />

∼<br />

( − ), with density ( − ) (“location family”).<br />

If we know what is, then the MLE is defined<br />

by maximizing P log ( −), i.e. by solving<br />

0= <br />

X<br />

log ( − ) = X − 0<br />

More generally, a solution ˆ to<br />

X<br />

( − ) =0<br />

( − )<br />

for a suitable function , is an “M-estimate” of location.<br />

Thus the MLE of location, from a known<br />

, is an M-estimate with “score function”<br />

() = () = − 0<br />

()<br />

Quite generally,<br />

⎛<br />

√ <br />

³ˆ − ´ → ⎝0( ) =<br />

h<br />

2 ( − ) i ⎞<br />

{ [ 0 ( − )]} 2<br />


188<br />

Here is an outline of why this is so. By the MVT,<br />

0 = 1 √ <br />

X<br />

( − ˆ )<br />

= √ 1 X<br />

( − )<br />

<br />

−<br />

" #<br />

1 X<br />

√ 0 ( − )<br />

³ˆ − ´<br />

+ <br />

so<br />

√ <br />

³ˆ − ´<br />

=<br />

1 √ <br />

P ( − )+ <br />

1<br />

<br />

P 0 ( − )<br />

A natural assumption, and one that is made here,<br />

is that [ ( − )] = 0. Then by the CLT<br />

and WLLN,<br />

1<br />

√ <br />

X<br />

( − ) → ³ 0 <br />

h<br />

2 ( − ) i´ <br />

1<br />

<br />

X<br />

0 ( − ) → <br />

h<br />

0 ( − ) i <br />

If the remainder <br />

<br />

→ 0(showingthisiswhere<br />

some work is required) then the result follows by<br />

Slutsky’s Theorem.


189<br />

• The asymptotic variance does not depend on ;<br />

it is<br />

R ∞−∞<br />

2 () ()<br />

( ) = h R ∞−∞<br />

0 () () i 2 <br />

The denominator in ( ) isthesquareof<br />

Z ∞<br />

−∞ 0 () ()<br />

Z<br />

¯ ∞<br />

¯∞−∞<br />

−<br />

= ()()<br />

() 0 ()<br />

Z<br />

−∞<br />

∞<br />

= () ()()<br />

−∞<br />

= [ () ()] <br />

Here we use an assumption that ()() → 0<br />

as → ±∞. Then<br />

1<br />

( )<br />

= { [ () ()]} 2<br />

h<br />

2 () i<br />

h<br />

2 () i h<br />

<br />

2<br />

<br />

() i<br />

≤<br />

<br />

<br />

h<br />

2 () i<br />

= <br />

h<br />

<br />

2<br />

() i


190<br />

Thus ( ) is minimized, for fixed ,by =<br />

. The minimum variance is the inverse of<br />

Z ∞<br />

"<br />

h<br />

<br />

2<br />

() i = − 0 # 2<br />

−∞ () () = ( );<br />

this is “Fisher information for location”.<br />

• How might we choose if is not known? We<br />

take a minimax approach: we allow to be any<br />

member of some realistic class F of distributions<br />

(e.g. “approximately Normal” distributions), and<br />

aim to find a 0 that minimizes the maximum<br />

variance:<br />

max<br />

∈ F ( 0) ≤ max ( ) for any <br />

∈ F<br />

(22.1)<br />

We will show that the solution to this problem<br />

is to find an 0 ∈ F that is “least favourable”<br />

in the sense of minimizing ( )inF; we then<br />

protect ourselves against this worst case by using<br />

the MLE based on 0 ,i.e. 0 = 0 = − 0 0 0.


191<br />

• We will show that such a pair ( 0 0 )isa“saddlepoint<br />

solution”:<br />

( 0 ) ≤ ( 0 0 )= 1<br />

( 0 ) ≤ ( 0)<br />

for all and all ∈ F (22.2)<br />

If (22.2) holds we have, for any ,<br />

max ( 0)= ( 0 0 ) ≤ max ( ) <br />

∈ F ∈ F<br />

which is (22.1). Note that the equality in (22.2),<br />

and the second inequality, have already been established.<br />

Thus (22.2) holds iff 0 satisfies the<br />

first inequality.<br />

• Assume that F is convex, in that for any 0 1 ∈<br />

F the d.f. =(1− ) 0 + 1 (0 ≤ ≤ 1) is<br />

also in F. Thefirst inequality in (22.2) states that<br />

( 0 ) is maximized by 0 ; equivalently that<br />

n R ∞−∞<br />

0 0 () () o 2<br />

() =1 ( 0 )= R ∞−∞<br />

0 2 () () <br />

is minimized at =0for each 1 ∈ F (22.3)


192<br />

Note that<br />

Z ∞<br />

−∞ 0 0 () () =<br />

Z ∞<br />

Z ∞<br />

(1 − )<br />

−∞ 0 0 () 0 () + <br />

−∞ 0 0 () 1 () <br />

is a linear function of ; so too is the denominator<br />

of ().<br />

• Lemma: If ()() are linear functions of ∈<br />

[0 1], and () 0, then () = 2 ()() is<br />

convex:<br />

((1 − ) 1 + 2 ) ≤ (1 − ) ( 1 )+ ( 2 )<br />

for 1 2 ∈ [0 1].<br />

Proof:<br />

Using 00 = 00 =0weget<br />

00 = 2 3 ³<br />

0 − 0´2 ≥ 0


193<br />

• By the Lemma, () in(22.3)isconvex,sois<br />

minimized at =0iff 0 (0) ≥ 0foreach 1 ∈ F.<br />

• In the notation of the Lemma we have<br />

0 (0) = 2 (0)<br />

à ! 2 (0)<br />

(0) 0 (0) − 0 (0) with<br />

(0)<br />

(0) =<br />

(0) =<br />

=<br />

Thus<br />

Z ∞<br />

−∞ 2 0 0 = ( 0 );<br />

Z ∞<br />

−∞ 0 0 0 = ³<br />

0 −<br />

0 0´<br />

<br />

−∞<br />

(integration by parts again)<br />

Z ∞<br />

Z ∞<br />

−∞ 2 0 0 = ( 0 )<br />

0 (0) = 2 0 (0)− 0 (0) =<br />

Z ³<br />

2<br />

0<br />

0 − 2 0<br />

´<br />

(1 − 0 ) <br />

We have shown that ( 0 ) is maximized by<br />

0 iff, forall 1 ∈ F<br />

Z ∞ ³<br />

2<br />

0<br />

0 − 0<br />

2 ´<br />

(1 − 0 ) ≥ 0<br />

−∞<br />

(22.4)


194<br />

• Now consider the companion problem of minimizing<br />

( )inF. The function<br />

Z ∞<br />

Ã<br />

Z ∞<br />

¡<br />

<br />

0 <br />

¢ 2<br />

() =( )= − 0 ! 2<br />

<br />

= <br />

−∞ −∞ <br />

is convex. This is because, by the Lemma, its<br />

integrand () isconvexforeach; thus for any<br />

1 2 ∈ [0 1]<br />

(1−)1 + 2<br />

() ≤ (1 − ) 1 ()+ 2 ();<br />

integrating this gives<br />

((1 − ) 1 + 2 ) ≤ (1 − ) ( 1 )+ ( 2 ) <br />

Thus ( ) is minimized by 0 iff, for each 1 ∈<br />

F,<br />

0 ≤ 0 (0)<br />

Z ∞ 2 0 ³<br />

<br />

0<br />

1 − 0´ 0 − ¡ 0 2<br />

¢<br />

(1 − 0 )<br />

=<br />

=<br />

=<br />

=<br />

−∞<br />

Z ∞<br />

2 <br />

|=0<br />

Ã<br />

−2 − 0 ! Ã<br />

³<br />

0 0<br />

−∞ 1 − 0<br />

0 ´<br />

− − 0 ! 2<br />

0<br />

( 1 − 0 ) <br />

0 0<br />

³<br />

<br />

0<br />

1 − 0<br />

0 ´<br />

− <br />

2<br />

0 ( 1 − 0 ) <br />

Z ∞<br />

−2 0<br />

Z<br />

−∞<br />

∞<br />

−∞<br />

³<br />

2<br />

0<br />

0 − 0<br />

2 ´<br />

(1 − 0 )


195<br />

• By comparison with (22.4) we have that the following<br />

are equivalent:<br />

1. ³ 0 = − 0 0 0 0´<br />

is a saddlepoint solution<br />

to the minimax problem;<br />

2. ( 0 ) is maximized by 0 ;<br />

3. ( ) is minimized in F by 0 ;<br />

4. R ∞<br />

−∞<br />

³<br />

2<br />

0<br />

0 − 2 0´<br />

(1 − 0 ) ≥ 0 for all 1 ∈<br />

F.


196<br />

23. Minimax M-estimation II<br />

• By the preceding, we are to minimize ( )inF<br />

andthenput 0 = − 0 0 0. We must now specify<br />

a “reasonable” class F. A commonly used one is<br />

the “gross errors” class<br />

F = { | () =(1− ) ()+()} <br />

where () is the Normal density and () isan<br />

arbitrary (but symmetric) density.<br />

— Why symmetric? We need [ 0 ( − )] =<br />

R ∞−∞<br />

0 () () =0forall ∈ F; thisis<br />

guaranteed if is even and, as will turn out<br />

to be the case, 0 is odd and bounded.<br />

• The interpretation is that 100 (1 − )%oftheobservations<br />

are Normally distributed; the remainder<br />

come from an unknown population. For this<br />

modelwehave 1 − 0 = ( 1 − 0 ), and so we<br />

are to find 0 satisfying<br />

Z ∞ ³<br />

2<br />

0<br />

0 − 2 ´<br />

0 (1 − 0 ) ≥ 0forall 1 <br />

−∞<br />

(23.1)


We note that<br />

and<br />

() = − 0 = −()log() =<br />

2 0 () − 2 () =2− 2 <br />

197<br />

Condition (23.1) states that R ∞<br />

−∞<br />

³<br />

2<br />

0<br />

0 − 2 0´<br />

<br />

is to be minimized by 0 .But 0 depends on 0 .<br />

We conjecture that<br />

1. The density 0 mustplaceallofitsmasswhere<br />

20 0 − 2 0 is a minimum;<br />

2. On this set, 20 0 − 2 0<br />

= − 2 for some .<br />

is to be constant, say<br />

We will first verify that these conditions ensure a<br />

minimax solution, and then verify that there is a<br />

density 0 that has these properties.<br />

• A clue to the form of the set in 2. above is provided<br />

by the behaviour of 2 0 () − 2 ().


198<br />

• Suppose then that we can construct a density 0<br />

in such a way that<br />

0 () =<br />

0 () =<br />

(<br />

(<br />

(1 − ) () || ≤ <br />

(1 − ) ()+ 0 () || ≥ ;<br />

() =<br />

|| ≤ <br />

asolutionto20 0 − 2 0 = −2 || ≥ ;<br />

and with 0 = − 0 0 0 on || ≥ . Suppose also<br />

that<br />

− 2 ≤ 2 − 2 <br />

so that 20 0 − 2 0 attains its minimum (of −2 )<br />

on the set || ≥ . Then<br />

Z ∞ ³<br />

2<br />

0<br />

0 − 0<br />

2 ´<br />

(1 − 0 ) <br />

=<br />

Z−∞<br />

||≤<br />

³<br />

2 0 − 2´<br />

1 +<br />

≥ − 2 " Z<br />

||≤ 1 +<br />

" Z ∞<br />

= − 2 1 −<br />

−∞<br />

= 0<br />

Z<br />

Z<br />

Z<br />

||≥<br />

³<br />

−<br />

2´<br />

||≥ ( 1 − 0 ) <br />

||≥ 0<br />

#<br />

( 1 − 0 ) <br />

#


199<br />

• A solution (there are three) to 20 0 − 2 0 = −2<br />

is 0 () =() · , implying 0 () ∝ −|| .<br />

This leads to<br />

0 () =<br />

0 () =<br />

(<br />

(1 − ) () || ≤ <br />

(1 − ) () −(||−) || ≥ ;<br />

(<br />

() = || ≤ <br />

() · || ≥ ;<br />

with = . Note that 0 and 0 are continuous,<br />

that 0 = −0 0 0,andthat− 2 ≤ 2 − 2 .<br />

It remains only to show that 0 ∈ F, i.e.that<br />

0 () =(1− ) ()+ 0 () for some density 0 .<br />

It is left as an exercise to show that the function 0<br />

defined by this relationship is non-negative, and<br />

that a unique = () can be found such that<br />

R ∞−∞<br />

0 () =1.<br />

• This function $\psi_0(x)$ is the famous "Huber's psi function". The theory given here extends very simply to the case in which $\phi$ is replaced by any other "strongly unimodal" density, one for which $\psi_f(x)=-f'(x)/f(x)$ is an increasing function of $x$. Details are in the landmark paper: Huber, P. J. (1964), "Robust Estimation of a Location Parameter," The Annals of Mathematical Statistics, 35, 73-101.
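Huber's $\psi$ and the resulting M-estimate of location are simple to compute. The sketch below (Python/NumPy assumed) solves $\sum_i\psi_0(X_i-\theta)=0$ by Newton's method; the truncation point $c=1.345$ is a conventional tuning choice corresponding to a particular $\epsilon$, and the scale is treated as known ($=1$) to keep the sketch minimal:

```python
import numpy as np

def huber_psi(r, c=1.345):
    """Huber's psi: the identity on |r| <= c, clamped at +/- c outside."""
    return np.clip(r, -c, c)

def m_location(x, c=1.345, n_iter=100, tol=1e-10):
    """M-estimate of location: Newton's method for sum psi(x_i - theta) = 0.
    psi' is 1 on |r| <= c and 0 outside, so the derivative term is a count."""
    theta = np.median(x)                     # robust starting value
    for _ in range(n_iter):
        r = x - theta
        step = huber_psi(r, c).sum() / max(np.sum(np.abs(r) <= c), 1)
        theta += step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(10, 1, 95), rng.normal(50, 1, 5)])  # 5% gross errors
print(np.mean(x), m_location(x))   # the mean is dragged upward; the M-estimate is not
```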

• Extensions to regression are immediate. An M-estimate of regression is a minimizer of

$$\sum_i\rho\left(y_i-\mathbf x_i'\boldsymbol\theta\right)$$

or, with $\psi=\rho'$, a solution to

$$\sum_i\mathbf x_i\,\psi\left(y_i-\mathbf x_i'\boldsymbol\theta\right)=\mathbf 0.$$

If $\rho(x)=x^2/2$, $\psi(x)=x$, then this becomes

$$\sum_i\mathbf x_iy_i=\sum_i\mathbf x_i\mathbf x_i'\boldsymbol\theta.$$

The solution is

$$\hat{\boldsymbol\theta}=\left[\sum_i\mathbf x_i\mathbf x_i'\right]^{-1}\sum_i\mathbf x_iy_i=\left(\mathbf X'\mathbf X\right)^{-1}\mathbf X'\mathbf y,$$

the LSE. In general, Newton-Raphson or Iteratively Reweighted Least Squares (IRLS) can be used to obtain the solution.


• The asymptotic normality result is that

$$\sqrt n\left(\hat{\boldsymbol\theta}-\boldsymbol\theta\right)\to N\left(\mathbf 0,\ V(\psi,F)\lim_{n\to\infty}n\left(\mathbf X'\mathbf X\right)^{-1}\right)$$

if the errors have a density $f$, symmetric around 0. Here $V(\psi,F)$ is as before, so that the same minimax results as derived here can be applied.

• IRLS: Write the equations as

$$\mathbf 0=\sum_i\mathbf x_i\,\frac{\psi\left(y_i-\mathbf x_i'\boldsymbol\theta\right)}{y_i-\mathbf x_i'\boldsymbol\theta}\left(y_i-\mathbf x_i'\boldsymbol\theta\right)=\sum_i\left[\mathbf x_i\cdot w_i\cdot\left(y_i-\mathbf x_i'\boldsymbol\theta\right)\right]=\mathbf X'\mathbf W\mathbf y-\mathbf X'\mathbf W\mathbf X\boldsymbol\theta,$$

for $\mathbf W=\operatorname{diag}\left(w_1,\dots,w_n\right)$ and weights $w_i=w_i(\boldsymbol\theta)=\psi\left(y_i-\mathbf x_i'\boldsymbol\theta\right)/\left(y_i-\mathbf x_i'\boldsymbol\theta\right)$ depending on the parameters. "Solve" these equations:

$$\boldsymbol\theta=\left(\mathbf X'\mathbf W\mathbf X\right)^{-1}\mathbf X'\mathbf W\mathbf y;$$

use this value to re-calculate the weights; iterate to convergence. Thus the $(m+1)$st step is a weighted least squares regression using weights $w_i(\boldsymbol\theta_m)$ computed from the residuals at the previous step.
from the residuals at the previous step.


202<br />

24. Measure and Integration<br />

• Recall the definition of a probability space: basic<br />

components are a set Ω (“outcomes”), a “Borel<br />

field” or “-algebra” F of subsets (“events”; F<br />

contains Ω and is closed under complementation<br />

and countable unions) and a measure assigning<br />

probability () toevents ∈ F.<br />

• Let Ω =(0 1], the unit interval, and start with<br />

subintervals ( ]. Define a (probability) measure<br />

by (( ]) = − . This extends to the set B 0<br />

of finite disjoint unions and complements of such<br />

intervals in the obvious way.<br />

• Now consider $\mathcal B=\sigma(\mathcal B_0)$, the smallest $\sigma$-algebra containing $\mathcal B_0$ (i.e., the intersection of all of them). One can extend the measure on $\mathcal B_0$ to the $\sigma$-algebra $\mathcal B$. Formally, define the outer measure of a set $A\subset\Omega$ by

$$P^*(A)=\inf\sum_nP(A_n),$$

where the infimum is over all sequences $\{A_n\}$ in $\mathcal B_0$ satisfying $A\subset\cup_nA_n$. If, for any $E\subset\Omega$, we then have $P^*(E\cap A)+P^*(E\cap A^c)=P^*(E)$, we say that $A$ is Lebesgue measurable, with Lebesgue measure $P^*(A)$. (The condition implies in particular that $P^*(A)+P^*(A^c)=1$.)

• It can be shown that the Lebesgue measurable sets include $\sigma(\mathcal B_0)$, and that Lebesgue measure agrees with $P$ whenever both are defined. In particular, the Lebesgue measure of an interval is its length. The restriction of Lebesgue measure to $\sigma(\mathcal B_0)$ is called Borel measure, and the sets in $\sigma(\mathcal B_0)$ are Borel measurable.

• The measure $P^*$ can in turn be extended from $\mathcal B$ through a process of completion, essentially by appending to $\mathcal B$ all subsets of those sets with measure zero. The resulting measure space is complete, in that if a set $A$ is Lebesgue measurable and has measure zero, then the same is true of any subset of $A$.


204<br />

• Example: The set of rational numbers in (0 1]<br />

has Lebesgue measure ∗ () = 0. This is because<br />

we can enumerate the rationals: = { 1 2 }<br />

and then if<br />

= ³ − 2 −(+1) +2 −(+1) ´ ∩ (0 1]<br />

we have ⊂∪ and ∗ () ≤ P ( ) ≤ .<br />

• One can carry out these constructions for more<br />

general sets Ω. StartingwithΩ = R and intervals<br />

( ] as above results in Lebesgue measure on the<br />

real line.<br />

• Now let $(\Omega,\mathcal F,\mu)$ be any measure space, so that $\mathcal F$ is a $\sigma$-algebra and $\mu$ a measure (Lebesgue measure, counting measure, ...). Here we define the integral, written

$$\int f\,d\mu=\int_\Omega f\,d\mu=\int_\Omega f(\omega)\,\mu(d\omega).$$

One approach to this is to start with simple, non-negative functions $f=\sum_ia_i1_{A_i}$, where $\Omega=\cup_{i=1}^mA_i$; in this case define $\int f\,d\mu=\sum_ia_i\,\mu(A_i)$. (Then, e.g., if $f=a$ on $(b,c]$ and zero elsewhere, and $\mu$ is Lebesgue measure, this gives $\int f\,d\mu=a(c-b)$, in agreement with the R-integral.) Now any non-negative function $f$ can be represented as an increasing limit ($f_n\uparrow f$) of simple functions; one then defines $\int f\,d\mu=\lim_n\int f_n\,d\mu$.

• To extend to arbitrary functions, define

$$f^+(\omega)=\max(f(\omega),0),\qquad f^-(\omega)=\max(-f(\omega),0).$$

Then $f=f^+-f^-$ is the difference of non-negative functions, and one defines

$$\int f\,d\mu=\int f^+\,d\mu-\int f^-\,d\mu$$

(unless one of these is $\infty$, in which case we say that $f$ is "not integrable", although one still assigns a value to $\int f\,d\mu$ by adopting the convention $a\pm\infty=\pm\infty$ for finite $a$). Finally, for sets $A\in\mathcal F$ one defines $\int_Af\,d\mu=\int f\cdot1_A\,d\mu$.
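The increasing simple-function approximation can be made concrete: a standard choice is $f_n=\min\left(n,\ 2^{-n}\lfloor2^nf\rfloor\right)$, which satisfies $f_n\uparrow f$. The sketch below only illustrates the idea numerically, using a fine grid on $(0,1]$ as a stand-in for Lebesgue measure; the function and grid size are arbitrary choices:

```python
import numpy as np

# f_n takes the value k/2^n on the set {k/2^n <= f < (k+1)/2^n}, capped at n.
# For f(x) = x^2 on (0,1], the integrals of the f_n increase to 1/3.
f = lambda x: x**2
x = np.linspace(1e-6, 1, 1_000_000)    # grid approximating (0,1]
dx = x[1] - x[0]
for n in [1, 2, 4, 8, 16]:
    f_n = np.minimum(np.floor(f(x) * 2**n) / 2**n, n)
    print(n, np.sum(f_n) * dx)         # increasing, -> 1/3
```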


206<br />

• Some properties of the integral:<br />

1. If and are integrable and ≤ a.e. (“almost<br />

everywhere”; i.e. except on a set with<br />

measure zero) then R ≤ R . (Then<br />

|| = + + − is integrable and since − || ≤<br />

≤ || we have that | R | ≤ R || .)<br />

2. Monotone convergence: If 0 ≤ ↑ a.e.<br />

then R ↑ R . (Note that R is<br />

defined for all non-negative functions .)<br />

3. Dominated convergence: If | | ≤ a.e., where<br />

is integrable (i.e. R , which necessarily<br />

exists, is finite) and if → a.e., then and<br />

the areintegrableand R → R .<br />

4. Bounded convergence: if (Ω) ∞ and the<br />

are uniformly bounded (i.e. () ≤ <br />

for all and all ), then → a.e. implies<br />

R<br />

→ R .


207<br />

• If is Lebesgue measure then the integral defined<br />

above is the Lebesgue integral. Example: The<br />

function =1 is zero everywhere except on the<br />

set of rationals in (0 1], i.e. almost everywhere.<br />

By the above R = 0; recall that this is an<br />

example in which the R-integral does not exist.<br />

When the R-integral does exist, it has the same<br />

value as the Lebesgue integral.<br />

• Now let (Ω F) be a probability space; a (finite)<br />

random variable is a function : Ω → R such<br />

that −1 () ∈ F for any Borel measurable set<br />

. Equivalently (proof at end), inverses of open<br />

sets are events.<br />

• A function () is also a r.v. under a certain<br />

condition. For ∈ B, we have<br />

( ◦ ) −1 () = −1 ◦ −1 () ∈ F<br />

as long as −1 () ∈ B, i.e. must be Borel<br />

measurable - a function for which the inverses of<br />

Borel sets (or just open sets) are Borel sets.


208<br />

• Any r.v. induces a probability space (R B)via<br />

() = ³ −1 ()´ = ( ∈ ) for ∈ B<br />

This measure is the probability measure (p.m.)<br />

of , and the associated distribution function is<br />

defined by<br />

() = ((−∞]) = ( ≤ ) <br />

• If there is a function (·) and a measure space<br />

(R B)with<br />

() = ( ∈ ) =<br />

Z<br />

<br />

() () <br />

we say that is the density of (or of ) w.r.t.<br />

and that is ‘absolutely continuous’ w.r.t. .<br />

The most common cases are = Lebesgue measure<br />

(in which case we say that is a continuous<br />

r.v. and then 0 () exists and equals ()<br />

a.e.) and = counting measure, in which case<br />

() = ( = ) and is discrete.


209<br />

• The expected value of a r.v. is defined by the<br />

integral [] = R Ω () (), and can be<br />

evaluatedbytransformingtothep.m., i.e.<br />

[()] =<br />

Z<br />

R () (); (24.1)<br />

this in turn equals the R-S integral R ∞<br />

−∞ ()()<br />

whenever the latter exists. The proof of (24.1)<br />

consists of showing that both sides agree for simple<br />

functions (for instance when =1 both<br />

sides equal ( ∈ )) and extending to general<br />

Borel functions by monotonicity, etc.


210<br />

• Basic properties of expectations are inherited from<br />

those of the integral. Some particular ones are:<br />

1. []existsiff [||] exists, and then | []| ≤<br />

[||].<br />

2. Monotone convergence: If ≥ 0and ↑<br />

a.e., then [ ] → [] (which might<br />

= ∞).<br />

3. Dominated convergence: If → a.e. (or<br />

<br />

just → )and∀ | | ≤ with [ ] <br />

∞, then [ ] → [].<br />

4. Bounded convergence: this is “Dominated convergence”<br />

with = , aconstant.<br />

• Example: Suppose we estimate a bounded, continuous<br />

function () ofapopulationmeanby<br />

( ), where is an average based on a sample<br />

of size . We have that → by the<br />

<br />

WLLN, so ( ) → () bycontinuity;then<br />

[ ( )] → [ ()] = () by bounded convergence.


211<br />

• It was stated above that if is a function on<br />

(Ω F P),mappinginto(R B), and −1 () ∈<br />

F for every open set in R (recall this was our original<br />

definition of a r.v.) then −1 () ∈ F for<br />

every Borel measurable set (our current definition<br />

of a r.v.) To see this, firstnotethatif −1 () ∈<br />

F for every open set , then this holds as well for<br />

every interval = ( ] (=( ) ∩ ( ) for<br />

any ). It then holds as well for the set B 0<br />

of finite disjoint unions and complements of such<br />

intervals. The property can finally be extended<br />

to B = (B 0 ) through the Monotone Class Theorem.<br />

A class M of subsets of R is monotone if<br />

for sequences { } in M,<br />

1 ⊂ 2 ⊂ ⇒∪ ∈ M<br />

1 ⊃ 2 ⊃ ⇒∩ ∈ M<br />

The theorem states that if M is a monotone class,<br />

then B 0 ⊂ M implies (B 0 ) ⊂ M. So it suffices<br />

to verify that the class M of subsets for which<br />

−1 () ∈ F is monotone. This is straightforward.
