STATISTICS 512
TECHNIQUES OF MATHEMATICS FOR STATISTICS

Doug Wiens

December 16, 2013
Contents

I MATRIX ALGEBRA 6
1 Introduction; matrix manipulations 7
2 Vector spaces 16
3 Orthogonality; Gram-Schmidt method; QR-decomposition 24
4 LSEs; Spectral theory 31
5 Examples & applications 41

II LIMITS, CONTINUITY, DIFFERENTIATION 50
6 Limits; continuity; probability spaces 51
7 Random variables; distributions; Jensen's Inequality; WLLN 57
8 Differentiation; Mean Value and Taylor's Theorems 64
9 Applications: transformations; variance stabilization 71

III SEQUENCES, SERIES, INTEGRATION 78
10 Sequences and series 79
11 Power series; moment and probability generating functions 87
12 Branching processes 95
13 Riemann integration 102
14 Riemann and Riemann-Stieltjes integration 111
15 Moment generating functions; Chebyshev's Inequality; Asymptotic statistical theory 121

IV MULTIDIMENSIONAL CALCULUS AND OPTIMIZATION 131
16 Multidimensional differentiation; Taylor's and Inverse Function Theorems 132
17 Implicit Function Theorem; extrema; Lagrange multipliers 141
18 Integration; Leibnitz's Rule; Normal sampling distributions 152
19 Numerical optimization: Steepest descent, Newton-Raphson, Gauss-Newton 162
20 Maximum likelihood 170
21 Asymptotics of ML estimation; Information Inequality 178
22 Minimax M-estimation I 187
23 Minimax M-estimation II 196
24 Measure and Integration 202
Part I
MATRIX ALGEBRA
1. Introduction; matrix manipulations

• Outline:

— Linear algebra: regression (linear/nonlinear), multivariate analysis, and more generally linear models and linear approximations.

— Real analysis/calculus: the theory of statistical distributions; optimal selection of statistical procedures (e.g. determining a parameter estimate which minimizes a certain loss function); approximation of intractable procedures by simpler ones.

— Measure theory (very briefly) and the theory of integration: probability, mathematical finance, the theory of mathematical statistics, foundations of statistical and probabilistic methods. A rigorous development of conditional expectation requires measure theory.

— Optimization: finding numbers or functions minimizing certain objectives, e.g. designing experiments for maximum information/minimum variance, etc.; associated numerical methods.
• In Statistics, matrices are convenient ways to store and refer to data. As well, in regression for instance, there are important structural features that come from examining the algebraic properties of the vector space formed from all linear combinations of the columns of a matrix.

• As for the first of these (matrices as data storage), you should learn various ways to manipulate matrices. In particular, the formula for the product of two matrices, namely that the $(i,j)$ element of the product $AB$ of an $n \times m$ matrix $A$ and an $m \times p$ matrix $B$ is given by

$$[AB]_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj},$$

is of rather limited usefulness. Rather, one should treat either the rows or the columns of matrices as the basic elements. Some examples:
— Define a (column) vector in $\mathbb{R}^n$; sum, transpose, scalar product, outer product.

— Matrix as a column of rows, or a row of columns:

$$A_{n \times p} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix} = (\boldsymbol{\alpha}_1 \,\vdots\, \boldsymbol{\alpha}_2 \,\vdots\, \cdots \,\vdots\, \boldsymbol{\alpha}_p).$$
— If $X_{n \times p}$ has rows $\{\mathbf{x}_i'\}_{i=1}^{n}$ (note: vectors are columns, rows are transposed vectors), and $\boldsymbol{\theta}$ is a $p \times 1$ vector, then

$$X\boldsymbol{\theta} = \begin{pmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_i' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix} \boldsymbol{\theta} = \begin{pmatrix} \mathbf{x}_1'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_i'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_n'\boldsymbol{\theta} \end{pmatrix}.$$

— If $X_{n \times p}$ has columns $\{\mathbf{z}_j\}_{j=1}^{p}$, and $\boldsymbol{\theta}$ is a $p \times 1$ vector, then

$$X\boldsymbol{\theta} = \left[\mathbf{z}_1 \cdots \mathbf{z}_j \cdots \mathbf{z}_p\right] \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_j \\ \vdots \\ \theta_p \end{pmatrix} = \sum_{j=1}^{p} \theta_j \mathbf{z}_j.$$
— If $X_{n \times p}$ has columns $\{\mathbf{z}_j\}_{j=1}^{p}$, and $A$ is a matrix with $n$ columns, then

$$AX = A\left[\mathbf{z}_1 \cdots \mathbf{z}_j \cdots \mathbf{z}_p\right] = \left[A\mathbf{z}_1 \cdots A\mathbf{z}_j \cdots A\mathbf{z}_p\right].$$

— If $X$ is as above and $A_{m \times n}$ is a matrix with rows $\{\mathbf{a}_i'\}_{i=1}^{m}$, then

$$AX = \begin{pmatrix} \mathbf{a}_1' \\ \vdots \\ \mathbf{a}_i' \\ \vdots \\ \mathbf{a}_m' \end{pmatrix} X = \begin{pmatrix} \mathbf{a}_1'X \\ \vdots \\ \mathbf{a}_i'X \\ \vdots \\ \mathbf{a}_m'X \end{pmatrix} = \left(\mathbf{a}_i'\mathbf{z}_j\right)_{ij}.$$

You should become familiar with all of these, and learn to choose the most appropriate form in an application.
— Block matrices ... a particular example is, with notation as above,

$$A_{n \times m} B_{m \times p} = (\boldsymbol{\alpha}_1 \,\vdots\, \boldsymbol{\alpha}_2 \,\vdots\, \cdots \,\vdots\, \boldsymbol{\alpha}_m) \begin{pmatrix} \boldsymbol{\beta}_1' \\ \boldsymbol{\beta}_2' \\ \vdots \\ \boldsymbol{\beta}_m' \end{pmatrix} = \sum_{k=1}^{m} \boldsymbol{\alpha}_k \boldsymbol{\beta}_k'.$$
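The row, column, and outer-product views of the matrix product can be checked numerically. A minimal NumPy sketch (NumPy is an assumption here, not something the notes prescribe):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # n x m
B = rng.standard_normal((3, 5))   # m x p

# Outer-product view: AB is a sum of m outer products
# (column k of A) (row k of B)'.
outer_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

# Row view: the i-th row of AB is (row i of A) times B.
row_view = np.vstack([A[i, :] @ B for i in range(A.shape[0])])

assert np.allclose(A @ B, outer_sum)
assert np.allclose(A @ B, row_view)
```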
• Let $X$ be a random variable (r.v.) (formal definition to come later) with (i) distribution function $F(x) = P(X \le x)$ and probability density function $f(x) = F'(x)$, or (ii) probability mass function $f(x) = P(X = x)$ for $x \in \mathcal{X}$, a finite or countable set. Then the "expected value" is

$$E[X] = \begin{cases} \int_{-\infty}^{\infty} x f(x)\,dx & \text{(i)}, \\ \sum_{x \in \mathcal{X}} x f(x) & \text{(ii)}. \end{cases}$$

Think "average". The cases can be unified and extended (for instance to cases where $F$ is not differentiable) via the Riemann-Stieltjes integral, to be considered later. Also, the extension to random vectors (r.vecs) is immediate, involving multidimensional integrals or sums. A consequence is that

$$E\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_n] \end{pmatrix}.$$
(Similarly with random matrices.) We define $E[g(\mathbf{X})]$ to be $E[Y]$, where $Y = g(\mathbf{X})$. In principle this requires the derivation of the distribution of $Y$. It can be shown that this can instead be obtained by integration or summation w.r.t. ("with respect to") the distribution of $\mathbf{X}$. Corresponding to the cases above, this is

$$E[g(\mathbf{X})] = \begin{cases} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x} & \text{(i)}, \\ \sum_{\mathbf{x} \in \mathcal{X}} g(\mathbf{x}) f(\mathbf{x}) & \text{(ii)}, \end{cases}$$

respectively.

• A special consequence is linearity: $E[aX + bY] = aE[X] + bE[Y]$. More generally, if $\mathbf{x}$ is a r.vec. and $\mathbf{b}$ a constant vector,

$$E[A\mathbf{x} + \mathbf{b}] = A\,E[\mathbf{x}] + \mathbf{b},$$
$$E[A\mathbf{X}B + C] = A\,E[\mathbf{X}]\,B + C$$
for a random matrix $\mathbf{X}$. You should verify this. Thus, e.g., if $\boldsymbol{\mu} = E[\mathbf{x}]$ then (how?)

$$E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'\right] = E\left[\mathbf{x}\mathbf{x}'\right] - \boldsymbol{\mu}\boldsymbol{\mu}'.$$

The $(i,j)$ element is

$$E\left[(X_i - \mu_i)(X_j - \mu_j)\right] = \operatorname{cov}(X_i, X_j)$$

(= the variance if $i = j$). The matrix is called the covariance matrix of the random vector $\mathbf{x}$.
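The identity $E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})'] = E[\mathbf{x}\mathbf{x}'] - \boldsymbol{\mu}\boldsymbol{\mu}'$ holds exactly for sample moments as well, which makes for an easy numerical check (an illustrative NumPy sketch with arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
# 10000 draws of a 3-dimensional random vector with correlated components.
x = rng.standard_normal((10000, 3)) @ np.array([[2., 0., 0.],
                                                [1., 1., 0.],
                                                [0., 0., 3.]])
mu = x.mean(axis=0)

# E[(x - mu)(x - mu)'] computed directly ...
direct = (x - mu).T @ (x - mu) / len(x)
# ... equals E[xx'] - mu mu' (the identity above), exactly, for sample moments.
via_identity = x.T @ x / len(x) - np.outer(mu, mu)

assert np.allclose(direct, via_identity)
```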
• A common application, to be developed in detail, is linear regression. An experimenter observes a variable $Y$ (= response to a medical treatment, say) thought to depend on the type of drug used ($x_1 = 1$ for type A, $0$ for type B) and the amount applied ($x_2$). The response contains a random component as well (measurement error, model inadequacies, etc.); a tentative linear regression model might be

$$Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \varepsilon,$$

where the $\theta$'s are unknown parameters to be estimated, and $\varepsilon$ is unobserved random error, assumed to have mean $0$ and constant variance across subjects, possibly also normally distributed.
— Interpretation of $E[Y \mid x_1, x_2]$ in the two treatment groups:

$$E[Y \mid x_1 = 0, x_2] = \theta_0 + \theta_2 x_2,$$
$$E[Y \mid x_1 = 1, x_2] = \theta_0 + \theta_1 + \theta_2 x_2;$$

hence $\theta_1$ = the difference in mean effects of the treatments, if the same amounts are applied.

— Then with $\mathbf{x} = (1, x_1, x_2)'$, $\boldsymbol{\theta} = (\theta_0, \theta_1, \theta_2)'$:

$$Y = \mathbf{x}'\boldsymbol{\theta} + \varepsilon; \qquad E[Y \mid \mathbf{x}] = \mathbf{x}'\boldsymbol{\theta}.$$
• Take $n$ data:

$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1'\boldsymbol{\theta} \\ \mathbf{x}_2'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_n'\boldsymbol{\theta} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix};$$

more concisely,

$$\mathbf{Y} = \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix} \boldsymbol{\theta} + \boldsymbol{\varepsilon} = X\boldsymbol{\theta} + \boldsymbol{\varepsilon}.$$
Here the observations (rows) have been singled out as the relevant objects. Much of the theory will hinge on the representation of $E[\mathbf{Y} \mid X]$ as a linear combination of the columns of $X$, with coefficients $\theta_j$:

$$X = (\mathbf{1} \,\vdots\, \mathbf{z}_1 \,\vdots\, \mathbf{z}_2); \qquad E[\mathbf{Y} \mid X] = \mathbf{1}\theta_0 + \mathbf{z}_1\theta_1 + \mathbf{z}_2\theta_2.$$

The extension to more than 3 columns is immediate.

• Estimation of $\boldsymbol{\theta}$: Given an estimate $\hat{\boldsymbol{\theta}}$ one estimates $E[Y_i \mid \mathbf{x}_i]$ by $\mathbf{x}_i'\hat{\boldsymbol{\theta}}$, with residuals

$$e_i = Y_i - \mathbf{x}_i'\hat{\boldsymbol{\theta}}$$

and residual vector $\mathbf{e} = \mathbf{Y} - X\hat{\boldsymbol{\theta}}$.

• We define the (Euclidean) norm, i.e. the length, of a vector by $\|\mathbf{e}\| = \sqrt{\sum_i e_i^2} = \sqrt{\mathbf{e}'\mathbf{e}}$.

• Least Squares Principle: Choose

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{Y} - X\boldsymbol{\theta}\|^2,$$

minimizing the Sum of Squares of the Residuals (or Errors, hence "SSE").
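The Least Squares Principle, applied to the drug-type/amount model above, can be sketched with simulated data (the data and the true parameter values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.integers(0, 2, n).astype(float)   # drug type indicator
x2 = rng.uniform(1, 5, n)                  # amount applied
X = np.column_stack([np.ones(n), x1, x2])
theta_true = np.array([1.0, 2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Minimize ||Y - X theta||^2 over theta.
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ theta_hat                      # residual vector

# The LS estimate achieves a no-larger SSE than any other theta.
assert e @ e <= np.sum((Y - X @ (theta_hat + 0.1)) ** 2)
```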
2. Vector spaces

• Now let's relate matrices to vector spaces. We start with the definition of a vector space. This is largely for formal completeness (you might wish to skip over the next bullet), since the only vector space considered here will be

$$\mathbb{R}^n = \text{all } n\text{-dimensional vectors with real elements}$$

and its subspaces.

• We list a number of axioms to be satisfied by a structure in order that it be called a vector space; for $\mathbb{R}^n$ these are all pretty obvious. Note that $\mathbb{R}^n$ is closed under addition ($\mathbf{x}, \mathbf{y} \in \mathbb{R}^n \Rightarrow \mathbf{x} + \mathbf{y} \in \mathbb{R}^n$) and scalar multiplication ($\mathbf{x} \in \mathbb{R}^n$ and $a \in \mathbb{R} \Rightarrow a\mathbf{x} \in \mathbb{R}^n$), and satisfies

1. Associativity: For all $\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^n$, we have $\mathbf{x} + (\mathbf{y} + \mathbf{z}) = (\mathbf{x} + \mathbf{y}) + \mathbf{z}$.

2. Commutativity: For all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, we have $\mathbf{x} + \mathbf{y} = \mathbf{y} + \mathbf{x}$.

3. Identity element: There is $\mathbf{0} \in \mathbb{R}^n$ such that $\mathbf{x} + \mathbf{0} = \mathbf{x}$.

4. Inverse elements: For all $\mathbf{x} \in \mathbb{R}^n$, there exists an element $-\mathbf{x} \in \mathbb{R}^n$, called the additive inverse of $\mathbf{x}$, such that $\mathbf{x} + (-\mathbf{x}) = \mathbf{0}$.

5. Distributivity for scalar multiplication: For all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and $a \in \mathbb{R}$ we have $a(\mathbf{x} + \mathbf{y}) = a\mathbf{x} + a\mathbf{y}$.

6. Distributivity for scalar addition: For all $\mathbf{x} \in \mathbb{R}^n$ and $a, b \in \mathbb{R}$ we have $(a + b)\mathbf{x} = a\mathbf{x} + b\mathbf{x}$.

7. For all $\mathbf{x} \in \mathbb{R}^n$ and $a, b \in \mathbb{R}$ we have $(ab)\mathbf{x} = a(b\mathbf{x})$.

8. Scalar multiplication has an identity: $1\mathbf{x} = \mathbf{x}$.

Because these properties hold we say that $\mathbb{R}^n$ is a vector space (over the field $\mathbb{R}$ of scalars).
• A subset $U$ of $\mathbb{R}^n$ that is itself closed under addition and scalar multiplication is a vector space in its own right, called a vector subspace of $\mathbb{R}^n$; similarly $W \subset U$ closed under addition and scalar multiplication is a subspace of $U$. (You might wish to prove this; the proof consists of showing that 1.-8. hold in $U$ if they hold in $\mathbb{R}^n$ and if $U$ has these two closure properties.)

• Definitions:

(i) Elements $\mathbf{v}_1, \ldots, \mathbf{v}_m$ of $U$ form a spanning set if every $\mathbf{v} \in U$ is a linear combination of them.

(ii) Elements $\mathbf{v}_1, \ldots, \mathbf{v}_m$ of $U$ are (linearly) independent if all are non-zero and

$$\sum_{i=1}^{m} a_i \mathbf{v}_i = \mathbf{0} \Rightarrow \text{all } a_i = 0,$$

i.e. there is only one way in which $\mathbf{0}$ can be represented as a linear combination of them. Otherwise they are dependent (equivalently, at least one is a linear combination of the others).

(iii) A spanning set whose elements are independent is a basis of $U$. Thus if $\{\mathbf{v}_1, \ldots, \mathbf{v}_m\}$ is a basis, any $\mathbf{v} \in U$ is uniquely (why?) representable as a linear combination of these basis elements.
— Fact 1: Every vector space has a basis. No proper subset of a basis can span the entire space (why not?).

— Fact 2: If $U$ has a basis of size $m$, then any $m + 1$ elements of $U$ are dependent. (Obvious if these include the basis; the proof is a bit lengthy otherwise.)

∗ Definition: The dimension of $U$ is the unique size of a basis. Uniqueness is a consequence of the preceding two statements.

∗ Another consequence: If $\dim(U) = m$, then any $m$ independent vectors in $U$ form a basis. (If not, then one can augment with elements not spanned to get $m + 1$ independent vectors. This contradicts Fact 2.)
• Let $U$ be a vector subspace of $\mathbb{R}^n$. Suppose that one forms a matrix $X$ by choosing the basis elements of $U$ to be the columns of $X$. Then the interpretations of "spanning" and "independence" in $U$ are, in terms of $X$:

spanning: $X\mathbf{c} = \mathbf{y}$ is solvable (in $\mathbf{c}$) for any $\mathbf{y} \in U$;
independence: $\mathbf{y} = \mathbf{0}$ in the above $\Rightarrow \mathbf{c} = \mathbf{0}$.

If instead we begin with a matrix $X$, then the set of all linear combinations of the columns of $X$ is a vector space (why?), called the column space ($\mathcal{C}(X)$), whose dimension is called the rank of $X$. The independent columns of $X$ form a basis for $\mathcal{C}(X)$.

Results about matrix ranks:

1) $\operatorname{rank}(AB) \le \operatorname{rank}(A)$: Since $\mathcal{C}(AB) \subseteq \mathcal{C}(A)$ (why?),

$$\operatorname{rank}(AB) = \dim(\mathcal{C}(AB)) \le \dim(\mathcal{C}(A)) = \operatorname{rank}(A).$$

(The inequality follows (ass't 1) from Fact 2 above.)

2) The rank of a matrix is at least as large as that of any of its submatrices (you should formulate and prove this).
3) Used often: $\operatorname{rank}(A'A) = \operatorname{rank}(A)$.

Proof: Let $A$ be $n \times p$ with rank $r$. We first show that $\operatorname{rank}(A'A) \ge r$. If the first $r$ columns of $A$ are independent, we can write

$$A = F_{n \times r}(I_r \,\vdots\, J), \qquad A'A = \begin{pmatrix} F'F & F'FJ \\ J'F'F & J'F'FJ \end{pmatrix},$$

where $F$ consists of the $r$ independent columns of $A$ and hence has rank $r$. Now $\operatorname{rank}(A'A) \ge \operatorname{rank}(F'F)$ (why?) and all $r$ columns of $F'F$ are independent:

$$F'F\mathbf{x} = \mathbf{0} \;\overset{?}{\Rightarrow}\; \|F\mathbf{x}\| = 0 \;\overset{?}{\Rightarrow}\; \mathbf{x} = \mathbf{0}.$$

For the general case first permute the columns of $A$: $A \to AQ$, where the first $r$ columns of $AQ$ are independent. (How? What kind of matrix $Q$ would accomplish this?) Then $AQ = F_{n \times r}(I_r \,\vdots\, J)$ and as above

$$r \le \operatorname{rank}(Q'A'AQ) \overset{?}{=} \operatorname{rank}(A'A).$$

Now write (either $Q'A'AQ$ or) $A'A$ as $(I_r \,\vdots\, J)'\,F'F\,(I_r \,\vdots\, J)$. By 1),

$$\operatorname{rank}(A'A) \le \operatorname{rank}\left((I_r \,\vdots\, J)'\right) = r$$

(why?).
4) By 3), then 1), $\operatorname{rank}(A) = \operatorname{rank}(A'A) \le \operatorname{rank}(A')$; replacing $A$ by $A'$ gives $\operatorname{rank}(A') \le \operatorname{rank}(A)$ and so

$$\operatorname{rank}(A') = \operatorname{rank}(A);$$

i.e. row rank = column rank = # of independent rows or columns. Thus, from now on, 'rank' can mean either row rank or column rank.

5) $\operatorname{rank}(AB) \le \min(\operatorname{rank}(A), \operatorname{rank}(B))$.

Proof: That $\operatorname{rank}(AB) \le \operatorname{rank}(A)$ has been shown. Using 4) and 1),

$$\operatorname{rank}(AB) = \operatorname{rank}(B'A') \le \operatorname{rank}(B') = \operatorname{rank}(B).$$
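The rank results 3), 4) and 5) are easy to spot-check numerically; a small NumPy sketch (the particular matrices are arbitrary, constructed so that $A$ has rank 2):

```python
import numpy as np

rng = np.random.default_rng(3)
# A rank-2 matrix: the product of an n x 2 and a 2 x p factor.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))
B = rng.standard_normal((4, 5))

rank = np.linalg.matrix_rank
assert rank(A) == 2
assert rank(A.T @ A) == rank(A)                # result 3)
assert rank(A.T) == rank(A)                    # result 4)
assert rank(A @ B) <= min(rank(A), rank(B))    # result 5)
```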
• A square, full rank matrix has an inverse.

Proof: We are to show that if $A_{n \times n}$ has full rank then there is an 'inverse' $B$ with the property that $AB = BA = I_n$. The columns of $A$ are independent, hence form a basis of $\mathbb{R}^n$ (why?). Thus they span: the equations

$$A\left[\mathbf{b}_1 \cdots \mathbf{b}_n\right] = \left[\mathbf{e}_1 \cdots \mathbf{e}_n\right] = I_n$$

are all solvable. We write $\left[\mathbf{b}_1 \cdots \mathbf{b}_n\right] = B$; then $AB = I_n$. The matrix $B$ is square, full rank (why?) and so it also has an inverse on the right: there is $C_{n \times n}$ with $BC = I_n$. Now show that $C = A$; thus $AB = BA = I_n$ and so $B = A^{-1}$. □
• Fact: A square matrix has full rank iff it has a non-zero determinant.

— The determinant $|A|$ is a particular sum of products of the elements of $A_{n \times n}$. (Details in text.) Each product contains $n$ factors; there is one from each row and one from each column. It is a measure of the "size" of the matrix, in a geometrical sense.

• A consequence of the preceding is that if $X_{n \times p}$ has independent columns, so rank $p$, then $X'X$ is invertible. In a regression framework this can be interpreted in terms of information duplicated by dependent columns.
3. Orthogonality; Gram-Schmidt method; QR-decomposition

• Hat matrix: Consider a regression model $\mathbf{y} = X\boldsymbol{\theta} + \boldsymbol{\varepsilon}$ with $X_{n \times p}$ of full rank $p$. We will later show that the LSEs are

$$\hat{\boldsymbol{\theta}} = \left(X'X\right)^{-1}X'\mathbf{y},$$

so that the estimate of $E[\mathbf{y}] = X\boldsymbol{\theta}$ is $\hat{\mathbf{y}} = X\hat{\boldsymbol{\theta}} = H\mathbf{y}$, where

$$H_{n \times n} = X\left(X'X\right)^{-1}X'$$

is the "hat" matrix; it "places the hat on $\mathbf{y}$". Properties:

$$H = H' = H^2 \quad \text{("idempotent")},$$
$$HX = X,$$
$$(I - H)X = \mathbf{0},$$
$$(I - H)^2 = (I - H),$$
$$H(I - H) = \mathbf{0}.$$
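These hat-matrix properties follow directly from the definition and are easy to confirm numerically (an illustrative NumPy sketch; the explicit inverse is fine for a demonstration, though the QR route developed below is preferred in practice):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 3))              # full column rank (w.p. 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix

assert np.allclose(H, H.T)                   # H = H'
assert np.allclose(H, H @ H)                 # H = H^2 (idempotent)
assert np.allclose(H @ X, X)                 # HX = X
assert np.allclose((np.eye(8) - H) @ X, 0)   # (I - H)X = 0
```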
• Angle between nonzero vectors $\mathbf{x}, \mathbf{y}$ is defined by

$$\cos\phi = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}.$$

That such an angle exists is equivalent to the statement that $|\mathbf{x}'\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\|$. This in turn is a version of the famous Cauchy-Schwarz Inequality, to be studied later.

Proof of this version: For any real number $t$,

$$0 \le \|\mathbf{x} + t\mathbf{y}\|^2 = \|\mathbf{y}\|^2 t^2 + 2t\,\mathbf{x}'\mathbf{y} + \|\mathbf{x}\|^2,$$

so that there is at most one real zero $t$. Thus "$b^2 - 4ac$" $\le 0$, i.e.

$$4\left(\left(\mathbf{x}'\mathbf{y}\right)^2 - \|\mathbf{x}\|^2\|\mathbf{y}\|^2\right) \le 0.$$

— Equality in "$|\mathbf{x}'\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\|$" implies that $\|\mathbf{x} + t_0\mathbf{y}\|^2 = 0$ for some $t_0$ ($= \pm\|\mathbf{x}\|/\|\mathbf{y}\|$), so that $\mathbf{x}$ and $\mathbf{y}$ are proportional. The converse holds as well (you should verify this). □
• Two vectors are orthogonal if the angle between them $= \pm\pi/2$, equivalently if their scalar product $= 0$. We write $\mathbf{x} \perp \mathbf{y}$.

— Example: If $\mathbf{z}$ is any $n \times 1$ vector, and $H$ is a hat matrix, then

$$\mathbf{z} = H\mathbf{z} + (I - H)\mathbf{z} = \mathbf{z}_1 + \mathbf{z}_2,$$

say, where $\mathbf{z}_1 \perp \mathbf{z}_2$. The first is in $\operatorname{col}(X)$ (why?) and the second is in the space of vectors orthogonal to every vector in $\operatorname{col}(X)$. We write $\mathbf{z}_2 \in \operatorname{col}(X)^{\perp}$. You should verify that this is a vector space (i.e. is closed under addition and scalar multiplication).

• A matrix $Q_{n \times n}$ is orthogonal if the columns are mutually orthogonal, and have unit norm. Equivalently (why?)

$$QQ' = Q'Q = I_n.$$

If $Q$ is orthogonal then $\|Q\mathbf{y}\| = \|\mathbf{y}\|$ for any $n \times 1$ vector $\mathbf{y}$: "norms are preserved". Similarly, angles between vectors are also preserved
(why?). Geometrically, an orthogonal transformation is a "rigid motion": it corresponds to a rotation and/or an interchange of two or more axes. Rotation through an angle $\phi$ in the plane:

$$Q = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}.$$

Interchange of axes in the plane:

$$Q = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
• Orthogonal spaces in a regression context. Suppose $X_{n \times p}$ has independent columns, so $\mathcal{C}(X)$ has dimension $p$. Note that (how?) the orthogonal complement is

$$\mathcal{C}(X)^{\perp} = \left\{\mathbf{y} \mid X'\mathbf{y} = \mathbf{0}\right\}.$$

Then $\mathcal{C}(X)^{\perp} = \mathcal{C}(I - H)$:

$$\mathbf{y} \in \mathcal{C}(X)^{\perp} \Rightarrow X'\mathbf{y} = \mathbf{0} \Rightarrow (I - H)\mathbf{y} = \mathbf{y};$$
$$\mathbf{y} \in \mathcal{C}(I - H) \Rightarrow H\mathbf{y} = \mathbf{0} \Rightarrow X'\mathbf{y} = \mathbf{0}.$$

Thus $\dim\left(\mathcal{C}(X)^{\perp}\right) = \operatorname{rank}(I - H)$.
— The trace of a square matrix is the sum of its diagonal elements. You should verify that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$. Thus products within traces can be rearranged cyclically:

$$\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA), \quad \text{but not necessarily} = \operatorname{tr}(ACB).$$

It will be shown that for an idempotent matrix, rank = trace. A consequence is that

$$\operatorname{rank}(I - H) = \operatorname{tr}(I - H) = n - \operatorname{tr}(H) = n - \operatorname{tr}\left(X\left(X'X\right)^{-1}X'\right) = n - \operatorname{tr}\left(X'X\left(X'X\right)^{-1}\right) = n - p.$$

Similarly $\operatorname{rank}(X) = \operatorname{rank}(H)$ and $\operatorname{rank}(H) = \operatorname{tr}(H) = p$.
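The rank/trace bookkeeping above can be confirmed numerically (an illustrative NumPy sketch with an arbitrary full-rank design):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 3
X = rng.standard_normal((n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.isclose(np.trace(H), p)                       # tr(H) = p
assert np.isclose(np.trace(np.eye(n) - H), n - p)       # tr(I - H) = n - p
assert np.linalg.matrix_rank(np.eye(n) - H) == n - p    # rank = trace for idempotents
```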
• Gram-Schmidt Theorem: Every $m$-dimensional vector space $U$ has an orthonormal basis.

Proof: Start with any basis $\mathbf{v}_1, \ldots, \mathbf{v}_m$. Normalize $\mathbf{v}_1$ to get a unit vector (i.e. a vector with unit norm) $\mathbf{q}_1 \in U$; in general suppose that mutually orthogonal unit vectors $\mathbf{q}_1, \ldots, \mathbf{q}_k$ have been constructed, with $\mathbf{q}_j$ a linear combination of $\mathbf{v}_1, \ldots, \mathbf{v}_j$. Define $Q_k = (\mathbf{q}_1, \ldots, \mathbf{q}_k)$; this has orthonormal columns and so $Q_k'Q_k = I_k$. Define also

$$H_k = Q_kQ_k' = \text{hat matrix arising from } Q_k,$$
$$\mathbf{q}_{k+1} = \frac{\left(I - H_k\right)\mathbf{v}_{k+1}}{\left\|\left(I - H_k\right)\mathbf{v}_{k+1}\right\|}. \tag{3.1}$$

For instance, $\mathbf{q}_2 = \ldots$.

Then: (i) the numerator in (3.1) is a linear combination of $\mathbf{v}_1, \ldots, \mathbf{v}_{k+1}$ in which at least one coefficient, that of $\mathbf{v}_{k+1}$, is non-zero (why?), so that in particular the denominator is non-zero; (ii) $\mathbf{q}_{k+1} \perp \mathbf{q}_1, \ldots, \mathbf{q}_k$ ($\overset{?}{\Leftarrow} \mathbf{q}_{k+1}'H_k = \mathbf{0}'$). Continuing this process results in $m$ mutually orthogonal unit vectors $\mathbf{q}_1, \ldots, \mathbf{q}_m \in U$. Since these are orthogonal they are independent (why?) and so form a basis of $U$.
— There is a nice geometric interpretation. The matrix $H_k$ is idempotent (in fact it is the "hat" matrix arising from the $n \times k$ matrix with columns $\mathbf{q}_1, \ldots, \mathbf{q}_k$), and $H_k\mathbf{v}_{k+1}$ is the "projection of $\mathbf{v}_{k+1}$ onto the space spanned by $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$". Thus we say that $\mathbf{q}_{k+1}$ is formed by "subtracting from $\mathbf{v}_{k+1}$ its projection onto the space spanned by $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$", so as to make what is left orthogonal to this space (and then normalizing).
• QR-decomposition. In the previous construction, at each stage, $\mathbf{q}_k$ was obtained as a linear combination of $\mathbf{v}_1, \ldots, \mathbf{v}_k$. Thus if $V_{n \times m}$ has these vectors as its columns, and $Q_{n \times m} = (\mathbf{q}_1, \ldots, \mathbf{q}_m)$, we can write

$$V_{n \times m} U_{m \times m} = Q_{n \times m}$$

for $U$ upper triangular with positive diagonal elements ($u_{k+1,k+1} = 1/\left\|\left(I - H_k\right)\mathbf{v}_{k+1}\right\| > 0$). Then $U$ is nonsingular and $V = QR$ for $R = U^{-1}$. (Note that $R$ is also upper triangular with positive diagonal elements $r_{k+1,k+1} = \left\|\left(I - H_k\right)\mathbf{v}_{k+1}\right\|$.)
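In practice one uses a library routine rather than hand-rolled Gram-Schmidt; NumPy's QR (an assumption of these sketches) returns exactly this reduced factorization, though its $R$ is only guaranteed triangular, not sign-normalized:

```python
import numpy as np

rng = np.random.default_rng(7)
V = rng.standard_normal((5, 3))

Q, R = np.linalg.qr(V)              # "reduced" QR: Q is 5x3, R is 3x3

assert np.allclose(Q @ R, V)        # V = QR
assert np.allclose(Q.T @ Q, np.eye(3))
assert np.allclose(R, np.triu(R))   # R upper triangular
```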
4. LSEs; Spectral theory

• Recall the decomposition arising from the Gram-Schmidt Theorem. Let $X_{n \times p}$ have rank $p$. Write $X = Q_1R_1$, where $Q_1: n \times p$ has orthonormal columns, and $R_1: p \times p$ is upper triangular with positive diagonal elements. Apply Gram-Schmidt once again, starting with the $n - p$ independent columns of $I - H$, to obtain $Q_2: n \times (n - p)$ whose columns are orthonormal and are a basis for $\mathcal{C}(X)^{\perp}$. Then $Q = (Q_1 \,\vdots\, Q_2)$ has orthonormal columns and is square, hence is an orthogonal matrix. We have

$$X = (Q_1 \,\vdots\, Q_2)\begin{pmatrix} R_1 \\ \mathbf{0} \end{pmatrix} = QR,$$
$$R'R = R_1'R_1 = X'X,$$
$$\left(X'X\right)^{-1} = R_1^{-1}\left(R_1'\right)^{-1},$$
$$H = Q_1Q_1',$$
$$I - H = Q_2Q_2'.$$
• Return to regression.

(i) Least squares estimation in terms of the hat-matrix decomposition of the norm of the residuals: Note that $\mathbf{x} \perp \mathbf{y} \Rightarrow \|\mathbf{x} + \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2$; then

$$\left\|\mathbf{y} - X\hat{\boldsymbol{\theta}}\right\|^2 = \left\|H\left(\mathbf{y} - X\hat{\boldsymbol{\theta}}\right)\right\|^2 + \left\|(I - H)\left(\mathbf{y} - X\hat{\boldsymbol{\theta}}\right)\right\|^2 = \left\|H\mathbf{y} - X\hat{\boldsymbol{\theta}}\right\|^2 + \left\|(I - H)\mathbf{y}\right\|^2 \ge \left\|(I - H)\mathbf{y}\right\|^2,$$

with equality iff $H\mathbf{y} = X\hat{\boldsymbol{\theta}}$ iff ('if and only if')

$$\hat{\boldsymbol{\theta}} = \left(X'X\right)^{-1}X'\mathbf{y}$$

(how?), the LS estimator. The fitted values are

$$\hat{\mathbf{y}} = X\hat{\boldsymbol{\theta}} = H\mathbf{y}$$

and are orthogonal to the residuals

$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (I - H)\mathbf{y}.$$

We say that $H$ and $I - H$ project the data ($\mathbf{y}$) onto the estimation space and error space, respectively, and that these spaces are orthogonal.
(ii) In terms of the QR-decomposition: we have that

$$\hat{\boldsymbol{\theta}} = R_1^{-1}\left(R_1'\right)^{-1}R_1'Q_1'\mathbf{y} = R_1^{-1}Q_1'\mathbf{y};$$

i.e. $\hat{\boldsymbol{\theta}}$ is the solution to

$$R_1\hat{\boldsymbol{\theta}} = Q_1'\mathbf{y}.$$

Thus compute

$$\mathbf{z}_{n \times 1} = Q'\mathbf{y} = \begin{pmatrix} Q_1'\mathbf{y} \\ Q_2'\mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \end{pmatrix} \begin{matrix} p \times 1 \\ (n-p) \times 1 \end{matrix}.$$

Then backsolve the system of equations $R_1\hat{\boldsymbol{\theta}} = \mathbf{z}_1$. This is numerically stable: no matrix inversions.

• The residual vector is $\mathbf{e} = Q_2\mathbf{z}_2$, with squared norm $\|\mathbf{z}_2\|^2$. The usual estimate of the variance $\sigma^2$ of the random errors is

$$\hat{\sigma}^2 = \frac{\text{SS of residuals}}{n - p} = \frac{\|\mathbf{e}\|^2}{n - p} = \frac{\|\mathbf{z}_2\|^2}{n - p},$$
the mean squared error. We have

$$E[\mathbf{z}] = Q'E[\mathbf{y}] = \begin{pmatrix} Q_1'Q_1R_1\boldsymbol{\theta} \\ Q_2'Q_1R_1\boldsymbol{\theta} \end{pmatrix} = \begin{pmatrix} R_1\boldsymbol{\theta} \\ \mathbf{0} \end{pmatrix},$$

and, using "$\operatorname{cov}[A\mathbf{y}] = A\operatorname{cov}[\mathbf{y}]A'$" (how?) we get

$$\operatorname{cov}[\mathbf{z}] = Q'\operatorname{cov}[\mathbf{y}]Q = Q'\,\sigma^2 I\,Q = \sigma^2 I;$$

hence the elements $z_{p+1}, \ldots, z_n$ of $\mathbf{z}_2$ have mean zero and $\operatorname{var}[z_i] = E\left[z_i^2\right] = \sigma^2$. Thus $\hat{\sigma}^2$ is unbiased:

$$E\left[\hat{\sigma}^2\right] = E\left[\frac{\sum_{i=p+1}^{n} z_i^2}{n - p}\right] = \sigma^2.$$
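The QR route to the LSE can be sketched in a few lines of NumPy (simulated data; `np.linalg.solve` on the triangular $R_1$ plays the role of the backsolve):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

Q1, R1 = np.linalg.qr(X)              # reduced QR: Q1 is n x p, R1 is p x p
z1 = Q1.T @ y
theta_hat = np.linalg.solve(R1, z1)   # backsolve R1 theta = z1; no inverse formed

# Agrees with the normal-equations solution (X'X)^{-1} X'y.
assert np.allclose(theta_hat, np.linalg.solve(X.T @ X, X.T @ y))

# Residuals and the unbiased variance estimate.
e = y - X @ theta_hat
sigma2_hat = e @ e / (n - p)
```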
• Unrelated but nonetheless useful facts: inverses and determinants of matrices in block form. If $P$ and $Q$ are nonsingular, then

$$\det\begin{pmatrix} P & S \\ R & Q \end{pmatrix} = |P| \cdot \left|Q - RP^{-1}S\right| = |Q| \cdot \left|P - SQ^{-1}R\right|$$
and

$$\begin{pmatrix} P & S \\ R & Q \end{pmatrix}^{-1} = \begin{pmatrix} \left(P - SQ^{-1}R\right)^{-1} & -P^{-1}S\left(Q - RP^{-1}S\right)^{-1} \\ -\left(Q - RP^{-1}S\right)^{-1}RP^{-1} & \left(Q - RP^{-1}S\right)^{-1} \end{pmatrix}.$$

How? Verify

$$\begin{pmatrix} I & -SQ^{-1} \\ \mathbf{0} & I \end{pmatrix}\begin{pmatrix} P & S \\ R & Q \end{pmatrix}\begin{pmatrix} I & \mathbf{0} \\ -Q^{-1}R & I \end{pmatrix} = \begin{pmatrix} P - SQ^{-1}R & \mathbf{0} \\ \mathbf{0} & Q \end{pmatrix},$$

etc.

Example:

$$\det\begin{pmatrix} I_n & \mathbf{1} \\ \mathbf{1}' & -1 \end{pmatrix} = |I_n| \cdot \left|-1 - \mathbf{1}'\mathbf{1}\right| = -1 - n.$$
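Both forms of the block determinant identity are easy to verify numerically (an illustrative sketch; the diagonal shift below is just a cheap way to make the blocks $P$ and $Q$ nonsingular):

```python
import numpy as np

rng = np.random.default_rng(9)
P = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # nonsingular block
Q = rng.standard_normal((2, 2)) + 3 * np.eye(2)   # nonsingular block
S = rng.standard_normal((3, 2))
R = rng.standard_normal((2, 3))

M = np.block([[P, S], [R, Q]])
det, inv = np.linalg.det, np.linalg.inv
assert np.isclose(det(M), det(P) * det(Q - R @ inv(P) @ S))
assert np.isclose(det(M), det(Q) * det(P - S @ inv(Q) @ R))
```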
• Spectral theory for real, symmetric matrices. First let $M_{n \times n}$ be any square matrix. For a variable $\lambda$ the determinant $|M - \lambda I_n|$ is a polynomial in $\lambda$ of degree $n$, called the characteristic polynomial. The equation

$$|M - \lambda I_n| = 0$$

is the characteristic equation. The Fundamental Theorem of Algebra states that there are then $n$ (real or complex) roots of this equation. Any such root is called an eigenvalue of $M$. If $\lambda$ is an eigenvalue then $M - \lambda I_n$ is singular, so the columns are dependent:

$$(M - \lambda I_n)\mathbf{v} = \mathbf{0} \tag{4.1}$$

for some non-zero vector $\mathbf{v}$ (possibly complex), called the eigenvector corresponding to, or belonging to, $\lambda$. Thus

$$M\mathbf{v} = \lambda\mathbf{v}. \tag{4.2}$$
• Now suppose that $M$ is symmetric (and real). Then the eigenvalues (hence the eigenvectors as well) are real. To see this, define an operation $A^*$ by taking a transpose and a complex conjugate:

$$\left(A^*\right)_{ij} = \bar{A}_{ji}.$$

Note that $(AB)^* = B^*A^*$ and that $\mathbf{v}^*\mathbf{v} = \sum_i |v_i|^2$ is real. For a real symmetric matrix $M$ we have (why?) $M^* = M$. Thus in (4.2),

$$\mathbf{v}^*M\mathbf{v} = \lambda\,\mathbf{v}^*\mathbf{v};$$

taking the conjugate transpose of each side gives

$$\mathbf{v}^*M\mathbf{v} = \bar{\lambda}\,\mathbf{v}^*\mathbf{v}.$$

Thus $\left(\lambda - \bar{\lambda}\right)\mathbf{v}^*\mathbf{v} = 0$, so that (why?) $\lambda$ is real.

• We can, and from now on will, assume that any eigenvector has unit norm.
• Eigenvectors corresponding to distinct eigenvalues are orthogonal. Reason: If $M\mathbf{v}_i = \lambda_i\mathbf{v}_i$ for $i = 1, 2$ and $\lambda_1 \ne \lambda_2$, then

$$\mathbf{v}_1'M\mathbf{v}_2 = \mathbf{v}_1'\left(M\mathbf{v}_2\right) = \lambda_2\,\mathbf{v}_1'\mathbf{v}_2 \quad\text{and}\quad = \left(\mathbf{v}_1'M\right)\mathbf{v}_2 = \lambda_1\,\mathbf{v}_1'\mathbf{v}_2;$$

thus $\left(\lambda_1 - \lambda_2\right)\mathbf{v}_1'\mathbf{v}_2 = 0$ and so $\mathbf{v}_1'\mathbf{v}_2 = 0$.

• If $\lambda$ is a multiple root of the characteristic equation, with multiplicity $m$, then the set of corresponding eigenvectors is a vector space (you should verify this) with dimension $m$.

— The proof that the dimension is $m$ requires some work, and uses two results being established in Assignment 1, and so it is added as an addendum (which you should read) to that assignment.

— By Gram-Schmidt, therefore, there are $m$ orthogonal eigenvectors corresponding to $\lambda$.
• Spectral Decomposition Theorem for real, symmetric matrices: Let $M_{n \times n}$ be real and symmetric, with eigenvalues $\lambda_1, \ldots, \lambda_n$ and corresponding orthogonal eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ with unit norms. Put

$$V_{n \times n} = (\mathbf{v}_1 \cdots \mathbf{v}_n),$$

an orthogonal matrix. Let $D_{\lambda}$ be the diagonal matrix with $\lambda_1, \ldots, \lambda_n$ on the diagonal. Since

$$MV = (\lambda_1\mathbf{v}_1 \cdots \lambda_n\mathbf{v}_n) = (\mathbf{v}_1 \cdots \mathbf{v}_n)\begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix} = VD_{\lambda},$$

we have

$$M = VD_{\lambda}V'. \tag{4.3}$$

We say that "a real symmetric matrix is orthogonally similar to a diagonal matrix".
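The decomposition (4.3) is exactly what NumPy's symmetric eigensolver returns; a quick sketch (the symmetrization of a random matrix is just to manufacture a test case):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((4, 4))
M = (A + A.T) / 2                        # real symmetric

lam, V = np.linalg.eigh(M)               # eigenvalues (ascending), eigenvectors

assert np.allclose(V @ np.diag(lam) @ V.T, M)   # M = V D V'   (4.3)
assert np.allclose(V.T @ V, np.eye(4))          # V orthogonal
assert np.all(np.isreal(lam))                   # real eigenvalues
```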
• In a sense that will become clear, the importance of this result is that a real, symmetric matrix is "almost" diagonal. Thus when solving problems concerning real symmetric matrices it is very often useful to solve them first for diagonal matrices. This is frequently quite simple, and then extends to the general case via (4.3).

• In the construction above we could have assumed, and sometimes will assume, that the eigenvalues were ordered before they and the eigenvectors were labelled: $\lambda_1 \ge \cdots \ge \lambda_n$.
5. Examples & applications

Consequences of the spectral decomposition of a real, symmetric matrix $M_{n \times n}$. Recall that we showed $M = VD_{\lambda}V'$, for an orthogonal $V = [\mathbf{v}_1 \cdots \mathbf{v}_n]$ (the orthonormal eigenvectors) and $D_{\lambda} = \operatorname{diag}(\lambda_1 \ge \cdots \ge \lambda_n)$ (the eigenvalues).

• Bounds on eigenvalues. We have

$$\max_{\|\mathbf{x}\|=1} \mathbf{x}'M\mathbf{x} = \max_{\|\mathbf{x}\|=1} \mathbf{x}'VD_{\lambda}V'\mathbf{x} = \max_{\|\mathbf{y}\|=1} \mathbf{y}'D_{\lambda}\mathbf{y} \quad\text{(why?)} = \max\left\{\sum_{i=1}^{n} \lambda_i y_i^2 \;\Big|\; \sum_{i=1}^{n} y_i^2 = 1\right\}.$$

It is easy to guess the solution. The maximum is (what?), attained at $\mathbf{y} =$ (what?); hence the maximizing $\mathbf{x}$ is the corresponding eigenvector. An analogous result holds for $\min_{\|\mathbf{x}\|=1} \mathbf{x}'M\mathbf{x}$. (You should write it out and prove it.)
• Positive definite matrices. If a symmetric matrix $M$ is such that $\mathbf{x}'M\mathbf{x} \ge 0$ for all $\mathbf{x}$, we say that $M$ is positive semi-definite (p.s.d.) or non-negative definite (n.n.d.). We write $M \ge 0$. (The text reserves the term p.s.d. for the case in which equality is attained for at least one non-zero $\mathbf{x}$; this convention is somewhat unusual and won't be followed here.) The preceding discussion shows (how?) that $M$ is p.s.d. iff all eigenvalues are non-negative.

If $\mathbf{x}'M\mathbf{x} > 0$ for all $\mathbf{x} \ne \mathbf{0}$, we say that $M$ is positive definite (p.d.). We write $M > 0$. Equivalently, all eigenvalues are positive.

— Geometric interpretation: If $M > 0$ then $|M| > 0$ (why?) and the set

$$\left\{\mathbf{x} \mid \mathbf{x}'M^{-1}\mathbf{x} = c^2\right\}$$

is transformed, via the (orthogonal) transformation $\mathbf{y} = V'\mathbf{x}$, into the set

$$\left\{\mathbf{y} \;\Big|\; \sum_{i=1}^{n} \frac{y_i^2}{\lambda_i} = c^2\right\}.$$
This is the ellipsoid in $\mathbb{R}^n$ with semi-axes of lengths $c\sqrt{\lambda_i}$ along the coordinate axes (and volume $\propto \sqrt{|M|}$). Thus (why?) the original set, obtained from the second via the transformation $\mathbf{x} = V\mathbf{y}$, is an ellipsoid as well, whose semi-axes have the same lengths but are now in the directions of the eigenvectors of $M$.

The following three results illustrate the adage that "a symmetric matrix is almost diagonal".
• Matrix square roots. Can we define a notion of the square root of a (n.n.d.) matrix? Start by thinking of a diagonal matrix, in which case the method is obvious: if $D$ is a diagonal matrix with non-negative diagonal elements, then we can define the square root $D^{1/2}$ to be the diagonal matrix with the roots of these elements on its diagonal. Now extend to the general case. If $M \ge 0$ we write $M = VD_\lambda V'$, where $V$ is orthogonal and $D_\lambda$ has a non-negative diagonal. We define a symmetric, p.s.d. square root of $M$ by
$$M^{1/2} = VD_\lambda^{1/2}V'.$$
— There are other roots, for instance $P = VD_\lambda^{1/2}W$ for any orthogonal $W$ (then $PP' = M$), but we will generally mean the one above.
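The construction $M^{1/2} = VD_\lambda^{1/2}V'$ can be sketched numerically. A minimal numpy illustration (the matrix $A$ below is an arbitrary example, not from the text):

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric p.s.d. square root M^{1/2} = V D^{1/2} V' of an n.n.d. matrix M."""
    lam, V = np.linalg.eigh(M)         # spectral decomposition: M = V diag(lam) V'
    lam = np.clip(lam, 0.0, None)      # guard against tiny negative round-off
    return V @ np.diag(np.sqrt(lam)) @ V.T

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # symmetric, eigenvalues 1 and 3 > 0
R = sym_sqrt(A)
print(np.allclose(R @ R, A))          # R is a square root of A
print(np.allclose(R, R.T))            # and is symmetric
```

Any $P = VD_\lambda^{1/2}W$ with $W$ orthogonal would also satisfy $PP' = A$; the choice $W = V'$ gives the symmetric root above.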
• The rank of a real symmetric matrix equals the number of non-zero eigenvalues. Reason: If $M = VD_\lambda V'$ then the rank of $M$ equals the rank of $D_\lambda$ (why?), and the latter is clearly (is it?) the number of non-zero diagonal elements.
— Note also that if $M = VDV'$ is the spectral decomposition then $M$ and $D$ have the same eigenvalues, namely the diagonal elements of $D$. This is because the characteristic polynomials are the same:
$$|M - \lambda I| = \big|V(D - \lambda I)V'\big| = |D - \lambda I|.$$
• If $H$ is idempotent then (i) all eigenvalues are 0 or 1, and (ii) rank = trace. Reason: (i) It is clearly true (how?) for diagonal idempotents. But if $H$ is idempotent then $H = VD_\lambda V'$ where $D_\lambda$ is idempotent, and $H$ has the same eigenvalues as $D_\lambda$. (ii) $\operatorname{rank}(H) = \operatorname{rank}(D_\lambda) = \operatorname{tr}(D_\lambda) = \operatorname{tr}(H)$ (how are these steps justified?).
— Another interesting property, previously established via Gram-Schmidt: We can partition $D_\lambda$, and compatibly partition $V$, as
$$D_\lambda = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}, \qquad V = (V_1 \,\vdots\, V_2),$$
where $r = \operatorname{rank}(H)$ and $V_1$ is $n \times r$. This results in the decomposition of an idempotent matrix as
$$H = V_1V_1' \quad \text{where } V_1'V_1 = I_r.$$
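The rank = trace property is easy to verify numerically. A sketch using the familiar "hat" projection matrix $H = X(X'X)^{-1}X'$ of an arbitrary full-rank design matrix $X$ (an assumed example, chosen only because it is idempotent):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))              # arbitrary 10 x 3 design, rank 3
H = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto the column space of X

print(np.allclose(H @ H, H))                  # idempotent
eig = np.sort(np.linalg.eigvalsh(H))
print(np.allclose(eig[-3:], 1.0), np.allclose(eig[:-3], 0.0))  # eigenvalues 0 or 1
print(round(np.trace(H)), np.linalg.matrix_rank(H))            # trace = rank = 3
```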
Application 1. Illustration of the preceding theory: the two-population classification problem. Suppose we are given lengths and widths of prehistoric skulls, of type A or B (the "training sample"). We know that $n_1$ of these, say $x_1, \dots, x_{n_1}$, are of type A, and $n_2 = n - n_1$, say $y_1, \dots, y_{n_2}$, are of type B. Now we find a new skull, with length and width the components of $z$. We are to classify it as A or B. (Other applications: rock samples in geology, risk data in an actuarial analysis, etc.)

• Reduce to a univariate problem: $u_i = \alpha'x_i$, $v_i = \alpha'y_i$ for some vector $\alpha$. Put $w = \alpha'z$ and classify the new skull as A if $|w - \bar u| < |w - \bar v|$.

• Choose $\alpha$ for "maximal separation": $|\bar u - \bar v|$ should be large relative to the underlying variation. Put
$$s_1^2 = \frac{1}{n_1 - 1}\sum_i (u_i - \bar u)^2 = \frac{1}{n_1 - 1}\sum_i \big(\alpha'(x_i - \bar x)\big)^2 = \alpha'\Big[\frac{1}{n_1 - 1}\sum_i (x_i - \bar x)(x_i - \bar x)'\Big]\alpha = \alpha'S_1\alpha,$$
and similarly define $s_2^2$ as the variation in the other sample. Choose $\alpha$ to maximize
$$\frac{(\bar u - \bar v)^2}{\big[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\big]\big/(n - 2)} = \frac{\alpha'(\bar x - \bar y)(\bar x - \bar y)'\alpha}{\alpha'S\alpha}, \tag{5.1}$$
where $S$ is the two-sample covariance matrix
$$S = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{n - 2}.$$

• Put $\beta = S^{1/2}\alpha$, $\alpha = S^{-1/2}\beta$, so (5.1) is
$$\frac{\beta'S^{-1/2}(\bar x - \bar y)(\bar x - \bar y)'S^{-1/2}\beta}{\beta'\beta},$$
which is a maximum if $\beta/\|\beta\|$ is the unit eigenvector corresponding to
$$\lambda_{\max}\big(S^{-1/2}(\bar x - \bar y)(\bar x - \bar y)'S^{-1/2}\big) = \lambda_{\max}(aa'),$$
where $a = S^{-1/2}(\bar x - \bar y)$. Note $aa'$ has rank 1, hence has 1 non-zero eigenvalue, necessarily equal (why?) to $\operatorname{tr}(aa')$:
$$\lambda = a'a = (\bar x - \bar y)'S^{-1}(\bar x - \bar y).$$
Now solve
$$aa'\beta = \lambda\beta$$
to get $\beta$ ($\beta =$ what? - guess at a solution); any multiple will do. Then
$$\alpha = S^{-1/2}\beta = S^{-1}(\bar x - \bar y)$$
and we classify as A if
$$|w - \bar u| = \big|\alpha'(z - \bar x)\big| < \big|\alpha'(z - \bar y)\big| = |w - \bar v|.$$
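The rule $\alpha = S^{-1}(\bar x - \bar y)$ can be sketched on simulated data. The two "skull" samples below, and their means, are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 2)) + np.array([5.0, 3.0])   # type A sample
y = rng.standard_normal((60, 2)) + np.array([3.0, 5.0])   # type B sample
n1, n2 = len(x), len(y)
xbar, ybar = x.mean(axis=0), y.mean(axis=0)
S1, S2 = np.cov(x.T), np.cov(y.T)
S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)       # pooled covariance
alpha = np.linalg.solve(S, xbar - ybar)                   # alpha = S^{-1}(xbar - ybar)

def classify(z):
    """Classify z as A if |w - ubar| < |w - vbar|, with w = alpha'z."""
    w = alpha @ z
    return "A" if abs(w - alpha @ xbar) < abs(w - alpha @ ybar) else "B"

print(classify(np.array([5.0, 3.0])), classify(np.array([3.0, 5.0])))
```

A point near the A population's centre is classified A, and near the B centre, B; this is Fisher's linear discriminant rule.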
Application 2. By the Cauchy-Schwarz Inequality,
$$\max_{y} \frac{|x'My|}{\|y\|} = \|M'x\| = \sqrt{x'MM'x}.$$
Related facts: Note that $MM' \ge 0$ (why?). Conversely, any n.n.d. matrix can be represented as $MM'$ (in many ways). In particular, if $S$ is a $n \times n$ n.n.d. matrix of rank $r \le n$, then one can find $M$ ($n \times r$) such that $S = MM'$ and $M'M$ is the $r \times r$ diagonal matrix of the positive eigenvalues of $S$.
Construction: Write $S = VDV'$, where
$$D_{n \times n} = \begin{pmatrix} D_1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad V_{n \times n} = (\underbrace{V_1}_{r} \,\vdots\, \underbrace{V_2}_{n-r}),$$
and $D_1$ is the $r \times r$ diagonal matrix containing the positive eigenvalues. Then $S = V_1D_1V_1'$ and so $M_{n \times r} = V_1D_1^{1/2}$ has the desired properties. (Note also that this is a version of $S^{1/2}$.)
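The construction $M = V_1D_1^{1/2}$ is straightforward to check numerically. A sketch with a rank-2 n.n.d. $3 \times 3$ matrix $S$ built for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 2))
S = B @ B.T                               # n.n.d. of rank r = 2 (almost surely)
lam, V = np.linalg.eigh(S)                # eigenvalues in ascending order
keep = lam > 1e-10                        # positive part of the spectrum
V1, D1 = V[:, keep], np.diag(lam[keep])   # V1 is 3 x 2, D1 is 2 x 2 diagonal
M = V1 @ np.sqrt(D1)                      # M = V1 D1^{1/2}, 3 x 2

print(np.allclose(M @ M.T, S))            # S = MM'
print(np.allclose(M.T @ M, D1))           # M'M = diag(positive eigenvalues of S)
```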
Part II<br />
LIMITS, CONTINUITY,<br />
DIFFERENTIATION
6. Limits; continuity; probability spaces<br />
• Open and closed sets in $\mathbb{R}^n$; limits:
— Neighbourhood of a point $a$, of radius $\epsilon$:
$$N_\epsilon(a) = \{x \mid \|x - a\| < \epsilon\}.$$
— $S \subset \mathbb{R}^n$ is open if
$$a \in S \Rightarrow N_\epsilon(a) \subset S$$
for all sufficiently small $\epsilon > 0$.
∗ Example: $(0, 1)$.
— A sequence $\{x_n\}$ tends to a point $a$: "$x_n \to a$" if $x_n$ gets arbitrarily close to $a$ as $n$ gets larger. More formally, any neighbourhood of $a$, no matter how small, will eventually contain the $x_n$ from some point onward. More formally yet, "for any radius $\epsilon$, we can find an $N$ large enough that, once $n > N$, all of the $x_n$ lie in $N_\epsilon(a)$". This required $N$ will typically get larger as $\epsilon$ gets smaller. Finally,
$$\forall \epsilon \; \exists N = N(\epsilon) \; \big(n > N \Rightarrow x_n \in N_\epsilon(a)\big),$$
read "for all $\epsilon$ there exists an $N$, that depends on $\epsilon$, such that $n > N$ implies that $x_n \in N_\epsilon(a)$".
∗ Equivalently (why?): $x_n \to a \iff \|x_n - a\| \to 0$.
∗ This is for $a$ finite; obvious modifications otherwise. You should derive an appropriate definition of "$x_n \to \infty$" ($x_n$ scalars, not vectors).
∗ Example: $x_n = 1 - \frac{1}{n}$.
— A point $a$ is a limit point of $S \subset \mathbb{R}^n$ if there is a sequence $\{x_n\} \subset S$ such that $x_n \to a$.
∗ Example: $S = \{x \mid x = 1 - \frac{1}{n},\; n = 1, 2, \dots\}$; $x_n = 1 - \frac{1}{n} \to 1$ ($\notin S$).
— $S \subset \mathbb{R}^n$ is closed if it contains all of its limit points.
∗ Examples: $S = \{x \mid x = 1 - \frac{1}{n},\; n = 1, 2, \dots\} \cup \{1\}$, $S = [0, 1]$.
• A function $f(x) \to L$ as $x \to a$ ("$f(x)$ tends to $L$ as $x$ tends to $a$") if we can force $f(x)$ to be arbitrarily close to $L$ by choosing $x$ ($\ne a$) sufficiently close to $a$. Formally,
$$\forall \epsilon \; \exists \delta = \delta(\epsilon, a) \; \big(0 < \|x - a\| < \delta \Rightarrow |f(x) - L| < \epsilon\big).$$
The "$= \delta(\epsilon, a)$" is often omitted (but understood, unless stated otherwise). Note the "$0 < \|x - a\|$": $f(a)$ need not exist.

• Suppose $f(x)$ is defined for $x \in S$, the domain of $f$. Then $f$ is continuous at a point $a \in S$ if $f(x) \to f(a)$ as $x \to a$.
— Note that the definition requires $f$ to be defined at $a$.
— Equivalently,
$$\forall \epsilon \; \exists \delta = \delta(\epsilon, a) \; \big(\|x - a\| < \delta \Rightarrow |f(x) - f(a)| < \epsilon\big).$$

• Example: $f(x) = x^2$, $S = (0, \infty)$. Then if $\delta > 0$ and $|x - y| < \delta$ we have
$$|f(x) - f(y)| = |x - y|\,|x - y + 2y| \le |x - y| \cdot (|x - y| + 2y) < \delta(\delta + 2y),$$
which is $< \epsilon$ if $\delta^2 + 2y\delta - \epsilon < 0$, i.e.
$$0 < \delta < \sqrt{y^2 + \epsilon} - y.$$
Here we used the triangle inequality: $|a + b| \le |a| + |b|$.
• Note $\delta = \delta(\epsilon, y)$. Sometimes the same $\delta$ works for all $y$; if so we say $f$ is uniformly continuous on $S$. E.g. in the previous example, the $\delta$ that is required will $\to 0$ as $y \to \infty$, but suppose $S$ is bounded, say $S = (0, b)$. It can be shown that $\sqrt{y^2 + \epsilon} - y$ is $\downarrow$ in $y$ ("decreasing"), hence
$$\sqrt{y^2 + \epsilon} - y > \sqrt{b^2 + \epsilon} - b > 0$$
for all $y \in S$, so that
$$|x - y| < \delta = \sqrt{b^2 + \epsilon} - b \;\Rightarrow\; |f(x) - f(y)| < \epsilon.$$
Thus $f$ is uniformly continuous on $(0, b)$.

• Formally, in the last example,
$$\sqrt{b^2 + \epsilon} - b = \inf_{y \in (0, b)} \Big(\sqrt{y^2 + \epsilon} - y\Big).$$
For any set $S$, $m$ is a lower bound if $m \le x$ for all $x \in S$. If there is a finite lower bound then there are many; the largest of them is the greatest lower bound (g.l.b.) or infimum (inf). Similarly with upper bound, least upper bound (l.u.b.) or supremum (sup).
Probability spaces, random variables, distribution functions:

We start with a sample space $\Omega$, whose elements are all possible outcomes of an experiment (e.g. toss a coin ten times; $\Omega$ is all possible sequences of $H$s and $T$s). A Borel field or $\sigma$-algebra of events is a collection $\mathcal{B}$ of subsets ("events") of $\Omega$ such that one of its elements is $\Omega$ itself, it is closed under complementation, and closed under the taking of countable unions.

A probability $P$ is a function defined on $\mathcal{B}$ such that $P(\Omega) = 1$, $0 \le P(A) \le 1$, and probabilities of disjoint countable unions are additive. The triple $(\Omega, \mathcal{B}, P)$ is called a probability space. All the usual rules for manipulating probabilities follow from these axioms, e.g. $P(\emptyset) = 0$, $P(A) \le P(B)$ if $A \subset B$. In particular ("continuity of probabilities"):
$$A_n \supseteq A_{n+1} \supseteq \cdots \text{ and } \cap_{n=1}^\infty A_n = A \;\Rightarrow\; P(A_n) \to P(A). \tag{6.1}$$
7. Random variables; distributions; Jensen’s<br />
Inequality; WLLN<br />
• A (real valued, finite) random variable (r.v.) is a function $X : \Omega \to \mathbb{R}$ with the property that if $O$ is any open set, then $X^{-1}(O) = \{\omega \mid X(\omega) \in O\}$ is an event, i.e. a member of $\mathcal{B}$. E.g. $X(\omega) = \#$ of heads in the sequence $\omega$ of tosses. (For a finite sample space we generally take $\mathcal{B} = 2^\Omega$, the set of all subsets of $\Omega$.)
— Note that $O$ is open iff $O^c$ is closed.
Proof: You should show that $O$ open $\Rightarrow$ $O^c$ closed. Conversely, suppose $O^c$ is closed; we are to show that $O$ is open. We will derive a contradiction from the supposition that $O$ is not open. Suppose it isn't; then for some $x \in O$, no $N_\epsilon(x) \subset O$ (no matter how small we choose $\epsilon$). Then in particular $N_{1/n}(x)$ contains points $x_n \in O^c$. Since $\|x_n - x\| < \frac{1}{n} \to 0$, we have $x_n \to x$ and so $x$ is a limit point of $O^c$, hence a member of $O^c$ (why?). This contradicts the fact that $x \in O$, thus completing the proof. $\square$
— Note that $X^{-1}(O^c) = \big\{X^{-1}(O)\big\}^c$:
$$X^{-1}(O^c) = \{\omega \mid X(\omega) \in O^c\} = \{\omega \mid X(\omega) \in O\}^c = \big\{X^{-1}(O)\big\}^c.$$
— By the preceding points, if $C$ is closed then $O = C^c$ is open and so $X^{-1}(C) = \big\{X^{-1}(O)\big\}^c \in \mathcal{B}$: the inverse images of closed sets must also be events.
• Since the set $C = (-\infty, x]$ is closed, so also $X^{-1}(C) = \{\omega \mid X(\omega) \le x\}$ is a member of $\mathcal{B}$, hence has a probability. We write
$$F(x) = P(\{\omega \mid X(\omega) \le x\}) = P(X \le x)$$
and call $F$ the distribution function (d.f.) of the r.v. $X$. Any distribution function is right continuous, in that
$$x_n \downarrow x \;\Rightarrow\; F(x_n) \to F(x).$$
Proof: Recall (6.1) with $C_n = (-\infty, x_n]$, where $x_n \downarrow x$ and $A_n = X^{-1}(C_n)$. Then
$$A = \cap_{n=1}^\infty A_n = \cap_{n=1}^\infty X^{-1}(C_n) = X^{-1}\big(\cap_{n=1}^\infty C_n\big) \text{ (verify this)} = X^{-1}\big((-\infty, x]\big).$$
Thus $P(X \le x_n) = P(A_n) \to P(A) = P(X \le x)$. $\square$
— A d.f. is then a function $F : \mathbb{R} \to [0, 1]$ satisfying (i) $F(-\infty) = 0$, $F(\infty) = 1$; (ii) $F$ is right continuous; (iii) $F$ is weakly increasing: $x < y \Rightarrow F(x) \le F(y)$ (you should show (iii)).
— Recall the notion of expected value, which we defined in terms of a density or probability mass function. If $F(x)$ is differentiable then $f = F'$ is the density, and expectations, probabilities etc. are obtained by integration of $f$. If $F$ is a step function with jumps of height $p_k$ at points $x_k$ ($k = 0, 1, 2, \dots$) then the probability mass function is the function $p(x_k) = p_k$, and expectations, probabilities etc. are obtained by summation over $k$. In the former case we say that $X$ is continuous; in the latter, $X$ is discrete.

• Convex functions: A function $g : I \to \mathbb{R}$ is convex if
$$g\big((1 - t)x + ty\big) \le (1 - t)g(x) + t\,g(y)$$
for all $x, y \in I$, $0 \le t \le 1$. Examples: $x^2$ on $\mathbb{R}$, $-\log x$ on $(0, \infty)$. Convex functions are continuous; if a function has a derivative $g'(x)$ on $I$ which is an increasing (used here and elsewhere in the weak sense) function of $x$, then it is convex.
• Jensen's Inequality: If $X : \Omega \to I \subset \mathbb{R}$ has a finite mean $E[X]$, and if $g$ is convex on $I$, then $E[g(X)] \ge g(E[X])$.
— Application. The arithmetic/geometric mean inequality: if $x_1, \dots, x_n > 0$ then
$$\Big(\prod_{i=1}^n x_i\Big)^{1/n} \le \bar x.$$
Proof: Define a r.v. $X$ by $P(X = x_i) = \frac{1}{n}$ and apply Jensen's Inequality using the convex function $g(x) = -\log x$. $\square$
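The inequality is easy to check numerically; computing the geometric mean through logarithms, as in the proof. A minimal sketch on arbitrary positive data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.1, 10.0, size=1000)       # arbitrary positive sample
geo = np.exp(np.mean(np.log(x)))            # geometric mean via logs (avoids overflow)
ari = np.mean(x)                            # arithmetic mean
print(geo <= ari)                           # Jensen: geometric <= arithmetic
```

Equality holds only when all the $x_i$ are equal, matching the equality case of Jensen's Inequality for a strictly convex $g$.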
• Limits and continuity in probability: Let $\{X_n\}$ be a sequence of r.v.s, e.g. toss a fair coin $n$ times and let $X_n$ denote the proportion of heads in the $n$ tosses. Then $E[X_n] = \frac{1}{2}$ and we expect $X_n$ to be near $\frac{1}{2}$, with high probability, for $n$ large. We say that "$X_n$ converges to a constant $c$ in probability", and write $X_n \overset{p}{\to} c$, if
$$\lim_{n \to \infty} P(|X_n - c| \ge \epsilon) = 0 \quad \text{for any } \epsilon > 0.$$
The Weak Law of Large Numbers states that if $\bar X_n$ is the average of $n$ independent r.v.s $X_1, \dots, X_n$, all with finite mean $\mu$ and variance $\sigma^2$, then $\bar X_n \overset{p}{\to} \mu$.
— e.g. $X_i = I(i\text{-th toss results in a head})$, $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then $X_i = 1, 0$ w.p. $\frac{1}{2}$ each; $\mu = \frac{1}{2}$; by the WLLN $\bar X_n \overset{p}{\to} \frac{1}{2}$.
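The coin-tossing version of the WLLN can be illustrated by Monte Carlo: estimate $P(|\bar X_n - \tfrac{1}{2}| \ge \epsilon)$ over many replications and watch it shrink as $n$ grows (a sketch; the sample sizes and $\epsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
eps, reps = 0.05, 2000
probs = []
for n in (10, 100, 1000):
    # each row is one experiment of n fair coin tosses; take the proportion of heads
    xbar = rng.integers(0, 2, size=(reps, n)).mean(axis=1)
    probs.append(np.mean(np.abs(xbar - 0.5) >= eps))
print(probs)   # decreasing toward 0 as n increases
```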
• This is a basic notion required for the theory of estimation in Statistics.
— e.g. $\sigma^2 = \operatorname{Var}(X) = E\big[(X - \mu)^2\big]$ can be estimated from a sample $X_1, \dots, X_n$ of independent observations $X_i \sim F$ by the sample variance $s^2 = (n - 1)^{-1}\sum_i \big(X_i - \bar X\big)^2$. The adjustment is for bias; disregarding it, the main idea is that averages are consistent estimates of expectations (i.e. they converge in probability to these constants). Then also $s \overset{p}{\to} \sigma$; this is a consequence of the following result.
• If $X_n \overset{p}{\to} c$ and the function $g$ is continuous at $c$, then $g(X_n) \overset{p}{\to} g(c)$.
Proof: We want to show that
$$P\big(|g(X_n) - g(c)| \ge \epsilon\big) \to 0.$$
Use the continuity of $g$ to find $\delta > 0$ such that
$$|x - c| < \delta \;\Rightarrow\; |g(x) - g(c)| < \epsilon.$$
Then
$$P(|X_n - c| < \delta) \le P\big(|g(X_n) - g(c)| < \epsilon\big).$$
Here we use the fact that if one event implies another, it has a smaller probability (i.e. $A \subset B \Rightarrow P(A) \le P(B)$). Since the first probability $\to 1$, so does the second (why?). $\square$
8. Differentiation; Mean Value and Taylor’s<br />
Theorems<br />
• Let $f : S \subset \mathbb{R} \to \mathbb{R}$ be defined in a neighbourhood $N_\epsilon(x_0)$; put
$$\phi(h) = \frac{f(x_0 + h) - f(x_0)}{h}$$
("Newton's quotient"). If $\phi(h)$ has a limit as $h \to 0$ we call it the derivative $f'(x_0)$ of $f$ at $x_0$, also written $\frac{d}{dx}\big(f(x)\big)\big|_{x = x_0}$.

• Examples: $f(x) = x^2$, $f(x) = |x|$. The former is differentiable everywhere in $\mathbb{R}$; the latter everywhere except $x = 0$.

• Differentiability $\Rightarrow$ Continuity: If $f'(x_0)$ exists then $f$ is continuous at $x_0$.
Proof:
$$|f(x_0 + h) - f(x_0)| = |h\,\phi(h)| \to 0$$
as $h \to 0$. $\square$
• Linearity, product, quotient, chain rules - read in text. They allow us to build up a stock of differentiable functions from simpler ones, and also show how the derivative of the more complicated function can be gotten from those of the simpler ones.

• Relation to monotonicity: if $f \nearrow$ on $(a, b)$ and is differentiable there, then $f'(x) \ge 0$ on $(a, b)$.
Proof: As $h \downarrow 0$ the numerator of $\phi(h)$ is $\ge 0$, hence $f'(x) = \lim_{h \downarrow 0} \phi(h) \ge 0$. (Similarly $\lim_{h \uparrow 0} \phi(h) \ge 0$.)

• If $f$ is continuous on $[a, b]$ then the inf and sup are finite, and are attained: there are points $x_m, x_M \in [a, b]$ with $f(x_m) \le f(x) \le f(x_M)$ for all $x$. Show: If a max or min is in the open interval $(a, b)$, then $f' = 0$ there (if $f'$ exists).
• Mean Value Theorem: If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, then $\exists c \in (a, b)$ with $f(b) = f(a) + f'(c)(b - a)$. This is a result of crucial importance in the approximation of functions. "Differentiable functions are locally almost linear."
— Follows from the previous bullet applied to
$$g(x) = f(x) - \Big(f(a) + \frac{f(b) - f(a)}{b - a}(x - a)\Big).$$
— Restatement: $f(y) \approx f(x) + f'(x)(y - x)$ if $|y - x|$ is small and $f'$ is continuous (since $|f'(c) - f'(x)| \to 0$ as $|y - x| \to 0$). The result is that $f(y)$ is approximately linear near $y = x$, with slope $f'(x)$. The next result (Taylor's Theorem) strengthens this statement and also allows us to assess the error in this approximation.
— A consequence of the MVT is that if $f'(x) \ge 0$ on $(a, b)$ then $f \nearrow$ there: suppose $x_1 < x_2$; then
$$f(x_2) = f(x_1) + f'(c)(x_2 - x_1) \ge f(x_1).$$
• Taylor's Theorem: "Sufficiently smooth functions can be approximated locally by polynomials." Suppose $f(x)$ has $n$ derivatives on $(a, b)$, with $f^{(n-1)}(x)$ continuous on $[a, b]$. (We put $f^{(0)}(x) = f(x)$; the assumptions imply existence and continuity of $f^{(k)}(x)$ on $(a, b)$ for $k < n$.) Then for $x, x_0 \in [a, b]$ there is a point $c$ between $x$ and $x_0$ such that
$$f(x) = \sum_{k=0}^{n-1} f^{(k)}(x_0)\frac{(x - x_0)^k}{k!} + f^{(n)}(c)\frac{(x - x_0)^n}{n!}.$$
— Example: $f(x) = \log(1 + x)$ with $|x| < 1$; expand around $x_0 = 0$: $f(0) = 0$ and for $k > 0$:
$$f^{(k)}(x) = \frac{(-1)^{k+1}(k - 1)!}{(1 + x)^k},$$
so that
$$f^{(k)}(0) = (-1)^{k+1}(k - 1)!.$$
Then
$$\log(1 + x) = \sum_{k=1}^{n-1} (-1)^{k+1}\frac{x^k}{k} + \frac{(-1)^{n+1}x^n}{n(1 + c)^n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots + (-1)^n\frac{x^{n-1}}{n - 1} + \frac{(-1)^{n+1}x^n}{n(1 + c)^n}$$
for some $c$ between $0$ and $x$, i.e. with $|c| < |x|$. Write this as
$$\log(1 + x) = s_n(x) + r_n(x);$$
if $r_n(x) \to 0$ as $n \to \infty$ we say that the series $\lim_{n \to \infty} s_n(x) = \sum_{k=1}^\infty (-1)^{k+1}\frac{x^k}{k}$ 'represents the function' $\log(1 + x)$.
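The shrinking remainder $r_n(x)$ can be seen directly by comparing the partial sums $s_n(x) = \sum_{k=1}^{n}(-1)^{k+1}x^k/k$ with the library logarithm (a sketch; the value $x = 0.5$ and the truncation orders are arbitrary):

```python
import math

def s_n(x, n):
    """Partial sum of the Taylor series for log(1+x) around 0."""
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, n + 1))

x = 0.5
errors = [abs(s_n(x, n) - math.log(1 + x)) for n in (5, 10, 20)]
print(errors)   # decreasing: the remainder r_n(x) -> 0 for |x| < 1
```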
— Proof of Taylor's Theorem: For $n = 1$ this is the MVT; assume $n > 1$. For $t \in [a, b]$ put
$$F(t) = f(x) - f(t) - \sum_{k=1}^{n-1} f^{(k)}(t)\frac{(x - t)^k}{k!}.$$
We want to show that $F(x_0)$, which is
$$F(x_0) = f(x) - \sum_{k=0}^{n-1} f^{(k)}(x_0)\frac{(x - x_0)^k}{k!},$$
can also be expressed as
$$F(x_0) = f^{(n)}(c)\frac{(x - x_0)^n}{n!} \tag{8.1}$$
for some $c \in (x_0, x)$. For this, define
$$G(t) = F(t) - \Big(\frac{x - t}{x - x_0}\Big)^n F(x_0)$$
and note that $G(x) = G(x_0) = 0$, and that $G(t)$ is differentiable on $(x_0, x)$ and continuous on $[x_0, x]$. By the MVT there is a point $c \in (x_0, x)$ with
$$G(x) = G(x_0) + G'(c)(x - x_0);$$
thus $G'(c) = 0$:
$$0 = G'(c) = F'(c) + n\frac{(x - c)^{n-1}}{(x - x_0)^n}F(x_0),$$
so
$$F(x_0) = -F'(c)\frac{(x - x_0)^n}{n(x - c)^{n-1}}. \tag{8.2}$$
But
$$F'(t) = -f'(t) - \sum_{k=1}^{n-1}\Bigg[f^{(k+1)}(t)\frac{(x - t)^k}{k!} - f^{(k)}(t)\frac{(x - t)^{k-1}}{(k - 1)!}\Bigg] = -f^{(n)}(t)\frac{(x - t)^{n-1}}{(n - 1)!};$$
this in (8.2) gives (8.1). $\square$
• l'Hospital's Rule: Read Theorem 4.2.6 (or another source) and the examples following it.
— Rough idea: If $f(a) = g(a) = 0$, then
$$\lim_{x \to a}\frac{f(x)}{g(x)} = \lim_{x \to a}\frac{\dfrac{f(x) - f(a)}{x - a}}{\dfrac{g(x) - g(a)}{x - a}} = \lim_{x \to a}\frac{f'(x)}{g'(x)}.$$
— Example: $\lim_{x \to 0}\dfrac{\sin x}{x} = \lim_{x \to 0}\dfrac{\cos x}{1} = 1$.
9. Applications: transformations; variance<br />
stabilization<br />
• Application 1. Distribution of functions of r.v.s. Suppose a r.v. $X$ has a differentiable d.f. $F_X(x)$, with density $f_X(x) = F_X'(x)$. Consider the r.v. $Y = g(X)$ (e.g. $Y = \log X$). First assume $g$ is strictly monotonic ($\uparrow$ or $\downarrow$). The d.f. of $Y$ is
$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = \begin{cases} P\big(X \le g^{-1}(y)\big) & \text{if } g \uparrow \\ P\big(X \ge g^{-1}(y)\big) & \text{if } g \downarrow \end{cases} = \begin{cases} F_X\big(g^{-1}(y)\big) & \text{if } g \uparrow \\ 1 - F_X\big(g^{-1}(y)\big) & \text{if } g \downarrow. \end{cases}$$
Note that the left continuity of $F_X$ is used here:
$$P\big(X \ge g^{-1}(y)\big) \overset{?}{=} 1 - F_X\big(g^{-1}(y)\big).$$
To get the density $f_Y(y)$ of $Y = g(X)$ we must differentiate $F_X\big(g^{-1}(y)\big)$. Write $x = g^{-1}(y)$; then $\big(g^{-1}(y)\big)' = \frac{dx}{dy}$ can be obtained by differentiating the relationship $y = g(x)$:
$$1 = \frac{dy}{dy} = g'(x)\frac{dx}{dy};$$
hence
$$\frac{dx}{dy} = \frac{1}{g'(x)} = \frac{1}{g'\big(g^{-1}(y)\big)}.$$
In the above,
$$f_Y(y) = \begin{cases} f_X\big(g^{-1}(y)\big)\big[g'\big(g^{-1}(y)\big)\big]^{-1} & \text{if } g \uparrow \\ -f_X\big(g^{-1}(y)\big)\big[g'\big(g^{-1}(y)\big)\big]^{-1} & \text{if } g \downarrow. \end{cases}$$
In either event,
$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right|$$
if $y = g(x)$ is strictly monotone, with $x$ expressed in terms of $y$ on the RHS.
— Example: $X > 0$, $Y = -\log X$. Then
$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right| = f_X\big(e^{-y}\big)\big|-e^{-y}\big| = f_X\big(e^{-y}\big)e^{-y}.$$
Thus if $X \sim U(0, 1)$ with $f_X(x) = I(0 < x < 1)$, then $Y$ has density $e^{-y}$ ($y > 0$); we say $Y$ has the exponential density with mean 1. (The function $f(y) = e^{-y}$, $y > 0$, is the exponential p.d.f. with mean 1.)
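The claim that $Y = -\log X$ is exponential with mean 1 for $X \sim U(0,1)$ is easy to check by simulation (a sketch; the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
y = -np.log(rng.uniform(size=200_000))   # Y = -log(U), U ~ Uniform(0,1)
print(round(y.mean(), 2), round(y.var(), 2))   # both near 1, as for Exp(1)
```

This transformation is also the standard inverse-c.d.f. method for generating exponential random variables.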
— When $g$ is non-monotonic it is usual to split up the range of $X$ into regions on which $g$ is monotonic. Example: suppose $X \sim N(0, 1)$ with density
$$\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \quad (-\infty < x < \infty)$$
and d.f. $\Phi(x) = \int_{-\infty}^x \phi(t)\,dt = P(X \le x)$. Put $Y = -\log|X|$. Then
$$F_Y(y) = P(Y \le y) = P\big(X \le -e^{-y} \text{ or } X \ge e^{-y}\big) = P\big(X \le -e^{-y}\big) + P\big(X \ge e^{-y}\big) = \Phi\big(-e^{-y}\big) + 1 - \Phi\big(e^{-y}\big);$$
thus
$$f_Y(y) = e^{-y}\phi\big(-e^{-y}\big) + e^{-y}\phi\big(e^{-y}\big) = 2e^{-y}\phi\big(e^{-y}\big) \quad (-\infty < y < \infty).$$
• Application 2. Variance stabilization. We first need notions of convergence in law, or distribution. Suppose $\{Y_n\}$ is a sequence of r.v.s; we say that $Y_n \overset{L}{\to} Y \sim F$ if
$$P(Y_n \le y) \to P(Y \le y) = F(y)$$
at every continuity point $y$ of $F$.
— The Central Limit Theorem (CLT; we'll prove it later) refers to this kind of convergence: if $Y_n = \sqrt{n}\big(\bar X_n - \mu\big)$, where $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ and $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then $Y_n \overset{L}{\to} Y \sim N(0, \sigma^2)$. The CLT, WLLN and MVT together are sufficient to derive a vast array of large sample approximations in Mathematical Statistics.
— Some basic facts required here are that if $Y_n \overset{L}{\to} Y \sim F$ then:
1. $a_nY_n + b_n \overset{L}{\to} aY + b$, if $a_n$ and $b_n$ are constants tending to $a$ and $b$, or r.v.s with these limits in probability. ("Slutsky's Theorem"; its role is to eliminate "nuisance terms", that typically $\to 0$ or $1$, in limit distributions.)
2. $g(Y_n) \overset{L}{\to} g(Y)$ if $g$ is continuous.
3. $Y_n \overset{L}{\to} c$ (a constant) $\iff Y_n \overset{p}{\to} c$. (You should show this.)
Now suppose that $X_n \overset{p}{\to} \theta$ and (actually, the former is implied by)
$$\sqrt{n}(X_n - \theta) \overset{L}{\to} X \sim N(0, \sigma^2).$$
Consider a function $Y_n = g(X_n)$, where $g$ is twice continuously differentiable on $\mathbb{R}$. Then $Y_n \overset{p}{\to} g(\theta)$, and by Taylor's Theorem, for some $\theta_n^*$ between $X_n$ and $\theta$,
$$\sqrt{n}\big(Y_n - g(\theta)\big) = \sqrt{n}\{g(X_n) - g(\theta)\} = \sqrt{n}\Big\{g'(\theta)(X_n - \theta) + g''(\theta_n^*)\frac{(X_n - \theta)^2}{2}\Big\} = g'(\theta)\big\{\sqrt{n}(X_n - \theta)\big\} + \frac{g''(\theta_n^*)}{2\sqrt{n}}\big\{\sqrt{n}(X_n - \theta)\big\}^2.$$
By Slutsky's Theorem, this has the same limit distribution as $g'(\theta)\sqrt{n}(X_n - \theta)$, as long as
$$\frac{g''(\theta_n^*)}{2\sqrt{n}}\big\{\sqrt{n}(X_n - \theta)\big\}^2 \overset{p}{\to} 0.$$
This in turn follows (how?) from
$$\big\{\sqrt{n}(X_n - \theta)\big\}^2 \overset{L}{\to} X^2 \text{ (why?)}, \qquad \frac{g''(\theta_n^*)}{2\sqrt{n}} \overset{p}{\to} 0 \text{ (why?)}.$$
The end result is that
$$\sqrt{n}\big(Y_n - g(\theta)\big) \overset{L}{\to} N\Big(0, \big(g'(\theta)\big)^2\sigma^2\Big).$$
Suppose now that we have evidence that our r.v. $\bar X_n$ has a variance that depends on its mean, i.e. $\sigma^2 = \sigma^2(\mu)$. An example is if $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ represents the average number of radioactive emissions of a certain type in $n$ runs of an experiment, where the number $X$ of emissions in one experiment has the Poisson distribution
$$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \dots.$$
Then $X$ has mean and variance both $= \lambda$, so that by the CLT $\sqrt{n}\big(\bar X_n - \lambda\big) \overset{L}{\to} N(0, \lambda)$, i.e. $\sigma^2(\mu) = \mu$.
This can make it problematic to make reliable inferences about the mean. For instance, a confidence interval on $\mu$: $\bar X_n \pm 2\sqrt{\bar X_n/n}$, will have a width depending on the unknown $\mu$.

Question: what "variance stabilizing" transformation $Y_n = g(\bar X_n)$ will have an approximately constant variance? We require $g'(\mu)\sqrt{\sigma^2(\mu)}$ to be constant, i.e. $g'(\mu) \propto \frac{1}{\sqrt{\sigma^2(\mu)}}$. This will be the case if
$$g(\mu) \propto \int \frac{d\mu}{\sqrt{\sigma^2(\mu)}}.$$
In the Poisson example we would take
$$g(\lambda) \propto \int \frac{d\lambda}{\sqrt{\sigma^2(\lambda)}} = \int \lambda^{-1/2}\,d\lambda \propto \sqrt{\lambda}$$
to obtain
$$\sqrt{n}\Big\{\sqrt{\bar X_n} - \sqrt{\lambda}\Big\} \overset{L}{\to} N(0, 1/4).$$
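The stabilizing effect of the square-root transformation is visible in simulation: $n\,\operatorname{Var}(\bar X_n)$ grows with $\lambda$, while $n\,\operatorname{Var}(\sqrt{\bar X_n}) \approx 1/4$ regardless of $\lambda$ (a sketch; the values of $n$, $\lambda$, and the replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 200, 5000
stab = []
for lam in (2.0, 8.0, 32.0):
    # each row: one sample mean of n Poisson(lam) observations
    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    stab.append(np.sqrt(xbar).var() * n)     # ~ 1/4 by the delta method
    print(lam, round(xbar.var() * n, 2))     # ~ lam: unstabilized variance grows
print([round(v, 3) for v in stab])           # all near 0.25, independent of lam
```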
Part III<br />
SEQUENCES, SERIES,<br />
INTEGRATION
10. Sequences and series<br />
• Convergence of a sequence $\{a_n\}_{n=1}^\infty$: We say that $a_n \to a$ if $f(n) = a_n \to a$ as $n \to \infty$, i.e. if
$$\forall \epsilon > 0 \; \exists N \; \big(n > N \Rightarrow |a_n - a| < \epsilon\big).$$
This is for $a$ finite; obvious modifications (what are they?) otherwise. E.g. $a_n = x^n \to 0$ if $|x| < 1$. ($N = \log\epsilon/\log|x|$; $N$ depends on $\epsilon$ and $x$.)

• Series: Put $s_n = \sum_{k=1}^n a_k$, the $n$-th partial sum of the series $\sum_{k=1}^\infty a_k$. We say that $\sum_{k=1}^\infty a_k = s$ if $s_n \to s$.
— Example: Geometric series $\sum_{k=0}^\infty x^k$ for $|x| < 1$. We have
$$s_n = \sum_{k=0}^n x^k = \frac{1 - x^{n+1}}{1 - x} \to s = \frac{1}{1 - x}.$$
• Extend to functions. $f_n : S \to \mathbb{R}$ functions; if the sequence $f_n(x)$ has a limit for every $x \in S$, denoted $f(x)$, then we say that $f_n \to f$ on $S$. Formally, for each $x \in S$,
$$\forall \epsilon > 0 \; \exists N = N(\epsilon, x) \; \big(n > N \Rightarrow |f_n(x) - f(x)| < \epsilon\big). \tag{10.1}$$

• Similarly, consider $s_n(x) = \sum_{k=1}^n f_k(x)$. If $s_n(x) \to s(x)$ for $x \in S$ we say that $s(x) = \sum_{k=1}^\infty f_k(x)$ and that $\sum_{k=1}^\infty f_k(x)$ converges to $s(x)$.

• If, in (10.1), the same $N = N(\epsilon)$ works for all $x \in S$ we say the convergence is uniform on $S$: $f_n \Rightarrow f$ on $S$. Equivalently,
$$f_n \Rightarrow f \text{ on } S \iff \sup_{x \in S}|f_n(x) - f(x)| \to 0.$$
Example of non-uniformity of convergence:
$$f_n(x) = \begin{cases} x^n, & 0 \le x < 1 \\ 1, & x \ge 1 \end{cases} \;\to\; f(x) = \begin{cases} 0, & 0 \le x < 1 \\ 1, & x \ge 1. \end{cases}$$
Then for each $n$,
$$\sup_{[0, \infty)}|f_n(x) - f(x)| \ge \sup_{[0, 1)}|x^n| = 1,$$
so that $\sup_{[0, \infty)}|f_n(x) - f(x)| \nrightarrow 0$.
• Example of uniformity of convergence. Consider $f(x) = e^x$. By Taylor's Theorem, for $c$ between $0$ and $x$:
$$f(x) = \sum_{k=0}^n f^{(k)}(0)\frac{x^k}{k!} + f^{(n+1)}(c)\frac{x^{n+1}}{(n+1)!} = \sum_{k=0}^n \frac{x^k}{k!} + e^c\frac{x^{n+1}}{(n+1)!} = s_n(x) + r_n(x), \text{ say.}$$
Then $|f(x) - s_n(x)| = |r_n(x)|$ and so $f(x) = \sum_{k=0}^\infty \frac{x^k}{k!}$ if $|r_n(x)| \to 0$. We show the stronger result that $s_n \Rightarrow f$ (equivalently, $r_n \Rightarrow 0$) on any closed interval $[a, b]$. For this, let $M$ be any integer that exceeds both $|a|$ and $|b|$, hence exceeds $|c|$. Let $n > M$. Then $\sup_{x \in [a, b]}|r_n(x)| \to 0$; this is because it is
$$\le e^M\frac{M^{n+1}}{(n+1)!} = e^M\frac{M^{n+1-M}\,M^M}{M!\,(M+1)\cdots\big(M + (n+1-M)\big)} = e^M\frac{M^M}{M!}\cdot\frac{M}{M+1}\cdot\frac{M}{M+2}\cdots\frac{M}{M + (n+1-M)} \le e^M\frac{M^M}{M!}\cdot\Big(\frac{M}{M+1}\Big)^{n+1-M} \to 0 \text{ as } n \to \infty.$$
• Cauchy sequences. A sequence $\{a_n\}_{n=1}^\infty$ is Cauchy if the terms get close together sufficiently quickly:
$$\forall \epsilon > 0 \; \exists N \; \big(m, n > N \Rightarrow |a_m - a_n| < \epsilon\big).$$
Note that if $a_n \to a$ (finite) then we can let $N$ be such that
$$n > N \Rightarrow |a_n - a| < \epsilon/2;$$
then for $m, n > N$,
$$|a_m - a_n| = |(a_m - a) - (a_n - a)| \le |a_m - a| + |a_n - a| < \epsilon.$$
Thus a convergent sequence (i.e. a sequence with a finite limit) is Cauchy. (As a consequence, $\sum_{n=1}^\infty n^{-1}$ diverges.) The converse (not proven here) holds as well, so that a sequence is convergent iff it is Cauchy.
• A consequence is that if $\sum a_k$ converges absolutely, i.e. if $\sum |a_k|$ converges, then $\sum a_k$ converges.
Proof: Suppose that $t_n = \sum_{k=1}^n |a_k|$ is a convergent, hence a Cauchy, sequence. There is $N$ such that
$$m > n > N \Rightarrow |t_m - t_n| < \epsilon.$$
But $|t_m - t_n| = \sum_{k=n+1}^m |a_k|$, so that $s_n = \sum_{k=1}^n a_k$ satisfies (for $m > n > N$)
$$|s_m - s_n| = \Bigg|\sum_{k=n+1}^m a_k\Bigg| \le \sum_{k=n+1}^m |a_k| = |t_m - t_n| < \epsilon.$$
Thus $\{s_n\}$ is Cauchy, hence is convergent. $\square$
• Example: Let $X$ be a discrete r.v. with $P(X = k) = p_k$, $k = 0, 1, 2, \dots$. If $\sum k^m p_k$ converges absolutely, we call it the $m$-th moment $E[X^m]$ of $X$. Suppose $X$ has the Poisson distribution $\mathcal{P}(\lambda)$:
$$P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \dots.$$
(Note $\sum p_k = 1$ - how?) Then the moments exist for all $m > 0$. To see this, consider the partial sums
$$s_n = \sum_{k=0}^n k^m p_k = e^{-\lambda}\sum_{k=0}^n \frac{k^m\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^n b_k, \text{ say.}$$
We must show that $s_n$ converges. Note that
$$\frac{b_{k+1}}{b_k} = \frac{\lambda}{k+1}\Big(1 + \frac{1}{k}\Big)^m \to 0 \text{ as } k \to \infty,$$
so that for $0 < r < 1$ there is $K$ so that
$$k \ge K \Rightarrow \frac{b_{k+1}}{b_k} \le r.$$
Then for $n_0 > K$ we have
$$s_{n_0} = s_K + e^{-\lambda}\sum_{k=K+1}^{n_0} b_k \le s_K + e^{-\lambda}b_K\big(r + r^2 + \cdots + r^{n_0 - K}\big) < s_K + e^{-\lambda}b_K\frac{r}{1 - r}.$$
Thus the sequence $\{s_{K+1}, s_{K+2}, \dots\}$ is increasing and bounded above, hence has a limit ($=$ the sup) - you should show this.
— Here we have established convergence by using a version of the ratio test. See §5.2.1, 5.2.2 in the text, or elsewhere, for other tests.
• Uniform convergence ensures that we can interchange certain operations.
Theorem: If $f_n \Rightarrow f$ on $S$, then
$$\lim_{n \to \infty}\lim_{x \to x_0} f_n(x) = \lim_{x \to x_0}\lim_{n \to \infty} f_n(x)$$
for $x_0 \in S$.
— A case in which this fails, because the convergence is not uniform, is
$$f_n(x) = \begin{cases} x^n, & 0 \le x < 1 \\ 1, & x \ge 1 \end{cases} \;\to\; f(x) = \begin{cases} 0, & 0 \le x < 1 \\ 1, & x \ge 1, \end{cases}$$
with $x_0 = 1$. Here $\lim_{x \to x_0}\lim_{n \to \infty} f_n(x) = \lim_{x \to 1} f(x)$ does not even exist - the limit function is discontinuous. Cases like this are ruled out if the convergence is uniform:
— If $f_n \Rightarrow f$ on $S$ and the $f_n$ are continuous on $S$, then $f$ is continuous on $S$.
Proof: For any $x_0 \in S$,
$$\lim_{x \to x_0} f(x) = \lim_{x \to x_0}\lim_{n \to \infty} f_n(x) = \lim_{n \to \infty}\lim_{x \to x_0} f_n(x) = \lim_{n \to \infty} f_n(x_0) = f(x_0). \quad \square$$
11. Power series; moment and probability<br />
generating functions<br />
• Power series: Put $s_n(x) = \sum_{k=0}^n c_k(x - x_0)^k$; if $s_n(x) \to f(x)$ as $n \to \infty$ we say that $\sum_{k=0}^\infty c_k(x - x_0)^k$ is the power series representing $f$.
— e.g. by Taylor's Theorem, if
$$s_n(x) = \sum_{k=0}^n f^{(k)}(x_0)\frac{(x - x_0)^k}{k!}$$
then
$$f(x) = s_n(x) + f^{(n+1)}(c)\frac{(x - x_0)^{n+1}}{(n+1)!},$$
so that if
$$f^{(n+1)}(c)\frac{(x - x_0)^{n+1}}{(n+1)!} \to 0$$
then $\sum_{k=0}^\infty \frac{f^{(k)}(x_0)(x - x_0)^k}{k!}$ is the power series ("Taylor series", or "Maclaurin's series" if $x_0 = 0$) "representing $f$".
• Theorem: Suppose a power series $\sum_{k=0}^\infty c_k x^k$ converges for one value $x_0 \ne 0$. Then it converges absolutely for $|x| < |x_0|$.
Proof: Put
$$s_n(x) = \sum_{k=0}^n c_k x^k, \qquad t_n(x) = \sum_{k=0}^n \big|c_k x^k\big|.$$
Since $s_n(x_0)$ has a limit as $n \to \infty$, it is a Cauchy sequence and, in particular, $\big|c_n x_0^n\big| = \big|s_n(x_0) - s_{n-1}(x_0)\big| \to 0$. For $\epsilon > 0$ let $N$ be large enough that
$$n > N \Rightarrow \big|c_n x_0^n\big| < \epsilon.$$
Then for $n > N$ and $|x| < |x_0|$,
$$t_n(x) = \sum_{k=0}^n \big|c_k x^k\big| = t_N(x) + \sum_{k=N+1}^n \big|c_k x_0^k\big|\left|\frac{x}{x_0}\right|^k < t_N(x) + \frac{\epsilon}{1 - |x/x_0|},$$
i.e. the partial sums $t_n(x)$, which are necessarily increasing, are bounded above. $\square$
• If $\sum_{k=0}^\infty c_k x^k$ converges for $|x| < R$ and diverges for $|x| > R$ we call $R$ the radius of convergence. Then, by the previous result, if $|x| < R$ the series is absolutely convergent.
— e.g. put
$$s_n(x) = \sum_{k=0}^n (-x)^k = \frac{1 - (-x)^{n+1}}{1 + x};$$
then with $f(x) = 1/(1 + x)$ we have
$$|s_n(x) - f(x)| = \frac{|x|^{n+1}}{|1 + x|}.$$
If $|x| < 1$ then $|s_n(x) - f(x)| \to 0$; if $|x| > 1$ it $\to \infty$. Thus $R = 1$ is the radius of convergence. In this case when $|x| = 1$ the series diverges (i.e. the partial sums do not converge).
• Theorem: Suppose a power series $\sum_{k=0}^\infty c_k x^k$ has a radius of convergence $R > 0$ (within which it necessarily converges absolutely). Let $0 < r < R$. Then:
(i) $\sum_{k=0}^\infty c_k x^k$ converges uniformly on $[-r, r]$;
(ii) For $|x| < R$ the limit function $f(x) = \sum_{k=0}^\infty c_k x^k$ is continuous and differentiable, and the derivative is represented by the convergent series
$$f'(x) = \sum_{k=1}^\infty k c_k x^{k-1}.$$
(Thus $\sum_{k=1}^\infty k c_k x^{k-1}$ converges for $|x| < R$, and so (i), (ii) apply to $f'(x)$.)
Proof of (i): Suppose $\sum_{k=0}^n c_k x^k = s_n(x) \to f(x)$ for $|x| < R$. Then for $|x| \le r$ we have
$$|s_n(x) - f(x)| = \Bigg|\sum_{k=n+1}^\infty c_k x^k\Bigg| \le \sum_{k=n+1}^\infty |c_k|\,r^k \to 0$$
as $n \to \infty$, since $\sum_{k=0}^\infty c_k r^k$ converges absolutely for $r < R$. Thus $\sup_{|x| \le r}|s_n(x) - f(x)| \to 0$, as required. $\square$
• By (ii), we can repeat the process: $f''(x) = \sum_{k=2}^\infty k(k - 1)c_k x^{k-2}$, etc. Among other things, this implies the uniqueness of power series representations. (How?)

• Example: The probability generating function of a r.v. $X$ is the function $g(s) = E\big[s^X\big]$, provided this exists. In particular, if $X$ has support $\mathbb{N} = \{0, 1, 2, \dots\}$ then
$$g(s) = \sum_{k=0}^\infty s^k P(X = k).$$
Since this converges for $s = 1$ it has radius of convergence $R \ge 1$. We can then differentiate term-by-term near $s = 0$:
$$g^{(m)}(0) = \sum_{k=m}^\infty k(k - 1)\cdots(k - m + 1)\,s^{k-m}P(X = k)\Big|_{s=0} = m!\,P(X = m).$$
— Note that, by uniqueness of power series, if we can expand $g(s)$ as $\sum_{k=0}^\infty c_k s^k$ then, necessarily, $c_k = P(X = k) = g^{(k)}(0)/k!$. In other words, the p.g.f. uniquely determines the distribution: two r.v.s with the same p.g.f. have the same distribution.
— Example: If $X \sim \text{Binomial}(n, p)$ then
$$g(s) = (1 - p + ps)^n;$$
the uniqueness then shows that the sum of such independent $X$s, all with the same $p$ but possibly different values of $n$, is $\sim \text{Binomial}\big(\sum n_i, p\big)$.
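The uniqueness statement can be checked by expanding the binomial p.g.f. $(1-p+ps)^n$ as a polynomial in $s$ and comparing the coefficients with the binomial probabilities (a sketch with arbitrary $n$ and $p$):

```python
import numpy as np
from math import comb

n, p = 5, 0.3
# coefficients of (q + p s)^n in ascending powers of s, by repeated
# polynomial multiplication (convolution of coefficient arrays)
coef = np.array([1.0])
for _ in range(n):
    coef = np.convolve(coef, np.array([1 - p, p]))

pmf = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
print(np.allclose(coef, pmf))   # coefficient of s^k equals P(X = k)
```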
— In the above we have used the fact that a characterization of the independence of r.v.s $(X, Y)$ is that $E[g(X)h(Y)] = E[g(X)]\,E[h(Y)]$ for all functions $g, h$ such that $g(X)$ and $h(Y)$ are also r.v.s. Equivalently, $g(X)$ and $h(Y)$ are uncorrelated for all such $g, h$.
• The moment generating function of a r.v. $X$ is the function $M(t) = E\big[e^{tX}\big]$, provided this exists (i.e. is finite). (Replacing $t$ by $it$ gives the characteristic function, which always exists: it is $E[\cos(tX)] + iE[\sin(tX)]$.) With $X$ as above,
$$M(t) = \sum_{k=0}^\infty e^{tk}P(X = k).$$
Note that $M(t) = g(e^t)$, so that it converges (absolutely) in a neighbourhood of $t = 0$ iff $g$ has a radius of convergence $R > 1$. Assume this. Then for $|t| < \log R$ we have, by the preceding theorem,
$$M'(t) = g'(e^t)\,e^t = \sum_{k=0}^\infty k\big(e^t\big)^{k-1}P(X = k)\cdot e^t = \sum_{k=0}^\infty k\,e^{tk}P(X = k) = E\big[Xe^{tX}\big],$$
with $M'(0) = E[X]$. Continuing, $M^{(m)}(t) = E\big[X^m e^{tX}\big]$, with $M^{(m)}(0) = E[X^m]$. (i.e. we can differentiate within the $E[\cdot]$.)
— e.g. $X \sim \mathcal P(\lambda)$ with $P(X = k) = e^{-\lambda}\lambda^k/k!$ has
$$M(t) = \sum_{k=0}^{\infty} e^{tk}\,e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^{\infty}\frac{\left(\lambda e^t\right)^k}{k!} = e^{-\lambda}\cdot e^{\lambda e^t} = e^{\lambda(e^t - 1)}.$$
Thus
$$E[X] = M'(0) = \lambda,\quad E\left[X^2\right] = M''(0) = \lambda^2 + \lambda,\ \text{ hence } \text{Var}[X] = \lambda.$$
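These moment computations can be verified by differentiating the Poisson m.g.f. numerically (central differences at $t = 0$; the value $\lambda = 2.5$ is an arbitrary choice for illustration):

```python
import math

lam = 2.5
def M(t):
    # m.g.f. of Poisson(lam): M(t) = exp(lam * (e^t - 1))
    return math.exp(lam * (math.exp(t) - 1.0))

h = 1e-4
M1 = (M(h) - M(-h)) / (2 * h)             # ~ M'(0)  = E[X]   = lam
M2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2   # ~ M''(0) = E[X^2] = lam^2 + lam
var = M2 - M1 ** 2                        # ~ Var[X] = lam
```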
— The cumulants $\kappa_n$ of a distribution are defined as the coefficients in the expansion
$$\log E\left[e^{tX}\right] = \sum_{n=1}^{\infty}\kappa_n\frac{t^n}{n!}.$$
Thus the Poisson distribution has all cumulants $\kappa_n = \lambda$. In general $\kappa_1$ is the mean and $\kappa_2$ is the variance; after that they get more complicated. The Normal distribution has all $\kappa_n = 0$ for $n > 2$.
12. Branching processes
• Important in population studies and elsewhere. Organisms are born, live for 1 unit of time, then give birth to a random number of offspring and die.
• Define r.v.s
$N_t$ = population size at time $t$ ($t = 0, 1, 2, \dots$), with $N_0 = 1$;
$X_i$ = number of offspring of the $i$th member of the population.
Then
$$N_t = \sum_{i=1}^{N_{t-1}} X_i.$$
• Problems: (i) Determine properties of the distribution of $N_t$. (ii) Determine the limiting probability of extinction ($= \lim_{t\to\infty} P(N_t = 0) = \lim_{t\to\infty} G_t(0)$, if $G_t$ is the p.g.f. of $N_t$).
• Assume: When $N_{t-1} = k$, $X_1, \dots, X_k$ are independent r.v.s, independent of $N_{t-1}$. (i.e. number of offspring of one member has no effect on that of another, and is unaffected by current size of population. Realistic?) Assume also that all $X_i$ are distributed in the same manner.
• We will work with the p.g.f.s
$$G(s) = E\left[s^X\right],\quad G_t(s) = E\left[s^{N_t}\right];\quad 0 \le s \le 1.$$
Assume $G$ has a radius of convergence $> 1$, so that $E[X] = G'(1)$ exists. Note that
$$G_t(s) = E\left[s^{N_t}\right] = E\left[s^{\sum_{i=1}^{N_{t-1}} X_i}\right].$$
If $N_{t-1} = k$, this is
$$E\left[s^{\sum_{i=1}^{k} X_i}\right] = E\left[\prod_{i=1}^{k} s^{X_i}\right] = \prod_{i=1}^{k} E\left[s^{X_i}\right] \ \text{(independence)} = G^k(s) \ \text{(since the } X_i \text{ are identically distributed).}$$
Considering the probabilities of the events “$N_{t-1} = k$” (i.e. Double Expectation Theorem: $E\left[s^{N_t}\right] = E_{N_{t-1}}\left\{E\left[s^{N_t}\mid N_{t-1}\right]\right\}$) gives
$$G_t(s) = \sum_{k=0}^{\infty} G^k(s)\,P(N_{t-1} = k) = G_{t-1}(G(s)).$$
Iterating:
$$G_0(s) = E\left[s^{N_0}\right] = E[s] = s,$$
$$G_1(s) = G_0(G(s)) = G(s) = G(G_0(s)),$$
$$G_2(s) = G_1(G(s)) = G\circ G(s) = G(G_1(s)),$$
$$G_3(s) = G_2(G(s)) = G\circ G\circ G(s) = G(G_2(s)),$$
and in general (by induction)
$$G_t(s) = G(G_{t-1}(s)),\quad t = 1, 2, \dots;\quad G_0(s) = s.$$
It follows (you should show how) that $E[N_t] = \{E[X]\}^t$. (Intuitively obvious?)
• Probability of extinction. Note $P(N_t = 0) = G_t(0) = \eta_t$, say, and $\eta_t = G(\eta_{t-1})$ with $\eta_0 = 0$. Does $\lim_{t\to\infty}\eta_t$ exist, and if so what is it? We shall assume that $0 < P(X = 0) < 1$, otherwise the problem is trivial. Consequently, $G(s)$ is positive, strictly increasing and convex for $0 \le s < 1$:
$$G(s) = \sum_{k=0}^{\infty} s^k P(X = k) \ge P(X = 0) > 0,$$
$$G'(s) = \sum_{k=1}^{\infty} k\,s^{k-1} P(X = k) > 0, \text{ since at least one of the } P(X = k),\ k \ge 1, \text{ is } > 0,$$
$$G''(s) = \sum_{k=2}^{\infty} k(k-1)\,s^{k-2} P(X = k) \ge 0.$$
Now
$$\eta_1 = G(\eta_0) = G(0) > 0 = \eta_0,$$
$$\eta_1 > \eta_0 \Rightarrow \eta_2 = G(\eta_1) > G(\eta_0) = \eta_1,$$
$$\cdots$$
$$\eta_t > \eta_{t-1} \Rightarrow \eta_{t+1} = G(\eta_t) > G(\eta_{t-1}) = \eta_t.$$
In general $0 = \eta_0 < \eta_1 < \eta_2 < \dots \le 1$, and so $\eta_t \uparrow \eta = \sup_t\{\eta_t\}$.
Since $\eta_t = G(\eta_{t-1})$ and $G$ is continuous we have
$$\eta = \lim_{t\to\infty} G(\eta_{t-1}) = G\left(\lim_{t\to\infty}\eta_{t-1}\right) = G(\eta).$$
• Put $H(s) = G(s) - s$; note
$$H(0) = P(X = 0) > 0,\quad H(1) = 0,\quad H'(0) = G'(0) - 1 = P(X = 1) - 1 < 0,$$
and $H$ is convex. Also
$$H'(1) = G'(1) - 1 = E[X] - 1.$$
The function $H(s)$ can drop below 0 at most once in $(0,1)$. Graph the two possible cases. In the first $H(s)$ is increasing at $s = 1$, in the second it is decreasing.
— Case 1: $E[X] > 1$. Equivalently, $H'(1) > 0$. There are two roots, say $\eta^* \in (0,1)$ and $s = 1$, to the equation $H(s) = 0$, and $\eta$ is one of them. We have
$$\eta_0 = 0 < \eta^* \Rightarrow \eta_1 = G(\eta_0) < G(\eta^*) = \eta^* \Rightarrow \eta_2 = G(\eta_1) < G(\eta^*) = \eta^*,$$
etc.; hence $\eta_t \le \eta^*$ and so $\eta = \eta^*$.
— Case 2: $E[X] \le 1$. Equivalently, $H'(1) \le 0$, and $s = 1$ is the only solution.
• Summary:
If $E[X] \le 1$ then $P(\text{eventual extinction}) = 1$;
if $E[X] > 1$ then this probability is $< 1$ and is the unique solution $\eta$ in $(0,1)$ to $G(s) = s$.
Let $T$ = time of extinction. Then
$$P(T > t) = P(N_t > 0) = 1 - \eta_t,$$
hence $P(T \le t) = \eta_t$, with $\eta_0 = 0$, $\eta_t = G(\eta_{t-1})$, and
$$P(T = \infty) = 1 - \eta.$$
It can (and will - Asst. 3) be shown that
$$E[T] = \sum_{t=0}^{\infty}(1 - \eta_t) \quad (= \infty \text{ if } E[X] \ge 1).$$
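The iteration $\eta_t = G(\eta_{t-1})$ can be run directly. A sketch with a Poisson offspring distribution of mean 2 (supercritical, so the extinction probability is the root of $G(s) = s$ in $(0,1)$; the offspring law is an illustrative choice, not from the notes):

```python
import math

lam = 2.0  # mean offspring number, E[X] > 1
def G(s):
    # p.g.f. of the Poisson(lam) offspring distribution
    return math.exp(lam * (s - 1.0))

eta = 0.0                 # eta_0 = P(N_0 = 0) = 0
for _ in range(200):      # eta_t = G(eta_{t-1}) increases to the fixed point
    eta = G(eta)
# eta is now (numerically) the smallest solution of G(s) = s in [0, 1)
```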
13. Riemann integration
• Riemann integration. First consider $f : [a,b] \to \mathbb R$, a bounded function. Consider a partition, or ‘mesh’ $P = \{a = x_0 < x_1 < \dots < x_n = b\}$ of $[a,b]$; its norm is $\Delta = \max_i(\Delta_i)$, where $\Delta_i = x_i - x_{i-1}$. For $\xi_i \in [x_{i-1}, x_i]$, an approximation to the area under the graph of $f$ is
$$S(P) = \sum_{i=1}^n f(\xi_i)\,\Delta_i.$$
The integral of $f$ over $[a,b]$ is defined as the limit of these approximations, as they become more and more refined, i.e. as $\Delta \to 0$.
• Formally, we first bound the Riemann sum $S(P)$ above and below as follows. Define
$$m_i = \inf_{[x_{i-1}, x_i]} f(x),\quad M_i = \sup_{[x_{i-1}, x_i]} f(x);$$
$$L(P) = \sum_{i=1}^n m_i\Delta_i,\quad U(P) = \sum_{i=1}^n M_i\Delta_i.$$
Then clearly
$$L(P) \le S(P) \le U(P).$$
If we refine $P$ by including points $x'$ between $x_{i-1}$ and $x_i$, obtaining another partition $P' \supset P$, then in $P'$ the infima increase and the suprema decrease; thus
$$L(P) \le L(P') \quad \text{and} \quad U(P') \le U(P).$$
Also $L(P) \le U(P')$ for any partitions $P$, $P'$ (shown by considering their union, whose lower sum exceeds that of $P$ and whose upper sum is $\le$ that of $P'$). Continuing:
$$L(P) \le L(P') \le L(P'') \le \dots \le \sup_P L(P) \le \inf_P U(P) \le \dots \le U(P'') \le U(P') \le U(P).$$
We say that $f$ is (R-)integrable if $\sup_P L(P) = \inf_P U(P)$, and then their common value is $\int_a^b f(x)\,dx$. Equivalently
$$\inf_P\{U(P) - L(P)\} = 0.$$
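The definition can be illustrated numerically: for an increasing $f$ the lower and upper sums on a uniform partition bracket the integral, and their gap is $\Delta[f(b) - f(a)]$. A sketch with $f(x) = x^2$ on $[0,1]$:

```python
def lower_upper(f, a, b, n):
    # lower and upper Riemann sums of an increasing f on a uniform partition
    dx = (b - a) / n
    xs = [a + i * dx for i in range(n + 1)]
    # for increasing f: inf on [x_{i-1}, x_i] is f(x_{i-1}), sup is f(x_i)
    L = sum(f(xs[i - 1]) * dx for i in range(1, n + 1))
    U = sum(f(xs[i]) * dx for i in range(1, n + 1))
    return L, U

L, U = lower_upper(lambda x: x * x, 0.0, 1.0, 1000)
# L <= 1/3 <= U, and U - L = (f(1) - f(0))/n -> 0 as n grows
```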
• An example of a non-integrable function is $f(x) = I(x \in \mathbb Q)$ ($\mathbb Q$ the rationals) for $x \in [0,1]$. Then for any partition we have $m_i \equiv 0$ and $M_i \equiv 1$, so $\sup_P L(P) = 0 < 1 = \inf_P U(P)$.
• Continuous functions on $[a,b]$ are R-integrable there. (We write $f \in \mathcal R[a,b]$, or just $f \in \mathcal R$.) The general idea of the proof is that, since continuous functions on bounded, closed intervals are (bounded and) uniformly continuous, $M_i - m_i$ can be made uniformly small, say $\le \varepsilon$ whenever $\Delta < \delta$. Then
$$U(P) - L(P) \le \varepsilon\sum_{i=1}^n \Delta_i = \varepsilon(b - a),$$
hence $\inf_P\{U(P) - L(P)\} = 0$.
• Monotonic, bounded functions on $[a,b]$ are R-integrable; e.g. for $\nearrow$ functions,
$$U(P) - L(P) = \sum_{i=1}^n\left[f(x_i) - f(x_{i-1})\right]\Delta_i \le \Delta\left[f(b) - f(a)\right] \to 0.$$
• More generally, we say a function is of bounded variation (“$f$ is BV”) on $[a,b]$ if $\sum_{i=1}^n|\Delta f_i| \le M$ for some $M > 0$ and all partitions $P$. (Here $\Delta f_i = f(x_i) - f(x_{i-1})$.) This clearly holds if $f$ is monotonic and bounded, or (by the MVT) if $f$ has a bounded derivative (since then $|\Delta f_i| \le K\Delta_i$, where $|f'| \le K$ on $[a,b]$). It can be shown that if $f$ is BV then $f \in \mathcal R[a,b]$.
• Standard properties follow from these definitions. If $f, g \in \mathcal R[a,b]$ then so are $f + g$, $\alpha f$, $fg$ and $|f|$; in the first two cases the integral is linear; in the last we have $\left|\int_a^b f(x)\,dx\right| \le \int_a^b|f(x)|\,dx$. If $f \le g$ then $\int_a^b f(x)\,dx \le \int_a^b g(x)\,dx$. (You should show these two inequalities.) Also, $\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx$ for $c \in [a,b]$.
• An important result is the Mean Value Theorem for Riemann integrals: If $f$ is continuous on $[a,b]$ then there is $c \in [a,b]$ for which
$$\int_a^b f(x)\,dx = f(c)(b - a).$$
Proof: Let $m$ and $M$ be the inf and sup of $f$ on $[a,b]$; then
$$m \le \frac{1}{b-a}\int_a^b f(x)\,dx \le M.$$
Since $f$ is continuous it attains $m$ and $M$ and every point between (Intermediate Value Theorem), hence there is $c \in [a,b]$ for which $f(c) = \frac{1}{b-a}\int_a^b f(x)\,dx$. ¤
• Now define
$$F(x) = \int_a^x f(t)\,dt,\quad a \le x \le b,$$
the indefinite integral of $f$. We have the Fundamental Theorem of Calculus: If $f$ is continuous on $[a,b]$ then $F$ is differentiable there, with
$$F'(x) = f(x); \tag{13.1}$$
hence $\int_a^b f(x)\,dx = F(b) - F(a)$ (as below).
Proof of (13.1):
$$F'(x) = \lim_{h\to 0}\frac{1}{h}\left[\int_a^{x+h} f(t)\,dt - \int_a^x f(t)\,dt\right] = \lim_{h\to 0}\frac{1}{h}\int_x^{x+h} f(t)\,dt = \lim_{h\to 0}\frac{1}{h}\cdot h\,f(x_h)$$
for some $x_h \in [x, x+h]$, by the MVT. Since $x_h \to x$ and $f$ is continuous, $f(x_h) \to f(x)$. ¤
— This is the main tool for evaluating integrals - we find a $G$ whose derivative is $f$. Reason: If $G'(x) = f(x)$, then
$$\varphi(x) = F(x) - [G(x) - G(a)]$$
has $\varphi(a) = 0$ and $\varphi'(x) \equiv 0$, so $\varphi(x) \equiv 0$ (for instance by the MVT). Hence
$$\int_a^b f(x)\,dx = F(b) = G(b) - G(a).$$
— This is used to justify the change-of-variables formula for Riemann integration (i.e. integration by substitution).
— Example: the substitution $x = \tan\theta$, with $dx = \sec^2\theta\,d\theta$, gives
$$\int_a^b \frac{dx}{1+x^2} = \int_{\arctan a}^{\arctan b}\cos^2\theta\,\sec^2\theta\,d\theta = \theta\Big|_{\arctan a}^{\arctan b} = \arctan b - \arctan a.$$
• Improper Riemann integrals, in which one or both endpoints are infinite, or at which $f$ is unbounded, are defined by taking appropriate limits:
$$\int_a^{\infty} f(x)\,dx = \lim_{b\to\infty}\int_a^b f(x)\,dx,$$
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^c f(x)\,dx + \int_c^{\infty} f(x)\,dx \quad \text{for any } c,$$
$$\int_a^b f(x)\,dx = \lim_{\varepsilon\downarrow 0}\int_a^{b-\varepsilon} f(x)\,dx \quad \text{if } f(b) = \pm\infty.$$
Example: $f(x) = 1/\{\pi(1+x^2)\}$, $-\infty < x < \infty$, is the ‘Cauchy’ ($= t$ on 1 degree of freedom) p.d.f.; we have
$$\int_a^b f(x)\,dx = (\arctan b - \arctan a)/\pi,$$
so
$$\int_{-\infty}^{\infty} f(x)\,dx = \left(\int_{-\infty}^c + \int_c^{\infty}\right) f(x)\,dx = \lim_{a\to-\infty}\frac{\arctan c - \arctan a}{\pi} + \lim_{b\to\infty}\frac{\arctan b - \arctan c}{\pi}$$
$$= \frac{-\arctan(-\infty) + \arctan\infty}{\pi} = \frac{\pi/2 + \pi/2}{\pi} = 1.$$
However, none of the moments
$$E\left[X^k\right] = \int_{-\infty}^{\infty} x^k f(x)\,dx$$
exist for $k \ge 1$. This is because the existence of $E[X^k]$ requires the existence of
$$\int_0^{\infty} x^k f(x)\,dx = \int_0^{\infty} |x|^k f(x)\,dx$$
and of
$$\int_{-\infty}^0 x^k f(x)\,dx = (-1)^k\int_{-\infty}^0 |x|^k f(x)\,dx,$$
hence the existence of $E[|X|^k]$. You should show that $E[|X|^k]$ does not exist if $X$ is Cauchy; a consequence is that even if the integrand of $\int_{-\infty}^{\infty} x^k f(x)\,dx$ is an odd function, the integral need not $= 0$.
14. Riemann and Riemann-Stieltjes integration
• An application of the Fundamental Theorem of Calculus is the formula for integration by parts. If $f, g$ are differentiable, and $f', g'$ are integrable, then
$$\int_a^b [f(x)g(x)]'\,dx = f(b)g(b) - f(a)g(a), \text{ and also}$$
$$\int_a^b [f(x)g(x)]'\,dx = \int_a^b\left[f'(x)g(x) + f(x)g'(x)\right]dx;$$
hence
$$\int_a^b f'(x)g(x)\,dx = f(b)g(b) - f(a)g(a) - \int_a^b f(x)g'(x)\,dx.$$
A mnemonic is “$\int u\,dv = uv| - \int v\,du$”.
• Application. Define $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx$ ($\alpha > 0$), the Gamma integral. Establishing the existence of $\lim_{b\to\infty}\int_0^b$, or of $\lim_{a\to 0,\,b\to\infty}\int_a^b$ if $\alpha < 1$, is left to you. We have
$$\Gamma(\alpha) = \int_0^{\infty}\left(\frac{x^{\alpha}}{\alpha}\right)' e^{-x}\,dx = \left.\frac{x^{\alpha}}{\alpha}e^{-x}\right|_0^{\infty} - \int_0^{\infty}\frac{x^{\alpha}}{\alpha}\left(-e^{-x}\right)dx = \frac{1}{\alpha}\int_0^{\infty} x^{\alpha}e^{-x}\,dx = \frac{1}{\alpha}\Gamma(\alpha+1);$$
hence
$$\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha);$$
in particular for $n$ an integer,
$$\Gamma(n+1) = n\,\Gamma(n) = \cdots = n(n-1)\cdots 1\cdot\Gamma(1) = n!$$
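The recursion $\Gamma(\alpha+1) = \alpha\Gamma(\alpha)$ and the factorial identity are easy to check against Python's built-in gamma function:

```python
import math

# Gamma(a + 1) = a * Gamma(a), also at non-integer arguments ...
for a in (0.5, 1.7, 3.2):
    assert abs(math.gamma(a + 1) - a * math.gamma(a)) < 1e-12 * math.gamma(a + 1)

# ... and Gamma(n + 1) = n! for integers
for n in range(1, 8):
    assert abs(math.gamma(n + 1) - math.factorial(n)) < 1e-9
```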
• A generalization of the Riemann integral that is particularly useful in statistics is the Riemann-Stieltjes (R-S) integral. Let $f$ be bounded on $[a,b]$, and let $g(x)$ be $\nearrow$ there. In the definition of the R-integral, replace $\Delta_i$ everywhere by $\Delta g_i = g(x_i) - g(x_{i-1})\ (\ge 0)$. The analogue of $S(P)$ is $S(P; g) = \sum_{i=1}^n f(\xi_i)\,\Delta g_i$; if this has a limit as $\Delta \to 0$ - equivalently (as at Theorem 6.2.1) if $\sup_P L(P) = \inf_P U(P)$ - then we call it the R-S integral $\int_a^b f(x)\,dg(x)$. It is particularly useful in cases where $g$ is not continuous.
• Special cases:
1. $g(x) = x$; $\int_a^b f(x)\,dg(x) = \int_a^b f(x)\,dx$, the R-integral.
2. $g$ differentiable, with $g' = h$; $\int_a^b f(x)\,dg(x) = \int_a^b f(x)h(x)\,dx$, the R-integral.
3. $g(x) = \begin{cases} c_1, & a \le x < x^*,\\ c_2, & x^* \le x \le b.\end{cases}$ Then
$$\Delta g_i = \begin{cases} c_2 - c_1, & x^* \in (x_{i-1}, x_i],\\ 0, & \text{otherwise.}\end{cases}$$
It follows that $\int_a^b f(x)\,dg(x) = f(x^*)(c_2 - c_1)$.
• We adopt the convention that unless stated otherwise, by $\int_a^b$ we mean $\int_{(a,b]}$, i.e. the right-hand endpoint is included, the left is not. Note that this is not an issue for R-integrals (why not?). Combining 2) and 3):
Suppose $a = x_0 < x_1 < \dots < x_n = b$ and
(a) $g$ is differentiable on $(x_{i-1}, x_i)$ with $g' = h \ge 0$,
(b) $g$ has a jump discontinuity (but is right continuous) at each $x_i$, with $g(x_i) - g(x_i^-) = c_i$.
Then
$$\int_a^b f(x)\,dg(x) = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x)\,dg(x) = \sum_{i=1}^n\left\{\int_{x_{i-1}}^{x_i} f(x)h(x)\,dx + f(x_i)\,c_i\right\}.$$
• Improper R-S integrals are defined as for R-integrals. In particular, let $X$ be a r.v. with d.f. $F(x)$, $-\infty < x < \infty$ (note $F \nearrow$). Let $h(X)$ be a function of $X$. We define the expected value of $h(X)$ to be
$$E[h(X)] = \int_{-\infty}^{\infty} h(x)\,dF(x).$$
If $X$ has a density this agrees with the earlier definition. Suppose instead that $X$ is discrete, with
$$P(X = x_k) = p_k,\quad k = 0, 1, 2, \dots$$
Then $F(x) = P(X \le x)$ has a jump of height $\Delta F_k = p_k$ at $x_k$ and has $F' = 0$ elsewhere, so
$$E[h(X)] = \sum_{k=0}^{\infty} h(x_k)\,p_k.$$
• An example illustrating the power of this integral, in which neither the R-integral nor a sum alone will suffice, is if $X$ represents the lifetime of a randomly chosen light bulb. Suppose that, with probability $p$, the bulb blows when first installed. Otherwise, it has an exponentially distributed lifetime, with $P(X > x) = e^{-\lambda x}$. Thus its d.f. is
$$F(x) = P(X \le x) = \begin{cases} 0, & x < 0,\\ p, & x = 0,\\ p + (1-p)\left(1 - e^{-\lambda x}\right), & x > 0,\end{cases}$$
with
$$E[h(X)] = \int_{-\infty}^{\infty} h(x)\,dF(x) = h(0)\cdot p + \int_0^{\infty} h(x)\left[\frac{d}{dx}\left\{p + (1-p)\left(1 - e^{-\lambda x}\right)\right\}\right]dx = h(0)\cdot p + (1-p)\int_0^{\infty} h(x)\,\lambda e^{-\lambda x}\,dx.$$
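With $h(x) = x$ the two pieces give $E[X] = 0\cdot p + (1-p)/\lambda$, which a seeded Monte Carlo draw from the mixture reproduces (parameter values illustrative):

```python
import random

p, lam = 0.1, 0.5
exact = 0.0 * p + (1 - p) * (1.0 / lam)   # point mass at 0 plus exponential part

random.seed(1)
n = 200_000
total = 0.0
for _ in range(n):
    # with prob p the bulb blows at installation (X = 0), else X ~ Exp(lam)
    total += 0.0 if random.random() < p else random.expovariate(lam)
approx = total / n
```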
• Cauchy-Schwarz inequality:
$$\left(\int f(x)g(x)\,dF(x)\right)^2 \le \int f^2(x)\,dF(x)\cdot\int g^2(x)\,dF(x),$$
provided all three integrals exist. The range is the same for all three, but need not be bounded. (Does existence of the latter two integrals imply existence of the first?)
Proof: Essentially identical to the vector version:
$$0 \le \int(\lambda f + g)^2\,dF = \lambda^2\int f^2\,dF + 2\lambda\int fg\,dF + \int g^2\,dF,$$
hence “$b^2 - 4ac$” $\le 0$, i.e.
$$4\left(\int fg\,dF\right)^2 - 4\int f^2\,dF\cdot\int g^2\,dF \le 0. \ \textrm{¤}$$
— Example: $E\left[X^3\right] \le \sqrt{E\left[X^2\right]E\left[X^4\right]}$.
• Integration by parts: if the R-S integrals $\int_a^b f(x)\,dg(x)$ and $\int_a^b g(x)\,df(x)$ both exist, then
$$\int_a^b f(x)\,dg(x) + \int_a^b g(x)\,df(x) = f(b)g(b) - f(a)g(a).$$
This and other identities are also valid for decreasing integrators - e.g. replace $\int f\,dg$ by $-\int f\,d(-g)$ in the appropriate places.
• An application is Euler’s summation formula: If $f$ has a continuous derivative $f'$ on $(a - \varepsilon, n]$ for some $\varepsilon \in (0,1)$, then
$$\sum_{k=a}^{n} f(k) = \int_a^n f(x)\,dx + f(a) + \int_a^n f'(x)\{x\}\,dx,$$
where $\{x\} = x - [x]$ is the fractional part of $x$.
• Example: with $f(x) = 1/x$ and $a = 1$, we obtain
$$\sum_{k=1}^n \frac{1}{k} - \log n = 1 - \int_1^n \frac{\{x\}}{x^2}\,dx \in \left(\frac{1}{n},\ 1\right).$$
The middle term above is decreasing in $n$, and bounded below by 0, thus it has a limit (‘Euler’s constant’) $\gamma \in [0, 1)$ as $n \to \infty$:
$$\sum_{k=1}^n \frac{1}{k} - \log n \to \gamma = .577215\ldots$$
Note that both $\sum_{k=1}^n \frac{1}{k}$ and $\log n$ diverge.
• Similarly, $0 \le 2\sqrt n - 1 - \sum_{k=1}^n \frac{1}{\sqrt k} \le 1 - \frac{1}{\sqrt n}$. The convergence is very slow; limit $\approx .4604$ with $n = 10^9$, $.4603$ with $n = 10^8$.
Proof of Euler’s formula: Write the sum as a R-S integral, and split into regions on which $\{x\}$ is monotone:
$$\sum_{k=a}^n f(k) = \int_{a-\varepsilon}^n f(x)\,d[x] = \int_{a-\varepsilon}^n f(x)\,d(x - \{x\}) = \int_{a-\varepsilon}^n f(x)\,dx - \int_{a-\varepsilon}^n f(x)\,d\{x\}$$
$$= \int_{a-\varepsilon}^n f(x)\,dx - \int_{a-\varepsilon}^a f(x)\,d\{x\} - \sum_{k=a+1}^n \int_{k-1}^k f(x)\,d\{x\}.$$
Integrating by parts, and using $\{a - \varepsilon\} = 1 - \varepsilon$, gives
$$\sum_{k=a}^n f(k) = \int_{a-\varepsilon}^n f(x)\,dx - \left[f(a)\{a\} - f(a-\varepsilon)\{a-\varepsilon\} - \int_{a-\varepsilon}^a f'(x)\{x\}\,dx\right]$$
$$\qquad - \sum_{k=a+1}^n\left[f(k)\{k\} - f(k-1)\{k-1\} - \int_{k-1}^k f'(x)\{x\}\,dx\right]$$
$$= \int_{a-\varepsilon}^n f(x)\,dx + (1-\varepsilon)f(a-\varepsilon) + \int_{a-\varepsilon}^a f'(x)\{x\}\,dx + \sum_{k=a+1}^n \int_{k-1}^k f'(x)\{x\}\,dx$$
$$= \int_{a-\varepsilon}^n f(x)\,dx + (1-\varepsilon)f(a-\varepsilon) + \int_{a-\varepsilon}^n f'(x)\{x\}\,dx$$
$$\xrightarrow{\ \varepsilon\downarrow 0\ } \int_a^n f(x)\,dx + f(a) + \int_a^n f'(x)\{x\}\,dx,$$
as required. ¤
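Euler's formula with $f(x) = 1/x$ can be exercised numerically: the remainder integral $\int_1^n \{x\}/x^2\,dx$ (midpoint rule on each unit interval, where $\{x\}$ is smooth) recovers $\sum_{k\le n} 1/k - \log n$, which decreases to $\gamma$:

```python
import math

def gamma_seq(n):
    # sum_{k<=n} 1/k - log n, which decreases to Euler's constant
    return sum(1.0 / k for k in range(1, n + 1)) - math.log(n)

def remainder_integral(n, m=200):
    # integral of {x}/x^2 over [1, n], midpoint rule on each unit interval [k, k+1]
    total = 0.0
    for k in range(1, n):
        for j in range(m):
            x = k + (j + 0.5) / m
            total += ((x - k) / (x * x)) / m   # {x} = x - k on this interval
    return total

# the identity: sum_{k<=n} 1/k - log n = 1 - integral of {x}/x^2 over [1, n]
n = 50
lhs, rhs = gamma_seq(n), 1.0 - remainder_integral(n)
```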
15. Moment generating functions; Chebyshev’s Inequality; Asymptotic statistical theory
• Moment generating functions. $M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx}\,dF(x)$ (the R-S integral). If this exists in an open neighbourhood of $t = 0$ (note $M_X(0) = 1$ always exists) then it is the m.g.f. It is also written $M(t)$. Some useful properties:
1. $M^{(k)}(0) = E[X^k]$ (so if the m.g.f. exists, so do all moments). In other words we can differentiate under the integral sign. Then if we can find an expansion of the form $M(t) = \sum_k a_k t^k/k!$, by the uniqueness of power series this must be the MacLaurin series, and so we must have $a_k = M^{(k)}(0) = E[X^k]$.
2. If $M_X(t) = M_Y(t)$ for all $|t| < \delta$ (for some $\delta > 0$) then $X \sim Y$, i.e. the distribution of a r.v. is uniquely determined by the m.g.f.
3. If $\{X_n\}$ is a sequence of r.v.s with m.g.f.s $M_{X_n}(t)$, and if $M_{X_n}(t) \to M_X(t)$ for $t$ in a neighbourhood of 0, where $M_X(t)$ is the m.g.f. of a r.v. $X$, then $X_n \xrightarrow{\mathcal L} X$.
— You should show: if $X_n \sim \text{Binomial}(n, p_n)$ with $np_n \to \lambda$ then $X_n \xrightarrow{\mathcal L} \mathcal P(\lambda)$ (Poisson, mean $\lambda$).
4. Sums of independent r.v.s. If $X_1, \dots, X_n$ are independent r.v.s with m.g.f.s $M_{X_1}(t), \dots, M_{X_n}(t)$, then $S = \sum_{i=1}^n X_i$ has
$$M_S(t) = E\left[e^{t\sum_{i=1}^n X_i}\right] = E\left[\prod_{i=1}^n e^{tX_i}\right] = \prod_{i=1}^n E\left[e^{tX_i}\right] = \prod_{i=1}^n M_{X_i}(t).$$
In particular, if all $X_i$ are distributed in the same way, with m.g.f. $M(t)$, then the m.g.f. of their sum is $M_S(t) = M^n(t)$ and the m.g.f. of their average is
$$M_{\bar X}(t) = E\left[e^{tS/n}\right] = M_S(t/n) = M^n(t/n).$$
• All of this also holds for the characteristic function (c.f.) $E[e^{itX}] = E[\cos tX] + iE[\sin tX]$, which always exists.
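Property 3 can be illustrated with the binomial-to-Poisson limit mentioned above: with $p_n = \lambda/n$, the binomial m.g.f. $(1 - p_n + p_n e^t)^n$ approaches $e^{\lambda(e^t - 1)}$ (the values of $\lambda$ and $t$ below are illustrative):

```python
import math

lam, t = 3.0, 0.4
poisson_mgf = math.exp(lam * (math.exp(t) - 1.0))

for n in (10, 100, 10_000):
    p = lam / n
    binom_mgf = (1 - p + p * math.exp(t)) ** n  # m.g.f. of Binomial(n, lam/n)
    # binom_mgf -> poisson_mgf as n grows with n*p = lam held fixed
```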
• Suppose $X \sim N(0,1)$, with p.d.f. $\phi(x)$. Define $Y = X^2$, a $\chi^2_1$ r.v. Its m.g.f. is
$$M_Y(t) = \int_{-\infty}^{\infty} e^{tx^2}\phi(x)\,dx = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-x^2(1-2t)/2}\,dx \quad (|t| < 1/2)$$
$$= \frac{1}{\sqrt{1-2t}}\int_{-\infty}^{\infty}\phi(u)\,du \quad (u = x\sqrt{1-2t})$$
$$= (1-2t)^{-1/2}.$$
Now suppose $Y$ is the sum of squares of $n$ independent $N(0,1)$’s, i.e. $Y$ is a $\chi^2_n$ r.v. Its m.g.f. is the $n$th power of the above (why?), thus $M_Y(t) = (1-2t)^{-n/2}$ ($|t| < 1/2$). It follows that the p.d.f. is
$$f_Y(y) = \frac{\left(\frac{y}{2}\right)^{\frac{n}{2}-1}e^{-\frac{y}{2}}}{2\,\Gamma\!\left(\frac{n}{2}\right)},\quad 0 < y < \infty.$$
Proof: With $u = y(1-2t)/2$ for $|t| < 1/2$,
$$\int_0^{\infty} e^{ty}f_Y(y)\,dy = (1-2t)^{-\frac{n}{2}}\int_0^{\infty}\frac{u^{\frac{n}{2}-1}e^{-u}}{\Gamma\!\left(\frac{n}{2}\right)}\,du = (1-2t)^{-\frac{n}{2}}. \ \textrm{¤}$$
• Chebyshev’s Inequality: If a r.v. $X$ has mean $\mu$ and variance $\sigma^2$ then
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
Proof: An equivalent formulation is
$$P(|Z| \ge k) \le \frac{1}{k^2},$$
where $Z = (X - \mu)/\sigma$ has mean 0 and variance 1. Note that the indicator of an event $A$, given by
$$I(A) = \begin{cases}1, & \text{if } A \text{ occurs},\\ 0, & \text{otherwise},\end{cases}$$
is $\sim \text{Binomial}(1, P(A))$, with $E[I(A)] = P(A)$. Thus
$$1 = \text{Var}[Z] = E\left[Z^2\right] \ge E\left[Z^2 I(|Z| \ge k)\right] \ge k^2 E[I(|Z| \ge k)] = k^2 P(|Z| \ge k). \ \textrm{¤}$$
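Chebyshev's bound holds for every distribution with a variance, and is usually far from tight; comparing with the exact standard normal tail (via the standard library's NormalDist):

```python
from statistics import NormalDist

Z = NormalDist(0.0, 1.0)
for k in (1.5, 2.0, 3.0):
    exact = 2.0 * (1.0 - Z.cdf(k))  # P(|Z| >= k) for Z ~ N(0, 1)
    bound = 1.0 / k ** 2            # Chebyshev's bound
    assert exact <= bound           # the bound holds, with room to spare
```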
• Chebyshev’s Inequality furnishes an easy proof of the Weak Law of Large Numbers: If $\bar X_n$ is the average of $n$ independent r.v.s, each with mean $\mu$ and variance $\sigma^2$, then $\bar X_n \xrightarrow{P} \mu$ as $n \to \infty$.
Proof: Note that $\bar X_n$ has mean $\mu$ and variance $\sigma^2/n$. For $\varepsilon > 0$, put $k = \varepsilon\sqrt n/\sigma$; then
$$P\left(\left|\bar X_n - \mu\right| \ge \varepsilon\right) = P\left(\left|\bar X_n - \mu\right| \ge k\left(\frac{\sigma}{\sqrt n}\right)\right) \le \frac{1}{k^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0 \text{ as } n \to \infty. \ \textrm{¤}$$
• Central Limit Theorem. This is probably the most significant theorem in mathematical statistics. It gives the approximate normality of averages of r.v.s and, when combined with the MVT (or Taylor’s Theorem), the WLLN and Slutsky’s Theorem (see below), forms the basis for approximating the distributions of many other statistics of interest.
• Theorem: Let $X_1, X_2, \dots$ be independent r.v.s, with common d.f. $F(x) = P(X \le x)$, mean $\mu$, variance $\sigma^2$ ($0 < \sigma^2 < \infty$). Put $Z_n = \sqrt n\left(\bar X_n - \mu\right)$; then $Z_n \xrightarrow{\mathcal L} N(0, \sigma^2)$.
— To apply, since the statements “$Z_n$ is approximately $N(0, \sigma^2)$” and “$\bar X_n$ is approximately $N(\mu, \sigma^2/n)$” are equivalent, we treat $\bar X_n$ as if it were distributed approximately as $N(\mu, \sigma^2/n)$. Then, e.g. if we can also estimate $\sigma^2$, we have the basis for making inferences about $\mu$.
• Proof of CLT: We make the additional assumption that the $X_i$ have a m.g.f. Define
$$m(t) = E\left[e^{t(X - \mu)}\right]\ \left(= e^{-t\mu}E\left[e^{tX}\right]\right).$$
We shall use the fact, being established in assignment 3, that the m.g.f. of $Y \sim N(\mu, \sigma^2)$ is
$$E\left[e^{tY}\right] = e^{t\mu + t^2\sigma^2/2}.$$
Notation: “$f(t) = o(g(t))$ as $t \to t_0$” means “$f(t)/g(t) \to 0$ as $t \to t_0$”.
Let $t$ be fixed but arbitrary. Expand $m(\cdot)$ as
$$m(t) = m(0) + m'(0)t + m''(0)\frac{t^2}{2} + m'''(t^*)\frac{t^3}{6}\quad (0 \le |t^*| \le |t|)$$
$$= 1 + E[X - \mu]\,t + E\left[(X - \mu)^2\right]\frac{t^2}{2} + m'''(t^*)\frac{t^3}{6}$$
$$= 1 + \frac{\sigma^2 t^2}{2} + o(t^2) \text{ as } t \to 0.$$
Why $o(t^2)$? - because $m'''(t^*)$ has a finite limit as $t$, hence $t^*$, tends to 0.
We are to show that the m.g.f. of
$$Z_n = \frac{1}{\sqrt n}\sum_{i=1}^n (X_i - \mu)$$
tends to that of a $N(0, \sigma^2)$ r.v., i.e. that
$$E\left[e^{\frac{t}{\sqrt n}\sum_{i=1}^n (X_i - \mu)}\right] = \prod_{i=1}^n E\left[e^{\frac{t}{\sqrt n}(X_i - \mu)}\right] = m^n(t/\sqrt n)$$
tends to $e^{t^2\sigma^2/2}$. Equivalently, we show that
$$n\log m(t/\sqrt n) \to \frac{t^2\sigma^2}{2} \quad \text{as } n \to \infty. \tag{15.1}$$
For this, write
$$n\log m(t/\sqrt n) = n\log\Bigl(1 + \underbrace{\tfrac{t^2\sigma^2}{2n} + o\bigl(\tfrac{t^2}{n}\bigr)}_{x_n}\Bigr) = n\log(1 + x_n),$$
where (why?) $x_n \to 0$ and $n x_n \to t^2\sigma^2/2$. This gives (15.1). ¤
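The limit (15.1) can be watched directly for a concrete distribution. A sketch with Bernoulli($p$) summands (an arbitrary choice with an explicit $m(t)$):

```python
import math

p = 0.3                      # Bernoulli(p): mu = p, sigma^2 = p(1 - p)
sigma2 = p * (1 - p)

def m(t):
    # m(t) = E[e^{t(X - mu)}] for X ~ Bernoulli(p)
    return p * math.exp(t * (1 - p)) + (1 - p) * math.exp(-t * p)

t = 1.2
target = t * t * sigma2 / 2.0    # t^2 sigma^2 / 2
for n in (10, 1_000, 100_000):
    val = n * math.log(m(t / math.sqrt(n)))
    # val -> target as n -> infinity
```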
• Slutsky’s Theorem: If $X_n \xrightarrow{\mathcal L} X$ and $Y_n \xrightarrow{P} c$ (constant) then:
1. $X_n \pm Y_n \xrightarrow{\mathcal L} X \pm c$,
2. $X_n \cdot Y_n \xrightarrow{\mathcal L} X \cdot c$,
3. $X_n / Y_n \xrightarrow{\mathcal L} X/c$ if $c \ne 0$.
Note that if $X = c_0$ (constant) then all occurrences of $\xrightarrow{\mathcal L}$ can be replaced by $\xrightarrow{P}$.
• Application: We often make inferences about a population mean using the $t$-statistic
$$T_n = \frac{\sqrt n\left(\bar X_n - \mu\right)}{s_n},$$
where $\bar X_n$ is as in the CLT and $s_n$ is the sample standard deviation. If the data are normally distributed then $T_n$ follows a “Student’s t” distribution on $n - 1$ degrees of freedom; it is well known that this distribution is closely approximated by
the $N(0,1)$ when $n$ is reasonably large. This latter fact holds even for non-normal parent distributions: Note
$$s_n^2 = \frac{\sum_{i=1}^n\left(X_i - \bar X_n\right)^2}{n-1} = \frac{n}{n-1}\left(\frac{\sum_{i=1}^n X_i^2}{n} - \bar X_n^2\right),$$
where $\sum_{i=1}^n X_i^2/n \xrightarrow{P} E[X^2]$ by the WLLN, $\bar X_n \xrightarrow{P} \mu$ by the WLLN; it follows that $\left(\sum_{i=1}^n X_i^2/n - \bar X_n^2\right) \xrightarrow{P} E[X^2] - \mu^2 = \sigma^2$ (a special case of (1) of Slutsky’s Theorem), hence so does $s_n^2$ (Slutsky (2)); thus $s_n \xrightarrow{P} \sigma$ (since $s_n$ is a continuous function of $s_n^2$) and so $s_n/\sigma \xrightarrow{P} 1$ (Slutsky (3)). Now again by Slutsky, and the CLT,
$$T_n = \frac{\dfrac{\sqrt n\left(\bar X_n - \mu\right)}{\sigma}}{\dfrac{s_n}{\sigma}} \xrightarrow{\mathcal L} \frac{N(0,1)}{1} = N(0,1).$$
Part IV<br />
MULTIDIMENSIONAL<br />
CALCULUS AND<br />
OPTIMIZATION
16. Multidimensional differentiation; Taylor’s and Inverse Function Theorems
• $\mathbf f : S \subset \mathbb R^n \to \mathbb R^m$ can be represented as
$$\begin{pmatrix} f_1(\mathbf x)\\ f_2(\mathbf x)\\ \vdots\\ f_m(\mathbf x) \end{pmatrix} \quad \text{for } f_i(\mathbf x) : \mathbb R^n \to \mathbb R^1.$$
• Some results from any text on multivariable calculus/analysis:
— Every bounded sequence in $\mathbb R^n$ contains a convergent subsequence.
— If $f : S \subset \mathbb R^n \to \mathbb R$ is continuous on a closed, bounded set $S$ then $f$ attains its inf and sup there; i.e. there are points $\mathbf p, \mathbf q \in S$ with $f(\mathbf p) = \sup_{\mathbf x \in S} f(\mathbf x)$ and $f(\mathbf q) = \inf_{\mathbf x \in S} f(\mathbf x)$.
— If $\mathbf f : S \subset \mathbb R^n \to \mathbb R^m$ is continuous on a closed, bounded set $S$ then $\mathbf f$ is uniformly continuous on $S$.
• Derivatives. Put $\mathbf e_j = (0, \dots, 0, 1, 0, \dots, 0)'$, with the 1 in position $j$. Let $f : S \subset \mathbb R^n \to \mathbb R^1$. If
$$\lim_{h\to 0}\frac{f(\mathbf a + h\mathbf e_j) - f(\mathbf a)}{h}$$
exists, we say $f$ has a partial derivative with respect to $x_j$ at $\mathbf a$; this limit is denoted by $\frac{\partial f(\mathbf a)}{\partial x_j}$. It is computed by treating all variables except the $j$th as constant; i.e. it is the ordinary derivative of $f(a_1, \dots, a_{j-1}, x_j, a_{j+1}, \dots, a_n)$ with respect to $x_j$.
• The Jacobian matrix is the $m \times n$ matrix $\mathbf J_{\mathbf f}(\mathbf x) = \left(\frac{\partial\mathbf f}{\partial\mathbf x}\right)$ with $(i,j)$ element $\frac{\partial f_i}{\partial x_j}$, evaluated at $\mathbf x = (x_1, \dots, x_n)$. This arrangement of partial derivatives ensures that the chain rule is easily represented: if $\mathbf f : \mathbb R^n \to \mathbb R^m$ and $\mathbf g : \mathbb R^m \to \mathbb R^p$ then $\mathbf g \circ \mathbf f : \mathbb R^n \to \mathbb R^p$ has
$$\mathbf J_{\mathbf g\circ\mathbf f}(\mathbf x)_{p\times n} = \mathbf J_{\mathbf g}(\mathbf f(\mathbf x))_{p\times m}\,\mathbf J_{\mathbf f}(\mathbf x)_{m\times n}. \tag{16.1}$$
This is a consequence of the formula for the ‘total derivative’: if $u \in \mathbb R$ and $z = g(y_1(u), \dots, y_m(u))$ has continuous partial derivatives, then
$$\frac{dz}{du} = \sum_{k=1}^m \frac{\partial z}{\partial y_k}\frac{dy_k}{du}.$$
Apply this to each $g_i$, with $y_k = f_k(\mathbf x)$ and $u = x_j$:
$$\left[\mathbf J_{\mathbf g\circ\mathbf f}(\mathbf x)\right]_{ij} = \frac{\partial(\mathbf g\circ\mathbf f)_i}{\partial x_j} = \frac{\partial g_i(\mathbf f(\mathbf x))}{\partial x_j} = \sum_{k=1}^m \frac{\partial g_i(\mathbf f(\mathbf x))}{\partial f_k(\mathbf x)}\frac{\partial f_k(\mathbf x)}{\partial x_j} = \sum_{k=1}^m \left[\mathbf J_{\mathbf g}(\mathbf f(\mathbf x))\right]_{ik}\left[\mathbf J_{\mathbf f}(\mathbf x)\right]_{kj} = \left[\mathbf J_{\mathbf g}(\mathbf f(\mathbf x))\cdot\mathbf J_{\mathbf f}(\mathbf x)\right]_{ij}.$$
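The matrix form (16.1) of the chain rule can be verified with finite-difference Jacobians; the maps $\mathbf f$ and $\mathbf g$ below are arbitrary smooth examples chosen for illustration:

```python
def jac(F, x, m, h=1e-6):
    # central finite-difference Jacobian of F: R^n -> R^m at x
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        Fp, Fm = F(xp), F(xm)
        for i in range(m):
            J[i][j] = (Fp[i] - Fm[i]) / (2.0 * h)
    return J

f = lambda x: [x[0] * x[1], x[0] + x[1] ** 2]   # f: R^2 -> R^2
g = lambda y: [y[0] ** 2 + 3.0 * y[1]]          # g: R^2 -> R^1

x = [0.5, -1.2]
lhs = jac(lambda z: g(f(z)), x, 1)              # J_{g∘f}(x), computed directly
Jg, Jf = jac(g, f(x), 1), jac(f, x, 2)
rhs = [[sum(Jg[i][k] * Jf[k][j] for k in range(2))   # J_g(f(x)) J_f(x)
        for j in range(2)] for i in range(1)]
```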
— If $m = 1$ then the Jacobian matrix of $f : \mathbb R^n \to \mathbb R$ is a row vector whose transpose is the gradient:
$$\nabla f(\mathbf x) = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)'.$$
— The Jacobian of $\nabla f : \mathbb R^n \to \mathbb R^n$ is called the Hessian of $f : \mathbb R^n \to \mathbb R$. This $n \times n$ matrix $\mathbf H_f(\mathbf x)$ has $(i,j)$ element
$$\left(\frac{\partial(\nabla f)_i}{\partial x_j}\right) = \frac{\partial}{\partial x_j}\frac{\partial f}{\partial x_i}.$$
If one of $\frac{\partial^2 f}{\partial x_j\partial x_i}$, $\frac{\partial^2 f}{\partial x_i\partial x_j}$ exists and is continuous, then the other exists and the two are equal; under these conditions the Hessian matrix is symmetric. We write the $(i,j)$ element as $\frac{\partial^2 f}{\partial x_i\partial x_j}$.
— If $\mathbf f : S \subset \mathbb R^n \to \mathbb R^m$ then the directional derivative at $\mathbf a$ in the direction $\mathbf v$ (with $\|\mathbf v\| = 1$) is
$$\lim_{t\to 0}\frac{\mathbf f(\mathbf a + t\mathbf v) - \mathbf f(\mathbf a)}{t} = \mathbf J_{\mathbf f}(\mathbf a)\mathbf v,$$
provided the Jacobian exists.
Proof: Put $\mathbf g(t) = \mathbf f(\mathbf a + t\mathbf v) = \mathbf f\circ\mathbf k(t)$ where $\mathbf k(t) = \mathbf a + t\mathbf v$. We seek $\lim_{t\to 0}\frac{\mathbf g(t) - \mathbf g(0)}{t}$ - using (16.1) this is
$$\frac{d\mathbf g}{dt}\bigg|_{t=0} = \mathbf J_{\mathbf f}(\mathbf k(t))\,\mathbf J_{\mathbf k}(t)\Big|_{t=0} = \mathbf J_{\mathbf f}(\mathbf a)\mathbf v.$$
• Taylor’s Theorem. I’ll give a version suitable for the intended applications. A major difficulty in writing down a multivariate Taylor’s Theorem is that appropriate notation, for representing derivatives higher than second order, is very cumbersome. It is rare however to require expansions beyond “Hessian + remainder”. Thus, suppose $f : S \subset \mathbb R^n \to \mathbb R$, where $S$ is convex:
$$\mathbf x, \mathbf y \in S \Rightarrow (1-\lambda)\mathbf x + \lambda\mathbf y \in S \text{ for } 0 \le \lambda \le 1.$$
1. If the partial derivatives of $f$ are continuous on $S$ then
$$f(\mathbf x) = f(\mathbf u) + \nabla f'(\boldsymbol\xi)(\mathbf x - \mathbf u)$$
for some $\boldsymbol\xi = (1-\lambda)\mathbf u + \lambda\mathbf x = \boldsymbol\xi_\lambda$. (How? Write $g(\lambda) = f(\boldsymbol\xi_\lambda)$, expand $g(1)$ around $\lambda = 0$.) We also have
$$f(\mathbf x) = f(\mathbf u) + \nabla f'(\mathbf u)(\mathbf x - \mathbf u) + o(\|\mathbf x - \mathbf u\|) \quad \text{as } \|\mathbf x - \mathbf u\| \to 0.$$
(How? Write $\mathbf x = \mathbf u + t\mathbf v$ for $\mathbf v = (\mathbf x - \mathbf u)/\|\mathbf x - \mathbf u\|$, $t = \|\mathbf x - \mathbf u\|$; apply l’Hospital’s Rule.)
2. If the second order partials are continuous on $S$ then with $\boldsymbol\xi$ as above,
$$f(\mathbf x) = f(\mathbf u) + \nabla f'(\mathbf u)(\mathbf x - \mathbf u) + \frac12(\mathbf x - \mathbf u)'\mathbf H_f(\boldsymbol\xi)(\mathbf x - \mathbf u).$$
We also have
$$f(\mathbf x) = f(\mathbf u) + \nabla f'(\mathbf u)(\mathbf x - \mathbf u) + \frac12(\mathbf x - \mathbf u)'\mathbf H_f(\mathbf u)(\mathbf x - \mathbf u) + o(\|\mathbf x - \mathbf u\|^2).$$
— Example: $\mathbf x = (x, y)'$, $f(\mathbf x) = e^x\cos y$, $\mathbf u = \mathbf 0$. Then
$$\nabla f(\mathbf 0) = \begin{pmatrix} e^x\cos y\\ -e^x\sin y \end{pmatrix}\bigg|_{\mathbf x=\mathbf 0} = \begin{pmatrix}1\\0\end{pmatrix},$$
$$\mathbf H_f(\mathbf 0) = \begin{pmatrix} e^x\cos y & -e^x\sin y\\ -e^x\sin y & -e^x\cos y \end{pmatrix}\bigg|_{\mathbf x=\mathbf 0} = \begin{pmatrix}1 & 0\\ 0 & -1\end{pmatrix},$$
and
$$f(\mathbf x) = f(\mathbf 0) + \nabla f'(\mathbf 0)\mathbf x + \frac12\mathbf x'\mathbf H_f(\mathbf 0)\mathbf x + o(\|\mathbf x\|^2) = 1 + x + \frac12\left[x^2 - y^2\right] + o(x^2 + y^2).$$
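The quadratic approximation above can be checked numerically; the error should shrink like the cube of $\|\mathbf x\|$ (i.e. it is $o(\|\mathbf x\|^2)$):

```python
import math

def f(x, y):
    return math.exp(x) * math.cos(y)

def quad(x, y):
    # 1 + x + (x^2 - y^2)/2, from grad f(0) = (1, 0)' and H_f(0) = diag(1, -1)
    return 1.0 + x + 0.5 * (x * x - y * y)

# shrinking (x, y) by a factor of 10 shrinks the error by roughly 1000
e1 = abs(f(0.1, 0.2) - quad(0.1, 0.2))
e2 = abs(f(0.01, 0.02) - quad(0.01, 0.02))
```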
• Notation: If $\mathbf f : \mathbb R^n \to \mathbb R^m$, $\mathbf y = \mathbf f(\mathbf x) \in \mathbb R^m$, we often write $\left(\frac{\partial\mathbf y}{\partial\mathbf x}\right)$ for the $m \times n$ Jacobian matrix $\mathbf J_{\mathbf f}(\mathbf x)$.
• Example: Let $F$ be a strictly increasing d.f., $q \in (0,1)$ a probability. Then the relationship $F(x) = q$ defines $x = x(q)$ as a function of $q$. This value $x(q) = F^{-1}(q)$ is the $q$th quantile, and $x(\cdot)$ is the
quantile function. Differentiating $F(x(q)) = q$ with respect to $q$ and using the chain rule gives
$$1 = \frac{dq}{dq} = \frac{dF(x(q))}{dq} = F'(x(q))\,x'(q); \text{ so } x'(q) = \frac{1}{F'(x(q))} = \frac{1}{F'(F^{-1}(q))}.$$
The denominator is non-zero since $F$ is strictly increasing. The multivariate analogue of this follows.
• Inverse Function Theorem: If $\mathbf f : S \subset \mathbb R^n \to \mathbb R^n$ has a continuous Jacobian within $S$, nonsingular at a point $\mathbf x_0 \in S$, then (i) $\mathbf f$ is $1-1$ in an open neighbourhood of $\mathbf x_0$, and (ii) there is an open neighbourhood $T$ of $\mathbf f(\mathbf x_0)$ within which $\mathbf f^{-1} : T \subset \mathbb R^n \to \mathbb R^n$ is well-defined and has a continuous, non-singular Jacobian with
$$\mathbf J_{\mathbf f^{-1}}(\mathbf f(\mathbf x_0)) = \mathbf J_{\mathbf f}^{-1}(\mathbf x_0).$$
— If $\mathbf J_{\mathbf f}(\mathbf x_0)$ is singular then $\mathbf J_{\mathbf f}(\mathbf x_0)\mathbf v = \mathbf 0$ for some $\mathbf v \ne \mathbf 0$, so that the derivative in direction $\mathbf v$ is zero and $\mathbf f$ might not be 1-1. Nonsingularity of $\mathbf J_{\mathbf f}(\mathbf x_0)$ is sufficient but not always necessary - think about $f(x) = x^3$ and $x_0 = 0$.
— In the notation introduced above, the statement becomes
$$\left(\frac{\partial\mathbf x}{\partial\mathbf y}\right) = \left(\frac{\partial\mathbf y}{\partial\mathbf x}\right)^{-1},$$
with both sides evaluated at $(\mathbf x_0,\ \mathbf y_0 = \mathbf f(\mathbf x_0))$.
— Note that applying the chain rule to the relationship $\mathbf x = \mathbf f^{-1}\circ\mathbf f(\mathbf x)$ gives
$$\mathbf I = \left(\frac{\partial\mathbf x}{\partial\mathbf x}\right) = \mathbf J_{\mathbf f^{-1}}(\mathbf f(\mathbf x))\,\mathbf J_{\mathbf f}(\mathbf x).$$
17. Implicit Function Theorem; extrema; Lagrange multipliers
• Example of Inverse Function Theorem. Suppose that r.v.s $X_1, X_2 > 0$ with joint p.d.f. $f_X(x_1, x_2)$ are transformed to $Y_1, Y_2 > 0$ through the transformation
$$\begin{pmatrix} y_1\\ y_2 \end{pmatrix} = \mathbf f\begin{pmatrix} x_1\\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 + x_2^2\\ x_1^2 - x_2^2 \end{pmatrix}.$$
We will later see, when we look at multivariable integration, that the p.d.f. of $(Y_1, Y_2)$ is
$$f_X(\mathbf x(\mathbf y))\,\left|\det\left(\partial\mathbf x/\partial\mathbf y\right)\right|.$$
Note
$$\left(\frac{\partial\mathbf y}{\partial\mathbf x}\right) = \mathbf J_{\mathbf f}(\mathbf x) = 2\begin{pmatrix} x_1 & x_2\\ x_1 & -x_2 \end{pmatrix}$$
is non-singular if neither $x_1$ nor $x_2$ equals 0. If $\mathbf x_0$ is such a point and $\mathbf y_0 = \mathbf f(\mathbf x_0)$ then in a neighbourhood of $\mathbf y_0$ the inverse map $\mathbf f^{-1} : \mathbf y \to \mathbf f^{-1}(\mathbf y) = \mathbf x$ exists and is such that
$$x_1^2 + x_2^2 = y_1, \qquad x_1^2 - x_2^2 = y_2.$$
The Jacobian of $\mathbf f^{-1}$ is
$$\left(\frac{\partial\mathbf x}{\partial\mathbf y}\right) = \left(\frac{\partial\mathbf y}{\partial\mathbf x}\right)^{-1}.$$
We need only
$$\left|\det\left(\frac{\partial\mathbf x}{\partial\mathbf y}\right)\right| = \left|\det\left(\frac{\partial\mathbf y}{\partial\mathbf x}\right)\right|^{-1} = (8x_1x_2)^{-1} = \frac{1}{4\sqrt{y_1^2 - y_2^2}}.$$
Note that this can be calculated without determining $\mathbf J_{\mathbf f}^{-1}$, or solving for $\mathbf x$ in terms of $\mathbf y$, explicitly.
• The Inverse Function Theorem asserts a unique solution $\mathbf x = \mathbf f^{-1}(\mathbf y)$ to equations of the form $\mathbf g(\mathbf x, \mathbf y) = \mathbf y - \mathbf f(\mathbf x) = \mathbf 0$, with $\mathbf x, \mathbf y \in \mathbb R^n$, under the given conditions. Here $\mathbf x$ is explicitly defined as a function of $\mathbf y$. The solutions, when written as $\mathbf x = \boldsymbol\varphi(\mathbf y)$, are differentiable functions of $\mathbf y$ with $\mathbf J_{\boldsymbol\varphi}(\mathbf y) = \mathbf J_{\mathbf f}^{-1}(\mathbf x)$. Consider now the general case
$$\mathbf g(\mathbf x, \mathbf y) = \mathbf 0_{n\times 1} \quad \text{for } \mathbf x \in \mathbb R^n,\ \mathbf y \in \mathbb R^m. \tag{17.1}$$
For given $\mathbf y$ there may be a solution $\mathbf x = \boldsymbol\varphi(\mathbf y)$ for $\boldsymbol\varphi : \mathbb R^m \to \mathbb R^n$. Such a function is “implicitly defined” by (17.1). The Implicit Function Theorem gives conditions under which such a function exists, and gives some of its properties.
• Implicit Function Theorem. With notation as in (17.1), suppose $\mathbf g$ is defined on $S \subset \mathbb R^{n+m}$ and has a continuous Jacobian in a neighbourhood of a point $(\mathbf x_0, \mathbf y_0) \in S$, at which $\mathbf g(\mathbf x_0, \mathbf y_0) = \mathbf 0_{n\times 1}$. Suppose
$$\mathbf J_1(\mathbf x, \mathbf y) = \left(\frac{\partial\mathbf g(\mathbf x, \mathbf y)}{\partial\mathbf x}\right)$$
(= the first $n$ columns of the $n \times (n+m)$ matrix $\mathbf J_{\mathbf g}(\mathbf x, \mathbf y)$) is non-singular at $(\mathbf x_0, \mathbf y_0)$. Then there is a neighbourhood of $(\mathbf x_0, \mathbf y_0)$ within which the relationship $\mathbf g(\mathbf x, \mathbf y) = \mathbf 0$ defines a continuous mapping $\mathbf x = \boldsymbol\varphi(\mathbf y)$, i.e. $\mathbf g(\boldsymbol\varphi(\mathbf y), \mathbf y) = \mathbf 0$. Moreover, $\boldsymbol\varphi$ has a continuous Jacobian, with
$$\mathbf J_{\boldsymbol\varphi}(\mathbf y) = -\mathbf J_1^{-1}(\boldsymbol\varphi(\mathbf y), \mathbf y)\,\mathbf J_2(\boldsymbol\varphi(\mathbf y), \mathbf y), \tag{17.2}$$
where $\mathbf J_2(\mathbf x, \mathbf y) = \left(\frac{\partial\mathbf g(\mathbf x, \mathbf y)}{\partial\mathbf y}\right)$.
— Note that differentiating the relationship $\mathbf g(\mathbf x, \mathbf y) = \mathbf 0$ with $\mathbf x = \boldsymbol\varphi(\mathbf y)$ gives
$$\mathbf J_1(\mathbf x, \mathbf y)\left(\frac{\partial\mathbf x}{\partial\mathbf y}\right) + \mathbf J_2(\mathbf x, \mathbf y)\left(\frac{\partial\mathbf y}{\partial\mathbf y}\right) = \mathbf 0,$$
yielding (17.2) with $(\partial\mathbf x/\partial\mathbf y)$ written as $\mathbf J_{\boldsymbol\varphi}(\mathbf y)$.
• Example. Write the characteristic equation for a<br />
matrix A × as<br />
(−1) |A − I| = ( a) =<br />
X<br />
=0<br />
=0<br />
Here the are certain continuous functions of the<br />
elements of A and =1;a =( 0 −1 ) 0 .<br />
How do the eigenvalues vary as A varies, say in a<br />
neighbourhood of some matrix A 0 ? The Jacobian<br />
of is clearly continuous, and so if 0 is an<br />
eigenvalue of A 0 with multiplicity one, so that<br />
1 ( a) =( a) 6= 0<br />
at ( 0 a 0 = a (A 0 )), then the char. eqn. defines<br />
as a continuously differentiable function of the<br />
in a neighbourhood of a 0 :<br />
0 0 ( a) ( a)<br />
= +<br />
a a<br />
⇒ (a)<br />
a = − a<br />
(a)<br />
³ <br />
1<br />
−1´<br />
= −<br />
P =1<br />
−1
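The sensitivity formula can be checked numerically. The sketch below is not from the notes: it uses a hypothetical $2\times 2$ matrix, for which the characteristic polynomial is a quadratic, so the perturbed eigenvalue is available in closed form and can be compared with the Implicit Function Theorem prediction $\partial\lambda/\partial a_k = -\lambda^k / p'(\lambda)$.<br />

```python
import math

# Monic characteristic polynomial p(l) = l^2 + a1*l + a0 of a 2x2 matrix.
# The implicit-function formula gives dl/da_k = -l^k / p'(l), with p'(l) = 2l + a1.

def roots_quadratic(a1, a0):
    """Real roots of l^2 + a1*l + a0 = 0, largest first."""
    disc = math.sqrt(a1 * a1 - 4.0 * a0)
    return ((-a1 + disc) / 2.0, (-a1 - disc) / 2.0)

a1, a0 = -3.0, 2.0                       # char. poly of A0 = [[2,1],[0,1]]: roots 2 and 1
lam = roots_quadratic(a1, a0)[0]         # simple eigenvalue lam = 2

# Implicit Function Theorem prediction
dlam_da0 = -1.0 / (2.0 * lam + a1)       # -l^0 / p'(l)
dlam_da1 = -lam / (2.0 * lam + a1)       # -l^1 / p'(l)

# Finite-difference check: perturb each coefficient and track the root
h = 1e-7
fd_da0 = (roots_quadratic(a1, a0 + h)[0] - lam) / h
fd_da1 = (roots_quadratic(a1 + h, a0)[0] - lam) / h
```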
• Extrema of $f: D \subset \mathbb{R}^n \to \mathbb{R}$. Suppose the conditions of Taylor’s Theorem hold, so that we can expand as<br />
$$f(x_0 + h) = f(x_0) + \nabla f'(x_0)\,h + \tfrac{1}{2}\,h' H_f(\xi)\, h.$$<br />
Let $x_0$ be a stationary point: $\nabla f(x_0) = 0$. Then:<br />
1. If $H_f(\xi) > 0$ for $\xi$ in a neighbourhood of $x_0$ then $x_0$ furnishes a local minimum of $f$: $f(x_0 + h) > f(x_0)$ for sufficiently small $h \ne 0$.<br />
2. If $H_f(\xi) < 0$ for $\xi$ in a neighbourhood of $x_0$ then $x_0$ furnishes a local maximum of $f$: $f(x_0 + h) < f(x_0)$ for sufficiently small $h \ne 0$.<br />
3. If neither (1) nor (2) holds then $f(x_0 + h) - f(x_0)$ changes sign as $h$ varies; we say that $x_0$ is a saddlepoint.<br />
• In (1), $H_f(x_0) > 0$ and continuous in a neighbourhood of $x_0$ suffices; similarly with (2).<br />
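The classification in (1)-(3) can be sketched in code by checking the signs of the Hessian’s eigenvalues at a stationary point. The two functions below are hypothetical examples chosen for illustration, not taken from the notes.<br />

```python
import math

def eig2_sym(a, b, c):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]], smallest first."""
    m = 0.5 * (a + c)
    r = math.sqrt(0.25 * (a - c) ** 2 + b * b)
    return (m - r, m + r)

def classify(hessian):
    """hessian = (a, b, c) for [[a, b], [b, c]] evaluated at a stationary point."""
    lo, hi = eig2_sym(*hessian)
    if lo > 0:
        return "local minimum"    # H > 0: all eigenvalues positive
    if hi < 0:
        return "local maximum"    # H < 0: all eigenvalues negative
    return "saddlepoint"

kind_f = classify((2.0, 0.0, 6.0))    # Hessian of f(x,y) = x^2 + 3y^2 at (0,0)
kind_g = classify((2.0, 0.0, -2.0))   # Hessian of g(x,y) = x^2 - y^2 at (0,0)
```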
• Often we seek extrema of multivariate functions, subject to certain side conditions. For instance in ANOVA we might seek least squares estimates, subject to the constraint that the average treatment effect is zero. The general problem considered here is to find the extrema of $f: D \subset \mathbb{R}^n \to \mathbb{R}$ subject to $g(x) = 0_{m\times 1}$ for $m < n$. E.g.<br />
P: Minimize $x'Ax$ subject to $Bx = c_{m\times 1}$, where $A > 0$ and $B_{m\times n}$ has rank $m < n$.<br />
Put<br />
$$L(x; \lambda) = f(x) + \lambda' g(x)$$<br />
for a vector $\lambda_{m\times 1}$ of “Lagrange multipliers”. Claim: The stationary points of $L$ that satisfy the constraints determine the stationary points in the original problem. These points then satisfy the $n + m$ equations in the $n + m$ variables of $(x, \lambda)$:<br />
$$\nabla L(x; \lambda) = 0_{(n+m)\times 1}.$$<br />
Equivalently,<br />
$$\nabla f'(x) + \lambda' J_g(x) = 0'_{1\times n},$$<br />
$$g(x) = 0_{m\times 1}.$$<br />
The proof of this claim follows the example.<br />
• Example: Problem P above. We have<br />
$$L(x; \lambda) = f(x) + \lambda' g(x) = x'Ax + \lambda'(Bx - c),$$<br />
with (you should verify this)<br />
$$0'_{1\times n} = \frac{\partial L}{\partial x} = 2x'A + \lambda'B,$$<br />
implying<br />
$$x = -\tfrac{1}{2}A^{-1}B'\lambda.$$<br />
Combine this with $Bx = c$ to get<br />
$$\lambda = -2\left(BA^{-1}B'\right)^{-1}c,$$<br />
whence<br />
$$x = A^{-1}B'\left(BA^{-1}B'\right)^{-1}c.$$<br />
(You should be able to verify that $BA^{-1}B'$ is non-singular.)<br />
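A minimal numerical sketch of this closed-form solution, on an assumed small instance with $A = \mathrm{diag}(1, 2)$, $B = (1\ 1)$, $c = 1$ (i.e. minimize $x_1^2 + 2x_2^2$ subject to $x_1 + x_2 = 1$); $A$ is diagonal so $A^{-1}$ is immediate:<br />

```python
# Problem P instance: minimize x'Ax subject to Bx = c.
A_diag = (1.0, 2.0)       # A = diag(1, 2) > 0
B = (1.0, 1.0)            # 1 x 2, rank 1
c = 1.0

Ainv_Bt = tuple(B[i] / A_diag[i] for i in range(2))   # A^{-1} B'
BAinvBt = sum(B[i] * Ainv_Bt[i] for i in range(2))    # B A^{-1} B' (a scalar here)
lam = -2.0 * c / BAinvBt                              # lambda = -2 (B A^{-1} B')^{-1} c
x = tuple(v * c / BAinvBt for v in Ainv_Bt)           # the constrained minimizer

# Sanity checks: the constraint holds, and x beats a nearby feasible competitor
fx = x[0] ** 2 + 2.0 * x[1] ** 2
f_alt = 0.5 ** 2 + 2.0 * 0.5 ** 2                     # feasible point (0.5, 0.5)
```

The minimizer works out to $x = (2/3, 1/3)$, which can also be confirmed directly from the stationarity conditions $2x_1 = \mu$, $4x_2 = \mu$, $x_1 + x_2 = 1$.<br />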
• Once we have the stationary points we must check for a minimum or maximum. There are conditions under which the satisfaction of these equations is sufficient as well as necessary to determine the extrema. These are generally so restrictive and complicated as to be useless in practice. The virtue of the Lagrange multiplier method is that it reduces to a small number the points that must be checked; we know that the required extrema are among them.<br />
— One easy way (if it works) to check optimality is this. Suppose that $(x_0; \lambda_0)$ is a stationary point of $L(x; \lambda)$ and that $x_0$ minimizes $L(x; \lambda_0)$ unconditionally. Then $L(x_0; \lambda_0) < L(x; \lambda_0)$ for all $x \ne x_0$; hence in the class of those $x$ that satisfy $g(x) = 0$ we have<br />
$$f(x_0) = L(x_0; \lambda_0) < L(x; \lambda_0) = f(x).$$<br />
— In Problem P there was only one stationary point $x_0$. This furnishes the minimum since<br />
$$L(x; \lambda_0) = x'Ax + \lambda_0'(Bx - c),$$<br />
where $\lambda_0 = -2\left(BA^{-1}B'\right)^{-1}c$, has Hessian $2A > 0$, hence is minimized unconditionally at $x_0$.<br />
• Proof of claim: Let $x$ be a solution to the original problem, so that in particular $g(x) = 0_{m\times 1}$. We will ‘solve’ these equations, thus expressing $m$ of the $x_i$’s in terms of the others. The Implicit Function Theorem allows us to do this. We must show that $\nabla f'(x) + \lambda' J_g(x) = 0'_{1\times n}$. Partition $x$, the gradient of $f$, and the Jacobian of $g$ as:<br />
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad x_1:\ m \times 1, \quad x_2:\ (n - m) \times 1;$$<br />
$$\nabla f(x) = \begin{pmatrix} \left(\partial f/\partial x_1\right)' \\ \left(\partial f/\partial x_2\right)' \end{pmatrix} = \begin{pmatrix} \tau(x) \\ \psi(x) \end{pmatrix};$$<br />
$$J_g(x) = \left(\frac{\partial g}{\partial x_1} \;\vdots\; \frac{\partial g}{\partial x_2}\right) = \left(\Gamma_{m\times m}(x) \;\vdots\; \Delta_{m\times(n-m)}(x)\right).$$<br />
Under the conditions of the Implicit Function Theorem (so that, in particular, $\Gamma(x)$ is non-singular) we can solve the equations $g(x_1, x_2) = 0_{m\times 1}$ for $x_1$ in terms of $x_2$, obtaining $x_1 = h(x_2)$. Thus $f(x) = f(h(x_2), x_2)$ and $g(h(x_2), x_2) = 0_{m\times 1}$.<br />
Since $x_2$ is a stationary point we have<br />
$$0'_{1\times(n-m)} = \frac{\partial f}{\partial x_2} = \tau'(x)\, J_h(x_2) + \psi'(x). \tag{17.3}$$<br />
But, as in the Implicit Function Theorem, $g(h(x_2), x_2) = 0_{m\times 1}$ gives<br />
$$\Gamma(x)\, J_h(x_2) + \Delta(x) = 0_{m\times(n-m)},$$<br />
so<br />
$$J_h(x_2) = -\Gamma^{-1}(x)\,\Delta(x).$$<br />
In (17.3) this gives<br />
$$0' = -\tau'(x)\,\Gamma^{-1}(x)\,\Delta(x) + \psi'(x) = \lambda'\Delta(x) + \psi'(x), \tag{17.4}$$<br />
where $\lambda' = -\tau'(x)\,\Gamma^{-1}(x):\ 1\times m$. Thus<br />
$$\nabla f'(x) + \lambda' J_g(x) = \left(\tau'(x),\ \psi'(x)\right) - \tau'(x)\,\Gamma^{-1}(x)\left(\Gamma(x) \;\vdots\; \Delta(x)\right) = \left(0',\ \psi'(x) - \tau'(x)\,\Gamma^{-1}(x)\,\Delta(x)\right) = 0'_{1\times n},$$<br />
by (17.4), as required. ¤<br />
18. Integration; Leibnitz’s Rule; Normal sampling distributions<br />
• Integration over an $n$-dimensional rectangle. Let $f: D \subset \mathbb{R}^n \to \mathbb{R}$ be bounded on the bounded set $D$. The development of the Riemann integral proceeds along the same lines as for $n = 1$. Thus, define the rectangle<br />
$$[a, b] = [a_1, b_1] \times \cdots \times [a_n, b_n],$$<br />
large enough that $D \subset [a, b]$. First suppose that $f$ is defined and bounded on $[a, b]$. Let $P$ be a partition of $[a, b]$, itself consisting of $n$-dimensional rectangles $R_1, \dots, R_k$. Define the lower and upper sums<br />
$$L(P) = \sum_{j=1}^{k} m_j\, \mu(R_j), \quad U(P) = \sum_{j=1}^{k} M_j\, \mu(R_j),$$<br />
where $m_j$ and $M_j$ are the inf and sup of $f$ on $R_j$, and $\mu(R_j)$ is the volume of $R_j$:<br />
$$\mu([c, d]) = \prod_{i=1}^{n} (d_i - c_i).$$<br />
If<br />
$$\sup_P L(P) = \inf_P U(P),$$<br />
or equivalently if<br />
$$\lim_{\Delta_P \to 0}\left(U(P) - L(P)\right) = 0,$$<br />
then the common value is the Riemann integral $\int_{[a,b]} f(x)\, dx$.<br />
• Now recall that $D \subset [a, b]$. Define<br />
$$1_D(x) = \begin{cases} 1, & x \in D, \\ 0, & x \notin D. \end{cases}$$<br />
If $\int_{[a,b]} 1_D(x)\, dx$ exists we say $D$ is Jordan measurable. The value of the integral is called the Jordan content of $D$. When $D$ is Jordan measurable we define<br />
$$\int_D f(x)\, dx = \int_{[a,b]} f(x)\, 1_D(x)\, dx.$$<br />
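Jordan content can be illustrated with a Riemann sum for the indicator $1_D$. The sketch below (the grid size is an arbitrary choice for illustration) approximates the content, i.e. the area, of the unit disc, which should be near $\pi$:<br />

```python
import math

def disc_area_estimate(N):
    """Midpoint Riemann sum of 1_D over [-1,1]^2 with an N x N grid of cells R_j."""
    h = 2.0 / N
    total = 0.0
    for i in range(N):
        for j in range(N):
            x = -1.0 + (i + 0.5) * h
            y = -1.0 + (j + 0.5) * h
            if x * x + y * y <= 1.0:   # 1_D evaluated at the cell centre
                total += h * h         # mu(R_j), the cell's volume (area)
    return total

area = disc_area_estimate(500)
```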
• A major tool for evaluating multidimensional integrals is Fubini’s Theorem: If $f$ is absolutely integrable on $[a, b]$ then<br />
$$\int_{[a,b]} f(x)\, dx = \int_{a_1}^{b_1}\left(\cdots\left(\int_{a_{n-1}}^{b_{n-1}}\left(\int_{a_n}^{b_n} f(x)\, dx_n\right) dx_{n-1}\right)\cdots\right) dx_1,$$<br />
and the integrations on the RHS may be carried out in any order.<br />
• Change of variables. Let $f: D \subset \mathbb{R}^n \to \mathbb{R}$, where $D$ is closed and bounded and $f$ is continuous. Let $h: x \in D \to y \in \mathbb{R}^n$ be a $1-1$ function with continuous Jacobian matrix $J_h(x)$, non-singular on $D$. Then there is an inverse function $h^{-1}: y \in h(D) \to D$ with<br />
$$\int_D f(x)\, dx = \int_{h(D)} f(h^{-1}(y))\left|J_{h^{-1}}(y)\right|_+ dy, \tag{18.1}$$<br />
where $|\cdot|_+$ denotes the absolute value of the determinant. Note also<br />
$$\left|J_{h^{-1}}(y)\right|_+ = \left|\frac{\partial x}{\partial y}\right|_+ = \left|\frac{\partial y}{\partial x}\right|_+^{-1}, \text{ where } y = h(x).$$<br />
Here and elsewhere the assumption that $D$ be bounded can be dropped by defining the resulting improper integral as a limit of proper integrals.<br />
• In particular, suppose $f$ is the p.d.f. of a r.vec. $X$. Put $Y = h(X)$. If $h$ is as above we have<br />
$$\int_S f(x)\, dx = P(X \in S) = P(Y \in h(S)) = \int_{h(S)} g(y)\, dy,$$<br />
where $g(y)$ is the p.d.f. of $Y$. But also (18.1) holds; thus<br />
$$g(y) = f(h^{-1}(y))\left|J_{h^{-1}}(y)\right|_+ = f(x)\left|\frac{\partial x}{\partial y}\right|_+, \text{ with } x = x(y).$$<br />
• Differentiation under the integral sign. Define<br />
$$F(\theta) = \int_{a(\theta)}^{b(\theta)} f(\theta, x)\, dx,$$<br />
where $a(\theta)$, $b(\theta)$ are continuously differentiable for $c \le \theta \le d$ and $f(\theta, x)$ is continuous, with a continuous partial derivative w.r.t. $\theta$, on a region containing $\{c \le \theta \le d \text{ and } a(\theta) \le x \le b(\theta)\}$. Then (Leibnitz’s Rule): $F(\theta)$ is continuously differentiable with<br />
$$F'(\theta) = \int_{a(\theta)}^{b(\theta)} \frac{\partial f(\theta, x)}{\partial \theta}\, dx + f(\theta, b(\theta))\, b'(\theta) - f(\theta, a(\theta))\, a'(\theta). \tag{18.2}$$<br />
• Note this is the result of writing<br />
$$F(\theta) = H(\theta, a(\theta), b(\theta)),$$<br />
where $H(\theta, u, v) = \int_u^v f(\theta, x)\, dx$. Then<br />
$$\frac{d}{d\theta} H(\theta, a(\theta), b(\theta)) = \frac{\partial H(\theta, u, v)}{\partial \theta} + \left.\frac{\partial H(\theta, u, v)}{\partial u}\right|_{\substack{u = a(\theta) \\ v = b(\theta)}} a'(\theta) + \left.\frac{\partial H(\theta, u, v)}{\partial v}\right|_{\substack{u = a(\theta) \\ v = b(\theta)}} b'(\theta).$$<br />
If differentiation under the integral sign is permissible, and Leibnitz’s rule says that it is, then we have (18.2).<br />
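Leibnitz’s Rule can be checked numerically. The example below is a hypothetical choice, not from the notes: $F(t) = \int_0^t \sin(tx)\, dx$, for which (18.2) gives $F'(t) = \int_0^t x\cos(tx)\, dx + \sin(t^2)\cdot 1$. The rule’s value is compared against a central finite difference of $F$:<br />

```python
import math

def simpson(g, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = g(a) + g(b) + sum((4 if k % 2 else 2) * g(a + k * h) for k in range(1, n))
    return s * h / 3.0

def F(t):
    return simpson(lambda x: math.sin(t * x), 0.0, t)

t0 = 1.3
# Right-hand side of Leibnitz's Rule (18.2): a(t)=0, b(t)=t, so a'=0, b'=1
leibnitz = simpson(lambda x: x * math.cos(t0 * x), 0.0, t0) + math.sin(t0 * t0)
# Direct numerical derivative of F for comparison
h = 1e-5
central_diff = (F(t0 + h) - F(t0 - h)) / (2.0 * h)
```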
• Example: Let $X, Y$ be independent, non-negative r.v.s with continuous densities $f_X, f_Y$ respectively ($f_X(x) = f_Y(x) = 0$ for $x < 0$). Then $Z = X + Y$ has d.f.<br />
$$F_Z(z) = P(Z \le z) = \int_{[0,z]\times[0,z]} I(x + y \le z)\, f_X(x)\, f_Y(y)\, d(x, y)$$<br />
$$= \int_0^z f_X(x)\, P(Y \le z - x)\, dx \quad \text{(by Fubini’s Theorem)}$$<br />
$$= \int_0^z f_X(x)\int_0^{z-x} f_Y(y)\, dy\, dx,$$<br />
with, by Leibnitz’s Rule, density<br />
$$f_Z(z) = \int_0^z f_X(x)\, f_Y(z - x)\, dx,$$<br />
since $F_Y(0) = 0$. This integral is called the convolution of $f_X$ with $f_Y$. (Only one of $X, Y$ needs to be continuous - why?)<br />
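As a numerical sketch of the convolution formula (a hypothetical instance, not from the notes): for $X, Y$ independent Exp(1), $f_Z(z) = \int_0^z e^{-x} e^{-(z-x)}\, dx = z e^{-z}$, the Erlang (gamma) density with shape 2.<br />

```python
import math

def simpson(g, a, b, n=1000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = g(a) + g(b) + sum((4 if k % 2 else 2) * g(a + k * h) for k in range(1, n))
    return s * h / 3.0

f_exp = lambda u: math.exp(-u) if u >= 0 else 0.0   # Exp(1) density

def f_Z(z):
    """Convolution integral f_Z(z) = int_0^z f_X(x) f_Y(z - x) dx."""
    return simpson(lambda x: f_exp(x) * f_exp(z - x), 0.0, z)

# Compare the numerical convolution with the exact Erlang(2) density z*e^{-z}
vals = [(z, f_Z(z), z * math.exp(-z)) for z in (0.5, 1.0, 2.0, 4.0)]
```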
• Application: The density of a $\chi_1^2$ r.v., i.e. $Y = Z^2$, where $Z \sim N(0, 1)$, can be obtained as<br />
$$f_1(y) = \frac{d}{dy}P(Y \le y) = \frac{d}{dy}P\left(-\sqrt{y} \le Z \le \sqrt{y}\right) = \frac{d}{dy}\int_{-\sqrt{y}}^{\sqrt{y}} \phi(z)\, dz = \frac{\phi(\sqrt{y})}{\sqrt{y}} = \frac{\left(\frac{y}{2}\right)^{\frac{1}{2}-1} e^{-\frac{y}{2}}}{2\Gamma\left(\frac{1}{2}\right)}.$$<br />
Then the $\chi_2^2$ density is<br />
$$f_2(y) = \int_0^y f_1(x)\, f_1(y - x)\, dx = \frac{e^{-\frac{y}{2}}}{4\Gamma^2\left(\frac{1}{2}\right)}\int_0^y \left(\frac{x}{2}\right)^{\frac{1}{2}-1}\left(\frac{y - x}{2}\right)^{\frac{1}{2}-1} dx = \frac{1}{2}e^{-\frac{y}{2}}\cdot c,$$<br />
where (substituting $x = uy$)<br />
$$c = \frac{1}{\pi}\int_0^1 u^{\frac{1}{2}-1}(1 - u)^{\frac{1}{2}-1}\, du$$<br />
must $= 1$ in<br />
order that $f_2(y)$ integrate to 1. Now the general density<br />
$$f_\nu(y) = \frac{\left(\frac{y}{2}\right)^{\frac{\nu}{2}-1} e^{-\frac{y}{2}}}{2\Gamma\left(\frac{\nu}{2}\right)}$$<br />
can be proved by induction, or conjectured and then established by using the uniqueness of m.g.f.s (as in Lecture 15).<br />
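A quick Monte Carlo check of the $\chi_2^2$ result (the sample size and the evaluation point are arbitrary choices): since $f_2(y) = \tfrac{1}{2}e^{-y/2}$, we have $P(Y \le y) = 1 - e^{-y/2}$ for $Y = Z_1^2 + Z_2^2$ with $Z_i$ i.i.d. $N(0,1)$.<br />

```python
import math
import random

random.seed(12345)
n = 200000
y0 = 2.0
count = 0
for _ in range(n):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    if z1 * z1 + z2 * z2 <= y0:   # event {Y <= y0}
        count += 1

p_hat = count / n                         # Monte Carlo estimate of P(Y <= y0)
p_exact = 1.0 - math.exp(-y0 / 2.0)       # from f_2(y) = (1/2) e^{-y/2}
```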
• Example: Joint distribution of the sample mean and variance in Normal samples.<br />
Suppose that $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ r.v.s, so that $X = (X_1, \dots, X_n)'$ has p.d.f.<br />
$$f(x) = \prod_{i=1}^{n}\left(2\pi\sigma^2\right)^{-1/2}\exp\left\{-\frac{(x_i - \mu)^2}{2\sigma^2}\right\} = \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\sum_{i=1}^{n}\frac{(x_i - \mu)^2}{2\sigma^2}\right\}.$$<br />
Note<br />
$$\sum_{i=1}^{n}(x_i - \mu)^2 = \sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2 = (n - 1)s^2 + n(\bar{x} - \mu)^2,$$<br />
so that<br />
$$f(x) = \left(2\pi\sigma^2\right)^{-n/2} e^{-\frac{(n-1)s^2}{2\sigma^2}}\, e^{-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}}.$$<br />
We derive the joint p.d.f. of $(s^2, \bar{X})$. First note that $1_{n\times 1}/\sqrt{n}$ has norm 1. Adjoin $n - 1$ unit vectors $e_i$ to get a basis for $\mathbb{R}^n$, and then apply Gram-Schmidt to get an orthonormal basis whose first member is $1/\sqrt{n}$. This yields an orthogonal matrix<br />
$$H_{n\times n} = \begin{pmatrix} 1'/\sqrt{n} \\ H_1 \end{pmatrix}.$$<br />
Put $Y = HX$. Then<br />
$$Y_1 = 1'X/\sqrt{n} = \sqrt{n}\,\bar{X},$$<br />
and $\|X\|^2 = \|Y\|^2$, so that<br />
$$\sum_{i=2}^{n} Y_i^2 = \|Y\|^2 - Y_1^2 = \|X\|^2 - \left(\sqrt{n}\,\bar{X}\right)^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = (n - 1)s^2.$$<br />
Note that<br />
$$\left|\frac{\partial x}{\partial y}\right|_+ = \left|H'\right|_+ = |\pm 1| = 1,$$<br />
so that the p.d.f. $g(y)$ of $Y$ is<br />
$$g(y) = f(x(y))\left|\frac{\partial x}{\partial y}\right|_+ = f(x(y))$$<br />
$$= \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\frac{\sum_{i=2}^{n} y_i^2}{2\sigma^2}\right\}\exp\left\{-\frac{\left(y_1 - \sqrt{n}\,\mu\right)^2}{2\sigma^2}\right\}$$<br />
$$= \left\{\left(2\pi\sigma^2\right)^{-1/2} e^{-\frac{\left(y_1 - \sqrt{n}\,\mu\right)^2}{2\sigma^2}}\right\}\cdot\prod_{i=2}^{n}\left\{\left(2\pi\sigma^2\right)^{-1/2} e^{-\frac{y_i^2}{2\sigma^2}}\right\}.$$<br />
Thus<br />
1. $Y_1, \dots, Y_n$ are independently distributed;<br />
2. $Y_1 \sim N\left(\sqrt{n}\,\mu, \sigma^2\right)$, so that $\bar{X} \sim N\left(\mu, \sigma^2/n\right)$;<br />
3. $\frac{(n-1)s^2}{\sigma^2} = \sum_{i=2}^{n}\left(\frac{Y_i}{\sigma}\right)^2 \sim \chi_{n-1}^2$, since $\frac{Y_i}{\sigma} \sim N(0, 1)$ for $i \ge 2$; furthermore $\bar{X}$ and $s^2$ are independently distributed.<br />
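A simulation sketch of these conclusions (the values of $\mu$, $\sigma$, $n$ and the replication count are arbitrary choices): $\bar{X}$ should have mean $\mu$ and variance $\sigma^2/n$, $s^2$ should have mean $\sigma^2$, and the sample correlation between $\bar{X}$ and $s^2$ should be near 0, consistent with independence.<br />

```python
import random

random.seed(2013)
mu, sigma, n, reps = 5.0, 2.0, 10, 20000
xbars, s2s = [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)   # sample variance
    xbars.append(xbar)
    s2s.append(s2)

mean_xbar = sum(xbars) / reps
var_xbar = sum((v - mean_xbar) ** 2 for v in xbars) / reps
mean_s2 = sum(s2s) / reps
var_s2 = sum((v - mean_s2) ** 2 for v in s2s) / reps
cov = sum((a - mean_xbar) * (b - mean_s2) for a, b in zip(xbars, s2s)) / reps
corr = cov / (var_xbar * var_s2) ** 0.5
```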
19. Numerical optimization: Steepest descent, Newton-Raphson, Gauss-Newton<br />
• Numerical minimization. Suppose that a function $f: \mathbb{R}^n \to \mathbb{R}$ is to be minimized.<br />
• Method of steepest descent. First choose an initial value $x_0$. Let $h$ be a vector of unit length and expand around $x_0$ in the direction $th$ ($t > 0$):<br />
$$f(x_0 + th) - f(x_0) \approx t\, h'\nabla f(x_0).$$<br />
We choose $h$ such that $h'\nabla f(x_0)$ is negative but maximized in absolute value. Specifically, note that by Cauchy-Schwarz, for $\|h\| = 1$,<br />
$$\left|h'\nabla f(x_0)\right| \le \left\|\nabla f(x_0)\right\|,$$<br />
with equality iff $h = \pm\nabla f(x_0)/\left\|\nabla f(x_0)\right\|$. Then $h'\nabla f(x_0) = \pm\left\|\nabla f(x_0)\right\|$, so we use the “$-$” sign:<br />
$$x_1(t) = x_0 + th = x_0 - t\,\nabla f(x_0)/\left\|\nabla f(x_0)\right\|,$$<br />
with $t = t_0$ chosen to minimize (by trial and error) $f(x_1(t))$. Repeat, with $x_1(t_0)$ replacing $x_0$. Iterate to convergence.<br />
— Example: $f(x) = \|x\|^2$. Then $\nabla f(x) = 2x$ and so<br />
$$x_1 = x_0 - t\cdot 2x_0/\left\|2x_0\right\| = x_0\left(1 - t/\|x_0\|\right).$$<br />
We vary $t$ until $f(x_1(t)) = \|x_1(t)\|^2$ is a minimum, i.e. to $t = \|x_0\|$. Then $x_1 = 0$, and the minimum is achieved in 1 step, from any starting value.<br />
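The method can be sketched in a few lines. The quadratic objective below is a hypothetical choice, for which the line search along a fixed direction has a closed form (an implementation convenience standing in for "trial and error"):<br />

```python
def f(x):
    """Hypothetical objective: f(x) = x1^2 + 10*x2^2, i.e. x'Ax with A = diag(1, 10)."""
    return x[0] ** 2 + 10.0 * x[1] ** 2

def grad(x):
    return (2.0 * x[0], 20.0 * x[1])

def steepest_descent(x, iters=200):
    for _ in range(iters):
        g = grad(x)
        norm = (g[0] ** 2 + g[1] ** 2) ** 0.5
        if norm < 1e-12:
            break
        h = (-g[0] / norm, -g[1] / norm)        # unit steepest-descent direction
        # Exact line search, closed form for a quadratic: t* = -g'h / (2 h'Ah)
        gh = g[0] * h[0] + g[1] * h[1]
        hAh = h[0] ** 2 + 10.0 * h[1] ** 2
        t = -gh / (2.0 * hAh)
        x = (x[0] + t * h[0], x[1] + t * h[1])
    return x

x_min = steepest_descent((4.0, 1.0))
```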
• The method of steepest descent uses a linear approximation of $f$; for this and other reasons the convergence can be very slow. The Newton-Raphson method uses a quadratic approximation of $f$; equivalently it takes a linear approximation of $\nabla f(x)$ in order to solve $\nabla f(x) = 0$. In its general form the Newton-Raphson method attempts to solve a system of equations of the form $g(x) = 0$, where $x$ and $g(x)$ are $n \times 1$.<br />
• Expand $g(x)$ around an initial value $x_0$:<br />
$$g(x) \approx g(x_0) + J_g(x_0)(x - x_0).$$<br />
Equate the RHS to zero, to get the next iterate:<br />
$$x_1 = x_0 - J_g^{-1}(x_0)\, g(x_0).$$<br />
In general,<br />
$$x_{m+1} = x_m - J_g^{-1}(x_m)\, g(x_m), \quad m = 0, 1, 2, \dots.$$<br />
At convergence, with $x_\infty = \lim_{m\to\infty} x_m$ and assuming that $J_g(x_\infty)$ is non-singular,<br />
$$x_\infty = x_\infty - J_g^{-1}(x_\infty)\, g(x_\infty),$$<br />
so that $g(x_\infty) = 0$.<br />
— If this is a minimization problem, so that $g(x) = \nabla f(x)$, then $J_g(x) = H_f(x)$ and the scheme is<br />
$$x_{m+1} = x_m - H_f^{-1}(x_m)\,\nabla f(x_m).$$<br />
Note<br />
$$f(x_{m+1}) \approx f(x_m) + \nabla f'(x_m)\left(x_{m+1} - x_m\right) = f(x_m) - \nabla f'(x_m)\, H_f^{-1}(x_m)\,\nabla f(x_m) < f(x_m) \text{ if } H_f(x_m) > 0.$$<br />
— Example: solve $g(x) = \log x - 1 = 0$:<br />
$$x_{m+1} = x_m - \frac{g(x_m)}{g'(x_m)} = x_m - \frac{\log x_m - 1}{1/x_m} = x_m(2 - \log x_m), \quad m = 0, 1, 2, \dots.$$<br />
This gives<br />
$$x_0 = 1,\quad x_1 = 2,\quad x_2 = 2.6137,\quad x_3 = 2.7162,\quad x_4 = 2.7182811;\quad e = 2.7182818\dots.$$<br />
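The table of iterates is easy to reproduce:<br />

```python
import math

# Newton-Raphson for g(x) = log(x) - 1: x_{m+1} = x_m * (2 - log(x_m))
x = 1.0
iterates = [x]
for _ in range(6):
    x = x * (2.0 - math.log(x))
    iterates.append(x)
```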
• The starting points can make a big difference, even if the function being minimized is convex. Example: Consider using Newton’s method to find the zero of $g(x) = \mathrm{sgn}(x)\cdot x^2/\left(1 + x^2\right)$. This is increasing, so it is the derivative of a convex function $f$, minimized at the zero of $g$. Start at $x_0 \ne 0$. We have $g(x)/g'(x) = x\left(1 + x^2\right)/2$, and so the iterates satisfy<br />
$$x_{m+1} = x_m\left(1 - x_m^2\right)/2.$$<br />
Put $r_m = \left|x_{m+1}/x_m\right| = \left|1 - x_m^2\right|/2$. Suppose that $|x_0| > \sqrt{3}$. By induction,<br />
$$|x_m| > \sqrt{3} \text{ and } r_m > r_{m-1} > \cdots > r_0 > 1, \tag{19.1}$$<br />
so that $\left|x_{m+1}\right| = r_m \cdots r_0\, |x_0| \uparrow \infty$. If $|x_0| = \sqrt{3}$ then $x_m = (-1)^m x_0$. Similarly, if $|x_0| < \sqrt{3}$ then $|x_m| < \sqrt{3}$ and $r_m < r_{m-1} < \cdots < r_0 < 1$, and so $|x_m| \downarrow 0$ (= the desired root).<br />
Details of (19.1): If true for $m$, then $\left|x_{m+1}\right| = r_m |x_m| > \sqrt{3}$, and then<br />
$$r_{m+1} = \frac{\left|1 - x_{m+1}^2\right|}{2} = \frac{x_{m+1}^2 - 1}{2} = \frac{r_m^2 x_m^2 - 1}{2} > \frac{3r_m^2 - 1}{2} > r_m,$$<br />
the last inequality because $3r_m^2 - 2r_m - 1 = (3r_m + 1)(r_m - 1) > 0$ when $r_m > 1$.<br />
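The three regimes are easy to see numerically (the specific starting values below are arbitrary choices on each side of $\sqrt{3}$):<br />

```python
def iterate(x0, m):
    """Run x_{m+1} = x_m * (1 - x_m^2) / 2 for m steps."""
    x = x0
    for _ in range(m):
        x = x * (1.0 - x * x) / 2.0
    return x

x_conv = iterate(1.5, 30)       # |x_0| < sqrt(3): drawn to the root 0
x_div = iterate(2.0, 6)         # |x_0| > sqrt(3): |x_m| blows up
x_osc = iterate(3 ** 0.5, 2)    # |x_0| = sqrt(3): x_m = (-1)^m x_0
```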
• Gauss-Newton algorithm. Uses least squares minimization along with a linear approximation of the function of interest. A common application is non-linear regression, so I’ll illustrate the technique there. Suppose we observe<br />
$$y_i = \eta(x_i, \theta) + \varepsilon_i, \quad i = 1, \dots, n.$$<br />
• An example is a Michaelis-Menten response $\eta(x, \theta) = \theta_1 x/(\theta_2 + x)$, ($x > 0$), used to describe various chemical and pharmacological reactions. Note the horizontal asymptote of $\theta_1$.<br />
If $\eta(x, \theta) = z'(x)\theta$ for some regressors $z(x)$ then this is a linear regression problem; otherwise non-linear. Define<br />
$$\eta(\theta) = \left(\eta(x_1, \theta), \cdots, \eta(x_n, \theta)\right)',$$<br />
so that the data can be represented as $y = \eta(\theta) + \varepsilon$. The LSEs are the minimizers of<br />
$$S(\theta) = \left\|y - \eta(\theta)\right\|^2.$$<br />
Take an initial value $\theta_0$, expand around $\theta_0$ to get<br />
$$y_i - \eta(x_i, \theta) \approx y_i - \eta(x_i, \theta_0) - \nabla\eta'(x_i, \theta_0)(\theta - \theta_0),$$<br />
i.e.<br />
$$y - \eta(\theta) \approx y - \eta(\theta_0) - J_\eta(\theta_0)(\theta - \theta_0).$$<br />
Define $y^{(1)} = y - \eta(\theta_0)$, so that<br />
$$\left\|y - \eta(\theta)\right\|^2 \approx \left\|y^{(1)} - J_\eta(\theta_0)(\theta - \theta_0)\right\|^2$$<br />
is to be minimized. By analogy with the linear regression model<br />
$$y^{(1)} = J_\eta(\theta_0)\beta + \varepsilon,$$<br />
the minimizer is<br />
$$\theta - \theta_0 = \beta = \left[J_\eta'(\theta_0)\, J_\eta(\theta_0)\right]^{-1} J_\eta'(\theta_0)\, y^{(1)}.$$<br />
Thus the next value is $\theta_1 = \theta_0 + \beta$, i.e.<br />
$$\theta_1 = \theta_0 + \left[J'(\theta_0)\, J(\theta_0)\right]^{-1} J'(\theta_0)\left(y - \eta(\theta_0)\right).$$<br />
In general, $\theta_{m+1} = \theta_m + \hat{\beta}_m$, where<br />
$$\hat{\beta}_m = \left[J'(\theta_m)\, J(\theta_m)\right]^{-1} J'(\theta_m)\left(y - \eta(\theta_m)\right).$$<br />
Thus we are repeatedly doing least squares regressions, in the $(m+1)$st of which the residuals from the $m$th are regressed on the columns of the Jacobian matrix, evaluated at $\theta_m$. A stopping rule can be based on the F-test of $H_0: \beta = 0$, the p-values for which will be included in the regression output.<br />
Assuming convergence, the limit $\hat{\theta}$ satisfies<br />
$$J'(\hat{\theta})\left(y - \eta(\hat{\theta})\right) = 0, \tag{19.2}$$<br />
so that<br />
$$\nabla S'(\hat{\theta}) = -2\left(y - \eta(\hat{\theta})\right)' J(\hat{\theta}) = 0',$$<br />
and $\hat{\theta}$ is a stationary point of $S(\theta)$.<br />
Typically $S\left(\theta_{m+1}\right) < S(\theta_m)$; if not it is usual to take<br />
$$\theta_{m+1} = \theta_m + \alpha\left[J'(\theta_m)\, J(\theta_m)\right]^{-1} J'(\theta_m)\left(y - \eta(\theta_m)\right)$$<br />
for $\alpha = 1, 1/2, 1/4, \dots$ until a decrease in $S$ is attained.<br />
A normal approximation is generally valid:<br />
$$\hat{\theta} \approx N\left(\theta,\ \sigma^2\left[J'(\hat{\theta})\, J(\hat{\theta})\right]^{-1}\right);$$<br />
the basic idea is (19.2) applied to<br />
$$\varepsilon = y - \eta(\theta) \approx y - \left[\eta(\hat{\theta}) + J(\hat{\theta})(\theta - \hat{\theta})\right].$$<br />
More precisely,<br />
$$\sqrt{n}\left(\hat{\theta} - \theta\right) \xrightarrow{L} N\left(0,\ \sigma^2\left[\lim_{n\to\infty}\frac{J'(\hat{\theta})\, J(\hat{\theta})}{n}\right]^{-1}\right).$$<br />
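A Gauss-Newton sketch for the Michaelis-Menten model $\eta(x, \theta) = \theta_1 x/(\theta_2 + x)$. The data below are synthetic and noiseless (true $\theta = (2, 3)$; design points and the starting value are arbitrary choices), so the iteration should recover $\theta$ essentially exactly; the $2\times 2$ normal equations are solved by Cramer’s rule.<br />

```python
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
theta_true = (2.0, 3.0)
ys = [theta_true[0] * x / (theta_true[1] + x) for x in xs]   # noiseless responses

def gauss_newton(theta, iters=50):
    t1, t2 = theta
    for _ in range(iters):
        resid = [y - t1 * x / (t2 + x) for x, y in zip(xs, ys)]
        # Jacobian of eta: d/dt1 = x/(t2+x), d/dt2 = -t1*x/(t2+x)^2
        J = [(x / (t2 + x), -t1 * x / (t2 + x) ** 2) for x in xs]
        # Normal equations (J'J) beta = J'resid, solved by Cramer's rule
        a = sum(j1 * j1 for j1, _ in J)
        b = sum(j1 * j2 for j1, j2 in J)
        d = sum(j2 * j2 for _, j2 in J)
        r1 = sum(j1 * r for (j1, _), r in zip(J, resid))
        r2 = sum(j2 * r for (_, j2), r in zip(J, resid))
        det = a * d - b * b
        beta = ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)
        t1, t2 = t1 + beta[0], t2 + beta[1]
    return t1, t2

theta_hat = gauss_newton((1.8, 2.7))
```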
20. Maximum likelihood<br />
• Maximum Likelihood Estimation. This is the most common and versatile method of estimation in statistics. It almost always gives reasonable estimates, even in situations that are so intractable as to be highly resistant to other estimation methods.<br />
• Data $x$, p.d.f. $f(x; \theta)$; e.g. i.i.d. $N(\mu, \sigma^2)$ observations gives $\theta = (\mu, \sigma^2)'$ and<br />
$$f(x; \theta) = \prod_{i=1}^{n}\frac{1}{\sigma}\,\phi\left(\frac{x_i - \mu}{\sigma}\right).$$<br />
The p.d.f. evaluated at the data is the likelihood function $L(\theta; x)$; its logarithm<br />
$$l(\theta) = \log L(\theta; x)$$<br />
is the log-likelihood.<br />
• For i.i.d. observations with common p.d.f. $f(x; \theta)$ we have<br />
$$f(x; \theta) = \prod_{i=1}^{n} f(x_i; \theta), \text{ so } l(\theta) = \sum_{i=1}^{n}\log f(x_i; \theta).$$<br />
Viewed as a r.v., $l(\theta) = \sum_{i=1}^{n}\log f(X_i; \theta)$ is itself a sum of i.i.d.s.<br />
• The MLE $\hat{\theta}$ is the maximizer of the likelihood; intuitively it makes the observed data “most likely to have occurred”.<br />
• A more quantitative justification for the MLE is as follows. Let $\theta_0$ be the true value, and assume the $X_i$ are i.i.d. We will show that<br />
$$P_{\theta_0}\left(L(\theta_0; X) > L(\theta; X)\right) \to 1 \text{ as } n \to \infty, \text{ for any } \theta \ne \theta_0. \tag{20.1}$$<br />
By this, for large samples and with high probability, the (random) likelihood is maximized by the<br />
true parameter value, hence the maximizer of the (observed) likelihood should be a good estimate of this true value.<br />
Proof of (20.1): The inequality<br />
$$L(\theta_0; X) = \prod_{i=1}^{n} f(X_i; \theta_0) > \prod_{i=1}^{n} f(X_i; \theta) = L(\theta; X)$$<br />
is equivalent to<br />
$$-\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(X_i; \theta)}{f(X_i; \theta_0)} > 0.$$<br />
By the WLLN this average tends in probability to<br />
$$-E_{\theta_0}\left[\log\frac{f(X; \theta)}{f(X; \theta_0)}\right] > -\log E_{\theta_0}\left[\frac{f(X; \theta)}{f(X; \theta_0)}\right] \text{ (why?)}$$<br />
$$= -\log\int\frac{f(x; \theta)}{f(x; \theta_0)}\, f(x; \theta_0)\, dx = -\log\int f(x; \theta)\, dx = 0.$$<br />
And so ... ¤<br />
• The MLE is generally obtained as a root of the likelihood equation<br />
$$\dot{l}(\theta) = 0,$$<br />
where $\dot{l}(\theta) = \nabla l(\theta)$ denotes the gradient. There may be multiple roots in finite samples. Under reasonable conditions (studied in STAT 665) we have that any sequence $\hat{\theta}_n$ of roots is asymptotically normal:<br />
$$\sqrt{n}\left(\hat{\theta}_n - \theta\right) \xrightarrow{L} N\left(0,\ I^{-1}(\theta)\right),$$<br />
where<br />
$$I(\theta) = \lim_{n\to\infty}\frac{1}{n}E\left[\dot{l}(\theta)\,\dot{l}'(\theta)\right] \tag{20.2}$$<br />
is “Fisher’s Information matrix”. The practical interpretation is that<br />
$$\hat{\theta}_n \approx N\left(\theta,\ \frac{1}{n}I^{-1}(\theta)\right),$$<br />
i.e. with $\sigma_{jk}$ representing the $(j, k)$ element of $I^{-1}(\theta)$ (or of $I^{-1}(\hat{\theta}_n)$) we have the approximations<br />
$$\hat{\theta}_j \approx N\left(\theta_j,\ \frac{\sigma_{jj}}{n}\right), \quad \mathrm{cov}\left[\hat{\theta}_j, \hat{\theta}_k\right] \approx \frac{\sigma_{jk}}{n}.$$<br />
• The MLE has attractive large-sample optimality properties, to be established later. That derivation, and the example we look at next, use the following ‘regularity condition’: we suppose that we can differentiate the equation $1 = \int L(\theta; x)\, dx$ under the integral sign twice. (In particular, the limits of integration should not depend on $\theta$.) Then, writing $\dot{L}(\theta; x)$ and $\dot{l}(\theta; x)$ for the gradients we have<br />
$$0_{k\times 1} = \int\dot{L}(\theta; x)\, dx = \int\frac{\dot{L}(\theta; x)}{L(\theta; x)}\, L(\theta; x)\, dx = \int\dot{l}(\theta; x)\, f(x; \theta)\, dx = E_\theta\left[\dot{l}(\theta; X)\right]. \tag{20.3}$$<br />
Thus $E_\theta\left[\dot{l}(\theta)\right] = 0$. With $\ddot{l}(\theta; x)$ denoting the Hessian matrix we have<br />
$$0_{k\times k} = \frac{\partial}{\partial\theta}\int\dot{l}(\theta; x)\, f(x; \theta)\, dx = \int\ddot{l}(\theta; x)\, f(x; \theta)\, dx + \int\dot{l}(\theta; x)\,\dot{l}'(\theta; x)\, f(x; \theta)\, dx,$$<br />
so that<br />
$$\mathrm{cov}_\theta\left[\dot{l}(\theta)\right]\left(= E_\theta\left[\dot{l}(\theta)\,\dot{l}'(\theta)\right]\right) = E_\theta\left[-\ddot{l}(\theta)\right]. \tag{20.4}$$<br />
• If the observations are i.i.d. then<br />
$$\dot{l}(\theta) = \sum_{i=1}^{n}\nabla\log f(X_i; \theta)$$<br />
is a sum of i.i.d.s, each of which (by taking $n = 1$ in (20.3) and (20.4)) has a mean of $0$ and a covariance of<br />
$$E_\theta\left[\nabla\log f(X; \theta)\,\nabla'\log f(X; \theta)\right] = E_\theta\left[-\frac{\partial^2\log f(X; \theta)}{\partial\theta\,\partial\theta'}\right].$$<br />
Now (20.2) states that<br />
$$I(\theta) = \lim_{n\to\infty}\frac{1}{n}\,\mathrm{cov}\left[\dot{l}(\theta)\right] = \mathrm{cov}\left[\nabla\log f(X; \theta)\right],$$<br />
since this is the same for all $n$. Then by the CLT (next page),<br />
$$\frac{1}{\sqrt{n}}\,\dot{l}(\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\nabla\log f(X_i; \theta) \xrightarrow{L} N(0, I(\theta)). \tag{20.5}$$<br />
• We have used the multivariate CLT: if $Z_1, \dots, Z_n$ are i.i.d. r.vecs. with mean vector $\mu$ and covariance matrix $\Sigma$, then<br />
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(Z_i - \mu) \xrightarrow{L} N(0, \Sigma).$$<br />
(In STAT 665 we give a very elementary proof of this, which uses only the univariate CLT.) This was applied in (20.5) with $Z_i = \nabla\log f(X_i; \theta)$, $\mu = 0$ and $\Sigma = I(\theta)$.<br />
• Now here is an outline of the proof of asymptotic normality of the MLE. Expand the likelihood equation $\dot{l}(\hat{\theta}) = 0$ around the true value, with remainder $r_n$:<br />
$$0 = \dot{l}(\hat{\theta}) = \dot{l}(\theta) + \ddot{l}(\theta)\left(\hat{\theta} - \theta\right) + r_n.$$<br />
Rearrange this as<br />
$$\sqrt{n}\left(\hat{\theta} - \theta\right) = \left[-\frac{\ddot{l}(\theta)}{n}\right]^{-1}\left\{\frac{1}{\sqrt{n}}\,\dot{l}(\theta) + \frac{r_n}{\sqrt{n}}\right\}. \tag{20.6}$$<br />
We have (by the WLLN) that<br />
$$-\frac{\ddot{l}(\theta)}{n} = \frac{1}{n}\sum_{i=1}^{n}\left(-\frac{\partial^2\log f(X_i; \theta)}{\partial\theta\,\partial\theta'}\right) \xrightarrow{P} E_\theta\left[-\frac{\partial^2\log f(X; \theta)}{\partial\theta\,\partial\theta'}\right] = I(\theta),$$<br />
so that using (20.5) and Slutsky’s Theorem,<br />
$$\left[-\frac{\ddot{l}(\theta)}{n}\right]^{-1}\frac{1}{\sqrt{n}}\,\dot{l}(\theta) \xrightarrow{L} N\left(0,\ I^{-1}(\theta)\right).$$<br />
If $r_n/\sqrt{n} \xrightarrow{P} 0$ (it does, but this is where some work is required) then, again by Slutsky applied to (20.6),<br />
$$\sqrt{n}\left(\hat{\theta} - \theta\right) \xrightarrow{L} N\left(0,\ I^{-1}(\theta)\right).$$<br />
21. Asymptotics of ML estimation; Information Inequality<br />
• Example. Suppose $\{X_1, \dots, X_n\}$ is a sample from the gamma$(\alpha, \beta)$ density, with<br />
$$f(x; \theta) = \frac{(x/\beta)^{\alpha - 1}\, e^{-x/\beta}}{\beta\,\Gamma(\alpha)}, \quad 0 < x < \infty.$$<br />
Note that if $\theta = (\alpha, \beta) = (\nu/2, 2)$ then this is the $\chi_\nu^2$ density. If $\beta = \lambda^{-1}$, $\alpha = k$ it is the “Erlang” density - the density of the sum of $k$ i.i.d. $E(\lambda)$ r.v.s. The distribution of the r.v. $X$, where $X^2 \sim$ gamma$(m, \Omega/m)$, is known as the “Nakagami” distribution, and $m$ is the “fading parameter”; this is of interest in the theory of wireless transmissions.<br />
The log-likelihood is<br />
$$l(\theta) = \sum_{i=1}^{n}\log f(x_i; \theta) = \sum_{i=1}^{n}\left[(\alpha - 1)\left(\log x_i - \log\beta\right) - \frac{x_i}{\beta} - \log\beta - \log\Gamma(\alpha)\right]$$<br />
$$= n\left[(\alpha - 1)\left(\overline{\log x} - \log\beta\right) - \frac{\bar{x}}{\beta} - \log\beta - \log\Gamma(\alpha)\right],$$<br />
with gradient (components ordered as $\theta = (\beta, \alpha)$)<br />
$$\dot{l}(\theta) = n\begin{pmatrix} -\dfrac{\alpha}{\beta} + \dfrac{\bar{x}}{\beta^2} \\ \overline{\log x} - \log\beta - \psi(\alpha) \end{pmatrix},$$<br />
where $\psi(\alpha) = (d/d\alpha)\log\Gamma(\alpha)$ $\left(= E\left[\log(X/\beta)\right]\right)$<br />
is the “digamma” function.<br />
<br />
• The Newton-Raphson method for solving the likelihood equations is<br />
$$\theta_{m+1} = \theta_m - \ddot{l}^{-1}(\theta_m)\,\dot{l}(\theta_m),$$<br />
where<br />
$$\ddot{l}(\theta) = n\begin{pmatrix} \dfrac{\alpha}{\beta^2} - \dfrac{2\bar{x}}{\beta^3} & -\dfrac{1}{\beta} \\ -\dfrac{1}{\beta} & -\psi'(\alpha) \end{pmatrix}.$$<br />
• A commonly used alternative to Newton-Raphson is Fisher’s Method of Scoring. This involves replacing<br />
$$-\ddot{l}(\theta) = -\sum_{i=1}^{n}\frac{\partial^2\log f(X_i; \theta)}{\partial\theta\,\partial\theta'}$$<br />
by its expectation $nI(\theta)$ in N-R, to get the scheme<br />
$$\theta_{m+1} = \theta_m + \frac{1}{n}\,I^{-1}(\theta_m)\,\dot{l}(\theta_m).$$<br />
This is often more stable than N-R.<br />
• Starting values for iterative solution to the likelihood equations. Note that<br />
$$E_\theta\left[\dot{l}(\theta; X)\right] = 0 \Rightarrow E\left[\bar{X}\right] = \alpha\beta.$$<br />
Also, for instance by computing the m.g.f. and differentiating twice (or more simply by a direct integration), $E[X^2] = \alpha(\alpha + 1)\beta^2$. Thus $\mathrm{Var}[X] = \alpha\beta^2$ and so<br />
$$E[s^2] = \alpha\beta^2,$$<br />
where $s^2$ is the sample variance (the unbiasedness of $s^2$ is a simple calculation). The “method of moments” estimates $\theta_0 = (\beta_0, \alpha_0)'$ are now obtained by equating $\bar{x}$ and $s^2$ to their expectations and solving for the parameters:<br />
$$\beta_0 = \frac{s^2}{\bar{x}}, \quad \alpha_0 = \frac{\bar{x}^2}{s^2}\left(= \frac{\bar{x}}{\beta_0}\right).$$<br />
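The whole scheme, method-of-moments start followed by an iterative likelihood solve, can be sketched as follows. Rather than the full bivariate Newton-Raphson, this sketch uses the common profile reduction: given $\alpha$, the first likelihood equation gives $\hat{\beta} = \bar{x}/\alpha$, leaving the scalar equation $\log\alpha - \psi(\alpha) = \log\bar{x} - \overline{\log x}$ for $\alpha$. The digamma/trigamma implementations (recurrence plus asymptotic series) are implementation choices, not from the notes.<br />

```python
import math
import random

def digamma(x):
    s = 0.0
    while x < 6.0:                    # recurrence: psi(x) = psi(x + 1) - 1/x
        s -= 1.0 / x
        x += 1.0
    # asymptotic series, accurate for x >= 6
    return s + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x ** 2) + 1.0 / (120 * x ** 4)

def trigamma(x):
    s = 0.0
    while x < 6.0:                    # recurrence: psi'(x) = psi'(x + 1) + 1/x^2
        s += 1.0 / x ** 2
        x += 1.0
    return s + 1.0 / x + 1.0 / (2 * x ** 2) + 1.0 / (6 * x ** 3) - 1.0 / (30 * x ** 5)

def fit_gamma(data):
    """MoM starting value, then Newton's method on the profile likelihood equation."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((v - xbar) ** 2 for v in data) / (n - 1)
    alpha = xbar ** 2 / s2            # method-of-moments start alpha_0
    c = math.log(xbar) - sum(math.log(v) for v in data) / n
    for _ in range(30):               # solve log(a) - psi(a) - c = 0
        f = math.log(alpha) - digamma(alpha) - c
        fp = 1.0 / alpha - trigamma(alpha)
        alpha -= f / fp
    return alpha, xbar / alpha        # (alpha_hat, beta_hat)

random.seed(7)
data = [random.gammavariate(3.0, 2.0) for _ in range(20000)]   # true (alpha, beta) = (3, 2)
alpha_hat, beta_hat = fit_gamma(data)
```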
• Method of moments: Define population moments $\mu_k = E\left[X^k\right]$ and estimates $\hat{\mu}_k = n^{-1}\sum_{i=1}^{n} X_i^k$. By the WLLN these are consistent:<br />
$$\hat{\mu}_k \xrightarrow{P} \mu_k \text{ as } n \to \infty.$$<br />
Then to estimate continuous functions<br />
$$\theta = g\left(\mu_1, \dots, \mu_r\right)$$<br />
of the population moments, the method of moments estimate<br />
$$\hat{\theta} = g\left(\hat{\mu}_1, \dots, \hat{\mu}_r\right)$$<br />
is also consistent. The proof is the same as in the univariate case:<br />
$$P\left(\left\|\hat{\theta} - \theta\right\| \ge \epsilon\right) = P\left(\left\|g(\hat{\mu}) - g(\mu)\right\| \ge \epsilon\right) \le P\left(\left\|\hat{\mu} - \mu\right\| \ge \delta\right) \to 0;$$<br />
here $\delta > 0$ is such that<br />
$$\left\|\hat{\mu} - \mu\right\| < \delta \Rightarrow \left\|g(\hat{\mu}) - g(\mu)\right\| < \epsilon,$$<br />
and its existence is guaranteed by the continuity of $g$.<br />
— An interesting aside: if $f(x; \lambda)$ is the density of an exponential r.v. with mean $1/\lambda$, given that it must be $\in [0, 1]$, then $f(x; \lambda) = \lambda e^{-\lambda x}/\left(1 - e^{-\lambda}\right)$ and the equations $l'(\lambda) = 0$ and ‘$\bar{x} = E[X]$’ turn out to be identical, so that the mle and method of moments estimator coincide.<br />
• More efficient, in fact almost as efficient as the MLE itself, are<br />
$$\beta_0 = n^{-1}\sum_i (x_i - \bar{x})\left(\log x_i - \overline{\log x}\right), \quad \alpha_0 = \frac{\bar{x}}{\beta_0},$$<br />
which are method of moments estimators arising from the observation that<br />
$$\mathrm{cov}[X, \log X] = \mathrm{cov}\left[X, \log\left(\frac{X}{\beta}\right)\right] = \beta,$$<br />
implying<br />
$$\beta = \mathrm{cov}[X, \log X], \quad \alpha = \frac{E[X]}{\beta}.$$<br />
Details in Wiens, Cheng (a former Stat 512 student) and Beaulieu 2003 at http://www.stat.ualberta.ca/˜wiens/.<br />
• The limit of the NR-process is the MLE $\hat{\theta}$, and<br />
$$\sqrt{n}\left(\hat{\theta} - \theta\right) \xrightarrow{L} N\left(0,\ I^{-1}(\theta)\right).$$<br />
The information matrix is<br />
$$I(\theta) = \lim_{n\to\infty}\frac{1}{n}E_\theta\left[-\ddot{l}(\theta)\right] = \begin{pmatrix} \dfrac{\alpha}{\beta^2} & \dfrac{1}{\beta} \\ \dfrac{1}{\beta} & \psi'(\alpha) \end{pmatrix},$$<br />
with<br />
$$I^{-1}(\theta) = \frac{1}{\alpha\psi'(\alpha) - 1}\begin{pmatrix} \beta^2\psi'(\alpha) & -\beta \\ -\beta & \alpha \end{pmatrix}.$$<br />
Then, e.g., the approximation to the distribution of $\hat{\beta}$ is<br />
$$\hat{\beta} \approx N\left(\beta,\ \frac{I^{11}(\theta)}{n}\right), \quad I^{11}(\theta) = \frac{\beta^2\psi'(\alpha)}{\alpha\psi'(\alpha) - 1}.$$<br />
We estimate the parameters in the variance, obtaining<br />
$$\frac{\hat{\beta} - \beta}{\sqrt{I^{11}\left(\hat{\alpha}, \hat{\beta}\right)/n}} \approx N(0, 1).$$<br />
Note that $\psi'(\alpha) = \mathrm{Var}\left[\log X\right]$ can also be consistently estimated by the sample variance of $\{\log x_i\}_{i=1}^{n}$.<br />
• We can now establish an asymptotic optimality property of the MLE. Suppose that the observations are i.i.d., and that differentiation under the integral sign, as above, is permissible. We aim to estimate a (scalar) function $g(\theta)$. The MLE $\hat{g}$ is defined to be $g(\hat{\theta})$, where $\hat{\theta}$ is the MLE for $\theta$. Recall that in studying variance stabilization (Lecture 9) we noted that in the single-parameter case, if $\hat{\theta}$ were asymptotically normal then so would be $g(\hat{\theta})$, with a mean of $g\left(\text{mean of }\hat{\theta}\right)$ and a variance of $\left[g'(\theta)\right]^2\cdot\left(\text{variance of }\hat{\theta}\right)$. The multi-parameter analogue (the “delta method”) is that<br />
$$\sqrt{n}\left(g(\hat{\theta}) - g(\theta)\right) \xrightarrow{L} N\left(0,\ \dot{g}'(\theta)\, I^{-1}(\theta)\,\dot{g}(\theta)\right),$$<br />
where $\dot{g}(\theta) = \nabla g(\theta)$ is the gradient.<br />
• Now let $T(X)$ be any unbiased estimator of $g(\theta)$, so that<br />
$$g(\theta) = E_\theta[T(X)] = \int T(x)\, L(\theta; x)\, dx.$$<br />
Thus<br />
$$\dot{g}(\theta) = \int T(x)\,\nabla L(\theta; x)\, dx = \int T(x)\,\dot{l}(\theta; x)\, L(\theta; x)\, dx = E_\theta\left[T(X)\,\dot{l}(\theta; X)\right] = E_\theta\left[\{T(X) - g(\theta)\}\,\dot{l}(\theta; X)\right],$$<br />
since<br />
$$E_\theta\left[g(\theta)\,\dot{l}(\theta; X)\right] = g(\theta)\, E_\theta\left[\dot{l}(\theta; X)\right] = 0.$$<br />
Then for any constant vector $c_{k\times 1}$ we have<br />
$$c'\dot{g}(\theta) = E_\theta\left[\{T(X) - g(\theta)\}\, c'\dot{l}(\theta; X)\right],$$<br />
and by the Cauchy-Schwarz inequality,<br />
$$\left[c'\dot{g}(\theta)\right]^2 \le E_\theta\left[\{T(X) - g(\theta)\}^2\right] E_\theta\left[\left\{c'\dot{l}(\theta; X)\right\}^2\right] = \mathrm{Var}_\theta[T(X)]\; c'\, E_\theta\left[\dot{l}(\theta)\,\dot{l}'(\theta)\right] c = n\,\mathrm{Var}_\theta[T(X)]\cdot c'\, I(\theta)\, c,$$<br />
the last step since $E_\theta\left[\dot{l}(\theta)\,\dot{l}'(\theta)\right] = n\, I(\theta)$ for an i.i.d. sample, by (20.2) and (20.4).<br />
i.e.<br />
$$\mathrm{Var}_\theta\left[\sqrt{n}\, T(X)\right] \ge \frac{\left|c'\dot{g}(\theta)\right|^2}{c'\, I(\theta)\, c}.$$<br />
Put $c = I^{-1/2}(\theta)\, t$ for arbitrary $t$ to get<br />
$$\mathrm{Var}_\theta\left[\sqrt{n}\, T(X)\right] \ge \frac{\left|t'\, I^{-1/2}(\theta)\,\dot{g}(\theta)\right|^2}{t't} \text{ for any } t,$$<br />
hence<br />
$$\mathrm{Var}_\theta\left[\sqrt{n}\, T(X)\right] \ge \max_{\|t\| = 1}\left|t'\, I^{-1/2}(\theta)\,\dot{g}(\theta)\right|^2 = \left\|I^{-1/2}(\theta)\,\dot{g}(\theta)\right\|^2 = \dot{g}'(\theta)\, I^{-1}(\theta)\,\dot{g}(\theta),$$<br />
which is the asymptotic variance of the (normalized) MLE $\sqrt{n}\left(g(\hat{\theta}) - g(\theta)\right)$. This is the Information Inequality, giving a lower bound on the variance of unbiased estimators. Since it is attained (in the limit) in the case of the MLE, we say the MLE is asymptotically efficient.<br />
22. Minimax M-estimation I<br />
• M-estimation of location. Suppose $X_1, \dots, X_n \overset{iid}{\sim} F(x - \theta)$, with density $f(x - \theta)$ (“location family”). If we know what $f$ is, then the MLE is defined by maximizing $\sum_i \log f(x_i - \theta)$, i.e. by solving<br />
$$0 = \frac{\partial}{\partial\theta}\sum_i \log f(x_i - \theta) = \sum_i \frac{-f'}{f}(x_i - \theta).$$<br />
More generally, a solution $\hat{\theta}_\psi$ to<br />
$$\sum_i \psi(x_i - \theta) = 0,$$<br />
for a suitable function $\psi$, is an “M-estimate” of location. Thus the MLE of location, from a known $f$, is an M-estimate with “score function”<br />
$$\psi(x) = \psi_f(x) = \frac{-f'(x)}{f(x)}.$$<br />
Quite generally,<br />
$$\sqrt{n}\left(\hat{\theta}_\psi - \theta\right) \xrightarrow{L} N\left(0,\ V(\psi, f) = \frac{E_f\left[\psi^2(X - \theta)\right]}{\left\{E_f\left[\psi'(X - \theta)\right]\right\}^2}\right).$$<br />
Here is an outline of why this is so. By the MVT,<br />
$$0 = \frac{1}{\sqrt{n}}\sum_i \psi\left(X_i - \hat{\theta}_\psi\right) = \frac{1}{\sqrt{n}}\sum_i \psi(X_i - \theta) - \left[\frac{1}{n}\sum_i \psi'(X_i - \theta)\right]\sqrt{n}\left(\hat{\theta}_\psi - \theta\right) + r_n,$$<br />
so<br />
$$\sqrt{n}\left(\hat{\theta}_\psi - \theta\right) = \frac{\frac{1}{\sqrt{n}}\sum_i \psi(X_i - \theta) + r_n}{\frac{1}{n}\sum_i \psi'(X_i - \theta)}.$$<br />
A natural assumption, and one that is made here, is that $E_f\left[\psi(X - \theta)\right] = 0$. Then by the CLT and WLLN,<br />
$$\frac{1}{\sqrt{n}}\sum_i \psi(X_i - \theta) \xrightarrow{L} N\left(0,\ E_f\left[\psi^2(X - \theta)\right]\right), \quad \frac{1}{n}\sum_i \psi'(X_i - \theta) \xrightarrow{P} E_f\left[\psi'(X - \theta)\right].$$<br />
If the remainder $r_n \xrightarrow{P} 0$ (showing this is where some work is required) then the result follows by Slutsky’s Theorem.<br />
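An M-estimate is straightforward to compute. The sketch below uses Huber’s score $\psi_c(u) = \max(-c, \min(c, u))$ (a standard robust choice, named here as an assumption since the notes have not yet specified a $\psi$), solving $\sum_i \psi(x_i - \theta) = 0$ by Newton’s method; the data, tuning constant $c$ and median starting value are arbitrary choices.<br />

```python
def huber_m_estimate(data, c=1.5, iters=50):
    """Solve sum_i psi_c(x_i - theta) = 0 by Newton's method, starting at a median."""
    xs = sorted(data)
    theta = xs[len(xs) // 2]
    for _ in range(iters):
        psi_sum = sum(max(-c, min(c, x - theta)) for x in data)
        # psi_c'(u) = 1 if |u| < c, else 0
        dpsi_sum = sum(1.0 for x in data if abs(x - theta) < c)
        if dpsi_sum == 0:
            break
        theta += psi_sum / dpsi_sum
    return theta

data = [0.1, -0.2, 0.3, 0.0, -0.1, 50.0]   # one gross outlier
theta_hat = huber_m_estimate(data)
mean = sum(data) / len(data)               # the outlier drags the mean far away
```

The clipped score bounds the outlier’s influence, so the estimate stays near the bulk of the data while the sample mean does not.<br />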
• The asymptotic variance does not depend on $\theta$; it is<br />
$$V(\psi, f) = \frac{\int_{-\infty}^{\infty} \psi^2(x)\, f(x)\, dx}{\left[\int_{-\infty}^{\infty} \psi'(x)\, f(x)\, dx\right]^2}.$$<br />
The denominator in $V(\psi, f)$ is the square of<br />
$$\int_{-\infty}^{\infty}\psi'(x)\, f(x)\, dx = \psi(x)f(x)\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty}\psi(x)\, f'(x)\, dx = \int_{-\infty}^{\infty}\psi(x)\,\psi_f(x)\, f(x)\, dx = E_f\left[\psi(X)\,\psi_f(X)\right].$$<br />
Here we use an assumption that $\psi(x)f(x) \to 0$ as $x \to \pm\infty$. Then, by Cauchy-Schwarz,<br />
$$\frac{1}{V(\psi, f)} = \frac{\left\{E_f\left[\psi(X)\,\psi_f(X)\right]\right\}^2}{E_f\left[\psi^2(X)\right]} \le \frac{E_f\left[\psi^2(X)\right] E_f\left[\psi_f^2(X)\right]}{E_f\left[\psi^2(X)\right]} = E_f\left[\psi_f^2(X)\right].$$<br />
190<br />
Thus $V(\psi, f)$ is minimized, for fixed $f$, by $\psi = \psi_f$. The minimum variance is the inverse of<br />
$$E_f\left[\psi_f^2(X)\right] = \int_{-\infty}^{\infty} \left[\frac{-f'(x)}{f(x)}\right]^2 f(x)\, dx = I(f);$$<br />
this is "Fisher information for location".<br />
• How might we choose $\psi$ if $f$ is not known? We take a minimax approach: we allow $f$ to be any member of some realistic class $\mathcal F$ of distributions (e.g. "approximately Normal" distributions), and aim to find a $\psi_0$ that minimizes the maximum variance:<br />
$$\max_{f \in \mathcal F} V(\psi_0, f) \le \max_{f \in \mathcal F} V(\psi, f) \quad\text{for any } \psi. \tag{22.1}$$<br />
We will show that the solution to this problem is to find an $f_0 \in \mathcal F$ that is "least favourable" in the sense of minimizing $I(f)$ in $\mathcal F$; we then protect ourselves against this worst case by using the MLE based on $f_0$, i.e. $\psi_0 = \psi_{f_0} = -f_0'/f_0$.<br />
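• (Numerical sanity check; not in the original notes. The quadrature grid and the competing score $\tanh$ are our own choices.) Under $f = \phi$ the optimal score $\psi_\phi(x) = x$ attains $V = 1/I(\phi) = 1$, and any other score does worse:<br />

```python
import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def V(psi, dpsi, f=phi, lo=-8.0, hi=8.0, m=16000):
    """V(psi, f) = int psi^2 f / (int psi' f)^2 by a midpoint Riemann sum."""
    dx = (hi - lo) / m
    xs = [lo + (i + 0.5) * dx for i in range(m)]
    num = sum(psi(x) ** 2 * f(x) for x in xs) * dx
    den = sum(dpsi(x) * f(x) for x in xs) * dx
    return num / den ** 2

v_score = V(lambda x: x, lambda x: 1.0)                     # psi = psi_phi
v_other = V(math.tanh, lambda x: 1.0 - math.tanh(x) ** 2)   # any other psi
print(v_score, v_other)
```

The first value is (numerically) $1 = 1/I(\phi)$, and the second is strictly larger, as the Cauchy-Schwarz argument above requires.<br />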
191<br />
• We will show that such a pair $(\psi_0, f_0)$ is a "saddlepoint solution":<br />
$$V(\psi_0, f) \le V(\psi_0, f_0) = \frac{1}{I(f_0)} \le V(\psi, f_0) \quad\text{for all } \psi \text{ and all } f \in \mathcal F. \tag{22.2}$$<br />
If (22.2) holds we have, for any $\psi$,<br />
$$\max_{f \in \mathcal F} V(\psi_0, f) = V(\psi_0, f_0) \le \max_{f \in \mathcal F} V(\psi, f),$$<br />
which is (22.1). Note that the equality in (22.2), and the second inequality, have already been established. Thus (22.2) holds iff $f_0$ satisfies the first inequality.<br />
• Assume that $\mathcal F$ is convex, in that for any $f_0, f_1 \in \mathcal F$ the density $f_\varepsilon = (1 - \varepsilon) f_0 + \varepsilon f_1$ ($0 \le \varepsilon \le 1$) is also in $\mathcal F$. The first inequality in (22.2) states that $V(\psi_0, f)$ is maximized by $f_0$; equivalently, that<br />
$$k(\varepsilon) = \frac{1}{V(\psi_0, f_\varepsilon)} = \frac{\left\{\int_{-\infty}^{\infty} \psi_0'(x)\, f_\varepsilon(x)\, dx\right\}^2}{\int_{-\infty}^{\infty} \psi_0^2(x)\, f_\varepsilon(x)\, dx}$$<br />
is minimized at $\varepsilon = 0$, for each $f_1 \in \mathcal F$. \qquad (22.3)<br />
192<br />
Note that<br />
$$\int_{-\infty}^{\infty} \psi_0'(x)\, f_\varepsilon(x)\, dx = (1 - \varepsilon)\int_{-\infty}^{\infty} \psi_0'(x)\, f_0(x)\, dx + \varepsilon\int_{-\infty}^{\infty} \psi_0'(x)\, f_1(x)\, dx$$<br />
is a linear function of $\varepsilon$; so too is the denominator of $k(\varepsilon)$.<br />
• Lemma: If $u(\varepsilon), v(\varepsilon)$ are linear functions of $\varepsilon \in [0, 1]$, and $v(\varepsilon) > 0$, then $w(\varepsilon) = u^2(\varepsilon)/v(\varepsilon)$ is convex:<br />
$$w\left((1 - \alpha)\varepsilon_1 + \alpha\varepsilon_2\right) \le (1 - \alpha)\, w(\varepsilon_1) + \alpha\, w(\varepsilon_2)$$<br />
for $\varepsilon_1, \varepsilon_2 \in [0, 1]$.<br />
Proof: Using $u'' = v'' = 0$ we get<br />
$$w'' = \frac{2}{v^3}\left(u' v - u v'\right)^2 \ge 0. \qquad \square$$<br />
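• (A brute-force check of the Lemma, not in the original notes; the linear pair below is chosen arbitrarily, with $v > 0$ on $[0,1]$.)<br />

```python
# Midpoint-convexity check of w(e) = u(e)^2 / v(e) for linear u, v with v > 0.
u = lambda e: 1.0 + 2.0 * e   # arbitrary linear function
v = lambda e: 2.0 - e         # linear and strictly positive on [0, 1]
w = lambda e: u(e) ** 2 / v(e)

grid = [i / 100 for i in range(101)]
ok = all(
    w(0.5 * (e1 + e2)) <= 0.5 * (w(e1) + w(e2)) + 1e-12   # small float slack
    for e1 in grid for e2 in grid
)
print(ok)   # True
```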
193<br />
• By the Lemma, $k(\varepsilon)$ in (22.3) is convex, so it is minimized at $\varepsilon = 0$ iff $k'(0) \ge 0$ for each $f_1 \in \mathcal F$.<br />
• In the notation of the Lemma we have<br />
$$k'(0) = \frac{2u(0)}{v(0)}\, u'(0) - \left(\frac{u(0)}{v(0)}\right)^2 v'(0), \quad\text{with}$$<br />
$$u(0) = \int_{-\infty}^{\infty} \psi_0'\, f_0\, dx = \psi_0 f_0\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} \psi_0\, f_0'\, dx = \int_{-\infty}^{\infty} \psi_0^2\, f_0\, dx = I(f_0)$$<br />
(integration by parts again), and<br />
$$v(0) = \int_{-\infty}^{\infty} \psi_0^2\, f_0\, dx = I(f_0).$$<br />
Thus<br />
$$k'(0) = 2u'(0) - v'(0) = \int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(f_1 - f_0\right) dx.$$<br />
We have shown that $V(\psi_0, f)$ is maximized by $f_0$ iff, for all $f_1 \in \mathcal F$,<br />
$$\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(f_1 - f_0\right) dx \ge 0. \tag{22.4}$$<br />
194<br />
• Now consider the companion problem of minimizing $I(f)$ in $\mathcal F$. The function<br />
$$J(\varepsilon) = I(f_\varepsilon) = \int_{-\infty}^{\infty} \left(\frac{-f_\varepsilon'(x)}{f_\varepsilon(x)}\right)^2 f_\varepsilon(x)\, dx = \int_{-\infty}^{\infty} \frac{\left(f_\varepsilon'(x)\right)^2}{f_\varepsilon(x)}\, dx$$<br />
is convex. This is because, by the Lemma, its integrand $w_\varepsilon(x)$ is convex in $\varepsilon$ for each $x$ (with $u = f_\varepsilon'(x)$ and $v = f_\varepsilon(x)$, both linear in $\varepsilon$); thus for any $\varepsilon_1, \varepsilon_2 \in [0, 1]$<br />
$$w_{(1-\alpha)\varepsilon_1 + \alpha\varepsilon_2}(x) \le (1 - \alpha)\, w_{\varepsilon_1}(x) + \alpha\, w_{\varepsilon_2}(x);$$<br />
integrating this gives<br />
$$J\left((1 - \alpha)\varepsilon_1 + \alpha\varepsilon_2\right) \le (1 - \alpha)\, J(\varepsilon_1) + \alpha\, J(\varepsilon_2).$$<br />
Thus $I(f)$ is minimized by $f_0$ iff, for each $f_1 \in \mathcal F$,<br />
$$0 \le J'(0) = \int_{-\infty}^{\infty} \frac{d}{d\varepsilon}\bigg|_{\varepsilon = 0} \frac{\left(f_\varepsilon'\right)^2}{f_\varepsilon}\, dx = \int_{-\infty}^{\infty} \left[\frac{2 f_0' \left(f_1' - f_0'\right)}{f_0} - \left(\frac{f_0'}{f_0}\right)^2 \left(f_1 - f_0\right)\right] dx$$<br />
$$= \int_{-\infty}^{\infty} \left[-2\psi_0\left(f_1' - f_0'\right) - \psi_0^2\left(f_1 - f_0\right)\right] dx = \int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(f_1 - f_0\right) dx,$$<br />
the last step by integration by parts.<br />
195<br />
• By comparison with (22.4) we have that the following are equivalent:<br />
1. $\left(\psi_0 = -f_0'/f_0,\ f_0\right)$ is a saddlepoint solution to the minimax problem;<br />
2. $V(\psi_0, f)$ is maximized by $f_0$;<br />
3. $I(f)$ is minimized in $\mathcal F$ by $f_0$;<br />
4. $\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(f_1 - f_0\right) dx \ge 0$ for all $f_1 \in \mathcal F$.<br />
196<br />
23. Minimax M-estimation II<br />
• By the preceding, we are to minimize $I(f)$ in $\mathcal F$ and then put $\psi_0 = -f_0'/f_0$. We must now specify a "reasonable" class $\mathcal F$. A commonly used one is the "gross errors" class<br />
$$\mathcal F = \left\{f \mid f(x) = (1 - \varepsilon)\, \phi(x) + \varepsilon\, h(x)\right\},$$<br />
where $\phi(x)$ is the Normal density and $h(x)$ is an arbitrary (but symmetric) density.<br />
— Why symmetric? We need $E_f\left[\psi(X - \theta)\right] = \int_{-\infty}^{\infty} \psi(x)\, f(x)\, dx = 0$ for all $f \in \mathcal F$; this is guaranteed if $f$ is even and, as will turn out to be the case, $\psi_0$ is odd and bounded.<br />
• The interpretation is that $100(1 - \varepsilon)\%$ of the observations are Normally distributed; the remainder come from an unknown population. For this model we have $f_1 - f_0 = \varepsilon\left(h_1 - h_0\right)$, and so we are to find $f_0$ satisfying<br />
$$\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(h_1 - h_0\right) dx \ge 0 \quad\text{for all } h_1. \tag{23.1}$$<br />
We note that<br />
$$\psi_\phi(x) = \frac{-\phi'(x)}{\phi(x)} = -\frac{d}{dx}\log\phi(x) = x,$$<br />
and<br />
$$2\psi_\phi'(x) - \psi_\phi^2(x) = 2 - x^2.$$<br />
197<br />
Condition (23.1) states that $\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right) h\, dx$ is to be minimized by $h = h_0$. But $\psi_0$ depends on $h_0$. We conjecture that:<br />
1. The density $h_0$ must place all of its mass where $2\psi_0' - \psi_0^2$ is a minimum;<br />
2. On this set, $2\psi_0' - \psi_0^2$ is to be constant, say $= -k^2$ for some $k$.<br />
We will first verify that these conditions ensure a minimax solution, and then verify that there is a density $h_0$ that has these properties.<br />
• A clue to the form of the set in 2. above is provided by the behaviour of $2\psi_\phi'(x) - \psi_\phi^2(x) = 2 - x^2$, which is smallest for $|x|$ large.<br />
198<br />
• Suppose then that we can construct a density $f_0$ in such a way that<br />
$$f_0(x) = \begin{cases} (1 - \varepsilon)\, \phi(x), & |x| \le a, \\ (1 - \varepsilon)\, \phi(x) + \varepsilon\, h_0(x), & |x| \ge a; \end{cases} \qquad \psi_0(x) = \begin{cases} x, & |x| \le a, \\ \text{a solution to } 2\psi_0' - \psi_0^2 = -k^2, & |x| \ge a; \end{cases}$$<br />
and with $\psi_0 = -f_0'/f_0$ on $|x| \ge a$. Suppose also that<br />
$$-k^2 \le 2 - a^2,$$<br />
so that $2\psi_0' - \psi_0^2$ attains its minimum (of $-k^2$) on the set $|x| \ge a$. Then, using that $h_0$ vanishes on $|x| \le a$,<br />
$$\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(h_1 - h_0\right) dx = \int_{|x| \le a} \left(2 - x^2\right) h_1\, dx + \int_{|x| \ge a} \left(-k^2\right)\left(h_1 - h_0\right) dx$$<br />
$$\ge -k^2\left[\int_{|x| \le a} h_1\, dx + \int_{|x| \ge a} \left(h_1 - h_0\right) dx\right] = -k^2\left[\int_{-\infty}^{\infty} h_1\, dx - \int_{|x| \ge a} h_0\, dx\right] = 0,$$<br />
since $h_1$ and $h_0$ each integrate to one and $h_0$ has all of its mass on $|x| \ge a$.<br />
199<br />
• A solution (there are three) to $2\psi_0' - \psi_0^2 = -k^2$ is $\psi_0(x) = \operatorname{sgn}(x) \cdot k$, implying $f_0(x) \propto e^{-k|x|}$. This leads to<br />
$$f_0(x) = \begin{cases} (1 - \varepsilon)\, \phi(x), & |x| \le a, \\ (1 - \varepsilon)\, \phi(a)\, e^{-a\left(|x| - a\right)}, & |x| \ge a; \end{cases} \qquad \psi_0(x) = \begin{cases} x, & |x| \le a, \\ \operatorname{sgn}(x) \cdot a, & |x| \ge a; \end{cases}$$<br />
with $k = a$. Note that $f_0$ and $\psi_0$ are continuous, that $\psi_0 = -f_0'/f_0$, and that $-k^2 = -a^2 \le 2 - a^2$. It remains only to show that $f_0 \in \mathcal F$, i.e. that $f_0(x) = (1 - \varepsilon)\, \phi(x) + \varepsilon\, h_0(x)$ for some density $h_0$. It is left as an exercise to show that the function $h_0$ defined by this relationship is non-negative, and that a unique $a = a(\varepsilon)$ can be found such that $\int_{-\infty}^{\infty} f_0(x)\, dx = 1$.<br />
• This function $\psi_0(x)$ is the famous "Huber's psi function". The theory given here extends very simply to the case in which $\phi$ is replaced by any other "strongly unimodal" density - one for which $\psi_f(x)$ is an increasing function of $x$. Details are in<br />
200<br />
the landmark paper Huber, P. J. (1964), “Robust<br />
Estimation of a Location Parameter,” The Annals<br />
of Mathematical Statistics, 35, 73-101.<br />
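• (Sketch of the normalization exercise, not in the original notes.) Integrating the two pieces of $f_0$ gives $\int f_0\, dx = (1 - \varepsilon)\left[(2\Phi(a) - 1) + 2\phi(a)/a\right]$, so $a = a(\varepsilon)$ solves $2\phi(a)/a - 2\Phi(-a) = \varepsilon/(1 - \varepsilon)$; the left side is strictly decreasing in $a$, so bisection applies (the bracketing interval and tolerance below are arbitrary choices):<br />

```python
import math

def phi(x):   # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):   # standard normal c.d.f., via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def huber_a(eps, tol=1e-12):
    """Solve 2*phi(a)/a - 2*Phi(-a) = eps/(1-eps) for a by bisection.

    The left side decreases from +infinity (a -> 0+) to 0 (a -> infinity),
    so a unique root exists for every 0 < eps < 1.
    """
    g = lambda a: 2 * phi(a) / a - 2 * Phi(-a) - eps / (1 - eps)
    lo, hi = 1e-8, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a = huber_a(0.05)
# check the normalization integral of f_0 directly:
total = (1 - 0.05) * ((2 * Phi(a) - 1) + 2 * phi(a) / a)
print(a, total)
```

For $\varepsilon = 0.05$ this gives $a \approx 1.4$, consistent with the standard tabulated values for Huber's estimator.<br />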
• Extensions to regression are immediate. An M-estimate of regression is a minimizer of<br />
$$\sum_i \rho\left(Y_i - \mathbf x_i' \boldsymbol\theta\right)$$<br />
or, with $\psi = \rho'$, a solution to<br />
$$\sum_i \mathbf x_i\, \psi\left(Y_i - \mathbf x_i' \boldsymbol\theta\right) = \mathbf 0.$$<br />
If $\rho(x) = x^2/2$, $\psi(x) = x$, then this becomes<br />
$$\sum_i \mathbf x_i Y_i = \sum_i \mathbf x_i \mathbf x_i'\, \boldsymbol\theta.$$<br />
The solution is<br />
$$\hat{\boldsymbol\theta} = \left[\sum_i \mathbf x_i \mathbf x_i'\right]^{-1} \sum_i \mathbf x_i Y_i = \left(\mathbf X' \mathbf X\right)^{-1} \mathbf X' \mathbf y,$$<br />
the LSE. In general Newton-Raphson, or Iteratively Reweighted Least Squares (IRLS), can be used to obtain the solution.<br />
• The asymptotic normality result is that<br />
$$\sqrt n\left(\hat{\boldsymbol\theta} - \boldsymbol\theta\right) \overset{L}{\to} N\left(\mathbf 0,\ V(\psi, f)\left(\mathbf X' \mathbf X / n\right)^{-1}\right)$$<br />
201<br />
if the errors have a density $f$, symmetric around 0. Here $V(\psi, f)$ is as before, so that the same minimax results as derived here can be applied.<br />
• IRLS: Write the equations as<br />
$$\mathbf 0 = \sum_i \mathbf x_i\, \frac{\psi\left(Y_i - \mathbf x_i' \boldsymbol\theta\right)}{Y_i - \mathbf x_i' \boldsymbol\theta}\left(Y_i - \mathbf x_i' \boldsymbol\theta\right) = \sum_i \left[\mathbf x_i \cdot w_i \cdot \left(Y_i - \mathbf x_i' \boldsymbol\theta\right)\right] = \mathbf X' \mathbf W \mathbf y - \mathbf X' \mathbf W \mathbf X\, \boldsymbol\theta$$<br />
for $\mathbf W = \operatorname{diag}\left(w_1, \ldots, w_n\right)$ and weights $w_i = w_i(\boldsymbol\theta) = \psi\left(Y_i - \mathbf x_i' \boldsymbol\theta\right)/\left(Y_i - \mathbf x_i' \boldsymbol\theta\right)$ depending on the parameters. "Solve" these equations:<br />
$$\boldsymbol\theta = \left(\mathbf X' \mathbf W \mathbf X\right)^{-1} \mathbf X' \mathbf W \mathbf y;$$<br />
use this value to re-calculate the weights; iterate to convergence. Thus the $(m+1)$st step is a weighted least squares regression using weights $w_i\left(\boldsymbol\theta_m\right)$ computed from the residuals at the previous step.<br />
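• (A minimal IRLS sketch, not in the original notes, for Huber's $\psi$ with $p = 2$ parameters. The tuning constant $a = 1.345$, the fabricated data-generating model, and the fixed iteration count are all our assumptions.)<br />

```python
import random

def huber_psi(r, a=1.345):
    return max(-a, min(a, r))

def irls(X, y, a=1.345, iters=50):
    """IRLS for Huber regression with p = 2 columns.

    Each pass is a weighted LS fit, with weights w_i = psi(r_i)/r_i
    computed from the residuals of the previous pass.
    """
    theta = [0.0, 0.0]
    for _ in range(iters):
        r = [yi - xi[0] * theta[0] - xi[1] * theta[1]
             for xi, yi in zip(X, y)]
        w = [1.0 if abs(ri) < 1e-12 else huber_psi(ri, a) / ri for ri in r]
        # normal equations X'WX theta = X'Wy, solved explicitly (2 x 2)
        s11 = sum(wi * xi[0] * xi[0] for wi, xi in zip(w, X))
        s12 = sum(wi * xi[0] * xi[1] for wi, xi in zip(w, X))
        s22 = sum(wi * xi[1] * xi[1] for wi, xi in zip(w, X))
        b1 = sum(wi * xi[0] * yi for wi, xi, yi in zip(w, X, y))
        b2 = sum(wi * xi[1] * yi for wi, xi, yi in zip(w, X, y))
        det = s11 * s22 - s12 * s12
        theta = [(s22 * b1 - s12 * b2) / det,
                 (s11 * b2 - s12 * b1) / det]
    return theta

random.seed(3)
n = 200
X = [[1.0, random.uniform(-2, 2)] for _ in range(n)]
y = [1.0 + 2.0 * x[1] + random.gauss(0, 0.5) for x in X]
for i in range(0, n, 25):                # plant a few gross errors
    y[i] += 20.0
theta_huber = irls(X, y)
theta_ls = irls(X, y, a=1e12, iters=1)   # huge a => weights 1 => LSE
print(theta_huber, theta_ls)
```

The Huber fit stays near the true $(1, 2)$ while the LSE intercept is pulled up by the contamination. (In practice the residuals would first be standardized by a robust scale estimate; that refinement is omitted here.)<br />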
202<br />
24. Measure and Integration<br />
• Recall the definition of a probability space: the basic components are a set $\Omega$ ("outcomes"), a "Borel field" or "$\sigma$-algebra" $\mathcal F$ of subsets ("events"; $\mathcal F$ contains $\Omega$ and is closed under complementation and countable unions) and a measure $P$ assigning probability $P(A)$ to events $A \in \mathcal F$.<br />
• Let $\Omega = (0, 1]$, the unit interval, and start with subintervals $(a, b]$. Define a (probability) measure by $P\left((a, b]\right) = b - a$. This extends to the set $\mathcal B_0$ of finite disjoint unions and complements of such intervals in the obvious way.<br />
• Now consider $\mathcal B = \sigma\left(\mathcal B_0\right)$, the smallest $\sigma$-algebra containing $\mathcal B_0$ (i.e., the intersection of all of them). One can extend the measure $P$ on $\mathcal B_0$ to the $\sigma$-algebra $\mathcal B$. Formally, define the outer measure of a set $A \subset \Omega$ by<br />
$$P^*(A) = \inf \sum_n P\left(A_n\right),$$<br />
203<br />
where the infimum is over all sequences $\left\{A_n\right\}$ in $\mathcal B_0$ satisfying $A \subset \cup_n A_n$. If, for any $E \subset \Omega$, we then have $P^*(A \cap E) + P^*\left(A^c \cap E\right) = P^*(E)$, we say that $A$ is Lebesgue measurable, with Lebesgue measure $P^*(A)$. (The condition implies in particular that $P^*(A) + P^*\left(A^c\right) = 1$.)<br />
• It can be shown that the Lebesgue measurable sets include $\sigma\left(\mathcal B_0\right)$, and that Lebesgue measure agrees with $P$ whenever both are defined. In particular the Lebesgue measure of an interval is its length. The restriction of Lebesgue measure to $\sigma\left(\mathcal B_0\right)$ is called Borel measure, and the sets in $\sigma\left(\mathcal B_0\right)$ are Borel measurable.<br />
• The measure $P^*$ can in turn be extended from $\mathcal B$ through a process of completion, essentially by appending to $\mathcal B$ all subsets of those sets with measure zero. The resulting measure space is complete, in that if a set $A$ is Lebesgue measurable and has measure zero, then the same is true of any subset of $A$.<br />
204<br />
• Example: The set $A$ of rational numbers in $(0, 1]$ has Lebesgue measure $P^*(A) = 0$. This is because we can enumerate the rationals: $A = \left\{a_1, a_2, \ldots\right\}$, and then if<br />
$$A_n = \left(a_n - \delta\, 2^{-(n+1)},\ a_n + \delta\, 2^{-(n+1)}\right) \cap (0, 1]$$<br />
we have $A \subset \cup_n A_n$ and $P^*(A) \le \sum_n P\left(A_n\right) \le \delta$, for arbitrary $\delta > 0$.<br />
• One can carry out these constructions for more general sets $\Omega$. Starting with $\Omega = \mathbb R$ and intervals $(a, b]$ as above results in Lebesgue measure on the real line.<br />
• Now let $(\Omega, \mathcal F, \mu)$ be any measure space, so that $\mathcal F$ is a $\sigma$-algebra and $\mu$ a measure (Lebesgue measure, counting measure, ...). Here we define the integral, written<br />
$$\int f\, d\mu = \int_\Omega f(\omega)\, d\mu(\omega) = \int_\Omega f(\omega)\, \mu(d\omega).$$<br />
205<br />
One approach to this is to start with simple, non-negative functions $f = \sum_i c_i 1_{A_i}$, where $\Omega = \cup_{i=1}^n A_i$; in this case define $\int f\, d\mu = \sum_i c_i\, \mu\left(A_i\right)$. (Then, e.g., if $f = c$ on $(a, b]$ and zero elsewhere, and $\mu$ is Lebesgue measure, this gives $\int f\, d\mu = c\,(b - a)$, in agreement with the R-integral.) Now any non-negative function $f$ can be represented as an increasing limit ($f_n \uparrow f$) of simple functions; one then defines $\int f\, d\mu = \lim_n \int f_n\, d\mu$.<br />
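• (A concrete instance of this construction, not in the original notes; the choice $f(x) = x^2$ is ours.) Take $f(x) = x^2$ on $\Omega = (0, 1]$ with Lebesgue measure $\mu$, and the standard simple approximants $f_n = 2^{-n}\left\lfloor 2^n f \right\rfloor$ (capped at $n$, though the cap never binds here since $f \le 1$). Each level set of $f_n$ is an interval, so $\int f_n\, d\mu$ is an exact finite sum:<br />

```python
# Approximate the Lebesgue integral of f(x) = x^2 over (0, 1] by the
# increasing simple functions f_n = floor(2^n f)/2^n.
# The level set {f_n = k/2^n} is the interval [sqrt(k/2^n), sqrt((k+1)/2^n)),
# whose Lebesgue measure is the difference of the endpoints.
import math

def integral_simple(n):
    total, levels = 0.0, 2 ** n
    for k in range(levels):
        c = k / levels                       # value of f_n on this level set
        lo = math.sqrt(k / levels)
        hi = min(1.0, math.sqrt((k + 1) / levels))
        total += c * (hi - lo)               # c * mu(level set)
    return total

vals = [integral_simple(n) for n in (2, 4, 8, 12)]
print(vals)   # increases toward 1/3
```

The values increase with $n$ toward $\int_0^1 x^2\, dx = 1/3$, exactly as the definition $\int f\, d\mu = \lim_n \int f_n\, d\mu$ requires.<br />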
• To extend to arbitrary functions, define<br />
$$f^+(\omega) = \max\left(f(\omega), 0\right), \qquad f^-(\omega) = \max\left(-f(\omega), 0\right).$$<br />
Then $f = f^+ - f^-$ is the difference of non-negative functions, and one defines<br />
$$\int f\, d\mu = \int f^+\, d\mu - \int f^-\, d\mu$$<br />
(unless one of these is $\infty$, in which case we say that $f$ is "not integrable", although one still assigns a value to $\int f\, d\mu$ by adopting the convention $c \pm \infty = \pm\infty$ for finite $c$). Finally, for sets $A \in \mathcal F$ one defines $\int_A f\, d\mu = \int f\, 1_A\, d\mu$.<br />
206<br />
• Some properties of the integral:<br />
1. If $f$ and $g$ are integrable and $f \le g$ a.e. ("almost everywhere"; i.e. except on a set with measure zero), then $\int f\, d\mu \le \int g\, d\mu$. (Then $|f| = f^+ + f^-$ is integrable, and since $-|f| \le f \le |f|$ we have that $\left|\int f\, d\mu\right| \le \int |f|\, d\mu$.)<br />
2. Monotone convergence: If $0 \le f_n \uparrow f$ a.e., then $\int f_n\, d\mu \uparrow \int f\, d\mu$. (Note that $\int f\, d\mu$ is defined for all non-negative functions $f$.)<br />
3. Dominated convergence: If $\left|f_n\right| \le g$ a.e., where $g$ is integrable (i.e. $\int g\, d\mu$, which necessarily exists, is finite), and if $f_n \to f$ a.e., then $f$ and the $f_n$ are integrable and $\int f_n\, d\mu \to \int f\, d\mu$.<br />
4. Bounded convergence: if $\mu(\Omega) < \infty$ and the $f_n$ are uniformly bounded (i.e. $\left|f_n(\omega)\right| \le K$ for all $n$ and all $\omega$), then $f_n \to f$ a.e. implies $\int f_n\, d\mu \to \int f\, d\mu$.<br />
207<br />
• If $\mu$ is Lebesgue measure then the integral defined above is the Lebesgue integral. Example: The function $f = 1_A$ is zero everywhere except on the set $A$ of rationals in $(0, 1]$, i.e. almost everywhere. By the above, $\int f\, d\mu = 0$; recall that this is an example in which the R-integral does not exist. When the R-integral does exist, it has the same value as the Lebesgue integral.<br />
• Now let $(\Omega, \mathcal F, P)$ be a probability space; a (finite) random variable is a function $X: \Omega \to \mathbb R$ such that $X^{-1}(B) \in \mathcal F$ for any Borel measurable set $B$. Equivalently (proof at end), inverses of open sets are events.<br />
• A function $g(X)$ is also a r.v. under a certain condition. For $B \in \mathcal B$, we have<br />
$$(g \circ X)^{-1}(B) = X^{-1}\left(g^{-1}(B)\right) \in \mathcal F$$<br />
as long as $g^{-1}(B) \in \mathcal B$, i.e. $g$ must be Borel measurable - a function for which the inverses of Borel sets (or just open sets) are Borel sets.<br />
208<br />
• Any r.v. $X$ induces a probability space $(\mathbb R, \mathcal B, \mu_X)$ via<br />
$$\mu_X(B) = P\left(X^{-1}(B)\right) = P(X \in B) \quad\text{for } B \in \mathcal B.$$<br />
This measure $\mu_X$ is the probability measure (p.m.) of $X$, and the associated distribution function is defined by<br />
$$F_X(x) = \mu_X\left((-\infty, x]\right) = P(X \le x).$$<br />
• If there is a function $f_X(\cdot)$ and a measure space $(\mathbb R, \mathcal B, \lambda)$ with<br />
$$\mu_X(B) = P(X \in B) = \int_B f_X(x)\, d\lambda(x),$$<br />
we say that $f_X$ is the density of $X$ (or of $\mu_X$) w.r.t. $\lambda$, and that $\mu_X$ is "absolutely continuous" w.r.t. $\lambda$. The most common cases are $\lambda =$ Lebesgue measure (in which case we say that $X$ is a continuous r.v., and then $F_X'(x)$ exists and equals $f_X(x)$ a.e.) and $\lambda =$ counting measure, in which case $f_X(x) = P(X = x)$ and $X$ is discrete.<br />
209<br />
• The expected value of a r.v. $X$ is defined by the integral $E[X] = \int_\Omega X(\omega)\, dP(\omega)$, and can be evaluated by transforming to the p.m., i.e.<br />
$$E[g(X)] = \int_{\mathbb R} g(x)\, d\mu_X(x); \tag{24.1}$$<br />
this in turn equals the R-S integral $\int_{-\infty}^{\infty} g(x)\, dF_X(x)$ whenever the latter exists. The proof of (24.1) consists of showing that both sides agree for simple functions (for instance when $g = 1_B$ both sides equal $P(X \in B)$) and extending to general Borel functions $g$ by monotonicity, etc.<br />
210<br />
• Basic properties of expectations are inherited from those of the integral. Some particular ones are:<br />
1. $E[X]$ exists iff $E[|X|]$ exists, and then $|E[X]| \le E[|X|]$.<br />
2. Monotone convergence: If $X_n \ge 0$ and $X_n \uparrow X$ a.e., then $E\left[X_n\right] \to E[X]$ (which might $= \infty$).<br />
3. Dominated convergence: If $X_n \to X$ a.e. (or just $X_n \overset{P}{\to} X$) and $\forall n$, $\left|X_n\right| \le Y$ with $E[Y] < \infty$, then $E\left[X_n\right] \to E[X]$.<br />
4. Bounded convergence: this is "Dominated convergence" with $Y = K$, a constant.<br />
• Example: Suppose we estimate a bounded, continuous function $g(\mu)$ of a population mean $\mu$ by $g\left(\bar X_n\right)$, where $\bar X_n$ is an average based on a sample of size $n$. We have that $\bar X_n \overset{P}{\to} \mu$ by the WLLN, so $g\left(\bar X_n\right) \overset{P}{\to} g(\mu)$ by continuity; then $E\left[g\left(\bar X_n\right)\right] \to E[g(\mu)] = g(\mu)$ by bounded convergence.<br />
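• (Simulation of this example, not in the original notes; the Exponential(1) data, so that $\mu = 1$, and the bounded continuous $g(t) = 1/(1 + t^2)$ are arbitrary choices of ours.)<br />

```python
import random

g = lambda t: 1.0 / (1.0 + t * t)   # bounded and continuous
mu = 1.0                            # mean of Exponential(1)

random.seed(4)

def mean_g_of_xbar(n, reps=2000):
    """Monte Carlo estimate of E[g(Xbar_n)] for Exponential(1) samples."""
    acc = 0.0
    for _ in range(reps):
        xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
        acc += g(xbar)
    return acc / reps

small, large = mean_g_of_xbar(5), mean_g_of_xbar(400)
print(small, large, g(mu))
```

For the larger sample size, $E\left[g\left(\bar X_n\right)\right]$ is visibly closer to $g(\mu) = 1/2$, as bounded convergence guarantees in the limit.<br />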
211<br />
• It was stated above that if $X$ is a function on $(\Omega, \mathcal F, P)$, mapping into $(\mathbb R, \mathcal B)$, and $X^{-1}(B) \in \mathcal F$ for every open set $B$ in $\mathbb R$ (recall this was our original definition of a r.v.), then $X^{-1}(B) \in \mathcal F$ for every Borel measurable set $B$ (our current definition of a r.v.). To see this, first note that if $X^{-1}(B) \in \mathcal F$ for every open set $B$, then this holds as well for every interval $B = (a, b]$ ($= \cap_n (a, b + 1/n)$, a countable intersection of open sets). It then holds as well for the set $\mathcal B_0$ of finite disjoint unions and complements of such intervals. The property can finally be extended to $\mathcal B = \sigma\left(\mathcal B_0\right)$ through the Monotone Class Theorem. A class $\mathcal M$ of subsets of $\mathbb R$ is monotone if, for sequences $\left\{B_n\right\}$ in $\mathcal M$,<br />
$$B_1 \subset B_2 \subset \cdots \Rightarrow \cup_n B_n \in \mathcal M \quad\text{and}\quad B_1 \supset B_2 \supset \cdots \Rightarrow \cap_n B_n \in \mathcal M.$$<br />
The theorem states that if $\mathcal M$ is a monotone class, then $\mathcal B_0 \subset \mathcal M$ implies $\sigma\left(\mathcal B_0\right) \subset \mathcal M$. So it suffices to verify that the class $\mathcal M$ of subsets $B$ for which $X^{-1}(B) \in \mathcal F$ is monotone. This is straightforward.<br />