STATISTICS 512
TECHNIQUES OF MATHEMATICS FOR STATISTICS

Doug Wiens

December 16, 2013
Contents

I MATRIX ALGEBRA 6
1 Introduction; matrix manipulations 7
2 Vector spaces 16
3 Orthogonality; Gram-Schmidt method; QR-decomposition 24
4 LSEs; Spectral theory 31
5 Examples & applications 41

II LIMITS, CONTINUITY, DIFFERENTIATION 50
6 Limits; continuity; probability spaces 51
7 Random variables; distributions; Jensen's Inequality; WLLN 57
8 Differentiation; Mean Value and Taylor's Theorems 64
9 Applications: transformations; variance stabilization 71

III SEQUENCES, SERIES, INTEGRATION 78
10 Sequences and series 79
11 Power series; moment and probability generating functions 87
12 Branching processes 95
13 Riemann integration 102
14 Riemann and Riemann-Stieltjes integration 111
15 Moment generating functions; Chebyshev's Inequality; Asymptotic statistical theory 121

IV MULTIDIMENSIONAL CALCULUS AND OPTIMIZATION 131
16 Multidimensional differentiation; Taylor's and Inverse Function Theorems 132
17 Implicit Function Theorem; extrema; Lagrange multipliers 141
18 Integration; Leibnitz's Rule; Normal sampling distributions 152
19 Numerical optimization: Steepest descent, Newton-Raphson, Gauss-Newton 162
20 Maximum likelihood 170
21 Asymptotics of ML estimation; Information Inequality 178
22 Minimax M-estimation I 187
23 Minimax M-estimation II 196
24 Measure and Integration 202
Part I
MATRIX ALGEBRA
1. Introduction; matrix manipulations

• Outline:

— Linear algebra: regression (linear/nonlinear), multivariate analysis, and more generally linear models and linear approximations.

— Real analysis/calculus: the theory of statistical distributions; optimal selection of statistical procedures (e.g. determining a parameter estimate which minimizes a certain loss function); approximation of intractable procedures by simpler ones.

— Measure theory (very briefly) and the theory of integration: probability, mathematical finance, the theory of mathematical statistics, foundations of statistical and probabilistic methods. A rigorous development of conditional expectation requires measure theory.

— Optimization: finding numbers or functions minimizing certain objectives, e.g. designing experiments for maximum information/minimum variance, etc.; associated numerical methods.
• In Statistics, matrices are convenient ways to store and refer to data. As well, in regression for instance, there are important structural features that come from examining the algebraic properties of the vector space formed from all linear combinations of the columns of a matrix.

• As for the first of these (matrices as data storage), you should learn various ways to manipulate matrices. In particular, the formula for the product of two matrices, namely that the $(i,j)$ element of the product $AB$ of an $n \times m$ matrix $A$ and an $m \times p$ matrix $B$ is given by

$$[AB]_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj},$$

is of rather limited usefulness. Rather, one should treat either the rows or the columns of matrices as the basic elements. Some examples:
— Define a (column) vector in $\mathbb{R}^n$; sum, transpose, scalar product, outer product.

— Matrix as a column of rows, or a row of columns:

$$A_{n \times p} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix} = (\boldsymbol{\alpha}_1 \,\vdots\, \boldsymbol{\alpha}_2 \,\vdots\, \cdots \,\vdots\, \boldsymbol{\alpha}_p).$$
— If $X_{n \times p}$ has rows $\{\mathbf{x}_i'\}_{i=1}^{n}$ (note: vectors are columns, rows are transposed vectors), and $\boldsymbol{\theta}$ is a $p \times 1$ vector, then

$$X\boldsymbol{\theta} = \begin{pmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_i' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix} \boldsymbol{\theta} = \begin{pmatrix} \mathbf{x}_1'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_i'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_n'\boldsymbol{\theta} \end{pmatrix}.$$

— If $X_{n \times p}$ has columns $\{\mathbf{z}_j\}_{j=1}^{p}$, and $\boldsymbol{\theta}$ is a $p \times 1$ vector, then

$$X\boldsymbol{\theta} = \left[\mathbf{z}_1 \cdots \mathbf{z}_j \cdots \mathbf{z}_p\right] \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_j \\ \vdots \\ \theta_p \end{pmatrix} = \sum_{j=1}^{p} \theta_j \mathbf{z}_j.$$
— If $X_{n \times p}$ has columns $\{\mathbf{z}_j\}_{j=1}^{p}$, and $A$ is a matrix with $n$ columns, then

$$AX = A\left[\mathbf{z}_1 \cdots \mathbf{z}_j \cdots \mathbf{z}_p\right] = \left[A\mathbf{z}_1 \cdots A\mathbf{z}_j \cdots A\mathbf{z}_p\right].$$

— If $X$ is as above and $A_{m \times n}$ is a matrix with rows $\{\mathbf{a}_i'\}_{i=1}^{m}$, then

$$AX = \begin{pmatrix} \mathbf{a}_1' \\ \vdots \\ \mathbf{a}_i' \\ \vdots \\ \mathbf{a}_m' \end{pmatrix} X = \begin{pmatrix} \mathbf{a}_1'X \\ \vdots \\ \mathbf{a}_i'X \\ \vdots \\ \mathbf{a}_m'X \end{pmatrix} = \left(\mathbf{a}_i'\mathbf{z}_j\right)_{ij}.$$

You should become familiar with all of these, and learn to choose the most appropriate form in an application.
— Block matrices ... a particular example is, with notation as above,

$$A_{n \times m} B_{m \times p} = (\boldsymbol{\alpha}_1 \,\vdots\, \boldsymbol{\alpha}_2 \,\vdots\, \cdots \,\vdots\, \boldsymbol{\alpha}_m) \begin{pmatrix} \boldsymbol{\beta}_1' \\ \boldsymbol{\beta}_2' \\ \vdots \\ \boldsymbol{\beta}_m' \end{pmatrix} = \sum_{k=1}^{m} \boldsymbol{\alpha}_k \boldsymbol{\beta}_k'.$$
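The row, column, and outer-product views of the matrix product can be checked numerically. A minimal NumPy sketch (NumPy is an assumption here, not something the notes prescribe):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # n x m
B = rng.standard_normal((3, 5))   # m x p

# Outer-product view: AB is a sum of m outer products
# (column k of A) (row k of B)'.
outer_sum = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

# Row view: the i-th row of AB is (row i of A) times B.
row_view = np.vstack([A[i, :] @ B for i in range(A.shape[0])])

assert np.allclose(A @ B, outer_sum)
assert np.allclose(A @ B, row_view)
```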
• Let $X$ be a random variable (r.v.) (formal definition to come later) with (i) distribution function $F(x) = P(X \le x)$ and probability density function $f(x) = F'(x)$, or (ii) probability mass function $f(x) = P(X = x)$ for $x \in \mathcal{X}$, a finite or countable set. Then the "expected value" is

$$E[X] = \begin{cases} \int_{-\infty}^{\infty} x f(x)\,dx & \text{(i)}, \\ \sum_{x \in \mathcal{X}} x f(x) & \text{(ii)}. \end{cases}$$

Think "average". The cases can be unified and extended (for instance to cases where $F$ is not differentiable) via the Riemann-Stieltjes integral, to be considered later. Also, the extension to random vectors (r.vecs) is immediate, involving multidimensional integrals or sums. A consequence is that

$$E\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_n] \end{pmatrix}.$$
(Similarly with random matrices.) We define $E[g(\mathbf{X})]$ to be $E[Y]$, where $Y = g(\mathbf{X})$. In principle this requires the derivation of the distribution of $Y$. It can be shown that this can instead be obtained by integration or summation w.r.t. ("with respect to") the distribution of $\mathbf{X}$. Corresponding to the cases above, this is

$$E[g(\mathbf{X})] = \begin{cases} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x} & \text{(i)}, \\ \sum_{\mathbf{x} \in \mathcal{X}} g(\mathbf{x}) f(\mathbf{x}) & \text{(ii)}, \end{cases}$$

respectively.

• A special consequence is linearity: $E[aX + bY] = aE[X] + bE[Y]$. More generally, if $\mathbf{x}$ is a r.vec. and $\mathbf{b}$ a constant vector,

$$E[A\mathbf{x} + \mathbf{b}] = A\,E[\mathbf{x}] + \mathbf{b},$$
$$E[A\mathbf{X}B + C] = A\,E[\mathbf{X}]\,B + C$$
for a random matrix $\mathbf{X}$. You should verify this. Thus, e.g., if $\boldsymbol{\mu} = E[\mathbf{x}]$ then (how?)

$$E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'\right] = E\left[\mathbf{x}\mathbf{x}'\right] - \boldsymbol{\mu}\boldsymbol{\mu}'.$$

The $(i,j)$ element is

$$E\left[(X_i - \mu_i)(X_j - \mu_j)\right] = \operatorname{cov}(X_i, X_j)$$

(= the variance if $i = j$). The matrix is called the covariance matrix of the random vector $\mathbf{x}$.
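The identity $E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})'] = E[\mathbf{x}\mathbf{x}'] - \boldsymbol{\mu}\boldsymbol{\mu}'$ holds exactly for sample moments as well, which makes for an easy numerical check (an illustrative NumPy sketch with arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
# 10000 draws of a 3-dimensional random vector with correlated components.
x = rng.standard_normal((10000, 3)) @ np.array([[2., 0., 0.],
                                                [1., 1., 0.],
                                                [0., 0., 3.]])
mu = x.mean(axis=0)

# E[(x - mu)(x - mu)'] computed directly ...
direct = (x - mu).T @ (x - mu) / len(x)
# ... equals E[xx'] - mu mu' (the identity above), exactly, for sample moments.
via_identity = x.T @ x / len(x) - np.outer(mu, mu)

assert np.allclose(direct, via_identity)
```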
• A common application, to be developed in detail, is linear regression. An experimenter observes a variable $Y$ (= response to a medical treatment, say) thought to depend on the type of drug used ($x_1 = 1$ for type A, $0$ for type B) and the amount applied ($x_2$). The response contains a random component as well (measurement error, model inadequacies, etc.); a tentative linear regression model might be

$$Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \varepsilon,$$

where the $\theta$'s are unknown parameters to be estimated, and $\varepsilon$ is unobserved random error, assumed to have mean $0$ and constant variance across subjects, possibly also normally distributed.
— Interpretation of $E[Y \mid x_1, x_2]$ in the two treatment groups:

$$E[Y \mid x_1 = 0, x_2] = \theta_0 + \theta_2 x_2,$$
$$E[Y \mid x_1 = 1, x_2] = \theta_0 + \theta_1 + \theta_2 x_2;$$

hence $\theta_1$ = the difference in mean effects of the treatments, if the same amounts are applied.

— Then with $\mathbf{x} = (1, x_1, x_2)'$, $\boldsymbol{\theta} = (\theta_0, \theta_1, \theta_2)'$:

$$Y = \mathbf{x}'\boldsymbol{\theta} + \varepsilon; \qquad E[Y \mid \mathbf{x}] = \mathbf{x}'\boldsymbol{\theta}.$$
• Take $n$ data:

$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1'\boldsymbol{\theta} \\ \mathbf{x}_2'\boldsymbol{\theta} \\ \vdots \\ \mathbf{x}_n'\boldsymbol{\theta} \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix};$$

more concisely,

$$\mathbf{Y} = \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix} \boldsymbol{\theta} + \boldsymbol{\varepsilon} = X\boldsymbol{\theta} + \boldsymbol{\varepsilon}.$$
Here the observations (rows) have been singled out as the relevant objects. Much of the theory will hinge on the representation of $E[\mathbf{Y} \mid X]$ as a linear combination of the columns of $X$, with coefficients $\theta_j$:

$$X = (\mathbf{1} \,\vdots\, \mathbf{z}_1 \,\vdots\, \mathbf{z}_2); \qquad E[\mathbf{Y} \mid X] = \mathbf{1}\theta_0 + \mathbf{z}_1\theta_1 + \mathbf{z}_2\theta_2.$$

The extension to more than 3 columns is immediate.

• Estimation of $\boldsymbol{\theta}$: Given an estimate $\hat{\boldsymbol{\theta}}$ one estimates $E[Y_i \mid \mathbf{x}_i]$ by $\mathbf{x}_i'\hat{\boldsymbol{\theta}}$, with residuals

$$e_i = Y_i - \mathbf{x}_i'\hat{\boldsymbol{\theta}}$$

and residual vector $\mathbf{e} = \mathbf{Y} - X\hat{\boldsymbol{\theta}}$.

• We define the (Euclidean) norm, i.e. the length, of a vector by $\|\mathbf{e}\| = \sqrt{\sum_i e_i^2} = \sqrt{\mathbf{e}'\mathbf{e}}$.

• Least Squares Principle: Choose

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{Y} - X\boldsymbol{\theta}\|^2,$$

minimizing the Sum of Squares of the Residuals (or Errors, hence "SSE").
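The Least Squares Principle, applied to the drug-type/amount model above, can be sketched with simulated data (the data and the true parameter values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.integers(0, 2, n).astype(float)   # drug type indicator
x2 = rng.uniform(1, 5, n)                  # amount applied
X = np.column_stack([np.ones(n), x1, x2])
theta_true = np.array([1.0, 2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Minimize ||Y - X theta||^2 over theta.
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ theta_hat                      # residual vector

# The LS estimate achieves a no-larger SSE than any other theta.
assert e @ e <= np.sum((Y - X @ (theta_hat + 0.1)) ** 2)
```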
2. Vector spaces

• Now let's relate matrices to vector spaces. We start with the definition of a vector space. This is largely for formal completeness (you might wish to skip over the next bullet), since the only vector space considered here will be

$$\mathbb{R}^n = \text{all } n\text{-dimensional vectors with real elements}$$

and its subspaces.

• We list a number of axioms to be satisfied by a structure in order that it be called a vector space; for $\mathbb{R}^n$ these are all pretty obvious. Note that $\mathbb{R}^n$ is closed under addition ($\mathbf{x}, \mathbf{y} \in \mathbb{R}^n \Rightarrow \mathbf{x} + \mathbf{y} \in \mathbb{R}^n$) and scalar multiplication ($\mathbf{x} \in \mathbb{R}^n$ and $a \in \mathbb{R} \Rightarrow a\mathbf{x} \in \mathbb{R}^n$), and satisfies

1. Associativity: For all $\mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^n$, we have $\mathbf{x} + (\mathbf{y} + \mathbf{z}) = (\mathbf{x} + \mathbf{y}) + \mathbf{z}$.

2. Commutativity: For all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, we have $\mathbf{x} + \mathbf{y} = \mathbf{y} + \mathbf{x}$.

3. Identity element: There is $\mathbf{0} \in \mathbb{R}^n$ such that $\mathbf{x} + \mathbf{0} = \mathbf{x}$.

4. Inverse elements: For all $\mathbf{x} \in \mathbb{R}^n$, there exists an element $-\mathbf{x} \in \mathbb{R}^n$, called the additive inverse of $\mathbf{x}$, such that $\mathbf{x} + (-\mathbf{x}) = \mathbf{0}$.

5. Distributivity for scalar multiplication: For all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and $a \in \mathbb{R}$ we have $a(\mathbf{x} + \mathbf{y}) = a\mathbf{x} + a\mathbf{y}$.

6. Distributivity for scalar addition: For all $\mathbf{x} \in \mathbb{R}^n$ and $a, b \in \mathbb{R}$ we have $(a + b)\mathbf{x} = a\mathbf{x} + b\mathbf{x}$.

7. For all $\mathbf{x} \in \mathbb{R}^n$ and $a, b \in \mathbb{R}$ we have $(ab)\mathbf{x} = a(b\mathbf{x})$.

8. Scalar multiplication has an identity: $1\mathbf{x} = \mathbf{x}$.

Because these properties hold we say that $\mathbb{R}^n$ is a vector space (over the field $\mathbb{R}$ of scalars).
• A subset $U$ of $\mathbb{R}^n$ that is itself closed under addition and scalar multiplication is a vector space in its own right, called a vector subspace of $\mathbb{R}^n$; similarly $W \subset U$ closed under addition and scalar multiplication is a subspace of $U$. (You might wish to prove this; the proof consists of showing that 1.-8. hold in $U$ if they hold in $\mathbb{R}^n$ and if $U$ has these two closure properties.)

• Definitions:

(i) Elements $\mathbf{v}_1, \ldots, \mathbf{v}_m$ of $U$ form a spanning set if every $\mathbf{v} \in U$ is a linear combination of them.

(ii) Elements $\mathbf{v}_1, \ldots, \mathbf{v}_m$ of $U$ are (linearly) independent if all are non-zero and

$$\sum_{i=1}^{m} a_i \mathbf{v}_i = \mathbf{0} \Rightarrow \text{all } a_i = 0,$$

i.e. there is only one way in which $\mathbf{0}$ can be represented as a linear combination of them. Otherwise they are dependent (equivalently, at least one is a linear combination of the others).

(iii) A spanning set whose elements are independent is a basis of $U$. Thus if $\{\mathbf{v}_1, \ldots, \mathbf{v}_m\}$ is a basis, any $\mathbf{v} \in U$ is uniquely (why?) representable as a linear combination of these basis elements.
— Fact 1: Every vector space has a basis. No proper subset of a basis can span the entire space (why not?).

— Fact 2: If $U$ has a basis of size $m$, then any $m + 1$ elements of $U$ are dependent. (Obvious if these include the basis; the proof is a bit lengthy otherwise.)

∗ Definition: The dimension of $U$ is the unique size of a basis. Uniqueness is a consequence of the preceding two statements.

∗ Another consequence: If $\dim(U) = m$, then any $m$ independent vectors in $U$ form a basis. (If not, then one can augment with elements not spanned to get $m + 1$ independent vectors. This contradicts Fact 2.)
• Let $U$ be a vector subspace of $\mathbb{R}^n$. Suppose that one forms a matrix $X$ by choosing the basis elements of $U$ to be the columns of $X$. Then the interpretations of "spanning" and "independence" in $U$ are, in terms of $X$:

spanning: $X\mathbf{c} = \mathbf{y}$ is solvable (in $\mathbf{c}$) for any $\mathbf{y} \in U$;
independence: $\mathbf{y} = \mathbf{0}$ in the above $\Rightarrow \mathbf{c} = \mathbf{0}$.

If instead we begin with a matrix $X$, then the set of all linear combinations of the columns of $X$ is a vector space (why?), called the column space ($\mathcal{C}(X)$), whose dimension is called the rank of $X$. The independent columns of $X$ form a basis for $\mathcal{C}(X)$.

Results about matrix ranks:

1) $\operatorname{rank}(AB) \le \operatorname{rank}(A)$: Since $\mathcal{C}(AB) \subseteq \mathcal{C}(A)$ (why?),

$$\operatorname{rank}(AB) = \dim(\mathcal{C}(AB)) \le \dim(\mathcal{C}(A)) = \operatorname{rank}(A).$$

(The inequality follows (ass't 1) from Fact 2 above.)

2) The rank of a matrix is at least as large as that of any of its submatrices (you should formulate and prove this).
3) Used often: $\operatorname{rank}(A'A) = \operatorname{rank}(A)$.

Proof: Let $A$ be $n \times p$ with rank $r$. We first show that $\operatorname{rank}(A'A) \ge r$. If the first $r$ columns of $A$ are independent, we can write

$$A = F_{n \times r}(I_r \,\vdots\, J), \qquad A'A = \begin{pmatrix} F'F & F'FJ \\ J'F'F & J'F'FJ \end{pmatrix},$$

where $F$ consists of the $r$ independent columns of $A$ and hence has rank $r$. Now $\operatorname{rank}(A'A) \ge \operatorname{rank}(F'F)$ (why?) and all $r$ columns of $F'F$ are independent:

$$F'F\mathbf{x} = \mathbf{0} \;\overset{?}{\Rightarrow}\; \|F\mathbf{x}\| = 0 \;\overset{?}{\Rightarrow}\; \mathbf{x} = \mathbf{0}.$$

For the general case first permute the columns of $A$: $A \to AQ$, where the first $r$ columns of $AQ$ are independent. (How? What kind of matrix $Q$ would accomplish this?) Then $AQ = F_{n \times r}(I_r \,\vdots\, J)$ and as above

$$r \le \operatorname{rank}(Q'A'AQ) \overset{?}{=} \operatorname{rank}(A'A).$$

Now write (either $Q'A'AQ$ or) $A'A$ as $(I_r \,\vdots\, J)'\,F'F\,(I_r \,\vdots\, J)$. By 1),

$$\operatorname{rank}(A'A) \le \operatorname{rank}\left((I_r \,\vdots\, J)'\right) = r$$

(why?).
4) By 3), then 1), $\operatorname{rank}(A) = \operatorname{rank}(A'A) \le \operatorname{rank}(A')$; replacing $A$ by $A'$ gives $\operatorname{rank}(A') \le \operatorname{rank}(A)$ and so

$$\operatorname{rank}(A') = \operatorname{rank}(A);$$

i.e. row rank = column rank = # of independent rows or columns. Thus, from now on, 'rank' can mean either row rank or column rank.

5) $\operatorname{rank}(AB) \le \min(\operatorname{rank}(A), \operatorname{rank}(B))$.

Proof: That $\operatorname{rank}(AB) \le \operatorname{rank}(A)$ has been shown. Using 4) and 1),

$$\operatorname{rank}(AB) = \operatorname{rank}(B'A') \le \operatorname{rank}(B') = \operatorname{rank}(B).$$
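The rank results 3), 4) and 5) are easy to spot-check numerically; a small NumPy sketch (the particular matrices are arbitrary, constructed so that $A$ has rank 2):

```python
import numpy as np

rng = np.random.default_rng(3)
# A rank-2 matrix: the product of an n x 2 and a 2 x p factor.
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))
B = rng.standard_normal((4, 5))

rank = np.linalg.matrix_rank
assert rank(A) == 2
assert rank(A.T @ A) == rank(A)                # result 3)
assert rank(A.T) == rank(A)                    # result 4)
assert rank(A @ B) <= min(rank(A), rank(B))    # result 5)
```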
• A square, full rank matrix has an inverse.

Proof: We are to show that if $A_{n \times n}$ has full rank then there is an 'inverse' $B$ with the property that $AB = BA = I_n$. The columns of $A$ are independent, hence form a basis of $\mathbb{R}^n$ (why?). Thus they span: the equations

$$A\left[\mathbf{b}_1 \cdots \mathbf{b}_n\right] = \left[\mathbf{e}_1 \cdots \mathbf{e}_n\right] = I_n$$

are all solvable. We write $\left[\mathbf{b}_1 \cdots \mathbf{b}_n\right] = B$; then $AB = I_n$. The matrix $B$ is square, full rank (why?) and so it also has an inverse on the right: there is $C_{n \times n}$ with $BC = I_n$. Now show that $C = A$; thus $AB = BA = I_n$ and so $B = A^{-1}$. □
• Fact: A square matrix has full rank iff it has a non-zero determinant.

— The determinant $|A|$ is a particular sum of products of the elements of $A_{n \times n}$. (Details in text.) Each product contains $n$ factors; there is one from each row and one from each column. It is a measure of the "size" of the matrix, in a geometrical sense.

• A consequence of the preceding is that if $X_{n \times p}$ has independent columns, so rank $p$, then $X'X$ is invertible. In a regression framework this can be interpreted in terms of information duplicated by dependent columns.
3. Orthogonality; Gram-Schmidt method; QR-decomposition

• Hat matrix: Consider a regression model $\mathbf{y} = X\boldsymbol{\theta} + \boldsymbol{\varepsilon}$ with $X_{n \times p}$ of full rank $p$. We will later show that the LSEs are

$$\hat{\boldsymbol{\theta}} = \left(X'X\right)^{-1}X'\mathbf{y},$$

so that the estimate of $E[\mathbf{y}] = X\boldsymbol{\theta}$ is $\hat{\mathbf{y}} = X\hat{\boldsymbol{\theta}} = H\mathbf{y}$, where

$$H_{n \times n} = X\left(X'X\right)^{-1}X'$$

is the "hat" matrix; it "places the hat on $\mathbf{y}$". Properties:

$$H = H' = H^2 \quad \text{("idempotent")},$$
$$HX = X,$$
$$(I - H)X = \mathbf{0},$$
$$(I - H)^2 = (I - H),$$
$$H(I - H) = \mathbf{0}.$$
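These hat-matrix properties follow directly from the definition and are easy to confirm numerically (an illustrative NumPy sketch; the explicit inverse is fine for a demonstration, though the QR route developed below is preferred in practice):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 3))              # full column rank (w.p. 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix

assert np.allclose(H, H.T)                   # H = H'
assert np.allclose(H, H @ H)                 # H = H^2 (idempotent)
assert np.allclose(H @ X, X)                 # HX = X
assert np.allclose((np.eye(8) - H) @ X, 0)   # (I - H)X = 0
```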
• Angle between nonzero vectors $\mathbf{x}, \mathbf{y}$ is defined by

$$\cos\phi = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}.$$

That such an angle exists is equivalent to the statement that $|\mathbf{x}'\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\|$. This in turn is a version of the famous Cauchy-Schwarz Inequality, to be studied later.

Proof of this version: For any real number $t$,

$$0 \le \|\mathbf{x} + t\mathbf{y}\|^2 = \|\mathbf{y}\|^2 t^2 + 2t\,\mathbf{x}'\mathbf{y} + \|\mathbf{x}\|^2,$$

so that there is at most one real zero $t$. Thus "$b^2 - 4ac$" $\le 0$, i.e.

$$4\left(\left(\mathbf{x}'\mathbf{y}\right)^2 - \|\mathbf{x}\|^2\|\mathbf{y}\|^2\right) \le 0.$$

— Equality in "$|\mathbf{x}'\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\|$" implies that $\|\mathbf{x} + t_0\mathbf{y}\|^2 = 0$ for some $t_0$ ($= \pm\|\mathbf{x}\|/\|\mathbf{y}\|$), so that $\mathbf{x}$ and $\mathbf{y}$ are proportional. The converse holds as well (you should verify this). □
• Two vectors are orthogonal if the angle between them $= \pm\pi/2$, equivalently if their scalar product $= 0$. We write $\mathbf{x} \perp \mathbf{y}$.

— Example: If $\mathbf{z}$ is any $n \times 1$ vector, and $H$ is a hat matrix, then

$$\mathbf{z} = H\mathbf{z} + (I - H)\mathbf{z} = \mathbf{z}_1 + \mathbf{z}_2,$$

say, where $\mathbf{z}_1 \perp \mathbf{z}_2$. The first is in $\operatorname{col}(X)$ (why?) and the second is in the space of vectors orthogonal to every vector in $\operatorname{col}(X)$. We write $\mathbf{z}_2 \in \operatorname{col}(X)^{\perp}$. You should verify that this is a vector space (i.e. is closed under addition and scalar multiplication).

• A matrix $Q_{n \times n}$ is orthogonal if the columns are mutually orthogonal, and have unit norm. Equivalently (why?)

$$QQ' = Q'Q = I_n.$$

If $Q$ is orthogonal then $\|Q\mathbf{y}\| = \|\mathbf{y}\|$ for any $n \times 1$ vector $\mathbf{y}$: "norms are preserved". Similarly, angles between vectors are also preserved
(why?). Geometrically, an orthogonal transformation is a "rigid motion": it corresponds to a rotation and/or an interchange of two or more axes. Rotation through an angle $\phi$ in the plane:

$$Q = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}.$$

Interchange of axes in the plane:

$$Q = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
• Orthogonal spaces in a regression context. Suppose $X_{n \times p}$ has independent columns, so $\mathcal{C}(X)$ has dimension $p$. Note that (how?) the orthogonal complement is

$$\mathcal{C}(X)^{\perp} = \left\{\mathbf{y} \mid X'\mathbf{y} = \mathbf{0}\right\}.$$

Then $\mathcal{C}(X)^{\perp} = \mathcal{C}(I - H)$:

$$\mathbf{y} \in \mathcal{C}(X)^{\perp} \Rightarrow X'\mathbf{y} = \mathbf{0} \Rightarrow (I - H)\mathbf{y} = \mathbf{y};$$
$$\mathbf{y} \in \mathcal{C}(I - H) \Rightarrow H\mathbf{y} = \mathbf{0} \Rightarrow X'\mathbf{y} = \mathbf{0}.$$

Thus $\dim\left(\mathcal{C}(X)^{\perp}\right) = \operatorname{rank}(I - H)$.
— The trace of a square matrix is the sum of its diagonal elements. You should verify that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$. Thus products within traces can be rearranged cyclically:

$$\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA), \quad \text{but not necessarily} = \operatorname{tr}(ACB).$$

It will be shown that for an idempotent matrix, rank = trace. A consequence is that

$$\operatorname{rank}(I - H) = \operatorname{tr}(I - H) = n - \operatorname{tr}(H) = n - \operatorname{tr}\left(X\left(X'X\right)^{-1}X'\right) = n - \operatorname{tr}\left(X'X\left(X'X\right)^{-1}\right) = n - p.$$

Similarly $\operatorname{rank}(X) = \operatorname{rank}(H)$ and $\operatorname{rank}(H) = \operatorname{tr}(H) = p$.
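The rank/trace bookkeeping above can be confirmed numerically (an illustrative NumPy sketch with an arbitrary full-rank design):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 3
X = rng.standard_normal((n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.isclose(np.trace(H), p)                       # tr(H) = p
assert np.isclose(np.trace(np.eye(n) - H), n - p)       # tr(I - H) = n - p
assert np.linalg.matrix_rank(np.eye(n) - H) == n - p    # rank = trace for idempotents
```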
• Gram-Schmidt Theorem: Every $m$-dimensional vector space $U$ has an orthonormal basis.

Proof: Start with any basis $\mathbf{v}_1, \ldots, \mathbf{v}_m$. Normalize $\mathbf{v}_1$ to get a unit vector (i.e. a vector with unit norm) $\mathbf{q}_1 \in U$; in general suppose that mutually orthogonal unit vectors $\mathbf{q}_1, \ldots, \mathbf{q}_k$ have been constructed, with $\mathbf{q}_j$ a linear combination of $\mathbf{v}_1, \ldots, \mathbf{v}_j$. Define $Q_k = (\mathbf{q}_1, \ldots, \mathbf{q}_k)$; this has orthonormal columns and so $Q_k'Q_k = I_k$. Define also

$$H_k = Q_kQ_k' = \text{hat matrix arising from } Q_k,$$
$$\mathbf{q}_{k+1} = \frac{\left(I - H_k\right)\mathbf{v}_{k+1}}{\left\|\left(I - H_k\right)\mathbf{v}_{k+1}\right\|}. \tag{3.1}$$

For instance, $\mathbf{q}_2 = \ldots$.

Then: (i) the numerator in (3.1) is a linear combination of $\mathbf{v}_1, \ldots, \mathbf{v}_{k+1}$ in which at least one coefficient, that of $\mathbf{v}_{k+1}$, is non-zero (why?), so that in particular the denominator is non-zero; (ii) $\mathbf{q}_{k+1} \perp \mathbf{q}_1, \ldots, \mathbf{q}_k$ ($\overset{?}{\Leftarrow} \mathbf{q}_{k+1}'H_k = \mathbf{0}'$). Continuing this process results in $m$ mutually orthogonal unit vectors $\mathbf{q}_1, \ldots, \mathbf{q}_m \in U$. Since these are orthogonal they are independent (why?) and so form a basis of $U$.
— There is a nice geometric interpretation. The matrix $H_k$ is idempotent (in fact it is the "hat" matrix arising from the $n \times k$ matrix with columns $\mathbf{q}_1, \ldots, \mathbf{q}_k$), and $H_k\mathbf{v}_{k+1}$ is the "projection of $\mathbf{v}_{k+1}$ onto the space spanned by $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$". Thus we say that $\mathbf{q}_{k+1}$ is formed by "subtracting from $\mathbf{v}_{k+1}$ its projection onto the space spanned by $\{\mathbf{q}_1, \ldots, \mathbf{q}_k\}$", so as to make what is left orthogonal to this space (and then normalizing).
• QR-decomposition. In the previous construction, at each stage, $\mathbf{q}_k$ was obtained as a linear combination of $\mathbf{v}_1, \ldots, \mathbf{v}_k$. Thus if $V_{n \times m}$ has these vectors as its columns, and $Q_{n \times m} = (\mathbf{q}_1, \ldots, \mathbf{q}_m)$, we can write

$$V_{n \times m} U_{m \times m} = Q_{n \times m}$$

for $U$ upper triangular with positive diagonal elements ($u_{k+1,k+1} = 1/\left\|\left(I - H_k\right)\mathbf{v}_{k+1}\right\| > 0$). Then $U$ is nonsingular and $V = QR$ for $R = U^{-1}$. (Note that $R$ is also upper triangular with positive diagonal elements $r_{k+1,k+1} = \left\|\left(I - H_k\right)\mathbf{v}_{k+1}\right\|$.)
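In practice one uses a library routine rather than hand-rolled Gram-Schmidt; NumPy's QR (an assumption of these sketches) returns exactly this reduced factorization, though its $R$ is only guaranteed triangular, not sign-normalized:

```python
import numpy as np

rng = np.random.default_rng(7)
V = rng.standard_normal((5, 3))

Q, R = np.linalg.qr(V)              # "reduced" QR: Q is 5x3, R is 3x3

assert np.allclose(Q @ R, V)        # V = QR
assert np.allclose(Q.T @ Q, np.eye(3))
assert np.allclose(R, np.triu(R))   # R upper triangular
```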
4. LSEs; Spectral theory

• Recall the decomposition arising from the Gram-Schmidt Theorem. Let $X_{n \times p}$ have rank $p$. Write $X = Q_1R_1$, where $Q_1: n \times p$ has orthonormal columns, and $R_1: p \times p$ is upper triangular with positive diagonal elements. Apply Gram-Schmidt once again, starting with the $n - p$ independent columns of $I - H$, to obtain $Q_2: n \times (n - p)$ whose columns are orthonormal and are a basis for $\mathcal{C}(X)^{\perp}$. Then $Q = (Q_1 \,\vdots\, Q_2)$ has orthonormal columns and is square, hence is an orthogonal matrix. We have

$$X = (Q_1 \,\vdots\, Q_2)\begin{pmatrix} R_1 \\ \mathbf{0} \end{pmatrix} = QR,$$
$$R'R = R_1'R_1 = X'X,$$
$$\left(X'X\right)^{-1} = R_1^{-1}\left(R_1'\right)^{-1},$$
$$H = Q_1Q_1',$$
$$I - H = Q_2Q_2'.$$
• Return to regression.

(i) Least squares estimation in terms of the hat-matrix decomposition of the norm of the residuals: Note that $\mathbf{x} \perp \mathbf{y} \Rightarrow \|\mathbf{x} + \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2$; then

$$\left\|\mathbf{y} - X\hat{\boldsymbol{\theta}}\right\|^2 = \left\|H\left(\mathbf{y} - X\hat{\boldsymbol{\theta}}\right)\right\|^2 + \left\|(I - H)\left(\mathbf{y} - X\hat{\boldsymbol{\theta}}\right)\right\|^2 = \left\|H\mathbf{y} - X\hat{\boldsymbol{\theta}}\right\|^2 + \left\|(I - H)\mathbf{y}\right\|^2 \ge \left\|(I - H)\mathbf{y}\right\|^2,$$

with equality iff $H\mathbf{y} = X\hat{\boldsymbol{\theta}}$ iff ('if and only if')

$$\hat{\boldsymbol{\theta}} = \left(X'X\right)^{-1}X'\mathbf{y}$$

(how?), the LS estimator. The fitted values are

$$\hat{\mathbf{y}} = X\hat{\boldsymbol{\theta}} = H\mathbf{y}$$

and are orthogonal to the residuals

$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (I - H)\mathbf{y}.$$

We say that $H$ and $I - H$ project the data ($\mathbf{y}$) onto the estimation space and error space, respectively, and that these spaces are orthogonal.
(ii) In terms of the QR-decomposition: we have that

$$\hat{\boldsymbol{\theta}} = R_1^{-1}\left(R_1'\right)^{-1}R_1'Q_1'\mathbf{y} = R_1^{-1}Q_1'\mathbf{y};$$

i.e. $\hat{\boldsymbol{\theta}}$ is the solution to

$$R_1\hat{\boldsymbol{\theta}} = Q_1'\mathbf{y}.$$

Thus compute

$$\mathbf{z}_{n \times 1} = Q'\mathbf{y} = \begin{pmatrix} Q_1'\mathbf{y} \\ Q_2'\mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{z}_1 \\ \mathbf{z}_2 \end{pmatrix} \begin{matrix} p \times 1 \\ (n-p) \times 1 \end{matrix}.$$

Then backsolve the system of equations $R_1\hat{\boldsymbol{\theta}} = \mathbf{z}_1$. This is numerically stable: no matrix inversions.

• The residual vector is $\mathbf{e} = Q_2\mathbf{z}_2$, with squared norm $\|\mathbf{z}_2\|^2$. The usual estimate of the variance $\sigma^2$ of the random errors is

$$\hat{\sigma}^2 = \frac{\text{SS of residuals}}{n - p} = \frac{\|\mathbf{e}\|^2}{n - p} = \frac{\|\mathbf{z}_2\|^2}{n - p},$$
the mean squared error. We have

$$E[\mathbf{z}] = Q'E[\mathbf{y}] = \begin{pmatrix} Q_1'Q_1R_1\boldsymbol{\theta} \\ Q_2'Q_1R_1\boldsymbol{\theta} \end{pmatrix} = \begin{pmatrix} R_1\boldsymbol{\theta} \\ \mathbf{0} \end{pmatrix},$$

and, using "$\operatorname{cov}[A\mathbf{y}] = A\operatorname{cov}[\mathbf{y}]A'$" (how?) we get

$$\operatorname{cov}[\mathbf{z}] = Q'\operatorname{cov}[\mathbf{y}]Q = Q'\,\sigma^2 I\,Q = \sigma^2 I;$$

hence the elements $z_{p+1}, \ldots, z_n$ of $\mathbf{z}_2$ have mean zero and $\operatorname{var}[z_i] = E\left[z_i^2\right] = \sigma^2$. Thus $\hat{\sigma}^2$ is unbiased:

$$E\left[\hat{\sigma}^2\right] = E\left[\frac{\sum_{i=p+1}^{n} z_i^2}{n - p}\right] = \sigma^2.$$
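The QR route to the LSE can be sketched in a few lines of NumPy (simulated data; `np.linalg.solve` on the triangular $R_1$ plays the role of the backsolve):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

Q1, R1 = np.linalg.qr(X)              # reduced QR: Q1 is n x p, R1 is p x p
z1 = Q1.T @ y
theta_hat = np.linalg.solve(R1, z1)   # backsolve R1 theta = z1; no inverse formed

# Agrees with the normal-equations solution (X'X)^{-1} X'y.
assert np.allclose(theta_hat, np.linalg.solve(X.T @ X, X.T @ y))

# Residuals and the unbiased variance estimate.
e = y - X @ theta_hat
sigma2_hat = e @ e / (n - p)
```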
• Unrelated but nonetheless useful facts: inverses and determinants of matrices in block form. If $P$ and $Q$ are nonsingular, then

$$\det\begin{pmatrix} P & S \\ R & Q \end{pmatrix} = |P| \cdot \left|Q - RP^{-1}S\right| = |Q| \cdot \left|P - SQ^{-1}R\right|$$
and

$$\begin{pmatrix} P & S \\ R & Q \end{pmatrix}^{-1} = \begin{pmatrix} \left(P - SQ^{-1}R\right)^{-1} & -P^{-1}S\left(Q - RP^{-1}S\right)^{-1} \\ -\left(Q - RP^{-1}S\right)^{-1}RP^{-1} & \left(Q - RP^{-1}S\right)^{-1} \end{pmatrix}.$$

How? Verify

$$\begin{pmatrix} I & -SQ^{-1} \\ \mathbf{0} & I \end{pmatrix}\begin{pmatrix} P & S \\ R & Q \end{pmatrix}\begin{pmatrix} I & \mathbf{0} \\ -Q^{-1}R & I \end{pmatrix} = \begin{pmatrix} P - SQ^{-1}R & \mathbf{0} \\ \mathbf{0} & Q \end{pmatrix},$$

etc.

Example:

$$\det\begin{pmatrix} I_n & \mathbf{1} \\ \mathbf{1}' & -1 \end{pmatrix} = |I_n| \cdot \left|-1 - \mathbf{1}'\mathbf{1}\right| = -1 - n.$$
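Both forms of the block determinant identity are easy to verify numerically (an illustrative sketch; the diagonal shift below is just a cheap way to make the blocks $P$ and $Q$ nonsingular):

```python
import numpy as np

rng = np.random.default_rng(9)
P = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # nonsingular block
Q = rng.standard_normal((2, 2)) + 3 * np.eye(2)   # nonsingular block
S = rng.standard_normal((3, 2))
R = rng.standard_normal((2, 3))

M = np.block([[P, S], [R, Q]])
det, inv = np.linalg.det, np.linalg.inv
assert np.isclose(det(M), det(P) * det(Q - R @ inv(P) @ S))
assert np.isclose(det(M), det(Q) * det(P - S @ inv(Q) @ R))
```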
• Spectral theory for real, symmetric matrices. First let $M_{n \times n}$ be any square matrix. For a variable $\lambda$ the determinant $|M - \lambda I_n|$ is a polynomial in $\lambda$ of degree $n$, called the characteristic polynomial. The equation

$$|M - \lambda I_n| = 0$$

is the characteristic equation. The Fundamental Theorem of Algebra states that there are then $n$ (real or complex) roots of this equation. Any such root is called an eigenvalue of $M$. If $\lambda$ is an eigenvalue then $M - \lambda I_n$ is singular, so the columns are dependent:

$$(M - \lambda I_n)\mathbf{v} = \mathbf{0} \tag{4.1}$$

for some non-zero vector $\mathbf{v}$ (possibly complex), called the eigenvector corresponding to, or belonging to, $\lambda$. Thus

$$M\mathbf{v} = \lambda\mathbf{v}. \tag{4.2}$$
• Now suppose that $M$ is symmetric (and real). Then the eigenvalues (hence the eigenvectors as well) are real. To see this, define an operation $A^*$ by taking a transpose and a complex conjugate:

$$\left(A^*\right)_{ij} = \bar{A}_{ji}.$$

Note that $(AB)^* = B^*A^*$ and that $\mathbf{v}^*\mathbf{v} = \sum_i |v_i|^2$ is real. For a real symmetric matrix $M$ we have (why?) $M^* = M$. Thus in (4.2),

$$\mathbf{v}^*M\mathbf{v} = \lambda\,\mathbf{v}^*\mathbf{v};$$

taking the conjugate transpose of each side gives

$$\mathbf{v}^*M\mathbf{v} = \bar{\lambda}\,\mathbf{v}^*\mathbf{v}.$$

Thus $\left(\lambda - \bar{\lambda}\right)\mathbf{v}^*\mathbf{v} = 0$, so that (why?) $\lambda$ is real.

• We can, and from now on will, assume that any eigenvector has unit norm.
• Eigenvectors corresponding to distinct eigenvalues are orthogonal. Reason: If $M\mathbf{v}_i = \lambda_i\mathbf{v}_i$ for $i = 1, 2$ and $\lambda_1 \ne \lambda_2$, then

$$\mathbf{v}_1'M\mathbf{v}_2 = \mathbf{v}_1'\left(M\mathbf{v}_2\right) = \lambda_2\,\mathbf{v}_1'\mathbf{v}_2 \quad\text{and}\quad = \left(\mathbf{v}_1'M\right)\mathbf{v}_2 = \lambda_1\,\mathbf{v}_1'\mathbf{v}_2;$$

thus $\left(\lambda_1 - \lambda_2\right)\mathbf{v}_1'\mathbf{v}_2 = 0$ and so $\mathbf{v}_1'\mathbf{v}_2 = 0$.

• If $\lambda$ is a multiple root of the characteristic equation, with multiplicity $m$, then the set of corresponding eigenvectors is a vector space (you should verify this) with dimension $m$.

— The proof that the dimension is $m$ requires some work, and uses two results being established in Assignment 1, and so it is added as an addendum (which you should read) to that assignment.

— By Gram-Schmidt, therefore, there are $m$ orthogonal eigenvectors corresponding to $\lambda$.
• Spectral Decomposition Theorem for real, symmetric matrices: Let $M_{n \times n}$ be real and symmetric, with eigenvalues $\lambda_1, \ldots, \lambda_n$ and corresponding orthogonal eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ with unit norms. Put

$$V_{n \times n} = (\mathbf{v}_1 \cdots \mathbf{v}_n),$$

an orthogonal matrix. Let $D_{\lambda}$ be the diagonal matrix with $\lambda_1, \ldots, \lambda_n$ on the diagonal. Since

$$MV = (\lambda_1\mathbf{v}_1 \cdots \lambda_n\mathbf{v}_n) = (\mathbf{v}_1 \cdots \mathbf{v}_n)\begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{pmatrix} = VD_{\lambda},$$

we have

$$M = VD_{\lambda}V'. \tag{4.3}$$

We say that "a real symmetric matrix is orthogonally similar to a diagonal matrix".
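The decomposition (4.3) is exactly what NumPy's symmetric eigensolver returns; a quick sketch (the symmetrization of a random matrix is just to manufacture a test case):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((4, 4))
M = (A + A.T) / 2                        # real symmetric

lam, V = np.linalg.eigh(M)               # eigenvalues (ascending), eigenvectors

assert np.allclose(V @ np.diag(lam) @ V.T, M)   # M = V D V'   (4.3)
assert np.allclose(V.T @ V, np.eye(4))          # V orthogonal
assert np.all(np.isreal(lam))                   # real eigenvalues
```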
• In a sense that will become clear, the importance of this result is that a real, symmetric matrix is "almost" diagonal. Thus when solving problems concerning real symmetric matrices it is very often useful to solve them first for diagonal matrices. This is frequently quite simple, and then extends to the general case via (4.3).

• In the construction above we could have assumed, and sometimes will assume, that the eigenvalues were ordered before they and the eigenvectors were labelled: $\lambda_1 \ge \cdots \ge \lambda_n$.
5. Examples & applications

Consequences of the spectral decomposition of a real, symmetric matrix $M_{n \times n}$. Recall that we showed $M = VD_{\lambda}V'$, for an orthogonal $V = [\mathbf{v}_1 \cdots \mathbf{v}_n]$ (the orthonormal eigenvectors) and $D_{\lambda} = \operatorname{diag}(\lambda_1 \ge \cdots \ge \lambda_n)$ (the eigenvalues).

• Bounds on eigenvalues. We have

$$\max_{\|\mathbf{x}\|=1} \mathbf{x}'M\mathbf{x} = \max_{\|\mathbf{x}\|=1} \mathbf{x}'VD_{\lambda}V'\mathbf{x} = \max_{\|\mathbf{y}\|=1} \mathbf{y}'D_{\lambda}\mathbf{y} \quad\text{(why?)} = \max\left\{\sum_{i=1}^{n} \lambda_i y_i^2 \;\Big|\; \sum_{i=1}^{n} y_i^2 = 1\right\}.$$

It is easy to guess the solution. The maximum is (what?), attained at $\mathbf{y} =$ (what?); hence the maximizing $\mathbf{x}$ is the corresponding eigenvector. An analogous result holds for $\min_{\|\mathbf{x}\|=1} \mathbf{x}'M\mathbf{x}$. (You should write it out and prove it.)
• Positive definite matrices. If a symmetric matrix $M$ is such that $\mathbf{x}'M\mathbf{x} \ge 0$ for all $\mathbf{x}$, we say that $M$ is positive semi-definite (p.s.d.) or non-negative definite (n.n.d.). We write $M \ge 0$. (The text reserves the term p.s.d. for the case in which equality is attained for at least one non-zero $\mathbf{x}$; this convention is somewhat unusual and won't be followed here.) The preceding discussion shows (how?) that $M$ is p.s.d. iff all eigenvalues are non-negative.

If $\mathbf{x}'M\mathbf{x} > 0$ for all $\mathbf{x} \ne \mathbf{0}$, we say that $M$ is positive definite (p.d.). We write $M > 0$. Equivalently, all eigenvalues are positive.

— Geometric interpretation: If $M > 0$ then $|M| > 0$ (why?) and the set

$$\left\{\mathbf{x} \mid \mathbf{x}'M^{-1}\mathbf{x} = c^2\right\}$$

is transformed, via the (orthogonal) transformation $\mathbf{y} = V'\mathbf{x}$, into the set

$$\left\{\mathbf{y} \;\Big|\; \sum_{i=1}^{n} \frac{y_i^2}{\lambda_i} = c^2\right\}.$$
This is the ellipsoid in $\mathbb{R}^n$ with semi-axes of lengths $c\sqrt{\lambda_i}$ along the coordinate axes (and volume $\propto \sqrt{|M|}$). Thus (why?) the original set, obtained from the second via the transformation $\mathbf{x} = V\mathbf{y}$, is an ellipsoid as well, whose semi-axes have the same lengths but are now in the directions of the eigenvectors of $M$.

The following three results illustrate the adage that "a symmetric matrix is almost diagonal".
• Matrix square roots. Can we define a notion of the square root of a (n.n.d.) matrix? Start by thinking of a diagonal matrix, in which case the method is obvious: if $D$ is a diagonal matrix with non-negative diagonal elements, then we can define the square root $D^{1/2}$ to be the diagonal matrix with the roots of these elements on its diagonal. Now extend to the general case. If $M \ge 0$ we write $M = VD_\lambda V'$, where $V$ is orthogonal and $D_\lambda$ has a non-negative diagonal. We define a symmetric, p.s.d. square root of $M$ by
$$M^{1/2} = VD_\lambda^{1/2}V'.$$
— There are other roots, for instance $P = VD_\lambda^{1/2}W$ for any orthogonal $W$ (then $PP' = M$), but we will generally mean the one above.
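The construction $M^{1/2} = VD_\lambda^{1/2}V'$ can be sketched numerically. A minimal numpy illustration (the matrix $A$ below is an arbitrary example, not from the text):

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric p.s.d. square root M^{1/2} = V D^{1/2} V' of an n.n.d. matrix M."""
    lam, V = np.linalg.eigh(M)         # spectral decomposition: M = V diag(lam) V'
    lam = np.clip(lam, 0.0, None)      # guard against tiny negative round-off
    return V @ np.diag(np.sqrt(lam)) @ V.T

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # symmetric, eigenvalues 1 and 3 > 0
R = sym_sqrt(A)
print(np.allclose(R @ R, A))          # R is a square root of A
print(np.allclose(R, R.T))            # and is symmetric
```

Any $P = VD_\lambda^{1/2}W$ with $W$ orthogonal would also satisfy $PP' = A$; the choice $W = V'$ gives the symmetric root above.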
• The rank of a real symmetric matrix equals the number of non-zero eigenvalues. Reason: If $M = VD_\lambda V'$ then the rank of $M$ equals the rank of $D_\lambda$ (why?), and the latter is clearly (is it?) the number of non-zero diagonal elements.
— Note also that if $M = VDV'$ is the spectral decomposition then $M$ and $D$ have the same eigenvalues, namely the diagonal elements of $D$. This is because the characteristic polynomials are the same:
$$|M - \lambda I| = \big|V(D - \lambda I)V'\big| = |D - \lambda I|.$$
• If $H$ is idempotent then (i) all eigenvalues are 0 or 1, and (ii) rank = trace. Reason: (i) It is clearly true (how?) for diagonal idempotents. But if $H$ is idempotent then $H = VD_\lambda V'$ where $D_\lambda$ is idempotent, and $H$ has the same eigenvalues as $D_\lambda$. (ii) $\operatorname{rank}(H) = \operatorname{rank}(D_\lambda) = \operatorname{tr}(D_\lambda) = \operatorname{tr}(H)$ (how are these steps justified?).
— Another interesting property, previously established via Gram-Schmidt: We can partition $D_\lambda$, and compatibly partition $V$, as
$$D_\lambda = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}, \qquad V = (V_1 \,\vdots\, V_2),$$
where $r = \operatorname{rank}(H)$ and $V_1$ is $n \times r$. This results in the decomposition of an idempotent matrix as
$$H = V_1V_1' \quad \text{where } V_1'V_1 = I_r.$$
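The rank = trace property is easy to verify numerically. A sketch using the familiar "hat" projection matrix $H = X(X'X)^{-1}X'$ of an arbitrary full-rank design matrix $X$ (an assumed example, chosen only because it is idempotent):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))              # arbitrary 10 x 3 design, rank 3
H = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto the column space of X

print(np.allclose(H @ H, H))                  # idempotent
eig = np.sort(np.linalg.eigvalsh(H))
print(np.allclose(eig[-3:], 1.0), np.allclose(eig[:-3], 0.0))  # eigenvalues 0 or 1
print(round(np.trace(H)), np.linalg.matrix_rank(H))            # trace = rank = 3
```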
Application 1. Illustration of the preceding theory: the two-population classification problem. Suppose we are given lengths and widths of prehistoric skulls, of type A or B (the "training sample"). We know that $n_1$ of these, say $x_1, \dots, x_{n_1}$, are of type A, and $n_2 = n - n_1$, say $y_1, \dots, y_{n_2}$, are of type B. Now we find a new skull, with length and width the components of $z$. We are to classify it as A or B. (Other applications: rock samples in geology, risk data in an actuarial analysis, etc.)

• Reduce to a univariate problem: $u_i = \alpha'x_i$, $v_i = \alpha'y_i$ for some vector $\alpha$. Put $w = \alpha'z$ and classify the new skull as A if $|w - \bar u| < |w - \bar v|$.

• Choose $\alpha$ for "maximal separation": $|\bar u - \bar v|$ should be large relative to the underlying variation. Put
$$s_1^2 = \frac{1}{n_1 - 1}\sum_i (u_i - \bar u)^2 = \frac{1}{n_1 - 1}\sum_i \big(\alpha'(x_i - \bar x)\big)^2 = \alpha'\Big[\frac{1}{n_1 - 1}\sum_i (x_i - \bar x)(x_i - \bar x)'\Big]\alpha = \alpha'S_1\alpha,$$
and similarly define $s_2^2$ as the variation in the other sample. Choose $\alpha$ to maximize
$$\frac{(\bar u - \bar v)^2}{\big[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\big]\big/(n - 2)} = \frac{\alpha'(\bar x - \bar y)(\bar x - \bar y)'\alpha}{\alpha'S\alpha}, \tag{5.1}$$
where $S$ is the two-sample covariance matrix
$$S = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{n - 2}.$$

• Put $\beta = S^{1/2}\alpha$, $\alpha = S^{-1/2}\beta$, so (5.1) is
$$\frac{\beta'S^{-1/2}(\bar x - \bar y)(\bar x - \bar y)'S^{-1/2}\beta}{\beta'\beta},$$
which is a maximum if $\beta/\|\beta\|$ is the unit eigenvector corresponding to
$$\lambda_{\max}\big(S^{-1/2}(\bar x - \bar y)(\bar x - \bar y)'S^{-1/2}\big) = \lambda_{\max}(aa'),$$
where $a = S^{-1/2}(\bar x - \bar y)$. Note $aa'$ has rank 1, hence has 1 non-zero eigenvalue, necessarily equal (why?) to $\operatorname{tr}(aa')$:
$$\lambda = a'a = (\bar x - \bar y)'S^{-1}(\bar x - \bar y).$$
Now solve
$$aa'\beta = \lambda\beta$$
to get $\beta$ ($\beta =$ what? - guess at a solution); any multiple will do. Then
$$\alpha = S^{-1/2}\beta = S^{-1}(\bar x - \bar y)$$
and we classify as A if
$$|w - \bar u| = \big|\alpha'(z - \bar x)\big| < \big|\alpha'(z - \bar y)\big| = |w - \bar v|.$$
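The rule $\alpha = S^{-1}(\bar x - \bar y)$ can be sketched on simulated data. The two "skull" samples below, and their means, are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 2)) + np.array([5.0, 3.0])   # type A sample
y = rng.standard_normal((60, 2)) + np.array([3.0, 5.0])   # type B sample
n1, n2 = len(x), len(y)
xbar, ybar = x.mean(axis=0), y.mean(axis=0)
S1, S2 = np.cov(x.T), np.cov(y.T)
S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)       # pooled covariance
alpha = np.linalg.solve(S, xbar - ybar)                   # alpha = S^{-1}(xbar - ybar)

def classify(z):
    """Classify z as A if |w - ubar| < |w - vbar|, with w = alpha'z."""
    w = alpha @ z
    return "A" if abs(w - alpha @ xbar) < abs(w - alpha @ ybar) else "B"

print(classify(np.array([5.0, 3.0])), classify(np.array([3.0, 5.0])))
```

A point near the A population's centre is classified A, and near the B centre, B; this is Fisher's linear discriminant rule.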
Application 2. By the Cauchy-Schwarz Inequality,
$$\max_{y} \frac{|x'My|}{\|y\|} = \|M'x\| = \sqrt{x'MM'x}.$$
Related facts: Note that $MM' \ge 0$ (why?). Conversely, any n.n.d. matrix can be represented as $MM'$ (in many ways). In particular, if $S$ is a $n \times n$ n.n.d. matrix of rank $r \le n$, then one can find $M$ ($n \times r$) such that $S = MM'$ and $M'M$ is the $r \times r$ diagonal matrix of the positive eigenvalues of $S$.
Construction: Write $S = VDV'$, where
$$D_{n \times n} = \begin{pmatrix} D_1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad V_{n \times n} = (\underbrace{V_1}_{r} \,\vdots\, \underbrace{V_2}_{n-r}),$$
and $D_1$ is the $r \times r$ diagonal matrix containing the positive eigenvalues. Then $S = V_1D_1V_1'$ and so $M_{n \times r} = V_1D_1^{1/2}$ has the desired properties. (Note also that this is a version of $S^{1/2}$.)
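The construction $M = V_1D_1^{1/2}$ is straightforward to check numerically. A sketch with a rank-2 n.n.d. $3 \times 3$ matrix $S$ built for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 2))
S = B @ B.T                               # n.n.d. of rank r = 2 (almost surely)
lam, V = np.linalg.eigh(S)                # eigenvalues in ascending order
keep = lam > 1e-10                        # positive part of the spectrum
V1, D1 = V[:, keep], np.diag(lam[keep])   # V1 is 3 x 2, D1 is 2 x 2 diagonal
M = V1 @ np.sqrt(D1)                      # M = V1 D1^{1/2}, 3 x 2

print(np.allclose(M @ M.T, S))            # S = MM'
print(np.allclose(M.T @ M, D1))           # M'M = diag(positive eigenvalues of S)
```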
Part II<br />
LIMITS, CONTINUITY,<br />
DIFFERENTIATION
6. Limits; continuity; probability spaces<br />
• Open and closed sets in $\mathbb{R}^n$; limits:
— Neighbourhood of a point $a$, of radius $\epsilon$:
$$N_\epsilon(a) = \{x \mid \|x - a\| < \epsilon\}.$$
— $S \subset \mathbb{R}^n$ is open if
$$a \in S \Rightarrow N_\epsilon(a) \subset S$$
for all sufficiently small $\epsilon > 0$.
∗ Example: $(0, 1)$.
— A sequence $\{x_n\}$ tends to a point $a$: "$x_n \to a$" if $x_n$ gets arbitrarily close to $a$ as $n$ gets larger. More formally, any neighbourhood of $a$, no matter how small, will eventually contain the $x_n$ from some point onward. More formally yet, "for any radius $\epsilon$, we can find an $N$ large enough that, once $n > N$, all of the $x_n$ lie in $N_\epsilon(a)$". This required $N$ will typically get larger as $\epsilon$ gets smaller. Finally,
$$\forall \epsilon \; \exists N = N(\epsilon) \; \big(n > N \Rightarrow x_n \in N_\epsilon(a)\big),$$
read "for all $\epsilon$ there exists an $N$, that depends on $\epsilon$, such that $n > N$ implies that $x_n \in N_\epsilon(a)$".
∗ Equivalently (why?): $x_n \to a \iff \|x_n - a\| \to 0$.
∗ This is for $a$ finite; obvious modifications otherwise. You should derive an appropriate definition of "$x_n \to \infty$" ($x_n$ scalars, not vectors).
∗ Example: $x_n = 1 - \frac{1}{n}$.
— A point $a$ is a limit point of $S \subset \mathbb{R}^n$ if there is a sequence $\{x_n\} \subset S$ such that $x_n \to a$.
∗ Example: $S = \{x \mid x = 1 - \frac{1}{n},\; n = 1, 2, \dots\}$; $x_n = 1 - \frac{1}{n} \to 1$ ($\notin S$).
— $S \subset \mathbb{R}^n$ is closed if it contains all of its limit points.
∗ Examples: $S = \{x \mid x = 1 - \frac{1}{n},\; n = 1, 2, \dots\} \cup \{1\}$, $S = [0, 1]$.
• A function $f(x) \to L$ as $x \to a$ ("$f(x)$ tends to $L$ as $x$ tends to $a$") if we can force $f(x)$ to be arbitrarily close to $L$ by choosing $x$ ($\ne a$) sufficiently close to $a$. Formally,
$$\forall \epsilon \; \exists \delta = \delta(\epsilon, a) \; \big(0 < \|x - a\| < \delta \Rightarrow |f(x) - L| < \epsilon\big).$$
The "$= \delta(\epsilon, a)$" is often omitted (but understood, unless stated otherwise). Note the "$0 < \|x - a\|$": $f(a)$ need not exist.

• Suppose $f(x)$ is defined for $x \in S$, the domain of $f$. Then $f$ is continuous at a point $a \in S$ if $f(x) \to f(a)$ as $x \to a$.
— Note that the definition requires $f$ to be defined at $a$.
— Equivalently,
$$\forall \epsilon \; \exists \delta = \delta(\epsilon, a) \; \big(\|x - a\| < \delta \Rightarrow |f(x) - f(a)| < \epsilon\big).$$

• Example: $f(x) = x^2$, $S = (0, \infty)$. Then if $\delta > 0$ and $|x - y| < \delta$ we have
$$|f(x) - f(y)| = |x - y|\,|x - y + 2y| \le |x - y| \cdot (|x - y| + 2y) < \delta(\delta + 2y),$$
which is $< \epsilon$ if $\delta^2 + 2y\delta - \epsilon < 0$, i.e.
$$0 < \delta < \sqrt{y^2 + \epsilon} - y.$$
Here we used the triangle inequality: $|a + b| \le |a| + |b|$.
• Note $\delta = \delta(\epsilon, y)$. Sometimes the same $\delta$ works for all $y$; if so we say $f$ is uniformly continuous on $S$. E.g. in the previous example, the $\delta$ that is required will $\to 0$ as $y \to \infty$, but suppose $S$ is bounded, say $S = (0, b)$. It can be shown that $\sqrt{y^2 + \epsilon} - y$ is $\downarrow$ in $y$ ("decreasing"), hence
$$\sqrt{y^2 + \epsilon} - y > \sqrt{b^2 + \epsilon} - b > 0$$
for all $y \in S$, so that
$$|x - y| < \delta = \sqrt{b^2 + \epsilon} - b \;\Rightarrow\; |f(x) - f(y)| < \epsilon.$$
Thus $f$ is uniformly continuous on $(0, b)$.

• Formally, in the last example,
$$\sqrt{b^2 + \epsilon} - b = \inf_{y \in (0, b)} \Big(\sqrt{y^2 + \epsilon} - y\Big).$$
For any set $S$, $m$ is a lower bound if $m \le x$ for all $x \in S$. If there is a finite lower bound then there are many; the largest of them is the greatest lower bound (g.l.b.) or infimum (inf). Similarly with upper bound, least upper bound (l.u.b.) or supremum (sup).
Probability spaces, random variables, distribution functions:

We start with a sample space $\Omega$, whose elements are all possible outcomes of an experiment (e.g. toss a coin ten times; $\Omega$ is all possible sequences of $H$s and $T$s). A Borel field or $\sigma$-algebra of events is a collection $\mathcal{B}$ of subsets ("events") of $\Omega$ such that one of its elements is $\Omega$ itself, it is closed under complementation, and closed under the taking of countable unions.

A probability $P$ is a function defined on $\mathcal{B}$ such that $P(\Omega) = 1$, $0 \le P(A) \le 1$, and probabilities of disjoint countable unions are additive. The triple $(\Omega, \mathcal{B}, P)$ is called a probability space. All the usual rules for manipulating probabilities follow from these axioms, e.g. $P(\emptyset) = 0$, $P(A) \le P(B)$ if $A \subset B$. In particular ("continuity of probabilities"):
$$A_n \supseteq A_{n+1} \supseteq \cdots \text{ and } \cap_{n=1}^\infty A_n = A \;\Rightarrow\; P(A_n) \to P(A). \tag{6.1}$$
7. Random variables; distributions; Jensen’s<br />
Inequality; WLLN<br />
• A (real valued, finite) random variable (r.v.) is a function $X : \Omega \to \mathbb{R}$ with the property that if $O$ is any open set, then $X^{-1}(O) = \{\omega \mid X(\omega) \in O\}$ is an event, i.e. a member of $\mathcal{B}$. E.g. $X(\omega) = \#$ of heads in the sequence $\omega$ of tosses. (For a finite sample space we generally take $\mathcal{B} = 2^\Omega$, the set of all subsets of $\Omega$.)
— Note that $O$ is open iff $O^c$ is closed.
Proof: You should show that $O$ open $\Rightarrow$ $O^c$ closed. Conversely, suppose $O^c$ is closed; we are to show that $O$ is open. We will derive a contradiction from the supposition that $O$ is not open. Suppose it isn't; then for some $x \in O$, no $N_\epsilon(x) \subset O$ (no matter how small we choose $\epsilon$). Then in particular $N_{1/n}(x)$ contains points $x_n \in O^c$. Since $\|x_n - x\| < \frac{1}{n} \to 0$, we have $x_n \to x$ and so $x$ is a limit point of $O^c$, hence a member of $O^c$ (why?). This contradicts the fact that $x \in O$, thus completing the proof. $\square$
— Note that $X^{-1}(O^c) = \big\{X^{-1}(O)\big\}^c$:
$$X^{-1}(O^c) = \{\omega \mid X(\omega) \in O^c\} = \{\omega \mid X(\omega) \in O\}^c = \big\{X^{-1}(O)\big\}^c.$$
— By the preceding points, if $C$ is closed then $O = C^c$ is open and so $X^{-1}(C) = \big\{X^{-1}(O)\big\}^c \in \mathcal{B}$: the inverse images of closed sets must also be events.
• Since the set $C = (-\infty, x]$ is closed, so also $X^{-1}(C) = \{\omega \mid X(\omega) \le x\}$ is a member of $\mathcal{B}$, hence has a probability. We write
$$F(x) = P(\{\omega \mid X(\omega) \le x\}) = P(X \le x)$$
and call $F$ the distribution function (d.f.) of the r.v. $X$. Any distribution function is right continuous, in that
$$x_n \downarrow x \;\Rightarrow\; F(x_n) \to F(x).$$
Proof: Recall (6.1) with $C_n = (-\infty, x_n]$, where $x_n \downarrow x$ and $A_n = X^{-1}(C_n)$. Then
$$A = \cap_{n=1}^\infty A_n = \cap_{n=1}^\infty X^{-1}(C_n) = X^{-1}\big(\cap_{n=1}^\infty C_n\big) \text{ (verify this)} = X^{-1}\big((-\infty, x]\big).$$
Thus $P(X \le x_n) = P(A_n) \to P(A) = P(X \le x)$. $\square$
— A d.f. is then a function $F : \mathbb{R} \to [0, 1]$ satisfying (i) $F(-\infty) = 0$, $F(\infty) = 1$; (ii) $F$ is right continuous; (iii) $F$ is weakly increasing: $x < y \Rightarrow F(x) \le F(y)$ (you should show (iii)).
— Recall the notion of expected value, which we defined in terms of a density or probability mass function. If $F(x)$ is differentiable then $f = F'$ is the density, and expectations, probabilities etc. are obtained by integration of $f$. If $F$ is a step function with jumps of height $p_k$ at points $x_k$ ($k = 0, 1, 2, \dots$) then the probability mass function is the function $p(x_k) = p_k$, and expectations, probabilities etc. are obtained by summation over $k$. In the former case we say that $X$ is continuous; in the latter, $X$ is discrete.

• Convex functions: A function $g : I \to \mathbb{R}$ is convex if
$$g\big((1 - t)x + ty\big) \le (1 - t)g(x) + t\,g(y)$$
for all $x, y \in I$, $0 \le t \le 1$. Examples: $x^2$ on $\mathbb{R}$, $-\log x$ on $(0, \infty)$. Convex functions are continuous; if a function has a derivative $g'(x)$ on $I$ which is an increasing (used here and elsewhere in the weak sense) function of $x$, then it is convex.
• Jensen's Inequality: If $X : \Omega \to I \subset \mathbb{R}$ has a finite mean $E[X]$, and if $g$ is convex on $I$, then $E[g(X)] \ge g(E[X])$.
— Application. The arithmetic/geometric mean inequality: if $x_1, \dots, x_n > 0$ then
$$\Big(\prod_{i=1}^n x_i\Big)^{1/n} \le \bar x.$$
Proof: Define a r.v. $X$ by $P(X = x_i) = \frac{1}{n}$ and apply Jensen's Inequality using the convex function $g(x) = -\log x$. $\square$
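The inequality is easy to check numerically; computing the geometric mean through logarithms, as in the proof. A minimal sketch on arbitrary positive data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.1, 10.0, size=1000)       # arbitrary positive sample
geo = np.exp(np.mean(np.log(x)))            # geometric mean via logs (avoids overflow)
ari = np.mean(x)                            # arithmetic mean
print(geo <= ari)                           # Jensen: geometric <= arithmetic
```

Equality holds only when all the $x_i$ are equal, matching the equality case of Jensen's Inequality for a strictly convex $g$.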
• Limits and continuity in probability: Let $\{X_n\}$ be a sequence of r.v.s, e.g. toss a fair coin $n$ times and let $X_n$ denote the proportion of heads in the $n$ tosses. Then $E[X_n] = \frac{1}{2}$ and we expect $X_n$ to be near $\frac{1}{2}$, with high probability, for $n$ large. We say that "$X_n$ converges to a constant $c$ in probability", and write $X_n \overset{p}{\to} c$, if
$$\lim_{n \to \infty} P(|X_n - c| \ge \epsilon) = 0 \quad \text{for any } \epsilon > 0.$$
The Weak Law of Large Numbers states that if $\bar X_n$ is the average of $n$ independent r.v.s $X_1, \dots, X_n$, all with finite mean $\mu$ and variance $\sigma^2$, then $\bar X_n \overset{p}{\to} \mu$.
— e.g. $X_i = I(i\text{-th toss results in a head})$, $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then $X_i = 1, 0$ w.p. $\frac{1}{2}$ each; $\mu = \frac{1}{2}$; by the WLLN $\bar X_n \overset{p}{\to} \frac{1}{2}$.
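The coin-tossing version of the WLLN can be illustrated by Monte Carlo: estimate $P(|\bar X_n - \tfrac{1}{2}| \ge \epsilon)$ over many replications and watch it shrink as $n$ grows (a sketch; the sample sizes and $\epsilon$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
eps, reps = 0.05, 2000
probs = []
for n in (10, 100, 1000):
    # each row is one experiment of n fair coin tosses; take the proportion of heads
    xbar = rng.integers(0, 2, size=(reps, n)).mean(axis=1)
    probs.append(np.mean(np.abs(xbar - 0.5) >= eps))
print(probs)   # decreasing toward 0 as n increases
```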
• This is a basic notion required for the theory of estimation in Statistics.
— e.g. $\sigma^2 = \operatorname{Var}(X) = E\big[(X - \mu)^2\big]$ can be estimated from a sample $X_1, \dots, X_n$ of independent observations $X_i \sim F$ by the sample variance $s^2 = (n - 1)^{-1}\sum_i \big(X_i - \bar X\big)^2$. The adjustment is for bias; disregarding it, the main idea is that averages are consistent estimates of expectations (i.e. they converge in probability to these constants). Then also $s \overset{p}{\to} \sigma$; this is a consequence of the following result.
• If $X_n \overset{p}{\to} c$ and the function $g$ is continuous at $c$, then $g(X_n) \overset{p}{\to} g(c)$.
Proof: We want to show that
$$P\big(|g(X_n) - g(c)| \ge \epsilon\big) \to 0.$$
Use the continuity of $g$ to find $\delta > 0$ such that
$$|x - c| < \delta \;\Rightarrow\; |g(x) - g(c)| < \epsilon.$$
Then
$$P(|X_n - c| < \delta) \le P\big(|g(X_n) - g(c)| < \epsilon\big).$$
Here we use the fact that if one event implies another, it has a smaller probability (i.e. $A \subset B \Rightarrow P(A) \le P(B)$). Since the first probability $\to 1$, so does the second (why?). $\square$
8. Differentiation; Mean Value and Taylor’s<br />
Theorems<br />
• Let $f : S \subset \mathbb{R} \to \mathbb{R}$ be defined in a neighbourhood $N_\epsilon(x_0)$; put
$$\phi(h) = \frac{f(x_0 + h) - f(x_0)}{h}$$
("Newton's quotient"). If $\phi(h)$ has a limit as $h \to 0$ we call it the derivative $f'(x_0)$ of $f$ at $x_0$, also written $\frac{d}{dx}\big(f(x)\big)\big|_{x = x_0}$.

• Examples: $f(x) = x^2$, $f(x) = |x|$. The former is differentiable everywhere in $\mathbb{R}$; the latter everywhere except $x = 0$.

• Differentiability $\Rightarrow$ Continuity: If $f'(x_0)$ exists then $f$ is continuous at $x_0$.
Proof:
$$|f(x_0 + h) - f(x_0)| = |h\,\phi(h)| \to 0$$
as $h \to 0$. $\square$
• Linearity, product, quotient, chain rules - read in text. They allow us to build up a stock of differentiable functions from simpler ones, and also show how the derivative of the more complicated function can be gotten from those of the simpler ones.

• Relation to monotonicity: if $f \nearrow$ on $(a, b)$ and is differentiable there, then $f'(x) \ge 0$ on $(a, b)$.
Proof: As $h \downarrow 0$ the numerator of $\phi(h)$ is $\ge 0$, hence $f'(x) = \lim_{h \downarrow 0} \phi(h) \ge 0$. (Similarly $\lim_{h \uparrow 0} \phi(h) \ge 0$.)

• If $f$ is continuous on $[a, b]$ then the inf and sup are finite, and are attained: there are points $x_m, x_M \in [a, b]$ with $f(x_m) \le f(x) \le f(x_M)$ for all $x$. Show: If a max or min is in the open interval $(a, b)$, then $f' = 0$ there (if $f'$ exists).
• Mean Value Theorem: If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, then $\exists c \in (a, b)$ with $f(b) = f(a) + f'(c)(b - a)$. This is a result of crucial importance in the approximation of functions. "Differentiable functions are locally almost linear."
— Follows from the previous bullet applied to
$$g(x) = f(x) - \Big(f(a) + \frac{f(b) - f(a)}{b - a}(x - a)\Big).$$
— Restatement: $f(y) \approx f(x) + f'(x)(y - x)$ if $|y - x|$ is small and $f'$ is continuous (since $|f'(c) - f'(x)| \to 0$ as $|y - x| \to 0$). The result is that $f(y)$ is approximately linear near $y = x$, with slope $f'(x)$. The next result (Taylor's Theorem) strengthens this statement and also allows us to assess the error in this approximation.
— A consequence of the MVT is that if $f'(x) \ge 0$ on $(a, b)$ then $f \nearrow$ there: suppose $x_1 < x_2$; then
$$f(x_2) = f(x_1) + f'(c)(x_2 - x_1) \ge f(x_1).$$
• Taylor's Theorem: "Sufficiently smooth functions can be approximated locally by polynomials." Suppose $f(x)$ has $n$ derivatives on $(a, b)$, with $f^{(n-1)}(x)$ continuous on $[a, b]$. (We put $f^{(0)}(x) = f(x)$; the assumptions imply existence and continuity of $f^{(k)}(x)$ on $(a, b)$ for $k < n$.) Then for $x, x_0 \in [a, b]$ there is a point $c$ between $x$ and $x_0$ such that
$$f(x) = \sum_{k=0}^{n-1} f^{(k)}(x_0)\frac{(x - x_0)^k}{k!} + f^{(n)}(c)\frac{(x - x_0)^n}{n!}.$$
— Example: $f(x) = \log(1 + x)$ with $|x| < 1$; expand around $x_0 = 0$: $f(0) = 0$ and for $k > 0$:
$$f^{(k)}(x) = \frac{(-1)^{k+1}(k - 1)!}{(1 + x)^k},$$
so that
$$f^{(k)}(0) = (-1)^{k+1}(k - 1)!.$$
Then
$$\log(1 + x) = \sum_{k=1}^{n-1} (-1)^{k+1}\frac{x^k}{k} + \frac{(-1)^{n+1}x^n}{n(1 + c)^n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots + (-1)^n\frac{x^{n-1}}{n - 1} + \frac{(-1)^{n+1}x^n}{n(1 + c)^n}$$
for some $c$ between $0$ and $x$, i.e. with $|c| < |x|$. Write this as
$$\log(1 + x) = s_n(x) + r_n(x);$$
if $r_n(x) \to 0$ as $n \to \infty$ we say that the series $\lim_{n \to \infty} s_n(x) = \sum_{k=1}^\infty (-1)^{k+1}\frac{x^k}{k}$ 'represents the function' $\log(1 + x)$.
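The shrinking remainder $r_n(x)$ can be seen directly by comparing the partial sums $s_n(x) = \sum_{k=1}^{n}(-1)^{k+1}x^k/k$ with the library logarithm (a sketch; the value $x = 0.5$ and the truncation orders are arbitrary):

```python
import math

def s_n(x, n):
    """Partial sum of the Taylor series for log(1+x) around 0."""
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, n + 1))

x = 0.5
errors = [abs(s_n(x, n) - math.log(1 + x)) for n in (5, 10, 20)]
print(errors)   # decreasing: the remainder r_n(x) -> 0 for |x| < 1
```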
— Proof of Taylor's Theorem: For $n = 1$ this is the MVT; assume $n > 1$. For $t \in [a, b]$ put
$$F(t) = f(x) - f(t) - \sum_{k=1}^{n-1} f^{(k)}(t)\frac{(x - t)^k}{k!}.$$
We want to show that $F(x_0)$, which is
$$F(x_0) = f(x) - \sum_{k=0}^{n-1} f^{(k)}(x_0)\frac{(x - x_0)^k}{k!},$$
can also be expressed as
$$F(x_0) = f^{(n)}(c)\frac{(x - x_0)^n}{n!} \tag{8.1}$$
for some $c \in (x_0, x)$. For this, define
$$G(t) = F(t) - \Big(\frac{x - t}{x - x_0}\Big)^n F(x_0)$$
and note that $G(x) = G(x_0) = 0$, and that $G(t)$ is differentiable on $(x_0, x)$ and continuous on $[x_0, x]$. By the MVT there is a point $c \in (x_0, x)$ with
$$G(x) = G(x_0) + G'(c)(x - x_0);$$
thus $G'(c) = 0$:
$$0 = G'(c) = F'(c) + n\frac{(x - c)^{n-1}}{(x - x_0)^n}F(x_0),$$
so
$$F(x_0) = -F'(c)\frac{(x - x_0)^n}{n(x - c)^{n-1}}. \tag{8.2}$$
But
$$F'(t) = -f'(t) - \sum_{k=1}^{n-1}\Bigg[f^{(k+1)}(t)\frac{(x - t)^k}{k!} - f^{(k)}(t)\frac{(x - t)^{k-1}}{(k - 1)!}\Bigg] = -f^{(n)}(t)\frac{(x - t)^{n-1}}{(n - 1)!};$$
this in (8.2) gives (8.1). $\square$
• l'Hospital's Rule: Read Theorem 4.2.6 (or another source) and the examples following it.
— Rough idea: If $f(a) = g(a) = 0$, then
$$\lim_{x \to a}\frac{f(x)}{g(x)} = \lim_{x \to a}\frac{\dfrac{f(x) - f(a)}{x - a}}{\dfrac{g(x) - g(a)}{x - a}} = \lim_{x \to a}\frac{f'(x)}{g'(x)}.$$
— Example: $\lim_{x \to 0}\dfrac{\sin x}{x} = \lim_{x \to 0}\dfrac{\cos x}{1} = 1$.
9. Applications: transformations; variance<br />
stabilization<br />
• Application 1. Distribution of functions of r.v.s. Suppose a r.v. $X$ has a differentiable d.f. $F_X(x)$, with density $f_X(x) = F_X'(x)$. Consider the r.v. $Y = g(X)$ (e.g. $Y = \log X$). First assume $g$ is strictly monotonic ($\uparrow$ or $\downarrow$). The d.f. of $Y$ is
$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = \begin{cases} P\big(X \le g^{-1}(y)\big) & \text{if } g \uparrow \\ P\big(X \ge g^{-1}(y)\big) & \text{if } g \downarrow \end{cases} = \begin{cases} F_X\big(g^{-1}(y)\big) & \text{if } g \uparrow \\ 1 - F_X\big(g^{-1}(y)\big) & \text{if } g \downarrow. \end{cases}$$
Note that the left continuity of $F_X$ is used here:
$$P\big(X \ge g^{-1}(y)\big) \overset{?}{=} 1 - F_X\big(g^{-1}(y)\big).$$
To get the density $f_Y(y)$ of $Y = g(X)$ we must differentiate $F_X\big(g^{-1}(y)\big)$. Write $x = g^{-1}(y)$; then $\big(g^{-1}(y)\big)' = \frac{dx}{dy}$ can be obtained by differentiating the relationship $y = g(x)$:
$$1 = \frac{dy}{dy} = g'(x)\frac{dx}{dy};$$
hence
$$\frac{dx}{dy} = \frac{1}{g'(x)} = \frac{1}{g'\big(g^{-1}(y)\big)}.$$
In the above,
$$f_Y(y) = \begin{cases} f_X\big(g^{-1}(y)\big)\big[g'\big(g^{-1}(y)\big)\big]^{-1} & \text{if } g \uparrow \\ -f_X\big(g^{-1}(y)\big)\big[g'\big(g^{-1}(y)\big)\big]^{-1} & \text{if } g \downarrow. \end{cases}$$
In either event,
$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right|$$
if $y = g(x)$ is strictly monotone, with $x$ expressed in terms of $y$ on the RHS.
— Example: $X > 0$, $Y = -\log X$. Then
$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right| = f_X\big(e^{-y}\big)\big|-e^{-y}\big| = f_X\big(e^{-y}\big)e^{-y}.$$
Thus if $X \sim U(0, 1)$ with $f_X(x) = I(0 < x < 1)$, then $Y$ has density $e^{-y}$ ($y > 0$); we say $Y$ has the exponential density with mean 1. (The function $f(y) = e^{-y}$, $y > 0$, is the exponential p.d.f. with mean 1.)
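The claim that $Y = -\log X$ is exponential with mean 1 for $X \sim U(0,1)$ is easy to check by simulation (a sketch; the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
y = -np.log(rng.uniform(size=200_000))   # Y = -log(U), U ~ Uniform(0,1)
print(round(y.mean(), 2), round(y.var(), 2))   # both near 1, as for Exp(1)
```

This transformation is also the standard inverse-c.d.f. method for generating exponential random variables.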
— When $g$ is non-monotonic it is usual to split up the range of $X$ into regions on which $g$ is monotonic. Example: suppose $X \sim N(0, 1)$ with density
$$\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \quad (-\infty < x < \infty)$$
and d.f. $\Phi(x) = \int_{-\infty}^x \phi(t)\,dt = P(X \le x)$. Put $Y = -\log|X|$. Then
$$F_Y(y) = P(Y \le y) = P\big(X \le -e^{-y} \text{ or } X \ge e^{-y}\big) = P\big(X \le -e^{-y}\big) + P\big(X \ge e^{-y}\big) = \Phi\big(-e^{-y}\big) + 1 - \Phi\big(e^{-y}\big);$$
thus
$$f_Y(y) = e^{-y}\phi\big(-e^{-y}\big) + e^{-y}\phi\big(e^{-y}\big) = 2e^{-y}\phi\big(e^{-y}\big) \quad (-\infty < y < \infty).$$
• Application 2. Variance stabilization. We first need notions of convergence in law, or distribution. Suppose $\{Y_n\}$ is a sequence of r.v.s; we say that $Y_n \overset{L}{\to} Y \sim F$ if
$$P(Y_n \le y) \to P(Y \le y) = F(y)$$
at every continuity point $y$ of $F$.
— The Central Limit Theorem (CLT; we'll prove it later) refers to this kind of convergence: if $Y_n = \sqrt{n}\big(\bar X_n - \mu\big)$, where $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ and $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then $Y_n \overset{L}{\to} Y \sim N(0, \sigma^2)$. The CLT, WLLN and MVT together are sufficient to derive a vast array of large sample approximations in Mathematical Statistics.
— Some basic facts required here are that if $Y_n \overset{L}{\to} Y \sim F$ then:
1. $a_nY_n + b_n \overset{L}{\to} aY + b$, if $a_n$ and $b_n$ are constants tending to $a$ and $b$, or r.v.s with these limits in probability. ("Slutsky's Theorem"; its role is to eliminate "nuisance terms", that typically $\to 0$ or $1$, in limit distributions.)
2. $g(Y_n) \overset{L}{\to} g(Y)$ if $g$ is continuous.
3. $Y_n \overset{L}{\to} c$ (a constant) $\iff Y_n \overset{p}{\to} c$. (You should show this.)
Now suppose that $X_n \overset{p}{\to} \theta$ and (actually, the former is implied by)
$$\sqrt{n}(X_n - \theta) \overset{L}{\to} X \sim N(0, \sigma^2).$$
Consider a function $Y_n = g(X_n)$, where $g$ is twice continuously differentiable on $\mathbb{R}$. Then $Y_n \overset{p}{\to} g(\theta)$, and by Taylor's Theorem, for some $\theta_n^*$ between $X_n$ and $\theta$,
$$\sqrt{n}\big(Y_n - g(\theta)\big) = \sqrt{n}\{g(X_n) - g(\theta)\} = \sqrt{n}\Big\{g'(\theta)(X_n - \theta) + g''(\theta_n^*)\frac{(X_n - \theta)^2}{2}\Big\} = g'(\theta)\big\{\sqrt{n}(X_n - \theta)\big\} + \frac{g''(\theta_n^*)}{2\sqrt{n}}\big\{\sqrt{n}(X_n - \theta)\big\}^2.$$
By Slutsky's Theorem, this has the same limit distribution as $g'(\theta)\sqrt{n}(X_n - \theta)$, as long as
$$\frac{g''(\theta_n^*)}{2\sqrt{n}}\big\{\sqrt{n}(X_n - \theta)\big\}^2 \overset{p}{\to} 0.$$
This in turn follows (how?) from
$$\big\{\sqrt{n}(X_n - \theta)\big\}^2 \overset{L}{\to} X^2 \text{ (why?)}, \qquad \frac{g''(\theta_n^*)}{2\sqrt{n}} \overset{p}{\to} 0 \text{ (why?)}.$$
The end result is that
$$\sqrt{n}\big(Y_n - g(\theta)\big) \overset{L}{\to} N\Big(0, \big(g'(\theta)\big)^2\sigma^2\Big).$$
Suppose now that we have evidence that our r.v. $\bar X_n$ has a variance that depends on its mean, i.e. $\sigma^2 = \sigma^2(\mu)$. An example is if $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ represents the average number of radioactive emissions of a certain type in $n$ runs of an experiment, where the number $X$ of emissions in one experiment has the Poisson distribution
$$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \dots.$$
Then $X$ has mean and variance both $= \lambda$, so that by the CLT $\sqrt{n}\big(\bar X_n - \lambda\big) \overset{L}{\to} N(0, \lambda)$, i.e. $\sigma^2(\mu) = \mu$.
This can make it problematic to make reliable inferences about the mean. For instance, a confidence interval on $\mu$: $\bar X_n \pm 2\sqrt{\bar X_n/n}$, will have a width depending on the unknown $\mu$.

Question: what "variance stabilizing" transformation $Y_n = g(\bar X_n)$ will have an approximately constant variance? We require $g'(\mu)\sqrt{\sigma^2(\mu)}$ to be constant, i.e. $g'(\mu) \propto \frac{1}{\sqrt{\sigma^2(\mu)}}$. This will be the case if
$$g(\mu) \propto \int \frac{d\mu}{\sqrt{\sigma^2(\mu)}}.$$
In the Poisson example we would take
$$g(\lambda) \propto \int \frac{d\lambda}{\sqrt{\sigma^2(\lambda)}} = \int \lambda^{-1/2}\,d\lambda \propto \sqrt{\lambda}$$
to obtain
$$\sqrt{n}\Big\{\sqrt{\bar X_n} - \sqrt{\lambda}\Big\} \overset{L}{\to} N(0, 1/4).$$
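The stabilizing effect of the square-root transformation is visible in simulation: $n\,\operatorname{Var}(\bar X_n)$ grows with $\lambda$, while $n\,\operatorname{Var}(\sqrt{\bar X_n}) \approx 1/4$ regardless of $\lambda$ (a sketch; the values of $n$, $\lambda$, and the replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 200, 5000
stab = []
for lam in (2.0, 8.0, 32.0):
    # each row: one sample mean of n Poisson(lam) observations
    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    stab.append(np.sqrt(xbar).var() * n)     # ~ 1/4 by the delta method
    print(lam, round(xbar.var() * n, 2))     # ~ lam: unstabilized variance grows
print([round(v, 3) for v in stab])           # all near 0.25, independent of lam
```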
Part III<br />
SEQUENCES, SERIES,<br />
INTEGRATION
10. Sequences and series<br />
• Convergence of a sequence $\{a_n\}_{n=1}^\infty$: We say that $a_n \to a$ if $f(n) = a_n \to a$ as $n \to \infty$, i.e. if
$$\forall \epsilon > 0 \; \exists N \; \big(n > N \Rightarrow |a_n - a| < \epsilon\big).$$
This is for $a$ finite; obvious modifications (what are they?) otherwise. E.g. $a_n = x^n \to 0$ if $|x| < 1$. ($N = \log\epsilon/\log|x|$; $N$ depends on $\epsilon$ and $x$.)

• Series: Put $s_n = \sum_{k=1}^n a_k$, the $n$-th partial sum of the series $\sum_{k=1}^\infty a_k$. We say that $\sum_{k=1}^\infty a_k = s$ if $s_n \to s$.
— Example: Geometric series $\sum_{k=0}^\infty x^k$ for $|x| < 1$. We have
$$s_n = \sum_{k=0}^n x^k = \frac{1 - x^{n+1}}{1 - x} \to s = \frac{1}{1 - x}.$$
• Extend to functions. $f_n : S \to \mathbb{R}$ functions; if the sequence $f_n(x)$ has a limit for every $x \in S$, denoted $f(x)$, then we say that $f_n \to f$ on $S$. Formally, for each $x \in S$,
$$\forall \epsilon > 0 \; \exists N = N(\epsilon, x) \; \big(n > N \Rightarrow |f_n(x) - f(x)| < \epsilon\big). \tag{10.1}$$

• Similarly, consider $s_n(x) = \sum_{k=1}^n f_k(x)$. If $s_n(x) \to s(x)$ for $x \in S$ we say that $s(x) = \sum_{k=1}^\infty f_k(x)$ and that $\sum_{k=1}^\infty f_k(x)$ converges to $s(x)$.

• If, in (10.1), the same $N = N(\epsilon)$ works for all $x \in S$ we say the convergence is uniform on $S$: $f_n \Rightarrow f$ on $S$. Equivalently,
$$f_n \Rightarrow f \text{ on } S \iff \sup_{x \in S}|f_n(x) - f(x)| \to 0.$$
Example of non-uniformity of convergence:
$$f_n(x) = \begin{cases} x^n, & 0 \le x < 1 \\ 1, & x \ge 1 \end{cases} \;\to\; f(x) = \begin{cases} 0, & 0 \le x < 1 \\ 1, & x \ge 1. \end{cases}$$
Then for each $n$,
$$\sup_{[0, \infty)}|f_n(x) - f(x)| \ge \sup_{[0, 1)}|x^n| = 1,$$
so that $\sup_{[0, \infty)}|f_n(x) - f(x)| \nrightarrow 0$.
• Example of uniformity of convergence. Consider $f(x) = e^x$. By Taylor's Theorem, for $c$ between $0$ and $x$:
$$f(x) = \sum_{k=0}^n f^{(k)}(0)\frac{x^k}{k!} + f^{(n+1)}(c)\frac{x^{n+1}}{(n+1)!} = \sum_{k=0}^n \frac{x^k}{k!} + e^c\frac{x^{n+1}}{(n+1)!} = s_n(x) + r_n(x), \text{ say.}$$
Then $|f(x) - s_n(x)| = |r_n(x)|$ and so $f(x) = \sum_{k=0}^\infty \frac{x^k}{k!}$ if $|r_n(x)| \to 0$. We show the stronger result that $s_n \Rightarrow f$ (equivalently, $r_n \Rightarrow 0$) on any closed interval $[a, b]$. For this, let $M$ be any integer that exceeds both $|a|$ and $|b|$, hence exceeds $|c|$. Let $n > M$. Then $\sup_{x \in [a, b]}|r_n(x)| \to 0$; this is because it is
$$\le e^M\frac{M^{n+1}}{(n+1)!} = e^M\frac{M^{n+1-M}\,M^M}{M!\,(M+1)\cdots\big(M + (n+1-M)\big)} = e^M\frac{M^M}{M!}\cdot\frac{M}{M+1}\cdot\frac{M}{M+2}\cdots\frac{M}{M + (n+1-M)} \le e^M\frac{M^M}{M!}\cdot\Big(\frac{M}{M+1}\Big)^{n+1-M} \to 0 \text{ as } n \to \infty.$$
• Cauchy sequences. A sequence $\{a_n\}_{n=1}^\infty$ is Cauchy if the terms get close together sufficiently quickly:
$$\forall \epsilon > 0 \; \exists N \; \big(m, n > N \Rightarrow |a_m - a_n| < \epsilon\big).$$
Note that if $a_n \to a$ (finite) then we can let $N$ be such that
$$n > N \Rightarrow |a_n - a| < \epsilon/2;$$
then for $m, n > N$,
$$|a_m - a_n| = |(a_m - a) - (a_n - a)| \le |a_m - a| + |a_n - a| < \epsilon.$$
Thus a convergent sequence (i.e. a sequence with a finite limit) is Cauchy. (As a consequence, $\sum_{n=1}^\infty n^{-1}$ diverges.) The converse (not proven here) holds as well, so that a sequence is convergent iff it is Cauchy.
• A consequence is that if $\sum a_k$ converges absolutely, i.e. if $\sum |a_k|$ converges, then $\sum a_k$ converges.
Proof: Suppose that $t_n = \sum_{k=1}^n |a_k|$ is a convergent, hence a Cauchy, sequence. There is $N$ such that
$$m > n > N \Rightarrow |t_m - t_n| < \epsilon.$$
But $|t_m - t_n| = \sum_{k=n+1}^m |a_k|$, so that $s_n = \sum_{k=1}^n a_k$ satisfies (for $m > n > N$)
$$|s_m - s_n| = \Bigg|\sum_{k=n+1}^m a_k\Bigg| \le \sum_{k=n+1}^m |a_k| = |t_m - t_n| < \epsilon.$$
Thus $\{s_n\}$ is Cauchy, hence is convergent. $\square$
• Example: Let $X$ be a discrete r.v. with $P(X = k) = p_k$, $k = 0, 1, 2, \dots$. If $\sum k^m p_k$ converges absolutely, we call it the $m$-th moment $E[X^m]$ of $X$. Suppose $X$ has the Poisson distribution $\mathcal{P}(\lambda)$:
$$P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \dots.$$
(Note $\sum p_k = 1$ - how?) Then the moments exist for all $m > 0$. To see this, consider the partial sums
$$s_n = \sum_{k=0}^n k^m p_k = e^{-\lambda}\sum_{k=0}^n \frac{k^m\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^n b_k, \text{ say.}$$
We must show that $s_n$ converges. Note that
$$\frac{b_{k+1}}{b_k} = \frac{\lambda}{k+1}\Big(1 + \frac{1}{k}\Big)^m \to 0 \text{ as } k \to \infty,$$
so that for $0 < r < 1$ there is $K$ so that
$$k \ge K \Rightarrow \frac{b_{k+1}}{b_k} \le r.$$
Then for $n_0 > K$ we have
$$s_{n_0} = s_K + e^{-\lambda}\sum_{k=K+1}^{n_0} b_k \le s_K + e^{-\lambda}b_K\big(r + r^2 + \cdots + r^{n_0 - K}\big) < s_K + e^{-\lambda}b_K\frac{r}{1 - r}.$$
Thus the sequence $\{s_{K+1}, s_{K+2}, \dots\}$ is increasing and bounded above, hence has a limit ($=$ the sup) - you should show this.
— Here we have established convergence by using a version of the ratio test. See §5.2.1, 5.2.2 in the text, or elsewhere, for other tests.
• Uniform convergence ensures that we can interchange certain operations.
Theorem: If $f_n \Rightarrow f$ on $S$, then
$$\lim_{n \to \infty}\lim_{x \to x_0} f_n(x) = \lim_{x \to x_0}\lim_{n \to \infty} f_n(x)$$
for $x_0 \in S$.
— A case in which this fails, because the convergence is not uniform, is
$$f_n(x) = \begin{cases} x^n, & 0 \le x < 1 \\ 1, & x \ge 1 \end{cases} \;\to\; f(x) = \begin{cases} 0, & 0 \le x < 1 \\ 1, & x \ge 1, \end{cases}$$
with $x_0 = 1$. Here $\lim_{x \to x_0}\lim_{n \to \infty} f_n(x) = \lim_{x \to 1} f(x)$ does not even exist - the limit function is discontinuous. Cases like this are ruled out if the convergence is uniform:
— If $f_n \Rightarrow f$ on $S$ and the $f_n$ are continuous on $S$, then $f$ is continuous on $S$.
Proof: For any $x_0 \in S$,
$$\lim_{x \to x_0} f(x) = \lim_{x \to x_0}\lim_{n \to \infty} f_n(x) = \lim_{n \to \infty}\lim_{x \to x_0} f_n(x) = \lim_{n \to \infty} f_n(x_0) = f(x_0). \quad \square$$
11. Power series; moment and probability<br />
generating functions<br />
• Power series: Put $s_n(x) = \sum_{k=0}^n c_k(x - x_0)^k$; if $s_n(x) \to f(x)$ as $n \to \infty$ we say that $\sum_{k=0}^\infty c_k(x - x_0)^k$ is the power series representing $f$.
— e.g. by Taylor's Theorem, if
$$s_n(x) = \sum_{k=0}^n f^{(k)}(x_0)\frac{(x - x_0)^k}{k!}$$
then
$$f(x) = s_n(x) + f^{(n+1)}(c)\frac{(x - x_0)^{n+1}}{(n+1)!},$$
so that if
$$f^{(n+1)}(c)\frac{(x - x_0)^{n+1}}{(n+1)!} \to 0$$
then $\sum_{k=0}^\infty \frac{f^{(k)}(x_0)(x - x_0)^k}{k!}$ is the power series ("Taylor series", or "Maclaurin's series" if $x_0 = 0$) "representing $f$".
• Theorem: Suppose a power series $\sum_{k=0}^\infty c_k x^k$ converges for one value $x_0 \ne 0$. Then it converges absolutely for $|x| < |x_0|$.
Proof: Put
$$s_n(x) = \sum_{k=0}^n c_k x^k, \qquad t_n(x) = \sum_{k=0}^n \big|c_k x^k\big|.$$
Since $s_n(x_0)$ has a limit as $n \to \infty$, it is a Cauchy sequence and, in particular, $\big|c_n x_0^n\big| = \big|s_n(x_0) - s_{n-1}(x_0)\big| \to 0$. For $\epsilon > 0$ let $N$ be large enough that
$$n > N \Rightarrow \big|c_n x_0^n\big| < \epsilon.$$
Then for $n > N$ and $|x| < |x_0|$,
$$t_n(x) = \sum_{k=0}^n \big|c_k x^k\big| = t_N(x) + \sum_{k=N+1}^n \big|c_k x_0^k\big|\left|\frac{x}{x_0}\right|^k < t_N(x) + \frac{\epsilon}{1 - |x/x_0|},$$
i.e. the partial sums $t_n(x)$, which are necessarily increasing, are bounded above. $\square$
• If $\sum_{k=0}^\infty c_k x^k$ converges for $|x| < R$ and diverges for $|x| > R$ we call $R$ the radius of convergence. Then, by the previous result, if $|x| < R$ the series is absolutely convergent.
— e.g. put
$$s_n(x) = \sum_{k=0}^n (-x)^k = \frac{1 - (-x)^{n+1}}{1 + x};$$
then with $f(x) = 1/(1 + x)$ we have
$$|s_n(x) - f(x)| = \frac{|x|^{n+1}}{|1 + x|}.$$
If $|x| < 1$ then $|s_n(x) - f(x)| \to 0$; if $|x| > 1$ it $\to \infty$. Thus $R = 1$ is the radius of convergence. In this case when $|x| = 1$ the series diverges (i.e. the partial sums do not converge).
• Theorem: Suppose a power series $\sum_{k=0}^\infty c_k x^k$ has a radius of convergence $R > 0$ (within which it necessarily converges absolutely). Let $0 < r < R$. Then:
(i) $\sum_{k=0}^\infty c_k x^k$ converges uniformly on $[-r, r]$;
(ii) For $|x| < R$ the limit function $f(x) = \sum_{k=0}^\infty c_k x^k$ is continuous and differentiable, and the derivative is represented by the convergent series
$$f'(x) = \sum_{k=1}^\infty k c_k x^{k-1}.$$
(Thus $\sum_{k=1}^\infty k c_k x^{k-1}$ converges for $|x| < R$, and so (i), (ii) apply to $f'(x)$.)
Proof of (i): Suppose $\sum_{k=0}^n c_k x^k = s_n(x) \to f(x)$ for $|x| < R$. Then for $|x| \le r$ we have
$$|s_n(x) - f(x)| = \Bigg|\sum_{k=n+1}^\infty c_k x^k\Bigg| \le \sum_{k=n+1}^\infty |c_k|\,r^k \to 0$$
as $n \to \infty$, since $\sum_{k=0}^\infty c_k r^k$ converges absolutely for $r < R$. Thus $\sup_{|x| \le r}|s_n(x) - f(x)| \to 0$, as required. $\square$
• By (ii), we can repeat the process: $f''(x) = \sum_{k=2}^\infty k(k - 1)c_k x^{k-2}$, etc. Among other things, this implies the uniqueness of power series representations. (How?)

• Example: The probability generating function of a r.v. $X$ is the function $g(s) = E\big[s^X\big]$, provided this exists. In particular, if $X$ has support $\mathbb{N} = \{0, 1, 2, \dots\}$ then
$$g(s) = \sum_{k=0}^\infty s^k P(X = k).$$
Since this converges for $s = 1$ it has radius of convergence $R \ge 1$. We can then differentiate term-by-term near $s = 0$:
$$g^{(m)}(0) = \sum_{k=m}^\infty k(k - 1)\cdots(k - m + 1)\,s^{k-m}P(X = k)\Big|_{s=0} = m!\,P(X = m).$$
— Note that, by uniqueness of power series, if we can expand $g(s)$ as $\sum_{k=0}^\infty c_k s^k$ then, necessarily, $c_k = P(X = k) = g^{(k)}(0)/k!$. In other words, the p.g.f. uniquely determines the distribution: two r.v.s with the same p.g.f. have the same distribution.
— Example: If $X \sim \text{Binomial}(n, p)$ then
$$g(s) = (1 - p + ps)^n;$$
the uniqueness then shows that the sum of such independent $X$s, all with the same $p$ but possibly different values of $n$, is $\sim \text{Binomial}\big(\sum n_i, p\big)$.
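The uniqueness statement can be checked by expanding the binomial p.g.f. $(1-p+ps)^n$ as a polynomial in $s$ and comparing the coefficients with the binomial probabilities (a sketch with arbitrary $n$ and $p$):

```python
import numpy as np
from math import comb

n, p = 5, 0.3
# coefficients of (q + p s)^n in ascending powers of s, by repeated
# polynomial multiplication (convolution of coefficient arrays)
coef = np.array([1.0])
for _ in range(n):
    coef = np.convolve(coef, np.array([1 - p, p]))

pmf = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
print(np.allclose(coef, pmf))   # coefficient of s^k equals P(X = k)
```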
— In the above we have used the fact that a characterization of the independence of r.v.s $(X, Y)$ is that $E[g(X)h(Y)] = E[g(X)]\,E[h(Y)]$ for all functions $g, h$ such that $g(X)$ and $h(Y)$ are also r.v.s. Equivalently, $g(X)$ and $h(Y)$ are uncorrelated for all such $g, h$.
• The moment generating function of a r.v. $X$ is the function $M(t) = E\big[e^{tX}\big]$, provided this exists (i.e. is finite). (Replacing $t$ by $it$ gives the characteristic function, which always exists: it is $E[\cos(tX)] + iE[\sin(tX)]$.) With $X$ as above,
$$M(t) = \sum_{k=0}^\infty e^{tk}P(X = k).$$
Note that $M(t) = g(e^t)$, so that it converges (absolutely) in a neighbourhood of $t = 0$ iff $g$ has a radius of convergence $R > 1$. Assume this. Then for $|t| < \log R$ we have, by the preceding theorem,
$$M'(t) = g'(e^t)\,e^t = \sum_{k=0}^\infty k\big(e^t\big)^{k-1}P(X = k)\cdot e^t = \sum_{k=0}^\infty k\,e^{tk}P(X = k) = E\big[Xe^{tX}\big],$$
with $M'(0) = E[X]$. Continuing, $M^{(m)}(t) = E\big[X^m e^{tX}\big]$, with $M^{(m)}(0) = E[X^m]$. (i.e. we can differentiate within the $E[\cdot]$.)
— e.g. $X \sim \mathcal P(\lambda)$ with $P(X = k) = e^{-\lambda}\lambda^k/k!$ has
$$M(t) = \sum_{k=0}^{\infty} e^{tk}\,e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^{\infty}\frac{\left(\lambda e^t\right)^k}{k!} = e^{-\lambda}\cdot e^{\lambda e^t} = e^{\lambda(e^t - 1)}.$$
Thus
$$E[X] = M'(0) = \lambda,\quad E\left[X^2\right] = M''(0) = \lambda^2 + \lambda,\ \text{ hence } \text{Var}[X] = \lambda.$$
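These moment computations can be verified by differentiating the Poisson m.g.f. numerically (central differences at $t = 0$; the value $\lambda = 2.5$ is an arbitrary choice for illustration):

```python
import math

lam = 2.5
def M(t):
    # m.g.f. of Poisson(lam): M(t) = exp(lam * (e^t - 1))
    return math.exp(lam * (math.exp(t) - 1.0))

h = 1e-4
M1 = (M(h) - M(-h)) / (2 * h)             # ~ M'(0)  = E[X]   = lam
M2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2   # ~ M''(0) = E[X^2] = lam^2 + lam
var = M2 - M1 ** 2                        # ~ Var[X] = lam
```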
— The cumulants $\kappa_n$ of a distribution are defined as the coefficients in the expansion
$$\log E\left[e^{tX}\right] = \sum_{n=1}^{\infty}\kappa_n\frac{t^n}{n!}.$$
Thus the Poisson distribution has all cumulants $\kappa_n = \lambda$. In general $\kappa_1$ is the mean and $\kappa_2$ is the variance; after that they get more complicated. The Normal distribution has all $\kappa_n = 0$ for $n > 2$.
12. Branching processes
• Important in population studies and elsewhere. Organisms are born, live for 1 unit of time, then give birth to a random number of offspring and die.
• Define r.v.s
$N_t$ = population size at time $t$ ($t = 0, 1, 2, \dots$), with $N_0 = 1$;
$X_i$ = number of offspring of the $i$th member of the population.
Then
$$N_t = \sum_{i=1}^{N_{t-1}} X_i.$$
• Problems: (i) Determine properties of the distribution of $N_t$. (ii) Determine the limiting probability of extinction ($= \lim_{t\to\infty} P(N_t = 0) = \lim_{t\to\infty} G_t(0)$, if $G_t$ is the p.g.f. of $N_t$).
• Assume: When $N_{t-1} = k$, $X_1, \dots, X_k$ are independent r.v.s, independent of $N_{t-1}$. (i.e. number of offspring of one member has no effect on that of another, and is unaffected by current size of population. Realistic?) Assume also that all $X_i$ are distributed in the same manner.
• We will work with the p.g.f.s
$$G(s) = E\left[s^X\right],\quad G_t(s) = E\left[s^{N_t}\right];\quad 0 \le s \le 1.$$
Assume $G$ has a radius of convergence $> 1$, so that $E[X] = G'(1)$ exists. Note that
$$G_t(s) = E\left[s^{N_t}\right] = E\left[s^{\sum_{i=1}^{N_{t-1}} X_i}\right].$$
If $N_{t-1} = k$, this is
$$E\left[s^{\sum_{i=1}^{k} X_i}\right] = E\left[\prod_{i=1}^{k} s^{X_i}\right] = \prod_{i=1}^{k} E\left[s^{X_i}\right] \ \text{(independence)} = G^k(s) \ \text{(since the } X_i \text{ are identically distributed).}$$
Considering the probabilities of the events “$N_{t-1} = k$” (i.e. Double Expectation Theorem: $E\left[s^{N_t}\right] = E_{N_{t-1}}\left\{E\left[s^{N_t}\mid N_{t-1}\right]\right\}$) gives
$$G_t(s) = \sum_{k=0}^{\infty} G^k(s)\,P(N_{t-1} = k) = G_{t-1}(G(s)).$$
Iterating:
$$G_0(s) = E\left[s^{N_0}\right] = E[s] = s,$$
$$G_1(s) = G_0(G(s)) = G(s) = G(G_0(s)),$$
$$G_2(s) = G_1(G(s)) = G\circ G(s) = G(G_1(s)),$$
$$G_3(s) = G_2(G(s)) = G\circ G\circ G(s) = G(G_2(s)),$$
and in general (by induction)
$$G_t(s) = G(G_{t-1}(s)),\quad t = 1, 2, \dots;\quad G_0(s) = s.$$
It follows (you should show how) that $E[N_t] = \{E[X]\}^t$. (Intuitively obvious?)
• Probability of extinction. Note $P(N_t = 0) = G_t(0) = \eta_t$, say, and $\eta_t = G(\eta_{t-1})$ with $\eta_0 = 0$. Does $\lim_{t\to\infty}\eta_t$ exist, and if so what is it? We shall assume that $0 < P(X = 0) < 1$, otherwise the problem is trivial. Consequently, $G(s)$ is positive, strictly increasing and convex for $0 \le s < 1$:
$$G(s) = \sum_{k=0}^{\infty} s^k P(X = k) \ge P(X = 0) > 0,$$
$$G'(s) = \sum_{k=1}^{\infty} k\,s^{k-1} P(X = k) > 0, \text{ since at least one of the } P(X = k),\ k \ge 1, \text{ is } > 0,$$
$$G''(s) = \sum_{k=2}^{\infty} k(k-1)\,s^{k-2} P(X = k) \ge 0.$$
Now
$$\eta_1 = G(\eta_0) = G(0) > 0 = \eta_0,$$
$$\eta_1 > \eta_0 \Rightarrow \eta_2 = G(\eta_1) > G(\eta_0) = \eta_1,$$
$$\cdots$$
$$\eta_t > \eta_{t-1} \Rightarrow \eta_{t+1} = G(\eta_t) > G(\eta_{t-1}) = \eta_t.$$
In general $0 = \eta_0 < \eta_1 < \eta_2 < \dots \le 1$, and so $\eta_t \uparrow \eta = \sup_t\{\eta_t\}$.
Since $\eta_t = G(\eta_{t-1})$ and $G$ is continuous we have
$$\eta = \lim_{t\to\infty} G(\eta_{t-1}) = G\left(\lim_{t\to\infty}\eta_{t-1}\right) = G(\eta).$$
• Put $H(s) = G(s) - s$; note
$$H(0) = P(X = 0) > 0,\quad H(1) = 0,\quad H'(0) = G'(0) - 1 = P(X = 1) - 1 < 0,$$
and $H$ is convex. Also
$$H'(1) = G'(1) - 1 = E[X] - 1.$$
The function $H(s)$ can drop below 0 at most once in $(0,1)$. Graph the two possible cases. In the first $H(s)$ is increasing at $s = 1$, in the second it is decreasing.
— Case 1: $E[X] > 1$. Equivalently, $H'(1) > 0$. There are two roots, say $\eta^* \in (0,1)$ and $s = 1$, to the equation $H(s) = 0$, and $\eta$ is one of them. We have
$$\eta_0 = 0 < \eta^* \Rightarrow \eta_1 = G(\eta_0) < G(\eta^*) = \eta^* \Rightarrow \eta_2 = G(\eta_1) < G(\eta^*) = \eta^*,$$
etc.; hence $\eta_t \le \eta^*$ and so $\eta = \eta^*$.
— Case 2: $E[X] \le 1$. Equivalently, $H'(1) \le 0$, and $s = 1$ is the only solution.
• Summary:
If $E[X] \le 1$ then $P(\text{eventual extinction}) = 1$;
if $E[X] > 1$ then this probability is $< 1$ and is the unique solution $\eta$ in $(0,1)$ to $G(s) = s$.
Let $T$ = time of extinction. Then
$$P(T > t) = P(N_t > 0) = 1 - \eta_t,$$
hence $P(T \le t) = \eta_t$, with $\eta_0 = 0$, $\eta_t = G(\eta_{t-1})$, and
$$P(T = \infty) = 1 - \eta.$$
It can (and will - Asst. 3) be shown that
$$E[T] = \sum_{t=0}^{\infty}(1 - \eta_t) \quad (= \infty \text{ if } E[X] \ge 1).$$
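The iteration $\eta_t = G(\eta_{t-1})$ can be run directly. A sketch with a Poisson offspring distribution of mean 2 (supercritical, so the extinction probability is the root of $G(s) = s$ in $(0,1)$; the offspring law is an illustrative choice, not from the notes):

```python
import math

lam = 2.0  # mean offspring number, E[X] > 1
def G(s):
    # p.g.f. of the Poisson(lam) offspring distribution
    return math.exp(lam * (s - 1.0))

eta = 0.0                 # eta_0 = P(N_0 = 0) = 0
for _ in range(200):      # eta_t = G(eta_{t-1}) increases to the fixed point
    eta = G(eta)
# eta is now (numerically) the smallest solution of G(s) = s in [0, 1)
```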
13. Riemann integration
• Riemann integration. First consider $f : [a,b] \to \mathbb R$, a bounded function. Consider a partition, or ‘mesh’ $P = \{a = x_0 < x_1 < \dots < x_n = b\}$ of $[a,b]$; its norm is $\Delta = \max_i(\Delta_i)$, where $\Delta_i = x_i - x_{i-1}$. For $\xi_i \in [x_{i-1}, x_i]$, an approximation to the area under the graph of $f$ is
$$S(P) = \sum_{i=1}^n f(\xi_i)\,\Delta_i.$$
The integral of $f$ over $[a,b]$ is defined as the limit of these approximations, as they become more and more refined, i.e. as $\Delta \to 0$.
• Formally, we first bound the Riemann sum $S(P)$ above and below as follows. Define
$$m_i = \inf_{[x_{i-1}, x_i]} f(x),\quad M_i = \sup_{[x_{i-1}, x_i]} f(x);$$
$$L(P) = \sum_{i=1}^n m_i\Delta_i,\quad U(P) = \sum_{i=1}^n M_i\Delta_i.$$
Then clearly
$$L(P) \le S(P) \le U(P).$$
If we refine $P$ by including points $x'$ between $x_{i-1}$ and $x_i$, obtaining another partition $P' \supset P$, then in $P'$ the infima increase and the suprema decrease; thus
$$L(P) \le L(P') \quad \text{and} \quad U(P') \le U(P).$$
Also $L(P) \le U(P')$ for any partitions $P$, $P'$ (shown by considering their union, whose lower sum exceeds that of $P$ and whose upper sum is $\le$ that of $P'$). Continuing:
$$L(P) \le L(P') \le L(P'') \le \dots \le \sup_P L(P) \le \inf_P U(P) \le \dots \le U(P'') \le U(P') \le U(P).$$
We say that $f$ is (R-)integrable if $\sup_P L(P) = \inf_P U(P)$, and then their common value is $\int_a^b f(x)\,dx$. Equivalently
$$\inf_P\{U(P) - L(P)\} = 0.$$
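The definition can be illustrated numerically: for an increasing $f$ the lower and upper sums on a uniform partition bracket the integral, and their gap is $\Delta[f(b) - f(a)]$. A sketch with $f(x) = x^2$ on $[0,1]$:

```python
def lower_upper(f, a, b, n):
    # lower and upper Riemann sums of an increasing f on a uniform partition
    dx = (b - a) / n
    xs = [a + i * dx for i in range(n + 1)]
    # for increasing f: inf on [x_{i-1}, x_i] is f(x_{i-1}), sup is f(x_i)
    L = sum(f(xs[i - 1]) * dx for i in range(1, n + 1))
    U = sum(f(xs[i]) * dx for i in range(1, n + 1))
    return L, U

L, U = lower_upper(lambda x: x * x, 0.0, 1.0, 1000)
# L <= 1/3 <= U, and U - L = (f(1) - f(0))/n -> 0 as n grows
```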
• An example of a non-integrable function is $f(x) = I(x \in \mathbb Q)$ ($\mathbb Q$ the rationals) for $x \in [0,1]$. Then for any partition we have $m_i \equiv 0$ and $M_i \equiv 1$, so $\sup_P L(P) = 0 < 1 = \inf_P U(P)$.
• Continuous functions on $[a,b]$ are R-integrable there. (We write $f \in \mathcal R[a,b]$, or just $f \in \mathcal R$.) The general idea of the proof is that, since continuous functions on bounded, closed intervals are (bounded and) uniformly continuous, $M_i - m_i$ can be made uniformly small, say $\le \varepsilon$ whenever $\Delta < \delta$. Then
$$U(P) - L(P) \le \varepsilon\sum_{i=1}^n \Delta_i = \varepsilon(b - a),$$
hence $\inf_P\{U(P) - L(P)\} = 0$.
• Monotonic, bounded functions on $[a,b]$ are R-integrable; e.g. for $\nearrow$ functions,
$$U(P) - L(P) = \sum_{i=1}^n\left[f(x_i) - f(x_{i-1})\right]\Delta_i \le \Delta\left[f(b) - f(a)\right] \to 0.$$
• More generally, we say a function is of bounded variation (“$f$ is BV”) on $[a,b]$ if $\sum_{i=1}^n|\Delta f_i| \le M$ for some $M > 0$ and all partitions $P$. (Here $\Delta f_i = f(x_i) - f(x_{i-1})$.) This clearly holds if $f$ is monotonic and bounded, or (by the MVT) if $f$ has a bounded derivative (since then $|\Delta f_i| \le K\Delta_i$, where $|f'| \le K$ on $[a,b]$). It can be shown that if $f$ is BV then $f \in \mathcal R[a,b]$.
• Standard properties follow from these definitions. If $f, g \in \mathcal R[a,b]$ then so are $f + g$, $\alpha f$, $fg$ and $|f|$; in the first two cases the integral is linear; in the last we have $\left|\int_a^b f(x)\,dx\right| \le \int_a^b|f(x)|\,dx$. If $f \le g$ then $\int_a^b f(x)\,dx \le \int_a^b g(x)\,dx$. (You should show these two inequalities.) Also, $\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx$ for $c \in [a,b]$.
• An important result is the Mean Value Theorem for Riemann integrals: If $f$ is continuous on $[a,b]$ then there is $c \in [a,b]$ for which
$$\int_a^b f(x)\,dx = f(c)(b - a).$$
Proof: Let $m$ and $M$ be the inf and sup of $f$ on $[a,b]$; then
$$m \le \frac{1}{b-a}\int_a^b f(x)\,dx \le M.$$
Since $f$ is continuous it attains $m$ and $M$ and every point between (Intermediate Value Theorem), hence there is $c \in [a,b]$ for which $f(c) = \frac{1}{b-a}\int_a^b f(x)\,dx$. ¤
• Now define
$$F(x) = \int_a^x f(t)\,dt,\quad a \le x \le b,$$
the indefinite integral of $f$. We have the Fundamental Theorem of Calculus: If $f$ is continuous on $[a,b]$ then $F$ is differentiable there, with
$$F'(x) = f(x); \tag{13.1}$$
hence $\int_a^b f(x)\,dx = F(b) - F(a)$ (as below).
Proof of (13.1):
$$F'(x) = \lim_{h\to 0}\frac{1}{h}\left[\int_a^{x+h} f(t)\,dt - \int_a^x f(t)\,dt\right] = \lim_{h\to 0}\frac{1}{h}\int_x^{x+h} f(t)\,dt = \lim_{h\to 0}\frac{1}{h}\cdot h\,f(x_h)$$
for some $x_h \in [x, x+h]$, by the MVT. Since $x_h \to x$ and $f$ is continuous, $f(x_h) \to f(x)$. ¤
— This is the main tool for evaluating integrals - we find a $G$ whose derivative is $f$. Reason: If $G'(x) = f(x)$, then
$$\varphi(x) = F(x) - [G(x) - G(a)]$$
has $\varphi(a) = 0$ and $\varphi'(x) \equiv 0$, so $\varphi(x) \equiv 0$ (for instance by the MVT). Hence
$$\int_a^b f(x)\,dx = F(b) = G(b) - G(a).$$
— This is used to justify the change-of-variables formula for Riemann integration (i.e. integration by substitution).
— Example: the substitution $x = \tan\theta$, with $dx = \sec^2\theta\,d\theta$, gives
$$\int_a^b \frac{dx}{1+x^2} = \int_{\arctan a}^{\arctan b}\cos^2\theta\,\sec^2\theta\,d\theta = \theta\Big|_{\arctan a}^{\arctan b} = \arctan b - \arctan a.$$
• Improper Riemann integrals, in which one or both endpoints are infinite, or at which $f$ is unbounded, are defined by taking appropriate limits:
$$\int_a^{\infty} f(x)\,dx = \lim_{b\to\infty}\int_a^b f(x)\,dx,$$
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^c f(x)\,dx + \int_c^{\infty} f(x)\,dx \quad \text{for any } c,$$
$$\int_a^b f(x)\,dx = \lim_{\varepsilon\downarrow 0}\int_a^{b-\varepsilon} f(x)\,dx \quad \text{if } f(b) = \pm\infty.$$
Example: $f(x) = 1/\{\pi(1+x^2)\}$, $-\infty < x < \infty$, is the ‘Cauchy’ ($= t$ on 1 degree of freedom) p.d.f.; we have
$$\int_a^b f(x)\,dx = (\arctan b - \arctan a)/\pi,$$
so
$$\int_{-\infty}^{\infty} f(x)\,dx = \left(\int_{-\infty}^c + \int_c^{\infty}\right) f(x)\,dx = \lim_{a\to-\infty}\frac{\arctan c - \arctan a}{\pi} + \lim_{b\to\infty}\frac{\arctan b - \arctan c}{\pi}$$
$$= \frac{-\arctan(-\infty) + \arctan\infty}{\pi} = \frac{\pi/2 + \pi/2}{\pi} = 1.$$
However, none of the moments
$$E\left[X^k\right] = \int_{-\infty}^{\infty} x^k f(x)\,dx$$
exist for $k \ge 1$. This is because the existence of $E[X^k]$ requires the existence of
$$\int_0^{\infty} x^k f(x)\,dx = \int_0^{\infty} |x|^k f(x)\,dx$$
and of
$$\int_{-\infty}^0 x^k f(x)\,dx = (-1)^k\int_{-\infty}^0 |x|^k f(x)\,dx,$$
hence the existence of $E[|X|^k]$. You should show that $E[|X|^k]$ does not exist if $X$ is Cauchy; a consequence is that even if the integrand of $\int_{-\infty}^{\infty} x^k f(x)\,dx$ is an odd function, the integral need not $= 0$.
14. Riemann and Riemann-Stieltjes integration
• An application of the Fundamental Theorem of Calculus is the formula for integration by parts. If $f, g$ are differentiable, and $f', g'$ are integrable, then
$$\int_a^b [f(x)g(x)]'\,dx = f(b)g(b) - f(a)g(a), \text{ and also}$$
$$\int_a^b [f(x)g(x)]'\,dx = \int_a^b\left[f'(x)g(x) + f(x)g'(x)\right]dx;$$
hence
$$\int_a^b f'(x)g(x)\,dx = f(b)g(b) - f(a)g(a) - \int_a^b f(x)g'(x)\,dx.$$
A mnemonic is “$\int u\,dv = uv| - \int v\,du$”.
• Application. Define $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1}e^{-x}\,dx$ ($\alpha > 0$), the Gamma integral. Establishing the existence of $\lim_{b\to\infty}\int_0^b$, or of $\lim_{a\to 0,\,b\to\infty}\int_a^b$ if $\alpha < 1$, is left to you. We have
$$\Gamma(\alpha) = \int_0^{\infty}\left(\frac{x^{\alpha}}{\alpha}\right)' e^{-x}\,dx = \left.\frac{x^{\alpha}}{\alpha}e^{-x}\right|_0^{\infty} - \int_0^{\infty}\frac{x^{\alpha}}{\alpha}\left(-e^{-x}\right)dx = \frac{1}{\alpha}\int_0^{\infty} x^{\alpha}e^{-x}\,dx = \frac{1}{\alpha}\Gamma(\alpha+1);$$
hence
$$\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha);$$
in particular for $n$ an integer,
$$\Gamma(n+1) = n\,\Gamma(n) = \cdots = n(n-1)\cdots 1\cdot\Gamma(1) = n!$$
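The recursion $\Gamma(\alpha+1) = \alpha\Gamma(\alpha)$ and the factorial identity are easy to check against Python's built-in gamma function:

```python
import math

# Gamma(a + 1) = a * Gamma(a), also at non-integer arguments ...
for a in (0.5, 1.7, 3.2):
    assert abs(math.gamma(a + 1) - a * math.gamma(a)) < 1e-12 * math.gamma(a + 1)

# ... and Gamma(n + 1) = n! for integers
for n in range(1, 8):
    assert abs(math.gamma(n + 1) - math.factorial(n)) < 1e-9
```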
• A generalization of the Riemann integral that is particularly useful in statistics is the Riemann-Stieltjes (R-S) integral. Let $f$ be bounded on $[a,b]$, and let $g(x)$ be $\nearrow$ there. In the definition of the R-integral, replace $\Delta_i$ everywhere by $\Delta g_i = g(x_i) - g(x_{i-1})\ (\ge 0)$. The analogue of $S(P)$ is $S(P; g) = \sum_{i=1}^n f(\xi_i)\,\Delta g_i$; if this has a limit as $\Delta \to 0$ - equivalently (as at Theorem 6.2.1) if $\sup_P L(P) = \inf_P U(P)$ - then we call it the R-S integral $\int_a^b f(x)\,dg(x)$. It is particularly useful in cases where $g$ is not continuous.
• Special cases:
1. $g(x) = x$; $\int_a^b f(x)\,dg(x) = \int_a^b f(x)\,dx$, the R-integral.
2. $g$ differentiable, with $g' = h$; $\int_a^b f(x)\,dg(x) = \int_a^b f(x)h(x)\,dx$, the R-integral.
3. $g(x) = \begin{cases} c_1, & a \le x < x^*,\\ c_2, & x^* \le x \le b.\end{cases}$ Then
$$\Delta g_i = \begin{cases} c_2 - c_1, & x^* \in (x_{i-1}, x_i],\\ 0, & \text{otherwise.}\end{cases}$$
It follows that $\int_a^b f(x)\,dg(x) = f(x^*)(c_2 - c_1)$.
• We adopt the convention that unless stated otherwise, by $\int_a^b$ we mean $\int_{(a,b]}$, i.e. the right-hand endpoint is included, the left is not. Note that this is not an issue for R-integrals (why not?). Combining 2) and 3):
Suppose $a = x_0 < x_1 < \dots < x_n = b$ and
(a) $g$ is differentiable on $(x_{i-1}, x_i)$ with $g' = h \ge 0$,
(b) $g$ has a jump discontinuity (but is right continuous) at each $x_i$, with $g(x_i) - g(x_i^-) = c_i$.
Then
$$\int_a^b f(x)\,dg(x) = \sum_{i=1}^n \int_{x_{i-1}}^{x_i} f(x)\,dg(x) = \sum_{i=1}^n\left\{\int_{x_{i-1}}^{x_i} f(x)h(x)\,dx + f(x_i)\,c_i\right\}.$$
• Improper R-S integrals are defined as for R-integrals. In particular, let $X$ be a r.v. with d.f. $F(x)$, $-\infty < x < \infty$ (note $F \nearrow$). Let $h(X)$ be a function of $X$. We define the expected value of $h(X)$ to be
$$E[h(X)] = \int_{-\infty}^{\infty} h(x)\,dF(x).$$
If $X$ has a density this agrees with the earlier definition. Suppose instead that $X$ is discrete, with
$$P(X = x_k) = p_k,\quad k = 0, 1, 2, \dots$$
Then $F(x) = P(X \le x)$ has a jump of height $\Delta F_k = p_k$ at $x_k$ and has $F' = 0$ elsewhere, so
$$E[h(X)] = \sum_{k=0}^{\infty} h(x_k)\,p_k.$$
• An example illustrating the power of this integral, in which neither the R-integral nor a sum alone will suffice, is if $X$ represents the lifetime of a randomly chosen light bulb. Suppose that, with probability $p$, the bulb blows when first installed. Otherwise, it has an exponentially distributed lifetime, with $P(X > x) = e^{-\lambda x}$. Thus its d.f. is
$$F(x) = P(X \le x) = \begin{cases} 0, & x < 0,\\ p, & x = 0,\\ p + (1-p)\left(1 - e^{-\lambda x}\right), & x > 0,\end{cases}$$
with
$$E[h(X)] = \int_{-\infty}^{\infty} h(x)\,dF(x) = h(0)\cdot p + \int_0^{\infty} h(x)\left[\frac{d}{dx}\left\{p + (1-p)\left(1 - e^{-\lambda x}\right)\right\}\right]dx = h(0)\cdot p + (1-p)\int_0^{\infty} h(x)\,\lambda e^{-\lambda x}\,dx.$$
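With $h(x) = x$ the two pieces give $E[X] = 0\cdot p + (1-p)/\lambda$, which a seeded Monte Carlo draw from the mixture reproduces (parameter values illustrative):

```python
import random

p, lam = 0.1, 0.5
exact = 0.0 * p + (1 - p) * (1.0 / lam)   # point mass at 0 plus exponential part

random.seed(1)
n = 200_000
total = 0.0
for _ in range(n):
    # with prob p the bulb blows at installation (X = 0), else X ~ Exp(lam)
    total += 0.0 if random.random() < p else random.expovariate(lam)
approx = total / n
```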
• Cauchy-Schwarz inequality:
$$\left(\int f(x)g(x)\,dF(x)\right)^2 \le \int f^2(x)\,dF(x)\cdot\int g^2(x)\,dF(x),$$
provided all three integrals exist. The range is the same for all three, but need not be bounded. (Does existence of the latter two integrals imply existence of the first?)
Proof: Essentially identical to the vector version:
$$0 \le \int(\lambda f + g)^2\,dF = \lambda^2\int f^2\,dF + 2\lambda\int fg\,dF + \int g^2\,dF,$$
hence “$b^2 - 4ac$” $\le 0$, i.e.
$$4\left(\int fg\,dF\right)^2 - 4\int f^2\,dF\cdot\int g^2\,dF \le 0. \ \textrm{¤}$$
— Example: $E\left[X^3\right] \le \sqrt{E\left[X^2\right]E\left[X^4\right]}$.
• Integration by parts: if the R-S integrals $\int_a^b f(x)\,dg(x)$ and $\int_a^b g(x)\,df(x)$ both exist, then
$$\int_a^b f(x)\,dg(x) + \int_a^b g(x)\,df(x) = f(b)g(b) - f(a)g(a).$$
This and other identities are also valid for decreasing integrators - e.g. replace $\int f\,dg$ by $-\int f\,d(-g)$ in the appropriate places.
• An application is Euler’s summation formula: If $f$ has a continuous derivative $f'$ on $(a - \varepsilon, n]$ for some $\varepsilon \in (0,1)$, then
$$\sum_{k=a}^{n} f(k) = \int_a^n f(x)\,dx + f(a) + \int_a^n f'(x)\{x\}\,dx,$$
where $\{x\} = x - [x]$ is the fractional part of $x$.
• Example: with $f(x) = 1/x$ and $a = 1$, we obtain
$$\sum_{k=1}^n \frac{1}{k} - \log n = 1 - \int_1^n \frac{\{x\}}{x^2}\,dx \in \left(\frac{1}{n},\ 1\right).$$
The middle term above is decreasing in $n$, and bounded below by 0, thus it has a limit (‘Euler’s constant’) $\gamma \in [0, 1)$ as $n \to \infty$:
$$\sum_{k=1}^n \frac{1}{k} - \log n \to \gamma = .577215\ldots$$
Note that both $\sum_{k=1}^n \frac{1}{k}$ and $\log n$ diverge.
• Similarly, $0 \le 2\sqrt n - 1 - \sum_{k=1}^n \frac{1}{\sqrt k} \le 1 - \frac{1}{\sqrt n}$. The convergence is very slow; limit $\approx .4604$ with $n = 10^9$, $.4603$ with $n = 10^8$.
Proof of Euler’s formula: Write the sum as a R-S integral, and split into regions on which $\{x\}$ is monotone:
$$\sum_{k=a}^n f(k) = \int_{a-\varepsilon}^n f(x)\,d[x] = \int_{a-\varepsilon}^n f(x)\,d(x - \{x\}) = \int_{a-\varepsilon}^n f(x)\,dx - \int_{a-\varepsilon}^n f(x)\,d\{x\}$$
$$= \int_{a-\varepsilon}^n f(x)\,dx - \int_{a-\varepsilon}^a f(x)\,d\{x\} - \sum_{k=a+1}^n \int_{k-1}^k f(x)\,d\{x\}.$$
Integrating by parts, and using $\{a - \varepsilon\} = 1 - \varepsilon$, gives
$$\sum_{k=a}^n f(k) = \int_{a-\varepsilon}^n f(x)\,dx - \left[f(a)\{a\} - f(a-\varepsilon)\{a-\varepsilon\} - \int_{a-\varepsilon}^a f'(x)\{x\}\,dx\right]$$
$$\qquad - \sum_{k=a+1}^n\left[f(k)\{k\} - f(k-1)\{k-1\} - \int_{k-1}^k f'(x)\{x\}\,dx\right]$$
$$= \int_{a-\varepsilon}^n f(x)\,dx + (1-\varepsilon)f(a-\varepsilon) + \int_{a-\varepsilon}^a f'(x)\{x\}\,dx + \sum_{k=a+1}^n \int_{k-1}^k f'(x)\{x\}\,dx$$
$$= \int_{a-\varepsilon}^n f(x)\,dx + (1-\varepsilon)f(a-\varepsilon) + \int_{a-\varepsilon}^n f'(x)\{x\}\,dx$$
$$\xrightarrow{\ \varepsilon\downarrow 0\ } \int_a^n f(x)\,dx + f(a) + \int_a^n f'(x)\{x\}\,dx,$$
as required. ¤
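Euler's formula with $f(x) = 1/x$ can be exercised numerically: the remainder integral $\int_1^n \{x\}/x^2\,dx$ (midpoint rule on each unit interval, where $\{x\}$ is smooth) recovers $\sum_{k\le n} 1/k - \log n$, which decreases to $\gamma$:

```python
import math

def gamma_seq(n):
    # sum_{k<=n} 1/k - log n, which decreases to Euler's constant
    return sum(1.0 / k for k in range(1, n + 1)) - math.log(n)

def remainder_integral(n, m=200):
    # integral of {x}/x^2 over [1, n], midpoint rule on each unit interval [k, k+1]
    total = 0.0
    for k in range(1, n):
        for j in range(m):
            x = k + (j + 0.5) / m
            total += ((x - k) / (x * x)) / m   # {x} = x - k on this interval
    return total

# the identity: sum_{k<=n} 1/k - log n = 1 - integral of {x}/x^2 over [1, n]
n = 50
lhs, rhs = gamma_seq(n), 1.0 - remainder_integral(n)
```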
15. Moment generating functions; Chebyshev’s Inequality; Asymptotic statistical theory
• Moment generating functions. $M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx}\,dF(x)$ (the R-S integral). If this exists in an open neighbourhood of $t = 0$ (note $M_X(0) = 1$ always exists) then it is the m.g.f. It is also written $M(t)$. Some useful properties:
1. $M^{(k)}(0) = E[X^k]$ (so if the m.g.f. exists, so do all moments). In other words we can differentiate under the integral sign. Then if we can find an expansion of the form $M(t) = \sum_k a_k t^k/k!$, by the uniqueness of power series this must be the MacLaurin series, and so we must have $a_k = M^{(k)}(0) = E[X^k]$.
2. If $M_X(t) = M_Y(t)$ for all $|t| < \delta$ (for some $\delta > 0$) then $X \sim Y$, i.e. the distribution of a r.v. is uniquely determined by the m.g.f.
3. If $\{X_n\}$ is a sequence of r.v.s with m.g.f.s $M_{X_n}(t)$, and if $M_{X_n}(t) \to M_X(t)$ for $t$ in a neighbourhood of 0, where $M_X(t)$ is the m.g.f. of a r.v. $X$, then $X_n \xrightarrow{\mathcal L} X$.
— You should show: if $X_n \sim \text{Binomial}(n, p_n)$ with $np_n \to \lambda$ then $X_n \xrightarrow{\mathcal L} \mathcal P(\lambda)$ (Poisson, mean $\lambda$).
4. Sums of independent r.v.s. If $X_1, \dots, X_n$ are independent r.v.s with m.g.f.s $M_{X_1}(t), \dots, M_{X_n}(t)$, then $S = \sum_{i=1}^n X_i$ has
$$M_S(t) = E\left[e^{t\sum_{i=1}^n X_i}\right] = E\left[\prod_{i=1}^n e^{tX_i}\right] = \prod_{i=1}^n E\left[e^{tX_i}\right] = \prod_{i=1}^n M_{X_i}(t).$$
In particular, if all $X_i$ are distributed in the same way, with m.g.f. $M(t)$, then the m.g.f. of their sum is $M_S(t) = M^n(t)$ and the m.g.f. of their average is
$$M_{\bar X}(t) = E\left[e^{tS/n}\right] = M_S(t/n) = M^n(t/n).$$
• All of this also holds for the characteristic function (c.f.) $E[e^{itX}] = E[\cos tX] + iE[\sin tX]$, which always exists.
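Property 3 can be illustrated with the binomial-to-Poisson limit mentioned above: with $p_n = \lambda/n$, the binomial m.g.f. $(1 - p_n + p_n e^t)^n$ approaches $e^{\lambda(e^t - 1)}$ (the values of $\lambda$ and $t$ below are illustrative):

```python
import math

lam, t = 3.0, 0.4
poisson_mgf = math.exp(lam * (math.exp(t) - 1.0))

for n in (10, 100, 10_000):
    p = lam / n
    binom_mgf = (1 - p + p * math.exp(t)) ** n  # m.g.f. of Binomial(n, lam/n)
    # binom_mgf -> poisson_mgf as n grows with n*p = lam held fixed
```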
• Suppose $X \sim N(0,1)$, with p.d.f. $\phi(x)$. Define $Y = X^2$, a $\chi^2_1$ r.v. Its m.g.f. is
$$M_Y(t) = \int_{-\infty}^{\infty} e^{tx^2}\phi(x)\,dx = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-x^2(1-2t)/2}\,dx \quad (|t| < 1/2)$$
$$= \frac{1}{\sqrt{1-2t}}\int_{-\infty}^{\infty}\phi(u)\,du \quad (u = x\sqrt{1-2t})$$
$$= (1-2t)^{-1/2}.$$
Now suppose $Y$ is the sum of squares of $n$ independent $N(0,1)$’s, i.e. $Y$ is a $\chi^2_n$ r.v. Its m.g.f. is the $n$th power of the above (why?), thus $M_Y(t) = (1-2t)^{-n/2}$ ($|t| < 1/2$). It follows that the p.d.f. is
$$f_Y(y) = \frac{\left(\frac{y}{2}\right)^{\frac{n}{2}-1}e^{-\frac{y}{2}}}{2\,\Gamma\!\left(\frac{n}{2}\right)},\quad 0 < y < \infty.$$
Proof: With $u = y(1-2t)/2$ for $|t| < 1/2$,
$$\int_0^{\infty} e^{ty}f_Y(y)\,dy = (1-2t)^{-\frac{n}{2}}\int_0^{\infty}\frac{u^{\frac{n}{2}-1}e^{-u}}{\Gamma\!\left(\frac{n}{2}\right)}\,du = (1-2t)^{-\frac{n}{2}}. \ \textrm{¤}$$
• Chebyshev’s Inequality: If a r.v. $X$ has mean $\mu$ and variance $\sigma^2$ then
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
Proof: An equivalent formulation is
$$P(|Z| \ge k) \le \frac{1}{k^2},$$
where $Z = (X - \mu)/\sigma$ has mean 0 and variance 1. Note that the indicator of an event $A$, given by
$$I(A) = \begin{cases}1, & \text{if } A \text{ occurs},\\ 0, & \text{otherwise},\end{cases}$$
is $\sim \text{Binomial}(1, P(A))$, with $E[I(A)] = P(A)$. Thus
$$1 = \text{Var}[Z] = E\left[Z^2\right] \ge E\left[Z^2 I(|Z| \ge k)\right] \ge k^2 E[I(|Z| \ge k)] = k^2 P(|Z| \ge k). \ \textrm{¤}$$
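Chebyshev's bound holds for every distribution with a variance, and is usually far from tight; comparing with the exact standard normal tail (via the standard library's NormalDist):

```python
from statistics import NormalDist

Z = NormalDist(0.0, 1.0)
for k in (1.5, 2.0, 3.0):
    exact = 2.0 * (1.0 - Z.cdf(k))  # P(|Z| >= k) for Z ~ N(0, 1)
    bound = 1.0 / k ** 2            # Chebyshev's bound
    assert exact <= bound           # the bound holds, with room to spare
```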
• Chebyshev’s Inequality furnishes an easy proof of the Weak Law of Large Numbers: If $\bar X_n$ is the average of $n$ independent r.v.s, each with mean $\mu$ and variance $\sigma^2$, then $\bar X_n \xrightarrow{P} \mu$ as $n \to \infty$.
Proof: Note that $\bar X_n$ has mean $\mu$ and variance $\sigma^2/n$. For $\varepsilon > 0$, put $k = \varepsilon\sqrt n/\sigma$; then
$$P\left(\left|\bar X_n - \mu\right| \ge \varepsilon\right) = P\left(\left|\bar X_n - \mu\right| \ge k\left(\frac{\sigma}{\sqrt n}\right)\right) \le \frac{1}{k^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0 \text{ as } n \to \infty. \ \textrm{¤}$$
• Central Limit Theorem. This is probably the most significant theorem in mathematical statistics. It gives the approximate normality of averages of r.v.s and, when combined with the MVT (or Taylor’s Theorem), the WLLN and Slutsky’s Theorem (see below), forms the basis for approximating the distributions of many other statistics of interest.
• Theorem: Let $X_1, X_2, \dots$ be independent r.v.s, with common d.f. $F(x) = P(X \le x)$, mean $\mu$, variance $\sigma^2$ ($0 < \sigma^2 < \infty$). Put $Z_n = \sqrt n\left(\bar X_n - \mu\right)$; then $Z_n \xrightarrow{\mathcal L} N(0, \sigma^2)$.
— To apply, since the statements “$Z_n$ is approximately $N(0, \sigma^2)$” and “$\bar X_n$ is approximately $N(\mu, \sigma^2/n)$” are equivalent, we treat $\bar X_n$ as if it were distributed approximately as $N(\mu, \sigma^2/n)$. Then, e.g. if we can also estimate $\sigma^2$, we have the basis for making inferences about $\mu$.
• Proof of CLT: We make the additional assumption that the $X_i$ have a m.g.f. Define
$$m(t) = E\left[e^{t(X - \mu)}\right]\ \left(= e^{-t\mu}E\left[e^{tX}\right]\right).$$
We shall use the fact, being established in assignment 3, that the m.g.f. of $Y \sim N(\mu, \sigma^2)$ is
$$E\left[e^{tY}\right] = e^{t\mu + t^2\sigma^2/2}.$$
Notation: “$f(t) = o(g(t))$ as $t \to t_0$” means “$f(t)/g(t) \to 0$ as $t \to t_0$”.
Let $t$ be fixed but arbitrary. Expand $m(\cdot)$ as
$$m(t) = m(0) + m'(0)t + m''(0)\frac{t^2}{2} + m'''(t^*)\frac{t^3}{6}\quad (0 \le |t^*| \le |t|)$$
$$= 1 + E[X - \mu]\,t + E\left[(X - \mu)^2\right]\frac{t^2}{2} + m'''(t^*)\frac{t^3}{6}$$
$$= 1 + \frac{\sigma^2 t^2}{2} + o(t^2) \text{ as } t \to 0.$$
Why $o(t^2)$? - because $m'''(t^*)$ has a finite limit as $t$, hence $t^*$, tends to 0.
We are to show that the m.g.f. of
$$Z_n = \frac{1}{\sqrt n}\sum_{i=1}^n (X_i - \mu)$$
tends to that of a $N(0, \sigma^2)$ r.v., i.e. that
$$E\left[e^{\frac{t}{\sqrt n}\sum_{i=1}^n (X_i - \mu)}\right] = \prod_{i=1}^n E\left[e^{\frac{t}{\sqrt n}(X_i - \mu)}\right] = m^n(t/\sqrt n)$$
tends to $e^{t^2\sigma^2/2}$. Equivalently, we show that
$$n\log m(t/\sqrt n) \to \frac{t^2\sigma^2}{2} \quad \text{as } n \to \infty. \tag{15.1}$$
For this, write
$$n\log m(t/\sqrt n) = n\log\Bigl(1 + \underbrace{\tfrac{t^2\sigma^2}{2n} + o\bigl(\tfrac{t^2}{n}\bigr)}_{x_n}\Bigr) = n\log(1 + x_n),$$
where (why?) $x_n \to 0$ and $n x_n \to t^2\sigma^2/2$. This gives (15.1). ¤
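The limit (15.1) can be watched directly for a concrete distribution. A sketch with Bernoulli($p$) summands (an arbitrary choice with an explicit $m(t)$):

```python
import math

p = 0.3                      # Bernoulli(p): mu = p, sigma^2 = p(1 - p)
sigma2 = p * (1 - p)

def m(t):
    # m(t) = E[e^{t(X - mu)}] for X ~ Bernoulli(p)
    return p * math.exp(t * (1 - p)) + (1 - p) * math.exp(-t * p)

t = 1.2
target = t * t * sigma2 / 2.0    # t^2 sigma^2 / 2
for n in (10, 1_000, 100_000):
    val = n * math.log(m(t / math.sqrt(n)))
    # val -> target as n -> infinity
```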
• Slutsky’s Theorem: If $X_n \xrightarrow{\mathcal L} X$ and $Y_n \xrightarrow{P} c$ (constant) then:
1. $X_n \pm Y_n \xrightarrow{\mathcal L} X \pm c$,
2. $X_n \cdot Y_n \xrightarrow{\mathcal L} X \cdot c$,
3. $X_n / Y_n \xrightarrow{\mathcal L} X/c$ if $c \ne 0$.
Note that if $X = c_0$ (constant) then all occurrences of $\xrightarrow{\mathcal L}$ can be replaced by $\xrightarrow{P}$.
• Application: We often make inferences about a population mean using the $t$-statistic
$$T_n = \frac{\sqrt n\left(\bar X_n - \mu\right)}{s_n},$$
where $\bar X_n$ is as in the CLT and $s_n$ is the sample standard deviation. If the data are normally distributed then $T_n$ follows a “Student’s t” distribution on $n - 1$ degrees of freedom; it is well known that this distribution is closely approximated by
the $N(0,1)$ when $n$ is reasonably large. This latter fact holds even for non-normal parent distributions: Note
$$s_n^2 = \frac{\sum_{i=1}^n\left(X_i - \bar X_n\right)^2}{n-1} = \frac{n}{n-1}\left(\frac{\sum_{i=1}^n X_i^2}{n} - \bar X_n^2\right),$$
where $\sum_{i=1}^n X_i^2/n \xrightarrow{P} E[X^2]$ by the WLLN, $\bar X_n \xrightarrow{P} \mu$ by the WLLN; it follows that $\left(\sum_{i=1}^n X_i^2/n - \bar X_n^2\right) \xrightarrow{P} E[X^2] - \mu^2 = \sigma^2$ (a special case of (1) of Slutsky’s Theorem), hence so does $s_n^2$ (Slutsky (2)); thus $s_n \xrightarrow{P} \sigma$ (since $s_n$ is a continuous function of $s_n^2$) and so $s_n/\sigma \xrightarrow{P} 1$ (Slutsky (3)). Now again by Slutsky, and the CLT,
$$T_n = \frac{\dfrac{\sqrt n\left(\bar X_n - \mu\right)}{\sigma}}{\dfrac{s_n}{\sigma}} \xrightarrow{\mathcal L} \frac{N(0,1)}{1} = N(0,1).$$
Part IV<br />
MULTIDIMENSIONAL<br />
CALCULUS AND<br />
OPTIMIZATION
16. Multidimensional differentiation; Taylor’s and Inverse Function Theorems
• $\mathbf f : S \subset \mathbb R^n \to \mathbb R^m$ can be represented as
$$\begin{pmatrix} f_1(\mathbf x)\\ f_2(\mathbf x)\\ \vdots\\ f_m(\mathbf x) \end{pmatrix} \quad \text{for } f_i(\mathbf x) : \mathbb R^n \to \mathbb R^1.$$
• Some results from any text on multivariable calculus/analysis:
— Every bounded sequence in $\mathbb R^n$ contains a convergent subsequence.
— If $f : S \subset \mathbb R^n \to \mathbb R$ is continuous on a closed, bounded set $S$ then $f$ attains its inf and sup there; i.e. there are points $\mathbf p, \mathbf q \in S$ with $f(\mathbf p) = \sup_{\mathbf x \in S} f(\mathbf x)$ and $f(\mathbf q) = \inf_{\mathbf x \in S} f(\mathbf x)$.
— If $\mathbf f : S \subset \mathbb R^n \to \mathbb R^m$ is continuous on a closed, bounded set $S$ then $\mathbf f$ is uniformly continuous on $S$.
• Derivatives. Put $\mathbf e_j = (0, \dots, 0, 1, 0, \dots, 0)'$, with the 1 in position $j$. Let $f : S \subset \mathbb R^n \to \mathbb R^1$. If
$$\lim_{h\to 0}\frac{f(\mathbf a + h\mathbf e_j) - f(\mathbf a)}{h}$$
exists, we say $f$ has a partial derivative with respect to $x_j$ at $\mathbf a$; this limit is denoted by $\frac{\partial f(\mathbf a)}{\partial x_j}$. It is computed by treating all variables except the $j$th as constant; i.e. it is the ordinary derivative of $f(a_1, \dots, a_{j-1}, x_j, a_{j+1}, \dots, a_n)$ with respect to $x_j$.
• The Jacobian matrix is the $m \times n$ matrix $\mathbf J_{\mathbf f}(\mathbf x) = \left(\frac{\partial\mathbf f}{\partial\mathbf x}\right)$ with $(i,j)$ element $\frac{\partial f_i}{\partial x_j}$, evaluated at $\mathbf x = (x_1, \dots, x_n)$. This arrangement of partial derivatives ensures that the chain rule is easily represented: if $\mathbf f : \mathbb R^n \to \mathbb R^m$ and $\mathbf g : \mathbb R^m \to \mathbb R^p$ then $\mathbf g \circ \mathbf f : \mathbb R^n \to \mathbb R^p$ has
$$\mathbf J_{\mathbf g\circ\mathbf f}(\mathbf x)_{p\times n} = \mathbf J_{\mathbf g}(\mathbf f(\mathbf x))_{p\times m}\,\mathbf J_{\mathbf f}(\mathbf x)_{m\times n}. \tag{16.1}$$
This is a consequence of the formula for the ‘total derivative’: if $u \in \mathbb R$ and $z = g(y_1(u), \dots, y_m(u))$ has continuous partial derivatives, then
$$\frac{dz}{du} = \sum_{k=1}^m \frac{\partial z}{\partial y_k}\frac{dy_k}{du}.$$
Apply this to each $g_i$, with $y_k = f_k(\mathbf x)$ and $u = x_j$:
$$\left[\mathbf J_{\mathbf g\circ\mathbf f}(\mathbf x)\right]_{ij} = \frac{\partial(\mathbf g\circ\mathbf f)_i}{\partial x_j} = \frac{\partial g_i(\mathbf f(\mathbf x))}{\partial x_j} = \sum_{k=1}^m \frac{\partial g_i(\mathbf f(\mathbf x))}{\partial f_k(\mathbf x)}\frac{\partial f_k(\mathbf x)}{\partial x_j} = \sum_{k=1}^m \left[\mathbf J_{\mathbf g}(\mathbf f(\mathbf x))\right]_{ik}\left[\mathbf J_{\mathbf f}(\mathbf x)\right]_{kj} = \left[\mathbf J_{\mathbf g}(\mathbf f(\mathbf x))\cdot\mathbf J_{\mathbf f}(\mathbf x)\right]_{ij}.$$
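The matrix form (16.1) of the chain rule can be verified with finite-difference Jacobians; the maps $\mathbf f$ and $\mathbf g$ below are arbitrary smooth examples chosen for illustration:

```python
def jac(F, x, m, h=1e-6):
    # central finite-difference Jacobian of F: R^n -> R^m at x
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        Fp, Fm = F(xp), F(xm)
        for i in range(m):
            J[i][j] = (Fp[i] - Fm[i]) / (2.0 * h)
    return J

f = lambda x: [x[0] * x[1], x[0] + x[1] ** 2]   # f: R^2 -> R^2
g = lambda y: [y[0] ** 2 + 3.0 * y[1]]          # g: R^2 -> R^1

x = [0.5, -1.2]
lhs = jac(lambda z: g(f(z)), x, 1)              # J_{g∘f}(x), computed directly
Jg, Jf = jac(g, f(x), 1), jac(f, x, 2)
rhs = [[sum(Jg[i][k] * Jf[k][j] for k in range(2))   # J_g(f(x)) J_f(x)
        for j in range(2)] for i in range(1)]
```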
— If $m = 1$ then the Jacobian matrix of $f : \mathbb R^n \to \mathbb R$ is a row vector whose transpose is the gradient:
$$\nabla f(\mathbf x) = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)'.$$
— The Jacobian of $\nabla f : \mathbb R^n \to \mathbb R^n$ is called the Hessian of $f : \mathbb R^n \to \mathbb R$. This $n \times n$ matrix $\mathbf H_f(\mathbf x)$ has $(i,j)$ element
$$\left(\frac{\partial(\nabla f)_i}{\partial x_j}\right) = \frac{\partial}{\partial x_j}\frac{\partial f}{\partial x_i}.$$
If one of $\frac{\partial^2 f}{\partial x_j\partial x_i}$, $\frac{\partial^2 f}{\partial x_i\partial x_j}$ exists and is continuous, then the other exists and the two are equal; under these conditions the Hessian matrix is symmetric. We write the $(i,j)$ element as $\frac{\partial^2 f}{\partial x_i\partial x_j}$.
— If $\mathbf f : S \subset \mathbb R^n \to \mathbb R^m$ then the directional derivative at $\mathbf a$ in the direction $\mathbf v$ (with $\|\mathbf v\| = 1$) is
$$\lim_{t\to 0}\frac{\mathbf f(\mathbf a + t\mathbf v) - \mathbf f(\mathbf a)}{t} = \mathbf J_{\mathbf f}(\mathbf a)\mathbf v,$$
provided the Jacobian exists.
Proof: Put $\mathbf g(t) = \mathbf f(\mathbf a + t\mathbf v) = \mathbf f\circ\mathbf k(t)$ where $\mathbf k(t) = \mathbf a + t\mathbf v$. We seek $\lim_{t\to 0}\frac{\mathbf g(t) - \mathbf g(0)}{t}$ - using (16.1) this is
$$\frac{d\mathbf g}{dt}\bigg|_{t=0} = \mathbf J_{\mathbf f}(\mathbf k(t))\,\mathbf J_{\mathbf k}(t)\Big|_{t=0} = \mathbf J_{\mathbf f}(\mathbf a)\mathbf v.$$
• Taylor’s Theorem. I’ll give a version suitable for the intended applications. A major difficulty in writing down a multivariate Taylor’s Theorem is that appropriate notation, for representing derivatives higher than second order, is very cumbersome. It is rare however to require expansions beyond “Hessian + remainder”. Thus, suppose $f : S \subset \mathbb R^n \to \mathbb R$, where $S$ is convex:
$$\mathbf x, \mathbf y \in S \Rightarrow (1-\lambda)\mathbf x + \lambda\mathbf y \in S \text{ for } 0 \le \lambda \le 1.$$
1. If the partial derivatives of $f$ are continuous on $S$ then
$$f(\mathbf x) = f(\mathbf u) + \nabla f'(\boldsymbol\xi)(\mathbf x - \mathbf u)$$
for some $\boldsymbol\xi = (1-\lambda)\mathbf u + \lambda\mathbf x = \boldsymbol\xi_\lambda$. (How? Write $g(\lambda) = f(\boldsymbol\xi_\lambda)$, expand $g(1)$ around $\lambda = 0$.) We also have
$$f(\mathbf x) = f(\mathbf u) + \nabla f'(\mathbf u)(\mathbf x - \mathbf u) + o(\|\mathbf x - \mathbf u\|) \quad \text{as } \|\mathbf x - \mathbf u\| \to 0.$$
(How? Write $\mathbf x = \mathbf u + t\mathbf v$ for $\mathbf v = (\mathbf x - \mathbf u)/\|\mathbf x - \mathbf u\|$, $t = \|\mathbf x - \mathbf u\|$; apply l’Hospital’s Rule.)
2. If the second order partials are continuous on $S$ then with $\boldsymbol\xi$ as above,
$$f(\mathbf x) = f(\mathbf u) + \nabla f'(\mathbf u)(\mathbf x - \mathbf u) + \frac12(\mathbf x - \mathbf u)'\mathbf H_f(\boldsymbol\xi)(\mathbf x - \mathbf u).$$
We also have
$$f(\mathbf x) = f(\mathbf u) + \nabla f'(\mathbf u)(\mathbf x - \mathbf u) + \frac12(\mathbf x - \mathbf u)'\mathbf H_f(\mathbf u)(\mathbf x - \mathbf u) + o(\|\mathbf x - \mathbf u\|^2).$$
— Example: $\mathbf x = (x, y)'$, $f(\mathbf x) = e^x\cos y$, $\mathbf u = \mathbf 0$. Then
$$\nabla f(\mathbf 0) = \begin{pmatrix} e^x\cos y\\ -e^x\sin y \end{pmatrix}\bigg|_{\mathbf x=\mathbf 0} = \begin{pmatrix}1\\0\end{pmatrix},$$
$$\mathbf H_f(\mathbf 0) = \begin{pmatrix} e^x\cos y & -e^x\sin y\\ -e^x\sin y & -e^x\cos y \end{pmatrix}\bigg|_{\mathbf x=\mathbf 0} = \begin{pmatrix}1 & 0\\ 0 & -1\end{pmatrix},$$
and
$$f(\mathbf x) = f(\mathbf 0) + \nabla f'(\mathbf 0)\mathbf x + \frac12\mathbf x'\mathbf H_f(\mathbf 0)\mathbf x + o(\|\mathbf x\|^2) = 1 + x + \frac12\left[x^2 - y^2\right] + o(x^2 + y^2).$$
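The quadratic approximation above can be checked numerically; the error should shrink like the cube of $\|\mathbf x\|$ (i.e. it is $o(\|\mathbf x\|^2)$):

```python
import math

def f(x, y):
    return math.exp(x) * math.cos(y)

def quad(x, y):
    # 1 + x + (x^2 - y^2)/2, from grad f(0) = (1, 0)' and H_f(0) = diag(1, -1)
    return 1.0 + x + 0.5 * (x * x - y * y)

# shrinking (x, y) by a factor of 10 shrinks the error by roughly 1000
e1 = abs(f(0.1, 0.2) - quad(0.1, 0.2))
e2 = abs(f(0.01, 0.02) - quad(0.01, 0.02))
```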
• Notation: If $\mathbf f : \mathbb R^n \to \mathbb R^m$, $\mathbf y = \mathbf f(\mathbf x) \in \mathbb R^m$, we often write $\left(\frac{\partial\mathbf y}{\partial\mathbf x}\right)$ for the $m \times n$ Jacobian matrix $\mathbf J_{\mathbf f}(\mathbf x)$.
• Example: Let $F$ be a strictly increasing d.f., $q \in (0,1)$ a probability. Then the relationship $F(x) = q$ defines $x = x(q)$ as a function of $q$. This value $x(q) = F^{-1}(q)$ is the $q$th quantile, and $x(\cdot)$ is the
quantile function. Differentiating $F(x(q)) = q$ with respect to $q$ and using the chain rule gives
$$1 = \frac{dq}{dq} = \frac{dF(x(q))}{dq} = F'(x(q))\,x'(q); \text{ so } x'(q) = \frac{1}{F'(x(q))} = \frac{1}{F'(F^{-1}(q))}.$$
The denominator is non-zero since $F$ is strictly increasing. The multivariate analogue of this follows.
• Inverse Function Theorem: If $\mathbf f : S \subset \mathbb R^n \to \mathbb R^n$ has a continuous Jacobian within $S$, nonsingular at a point $\mathbf x_0 \in S$, then (i) $\mathbf f$ is $1-1$ in an open neighbourhood of $\mathbf x_0$, and (ii) there is an open neighbourhood $T$ of $\mathbf f(\mathbf x_0)$ within which $\mathbf f^{-1} : T \subset \mathbb R^n \to \mathbb R^n$ is well-defined and has a continuous, non-singular Jacobian with
$$\mathbf J_{\mathbf f^{-1}}(\mathbf f(\mathbf x_0)) = \mathbf J_{\mathbf f}^{-1}(\mathbf x_0).$$
— If $\mathbf J_{\mathbf f}(\mathbf x_0)$ is singular then $\mathbf J_{\mathbf f}(\mathbf x_0)\mathbf v = \mathbf 0$ for some $\mathbf v \ne \mathbf 0$, so that the derivative in direction $\mathbf v$ is zero and $\mathbf f$ might not be 1-1. Nonsingularity of $\mathbf J_{\mathbf f}(\mathbf x_0)$ is sufficient but not always necessary - think about $f(x) = x^3$ and $x_0 = 0$.
— In the notation introduced above, the statement becomes
$$\left(\frac{\partial\mathbf x}{\partial\mathbf y}\right) = \left(\frac{\partial\mathbf y}{\partial\mathbf x}\right)^{-1},$$
with both sides evaluated at $(\mathbf x_0,\ \mathbf y_0 = \mathbf f(\mathbf x_0))$.
— Note that applying the chain rule to the relationship $\mathbf x = \mathbf f^{-1}\circ\mathbf f(\mathbf x)$ gives
$$\mathbf I = \left(\frac{\partial\mathbf x}{\partial\mathbf x}\right) = \mathbf J_{\mathbf f^{-1}}(\mathbf f(\mathbf x))\,\mathbf J_{\mathbf f}(\mathbf x).$$
17. Implicit Function Theorem; extrema; Lagrange multipliers
• Example of Inverse Function Theorem. Suppose that r.v.s $X_1, X_2 > 0$ with joint p.d.f. $f_X(x_1, x_2)$ are transformed to $Y_1, Y_2 > 0$ through the transformation
$$\begin{pmatrix} y_1\\ y_2 \end{pmatrix} = \mathbf f\begin{pmatrix} x_1\\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 + x_2^2\\ x_1^2 - x_2^2 \end{pmatrix}.$$
We will later see, when we look at multivariable integration, that the p.d.f. of $(Y_1, Y_2)$ is
$$f_X(\mathbf x(\mathbf y))\,\left|\det\left(\partial\mathbf x/\partial\mathbf y\right)\right|.$$
Note
$$\left(\frac{\partial\mathbf y}{\partial\mathbf x}\right) = \mathbf J_{\mathbf f}(\mathbf x) = 2\begin{pmatrix} x_1 & x_2\\ x_1 & -x_2 \end{pmatrix}$$
is non-singular if neither $x_1$ nor $x_2$ equals 0. If $\mathbf x_0$ is such a point and $\mathbf y_0 = \mathbf f(\mathbf x_0)$ then in a neighbourhood of $\mathbf y_0$ the inverse map $\mathbf f^{-1} : \mathbf y \to \mathbf f^{-1}(\mathbf y) = \mathbf x$ exists and is such that
$$x_1^2 + x_2^2 = y_1, \qquad x_1^2 - x_2^2 = y_2.$$
The Jacobian of $\mathbf f^{-1}$ is
$$\left(\frac{\partial\mathbf x}{\partial\mathbf y}\right) = \left(\frac{\partial\mathbf y}{\partial\mathbf x}\right)^{-1}.$$
We need only
$$\left|\det\left(\frac{\partial\mathbf x}{\partial\mathbf y}\right)\right| = \left|\det\left(\frac{\partial\mathbf y}{\partial\mathbf x}\right)\right|^{-1} = (8x_1x_2)^{-1} = \frac{1}{4\sqrt{y_1^2 - y_2^2}}.$$
Note that this can be calculated without determining $\mathbf J_{\mathbf f}^{-1}$, or solving for $\mathbf x$ in terms of $\mathbf y$, explicitly.
• The Inverse Function Theorem asserts a unique solution $\mathbf x = \mathbf f^{-1}(\mathbf y)$ to equations of the form $\mathbf g(\mathbf x, \mathbf y) = \mathbf y - \mathbf f(\mathbf x) = \mathbf 0$, with $\mathbf x, \mathbf y \in \mathbb R^n$, under the given conditions. Here $\mathbf x$ is explicitly defined as a function of $\mathbf y$. The solutions, when written as $\mathbf x = \boldsymbol\varphi(\mathbf y)$, are differentiable functions of $\mathbf y$ with $\mathbf J_{\boldsymbol\varphi}(\mathbf y) = \mathbf J_{\mathbf f}^{-1}(\mathbf x)$. Consider now the general case
$$\mathbf g(\mathbf x, \mathbf y) = \mathbf 0_{n\times 1} \quad \text{for } \mathbf x \in \mathbb R^n,\ \mathbf y \in \mathbb R^m. \tag{17.1}$$
For given $\mathbf y$ there may be a solution $\mathbf x = \boldsymbol\varphi(\mathbf y)$ for $\boldsymbol\varphi : \mathbb R^m \to \mathbb R^n$. Such a function is “implicitly defined” by (17.1). The Implicit Function Theorem gives conditions under which such a function exists, and gives some of its properties.
• Implicit Function Theorem. With notation as in (17.1), suppose $\mathbf g$ is defined on $S \subset \mathbb R^{n+m}$ and has a continuous Jacobian in a neighbourhood of a point $(\mathbf x_0, \mathbf y_0) \in S$, at which $\mathbf g(\mathbf x_0, \mathbf y_0) = \mathbf 0_{n\times 1}$. Suppose
$$\mathbf J_1(\mathbf x, \mathbf y) = \left(\frac{\partial\mathbf g(\mathbf x, \mathbf y)}{\partial\mathbf x}\right)$$
(= the first $n$ columns of the $n \times (n+m)$ matrix $\mathbf J_{\mathbf g}(\mathbf x, \mathbf y)$) is non-singular at $(\mathbf x_0, \mathbf y_0)$. Then there is a neighbourhood of $(\mathbf x_0, \mathbf y_0)$ within which the relationship $\mathbf g(\mathbf x, \mathbf y) = \mathbf 0$ defines a continuous mapping $\mathbf x = \boldsymbol\varphi(\mathbf y)$, i.e. $\mathbf g(\boldsymbol\varphi(\mathbf y), \mathbf y) = \mathbf 0$. Moreover, $\boldsymbol\varphi$ has a continuous Jacobian, with
$$\mathbf J_{\boldsymbol\varphi}(\mathbf y) = -\mathbf J_1^{-1}(\boldsymbol\varphi(\mathbf y), \mathbf y)\,\mathbf J_2(\boldsymbol\varphi(\mathbf y), \mathbf y), \tag{17.2}$$
where $\mathbf J_2(\mathbf x, \mathbf y) = \left(\frac{\partial\mathbf g(\mathbf x, \mathbf y)}{\partial\mathbf y}\right)$.
— Note that differentiating the relationship $\mathbf g(\mathbf x, \mathbf y) = \mathbf 0$ with $\mathbf x = \boldsymbol\varphi(\mathbf y)$ gives
$$\mathbf J_1(\mathbf x, \mathbf y)\left(\frac{\partial\mathbf x}{\partial\mathbf y}\right) + \mathbf J_2(\mathbf x, \mathbf y)\left(\frac{\partial\mathbf y}{\partial\mathbf y}\right) = \mathbf 0,$$
yielding (17.2) with $(\partial\mathbf x/\partial\mathbf y)$ written as $\mathbf J_{\boldsymbol\varphi}(\mathbf y)$.
• Example. Write the characteristic equation for a<br />
matrix A × as<br />
(−1) |A − I| = ( a) =<br />
X<br />
=0<br />
=0<br />
Here the are certain continuous functions of the<br />
elements of A and =1;a =( 0 −1 ) 0 .<br />
How do the eigenvalues vary as A varies, say in a<br />
neighbourhood of some matrix A 0 ? The Jacobian<br />
of is clearly continuous, and so if 0 is an<br />
eigenvalue of A 0 with multiplicity one, so that<br />
1 ( a) =( a) 6= 0<br />
at ( 0 a 0 = a (A 0 )), then the char. eqn. defines<br />
as a continuously differentiable function of the<br />
in a neighbourhood of a 0 :<br />
0 0 ( a) ( a)<br />
= +<br />
a a<br />
⇒ (a)<br />
a = − a<br />
(a)<br />
³ <br />
1<br />
−1´<br />
= −<br />
P =1<br />
−1
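The sensitivity formula can be checked numerically. The sketch below is not from the notes: it uses a hypothetical $2\times 2$ matrix, for which the characteristic polynomial is a quadratic, so the perturbed eigenvalue is available in closed form and can be compared with the Implicit Function Theorem prediction $\partial\lambda/\partial a_k = -\lambda^k / p'(\lambda)$.<br />

```python
import math

# Monic characteristic polynomial p(l) = l^2 + a1*l + a0 of a 2x2 matrix.
# The implicit-function formula gives dl/da_k = -l^k / p'(l), with p'(l) = 2l + a1.

def roots_quadratic(a1, a0):
    """Real roots of l^2 + a1*l + a0 = 0, largest first."""
    disc = math.sqrt(a1 * a1 - 4.0 * a0)
    return ((-a1 + disc) / 2.0, (-a1 - disc) / 2.0)

a1, a0 = -3.0, 2.0                       # char. poly of A0 = [[2,1],[0,1]]: roots 2 and 1
lam = roots_quadratic(a1, a0)[0]         # simple eigenvalue lam = 2

# Implicit Function Theorem prediction
dlam_da0 = -1.0 / (2.0 * lam + a1)       # -l^0 / p'(l)
dlam_da1 = -lam / (2.0 * lam + a1)       # -l^1 / p'(l)

# Finite-difference check: perturb each coefficient and track the root
h = 1e-7
fd_da0 = (roots_quadratic(a1, a0 + h)[0] - lam) / h
fd_da1 = (roots_quadratic(a1 + h, a0)[0] - lam) / h
```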
• Extrema of $f: D \subset \mathbb{R}^n \to \mathbb{R}$. Suppose the conditions of Taylor’s Theorem hold, so that we can expand as<br />
$$f(x_0 + h) = f(x_0) + \nabla f'(x_0)\,h + \tfrac{1}{2}\,h' H_f(\xi)\, h.$$<br />
Let $x_0$ be a stationary point: $\nabla f(x_0) = 0$. Then:<br />
1. If $H_f(\xi) > 0$ for $\xi$ in a neighbourhood of $x_0$ then $x_0$ furnishes a local minimum of $f$: $f(x_0 + h) > f(x_0)$ for sufficiently small $h \ne 0$.<br />
2. If $H_f(\xi) < 0$ for $\xi$ in a neighbourhood of $x_0$ then $x_0$ furnishes a local maximum of $f$: $f(x_0 + h) < f(x_0)$ for sufficiently small $h \ne 0$.<br />
3. If neither (1) nor (2) holds then $f(x_0 + h) - f(x_0)$ changes sign as $h$ varies; we say that $x_0$ is a saddlepoint.<br />
• In (1), $H_f(x_0) > 0$ and continuous in a neighbourhood of $x_0$ suffices; similarly with (2).<br />
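The classification in (1)-(3) can be sketched in code by checking the signs of the Hessian’s eigenvalues at a stationary point. The two functions below are hypothetical examples chosen for illustration, not taken from the notes.<br />

```python
import math

def eig2_sym(a, b, c):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]], smallest first."""
    m = 0.5 * (a + c)
    r = math.sqrt(0.25 * (a - c) ** 2 + b * b)
    return (m - r, m + r)

def classify(hessian):
    """hessian = (a, b, c) for [[a, b], [b, c]] evaluated at a stationary point."""
    lo, hi = eig2_sym(*hessian)
    if lo > 0:
        return "local minimum"    # H > 0: all eigenvalues positive
    if hi < 0:
        return "local maximum"    # H < 0: all eigenvalues negative
    return "saddlepoint"

kind_f = classify((2.0, 0.0, 6.0))    # Hessian of f(x,y) = x^2 + 3y^2 at (0,0)
kind_g = classify((2.0, 0.0, -2.0))   # Hessian of g(x,y) = x^2 - y^2 at (0,0)
```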
• Often we seek extrema of multivariate functions, subject to certain side conditions. For instance in ANOVA we might seek least squares estimates, subject to the constraint that the average treatment effect is zero. The general problem considered here is to find the extrema of $f: D \subset \mathbb{R}^n \to \mathbb{R}$ subject to $g(x) = 0_{m\times 1}$ for $m < n$. E.g.<br />
P: Minimize $x'Ax$ subject to $Bx = c_{m\times 1}$, where $A > 0$ and $B_{m\times n}$ has rank $m < n$.<br />
Put<br />
$$L(x; \lambda) = f(x) + \lambda' g(x)$$<br />
for a vector $\lambda_{m\times 1}$ of “Lagrange multipliers”. Claim: The stationary points of $L$ that satisfy the constraints determine the stationary points in the original problem. These points then satisfy the $n + m$ equations in the $n + m$ variables of $(x, \lambda)$:<br />
$$\nabla L(x; \lambda) = 0_{(n+m)\times 1}.$$<br />
Equivalently,<br />
$$\nabla f'(x) + \lambda' J_g(x) = 0'_{1\times n},$$<br />
$$g(x) = 0_{m\times 1}.$$<br />
The proof of this claim follows the example.<br />
• Example: Problem P above. We have<br />
$$L(x; \lambda) = f(x) + \lambda' g(x) = x'Ax + \lambda'(Bx - c),$$<br />
with (you should verify this)<br />
$$0'_{1\times n} = \frac{\partial L}{\partial x} = 2x'A + \lambda'B,$$<br />
implying<br />
$$x = -\tfrac{1}{2}A^{-1}B'\lambda.$$<br />
Combine this with $Bx = c$ to get<br />
$$\lambda = -2\left(BA^{-1}B'\right)^{-1}c,$$<br />
whence<br />
$$x = A^{-1}B'\left(BA^{-1}B'\right)^{-1}c.$$<br />
(You should be able to verify that $BA^{-1}B'$ is non-singular.)<br />
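A minimal numerical sketch of this closed-form solution, on an assumed small instance with $A = \mathrm{diag}(1, 2)$, $B = (1\ 1)$, $c = 1$ (i.e. minimize $x_1^2 + 2x_2^2$ subject to $x_1 + x_2 = 1$); $A$ is diagonal so $A^{-1}$ is immediate:<br />

```python
# Problem P instance: minimize x'Ax subject to Bx = c.
A_diag = (1.0, 2.0)       # A = diag(1, 2) > 0
B = (1.0, 1.0)            # 1 x 2, rank 1
c = 1.0

Ainv_Bt = tuple(B[i] / A_diag[i] for i in range(2))   # A^{-1} B'
BAinvBt = sum(B[i] * Ainv_Bt[i] for i in range(2))    # B A^{-1} B' (a scalar here)
lam = -2.0 * c / BAinvBt                              # lambda = -2 (B A^{-1} B')^{-1} c
x = tuple(v * c / BAinvBt for v in Ainv_Bt)           # the constrained minimizer

# Sanity checks: the constraint holds, and x beats a nearby feasible competitor
fx = x[0] ** 2 + 2.0 * x[1] ** 2
f_alt = 0.5 ** 2 + 2.0 * 0.5 ** 2                     # feasible point (0.5, 0.5)
```

The minimizer works out to $x = (2/3, 1/3)$, which can also be confirmed directly from the stationarity conditions $2x_1 = \mu$, $4x_2 = \mu$, $x_1 + x_2 = 1$.<br />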
• Once we have the stationary points we must check for a minimum or maximum. There are conditions under which the satisfaction of these equations is sufficient as well as necessary to determine the extrema. These are generally so restrictive and complicated as to be useless in practice. The virtue of the Lagrange multiplier method is that it reduces to a small number the points that must be checked; we know that the required extrema are among them.<br />
— One easy way (if it works) to check optimality is this. Suppose that $(x_0; \lambda_0)$ is a stationary point of $L(x; \lambda)$ and that $x_0$ minimizes $L(x; \lambda_0)$ unconditionally. Then $L(x_0; \lambda_0) < L(x; \lambda_0)$ for all $x \ne x_0$; hence in the class of those $x$ that satisfy $g(x) = 0$ we have<br />
$$f(x_0) = L(x_0; \lambda_0) < L(x; \lambda_0) = f(x).$$<br />
— In Problem P there was only one stationary point $x_0$. This furnishes the minimum since<br />
$$L(x; \lambda_0) = x'Ax + \lambda_0'(Bx - c),$$<br />
where $\lambda_0 = -2\left(BA^{-1}B'\right)^{-1}c$, has Hessian $2A > 0$, hence is minimized unconditionally at $x_0$.<br />
• Proof of claim: Let $x$ be a solution to the original problem, so that in particular $g(x) = 0_{m\times 1}$. We will ‘solve’ these equations, thus expressing $m$ of the $x_i$’s in terms of the others. The Implicit Function Theorem allows us to do this. We must show that $\nabla f'(x) + \lambda' J_g(x) = 0'_{1\times n}$. Partition $x$, the gradient of $f$, and the Jacobian of $g$ as:<br />
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad x_1:\ m \times 1, \quad x_2:\ (n - m) \times 1;$$<br />
$$\nabla f(x) = \begin{pmatrix} \left(\partial f/\partial x_1\right)' \\ \left(\partial f/\partial x_2\right)' \end{pmatrix} = \begin{pmatrix} \tau(x) \\ \psi(x) \end{pmatrix};$$<br />
$$J_g(x) = \left(\frac{\partial g}{\partial x_1} \;\vdots\; \frac{\partial g}{\partial x_2}\right) = \left(\Gamma_{m\times m}(x) \;\vdots\; \Delta_{m\times(n-m)}(x)\right).$$<br />
Under the conditions of the Implicit Function Theorem (so that, in particular, $\Gamma(x)$ is non-singular) we can solve the equations $g(x_1, x_2) = 0_{m\times 1}$ for $x_1$ in terms of $x_2$, obtaining $x_1 = h(x_2)$. Thus $f(x) = f(h(x_2), x_2)$ and $g(h(x_2), x_2) = 0_{m\times 1}$.<br />
Since $x_2$ is a stationary point we have<br />
$$0'_{1\times(n-m)} = \frac{\partial f}{\partial x_2} = \tau'(x)\, J_h(x_2) + \psi'(x). \tag{17.3}$$<br />
But, as in the Implicit Function Theorem, $g(h(x_2), x_2) = 0_{m\times 1}$ gives<br />
$$\Gamma(x)\, J_h(x_2) + \Delta(x) = 0_{m\times(n-m)},$$<br />
so<br />
$$J_h(x_2) = -\Gamma^{-1}(x)\,\Delta(x).$$<br />
In (17.3) this gives<br />
$$0' = -\tau'(x)\,\Gamma^{-1}(x)\,\Delta(x) + \psi'(x) = \lambda'\Delta(x) + \psi'(x), \tag{17.4}$$<br />
where $\lambda' = -\tau'(x)\,\Gamma^{-1}(x):\ 1\times m$. Thus<br />
$$\nabla f'(x) + \lambda' J_g(x) = \left(\tau'(x),\ \psi'(x)\right) - \tau'(x)\,\Gamma^{-1}(x)\left(\Gamma(x) \;\vdots\; \Delta(x)\right) = \left(0',\ \psi'(x) - \tau'(x)\,\Gamma^{-1}(x)\,\Delta(x)\right) = 0'_{1\times n},$$<br />
by (17.4), as required. ¤<br />
18. Integration; Leibnitz’s Rule; Normal sampling distributions<br />
• Integration over an $n$-dimensional rectangle. Let $f: D \subset \mathbb{R}^n \to \mathbb{R}$ be bounded on the bounded set $D$. The development of the Riemann integral proceeds along the same lines as for $n = 1$. Thus, define the rectangle<br />
$$[a, b] = [a_1, b_1] \times \cdots \times [a_n, b_n],$$<br />
large enough that $D \subset [a, b]$. First suppose that $f$ is defined and bounded on $[a, b]$. Let $P$ be a partition of $[a, b]$, itself consisting of $n$-dimensional rectangles $R_1, \dots, R_k$. Define the lower and upper sums<br />
$$L(P) = \sum_{j=1}^{k} m_j\, \mu(R_j), \quad U(P) = \sum_{j=1}^{k} M_j\, \mu(R_j),$$<br />
where $m_j$ and $M_j$ are the inf and sup of $f$ on $R_j$, and $\mu(R_j)$ is the volume of $R_j$:<br />
$$\mu([c, d]) = \prod_{i=1}^{n} (d_i - c_i).$$<br />
If<br />
$$\sup_P L(P) = \inf_P U(P),$$<br />
or equivalently if<br />
$$\lim_{\Delta_P \to 0}\left(U(P) - L(P)\right) = 0,$$<br />
then the common value is the Riemann integral $\int_{[a,b]} f(x)\, dx$.<br />
• Now recall that $D \subset [a, b]$. Define<br />
$$1_D(x) = \begin{cases} 1, & x \in D, \\ 0, & x \notin D. \end{cases}$$<br />
If $\int_{[a,b]} 1_D(x)\, dx$ exists we say $D$ is Jordan measurable. The value of the integral is called the Jordan content of $D$. When $D$ is Jordan measurable we define<br />
$$\int_D f(x)\, dx = \int_{[a,b]} f(x)\, 1_D(x)\, dx.$$<br />
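Jordan content can be illustrated with a Riemann sum for the indicator $1_D$. The sketch below (the grid size is an arbitrary choice for illustration) approximates the content, i.e. the area, of the unit disc, which should be near $\pi$:<br />

```python
import math

def disc_area_estimate(N):
    """Midpoint Riemann sum of 1_D over [-1,1]^2 with an N x N grid of cells R_j."""
    h = 2.0 / N
    total = 0.0
    for i in range(N):
        for j in range(N):
            x = -1.0 + (i + 0.5) * h
            y = -1.0 + (j + 0.5) * h
            if x * x + y * y <= 1.0:   # 1_D evaluated at the cell centre
                total += h * h         # mu(R_j), the cell's volume (area)
    return total

area = disc_area_estimate(500)
```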
• A major tool for evaluating multidimensional integrals is Fubini’s Theorem: If $f$ is absolutely integrable on $[a, b]$ then<br />
$$\int_{[a,b]} f(x)\, dx = \int_{a_1}^{b_1}\left(\cdots\left(\int_{a_{n-1}}^{b_{n-1}}\left(\int_{a_n}^{b_n} f(x)\, dx_n\right) dx_{n-1}\right)\cdots\right) dx_1,$$<br />
and the integrations on the RHS may be carried out in any order.<br />
• Change of variables. Let $f: D \subset \mathbb{R}^n \to \mathbb{R}$, where $D$ is closed and bounded and $f$ is continuous. Let $h: x \in D \to y \in \mathbb{R}^n$ be a $1-1$ function with continuous Jacobian matrix $J_h(x)$, non-singular on $D$. Then there is an inverse function $h^{-1}: y \in h(D) \to D$ with<br />
$$\int_D f(x)\, dx = \int_{h(D)} f(h^{-1}(y))\left|J_{h^{-1}}(y)\right|_+ dy, \tag{18.1}$$<br />
where $|\cdot|_+$ denotes the absolute value of the determinant. Note also<br />
$$\left|J_{h^{-1}}(y)\right|_+ = \left|\frac{\partial x}{\partial y}\right|_+ = \left|\frac{\partial y}{\partial x}\right|_+^{-1}, \text{ where } y = h(x).$$<br />
Here and elsewhere the assumption that $D$ be bounded can be dropped by defining the resulting improper integral as a limit of proper integrals.<br />
• In particular, suppose $f$ is the p.d.f. of a r.vec. $X$. Put $Y = h(X)$. If $h$ is as above we have<br />
$$\int_S f(x)\, dx = P(X \in S) = P(Y \in h(S)) = \int_{h(S)} g(y)\, dy,$$<br />
where $g(y)$ is the p.d.f. of $Y$. But also (18.1) holds; thus<br />
$$g(y) = f(h^{-1}(y))\left|J_{h^{-1}}(y)\right|_+ = f(x)\left|\frac{\partial x}{\partial y}\right|_+, \text{ with } x = x(y).$$<br />
• Differentiation under the integral sign. Define<br />
$$F(\theta) = \int_{a(\theta)}^{b(\theta)} f(\theta, x)\, dx,$$<br />
where $a(\theta)$, $b(\theta)$ are continuously differentiable for $c \le \theta \le d$ and $f(\theta, x)$ is continuous, with a continuous partial derivative w.r.t. $\theta$, on a region containing $\{c \le \theta \le d \text{ and } a(\theta) \le x \le b(\theta)\}$. Then (Leibnitz’s Rule): $F(\theta)$ is continuously differentiable with<br />
$$F'(\theta) = \int_{a(\theta)}^{b(\theta)} \frac{\partial f(\theta, x)}{\partial \theta}\, dx + f(\theta, b(\theta))\, b'(\theta) - f(\theta, a(\theta))\, a'(\theta). \tag{18.2}$$<br />
• Note this is the result of writing<br />
$$F(\theta) = H(\theta, a(\theta), b(\theta)),$$<br />
where $H(\theta, u, v) = \int_u^v f(\theta, x)\, dx$. Then<br />
$$\frac{d}{d\theta} H(\theta, a(\theta), b(\theta)) = \frac{\partial H(\theta, u, v)}{\partial \theta} + \left.\frac{\partial H(\theta, u, v)}{\partial u}\right|_{\substack{u = a(\theta) \\ v = b(\theta)}} a'(\theta) + \left.\frac{\partial H(\theta, u, v)}{\partial v}\right|_{\substack{u = a(\theta) \\ v = b(\theta)}} b'(\theta).$$<br />
If differentiation under the integral sign is permissible, and Leibnitz’s rule says that it is, then we have (18.2).<br />
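Leibnitz’s Rule can be checked numerically. The example below is a hypothetical choice, not from the notes: $F(t) = \int_0^t \sin(tx)\, dx$, for which (18.2) gives $F'(t) = \int_0^t x\cos(tx)\, dx + \sin(t^2)\cdot 1$. The rule’s value is compared against a central finite difference of $F$:<br />

```python
import math

def simpson(g, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = g(a) + g(b) + sum((4 if k % 2 else 2) * g(a + k * h) for k in range(1, n))
    return s * h / 3.0

def F(t):
    return simpson(lambda x: math.sin(t * x), 0.0, t)

t0 = 1.3
# Right-hand side of Leibnitz's Rule (18.2): a(t)=0, b(t)=t, so a'=0, b'=1
leibnitz = simpson(lambda x: x * math.cos(t0 * x), 0.0, t0) + math.sin(t0 * t0)
# Direct numerical derivative of F for comparison
h = 1e-5
central_diff = (F(t0 + h) - F(t0 - h)) / (2.0 * h)
```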
• Example: Let $X, Y$ be independent, non-negative r.v.s with continuous densities $f_X, f_Y$ respectively ($f_X(x) = f_Y(x) = 0$ for $x < 0$). Then $Z = X + Y$ has d.f.<br />
$$F_Z(z) = P(Z \le z) = \int_{[0,z]\times[0,z]} I(x + y \le z)\, f_X(x)\, f_Y(y)\, d(x, y)$$<br />
$$= \int_0^z f_X(x)\, P(Y \le z - x)\, dx \quad \text{(by Fubini’s Theorem)}$$<br />
$$= \int_0^z f_X(x)\int_0^{z-x} f_Y(y)\, dy\, dx,$$<br />
with, by Leibnitz’s Rule, density<br />
$$f_Z(z) = \int_0^z f_X(x)\, f_Y(z - x)\, dx,$$<br />
since $F_Y(0) = 0$. This integral is called the convolution of $f_X$ with $f_Y$. (Only one of $X, Y$ needs to be continuous - why?)<br />
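As a numerical sketch of the convolution formula (a hypothetical instance, not from the notes): for $X, Y$ independent Exp(1), $f_Z(z) = \int_0^z e^{-x} e^{-(z-x)}\, dx = z e^{-z}$, the Erlang (gamma) density with shape 2.<br />

```python
import math

def simpson(g, a, b, n=1000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = g(a) + g(b) + sum((4 if k % 2 else 2) * g(a + k * h) for k in range(1, n))
    return s * h / 3.0

f_exp = lambda u: math.exp(-u) if u >= 0 else 0.0   # Exp(1) density

def f_Z(z):
    """Convolution integral f_Z(z) = int_0^z f_X(x) f_Y(z - x) dx."""
    return simpson(lambda x: f_exp(x) * f_exp(z - x), 0.0, z)

# Compare the numerical convolution with the exact Erlang(2) density z*e^{-z}
vals = [(z, f_Z(z), z * math.exp(-z)) for z in (0.5, 1.0, 2.0, 4.0)]
```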
• Application: The density of a $\chi_1^2$ r.v., i.e. $Y = Z^2$, where $Z \sim N(0, 1)$, can be obtained as<br />
$$f_1(y) = \frac{d}{dy}P(Y \le y) = \frac{d}{dy}P\left(-\sqrt{y} \le Z \le \sqrt{y}\right) = \frac{d}{dy}\int_{-\sqrt{y}}^{\sqrt{y}} \phi(z)\, dz = \frac{\phi(\sqrt{y})}{\sqrt{y}} = \frac{\left(\frac{y}{2}\right)^{\frac{1}{2}-1} e^{-\frac{y}{2}}}{2\Gamma\left(\frac{1}{2}\right)}.$$<br />
Then the $\chi_2^2$ density is<br />
$$f_2(y) = \int_0^y f_1(x)\, f_1(y - x)\, dx = \frac{e^{-\frac{y}{2}}}{4\Gamma^2\left(\frac{1}{2}\right)}\int_0^y \left(\frac{x}{2}\right)^{\frac{1}{2}-1}\left(\frac{y - x}{2}\right)^{\frac{1}{2}-1} dx = \frac{1}{2}e^{-\frac{y}{2}}\cdot c,$$<br />
where (substituting $x = uy$)<br />
$$c = \frac{1}{\pi}\int_0^1 u^{\frac{1}{2}-1}(1 - u)^{\frac{1}{2}-1}\, du$$<br />
must $= 1$ in<br />
order that $f_2(y)$ integrate to 1. Now the general density<br />
$$f_\nu(y) = \frac{\left(\frac{y}{2}\right)^{\frac{\nu}{2}-1} e^{-\frac{y}{2}}}{2\Gamma\left(\frac{\nu}{2}\right)}$$<br />
can be proved by induction, or conjectured and then established by using the uniqueness of m.g.f.s (as in Lecture 15).<br />
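A quick Monte Carlo check of the $\chi_2^2$ result (the sample size and the evaluation point are arbitrary choices): since $f_2(y) = \tfrac{1}{2}e^{-y/2}$, we have $P(Y \le y) = 1 - e^{-y/2}$ for $Y = Z_1^2 + Z_2^2$ with $Z_i$ i.i.d. $N(0,1)$.<br />

```python
import math
import random

random.seed(12345)
n = 200000
y0 = 2.0
count = 0
for _ in range(n):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    if z1 * z1 + z2 * z2 <= y0:   # event {Y <= y0}
        count += 1

p_hat = count / n                         # Monte Carlo estimate of P(Y <= y0)
p_exact = 1.0 - math.exp(-y0 / 2.0)       # from f_2(y) = (1/2) e^{-y/2}
```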
• Example: Joint distribution of the sample mean and variance in Normal samples.<br />
Suppose that $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ r.v.s, so that $X = (X_1, \dots, X_n)'$ has p.d.f.<br />
$$f(x) = \prod_{i=1}^{n}\left(2\pi\sigma^2\right)^{-1/2}\exp\left\{-\frac{(x_i - \mu)^2}{2\sigma^2}\right\} = \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\sum_{i=1}^{n}\frac{(x_i - \mu)^2}{2\sigma^2}\right\}.$$<br />
Note<br />
$$\sum_{i=1}^{n}(x_i - \mu)^2 = \sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2 = (n - 1)s^2 + n(\bar{x} - \mu)^2,$$<br />
so that<br />
$$f(x) = \left(2\pi\sigma^2\right)^{-n/2} e^{-\frac{(n-1)s^2}{2\sigma^2}}\, e^{-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}}.$$<br />
We derive the joint p.d.f. of $(s^2, \bar{X})$. First note that $1_{n\times 1}/\sqrt{n}$ has norm 1. Adjoin $n - 1$ unit vectors $e_i$ to get a basis for $\mathbb{R}^n$, and then apply Gram-Schmidt to get an orthonormal basis whose first member is $1/\sqrt{n}$. This yields an orthogonal matrix<br />
$$H_{n\times n} = \begin{pmatrix} 1'/\sqrt{n} \\ H_1 \end{pmatrix}.$$<br />
Put $Y = HX$. Then<br />
$$Y_1 = 1'X/\sqrt{n} = \sqrt{n}\,\bar{X},$$<br />
and $\|X\|^2 = \|Y\|^2$, so that<br />
$$\sum_{i=2}^{n} Y_i^2 = \|Y\|^2 - Y_1^2 = \|X\|^2 - \left(\sqrt{n}\,\bar{X}\right)^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = (n - 1)s^2.$$<br />
Note that<br />
$$\left|\frac{\partial x}{\partial y}\right|_+ = \left|H'\right|_+ = |\pm 1| = 1,$$<br />
so that the p.d.f. $g(y)$ of $Y$ is<br />
$$g(y) = f(x(y))\left|\frac{\partial x}{\partial y}\right|_+ = f(x(y))$$<br />
$$= \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\frac{\sum_{i=2}^{n} y_i^2}{2\sigma^2}\right\}\exp\left\{-\frac{\left(y_1 - \sqrt{n}\,\mu\right)^2}{2\sigma^2}\right\}$$<br />
$$= \left\{\left(2\pi\sigma^2\right)^{-1/2} e^{-\frac{\left(y_1 - \sqrt{n}\,\mu\right)^2}{2\sigma^2}}\right\}\cdot\prod_{i=2}^{n}\left\{\left(2\pi\sigma^2\right)^{-1/2} e^{-\frac{y_i^2}{2\sigma^2}}\right\}.$$<br />
Thus<br />
1. $Y_1, \dots, Y_n$ are independently distributed;<br />
2. $Y_1 \sim N\left(\sqrt{n}\,\mu, \sigma^2\right)$, so that $\bar{X} \sim N\left(\mu, \sigma^2/n\right)$;<br />
3. $\frac{(n-1)s^2}{\sigma^2} = \sum_{i=2}^{n}\left(\frac{Y_i}{\sigma}\right)^2 \sim \chi_{n-1}^2$, since $\frac{Y_i}{\sigma} \sim N(0, 1)$ for $i \ge 2$; furthermore $\bar{X}$ and $s^2$ are independently distributed.<br />
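A simulation sketch of these conclusions (the values of $\mu$, $\sigma$, $n$ and the replication count are arbitrary choices): $\bar{X}$ should have mean $\mu$ and variance $\sigma^2/n$, $s^2$ should have mean $\sigma^2$, and the sample correlation between $\bar{X}$ and $s^2$ should be near 0, consistent with independence.<br />

```python
import random

random.seed(2013)
mu, sigma, n, reps = 5.0, 2.0, 10, 20000
xbars, s2s = [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)   # sample variance
    xbars.append(xbar)
    s2s.append(s2)

mean_xbar = sum(xbars) / reps
var_xbar = sum((v - mean_xbar) ** 2 for v in xbars) / reps
mean_s2 = sum(s2s) / reps
var_s2 = sum((v - mean_s2) ** 2 for v in s2s) / reps
cov = sum((a - mean_xbar) * (b - mean_s2) for a, b in zip(xbars, s2s)) / reps
corr = cov / (var_xbar * var_s2) ** 0.5
```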
19. Numerical optimization: Steepest descent, Newton-Raphson, Gauss-Newton<br />
• Numerical minimization. Suppose that a function $f: \mathbb{R}^n \to \mathbb{R}$ is to be minimized.<br />
• Method of steepest descent. First choose an initial value $x_0$. Let $h$ be a vector of unit length and expand around $x_0$ in the direction $th$ ($t > 0$):<br />
$$f(x_0 + th) - f(x_0) \approx t\, h'\nabla f(x_0).$$<br />
We choose $h$ such that $h'\nabla f(x_0)$ is negative but maximized in absolute value. Specifically, note that by Cauchy-Schwarz, for $\|h\| = 1$,<br />
$$\left|h'\nabla f(x_0)\right| \le \left\|\nabla f(x_0)\right\|,$$<br />
with equality iff $h = \pm\nabla f(x_0)/\left\|\nabla f(x_0)\right\|$. Then $h'\nabla f(x_0) = \pm\left\|\nabla f(x_0)\right\|$, so we use the “$-$” sign:<br />
$$x_1(t) = x_0 + th = x_0 - t\,\nabla f(x_0)/\left\|\nabla f(x_0)\right\|,$$<br />
with $t = t_0$ chosen to minimize (by trial and error) $f(x_1(t))$. Repeat, with $x_1(t_0)$ replacing $x_0$. Iterate to convergence.<br />
— Example: $f(x) = \|x\|^2$. Then $\nabla f(x) = 2x$ and so<br />
$$x_1 = x_0 - t\cdot 2x_0/\left\|2x_0\right\| = x_0\left(1 - t/\|x_0\|\right).$$<br />
We vary $t$ until $f(x_1(t)) = \|x_1(t)\|^2$ is a minimum, i.e. to $t = \|x_0\|$. Then $x_1 = 0$, and the minimum is achieved in 1 step, from any starting value.<br />
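The method can be sketched in a few lines. The quadratic objective below is a hypothetical choice, for which the line search along a fixed direction has a closed form (an implementation convenience standing in for "trial and error"):<br />

```python
def f(x):
    """Hypothetical objective: f(x) = x1^2 + 10*x2^2, i.e. x'Ax with A = diag(1, 10)."""
    return x[0] ** 2 + 10.0 * x[1] ** 2

def grad(x):
    return (2.0 * x[0], 20.0 * x[1])

def steepest_descent(x, iters=200):
    for _ in range(iters):
        g = grad(x)
        norm = (g[0] ** 2 + g[1] ** 2) ** 0.5
        if norm < 1e-12:
            break
        h = (-g[0] / norm, -g[1] / norm)        # unit steepest-descent direction
        # Exact line search, closed form for a quadratic: t* = -g'h / (2 h'Ah)
        gh = g[0] * h[0] + g[1] * h[1]
        hAh = h[0] ** 2 + 10.0 * h[1] ** 2
        t = -gh / (2.0 * hAh)
        x = (x[0] + t * h[0], x[1] + t * h[1])
    return x

x_min = steepest_descent((4.0, 1.0))
```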
• The method of steepest descent uses a linear approximation of $f$; for this and other reasons the convergence can be very slow. The Newton-Raphson method uses a quadratic approximation of $f$; equivalently it takes a linear approximation of $\nabla f(x)$ in order to solve $\nabla f(x) = 0$. In its general form the Newton-Raphson method attempts to solve a system of equations of the form $g(x) = 0$, where $x$ and $g(x)$ are $n \times 1$.<br />
• Expand $g(x)$ around an initial value $x_0$:<br />
$$g(x) \approx g(x_0) + J_g(x_0)(x - x_0).$$<br />
Equate the RHS to zero, to get the next iterate:<br />
$$x_1 = x_0 - J_g^{-1}(x_0)\, g(x_0).$$<br />
In general,<br />
$$x_{m+1} = x_m - J_g^{-1}(x_m)\, g(x_m), \quad m = 0, 1, 2, \dots.$$<br />
At convergence, with $x_\infty = \lim_{m\to\infty} x_m$ and assuming that $J_g(x_\infty)$ is non-singular,<br />
$$x_\infty = x_\infty - J_g^{-1}(x_\infty)\, g(x_\infty),$$<br />
so that $g(x_\infty) = 0$.<br />
— If this is a minimization problem, so that $g(x) = \nabla f(x)$, then $J_g(x) = H_f(x)$ and the scheme is<br />
$$x_{m+1} = x_m - H_f^{-1}(x_m)\,\nabla f(x_m).$$<br />
Note<br />
$$f(x_{m+1}) \approx f(x_m) + \nabla f'(x_m)\left(x_{m+1} - x_m\right) = f(x_m) - \nabla f'(x_m)\, H_f^{-1}(x_m)\,\nabla f(x_m) < f(x_m) \text{ if } H_f(x_m) > 0.$$<br />
— Example: solve $g(x) = \log x - 1 = 0$:<br />
$$x_{m+1} = x_m - \frac{g(x_m)}{g'(x_m)} = x_m - \frac{\log x_m - 1}{1/x_m} = x_m(2 - \log x_m), \quad m = 0, 1, 2, \dots.$$<br />
This gives<br />
$$x_0 = 1,\quad x_1 = 2,\quad x_2 = 2.6137,\quad x_3 = 2.7162,\quad x_4 = 2.7182811;\quad e = 2.7182818\dots.$$<br />
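The table of iterates is easy to reproduce:<br />

```python
import math

# Newton-Raphson for g(x) = log(x) - 1: x_{m+1} = x_m * (2 - log(x_m))
x = 1.0
iterates = [x]
for _ in range(6):
    x = x * (2.0 - math.log(x))
    iterates.append(x)
```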
• The starting points can make a big difference, even if the function being minimized is convex. Example: Consider using Newton’s method to find the zero of $g(x) = \mathrm{sgn}(x)\cdot x^2/\left(1 + x^2\right)$. This is increasing, so it is the derivative of a convex function $f$, minimized at the zero of $g$. Start at $x_0 \ne 0$. We have $g(x)/g'(x) = x\left(1 + x^2\right)/2$, and so the iterates satisfy<br />
$$x_{m+1} = x_m\left(1 - x_m^2\right)/2.$$<br />
Put $r_m = \left|x_{m+1}/x_m\right| = \left|1 - x_m^2\right|/2$. Suppose that $|x_0| > \sqrt{3}$. By induction,<br />
$$|x_m| > \sqrt{3} \text{ and } r_m > r_{m-1} > \cdots > r_0 > 1, \tag{19.1}$$<br />
so that $\left|x_{m+1}\right| = r_m \cdots r_0\, |x_0| \uparrow \infty$. If $|x_0| = \sqrt{3}$ then $x_m = (-1)^m x_0$. Similarly, if $|x_0| < \sqrt{3}$ then $|x_m| < \sqrt{3}$ and $r_m < r_{m-1} < \cdots < r_0 < 1$, and so $|x_m| \downarrow 0$ (= the desired root).<br />
Details of (19.1): If true for $m$, then $\left|x_{m+1}\right| = r_m |x_m| > \sqrt{3}$, and then<br />
$$r_{m+1} = \frac{\left|1 - x_{m+1}^2\right|}{2} = \frac{x_{m+1}^2 - 1}{2} = \frac{r_m^2 x_m^2 - 1}{2} > \frac{3r_m^2 - 1}{2} > r_m,$$<br />
the last inequality because $3r_m^2 - 2r_m - 1 = (3r_m + 1)(r_m - 1) > 0$ when $r_m > 1$.<br />
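The three regimes are easy to see numerically (the specific starting values below are arbitrary choices on each side of $\sqrt{3}$):<br />

```python
def iterate(x0, m):
    """Run x_{m+1} = x_m * (1 - x_m^2) / 2 for m steps."""
    x = x0
    for _ in range(m):
        x = x * (1.0 - x * x) / 2.0
    return x

x_conv = iterate(1.5, 30)       # |x_0| < sqrt(3): drawn to the root 0
x_div = iterate(2.0, 6)         # |x_0| > sqrt(3): |x_m| blows up
x_osc = iterate(3 ** 0.5, 2)    # |x_0| = sqrt(3): x_m = (-1)^m x_0
```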
• Gauss-Newton algorithm. Uses least squares minimization along with a linear approximation of the function of interest. A common application is non-linear regression, so I’ll illustrate the technique there. Suppose we observe<br />
$$y_i = \eta(x_i, \theta) + \varepsilon_i, \quad i = 1, \dots, n.$$<br />
• An example is a Michaelis-Menten response $\eta(x, \theta) = \theta_1 x/(\theta_2 + x)$, ($x > 0$), used to describe various chemical and pharmacological reactions. Note the horizontal asymptote of $\theta_1$.<br />
If $\eta(x, \theta) = z'(x)\theta$ for some regressors $z(x)$ then this is a linear regression problem; otherwise non-linear. Define<br />
$$\eta(\theta) = \left(\eta(x_1, \theta), \cdots, \eta(x_n, \theta)\right)',$$<br />
so that the data can be represented as $y = \eta(\theta) + \varepsilon$. The LSEs are the minimizers of<br />
$$S(\theta) = \left\|y - \eta(\theta)\right\|^2.$$<br />
Take an initial value $\theta_0$, expand around $\theta_0$ to get<br />
$$y_i - \eta(x_i, \theta) \approx y_i - \eta(x_i, \theta_0) - \nabla\eta'(x_i, \theta_0)(\theta - \theta_0),$$<br />
i.e.<br />
$$y - \eta(\theta) \approx y - \eta(\theta_0) - J_\eta(\theta_0)(\theta - \theta_0).$$<br />
Define $y^{(1)} = y - \eta(\theta_0)$, so that<br />
$$\left\|y - \eta(\theta)\right\|^2 \approx \left\|y^{(1)} - J_\eta(\theta_0)(\theta - \theta_0)\right\|^2$$<br />
is to be minimized. By analogy with the linear regression model<br />
$$y^{(1)} = J_\eta(\theta_0)\beta + \varepsilon,$$<br />
the minimizer is<br />
$$\theta - \theta_0 = \beta = \left[J_\eta'(\theta_0)\, J_\eta(\theta_0)\right]^{-1} J_\eta'(\theta_0)\, y^{(1)}.$$<br />
Thus the next value is $\theta_1 = \theta_0 + \beta$, i.e.<br />
$$\theta_1 = \theta_0 + \left[J'(\theta_0)\, J(\theta_0)\right]^{-1} J'(\theta_0)\left(y - \eta(\theta_0)\right).$$<br />
In general, $\theta_{m+1} = \theta_m + \hat{\beta}_m$, where<br />
$$\hat{\beta}_m = \left[J'(\theta_m)\, J(\theta_m)\right]^{-1} J'(\theta_m)\left(y - \eta(\theta_m)\right).$$<br />
Thus we are repeatedly doing least squares regressions, in the $(m+1)$st of which the residuals from the $m$th are regressed on the columns of the Jacobian matrix, evaluated at $\theta_m$. A stopping rule can be based on the F-test of $H_0: \beta = 0$, the p-values for which will be included in the regression output.<br />
Assuming convergence, the limit $\hat{\theta}$ satisfies<br />
$$J'(\hat{\theta})\left(y - \eta(\hat{\theta})\right) = 0, \tag{19.2}$$<br />
so that<br />
$$\nabla S'(\hat{\theta}) = -2\left(y - \eta(\hat{\theta})\right)' J(\hat{\theta}) = 0',$$<br />
and $\hat{\theta}$ is a stationary point of $S(\theta)$.<br />
Typically $S\left(\theta_{m+1}\right) < S(\theta_m)$; if not it is usual to take<br />
$$\theta_{m+1} = \theta_m + \alpha\left[J'(\theta_m)\, J(\theta_m)\right]^{-1} J'(\theta_m)\left(y - \eta(\theta_m)\right)$$<br />
for $\alpha = 1, 1/2, 1/4, \dots$ until a decrease in $S$ is attained.<br />
A normal approximation is generally valid:<br />
$$\hat{\theta} \approx N\left(\theta,\ \sigma^2\left[J'(\hat{\theta})\, J(\hat{\theta})\right]^{-1}\right);$$<br />
the basic idea is (19.2) applied to<br />
$$\varepsilon = y - \eta(\theta) \approx y - \left[\eta(\hat{\theta}) + J(\hat{\theta})(\theta - \hat{\theta})\right].$$<br />
More precisely,<br />
$$\sqrt{n}\left(\hat{\theta} - \theta\right) \xrightarrow{L} N\left(0,\ \sigma^2\left[\lim_{n\to\infty}\frac{J'(\hat{\theta})\, J(\hat{\theta})}{n}\right]^{-1}\right).$$<br />
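A Gauss-Newton sketch for the Michaelis-Menten model $\eta(x, \theta) = \theta_1 x/(\theta_2 + x)$. The data below are synthetic and noiseless (true $\theta = (2, 3)$; design points and the starting value are arbitrary choices), so the iteration should recover $\theta$ essentially exactly; the $2\times 2$ normal equations are solved by Cramer’s rule.<br />

```python
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
theta_true = (2.0, 3.0)
ys = [theta_true[0] * x / (theta_true[1] + x) for x in xs]   # noiseless responses

def gauss_newton(theta, iters=50):
    t1, t2 = theta
    for _ in range(iters):
        resid = [y - t1 * x / (t2 + x) for x, y in zip(xs, ys)]
        # Jacobian of eta: d/dt1 = x/(t2+x), d/dt2 = -t1*x/(t2+x)^2
        J = [(x / (t2 + x), -t1 * x / (t2 + x) ** 2) for x in xs]
        # Normal equations (J'J) beta = J'resid, solved by Cramer's rule
        a = sum(j1 * j1 for j1, _ in J)
        b = sum(j1 * j2 for j1, j2 in J)
        d = sum(j2 * j2 for _, j2 in J)
        r1 = sum(j1 * r for (j1, _), r in zip(J, resid))
        r2 = sum(j2 * r for (_, j2), r in zip(J, resid))
        det = a * d - b * b
        beta = ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)
        t1, t2 = t1 + beta[0], t2 + beta[1]
    return t1, t2

theta_hat = gauss_newton((1.8, 2.7))
```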
20. Maximum likelihood<br />
• Maximum Likelihood Estimation. This is the most common and versatile method of estimation in statistics. It almost always gives reasonable estimates, even in situations that are so intractable as to be highly resistant to other estimation methods.<br />
• Data $x$, p.d.f. $f(x; \theta)$; e.g. i.i.d. $N(\mu, \sigma^2)$ observations gives $\theta = (\mu, \sigma^2)'$ and<br />
$$f(x; \theta) = \prod_{i=1}^{n}\frac{1}{\sigma}\,\phi\left(\frac{x_i - \mu}{\sigma}\right).$$<br />
The p.d.f. evaluated at the data is the likelihood function $L(\theta; x)$; its logarithm<br />
$$l(\theta) = \log L(\theta; x)$$<br />
is the log-likelihood.<br />
• For i.i.d. observations with common p.d.f. $f(x; \theta)$ we have<br />
$$f(x; \theta) = \prod_{i=1}^{n} f(x_i; \theta), \text{ so } l(\theta) = \sum_{i=1}^{n}\log f(x_i; \theta).$$<br />
Viewed as a r.v., $l(\theta) = \sum_{i=1}^{n}\log f(X_i; \theta)$ is itself a sum of i.i.d.s.<br />
• The MLE $\hat{\theta}$ is the maximizer of the likelihood; intuitively it makes the observed data “most likely to have occurred”.<br />
• A more quantitative justification for the MLE is as follows. Let $\theta_0$ be the true value, and assume the $X_i$ are i.i.d. We will show that<br />
$$P_{\theta_0}\left(L(\theta_0; X) > L(\theta; X)\right) \to 1 \text{ as } n \to \infty, \text{ for any } \theta \ne \theta_0. \tag{20.1}$$<br />
By this, for large samples and with high probability, the (random) likelihood is maximized by the<br />
true parameter value, hence the maximizer of the (observed) likelihood should be a good estimate of this true value.<br />
Proof of (20.1): The inequality<br />
$$L(\theta_0; X) = \prod_{i=1}^{n} f(X_i; \theta_0) > \prod_{i=1}^{n} f(X_i; \theta) = L(\theta; X)$$<br />
is equivalent to<br />
$$-\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(X_i; \theta)}{f(X_i; \theta_0)} > 0.$$<br />
By the WLLN this average tends in probability to<br />
$$-E_{\theta_0}\left[\log\frac{f(X; \theta)}{f(X; \theta_0)}\right] > -\log E_{\theta_0}\left[\frac{f(X; \theta)}{f(X; \theta_0)}\right] \text{ (why?)}$$<br />
$$= -\log\int\frac{f(x; \theta)}{f(x; \theta_0)}\, f(x; \theta_0)\, dx = -\log\int f(x; \theta)\, dx = 0.$$<br />
And so ... ¤<br />
• The MLE is generally obtained as a root of the likelihood equation<br />
$$\dot{l}(\theta) = 0,$$<br />
where $\dot{l}(\theta) = \nabla l(\theta)$ denotes the gradient. There may be multiple roots in finite samples. Under reasonable conditions (studied in STAT 665) we have that any sequence $\hat{\theta}_n$ of roots is asymptotically normal:<br />
$$\sqrt{n}\left(\hat{\theta}_n - \theta\right) \xrightarrow{L} N\left(0,\ I^{-1}(\theta)\right),$$<br />
where<br />
$$I(\theta) = \lim_{n\to\infty}\frac{1}{n}E\left[\dot{l}(\theta)\,\dot{l}'(\theta)\right] \tag{20.2}$$<br />
is “Fisher’s Information matrix”. The practical interpretation is that<br />
$$\hat{\theta}_n \approx N\left(\theta,\ \frac{1}{n}I^{-1}(\theta)\right),$$<br />
i.e. with $\sigma_{jk}$ representing the $(j, k)$ element of $I^{-1}(\theta)$ (or of $I^{-1}(\hat{\theta}_n)$) we have the approximations<br />
$$\hat{\theta}_j \approx N\left(\theta_j,\ \frac{\sigma_{jj}}{n}\right), \quad \mathrm{cov}\left[\hat{\theta}_j, \hat{\theta}_k\right] \approx \frac{\sigma_{jk}}{n}.$$<br />
• The MLE has attractive large-sample optimality properties, to be established later. That derivation, and the example we look at next, use the following ‘regularity condition’: we suppose that we can differentiate the equation $1 = \int L(\theta; x)\, dx$ under the integral sign twice. (In particular, the limits of integration should not depend on $\theta$.) Then, writing $\dot{L}(\theta; x)$ and $\dot{l}(\theta; x)$ for the gradients we have<br />
$$0_{k\times 1} = \int\dot{L}(\theta; x)\, dx = \int\frac{\dot{L}(\theta; x)}{L(\theta; x)}\, L(\theta; x)\, dx = \int\dot{l}(\theta; x)\, f(x; \theta)\, dx = E_\theta\left[\dot{l}(\theta; X)\right]. \tag{20.3}$$<br />
Thus $E_\theta\left[\dot{l}(\theta)\right] = 0$. With $\ddot{l}(\theta; x)$ denoting the Hessian matrix we have<br />
$$0_{k\times k} = \frac{\partial}{\partial\theta}\int\dot{l}(\theta; x)\, f(x; \theta)\, dx = \int\ddot{l}(\theta; x)\, f(x; \theta)\, dx + \int\dot{l}(\theta; x)\,\dot{l}'(\theta; x)\, f(x; \theta)\, dx,$$<br />
so that<br />
$$\mathrm{cov}_\theta\left[\dot{l}(\theta)\right]\left(= E_\theta\left[\dot{l}(\theta)\,\dot{l}'(\theta)\right]\right) = E_\theta\left[-\ddot{l}(\theta)\right]. \tag{20.4}$$<br />
• If the observations are i.i.d. then<br />
$$\dot{l}(\theta) = \sum_{i=1}^{n}\nabla\log f(X_i; \theta)$$<br />
is a sum of i.i.d.s, each of which (by taking $n = 1$ in (20.3) and (20.4)) has a mean of $0$ and a covariance of<br />
$$E_\theta\left[\nabla\log f(X; \theta)\,\nabla'\log f(X; \theta)\right] = E_\theta\left[-\frac{\partial^2\log f(X; \theta)}{\partial\theta\,\partial\theta'}\right].$$<br />
Now (20.2) states that<br />
$$I(\theta) = \lim_{n\to\infty}\frac{1}{n}\,\mathrm{cov}\left[\dot{l}(\theta)\right] = \mathrm{cov}\left[\nabla\log f(X; \theta)\right],$$<br />
since this is the same for all $n$. Then by the CLT (next page),<br />
$$\frac{1}{\sqrt{n}}\,\dot{l}(\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\nabla\log f(X_i; \theta) \xrightarrow{L} N(0, I(\theta)). \tag{20.5}$$<br />
• We have used the multivariate CLT: if $Z_1, \dots, Z_n$ are i.i.d. r.vecs. with mean vector $\mu$ and covariance matrix $\Sigma$, then<br />
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(Z_i - \mu) \xrightarrow{L} N(0, \Sigma).$$<br />
(In STAT 665 we give a very elementary proof of this, which uses only the univariate CLT.) This was applied in (20.5) with $Z_i = \nabla\log f(X_i; \theta)$, $\mu = 0$ and $\Sigma = I(\theta)$.<br />
• Now here is an outline of the proof of asymptotic normality of the MLE. Expand the likelihood equation $\dot{l}(\hat{\theta}) = 0$ around the true value, with remainder $r_n$:<br />
$$0 = \dot{l}(\hat{\theta}) = \dot{l}(\theta) + \ddot{l}(\theta)\left(\hat{\theta} - \theta\right) + r_n.$$<br />
Rearrange this as<br />
$$\sqrt{n}\left(\hat{\theta} - \theta\right) = \left[-\frac{\ddot{l}(\theta)}{n}\right]^{-1}\left\{\frac{1}{\sqrt{n}}\,\dot{l}(\theta) + \frac{r_n}{\sqrt{n}}\right\}. \tag{20.6}$$<br />
We have (by the WLLN) that<br />
$$-\frac{\ddot{l}(\theta)}{n} = \frac{1}{n}\sum_{i=1}^{n}\left(-\frac{\partial^2\log f(X_i; \theta)}{\partial\theta\,\partial\theta'}\right) \xrightarrow{P} E_\theta\left[-\frac{\partial^2\log f(X; \theta)}{\partial\theta\,\partial\theta'}\right] = I(\theta),$$<br />
so that using (20.5) and Slutsky’s Theorem,<br />
$$\left[-\frac{\ddot{l}(\theta)}{n}\right]^{-1}\frac{1}{\sqrt{n}}\,\dot{l}(\theta) \xrightarrow{L} N\left(0,\ I^{-1}(\theta)\right).$$<br />
If $r_n/\sqrt{n} \xrightarrow{P} 0$ (it does, but this is where some work is required) then, again by Slutsky applied to (20.6),<br />
$$\sqrt{n}\left(\hat{\theta} - \theta\right) \xrightarrow{L} N\left(0,\ I^{-1}(\theta)\right).$$<br />
21. Asymptotics of ML estimation; Information Inequality<br />
• Example. Suppose $\{X_1, \dots, X_n\}$ is a sample from the gamma$(\alpha, \beta)$ density, with<br />
$$f(x; \theta) = \frac{(x/\beta)^{\alpha - 1}\, e^{-x/\beta}}{\beta\,\Gamma(\alpha)}, \quad 0 < x < \infty.$$<br />
Note that if $\theta = (\alpha, \beta) = (\nu/2, 2)$ then this is the $\chi_\nu^2$ density. If $\beta = \lambda^{-1}$, $\alpha = k$ it is the “Erlang” density - the density of the sum of $k$ i.i.d. $E(\lambda)$ r.v.s. The distribution of the r.v. $X$, where $X^2 \sim$ gamma$(m, \Omega/m)$, is known as the “Nakagami” distribution, and $m$ is the “fading parameter”; this is of interest in the theory of wireless transmissions.<br />
The log-likelihood is<br />
$$l(\theta) = \sum_{i=1}^{n}\log f(x_i; \theta) = \sum_{i=1}^{n}\left[(\alpha - 1)\left(\log x_i - \log\beta\right) - \frac{x_i}{\beta} - \log\beta - \log\Gamma(\alpha)\right]$$<br />
$$= n\left[(\alpha - 1)\left(\overline{\log x} - \log\beta\right) - \frac{\bar{x}}{\beta} - \log\beta - \log\Gamma(\alpha)\right],$$<br />
with gradient (components ordered as $\theta = (\beta, \alpha)$)<br />
$$\dot{l}(\theta) = n\begin{pmatrix} -\dfrac{\alpha}{\beta} + \dfrac{\bar{x}}{\beta^2} \\ \overline{\log x} - \log\beta - \psi(\alpha) \end{pmatrix},$$<br />
where $\psi(\alpha) = (d/d\alpha)\log\Gamma(\alpha)$ $\left(= E\left[\log(X/\beta)\right]\right)$<br />
is the “digamma” function.<br />
<br />
• The Newton-Raphson method for solving the likelihood equations is<br />
$$\theta_{m+1} = \theta_m - \ddot{l}^{-1}(\theta_m)\,\dot{l}(\theta_m),$$<br />
where<br />
$$\ddot{l}(\theta) = n\begin{pmatrix} \dfrac{\alpha}{\beta^2} - \dfrac{2\bar{x}}{\beta^3} & -\dfrac{1}{\beta} \\ -\dfrac{1}{\beta} & -\psi'(\alpha) \end{pmatrix}.$$<br />
• A commonly used alternative to Newton-Raphson is Fisher’s Method of Scoring. This involves replacing<br />
$$-\ddot{l}(\theta) = -\sum_{i=1}^{n}\frac{\partial^2\log f(X_i; \theta)}{\partial\theta\,\partial\theta'}$$<br />
by its expectation $nI(\theta)$ in N-R, to get the scheme<br />
$$\theta_{m+1} = \theta_m + \frac{1}{n}\,I^{-1}(\theta_m)\,\dot{l}(\theta_m).$$<br />
This is often more stable than N-R.<br />
• Starting values for iterative solution to the likelihood equations. Note that<br />
$$E_\theta\left[\dot{l}(\theta; X)\right] = 0 \Rightarrow E\left[\bar{X}\right] = \alpha\beta.$$<br />
Also, for instance by computing the m.g.f. and differentiating twice (or more simply by a direct integration), $E[X^2] = \alpha(\alpha + 1)\beta^2$. Thus $\mathrm{Var}[X] = \alpha\beta^2$ and so<br />
$$E[s^2] = \alpha\beta^2,$$<br />
where $s^2$ is the sample variance (the unbiasedness of $s^2$ is a simple calculation). The “method of moments” estimates $\theta_0 = (\beta_0, \alpha_0)'$ are now obtained by equating $\bar{x}$ and $s^2$ to their expectations and solving for the parameters:<br />
$$\beta_0 = \frac{s^2}{\bar{x}}, \quad \alpha_0 = \frac{\bar{x}^2}{s^2}\left(= \frac{\bar{x}}{\beta_0}\right).$$<br />
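The whole scheme, method-of-moments start followed by an iterative likelihood solve, can be sketched as follows. Rather than the full bivariate Newton-Raphson, this sketch uses the common profile reduction: given $\alpha$, the first likelihood equation gives $\hat{\beta} = \bar{x}/\alpha$, leaving the scalar equation $\log\alpha - \psi(\alpha) = \log\bar{x} - \overline{\log x}$ for $\alpha$. The digamma/trigamma implementations (recurrence plus asymptotic series) are implementation choices, not from the notes.<br />

```python
import math
import random

def digamma(x):
    s = 0.0
    while x < 6.0:                    # recurrence: psi(x) = psi(x + 1) - 1/x
        s -= 1.0 / x
        x += 1.0
    # asymptotic series, accurate for x >= 6
    return s + math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x ** 2) + 1.0 / (120 * x ** 4)

def trigamma(x):
    s = 0.0
    while x < 6.0:                    # recurrence: psi'(x) = psi'(x + 1) + 1/x^2
        s += 1.0 / x ** 2
        x += 1.0
    return s + 1.0 / x + 1.0 / (2 * x ** 2) + 1.0 / (6 * x ** 3) - 1.0 / (30 * x ** 5)

def fit_gamma(data):
    """MoM starting value, then Newton's method on the profile likelihood equation."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((v - xbar) ** 2 for v in data) / (n - 1)
    alpha = xbar ** 2 / s2            # method-of-moments start alpha_0
    c = math.log(xbar) - sum(math.log(v) for v in data) / n
    for _ in range(30):               # solve log(a) - psi(a) - c = 0
        f = math.log(alpha) - digamma(alpha) - c
        fp = 1.0 / alpha - trigamma(alpha)
        alpha -= f / fp
    return alpha, xbar / alpha        # (alpha_hat, beta_hat)

random.seed(7)
data = [random.gammavariate(3.0, 2.0) for _ in range(20000)]   # true (alpha, beta) = (3, 2)
alpha_hat, beta_hat = fit_gamma(data)
```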
• Method of moments: Define population moments $\mu_k = E\left[X^k\right]$ and estimates $\hat{\mu}_k = n^{-1}\sum_{i=1}^{n} X_i^k$. By the WLLN these are consistent:<br />
$$\hat{\mu}_k \xrightarrow{P} \mu_k \text{ as } n \to \infty.$$<br />
Then to estimate continuous functions<br />
$$\theta = g\left(\mu_1, \dots, \mu_r\right)$$<br />
of the population moments, the method of moments estimate<br />
$$\hat{\theta} = g\left(\hat{\mu}_1, \dots, \hat{\mu}_r\right)$$<br />
is also consistent. The proof is the same as in the univariate case:<br />
$$P\left(\left\|\hat{\theta} - \theta\right\| \ge \epsilon\right) = P\left(\left\|g(\hat{\mu}) - g(\mu)\right\| \ge \epsilon\right) \le P\left(\left\|\hat{\mu} - \mu\right\| \ge \delta\right) \to 0;$$<br />
here $\delta > 0$ is such that<br />
$$\left\|\hat{\mu} - \mu\right\| < \delta \Rightarrow \left\|g(\hat{\mu}) - g(\mu)\right\| < \epsilon,$$<br />
and its existence is guaranteed by the continuity of $g$.<br />
— An interesting aside: if $f(x; \lambda)$ is the density of an exponential r.v. with mean $1/\lambda$, given that it must be $\in [0, 1]$, then $f(x; \lambda) = \lambda e^{-\lambda x}/\left(1 - e^{-\lambda}\right)$ and the equations $l'(\lambda) = 0$ and ‘$\bar{x} = E[X]$’ turn out to be identical, so that the mle and method of moments estimator coincide.<br />
• More efficient, in fact almost as efficient as the MLE itself, are<br />
$$\beta_0 = n^{-1}\sum_i (x_i - \bar{x})\left(\log x_i - \overline{\log x}\right), \quad \alpha_0 = \frac{\bar{x}}{\beta_0},$$<br />
which are method of moments estimators arising from the observation that<br />
$$\mathrm{cov}[X, \log X] = \mathrm{cov}\left[X, \log\left(\frac{X}{\beta}\right)\right] = \beta,$$<br />
implying<br />
$$\beta = \mathrm{cov}[X, \log X], \quad \alpha = \frac{E[X]}{\beta}.$$<br />
Details in Wiens, Cheng (a former Stat 512 student) and Beaulieu 2003 at http://www.stat.ualberta.ca/˜wiens/.<br />
• The limit of the NR-process is the MLE $\hat{\theta}$, and<br />
$$\sqrt{n}\left(\hat{\theta} - \theta\right) \xrightarrow{L} N\left(0,\ I^{-1}(\theta)\right).$$<br />
The information matrix is<br />
$$I(\theta) = \lim_{n\to\infty}\frac{1}{n}E_\theta\left[-\ddot{l}(\theta)\right] = \begin{pmatrix} \dfrac{\alpha}{\beta^2} & \dfrac{1}{\beta} \\ \dfrac{1}{\beta} & \psi'(\alpha) \end{pmatrix},$$<br />
with<br />
$$I^{-1}(\theta) = \frac{1}{\alpha\psi'(\alpha) - 1}\begin{pmatrix} \beta^2\psi'(\alpha) & -\beta \\ -\beta & \alpha \end{pmatrix}.$$<br />
Then, e.g., the approximation to the distribution of $\hat{\beta}$ is<br />
$$\hat{\beta} \approx N\left(\beta,\ \frac{I^{11}(\theta)}{n}\right), \quad I^{11}(\theta) = \frac{\beta^2\psi'(\alpha)}{\alpha\psi'(\alpha) - 1}.$$<br />
We estimate the parameters in the variance, obtaining<br />
$$\frac{\hat{\beta} - \beta}{\sqrt{I^{11}\left(\hat{\alpha}, \hat{\beta}\right)/n}} \approx N(0, 1).$$<br />
Note that $\psi'(\alpha) = \mathrm{Var}\left[\log X\right]$ can also be consistently estimated by the sample variance of $\{\log x_i\}_{i=1}^{n}$.<br />
• We can now establish an asymptotic optimality property of the MLE. Suppose that the observations are i.i.d., and that differentiation under the integral sign, as above, is permissible. We aim to estimate a (scalar) function $g(\theta)$. The MLE $\hat{g}$ is defined to be $g(\hat{\theta})$, where $\hat{\theta}$ is the MLE for $\theta$. Recall that in studying variance stabilization (Lecture 9) we noted that in the single-parameter case, if $\hat{\theta}$ were asymptotically normal then so would be $g(\hat{\theta})$, with a mean of $g\left(\text{mean of }\hat{\theta}\right)$ and a variance of $\left[g'(\theta)\right]^2\cdot\left(\text{variance of }\hat{\theta}\right)$. The multi-parameter analogue (the “delta method”) is that<br />
$$\sqrt{n}\left(g(\hat{\theta}) - g(\theta)\right) \xrightarrow{L} N\left(0,\ \dot{g}'(\theta)\, I^{-1}(\theta)\,\dot{g}(\theta)\right),$$<br />
where $\dot{g}(\theta) = \nabla g(\theta)$ is the gradient.<br />
• Now let $T(X)$ be any unbiased estimator of $g(\theta)$, so that<br />
$$g(\theta) = E_\theta[T(X)] = \int T(x)\, L(\theta; x)\, dx.$$<br />
Thus<br />
$$\dot{g}(\theta) = \int T(x)\,\nabla L(\theta; x)\, dx = \int T(x)\,\dot{l}(\theta; x)\, L(\theta; x)\, dx = E_\theta\left[T(X)\,\dot{l}(\theta; X)\right] = E_\theta\left[\{T(X) - g(\theta)\}\,\dot{l}(\theta; X)\right],$$<br />
since<br />
$$E_\theta\left[g(\theta)\,\dot{l}(\theta; X)\right] = g(\theta)\, E_\theta\left[\dot{l}(\theta; X)\right] = 0.$$<br />
Then for any constant vector $c_{k\times 1}$ we have<br />
$$c'\dot{g}(\theta) = E_\theta\left[\{T(X) - g(\theta)\}\, c'\dot{l}(\theta; X)\right],$$<br />
and by the Cauchy-Schwarz inequality,<br />
$$\left[c'\dot{g}(\theta)\right]^2 \le E_\theta\left[\{T(X) - g(\theta)\}^2\right] E_\theta\left[\left\{c'\dot{l}(\theta; X)\right\}^2\right] = \mathrm{Var}_\theta[T(X)]\; c'\, E_\theta\left[\dot{l}(\theta)\,\dot{l}'(\theta)\right] c = n\,\mathrm{Var}_\theta[T(X)]\cdot c'\, I(\theta)\, c,$$<br />
the last step since $E_\theta\left[\dot{l}(\theta)\,\dot{l}'(\theta)\right] = n\, I(\theta)$ for an i.i.d. sample, by (20.2) and (20.4).<br />
i.e.<br />
$$\mathrm{Var}_\theta\left[\sqrt{n}\, T(X)\right] \ge \frac{\left|c'\dot{g}(\theta)\right|^2}{c'\, I(\theta)\, c}.$$<br />
Put $c = I^{-1/2}(\theta)\, t$ for arbitrary $t$ to get<br />
$$\mathrm{Var}_\theta\left[\sqrt{n}\, T(X)\right] \ge \frac{\left|t'\, I^{-1/2}(\theta)\,\dot{g}(\theta)\right|^2}{t't} \text{ for any } t,$$<br />
hence<br />
$$\mathrm{Var}_\theta\left[\sqrt{n}\, T(X)\right] \ge \max_{\|t\| = 1}\left|t'\, I^{-1/2}(\theta)\,\dot{g}(\theta)\right|^2 = \left\|I^{-1/2}(\theta)\,\dot{g}(\theta)\right\|^2 = \dot{g}'(\theta)\, I^{-1}(\theta)\,\dot{g}(\theta),$$<br />
which is the asymptotic variance of the (normalized) MLE $\sqrt{n}\left(g(\hat{\theta}) - g(\theta)\right)$. This is the Information Inequality, giving a lower bound on the variance of unbiased estimators. Since it is attained (in the limit) in the case of the MLE, we say the MLE is asymptotically efficient.<br />
22. Minimax M-estimation I<br />
• M-estimation of location. Suppose $X_1, \dots, X_n \overset{iid}{\sim} F(x - \theta)$, with density $f(x - \theta)$ (“location family”). If we know what $f$ is, then the MLE is defined by maximizing $\sum_i \log f(x_i - \theta)$, i.e. by solving<br />
$$0 = \frac{\partial}{\partial\theta}\sum_i \log f(x_i - \theta) = \sum_i \frac{-f'}{f}(x_i - \theta).$$<br />
More generally, a solution $\hat{\theta}_\psi$ to<br />
$$\sum_i \psi(x_i - \theta) = 0,$$<br />
for a suitable function $\psi$, is an “M-estimate” of location. Thus the MLE of location, from a known $f$, is an M-estimate with “score function”<br />
$$\psi(x) = \psi_f(x) = \frac{-f'(x)}{f(x)}.$$<br />
Quite generally,<br />
$$\sqrt{n}\left(\hat{\theta}_\psi - \theta\right) \xrightarrow{L} N\left(0,\ V(\psi, f) = \frac{E_f\left[\psi^2(X - \theta)\right]}{\left\{E_f\left[\psi'(X - \theta)\right]\right\}^2}\right).$$<br />
Here is an outline of why this is so. By the MVT,<br />
$$0 = \frac{1}{\sqrt{n}}\sum_i \psi\left(X_i - \hat{\theta}_\psi\right) = \frac{1}{\sqrt{n}}\sum_i \psi(X_i - \theta) - \left[\frac{1}{n}\sum_i \psi'(X_i - \theta)\right]\sqrt{n}\left(\hat{\theta}_\psi - \theta\right) + r_n,$$<br />
so<br />
$$\sqrt{n}\left(\hat{\theta}_\psi - \theta\right) = \frac{\frac{1}{\sqrt{n}}\sum_i \psi(X_i - \theta) + r_n}{\frac{1}{n}\sum_i \psi'(X_i - \theta)}.$$<br />
A natural assumption, and one that is made here, is that $E_f\left[\psi(X - \theta)\right] = 0$. Then by the CLT and WLLN,<br />
$$\frac{1}{\sqrt{n}}\sum_i \psi(X_i - \theta) \xrightarrow{L} N\left(0,\ E_f\left[\psi^2(X - \theta)\right]\right), \quad \frac{1}{n}\sum_i \psi'(X_i - \theta) \xrightarrow{P} E_f\left[\psi'(X - \theta)\right].$$<br />
If the remainder $r_n \xrightarrow{P} 0$ (showing this is where some work is required) then the result follows by Slutsky’s Theorem.<br />
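An M-estimate is straightforward to compute. The sketch below uses Huber’s score $\psi_c(u) = \max(-c, \min(c, u))$ (a standard robust choice, named here as an assumption since the notes have not yet specified a $\psi$), solving $\sum_i \psi(x_i - \theta) = 0$ by Newton’s method; the data, tuning constant $c$ and median starting value are arbitrary choices.<br />

```python
def huber_m_estimate(data, c=1.5, iters=50):
    """Solve sum_i psi_c(x_i - theta) = 0 by Newton's method, starting at a median."""
    xs = sorted(data)
    theta = xs[len(xs) // 2]
    for _ in range(iters):
        psi_sum = sum(max(-c, min(c, x - theta)) for x in data)
        # psi_c'(u) = 1 if |u| < c, else 0
        dpsi_sum = sum(1.0 for x in data if abs(x - theta) < c)
        if dpsi_sum == 0:
            break
        theta += psi_sum / dpsi_sum
    return theta

data = [0.1, -0.2, 0.3, 0.0, -0.1, 50.0]   # one gross outlier
theta_hat = huber_m_estimate(data)
mean = sum(data) / len(data)               # the outlier drags the mean far away
```

The clipped score bounds the outlier’s influence, so the estimate stays near the bulk of the data while the sample mean does not.<br />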
• The asymptotic variance does not depend on $\theta$; it is<br />
$$V(\psi, f) = \frac{\int_{-\infty}^{\infty} \psi^2(x)\, f(x)\, dx}{\left[\int_{-\infty}^{\infty} \psi'(x)\, f(x)\, dx\right]^2}.$$<br />
The denominator in $V(\psi, f)$ is the square of<br />
$$\int_{-\infty}^{\infty}\psi'(x)\, f(x)\, dx = \psi(x)f(x)\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty}\psi(x)\, f'(x)\, dx = \int_{-\infty}^{\infty}\psi(x)\,\psi_f(x)\, f(x)\, dx = E_f\left[\psi(X)\,\psi_f(X)\right].$$<br />
Here we use an assumption that $\psi(x)f(x) \to 0$ as $x \to \pm\infty$. Then, by Cauchy-Schwarz,<br />
$$\frac{1}{V(\psi, f)} = \frac{\left\{E_f\left[\psi(X)\,\psi_f(X)\right]\right\}^2}{E_f\left[\psi^2(X)\right]} \le \frac{E_f\left[\psi^2(X)\right] E_f\left[\psi_f^2(X)\right]}{E_f\left[\psi^2(X)\right]} = E_f\left[\psi_f^2(X)\right].$$<br />
190<br />
Thus $V(\psi, f)$ is minimized, for fixed $f$, by $\psi = \psi_f$. The minimum variance is the inverse of<br />
$$E_f\left[\psi_f^2(X)\right] = \int_{-\infty}^{\infty} \left[\frac{-f'(x)}{f(x)}\right]^2 f(x)\, dx = I(f);$$<br />
this is "Fisher information for location".<br />
• How might we choose $\psi$ if $f$ is not known? We take a minimax approach: we allow $f$ to be any member of some realistic class $\mathcal F$ of distributions (e.g. "approximately Normal" distributions), and aim to find a $\psi_0$ that minimizes the maximum variance:<br />
$$\max_{f \in \mathcal F} V(\psi_0, f) \le \max_{f \in \mathcal F} V(\psi, f) \quad\text{for any } \psi. \tag{22.1}$$<br />
We will show that the solution to this problem is to find an $f_0 \in \mathcal F$ that is "least favourable" in the sense of minimizing $I(f)$ in $\mathcal F$; we then protect ourselves against this worst case by using the MLE based on $f_0$, i.e. $\psi_0 = \psi_{f_0} = -f_0'/f_0$.<br />
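• (Numerical sanity check; not in the original notes. The quadrature grid and the competing score $\tanh$ are our own choices.) Under $f = \phi$ the optimal score $\psi_\phi(x) = x$ attains $V = 1/I(\phi) = 1$, and any other score does worse:<br />

```python
import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def V(psi, dpsi, f=phi, lo=-8.0, hi=8.0, m=16000):
    """V(psi, f) = int psi^2 f / (int psi' f)^2 by a midpoint Riemann sum."""
    dx = (hi - lo) / m
    xs = [lo + (i + 0.5) * dx for i in range(m)]
    num = sum(psi(x) ** 2 * f(x) for x in xs) * dx
    den = sum(dpsi(x) * f(x) for x in xs) * dx
    return num / den ** 2

v_score = V(lambda x: x, lambda x: 1.0)                     # psi = psi_phi
v_other = V(math.tanh, lambda x: 1.0 - math.tanh(x) ** 2)   # any other psi
print(v_score, v_other)
```

The first value is (numerically) $1 = 1/I(\phi)$, and the second is strictly larger, as the Cauchy-Schwarz argument above requires.<br />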
191<br />
• We will show that such a pair $(\psi_0, f_0)$ is a "saddlepoint solution":<br />
$$V(\psi_0, f) \le V(\psi_0, f_0) = \frac{1}{I(f_0)} \le V(\psi, f_0) \quad\text{for all } \psi \text{ and all } f \in \mathcal F. \tag{22.2}$$<br />
If (22.2) holds we have, for any $\psi$,<br />
$$\max_{f \in \mathcal F} V(\psi_0, f) = V(\psi_0, f_0) \le \max_{f \in \mathcal F} V(\psi, f),$$<br />
which is (22.1). Note that the equality in (22.2), and the second inequality, have already been established. Thus (22.2) holds iff $f_0$ satisfies the first inequality.<br />
• Assume that $\mathcal F$ is convex, in that for any $f_0, f_1 \in \mathcal F$ the density $f_\varepsilon = (1 - \varepsilon) f_0 + \varepsilon f_1$ ($0 \le \varepsilon \le 1$) is also in $\mathcal F$. The first inequality in (22.2) states that $V(\psi_0, f)$ is maximized by $f_0$; equivalently, that<br />
$$k(\varepsilon) = \frac{1}{V(\psi_0, f_\varepsilon)} = \frac{\left\{\int_{-\infty}^{\infty} \psi_0'(x)\, f_\varepsilon(x)\, dx\right\}^2}{\int_{-\infty}^{\infty} \psi_0^2(x)\, f_\varepsilon(x)\, dx}$$<br />
is minimized at $\varepsilon = 0$, for each $f_1 \in \mathcal F$. \qquad (22.3)<br />
192<br />
Note that<br />
$$\int_{-\infty}^{\infty} \psi_0'(x)\, f_\varepsilon(x)\, dx = (1 - \varepsilon)\int_{-\infty}^{\infty} \psi_0'(x)\, f_0(x)\, dx + \varepsilon\int_{-\infty}^{\infty} \psi_0'(x)\, f_1(x)\, dx$$<br />
is a linear function of $\varepsilon$; so too is the denominator of $k(\varepsilon)$.<br />
• Lemma: If $u(\varepsilon), v(\varepsilon)$ are linear functions of $\varepsilon \in [0, 1]$, and $v(\varepsilon) > 0$, then $w(\varepsilon) = u^2(\varepsilon)/v(\varepsilon)$ is convex:<br />
$$w\left((1 - \alpha)\varepsilon_1 + \alpha\varepsilon_2\right) \le (1 - \alpha)\, w(\varepsilon_1) + \alpha\, w(\varepsilon_2)$$<br />
for $\varepsilon_1, \varepsilon_2 \in [0, 1]$.<br />
Proof: Using $u'' = v'' = 0$ we get<br />
$$w'' = \frac{2}{v^3}\left(u' v - u v'\right)^2 \ge 0. \qquad \square$$<br />
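• (A brute-force check of the Lemma, not in the original notes; the linear pair below is chosen arbitrarily, with $v > 0$ on $[0,1]$.)<br />

```python
# Midpoint-convexity check of w(e) = u(e)^2 / v(e) for linear u, v with v > 0.
u = lambda e: 1.0 + 2.0 * e   # arbitrary linear function
v = lambda e: 2.0 - e         # linear and strictly positive on [0, 1]
w = lambda e: u(e) ** 2 / v(e)

grid = [i / 100 for i in range(101)]
ok = all(
    w(0.5 * (e1 + e2)) <= 0.5 * (w(e1) + w(e2)) + 1e-12   # small float slack
    for e1 in grid for e2 in grid
)
print(ok)   # True
```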
193<br />
• By the Lemma, $k(\varepsilon)$ in (22.3) is convex, so it is minimized at $\varepsilon = 0$ iff $k'(0) \ge 0$ for each $f_1 \in \mathcal F$.<br />
• In the notation of the Lemma we have<br />
$$k'(0) = \frac{2u(0)}{v(0)}\, u'(0) - \left(\frac{u(0)}{v(0)}\right)^2 v'(0), \quad\text{with}$$<br />
$$u(0) = \int_{-\infty}^{\infty} \psi_0'\, f_0\, dx = \psi_0 f_0\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} \psi_0\, f_0'\, dx = \int_{-\infty}^{\infty} \psi_0^2\, f_0\, dx = I(f_0)$$<br />
(integration by parts again), and<br />
$$v(0) = \int_{-\infty}^{\infty} \psi_0^2\, f_0\, dx = I(f_0).$$<br />
Thus<br />
$$k'(0) = 2u'(0) - v'(0) = \int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(f_1 - f_0\right) dx.$$<br />
We have shown that $V(\psi_0, f)$ is maximized by $f_0$ iff, for all $f_1 \in \mathcal F$,<br />
$$\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(f_1 - f_0\right) dx \ge 0. \tag{22.4}$$<br />
194<br />
• Now consider the companion problem of minimizing $I(f)$ in $\mathcal F$. The function<br />
$$J(\varepsilon) = I(f_\varepsilon) = \int_{-\infty}^{\infty} \left(\frac{-f_\varepsilon'(x)}{f_\varepsilon(x)}\right)^2 f_\varepsilon(x)\, dx = \int_{-\infty}^{\infty} \frac{\left(f_\varepsilon'(x)\right)^2}{f_\varepsilon(x)}\, dx$$<br />
is convex. This is because, by the Lemma, its integrand $w_\varepsilon(x)$ is convex in $\varepsilon$ for each $x$ (with $u = f_\varepsilon'(x)$ and $v = f_\varepsilon(x)$, both linear in $\varepsilon$); thus for any $\varepsilon_1, \varepsilon_2 \in [0, 1]$<br />
$$w_{(1-\alpha)\varepsilon_1 + \alpha\varepsilon_2}(x) \le (1 - \alpha)\, w_{\varepsilon_1}(x) + \alpha\, w_{\varepsilon_2}(x);$$<br />
integrating this gives<br />
$$J\left((1 - \alpha)\varepsilon_1 + \alpha\varepsilon_2\right) \le (1 - \alpha)\, J(\varepsilon_1) + \alpha\, J(\varepsilon_2).$$<br />
Thus $I(f)$ is minimized by $f_0$ iff, for each $f_1 \in \mathcal F$,<br />
$$0 \le J'(0) = \int_{-\infty}^{\infty} \frac{d}{d\varepsilon}\bigg|_{\varepsilon = 0} \frac{\left(f_\varepsilon'\right)^2}{f_\varepsilon}\, dx = \int_{-\infty}^{\infty} \left[\frac{2 f_0' \left(f_1' - f_0'\right)}{f_0} - \left(\frac{f_0'}{f_0}\right)^2 \left(f_1 - f_0\right)\right] dx$$<br />
$$= \int_{-\infty}^{\infty} \left[-2\psi_0\left(f_1' - f_0'\right) - \psi_0^2\left(f_1 - f_0\right)\right] dx = \int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(f_1 - f_0\right) dx,$$<br />
the last step by integration by parts.<br />
195<br />
• By comparison with (22.4) we have that the following are equivalent:<br />
1. $\left(\psi_0 = -f_0'/f_0,\ f_0\right)$ is a saddlepoint solution to the minimax problem;<br />
2. $V(\psi_0, f)$ is maximized by $f_0$;<br />
3. $I(f)$ is minimized in $\mathcal F$ by $f_0$;<br />
4. $\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(f_1 - f_0\right) dx \ge 0$ for all $f_1 \in \mathcal F$.<br />
196<br />
23. Minimax M-estimation II<br />
• By the preceding, we are to minimize $I(f)$ in $\mathcal F$ and then put $\psi_0 = -f_0'/f_0$. We must now specify a "reasonable" class $\mathcal F$. A commonly used one is the "gross errors" class<br />
$$\mathcal F = \left\{f \mid f(x) = (1 - \varepsilon)\, \phi(x) + \varepsilon\, h(x)\right\},$$<br />
where $\phi(x)$ is the Normal density and $h(x)$ is an arbitrary (but symmetric) density.<br />
— Why symmetric? We need $E_f\left[\psi(X - \theta)\right] = \int_{-\infty}^{\infty} \psi(x)\, f(x)\, dx = 0$ for all $f \in \mathcal F$; this is guaranteed if $f$ is even and, as will turn out to be the case, $\psi_0$ is odd and bounded.<br />
• The interpretation is that $100(1 - \varepsilon)\%$ of the observations are Normally distributed; the remainder come from an unknown population. For this model we have $f_1 - f_0 = \varepsilon\left(h_1 - h_0\right)$, and so we are to find $f_0$ satisfying<br />
$$\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(h_1 - h_0\right) dx \ge 0 \quad\text{for all } h_1. \tag{23.1}$$<br />
We note that<br />
$$\psi_\phi(x) = \frac{-\phi'(x)}{\phi(x)} = -\frac{d}{dx}\log\phi(x) = x,$$<br />
and<br />
$$2\psi_\phi'(x) - \psi_\phi^2(x) = 2 - x^2.$$<br />
197<br />
Condition (23.1) states that $\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right) h\, dx$ is to be minimized by $h = h_0$. But $\psi_0$ depends on $h_0$. We conjecture that:<br />
1. The density $h_0$ must place all of its mass where $2\psi_0' - \psi_0^2$ is a minimum;<br />
2. On this set, $2\psi_0' - \psi_0^2$ is to be constant, say $= -k^2$ for some $k$.<br />
We will first verify that these conditions ensure a minimax solution, and then verify that there is a density $h_0$ that has these properties.<br />
• A clue to the form of the set in 2. above is provided by the behaviour of $2\psi_\phi'(x) - \psi_\phi^2(x) = 2 - x^2$, which is smallest for $|x|$ large.<br />
198<br />
• Suppose then that we can construct a density $f_0$ in such a way that<br />
$$f_0(x) = \begin{cases} (1 - \varepsilon)\, \phi(x), & |x| \le a, \\ (1 - \varepsilon)\, \phi(x) + \varepsilon\, h_0(x), & |x| \ge a; \end{cases} \qquad \psi_0(x) = \begin{cases} x, & |x| \le a, \\ \text{a solution to } 2\psi_0' - \psi_0^2 = -k^2, & |x| \ge a; \end{cases}$$<br />
and with $\psi_0 = -f_0'/f_0$ on $|x| \ge a$. Suppose also that<br />
$$-k^2 \le 2 - a^2,$$<br />
so that $2\psi_0' - \psi_0^2$ attains its minimum (of $-k^2$) on the set $|x| \ge a$. Then, using that $h_0$ vanishes on $|x| \le a$,<br />
$$\int_{-\infty}^{\infty} \left(2\psi_0' - \psi_0^2\right)\left(h_1 - h_0\right) dx = \int_{|x| \le a} \left(2 - x^2\right) h_1\, dx + \int_{|x| \ge a} \left(-k^2\right)\left(h_1 - h_0\right) dx$$<br />
$$\ge -k^2\left[\int_{|x| \le a} h_1\, dx + \int_{|x| \ge a} \left(h_1 - h_0\right) dx\right] = -k^2\left[\int_{-\infty}^{\infty} h_1\, dx - \int_{|x| \ge a} h_0\, dx\right] = 0,$$<br />
since $h_1$ and $h_0$ each integrate to one and $h_0$ has all of its mass on $|x| \ge a$.<br />
199<br />
• A solution (there are three) to $2\psi_0' - \psi_0^2 = -k^2$ is $\psi_0(x) = \operatorname{sgn}(x) \cdot k$, implying $f_0(x) \propto e^{-k|x|}$. This leads to<br />
$$f_0(x) = \begin{cases} (1 - \varepsilon)\, \phi(x), & |x| \le a, \\ (1 - \varepsilon)\, \phi(a)\, e^{-a\left(|x| - a\right)}, & |x| \ge a; \end{cases} \qquad \psi_0(x) = \begin{cases} x, & |x| \le a, \\ \operatorname{sgn}(x) \cdot a, & |x| \ge a; \end{cases}$$<br />
with $k = a$. Note that $f_0$ and $\psi_0$ are continuous, that $\psi_0 = -f_0'/f_0$, and that $-k^2 = -a^2 \le 2 - a^2$. It remains only to show that $f_0 \in \mathcal F$, i.e. that $f_0(x) = (1 - \varepsilon)\, \phi(x) + \varepsilon\, h_0(x)$ for some density $h_0$. It is left as an exercise to show that the function $h_0$ defined by this relationship is non-negative, and that a unique $a = a(\varepsilon)$ can be found such that $\int_{-\infty}^{\infty} f_0(x)\, dx = 1$.<br />
• This function $\psi_0(x)$ is the famous "Huber's psi function". The theory given here extends very simply to the case in which $\phi$ is replaced by any other "strongly unimodal" density - one for which $\psi_f(x)$ is an increasing function of $x$. Details are in<br />
200<br />
the landmark paper Huber, P. J. (1964), “Robust<br />
Estimation of a Location Parameter,” The Annals<br />
of Mathematical Statistics, 35, 73-101.<br />
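• (Sketch of the normalization exercise, not in the original notes.) Integrating the two pieces of $f_0$ gives $\int f_0\, dx = (1 - \varepsilon)\left[(2\Phi(a) - 1) + 2\phi(a)/a\right]$, so $a = a(\varepsilon)$ solves $2\phi(a)/a - 2\Phi(-a) = \varepsilon/(1 - \varepsilon)$; the left side is strictly decreasing in $a$, so bisection applies (the bracketing interval and tolerance below are arbitrary choices):<br />

```python
import math

def phi(x):   # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):   # standard normal c.d.f., via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def huber_a(eps, tol=1e-12):
    """Solve 2*phi(a)/a - 2*Phi(-a) = eps/(1-eps) for a by bisection.

    The left side decreases from +infinity (a -> 0+) to 0 (a -> infinity),
    so a unique root exists for every 0 < eps < 1.
    """
    g = lambda a: 2 * phi(a) / a - 2 * Phi(-a) - eps / (1 - eps)
    lo, hi = 1e-8, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a = huber_a(0.05)
# check the normalization integral of f_0 directly:
total = (1 - 0.05) * ((2 * Phi(a) - 1) + 2 * phi(a) / a)
print(a, total)
```

For $\varepsilon = 0.05$ this gives $a \approx 1.4$, consistent with the standard tabulated values for Huber's estimator.<br />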
• Extensions to regression are immediate. An M-estimate of regression is a minimizer of<br />
$$\sum_i \rho\left(Y_i - \mathbf x_i' \boldsymbol\theta\right)$$<br />
or, with $\psi = \rho'$, a solution to<br />
$$\sum_i \mathbf x_i\, \psi\left(Y_i - \mathbf x_i' \boldsymbol\theta\right) = \mathbf 0.$$<br />
If $\rho(x) = x^2/2$, $\psi(x) = x$, then this becomes<br />
$$\sum_i \mathbf x_i Y_i = \sum_i \mathbf x_i \mathbf x_i'\, \boldsymbol\theta.$$<br />
The solution is<br />
$$\hat{\boldsymbol\theta} = \left[\sum_i \mathbf x_i \mathbf x_i'\right]^{-1} \sum_i \mathbf x_i Y_i = \left(\mathbf X' \mathbf X\right)^{-1} \mathbf X' \mathbf y,$$<br />
the LSE. In general Newton-Raphson, or Iteratively Reweighted Least Squares (IRLS), can be used to obtain the solution.<br />
• The asymptotic normality result is that<br />
$$\sqrt n\left(\hat{\boldsymbol\theta} - \boldsymbol\theta\right) \overset{L}{\to} N\left(\mathbf 0,\ V(\psi, f)\left(\mathbf X' \mathbf X / n\right)^{-1}\right)$$<br />
201<br />
if the errors have a density $f$, symmetric around 0. Here $V(\psi, f)$ is as before, so that the same minimax results as derived here can be applied.<br />
• IRLS: Write the equations as<br />
$$\mathbf 0 = \sum_i \mathbf x_i\, \frac{\psi\left(Y_i - \mathbf x_i' \boldsymbol\theta\right)}{Y_i - \mathbf x_i' \boldsymbol\theta}\left(Y_i - \mathbf x_i' \boldsymbol\theta\right) = \sum_i \left[\mathbf x_i \cdot w_i \cdot \left(Y_i - \mathbf x_i' \boldsymbol\theta\right)\right] = \mathbf X' \mathbf W \mathbf y - \mathbf X' \mathbf W \mathbf X\, \boldsymbol\theta$$<br />
for $\mathbf W = \operatorname{diag}\left(w_1, \ldots, w_n\right)$ and weights $w_i = w_i(\boldsymbol\theta) = \psi\left(Y_i - \mathbf x_i' \boldsymbol\theta\right)/\left(Y_i - \mathbf x_i' \boldsymbol\theta\right)$ depending on the parameters. "Solve" these equations:<br />
$$\boldsymbol\theta = \left(\mathbf X' \mathbf W \mathbf X\right)^{-1} \mathbf X' \mathbf W \mathbf y;$$<br />
use this value to re-calculate the weights; iterate to convergence. Thus the $(m+1)$st step is a weighted least squares regression using weights $w_i\left(\boldsymbol\theta_m\right)$ computed from the residuals at the previous step.<br />
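• (A minimal IRLS sketch, not in the original notes, for Huber's $\psi$ with $p = 2$ parameters. The tuning constant $a = 1.345$, the fabricated data-generating model, and the fixed iteration count are all our assumptions.)<br />

```python
import random

def huber_psi(r, a=1.345):
    return max(-a, min(a, r))

def irls(X, y, a=1.345, iters=50):
    """IRLS for Huber regression with p = 2 columns.

    Each pass is a weighted LS fit, with weights w_i = psi(r_i)/r_i
    computed from the residuals of the previous pass.
    """
    theta = [0.0, 0.0]
    for _ in range(iters):
        r = [yi - xi[0] * theta[0] - xi[1] * theta[1]
             for xi, yi in zip(X, y)]
        w = [1.0 if abs(ri) < 1e-12 else huber_psi(ri, a) / ri for ri in r]
        # normal equations X'WX theta = X'Wy, solved explicitly (2 x 2)
        s11 = sum(wi * xi[0] * xi[0] for wi, xi in zip(w, X))
        s12 = sum(wi * xi[0] * xi[1] for wi, xi in zip(w, X))
        s22 = sum(wi * xi[1] * xi[1] for wi, xi in zip(w, X))
        b1 = sum(wi * xi[0] * yi for wi, xi, yi in zip(w, X, y))
        b2 = sum(wi * xi[1] * yi for wi, xi, yi in zip(w, X, y))
        det = s11 * s22 - s12 * s12
        theta = [(s22 * b1 - s12 * b2) / det,
                 (s11 * b2 - s12 * b1) / det]
    return theta

random.seed(3)
n = 200
X = [[1.0, random.uniform(-2, 2)] for _ in range(n)]
y = [1.0 + 2.0 * x[1] + random.gauss(0, 0.5) for x in X]
for i in range(0, n, 25):                # plant a few gross errors
    y[i] += 20.0
theta_huber = irls(X, y)
theta_ls = irls(X, y, a=1e12, iters=1)   # huge a => weights 1 => LSE
print(theta_huber, theta_ls)
```

The Huber fit stays near the true $(1, 2)$ while the LSE intercept is pulled up by the contamination. (In practice the residuals would first be standardized by a robust scale estimate; that refinement is omitted here.)<br />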
202<br />
24. Measure and Integration<br />
• Recall the definition of a probability space: the basic components are a set $\Omega$ ("outcomes"), a "Borel field" or "$\sigma$-algebra" $\mathcal F$ of subsets ("events"; $\mathcal F$ contains $\Omega$ and is closed under complementation and countable unions) and a measure $P$ assigning probability $P(A)$ to events $A \in \mathcal F$.<br />
• Let $\Omega = (0, 1]$, the unit interval, and start with subintervals $(a, b]$. Define a (probability) measure by $P\left((a, b]\right) = b - a$. This extends to the set $\mathcal B_0$ of finite disjoint unions and complements of such intervals in the obvious way.<br />
• Now consider $\mathcal B = \sigma\left(\mathcal B_0\right)$, the smallest $\sigma$-algebra containing $\mathcal B_0$ (i.e., the intersection of all of them). One can extend the measure $P$ on $\mathcal B_0$ to the $\sigma$-algebra $\mathcal B$. Formally, define the outer measure of a set $A \subset \Omega$ by<br />
$$P^*(A) = \inf \sum_n P\left(A_n\right),$$<br />
203<br />
where the infimum is over all sequences $\left\{A_n\right\}$ in $\mathcal B_0$ satisfying $A \subset \cup_n A_n$. If, for any $E \subset \Omega$, we then have $P^*(A \cap E) + P^*\left(A^c \cap E\right) = P^*(E)$, we say that $A$ is Lebesgue measurable, with Lebesgue measure $P^*(A)$. (The condition implies in particular that $P^*(A) + P^*\left(A^c\right) = 1$.)<br />
• It can be shown that the Lebesgue measurable sets include $\sigma\left(\mathcal B_0\right)$, and that Lebesgue measure agrees with $P$ whenever both are defined. In particular the Lebesgue measure of an interval is its length. The restriction of Lebesgue measure to $\sigma\left(\mathcal B_0\right)$ is called Borel measure, and the sets in $\sigma\left(\mathcal B_0\right)$ are Borel measurable.<br />
• The measure $P^*$ can in turn be extended from $\mathcal B$ through a process of completion, essentially by appending to $\mathcal B$ all subsets of those sets with measure zero. The resulting measure space is complete, in that if a set $A$ is Lebesgue measurable and has measure zero, then the same is true of any subset of $A$.<br />
204<br />
• Example: The set $A$ of rational numbers in $(0, 1]$ has Lebesgue measure $P^*(A) = 0$. This is because we can enumerate the rationals: $A = \left\{a_1, a_2, \ldots\right\}$, and then if<br />
$$A_n = \left(a_n - \delta\, 2^{-(n+1)},\ a_n + \delta\, 2^{-(n+1)}\right) \cap (0, 1]$$<br />
we have $A \subset \cup_n A_n$ and $P^*(A) \le \sum_n P\left(A_n\right) \le \delta$, for arbitrary $\delta > 0$.<br />
• One can carry out these constructions for more general sets $\Omega$. Starting with $\Omega = \mathbb R$ and intervals $(a, b]$ as above results in Lebesgue measure on the real line.<br />
• Now let $(\Omega, \mathcal F, \mu)$ be any measure space, so that $\mathcal F$ is a $\sigma$-algebra and $\mu$ a measure (Lebesgue measure, counting measure, ...). Here we define the integral, written<br />
$$\int f\, d\mu = \int_\Omega f(\omega)\, d\mu(\omega) = \int_\Omega f(\omega)\, \mu(d\omega).$$<br />
205<br />
One approach to this is to start with simple, non-negative functions $f = \sum_i c_i 1_{A_i}$, where $\Omega = \cup_{i=1}^n A_i$; in this case define $\int f\, d\mu = \sum_i c_i\, \mu\left(A_i\right)$. (Then, e.g., if $f = c$ on $(a, b]$ and zero elsewhere, and $\mu$ is Lebesgue measure, this gives $\int f\, d\mu = c\,(b - a)$, in agreement with the R-integral.) Now any non-negative function $f$ can be represented as an increasing limit ($f_n \uparrow f$) of simple functions; one then defines $\int f\, d\mu = \lim_n \int f_n\, d\mu$.<br />
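• (A concrete instance of this construction, not in the original notes; the choice $f(x) = x^2$ is ours.) Take $f(x) = x^2$ on $\Omega = (0, 1]$ with Lebesgue measure $\mu$, and the standard simple approximants $f_n = 2^{-n}\left\lfloor 2^n f \right\rfloor$ (capped at $n$, though the cap never binds here since $f \le 1$). Each level set of $f_n$ is an interval, so $\int f_n\, d\mu$ is an exact finite sum:<br />

```python
# Approximate the Lebesgue integral of f(x) = x^2 over (0, 1] by the
# increasing simple functions f_n = floor(2^n f)/2^n.
# The level set {f_n = k/2^n} is the interval [sqrt(k/2^n), sqrt((k+1)/2^n)),
# whose Lebesgue measure is the difference of the endpoints.
import math

def integral_simple(n):
    total, levels = 0.0, 2 ** n
    for k in range(levels):
        c = k / levels                       # value of f_n on this level set
        lo = math.sqrt(k / levels)
        hi = min(1.0, math.sqrt((k + 1) / levels))
        total += c * (hi - lo)               # c * mu(level set)
    return total

vals = [integral_simple(n) for n in (2, 4, 8, 12)]
print(vals)   # increases toward 1/3
```

The values increase with $n$ toward $\int_0^1 x^2\, dx = 1/3$, exactly as the definition $\int f\, d\mu = \lim_n \int f_n\, d\mu$ requires.<br />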
• To extend to arbitrary functions, define<br />
$$f^+(\omega) = \max\left(f(\omega), 0\right), \qquad f^-(\omega) = \max\left(-f(\omega), 0\right).$$<br />
Then $f = f^+ - f^-$ is the difference of non-negative functions, and one defines<br />
$$\int f\, d\mu = \int f^+\, d\mu - \int f^-\, d\mu$$<br />
(unless one of these is $\infty$, in which case we say that $f$ is "not integrable", although one still assigns a value to $\int f\, d\mu$ by adopting the convention $c \pm \infty = \pm\infty$ for finite $c$). Finally, for sets $A \in \mathcal F$ one defines $\int_A f\, d\mu = \int f\, 1_A\, d\mu$.<br />
206<br />
• Some properties of the integral:<br />
1. If $f$ and $g$ are integrable and $f \le g$ a.e. ("almost everywhere"; i.e. except on a set with measure zero), then $\int f\, d\mu \le \int g\, d\mu$. (Then $|f| = f^+ + f^-$ is integrable, and since $-|f| \le f \le |f|$ we have that $\left|\int f\, d\mu\right| \le \int |f|\, d\mu$.)<br />
2. Monotone convergence: If $0 \le f_n \uparrow f$ a.e., then $\int f_n\, d\mu \uparrow \int f\, d\mu$. (Note that $\int f\, d\mu$ is defined for all non-negative functions $f$.)<br />
3. Dominated convergence: If $\left|f_n\right| \le g$ a.e., where $g$ is integrable (i.e. $\int g\, d\mu$, which necessarily exists, is finite), and if $f_n \to f$ a.e., then $f$ and the $f_n$ are integrable and $\int f_n\, d\mu \to \int f\, d\mu$.<br />
4. Bounded convergence: if $\mu(\Omega) < \infty$ and the $f_n$ are uniformly bounded (i.e. $\left|f_n(\omega)\right| \le K$ for all $n$ and all $\omega$), then $f_n \to f$ a.e. implies $\int f_n\, d\mu \to \int f\, d\mu$.<br />
207<br />
• If $\mu$ is Lebesgue measure then the integral defined above is the Lebesgue integral. Example: The function $f = 1_A$ is zero everywhere except on the set $A$ of rationals in $(0, 1]$, i.e. almost everywhere. By the above, $\int f\, d\mu = 0$; recall that this is an example in which the R-integral does not exist. When the R-integral does exist, it has the same value as the Lebesgue integral.<br />
• Now let $(\Omega, \mathcal F, P)$ be a probability space; a (finite) random variable is a function $X: \Omega \to \mathbb R$ such that $X^{-1}(B) \in \mathcal F$ for any Borel measurable set $B$. Equivalently (proof at end), inverses of open sets are events.<br />
• A function $g(X)$ is also a r.v. under a certain condition. For $B \in \mathcal B$, we have<br />
$$(g \circ X)^{-1}(B) = X^{-1}\left(g^{-1}(B)\right) \in \mathcal F$$<br />
as long as $g^{-1}(B) \in \mathcal B$, i.e. $g$ must be Borel measurable - a function for which the inverses of Borel sets (or just open sets) are Borel sets.<br />
208<br />
• Any r.v. $X$ induces a probability space $(\mathbb R, \mathcal B, \mu_X)$ via<br />
$$\mu_X(B) = P\left(X^{-1}(B)\right) = P(X \in B) \quad\text{for } B \in \mathcal B.$$<br />
This measure $\mu_X$ is the probability measure (p.m.) of $X$, and the associated distribution function is defined by<br />
$$F_X(x) = \mu_X\left((-\infty, x]\right) = P(X \le x).$$<br />
• If there is a function $f_X(\cdot)$ and a measure space $(\mathbb R, \mathcal B, \lambda)$ with<br />
$$\mu_X(B) = P(X \in B) = \int_B f_X(x)\, d\lambda(x),$$<br />
we say that $f_X$ is the density of $X$ (or of $\mu_X$) w.r.t. $\lambda$, and that $\mu_X$ is "absolutely continuous" w.r.t. $\lambda$. The most common cases are $\lambda =$ Lebesgue measure (in which case we say that $X$ is a continuous r.v., and then $F_X'(x)$ exists and equals $f_X(x)$ a.e.) and $\lambda =$ counting measure, in which case $f_X(x) = P(X = x)$ and $X$ is discrete.<br />
209<br />
• The expected value of a r.v. $X$ is defined by the integral $E[X] = \int_\Omega X(\omega)\, dP(\omega)$, and can be evaluated by transforming to the p.m., i.e.<br />
$$E[g(X)] = \int_{\mathbb R} g(x)\, d\mu_X(x); \tag{24.1}$$<br />
this in turn equals the R-S integral $\int_{-\infty}^{\infty} g(x)\, dF_X(x)$ whenever the latter exists. The proof of (24.1) consists of showing that both sides agree for simple functions (for instance when $g = 1_B$ both sides equal $P(X \in B)$) and extending to general Borel functions $g$ by monotonicity, etc.<br />
210<br />
• Basic properties of expectations are inherited from those of the integral. Some particular ones are:<br />
1. $E[X]$ exists iff $E[|X|]$ exists, and then $|E[X]| \le E[|X|]$.<br />
2. Monotone convergence: If $X_n \ge 0$ and $X_n \uparrow X$ a.e., then $E\left[X_n\right] \to E[X]$ (which might $= \infty$).<br />
3. Dominated convergence: If $X_n \to X$ a.e. (or just $X_n \overset{P}{\to} X$) and $\forall n$, $\left|X_n\right| \le Y$ with $E[Y] < \infty$, then $E\left[X_n\right] \to E[X]$.<br />
4. Bounded convergence: this is "Dominated convergence" with $Y = K$, a constant.<br />
• Example: Suppose we estimate a bounded, continuous function $g(\mu)$ of a population mean $\mu$ by $g\left(\bar X_n\right)$, where $\bar X_n$ is an average based on a sample of size $n$. We have that $\bar X_n \overset{P}{\to} \mu$ by the WLLN, so $g\left(\bar X_n\right) \overset{P}{\to} g(\mu)$ by continuity; then $E\left[g\left(\bar X_n\right)\right] \to E[g(\mu)] = g(\mu)$ by bounded convergence.<br />
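• (Simulation of this example, not in the original notes; the Exponential(1) data, so that $\mu = 1$, and the bounded continuous $g(t) = 1/(1 + t^2)$ are arbitrary choices of ours.)<br />

```python
import random

g = lambda t: 1.0 / (1.0 + t * t)   # bounded and continuous
mu = 1.0                            # mean of Exponential(1)

random.seed(4)

def mean_g_of_xbar(n, reps=2000):
    """Monte Carlo estimate of E[g(Xbar_n)] for Exponential(1) samples."""
    acc = 0.0
    for _ in range(reps):
        xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
        acc += g(xbar)
    return acc / reps

small, large = mean_g_of_xbar(5), mean_g_of_xbar(400)
print(small, large, g(mu))
```

For the larger sample size, $E\left[g\left(\bar X_n\right)\right]$ is visibly closer to $g(\mu) = 1/2$, as bounded convergence guarantees in the limit.<br />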
211<br />
• It was stated above that if $X$ is a function on $(\Omega, \mathcal F, P)$, mapping into $(\mathbb R, \mathcal B)$, and $X^{-1}(B) \in \mathcal F$ for every open set $B$ in $\mathbb R$ (recall this was our original definition of a r.v.), then $X^{-1}(B) \in \mathcal F$ for every Borel measurable set $B$ (our current definition of a r.v.). To see this, first note that if $X^{-1}(B) \in \mathcal F$ for every open set $B$, then this holds as well for every interval $B = (a, b]$ ($= \cap_n (a, b + 1/n)$, a countable intersection of open sets). It then holds as well for the set $\mathcal B_0$ of finite disjoint unions and complements of such intervals. The property can finally be extended to $\mathcal B = \sigma\left(\mathcal B_0\right)$ through the Monotone Class Theorem. A class $\mathcal M$ of subsets of $\mathbb R$ is monotone if, for sequences $\left\{B_n\right\}$ in $\mathcal M$,<br />
$$B_1 \subset B_2 \subset \cdots \Rightarrow \cup_n B_n \in \mathcal M \quad\text{and}\quad B_1 \supset B_2 \supset \cdots \Rightarrow \cap_n B_n \in \mathcal M.$$<br />
The theorem states that if $\mathcal M$ is a monotone class, then $\mathcal B_0 \subset \mathcal M$ implies $\sigma\left(\mathcal B_0\right) \subset \mathcal M$. So it suffices to verify that the class $\mathcal M$ of subsets $B$ for which $X^{-1}(B) \in \mathcal F$ is monotone. This is straightforward.<br />