
MT580 Mathematics for Statistics:
"All the Math You Never Had"

Jenny A. Baglivo
Mathematics Department
Boston College
Chestnut Hill, MA 02467-3806
(617) 552 3772
jenny.baglivo@bc.edu

Third Draft, Summer 2006

MT580 is an introduction to probability, discrete and continuous methods, and linear algebra for graduate students in applied statistics with little or no formal training in these fields. Applications in statistics are emphasized throughout.

The topics covered in MT580 are drawn from several subject areas; hence the need for a specialized set of course notes. I would like to thank GSAS Dean Mick Smyer for supporting the development of these notes and Mathematics Professor Dan Chambers for many useful comments and suggestions on an earlier draft.

© Jenny A. Baglivo 2006. All rights reserved.


Contents

1 Introductory Probability Concepts
   1.1 Set Operations
   1.2 Sample Spaces and Events
   1.3 Probability Axioms and Properties Following From the Axioms
      1.3.1 Kolmogorov Axioms for Finite Sample Spaces
      1.3.2 Equally Likely Outcomes
      1.3.3 Properties Following from the Axioms
   1.4 Counting Methods
   1.5 Permutations and Combinations
      1.5.1 Example: Simple Urn Model
      1.5.2 Binomial Coefficients
   1.6 Partitioning Sets
      1.6.1 Multinomial Coefficients
      1.6.2 Example: Urn Models
   1.7 Applications in Statistics
      1.7.1 Maximum Likelihood Estimation in Simple Urn Models
      1.7.2 Permutation and Nonparametric Methods
   1.8 Conditional Probability
      1.8.1 Multiplication Rules for Probability
      1.8.2 Law of Total Probability
      1.8.3 Bayes Rule
   1.9 Independent Events and Mutually Independent Events

2 Discrete Random Variables
   2.1 Definitions
      2.1.1 PDF and CDF for Discrete Distributions
   2.2 Families of Distributions
      2.2.1 Example: Discrete Uniform Distribution
      2.2.2 Example: Hypergeometric Distribution
      2.2.3 Distributions Related to Bernoulli Experiments
      2.2.4 Simple Random Samples
   2.3 Mathematical Expectation
      2.3.1 Definitions for Finite Discrete Random Variables
      2.3.2 Properties of Expectation
      2.3.3 Mean, Variance and Standard Deviation
      2.3.4 Chebyshev's Inequality

3 Joint Discrete Distributions
   3.1 Bivariate Distributions
      3.1.1 Joint PDF
      3.1.2 Joint CDF
      3.1.3 Marginal Distributions
      3.1.4 Conditional Distributions
      3.1.5 Independent Random Variables
      3.1.6 Mathematical Expectation for Finite Bivariate Distributions
      3.1.7 Covariance and Correlation
      3.1.8 Conditional Expectation and Regression
      3.1.9 Example: Bivariate Hypergeometric Distribution
      3.1.10 Example: Trinomial Distribution
      3.1.11 Simple Random Samples
   3.2 Multivariate Distributions
      3.2.1 Joint PDF and Joint CDF
      3.2.2 Mutually Independent Random Variables and Random Samples
      3.2.3 Mathematical Expectation for Finite Multivariate Distributions
      3.2.4 Example: Multivariate Hypergeometric Distribution
      3.2.5 Example: Multinomial Distribution
      3.2.6 Simple Random Samples
      3.2.7 Sample Summaries

4 Introductory Calculus Concepts
   4.1 Functions
   4.2 Limits and Continuity
      4.2.1 Neighborhoods, Deleted Neighborhoods and Limits
      4.2.2 Continuous Functions
   4.3 Differentiation
      4.3.1 Derivative and Tangent Line
      4.3.2 Approximations and Local Linearity
      4.3.3 First and Second Derivative Functions
      4.3.4 Some Rules for Finding Derivative Functions
      4.3.5 Chain Rule for Composition of Functions
      4.3.6 Rules for Exponential and Logarithmic Functions
      4.3.7 Rules for Trigonometric and Inverse Trigonometric Functions
      4.3.8 Indeterminate Forms and L'Hospital's Rule
   4.4 Optimization
      4.4.1 Local Extrema and Derivatives
      4.4.2 First Derivative Test for Local Extrema
      4.4.3 Second Derivative Test for Local Extrema
      4.4.4 Global Extrema on Closed Intervals
      4.4.5 Newton's Method
   4.5 Applications in Statistics
      4.5.1 Maximum Likelihood Estimation in Binomial Models
      4.5.2 Maximum Likelihood Estimation in Multinomial Models
      4.5.3 Minimum Chi-Square Estimation in Multinomial Models
   4.6 Rolle's Theorem and the Mean Value Theorem
      4.6.1 Generalized Mean Value Theorem and Error Analysis
   4.7 Integration
      4.7.1 Riemann Sums and Definite Integrals
      4.7.2 Properties of Definite Integrals
      4.7.3 Fundamental Theorem of Calculus
      4.7.4 Antiderivatives and Indefinite Integrals
      4.7.5 Some Formulas for Indefinite Integrals
      4.7.6 Approximation Methods
      4.7.7 Improper Integrals
   4.8 Applications in Statistics
      4.8.1 deMoivre-Laplace Limit Theorem
      4.8.2 Computing Quantiles

5 Continuous Random Variables
   5.1 Definitions
      5.1.1 PDF and CDF for Continuous Distributions
      5.1.2 Quantiles and Percentiles
   5.2 Families of Distributions
      5.2.1 Example: Uniform Distribution
      5.2.2 Example: Exponential Distribution
      5.2.3 Euler Gamma Function
      5.2.4 Example: Gamma Distribution
      5.2.5 Example: Normal Distribution
      5.2.6 Transforming Continuous Random Variables
   5.3 Mathematical Expectation
      5.3.1 Definitions for Continuous Random Variables
      5.3.2 Properties of Expectation
      5.3.3 Mean, Variance and Standard Deviation
      5.3.4 Chebyshev's Inequality

6 Infinite Sequences and Series
   6.1 Limits and Convergence of Infinite Sequences
      6.1.1 Convergence Properties
      6.1.2 Geometric Sequences
      6.1.3 Bounded Sequences
      6.1.4 Monotone Sequences
   6.2 Partial Sums and Convergence of Infinite Series
      6.2.1 Convergence Properties
      6.2.2 Geometric Series
   6.3 Convergence Tests for Positive Series
   6.4 Alternating Series and Error Analysis
   6.5 Absolute and Conditional Convergence of Series
      6.5.1 Product of Series
   6.6 Taylor Polynomials and Taylor Series
      6.6.1 Higher Order Derivatives
      6.6.2 Taylor Polynomials
      6.6.3 Taylor Series
      6.6.4 Radius of Convergence
   6.7 Power Series
      6.7.1 Power Series and Taylor Series
      6.7.2 Differentiation and Integration of Power Series
   6.8 Discrete Random Variables, revisited
      6.8.1 Countably Infinite Sample Spaces
      6.8.2 Infinite Discrete Random Variables
      6.8.3 Example: Geometric Distribution
      6.8.4 Example: Poisson Distribution
   6.9 Continuous Random Variables, revisited
      6.9.1 Exponential Distribution
      6.9.2 Gamma Distribution
   6.10 Summary: Poisson Processes

7 Multivariable Calculus Concepts
   7.1 Vector And Matrix Operations
      7.1.1 Representations of Vectors
      7.1.2 Addition and Scalar Multiplication of Vectors
      7.1.3 Length and Distance
      7.1.4 Dot Product of Vectors
      7.1.5 Matrix Notation and Matrix Transpose
      7.1.6 Addition and Scalar Multiplication of Matrices
      7.1.7 Matrix Multiplication
      7.1.8 Determinants of Square Matrices
   7.2 Limits and Continuity
      7.2.1 Neighborhoods, Deleted Neighborhoods and Limits
      7.2.2 Continuous Functions
   7.3 Differentiation in Several Variables
      7.3.1 Partial Derivatives and the Gradient Vector
      7.3.2 Second-Order Partial Derivatives
      7.3.3 Differentiability, Local Linearity and Tangent (Hyper)Planes
      7.3.4 Total Derivative
      7.3.5 Directional Derivatives
   7.4 Optimization
      7.4.1 Linear and Quadratic Approximations
      7.4.2 Second Derivative Test
      7.4.3 Constrained Optimization and Lagrange Multipliers
      7.4.4 Method of Steepest Descent/Ascent
   7.5 Applications in Statistics
      7.5.1 Least Squares Estimation in Quadratic Models
      7.5.2 Maximum Likelihood Estimation in Multinomial Models
   7.6 Multiple Integration
      7.6.1 Riemann Sums and Double Integrals over Rectangles
      7.6.2 Iterated Integrals over Rectangles
      7.6.3 Properties of Double Integrals
      7.6.4 Double and Iterated Integrals over General Domains
      7.6.5 Triple and Iterated Integrals over Boxes
      7.6.6 Triple Integrals over General Domains
      7.6.7 Partial Anti-Differentiation and Partial Differentiation
      7.6.8 Change of Variables and the Jacobian
   7.7 Applications in Statistics
      7.7.1 Computing Probabilities for Continuous Random Pairs
      7.7.2 Contours for Independent Standard Normal Random Variables

8 Joint Continuous Distributions
   8.1 Bivariate Distributions
      8.1.1 Joint PDF and Joint CDF
      8.1.2 Marginal Distributions
      8.1.3 Conditional Distributions
      8.1.4 Independent Random Variables
      8.1.5 Mathematical Expectation for Bivariate Continuous Distributions
      8.1.6 Covariance and Correlation
      8.1.7 Conditional Expectation and Regression
      8.1.8 Example: Bivariate Uniform Distribution
      8.1.9 Example: Bivariate Normal Distribution
      8.1.10 Transforming Continuous Random Variables
   8.2 Multivariate Distributions
      8.2.1 Joint PDF and Joint CDF
      8.2.2 Marginal and Conditional Distributions
      8.2.3 Mutually Independent Random Variables and Random Samples
      8.2.4 Mathematical Expectation For Multivariate Continuous Distributions
      8.2.5 Sample Summaries

9 Linear Algebra Concepts
   9.1 Solving Systems of Equations
      9.1.1 Elementary Row Operations
      9.1.2 Row Reduction and Echelon Forms
      9.1.3 Vector and Matrix Equations
   9.2 Matrix Operations
      9.2.1 Basic Operations
      9.2.2 Inverses of Square Matrices
      9.2.3 Matrix Factorizations
      9.2.4 Partitioned Matrices
   9.3 Determinants of Square Matrices
   9.4 Vector Spaces and Subspaces
      9.4.1 Definitions
      9.4.2 Subspaces of R^m
      9.4.3 Linear Independence, Basis and Dimension
      9.4.4 Rank and Nullity
   9.5 Orthogonality
      9.5.1 Orthogonal Complements
      9.5.2 Orthogonal Sets, Bases and Projections
      9.5.3 Gram-Schmidt Orthogonalization
      9.5.4 QR-Factorization
      9.5.5 Linear Least Squares
   9.6 Eigenvalues and Eigenvectors
      9.6.1 Characteristic Equation
      9.6.2 Diagonalization
      9.6.3 Symmetric Matrices
      9.6.4 Positive Definite Matrices
      9.6.5 Singular Value Decomposition
   9.7 Linear Transformations
      9.7.1 Range and Kernel
      9.7.2 Linear Transformations and Matrices
      9.7.3 Linearly Independent Columns
      9.7.4 Composition of Functions and Matrix Multiplication
      9.7.5 Diagonalization and Change of Bases
      9.7.6 Random Vectors and Linear Transformations
      9.7.7 Example: Multivariate Normal Distribution
   9.8 Applications in Statistics
      9.8.1 Least Squares Estimation
      9.8.2 Principal Components Analysis

10 Additional Reading


1 Introductory Probability Concepts

Probability is the study of random phenomena. Probability theory can be applied, for example, to study games of chance (e.g. roulette games, card games), occurrences of catastrophic events (e.g. tornados, earthquakes), survival of animal species, and changes in stock and commodity markets.

This chapter introduces probability theory.

1.1 Set Operations

Sets and operations on sets are fundamental to any area of mathematics. A set is a well-defined collection of objects called the elements or members of the set. (Well-defined means that there is a definite way of determining whether a given element belongs to the set.) The notation x ∈ A means "x is an element of A," and the notation x ∉ A means "x is not an element of A."

The empty set is the set containing no elements, and the universal set is the set containing all elements under consideration in a given study. The notation ∅ is used to denote the empty set, and the notation Ω is used to denote the universal set.

A is a subset of B (denoted by A ⊆ B) if each element of A is an element of B.

Union and intersection. The union of the sets A and B (denoted by A ∪ B) is the set of elements that belong to either A or B, and the intersection of the sets A and B (denoted by A ∩ B) is the set of elements that belong to both A and B. That is,

    A ∪ B = {x | x ∈ A or x ∈ B}  and  A ∩ B = {x | x ∈ A and x ∈ B}.

Note that A ∪ B = B ∪ A and A ∩ B = B ∩ A.

The operations of union and intersection are associative, that is,

    (A ∪ B) ∪ C = A ∪ (B ∪ C)  and  (A ∩ B) ∩ C = A ∩ (B ∩ C)  for sets A, B, and C.

Thus, the union A ∪ B ∪ C and the intersection A ∩ B ∩ C are well-defined. More generally, the union of the k sets A1, A2, ..., Ak (denoted by ∪_{i=1}^k Ai) consists of all elements belonging to at least one Ai, and the intersection of the k sets A1, A2, ..., Ak (denoted by ∩_{i=1}^k Ai) consists of all elements belonging to all k sets.

The operations of union and intersection satisfy the following distributive laws:

    A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)  and  A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)  for sets A, B, and C.

Complement. The complement of the set A in the universal set (denoted by A^c = Ω − A) is the set of all elements in the universal set that do not belong to A. Note, in particular, that A ∩ A^c = ∅, A ∪ A^c = Ω, and (A^c)^c = A.


Pairwise disjoint sets. The sets A1 and A2 are said to be disjoint if A1 ∩ A2 = ∅. The sets A1, A2, ..., Ak are said to be pairwise disjoint if Ai ∩ Aj = ∅ whenever i ≠ j.

If A and B are sets, then B can be written as the disjoint union of the part of B contained in A with the part of B contained in the complement of A. That is,

    B = (B ∩ A) ∪ (B ∩ A^c)  where  (B ∩ A) ∩ (B ∩ A^c) = ∅.

Partitions. A partition of Ω is a collection A1, A2, ..., Ak of pairwise disjoint sets whose union is Ω. If A1, A2, ..., Ak partitions Ω, then (B ∩ A1), (B ∩ A2), ..., (B ∩ Ak) partitions the set B. That is,

    B = ∪_{i=1}^k (B ∩ Ai)  where  (B ∩ Ai) ∩ (B ∩ Aj) = ∅ when i ≠ j.

De Morgan's laws. De Morgan's laws relate complements to unions and intersections:

    (A ∪ B)^c = A^c ∩ B^c  and  (A ∩ B)^c = A^c ∪ B^c  for sets A and B.

More generally,

    (∪_{i=1}^k Ai)^c = ∩_{i=1}^k Ai^c  and  (∩_{i=1}^k Ai)^c = ∪_{i=1}^k Ai^c.

1.2 Sample Spaces and Events

The term experiment (or random experiment) is used in probability theory to describe a procedure whose outcome is not known in advance with certainty. Further, experiments are assumed to be repeatable (at least in theory) and to have a well-defined set of possible outcomes.

The sample space S is the set of all possible outcomes of an experiment. An event is a subset of the sample space. A simple event is an event with a single outcome. Events are usually denoted by capital letters (A, B, C, ...) and outcomes by lower case letters (x, y, z, ...). If x ∈ A is observed, then A is said to have occurred. The favorable outcomes of an experiment form the event of interest.

Each repetition of an experiment is called a trial. Repeated trials are repetitions of the experiment using the specified procedure, with the outcomes of the trials having no influence on one another.

Example 1.1 (Coin Tossing) Suppose you toss a fair coin 5 times and record h (for head) or t (for tail) each time. The sample space for this experiment is the collection of 32 = 2^5 sequences of 5 h's or t's:

    S = {hhhhh, hhhht, hhhth, hhthh, hthhh, thhhh, hhhtt, hhtht, hthht, thhht,
         hhtth, hthth, thhth, htthh, ththh, tthhh, ttthh, tthth, thtth, httth, tthht,
         ththt, httht, thhtt, hthtt, hhttt, htttt, thttt, tthtt, tttht, tttth, ttttt}.

If you are interested in getting exactly 5 heads, then the event of interest is the simple event A = {hhhhh}. If you are interested in getting exactly 3 heads, then the event of interest is

    A = {hhhtt, hhtht, hthht, thhht, hhtth, hthth, thhth, htthh, ththh, tthhh}.

Example 1.2 (Birthdays) Consider the experiment "Ask ten people their birthdays and record the list of dates." The sample space, S, is the collection of all sequences of 10 birthdays. If you are interested in getting at least one match, then the event of interest, A, is the collection of all sequences of ten birthdays with at least one match.

1.3 Probability Axioms and Properties Following From the Axioms

The basic rules (or axioms) of probability were introduced by A. Kolmogorov in the 1930's. Let A ⊆ S be an event, and P(A) be the probability that A will occur.

1.3.1 Kolmogorov Axioms for Finite Sample Spaces

A probability distribution, or simply a probability, on a finite sample space S is a specification of numbers P(A) satisfying the following axioms:

1. P(S) = 1.

2. If A is an event, then 0 ≤ P(A) ≤ 1.

3. If A1 and A2 are disjoint, then P(A1 ∪ A2) = P(A1) + P(A2).

3′. More generally, if the events A1, A2, ..., Ak are pairwise disjoint, then

       P(A1 ∪ A2 ∪ ··· ∪ Ak) = P(A1) + P(A2) + ··· + P(Ak).

Since S is the set of all possible outcomes, an outcome in S is certain to occur; the probability of an event that is certain to occur must be 1 (axiom 1). Probabilities must be between 0 and 1 (axiom 2), and probabilities must be additive when events are pairwise disjoint (axiom 3).

Note that the second part of axiom 3 can be proven from the first part using mathematical induction. The inductive step of the proof assumes that the equality holds for k − 1 events. Then

    P(A1 ∪ A2 ∪ ··· ∪ Ak)
      = P((A1 ∪ A2 ∪ ··· ∪ Ak−1) ∪ Ak)            (associativity of union)
      = P(A1 ∪ A2 ∪ ··· ∪ Ak−1) + P(Ak)            (axiom 3)
      = P(A1) + P(A2) + ··· + P(Ak−1) + P(Ak)      (induction hypothesis)

An additional extension of axiom 3 is needed for infinite sample spaces. Namely,

3′′. If A1, A2, ... are pairwise disjoint, then

       P(∪_{i=1}^∞ Ai) = P(A1) + P(A2) + P(A3) + ···

where the righthand side is understood to be the sum of a convergent infinite series. Methods for sequences and series are covered in Chapter 6 of these notes.

Relative frequencies. The probability of an event can be written as the limit of relative frequencies. That is, if A ⊆ S is an event, then

    P(A) = lim_{n→∞} #(A)/n,

where #(A) is the number of occurrences of event A in n repeated trials of the experiment.

Expected number. If P(A) is the probability of event A, then nP(A) is the expected number of occurrences of event A in n repeated trials of the experiment.

1.3.2 Equally Likely Outcomes

If S is a finite set with N elements, A is a subset of S with n elements, and each outcome is equally likely, then

    P(A) = (Number of elements in A)/(Number of elements in S) = |A|/|S| = n/N.

Example 1.3 (Coin Tossing, continued) For example, if you toss a fair coin 5 times and record heads or tails each time, then the probability of getting exactly 3 heads is 10/32 = 0.3125. In 2000 repetitions of the experiment, you expect to observe exactly 3 heads 2000·P(exactly 3 heads) = 2000(0.3125) = 625 times.
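The calculation is easy to verify numerically. The following small Python sketch (not part of the original notes; names are illustrative) counts the favorable sequences and the expected number of occurrences:

    from math import comb

    n_tosses = 5
    favorable = comb(n_tosses, 3)         # sequences with exactly 3 heads: C(5, 3) = 10
    total = 2 ** n_tosses                 # all equally likely outcomes: 32
    p = favorable / total                 # 0.3125
    print(favorable, total, p, 2000 * p)  # expected count in 2000 repetitions: 625.0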

1.3.3 Properties Following from the Axioms

Properties following from the Kolmogorov axioms include

1. Complement rule: Let A^c = S − A be the complement of A in S. Then

       P(A^c) = 1 − P(A).

   In particular, P(∅) = 0.

2. Subset rule: If A is a subset of B, then P(A) ≤ P(B).

3. Inclusion-exclusion rule: If A and B are events, then

       P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Example 1.4 (Demonstration of Inclusion-Exclusion) To demonstrate the inclusion-exclusion rule using the axioms, first note that

    A ∪ B = A ∪ (B ∩ A^c)  where  A ∩ (B ∩ A^c) = ∅.

Since the intersection is empty, we know that P(A ∪ B) = P(A) + P(B ∩ A^c) by axiom 3. Similarly, since

    B = (B ∩ A) ∪ (B ∩ A^c)  where  (B ∩ A) ∩ (B ∩ A^c) = ∅,

P(B) = P(B ∩ A) + P(B ∩ A^c), or P(B ∩ A^c) = P(B) − P(B ∩ A). Thus,

    P(A ∪ B) = P(A) + P(B ∩ A^c) = P(A) + P(B) − P(B ∩ A) = P(A) + P(B) − P(A ∩ B),

since P(A ∩ B) = P(B ∩ A).

Inclusion-exclusion rules. The inclusion-exclusion rule can be generalized to more than two events. For three events, for example, the rule is

    P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

1.4 Counting Methods

Methods for counting the number of outcomes in a sample space or event are important in probability. The multiplication rule is the basic counting formula.

Theorem 1.5 (Multiplication Rule) If an operation consists of r steps of which the first can be done in n1 ways, for each of these the second can be done in n2 ways, for each of the first and second steps the third can be done in n3 ways, etc., then the entire operation can be done in n1 × n2 × ··· × nr ways.

Sampling with/without replacement. Two special cases of the multiplication rule are

1. Sampling with replacement. For a set of size n and a sample of size r, there are a total of n^r = n × n × ··· × n ordered samples, if duplication is allowed.

2. Sampling without replacement. For a set of size n and a sample of size r, there are a total of

       n!/(n − r)! = n × (n − 1) × ··· × (n − r + 1)

   ordered samples, if duplication is not allowed.

If n is a positive integer, the notation n! ("n factorial") is used for the product n! = n × (n − 1) × ··· × 1. For convenience, 0! is defined to equal 1 (0! = 1).

Example 1.6 (Birthdays, continued) For example, suppose there are r unrelated people in a room, none of whom was born on February 29th of a leap year. You would like to determine the probability that at least two people have the same birthday.

1. You ask, and record, each person's birthday. There are

       365^r = 365 × 365 × ··· × 365

   possible outcomes, where an outcome is a sequence of r responses.

2. Consider the event "everyone has a different birthday." The number of outcomes in this event is

       365!/(365 − r)! = 365 × 364 × ··· × (365 − r + 1).

3. Suppose that each sequence of birthdays is equally likely. The probability that at least two people have a common birthday is 1 minus the probability that everyone has a different birthday, or

       1 − (365 × 364 × ··· × (365 − r + 1))/365^r.

Let A be the event "at least two people have a common birthday". The following table demonstrates how quickly P(A) increases as the number of people in the room increases:

    r     5     10    15    20    25    30    35    40    45    50
    P(A)  0.03  0.12  0.25  0.41  0.57  0.71  0.81  0.89  0.94  0.97
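The table can be reproduced with a few lines of code. A minimal Python sketch (illustrative only, not from the notes):

    from math import prod

    def p_common_birthday(r):
        # P(at least two of r people share a birthday), assuming 365 equally likely days
        p_all_different = prod(365 - i for i in range(r)) / 365 ** r
        return 1 - p_all_different

    for r in (5, 10, 15, 20, 25, 30, 35, 40, 45, 50):
        print(r, round(p_common_birthday(r), 2))   # reproduces the table, e.g. 0.57 at r = 25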

1.5 Permutations and Combinations

A permutation is an ordered subset of r distinct objects out of a set of n objects. A combination is an unordered subset of r distinct objects out of the n objects.

By the multiplication rule (Theorem 1.5), there are a total of

    $_nP_r = \frac{n!}{(n-r)!} = n \times (n-1) \times \cdots \times (n-r+1)$

permutations of r objects out of n objects.

Since each unordered subset corresponds to r! ordered subsets (the r chosen elements are permuted in all possible ways), there are a total of

    $_nC_r = \frac{_nP_r}{r!} = \frac{n!}{(n-r)!\,r!} = \frac{n \times (n-1) \times \cdots \times (n-r+1)}{r \times (r-1) \times \cdots \times 1}$

combinations of r objects out of n objects.

For example, there are a total of 5040 ordered subsets of size 4 from a set of size 10, and a total of 5040/24 = 210 unordered subsets.

The notation

    $\binom{n}{r} = {_nC_r} = \frac{n!}{(n-r)!\,r!}$    (read "n choose r")

is used to denote the total number of combinations.


Special cases are as follows:

    $\binom{n}{0} = \binom{n}{n} = 1$  and  $\binom{n}{1} = \binom{n}{n-1} = n$.

Further, since choosing r elements to form a subset is equivalent to choosing the remaining n − r elements to form the complementary subset,

    $\binom{n}{r} = \binom{n}{n-r}$  for r = 0, 1, ..., n.

1.5.1 Example: Simple Urn Model

Suppose there are M special objects in an urn containing a total of N objects. In a subset of size n chosen from the urn, exactly m are special.

Unordered subsets. There are a total of

    $\binom{M}{m} \times \binom{N-M}{n-m}$

unordered subsets with exactly m special objects (and exactly n − m other objects). If each choice of subset is equally likely, then for each m

    $P(m \text{ special objects}) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}.$

Ordered subsets. There are a total of

    $\binom{n}{m} \times {_MP_m} \times {_{N-M}P_{n-m}}$

ordered subsets with exactly m special objects. (The positions of the special objects are selected first, followed by the special objects to fill these positions, followed by the nonspecial objects to fill the remaining positions.) If each choice of subset is equally likely, then for each m

    $P(m \text{ special objects}) = \frac{\binom{n}{m} \times {_MP_m} \times {_{N-M}P_{n-m}}}{{_NP_n}}.$

Computing probabilities. Interestingly, P(m special objects) is the same in both cases. For example, let N = 25, M = 10, n = 8, and m = 3. Then, using the first formula,

    $P(3 \text{ special objects}) = \frac{\binom{10}{3}\binom{15}{5}}{\binom{25}{8}} = \frac{120 \times 3003}{1081575} = \frac{728}{2185} \approx 0.333.$

Using the second formula, the probability is

    $P(3 \text{ special objects}) = \frac{\binom{8}{3} \times {_{10}P_3} \times {_{15}P_5}}{{_{25}P_8}} = \frac{56 \times 720 \times 360360}{43609104000} = \frac{728}{2185} \approx 0.333.$
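Both formulas are easy to check numerically with the standard math.comb and math.perm functions. A short Python sketch (illustrative only, not from the notes):

    from math import comb, perm

    N, M, n, m = 25, 10, 8, 3

    # unordered-subset formula
    p_unordered = comb(M, m) * comb(N - M, n - m) / comb(N, n)

    # ordered-subset formula
    p_ordered = comb(n, m) * perm(M, m) * perm(N - M, n - m) / perm(N, n)

    print(p_unordered, p_ordered)   # both equal 728/2185, about 0.333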


Example 1.7 (Quality Management) A batch of 20 microwaves contains 4 defective and 16 good machines. You decide to select 3 machines at random and test them. If each choice of subset is equally likely, then the probability that there are k defective machines in the chosen subset is

    $P(k \text{ defectives}) = \frac{\binom{4}{k}\binom{16}{3-k}}{\binom{20}{3}}, \qquad k = 0, 1, 2, 3.$

The following table shows the distribution of probabilities:

    k                 0          1          2         3
    P(k defectives)   560/1140   480/1140   96/1140   4/1140
                      = 0.4912   = 0.4211   = 0.0842  = 0.0035

If you decide to accept the batch as long as none of the 3 chosen machines is defective, then there is a 49.12% chance that you will accept a batch containing 4 defective machines.
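A short Python sketch (illustrative only; the helper name is not from the notes) that reproduces the distribution and the acceptance probability:

    from math import comb

    def p_defectives(k, defective=4, good=16, sample=3):
        # hypergeometric probability for the microwave batch
        return comb(defective, k) * comb(good, sample - k) / comb(defective + good, sample)

    print([round(p_defectives(k), 4) for k in range(4)])  # [0.4912, 0.4211, 0.0842, 0.0035]
    print(p_defectives(0))                                # chance of accepting the batch, about 0.4912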

1.5.2 Binomial Coefficients

The quantities $\binom{n}{r}$, r = 0, 1, ..., n, are often referred to as the binomial coefficients because of the following theorem.

Theorem 1.8 (Binomial Theorem) For all real numbers x and y and each positive integer n,

    $(x + y)^n = \sum_{r=0}^{n} \binom{n}{r} x^r y^{n-r}.$

Idea of proof. The product on the left can be written as a sequence of n factors:

    $(x + y)^n = (x + y) \times (x + y) \times \cdots \times (x + y).$

The product expands to 2^n summands, where each summand is a sequence of n letters, where one letter is chosen from each factor. For each r, exactly $\binom{n}{r}$ sequences have r copies of x and n − r copies of y.

Example 1.9 (n = 12) If n = 12, then the numbers of subsets of size r are as follows:

    r               0  1   2   3    4    5    6    7    8    9    10  11  12
    (12 choose r)   1  12  66  220  495  792  924  792  495  220  66  12  1

and the total number of subsets is $\sum_{r=0}^{12} \binom{12}{r} = 4096 = 2^{12}$.
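A quick numerical check of the row and its sum, as a Python sketch (illustrative only):

    from math import comb

    row = [comb(12, r) for r in range(13)]
    print(row)        # 1, 12, 66, 220, 495, 792, 924, 792, 495, 220, 66, 12, 1
    print(sum(row))   # 4096 = 2**12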

1.6 Partitioning Sets

The multiplication rule (Theorem 1.5) can be used to find the number of partitions of a set of n elements into k distinguishable subsets of sizes r1, r2, ..., rk.

Specifically, r1 of the n elements are chosen for the first subset, r2 of the remaining n − r1 elements are chosen for the second subset, and so forth. The result is the product of the numbers of ways to perform each step:

    $\binom{n}{r_1} \times \binom{n - r_1}{r_2} \times \cdots \times \binom{n - r_1 - \cdots - r_{k-1}}{r_k}.$

The product simplifies to

    $\frac{n!}{r_1!\, r_2! \cdots r_k!}$

and is denoted by $\binom{n}{r_1, r_2, \ldots, r_k}$ (read "n choose r1, r2, ..., rk").

Example 1.10 (Partitioning Students) For example, there are a total of

    $\binom{15}{5,5,5} = \frac{15!}{5!\, 5!\, 5!}$ = 756,756

ways to partition the members of a class of 15 students into recitation sections of size 5 each led by Joe, Sally, and Mary, respectively. (The recitation sections are distinguished by their group leaders.)

Permutations of indistinguishable objects. The formula above also represents the number of ways to permute n objects, where the first r1 are indistinguishable, the next r2 are indistinguishable, ..., the last rk are indistinguishable. The computation is done as follows: r1 of the n positions are chosen for the first type of object, r2 of the remaining n − r1 positions are chosen for the second type of object, etc.

In the example above, imagine that the 15 positions correspond to the 15 students in alphabetical order, and that there are 5 J's (for Joe), 5 S's (for Sally), and 5 M's (for Mary) to permute. Each permutation of the 15 letters corresponds to an assignment of the 15 students to the recitation sections led by Joe, Sally, and Mary.

Footnote on partitioning students. Suppose instead that the 15 students are assigned to 3 teams of 5 students each and that each team will work on the same project. Then the total number of different assignments becomes $\binom{15}{5,5,5} / 3!$ = 126,126.
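The two counts can be verified directly. A minimal Python sketch (the helper name multinomial is illustrative, not from the notes):

    from math import factorial

    def multinomial(n, *parts):
        # number of ways to split n labeled items into groups of the given sizes
        assert sum(parts) == n
        count = factorial(n)
        for r in parts:
            count //= factorial(r)
        return count

    labeled = multinomial(15, 5, 5, 5)    # sections led by Joe, Sally, Mary: 756756
    unlabeled = labeled // factorial(3)   # indistinguishable teams: 126126
    print(labeled, unlabeled)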

1.6.1 Multinomial Coefficients

The quantities $\binom{n}{r_1, r_2, \ldots, r_k}$ are often referred to as the multinomial coefficients because of the following theorem.

Theorem 1.11 (Multinomial Theorem) For all real numbers x1, x2, ..., xk and each positive integer n,

    $(x_1 + x_2 + \cdots + x_k)^n = \sum_{(r_1, r_2, \ldots, r_k)} \binom{n}{r_1, r_2, \ldots, r_k} x_1^{r_1} x_2^{r_2} \cdots x_k^{r_k},$

where the sum is over all k-tuples of nonnegative integers with $\sum_i r_i = n$.

Idea of proof: The product on the left can be written as a sequence of n factors

    $(x_1 + x_2 + \cdots + x_k)^n = (x_1 + x_2 + \cdots + x_k) \times (x_1 + x_2 + \cdots + x_k) \times \cdots \times (x_1 + x_2 + \cdots + x_k).$

The product expands to k^n summands, where each summand is a sequence of n letters (one from each factor). For each r1, r2, ..., rk, exactly $\binom{n}{r_1, r_2, \ldots, r_k}$ sequences have r1 copies of x1, r2 copies of x2, etc.

1.6.2 Example: Urn Models

The simple urn model from Section 1.5.1 can be generalized.

Assume that an urn contains k types of objects, Mi is the number of objects of type i, and N = ∑_i Mi is the total number of objects in the urn. A subset of size n is chosen. If each choice of subset is equally likely, then the probability that there are exactly mi objects of type i for i = 1, 2, ..., k is

    $P(\text{Exactly } m_i \text{ objects of type } i \text{ for each } i) = \frac{\binom{M_1}{m_1}\binom{M_2}{m_2} \cdots \binom{M_k}{m_k}}{\binom{N}{n}}.$

Example 1.12 (3 Types of Objects) Let M1 = 3, M2 = 4, M3 = 5 and n = 4. The following table gives the probability distribution for all possible values of m1 and m2 (with m3 = 4 − m1 − m2):

              m2 = 0   m2 = 1    m2 = 2   m2 = 3   m2 = 4   total
    m1 = 0    5/495    40/495    60/495   20/495   1/495    126/495
    m1 = 1    30/495   120/495   90/495   12/495            252/495
    m1 = 2    30/495   60/495    18/495                     108/495
    m1 = 3    5/495    4/495                                9/495
    total     70/495   224/495   168/495  32/495   1/495    495/495

Since N = M1 + M2 + M3 = 12, the total number of subsets of size 4 is $\binom{12}{4}$ = 495. To get the numerator for m1 = 2 and m2 = 1, for example, we compute $\binom{3}{2}\binom{4}{1}\binom{5}{1}$ = 60.

The rightmost column gives the proportion of time we expect to have 0, 1, 2, or 3 elements of type 1 in the subset of size 4. The bottom row gives the proportion of time we expect to have 0, 1, 2, 3, or 4 of type 2 in the subset of size 4.
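A small Python sketch (illustrative only) that reproduces the numerators of the table:

    from math import comb

    M = [3, 4, 5]            # counts of objects of types 1, 2, 3
    N, n = sum(M), 4

    def p(m1, m2):
        m3 = n - m1 - m2
        if m3 < 0 or m3 > M[2]:
            return 0
        return comb(M[0], m1) * comb(M[1], m2) * comb(M[2], m3) / comb(N, n)

    # numerators out of 495; for example, p(2, 1) corresponds to 60/495
    for m1 in range(4):
        print([round(p(m1, m2) * 495) for m2 in range(5)])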

Urn models can be used to analyze surveys conducted in small populations. Additional examples will be discussed in the context of joint discrete distributions (Chapter 3).

1.7 Applications in Statistics

This section introduces two important applications of the methods we have seen so far.


[Figure 1.1: Likelihood for survey example (left) and capture-recapture example (right). Each panel plots probability (Prob) against M (left) or N (right).]

1.7.1 Maximum Likelihood Estimation in Simple Urn Models

In statistics, information from a sample is used when complete information is impossible to obtain or would take too long to gather. For example, information from a randomly chosen subset of elements from an urn can be used to

1. Estimate the number of special elements in an urn, M, when N is known.

2. Estimate the total number of elements in an urn, N, when M is known.

The method uses proportions: if all four numbers (N, M, n, m) are positive, then the proportion of special elements in the subset (m/n) is set equal to the proportion of special elements in the urn (M/N), and the equation is solved for the unknown quantity.

The method is intuitively appealing. Further, it can be shown that estimates obtained using proportions have maximum (or close to maximum) probabilities.

Example 1.13 (Survey Analysis) Assume that in a well-conducted survey of 150 of the 1000 students in the senior class at a local university, 82 students preferred Susan for class president. Since (82/150)1000 ≈ 547, we estimate that 547 seniors support Susan.

The left part of Figure 1.1 is a plot of the function

    $f(M) = \frac{\binom{M}{82}\binom{1000-M}{68}}{\binom{1000}{150}}$  for M = 450, 451, ..., 625.

If the subset of students chosen for the survey was one of $\binom{1000}{150}$ equally likely choices, then the plot suggests that 547 is a maximum likelihood estimate of M.

Example 1.14 (Capture-Recapture Model) Two independent techniques were used to estimate the total number of potentially fatal errors in a complex computer program. The first method identified M = 380 errors. The second method identified a total of n = 278 errors, m = 250 of which were already identified by the first method.

Since (278/250)380 ≈ 423, we estimate that there are a total of 423 potentially fatal errors in the program (408 identified by at least one method; 15 not identified by either method).

The right part of Figure 1.1 is a plot of

    $f(N) = \frac{\binom{380}{250}\binom{N-380}{28}}{\binom{N}{278}}$  for N = 411, 412, ..., 437.

If the subset of errors identified by the second method was one of $\binom{N}{278}$ equally likely choices, then the plot suggests that 423 is a maximum likelihood estimate of N.

Probability (or likelihood) ratio. An alternative to reporting a single value of M or N with maximum probability is to report a range of likely values. Often, the range is determined using probability ratios. That is, M or N is in the reported range if

    (Probability for M or N) / (Maximum Probability) ≥ p  for a fixed proportion p.

Using p = 0.25, the range of likely values in the survey example is {485, ..., 608} and the range of likely values in the capture-recapture example is {415, ..., 431}.
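The maximum likelihood estimate and the ratio-based range for the survey example can be computed by brute force. A minimal Python sketch (function and variable names are illustrative, not from the notes):

    from math import comb

    def likelihood(M, N=1000, n=150, m=82):
        # f(M): probability of m special objects in a sample of n (survey example)
        return comb(M, m) * comb(N - M, n - m) / comb(N, n)

    values = {M: likelihood(M) for M in range(450, 626)}
    mle = max(values, key=values.get)
    # values of M whose probability is at least 25% of the maximum
    likely = [M for M, f in values.items() if f >= 0.25 * values[mle]]
    print(mle, likely[0], likely[-1])   # expect 547, and (per the notes) a range of about 485 to 608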

Survey analysis when N is large. When N is large, we usually work with the unknown proportion of special objects p = M/N and the estimate p̂ = m/n. Further, as an approximation, we assume that p takes values anywhere in the interval 0 ≤ p ≤ 1 and we use "sampling with replacement" in place of "sampling without replacement."

Survey analysis with three types of objects. The simple method of proportions can be generalized to urn models with three (or more) types of objects. For example, suppose we wish to estimate the numbers of students who will vote for each of three candidates (Susan plus two others) and that in a well-conducted survey of 150 of the 1000 students in the senior class, 72 prefer Susan (candidate 1), 56 prefer the second candidate and 22 prefer the third candidate. We estimate that

    (72/150)1000 ≈ 480,  (56/150)1000 ≈ 373  and  (22/150)1000 ≈ 147

will vote for candidates 1, 2, 3, respectively. Further, these estimates represent maximum (or near maximum) probabilities over all triples (M1, M2, M3).

1.7.2 Permutation and Nonparametric Methods

In many statistical applications, the null and alternative hypotheses of interest can be paraphrased in the following simple terms:

    Ho: Any patterns appearing in the data are due to chance alone.
    Ha: There is a tendency for a certain type of pattern to appear.

Permutation methods allow researchers to determine whether to accept or reject a null hypothesis of randomness and, in some cases, to construct confidence intervals for unknown parameters. The methods are applicable in many settings since they require few mathematical assumptions.

For example, consider the analysis of two samples. Let {x1, x2, ..., xn} and {y1, y2, ..., ym} be the observed samples, where each is a list of numbers with repetitions.

The data could have been generated under one of two sampling models.

Population model. Data generated under a population model can be treated as the values of independent random samples from continuous distributions. (These concepts are defined formally in Chapter 8 of these notes.)

For example, consider testing the null hypothesis that the X and Y distributions are equal versus the alternative hypothesis that X is stochastically smaller than Y. (That is, that the values of X tend to be smaller than the values of Y.)

If n = 3, m = 5, and the observed lists are

    {x1, x2, x3} = {1.4, 1.2, 2.8}  and  {y1, y2, y3, y4, y5} = {1.7, 2.9, 3.7, 2.3, 1.3},

then the event

    X2 < Y5 < X1 < Y1 < Y4 < X3 < Y2 < Y3

has occurred. Under the null hypothesis, each permutation of the n + m = 8 random variables is equally likely; thus, the observed event is one of (n + m)! = 40,320 equally likely choices. Patterns of interest under the alternative hypothesis are events where the Xi's tend to be smaller than the Yj's.

Randomization model. Data generated under a randomization model cannot be treated as the values of independent random samples from continuous distributions, but can be thought of as one of N equally likely choices under the null hypothesis.

In the example above, the null hypothesis of randomness is that the observed data is one of N = 40,320 equally likely choices, where a choice corresponds to a matching of the 8 numbers to the labels x1, x2, x3, y1, y2, y3, y4, y5.

Permutation tests. To conduct a permutation test using a test statistic T,

1. The sampling distribution of T is obtained by computing the value of the statistic for each reordering of the data, and

2. The observed significance level (or p value) is computed by comparing the observed value of T to the sampling distribution from step 1.

The sampling distribution from the first step is called the permutation distribution of the statistic T, and the p value is called a permutation p value.


Table 1.1: Sampling distribution of the sum of observations in the first sample.<br />

sum x sample sum x sample sum x sample<br />

3.9 {1.2,1.3,1.4} 5.9 {1.4,1.7,2.8} 7.1 {1.4,2.8,2.9}<br />

4.2 {1.2,1.3,1.7} 5.9 {1.3,1.7,2.9} 7.2 {1.2,2.3,3.7}<br />

4.3 {1.2,1.4,1.7} 6.0 {1.4,1.7,2.9} 7.3 {1.3,2.3,3.7}<br />

4.4 {1.3,1.4,1.7} 6.2 {1.2,1.3,3.7} 7.4 {1.4,2.3,3.7}<br />

4.8 {1.2,1.3,2.3} 6.3 {1.2,1.4,3.7} 7.4 {1.7,2.8,2.9}<br />

4.9 {1.2,1.4,2.3} 6.3 {1.2,2.3,2.8} 7.7 {1.2,2.8,3.7}<br />

5.0 {1.3,1.4,2.3} 6.4 {1.3,2.3,2.8} 7.7 {1.7,2.3,3.7}<br />

5.2 {1.2,1.7,2.3} 6.4 {1.2,2.3,2.9} 7.8 {1.2,2.9,3.7}<br />

5.3 {1.2,1.3,2.8} 6.4 {1.3,1.4,3.7} 7.8 {1.3,2.8,3.7}<br />

5.3 {1.3,1.7,2.3} 6.5 {1.3,2.3,2.9} 7.9 {1.4,2.8,3.7}<br />

5.4 {1.2,1.4,2.8} 6.5 {1.4,2.3,2.8} 7.9 {1.3,2.9,3.7}<br />

5.4 {1.4,1.7,2.3} 6.6 {1.2,1.7,3.7} 8.0 {1.4,2.9,3.7}<br />

5.4 {1.2,1.3,2.9} 6.6 {1.4,2.3,2.9} 8.0 {2.3,2.8,2.9}<br />

5.5 {1.2,1.4,2.9} 6.7 {1.3,1.7,3.7} 8.2 {1.7,2.8,3.7}<br />

5.5 {1.3,1.4,2.8} 6.8 {1.4,1.7,3.7} 8.3 {1.7,2.9,3.7}<br />

5.6 {1.3,1.4,2.9} 6.8 {1.7,2.3,2.8} 8.8 {2.3,2.8,3.7}<br />

5.7 {1.2,1.7,2.8} 6.9 {1.2,2.8,2.9} 8.9 {2.3,2.9,3.7}<br />

5.8 {1.2,1.7,2.9} 6.9 {1.7,2.3,2.9} 9.4 {2.8,2.9,3.7}<br />

5.8 {1.3,1.7,2.8} 7.0 {1.3,2.8,2.9}<br />

Continuing with the example above, let S be the sum of numbers in the x sample. Since<br />

the value of S depends only on which observations are labelled x’s, and not on the relative<br />

ordering of all 8 observations, the sampling distribution of S under the null hypothesis is<br />

obtained by computing its value <strong>for</strong> each partition of the 8 numbers into subsets of sizes 3<br />

and 5, respectively.<br />

Table 1.1 shows the values of S for each of the C(8, 3) = 56 choices of observations for the first
sample. The observed value of S is 5.4. Since small values of S support the alternative

hypothesis that x values tend to be smaller than y values, the observed significance level is<br />

P(S ≤ 5.4) = 13/56.<br />
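The permutation distribution of S can be generated by brute force. The following is a minimal sketch (not part of the original notes), assuming Python 3 with only the standard library; the data are the eight observations of the example.

```python
# Enumerate all C(8,3) = 56 choices of the x sample and compute the sum for each.
from itertools import combinations

x = [1.4, 1.2, 2.8]
y = [1.7, 2.9, 3.7, 2.3, 1.3]
pooled = x + y

observed = sum(x)                                            # 5.4
sums = [sum(subset) for subset in combinations(pooled, 3)]   # 56 sums, as in Table 1.1

# Small sums support the alternative that x values tend to be smaller than y values.
p_value = sum(1 for s in sums if s <= observed + 1e-9) / len(sums)
print(len(sums), p_value)                                    # 56  0.232... (= 13/56)
```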

Example 1.15 (Olympic Sprinting Times) As a second permutation test example,<br />

consider a test of association between x-values and y-values using data on Olympic sprinting<br />

times (Hand et al., Chapman & Hall, 1994, p. 248), and the sample correlation statistic<br />

R = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^n (Xi − X̄)² Σ_{i=1}^n (Yi − Ȳ)² ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively, and n is the

total number of data pairs.<br />

The left part of Figure 1.2 shows the winning times in seconds (vertical axis) for the men's
100 meter finals versus the year since 1900 (horizontal axis) for the years 1912, 1924, ..., 1972.



Figure 1.2: Winning time in seconds (y) versus year since 1900 (x) for Olympic sprinting
times example (left plot). Triglycerides in mg/dL (y) versus cholesterol in mg/dL (x) for
heart disease example (right plot).

The data are reproduced in the following table:<br />

x 12 24 36 48 60 72<br />

y 10.8 10.6 10.3 10.3 10.2 10.14<br />

For these data, the observed value of R is −0.941.<br />

To determine if the observed association is due to chance alone, a permutation analysis of
R = 0 versus R ≠ 0 was conducted at the 5% significance level.

The sampling distribution of R under the null hypothesis is obtained by matching x-values<br />

to permuted y-values in all possible ways, and computing the value of R in each case. Since<br />

two y-values are equal, there are a total of 6!/2 = 360 permutations to consider. Since the<br />

permutation p value is<br />

p value = #(|R| ≥ |−0.941|) / 360 = 2/360 ≈ 0.006

(obtained using the computer), there is evidence that the observed negative association is<br />

not due to chance alone.<br />
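A sketch of this exact permutation analysis is given below (assumed code, not taken from the notes). Because the two tied y values make each distinct matching appear twice among the 6! orderings, the proportion of extreme correlations is unchanged.

```python
from itertools import permutations
from math import sqrt

x = [12, 24, 36, 48, 60, 72]
y = [10.8, 10.6, 10.3, 10.3, 10.2, 10.14]

def corr(a, b):
    # the sample correlation statistic R defined above
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = sqrt(sum((ai - ma) ** 2 for ai in a))
    sb = sqrt(sum((bi - mb) ** 2 for bi in b))
    return sab / (sa * sb)

r_obs = corr(x, y)                                        # about -0.941
orderings = list(permutations(y))                         # 6! = 720 orderings
extreme = sum(1 for p in orderings if abs(corr(x, p)) >= abs(r_obs) - 1e-12)
print(round(r_obs, 3), extreme / len(orderings))          # -0.941  about 0.006
```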

Conditional test, nonparametric test. A permutation test is an example of a conditional<br />

test, since the sampling distribution of T is computed conditional on the observations.<br />

Note that the term nonparametric test is often used to describe permutation tests where<br />

observations have been replaced by ranks. The Wilcoxon rank sum and Mann-Whitney<br />

tests <strong>for</strong> the analysis of two samples, and the Wilcoxon signed rank test <strong>for</strong> the analysis of<br />

paired samples, are examples of nonparametric tests.<br />

Monte Carlo test. If the number of reorderings of the data is very large, then the<br />

computer can be used to approximate the sampling distribution of T by computing its<br />



value <strong>for</strong> a fixed number of random reorderings of the data. The approximate sampling<br />

distribution can then be used to estimate the p value.<br />

A test conducted in this way is an example of a Monte Carlo test. (In a Monte Carlo<br />

analysis, simulation is used to estimate a quantity of interest. Here the quantity of interest<br />

is a p value.)<br />

Example 1.16 (Heart Disease Study) (Hand et al., Chapman & Hall, 1994, p. 221)<br />

Cholesterol and triglycerides belong to the class of chemicals known as lipids (fats). As part<br />

of a study to determine the relationship between high levels of lipids and coronary artery<br />

disease, researchers measured plasma levels of cholesterol and triglycerides in milligrams<br />

per deciliter (mg/dL) in 371 men complaining of chest pain.<br />

The right part of Figure 1.2 compares the cholesterol (x) and triglycerides (y) levels <strong>for</strong> the<br />

51 men with no evidence of heart disease. The sample correlation is 0.325. (See the last<br />

example <strong>for</strong> the definition of the sample correlation statistic.)<br />

To determine if the observed association is due to chance alone, a permutation analysis can<br />

be conducted using the sample correlation statistic R as test statistic.<br />

Consider testing R = 0 versus R ≠ 0 at the 5% significance level. Under the null hypothesis

of randomness, each matching of observed x values to permuted y values is equally likely.<br />

Thus, the observed matching is one of about 52! ≈ 8 × 10^67 equally likely choices.

In a Monte Carlo analysis using 5000 random matchings (including the observed matching),<br />

the estimated permutation p value was<br />

estimated p value = #(|R| ≥ |0.325|) / 5000 = 98/5000 ≈ 0.0196.

Thus, there is evidence that the positive association between levels of cholesterol and triglycerides<br />

is not due to chance alone.<br />
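A minimal Monte Carlo sketch is shown below (assumed code; it requires Python 3.10+ for statistics.correlation). The observed matching is counted along with the random re-matchings.

```python
import random
from statistics import correlation

def monte_carlo_p(x, y, n_rand=4999, seed=0):
    rng = random.Random(seed)
    r_obs = correlation(x, y)
    count = 1                                   # include the observed matching
    y_perm = list(y)
    for _ in range(n_rand):
        rng.shuffle(y_perm)                     # a random re-matching of y values to x values
        if abs(correlation(x, y_perm)) >= abs(r_obs):
            count += 1
    return count / (n_rand + 1)                 # estimated permutation p value
```

With n_rand = 4999, the estimate is based on 5000 matchings, as in the heart disease example.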

1.8 Conditional Probability<br />

Assume that A and B are events, and that P(B) > 0. Then the conditional probability of<br />

A given B is defined as follows:<br />

P(A|B) = P(A ∩ B) / P(B).

Event B is often referred to as the conditional sample space. The conditional probability<br />

P(A|B) is the relative “size” of A within B.<br />

Example 1.17 (Respiratory Problems) For example, suppose that 40% of the adults in<br />

a certain population smoke cigarettes and that 28% smoke cigarettes and have respiratory<br />

problems. Then<br />

P(Respiratory problems | Smoker) = 0.28/0.40 = 0.70

is the probability that an adult has respiratory problems given that the adult is a smoker.<br />

(70% of smokers have respiratory problems.)<br />



Equally likely outcomes. Note that if the sample space S is finite and each outcome is<br />

equally likely, then the conditional probability of A given B simplifies to the following:<br />

P(A|B) = (|A ∩ B|/|S|) / (|B|/|S|) = |A ∩ B| / |B| = (number of elements in A ∩ B) / (number of elements in B).

1.8.1 Multiplication Rules for Probability

Assume that A and B are events with positive probability. Then the definition of conditional<br />

probability implies that the probability of the intersection, P(A ∩ B), can be written as a<br />

product of probabilities in two different ways:<br />

P(A ∩ B) = P(B) × P(A|B) = P(A) × P(B|A).

More generally,

Theorem 1.18 (Multiplication Rule <strong>for</strong> Probability) If A1,A2,... ,Ak are events<br />

and P(A1 ∩ A2 ∩ · · · ∩ Ak−1) > 0, then<br />

P(A1 ∩ A2 ∩ · · · ∩ Ak) =<br />

P(A1) × P(A2|A1) × P(A3|A1 ∩ A2) × · · · × P(Ak|A1 ∩ A2 ∩ · · · ∩ Ak−1).<br />

Example 1.19 (Sampling Without Replacement) For example, suppose that four<br />

slips of paper are sampled without replacement from a well-mixed urn containing twenty-five
slips of paper: fifteen slips with the letter X written on each and ten slips of paper
with the letter Y written on each. Then the probability of observing the sequence XYXX is

P(XYXX) = P(X) × P(Y|X) × P(X|XY) × P(X|XYX) = (15/25) × (10/24) × (14/23) × (13/22) ≈ 0.09.

(The probability of choosing an X slip is 15/25; with an X removed from the urn, the<br />

probability of drawing a Y slip is 10/24; with an X and Y removed from the urn, the<br />

probability of drawing an X slip is 14/23; with two X slips and one Y slip removed from<br />

the urn, the probability of drawing an X slip is 13/22.)<br />

1.8.2 Law of Total Probability<br />

The law of total probability can be used to write an unconditional probability as the<br />

weighted average of conditional probabilities. Specifically,<br />

Theorem 1.20 (Law of Total Probability) Let A1, A2, ..., Ak and B be events with<br />

nonzero probability. If A1, A2, ..., Ak are pairwise disjoint with union S, then<br />

P(B) = P(A1) × P(B|A1) + P(A2) × P(B|A2) + · · · + P(Ak) × P(B|Ak).<br />



To demonstrate the law of total probability, note that if A1, A2, ..., Ak are pairwise disjoint<br />

with union S, then the sets B ∩ A1, B ∩ A2, ..., B ∩ Ak are pairwise disjoint with union B.<br />

Thus, axiom 3 and the definition of conditional probability imply that<br />

P(B) = Σ_{j=1}^k P(B ∩ Aj) = Σ_{j=1}^k P(Aj) P(B|Aj).

Example 1.21 (Respiratory Problems, continued) For example, suppose that 70% of<br />

smokers and 15% of nonsmokers in a certain population of adults have respiratory problems.<br />

If 40% of the population smoke cigarettes, then<br />

P(Respiratory problems) = 0.40(0.70) + 0.60(0.15) = 0.37<br />

is the probability of having respiratory problems.<br />

Law of average conditional probabilities. The law of total probability is often called<br />

the law of average conditional probabilities. Specifically, P(B) is the weighted average of<br />

the collection of conditional probabilities {P(B|Aj)}, using the collection of unconditional<br />

probabilities {P(Aj)} as weights.<br />

In the respiratory problems example above, 0.37 is the weighted average of 0.70 (the probability<br />

that a smoker has respiratory problems) and 0.15 (the probability that a nonsmoker<br />

has respiratory problems).<br />

1.8.3 Bayes Rule<br />

Bayes rule, proven by the Reverend T. Bayes in the 1760’s, can be used to update probabilities<br />

given that an event has occurred. Specifically,<br />

Theorem 1.22 (Bayes Rule) Let A1, A2, ..., Ak and B be events with nonzero probability.<br />

If A1, A2, ..., Ak are pairwise disjoint with union S, then<br />

P(Aj|B) = [P(Aj) × P(B|Aj)] / [Σ_{i=1}^k P(Ai) × P(B|Ai)], j = 1, 2, ..., k.

Bayes rule is a restatement of the definition of conditional probability: the numerator in<br />

the <strong>for</strong>mula is P(Aj ∩ B), the denominator is P(B) (by the law of total probability), and<br />

the ratio is P(Aj|B).<br />

Example 1.23 (Defective Products) For example, suppose that 2% of the products<br />

assembled during the day shift and 6% of the products assembled during the night shift at<br />

a small company are defective and need reworking. If the day shift accounts <strong>for</strong> 55% of the<br />

products assembled by the company, then<br />

P(Day shift | Defective) = 0.55(0.02) / [0.55(0.02) + 0.45(0.06)] ≈ 0.289

is the probability that a product was assembled during the day shift given that the product<br />

is defective. (About 28.9% of defective products are assembled during the day shift.)<br />
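Bayes rule is easy to code directly. The following sketch (assumed code, not from the notes) takes the prior probabilities P(Aj) and the conditional probabilities P(B|Aj) and returns the posterior probabilities P(Aj|B).

```python
def posterior(priors, likelihoods):
    joint = [p * q for p, q in zip(priors, likelihoods)]   # P(Aj) x P(B|Aj)
    total = sum(joint)                                     # P(B), by the law of total probability
    return [j / total for j in joint]

# Defective products example: day and night shifts with defect rates 2% and 6%.
print(posterior([0.55, 0.45], [0.02, 0.06]))               # [0.289..., 0.710...]
```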



Table 1.2: Distribution of unaided distance scores for right (x) and left (y) eyes.

        y = 1   y = 2   y = 3   y = 4   total
x = 1   0.203   0.036   0.017   0.009   0.265
x = 2   0.031   0.202   0.058   0.010   0.301
x = 3   0.016   0.048   0.237   0.027   0.328
x = 4   0.005   0.011   0.024   0.066   0.106
total   0.255   0.297   0.336   0.112   1.000

Figure 1.3: Distributions of left eye scores given right eye score is 1 (left) or 4 (right).<br />

Prior and posterior distributions. In applications of Bayes rule, the collection of<br />

probabilities {P(Aj)} are often referred to as the prior probabilities (the probabilities be<strong>for</strong>e<br />

observing an outcome in B), and the collection of probabilities {P(Aj|B)} are often referred<br />

to as the posterior probabilities (the probabilities after event B has occurred).<br />

Example 1.24 (Eyesight Study) (Stuart, Biometrika 42:412-416, 1955) Between 1943<br />

and 1946 the eyesight of more than seven thousand women aged 30-39 employed in the<br />

British weapons industry was tested. For each woman the unaided distance vision of the<br />

right eye (x) and of the left eye (y) was recorded using the four-point scale 1, 2, 3, 4, where<br />

1 is the highest grade and 4 is the lowest grade. Table 1.2 summarizes the results of the<br />

experiment.<br />

For example, <strong>for</strong> 20.3% of women who participated in the study both eyes were given the<br />

highest score (x = y = 1) and <strong>for</strong> 6.6% of women who participated in the study both eyes<br />

were given the lowest score (x = y = 4).<br />

Consider the experiment “Choose a woman from the set of women studied by Stuart and<br />

record the scores <strong>for</strong> both eyes,” assume each choice is equally likely, and let Aj be the<br />

event that the left eye score equals j.<br />

The left part of Figure 1.3 shows the posterior distribution {P(Aj|B)} (in black) superimposed<br />

on the prior distribution {P(Aj)} (in gray) when B is the event that the right eye<br />



score is 1. (If the right eye score is 1, then the left eye score is most likely 1.)<br />

The right part of Figure 1.3 shows the posterior distribution {P(Aj|B)} (in black) superimposed<br />

on the prior distribution {P(Aj)} (in gray) when B is the event that the right eye<br />

score is 4. (If the right eye score is 4, then the left eye score is most likely 4.)<br />

Example 1.25 (Religion and Education) As a last example, suppose that 45% of<br />

the population of a certain adult community are Catholic, 15% are Jewish, and 40% are<br />

Protestant. Further, suppose that<br />

1. Education levels among the Catholic population are as follows:

   Level:   8th grade or less  Some High School  HS Graduate  Some College  College Graduate  Post-Graduate
   Percent: 21%                28%               32%          8%            7%                4%

2. Education levels among the Jewish population are as follows:

   Level:   8th grade or less  Some High School  HS Graduate  Some College  College Graduate  Post-Graduate
   Percent: 7%                 23%               24%          15%           18%               13%

3. Education levels among the Protestant population are as follows:

   Level:   8th grade or less  Some High School  HS Graduate  Some College  College Graduate  Post-Graduate
   Percent: 13%                22%               27%          11%           19%               8%

The information above can be used to construct a contingency table of proportions in each
religion-education group in this community:

              8th grade or less  Some High School  HS Graduate  Some College  College Graduate  Post-Graduate  Total
Catholic:     0.0945             0.126             0.144        0.036         0.0315            0.018          0.45
Jewish:       0.0105             0.0345            0.036        0.0225        0.027             0.0195         0.15
Protestant:   0.052              0.088             0.108        0.044         0.076             0.032          0.40
Total:        0.157              0.2485            0.288        0.1025        0.1345            0.0695         1

Each number in the body of the table is obtained using the multiplication rule. For example,<br />

P(Catholic and HS Graduate) = 0.45 × 0.32 = 0.144.<br />

Each number in the bottom row is obtained using the law of total probability. For example,<br />

P(HS Graduate) = 0.45 × 0.32 + 0.15 × 0.24 + 0.40 × 0.27 = 0.288.<br />

The prior distribution of Catholic, Jewish, and Protestant is 0.45, 0.15, and 0.40, respectively.<br />

Given that a person is a high school graduate (and no more), <strong>for</strong> example, the<br />

posterior distribution of Catholic, Jewish, and Protestant is<br />

0.144/0.288 = 0.5, 0.036/0.288 = 0.125, and 0.108/0.288 = 0.375, respectively.

This posterior distribution was computed using Bayes rule.<br />



1.9 Independent Events and Mutually Independent Events<br />

Events A and B are said to be independent if<br />

P(A ∩ B) = P(A) × P(B).<br />

Otherwise, A and B are said to be dependent.<br />

If A and B are independent and have positive probabilities, then the multiplication rule <strong>for</strong><br />

probability (page 17) implies that<br />

P(A|B) = P(A) and P(B|A) = P(B).<br />

(The relative size of A within B is the same as its relative size within S; the relative size of<br />

B within A is the same as its relative size within S.)<br />

If A and B are independent, 0 < P(A) < 1, and 0 < P(B) < 1, then A and B^c are independent,
A^c and B are independent, and A^c and B^c are independent.

More generally, events A1,A2,...,Ak are said to be mutually independent if<br />

• For each pair of distinct indices (i1, i2): P(Ai1 ∩ Ai2) = P(Ai1) × P(Ai2),

• For each triple of distinct indices (i1, i2, i3): P(Ai1 ∩ Ai2 ∩ Ai3) = P(Ai1) × P(Ai2) × P(Ai3),

• and so forth.

Example 1.26 (Sampling With Replacement) For example, suppose that four slips<br />

of paper are sampled with replacement from a well-mixed urn containing twenty-five slips<br />

of paper: fifteen slips with the letter X written on each and ten slips of paper with the<br />

letter Y written on each. Then the probability of observing the sequence XY XX is<br />

P(XYXX) = P(X) × P(Y) × P(X) × P(X) = (15/25)³ (10/25) = 54/625 = 0.0864.

Further, since C(4, 3) = 4 sequences have exactly 3 X's, the probability of observing a sequence
with exactly 3 X's is 4(0.0864) = 0.3456.

Example 1.27 (Three Coins) You toss a fair coin three times and record an h (<strong>for</strong><br />

heads) or a t (<strong>for</strong> tails) each time. The sample space is<br />

S = {hhh, hht, hth, thh, htt, tht, tth, ttt}.<br />

Let Hi be the event that the i th toss results in a head:<br />

H1 = {hhh, hht, hth, htt}, H2 = {hhh, hht, thh, tht}, H3 = {hhh, hth, thh, tth}.<br />



Since P(H1) = P(H2) = P(H3) = 1/2,

P(Hi ∩ Hj) = 1/4 = P(Hi) × P(Hj) for (i, j) = (1,2), (1,3), (2,3),

and P(H1 ∩ H2 ∩ H3) = 1/8 = P(H1)P(H2)P(H3), the events are mutually independent.

Example 1.28 (Not Mutually Independent) Suppose that the sample space of an<br />

experiment consists of the 3! = 6 permutations of the letters a, b, and c, along with the<br />

triples of each letter:<br />

S = {aaa, bbb, ccc, abc, acb, bac, bca, cab, cba}.<br />

Assume each outcome is equally likely, and let Ai be the event that the i th letter is an a:<br />

A1 = {aaa, abc, acb}, A2 = {aaa, bac, cab}, A3 = {aaa, bca,cba}.<br />

Then P(A1) = P(A2) = P(A3) = 1/3 and

P(A1 ∩ A2) = P(A1 ∩ A3) = P(A2 ∩ A3) = P(A1 ∩ A2 ∩ A3) = 1/9,

since each intersection set equals the simple event {aaa}.

Since P(Ai ∩ Aj) = 1/9 = P(Ai) × P(Aj) for (i, j) = (1,2), (1,3), (2,3), the events are
independent in pairs. However, since P(A1 ∩ A2 ∩ A3) = 1/9 ≠ 1/27 = P(A1)P(A2)P(A3),
the three events are not mutually independent.

Repeated trials and mutual independence. As stated at the beginning of the chapter,<br />

the term experiment is used in probability theory to describe a procedure whose outcome<br />

is not known in advance with certainty. Experiments are assumed to be repeatable, and to<br />

have a well-defined set of outcomes. Repeated trials are repetitions of an experiment using<br />

the specified procedure, with the outcomes of the trials having no influence on one another.<br />

The results of repeated trials of an experiment are mutually independent.<br />



2 Discrete Random Variables<br />

Researchers use random variables to describe the numerical results of experiments. For<br />

example, if a fair coin is tossed 5 times and the total number of heads is recorded, then a<br />

random variable whose values are 0, 1, 2, 3, 4, 5 is used to give a numerical description of<br />

the results.<br />

This chapter focuses on discrete random variables and their probability distributions.<br />

2.1 Definitions<br />

A random variable is a function from the sample space of an experiment to the real numbers.<br />

The range of a random variable is the set of values the random variable assumes. Random<br />

variables are usually denoted by capital letters (X, Y , Z, ...) and their values by lower case<br />

letters (x, y, z, ...).<br />

If the range of a random variable is a finite or countably infinite set, then the random<br />

variable is said to be discrete; if the range is an interval or a union of intervals, the random<br />

variable is said to be continuous; otherwise, the random variable is said to be mixed.<br />

If X is a discrete random variable, then P(X = x) is the probability that an outcome<br />

has value x. Similarly, P(X ≤ x) is the probability that an outcome has value x or less,<br />

P(a < X < b) is the probability that an outcome has value strictly between a and b, and<br />

so <strong>for</strong>th.<br />

Example 2.1 (Coin Tossing) Let H be the number of heads in 8 tosses of a fair coin<br />

and let X be the difference between the number of heads and the number of tails. That is,<br />

let X = H − (8 − H) = 2H − 8. Since<br />

P(H = h) = C(8, h)/256 when h = 0, 1, ..., 8 and 0 otherwise,

we know that X is a discrete random variable with range {−8, −6, −4, −2, 0, 2, 4, 6, 8}.
The following table displays P(X = x) for each x in the range of X, using a common
denominator throughout:

h           0      1      2      3      4      5      6      7      8
x          −8     −6     −4     −2      0      2      4      6      8
P(X = x)  1/256  8/256  28/256 56/256 70/256 56/256 28/256  8/256  1/256

From this table, we can compute all probabilities of interest. For example,

P(X ≥ 3) = P(X = 4, 6, or 8) = 37/256 ≈ 0.1445.

(There are 256 sequences of heads and tails. Of these, 37 have either 6, 7, or 8 heads.)
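A short sketch (assumed code, not from the notes) reproducing this table and the probability P(X ≥ 3):

```python
from math import comb

# PDF of X = 2H - 8, where H is the number of heads in 8 tosses of a fair coin.
f = {2 * h - 8: comb(8, h) / 2 ** 8 for h in range(9)}

print(f[0])                                    # 70/256 = 0.2734375
print(sum(p for x, p in f.items() if x >= 3))  # P(X >= 3) = 37/256, about 0.1445
```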


2.1.1 PDF and CDF for Discrete Distributions

If X is a discrete random variable, then the frequency function (FF) or probability density<br />

function (PDF) of X is defined as follows:

f(x) = P(X = x) for all real numbers x.

PDFs satisfy the following properties:

1. f(x) ≥ 0 for all real numbers x.

2. Σ_{x∈R} f(x) = 1, where R is the range of X.

Since f(x) is the probability of an event and events have nonnegative probabilities, the<br />

values of f(x) must be nonnegative (property 1). Since the events X = x <strong>for</strong> x ∈ R are<br />

mutually disjoint with union S (the sample space), the sum of the probabilities of these<br />

events must be 1 (property 2).<br />

The cumulative distribution function (CDF) of the discrete random variable X is defined<br />

as follows<br />

F(x) = P(X ≤ x) for all real numbers x.

Note that F(x) is a sum of probabilities:

F(x) = Σ_{y∈Rx} f(y), where Rx = {y ∈ R : y ≤ x}.

CDFs satisfy the following properties:

1. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.

2. If x1 ≤ x2, then F(x1) ≤ F(x2).

3. F(x) is right continuous. That is, for each a, lim_{x→a+} F(x) = F(a).

F(x) represents cumulative probability, with limits 0 and 1 (property 1). Cumulative probability<br />

increases with increasing x (property 2), and has discrete jumps at values of x in<br />

the range of the random variable (property 3). (The concept of limit is central to calculus,<br />

and will be defined in Chapter 4 of these notes.)<br />

Plotting PDF and CDF functions. The PDF of a discrete random variable is represented<br />

graphically<br />

1. By using a scatter plot of pairs (x,f(x)) <strong>for</strong> x ∈ R, or<br />

2. By using a probability histogram, where area is used to represent probability.<br />



Figure 2.1: Probability histogram (left) and cumulative distribution function (right) for the
difference between the number of heads and the number of tails in eight tosses of a fair coin.

The CDF of a discrete random variable is represented graphically as a step function, with<br />

steps of height f(x) at each x ∈ R.<br />

Example 2.2 (Coin Tossing, continued) For example, the left part of Figure 2.1 is the<br />

probability histogram <strong>for</strong> the difference between the number of heads and the number of<br />

tails in eight tosses of a fair coin. For each x in the range of the random variable, a rectangle<br />

with base equal to the interval [x − 0.50,x + 0.50] and with height equal to f(x) is drawn.<br />

The total area is 1.0. The right part is a representation of the cumulative distribution<br />

function. Note that F(x) is nondecreasing, F(x) = 0 when x < −8, and F(x) = 1 when<br />

x > 8. Steps occur at x = −8, −6,... ,6,8.<br />

2.2 Families of Distributions<br />

This section briefly describes four families of discrete probability distributions, and states<br />

properties of these distributions.<br />

2.2.1 Example: Discrete Uniform Distribution

Let n be a positive integer. The random variable X is said to be a discrete uniform random
variable, or to have a discrete uniform distribution, with parameter n when its PDF is as
follows:

f(x) = 1/n when x = 1, 2, ..., n and 0 otherwise.

For example, if you roll a fair six-sided die and let X equal the number of dots on the top<br />

face, then X has a discrete uni<strong>for</strong>m distribution with n = 6.<br />

2.2.2 Example: Hypergeometric Distribution<br />

Let N, M, and n be integers with 0 < M < N and 0 < n < N. The random variable X is<br />

said to be a hypergeometric random variable, or to have a hypergeometric distribution, with<br />



Figure 2.2: Probability histograms for hypergeometric (left) and binomial (right) examples.

parameters n, M, and N, when its PDF is as follows:

f(x) = C(M, x) C(N − M, n − x) / C(N, n),

for integers x between max(0, n + M − N) and min(n, M) (and zero otherwise). Note that

the restrictions on x ensure that all combinations are defined and positive.<br />

Example 2.3 (n = 8, M = 14, N = 35) Assume that X has a hypergeometric distribution<br />

with parameters n = 8, M = 14 and N = 35. The following table gives the values<br />

of the PDF of X <strong>for</strong> all x in its range, using 4 decimal place accuracy.<br />

x 0 1 2 3 4 5 6 7 8<br />

f(x) 0.0086 0.0692 0.2098 0.3147 0.2545 0.1131 0.0268 0.0031 0.0001<br />

The left part of Figure 2.2 is a probability histogram <strong>for</strong> X.<br />

Hypergeometric distributions are used to model urn experiments (see Section 1.5.1). Suppose<br />

there are M special objects in an urn containing a total of N objects. Let X be the<br />

number of special objects in a subset of size n chosen from the urn. If each choice of subset<br />

is equally likely, then X has a hypergeometric distribution. In the example above, 40%<br />

(14/35) of the elements in the urn are special.<br />
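A minimal sketch (assumed code, not from the notes) of the hypergeometric PDF, checked against Example 2.3:

```python
from math import comb

def hyper_pdf(x, n, M, N):
    # probability of x special objects in a subset of size n from an urn with M special of N objects
    if x < max(0, n + M - N) or x > min(n, M):
        return 0.0
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

print([round(hyper_pdf(x, 8, 14, 35), 4) for x in range(9)])
# [0.0086, 0.0692, 0.2098, 0.3147, 0.2545, 0.1131, 0.0268, 0.0031, 0.0001]
```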

2.2.3 Distributions Related to Bernoulli Experiments<br />

A Bernoulli experiment is an experiment with two possible outcomes. The outcome of<br />

chief interest is often called “success” and the other outcome “failure.” Let p equal the<br />

probability of success.<br />

Imagine repeating a Bernoulli experiment n times. The expected number of successes in n<br />

independent trials of a Bernoulli experiment with success probability p is np.<br />

For example, suppose that you roll a fair six-sided die and observe the number on the top<br />

face each time. Let success be a 1 or 4 on the top face, and failure be a 2, 3, 5, or 6 on the<br />

top face. Then p = 1/3 is the probability of success. In 600 trials of the experiment, you<br />

expect 200 successes.<br />



Example: Bernoulli distribution. Suppose that a Bernoulli experiment is run once.<br />

Let X equal 1 if a success occurs and 0 if a failure occurs. Then X is said to be a Bernoulli<br />

random variable, or to have a Bernoulli distribution, with parameter p. The PDF of X is<br />

as follows:<br />

f(1) = p, f(0) = 1 − p, and f(x) = 0 otherwise.<br />

Example: binomial distribution. Let X be the number of successes in n independent<br />

trials of a Bernoulli experiment with success probability p. Then X is said to be a binomial<br />

random variable, or to have a binomial distribution, with parameters n and p. The PDF of<br />

X is as follows:<br />

f(x) = C(n, x) p^x (1 − p)^(n−x) when x = 0, 1, 2, ..., n, and 0 otherwise.

For each x, f(x) is the probability of the event “exactly x successes in n independent trials.”
(There are a total of C(n, x) sequences with exactly x successes and n − x failures; each such
sequence has probability p^x (1 − p)^(n−x).)

Example 2.4 (n = 8, p = 0.4) Assume that X has a binomial distribution with parameters<br />

n = 8 and p = 0.4. The following table gives the values of the PDF of X <strong>for</strong> all x in<br />

its range:<br />

x 0 1 2 3 4 5 6 7 8<br />

f(x) 0.0168 0.0896 0.209 0.2787 0.2322 0.1239 0.0413 0.0079 0.0007<br />

The right part of Figure 2.2 is a probability histogram <strong>for</strong> X.<br />

Notes. The binomial theorem (page 8) can be used to demonstrate that<br />

f(0) + f(1) + · · · + f(n) = 1.<br />

Note also that a Bernoulli random variable is a binomial random variable with n = 1.<br />

2.2.4 Simple Random Samples<br />

Suppose that an urn contains N objects. A simple random sample of size n is a sequence<br />

of n objects chosen without replacement from the urn, where the choice of each sequence is<br />

equally likely.<br />

Let M be the number of special objects in the urn and X be the number of special objects<br />

in a simple random sample of size n. Then X has a hypergeometric distribution with<br />

parameters n, M, N. Further, if N is very large, then binomial probabilities can often be<br />

used to approximate hypergeometric probabilities.<br />

Theorem 2.5 (Binomial Approximation) If N is large, then the binomial distribution<br />

with parameters n and p = M/N can be used to approximate the hypergeometric<br />

distribution with parameters n, M, N. Specifically,<br />

P(x special objects) = C(M, x) C(N − M, n − x) / C(N, n) ≈ C(n, x) p^x (1 − p)^(n−x) for each x.


Note that if X is the number of special objects in a sequence of n objects chosen with<br />

replacement from the urn and if the choice of each sequence is equally likely, then X has<br />

a binomial distribution with parameters n and p = M/N. The theorem says that if N is<br />

large, then the model where sampling is done with replacement can be used to approximate<br />

the model where sampling is done without replacement.<br />

Adequacy of the approximation. One rule of thumb <strong>for</strong> judging the adequacy of the<br />

approximation is the following: the binomial approximation is adequate when<br />

20n < N and 0.05 < p < 0.95.<br />

Survey analysis. Simple random samples are used in surveys. If the survey population<br />

is small, then hypergeometric distributions are used to analyze the results. If the survey<br />

population is large, then binomial distributions are used to analyze the results, even though<br />

each person’s opinion is solicited at most once.<br />

Example 2.6 (Survey Analysis) For example, suppose that a surveyor is interested in<br />

determining the level of support <strong>for</strong> a proposal to change the local tax structure, and decides<br />

to choose a simple random sample of size 10 from the registered voter list. If there are a<br />

total of 120 registered voters, one-third of whom support the proposal, then the probability<br />

that exactly 3 of the 10 chosen voters support the proposal is<br />

P(X = 3) = C(40, 3) C(80, 7) / C(120, 10) ≈ 0.27.

If there are thousands of registered voters, the probability is

P(X = 3) ≈ C(10, 3) (1/3)³ (2/3)⁷ ≈ 0.26.

Note that the exact number of registered voters is not needed in the approximation.<br />
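The survey calculation can be checked with a short sketch (assumed code, not from the notes) comparing the exact hypergeometric probability with its binomial approximation:

```python
from math import comb

def hyper_pdf(x, n, M, N):
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def binom_pdf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(round(hyper_pdf(3, 10, 40, 120), 4))   # exact probability, about 0.27
print(round(binom_pdf(3, 10, 1 / 3), 4))     # binomial approximation, about 0.26
```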

2.3 Mathematical Expectation<br />

Mathematical expectation generalizes the idea of a weighted average, where probability<br />

distributions are used as the weights.<br />

2.3.1 Definitions for Finite Discrete Random Variables

Let X be a discrete random variable with finite range R and PDF f(x). The mean of X<br />

(or the expected value of X or the expectation of X) is defined as follows:<br />

E(X) = Σ_{x∈R} x f(x).



Similarly, if g(X) is a real-valued function of X, then the mean of g(X) (or the expected<br />

value of g(X) or the expectation of g(X)) is<br />

E(g(X)) = Σ_{x∈R} g(x) f(x).

Example 2.7 (Nickels and Dimes) Assume you have 3 dimes and 5 nickels in your<br />

pocket. You choose a subset of 4 coins, let X equal the number of dimes in the subset and<br />

g(X) = 10X + 5(4 − X) = 20 + 5X<br />

be the total value (in cents) of the chosen coins. If each choice of subset is equally likely,<br />

then X has a hypergeometric distribution with parameters n = 4, M = 3, and N = 8, the<br />

expected number of dimes in the subset is<br />

E(X) = Σ_{x=0}^{3} x f(x) = 0(5/70) + 1(30/70) + 2(30/70) + 3(5/70) = 1.5

and the expected total value of the chosen coins is

E(g(X)) = Σ_{x=0}^{3} (20 + 5x) f(x) = 20(5/70) + 25(30/70) + 30(30/70) + 35(5/70) = 27.5 cents.

Note that E(g(X)) = g(E(X)).

2.3.2 Properties of Expectation

Properties of sums imply the following properties of expectation:

1. If a is a constant, then E(a) = a.

2. If a and b are constants, then E(a + bX) = a + bE(X).

3. If ci is a constant and gi(X) is a real-valued function for i = 1, 2, ..., k, then

   E(c1 g1(X) + c2 g2(X) + · · · + ck gk(X)) = c1 E(g1(X)) + c2 E(g2(X)) + · · · + ck E(gk(X)).

The first property says that the mean of a constant function is the constant itself. The
second property says that if g(X) = a + bX, then E(g(X)) = g(E(X)). The third property
generalizes the first two.

Note that if g(X) is not of the form a + bX, then E(g(X)) and g(E(X)) may be different. For example,
let g(X) = X² in the nickels and dimes example above. Then

E(g(X)) = 0²(5/70) + 1²(30/70) + 2²(30/70) + 3²(5/70) = 39/14.

Since 39/14 ≠ (3/2)², E(g(X)) ≠ g(E(X)) in this case.
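These expectations can be verified with a short sketch (assumed code, not from the notes); X is hypergeometric with n = 4, M = 3, N = 8, and g(X) = 20 + 5X is the value of the chosen coins in cents.

```python
from math import comb

f = {x: comb(3, x) * comb(5, 4 - x) / comb(8, 4) for x in range(4)}   # PDF of X

EX  = sum(x * p for x, p in f.items())              # 1.5 dimes on average
Eg  = sum((20 + 5 * x) * p for x, p in f.items())   # 27.5 cents; equals g(E(X)) since g is linear
EX2 = sum(x ** 2 * p for x, p in f.items())         # 39/14, which is not (E(X))^2 = 2.25
print(EX, Eg, EX2)
```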



Table 2.1: Means and variances for four families of discrete distributions.

Distribution              E(X)        Var(X)
Discrete Uniform n        (n + 1)/2   (n² − 1)/12
Hypergeometric n, M, N    nM/N        n(M/N)(1 − M/N)(N − n)/(N − 1)
Bernoulli p               p           p(1 − p)
Binomial n, p             np          np(1 − p)

2.3.3 Mean, Variance and Standard Deviation

Let X be a random variable and let µ = E(X) be its mean. The variance of X, V ar(X),<br />

is defined as follows:

Var(X) = E((X − µ)²).

The notation σ 2 = V ar(X) is used to denote the variance. The standard deviation of X,<br />

σ = SD(X), is the positive square root of the variance.<br />

The mean is a measure of the center (or location) of a distribution. The variance and<br />

standard deviation are measures of the scale (or spread) of a distribution. If X is the height<br />

of an individual in inches, say, then the values of E(X) and SD(X) are in inches, while the<br />

value of V ar(X) is in square-inches.<br />

Table 2.1 gives general <strong>for</strong>mulas <strong>for</strong> the mean and variance of the four families of discrete<br />

probability distributions discussed earlier.<br />

Example 2.8 (Drilling <strong>for</strong> Oil) Assume that the probability of finding oil in a given<br />

drill hole is 0.3 and that the results (finding oil or not) are independent from drill hole to<br />

drill hole.<br />

An oil company drills one hole at a time. If they find oil, then they stop; otherwise, they<br />

continue. However, the company only has enough money to drill five holes. Let X be the<br />

number of holes drilled.<br />

The range of X is {1,2,3,4,5} and its PDF is given in Table 2.2. (The values of f(x) are<br />

zero otherwise.) The mean number of holes drilled is

E(X) = Σ_{x=1}^{5} x f(x) = 1(0.3) + 2(0.21) + 3(0.147) + 4(0.1029) + 5(0.2401) ≈ 2.7731.

Similarly,

Var(X) = Σ_{x=1}^{5} (x − E(X))² f(x) ≈ 2.42182 and SD(X) = √Var(X) ≈ 1.55622.



Table 2.2: Probability distribution for the oil drilling example.

f(x) = P(X = x) Comments:<br />

f(1) = 0.3 Success on first try<br />

f(2) = (0.7)0.3 = 0.21 Failure followed by success<br />

f(3) = (0.7) 2 0.3 = 0.147 Two failures followed by success<br />

f(4) = (0.7) 3 0.3 = 0.1029 Three failures followed by success<br />

f(5) = (0.7) 4 0.3 + (0.7) 5 = 0.2401 Four failures followed by success<br />

or failure on all 5 tries<br />
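The mean, variance, and standard deviation for the oil drilling example can be reproduced with a few lines (assumed code, not from the notes):

```python
from math import sqrt

# PDF of the number of holes drilled, as in Table 2.2
f = {1: 0.3, 2: 0.7 * 0.3, 3: 0.7 ** 2 * 0.3, 4: 0.7 ** 3 * 0.3, 5: 0.7 ** 4 * 0.3 + 0.7 ** 5}

mean = sum(x * p for x, p in f.items())                   # about 2.7731
var  = sum((x - mean) ** 2 * p for x, p in f.items())     # about 2.4218
print(mean, var, sqrt(var))                               # SD about 1.5562
```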

Properties of variance and standard deviation. Properties of sums can be used to<br />

prove the following properties of the variance and standard deviation:<br />

1. V ar(X) = E(X 2 ) − (E(X)) 2 .<br />

2. If Y = a + bX, then V ar(Y ) = b 2 V ar(X) and SD(Y ) = |b|SD(X).<br />

The first property provides a quick by-hand method <strong>for</strong> computing the variance. For example,<br />

the variance of X in the nickels and dimes example is 39/14 − (3/2) 2 = 15/28<br />

(confirming the result <strong>for</strong> hypergeometric random variables given in Table 2.1).<br />

2.3.4 Chebyshev’s Inequality<br />

The Chebyshev inequality illustrates the importance of the concepts of mean, variance, and<br />

standard deviation.<br />

Theorem 2.9 (Chebyshev’s Inequality) Let X be a random variable with finite mean<br />

µ and standard deviation σ, and let ε be a positive constant. Then

P(µ − ε < X < µ + ε) > 1 − σ²/ε².

In particular, if ε = kσ for some k, then

P(µ − kσ < X < µ + kσ) > 1 − 1/k².

For example, if ε = 2.5σ, then Chebyshev's inequality states that

• At least 84% of the distribution of X is in the interval |x − µ| < 2.5σ and<br />

• At most 16% of the distribution is in the complementary interval |x − µ| ≥ 2.5σ.<br />



Proof of Chebyshev's inequality: Using complements, it is sufficient to demonstrate that

P(|X − µ| ≥ ε) ≤ σ²/ε².

Since σ² = E((X − µ)²) = Σ_x (x − µ)² f(x), where f(x) is the PDF of X, and the sum of
two nonnegative terms is greater than or equal to either term,

σ² = Σ_{|x−µ|≥ε} (x − µ)² f(x) + Σ_{|x−µ|<ε} (x − µ)² f(x) ≥ Σ_{|x−µ|≥ε} (x − µ)² f(x) ≥ ε² Σ_{|x−µ|≥ε} f(x) = ε² P(|X − µ| ≥ ε).

Dividing both sides by ε² gives the result.


3 Joint Discrete Distributions<br />

A probability distribution describing the joint variability of two or more random variables<br />

is called a joint distribution.<br />

For example, if X is the height (in feet), Y is the weight (in pounds), and Z is the serum<br />

cholesterol level (in mg/dL) of a person chosen from a given population, then we may be<br />

interested in describing the joint distribution of the triple (X,Y,Z).<br />

3.1 Bivariate Distributions<br />

A bivariate distribution is the joint distribution of a pair of random variables.<br />

3.1.1 Joint PDF<br />

Assume that X and Y are discrete random variables. The joint frequency function (joint<br />

FF) or joint probability density function (joint PDF) of (X,Y ) is defined as follows:<br />

f(x,y) = P(X = x,Y = y) <strong>for</strong> all real pairs (x,y),<br />

where the comma is understood to mean the intersection of the events. The notation<br />

fXY (x,y) is sometimes used to emphasize the two random variables.<br />

3.1.2 Joint CDF<br />

Assume that X and Y are discrete random variables. The joint cumulative distribution<br />

function (joint CDF) of (X,Y ) is defined as follows:<br />

F(x,y) = P(X ≤ x,Y ≤ y) <strong>for</strong> all real pairs (x,y),<br />

where the comma is understood to mean the intersection of the events. The notation<br />

FXY (x,y) is sometimes used to emphasize the two random variables.<br />

3.1.3 Marginal Distributions<br />

Let X and Y be discrete random variables with joint PDF f(x,y). The marginal frequency<br />

function (marginal FF) or marginal probability density function (marginal PDF) of X is<br />

fX(x) = Σ_y f(x, y) for all real numbers x,

where the sum is taken over all y in the range of Y. Similarly, the marginal frequency
function or marginal PDF of Y is

fY(y) = Σ_x f(x, y) for all real numbers y,

where the sum is taken over all x in the range of X.



3.1.4 Conditional Distributions<br />

Let X and Y be discrete random variables with joint PDF f(x,y).<br />

If fY(y) ≠ 0, then the conditional frequency function (conditional FF) or conditional probability
density function (conditional PDF) of X given Y = y is defined as follows:

f_{X|Y=y}(x|y) = f(x, y) / fY(y) for all real numbers x.

Similarly, if fX(x) ≠ 0, then the conditional PDF of Y given X = x is

f_{Y|X=x}(y|x) = f(x, y) / fX(x) for all real numbers y.

Note that in the first case, the conditional sample space is the collection of outcomes with<br />

Y = y; in the second case, it is the collection of outcomes with X = x.<br />

Example 3.1 (Eyesight Study, continued) Consider again the eyesight study from<br />

page 19. Let X be the score <strong>for</strong> the right eye and Y be the score <strong>for</strong> the left eye. Then<br />

Table 1.2 displays the joint (X,Y ) distribution. The probability that both eyes have the<br />

same score, <strong>for</strong> example, is<br />

P(X = Y ) = f(1,1) + f(2,2) + f(3,3) + f(4,4) = 0.708.<br />

The marginal distribution of X is given in the right column of Table 1.2 and the marginal<br />

distribution of Y is given in the bottom row of the table.<br />

Figure 1.3 displays the marginal distribution of Y (in gray), the conditional distribution<br />

of Y given X = 1 (left plot, in black), and the conditional distribution of Y given X = 4<br />

(right plot, in black).<br />
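Marginal and conditional distributions are simple sums and ratios of the joint PDF. The following sketch (assumed code, not from the notes) uses the joint distribution of Table 1.2:

```python
# joint[(x, y)] = P(X = x, Y = y) for right-eye score x and left-eye score y
joint = {
    (1, 1): 0.203, (1, 2): 0.036, (1, 3): 0.017, (1, 4): 0.009,
    (2, 1): 0.031, (2, 2): 0.202, (2, 3): 0.058, (2, 4): 0.010,
    (3, 1): 0.016, (3, 2): 0.048, (3, 3): 0.237, (3, 4): 0.027,
    (4, 1): 0.005, (4, 2): 0.011, (4, 3): 0.024, (4, 4): 0.066,
}

fX = {x: sum(p for (a, b), p in joint.items() if a == x) for x in range(1, 5)}
fY = {y: sum(p for (a, b), p in joint.items() if b == y) for y in range(1, 5)}
condY_given_X1 = {y: joint[(1, y)] / fX[1] for y in range(1, 5)}   # conditional PDF of Y given X = 1
print(fX, fY, condY_given_X1)
```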

Example 3.2 (Urn Experiment) Assume that an urn contains 4 red, 3 white and 1 blue<br />

chip. Consider the following 2 step experiment:<br />

Step 1. Thoroughly mix the contents of the urn. Choose a chip and record its color.<br />

Return the chip plus two more chips of the same color to the urn.<br />

Step 2. Thoroughly mix the contents of the urn. Choose a chip and record its color.<br />

Let X equal the number of red chips chosen and Y equal the number of white chips chosen.<br />

Then Table 3.1 displays the joint (X,Y ) distribution. The marginal distribution of X is<br />

given in the right column and the marginal distribution of Y is given in the bottom row.<br />

The conditional distribution of Y given X = 1, for example, is

f_{Y|X=1}(0|1) = 0.1/0.4 = 1/4, f_{Y|X=1}(1|1) = 0.3/0.4 = 3/4,

and f_{Y|X=1}(y|1) = 0 otherwise.



Table 3.1: Joint distribution for the urn experiment example.

        y = 0    y = 1    y = 2    total
x = 0   0.0375   0.0750   0.1875   0.3
x = 1   0.1000   0.3000   0        0.4
x = 2   0.3000   0        0        0.3
total   0.4375   0.3750   0.1875   1.0

3.1.5 Independent Random Variables

The discrete random variables X and Y are said to be independent if<br />

f(x,y) = fX(x)fY (y) <strong>for</strong> all real pairs (x,y).<br />

Otherwise, X and Y are said to be dependent.<br />

The discrete random variables X and Y are independent if the probability of the intersection is
equal to the product of the probabilities,

P(X = x, Y = y) = P(X = x) P(Y = y),

for all events of interest (for all x, y).

Example 3.3 (Independent Bernoulli Random Variables) Assume that (X,Y ) has<br />

the joint distribution given in the following table:<br />

y = 0 y = 1 total<br />

x = 0 0.28 0.42 0.70<br />

x = 1 0.12 0.18 0.30<br />

total 0.40 0.60 1.00<br />

X is a Bernoulli random variable with parameter 0.30 and Y is a Bernoulli random variable<br />

with parameter 0.60. Further, since each probability in the body of the 2-by-2 table is the<br />

product of the appropriate marginal probabilities, X and Y are independent.<br />

Example 3.4 (Eyesight Study, continued) The random variables X and Y in the eyesight
study are dependent. To demonstrate this, we need to show that f(x, y) ≠ fX(x)fY(y)
for some pair (x, y). For example,

f(1, 1) = 0.203 ≠ 0.265 × 0.255 = fX(1)fY(1).

Example 3.5 (Urn Experiment, continued) Similarly, the random variables X and Y
in the urn experiment are dependent. For example,

f(1, 1) = 0.3 ≠ 0.4 × 0.375 = fX(1)fY(1).



3.1.6 Mathematical Expectation for Finite Bivariate Distributions

Let X and Y be discrete random variables with joint PDF f(x,y) = P(X = x,Y = y),<br />

and let g(X,Y ) be a real-valued function. The mean of g(X,Y ) (or the expected value of<br />

g(X,Y ) or the expectation of g(X,Y )) is<br />

E(g(X, Y)) = Σ_x Σ_y g(x, y) f(x, y),

where the double sum includes all pairs with nonzero joint PDF.

Example 3.6 (Eyesight Study, continued) Continuing with the eyesight study, let<br />

g(X,Y ) = |X − Y | be the absolute value of the difference in scores <strong>for</strong> the two eyes. Then<br />

E(g(X, Y)) = Σ_{x=1}^{4} Σ_{y=1}^{4} |x − y| f(x, y) = 0.374.

(The average absolute difference in scores <strong>for</strong> the two eyes is small.)<br />

Properties. Properties of sums and the fact that the joint PDF of independent random<br />

variables equals the product of the marginal PDFs can be used to show the following:<br />


1. If a and b1,b2,... ,bn are constants and gi(X,Y ) are real-valued functions <strong>for</strong> each<br />

i = 1, 2, ..., n, then

E(a + Σ_{i=1}^{n} bi gi(X, Y)) = a + Σ_{i=1}^{n} bi E(gi(X, Y)).

2. If X and Y are independent and g(X) and h(Y) are real-valued functions, then

E(g(X)h(Y)) = E(g(X))E(h(Y)).

The first property generalizes properties from Section 2.3.2. The second property is useful<br />

when studying associations between variables.<br />

3.1.7 Covariance and Correlation<br />

Let X and Y be random variables with finite means (µx, µy) and finite standard deviations<br />

(σx, σy). The covariance of X and Y , Cov(X,Y ), is defined as follows:<br />

Cov(X,Y ) = E((X − µx)(Y − µy)).<br />

The notation σxy = Cov(X,Y ) is used to denote the covariance. The correlation of X and<br />

Y , Corr(X,Y ), is defined as follows:<br />

Corr(X, Y) = Cov(X, Y) / (σx σy) = σxy / (σx σy).

The notation ρ = Corr(X,Y ) is used to denote the correlation of X and Y ; ρ is called the<br />

correlation coefficient.<br />



Positive and negative association. Covariance and correlation are measures of the<br />

association between two random variables. Specifically,<br />

1. The random variables X and Y are said to be positively associated if as X increases,<br />

Y tends to increase. If X and Y are positively associated, then Cov(X,Y ) and<br />

Corr(X,Y ) will be positive.<br />

2. The random variables X and Y are said to be negatively associated if as X increases,<br />

Y tends to decrease. If X and Y are negatively associated, then Cov(X,Y ) and<br />

Corr(X,Y ) will be negative.<br />

For example, the height and weight of individuals in a given population are positively<br />

associated. Educational level and indices of poor health are often negatively associated.<br />

Properties of covariance. The following properties of covariance are important:<br />

1. Cov(X,X) = V ar(X).<br />

2. Cov(X,Y ) = Cov(Y,X).<br />

3. Cov(a + bX,c + dY ) = bdCov(X,Y ), where a, b, c, and d are constants.<br />

4. Cov(X,Y ) = E(XY ) − E(X)E(Y ).<br />

5. If X and Y are independent, then Cov(X,Y ) = 0.<br />

The first two properties follow immediately from the definition of covariance. The third<br />

property relates the covariance of linearly trans<strong>for</strong>med random variables to the covariance<br />

of the original random variables. In particular, if b = d = 1 (values are shifted only), then<br />

the covariance is unchanged; if a = c = 0 and b,d > 0 (values are re-scaled), then the<br />

covariance is multiplied by bd.<br />

The fourth property gives an alternative method <strong>for</strong> calculating the covariance; the method<br />

is particularly well-suited <strong>for</strong> by-hand computations. Since E(XY ) = E(X)E(Y ) <strong>for</strong> independent<br />

random variables, the fourth property can be used to prove that the covariance of<br />

independent random variables is zero (property 5).<br />

Properties of correlation. The following properties of correlation are important:<br />

1. −1 ≤ Corr(X,Y ) ≤ 1.<br />

2. If a, b, c, and d are constants, then<br />

Corr(a + bX, c + dY) = Corr(X, Y) when bd > 0, and −Corr(X, Y) when bd < 0.

In particular, Corr(X,a + bX) equals 1 if b > 0 and equals −1 if b < 0.<br />

3. If X and Y are independent, then Corr(X,Y ) = 0.<br />



Correlation is often called standardized covariance since, by the first property, its values<br />

always lie in the [−1,1] interval. If the correlation is close to −1, then there is a strong<br />

negative association between the variables; if the correlation is close to 1, then there is a<br />

strong positive association between the variables.<br />

The second property relates the correlation of linearly trans<strong>for</strong>med variables to the correlation<br />

of the original variables. Note, in particular, that the correlation is unchanged if the<br />

random variables are shifted (b = d = 1) or if the random variables are re-scaled (a = c = 0<br />

and b,d > 0). For example, the correlation between the height and weight of individuals in<br />

a population is the same no matter which measurement scale is used <strong>for</strong> height (e.g. inches,<br />

feet) and which measurement scale is used <strong>for</strong> weight (e.g. pounds, kilograms).<br />

If Corr(X,Y ) = 0, then X and Y are said to be uncorrelated; otherwise, they are said to<br />

be correlated. The third property says that independent random variables are uncorrelated.<br />

Note that uncorrelated random variables are not necessarily independent.<br />

Example 3.7 (Eyesight Study, continued) For (X,Y ) from the eyesight study,<br />

E(X) = 2.275, E(Y ) = 2.305, E(X 2 ) = 6.117, E(Y 2 ) = 6.259, E(XY ) = 5.905.<br />

Using properties of variance and covariance:

Var(X) ≈ 0.941, Var(Y) ≈ 0.946, Cov(X, Y) ≈ 0.661.

Finally, ρ = Corr(X, Y) ≈ 0.701.
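These values follow directly from the moments listed above; a short sketch (assumed code, not from the notes) using the property Cov(X, Y) = E(XY) − E(X)E(Y):

```python
from math import sqrt

EX, EY, EX2, EY2, EXY = 2.275, 2.305, 6.117, 6.259, 5.905

var_x = EX2 - EX ** 2              # about 0.941
var_y = EY2 - EY ** 2              # about 0.946
cov   = EXY - EX * EY              # about 0.661
rho   = cov / sqrt(var_x * var_y)  # about 0.701
print(var_x, var_y, cov, rho)
```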

Example 3.8 (Uncorrelated but Dependent) Assume that the random pair (X,Y )<br />

has the joint distribution given in the following table:<br />

y = −1 y = 0 y = 1 total<br />

x = −1 0.05 0.10 0.05 0.20<br />

x = 0 0.10 0.40 0.10 0.60<br />

x = 1 0.05 0.10 0.05 0.20<br />

total 0.20 0.60 0.20 1.00<br />

Since E(X) = E(Y ) = E(XY ) = 0, the covariance and correlation must equal 0.<br />

But X and Y are dependent since, for example, f(0, 0) ≠ fX(0)fY(0).

3.1.8 Conditional Expectation and Regression<br />

Let X and Y be discrete random variables with finite ranges and joint PDF f(x,y).<br />

If fX(x) ≠ 0, then the conditional expectation (or the conditional mean) of Y given X = x,
E(Y|X = x), is defined as follows:

E(Y|X = x) = Σ_y y f_{Y|X=x}(y|x),

where the sum is over all y with nonzero conditional PDF (f_{Y|X=x}(y|x) ≠ 0).



      y1     y2     y3     y4     y5
x0    0.02   0.02   0.02   0.01   0.01
x1    0.02   0.02   0.01   0.04   0.04
x2    0.02   0.01   0.04   0.04   0.10
x3    0.01   0.04   0.04   0.10   0.10
x4    0.01   0.04   0.10   0.10   0.04

Figure 3.1: Joint (X, Y) distribution (left) and plot of (x, E(Y|X = x)) pairs (right) for the
conditional expectation example.

The conditional expectation of X given Y = y, E(X|Y = y), is defined similarly.<br />

Example 3.9 (Conditional Expectation) Let (X,Y ) be the random pair whose joint<br />

distribution is shown in the left part of Figure 3.1. Then

E(Y|X = 0) = 1(.02/.08) + 2(.02/.08) + 3(.02/.08) + 4(.01/.08) + 5(.01/.08) = 2.625.

Similarly,<br />

E(Y |X = 1) = 3.462, E(Y |X = 2) = 3.905, E(Y |X = 3) = 3.828 and E(Y |X = 4) = 3.414.<br />

The right part of Figure 3.1 is a plot of pairs (x,E(Y |X = x)) <strong>for</strong> x = 0,1,2,3,4.<br />
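The regression of Y on X can be computed directly from the joint PDF. A sketch (assumed code, not from the notes) using the distribution in Figure 3.1 (rows x = 0, ..., 4; columns y = 1, ..., 5):

```python
rows = [
    [0.02, 0.02, 0.02, 0.01, 0.01],   # x = 0
    [0.02, 0.02, 0.01, 0.04, 0.04],   # x = 1
    [0.02, 0.01, 0.04, 0.04, 0.10],   # x = 2
    [0.01, 0.04, 0.04, 0.10, 0.10],   # x = 3
    [0.01, 0.04, 0.10, 0.10, 0.04],   # x = 4
]

for x, row in enumerate(rows):
    fx = sum(row)                                                 # marginal P(X = x)
    cond_mean = sum((y + 1) * p for y, p in enumerate(row)) / fx  # E(Y | X = x)
    print(x, round(cond_mean, 3))    # 2.625, 3.462, 3.905, 3.828, 3.414
```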

Regression equation. The <strong>for</strong>mula <strong>for</strong> the conditional expectation<br />

E(Y |X = x) as a function of x<br />

is called the regression equation of Y on X. Similarly, the <strong>for</strong>mula <strong>for</strong> the conditional<br />

expectation E(X|Y = y) as a function of y is called the regression equation of X on Y .<br />

Linear conditional means. Let X and Y be random variables with finite means (µx,<br />

µy), standard deviations (σx, σy), and correlation (ρ).<br />

If E(Y|X = x) is a linear function of x, then the formula is of the form:

E(Y|X = x) = µy + ρ (σy/σx) (x − µx).

Similarly, if E(X|Y = y) is a linear function of y, then the formula is of the form:

E(X|Y = y) = µx + ρ (σx/σy) (y − µy).


3.1.9 Example: Bivariate Hypergeometric Distribution<br />

Let n, M1, M2, and M3 be positive integers with n ≤ M1 + M2 + M3. The random<br />

pair (X,Y ) is said to have a bivariate hypergeometric distribution with parameters n and<br />

(M1,M2,M3) if its joint PDF has the following <strong>for</strong>m:<br />

f(x, y) = C(M1, x) C(M2, y) C(M3, n − x − y) / C(M1 + M2 + M3, n)

when x and y are nonnegative integers with

x ≤ min(n, M1), y ≤ min(n, M2) and max(0, n − M3) ≤ x + y ≤ n,

and is equal to zero otherwise. Note that the restrictions on x and y ensure that all
combinations are defined and positive.

Example 3.10 (n = 4, M1 = 2, M2 = 3, M3 = 5) Assume that (X,Y ) has a bivariate<br />

hypergeometric distribution with parameters n = 4, M1 = 2, M2 = 3 and M3 = 5. The<br />

joint PDF is<br />

f(x, y) = C(2, x) C(3, y) C(5, 4 − x − y) / C(10, 4), x = 0, 1, 2; y = 0, 1, ..., min(3, 4 − x),

and 0 otherwise. The joint (X,Y ) distribution and the marginal X and Y distributions are<br />

displayed in the following table:<br />

y = 0 y = 1 y = 2 y = 3 total<br />

x = 0 0.0238 0.1429 0.1429 0.0238 0.3334<br />

x = 1 0.0952 0.2857 0.1429 0.0095 0.5333<br />

x = 2 0.0476 0.0714 0.0143 0.1333<br />

total 0.1666 0.5000 0.3001 0.0333 1.0000<br />

Urn experiments. Bivariate hypergeometric distributions are used to model urn experiments. Specifically, suppose that an urn contains N objects: M1 of type 1, M2 of type 2, and M3 of type 3 (N = M1 + M2 + M3). Let X equal the number of objects of type 1 and Y equal the number of objects of type 2 in a subset of size n chosen from the urn. If each choice of subset is equally likely, then (X,Y ) has a bivariate hypergeometric distribution with parameters n and (M1, M2, M3). In the example above, 20% (2/10) of the objects in the urn are of type 1, 30% (3/10) are of type 2 and 50% (5/10) are of type 3.

Negative association. If (X,Y ) has a bivariate hypergeometric distribution, then X and Y are negatively associated, with correlation

ρ = Corr(X,Y ) = −√( (M1/(M2 + M3)) (M2/(M1 + M3)) ).


Marginal and conditional distributions. If (X,Y ) has a bivariate hypergeometric distribution, then

1. X has a hypergeometric distribution with parameters n, M1, and N; and

2. Y has a hypergeometric distribution with parameters n, M2, and N.

In addition, each conditional distribution is hypergeometric. Finally, bivariate hypergeometric distributions have linear conditional means.

3.1.10 Example: Trinomial Distribution

Let n be a positive integer, and let p1, p2, and p3 be positive proportions with sum 1. The random pair (X,Y ) is said to have a trinomial distribution with parameters n and (p1, p2, p3) when its joint PDF has the following form:

f(x,y) = (n! / (x! y! (n − x − y)!)) p1^x p2^y p3^(n−x−y)

when x = 0, 1, ..., n; y = 0, 1, ..., n; x + y ≤ n, and is equal to zero otherwise.

Example 3.11 (n = 4, p1 = 0.2, p2 = 0.3, p3 = 0.5) Assume that (X,Y ) has a trinomial distribution with parameters n = 4, p1 = 0.2, p2 = 0.3 and p3 = 0.5. The joint PDF is

f(x,y) = (4! / (x! y! (4 − x − y)!)) (0.2)^x (0.3)^y (0.5)^(4−x−y)   when x, y = 0, 1, 2, 3, 4; x + y ≤ 4,

and 0 otherwise. The joint (X,Y ) distribution and the marginal X and Y distributions are displayed in the following table:

          y = 0     y = 1     y = 2     y = 3     y = 4     total
x = 0     0.0625    0.1500    0.1350    0.0540    0.0081    0.4096
x = 1     0.1000    0.1800    0.1080    0.0216              0.4096
x = 2     0.0600    0.0720    0.0216                        0.1536
x = 3     0.0160    0.0096                                  0.0256
x = 4     0.0016                                            0.0016
total     0.2401    0.4116    0.2646    0.0756    0.0081    1.0000
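The same check works for the trinomial table. A small sketch (again in Python, with the parameter values of Example 3.11) follows:

    from math import factorial

    # Sketch: trinomial PDF for n = 4, (p1, p2, p3) = (0.2, 0.3, 0.5).
    n, p1, p2, p3 = 4, 0.2, 0.3, 0.5
    def trinomial(x, y):
        z = n - x - y
        coef = factorial(n) // (factorial(x) * factorial(y) * factorial(z))
        return coef * p1**x * p2**y * p3**z
    for x in range(n + 1):
        print(x, [round(trinomial(x, y), 4) for y in range(n - x + 1)])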

Experiments with three outcomes. Trinomial distributions are used to model experiments with exactly three outcomes. Specifically, suppose that an experiment has three outcomes, which occur with probabilities p1, p2, and p3, respectively. Let X be the number of occurrences of outcome 1 and Y be the number of occurrences of outcome 2 in n independent trials of the experiment. Then (X,Y ) has a trinomial distribution with parameters n and (p1, p2, p3). In the example above, the probabilities of the three outcomes are 0.2, 0.3 and 0.5, respectively.


Negative association. If (X,Y ) has a trinomial distribution, then X and Y are negatively associated, with correlation

ρ = Corr(X,Y ) = −√( (p1/(1 − p1)) (p2/(1 − p2)) ).

Marginal and conditional distributions. If (X,Y ) has a trinomial distribution, then

1. X has a binomial distribution with parameters n and p1, and

2. Y has a binomial distribution with parameters n and p2.

In addition, each conditional distribution is binomial. Finally, trinomial distributions have linear conditional means.

3.1.11 Simple Random Samples

The results of Section 2.2.4 can be generalized. In particular, trinomial probabilities can often be used to approximate bivariate hypergeometric probabilities when N is large enough, and each family of distributions can be used in survey analysis.

Example 3.12 (Survey Analysis) For example, suppose that a surveyor is interested in determining the level of support for a proposal to change the local tax structure, and decides to choose a simple random sample of size 10 from the registered voter list. If there are a total of 120 registered voters, where one-third support the proposal, one-half oppose the proposal, and one-sixth have no opinion, then the probability that exactly 3 support, 5 oppose, and 2 have no opinion is

P(X = 3, Y = 5) = C(40, 3) C(60, 5) C(20, 2) / C(120, 10) ≈ 0.088.

If there are thousands of registered voters, then the probability is approximately

P(X = 3, Y = 5) ≈ (10! / (3! 5! 2!)) (1/3)^3 (1/2)^5 (1/6)^2 ≈ 0.081.

As before, you do not need to know the exact number of registered voters when you use the trinomial approximation.
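To see how close the two answers are, here is a sketch that computes both the exact bivariate hypergeometric probability and its trinomial approximation for this survey (counts 40/60/20 out of N = 120, sample size 10):

    from math import comb, factorial

    # Sketch: exact probability vs. trinomial approximation for Example 3.12.
    exact = comb(40, 3) * comb(60, 5) * comb(20, 2) / comb(120, 10)
    coef = factorial(10) // (factorial(3) * factorial(5) * factorial(2))
    approx = coef * (1/3)**3 * (1/2)**5 * (1/6)**2
    print(round(exact, 3), round(approx, 3))    # approximately 0.088 and 0.081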

3.2 Multivariate Distributions

A multivariate distribution is the joint distribution of k random variables. Ideas studied in the bivariate case (k = 2) can be generalized to the case where k > 2.


3.2.1 Joint PDF and Joint CDF

Let X1, X2, ..., Xk be discrete random variables and let

X = (X1, X2, ..., Xk).

The joint frequency function (joint FF) or joint probability density function (joint PDF) of X is defined as follows:

f(x) = P(X1 = x1, X2 = x2, ..., Xk = xk)

for all real k-tuples x = (x1, x2, ..., xk), where commas are understood to mean the intersection of events.

The joint cumulative distribution function (joint CDF) of X is defined as follows:

F(x) = P(X1 ≤ x1, X2 ≤ x2, ..., Xk ≤ xk)

for all real k-tuples x = (x1, x2, ..., xk), where commas are understood to mean the intersection of events.

Example 3.13 (k = 3) Assume that (X,Y,Z) has joint PDF

f(x,y,z) = 5 / (28(x + y + z))   when x, y = 0, 1 and z = 1, 2, 3, 4, and 0 otherwise.

The following table shows the joint (X,Y,Z) distribution. The marginal (X,Y ) distribution is displayed in the right column and the marginal Z distribution is displayed in the bottom row. A common denominator is used throughout.

                  z = 1      z = 2      z = 3      z = 4
x = 0   y = 0    60/336     30/336     20/336     15/336     125/336
        y = 1    30/336     20/336     15/336     12/336      77/336
x = 1   y = 0    30/336     20/336     15/336     12/336      77/336
        y = 1    20/336     15/336     12/336     10/336      57/336
                140/336     85/336     62/336     49/336        1

X and Y have the same marginal distribution. Namely,

fX(0) = fY(0) = 101/168 and fX(1) = fY(1) = 67/168

(and 0 otherwise).

The probability that X + Y + Z is at most 3, for example, is

P(X + Y + Z ≤ 3) = Σ_{x=0}^{1} Σ_{y=0}^{1} Σ_{z=1}^{3−x−y} f(x,y,z) = 115/168 ≈ 0.685.
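These fractions can be verified by brute force. A sketch (using Python's Fraction type so the arithmetic stays exact) follows:

    from fractions import Fraction

    # Sketch: check the total, a marginal, and P(X + Y + Z <= 3) for Example 3.13.
    def f(x, y, z):
        return Fraction(5, 28 * (x + y + z))
    triples = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (1, 2, 3, 4)]
    total = sum(f(*t) for t in triples)
    fX0 = sum(f(x, y, z) for (x, y, z) in triples if x == 0)
    prob = sum(f(*t) for t in triples if sum(t) <= 3)
    print(total, fX0, prob)    # 1, 101/168, 115/168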


3.2.2 Mutually Independent Random Variables and Random Samples

The discrete random variables X1, X2, ..., Xk are said to be mutually independent (or independent) if

f(x) = f1(x1) f2(x2) · · · fk(xk) for all real k-tuples x = (x1, x2, ..., xk),

where fi(xi) = P(Xi = xi) for i = 1, 2, ..., k. (The probability of the intersection is equal to the product of the probabilities for all events of interest.)

Equivalently, X1, X2, ..., Xk are said to be mutually independent if

F(x) = F1(x1) F2(x2) · · · Fk(xk) for all real k-tuples x = (x1, x2, ..., xk),

where Fi(xi) = P(Xi ≤ xi) for i = 1, 2, ..., k.

X1, X2, ..., Xk are said to be dependent if they are not mutually independent.

Example 3.14 (k = 3, continued) The random variables X, Y and Z in the previous example are dependent. To see this, we just need to show that f(x,y,z) ≠ fX(x) fY(y) fZ(z) for some triple (x,y,z). For example, f(1,1,1) ≠ fX(1) fY(1) fZ(1).

Random sample. If the discrete random variables X1, X2, ..., Xk are mutually independent and have a common distribution (each marginal PDF is the same), then the random variables X1, X2, ..., Xk are said to be a random sample from that distribution.

3.2.3 Mathematical Expectation for Finite Multivariate Distributions

Let X1, X2, ..., Xk be discrete random variables with joint PDF f(x), and let g(X) be a real-valued k-variable function. The mean (or expected value or expectation) of g(X) is

E(g(X)) = Σ_{x1} Σ_{x2} · · · Σ_{xk} g(x1, x2, ..., xk) f(x1, x2, ..., xk),

where the multiple sum includes all k-tuples with nonzero joint PDF.

Example 3.15 (Maximum Roll) You roll a fair die three times. Let Xi be the number on the top face after the i-th roll and g(X1, X2, X3) be the maximum of the three numbers. X1, X2, X3 is a random sample from a uniform distribution with parameter 6. The joint PDF of the random triple is

f(x1, x2, x3) = 1/216   when x1, x2, x3 = 1, 2, 3, 4, 5, 6 and 0 otherwise,

and the expected value of the maximum of the three rolls reduces to

E(g(X1, X2, X3)) = 1(1/216) + 2(7/216) + 3(19/216) + 4(37/216) + 5(61/216) + 6(91/216) = 119/24 ≈ 4.96.
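The expectation can also be found by enumerating all 216 equally likely triples; a minimal sketch:

    from fractions import Fraction
    from itertools import product

    # Sketch: expected maximum of three fair-die rolls by direct enumeration.
    expected = sum(Fraction(max(t), 216) for t in product(range(1, 7), repeat=3))
    print(expected, float(expected))    # 119/24, approximately 4.958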


Properties of expectation. Properties in the multivariate case directly generalize properties in the univariate and bivariate cases (Sections 2.3.2 and 3.1.6).

3.2.4 Example: Multivariate Hypergeometric Distribution

Suppose that an urn contains N objects:

M1 of type 1, M2 of type 2, ..., and Mk of type k (M1 + M2 + · · · + Mk = N).

Let Xi be the number of objects of type i in a subset of size n chosen from the urn. If each choice of subset is equally likely, then the random k-tuple (X1, X2, ..., Xk) is said to have a multivariate hypergeometric distribution with parameters n and (M1, M2, ..., Mk).

The joint PDF for the k-tuple is

f(x1, x2, ..., xk) = C(M1, x1) C(M2, x2) · · · C(Mk, xk) / C(N, n)

when each xi is a nonnegative integer satisfying xi ≤ min(n, Mi) and Σ_i xi = n (and zero otherwise).

Note that

1. If X is a hypergeometric random variable with parameters n, M, N, then (X, n − X) has a multivariate hypergeometric distribution with parameters n and (M, N − M).

2. If (X,Y ) is bivariate hypergeometric with parameters n and (M1, M2, M3), then (X, Y, n − X − Y ) is multivariate hypergeometric with parameters n and (M1, M2, M3).

3. If (X1, X2, ..., Xk) has a multivariate hypergeometric distribution, then

(a) For each i, Xi is a hypergeometric random variable with parameters n, Mi, N.

(b) For each i ≠ j, (Xi, Xj) is bivariate hypergeometric with parameters n and (Mi, Mj, N − Mi − Mj). In particular, Xi and Xj are negatively associated.

Example 3.16 (Physical Activity) Suppose there are 600 women and 400 men living in an adult community. Among the women, 180 are physically active and 420 are not. Among the men, 120 are physically active and 280 are not.

Let X1 be the number of women who are physically active, X2 be the number of women who are not physically active, X3 be the number of men who are physically active and X4 be the number of men who are not physically active in a simple random sample of size 15 chosen from those living in the community. Then, for example,

P(X1 = 3, X2 = 7, X3 = 1, X4 = 4) = C(180, 3) C(420, 7) C(120, 1) C(280, 4) / C(1000, 15) ≈ 0.0182.
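The probability above involves very large binomial coefficients, so it is convenient to compute it exactly by machine; a short sketch with the counts of Example 3.16:

    from math import comb

    # Sketch: multivariate hypergeometric probability from Example 3.16.
    counts = {180: 3, 420: 7, 120: 1, 280: 4}            # Mi : xi
    num = 1
    for M, x in counts.items():
        num *= comb(M, x)
    prob = num / comb(1000, 15)
    print(round(prob, 4))    # approximately 0.0182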


3.2.5 Example: Multinomial Distribution

Suppose that an experiment has exactly k outcomes, and let

pi be the probability of the i-th outcome, i = 1, 2, ..., k (p1 + p2 + · · · + pk = 1).

Let Xi be the number of occurrences of the i-th outcome in n independent trials of the experiment. Then the random k-tuple (X1, X2, ..., Xk) is said to have a multinomial distribution with parameters n and (p1, p2, ..., pk).

The joint PDF for the k-tuple is

f(x1, x2, ..., xk) = (n! / (x1! x2! · · · xk!)) p1^x1 p2^x2 · · · pk^xk

when x1, x2, ..., xk = 0, 1, ..., n and Σ_i xi = n (and zero otherwise).

Note that

1. If X is a binomial random variable with parameters n and p, then (X, n − X) has a multinomial distribution with parameters n and (p, 1 − p).

2. If (X,Y ) has a trinomial distribution with parameters n and (p1, p2, p3), then (X, Y, n − X − Y ) has a multinomial distribution with parameters n and (p1, p2, p3).

3. If (X1, X2, ..., Xk) has a multinomial distribution, then

(a) For each i, Xi is a binomial random variable with parameters n and pi.

(b) For each i ≠ j, (Xi, Xj) has a trinomial distribution with parameters n and (pi, pj, 1 − pi − pj). In particular, Xi and Xj are negatively associated.

Example 3.17 (Independence Model) Assume that A and B are independent events of interest to researchers. Let outcomes 1, 2, 3, 4 correspond to

"both A and B occur," "only A occurs," "only B occurs," "neither A nor B occurs,"

respectively. If P(A) = α and P(B) = β, then

(p1, p2, p3, p4) = (αβ, α(1 − β), (1 − α)β, (1 − α)(1 − β))

can be used as the list of probabilities in a multinomial model.

If P(A) = 0.6, P(B) = 0.3, and Xi is the number of occurrences of the i-th outcome in 15 independent trials (i = 1, 2, 3, 4), then, for example,

P(X1 = 3, X2 = 7, X3 = 1, X4 = 4) = (15! / (3! 7! 1! 4!)) (0.18)^3 (0.42)^7 (0.12)^1 (0.28)^4 ≈ 0.0179.
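A sketch of the last computation (cell probabilities 0.18, 0.42, 0.12, 0.28 follow from α = 0.6 and β = 0.3):

    from math import factorial

    # Sketch: multinomial probability from Example 3.17.
    x = [3, 7, 1, 4]
    p = [0.18, 0.42, 0.12, 0.28]
    coef = factorial(15)
    for xi in x:
        coef //= factorial(xi)                  # multinomial coefficient 15!/(3! 7! 1! 4!)
    prob = coef
    for xi, pi in zip(x, p):
        prob *= pi**xi
    print(round(prob, 4))    # approximately 0.0179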


3.2.6 Simple Random Samples

The results of Sections 2.2.4 and 3.1.11 can be directly generalized to the multivariate case. In particular, multinomial probabilities can be used to approximate multivariate hypergeometric probabilities when N is large enough, and each family of distributions can be used in survey analysis.

3.2.7 Sample Summaries

If X1, X2, ..., Xn is a random sample from a distribution with mean µ and standard deviation σ, then the sample mean, X̄, is the random variable

X̄ = (1/n)(X1 + X2 + · · · + Xn),

the sample variance, S², is the random variable

S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²,

and the sample standard deviation, S, is the positive square root of the sample variance.

The following theorem can be proven using properties of expectation:

Theorem 3.18 (Sample Summaries) If X̄ is the sample mean and S² is the sample variance of a random sample of size n from a distribution with mean µ and standard deviation σ, then

1. E(X̄) = µ and Var(X̄) = σ²/n.

2. E(S²) = σ².

Demonstration 1. To demonstrate that E(X̄) = E(X) = µ,

E(X̄) = E( (1/n)(X1 + X2 + · · · + Xn) )
      = (1/n)(E(X1) + E(X2) + · · · + E(Xn))
      = (1/n)(n E(X)) = (1/n)(nµ) = µ.

Demonstration 2. To demonstrate that Var(X̄) = Var(X)/n = σ²/n, first note that

Var(X̄) = E( (X̄ − µ)² ) = E( ( (1/n) Σ_{i=1}^{n} Xi − µ )² ) = E( ( (1/n) Σ_{i=1}^{n} (Xi − µ) )² ).


Now,

Var(X̄) = (1/n²) E( ( Σ_{i=1}^{n} (Xi − µ) )² )
        = (1/n²) [ Σ_i E((Xi − µ)²) + Σ_{i ≠ j} E((Xi − µ)(Xj − µ)) ].

Since the Xi's are independent, the last sum is zero. Finally,

Var(X̄) = (1/n²)(nσ² + 0) = σ²/n.

Demonstration 3. To demonstrate that E(S²) = Var(X) = σ², first note that

S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²
   = (1/(n − 1)) Σ_{i=1}^{n} (Xi² − 2X̄ Xi + X̄²)
   = (1/(n − 1)) ( Σ_{i=1}^{n} Xi² − 2X̄(nX̄) + nX̄² )     since Σ_{i=1}^{n} Xi = nX̄
   = (1/(n − 1)) ( Σ_{i=1}^{n} Xi² − nX̄² ).

Observe that

E(Xi²) = Var(X) + (E(X))² = σ² + µ²   and   E(X̄²) = Var(X̄) + (E(X̄))² = σ²/n + µ².

Finally,

E(S²) = (1/(n − 1)) ( Σ_{i=1}^{n} E(Xi²) − n E(X̄²) )
      = (1/(n − 1)) ( n(σ² + µ²) − n(σ²/n + µ²) )
      = (1/(n − 1)) ( nσ² + nµ² − σ² − nµ² )
      = (1/(n − 1)) (n − 1)σ² = σ².
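Theorem 3.18 is also easy to check by simulation. The following sketch assumes (arbitrarily) samples of size 5 from the uniform distribution on {1, ..., 6}, for which µ = 3.5 and σ² = 35/12:

    import random

    # Sketch: Monte Carlo check that E(sample mean) = mu and E(sample variance) = sigma^2.
    random.seed(1)
    n, reps = 5, 200_000
    means, variances = [], []
    for _ in range(reps):
        x = [random.randint(1, 6) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
        means.append(xbar)
        variances.append(s2)
    print(sum(means) / reps, sum(variances) / reps)    # close to 3.5 and 2.9167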

Sample correlation. A random sample of size n from the joint (X,Y ) distribution is a list of n mutually independent random pairs, each with the same distribution as (X,Y ).

If (X1,Y1), (X2,Y2), ..., (Xn,Yn) is a random sample of size n from a bivariate distribution with correlation ρ = Corr(X,Y ), then the sample correlation, R, is the random variable

R = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)² Σ_{i=1}^{n} (Yi − Ȳ)² ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively.


Applications. In statistical applications, observed values of the sample mean, sample variance, and sample correlation are used to estimate unknown values of the parameters µ, σ² and ρ, respectively.

Example 3.19 (n = 600) Assume that the following table summarizes the values of a random sample of size 600 from the joint (X,Y ) distribution:

        y=1  y=2  y=3  y=4  y=5  y=6  y=7  y=8  y=9  y=10  total
x = 0         2    1    5   12   15   15    7    2     1     60
x = 1         2    9   25   36   42   31   13               158
x = 2    3    4   15   39   55   41   21    5               183
x = 3    2    7   20   27   39   14    1                    110
x = 4    1   13   11   16   18    4                          63
x = 5    1    6    5    6    2                                20
x = 6    1    2    3                                           6
total    8   36   64  118  162  116   68   25    2     1    600

Two pairs of the form (0,2) were observed, one pair of the form (0,3) was observed, five pairs of the form (0,4) were observed, and so forth.

The sample mean of the x-coordinates is

x̄ = (1/600) Σ xi = (1/600)(0(60) + 1(158) + · · · + 6(6)) ≈ 2.07

and the sample variance is

s²x = (1/599) Σ (xi − x̄)² = (1/599)( (0 − x̄)²(60) + (1 − x̄)²(158) + · · · + (6 − x̄)²(6) ) ≈ 1.72464.

Similarly,

ȳ ≈ 4.92333 and s²y ≈ 2.49161.

The observed sample correlation is

r = −634.78 / √( (1033.06)(1492.47) ) ≈ −0.511219.

The results suggest that X and Y are negatively associated.
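All of the summaries in Example 3.19 can be computed directly from the frequency table. The sketch below stores the counts as rows indexed by x (blank cells become zeros); the printed values should agree with the text to rounding:

    # Sketch: sample summaries from the frequency table of Example 3.19.
    counts = [
        [0, 2, 1, 5, 12, 15, 15, 7, 2, 1],     # x = 0, counts for y = 1,...,10
        [0, 2, 9, 25, 36, 42, 31, 13, 0, 0],   # x = 1
        [3, 4, 15, 39, 55, 41, 21, 5, 0, 0],   # x = 2
        [2, 7, 20, 27, 39, 14, 1, 0, 0, 0],    # x = 3
        [1, 13, 11, 16, 18, 4, 0, 0, 0, 0],    # x = 4
        [1, 6, 5, 6, 2, 0, 0, 0, 0, 0],        # x = 5
        [1, 2, 3, 0, 0, 0, 0, 0, 0, 0],        # x = 6
    ]
    n = sum(map(sum, counts))                   # 600
    pairs = [(x, y + 1, c) for x, row in enumerate(counts) for y, c in enumerate(row)]
    xbar = sum(x * c for x, y, c in pairs) / n
    ybar = sum(y * c for x, y, c in pairs) / n
    sxx = sum(c * (x - xbar) ** 2 for x, y, c in pairs)
    syy = sum(c * (y - ybar) ** 2 for x, y, c in pairs)
    sxy = sum(c * (x - xbar) * (y - ybar) for x, y, c in pairs)
    r = sxy / (sxx * syy) ** 0.5
    print(round(xbar, 2), round(sxx / 599, 5), round(ybar, 5), round(syy / 599, 5), round(r, 6))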


4 Introductory Calculus Concepts

In introductory calculus, we study real-valued functions whose domain (set of input values) is a subset of the real numbers. We often use the shorthand

f : D ⊆ R → R,

where D is the domain and R is the set of real numbers, to denote such a function.

This chapter covers limits, derivatives, and integrals, together with applications in statistics.

4.1 Functions

This section reviews functions commonly used in calculus.

Power functions. A power function is a function with rule

f(x) = c x^p, where the coefficient c and the power p are constants.

(The values of f are proportional to a power of the input x.) The domain of f depends on the power p. For example, if p = 2, the domain of f is the set of all real numbers; if p = 1/2, the domain of f is the set of nonnegative real numbers.

Polynomial and rational functions. A polynomial function is a function with rule

f(x) = an x^n + an−1 x^(n−1) + · · · + a1 x + a0,

where the coefficients ai are constants, n is a nonnegative integer and x is a real number. If an ≠ 0, then n is the degree of the polynomial.

Special cases of polynomials include: if the degree of f is 0, then f is a constant function; if the degree is 1, then f is a linear function; if the degree is 2, then f is a quadratic function; if the degree is 3, then f is a cubic function.

A rational function is the ratio of two polynomials:

f(x) = p(x)/q(x), where p(x) and q(x) are polynomials.

The domain of f is the set of real numbers satisfying q(x) ≠ 0.

Exponential functions. An exponential function is a function with rule

f(x) = a^x, where a > 0 is a constant and x is a real number.

In this formula, a is the base and x is the exponent. The exponential function is strictly increasing when a > 1 and strictly decreasing when 0 < a < 1.


The exponential function with base e,

f(x) = e^x, where e ≈ 2.71828 is Euler's number,

is used most often in calculus. (e is the natural base for calculus.)

Exponential functions satisfy the following laws of exponents:

(1) a^(x+y) = a^x a^y,   (2) a^(−x) = 1/a^x,   (3) (a^x)^y = a^(xy),

where a > 0 is a positive number, and x and y are any real numbers.

Logarithmic functions. A logarithmic function is a function with rule

f(x) = log_a(x), where a is a positive constant and x is a positive real number.

As above, a is the base of the function. When a = e,

f(x) = log_e(x) = ln(x) is called the natural logarithm.

Exponential and logarithmic functions with the same base are related as follows:

y = log_a(x) ⟺ a^y = x,

where a and x are positive numbers and y is any real number. Thus,

log_a(a^x) = x for all x ∈ R and a^(log_a(x)) = x for all x > 0.

Logarithmic functions satisfy the following laws of logarithms:

(1) log_a(xy) = log_a(x) + log_a(y),   (2) log_a(x/y) = log_a(x) − log_a(y),   and   (3) log_a(x^r) = r log_a(x),

where a, x and y are positive numbers and r is any real number.

Using the natural base. Since a = e^(ln(a)) for any a > 0, we can write

a^x = (e^(ln(a)))^x = e^(x ln(a)), where x is a real number.

Similarly, suppose that y = log_a(x) (equivalently, x = a^y). Then

ln(x) = ln(a^y) = y ln(a) = log_a(x) ln(a)  ⟹  log_a(x) = ln(x)/ln(a).

Thus, exponential and logarithmic functions in base a can always be written in terms of the corresponding functions in the natural base.


Figure 4.1: A positive angle t (left plot) and a negative angle t (right plot). In each case, the point P(x,y) lies on the unit circle centered at the origin, x = cos(t) and y = sin(t).

Trigonometric functions. The trigonometric functions are defined using a unit circle centered at the origin and the radian measure of an angle, as illustrated in Figure 4.1. In the left part of the figure, t is the length of the arc of the unit circle measured counterclockwise from (1,0) to P(x,y); in the right part of the figure, t is the negative of the length of the arc measured clockwise from (1,0) to P(x,y).

The sine and cosine functions have the following rules:

sin(t) = y and cos(t) = x,

where t is any real number, and x and y are described above. Since the circumference of a circle of radius one is 2π, we know that

sin(t + k(2π)) = sin(t) and cos(t + k(2π)) = cos(t)

for every integer k and real number t.

The tangent, cotangent, secant and cosecant functions are defined in terms of sine and cosine. Specifically,

tan(t) = sin(t)/cos(t),   cot(t) = cos(t)/sin(t),   sec(t) = 1/cos(t)   and   csc(t) = 1/sin(t).

In each case, the domain of the function is all t for which the denominator is not 0.

Inverse trigonometric functions. Each trigonometric function has an inverse function:

1. Inverse sine function: For −1 ≤ z ≤ 1,

sin⁻¹(z) = t ⟺ sin(t) = z and −π/2 ≤ t ≤ π/2.

2. Inverse cosine function: For −1 ≤ z ≤ 1,

cos⁻¹(z) = t ⟺ cos(t) = z and 0 ≤ t ≤ π.

3. Inverse tangent function: For any real number z,

tan⁻¹(z) = t ⟺ tan(t) = z and −π/2 < t < π/2.

4. Inverse cotangent function: For any real number z,

cot⁻¹(z) = t ⟺ cot(t) = z and 0 < t < π.

5. Inverse secant function: For |z| ≥ 1,

sec⁻¹(z) = t ⟺ sec(t) = z and (0 ≤ t < π/2 or π ≤ t < 3π/2).

6. Inverse cosecant function: For |z| ≥ 1,

csc⁻¹(z) = t ⟺ csc(t) = z and (0 < t ≤ π/2 or π < t ≤ 3π/2).

Radian and degree measures. Angles can be measured in degrees or in radians. Radian measure is preferred in calculus. The following table gives some common conversions:

Degrees   0    30    45    60    90    120    135    150    180    270    360
Radians   0   π/6   π/4   π/3   π/2   2π/3   3π/4   5π/6    π     3π/2    2π

Note that 2π radians (or 360°) corresponds to a complete revolution of the unit circle.

Some values of sine and cosine. The following table gives some often used values of the sine and cosine functions:

t        0    π/6    π/4    π/3    π/2    2π/3    3π/4    5π/6     π
sin(t)   0    1/2   √2/2   √3/2     1    √3/2    √2/2     1/2      0
cos(t)   1   √3/2   √2/2    1/2     0    −1/2   −√2/2   −√3/2     −1

4.2 Limits and Continuity

This section introduces the concepts of limit and continuity, which are central to calculus.

4.2.1 Neighborhoods, Deleted Neighborhoods and Limits

A neighborhood of the real number a is the set of real numbers x satisfying

|x − a| < r for some positive number r.

A deleted neighborhood of a is the neighborhood with a removed; that is, the set of real numbers x satisfying 0 < |x − a| < r for some positive r.


Figure 4.2: Plots of y = (x − 1)/(x² − 1) (left) and y = |x − 1|/(x² − 1) (right).

Suppose that f(x) is defined for all x in a deleted neighborhood of a. We write

lim_{x→a} f(x) = L

and say "the limit of f(x), as x approaches a, equals L" if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < |x − a| < δ, then f(x) satisfies |f(x) − L| < ε.

(The deleted δ-neighborhood of a is mapped into the ε-neighborhood of L.)

Example 4.1 (Figure 4.2) Let f(x) = (x − 1)/(x² − 1) and a = 1 (left part of Figure 4.2). Although f(1) is undefined, the limit as x approaches 1 exists. Specifically,

lim_{x→1} f(x) = lim_{x→1} (x − 1)/(x² − 1) = lim_{x→1} (x − 1)/((x − 1)(x + 1)) = lim_{x→1} 1/(x + 1) = 1/2.

By contrast, if f(x) = |x − 1|/(x² − 1) and a = 1 (right part of Figure 4.2), then a limit does not exist, since function values approach either −1/2 or +1/2 depending on whether x < 1 or x > 1.

Example 4.2 (Figure 4.3) Let f(x) = sin(π/x) and a = 0 (left part of Figure 4.3). f(x) is undefined at 0. Further, as x approaches 0, values of f(x) fluctuate between −1 and +1, and no limit exists.

By contrast, if f(x) = x sin(π/x) and a = 0 (right part of Figure 4.3), then a limit does exist. Specifically, function values approach 0 as x approaches 0.

One-sided limits. We write

lim_{x→a⁺} f(x) = L

and say "the limit of f(x), as x approaches a from above, equals L" if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < x − a < δ, then f(x) satisfies |f(x) − L| < ε.


Figure 4.3: Plots of y = sin(π/x) (left) and y = x sin(π/x) (right).

Similarly, we write

lim_{x→a⁻} f(x) = L

and say "the limit of f(x), as x approaches a from below, equals L" if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < a − x < δ, then f(x) satisfies |f(x) − L| < ε.

In the first case, we approach a from the right only; in the second case, from the left only. For example, if f(x) = |x − 1|/(x² − 1) and a = 1 (right part of Figure 4.2), then

lim_{x→1⁻} f(x) = −1/2 and lim_{x→1⁺} f(x) = 1/2.

Infinite limits. We write

lim_{x→∞} f(x) = L

and say "the limit of f(x), as x approaches ∞, equals L" if for every positive number ε there exists a number M such that

if x satisfies x > M, then f(x) satisfies |f(x) − L| < ε.

Similarly, we write

lim_{x→−∞} f(x) = L

and say "the limit of f(x), as x approaches −∞, equals L" if for every positive number ε there exists a number M such that

if x satisfies x < M, then f(x) satisfies |f(x) − L| < ε.

Example 4.3 (Exponential Function) Let f(x) = e^x. Then

lim_{x→−∞} f(x) = 0,

but no finite limit exists as x → ∞. We often write lim_{x→∞} f(x) = ∞ when function values grow large without bound.


Application in statistics. Let F(x) = P(X ≤ x) be the cumulative distribution function (CDF) of a discrete random variable (Section 2.1.1) and let a be an element of the range of the random variable. Then both one-sided limits exist:

lim_{x→a⁺} F(x) = F(a) and lim_{x→a⁻} F(x) = F(a) − P(X = a).

Further,

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

Properties of limits. Suppose that

lim_{x→a} f(x) = L and lim_{x→a} g(x) = M,

and let c be a constant. Then

1. Constant rule: lim_{x→a} c = c.

2. Rule for x: lim_{x→a} x = a.

3. Constant multiple rule: lim_{x→a} (c f(x)) = cL.

4. Sum rule: lim_{x→a} (f(x) + g(x)) = L + M.

5. Difference rule: lim_{x→a} (f(x) − g(x)) = L − M.

6. Product rule: lim_{x→a} (f(x)g(x)) = LM.

7. Quotient rule: lim_{x→a} (f(x)/g(x)) = L/M when M ≠ 0.

8. Power rule: lim_{x→a} (f(x))^n = L^n when n is a positive integer.

9. Power rule for x: lim_{x→a} x^n = a^n when n is a positive integer.

10. Rule for roots: lim_{x→a} ⁿ√f(x) = ⁿ√L when n is an odd positive integer, or when n is an even positive integer and L is greater than 0.

11. Rule for ordering: If f(x) ≤ g(x) for all x near a (except possibly at a), then L ≤ M.

Squeezing principle. If ℓ(x) ≤ f(x) ≤ u(x) for all x near a (except possibly at a) and

lim_{x→a} ℓ(x) = lim_{x→a} u(x) = L,

then f(x) approaches L as well: lim_{x→a} f(x) = L.

For example, let f(x) = x sin(π/x) and a = 0 (right part of Figure 4.3). Since

−|x| ≤ x sin(π/x) ≤ |x|

and the lower and upper functions each approach 0 as x approaches 0, we know that f(x) approaches 0 as well.


Big O and little o notation. If

|f(x)| ≤ K g(x) for some positive K and all x near a (except possibly at a),

we say that f(x) = O(g(x)) for x near a. If

lim_{x→a} f(x)/g(x) = 0,

we say that f(x) = o(g(x)) for x near a.

4.2.2 Continuous Functions

The function f(x) is said to be continuous at a if

lim_{x→a} f(x) = f(a).

The function is said to be discontinuous at a otherwise. Note that f(x) is discontinuous at a when f(a) is undefined, or when the limit does not exist, or when the limit and function values exist but are not equal.

Polynomial functions, rational functions (ratios of polynomials), exponential functions, logarithmic functions, and trigonometric functions are continuous throughout their domains. Step functions are discontinuous at each "step".

Note that the properties of limits given above imply that sums, differences, products, quotients, powers, roots, and constant multiples of continuous functions are continuous whenever the appropriate operation is defined.

Example 4.4 (Split Definition Functions) Let

f(x) = −4x + 10 when x < 2, and f(x) = −0.25x + 2.5 when x ≥ 2,

and let

g(x) = 0 when x < 0, 0.25 when 0 ≤ x < 1, 0.75 when 1 ≤ x < 2, and 1 when x ≥ 2.

The function f(x) (left part of Figure 4.4) is continuous everywhere. In particular,

lim_{x→2⁻} f(x) = lim_{x→2⁻} (−4x + 10) = 2 and lim_{x→2⁺} f(x) = lim_{x→2⁺} (−0.25x + 2.5) = 2.

The step function g(x) (right part of Figure 4.4) is discontinuous at 0, 1 and 2. Note that g(x) is the cumulative distribution function of a binomial random variable with parameters n = 2 and p = 1/2.

Compositions. If g(x) is continuous at a and f(x) is continuous at g(a), then the composite function f(g(x)) is continuous at a. In other words,

lim_{x→a} f(g(x)) = f( lim_{x→a} g(x) ) = f(g(a)).

For example, lim_{x→5} ln(6 − x) = ln(1) = 0, where ln() is the natural logarithm function.


Figure 4.4: Plots of split definition functions.

4.3 Differentiation

Suppose that f(x) is defined for all x near a. The average rate of change of the function f over the interval [a, x] if x > a (or [x, a] if x < a) is the ratio

(f(x) − f(a)) / (x − a).

The average rate of change corresponds to the slope of the secant line through the points (a, f(a)) and (x, f(x)).

Interest focuses on whether we can define the instantaneous rate of change at a or, equivalently, the slope of the curve y = f(x) at the point (a, f(a)).

4.3.1 Derivative and Tangent Line

The derivative of the function f at the number a, denoted by f′(a) ("f prime of a"), is

f′(a) = lim_{x→a} (f(x) − f(a)) / (x − a),

if the limit exists.

The function f(x) is said to be differentiable at a if the limit above exists. Otherwise, f(x) is not differentiable at a.

If f(x) represents your position at time x, then (f(x) − f(a))/(x − a) is your average velocity over the interval [a, x] (or [x, a]) and f′(a) is your instantaneous velocity at time a.

Tangent line. If f(x) is differentiable at a, then the line with equation

y = f(a) + f′(a)(x − a)

is tangent to the curve y = f(x) at the point (a, f(a)), and f′(a) is said to be the slope of the curve at this point.


Figure 4.5: Plots of y = x/(1 + 2x) (left part) and y = 7 − 3|x − 2| (right part).

Example 4.5 (Derivative Exists) Consider the function f(x) = x/(1 + 2x) and let a = 1 (left part of Figure 4.5). The derivative is computed as follows:

f′(1) = lim_{x→1} ( x/(1 + 2x) − 1/3 ) / (x − 1) = lim_{x→1} 1/(3 + 6x) = 1/9.

The equation of the tangent line is y = (1/3) + (1/9)(x − 1).

Example 4.6 (Derivative Does Not Exist) Consider the function f(x) = 7 − 3|x − 2| and let a = 2 (right part of Figure 4.5). Then

1. If x > 2, then f(x) = 7 − 3(x − 2) and lim_{x→2⁺} (f(x) − f(2))/(x − 2) = −3.

2. If x < 2, then f(x) = 7 − 3(2 − x) and lim_{x→2⁻} (f(x) − f(2))/(x − 2) = 3.

Since the limits from above and below are not equal, the derivative does not exist.

4.3.2 Approximations and Local Linearity

If f(x) is differentiable at a, then f(x) is locally linear near a. That is, values on the tangent line can be used to approximate values on the curve itself.

Example 4.7 (Derivative Exists, continued) For example, if f(x) = x/(1 + 2x), a = 1, and ℓ(x) = (1/3) + (1/9)(x − 1) (as calculated above), then the following table confirms that values of the linear function ℓ(x) are close to values of f(x) when x is near 1:

x      0.75   0.80   0.85   0.90   0.95   1.00   1.05   1.10   1.15   1.20   1.25
f(x)   0.300  0.308  0.315  0.321  0.328  0.333  0.339  0.344  0.348  0.353  0.357
ℓ(x)   0.306  0.311  0.317  0.322  0.328  0.333  0.339  0.344  0.350  0.356  0.361
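A table like this is easy to regenerate. The following sketch (the step size 0.05 and the interval are just the choices used in the table) compares f with its tangent-line approximation at a = 1:

    # Sketch: f(x) = x/(1 + 2x) versus its tangent line at (1, 1/3).
    def f(x):
        return x / (1 + 2 * x)
    def ell(x):
        return 1/3 + (1/9) * (x - 1)
    for i in range(11):
        x = 0.75 + 0.05 * i
        print(round(x, 2), round(f(x), 3), round(ell(x), 3))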


Figure 4.6: Plots of y = 2 + 9x − 6x² + x³ with different points highlighted.

4.3.3 First and Second Derivative Functions

Since the derivative can take different values at different points, we can think of it as a function in its own right. For any function f, we define the derivative function, f′, by

f′(x) = lim_{h→0} (f(x + h) − f(x)) / h.

If the limit exists at a given x, we say that f is differentiable at x. If the derivative exists for all x in the domain of the function, we say that f is differentiable everywhere.

Alternative notations. Starting with y = f(x) (where x represents the independent variable and y represents the dependent variable), there are several different ways to write the derivative function:

f′(x) = dy/dx = d/dx (f(x)) = df/dx = Df(x) = y′.

The "prime" notation (f′(x), y′) is due to Newton. The d/dx ("d-dx") form is due to Leibniz.

Second derivative function. If f is a differentiable function, then its derivative f′ is also a function, so f′ may have a derivative of its own. The second derivative of f, denoted by f′′ ("f double prime"), is defined as follows:

f′′(x) = lim_{h→0} (f′(x + h) − f′(x)) / h.

Starting with y = f(x), alternative notations are

f′′(x) = d²y/dx² = d²/dx² (f(x)) = d²f/dx² = D²f(x) = y′′.

The second derivative is the rate of change of the rate of change. If f′′(a) > 0, then the curve y = f(x) is concave up at (a, f(a)). If f′′(a) < 0, then the curve is concave down at (a, f(a)).


Continuity and differentiability. Differentiable functions are continuous, but continuous functions may not be differentiable. For example, the absolute value function f(x) = 7 − 3|x − 2| (see Figure 4.5) is continuous everywhere and is differentiable for all values of x other than 2.

4.3.4 Some Rules for Finding Derivative Functions

Assume that f and g are differentiable on a common domain D, and let a, b, c, and m be constants. Then

1. Constant functions: d/dx (c) = 0.

2. Linear functions: d/dx (mx + b) = m.

3. Constant multiples: d/dx (c f(x)) = c f′(x).

4. Linear combinations: d/dx (a f(x) + b g(x)) = a f′(x) + b g′(x).

5. Products of functions: d/dx (f(x)g(x)) = f′(x)g(x) + f(x)g′(x).

6. Quotients of functions: d/dx (f(x)/g(x)) = ( g(x)f′(x) − f(x)g′(x) ) / (g(x))² when g(x) ≠ 0.

7. Powers of x: d/dx (x^n) = n x^(n−1) for each constant n.

Polynomial functions. Recall that a polynomial function, f(x), has the form

f(x) = an x^n + an−1 x^(n−1) + · · · + a2 x² + a1 x + a0,

where a0, a1, ..., an are constants, and n is a nonnegative integer. The rules given above imply that

f′(x) = n an x^(n−1) + (n − 1) an−1 x^(n−2) + · · · + 2 a2 x + a1.

Example 4.8 (Cubic Function) Let f(x) = 2 + 9x − 6x² + x³. Then

f′(x) = 9 − 12x + 3x² and f′′(x) = −12 + 6x.

Since f′′(x) < 0 when x < 2, f is concave down when x < 2 (left part of Figure 4.6). Since f′′(x) > 0 when x > 2, f is concave up when x > 2 (right part of Figure 4.6).

Rational functions. Recall that a rational function, r(x), is the ratio of polynomials. The derivative of r(x) is computed using the quotient rule.

Example 4.9 (Rational Function) For example, if

r(x) = p(x)/q(x) = (3 + 4x − x²)/(7 + 2x),

then

r′(x) = ( q(x)p′(x) − p(x)q′(x) ) / (q(x))² = ( (7 + 2x)(4 − 2x) − (3 + 4x − x²)(2) ) / (7 + 2x)² = −2(−11 + 7x + x²) / (7 + 2x)².


4.3.5 Chain Rule for Composition of Functions

If f and g are differentiable functions, then

d/dx (f(g(x))) = f′(g(x)) g′(x).

(The derivative of the outside function evaluated at the inside function, times the derivative of the inside function evaluated at x.)

Starting with y = f(u) and u = g(x), the chain rule can be written as follows:

dy/dx = (dy/du)(du/dx).

For example, since √u = u^(1/2) and d/du (u^(1/2)) = (1/2) u^(−1/2) = 1/(2√u),

d/dx ( √(x² + 1) ) = (1/(2√(x² + 1))) (2x) = x/√(x² + 1).

Inverse functions. The functions f and g are said to be inverse functions if f(g(x)) = x. Using the chain rule:

d/dx (f(g(x))) = f′(g(x)) g′(x)  ⟹  1 = f′(g(x)) g′(x)  ⟹  g′(x) = 1/f′(g(x)).

(The derivative of g at x is the reciprocal of the derivative of f at g(x).)

For example, f(u) = u² and g(x) = √x are inverse functions when x is positive. Since f′(u) = 2u,

g′(x) = 1/(2√x) = 1/f′(g(x)).

4.3.6 Rules for Exponential and Logarithmic Functions

The natural base for exponential and logarithmic functions in calculus is the base e, where

e = lim_{n→∞} (1 + 1/n)^n = 2.71828... is Euler's number.

The limit above is illustrated in the following table of approximate values:

n              10        100       1000      10000     100000    1000000
(1 + 1/n)^n    2.59374   2.70481   2.71692   2.71815   2.71827   2.71828

The domain of the exponential function e^x = exp(x) is all real numbers and the domain of the logarithmic function ln(x) = log_e(x) is the positive real numbers. Plots of the exponential and logarithmic functions are given in Figure 4.7.


Figure 4.7: Plots of y = e^x (left) and y = ln(x) (right).

The functions e^x and ln(x) are inverses. That is,

e^(ln(x)) = x when x > 0 and ln(e^x) = x for all real numbers x.

Their derivatives are as follows:

d/dx (e^x) = e^x and d/dx (ln(x)) = 1/x.

More generally,

d/dx (e^u) = e^u (du/dx) and d/dx (ln(u)) = (1/u)(du/dx).

For example,

d/dx ( e^(12 − x³) ) = e^(12 − x³) (−3x²) and d/dx ( ln(x⁴ + 2x² + 7) ) = (4x³ + 4x)/(x⁴ + 2x² + 7).

4.3.7 Rules for Trigonometric and Inverse Trigonometric Functions

In calculus, we work with trigonometric functions using radian measure. That is, the input x is the radian measure of an angle (Section 4.1). Figure 4.8 shows plots of the two most commonly used functions, sin(x) and cos(x).

Trigonometric functions have specialized derivative formulas:

d/dx (sin(x)) = cos(x)            d/dx (cos(x)) = −sin(x)
d/dx (tan(x)) = sec(x)²           d/dx (cot(x)) = −csc(x)²
d/dx (sec(x)) = sec(x) tan(x)     d/dx (csc(x)) = −csc(x) cot(x)


Figure 4.8: Plots of y = sin(x) (left) and y = cos(x) (right).

Similarly, inverse trigonometric functions have specialized derivative formulas:

d/dx (sin⁻¹(x)) = 1/√(1 − x²)              d/dx (cos⁻¹(x)) = −1/√(1 − x²)
d/dx (tan⁻¹(x)) = 1/(1 + x²)               d/dx (cot⁻¹(x)) = −1/(1 + x²)
d/dx (sec⁻¹(x)) = 1/(x²√(1 − x⁻²))         d/dx (csc⁻¹(x)) = −1/(x²√(1 − x⁻²))

For example,

d/dx ( sin(3x² + 8) ) = cos(3x² + 8)(6x)

and

d/dx ( x cos(√x + x) ) = cos(√x + x) − x sin(√x + x)( 1/(2√x) + 1 ) = cos(√x + x) − sin(√x + x)( √x/2 + x ).

4.3.8 Indeterminate Forms and L'Hospital's Rule

If f(x) → 0 ("f(x) approaches 0") as x → a and g(x) → 0 as x → a, then

lim_{x→a} f(x)/g(x)

may or may not exist, and is called an indeterminate form of type 0/0.

For example,

lim_{x→1} (x − 1)/(x² + 11x − 12)

is an indeterminate form of type 0/0 whose limit can be computed after cancelling the common factor (x − 1) from numerator and denominator:

lim_{x→1} (x − 1)/(x² + 11x − 12) = lim_{x→1} (x − 1)/((x − 1)(x + 12)) = lim_{x→1} 1/(x + 12) = 1/13.


Similarly, if f(x) → ±∞ and g(x) → ±∞ as x → a, then the limit of the ratio f(x)/g(x) is an indeterminate form of type ∞/∞.

Theorem 4.10 (L'Hospital's Rule) Suppose that f and g are differentiable and g′(x) ≠ 0 near a (except possibly at a). Further, suppose that

1. f(x) → 0 and g(x) → 0 as x → a, or

2. f(x) → ±∞ and g(x) → ±∞ as x → a.

Then

lim_{x→a} f(x)/g(x) = lim_{x→a} f′(x)/g′(x),

if the limit on the righthand side exists.

L'Hospital's rule says that the limit of the ratio of the functions is the same as the limit of the ratio of the rates. For example,

lim_{x→1} (x²⁰ − 1)/(x¹⁸ − 1) = lim_{x→1} 20x¹⁹/(18x¹⁷) = lim_{x→1} 10x²/9 = 10/9.

In some cases, L'Hospital's rule needs to be used more than once. For example,

lim_{x→0} (1 − cos(x))/x² = lim_{x→0} sin(x)/(2x) = lim_{x→0} cos(x)/2 = 1/2.

(The first and second limits are indeterminate forms of type 0/0.)

One-sided and infinite limits. Note that L'Hospital's rule remains true for one-sided limits (x → a⁺ or x → a⁻) and for limits where x → ±∞.

Indeterminate products. If f(x) → 0 and g(x) → ±∞ as x → a, then

lim_{x→a} f(x)g(x)

is an indeterminate form of type 0 · ∞. L'Hospital's rule can sometimes be used to evaluate the limit. The trick is to rewrite the product as a quotient:

lim_{x→a} f(x)g(x) = lim_{x→a} f(x)/(1/g(x))   (type 0/0)   or   lim_{x→a} f(x)g(x) = lim_{x→a} g(x)/(1/f(x))   (type ∞/∞).

For example,

lim_{x→∞} x e^(−x) = lim_{x→∞} x/e^x = lim_{x→∞} 1/e^x = 0   (using type ∞/∞).


Remarks. There are many other types of indeterminate forms, including

∞ − ∞, 0⁰, ∞⁰, 1^∞.

In each case, anything can happen (that is, a limit may or may not exist).

Note that the limit used to define Euler's number (Section 4.3.6) is an example of an indeterminate form of type 1^∞.

4.4 Optimization

The function f : D ⊆ R → R has a local minimum (or relative minimum) at c ∈ D if

f(c) ≤ f(x) for each x in the intersection of a neighborhood of c with D.

Similarly, f has a local maximum (or relative maximum) at c ∈ D if

f(c) ≥ f(x) for each x in the intersection of a neighborhood of c with D.

Local maxima and minima are called local extrema of the function.

f has a global (or absolute) minimum at c if f(c) ≤ f(x) for all x ∈ D. Similarly, f has a global (or absolute) maximum at c if f(c) ≥ f(x) for all x ∈ D. Global maxima and minima are called global extrema of the function.

4.4.1 Local Extrema and Derivatives

The following theorem establishes a relationship between local extrema and derivatives.

Theorem 4.11 (Fermat's Theorem) If the continuous function f has a local extremum at c, and if f′(c) exists, then f′(c) = 0.

Continuing with the cubic function example (Figure 4.6), f(x) = 2 + 9x − 6x² + x³ has a local maximum at 1 (with maximum value 6) and a local minimum at 3 (with minimum value 2). Since the derivative function is f′(x) = 9 − 12x + 3x², it is easy to check that f′(1) = f′(3) = 0.

Critical numbers. A number c is a critical number of f if either f′(c) = 0 or f′(c) does not exist. If f has a local extremum at c, then Fermat's theorem implies that c is a critical number of the function f.

The cubic function f(x) = 2 + 9x − 6x² + x³ has critical numbers 1 and 3. The absolute value function f(x) = 7 − 3|x − 2| (Figure 4.5) has critical number 2.

4.4.2 First Derivative Test for Local Extrema

Suppose that f is differentiable for all x near c. If c is a critical number of f and if f′ changes sign at c, then f has a local extremum at c. Specifically,

1. If f′(x) is positive below c and negative above c, then f is increasing to the left of c and decreasing to the right of c. Thus, f has a local maximum at c.

2. If f′(x) is negative below c and positive above c, then f is decreasing to the left of c and increasing to the right of c. Thus, f has a local minimum at c.

Continuing with the cubic function f(x) = 2 + 9x − 6x² + x³, it is easy to check that

f′(x) > 0 when x < 1, f′(x) < 0 when 1 < x < 3, and f′(x) > 0 when x > 3.

These computations confirm that f has a local maximum at 1 and a local minimum at 3.

4.4.3 Second Derivative Test for Local Extrema

Suppose that f is twice differentiable for all x near c, and that c is a critical number for f. (In this case, f′(c) = 0.) Then

1. If f′′(c) > 0, then f has a local minimum at c.

2. If f′′(c) < 0, then f has a local maximum at c.

Note that if f′(c) = 0 and f′′(c) = 0, then anything can happen. For example, if f is the quartic function f(x) = x⁴, then f′(0) = f′′(0) = 0 and f has a local (and global) minimum at 0. By contrast, if f is the cubic function f(x) = x³, then f′(0) = f′′(0) = 0, but f has neither a local minimum nor a local maximum at 0.

4.4.4 Global Extrema on Closed Intervals

Consider f : [a,b] ⊆ R → R (the domain of interest is the closed interval [a,b]). The following theorem says that continuous functions have global extrema on closed intervals.

Theorem 4.12 (Extreme Value Theorem) If f is continuous on the closed interval [a,b], then there are numbers c, d ∈ [a,b] such that f has a global minimum at c and a global maximum at d.

For example, if the domain of the cubic function f(x) = 2 + 9x − 6x² + x³ (Figure 4.6) is restricted to the closed interval [−0.5, 3.5], then f has a global minimum value of −4.125 at −0.5, and a global maximum value of 6 at 1.

Method for finding global extrema. To find the global extrema of the continuous function f on the closed interval [a,b]:

1. Find the values of f at the critical numbers in the open interval (a,b).

2. Find the values of f at the endpoints: f(a), f(b).


3. The largest function value from the first two steps is the global maximum, and the smallest function value is the global minimum.

Figure 4.9: Plots of y = 40 + 12x² + 4x³ − 3x⁴. In the right plot, the domain is restricted to the interval [0,3].

Example 4.13 (Quartic Function) Let f(x) = 40 + 12x² + 4x³ − 3x⁴ for all x.

1. As illustrated in the left part of Figure 4.9, f(x) has a local maximum of 45 = f(−1) when x = −1, a local minimum of 40 = f(0) when x = 0, and both a local and global maximum of 72 = f(2) when x = 2.

2. Since f′(x) = 24x + 12x² − 12x³ = −12(x − 2)x(x + 1), we know that f′(x) = 0 when x = −1, 0, 2. Further,

(a) f′(x) > 0 when x < −1 and when 0 < x < 2, and

(b) f′(x) < 0 when −1 < x < 0 and when x > 2.

The values of f(x) are increasing when x < −1, decreasing when −1 < x < 0, increasing when 0 < x < 2, and decreasing when x > 2.

3. Since f′′(x) = 24 + 24x − 36x², we know that f′′(x) = 0 when

x = (1 − √7)/3 ≈ −0.549 and x = (1 + √7)/3 ≈ 1.215.

Further,

(a) f′′(x) < 0 when x < −0.549 and when x > 1.215, and

(b) f′′(x) > 0 when −0.549 < x < 1.215.

Concavity changes at both roots of the equation f′′(x) = 0. Specifically, f(x) is concave down when x < −0.549, concave up when −0.549 < x < 1.215, and concave down when x > 1.215.

Example 4.14 (Restricted Domain) Let f(x) = 40 + 12x² + 4x³ − 3x⁴ for x ∈ [0,3].

1. As illustrated in the right part of Figure 4.9, f(x) has a local minimum of 40 = f(0) at x = 0, a local and global maximum of 72 = f(2) at x = 2, and a local and global minimum of 13 = f(3) at x = 3.

2. Since f′(x) = 24x + 12x² − 12x³ = −12(x − 2)x(x + 1), we know that f′(x) = 0 when x = −1, 0, 2. For the two roots in the restricted domain, we consider f(0) = 40 and f(2) = 72 when determining global extrema.

3. The function values at the endpoints are f(0) = 40 and f(3) = 13.

4. The global maximum value is the largest of the numbers computed in items 2 and 3. The global minimum value is the smallest of the numbers computed in items 2 and 3.
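The same bookkeeping can be scripted. A minimal sketch for Example 4.14 (the candidate list is just the critical numbers in [0,3] together with the endpoints):

    # Sketch: global extrema of f(x) = 40 + 12x^2 + 4x^3 - 3x^4 on [0, 3].
    def f(x):
        return 40 + 12 * x**2 + 4 * x**3 - 3 * x**4
    candidates = [0, 2, 3]                      # critical numbers 0 and 2, plus endpoint 3
    values = {x: f(x) for x in candidates}
    print(values)                                               # {0: 40, 2: 72, 3: 13}
    print(max(values, key=values.get), min(values, key=values.get))   # max at x = 2, min at x = 3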

4.4.5 Newton’s Method<br />

Many problems in mathematics involve finding roots. For example, to find the critical<br />

numbers of the differentiable function f, we need to solve the derivative equation f ′ (x) = 0.<br />

Newton’s method is an iterative method <strong>for</strong> approximating the roots of an equation of the<br />

<strong>for</strong>m g(x) = 0 when g is a differentiable function. The method uses local linearity.<br />

Specifically, if r is a root of the equation, x0 (an initial estimate of r) is a number near r,<br />

and g ′ (x0) = 0, then the tangent line through (x0,g(x0))<br />

y = g(x0) + g ′ (x0)(x − x0)<br />

can be used to approximate the curve y = g(x) and the x-coordinate of the point where the<br />

tangent line crosses the x-axis (call it x1) is likely to be closer to r than x0. To find x1:<br />

0 = g(x0) + g ′ (x0)(x1 − x0) ⇒ x1 = x0 − g(x0)<br />

g ′ (x0) .<br />

If g ′ (x1) = 0, the method can be repeated to yield x2 = x1 − g(x1)<br />

g ′ (x1)<br />

general <strong>for</strong>mula is<br />

xk+1 = xk − g(xk)<br />

g ′ (xk) .<br />

, and so <strong>for</strong>th. The<br />

The iteration stops when the error term |g(xk)/g ′ (xk)| falls below a pre-specified bound.<br />

Example 4.15 (Root of Cubic Function) The cubic function g(x) = x 3 − x − 1 has a<br />

root between 1 and 2, as illustrated in the left part of Figure 4.10. Newton’s method can<br />

be used to find the root. The following table shows the results of the first five iterations<br />

starting with initial value x0 = 1.0.<br />

k 0 1 2 3 4 5<br />

xk 1.0 1.5 1.34783 1.3252 1.32472 1.32472<br />

Since x4 and x5 are identical to within five decimal places of accuracy, we estimate that the<br />

root occurs when x = 1.32472.<br />


Figure 4.10: Newton’s method examples.<br />

Example 4.16 (Point of Intersection) The functions<br />

f1(x) = x + 4 and f2(x) = e^(x/3)

have a point of intersection between 5 and 10, as illustrated in the right part of Figure 4.10.<br />

One way to find the point of intersection is to use Newton’s method to solve the equation<br />

g(x) = f1(x) − f2(x) = 0. The following table shows the results of the first six iterations<br />

starting with initial value x0 = 5.5.<br />

k 0 1 2 3 4 5 6<br />

xk 5.5 8.49133 7.53204 7.28038 7.26519 7.26514 7.26514<br />

Since x5 and x6 are identical to within five decimal places of accuracy, we estimate that the<br />

point of intersection occurs when x = 7.26514.<br />

Warning. Newton’s method is powerful and fast (in general, few iterations are required<br />

<strong>for</strong> convergence), but is sensitive to the initial value x0. If we let x0 = 0.57 in the cubic<br />

example above, <strong>for</strong> example, then 35 iterations are needed to approximate the root to within<br />

5 decimal places of accuracy.<br />

If g(x) has more than one root and the chosen x0 is far from r, then the method could<br />

converge to the wrong root. If g ′ (xk) ≈ 0 <strong>for</strong> some k, then the sequence of iterates may not<br />

converge at all.<br />

4.5 Applications in <strong>Statistics</strong><br />

Optimization is an important component of many statistical analyses. This section illustrates<br />

two applications.<br />

4.5.1 Maximum Likelihood Estimation in Binomial Models<br />

Consider the problem of estimating the binomial probability of success, p. If x successes are<br />

observed in n trials, then a natural estimate of p is the sample proportion, p = x/n. If the<br />

data summarize the results of n independent trials of a Bernoulli experiment with success probability p, then p = x/n is also the maximum likelihood (ML) estimate of p.

Figure 4.11: Examples of binomial (left) and multinomial (right) likelihoods.

The ML estimate of p is the value of p which maximizes the likelihood function, Lik(p), or<br />

(equivalently) the log-likelihood function, ℓ(p). Lik(p) is the binomial PDF with n and x<br />

fixed and p varying<br />

Lik(p) = C(n,x) p^x (1 − p)^(n−x),

where C(n,x) denotes the binomial coefficient, and ℓ(p) is the natural logarithm of the likelihood:

ℓ(p) = ln(Lik(p)) = ln C(n,x) + x ln(p) + (n − x) ln(1 − p).

If 0 < x < n, then the ML estimate is obtained by solving ℓ′(p) = 0 for p:

ℓ′(p) = x/p − (n − x)/(1 − p) = 0 ⇒ x/p = (n − x)/(1 − p) ⇒ p = x/n.

Further, since ℓ′(p) > 0 when 0 < p < x/n and ℓ′(p) < 0 when x/n < p < 1, we know that the likelihood is maximized when p = x/n.

For example, the left part of Figure 4.11 shows the binomial likelihood when 17 successes<br />

are observed in 25 trials. The function is maximized at the ML estimate, p = 0.68.<br />
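As a numerical check (a sketch, not part of the original notes), the log-likelihood for x = 17 successes in n = 25 trials can be evaluated on a fine grid of p values; the grid maximum falls at the sample proportion 0.68, in agreement with the calculus argument above.

    import math

    n, x = 25, 17
    log_binom = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

    def loglik(p):
        return log_binom + x * math.log(p) + (n - x) * math.log(1 - p)

    grid = [i / 10000 for i in range(1, 10000)]   # p values in (0, 1)
    p_hat = max(grid, key=loglik)
    print(p_hat)  # 0.68, the sample proportion x/n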

4.5.2 Maximum Likelihood Estimation in Multinomial Models<br />

Maximum likelihood estimation can be generalized to multinomial models. Consider, <strong>for</strong><br />

example, a multinomial model with three categories and probabilities<br />

(p1,p2,p3) = (θ,(1 − θ)θ,(1 − θ) 2 ), where 0 ≤ θ ≤ 1 is unknown,<br />

and suppose that in 60 independent trials of the experiment, the first outcome was observed<br />

15 times, the second outcome 20 times and the third outcome 25 times. That is, suppose<br />

that n = 60, x1 = 15, x2 = 20 and x3 = 25.<br />



The likelihood function, Lik(θ), is the multinomial PDF written as a function of θ,<br />

<br />

Lik(θ) = C(60; 15,20,25) θ^15 ((1 − θ)θ)^20 ((1 − θ)²)^25 = C(60; 15,20,25) θ^35 (1 − θ)^70,

where C(60; 15,20,25) denotes the multinomial coefficient, as shown in the right part of Figure 4.11.

The log-likelihood function is the natural logarithm of the likelihood,

ℓ(θ) = ln(Lik(θ)) = ln C(60; 15,20,25) + 35 ln(θ) + 70 ln(1 − θ).

ℓ′(θ) = 0 when θ = 35/105 = 1/3. You can use either the first or second derivative test to demonstrate that the log-likelihood is maximized when θ = 1/3.

The estimated model has probabilities (1/3, 2/9, 4/9).
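A similar grid search confirms the multinomial ML estimate (an illustrative sketch, not part of the notes; the constant multinomial coefficient is omitted because it does not affect the maximizing value of θ).

    import math

    x1, x2, x3 = 15, 20, 25   # observed counts, n = 60
    a = x1 + x2               # exponent of theta: 35
    b = x2 + 2 * x3           # exponent of (1 - theta): 70

    def loglik(theta):
        return a * math.log(theta) + b * math.log(1 - theta)

    grid = [i / 100000 for i in range(1, 100000)]
    theta_hat = max(grid, key=loglik)
    print(round(theta_hat, 5))  # approximately 1/3
    print(theta_hat, (1 - theta_hat) * theta_hat, (1 - theta_hat) ** 2)  # estimated (p1, p2, p3)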

4.5.3 Minimum Chi-Square Estimation in Multinomial Models<br />

In 1900, K. Pearson developed a quantitative method to determine if observed data are<br />

consistent with a given multinomial model.<br />

If (X1,X2,... ,Xk) has a multinomial distribution with parameters n and (p1,p2,... ,pk),<br />

then Pearson’s statistic is the following random variable:<br />

X² = Σ_{i=1}^{k} (Xi − n pi)² / (n pi).

For each i, the observed frequency, Xi, is compared to the expected frequency, E(Xi) = npi,<br />

under the multinomial model. If each observed frequency is close to expected, then the value<br />

of X 2 will be close to zero. If at least one observed frequency is far from expected, then<br />

the value of X 2 will be large and the appropriateness of the given multinomial model will<br />

be called into question.<br />

In many practical situations, certain parameters of the multinomial model need to be estimated<br />

from the sample data. For example, if<br />

(p1,p2,... ,pk) = (p1(θ),p2(θ),... ,pk(θ)) <strong>for</strong> some parameter θ,<br />

then the observed value of Pearson’s statistic<br />

f(θ) = Σ_{i=1}^{k} (xi − n pi(θ))² / (n pi(θ))

is a function of θ, and a natural estimate of θ is the value of θ which minimizes f. This<br />

estimate is known as the minimum chi-square estimate of θ <strong>for</strong> the given (x1,x2,... ,xk).<br />

Example 4.17 (n = 60, k = 3, continued) Consider again the model with probabilities<br />

(p1,p2,p3) = (θ,(1 − θ)θ,(1 − θ) 2 ), where 0 ≤ θ ≤ 1 is unknown,<br />


Figure 4.12: Plots of y = f(θ) <strong>for</strong> the minimum chi-square estimation examples.<br />

and assume that in 60 independent trials we obtained x1 = 15, x2 = 20 and x3 = 25.<br />

The function we would like to minimize is<br />

f(θ) = (15 − 60θ)²/(60θ) + (20 − 60(1 − θ)θ)²/(60(1 − θ)θ) + (25 − 60(1 − θ)²)²/(60(1 − θ)²)

(left part of Figure 4.12) and the derivative of this function (after simplification) is

f′(θ) = −5(9θ³ − 9θ² + 75θ − 25) / (12(θ − 1)³θ²).

To find the minimum chi-square estimate we need to find the root of

g(θ) = 9θ³ − 9θ² + 75θ − 25 in the [0,1] interval.

Using Newton's method, 4 decimal places of accuracy, and θ0 = 0.5 we get:

θ0 = 0.5, θ1 = 0.343643, θ2 = 0.342592, θ3 = 0.342592.

Let θ̂ = 0.342592. Since

f′(θ) < 0 when 0 < θ < θ̂ and f′(θ) > 0 when θ̂ < θ < 1,

the function is minimized at θ̂.

The estimated model has (approximate) probabilities (0.3426,0.2252,0.4322).<br />
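The minimum chi-square estimate can also be checked by evaluating f(θ) directly on a grid (an illustrative sketch, not part of the notes).

    def f(theta, n=60, counts=(15, 20, 25)):
        """Pearson statistic as a function of theta for this model."""
        probs = (theta, (1 - theta) * theta, (1 - theta) ** 2)
        return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))

    grid = [i / 100000 for i in range(1, 100000)]
    theta_hat = min(grid, key=f)
    print(round(theta_hat, 4))  # approximately 0.3426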

Example 4.18 (Genetics Study) (Larsen & Marx, Prentice-Hall, 1986, p.418) Researchers<br />

were interested in studying two genetic characteristics of corn:

• Starchy (S) or sugary (s), where starchy is the dominant characteristic, and<br />

• Green base leaf (G) or white base leaf (g), where green base leaf is the dominant<br />

characteristic,<br />



Table 4.1: Summary of results of the genetics study.<br />

1 Starchy green (SG) 1997<br />

2 Starchy white (Sg) 906<br />

3 Sugary green (sG) 904<br />

4 Sugary white (sg) 32<br />

in the offspring of self-fertilized hybrid corn plants. The results <strong>for</strong> 3839 offspring are<br />

summarized in Table 4.1.<br />

A model of interest to geneticists hypothesizes that the probabilities of the four outcomes (SG, Sg, sG, sg) are

(p1,p2,p3,p4) = (0.25(2 + θ),0.25(1 − θ),0.25(1 − θ),0.25θ),<br />

where θ is a parameter in the interval (0, 1). θ is a cross-over parameter, indicating a<br />

tendency <strong>for</strong> chromosome pairs Sg and sG in the hybrid parent to cross-over to SG and sg<br />

at the time of reproduction. (As θ increases the tendency to cross-over increases.)<br />

The right part of Figure 4.12 suggests that f(θ) is minimized in the interval 0.02 < θ < 0.05.<br />

Using Newton's method to solve the derivative equation d/dθ (f(θ)) = 0 with initial value 0.03, the minimum chi-square estimate of θ is θ̂ = 0.0357852.

4.6 Rolle’s Theorem and the Mean Value Theorem<br />

The Mean Value Theorem (MVT) connects the local picture about a curve (namely, the<br />

slope of the curve at a given point) to the global picture (namely, the average rate of change<br />

over an interval). Rolle’s Theorem is the first step.<br />

Theorem 4.19 (Rolle’s Theorem) Suppose f is continuous on [a,b], differentiable on<br />

(a,b) and that f(a) = f(b). Then there is a c ∈ (a,b) satisfying f ′ (c) = 0.<br />

Example 4.20 (Cubic Polynomial) For example, let f(x) = 48 − 44x + 12x 2 − x 3 on<br />

[a,b] = [2,4] (left part of Figure 4.13). Since f(2) = f(4), Rolle’s theorem applies. On this<br />

interval, f ′ (c) = 0 when c = 4 − 2/ √ 3 ≈ 2.8453. (c is a critical number corresponding to<br />

the global minimum on the interval.)<br />

Theorem 4.21 (Mean Value Theorem) Suppose f is continuous on the closed interval<br />

[a,b] and differentiable on the open interval (a,b). Then<br />

(f(b) − f(a))/(b − a) = f′(c) for some number c satisfying a < c < b.

Figure 4.13: Plots of y = 48 − 44x + 12x² − x³ (left) and y = 17 + 2x − 3x² + 0.1x⁴ (right).

The MVT can be proven using Rolle’s theorem. Let<br />

s(x) = f(a) + [(f(b) − f(a))/(b − a)](x − a) and h(x) = f(x) − s(x).

Since h(a) = f(a) − f(a) = 0 and h(b) = f(b) − f(b) = 0, the conditions of Rolle's theorem are met. Thus, there is a c ∈ (a,b) satisfying

0 = h′(c) = f′(c) − (f(b) − f(a))/(b − a) ⇒ f′(c) = (f(b) − f(a))/(b − a).

Example 4.22 (Quartic Polynomial) For example, if f(x) = 17+2x−3x 2 +0.1x 4 and<br />

[a,b] = [−5,5] (right part of Figure 4.13), then the average rate of change over the interval is<br />

2. The instantaneous rate of change of the function is 2 when c = 0, c = ± √ 15 ≈ ±3.87298.<br />

4.6.1 Generalized Mean Value Theorem and Error Analysis<br />

The MVT can be extended to two functions.<br />

Theorem 4.23 (Generalized MVT) Suppose f and g are continuous on the closed<br />

interval [a,b] and differentiable on the open interval (a,b). Then<br />

(f(b) − f(a))g ′ (c) = (g(b) − g(a))f ′ (c) <strong>for</strong> some number c satisfying a < c < b.<br />

The generalized MVT can be proven using Rolle’s theorem applied to<br />

h(x) = (f(b) − f(a))g(x) − (g(b) − g(a))f(x).<br />

Since h(a) = f(b)g(a) − g(b)f(a) and h(b) = f(b)g(a) − g(b)f(a), the conditions of Rolle’s<br />

theorem are met. Thus, there is a c ∈ (a,b) satisfying<br />

0 = h ′ (c) = (f(b)−f(a))g ′ (c)−(g(b)−g(a))f ′ (c) ⇒ (f(b)−f(a))g ′ (c) = (g(b)−g(a))f ′ (c).<br />



Error in linear approximations. If we apply the MVT to the function f on the interval<br />

[a,x], then we get<br />

f(x) = f(a) + f ′ (c)(x − a) <strong>for</strong> some c satisfying a < c < x.<br />

That is, the error in using f(a) instead of using f(x) <strong>for</strong> values of x near a is exactly<br />

f ′ (c)(x − a) <strong>for</strong> some number c between a and x.<br />

How about the error in using the tangent line instead of f(x)?<br />

Suppose f is twice differentiable. Let ℓ(x) = f(a) + f ′ (a)(x − a) be the tangent line<br />

to y = f(x) at (a,f(a)), e(x) = f(x) − ℓ(x) be the difference between f and ℓ, and<br />

g(x) = (x − a) 2 . (The function e(x) is often called the error function.)<br />

If we apply the generalized MVT twice to the functions e and g on the interval [a,x], then<br />

we obtain the following equation<br />

f(x) = f(a) + f′(a)(x − a) + (f′′(c)/2)(x − a)² for some c ∈ (a,x).

That is, the error in using values on the tangent line instead of using f(x) when x is near a is exactly (f′′(c)/2)(x − a)² for some number c between a and x. The form of the result is the same if you apply the generalized MVT twice to e and g on the interval [x,a].

Error in quadratic approximations. In Section 6.6 we will consider quadratic, cubic,<br />

quartic and higher order polynomial approximations to a function f, and error analyses. As<br />

a preview, suppose that f is differentiable to order 3 and consider the following quadratic<br />

approximation to f at a,<br />

q(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)².

(q(x) is a quadratic polynomial satisfying q(a) = f(a), q′(a) = f′(a) and q′′(a) = f′′(a).)

Let e(x) = f(x) − q(x) and g(x) = (x − a)³. If we apply the generalized MVT three times to the functions e and g on the interval [a,x], then we obtain the following equation

f(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)² + (f′′′(c)/6)(x − a)³ for some c ∈ (a,x).

That is, the error in using values on the quadratic curve instead of using f(x) when x is near a is exactly (f′′′(c)/6)(x − a)³ for some number c between a and x. The form of the result is the same if you apply the generalized MVT three times to e and g on the interval [x,a].
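As an illustration (a sketch, not part of the notes), the two error formulas can be checked numerically for f(x) = e^x near a = 0, where f′, f′′ and f′′′ all equal e^x, so on [a,x] the errors are bounded by e^x (x − a)²/2 and e^x (x − a)³/6, respectively.

    import math

    a, x = 0.0, 0.3
    f = math.exp

    linear = f(a) + f(a) * (x - a)                  # tangent line (f'(a) = e^a)
    quadratic = linear + (f(a) / 2) * (x - a) ** 2  # quadratic approximation (f''(a) = e^a)

    err_lin = f(x) - linear       # equals f''(c)/2 * (x - a)^2 for some c in (a, x)
    err_quad = f(x) - quadratic   # equals f'''(c)/6 * (x - a)^3 for some c in (a, x)

    print(err_lin, math.exp(x) * (x - a) ** 2 / 2)   # error and its bound
    print(err_quad, math.exp(x) * (x - a) ** 3 / 6)  # error and its bound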

4.7 Integration<br />

If f is the derivative of F, then we can interpret the function f as the instantaneous rate<br />

of change of the function F. Starting with a function f representing instantaneous rate of<br />

change, the techniques of integration allow us to recover F and to compute the total change<br />

in F over an interval.<br />


Figure 4.14: Definite integral as area (left) and difference in areas (right).<br />

4.7.1 Riemann Sums and Definite Integrals<br />

Suppose f is continuous on the closed interval [a,b], and let n be a positive integer. Further,<br />

let △x = (b − a)/n and xi = a + i△x <strong>for</strong> i = 0,1,2,... ,n. Then the n + 1 numbers<br />

a = x0 < x1 < x2 < · · · < xn = b<br />

partition the interval [a,b] into n subintervals of equal length. Lastly, let x_i^* be the midpoint of the i-th subinterval: x_i^* = (x_{i−1} + x_i)/2.

The definite integral of f from a to b is

∫_a^b f(x) dx = lim_{n→∞} Σ_{i=1}^{n} f(x_i^*) △x.

In this <strong>for</strong>mula: a is the lower limit of integration, b is the upper limit of integration, f(x)<br />

is the integrand, and the sum on the right is called a Riemann sum.<br />

Continuity and Integrability. f is said to be integrable on the interval [a,b] if the limit<br />

above exists. Continuous functions are always integrable. Further, it is possible to show<br />

that the limit exists even if f has a finite number of “jump discontinuities” on [a,b].<br />

Area and difference in areas. If f is nonnegative on [a,b], then ∫_a^b f(x) dx can be

interpreted as the area under the curve y = f(x) and above the x-axis when a ≤ x ≤ b.<br />

Otherwise, the integral can be interpreted as the difference between two areas.<br />

Example 4.24 (Computing Area) Let f(x) = √ x on the interval [0,4]. To estimate<br />

the area under y = f(x) and above the x-axis we let n = 10 and ∆x = 0.4, as illustrated in<br />

the left part of Figure 4.14. The 10 midpoints are

0.2, 0.6, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8,

and the sum of the areas is 5.347.


Figure 4.15: Definite integral as average (left) and arc length (right).<br />

The following list of approximate values suggests that the limit is 16/3.<br />

n 10 50 90 130 170 210 250 290 330 370 410<br />

Sum 5.347 5.335 5.334 5.334 5.334 5.333 5.333 5.333 5.333 5.333 5.333<br />
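The entries in the table can be reproduced with a few lines of code; the following Python sketch (not part of the notes) implements the midpoint-rule Riemann sum described above.

    def midpoint_sum(f, a, b, n):
        """Midpoint-rule Riemann sum for the definite integral of f over [a, b]."""
        dx = (b - a) / n
        return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

    for n in (10, 50, 250):
        print(n, round(midpoint_sum(lambda x: x ** 0.5, 0, 4, n), 3))
    # the values approach 16/3 = 5.333... as n grows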

Example 4.25 (Difference in Areas) Let f(x) = x3 − 4x on the interval [0,3]. To<br />

estimate the definite integral we let n = 9 and ∆x = 1/3, as illustrated in the right part of Figure 4.14. The 9 midpoints are

0.167, 0.5, 0.833, 1.167, 1.5, 1.833, 2.167, 2.5, 2.833,

and the sum is Σ_i f(x_i^*)∆x = 2.125.

The following list of approximate values suggests that the limit is 2.25.<br />

n 9 24 39 54 69 84 99 114 129 144 159<br />

Sum 2.125 2.232 2.243 2.247 2.248 2.249 2.249 2.249 2.249 2.25 2.25<br />

Geometrically, the limit is the difference between areas:<br />

2.25 = ∫_0^3 f(x) dx = ∫_0^2 f(x) dx + ∫_2^3 f(x) dx = −4 + 6.25.

(The area between 0 and 2 is negated because it lies below the x-axis.)<br />

Note that the total area bounded by y = f(x), y = 0, x = 0 and x = 3 is 4 + 6.25 = 10.25.<br />

Average values. The quantity (1/(b − a)) ∫_a^b f(x) dx can be interpreted as the average value of f(x) on the interval [a,b].

To see this, let f̄ = (1/n) Σ_{i=1}^{n} f(x_i^*) be the sample average. Then

f̄ = (1/(b − a)) · ((b − a)/n) Σ_{i=1}^{n} f(x_i^*) = (1/(b − a)) Σ_{i=1}^{n} f(x_i^*) △x ⇒ lim_{n→∞} f̄ = (1/(b − a)) ∫_a^b f(x) dx.


Example 4.26 (Average Value) Let f(x) = 225e^(0.08x) on the interval [0,35]. To estimate

the average of f(x) we let n = 5 and ∆x = 35/5 = 7, as illustrated in the left part of<br />

Figure 4.15. Function values at the 5 midpoints are<br />

f(3.5) ≈ 297.704, f(10.5) ≈ 521.183, f(17.5) ≈ 912.42, f(24.5) ≈ 1597.35, f(31.5) ≈ 2796.43,<br />

and the average of these 5 numbers is 1225.02.<br />

The following list of approximate values suggests that the limit is 1241.08.<br />

n 5 30 55 80 105 130 155 180 205<br />

Avg 1225.02 1240.64 1240.95 1241.02 1241.05 1241.06 1241.07 1241.08 1241.08<br />

Arc lengths. Under mild conditions, the quantity ∫_a^b √(1 + (f′(x))²) dx can be interpreted as the length of the curve y = f(x) for x ∈ [a,b].

To see this, let y_i = f(x_i), ∆y_i = y_i − y_{i−1}, and

Σ_{i=1}^{n} √((∆x)² + (∆y_i)²) = Σ_{i=1}^{n} √(1 + (∆y_i/∆x)²) ∆x

be the sum of lengths of line segments joining successive points:

(x0,y0) → (x1,y1) → · · · → (xn,yn).

If f′ is continuous and not zero on [a,b] (except possibly at the endpoints), then

∆y_i/∆x ≈ f′(x_i^*) for i = 1,2,...,n,

the sum of segment lengths can be approximated as follows,

Σ_{i=1}^{n} √((∆x)² + (∆y_i)²) ≈ Σ_{i=1}^{n} √(1 + (f′(x_i^*))²) ∆x,

and the limit of the Riemann sum on the right is the arc length.<br />

Example 4.27 (Arc Length) Let f(x) = x 3 on the interval [0,4]. To estimate the length<br />

of the curve we let n = 4 and ∆x = 4/4 = 1, as illustrated in the right part of Figure 4.15.<br />

The sum of the lengths of the segments connecting the points (0,0), (1,1), (2,8), (3,27), (4,64)<br />

is 64.525.<br />

The following list of approximate values suggests that the limit is 64.672.<br />

n 4 12 20 28 36 44 52 60 68 76<br />

Sum 64.525 64.659 64.667 64.669 64.67 64.671 64.671 64.671 64.672 64.672<br />
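The polyline approximation can be reproduced directly (an illustrative sketch, not part of the notes): sum the lengths of the segments joining successive points on the curve.

    import math

    def polyline_length(f, a, b, n):
        """Length of the piecewise-linear curve through (x_i, f(x_i)), i = 0..n."""
        dx = (b - a) / n
        ys = [f(a + i * dx) for i in range(n + 1)]
        return sum(math.hypot(dx, ys[i] - ys[i - 1]) for i in range(1, n + 1))

    for n in (4, 20, 76):
        print(n, round(polyline_length(lambda x: x ** 3, 0, 4, n), 3))
    # the values approach the arc length, approximately 64.672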



4.7.2 Properties of Definite Integrals<br />

Suppose that f and g are integrable on [a,b], and let c be a constant. Then properties of<br />

limits imply the following properties of definite integrals:<br />

1. Constant functions: ∫_a^b c dx = c(b − a).

2. Sums and differences: ∫_a^b (f(x) ± g(x)) dx = ∫_a^b f(x) dx ± ∫_a^b g(x) dx.

3. Constant multiples: ∫_a^b c f(x) dx = c ∫_a^b f(x) dx.

4. Adjacent intervals: ∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx when a < c < b.

5. Null interval: ∫_a^a f(x) dx = 0.

6. Reverse limits of integration: ∫_a^b f(x) dx = −∫_b^a f(x) dx.

7. Ordering: If f(x) ≤ g(x) for all x on [a,b], then ∫_a^b f(x) dx ≤ ∫_a^b g(x) dx.

8. Absolute value: If |f| is integrable, then |∫_a^b f(x) dx| ≤ ∫_a^b |f(x)| dx.

4.7.3 Fundamental Theorem of Calculus<br />

The Fundamental Theorem of Calculus (FTC) relates integration and differentiation, and<br />

gives us a method <strong>for</strong> computing definite integrals when f(x) is a derivative function.<br />

Theorem 4.28 (Fundamental Theorem of Calculus) Let f be a continuous function<br />

on the closed interval [a,b]. Then<br />

1. If F(x) = ∫_a^x f(t) dt, then F′(x) = f(x) for x ∈ [a,b].

2. If f(x) = F′(x) for all x in [a,b], then ∫_a^b f(x) dx = F(b) − F(a).

The first part of the fundamental theorem says that f(x) is the rate of change of F(x) on<br />

the interval [a,b]. The second part of the theorem says that the definite integral of a rate<br />

of change is the total change over the interval.<br />

Derivatives of integrals with variable endpoints. A useful generalization of the<br />

Fundamental Theorem of Calculus is as follows: If ℓ(x) ≤ u(x) are differentiable functions<br />

(<strong>for</strong> the lower and upper endpoints), f is continuous, and<br />

G(x) = ∫_{ℓ(x)}^{u(x)} f(t) dt,

then G′(x) = f(u(x)) u′(x) − f(ℓ(x)) ℓ′(x).


Table 4.2: Table of Indefinite Integrals.

∫ c f(x) dx = c ∫ f(x) dx
∫ (f(x) + g(x)) dx = ∫ f(x) dx + ∫ g(x) dx
∫ x^n dx = x^{n+1}/(n+1) + C (n ≠ −1)
∫ (1/x) dx = ln|x| + C (x ≠ 0)
∫ e^x dx = e^x + C
∫ ln(x) dx = x ln(x) − x + C (x > 0)
∫ sin(x) dx = −cos(x) + C
∫ cos(x) dx = sin(x) + C
∫ sec²(x) dx = tan(x) + C
∫ csc²(x) dx = −cot(x) + C
∫ sec(x) tan(x) dx = sec(x) + C
∫ csc(x) cot(x) dx = −csc(x) + C
∫ 1/(1 + x²) dx = tan⁻¹(x) + C
∫ 1/√(1 − x²) dx = sin⁻¹(x) + C

4.7.4 Antiderivatives and Indefinite Integrals

The function F(x) in the FTC is called an antiderivative of f(x). Since the derivative of a<br />

constant function is zero, antiderivative functions are not unique. The family of antiderivatives,<br />

known as the indefinite integral of f(x), is often written as follows:<br />

<br />

∫ f(x) dx = F(x) + C where C is a constant (the constant of integration).

If F(x) is an antiderivative of f(x), then the following notation is used when evaluating a<br />

definite integral using the FTC:<br />

∫_a^b f(x) dx = [F(x)]_a^b = F(b) − F(a).

That is, we compute the difference between the function value at the upper endpoint and<br />

the function value at the lower endpoint.<br />

4.7.5 Some Formulas <strong>for</strong> Indefinite Integrals<br />

A useful first list of indefinite integrals is given in Table 4.2.<br />

Example 4.29 (Computing Area, continued) Since ∫ x^{1/2} dx = (2/3)x^{3/2} + C, the area under the curve y = √x = x^{1/2} and above the x-axis for x ∈ [0,4] is

∫_0^4 √x dx = [(2/3)x^{3/2}]_0^4 = 16/3.

This answer confirms the earlier result.



Rules for polynomials. If p(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0 is a polynomial function, then information in Table 4.2 and properties of integrals given earlier imply that

∫ p(x) dx = (a_n/(n + 1)) x^{n+1} + (a_{n−1}/n) x^n + · · · + (a_2/3) x^3 + (a_1/2) x^2 + a_0 x + C.

In particular, ∫ a dx = ax + C and ∫ (ax + b) dx = (a/2)x² + bx + C.

Example 4.30 (Difference in Areas, continued) Since ∫ (x³ − 4x) dx = (1/4)x⁴ − 2x² + C, the definite integral of f(x) = x³ − 4x for x ∈ [0,3] is

∫_0^3 (x³ − 4x) dx = [(1/4)x⁴ − 2x²]_0^3 = 9/4.

This answer confirms the earlier result.

Substitution rule for composite functions. The chain rule for compositions says that

d/dx (f(g(x))) = f′(g(x))g′(x) where f and g are differentiable functions.

Thus,

∫ f′(g(x))g′(x) dx = f(g(x)) + C

is a general rule for finding antiderivatives.

If we let u = g(x) and du = g′(x) dx, then we can rewrite the formula above as

∫ f′(u) du = f(u) + C.

(We substitute u for g(x) and du for g′(x) dx.)

For example, consider finding

∫ x/√(x² + 1) dx.

To apply substitution, let u = x² + 1. Then du = 2x dx (or x dx = (1/2) du), and the integral becomes

∫ 1/(2√u) du = u^{1/2} + C = √(x² + 1) + C.

Compositions when g(x) is linear. If g(x) = ax + b, where a and b are constants and a ≠ 0, then a general rule for finding the indefinite integral is as follows:

∫ f′(ax + b) dx = (1/a) f(ax + b) + C.

Example 4.31 (Average Value, continued) Since ∫ 225e^{0.08x} dx = 2812.5e^{0.08x} + C, the average value of f(x) = 225e^{0.08x} for x ∈ [0,35] is

(1/35) ∫_0^35 225e^{0.08x} dx = (1/35) [2812.5e^{0.08x}]_0^35 ≈ 1241.08.

This answer confirms the earlier result.



Integration by parts. The derivative rule <strong>for</strong> products says that<br />

d/dx (f(x)g(x)) = f′(x)g(x) + f(x)g′(x) where f and g are differentiable functions.

Thus,

∫ (f′(x)g(x) + f(x)g′(x)) dx = f(x)g(x) + C ⇒ ∫ f(x)g′(x) dx = f(x)g(x) − ∫ f′(x)g(x) dx

is a general rule for finding antiderivatives.

If we let u = f(x), du = f′(x) dx, v = g(x), and dv = g′(x) dx, then we can rewrite the formula above as follows:

∫ u dv = uv − ∫ v du.

In integration by parts, the problem of finding the indefinite integral ∫ u dv is replaced by the problem of finding ∫ v du. Care must be taken to choose u and v so that the second integral is easier to compute than the first.

For example, to find ∫ x e^x dx, we let u = x and dv = e^x dx. Then du = dx. Since

∫ dv = ∫ e^x dx = e^x + C

and v can be any antiderivative, we choose v = e^x. Now

∫ u dv = uv − ∫ v du = x e^x − ∫ e^x dx = x e^x − e^x + C.

Similarly, to find ∫ x ln(x) dx, we let u = ln(x) and dv = x dx. Then du = (1/x) dx. Since

∫ dv = ∫ x dx = x²/2 + C

and v can be any antiderivative, we choose v = x²/2. Now,

∫ u dv = uv − ∫ v du = (1/2)x² ln(x) − ∫ (x²/2)(1/x) dx = (1/2)x² ln(x) − x²/4 + C.

4.7.6 Approximation Methods<br />

Finding antiderivatives is not as straight<strong>for</strong>ward as finding derivatives. In fact, in many<br />

situations an antiderivative cannot be written in closed form. (For example, the antiderivative of f(x) = e^x/x cannot be expressed in terms of elementary functions.) When a closed-form antiderivative is not available, an approximation method can

be used to estimate a definite integral.<br />

Midpoint rule. A natural estimate to use is the value of the Riemann sum<br />

Σ_{i=1}^{n} f(x_i^*) △x when n is large.

Since x ∗ i was chosen to be the midpoint of [xi−1,xi] <strong>for</strong> each i, this method of estimating<br />

the definite integral is called the midpoint rule.<br />


Figure 4.16: Plots of y = e x /x with trapezoidal (left) and Simpson (right) approximations.<br />

Trapezoidal and Simpson’s rules. In the trapezoidal rule, the area under the piecewise<br />

linear curve defined by the points<br />

(xi,f(xi)) <strong>for</strong> i = 0,1,2,... ,n<br />

is used to estimate the area under the nonnegative curve y = f(x). (Geometrically, a sum<br />

of areas of trapezoids replaces a sum of areas of rectangles.)<br />

In Simpson’s rule, the area under a piecewise quadratic curve is used to estimate the area un-<br />

der the nonnegative curve y = f(x), where the quadratic polynomial <strong>for</strong> the ith subinterval<br />

passes through the points (xi−1,f(xi−1)), (x∗ i ,f(x∗ i )), and (xi,f(xi)).<br />

Example 4.32 (Figure 4.16) Let f(x) = e x /x on the interval [1,7]. The left part of<br />

Figure 4.16 shows the setup <strong>for</strong> the trapezoidal rule, and the right part shows the setup<br />

<strong>for</strong> Simpson’s rule, when n = 3 subintervals are used. The approximation in the right plot<br />

is significantly better than the one in the left plot. (Note that the numerical value of the<br />

integral is 189.61.)<br />
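The following Python sketch (not part of the notes) compares the three rules for this integral; with only n = 3 subintervals, Simpson's rule is already close to the numerical value 189.61.

    import math

    def midpoint(f, a, b, n):
        dx = (b - a) / n
        return dx * sum(f(a + (i + 0.5) * dx) for i in range(n))

    def trapezoid(f, a, b, n):
        dx = (b - a) / n
        xs = [a + i * dx for i in range(n + 1)]
        return dx * (sum(f(x) for x in xs) - 0.5 * (f(a) + f(b)))

    def simpson(f, a, b, n):
        # piecewise-quadratic rule through the endpoints and midpoint of each subinterval
        dx = (b - a) / n
        total = 0.0
        for i in range(n):
            x0, x1 = a + i * dx, a + (i + 1) * dx
            total += dx / 6 * (f(x0) + 4 * f((x0 + x1) / 2) + f(x1))
        return total

    f = lambda x: math.exp(x) / x
    for rule in (midpoint, trapezoid, simpson):
        print(rule.__name__, round(rule(f, 1, 7, 3), 2))  # Simpson is closest to 189.61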

Newton-Cotes and Gaussian methods. The midpoint rule gives exact answers if f is<br />

a constant function, the trapezoidal rule gives exact answers if f is a constant function or a<br />

linear function, and Simpson’s rule gives exact answers if f is a constant function, a linear<br />

function or a quadratic function. They are examples of Newton-Cotes approximations.<br />

An approximation of the Newton-Cotes type will give exact answers when f is a polynomial<br />

function of degree at most a fixed constant. Gaussian quadrature methods take Newton-<br />

Cotes methods a step further, by choosing partition points that are not necessarily equally spaced;

the spacing is chosen to minimize the approximation error.<br />

Monte Carlo methods. Monte Carlo methods are based on the idea of averaging (see<br />

the discussion of average values in Section 4.7.1). Let x1, x2, ..., xn be n numbers chosen uniformly from the interval [a,b]. Then a Monte Carlo estimate of ∫_a^b f(x) dx is

(b − a) (1/n) Σ_{i=1}^{n} f(x_i).



Figure 4.17: Improper integral examples on finite intervals.<br />

Let f(x) = e x /x and [a,b] = [1,7] (see Figure 4.16). A Monte Carlo estimate of the integral,<br />

based on n = 50000 random numbers, was 190.061.<br />
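A sketch of the Monte Carlo estimate described above (not part of the notes; the seed is an arbitrary choice, and each run gives a slightly different answer near the true value).

    import math
    import random

    def monte_carlo_integral(f, a, b, n, seed=0):
        rng = random.Random(seed)
        xs = (rng.uniform(a, b) for _ in range(n))
        return (b - a) * sum(f(x) for x in xs) / n

    estimate = monte_carlo_integral(lambda x: math.exp(x) / x, 1, 7, 50000)
    print(round(estimate, 2))  # a random estimate near the numerical value 189.61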

Power series methods. Methods based on power series expansions can also be used to<br />

estimate an integral. See Section 6.7.2.<br />

4.7.7 Improper Integrals<br />

The definition of the definite integral can be extended to include situations where f is<br />

undefined at either the lower or upper endpoint of integration, and to include situations<br />

where a = −∞ or b = ∞.<br />

Specifically,<br />

1. If f is integrable on [c,b] for all a < c < b, then ∫_a^b f(x) dx = lim_{c→a⁺} ∫_c^b f(x) dx when this limit exists.

2. If f is integrable on [a,c] for all a < c < b, then ∫_a^b f(x) dx = lim_{c→b⁻} ∫_a^c f(x) dx when this limit exists.

3. If f is integrable on [a,b] for all b greater than a fixed a, then ∫_a^∞ f(x) dx = lim_{b→∞} ∫_a^b f(x) dx when this limit exists.

4. If f is integrable on [a,b] for all a less than a fixed b, then ∫_{−∞}^b f(x) dx = lim_{a→−∞} ∫_a^b f(x) dx when this limit exists.




Figure 4.18: Improper integral examples on infinite intervals.<br />

When the limit exists, the improper integral is said to converge; otherwise, the improper<br />

integral is said to diverge.<br />

Example 4.33 (Improper Integrals on Finite Intervals) Let f(x) = 10/x² on the interval (0,2], as illustrated in the left part of Figure 4.17. Since

∫_0^2 f(x) dx = lim_{c→0⁺} [−10/x]_c^2 = −5 − lim_{c→0⁺} (−10/c) does not exist,

the improper integral diverges.

Similarly, let f(x) = 10/√(5 − x) on the interval [0,5), as illustrated in the right part of Figure 4.17. Since

∫_0^5 f(x) dx = lim_{c→5⁻} [−20√(5 − x)]_0^c = lim_{c→5⁻} (−20√(5 − c) + 20√5) = 0 + 20√5 = 20√5,

the improper integral converges to 20√5 ≈ 44.7214.

Example 4.34 (Improper Integrals on Infinite Intervals) Let f(x) = xe^{−x} on the interval [0, ∞), as illustrated in the left part of Figure 4.18. Since

∫_0^∞ f(x) dx = lim_{c→∞} [−xe^{−x} − e^{−x}]_0^c   (using integration-by-parts)
            = lim_{c→∞} (−ce^{−c} − e^{−c}) + 1
            = lim_{c→∞} (−c/e^c) − lim_{c→∞} (e^{−c}) + 1
            = lim_{c→∞} (−1/e^c) − 0 + 1   (using L'Hospital's Rule)
            = 0 − 0 + 1 = 1,

the improper integral converges to 1.

Similarly, let f(x) = 10/√x on the interval [1, ∞), as illustrated in the right part of Figure 4.18. Since

∫_1^∞ f(x) dx = lim_{c→∞} [20√x]_1^c = lim_{c→∞} (20√c − 20) does not exist,

the improper integral diverges.




Figure 4.19: Standardized binomial distribution with standard normal approximation.<br />

4.8 Applications in <strong>Statistics</strong><br />

Integration methods are important in statistics, and, in particular, when working with<br />

density functions. Density functions are used to “smooth over” probability histograms of<br />

discrete random variables, and as functions supporting the theory of continuous random<br />

variables. Continuous random variables are studied in Chapter 5 of these notes.<br />

4.8.1 deMoivre-Laplace Limit Theorem<br />

The most famous example of “smoothing over” is given in the following theorem.<br />

Theorem 4.35 (deMoivre-Laplace Limit Theorem) Let Xn be a binomial random<br />

variable based on n trials of a Bernoulli experiment with success probability p, and let<br />

Zn = (Xn − np)/√(npq)

be the standardized form of Xn. (E(Zn) = 0 and SD(Zn) = 1.) Then, for each real number x,

lim_{n→∞} P(Zn ≤ x) = ∫_{−∞}^x (1/√(2π)) e^{−z²/2} dz.

The deMoivre-Laplace limit theorem implies that the area under the standard normal density<br />

function, φ(z) = (1/√(2π)) e^{−z²/2}, can be used to estimate binomial probabilities when n is large enough. (See Section 5.2.5 for more details on the normal distribution.)

Example 4.36 (Binomial Approximation) Let n = 40 and p = 1/2. In Figure 4.19, the<br />

function φ(z) is superimposed on the probability histogram <strong>for</strong> the standardized random<br />

variable Z40 = (X40 − 20)/ √ 10.<br />

The probability P(X40 ≤ 22) can be computed using the binomial PDF:<br />

<br />

P(X40 ≤ 22) = Σ_{x=0}^{22} C(40, x) (1/2)^40 = 0.785205.


Using the normal approximation (with continuity correction) we get<br />

P(X40 ≤ 22) ≈ Φ((22.5 − 20)/√10) = 0.785402,

where Φ(x) is the area under φ(z) for z ≤ x. The values are close.
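Both numbers can be reproduced with a short computation (a sketch, not part of the notes), using the exact binomial sum and the identity Φ(x) = (1 + erf(x/√2))/2 for the standard normal CDF.

    import math

    n, p = 40, 0.5

    def binom_cdf(k, n, p):
        return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

    def Phi(z):
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    exact = binom_cdf(22, n, p)
    approx = Phi((22.5 - n * p) / math.sqrt(n * p * (1 - p)))   # continuity correction
    print(round(exact, 6), round(approx, 6))  # approximately 0.785205 and 0.785402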

Stirling’s <strong>for</strong>mula. An important component of the proof of the deMoivre-Laplace limit<br />

theorem is the use of Stirling’s <strong>for</strong>mula <strong>for</strong> factorials. Stirling proved that<br />

lim_{n→∞} n! / (√(2π) n^{n+1/2} e^{−n}) = 1.

Thus, for large values of n, the formula √(2π) n^{n+1/2} e^{−n} can be used to approximate n!.

4.8.2 Computing Quantiles

If the continuous random variable X has density function f(x), then the cumulative distribution function (CDF) of X can be obtained using integration:

F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt.

If 0 < p < 1, then the pth quantile of the X distribution (when it exists) is the number, xp,<br />

satisfying the equation<br />

F(xp) = ∫_{−∞}^{xp} f(t) dt = p.

Example 4.37 (Pareto Distribution) The Pareto distribution has density function<br />

f(x) = α/x^{α+1} when x > 1 and 0 otherwise,

where α is a positive constant.

Pareto distributions are used to model incomes in a population. The CDF of the Pareto distribution is found using integration. Specifically,

F(x) = ∫_{−∞}^x f(t) dt = ∫_1^x (α/t^{α+1}) dt = [−t^{−α}]_1^x = 1 − x^{−α}

when x > 1 and 0 otherwise. For each p, the p-th quantile is

p = F(xp) ⇒ xp = (1 − p)^{−1/α}.

For example, the quartiles (25 th , 50 th , and 75 th percentiles) of the Pareto distribution with<br />

parameter α = 2.5 are 1.122, 1.320, and 1.741, respectively.<br />
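A sketch (not part of the notes) that evaluates the closed-form quantile function x_p = (1 − p)^{−1/α} for α = 2.5:

    def pareto_quantile(p, alpha):
        """p-th quantile of the Pareto distribution with parameter alpha."""
        return (1 - p) ** (-1 / alpha)

    alpha = 2.5
    print([round(pareto_quantile(p, alpha), 3) for p in (0.25, 0.50, 0.75)])
    # [1.122, 1.32, 1.741], the quartiles quoted above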

Additional examples will be given in the next chapter.<br />



5 Continuous Random Variables<br />

Researchers use random variables to describe the numerical results of experiments. In the<br />

continuous setting, the possible numerical values <strong>for</strong>m an interval or a union of intervals.<br />

For example, a random variable whose values are the positive real numbers might be used<br />

to describe the lifetimes of individuals in a population.<br />

This chapter focuses on continuous random variables and their probability distributions.<br />

5.1 Definitions<br />

Recall that a random variable is a function from the sample space of an experiment to the<br />

real numbers and that the random variable X is said to be continuous if its range is an<br />

interval or a union of intervals.<br />

5.1.1 PDF and CDF <strong>for</strong> Continuous Distributions<br />

Let X be a continuous random variable. We assume there exists a nonnegative function<br />

f(x) defined over the entire real line such that<br />

<br />

P(X ∈ D) = ∫_D f(x) dx,

where D ⊆ R is an interval or union of intervals. The function f(x) is called the probability<br />

density function (PDF) (or density function) of the continuous random variable X.<br />

PDFs satisfy the following properties:<br />

1. f(x) ≥ 0 whenever it exists.<br />

2. ∫_R f(x) dx equals 1, where R is the range of X.

f represents the rate of change of probability; the rate must be nonnegative (property 1).<br />

The area under f represents probability; the total probability must be 1 (property 2).<br />

The cumulative distribution function (CDF) of X, denoted by F(x), is the function<br />

<br />

F(x) = P(X ≤ x) = ∫_{(−∞,x]} f(x) dx for all real numbers x.

CDFs satisfy the following properties:

1. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.

2. If x1 ≤ x2, then F(x1) ≤ F(x2).

3. F(x) is continuous.


Figure 5.1: Probability density function (left plot) and cumulative distribution function<br />

(right plot) <strong>for</strong> a continuous random variable with range x ≥ 0.<br />

F represents cumulative probability, with limits 0 and 1 (property 1). Cumulative probability<br />

increases with increasing x (property 2). For continuous random variables, the<br />

cumulative distribution function is continuous (property 3).<br />

Note that if R is the range of the continuous random variable X and I is an interval, then<br />

the probability of the event “the value of X is in the interval I” is computed by finding the<br />

area under the density curve <strong>for</strong> x ∈ I ∩ R:<br />

<br />

P(X ∈ I) = ∫_{I∩R} f(x) dx.

In particular, if a ∈ R, then P(X = a) = 0 since the area under the curve over an interval of length zero is zero.

Example 5.1 (Distribution on Nonnegative Reals) Let X be a continuous random variable whose range is the nonnegative real numbers and whose PDF is

f(x) = 200/(10 + x)³ when x ≥ 0 and 0 otherwise.

The left part of Figure 5.1 is a plot of the density function of X and the right part is a plot<br />

of its cumulative distribution function:<br />

F(x) = 1 − 100/(10 + x)² when x ≥ 0 and 0 otherwise.

Further, for this random variable, the probability that X is greater than 8 is

P(X > 8) = ∫_8^∞ f(x) dx = 1 − F(8) = 100/18² ≈ 0.3086.

5.1.2 Quantiles and Percentiles

Assume that 0 < p < 1. The p th quantile (or 100p th percentile) of the X distribution (when<br />

it exists) is the point, xp, satisfying the equation<br />

P(X ≤ xp) = p.<br />



To find xp, solve the equation F(x) = p <strong>for</strong> x.<br />

Important special cases of quantiles are as follows:<br />

1. The median of the X distribution: the 50 th percentile.<br />

2. The quartiles of the X distribution: the 25 th , 50 th , and 75 th percentiles.<br />

3. The deciles of the X distribution: the 10 th , 20 th , ..., 90 th percentiles.<br />

Interquartile range. The interquartile range (IQR) is the difference between the 75 th<br />

and 25 th percentiles: IQR = x0.75 − x0.25.<br />

Example 5.2 (Distribution on Nonnegative Reals, continued) Continuing with the<br />

example above, a general <strong>for</strong>mula <strong>for</strong> the p th quantile is<br />

p = 1 − 100/(10 + xp)² ⇒ xp = 10/√(1 − p) − 10.

In particular, the quartiles are

x0.25 = 20/√3 − 10 ≈ 1.547, x0.50 = 10√2 − 10 ≈ 4.142 and x0.75 = 10.

The IQR of this distribution is IQR = 20 − 20/√3 ≈ 8.453.

Measures of center and spread. The median is a measure of the center (or location)<br />

of a distribution. The IQR is a measure of the scale (or spread) of a distribution.<br />

Additional measures of center and spread are the mean and standard deviation, respectively.<br />

Definitions <strong>for</strong> continuous random variables are given in Section 5.3.3.<br />

5.2 Families of Distributions<br />

This section briefly describes four families of continuous probability distributions, considers<br />

properties of these distributions, and considers continuous trans<strong>for</strong>mations.<br />

5.2.1 Example: Uni<strong>for</strong>m Distribution<br />

Let a and b be real numbers with a < b. The continuous random variable X is said to be a<br />

uni<strong>for</strong>m random variable, or to have a uni<strong>for</strong>m distribution, on the interval [a,b] when its<br />

PDF is as follows:<br />

f(x) = 1/(b − a) when a ≤ x ≤ b and 0 otherwise.

Note that the open interval (a,b), or one of the half-closed intervals [a,b) or (a,b], can be<br />

used instead of [a,b] as the range of a uni<strong>for</strong>m random variable.<br />


Figure 5.2: PDF (left) and CDF (right) <strong>for</strong> the uni<strong>for</strong>m random variable on [10,50].<br />

Uni<strong>for</strong>m distributions have constant density over an interval. That density is the reciprocal<br />

of the length of the interval. If X is a uni<strong>for</strong>m random variable on the interval [a,b], and<br />

[c,d] ⊆ [a,b] is a subinterval, then<br />

P(c ≤ X ≤ d) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a) = (Length of [c,d])/(Length of [a,b]).

Also, P(c < X < d) = P(c < X ≤ d) = P(c ≤ X < d) = (d − c)/(b − a).<br />

If 0 < p < 1, then the p th quantile is xp = a + p(b − a).<br />

Example 5.3 (a = 10, b = 50) Let X be a uni<strong>for</strong>m random variable on the interval<br />

[10,50]. X has median 30 and IQR 20. Further,<br />

P(15 ≤ X ≤ 42) = 27/40 = 0.675.

The PDF and CDF of X are shown in Figure 5.2.<br />

Computer-generated random numbers. Note that computer commands that return<br />

“random” numbers in the interval [0,1] are simulating random numbers from the uni<strong>for</strong>m<br />

distribution on the interval [0,1].<br />

5.2.2 Example: Exponential Distribution<br />

Let λ be a positive real number. The continuous random variable X is said to be an<br />

exponential random variable, or to have an exponential distribution, with parameter λ when<br />

its PDF is as follows:<br />

f(x) = λe −λx when x ≥ 0 and 0 otherwise.<br />

Note that the interval x > 0 can be used instead of the interval x ≥ 0 as the range of an<br />

exponential random variable.<br />




Figure 5.3: PDF (left) and CDF (right) <strong>for</strong> the exponential random variable with λ = 4.<br />

Exponential distributions are often used to represent the time that elapses be<strong>for</strong>e the occurrence<br />

of an event. For example, the time that a machine component will operate be<strong>for</strong>e<br />

breaking down.<br />

If X is an exponential random variable with parameter λ and [a,b] ⊆ [0, ∞), then<br />

P(a ≤ X ≤ b) = ∫_a^b λe^{−λx} dx = [−e^{−λx}]_a^b = −e^{−λb} + e^{−λa}.

If 0 < p < 1, then the p-th quantile is xp = −ln(1 − p)/λ.

Example 5.4 (λ = 4) Let X be an exponential random variable with parameter 4. Using<br />

properties of logarithms, X has median ln(2)/4 ≈ 0.173 and IQR ln(3)/4 ≈ 0.275. Further,<br />

P(1/4 ≤ X ≤ 3/4) = −e^{−3} + e^{−1} ≈ 0.318.

The PDF and CDF of X are shown in Figure 5.3.<br />

The exponential distribution is discussed further in Section 6.9.1.<br />
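The median, IQR, and interval probability quoted in Example 5.4 follow directly from the formulas above; a short Python check (a sketch, not part of the notes):

    import math

    lam = 4.0
    quantile = lambda p: -math.log(1 - p) / lam

    median = quantile(0.5)                                     # ln(2)/4
    iqr = quantile(0.75) - quantile(0.25)                      # ln(3)/4
    prob = -math.exp(-lam * 0.75) + math.exp(-lam * 0.25)      # P(1/4 <= X <= 3/4)
    print(round(median, 3), round(iqr, 3), round(prob, 3))     # 0.173, 0.275, 0.318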

5.2.3 Euler Gamma Function<br />

Let r be a positive real number. The Euler gamma function is defined as follows:<br />

Γ(r) = ∫_0^∞ x^{r−1} e^{−x} dx.

If r is a positive integer, then Γ(r) = (r−1)!. Thus, the gamma function is said to interpolate<br />

the factorials.<br />

Remark. The property Γ(r) = (r−1)! <strong>for</strong> positive integers can be proven using induction.<br />

To start the induction, you need to demonstrate that Γ(1) = 1. To prove the induction<br />

step, you need to use integration-by-parts to demonstrate that Γ(r + 1) = r × Γ(r).<br />


Figure 5.4: PDF (left) and CDF (right) <strong>for</strong> the gamma random variable with shape parameter<br />

α = 2 and scale parameter β = 5.<br />

5.2.4 Example: Gamma Distribution<br />

Let α and β be positive real numbers. The continuous random variable X is said to be a<br />

gamma random variable, or to have a gamma distribution, with parameters α and β when<br />

its PDF is as follows:<br />

f(x) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β} when x > 0 and 0 otherwise.

Note that if α ≥ 1, then the interval x ≥ 0 can be used instead of the interval x > 0 as the<br />

range of a gamma random variable.<br />

The parameter α is called a shape parameter; gamma distributions with the same value<br />

of α have the same shape. The parameter β is called a scale parameter; <strong>for</strong> fixed α, as β<br />

changes, the scale on the vertical and horizontal axes change, but the shape remains the<br />

same.<br />

Example 5.5 (α = 2, β = 5) Let X be a gamma random variable with shape parameter<br />

2 and scale parameter 5. Using Newton’s method, the median of X is approximately 8.39<br />

and the IQR is approximately 8.65. Further,<br />

P(8 ≤ X ≤ 20) = ∫_8^20 (1/25) x e^{−x/5} dx = [−e^{−x/5} − (1/5) x e^{−x/5}]_8^20 ≈ 0.433.

The PDF and CDF of X are shown in Figure 5.4.

Remarks. The fact that f(x) is a valid PDF can be demonstrated using the method of<br />

substitution and the definition of the Euler gamma function. If the shape parameter α = 1,<br />

then the gamma distribution is the same as the exponential distribution with λ = 1/β.<br />

The gamma distribution is discussed further in Section 6.9.2.<br />



5.2.5 Example: Normal Distribution<br />

Let µ be a real number and σ be a positive real number. The continuous random variable<br />

X is said to be a normal random variable, or to have a normal distribution, with parameters<br />

µ and σ when its PDF is as follows:<br />

f(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)) for all real numbers x,

where exp() is the exponential function. The normal distribution is also called the Gaussian<br />

distribution, in honor of the mathematician Carl Friedrich Gauss.<br />

The graph of the PDF of a normal random variable is the famous bell-shaped curve. The<br />

parameter µ is called the mean (or center) of the normal distribution, since the graph of<br />

the PDF is symmetric around x = µ. The median of the normal distribution is µ. The<br />

parameter σ is called the standard deviation (or spread) of the normal distribution. The<br />

interquartile range of the normal distribution is (approximately) 1.35σ.<br />

Normal distributions have many applications. For example, normal random variables can<br />

be used to model measurements of manufactured items made under strict controls; normal<br />

random variables can be used to model physical measurements (e.g. height, weight, blood<br />

values) in homogeneous populations.<br />

Note that the PDF of the normal distribution cannot be integrated in closed <strong>for</strong>m. Most<br />

computer programs provide functions to compute probabilities and quantiles of the normal<br />

distribution.<br />

Standard normal distribution. The continuous random variable Z is said to be a<br />

standard normal random variable, or to have a standard normal distribution, when Z is a<br />

normal random variable with µ = 0 and σ = 1. The PDF and CDF of Z have special<br />

symbols:<br />

φ(z) = (1/√(2π)) e^{−z²/2} and Φ(z) = ∫_{−∞}^z φ(t) dt, where z is a real number.

The notation zp is used <strong>for</strong> the p th quantile of the Z distribution.<br />

Working with tables. Table 5.1 gives values of Φ(z) <strong>for</strong> z ≥ 0, where<br />

z = Row Value + Column Value.<br />

For z < 0, use Φ(z) = 1 − Φ(−z). For example,<br />

1. Φ(1.25) = 0.894 (row 1.2, column 0.05).<br />

2. Φ(−0.70) = 1 − Φ(0.70) = 1 − 0.758 = 0.242 (row 0.7, column 0.00).<br />

3. z0.25 ≈ −0.67 and z0.75 ≈ 0.67 since Φ(0.67) = 0.749 (row 0.6, column 0.07).<br />

The graph shown in the table illustrates Φ(z) as an area.<br />
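When a computer is available, the table can be bypassed entirely; the following sketch (not part of the notes) computes Φ(z) from the error function in the standard library, using Φ(z) = (1 + erf(z/√2))/2.

    import math

    def Phi(z):
        """Standard normal CDF."""
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    print(round(Phi(1.25), 3))   # 0.894, matching row 1.2, column 0.05
    print(round(Phi(-0.70), 3))  # 0.242
    print(round(Phi(0.67), 3))   # 0.749, so z_0.75 is approximately 0.67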



Table 5.1: Cumulative probabilities of the standard normal random variable<br />

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09<br />

0.0 0.500 0.504 0.508 0.512 0.516 0.520 0.524 0.528 0.532 0.536<br />

0.1 0.540 0.544 0.548 0.552 0.556 0.560 0.564 0.567 0.571 0.575<br />

0.2 0.579 0.583 0.587 0.591 0.595 0.599 0.603 0.606 0.610 0.614<br />

0.3 0.618 0.622 0.626 0.629 0.633 0.637 0.641 0.644 0.648 0.652<br />

0.4 0.655 0.659 0.663 0.666 0.670 0.674 0.677 0.681 0.684 0.688<br />

0.5 0.691 0.695 0.698 0.702 0.705 0.709 0.712 0.716 0.719 0.722<br />

0.6 0.726 0.729 0.732 0.736 0.739 0.742 0.745 0.749 0.752 0.755<br />

0.7 0.758 0.761 0.764 0.767 0.770 0.773 0.776 0.779 0.782 0.785<br />

0.8 0.788 0.791 0.794 0.797 0.800 0.802 0.805 0.808 0.811 0.813<br />

0.9 0.816 0.819 0.821 0.824 0.826 0.829 0.831 0.834 0.836 0.839<br />

1.0 0.841 0.844 0.846 0.848 0.851 0.853 0.855 0.858 0.860 0.862<br />

1.1 0.864 0.867 0.869 0.871 0.873 0.875 0.877 0.879 0.881 0.883<br />

1.2 0.885 0.887 0.889 0.891 0.893 0.894 0.896 0.898 0.900 0.901<br />

1.3 0.903 0.905 0.907 0.908 0.910 0.911 0.913 0.915 0.916 0.918<br />

1.4 0.919 0.921 0.922 0.924 0.925 0.926 0.928 0.929 0.931 0.932<br />

1.5 0.933 0.934 0.936 0.937 0.938 0.939 0.941 0.942 0.943 0.944<br />

1.6 0.945 0.946 0.947 0.948 0.949 0.951 0.952 0.953 0.954 0.954<br />

1.7 0.955 0.956 0.957 0.958 0.959 0.960 0.961 0.962 0.962 0.963<br />

1.8 0.964 0.965 0.966 0.966 0.967 0.968 0.969 0.969 0.970 0.971<br />

1.9 0.971 0.972 0.973 0.973 0.974 0.974 0.975 0.976 0.976 0.977<br />

2.0 0.977 0.978 0.978 0.979 0.979 0.980 0.980 0.981 0.981 0.982<br />

2.1 0.982 0.983 0.983 0.983 0.984 0.984 0.985 0.985 0.985 0.986<br />

2.2 0.986 0.986 0.987 0.987 0.987 0.988 0.988 0.988 0.989 0.989<br />

2.3 0.989 0.990 0.990 0.990 0.990 0.991 0.991 0.991 0.991 0.992<br />

2.4 0.992 0.992 0.992 0.992 0.993 0.993 0.993 0.993 0.993 0.994<br />

2.5 0.994 0.994 0.994 0.994 0.994 0.995 0.995 0.995 0.995 0.995<br />

2.6 0.995 0.995 0.996 0.996 0.996 0.996 0.996 0.996 0.996 0.996<br />

2.7 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997<br />

2.8 0.997 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.998<br />

2.9 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.999 0.999 0.999<br />

3.0 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />

3.1 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />

3.2 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />




Figure 5.5: Probability density functions of a continuous random variable with values on<br />

0 < x < 2 (left plot) and of its reciprocal with values on y > 0.5 (right plot).<br />

Nonstandard normal distributions. If X is a normal random variable with mean µ<br />

and standard deviation σ, then<br />

<br />

P(X ≤ x) = Φ((x − µ)/σ)

where Φ() is the CDF of the standard normal random variable, and

xp = µ + zp σ

where zp is the p-th quantile of the standard normal distribution.

Example 5.6 (µ = 100, σ = 20) Let X be a normal random variable with mean 100<br />

and standard deviation 20. The median of X is 100 and the IQR is 20(z0.75 − z0.25) ≈ 26.8.<br />

Further,<br />

<br />

P(60 ≤ X ≤ 110) = Φ((110 − 100)/20) − Φ((60 − 100)/20) ≈ 0.668.

5.2.6 Trans<strong>for</strong>ming Continuous Random Variables<br />

If X and Y = g(X) are continuous random variables, then the CDF of X can be used to<br />

determine the CDF and PDF of Y .<br />

Example 5.7 (Reciprocal Transformation) Let X be a continuous random variable whose range is the open interval (0,2) and whose PDF is

fX(x) = x/2 when 0 < x < 2 and 0 otherwise,

and let Y equal the reciprocal of X, Y = 1/X. Then for y > 1/2,

P(Y ≤ y) = P(1/X ≤ y) = P(X ≥ 1/y) = ∫_{x=1/y}^{2} (x/2) dx = 1 − 1/(4y²)

and d/dy P(Y ≤ y) = 1/(2y³). Thus, the PDF of Y is

fY(y) = 1/(2y³) when y > 1/2 and 0 otherwise,

and the CDF of Y is

FY(y) = 1 − 1/(4y²) when y > 1/2 and 0 otherwise.

The left part of Figure 5.5 is a plot of the density function of X, and the right part is a plot of the density function of Y.
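As a quick check of Example 5.7 (an added sketch, not in the original notes; it assumes NumPy), we can simulate X by inverting its CDF FX(x) = x²/4 and compare the empirical CDF of Y = 1/X with 1 − 1/(4y²):

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(size=100_000)
    x = 2 * np.sqrt(u)               # F_X(x) = x^2/4 on (0,2), so X = 2*sqrt(U)
    y = 1 / x                        # the reciprocal transformation

    for y0 in (0.6, 1.0, 2.0):
        empirical = np.mean(y <= y0)
        exact = 1 - 1 / (4 * y0**2)  # F_Y(y) = 1 - 1/(4 y^2) for y > 1/2
        print(y0, round(empirical, 4), round(exact, 4))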

Example 5.8 (Square Transformation) Let Z be the standard normal random variable and W be the square of Z, W = Z². Then for w > 0,

P(W ≤ w) = P(Z² ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w).

Since the graph of the PDF of Z is symmetric around z = 0, Φ(−z) = 1 − Φ(z) for every z. Thus, the CDF of W is

FW(w) = 2Φ(√w) − 1 when w > 0 and 0 otherwise,

and the PDF of W is

fW(w) = d/dw FW(w) = (1/√(2πw)) e^{−w/2} when w > 0 and 0 otherwise.

(Note that W = Z² is said to have a chi-square distribution with one degree of freedom.)

Monotone transformations. If the differentiable transformation g is strictly monotone (either always increasing or always decreasing) on the range of X, Y = g(X), and y is in the range of Y, then the PDF of Y has the following form:

fY(y) = fX(x) / |dy/dx|,

where y = g(x) (and x = g⁻¹(y)).

Example 5.9 (Reciprocal Transformation, continued) In the reciprocal transformation example above, y = g(x) = 1/x and x = g⁻¹(y) = 1/y (the inverse of the reciprocal function is the reciprocal function), and

fY(y) = fX(g⁻¹(y)) / |−1/(g⁻¹(y))²| = (1/(2y)) / |−y²| = 1/(2y³) when y > 0.5.

Linear transformations. Linear transformations with nonzero slope are examples of differentiable monotone transformations. In particular, if Y = aX + b where a ≠ 0 and y is in the range of Y, then

fY(y) = (1/|a|) fX((y − b)/a).



Example 5.10 (Normal Distribution) If X is a normal random variable with mean µ and standard deviation σ and Z is the standard normal random variable with PDF φ(z), then X = σZ + µ with PDF

fX(x) = (1/σ) φ((x − µ)/σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) for all x.

This formula for the PDF agrees with the one given earlier.

5.3 Mathematical Expectation<br />

Mathematical expectation generalizes the idea of a weighted average, where probability<br />

distributions are used as the weights. (See Section 2.3 also.)<br />

5.3.1 Definitions <strong>for</strong> Continuous Random Variables<br />

Let X be a continuous random variable with range R and PDF f(x). The mean of X (or expected value of X or expectation of X) is defined as follows:

E(X) = ∫_{x∈R} x f(x) dx,

provided that ∫_{x∈R} |x| f(x) dx < ∞ (i.e. the integral converges absolutely).

Similarly, if g(X) is a real-valued function of X, then the mean of g(X) (or expected value of g(X) or expectation of g(X)) is

E(g(X)) = ∫_{x∈R} g(x) f(x) dx,

provided that ∫_{x∈R} |g(x)| f(x) dx < ∞.

Note that the absolute convergence of an integral is not guaranteed. In cases where an<br />

integral with absolute values diverges, we say that the expectation is indeterminate.<br />

Example 5.11 (Triangular Distribution) Let X be a continuous random variable whose range is the interval [1,4] and whose PDF is as follows

f(x) = (2/9)(x − 1) when 1 ≤ x ≤ 4 and 0 otherwise,

and let g(X) = X² be the square of X. Then

E(X) = ∫_1^4 x f(x) dx = ∫_1^4 (2/9) x (x − 1) dx = 3

and

E(g(X)) = ∫_1^4 x² f(x) dx = ∫_1^4 (2/9) x² (x − 1) dx = 19/2.

Note that E(g(X)) ≠ g(E(X)).
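The two integrals in Example 5.11 can also be evaluated numerically; the following sketch (added here, assuming SciPy's quad routine is available) reproduces E(X) = 3 and E(X²) = 19/2.

    from scipy.integrate import quad

    f = lambda x: (2 / 9) * (x - 1)                # triangular PDF on [1, 4]
    EX, _ = quad(lambda x: x * f(x), 1, 4)         # E(X)
    EX2, _ = quad(lambda x: x**2 * f(x), 1, 4)     # E(X^2)
    print(round(EX, 6), round(EX2, 6))             # 3.0 and 9.5, so E(g(X)) != g(E(X))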


Table 5.2: Means and variances for four families of continuous distributions.

Distribution      E(X)        Var(X)
Uniform a, b      (a + b)/2   (b − a)²/12
Exponential λ     1/λ         1/λ²
Gamma α, β        αβ          αβ²
Normal µ, σ       µ           σ²

Example 5.12 (Indeterminate Mean) Let X be a continuous random variable whose range is x ≥ 1 and whose PDF is as follows:

f(x) = 1/x² when x ≥ 1 and 0 otherwise.

Since ∫_1^∞ x f(x) dx does not converge, the expectation of X is indeterminate.

5.3.2 Properties of Expectation<br />

The properties of expectation stated in Section 2.3.2 for discrete random variables are also true for continuous random variables.

5.3.3 Mean, Variance and Standard Deviation<br />

Let X be a random variable and let µ = E(X) be its mean. The variance of X, Var(X), is defined as follows:

Var(X) = E((X − µ)²).

The notation σ² = Var(X) is used to denote the variance. The standard deviation of X, σ = SD(X), is the positive square root of the variance.

These summary measures (mean, variance, standard deviation) have the same properties<br />

as those stated in the discrete case in Section 2.3.3.<br />

Table 5.2 gives general formulas for the mean and variance of the four families of continuous probability distributions discussed earlier.

5.3.4 Chebyshev’s Inequality<br />

Chebyshev’s inequality, stated in Section 2.3.4 for discrete random variables, remains true in the continuous case. The proof (with integration replacing summation) is also the same.



6 Infinite Sequences and Series<br />

An infinite sequence (or sequence) is a real-valued function whose domain is the positive<br />

integers. We write a sequence as a list of numbers in a definite order<br />

a1, a2, a3, ..., an, ... ,<br />

where a1 is the first term, a2 is the second term, etc. The notations<br />

{a1, a2, a3, ...} or {an} or {an} ∞ n=1<br />

are used to denote sequences.<br />

In addition, sequences may start with n = 0 (or with some other integer).<br />

If {an}∞n=1 is a sequence, then the formal sum

∑_{n=1}^∞ an = a1 + a2 + a3 + · · ·

is called an infinite series (or series). The numbers a1, a2, ... are the terms of the series.

6.1 Limits and Convergence of Infinite Sequences<br />

Let {an} be an infinite sequence. We say that {an} has limit L, and write<br />

lim_{n→∞} an = L   or   an → L as n → ∞

if for every positive real number ε there exists an integer M satisfying

|an − L| < ε when n > M.

If the sequence {an} has limit L, then the sequence converges (or is convergent). Otherwise,<br />

the sequence diverges (or is divergent).<br />

Example 6.1 (Convergent and Divergent Sequences) Since

lim_{n→∞} (n² + 1)/(5n² + 7) = lim_{n→∞} (1 + 1/n²)/(5 + 7/n²) = 1/5,

the sequence {(n² + 1)/(5n² + 7)} converges to 1/5. Similarly, the sequence {(−0.5)ⁿ} converges to 0.

By contrast, the sequence {ln(n)} diverges. (In fact, the values grow large without bound.)

6.1.1 Convergence Properties<br />

Assume that {an} converges to L and {bn} converges to M. Then<br />

1. Constant multiples: {can} converges to cL, where c is a constant.<br />



2. Sums and differences: {an ± bn} converges to L ± M.

3. Products: {anbn} converges to LM.

4. Quotients: If bn ≠ 0 for all n and M ≠ 0, then {an/bn} converges to L/M.

5. Convergence to zero: {an} converges to 0 if and only if {|an|} converges to 0.

6. Squeezing principle: If an ≤ bn ≤ cn for all n ≥ n0, where n0 is a constant, and lim_{n→∞} an = lim_{n→∞} cn = L, then {bn} converges to L as well.

Example 6.2 (Squeezing Principle) Consider the sequence {sin²(n)/n}. Since

0 ≤ sin²(n)/n ≤ 1/n   and   lim_{n→∞} 1/n = 0,

the sequence {sin²(n)/n} converges to 0 as well.

6.1.2 Geometric Sequences

Consider the sequence {rⁿ} where r is a fixed constant (called the ratio of the geometric sequence). Then

{rⁿ} converges to 0 when |r| < 1 and converges to 1 when r = 1.

In all other cases, the sequence diverges.

6.1.3 Bounded Sequences<br />

The sequence {an} is said to be bounded above if there is a number M satisfying

an ≤ M for all n ≥ 1.

Similarly, {an} is said to be bounded below if there is a number m satisfying

an ≥ m for all n ≥ 1.

The sequence {an} is said to be bounded if it is bounded above and below. Otherwise, {an} is said to be unbounded.

Theorem 6.3 (Bounded Sequence Theorem) Let {an} be a sequence.

1. If {an} converges, then it is bounded.

2. If {an} is unbounded, then it diverges.

Every convergent sequence is bounded, but not every bounded sequence is convergent. For example, the sequence {(−1)ⁿ} is bounded (since −1 ≤ an ≤ 1 for all n) but not convergent.



6.1.4 Monotone Sequences<br />

The sequence {an} is said to be increasing if

a1 < a2 < · · · < an < · · · .

Similarly, {an} is said to be decreasing if

a1 > a2 > · · · > an > · · · .

{an} is said to be monotone if it is either increasing or decreasing.<br />

Theorem 6.4 (Monotone Convergence Theorem) If {an} is a bounded and monotone<br />

sequence, then {an} converges.<br />

Example 6.5 (Convergent Monotone Sequences) Since 0 ≤ 1 − 1/n ≤ 1 for all n ≥ 1 and the terms are increasing, the sequence {1 − 1/n}∞n=1 converges. In fact, the limit is 1.

Similarly, since 0 ≤ sin(1/n) ≤ sin(1) for all n ≥ 1 and the terms are decreasing, the sequence {sin(1/n)}∞n=1 converges. In fact, the limit is 0.

Note. The proof of the Monotone Convergence Theorem uses the fact that sets of real<br />

numbers that are bounded above have a least upper bound (or supremum), and that sets of<br />

real numbers that are bounded below have a greatest lower bound (or infimum).<br />

6.2 Partial Sums and Convergence of Infinite Series<br />

Let ∑_{n=1}^∞ an be an infinite series. The sequence with terms

sn = ∑_{i=1}^n ai,   n = 1,2,3,...

is known as the sequence of partial sums of the series.

If the sequence of partial sums {sn} converges to L, then the series converges to L:

∑_{n=1}^∞ an = lim_{n→∞} sn = L

and L is said to be the sum of the series. If the sequence of partial sums {sn} diverges, then the series is said to diverge.

Example 6.6 (Convergent Series) The series ∑_{n=1}^∞ 1/(n(n+1)) converges to 1.

To see this, note that 1/(n(n+1)) = 1/n − 1/(n+1) for n ≥ 1. Thus,

sn = (1 − 1/2) + (1/2 − 1/3) + · · · + (1/n − 1/(n+1)) = 1 − 1/(n+1)

(all middle terms cancel) and sn → 1 as n → ∞.
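A few partial sums make the telescoping behavior concrete; this short Python sketch (not in the original notes) compares the direct sums with the closed form 1 − 1/(n+1):

    # Partial sums s_n of the series in Example 6.6; they should approach 1.
    for n in (10, 100, 1000):
        s_n = sum(1 / (i * (i + 1)) for i in range(1, n + 1))
        print(n, s_n, 1 - 1 / (n + 1))   # direct sum vs. the closed form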



6.2.1 Convergence Properties<br />

Assume that ∑_{n=1}^∞ an = L and ∑_{n=1}^∞ bn = M. Then

1. Constant multiples: ∑_{n=1}^∞ c an = c ∑_{n=1}^∞ an = cL, where c is a constant.

2. Sums and differences: ∑_{n=1}^∞ (an ± bn) = ∑_{n=1}^∞ an ± ∑_{n=1}^∞ bn = L ± M.

3. Convergence to zero: If the series ∑_{n=1}^∞ an converges, then {an} converges to 0.

Note that, if the sequence of terms of a series does not have limit zero, then the series must diverge. For example,

∑_{n=1}^∞ (n² + 1)/(5n² + 7) diverges since lim_{n→∞} (n² + 1)/(5n² + 7) = 1/5 ≠ 0.

(The sequence of partial sums is unbounded, since sn+1 ≈ sn + 1/5 when n is large.)

6.2.2 Geometric Series

Consider the series ∑_{n=1}^∞ a rⁿ⁻¹ where a and r are fixed constants. Then the n th partial sum of the geometric series is

sn = a + ar + ar² + · · · + arⁿ⁻¹ = a(1 − rⁿ)/(1 − r) when r ≠ 1

(and sn = an when r = 1). Further, the series converges when |r| < 1 and its sum is

∑_{n=1}^∞ a rⁿ⁻¹ = a/(1 − r)   (|r| < 1).

If |r| ≥ 1, the series diverges.

Example 6.7 (a = −10/27, r = −2/3) The series ∑_{n=1}^∞ 5(−2)ⁿ/3ⁿ⁺² converges to −2/9 since

∑_{n=1}^∞ 5(−2)ⁿ/3ⁿ⁺² = −10/27 + 20/81 − 40/243 + 80/729 ∓ · · ·
                      = −(10/27)(1 − 2/3 + 4/9 − 8/27 ± · · ·)
                      = (−10/27)/(1 − (−2/3)) = −2/9.
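Partial sums confirm Example 6.7 numerically; the sketch below (an addition, plain Python) watches the sums approach −2/9:

    target = -2 / 9
    for N in (5, 10, 20, 40):
        s_N = sum(5 * (-2)**n / 3**(n + 2) for n in range(1, N + 1))
        print(N, s_N, abs(s_N - target))   # the error shrinks geometrically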


6.3 Convergence Tests for Positive Series

Let ∑_{n=1}^∞ an and ∑_{n=1}^∞ bn be series with positive terms. Then

1. Comparison test: Assume that an ≤ bn for all n ≥ n0. Then

(a) If ∑_{n=1}^∞ bn converges, then ∑_{n=1}^∞ an converges.

(b) If ∑_{n=1}^∞ an diverges, then ∑_{n=1}^∞ bn diverges.

2. Integral test: If f(x) is a decreasing function with f(n) = an, then

∑_{n=1}^∞ an and ∫_1^∞ f(x) dx both converge or both diverge.

3. Ratio test: Assume that (an+1/an) → L as n → ∞.

(a) If 0 ≤ L < 1, then ∑_{n=1}^∞ an converges.

(b) If L > 1, then ∑_{n=1}^∞ an diverges.

4. Root test: Assume that ⁿ√an → L as n → ∞.

(a) If 0 ≤ L < 1, then ∑_{n=1}^∞ an converges.

(b) If L > 1, then ∑_{n=1}^∞ an diverges.

5. Limit comparison test: If {an/bn} converges to L > 0, then

∑_{n=1}^∞ an and ∑_{n=1}^∞ bn both converge or both diverge.

The comparison test follows from the Monotone Convergence Theorem since the sequences of partial sums are monotone increasing: if the b-series converges, then its sum is an upper bound for the sequence of partial sums of the a-series; if the a-series diverges, then the sequence of partial sums of the b-series must be unbounded.

The ratio and root tests apply to series that are “sufficiently like” geometric series. In each<br />

case, if L = 1, then no conclusion can be drawn. The limit comparison test applies to series<br />

that are approximately constant multiples of one another.<br />

Example 6.8 (p Series) The integral test can be used to demonstrate that

∑_{n=1}^∞ 1/nᵖ converges if and only if p > 1.

To see this, let f(x) = 1/xᵖ = x⁻ᵖ. If p > 1, then

∫_1^∞ f(x) dx = [x^(−p+1)/(−p + 1)]_{x=1}^{x→∞} = lim_{x→∞} x^(−p+1)/(−p + 1) − 1/(−p + 1) = 0 − 1/(−p + 1) = 1/(p − 1).



(The limit is 0 since x has a negative exponent.) Since the improper integral converges, so<br />

does the series.<br />

If p < 1, then the improper integral diverges since x has a positive exponent in the limit<br />

above. If p = 1, then the antiderivative of f(x) is ln(x) and again the improper integral<br />

diverges.<br />

Example 6.9 (Convergent Exponential Series) An important application of the ratio test is to the series

∑_{n=0}^∞ xⁿ/n!   where x is a positive real number.

Since the ratio

an+1/an = (x^(n+1)/(n + 1)!)/(xⁿ/n!) = x/(n + 1) → 0 as n → ∞,

the series above converges. In addition, since the series converges, the sequence of terms must approach 0. That is, xⁿ/n! → 0 as n → ∞.

Example 6.10 (Comparison Test) The comparison test can be used to demonstrate that

∑_{n=0}^∞ 10 sin²(n)/n! converges   and that   ∑_{n=1}^∞ 4/√(5n − 1) diverges.

In the first case, let an = 10 sin²(n)/n! and bn = 10/n!. The ratio test can be used to demonstrate that the b-series converges (as in the last example). Thus, the a-series converges.

In the second case, let an = 4/√(5n) and bn = 4/√(5n − 1). The integral test can be used to demonstrate that the a-series diverges (as in the example above). Thus, the b-series diverges.

Example 6.11 (Limit Comparison Test) The limit comparison test can be used to demonstrate that

∑_{n=1}^∞ (4n + 3)/(n³ + 7n − 4) converges   and that   ∑_{n=1}^∞ (n + 5)/(3n² + 2n + 2) diverges.

In the first case, let an = (4n + 3)/(n³ + 7n − 4) and bn = 1/n². Since

lim_{n→∞} an/bn = lim_{n→∞} (4n³ + 3n²)/(n³ + 7n − 4) = lim_{n→∞} (4 + 3/n)/(1 + 7/n² − 4/n³) = 4

and the b-series converges (by a previous example), the a-series converges.

In the second case, let an = (n + 5)/(3n² + 2n + 2) and bn = 1/n. Since

lim_{n→∞} an/bn = lim_{n→∞} (n² + 5n)/(3n² + 2n + 2) = lim_{n→∞} (1 + 5/n)/(3 + 2/n + 2/n²) = 1/3

and the b-series diverges (by the same example), the a-series diverges.



Example 6.12 (Root Test) The root test can be used to demonstrate that

∑_{n=1}^∞ ((n + 5)/(2n))ⁿ converges.

Specifically, since

ⁿ√an = (n + 5)/(2n) = (1 + 5/n)/2 → 1/2 as n → ∞

and the limit is less than 1, the series converges. (For large n, the terms of the series are like those of a convergent geometric series with r = 1/2.)

Remarks. The convergence tests above can be used to determine if a positive series converges, but tell us nothing about the sum of the series. In most cases, finding an exact sum is challenging. In Section 6.6, we demonstrate that

∑_{n=0}^∞ xⁿ/n! = eˣ for all x.

Harmonic series. The divergent series ∑_{n=1}^∞ 1/n is called the harmonic series. Although the terms of the harmonic series approach 0, the series diverges by the integral test.

6.4 Alternating Series and Error Analysis<br />

An alternating series is a series whose terms alternate in sign. The usual notation is

∑_{n=1}^∞ (−1)ⁿ⁺¹ an = a1 − a2 + a3 − a4 ± · · · ,

where an > 0 for each n, for series starting with positive terms. For example,

∑_{n=1}^∞ (−1)ⁿ⁺¹/2ⁿ = 1/2 − 1/4 + 1/8 − 1/16 ± · · ·

is an alternating series starting with a positive term.

A simple convergence test for alternating series is based on the following theorem.

Theorem 6.13 (Alternating Series Theorem) Assume that an+1 ≤ an for each n and that an → 0 as n → ∞. Then

1. ∑_{n=1}^∞ (−1)ⁿ⁺¹ an and ∑_{n=1}^∞ (−1)ⁿ an are convergent series.

2. If sn is the n th partial sum and L is the sum of one of these series, then

|sn − L| ≤ an+1 for each n.



The partial sums of an alternating series are alternately larger and smaller than the limit, L. When the first term is positive, the even partial sums form an increasing sequence and the odd partial sums form a decreasing sequence:

s2 ≤ s4 ≤ s6 ≤ s8 ≤ s10 ≤ · · · ≤ L ≤ · · · ≤ s9 ≤ s7 ≤ s5 ≤ s3 ≤ s1.<br />

When the first term is negative, the even partial sums are decreasing and the odd partial<br />

sums are increasing. Each subsequence of partial sums must converge, since each is bounded<br />

and monotone. Further, since<br />

|sn − sn−1| = an → 0 as n → ∞,<br />

the limits of the subsequences must be equal.<br />

Example 6.14 (Alternating Harmonic Series) The series

∑_{n=1}^∞ (−1)ⁿ⁺¹/n = 1 − 1/2 + 1/3 − 1/4 ± · · ·

converges since the sequence {1/n} is decreasing with limit 0.

To estimate the sum of the alternating harmonic series to within 2 decimal places, we need to use 200 or more terms since

|sn − L| ≤ an+1 = 1/(n + 1) < 0.005 =⇒ n > 199.

Our estimate is s200 ≈ 0.690653. (In fact, L = ln(2). See Section 6.6.)
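The estimate in Example 6.14 is easy to reproduce; the sketch below (added, plain Python) computes s200 and checks that it is within a201 = 1/201 of ln(2):

    import math

    s200 = sum((-1)**(n + 1) / n for n in range(1, 201))
    print(round(s200, 6), round(math.log(2), 6), abs(s200 - math.log(2)) < 1 / 201)
    # prints approximately 0.690653, 0.693147, True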

Example 6.15 (Alternating p Series) More generally, the series

∑_{n=1}^∞ (−1)ⁿ⁺¹/nᵖ = 1 − 1/2ᵖ + 1/3ᵖ − 1/4ᵖ ± · · ·

converges for all p > 0 since {1/nᵖ} is decreasing with limit 0.

To estimate the sum when p = 3 to within 2 decimal places, for example, we need to use 5 or more terms since

|sn − L| ≤ an+1 = 1/(n + 1)³ < 0.005 =⇒ n > 4.84804.

Our estimate is s5 ≈ 0.904412.

6.5 Absolute and Conditional Convergence of Series<br />

The following theorem is used to study convergence properties of general series.<br />

Theorem 6.16 (Absolute Convergence Theorem) If the series ∑_{n=1}^∞ |an| converges, then the series ∑_{n=1}^∞ an converges as well.

Thus, tests for positive series can be used to determine whether the series with terms |an| converges. If that series converges, then we can conclude that the series with terms an converges as well. Note, however, that if the series with terms |an| diverges, then we cannot draw any conclusion about the series with terms an.

Let ∑_{n=1}^∞ an be a convergent series. If the series ∑_{n=1}^∞ |an| converges, then ∑_{n=1}^∞ an is said to be an absolutely convergent series. Otherwise, ∑_{n=1}^∞ an is said to be a conditionally convergent series.

Example 6.17 (Alternating p Series, continued) The series

∑_{n=1}^∞ (−1)ⁿ⁺¹/nᵖ = 1 − 1/2ᵖ + 1/3ᵖ − 1/4ᵖ ± · · ·

is absolutely convergent when p > 1 (by the integral test) and conditionally convergent when 0 < p ≤ 1 (by the integral and alternating series tests).

Example 6.18 (Exponential Series, continued) The series

∑_{n=0}^∞ xⁿ/n! = 1 + x + x²/2 + x³/6 + x⁴/24 + · · ·

converges for all real numbers x. Specifically, if x = 0, the sum is 1. Otherwise, the series converges absolutely by the ratio test (see page 108).

Example 6.19 (Power Series) The series

∑_{n=1}^∞ xⁿ/n = x + x²/2 + x³/3 + x⁴/4 + · · ·

converges absolutely when |x| < 1, converges conditionally when x = −1, and diverges otherwise. To see this,

1. If x = 0, then the sum is 0. If 0 < |x| < 1, then

|an+1|/|an| = (n/(n + 1)) |x| → |x| < 1 as n → ∞,

implying that the series converges absolutely by the ratio test. Thus, the series converges absolutely when |x| < 1.

2. When x = 1, the series reduces to the harmonic series, which diverges by the integral test. When x = −1, the series reduces to the (negative) alternating harmonic series, which converges by the alternating series test. Thus, the series converges conditionally when x = −1.

3. When |x| > 1, the terms do not converge to 0, and the series diverges.

6.5.1 Product of Series<br />

Let ∑_{n=0}^∞ an and ∑_{n=0}^∞ bn be series with index starting at 0. The series ∑_{n=0}^∞ cn with terms

cn = ∑_{i=0}^n ai bn−i = a0bn + a1bn−1 + · · · + an−1b1 + anb0   for n = 0,1,2,...

is the product (or the convolution) of ∑_{n=0}^∞ an and ∑_{n=0}^∞ bn.

Theorem 6.20 (Convolution Theorem) If ∑_{n=0}^∞ an = L, ∑_{n=0}^∞ bn = M and both series are absolutely convergent, then ∑_{n=0}^∞ cn is an absolutely convergent series with sum LM.

Example 6.21 (Square of Geometric Series) The geometric series

∑_{n=0}^∞ xⁿ = 1 + x + x² + x³ + · · ·

converges to 1/(1 − x) when |x| < 1 and diverges otherwise.

The square of this series is

(∑_{n=0}^∞ xⁿ)² = ∑_{n=0}^∞ (n + 1)xⁿ = 1 + 2x + 3x² + 4x³ + · · · .

By the convolution theorem, the square series converges to 1/(1 − x)² when |x| < 1.

Remarks. The last example demonstrates that we can “represent” the functions f(x) = 1/(1 − x) and g(x) = 1/(1 − x)² as sums of convergent series for all x satisfying |x| < 1. Series representations of functions will be explored further in the next two sections.

6.6 Taylor Polynomials and Taylor Series<br />

We have already seen how to approximate a function using a tangent line. This section considers using polynomials of higher orders to approximate functions with higher precision, and using the limiting forms of these polynomials to give exact representations of the functions on intervals.

6.6.1 Higher Order Derivatives<br />

The n th derivative of the function f is obtained from the function by differentiating n times: f′′′ is the derivative of f′′, f′′′′ is the derivative of f′′′, and so forth.

Starting with y = f(x), notations for the n th derivative are

f⁽ⁿ⁾(x) = dⁿy/dxⁿ = dⁿ/dxⁿ (f(x)) = dⁿf/dxⁿ = Dⁿf(x) = y⁽ⁿ⁾.

For convenience, we let f⁽⁰⁾(x) = f(x).



Figure 6.1: Plots of y = cos(x) (left) and y = 1/(1 − x) (right) with Taylor approximations.

6.6.2 Taylor Polynomials<br />

The Taylor polynomial of order n centered at a is the polynomial

pn(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)² + · · · + (f⁽ⁿ⁾(a)/n!)(x − a)ⁿ.

Note that pn(a) = f(a) and that pn⁽ᵏ⁾(a) = f⁽ᵏ⁾(a) for k = 1,...,n.

Theorem 6.22 (Taylor’s Theorem, Part 1) If the n th derivative of f is continuous on the closed interval [a,b] and the (n + 1) st derivative exists on the open interval (a,b), then for each x in [a,b]

f(x) = pn(x) + (f⁽ⁿ⁺¹⁾(c)/(n + 1)!)(x − a)ⁿ⁺¹ for some c satisfying a < c < x.

In words, the error in using the Taylor polynomial pn(x) instead of the function f(x) for values of x near a is exactly the remainder

Rn(x) = (f⁽ⁿ⁺¹⁾(c)/(n + 1)!)(x − a)ⁿ⁺¹ for some number c between a and x.

Thus, Taylor’s theorem generalizes results on errors in linear approximations (Section 4.6). Taylor’s theorem can be proven using the Mean Value Theorem.

Note that if |f⁽ⁿ⁺¹⁾(x)| is bounded in a neighborhood of a, then |Rn(x)| is also bounded. Specifically,

|f⁽ⁿ⁺¹⁾(x)| ≤ M for |x − a| < r   ⇒   |Rn(x)| ≤ (M/(n + 1)!)|x − a|ⁿ⁺¹ for |x − a| < r.

Example 6.23 (Cosine Function) Consider f(x) = cos(x) and let a = 0. The left part of<br />

Figure 6.1 shows graphs of y = f(x), y = p2(x), y = p8(x), and y = p14(x). As n increases,<br />

the polynomial approximates the function well over wider intervals centered at 0.<br />



Further, since |f⁽ⁿ⁾(x)| ≤ 1 (each derivative is either ±sin(x) or ±cos(x)) and a = 0,

|Rn(x)| ≤ |x|ⁿ⁺¹/(n + 1)! for all x.
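To see the error bound in action, the sketch below (an added Python illustration, not from the notes) evaluates the Taylor polynomials p2, p8, and p14 of cos(x) at x = 2 and compares the actual error with the bound |x|^(n+1)/(n+1)!:

    import math

    def taylor_cos(x, n):
        """Taylor polynomial of cos(x) centered at 0, through degree n (n even)."""
        return sum((-1)**k * x**(2 * k) / math.factorial(2 * k) for k in range(n // 2 + 1))

    x = 2.0
    for n in (2, 8, 14):
        pn = taylor_cos(x, n)
        bound = abs(x)**(n + 1) / math.factorial(n + 1)   # |R_n(x)| <= |x|^(n+1)/(n+1)!
        print(n, round(pn, 6), round(abs(pn - math.cos(x)), 6), round(bound, 6))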

Example 6.24 (Geometric Function) Consider f(x) = 1/(1 − x) and let a = 0. The<br />

right part of Figure 6.1 shows graphs of y = f(x), y = p1(x), y = p3(x), and y = p5(x).<br />

As n increases, the polynomial approximates the function well over wider subintervals of<br />

(−1,1).<br />

Since f (n) (x) = n!/(1 − x) n+1 and f (n) (0) = n!, pn(x) = 1 + x + x 2 + · · · + x n .<br />

Further, since

(1 − xⁿ⁺¹)/(1 − x) = 1 + x + x² + · · · + xⁿ   =⇒   1/(1 − x) = 1 + x + x² + · · · + xⁿ + xⁿ⁺¹/(1 − x)

(using properties of geometric series), the remainder is Rn(x) = xⁿ⁺¹/(1 − x).

6.6.3 Taylor Series<br />

The Taylor series representation of f(x) centered at a is

f(x) = ∑_{n=0}^∞ (f⁽ⁿ⁾(a)/n!)(x − a)ⁿ = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)² + · · ·

when the series converges. Note that the Taylor series centered at 0 is often called the Maclaurin series.

Theorem 6.25 (Taylor’s Theorem, Part 2) If f(x) = pn(x) + Rn(x) and

lim_{n→∞} Rn(x) = 0 for all x satisfying |x − a| < r,

then f(x) is equal to the sum of its Taylor series on the interval |x − a| < r.

Example 6.26 (Exponential Function) Consider f(x) = eˣ and let a = 0.

Since f⁽ⁿ⁾(x) = eˣ and f⁽ⁿ⁾(0) = 1 for all n,

eˣ = 1 + x + x²/2 + · · · + xⁿ/n! + Rn(x)   where Rn(x) = (e^c/(n + 1)!) xⁿ⁺¹

for some c between 0 and x. Since

|Rn(x)| ≤ C |x|ⁿ⁺¹/(n + 1)!   where C = max(eˣ, 1)

and |x|ⁿ/n! → 0 as n → ∞ for all x (page 108), Rn(x) → 0 as n → ∞ as well. Thus, eˣ is equal to the sum of its Taylor series for all real numbers.



Table 6.1: Some Taylor series expansions centered at 0.

1/(1 − x)   = 1 + x + x² + x³ + · · ·                for −1 < x < 1
1/(1 − x)²  = 1 + 2x + 3x² + 4x³ + · · ·             for −1 < x < 1
ln(1 + x)   = x − x²/2 + x³/3 − x⁴/4 + · · ·         for −1 < x ≤ 1
eˣ          = 1 + x + x²/2 + x³/3! + x⁴/4! + · · ·   for all x
cos(x)      = 1 − x²/2 + x⁴/4! − x⁶/6! + · · ·       for all x
sin(x)      = x − x³/3! + x⁵/5! − x⁷/7! + · · ·      for all x

Example 6.27 (Exponential Function, continued) Using the result above, a series representation of e⁻¹ is

e⁻¹ = ∑_{n=0}^∞ (−1)ⁿ/n! = 1 − 1 + 1/2 − 1/6 + 1/24 ∓ · · · .

To estimate e⁻¹ to within 2 decimal places of accuracy, for example, we need to use 6 terms since

|Rn(x)| ≤ 1/(n + 1)! < 0.005 =⇒ n > 5.

Our estimate is s6 = 53/144 ≈ 0.368056. (Note that this result could also be obtained using the methods of Section 6.4.)
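Successive partial sums of the series for e⁻¹ can be listed directly; the sketch below (added, plain Python) shows how quickly they settle around e⁻¹ ≈ 0.367879, with the error shrinking factorially:

    import math

    s = 0.0
    for n in range(8):
        s += (-1)**n / math.factorial(n)
        print(n, round(s, 6), round(abs(s - math.exp(-1)), 6))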

6.6.4 Radius of Convergence<br />

The largest value of r satisfying Taylor’s theorem is called the radius of convergence of the series. The example above demonstrates that the radius of convergence for the exponential series centered at 0 is ∞.

Table 6.1 lists several commonly used series expansions. In the first three cases, the radius of convergence is 1. In the last three cases, the radius of convergence is ∞.

6.7 Power Series<br />

A power series centered at a is a series of the form

∑_{n=0}^∞ cn(x − a)ⁿ = c0 + c1(x − a) + c2(x − a)² + c3(x − a)³ + · · · ,

where {cn} is a sequence of constants (the coefficients of the series). The series converges to c0 when x = a. Interest focuses on finding the values of x for which the series converges and finding exact forms for the sums.

Theorem 6.28 (Convergence Theorem) Let ∑_{n=0}^∞ cn(x − a)ⁿ be a power series centered at a. Then exactly one of the following holds:

1. The series converges for x = a only.

2. The series converges for all x.

3. There is a number r > 0 such that the series converges when |x − a| < r and diverges when |x − a| > r.

The number r given in the third part of the theorem is called the radius of convergence of<br />

the power series. If r is the radius of convergence, then the theorem tells us nothing about<br />

convergence at x = a ± r. Thus, the interval of convergence could be any one of<br />

[a − r,a + r], (a − r,a + r], [a − r,a + r), (a − r,a + r).<br />

Example 6.29 (Logarithm Function) Consider the Taylor series for f(x) = ln(x) centered at 1. Since

ln(1) = 0,   f⁽ⁿ⁾(x) = (−1)ⁿ⁻¹(n − 1)!/xⁿ   and   f⁽ⁿ⁾(1) = (−1)ⁿ⁻¹(n − 1)!,

the form of the Taylor series is

∑_{n=1}^∞ (−1)ⁿ⁻¹(x − 1)ⁿ/n = (x − 1) − (x − 1)²/2 + (x − 1)³/3 − (x − 1)⁴/4 ± · · · .

The series converges absolutely when |x − 1| < 1 by the ratio test. If |x − 1| > 1, then the terms do not converge to 0, and the series diverges. If x = 0, then the series diverges since it is the negative of the harmonic series. If x = 2, then the series converges since it is the alternating harmonic series.

Thus, the radius of convergence is r = 1 and the interval of convergence is (0,2].

6.7.1 Power Series and Taylor Series<br />

Taylor series are examples of power series. The following theorem states that convergent<br />

power series must be Taylor series within their radius of convergence.<br />

Theorem 6.30 (Representation Theorem) Assume that the radius of convergence of the power series ∑_{n=0}^∞ cn(x − a)ⁿ is r > 0. Let

f(x) = ∑_{n=0}^∞ cn(x − a)ⁿ   for a − r < x < a + r.

Then f has derivatives of all orders on (a − r, a + r) and

f⁽ⁿ⁾(a) = cn n! for all n.

Thus, the power series is the Taylor series of f centered at a.



6.7.2 Differentiation and Integration of Power Series

Power series can be differentiated and integrated term-by-term within their radius of convergence. Specifically,

Theorem 6.31 (Term-by-Term Analysis) Assume that the radius of convergence of the power series ∑_{n=0}^∞ cn(x − a)ⁿ is r > 0. Let

f(x) = ∑_{n=0}^∞ cn(x − a)ⁿ   for a − r < x < a + r.

Then f is differentiable (and continuous) on the interval (a − r, a + r). Further,

1. the derivative of f has the following form:

f′(x) = ∑_{n=1}^∞ n cn(x − a)ⁿ⁻¹ = c1 + 2c2(x − a) + 3c3(x − a)² + · · · .

2. the family of antiderivatives of f has the following form:

∫ f(x) dx = C + ∑_{n=0}^∞ (cn/(n + 1))(x − a)ⁿ⁺¹ = C + c0(x − a) + (c1/2)(x − a)² + · · · .

The radius of convergence of each series is r.

Example 6.32 (Geometric Function) Let

f(x) = 1/(1 − x) = ∑_{n=0}^∞ xⁿ = 1 + x + x² + x³ + · · ·   for −1 < x < 1.

The derivative of f can be computed two ways:

f′(x) = d/dx (1/(1 − x)) = 1/(1 − x)²

and

f′(x) = d/dx (∑_{n=0}^∞ xⁿ) = ∑_{n=1}^∞ n xⁿ⁻¹ = 1 + 2x + 3x² + 4x³ + · · · .

Thus,

1/(1 − x)² = ∑_{n=1}^∞ n xⁿ⁻¹ = 1 + 2x + 3x² + 4x³ + · · ·   for −1 < x < 1,

confirming results obtained on page 112.

Example 6.33 (Standard Normal Distribution) Let Z be the standard normal random variable. Using the Taylor series expansion of eˣ centered at 0, the PDF of Z can be written as follows:

φ(z) = (1/√(2π)) e^{−z²/2} = (1/√(2π)) ∑_{n=0}^∞ (−z²/2)ⁿ/n! = (1/√(2π)) ∑_{n=0}^∞ (−1)ⁿ z²ⁿ/(2ⁿ n!)   for all z.

φ(z) does not have an exact antiderivative. However, by using term-by-term analysis, we can write the indefinite integral as a family of power series. Specifically,

∫ φ(z) dz = C + (1/√(2π)) ∑_{n=0}^∞ (−1)ⁿ z²ⁿ⁺¹/((2n + 1) 2ⁿ n!).

This power series representation can be used to compute probabilities for standard normal random variables to within any degree of accuracy.
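The sketch below (an added Python illustration, not part of the notes) turns the power series into a routine for Φ(z), fixing the constant of integration with Φ(0) = 1/2; the values can be compared with the normal CDF table earlier in this chapter.

    import math

    def Phi_series(z, nterms=30):
        """Standard normal CDF from the term-by-term antiderivative of phi(z)."""
        total = sum((-1)**n * z**(2 * n + 1) / ((2 * n + 1) * 2**n * math.factorial(n))
                    for n in range(nterms))
        return 0.5 + total / math.sqrt(2 * math.pi)

    for z in (0.5, 1.0, 2.0):
        print(z, round(Phi_series(z), 4))   # about 0.6915, 0.8413, 0.9772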

6.8 Discrete Random Variables, revisited<br />

There are many important applications of sequences and series in probability and statistics.<br />

This section focuses on discrete random variables whose ranges are countably infinite.<br />

6.8.1 Countably Infinite Sample Spaces<br />

A set is countably infinite when it corresponds to an infinite sequence. When a sample space<br />

S is countably infinite, we need to include the possibility of infinite unions, and of infinite<br />

sums of probabilities. In particular, the third Kolmogorov axiom is extended as follows:<br />

3 ′′ . If A1,A2,... are pairwise disjoint, then<br />

P (∪ ∞ i=1Ai) = P(A1) + P(A2) + P(A3) + · · ·<br />

where the righthand side is understood to be the sum of a convergent infinite series.<br />

6.8.2 Infinite Discrete Random Variables<br />

Recall that a discrete random variable is one whose range, R, is either finite or countably<br />

infinite. If the range of X is countably infinite, f(x) is the PDF of X, and g(X) is a<br />

real-valued function, then the expected value (or mean or expectation) of g(X) is<br />

E(g(X)) = ∑_{x∈R} g(x)f(x) when the series converges absolutely.

That is, when ∑_{x∈R} |g(x)|f(x) converges. If the series with absolute values diverges, we say that the expectation is indeterminate.

Example 6.34 (Coin Tossing) You toss a fair coin until you get a tail and record the sequence of h's and t's (for heads and tails, respectively). The sample space is

S = {t, ht, hht, hhht, hhhht, hhhhht, ...}.

Let X be the total number of tosses. Then X is a discrete random variable with PDF

f(x) = P(X = x) = 1/2ˣ for x = 1,2,3,... and 0 otherwise.



Figure 6.2: Probability density function of the uniform random variable on the open interval (0,1) (left plot) and of the floor of its reciprocal (right plot). The range of the floor of the reciprocal is the positive integers. Vertical lines in the left plot are drawn at u = 1/2, 1/3, 1/4, ....

The probability that you get a tail in 3 or fewer tosses, for example, is

P(X ≤ 3) = P({t, ht, hht}) = f(1) + f(2) + f(3) = 1/2 + 1/4 + 1/8 = 7/8.

To find the expected value of X, we need to determine the sum of the following series:

E(X) = ∑_{x=1}^∞ x(1/2ˣ) = 1(1/2) + 2(1/4) + 3(1/8) + 4(1/16) + · · · .

By using a simple trick, we can reduce the problem to finding the sum of a convergent geometric series. Specifically,

E(X) − (1/2)E(X) = ∑_{x=1}^∞ x(1/2ˣ) − ∑_{x=1}^∞ x(1/2^(x+1)) = ∑_{x=1}^∞ (x(1/2ˣ) − (x − 1)(1/2ˣ)) = ∑_{x=1}^∞ 1/2ˣ = 1.

Since (1/2)E(X) = 1, we know that E(X) = 2. (The expected number of tosses is 2.) This trick is valid since the series is absolutely convergent.
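Both the series and the answer E(X) = 2 are easy to check; the sketch below (added, assuming NumPy for the simulation half) sums the series numerically and also simulates the number of tosses until the first tail:

    import numpy as np

    partial = sum(x / 2**x for x in range(1, 40))      # partial sum of E(X) = sum x/2^x
    rng = np.random.default_rng(1)
    tosses = rng.geometric(p=0.5, size=100_000)        # tosses needed for the first tail
    print(round(partial, 6), round(tosses.mean(), 3))  # both close to 2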

Example 6.35 (Indeterminate Mean) Let U be a continuous uniform random variable on the open interval (0,1) and let X be the greatest integer less than or equal to the reciprocal of U (the “floor” of the reciprocal of U), X = ⌊1/U⌋. For x in the positive integers,

f(x) = P(X = x) = P(1/(x + 1) < U ≤ 1/x) = 1/(x(x + 1));

f(x) = 0 otherwise. The expectation of X is

E(X) = ∑_{x=1}^∞ x · 1/(x(x + 1)) = ∑_{x=1}^∞ 1/(x + 1).


Figure 6.3: Probability histograms for geometric (left) and Poisson (right) examples.

The integral test can be used to demonstrate that the series on the right diverges. Thus,<br />

the expectation of X is indeterminate.<br />

The left part of Figure 6.2 is a plot of the PDF of U with the vertical lines<br />

u = 1/2, 1/3, 1/4, 1/5, ...<br />

superimposed. The right part is a probability histogram of the distribution of X = ⌊1/U⌋.<br />

Note that the area between u = 1/2 and u = 1 on the left is P(X = 1), the area between<br />

u = 1/3 and u = 1/2 is P(X = 2), etc.<br />

6.8.3 Example: Geometric Distribution<br />

Let X be the number of failures before the first success in a sequence of independent Bernoulli experiments with success probability p. Then X is said to be a geometric random variable, or to have a geometric distribution, with parameter p. The PDF of X is as follows:

f(x) = (1 − p)ˣ p when x = 0,1,2,..., and 0 otherwise.

For each x, f(x) is the probability of the sequence of x failures (F) followed by a success (S). For example, f(5) = P({FFFFFS}) = (1 − p)⁵ p.

The probabilities f(0), f(1), ... form a geometric sequence whose sum is 1. Further,

E(X) = (1 − p)/p   and   Var(X) = E((X − E(X))²) = (1 − p)/p².

Example 6.36 (p = 1/4) Let X be a geometric random variable with parameter 1/4. Then

E(X) = 3 and Var(X) = 12.

The left part of Figure 6.3 is a probability histogram for X, for values of X at most 20. Note that

P(X > 20) = ∑_{x=21}^∞ (3/4)ˣ(1/4) = (3/4)²¹ ≈ 0.00237841.



Alternative definition. An alternative definition of the geometric random variable is as<br />

follows: X is the trial number of the first success in a sequence of independent Bernoulli<br />

experiments with success probability p. In this case,<br />

f(x) = (1 − p) x−1 p when x = 1,2,3,..., and 0 otherwise.<br />

In particular, the range is now the positive integers.<br />

6.8.4 Example: Poisson Distribution<br />

Let λ be a positive real number. The random variable X is said to be a Poisson random variable, or to have a Poisson distribution, with parameter λ if its PDF is as follows:

f(x) = e^{−λ} λˣ/x! when x = 0,1,2,..., and 0 otherwise.

The series expansion for eˣ centered at 0 (see Table 6.1) can be used to demonstrate that the sequence f(0), f(1), ... has sum 1. Further,

E(X) = λ   and   Var(X) = E((X − E(X))²) = λ.

Example 6.37 (λ = 3) Let X be the Poisson random variable with mean 3 (and with variance 3). The right part of Figure 6.3 is a probability histogram for X, for values of X at most 10. Note that

P(X > 10) = ∑_{x=11}^∞ e⁻³ 3ˣ/x! ≈ 0.000292337.

Rare events. The idea for the Poisson distribution comes from a limit theorem proven by the mathematician S. Poisson in the 1830’s.

Theorem 6.38 (Poisson Limit Theorem) Let λ be a positive real number, n a positive integer, and p = λ/n. Then

lim_{n→∞} (n choose x) pˣ(1 − p)ⁿ⁻ˣ = e^{−λ} λˣ/x!   when x = 0,1,2,....

This theorem can be used to estimate binomial probabilities when the number of trials is large and the probability of success is small. That is, when success is a rare event.

For example, if n = 10000, p = 2/5000, and x = 3, then the probability of 3 successes in 10000 trials is

(10000 choose 3) (2/5000)³ (4998/5000)^9997 ≈ 0.195386

and Poisson’s approximation is

e⁻⁴ 4³/3! ≈ 0.195367,
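A two-line check of these numbers (an addition, assuming SciPy) uses the binomial and Poisson probability mass functions directly:

    from scipy.stats import binom, poisson

    n, p, x = 10_000, 2 / 5_000, 3
    print(binom.pmf(x, n, p))       # about 0.195386
    print(poisson.pmf(x, n * p))    # about 0.195367, with lambda = np = 4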

Poisson process. Events occurring in time are said to be generated by an (approximate)<br />

Poisson process with rate λ when the following conditions are satisfied:<br />

1. The numbers of events occurring in disjoint subintervals of time are independent of one another.

2. The probability of one event occurring in a sufficiently small subinterval of<br />

time is proportional to the size of the subinterval. If h is the size, then the<br />

probability is λh.<br />

3. The probability of two or more events occurring in a sufficiently small<br />

subinterval of time is virtually zero.<br />

In this definition, λ represents the average number of events per unit time. If events follow<br />

an (approximate) Poisson process and X is the number of events observed in one unit of<br />

time, then X has a Poisson distribution with parameter λ.<br />

Typical applications of Poisson distributions include the numbers of cars passing an intersection<br />

in a fixed period of time during a workday, or the number of phone calls received in<br />

a fixed period of time during a workday.<br />

The idea of a Poisson process can be generalized to include events occurring over regions<br />

of space instead of intervals of time. (“Subregions” take the place of “subintervals” in the<br />

conditions above. In this case, λ represents the average number of events per unit area or<br />

per unit volume.)<br />

Remark. The definition of Poisson process allows you to think of the PDF of X as<br />

the limit of a sequence of binomial PDFs: The observation interval is subdivided into n<br />

nonoverlapping subintervals; the i th Bernoulli trial results in success if an event occurs in<br />

the i th subinterval, and failure otherwise. If n is large enough, then the probability that 2<br />

or more events occur in one subinterval can be assumed to be zero.<br />

6.9 Continuous Random Variables, revisited<br />

The exponential and gamma families of distributions are related to Poisson processes.<br />



6.9.1 Exponential Distribution<br />

Recall that the continuous random variable X has an exponential distribution with parameter λ when its PDF has the form

f(x) = λe^{−λx} when x ≥ 0 and 0 otherwise,

where λ is a positive constant. (See Section 5.2.2.)

If events occurring over time follow an approximate Poisson process with rate λ, where λ<br />

is the average number of events per unit time, then the time between successive events has<br />

an exponential distribution with parameter λ. To see this,<br />

1. If you observe the process for t units of time and let Y equal the number of observed events, then Y has a Poisson distribution with parameter λt. The PDF of Y is as follows:

P(Y = y) = e^{−λt}(λt)^y / y! when y = 0,1,2,... and 0 otherwise.

2. An event occurs, the clock is reset to time 0, and X is the time until the<br />

next event occurs. Then X is a continuous random variable whose range<br />

is x > 0. Further,<br />

P(X > t) = P(0 events in the interval [0,t]) = P(Y = 0) = e −λt<br />

and P(X ≤ t) = 1 − e −λt .<br />

3. Since f(t) = d/dt P(X ≤ t) = d/dt (1 − e^{−λt}) = λe^{−λt} when t > 0 (and 0 otherwise) is the same as the PDF of an exponential random variable with parameter λ, X has an exponential distribution with parameter λ.

6.9.2 Gamma Distribution<br />

Recall that the continuous random variable X has a gamma distribution with shape parameter α and scale parameter β when its PDF has the following form

f(x) = (1/(β^α Γ(α))) x^(α−1) e^{−x/β} when x > 0 and 0 otherwise,

where α and β are positive constants. (See Section 5.2.4.)

If events occurring over time follow an approximate Poisson process with rate λ, where λ<br />

is the average number of events per unit time and if r is a positive integer, then the time<br />

until the r th event occurs has a gamma distribution with α = r and β = 1/λ. To see this,<br />

1. If you observe the process for t units of time and let Y equal the number of observed events, then Y has a Poisson distribution with parameter λt. The PDF of Y is as follows:

P(Y = y) = e^{−λt}(λt)^y / y! when y = 0,1,2,... and 0 otherwise.



2. Let X be the time you observe the r th event, starting from time 0. Then X<br />

is a continuous random variable whose range is x > 0. Further, P(X > t)<br />

is the same as the probability that there are fewer than r events in the<br />

interval [0,t]. Thus,<br />

P(X > t) = P(Y < r) = ∑_{y=0}^{r−1} e^{−λt}(λt)^y / y!

and P(X ≤ t) = 1 − P(X > t).

3. The PDF f(t) = d/dt P(X ≤ t) is computed using the product rule:

d/dt [1 − e^{−λt} ∑_{y=0}^{r−1} λ^y t^y / y!]
   = λe^{−λt} ∑_{y=0}^{r−1} λ^y t^y / y! − e^{−λt} ∑_{y=1}^{r−1} λ^y y t^{y−1} / y!
   = λe^{−λt} [ ∑_{y=0}^{r−1} λ^y t^y / y! − ∑_{y=1}^{r−1} λ^{y−1} t^{y−1} / (y−1)! ]
   = λe^{−λt} λ^{r−1} t^{r−1} / (r−1)!
   = (λ^r / (r−1)!) t^{r−1} e^{−λt}.

Since f(t) is the same as the PDF of a gamma random variable with parameters α = r and β = 1/λ, X has a gamma distribution with parameters α = r and β = 1/λ.

6.10 Summary: Poisson Processes<br />

In summary, there are three distributions related to Poisson processes:<br />

1. If X is the number of events observed in one unit of time, then X is a Poisson random<br />

variable with parameter λ, where λ is the average number of events per unit time.<br />

The probability that exactly x events occur is<br />

P(X = x) = e^{−λ} λˣ/x! when x = 0,1,2,..., and 0 otherwise.

2. If X is the time between successive events, then X is an exponential random variable<br />

with parameter λ, where λ is the average number of events per unit time. The CDF<br />

of X is<br />

F(x) = 1 − e −λx when x > 0 and 0 otherwise.<br />

3. If X is the time to the rth event, then X is a gamma random variable with parameters<br />

α = r and β = 1/λ, where λ is the average number of events per unit time. The CDF<br />

of X is<br />

F(x) = 1 − ∑_{y=0}^{r−1} e^{−λx}(λx)^y / y! when x > 0 and 0 otherwise.

Remark. If α = r and r is not too large, then gamma probabilities can be computed by hand using the formula for the CDF above. Otherwise, the computer can be used to find probabilities for gamma distributions.
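The hand formula and the computer agree, of course; the sketch below (added, assuming SciPy, and using arbitrary illustrative values λ = 2, r = 3, x = 1.5) evaluates the gamma CDF both ways:

    import math
    from scipy.stats import gamma

    lam, r, x = 2.0, 3, 1.5
    hand = 1 - sum(math.exp(-lam * x) * (lam * x)**y / math.factorial(y) for y in range(r))
    library = gamma.cdf(x, a=r, scale=1 / lam)   # alpha = r, scale = beta = 1/lambda
    print(round(hand, 6), round(library, 6))     # the two values match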



7 Multivariable Calculus Concepts<br />

Let R k be the set of ordered k-tuples of real numbers,<br />

R k = {(x1,x2,...,xk) : xi ∈ R} .<br />

Each element of R k is called a point or a vector. Initially, we use the notation<br />

x = (x1,x2,...,xk)<br />

to denote vectors in R k . We say that xi is the i th component of x.<br />

This chapter considers real-valued functions whose domain, D, is a subset of Rᵏ,

f : D ⊆ Rᵏ −→ R,

and their applications in statistics.

7.1 Vector And Matrix Operations

This section introduces basic vector and matrix operations. Ideas introduced here will be studied in more depth in Chapter 9 of these notes.

7.1.1 Representations of Vectors<br />

The origin (or zero vector) O in R k is the vector all of whose components are zero,<br />

O = (0,0,... ,0).<br />

Let v = (v1,v2,...,vk) be a vector in R k . If v is not the zero vector, then v can be<br />

represented as a directed line segment in two ways:<br />

1. As a position vector drawn from the origin to the point P(v1,v2,...,vk): v = OP .<br />

2. As a displacement vector drawn from point P(p1,p2,... ,pk) to point Q(q1,q2,... ,qk),<br />

where vi = qi − pi <strong>for</strong> each i: v = P Q.<br />

For example, the left part of Figure 7.1 shows the vector v = (−4,3) represented as the<br />

position vector OP and the displacement vectors AB and V W. The right part of the<br />

figure shows the position vector v = OP, where P is the point P(−4,3,2).<br />

7.1.2 Addition and Scalar Multiplication of Vectors<br />

If v and w are vectors in R k and c ∈ R is a scalar, then<br />



Figure 7.1: Representations of vectors in 2-space (left) and 3-space (right).

1. the vector sum v + w is the vector obtained by componentwise addition,<br />

v + w = (v1 + w1,v2 + w2,...,vk + wk), and<br />

2. the scalar product cv is the vector obtained by multiplying each component of v by c,<br />

cv = (cv1,cv2,...,cvk).<br />

Example 7.1 (Vectors in R 3 ) If v = (8, −3,2) and w = (−9,2,3), then<br />

v + w = (−1, −1,5), 5v = (40, −15,10) and 5v − w = (49, −17,7).<br />
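These componentwise operations correspond directly to array arithmetic; the sketch below (added here, assuming NumPy) reproduces Example 7.1:

    import numpy as np

    v = np.array([8, -3, 2])
    w = np.array([-9, 2, 3])
    print(v + w)        # [-1 -1  5]
    print(5 * v)        # [40 -15 10]
    print(5 * v - w)    # [49 -17  7]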

Properties of addition and scalar multiplication. The following properties hold for vectors u, v, w in Rᵏ and c, d ∈ R:

1. Commutative: v + w = w + v.<br />

2. Associative: (u + v) + w = u + (v + w), c(dv) = (cd)v.<br />

3. Distributive: (c + d)v = cv + dv, c(v + w) = cv + cw.<br />

4. Identity: 1v = v, 0v = O, v + O = v, and<br />

v + (−1)w = v − w (that is, we have the operation of subtraction).<br />

Parallelogram or triangle rules. Figure 7.2 is a graphical way to represent addition (left part) and subtraction (right part), in 2-space and 3-space.

In the left plot, the vector sum v + w is obtained by starting at the lower left and going<br />

across the bottom and then up the right side, or by starting at the lower left and going up<br />

the left side and across the top (since v + w = w + v). Geometrically, the vector sum is<br />

the diagonal of the parallelogram.<br />

In the right plot, observe that w + u = v implies that u = v − w.<br />



Figure 7.2: Graphical representation of vector addition (left) and subtraction (right).

Figure 7.3: Graphical representation of scalar multiplication of vectors.

Parallel vectors. Vectors v and w are said to be parallel when w = mv <strong>for</strong> some constant<br />

m. If m > 0, then the vectors are pointing in the same direction; if m < 0, then the vectors<br />

are pointing in the opposite direction.<br />

Figure 7.3 is a graphical way to represent scalar multiplication in 2-space and 3-space.<br />

Vector 2v is a vector pointing in the same direction as v but of twice the length; vector<br />

−1.5v is a vector of length 1.5 times the length of v, but pointing in the opposite direction.<br />

7.1.3 Length and Distance<br />

If v is a vector in Rᵏ, then the length of v is

‖v‖ = √(v1² + v2² + · · · + vk²).

The zero vector has length 0; all other vectors have positive length.

A unit vector is a vector of length 1. If v is not the zero vector, then the unit vector in the direction of v is the vector u = (1/‖v‖) v.

The distance between vectors v and w is the length of the difference vector v − w:

‖v − w‖ = √((v1 − w1)² + (v2 − w2)² + · · · + (vk − wk)²).

Example 7.2 (Vectors in R³, continued) If v = (8, −3, 2) and w = (−9, 2, 3), then the length of v is ‖v‖ = √77 ≈ 8.77, the length of w is ‖w‖ = √94 ≈ 9.70, and the distance between v and w is ‖v − w‖ = 3√35 ≈ 17.75.

The unit vector in the direction of v is u = (8/√77, −3/√77, 2/√77).

Theorem 7.3 (Triangle Inequality) If v, w ∈ Rᵏ, then ‖v + w‖ ≤ ‖v‖ + ‖w‖.

Note that if k = 2 or k = 3, then the triangle with corners O, v and v + w has sides of lengths ‖v‖, ‖w‖, and ‖v + w‖.

7.1.4 Dot Product of Vectors<br />

If v,w ∈ R k , then the dot product (or inner product or scalar product) is the number<br />

Note that v·v = v 2 .<br />

v·w = v1w1 + v2w2 + · · · + vkwk.<br />

Let k = 2 or 3, v = OP and w = OQ be representations of v and w as position vectors,<br />

and θ be the smaller of the two angles defined by P OQ (the connected segments from P<br />

to the origin to Q). Then, analytic geometry can be used to demonstrate that<br />

v·w = ‖v‖ ‖w‖ cos(θ).

If θ is unknown, then it can be found by solving the equation above.<br />

Example 7.4 (Vectors in R^3, continued) If v = (8, −3, 2) and w = (−9, 2, 3), then
v·w = −72  and  θ = cos⁻¹( v·w / (‖v‖ ‖w‖) ) ≈ 2.58 radians (≈ 148 degrees).
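The vector operations above are easy to check numerically. The following short sketch (assuming the NumPy library is available; it is not part of the original notes) reproduces the computations of Examples 7.1, 7.2, and 7.4.

    import numpy as np

    v = np.array([8.0, -3.0, 2.0])
    w = np.array([-9.0, 2.0, 3.0])

    # Componentwise addition and scalar multiplication (Example 7.1)
    print(v + w)          # [-1. -1.  5.]
    print(5 * v)          # [40. -15. 10.]
    print(5 * v - w)      # [49. -17.  7.]

    # Lengths and distance (Example 7.2)
    print(np.linalg.norm(v))        # sqrt(77) ~ 8.77
    print(np.linalg.norm(w))        # sqrt(94) ~ 9.70
    print(np.linalg.norm(v - w))    # 3*sqrt(35) ~ 17.75

    # Unit vector in the direction of v
    u = v / np.linalg.norm(v)

    # Dot product and angle (Example 7.4)
    dot = np.dot(v, w)                                            # -72
    theta = np.arccos(dot / (np.linalg.norm(v) * np.linalg.norm(w)))
    print(dot, theta, np.degrees(theta))                          # ~ 2.58 rad, ~ 148 degrees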

Angle between vectors. Analytic geometry can be used to prove the following theorem.<br />

Theorem 7.5 (Cauchy-Schwarz Inequality) If v, w ∈ R^k, then |v·w| ≤ ‖v‖ ‖w‖.

The Cauchy-Schwarz inequality allows us to formally define the angle between two nonzero
vectors. Specifically, if v, w ∈ R^k are nonzero vectors, then θ is defined to be the angle
satisfying the following equation:
θ = cos⁻¹( v·w / (‖v‖ ‖w‖) ).

This definition generalizes ideas about angles from 2-space and 3-space.<br />

Orthogonal vectors. The nonzero vectors v,w ∈ R k are said to be orthogonal if their<br />

dot product is 0, v·w = 0.<br />

Note that the angle between orthogonal vectors is π/2 (or 90 degrees), since cos(π/2) = 0 and
π/2 is in the range of the inverse cosine function (see page 53).



7.1.5 Matrix Notation and Matrix Transpose<br />

An m-by-n matrix A is a rectangular array with m rows and n columns:<br />

A = [ a11 a12 · · · a1n ; a21 a22 · · · a2n ; ... ; am1 am2 · · · amn ] = [aij]
(rows are separated by semicolons here and in the displays below).

The (i,j)-entry of A (denoted by aij) is the element in the row i and column j position.<br />

Entries are shown explicitly in the first term above. The second term is used as shorthand<br />

when the size of the matrix (the numbers m and n) is understood.<br />

A 1-by-n matrix is known as a row vector and an m-by-1 matrix is known as a column<br />

vector. The m-by-n matrix A can be thought of as a single column of m row vectors or as<br />

a single row of n column vectors.<br />

If A is an m-by-n matrix, then the transpose of A, denoted by A^T, is the n-by-m matrix
whose (i,j)-entry is aji. For example, if
A = [ 1 2 3 ; 4 5 6 ],  then  A^T = [ 1 4 ; 2 5 ; 3 6 ].

7.1.6 Addition and Scalar Multiplication of Matrices<br />

Let A = [aij] and B = [bij] be m-by-n matrices of numbers and let c ∈ R be a scalar. Then<br />

1. the matrix sum, A + B, is the m-by-n matrix obtained by componentwise addition,<br />

A + B = [aij + bij], and<br />

2. the scalar product, cA, is the m-by-n matrix obtained by multiplying each component<br />

of A by c,<br />

cA = [caij].<br />

<br />

Example 7.6 (2-by-3 Matrices) If A = [ 2 3 6 ; −2 0 1 ] and B = [ 5 −6 2 ; 4 5 0 ], then
A + B = [ 7 −3 8 ; 2 5 1 ],  5A = [ 10 15 30 ; −10 0 5 ],  and  5A − 2B = [ 0 27 26 ; −18 −10 5 ].

Properties of addition and scalar multiplication. Since operations are per<strong>for</strong>med<br />

componentwise, properties of addition and scalar multiplication of matrices are similar to<br />

those <strong>for</strong> vectors.<br />



7.1.7 Matrix Multiplication<br />

Let A = [aij] be an m-by-n matrix and let B = [bij] be an n-by-p matrix.<br />

The matrix product, C = AB, is the m-by-p matrix whose (i,j)-entry is the dot product of<br />

the i th row of A with the j th column of B:<br />

cij = Σ_{k=1}^{n} aik bkj   for i = 1, 2, ..., m;  j = 1, 2, ..., p.

<br />

Example 7.7 (2-by-3 and 3-by-2 Matrices) Let A = [ 2 3 6 ; −1 0 1 ] and B = [ 2 −3 ; 0 1 ; 4 −7 ].
Then
AB = [ 28 −45 ; 2 −4 ]  and  BA = [ 7 6 9 ; −1 0 1 ; 15 12 17 ].
For example, the (1,1)-entry of the product AB is
(2, 3, 6)·(2, 0, 4) = (2)(2) + (3)(0) + (6)(4) = 28.
Similarly, the (1,1)-entry of the product BA is (2)(2) + (−3)(−1) = 7.
Properties. Matrix products are associative. Thus, the product
ABC = A(BC) = (AB)C
is well-defined. Matrix products are not necessarily commutative, as illustrated in the
example above. In fact, if the number of rows of A is not equal to the number of columns
of B, then the product BA is not defined.

Note also that the transpose of the product of two matrices is equal to the product of the
transposes in the opposite order: (AB)^T = B^T A^T.
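As a quick numerical check, the sketch below (NumPy assumed; not part of the original notes) reproduces Example 7.7 and verifies the transpose rule.

    import numpy as np

    A = np.array([[2, 3, 6],
                  [-1, 0, 1]])
    B = np.array([[2, -3],
                  [0, 1],
                  [4, -7]])

    AB = A @ B      # 2-by-2 product
    BA = B @ A      # 3-by-3 product
    print(AB)       # [[28 -45], [ 2  -4]]
    print(BA)       # [[ 7  6  9], [-1  0  1], [15 12 17]]

    # (AB)^T equals B^T A^T
    print(np.array_equal(AB.T, B.T @ A.T))   # True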

7.1.8 Determinants of Square Matrices<br />

A square matrix is a matrix with m = n. If A is a square matrix, then the determinant<br />

of A is a number which summarizes certain properties of the matrix. Notations <strong>for</strong> the<br />

determinant of A are det(A) or |A|.<br />

The determinant is often defined recursively, as illustrated in the following special cases:<br />

1. If A = [ a1 a2 ; b1 b2 ] is a 2-by-2 matrix, then
det(A) = det[ a1 a2 ; b1 b2 ] = a1 b2 − b1 a2.


2. If A = [ a1 a2 a3 ; b1 b2 b3 ; c1 c2 c3 ] is a 3-by-3 matrix, then
det(A) = a1 · det[ b2 b3 ; c2 c3 ] − a2 · det[ b1 b3 ; c1 c3 ] + a3 · det[ b1 b2 ; c1 c2 ].
(The formula uses an alternating sum of 3 determinants of 2-by-2 matrices.)

3. If A = [ a1 a2 a3 a4 ; b1 b2 b3 b4 ; c1 c2 c3 c4 ; d1 d2 d3 d4 ] is a 4-by-4 matrix, then
det(A) = a1 · det[ b2 b3 b4 ; c2 c3 c4 ; d2 d3 d4 ] − a2 · det[ b1 b3 b4 ; c1 c3 c4 ; d1 d3 d4 ]
       + a3 · det[ b1 b2 b4 ; c1 c2 c4 ; d1 d2 d4 ] − a4 · det[ b1 b2 b3 ; c1 c2 c3 ; d1 d2 d3 ].
(The formula uses an alternating sum of 4 determinants of 3-by-3 matrices.)

Note that the notation <strong>for</strong> the entries in the special cases was chosen to emphasize the<br />

recursive nature of the definition.<br />

In general, if Aij is the (n − 1)-by-(n − 1) submatrix obtained by eliminating the i th row<br />

and the j th column of A, then<br />

det(A) = Σ_{j=1}^{n} (−1)^{1+j} a1j det(A1j).

The <strong>for</strong>mula uses an alternating sum of n determinants of (n − 1)-by-(n − 1) matrices. We<br />

say that the determinant is obtained by “expanding in the first row.” (Additional methods<br />

<strong>for</strong> computing det(A) are discussed in Section 9.3.)<br />
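A direct translation of the recursive definition into code can make the "expand in the first row" idea concrete. The sketch below (Python with NumPy; a naive implementation with exponential running time, suitable only for small matrices) uses the 3-by-3 matrix of Example 7.8 below as a test case, with numpy.linalg.det as an independent check.

    import numpy as np

    def det_by_first_row(A):
        """Determinant via cofactor expansion along the first row (recursive)."""
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        if n == 1:
            return A[0, 0]
        total = 0.0
        for j in range(n):
            # A1j: delete row 0 and column j
            minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
            total += (-1) ** j * A[0, j] * det_by_first_row(minor)
        return total

    A = [[4, 1, 2], [-1, 2, 1], [1, 3, -3]]
    print(det_by_first_row(A))     # -48.0
    print(np.linalg.det(A))        # ~ -48.0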

Example 7.8 (3-by-3 Matrix) If A = [ 4 1 2 ; −1 2 1 ; 1 3 −3 ], then
det(A) = 4 · det[ 2 1 ; 3 −3 ] − 1 · det[ −1 1 ; 1 −3 ] + 2 · det[ −1 2 ; 1 3 ]
       = 4(−6 − 3) − 1(3 − 1) + 2(−3 − 2)
       = 4(−9) − 1(2) + 2(−5) = −48.

7.2 Limits and Continuity

This section generalizes ideas from Section 4.2.


7.2.1 Neighborhoods, Deleted Neighborhoods and Limits<br />

A neighborhood of the vector a ∈ R k is the set of vectors x ∈ R k satisfying<br />

‖x − a‖ < r for some positive number r.

A deleted neighborhood of a is the neighborhood with a removed; that is, the set of vectors<br />

x satisfying 0 < ‖x − a‖ < r for some positive r.

The vector a ∈ D is said to be a limit point of the domain D if every deleted neighborhood<br />

of a contains points of D.<br />

Consider f : D ⊆ R k −→ R, and suppose that a is a limit point of D. We say that the<br />

limit as x approaches a of f(x) is L,<br />

lim f(x) = L,<br />

x→a<br />

if for every positive number ε there exists a positive number δ such that
if x ∈ D satisfies 0 < ‖x − a‖ < δ, then f(x) satisfies |f(x) − L| < ε.

(Vectors in the intersection of the domain of the function with the deleted δ-neighborhood<br />

of a are mapped into the ε-neighborhood of L.)

Example 7.9 (Limit Exists) Consider the 2-variable function
f(x,y) = (x³ − y³)/(x − y)  when x ≠ y.
Let x = (x,y) and a = (0,0). Then
lim_{x→a} f(x) = lim_{(x,y)→(0,0)} (x³ − y³)/(x − y) = lim_{(x,y)→(0,0)} (x² + xy + y²) = 0.

(As (x,y) approaches the origin from any direction not crossing the line y = x, the function<br />

values f(x,y) approach zero.)<br />

Example 7.10 (Limit Does Not Exist) Consider the 2-variable function
f(x,y) = (x² − y²)/(x² + y²)  when (x,y) ≠ (0,0).
Let x = (x,y) and a = (0,0), and consider approaching (0,0) along the line y = mx. Then
f(x,y) = f(x, mx) = (x² − m²x²)/(x² + m²x²) = (1 − m²)/(1 + m²)  on this line (when x ≠ 0).
Since
lim_{x→a along y = mx} f(x) = (1 − m²)/(1 + m²)
and m can be any number, the two-dimensional limit does not exist.


Properties of limits. Properties of limits <strong>for</strong> 1-variable functions (see page 57) can be<br />

generalized to limits <strong>for</strong> k-variable functions.<br />

Example 7.11 (Squeezing Principle) Consider the 2-variable function
f(x,y) = 3x²y/(x² + y²)  when (x,y) ≠ (0,0).
Let x = (x,y) and a = (0,0). The squeezing principle can be used to demonstrate that
lim_{(x,y)→(0,0)} f(x,y) = 0.
To see this, note that
0 ≤ x²/(x² + y²) ≤ 1  when (x,y) ≠ (0,0)
implies that values of f have the following simple bounds:
−3|y| ≤ f(x,y) ≤ 3|y|  when (x,y) ≠ (0,0).

Since ±3|y| → 0 as y → 0, f(x,y) must approach 0 as well.<br />

7.2.2 Continuous Functions<br />

The function f(x) is said to be continuous at a if<br />

lim_{x→a} f(x) = f(a).

The function is said to be discontinuous at a otherwise. Note that f(x) is discontinuous at<br />

a when f(a) is undefined, or when the limit does not exist, or when the limit and function<br />

values exist but are not equal.<br />

Linear functions. A linear function in k variables is a function of the <strong>for</strong>m<br />

f(x) = c0 + c1x1 + c2x2 + · · · + ckxk<br />

where each coefficient ci ∈ R. Linear functions are continuous <strong>for</strong> all x ∈ R k .<br />

Polynomial functions. A polynomial function in k variables is a function of the form
f(x) = Σ_{j1,j2,...,jk=0}^{d} c_{j1,j2,...,jk} x1^{j1} x2^{j2} · · · xk^{jk},
where d is a nonnegative integer and the c's are real numbers. For example, f(x,y) = 3x² − 10xy⁸ + y² + 12 is a polynomial
function in two variables. Polynomial functions are continuous for all x ∈ R^k.

Rational functions. A rational function in k variables is a ratio of polynomials, f(x) =<br />

p(x)/q(x). Rational functions are continuous whenever q(x) ≠ 0.



More generally. More generally, properties of limits imply that sums, differences, products,<br />

quotients, and constant multiples of continuous functions are continuous whenever the<br />

appropriate operation is defined.<br />

Composition of continuous functions. If the k-variable function g(x) is continuous<br />

at a and the 1-variable function f(x) is continuous at g(a), then the composite function<br />

f(g(x)) is continuous at a. In other words,<br />

<br />

lim_{x→a} f(g(x)) = f( lim_{x→a} g(x) ) = f(g(a)).

For example, let g(x,y) = x² − 4y, f(x) = √x and a = (5,1). Then the composite function
f(g(x,y)) is continuous at (5,1) since
lim_{(x,y)→(5,1)} f(g(x,y)) = lim_{(x,y)→(5,1)} √(x² − 4y) = √(25 − 4) = √21 = f(g(5,1)).

7.3 Differentiation in Several Variables<br />

This section generalizes differentiability from Section 4.3.<br />

7.3.1 Partial Derivatives and the Gradient Vector<br />

The partial derivative of f(x) with respect to xi is
fxi(x) = lim_{h→0} [ f(x1, ..., xi−1, xi + h, xi+1, ..., xk) − f(x) ] / h.
(The ith partial derivative is the ordinary derivative of the function where all variables
except the ith variable are held fixed.) Alternative notations are Dxi f(x) and ∂f/∂xi (x).

Example 7.12 (k = 3) Let x = (x,y,z) and f(x,y,z) = −2xy² + z³ + 3x²y z⁴. Then
fx(x,y,z) = −2y² + 6xy z⁴,  fy(x,y,z) = −4xy + 3x²z⁴,  fz(x,y,z) = 3z² + 12x²y z³.
The gradient vector, ∇f(x) ("grad f of x"), is the vector of partial derivatives:
∇f(x) = ( fx1(x), fx2(x), ..., fxk(x) ).
For example, for the 3-variable function above,
∇f(x,y,z) = ( −2y² + 6xy z⁴, −4xy + 3x²z⁴, 3z² + 12x²y z³ ).



Matrix notation. We sometimes write the list of partial derivatives as a 1-by-k matrix,<br />

Df(x) = [fx1 (x) fx2 (x) ... fxk (x)] .<br />

Note that matrix notation is especially useful when considering vector-valued functions. For<br />

example, if the domain of f is a subset of R n and the range of f is a subset of R m , then<br />

Df(x) is an m-by-n matrix of partial derivatives.<br />

7.3.2 Second-Order Partial Derivatives<br />

The second-order partial derivative of f(x) with respect to xi and xj is the partial derivative<br />

of fxi (x) with respect to xj:<br />

<br />

fxi,xj(x) = ∂/∂xj ( ∂f/∂xi )(x) = ∂²f/(∂xj ∂xi)(x) = Dxi,xj f(x).

When i = j, the second-order partial derivative can be interpreted as the concavity of a
partial function. When i ≠ j, the second-order partial derivative is often called a mixed
partial derivative. Additional notations when i = j are:

fxi,xi(x) = ∂²f/∂xi² (x) = D²xi f(x).

Theorem 7.13 (Clairaut's Theorem) Suppose that the k-variable function f(x) has

continuous first- and second-order partial derivatives in a neighborhood of a. Then <strong>for</strong> each<br />

i and j, the mixed partial derivatives at a are equal:<br />

fxi,xj (a) = fxj,xi (a) <strong>for</strong> all i,j = 1,2,... ,k.<br />

Example 7.14 (k = 3, continued) The second-order partial derivatives (with arguments<br />

omitted) of the 3-variable function above are as follows:<br />

fx,x = 6y z⁴,  fy,y = −4x,  fz,z = 6z + 36x²y z²,
fx,y = fy,x = −4y + 6xz⁴,  fx,z = fz,x = 24xy z³,  fy,z = fz,y = 12x²z³.
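The derivatives in Examples 7.12 and 7.14 can be verified symbolically. The sketch below assumes the SymPy library is available (it is not part of the original notes); it computes the gradient and checks the equality of mixed partials asserted by Clairaut's theorem.

    import sympy as sp

    x, y, z = sp.symbols('x y z')
    f = -2*x*y**2 + z**3 + 3*x**2*y*z**4

    grad = [sp.diff(f, v) for v in (x, y, z)]
    print(grad)   # [-2*y**2 + 6*x*y*z**4, -4*x*y + 3*x**2*z**4, 3*z**2 + 12*x**2*y*z**3]

    # Mixed second-order partials agree (Clairaut's theorem)
    for u in (x, y, z):
        for v in (x, y, z):
            assert sp.simplify(sp.diff(f, u, v) - sp.diff(f, v, u)) == 0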

Remark. Higher-order partial derivatives can also be defined.<br />

7.3.3 Differentiability, Local Linearity and Tangent (Hyper)Planes<br />

Suppose that f(x) is defined <strong>for</strong> all x in a neighborhood of a and that<br />

fxi (a) exists <strong>for</strong> i = 1,2,... ,k.<br />


Figure 7.4: Plots of z = 4 − x² − y² (left part) and z = x³ − 3xy + y³ (right part). In each

case, partial functions are superimposed on the surface. (See Figure 7.5 also.)<br />

Let
h(x) = f(a) + ∇f(a)·(x − a) = f(a) + fx1(a)(x1 − a1) + · · · + fxk(a)(xk − ak).

We say that f is differentiable at a if the following limit holds:<br />

lim_{x→a} [ f(x) − h(x) ] / ‖x − a‖ = 0.

That is, if the error in using h as an approximation to f approaches 0 at a rate faster than<br />

the rate at which the distance between x and a approaches 0.<br />

Local linearity. The definition of differentiability says that f is locally linear; the error<br />

in using the linear function h as an approximation to f can be made as small as we like by<br />

taking x to be close enough to a.<br />

Tangent (hyper)planes. If f is differentiable at a, then y = h(x) is called the tangent<br />

plane at (a,f(a)) when k = 2 and is called the tangent hyperplane at (a,f(a)) when k > 2.<br />

Example 7.15 (Quadratic Function) Consider the 2-variable function<br />

f(x,y) = 4 − x² − y² and let a = (0.5, −1).

The left part of Figure 7.4 shows a graph of the surface z = f(x,y), with graphs of the<br />

partial functions z = f(0.5,y) and z = f(x, −1) superimposed.<br />

1. The partial derivatives of f are fx(x,y) = −2x and fy(x,y) = −2y.<br />

2. The slope of z = f(x, −1) when x = 0.5 is fx(0.5, −1) = −1.<br />

3. The slope of z = f(0.5,y) when y = −1 is fy(0.5, −1) = 2.<br />



4. Since f(0.5, −1) = 2.75, the linear approximation at (0.5, −1) is<br />

h(x,y) = 2.75 − (x − 0.5) + 2(y + 1).<br />

In addition, f is differentiable at this point (and everywhere).<br />

Example 7.16 (Cubic Function) Consider the 2-variable function<br />

f(x,y) = x³ − 3xy + y³ and let a = (1.5, 1).

The right part of Figure 7.4 shows a graph of the surface z = f(x,y), with graphs of the<br />

partial functions z = f(1.5,y) and z = f(x,1) superimposed.<br />

1. The partial derivatives of f are fx(x,y) = 3x² − 3y and fy(x,y) = 3y² − 3x.

2. The slope of z = f(x,1) when x = 1.5 is fx(1.5,1) = 3.75.<br />

3. The slope of z = f(1.5,y) when y = 1 is fy(1.5,1) = −1.5.<br />

4. Since f(1.5,1) = −0.125, the linear approximation at (1.5,1) is<br />

h(x,y) = −0.125 + 3.75(x − 1.5) − 1.5(y − 1).<br />

In addition, f is differentiable at this point (and everywhere).<br />

Example 7.17 (Derivative Does Not Exist) Consider the 2-variable function
f(x,y) = ||x| − |y|| − |x| − |y|  and let a = (0,0).
Since f(x,0) = 0 and f(0,y) = 0, fx(x,0) = 0 and fy(0,y) = 0 for each x and y, and the
linear approximation h(x,y) = 0.
But f is not differentiable at the origin since the limit given in the definition of differentiability
does not exist. If you evaluate the limit by approaching along y = 0 and along y = x,
for example, you get two different answers:
lim_{(x,0)→(0,0)} [ f(x,0) − 0 ] / ‖(x,0)‖ = 0  and  lim_{(x,x)→(0,0)} [ f(x,x) − 0 ] / ‖(x,x)‖ = −√2.

Theorem 7.18 (Differentiability and Continuity) Suppose that the k-variable function

f(x) is defined <strong>for</strong> all x in a neighborhood of a. Then<br />

1. If f is differentiable at a, then f is continuous at a.<br />

2. If each partial derivative function, fxi (x) <strong>for</strong> i = 1,2,... ,k, is continuous in a neighborhood<br />

of a, then f is differentiable at a.<br />

Note that in the quadratic and cubic function examples, the partial derivative functions (fx<br />

and fy) are defined and continuous <strong>for</strong> all pairs. In the absolute value function example,<br />

partial derivative functions are not defined in a neighborhood of the origin.<br />



7.3.4 Total Derivative<br />

If each coordinate xi is a function of t,<br />

xi = xi(t) <strong>for</strong> i = 1,2,... ,k<br />

(or simply x = x(t)), then f is also a function of t. The derivative of f with respect to t<br />

(when it exists) is called the total derivative of f.<br />

Theorem 7.19 (Total Derivatives) If f is differentiable in a neighborhood of x0 = x(t0)<br />

and each coordinate function is differentiable at t0, then the total derivative exists and its<br />

value can be computed as follows:<br />

df/dt (t0) = fx1(x0) x′1(t0) + fx2(x0) x′2(t0) + · · · + fxk(x0) x′k(t0).

Note that if we let x ′ (t) = (x ′ 1 (t),x′ 2 (t),... ,x′ k (t)) be the vector of derivatives of the coordinate<br />

functions, then the total derivative can be written as the dot product of the gradient<br />

vector of f with the derivative vector of x:<br />

df/dt (t0) = ∇f(x0)·x′(t0).

Thus, the <strong>for</strong>mula <strong>for</strong> the total derivative is a generalization of the chain rule <strong>for</strong> compositions<br />

of 1-variable functions.<br />

Example 7.20 (k = 3, continued) Let f(x,y,z) = −2xy² + z³ + 3x²y z⁴,
x(t) = ( 1/t, 1/t², −2t ),
and t0 = 1. Since
x = x(t) = 1/t,  y = y(t) = 1/t²,  z = z(t) = −2t  ⇒  x′(t) = ( −1/t², −2/t³, −2 )  ⇒  x′(1) = (−1, −2, −2),
and ∇f(x(1)) = ∇f(1, 1, −2) = (94, 44, −84) (using the formula for ∇f derived earlier),
the total derivative of f when t = 1 is
∇f(x(1))·x′(1) = (94, 44, −84)·(−1, −2, −2) = −14.
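A quick numerical check of Example 7.20 (a sketch, assuming NumPy; the helper names are ours): compare the chain-rule value ∇f(x(t0))·x′(t0) with a finite-difference approximation of d/dt f(x(t)) at t0 = 1.

    import numpy as np

    def f(x, y, z):
        return -2*x*y**2 + z**3 + 3*x**2*y*z**4

    def grad_f(x, y, z):
        return np.array([-2*y**2 + 6*x*y*z**4,
                         -4*x*y + 3*x**2*z**4,
                         3*z**2 + 12*x**2*y*z**3])

    def path(t):
        return np.array([1/t, 1/t**2, -2*t])

    def path_prime(t):
        return np.array([-1/t**2, -2/t**3, -2.0])

    t0, h = 1.0, 1e-6
    chain_rule = grad_f(*path(t0)) @ path_prime(t0)
    finite_diff = (f(*path(t0 + h)) - f(*path(t0 - h))) / (2*h)
    print(chain_rule, finite_diff)   # both approximately -14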

7.3.5 Directional Derivatives<br />


If u is a unit vector, and f is a k-variable function defined in a neighborhood of a, then the

directional derivative of f at a is
Duf(a) = lim_{h→0} [ f(a + hu) − f(a) ] / h,
provided the limit exists.


In general, the function Duf(x) represents the rate of change of the function f with respect<br />

to a unit change in the direction given by u.<br />

Theorem 7.21 (Directional Derivatives) If the k-variable function f is differentiable<br />

in a neighborhood of a and u is a unit vector, then<br />

Duf(a) = ∇f(a)·u .<br />

That is, the rate of change is the dot product of the gradient vector of f at a with u.<br />

Bounds on directional derivatives. Let θ be the angle between vectors ∇f(a) and u
(see page 128). Then, since ‖u‖ = 1,
Duf(a) = ∇f(a)·u = ‖∇f(a)‖ cos(θ).
Further, since the cosine function takes values in the interval [−1,1], directional derivatives
take values between −‖∇f(a)‖ and ‖∇f(a)‖. If ∇f(a) ≠ O, then
1. The minimum value is assumed when u = −(1/‖∇f(a)‖) ∇f(a), and
2. The maximum value is assumed when u = (1/‖∇f(a)‖) ∇f(a).

In the first case, the direction of u is called the direction of steepest descent; in the second<br />

case, it is called the direction of steepest ascent.<br />

Example 7.22 (Quadratic Function, continued) Consider again the function
f(x,y) = 4 − x² − y² and let a = (0.5, −1).
The left part of Figure 7.5 shows contours corresponding to z = 0, 1, 2, 3, 4. The unit vector
in the direction of steepest ascent at (0.5, −1) is highlighted.
Since ∇f(a) = (−1, 2), the direction of steepest ascent is u = (1/√5)(−1, 2), and the slope of
the partial function in this direction is √5 ≈ 2.24.

Note that the contour corresponding to z = k is the set of all pairs (x, y) satisfying f(x, y) = k.<br />

Example 7.23 (Cubic Function, continued) Consider again the function
f(x,y) = x³ − 3xy + y³ and let a = (1.5, 1).
The right part of Figure 7.5 shows contours corresponding to z = −1, 0, 1, 2, 3, 4. The unit
vector in the direction of steepest ascent at (1.5, 1) is highlighted.
Since ∇f(a) = (3.75, −1.5), the direction of steepest ascent is u = (1/√16.3125)(3.75, −1.5), and
the slope of the partial function in this direction is √16.3125 ≈ 4.04.
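The directions of steepest ascent in Examples 7.22 and 7.23 come directly from normalizing the gradient. A minimal sketch (NumPy assumed; the function name is ours):

    import numpy as np

    def steepest_ascent_direction(grad):
        """Unit vector in the direction of the gradient; the slope there is ||grad||."""
        grad = np.asarray(grad, dtype=float)
        return grad / np.linalg.norm(grad), np.linalg.norm(grad)

    # Example 7.22: f(x,y) = 4 - x^2 - y^2 at a = (0.5, -1), gradient (-1, 2)
    print(steepest_ascent_direction([-1.0, 2.0]))     # direction (-1, 2)/sqrt(5), slope ~ 2.24

    # Example 7.23: f(x,y) = x^3 - 3xy + y^3 at a = (1.5, 1), gradient (3.75, -1.5)
    print(steepest_ascent_direction([3.75, -1.5]))    # slope ~ 4.04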


Figure 7.5: Contour plots of z = 4 − x² − y² (left part) and z = x³ − 3xy + y³ (right part),

with directions of steepest ascent highlighted. (See Figure 7.4 also.)<br />

7.4 Optimization<br />

Definitions of local and global extrema <strong>for</strong> functions of k-variables directly generalize the<br />

definitions given in Section 4.4 <strong>for</strong> 1-variable functions. In addition, Fermat’s theorem can<br />

be generalized as follows:<br />

Theorem 7.24 (Fermat’s Theorem) If the k-variable function f has a local extremum<br />

at c and if f is differentiable in a neighborhood of c, then ∇f(c) = O.<br />

Example 7.25 (Quadratic Function, continued) Consider again<br />

f(x,y) = 4 − x² − y² (left parts of Figures 7.4 and 7.5).

f has a local (and global) maximum at the origin and ∇f(0,0) = (0,0).<br />

Example 7.26 (Cubic Function, continued) Consider again<br />

f(x,y) = x³ − 3xy + y³ (right parts of Figures 7.4 and 7.5).

f has a local minimum at (1,1) and ∇f(1,1) = (0,0).<br />

Note that ∇f(0,0) = (0,0), but (0,0) is neither a local max nor a local min.<br />

Critical points. A point c is a critical point (or stationary point) of the k-variable function<br />

f if either ∇f(c) = O or if one of the partial derivatives does not exist at c.<br />

For example, the quadratic function f(x,y) = 4 − x² − y² has a single critical point at the
origin; the cubic function f(x,y) = x³ − 3xy + y³ has critical points (1,1) and (0,0).

7.4.1 Linear and Quadratic Approximations<br />

The definition of Taylor polynomials given in Section 6.6 can be generalized to k-variable<br />

functions. This section considers first-order and second-order approximations only.<br />



Linear approximation. Suppose that the k-variable function f is differentiable in a<br />

neighborhood of a. The first-order Taylor polynomial centered at a is defined as follows:<br />

p1(x) = f(a) + ∇f(a)·(x − a).<br />

Note that, since f is differentiable, the error in using p1 as an approximation to f can be<br />

made as small as we like by taking x to be close enough to a.<br />

If we write the list of partial derivatives as the 1-by-k matrix Df(a), the difference vector<br />

(x − a) as a k-by-1 matrix, and consider the matrix product<br />

Df(a)(x − a) = [ fx1(a) fx2(a) · · · fxk(a) ] [ x1 − a1 ; x2 − a2 ; ... ; xk − ak ]
= fx1(a)(x1 − a1) + fx2(a)(x2 − a2) + · · · + fxk(a)(xk − ak)

(a 1-by-1 matrix is equivalent to a scalar), then the first-order Taylor polynomial can be<br />

written as follows:<br />

p1(x) = f(a) + Df(a)(x − a).<br />

Quadratic approximation. The Hessian of a k-variable function f is the k-by-k matrix<br />

whose (i,j)-entry is fxi,xj :<br />

Hf(x) = [ fx1,x1(x) fx1,x2(x) · · · fx1,xk(x) ; fx2,x1(x) fx2,x2(x) · · · fx2,xk(x) ; ... ; fxk,x1(x) fxk,x2(x) · · · fxk,xk(x) ].

Suppose that the k-variable function f has continuous first- and second-order partial derivatives<br />

in a neighborhood of a. Then the second-order Taylor polynomial centered at a is<br />

defined as follows:<br />

p2(x) = f(a) + Df(a)(x − a) + (1/2)(x − a)^T Hf(a)(x − a),

where Df(a) is the list of partial derivatives written as a 1-by-k matrix, (x − a) is the<br />

difference vector written as a k-by-1 matrix, and Hf(a) is the k-by-k Hessian matrix at a.<br />

Example 7.27 (Ratio Function) Let f(x,y) = y/x and a = (1,1).
1. The first- and second-order partial derivatives of f are:
fx(x,y) = −y/x²,  fy(x,y) = 1/x,
fxx(x,y) = 2y/x³,  fxy(x,y) = fyx(x,y) = −1/x²,  and  fyy(x,y) = 0.


2. Since f(a) = 1, fx(a) = −1 and fy(a) = 1, the first-order Taylor polynomial centered
at a is
p1(x,y) = 1 + [ −1 1 ] [ x − 1 ; y − 1 ] = 1 − (x − 1) + (y − 1).
3. Further, since fxx(a) = 2, fxy(a) = fyx(a) = −1 and fyy(a) = 0, the second-order
Taylor polynomial centered at a is
p2(x,y) = 1 + [ −1 1 ] [ x − 1 ; y − 1 ] + (1/2) [ x − 1  y − 1 ] [ 2 −1 ; −1 0 ] [ x − 1 ; y − 1 ]
= 1 − (x − 1) + (y − 1) + (x − 1)² − (x − 1)(y − 1).
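A short sketch comparing f(x,y) = y/x with the two Taylor approximations of Example 7.27 near a = (1,1) (NumPy-free Python; not part of the original notes). The error for p2 shrinks faster than the error for p1 as the point approaches a.

    def f(x, y):
        return y / x

    def p1(x, y):
        return 1 - (x - 1) + (y - 1)

    def p2(x, y):
        return 1 - (x - 1) + (y - 1) + (x - 1)**2 - (x - 1)*(y - 1)

    for h in (0.5, 0.1, 0.01):
        x, y = 1 + h, 1 + h/2
        print(h, abs(f(x, y) - p1(x, y)), abs(f(x, y) - p2(x, y)))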

Theorem 7.28 (Second-Order Taylor Polynomials) Suppose that the k-variable<br />

function f has continuous first- and second-order partial derivatives in a neighborhood of<br />

a, and let p2(x) be the second-order Taylor polynomial centered at a. Then<br />

lim_{x→a} [ f(x) − p2(x) ] / ‖x − a‖² = 0.
That is, the error function e(x) = f(x) − p2(x) approaches zero at a rate faster than the
rate at which the square of the distance between x and a approaches zero.

7.4.2 Second Derivative Test<br />

Suppose that the k-variable function f has continuous first- and second-order partial derivatives<br />

in a neighborhood of the critical point a. Since Df(a) is the zero vector,<br />

p2(x) = f(a) + (1/2)(x − a)^T Hf(a)(x − a).

If the term (x −a) T Hf(a)(x −a) is positive <strong>for</strong> all x sufficiently close (but not equal) to a,<br />

then f has a local minimum at a. Similarly, if the term (x − a) T Hf(a)(x − a) is negative<br />

<strong>for</strong> all x sufficiently close (but not equal) to a, then f has a local maximum at a.<br />

The second derivative test gives criteria <strong>for</strong> determining when the quadratic term in the<br />

second-order approximation is always positive or always negative <strong>for</strong> x near a. Matrix<br />

notation and operations are useful in developing the test.<br />

Symmetric matrices and quadratic <strong>for</strong>ms. The n-by-n matrix A is symmetric if<br />

aij = aji <strong>for</strong> all i and j. Symmetric matrices have the property that A T = A.<br />

If A is a symmetric n-by-n matrix and h is an n-by-1 matrix (a column vector), then
Q(h) = h^T A h = Σ_{i=1}^{n} Σ_{j=1}^{n} aij hi hj
is called a quadratic form in h.


Positive and negative definite quadratic forms. The quadratic form Q (respectively,
the associated symmetric matrix A) is said to be positive definite if Q(h) > 0 when h ≠ O
and is said to be negative definite if Q(h) < 0 when h ≠ O.

If the k-variable function f has continuous first- and second-partial derivatives in a neighborhood<br />

of a, then by Clairaut’s theorem (page 135) the Hessian matrix Hf(a) is a symmetric<br />

k-by-k matrix. Our concern is whether Hf(a) is positive definite or negative definite.<br />

Principal minors and determinants. If A is a k-by-k matrix, then the j-by-j principal
minor of A is the matrix
Aj = [ a11 a12 · · · a1j ; a21 a22 · · · a2j ; ... ; aj1 aj2 · · · ajj ]   for j = 1, 2, ..., k.

Let dj = det(Aj) be the determinant of the j-by-j principal minor (j > 1) and let d1 = a11.<br />

For example, if A = [ 4 1 2 ; −1 2 1 ; 1 3 −3 ], then d1 = 4, d2 = 9, and d3 = −48.

Second derivative test <strong>for</strong> local extrema. Suppose that the k-variable function f<br />

has continuous first- and second-order partial derivatives in a neighborhood of the critical<br />

point a, let Hf(a) be the Hessian matrix at a, and let d1, d2, · · ·, dk be the sequence of<br />

determinants of the principal minors of Hf(a).<br />

Assume that dk = det(Hf(a)) ≠ 0. Then

1. If di > 0 <strong>for</strong> i = 1,2,... ,k, then f has a local minimum at a.<br />

2. If d1 < 0, d2 > 0, d3 < 0, ..., then f has a local maximum at a.<br />

3. If the patterns given in the first two steps do not hold, then f has a saddle point at<br />

the point a. That is, there is neither a maximum nor a minimum at a.<br />

If dk = det(Hf(a)) = 0, then anything can happen. (The critical point may correspond to<br />

a local maximum, a local minimum, or a saddle point.)<br />

Example 7.29 (Quadratic Function, continued) Consider again
f(x,y) = 4 − x² − y² (left parts of Figures 7.4 and 7.5).
Since
∇f(x,y) = (−2x, −2y) = (0,0)  ⇒  x = y = 0,
the only critical point is the origin. Since
Hf(0,0) = [ −2 0 ; 0 −2 ]  ⇒  d1 = −2 and d2 = 4,
the second derivative test implies that f has a local maximum at the origin.



Example 7.30 (Cubic Function, continued) Consider again
f(x,y) = x³ − 3xy + y³ (right parts of Figures 7.4 and 7.5).
Since
∇f(x,y) = (3x² − 3y, −3x + 3y²) = (0,0)  ⇒  (x,y) = (1,1) or (0,0),
there are two critical points to consider.
1. Let (x,y) = (1,1). Since
Hf(1,1) = [ 6 −3 ; −3 6 ]  ⇒  d1 = 6 and d2 = 27,
the second derivative test implies that f has a local minimum at (1,1).
2. Let (x,y) = (0,0). Since
Hf(0,0) = [ 0 −3 ; −3 0 ]  ⇒  d1 = 0 and d2 = −9,
the second derivative test implies that f has a saddle point at the origin.
Note that if we let y = −x, then the 1-variable function z = f(x, −x) = 3x² has a local
minimum at the origin; if we let y = x, then the 1-variable function z = f(x, x) = 2x³ − 3x²
has a local maximum at the origin.
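The principal-minor calculations in the second derivative test are easy to automate. A minimal sketch (NumPy assumed; the function name is ours), applied to the two critical points of the cubic function:

    import numpy as np

    def classify(hessian):
        """Second derivative test via determinants of the principal minors."""
        H = np.asarray(hessian, dtype=float)
        k = H.shape[0]
        d = [np.linalg.det(H[:j, :j]) for j in range(1, k + 1)]
        if abs(d[-1]) < 1e-12:
            return d, "test inconclusive (det Hf(a) = 0)"
        if all(dj > 0 for dj in d):
            return d, "local minimum"
        if all((-1) ** (j + 1) * dj > 0 for j, dj in enumerate(d)):
            return d, "local maximum"
        return d, "saddle point"

    # f(x,y) = x^3 - 3xy + y^3: Hf(x,y) = [[6x, -3], [-3, 6y]]
    print(classify([[6, -3], [-3, 6]]))   # at (1,1): d = [6, 27] -> local minimum
    print(classify([[0, -3], [-3, 0]]))   # at (0,0): d = [0, -9] -> saddle point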

7.4.3 Constrained Optimization and Lagrange Multipliers<br />

In many applications of calculus the goal is to optimize (either maximize or minimize) a<br />

real-valued function of k-variables, f(x), subject to one or more constraints:<br />

g1(x) = c1, g2(x) = c2, ... , gp(x) = cp, where the ci are constants, and p < k.<br />

Solve and substitute method. It may be possible to use the constraints to reduce the<br />

number of variables in the optimization problem, as illustrated in the following example.<br />

Example 7.31 (Constrained Minimization) Let x represent length, y represent width,<br />

and z represent height of an aquarium. Assume that the cost of the slate needed <strong>for</strong><br />

the bottom of the aquarium is 4 times the cost of the glass needed <strong>for</strong> the sides. Then<br />

4xy + 2xz + 2yz is proportional to the total cost <strong>for</strong> these materials.<br />

The problem of interest is to find positive numbers x, y, z to<br />

Minimize 4xy + 2xz + 2yz when the volume V = xyz = 16 cubic units.<br />

The volume constraint can be solved <strong>for</strong> z, and the result substituted into the 3-variable<br />

function we would like to minimize, yielding the following 2-variable problem:<br />

Minimize f(x,y) = 4xy + 32/y + 32/x over the domain where x > 0 and y > 0.


Figure 7.6: Surface (left) and contour (right) plots of f(x,y) = 4xy + 32/y + 32/x. Contours
correspond to f(x,y) = 49, 51, 53, 55, 57.

(See Figure 7.6.) Working with this function,
∇f(x,y) = ( 4y − 32/x², 4x − 32/y² )  and  Hf(x,y) = [ 64/x³ 4 ; 4 64/y³ ].
Since ∇f(x,y) = (0,0) when x = y = 2, Hf(2,2) = [ 8 4 ; 4 8 ], d1 = 8 and d2 = 48, we know
that f has a local (and global) minimum at (2,2). The minimum value is 48.
The dimensions of the aquarium should be x = 2, y = 2, z = 4.
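The same minimum can be found numerically. A sketch (an assumption, using scipy.optimize.minimize; any numerical optimizer would do) applied to the 2-variable cost function:

    from scipy.optimize import minimize

    def cost(v):
        x, y = v
        return 4*x*y + 32/y + 32/x

    res = minimize(cost, x0=[1.0, 1.0], bounds=[(1e-6, None), (1e-6, None)])
    x, y = res.x
    print(x, y, res.fun)     # approximately 2, 2, 48
    print(16 / (x * y))      # z = 16/(xy), approximately 4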

Method of Lagrange Multipliers. A general method, which does not require the “solve<br />

and substitute” approach outlined in the example above, depends on the following theorem.<br />

Theorem 7.32 (Lagrange Multipliers) Assume that the k-variable functions f and<br />

g1, g2, ..., gp have continuous first partial derivatives on the domain<br />

D = {x : g1(x) = c1, g2(x) = c2, ..., gp(x) = cp},<br />

f has an extremum on this domain at a, and that<br />

{∇g1(a), ∇g2(a), ..., ∇gp(a)} is a linearly independent set (see below).<br />

Then there exist constants λ1, ..., λp (called Lagrange multipliers) satisfying<br />

∇f(a) = λ1∇g1(a) + λ2∇g2(a) + · · · + λp∇gp(a).<br />

Note that when p = 1, the linear independence requirement in Lagrange’s theorem reduces<br />

to ∇g1(a) ≠ O. When p > 1, the requirement means that

the linear combination Σ_{i=1}^{p} ai ∇gi(a) = O only when each coefficient ai = 0.


The theorem can be used to find critical points. When p = 1, we drop the subscript on the<br />

single constraint and solve<br />

∇f(x) = λ∇g(x) and g(x) = c <strong>for</strong> x1,x2,...,xk and λ.<br />

Example 7.33 (Constrained Minimization, continued) Consider again the aquarium<br />

problem and let f(x,y,z) = 4xy + 2xz + 2yz and g(x,y,z) = xyz. Then<br />

∇f(x,y,z) = (4y + 2z,4x + 2z,2x + 2y) and ∇g(x,y,z) = (yz,xz,xy).<br />

Critical points need to satisfy the following system of 4 equations in 4 unknowns:<br />

4y + 2z = λyz, 4x + 2z = λxz, 2x + 2y = λxy and xyz = 16.<br />

Since x, y and z must be positive, we can solve the first 3 equations for λ:
λ = 4/z + 2/y,  λ = 4/z + 2/x,  λ = 2/y + 2/x.

From equations 1 and 2 we get y = x, and from equations 1 and 3 we get z = 2x. Now,<br />

16 = xyz = x(x)(2x) ⇒ x = 2, y = 2, z = 4.<br />

Finally, λ = 2 as well. (Note that λ can be interpreted as the rate of change of cost with<br />

respect to a unit change in the volume constraint.)<br />
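The critical-point equations from Lagrange's theorem can also be handed to a computer algebra system. A sketch assuming SymPy (not part of the original notes):

    import sympy as sp

    x, y, z, lam = sp.symbols('x y z lam', positive=True)
    f = 4*x*y + 2*x*z + 2*y*z
    g = x*y*z

    equations = [sp.Eq(sp.diff(f, v), lam * sp.diff(g, v)) for v in (x, y, z)]
    equations.append(sp.Eq(g, 16))

    print(sp.solve(equations, [x, y, z, lam], dict=True))
    # expected: [{x: 2, y: 2, z: 4, lam: 2}]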

Lagrangian function. The method outlined in Lagrange’s theorem can be coded using<br />

the following (p + k)-variable Lagrangian function:<br />

L(λ1, ..., λp, x1, ..., xk) = f(x1, ..., xk) − Σ_{i=1}^{p} λi ( gi(x1, ..., xk) − ci ).

For simplicity, we use the notation L(λ,x).<br />

Critical points of L satisfy the result of Lagrange’s theorem. A special <strong>for</strong>m of the second<br />

derivative test allows us to determine in some cases whether critical points are local extrema.<br />

Second derivative test <strong>for</strong> constrained local extrema. Let a be a constrained critical<br />

point of f subject to the constraints gi(x) = ci (<strong>for</strong> all i), and assume that all first- and<br />

second-partial derivatives are continuous on the constrained domain.<br />

Let d1, d2, ..., dp+k be the determinants of the principal minors of HL(λ,a) and consider<br />

the subsequence of (k − p) values:<br />

(−1) p d2p+1, (−1) p d2p+2, ..., (−1) p dp+k.<br />

Assume that dp+k = det(HL(λ,a)) ≠ 0. Then

1. If all terms in the subsequence are positive, then f has a constrained local<br />

minimum value at a.<br />



2. If the first term in the subsequence is negative, and the signs alternate,<br />

then f has a constrained local maximum value at a.<br />

3. If the patterns given in the first two steps do not hold, then f has a constrained<br />

saddle point at a.<br />

Note that if dp+k = det(HL(λ,a)) = 0, then anything can happen.<br />

Example 7.34 (Constrained Minimization, continued) Returning to the problem
above,
L(λ,x,y,z) = 4xy + 2xz + 2yz − λ(xyz − 16),
∇L(λ,x,y,z) = ( 16 − xyz, 4y + 2z − yzλ, 4x + 2z − xzλ, 2x + 2y − xyλ ), and
HL(λ,x,y,z) = [ 0 −yz −xz −xy ; −yz 0 4−zλ 2−yλ ; −xz 4−zλ 0 2−xλ ; −xy 2−yλ 2−xλ 0 ].
If λ = 2, x = 2, y = 2, z = 4, then d1 = 0, d2 = −64, d3 = −512, d4 = −768.
We need to consider the subsequence −d3 = 512 and −d4 = 768. Since both terms are
positive, f has a constrained local minimum when x = 2, y = 2, z = 4.

Example 7.35 (k = 3, p = 2) Consider the following minimization problem:
Minimize x² + y² + z² subject to x + y + z = 12 and x + 2y + 3z = 18,
and let
L(λ,µ,x,y,z) = x² + y² + z² − λ(x + y + z − 12) − µ(x + 2y + 3z − 18).
Then
∇L(λ,µ,x,y,z) = ( 12 − x − y − z, 18 − x − 2y − 3z, 2x − λ − µ, 2y − λ − 2µ, 2z − λ − 3µ ),
HL(λ,µ,x,y,z) = [ 0 0 −1 −1 −1 ; 0 0 −1 −2 −3 ; −1 −1 2 0 0 ; −1 −2 0 2 0 ; −1 −3 0 0 2 ],
and d1 = d2 = d3 = 0, d4 = 1 and d5 = 12 (for all x, y, z).
Since
∇L(λ,µ,x,y,z) = O  ⇒  λ = 20, µ = −6, x = 7, y = 4, z = 1,
and (−1)² d5 = 12 > 0, there is a constrained minimum at (x,y,z) = (7,4,1).
Footnote: The minimization problem above corresponds to finding the point on the intersection of
the planes x + y + z = 12 and x + 2y + 3z = 18 which is closest to the origin.


7.4.4 Method of Steepest Descent/Ascent<br />

The first step in solving an optimization problem is to find the critical points of f by solving<br />

the derivative equation ∇f(x) = O. Since this equation may be difficult to solve, a variety<br />

of approximate methods have been developed. This section focuses on methods based on<br />

properties of directional derivatives and gradients. (See Section 7.3.5.)<br />

Method of steepest descent. Suppose that f has continuous partial derivatives in a<br />

neighborhood of a and has a local minimum at a. If x0 (an initial point) is close enough<br />

to the point a and ∇f(x0) = O, then<br />

The minimum value of Duf(x0) occurs when u = −∇f(x0)/‖∇f(x0)‖.

That is, the directional derivative is minimized in the direction of the negative of the gradient<br />

at the point x0.<br />

The idea of the method of steepest descent is to move in the direction of steepest descent<br />

as far as possible with the hope of getting closer to a. Specifically, let g(t) be the following<br />

1-variable function:<br />

<br />

g(t) = f( x0 − t ∇f(x0)/‖∇f(x0)‖ ).

Note that g(0) = f(x0). In addition, <strong>for</strong> values of t near zero, g(t) is decreasing. Let t0 > 0<br />

be the critical number of g closest to 0 and let x1 = x0 − t0 (∇f(x0)/‖∇f(x0)‖).

In general, the process is repeated, defining<br />

xi+1 = xi − ti (∇f(xi)/‖∇f(xi)‖),
until ti is close enough to zero.

Example 7.36 (Constrained Minimization, continued) The method is illustrated
using a familiar situation:
Minimize f(x,y) = 4xy + 32/y + 32/x over the domain where x > 0 and y > 0.
(See Figure 7.6.)

The following table shows the results of the first eight steps of the algorithm, using (3.2,4.1)<br />

as the initial point, and using four decimal places of accuracy:<br />

i 0 1 2 3 4 5 6 7<br />

ti 2.0973 0.9084 0.1588 0.0458 0.0038 0.0007 0.0000 0.0000<br />

xi 3.2 1.5789 2.1553 2.0325 2.0034 2.0005 2.0000 2.0000<br />

yi 4.1 2.7694 2.0672 1.9664 2.0019 1.9995 2.0000 2.0000<br />

Both t6 and t7 are virtually zero (and x6 and x7 are virtually equal), indicating that the<br />

algorithm has identified (2,2) as a local minimum point.<br />

Note that Newton’s method (Section 4.4.5) was used to find each ti.<br />
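A compact sketch of the descent iteration described above (NumPy assumed; not part of the original notes). For simplicity the inner one-dimensional problem is handled by a crude grid search over t rather than by Newton's method, so the iterates differ slightly from the table, but they approach the same point (2, 2).

    import numpy as np

    def f(v):
        x, y = v
        return 4*x*y + 32/y + 32/x

    def grad_f(v):
        x, y = v
        return np.array([4*y - 32/x**2, 4*x - 32/y**2])

    def line_value(x, u, t):
        p = x - t*u
        if p[0] <= 0 or p[1] <= 0:
            return np.inf              # stay inside the domain x > 0, y > 0
        return f(p)

    x = np.array([3.2, 4.1])
    for i in range(50):
        g = grad_f(x)
        u = g / np.linalg.norm(g)      # unit gradient; step in the opposite direction
        ts = np.linspace(0.0, 3.0, 3001)
        t_best = ts[int(np.argmin([line_value(x, u, t) for t in ts]))]
        if t_best < 1e-8:
            break
        x = x - t_best * u

    print(x, f(x))                     # approaches (2, 2) with minimum value 48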


Figure 7.7: Weight change in grams (vertical axis) versus dosage level in 100 mg/kg/day<br />

(horizontal axis) for ten animals. The gray curve is y = −0.49x² + 1.98x + 6.70.

Method of steepest ascent. To approximate a local maximum, the function
g(t) = f( xi + t ∇f(xi)/‖∇f(xi)‖ )
is used to find ti, and the formula
xi+1 = xi + ti (∇f(xi)/‖∇f(xi)‖)
is used to find the next approximate point.

7.5 Applications in Statistics

Optimization is an essential part of many problems in statistics. In this section, optimization<br />

techniques are used to solve problems in least squares and maximum likelihood estimation.<br />

7.5.1 Least Squares Estimation in Quadratic Models<br />

Given n data pairs (x1,y1), (x2,y2), ..., (xn,yn), whose values lie close to a quadratic curve<br />

of the <strong>for</strong>m<br />

y = ax² + bx + c,

the method of least squares, developed by Legendre in the early 1800’s, allows us to estimate<br />

the unknown parameters: Find the values of a, b, and c which minimize the sum of squared<br />

deviations of observed y-coordinates from the values expected if the data points fit the curve<br />

exactly. That is, minimize the following 3-variable function:<br />

f(a,b,c) = Σ_{i=1}^{n} ( yi − (a xi² + b xi + c) )².

Example 7.37 (Toxicology Study) (Fine & Bosch, JASA 94:375-382, 2000) As part of a<br />

study to determine the adverse effects of a proposed drug <strong>for</strong> the treatment of tuberculosis,<br />

female rats were given the drug <strong>for</strong> a period of 14 days at each of five dosage levels (in 100<br />

milligrams per kilogram per day).<br />



The vertical axis in Figure 7.7 is the weight change in grams (defined as the weight at the<br />

end of the period minus the weight at the beginning of the period) <strong>for</strong> ten animals in the<br />

study, and the horizontal axis shows the dose in 100 mg/kg/day.<br />

For these data, f(a,b,c) =<br />

562.63 + 703.45a + 7612.13a² − 15.1b + 2223.5ab + 172.5b² − 88.6c + 345ac + 62bc + 10c²,

∇f(a,b,c) =<br />

(703.45+15224.3a+2223.5b+345c, −15.1+2223.5a+345b+62c, −88.6+345a+62b+20c)<br />

and ∇f(a,b,c) = O when a = −0.49, b = 1.98, and c = 6.70.<br />

Since
Hf = [ 15224.3 2223.5 345 ; 2223.5 345 62 ; 345 62 20 ],  d1 = 15224.3,  d2 ≈ 3.1 × 10⁵,  d3 ≈ 1.7 × 10⁶,
and f is a quadratic function, f is minimized when a = −0.49, b = 1.98, and c = 6.70. The
least squares quadratic fit formula is shown in Figure 7.7.
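Because f(a,b,c) is quadratic, setting ∇f(a,b,c) = O gives a linear system whose coefficient matrix is the Hessian shown above. A sketch (NumPy assumed; not part of the original notes):

    import numpy as np

    H = np.array([[15224.3, 2223.5, 345.0],
                  [2223.5,   345.0,  62.0],
                  [345.0,     62.0,  20.0]])
    const = np.array([703.45, -15.1, -88.6])   # constant terms of the gradient

    # Gradient = const + H @ (a, b, c); setting it to zero gives a linear system.
    a, b, c = np.linalg.solve(H, -const)
    print(a, b, c)                              # approximately -0.49, 1.98, 6.70

    # Determinants of the principal minors of the Hessian confirm a minimum.
    print([np.linalg.det(H[:j, :j]) for j in (1, 2, 3)])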

Linear algebra and least squares. A linear algebra approach to least squares analysis<br />

is discussed in Sections 9.5.5 and 9.8.1 of these notes.<br />

7.5.2 Maximum Likelihood Estimation in Multinomial Models<br />

This section generalizes ideas from Section 4.5.1.<br />

Consider the problem of estimating the proportions p1,... ,pk in a multinomial experiment.<br />

Let p = (p1,p2,... ,pk), and let xi be the observed number of occurrences of outcome i in<br />

n trials (<strong>for</strong> i = 1,2,... ,k). We will demonstrate that the sample proportions, pi = xi/n,<br />

are maximum likelihood estimates of the proportions in the multinomial experiment using<br />

constrained optimization techniques.

The likelihood function, Lik(p), is the joint PDF with n and the xi fixed and p varying:
Lik(p) = C(n; x1, x2, ..., xk) · p1^{x1} p2^{x2} · · · pk^{xk},
where C(n; x1, x2, ..., xk) denotes the multinomial coefficient.
The log-likelihood function, ℓ(p), is the natural logarithm of the likelihood:
ℓ(p) = ln(Lik(p)) = ln C(n; x1, x2, ..., xk) + x1 ln(p1) + x2 ln(p2) + · · · + xk ln(pk).
We wish to maximize ℓ(p) subject to the constraint that the sum of the proportions is 1.
The Lagrangian function is
L(λ,p) = ℓ(p) − λ(p1 + p2 + · · · + pk − 1),


and the gradient of the Lagrangian function is<br />

∇L(λ,p) = (1 − p1 − p2 − · · · − pk,x1/p1 − λ,x2/p2 − λ,... ,xk/pk − λ).<br />

If all quantities are positive, then ∇L(λ,p) = O implies<br />

1 = Σi pi  and  pi = xi/λ for each i.

These equations imply that (n,x1/n,x2/n,...,xk/n) is a critical point.<br />

Example 7.38 (k = 3) To check that the result is a maximum, we consider the special
case when
k = 3, n = 40, x1 = 18, x2 = 12 and x3 = 10.
Then
HL = [ 0 −1 −1 −1 ; −1 −800/9 0 0 ; −1 0 −400/3 0 ; −1 0 0 −160 ],
d1 = 0,  d2 = −1,  d3 = 2000/9,  d4 = −1280000/27.
Since −d3 < 0 and −d4 > 0, the second derivative test for constrained local extrema implies
that there is a constrained local maximum when p1 = 0.45, p2 = 0.30, and p3 = 0.25 (and
λ = 40). There is both a local and global maximum at (p1, p2, p3) = (0.45, 0.30, 0.25).
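A simple numerical sanity check of this result (NumPy assumed; not part of the original notes): the log-likelihood, up to the additive constant ln C(n; x1,...,xk), is largest at the sample proportions.

    import numpy as np

    x = np.array([18.0, 12.0, 10.0])
    n = x.sum()                     # 40
    p_hat = x / n                   # (0.45, 0.30, 0.25)

    def loglik(p):
        """x1 ln p1 + ... + xk ln pk (the constant term does not affect the maximizer)."""
        p = np.asarray(p, dtype=float)
        return float(np.sum(x * np.log(p)))

    print(p_hat, loglik(p_hat))

    # Any other probability vector on the simplex gives a smaller value.
    rng = np.random.default_rng(0)
    for _ in range(5):
        q = rng.dirichlet([1.0, 1.0, 1.0])
        print(q, loglik(q) <= loglik(p_hat))    # True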

7.6 Multiple Integration<br />

This section generalizes ideas from Section 4.7.<br />

7.6.1 Riemann Sums and Double Integrals over Rectangles<br />

Suppose that the 2-variable function f(x,y) is continuous on the rectangle<br />

R = [a,b] × [c,d] = {(x,y) : a ≤ x ≤ b, c ≤ y ≤ d} ⊂ R 2<br />

and let m and n be positive integers. Further, let △x = (b − a)/m and △y = (d − c)/n,

1. xi = a + i△x <strong>for</strong> i = 0,1,... ,m, x ∗ i = (xi−1 + xi)/2 <strong>for</strong> i = 1,2,... ,m,<br />

2. yj = c + j△y <strong>for</strong> j = 0,1,... ,n, and y ∗ j = (yj−1 + yj)/2 <strong>for</strong> j = 1,2,... ,n.<br />

Then the (m + 1) numbers<br />

a = x0 < x1 < x2 < · · · < xm = b<br />

partition [a,b] into m subintervals of equal length with midpoints x∗ i , the (n + 1) numbers<br />

c = y0 < y1 < y2 < · · · < yn = d<br />


Figure 7.8: Plot of z = 3x² + y + 2 on [0,4] × [0,6] (left) and approximating boxes (right).

partition [c,d] into n subintervals of equal length with midpoints y*j, and the mn subrectangles
Ri,j = [xi−1, xi] × [yj−1, yj]
form an m-by-n partition of the rectangle R. Each subrectangle has area △A = (△x)(△y).

The double integral of f over R is<br />

<br />

∫∫_R f(x,y) dA = lim_{m,n→∞} Σ_{i=1}^{m} Σ_{j=1}^{n} f(x*i, y*j) △A.

In this <strong>for</strong>mula, f(x,y) is the integrand, and the sum on the right is called a Riemann sum.<br />

Continuity and Integrability. f(x,y) is said to be integrable on the rectangle R if<br />

the limit above exists. Continuous functions are always integrable. Further, it is possible<br />

to show that the limit exists when the values of f are bounded on R and the set of<br />

discontinuities has area zero.<br />

Volumes. If f is nonnegative on R, then the double integral over R can be interpreted<br />

as the volume under the surface z = f(x,y) and above the xy-plane <strong>for</strong> (x,y) ∈ R.<br />

Example 7.39 (Computing Volume) Let f(x,y) = 3x² + y + 2 on R = [0,4] × [0,6].

To estimate the volume under the surface z = f(x,y) and above the xy-plane over R (left<br />

part of Figure 7.8), we let m = 4, n = 6 and ∆x = ∆y = 1. The 24 midpoints are<br />

(x ∗ ,y ∗ ), where x ∗ = 0.5,1.5,2.5,3.5 and y ∗ = 0.5,1.5,... ,5.5,<br />

and the value of the Riemann sum is 498.0. This value is the sum of the volumes of the 24<br />

rectangular boxes shown in the right part of Figure 7.8.<br />

The following list of approximate values suggests that the limit is 504.<br />

m 4 8 16 32 64 128 256<br />

n 6 12 24 48 96 192 384<br />

Sum 498.0 502.5 503.625 503.906 503.977 503.994 503.999<br />
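The table of Riemann sums is straightforward to reproduce. A sketch of the midpoint rule on the rectangle [0,4] × [0,6] (NumPy assumed; the function name is ours):

    import numpy as np

    def midpoint_double_sum(f, a, b, c, d, m, n):
        """Midpoint Riemann sum for f over [a,b] x [c,d] with an m-by-n partition."""
        dx, dy = (b - a) / m, (d - c) / n
        xs = a + dx * (np.arange(m) + 0.5)      # midpoints x*_i
        ys = c + dy * (np.arange(n) + 0.5)      # midpoints y*_j
        X, Y = np.meshgrid(xs, ys)
        return float(np.sum(f(X, Y)) * dx * dy)

    f = lambda x, y: 3*x**2 + y + 2
    for m, n in [(4, 6), (8, 12), (16, 24), (32, 48)]:
        print(m, n, midpoint_double_sum(f, 0, 4, 0, 6, m, n))
    # 498.0, 502.5, 503.625, 503.906..., approaching 504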


Figure 7.9: Area slices of z = 3x² + y + 2 on [0,4] × [0,6] for fixed x = 0, 0.5, ..., 4 (left)

and <strong>for</strong> fixed y = 0,0.5,... ,6 (right).<br />

Average values. The quantity
( 1/((b − a)(d − c)) ) ∫∫_R f(x,y) dA
can be interpreted as the average value of f(x,y) on R.

To see this, let f̄ = (1/(mn)) Σ_{i,j} f(x*i, y*j) be the sample average, and let A = (b − a)(d − c) be
the area of the rectangle. Then
f̄ = (1/A) Σ_{i,j} f(x*i, y*j) △A   ⇒   lim_{m,n→∞} f̄ = (1/A) ∫∫_R f(x,y) dA.

For example, the list of approximations in the previous example suggests that the average
value of f(x,y) = 3x² + y + 2 on [0,4] × [0,6] is 504/24 = 21.

7.6.2 Iterated Integrals over Rectangles<br />

Suppose that f is a continuous, 2-variable function on the rectangle R = [a,b] × [c,d].<br />

We use the notation
∫_c^d f(x,y) dy
to mean that x is held fixed and f is integrated as y varies from c to d. Similarly, we use
the notation
∫_a^b f(x,y) dx
to mean that y is held fixed and f is integrated as x varies from a to b.
If f is nonnegative on R, then the results can be interpreted as areas.

Example 7.40 (Area Slices) Consider again f(x,y) = 3x² + y + 2 on [0,4] × [0,6].



The left part of Figure 7.9 shows area slices related to the partial functions of f when
x = 0, 0.5, 1, ..., 4. The area of each slice as a function of x is
Ax(x) = ∫_0^6 (3x² + y + 2) dy = [ 3x²y + (1/2)y² + 2y ]_0^6 = 18x² + 30,
and the areas of the displayed slices are 30, 34.5, 48, ..., 318.
The right part of Figure 7.9 shows area slices related to the partial functions of f when
y = 0, 0.5, 1, ..., 6. The area of each slice as a function of y is
Ay(y) = ∫_0^4 (3x² + y + 2) dx = [ x³ + yx + 2x ]_0^4 = 4y + 72,
and the areas of the displayed slices are 72, 74, 76, ..., 96.

Cavalieri’s principle. Continuing with the area application, Cavalieri’s principle (also<br />

called the method of slicing) says that the volume under the surface can be obtained by<br />

integrating the area function:<br />

Volume = ∫_c^d Ay(y) dy = ∫_c^d ∫_a^b f(x,y) dx dy = ∫_a^b Ax(x) dx = ∫_a^b ∫_c^d f(x,y) dy dx.

The right most integrals on each line above are examples of iterated integrals. In each case,<br />

the inside integral is evaluated first, followed by the outside integral.<br />

Example 7.41 (Cavalieri's Principle) Continuing with the example above, since
∫_0^4 (18x² + 30) dx = [ 6x³ + 30x ]_0^4 = 504  and  ∫_0^6 (4y + 72) dy = [ 2y² + 72y ]_0^6 = 504,
we know that the volume of the solid shown in the left part of Figure 7.8 is 504 cubic units.

Double versus iterated integrals. Fubini’s theorem says that we can compute double<br />

integrals by computing iterated integrals, where the iteration can be done in either order.<br />

Theorem 7.42 (Fubini’s Theorem) Suppose that f is a continuous, 2-variable function<br />

on the rectangle R = [a,b] × [c,d]. Then<br />

<br />

∫∫_R f(x,y) dA = ∫_c^d ∫_a^b f(x,y) dx dy = ∫_a^b ∫_c^d f(x,y) dy dx.

Note that Fubini’s theorem remains true if f is a bounded function on R satisfying (1) the<br />

set of discontinuities of f has area 0, and (2) each slice (<strong>for</strong> fixed x or <strong>for</strong> fixed y) intersects<br />

the set of discontinuities in at most a finite number of points.<br />
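Fubini's theorem also justifies checking such computations with a numerical integrator. A sketch using scipy.integrate.dblquad (an assumption, not part of the original notes; note that dblquad takes the integrand as func(y, x), with the x-limits listed first):

    from scipy.integrate import dblquad

    # Examples 7.39/7.41: 3x^2 + y + 2 over [0,4] x [0,6]
    value, err = dblquad(lambda y, x: 3*x**2 + y + 2, 0, 4, 0, 6)
    print(value)      # approximately 504

    # Example 7.43 (below): 10/x^2 - 90/(x + 2y)^2 over [1,2] x [0,2]
    value2, err2 = dblquad(lambda y, x: 10/x**2 - 90/(x + 2*y)**2, 1, 2, 0, 2)
    print(value2)     # approximately -12.9871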


Figure 7.10: Plot of z = 10/x² − 90/(x + 2y)² on [1,2] × [0,2] (left plot) and function domain

with y = x superimposed (right plot).<br />

Example 7.43 (Difference in Volumes) Let
f(x,y) = 10/x² − 90/(x + 2y)²  on R = [1,2] × [0,2],
as illustrated in Figure 7.10.
By Fubini's theorem,
∫∫_R f(x,y) dA = ∫_1^2 ∫_0^2 f(x,y) dy dx
= ∫_1^2 ( 20/x² − 45/x + 45/(4 + x) ) dx
= [ −20/x − 45 ln(x) + 45 ln(4 + x) ]_1^2 ≈ −12.9871.

Note that, given (x,y) ∈ R, f(x,y) ≥ 0 when y ≥ x and f(x,y) ≤ 0 when y ≤ x. Thus, this<br />

answer can be interpreted as the difference between (1) the volume of the solid bounded<br />

by z = f(x,y) and the xy-plane <strong>for</strong> (x,y) ∈ R with y ≥ x and (2) the volume of the solid<br />

bounded by z = f(x,y) and the xy-plane <strong>for</strong> (x,y) ∈ R with y ≤ x.<br />

7.6.3 Properties of Double Integrals<br />

Properties of double integrals generalize those for single integrals from Section 4.7.2.
Suppose that f(x,y) and g(x,y) are integrable functions on R, and let c be a constant. Then properties of limits imply:
1. Constant functions: ∬_R c dA = c × (Area of R).
2. Sums and differences: ∬_R (f(x,y) ± g(x,y)) dA = ∬_R f(x,y) dA ± ∬_R g(x,y) dA.
3. Constant multiples: ∬_R c f(x,y) dA = c ∬_R f(x,y) dA.
4. Null domain: If the area of R is 0, then ∬_R f(x,y) dA = 0.


y<br />

2<br />

1<br />

0.5 1 x<br />

1<br />

0.5<br />

y<br />

0.5 1 x<br />

Figure 7.11: Domains of type 1 (left) and of both types (right).<br />

5. Ordering: If f(x,y) ≤ g(x,y) for all (x,y) ∈ R, then ∬_R f(x,y) dA ≤ ∬_R g(x,y) dA.
6. Absolute value: If |f| is integrable, then |∬_R f(x,y) dA| ≤ ∬_R |f(x,y)| dA.

7.6.4 Double and Iterated Integrals over General Domains<br />

Double and iterated integrals can be evaluated over more general domains. Specifically,
1. Type 1: Let D = {(x,y) : a ≤ x ≤ b, ℓ(x) ≤ y ≤ u(x)}, where ℓ(x) and u(x) are continuous on [a,b], and let f be a continuous, 2-variable function on D. Then
∬_D f(x,y) dA = ∫_a^b ∫_{ℓ(x)}^{u(x)} f(x,y) dy dx.
2. Type 2: Let D = {(x,y) : c ≤ y ≤ d, ℓ(y) ≤ x ≤ u(y)}, where ℓ(y) and u(y) are continuous on [c,d], and let f be a continuous, 2-variable function on D. Then
∬_D f(x,y) dA = ∫_c^d ∫_{ℓ(y)}^{u(y)} f(x,y) dx dy.
The left part of Figure 7.11 illustrates a domain of type 1:
Dleft = {(x,y) : 0 ≤ x ≤ 1, 1 − x ≤ y ≤ 1 + x}.
The right part of the figure illustrates a domain of both types:
Dright = {(x,y) : 0 ≤ x ≤ 1, x² ≤ y ≤ x} = {(x,y) : 0 ≤ y ≤ 1, y ≤ x ≤ √y}.

Example 7.44 (Figure 7.11, Left Part) To illustrate computations, we compute the double integral of f(x,y) = 70xy² over the domain shown in the left part of Figure 7.11:
∬_{Dleft} 70xy² dA = ∫_0^1 ∫_{1−x}^{1+x} 70xy² dy dx = ∫_0^1 (140x² + 140x⁴/3) dx = 56.
Note that Dleft can be written as the union of two type 2 domains whose intersection is a set with area 0, and that the double integral can be computed as the sum of the values of two iterated integrals. Specifically,
Dleft = {(x,y) : 0 ≤ y ≤ 1, 1 − y ≤ x ≤ 1} ∪ {(x,y) : 1 ≤ y ≤ 2, y − 1 ≤ x ≤ 1}
(union of triangular regions with intersection the segment from (0,1) to (1,1)), and
∬_{Dleft} 70xy² dA = ∫_0^1 ∫_{1−y}^1 70xy² dx dy + ∫_1^2 ∫_{y−1}^1 70xy² dx dy = 21/2 + 91/2 = 56.
Example 7.45 (Difference in Volumes, continued) Consider again
f(x,y) = 10/x² − 90/(x + 2y)² on R = [1,2] × [0,2] (Figure 7.10).

The rectangle R can be written as the union of two type 1 domains whose intersection is<br />

the segment from (1,1) to (2,2):<br />

D1 = R ∩ {(x,y) : y ≥ x} and D2 = R ∩ {(x,y) : y ≤ x}.
Since
∬_{D1} f(x,y) dA = ∫_1^2 ∫_x^2 f(x,y) dy dx = ∫_1^2 (20/x² − 25/x + 45/(4 + x)) dx ≈ 0.8758
and
∬_{D2} f(x,y) dA = ∫_1^2 ∫_0^x f(x,y) dy dx = ∫_1^2 (−20/x) dx = [−20 ln(x)]_1^2 ≈ −13.8629,
the double integral ∬_R f(x,y) dA ≈ −12.9871, confirming the result obtained earlier.
Further, the total volume enclosed by
z = f(x,y), the xy-plane, x = 1, x = 2, y = 0 and y = 2
is approximately 0.8758 + 13.8629 = 14.7387.
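As a numerical cross-check of the three values above, iterated integrals with variable limits can be evaluated with a routine such as scipy.integrate.dblquad. This is only an illustrative sketch, assuming SciPy is available; it is not part of the original computation.

```python
from scipy.integrate import dblquad

# f(x, y) = 10/x^2 - 90/(x + 2y)^2; dblquad expects the integrand as g(y, x),
# with the inner variable (y) listed first.
f = lambda y, x: 10 / x**2 - 90 / (x + 2 * y)**2

whole, _ = dblquad(f, 1, 2, lambda x: 0, lambda x: 2)   # over R = [1,2] x [0,2]
upper, _ = dblquad(f, 1, 2, lambda x: x, lambda x: 2)   # over D1 (y >= x)
lower, _ = dblquad(f, 1, 2, lambda x: 0, lambda x: x)   # over D2 (y <= x)

print(round(whole, 4), round(upper, 4), round(lower, 4))
# approximately -12.9871, 0.8758, -13.8629
```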

Average values over general domains. If f is integrable over the general domain D with positive area A(D), then the average value of f over D is
Average of f over D = (1/A(D)) ∬_D f(x,y) dA.

Example 7.46 (Figure 7.11, Right Part) To illustrate computations, we find the<br />

average value of f(x,y) = 70xy 2 over the domain shown in the right part of Figure 7.11,<br />

and view the domain as a domain of type 1.<br />

The double integral of f over Dright is
∬_{Dright} 70xy² dA = ∫_0^1 ∫_{x²}^x 70xy² dy dx = ∫_0^1 (70x⁴/3 − 70x⁷/3) dx = 7/4.
The area of Dright is the double integral of the function 1 (or, equivalently, the single integral over [0,1] of the difference between the value of y on the upper curve and the value of y on the lower curve):
∬_{Dright} 1 dA = ∫_0^1 ∫_{x²}^x 1 dy dx = ∫_0^1 (x − x²) dx = 1/6.
Finally, the average value is (7/4)/(1/6) = 21/2.
7.6.5 Triple and Iterated Integrals over Boxes

Suppose that the 3-variable function f(x,y,z) is continuous on the box<br />

B = [a,b] × [c,d] × [p,q] = {(x,y,z) : a ≤ x ≤ b, c ≤ y ≤ d, p ≤ z ≤ q} ⊂ R 3<br />

and let ℓ, m and n be positive integers. Further, let ∆x = (b − a)/ℓ, ∆y = (d − c)/m and ∆z = (q − p)/n,

1. xi = a + i∆x <strong>for</strong> i = 0,1,... ,ℓ, x ∗ i = (xi−1 + xi)/2 <strong>for</strong> i = 1,2,... ,ℓ,<br />

2. yj = c + j∆y <strong>for</strong> j = 0,1,... ,m, y ∗ j = (yj−1 + yj)/2 <strong>for</strong> j = 1,2,... ,m,<br />

3. zk = p + k∆z <strong>for</strong> k = 0,1,... ,n, and z ∗ k = (zk−1 + zk)/2 <strong>for</strong> k = 1,2,... ,n.<br />

Then the (ℓ + 1) numbers<br />

a = x0 < x1 < · · · < xℓ = b<br />

partition [a,b] into ℓ subintervals of equal length with midpoints x∗ i , the (m + 1) numbers<br />

c = y0 < y1 < · · · < ym = d<br />

partition [c,d] into m subintervals of equal length with midpoints y∗ j , the (n + 1) numbers<br />

p = z0 < z1 < · · · < zn = q<br />

partition [p,q] into n subintervals of equal length with midpoints z∗ k , and the ℓmn subboxes<br />

Bi,j,k = [xi−1,xi] × [yj−1,yj] × [zk−1,zk] (i = 1,... ,ℓ; j = 1,... ,m; k = 1,... ,n)<br />

<strong>for</strong>m an ℓ-by-m-by-n partition of B. Each subbox has volume △V = (△x)(△y)(△z).<br />

The triple integral of f over B is
∭_B f(x,y,z) dV = lim_{ℓ,m,n→∞} Σ_{i=1}^{ℓ} Σ_{j=1}^{m} Σ_{k=1}^{n} f(x*_i, y*_j, z*_k) △V.
In this formula, f(x,y,z) is the integrand and the sum on the right is a Riemann sum.



Continuity and integrability. f(x,y,z) is said to be integrable on the box B if the limit<br />

above exists. Continuous functions are always integrable. Further, it is possible to show<br />

that the limit above exists if the values of f are bounded on B and the set of discontinuities<br />

of f on B has volume zero.<br />

Iterated integrals and Fubini’s theorem. A generalization of Fubini’s theorem (p. 154)<br />

states that if f is continuous on B, then the triple integral can be computed using iterated<br />

integrals, where the iteration can be done in any order. In particular,<br />

∭_B f(x,y,z) dV = ∫_a^b ∫_c^d ∫_p^q f(x,y,z) dz dy dx.
(There are a total of six orders.)

More generally, if f is a bounded function on B satisfying (1) the set of discontinuities of<br />

f has volume 0, and (2) each line parallel to one of the coordinate axes intersects the set<br />

of discontinuities of f in at most a finite number of points, then the triple integral can be<br />

computed using iterated integrals in any order.<br />

Average values. The quantity
(1/((b − a)(d − c)(q − p))) ∭_B f(x,y,z) dV
can be interpreted as the average value of f on B.
To see this, let
f̄ = (1/(ℓmn)) Σ_{i,j,k} f(x*_i, y*_j, z*_k)
be the sample average, and let V = (b − a)(d − c)(q − p) be the volume of the box. Then
f̄ = (1/V) Σ_{i,j,k} f(x*_i, y*_j, z*_k) (V/(ℓmn)) = (1/V) Σ_{i,j,k} f(x*_i, y*_j, z*_k) △V
⇒ lim_{ℓ,m,n→∞} f̄ = (1/V) ∭_B f(x,y,z) dV.

Example 7.47 (Average Value) To illustrate computations, we find the average of
f(x,y,z) = 12x²(y − z) on B = [−1,1] × [−1,2] × [0,3].
Using the generalization of Fubini’s theorem,
∭_B f(x,y,z) dV = ∫_{−1}^1 ∫_{−1}^2 ∫_0^3 12x²(y − z) dz dy dx
= ∫_{−1}^1 ∫_{−1}^2 (−54x² + 36x²y) dy dx
= ∫_{−1}^1 (−108x²) dx = −72.
Since the volume of B is 18, the average of f on B is −72/18 = −4.
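A symbolic check of this example, assuming the SymPy library, evaluates the iterated integral in the order dz, dy, dx and then divides by the volume of the box.

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = 12 * x**2 * (y - z)

# Iterated integration, innermost variable first (any order gives the same value).
triple = sp.integrate(f, (z, 0, 3), (y, -1, 2), (x, -1, 1))
volume = (1 - (-1)) * (2 - (-1)) * (3 - 0)

print(triple, triple / volume)   # -72 and -4
```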


6<br />

z<br />

4<br />

2<br />

0<br />

0<br />

1<br />

x<br />

0<br />

2<br />

1<br />

0.5<br />

y<br />

25<br />

z<br />

0<br />

0<br />

5<br />

y<br />

10 0<br />

Figure 7.12: Domains of integration <strong>for</strong> triple integral examples.<br />

7.6.6 Triple Integrals over General Domains<br />

Triple integrals can also be computed over more general domains. There are six basic types,<br />

which depend on the order of integration.<br />

One of the six types can be defined as follows:
W = {(x,y,z) : a ≤ x ≤ b, ℓ1(x) ≤ y ≤ u1(x), ℓ2(x,y) ≤ z ≤ u2(x,y)},
where
1. ℓ1(x) and u1(x) are continuous on [a,b] and
2. ℓ2(x,y) and u2(x,y) are continuous on the 2-dimensional domain {(x,y) : a ≤ x ≤ b, ℓ1(x) ≤ y ≤ u1(x)}.
If f(x,y,z) is continuous on W, then
∭_W f(x,y,z) dV = ∫_a^b ∫_{ℓ1(x)}^{u1(x)} ∫_{ℓ2(x,y)}^{u2(x,y)} f(x,y,z) dz dy dx.

Example 7.48 (Figure 7.12, Left Part) Let f(x,y,z) = xy on
W = {(x,y,z) : 0 ≤ x ≤ 2, 0 ≤ y ≤ 1, 0 ≤ z ≤ 4 + x + y}
(left part of Figure 7.12).
Then, the triple integral of f over W is
∭_W f dV = ∫_0^2 ∫_0^1 ∫_0^{4+x+y} xy dz dy dx = ∫_0^2 ∫_0^1 (4xy + x²y + xy²) dy dx = ∫_0^2 (7x/3 + x²/2) dx = 6.


Average values over general domains. If f is integrable over the general domain W with positive volume V(W), then the average value of f over W is
Average value of f over W = (1/V(W)) ∭_W f(x,y,z) dV.

Example 7.49 (Figure 7.12, Right Part) Let f(x,y,z) = 3x on
W = {(x,y,z) : 0 ≤ y ≤ 10, y ≤ x ≤ 20 − y, 0 ≤ z ≤ 25 − y}
(right part of Figure 7.12).
The triple integral of f over W is
∭_W f dV = ∫_0^{10} ∫_y^{20−y} ∫_0^{25−y} 3x dz dx dy = ∫_0^{10} ∫_y^{20−y} (75x − 3xy) dx dy = ∫_0^{10} (15000 − 2100y + 60y²) dy = 65000.
The volume of W is the triple integral of the function 1 (equivalently, the volume is the double integral of 25 − y over {(x,y) : 0 ≤ y ≤ 10, y ≤ x ≤ 20 − y}):
V(W) = ∫_0^{10} ∫_y^{20−y} ∫_0^{25−y} 1 dz dx dy = ∫_0^{10} ∫_y^{20−y} (25 − y) dx dy = ∫_0^{10} (500 − 70y + 2y²) dy = 6500/3.
Finally, the average is (65000)/(6500/3) = 30.
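Averages over general domains can also be estimated by simulation: sample points uniformly from a bounding box, keep those that land in W, and average f over the kept points. The sketch below is a Monte Carlo check of the value 30, assuming Python with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Sample uniformly in a bounding box of W and keep the points that fall in W.
x = rng.uniform(0, 20, N)
y = rng.uniform(0, 10, N)
z = rng.uniform(0, 25, N)
in_W = (y <= x) & (x <= 20 - y) & (z <= 25 - y)

print(np.mean(3 * x[in_W]))   # close to 30
```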

Extensions to k variables. Definitions and techniques for double and triple integrals can be extended to define integration methods for k-variable functions. The extensions are straightforward, but the computations can be quite tedious.

7.6.7 Partial Anti-Differentiation and Partial Differentiation<br />

Computing iterated integrals involves computing partial anti-derivatives. This section explores the processes of partial differentiation and partial anti-differentiation further in the 2-variable setting.

Theorem 7.50 (Fixed Endpoints) If the 2-variable function f has continuous first<br />

partial derivatives on the rectangle [a,b] × [c,d], then<br />

(d/dy) ∫_a^b f(x,y) dx = ∫_a^b fy(x,y) dx for y ∈ [c,d]
and
(d/dx) ∫_c^d f(x,y) dy = ∫_c^d fx(x,y) dy for x ∈ [a,b].


To illustrate the first equality, let f(x,y) = 4x³y² and [a,b] = [1,2]. Then
(d/dy) ∫_1^2 4x³y² dx = (d/dy)(15y²) = 30y and ∫_1^2 ∂/∂y (4x³y²) dx = ∫_1^2 8x³y dx = 30y.
The theorem can be extended to integrals with variable endpoints, using the rule for the total derivative from Section 7.3.4.

7.6.8 Change of Variables and the Jacobian<br />

The change of variables technique <strong>for</strong> multiple integrals generalizes the method of substitution<br />

<strong>for</strong> single integrals (see page 83). In this section, we introduce the technique in the<br />

2-variable setting.<br />

Consider a vector-valued function T : D ∗ ⊆ R 2 → R 2 , with <strong>for</strong>mula<br />

T(u,v) = (x(u,v),y(u,v)) <strong>for</strong> all (u,v) ∈ D ∗ ,<br />

and let D = T(D ∗ ) be the image of D ∗ . We say that the trans<strong>for</strong>mation T maps the<br />

domain D ∗ in the uv-plane to the domain D in the xy-plane.<br />

The functions x(u,v) and y(u,v) are called the coordinate functions of T.<br />

The Jacobian. Assume that each coordinate function has continuous first partial derivatives. The Jacobian of the transformation T, denoted by ∂(x,y)/∂(u,v), is defined as follows:
∂(x,y)/∂(u,v) = det[ xu  xv ; yu  yv ] = xu yv − xv yu.
(The Jacobian is the determinant of the matrix of partial derivatives.)

Example 7.51 (Linear Transformation) Let
(x,y) = T(u,v) = (au + bv, cu + dv),
where a, b, c and d are constants.
The formula for T can be written using matrix notation. Specifically,
x = T(u) = Au,
where
A = [ a  b ; c  d ],  x = [ x ; y ]  and  u = [ u ; v ].
A is the matrix of partial derivatives of T; the Jacobian of T is det(A) = ad − bc.


Figure 7.13: Unit square in uv-plane (left) and parallelogram in xy-plane (right).

Example 7.52 (Nonlinear Transformation) Let
(x,y) = T(u,v) = (√(u/v), √(uv)),
where u and v are positive.
The (simplified) matrix of partial derivatives is
[ xu  xv ; yu  yv ] = [ 1/(2√(uv))   −√u/(2v√v) ; √v/(2√u)   √u/(2√v) ],
and the (simplified) Jacobian is 1/(2v).

Stretching/shrinking factor. The absolute value of the Jacobian is the “stretching”<br />

(or “shrinking”) factor which tells us how areas in xy-space are related to areas in uv-space.<br />

For linear trans<strong>for</strong>mations, the relationship between areas is easy to state.<br />

Theorem 7.53 (Linear Transformations) Let x = T(u) = Au, where
A = [ a  b ; c  d ] and det(A) = ad − bc ≠ 0.
Then
1. T is a one-to-one and onto transformation.
2. T maps triangles to triangles and parallelograms to parallelograms in such a way that vertices are mapped to vertices.
3. If D∗ is a parallelogram in the uv-plane and D = T(D∗) in the xy-plane, then
(Area of D) = |∂(x,y)/∂(u,v)| (Area of D∗) = |det(A)| (Area of D∗).


Example 7.54 (Area of Parallelogram) The parallelogram D in the xy-plane whose corners are (0,0), (1,−1), (4,0) and (3,1) is the image of D∗ = [0,1] × [0,1] in the uv-plane under the linear transformation
(x,y) = T(u,v) = (3u + v, u − v) for (u,v) ∈ D∗,
as illustrated in Figure 7.13.
The Jacobian of T is det[ 3  1 ; 1  −1 ] = −4.
Since the area of D∗ is 1, the area of D is
(Area of D) = |−4| (Area of D∗) = 4.
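The area relationship can be verified numerically: compute |det(A)| and compare it with the area of the image of the unit square obtained from the mapped corners (shoelace formula). A small sketch, assuming NumPy is available:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, -1.0]])            # matrix of T(u, v) = (3u + v, u - v)
unit_square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])
corners = unit_square @ A.T            # images of the corners of [0,1] x [0,1]

# Shoelace formula for the area of the image parallelogram.
xs, ys = corners[:, 0], corners[:, 1]
shoelace = 0.5 * abs(np.dot(xs, np.roll(ys, -1)) - np.dot(ys, np.roll(xs, -1)))

print(abs(np.linalg.det(A)), shoelace)   # both equal 4.0
```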

Change of variables. In 2-variable substitution, we replace
dA = dx dy with dA = |∂(x,y)/∂(u,v)| du dv.
The complete formula is given in the following theorem.

Theorem 7.55 (Change of Variables) Let D and D ∗ be basic regions (of types 1 or<br />

2) in the xy- and uv-planes, respectively. Suppose that<br />

T : D ∗ ⊆ R 2 → R 2<br />

maps D∗ onto D in a one-to-one fashion, and that each coordinate function of T has<br />

continuous first partial derivatives. Further, suppose that f is a 2-variable, continuous<br />

function on D. Then<br />

∬_D f(x,y) dx dy = ∬_{D∗} f(x(u,v), y(u,v)) |∂(x,y)/∂(u,v)| du dv.

The change of variables technique is used to simplify the boundaries of integration, to<br />

simplify the integrand, or both.<br />

Example 7.56 (Simplifying Integrand) Consider evaluating the double integral
∬_D cos((x − y)/(x + y)) dA
where D is the triangular region in the xy-plane bounded by x = 0, y = 0, x + y = 1.
The integrand can be simplified by using
u = x − y, v = x + y ⟺ x = (u + v)/2, y = (−u + v)/2.
Since
(1) x = 0 ⇒ u = −v, (2) y = 0 ⇒ u = v and (3) x + y = 1 ⇒ v = 1,
the domain D∗ = {(u,v) : 0 ≤ v ≤ 1, −v ≤ u ≤ v}. The Jacobian is 1/2.
The integral becomes
∬_D cos((x − y)/(x + y)) dA = ∫_0^1 ∫_{−v}^v cos(u/v) (1/2) du dv = ∫_0^1 v sin(1) dv = (1/2) sin(1).
(Note that, by substitution, ∫ cos(u/v) du = v sin(u/v) + C.)

Example 7.57 (Simplifying Boundary) Consider evaluating the double integral
∬_D xy dA
where D is the region bounded by xy = 1, xy = 4, xy² = 1, xy² = 4.
The boundary can be simplified by using
u = xy, v = xy² ⟺ x = u²/v, y = v/u.
The region D∗ = [1,4] × [1,4]. After simplification, the Jacobian is 1/v.
The integral becomes
∬_D xy dA = ∫_1^4 ∫_1^4 u (1/v) du dv = ∫_1^4 15/(2v) dv = 15 ln(4)/2.
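A symbolic check of this change of variables, assuming SymPy is available: compute the Jacobian of (x,y) = (u²/v, v/u) and evaluate the transformed integral.

```python
import sympy as sp

u, v = sp.symbols('u v', positive=True)
x = u**2 / v
y = v / u

# Jacobian of the transformation (x, y) = (u^2/v, v/u).
J = sp.simplify(sp.Matrix([[sp.diff(x, u), sp.diff(x, v)],
                           [sp.diff(y, u), sp.diff(y, v)]]).det())
print(J)                                         # 1/v

# Transformed integral: integrand xy = u, times |Jacobian| = 1/v, over [1,4] x [1,4].
val = sp.integrate(u * (1 / v), (u, 1, 4), (v, 1, 4))
print(sp.simplify(val))                          # 15*log(4)/2
```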

Remarks. As stated above, the absolute value of the Jacobian of a linear trans<strong>for</strong>mation<br />

is related to a change in area. If T is a nonlinear trans<strong>for</strong>mation with continuous first<br />

partial derivatives, then it is approximately linear in small enough regions. Thus, <strong>for</strong> small<br />

regions, the absolute value of the Jacobian is an approximate change in area.<br />

Linear trans<strong>for</strong>mations will be studied further in Chapter 9 of these notes.<br />

Polar coordinate transformations. Consider (x,y) in polar coordinates
x = r cos(θ), y = r sin(θ),
where r = √(x² + y²) is the distance between (x,y) and (0,0), and θ is the angle measured counterclockwise from the positive x-axis to the ray from (0,0) to (x,y).
Transformations from domains in the rθ-plane to domains in the xy-plane have Jacobian
∂(x,y)/∂(r,θ) = det[ xr  xθ ; yr  yθ ] = det[ cos(θ)  −r sin(θ) ; sin(θ)  r cos(θ) ] = r cos²(θ) + r sin²(θ) = r,
since sin²(θ) + cos²(θ) = 1 for all θ.
Polar coordinate transformations are particularly useful when the domain of integration is a disk (the boundary plus the interior of a circle), or a wedge of a disk.


Example 7.58 (Quarter Disk) Consider evaluating
∬_D (x² + y²) dA
where D is the subset of the disk x² + y² ≤ 1 in the first quadrant. Then D can be described in polar coordinates as D∗ = {(r,θ) : 0 ≤ r ≤ 1, 0 ≤ θ ≤ π/2}, and
∬_D (x² + y²) dA = ∫_0^1 ∫_0^{π/2} r² r dθ dr = ∫_0^1 (π/2) r³ dr = π/8.

Application of polar transformation. An important application of the polar coordinate transformation in probability and statistics is the evaluation of the double integral
∬_D e^{−(x²+y²)} dA, where D = R².
D can be described in polar coordinates as D∗ = {(r,θ) : 0 ≤ r < ∞, 0 ≤ θ ≤ 2π} and
∬_D e^{−(x²+y²)} dA = ∫_0^∞ ∫_0^{2π} e^{−r²} r dθ dr
= π ∫_0^∞ e^{−r²} (2r) dr
= π [−e^{−r²}]_0^{r→∞}   (by substitution)
= π (lim_{r→∞}(−e^{−r²}) + 1)
= π(0 + 1) = π.
7.7 Applications in Statistics

Multiple integration methods are important in statistics, and, in particular, when working<br />

with joint density functions. Joint density functions are used to “smooth over” joint<br />

probability histograms, and as functions supporting the theory of multivariate continuous<br />

random variables (Chapter 8). This section previews some ideas.<br />

7.7.1 Computing Probabilities for Continuous Random Pairs
If the continuous random pair (X,Y) has joint density function f(x,y), then the probability that (X,Y) takes values in the domain D is the double integral of f over D:
P((X,Y) ∈ D) = ∬_D f(x,y) dA.

Example 7.59 (Joint Distribution on Rectangle) Let
f(x,y) = (1/8)(x² + y²) for (x,y) ∈ [−1,2] × [−1,1]


Figure 7.14: Volume (left) over D (right) is P(X ≥ Y) for a random pair on [−1,2] × [−1,1].

(and 0 otherwise) be the joint density function of a continuous random pair (X,Y ).<br />

To find the probability that X is greater than or equal to Y, for example, we need to compute the double integral of f over the domain D, where D is the intersection of the rectangle [−1,2] × [−1,1] with the half-plane x ≥ y:
P(X ≥ Y) = ∫_{−1}^1 ∫_y^2 (1/8)(x² + y²) dx dy = ∫_{−1}^1 (1/3 + y²/4 − y³/6) dy = 5/6.
The domain D is pictured in the right part of Figure 7.14; the probability P(X ≥ Y) is the volume of the solid shown in the left part of the figure.
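The value 5/6 can be checked by numerical integration over the part of the rectangle where x ≥ y, for example with scipy.integrate.dblquad (an assumption; any numerical integrator would serve).

```python
from scipy.integrate import dblquad

# P(X >= Y): integrate the joint density over y in [-1, 1], x in [y, 2].
# dblquad integrates func(inner, outer); here inner = x and outer = y.
density = lambda x, y: (x**2 + y**2) / 8
prob, _ = dblquad(density, -1, 1, lambda y: y, lambda y: 2)

print(prob, 5 / 6)   # both approximately 0.8333
```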

7.7.2 Contours for Independent Standard Normal Random Variables
Assume that the continuous random pair (X,Y) has joint density function
f(x,y) = (1/(2π)) exp(−(x² + y²)/2) for (x,y) ∈ R²,

where exp() is the exponential function. The left part of Figure 7.15 is a graph of the<br />

surface z = f(x,y) when (x,y) ∈ [−3,3] × [−3,3].<br />

Contours <strong>for</strong> the joint density are circles in the xy-plane centered at the origin. The right<br />

part of Figure 7.15 shows contours enclosing 5%, 20%, ..., 95% of the joint probability.<br />

Given a > 0, let
Da = {(x,y) : x² + y² ≤ a²}
be the disk of radius a centered at the origin. The polar transformation can be used to compute the probability that values of (X,Y) lie in Da. Specifically,
P(X² + Y² ≤ a²) = ∬_{Da} f(x,y) dA = ∫_0^{2π} ∫_0^a (1/(2π)) e^{−r²/2} r dr dθ = 1 − e^{−a²/2}.
(See the end of the last section for a similar computation.)


z<br />

-3<br />

0<br />

x<br />

3 -3<br />

0<br />

y<br />

3<br />

-3 -1.5 1.5 3 x<br />

-1.5<br />

Figure 7.15: Joint density (left) and contours enclosing 5%, 20%, ..., 95% of probability<br />

when X and Y are independent standard normal random variables.<br />

To find the value of a so that Da encloses a given proportion p of probability, we solve<br />

p = P(X 2 + Y 2 ≤ a 2 ) = 1 − e −a2 /2 ⇒ a =<br />

3<br />

1.5<br />

-3<br />

y<br />

<br />

−2ln(1 − p).<br />

Further, <strong>for</strong> (x,y) satisfying x 2 + y 2 = a 2 = −2ln(1 − p), f(x,y) = (1 − p)/(2π).<br />
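The relationship between p and the contour radius a can be checked by simulation, assuming Python with NumPy: generate standard normal pairs and count the fraction falling inside the disk of radius a = √(−2 ln(1 − p)).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500_000)
y = rng.standard_normal(500_000)

for p in (0.05, 0.20, 0.50, 0.95):
    a = np.sqrt(-2 * np.log(1 - p))          # radius of the contour enclosing p
    inside = np.mean(x**2 + y**2 <= a**2)    # simulated probability
    print(round(a, 3), p, round(inside, 3))  # inside is close to p
```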

Remarks. The joint density given in this section corresponds to the density of independent<br />

standard normal random variables. See Section 8.1.9 <strong>for</strong> details about bivariate normal<br />

distributions.<br />

Probability contours <strong>for</strong> bivariate continuous distributions like the bivariate normal distribution<br />

take the place of quantiles <strong>for</strong> univariate continuous distributions.<br />



8 Joint Continuous Distributions<br />

Recall, from Chapter 3, that a probability distribution describing the joint variability of<br />

two or more random variables is called a joint distribution, and that a bivariate distribution<br />

is the joint distribution of a pair of random variables.<br />

8.1 Bivariate Distributions<br />

This section considers joint distributions of pairs of continuous random variables.<br />

8.1.1 Joint PDF and Joint CDF<br />

In the bivariate continuous setting, we assume that X and Y are continuous random variables<br />

and that there exists a nonnegative function f(x,y) defined over the entire xy-plane<br />

such that<br />

<br />

P((X,Y ) ∈ D) = f dA <strong>for</strong> every D ⊆ R2 .<br />

D<br />

The function f(x,y) is called the joint probability density function (joint PDF) (or joint<br />

density function) of the random pair (X,Y ). The notation fXY (x,y) is sometimes used to<br />

emphasize the two random variables.<br />

Joint PDFs satisfy the following two properties:<br />

1. f(x,y) ≥ 0 for all pairs (x,y).
2. ∬_{R²} f(x,y) dA = 1.

The joint cumulative distribution function (joint CDF) of the random pair (X,Y), denoted by F(x,y), is the function
F(x,y) = P(X ≤ x, Y ≤ y) = ∬_{Dxy} f dA, for all real pairs (x,y),
where Dxy = {(u,v) : u ≤ x, v ≤ y}.
Joint CDFs satisfy the following properties:
1. lim_{(x,y)→(−∞,−∞)} F(x,y) = 0 and lim_{(x,y)→(∞,∞)} F(x,y) = 1.
2. If x1 ≤ x2 and y1 ≤ y2, then F(x1,y1) ≤ F(x2,y2).
3. F(x,y) is continuous.

Example 8.1 (Joint Distribution in 1st Quadrant) Assume that (X,Y ) has the<br />

following joint density function:<br />

f(x,y) = 2e −x−2y when x,y ≥ 0 and 0 otherwise.<br />

Given (x,y) in the first quadrant (both coordinates nonnegative),
F(x,y) = ∫_0^y ∫_0^x 2e^{−u−2v} du dv = ∫_0^y 2e^{−2v}(1 − e^{−x}) dv = (1 − e^{−x})(1 − e^{−2y}).
F(x,y) is equal to 0 otherwise.

Further, to compute the probability that X is greater than or equal to Y , <strong>for</strong> example, we<br />

compute the double integral of the joint PDF over the intersection of the half-plane x ≥ y<br />

with the first quadrant:<br />

P(X ≥ Y) = ∫_0^∞ ∫_y^∞ 2e^{−x−2y} dx dy = ∫_0^∞ 2e^{−3y} dy = 2/3.
8.1.2 Marginal Distributions

The marginal probability density function (marginal PDF) (or marginal density function) of X is
fX(x) = ∫_y f(x,y) dy for x in the range of X
and 0 otherwise, where the integral is taken over all y in the range of Y.
Similarly, the marginal PDF (or marginal density function) of Y is
fY(y) = ∫_x f(x,y) dx for y in the range of Y
and 0 otherwise, where the integral is taken over all x in the range of X.

Example 8.2 (Joint Distribution in 1st Quadrant, continued) For the joint distribution<br />

described above:<br />

1. When x ≥ 0, ∫_0^∞ f(x,y) dy = e^{−x}. Thus,
fX(x) = e^{−x} when x ≥ 0 and 0 otherwise.
(X is an exponential random variable with parameter 1.)
2. When y ≥ 0, ∫_0^∞ f(x,y) dx = 2e^{−2y}. Thus,
fY(y) = 2e^{−2y} when y ≥ 0 and 0 otherwise.
(Y is an exponential random variable with parameter 2.)
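Marginal densities can be obtained symbolically by integrating the joint PDF over the other variable. A minimal sketch for the joint density above, assuming SymPy is available:

```python
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = 2 * sp.exp(-x - 2 * y)          # joint PDF on the first quadrant

f_X = sp.integrate(f, (y, 0, sp.oo))   # marginal of X
f_Y = sp.integrate(f, (x, 0, sp.oo))   # marginal of Y

print(f_X, f_Y)                        # exp(-x) and 2*exp(-2*y)
```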

Example 8.3 (Joint Distribution on Eighth Plane) Assume that (X,Y ) has the<br />

following joint density function:<br />

f(x,y) = (1/4) e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise.

Figure 8.1: Plot of z = (1/4)e^{−y/2} on 0 ≤ x ≤ y (left) and region of nonzero density (right).

The left plot in Figure 8.1 shows part of the surface z = f(x,y) (with vertical planes added)<br />

and the right plot shows part of the region of nonzero density.<br />

The volume of the solid shown in the left part of Figure 8.1 is equal to the probability that Y is at most 11:
P(Y ≤ 11) = ∫_0^{11} ∫_0^y (1/4) e^{−y/2} dx dy = ∫_0^{11} (1/4) y e^{−y/2} dy = 1 − (13/2) e^{−11/2} ≈ 0.9734.
Further, for this random pair,
1. When x ≥ 0, ∫_x^∞ f(x,y) dy = (1/2) e^{−x/2}. Thus,
fX(x) = (1/2) e^{−x/2} when x ≥ 0 and 0 otherwise.
(X is an exponential random variable with parameter 1/2.)
2. When y ≥ 0, ∫_0^y f(x,y) dx = (1/4) y e^{−y/2}. Thus,
fY(y) = (1/4) y e^{−y/2} when y ≥ 0 and 0 otherwise.
(Y is a gamma random variable with shape parameter 2 and scale parameter 2.)

8.1.3 Conditional Distributions<br />

Let X and Y be continuous random variables with joint PDF f(x,y), and marginal PDFs<br />

fX(x) and fY (y), respectively.<br />

If fY(y) ≠ 0, then the conditional probability density function (conditional PDF) (or conditional density function) of X given Y = y is defined as follows:
f_{X|Y=y}(x|y) = f(x,y)/fY(y) for all real numbers x.
Similarly, if fX(x) ≠ 0, then the conditional PDF (or conditional density function) of Y given X = x is
f_{Y|X=x}(y|x) = f(x,y)/fX(x) for all real numbers y.

Example 8.4 (Joint Distribution in 1st Quadrant, continued) Consider again the<br />

random pair with joint density function<br />

f(x,y) = 2e −x−2y when x,y ≥ 0 and 0 otherwise.<br />

1. Given x ≥ 0, the conditional distribution of Y given X = x has PDF<br />

f Y |X=x(y|x) = 2e −2y when y ≥ 0 and 0 otherwise.<br />

(The conditional distribution is the same as the marginal distribution.)<br />

2. Given y ≥ 0, the conditional distribution of X given Y = y has PDF<br />

f X|Y =y(x|y) = e −x when x ≥ 0 and 0 otherwise.<br />

(The conditional distribution is the same as the marginal distribution.)<br />

Example 8.5 (Joint Distribution on Eighth Plane, continued) Consider again the<br />

random pair with joint density function<br />

f(x,y) = (1/4) e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise. (Figure 8.1)
1. Given x ≥ 0, the conditional distribution of Y given X = x has PDF
f_{Y|X=x}(y|x) = (1/2) e^{−(y−x)/2} when y ≥ x and 0 otherwise.
(The conditional distribution is a shifted exponential distribution.)
2. Given y > 0, the conditional distribution of X given Y = y has PDF
f_{X|Y=y}(x|y) = 1/y when 0 ≤ x ≤ y and 0 otherwise.
(The conditional distribution is uniform on the interval [0,y].)

Remarks. The function f_{X|Y=y}(x|y) is a valid PDF since f_{X|Y=y}(x|y) ≥ 0 and
∫_x f_{X|Y=y}(x|y) dx = (1/fY(y)) ∫_x f(x,y) dx = (1/fY(y)) fY(y) = 1.
Similarly, the function f_{Y|X=x}(y|x) is a valid PDF. Thus, conditional distributions are distributions in their own right.
Although, for example, fY(y) does not represent probability, we can think of the conditional density function f_{X|Y=y}(x|y) as a limit of probabilities. Specifically,
f_{X|Y=y}(x|y) = lim_{ǫ→0} ∂/∂x P(X ≤ x | y − ǫ ≤ Y ≤ y + ǫ).
Thus, the conditional PDF of X given Y = y can be thought of as the conditional PDF of X given that Y is very close to y.


8.1.4 Independent Random Variables<br />

The continuous random variables X and Y are said to be independent if their joint density<br />

function equals the product of their marginal density functions <strong>for</strong> all real pairs,<br />

f(x,y) = fX(x)fY (y) <strong>for</strong> all (x,y).<br />

Equivalently, X and Y are independent if their joint CDF equals the product of their<br />

marginal CDFs <strong>for</strong> all real pairs,<br />

F(x,y) = FX(x)FY (y) <strong>for</strong> all (x,y).<br />

Random variables that are not independent are said to be dependent.<br />

Example 8.6 (Joint Distribution in 1st Quadrant, continued) The random variables<br />

with joint density function<br />

are independent.<br />

f(x,y) = 2e −x−2y = (e −x )(2e −2y ) when x,y ≥ 0 and 0 otherwise<br />

Example 8.7 (Joint Distribution on Eighth Plane, continued) The random variables<br />

with joint density function<br />

f(x,y) = (1/4) e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise (Figure 8.1)
are dependent. To demonstrate the dependence of X and Y, we just need to show that f(x,y) ≠ fX(x)fY(y) for some pair (x,y). For example, f(2,1) ≠ fX(2)fY(1).

8.1.5 Mathematical Expectation for Bivariate Continuous Distributions
Let X and Y be continuous random variables with joint density function f(x,y) and let g(X,Y) be a real-valued function.
The mean of g(X,Y) (or the expected value of g(X,Y) or the expectation of g(X,Y)) is
E(g(X,Y)) = ∫_x ∫_y g(x,y) f(x,y) dy dx
provided the integral converges absolutely. The double integral is assumed to include all pairs with nonzero joint density.

Example 8.8 (Joint Distribution on Triangle) Assume that (X,Y) has the following joint density function
f(x,y) = 2/15 when 0 ≤ x ≤ 3, 0 ≤ y ≤ 5x/3 and 0 otherwise.
(The random pair has constant density on the triangle with corners (0,0), (3,0), (3,5).)
Let g(X,Y) = XY be the product of the variables. Then
E(g(X,Y)) = ∫_0^3 ∫_0^{5x/3} xy (2/15) dy dx = ∫_0^3 (5/27) x³ dx = 15/4.
4 .


Properties of expectation. The properties of expectation stated in Section 3.1.6 in the<br />

bivariate discrete case are also true in the bivariate continuous case.<br />

8.1.6 Covariance and Correlation<br />

Let X and Y be random variables with finite means (µx, µy) and finite standard deviations<br />

(σx, σy). The covariance of X and Y , Cov(X,Y ), is defined as follows:<br />

Cov(X,Y ) = E((X − µx)(Y − µy)).<br />

The notation σxy = Cov(X,Y) is used to denote the covariance. The correlation of X and Y, Corr(X,Y), is defined as follows:
Corr(X,Y) = Cov(X,Y)/(σx σy) = σxy/(σx σy).

The notation ρ = Corr(X,Y ) is used to denote the correlation of X and Y ; ρ is called the<br />

correlation coefficient.<br />

The properties of covariance and correlation stated in Section 3.1.7 in the bivariate discrete<br />

case are also true in the bivariate continuous case.<br />

Example 8.9 (Joint Distribution on Triangle, continued) Continuing with the joint distribution above,
E(X) = 2, E(Y) = 5/3, E(X²) = 9/2, E(Y²) = 25/6, E(XY) = 15/4.
Using properties of variance and covariance:
Var(X) = E(X²) − (E(X))² = 1/2, Var(Y) = E(Y²) − (E(Y))² = 25/18,
Cov(X,Y) = E(XY) − E(X)E(Y) = 5/12 and ρ = Corr(X,Y) = 1/2.
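The moments in this example can be verified symbolically, assuming SymPy is available, by integrating over the triangular region (y from 0 to 5x/3, then x from 0 to 3):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.Rational(2, 15)                 # constant density on the triangle

def E(g):
    # expectation of g(X, Y): integrate g * f over the triangle
    return sp.integrate(g * f, (y, 0, 5 * x / 3), (x, 0, 3))

EX, EY, EXY = E(x), E(y), E(x * y)
VarX, VarY = E(x**2) - EX**2, E(y**2) - EY**2
cov = EXY - EX * EY
rho = cov / sp.sqrt(VarX * VarY)

print(EX, EY, cov, sp.simplify(rho))   # 2, 5/3, 5/12, 1/2
```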

Example 8.10 (Joint Distribution on Diamond) Assume that the continuous pair (X,Y) has joint density function
f(x,y) = 1/2 when |x| + |y| ≤ 1 and 0 otherwise.
(Joint density is constant on the “diamond” with corners (−1,0), (0,1), (1,0), (0,−1).)
For this joint distribution,
1. E(X) = E(Y) = E(XY) = 0.
2. Cov(X,Y) = 0 and Corr(X,Y) = 0.
3. The marginal PDF of X is fX(x) = 1 − |x| when −1 ≤ x ≤ 1 and 0 otherwise.
4. The marginal PDF of Y is fY(y) = 1 − |y| when −1 ≤ y ≤ 1 and 0 otherwise.
Since Corr(X,Y) = 0, X and Y are uncorrelated. But, X and Y are dependent since, for example, f(0,0) ≠ fX(0)fY(0).


Figure 8.2: Graph of z = f(x,y) for a continuous bivariate distribution (left plot) and contour plot with z = 1/2, 1/4, 1/6, ..., 1/20 and conditional expectation of Y given X = x superimposed (right plot).

8.1.7 Conditional Expectation and Regression<br />

Let X and Y be continuous random variables with joint density function f(x,y).<br />

If fX(x) ≠ 0, then the conditional expectation (or the conditional mean) of Y given X = x, E(Y|X = x), is defined as follows:
E(Y|X = x) = ∫_y y f_{Y|X=x}(y|x) dy,
where the integral is over all y with nonzero conditional density (f_{Y|X=x}(y|x) ≠ 0), provided the integral converges absolutely.

The conditional expectation of X given Y = y, E(X|Y = y), is defined similarly.<br />

Example 8.11 (Conditional Expectation) Assume that the continuous pair (X,Y) has joint density function
f(x,y) = e^{−yx}/(2√x) when x ≥ 1, y ≥ 0 and 0 otherwise.
1. The marginal PDF of X is
fX(x) = 1/(2x^{3/2}) when x ≥ 1 and 0 otherwise.
2. The conditional PDF of Y given X = x is
f_{Y|X=x}(y|x) = x e^{−xy} when y ≥ 0 and 0 otherwise.
(The conditional distribution is exponential with parameter x.)
3. Since the expected value of an exponential random variable is the reciprocal of its parameter (see Table 5.2), the conditional mean formula is
E(Y|X = x) = 1/x for x ≥ 1.
The left part of Figure 8.2 is a graph of z = f(x,y) (with vertical planes added) and the right part is a contour plot with z = 1/2, 1/4, 1/6, ..., 1/20 (in gray). The formula for the conditional mean, y = 1/x, is superimposed on the contour plot (in black).

Regression equation. Note that the formula for the conditional expectation
E(Y|X = x) as a function of x
is called the regression equation of Y on X. Similarly, the formula for the conditional expectation E(X|Y = y) as a function of y is called the regression equation of X on Y.
Linear conditional means. Let X and Y be random variables with finite means (µx, µy), standard deviations (σx, σy), and correlation (ρ).
If E(Y|X = x) is a linear function of x, then the formula is of the form:
E(Y|X = x) = µy + ρ (σy/σx)(x − µx).
Similarly, if E(X|Y = y) is a linear function of y, then the formula is of the form:
E(X|Y = y) = µx + ρ (σx/σy)(y − µy).

Example 8.12 (Two Step Experiment) Consider the following 2 step experiment:
Step 1: Choose X uniformly from the interval [10,20].
Step 2: Given that X = x, choose Y uniformly from the interval [0,x].
Then the continuous pair (X,Y) has joint density function
f(x,y) = fX(x) f_{Y|X=x}(y|x) = (1/10)(1/x) = 1/(10x)
when 10 ≤ x ≤ 20, 0 ≤ y ≤ x and 0 otherwise. (The region of nonzero density is the trapezoidal region with corners (10,0), (10,10), (20,20), (20,0).)
Since the conditional distribution of Y given X = x is uniform on the interval [0,x], and the expected value of a uniform random variable is the midpoint of its interval (see Table 5.2), the conditional mean formula is
E(Y|X = x) = x/2 for 10 ≤ x ≤ 20
(a linear function of x). Further, since
µx = 15, µy = 15/2, σx = 5/√3, σy = 5√31/6 and ρ = √(3/31)
for this random pair, µy + ρ(σy/σx)(x − µx) = x/2, verifying the formula given above.
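The two-step construction is easy to simulate, which gives a check on the moments and on the linear conditional mean formula. A sketch assuming Python with NumPy (the conditioning on X near 16 is only an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.uniform(10, 20, n)      # Step 1
y = rng.uniform(0, x)           # Step 2: uniform on [0, x], componentwise

print(y.mean(), 15 / 2)                              # about 7.5
print(np.corrcoef(x, y)[0, 1], np.sqrt(3 / 31))      # about 0.311

# Conditional mean check: the average of Y for X near 16 should be close to 8.
near_16 = np.abs(x - 16) < 0.05
print(y[near_16].mean())                             # about 8
```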


8.1.8 Example: Bivariate Uniform Distribution
Let R be a region of the plane with finite positive area. The continuous random pair (X,Y) is said to have a bivariate uniform distribution on the region R if its joint PDF is as follows:
f(x,y) = 1/(Area of R) when (x,y) ∈ R and 0 otherwise.
Bivariate uniform distributions have constant density over the region of nonzero density. That constant density is the reciprocal of the area. If A is a subregion of R (A ⊆ R), then
P((X,Y) ∈ A) = (Area of A)/(Area of R).

That is, the probability of the event “the point (x,y) is in A” is the ratio of the area of the<br />

subregion to the area of the full region.<br />

The joint distributions on the triangle and on the diamond considered earlier are examples<br />

of bivariate uni<strong>for</strong>m distributions on regions with areas 15/2 and 2, respectively.<br />

Expectations. If (X,Y ) has a bivariate uni<strong>for</strong>m distribution on R and g(X,Y ) is a realvalued<br />

function, then E(g(X,Y )) is the same as the average value of the 2-variable function<br />

g over the region R, as discussed on page 157.<br />

Marginal distributions. If (X,Y ) has a bivariate uni<strong>for</strong>m distribution, then X and Y<br />

may not be uni<strong>for</strong>m random variables.<br />

Consider again the joint distribution on the diamond. X is not a uni<strong>for</strong>m random variable<br />

since its PDF is not constant on the interval [−1,1]. Similarly, Y is not a uni<strong>for</strong>m random<br />

variable since its PDF is not constant on [−1,1].<br />

8.1.9 Example: Bivariate Normal Distribution<br />

Let µx and µy be real numbers, σx and σy be positive real numbers, and ρ be a number<br />

in the interval −1 < ρ < 1. The random pair (X,Y ) is said to have a bivariate normal<br />

distribution with parameters µx, µy, σx, σy and ρ when its joint PDF is as follows
f(x,y) = (1/(2π σx σy √(1 − ρ²))) exp( −term/(2(1 − ρ²)) ) for all real pairs (x,y),
where
term = ((x − µx)/σx)² − 2ρ ((x − µx)/σx)((y − µy)/σy) + ((y − µy)/σy)²
and exp() is the exponential function.
For this random pair,
2


Figure 8.3: Joint density (left) and contours enclosing 5%, 35%, 65% and 95% of the probability for the standard bivariate normal distribution with correlation ρ = −0.80.

1. µx and σx are the mean and standard deviation of X,<br />

2. µy and σy are the mean and standard deviation of Y and<br />

3. ρ (the correlation coefficient) is the correlation between X and Y .<br />

Further, if ρ = 0, then X and Y are independent.<br />

Standard bivariate normal distribution. The random pair (X,Y ) is said to have a<br />

standard bivariate normal distribution with parameter ρ when (X,Y ) has a bivariate normal<br />

distribution with µx = µy = 0 and σx = σy = 1.<br />

The joint PDF of (X,Y) is as follows:
f(x,y) = (1/(2π√(1 − ρ²))) exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) ) for all real pairs (x,y).

Example 8.13 (ρ = −0.80) The left part of Figure 8.3 shows the joint density function<br />

of the bivariate standard normal distribution with correlation ρ = −0.80.<br />

The right part of the figure shows contours enclosing 5%, 35%, 65% and 95% of the probability<br />

of this joint distribution. Each contour is an ellipse whose major axis lies along the<br />

line y = −x and whose minor axis lies along the line y = x.<br />

Marginal and conditional distributions. If (X,Y ) has a bivariate normal distribution,<br />

then each marginal and conditional distribution is normal. Specifically,<br />

1. X is a normal random variable with mean µx and standard deviation σx,<br />

2. Y is a normal random variable with mean µy and standard deviation σy,<br />

Figure 8.4: The region of nonzero density for a bivariate uniform distribution on the open rectangle (0,2) × (0,1) (left plot) and the probability density function of the product of the coordinates with values in (0,2) (right plot). Contours for products equal to 0.2, 0.6, 1.0, and 1.4 are shown in the left plot.

3. the conditional distribution of Y given X = x is normal with mean
E(Y|X = x) = µy + ρ (σy/σx)(x − µx)
and standard deviation σy√(1 − ρ²), and
4. the conditional distribution of X given Y = y is normal with mean
E(X|Y = y) = µx + ρ (σx/σy)(y − µy)
and standard deviation σx√(1 − ρ²).

Remarks. The computations needed to find probability contours <strong>for</strong> bivariate normal<br />

distributions generalize ideas from Section 7.7.2 and are discussed further in the next section.<br />

8.1.10 Trans<strong>for</strong>ming Continuous Random Variables<br />

Let (U,V ) be a continuous random pair with joint density function f(u,v). This section<br />

considers scalar-valued and vector-valued continuous trans<strong>for</strong>mations of (U,V ).<br />

Scalar-valued functions. We first consider scalar-valued functions.<br />

Example 8.14 (Product Trans<strong>for</strong>mation) Let U be the length, V be the width, and<br />

X = UV be the area of a random rectangle. Specifically, assume that U is a uni<strong>for</strong>m<br />

random variable on the interval (0,2), V is a uni<strong>for</strong>m random variable on the interval (0,1),<br />

and that U and V are independent.<br />

Since the joint PDF of (U,V) is
f(u,v) = fU(u) fV(v) = 1/2 when (u,v) ∈ (0,2) × (0,1) and 0 otherwise,


(U,V ) has a bivariate uni<strong>for</strong>m distribution on the open rectangle.<br />

For 0 < x < 2,
P(X ≤ x) = P(UV ≤ x) = x/2 + ∫_x^2 (x/(2u)) du = (1/2)(x + x ln(2) − x ln(x))
and (d/dx) P(X ≤ x) = (1/2)(ln(2) − ln(x)) = (1/2) ln(2/x). Thus,
fX(x) = (1/2) ln(2/x) when 0 < x < 2 and 0 otherwise.

The left part of Figure 8.4 shows the region of nonzero joint density, with contours corresponding<br />

to x = 0.2,0.6,1.0,1.4 superimposed. The right part is a plot of the density<br />

function of X = UV .<br />

Example 8.15 (Sum Trans<strong>for</strong>mation) Let X = U +V , where U and V are independent<br />

exponential random variables with parameter 1. The range of X is the nonnegative real<br />

numbers.<br />

Since U and V are independent, the joint density function of (U,V) is
f(u,v) = fU(u) fV(v) = e^{−u−v} when u,v ≥ 0 and 0 otherwise.
Given x ≥ 0,
P(X ≤ x) = ∫_0^x ∫_0^{x−u} e^{−u−v} dv du = ∫_0^x (e^{−u} − e^{−x}) du = 1 − e^{−x} − x e^{−x},
and (d/dx) P(X ≤ x) = x e^{−x} (using the product rule for derivatives). Thus,

fX(x) = xe −x when x ≥ 0 and 0 otherwise.<br />

(X is a gamma random variable with shape parameter 2 and scale parameter 1.)<br />

Working with sums. If X = U + V and f(u,v) has continuous partial derivatives, then the PDF of X can be computed as follows:
fX(x) = ∫_{u∈R} f(u, x − u) du, where R is the range of U.
This result can be proven using the relationship between partial differentiation and partial anti-differentiation given in Section 7.6.7.
For the example above, since the joint density function is nonzero when both arguments are nonnegative, the integral reduces to
fX(x) = ∫_0^x e^{−x} du = x e^{−x} when x ≥ 0 and 0 otherwise,
confirming the result obtained using double integration followed by differentiation.


Vector-valued functions. We next consider vector-valued functions.<br />

Let (x,y) = T(u,v) be a vector-valued function of two variables. Assume that the coordinate<br />

functions of the transformation, x = x(u,v) and y = y(u,v), have continuous partial derivatives and that the Jacobian of the transformation
∂(x,y)/∂(u,v) = det[ xu  xv ; yu  yv ] = xu yv − xv yu ≠ 0 for all (u,v).
(See Section 7.6.8 for a discussion of the Jacobian.)
If (X,Y) = T(U,V) and f(u,v) has continuous partial derivatives, then, for all (x,y) with nonzero joint density,
fXY(x,y) = f(u,v) / |∂(x,y)/∂(u,v)|, where (x,y) = T(u,v).
(This result generalizes the result for monotone transformations from page 100.)

Example 8.16 (Standard Bivariate Normal Distribution) Let U and V be independent standard normal random variables and let
(X,Y) = T(U,V) = (U, ρU + √(1 − ρ²) V),
where −1 < ρ < 1. Then (X,Y) has a standard bivariate normal distribution with correlation ρ. To see this,
1. The Jacobian of the transformation is
∂(x,y)/∂(u,v) = det[ xu  xv ; yu  yv ] = det[ 1  0 ; ρ  √(1 − ρ²) ] = √(1 − ρ²).
2. Since x = u, y = ρu + √(1 − ρ²) v implies u = x, v = (1/√(1 − ρ²))(−ρx + y),
u² + v² = x² + (1/(1 − ρ²))(−ρx + y)² = (x² − 2ρxy + y²)/(1 − ρ²).
3. Since f(u,v) = exp(−(u² + v²)/2)/(2π), the formula above implies that
fXY(x,y) = (1/(2π√(1 − ρ²))) exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) ) for all (x,y).

The result is the same as the joint density function of the standard bivariate normal<br />

distribution with correlation ρ given earlier.<br />
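The transformation can be checked by simulation, assuming Python with NumPy: generate independent standard normals, apply T, and compare the sample moments with those of the standard bivariate normal distribution with correlation ρ.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = -0.80
u = rng.standard_normal(1_000_000)
v = rng.standard_normal(1_000_000)

x = u
y = rho * u + np.sqrt(1 - rho**2) * v

# Sample moments should match the standard bivariate normal with correlation rho.
print(x.mean(), y.mean())           # both near 0
print(x.std(), y.std())             # both near 1
print(np.corrcoef(x, y)[0, 1])      # near -0.80
```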

Example 8.17 (Bivariate Normal Distribution) More generally, if U and V are independent standard normal random variables and we let
(X,Y) = T(U,V) = (µx + σx U, µy + σy (ρU + √(1 − ρ²) V)),
then (X,Y) has a bivariate normal distribution with parameters µx, µy, σx, σy and ρ.
The steps in the derivation follow those in the previous example.


Remarks. The linear transformations given in the last two examples can be used to find probability contours for bivariate normal distributions. That is, the transformations can be used to find contours enclosing probability p, for 0 < p < 1.
We consider the first transformation.
1. Let U and V be independent standard normal random variables, Ca be the circle of radius a centered at the origin and
Da = {(u,v) : u² + v² ≤ a²}
be the disk of radius a centered at the origin. If a = √(−2 ln(1 − p)), then
P((U,V) ∈ Da) = p
(see Section 7.7.2) and Ca is the contour enclosing probability p.
Further, if (u,v) ∈ Ca, then f(u,v) = (1 − p)/(2π).
2. Let (X,Y) = T(U,V) = (U, ρU + √(1 − ρ²) V), where −1 < ρ < 1.
The image of Da under T,
T(Da) = {(x,y) : x² − 2ρxy + y² ≤ a²(1 − ρ²)},
is an elliptical region centered at the origin, whose boundary, T(Ca), is the contour enclosing probability p.
Further, if (x,y) ∈ T(Ca), then fXY(x,y) = (1 − p)/(2π√(1 − ρ²)).
Further, if (x,y) ∈ T(Ca), then fXY (x,y) = (1 − p)/(2π 1 − ρ 2 ).<br />

Note that the right part of Figure 8.3 shows contours corresponding to p = 0.05, 0.35, 0.65<br />

and 0.95 when ρ = −0.80.<br />

Bivariate normal distributions (and multivariate normal distributions) will be studied further<br />

in Chapter 9 of these notes.<br />

8.2 Multivariate Distributions<br />

Recall (from Section 3.2) that a multivariate distribution is the joint distribution of k<br />

random variables. Ideas studied in the bivariate case (k = 2) can be generalized to the case<br />

where k > 2.<br />

8.2.1 Joint PDF and Joint CDF<br />

Let X1, X2, ..., Xk be continuous random variables and let<br />

X = (X1,X2,...,Xk).<br />

We assume there exists a nonnegative k-variable function f with domain Rᵏ such that
P(X ∈ W) = ∫ ··· ∫_W f(x1,x2,...,xk) dx1 ··· dxk
for every W ⊆ Rᵏ. The function f(x1,x2,...,xk) is called the joint probability density function (joint PDF) (or joint density function) of the random k-tuple X.

The joint cumulative distribution function (joint CDF) of X is defined as follows:<br />

F(x) = P(X1 ≤ x1,X2 ≤ x2,... ,Xk ≤ xk) <strong>for</strong> all x = (x1,x2,... ,xk) ∈ R k ,<br />

where commas are understood to mean the intersection of events.<br />

Example 8.18 (Joint Distribution on 1st Octant) Assume that X = (X,Y,Z) has joint density function
f(x,y,z) = 6/(1 + x + y + z)⁴ when x,y,z ≥ 0 and 0 otherwise.
Given x,y,z ≥ 0,
F(x,y,z) = ∫_0^x ∫_0^y ∫_0^z f(u,v,w) dw dv du
= yz(2 + y + z)/((1 + y)(1 + z)(1 + y + z)) − yz(2 + 2x + y + z)/((1 + x)(1 + x + y)(1 + x + z)(1 + x + y + z)).
The joint CDF is equal to 0 otherwise.
Let c be a positive constant. To find the probability that the sum of the three variables is at most c, for example, we compute the triple integral of f over the region
W = {(x,y,z) : 0 ≤ x ≤ c, 0 ≤ y ≤ c − x, 0 ≤ z ≤ c − x − y} ⊂ R³.
The probability is
P(X + Y + Z ≤ c) = ∭_W f(x,y,z) dV = (c/(c + 1))³.
8.2.2 Marginal and Conditional Distributions

The concepts of marginal and conditional distributions can also be generalized.<br />

Example 8.19 (Joint Distribution on 1st Octant, continued) To illustrate computations, we continue with the joint distribution defined above.
1. The marginal distribution of the random pair (X,Y) is
fXY(x,y) = ∫_0^∞ f(x,y,z) dz = 2/(1 + x + y)³ when x,y ≥ 0 and 0 otherwise.
2. The marginal distribution of Z is
fZ(z) = ∫_0^∞ ∫_0^∞ f(x,y,z) dx dy = 1/(1 + z)² when z ≥ 0 and 0 otherwise.
3. Let x and y be fixed nonnegative constants. The conditional distribution of Z given (X,Y) = (x,y) is
f_{Z|(X,Y)=(x,y)}(z|(x,y)) = f(x,y,z)/fXY(x,y) = 3(1 + x + y)³/(1 + x + y + z)⁴
when z ≥ 0 and 0 otherwise.
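These marginal and conditional computations can be verified symbolically, assuming SymPy is available:

```python
import sympy as sp

x, y, z = sp.symbols('x y z', nonnegative=True)
f = 6 / (1 + x + y + z)**4

f_XY = sp.simplify(sp.integrate(f, (z, 0, sp.oo)))                # 2/(1 + x + y)^3
f_Z = sp.simplify(sp.integrate(f, (x, 0, sp.oo), (y, 0, sp.oo)))  # 1/(1 + z)^2
cond = sp.simplify(f / f_XY)                                      # 3*(1 + x + y)^3/(1 + x + y + z)^4

print(f_XY, f_Z, cond)
```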

8.2.3 Mutually Independent Random Variables and Random Samples<br />

The continuous random variables X1, X2, ..., Xk are said to be mutually independent (or independent) if

    f(x) = f1(x1) f2(x2) ··· fk(xk)   for all x = (x1, x2, ..., xk) ∈ R^k,

where fi(xi) is the PDF of Xi for each i.

Equivalently, X1, X2, ..., Xk are said to be mutually independent if

    F(x) = F1(x1) F2(x2) ··· Fk(xk)   for all x = (x1, x2, ..., xk) ∈ R^k,

where Fi(xi) is the CDF of Xi for each i.

X1, X2, ..., Xk are said to be dependent if they are not independent.

Example 8.20 (Joint Distribution on 1st Octant, continued) The random variables X, Y and Z in the example above are dependent. To see this, first note that each random variable has the same marginal density function:

    fX(u) = fY(u) = fZ(u) = 1/(1 + u)^2   when u ≥ 0, and 0 otherwise.

To demonstrate dependence, we just need to show that f(x,y,z) ≠ fX(x)fY(y)fZ(z) for some triple (x,y,z). For example, f(1,1,1) = 6/4^4 = 3/128, while fX(1)fY(1)fZ(1) = (1/4)^3 = 1/64.

Random sample. If the continuous random variables X1, X2, ..., Xk are mutually<br />

independent and have a common distribution (each marginal distribution is the same),<br />

then X1, X2, ..., Xk are said to be a random sample from that distribution.<br />

8.2.4 Mathematical Expectation For Multivariate Continuous Distributions<br />

Let X1, X2, ..., Xk be continuous random variables with joint PDF f(x) and let g(X) be a real-valued function. The mean or expected value or expectation of g(X) is

    E(g(X)) = ∫_{xk} ··· ∫_{x1} g(x1, x2, ..., xk) f(x1, x2, ..., xk) dx1 dx2 ··· dxk,

provided that the integral converges absolutely. The multiple integral is assumed to include all k-tuples with nonzero joint PDF.


Example 8.21 (Product Function) You pick three numbers independently and uniformly from the interval [0,5]. Let X1, X2, X3 be the three numbers and let g(X1,X2,X3) be the function with rule g(X1,X2,X3) = X1 X2^2 X3.

By independence,

    f(x1,x2,x3) = 1/125   on the [0,5] × [0,5] × [0,5] box, and 0 otherwise.

The expected product is

    E(g(X1,X2,X3)) = ∫_0^5 ∫_0^5 ∫_0^5 x1 x2^2 x3 (1/125) dx3 dx2 dx1
                   = ∫_0^5 ∫_0^5 (1/10) x1 x2^2 dx2 dx1
                   = ∫_0^5 (25/6) x1 dx1 = 625/12 ≈ 52.08.

Properties of expectation. Properties of expectation in the multivariate case directly<br />

generalize properties in the univariate and bivariate cases. In particular,<br />

1. If a and b1, b2, ..., bn are constants and gi(X) are real-valued k-variable functions for each i = 1, 2, ..., n, then

       E( a + Σ_{i=1}^n bi gi(X) ) = a + Σ_{i=1}^n bi E(gi(X)).

2. If X1, X2, ..., Xk are mutually independent and gi(Xi) are real-valued 1-variable functions for i = 1, 2, ..., k, then

       E( g1(X1) g2(X2) ··· gk(Xk) ) = Π_{i=1}^k E(gi(Xi)).

Example 8.22 (Product Function, continued) Continuing with the example above, and using properties of expectation, the expected product is

    E(X1 X2^2 X3) = E(X1) E(X2^2) E(X3) = (5/2)(25/3)(5/2) = 625/12,

confirming the result obtained above.
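A simulation gives the same answer to within sampling error. The sketch below (using NumPy; the seed and the sample size of one million draws are arbitrary choices) estimates the expected product directly:

    # Monte Carlo check of Examples 8.21 and 8.22: E(X1 * X2^2 * X3) = 625/12 = 52.083...
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 5.0, size=(1_000_000, 3))     # three independent Uniform[0,5] draws per row
    estimate = np.mean(x[:, 0] * x[:, 1]**2 * x[:, 2])
    print(estimate, 625 / 12)                          # estimate is close to 52.083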

8.2.5 Sample Summaries<br />

If X1, X2, ..., Xn is a random sample from a distribution with mean µ and standard deviation σ, then the sample mean, X̄, is the random variable

    X̄ = (1/n)(X1 + X2 + ··· + Xn),

the sample variance, S^2, is the random variable

    S^2 = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)^2,


and the sample standard deviation, S, is the positive square root of the sample variance.<br />

The following theorem can be proven using properties of expectation (see Section 3.2.7):<br />

Theorem 8.23 (Sample Summaries) If X is the sample mean and S 2 is the sample<br />

variance of a random sample of size n from a distribution with mean µ and standard<br />

deviation σ, then<br />

1. E(X) = µ and V ar(X) = σ 2 /n.<br />

2. E(S 2 ) = σ 2 .<br />

Sample correlation. A random sample of size n from the joint (X,Y ) distribution is a<br />

list of n mutually independent random pairs, each with the same distribution as (X,Y ).<br />

If (X1,Y1), (X2,Y2), ..., (Xn,Yn) is a random sample of size n from a bivariate distribution with correlation ρ = Corr(X,Y), then the sample correlation, R, is the random variable

    R = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / sqrt( Σ_{i=1}^n (Xi − X̄)^2 · Σ_{i=1}^n (Yi − Ȳ)^2 ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively.

Applications. In statistical applications, observed values of the sample mean, sample<br />

variance, and sample correlation are used to estimate unknown values of the parameters µ,<br />

σ 2 and ρ, respectively.<br />
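In practice these summaries are computed directly from the observed data. The sketch below (using NumPy; the two short data vectors are made-up numbers, not the brain-body data of the next example) computes observed sample means, sample standard deviations, and the sample correlation:

    # Observed sample summaries for paired data (Section 8.2.5).
    import numpy as np

    x = np.array([0.12, -1.50, 2.30, 0.75, -0.40])   # hypothetical observations
    y = np.array([0.90, -0.80, 3.10, 1.40, 0.10])    # hypothetical observations

    xbar, ybar = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)            # ddof=1 gives the 1/(n-1) sample variance
    r = np.sum((x - xbar) * (y - ybar)) / np.sqrt(np.sum((x - xbar)**2) * np.sum((y - ybar)**2))
    print(xbar, sx, ybar, sy, r)
    print(np.corrcoef(x, y)[0, 1])                   # same value of r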

Example 8.24 (Brain-Body Weights) (Allison & Cicchetti, Science, 194:732-734, 1976;

lib.stat.cmu.edu/DASL.) As part of a study on sleep in mammals, researchers collected<br />

in<strong>for</strong>mation on the average brain weight and average body weight <strong>for</strong> 43 different species.<br />

(Body-weight, brain-weight) combinations ranged from (0.05kg,0.14g) <strong>for</strong> the short-tail<br />

shrew to (2547.0kg,4603.0g) <strong>for</strong> the Asian elephant. The data <strong>for</strong> “man” are<br />

(62kg,1320g) = (136.4lb,2.9lb).<br />

Let X be the common logarithm of body weight in kilograms, and Y be the common<br />

logarithm of brain weight in grams. Sample summaries are as follows:<br />

1. Mean log-body weight is x = 0.311738, with a SD of sx = 1.33113.<br />

2. Mean log-brain weight is y = 1.17421, with a SD of sy = 1.09415.<br />

3. Correlation between log-body weight and log-brain weight is r = 0.951693.<br />


Figure 8.5: Log-brain weight (vertical axis) versus log-body weight (horizontal axis) <strong>for</strong> the<br />

brain-body study. Probability contours of the estimated bivariate normal model and the<br />

estimated conditional expectation are superimposed.<br />

The researchers determined that a bivariate normal model was reasonable <strong>for</strong> these data.<br />

Figure 8.5 compares the common logarithms of average brain weight in grams (vertical<br />

axis) and average body weight in kilograms (horizontal axis) <strong>for</strong> the 43 species. Probability<br />

contours <strong>for</strong> the model with parameters equal to the sample summaries are shown on the<br />

plot; contours enclose 5%, 35%, 65% and 95% of probability. The pair <strong>for</strong> “man” is the<br />

only point outside the outermost contour.<br />

The estimated conditional expectation line,<br />

y = 1.17421 + 0.782263 (x − 0.311738),<br />

is displayed in the plot. Note, in particular, that the slope of the line is r(sy/sx) = 0.782263.<br />



9 Linear Algebra Concepts<br />

The central equations of interest in linear algebra are of the <strong>for</strong>m Ax = b, where A is an<br />

m-by-n coefficient matrix, x is an n-by-1 vector of unknowns, and b is an m-by-1 vector<br />

of values. Solutions come at three levels of sophistication: (1) direct solution methods, (2)<br />

matrix algebra methods, and (3) vector space methods.<br />

This chapter introduces linear algebra and its applications in statistics.

9.1 Solving Systems of Equations<br />

A linear equation in the variables x1,x2,... ,xn is an equation of the <strong>for</strong>m<br />

a1x1 + a2x2 + · · · + anxn = b<br />

where a1,a2,... ,an and b are constants. A system of linear equations in the variables<br />

x1,x2,... ,xn is a collection of one or more linear equations in these variables. For example,<br />

x1 −2x2 + x3 = 0<br />

2x2 −8x3 = 8<br />

−4x1 +5x2 +9x3 = −9<br />

is a system of 3 linear equations in the three unknowns x1,x2,x3 (a “3-by-3 system”).<br />

A solution of a linear system is a list (s1,s2,... ,sn) of numbers that makes each equation<br />

in the system true when si is substituted <strong>for</strong> xi <strong>for</strong> i = 1,2,... ,n. The list (29, 16, 3) is a<br />

solution to the system above. To check this, note that<br />

(29) −2(16) + (3) = 0<br />

2(16) −8(3) = 8<br />

−4(29) +5(16) +9(3) = −9<br />

The solution set of a linear system is the collection of all possible solutions of the system.<br />

For the example above, (29, 16, 3) is the unique solution.<br />

Two linear systems are equivalent if each has the same solution set.<br />

9.1.1 Elementary Row Operations<br />

The strategy <strong>for</strong> solving a linear system is to replace the system with an equivalent one that<br />

is easier to solve. The equivalent system is obtained using elementary row operations:<br />

1. (Replacement) Add a multiple of one row to another row,<br />

2. (Interchange) Interchange two rows,<br />

3. (Scaling) Multiply all entries in a row by a nonzero constant,<br />

where each “row” corresponds to an equation in the system.<br />

189


Example 9.1 (Using Row Operations) We work out the strategy using the 3-by-3 system given above, where a 3-by-4 augmented matrix of the coefficients and right hand side values follows our progress toward a solution.

(1)  R1:        x1 − 2x2 +  x3 =  0        [  1 −2  1 |  0 ]
     R2:              2x2 − 8x3 =  8        [  0  2 −8 |  8 ]
     R3:       −4x1 + 5x2 + 9x3 = −9        [ −4  5  9 | −9 ]

(2)  R1:        x1 − 2x2 +  x3 =  0        [  1 −2   1 |  0 ]
     R2:              2x2 − 8x3 =  8        [  0  2  −8 |  8 ]
     R3 + 4R1:      −3x2 + 13x3 = −9        [  0 −3  13 | −9 ]

(3)  R1:        x1 − 2x2 +  x3 =  0        [  1 −2   1 |  0 ]
     (1/2)R2:          x2 − 4x3 =  4        [  0  1  −4 |  4 ]
     R3:            −3x2 + 13x3 = −9        [  0 −3  13 | −9 ]

(4)  R1:        x1 − 2x2 +  x3 =  0        [  1 −2   1 |  0 ]
     R2:               x2 − 4x3 =  4        [  0  1  −4 |  4 ]
     R3 + 3R2:               x3 =  3        [  0  0   1 |  3 ]

(5)  R1 − R3:   x1 − 2x2 = −3               [  1 −2  0 | −3 ]
     R2 + 4R3:        x2 = 16               [  0  1  0 | 16 ]
     R3:              x3 =  3               [  0  0  1 |  3 ]

(6)  R1 + 2R2:  x1 = 29                     [  1  0  0 | 29 ]
     R2:        x2 = 16                     [  0  1  0 | 16 ]
     R3:        x3 =  3                     [  0  0  1 |  3 ]

Thus, the solution is x1 = 29, x2 = 16, and x3 = 3.
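The computation can be confirmed numerically. The sketch below (using NumPy; numpy.linalg.solve performs a similar elimination internally) solves the system directly:

    # Numerical check of Example 9.1.
    import numpy as np

    A = np.array([[ 1.0, -2.0,  1.0],
                  [ 0.0,  2.0, -8.0],
                  [-4.0,  5.0,  9.0]])
    b = np.array([0.0, 8.0, -9.0])
    print(np.linalg.solve(A, b))    # -> [29. 16.  3.]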

Forward and backward phases. In the <strong>for</strong>ward phase of the row reduction process of<br />

our example (corresponding to systems (1) through (4)), R1 is used to eliminate x1 from<br />

R2 and R3; then R2 is used to eliminate x2 from R3. In the backward phase of the process<br />

(corresponding to systems (5) and (6)), R3 is used to eliminate x3 from R1 and R2; then<br />

R2 is used to eliminate x2 from R1.<br />

Existence and uniqueness. Two fundamental questions about linear systems are:<br />

1. Does at least one solution exist?<br />

2. If a solution exists, is the solution unique?<br />

If a linear system has one or more solutions, then it is said to be consistent; otherwise, it is<br />

said to be inconsistent. A linear system can be inconsistent, or have a unique solution, or<br />

have infinitely many solutions.<br />

Note that in the example above, we knew that the system was consistent once we reached<br />

equivalent system (4); the remaining steps allowed us to find the unique solution.<br />



Example 9.2 (Infinitely Many Solutions) Consider the following 3-by-4 system:

      x1 +  x2 −  x3        =   9
      x1 + 2x2 − 3x3 − 4x4  =   9
    −3x1 − 2x2 + 2x3 − 3x4  = −23

and the following sequence of equivalent augmented matrices:

    [  1  1 −1  0 |   9 ]   [ 1 1 −1  0 | 9 ]   [ 1 1 −1  0 | 9 ]
    [  1  2 −3 −4 |   9 ] ~ [ 0 1 −2 −4 | 0 ] ~ [ 0 1 −2 −4 | 0 ]
    [ −3 −2  2 −3 | −23 ]   [ 0 1 −1 −3 | 4 ]   [ 0 0  1  1 | 4 ]

      [ 1 1 0  1 | 13 ]   [ 1 0 0  3 | 5 ]
    ~ [ 0 1 0 −2 |  8 ] ~ [ 0 1 0 −2 | 8 ].
      [ 0 0 1  1 |  4 ]   [ 0 0 1  1 | 4 ]

The last matrix corresponds to x1 + 3x4 = 5, x2 − 2x4 = 8, and x3 + x4 = 4. Thus,

    (s1, s2, s3, s4) = (5 − 3x4, 8 + 2x4, 4 − x4, x4)

is a solution for any value of x4.

Example 9.3 (Inconsistent System) Consider the following 3-by-3 system:

           3x2 − 6x3 =  8
      x1 − 2x2 + 3x3 = −1
     5x1 − 7x2 + 9x3 =  0

and the following sequence of equivalent augmented matrices:

    [ 0  3 −6 |  8 ]   [ 1 −2  3 | −1 ]   [ 1 −2  3 | −1 ]   [ 1 −2  3 | −1 ]
    [ 1 −2  3 | −1 ] ~ [ 0  3 −6 |  8 ] ~ [ 0  3 −6 |  8 ] ~ [ 0  3 −6 |  8 ].
    [ 5 −7  9 |  0 ]   [ 5 −7  9 |  0 ]   [ 0  3 −6 |  5 ]   [ 0  0  0 | −3 ]

The last matrix corresponds to x1 − 2x2 + 3x3 = −1, 3x2 − 6x3 = 8, and 0 = −3. Since the equation 0 = −3 is false, the system has no solution.

9.1.2 Row Reduction and Echelon Forms<br />

A matrix is in (row) echelon <strong>for</strong>m when<br />

1. All nonzero rows are above any row of zeros.<br />

2. Each leading entry (that is, leftmost nonzero entry) in a row is in a column to the<br />

right of the leading entries of the rows above it.<br />

3. All entries in a column below a leading entry are zero.<br />

Table 9.1: 6-by-11 matrices in echelon (left) and reduced echelon (right) forms.

    Echelon form:                        Reduced echelon form:
    [ 0 • ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]            [ 0 1 ∗ 0 0 ∗ ∗ 0 0 ∗ ∗ ]
    [ 0 0 0 • ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]            [ 0 0 0 1 0 ∗ ∗ 0 0 ∗ ∗ ]
    [ 0 0 0 0 • ∗ ∗ ∗ ∗ ∗ ∗ ]            [ 0 0 0 0 1 ∗ ∗ 0 0 ∗ ∗ ]
    [ 0 0 0 0 0 0 0 • ∗ ∗ ∗ ]            [ 0 0 0 0 0 0 0 1 0 ∗ ∗ ]
    [ 0 0 0 0 0 0 0 0 • ∗ ∗ ]            [ 0 0 0 0 0 0 0 0 1 ∗ ∗ ]
    [ 0 0 0 0 0 0 0 0 0 0 0 ]            [ 0 0 0 0 0 0 0 0 0 0 0 ]

An echelon <strong>for</strong>m matrix is in reduced echelon <strong>for</strong>m when (in addition)<br />

4. The leading entry in each nonzero row is 1.<br />

5. Each leading 1 is the only nonzero entry in its column.<br />

The left part of Table 9.1 represents a 6-by-11 echelon <strong>for</strong>m matrix, where • represents<br />

the leading entry, and ∗ represents any number (either zero or nonzero). The right part of<br />

Table 9.1 represents a 6-by-11 reduced echelon <strong>for</strong>m matrix.<br />

Starting with the augmented matrix of a system of linear equations, the <strong>for</strong>ward phase of<br />

the row reduction process will produce a matrix in echelon <strong>for</strong>m. Continuing the process<br />

through the backward phase, we get a reduced echelon <strong>for</strong>m matrix. An important theorem<br />

is the following:<br />

Theorem 9.4 (Uniqueness Theorem) Each matrix is row-equivalent to one and only<br />

one reduced echelon <strong>for</strong>m matrix.<br />

Further, we can “read” the solutions to the original system from the reduced echelon <strong>for</strong>m<br />

of the augmented matrix.<br />

Pivot positions. A pivot position is a position of a leading entry in an echelon <strong>for</strong>m<br />

matrix. A column that contains a pivot position is called a pivot column. For example, in<br />

the left part of Table 9.1, each • is located at a pivot position, and columns 2, 4, 5, 8, and<br />

9 are pivot columns.<br />

A pivot is a nonzero number that either is used in a pivot position to create zeros or is<br />

changed into a leading 1, which in turn is used to create zeros. In general there is no more<br />

than one pivot in any row and no more than one pivot in any column.<br />

Example 9.5 (Echelon Forms) Consider the following 3-by-6 matrix:

    [ 0  3 −6  6 4 −5 ]
    [ 3 −7  8 −5 8  9 ]
    [ 3 −9 12 −9 6 15 ]

We first interchange R1 and R3:

    [ 3 −9 12 −9 6 15 ]
    [ 3 −7  8 −5 8  9 ]
    [ 0  3 −6  6 4 −5 ]

Column 1 is a pivot column and we use the 3 in the upper left corner to convert the 3 in row 2 to 0 (R2 − R1 replaces R2). In addition, R1 is replaced by (1/3)R1 for simplicity:

    [ 1 −3  4 −3 2  5 ]
    [ 0  2 −4  4 2 −6 ]
    [ 0  3 −6  6 4 −5 ]

Column 2 is a pivot column. We could use the current row 2, or interchange the second and third rows and use the new row two for the pivot. For simplicity, we use the 2 in the second row to convert the 3 in the third row to a zero (R3 − (3/2)R2 replaces R3). In addition, R2 is replaced by (1/2)R2 for simplicity:

    [ 1 −3  4 −3 2  5 ]
    [ 0  1 −2  2 1 −3 ]
    [ 0  0  0  0 1  4 ]

This matrix is in echelon form; columns 1, 2, and 5 are pivot columns.

Working with rows 3 and 2 (in that order) to “clear” columns 5 and 2, we get the following reduced echelon form matrix:

    [ 1 −3  4 −3 2  5 ]   [ 1 −3  4 −3 0 −3 ]   [ 1 0 −2 3 0 −24 ]
    [ 0  1 −2  2 1 −3 ] ~ [ 0  1 −2  2 0 −7 ] ~ [ 0 1 −2 2 0  −7 ]
    [ 0  0  0  0 1  4 ]   [ 0  0  0  0 1  4 ]   [ 0 0  0 0 1   4 ]

Note that if the original matrix was the augmented matrix of a 3-by-5 system of linear equations in x1, x2, x3, x4, x5, then the last matrix corresponds to the equivalent linear system x1 − 2x3 + 3x4 = −24, x2 − 2x3 + 2x4 = −7, x5 = 4. Thus,

    (s1, s2, s3, s4, s5) = (−24 + 2x3 − 3x4, −7 + 2x3 − 2x4, x3, x4, 4)

is a solution for any values of x3, x4.
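Echelon forms can also be produced by software. The sketch below (using SymPy; the rref method returns the reduced echelon form together with the indices of the pivot columns, counted from 0) reproduces the reduction above:

    # Reduced echelon form of the matrix in Example 9.5.
    from sympy import Matrix

    A = Matrix([[0,  3, -6,  6, 4, -5],
                [3, -7,  8, -5, 8,  9],
                [3, -9, 12, -9, 6, 15]])
    R, pivots = A.rref()
    print(pivots)   # (0, 1, 4): columns 1, 2, and 5 are the pivot columns
    print(R)        # matches the reduced echelon form above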

Solving linear systems. When solving linear systems,<br />

1. A basic variable is any variable that corresponds to a pivot column in the augmented<br />

matrix of the system, and a free variable is any nonbasic variable.<br />

2. From the equivalent system produced using the reduced echelon <strong>for</strong>m, we solve each<br />

equation <strong>for</strong> the basic variable in terms of the free variables (if any) in the equation.<br />

3. If there is at least one free variable, then the original linear system has an infinite<br />

number of solutions.<br />

4. If an echelon form of the augmented matrix has a row of the form

       [ 0 0 ··· 0 b ]   where b ≠ 0,

then the system is inconsistent. (Once we know the system is inconsistent, there is<br />

no need to continue to find the reduced echelon <strong>for</strong>m.)<br />



Example 9.6 (Echelon Forms, continued) Continuing with the last example, the basic<br />

variables are x1, x2, and x5; the free variables are x3 and x4. The free variables are used to<br />

characterize the infinite number of solutions to the 3-by-5 system.<br />

9.1.3 Vector and Matrix Equations<br />

In linear algebra, we let R^m be the set of m-by-1 matrices of real numbers:

    R^m = { x = [x1, x2, ..., xm]^T : xi ∈ R }.

Each element of R^m is called a point or a vector. If x = [x1, x2, ..., xm]^T is a vector in R^m, we say that xi is the i-th component of x.

The zero vector, O, is the vector all of whose components equal zero.<br />

Addition and scalar multiplication. If x,y ∈ R m and c ∈ R, then<br />

1. the vector sum x + y is the vector obtained by componentwise addition. That is, the<br />

vector whose i th component is xi + yi <strong>for</strong> each i. And,<br />

2. the scalar product cx is the vector obtained by multiplying each component by c.<br />

That is, the vector whose i th component is cxi <strong>for</strong> each i.<br />

Linear combinations and spans. If v1,v2,... ,vn ∈ R m and c1,c2,...,cn ∈ R, then<br />

w = c1v1 + c2v2 + · · · + cnvn<br />

is a linear combination of the vectors v1, v2, ..., vn. The span of the set {v1,v2,...,vn}<br />

is the collection of all linear combinations of the vectors v1, v2, ..., vn:<br />

Span{v1,v2,...,vn} = {c1v1 + c2v2 + · · · + cnvn : ci ∈ R} ⊆ R m .<br />

The span of a set of vectors always contains the zero vector O (let each ci = 0).<br />

Vector equations. Consider an m-by-n system of equations in the variables x1, x2, ...,<br />

xn, where each equation is of the <strong>for</strong>m<br />

ai1x1 + ai2x2 + · · · + ainxn = bi <strong>for</strong> i = 1,2,... ,m.<br />

Let aj be the vector of coefficients of xj, and b be the vector of right hand side values.<br />



The m-by-n system can be rewritten as a vector equation:<br />

x1a1 + x2a2 + · · · + xnan = b.<br />

The system is consistent if and only if b ∈ Span{a1,a2,... ,an}.<br />

Matrix equations. The coefficient matrix A of the m-by-n system above is the m-by-n<br />

matrix whose columns are the ai’s: A = [a1 a2 · · · an ].<br />

If x ∈ R^n is the vector of unknowns, then the matrix product of A with x is

    Ax = [ a1 a2 ··· an ] [x1, x2, ..., xn]^T = x1 a1 + x2 a2 + ··· + xn an,

and the matrix equation for the system is Ax = b.

Note that the matrix product of a linear combination of vectors is equal to that linear combination of matrix products:

    A(c1 v1 + c2 v2 + ··· + cn vn) = c1 Av1 + c2 Av2 + ··· + cn Avn

<strong>for</strong> all ci ∈ R and vi ∈ R m . (Multiplication by A distributes over addition, and we can<br />

factor out constants.)<br />

Consistency <strong>for</strong> all b. The following theorem gives criteria <strong>for</strong> determining when a<br />

system of equations is consistent <strong>for</strong> all b ∈ R m :<br />

Theorem 9.7 (Consistency Theorem) Let A be an m-by-n coefficient matrix, and x<br />

be an n-by-1 vector of unknowns. The following statements are equivalent:<br />

1. The equation Ax = b has a solution <strong>for</strong> all b ∈ R m .<br />

2. Each b ∈ R m is a linear combination of the coefficient vectors aj.<br />

3. Span{a1,a2,...,an} = R m .<br />

4. A has a pivot position in each row.<br />

Note that the number of pivot positions is at most the minimum of m and n. Thus, by the<br />

consistency theorem, if m > n (there are more equations than unknowns), a system with<br />

coefficient matrix A cannot be consistent <strong>for</strong> all b ∈ R m .<br />

Homogeneous systems and solution sets. A homogeneous system is a linear system<br />

with matrix equation Ax = O.<br />

Homogeneous systems are consistent since x = O is a solution to the system (called the<br />

trivial solution). In addition, each linear combination of solutions is a solution. Thus, the<br />

solution set is a span of some set of vectors.<br />



Example 9.8 (Trivial Solution Only) Consider the matrix equation

    [ 2 −2 ]                 [ 0 ]
    [ 1  2 ] [x1, x2]^T  =   [ 0 ].
    [ 1  8 ]                 [ 0 ]

Since

    [ 2 −2 0 ]         [ 1 0 0 ]
    [ 1  2 0 ] ~ ··· ~ [ 0 1 0 ]   ⇒   x1 = 0,  x2 = 0,
    [ 1  8 0 ]         [ 0 0 0 ]

the system has the trivial solution only. The solution set is Span{O} = {O} ⊆ R^2.

Example 9.9 (One Generating Vector) Consider the matrix equation

    [ 2 4  −6 ]                     [ 0 ]
    [ 4 8 −10 ] [x1, x2, x3]^T  =   [ 0 ].

Since

    [ 2 4  −6 0 ]         [ 1 2 0 0 ]        x1 = −2x2
    [ 4 8 −10 0 ] ~ ··· ~ [ 0 0 1 0 ]   ⇒    x2 is free
                                             x3 = 0,

solutions can be written parametrically as follows:

    s = [ −2x2, x2, 0 ]^T = x2 [ −2, 1, 0 ]^T = x2 v

where x2 is free. The solution set is Span{v} ⊆ R^3.

Example 9.10 (Two Generating Vectors) Consider the matrix equation

    [ 2 4 −6 8 ]                         [ 0 ]
    [ 4 8  4 0 ] [x1, x2, x3, x4]^T  =   [ 0 ].
    [ 3 6 −1 4 ]                         [ 0 ]

Since

    [ 2 4 −6 8 0 ]         [ 1 2 0  1 0 ]        x1 = −2x2 − x4
    [ 4 8  4 0 0 ] ~ ··· ~ [ 0 0 1 −1 0 ]   ⇒    x2 is free
    [ 3 6 −1 4 0 ]         [ 0 0 0  0 0 ]        x3 = x4
                                                 x4 is free,

solutions can be written parametrically as:

    s = [ −2x2 − x4, x2, x4, x4 ]^T = x2 [ −2, 1, 0, 0 ]^T + x4 [ −1, 0, 1, 1 ]^T = x2 v1 + x4 v2

where x2 and x4 are free. The solution set is Span{v1, v2} ⊆ R^4.
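The generating vectors can be recovered by a null-space computation. The sketch below (using SymPy) returns a basis for the solution set of Ax = O:

    # Generating vectors for the homogeneous system in Example 9.10.
    from sympy import Matrix

    A = Matrix([[2, 4, -6, 8],
                [4, 8,  4, 0],
                [3, 6, -1, 4]])
    for v in A.nullspace():
        print(v.T)   # Matrix([[-2, 1, 0, 0]]) and Matrix([[-1, 0, 1, 1]])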


Nonhomogeneous systems and solution sets. A nonhomogeneous system is a linear system with matrix equation Ax = b where b ≠ O.

Theorem 9.11 (Solution Sets) Suppose that Ax = b is consistent and let p be a<br />

solution. Then the solution set of the system Ax = b is the set of all vectors of the <strong>for</strong>m<br />

s = p + vh<br />

where vh is a solution of the homogeneous system Ax = O.<br />

Example 9.12 (Two Generating Vectors, continued) Consider the matrix equation

    [ 2 4 −6 8 ]                         [ 18 ]
    [ 4 8  4 0 ] [x1, x2, x3, x4]^T  =   [  4 ].
    [ 3 6 −1 4 ]                         [ 11 ]

Since

    [ 2 4 −6 8 18 ]         [ 1 2 0  1  3 ]        x1 = 3 − 2x2 − x4
    [ 4 8  4 0  4 ] ~ ··· ~ [ 0 0 1 −1 −2 ]   ⇒    x2 is free
    [ 3 6 −1 4 11 ]         [ 0 0 0  0  0 ]        x3 = −2 + x4
                                                   x4 is free,

solutions can be written parametrically as:

    s = [ 3 − 2x2 − x4, x2, −2 + x4, x4 ]^T
      = [ 3, 0, −2, 0 ]^T + x2 [ −2, 1, 0, 0 ]^T + x4 [ −1, 0, 1, 1 ]^T
      = p + (x2 v1 + x4 v2) = p + vh

where x2 and x4 are free. (The solution set is a shift of a span.)
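The same parametric description can be produced symbolically. The sketch below (using SymPy's linsolve; the unknowns x2 and x4 reappear as the free parameters) solves the system of Example 9.12:

    # General solution of the nonhomogeneous system in Example 9.12.
    from sympy import Matrix, symbols, linsolve

    x1, x2, x3, x4 = symbols('x1 x2 x3 x4')
    A = Matrix([[2, 4, -6, 8],
                [4, 8,  4, 0],
                [3, 6, -1, 4]])
    b = Matrix([18, 4, 11])
    print(linsolve((A, b), x1, x2, x3, x4))
    # -> {(-2*x2 - x4 + 3, x2, x4 - 2, x4)}, i.e. p + x2*v1 + x4*v2 as above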

9.2 Matrix Operations<br />

An m-by-n matrix A is a rectangular array with m rows and n columns:

    A = [ a11 a12 ··· a1n ]
        [ a21 a22 ··· a2n ]
        [  .   .  ···  .  ]
        [ am1 am2 ··· amn ]   = [ a1 a2 ··· an ].

The (i,j)-entry of A (denoted by aij) is the element in the row i and column j position. In the first term above, the entries of A are shown explicitly. In the second term, A is written as n columns, where each column is a vector in R^m. (See Section 7.1 also.)

Square matrices. A square matrix is a matrix with m = n. An n-by-n square matrix is<br />

often called a square matrix of order n. If A is a square matrix of order n, then the diagonal<br />

elements of A are the elements: a11, a22, ..., ann.<br />

Symmetric matrices. Let A be a square matrix. Then A is said to be a symmetric<br />

matrix if aij = aji <strong>for</strong> all i and j.<br />



Triangular matrices. Let A be a square matrix. Then A is said to be an upper triangular matrix if aij = 0 when i > j; A is said to be a lower triangular matrix if aij = 0 when i < j; and A is said to be a triangular matrix if it is either upper or lower triangular.

Diagonal matrices, identity matrices. The square matrix A is said to be a diagonal matrix when aij = 0 whenever i ≠ j.

Given d1, d2, ..., dn, Diag(d1, d2, ..., dn) is used to denote the diagonal matrix of order n whose diagonal elements are the di's:

    Diag(d1, d2, ..., dn) = [ d1  0  0 ···  0 ]
                            [  0 d2  0 ···  0 ]
                            [  0  0 d3 ···  0 ]
                            [  .  .  . ···  . ]
                            [  0  0  0 ··· dn ].

The identity matrix of order n, In, is the diagonal matrix with di = 1 <strong>for</strong> each i. We<br />

sometimes omit the subscript when the order of the matrix is clear from the problem.<br />

Note that diagonal matrices are symmetric, upper triangular, and lower triangular.<br />

Vectors and matrices with all ones. For convenience, we let 1m be the vector in R m<br />

all of whose components equal one, and we let Jm×n be the m-by-n matrix all of whose<br />

entries equal one. We sometimes omit the subscripts when the orders of the vectors and<br />

matrices are clear from the problem.<br />

9.2.1 Basic Operations<br />

In this section, we consider the operations of addition and scalar multiplication, linear<br />

combinations of matrices, matrix transpose and matrix product.<br />

Addition and scalar multiplication. Let A and B be m-by-n matrices, and let c ∈ R<br />

be a scalar. Then matrix sum and scalar product are defined as follows:<br />

1. The matrix sum of A and B is the m-by-n matrix<br />

A + B = [a1 + b1 a2 + b2 · · · an + bn ].<br />

That is, A + B is the matrix with (i,j)-entry (A + B) ij = aij + bij.<br />

2. The scalar product of A by c is the m-by-n matrix<br />

cA = [ ca1 ca2 · · · can ].<br />

That is, cA is the matrix with (i,j)-entry (cA) ij = caij.<br />



Linear combinations of matrices. Let A and B be m-by-n matrices, and c and d be scalars. Then the m-by-n matrix

    cA + dB = [ ca1 + db1  ca2 + db2  ···  can + dbn ]

is said to be a linear combination of A and B. The linear combination of A and B has (i,j)-entry

    (cA + dB)ij = c aij + d bij.

For example, if

    A = [ 1 −2 ]            [  0  7 ]                     [  2 −11 ]
        [ 3 −4 ]   and  B = [  4  2 ],   then   2A − B =  [  2 −10 ].
        [ 5 −6 ]            [ −2 −6 ]                     [ 12  −6 ]
Matrix transpose. The transpose of the m-by-n matrix A, denoted by A^T, is the n-by-m matrix whose (i,j)-entry is aji: (A^T)ij = aji.

For convenience, we let αi ∈ R^n (for i = 1, 2, ..., m) be the columns of A^T and write

    A = [ α1^T ]
        [  ...  ]
        [ αm^T ]

when we want to view A as a “stack” of m row vectors.

Matrix product. Let A be an m-by-n matrix and let B = [ b1 b2 ··· bp ] be an n-by-p matrix. Then the matrix product, AB, is the m-by-p matrix whose j-th column is the product of A with the j-th column of B:

    AB = [ Ab1 Ab2 ··· Abp ].

Equivalently, AB is the matrix whose (i,j)-entry is

    (AB)ij = ai1 b1j + ai2 b2j + ··· + ain bnj.

(See Section 7.1.7 also.)

Dot and matrix products. Another convenient way to write the matrix product is in terms of dot products. Recall that if v and w are vectors in R^n, then their dot product is the scalar

    v·w = v^T w = v1 w1 + v2 w2 + ··· + vn wn.

Then

    AB = [ α1^T ]                        [ α1^T b1  α1^T b2  ···  α1^T bp ]
         [  ...  ] [ b1 b2 ··· bp ]  =   [ α2^T b1  α2^T b2  ···  α2^T bp ]
         [ αm^T ]                        [    .        .     ···     .    ]
                                         [ αm^T b1  αm^T b2  ···  αm^T bp ].


Properties of the basic operations. Some properties of the basic operations are given<br />

below. In each case, the sizes of the matrices are assumed to be compatible.<br />

1. Commutative: A + B = B + A.<br />

2. Associative: A + (B + C) = (A + B) + C and A(BC) = (AB)C.<br />

3. Distributive: A(B + C) = (AB) + (AC), (B + C)A = (BA) + (CA).<br />

4. Scalar Multiples: Given c ∈ R:<br />

c(AB) = (cA)B = A(cB), c(A + B) = (cA) + (cB) and (cA) T = cA T .<br />

5. Identity: ImA = A = AIn when A is an m-by-n matrix.<br />

6. Transpose of Transpose:<br />

<br />

A T T<br />

= A.<br />

7. Transpose of Sum: (A + B) T = A T + B T .<br />

8. Transpose of Product: (AB) T = B T A T .<br />

Note, in particular, that the matrix sum is commutative but the matrix product is not. In<br />

fact, it is possible that AB is defined but that BA is undefined.<br />
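A small numerical example makes the point. The sketch below (using NumPy; the two matrices are arbitrary) shows that A + B = B + A while AB and BA differ:

    # Matrix sum is commutative; matrix product generally is not.
    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[0, 1], [1, 0]])
    print(np.array_equal(A + B, B + A))   # True
    print(A @ B)                          # [[2 1] [4 3]]
    print(B @ A)                          # [[3 4] [1 2]]  -- not the same as A @ B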

9.2.2 Inverses of Square Matrices<br />

The n-by-n matrix A is said to be invertible if there exists a square matrix C satisfying<br />

AC = CA = In where In is the n-by-n identity matrix.<br />

Otherwise, A is said to be not invertible. Square matrices that are not invertible are also<br />

called singular matrices; invertible square matrices are also called nonsingular matrices.<br />

If A is invertible, then the matrix C is unique. To see this, suppose that the matrix D also<br />

satisfies the equality AD = DA = In. Then, using properties of matrix products,<br />

D = DIn = D(AC) = (DA)C = InC = C.<br />

If A is invertible, then A −1 is used to denote the matrix inverse of A.<br />

Special case: 2-by-2 matrices. Let A = [ a b ; c d ] be a 2-by-2 matrix. Then

    A^{-1} = (1/(ad − bc)) [  d −b ]   when ad − bc ≠ 0.
                           [ −c  a ]

If ad − bc = 0, then A does not have an inverse. (Note: ad − bc is the determinant of A.)


Special case: diagonal matrices. Let A = Diag(a1, a2, ..., an) be a diagonal matrix. If ai ≠ 0 for each i, then A is invertible with inverse

    A^{-1} = Diag(1/a1, 1/a2, ..., 1/an).

If ai = 0 for some i, then A is not invertible.

Demonstration: Note first that if C = Diag(c1, c2, ..., cn), then

    AC = CA = Diag(a1 c1, a2 c2, ..., an cn).

To obtain AC = CA = In, we need ai ci = 1 for each i. If ai = 0 for some i, then we cannot find solutions; otherwise, we let ci = 1/ai for each i.

Properties of inverses. Let A and B be invertible square matrices of order n and let c be a nonzero constant. Then

1. Inverse of Identity: In^{-1} = In for each n.

2. Inverse of Inverse: (A^{-1})^{-1} = A.

3. Inverse of Product: (AB)^{-1} = B^{-1} A^{-1}.

4. Inverse of Scalar Multiple: (cA)^{-1} = (1/c) A^{-1}.

5. Inverse of Transpose: (A^T)^{-1} = (A^{-1})^T.

Algorithm <strong>for</strong> finding an inverse. Let A be an n-by-n matrix. Form the n-by-2n<br />

augmented matrix whose last n columns are the columns of the identity matrix: [A In ].<br />

If A is row equivalent to the identity matrix, then the n-by-2n augmented matrix will be<br />

trans<strong>for</strong>med as follows:<br />

[A In ] ∼ · · · ∼ [In A −1 ]<br />

That is, the last n columns will be the inverse of A. Otherwise, A does not have an inverse.<br />

Partial justification: For convenience, let C = A^{-1} and In = [ e1 e2 ··· en ]. Then AC = In can be written as follows:

    AC = [ Ac1 Ac2 ··· Acn ] = [ e1 e2 ··· en ].

Equivalently,

    Ac1 = e1,  Ac2 = e2,  ...,  Acn = en.

By using the n-by-2n augmented matrix, we are solving all n systems at once. The solutions<br />

to the n systems <strong>for</strong>m the n columns of the inverse matrix.<br />
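The row-reduction algorithm can be written out directly. The sketch below (plain NumPy; the pivoting strategy and tolerance are implementation choices, not part of the definition) reduces [A I] to [I A^{-1}]:

    # Compute an inverse by row-reducing the augmented matrix [A | I].
    import numpy as np

    def inverse_via_row_reduction(A):
        n = A.shape[0]
        M = np.hstack([A.astype(float), np.eye(n)])    # augmented matrix [A I]
        for j in range(n):
            p = j + np.argmax(np.abs(M[j:, j]))        # choose a pivot row (partial pivoting)
            if np.isclose(M[p, j], 0.0):
                raise ValueError("matrix is not invertible")
            M[[j, p]] = M[[p, j]]                      # interchange
            M[j] /= M[j, j]                            # scale the pivot row to get a leading 1
            for i in range(n):
                if i != j:
                    M[i] -= M[i, j] * M[j]             # replacement: clear the rest of column j
        return M[:, n:]

    A = np.array([[0, 1, 2], [1, 0, 3], [4, -3, 8]])
    print(inverse_via_row_reduction(A))                # agrees with np.linalg.inv(A)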

<br />

Example 9.13 (2-by-2 Invertible Matrix) Let A = [ −7 8 ; 2 −2 ]. Using the special formula for 2-by-2 matrices,

    A^{-1} = (1/(−2)) [ −2 −8 ]   =   [ 1   4  ]
                      [ −2 −7 ]       [ 1  7/2 ].

Using the algorithm for finding inverses,

    [ A I2 ] = [ −7  8 | 1 0 ] ~ [ −7  8  | 1   0 ] ~ [ −7 8 | 1  0  ]
               [  2 −2 | 0 1 ]   [  0 2/7 | 2/7 1 ]   [  0 1 | 1 7/2 ]

                               ~ [ −7 0 | −7 −28 ] ~ [ 1 0 | 1  4  ]  =  [ I2 A^{-1} ].
                                 [  0 1 |  1 7/2 ]   [ 0 1 | 1 7/2 ]

The last two columns of the augmented matrix match the inverse computed above.

Example 9.14 (3-by-3 Invertible Matrix) Let A = [ 0 1 2 ; 1 0 3 ; 4 −3 8 ]. Since

    [ A I3 ] = [ 0  1 2 | 1 0 0 ]           [ 1 0 0 | −4.5  7.0 −1.5 ]
               [ 1  0 3 | 0 1 0 ]  ~ ··· ~  [ 0 1 0 | −2.0  4.0 −1.0 ],
               [ 4 −3 8 | 0 0 1 ]           [ 0 0 1 |  1.5 −2.0  0.5 ]

A is invertible with inverse

    A^{-1} = [ −4.5  7.0 −1.5 ]
             [ −2.0  4.0 −1.0 ].
             [  1.5 −2.0  0.5 ]

Example 9.15 (3-by-3 Singular Matrix) Let A = [ 0 1 2 ; 1 0 3 ; 4 −3 6 ]. Since

    [ A I3 ] = [ 0  1 2 | 1 0 0 ]           [ 1 0 3 | 0  1 0 ]
               [ 1  0 3 | 0 1 0 ]  ~ ··· ~  [ 0 1 2 | 1  0 0 ],
               [ 4 −3 6 | 0 0 1 ]           [ 0 0 0 | 3 −4 1 ]

A is not invertible. (Note that the third row of the left block is a row of zeros, so A cannot be row-reduced to the identity.)

Elementary matrices and matrix inverses. An elementary matrix, E, is one obtained<br />

by per<strong>for</strong>ming a single elementary row operation on an identity matrix. Since each elementary<br />

row operation is reversible, each elementary matrix is invertible.<br />

For example, let n = 4.<br />

1. (Replacement) Starting with I4, we replace R3 with R3+aR1. The elementary matrix<br />

is E shown on the left below. The operation is reversed by replacing R3 with R3−aR1.<br />

The inverse matrix is shown on the right below.<br />

    E = [ 1 0 0 0 ]                  [  1 0 0 0 ]
        [ 0 1 0 0 ]    and  E^{-1} = [  0 1 0 0 ].
        [ a 0 1 0 ]                  [ −a 0 1 0 ]
        [ 0 0 0 1 ]                  [  0 0 0 1 ]

2. (Scaling) Starting with I4, we scale R2 using the nonzero constant c. The elementary matrix is E shown on the left below. The operation is reversed by scaling R2 by the constant 1/c. The inverse matrix is shown on the right below.

    E = [ 1 0 0 0 ]                  [ 1  0  0 0 ]
        [ 0 c 0 0 ]    and  E^{-1} = [ 0 1/c 0 0 ].
        [ 0 0 1 0 ]                  [ 0  0  1 0 ]
        [ 0 0 0 1 ]                  [ 0  0  0 1 ]

3. (Interchange) Starting with I4, we interchange R2 and R4. The elementary matrix is E shown below. The operation is reversed by interchanging R2 and R4 again. Thus, E^{-1} = E. Note also that E^2 = I4.

    E = [ 1 0 0 0 ]
        [ 0 0 0 1 ]   = E^{-1}.
        [ 0 0 1 0 ]
        [ 0 1 0 0 ]

Let A be an n-by-n matrix. Elementary row operations on A correspond to left multiplication by the corresponding elementary matrix. If a sequence of p elementary row operations (with elementary matrices E1, E2, ..., Ep) transforms A to the identity matrix, then

    (Ep(Ep−1 ··· (E2(E1 A)))) = (Ep Ep−1 ··· E2 E1) A = In,

the inverse of A is the following product of elementary matrices:

    A^{-1} = Ep Ep−1 ··· E2 E1,

and A is the product of the inverses in the opposite order:

    A = E1^{-1} E2^{-1} ··· Ep−1^{-1} Ep^{-1}.

These computations can be used to formally justify the algorithm for finding inverses.

Example 9.16 (2-by-2 Matrix, continued) Continuing with the 2-by-2 example, four elementary row operations were used. Written as a matrix product:

    E4 E3 E2 E1 A = [ −1/7 0 ] [ 1 −8 ] [ 1  0  ] [  1  0 ] [ −7  8 ]   =   [ 1 0 ]   = I2.
                    [   0  1 ] [ 0  1 ] [ 0 7/2 ] [ 2/7 1 ] [  2 −2 ]       [ 0 1 ]

Thus,

    A^{-1} = E4 E3 E2 E1 = [ 1  4  ].
                           [ 1 7/2 ]

Relationship to solving systems of equations. The following theorem provides an<br />

important relationship between matrix inverses and systems of linear equations.<br />

Theorem 9.17 (Uniqueness Theorem) Let A be an invertible n-by-n matrix. Then<br />

the matrix equation Ax = b has the unique solution x = A −1 b <strong>for</strong> any b ∈ R n .<br />

9.2.3 Matrix Factorizations<br />

A factorization of a matrix A is an equation expressing A as the product of two or more<br />

matrices. This section introduces the LU-factorization, which supports an important practical<br />

method <strong>for</strong> solving large systems of linear equations. Additional factorizations will be<br />

considered later in these notes.<br />



LU-Factorization. Let A be an m-by-n matrix. An LU-factorization of A is a matrix<br />

equation A = LU, where<br />

1. L is an m-by-m unit lower triangular matrix, and<br />

2. U is an m-by-n (upper) echelon <strong>for</strong>m matrix.<br />

Note that L is said to be a unit lower triangular matrix if L is a lower triangular matrix<br />

whose diagonal elements ℓii = 1 <strong>for</strong> all i.<br />

The m-by-n matrix A has an LU-factorization when (and only when) A is row equivalent<br />

to an echelon <strong>for</strong>m matrix using a sequence of row replacements only:<br />

EpEp−1 · · · E2E1A = U =⇒ A = LU where L = (EpEp−1 · · · E2E1) −1 .<br />

In this equation, the Ei’s are the elementary matrices representing the row replacements.<br />

Example 9.18 (3-by-3 Matrix) Consider the following matrix and equivalent forms:

    A = [  3 1   0 ]   [ 3 1   0 ]   [ 3 1  0 ]
        [ −6 0  −4 ] ~ [ 0 2  −4 ] ~ [ 0 2 −4 ].
        [  0 6 −10 ]   [ 0 6 −10 ]   [ 0 0  2 ]

The last matrix is the echelon matrix U. Since two row replacements were used,

    L = E1^{-1} E2^{-1} = [  1 0 0 ] [ 1 0 0 ]   [  1 0 0 ]
                          [ −2 1 0 ] [ 0 1 0 ] = [ −2 1 0 ].
                          [  0 0 1 ] [ 0 3 1 ]   [  0 3 1 ]

Example 9.19 (3-by-4 Matrix) Consider the following matrix and equivalent forms:

    A = [  2  4 −1  5 ]   [ 2  4 −1  5 ]   [ 2 4 −1 5 ]
        [ −4 −5  3 −8 ] ~ [ 0  3  1  2 ] ~ [ 0 3  1 2 ].
        [  2 −5 −4  1 ]   [ 0 −9 −3 −4 ]   [ 0 0  0 2 ]

The last matrix is the echelon matrix U. Since three row replacements were used,

    L = E1^{-1} E2^{-1} E3^{-1} = [  1 0 0 ] [ 1 0 0 ] [ 1  0 0 ]   [  1  0 0 ]
                                  [ −2 1 0 ] [ 0 1 0 ] [ 0  1 0 ] = [ −2  1 0 ].
                                  [  0 0 1 ] [ 1 0 1 ] [ 0 −3 1 ]   [  1 −3 1 ]

Application. If A = LU and Ax = b is consistent, then the system can be solved<br />

efficiently. Consider the following setup:<br />

b = Ax = (LU)x = L(Ux) = Ly where y = Ux.<br />

The efficient solution is done in two steps:<br />

1. Solve Ly = b <strong>for</strong> y.<br />

2. Solve Ux = y <strong>for</strong> x.<br />

The first step is essentially a <strong>for</strong>ward substitution method, and the second step is essentially<br />

a backward substitution method. The process is useful when the same coefficient matrix is<br />

used with several different right hand sides.<br />
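Standard libraries implement exactly this two-step approach. The sketch below (using SciPy; lu_factor applies partial pivoting, so it actually factors a row-permuted copy of A) factors once and then reuses the factorization for several right hand sides:

    # Solve Ax = b repeatedly using one LU-factorization.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    A = np.array([[ 3.0, 1.0,   0.0],
                  [-6.0, 0.0,  -4.0],
                  [ 0.0, 6.0, -10.0]])     # the matrix of Example 9.18
    lu, piv = lu_factor(A)                 # factor once ...
    for b in (np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 4.0])):
        print(lu_solve((lu, piv), b))      # ... then solve for each right hand side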



Permutation matrices and factorizations. A permutation matrix P is a product of<br />

elementary matrices corresponding to row interchanges.<br />

In the LU-factorization, L encodes in<strong>for</strong>mation about row replacements, and the pivot<br />

positions of U encode in<strong>for</strong>mation about scalings. If row interchanges are needed to solve a<br />

system, then a permutation matrix can be used to encode these operations. Specifically, let<br />

P be the product of interchange matrices needed to bring the rows of A into proper order<br />

<strong>for</strong> an LU-factorization. Then write PA = LU. (If row interchanges are needed, then A<br />

does not have an LU-factorization. The matrix PA does have an LU-factorization.)<br />

9.2.4 Partitioned Matrices<br />

A partitioned matrix (or blocked matrix) is one whose entries are themselves matrices. An m-by-n matrix can be partitioned in many ways, including the following 2-by-2 form:

    A = [ A11 A12 ]   where Aij is a pi-by-qj matrix,
        [ A21 A22 ]

p1 + p2 = m and q1 + q2 = n. (This can be generalized to 3-by-3 forms, etc.)

Each submatrix in a partitioned matrix is called a block. Partitioning is especially useful when you can take advantage of block patterns: for example, if certain Aij's are zero matrices, identity matrices, diagonal matrices, or matrices all of whose entries are equal.

If all submatrices are conformable, then partitioned matrices can be added or multiplied by adding or multiplying the appropriate blocks. In particular,

    AB = [ A11 A12 ] [ B11 B12 ]   [ C11 C12 ]
         [ A21 A22 ] [ B21 B22 ] = [ C21 C22 ]

where Cij = Ai1 B1j + Ai2 B2j for each i and j.

This section considers inverses of partitioned square matrices. Additional methods <strong>for</strong><br />

partitioned matrices will be considered in later sections of these notes.<br />

Block diagonal matrices. A 2-by-2 block diagonal matrix is a square matrix A of order n written as follows

    A = [ A11  O  ]   where A11 is p-by-p, A22 is q-by-q,
        [  O  A22 ]

each O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

A is invertible if and only if both A11 and A22 are invertible. Then,

    A^{-1} = [ A11^{-1}     O     ].
             [    O      A22^{-1} ]

Example 9.20 (4-by-4 Matrix) Let A be the 4-by-4 matrix shown on the left below. The inverse of A can be computed using the formula for 2-by-2 block diagonal matrices:

    A = [  8 −5 0 0 ]                   [ 2 5  0   0  ]
        [ −3  2 0 0 ]    =⇒   A^{-1} =  [ 3 8  0   0  ].
        [  0  0 9 4 ]                   [ 0 0  1  −2  ]
        [  0  0 4 2 ]                   [ 0 0 −2  9/2 ]


Block upper triangular matrices. A 2-by-2 block upper triangular matrix is a square matrix A of order n written as follows

    A = [ A11 A12 ]   where A11 is p-by-p, A12 is p-by-q, A22 is q-by-q,
        [  O  A22 ]

O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

A is invertible if and only if both A11 and A22 are invertible. Then,

    A^{-1} = [ A11^{-1}     C     ]    where C = −A11^{-1} A12 A22^{-1}.
             [    O      A22^{-1} ]

In particular, if A11 and A22 are identity matrices, then

    A^{-1} = [ Ip −A12 ].
             [ O    Iq ]

Block lower triangular matrices. A 2-by-2 block lower triangular matrix is a square matrix A of order n written as follows

    A = [ A11  O  ]   where A11 is p-by-p, A21 is q-by-p, A22 is q-by-q,
        [ A21 A22 ]

O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

Since

    A^T = [ A11^T A21^T ]
          [   O   A22^T ]

is block upper triangular, the method above easily generalizes to handle this case.

Block elimination and Schur complements. Let M be a square matrix of order n written as follows:

    M = [ A B ]   where the diagonal submatrices are square,
        [ C D ]

and assume that A is invertible. Then block elimination subtracts CA^{-1} times the first row from the second row. This leaves the Schur complement of A in the lower right corner of the resulting block upper triangular matrix, as illustrated below:

    [    I     O ] [ A B ]   [ A       B        ]
    [ −CA^{-1} I ] [ C D ] = [ O  D − CA^{-1} B ].

If the Schur complement (D − CA^{-1} B) is invertible, then the inverse of M can be found in block form using methods for block triangular matrices.

matrix) and its Schur complement:<br />

M =<br />

<br />

A B<br />

=<br />

C D<br />

⎡<br />

⎢<br />

⎣<br />

3 1 0 −42 0<br />

4 1 −2 0 −1<br />

2 0 −5 1 −7<br />

1 0 1 2 1<br />

1 1 1 3 4<br />

⎤<br />

⎥<br />

⎦<br />

206<br />

and (D − CA −1 B) =<br />

<br />

−289 −13<br />

378 17<br />

<br />

.<br />

<br />

,


Since A, D and (D − CA^{-1}B) are invertible, we can compute M^{-1} using block methods. Using properties of inverses,

    M^{-1} = [ A       B        ]^{-1} [    I     O ]
             [ O  D − CA^{-1}B  ]      [ −CA^{-1} I ]

           = [ −5   5  −2 −134 −103 ] [  1  0  0 0 0 ]   [ −16  119  −95 −134 −103 ]
             [ 16 −15   6 1116  855 ] [  0  1  0 0 0 ]   [ 133 −987  789 1116  855 ]
             [ −2   2  −1  479  366 ] [  0  0  1 0 0 ] = [  57 −423  338  479  366 ].
             [  0   0   0   17   13 ] [  7 −7  3 1 0 ]   [   2  −15   12   17   13 ]
             [  0   0   0 −378 −289 ] [ −9  8 −3 0 1 ]   [ −45  334 −267 −378 −289 ]

9.3 Determinants of Square Matrices

Determinants of square matrices were introduced in Section 7.1.8. This section gives additional<br />

in<strong>for</strong>mation about determinants.<br />

Recursive or cofactor definition. Let A be a square matrix of order n, and let Aij be the submatrix obtained by eliminating the i-th row and the j-th column of A. Then the determinant of A can be defined recursively as follows:

    det(A) = Σ_{j=1}^n (−1)^{1+j} a1j det(A1j).

(The determinant has been computed by expanding across the first row.)

The value of the determinant remains the same if we expand across the i-th row:

    det(A) = Σ_{j=1}^n (−1)^{i+j} aij det(Aij),

or if we expand down the j-th column:

    det(A) = Σ_{i=1}^n (−1)^{i+j} aij det(Aij),

for any i or j. The number

    cij = (−1)^{i+j} det(Aij)

is often called the ij-cofactor of A.


Properties of determinants. Let A and B be square matrices of order n. Then

1. Transpose: det(A^T) = det(A).

2. Triangular: If A is a triangular matrix, then det(A) = a11 a22 ··· ann.

3. Product: det(AB) = det(A) det(B).

4. Idempotent: If A^2 = A (A is an idempotent matrix), then det(A) = 0 or 1.

5. Inverse: A is an invertible matrix if and only if det(A) ≠ 0. Further, if the matrix A is invertible, then det(A^{-1}) = 1/det(A).

6. Elementary Matrices: Let E be an elementary matrix. Then

       det(E) =  1   when E corresponds to row replacement,
              = −1   when E corresponds to row interchange,
              =  c   when E corresponds to scaling a row by c.

7. Zeroes: If A has a row or column of zeroes, then det(A) = 0.

8. Identical: If two rows or two columns of A are identical, then det(A) = 0.

Example 9.22 (3-by-3 Invertible Matrix, continued) If A = [ 0 1 2 ; 1 0 3 ; 4 −3 8 ], then

    det(A) = 0 · | 0  3 |  −  1 · | 1 3 |  +  2 · | 1  0 |  =  0 − (−4) + 2(−3) = −2 ≠ 0.
                 | −3 8 |         | 4 8 |         | 4 −3 |

Example 9.23 (3-by-3 Singular Matrix, continued) If A = [ 0 1 2 ; 1 0 3 ; 4 −3 6 ], then

    det(A) = 0 · | 0  3 |  −  1 · | 1 3 |  +  2 · | 1  0 |  =  0 − (−6) + 2(−3) = 0.
                 | −3 6 |         | 4 6 |         | 4 −3 |

Relationship to pivots. Consider the square matrix A of order n. Let K be a product of elementary matrices needed to bring A into echelon form, and U be the resulting echelon form matrix: KA = U.

Since det(K) ≠ 0 by Properties 3 and 5, and det(U) = Π_i uii by Property 2,

    det(A) = (1/det(K)) Π_i uii   (using Property 3).

Thus, det(A) = 0 when there are fewer than n pivot positions.
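These properties are easy to spot-check numerically. The sketch below (using NumPy; the second matrix B is arbitrary) verifies Properties 1, 3 and 5 for the matrix of Example 9.22:

    # Spot checks of determinant properties 1, 3, and 5.
    import numpy as np

    A = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [4.0, -3.0, 8.0]])   # det(A) = -2
    B = np.array([[2.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 3.0]])

    print(np.linalg.det(A.T), np.linalg.det(A))                         # equal
    print(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))    # equal
    print(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A))        # equal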


Reducing to echelon form. The determinant of A can be derived by reducing A to an echelon form matrix and observing the following:

1. If a multiple of one row is added to another, then the determinant is unchanged.

2. If two rows are interchanged, then the determinant changes sign.

3. If a row is multiplied by c, then the determinant is multiplied by c.

For example,

    | 3 6 6 |       | 1 2 2 |       | 1 2  2 |        | 1 2  2 |
    | 1 2 1 |  =  3 | 1 2 1 |  =  3 | 0 0 −1 |  =  −3 | 0 2  2 |  =  −3(−2)  =  6.
    | 1 4 4 |       | 1 4 4 |       | 0 2  2 |        | 0 0 −1 |

<br />

Explicit formula for the inverse using cofactors. If A is invertible, then the inverse of A can be written explicitly in terms of determinants as follows:

    A^{-1} = (1/det(A)) [ c11 c12 ··· c1n ]^T
                        [ c21 c22 ··· c2n ]        where cij = (−1)^{i+j} det(Aij).
                        [  .   .  ···  .  ]
                        [ cn1 cn2 ··· cnn ]

(A^{-1} equals the reciprocal of det(A) times the transpose of the cofactor matrix.)

This explicit formula is often useful when proving theorems about inverses, but it is not a useful computational formula.

9.4 Vector Spaces and Subspaces<br />

This section introduces vector space theory, emphasizing topics of interest in statistics.<br />

9.4.1 Definitions<br />

A vector space V over the reals is a set of objects (called vectors), and operations of addition<br />

and scalar multiplication satisfying the following axioms:<br />

1. Vector sum: Vector sum is commutative and associative.<br />

2. Zero vector: There is a zero vector O satisfying

       O + x = x = x + O   for each x ∈ V.

3. Additive inverse: For each x ∈ V,

       y + x = O = x + y   for some y ∈ V.

4. Scalar multiples: Given x, y ∈ V and c, d ∈ R,

       c(x + y) = (cx) + (cy),   (c + d)x = (cx) + (dx),   (cd)x = c(dx),   1x = x.

Examples of vector spaces include<br />

1. R m with the usual vector addition and scalar multiplication.<br />

2. The set of r-by-c matrices, Mr×c, with the usual operations of matrix addition and<br />

scalar multiplication.<br />

3. The set of polynomial functions of degree at most k, Pk,<br />

Pk = {p(t) = a0 + a1t + a2t 2 + · · · + akt k : ai ∈ R}<br />

where polynomials are added by adding their values <strong>for</strong> each t, and the scalar multiple<br />

by c is obtained by multiplying each value of the polynomial by the scalar c.<br />

4. The set of all polynomial functions, P∞,<br />

P∞ = P1 ∪ P2 ∪ P3 ∪ · · ·<br />

with addition and scalar multiplication as defined above.<br />

5. The set of continuous real-valued functions with domain [0,1], C[0, 1],<br />

C[0, 1] = {f : [0,1] → R : f is continuous}<br />

where functions are added by adding their values <strong>for</strong> each x ∈ [0,1], and the scalar<br />

multiple by c is obtained by multiplying each value of the function by the scalar c.<br />

Subspaces. A subset H ⊆ V is a subspace of the vector space V if<br />

1. Contains zero vector: O ∈ H.<br />

2. Closed under addition: If x,y ∈ H, then x + y ∈ H.<br />

3. Closed under scalar multiplication: If x ∈ H, then cx ∈ H <strong>for</strong> each c ∈ R.<br />

Note that the subset {O} is a subspace, called the trivial subspace.<br />

Example 9.24 (Lines in R^2) The set of points in the plane satisfying y = mx + b for some m and b:

    Lm,b = { [x, y]^T : y = mx + b } ⊆ R^2

is a subspace of R^2 when b = 0 (the line goes through the origin) and is not a subspace when b ≠ 0 (the line does not go through the origin).


Example 9.25 (Planes in R3 ) The set of points in three-space satisfying z = ax+by +c<br />

for some a, b, c,

Pa,b,c = { [ x ; y ; z ] : z = ax + by + c } ⊆ R 3 ,

is a subspace of R 3 when c = 0 (the plane goes through the origin) and is not a subspace when c ≠ 0 (the plane does not go through the origin).

Of interest. Subspaces are vector spaces in their own right. Of interest in statistics are<br />

the vector spaces R m , and the vector subspaces H ⊆ R m .<br />

9.4.2 Subspaces of R m<br />

Recall (from Section 9.1.3) that the span of the set {v1,v2,... ,vn} ⊆ R m , is the collection<br />

of all linear combinations of the vectors v1, v2, ..., vn:<br />

Span{v1,v2,...,vn} = {c1v1 + c2v2 + · · · + cnvn : ci ∈ R} ⊆ R m .<br />

Spans are subspaces. Of particular interest are spans related to coefficient matrices.<br />

Column space of A. Let A = [a1 a2 · · · an] be an m-by-n coefficient matrix. The<br />

column space of A is the span of the columns of A:<br />

Col(A) = Span{a1,a2,... ,an} ⊆ R m .<br />

The column space of A is the collection of all b so that Ax = b is consistent.<br />

Null space of A. Let A be an m-by-n coefficient matrix. The null space of A is the set<br />

of solutions of the homogeneous system Ax = O:<br />

Null(A) = {x : Ax = O} ⊆ R n .<br />

Since Null(A) can be written as a span of a set of vectors, it is a subspace of R n .<br />

9.4.3 Linear Independence, Basis and Dimension<br />

The set {v1,v2,... ,vn} ⊆ R m is said to be linearly independent when<br />

c1v1 + c2v2 + · · · + cnvn = O<br />

only when each ci = 0. Otherwise, the set is said to be linearly dependent.<br />

The following uniqueness theorem gives an important property of linearly independent sets.<br />



Theorem 9.26 (Uniqueness Theorem) If {v1,v2,...,vn} ⊆ R m is a linearly independent<br />

set and w ∈ Span{v1,v2,... ,vn}, then w has a unique representation as a linear<br />

combination of the vi’s. That is, we can write<br />

w = c1v1 + c2v2 + · · · + cnvn<br />

<strong>for</strong> a unique ordered list of scalars c1, c2, ..., cn (called the coordinates of w).<br />

Note that the coordinates can be found by solving the system<br />

[ v1 v2 · · · vn ] [ c1 ; c2 ; ... ; cn ] = [ w1 ; w2 ; ... ; wm ].

That is, by solving Ac = w, where A is the m-by-n matrix whose columns are the vi’s.<br />

Additional facts. Some additional facts about linear independence/dependence are:<br />

1. If vi = O <strong>for</strong> some i, then the set {v1,v2,...,vn} is linearly dependent.<br />

2. If n > m, then the set {v1,v2,... ,vn} is linearly dependent.<br />

3. Assume vi ≠ O for each i. Then, the set is linearly dependent if and only if one vector in the set can be written as a linear combination of the others.

4. Let A be the m-by-n matrix whose columns are the vi's: A = [ v1 v2 · · · vn ]. Then, the set is linearly independent if and only if A has a pivot in every column.

5. Let A be the m-by-n matrix whose columns are the vi's: A = [ v1 v2 · · · vn ]. Then, the set is linearly independent if and only if the homogeneous system Ax = O has only the trivial solution. (See the sketch below.)
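A short NumPy sketch (illustration only; the three vectors are arbitrary examples) of facts 4 and 5: the columns of A are linearly independent exactly when the rank of A equals the number of columns.

import numpy as np

v1, v2, v3 = np.array([1., 0., 0.]), np.array([0., 1., 1.]), np.array([1., 1., 0.])
A = np.column_stack([v1, v2, v3])
independent = np.linalg.matrix_rank(A) == A.shape[1]
print(independent)   # True here; replacing v3 by v1 + v2 would make it False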

Bases. A basis for the vector space V ⊆ R m is a linearly independent set which spans V . That is, a basis is a set of linearly independent vectors {v1,v2,... ,vn} satisfying

V = Span{v1,v2,... ,vn}.

Example 9.27 (Bases for R 3 ) Bases are not unique. For example, the sets

{ [ 1 ; 0 ; 0 ], [ 0 ; 1 ; 0 ], [ 0 ; 0 ; 1 ] }   and   { [ −3 ; 0 ; 0 ], [ 2 ; 1 ; 0 ], [ 6 ; −2 ; 1 ] }

are each bases of V = R 3 .

Further, if the components of x ∈ R 3 are x1, x2, x3, then the coordinates of x

1. With respect to the first basis are x1, x2, x3.

2. With respect to the second basis are (1/3)(−x1 + 2x2 + 10x3), x2 + 2x3, x3.


Standard basis <strong>for</strong> R m . The standard basis <strong>for</strong> R m is the set {e1,e2,... ,em}, where<br />

the ei’s are the columns of the identity matrix Im. That is, where ei is the vector whose<br />

j th component equals 0 when j ≠ i and equals 1 when j = i.

Note that the first basis given in the previous example is the standard basis <strong>for</strong> R 3 .<br />

Basis <strong>for</strong> the span of a set of vectors. Suppose that<br />

V = Span{v1,v2,... ,vn} ⊆ R m .

A basis for V can be found using row reductions. Specifically:

Theorem 9.28 (Spanning Set Theorem) If V = Span{v1,v2,...,vn} ⊆ R m is not<br />

the trivial subspace, and A is the m-by-n matrix whose columns are the vi’s:<br />

A = [v1 v2 · · · vn ],<br />

then the pivot columns of A <strong>for</strong>m a basis <strong>for</strong> the vector space V .<br />

Basis <strong>for</strong> the column space of A. In particular, the pivot columns of the coefficient<br />

matrix A <strong>for</strong>m a basis <strong>for</strong> the column space of A, Col(A).<br />

Example 9.29 (3-by-5 Matrix) Consider the following 3-by-5 coefficient matrix A and<br />

its equivalent reduced echelon <strong>for</strong>m matrix:<br />

A = [ a1 a2 a3 a4 a5 ] = ⎡ 0  3  −6   6  4 ⎤     ⎡ 1  0  −2  3  0 ⎤
                         ⎢ 3 −7   8  −5  8 ⎥  ∼  ⎢ 0  1  −2  2  0 ⎥
                         ⎣ 3 −9  12  −9  6 ⎦     ⎣ 0  0   0  0  1 ⎦

Since the pivots are in columns 1, 2, 5, a basis <strong>for</strong> Col(A) is the set {a1,a2,a5}.<br />

Basis <strong>for</strong> the null space of A. The method <strong>for</strong> writing the solution set of a homogeneous<br />

system Ax = O given in Section 9.1.3 automatically produces a set of linearly independent<br />

vectors which span Null(A). Thus, no further work is needed to find a basis <strong>for</strong> Null(A).<br />

Example 9.30 (3-by-5 Matrix, continued) Continuing with the example above, since<br />

Ax = O  ⇒  x1 = 2x3 − 3x4,  x2 = 2x3 − 2x4,  x3 is free,  x4 is free,  x5 = 0

⇒  s = x3 [ 2 ; 2 ; 1 ; 0 ; 0 ] + x4 [ −3 ; −2 ; 0 ; 1 ; 0 ] = x3 v1 + x4 v2,

where x3 and x4 are free, a basis for Null(A) is the set {v1,v2}.
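As a numerical cross-check of Examples 9.29 and 9.30 (a sketch only, assuming NumPy and SciPy are available): the rank gives the number of pivot columns, and scipy's null_space returns an orthonormal basis whose span agrees with Span{v1, v2}.

import numpy as np
from scipy.linalg import null_space

A = np.array([[0., 3., -6., 6., 4.],
              [3., -7., 8., -5., 8.],
              [3., -9., 12., -9., 6.]])
print(np.linalg.matrix_rank(A))   # 3 pivot columns
N = null_space(A)                  # 5-by-2 matrix; its columns span Null(A)
print(np.allclose(A @ N, 0))       # True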


Dimension of a vector space. Although there are many choices <strong>for</strong> the basis vectors of<br />

a vector space V , the number of basis vectors does not change.<br />

Theorem 9.31 (Basis Theorem) If V ⊆ R m has a basis with p vectors, then<br />

1. Every basis of V must contain exactly p vectors.<br />

2. Any subset of V with p linearly independent vectors is a basis of V .<br />

3. Any subset of V with p vectors which span V is a basis of V .<br />

If V is a nontrivial subspace of R m , then the dimension of V , denoted by dim(V ), is the<br />

number of vectors in any basis of V . By the basis theorem, this number is well-defined. For<br />

convenience, we say that the dimension of the trivial subspace, {O}, is zero.<br />

Note that <strong>for</strong> every subspace V ⊆ R m , 0 ≤ dim(V ) ≤ m.<br />

9.4.4 Rank and Nullity<br />

Let A be an m-by-n matrix. The rank of A, denoted by rank(A), is the number of pivot<br />

columns of A. Equivalently, the rank of A is the dimension of the column space of A,<br />

rank(A) = dim(Col(A)).<br />

The nullity of A, denoted by nullity(A), is the dimension of its null space,<br />

nullity(A) = dim(Null(A)).<br />

If A is the coefficient matrix of an m-by-n system of linear equations and r = rank(A),<br />

then the system has r basic variables and n − r free variables. Since the number of free<br />

variables corresponds to the dimension of the null space of A, we have<br />

rank(A) + nullity(A) = n.<br />

Example 9.32 (3-by-5 Matrix, continued) For example, the 3-by-5 coefficient matrix<br />

A considered earlier has rank 3 and nullity 2. The sum of these dimensions equals the<br />

number of columns of the matrix.<br />

Subspaces related to A and A T . Let A be an m-by-n coefficient matrix. There are<br />

four important subspaces related to A:

1. Column space of A: The column space of A, Col(A) ⊆ R m .<br />

2. Null space of A: The null space of A, Null(A) ⊆ R n .<br />

3. Row space of A: The column space of A T , Col(A T ) ⊆ R n .<br />

4. Left null space of A: The null space of A T , Null(A T ) ⊆ R m .<br />



The dimensions of these spaces are related as follows:<br />

Theorem 9.33 (Fundamental Theorem of Linear Algebra, Part I) If A is an<br />

m-by-n coefficient matrix of rank r, then<br />

1. rank(A) = rank(A T ) = r,<br />

2. nullity(A) = n − r and<br />

3. nullity(A T ) = m − r.<br />

Matrices of rank one. If the m-by-n matrix A has rank 1, then the columns of A are<br />

all multiples of a single vector, say ai = civ <strong>for</strong> some constants ci and vector v ∈ R m . Let<br />

c ∈ R n be the vector whose components are the ci's. Then

A = [ c1v c2v · · · cnv ] = v c T .

9.5 Orthogonality

Let v,w ∈ R m . Recall (from Section 7.1.4) that<br />

1. The dot product of v with w is the number

v·w = v T w = v1w1 + v2w2 + · · · + vmwm.

2. If θ is the angle between v and w, then

v·w = ‖v‖ ‖w‖ cos(θ).

3. The vectors v and w are said to be orthogonal if v·w = 0.

When m = 2 or m = 3, orthogonality corresponds to being at right angles in the usual<br />

geometric sense. When m > 3, the definition of orthogonality in terms of the dot product<br />

is a useful generalization.<br />

9.5.1 Orthogonal Complements<br />

If V ⊆ R m is a subspace, then the orthogonal complement of V is the set of all vectors<br />

x ∈ R m which are orthogonal to every vector in V :<br />

V ⊥ = {x : v·x = 0 for every v ∈ V } ⊆ R m .

(V ⊥ is read “V -perp.”)


Properties. Orthogonal complements satisfy the following properties:<br />

1. Subspace: V ⊥ is a subspace of R m .<br />

2. Trivial intersection: Since x·x = 0 only when x = O, V ∩ V ⊥ = {O}.<br />

3. Complement of complement: (V ⊥ ) ⊥ = V .

4. Bases: If you pool bases <strong>for</strong> V and V ⊥ , you get a basis <strong>for</strong> R m .<br />

5. Dimensions: dim(V ) + dim(V ⊥ ) = m.<br />

Example 9.34 (Complements in R3 ) Let V be the following subspace:<br />

V = Span{v1,v2} = Span{ [ 1 ; −1 ; 0 ], [ 2 ; 0 ; −2 ] } ⊆ R 3 .

V is a two-dimensional subspace of R 3 since the set {v1,v2} is linearly independent. To find V ⊥ it is sufficient to find those vectors x satisfying v1·x = 0 and v2·x = 0. Equivalently, we need to solve the following matrix equation:

[ v1 T ; v2 T ] x = O.

Since

[ 1 −1  0  0 ]  ∼  [ 1  0  −1  0 ]   ⇒   x1 = x3,  x2 = x3,  x3 is free,
[ 2  0 −2  0 ]     [ 0  1  −1  0 ]

V ⊥ is the span of the single vector v3 = [ 1 ; 1 ; 1 ]. Further, {v1,v2,v3} is a basis for R 3 .

The example above suggests the second part of the Fundamental Theorem of Linear Algebra:<br />

Theorem 9.35 (Fundamental Theorem of Linear Algebra, Part II) Let A be an<br />

m-by-n coefficient matrix. Then<br />

1. Null(A T ) and Col(A) are orthogonal complements in R m .<br />

2. Null(A) and Col(A T ) are orthogonal complements in R n .<br />

9.5.2 Orthogonal Sets, Bases and Projections<br />

The nonzero vectors v1, v2, ..., vn are said to be mutually orthogonal if vi·vj = 0 when<br />

i ≠ j. A set of mutually orthogonal vectors is known as an orthogonal set.

The following theorem summarizes several important properties of orthogonal sets.<br />

Theorem 9.36 (Orthogonal Sets) Let {v1,v2,...,vn} be an orthogonal set of vectors<br />

in R m , and let V be the subspace spanned by these vectors. Then<br />



1. {v1,v2,...,vn} is a basis for V .

2. If y ∈ V , then

y = c1v1 + c2v2 + · · · + cnvn, where ci = (y·vi)/(vi·vi) for each i.

3. If y ∉ V , then

ŷ = c1v1 + c2v2 + · · · + cnvn, where ci = (y·vi)/(vi·vi) for each i,

is the point closest to y in V . Further, the difference between the vector y and the vector ŷ is a vector in the orthogonal complement of V : (y − ŷ) ∈ V ⊥ .

The first part of the theorem tells us that orthogonal sets are linearly independent. The<br />

second part gives a convenient way of finding the coordinates of a vector with respect to a<br />

basis of mutually orthogonal vectors (called an orthogonal basis). The third part generalizes<br />

the idea from analytic geometry of “dropping a perpendicular.”<br />

Projections and orthogonal decompositions. If y ∉ V , then ŷ in the theorem above is called the projection of y on V , and is denoted by

ŷ = proj V (y).

Writing y as the sum of a vector in V and a vector in V ⊥ :

y = ŷ + (y − ŷ)

is called the orthogonal decomposition of y with respect to V . For each y ∈ R m , the<br />

orthogonal decomposition of y with respect to V is unique.<br />
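A small NumPy illustration of the projection formula (the vectors used here are arbitrary examples, not from the text): project y onto V = Span{v1, v2} using ci = (y·vi)/(vi·vi) for an orthogonal pair {v1, v2}, then check that the residual lies in V ⊥ .

import numpy as np

v1 = np.array([1., 1., 0., 0.])
v2 = np.array([0., 0., 1., -1.])            # v1 . v2 = 0, so {v1, v2} is orthogonal
y  = np.array([2., 0., 3., 1.])

yhat = sum((y @ v) / (v @ v) * v for v in (v1, v2))
resid = y - yhat
print(yhat)                                  # projection of y on V
print(resid @ v1, resid @ v2)                # both 0 (up to roundoff)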

9.5.3 Gram-Schmidt Orthogonalization<br />

The Gram-Schmidt orthogonalization process allows us to convert any basis of a subspace<br />

of R m to an orthogonal basis <strong>for</strong> the space.<br />

Theorem 9.37 (Gram-Schmidt Process) Let {v1,v2,... ,vn} be a basis <strong>for</strong> the vector<br />

space V ⊆ R m , and let {u1,u2,... ,un} be defined as follows:<br />

u1 = v1<br />

u2 = v2 − proj V1 (v2) where V1 = Span{u1} = Span{v1}.<br />

u3 = v3 − proj V2 (v3) where V2 = Span{u1,u2} = Span{v1,v2}.<br />

.<br />

un = vn − proj Vn−1 (vn) where Vn−1 = Span{u1,... ,un−1} = Span{v1,...,vn−1}.<br />

Then {u1,u2,... ,un} is an orthogonal basis of V .<br />



Example 9.38 (Subspace of R 4 ) If V = Span{ [ 1 ; −1 ; 0 ; 1 ], [ 2 ; 0 ; −2 ; 1 ], [ 4 ; 0 ; 5 ; 2 ] }, then

u1 = v1 = [ 1 ; −1 ; 0 ; 1 ],

u2 = v2 − (v2·u1 / u1·u1) u1 = [ 2 ; 0 ; −2 ; 1 ] − (3/3) [ 1 ; −1 ; 0 ; 1 ] = [ 1 ; 1 ; −2 ; 0 ],

and

u3 = v3 − (v3·u1 / u1·u1) u1 − (v3·u2 / u2·u2) u2
   = [ 4 ; 0 ; 5 ; 2 ] − (6/3) [ 1 ; −1 ; 0 ; 1 ] − (−6/6) [ 1 ; 1 ; −2 ; 0 ] = [ 3 ; 3 ; 3 ; 0 ].

{u1,u2,u3} is an orthogonal basis of V ⊆ R 4 .<br />
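A compact Gram-Schmidt sketch in NumPy (illustration only), applied to the three vectors of Example 9.38; it reproduces u1, u2, u3 up to roundoff.

import numpy as np

def gram_schmidt(vectors):
    basis = []
    for v in vectors:
        u = v - sum((v @ u) / (u @ u) * u for u in basis)   # subtract projections
        basis.append(u)
    return basis

V = [np.array([1., -1., 0., 1.]),
     np.array([2., 0., -2., 1.]),
     np.array([4., 0., 5., 2.])]
for u in gram_schmidt(V):
    print(u)       # [1 -1 0 1], [1 1 -2 0], [3 3 3 0]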

9.5.4 QR-Factorization<br />

Orthogonal vectors make many calculations easier. For example, if {u1,u2,...,un} is an<br />

orthogonal set in R m , and A is the matrix whose columns are the ui’s, then A T A is a<br />

diagonal matrix. This idea motivates a technique known as QR-factorization.<br />

Orthonormal sets. The orthogonal set {q1,q2,... ,qn} is said to be an orthonormal set<br />

of vectors if each qi is a unit vector. (Each vector has been normalized to have length 1.)<br />

Theorem 9.39 (Orthonormal Sets) If {q1,q2,... ,qn} is an orthonormal set of vectors<br />

in R m , Q is the matrix whose columns are the qi’s, and x,y ∈ R m , then<br />

1. Q T Q = In. (If Q is square, that is n = m, then QQ T = Im as well.)

2. ‖Qx‖ = ‖x‖.

3. (Qx)·(Qy) = x·y.<br />

4. (Qx)·(Qy) = 0 if and only if x·y = 0.<br />

Orthogonal matrices. The square matrix Q is said to be an orthogonal matrix if its<br />

columns <strong>for</strong>m an orthonormal set. If Q is an orthogonal matrix of order n, then the first<br />

part of the theorem above implies<br />

Q T Q = QQ T = In =⇒ Q −1 = Q T .<br />

That is, the inverse of Q is equal to its transpose.<br />



QR-factorization. Let A = [v1 v2 · · · vn] be an m-by-n matrix whose columns are<br />

linearly independent. Use the Gram-Schmidt process to convert the vi’s to ui’s, let<br />

qi = (1/‖ui‖) ui   for i = 1,2,... ,n,

and let Q be the m-by-n matrix whose columns are the qi’s.<br />

In the Gram-Schmidt process,<br />

vi ∈ Span{v1,... ,vi} = Span{u1,... ,ui} <strong>for</strong> i = 1,2,... ,n.<br />

By the uniqueness theorem <strong>for</strong> coordinates, Q T A = R must be an upper triangular matrix<br />

of order n. Further, it can be shown that the diagonal elements of R are all positive. Since<br />

QQ T x = x for every x in the column space of Q, and each column of A lies in that column space,

Q T A = R =⇒ Q(Q T A) = QR =⇒ A = QR.
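A brief NumPy sketch (illustration only; the 4-by-3 matrix below is the one from Example 9.40): numpy.linalg.qr computes a QR-factorization directly. The library may flip signs of columns of Q and rows of R, so the factorization is checked through the product.

import numpy as np

A = np.array([[1., 2., 4.],
              [-1., 0., 0.],
              [0., -2., 5.],
              [1., 1., 2.]])
Q, R = np.linalg.qr(A)                    # Q is 4-by-3 with orthonormal columns, R is 3-by-3
print(np.allclose(Q @ R, A))              # True
print(np.allclose(Q.T @ Q, np.eye(3)))    # True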

Example 9.40 (Subspace of R 4 , continued) For example, the QR-factorization of the<br />

4-by-3 matrix whose columns are the vi’s from the Gram-Schmidt example above is<br />

⎡  1   2   4 ⎤   ⎡  1/√3   1/√6   1/√3 ⎤ ⎡ √3   √3   2√3 ⎤
⎢ −1   0   0 ⎥ = ⎢ −1/√3   1/√6   1/√3 ⎥ ⎢  0   √6   −√6 ⎥
⎢  0  −2   5 ⎥   ⎢   0    −2/√6   1/√3 ⎥ ⎣  0    0   3√3 ⎦ .
⎣  1   1   2 ⎦   ⎣  1/√3    0      0   ⎦

9.5.5 Linear Least Squares

One of the most useful applications of the methods from this section is to the linear least<br />

squares problem. The introduction here focuses on the general problem of finding approximate<br />

solutions to inconsistent systems.<br />

Let A be an m-by-n coefficient matrix, and assume that Ax = b is inconsistent (b is not in<br />

the column space of A). To find approximate solutions to the original system, we propose<br />

to do the following:<br />

1. Find the projection of b on Col(A), b̂, and

2. Report solutions to the consistent system Ax = b̂.

There are two key observations:<br />

Observation 1: Since b̂ is as close to b as possible, each approximate solution x satisfies:

‖b − b̂‖ = ‖b − Ax‖ is as small as possible.


The difference vector is<br />

b − Ax = ⎡ b1 − (a1,1x1 + a1,2x2 + · · · + a1,nxn) ⎤
         ⎢ b2 − (a2,1x1 + a2,2x2 + · · · + a2,nxn) ⎥
         ⎢                    .                    ⎥
         ⎣ bm − (am,1x1 + am,2x2 + · · · + am,nxn) ⎦

and the square of the length of the difference vector is

Σ_{i=1}^{m} ( bi − (ai,1x1 + ai,2x2 + · · · + ai,nxn) )² .

Each approximate solution x will minimize the above sum of squared differences. For this<br />

reason the approximate solutions are called least squares solutions.<br />

Observation 2: Since the difference vector<br />

(b − b̂) = (b − Ax) ∈ Col(A) ⊥ = Null(A T )

(by the Fundamental Theorem of Linear Algebra), we know that<br />

O = A T (b−Ax) = A T b − A T Ax =⇒ A T Ax = A T b.<br />

Thus, least squares solutions can be found by solving the consistent system on the right;<br />

there is no need to explicitly compute the projection of b on the column space of A.<br />

The matrix equation A T Ax = A T b is often called the normal equation of the system.<br />
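For illustration (a minimal NumPy sketch using the system of Example 9.41 below): the normal-equation solution agrees with numpy's built-in least squares routine.

import numpy as np

A = np.array([[1., 0.], [1., 1.], [1., 2.]])
b = np.array([6., 0., 0.])
x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # solve  A^T A x = A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # library least squares
print(x_normal, x_lstsq)                           # both [ 5. -3.]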

Example 9.41 (3-by-2 Inconsistent System) Consider the inconsistent system<br />

⎡ 1 0 ⎤ ⎡ x ⎤   ⎡ 6 ⎤
⎢ 1 1 ⎥ ⎣ y ⎦ = ⎢ 0 ⎥ .
⎣ 1 2 ⎦         ⎣ 0 ⎦

Then

A T Ax = A T b  ⇒  [ 3 3 ; 3 5 ] [ x ; y ] = [ 6 ; 0 ]  ⇒  x = 5, y = −3.

Thus, x = 5 and y = −3 is the unique least squares solution to the system.<br />

Example 9.42 (4-by-3 Inconsistent System) Consider the inconsistent system<br />

⎡ 1 3 5 ⎤ ⎡ x ⎤   ⎡  3 ⎤
⎢ 1 1 0 ⎥ ⎢ y ⎥ = ⎢  5 ⎥
⎢ 1 1 2 ⎥ ⎣ z ⎦   ⎢  7 ⎥ .
⎣ 1 3 3 ⎦         ⎣ −3 ⎦

Then

A T Ax = A T b  ⇒  ⎡  4  8 10 ⎤ ⎡ x ⎤   ⎡ 12 ⎤        x = 10
                   ⎢  8 20 26 ⎥ ⎢ y ⎥ = ⎢ 12 ⎥   ⇒   y = −6
                   ⎣ 10 26 38 ⎦ ⎣ z ⎦   ⎣ 20 ⎦        z = 2

Thus, x = 10, y = −6 and z = 2 is the unique least squares solution to the system.<br />



Remark. If the columns of the coefficient matrix A are linearly independent, then the<br />

symmetric matrix A T A is invertible and

x = (A T A) −1 A T b

is the unique least squares solution. (The proof uses the QR-factorization of A.)<br />

9.6 Eigenvalues and Eigenvectors<br />

Let A be a square matrix of order n. The constant λ is said to be an eigenvalue of A if<br />

Ax = λx <strong>for</strong> some nonzero vector x ∈ R n .<br />

Each nonzero x satisfying this equation is said to be an eigenvector with eigenvalue λ.<br />

If x is an eigenvector of A with eigenvalue λ, then so is each scalar multiple of x, since<br />

A(cx) = c(Ax) = c(λx) = λ(cx) <strong>for</strong> each nonzero constant c.<br />

If we think of the collection of scalar multiples of x as defining a line in n-space, then A<br />

takes each point on the line to another point on the line.<br />

9.6.1 Characteristic Equation<br />

Since x = Inx, where In is the identity matrix,<br />

Ax = λx ⇐⇒ Ax − λx = O ⇐⇒ (A − λIn)x = O.<br />

That is, λ is an eigenvalue of A if and only if the homogeneous system of equations on the<br />

right has nontrivial solutions. The system has nontrivial solutions if and only if (A − λIn)<br />

is a singular matrix. The matrix is singular if and only if its determinant is zero.<br />

Thus, the eigenvalues of A can be found by solving<br />

det(A − λIn) = 0 <strong>for</strong> λ.<br />

This equation is known as the characteristic equation <strong>for</strong> the matrix A. Further, if λ is an<br />

eigenvalue of A, then its eigenvectors are the nonzero vectors in the null space of (A −λIn).<br />

<br />

Example 9.43 (2-by-2 Matrix) Let A = [ −8 −10 ; 5 7 ]. Then

det(A − λI2) = det[ −8 − λ  −10 ; 5  7 − λ ] = (−8 − λ)(7 − λ) + 50 = λ² + λ − 6 = (λ + 3)(λ − 2),

and the determinant equals 0 when λ = 2 or λ = −3.

Using the techniques of Section 9.1.3,

Null(A − 2I2) = Span{ [ 1 ; −1 ] }   and   Null(A + 3I2) = Span{ [ −2 ; 1 ] }.

Finally, note that the set {v1,v2} = { [ 1 ; −1 ], [ −2 ; 1 ] } is a basis for R 2 .
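As a quick numerical cross-check (illustration only, assuming NumPy), numpy.linalg.eig recovers the eigenvalues 2 and −3 of Example 9.43 and unit eigenvectors proportional to (1, −1) and (−2, 1).

import numpy as np

A = np.array([[-8., -10.], [5., 7.]])
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)      # [ 2. -3.]  (order may vary)
print(eigvecs)      # columns are unit eigenvectors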


Properties. The following are important properties of eigenvalues and eigenvectors:<br />

1. 0 is an eigenvalue of A if and only if A is singular (does not have an inverse).<br />

2. If A is a triangular matrix, then the eigenvalues of A are its diagonal elements.<br />

3. If λ is an eigenvalue of A with eigenvector x and A k = AA · · · A is the k th power of<br />

the matrix A, then λ k is an eigenvalue of A k with eigenvector x.<br />

4. Suppose that A has k distinct eigenvalues λ1, λ2, ..., λk. If you pool the bases of<br />

the null spaces Null(A − λiIn), <strong>for</strong> i = 1,2,... ,k, then the resulting set is linearly<br />

independent. If the resulting set has n elements, the set is a basis <strong>for</strong> R n .<br />

Remarks. If A is a square matrix of order n, then the characteristic equation is a polynomial<br />

equation of degree n. Polynomials of degree n cannot always be factored into n terms<br />

of the <strong>for</strong>m “(λ−c)” where the c’s are real numbers, but can be factored into n linear terms<br />

if we allow some c’s to be complex numbers.<br />

9.6.2 Diagonalization<br />

Suppose that A has n linearly independent eigenvectors,<br />

vi <strong>for</strong> i = 1,2,... ,n,<br />

with corresponding eigenvalues λi <strong>for</strong> i = 1,2,... ,n.<br />

Let P be the matrix whose columns are the vi’s, and let Λ be the diagonal matrix whose<br />

diagonal elements are the λi’s,<br />

Then, since<br />

P = [v1 v2 · · · vn], Λ = Diag(λ1,λ2,...,λn).<br />

AP = [Av1 Av2 · · · Avn ] = [λ1v1 λ2v2 · · · λnvn ] = PΛ<br />

and P is invertible, P −1 AP = Λ and A = PΛP −1 .<br />

The matrix equation A = PΛP −1 , where Λ is a diagonal matrix, is called a diagonalization<br />

of the matrix A. Diagonalizations do not always exist.<br />

Theorem 9.44 (Diagonalization Theorem) The square matrix A of order n is diagonalizable<br />

if and only if A has n linearly independent eigenvectors.<br />

Example 9.45 (Diagonalizable 3-by-3 Matrix) Let A = [ 3 0 0 ; 0 3 0 ; −1 1 2 ].

Since A is a triangular matrix, its eigenvalues are 3, 3, 2. (Since 3 corresponds to two roots of the characteristic equation, it appears twice in this list.)

Using the techniques of Section 9.1.3,

Null(A − 3I3) = Span{ [ 1 ; 1 ; 0 ], [ −1 ; 0 ; 1 ] }   and   Null(A − 2I3) = Span{ [ 0 ; 0 ; 1 ] }.

Since there are three linearly independent eigenvectors, A is diagonalizable. Further, we can factor A as follows:

A = PΛP −1 = ⎡ 1 −1 0 ⎤ ⎡ 3 0 0 ⎤ ⎡  0  1 0 ⎤
             ⎢ 1  0 0 ⎥ ⎢ 0 3 0 ⎥ ⎢ −1  1 0 ⎥ .
             ⎣ 0  1 1 ⎦ ⎣ 0 0 2 ⎦ ⎣  1 −1 1 ⎦

Example 9.46 (Nondiagonalizable 3-by-3 Matrix) Let A = [ 3 0 0 ; 2 3 0 ; 2 1 2 ].

As above, since A is a triangular matrix, its eigenvalues are 3, 3, 2.

Using the techniques of Section 9.1.3,

Null(A − 3I3) = Span{ [ 0 ; 1 ; 1 ] }   and   Null(A − 2I3) = Span{ [ 0 ; 0 ; 1 ] }.

Since there are only two linearly independent eigenvectors, A is not diagonalizable.<br />
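A small NumPy check (illustration only) of Example 9.45: the eigenvector matrix P and eigenvalue matrix Λ reconstruct A.

import numpy as np

A = np.array([[3., 0., 0.], [0., 3., 0.], [-1., 1., 2.]])
P = np.array([[1., -1., 0.], [1., 0., 0.], [0., 1., 1.]])
Lam = np.diag([3., 3., 2.])
print(np.allclose(P @ Lam @ np.linalg.inv(P), A))   # True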

9.6.3 Symmetric Matrices<br />

Recall that the square matrix A is symmetric if aij = aji <strong>for</strong> all i and j. Written succinctly,<br />

the square matrix A is symmetric if A = A T .<br />

Symmetric matrices are always diagonalizable. Further, the eigenvectors can be chosen to<br />

be orthonormal. These facts are summarized in the following theorem.<br />

Theorem 9.47 (Spectral Theorem) Let A be a symmetric matrix of order n. Then<br />

there exists a diagonal matrix Λ and an orthogonal matrix Q satisfying<br />

A = QΛQ −1 = QΛQ T

(since the inverse of Q is its transpose).

The spectral theorem says that if A is symmetric, then (1) the diagonalization process<br />

outlined above always results in finding n linearly independent eigenvectors, and (2) eigenvectors<br />

corresponding to different eigenvalues are always orthogonal.<br />

To construct an orthonormal eigenvector basis, you just need to convert the basis of each<br />

null space, Null(A − λiIn) <strong>for</strong> each i, to an orthonormal basis using the normalized Gram-<br />

Schmidt process.<br />
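For symmetric matrices, a routine such as numpy.linalg.eigh returns exactly this kind of factorization; a minimal sketch (the 2-by-2 matrix is an arbitrary example):

import numpy as np

S = np.array([[2., 1.], [1., 2.]])
lam, Q = np.linalg.eigh(S)                        # eigenvalues in ascending order
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))     # True
print(np.allclose(Q.T @ Q, np.eye(2)))            # True: columns of Q are orthonormal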



Spectral decomposition. Let Q = [q1 q2 · · · qn ] and let λi be the eigenvalue<br />

associated with qi <strong>for</strong> each i. Then<br />

A = QΛQ T = λ1q1q1 T + λ2q2q2 T + · · · + λnqnqn T<br />

is called the spectral decomposition of A. The spectral decomposition breaks up A into pieces<br />

determined by its “spectrum” of eigenvalues. Each term in the spectral decomposition of<br />

A is a rank 1 matrix.<br />

9.6.4 Positive Definite Matrices<br />

Let A be a symmetric matrix of order n and let Quad(x) be the following real-valued<br />

n-variable function:<br />

Quad(x) = x T Ax = Σ_{i=1}^{n} Σ_{j=1}^{n} aij xi xj .

(Quad(x) is a quadratic form in x.) Then, A is said to be positive definite if

Quad(x) > 0 when x ≠ O.

The following theorem relates the method <strong>for</strong> determining when a matrix is positive definite<br />

given in Section 7.4.2 (on the second derivative test) to the eigenvalues of A.<br />

Theorem 9.48 (Positive Definite Matrices) Let A be a symmetric matrix. If A has<br />

one of the following three properties, it has them all:<br />

1. All the eigenvalues of A are positive.<br />

2. The determinants of all the principal minors are positive.<br />

3. Quad(x) = x T Ax > 0 except when x = O.<br />

To demonstrate that the first statement implies the third, <strong>for</strong> example, write A = QΛQ T ,<br />

as given in the spectral theorem above. Then<br />

x T Ax = x T (QΛQ T )x = y T Λy = Σ_{i=1}^{n} λi yi² ,   where y = Q T x.

Since Q is an invertible matrix (with inverse Q T ), y = Q T x ≠ O when x ≠ O. Further, since each λi is positive, the sum on the right is positive when y ≠ O.
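A brief NumPy sketch (illustration only; the matrix is an arbitrary example) of the eigenvalue criterion of Theorem 9.48:

import numpy as np

A = np.array([[2., -1.], [-1., 2.]])
print(np.all(np.linalg.eigvalsh(A) > 0))   # True: both eigenvalues (1 and 3) are positive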

9.6.5 Singular Value Decomposition<br />

The singular value decomposition is a useful factorization method <strong>for</strong> rectangular matrices,<br />

generalizing the idea of diagonalization.<br />



Singular values. Let A be an m-by-n matrix. By the spectral theorem, the symmetric<br />

matrix A T A can be orthogonally diagonalized. Let<br />

A T A = V ΛV T where V = [v1 v2 · · · vn ], Λ = Diag(λ1,λ2,... ,λn),<br />

the columns of V <strong>for</strong>m an orthogonal set, and the λi’s (and vi’s) have been ordered so that<br />

λ1 ≥ λ2 ≥ · · · ≥ λn.<br />

The λi’s are nonnegative since they correspond to squared lengths:<br />

‖Avi‖² = (Avi) T (Avi) = vi T A T Avi = vi T (λivi) = λi ≥ 0 for each i.

The singular values of A are the square roots of the eigenvalues of A T A,<br />

σi = √ λi <strong>for</strong> i = 1,2,... ,n<br />

written in descending order. (The singular values are the lengths of the vectors Avi.)<br />

Theorem 9.49 (Singular Values) Given the setup above, if the rank of A is r, then<br />

1. σi > 0 when i ≤ r and σi = 0 when i > r.<br />

2. The set {Av1,Av2,... ,Avr} is an orthogonal basis of Col(A).<br />

Decomposition. To complete the singular value decomposition, let<br />

ui = (1/‖Avi‖) Avi = (1/σi) Avi   for i = 1,2,... ,r.

Then {u1,u2,...,ur} is an orthonormal basis of Col(A). This basis can be completed to<br />

an orthonormal basis of R m , say {u1,u2,...,um}.<br />

Let U be the orthogonal matrix of order m whose columns are the ui’s,<br />

U = [u1 u2 · · · um],<br />

let D be the diagonal matrix of order r whose diagonal elements are the nonzero σi’s,<br />

D = Diag(σ1,σ2,... ,σr),<br />

and let Σ be the m-by-n partitioned matrix<br />

<br />

Σ = ⎡ D  O ⎤
    ⎣ O  O ⎦ ,

where each O is a zero matrix.

Then, since<br />

UΣ = [σ1u1 · · · σrur 0 · · · 0] = [Av1 · · · Avr 0 · · · 0] = AV<br />

and V is invertible (with inverse V T ), UΣV T = AV V T = A.<br />



The matrix equation<br />

A = UΣV T ,<br />

where U and V are orthogonal matrices and Σ has the <strong>for</strong>m shown above, is called a singular<br />

value decomposition of A. Singular value decompositions always exist.<br />

<br />

Example 9.50 (2-by-2 Matrix of Rank 1) Let A = [ 2 2 ; 1 1 ].

Then A T A = [ 5 5 ; 5 5 ] = V ΛV T where

V = [ v1 v2 ] = [ 1/√2  −1/√2 ; 1/√2  1/√2 ]   and   Λ = [ λ1 0 ; 0 λ2 ] = [ 10 0 ; 0 0 ].

The column space of A is spanned by Av1, the first singular value is σ1 = √10, and

u1 = (1/σ1) Av1 = [ 2/√5 ; 1/√5 ].

A unit vector orthogonal to u1 is u2 = [ −1/√5 ; 2/√5 ].

Thus, a singular value decomposition of A is

A = UΣV T = [ 2/√5  −1/√5 ; 1/√5  2/√5 ] [ √10 0 ; 0 0 ] [ 1/√2  1/√2 ; −1/√2  1/√2 ].

Example 9.51 (3-by-2 Matrix of Rank 2) Let A = [ −1 13 ; 4 8 ; 10 −10 ].

Then A T A = [ 117 −81 ; −81 333 ] = V ΛV T where

V = [ v1 v2 ] = [ −1/√10  3/√10 ; 3/√10  1/√10 ]   and   Λ = [ λ1 0 ; 0 λ2 ] = [ 360 0 ; 0 90 ].

The column space of A is spanned by {Av1, Av2}, σ1 = 6√10, σ2 = 3√10,

u1 = (1/σ1) Av1 = [ 2/3 ; 1/3 ; −2/3 ]   and   u2 = (1/σ2) Av2 = [ 1/3 ; 2/3 ; 2/3 ].

A unit vector orthogonal to these vectors is u3 = [ 2/3 ; −2/3 ; 1/3 ].

Thus, a singular value decomposition of A is

A = UΣV T = ⎡  2/3  1/3   2/3 ⎤ ⎡ 6√10    0  ⎤
            ⎢  1/3  2/3  −2/3 ⎥ ⎢   0   3√10 ⎥ [ −1/√10  3/√10 ; 3/√10  1/√10 ] .
            ⎣ −2/3  2/3   1/3 ⎦ ⎣   0     0  ⎦
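A quick NumPy check (illustration only) of Example 9.51: the singular values of A are 6√10 and 3√10, and the computed factors reproduce A.

import numpy as np

A = np.array([[-1., 13.], [4., 8.], [10., -10.]])
U, s, Vt = np.linalg.svd(A)                    # singular values in descending order
print(s)                                        # approx [18.974  9.487]
print(np.allclose((U[:, :2] * s) @ Vt, A))      # True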


Bases <strong>for</strong> the four important subspaces. Let A be an m-by-n matrix of rank r, and<br />

let A = UΣV T be a singular value decomposition. Let U = [Ur Um−r ] be a partition of<br />

U into matrices with r and (m − r) columns. Similarly, let V = [Vr Vn−r ] be a partition<br />

of V into matrices with r and (n − r) columns. Then<br />

1. The columns of Ur <strong>for</strong>m an orthonormal basis <strong>for</strong> Col(A) ⊆ R m .<br />

2. The columns of Um−r <strong>for</strong>m an orthonormal basis <strong>for</strong> Null(A T ) ⊆ R m .<br />

3. The columns of Vr <strong>for</strong>m an orthonormal basis <strong>for</strong> Col(A T ) ⊆ R n .<br />

4. The columns of Vn−r <strong>for</strong>m an orthonormal basis <strong>for</strong> Null(A) ⊆ R n .<br />

For this reason, the singular value decomposition is often thought of as the third part of<br />

the Fundamental Theorem of Linear Algebra.<br />

Note that we can now write:<br />

A = UΣV T = [ Ur  Um−r ] ⎡ D  O ⎤ ⎡ Vr T   ⎤ = Ur D Vr T ,
                         ⎣ O  O ⎦ ⎣ Vn−r T ⎦

where D = Diag(σ1,σ2,... ,σr) is the diagonal matrix of order r whose diagonal elements<br />

are the nonzero singular values, written in descending order.<br />

Moore-Penrose pseudoinverse. Working from the decomposition above, the matrix<br />

A + = VrD −1 U T r<br />

is called the Moore-Penrose pseudoinverse of the matrix A. A + is an n-by-m matrix.<br />

Using properties of orthogonal matrices: A + AA + = A + and AA + A = A.

The Moore-Penrose pseudoinverse can be applied to least squares problems (Section 9.5.5).<br />

If Ax = b is inconsistent and A T A does not have an inverse, then there are infinitely many<br />

least squares solutions. By using A + as if it were a true inverse:<br />

Ax = b =⇒ x + = A + b<br />

we obtain the least squares solution of minimum length.<br />

Example 9.52 (Solution of Minimum Length) Consider the inconsistent system<br />

⎡  1  −2 ⎤ ⎡ x ⎤   ⎡  3 ⎤
⎢  2  −4 ⎥ ⎣ y ⎦ = ⎢ −4 ⎥ .
⎣ −1   2 ⎦         ⎣ 15 ⎦

Then

A T Ax = A T b  ⇒  [ 6 −12 ; −12 24 ] [ x ; y ] = [ −20 ; 40 ]  ⇒  x = 2y − 10/3, y is free.

Thus, there are an infinite number of least squares solutions, of the form

(2y − 10/3, y) where y is free.

Since A = [ −1/√6 ; −2/√6 ; 1/√6 ] √30 [ −1/√5  2/√5 ], the pseudoinverse is

A + = [ −1/√5 ; 2/√5 ] (1/√30) [ −1/√6  −2/√6  1/√6 ]   and   x + = A + b = [ −2/3 ; 4/3 ].

Note that the square of the length of each solution is f(y) = (2y − 10/3)² + y², and that the function f(y) is minimized when y = 4/3.
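A short NumPy check (illustration only) of Example 9.52: numpy.linalg.pinv builds the Moore-Penrose pseudoinverse from the SVD and returns the minimum-length least squares solution.

import numpy as np

A = np.array([[1., -2.], [2., -4.], [-1., 2.]])
b = np.array([3., -4., 15.])
x_plus = np.linalg.pinv(A) @ b
print(x_plus)      # [-0.6667  1.3333], i.e. (-2/3, 4/3)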

Numerical linear algebra. In most practical applications of linear algebra, solutions<br />

are subject to roundoff errors and care must be taken to minimize these errors. Often, a<br />

singular value decomposition of the coefficient matrix is used in the computations, and the<br />

range of nonzero singular values is used to analyze errors.<br />

If the coefficient matrix is invertible, then the ratio of the largest to the smallest singular<br />

value c = σ1/σn is called the condition number of the matrix. The larger the condition<br />

number, the more error prone the numerical algorithm. A rule of thumb, which has been<br />

experimentally verified in a large number of trial cases, is that the computer can lose about<br />

log 10(c) decimal places of accuracy while solving linear equations numerically.<br />

In linear regression analysis, <strong>for</strong> example, if two columns of the design matrix X are nearly<br />

linearly related (nearly collinear), then the condition number of X T X will be quite high,<br />

affecting estimates of coefficients and variances.<br />

9.7 Linear Trans<strong>for</strong>mations<br />

Let V and W be vector spaces and let T : V → W be a function assigning to each v ∈ V<br />

a vector T(v) ∈ W. Then T is said to be a linear trans<strong>for</strong>mation if<br />

1. T(v1 + v2) = T(v1) + T(v2) <strong>for</strong> all v1,v2 ∈ V and<br />

2. T(cv) = cT(v) <strong>for</strong> all v ∈ V and c ∈ R.<br />

(That is, if T “respects” vector addition and scalar multiplication.)<br />

Example 9.53 (Trans<strong>for</strong>mations on R2 ) For example, let V = W = R2 ,<br />

<br />

T1( [ x ; y ] ) = [ 2x + 3y ; −y ]   and   T2( [ x ; y ] ) = [ x − y ; xy ].

Then T1 satisfies the criteria for linear transformations, but T2 does not. In particular,

T2( [ 3 ; 3 ] ) = [ 0 ; 9 ] ≠ [ 0 ; 3 ] = 3 T2( [ 1 ; 1 ] ).



Figure 9.1: The effects of three different linear trans<strong>for</strong>mations on the unit square (left).<br />

Linear combinations. If T is a linear trans<strong>for</strong>mation, then T(O) = O and<br />

T(c1v1 + c2v2 + · · · + cnvn) = c1T(v1) + c2T(v2) + · · · + cnT(vn)<br />

<strong>for</strong> each linear combination of vectors in V .<br />

9.7.1 Range and Kernel<br />

If T is a linear trans<strong>for</strong>mation, then the range of T,<br />

Range(T) = {w : w = T(v) <strong>for</strong> some v ∈ V } ⊆ W,<br />

is a subspace of W. Similarly, the kernel of T,<br />

Kernel(T) = {v : T(v) = O} ⊆ V,

is a subspace of V .

Isomorphism. T is said to be one-to-one if its kernel is trivial, Kernel(T) = {O}. T is<br />

said to be onto if its range is the entire space, Range(T) = W.<br />

If T is both one-to-one and onto, then T is an isomorphism between the vector spaces.<br />

9.7.2 Linear Trans<strong>for</strong>mations and Matrices<br />

If V ⊆ R n and W ⊆ R m , then the linear trans<strong>for</strong>mation T corresponds to matrix multiplication.<br />

That is,<br />

T(v) = Av <strong>for</strong> some m-by-n matrix A.<br />

Further, Range(T) = Col(A) and Kernel(T) = Null(A).<br />

Example 9.54 (Planar Trans<strong>for</strong>mations) When V = W = R2 , linear trans<strong>for</strong>mations<br />

are often viewed geometrically. For example, Figure 9.1 shows the image of the unit square<br />

(the set [0,1] × [0,1]) under linear trans<strong>for</strong>mations whose matrices are<br />

<br />

[ 1  1.5 ; 0  1 ],    [ −1.5  0 ; 0  2 ],    and    [ −2  1 ; −1  2.5 ],   respectively.



Figure 9.2: A trans<strong>for</strong>mation of R 2 (left) whose range (right) is isomorphic to R 2 .<br />

A linear trans<strong>for</strong>mation whose matrix is invertible will map parallelograms to parallelograms.<br />

Since the second matrix is a diagonal matrix, the image of the unit square is a<br />

rectangle whose sides have lengths equal to the absolute values of the diagonal elements.<br />

In each case, the area of the image of the unit square is the absolute value of the determinant<br />

of the matrix. The resulting areas in this case are 1, 3, and 4, respectively.<br />

9.7.3 Linearly Independent Columns<br />

Let T : R n → R m be a linear trans<strong>for</strong>mation with matrix A. If the columns of A are<br />

linearly independent, then<br />


Range(T) = Col(A) ⊆ R m is isomorphic to R n .<br />

Example 9.55 (Dimension 2) Let T : R2 → R3 be defined as follows:<br />

T(x) = A [ x1 ; y1 ] = [ 1 0 ; 0 −1 ; 1 1 ] [ x1 ; y1 ] = [ x1 ; −y1 ; x1 + y1 ].

Since the columns of A are linearly independent, the range of T is the 2-dimensional subspace<br />

of R 3 shown in the right part of Figure 9.2. (Plane with equation z2 = x2 − y2.)<br />

The figure also shows how a square grid on the x1y1-plane is trans<strong>for</strong>med by T to a parallelogram<br />

grid on the surface in x2y2z2-space with equation z2 = x2 − y2.<br />

9.7.4 Composition of Functions and Matrix Multiplication<br />

Let U ⊆ R p , V ⊆ R n , and W ⊆ R m be subspaces, let<br />

TB : U → V and TA : V → W<br />

be linear trans<strong>for</strong>mations with matrices B and A, respectively, and let<br />

T : U → W be the composition T(u) = TA(TB(u)).<br />




Figure 9.3: A linear trans<strong>for</strong>mation of the [−2,2]×[−2,2] square viewed through the singular<br />

value decomposition of the matrix of the trans<strong>for</strong>mation.<br />

Then the matrix of the linear trans<strong>for</strong>mation T is the matrix product AB.<br />

<br />

Example 9.56 (SVD) The planar transformation whose matrix is A = [ 2 −1 ; 2 2 ] maps the x1y1-plane to the x4y4-plane, shown in the first and last parts of Figure 9.3.

Consider the following singular value decomposition of A,

A = UΣV T = [ 1/√5  −2/√5 ; 2/√5  1/√5 ] [ 3 0 ; 0 2 ] [ 2/√5  1/√5 ; −1/√5  2/√5 ],

and its effect on the [−2,2] × [−2,2] square.

1. The x1y1-plane is mapped to the x2y2-plane by the trans<strong>for</strong>mation whose matrix is<br />

V T . Since V is orthogonal, the image of the square is a tilted square of the same size.<br />

2. The x2y2-plane is mapped to the x3y3-plane by the trans<strong>for</strong>mation whose matrix is<br />

Σ. Since Σ is diagonal, the tilted square is stretched in the horizontal and vertical<br />

directions by 3 and 2, respectively, yielding a parallelogram.<br />

3. The x3y3-plane is mapped to the x4y4-plane by the trans<strong>for</strong>mation whose matrix is U.<br />

Since U is orthogonal, the image of the parallelogram is a congruent parallelogram.<br />

The figure also shows that the unit circle is trans<strong>for</strong>med to an ellipse by the action of the<br />

diagonal matrix in the singular value decomposition.<br />

9.7.5 Diagonalization and Change of Bases<br />

Suppose that A is a diagonalizable matrix of order n (Section 9.6.2), with diagonalization<br />

A = PΛP −1 , where P = [v1 v2 · · · vn ] and Λ = Diag(λ1,λ2,... ,λn).<br />

The columns of P <strong>for</strong>m an eigenvector basis of R n . Given x ∈ R n , the coordinates of x<br />

with respect to this basis, say c, are found by solving the matrix equation:<br />

Pc = x =⇒ c = P −1 x.<br />



Thus, the linear trans<strong>for</strong>mation whose matrix is P −1 can be thought of as a trans<strong>for</strong>mation<br />

from the standard basis of R n to the eigenvector basis {v1,v2,...,vn}. Similarly, the<br />

trans<strong>for</strong>mation whose matrix is P can be thought of as a change of basis from the eigenvector<br />

basis back to the standard basis.<br />

Now, consider the trans<strong>for</strong>mation T whose matrix is A. Since<br />

<br />

T(x) = Ax = (PΛP −1 )x = P(Λ(P −1 x)) = P(Λc),

the action of T is (1) to change from the standard basis to the eigenvector basis, (2) to<br />

operate in the eigenvector-coordinate system as a diagonal matrix, and (3) to interpret the<br />

results back in standard-basis <strong>for</strong>m.<br />

Powers of A. The k th power of A can be written as follows:<br />

A k = (PΛP −1 ) k = (PΛP −1 )(PΛP −1 ) · · · (PΛP −1 ) = PΛ k P −1 .

Thus, the action of the linear trans<strong>for</strong>mation whose matrix is A k is (1) to change to the<br />

eigenvector basis, (2) to operate in the eigenvector-coordinate system by applying Λ k times,<br />

and (3) to interpret the results back in standard-basis <strong>for</strong>m.<br />

This geometric interpretation is especially useful in practice when considering, <strong>for</strong> example,<br />

transition matrices (from probability theory) or Leslie matrices (from population dynamics).<br />

9.7.6 Random Vectors and Linear Trans<strong>for</strong>mations<br />

Linear trans<strong>for</strong>mations of vectors of random variables are commonly used. Let<br />

X = [ X1 ; X2 ; ... ; Xn ]   and   µ = E(X) = [ E(X1) ; E(X2) ; ... ; E(Xn) ]

be n-by-1 vectors of random variables and their means, respectively, and let<br />

V ar(X) = E((X − µ)(X − µ) T )<br />

be the n-by-n matrix of covariances of the random variables. That is, V ar(X) is the matrix<br />

whose ij-entry is equal to the covariance between Xi and Xj when i ≠ j, and is equal to

the variance of Xi when i = j.<br />

Linear trans<strong>for</strong>mations. If A is an m-by-n matrix, then the linearly trans<strong>for</strong>med random<br />

vector Y = AX has an m-variate distribution with mean vector<br />

E(Y ) = E(AX) = AE(X) = Aµ

and with covariance matrix

V ar(Y ) = E((AX − Aµ)(AX − Aµ) T ) = E(A(X − µ)(X − µ) T A T ) = A V ar(X) A T .




Figure 9.4: Joint density (left) and contours enclosing 20%, 40% and 60% of the probability<br />

<strong>for</strong> the standard bivariate normal distribution with correlation ρ = 0.50.<br />

Example 9.57 (2-by-2 Matrix) Let

X = [ X1 ; X2 ],   A = [ 1 1 ; 2 −4 ],   and   Y = AX = [ X1 + X2 ; 2X1 − 4X2 ].

If

µ = E(X) = [ 2 ; 1 ]   and   V ar(X) = [ 1  1/2 ; 1/2  1 ],

then

E(Y ) = Aµ = [ 3 ; 0 ]   and   V ar(Y ) = A V ar(X) A T = [ 3  −3 ; −3  12 ].

Note that Corr(X1,X2) = 1/2 and Corr(Y1,Y2) = Cov(Y1,Y2) / √(V ar(Y1) V ar(Y2)) = −1/2.
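A tiny NumPy check (illustration only) of Example 9.57: the mean vector and covariance matrix transform as Aµ and A V A T .

import numpy as np

A  = np.array([[1., 1.], [2., -4.]])
mu = np.array([2., 1.])
V  = np.array([[1., 0.5], [0.5, 1.]])
print(A @ mu)        # [ 3.  0.]
print(A @ V @ A.T)   # [[ 3. -3.] [-3. 12.]]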

9.7.7 Example: Multivariate Normal Distribution<br />

Let X = [ X1 ; X2 ; ... ; Xk ] be a random k-tuple.

X is said to have a multivariate normal distribution if its joint PDF has the following <strong>for</strong>m<br />

f(x) = (2π)^(−k/2) |V |^(−1/2) exp( −(1/2) (x − µ) T V −1 (x − µ) )   for all x ∈ R k .

In this <strong>for</strong>mula, µ = E(X) is the k-by-1 vector of means and V = V ar(X) is the k-by-k<br />

covariance matrix. (See Section 8.1.9 also.)<br />

Note that to find the probability that a normal random k-tuple takes values in a domain<br />

D ⊆ R k , you need to compute a k-variate integral.<br />



Figure 9.5: Transformations of probability contours.

Example 9.58 (k = 2) Let X = [ X ; Y ], µ = O and V = [ 1  0.5 ; 0.5  1 ].

Then the random pair has a standard bivariate normal distribution with parameter ρ = 0.50.<br />

The left part of Figure 9.4 is a graph of the surface z = f(x,y). The right part of the figure<br />

shows contours enclosing 20%, 40% and 60% of the probability.<br />

Each contour in the right plot is an ellipse whose major and minor axes are along the lines<br />

y = x and y = −x (gray lines), respectively. Let D1, D2, and D3 be the regions bounded<br />

by the three ellipses in increasing size.<br />

Orthogonal diagonalization. Since V is symmetric, the spectral theorem (Section 9.6.3)<br />

implies that it can be diagonalized as follows:<br />

V = QΛQ T ,<br />

where Λ is a diagonal matrix of eigenvalues and Q is orthogonal.<br />

As discussed earlier, the linear trans<strong>for</strong>mation whose matrix is QT can be thought of as a<br />

trans<strong>for</strong>mation from the standard basis of Rk to the basis whose elements are the columns<br />

of Q. Further, if Y = QTX, then<br />

E(Y ) = Q T µ   and   V ar(Y ) = Q T V ar(X) Q = Q T (QΛQ T )Q = Λ.

Thus, the Yi’s are uncorrelated, with variances equal to the eigenvalues of V . (The Yi’s are,<br />

in fact, independent normal random variables.)<br />

Example 9.59 (k = 2, continued) Continuing with the example above,<br />

V = QΛQ T = [ 1/√2  −1/√2 ; 1/√2  1/√2 ] [ 1.5  0 ; 0  0.5 ] [ 1/√2  1/√2 ; −1/√2  1/√2 ].

The action of the matrix product Λ^(−1/2) Q T is illustrated in Figure 9.5.

Specifically, if we let x2 = Q T x1, then points on the ellipses shown in the x1y1-plane (left<br />

plot) are rotated by 45 degrees in the clockwise direction to points on the ellipses shown in<br />

the x2y2-plane (center plot).<br />



Further, if we let x3 = Λ^(−1/2) x2, then points on the rotated ellipses in the x2y2-plane (center

plot) are mapped to circles in the x3y3-plane (right plot).<br />

These trans<strong>for</strong>mations (plus a polar trans<strong>for</strong>mation) can be used to demonstrate that<br />

P((X,Y ) ∈ D1) = 0.20, P((X,Y ) ∈ D2) = 0.40, P((X,Y ) ∈ D3) = 0.60.<br />

9.8 Applications in <strong>Statistics</strong><br />

Linear algebra is one of the most useful branches of mathematics. In statistics, linear algebra<br />

methods are used to simplify computations related to multivariate probability distributions<br />

and to simplify computations related to summarizing random samples. This section focuses<br />

on two applications.<br />

9.8.1 Least Squares Estimation<br />

Given n cases, (x1,i,x2,i,... ,xp−1,i,yi) <strong>for</strong> i = 1,2,... ,n, whose values lie close to a linear<br />

equation of the <strong>for</strong>m<br />

y = β0 + β1x1 + β2x2 + · · · + βp−1xp−1,<br />

the method of least squares can be used to estimate the unknown parameters.<br />

The n-by-p design matrix, X, the p-by-1 vector of unknowns, β, and the n-by-1 vector of<br />

right hand side values, y, are defined as follows:<br />

X = ⎡ 1  x1,1  x2,1  · · ·  xp−1,1 ⎤        ⎡ β0   ⎤             ⎡ y1 ⎤
    ⎢ 1  x1,2  x2,2  · · ·  xp−1,2 ⎥        ⎢ β1   ⎥             ⎢ y2 ⎥
    ⎢ .    .     .   · · ·     .   ⎥ ,  β = ⎢  .   ⎥ ,  and  y = ⎢  . ⎥ .
    ⎣ 1  x1,n  x2,n  · · ·  xp−1,n ⎦        ⎣ βp−1 ⎦             ⎣ yn ⎦

We wish to find least squares solutions to the inconsistent system Xβ = y.<br />

Example 9.60 (Sleep Study, Part 1) (Allison & Cicchetti, Science, 194:732-734, 1976;

lib.stat.cmu.edu/DASL.) As part of a study on sleep in mammals, researchers collected<br />

in<strong>for</strong>mation on the average hours of nondreaming and dreaming sleep <strong>for</strong> 43 different species.<br />

The left part of Figure 9.6 compares the common logarithm of nondreaming sleep (vertical<br />

axis) to the common logarithm of dreaming sleep (horizontal axis). A simple linear prediction<br />

<strong>for</strong>mula will be used to write the common logarithm of hours of nondreaming sleep as<br />

a function of the common logarithm of hours of dreaming sleep.<br />

For these data, Xβ = y ⇒ X T Xβ = X T y ⇒<br />

<br />

<br />

[ 43       8.75915 ] [ β0 ]   [ 38.3639 ]        β0 = 0.805178
[ 8.75915  5.33664 ] [ β1 ] = [ 9.33209 ]   ⇒   β1 = 0.427124 .

Thus, the simple linear prediction equation is y = 0.427124x + 0.805178.
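A short NumPy sketch (illustration only) reproducing the fit in Example 9.60 from the normal equations X T Xβ = X T y, using the summary values quoted above:

import numpy as np

XtX = np.array([[43., 8.75915], [8.75915, 5.33664]])
Xty = np.array([38.3639, 9.33209])
beta = np.linalg.solve(XtX, Xty)
print(beta)     # approximately [0.805178  0.427124]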




Figure 9.6: Comparison of nondreaming sleep in log-hours with dreaming sleep in log-hours<br />

(left plot) and with body weight in log-kilograms (right plot) <strong>for</strong> data from the sleep study.<br />

Least squares linear fits are superimposed.<br />

The simple linear prediction equation is shown in the left part of Figure 9.6.<br />

Example 9.61 (Sleep Study, Part 2) The researchers also collected in<strong>for</strong>mation on the<br />

average body weight in kilograms <strong>for</strong> the 43 species, and on other variables. They developed<br />

a danger index using a five-point scale, where a higher value corresponds to a higher risk<br />

of exposure to the elements while sleeping and/or a higher risk of predation while sleeping.<br />

Table 9.2 gives the values of the danger index <strong>for</strong> each species in the study.<br />

The right part of Figure 9.6 compares the common logarithm of nondreaming sleep (vertical<br />

axis) to the common logarithm of body weight (horizontal axis). A linear prediction <strong>for</strong>mula<br />

will be used to write the common logarithm of hours of nondreaming sleep as a function<br />

of the common logarithm of kilograms of body weight (variable 1) and the danger index<br />

(variable 2).<br />

For these data, Xβ = y ⇒ X T Xβ = X T y ⇒<br />

⎡ 43       13.4047  115     ⎤ ⎡ β0 ⎤   ⎡ 38.3639 ⎤        β0 = 1.06671
⎢ 13.4047  78.5992  63.7371 ⎥ ⎢ β1 ⎥ = ⎢ 3.22446 ⎥   ⇒   β1 = −0.097165
⎣ 115      63.7371  389     ⎦ ⎣ β2 ⎦   ⎣ 95.4993 ⎦        β2 = −0.0539304 .

Thus, the linear prediction equation is y = −0.097165x − 0.0539304d + 1.06671.

Nondreaming sleep is negatively associated with both body weight and danger index. The<br />

right part of Figure 9.6 shows the simple linear prediction <strong>for</strong>mulas when d = 1,2,3,4,5.<br />

9.8.2 Principal Components Analysis<br />

Let X = [ x1 x2 · · · xn ] be a p-by-n data matrix, whose columns are a sample of size<br />

n from a p-variate distribution, let x̄ be the p-by-1 sample mean,

x̄ = (1/n) Σ_{i=1}^{n} xi ,


Table 9.2: Danger indices <strong>for</strong> 43 species in the sleep study.<br />

Danger = 1           Danger = 2          Danger = 3          Danger = 4          Danger = 5
Big brown bat        European hedgehog   African giant rat   Asian elephant      Cow
Cat                  Galago              ground squirrel     Baboon              Goat
Chimpanzee           Golden hamster      Mountain beaver     Brazilian tapir     Horse
E.Amer.mole          Owl monkey          Mouse               Chincilla           Rabbit
Gray seal            Phanlanger          Musk shrew          guinea pig          Sheep
Little brown bat     Rhesus monkey       Rat                 Short tail shrew
Man                  Rock hyrax (h.b.)   Rockhyrax (p.h.)    Patas monkey
Mole rat             Tenrec              Tree hyrax          Pig
N.Amer.opossum       Tree shrew          Vervet
9 banded armadillo
Red fox
Water opossum

Let X̃ be the p-by-n matrix whose jth column is the deviation of the jth observation from the sample mean: x̃j = xj − x̄.

Let S be the sample covariance matrix. S can be computed using the matrix equation

\[
S = \frac{1}{n-1} \widetilde{X} \widetilde{X}^T ,
\]

where X̃ is the centered data matrix above.
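A minimal numpy sketch of this computation (the function and variable names are illustrative, not from the notes):

    import numpy as np

    def sample_covariance(X):
        """X is p-by-n, with one observation per column.
        Returns the p-by-p sample covariance matrix S = (1/(n-1)) Xc Xc^T."""
        n = X.shape[1]
        xbar = X.mean(axis=1, keepdims=True)   # p-by-1 sample mean
        Xc = X - xbar                          # centered data matrix (deviations from the mean)
        return (Xc @ Xc.T) / (n - 1)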

Since S is symmetric, we can use the spectral theorem (Section 9.6.3) to write

S = QΛQ^T,

where Λ is a diagonal matrix of eigenvalues ordered so that λ1 ≥ λ2 ≥ · · · ≥ λp, and where Q is an orthogonal matrix whose columns q1, q2, . . . , qp are corresponding unit eigenvectors.

Consider the linear transformation

\[
Y = Q^T \widetilde{X} =
\begin{bmatrix}
q_1 \cdot \tilde{x}_1 & q_1 \cdot \tilde{x}_2 & \cdots & q_1 \cdot \tilde{x}_n \\
q_2 \cdot \tilde{x}_1 & q_2 \cdot \tilde{x}_2 & \cdots & q_2 \cdot \tilde{x}_n \\
\vdots & \vdots & & \vdots \\
q_p \cdot \tilde{x}_1 & q_p \cdot \tilde{x}_2 & \cdots & q_p \cdot \tilde{x}_n
\end{bmatrix} ,
\]

and let S_Y be the sample covariance matrix of Y. Then

\[
S_Y = Q^T S Q = Q^T \left( Q \Lambda Q^T \right) Q = \Lambda ,
\]

indicating that the p transformed centered samples are uncorrelated. Further, the first transformed centered sample (first row of Y) has the largest sample variance, λ1; the second (second row of Y) has the next largest sample variance, λ2; and so forth.

Since the ith column of Q, namely qi, is used to produce the ith transformed sample (the ith row of Y), the vector qi is called the ith principal component of the centered data matrix X̃.
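Putting the pieces together, the principal components of any p-by-n data matrix can be computed with a few lines of numpy. The sketch below (illustrative, not part of the notes) mirrors the steps above; the sample covariance matrix of the rows of the returned Y is, up to rounding, the diagonal matrix Λ.

    import numpy as np

    def principal_components(X):
        """X is p-by-n, with one observation per column.
        Returns the eigenvalues in decreasing order, the orthogonal matrix Q whose
        columns are the principal components, and the transformed data Y = Q^T Xc."""
        n = X.shape[1]
        Xc = X - X.mean(axis=1, keepdims=True)   # centered data matrix
        S = (Xc @ Xc.T) / (n - 1)                # sample covariance matrix
        lam, Q = np.linalg.eigh(S)               # eigh handles symmetric matrices, but
        order = np.argsort(lam)[::-1]            # returns eigenvalues in increasing order,
        lam, Q = lam[order], Q[:, order]         # so reverse to get lam[0] >= lam[1] >= ...
        Y = Q.T @ Xc                             # rows of Y are the transformed samples
        return lam, Q, Y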




Figure 9.7: Abdomen and chest measurements (left), and abdomen, chest, and hip measurements (right) for 100 men. Principal component axes are highlighted in each plot.

Example 9.62 (Body Fat Study, Part 1) (Johnson, JSE:4(1), 1996) As part of a study to determine if simple body measurements could be used to predict body fat, abdomen and chest measurements in centimeters were made on 100 men. The left part of Figure 9.7 compares chest (variable 1) and abdomen (variable 2) measurements for these men.

For these data,

\[
\bar{x} = \begin{bmatrix} 101.068 \\ 92.183 \end{bmatrix} , \qquad
S = \begin{bmatrix} 68.3283 & 76.3466 \\ 76.3466 & 102.599 \end{bmatrix} ,
\]

and S = QΛQ^T, where

\[
\Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}
\approx \begin{bmatrix} 163.709 & 0 \\ 0 & 7.218 \end{bmatrix}
\qquad \text{and} \qquad
Q = \begin{bmatrix} q_1 & q_2 \end{bmatrix}
\approx \begin{bmatrix} 0.625 & -0.781 \\ 0.781 & 0.625 \end{bmatrix} .
\]

The principal component axes shown in the left part of Figure 9.7 are lines described parametrically as ℓi(t) = x̄ + t qi, for i = 1, 2. The lines are centered at the sample mean x̄.

Since λ1 is much greater than λ2, most of the variation in the data is along the first principal component axis. In fact, since

\[
\frac{\lambda_1}{\lambda_1 + \lambda_2} \approx 0.958
\qquad \text{and} \qquad
\frac{\lambda_2}{\lambda_1 + \lambda_2} \approx 0.042 ,
\]

about 95.8% of the total variation in the data is due to the first component.

Further, if c is the chest measurement and a is the abdomen measurement of a man in the study, then the formula

y = 0.625(c − 101.068) + 0.781(a − 92.183)

can be used as a body “size index” in later analyses.
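The eigenvalues, the proportion of variation, and the size index can be reproduced from the quoted sample mean and covariance matrix; a minimal numpy sketch (illustrative) is:

    import numpy as np

    xbar = np.array([101.068, 92.183])          # mean chest and abdomen measurements (cm)
    S = np.array([[68.3283,  76.3466],
                  [76.3466, 102.599 ]])

    lam, Q = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]               # decreasing eigenvalue order
    lam, Q = lam[order], Q[:, order]
    print(lam)                                  # approximately [163.709, 7.218]
    print(lam[0] / lam.sum())                   # approximately 0.958

    q1 = Q[:, 0]
    if q1[0] < 0:                               # an eigenvector's sign is arbitrary;
        q1 = -q1                                # match the signs used in the text

    def size_index(c, a):
        """First principal component score for a (chest, abdomen) pair, in centimeters."""
        return q1 @ (np.array([c, a]) - xbar)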



Example 9.63 (Body Fat Study, Part 2) The researchers also took hip measurements in centimeters. The right part of Figure 9.7 compares chest (variable 1), abdomen (variable 2), and hip (variable 3) measurements for the 100 men.

For these data,

\[
\bar{x} = \begin{bmatrix} 101.068 \\ 92.183 \\ 99.594 \end{bmatrix} , \qquad
S = \begin{bmatrix} 68.3283 & 76.3466 & 43.8397 \\ 76.3466 & 102.599 & 56.79 \\ 43.8397 & 56.79 & 42.1284 \end{bmatrix} ,
\]

and S = QΛQ^T, where

\[
\Lambda = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}
\approx \begin{bmatrix} 196.946 & 0 & 0 \\ 0 & 9.474 & 0 \\ 0 & 0 & 6.635 \end{bmatrix}
\]

and

\[
Q = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}
\approx \begin{bmatrix} 0.565 & 0.588 & 0.579 \\ 0.710 & 0.011 & -0.704 \\ 0.420 & -0.809 & 0.412 \end{bmatrix} .
\]

The principal component axes shown in the right part of Figure 9.7 are lines described parametrically as ℓi(t) = x̄ + t qi, for i = 1, 2, 3. The lines are centered at the sample mean x̄.

Since λ1 is much greater than λ2 and λ3, most of the variation in the data is along the first principal component axis. In fact, since

\[
\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3} \approx 0.924 , \qquad
\frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3} \approx 0.044 , \qquad
\frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3} \approx 0.031 ,
\]

about 92.4% of the total variation in the data is due to the first component.

Further, if c is the chest measurement, a is the abdomen measurement, and h is the hip measurement of a man in the study, then the formula

y = 0.565(c − 101.068) + 0.710(a − 92.183) + 0.420(h − 99.594)

can be used as a body “size index” in later analyses.
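The same numerical check works for the three-variable decomposition; the short sketch below (illustrative) recovers the eigenvalues and proportions of variation quoted above:

    import numpy as np

    S = np.array([[68.3283,  76.3466, 43.8397],
                  [76.3466, 102.599,  56.79  ],
                  [43.8397,  56.79,   42.1284]])

    lam = np.linalg.eigvalsh(S)[::-1]   # eigenvalues of the symmetric matrix, decreasing order
    print(lam)                          # approximately [196.946, 9.474, 6.635]
    print(lam / lam.sum())              # approximately [0.924, 0.044, 0.031]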





10 Additional Reading

(1-2) An Introduction to Mathematical Statistics, by Larsen and Marx (Prentice Hall), and Mathematical Statistics and Data Analysis, by Rice (Duxbury Press), are introductions to probability theory and mathematical statistics with emphasis on applications. The book by Larsen and Marx is currently used as the textbook for the MT426-427 mathematical statistics sequence.

(3-4) Calculus, by Hughes-Hallett, Gleason, et al. (Wiley), and Multivariable Calculus, by McCallum, Hughes-Hallett, Gleason, et al. (Wiley), are gentle introductions to single-variable and multivariable calculus methods. The first book is currently used as the textbook for the MT100-101 calculus sequence.

(5) Vector Calculus, by Colley (Prentice Hall), is a more sophisticated introduction to multivariable calculus. Colley emphasizes the use of linear algebra methods. This book is currently used as the textbook for the MT202 multivariable calculus course.

(6-7) Linear Algebra and Its Applications, by Lay (Addison-Wesley), and Introduction to Linear Algebra, by Strang (Wellesley-Cambridge), are good introductions to linear algebra methods. Each book emphasizes both theory and practice. The book by Lay is currently used as the textbook for the MT210 linear algebra course.

(8) Advanced Calculus with Applications in Statistics, by Khuri (Wiley), is a more advanced text designed specifically to tie topics in advanced calculus to applications in statistics.

(9-10) Introduction to Matrices with Applications in Statistics, by Graybill (Wadsworth), and Matrix Algebra Useful for Statistics, by Searle (Wiley), are well-respected texts designed specifically to tie linear algebra methods to applications in statistics. Note, however, that each book is quite old, and each uses out-of-date notation.

(11) Statistical Models, by Freedman (Cambridge University Press), is a recent text for a second course in statistics. From the Preface: “The contents of the book can be fairly described as what you have to know in order to start reading empirical papers that use statistical models.”

