MT580 Mathematics for Statistics:
"All the Math You Never Had"

Jenny A. Baglivo
Mathematics Department
Boston College
Chestnut Hill, MA 02467-3806
(617) 552-3772
jenny.baglivo@bc.edu

Third Draft, Summer 2006
MT580 is an introduction to probability, discrete and continuous methods, and linear algebra for graduate students in applied statistics with little or no formal training in these fields. Applications in statistics are emphasized throughout.

The topics covered in MT580 are drawn from several subject areas; hence the need for a specialized set of course notes. I would like to thank GSAS Dean Mick Smyer for supporting the development of these notes and Mathematics Professor Dan Chambers for many useful comments and suggestions on an earlier draft.

© Jenny A. Baglivo 2006. All rights reserved.
1 Introductory Probability Concepts 1
1.1 Set Operations . . . . . . . . 1
1.2 Sample Spaces and Events . . . . . . . . 2
1.3 Probability Axioms and Properties Following From the Axioms . . . . . . . . 3
1.3.1 Kolmogorov Axioms for Finite Sample Spaces . . . . . . . . 3
1.3.2 Equally Likely Outcomes . . . . . . . . 4
1.3.3 Properties Following from the Axioms . . . . . . . . 4
1.4 Counting Methods . . . . . . . . 5
1.5 Permutations and Combinations . . . . . . . . 6
1.5.1 Example: Simple Urn Model . . . . . . . . 7
1.5.2 Binomial Coefficients . . . . . . . . 8
1.6 Partitioning Sets . . . . . . . . 8
1.6.1 Multinomial Coefficients . . . . . . . . 9
1.6.2 Example: Urn Models . . . . . . . . 10
1.7 Applications in Statistics . . . . . . . . 10
1.7.1 Maximum Likelihood Estimation in Simple Urn Models . . . . . . . . 11
1.7.2 Permutation and Nonparametric Methods . . . . . . . . 12
1.8 Conditional Probability . . . . . . . . 16
1.8.1 Multiplication Rules for Probability . . . . . . . . 17
1.8.2 Law of Total Probability . . . . . . . . 17
1.8.3 Bayes Rule . . . . . . . . 18
1.9 Independent Events and Mutually Independent Events . . . . . . . . 21
2 Discrete Random Variables 23
2.1 Definitions . . . . . . . . 23
2.1.1 PDF and CDF for Discrete Distributions . . . . . . . . 24
2.2 Families of Distributions . . . . . . . . 25
2.2.1 Example: Discrete Uniform Distribution . . . . . . . . 25
2.2.2 Example: Hypergeometric Distribution . . . . . . . . 25
2.2.3 Distributions Related to Bernoulli Experiments . . . . . . . . 26
2.2.4 Simple Random Samples . . . . . . . . 27
2.3 Mathematical Expectation . . . . . . . . 28
2.3.1 Definitions for Finite Discrete Random Variables . . . . . . . . 28
2.3.2 Properties of Expectation . . . . . . . . 29
2.3.3 Mean, Variance and Standard Deviation . . . . . . . . 30
2.3.4 Chebyshev's Inequality . . . . . . . . 31
3 Joint Discrete Distributions 33
3.1 Bivariate Distributions . . . . . . . . 33
3.1.1 Joint PDF . . . . . . . . 33
3.1.2 Joint CDF . . . . . . . . 33
3.1.3 Marginal Distributions . . . . . . . . 33
3.1.4 Conditional Distributions . . . . . . . . 34
3.1.5 Independent Random Variables . . . . . . . . 35
3.1.6 Mathematical Expectation for Finite Bivariate Distributions . . . . . . . . 36
3.1.7 Covariance and Correlation . . . . . . . . 36
3.1.8 Conditional Expectation and Regression . . . . . . . . 38
3.1.9 Example: Bivariate Hypergeometric Distribution . . . . . . . . 40
3.1.10 Example: Trinomial Distribution . . . . . . . . 41
3.1.11 Simple Random Samples . . . . . . . . 42
3.2 Multivariate Distributions . . . . . . . . 42
3.2.1 Joint PDF and Joint CDF . . . . . . . . 43
3.2.2 Mutually Independent Random Variables and Random Samples . . . . . . . . 44
3.2.3 Mathematical Expectation for Finite Multivariate Distributions . . . . . . . . 44
3.2.4 Example: Multivariate Hypergeometric Distribution . . . . . . . . 45
3.2.5 Example: Multinomial Distribution . . . . . . . . 46
3.2.6 Simple Random Samples . . . . . . . . 47
3.2.7 Sample Summaries . . . . . . . . 47
4 Introductory Calculus Concepts 51
4.1 Functions . . . . . . . . 51
4.2 Limits and Continuity . . . . . . . . 54
4.2.1 Neighborhoods, Deleted Neighborhoods and Limits . . . . . . . . 54
4.2.2 Continuous Functions . . . . . . . . 58
4.3 Differentiation . . . . . . . . 59
4.3.1 Derivative and Tangent Line . . . . . . . . 59
4.3.2 Approximations and Local Linearity . . . . . . . . 60
4.3.3 First and Second Derivative Functions . . . . . . . . 61
4.3.4 Some Rules for Finding Derivative Functions . . . . . . . . 62
4.3.5 Chain Rule for Composition of Functions . . . . . . . . 63
4.3.6 Rules for Exponential and Logarithmic Functions . . . . . . . . 63
4.3.7 Rules for Trigonometric and Inverse Trigonometric Functions . . . . . . . . 64
4.3.8 Indeterminate Forms and L'Hospital's Rule . . . . . . . . 65
4.4 Optimization . . . . . . . . 67
4.4.1 Local Extrema and Derivatives . . . . . . . . 67
4.4.2 First Derivative Test for Local Extrema . . . . . . . . 67
4.4.3 Second Derivative Test for Local Extrema . . . . . . . . 68
4.4.4 Global Extrema on Closed Intervals . . . . . . . . 68
4.4.5 Newton's Method . . . . . . . . 70
4.5 Applications in Statistics . . . . . . . . 71
4.5.1 Maximum Likelihood Estimation in Binomial Models . . . . . . . . 71
4.5.2 Maximum Likelihood Estimation in Multinomial Models . . . . . . . . 72
4.5.3 Minimum Chi-Square Estimation in Multinomial Models . . . . . . . . 73
4.6 Rolle's Theorem and the Mean Value Theorem . . . . . . . . 75
4.6.1 Generalized Mean Value Theorem and Error Analysis . . . . . . . . 76
4.7 Integration . . . . . . . . 77
4.7.1 Riemann Sums and Definite Integrals . . . . . . . . 78
4.7.2 Properties of Definite Integrals . . . . . . . . 81
4.7.3 Fundamental Theorem of Calculus . . . . . . . . 81
4.7.4 Antiderivatives and Indefinite Integrals . . . . . . . . 82
4.7.5 Some Formulas for Indefinite Integrals . . . . . . . . 82
4.7.6 Approximation Methods . . . . . . . . 84
4.7.7 Improper Integrals . . . . . . . . 86
4.8 Applications in Statistics . . . . . . . . 88
4.8.1 de Moivre-Laplace Limit Theorem . . . . . . . . 88
4.8.2 Computing Quantiles . . . . . . . . 89
5 Continuous Random Variables 91
5.1 Definitions . . . . . . . . 91
5.1.1 PDF and CDF for Continuous Distributions . . . . . . . . 91
5.1.2 Quantiles and Percentiles . . . . . . . . 92
5.2 Families of Distributions . . . . . . . . 93
5.2.1 Example: Uniform Distribution . . . . . . . . 93
5.2.2 Example: Exponential Distribution . . . . . . . . 94
5.2.3 Euler Gamma Function . . . . . . . . 95
5.2.4 Example: Gamma Distribution . . . . . . . . 96
5.2.5 Example: Normal Distribution . . . . . . . . 97
5.2.6 Transforming Continuous Random Variables . . . . . . . . 99
5.3 Mathematical Expectation . . . . . . . . 101
5.3.1 Definitions for Continuous Random Variables . . . . . . . . 101
5.3.2 Properties of Expectation . . . . . . . . 102
5.3.3 Mean, Variance and Standard Deviation . . . . . . . . 102
5.3.4 Chebyshev's Inequality . . . . . . . . 102
6 Infinite Sequences and Series 103
6.1 Limits and Convergence of Infinite Sequences . . . . . . . . 103
6.1.1 Convergence Properties . . . . . . . . 103
6.1.2 Geometric Sequences . . . . . . . . 104
6.1.3 Bounded Sequences . . . . . . . . 104
6.1.4 Monotone Sequences . . . . . . . . 105
6.2 Partial Sums and Convergence of Infinite Series . . . . . . . . 105
6.2.1 Convergence Properties . . . . . . . . 106
6.2.2 Geometric Series . . . . . . . . 106
6.3 Convergence Tests for Positive Series . . . . . . . . 107
6.4 Alternating Series and Error Analysis . . . . . . . . 109
6.5 Absolute and Conditional Convergence of Series . . . . . . . . 110
6.5.1 Product of Series . . . . . . . . 111
6.6 Taylor Polynomials and Taylor Series . . . . . . . . 112
6.6.1 Higher Order Derivatives . . . . . . . . 112
6.6.2 Taylor Polynomials . . . . . . . . 113
6.6.3 Taylor Series . . . . . . . . 114
6.6.4 Radius of Convergence . . . . . . . . 115
6.7 Power Series . . . . . . . . 115
6.7.1 Power Series and Taylor Series . . . . . . . . 116
6.7.2 Differentiation and Integration of Power Series . . . . . . . . 117
6.8 Discrete Random Variables, revisited . . . . . . . . 118
6.8.1 Countably Infinite Sample Spaces . . . . . . . . 118
6.8.2 Infinite Discrete Random Variables . . . . . . . . 118
6.8.3 Example: Geometric Distribution . . . . . . . . 120
6.8.4 Example: Poisson Distribution . . . . . . . . 121
6.9 Continuous Random Variables, revisited . . . . . . . . 122
6.9.1 Exponential Distribution . . . . . . . . 123
6.9.2 Gamma Distribution . . . . . . . . 123
6.10 Summary: Poisson Processes . . . . . . . . 124
7 Multivariable Calculus Concepts 125
7.1 Vector and Matrix Operations . . . . . . . . 125
7.1.1 Representations of Vectors . . . . . . . . 125
7.1.2 Addition and Scalar Multiplication of Vectors . . . . . . . . 125
7.1.3 Length and Distance . . . . . . . . 127
7.1.4 Dot Product of Vectors . . . . . . . . 128
7.1.5 Matrix Notation and Matrix Transpose . . . . . . . . 129
7.1.6 Addition and Scalar Multiplication of Matrices . . . . . . . . 129
7.1.7 Matrix Multiplication . . . . . . . . 130
7.1.8 Determinants of Square Matrices . . . . . . . . 130
7.2 Limits and Continuity . . . . . . . . 131
7.2.1 Neighborhoods, Deleted Neighborhoods and Limits . . . . . . . . 132
7.2.2 Continuous Functions . . . . . . . . 133
7.3 Differentiation in Several Variables . . . . . . . . 134
7.3.1 Partial Derivatives and the Gradient Vector . . . . . . . . 134
7.3.2 Second-Order Partial Derivatives . . . . . . . . 135
7.3.3 Differentiability, Local Linearity and Tangent (Hyper)Planes . . . . . . . . 135
7.3.4 Total Derivative . . . . . . . . 138
7.3.5 Directional Derivatives . . . . . . . . 138
7.4 Optimization . . . . . . . . 140
7.4.1 Linear and Quadratic Approximations . . . . . . . . 140
7.4.2 Second Derivative Test . . . . . . . . 142
7.4.3 Constrained Optimization and Lagrange Multipliers . . . . . . . . 144
7.4.4 Method of Steepest Descent/Ascent . . . . . . . . 148
7.5 Applications in Statistics . . . . . . . . 149
7.5.1 Least Squares Estimation in Quadratic Models . . . . . . . . 149
7.5.2 Maximum Likelihood Estimation in Multinomial Models . . . . . . . . 150
7.6 Multiple Integration . . . . . . . . 151
7.6.1 Riemann Sums and Double Integrals over Rectangles . . . . . . . . 151
7.6.2 Iterated Integrals over Rectangles . . . . . . . . 153
7.6.3 Properties of Double Integrals . . . . . . . . 155
7.6.4 Double and Iterated Integrals over General Domains . . . . . . . . 156
7.6.5 Triple and Iterated Integrals over Boxes . . . . . . . . 158
7.6.6 Triple Integrals over General Domains . . . . . . . . 160
7.6.7 Partial Anti-Differentiation and Partial Differentiation . . . . . . . . 161
7.6.8 Change of Variables and the Jacobian . . . . . . . . 162
7.7 Applications in Statistics . . . . . . . . 166
7.7.1 Computing Probabilities for Continuous Random Pairs . . . . . . . . 166
7.7.2 Contours for Independent Standard Normal Random Variables . . . . . . . . 167
8 Joint Continuous Distributions 169
8.1 Bivariate Distributions . . . . . . . . 169
8.1.1 Joint PDF and Joint CDF . . . . . . . . 169
8.1.2 Marginal Distributions . . . . . . . . 170
8.1.3 Conditional Distributions . . . . . . . . 171
8.1.4 Independent Random Variables . . . . . . . . 173
8.1.5 Mathematical Expectation for Bivariate Continuous Distributions . . . . . . . . 173
8.1.6 Covariance and Correlation . . . . . . . . 174
8.1.7 Conditional Expectation and Regression . . . . . . . . 175
8.1.8 Example: Bivariate Uniform Distribution . . . . . . . . 177
8.1.9 Example: Bivariate Normal Distribution . . . . . . . . 177
8.1.10 Transforming Continuous Random Variables . . . . . . . . 179
8.2 Multivariate Distributions . . . . . . . . 182
8.2.1 Joint PDF and Joint CDF . . . . . . . . 182
8.2.2 Marginal and Conditional Distributions . . . . . . . . 183
8.2.3 Mutually Independent Random Variables and Random Samples . . . . . . . . 184
8.2.4 Mathematical Expectation for Multivariate Continuous Distributions . . . . . . . . 184
8.2.5 Sample Summaries . . . . . . . . 185
9 Linear Algebra Concepts 189
9.1 Solving Systems of Equations . . . . . . . . 189
9.1.1 Elementary Row Operations . . . . . . . . 189
9.1.2 Row Reduction and Echelon Forms . . . . . . . . 191
9.1.3 Vector and Matrix Equations . . . . . . . . 194
9.2 Matrix Operations . . . . . . . . 197
9.2.1 Basic Operations . . . . . . . . 198
9.2.2 Inverses of Square Matrices . . . . . . . . 200
9.2.3 Matrix Factorizations . . . . . . . . 203
9.2.4 Partitioned Matrices . . . . . . . . 205
9.3 Determinants of Square Matrices . . . . . . . . 207
9.4 Vector Spaces and Subspaces . . . . . . . . 209
9.4.1 Definitions . . . . . . . . 209
9.4.2 Subspaces of R^m . . . . . . . . 211
9.4.3 Linear Independence, Basis and Dimension . . . . . . . . 211
9.4.4 Rank and Nullity . . . . . . . . 214
9.5 Orthogonality . . . . . . . . 215
9.5.1 Orthogonal Complements . . . . . . . . 215
9.5.2 Orthogonal Sets, Bases and Projections . . . . . . . . 216
9.5.3 Gram-Schmidt Orthogonalization . . . . . . . . 217
9.5.4 QR-Factorization . . . . . . . . 218
9.5.5 Linear Least Squares . . . . . . . . 219
9.6 Eigenvalues and Eigenvectors . . . . . . . . 221
9.6.1 Characteristic Equation . . . . . . . . 221
9.6.2 Diagonalization . . . . . . . . 222
9.6.3 Symmetric Matrices . . . . . . . . 223
9.6.4 Positive Definite Matrices . . . . . . . . 224
9.6.5 Singular Value Decomposition . . . . . . . . 224
9.7 Linear Transformations . . . . . . . . 228
9.7.1 Range and Kernel . . . . . . . . 229
9.7.2 Linear Transformations and Matrices . . . . . . . . 229
9.7.3 Linearly Independent Columns . . . . . . . . 230
9.7.4 Composition of Functions and Matrix Multiplication . . . . . . . . 230
9.7.5 Diagonalization and Change of Bases . . . . . . . . 231
9.7.6 Random Vectors and Linear Transformations . . . . . . . . 232
9.7.7 Example: Multivariate Normal Distribution . . . . . . . . 233
9.8 Applications in Statistics . . . . . . . . 235
9.8.1 Least Squares Estimation . . . . . . . . 235
9.8.2 Principal Components Analysis . . . . . . . . 236
10 Additional Reading 241
1 Introductory Probability Concepts

Probability is the study of random phenomena. Probability theory can be applied, for example, to study games of chance (e.g. roulette games, card games), occurrences of catastrophic events (e.g. tornados, earthquakes), survival of animal species, and changes in stock and commodity markets.

This chapter introduces probability theory.
1.1 Set Operations

Sets and operations on sets are fundamental to any area of mathematics. A set is a well-defined collection of objects called the elements or members of the set. (Well-defined means that there is a definite way of determining whether a given element belongs to the set.) The notation x ∈ A means "x is an element of A," and the notation x ∉ A means "x is not an element of A."

The empty set is the set containing no elements, and the universal set is the set containing all elements under consideration in a given study. The notation ∅ is used to denote the empty set, and the notation Ω is used to denote the universal set.
A is a subset of B (denoted by A ⊆ B) if each element of A is an element of B.

Union and intersection. The union of the sets A and B (denoted by A ∪ B) is the set of elements that belong to either A or B, and the intersection of the sets A and B (denoted by A ∩ B) is the set of elements that belong to both A and B. That is,

A ∪ B = {x | x ∈ A or x ∈ B} and A ∩ B = {x | x ∈ A and x ∈ B}.

Note that A ∪ B = B ∪ A and A ∩ B = B ∩ A.

The operations of union and intersection are associative; that is,

(A ∪ B) ∪ C = A ∪ (B ∪ C) and (A ∩ B) ∩ C = A ∩ (B ∩ C) for sets A, B, and C.

Thus, the union A ∪ B ∪ C and the intersection A ∩ B ∩ C are well-defined. More generally, the union of the k sets A_1, A_2, ..., A_k (denoted by ∪_{i=1}^k A_i) consists of all elements belonging to at least one A_i, and the intersection of the k sets A_1, A_2, ..., A_k (denoted by ∩_{i=1}^k A_i) consists of all elements belonging to all k sets.

The operations of union and intersection satisfy the following distributive laws:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

for sets A, B, and C.
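These identities are easy to spot-check numerically. The following snippet (an illustration added to these notes, with arbitrarily chosen sets A, B, and C) verifies the commutative, associative, and distributive laws using Python's built-in set type, where | is union and & is intersection:

```python
# Arbitrary example sets; any finite sets would do.
A = {1, 2, 3}
B = {3, 4}
C = {4, 5}

# Commutativity of union and intersection.
assert A | B == B | A
assert A & B == B & A

# Associativity of union and intersection.
assert (A | B) | C == A | (B | C)
assert (A & B) & C == A & (B & C)

# The two distributive laws.
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)
```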
Complement. The complement of the set A in the universal set (denoted by A^c = Ω − A) is the set of all elements in the universal set that do not belong to A. Note, in particular, that A ∩ A^c = ∅, A ∪ A^c = Ω and (A^c)^c = A.
Pairwise disjoint sets. The sets A_1 and A_2 are said to be disjoint if A_1 ∩ A_2 = ∅. The sets A_1, A_2, ..., A_k are said to be pairwise disjoint if A_i ∩ A_j = ∅ whenever i ≠ j.

If A and B are sets, then B can be written as the disjoint union of the part of B contained in A with the part of B contained in the complement of A. That is,

B = (B ∩ A) ∪ (B ∩ A^c) where (B ∩ A) ∩ (B ∩ A^c) = ∅.
Partitions. A partition of Ω is a collection A_1, A_2, ..., A_k of pairwise disjoint sets whose union is Ω. If A_1, A_2, ..., A_k partition Ω, then (B ∩ A_1), (B ∩ A_2), ..., (B ∩ A_k) partition the set B. That is,

B = ∪_{i=1}^k (B ∩ A_i) where (B ∩ A_i) ∩ (B ∩ A_j) = ∅ when i ≠ j.
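As a quick sanity check (not part of the original text), the decomposition of B over a partition of Ω can be verified with Python sets; Omega, A1, A2, A3, and B below are arbitrary examples:

```python
# An arbitrary universal set and a partition of it into three pieces.
Omega = set(range(10))
A1, A2, A3 = {0, 1, 2}, {3, 4, 5}, {6, 7, 8, 9}
B = {1, 4, 5, 9}

# Intersect B with each partition piece.
parts = [B & Ai for Ai in (A1, A2, A3)]

# The pieces of B are pairwise disjoint ...
assert parts[0] & parts[1] == set()
assert parts[0] & parts[2] == set()
assert parts[1] & parts[2] == set()
# ... and their union recovers B.
assert parts[0] | parts[1] | parts[2] == B
```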
De Morgan's laws. De Morgan's laws relate complements to unions and intersections:

(A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c for sets A and B.

More generally,

(∪_{i=1}^k A_i)^c = ∩_{i=1}^k A_i^c and (∩_{i=1}^k A_i)^c = ∪_{i=1}^k A_i^c.
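A numerical check of De Morgan's laws (an addition to these notes, with arbitrarily chosen Omega, A, and B) takes complements within a universal set:

```python
# Arbitrary universal set and two of its subsets.
Omega = set(range(8))
A = {0, 1, 2}
B = {2, 3, 4}

def complement(S):
    """Complement of S within the universal set Omega."""
    return Omega - S

# De Morgan's laws for two sets.
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)
```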
1.2 Sample Spaces and Events
The term experiment (or random experiment) is used in probability theory to describe a procedure whose outcome is not known in advance with certainty. Further, experiments are assumed to be repeatable (at least in theory) and to have a well-defined set of possible outcomes.

The sample space S is the set of all possible outcomes of an experiment. An event is a subset of the sample space. A simple event is an event with a single outcome. Events are usually denoted by capital letters (A, B, C, ...) and outcomes by lower case letters (x, y, z, ...). If x ∈ A is observed, then A is said to have occurred. The favorable outcomes of an experiment form the event of interest.

Each repetition of an experiment is called a trial. Repeated trials are repetitions of the experiment using the specified procedure, with the outcomes of the trials having no influence on one another.
Example 1.1 (Coin Tossing) Suppose you toss a fair coin 5 times and record h (for head) or t (for tail) each time. The sample space for this experiment is the collection of 32 = 2^5 sequences of 5 h's or t's:

S = {hhhhh, hhhht, hhhth, hhthh, hthhh, thhhh, hhhtt, hhtht, hthht, thhht,
     hhtth, hthth, thhth, htthh, ththh, tthhh, ttthh, tthth, thtth, httth, tthht,
     ththt, httht, thhtt, hthtt, hhttt, htttt, thttt, tthtt, tttht, tttth, ttttt}.

If you are interested in getting exactly 5 heads, then the event of interest is the simple event A = {hhhhh}. If you are interested in getting exactly 3 heads, then the event of interest is

A = {hhhtt, hhtht, hthht, thhht, hhtth, hthth, thhth, htthh, ththh, tthhh}.
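The counts in this example can be confirmed by brute-force enumeration; the sketch below (not part of the original notes) uses Python's itertools to generate all 2^5 sequences:

```python
from itertools import product

# All sequences of 5 tosses, each toss being "h" or "t".
S = list(product("ht", repeat=5))
assert len(S) == 32          # 2^5 outcomes, as stated above

# Event A: sequences with exactly 3 heads.
A = [s for s in S if s.count("h") == 3]
assert len(A) == 10          # matches the 10 sequences listed above
```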
Example 1.2 (Birthdays) Consider the experiment "Ask ten people their birthdays and record the list of dates." The sample space, S, is the collection of all sequences of 10 birthdays. If you are interested in getting at least one match, then the event of interest, A, is the collection of all sequences of ten birthdays with at least one match.
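Although the counting tools needed appear later in the chapter, the probability of this event has a well-known closed form via the complement of "no matches." The sketch below (an addition to these notes, assuming 365 equally likely birthdays and ignoring leap years) evaluates it:

```python
# P(no match among 10 people) = (365/365)(364/365)...(356/365);
# P(at least one match) is its complement.
p_no_match = 1.0
for i in range(10):
    p_no_match *= (365 - i) / 365

p_match = 1 - p_no_match
assert 0.116 < p_match < 0.118   # roughly an 11.7% chance of a shared birthday
```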
1.3 Probability Axioms and Properties Following From the Axioms

The basic rules (or axioms) of probability were introduced by A. Kolmogorov in the 1930s. Let A ⊆ S be an event, and let P(A) be the probability that A will occur.

1.3.1 Kolmogorov Axioms for Finite Sample Spaces

A probability distribution, or simply a probability, on a finite sample space S is a specification of numbers P(A) satisfying the following axioms:

1. P(S) = 1.
2. If A is an event, then 0 ≤ P(A) ≤ 1.
3. If A_1 and A_2 are disjoint, then P(A_1 ∪ A_2) = P(A_1) + P(A_2).
3′. More generally, if the events A_1, A_2, ..., A_k are pairwise disjoint, then
    P(A_1 ∪ A_2 ∪ · · · ∪ A_k) = P(A_1) + P(A_2) + · · · + P(A_k).
Since S is the set of all possible outcomes, an outcome in S is certain to occur; the probability of an event that is certain to occur must be 1 (axiom 1). Probabilities must be between 0 and 1 (axiom 2), and probabilities must be additive when events are pairwise disjoint (axiom 3).

Note that the second part of axiom 3 can be proven from the first part using mathematical induction. The inductive step of the proof assumes that the equality holds for k − 1 events. Then

P(A_1 ∪ A_2 ∪ · · · ∪ A_k)
  = P((A_1 ∪ A_2 ∪ · · · ∪ A_{k−1}) ∪ A_k)         (associativity of union)
  = P(A_1 ∪ A_2 ∪ · · · ∪ A_{k−1}) + P(A_k)        (axiom 3)
  = P(A_1) + P(A_2) + · · · + P(A_{k−1}) + P(A_k)  (induction hypothesis)
An additional extension of axiom 3 is needed for infinite sample spaces. Namely,

3′′. If A1, A2, . . . are pairwise disjoint, then

P(A1 ∪ A2 ∪ A3 ∪ · · ·) = P(A1) + P(A2) + P(A3) + · · ·

where the righthand side is understood to be the sum of a convergent infinite series.
Methods for sequences and series are covered in Chapter 6 of these notes.
Relative frequencies. The probability of an event can be written as the limit of relative<br />
frequencies. That is, if A ⊆ S is an event, then<br />
P(A) = lim_{n→∞} #(A)/n,

where #(A) is the number of occurrences of event A in n repeated trials of the experiment.
Expected number. If P(A) is the probability of event A, then nP(A) is the expected<br />
number of occurrences of event A in n repeated trials of the experiment.<br />
1.3.2 Equally Likely Outcomes<br />
If S is a finite set with N elements, A is a subset of S with n elements, and each outcome<br />
is equally likely, then<br />
P(A) = (Number of elements in A)/(Number of elements in S) = |A|/|S| = n/N.
Example 1.3 (Coin Tossing, continued) For example, if you toss a fair coin 5 times<br />
and record heads or tails each time, then the probability of getting exactly 3 heads is<br />
10/32=0.3125. In 2000 repetitions of the experiment, you expect to observe exactly 3 heads<br />
2000P(exactly 3 heads) = 2000(0.3125) = 625 times.<br />
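These counts can be checked directly with a short computation (a Python sketch for illustration only; the notes themselves assume no software):

```python
from math import comb

# Probability of exactly 3 heads in 5 tosses of a fair coin:
# comb(5, 3) = 10 favorable outcomes out of 2**5 = 32 equally likely outcomes.
p_three_heads = comb(5, 3) / 2**5

# Expected number of occurrences in n = 2000 repetitions is n * P(A).
expected = 2000 * p_three_heads

print(p_three_heads, expected)  # 0.3125 625.0
```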
1.3.3 Properties Following from the Axioms<br />
Properties following from the Kolmogorov axioms include<br />
1. Complement rule: Let A^c = S − A be the complement of A in S. Then

P(A^c) = 1 − P(A).

In particular, P(∅) = 0.
2. Subset rule: If A is a subset of B, then P(A) ≤ P(B).<br />
3. Inclusion-exclusion rule: If A and B are events, then<br />
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).<br />
Example 1.4 (Demonstration of Inclusion-Exclusion) To demonstrate the inclusion-exclusion rule using the axioms, first note that
A ∪ B = A ∪ (B ∩ A c ) where A ∩ (B ∩ A c ) = ∅.<br />
Since the intersection is empty, we know that P(A ∪ B) = P(A) + P(B ∩ A^c) by axiom 3. Similarly, since

B = (B ∩ A) ∪ (B ∩ A^c) where (B ∩ A) ∩ (B ∩ A^c) = ∅,

P(B) = P(B ∩ A) + P(B ∩ A^c) or P(B ∩ A^c) = P(B) − P(B ∩ A). Thus,

P(A ∪ B) = P(A) + P(B ∩ A^c) = P(A) + P(B) − P(B ∩ A) = P(A) + P(B) − P(A ∩ B),

since P(A ∩ B) = P(B ∩ A).
Inclusion-exclusion rules. The inclusion-exclusion rule can be generalized to more than<br />
two events. For three events, for example, the rule is
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).<br />
1.4 Counting Methods<br />
Methods for counting the number of outcomes in a sample space or event are important in probability. The multiplication rule is the basic counting formula.

Theorem 1.5 (Multiplication Rule) If an operation consists of r steps of which the first can be done in n1 ways, for each of these the second can be done in n2 ways, for each of the first and second steps the third can be done in n3 ways, etc., then the entire operation can be done in n1 × n2 × · · · × nr ways.
Sampling with/without replacement. Two special cases of the multiplication rule are<br />
1. Sampling with replacement. For a set of size n and a sample of size r, there are a total<br />
of n r = n × n × · · · × n ordered samples, if duplication is allowed.<br />
2. Sampling without replacement. For a set of size n and a sample of size r, there are a<br />
total of<br />
n!/(n − r)! = n × (n − 1) × · · · × (n − r + 1)

ordered samples, if duplication is not allowed.
If n is a positive integer, the notation n! ("n factorial") is used for the product
n! = n × (n − 1) × · · · × 1.<br />
For convenience, 0! is defined to equal 1 (0! = 1).<br />
Example 1.6 (Birthdays, continued) For example, suppose there are r unrelated people<br />
in a room, none of whom was born on February 29th of a leap year. You would like to<br />
determine the probability that at least two people have the same birthday.<br />
1. You ask, and record, each person’s birthday. There are<br />
365 r = 365 × 365 × · · · × 365<br />
possible outcomes, where an outcome is a sequence of r responses.
2. Consider the event “everyone has a different birthday.” The number of outcomes in<br />
this event is<br />
365!/(365 − r)! = 365 × 364 × · · · × (365 − r + 1).
3. Suppose that each sequence of birthdays is equally likely. The probability that at<br />
least two people have a common birthday is 1 minus the probability that everyone<br />
has a different birthday, or<br />
1 − (365 × 364 × · · · × (365 − r + 1))/365^r.
Let A be the event “at least two people have a common birthday”. The following table<br />
demonstrates how quickly P(A) increases as the number of people in the room increases:<br />
r 5 10 15 20 25 30 35 40 45 50<br />
P(A) 0.03 0.12 0.25 0.41 0.57 0.71 0.81 0.89 0.94 0.97<br />
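The table of P(A) values can be reproduced with a short loop (an illustrative Python sketch):

```python
# P(at least two of r people share a birthday), assuming 365 equally
# likely birthdays and no February 29th births.
def p_match(r):
    prob_all_distinct = 1.0
    for i in range(r):
        prob_all_distinct *= (365 - i) / 365
    return 1 - prob_all_distinct

# Reproduce the table for r = 5, 10, ..., 50.
table = {r: round(p_match(r), 2) for r in range(5, 55, 5)}
print(table)
```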
1.5 Permutations and Combinations<br />
A permutation is an ordered subset of r distinct objects out of a set of n objects. A<br />
combination is an unordered subset of r distinct objects out of the n objects.<br />
By the multiplication rule (page 5), there are a total of

nPr = n!/(n − r)! = n × (n − 1) × · · · × (n − r + 1)

permutations of r objects out of n objects.
Since each unordered subset corresponds to r! ordered subsets (the r chosen elements are permuted in all possible ways), there are a total of

nCr = nPr/r! = n!/((n − r)! r!) = (n × (n − 1) × · · · × (n − r + 1))/(r × (r − 1) × · · · × 1)

combinations of r objects out of n objects.
For example, there are a total of 5040 ordered subsets of size 4 from a set of size 10, and a<br />
total of 5040/24=210 unordered subsets.<br />
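The 5040 and 210 above can be verified with the standard-library counting functions (a Python sketch for illustration):

```python
from math import comb, perm

# Ordered subsets (permutations) of size 4 from a set of size 10:
n_ordered = perm(10, 4)    # 10 * 9 * 8 * 7 = 5040

# Each unordered subset corresponds to 4! = 24 orderings.
n_unordered = comb(10, 4)  # 5040 / 24 = 210

print(n_ordered, n_unordered)  # 5040 210
```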
The notation

C(n, r) = nCr = n!/((n − r)! r!)   (read "n choose r")

is used to denote the total number of combinations.
Special cases are as follows:

C(n, 0) = C(n, n) = 1 and C(n, 1) = C(n, n − 1) = n.

Further, since choosing r elements to form a subset is equivalent to choosing the remaining n − r elements to form the complementary subset,

C(n, r) = C(n, n − r) for r = 0, 1, ..., n.
1.5.1 Example: Simple Urn Model<br />
Suppose there are M special objects in an urn containing a total of N objects. In a subset<br />
of size n chosen from the urn, exactly m are special.<br />
Unordered subsets. There are a total of

C(M, m) × C(N − M, n − m)
unordered subsets with exactly m special objects (and exactly n − m other objects). If each choice of subset is equally likely, then for each m

P(m special objects) = [C(M, m) × C(N − M, n − m)] / C(N, n).

Ordered subsets. There are a total of

C(n, m) × MPm × N−MPn−m

ordered subsets with exactly m special objects. (The positions of the special objects are selected first, followed by the special objects to fill these positions, followed by the nonspecial objects to fill the remaining positions.) If each choice of subset is equally likely, then for each m

P(m special objects) = [C(n, m) × MPm × N−MPn−m] / NPn.
Computing probabilities. Interestingly, P(m special objects) is the same in both cases. For example, let N = 25, M = 10, n = 8, and m = 3. Then, using the first formula,

P(3 special objects) = [C(10, 3) × C(15, 5)] / C(25, 8) = (120 × 3003)/1081575 = 728/2185 ≈ 0.333.

Using the second formula, the probability is

P(3 special objects) = [C(8, 3) × 10P3 × 15P5] / 25P8 = (56 × 720 × 360360)/43609104000 = 728/2185 ≈ 0.333.
Example 1.7 (Quality Management) A batch of 20 microwaves contains 4 defective<br />
and 16 good machines. You decide to select 3 machines at random and test them. If each<br />
choice of subset is equally likely, then the probability that there are k defective machines<br />
in the chosen subset is<br />
<br />
P(k defectives) = [C(4, k) × C(16, 3 − k)] / C(20, 3),   k = 0, 1, 2, 3.
The following table shows the distribution of probabilities:<br />
k                  0                  1                  2                 3
P(k defectives)    560/1140 = 0.4912  480/1140 = 0.4211  96/1140 = 0.0842  4/1140 = 0.0035
If you decide to accept the batch as long as none of the 3 chosen machines is defective, then<br />
there is a 49.12% chance that you will accept a batch containing 4 defective machines.<br />
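The distribution in Example 1.7 can be computed from the urn-model formula (an illustrative Python sketch):

```python
from math import comb

# Hypergeometric probabilities: k defectives among 3 machines chosen
# from a batch of 20 containing 4 defectives and 16 good machines.
probs = {k: comb(4, k) * comb(16, 3 - k) / comb(20, 3) for k in range(4)}

# Probability of accepting the batch (no defectives among the 3 tested):
p_accept = probs[0]
print(probs)
```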
1.5.2 Binomial Coefficients<br />
The quantities C(n, r), r = 0, 1, ..., n, are often referred to as the binomial coefficients because
of the following theorem.<br />
Theorem 1.8 (Binomial Theorem) For all real numbers x and y and each positive<br />
integer n,<br />
(x + y)^n = Σ_{r=0}^{n} C(n, r) x^r y^{n−r}.

Idea of proof. The product on the left can be written as a sequence of n factors:

(x + y)^n = (x + y) × (x + y) × · · · × (x + y).
The product expands to 2^n summands, where each summand is a sequence of n letters, where one letter is chosen from each factor. For each r, exactly C(n, r) sequences have r copies of x and n − r copies of y.
Example 1.9 (n = 12) If n = 12, then the numbers of subsets of size r are as follows:

r         0   1   2    3    4    5    6    7    8    9   10  11  12
C(12, r)  1  12  66  220  495  792  924  792  495  220   66  12   1

and the total number of subsets is Σ_{r=0}^{12} C(12, r) = 4096 = 2^12.

1.6 Partitioning Sets
The multiplication rule (page 5) can be used to find the number of partitions of a set of n<br />
elements into k distinguishable subsets of sizes r1, r2, ..., rk.<br />
Specifically, r1 of the n elements are chosen for the first subset, r2 of the remaining n − r1 elements are chosen for the second subset, and so forth. The result is the product of the numbers of ways to perform each step:

C(n, r1) × C(n − r1, r2) × · · · × C(n − r1 − · · · − rk−1, rk).

The product simplifies to n!/(r1! r2! · · · rk!) and is denoted by C(n; r1, r2, ..., rk) (read "n choose r1, r2, ..., rk").

Example 1.10 (Partitioning Students) For example, there are a total of

C(15; 5, 5, 5) = 15!/(5! 5! 5!) = 756,756
ways to partition the members of a class of 15 students into recitation sections of size 5<br />
each led by Joe, Sally, and Mary, respectively. (The recitation sections are distinguished by<br />
their group leaders.)<br />
Permutations of indistinguishable objects. The formula above also represents the number of ways to permute n objects, where the first r1 are indistinguishable, the next r2 are indistinguishable, ..., the last rk are indistinguishable. The computation is done as follows: r1 of the n positions are chosen for the first type of object, r2 of the remaining n − r1 positions are chosen for the second type of object, etc.

In the example above, imagine that the 15 positions correspond to the 15 students in alphabetical order, and that there are 5 J's (for Joe), 5 S's (for Sally), and 5 M's (for Mary) to permute. Each permutation of the 15 letters corresponds to an assignment of the 15 students to the recitation sections led by Joe, Sally, and Mary.
Footnote on partitioning students. Suppose instead that the 15 students are assigned<br />
to 3 teams of 5 students each and that each team will work on the same project. Then the<br />
total number of different assignments becomes C(15; 5, 5, 5)/3! = 126,126.
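Both partition counts are quick to verify with factorials (a Python sketch for illustration):

```python
from math import factorial

# 15 students into three labeled sections of 5 each (Joe, Sally, Mary):
labeled = factorial(15) // (factorial(5) ** 3)

# If the three teams are interchangeable, divide by the 3! orderings of teams.
unlabeled = labeled // factorial(3)

print(labeled, unlabeled)  # 756756 126126
```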
1.6.1 Multinomial Coefficients<br />
The quantities C(n; r1, r2, ..., rk) are often referred to as the multinomial coefficients because of the following theorem.
Theorem 1.11 (Multinomial Theorem) For all real numbers x1,x2,... ,xk and each<br />
positive integer n,<br />
(x1 + x2 + · · · + xk)^n = Σ_{(r1, r2, ..., rk)} C(n; r1, r2, ..., rk) x1^{r1} x2^{r2} · · · xk^{rk},

where the sum is over all k-tuples of nonnegative integers with Σ_i ri = n.
Idea of proof: The product on the left can be written as a sequence of n factors:

(x1 + x2 + · · · + xk)^n = (x1 + x2 + · · · + xk) × (x1 + x2 + · · · + xk) × · · · × (x1 + x2 + · · · + xk).

The product expands to k^n summands, where each summand is a sequence of n letters (one from each factor). For each (r1, r2, ..., rk), exactly C(n; r1, r2, ..., rk) sequences have r1 copies of x1, r2 copies of x2, etc.
1.6.2 Example: Urn Models<br />
The simple urn model from Section 1.5.1 can be generalized.<br />
Assume that an urn contains k types of objects, Mi is the number of objects of type i, and N = Σ_i Mi is the total number of objects in the urn. A subset of size n is chosen. If each choice of subset is equally likely, then the probability that there are exactly mi objects of type i for i = 1, 2, ..., k is

P(exactly mi objects of type i for each i) = [C(M1, m1) × C(M2, m2) × · · · × C(Mk, mk)] / C(N, n).
Example 1.12 (3 Types of Objects) Let M1 = 3, M2 = 4, M3 = 5 and n = 4. The<br />
following table gives the probability distribution <strong>for</strong> all possible values of m1 and m2 (with<br />
m3 = 4 − m1 − m2):<br />
          m2 = 0    m2 = 1    m2 = 2    m2 = 3   m2 = 4   row total
m1 = 0     5/495    40/495    60/495    20/495    1/495    126/495
m1 = 1    30/495   120/495    90/495    12/495             252/495
m1 = 2    30/495    60/495    18/495                       108/495
m1 = 3     5/495     4/495                                   9/495
total     70/495   224/495   168/495    32/495    1/495    495/495

Since N = M1 + M2 + M3 = 12, the total number of subsets of size 4 is C(12, 4) = 495. To get the numerator for m1 = 2 and m2 = 1, for example, we compute C(3, 2) × C(4, 1) × C(5, 1) = 60.
The rightmost column gives the proportion of time we expect to have 0, 1, 2, or 3 elements<br />
of type 1 in the subset of size 4. The bottom row gives the proportion of time we expect to<br />
have 0, 1, 2, 3, or 4 of type 2 in the subset of size 4.<br />
Urn models can be used to analyze surveys conducted in small populations. Additional<br />
examples will be discussed in the context of joint discrete distributions (Chapter 3).<br />
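The table in Example 1.12 can be regenerated from the generalized urn formula (an illustrative Python sketch):

```python
from math import comb

# Multivariate hypergeometric numerators for M1 = 3, M2 = 4, M3 = 5, n = 4.
N, n = 12, 4
numerators = {}
for m1 in range(4):            # 0 <= m1 <= M1 = 3
    for m2 in range(5):        # 0 <= m2 <= M2 = 4
        m3 = n - m1 - m2
        if 0 <= m3 <= 5:       # remaining draws must come from type 3
            numerators[(m1, m2)] = comb(3, m1) * comb(4, m2) * comb(5, m3)

total = comb(N, n)             # 495 equally likely subsets of size 4
print(total, numerators[(2, 1)])  # 495 60
```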
1.7 Applications in <strong>Statistics</strong><br />
This section introduces two important applications of the methods we have seen so far.<br />
Figure 1.1: Likelihood for survey example (left) and capture-recapture example (right).
1.7.1 Maximum Likelihood Estimation in Simple Urn Models<br />
In statistics, information from a sample is used when complete information is impossible to obtain or would take too long to gather. For example, information from a randomly chosen
subset of elements from an urn can be used to<br />
1. Estimate the number of special elements in an urn, M, when N is known.<br />
2. Estimate the total number of elements in an urn, N, when M is known.<br />
The method uses proportions: if all four numbers (N, M, n, m) are positive, then the<br />
proportion of special elements in the subset (m/n) is set equal to the proportion of special<br />
elements in the urn (M/N), and the equation is solved <strong>for</strong> the unknown quantity.<br />
The method is intuitively appealing. Further, it can be shown that estimates obtained using<br />
proportions have maximum (or close to maximum) probabilities.<br />
Example 1.13 (Survey Analysis) Assume that in a well-conducted survey of 150 of the<br />
1000 students in the senior class at a local university, 82 students preferred Susan for class
president. Since (82/150)1000 ≈ 547, we estimate that 547 seniors support Susan.<br />
The left part of Figure 1.1 is a plot of the function

f(M) = [C(M, 82) × C(1000 − M, 68)] / C(1000, 150) for M = 450, 451, ..., 625.

If the subset of students chosen for the survey was one of C(1000, 150) equally likely choices, then the plot suggests that 547 is a maximum likelihood estimate of M.
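The maximizing value of M can be found by direct search over the plotted range (a Python sketch; the search range 450–625 simply matches the plot):

```python
from math import comb

# Likelihood of observing 82 Susan supporters in a sample of 150
# drawn from a senior class of 1000 containing M supporters.
def f(M):
    return comb(M, 82) * comb(1000 - M, 68) / comb(1000, 150)

# Maximize over the plotted range M = 450, ..., 625.
M_hat = max(range(450, 626), key=f)
print(M_hat)  # 547, agreeing with the proportion estimate
```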
Example 1.14 (Capture-Recapture Model) Two independent techniques were used to<br />
estimate the total number of potentially fatal errors in a complex computer program. The<br />
first method identified M = 380 errors. The second method identified a total of n = 278<br />
errors, m = 250 of which were already identified by the first method.<br />
Since (278/250)380 ≈ 423, we estimate that there are a total of 423 potentially fatal errors<br />
in the program (408 identified by at least one method; 15 not identified by either method).<br />
The right part of Figure 1.1 is a plot of

f(N) = [C(380, 250) × C(N − 380, 28)] / C(N, 278) for N = 411, 412, ..., 437.

If the subset of errors identified by the second method was one of C(N, 278) equally likely choices, then the plot suggests that 423 is a maximum likelihood estimate of N.
Probability (or likelihood) ratio. An alternative to reporting a single value of M or<br />
N with maximum probability is to report a range of likely values. Often, the range is<br />
determined using probability ratios. That is, M or N is in the reported range if

(Probability for M or N) / (Maximum Probability) ≥ p for a fixed proportion p.
Using p = 0.25, the range of likely values in the survey example is {485, 486, ..., 608} and the range of likely values in the capture-recapture example is {415, 416, ..., 431}.
Survey analysis when N is large. When N is large, we usually work with the unknown<br />
proportion of special objects p = M/N and the estimate p̂ = m/n. Further, as an
approximation, we assume that p takes values anywhere in the interval 0 ≤ p ≤ 1 and we<br />
use “sampling with replacement” in place of “sampling without replacement.”<br />
Survey analysis with three types of objects. The simple method of proportions can<br />
be generalized to urn models with three (or more) types of objects. For example, suppose<br />
we wish to estimate the numbers of students who will vote for each of three candidates
(Susan plus two others) and that in a well-conducted survey of 150 of the 1000 students in<br />
the senior class, 72 prefer Susan (candidate 1), 56 prefer the second candidate and 22 prefer<br />
the third candidate. We estimate that<br />
<br />
(72/150) × 1000 ≈ 480,   (56/150) × 1000 ≈ 373, and   (22/150) × 1000 ≈ 147
will vote for candidates 1, 2, 3, respectively. Further, these estimates represent maximum
(or near maximum) probabilities over all triples (M1,M2,M3).<br />
1.7.2 Permutation and Nonparametric Methods<br />
In many statistical applications, the null and alternative hypotheses of interest can be<br />
paraphrased in the following simple terms:<br />
Ho: Any patterns appearing in the data are due to chance alone.<br />
Ha: There is a tendency for a certain type of pattern to appear.
Permutation methods allow researchers to determine whether to accept or reject a null hypothesis<br />
of randomness and, in some cases, to construct confidence intervals for unknown
parameters. The methods are applicable in many settings since they require few mathematical<br />
assumptions.<br />
For example, consider the analysis of two samples. Let {x1,x2,...,xn} and {y1,y2,... ,ym}<br />
be the observed samples, where each is a list of numbers with repetitions.<br />
The data could have been generated under one of two sampling models.<br />
Population model. Data generated under a population model can be treated as the<br />
values of independent random samples from continuous distributions. (These concepts are<br />
defined formally in Chapter 8 of these notes.)
For example, consider testing the null hypothesis that the X and Y distributions are equal<br />
versus the alternative hypothesis that X is stochastically smaller than Y . (That is, that<br />
the values of X tend to be smaller than the values of Y .)<br />
If n = 3, m = 5, and the observed lists are

{x1, x2, x3} = {1.4, 1.2, 2.8} and {y1, y2, y3, y4, y5} = {1.7, 2.9, 3.7, 2.3, 1.3},

then the event

X2 < Y5 < X1 < Y1 < Y4 < X3 < Y2 < Y3
has occurred. Under the null hypothesis, each permutation of the n + m = 8 random<br />
variables is equally likely; thus, the observed event is one of (n + m)! = 40,320 equally<br />
likely choices. Patterns of interest under the alternative hypothesis are events where the<br />
Xi’s tend to be smaller than the Yj’s.<br />
Randomization model. Data generated under a randomization model cannot be treated<br />
as the values of independent random samples from continuous distributions, but can be<br />
thought of as one of N equally likely choices under the null hypothesis.<br />
In the example above, the null hypothesis of randomness is that the observed data is one<br />
of N = 40,320 equally likely choices, where a choice corresponds to a matching of the 8<br />
numbers to the labels x1, x2, x3, y1, y2, y3, y4, y5.<br />
Permutation tests. To conduct a permutation test using a test statistic T,<br />
1. The sampling distribution of T is obtained by computing the value of the statistic for
each reordering of the data, and<br />
2. The observed significance level (or p value) is computed by comparing the observed<br />
value of T to the sampling distribution from step 1.<br />
The sampling distribution from the first step is called the permutation distribution of the<br />
statistic T, and the p value is called a permutation p value.<br />
Table 1.1: Sampling distribution of the sum of observations in the first sample.<br />
sum x sample sum x sample sum x sample<br />
3.9 {1.2,1.3,1.4} 5.9 {1.4,1.7,2.8} 7.1 {1.4,2.8,2.9}<br />
4.2 {1.2,1.3,1.7} 5.9 {1.3,1.7,2.9} 7.2 {1.2,2.3,3.7}<br />
4.3 {1.2,1.4,1.7} 6.0 {1.4,1.7,2.9} 7.3 {1.3,2.3,3.7}<br />
4.4 {1.3,1.4,1.7} 6.2 {1.2,1.3,3.7} 7.4 {1.4,2.3,3.7}<br />
4.8 {1.2,1.3,2.3} 6.3 {1.2,1.4,3.7} 7.4 {1.7,2.8,2.9}<br />
4.9 {1.2,1.4,2.3} 6.3 {1.2,2.3,2.8} 7.7 {1.2,2.8,3.7}<br />
5.0 {1.3,1.4,2.3} 6.4 {1.3,2.3,2.8} 7.7 {1.7,2.3,3.7}<br />
5.2 {1.2,1.7,2.3} 6.4 {1.2,2.3,2.9} 7.8 {1.2,2.9,3.7}<br />
5.3 {1.2,1.3,2.8} 6.4 {1.3,1.4,3.7} 7.8 {1.3,2.8,3.7}<br />
5.3 {1.3,1.7,2.3} 6.5 {1.3,2.3,2.9} 7.9 {1.4,2.8,3.7}<br />
5.4 {1.2,1.4,2.8} 6.5 {1.4,2.3,2.8} 7.9 {1.3,2.9,3.7}<br />
5.4 {1.4,1.7,2.3} 6.6 {1.2,1.7,3.7} 8.0 {1.4,2.9,3.7}<br />
5.4 {1.2,1.3,2.9} 6.6 {1.4,2.3,2.9} 8.0 {2.3,2.8,2.9}<br />
5.5 {1.2,1.4,2.9} 6.7 {1.3,1.7,3.7} 8.2 {1.7,2.8,3.7}<br />
5.5 {1.3,1.4,2.8} 6.8 {1.4,1.7,3.7} 8.3 {1.7,2.9,3.7}<br />
5.6 {1.3,1.4,2.9} 6.8 {1.7,2.3,2.8} 8.8 {2.3,2.8,3.7}<br />
5.7 {1.2,1.7,2.8} 6.9 {1.2,2.8,2.9} 8.9 {2.3,2.9,3.7}<br />
5.8 {1.2,1.7,2.9} 6.9 {1.7,2.3,2.9} 9.4 {2.8,2.9,3.7}<br />
5.8 {1.3,1.7,2.8} 7.0 {1.3,2.8,2.9}<br />
Continuing with the example above, let S be the sum of numbers in the x sample. Since<br />
the value of S depends only on which observations are labelled x’s, and not on the relative<br />
ordering of all 8 observations, the sampling distribution of S under the null hypothesis is<br />
obtained by computing its value for each partition of the 8 numbers into subsets of sizes 3
and 5, respectively.<br />
Table 1.1 shows the values of S for each of the C(8, 3) = 56 choices of observations for the first sample. The observed value of S is 5.4. Since small values of S support the alternative hypothesis that x values tend to be smaller than y values, the observed significance level is

P(S ≤ 5.4) = 13/56.
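The permutation distribution in Table 1.1 and the p value 13/56 can be reproduced by enumerating the 56 choices (an illustrative Python sketch):

```python
from itertools import combinations

# Pooled observations from the two samples.
data = [1.4, 1.2, 2.8, 1.7, 2.9, 3.7, 2.3, 1.3]
s_obs = 1.4 + 1.2 + 2.8  # observed sum of the x sample, 5.4

# Value of S for each of the C(8, 3) = 56 choices of the first sample.
sums = [sum(c) for c in combinations(data, 3)]

# One-sided permutation p value: small values of S favor the alternative.
# (A tiny tolerance guards against floating-point noise in the sums.)
p_value = sum(s <= s_obs + 1e-9 for s in sums) / len(sums)
print(len(sums), p_value)  # 56 choices; p = 13/56, about 0.232
```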
Example 1.15 (Olympic Sprinting Times) As a second permutation test example,<br />
consider a test of association between x-values and y-values using data on Olympic sprinting<br />
times (Hand et al., Chapman & Hall, 1994, p. 248), and the sample correlation statistic<br />
R = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)^2 × Σ_{i=1}^{n} (Yi − Ȳ)^2 ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively, and n is the total number of data pairs.
The left part of Figure 1.2 shows the winning times in seconds (vertical axis) for the men's 100 meter finals versus the year since 1900 (horizontal axis) for the years 1912, 1924, · · ·,
1972.<br />
Figure 1.2: Winning time in seconds (y) versus year since 1900 (x) for Olympic sprinting times example (left plot). Triglycerides in mg/dL (y) versus cholesterol in mg/dL (x) for heart disease example (right plot).
The data are reproduced in the following table:<br />
x 12 24 36 48 60 72<br />
y 10.8 10.6 10.3 10.3 10.2 10.14<br />
For these data, the observed value of R is −0.941.<br />
To determine if the observed association is due to chance alone, a permutation analysis of R = 0 versus R ≠ 0 was conducted at the 5% significance level.
The sampling distribution of R under the null hypothesis is obtained by matching x-values<br />
to permuted y-values in all possible ways, and computing the value of R in each case. Since<br />
two y-values are equal, there are a total of 6!/2 = 360 permutations to consider. Since the<br />
permutation p value is<br />
p value = #(|R| ≥ | − 0.941|)/360 = 2/360 ≈ 0.006
(obtained using the computer), there is evidence that the observed negative association is<br />
not due to chance alone.<br />
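The exact permutation analysis of the sprinting data can be carried out by brute force (an illustrative Python sketch; iterating over all 6! = 720 orderings counts each distinct ordering of the tied y values twice, which leaves the p value unchanged):

```python
from itertools import permutations
from statistics import mean
from math import sqrt

x = [12, 24, 36, 48, 60, 72]
y = [10.8, 10.6, 10.3, 10.3, 10.2, 10.14]

def corr(xs, ys):
    # Sample correlation statistic R.
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    syy = sum((b - ybar) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r_obs = corr(x, y)  # about -0.941

# Permutation distribution: match x to every ordering of y.
rs = [corr(x, p) for p in permutations(y)]
p_value = sum(abs(r) >= abs(r_obs) - 1e-9 for r in rs) / len(rs)
print(round(r_obs, 3), round(p_value, 3))  # -0.941 0.006
```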
Conditional test, nonparametric test. A permutation test is an example of a conditional<br />
test, since the sampling distribution of T is computed conditional on the observations.<br />
Note that the term nonparametric test is often used to describe permutation tests where<br />
observations have been replaced by ranks. The Wilcoxon rank sum and Mann-Whitney<br />
tests for the analysis of two samples, and the Wilcoxon signed rank test for the analysis of
paired samples, are examples of nonparametric tests.<br />
Monte Carlo test. If the number of reorderings of the data is very large, then the<br />
computer can be used to approximate the sampling distribution of T by computing its<br />
value for a fixed number of random reorderings of the data. The approximate sampling
distribution can then be used to estimate the p value.<br />
A test conducted in this way is an example of a Monte Carlo test. (In a Monte Carlo<br />
analysis, simulation is used to estimate a quantity of interest. Here the quantity of interest<br />
is a p value.)<br />
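A Monte Carlo version of the previous correlation test can be sketched as follows (illustrative only; the seed is an arbitrary choice for reproducibility, and the sprinting data stand in for a case where full enumeration would be too expensive):

```python
import random
from statistics import mean
from math import sqrt

def corr(xs, ys):
    # Sample correlation statistic R.
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    return sxy / sqrt(sum((a - xbar) ** 2 for a in xs)
                      * sum((b - ybar) ** 2 for b in ys))

x = [12, 24, 36, 48, 60, 72]
y = [10.8, 10.6, 10.3, 10.3, 10.2, 10.14]
r_obs = corr(x, y)

random.seed(0)           # arbitrary seed, for a reproducible illustration
count, trials = 1, 5000  # include the observed matching itself
ys = y[:]
for _ in range(trials - 1):
    random.shuffle(ys)   # one random reordering of the y values
    if abs(corr(x, ys)) >= abs(r_obs) - 1e-9:
        count += 1

estimated_p = count / trials
print(estimated_p)  # a small value near the exact p = 2/360
```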
Example 1.16 (Heart Disease Study) (Hand et al., Chapman & Hall, 1994, p. 221)<br />
Cholesterol and triglycerides belong to the class of chemicals known as lipids (fats). As part<br />
of a study to determine the relationship between high levels of lipids and coronary artery<br />
disease, researchers measured plasma levels of cholesterol and triglycerides in milligrams<br />
per deciliter (mg/dL) in 371 men complaining of chest pain.<br />
The right part of Figure 1.2 compares the cholesterol (x) and triglycerides (y) levels for the
51 men with no evidence of heart disease. The sample correlation is 0.325. (See the last<br />
example for the definition of the sample correlation statistic.)
To determine if the observed association is due to chance alone, a permutation analysis can<br />
be conducted using the sample correlation statistic R as test statistic.<br />
Consider testing R = 0 versus R ≠ 0 at the 5% significance level. Under the null hypothesis
of randomness, each matching of observed x values to permuted y values is equally likely.<br />
Thus, the observed matching is one of about 52! ≈ 8 × 10^67 equally likely choices.
In a Monte Carlo analysis using 5000 random matchings (including the observed matching),<br />
the estimated permutation p value was<br />
estimated p value = #(|R| ≥ |0.325|)/5000 = 98/5000 = 0.0196.
Thus, there is evidence that the positive association between levels of cholesterol and triglycerides<br />
is not due to chance alone.<br />
1.8 Conditional Probability<br />
Assume that A and B are events, and that P(B) > 0. Then the conditional probability of<br />
A given B is defined as follows:<br />
P(A|B) = P(A ∩ B)/P(B).
Event B is often referred to as the conditional sample space. The conditional probability<br />
P(A|B) is the relative “size” of A within B.<br />
Example 1.17 (Respiratory Problems) For example, suppose that 40% of the adults in<br />
a certain population smoke cigarettes and that 28% smoke cigarettes and have respiratory<br />
problems. Then<br />
P(Respiratory problems | Smoker) = 0.28/0.40 = 0.70
is the probability that an adult has respiratory problems given that the adult is a smoker.<br />
(70% of smokers have respiratory problems.)<br />
Equally likely outcomes. Note that if the sample space S is finite and each outcome is<br />
equally likely, then the conditional probability of A given B simplifies to the following:<br />
P(A|B) = (|A ∩ B|/|S|)/(|B|/|S|) = |A ∩ B|/|B| = (Number of elements in A ∩ B)/(Number of elements in B).

1.8.1 Multiplication Rules for Probability
Assume that A and B are events with positive probability. Then the definition of conditional<br />
probability implies that the probability of the intersection, P(A ∩ B), can be written as a<br />
product of probabilities in two different ways:<br />
P(A ∩ B) = P(B) × P(A|B) = P(A) × P(B|A).

More generally,
Theorem 1.18 (Multiplication Rule for Probability) If A1, A2, ..., Ak are events
and P(A1 ∩ A2 ∩ · · · ∩ Ak−1) > 0, then<br />
P(A1 ∩ A2 ∩ · · · ∩ Ak) =<br />
P(A1) × P(A2|A1) × P(A3|A1 ∩ A2) × · · · × P(Ak|A1 ∩ A2 ∩ · · · ∩ Ak−1).<br />
Example 1.19 (Sampling Without Replacement) For example, suppose that four slips of paper are sampled without replacement from a well-mixed urn containing twenty-five slips of paper: fifteen slips with the letter X written on each and ten slips of paper with the letter Y written on each. Then the probability of observing the sequence XYXX is

P(XYXX) = P(X) × P(Y|X) × P(X|XY) × P(X|XYX) = (15/25) × (10/24) × (14/23) × (13/22) ≈ 0.09.
(The probability of choosing an X slip is 15/25; with an X removed from the urn, the<br />
probability of drawing a Y slip is 10/24; with an X and Y removed from the urn, the<br />
probability of drawing an X slip is 14/23; with two X slips and one Y slip removed from<br />
the urn, the probability of drawing an X slip is 13/22.)<br />
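This draw-by-draw conditioning can be checked numerically. The following Python sketch (the helper name sequence_prob is ours, not from the notes) applies the multiplication rule one draw at a time, using exact fractions:<br />

```python
from fractions import Fraction

def sequence_prob(counts, sequence):
    """Probability of a letter sequence drawn without replacement from an urn.

    counts maps each letter to the number of slips initially in the urn.
    """
    remaining = dict(counts)
    total = sum(remaining.values())
    prob = Fraction(1)
    for letter in sequence:
        # Multiplication rule: condition on all slips removed so far.
        prob *= Fraction(remaining[letter], total)
        remaining[letter] -= 1
        total -= 1
    return prob

p = sequence_prob({"X": 15, "Y": 10}, "XYXX")
print(p, float(p))  # 91/1012 ≈ 0.09
```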
1.8.2 Law of Total Probability<br />
The law of total probability can be used to write an unconditional probability as the<br />
weighted average of conditional probabilities. Specifically,<br />
Theorem 1.20 (Law of Total Probability) Let A1, A2, ..., Ak and B be events with<br />
nonzero probability. If A1, A2, ..., Ak are pairwise disjoint with union S, then<br />
P(B) = P(A1) × P(B|A1) + P(A2) × P(B|A2) + · · · + P(Ak) × P(B|Ak).<br />
To demonstrate the law of total probability, note that if A1, A2, ..., Ak are pairwise disjoint<br />
with union S, then the sets B ∩ A1, B ∩ A2, ..., B ∩ Ak are pairwise disjoint with union B.<br />
Thus, axiom 3 and the definition of conditional probability imply that<br />
P(B) = Σ_{j=1}^{k} P(B ∩ Aj) = Σ_{j=1}^{k} P(Aj)P(B|Aj).<br />
Example 1.21 (Respiratory Problems, continued) For example, suppose that 70% of<br />
smokers and 15% of nonsmokers in a certain population of adults have respiratory problems.<br />
If 40% of the population smoke cigarettes, then<br />
P(Respiratory problems) = 0.40(0.70) + 0.60(0.15) = 0.37<br />
is the probability of having respiratory problems.<br />
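A one-line numerical check of this weighted average (the dictionary names are illustrative):<br />

```python
priors = {"smoker": 0.40, "nonsmoker": 0.60}        # P(A_j): the partition
conditionals = {"smoker": 0.70, "nonsmoker": 0.15}  # P(B | A_j)

# Law of total probability: P(B) = sum over j of P(A_j) P(B | A_j).
p_b = sum(priors[a] * conditionals[a] for a in priors)
print(round(p_b, 4))  # 0.37
```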
Law of average conditional probabilities. The law of total probability is often called<br />
the law of average conditional probabilities. Specifically, P(B) is the weighted average of<br />
the collection of conditional probabilities {P(B|Aj)}, using the collection of unconditional<br />
probabilities {P(Aj)} as weights.<br />
In the respiratory problems example above, 0.37 is the weighted average of 0.70 (the probability<br />
that a smoker has respiratory problems) and 0.15 (the probability that a nonsmoker<br />
has respiratory problems).<br />
1.8.3 Bayes Rule<br />
Bayes rule, proven by the Reverend T. Bayes in the 1760’s, can be used to update probabilities<br />
given that an event has occurred. Specifically,<br />
Theorem 1.22 (Bayes Rule) Let A1, A2, ..., Ak and B be events with nonzero probability.<br />
If A1, A2, ..., Ak are pairwise disjoint with union S, then<br />
P(Aj|B) = [P(Aj) × P(B|Aj)] / [Σ_{i=1}^{k} P(Ai) × P(B|Ai)], j = 1, 2, ..., k.<br />
Bayes rule is a restatement of the definition of conditional probability: the numerator in<br />
the formula is P(Aj ∩ B), the denominator is P(B) (by the law of total probability), and<br />
the ratio is P(Aj|B).<br />
Example 1.23 (Defective Products) For example, suppose that 2% of the products<br />
assembled during the day shift and 6% of the products assembled during the night shift at<br />
a small company are defective and need reworking. If the day shift accounts for 55% of the<br />
products assembled by the company, then<br />
P(Day shift | Defective) = 0.55(0.02) / [0.55(0.02) + 0.45(0.06)] ≈ 0.289<br />
is the probability that a product was assembled during the day shift given that the product<br />
is defective. (About 28.9% of defective products are assembled during the day shift.)<br />
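The same arithmetic can be organized as a small function; bayes_posterior is a name we introduce for illustration:<br />

```python
def bayes_posterior(priors, likelihoods):
    """Posterior P(A_j | B) from priors P(A_j) and likelihoods P(B | A_j)."""
    joint = {a: priors[a] * likelihoods[a] for a in priors}  # numerators P(A_j) P(B|A_j)
    p_b = sum(joint.values())                                # law of total probability
    return {a: joint[a] / p_b for a in joint}

post = bayes_posterior({"day": 0.55, "night": 0.45},
                       {"day": 0.02, "night": 0.06})
print(round(post["day"], 3))  # 0.289
```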
Table 1.2: Distribution of unaided distance scores for right (x) and left (y) eyes.<br />
y = 1 y = 2 y = 3 y = 4 total<br />
x = 1 0.203 0.036 0.017 0.009 0.265<br />
x = 2 0.031 0.202 0.058 0.010 0.301<br />
x = 3 0.016 0.048 0.237 0.027 0.328<br />
x = 4 0.005 0.011 0.024 0.066 0.106<br />
total 0.255 0.297 0.336 0.112 1.000<br />
Figure 1.3: Distributions of left eye scores given right eye score is 1 (left) or 4 (right).<br />
Prior and posterior distributions. In applications of Bayes rule, the collection of<br />
probabilities {P(Aj)} is often referred to as the prior probabilities (the probabilities before<br />
observing an outcome in B), and the collection of probabilities {P(Aj|B)} is often referred<br />
to as the posterior probabilities (the probabilities after event B has occurred).<br />
Example 1.24 (Eyesight Study) (Stuart, Biometrika 42:412-416, 1955) Between 1943<br />
and 1946 the eyesight of more than seven thousand women aged 30-39 employed in the<br />
British weapons industry was tested. For each woman the unaided distance vision of the<br />
right eye (x) and of the left eye (y) was recorded using the four-point scale 1, 2, 3, 4, where<br />
1 is the highest grade and 4 is the lowest grade. Table 1.2 summarizes the results of the<br />
experiment.<br />
For example, for 20.3% of women who participated in the study both eyes were given the<br />
highest score (x = y = 1) and for 6.6% of women who participated in the study both eyes<br />
were given the lowest score (x = y = 4).<br />
Consider the experiment “Choose a woman from the set of women studied by Stuart and<br />
record the scores <strong>for</strong> both eyes,” assume each choice is equally likely, and let Aj be the<br />
event that the left eye score equals j.<br />
The left part of Figure 1.3 shows the posterior distribution {P(Aj|B)} (in black) superimposed<br />
on the prior distribution {P(Aj)} (in gray) when B is the event that the right eye<br />
score is 1. (If the right eye score is 1, then the left eye score is most likely 1.)<br />
The right part of Figure 1.3 shows the posterior distribution {P(Aj|B)} (in black) superimposed<br />
on the prior distribution {P(Aj)} (in gray) when B is the event that the right eye<br />
score is 4. (If the right eye score is 4, then the left eye score is most likely 4.)<br />
Example 1.25 (Religion and Education) As a last example, suppose that 45% of<br />
the population of a certain adult community are Catholic, 15% are Jewish, and 40% are<br />
Protestant. Further, suppose that<br />
1. Education levels among the Catholic population are as follows:<br />
Level:    8th grade   Some          HS         Some      College    Post-<br />
          or less     High School   Graduate   College   Graduate   Graduate<br />
Percent:  21%         28%           32%        8%        7%         4%<br />
2. Education levels among the Jewish population are as follows:<br />
Level:    8th grade   Some          HS         Some      College    Post-<br />
          or less     High School   Graduate   College   Graduate   Graduate<br />
Percent:  7%          23%           24%        15%       18%        13%<br />
3. Education levels among the Protestant population are as follows:<br />
Level:    8th grade   Some          HS         Some      College    Post-<br />
          or less     High School   Graduate   College   Graduate   Graduate<br />
Percent:  13%         22%           27%        11%       19%        8%<br />
The information above can be used to construct a contingency table of proportions in each<br />
religion-education group in this community:<br />
Level:       8th grade   Some          HS         Some      College    Post-<br />
             or less     High School   Graduate   College   Graduate   Graduate<br />
Catholic:    0.0945      0.126         0.144      0.036     0.0315     0.018     0.45<br />
Jewish:      0.0105      0.0345        0.036      0.0225    0.027      0.0195    0.15<br />
Protestant:  0.052       0.088         0.108      0.044     0.076      0.032     0.40<br />
Total:       0.157       0.2485        0.288      0.1025    0.1345     0.0695    1<br />
Each number in the body of the table is obtained using the multiplication rule. For example,<br />
P(Catholic and HS Graduate) = 0.45 × 0.32 = 0.144.<br />
Each number in the bottom row is obtained using the law of total probability. For example,<br />
P(HS Graduate) = 0.45 × 0.32 + 0.15 × 0.24 + 0.40 × 0.27 = 0.288.<br />
The prior distribution of Catholic, Jewish, and Protestant is 0.45, 0.15, and 0.40, respectively.<br />
Given that a person is a high school graduate (and no more), for example, the<br />
posterior distribution of Catholic, Jewish, and Protestant is<br />
0.144/0.288 = 0.5, 0.036/0.288 = 0.125, and 0.108/0.288 = 0.375, respectively.<br />
This posterior distribution was computed using Bayes rule.<br />
1.9 Independent Events and Mutually Independent Events<br />
Events A and B are said to be independent if<br />
P(A ∩ B) = P(A) × P(B).<br />
Otherwise, A and B are said to be dependent.<br />
If A and B are independent and have positive probabilities, then the multiplication rule for<br />
probability (page 17) implies that<br />
P(A|B) = P(A) and P(B|A) = P(B).<br />
(The relative size of A within B is the same as its relative size within S; the relative size of<br />
B within A is the same as its relative size within S.)<br />
If A and B are independent, 0 < P(A) < 1, and 0 < P(B) < 1, then A and B c are independent,<br />
A c and B are independent, and A c and B c are independent.<br />
More generally, events A1,A2,...,Ak are said to be mutually independent if<br />
• For each pair of distinct indices (i1, i2): P(Ai1 ∩ Ai2) = P(Ai1) × P(Ai2),<br />
• For each triple of distinct indices (i1, i2, i3): P(Ai1 ∩ Ai2 ∩ Ai3) = P(Ai1) × P(Ai2) × P(Ai3),<br />
• and so forth.<br />
Example 1.26 (Sampling With Replacement) For example, suppose that four slips<br />
of paper are sampled with replacement from a well-mixed urn containing twenty-five slips<br />
of paper: fifteen slips with the letter X written on each and ten slips of paper with the<br />
letter Y written on each. Then the probability of observing the sequence XYXX is<br />
P(XYXX) = P(X) × P(Y) × P(X) × P(X) = (15/25)^3 (10/25) = 54/625 = 0.0864.<br />
Further, since C(4, 3) = 4 sequences have exactly 3 X’s, the probability of observing a sequence<br />
with exactly 3 X’s is 4(0.0864) = 0.3456.<br />
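Under mutual independence the computation factors; a quick check with exact fractions:<br />

```python
from fractions import Fraction
from math import comb

p = Fraction(15, 25)        # probability of an X slip on every draw (with replacement)
p_seq = p**3 * (1 - p)      # independent draws: P(XYXX) = p^3 (1 - p)
print(float(p_seq))         # 0.0864

# comb(4, 3) = 4 orderings contain exactly three X's:
p_three = comb(4, 3) * p_seq
print(float(p_three))       # 0.3456
```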
Example 1.27 (Three Coins) You toss a fair coin three times and record an h (for<br />
heads) or a t (for tails) each time. The sample space is<br />
S = {hhh, hht, hth, thh, htt, tht, tth, ttt}.<br />
Let Hi be the event that the i th toss results in a head:<br />
H1 = {hhh, hht, hth, htt}, H2 = {hhh, hht, thh, tht}, H3 = {hhh, hth, thh, tth}.<br />
Since P(H1) = P(H2) = P(H3) = 1/2,<br />
P(Hi ∩ Hj) = 1/4 = P(Hi) × P(Hj) for (i, j) = (1, 2), (1, 3), (2, 3),<br />
and P(H1 ∩ H2 ∩ H3) = 1/8 = P(H1)P(H2)P(H3), the events are mutually independent.<br />
Example 1.28 (Not Mutually Independent) Suppose that the sample space of an<br />
experiment consists of the 3! = 6 permutations of the letters a, b, and c, along with the<br />
triples of each letter:<br />
S = {aaa, bbb, ccc, abc, acb, bac, bca, cab, cba}.<br />
Assume each outcome is equally likely, and let Ai be the event that the i th letter is an a:<br />
A1 = {aaa, abc, acb}, A2 = {aaa, bac, cab}, A3 = {aaa, bca,cba}.<br />
Then P(A1) = P(A2) = P(A3) = 1/3 and<br />
P(A1 ∩ A2) = P(A1 ∩ A3) = P(A2 ∩ A3) = P(A1 ∩ A2 ∩ A3) = 1/9<br />
since each intersection set equals the simple event {aaa}.<br />
Since<br />
P(Ai ∩ Aj) = 1/9 = P(Ai) × P(Aj) for (i, j) = (1, 2), (1, 3), (2, 3),<br />
the events are independent in pairs. However, since P(A1 ∩ A2 ∩ A3) = 1/9 ≠ 1/27 = P(A1)P(A2)P(A3),<br />
the three events are not mutually independent.<br />
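Both examples can be verified by brute-force enumeration of the equally likely outcomes; the helper prob is our own, not from the notes:<br />

```python
from itertools import product
from fractions import Fraction

def prob(event, space):
    # Equally likely outcomes: P(E) = |E| / |S|.
    return Fraction(sum(1 for o in space if event(o)), len(space))

# Example 1.27: three tosses of a fair coin are mutually independent.
coins = ["".join(t) for t in product("ht", repeat=3)]
H = [lambda o, i=i: o[i] == "h" for i in range(3)]
assert all(prob(lambda o: H[i](o) and H[j](o), coins)
           == prob(H[i], coins) * prob(H[j], coins)
           for i in range(3) for j in range(i + 1, 3))
assert prob(lambda o: o == "hhh", coins) == Fraction(1, 8)   # = (1/2)^3

# Example 1.28: pairwise independent, but not mutually independent.
perms = ["abc", "acb", "bac", "bca", "cab", "cba", "aaa", "bbb", "ccc"]
A = [lambda o, i=i: o[i] == "a" for i in range(3)]
assert all(prob(lambda o: A[i](o) and A[j](o), perms)
           == prob(A[i], perms) * prob(A[j], perms)
           for i in range(3) for j in range(i + 1, 3))
triple = prob(lambda o: all(a(o) for a in A), perms)
assert triple == Fraction(1, 9)
assert triple != prob(A[0], perms) ** 3   # 1/9 differs from (1/3)^3 = 1/27
```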
Repeated trials and mutual independence. As stated at the beginning of the chapter,<br />
the term experiment is used in probability theory to describe a procedure whose outcome<br />
is not known in advance with certainty. Experiments are assumed to be repeatable, and to<br />
have a well-defined set of outcomes. Repeated trials are repetitions of an experiment using<br />
the specified procedure, with the outcomes of the trials having no influence on one another.<br />
The results of repeated trials of an experiment are mutually independent.<br />
2 Discrete Random Variables<br />
Researchers use random variables to describe the numerical results of experiments. For<br />
example, if a fair coin is tossed 5 times and the total number of heads is recorded, then a<br />
random variable whose values are 0, 1, 2, 3, 4, 5 is used to give a numerical description of<br />
the results.<br />
This chapter focuses on discrete random variables and their probability distributions.<br />
2.1 Definitions<br />
A random variable is a function from the sample space of an experiment to the real numbers.<br />
The range of a random variable is the set of values the random variable assumes. Random<br />
variables are usually denoted by capital letters (X, Y , Z, ...) and their values by lower case<br />
letters (x, y, z, ...).<br />
If the range of a random variable is a finite or countably infinite set, then the random<br />
variable is said to be discrete; if the range is an interval or a union of intervals, the random<br />
variable is said to be continuous; otherwise, the random variable is said to be mixed.<br />
If X is a discrete random variable, then P(X = x) is the probability that an outcome<br />
has value x. Similarly, P(X ≤ x) is the probability that an outcome has value x or less,<br />
P(a < X < b) is the probability that an outcome has value strictly between a and b, and<br />
so <strong>for</strong>th.<br />
Example 2.1 (Coin Tossing) Let H be the number of heads in 8 tosses of a fair coin<br />
and let X be the difference between the number of heads and the number of tails. That is,<br />
let X = H − (8 − H) = 2H − 8. Since<br />
P(H = h) = C(8, h)/256 when h = 0, 1, ..., 8 and 0 otherwise,<br />
we know that X is a discrete random variable with range {−8, −6, −4, −2, 0, 2, 4, 6, 8}.<br />
The following table displays P(X = x) for each x in the range of X, using a common<br />
denominator throughout:<br />
h          0      1      2      3      4      5      6      7      8<br />
x         −8     −6     −4     −2      0      2      4      6      8<br />
P(X = x)  1/256  8/256  28/256 56/256 70/256 56/256 28/256 8/256  1/256<br />
From this table, we can compute all probabilities of interest. For example,<br />
P(X ≥ 3) = P(X = 4, 6, 8) = 37/256 ≈ 0.1445.<br />
(There are 256 sequences of heads and tails. Of these, 37 have either 6, 7, or 8 heads.)<br />
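The table can be generated directly from the binomial counts; a short Python sketch:<br />

```python
from fractions import Fraction
from math import comb

# PDF of X = 2H - 8, where H is the number of heads in 8 tosses of a fair coin.
f = {2 * h - 8: Fraction(comb(8, h), 256) for h in range(9)}
assert sum(f.values()) == 1

p = sum(v for x, v in f.items() if x >= 3)  # P(X >= 3) = P(X = 4, 6, 8)
print(p, float(p))  # 37/256 ≈ 0.1445
```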
2.1.1 PDF and CDF for Discrete Distributions<br />
If X is a discrete random variable, then the frequency function (FF) or probability density<br />
function (PDF) of X is defined as follows:<br />
f(x) = P(X = x) for all real numbers x.<br />
PDFs satisfy the following properties:<br />
1. f(x) ≥ 0 for all real numbers x.<br />
2. Σ_{x∈R} f(x) equals 1, where R is the range of X.<br />
Since f(x) is the probability of an event and events have nonnegative probabilities, the<br />
values of f(x) must be nonnegative (property 1). Since the events X = x <strong>for</strong> x ∈ R are<br />
mutually disjoint with union S (the sample space), the sum of the probabilities of these<br />
events must be 1 (property 2).<br />
The cumulative distribution function (CDF) of the discrete random variable X is defined<br />
as follows:<br />
F(x) = P(X ≤ x) for all real numbers x.<br />
Note that F(x) is a sum of probabilities:<br />
F(x) = Σ_{y∈Rx} f(y), where Rx = {y ∈ R : y ≤ x}.<br />
CDFs satisfy the following properties:<br />
1. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.<br />
2. If x1 ≤ x2, then F(x1) ≤ F(x2).<br />
3. F(x) is right continuous. That is, for each a, lim_{x→a+} F(x) = F(a).<br />
F(x) represents cumulative probability, with limits 0 and 1 (property 1). Cumulative probability<br />
increases with increasing x (property 2), and has discrete jumps at values of x in<br />
the range of the random variable (property 3). (The concept of limit is central to calculus,<br />
and will be defined in Chapter 4 of these notes.)<br />
Plotting PDF and CDF functions. The PDF of a discrete random variable is represented<br />
graphically<br />
1. By using a scatter plot of pairs (x, f(x)) for x ∈ R, or<br />
2. By using a probability histogram, where area is used to represent probability.<br />
Figure 2.1: Probability histogram (left) and cumulative distribution function (right) for the<br />
difference between the number of heads and the number of tails in eight tosses of a fair coin.<br />
The CDF of a discrete random variable is represented graphically as a step function, with<br />
steps of height f(x) at each x ∈ R.<br />
Example 2.2 (Coin Tossing, continued) For example, the left part of Figure 2.1 is the<br />
probability histogram for the difference between the number of heads and the number of<br />
tails in eight tosses of a fair coin. For each x in the range of the random variable, a rectangle<br />
with base equal to the interval [x − 0.50,x + 0.50] and with height equal to f(x) is drawn.<br />
The total area is 1.0. The right part is a representation of the cumulative distribution<br />
function. Note that F(x) is nondecreasing, F(x) = 0 when x < −8, and F(x) = 1 when<br />
x > 8. Steps occur at x = −8, −6,... ,6,8.<br />
2.2 Families of Distributions<br />
This section briefly describes four families of discrete probability distributions, and states<br />
properties of these distributions.<br />
2.2.1 Example: Discrete Uniform Distribution<br />
Let n be a positive integer. The random variable X is said to be a discrete uniform random<br />
variable, or to have a discrete uniform distribution, with parameter n when its PDF is as<br />
follows:<br />
f(x) = 1/n when x = 1, 2, ..., n and 0 otherwise.<br />
For example, if you roll a fair six-sided die and let X equal the number of dots on the top<br />
face, then X has a discrete uniform distribution with n = 6.<br />
2.2.2 Example: Hypergeometric Distribution<br />
Let N, M, and n be integers with 0 < M < N and 0 < n < N. The random variable X is<br />
said to be a hypergeometric random variable, or to have a hypergeometric distribution, with<br />
Figure 2.2: Probability histograms for hypergeometric (left) and binomial (right) examples.<br />
parameters n, M, and N, when its PDF is as follows:<br />
f(x) = C(M, x) C(N − M, n − x) / C(N, n)<br />
for integers x between max(0, n + M − N) and min(n, M) (and zero otherwise), where C(a, b)<br />
denotes the number of combinations of a objects taken b at a time. Note that the restrictions<br />
on x ensure that all combinations are defined and positive.<br />
Example 2.3 (n = 8, M = 14, N = 35) Assume that X has a hypergeometric distribution<br />
with parameters n = 8, M = 14 and N = 35. The following table gives the values<br />
of the PDF of X for all x in its range, using 4 decimal place accuracy.<br />
x 0 1 2 3 4 5 6 7 8<br />
f(x) 0.0086 0.0692 0.2098 0.3147 0.2545 0.1131 0.0268 0.0031 0.0001<br />
The left part of Figure 2.2 is a probability histogram for X.<br />
Hypergeometric distributions are used to model urn experiments (see Section 1.5.1). Suppose<br />
there are M special objects in an urn containing a total of N objects. Let X be the<br />
number of special objects in a subset of size n chosen from the urn. If each choice of subset<br />
is equally likely, then X has a hypergeometric distribution. In the example above, 40%<br />
(14/35) of the elements in the urn are special.<br />
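The PDF values in Example 2.3 can be reproduced with math.comb; the function name hypergeom_pdf is ours:<br />

```python
from math import comb

def hypergeom_pdf(x, n, M, N):
    """P(x special objects) when n objects are drawn from an urn of N containing M special."""
    if max(0, n + M - N) <= x <= min(n, M):
        return comb(M, x) * comb(N - M, n - x) / comb(N, n)
    return 0.0

# Example 2.3: n = 8, M = 14, N = 35.
print(round(hypergeom_pdf(3, 8, 14, 35), 4))  # 0.3147
assert abs(sum(hypergeom_pdf(x, 8, 14, 35) for x in range(9)) - 1) < 1e-12
```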
2.2.3 Distributions Related to Bernoulli Experiments<br />
A Bernoulli experiment is an experiment with two possible outcomes. The outcome of<br />
chief interest is often called “success” and the other outcome “failure.” Let p equal the<br />
probability of success.<br />
Imagine repeating a Bernoulli experiment n times. The expected number of successes in n<br />
independent trials of a Bernoulli experiment with success probability p is np.<br />
For example, suppose that you roll a fair six-sided die and observe the number on the top<br />
face each time. Let success be a 1 or 4 on the top face, and failure be a 2, 3, 5, or 6 on the<br />
top face. Then p = 1/3 is the probability of success. In 600 trials of the experiment, you<br />
expect 200 successes.<br />
Example: Bernoulli distribution. Suppose that a Bernoulli experiment is run once.<br />
Let X equal 1 if a success occurs and 0 if a failure occurs. Then X is said to be a Bernoulli<br />
random variable, or to have a Bernoulli distribution, with parameter p. The PDF of X is<br />
as follows:<br />
f(1) = p, f(0) = 1 − p, and f(x) = 0 otherwise.<br />
Example: binomial distribution. Let X be the number of successes in n independent<br />
trials of a Bernoulli experiment with success probability p. Then X is said to be a binomial<br />
random variable, or to have a binomial distribution, with parameters n and p. The PDF of<br />
X is as follows:<br />
f(x) = C(n, x) p^x (1 − p)^(n−x) when x = 0, 1, 2, ..., n, and 0 otherwise.<br />
For each x, f(x) is the probability of the event “exactly x successes in n independent trials.”<br />
(There are a total of C(n, x) sequences with exactly x successes and n − x failures; each such<br />
sequence has probability p^x (1 − p)^(n−x).)<br />
Example 2.4 (n = 8, p = 0.4) Assume that X has a binomial distribution with parameters<br />
n = 8 and p = 0.4. The following table gives the values of the PDF of X for all x in<br />
its range:<br />
x 0 1 2 3 4 5 6 7 8<br />
f(x) 0.0168 0.0896 0.209 0.2787 0.2322 0.1239 0.0413 0.0079 0.0007<br />
The right part of Figure 2.2 is a probability histogram for X.<br />
Notes. The binomial theorem (page 8) can be used to demonstrate that<br />
f(0) + f(1) + · · · + f(n) = 1.<br />
Note also that a Bernoulli random variable is a binomial random variable with n = 1.<br />
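A corresponding sketch for the binomial PDF, checked against Example 2.4:<br />

```python
from math import comb

def binomial_pdf(x, n, p):
    """comb(n, x) sequences, each with probability p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

pdf = [binomial_pdf(x, 8, 0.4) for x in range(9)]
print(round(pdf[3], 4))            # 0.2787
assert abs(sum(pdf) - 1) < 1e-12   # binomial theorem: the probabilities sum to 1
```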
2.2.4 Simple Random Samples<br />
Suppose that an urn contains N objects. A simple random sample of size n is a sequence<br />
of n objects chosen without replacement from the urn, where the choice of each sequence is<br />
equally likely.<br />
Let M be the number of special objects in the urn and X be the number of special objects<br />
in a simple random sample of size n. Then X has a hypergeometric distribution with<br />
parameters n, M, N. Further, if N is very large, then binomial probabilities can often be<br />
used to approximate hypergeometric probabilities.<br />
Theorem 2.5 (Binomial Approximation) If N is large, then the binomial distribution<br />
with parameters n and p = M/N can be used to approximate the hypergeometric<br />
distribution with parameters n, M, N. Specifically,<br />
P(x special objects) = C(M, x) C(N − M, n − x) / C(N, n) ≈ C(n, x) p^x (1 − p)^(n−x) for each x.<br />
Note that if X is the number of special objects in a sequence of n objects chosen with<br />
replacement from the urn and if the choice of each sequence is equally likely, then X has<br />
a binomial distribution with parameters n and p = M/N. The theorem says that if N is<br />
large, then the model where sampling is done with replacement can be used to approximate<br />
the model where sampling is done without replacement.<br />
Adequacy of the approximation. One rule of thumb <strong>for</strong> judging the adequacy of the<br />
approximation is the following: the binomial approximation is adequate when<br />
20n < N and 0.05 < p < 0.95.<br />
Survey analysis. Simple random samples are used in surveys. If the survey population<br />
is small, then hypergeometric distributions are used to analyze the results. If the survey<br />
population is large, then binomial distributions are used to analyze the results, even though<br />
each person’s opinion is solicited at most once.<br />
Example 2.6 (Survey Analysis) For example, suppose that a surveyor is interested in<br />
determining the level of support for a proposal to change the local tax structure, and decides<br />
to choose a simple random sample of size 10 from the registered voter list. If there are a<br />
total of 120 registered voters, one-third of whom support the proposal, then the probability<br />
that exactly 3 of the 10 chosen voters support the proposal is<br />
P(X = 3) = C(40, 3) C(80, 7) / C(120, 10) ≈ 0.27.<br />
If there are thousands of registered voters, the probability is<br />
P(X = 3) ≈ C(10, 3) (1/3)^3 (2/3)^7 ≈ 0.26.<br />
Note that the exact number of registered voters is not needed in the approximation.<br />
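The two computations can be placed side by side; the exact case assumes the 40 supporters among 120 voters stated above:<br />

```python
from math import comb

# Exact: simple random sample of 10 from 120 voters, 40 of whom support the proposal.
exact = comb(40, 3) * comb(80, 7) / comb(120, 10)

# Approximation: binomial with n = 10 and p = 1/3 (the population size drops out).
approx = comb(10, 3) * (1 / 3)**3 * (2 / 3)**7

print(round(exact, 2), round(approx, 2))  # 0.27 0.26
```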
2.3 Mathematical Expectation<br />
Mathematical expectation generalizes the idea of a weighted average, where probability<br />
distributions are used as the weights.<br />
2.3.1 Definitions for Finite Discrete Random Variables<br />
Let X be a discrete random variable with finite range R and PDF f(x). The mean of X<br />
(or the expected value of X or the expectation of X) is defined as follows:<br />
E(X) = Σ_{x∈R} x f(x).<br />
Similarly, if g(X) is a real-valued function of X, then the mean of g(X) (or the expected<br />
value of g(X) or the expectation of g(X)) is<br />
E(g(X)) = Σ_{x∈R} g(x) f(x).<br />
Example 2.7 (Nickels and Dimes) Assume you have 3 dimes and 5 nickels in your<br />
pocket. You choose a subset of 4 coins, let X equal the number of dimes in the subset and<br />
g(X) = 10X + 5(4 − X) = 20 + 5X<br />
be the total value (in cents) of the chosen coins. If each choice of subset is equally likely,<br />
then X has a hypergeometric distribution with parameters n = 4, M = 3, and N = 8, the<br />
expected number of dimes in the subset is<br />
E(X) = Σ_{x=0}^{3} x f(x) = 0(5/70) + 1(30/70) + 2(30/70) + 3(5/70) = 1.5<br />
and the expected total value of the chosen coins is<br />
E(g(X)) = Σ_{x=0}^{3} (20 + 5x) f(x) = 20(5/70) + 25(30/70) + 30(30/70) + 35(5/70) = 27.5 cents.<br />
Note that E(g(X)) = g(E(X)).<br />
2.3.2 Properties of Expectation<br />
Properties of sums imply the following properties of expectation:<br />
1. If a is a constant, then E(a) = a.<br />
2. If a and b are constants, then E(a + bX) = a + bE(X).<br />
3. If ci is a constant and gi(X) is a real-valued function for i = 1, 2, ..., k, then<br />
E(c1g1(X) + c2g2(X) + · · · + ckgk(X)) = c1E(g1(X)) + c2E(g2(X)) + · · · + ckE(gk(X)).<br />
The first property says that the mean of a constant function is the constant itself. The<br />
second property says that if g(X) = a + bX, then E(g(X)) = g(E(X)). The third property<br />
generalizes the first two.<br />
Note that if g(X) is not of the form a + bX, then E(g(X)) and g(E(X)) may be different. For example,<br />
let g(X) = X^2 in the nickels and dimes example above. Then<br />
E(g(X)) = 0^2(5/70) + 1^2(30/70) + 2^2(30/70) + 3^2(5/70) = 39/14.<br />
Since 39/14 ≠ (3/2)^2, E(g(X)) ≠ g(E(X)) in this case.<br />
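These expectations can be verified exactly, with the hypergeometric weights computed from math.comb:<br />

```python
from fractions import Fraction
from math import comb

# PDF for the nickels-and-dimes example: n = 4, M = 3, N = 8.
f = {x: Fraction(comb(3, x) * comb(5, 4 - x), comb(8, 4)) for x in range(4)}

E = sum(x * f[x] for x in f)              # E(X) = 3/2
Eg = sum((20 + 5 * x) * f[x] for x in f)  # E(20 + 5X) = 27.5 cents
Eg2 = sum(x**2 * f[x] for x in f)         # E(X^2) = 39/14

assert Eg == 20 + 5 * E    # linearity: E(a + bX) = a + b E(X)
assert Eg2 != E**2         # but E(X^2) differs from (E(X))^2
```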
Table 2.1: Means and variances for four families of discrete distributions.<br />
Distribution              E(X)      Var(X)<br />
Discrete Uniform n        (n+1)/2   (n^2 − 1)/12<br />
Hypergeometric n, M, N    nM/N      n(M/N)(1 − M/N)(N − n)/(N − 1)<br />
Bernoulli p               p         p(1 − p)<br />
Binomial n, p             np        np(1 − p)<br />
2.3.3 Mean, Variance and Standard Deviation<br />
Let X be a random variable and let µ = E(X) be its mean. The variance of X, Var(X),<br />
is defined as follows:<br />
Var(X) = E[(X − µ)^2].<br />
The notation σ^2 = Var(X) is used to denote the variance. The standard deviation of X,<br />
σ = SD(X), is the positive square root of the variance.<br />
The mean is a measure of the center (or location) of a distribution. The variance and<br />
standard deviation are measures of the scale (or spread) of a distribution. If X is the height<br />
of an individual in inches, say, then the values of E(X) and SD(X) are in inches, while the<br />
value of Var(X) is in square inches.<br />
Table 2.1 gives general formulas for the mean and variance of the four families of discrete<br />
probability distributions discussed earlier.<br />
Example 2.8 (Drilling for Oil) Assume that the probability of finding oil in a given<br />
drill hole is 0.3 and that the results (finding oil or not) are independent from drill hole to<br />
drill hole.<br />
An oil company drills one hole at a time. If they find oil, then they stop; otherwise, they<br />
continue. However, the company only has enough money to drill five holes. Let X be the<br />
number of holes drilled.<br />
The range of X is {1,2,3,4,5} and its PDF is given in Table 2.2. (The values of f(x) are<br />
zero otherwise.) The mean number of holes drilled is<br />
E(X) = Σ_{x=1}^{5} x f(x) = 1(0.3) + 2(0.21) + 3(0.147) + 4(0.1029) + 5(0.2401) = 2.7731.<br />
Similarly,<br />
Var(X) = Σ_{x=1}^{5} (x − E(X))^2 f(x) ≈ 2.42182<br />
and SD(X) = √Var(X) ≈ 1.55622.<br />
Table 2.2: Probability distribution for oil drilling example.<br />
f(x) = P(X = x) Comments:<br />
f(1) = 0.3 Success on first try<br />
f(2) = (0.7)(0.3) = 0.21 Failure followed by success<br />
f(3) = (0.7)^2(0.3) = 0.147 Two failures followed by success<br />
f(4) = (0.7)^3(0.3) = 0.1029 Three failures followed by success<br />
f(5) = (0.7)^4(0.3) + (0.7)^5 = 0.2401 Four failures followed by success<br />
or failure on all 5 tries<br />
Properties of variance and standard deviation. Properties of sums can be used to<br />
prove the following properties of the variance and standard deviation:<br />
1. Var(X) = E(X^2) − (E(X))^2.<br />
2. If Y = a + bX, then Var(Y) = b^2 Var(X) and SD(Y) = |b| SD(X).<br />
The first property provides a quick by-hand method for computing the variance. For example,<br />
the variance of X in the nickels and dimes example is 39/14 − (3/2)^2 = 15/28<br />
(confirming the result for hypergeometric random variables given in Table 2.1).<br />
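Both the definition and the shortcut can be checked on the oil drilling example with exact arithmetic:<br />

```python
from fractions import Fraction

# PDF from Table 2.2.
f = {1: Fraction(3, 10), 2: Fraction(21, 100), 3: Fraction(147, 1000),
     4: Fraction(1029, 10000), 5: Fraction(2401, 10000)}
assert sum(f.values()) == 1

mu = sum(x * p for x, p in f.items())
var_def = sum((x - mu)**2 * p for x, p in f.items())     # definition of Var(X)
var_short = sum(x**2 * p for x, p in f.items()) - mu**2  # shortcut E(X^2) - (E(X))^2
assert var_def == var_short

print(float(mu), float(var_def))  # 2.7731 2.42181639
```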
2.3.4 Chebyshev’s Inequality<br />
The Chebyshev inequality illustrates the importance of the concepts of mean, variance, and<br />
standard deviation.<br />
Theorem 2.9 (Chebyshev’s Inequality) Let X be a random variable with finite mean<br />
µ and standard deviation σ, and let ε be a positive constant. Then<br />
P(µ − ε < X < µ + ε) > 1 − σ^2/ε^2.<br />
In particular, if ε = kσ for some k, then<br />
P(µ − kσ < X < µ + kσ) > 1 − 1/k^2.<br />
For example, if ε = 2.5σ, then Chebyshev’s inequality states that<br />
• At least 84% of the distribution of X is in the interval |x − µ| < 2.5σ and<br />
• At most 16% of the distribution is in the complementary interval |x − µ| ≥ 2.5σ.<br />
Proof of Chebyshev’s inequality: Using complements, it is sufficient to demonstrate that<br />
P(|X − µ| ≥ ε) ≤ σ^2/ε^2.<br />
Since σ^2 = E[(X − µ)^2] = Σ_x (x − µ)^2 f(x), where f(x) is the PDF of X, and the sum of<br />
two nonnegative terms is greater than or equal to either term,<br />
σ^2 = Σ_{|x−µ|≥ε} (x − µ)^2 f(x) + Σ_{|x−µ|<ε} (x − µ)^2 f(x) ≥ Σ_{|x−µ|≥ε} (x − µ)^2 f(x) ≥ Σ_{|x−µ|≥ε} ε^2 f(x) = ε^2 P(|X − µ| ≥ ε).<br />
Dividing both sides by ε^2 completes the proof.<br />
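As a numerical illustration, the bound can be compared with the exact probability for the coin-difference variable of Example 2.1:<br />

```python
from fractions import Fraction
from math import comb

# X = 2H - 8 with H the number of heads in 8 fair tosses; mu = 0 and Var(X) = 4 Var(H) = 8.
f = {2 * h - 8: Fraction(comb(8, h), 256) for h in range(9)}
mu = sum(x * p for x, p in f.items())
var = sum((x - mu)**2 * p for x, p in f.items())

eps = 5
inside = sum(p for x, p in f.items() if abs(x - mu) < eps)
# Chebyshev: P(|X - mu| < eps) > 1 - sigma^2 / eps^2; here 238/256 > 1 - 8/25.
assert inside > 1 - var / Fraction(eps**2)
```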
3 Joint Discrete Distributions<br />
A probability distribution describing the joint variability of two or more random variables<br />
is called a joint distribution.<br />
For example, if X is the height (in feet), Y is the weight (in pounds), and Z is the serum<br />
cholesterol level (in mg/dL) of a person chosen from a given population, then we may be<br />
interested in describing the joint distribution of the triple (X,Y,Z).<br />
3.1 Bivariate Distributions<br />
A bivariate distribution is the joint distribution of a pair of random variables.<br />
3.1.1 Joint PDF<br />
Assume that X and Y are discrete random variables. The joint frequency function (joint<br />
FF) or joint probability density function (joint PDF) of (X,Y ) is defined as follows:<br />
f(x,y) = P(X = x, Y = y) for all real pairs (x,y),
where the comma is understood to mean the intersection of the events. The notation<br />
fXY (x,y) is sometimes used to emphasize the two random variables.<br />
3.1.2 Joint CDF<br />
Assume that X and Y are discrete random variables. The joint cumulative distribution<br />
function (joint CDF) of (X,Y ) is defined as follows:<br />
F(x,y) = P(X ≤ x, Y ≤ y) for all real pairs (x,y),
where the comma is understood to mean the intersection of the events. The notation<br />
FXY (x,y) is sometimes used to emphasize the two random variables.<br />
3.1.3 Marginal Distributions<br />
Let X and Y be discrete random variables with joint PDF f(x,y). The marginal frequency<br />
function (marginal FF) or marginal probability density function (marginal PDF) of X is<br />
fX(x) = Σ_y f(x,y)   for all real numbers x,

where the sum is taken over all y in the range of Y. Similarly, the marginal frequency function or marginal PDF of Y is

fY(y) = Σ_x f(x,y)   for all real numbers y,

where the sum is taken over all x in the range of X.
3.1.4 Conditional Distributions<br />
Let X and Y be discrete random variables with joint PDF f(x,y).<br />
If fY(y) ≠ 0, then the conditional frequency function (conditional FF) or conditional probability density function (conditional PDF) of X given Y = y is defined as follows:

fX|Y=y(x|y) = f(x,y)/fY(y)   for all real numbers x.

Similarly, if fX(x) ≠ 0, then the conditional PDF of Y given X = x is

fY|X=x(y|x) = f(x,y)/fX(x)   for all real numbers y.
Note that in the first case, the conditional sample space is the collection of outcomes with<br />
Y = y; in the second case, it is the collection of outcomes with X = x.<br />
Example 3.1 (Eyesight Study, continued) Consider again the eyesight study from<br />
page 19. Let X be the score <strong>for</strong> the right eye and Y be the score <strong>for</strong> the left eye. Then<br />
Table 1.2 displays the joint (X,Y ) distribution. The probability that both eyes have the<br />
same score, for example, is
P(X = Y ) = f(1,1) + f(2,2) + f(3,3) + f(4,4) = 0.708.<br />
The marginal distribution of X is given in the right column of Table 1.2 and the marginal<br />
distribution of Y is given in the bottom row of the table.<br />
Figure 1.3 displays the marginal distribution of Y (in gray), the conditional distribution<br />
of Y given X = 1 (left plot, in black), and the conditional distribution of Y given X = 4<br />
(right plot, in black).<br />
Example 3.2 (Urn Experiment) Assume that an urn contains 4 red, 3 white and 1 blue<br />
chip. Consider the following 2 step experiment:<br />
Step 1. Thoroughly mix the contents of the urn. Choose a chip and record its color.<br />
Return the chip plus two more chips of the same color to the urn.<br />
Step 2. Thoroughly mix the contents of the urn. Choose a chip and record its color.<br />
Let X equal the number of red chips chosen and Y equal the number of white chips chosen.<br />
Then Table 3.1 displays the joint (X,Y ) distribution. The marginal distribution of X is<br />
given in the right column and the marginal distribution of Y is given in the bottom row.<br />
The conditional distribution of Y given X = 1, for example, is

fY|X=1(0|1) = 0.1/0.4 = 1/4,   fY|X=1(1|1) = 0.3/0.4 = 3/4,

and fY|X=1(y|1) = 0 otherwise.
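As a cross-check on these values and on Table 3.1 below, the joint PDF of (X, Y) can be built directly from the two-step urn dynamics in exact arithmetic, and the marginal and conditional distributions recovered from it. A minimal sketch (the dictionary layout and helper names are our own):

```python
from fractions import Fraction as F
from collections import defaultdict

# Urn: 4 red, 3 white, 1 blue.  After the first draw the chosen chip is
# returned along with two more of the same color (8 chips, then 10 chips).
start = {"red": 4, "white": 3, "blue": 1}

joint = defaultdict(F)   # joint PDF of (X, Y) = (# red drawn, # white drawn)
for c1, n1 in start.items():
    p1 = F(n1, 8)
    second = dict(start)
    second[c1] += 2      # replacement plus two extra chips of the same color
    for c2, n2 in second.items():
        p2 = F(n2, 10)
        x = (c1 == "red") + (c2 == "red")
        y = (c1 == "white") + (c2 == "white")
        joint[(x, y)] += p1 * p2

# Marginal PDFs: sum the joint PDF over the other variable
fX = {x: sum(p for (a, b), p in joint.items() if a == x) for x in range(3)}
fY = {y: sum(p for (a, b), p in joint.items() if b == y) for y in range(3)}

# Conditional PDF of Y given X = 1
cond = {y: joint[(1, y)] / fX[1] for y in (0, 1)}
print(dict(joint), fX, fY, cond)   # cond is {0: 1/4, 1: 3/4}
```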
Table 3.1: Joint distribution for the urn experiment example.

          y = 0    y = 1    y = 2    total
x = 0     0.0375   0.0750   0.1875   0.3
x = 1     0.1000   0.3000            0.4
x = 2     0.3000                     0.3
total     0.4375   0.3750   0.1875   1.0

3.1.5 Independent Random Variables
The discrete random variables X and Y are said to be independent if<br />
f(x,y) = fX(x)fY(y) for all real pairs (x,y).
Otherwise, X and Y are said to be dependent.<br />
The discrete random variables X and Y are independent if the probability of the intersection is equal to the product of the probabilities

P(X = x, Y = y) = P(X = x)P(Y = y)

for all events of interest (for all x, y).
Example 3.3 (Independent Bernoulli Random Variables) Assume that (X,Y ) has<br />
the joint distribution given in the following table:<br />
y = 0 y = 1 total<br />
x = 0 0.28 0.42 0.70<br />
x = 1 0.12 0.18 0.30<br />
total 0.40 0.60 1.00<br />
X is a Bernoulli random variable with parameter 0.30 and Y is a Bernoulli random variable<br />
with parameter 0.60. Further, since each probability in the body of the 2-by-2 table is the<br />
product of the appropriate marginal probabilities, X and Y are independent.<br />
Example 3.4 (Eyesight Study, continued) The random variables X and Y in the eyesight<br />
study are dependent. To demonstrate this, we need to show that f(x,y) ≠ fX(x)fY(y) for some pair (x,y). For example,

f(1,1) = 0.203 ≠ 0.265 × 0.255 = fX(1)fY(1).
Example 3.5 (Urn Experiment, continued) Similarly, the random variables X and Y<br />
in the urn experiment are dependent. For example,<br />
f(1,1) = 0.3 ≠ 0.4 × 0.375 = fX(1)fY(1).
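These checks can be automated. A small sketch (the helper is our own) that tests f(x,y) = fX(x)fY(y) over every pair of a finite joint PDF, applied to Examples 3.3 and 3.5:

```python
def is_independent(joint, tol=1e-9):
    """Check f(x,y) = fX(x) fY(y) for every pair of a finite joint PDF."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    fX = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    fY = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0.0) - fX[x] * fY[y]) <= tol
               for x in xs for y in ys)

# Example 3.3: the Bernoulli pair, which is independent
bernoulli = {(0, 0): 0.28, (0, 1): 0.42, (1, 0): 0.12, (1, 1): 0.18}

# Example 3.5: the urn experiment pair (Table 3.1), which is dependent
urn = {(0, 0): 0.0375, (0, 1): 0.0750, (0, 2): 0.1875,
       (1, 0): 0.1000, (1, 1): 0.3000,
       (2, 0): 0.3000}

print(is_independent(bernoulli), is_independent(urn))  # True False
```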
3.1.6 Mathematical Expectation <strong>for</strong> Finite Bivariate Distributions<br />
Let X and Y be discrete random variables with joint PDF f(x,y) = P(X = x,Y = y),<br />
and let g(X,Y ) be a real-valued function. The mean of g(X,Y ) (or the expected value of<br />
g(X,Y ) or the expectation of g(X,Y )) is<br />
E(g(X,Y)) = Σ_x Σ_y g(x,y) f(x,y),

where the double sum includes all pairs (x,y) with nonzero joint PDF.
Example 3.6 (Eyesight Study, continued) Continuing with the eyesight study, let<br />
g(X,Y ) = |X − Y | be the absolute value of the difference in scores <strong>for</strong> the two eyes. Then<br />
E(g(X,Y)) = Σ_{x=1}^{4} Σ_{y=1}^{4} |x − y| f(x,y) = 0.374.
(The average absolute difference in scores <strong>for</strong> the two eyes is small.)<br />
Properties. Properties of sums and the fact that the joint PDF of independent random<br />
variables equals the product of the marginal PDFs can be used to show the following:<br />
1. If a and b1, b2, ..., bn are constants and gi(X,Y) are real-valued functions for each i = 1, 2, ..., n, then

E(a + Σ_{i=1}^{n} bi gi(X,Y)) = a + Σ_{i=1}^{n} bi E(gi(X,Y)).

2. If X and Y are independent and g(X) and h(Y) are real-valued functions, then

E(g(X)h(Y)) = E(g(X))E(h(Y)).
The first property generalizes properties from Section 2.3.2. The second property is useful<br />
when studying associations between variables.<br />
3.1.7 Covariance and Correlation<br />
Let X and Y be random variables with finite means (µx, µy) and finite standard deviations<br />
(σx, σy). The covariance of X and Y , Cov(X,Y ), is defined as follows:<br />
Cov(X,Y ) = E((X − µx)(Y − µy)).<br />
The notation σxy = Cov(X,Y ) is used to denote the covariance. The correlation of X and<br />
Y , Corr(X,Y ), is defined as follows:<br />
Corr(X,Y) = Cov(X,Y)/(σx σy) = σxy/(σx σy).
The notation ρ = Corr(X,Y ) is used to denote the correlation of X and Y ; ρ is called the<br />
correlation coefficient.<br />
Positive and negative association. Covariance and correlation are measures of the<br />
association between two random variables. Specifically,<br />
1. The random variables X and Y are said to be positively associated if as X increases,<br />
Y tends to increase. If X and Y are positively associated, then Cov(X,Y ) and<br />
Corr(X,Y ) will be positive.<br />
2. The random variables X and Y are said to be negatively associated if as X increases,<br />
Y tends to decrease. If X and Y are negatively associated, then Cov(X,Y ) and<br />
Corr(X,Y ) will be negative.<br />
For example, the height and weight of individuals in a given population are positively<br />
associated. Educational level and indices of poor health are often negatively associated.<br />
Properties of covariance. The following properties of covariance are important:<br />
1. Cov(X,X) = Var(X).
2. Cov(X,Y ) = Cov(Y,X).<br />
3. Cov(a + bX,c + dY ) = bdCov(X,Y ), where a, b, c, and d are constants.<br />
4. Cov(X,Y ) = E(XY ) − E(X)E(Y ).<br />
5. If X and Y are independent, then Cov(X,Y ) = 0.<br />
The first two properties follow immediately from the definition of covariance. The third<br />
property relates the covariance of linearly transformed random variables to the covariance
of the original random variables. In particular, if b = d = 1 (values are shifted only), then<br />
the covariance is unchanged; if a = c = 0 and b,d > 0 (values are re-scaled), then the<br />
covariance is multiplied by bd.<br />
The fourth property gives an alternative method for calculating the covariance; the method is particularly well-suited for by-hand computations. Since E(XY) = E(X)E(Y) for independent random variables, the fourth property can be used to prove that the covariance of independent random variables is zero (property 5).
Properties of correlation. The following properties of correlation are important:<br />
1. −1 ≤ Corr(X,Y ) ≤ 1.<br />
2. If a, b, c, and d are constants, then<br />
Corr(a + bX, c + dY) = Corr(X,Y) when bd > 0, and −Corr(X,Y) when bd < 0.
In particular, Corr(X,a + bX) equals 1 if b > 0 and equals −1 if b < 0.<br />
3. If X and Y are independent, then Corr(X,Y ) = 0.<br />
Correlation is often called standardized covariance since, by the first property, its values<br />
always lie in the [−1,1] interval. If the correlation is close to −1, then there is a strong<br />
negative association between the variables; if the correlation is close to 1, then there is a<br />
strong positive association between the variables.<br />
The second property relates the correlation of linearly transformed variables to the correlation
of the original variables. Note, in particular, that the correlation is unchanged if the<br />
random variables are shifted (b = d = 1) or if the random variables are re-scaled (a = c = 0<br />
and b,d > 0). For example, the correlation between the height and weight of individuals in<br />
a population is the same no matter which measurement scale is used for height (e.g. inches, feet) and which measurement scale is used for weight (e.g. pounds, kilograms).
If Corr(X,Y ) = 0, then X and Y are said to be uncorrelated; otherwise, they are said to<br />
be correlated. The third property says that independent random variables are uncorrelated.<br />
Note that uncorrelated random variables are not necessarily independent.<br />
Example 3.7 (Eyesight Study, continued) For (X,Y ) from the eyesight study,<br />
E(X) = 2.275, E(Y ) = 2.305, E(X 2 ) = 6.117, E(Y 2 ) = 6.259, E(XY ) = 5.905.<br />
Using properties of variance and covariance:

Var(X) ≈ 0.941, Var(Y) ≈ 0.946, Cov(X,Y) ≈ 0.661.

Finally, ρ = Corr(X,Y) ≈ 0.701.
Example 3.8 (Uncorrelated but Dependent) Assume that the random pair (X,Y )<br />
has the joint distribution given in the following table:<br />
y = −1 y = 0 y = 1 total<br />
x = −1 0.05 0.10 0.05 0.20<br />
x = 0 0.10 0.40 0.10 0.60<br />
x = 1 0.05 0.10 0.05 0.20<br />
total 0.20 0.60 0.20 1.00<br />
Since E(X) = E(Y ) = E(XY ) = 0, the covariance and correlation must equal 0.<br />
But, X and Y are dependent since, for example, f(0,0) = 0.40 ≠ 0.36 = fX(0)fY(0).
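A sketch (helper names are our own) that computes covariance and correlation from a finite joint PDF using property 4, Cov(X,Y) = E(XY) − E(X)E(Y), applied to the table of Example 3.8:

```python
from math import sqrt

def cov_corr(joint):
    """Covariance and correlation of a finite joint PDF {(x, y): prob},
    computed via Cov(X,Y) = E(XY) - E(X)E(Y)."""
    EX  = sum(x * p for (x, y), p in joint.items())
    EY  = sum(y * p for (x, y), p in joint.items())
    EXY = sum(x * y * p for (x, y), p in joint.items())
    EX2 = sum(x * x * p for (x, y), p in joint.items())
    EY2 = sum(y * y * p for (x, y), p in joint.items())
    cov = EXY - EX * EY
    return cov, cov / sqrt((EX2 - EX ** 2) * (EY2 - EY ** 2))

# The table of Example 3.8
table = {(-1, -1): 0.05, (-1, 0): 0.10, (-1, 1): 0.05,
         ( 0, -1): 0.10, ( 0, 0): 0.40, ( 0, 1): 0.10,
         ( 1, -1): 0.05, ( 1, 0): 0.10, ( 1, 1): 0.05}

cov, corr = cov_corr(table)
f00 = table[(0, 0)]
fX0 = sum(p for (x, y), p in table.items() if x == 0)
fY0 = sum(p for (x, y), p in table.items() if y == 0)
print(cov, corr)        # both numerically zero: X and Y are uncorrelated
print(f00, fX0 * fY0)   # 0.40 versus 0.36: yet X and Y are dependent
```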
3.1.8 Conditional Expectation and Regression<br />
Let X and Y be discrete random variables with finite ranges and joint PDF f(x,y).<br />
If fX(x) ≠ 0, then the conditional expectation (or the conditional mean) of Y given X = x,
E(Y |X = x), is defined as follows:<br />
E(Y|X = x) = Σ_y y fY|X=x(y|x),

where the sum is over all y with nonzero conditional PDF (fY|X=x(y|x) ≠ 0).
        y = 1   y = 2   y = 3   y = 4   y = 5
x = 0   0.02    0.02    0.02    0.01    0.01
x = 1   0.02    0.02    0.01    0.04    0.04
x = 2   0.02    0.01    0.04    0.04    0.10
x = 3   0.01    0.04    0.04    0.10    0.10
x = 4   0.01    0.04    0.10    0.10    0.04

Figure 3.1: Joint (X,Y) distribution (left) and plot of (x, E(Y|X = x)) pairs (right) for the conditional expectation example.
The conditional expectation of X given Y = y, E(X|Y = y), is defined similarly.<br />
Example 3.9 (Conditional Expectation) Let (X,Y ) be the random pair whose joint<br />
distribution is shown in the left part of Figure 3.1. Then

E(Y|X = 0) = 1(.02/.08) + 2(.02/.08) + 3(.02/.08) + 4(.01/.08) + 5(.01/.08) = 2.625.
Similarly,<br />
E(Y |X = 1) = 3.462, E(Y |X = 2) = 3.905, E(Y |X = 3) = 3.828 and E(Y |X = 4) = 3.414.<br />
The right part of Figure 3.1 is a plot of the pairs (x, E(Y|X = x)) for x = 0, 1, 2, 3, 4.
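The regression computation can be checked mechanically. A sketch using the joint PDF of Figure 3.1 (rows x = 0, ..., 4 and columns y = 1, ..., 5 as read above):

```python
# Joint PDF from Figure 3.1: rows x = 0..4, columns y = 1..5
rows = {0: (0.02, 0.02, 0.02, 0.01, 0.01),
        1: (0.02, 0.02, 0.01, 0.04, 0.04),
        2: (0.02, 0.01, 0.04, 0.04, 0.10),
        3: (0.01, 0.04, 0.04, 0.10, 0.10),
        4: (0.01, 0.04, 0.10, 0.10, 0.04)}

def cond_mean_Y(x):
    """E(Y | X = x): average of y weighted by the conditional PDF f(y|x)."""
    fx = sum(rows[x])                                        # P(X = x)
    return sum(y * p for y, p in zip(range(1, 6), rows[x])) / fx

reg = {x: cond_mean_Y(x) for x in rows}    # the regression of Y on X
print({x: round(m, 3) for x, m in reg.items()})
# {0: 2.625, 1: 3.462, 2: 3.905, 3: 3.828, 4: 3.414}
```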
Regression equation. The formula for the conditional expectation

E(Y|X = x) as a function of x

is called the regression equation of Y on X. Similarly, the formula for the conditional expectation E(X|Y = y) as a function of y is called the regression equation of X on Y.
Linear conditional means. Let X and Y be random variables with finite means (µx,<br />
µy), standard deviations (σx, σy), and correlation (ρ).<br />
If E(Y|X = x) is a linear function of x, then the formula is of the form:

E(Y|X = x) = µy + ρ(σy/σx)(x − µx).

Similarly, if E(X|Y = y) is a linear function of y, then the formula is of the form:

E(X|Y = y) = µx + ρ(σx/σy)(y − µy).
3.1.9 Example: Bivariate Hypergeometric Distribution<br />
Let n, M1, M2, and M3 be positive integers with n ≤ M1 + M2 + M3. The random<br />
pair (X,Y ) is said to have a bivariate hypergeometric distribution with parameters n and<br />
(M1,M2,M3) if its joint PDF has the following form:

f(x,y) = C(M1, x) C(M2, y) C(M3, n − x − y) / C(M1 + M2 + M3, n),

where C(a,b) denotes the binomial coefficient "a choose b," when x and y are nonnegative integers with

x ≤ min(n, M1), y ≤ min(n, M2) and max(0, n − M3) ≤ x + y ≤ n,

and is equal to zero otherwise. Note that the restrictions on x and y ensure that all combinations are defined and positive.
Example 3.10 (n = 4, M1 = 2, M2 = 3, M3 = 5) Assume that (X,Y ) has a bivariate<br />
hypergeometric distribution with parameters n = 4, M1 = 2, M2 = 3 and M3 = 5. The<br />
joint PDF is<br />
f(x,y) = C(2, x) C(3, y) C(5, 4 − x − y) / C(10, 4),   x = 0, 1, 2;  y = 0, 1, ..., min(3, 4 − x),
and 0 otherwise. The joint (X,Y ) distribution and the marginal X and Y distributions are<br />
displayed in the following table:<br />
y = 0 y = 1 y = 2 y = 3 total<br />
x = 0 0.0238 0.1429 0.1429 0.0238 0.3334<br />
x = 1 0.0952 0.2857 0.1429 0.0095 0.5333<br />
x = 2 0.0476 0.0714 0.0143 0.1333<br />
total 0.1666 0.5000 0.3001 0.0333 1.0000<br />
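A sketch of this joint PDF in exact arithmetic (math.comb and fractions.Fraction are standard-library), checking a few entries of the table above:

```python
from fractions import Fraction as F
from math import comb

def bhg_pdf(x, y, n, M1, M2, M3):
    """Bivariate hypergeometric joint PDF (zero outside the support)."""
    z = n - x - y
    if x < 0 or y < 0 or z < 0 or x > M1 or y > M2 or z > M3:
        return F(0)
    return F(comb(M1, x) * comb(M2, y) * comb(M3, z), comb(M1 + M2 + M3, n))

# Example 3.10: n = 4 and (M1, M2, M3) = (2, 3, 5)
f = {(x, y): bhg_pdf(x, y, 4, 2, 3, 5) for x in range(3) for y in range(4)}
print(float(f[(1, 1)]))                          # 60/210, the 0.2857 entry
print(float(sum(f[(1, y)] for y in range(4))))   # marginal fX(1) = 8/15
```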
Urn experiments. Bivariate hypergeometric distributions are used to model urn experiments.<br />
Specifically, suppose that an urn contains N objects, M1 of type 1, M2 of type 2,<br />
and M3 of type 3 (N = M1 + M2 + M3). Let X equal the number of objects of type 1 and<br />
Y equal the number of objects of type 2 in a subset of size n chosen from the urn. If each<br />
choice of subset is equally likely, then (X,Y ) has a bivariate hypergeometric distribution<br />
with parameters n and (M1,M2,M3). In the example above, 20% (2/10) of the objects in<br />
the urn are of type 1, 30% (3/10) are of type 2 and 50% (5/10) are of type 3.<br />
Negative association. If (X,Y ) has a bivariate hypergeometric distribution, then X<br />
and Y are negatively associated with correlation

ρ = Corr(X,Y) = −√( (M1/(M2 + M3)) (M2/(M1 + M3)) ).
Marginal and conditional distributions. If (X,Y ) has a bivariate hypergeometric<br />
distribution, then<br />
1. X has a hypergeometric distribution with parameters n, M1, and N; and<br />
2. Y has a hypergeometric distribution with parameters n, M2, and N.<br />
In addition, each conditional distribution is hypergeometric. Finally, bivariate hypergeometric<br />
distributions have linear conditional means.<br />
3.1.10 Example: Trinomial Distribution<br />
Let n be a positive integer, and p1, p2, and p3 be positive proportions with sum 1. The<br />
random pair (X,Y ) is said to have a trinomial distribution with parameters n and (p1,p2,p3)<br />
when its joint PDF has the following form:

f(x,y) = (n!/(x! y! (n − x − y)!)) p1^x p2^y p3^(n−x−y)

when x = 0, 1, ..., n; y = 0, 1, ..., n; x + y ≤ n, and is equal to zero otherwise.
Example 3.11 (n = 4, p1 = 0.2, p2 = 0.3, p3 = 0.5) Assume that (X,Y ) has a trinomial<br />
distribution with parameters n = 4, p1 = 0.2, p2 = 0.3 and p3 = 0.5. The joint PDF is

f(x,y) = (4!/(x! y! (4 − x − y)!)) (0.2)^x (0.3)^y (0.5)^(4−x−y),   x, y = 0, 1, 2, 3, 4; x + y ≤ 4,
and 0 otherwise. The joint (X,Y ) distribution and the marginal X and Y distributions are<br />
displayed in the following table:<br />
y = 0 y = 1 y = 2 y = 3 y = 4 total<br />
x = 0 0.0625 0.1500 0.1350 0.0540 0.0081 0.4096<br />
x = 1 0.1000 0.1800 0.1080 0.0216 0.4096<br />
x = 2 0.0600 0.0720 0.0216 0.1536<br />
x = 3 0.0160 0.0096 0.0256<br />
x = 4 0.0016 0.0016<br />
total 0.2401 0.4116 0.2646 0.0756 0.0081 1.0000<br />
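A sketch of the trinomial joint PDF (function names are our own), checking the f(1,1) entry and the marginal binomial row total P(X = 1) = 0.4096 from the table above:

```python
from math import comb

def trinomial_pdf(x, y, n, p1, p2):
    """Trinomial joint PDF; the third cell gets the remaining n - x - y trials."""
    z = n - x - y
    if x < 0 or y < 0 or z < 0:
        return 0.0
    # n!/(x! y! z!) written as a product of binomial coefficients
    return comb(n, x) * comb(n - x, y) * p1**x * p2**y * (1 - p1 - p2)**z

# Example 3.11: n = 4 and (p1, p2, p3) = (0.2, 0.3, 0.5)
f = {(x, y): trinomial_pdf(x, y, 4, 0.2, 0.3) for x in range(5) for y in range(5)}
marginal_X1 = sum(f[(1, y)] for y in range(5))
print(f[(1, 1)], marginal_X1)   # about 0.18 and 0.4096 = P(Binomial(4, 0.2) = 1)
```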
Experiments with three outcomes. Trinomial distributions are used to model experiments<br />
with exactly three outcomes. Specifically, suppose that an experiment has three<br />
outcomes which occur with probabilities p1, p2, and p3, respectively. Let X be the number<br />
of occurrences of outcome 1 and Y be the number of occurrences of outcome 2 in n independent<br />
trials of the experiment. Then (X,Y ) has a trinomial distribution with parameters<br />
n and (p1,p2,p3). In the example above, the probabilities of the three outcomes are 0.2, 0.3<br />
and 0.5, respectively.<br />
Negative association. If (X,Y ) has a trinomial distribution, then X and Y are negatively<br />
associated with correlation

ρ = Corr(X,Y) = −√( (p1/(1 − p1)) (p2/(1 − p2)) ).
Marginal and conditional distributions. If (X,Y ) has a trinomial distribution, then<br />
1. X has a binomial distribution with parameters n and p1, and<br />
2. Y has a binomial distribution with parameters n and p2.<br />
In addition, each conditional distribution is binomial. Finally, trinomial distributions have<br />
linear conditional means.<br />
3.1.11 Simple Random Samples<br />
The results of Section 2.2.4 can be generalized. In particular, trinomial probabilities can<br />
often be used to approximate bivariate hypergeometric probabilities when N is large enough,<br />
and each family of distributions can be used in survey analysis.<br />
Example 3.12 (Survey Analysis) For example, suppose that a surveyor is interested<br />
in determining the level of support for a proposal to change the local tax structure, and
decides to choose a simple random sample of size 10 from the registered voter list. If there<br />
are a total of 120 registered voters, where one-third support the proposal, one-half oppose<br />
the proposal, and one-sixth have no opinion, then the probability that exactly 3 support, 5<br />
oppose, and 2 have no opinion is<br />
P(X = 3, Y = 5) = C(40, 3) C(60, 5) C(20, 2) / C(120, 10) ≈ 0.088.

If there are thousands of registered voters, then the probability is

P(X = 3, Y = 5) ≈ (10!/(3! 5! 2!)) (1/3)^3 (1/2)^5 (1/6)^2 ≈ 0.081.
As before, you do not need to know the exact number of registered voters when you use the
trinomial approximation.<br />
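Both numbers are quick to reproduce. A sketch comparing the exact bivariate hypergeometric probability with its trinomial approximation:

```python
from math import comb

# Exact: simple random sample of 10 from 120 voters
# (40 support, 60 oppose, 20 no opinion)
exact = comb(40, 3) * comb(60, 5) * comb(20, 2) / comb(120, 10)

# Trinomial approximation: same proportions 1/3, 1/2, 1/6 in a large population
coef = comb(10, 3) * comb(7, 5)               # 10!/(3! 5! 2!) = 2520
approx = coef * (1/3)**3 * (1/2)**5 * (1/6)**2

print(round(exact, 3), round(approx, 3))      # 0.088 and 0.081
```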
3.2 Multivariate Distributions<br />
A multivariate distribution is the joint distribution of k random variables. Ideas studied in<br />
the bivariate case (k = 2) can be generalized to the case where k > 2.<br />
3.2.1 Joint PDF and Joint CDF<br />
Let X1, X2, ..., Xk be discrete random variables and let<br />
X = (X1,X2,...,Xk).<br />
The joint frequency function (joint FF) or joint probability density function (joint PDF) of<br />
X is defined as follows:<br />
f(x) = P(X1 = x1,X2 = x2,... ,Xk = xk)<br />
for all real k-tuples x = (x1,x2,... ,xk), where commas are understood to mean the intersection
of events.<br />
The joint cumulative distribution function (joint CDF) of X is defined as follows:<br />
F(x) = P(X1 ≤ x1,X2 ≤ x2,...,Xk ≤ xk)<br />
for all real k-tuples x = (x1,x2,... ,xk), where commas are understood to mean the intersection
of events.<br />
Example 3.13 (k = 3) Assume that (X,Y,Z) has joint PDF<br />
f(x,y,z) = 5 / (28(x + y + z))   when x, y = 0, 1 and z = 1, 2, 3, 4, and 0 otherwise.
The following table shows the joint (X,Y,Z) distribution. The marginal (X,Y ) distribution<br />
is displayed in the right column and the marginal Z distribution is displayed in the bottom<br />
row. A common denominator is used throughout.<br />
z = 1 z = 2 z = 3 z = 4<br />
x = 0 y = 0 60/336 30/336 20/336 15/336 125/336<br />
y = 1 30/336 20/336 15/336 12/336 77/336<br />
x = 1 y = 0 30/336 20/336 15/336 12/336 77/336<br />
y = 1 20/336 15/336 12/336 10/336 57/336<br />
140/336 85/336 62/336 49/336 1<br />
X and Y have the same marginal distribution. Namely,

fX(0) = fY(0) = 101/168 and fX(1) = fY(1) = 67/168

(and 0 otherwise).
The probability that X + Y + Z is at most 3, <strong>for</strong> example, is<br />
P(X + Y + Z ≤ 3) = Σ_{x=0}^{1} Σ_{y=0}^{1} Σ_{z=1}^{3−x−y} f(x,y,z) = 115/168 ≈ 0.685.
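These fractions can be confirmed in exact arithmetic. A sketch that tabulates the joint PDF of Example 3.13 and sums it:

```python
from fractions import Fraction as F

# Example 3.13: f(x,y,z) = 5/(28(x + y + z)) on the stated range
f = {(x, y, z): F(5, 28 * (x + y + z))
     for x in (0, 1) for y in (0, 1) for z in (1, 2, 3, 4)}

total = sum(f.values())                               # must equal 1
fX0 = sum(p for (x, y, z), p in f.items() if x == 0)  # marginal fX(0)
prob = sum(p for (x, y, z), p in f.items() if x + y + z <= 3)
print(total, fX0, prob)    # 1, 101/168, 115/168
```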
3.2.2 Mutually Independent Random Variables and Random Samples<br />
The discrete random variables X1, X2, ..., Xk are said to be mutually independent (or<br />
independent) if<br />
f(x) = f1(x1)f2(x2) · · · fk(xk) <strong>for</strong> all real k-tuples x = (x1,x2,...,xk),<br />
where fi(xi) = P(Xi = xi) for i = 1, 2, ..., k. (The probability of the intersection is equal
to the product of the probabilities <strong>for</strong> all events of interest.)<br />
Equivalently, X1, X2, ..., Xk are said to be mutually independent if<br />
F(x) = F1(x1)F2(x2) · · · Fk(xk) <strong>for</strong> all real k-tuples x = (x1,x2,... ,xk),<br />
where Fi(xi) = P(Xi ≤ xi) for i = 1, 2, ..., k.
X1, X2, ..., Xk are said to be dependent if they are not mutually independent.<br />
Example 3.14 (k = 3, continued) The random variables X, Y and Z in the previous<br />
example are dependent. To see this, we just need to show that f(x,y,z) ≠ fX(x)fY(y)fZ(z) for some triple (x,y,z). For example, f(1,1,1) = 20/336 ≈ 0.060, while fX(1)fY(1)fZ(1) = (67/168)(67/168)(140/336) ≈ 0.066.
Random sample. If the discrete random variables X1, X2, ..., Xk are mutually independent<br />
and have a common distribution (each marginal PDF is the same), then the random<br />
variables X1, X2, ..., Xk are said to be a random sample from that distribution.<br />
3.2.3 Mathematical Expectation <strong>for</strong> Finite Multivariate Distributions<br />
Let X1, X2, ..., Xk be discrete random variables with joint PDF f(x), and let g(X) be a<br />
real-valued k-variable function. The mean (or expected value or expectation) of g(X) is<br />
E(g(X)) = Σ_{x1} Σ_{x2} · · · Σ_{xk} g(x1, x2, ..., xk) f(x1, x2, ..., xk),

where the multiple sum includes all k-tuples with nonzero joint PDF.
Example 3.15 (Maximum Roll) You roll a fair die three times. Let Xi be the number<br />
on the top face after the i th roll and g(X1,X2,X3) be the maximum of the three numbers.<br />
X1, X2, X3 is a random sample from a uni<strong>for</strong>m distribution with parameter 6. The joint<br />
PDF of the random triple is<br />
f(x1,x2,x3) = 1/216 when x1, x2, x3 = 1, 2, 3, 4, 5, 6 and 0 otherwise,

and the expected value of the maximum of the three rolls reduces to

E(g(X1,X2,X3)) = 1(1/216) + 2(7/216) + 3(19/216) + 4(37/216) + 5(61/216) + 6(91/216) = 119/24 ≈ 4.96.
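The expectation can be checked by brute force, since the sample space has only 216 equally likely outcomes. A sketch in exact arithmetic:

```python
from fractions import Fraction as F
from itertools import product

# Enumerate all 6^3 = 216 equally likely outcomes of three fair-die rolls
expected_max = sum(F(max(rolls), 216) for rolls in product(range(1, 7), repeat=3))
print(expected_max, float(expected_max))   # 119/24, about 4.96
```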
Properties of expectation. Properties in the multivariate case directly generalize properties<br />
in the univariate and bivariate cases (Sections 2.3.2 and 3.1.6).<br />
3.2.4 Example: Multivariate Hypergeometric Distribution<br />
Suppose that an urn contains N objects,<br />
M1 of type 1, M2 of type 2, ..., and Mk of type k (M1 + M2 + · · · + Mk = N).<br />
Let Xi be the number of objects of type i in a subset of size n chosen from the urn. If each<br />
choice of subset is equally likely, then the random k-tuple (X1,X2,...,Xk) is said to have<br />
a multivariate hypergeometric distribution with parameters n and (M1,M2,... ,Mk).<br />
The joint PDF for the k-tuple is

f(x1, x2, ..., xk) = C(M1, x1) C(M2, x2) · · · C(Mk, xk) / C(N, n)

when each xi is a nonnegative integer satisfying xi ≤ min(n, Mi) and Σ_i xi = n (and zero otherwise).
Note that<br />
1. If X is a hypergeometric random variable with parameters n, M, N, then<br />
(X,n − X)<br />
has a multivariate hypergeometric distribution with parameters n and (M,N − M).<br />
2. If (X,Y ) is bivariate hypergeometric with parameters n and (M1,M2,M3), then<br />
(X,Y,n − X − Y )<br />
is multivariate hypergeometric with parameters n and (M1,M2,M3).<br />
3. If (X1,X2,... ,Xk) has a multivariate hypergeometric distribution, then<br />
(a) For each i, Xi is a hypergeometric random variable with parameters n, Mi, N.<br />
(b) For each i ≠ j, (Xi,Xj) is bivariate hypergeometric with parameters n and (Mi, Mj, N − Mi − Mj). In particular, Xi and Xj are negatively associated.
Example 3.16 (Physical Activity) Suppose there are 600 women and 400 men living in<br />
an adult community. Among the women, 180 are physically active and 420 are not. Among<br />
the men, 120 are physically active and 280 are not.<br />
Let X1 be the number of women who are physically active, X2 be the number of women<br />
who are not physically active, X3 be the number of men who are physically active and X4<br />
be the number of men who are not physically active in a simple random sample of size 15<br />
chosen from those living in the community. Then, for example,

P(X1 = 3, X2 = 7, X3 = 1, X4 = 4) = C(180, 3) C(420, 7) C(120, 1) C(280, 4) / C(1000, 15) ≈ 0.0182.
3.2.5 Example: Multinomial Distribution<br />
Suppose that an experiment has exactly k outcomes, and let<br />
pi be the probability of the i th outcome, i = 1,2,... ,k (p1 + p2 + · · · + pk = 1).<br />
Let Xi be the number of occurrences of the i th outcome in n independent trials of the experiment.<br />
Then the random k-tuple (X1,X2,...,Xk) is said to have a multinomial distribution<br />
with parameters n and (p1,p2,...,pk).<br />
The joint PDF for the k-tuple is

f(x1, x2, ..., xk) = (n!/(x1! x2! · · · xk!)) p1^x1 p2^x2 · · · pk^xk

when x1, x2, ..., xk = 0, 1, ..., n and Σ_i xi = n (and zero otherwise).
Note that<br />
1. If X is a binomial random variable with parameters n and p, then<br />
(X,n − X)<br />
has a multinomial distribution with parameters n and (p,1 − p).<br />
2. If (X,Y ) has a trinomial distribution with parameters n and (p1,p2,p3), then<br />
(X,Y,n − X − Y )<br />
has a multinomial distribution with parameters n and (p1,p2,p3).<br />
3. If (X1,X2,... ,Xk) has a multinomial distribution, then<br />
(a) For each i, Xi is a binomial random variable with parameters n and pi.<br />
(b) For each i ≠ j, (Xi,Xj) has a trinomial distribution with parameters n and (pi, pj, 1 − pi − pj). In particular, Xi and Xj are negatively associated.
Example 3.17 (Independence Model) Assume that A and B are independent events<br />
of interest to researchers. Let outcomes 1, 2, 3, 4 correspond to<br />
“both A and B occur,” “only A occurs,” “only B occurs,” “neither A nor B occur,”<br />
respectively. If P(A) = α and P(B) = β, then<br />
(p1,p2,p3,p4) = (αβ,α(1 − β),(1 − α)β,(1 − α)(1 − β))<br />
can be used as the list of probabilities in a multinomial model.<br />
If P(A) = 0.6, P(B) = 0.3, and Xi is the number of occurrences of the ith outcome in 15<br />
independent trials (i = 1, 2, 3, 4), then, for example,

P(X1 = 3, X2 = 7, X3 = 1, X4 = 4) = (15!/(3! 7! 1! 4!)) (0.18)^3 (0.42)^7 (0.12)^1 (0.28)^4 ≈ 0.0179.
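A sketch that reproduces both computations, the exact multivariate hypergeometric probability of Example 3.16 and the multinomial probability of Example 3.17, and shows how close they are:

```python
from math import comb, factorial

# Example 3.16: exact multivariate hypergeometric probability
exact = (comb(180, 3) * comb(420, 7) * comb(120, 1) * comb(280, 4)
         / comb(1000, 15))

# Example 3.17: the matching multinomial probability
coef = factorial(15) // (factorial(3) * factorial(7) * factorial(1) * factorial(4))
approx = coef * 0.18**3 * 0.42**7 * 0.12 * 0.28**4

print(round(exact, 4), round(approx, 4))   # 0.0182 and 0.0179
```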
3.2.6 Simple Random Samples<br />
The results of Sections 2.2.4 and 3.1.11 can be directly generalized to the multivariate<br />
case. In particular, multinomial probabilities can be used to approximate multivariate<br />
hypergeometric probabilities when N is large enough, and each family of distributions can<br />
be used in survey analysis.<br />
3.2.7 Sample Summaries<br />
If X1, X2, ..., Xn is a random sample from a distribution with mean µ and standard deviation σ, then the sample mean, X̄, is the random variable

X̄ = (1/n)(X1 + X2 + · · · + Xn),

the sample variance, S², is the random variable

S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²,

and the sample standard deviation, S, is the positive square root of the sample variance.
The following theorem can be proven using properties of expectation:<br />
Theorem 3.18 (Sample Summaries) If X̄ is the sample mean and S² is the sample variance of a random sample of size n from a distribution with mean µ and standard deviation σ, then

1. E(X̄) = µ and Var(X̄) = σ²/n.

2. E(S²) = σ².
Demonstration 1. To demonstrate that E(X̄) = µ:

E(X̄) = E((1/n)(X1 + X2 + · · · + Xn))
      = (1/n)(E(X1) + E(X2) + · · · + E(Xn))
      = (1/n)(nµ) = µ.
Demonstration 2. To demonstrate that Var(X̄) = Var(X)/n = σ²/n, first note that

Var(X̄) = E((X̄ − µ)²) = E(((1/n) Σ_{i=1}^{n} Xi − µ)²) = E(((1/n) Σ_{i=1}^{n} (Xi − µ))²).

Now,

Var(X̄) = (1/n²) E((Σ_{i=1}^{n} (Xi − µ))²)
        = (1/n²) (Σ_i E((Xi − µ)²) + Σ_{i≠j} E((Xi − µ)(Xj − µ))).

Since the Xi's are independent, the last sum is zero. Finally,

Var(X̄) = (1/n²)(nσ² + 0) = σ²/n.
Demonstration 3. To demonstrate that E(S²) = Var(X) = σ², first note that

(n − 1)S² = Σ_{i=1}^{n} (Xi − X̄)²
          = Σ_{i=1}^{n} (Xi² − 2X̄Xi + X̄²)
          = Σ_{i=1}^{n} Xi² − 2X̄(nX̄) + nX̄²
          = Σ_{i=1}^{n} Xi² − nX̄²,

since Σ_{i=1}^{n} Xi = nX̄. Observe that

E(Xi²) = Var(X) + (E(X))² = σ² + µ²   and   E(X̄²) = Var(X̄) + (E(X̄))² = σ²/n + µ².

Finally,

E(S²) = (1/(n − 1)) (Σ_{i=1}^{n} E(Xi²) − nE(X̄²))
      = (1/(n − 1)) (n(σ² + µ²) − n(σ²/n + µ²))
      = (1/(n − 1)) (nσ² − σ²)
      = (1/(n − 1)) (n − 1)σ² = σ².
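The identity E(S²) = σ² can also be verified exactly in a small case: enumerate every equally likely sample of size n = 3 from one roll of a fair die (for which σ² = 35/12) and average S² over all 216 samples. A sketch in exact arithmetic:

```python
from fractions import Fraction as F
from itertools import product

def sample_variance(xs):
    """S^2 with the 1/(n - 1) divisor, in exact arithmetic."""
    n = len(xs)
    mean = F(sum(xs), n)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

samples = list(product(range(1, 7), repeat=3))   # all 216 samples of size 3
expected_S2 = sum(sample_variance(s) for s in samples) / len(samples)
print(expected_S2)   # 35/12, the variance of a single fair-die roll
```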
Sample correlation. A random sample of size n from the joint (X,Y ) distribution is a<br />
list of n mutually independent random pairs, each with the same distribution as (X,Y ).<br />
If (X1,Y1),(X2,Y2),... ,(Xn,Yn) is a random sample of size n from a bivariate distribution<br />
with correlation ρ = Corr(X,Y ), then the sample correlation, R, is the random variable<br />
R = Σ_{i=1}^n (Xi − X)(Yi − Y) / √(Σ_{i=1}^n (Xi − X)^2 · Σ_{i=1}^n (Yi − Y)^2),
where X and Y are the sample means of the X and Y samples, respectively.
Applications. In statistical applications, observed values of the sample mean, sample<br />
variance, and sample correlation are used to estimate unknown values of the parameters µ,<br />
σ 2 and ρ, respectively.<br />
Example 3.19 (n = 600) Assume that the following table summarizes the values of a<br />
random sample of size 600 from the joint (X,Y ) distribution:<br />
        y=1   y=2   y=3   y=4   y=5   y=6   y=7   y=8   y=9   y=10   Total
x = 0          2     1     5    12    15    15     7     2     1       60
x = 1          2     9    25    36    42    31    13                  158
x = 2     3    4    15    39    55    41    21     5                  183
x = 3     2    7    20    27    39    14     1                        110
x = 4     1   13    11    16    18     4                               63
x = 5     1    6     5     6     2                                     20
x = 6     1    2     3                                                  6
Total     8   36    64   118   162   116    68    25     2     1      600
Two pairs of the form (0,2) were observed, one pair of the form (0,3) was observed, five pairs of the form (0,4) were observed, and so forth.
The sample mean of the x-coordinates is
x = (1/600) Σ_{i=1}^{600} xi = (1/600)(0(60) + 1(158) + · · · + 6(6)) ≈ 2.07
and the sample variance is
s_x^2 = (1/599) Σ_{i=1}^{600} (xi − x)^2 = (1/599)((0 − x)^2(60) + (1 − x)^2(158) + · · · + (6 − x)^2(6)) ≈ 1.72464.
Similarly,
y ≈ 4.92333 and s_y^2 ≈ 2.49161.
The observed sample correlation is
r = −634.78 / √((1033.06)(1492.47)) ≈ −0.511219.
The results suggest that X and Y are negatively associated.<br />
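The summaries in Example 3.19 can be reproduced directly from the table of counts. In the sketch below (variable names are ours) each row of counts is entered with its starting y-value, as inferred from the marginal totals:

```python
# freq[(x, y)] = number of observed (x, y) pairs; omitted cells are zero.
rows = {
    0: (2, [2, 1, 5, 12, 15, 15, 7, 2, 1]),  # x = 0: counts for y = 2, ..., 10
    1: (2, [2, 9, 25, 36, 42, 31, 13]),
    2: (1, [3, 4, 15, 39, 55, 41, 21, 5]),
    3: (1, [2, 7, 20, 27, 39, 14, 1]),
    4: (1, [1, 13, 11, 16, 18, 4]),
    5: (1, [1, 6, 5, 6, 2]),
    6: (1, [1, 2, 3]),
}
freq = {(x, y0 + j): c for x, (y0, counts) in rows.items()
        for j, c in enumerate(counts)}

n = sum(freq.values())                                # 600
xbar = sum(x * c for (x, y), c in freq.items()) / n   # about 2.07
ybar = sum(y * c for (x, y), c in freq.items()) / n   # about 4.92333
sxx = sum(c * (x - xbar) ** 2 for (x, y), c in freq.items())          # 1033.06
syy = sum(c * (y - ybar) ** 2 for (x, y), c in freq.items())          # 1492.47
sxy = sum(c * (x - xbar) * (y - ybar) for (x, y), c in freq.items())  # -634.78
r = sxy / (sxx * syy) ** 0.5                          # about -0.5112
```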
4 Introductory Calculus Concepts<br />
In introductory calculus, we study real-valued functions whose domain (set of input values)<br />
is a subset of the real numbers. We often use the shorthand<br />
f : D ⊆ R −→ R,<br />
where D is the domain and R is the set of real numbers, to denote such a function.<br />
This chapter covers limits, derivatives and integrals and applications in statistics.<br />
4.1 Functions<br />
This section reviews functions commonly used in calculus.<br />
Power functions. A power function is a function with rule<br />
f(x) = c x^p, where the coefficient c and power p are constants.
(The values of f are proportional to a power of the input x.) The domain of f depends<br />
on the power p. For example, if p = 2, the domain of f is the set of all real numbers; if<br />
p = 1/2, the domain of f is the set of nonnegative real numbers.<br />
Polynomial and rational functions. A polynomial function is a function with rule<br />
f(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0
where the coefficients a_i are constants, n is a positive integer and x is a real number. If a_n ≠ 0, then n is the degree of the polynomial.
Special cases of polynomials include: If the degree of f is 0, then f is a constant function; if<br />
the degree is 1, then f is a linear function; if the degree is 2, then f is a quadratic function;<br />
if the degree is 3, then f is a cubic function.<br />
A rational function is the ratio of two polynomials:
f(x) = p(x)/q(x), where p(x) and q(x) are polynomials.
The domain of f is the set of real numbers satisfying q(x) ≠ 0.
Exponential functions. An exponential function is a function with rule<br />
f(x) = a^x, where a is a positive constant and x is a real number.
In this formula, a is the base and x is the exponent. The exponential function is strictly
increasing when a > 1 and strictly decreasing when 0 < a < 1.<br />
The exponential function with base e,<br />
f(x) = e^x, where e ≈ 2.71828 is Euler’s constant,
is used most often in calculus. (e is the natural base <strong>for</strong> calculus.)<br />
Exponential functions satisfy the following laws of exponents:<br />
(1) a^{x+y} = a^x a^y, (2) a^{−x} = 1/a^x, (3) (a^x)^y = a^{xy},
where a > 0 is a positive number, and x and y are any real numbers.
Logarithmic functions. A logarithmic function is a function with rule<br />
f(x) = log_a(x), where a is a positive constant and x is a positive real number.
As above, a is the base of the function. When a = e,
f(x) = log_e(x) = ln(x) is called the natural logarithm.
Exponential and logarithmic functions with the same base are related as follows:<br />
y = log_a(x) ⇐⇒ a^y = x
where a and x are positive numbers and y is any real number. Thus,
log_a(a^x) = x for all x ∈ R and a^{log_a(x)} = x for all x > 0.
Logarithmic functions satisfy the following laws of logarithms:<br />
(1) log_a(xy) = log_a(x) + log_a(y), (2) log_a(x/y) = log_a(x) − log_a(y), and
(3) log_a(x^r) = r log_a(x),
where a, x and y are positive numbers and r is any real number.
Using the natural base. Since a = e^{ln(a)} for any a > 0, we can write
a^x = (e^{ln(a)})^x = e^{ln(a)x}, where x is a real number.
Similarly, suppose that y = log_a(x) (equivalently, x = a^y). Then,
ln(x) = ln(a^y) = y ln(a) = log_a(x) ln(a) =⇒ log_a(x) = ln(x)/ln(a).
Thus, exponential and logarithmic functions in base a can always be written in terms of<br />
the corresponding functions in the natural base.<br />
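These change-of-base identities are easy to spot-check numerically; the values a = 2 and x = 10 below are arbitrary choices.

```python
import math

# a**x equals e**(ln(a) * x), and log_a(x) equals ln(x)/ln(a).
a, x = 2.0, 10.0
assert math.isclose(a ** x, math.exp(math.log(a) * x))
assert math.isclose(math.log(x, a), math.log(x) / math.log(a))
assert math.isclose(math.log(1024, 2), 10.0)  # log_2(1024) = 10
```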
Figure 4.1: A positive angle t (left plot) and a negative angle t (right plot). In each case,<br />
the point P(x,y) lies on the unit circle centered at the origin, x = cos(t) and y = sin(t).<br />
Trigonometric functions. The trigonometric functions are defined using a unit circle<br />
centered at the origin and the radian measure of an angle, as illustrated in Figure 4.1.<br />
In the left part of the figure, t is the length of the arc of the unit circle measured counterclockwise<br />
from (1,0) to P(x,y); in the right part of the figure, t is the negative of the<br />
length of the arc measured clockwise from (1,0) to P(x,y).<br />
The sine and cosine functions have the following rules<br />
sin(t) = y and cos(t) = x<br />
where t is any real number, and x and y are described above. Since the circumference of a circle of radius one is 2π, we know that
sin(t + k(2π)) = sin(t) and cos(t + k(2π)) = cos(t)
for every integer k and real number t.
The tangent, cotangent, secant and cosecant functions are defined in terms of sine and cosine.<br />
Specifically,<br />
tan(t) = sin(t)/cos(t), cot(t) = cos(t)/sin(t), sec(t) = 1/cos(t) and csc(t) = 1/sin(t).
In each case, the domain of the function is all t for which the denominator is not 0.
Inverse trigonometric functions. Each trigonometric function has an inverse function:<br />
1. Inverse sine function: For −1 ≤ z ≤ 1,
sin^{−1}(z) = t ⇐⇒ sin(t) = z and −π/2 ≤ t ≤ π/2.
2. Inverse cosine function: For −1 ≤ z ≤ 1,
cos^{−1}(z) = t ⇐⇒ cos(t) = z and 0 ≤ t ≤ π.
3. Inverse tangent function: For any real number z,
tan^{−1}(z) = t ⇐⇒ tan(t) = z and −π/2 < t < π/2.
4. Inverse cotangent function: For any real number z,
cot^{−1}(z) = t ⇐⇒ cot(t) = z and 0 < t < π.
5. Inverse secant function: For |z| ≥ 1,
sec^{−1}(z) = t ⇐⇒ sec(t) = z and 0 ≤ t < π/2 or π ≤ t < 3π/2.
6. Inverse cosecant function: For |z| ≥ 1,
csc^{−1}(z) = t ⇐⇒ csc(t) = z and 0 < t ≤ π/2 or π < t ≤ 3π/2.
Radian and degree measures. Angles can be measured in degrees or in radians. Radian<br />
measure is preferred in calculus. The following table gives some common conversions:<br />
Degrees   0°   30°   45°   60°   90°   120°   135°   150°   180°   270°   360°
Radians   0    π/6   π/4   π/3   π/2   2π/3   3π/4   5π/6   π      3π/2   2π
Note that 2π radians (or 360°) corresponds to a complete revolution of the unit circle.
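Since radians = degrees · (π/180), the conversions in the table are easy to verify; the helper name below is ours, and the standard library offers math.radians for the same job.

```python
import math

def to_radians(degrees):
    """Convert a degree measure to radians: multiply by pi/180."""
    return degrees * math.pi / 180.0

conversions = [(30, math.pi / 6), (45, math.pi / 4), (60, math.pi / 3),
               (90, math.pi / 2), (135, 3 * math.pi / 4), (180, math.pi),
               (270, 3 * math.pi / 2), (360, 2 * math.pi)]
for deg, rad in conversions:
    assert math.isclose(to_radians(deg), rad)
    assert math.isclose(math.radians(deg), rad)  # library equivalent
```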
Some values of sine and cosine. The following table gives some often used values of<br />
the sine and cosine functions:<br />
t        0      π/6     π/4     π/3     π/2     2π/3     3π/4     5π/6     π
sin(t)   0      1/2     √2/2    √3/2    1       √3/2     √2/2     1/2      0
cos(t)   1      √3/2    √2/2    1/2     0       −1/2     −√2/2    −√3/2    −1

4.2 Limits and Continuity
This section introduces the concepts of limit and continuity, which are central to calculus.<br />
4.2.1 Neighborhoods, Deleted Neighborhoods and Limits<br />
A neighborhood of the real number a is the set of real numbers x satisfying
|x − a| < r for some positive number r.
A deleted neighborhood of a is the neighborhood with a removed; that is, the set of real numbers x satisfying 0 < |x − a| < r for some positive r.
Figure 4.2: Plots of y = (x − 1)/(x 2 − 1) (left) and y = |x − 1|/(x 2 − 1) (right).<br />
Suppose that f(x) is defined for all x in a deleted neighborhood of a. We write
lim_{x→a} f(x) = L
and say “the limit of f(x), as x approaches a, equals L” if for every positive number ε there exists a positive number δ such that
if x satisfies 0 < |x − a| < δ, then f(x) satisfies |f(x) − L| < ε.
(The deleted δ-neighborhood of a is mapped into the ε-neighborhood of L.)
Example 4.1 (Figure 4.2) Let f(x) = (x−1)/(x 2 −1) and a = 1 (left part of Figure 4.2).<br />
Although f(1) is undefined, the limit as x approaches 1 exists. Specifically,<br />
lim_{x→1} f(x) = lim_{x→1} (x − 1)/(x^2 − 1) = lim_{x→1} (x − 1)/((x − 1)(x + 1)) = lim_{x→1} 1/(x + 1) = 1/2.
By contrast, if f(x) = |x − 1|/(x 2 − 1) and a = 1 (right part of Figure 4.2), then a limit<br />
does not exist, since function values approach either −1/2 or +1/2 depending on whether<br />
x < 1 or x > 1.<br />
Example 4.2 (Figure 4.3) Let f(x) = sin(π/x) and a = 0 (left part of Figure 4.3). f(x)<br />
is undefined at 0. Further, as x approaches 0, values of f(x) fluctuate between −1 and +1,<br />
and no limit exists.<br />
By contrast, if f(x) = xsin(π/x) and a = 0 (right part of Figure 4.3), then a limit does<br />
exist. Specifically, function values approach 0 as x approaches 0.<br />
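The limit claims in Examples 4.1 and 4.2 can be illustrated (though of course not proven) by evaluating each function close to the limit point; the evaluation points below are arbitrary.

```python
import math

f = lambda x: (x - 1) / (x ** 2 - 1)     # Example 4.1: limit 1/2 at x = 1
g = lambda x: abs(x - 1) / (x ** 2 - 1)  # one-sided values near -1/2 and +1/2
h = lambda x: x * math.sin(math.pi / x)  # Example 4.2: limit 0 at x = 0

for eps in (1e-3, 1e-6):
    assert abs(f(1 + eps) - 0.5) < 1e-2  # f(x) near 1/2 on both sides of 1
    assert abs(f(1 - eps) - 0.5) < 1e-2
    assert abs(h(eps)) <= eps            # |h(x)| <= |x|, so h -> 0
assert g(1 + 1e-6) > 0.49 and g(1 - 1e-6) < -0.49  # two-sided limit fails
```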
One-sided limits. We write
lim_{x→a+} f(x) = L
and say “the limit of f(x), as x approaches a from above, equals L” if for every positive number ε there exists a positive number δ such that
if x satisfies 0 < x − a < δ, then f(x) satisfies |f(x) − L| < ε.
Figure 4.3: Plots of y = sin(π/x) (left) and y = xsin(π/x) (right).<br />
Similarly, we write
lim_{x→a−} f(x) = L
and say “the limit of f(x), as x approaches a from below, equals L” if for every positive number ε there exists a positive number δ such that
if x satisfies 0 < a − x < δ, then f(x) satisfies |f(x) − L| < ε.
In the first case, we approach a from the right only; in the second case, from the left only.
For example, if f(x) = |x − 1|/(x^2 − 1) and a = 1 (right part of Figure 4.2), then
lim_{x→1−} f(x) = −1/2 and lim_{x→1+} f(x) = 1/2.

Infinite limits. We write
lim_{x→∞} f(x) = L
and say “the limit of f(x), as x approaches ∞, equals L” if for every positive number ε there exists a number M such that
if x satisfies x > M, then f(x) satisfies |f(x) − L| < ε.
Similarly, we write
lim_{x→−∞} f(x) = L
and say “the limit of f(x), as x approaches −∞, equals L” if for every positive number ε there exists a number M such that
if x satisfies x < M, then f(x) satisfies |f(x) − L| < ε.
Example 4.3 (Exponential Function) Let f(x) = e x . Then<br />
lim_{x→−∞} f(x) = 0,
but no finite limit exists as x → ∞. We often write limx→∞ f(x) = ∞ when function values<br />
grow large without bound.<br />
Application in statistics. Let F(x) = P(X ≤ x) be the cumulative distribution function<br />
(CDF) of a discrete random variable (Section 2.1.1) and let a be an element of the range of<br />
the random variable. Then both one-sided limits exist:
lim_{x→a+} F(x) = F(a) and lim_{x→a−} F(x) = F(a) − P(X = a).
Further,
lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

Properties of limits. Suppose that
lim_{x→a} f(x) = L and lim_{x→a} g(x) = M,
and let c be a constant. Then
1. Constant rule: limx→a c = c.
2. Rule for x: limx→a x = a.
3. Constant multiple rule: limx→a(cf(x)) = cL.<br />
4. Sum rule: limx→a(f(x) + g(x)) = L + M.<br />
5. Difference rule: limx→a(f(x) − g(x)) = L − M.<br />
6. Product rule: limx→a(f(x)g(x)) = LM.<br />
7. Quotient rule: limx→a(f(x)/g(x)) = L/M when M ≠ 0.
8. Power rule: limx→a(f(x))^n = L^n when n is a positive integer.
9. Power rule for x: limx→a x^n = a^n when n is a positive integer.
10. Rule for roots: limx→a (f(x))^{1/n} = L^{1/n} when n is an odd positive integer, or when n is an even positive integer and L is greater than 0.
11. Rule for ordering: If f(x) ≤ g(x) for all x near a (except possibly at a), then L ≤ M.
Squeezing principle. If ℓ(x) ≤ f(x) ≤ u(x) for all x near a (except possibly at a) and
lim ℓ(x) = lim u(x) = L,<br />
x→a x→a<br />
then f(x) approaches L as well: limx→a f(x) = L.<br />
For example, let f(x) = xsin(π/x) and a = 0 (right part of Figure 4.3). Since<br />
−|x| ≤ xsin(π/x) ≤ |x|<br />
and the lower and upper functions each approach 0 as x approaches 0, we know that f(x)<br />
approaches 0 as well.<br />
Big O and little o notation. If
|f(x)| ≤ K g(x) for some positive K and all x near a (except possibly at a),
we say that f(x) = O(g(x)) for x near a. If
lim_{x→a} f(x)/g(x) = 0,
we say that f(x) = o(g(x)) for x near a.

4.2.2 Continuous Functions
The function f(x) is said to be continuous at a if<br />
lim_{x→a} f(x) = f(a).
The function is said to be discontinuous at a otherwise. Note that f(x) is discontinuous at<br />
a when f(a) is undefined, or when the limit does not exist, or when the limit and function<br />
values exist but are not equal.<br />
Polynomial functions, rational functions (ratios of polynomials), exponential functions, logarithmic<br />
functions, and trigonometric functions are continuous throughout their domains.<br />
Step functions are discontinuous at each “step”.<br />
Note that the properties of limits given above imply that sums, differences, products, quotients,<br />
powers, roots, and constant multiples of continuous functions are continuous whenever<br />
the appropriate operation is defined.<br />
Example 4.4 (Split Definition Functions) Let<br />
f(x) = { −4x + 10,      x < 2
       { −0.25x + 2.5,  x ≥ 2
and
g(x) = { 0,      x < 0
       { 0.25,   0 ≤ x < 1
       { 0.75,   1 ≤ x < 2
       { 1,      x ≥ 2.
The function f(x) (left part of Figure 4.4) is continuous everywhere. In particular,<br />
lim_{x→2−} f(x) = lim_{x→2−} (−4x + 10) = 2 and lim_{x→2+} f(x) = lim_{x→2+} (−0.25x + 2.5) = 2.
The step function g(x) (right part of Figure 4.4) is discontinuous at 0, 1 and 2. Note that<br />
g(x) is the cumulative distribution function of a binomial random variable with parameters<br />
n = 2 and p = 1/2.<br />
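The identification of g with a binomial CDF can be checked from the probability mass function; the helper below is ours.

```python
from math import comb

def binom_cdf(x, n=2, p=0.5):
    """P(X <= x) for X ~ Binomial(n, p); a step function of x."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if k <= x)

# The steps match g(x): 0 below 0, then 0.25, 0.75, and 1.
assert binom_cdf(-0.5) == 0.0
assert binom_cdf(0.5) == 0.25
assert binom_cdf(1.5) == 0.75
assert binom_cdf(2.5) == 1.0
```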
Compositions. If g(x) is continuous at a and f(x) is continuous at g(a), then the composite<br />
function f(g(x)) is continuous at a. In other words,<br />
<br />
lim_{x→a} f(g(x)) = f(lim_{x→a} g(x)) = f(g(a)).
For example, limx→5 ln(6 − x) = ln(1) = 0, where ln() is the natural logarithm function.<br />
Figure 4.4: Plots of split definition functions.

4.3 Differentiation
Suppose that f(x) is defined for all x near a. The average rate of change of the function f
over the interval [a,x] if x > a (or [x,a] if x < a) is the ratio<br />
(f(x) − f(a))/(x − a).
The average rate of change corresponds to the slope of the secant line through the points<br />
(a,f(a)) and (x,f(x)).<br />
Interest focuses on whether we can define the instantaneous rate of change at a or, equivalently,<br />
the slope of the curve y = f(x) at the point (a,f(a)).<br />
4.3.1 Derivative and Tangent Line<br />
The derivative of the function f at the number a, denoted by f ′ (a) (“f prime of a”), is<br />
f′(a) = lim_{x→a} (f(x) − f(a))/(x − a),
if the limit exists.<br />
The function f(x) is said to be differentiable at a if the limit above exists. Otherwise, f(x)<br />
is not differentiable at a.<br />
If f(x) represents your position at time x, then (f(x) − f(a))/(x − a) is your average velocity over<br />
the interval [a, x] (or [x, a]) and f ′ (a) is your instantaneous velocity at time a.<br />
Tangent line. If f(x) is differentiable at a, then the line with equation<br />
y = f(a) + f ′ (a)(x − a)<br />
is tangent to the curve y = f(x) at the point (a,f(a)), and f ′ (a) is said to be the slope of<br />
the curve at this point.<br />
Figure 4.5: Plots of y = x/(1 + 2x) (left part) and y = 7 − 3|x − 2| (right part).<br />
Example 4.5 (Derivative Exists) Consider the function f(x) = x/(1+2x) and let a = 1<br />
(left part of Figure 4.5). The derivative is computed as follows:<br />
f′(1) = lim_{x→1} (x/(1 + 2x) − 1/3)/(x − 1) = lim_{x→1} 1/(3 + 6x) = 1/9.
The equation of the tangent line is y = (1/3) + (1/9)(x − 1).
Example 4.6 (Derivative Does Not Exist) Consider the function f(x) = 7 − 3|x − 2|<br />
and let a = 2 (right part of Figure 4.5). Then<br />
1. If x > 2, then f(x) = 7 − 3(x − 2) and lim_{x→2+} (f(x) − f(2))/(x − 2) = −3.
2. If x < 2, then f(x) = 7 − 3(2 − x) and lim_{x→2−} (f(x) − f(2))/(x − 2) = 3.
Since the limits from above and below are not equal, the derivative does not exist.<br />
4.3.2 Approximations and Local Linearity<br />
If f(x) is differentiable at a, then f(x) is locally linear near a. That is, values on the tangent<br />
line can be used to approximate values on the curve itself.<br />
Example 4.7 (Derivative Exists, continued) For example, if f(x) = x/(1+2x), a = 1,<br />
and ℓ(x) = (1/3) + (1/9)(x − 1) (as calculated above), then the following table confirms<br />
that values of the linear function ℓ(x) are close to values of f(x) when x is near 1:<br />
x 0.75 0.80 0.85 0.9 0.95 1.00 1.05 1.10 1.15 1.20 1.25<br />
f(x) 0.300 0.308 0.315 0.321 0.328 0.333 0.339 0.344 0.348 0.353 0.357<br />
ℓ(x) 0.306 0.311 0.317 0.322 0.328 0.333 0.339 0.344 0.350 0.356 0.361<br />
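Both the value f′(1) = 1/9 and the table can be checked numerically; the step size h below is an arbitrary small number.

```python
f = lambda x: x / (1 + 2 * x)
ell = lambda x: 1 / 3 + (1 / 9) * (x - 1)   # tangent line at a = 1

h = 1e-6
difference_quotient = (f(1 + h) - f(1)) / h
assert abs(difference_quotient - 1 / 9) < 1e-5  # approaches f'(1) = 1/9

# Local linearity: the tangent line stays close to f near x = 1.
for x in (0.9, 0.95, 1.0, 1.05, 1.1):
    assert abs(f(x) - ell(x)) < 0.005
```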
Figure 4.6: Plots of y = 2 + 9x − 6x 2 + x 3 with different points highlighted.<br />
4.3.3 First and Second Derivative Functions<br />
Since the derivative can take different values at different points, we can think of it as a<br />
function in its own right. For any function f, we define the derivative function, f ′ , by<br />
f′(x) = lim_{h→0} (f(x + h) − f(x))/h.
If the limit exists at a given x, we say that f is differentiable at x. If the derivative exists<br />
for all x in the domain of the function, we say that f is differentiable everywhere.
Alternative notations. Starting with y = f(x) (where x represents the independent<br />
variable and y represents the dependent variable), there are several different ways to write<br />
the derivative function:<br />
f′(x) = dy/dx = d/dx (f(x)) = df/dx = Df(x) = y′.
The “prime” notation (f′(x), y′) is due to Lagrange. The d/dx (“d-dx”) form is due to Leibniz.
Second derivative function. If f is a differentiable function, then its derivative f ′ is<br />
also a function, so f ′ may have a derivative of its own. The second derivative of f, denoted<br />
by f ′′ (“f double prime”), is defined as follows:<br />
f′′(x) = lim_{h→0} (f′(x + h) − f′(x))/h.
Starting with y = f(x), alternative notations are
f′′(x) = d^2y/dx^2 = d^2/dx^2 (f(x)) = d^2f/dx^2 = D^2f(x) = y′′.
The second derivative is the rate of change of the rate of change. If f ′′ (a) > 0, then the<br />
curve y = f(x) is concave up at (a,f(a)). If f ′′ (a) < 0, then the curve is concave down at<br />
(a,f(a)).<br />
Continuity and differentiability. Differentiable functions are continuous, but continuous<br />
functions may not be differentiable. For example, the absolute value function<br />
f(x) = 7 − 3|x − 2| (see Figure 4.5) is continuous everywhere and is differentiable for all values of x other than 2.
4.3.4 Some Rules for Finding Derivative Functions
Assume that f and g are differentiable on a common domain D, and let a, b, c, and m be<br />
constants. Then<br />
1. Constant functions: d/dx (c) = 0.
2. Linear functions: d/dx (mx + b) = m.
3. Constant multiples: d/dx (cf(x)) = c f′(x).
4. Linear combinations: d/dx (af(x) + bg(x)) = a f′(x) + b g′(x).
5. Products of functions: d/dx (f(x)g(x)) = f′(x)g(x) + f(x)g′(x).
6. Quotients of functions: d/dx (f(x)/g(x)) = (g(x)f′(x) − f(x)g′(x))/(g(x))^2 when g(x) ≠ 0.
7. Powers of x: d/dx (x^n) = n x^{n−1} for each constant n.
Polynomial functions. Recall that a polynomial function, f(x), has the form
f(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_2 x^2 + a_1 x + a_0
where a_0, a_1, ..., a_n are constants, and n is a nonnegative integer. The rules given above imply that
f′(x) = n a_n x^{n−1} + (n − 1) a_{n−1} x^{n−2} + · · · + 2 a_2 x + a_1.
Example 4.8 (Cubic Function) Let f(x) = 2 + 9x − 6x 2 + x 3 . Then<br />
f ′ (x) = 9 − 12x + 3x 2 and f ′′ (x) = −12 + 6x.<br />
Since f ′′ (x) < 0 when x < 2, f is concave down when x < 2 (left part of Figure 4.6). Since<br />
f ′′ (x) > 0 when x > 2, f is concave up when x > 2 (right part of Figure 4.6).<br />
Rational functions. Recall that a rational function, r(x), is the ratio of polynomials.<br />
The derivative of r(x) is computed using the quotient rule.<br />
Example 4.9 (Rational Function) For example, if
r(x) = p(x)/q(x) = (3 + 4x − x^2)/(7 + 2x),
then
r′(x) = (q(x)p′(x) − p(x)q′(x))/(q(x))^2
      = ((7 + 2x)(4 − 2x) − (3 + 4x − x^2)(2))/(7 + 2x)^2
      = −2(−11 + 7x + x^2)/(7 + 2x)^2.
4.3.5 Chain Rule for Composition of Functions
If f and g are differentiable functions, then<br />
d/dx (f(g(x))) = f′(g(x)) g′(x).
(The derivative of the outside function evaluated at the inside function times the derivative<br />
of the inside function evaluated at x.)<br />
Starting with y = f(u) and u = g(x), the chain rule can be written as follows:<br />
dy/dx = (dy/du)(du/dx).
For example, since √u = u^{1/2} and d/du (u^{1/2}) = (1/2) u^{−1/2} = 1/(2√u),
d/dx (√(x^2 + 1)) = (1/(2√(x^2 + 1))) (2x) = x/√(x^2 + 1).
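A central-difference quotient gives a quick numeric check of this chain-rule computation; the step size and test points below are arbitrary.

```python
import math

def numeric_derivative(func, x, h=1e-6):
    """Central difference approximation to func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)

f = lambda x: math.sqrt(x ** 2 + 1)
for x in (-2.0, 0.5, 3.0):
    exact = x / math.sqrt(x ** 2 + 1)   # the chain-rule answer
    assert abs(numeric_derivative(f, x) - exact) < 1e-8
```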
Inverse functions. The functions f and g are said to be inverse functions if f(g(x)) = x.<br />
Using the chain rule:<br />
d/dx (f(g(x))) = f′(g(x)) g′(x) ⇒ 1 = f′(g(x)) g′(x) ⇒ g′(x) = 1/f′(g(x)).
(The derivative of g at x is the reciprocal of the derivative of f at g(x).)
For example, f(u) = u^2 and g(x) = √x are inverse functions when x is positive. Since f′(u) = 2u,
g′(x) = 1/(2√x) = 1/f′(g(x)).
4.3.6 Rules for Exponential and Logarithmic Functions
The natural base for exponential and logarithmic functions in calculus is the base e, where
<br />
e = lim_{n→∞} (1 + 1/n)^n = 2.71828... is the Euler constant.
The limit above is illustrated in the following table of approximate values:<br />
n             10        100       1000      10000     100000    1000000
(1 + 1/n)^n   2.59374   2.70481   2.71692   2.71815   2.71827   2.71828
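The table entries can be regenerated in a couple of lines:

```python
import math

# (1 + 1/n)^n increases toward e = 2.71828... as n grows.
approx = {n: (1 + 1 / n) ** n for n in (10, 100, 1000, 10 ** 4, 10 ** 5, 10 ** 6)}

assert abs(approx[10] - 2.59374) < 1e-4
assert abs(approx[100] - 2.70481) < 1e-4
assert abs(approx[10 ** 6] - math.e) < 2e-6   # within about 1/(2n) of e
```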
The domain of the exponential function e x = exp(x) is all real numbers and the domain<br />
of the logarithmic function ln(x) = log e(x) is the positive real numbers. Plots of the<br />
exponential and logarithmic functions are given in Figure 4.7.<br />
Figure 4.7: Plots of y = e^x (left) and y = ln(x) (right).

The functions e^x and ln(x) are inverses. That is,
e^{ln(x)} = x when x > 0 and ln(e^x) = x for all real numbers x.
Their derivatives are as follows:
d/dx (e^x) = e^x and d/dx (ln(x)) = 1/x.
More generally,
d/dx (e^u) = e^u (du/dx) and d/dx (ln(u)) = (1/u)(du/dx).
For example,
d/dx (e^{12−x^3}) = e^{12−x^3}(−3x^2) and d/dx (ln(x^4 + 2x^2 + 7)) = (4x^3 + 4x)/(x^4 + 2x^2 + 7).
4.3.7 Rules for Trigonometric and Inverse Trigonometric Functions
In calculus, we work with trigonometric functions using radian measure. That is, the input<br />
x is the radian measure of an angle (Section 4.1). Figure 4.8 shows plots of the two most
commonly used functions, sin(x) and cos(x).<br />
Trigonometric functions have specialized derivative formulas:
d/dx (sin(x)) = cos(x)             d/dx (cos(x)) = −sin(x)
d/dx (tan(x)) = (sec(x))^2         d/dx (cot(x)) = −(csc(x))^2
d/dx (sec(x)) = sec(x) tan(x)      d/dx (csc(x)) = −csc(x) cot(x)
Figure 4.8: Plots of y = sin(x) (left) and y = cos(x) (right).<br />
Similarly, inverse trigonometric functions have specialized derivative formulas:
d/dx (sin^{−1}(x)) = 1/√(1 − x^2)           d/dx (cos^{−1}(x)) = −1/√(1 − x^2)
d/dx (tan^{−1}(x)) = 1/(1 + x^2)            d/dx (cot^{−1}(x)) = −1/(1 + x^2)
d/dx (sec^{−1}(x)) = 1/(x^2 √(1 − x^{−2}))  d/dx (csc^{−1}(x)) = −1/(x^2 √(1 − x^{−2}))
For example,
d/dx (sin(3x^2 + 8)) = cos(3x^2 + 8)(6x) and
d/dx (x cos(√x + x)) = cos(√x + x) − x sin(√x + x)(1/(2√x) + 1)
                     = cos(√x + x) − sin(√x + x)(√x/2 + x).
4.3.8 Indeterminate Forms and L’Hospital’s Rule<br />
If f(x) → 0 (“f(x) approaches 0”) as x → a and g(x) → 0 as x → a, then<br />
f(x)<br />
lim<br />
x→a g(x)<br />
may or may not exist, and is called an indeterminate form of type 0/0.
For example,<br />
lim_{x→1} (x − 1)/(x^2 + 11x − 12)
is an indeterminate form of type 0/0 whose limit can be computed after cancelling the common factor (x − 1) from numerator and denominator:
lim_{x→1} (x − 1)/(x^2 + 11x − 12) = lim_{x→1} (x − 1)/((x − 1)(x + 12)) = lim_{x→1} 1/(x + 12) = 1/13.
Similarly, if f(x) → ±∞ and g(x) → ±∞ as x → a, then the limit of the ratio f(x)/g(x) is<br />
an indeterminate form of type ∞/∞.
Theorem 4.10 (L’Hospital’s Rule) Suppose that f and g are differentiable and g′(x) ≠ 0 near a (except possibly at a). Further, suppose that
1. f(x) → 0 and g(x) → 0 as x → a, or
2. f(x) → ±∞ and g(x) → ±∞ as x → a.
Then
lim_{x→a} f(x)/g(x) = lim_{x→a} f′(x)/g′(x)
if the limit on the righthand side exists.
L’Hospital’s rule says that the limit of the ratio of the functions is the same as the limit of<br />
the ratio of the rates. For example,<br />
lim_{x→1} (x^{20} − 1)/(x^{18} − 1) = lim_{x→1} 20x^{19}/(18x^{17}) = lim_{x→1} 10x^2/9 = 10/9.
In some cases, L’Hospital’s rule needs to be used more than once. For example,<br />
lim_{x→0} (1 − cos(x))/x^2 = lim_{x→0} sin(x)/(2x) = lim_{x→0} cos(x)/2 = 1/2.
(The first and second limits are indeterminate forms of type 0/0.)
One-sided and infinite limits. Note that L’Hospital’s rule remains true for one-sided limits (x → a+ or x → a−) and for limits where x → ±∞.
Indeterminate products. If f(x) → 0 and g(x) → ±∞ as x → a, then<br />
lim_{x→a} f(x)g(x)
is an indeterminate form of type 0 · ∞. L’Hospital’s rule can sometimes be used to evaluate the limit. The trick is to rewrite the product as a quotient:
lim_{x→a} f(x)g(x) = lim_{x→a} f(x)/(1/g(x)) (type 0/0) or lim_{x→a} g(x)/(1/f(x)) (type ∞/∞).
For example,
lim_{x→∞} x e^{−x} = lim_{x→∞} x/e^x = lim_{x→∞} 1/e^x = 0 (using type ∞/∞).
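The limits computed with L’Hospital’s rule above can be spot-checked numerically; the evaluation points below are arbitrary.

```python
import math

x = 1 + 1e-7
assert abs((x ** 20 - 1) / (x ** 18 - 1) - 10 / 9) < 1e-4  # type 0/0

x = 1e-4
assert abs((1 - math.cos(x)) / x ** 2 - 0.5) < 1e-4        # 0/0, rule applied twice

x = 50.0
assert x * math.exp(-x) < 1e-19                            # x e^(-x) -> 0
```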
Remarks. There are many other types of indeterminate forms, including
∞ − ∞, 0^0, ∞^0, 1^∞.
In each case, anything can happen (that is, a limit may or may not exist).
Note that the limit used to define the Euler constant (Section 4.3.6) is an example of an indeterminate form of type 1^∞.
4.4 Optimization<br />
The function f : D ⊆ R → R has a local minimum (or relative minimum) at c ∈ D if
f(c) ≤ f(x) for each x in the intersection of a neighborhood of c with D.
Similarly, f has a local maximum (or relative maximum) at c ∈ D if
f(c) ≥ f(x) for each x in the intersection of a neighborhood of c with D.
Local maxima and minima are called local extrema of the function.<br />
f has a global (or absolute) minimum at c if f(c) ≤ f(x) for all x ∈ D. Similarly, f has a global (or absolute) maximum at c if f(c) ≥ f(x) for all x ∈ D. Global maxima and minima
are called global extrema of the function.<br />
4.4.1 Local Extrema and Derivatives<br />
The following theorem establishes a relationship between local extrema and derivatives.<br />
Theorem 4.11 (Fermat’s Theorem) If the continuous function f has a local extremum<br />
at c, and if f ′ (c) exists, then f ′ (c) = 0.<br />
Continuing with the cubic function example (Figure 4.6), f(x) = 2 + 9x − 6x 2 + x 3 has a<br />
local maximum at 1 (with maximum value 6) and a local minimum at 3 (with minimum<br />
value 2). Since the derivative function is f ′ (x) = 9 − 12x + 3x 2 , it is easy to check that<br />
f ′ (1) = f ′ (3) = 0.<br />
Critical numbers. A number c is a critical number of f if either f ′ (c) = 0 or f ′ (c) does<br />
not exist. If f has a local extremum at c, then Fermat’s theorem implies that c is a critical<br />
number of the function f.<br />
The cubic function f(x) = 2 + 9x − 6x^2 + x^3 has critical numbers 1 and 3. The absolute
value function f(x) = 7 − 3|x − 2| (Figure 4.5) has critical number 2.
4.4.2 First Derivative Test <strong>for</strong> Local Extrema<br />
Suppose that f is differentiable for all x near c. If c is a critical number of f and if f′
changes sign at c, then f has a local extremum at c. Specifically,
1. If f ′ (x) is positive below c and negative above c, then f is increasing to the left of c<br />
and decreasing to the right of c. Thus, f has a local maximum at c.<br />
2. If f ′ (x) is negative below c and positive above c, then f is decreasing to the left of c<br />
and increasing to the right of c. Thus, f has a local minimum at c.<br />
Continuing with the cubic function f(x) = 2 + 9x − 6x^2 + x^3, it is easy to check that

f′(x) > 0 when x < 1, f′(x) < 0 when 1 < x < 3, and f′(x) > 0 when x > 3.

These computations confirm that f has a local maximum at 1 and a local minimum at 3.
4.4.3 Second Derivative Test <strong>for</strong> Local Extrema<br />
Suppose that f is twice differentiable for all x near c, and that c is a critical number for f.
(In this case, f′(c) = 0.) Then
1. If f ′′ (c) > 0, then f has a local minimum at c.<br />
2. If f ′′ (c) < 0, then f has a local maximum at c.<br />
Note that if f′(c) = 0 and f′′(c) = 0, then anything can happen. For example, if f is the
quartic function f(x) = x^4, then f′(0) = f′′(0) = 0 and f has a local (and global) minimum
at 0. By contrast, if f is the cubic function f(x) = x^3, then f′(0) = f′′(0) = 0, but f has
neither a local minimum nor a local maximum at 0.
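For the cubic example f(x) = 2 + 9x − 6x^2 + x^3, both derivative tests are easy to verify directly; a quick sketch:

```python
def fp(x):
    # f'(x) for f(x) = 2 + 9x - 6x^2 + x^3
    return 9 - 12*x + 3*x**2

def fpp(x):
    # f''(x)
    return -12 + 6*x

# critical numbers 1 and 3: f'(1) = f'(3) = 0
print(fp(1), fp(3))
# f''(1) = -6 < 0 (local maximum); f''(3) = 6 > 0 (local minimum)
print(fpp(1), fpp(3))
```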
4.4.4 Global Extrema on Closed Intervals<br />
Consider f : [a,b] ⊆ R → R (the domain of interest is the closed interval [a,b]). The<br />
following theorem says that continuous functions have global extrema on closed intervals.<br />
Theorem 4.12 (Extreme Value Theorem) If f is continuous on the closed interval<br />
[a,b], then there are numbers c,d ∈ [a,b] such that f has a global minimum at c and a<br />
global maximum at d.<br />
For example, if the domain of the cubic function f(x) = 2 + 9x − 6x^2 + x^3 (Figure 4.6) is
restricted to the closed interval [−0.5, 3.5], then f has a global minimum value of −4.125
at −0.5, and a global maximum value of 6 at 1.
Method for finding global extrema. To find the global extrema of the continuous
function f on the closed interval [a,b]:<br />
1. Find the values of f at the critical numbers in the open interval (a,b).<br />
2. Find the values of f at the endpoints: f(a), f(b).<br />
Figure 4.9: Plots of y = 40 + 12x^2 + 4x^3 − 3x^4. In the right plot, the domain is restricted
to the interval [0,3].
3. The largest function value from the first two steps is the global maximum, and the<br />
smallest function value is the global minimum.<br />
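Once the critical numbers are known, the three steps are easy to automate. A sketch for the cubic f(x) = 2 + 9x − 6x^2 + x^3 on [−0.5, 3.5], the restricted domain used above (the helper name is ours):

```python
def f(x):
    return 2 + 9*x - 6*x**2 + x**3

def closed_interval_extrema(f, a, b, critical_numbers):
    # Steps 1 and 2: values of f at critical numbers in (a,b) and at the endpoints
    candidates = [c for c in critical_numbers if a < c < b] + [a, b]
    values = [(f(c), c) for c in candidates]
    # Step 3: the largest value is the global maximum, the smallest the global minimum
    return min(values), max(values)

(min_val, min_at), (max_val, max_at) = closed_interval_extrema(f, -0.5, 3.5, [1, 3])
print(min_val, min_at)  # -4.125 -0.5
print(max_val, max_at)  # 6 1
```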
Example 4.13 (Quartic Function) Let f(x) = 40 + 12x^2 + 4x^3 − 3x^4 for all x.
1. As illustrated in the left part of Figure 4.9, f(x) has a local maximum of 45 = f(−1)<br />
when x = −1, a local minimum of 40 = f(0) when x = 0, and both a local and global<br />
maximum of 72 = f(2) when x = 2.<br />
2. Since f′(x) = 24x + 12x^2 − 12x^3 = −12(x − 2) x (x + 1), we know that f′(x) = 0
when x = −1, 0, 2. Further,

(a) f′(x) > 0 when x < −1 and when 0 < x < 2 and

(b) f′(x) < 0 when −1 < x < 0 and when x > 2.

The values of f(x) are increasing when x < −1, decreasing when −1 < x < 0,
increasing when 0 < x < 2, and decreasing when x > 2.
3. Since f′′(x) = 24 + 24x − 36x^2, we know that f′′(x) = 0 when

x = (1 − √7)/3 ≈ −0.549 and x = (1 + √7)/3 ≈ 1.215.

Further,
(a) f ′′ (x) < 0 when x < −0.549 and when x > 1.215 and<br />
(b) f ′′ (x) > 0 when −0.549 < x < 1.215.<br />
Concavity changes at both roots of the equation f ′′ (x) = 0. Specifically, f(x) is<br />
concave down when x < −0.549, concave up when −0.549 < x < 1.215, and concave<br />
down when x > 1.215.<br />
Example 4.14 (Restricted Domain) Let f(x) = 40 + 12x^2 + 4x^3 − 3x^4 for x ∈ [0,3].
1. As illustrated in the right part of Figure 4.9, f(x) has a local minimum of 40 = f(0)<br />
at x = 0, a local and global maximum of 72 = f(2) at x = 2, and a local and global<br />
minimum of 13 = f(3) at x = 3.<br />
2. Since f′(x) = 24x + 12x^2 − 12x^3 = −12(x − 2) x (x + 1), we know that f′(x) = 0
when x = −1,0,2. For the two roots in the restricted domain, we consider<br />
f(0) = 40 and f(2) = 72 when determining global extrema.<br />
3. The function values at the endpoints are f(0) = 40 and f(3) = 13.<br />
4. The global maximum value is the largest of the numbers computed in items 2 and 3.<br />
The global minimum value is the smallest of the numbers computed in items 2 and 3.<br />
4.4.5 Newton’s Method<br />
Many problems in mathematics involve finding roots. For example, to find the critical<br />
numbers of the differentiable function f, we need to solve the derivative equation f ′ (x) = 0.<br />
Newton’s method is an iterative method <strong>for</strong> approximating the roots of an equation of the<br />
<strong>for</strong>m g(x) = 0 when g is a differentiable function. The method uses local linearity.<br />
Specifically, if r is a root of the equation, x0 (an initial estimate of r) is a number near r,<br />
and g′(x0) ≠ 0, then the tangent line through (x0, g(x0)),

y = g(x0) + g′(x0)(x − x0),
can be used to approximate the curve y = g(x) and the x-coordinate of the point where the<br />
tangent line crosses the x-axis (call it x1) is likely to be closer to r than x0. To find x1:<br />
0 = g(x0) + g′(x0)(x1 − x0) ⇒ x1 = x0 − g(x0)/g′(x0).
If g′(x1) ≠ 0, the method can be repeated to yield x2 = x1 − g(x1)/g′(x1), and so forth.
The general formula is

xk+1 = xk − g(xk)/g′(xk).
The iteration stops when the error term |g(xk)/g ′ (xk)| falls below a pre-specified bound.<br />
Example 4.15 (Root of Cubic Function) The cubic function g(x) = x^3 − x − 1 has a
root between 1 and 2, as illustrated in the left part of Figure 4.10. Newton’s method can<br />
be used to find the root. The following table shows the results of the first five iterations<br />
starting with initial value x0 = 1.0.<br />
k 0 1 2 3 4 5<br />
xk 1.0 1.5 1.34783 1.3252 1.32472 1.32472<br />
Since x4 and x5 are identical to within five decimal places of accuracy, we estimate that the<br />
root occurs when x = 1.32472.<br />
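The iteration itself is only a few lines of code. A sketch (function and parameter names are ours), applied to g(x) = x^3 − x − 1 starting from x0 = 1.0:

```python
def newton(g, gprime, x0, tol=1e-5, max_iter=100):
    # iterate x_{k+1} = x_k - g(x_k)/g'(x_k) until |g(x_k)/g'(x_k)| < tol
    x = x0
    for _ in range(max_iter):
        step = g(x) / gprime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

root = newton(lambda x: x**3 - x - 1, lambda x: 3*x**2 - 1, 1.0)
print(round(root, 5))  # 1.32472
```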
Figure 4.10: Newton's method examples.
Example 4.16 (Point of Intersection) The functions<br />
f1(x) = x + 4 and f2(x) = e^(x/3)
have a point of intersection between 5 and 10, as illustrated in the right part of Figure 4.10.<br />
One way to find the point of intersection is to use Newton’s method to solve the equation<br />
g(x) = f1(x) − f2(x) = 0. The following table shows the results of the first six iterations<br />
starting with initial value x0 = 5.5.<br />
k 0 1 2 3 4 5 6<br />
xk 5.5 8.49133 7.53204 7.28038 7.26519 7.26514 7.26514<br />
Since x5 and x6 are identical to within five decimal places of accuracy, we estimate that the<br />
point of intersection occurs when x = 7.26514.<br />
Warning. Newton’s method is powerful and fast (in general, few iterations are required<br />
for convergence), but is sensitive to the initial value x0. If we let x0 = 0.57 in the cubic
example above (a value close to the root of g′), then 35 iterations are needed to approximate
the root to within 5 decimal places of accuracy.
If g(x) has more than one root and the chosen x0 is far from r, then the method could<br />
converge to the wrong root. If g′(xk) ≈ 0 for some k, then the sequence of iterates may not
converge at all.<br />
4.5 Applications in Statistics
Optimization is an important component of many statistical analyses. This section illustrates<br />
two applications.<br />
4.5.1 Maximum Likelihood Estimation in Binomial Models<br />
Consider the problem of estimating the binomial probability of success, p. If x successes are<br />
observed in n trials, then a natural estimate of p is the sample proportion, p̂ = x/n. If the
Figure 4.11: Examples of binomial (left) and multinomial (right) likelihoods.
data summarize the results of n independent trials of a Bernoulli experiment with success<br />
probability p, then p̂ = x/n is also the maximum likelihood (ML) estimate of p.
The ML estimate of p is the value of p which maximizes the likelihood function, Lik(p), or<br />
(equivalently) the log-likelihood function, ℓ(p). Lik(p) is the binomial PDF with n and x<br />
fixed and p varying:

Lik(p) = (n choose x) p^x (1 − p)^(n − x),

and ℓ(p) is the natural logarithm of the likelihood:

ℓ(p) = ln(Lik(p)) = ln(n choose x) + x ln(p) + (n − x) ln(1 − p).
If 0 < x < n, then the ML estimate is obtained by solving ℓ′(p) = 0 for p:

ℓ′(p) = x/p − (n − x)/(1 − p) = 0 ⇒ x/p = (n − x)/(1 − p) ⇒ p̂ = x/n.

Further, since ℓ′(p) > 0 when 0 < p < x/n and ℓ′(p) < 0 when x/n < p < 1, we know that
the likelihood is maximized when p = x/n.
For example, the left part of Figure 4.11 shows the binomial likelihood when 17 successes<br />
are observed in 25 trials. The function is maximized at the ML estimate, p̂ = 0.68.
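The maximum can be confirmed without calculus by evaluating the log-likelihood over a fine grid. A sketch; the additive constant ln(n choose x) is dropped, which does not change where the maximum occurs:

```python
import math

def loglik(p, n=25, x=17):
    # binomial log-likelihood with the constant term ln C(n,x) dropped
    return x * math.log(p) + (n - x) * math.log(1 - p)

grid = [k / 10000 for k in range(1, 10000)]  # grid over the open interval (0,1)
p_hat = max(grid, key=loglik)
print(p_hat)  # 0.68, the sample proportion 17/25
```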
4.5.2 Maximum Likelihood Estimation in Multinomial Models<br />
Maximum likelihood estimation can be generalized to multinomial models. Consider, for
example, a multinomial model with three categories and probabilities

(p1, p2, p3) = (θ, (1 − θ)θ, (1 − θ)^2), where 0 ≤ θ ≤ 1 is unknown,
and suppose that in 60 independent trials of the experiment, the first outcome was observed<br />
15 times, the second outcome 20 times and the third outcome 25 times. That is, suppose<br />
that n = 60, x1 = 15, x2 = 20 and x3 = 25.<br />
The likelihood function, Lik(θ), is the multinomial PDF written as a function of θ,<br />
Lik(θ) = (60 choose 15, 20, 25) θ^15 ((1 − θ)θ)^20 ((1 − θ)^2)^25 = (60 choose 15, 20, 25) θ^35 (1 − θ)^70,
as shown in the right part of Figure 4.11.<br />
The log-likelihood function is the natural logarithm of the likelihood,<br />
ℓ(θ) = ln(Lik(θ)) = ln(60 choose 15, 20, 25) + 35 ln(θ) + 70 ln(1 − θ).
ℓ′(θ) = 0 when θ = 35/105 = 1/3. You can use either the first or second derivative test to
demonstrate that the log-likelihood is maximized when θ̂ = 1/3.

The estimated model has probabilities (1/3, 2/9, 4/9).
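The same answer can be reached numerically by finding the root of ℓ′(θ) = 35/θ − 70/(1 − θ); a bisection sketch (ℓ′ is positive to the left of the root and negative to the right):

```python
def lprime(t):
    # derivative of l(t) = constant + 35 ln(t) + 70 ln(1 - t)
    return 35 / t - 70 / (1 - t)

lo, hi = 1e-6, 1 - 1e-6   # lprime(lo) > 0 and lprime(hi) < 0
for _ in range(60):
    mid = (lo + hi) / 2
    if lprime(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_hat = (lo + hi) / 2
print(round(theta_hat, 6))  # 0.333333, i.e. 1/3
```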
4.5.3 Minimum Chi-Square Estimation in Multinomial Models<br />
In 1900, K. Pearson developed a quantitative method to determine if observed data are<br />
consistent with a given multinomial model.<br />
If (X1,X2,... ,Xk) has a multinomial distribution with parameters n and (p1,p2,... ,pk),<br />
then Pearson’s statistic is the following random variable:<br />
X^2 = Σ_{i=1}^k (Xi − n pi)^2 / (n pi).
For each i, the observed frequency, Xi, is compared to the expected frequency, E(Xi) = npi,<br />
under the multinomial model. If each observed frequency is close to expected, then the value<br />
of X 2 will be close to zero. If at least one observed frequency is far from expected, then<br />
the value of X 2 will be large and the appropriateness of the given multinomial model will<br />
be called into question.<br />
In many practical situations, certain parameters of the multinomial model need to be estimated<br />
from the sample data. For example, if<br />
(p1, p2, ..., pk) = (p1(θ), p2(θ), ..., pk(θ)) for some parameter θ,
then the observed value of Pearson’s statistic<br />
f(θ) = Σ_{i=1}^k (xi − n pi(θ))^2 / (n pi(θ))
is a function of θ, and a natural estimate of θ is the value of θ which minimizes f. This<br />
estimate is known as the minimum chi-square estimate of θ for the given (x1, x2, ..., xk).
Example 4.17 (n = 60, k = 3, continued) Consider again the model with probabilities<br />
(p1, p2, p3) = (θ, (1 − θ)θ, (1 − θ)^2), where 0 ≤ θ ≤ 1 is unknown,
Figure 4.12: Plots of y = f(θ) for the minimum chi-square estimation examples.
and assume that in 60 independent trials we obtained x1 = 15, x2 = 20 and x3 = 25.<br />
The function we would like to minimize is<br />
f(θ) = (15 − 60θ)^2/(60θ) + (20 − 60(1 − θ)θ)^2/(60(1 − θ)θ) + (25 − 60(1 − θ)^2)^2/(60(1 − θ)^2)

(left part of Figure 4.12) and the derivative of this function (after simplification) is

f′(θ) = −5(9θ^3 − 9θ^2 + 75θ − 25) / (12(θ − 1)^3 θ^2).
To find the minimum chi-square estimate we need to find the root of<br />
g(θ) = 9θ^3 − 9θ^2 + 75θ − 25 in the [0,1] interval.

Using Newton's method, 4 decimal places of accuracy, and θ0 = 0.5 we get:

θ0 = 0.5, θ1 = 0.343643, θ2 = 0.342592, θ3 = 0.342592.

Let θ̂ = 0.342592. Since

f′(θ) < 0 when 0 < θ < θ̂ and f′(θ) > 0 when θ̂ < θ < 1,

the function is minimized at θ̂.

The estimated model has (approximate) probabilities (0.3426, 0.2252, 0.4322).
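The Newton iterates above are easy to reproduce in code; a sketch using g and its derivative:

```python
def g(t):
    # numerator of f'(t), up to a constant factor
    return 9*t**3 - 9*t**2 + 75*t - 25

def gprime(t):
    return 27*t**2 - 18*t + 75

t = 0.5
iterates = [t]
for _ in range(10):
    t = t - g(t) / gprime(t)
    iterates.append(t)
print([round(v, 6) for v in iterates[:3]])  # compare with the table above
```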
Example 4.18 (Genetics Study) (Larsen & Marx, Prentice-Hall, 1986, p. 418) Researchers
were interested in studying two genetic characteristics of corn:
• Starchy (S) or sugary (s), where starchy is the dominant characteristic, and<br />
• Green base leaf (G) or white base leaf (g), where green base leaf is the dominant<br />
characteristic,<br />
Table 4.1: Summary of results of the genetics study.<br />
1 Starchy green (SG) 1997<br />
2 Starchy white (Sg) 906<br />
3 Sugary green (sG) 904<br />
4 Sugary white (sg) 32<br />
in the offspring of self-fertilized hybrid corn plants. The results for 3839 offspring are
summarized in Table 4.1.<br />
A model of interest to geneticists hypothesizes that the probabilities of the four outcomes
(SG, Sg, sG, sg) are

(p1, p2, p3, p4) = (0.25(2 + θ), 0.25(1 − θ), 0.25(1 − θ), 0.25θ),
where θ is a parameter in the interval (0,1). θ is a cross-over parameter, indicating a
tendency for chromosome pairs Sg and sG in the hybrid parent to cross over to SG and sg
at the time of reproduction. (As θ increases, the tendency to cross over increases.)
The right part of Figure 4.12 suggests that f(θ) is minimized in the interval 0.02 < θ < 0.05.<br />
Using Newton’s method to solve the derivative equation<br />
d<br />
(f(θ)) = 0 with initial value 0.03,<br />
dθ<br />
the minimum chi-square estimate of θ is θ = 0.0357852.<br />
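The estimate can be reproduced without any calculus by a brute-force grid search over the interval suggested by the figure. A sketch (names are ours; the grid step limits the accuracy to about 1e-6):

```python
def pearson(theta, n=3839, counts=(1997, 906, 904, 32)):
    # Pearson's statistic for the cross-over model probabilities
    probs = (0.25*(2 + theta), 0.25*(1 - theta), 0.25*(1 - theta), 0.25*theta)
    return sum((x - n*p)**2 / (n*p) for x, p in zip(counts, probs))

grid = [0.02 + k * 1e-6 for k in range(30001)]   # theta from 0.02 to 0.05
theta_hat = min(grid, key=pearson)
print(round(theta_hat, 4))  # 0.0358
```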
4.6 Rolle’s Theorem and the Mean Value Theorem<br />
The Mean Value Theorem (MVT) connects the local picture about a curve (namely, the<br />
slope of the curve at a given point) to the global picture (namely, the average rate of change<br />
over an interval). Rolle’s Theorem is the first step.<br />
Theorem 4.19 (Rolle’s Theorem) Suppose f is continuous on [a,b], differentiable on<br />
(a,b) and that f(a) = f(b). Then there is a c ∈ (a,b) satisfying f ′ (c) = 0.<br />
Example 4.20 (Cubic Polynomial) For example, let f(x) = 48 − 44x + 12x^2 − x^3 on
[a,b] = [2,4] (left part of Figure 4.13). Since f(2) = f(4), Rolle's theorem applies. On this
interval, f′(c) = 0 when c = 4 − 2/√3 ≈ 2.8453. (c is a critical number corresponding to
the global minimum on the interval.)
Theorem 4.21 (Mean Value Theorem) Suppose f is continuous on the closed interval<br />
[a,b] and differentiable on the open interval (a,b). Then<br />
(f(b) − f(a))/(b − a) = f′(c) for some number c satisfying a < c < b.
Figure 4.13: Plots of y = 48 − 44x + 12x^2 − x^3 (left) and y = 17 + 2x − 3x^2 + 0.1x^4 (right).
The MVT can be proven using Rolle's theorem. Let

s(x) = f(a) + ((f(b) − f(a))/(b − a)) (x − a) and h(x) = f(x) − s(x).
Since h(a) = f(a) − f(a) = 0 and h(b) = f(b) − f(b) = 0, the conditions of Rolle’s theorem<br />
are met. Thus, there is a c ∈ (a,b) satisfying<br />
0 = h′(c) = f′(c) − (f(b) − f(a))/(b − a) ⇒ f′(c) = (f(b) − f(a))/(b − a).
Example 4.22 (Quartic Polynomial) For example, if f(x) = 17 + 2x − 3x^2 + 0.1x^4 and
[a,b] = [−5,5] (right part of Figure 4.13), then the average rate of change over the interval is
2. The instantaneous rate of change of the function is 2 when c = 0 and when c = ±√15 ≈ ±3.87298.
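Both claims are easy to check numerically; a quick sketch:

```python
def f(x):
    return 17 + 2*x - 3*x**2 + 0.1*x**4

def fprime(x):
    return 2 - 6*x + 0.4*x**3

a, b = -5, 5
avg_rate = (f(b) - f(a)) / (b - a)
print(avg_rate)  # 2.0

# MVT points: f'(c) matches the average rate at c = 0 and c = +/- sqrt(15)
c = 15 ** 0.5
print(fprime(0), fprime(c), fprime(-c))  # all equal to 2 up to rounding
```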
4.6.1 Generalized Mean Value Theorem and Error Analysis<br />
The MVT can be extended to two functions.<br />
Theorem 4.23 (Generalized MVT) Suppose f and g are continuous on the closed<br />
interval [a,b] and differentiable on the open interval (a,b). Then<br />
(f(b) − f(a))g ′ (c) = (g(b) − g(a))f ′ (c) <strong>for</strong> some number c satisfying a < c < b.<br />
The generalized MVT can be proven using Rolle’s theorem applied to<br />
h(x) = (f(b) − f(a))g(x) − (g(b) − g(a))f(x).<br />
Since h(a) = f(b)g(a) − g(b)f(a) and h(b) = f(b)g(a) − g(b)f(a), the conditions of Rolle’s<br />
theorem are met. Thus, there is a c ∈ (a,b) satisfying<br />
0 = h ′ (c) = (f(b)−f(a))g ′ (c)−(g(b)−g(a))f ′ (c) ⇒ (f(b)−f(a))g ′ (c) = (g(b)−g(a))f ′ (c).<br />
Error in linear approximations. If we apply the MVT to the function f on the interval<br />
[a,x], then we get<br />
f(x) = f(a) + f′(c)(x − a) for some c satisfying a < c < x.

That is, the error in using f(a) instead of using f(x) for values of x near a is exactly
f′(c)(x − a) for some number c between a and x.
How about the error in using the tangent line instead of f(x)?<br />
Suppose f is twice differentiable. Let ℓ(x) = f(a) + f ′ (a)(x − a) be the tangent line<br />
to y = f(x) at (a,f(a)), e(x) = f(x) − ℓ(x) be the difference between f and ℓ, and<br />
g(x) = (x − a)^2. (The function e(x) is often called the error function.)
If we apply the generalized MVT twice to the functions e and g on the interval [a,x], then<br />
we obtain the following equation:

f(x) = f(a) + f′(a)(x − a) + (f′′(c)/2)(x − a)^2 for some c ∈ (a,x).
That is, the error in using values on the tangent line instead of using f(x) when x is near a
is exactly (f′′(c)/2)(x − a)^2 for some number c between a and x. The form of the result is the
same if you apply the generalized MVT twice to e and g on the interval [x,a].
Error in quadratic approximations. In Section 6.6 we will consider quadratic, cubic,<br />
quartic and higher order polynomial approximations to a function f, and error analyses. As<br />
a preview, suppose that f is differentiable to order 3 and consider the following quadratic<br />
approximation to f at a,

q(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)^2.

(q(x) is a quadratic polynomial satisfying q(a) = f(a), q′(a) = f′(a) and q′′(a) = f′′(a).)
Let e(x) = f(x) − q(x) and g(x) = (x − a)^3. If we apply the generalized MVT three times
to the functions e and g on the interval [a,x], then we obtain the following equation:

f(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)^2 + (f′′′(c)/6)(x − a)^3 for some c ∈ (a,x).
That is, the error in using values on the quadratic curve instead of using f(x) when x is
near a is exactly (f′′′(c)/6)(x − a)^3 for some number c between a and x. The form of the result
is the same if you apply the generalized MVT three times to e and g on the interval [x,a].
4.7 Integration<br />
If f is the derivative of F, then we can interpret the function f as the instantaneous rate<br />
of change of the function F. Starting with a function f representing instantaneous rate of<br />
change, the techniques of integration allow us to recover F and to compute the total change<br />
in F over an interval.<br />
Figure 4.14: Definite integral as area (left) and difference in areas (right).
4.7.1 Riemann Sums and Definite Integrals<br />
Suppose f is continuous on the closed interval [a,b], and let n be a positive integer. Further,<br />
let △x = (b − a)/n and xi = a + i△x for i = 0, 1, 2, ..., n. Then the n + 1 numbers
a = x0 < x1 < x2 < · · · < xn = b<br />
partition the interval [a,b] into n subintervals of equal length. Lastly, let xi* be the midpoint
of the ith subinterval: xi* = (xi−1 + xi)/2.

The definite integral of f from a to b is

∫_a^b f(x) dx = lim_{n→∞} Σ_{i=1}^n f(xi*) △x.
In this formula: a is the lower limit of integration, b is the upper limit of integration, f(x)
is the integrand, and the sum on the right is called a Riemann sum.<br />
Continuity and Integrability. f is said to be integrable on the interval [a,b] if the limit<br />
above exists. Continuous functions are always integrable. Further, it is possible to show<br />
that the limit exists even if f has a finite number of “jump discontinuities” on [a,b].<br />
Area and difference in areas. If f is nonnegative on [a,b], then ∫_a^b f(x) dx can be
interpreted as the area under the curve y = f(x) and above the x-axis when a ≤ x ≤ b.
Otherwise, the integral can be interpreted as the difference between two areas.<br />
Example 4.24 (Computing Area) Let f(x) = √ x on the interval [0,4]. To estimate<br />
the area under y = f(x) and above the x-axis we let n = 10 and ∆x = 0.4, as illustrated in<br />
the left part of Figure 4.14. The 10 midpoints are<br />
0.2, 0.6, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8,

and the sum of the areas is 5.347.
Figure 4.15: Definite integral as average (left) and arc length (right).
The following list of approximate values suggests that the limit is 16/3.<br />
n 10 50 90 130 170 210 250 290 330 370 410<br />
Sum 5.347 5.335 5.334 5.334 5.334 5.333 5.333 5.333 5.333 5.333 5.333<br />
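The sums in this example (and the next) can be generated with a short midpoint-rule routine; a sketch (the function name is ours):

```python
def midpoint_sum(f, a, b, n):
    # Riemann sum using the midpoint of each of n equal subintervals
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) * dx for i in range(n))

print(round(midpoint_sum(lambda x: x ** 0.5, 0, 4, 10), 3))     # 5.347
print(round(midpoint_sum(lambda x: x ** 0.5, 0, 4, 10000), 3))  # 5.333
```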
Example 4.25 (Difference in Areas) Let f(x) = x^3 − 4x on the interval [0,3]. To
estimate the definite integral we let n = 9 and ∆x = 1/3, as illustrated in the right part of
Figure 4.14. The 9 midpoints are

0.167, 0.5, 0.833, 1.167, 1.5, 1.833, 2.167, 2.5, 2.833,

and the sum is Σ_i f(xi*) ∆x = 2.125.
The following list of approximate values suggests that the limit is 2.25.<br />
n 9 24 39 54 69 84 99 114 129 144 159<br />
Sum 2.125 2.232 2.243 2.247 2.248 2.249 2.249 2.249 2.249 2.25 2.25<br />
Geometrically, the limit is the difference between areas:<br />
2.25 = ∫_0^3 f(x) dx = ∫_0^2 f(x) dx + ∫_2^3 f(x) dx = −4 + 6.25.
(The area between 0 and 2 is negated because it lies below the x-axis.)<br />
Note that the total area bounded by y = f(x), y = 0, x = 0 and x = 3 is 4 + 6.25 = 10.25.<br />
Average values. The quantity (1/(b − a)) ∫_a^b f(x) dx can be interpreted as the average value
of f(x) on the interval [a,b].

To see this, let f̄ = (1/n) Σ_{i=1}^n f(xi*) be the sample average. Then

f̄ = ((b − a)/(b − a)) (1/n) Σ_{i=1}^n f(xi*) = (1/(b − a)) Σ_{i=1}^n f(xi*) △x ⇒ lim_{n→∞} f̄ = (1/(b − a)) ∫_a^b f(x) dx.
Example 4.26 (Average Value) Let f(x) = 225e^(0.08x) on the interval [0,35]. To estimate
the average of f(x) we let n = 5 and ∆x = 35/5 = 7, as illustrated in the left part of<br />
Figure 4.15. Function values at the 5 midpoints are<br />
f(3.5) ≈ 297.704, f(10.5) ≈ 521.183, f(17.5) ≈ 912.42, f(24.5) ≈ 1597.35, f(31.5) ≈ 2796.43,<br />
and the average of these 5 numbers is 1225.02.<br />
The following list of approximate values suggests that the limit is 1241.08.<br />
n 5 30 55 80 105 130 155 180 205<br />
Avg 1225.02 1240.64 1240.95 1241.02 1241.05 1241.06 1241.07 1241.08 1241.08<br />
Arc lengths. Under mild conditions, the quantity ∫_a^b √(1 + (f′(x))^2) dx can be interpreted
as the length of the curve y = f(x) for x ∈ [a,b].
To see this, let yi = f(xi), ∆yi = yi − yi−1, and

Σ_{i=1}^n √((∆x)^2 + (∆yi)^2) = Σ_{i=1}^n √(1 + (∆yi/∆x)^2) ∆x
be the sum of lengths of line segments joining successive points:<br />
(x0,y0) → (x1,y1) → · · · → (xn,yn).<br />
If f ′ is continuous and not zero on [a,b] (except possibly at the endpoints), then<br />
∆yi/∆x ≈ f′(xi*) for i = 1, 2, ..., n,

the sum of segment lengths can be approximated as follows,

Σ_{i=1}^n √((∆x)^2 + (∆yi)^2) ≈ Σ_{i=1}^n √(1 + (f′(xi*))^2) ∆x,
and the limit of the Riemann sum on the right is the arc length.<br />
Example 4.27 (Arc Length) Let f(x) = x^3 on the interval [0,4]. To estimate the length
of the curve we let n = 4 and ∆x = 4/4 = 1, as illustrated in the right part of Figure 4.15.<br />
The sum of the lengths of the segments connecting the points (0,0), (1,1), (2,8), (3,27), (4,64)<br />
is 64.525.<br />
The following list of approximate values suggests that the limit is 64.672.<br />
n 4 12 20 28 36 44 52 60 68 76<br />
Sum 64.525 64.659 64.667 64.669 64.67 64.671 64.671 64.671 64.672 64.672<br />
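The segment-length sums can be generated the same way; a sketch (the function name is ours):

```python
def segment_length_sum(f, a, b, n):
    # total length of the line segments joining successive points on y = f(x)
    dx = (b - a) / n
    ys = [f(a + i * dx) for i in range(n + 1)]
    return sum((dx**2 + (ys[i] - ys[i-1])**2) ** 0.5 for i in range(1, n + 1))

print(round(segment_length_sum(lambda x: x**3, 0, 4, 4), 3))     # 64.525
print(round(segment_length_sum(lambda x: x**3, 0, 4, 1000), 3))  # 64.672
```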
4.7.2 Properties of Definite Integrals<br />
Suppose that f and g are integrable on [a,b], and let c be a constant. Then properties of<br />
limits imply the following properties of definite integrals:<br />
1. Constant functions: ∫_a^b c dx = c(b − a).

2. Sums and differences: ∫_a^b (f(x) ± g(x)) dx = ∫_a^b f(x) dx ± ∫_a^b g(x) dx.

3. Constant multiples: ∫_a^b c f(x) dx = c ∫_a^b f(x) dx.

4. Adjacent intervals: ∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx when a < c < b.

5. Null interval: ∫_a^a f(x) dx = 0.

6. Reverse limits of integration: ∫_a^b f(x) dx = −∫_b^a f(x) dx.

7. Ordering: If f(x) ≤ g(x) for all x on [a,b], then ∫_a^b f(x) dx ≤ ∫_a^b g(x) dx.

8. Absolute value: If |f| is integrable, then |∫_a^b f(x) dx| ≤ ∫_a^b |f(x)| dx.
4.7.3 Fundamental Theorem of Calculus<br />
The Fundamental Theorem of Calculus (FTC) relates integration and differentiation, and<br />
gives us a method for computing definite integrals when f(x) is a derivative function.
Theorem 4.28 (Fundamental Theorem of Calculus) Let f be a continuous function<br />
on the closed interval [a,b]. Then<br />
1. If F(x) = ∫_a^x f(t) dt, then F′(x) = f(x) for x ∈ [a,b].

2. If f(x) = F′(x) for all x in [a,b], then ∫_a^b f(x) dx = F(b) − F(a).
The first part of the fundamental theorem says that f(x) is the rate of change of F(x) on<br />
the interval [a,b]. The second part of the theorem says that the definite integral of a rate<br />
of change is the total change over the interval.<br />
Derivatives of integrals with variable endpoints. A useful generalization of the<br />
Fundamental Theorem of Calculus is as follows: If ℓ(x) ≤ u(x) are differentiable functions<br />
(for the lower and upper endpoints), f is continuous, and

G(x) = ∫_{ℓ(x)}^{u(x)} f(t) dt,

then G′(x) = f(u(x)) u′(x) − f(ℓ(x)) ℓ′(x).
Table 4.2: Table of Indefinite Integrals.

∫ c f(x) dx = c ∫ f(x) dx
∫ (f(x) + g(x)) dx = ∫ f(x) dx + ∫ g(x) dx
∫ x^n dx = x^(n+1)/(n + 1) + C (n ≠ −1)
∫ (1/x) dx = ln|x| + C (x ≠ 0)
∫ e^x dx = e^x + C
∫ ln(x) dx = x ln(x) − x + C (x > 0)
∫ sin(x) dx = −cos(x) + C
∫ cos(x) dx = sin(x) + C
∫ sec^2(x) dx = tan(x) + C
∫ csc^2(x) dx = −cot(x) + C
∫ sec(x) tan(x) dx = sec(x) + C
∫ csc(x) cot(x) dx = −csc(x) + C
∫ 1/(1 + x^2) dx = tan^(−1)(x) + C
∫ 1/√(1 − x^2) dx = sin^(−1)(x) + C

4.7.4 Antiderivatives and Indefinite Integrals
The function F(x) in the FTC is called an antiderivative of f(x). Since the derivative of a<br />
constant function is zero, antiderivative functions are not unique. The family of antiderivatives,<br />
known as the indefinite integral of f(x), is often written as follows:<br />
∫ f(x) dx = F(x) + C, where C is a constant (the constant of integration).
If F(x) is an antiderivative of f(x), then the following notation is used when evaluating a<br />
definite integral using the FTC:<br />
∫_a^b f(x) dx = [F(x)]_a^b = F(b) − F(a).
That is, we compute the difference between the function value at the upper endpoint and<br />
the function value at the lower endpoint.<br />
4.7.5 Some Formulas for Indefinite Integrals
A useful first list of indefinite integrals is given in Table 4.2.<br />
Example 4.29 (Computing Area, continued) Since ∫ x^(1/2) dx = (2/3)x^(3/2) + C, the area
under the curve y = √x = x^(1/2) and above the x-axis for x ∈ [0,4] is

∫_0^4 √x dx = [(2/3)x^(3/2)]_0^4 = 16/3.

This answer confirms the earlier result.
Rules for polynomials. If p(x) = an x^n + an−1 x^(n−1) + · · · + a1 x + a0 is a polynomial
function, then information in Table 4.2 and properties of integrals given earlier imply that

∫ p(x) dx = (an/(n + 1)) x^(n+1) + (an−1/n) x^n + · · · + (a2/3) x^3 + (a1/2) x^2 + a0 x + C.

In particular, ∫ a dx = ax + C and ∫ (ax + b) dx = (a/2)x^2 + bx + C.
Example 4.30 (Difference in Areas, continued) Since ∫ (x^3 − 4x) dx = (1/4)x^4 − 2x^2 + C,
the definite integral of f(x) = x^3 − 4x for x ∈ [0,3] is

∫_0^3 (x^3 − 4x) dx = [(1/4)x^4 − 2x^2]_0^3 = 9/4.

This answer confirms the earlier result.
Substitution rule for composite functions. The chain rule for compositions says that

d/dx (f(g(x))) = f′(g(x)) g′(x), where f and g are differentiable functions.
Thus, <br />
is a general rule <strong>for</strong> finding antiderivatives.<br />
f ′ (g(x))g ′ (x) dx = f(g(x)) + C<br />
If we let u = g(x) and du = g ′ (x) dx, then we can rewrite the <strong>for</strong>mula above as<br />
<br />
f ′ (u) du = f(u) + C.<br />
(We substitute u <strong>for</strong> g(x) and du <strong>for</strong> g ′ (x) dx.)<br />
For example, consider finding

∫ x/√(x^2 + 1) dx.

To apply substitution, let u = x^2 + 1. Then du = 2x dx (or x dx = (1/2) du), and the integral becomes

∫ 1/(2√u) du = u^(1/2) + C = √(x^2 + 1) + C.
Compositions when g(x) is linear. If g(x) = ax + b, where a and b are constants and a ≠ 0, then a general rule for finding the indefinite integral is as follows:

∫ f′(ax + b) dx = (1/a) f(ax + b) + C.
Example 4.31 (Average Value, continued) Since ∫ 225e^(0.08x) dx = 2812.5e^(0.08x) + C, the average value of f(x) = 225e^(0.08x) for x ∈ [0,35] is

(1/35) ∫_0^35 225e^(0.08x) dx = (1/35) [2812.5e^(0.08x)]_0^35 ≈ 1241.08.

This answer confirms the earlier result.
Integration by parts. The derivative rule for products says that

(d/dx) (f(x)g(x)) = f′(x)g(x) + f(x)g′(x), where f and g are differentiable functions.

Thus,

∫ (f′(x)g(x) + f(x)g′(x)) dx = f(x)g(x) + C ⇒ ∫ f(x)g′(x) dx = f(x)g(x) − ∫ f′(x)g(x) dx

is a general rule for finding antiderivatives.

If we let u = f(x), du = f′(x) dx, v = g(x), and dv = g′(x) dx, then we can rewrite the formula above as follows:

∫ u dv = uv − ∫ v du.

In integration by parts, the problem of finding the indefinite integral ∫ u dv is replaced by the problem of finding ∫ v du. Care must be taken to choose u and v so that the second integral is easier to compute than the first.
For example, to find ∫ x e^x dx, we let u = x and dv = e^x dx. Then du = dx. Since

∫ dv = ∫ e^x dx = e^x + C

and v can be any antiderivative, we choose v = e^x. Now

∫ u dv = uv − ∫ v du = x e^x − ∫ e^x dx = x e^x − e^x + C.
Similarly, to find ∫ x ln(x) dx, we let u = ln(x) and dv = x dx. Then du = (1/x) dx. Since

∫ dv = ∫ x dx = x^2/2 + C

and v can be any antiderivative, we choose v = x^2/2. Now,

∫ u dv = uv − ∫ v du = (1/2) x^2 ln(x) − ∫ (x^2/2)(1/x) dx = (1/2) x^2 ln(x) − x^2/4 + C.
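As a quick sanity check (ours, not the notes'), differentiating each antiderivative numerically should recover the original integrand; the helper names below are illustrative.

```python
import math

def F(x):   # antiderivative of x e^x found above by parts
    return x * math.exp(x) - math.exp(x)

def G(x):   # antiderivative of x ln(x) found above by parts
    return 0.5 * x**2 * math.log(x) - x**2 / 4

def derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# F'(x) should match x e^x, and G'(x) should match x ln(x).
for x in (0.5, 1.0, 2.0):
    assert abs(derivative(F, x) - x * math.exp(x)) < 1e-6
    assert abs(derivative(G, x) - x * math.log(x)) < 1e-6
```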
4.7.6 Approximation Methods<br />
Finding antiderivatives is not as straightforward as finding derivatives. In fact, in many situations an antiderivative cannot be written in closed form. (For example, the antiderivative of f(x) = e^x/x cannot be expressed in terms of elementary functions.) When a closed-form antiderivative is not available, an approximate method can be used to estimate a definite integral.
Midpoint rule. A natural estimate to use is the value of the Riemann sum

Σ_{i=1}^n f(x*_i) Δx when n is large.

Since x*_i was chosen to be the midpoint of [x_{i−1}, x_i] for each i, this method of estimating the definite integral is called the midpoint rule.
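A minimal implementation of the midpoint rule (an illustrative sketch, not from the notes), applied to f(x) = e^x/x on [1,7], the integral discussed in Example 4.32:

```python
import math

def midpoint_rule(f, a, b, n):
    """Estimate the integral of f on [a, b] with n midpoint rectangles."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

estimate = midpoint_rule(lambda x: math.exp(x) / x, 1, 7, 10_000)
print(round(estimate, 2))   # approximately 189.61
```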
Figure 4.16: Plots of y = e^x/x with trapezoidal (left) and Simpson (right) approximations.
Trapezoidal and Simpson's rules. In the trapezoidal rule, the area under the piecewise linear curve defined by the points

(x_i, f(x_i)) for i = 0, 1, 2, ..., n

is used to estimate the area under the nonnegative curve y = f(x). (Geometrically, a sum of areas of trapezoids replaces a sum of areas of rectangles.)

In Simpson's rule, the area under a piecewise quadratic curve is used to estimate the area under the nonnegative curve y = f(x), where the quadratic polynomial for the ith subinterval passes through the points (x_{i−1}, f(x_{i−1})), (x*_i, f(x*_i)), and (x_i, f(x_i)).
Example 4.32 (Figure 4.16) Let f(x) = e^x/x on the interval [1,7]. The left part of Figure 4.16 shows the setup for the trapezoidal rule, and the right part shows the setup for Simpson's rule, when n = 3 subintervals are used. The approximation in the right plot is significantly better than the one in the left plot. (Note that the numerical value of the integral is 189.61.)
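The two rules are easy to compare in code. The sketch below is ours; the composite Simpson variant uses each subinterval's endpoints and midpoint, matching the description above. With n = 3 it reproduces the qualitative behavior in Figure 4.16.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    dx = (b - a) / n
    return (0.5 * (f(a) + f(b)) + sum(f(a + i * dx) for i in range(1, n))) * dx

def simpson(f, a, b, n):
    """Composite Simpson's rule; each subinterval uses its endpoints and midpoint."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x0, x1 = a + i * dx, a + (i + 1) * dx
        total += (dx / 6) * (f(x0) + 4 * f((x0 + x1) / 2) + f(x1))
    return total

f = lambda x: math.exp(x) / x
t, s = trapezoid(f, 1, 7, 3), simpson(f, 1, 7, 3)
print(t, s)   # Simpson's value is much closer to 189.61 than the trapezoidal value
```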
Newton-Cotes and Gaussian methods. The midpoint rule gives exact answers if f is<br />
a constant function, the trapezoidal rule gives exact answers if f is a constant function or a<br />
linear function, and Simpson’s rule gives exact answers if f is a constant function, a linear<br />
function or a quadratic function. They are examples of Newton-Cotes approximations.<br />
An approximation of the Newton-Cotes type will give exact answers when f is a polynomial function of degree at most a fixed constant. Gaussian quadrature methods take Newton-Cotes methods a step further, by choosing partition points that are not necessarily equally spaced; the spacing is chosen to minimize the approximation error.
Monte Carlo methods. Monte Carlo methods are based on the idea of averaging (see page 79). Let x_1, x_2, ..., x_n be n numbers chosen uniformly from the interval [a,b]. Then a Monte Carlo estimate of ∫_a^b f(x) dx is

(b − a) (1/n) Σ_{i=1}^n f(x_i).
Figure 4.17: Improper integral examples on finite intervals.
Let f(x) = e^x/x and [a,b] = [1,7] (see Figure 4.16). A Monte Carlo estimate of the integral, based on n = 50000 random numbers, was 190.061.
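A sketch of such an estimator (ours, using Python's random module; the seed is arbitrary, so individual runs of this experiment will differ):

```python
import math
import random

def mc_integral(f, a, b, n, seed=0):
    """Monte Carlo estimate of the integral of f over [a, b]:
    (b - a) times the average of f at n uniform random points."""
    rng = random.Random(seed)
    return (b - a) * sum(f(rng.uniform(a, b)) for _ in range(n)) / n

estimate = mc_integral(lambda x: math.exp(x) / x, 1, 7, 50_000)
print(estimate)   # close to 189.61; the error varies with the random sample
```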
Power series methods. Methods based on power series expansions can also be used to<br />
estimate an integral. See Section 6.7.2.<br />
4.7.7 Improper Integrals<br />
The definition of the definite integral can be extended to include situations where f is<br />
undefined at either the lower or upper endpoint of integration, and to include situations<br />
where a = −∞ or b = ∞.<br />
Specifically,<br />
1. If f is integrable on [c,b] for all a < c < b, then

∫_a^b f(x) dx = lim_{c→a+} ∫_c^b f(x) dx

when this limit exists.

2. If f is integrable on [a,c] for all a < c < b, then

∫_a^b f(x) dx = lim_{c→b−} ∫_a^c f(x) dx

when this limit exists.

3. If f is integrable on [a,b] for all b greater than a fixed a, then

∫_a^∞ f(x) dx = lim_{b→∞} ∫_a^b f(x) dx

when this limit exists.

4. If f is integrable on [a,b] for all a less than a fixed b, then

∫_{−∞}^b f(x) dx = lim_{a→−∞} ∫_a^b f(x) dx

when this limit exists.
Figure 4.18: Improper integral examples on infinite intervals.
When the limit exists, the improper integral is said to converge; otherwise, the improper<br />
integral is said to diverge.<br />
Example 4.33 (Improper Integrals on Finite Intervals) Let f(x) = 10/x^2 on the interval (0,2], as illustrated in the left part of Figure 4.17. Since

∫_0^2 f(x) dx = lim_{c→0+} [−10/x]_c^2 = −5 − lim_{c→0+} (−10/c) does not exist,

the improper integral diverges.
Similarly, let f(x) = 10/√(5 − x) on the interval [0,5), as illustrated in the right part of Figure 4.17. Since

∫_0^5 f(x) dx = lim_{c→5−} [−20√(5 − x)]_0^c = lim_{c→5−} (−20√(5 − c) + 20√5) = 0 + 20√5 = 20√5,

the improper integral converges to 20√5 ≈ 44.7214.
Example 4.34 (Improper Integrals on Infinite Intervals) Let f(x) = xe^(−x) on the interval [0, ∞), as illustrated in the left part of Figure 4.18. Since

∫_0^∞ f(x) dx = lim_{c→∞} [−xe^(−x) − e^(−x)]_0^c (using integration by parts)
= lim_{c→∞} (−ce^(−c) − e^(−c)) + 1
= lim_{c→∞} (−c/e^c) − lim_{c→∞} e^(−c) + 1
= lim_{c→∞} (−1/e^c) − 0 + 1 (using L'Hospital's Rule)
= 0 − 0 + 1 = 1,

the improper integral converges to 1.
Similarly, let f(x) = 10/√x on the interval [1, ∞), as illustrated in the right part of Figure 4.18. Since

∫_1^∞ f(x) dx = lim_{c→∞} [20√x]_1^c = lim_{c→∞} (20√c − 20) does not exist,

the improper integral diverges.
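The limit behavior in Example 4.34 can be watched numerically (a sketch of ours, not from the notes): evaluate the antiderivative of xe^(−x) at a growing upper limit c and observe the values approach 1.

```python
import math

def integral_to(c):
    """Value of the definite integral of x e^(-x) on [0, c],
    using the antiderivative -x e^(-x) - e^(-x) found by parts."""
    F = lambda x: -x * math.exp(-x) - math.exp(-x)
    return F(c) - F(0)

for c in (5, 10, 20, 40):
    print(c, integral_to(c))   # the values approach 1 as c grows
```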
Figure 4.19: Standardized binomial distribution with standard normal approximation.
4.8 Applications in Statistics
Integration methods are important in statistics, and, in particular, when working with<br />
density functions. Density functions are used to “smooth over” probability histograms of<br />
discrete random variables, and as functions supporting the theory of continuous random<br />
variables. Continuous random variables are studied in Chapter 5 of these notes.<br />
4.8.1 deMoivre-Laplace Limit Theorem<br />
The most famous example of “smoothing over” is given in the following theorem.<br />
Theorem 4.35 (deMoivre-Laplace Limit Theorem) Let X_n be a binomial random variable based on n trials of a Bernoulli experiment with success probability p, and let

Z_n = (X_n − np)/√(npq), where q = 1 − p,

be the standardized form of X_n. (E(Z_n) = 0 and SD(Z_n) = 1.) Then, for each real number x,

lim_{n→∞} P(Z_n ≤ x) = ∫_{−∞}^x (1/√(2π)) e^(−z^2/2) dz.
The deMoivre-Laplace limit theorem implies that the area under the standard normal density function, φ(z) = (1/√(2π)) e^(−z^2/2), can be used to estimate binomial probabilities when n is large enough. (See Section 5.2.5 for more details on the normal distribution.)
Example 4.36 (Binomial Approximation) Let n = 40 and p = 1/2. In Figure 4.19, the function φ(z) is superimposed on the probability histogram for the standardized random variable Z_40 = (X_40 − 20)/√10.

The probability P(X_40 ≤ 22) can be computed using the binomial PDF:

P(X_40 ≤ 22) = Σ_{x=0}^{22} C(40, x) (1/2)^40 = 0.785205.
Using the normal approximation (with continuity correction) we get

P(X_40 ≤ 22) ≈ Φ((22.5 − 20)/√10) = 0.785402,

where Φ(x) is the area under φ(z) for z ≤ x. The values are close.
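The comparison can be reproduced with a short script (ours, not the notes'; it uses math.comb and the error function, which gives Φ(z) = (1 + erf(z/√2))/2):

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for a binomial random variable, summing the binomial PDF."""
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

exact = binom_cdf(22, 40, 0.5)
approx = Phi((22.5 - 20) / math.sqrt(10))   # continuity correction
print(exact, approx)   # approximately 0.785205 and 0.785402
```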
Stirling's formula. An important component of the proof of the deMoivre-Laplace limit theorem is the use of Stirling's formula for factorials. Stirling proved that

lim_{n→∞} n!/(√(2π) n^(n+1/2) e^(−n)) = 1.

Thus, for large values of n, the formula √(2π) n^(n+1/2) e^(−n) can be used to approximate n!.

4.8.2 Computing Quantiles

If the continuous random variable X has density function f(x), then the cumulative distribution function (CDF) of X can be obtained using integration:

F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt.

If 0 < p < 1, then the pth quantile of the X distribution (when it exists) is the number x_p satisfying the equation

F(x_p) = ∫_{−∞}^{x_p} f(t) dt = p.
Example 4.37 (Pareto Distribution) The Pareto distribution has density function

f(x) = α/x^(α+1) when x > 1 and 0 otherwise,

where α is a positive constant.

Pareto distributions are used to model incomes in a population. The CDF of the Pareto distribution is found using integration. Specifically,

F(x) = ∫_{−∞}^x f(t) dt = ∫_1^x (α/t^(α+1)) dt = [−t^(−α)]_1^x = 1 − x^(−α)

when x > 1 and 0 otherwise. For each p, the pth quantile is

p = F(x_p) ⟹ x_p = (1 − p)^(−1/α).

For example, the quartiles (25th, 50th, and 75th percentiles) of the Pareto distribution with parameter α = 2.5 are 1.122, 1.320, and 1.741, respectively.
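The closed-form quantile makes the quartile computation a one-liner (an illustrative sketch of ours, checking the values quoted above):

```python
def pareto_quantile(p, alpha):
    """pth quantile of the Pareto distribution with CDF F(x) = 1 - x^(-alpha), x > 1."""
    return (1 - p) ** (-1 / alpha)

quartiles = [pareto_quantile(p, 2.5) for p in (0.25, 0.50, 0.75)]
print([round(q, 3) for q in quartiles])   # [1.122, 1.32, 1.741]
```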
Additional examples will be given in the next chapter.<br />
5 Continuous Random Variables<br />
Researchers use random variables to describe the numerical results of experiments. In the<br />
continuous setting, the possible numerical values form an interval or a union of intervals.
For example, a random variable whose values are the positive real numbers might be used<br />
to describe the lifetimes of individuals in a population.<br />
This chapter focuses on continuous random variables and their probability distributions.<br />
5.1 Definitions<br />
Recall that a random variable is a function from the sample space of an experiment to the<br />
real numbers and that the random variable X is said to be continuous if its range is an<br />
interval or a union of intervals.<br />
5.1.1 PDF and CDF for Continuous Distributions
Let X be a continuous random variable. We assume there exists a nonnegative function<br />
f(x) defined over the entire real line such that

P(X ∈ D) = ∫_D f(x) dx,
where D ⊆ R is an interval or union of intervals. The function f(x) is called the probability<br />
density function (PDF) (or density function) of the continuous random variable X.<br />
PDFs satisfy the following properties:<br />
1. f(x) ≥ 0 whenever it exists.<br />
2. ∫_R f(x) dx equals 1, where R is the range of X.
f represents the rate of change of probability; the rate must be nonnegative (property 1).<br />
The area under f represents probability; the total probability must be 1 (property 2).<br />
The cumulative distribution function (CDF) of X, denoted by F(x), is the function

F(x) = P(X ≤ x) = ∫_{(−∞, x]} f(x) dx for all real numbers x.

CDFs satisfy the following properties:

1. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
2. If x_1 ≤ x_2, then F(x_1) ≤ F(x_2).
3. F(x) is continuous.
Figure 5.1: Probability density function (left plot) and cumulative distribution function (right plot) for a continuous random variable with range x ≥ 0.
F represents cumulative probability, with limits 0 and 1 (property 1). Cumulative probability<br />
increases with increasing x (property 2). For continuous random variables, the<br />
cumulative distribution function is continuous (property 3).<br />
Note that if R is the range of the continuous random variable X and I is an interval, then the probability of the event "the value of X is in the interval I" is computed by finding the area under the density curve for x ∈ I ∩ R:

P(X ∈ I) = ∫_{I∩R} f(x) dx.

In particular, if a ∈ R, then P(X = a) = 0 since the area under the curve over an interval of length zero is zero.

Example 5.1 (Distribution on Nonnegative Reals) Let X be a continuous random variable whose range is the nonnegative real numbers and whose PDF is

f(x) = 200/(10 + x)^3 when x ≥ 0 and 0 otherwise.
The left part of Figure 5.1 is a plot of the density function of X and the right part is a plot of its cumulative distribution function:

F(x) = 1 − 100/(10 + x)^2 when x ≥ 0 and 0 otherwise.

Further, for this random variable, the probability that X is greater than 8 is

P(X > 8) = ∫_8^∞ f(x) dx = 1 − F(8) = 100/18^2 ≈ 0.3086.

5.1.2 Quantiles and Percentiles
Assume that 0 < p < 1. The pth quantile (or 100p-th percentile) of the X distribution (when it exists) is the point x_p satisfying the equation

P(X ≤ x_p) = p.
To find x_p, solve the equation F(x) = p for x.
Important special cases of quantiles are as follows:<br />
1. The median of the X distribution: the 50th percentile.
2. The quartiles of the X distribution: the 25th, 50th, and 75th percentiles.
3. The deciles of the X distribution: the 10th, 20th, ..., 90th percentiles.
Interquartile range. The interquartile range (IQR) is the difference between the 75th and 25th percentiles: IQR = x_0.75 − x_0.25.
Example 5.2 (Distribution on Nonnegative Reals, continued) Continuing with the example above, a general formula for the pth quantile is

p = 1 − 100/(10 + x_p)^2 ⇒ x_p = 10/√(1 − p) − 10.

In particular, the quartiles are

x_0.25 = 20/√3 − 10 ≈ 1.547, x_0.50 = 10√2 − 10 ≈ 4.142, and x_0.75 = 10.

The IQR of this distribution is IQR = 20 − 20/√3 ≈ 8.453.
Measures of center and spread. The median is a measure of the center (or location)<br />
of a distribution. The IQR is a measure of the scale (or spread) of a distribution.<br />
Additional measures of center and spread are the mean and standard deviation, respectively.<br />
Definitions <strong>for</strong> continuous random variables are given in Section 5.3.3.<br />
5.2 Families of Distributions<br />
This section briefly describes four families of continuous probability distributions, considers<br />
properties of these distributions, and considers continuous trans<strong>for</strong>mations.<br />
5.2.1 Example: Uniform Distribution

Let a and b be real numbers with a < b. The continuous random variable X is said to be a uniform random variable, or to have a uniform distribution, on the interval [a,b] when its PDF is as follows:

f(x) = 1/(b − a) when a ≤ x ≤ b and 0 otherwise.

Note that the open interval (a,b), or one of the half-closed intervals [a,b) or (a,b], can be used instead of [a,b] as the range of a uniform random variable.
Figure 5.2: PDF (left) and CDF (right) for the uniform random variable on [10,50].
Uniform distributions have constant density over an interval. That density is the reciprocal of the length of the interval. If X is a uniform random variable on the interval [a,b], and [c,d] ⊆ [a,b] is a subinterval, then

P(c ≤ X ≤ d) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a) = Length of [c,d] / Length of [a,b].

Also, P(c < X < d) = P(c < X ≤ d) = P(c ≤ X < d) = (d − c)/(b − a).

If 0 < p < 1, then the pth quantile is x_p = a + p(b − a).
Example 5.3 (a = 10, b = 50) Let X be a uniform random variable on the interval [10,50]. X has median 30 and IQR 20. Further,

P(15 ≤ X ≤ 42) = 27/40 = 0.675.

The PDF and CDF of X are shown in Figure 5.2.
Computer-generated random numbers. Note that computer commands that return "random" numbers in the interval [0,1] are simulating random numbers from the uniform distribution on the interval [0,1].
5.2.2 Example: Exponential Distribution<br />
Let λ be a positive real number. The continuous random variable X is said to be an<br />
exponential random variable, or to have an exponential distribution, with parameter λ when<br />
its PDF is as follows:<br />
f(x) = λe^(−λx) when x ≥ 0 and 0 otherwise.
Note that the interval x > 0 can be used instead of the interval x ≥ 0 as the range of an<br />
exponential random variable.<br />
Figure 5.3: PDF (left) and CDF (right) for the exponential random variable with λ = 4.
Exponential distributions are often used to represent the time that elapses before the occurrence of an event; for example, the time that a machine component will operate before breaking down.

If X is an exponential random variable with parameter λ and [a,b] ⊆ [0, ∞), then

P(a ≤ X ≤ b) = ∫_a^b λe^(−λx) dx = [−e^(−λx)]_a^b = −e^(−λb) + e^(−λa).

If 0 < p < 1, then the pth quantile is x_p = −ln(1 − p)/λ.
Example 5.4 (λ = 4) Let X be an exponential random variable with parameter 4. Using properties of logarithms, X has median ln(2)/4 ≈ 0.173 and IQR ln(3)/4 ≈ 0.275. Further,

P(1/4 ≤ X ≤ 3/4) = −e^(−3) + e^(−1) ≈ 0.318.

The PDF and CDF of X are shown in Figure 5.3.
The exponential distribution is discussed further in Section 6.9.1.<br />
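The quantile formula makes Example 5.4's summary statistics easy to verify (an illustrative sketch of ours, with λ = 4):

```python
import math

lam = 4.0

def exp_cdf(x):
    """CDF of the exponential distribution with parameter lam."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

def exp_quantile(p):
    """pth quantile: -ln(1 - p) / lam."""
    return -math.log(1 - p) / lam

median = exp_quantile(0.5)                      # ln(2)/4, about 0.173
iqr = exp_quantile(0.75) - exp_quantile(0.25)   # ln(3)/4, about 0.275
prob = exp_cdf(0.75) - exp_cdf(0.25)            # e^(-1) - e^(-3), about 0.318
print(median, iqr, prob)
```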
5.2.3 Euler Gamma Function<br />
Let r be a positive real number. The Euler gamma function is defined as follows:

Γ(r) = ∫_0^∞ x^(r−1) e^(−x) dx.
If r is a positive integer, then Γ(r) = (r−1)!. Thus, the gamma function is said to interpolate<br />
the factorials.<br />
Remark. The property Γ(r) = (r−1)! for positive integers can be proven using induction.
To start the induction, you need to demonstrate that Γ(1) = 1. To prove the induction<br />
step, you need to use integration-by-parts to demonstrate that Γ(r + 1) = r × Γ(r).<br />
Figure 5.4: PDF (left) and CDF (right) for the gamma random variable with shape parameter α = 2 and scale parameter β = 5.
5.2.4 Example: Gamma Distribution<br />
Let α and β be positive real numbers. The continuous random variable X is said to be a<br />
gamma random variable, or to have a gamma distribution, with parameters α and β when<br />
its PDF is as follows:<br />
f(x) = (1/(β^α Γ(α))) x^(α−1) e^(−x/β) when x > 0 and 0 otherwise.
Note that if α ≥ 1, then the interval x ≥ 0 can be used instead of the interval x > 0 as the<br />
range of a gamma random variable.<br />
The parameter α is called a shape parameter; gamma distributions with the same value of α have the same shape. The parameter β is called a scale parameter; for fixed α, as β changes, the scales on the vertical and horizontal axes change, but the shape remains the same.
Example 5.5 (α = 2, β = 5) Let X be a gamma random variable with shape parameter 2 and scale parameter 5. Using Newton's method, the median of X is approximately 8.39 and the IQR is approximately 8.65. Further,

P(8 ≤ X ≤ 20) = ∫_8^20 (1/25) x e^(−x/5) dx = [−e^(−x/5) − (1/5) x e^(−x/5)]_8^20 ≈ 0.433.

The PDF and CDF of X are shown in Figure 5.4.
Remarks. The fact that f(x) is a valid PDF can be demonstrated using the method of<br />
substitution and the definition of the Euler gamma function. If the shape parameter α = 1,<br />
then the gamma distribution is the same as the exponential distribution with λ = 1/β.<br />
The gamma distribution is discussed further in Section 6.9.2.<br />
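The probability in Example 5.5 can be re-evaluated from the antiderivative quoted there (a quick check of ours, not part of the notes):

```python
import math

def G(x, beta=5.0):
    """Antiderivative of (1/beta^2) x e^(-x/beta), the gamma PDF with shape 2."""
    return -math.exp(-x / beta) - (x / beta) * math.exp(-x / beta)

prob = G(20) - G(8)     # P(8 <= X <= 20) for shape 2, scale 5
print(round(prob, 3))   # 0.433
```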
5.2.5 Example: Normal Distribution<br />
Let µ be a real number and σ be a positive real number. The continuous random variable<br />
X is said to be a normal random variable, or to have a normal distribution, with parameters<br />
µ and σ when its PDF is as follows:<br />
f(x) = (1/(√(2π) σ)) exp(−(x − µ)^2/(2σ^2)) for all real numbers x,
where exp() is the exponential function. The normal distribution is also called the Gaussian<br />
distribution, in honor of the mathematician Carl Friedrich Gauss.<br />
The graph of the PDF of a normal random variable is the famous bell-shaped curve. The<br />
parameter µ is called the mean (or center) of the normal distribution, since the graph of<br />
the PDF is symmetric around x = µ. The median of the normal distribution is µ. The<br />
parameter σ is called the standard deviation (or spread) of the normal distribution. The<br />
interquartile range of the normal distribution is (approximately) 1.35σ.<br />
Normal distributions have many applications. For example, normal random variables can<br />
be used to model measurements of manufactured items made under strict controls; normal<br />
random variables can be used to model physical measurements (e.g. height, weight, blood<br />
values) in homogeneous populations.<br />
Note that the PDF of the normal distribution cannot be integrated in closed form. Most
computer programs provide functions to compute probabilities and quantiles of the normal<br />
distribution.<br />
Standard normal distribution. The continuous random variable Z is said to be a<br />
standard normal random variable, or to have a standard normal distribution, when Z is a<br />
normal random variable with µ = 0 and σ = 1. The PDF and CDF of Z have special<br />
symbols:<br />
φ(z) = (1/√(2π)) e^(−z^2/2) and Φ(z) = ∫_{−∞}^z φ(t) dt, where z is a real number.
The notation z_p is used for the pth quantile of the Z distribution.
Working with tables. Table 5.1 gives values of Φ(z) for z ≥ 0, where
z = Row Value + Column Value.<br />
For z < 0, use Φ(z) = 1 − Φ(−z). For example,<br />
1. Φ(1.25) = 0.894 (row 1.2, column 0.05).<br />
2. Φ(−0.70) = 1 − Φ(0.70) = 1 − 0.758 = 0.242 (row 0.7, column 0.00).<br />
3. z0.25 ≈ −0.67 and z0.75 ≈ 0.67 since Φ(0.67) = 0.749 (row 0.6, column 0.07).<br />
The graph shown in the table illustrates Φ(z) as an area.<br />
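Software replaces the table lookup. A sketch of ours using math.erf, since Φ(z) = (1 + erf(z/√2))/2, reproduces the three examples above:

```python
import math

def Phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(Phi(1.25), 3))    # 0.894, matching row 1.2, column 0.05
print(round(Phi(-0.70), 3))   # 0.242, consistent with 1 - Phi(0.70)
print(round(Phi(0.67), 3))    # 0.749, so z_0.75 is approximately 0.67
```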
Table 5.1: Cumulative probabilities of the standard normal random variable<br />
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09<br />
0.0 0.500 0.504 0.508 0.512 0.516 0.520 0.524 0.528 0.532 0.536<br />
0.1 0.540 0.544 0.548 0.552 0.556 0.560 0.564 0.567 0.571 0.575<br />
0.2 0.579 0.583 0.587 0.591 0.595 0.599 0.603 0.606 0.610 0.614<br />
0.3 0.618 0.622 0.626 0.629 0.633 0.637 0.641 0.644 0.648 0.652<br />
0.4 0.655 0.659 0.663 0.666 0.670 0.674 0.677 0.681 0.684 0.688<br />
0.5 0.691 0.695 0.698 0.702 0.705 0.709 0.712 0.716 0.719 0.722<br />
0.6 0.726 0.729 0.732 0.736 0.739 0.742 0.745 0.749 0.752 0.755<br />
0.7 0.758 0.761 0.764 0.767 0.770 0.773 0.776 0.779 0.782 0.785<br />
0.8 0.788 0.791 0.794 0.797 0.800 0.802 0.805 0.808 0.811 0.813<br />
0.9 0.816 0.819 0.821 0.824 0.826 0.829 0.831 0.834 0.836 0.839<br />
1.0 0.841 0.844 0.846 0.848 0.851 0.853 0.855 0.858 0.860 0.862<br />
1.1 0.864 0.867 0.869 0.871 0.873 0.875 0.877 0.879 0.881 0.883<br />
1.2 0.885 0.887 0.889 0.891 0.893 0.894 0.896 0.898 0.900 0.901<br />
1.3 0.903 0.905 0.907 0.908 0.910 0.911 0.913 0.915 0.916 0.918<br />
1.4 0.919 0.921 0.922 0.924 0.925 0.926 0.928 0.929 0.931 0.932<br />
1.5 0.933 0.934 0.936 0.937 0.938 0.939 0.941 0.942 0.943 0.944<br />
1.6 0.945 0.946 0.947 0.948 0.949 0.951 0.952 0.953 0.954 0.954<br />
1.7 0.955 0.956 0.957 0.958 0.959 0.960 0.961 0.962 0.962 0.963<br />
1.8 0.964 0.965 0.966 0.966 0.967 0.968 0.969 0.969 0.970 0.971<br />
1.9 0.971 0.972 0.973 0.973 0.974 0.974 0.975 0.976 0.976 0.977<br />
2.0 0.977 0.978 0.978 0.979 0.979 0.980 0.980 0.981 0.981 0.982<br />
2.1 0.982 0.983 0.983 0.983 0.984 0.984 0.985 0.985 0.985 0.986<br />
2.2 0.986 0.986 0.987 0.987 0.987 0.988 0.988 0.988 0.989 0.989<br />
2.3 0.989 0.990 0.990 0.990 0.990 0.991 0.991 0.991 0.991 0.992<br />
2.4 0.992 0.992 0.992 0.992 0.993 0.993 0.993 0.993 0.993 0.994<br />
2.5 0.994 0.994 0.994 0.994 0.994 0.995 0.995 0.995 0.995 0.995<br />
2.6 0.995 0.995 0.996 0.996 0.996 0.996 0.996 0.996 0.996 0.996<br />
2.7 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997<br />
2.8 0.997 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.998<br />
2.9 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.999 0.999 0.999<br />
3.0 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />
3.1 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />
3.2 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />
Figure 5.5: Probability density functions of a continuous random variable with values on 0 < x < 2 (left plot) and of its reciprocal with values on y > 0.5 (right plot).
Nonstandard normal distributions. If X is a normal random variable with mean µ and standard deviation σ, then

P(X ≤ x) = Φ((x − µ)/σ),

where Φ() is the CDF of the standard normal random variable, and

x_p = µ + z_p σ,

where z_p is the pth quantile of the standard normal distribution.
Example 5.6 (µ = 100, σ = 20) Let X be a normal random variable with mean 100 and standard deviation 20. The median of X is 100 and the IQR is 20(z_0.75 − z_0.25) ≈ 26.8. Further,

P(60 ≤ X ≤ 110) = Φ((110 − 100)/20) − Φ((60 − 100)/20) ≈ 0.668.
5.2.6 Transforming Continuous Random Variables

If X and Y = g(X) are continuous random variables, then the CDF of X can be used to determine the CDF and PDF of Y.
Example 5.7 (Reciprocal Transformation) Let X be a continuous random variable whose range is the open interval (0,2) and whose PDF is

f_X(x) = x/2 when 0 < x < 2 and 0 otherwise,

and let Y equal the reciprocal of X, Y = 1/X. Then for y > 1/2,

P(Y ≤ y) = P(1/X ≤ y) = P(X ≥ 1/y) = ∫_{1/y}^2 (x/2) dx = 1 − 1/(4y^2)
and (d/dy) P(Y ≤ y) = 1/(2y^3). Thus, the PDF of Y is

f_Y(y) = 1/(2y^3) when y > 1/2 and 0 otherwise,

and the CDF of Y is

F_Y(y) = 1 − 1/(4y^2) when y > 1/2 and 0 otherwise.

The left part of Figure 5.5 is a plot of the density function of X, and the right part is a plot of the density function of Y.
Example 5.8 (Square Transformation) Let Z be the standard normal random variable and W be the square of Z, W = Z^2. Then for w > 0,

P(W ≤ w) = P(Z^2 ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w).

Since the graph of the PDF of Z is symmetric around z = 0, Φ(−z) = 1 − Φ(z) for every z. Thus, the CDF of W is

F_W(w) = 2Φ(√w) − 1 when w > 0 and 0 otherwise,

and the PDF of W is

f_W(w) = (d/dw) F_W(w) = (1/√(2πw)) e^(−w/2) when w > 0 and 0 otherwise.

(Note that W = Z^2 is said to have a chi-square distribution with one degree of freedom.)
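A simulation check (ours; the seed and sample size are arbitrary): squaring standard normal draws should produce an empirical CDF that tracks F_W(w) = 2Φ(√w) − 1.

```python
import math
import random

rng = random.Random(1)
n = 200_000
w_samples = [rng.gauss(0, 1) ** 2 for _ in range(n)]   # draws of W = Z^2

def F_W(w):
    """CDF of W = Z^2 derived above."""
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 2 * Phi(math.sqrt(w)) - 1

for w in (0.5, 1.0, 2.0):
    empirical = sum(s <= w for s in w_samples) / n
    print(w, empirical, F_W(w))   # empirical and exact CDF values agree closely
```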
Monotone transformations. If the differentiable transformation g is strictly monotone (either always increasing or always decreasing) on the range of X, Y = g(X), and y is in the range of Y, then the PDF of Y has the following form:

f_Y(y) = f_X(x) / |dy/dx|, where y = g(x) (and x = g^(−1)(y)).
Example 5.9 (Reciprocal Transformation, continued) In the reciprocal transformation example above, y = g(x) = 1/x, x = g^(−1)(y) = 1/y (the inverse of the reciprocal function is the reciprocal function), and

f_Y(y) = f_X(g^(−1)(y)) / |−1/(g^(−1)(y))^2| = (1/(2y)) / |−y^2| = 1/(2y^3) when y > 0.5.
2y3 Linear trans<strong>for</strong>mations. Linear trans<strong>for</strong>mations with nonzero slope are examples of<br />
differentiable monotone trans<strong>for</strong>mations. In particular, if Y = aX + b where a = 0 and y is<br />
in the range of Y , then<br />
fY (y) = 1<br />
|a| fX<br />
<br />
1<br />
(y − b) .<br />
a<br />
100
Example 5.10 (Normal Distribution) If X is a normal random variable with mean µ
and standard deviation σ and Z is the standard normal random variable with PDF φ(z),
then X = σZ + µ with PDF

    fX(x) = (1/σ) φ((x − µ)/σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) for all x.
This <strong>for</strong>mula <strong>for</strong> the PDF agrees with the one given earlier.<br />
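The agreement between the two forms of the normal PDF can be verified directly. This sketch is not from the text; the function names are mine:

```python
import math

def phi(z):
    # Standard normal PDF.
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def normal_pdf(x, mu, sigma):
    # PDF of X = sigma*Z + mu via the linear-transformation rule (1/|a|) f((y-b)/a).
    return phi((x - mu) / sigma) / sigma

def direct(x, mu, sigma):
    # The usual closed-form normal density.
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

for x in (-1.0, 0.0, 2.5):
    assert abs(normal_pdf(x, 1.0, 2.0) - direct(x, 1.0, 2.0)) < 1e-12
print(normal_pdf(1.0, 1.0, 2.0))  # value at the mean, about 0.1995
```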
5.3 Mathematical Expectation

Mathematical expectation generalizes the idea of a weighted average, where probability
distributions are used as the weights. (See Section 2.3 also.)
5.3.1 Definitions for Continuous Random Variables

Let X be a continuous random variable with range R and PDF f(x). The mean of X (or
expected value of X or expectation of X) is defined as follows:

    E(X) = ∫_{x∈R} x f(x) dx

provided that ∫_{x∈R} |x| f(x) dx < ∞ (i.e., the integral converges absolutely).

Similarly, if g(X) is a real-valued function of X, then the mean of g(X) (or expected value
of g(X) or expectation of g(X)) is

    E(g(X)) = ∫_{x∈R} g(x) f(x) dx

provided that ∫_{x∈R} |g(x)| f(x) dx < ∞.
Note that the absolute convergence of an integral is not guaranteed. In cases where the
integral with absolute values diverges, we say that the expectation is indeterminate.
Example 5.11 (Triangular Distribution) Let X be a continuous random variable whose
range is the interval [1, 4] and whose PDF is as follows:

    f(x) = (2/9)(x − 1) when 1 ≤ x ≤ 4 and 0 otherwise,

and let g(X) = X² be the square of X. Then

    E(X) = ∫_{1}^{4} x f(x) dx = ∫_{1}^{4} (2/9) x(x − 1) dx = 3

and

    E(g(X)) = ∫_{1}^{4} x² f(x) dx = ∫_{1}^{4} (2/9) x²(x − 1) dx = 19/2.

Note that E(g(X)) ≠ g(E(X)).
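Both expectations can be approximated by simple numerical integration. The sketch below is mine (not from the text) and uses the midpoint rule:

```python
def expectation(g, f, a, b, n=100_000):
    # Midpoint-rule approximation of E[g(X)] = integral of g(x)*f(x) over [a, b].
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        total += g(x) * f(x)
    return total * h

f = lambda x: (2.0 / 9.0) * (x - 1.0)            # triangular PDF on [1, 4]
EX  = expectation(lambda x: x,     f, 1.0, 4.0)  # should be near 3
EX2 = expectation(lambda x: x * x, f, 1.0, 4.0)  # should be near 9.5
print(EX, EX2)  # note that EX2 differs from EX**2, i.e., E(g(X)) != g(E(X))
```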
Table 5.2: Means and variances for four families of continuous distributions.

    Distribution      E(X)        Var(X)
    Uniform a, b      (a + b)/2   (b − a)²/12
    Exponential λ     1/λ         1/λ²
    Gamma α, β        αβ          αβ²
    Normal µ, σ       µ           σ²
Example 5.12 (Indeterminate Mean) Let X be a continuous random variable whose
range is x ≥ 1 and whose PDF is as follows:

    f(x) = 1/x² when x ≥ 1 and 0 otherwise.

Since ∫_{1}^{∞} x f(x) dx does not converge, the expectation of X is indeterminate.
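The divergence can be seen numerically: the truncated integral ∫_1^B x f(x) dx equals ln(B), which grows without bound. The sketch below is mine, using the same midpoint rule idea:

```python
import math

def truncated_mean(B, n=100_000):
    # Midpoint approximation of integral from 1 to B of x * (1/x^2) dx = ln(B).
    h = (B - 1.0) / n
    return sum(1.0 / (1.0 + (i + 0.5) * h) for i in range(n)) * h

for B in (10.0, 100.0, 1000.0):
    print(B, truncated_mean(B))  # grows like ln(B), without bound
```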
5.3.2 Properties of Expectation

The properties of expectation stated in Section 2.3.2 for discrete random variables are also
true for continuous random variables.

5.3.3 Mean, Variance and Standard Deviation

Let X be a random variable and let µ = E(X) be its mean. The variance of X, Var(X), is
defined as follows:

    Var(X) = E((X − µ)²).

The notation σ² = Var(X) is used to denote the variance. The standard deviation of X,
σ = SD(X), is the positive square root of the variance.

These summary measures (mean, variance, standard deviation) have the same properties
as those stated in the discrete case in Section 2.3.3.

Table 5.2 gives general formulas for the mean and variance of the four families of continuous
probability distributions discussed earlier.

5.3.4 Chebyshev's Inequality

Chebyshev's inequality, stated in Section 2.3.4 for discrete random variables, remains true
in the continuous case. The proof (with integration replacing summation) is also the same.
6 Infinite Sequences and Series

An infinite sequence (or sequence) is a real-valued function whose domain is the positive
integers. We write a sequence as a list of numbers in a definite order

    a1, a2, a3, ..., an, ... ,

where a1 is the first term, a2 is the second term, etc. The notations

    {a1, a2, a3, ...} or {an} or {an}_{n=1}^∞

are used to denote sequences. In addition, sequences may start with n = 0 (or with some
other integer).

If {an}_{n=1}^∞ is a sequence, then the formal sum

    ∑_{n=1}^∞ an = a1 + a2 + a3 + · · ·

is called an infinite series (or series). The numbers a1, a2, ... are the terms of the series.
6.1 Limits and Convergence of Infinite Sequences

Let {an} be an infinite sequence. We say that {an} has limit L, and write

    lim_{n→∞} an = L or an → L as n → ∞,

if for every positive real number ε there exists an integer M satisfying

    |an − L| < ε when n > M.

If the sequence {an} has limit L, then the sequence converges (or is convergent). Otherwise,
the sequence diverges (or is divergent).
Example 6.1 (Convergent and Divergent Sequences) Since

    lim_{n→∞} (n² + 1)/(5n² + 7) = lim_{n→∞} (1 + 1/n²)/(5 + 7/n²) = 1/5,

the sequence {(n² + 1)/(5n² + 7)} converges to 1/5. Similarly, the sequence {(−0.5)^n}
converges to 0.

By contrast, the sequence {ln(n)} diverges. (In fact, the values grow large without bound.)
6.1.1 Convergence Properties

Assume that {an} converges to L and {bn} converges to M. Then

1. Constant multiples: {c an} converges to cL, where c is a constant.

2. Sums and differences: {an ± bn} converges to L ± M.

3. Products: {an bn} converges to LM.

4. Quotients: If bn ≠ 0 for all n and M ≠ 0, then {an/bn} converges to L/M.

5. Convergence to zero: {an} converges to 0 if and only if {|an|} converges to 0.

6. Squeezing principle: If an ≤ bn ≤ cn for all n ≥ n0, where n0 is a constant, and

       lim_{n→∞} an = lim_{n→∞} cn = L,

   then {bn} converges to L as well.
Example 6.2 (Squeezing Principle) Consider the sequence {sin²(n)/n}. Since

    0 ≤ sin²(n)/n ≤ 1/n and lim_{n→∞} 1/n = 0,

the sequence {sin²(n)/n} converges to 0 as well.

6.1.2 Geometric Sequences

Consider the sequence {r^n} where r is a fixed constant (called the ratio of the geometric
sequence). Then

    {r^n} converges to 0 when |r| < 1 and converges to 1 when r = 1.

In all other cases, the sequence diverges.
6.1.3 Bounded Sequences

The sequence {an} is said to be bounded above if there is a number M satisfying

    an ≤ M for all n ≥ 1.

Similarly, {an} is said to be bounded below if there is a number m satisfying

    an ≥ m for all n ≥ 1.

The sequence {an} is said to be bounded if it is bounded above and below. Otherwise, {an}
is said to be unbounded.

Theorem 6.3 (Bounded Sequence Theorem) Let {an} be a sequence.

1. If {an} converges, then it is bounded.

2. If {an} is unbounded, then it diverges.

Every convergent sequence is bounded, but not every bounded sequence is convergent. For
example, the sequence {(−1)^n} is bounded (since −1 ≤ an ≤ 1 for all n) but not convergent.
6.1.4 Monotone Sequences

The sequence {an} is said to be increasing if

    a1 < a2 < · · · < an < · · · .

Similarly, {an} is said to be decreasing if

    a1 > a2 > · · · > an > · · · .

{an} is said to be monotone if it is either increasing or decreasing.

Theorem 6.4 (Monotone Convergence Theorem) If {an} is a bounded and monotone
sequence, then {an} converges.

Example 6.5 (Convergent Monotone Sequences) Since 0 ≤ 1 − 1/n ≤ 1 for all n ≥ 1
and the terms are increasing, the sequence {1 − 1/n}_{n=1}^∞ converges. In fact, the limit
is 1.

Similarly, since 0 ≤ sin(1/n) ≤ sin(1) for all n ≥ 1 and the terms are decreasing, the
sequence {sin(1/n)}_{n=1}^∞ converges. In fact, the limit is 0.

Note. The proof of the Monotone Convergence Theorem uses the fact that sets of real
numbers that are bounded above have a least upper bound (or supremum), and that sets of
real numbers that are bounded below have a greatest lower bound (or infimum).
6.2 Partial Sums and Convergence of Infinite Series

Let ∑_{n=1}^∞ an be an infinite series. The sequence with terms

    sn = ∑_{i=1}^{n} ai, n = 1, 2, 3, ...

is known as the sequence of partial sums of the series.

If the sequence of partial sums {sn} converges to L, then the series converges to L:

    ∑_{n=1}^∞ an = lim_{n→∞} sn = L

and L is said to be the sum of the series. If the sequence of partial sums {sn} diverges,
then the series is said to diverge.

Example 6.6 (Convergent Series) The series ∑_{n=1}^∞ 1/(n(n + 1)) converges to 1.

To see this, note that 1/(n(n + 1)) = 1/n − 1/(n + 1) for n ≥ 1. Thus,

    sn = (1 − 1/2) + (1/2 − 1/3) + · · · + (1/n − 1/(n + 1)) = 1 − 1/(n + 1)

(all middle terms cancel) and sn → 1 as n → ∞.
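The telescoping identity can be checked with exact rational arithmetic. The sketch below is mine, using the standard library's Fraction type:

```python
from fractions import Fraction

def s(n):
    # n-th partial sum of sum 1/(k(k+1)), computed exactly.
    return sum(Fraction(1, k * (k + 1)) for k in range(1, n + 1))

for n in (10, 100):
    assert s(n) == 1 - Fraction(1, n + 1)  # telescoping: s_n = 1 - 1/(n+1)
print(float(s(100)))  # 0.990099..., approaching 1
```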
6.2.1 Convergence Properties

Assume that ∑_{n=1}^∞ an = L and ∑_{n=1}^∞ bn = M. Then

1. Constant multiples: ∑_{n=1}^∞ c an = c ∑_{n=1}^∞ an = cL, where c is a constant.

2. Sums and differences: ∑_{n=1}^∞ (an ± bn) = ∑_{n=1}^∞ an ± ∑_{n=1}^∞ bn = L ± M.

3. Convergence to zero: If the series ∑_{n=1}^∞ an converges, then {an} converges to 0.

Note that, if the sequence of terms of a series does not have limit zero, then the series must
diverge. For example,

    ∑_{n=1}^∞ (n² + 1)/(5n² + 7) diverges since lim_{n→∞} (n² + 1)/(5n² + 7) = 1/5 ≠ 0.

(The sequence of partial sums is unbounded, since sn+1 ≈ sn + 1/5 when n is large.)

6.2.2 Geometric Series
Consider the series ∑_{n=1}^∞ a r^(n−1) where a and r are fixed constants. Then the n-th
partial sum of the geometric series is

    sn = a + ar + ar² + · · · + a r^(n−1) = a(1 − r^n)/(1 − r) when r ≠ 1

(and sn = an when r = 1). Further, the series converges when |r| < 1 and its sum is

    ∑_{n=1}^∞ a r^(n−1) = a/(1 − r) (|r| < 1).

If |r| ≥ 1, the series diverges.
Example 6.7 (a = −10/27, r = −2/3) The series ∑_{n=1}^∞ 5(−2)^n/3^(n+2) converges
to −2/9, since

    ∑_{n=1}^∞ 5(−2)^n/3^(n+2) = −10/27 + 20/81 − 40/243 + 80/729 ∓ · · ·
                              = (−10/27)(1 − 2/3 + 4/9 − 8/27 ± · · ·)
                              = (−10/27)/(1 − (−2/3)) = −2/9.
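The partial sums can be computed exactly to confirm the limit −2/9. This sketch is mine, not part of the text:

```python
from fractions import Fraction

def partial(n):
    # Partial sum of sum_{k=1}^{n} 5*(-2)^k / 3^(k+2), computed exactly.
    return sum(Fraction(5 * (-2) ** k, 3 ** (k + 2)) for k in range(1, n + 1))

print(float(partial(40)))  # close to -2/9 = -0.2222...
```

Since |r| = 2/3 < 1, the tail shrinks geometrically, so 40 terms already agree with −2/9 to many decimal places.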
6.3 Convergence Tests for Positive Series

Let ∑_{n=1}^∞ an and ∑_{n=1}^∞ bn be series with positive terms. Then

1. Comparison test: Assume that an ≤ bn for all n ≥ n0. Then

   (a) If ∑_{n=1}^∞ bn converges, then ∑_{n=1}^∞ an converges.

   (b) If ∑_{n=1}^∞ an diverges, then ∑_{n=1}^∞ bn diverges.

2. Integral test: If f(x) is a decreasing function with f(n) = an, then

       ∑_{n=1}^∞ an and ∫_{1}^{∞} f(x) dx both converge or both diverge.

3. Ratio test: Assume that (an+1/an) → L as n → ∞.

   (a) If 0 ≤ L < 1, then ∑_{n=1}^∞ an converges.

   (b) If L > 1, then ∑_{n=1}^∞ an diverges.

4. Root test: Assume that ⁿ√an → L as n → ∞.

   (a) If 0 ≤ L < 1, then ∑_{n=1}^∞ an converges.

   (b) If L > 1, then ∑_{n=1}^∞ an diverges.

5. Limit comparison test: If {an/bn} converges to L > 0, then

       ∑_{n=1}^∞ an and ∑_{n=1}^∞ bn both converge or both diverge.

The comparison test follows from the Monotone Convergence Theorem since the sequences
of partial sums are monotone increasing: if the b-series converges, then its sum is an upper
bound for the sequence of partial sums of the a-series; if the a-series diverges, then the
sequence of partial sums of the b-series must be unbounded.

The ratio and root tests apply to series that are "sufficiently like" geometric series. In each
case, if L = 1, then no conclusion can be drawn. The limit comparison test applies to series
that are approximately constant multiples of one another.
Example 6.8 (p Series) The integral test can be used to demonstrate that

    ∑_{n=1}^∞ 1/n^p converges if and only if p > 1.

To see this, let f(x) = 1/x^p = x^(−p). If p > 1, then

    ∫_{1}^{∞} f(x) dx = [x^(−p+1)/(−p + 1)]_{1}^{∞}
                      = lim_{x→∞} x^(−p+1)/(−p + 1) − 1/(−p + 1) = 0 − 1/(−p + 1) = 1/(p − 1).
(The limit is 0 since x has a negative exponent.) Since the improper integral converges, so
does the series.

If p < 1, then the improper integral diverges since x has a positive exponent in the limit
above. If p = 1, then the antiderivative of f(x) is ln(x) and again the improper integral
diverges.
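The contrast between p > 1 and p = 1 is easy to see numerically. The sketch below is mine; for p = 2 the partial sums settle quickly, while for p = 1 (the harmonic series) doubling the number of terms keeps adding roughly ln(2):

```python
def p_partial(p, n):
    # n-th partial sum of the p-series sum 1/k^p.
    return sum(1.0 / k ** p for k in range(1, n + 1))

# p = 2 (> 1): the partial sums barely move between n = 1000 and n = 2000.
print(p_partial(2, 1000), p_partial(2, 2000))   # both near 1.64
# p = 1: the harmonic partial sums grow like ln(n), without bound.
print(p_partial(1, 1000), p_partial(1, 2000))   # about 7.49 vs 8.18
```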
Example 6.9 (Convergent Exponential Series) An important application of the ratio
test is to the series

    ∑_{n=0}^∞ x^n/n! where x is a positive real number.

Since the ratio

    an+1/an = (x^(n+1)/(n + 1)!)/(x^n/n!) = x/(n + 1) → 0 as n → ∞,

the series above converges. In addition, since the series converges, the sequence of terms
must approach 0. That is, x^n/n! → 0 as n → ∞.
Example 6.10 (Comparison Test) The comparison test can be used to demonstrate
that

    ∑_{n=0}^∞ 10 sin²(n)/n! converges and that ∑_{n=1}^∞ 4/√(5n − 1) diverges.

In the first case, let an = 10 sin²(n)/n! and bn = 10/n!. The ratio test can be used to
demonstrate that the b-series converges (as in the last example). Thus, the a-series
converges.

In the second case, let an = 4/√(5n) and bn = 4/√(5n − 1). The integral test can be used
to demonstrate that the a-series diverges (as in the example above). Thus, the b-series
diverges.
Example 6.11 (Limit Comparison Test) The limit comparison test can be used to
demonstrate that

    ∑_{n=1}^∞ (4n + 3)/(n³ + 7n − 4) converges and that ∑_{n=1}^∞ (n + 5)/(3n² + 2n + 2) diverges.

In the first case, let an = (4n + 3)/(n³ + 7n − 4) and bn = 1/n². Since

    lim_{n→∞} an/bn = lim_{n→∞} (4n³ + 3n²)/(n³ + 7n − 4) = lim_{n→∞} (4 + 3/n)/(1 + 7/n² − 4/n³) = 4

and the b-series converges (by a previous example), the a-series converges.

In the second case, let an = (n + 5)/(3n² + 2n + 2) and bn = 1/n. Since

    lim_{n→∞} an/bn = lim_{n→∞} (n² + 5n)/(3n² + 2n + 2) = lim_{n→∞} (1 + 5/n)/(3 + 2/n + 2/n²) = 1/3

and the b-series diverges (by the same example), the a-series diverges.
Example 6.12 (Root Test) The root test can be used to demonstrate that

    ∑_{n=1}^∞ ((n + 5)/(2n))^n converges.

Specifically, since

    ⁿ√an = (n + 5)/(2n) = (1 + 5/n)/2 → 1/2 as n → ∞

and the limit is less than 1, the series converges. (For large n, the terms of the series are
like those of a convergent geometric series with r = 1/2.)
Remarks. The convergence tests above can be used to determine if a positive series
converges, but tell us nothing about the sum of the series. In most cases, finding an exact
sum is challenging. In Section 6.6, we demonstrate that

    ∑_{n=0}^∞ x^n/n! = e^x for all x.

Harmonic series. The divergent series ∑_{n=1}^∞ 1/n is called the harmonic series. Although
the terms of the harmonic series approach 0, the series diverges by the integral test.
6.4 Alternating Series and Error Analysis

An alternating series is a series whose terms alternate in sign. The usual notation is

    ∑_{n=1}^∞ (−1)^(n+1) an = a1 − a2 + a3 − a4 ± · · · ,

where an > 0 for each n, for series starting with positive terms. For example,

    ∑_{n=1}^∞ (−1)^(n+1)/2^n = 1/2 − 1/4 + 1/8 − 1/16 ± · · ·

is an alternating series starting with a positive term.

A simple convergence test for alternating series is based on the following theorem.

Theorem 6.13 (Alternating Series Theorem) Assume that an+1 ≤ an for each n and
that an → 0 as n → ∞. Then

1. ∑_{n=1}^∞ (−1)^(n+1) an and ∑_{n=1}^∞ (−1)^n an are convergent series.

2. If sn is the n-th partial sum and L is the sum of one of these series, then

       |sn − L| ≤ an+1 for each n.
The partial sums of an alternating series are alternately larger and smaller than
the limit, L. When the first term is positive, the even partial sums form an increasing
sequence and the odd partial sums form a decreasing sequence:

    s2 ≤ s4 ≤ s6 ≤ s8 ≤ s10 ≤ · · · ≤ L ≤ · · · ≤ s9 ≤ s7 ≤ s5 ≤ s3 ≤ s1.

When the first term is negative, the even partial sums are decreasing and the odd partial
sums are increasing. Each subsequence of partial sums must converge, since each is bounded
and monotone. Further, since

    |sn − sn−1| = an → 0 as n → ∞,

the limits of the subsequences must be equal.
Example 6.14 (Alternating Harmonic Series) The series

    ∑_{n=1}^∞ (−1)^(n+1)/n = 1 − 1/2 + 1/3 − 1/4 ± · · ·

converges since the sequence {1/n} is decreasing with limit 0.

To estimate the sum of the alternating harmonic series to within 2 decimal places, we need
to use 200 or more terms since

    |sn − L| ≤ an+1 = 1/(n + 1) < 0.005 =⇒ n > 199.

Our estimate is s200 ≈ 0.690653. (In fact, L = ln(2). See Section 6.6.)
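The estimate and the error bound can both be checked directly. This sketch is mine, not part of the text:

```python
import math

def s(n):
    # n-th partial sum of the alternating harmonic series.
    return sum((-1) ** (k + 1) / k for k in range(1, n + 1))

print(round(s(200), 6))                        # about 0.690653
print(abs(s(200) - math.log(2)), 1 / 201)      # actual error vs the bound a_{n+1}
```

The actual error is well inside the guaranteed bound 1/(n + 1), as the theorem promises.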
Example 6.15 (Alternating p Series) More generally, the series

    ∑_{n=1}^∞ (−1)^(n+1)/n^p = 1 − 1/2^p + 1/3^p − 1/4^p ± · · ·

converges for all p > 0 since {1/n^p} is decreasing with limit 0.

To estimate the sum when p = 3 to within 2 decimal places, for example, we need to use 5
or more terms since

    |sn − L| ≤ an+1 = 1/(n + 1)³ < 0.005 =⇒ n > 4.84804.

Our estimate is s5 ≈ 0.904412.
6.5 Absolute and Conditional Convergence of Series

The following theorem is used to study convergence properties of general series.

Theorem 6.16 (Absolute Convergence Theorem) If the series ∑_{n=1}^∞ |an| converges,
then the series ∑_{n=1}^∞ an converges as well.

Thus, tests for positive series can be used to determine if the series with terms |an| converges.
If that series converges, then we can conclude that the series whose terms are an converges
as well. Note, however, that if the series with terms |an| diverges, then we cannot draw any
conclusion about the series with terms an.

Let ∑_{n=1}^∞ an be a convergent series. If the series ∑_{n=1}^∞ |an| converges, then
∑_{n=1}^∞ an is said to be an absolutely convergent series. Otherwise, ∑_{n=1}^∞ an is said
to be a conditionally convergent series.
Example 6.17 (Alternating p Series, continued) The series

    ∑_{n=1}^∞ (−1)^(n+1)/n^p = 1 − 1/2^p + 1/3^p − 1/4^p ± · · ·

is absolutely convergent when p > 1 (by the integral test) and conditionally convergent
when 0 < p ≤ 1 (by the integral and alternating series tests).

Example 6.18 (Exponential Series, continued) The series

    ∑_{n=0}^∞ x^n/n! = 1 + x + x²/2 + x³/6 + x⁴/24 + · · ·

converges for all real numbers x. Specifically, if x = 0, the sum is 1. Otherwise, the series
converges absolutely by the ratio test (see Example 6.9).
Example 6.19 (Power Series) The series

    ∑_{n=1}^∞ x^n/n = x + x²/2 + x³/3 + x⁴/4 + · · ·

converges absolutely when |x| < 1, converges conditionally when x = −1, and diverges
otherwise. To see this:

1. If x = 0, then the sum is 0. If 0 < |x| < 1, then

       |an+1|/|an| = (n/(n + 1)) |x| → |x| < 1 as n → ∞,

   implying that the series converges absolutely by the ratio test. Thus, the series converges
   absolutely when |x| < 1.

2. When x = 1, the series reduces to the harmonic series, which diverges by the integral
   test. When x = −1, the series reduces to the (negative) alternating harmonic series,
   which converges by the alternating series test. Thus, the series converges conditionally
   when x = −1.

3. When |x| > 1, the terms do not converge to 0, and the series diverges.
6.5.1 Product of Series

Let ∑_{n=0}^∞ an and ∑_{n=0}^∞ bn be series with index starting at 0. The series
∑_{n=0}^∞ cn with terms

    cn = ∑_{i=0}^{n} ai bn−i = a0 bn + a1 bn−1 + · · · + an−1 b1 + an b0 for n = 0, 1, 2, ...

is the product (or the convolution) of ∑_{n=0}^∞ an and ∑_{n=0}^∞ bn.

Theorem 6.20 (Convolution Theorem) If ∑_{n=0}^∞ an = L, ∑_{n=0}^∞ bn = M and both
series are absolutely convergent, then ∑_{n=0}^∞ cn is an absolutely convergent series with
sum LM.
Example 6.21 (Square of Geometric Series) The geometric series

    ∑_{n=0}^∞ x^n = 1 + x + x² + x³ + · · ·

converges to 1/(1 − x) when |x| < 1 and diverges otherwise. The square of this series is

    (∑_{n=0}^∞ x^n)² = ∑_{n=0}^∞ (n + 1)x^n = 1 + 2x + 3x² + 4x³ + · · · .

By the convolution theorem, the square series converges to 1/(1 − x)² when |x| < 1.
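The convolution formula cn = ∑ ai bn−i can be computed directly on coefficient lists. This sketch is mine; squaring the all-ones coefficients of the geometric series should give the coefficients 1, 2, 3, ...:

```python
def cauchy_product(a, b):
    # Convolution c_n = sum_{i=0}^{n} a_i * b_{n-i} of two coefficient lists.
    n = min(len(a), len(b))
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(n)]

a = [1] * 10                 # coefficients of the geometric series sum x^n
c = cauchy_product(a, a)
print(c)                     # [1, 2, 3, ..., 10], the coefficients of sum (n+1) x^n
```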
Remarks. The last example demonstrates that we can "represent" the functions
f(x) = 1/(1 − x) and g(x) = 1/(1 − x)² as sums of convergent series for all x satisfying
|x| < 1. Series representations of functions will be explored further in the next two sections.
6.6 Taylor Polynomials and Taylor Series

We have already seen how to approximate a function using a tangent line. This section
considers using polynomials of higher orders to approximate functions with higher precision,
and using the limiting forms of these polynomials to give exact representations of the
functions on intervals.

6.6.1 Higher Order Derivatives

The n-th derivative of the function f is obtained from the function by differentiating n times:
f′′′ is the derivative of f′′, f′′′′ is the derivative of f′′′, and so forth.

Starting with y = f(x), notations for the n-th derivative are

    f^(n)(x) = d^n y/dx^n = (d^n/dx^n)(f(x)) = d^n f/dx^n = D^n f(x) = y^(n).

For convenience, we let f^(0)(x) = f(x).
Figure 6.1: Plots of y = cos(x) (left) and y = 1/(1 −x) (right) with Taylor approximations.<br />
6.6.2 Taylor Polynomials

The Taylor polynomial of order n centered at a is the polynomial

    pn(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)² + · · · + (f^(n)(a)/n!)(x − a)^n.

Note that pn(a) = f(a) and that pn^(k)(a) = f^(k)(a) for k = 1, ..., n.

Theorem 6.22 (Taylor's Theorem, Part 1) If the n-th derivative of f is continuous on
the closed interval [a, b] and the (n + 1)-st derivative exists on the open interval (a, b), then
for each x in [a, b]

    f(x) = pn(x) + (f^(n+1)(c)/(n + 1)!)(x − a)^(n+1) for some c satisfying a < c < x.

In words, the error in using the Taylor polynomial pn(x) instead of the function f(x) for
values of x near a is exactly the remainder

    Rn(x) = (f^(n+1)(c)/(n + 1)!)(x − a)^(n+1) for some number c between a and x.

Thus, Taylor's theorem generalizes results on errors in linear approximations (Section 4.6).
Taylor's theorem can be proven using the Mean Value Theorem.

Note that if |f^(n+1)(x)| is bounded in a neighborhood of a, then |Rn(x)| is also bounded.
Specifically,

    |f^(n+1)(x)| ≤ M for |x − a| < r  ⇒  |Rn(x)| ≤ (M/(n + 1)!) |x − a|^(n+1) for |x − a| < r.
Example 6.23 (Cosine Function) Consider f(x) = cos(x) and let a = 0. The left part of
Figure 6.1 shows graphs of y = f(x), y = p2(x), y = p8(x), and y = p14(x). As n increases,
the polynomial approximates the function well over wider intervals centered at 0.

Further, since |f^(n)(x)| ≤ 1 (each derivative is either ± sin(x) or ± cos(x)) and a = 0,

    |Rn(x)| ≤ |x|^(n+1)/(n + 1)! for all x.
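The remainder bound can be verified for the same orders n = 2, 8, 14 shown in the figure. This sketch is mine, not part of the text:

```python
import math

def cos_taylor(x, n):
    # Taylor polynomial of cos(x) centered at 0, keeping terms through degree n.
    return sum((-1) ** k * x ** (2 * k) / math.factorial(2 * k)
               for k in range(n // 2 + 1))

x = 1.0
for n in (2, 8, 14):
    err = abs(cos_taylor(x, n) - math.cos(x))
    bound = abs(x) ** (n + 1) / math.factorial(n + 1)  # |R_n(x)| bound from the text
    print(n, err, bound)
```

In each case the actual error is below the bound |x|^(n+1)/(n + 1)!, and it shrinks rapidly as n grows.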
Example 6.24 (Geometric Function) Consider f(x) = 1/(1 − x) and let a = 0. The
right part of Figure 6.1 shows graphs of y = f(x), y = p1(x), y = p3(x), and y = p5(x).
As n increases, the polynomial approximates the function well over wider subintervals of
(−1, 1).

Since f^(n)(x) = n!/(1 − x)^(n+1) and f^(n)(0) = n!, pn(x) = 1 + x + x² + · · · + x^n.
Further, since

    (1 − x^(n+1))/(1 − x) = 1 + x + x² + · · · + x^n  =⇒  1/(1 − x) = 1 + x + x² + · · · + x^n + x^(n+1)/(1 − x)

(using properties of geometric series), the remainder is Rn(x) = x^(n+1)/(1 − x).
6.6.3 Taylor Series

The Taylor series representation of f(x) centered at a is

    f(x) = ∑_{n=0}^∞ (f^(n)(a)/n!)(x − a)^n = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)² + · · ·

when the series converges. Note that the Taylor series centered at 0 is often called the
Maclaurin series.

Theorem 6.25 (Taylor's Theorem, Part 2) If f(x) = pn(x) + Rn(x) and

    lim_{n→∞} Rn(x) = 0 for all x satisfying |x − a| < r,

then f(x) is equal to the sum of its Taylor series on the interval |x − a| < r.
Example 6.26 (Exponential Function) Consider f(x) = e^x and let a = 0.

Since f^(n)(x) = e^x and f^(n)(0) = 1 for all n,

    e^x = 1 + x + x²/2 + · · · + x^n/n! + Rn(x) where Rn(x) = (e^c/(n + 1)!) x^(n+1)

for some c between 0 and x. Since

    |Rn(x)| ≤ C |x|^(n+1)/(n + 1)! where C = max(e^x, 1)

and |x|^n/n! → 0 as n → ∞ for all x (Example 6.9), Rn(x) → 0 as n → ∞ as well. Thus, e^x is
equal to the sum of its Taylor series for all real numbers.
Table 6.1: Some Taylor series expansions centered at 0.

    1/(1 − x)    = 1 + x + x² + x³ + · · ·                 for −1 < x < 1
    1/(1 − x)²   = 1 + 2x + 3x² + 4x³ + · · ·              for −1 < x < 1
    ln(1 + x)    = x − x²/2 + x³/3 − x⁴/4 + · · ·          for −1 < x ≤ 1
    e^x          = 1 + x + x²/2 + x³/3! + x⁴/4! + · · ·    for all x
    cos(x)       = 1 − x²/2 + x⁴/4! − x⁶/6! + · · ·        for all x
    sin(x)       = x − x³/3! + x⁵/5! − x⁷/7! + · · ·       for all x
Example 6.27 (Exponential Function, continued) Using the result above, a series
representation of e^(−1) is

    e^(−1) = ∑_{n=0}^∞ (−1)^n/n! = 1 − 1 + 1/2 − 1/6 + 1/24 ∓ · · · .

To estimate e^(−1) to within 2 decimal places of accuracy, for example, we need to use 6
terms since

    |Rn(x)| ≤ 1/(n + 1)! < 0.005 =⇒ n > 5.

Our estimate is s6 = 53/144 ≈ 0.368056. (Note that this result could also be obtained using
the methods of Section 6.4.)
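The exact value 53/144 and its accuracy can be checked with rational arithmetic. This sketch is mine, not part of the text:

```python
from fractions import Fraction
import math

def s(n):
    # Partial sum of sum_{k=0}^{n} (-1)^k / k!, computed exactly.
    return sum(Fraction((-1) ** k, math.factorial(k)) for k in range(n + 1))

print(s(6), float(s(6)))  # 53/144, about 0.368056
```

The estimate agrees with e^(−1) ≈ 0.367879 to within the promised 0.005.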
6.6.4 Radius of Convergence

The largest value of r satisfying Taylor's theorem is called the radius of convergence of the
series. The example above demonstrates that the radius of convergence for the exponential
series centered at 0 is ∞.

Table 6.1 lists several commonly used series expansions. In the first three cases, the radius
of convergence is 1. In the last three cases, the radius of convergence is ∞.
6.7 Power Series

A power series centered at a is a series of the form

    ∑_{n=0}^∞ cn(x − a)^n = c0 + c1(x − a) + c2(x − a)² + c3(x − a)³ + · · · ,
where {cn} is a sequence of constants (the coefficients of the series). The series converges
to c0 when x = a. Interest focuses on finding the values of x for which the series converges
and finding exact forms for the sums.

Theorem 6.28 (Convergence Theorem) Let ∑_{n=0}^∞ cn(x − a)^n be a power series
centered at a. Then exactly one of the following holds:

1. The series converges for x = a only.

2. The series converges for all x.

3. There is a number r > 0 such that the series converges when |x − a| < r and diverges
   when |x − a| > r.

The number r given in the third part of the theorem is called the radius of convergence of
the power series. If r is the radius of convergence, then the theorem tells us nothing about
convergence at x = a ± r. Thus, the interval of convergence could be any one of

    [a − r, a + r], (a − r, a + r], [a − r, a + r), (a − r, a + r).
Example 6.29 (Logarithm Function) Consider the Taylor series for f(x) = ln(x)
centered at 1. Since

    ln(1) = 0, f^(n)(x) = (−1)^(n−1)(n − 1)!/x^n and f^(n)(1) = (−1)^(n−1)(n − 1)!,

the form of the Taylor series is

    ∑_{n=1}^∞ (−1)^(n−1)(x − 1)^n/n = (x − 1) − (x − 1)²/2 + (x − 1)³/3 − (x − 1)⁴/4 ± · · · .

The series converges absolutely when |x − 1| < 1 by the ratio test. If |x − 1| > 1, then the
terms do not converge to 0, and the series diverges. If x = 0, then the series diverges since
it is the negative of the harmonic series. If x = 2, then the series converges since it is the
alternating harmonic series.

Thus, the radius of convergence is r = 1 and the interval of convergence is (0, 2].
6.7.1 Power Series and Taylor Series

Taylor series are examples of power series. The following theorem states that convergent
power series must be Taylor series within their radius of convergence.

Theorem 6.30 (Representation Theorem) Assume that the radius of convergence of
the power series ∑_{n=0}^∞ cn(x − a)^n is r > 0. Let

    f(x) = ∑_{n=0}^∞ cn(x − a)^n for a − r < x < a + r.

Then f has derivatives of all orders on (a − r, a + r) and

    f^(n)(a) = cn n! for all n.

Thus, the power series is the Taylor series of f centered at a.
116
6.7.2 Differentiation and Integration of Power Series<br />
Power series can be differentiated and integrated term-by-term within their radius of convergence.<br />
Specifically,<br />
Theorem 6.31 (Term-by-Term Analysis) Assume that the radius of convergence of the power series ∑_{n=0}^∞ c_n (x − a)^n is r > 0. Let

f(x) = ∑_{n=0}^∞ c_n (x − a)^n for a − r < x < a + r.

Then f is differentiable (and continuous) on the interval (a − r, a + r). Further,

1. the derivative of f has the following form:

f′(x) = ∑_{n=1}^∞ n c_n (x − a)^{n−1} = c_1 + 2c_2 (x − a) + 3c_3 (x − a)^2 + · · · .

2. the family of antiderivatives of f has the following form:

∫ f(x) dx = C + ∑_{n=0}^∞ (c_n/(n + 1)) (x − a)^{n+1} = C + c_0 (x − a) + (c_1/2) (x − a)^2 + · · · .

The radius of convergence of each series is r.

Example 6.32 (Geometric Function) Let

f(x) = 1/(1 − x) = ∑_{n=0}^∞ x^n = 1 + x + x^2 + x^3 + · · · for −1 < x < 1.

The derivative of f can be computed two ways:

f′(x) = d/dx [1/(1 − x)] = 1/(1 − x)^2 and f′(x) = d/dx [∑_{n=0}^∞ x^n] = ∑_{n=1}^∞ n x^{n−1} = 1 + 2x + 3x^2 + 4x^3 + · · · .

Thus,

1/(1 − x)^2 = ∑_{n=1}^∞ n x^{n−1} = 1 + 2x + 3x^2 + 4x^3 + · · · for −1 < x < 1,

confirming results obtained on page 112.
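As a quick numerical check (Python, illustrative only), the partial sums of ∑ n x^{n−1} do approach 1/(1 − x)^2 inside the interval of convergence:

```python
x = 0.5
# Partial sum of the differentiated geometric series at x = 0.5.
partial = sum(n * x ** (n - 1) for n in range(1, 200))
closed_form = 1 / (1 - x) ** 2   # equals 4.0 at x = 0.5
print(abs(partial - closed_form) < 1e-10)   # True
```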
Example 6.33 (Standard Normal Distribution) Let Z be the standard normal random variable. Using the Taylor series expansion of e^x centered at 0, the PDF of Z can be written as follows:

φ(z) = (1/√(2π)) e^{−z^2/2} = (1/√(2π)) ∑_{n=0}^∞ (−z^2/2)^n/n! = (1/√(2π)) ∑_{n=0}^∞ (−1)^n z^{2n}/(2^n n!) for all z.
φ(z) does not have an antiderivative expressible in closed form. However, by using term-by-term analysis, we can write the indefinite integral as a family of power series. Specifically,

∫ φ(z) dz = C + (1/√(2π)) ∑_{n=0}^∞ (−1)^n z^{2n+1}/((2n + 1) 2^n n!).

This power series representation can be used to compute probabilities for standard normal random variables to within any degree of accuracy.
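For example, P(0 < Z ≤ z) can be computed by integrating the series from 0 to z. A short sketch in Python (not part of the original notes; the comparison against math.erf is just a sanity check):

```python
import math

def std_normal_prob(z, n_terms=40):
    """P(0 < Z <= z), from the term-by-term antiderivative of phi."""
    s = sum((-1) ** n * z ** (2 * n + 1) / ((2 * n + 1) * 2 ** n * math.factorial(n))
            for n in range(n_terms))
    return s / math.sqrt(2 * math.pi)

# Closed form based on the error function: P(0 < Z <= 1) = erf(1/sqrt(2))/2.
exact = 0.5 * math.erf(1 / math.sqrt(2))
print(abs(std_normal_prob(1.0) - exact) < 1e-12)   # True
```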
6.8 Discrete Random Variables, revisited<br />
There are many important applications of sequences and series in probability and statistics.<br />
This section focuses on discrete random variables whose ranges are countably infinite.<br />
6.8.1 Countably Infinite Sample Spaces<br />
A set is countably infinite when its elements can be listed as an infinite sequence. When a sample space
S is countably infinite, we need to include the possibility of infinite unions, and of infinite<br />
sums of probabilities. In particular, the third Kolmogorov axiom is extended as follows:<br />
3′′. If A_1, A_2, ... are pairwise disjoint, then

P(∪_{i=1}^∞ A_i) = P(A_1) + P(A_2) + P(A_3) + · · ·
where the righthand side is understood to be the sum of a convergent infinite series.<br />
6.8.2 Infinite Discrete Random Variables<br />
Recall that a discrete random variable is one whose range, R, is either finite or countably<br />
infinite. If the range of X is countably infinite, f(x) is the PDF of X, and g(X) is a<br />
real-valued function, then the expected value (or mean or expectation) of g(X) is<br />
E(g(X)) = ∑_{x∈R} g(x) f(x) when the series converges absolutely;

that is, when ∑_{x∈R} |g(x)| f(x) converges. If the series with absolute values diverges, we say that the expectation is indeterminate.
Example 6.34 (Coin Tossing) You toss a fair coin until you get a tail and record the<br />
sequence of h’s and t’s (<strong>for</strong> heads and tails, respectively). The sample space is<br />
S = {t, ht, hht, hhht, hhhht, hhhhht, ...}.<br />
Let X be the total number of tosses. Then X is a discrete random variable with PDF<br />
f(x) = P(X = x) = 1/2^x for x = 1, 2, 3, ... and 0 otherwise.
Figure 6.2: Probability density function of the uniform random variable on the open interval (0, 1) (left plot) and of the floor of its reciprocal (right plot). The range of the floor of the reciprocal is the positive integers. Vertical lines in the left plot are drawn at u = 1/2, 1/3, 1/4, ....
The probability that you get a tail in 3 or fewer tosses, for example, is

P(X ≤ 3) = P({t, ht, hht}) = f(1) + f(2) + f(3) = 1/2 + 1/4 + 1/8 = 7/8.
To find the expected value of X, we need to determine the sum of the following series:

E(X) = ∑_{x=1}^∞ x (1/2^x) = 1(1/2) + 2(1/4) + 3(1/8) + 4(1/16) + · · · .
By using a simple trick, we can reduce the problem to finding the sum of a convergent geometric series. Specifically,

E(X) − (1/2) E(X) = ∑_{x=1}^∞ x/2^x − ∑_{x=1}^∞ x/2^{x+1} = ∑_{x=1}^∞ [x/2^x − (x − 1)/2^x] = ∑_{x=1}^∞ 1/2^x = 1.

Since (1/2) E(X) = 1, we know that E(X) = 2. (The expected number of tosses is 2.) This trick is valid since the series is absolutely convergent.
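Both the partial sums and a simulation agree with E(X) = 2. A quick Python sketch (not part of the original notes):

```python
import random

# Deterministic check: partial sums of sum x / 2^x approach 2.
partial = sum(x / 2 ** x for x in range(1, 60))
print(abs(partial - 2) < 1e-12)   # True

# Monte Carlo check: toss a fair coin until the first tail.
random.seed(1)
def tosses_until_tail():
    x = 1
    while random.random() < 0.5:   # heads with probability 1/2
        x += 1
    return x

n = 100_000
mean = sum(tosses_until_tail() for _ in range(n)) / n
print(abs(mean - 2) < 0.05)       # True: the sample mean is close to 2
```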
Example 6.35 (Indeterminate Mean) Let U be a continuous uniform random variable on the open interval (0, 1) and let X be the greatest integer less than or equal to the reciprocal of U (the "floor" of the reciprocal of U), X = ⌊1/U⌋. For x in the positive integers,
f(x) = P(X = x) = P(1/(x + 1) < U ≤ 1/x) = 1/x − 1/(x + 1) = 1/(x(x + 1));

f(x) = 0 otherwise. The expectation of X is

E(X) = ∑_{x=1}^∞ x · 1/(x(x + 1)) = ∑_{x=1}^∞ 1/(x + 1).
Figure 6.3: Probability histograms <strong>for</strong> geometric (left) and Poisson (right) examples.<br />
The integral test can be used to demonstrate that the series on the right diverges. Thus,<br />
the expectation of X is indeterminate.<br />
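The divergence is easy to see numerically: the partial sums of ∑ 1/(x + 1) grow like ln N, without bound. A quick Python sketch (not part of the original notes):

```python
def partial_sum(n_terms):
    """Partial sum of sum_{x=1}^{n_terms} 1/(x + 1)."""
    return sum(1 / (x + 1) for x in range(1, n_terms + 1))

# The partial sums keep growing (roughly like ln n), never settling down.
sums = [partial_sum(n) for n in (10, 1000, 100_000)]
print(sums[0] < sums[1] < sums[2])   # True
print(sums[2] > 10)                  # True: already past 10 by n = 100000
```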
The left part of Figure 6.2 is a plot of the PDF of U with the vertical lines<br />
u = 1/2, 1/3, 1/4, 1/5, ...<br />
superimposed. The right part is a probability histogram of the distribution of X = ⌊1/U⌋.<br />
Note that the area between u = 1/2 and u = 1 on the left is P(X = 1), the area between<br />
u = 1/3 and u = 1/2 is P(X = 2), etc.<br />
6.8.3 Example: Geometric Distribution<br />
Let X be the number of failures before the first success in a sequence of independent Bernoulli experiments with success probability p. Then X is said to be a geometric random variable, or to have a geometric distribution, with parameter p. The PDF of X is as follows:

f(x) = (1 − p)^x p when x = 0, 1, 2, ..., and 0 otherwise.

For each x, f(x) is the probability of the sequence of x failures (F) followed by a success (S). For example, f(5) = P({FFFFFS}) = (1 − p)^5 p.
The probabilities f(0), f(1), ... form a geometric sequence whose sum is 1. Further,

E(X) = (1 − p)/p and Var(X) = E[(X − E(X))^2] = (1 − p)/p^2.
Example 6.36 (p = 1/4) Let X be a geometric random variable with parameter 1/4.<br />
Then<br />
E(X) = 3 and Var(X) = 12.
The left part of Figure 6.3 is a probability histogram for X, for values of X at most 20. Note that

P(X > 20) = ∑_{x=21}^∞ (3/4)^x (1/4) = (3/4)^{21} ≈ 0.00237841.
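The tail probability above follows from the geometric series formula; a quick check in Python (illustrative, not part of the notes):

```python
p = 1 / 4
# P(X > 20) = sum_{x=21}^inf (1 - p)^x p; the geometric series sums to (1 - p)^21.
tail_series = sum((1 - p) ** x * p for x in range(21, 500))  # truncated sum
tail_closed = (1 - p) ** 21
print(abs(tail_series - tail_closed) < 1e-12)   # True
print(round(tail_closed, 8))                    # 0.00237841
```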
Alternative definition. An alternative definition of the geometric random variable is as<br />
follows: X is the trial number of the first success in a sequence of independent Bernoulli<br />
experiments with success probability p. In this case,<br />
f(x) = (1 − p)^{x−1} p when x = 1, 2, 3, ..., and 0 otherwise.
In particular, the range is now the positive integers.<br />
6.8.4 Example: Poisson Distribution<br />
Let λ be a positive real number. The random variable X is said to be a Poisson random<br />
variable, or to have a Poisson distribution, with parameter λ if its PDF is as follows:<br />
f(x) = e^{−λ} λ^x/x! when x = 0, 1, 2, ..., and 0 otherwise.
The series expansion for e^x centered at 0 (see Table 6.1) can be used to demonstrate that the sequence f(0), f(1), ... has sum 1. Further,

E(X) = λ and Var(X) = E[(X − E(X))^2] = λ.
Example 6.37 (λ = 3) Let X be the Poisson random variable with mean 3 (and with<br />
variance 3). The right part of Figure 6.3 is a probability histogram for X, for values of X at most 10. Note that

P(X > 10) = ∑_{x=11}^∞ e^{−3} 3^x/x! ≈ 0.000292337.
Rare events. The idea for the Poisson distribution comes from a limit theorem proved by the mathematician S. Poisson in the 1830s.
Theorem 6.38 (Poisson Limit Theorem) Let λ be a positive real number, n a positive integer, and p = λ/n. Then

lim_{n→∞} C(n, x) p^x (1 − p)^{n−x} = e^{−λ} λ^x/x! when x = 0, 1, 2, ...,

where C(n, x) denotes the binomial coefficient.
This theorem can be used to estimate binomial probabilities when the number of trials is<br />
large and the probability of success is small. That is, when success is a rare event.<br />
For example, if n = 10000, p = 2/5000, and x = 3, then the probability of 3 successes in 10000 trials is

C(10000, 3) (2/5000)^3 (4998/5000)^{9997} ≈ 0.195386

and Poisson's approximation is

e^{−4} 4^3/3! ≈ 0.195367,

using λ = np = 4. The values are very close.
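The comparison is easy to reproduce. A Python sketch (not part of the original notes):

```python
import math

n, p, x = 10_000, 2 / 5000, 3
lam = n * p   # 4.0

# Exact binomial probability versus the Poisson approximation.
binom = math.comb(n, x) * p ** x * (1 - p) ** (n - x)
poisson = math.exp(-lam) * lam ** x / math.factorial(x)

print(round(binom, 6))              # 0.195386
print(round(poisson, 6))            # 0.195367
print(abs(binom - poisson) < 1e-4)  # True
```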
Poisson process. Events occurring in time are said to be generated by an (approximate)<br />
Poisson process with rate λ when the following conditions are satisfied:<br />
1. The numbers of events occurring in disjoint subintervals of time are independent of one another.
2. The probability of one event occurring in a sufficiently small subinterval of<br />
time is proportional to the size of the subinterval. If h is the size, then the<br />
probability is λh.<br />
3. The probability of two or more events occurring in a sufficiently small<br />
subinterval of time is virtually zero.<br />
In this definition, λ represents the average number of events per unit time. If events follow<br />
an (approximate) Poisson process and X is the number of events observed in one unit of<br />
time, then X has a Poisson distribution with parameter λ.<br />
Typical applications of Poisson distributions include the number of cars passing an intersection in a fixed period of time during a workday, or the number of phone calls received in a fixed period of time during a workday.
The idea of a Poisson process can be generalized to include events occurring over regions<br />
of space instead of intervals of time. (“Subregions” take the place of “subintervals” in the<br />
conditions above. In this case, λ represents the average number of events per unit area or<br />
per unit volume.)<br />
Remark. The definition of Poisson process allows you to think of the PDF of X as<br />
the limit of a sequence of binomial PDFs: The observation interval is subdivided into n<br />
nonoverlapping subintervals; the i th Bernoulli trial results in success if an event occurs in<br />
the i th subinterval, and failure otherwise. If n is large enough, then the probability that 2<br />
or more events occur in one subinterval can be assumed to be zero.<br />
6.9 Continuous Random Variables, revisited<br />
The exponential and gamma families of distributions are related to Poisson processes.<br />
6.9.1 Exponential Distribution<br />
Recall that the continuous random variable X has an exponential distribution with parameter λ when its PDF has the form

f(x) = λ e^{−λx} when x ≥ 0 and 0 otherwise,

where λ is a positive constant. (See Section 5.2.2.)
If events occurring over time follow an approximate Poisson process with rate λ, where λ<br />
is the average number of events per unit time, then the time between successive events has<br />
an exponential distribution with parameter λ. To see this,<br />
1. If you observe the process <strong>for</strong> t units of time and let Y equal the number<br />
of observed events, then Y has a Poisson distribution with parameter λt.<br />
The PDF of Y is as follows:

P(Y = y) = e^{−λt} (λt)^y/y! when y = 0, 1, 2, ... and 0 otherwise.
2. An event occurs, the clock is reset to time 0, and X is the time until the<br />
next event occurs. Then X is a continuous random variable whose range<br />
is x > 0. Further,<br />
P(X > t) = P(0 events in the interval [0, t]) = P(Y = 0) = e^{−λt} and P(X ≤ t) = 1 − e^{−λt}.
3. Since f(t) = d/dt P(X ≤ t) = d/dt (1 − e^{−λt}) = λ e^{−λt} when t > 0 (and 0 otherwise) is the same as the PDF of an exponential random variable with parameter λ, X has an exponential distribution with parameter λ.
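The connection can also be checked by simulation: generate exponential gaps between events and count the events in one unit of time; the counts should average λ. A Python sketch (illustrative; the rate λ = 2 is an arbitrary choice, not from the notes):

```python
import random

random.seed(7)
lam, t = 2.0, 1.0

def count_events():
    """Count events in [0, t] when the gaps are exponential with rate lam."""
    clock, count = random.expovariate(lam), 0
    while clock <= t:
        count += 1
        clock += random.expovariate(lam)
    return count

n_reps = 50_000
mean_count = sum(count_events() for _ in range(n_reps)) / n_reps
print(abs(mean_count - lam * t) < 0.05)   # True: mean count close to lam * t = 2
```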
6.9.2 Gamma Distribution<br />
Recall that the continuous random variable X has a gamma distribution with shape parameter α and scale parameter β when its PDF has the following form:

f(x) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β} when x > 0 and 0 otherwise,

where α and β are positive constants. (See Section 5.2.4.)
If events occurring over time follow an approximate Poisson process with rate λ, where λ<br />
is the average number of events per unit time and if r is a positive integer, then the time<br />
until the r th event occurs has a gamma distribution with α = r and β = 1/λ. To see this,<br />
1. If you observe the process <strong>for</strong> t units of time and let Y equal the number<br />
of observed events, then Y has a Poisson distribution with parameter λt.<br />
The PDF of Y is as follows:

P(Y = y) = e^{−λt} (λt)^y/y! when y = 0, 1, 2, ... and 0 otherwise.
2. Let X be the time you observe the r th event, starting from time 0. Then X is a continuous random variable whose range is x > 0. Further, P(X > t) is the same as the probability that there are fewer than r events in the interval [0, t]. Thus,

P(X > t) = P(Y < r) = ∑_{y=0}^{r−1} e^{−λt} (λt)^y/y!

and P(X ≤ t) = 1 − P(X > t).
3. The PDF f(t) = d/dt P(X ≤ t) is computed using the product rule:

d/dt [ 1 − e^{−λt} ∑_{y=0}^{r−1} λ^y t^y/y! ]
= λ e^{−λt} ∑_{y=0}^{r−1} λ^y t^y/y! − e^{−λt} ∑_{y=1}^{r−1} λ^y y t^{y−1}/y!
= λ e^{−λt} [ ∑_{y=0}^{r−1} λ^y t^y/y! − ∑_{y=1}^{r−1} λ^{y−1} t^{y−1}/(y−1)! ]
= λ e^{−λt} λ^{r−1} t^{r−1}/(r−1)!
= (λ^r/(r−1)!) t^{r−1} e^{−λt}.

Since f(t) is the same as the PDF of a gamma random variable with parameters α = r and β = 1/λ, X has a gamma distribution with parameters α = r and β = 1/λ.
6.10 Summary: Poisson Processes<br />
In summary, there are three distributions related to Poisson processes:<br />
1. If X is the number of events observed in one unit of time, then X is a Poisson random<br />
variable with parameter λ, where λ is the average number of events per unit time.<br />
The probability that exactly x events occur is<br />
P(X = x) = e^{−λ} λ^x/x! when x = 0, 1, 2, ..., and 0 otherwise.
2. If X is the time between successive events, then X is an exponential random variable<br />
with parameter λ, where λ is the average number of events per unit time. The CDF<br />
of X is<br />
F(x) = 1 − e^{−λx} when x > 0 and 0 otherwise.
3. If X is the time to the r th event, then X is a gamma random variable with parameters
α = r and β = 1/λ, where λ is the average number of events per unit time. The CDF<br />
of X is<br />
F(x) = 1 − ∑_{y=0}^{r−1} e^{−λx} (λx)^y/y! when x > 0 and 0 otherwise.
Remark. If α = r and r is not too large, then gamma probabilities can be computed by hand using the formula for the CDF above. Otherwise, the computer can be used to find probabilities for gamma distributions.
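For integer shape α = r, the CDF formula above can be coded directly. In the Python sketch below (not part of the original notes), the values r = 3 and λ = 2 are arbitrary illustrative choices:

```python
import math

def gamma_cdf_integer_shape(x, r, lam):
    """P(X <= x) for a gamma(alpha = r, beta = 1/lam) variable, r a positive integer."""
    return 1 - sum(math.exp(-lam * x) * (lam * x) ** y / math.factorial(y)
                   for y in range(r))

# Time until the 3rd event of a rate-2 Poisson process: P(X <= 1) = 1 - 5 e^{-2}.
val = gamma_cdf_integer_shape(1.0, r=3, lam=2.0)
print(abs(val - (1 - 5 * math.exp(-2))) < 1e-12)   # True
print(round(val, 4))                               # 0.3233
```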
7 Multivariable Calculus Concepts<br />
Let R k be the set of ordered k-tuples of real numbers,<br />
R k = {(x1,x2,...,xk) : xi ∈ R} .<br />
Each element of R k is called a point or a vector. Initially, we use the notation<br />
x = (x1,x2,...,xk)<br />
to denote vectors in R k . We say that xi is the i th component of x.<br />
This chapter considers real-valued functions whose domain, D, is a subset of R^k,

f : D ⊆ R^k −→ R,

and their applications in statistics.

7.1 Vector And Matrix Operations

This section introduces basic vector and matrix operations. Ideas introduced here will be studied in more depth in Chapter 9 of these notes.
7.1.1 Representations of Vectors<br />
The origin (or zero vector) O in R k is the vector all of whose components are zero,<br />
O = (0,0,... ,0).<br />
Let v = (v1,v2,...,vk) be a vector in R k . If v is not the zero vector, then v can be<br />
represented as a directed line segment in two ways:<br />
1. As a position vector drawn from the origin to the point P(v1,v2,...,vk): v = OP .<br />
2. As a displacement vector drawn from point P(p1, p2, ..., pk) to point Q(q1, q2, ..., qk), where vi = qi − pi for each i: v = PQ.

For example, the left part of Figure 7.1 shows the vector v = (−4, 3) represented as the position vector OP and the displacement vectors AB and VW. The right part of the figure shows the position vector v = OP, where P is the point P(−4, 3, 2).
7.1.2 Addition and Scalar Multiplication of Vectors<br />
If v and w are vectors in R k and c ∈ R is a scalar, then<br />
Figure 7.1: Representations of vectors in 2-space (left) and 3-space (right).<br />
1. the vector sum v + w is the vector obtained by componentwise addition,<br />
v + w = (v1 + w1,v2 + w2,...,vk + wk), and<br />
2. the scalar product cv is the vector obtained by multiplying each component of v by c,<br />
cv = (cv1,cv2,...,cvk).<br />
Example 7.1 (Vectors in R 3 ) If v = (8, −3,2) and w = (−9,2,3), then<br />
v + w = (−1, −1,5), 5v = (40, −15,10) and 5v − w = (49, −17,7).<br />
Properties of addition and scalar multiplication. The following properties hold for vectors u, v, w in R^k and c, d ∈ R:
1. Commutative: v + w = w + v.<br />
2. Associative: (u + v) + w = u + (v + w), c(dv) = (cd)v.<br />
3. Distributive: (c + d)v = cv + dv, c(v + w) = cv + cw.<br />
4. Identity: 1v = v, 0v = O, v + O = v, and<br />
v + (−1)w = v − w (that is, we have the operation of subtraction).<br />
Parallelogram or triangle rules. Figure 7.2 is a graphical way to represent addition (left part) and subtraction (right part) in 2-space and 3-space.
In the left plot, the vector sum v + w is obtained by starting at the lower left and going<br />
across the bottom and then up the right side, or by starting at the lower left and going up<br />
the left side and across the top (since v + w = w + v). Geometrically, the vector sum is<br />
the diagonal of the parallelogram.<br />
In the right plot, observe that w + u = v implies that u = v − w.<br />
Figure 7.2: Graphical representation of vector addition (left) and subtraction (right).

Figure 7.3: Graphical representation of scalar multiplication of vectors.
Parallel vectors. Vectors v and w are said to be parallel when w = mv for some constant m. If m > 0, then the vectors point in the same direction; if m < 0, then the vectors point in opposite directions.
Figure 7.3 is a graphical way to represent scalar multiplication in 2-space and 3-space.<br />
Vector 2v is a vector pointing in the same direction as v but of twice the length; vector<br />
−1.5v is a vector of length 1.5 times the length of v, but pointing in the opposite direction.<br />
7.1.3 Length and Distance<br />
If v is a vector in R^k, then the length of v is

‖v‖ = √(v_1^2 + v_2^2 + · · · + v_k^2).
The zero vector has length 0; all other vectors have positive length.<br />
A unit vector is a vector of length 1. If v is not the zero vector, then the unit vector in the direction of v is the vector u = (1/‖v‖) v.
The distance between vectors v and w is the length of the difference vector v − w:

‖v − w‖ = √((v_1 − w_1)^2 + (v_2 − w_2)^2 + · · · + (v_k − w_k)^2).
Example 7.2 (Vectors in R^3, continued) If v = (8, −3, 2) and w = (−9, 2, 3), then the length of v is ‖v‖ = √77 ≈ 8.77, the length of w is ‖w‖ = √94 ≈ 9.70, and the distance between v and w is ‖v − w‖ = 3√35 ≈ 17.75.

The unit vector in the direction of v is u = (8/√77, −3/√77, 2/√77).
Theorem 7.3 (Triangle Inequality) If v, w ∈ R^k, then ‖v + w‖ ≤ ‖v‖ + ‖w‖.

Note that if k = 2 or k = 3, then the triangle with corners O, v and v + w has sides of lengths ‖v‖, ‖w‖, and ‖v + w‖.
7.1.4 Dot Product of Vectors<br />
If v, w ∈ R^k, then the dot product (or inner product or scalar product) is the number

v·w = v_1 w_1 + v_2 w_2 + · · · + v_k w_k.

Note that v·v = ‖v‖^2.
Let k = 2 or 3, v = OP and w = OQ be representations of v and w as position vectors, and θ be the smaller of the two angles defined by P OQ (the connected segments from P to the origin to Q). Then, analytic geometry can be used to demonstrate that

v·w = ‖v‖ ‖w‖ cos(θ).

If θ is unknown, then it can be found by solving the equation above.
Example 7.4 (Vectors in R^3, continued) If v = (8, −3, 2) and w = (−9, 2, 3), then

v·w = −72 and θ = cos^{−1}(v·w/(‖v‖ ‖w‖)) ≈ 2.58 radians (≈ 148 degrees).
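The computations in Examples 7.2 and 7.4 are easy to reproduce. A Python sketch (not part of the original notes):

```python
import math

def dot(v, w):
    """Dot product of two vectors of equal length."""
    return sum(vi * wi for vi, wi in zip(v, w))

def length(v):
    """Length (norm) of a vector: sqrt(v . v)."""
    return math.sqrt(dot(v, v))

v, w = (8, -3, 2), (-9, 2, 3)

print(abs(length(v) - math.sqrt(77)) < 1e-12)   # True
print(dot(v, w))                                # -72
theta = math.acos(dot(v, w) / (length(v) * length(w)))
print(round(theta, 2))                          # 2.58 (radians)
```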
Angle between vectors. Analytic geometry can be used to prove the following theorem.<br />
Theorem 7.5 (Cauchy-Schwarz Inequality) If v, w ∈ R^k, then |v·w| ≤ ‖v‖ ‖w‖.
The Cauchy-Schwarz inequality allows us to formally define the angle between two nonzero vectors. Specifically, if v, w ∈ R^k are nonzero vectors, then θ is defined to be the angle satisfying the following equation:

θ = cos^{−1}(v·w/(‖v‖ ‖w‖)).
This definition generalizes ideas about angles from 2-space and 3-space.<br />
Orthogonal vectors. The nonzero vectors v,w ∈ R k are said to be orthogonal if their<br />
dot product is 0, v·w = 0.<br />
Note that the angle between orthogonal vectors is π/2 (or 90 degrees), since cos(π/2) = 0 and π/2 is in the range of the inverse cosine function (see page 53).
7.1.5 Matrix Notation and Matrix Transpose<br />
An m-by-n matrix A is a rectangular array with m rows and n columns:<br />
A = | a_11  a_12  · · ·  a_1n |
    | a_21  a_22  · · ·  a_2n |
    |  .     .            .   |
    | a_m1  a_m2  · · ·  a_mn |   = [a_ij].
The (i,j)-entry of A (denoted by aij) is the element in the row i and column j position.<br />
Entries are shown explicitly in the first term above. The second term is used as shorthand<br />
when the size of the matrix (the numbers m and n) is understood.<br />
A 1-by-n matrix is known as a row vector and an m-by-1 matrix is known as a column<br />
vector. The m-by-n matrix A can be thought of as a single column of m row vectors or as<br />
a single row of n column vectors.<br />
If A is an m-by-n matrix, then the transpose of A, denoted by A^T, is the n-by-m matrix whose (i, j)-entry is a_ji. For example, if

A = | 1 2 3 |
    | 4 5 6 |,

then

A^T = | 1 4 |
      | 2 5 |
      | 3 6 |.
7.1.6 Addition and Scalar Multiplication of Matrices<br />
Let A = [aij] and B = [bij] be m-by-n matrices of numbers and let c ∈ R be a scalar. Then<br />
1. the matrix sum, A + B, is the m-by-n matrix obtained by componentwise addition,<br />
A + B = [aij + bij], and<br />
2. the scalar product, cA, is the m-by-n matrix obtained by multiplying each component<br />
of A by c,<br />
cA = [caij].<br />
<br />
Example 7.6 (2-by-3 Matrices) If

A = | 2 3 6 |        and   B = | 5 −6 2 |
    | −2 0 1 |                 | 4 5 0 |,

then

A + B = | 7 −3 8 |     5A = | 10 15 30 |     5A − 2B = | 0 27 26 |
        | 2 5 1 |,          | −10 0 5 |,               | −18 −10 5 |.

Properties of addition and scalar multiplication. Since operations are performed componentwise, properties of addition and scalar multiplication of matrices are similar to those for vectors.
7.1.7 Matrix Multiplication<br />
Let A = [aij] be an m-by-n matrix and let B = [bij] be an n-by-p matrix.<br />
The matrix product, C = AB, is the m-by-p matrix whose (i,j)-entry is the dot product of<br />
the i th row of A with the j th column of B:<br />
c_ij = ∑_{k=1}^n a_ik b_kj for i = 1, 2, ..., m; j = 1, 2, ..., p.

Example 7.7 (2-by-3 and 3-by-2 Matrices) Let

A = | 2 3 6 |        and   B = | 2 −3 |
    | −1 0 1 |                 | 0 1 |
                               | 4 −7 |.

Then

AB = | 28 −45 |       and   BA = | 7 6 9 |
     | 2 −4 |                    | −1 0 1 |
                                 | 15 12 17 |.

For example, the (1,1)-entry of the product AB is

(2, 3, 6)·(2, 0, 4) = (2)(2) + (3)(0) + (6)(4) = 28.

Similarly, the (1,1)-entry of the product BA is (2)(2) + (−3)(−1) = 7.

Properties. Matrix products are associative. Thus, the product

ABC = A(BC) = (AB)C
is well-defined. Matrix products are not necessarily commutative, as illustrated in the<br />
example above. In fact, if the number of rows of A is not equal to the number of columns<br />
of B, then the product BA is not defined.<br />
Note also that the transpose of the product of two matrices is equal to the product of the transposes in the opposite order: (AB)^T = B^T A^T.
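The definitions of matrix product and transpose can be coded directly; the following Python check (not part of the original notes) reproduces Example 7.7 and the transpose identity:

```python
def matmul(A, B):
    """Product of A (m-by-n) and B (n-by-p), stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Transpose of A: rows become columns."""
    return [list(col) for col in zip(*A)]

A = [[2, 3, 6], [-1, 0, 1]]
B = [[2, -3], [0, 1], [4, -7]]

print(matmul(A, B))   # [[28, -45], [2, -4]]
print(matmul(B, A))   # [[7, 6, 9], [-1, 0, 1], [15, 12, 17]]
print(transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)))   # True
```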
7.1.8 Determinants of Square Matrices<br />
A square matrix is a matrix with m = n. If A is a square matrix, then the determinant<br />
of A is a number which summarizes certain properties of the matrix. Notations <strong>for</strong> the<br />
determinant of A are det(A) or |A|.<br />
The determinant is often defined recursively, as illustrated in the following special cases:<br />
1. If A = [a_1 a_2; b_1 b_2] is a 2-by-2 matrix (rows separated by semicolons), then

   det(A) = a_1 b_2 − b_1 a_2.
2. If A = [a_1 a_2 a_3; b_1 b_2 b_3; c_1 c_2 c_3] is a 3-by-3 matrix (rows separated by semicolons), then

   det(A) = a_1 det[b_2 b_3; c_2 c_3] − a_2 det[b_1 b_3; c_1 c_3] + a_3 det[b_1 b_2; c_1 c_2].

   (The formula uses an alternating sum of 3 determinants of 2-by-2 matrices.)

3. If A = [a_1 a_2 a_3 a_4; b_1 b_2 b_3 b_4; c_1 c_2 c_3 c_4; d_1 d_2 d_3 d_4] is a 4-by-4 matrix, then

   det(A) = a_1 det[b_2 b_3 b_4; c_2 c_3 c_4; d_2 d_3 d_4] − a_2 det[b_1 b_3 b_4; c_1 c_3 c_4; d_1 d_3 d_4] + a_3 det[b_1 b_2 b_4; c_1 c_2 c_4; d_1 d_2 d_4] − a_4 det[b_1 b_2 b_3; c_1 c_2 c_3; d_1 d_2 d_3].

   (The formula uses an alternating sum of 4 determinants of 3-by-3 matrices.)
Note that the notation for the entries in the special cases was chosen to emphasize the recursive nature of the definition.
In general, if Aij is the (n − 1)-by-(n − 1) submatrix obtained by eliminating the i th row<br />
and the j th column of A, then<br />
det(A) = ∑_{j=1}^n (−1)^{1+j} a_1j det(A_1j).

The formula uses an alternating sum of n determinants of (n − 1)-by-(n − 1) matrices. We say that the determinant is obtained by "expanding in the first row." (Additional methods for computing det(A) are discussed in Section 9.3.)
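The recursive definition ("expanding in the first row") translates directly into code. A Python sketch (not part of the original notes); note that with 0-based column index j the sign (−1)^{1+j} becomes (−1)^j:

```python
def det(A):
    """Determinant by expansion in the first row (recursive definition)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for j in range(n):
        # Submatrix A_{1j}: delete row 1 and column j of A.
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det(minor)
    return total

print(det([[4, 1, 2], [-1, 2, 1], [1, 3, -3]]))   # -48, as in Example 7.8
```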
Example 7.8 (3-by-3 Matrix) If A = [4 1 2; −1 2 1; 1 3 −3], then

det(A) = 4 det[2 1; 3 −3] − 1 det[−1 1; 1 −3] + 2 det[−1 2; 1 3]
       = 4(−6 − 3) − 1(3 − 1) + 2(−3 − 2)
       = 4(−9) − 1(2) + 2(−5) = −48.

7.2 Limits and Continuity

This section generalizes ideas from Section 4.2.
7.2.1 Neighborhoods, Deleted Neighborhoods and Limits<br />
A neighborhood of the vector a ∈ R^k is the set of vectors x ∈ R^k satisfying

‖x − a‖ < r for some positive number r.

A deleted neighborhood of a is the neighborhood with a removed; that is, the set of vectors x satisfying 0 < ‖x − a‖ < r for some positive r.
The vector a ∈ D is said to be a limit point of the domain D if every deleted neighborhood<br />
of a contains points of D.<br />
Consider f : D ⊆ R^k −→ R, and suppose that a is a limit point of D. We say that the limit as x approaches a of f(x) is L, written

lim_{x→a} f(x) = L,

if for every positive number ε there exists a positive number δ such that

if x ∈ D satisfies 0 < ‖x − a‖ < δ, then f(x) satisfies |f(x) − L| < ε.

(Vectors in the intersection of the domain of the function with the deleted δ-neighborhood of a are mapped into the ε-neighborhood of L.)
Example 7.9 (Limit Exists) Consider the 2-variable function

f(x, y) = (x^3 − y^3)/(x − y) when x ≠ y.

Let x = (x, y) and a = (0, 0). Then

lim_{x→a} f(x) = lim_{(x,y)→(0,0)} (x^3 − y^3)/(x − y) = lim_{(x,y)→(0,0)} (x^2 + xy + y^2) = 0.

(As (x, y) approaches the origin from any direction not crossing the line y = x, the function values f(x, y) approach zero.)
Example 7.10 (Limit Does Not Exist) Consider the 2-variable function

f(x, y) = (x^2 − y^2)/(x^2 + y^2) when (x, y) ≠ (0, 0).

Let x = (x, y) and a = (0, 0), and consider approaching (0, 0) along the line y = mx. Then

f(x, y) = f(x, mx) = (x^2 − m^2 x^2)/(x^2 + m^2 x^2) = (1 − m^2)/(1 + m^2) on this line (when x ≠ 0).

Since

lim_{(x,y)→(0,0) along y = mx} f(x) = (1 − m^2)/(1 + m^2)

and m can be any number, the two-dimensional limit does not exist.
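The path dependence is easy to see numerically: approach the origin along several lines y = mx and watch the value of f stay at (1 − m^2)/(1 + m^2). A Python sketch (not part of the original notes):

```python
def f(x, y):
    """f(x, y) = (x^2 - y^2)/(x^2 + y^2), defined away from the origin."""
    return (x * x - y * y) / (x * x + y * y)

# Along y = m x the value is constant: (1 - m^2)/(1 + m^2).
print(f(0.001, 0.0))                          # 1.0   (m = 0)
print(f(0.001, 0.001))                        # 0.0   (m = 1)
print(abs(f(0.001, 0.002) - (-0.6)) < 1e-9)   # True  (m = 2)
```

Since the limiting value depends on the slope m, no single two-dimensional limit can exist.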
Properties of limits. Properties of limits for 1-variable functions (see page 57) can be generalized to limits for k-variable functions.
Example 7.11 (Squeezing Principle) Consider the 2-variable function

f(x,y) = 3x²y/(x² + y²) when (x,y) ≠ (0,0).

Let x = (x,y) and a = (0,0). The squeezing principle can be used to demonstrate that

lim_{(x,y)→(0,0)} f(x,y) = 0.

To see this, note that

0 ≤ x²/(x² + y²) ≤ 1 when (x,y) ≠ (0,0)

implies that values of f have the following simple bounds:

−3|y| ≤ f(x,y) ≤ 3|y| when (x,y) ≠ (0,0).

Since ±3|y| → 0 as y → 0, f(x,y) must approach 0 as well.
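The squeeze bound can be illustrated numerically. The sketch below (Python; an added illustration, not part of the original computation) evaluates f along several directions approaching the origin and confirms that the values are controlled by 3|y|:

```python
import math

def f(x, y):
    # f(x,y) = 3x^2 y / (x^2 + y^2), defined for (x,y) != (0,0)
    return 3 * x**2 * y / (x**2 + y**2)

# Approach the origin from several directions at distance r = 1e-6;
# the bound |f(x,y)| <= 3|y| forces all values toward 0.
vals = []
for theta in [0.0, 0.5, 1.0, 2.0, 3.0]:
    r = 1e-6
    x, y = r * math.cos(theta), r * math.sin(theta)
    vals.append(abs(f(x, y)))

max_val = max(vals)   # at most 3|y| <= 3e-6
```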
7.2.2 Continuous Functions<br />
The function f(x) is said to be continuous at a if<br />
lim f(x) = f(a).<br />
x→a<br />
The function is said to be discontinuous at a otherwise. Note that f(x) is discontinuous at<br />
a when f(a) is undefined, or when the limit does not exist, or when the limit and function<br />
values exist but are not equal.<br />
Linear functions. A linear function in k variables is a function of the form

f(x) = c0 + c1x1 + c2x2 + · · · + ckxk,

where each coefficient ci ∈ R. Linear functions are continuous for all x ∈ R^k.
Polynomial functions. A polynomial function in k variables is a function of the <strong>for</strong>m<br />
f(x) = Σ_{j1,j2,...,jk=0}^{d} c_{j1,j2,...,jk} x1^{j1} x2^{j2} · · · xk^{jk},

where d is a nonnegative integer and the c's are real numbers. For example, f(x,y) = 3x² − 10xy⁸ + y² + 12 is a polynomial function in two variables. Polynomial functions are continuous for all x ∈ R^k.
Rational functions. A rational function in k variables is a ratio of polynomials, f(x) = p(x)/q(x). Rational functions are continuous whenever q(x) ≠ 0.
More generally. More generally, properties of limits imply that sums, differences, products,<br />
quotients, and constant multiples of continuous functions are continuous whenever the<br />
appropriate operation is defined.<br />
Composition of continuous functions. If the k-variable function g(x) is continuous<br />
at a and the 1-variable function f(x) is continuous at g(a), then the composite function<br />
f(g(x)) is continuous at a. In other words,<br />
lim_{x→a} f(g(x)) = f( lim_{x→a} g(x) ) = f(g(a)).

For example, let g(x,y) = x² − 4y, f(x) = √x and a = (5,1). Then the composite function f(g(x,y)) is continuous at (5,1) since

lim_{(x,y)→(5,1)} f(g(x,y)) = lim_{(x,y)→(5,1)} √(x² − 4y) = √(25 − 4) = √21 = f(g(5,1)).
7.3 Differentiation in Several Variables<br />
This section generalizes differentiability from Section 4.3.<br />
7.3.1 Partial Derivatives and the Gradient Vector<br />
The partial derivative of f(x) with respect to xi is

fxi(x) = lim_{h→0} ( f(x1,...,xi−1, xi + h, xi+1,...,xk) − f(x) ) / h.

(The ith partial derivative is the ordinary derivative of the function where all variables except the ith variable are held fixed.) Alternative notations are Dxi f(x) and (∂f/∂xi)(x).
Example 7.12 (k = 3) Let x = (x,y,z) and f(x,y,z) = −2xy² + z³ + 3x²yz⁴. Then

fx(x,y,z) = −2y² + 6xyz⁴,  fy(x,y,z) = −4xy + 3x²z⁴,  fz(x,y,z) = 3z² + 12x²yz³.

The gradient vector, ∇f(x) ("grad f of x"), is the vector of partial derivatives:

∇f(x) = (fx1(x), fx2(x), ..., fxk(x)).

For example, for the 3-variable function above,

∇f(x,y,z) = (−2y² + 6xyz⁴, −4xy + 3x²z⁴, 3z² + 12x²yz³).
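The analytic gradient can be checked against a central-difference approximation. The sketch below (illustrative Python; the helper `numeric_grad` is not part of the notes) does this for the 3-variable function above at the point (1, 1, −2):

```python
def f(x, y, z):
    return -2*x*y**2 + z**3 + 3*x**2*y*z**4

def grad_f(x, y, z):
    # analytic gradient from the text
    return (-2*y**2 + 6*x*y*z**4,
            -4*x*y + 3*x**2*z**4,
            3*z**2 + 12*x**2*y*z**3)

def numeric_grad(f, p, h=1e-6):
    # central differences, one coordinate at a time
    g = []
    for i in range(len(p)):
        q_plus  = list(p); q_plus[i]  += h
        q_minus = list(p); q_minus[i] -= h
        g.append((f(*q_plus) - f(*q_minus)) / (2*h))
    return g

p = (1.0, 1.0, -2.0)
exact  = grad_f(*p)            # (94, 44, -84)
approx = numeric_grad(f, p)
err = max(abs(a - b) for a, b in zip(exact, approx))
```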
Matrix notation. We sometimes write the list of partial derivatives as a 1-by-k matrix,

Df(x) = [fx1(x)  fx2(x)  · · ·  fxk(x)].

Note that matrix notation is especially useful when considering vector-valued functions. For example, if the domain of f is a subset of R^n and the range of f is a subset of R^m, then Df(x) is an m-by-n matrix of partial derivatives.
7.3.2 Second-Order Partial Derivatives<br />
The second-order partial derivative of f(x) with respect to xi and xj is the partial derivative<br />
of fxi (x) with respect to xj:<br />
fxi,xj(x) = (∂/∂xj)(∂f/∂xi)(x) = (∂²f/∂xj∂xi)(x) = Dxi,xj f(x).
When i = j, the second-order partial derivative can be interpreted as the concavity of a<br />
partial function. When i ≠ j, the second-order partial derivative is often called a mixed
partial derivative. Additional notations when i = j are:<br />
fxi,xi(x) = (∂²f/∂xi²)(x) = D²xi f(x).

Theorem 7.13 (Clairaut's Theorem) Suppose that the k-variable function f(x) has
continuous first- and second-order partial derivatives in a neighborhood of a. Then <strong>for</strong> each<br />
i and j, the mixed partial derivatives at a are equal:<br />
fxi,xj (a) = fxj,xi (a) <strong>for</strong> all i,j = 1,2,... ,k.<br />
Example 7.14 (k = 3, continued) The second-order partial derivatives (with arguments<br />
omitted) of the 3-variable function above are as follows:<br />
fx,x = 6yz⁴,  fy,y = −4x,  fz,z = 6z + 36x²yz²,
fx,y = fy,x = −4y + 6xz⁴,  fx,z = fz,x = 24xyz³,  fy,z = fz,y = 12x²z³.
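Clairaut's theorem can be illustrated numerically: a symmetric finite-difference estimate of a mixed partial gives the same value in either order. A sketch (illustrative Python; the helper `second_partial` is an assumption of this sketch):

```python
def f(x, y, z):
    return -2*x*y**2 + z**3 + 3*x**2*y*z**4

def second_partial(f, p, i, j, h=1e-4):
    # symmetric finite-difference estimate of f_{x_i,x_j} at p
    def shift(p, i, di, j, dj):
        q = list(p); q[i] += di*h; q[j] += dj*h
        return f(*q)
    return (shift(p, i, 1, j, 1) - shift(p, i, 1, j, -1)
            - shift(p, i, -1, j, 1) + shift(p, i, -1, j, -1)) / (4*h*h)

p = (1.0, 2.0, 0.5)
# mixed partials agree in either order (Clairaut's theorem);
# analytically f_{x,y} = -4y + 6xz^4 = -7.625 at p
fxy = second_partial(f, p, 0, 1)
fyx = second_partial(f, p, 1, 0)
```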
Remark. Higher-order partial derivatives can also be defined.<br />
7.3.3 Differentiability, Local Linearity and Tangent (Hyper)Planes<br />
Suppose that f(x) is defined <strong>for</strong> all x in a neighborhood of a and that<br />
fxi (a) exists <strong>for</strong> i = 1,2,... ,k.<br />
Figure 7.4: Plots of z = 4 − x² − y² (left part) and z = x³ − 3xy + y³ (right part). In each case, partial functions are superimposed on the surface. (See Figure 7.5 also.)
Let

h(x) = f(a) + ∇f(a)·(x − a) = f(a) + fx1(a)(x1 − a1) + · · · + fxk(a)(xk − ak).

We say that f is differentiable at a if the following limit holds:

lim_{x→a} ( f(x) − h(x) ) / ‖x − a‖ = 0.

That is, if the error in using h as an approximation to f approaches 0 at a rate faster than the rate at which the distance between x and a approaches 0.
Local linearity. The definition of differentiability says that f is locally linear; the error<br />
in using the linear function h as an approximation to f can be made as small as we like by<br />
taking x to be close enough to a.<br />
Tangent (hyper)planes. If f is differentiable at a, then y = h(x) is called the tangent<br />
plane at (a,f(a)) when k = 2 and is called the tangent hyperplane at (a,f(a)) when k > 2.<br />
Example 7.15 (Quadratic Function) Consider the 2-variable function<br />
f(x,y) = 4 − x² − y² and let a = (0.5, −1).
The left part of Figure 7.4 shows a graph of the surface z = f(x,y), with graphs of the<br />
partial functions z = f(0.5,y) and z = f(x, −1) superimposed.<br />
1. The partial derivatives of f are fx(x,y) = −2x and fy(x,y) = −2y.<br />
2. The slope of z = f(x, −1) when x = 0.5 is fx(0.5, −1) = −1.<br />
3. The slope of z = f(0.5,y) when y = −1 is fy(0.5, −1) = 2.<br />
4. Since f(0.5, −1) = 2.75, the linear approximation at (0.5, −1) is<br />
h(x,y) = 2.75 − (x − 0.5) + 2(y + 1).<br />
In addition, f is differentiable at this point (and everywhere).<br />
Example 7.16 (Cubic Function) Consider the 2-variable function<br />
f(x,y) = x³ − 3xy + y³ and let a = (1.5,1).
The right part of Figure 7.4 shows a graph of the surface z = f(x,y), with graphs of the<br />
partial functions z = f(1.5,y) and z = f(x,1) superimposed.<br />
1. The partial derivatives of f are fx(x,y) = 3x² − 3y and fy(x,y) = 3y² − 3x.
2. The slope of z = f(x,1) when x = 1.5 is fx(1.5,1) = 3.75.<br />
3. The slope of z = f(1.5,y) when y = 1 is fy(1.5,1) = −1.5.<br />
4. Since f(1.5,1) = −0.125, the linear approximation at (1.5,1) is<br />
h(x,y) = −0.125 + 3.75(x − 1.5) − 1.5(y − 1).<br />
In addition, f is differentiable at this point (and everywhere).<br />
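Local linearity can be illustrated numerically for this example: the error of the tangent-plane approximation h shrinks faster than the distance to a = (1.5, 1). A sketch (illustrative Python added here, not part of the notes):

```python
import math

def f(x, y):
    return x**3 - 3*x*y + y**3

def h(x, y):
    # tangent-plane approximation at a = (1.5, 1) from the text
    return -0.125 + 3.75*(x - 1.5) - 1.5*(y - 1)

# |f - h| / distance should shrink to 0 as (x,y) -> a (differentiability)
ratios = []
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    x, y = 1.5 + t, 1.0 + t                 # approach along a diagonal direction
    dist = math.hypot(x - 1.5, y - 1.0)
    ratios.append(abs(f(x, y) - h(x, y)) / dist)
```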
Example 7.17 (Derivative Does Not Exist) Consider the 2-variable function

f(x,y) = ||x| − |y|| − |x| − |y| and let a = (0,0).

Since f(x,0) = 0 and f(0,y) = 0 for each x and y, fx(x,0) = 0 and fy(0,y) = 0, and the linear approximation is h(x,y) = 0.

But f is not differentiable at the origin since the limit given in the definition of differentiability does not exist. If you evaluate the limit by approaching along y = 0 and along y = x, for example, you get two different answers:

lim_{(x,0)→(0,0)} ( f(x,0) − 0 ) / ‖(x,0)‖ = 0 and lim_{(x,x)→(0,0)} ( f(x,x) − 0 ) / ‖(x,x)‖ = −√2.

Theorem 7.18 (Differentiability and Continuity) Suppose that the k-variable function
f(x) is defined <strong>for</strong> all x in a neighborhood of a. Then<br />
1. If f is differentiable at a, then f is continuous at a.<br />
2. If each partial derivative function, fxi (x) <strong>for</strong> i = 1,2,... ,k, is continuous in a neighborhood<br />
of a, then f is differentiable at a.<br />
Note that in the quadratic and cubic function examples, the partial derivative functions (fx<br />
and fy) are defined and continuous <strong>for</strong> all pairs. In the absolute value function example,<br />
partial derivative functions are not defined in a neighborhood of the origin.<br />
7.3.4 Total Derivative<br />
If each coordinate xi is a function of t,<br />
xi = xi(t) <strong>for</strong> i = 1,2,... ,k<br />
(or simply x = x(t)), then f is also a function of t. The derivative of f with respect to t<br />
(when it exists) is called the total derivative of f.<br />
Theorem 7.19 (Total Derivatives) If f is differentiable in a neighborhood of x0 = x(t0)<br />
and each coordinate function is differentiable at t0, then the total derivative exists and its<br />
value can be computed as follows:<br />
(df/dt)(t0) = fx1(x0)x′1(t0) + fx2(x0)x′2(t0) + · · · + fxk(x0)x′k(t0).
Note that if we let x ′ (t) = (x ′ 1 (t),x′ 2 (t),... ,x′ k (t)) be the vector of derivatives of the coordinate<br />
functions, then the total derivative can be written as the dot product of the gradient<br />
vector of f with the derivative vector of x:<br />
(df/dt)(t0) = ∇f(x0)·x′(t0).
Thus, the <strong>for</strong>mula <strong>for</strong> the total derivative is a generalization of the chain rule <strong>for</strong> compositions<br />
of 1-variable functions.<br />
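The chain-rule formula can be sanity-checked numerically. The function f(x,y) = x²y and the path x(t) = (cos t, sin t) in the sketch below are illustrative choices (not taken from the text):

```python
import math

def f(x, y):
    # an illustrative 2-variable function
    return x**2 * y

def x_of_t(t):
    # an illustrative path in the plane
    return (math.cos(t), math.sin(t))

def total_derivative(t):
    # df/dt = grad f(x(t)) . x'(t)
    x, y = x_of_t(t)
    fx, fy = 2*x*y, x**2                  # gradient of f
    xp, yp = -math.sin(t), math.cos(t)    # x'(t)
    return fx*xp + fy*yp

t0, h = 0.3, 1e-6
g = lambda t: f(*x_of_t(t))
numeric = (g(t0 + h) - g(t0 - h)) / (2*h)   # finite-difference derivative of f(x(t))
exact = total_derivative(t0)
```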
Example 7.20 (k = 3, continued) Let f(x,y,z) = −2xy² + z³ + 3x²yz⁴,

x = x(t) = 1/t,  y = y(t) = 1/t²,  z = z(t) = −2t  (that is, x(t) = (1/t, 1/t², −2t)),

and t0 = 1. Since

x′(t) = (−1/t², −2/t³, −2) ⇒ x′(1) = (−1, −2, −2),

and ∇f(x(1)) = ∇f(1,1,−2) = (94, 44, −84) (using the formula for ∇f derived earlier), the total derivative of f when t = 1 is

∇f(x(1))·x′(1) = (94, 44, −84)·(−1, −2, −2) = −14.

7.3.5 Directional Derivatives
If u is a unit vector, and f is a k-variable function defined in a neighborhood of a, then the directional derivative of f at a is

Duf(a) = lim_{h→0} ( f(a + hu) − f(a) ) / h,

provided the limit exists.
In general, the function Duf(x) represents the rate of change of the function f with respect<br />
to a unit change in the direction given by u.<br />
Theorem 7.21 (Directional Derivatives) If the k-variable function f is differentiable<br />
in a neighborhood of a and u is a unit vector, then<br />
Duf(a) = ∇f(a)·u .<br />
That is, the rate of change is the dot product of the gradient vector of f at a with u.<br />
Bounds on directional derivatives. Let θ be the angle between vectors ∇f(a) and u (see page 128). Then, since ‖u‖ = 1,

Duf(a) = ∇f(a)·u = ‖∇f(a)‖ cos(θ).

Further, since the cosine function takes values in the interval [−1,1], directional derivatives take values between −‖∇f(a)‖ and ‖∇f(a)‖. If ∇f(a) ≠ O, then

1. The minimum value is assumed when u = −∇f(a)/‖∇f(a)‖, and
2. The maximum value is assumed when u = ∇f(a)/‖∇f(a)‖.
In the first case, the direction of u is called the direction of steepest descent; in the second<br />
case, it is called the direction of steepest ascent.<br />
Example 7.22 (Quadratic Function, continued) Consider again the function<br />
f(x,y) = 4 − x² − y² and let a = (0.5, −1).
The left part of Figure 7.5 shows contours corresponding to z = 0,1,2,3,4. The unit vector<br />
in the direction of steepest ascent at (0.5, −1) is highlighted.<br />
Since ∇f(a) = (−1,2), the direction of steepest ascent is u = (1/√5)(−1,2), and the slope of the partial function in this direction is √5 ≈ 2.24.
Note that the contour corresponding to z = k is the set of all pairs (x, y) satisfying f(x, y) = k.<br />
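The computation in Example 7.22 can be reproduced in a few lines (illustrative Python):

```python
import math

def grad_f(x, y):
    # gradient of f(x,y) = 4 - x^2 - y^2
    return (-2.0*x, -2.0*y)

gx, gy = grad_f(0.5, -1.0)     # (-1, 2)
norm = math.hypot(gx, gy)      # sqrt(5) ~ 2.24: the maximum rate of increase
u = (gx/norm, gy/norm)         # unit vector in the direction of steepest ascent
```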
Example 7.23 (Cubic Function, continued) Consider again the function<br />
f(x,y) = x³ − 3xy + y³ and let a = (1.5,1).
The right part of Figure 7.5 shows contours corresponding to z = −1,0,1,2,3,4. The unit<br />
vector in the direction of steepest ascent at (1.5,1) is highlighted.<br />
Since ∇f(a) = (3.75, −1.5), the direction of steepest ascent is u = (1/√16.3125)(3.75, −1.5), and the slope of the partial function in this direction is √16.3125 ≈ 4.04.
Figure 7.5: Contour plots of z = 4 − x² − y² (left part) and z = x³ − 3xy + y³ (right part), with directions of steepest ascent highlighted. (See Figure 7.4 also.)
7.4 Optimization<br />
Definitions of local and global extrema <strong>for</strong> functions of k-variables directly generalize the<br />
definitions given in Section 4.4 <strong>for</strong> 1-variable functions. In addition, Fermat’s theorem can<br />
be generalized as follows:<br />
Theorem 7.24 (Fermat’s Theorem) If the k-variable function f has a local extremum<br />
at c and if f is differentiable in a neighborhood of c, then ∇f(c) = O.<br />
Example 7.25 (Quadratic Function, continued) Consider again<br />
f(x,y) = 4 − x² − y² (left parts of Figures 7.4 and 7.5).
f has a local (and global) maximum at the origin and ∇f(0,0) = (0,0).<br />
Example 7.26 (Cubic Function, continued) Consider again<br />
f(x,y) = x³ − 3xy + y³ (right parts of Figures 7.4 and 7.5).
f has a local minimum at (1,1) and ∇f(1,1) = (0,0).<br />
Note that ∇f(0,0) = (0,0), but (0,0) is neither a local max nor a local min.<br />
Critical points. A point c is a critical point (or stationary point) of the k-variable function<br />
f if either ∇f(c) = O or if one of the partial derivatives does not exist at c.<br />
For example, the quadratic function f(x,y) = 4 − x² − y² has a single critical point at the origin; the cubic function f(x,y) = x³ − 3xy + y³ has critical points (1,1) and (0,0).
7.4.1 Linear and Quadratic Approximations<br />
The definition of Taylor polynomials given in Section 6.6 can be generalized to k-variable<br />
functions. This section considers first-order and second-order approximations only.<br />
Linear approximation. Suppose that the k-variable function f is differentiable in a<br />
neighborhood of a. The first-order Taylor polynomial centered at a is defined as follows:<br />
p1(x) = f(a) + ∇f(a)·(x − a).<br />
Note that, since f is differentiable, the error in using p1 as an approximation to f can be<br />
made as small as we like by taking x to be close enough to a.<br />
If we write the list of partial derivatives as the 1-by-k matrix Df(a), the difference vector<br />
(x − a) as a k-by-1 matrix, and consider the matrix product<br />
Df(a)(x − a) = [fx1(a)  fx2(a)  · · ·  fxk(a)] ×
[ x1 − a1 ]
[ x2 − a2 ]
[   ...   ]
[ xk − ak ]
= fx1(a)(x1 − a1) + fx2(a)(x2 − a2) + · · · + fxk(a)(xk − ak)
(a 1-by-1 matrix is equivalent to a scalar), then the first-order Taylor polynomial can be<br />
written as follows:<br />
p1(x) = f(a) + Df(a)(x − a).<br />
Quadratic approximation. The Hessian of a k-variable function f is the k-by-k matrix<br />
whose (i,j)-entry is fxi,xj :<br />
Hf(x) =
[ fx1,x1(x)  fx1,x2(x)  · · ·  fx1,xk(x) ]
[ fx2,x1(x)  fx2,x2(x)  · · ·  fx2,xk(x) ]
[    ...        ...     · · ·     ...    ]
[ fxk,x1(x)  fxk,x2(x)  · · ·  fxk,xk(x) ]
Suppose that the k-variable function f has continuous first- and second-order partial derivatives<br />
in a neighborhood of a. Then the second-order Taylor polynomial centered at a is<br />
defined as follows:<br />
p2(x) = f(a) + Df(a)(x − a) + (1/2)(x − a)ᵀ Hf(a)(x − a),
where Df(a) is the list of partial derivatives written as a 1-by-k matrix, (x − a) is the<br />
difference vector written as a k-by-1 matrix, and Hf(a) is the k-by-k Hessian matrix at a.<br />
Example 7.27 (Ratio Function) Let f(x,y) = y/x and a = (1,1).

1. The first- and second-order partial derivatives of f are:

fx(x,y) = −y/x²,  fy(x,y) = 1/x,
fxx(x,y) = 2y/x³,  fxy(x,y) = fyx(x,y) = −1/x²,  and fyy(x,y) = 0.
2. Since f(a) = 1, fx(a) = −1 and fy(a) = 1, the first-order Taylor polynomial centered at a is

p1(x,y) = 1 + [−1  1] [x − 1; y − 1] = 1 − (x − 1) + (y − 1),

where the column vector is written with its entries separated by semicolons.

3. Further, since fxx(a) = 2, fxy(a) = fyx(a) = −1 and fyy(a) = 0, the second-order Taylor polynomial centered at a is

p2(x,y) = 1 + [−1  1] [x − 1; y − 1] + (1/2) [x − 1  y − 1] [2  −1; −1  0] [x − 1; y − 1]
= 1 − (x − 1) + (y − 1) + (x − 1)² − (x − 1)(y − 1).
Theorem 7.28 (Second-Order Taylor Polynomials) Suppose that the k-variable<br />
function f has continuous first- and second-order partial derivatives in a neighborhood of<br />
a, and let p2(x) be the second-order Taylor polynomial centered at a. Then<br />
lim_{x→a} ( f(x) − p2(x) ) / ‖x − a‖² = 0.

That is, the error function e(x) = f(x) − p2(x) approaches zero at a rate faster than the rate at which the square of the distance between x and a approaches zero.
7.4.2 Second Derivative Test<br />
Suppose that the k-variable function f has continuous first- and second-order partial derivatives<br />
in a neighborhood of the critical point a. Since Df(a) is the zero vector,<br />
p2(x) = f(a) + (1/2)(x − a)ᵀ Hf(a)(x − a).

If the term (x − a)ᵀHf(a)(x − a) is positive for all x sufficiently close (but not equal) to a, then f has a local minimum at a. Similarly, if the term (x − a)ᵀHf(a)(x − a) is negative for all x sufficiently close (but not equal) to a, then f has a local maximum at a.
The second derivative test gives criteria <strong>for</strong> determining when the quadratic term in the<br />
second-order approximation is always positive or always negative <strong>for</strong> x near a. Matrix<br />
notation and operations are useful in developing the test.<br />
Symmetric matrices and quadratic <strong>for</strong>ms. The n-by-n matrix A is symmetric if<br />
aij = aji <strong>for</strong> all i and j. Symmetric matrices have the property that A T = A.<br />
If A is a symmetric n-by-n matrix and h is an n-by-1 matrix (a column vector), then

Q(h) = hᵀAh = Σ_{i=1}^{n} Σ_{j=1}^{n} aij hi hj

is called a quadratic form in h.
Positive and negative definite quadratic forms. The quadratic form Q (respectively, the associated symmetric matrix A) is said to be positive definite if Q(h) > 0 when h ≠ O and is said to be negative definite if Q(h) < 0 when h ≠ O.
If the k-variable function f has continuous first- and second-partial derivatives in a neighborhood<br />
of a, then by Clairaut’s theorem (page 135) the Hessian matrix Hf(a) is a symmetric<br />
k-by-k matrix. Our concern is whether Hf(a) is positive definite or negative definite.<br />
Principal minors and determinants. If A is a k-by-k matrix, then the j-by-j principal minor of A is the matrix
Aj =
[ a11  a12  · · ·  a1j ]
[ a21  a22  · · ·  a2j ]
[  ...  ...         ...]
[ aj1  aj2  · · ·  ajj ]

for j = 1,2,...,k.
Let dj = det(Aj) be the determinant of the j-by-j principal minor (j > 1) and let d1 = a11.<br />
For example, if

A =
[  4   1   2 ]
[ −1   2   1 ]
[  1   3  −3 ],

then d1 = 4, d2 = 9 and d3 = −48.
Second derivative test <strong>for</strong> local extrema. Suppose that the k-variable function f<br />
has continuous first- and second-order partial derivatives in a neighborhood of the critical<br />
point a, let Hf(a) be the Hessian matrix at a, and let d1, d2, · · ·, dk be the sequence of<br />
determinants of the principal minors of Hf(a).<br />
Assume that dk = det(Hf(a)) ≠ 0. Then
1. If di > 0 <strong>for</strong> i = 1,2,... ,k, then f has a local minimum at a.<br />
2. If d1 < 0, d2 > 0, d3 < 0, ..., then f has a local maximum at a.<br />
3. If the patterns given in the first two steps do not hold, then f has a saddle point at<br />
the point a. That is, there is neither a maximum nor a minimum at a.<br />
If dk = det(Hf(a)) = 0, then anything can happen. (The critical point may correspond to<br />
a local maximum, a local minimum, or a saddle point.)<br />
Example 7.29 (Quadratic Function, continued) Consider again

f(x,y) = 4 − x² − y² (left parts of Figures 7.4 and 7.5).

Since

∇f(x,y) = (−2x, −2y) = (0,0) ⇒ x = y = 0,

the only critical point is the origin. Since

Hf(0,0) = [−2  0; 0  −2] ⇒ d1 = −2 and d2 = 4,

the second derivative test implies that f has a local maximum at the origin.
Example 7.30 (Cubic Function, continued) Consider again

f(x,y) = x³ − 3xy + y³ (right parts of Figures 7.4 and 7.5).

Since

∇f(x,y) = (3x² − 3y, −3x + 3y²) = (0,0) ⇒ (x,y) = (1,1) or (0,0),

there are two critical points to consider.

1. Let (x,y) = (1,1). Since

Hf(1,1) = [6  −3; −3  6] ⇒ d1 = 6 and d2 = 27,

the second derivative test implies that f has a local minimum at (1,1).

2. Let (x,y) = (0,0). Since

Hf(0,0) = [0  −3; −3  0] ⇒ d1 = 0 and d2 = −9,

the second derivative test implies that f has a saddle point at the origin.

Note that if we let y = −x, then the 1-variable function z = f(x,−x) = 3x² has a local minimum at the origin; if we let y = x, then the 1-variable function z = f(x,x) = 2x³ − 3x² has a local maximum at the origin.
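For k = 2 the test reduces to checking d1 = fxx and d2 = det(Hf). A sketch applying it to the two examples above (illustrative Python; `classify_2x2` is a hypothetical helper):

```python
def classify_2x2(H):
    # second derivative test for k = 2 via determinants of principal minors
    d1 = H[0][0]
    d2 = H[0][0]*H[1][1] - H[0][1]*H[1][0]
    if d2 == 0:
        return "inconclusive"
    if d1 > 0 and d2 > 0:
        return "local minimum"
    if d1 < 0 and d2 > 0:
        return "local maximum"
    return "saddle point"

# Hessians at the critical points of the examples above
quad_origin  = classify_2x2([[-2, 0], [0, -2]])   # f = 4 - x^2 - y^2 at (0,0)
cubic_11     = classify_2x2([[6, -3], [-3, 6]])   # f = x^3 - 3xy + y^3 at (1,1)
cubic_origin = classify_2x2([[0, -3], [-3, 0]])   # same f at (0,0)
```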
7.4.3 Constrained Optimization and Lagrange Multipliers<br />
In many applications of calculus the goal is to optimize (either maximize or minimize) a<br />
real-valued function of k-variables, f(x), subject to one or more constraints:<br />
g1(x) = c1, g2(x) = c2, ... , gp(x) = cp, where the ci are constants, and p < k.<br />
Solve and substitute method. It may be possible to use the constraints to reduce the<br />
number of variables in the optimization problem, as illustrated in the following example.<br />
Example 7.31 (Constrained Minimization) Let x represent length, y represent width,<br />
and z represent height of an aquarium. Assume that the cost of the slate needed <strong>for</strong><br />
the bottom of the aquarium is 4 times the cost of the glass needed <strong>for</strong> the sides. Then<br />
4xy + 2xz + 2yz is proportional to the total cost <strong>for</strong> these materials.<br />
The problem of interest is to find positive numbers x, y, z to<br />
Minimize 4xy + 2xz + 2yz when the volume V = xyz = 16 cubic units.<br />
The volume constraint can be solved <strong>for</strong> z, and the result substituted into the 3-variable<br />
function we would like to minimize, yielding the following 2-variable problem:<br />
Minimize f(x,y) = 4xy + 32/y + 32/x over the domain where x > 0 and y > 0.
Figure 7.6: Surface (left) and contour (right) plots of f(x,y) = 4xy + 32/y + 32/x. Contours correspond to f(x,y) = 49,51,53,55,57.

(See Figure 7.6.) Working with this function,

∇f(x,y) = (4y − 32/x², 4x − 32/y²) and Hf(x,y) = [64/x³  4; 4  64/y³].

Since ∇f(x,y) = (0,0) when x = y = 2, Hf(2,2) = [8  4; 4  8], d1 = 8 and d2 = 48, we know that f has a local (and global) minimum at (2,2). The minimum value is 48.
The dimensions of the aquarium should be x = 2, y = 2, z = 4.<br />
Method of Lagrange Multipliers. A general method, which does not require the “solve<br />
and substitute” approach outlined in the example above, depends on the following theorem.<br />
Theorem 7.32 (Lagrange Multipliers) Assume that the k-variable functions f and<br />
g1, g2, ..., gp have continuous first partial derivatives on the domain<br />
D = {x : g1(x) = c1, g2(x) = c2, ..., gp(x) = cp},<br />
f has an extremum on this domain at a, and that<br />
{∇g1(a), ∇g2(a), ..., ∇gp(a)} is a linearly independent set (see below).<br />
Then there exist constants λ1, ..., λp (called Lagrange multipliers) satisfying<br />
∇f(a) = λ1∇g1(a) + λ2∇g2(a) + · · · + λp∇gp(a).<br />
Note that when p = 1, the linear independence requirement in Lagrange's theorem reduces to ∇g1(a) ≠ O. When p > 1, the requirement means that

the linear combination Σ_{i=1}^{p} ai ∇gi(a) = O only when each coefficient ai = 0.
The theorem can be used to find critical points. When p = 1, we drop the subscript on the<br />
single constraint and solve<br />
∇f(x) = λ∇g(x) and g(x) = c <strong>for</strong> x1,x2,...,xk and λ.<br />
Example 7.33 (Constrained Minimization, continued) Consider again the aquarium<br />
problem and let f(x,y,z) = 4xy + 2xz + 2yz and g(x,y,z) = xyz. Then<br />
∇f(x,y,z) = (4y + 2z,4x + 2z,2x + 2y) and ∇g(x,y,z) = (yz,xz,xy).<br />
Critical points need to satisfy the following system of 4 equations in 4 unknowns:<br />
4y + 2z = λyz, 4x + 2z = λxz, 2x + 2y = λxy and xyz = 16.<br />
Since x, y and z must be positive, we can solve the first 3 equations for λ:

λ = 4/z + 2/y,   λ = 4/z + 2/x,   λ = 2/y + 2/x.
From equations 1 and 2 we get y = x, and from equations 1 and 3 we get z = 2x. Now,<br />
16 = xyz = x(x)(2x) ⇒ x = 2, y = 2, z = 4.<br />
Finally, λ = 2 as well. (Note that λ can be interpreted as the rate of change of cost with<br />
respect to a unit change in the volume constraint.)<br />
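The solution can be verified by direct substitution into the Lagrange system (an illustrative Python sketch):

```python
# Verify that (x, y, z) = (2, 2, 4) with lambda = 2 satisfies the
# Lagrange conditions for the aquarium problem.
x, y, z, lam = 2.0, 2.0, 4.0, 2.0

grad_f = (4*y + 2*z, 4*x + 2*z, 2*x + 2*y)   # gradient of the cost function
grad_g = (y*z, x*z, x*y)                     # gradient of the volume constraint

# grad f - lambda * grad g should vanish, and the constraint should hold
residuals = [a - lam*b for a, b in zip(grad_f, grad_g)]
volume = x*y*z
```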
Lagrangian function. The method outlined in Lagrange's theorem can be coded using the following (p + k)-variable Lagrangian function:

L(λ1,...,λp, x1,...,xk) = f(x1,...,xk) − Σ_{i=1}^{p} λi (gi(x1,...,xk) − ci).
For simplicity, we use the notation L(λ,x).<br />
Critical points of L satisfy the result of Lagrange’s theorem. A special <strong>for</strong>m of the second<br />
derivative test allows us to determine in some cases whether critical points are local extrema.<br />
Second derivative test <strong>for</strong> constrained local extrema. Let a be a constrained critical<br />
point of f subject to the constraints gi(x) = ci (<strong>for</strong> all i), and assume that all first- and<br />
second-partial derivatives are continuous on the constrained domain.<br />
Let d1, d2, ..., dp+k be the determinants of the principal minors of HL(λ,a) and consider<br />
the subsequence of (k − p) values:<br />
(−1)^p d2p+1, (−1)^p d2p+2, ..., (−1)^p dp+k.
Assume that dp+k = det(HL(λ,a)) ≠ 0. Then
1. If all terms in the subsequence are positive, then f has a constrained local<br />
minimum value at a.<br />
2. If the first term in the subsequence is negative, and the signs alternate,<br />
then f has a constrained local maximum value at a.<br />
3. If the patterns given in the first two steps do not hold, then f has a constrained<br />
saddle point at a.<br />
Note that if dp+k = det(HL(λ,a)) = 0, then anything can happen.<br />
Example 7.34 (Constrained Minimization, continued) Returning to the problem<br />
above,<br />
L(λ,x,y,z) = 4xy + 2xz + 2yz − λ(xyz − 16),<br />
∇L(λ,x,y,z) = (16 − xy z,4y + 2z − y z λ,4x + 2z − xz λ,2x + 2y − xy λ), and<br />
HL(λ,x,y,z) =
[   0      −yz       −xz       −xy    ]
[  −yz      0       4 − zλ    2 − yλ  ]
[  −xz    4 − zλ      0       2 − xλ  ]
[  −xy    2 − yλ    2 − xλ      0     ].
If λ = 2, x = 2, y = 2, z = 4, then d1 = 0, d2 = −64, d3 = −512, d4 = −768.<br />
We need to consider the subsequence −d3 = 512 and −d4 = 768. Since both terms are<br />
positive, f has a constrained local minimum when x = 2, y = 2, z = 4.<br />
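The determinants d1,...,d4 can be reproduced with a small cofactor-expansion routine (illustrative Python; the `det` helper is not part of the notes):

```python
def det(M):
    # determinant by cofactor expansion along the first row (fine for small matrices)
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j+1:] for row in M[1:]]
        total += (-1)**j * M[0][j] * det(minor)
    return total

x, y, z, lam = 2.0, 2.0, 4.0, 2.0
HL = [
    [0.0,   -y*z,       -x*z,       -x*y     ],
    [-y*z,   0.0,        4 - z*lam,  2 - y*lam],
    [-x*z,   4 - z*lam,  0.0,        2 - x*lam],
    [-x*y,   2 - y*lam,  2 - x*lam,  0.0     ],
]

# determinants d1..d4 of the principal minors of HL
d = [det([row[:j] for row in HL[:j]]) for j in range(1, 5)]
# for p = 1 constraint and k = 3 variables, examine (-1)^1 d3 and (-1)^1 d4
signs = [-d[2], -d[3]]   # both positive => constrained local minimum
```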
Example 7.35 (k = 3, p = 2) Consider the following minimization problem:<br />
and let<br />
Minimize x 2 + y 2 + z 2 subject to x + y + z = 12 and x + 2y + 3z = 18<br />
L(λ,µ,x,y,z) = x 2 + y 2 + z 2 − λ(x + y + z − 12) − µ(x + 2y + 3z − 18).<br />
Then ∇L(λ,µ,x,y,z) =<br />
(12 − x − y − z, 18 − x − 2y − 3z, 2x − λ − µ, 2y − λ − 2µ, 2z − λ − 3µ),<br />
HL(λ,µ,x,y,z) =
[  0    0   −1   −1   −1 ]
[  0    0   −1   −2   −3 ]
[ −1   −1    2    0    0 ]
[ −1   −2    0    2    0 ]
[ −1   −3    0    0    2 ],
and d1 = d2 = d3 = 0, d4 = 1 and d5 = 12 (<strong>for</strong> all x, y, z).<br />
Since<br />
∇L(λ,µ,x,y,z) = O ⇒ λ = 20, µ = −6, x = 7, y = 4, z = 1,<br />
and (−1)² d5 = 12 > 0, there is a constrained minimum at (x,y,z) = (7,4,1).
Footnote: The minimization problem above corresponds to finding the point on the intersection of<br />
the planes x + y + z = 12 and x + 2y + 3z = 18 which is closest to the origin.<br />
7.4.4 Method of Steepest Descent/Ascent<br />
The first step in solving an optimization problem is to find the critical points of f by solving<br />
the derivative equation ∇f(x) = O. Since this equation may be difficult to solve, a variety<br />
of approximate methods have been developed. This section focuses on methods based on<br />
properties of directional derivatives and gradients. (See Section 7.3.5.)<br />
Method of steepest descent. Suppose that f has continuous partial derivatives in a<br />
neighborhood of a and has a local minimum at a. If x0 (an initial point) is close enough<br />
to the point a and ∇f(x0) ≠ O, then

The minimum value of Duf(x0) occurs when u = −∇f(x0)/‖∇f(x0)‖.
That is, the directional derivative is minimized in the direction of the negative of the gradient<br />
at the point x0.<br />
The idea of the method of steepest descent is to move in the direction of steepest descent<br />
as far as possible with the hope of getting closer to a. Specifically, let g(t) be the following<br />
1-variable function:<br />
g(t) = f( x0 − t ∇f(x0)/‖∇f(x0)‖ ).

Note that g(0) = f(x0). In addition, for values of t near zero, g(t) is decreasing. Let t0 > 0 be the critical number of g closest to 0 and let x1 = x0 − t0 (∇f(x0)/‖∇f(x0)‖).
In general, the process is repeated, defining

xi+1 = xi − ti (∇f(xi)/‖∇f(xi)‖),

until ti is close enough to zero.
Example 7.36 (Constrained Minimization, continued) The method is illustrated<br />
using a familiar situation:<br />
Minimize f(x,y) = 4xy + 32/y + 32/x over the domain where x > 0 and y > 0.

(See Figure 7.6.)
The following table shows the results of the first eight steps of the algorithm, using (3.2,4.1)<br />
as the initial point, and using four decimal places of accuracy:<br />
i 0 1 2 3 4 5 6 7<br />
ti 2.0973 0.9084 0.1588 0.0458 0.0038 0.0007 0.0000 0.0000<br />
xi 3.2 1.5789 2.1553 2.0325 2.0034 2.0005 2.0000 2.0000<br />
yi 4.1 2.7694 2.0672 1.9664 2.0019 1.9995 2.0000 2.0000<br />
Both t6 and t7 are virtually zero (and x6 and x7 are virtually equal), indicating that the<br />
algorithm has identified (2,2) as a local minimum point.<br />
Note that Newton’s method (Section 4.4.5) was used to find each ti.<br />
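The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not the notes' implementation: in place of Newton's method for finding each ti, it uses a crude grid search over t (an assumption made for simplicity), keeping iterates inside the domain x > 0, y > 0.

```python
import math

def f(x, y):
    # objective from Example 7.36; domain x > 0, y > 0
    return 4*x*y + 32/y + 32/x

def grad(x, y):
    # partial derivatives of f
    return (4*y - 32/x**2, 4*x - 32/y**2)

def steepest_descent(x, y, steps=40):
    for _ in range(steps):
        gx, gy = grad(x, y)
        norm = math.hypot(gx, gy)
        if norm < 1e-10:
            break
        ux, uy = gx / norm, gy / norm       # unit gradient direction
        # crude grid line search (stands in for Newton's method on g'(t))
        best_t, best_val = 0.0, f(x, y)
        t = 0.001
        while t <= 5.0:
            nx, ny = x - t*ux, y - t*uy
            if nx > 0 and ny > 0:           # stay in the domain
                val = f(nx, ny)
                if val < best_val:
                    best_t, best_val = t, val
            t += 0.001
        x, y = x - best_t*ux, y - best_t*uy
    return x, y

x, y = steepest_descent(3.2, 4.1)   # converges near the minimum point (2, 2)
```

Starting from (3.2, 4.1) as in the table, the iterates approach (2, 2), matching the last columns above.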
Figure 7.7: Weight change in grams (vertical axis) versus dosage level in 100 mg/kg/day (horizontal axis) for ten animals. The gray curve is y = −0.49x^2 + 1.98x + 6.70.
Method of steepest ascent. To approximate a local maximum, the function

g(t) = f(xi + t(∇f(xi)/‖∇f(xi)‖))

is used to find ti, and the formula

xi+1 = xi + ti(∇f(xi)/‖∇f(xi)‖)

is used to find the next approximate point.

7.5 Applications in Statistics
Optimization is an essential part of many problems in statistics. In this section, optimization<br />
techniques are used to solve problems in least squares and maximum likelihood estimation.<br />
7.5.1 Least Squares Estimation in Quadratic Models

Given n data pairs (x1,y1), (x2,y2), ..., (xn,yn), whose values lie close to a quadratic curve of the form

y = ax^2 + bx + c,

the method of least squares, developed by Legendre in the early 1800s, allows us to estimate the unknown parameters: Find the values of a, b, and c which minimize the sum of squared deviations of observed y-coordinates from the values expected if the data points fit the curve exactly. That is, minimize the following 3-variable function:

f(a,b,c) = Σ_{i=1}^{n} (yi − (a xi^2 + b xi + c))^2.
Example 7.37 (Toxicology Study) (Fine & Bosch, JASA 94:375-382, 2000) As part of a<br />
study to determine the adverse effects of a proposed drug <strong>for</strong> the treatment of tuberculosis,<br />
female rats were given the drug <strong>for</strong> a period of 14 days at each of five dosage levels (in 100<br />
milligrams per kilogram per day).<br />
The vertical axis in Figure 7.7 is the weight change in grams (defined as the weight at the<br />
end of the period minus the weight at the beginning of the period) <strong>for</strong> ten animals in the<br />
study, and the horizontal axis shows the dose in 100 mg/kg/day.<br />
For these data,

f(a,b,c) = 562.63 + 703.45a + 7612.13a^2 − 15.1b + 2223.5ab + 172.5b^2 − 88.6c + 345ac + 62bc + 10c^2,

∇f(a,b,c) = (703.45 + 15224.3a + 2223.5b + 345c, −15.1 + 2223.5a + 345b + 62c, −88.6 + 345a + 62b + 20c),

and ∇f(a,b,c) = O when a = −0.49, b = 1.98, and c = 6.70.

Since

Hf = [ 15224.3   2223.5   345
        2223.5    345      62
         345       62      20 ],   d1 = 15224.3,  d2 ≈ 3.1 × 10^5,  d3 ≈ 1.7 × 10^6,

and f is a quadratic function, f is minimized when a = −0.49, b = 1.98, and c = 6.70. The least squares quadratic fit formula is shown in Figure 7.7.
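Since f is quadratic, the critical-point equations ∇f(a,b,c) = O form a 3×3 linear system (the normal equations). A minimal sketch, assuming the coefficients are read off from the expanded f(a,b,c) above (each equation of ∇f = O divided through by 2):

```python
def solve3(A, rhs):
    # Gaussian elimination with partial pivoting for a 3x3 linear system
    n = 3
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))  # pivot row
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            factor = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= factor * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                         # back substitution
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

# normal equations (half of grad f = O) from the expansion in Example 7.37
A = [[7612.13, 1111.75, 172.5],
     [1111.75,  172.5,   31.0],
     [172.5,     31.0,   10.0]]
rhs = [-351.725, 7.55, 44.3]
a, b, c = solve3(A, rhs)   # approximately (-0.49, 1.98, 6.70)
```

The solution agrees with the critical point reported above to two decimal places.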
Linear algebra and least squares. A linear algebra approach to least squares analysis<br />
is discussed in Sections 9.5.5 and 9.8.1 of these notes.<br />
7.5.2 Maximum Likelihood Estimation in Multinomial Models<br />
This section generalizes ideas from Section 4.5.1.<br />
Consider the problem of estimating the proportions p1, ..., pk in a multinomial experiment. Let p = (p1, p2, ..., pk), and let xi be the observed number of occurrences of outcome i in n trials (for i = 1, 2, ..., k). We will demonstrate that the sample proportions, pi = xi/n, are maximum likelihood estimates of the proportions in the multinomial experiment using constrained optimization techniques.
The likelihood function, Lik(p), is the joint PDF with n and the xi fixed and p varying:

Lik(p) = (n; x1, x2, ..., xk) p1^{x1} p2^{x2} · · · pk^{xk},

where (n; x1, ..., xk) denotes the multinomial coefficient n!/(x1! x2! · · · xk!). The log-likelihood function, ℓ(p), is the natural logarithm of the likelihood:

ℓ(p) = ln(Lik(p)) = ln(n; x1, x2, ..., xk) + x1 ln(p1) + x2 ln(p2) + · · · + xk ln(pk).
We wish to maximize ℓ(p) subject to the constraint that the sum of the proportions is 1.<br />
The Lagrangian function is<br />
L(λ,p) = ℓ(p) − λ(p1 + p2 + · · · + pk − 1),<br />
and the gradient of the Lagrangian function is

∇L(λ,p) = (1 − p1 − p2 − · · · − pk, x1/p1 − λ, x2/p2 − λ, ..., xk/pk − λ).

If all quantities are positive, then ∇L(λ,p) = O implies

1 = Σ_i pi and pi = xi/λ for each i.

Summing the equations pi = xi/λ gives 1 = Σ_i xi/λ = n/λ, so λ = n and pi = xi/n. These equations imply that (λ, p) = (n, x1/n, x2/n, ..., xk/n) is a critical point.
Example 7.38 (k = 3) To check that the result is a maximum, we consider the special case when

k = 3, n = 40, x1 = 18, x2 = 12 and x3 = 10.

Then

HL = [  0      −1        −1       −1
       −1    −800/9       0        0
       −1       0      −400/3      0
       −1       0         0      −160 ],

d1 = 0,  d2 = −1,  d3 = 2000/9,  d4 = −1280000/27.
Since −d3 < 0 and −d4 > 0, the second derivative test for constrained local extrema implies that there is a constrained local maximum when p1 = 0.45, p2 = 0.30, and p3 = 0.25 (and λ = 40). There is both a local and global maximum at (p1,p2,p3) = (0.45, 0.30, 0.25).
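A quick numerical sanity check of this conclusion (a sketch: it compares the log-likelihood at the sample proportions against random points on the probability simplex; the multinomial-coefficient constant is dropped since it does not affect the comparison):

```python
import math
import random

x = [18, 12, 10]                     # observed counts from Example 7.38; n = 40
n = sum(x)
p_hat = [xi / n for xi in x]         # sample proportions (0.45, 0.30, 0.25)

def loglik(p):
    # log-likelihood without the additive multinomial-coefficient constant
    return sum(xi * math.log(pi) for xi, pi in zip(x, p))

# compare the log-likelihood at p_hat against random points on the simplex
random.seed(0)
ok = True
for _ in range(1000):
    w = [random.random() + 1e-9 for _ in range(3)]   # positive weights
    p = [wi / sum(w) for wi in w]                    # normalize onto simplex
    if loglik(p) > loglik(p_hat) + 1e-12:
        ok = False
```

No random point on the simplex beats the sample proportions, consistent with (0.45, 0.30, 0.25) being the global maximum.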
7.6 Multiple Integration<br />
This section generalizes ideas from Section 4.7.<br />
7.6.1 Riemann Sums and Double Integrals over Rectangles<br />
Suppose that the 2-variable function f(x,y) is continuous on the rectangle

R = [a,b] × [c,d] = {(x,y) : a ≤ x ≤ b, c ≤ y ≤ d} ⊂ R^2

and let m and n be positive integers. Further, let Δx = (b−a)/m, Δy = (d−c)/n,

1. xi = a + iΔx for i = 0,1,...,m, x*i = (xi−1 + xi)/2 for i = 1,2,...,m,

2. yj = c + jΔy for j = 0,1,...,n, and y*j = (yj−1 + yj)/2 for j = 1,2,...,n.
Then the (m + 1) numbers

a = x0 < x1 < x2 < · · · < xm = b

partition [a,b] into m subintervals of equal length with midpoints x*i, the (n + 1) numbers

c = y0 < y1 < y2 < · · · < yn = d
Figure 7.8: Plot of z = 3x^2 + y + 2 on [0,4] × [0,6] (left) and approximating boxes (right).

partition [c,d] into n subintervals of equal length with midpoints y*j, and the mn subrectangles

Ri,j = [xi−1, xi] × [yj−1, yj]

form an m-by-n partition of the rectangle R. Each subrectangle has area ΔA = (Δx)(Δy).
The double integral of f over R is

∫∫_R f(x,y) dA = lim_{m,n→∞} Σ_{i=1}^{m} Σ_{j=1}^{n} f(x*i, y*j) ΔA.
In this <strong>for</strong>mula, f(x,y) is the integrand, and the sum on the right is called a Riemann sum.<br />
Continuity and Integrability. f(x,y) is said to be integrable on the rectangle R if<br />
the limit above exists. Continuous functions are always integrable. Further, it is possible<br />
to show that the limit exists when the values of f are bounded on R and the set of<br />
discontinuities has area zero.<br />
Volumes. If f is nonnegative on R, then the double integral over R can be interpreted<br />
as the volume under the surface z = f(x,y) and above the xy-plane <strong>for</strong> (x,y) ∈ R.<br />
Example 7.39 (Computing Volume) Let f(x,y) = 3x^2 + y + 2 on R = [0,4] × [0,6]. To estimate the volume under the surface z = f(x,y) and above the xy-plane over R (left part of Figure 7.8), we let m = 4, n = 6 and Δx = Δy = 1. The 24 midpoints are

(x*, y*), where x* = 0.5, 1.5, 2.5, 3.5 and y* = 0.5, 1.5, ..., 5.5,
and the value of the Riemann sum is 498.0. This value is the sum of the volumes of the 24<br />
rectangular boxes shown in the right part of Figure 7.8.<br />
The following list of approximate values suggests that the limit is 504.<br />
m 4 8 16 32 64 128 256<br />
n 6 12 24 48 96 192 384<br />
Sum 498.0 502.5 503.625 503.906 503.977 503.994 503.999<br />
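The Riemann sums in the table can be reproduced with a short midpoint-rule sketch:

```python
def riemann_midpoint(f, a, b, c, d, m, n):
    # midpoint Riemann sum for f over [a,b] x [c,d] with an m-by-n partition
    dx, dy = (b - a) / m, (d - c) / n
    total = 0.0
    for i in range(m):
        xs = a + (i + 0.5) * dx          # midpoint of i-th x-subinterval
        for j in range(n):
            ys = c + (j + 0.5) * dy      # midpoint of j-th y-subinterval
            total += f(xs, ys) * dx * dy
    return total

f = lambda x, y: 3*x**2 + y + 2
s1 = riemann_midpoint(f, 0, 4, 0, 6, 4, 6)      # 498.0, the first table entry
s2 = riemann_midpoint(f, 0, 4, 0, 6, 64, 96)    # approaches 504
```

Increasing m and n reproduces the remaining columns of the table.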
Figure 7.9: Area slices of z = 3x^2 + y + 2 on [0,4] × [0,6] for fixed x = 0, 0.5, ..., 4 (left) and for fixed y = 0, 0.5, ..., 6 (right).
Average values. The quantity

(1/((b − a)(d − c))) ∫∫_R f(x,y) dA

can be interpreted as the average value of f(x,y) on R.

To see this, let f̄ = (1/(mn)) Σ_{i,j} f(x*i, y*j) be the sample average, and let A = (b − a)(d − c) be the area of the rectangle. Then

f̄ = (A/A) (1/(mn)) Σ_{i,j} f(x*i, y*j) = (1/A) Σ_{i,j} f(x*i, y*j) ΔA  ⇒  lim_{m,n→∞} f̄ = (1/A) ∫∫_R f(x,y) dA.
For example, the list of approximations in the previous example suggests that the average value of f(x,y) = 3x^2 + y + 2 on [0,4] × [0,6] is 504/24 = 21.
7.6.2 Iterated Integrals over Rectangles<br />
Suppose that f is a continuous, 2-variable function on the rectangle R = [a,b] × [c,d].<br />
We use the notation

∫_c^d f(x,y) dy

to mean that x is held fixed and f is integrated as y varies from c to d. Similarly, we use the notation

∫_a^b f(x,y) dx

to mean that y is held fixed and f is integrated as x varies from a to b.

If f is nonnegative on R, then the results can be interpreted as areas.
Example 7.40 (Area Slices) Consider again f(x,y) = 3x^2 + y + 2 on [0,4] × [0,6].
The left part of Figure 7.9 shows area slices related to the partial functions of f when x = 0, 0.5, 1, ..., 4. The area of each slice as a function of x is

Ax(x) = ∫_0^6 (3x^2 + y + 2) dy = [3x^2 y + (1/2)y^2 + 2y]_0^6 = 18x^2 + 30,

and the areas of the displayed slices are 30, 34.5, 48, ..., 318.

The right part of Figure 7.9 shows area slices related to the partial functions of f when y = 0, 0.5, 1, ..., 6. The area of each slice as a function of y is

Ay(y) = ∫_0^4 (3x^2 + y + 2) dx = [x^3 + yx + 2x]_0^4 = 4y + 72,

and the areas of the displayed slices are 72, 74, 76, ..., 96.
Cavalieri's principle. Continuing with the area application, Cavalieri's principle (also called the method of slicing) says that the volume under the surface can be obtained by integrating the area function:

Volume = ∫_c^d Ay(y) dy = ∫_c^d ∫_a^b f(x,y) dx dy
       = ∫_a^b Ax(x) dx = ∫_a^b ∫_c^d f(x,y) dy dx.

The rightmost integrals on each line above are examples of iterated integrals. In each case, the inside integral is evaluated first, followed by the outside integral.
Example 7.41 (Cavalieri's Principle) Continuing with the example above, since

∫_0^4 (18x^2 + 30) dx = [6x^3 + 30x]_0^4 = 504 and ∫_0^6 (4y + 72) dy = [2y^2 + 72y]_0^6 = 504,

we know that the volume of the solid shown in the left part of Figure 7.8 is 504 cubic units.
Double versus iterated integrals. Fubini’s theorem says that we can compute double<br />
integrals by computing iterated integrals, where the iteration can be done in either order.<br />
Theorem 7.42 (Fubini's Theorem) Suppose that f is a continuous, 2-variable function on the rectangle R = [a,b] × [c,d]. Then

∫∫_R f(x,y) dA = ∫_c^d ∫_a^b f(x,y) dx dy = ∫_a^b ∫_c^d f(x,y) dy dx.
Note that Fubini’s theorem remains true if f is a bounded function on R satisfying (1) the<br />
set of discontinuities of f has area 0, and (2) each slice (<strong>for</strong> fixed x or <strong>for</strong> fixed y) intersects<br />
the set of discontinuities in at most a finite number of points.<br />
Figure 7.10: Plot of z = 10/x^2 − 90/(x + 2y)^2 on [1,2] × [0,2] (left plot) and function domain with y = x superimposed (right plot).
Example 7.43 (Difference in Volumes) Let

f(x,y) = 10/x^2 − 90/(x + 2y)^2 on R = [1,2] × [0,2],

as illustrated in Figure 7.10.

By Fubini's theorem,

∫∫_R f(x,y) dA = ∫_1^2 ∫_0^2 f(x,y) dy dx
             = ∫_1^2 (20/x^2 − 45/x + 45/(4 + x)) dx
             = [−20/x − 45 ln(x) + 45 ln(4 + x)]_1^2
             ≈ −12.9871.
Note that, given (x,y) ∈ R, f(x,y) ≥ 0 when y ≥ x and f(x,y) ≤ 0 when y ≤ x. Thus, this<br />
answer can be interpreted as the difference between (1) the volume of the solid bounded<br />
by z = f(x,y) and the xy-plane <strong>for</strong> (x,y) ∈ R with y ≥ x and (2) the volume of the solid<br />
bounded by z = f(x,y) and the xy-plane <strong>for</strong> (x,y) ∈ R with y ≤ x.<br />
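The value ≈ −12.9871 can be cross-checked numerically. A minimal midpoint-rule sketch (the grid sizes are arbitrary choices):

```python
def double_midpoint(f, a, b, c, d, m=200, n=200):
    # midpoint-rule approximation of the double integral of f over [a,b] x [c,d]
    dx, dy = (b - a) / m, (d - c) / n
    total = 0.0
    for i in range(m):
        x = a + (i + 0.5) * dx
        for j in range(n):
            y = c + (j + 0.5) * dy
            total += f(x, y) * dx * dy
    return total

f = lambda x, y: 10/x**2 - 90/(x + 2*y)**2
v1 = double_midpoint(f, 1, 2, 0, 2)   # approximates the integral over R = [1,2] x [0,2]
```

By Fubini's theorem, summing the grid in the other order gives the same value.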
7.6.3 Properties of Double Integrals<br />
Properties of double integrals generalize those <strong>for</strong> single integrals from Section 4.7.2.<br />
Suppose that f(x,y) and g(x,y) are integrable functions on R, and let c be a constant.<br />
Then properties of limits imply:<br />
1. Constant functions: ∫∫_R c dA = c × (Area of R).

2. Sums and differences: ∫∫_R (f(x,y) ± g(x,y)) dA = ∫∫_R f(x,y) dA ± ∫∫_R g(x,y) dA.

3. Constant multiples: ∫∫_R cf(x,y) dA = c ∫∫_R f(x,y) dA.

4. Null domain: If the area of R is 0, then ∫∫_R f(x,y) dA = 0.
Figure 7.11: Domains of type 1 (left) and of both types (right).<br />
<br />
<br />
5. Ordering: If f(x,y) ≤ g(x,y) for all (x,y) ∈ R, then ∫∫_R f(x,y) dA ≤ ∫∫_R g(x,y) dA.

6. Absolute value: If |f| is integrable, then |∫∫_R f(x,y) dA| ≤ ∫∫_R |f(x,y)| dA.
7.6.4 Double and Iterated Integrals over General Domains<br />
Double and iterated integrals can be evaluated over more general domains. Specifically,

1. Type 1: Let D = {(x,y) : a ≤ x ≤ b, ℓ(x) ≤ y ≤ u(x)}, where ℓ(x) and u(x) are continuous on [a,b], and let f be a continuous, 2-variable function on D. Then

∫∫_D f(x,y) dA = ∫_a^b ∫_{ℓ(x)}^{u(x)} f(x,y) dy dx.

2. Type 2: Let D = {(x,y) : c ≤ y ≤ d, ℓ(y) ≤ x ≤ u(y)}, where ℓ(y) and u(y) are continuous on [c,d], and let f be a continuous, 2-variable function on D. Then

∫∫_D f(x,y) dA = ∫_c^d ∫_{ℓ(y)}^{u(y)} f(x,y) dx dy.

The left part of Figure 7.11 illustrates a domain of type 1:

Dleft = {(x,y) : 0 ≤ x ≤ 1, 1 − x ≤ y ≤ 1 + x}.

The right part of the figure illustrates a domain of both types:

Dright = {(x,y) : 0 ≤ x ≤ 1, x^2 ≤ y ≤ x} = {(x,y) : 0 ≤ y ≤ 1, y ≤ x ≤ √y}.
Example 7.44 (Figure 7.11, Left Part) To illustrate computations, we compute the double integral of f(x,y) = 70xy^2 over the domain shown in the left part of Figure 7.11:

∫∫_{Dleft} 70xy^2 dA = ∫_0^1 ∫_{1−x}^{1+x} 70xy^2 dy dx = ∫_0^1 (140x^2 + 140x^4/3) dx = 56.
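Iterated integrals with variable limits are easy to approximate numerically. A sketch, assuming a midpoint rule in both variables (grid sizes are arbitrary):

```python
def type1_midpoint(f, a, b, lower, upper, m=400, n=400):
    # iterated midpoint rule over D = {(x,y): a <= x <= b, lower(x) <= y <= upper(x)}
    dx = (b - a) / m
    total = 0.0
    for i in range(m):
        x = a + (i + 0.5) * dx
        lo, hi = lower(x), upper(x)
        dy = (hi - lo) / n               # y-step depends on the slice at x
        for j in range(n):
            y = lo + (j + 0.5) * dy
            total += f(x, y) * dx * dy
    return total

val = type1_midpoint(lambda x, y: 70*x*y**2, 0, 1,
                     lambda x: 1 - x, lambda x: 1 + x)   # approximates 56
```

The approximation agrees with the exact value 56 computed above.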
Note that Dleft can be written as the union of two type 2 domains whose intersection is a set with area 0, and that the double integral can be computed as the sum of the values of two iterated integrals. Specifically,

Dleft = {(x,y) : 0 ≤ y ≤ 1, 1 − y ≤ x ≤ 1} ∪ {(x,y) : 1 ≤ y ≤ 2, y − 1 ≤ x ≤ 1}

(union of triangular regions with intersection the segment from (0,1) to (1,1)), and

∫∫_{Dleft} 70xy^2 dA = ∫_0^1 ∫_{1−y}^{1} 70xy^2 dx dy + ∫_1^2 ∫_{y−1}^{1} 70xy^2 dx dy = 21/2 + 91/2 = 56.

Example 7.45 (Difference in Volumes, continued) Consider again

f(x,y) = 10/x^2 − 90/(x + 2y)^2 on R = [1,2] × [0,2] (Figure 7.10).
The rectangle R can be written as the union of two type 1 domains whose intersection is the segment from (1,1) to (2,2):

D1 = R ∩ {(x,y) : y ≥ x} and D2 = R ∩ {(x,y) : y ≤ x}.

Since

∫∫_{D1} f(x,y) dA = ∫_1^2 ∫_x^2 f(x,y) dy dx = ∫_1^2 (20/x^2 − 25/x + 45/(4 + x)) dx ≈ 0.8758

and

∫∫_{D2} f(x,y) dA = ∫_1^2 ∫_0^x f(x,y) dy dx = ∫_1^2 (−20/x) dx = [−20 ln(x)]_1^2 ≈ −13.8629,

the double integral ∫∫_R f(x,y) dA ≈ 0.8758 − 13.8629 ≈ −12.9871, confirming the result obtained earlier.

Further, the total volume enclosed by

z = f(x,y), the xy-plane, x = 1, x = 2, y = 0 and y = 2

is approximately 0.8758 + 13.8629 = 14.7387.
Average values over general domains. If f is integrable over the general domain D with positive area A(D), then the average value of f over D is

Average of f over D = (1/A(D)) ∫∫_D f(x,y) dA.
Example 7.46 (Figure 7.11, Right Part) To illustrate computations, we find the<br />
average value of f(x,y) = 70xy 2 over the domain shown in the right part of Figure 7.11,<br />
and view the domain as a domain of type 1.<br />
The double integral of f over Dright is

∫∫_{Dright} 70xy^2 dA = ∫_0^1 ∫_{x^2}^{x} 70xy^2 dy dx = ∫_0^1 (70x^4/3 − 70x^7/3) dx = 7/4.

The area of Dright is the double integral of the function 1 (or, equivalently, the single integral over [0,1] of the difference between the value of y on the upper curve and the value of y on the lower curve):

∫∫_{Dright} 1 dA = ∫_0^1 ∫_{x^2}^{x} 1 dy dx = ∫_0^1 (x − x^2) dx = 1/6.

Finally, the average value is (7/4)/(1/6) = 21/2.

7.6.5 Triple and Iterated Integrals over Boxes
Suppose that the 3-variable function f(x,y,z) is continuous on the box

B = [a,b] × [c,d] × [p,q] = {(x,y,z) : a ≤ x ≤ b, c ≤ y ≤ d, p ≤ z ≤ q} ⊂ R^3

and let ℓ, m and n be positive integers. Further, let Δx = (b−a)/ℓ, Δy = (d−c)/m and Δz = (q−p)/n,

1. xi = a + iΔx for i = 0,1,...,ℓ, x*i = (xi−1 + xi)/2 for i = 1,2,...,ℓ,

2. yj = c + jΔy for j = 0,1,...,m, y*j = (yj−1 + yj)/2 for j = 1,2,...,m,

3. zk = p + kΔz for k = 0,1,...,n, and z*k = (zk−1 + zk)/2 for k = 1,2,...,n.
Then the (ℓ + 1) numbers

a = x0 < x1 < · · · < xℓ = b

partition [a,b] into ℓ subintervals of equal length with midpoints x*i, the (m + 1) numbers

c = y0 < y1 < · · · < ym = d

partition [c,d] into m subintervals of equal length with midpoints y*j, the (n + 1) numbers

p = z0 < z1 < · · · < zn = q

partition [p,q] into n subintervals of equal length with midpoints z*k, and the ℓmn subboxes

Bi,j,k = [xi−1, xi] × [yj−1, yj] × [zk−1, zk] (i = 1,...,ℓ; j = 1,...,m; k = 1,...,n)

form an ℓ-by-m-by-n partition of B. Each subbox has volume ΔV = (Δx)(Δy)(Δz).
The triple integral of f over B is

∫∫∫_B f(x,y,z) dV = lim_{ℓ,m,n→∞} Σ_{i=1}^{ℓ} Σ_{j=1}^{m} Σ_{k=1}^{n} f(x*i, y*j, z*k) ΔV.
In this <strong>for</strong>mula, f(x,y,z) is the integrand and the sum on the right is a Riemann sum.<br />
Continuity and integrability. f(x,y,z) is said to be integrable on the box B if the limit<br />
above exists. Continuous functions are always integrable. Further, it is possible to show<br />
that the limit above exists if the values of f are bounded on B and the set of discontinuities<br />
of f on B has volume zero.<br />
Iterated integrals and Fubini's theorem. A generalization of Fubini's theorem (p. 154) states that if f is continuous on B, then the triple integral can be computed using iterated integrals, where the iteration can be done in any order. In particular,

∫∫∫_B f(x,y,z) dV = ∫_a^b ∫_c^d ∫_p^q f(x,y,z) dz dy dx.

(There are a total of six orders.)

More generally, if f is a bounded function on B satisfying (1) the set of discontinuities of f has volume 0, and (2) each line parallel to one of the coordinate axes intersects the set of discontinuities of f in at most a finite number of points, then the triple integral can be computed using iterated integrals in any order.
Average values. The quantity

(1/((b − a)(d − c)(q − p))) ∫∫∫_B f(x,y,z) dV

can be interpreted as the average value of f on B.

To see this, let

f̄ = (1/(ℓmn)) Σ_{i,j,k} f(x*i, y*j, z*k)
be the sample average, and let V = (b − a)(d − c)(q − p) be the volume of the box. Then<br />
f̄ = (V/V) (1/(ℓmn)) Σ_{i,j,k} f(x*i, y*j, z*k) = (1/V) Σ_{i,j,k} f(x*i, y*j, z*k) ΔV  ⇒  lim_{ℓ,m,n→∞} f̄ = (1/V) ∫∫∫_B f(x,y,z) dV.
Example 7.47 (Average Value) To illustrate computations, we find the average of<br />
f(x,y,z) = 12x 2 (y − z) on B = [−1,1] × [−1,2] × [0,3].<br />
Using the generalization of Fubini's theorem,

∫∫∫_B f(x,y,z) dV = ∫_{−1}^{1} ∫_{−1}^{2} ∫_0^3 12x^2 (y − z) dz dy dx
                 = ∫_{−1}^{1} ∫_{−1}^{2} (−54x^2 + 36x^2 y) dy dx
                 = ∫_{−1}^{1} (−108x^2) dx = −72.
Since the volume of B is 18, the average of f on B is −72/18 = −4.<br />
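This computation can be verified with a 3-dimensional midpoint Riemann sum. A minimal sketch (the grid sizes are arbitrary choices):

```python
def triple_midpoint(f, box, nx=40, ny=40, nz=40):
    # 3-D midpoint Riemann sum for f over box = (a, b, c, d, p, q)
    a, b, c, d, p, q = box
    dx, dy, dz = (b - a)/nx, (d - c)/ny, (q - p)/nz
    total = 0.0
    for i in range(nx):
        x = a + (i + 0.5) * dx
        for j in range(ny):
            y = c + (j + 0.5) * dy
            for k in range(nz):
                z = p + (k + 0.5) * dz
                total += f(x, y, z) * dx * dy * dz
    return total

integral = triple_midpoint(lambda x, y, z: 12*x**2*(y - z), (-1, 1, -1, 2, 0, 3))
average = integral / 18          # the volume of B is 2 * 3 * 3 = 18
```

The sum approaches −72, and the estimated average approaches −4, as computed above.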
Figure 7.12: Domains of integration <strong>for</strong> triple integral examples.<br />
7.6.6 Triple Integrals over General Domains<br />
Triple integrals can also be computed over more general domains. There are six basic types,<br />
which depend on the order of integration.<br />
One of the six types can be defined as follows:

W = {(x,y,z) : a ≤ x ≤ b, ℓ1(x) ≤ y ≤ u1(x), ℓ2(x,y) ≤ z ≤ u2(x,y)},

where

1. ℓ1(x) and u1(x) are continuous on [a,b] and

2. ℓ2(x,y) and u2(x,y) are continuous on the 2-dimensional domain

{(x,y) : a ≤ x ≤ b, ℓ1(x) ≤ y ≤ u1(x)}.

If f(x,y,z) is continuous on W, then

∫∫∫_W f(x,y,z) dV = ∫_a^b ∫_{ℓ1(x)}^{u1(x)} ∫_{ℓ2(x,y)}^{u2(x,y)} f(x,y,z) dz dy dx.
Example 7.48 (Figure 7.12, Left Part) Let f(x,y,z) = xy on

W = {(x,y,z) : 0 ≤ x ≤ 2, 0 ≤ y ≤ 1, 0 ≤ z ≤ 4 + x + y}

(left part of Figure 7.12).

Then, the triple integral of f over W is

∫∫∫_W f dV = ∫_0^2 ∫_0^1 ∫_0^{4+x+y} xy dz dy dx = ∫_0^2 ∫_0^1 (4xy + x^2 y + xy^2) dy dx = ∫_0^2 (7x/3 + x^2/2) dx = 6.
Average values over general domains. If f is integrable over the general domain W with positive volume V(W), then the average value of f over W is

Average value of f over W = (1/V(W)) ∫∫∫_W f(x,y,z) dV.
Example 7.49 (Figure 7.12, Right Part) Let f(x,y,z) = 3x on

W = {(x,y,z) : 0 ≤ y ≤ 10, y ≤ x ≤ 20 − y, 0 ≤ z ≤ 25 − y}

(right part of Figure 7.12).

The triple integral of f over W is

∫∫∫_W f dV = ∫_0^{10} ∫_y^{20−y} ∫_0^{25−y} 3x dz dx dy = ∫_0^{10} ∫_y^{20−y} (75x − 3xy) dx dy = ∫_0^{10} (15000 − 2100y + 60y^2) dy = 65000.

The volume of W is the triple integral of the function 1 (equivalently, the volume is the double integral of 25 − y over {(x,y) : 0 ≤ y ≤ 10, y ≤ x ≤ 20 − y}):

V(W) = ∫_0^{10} ∫_y^{20−y} ∫_0^{25−y} 1 dz dx dy = ∫_0^{10} ∫_y^{20−y} (25 − y) dx dy = ∫_0^{10} (500 − 70y + 2y^2) dy = 6500/3.

Finally, the average is (65000)/(6500/3) = 30.
Extensions to k variables. Definitions and techniques for double and triple integrals can be extended to define integration methods for k-variable functions. The extensions are straightforward, but the computations can be quite tedious.
7.6.7 Partial Anti-Differentiation and Partial Differentiation

Computing iterated integrals involves computing partial antiderivatives. This section explores the processes of partial differentiation and partial anti-differentiation further in the 2-variable setting.
Theorem 7.50 (Fixed Endpoints) If the 2-variable function f has continuous first partial derivatives on the rectangle [a,b] × [c,d], then

(d/dy) ∫_a^b f(x,y) dx = ∫_a^b fy(x,y) dx for y ∈ [c,d]

and

(d/dx) ∫_c^d f(x,y) dy = ∫_c^d fx(x,y) dy for x ∈ [a,b].
To illustrate the first equality, let f(x,y) = 4x^3 y^2 and [a,b] = [1,2]. Then

(d/dy) ∫_1^2 4x^3 y^2 dx = (d/dy) (15y^2) = 30y and ∫_1^2 (∂/∂y)(4x^3 y^2) dx = ∫_1^2 8x^3 y dx = 30y.
The theorem can be extended to integrals with variable endpoints, using the rule <strong>for</strong> the<br />
total derivative from Section 7.3.4.<br />
7.6.8 Change of Variables and the Jacobian<br />
The change of variables technique <strong>for</strong> multiple integrals generalizes the method of substitution<br />
<strong>for</strong> single integrals (see page 83). In this section, we introduce the technique in the<br />
2-variable setting.<br />
Consider a vector-valued function T : D* ⊆ R^2 → R^2, with formula

T(u,v) = (x(u,v), y(u,v)) for all (u,v) ∈ D*,

and let D = T(D*) be the image of D*. We say that the transformation T maps the domain D* in the uv-plane to the domain D in the xy-plane.

The functions x(u,v) and y(u,v) are called the coordinate functions of T.
The Jacobian. Assume that each coordinate function has continuous first partial derivatives. The Jacobian of the transformation T, denoted by ∂(x,y)/∂(u,v), is defined as follows:

∂(x,y)/∂(u,v) = | xu  xv |
                | yu  yv | = xu yv − xv yu.

(The Jacobian is the determinant of the matrix of partial derivatives.)
Example 7.51 (Linear Transformation) Let

(x,y) = T(u,v) = (au + bv, cu + dv),

where a, b, c and d are constants.

The formula for T can be written using matrix notation. Specifically, x = T(u) = Au, where

A = [ a  b ]     x = [ x ]     u = [ u ]
    [ c  d ],        [ y ],        [ v ].

A is the matrix of partial derivatives of T; the Jacobian of T is det(A) = (ad − bc).
Figure 7.13: Unit square in uv-plane (left) and parallelogram in xy-plane (right).<br />
Example 7.52 (Nonlinear Transformation) Let

(x,y) = T(u,v) = (√(u/v), √(uv)),

where u and v are positive.

The (simplified) matrix of partial derivatives is

[ xu  xv ]   [ 1/(2√(uv))   −√u/(2v√v) ]
[ yu  yv ] = [ √v/(2√u)      √u/(2√v)  ]

and the (simplified) Jacobian is 1/(2v).
Stretching/shrinking factor. The absolute value of the Jacobian is the “stretching”<br />
(or “shrinking”) factor which tells us how areas in xy-space are related to areas in uv-space.<br />
For linear trans<strong>for</strong>mations, the relationship between areas is easy to state.<br />
Theorem 7.53 (Linear Transformations) Let x = T(u) = Au, where

A = [ a  b ]
    [ c  d ] and det(A) = (ad − bc) ≠ 0.

Then

1. T is a one-to-one and onto transformation.

2. T maps triangles to triangles and parallelograms to parallelograms in such a way that vertices are mapped to vertices.

3. If D* is a parallelogram in the uv-plane and D = T(D*) in the xy-plane, then

(Area of D) = |∂(x,y)/∂(u,v)| (Area of D*) = |det(A)| (Area of D*).
Example 7.54 (Area of Parallelogram) The parallelogram D in the xy-plane whose corners are (0,0), (1,−1), (4,0) and (3,1) is the image of D* = [0,1] × [0,1] in the uv-plane under the linear transformation

(x,y) = T(u,v) = (3u + v, u − v) for (u,v) ∈ D*,

as illustrated in Figure 7.13.

The Jacobian of T is | 3   1 |
                     | 1  −1 | = −4.

Since the area of D* is 1, the area of D is

(Area of D) = |−4| (Area of D*) = 4.
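Theorem 7.53 can be illustrated by mapping the corners of D* and computing the area of the image polygon with the shoelace formula (the `shoelace` helper is introduced here for illustration):

```python
def T(u, v):
    # linear transformation from Example 7.54
    return (3*u + v, u - v)

def shoelace(pts):
    # polygon area from vertices listed in order around the boundary
    n = len(pts)
    s = sum(pts[i][0]*pts[(i+1) % n][1] - pts[(i+1) % n][0]*pts[i][1]
            for i in range(n))
    return abs(s) / 2

# corners of D* = [0,1] x [0,1], traversed in order
corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
image = [T(u, v) for u, v in corners]   # (0,0), (3,1), (4,0), (1,-1)
area = shoelace(image)                  # |det A| * (Area of D*) = 4
```

The vertices of D* map to the vertices of D, and the image area equals |det(A)| times the original area, as the theorem states.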
Change of variables. In 2-variable substitution, we replace

dA = dx dy with dA = |∂(x,y)/∂(u,v)| du dv.

The complete formula is given in the following theorem.
Theorem 7.55 (Change of Variables) Let D and D* be basic regions (of types 1 or 2) in the xy- and uv-planes, respectively. Suppose that

T : D* ⊆ R^2 → R^2

maps D* onto D in a one-to-one fashion, and that each coordinate function of T has continuous first partial derivatives. Further, suppose that f is a 2-variable, continuous function on D. Then

∫∫_D f(x,y) dx dy = ∫∫_{D*} f(x(u,v), y(u,v)) |∂(x,y)/∂(u,v)| du dv.
The change of variables technique is used to simplify the boundaries of integration, to<br />
simplify the integrand, or both.<br />
Example 7.56 (Simplifying Integrand) Consider evaluating the double integral

∫∫_D cos((x − y)/(x + y)) dA

where D is the triangular region in the xy-plane bounded by x = 0, y = 0, x + y = 1.

The integrand can be simplified by using

u = x − y, v = x + y ⟺ x = (u + v)/2, y = (−u + v)/2.

Since

(1) x = 0 ⇒ u = −v, (2) y = 0 ⇒ u = v and (3) x + y = 1 ⇒ v = 1,

the domain D* = {(u,v) : 0 ≤ v ≤ 1, −v ≤ u ≤ v}. The Jacobian is 1/2.
The integral becomes

∫∫_D cos((x − y)/(x + y)) dA = ∫_0^1 ∫_{−v}^{v} cos(u/v) (1/2) du dv = ∫_0^1 v sin(1) dv = (1/2) sin(1).

(Note that, by substitution, ∫ cos(u/v) du = v sin(u/v) + C.)
Example 7.57 (Simplifying Boundary) Consider evaluating the double integral

∫∫_D xy dA

where D is the region bounded by xy = 1, xy = 4, xy^2 = 1, xy^2 = 4.

The boundary can be simplified by using

u = xy, v = xy^2 ⟺ x = u^2/v, y = v/u.

The region D* = [1,4] × [1,4]. After simplification, the Jacobian is 1/v.
The integral becomes

∫∫_D xy dA = ∫_1^4 ∫_1^4 u (1/v) du dv = ∫_1^4 (15/(2v)) dv = 15 ln(4)/2.
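As a cross-check, the integral can also be computed directly in xy-coordinates by slicing the region at y = 1, which shows how much simpler the uv substitution is. A sketch where the inner x-integral of xy is done exactly and the outer y-integral by the midpoint rule; the slice limits are derived here from the four boundary curves:

```python
import math

def slice_integral(y_lo, y_hi, x_lo, x_hi, n=4000):
    # outer midpoint rule in y; the inner integral of x*y in x is exact:
    # integral of x*y dx from x1 to x2 equals y*(x2^2 - x1^2)/2
    dy = (y_hi - y_lo) / n
    total = 0.0
    for j in range(n):
        y = y_lo + (j + 0.5) * dy
        x1, x2 = x_lo(y), x_hi(y)
        total += y * (x2**2 - x1**2) / 2 * dy
    return total

# D bounded by xy = 1, xy = 4, x*y^2 = 1, x*y^2 = 4, sliced at y = 1:
# for 1/4 <= y <= 1, x runs from 1/y^2 to 4/y; for 1 <= y <= 4, from 1/y to 4/y^2
val = (slice_integral(0.25, 1.0, lambda y: 1/y**2, lambda y: 4/y) +
       slice_integral(1.0, 4.0, lambda y: 1/y, lambda y: 4/y**2))
target = (15/2) * math.log(4)       # value from the uv change of variables
```

Both computations agree on the value 15 ln(4)/2 ≈ 10.3972.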
Remarks. As stated above, the absolute value of the Jacobian of a linear transformation is related to a change in area. If T is a nonlinear transformation with continuous first partial derivatives, then it is approximately linear in small enough regions. Thus, for small regions, the absolute value of the Jacobian is an approximate change in area.
Linear transformations will be studied further in Chapter 9 of these notes.
Polar coordinate transformations. Consider (x,y) in polar coordinates

x = r cos(θ), y = r sin(θ)

where r = √(x^2 + y^2) is the distance between (x,y) and (0,0), and θ is the angle measured counterclockwise from the positive x-axis to the ray from (0,0) to (x,y).
Transformations from domains in the rθ-plane to domains in the xy-plane have Jacobian

∂(x,y)/∂(r,θ) = xr yθ − xθ yr = cos(θ) · r cos(θ) − (−r sin(θ)) · sin(θ) = r cos^2(θ) + r sin^2(θ) = r,

since sin^2(θ) + cos^2(θ) = 1 for all θ.
Polar coordinate transformations are particularly useful when the domain of integration is
a disk (the boundary plus the interior of a circle), or a wedge of a disk.<br />
Example 7.58 (Quarter Disk) Consider evaluating

∫∫_D (x^2 + y^2) dA

where D is the subset of the disk x^2 + y^2 ≤ 1 in the first quadrant. Then D can be described in polar coordinates as D* = {(r,θ) : 0 ≤ r ≤ 1, 0 ≤ θ ≤ π/2}, and

∫∫_D (x^2 + y^2) dA = ∫_0^1 ∫_0^{π/2} r^2 · r dθ dr = (π/2) ∫_0^1 r^3 dr = π/8.
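A midpoint-sum check of this example in Cartesian coordinates (an added aside; the grid size is arbitrary) shows why the polar computation is so convenient, since no trigonometric bookkeeping is needed there:

```python
import math

# Approximate the integral of x^2 + y^2 over the quarter disk
# x^2 + y^2 <= 1, x >= 0, y >= 0 with a midpoint sum, and compare
# with the exact value pi/8 obtained via polar coordinates.
n = 800
h = 1.0 / n
approx = 0.0
for i in range(n):
    x = (i + 0.5) * h
    for j in range(n):
        y = (j + 0.5) * h
        if x * x + y * y <= 1.0:   # keep midpoints inside the quarter disk
            approx += (x * x + y * y) * h * h

exact = math.pi / 8
print(approx, exact)   # both close to 0.3927
```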
Application of polar transformation. An important application of the polar coordinate transformation in probability and statistics is the evaluation of the double integral

∫∫_D e^(−(x^2 + y^2)) dA, where D = R^2.

D can be described in polar coordinates as D* = {(r,θ) : 0 ≤ r < ∞, 0 ≤ θ ≤ 2π} and

∫∫_D e^(−(x^2 + y^2)) dA = ∫_0^∞ ∫_0^{2π} e^(−r^2) r dθ dr
                        = π ∫_0^∞ e^(−r^2) (2r) dr
                        = π [−e^(−r^2)]_0^{r→∞}          (by substitution)
                        = π (lim_{r→∞}(−e^(−r^2)) + 1)
                        = π(0 + 1) = π.

7.7 Applications in Statistics
Multiple integration methods are important in statistics, and, in particular, when working<br />
with joint density functions. Joint density functions are used to “smooth over” joint<br />
probability histograms, and as functions supporting the theory of multivariate continuous<br />
random variables (Chapter 8). This section previews some ideas.<br />
7.7.1 Computing Probabilities for Continuous Random Pairs
If the continuous random pair (X,Y) has joint density function f(x,y), then the probability that (X,Y) takes values in the domain D is the double integral of f over D:

P((X,Y) ∈ D) = ∫∫_D f(x,y) dA.
Example 7.59 (Joint Distribution on Rectangle) Let

f(x,y) = (1/8)(x^2 + y^2) for (x,y) ∈ [−1,2] × [−1,1]
Figure 7.14: Volume (left) over D (right) is P(X ≥ Y) for a random pair on [−1,2] × [−1,1].
(and 0 otherwise) be the joint density function of a continuous random pair (X,Y ).<br />
To find the probability that X is greater than or equal to Y, for example, we need to compute the double integral of f over the domain D, where D is the intersection of the rectangle [−1,2] × [−1,1] with the half-plane x ≥ y:

P(X ≥ Y) = ∫_{−1}^1 ∫_y^2 (1/8)(x^2 + y^2) dx dy = ∫_{−1}^1 (1/3 + y^2/4 − y^3/6) dy = 5/6.
The domain D is pictured in the right part of Figure 7.14; the probability P(X ≥ Y ) is the<br />
volume of the solid shown in the left part of the figure.<br />
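A direct numerical check (an added aside; the grid sizes are arbitrary) confirms the probability by summing the density over the part of the rectangle with x ≥ y:

```python
# Approximate P(X >= Y) for the density f(x,y) = (x^2 + y^2)/8 on
# [-1,2] x [-1,1] with a midpoint sum restricted to x >= y, and
# compare with the exact answer 5/6.
nx, ny = 600, 400            # subdivisions of [-1,2] and [-1,1]
hx, hy = 3.0 / nx, 2.0 / ny
prob = 0.0
for i in range(nx):
    x = -1.0 + (i + 0.5) * hx
    for j in range(ny):
        y = -1.0 + (j + 0.5) * hy
        if x >= y:           # keep midpoints in the half-plane x >= y
            prob += (x * x + y * y) / 8.0 * hx * hy

print(prob)   # close to 5/6 = 0.8333...
```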
7.7.2 Contours for Independent Standard Normal Random Variables
Assume that the continuous random pair (X,Y) has joint density function

f(x,y) = (1/(2π)) exp(−(x^2 + y^2)/2) for (x,y) ∈ R^2,
where exp() is the exponential function. The left part of Figure 7.15 is a graph of the<br />
surface z = f(x,y) when (x,y) ∈ [−3,3] × [−3,3].<br />
Contours <strong>for</strong> the joint density are circles in the xy-plane centered at the origin. The right<br />
part of Figure 7.15 shows contours enclosing 5%, 20%, ..., 95% of the joint probability.<br />
Given a > 0, let

Da = {(x,y) : x^2 + y^2 ≤ a^2}

be the disk of radius a centered at the origin. The polar transformation can be used to compute the probability that values of (X,Y) lie in Da. Specifically,

P(X^2 + Y^2 ≤ a^2) = ∫∫_{Da} f(x,y) dA = ∫_0^a ∫_0^{2π} (1/(2π)) e^(−r^2/2) r dθ dr = 1 − e^(−a^2/2).

(See the end of the last section for a similar computation.)
Figure 7.15: Joint density (left) and contours enclosing 5%, 20%, ..., 95% of probability<br />
when X and Y are independent standard normal random variables.<br />
To find the value of a so that Da encloses a given proportion p of probability, we solve

p = P(X^2 + Y^2 ≤ a^2) = 1 − e^(−a^2/2)  ⇒  a = √(−2 ln(1 − p)).

Further, for (x,y) satisfying x^2 + y^2 = a^2 = −2 ln(1 − p), f(x,y) = (1 − p)/(2π).
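A seeded Monte Carlo simulation (an added aside; the seed and sample size are arbitrary) illustrates the contour formula: the disk of radius a = √(−2 ln(1 − p)) should capture proportion p of simulated standard normal pairs.

```python
import math
import random

# Estimate P(X^2 + Y^2 <= a^2) for independent standard normals and
# compare with 1 - e^(-a^2/2), using the contour radius for p = 0.95.
random.seed(12345)
p = 0.95
a = math.sqrt(-2.0 * math.log(1.0 - p))   # radius enclosing 95% of probability
n = 200_000
hits = 0
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = random.gauss(0.0, 1.0)
    if x * x + y * y <= a * a:
        hits += 1

estimate = hits / n
print(a, estimate)   # a near 2.448, estimate near 0.95
```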
Remarks. The joint density given in this section corresponds to the density of independent standard normal random variables. See Section 8.1.9 for details about bivariate normal distributions.
Probability contours for bivariate continuous distributions like the bivariate normal distribution take the place of quantiles for univariate continuous distributions.
8 Joint Continuous Distributions<br />
Recall, from Chapter 3, that a probability distribution describing the joint variability of<br />
two or more random variables is called a joint distribution, and that a bivariate distribution<br />
is the joint distribution of a pair of random variables.<br />
8.1 Bivariate Distributions<br />
This section considers joint distributions of pairs of continuous random variables.<br />
8.1.1 Joint PDF and Joint CDF<br />
In the bivariate continuous setting, we assume that X and Y are continuous random variables and that there exists a nonnegative function f(x,y) defined over the entire xy-plane such that

P((X,Y) ∈ D) = ∫∫_D f dA for every D ⊆ R^2.
The function f(x,y) is called the joint probability density function (joint PDF) (or joint<br />
density function) of the random pair (X,Y ). The notation fXY (x,y) is sometimes used to<br />
emphasize the two random variables.<br />
Joint PDFs satisfy the following two properties:<br />
1. f(x,y) ≥ 0 for all pairs (x,y).
2. ∫∫_{R^2} f(x,y) dA = 1.
The joint cumulative distribution function (joint CDF) of the random pair (X,Y), denoted by F(x,y), is the function

F(x,y) = P(X ≤ x, Y ≤ y) = ∫∫_{Dxy} f dA, for all real pairs (x,y),

where Dxy = {(u,v) : u ≤ x, v ≤ y}.
Joint CDFs satisfy the following properties:

1. lim_{(x,y)→(−∞,−∞)} F(x,y) = 0 and lim_{(x,y)→(∞,∞)} F(x,y) = 1.
2. If x1 ≤ x2 and y1 ≤ y2, then F(x1,y1) ≤ F(x2,y2).
3. F(x,y) is continuous.
Example 8.1 (Joint Distribution in 1st Quadrant) Assume that (X,Y ) has the<br />
following joint density function:<br />
f(x,y) = 2e^(−x−2y) when x,y ≥ 0 and 0 otherwise.
Given (x,y) in the first quadrant (both coordinates nonnegative),

F(x,y) = ∫_0^y ∫_0^x 2e^(−u−2v) du dv = ∫_0^y 2e^(−2v) (1 − e^(−x)) dv = (1 − e^(−x))(1 − e^(−2y)).

F(x,y) is equal to 0 otherwise.
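Since the density factors as (e^(−x))(2e^(−2y)), the pair can be simulated from exponential distributions, giving a seeded Monte Carlo check of the CDF formula at one point (an added aside; seed, sample size, and the test point (1,1) are arbitrary choices):

```python
import math
import random

# Estimate F(1,1) = P(X <= 1, Y <= 1) by simulation and compare with
# the closed form (1 - e^(-1))(1 - e^(-2)).
random.seed(0)
n = 200_000
hits = 0
for _ in range(n):
    x = random.expovariate(1.0)   # rate 1, matching the e^(-x) factor
    y = random.expovariate(2.0)   # rate 2, matching the 2e^(-2y) factor
    if x <= 1.0 and y <= 1.0:
        hits += 1

estimate = hits / n
exact = (1.0 - math.exp(-1.0)) * (1.0 - math.exp(-2.0))
print(estimate, exact)   # both close to 0.5466
```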
Further, to compute the probability that X is greater than or equal to Y, for example, we compute the double integral of the joint PDF over the intersection of the half-plane x ≥ y with the first quadrant:

P(X ≥ Y) = ∫_0^∞ ∫_y^∞ 2e^(−x−2y) dx dy = ∫_0^∞ 2e^(−3y) dy = 2/3.

8.1.2 Marginal Distributions
The marginal probability density function (marginal PDF) (or marginal density function) of X is

fX(x) = ∫_y f(x,y) dy for x in the range of X

and 0 otherwise, where the integral is taken over all y in the range of Y.
Similarly, the marginal PDF (or marginal density function) of Y is

fY(y) = ∫_x f(x,y) dx for y in the range of Y

and 0 otherwise, where the integral is taken over all x in the range of X.
Example 8.2 (Joint Distribution in 1st Quadrant, continued) For the joint distribution described above:

1. When x ≥ 0, ∫_0^∞ f(x,y) dy = e^(−x). Thus,

   fX(x) = e^(−x) when x ≥ 0 and 0 otherwise.

   (X is an exponential random variable with parameter 1.)

2. When y ≥ 0, ∫_0^∞ f(x,y) dx = 2e^(−2y). Thus,

   fY(y) = 2e^(−2y) when y ≥ 0 and 0 otherwise.

   (Y is an exponential random variable with parameter 2.)
Example 8.3 (Joint Distribution on Eighth Plane) Assume that (X,Y ) has the<br />
following joint density function:<br />
f(x,y) = (1/4) e^(−y/2) when 0 ≤ x ≤ y and 0 otherwise.
Figure 8.1: Plot of z = (1/4) e^(−y/2) on 0 ≤ x ≤ y (left) and region of nonzero density (right).
The left plot in Figure 8.1 shows part of the surface z = f(x,y) (with vertical planes added)<br />
and the right plot shows part of the region of nonzero density.<br />
The volume of the solid shown in the left part of Figure 8.1 is equal to the probability that Y is at most 11:

P(Y ≤ 11) = ∫_0^{11} ∫_0^y (1/4) e^(−y/2) dx dy = ∫_0^{11} (1/4) y e^(−y/2) dy = 1 − (13/2) e^(−11/2) ≈ 0.9734.

Further, for this random pair,

1. When x ≥ 0, ∫_x^∞ f(x,y) dy = (1/2) e^(−x/2). Thus,

   fX(x) = (1/2) e^(−x/2) when x ≥ 0 and 0 otherwise.

   (X is an exponential random variable with parameter 1/2.)

2. When y ≥ 0, ∫_0^y f(x,y) dx = (1/4) y e^(−y/2). Thus,

   fY(y) = (1/4) y e^(−y/2) when y ≥ 0 and 0 otherwise.

   (Y is a gamma random variable with shape parameter 2 and scale parameter 2.)
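A deterministic numerical check of P(Y ≤ 11) (an added aside; the subdivision count is arbitrary) integrates the marginal density of Y with a midpoint rule:

```python
import math

# Integrate the marginal density f_Y(y) = (1/4) y e^(-y/2) from 0 to 11
# by the midpoint rule and compare with 1 - (13/2) e^(-11/2).
n = 100_000
h = 11.0 / n
approx = 0.0
for i in range(n):
    y = (i + 0.5) * h
    approx += 0.25 * y * math.exp(-0.5 * y) * h

exact = 1.0 - 6.5 * math.exp(-5.5)
print(approx, exact)   # both close to 0.9734
```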
8.1.3 Conditional Distributions<br />
Let X and Y be continuous random variables with joint PDF f(x,y), and marginal PDFs fX(x) and fY(y), respectively.
If fY(y) ≠ 0, then the conditional probability density function (conditional PDF) (or conditional density function) of X given Y = y is defined as follows:

fX|Y=y(x|y) = f(x,y)/fY(y) for all real numbers x.

Similarly, if fX(x) ≠ 0, then the conditional PDF (or conditional density function) of Y given X = x is

fY|X=x(y|x) = f(x,y)/fX(x) for all real numbers y.
Example 8.4 (Joint Distribution in 1st Quadrant, continued) Consider again the random pair with joint density function

f(x,y) = 2e^(−x−2y) when x,y ≥ 0 and 0 otherwise.

1. Given x ≥ 0, the conditional distribution of Y given X = x has PDF

   fY|X=x(y|x) = 2e^(−2y) when y ≥ 0 and 0 otherwise.

   (The conditional distribution is the same as the marginal distribution.)

2. Given y ≥ 0, the conditional distribution of X given Y = y has PDF

   fX|Y=y(x|y) = e^(−x) when x ≥ 0 and 0 otherwise.

   (The conditional distribution is the same as the marginal distribution.)
Example 8.5 (Joint Distribution on Eighth Plane, continued) Consider again the<br />
random pair with joint density function<br />
f(x,y) = (1/4) e^(−y/2) when 0 ≤ x ≤ y and 0 otherwise. (Figure 8.1)

1. Given x ≥ 0, the conditional distribution of Y given X = x has PDF

   fY|X=x(y|x) = (1/2) e^(−(y−x)/2) when y ≥ x and 0 otherwise.

   (The conditional distribution is a shifted exponential distribution.)

2. Given y > 0, the conditional distribution of X given Y = y has PDF

   fX|Y=y(x|y) = 1/y when 0 ≤ x ≤ y and 0 otherwise.

   (The conditional distribution is uniform on the interval [0,y].)
Remarks. The function fX|Y=y(x|y) is a valid PDF since fX|Y=y(x|y) ≥ 0 and

∫_x fX|Y=y(x|y) dx = (1/fY(y)) ∫_x f(x,y) dx = (1/fY(y)) fY(y) = 1.

Similarly, the function fY|X=x(y|x) is a valid PDF. Thus, conditional distributions are distributions in their own right.
Although, for example, fY(y) does not represent probability, we can think of the conditional density function fX|Y=y(x|y) as a limit of probabilities. Specifically,

fX|Y=y(x|y) = lim_{ε→0} ∂/∂x P(X ≤ x | y − ε ≤ Y ≤ y + ε).

Thus, the conditional PDF of X given Y = y can be thought of as the conditional PDF of X given that Y is very close to y.
8.1.4 Independent Random Variables<br />
The continuous random variables X and Y are said to be independent if their joint density function equals the product of their marginal density functions for all real pairs,

f(x,y) = fX(x) fY(y) for all (x,y).

Equivalently, X and Y are independent if their joint CDF equals the product of their marginal CDFs for all real pairs,

F(x,y) = FX(x) FY(y) for all (x,y).
Random variables that are not independent are said to be dependent.<br />
Example 8.6 (Joint Distribution in 1st Quadrant, continued) The random variables with joint density function

f(x,y) = 2e^(−x−2y) = (e^(−x))(2e^(−2y)) when x,y ≥ 0 and 0 otherwise

are independent.
Example 8.7 (Joint Distribution on Eighth Plane, continued) The random variables with joint density function

f(x,y) = (1/4) e^(−y/2) when 0 ≤ x ≤ y and 0 otherwise (Figure 8.1)

are dependent. To demonstrate the dependence of X and Y, we just need to show that f(x,y) ≠ fX(x) fY(y) for some pair (x,y). For example, f(2,1) = 0 ≠ fX(2) fY(1).
8.1.5 Mathematical Expectation for Bivariate Continuous Distributions
Let X and Y be continuous random variables with joint density function f(x,y) and let<br />
g(X,Y ) be a real-valued function.<br />
The mean of g(X,Y) (or the expected value of g(X,Y) or the expectation of g(X,Y)) is

E(g(X,Y)) = ∫_x ∫_y g(x,y) f(x,y) dy dx

provided the integral converges absolutely. The double integral is assumed to include all pairs with nonzero joint density.
Example 8.8 (Joint Distribution on Triangle) Assume that (X,Y) has the following joint density function

f(x,y) = 2/15 when 0 ≤ x ≤ 3, 0 ≤ y ≤ 5x/3 and 0 otherwise.

(The random pair has constant density on the triangle with corners (0,0), (3,0), (3,5).)
Let g(X,Y) = XY be the product of the variables. Then

E(g(X,Y)) = ∫_0^3 ∫_0^{5x/3} xy (2/15) dy dx = ∫_0^3 (5/27) x^3 dx = 15/4.
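A nested midpoint-sum check of E(XY) (an added aside; the subdivision counts are arbitrary) evaluates the double integral over the triangle directly:

```python
# Approximate E(XY) for the density f(x,y) = 2/15 on the triangle
# 0 <= x <= 3, 0 <= y <= 5x/3, and compare with the exact value 15/4.
n = 400                       # subdivisions in each direction
approx = 0.0
hx = 3.0 / n
for i in range(n):
    x = (i + 0.5) * hx
    top = 5.0 * x / 3.0       # upper limit of y for this x
    hy = top / n
    for j in range(n):
        y = (j + 0.5) * hy
        approx += x * y * (2.0 / 15.0) * hx * hy

print(approx)   # close to 15/4 = 3.75
```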
Properties of expectation. The properties of expectation stated in Section 3.1.6 in the<br />
bivariate discrete case are also true in the bivariate continuous case.<br />
8.1.6 Covariance and Correlation<br />
Let X and Y be random variables with finite means (µx, µy) and finite standard deviations<br />
(σx, σy). The covariance of X and Y , Cov(X,Y ), is defined as follows:<br />
Cov(X,Y ) = E((X − µx)(Y − µy)).<br />
The notation σxy = Cov(X,Y) is used to denote the covariance. The correlation of X and Y, Corr(X,Y), is defined as follows:

Corr(X,Y) = Cov(X,Y)/(σx σy) = σxy/(σx σy).
The notation ρ = Corr(X,Y ) is used to denote the correlation of X and Y ; ρ is called the<br />
correlation coefficient.<br />
The properties of covariance and correlation stated in Section 3.1.7 in the bivariate discrete<br />
case are also true in the bivariate continuous case.<br />
Example 8.9 (Joint Distribution on Triangle, continued) Continuing with the joint<br />
distribution above,<br />
E(X) = 2, E(Y) = 5/3, E(X^2) = 9/2, E(Y^2) = 25/6, E(XY) = 15/4.

Using properties of variance and covariance:

Var(X) = E(X^2) − (E(X))^2 = 1/2, Var(Y) = E(Y^2) − (E(Y))^2 = 25/18,
Cov(X,Y) = E(XY) − E(X)E(Y) = 5/12 and ρ = Corr(X,Y) = 1/2.
Example 8.10 (Joint Distribution on Diamond) Assume that the continuous pair<br />
(X,Y ) has joint density function<br />
f(x,y) = 1/2 when |x| + |y| ≤ 1 and 0 otherwise.
(Joint density is constant on the “diamond” with corners (−1,0), (0,1), (1,0), (0, −1).)<br />
For this joint distribution,<br />
1. E(X) = E(Y ) = E(XY ) = 0.<br />
2. Cov(X,Y ) = 0 and Corr(X,Y ) = 0.<br />
3. The marginal PDF of X is fX(x) = 1 − |x| when −1 ≤ x ≤ 1 and 0 otherwise.<br />
4. The marginal PDF of Y is fY (y) = 1 − |y| when −1 ≤ y ≤ 1 and 0 otherwise.<br />
Since Corr(X,Y) = 0, X and Y are uncorrelated. But X and Y are dependent since, for example, f(0,0) ≠ fX(0) fY(0).
Figure 8.2: Graph of z = f(x,y) for a continuous bivariate distribution (left plot) and contour plot with z = 1/2, 1/4, 1/6, ..., 1/20 and conditional expectation of Y given X = x superimposed (right plot).
8.1.7 Conditional Expectation and Regression<br />
Let X and Y be continuous random variables with joint density function f(x,y).<br />
If fX(x) ≠ 0, then the conditional expectation (or the conditional mean) of Y given X = x, E(Y|X = x), is defined as follows:

E(Y|X = x) = ∫_y y fY|X=x(y|x) dy,

where the integral is over all y with nonzero conditional density (fY|X=x(y|x) ≠ 0), provided the integral converges absolutely.
The conditional expectation of X given Y = y, E(X|Y = y), is defined similarly.<br />
Example 8.11 (Conditional Expectation) Assume that the continuous pair (X,Y) has joint density function

f(x,y) = e^(−yx)/(2√x) when x ≥ 1, y ≥ 0 and 0 otherwise.

1. The marginal PDF of X is

   fX(x) = 1/(2x^(3/2)) when x ≥ 1 and 0 otherwise.

2. The conditional PDF of Y given X = x is

   fY|X=x(y|x) = x e^(−xy) when y ≥ 0 and 0 otherwise.

   (The conditional distribution is exponential with parameter x.)

3. Since the expected value of an exponential random variable is the reciprocal of its parameter (see Table 5.2), the conditional mean formula is

   E(Y|X = x) = 1/x for x ≥ 1.
The left part of Figure 8.2 is a graph of z = f(x,y) (with vertical planes added) and the right part is a contour plot with z = 1/2, 1/4, 1/6, ..., 1/20 (in gray). The formula for the conditional mean, y = 1/x, is superimposed on the contour plot (in black).
Regression equation. Note that the formula for the conditional expectation

E(Y|X = x) as a function of x

is called the regression equation of Y on X. Similarly, the formula for the conditional expectation E(X|Y = y) as a function of y is called the regression equation of X on Y.
Linear conditional means. Let X and Y be random variables with finite means (µx, µy), standard deviations (σx, σy), and correlation (ρ).
If E(Y|X = x) is a linear function of x, then the formula is of the form:

E(Y|X = x) = µy + ρ (σy/σx) (x − µx).

Similarly, if E(X|Y = y) is a linear function of y, then the formula is of the form:

E(X|Y = y) = µx + ρ (σx/σy) (y − µy).
Example 8.12 (Two Step Experiment) Consider the following 2-step experiment:

Step 1: Choose X uniformly from the interval [10,20].
Step 2: Given that X = x, choose Y uniformly from the interval [0,x].

Then the continuous pair (X,Y) has joint density function

f(x,y) = fX(x) fY|X=x(y|x) = (1/10)(1/x) = 1/(10x)

when 10 ≤ x ≤ 20, 0 ≤ y ≤ x and 0 otherwise. (The region of nonzero density is the trapezoidal region with corners (10,0), (10,10), (20,20), (20,0).)
Since the conditional distribution of Y given X = x is uniform on the interval [0,x], and the expected value of a uniform random variable is the midpoint of its interval (see Table 5.2), the conditional mean formula is

E(Y|X = x) = x/2 for 10 ≤ x ≤ 20

(a linear function of x). Further, since

µx = 15, µy = 15/2, σx = 5/√3, σy = 5√31/6 and ρ = √(3/31)

for this random pair, µy + ρ(σy/σx)(x − µx) = x/2, verifying the formula given above.
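The two-step experiment is easy to simulate, giving a seeded Monte Carlo check of µy = 15/2 and ρ = √(3/31) (an added aside; the seed and sample size are arbitrary):

```python
import math
import random

# Simulate the two-step experiment and compare sample moments with
# mu_y = 7.5 and rho = sqrt(3/31) ~ 0.311.
random.seed(1)
n = 200_000
xs, ys = [], []
for _ in range(n):
    x = random.uniform(10.0, 20.0)   # step 1
    y = random.uniform(0.0, x)       # step 2, given X = x
    xs.append(x)
    ys.append(y)

mx = sum(xs) / n
my = sum(ys) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
rho = cov / (sx * sy)
print(my, rho)   # close to 7.5 and 0.311
```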
8.1.8 Example: Bivariate Uniform Distribution
Let R be a region of the plane with finite positive area. The continuous random pair (X,Y) is said to have a bivariate uniform distribution on the region R if its joint PDF is as follows:
f(x,y) = 1/(Area of R) when (x,y) ∈ R and 0 otherwise.
Bivariate uniform distributions have constant density over the region of nonzero density.
That constant density is the reciprocal of the area. If A is a subregion of R (A ⊆ R), then<br />
P((X,Y) ∈ A) = (Area of A)/(Area of R).
That is, the probability of the event “the point (x,y) is in A” is the ratio of the area of the<br />
subregion to the area of the full region.<br />
The joint distributions on the triangle and on the diamond considered earlier are examples of bivariate uniform distributions on regions with areas 15/2 and 2, respectively.
Expectations. If (X,Y) has a bivariate uniform distribution on R and g(X,Y) is a real-valued function, then E(g(X,Y)) is the same as the average value of the 2-variable function g over the region R, as discussed on page 157.
Marginal distributions. If (X,Y) has a bivariate uniform distribution, then X and Y may not be uniform random variables.
Consider again the joint distribution on the diamond. X is not a uniform random variable since its PDF is not constant on the interval [−1,1]. Similarly, Y is not a uniform random variable since its PDF is not constant on [−1,1].
8.1.9 Example: Bivariate Normal Distribution<br />
Let µx and µy be real numbers, σx and σy be positive real numbers, and ρ be a number in the interval −1 < ρ < 1. The random pair (X,Y) is said to have a bivariate normal distribution with parameters µx, µy, σx, σy and ρ when its joint PDF is as follows

f(x,y) = (1/(2πσxσy√(1 − ρ^2))) exp(−term/(2(1 − ρ^2))) for all real pairs (x,y)

where

term = ((x − µx)/σx)^2 − 2ρ((x − µx)/σx)((y − µy)/σy) + ((y − µy)/σy)^2

and exp() is the exponential function.
For this random pair,
Figure 8.3: Joint density (left) and contours enclosing 5%, 35%, 65% and 95% of the probability for the standard bivariate normal distribution with correlation ρ = −0.80.
1. µx and σx are the mean and standard deviation of X,<br />
2. µy and σy are the mean and standard deviation of Y and<br />
3. ρ (the correlation coefficient) is the correlation between X and Y .<br />
Further, if ρ = 0, then X and Y are independent.<br />
Standard bivariate normal distribution. The random pair (X,Y ) is said to have a<br />
standard bivariate normal distribution with parameter ρ when (X,Y ) has a bivariate normal<br />
distribution with µx = µy = 0 and σx = σy = 1.<br />
The joint PDF of (X,Y) is as follows:

f(x,y) = (1/(2π√(1 − ρ^2))) exp(−(x^2 − 2ρxy + y^2)/(2(1 − ρ^2))) for all real pairs (x,y).
Example 8.13 (ρ = −0.80) The left part of Figure 8.3 shows the joint density function<br />
of the bivariate standard normal distribution with correlation ρ = −0.80.<br />
The right part of the figure shows contours enclosing 5%, 35%, 65% and 95% of the probability<br />
of this joint distribution. Each contour is an ellipse whose major axis lies along the<br />
line y = −x and whose minor axis lies along the line y = x.<br />
Marginal and conditional distributions. If (X,Y ) has a bivariate normal distribution,<br />
then each marginal and conditional distribution is normal. Specifically,<br />
1. X is a normal random variable with mean µx and standard deviation σx,<br />
2. Y is a normal random variable with mean µy and standard deviation σy,<br />
Figure 8.4: The region of nonzero density for a bivariate uniform distribution on the open rectangle (0,2) × (0,1) (left plot) and the probability density function of the product of the coordinates with values in (0,2) (right plot). Contours for products equal to 0.2, 0.6, 1.0, and 1.4 are shown in the left plot.
3. the conditional distribution of Y given X = x is normal with mean

   E(Y|X = x) = µy + ρ (σy/σx) (x − µx)

   and standard deviation σy √(1 − ρ^2), and

4. the conditional distribution of X given Y = y is normal with mean

   E(X|Y = y) = µx + ρ (σx/σy) (y − µy)

   and standard deviation σx √(1 − ρ^2).
Remarks. The computations needed to find probability contours for bivariate normal distributions generalize ideas from Section 7.7.2 and are discussed further in the next section.
8.1.10 Transforming Continuous Random Variables
Let (U,V) be a continuous random pair with joint density function f(u,v). This section considers scalar-valued and vector-valued continuous transformations of (U,V).
Scalar-valued functions. We first consider scalar-valued functions.
Example 8.14 (Product Transformation) Let U be the length, V be the width, and X = UV be the area of a random rectangle. Specifically, assume that U is a uniform random variable on the interval (0,2), V is a uniform random variable on the interval (0,1), and that U and V are independent.
Since the joint PDF of (U,V) is

f(u,v) = fU(u) fV(v) = 1/2 when (u,v) ∈ (0,2) × (0,1) and 0 otherwise,
(U,V) has a bivariate uniform distribution on the open rectangle.
For 0 < x < 2,

P(X ≤ x) = P(UV ≤ x) = x/2 + ∫_x^2 (x/u)(1/2) du = (1/2)(x + x ln(2) − x ln(x))

and (d/dx) P(X ≤ x) = (1/2)(ln(2) − ln(x)) = (1/2) ln(2/x). Thus,

fX(x) = (1/2) ln(2/x) when 0 < x < 2 and 0 otherwise.

The left part of Figure 8.4 shows the region of nonzero joint density, with contours corresponding to x = 0.2, 0.6, 1.0, 1.4 superimposed. The right part is a plot of the density function of X = UV.
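A seeded Monte Carlo check of the CDF formula at one point (an added aside; the seed, sample size, and test point x = 1 are arbitrary choices):

```python
import math
import random

# Estimate P(UV <= 1) for U uniform on (0,2), V uniform on (0,1),
# independent, and compare with (x + x ln(2) - x ln(x))/2 at x = 1.
random.seed(2)
x0 = 1.0
n = 200_000
hits = 0
for _ in range(n):
    u = random.uniform(0.0, 2.0)
    v = random.uniform(0.0, 1.0)
    if u * v <= x0:
        hits += 1

estimate = hits / n
exact = (x0 + x0 * math.log(2.0) - x0 * math.log(x0)) / 2.0
print(estimate, exact)   # both close to 0.8466
```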
Example 8.15 (Sum Transformation) Let X = U + V, where U and V are independent exponential random variables with parameter 1. The range of X is the nonnegative real numbers.
Since U and V are independent, the joint density function of (U,V) is

f(u,v) = fU(u) fV(v) = e^(−u−v) when u,v ≥ 0 and 0 otherwise.

Given x ≥ 0,

P(X ≤ x) = ∫_0^x ∫_0^{x−u} e^(−u−v) dv du = ∫_0^x (−e^(−x) + e^(−u)) du = 1 − e^(−x) − x e^(−x),

and (d/dx) P(X ≤ x) = x e^(−x) (using the product rule for derivatives). Thus,

fX(x) = x e^(−x) when x ≥ 0 and 0 otherwise.

(X is a gamma random variable with shape parameter 2 and scale parameter 1.)
Working with sums. If X = U + V and f(u,v) has continuous partial derivatives, then the PDF of X can be computed as follows:

fX(x) = ∫_{u∈R} f(u, x − u) du where R is the range of U.

This result can be proven using the relationship between partial differentiation and partial anti-differentiation given in Section 7.6.7.
For the example above, since the joint density function is nonzero when both arguments are nonnegative, the integral reduces to

fX(x) = ∫_0^x e^(−x) du = x e^(−x) when x ≥ 0 and 0 otherwise,

confirming the result obtained using double integration followed by differentiation.
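The convolution formula can also be checked numerically (an added aside; the evaluation point x = 2 and subdivision count are arbitrary):

```python
import math

# Evaluate f_X(x) = integral over u in [0, x] of f(u, x - u) du for the
# sum of two independent exponential(1) variables by the midpoint rule,
# and compare with the closed form x e^(-x) at x = 2.
x0 = 2.0
n = 10_000
h = x0 / n
approx = 0.0
for i in range(n):
    u = (i + 0.5) * h
    approx += math.exp(-u) * math.exp(-(x0 - u)) * h   # f(u, x0 - u)

exact = x0 * math.exp(-x0)
print(approx, exact)   # both close to 0.2707
```

Here the integrand is the constant e^(−x0), so the midpoint rule reproduces x0 e^(−x0) essentially exactly.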
Vector-valued functions. We next consider vector-valued functions.
Let (x,y) = T(u,v) be a vector-valued function of two variables. Assume that the coordinate functions of the transformation, x = x(u,v) and y = y(u,v), have continuous partial derivatives and that the Jacobian of the transformation

∂(x,y)/∂(u,v) = xu yv − xv yu ≠ 0 for all (u,v).

(See Section 7.6.8 for a discussion of the Jacobian.)
If (X,Y) = T(U,V) and f(u,v) has continuous partial derivatives, then, for all (x,y) with nonzero joint density,

fXY(x,y) = f(u,v) / |∂(x,y)/∂(u,v)|, where (x,y) = T(u,v).

(This result generalizes the result for monotone transformations from page 100.)
Example 8.16 (Standard Bivariate Normal Distribution) Let U and V be independent standard normal random variables and let

(X,Y) = T(U,V) = (U, ρU + √(1 − ρ^2) V),

where −1 < ρ < 1. Then (X,Y) has a standard bivariate normal distribution with correlation ρ. To see this,

1. The Jacobian of the transformation is

   ∂(x,y)/∂(u,v) = xu yv − xv yu = (1)(√(1 − ρ^2)) − (0)(ρ) = √(1 − ρ^2).

2. Since x = u, y = ρu + √(1 − ρ^2) v implies u = x, v = (1/√(1 − ρ^2))(−ρx + y),

   u^2 + v^2 = x^2 + (1/(1 − ρ^2))(−ρx + y)^2 = (x^2 − 2ρxy + y^2)/(1 − ρ^2).

3. Since f(u,v) = exp(−(u^2 + v^2)/2)/(2π), the formula above implies that

   fXY(x,y) = (1/(2π√(1 − ρ^2))) exp(−(x^2 − 2ρxy + y^2)/(2(1 − ρ^2))) for all (x,y).

The result is the same as the joint density function of the standard bivariate normal distribution with correlation ρ given earlier.
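The transformation is also a practical recipe for simulating correlated normal pairs, which gives a seeded Monte Carlo check of the claimed correlation (an added aside; the seed, sample size, and ρ = −0.80 are arbitrary choices):

```python
import math
import random

# Simulate (X, Y) = (U, rho U + sqrt(1 - rho^2) V) from independent
# standard normals; the sample correlation should be close to rho and
# the sample standard deviation of Y close to 1.
random.seed(3)
rho = -0.80
n = 200_000
xs, ys = [], []
for _ in range(n):
    u = random.gauss(0.0, 1.0)
    v = random.gauss(0.0, 1.0)
    xs.append(u)
    ys.append(rho * u + math.sqrt(1.0 - rho * rho) * v)

mx = sum(xs) / n
my = sum(ys) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
corr = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)
print(corr, sy)   # corr close to -0.80, sy close to 1
```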
Example 8.17 (Bivariate Normal Distribution) More generally, if U and V are independent standard normal random variables and we let

(X,Y) = T(U,V) = (µx + σx U, µy + σy (ρU + √(1 − ρ^2) V)),

then (X,Y) has a bivariate normal distribution with parameters µx, µy, σx, σy and ρ.
The steps in the derivation follow those in the previous example.
181
Remarks. The linear transformations given in the last two examples can be used to find probability contours for bivariate normal distributions. That is, the transformations can be used to find contours enclosing probability p, for 0 < p < 1. We consider the first transformation.

1. Let U and V be independent standard normal random variables, Ca be the circle of radius a centered at the origin, and

       Da = {(u,v) : u² + v² ≤ a²}

   be the disk of radius a centered at the origin. If a = √(−2 ln(1 − p)), then

       P((U,V) ∈ Da) = p

   (see Section 7.7.2) and Ca is the contour enclosing probability p. Further, if (u,v) ∈ Ca, then f(u,v) = (1 − p)/(2π).

2. Let (X,Y) = T(U,V) = ( U, ρU + √(1 − ρ²) V ), where −1 < ρ < 1. The image of Da under T,

       T(Da) = {(x,y) : x² − 2ρxy + y² ≤ a²(1 − ρ²)},

   is an elliptical region centered at the origin, whose boundary, T(Ca), is the contour enclosing probability p. Further, if (x,y) ∈ T(Ca), then fXY(x,y) = (1 − p)/(2π√(1 − ρ²)).

Note that the right part of Figure 8.3 shows contours corresponding to p = 0.05, 0.35, 0.65 and 0.95 when ρ = −0.80.
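As a sanity check on the contour construction, one can simulate (X,Y) pairs and count how often they land inside T(Da); by the formulas above, the fraction should be close to p. A small sketch (the helper name and parameter choices are mine):

```python
import math
import random

def contour_coverage(rho, p, n, seed=1):
    """Estimate the probability that (X, Y) = (U, rho*U + sqrt(1 - rho^2)*V)
    falls in the elliptical region x^2 - 2*rho*x*y + y^2 <= a^2*(1 - rho^2),
    where a^2 = -2*ln(1 - p)."""
    rng = random.Random(seed)
    a2 = -2.0 * math.log(1.0 - p)      # squared radius of the disk D_a
    hits = 0
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        x, y = u, rho * u + math.sqrt(1 - rho ** 2) * v
        if x * x - 2 * rho * x * y + y * y <= a2 * (1 - rho ** 2):
            hits += 1
    return hits / n

cov = contour_coverage(rho=-0.80, p=0.65, n=100_000)
```

Since x² − 2ρxy + y² = (1 − ρ²)(u² + v²) under this transformation, the event is exactly {u² + v² ≤ a²}, so the estimated coverage should match p up to simulation error.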
Bivariate normal distributions (and multivariate normal distributions) will be studied further<br />
in Chapter 9 of these notes.<br />
8.2 Multivariate Distributions<br />
Recall (from Section 3.2) that a multivariate distribution is the joint distribution of k<br />
random variables. Ideas studied in the bivariate case (k = 2) can be generalized to the case<br />
where k > 2.<br />
8.2.1 Joint PDF and Joint CDF<br />
Let X1, X2, ..., Xk be continuous random variables and let<br />
X = (X1,X2,...,Xk).<br />
We assume there exists a nonnegative k-variable function f with domain R^k such that

    P(X ∈ W) = ∫ ··· ∫_W f(x1, x2, ..., xk) dx1 ··· dxk

for every W ⊆ R^k. The function f(x1, x2, ..., xk) is called the joint probability density function (joint PDF) (or joint density function) of the random k-tuple X.

The joint cumulative distribution function (joint CDF) of X is defined as follows:

    F(x) = P(X1 ≤ x1, X2 ≤ x2, ..., Xk ≤ xk) for all x = (x1, x2, ..., xk) ∈ R^k,

where commas are understood to mean the intersection of events.
Example 8.18 (Joint Distribution on 1st Octant) Assume that X = (X,Y,Z) has joint density function

    f(x,y,z) = 6/(1 + x + y + z)⁴ when x, y, z ≥ 0, and 0 otherwise.

Given x, y, z ≥ 0,

    F(x,y,z) = ∫_0^x ∫_0^y ∫_0^z f(u,v,w) dw dv du
             = yz(2 + y + z) / ((1 + y)(1 + z)(1 + y + z))
               − yz(2 + 2x + y + z) / ((1 + x)(1 + x + y)(1 + x + z)(1 + x + y + z)).

The joint CDF is equal to 0 otherwise.

Let c be a positive constant. To find the probability that the sum of the three variables is at most c, for example, we compute the triple integral of f over the region

    W = {(x,y,z) : 0 ≤ x ≤ c, 0 ≤ y ≤ c − x, 0 ≤ z ≤ c − x − y} ⊂ R³.

The probability is

    P(X + Y + Z ≤ c) = ∫∫∫_W f(x,y,z) dV = (c/(c + 1))³.
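The closed form P(X + Y + Z ≤ c) = (c/(c + 1))³ can be checked numerically. The sketch below (a simple midpoint rule; the function name and grid size n are arbitrary choices of mine) approximates the triple integral of f over W:

```python
def p_sum_at_most(c, n=40):
    """Midpoint-rule approximation of the integral of f(x,y,z) =
    6/(1 + x + y + z)^4 over the region x, y, z >= 0, x + y + z <= c."""
    h = c / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                x, y, z = (i + 0.5) * h, (j + 0.5) * h, (k + 0.5) * h
                if x + y + z <= c:
                    total += 6.0 / (1.0 + x + y + z) ** 4 * h ** 3
    return total

approx = p_sum_at_most(1.0)
```

For c = 1 the exact value is (1/2)³ = 0.125, and the grid approximation should agree to within roughly 0.01 at this resolution.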
The concepts of marginal and conditional distributions can also be generalized.

Example 8.19 (Joint Distribution on 1st Octant, continued) To illustrate computations, we continue with the joint distribution defined above.

1. The marginal distribution of the random pair (X,Y) is

       fXY(x,y) = ∫_0^∞ f(x,y,z) dz = 2/(1 + x + y)³ when x, y ≥ 0, and 0 otherwise.

2. The marginal distribution of Z is

       fZ(z) = ∫_0^∞ ∫_0^∞ f(x,y,z) dx dy = 1/(1 + z)² when z ≥ 0, and 0 otherwise.

3. Let x and y be fixed nonnegative constants. The conditional distribution of Z given (X,Y) = (x,y) is

       fZ|(X,Y)=(x,y)(z|(x,y)) = f(x,y,z)/fXY(x,y) = 3(1 + x + y)³/(1 + x + y + z)⁴

   when z ≥ 0, and 0 otherwise.
8.2.3 Mutually Independent Random Variables and Random Samples<br />
The continuous random variables X1, X2, ..., Xk are said to be mutually independent (or independent) if

    f(x) = f1(x1) f2(x2) ··· fk(xk) for all x = (x1, x2, ..., xk) ∈ R^k,

where fi(xi) is the PDF of Xi for each i. Equivalently, X1, X2, ..., Xk are mutually independent if

    F(x) = F1(x1) F2(x2) ··· Fk(xk) for all x = (x1, x2, ..., xk) ∈ R^k,

where Fi(xi) is the CDF of Xi for each i.

X1, X2, ..., Xk are said to be dependent if they are not independent.
Example 8.20 (Joint Distribution on 1st Octant, continued) The random variables X, Y and Z in the example above are dependent. To see this, first note that each random variable has the same marginal density function:

    fX(u) = fY(u) = fZ(u) = 1/(1 + u)² when u ≥ 0, and 0 otherwise.

To demonstrate dependence, we just need to show that f(x,y,z) ≠ fX(x)fY(y)fZ(z) for some triple (x,y,z). For example, f(1,1,1) = 6/4⁴ = 3/128, while fX(1)fY(1)fZ(1) = (1/4)³ = 1/64, so f(1,1,1) ≠ fX(1)fY(1)fZ(1).
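Evaluating both sides at (1,1,1) makes the failure of factorization concrete; the sketch below (function names are mine) encodes the joint and marginal densities from this example:

```python
def f(x, y, z):
    """Joint density 6/(1 + x + y + z)^4 on the first octant."""
    return 6.0 / (1.0 + x + y + z) ** 4 if min(x, y, z) >= 0 else 0.0

def f_marginal(u):
    """Common marginal density 1/(1 + u)^2 of X, Y, and Z."""
    return 1.0 / (1.0 + u) ** 2 if u >= 0 else 0.0

joint = f(1, 1, 1)              # 6/4^4 = 3/128
product = f_marginal(1) ** 3    # (1/4)^3 = 1/64
```

Here joint = 3/128 ≈ 0.0234 while product = 1/64 ≈ 0.0156, so the joint density is not the product of the marginals.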
Random sample. If the continuous random variables X1, X2, ..., Xk are mutually<br />
independent and have a common distribution (each marginal distribution is the same),<br />
then X1, X2, ..., Xk are said to be a random sample from that distribution.<br />
8.2.4 Mathematical Expectation For Multivariate Continuous Distributions

Let X1, X2, ..., Xk be continuous random variables with joint PDF f(x) and let g(X) be a real-valued function. The mean or expected value or expectation of g(X) is

    E(g(X)) = ∫ ··· ∫ g(x1, x2, ..., xk) f(x1, x2, ..., xk) dx1 dx2 ··· dxk,

provided that the integral converges absolutely. The multiple integral is assumed to include all k-tuples with nonzero joint PDF.
Example 8.21 (Product Function) You pick three numbers independently and uniformly from the interval [0,5]. Let X1, X2, X3 be the three numbers and let g(X1,X2,X3) be the function with rule g(X1,X2,X3) = X1 X2² X3. By independence,

    f(x1,x2,x3) = 1/125 on the [0,5] × [0,5] × [0,5] box, and 0 otherwise.

The expected product is

    E(g(X1,X2,X3)) = ∫_0^5 ∫_0^5 ∫_0^5 x1 x2² x3 (1/125) dx3 dx2 dx1
                   = ∫_0^5 ∫_0^5 (1/10) x1 x2² dx2 dx1
                   = ∫_0^5 (25/6) x1 dx1 = 625/12 ≈ 52.08.
Properties of expectation. Properties of expectation in the multivariate case directly generalize properties in the univariate and bivariate cases. In particular,

1. If a and b1, b2, ..., bn are constants and gi(X) are real-valued k-variable functions for each i = 1, 2, ..., n, then

       E( a + Σ_{i=1}^n bi gi(X) ) = a + Σ_{i=1}^n bi E(gi(X)).

2. If X1, X2, ..., Xk are mutually independent and gi(Xi) are real-valued 1-variable functions for i = 1, 2, ..., k, then

       E( g1(X1) g2(X2) ··· gk(Xk) ) = Π_{i=1}^k E(gi(Xi)).
Example 8.22 (Product Function, continued) Continuing with the example above, and using properties of expectation, the expected product is

    E(X1 X2² X3) = E(X1) E(X2²) E(X3) = (5/2)(25/3)(5/2) = 625/12,

confirming the result obtained above.
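A Monte Carlo estimate offers an independent check of E(X1 X2² X3) = 625/12; the sketch below (names, seed, and sample size are mine) averages the product over simulated triples:

```python
import random

def mc_expected_product(n=200_000, seed=2):
    """Monte Carlo estimate of E(X1 * X2^2 * X3) for independent
    uniform [0, 5] random variables X1, X2, X3."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x1, x2, x3 = rng.uniform(0, 5), rng.uniform(0, 5), rng.uniform(0, 5)
        total += x1 * x2 ** 2 * x3
    return total / n

est = mc_expected_product()     # exact value is 625/12, about 52.08
```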
8.2.5 Sample Summaries

If X1, X2, ..., Xn is a random sample from a distribution with mean µ and standard deviation σ, then the sample mean, X̄, is the random variable

    X̄ = (1/n)(X1 + X2 + ··· + Xn),

the sample variance, S², is the random variable

    S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²,

and the sample standard deviation, S, is the positive square root of the sample variance.

The following theorem can be proven using properties of expectation (see Section 3.2.7):

Theorem 8.23 (Sample Summaries) If X̄ is the sample mean and S² is the sample variance of a random sample of size n from a distribution with mean µ and standard deviation σ, then

1. E(X̄) = µ and Var(X̄) = σ²/n.
2. E(S²) = σ².
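The unbiasedness claim E(S²) = σ² can be illustrated by simulation. The sketch below averages S² over many samples of size n = 5; the uniform [0,1] parent distribution (with σ² = 1/12) and all names are my arbitrary choices:

```python
import random

def mean_sample_variance(n=5, reps=40_000, seed=3):
    """Average the sample variance S^2 over many random samples of size n
    from a uniform [0, 1] parent distribution (true variance 1/12)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        xbar = sum(xs) / n
        total += sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return total / reps

avg_s2 = mean_sample_variance()
```

The average of S² should be close to 1/12 ≈ 0.0833, even though each individual sample is tiny.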
Sample correlation. A random sample of size n from the joint (X,Y) distribution is a list of n mutually independent random pairs, each with the same distribution as (X,Y). If (X1,Y1), (X2,Y2), ..., (Xn,Yn) is a random sample of size n from a bivariate distribution with correlation ρ = Corr(X,Y), then the sample correlation, R, is the random variable

    R = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^n (Xi − X̄)² · Σ_{i=1}^n (Yi − Ȳ)² ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively.

Applications. In statistical applications, observed values of the sample mean, sample variance, and sample correlation are used to estimate unknown values of the parameters µ, σ², and ρ, respectively.
Example 8.24 (Brain-Body Weights) (Allison & Cicchetti, Science, 194:732-734, 1976; lib.stat.cmu.edu/DASL.) As part of a study on sleep in mammals, researchers collected information on the average brain weight and average body weight for 43 different species. (Body-weight, brain-weight) combinations ranged from (0.05 kg, 0.14 g) for the short-tail shrew to (2547.0 kg, 4603.0 g) for the Asian elephant. The data for “man” are (62 kg, 1320 g) = (136.4 lb, 2.9 lb).

Let X be the common logarithm of body weight in kilograms, and Y be the common logarithm of brain weight in grams. Sample summaries are as follows:

1. Mean log-body weight is x̄ = 0.311738, with a SD of sx = 1.33113.
2. Mean log-brain weight is ȳ = 1.17421, with a SD of sy = 1.09415.
3. Correlation between log-body weight and log-brain weight is r = 0.951693.
Figure 8.5: Log-brain weight (vertical axis) versus log-body weight (horizontal axis) for the brain-body study. Probability contours of the estimated bivariate normal model and the estimated conditional expectation are superimposed.
The researchers determined that a bivariate normal model was reasonable for these data. Figure 8.5 compares the common logarithms of average brain weight in grams (vertical axis) and average body weight in kilograms (horizontal axis) for the 43 species. Probability contours for the model with parameters equal to the sample summaries are shown on the plot; contours enclose 5%, 35%, 65% and 95% of probability. The pair for “man” is the only point outside the outermost contour.

The estimated conditional expectation line,

    y = 1.17421 + 0.782263 (x − 0.311738),

is displayed in the plot. Note, in particular, that the slope of the line is r(sy/sx) = 0.782263.
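The slope and intercept of the estimated line follow directly from the reported sample summaries; the short sketch below (variable names are mine) reproduces the computation:

```python
# Sample summaries reported for the brain-body data (common-log scale)
x_bar, s_x = 0.311738, 1.33113      # mean and SD of log-body weight
y_bar, s_y = 1.17421, 1.09415       # mean and SD of log-brain weight
r = 0.951693                        # sample correlation

slope = r * (s_y / s_x)             # slope of the estimated line
intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
```

The computed slope agrees with the value 0.782263 quoted above, and by construction the line passes through the point of means (x̄, ȳ).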
9 Linear Algebra Concepts

The central equations of interest in linear algebra are of the form Ax = b, where A is an m-by-n coefficient matrix, x is an n-by-1 vector of unknowns, and b is an m-by-1 vector of values. Solutions come at three levels of sophistication: (1) direct solution methods, (2) matrix algebra methods, and (3) vector space methods.

This chapter introduces linear algebra and its applications in statistics.
9.1 Solving Systems of Equations<br />
A linear equation in the variables x1, x2, ..., xn is an equation of the form

    a1 x1 + a2 x2 + ··· + an xn = b,

where a1, a2, ..., an and b are constants. A system of linear equations in the variables x1, x2, ..., xn is a collection of one or more linear equations in these variables. For example,
x1 −2x2 + x3 = 0<br />
2x2 −8x3 = 8<br />
−4x1 +5x2 +9x3 = −9<br />
is a system of 3 linear equations in the three unknowns x1,x2,x3 (a “3-by-3 system”).<br />
A solution of a linear system is a list (s1, s2, ..., sn) of numbers that makes each equation in the system true when si is substituted for xi, for i = 1, 2, ..., n. The list (29, 16, 3) is a solution to the system above. To check this, note that
(29) −2(16) + (3) = 0<br />
2(16) −8(3) = 8<br />
−4(29) +5(16) +9(3) = −9<br />
The solution set of a linear system is the collection of all possible solutions of the system.<br />
For the example above, (29, 16, 3) is the unique solution.<br />
Two linear systems are equivalent if each has the same solution set.<br />
9.1.1 Elementary Row Operations<br />
The strategy for solving a linear system is to replace the system with an equivalent one that is easier to solve. The equivalent system is obtained using elementary row operations:

1. (Replacement) Add a multiple of one row to another row,
2. (Interchange) Interchange two rows,
3. (Scaling) Multiply all entries in a row by a nonzero constant,

where each “row” corresponds to an equation in the system.
Example 9.1 (Using Row Operations) We work out the strategy using the 3-by-3 system given above, where a 3-by-4 augmented matrix of the coefficients and right hand side values follows our progress toward a solution.

(1) The original system:

        x1 − 2x2 +  x3 =  0        ⎡  1 −2  1   0 ⎤
             2x2 − 8x3 =  8        ⎢  0  2 −8   8 ⎥
      −4x1 + 5x2 + 9x3 = −9        ⎣ −4  5  9  −9 ⎦

(2) Replace R3 by R3 + 4R1:

        x1 − 2x2 +  x3 =  0        ⎡ 1 −2  1   0 ⎤
             2x2 − 8x3 =  8        ⎢ 0  2 −8   8 ⎥
           −3x2 + 13x3 = −9        ⎣ 0 −3 13  −9 ⎦

(3) Replace R2 by (1/2)R2:

        x1 − 2x2 +  x3 =  0        ⎡ 1 −2  1   0 ⎤
              x2 − 4x3 =  4        ⎢ 0  1 −4   4 ⎥
           −3x2 + 13x3 = −9        ⎣ 0 −3 13  −9 ⎦

(4) Replace R3 by R3 + 3R2:

        x1 − 2x2 + x3 = 0          ⎡ 1 −2  1  0 ⎤
             x2 − 4x3 = 4          ⎢ 0  1 −4  4 ⎥
                   x3 = 3          ⎣ 0  0  1  3 ⎦

(5) Replace R1 by R1 − R3 and R2 by R2 + 4R3:

        x1 − 2x2 = −3              ⎡ 1 −2  0  −3 ⎤
              x2 = 16              ⎢ 0  1  0  16 ⎥
              x3 =  3              ⎣ 0  0  1   3 ⎦

(6) Replace R1 by R1 + 2R2:

        x1 = 29                    ⎡ 1  0  0  29 ⎤
        x2 = 16                    ⎢ 0  1  0  16 ⎥
        x3 =  3                    ⎣ 0  0  1   3 ⎦

Thus, the solution is x1 = 29, x2 = 16, and x3 = 3.
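The forward/backward strategy of this example translates directly into code. The sketch below (a minimal Gaussian elimination with exact rational arithmetic; the names are mine, and it assumes a square system with a unique solution) recovers x1 = 29, x2 = 16, x3 = 3:

```python
from fractions import Fraction

def solve_unique(a, b):
    """Solve A x = b by forward elimination and back substitution, assuming
    A is square and the system has a unique solution (exact rationals)."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for c in range(n):                             # forward phase
        pivot = next(i for i in range(c, n) if m[i][c] != 0)
        m[c], m[pivot] = m[pivot], m[c]            # interchange if needed
        for i in range(c + 1, n):                  # replacement operations
            factor = m[i][c] / m[c][c]
            m[i] = [u - factor * w for u, w in zip(m[i], m[c])]
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):                 # backward phase
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

sol = solve_unique([[1, -2, 1], [0, 2, -8], [-4, 5, 9]], [0, 8, -9])
```

Using Fraction avoids floating-point roundoff, so the answer matches the hand computation exactly.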
Forward and backward phases. In the forward phase of the row reduction process of our example (corresponding to systems (1) through (4)), R1 is used to eliminate x1 from R2 and R3; then R2 is used to eliminate x2 from R3. In the backward phase of the process (corresponding to systems (5) and (6)), R3 is used to eliminate x3 from R1 and R2; then R2 is used to eliminate x2 from R1.
Existence and uniqueness. Two fundamental questions about linear systems are:<br />
1. Does at least one solution exist?<br />
2. If a solution exists, is the solution unique?<br />
If a linear system has one or more solutions, then it is said to be consistent; otherwise, it is<br />
said to be inconsistent. A linear system can be inconsistent, or have a unique solution, or<br />
have infinitely many solutions.<br />
Note that in the example above, we knew that the system was consistent once we reached<br />
equivalent system (4); the remaining steps allowed us to find the unique solution.<br />
Example 9.2 (Infinitely Many Solutions) Consider the following 3-by-4 system:

      x1 +  x2 −  x3        =   9
      x1 + 2x2 − 3x3 − 4x4  =   9
    −3x1 − 2x2 + 2x3 − 3x4  = −23

and the following sequence of equivalent augmented matrices:

    ⎡  1  1 −1  0    9 ⎤   ⎡ 1  1 −1  0  9 ⎤   ⎡ 1  1 −1  0  9 ⎤
    ⎢  1  2 −3 −4    9 ⎥ ∼ ⎢ 0  1 −2 −4  0 ⎥ ∼ ⎢ 0  1 −2 −4  0 ⎥
    ⎣ −3 −2  2 −3  −23 ⎦   ⎣ 0  1 −1 −3  4 ⎦   ⎣ 0  0  1  1  4 ⎦

      ⎡ 1  1  0  1  13 ⎤   ⎡ 1  0  0  3  5 ⎤
    ∼ ⎢ 0  1  0 −2   8 ⎥ ∼ ⎢ 0  1  0 −2  8 ⎥
      ⎣ 0  0  1  1   4 ⎦   ⎣ 0  0  1  1  4 ⎦

The last matrix corresponds to x1 + 3x4 = 5, x2 − 2x4 = 8, and x3 + x4 = 4. Thus,

    (s1, s2, s3, s4) = (5 − 3x4, 8 + 2x4, 4 − x4, x4)

is a solution for any value of x4.
Example 9.3 (Inconsistent System) Consider the following 3-by-3 system:

         3x2 − 6x3 =  8
    x1 − 2x2 + 3x3 = −1
   5x1 − 7x2 + 9x3 =  0

and the following sequence of equivalent augmented matrices:

    ⎡ 0  3 −6   8 ⎤   ⎡ 1 −2  3  −1 ⎤   ⎡ 1 −2  3  −1 ⎤   ⎡ 1 −2  3  −1 ⎤
    ⎢ 1 −2  3  −1 ⎥ ∼ ⎢ 0  3 −6   8 ⎥ ∼ ⎢ 0  3 −6   8 ⎥ ∼ ⎢ 0  3 −6   8 ⎥
    ⎣ 5 −7  9   0 ⎦   ⎣ 5 −7  9   0 ⎦   ⎣ 0  3 −6   5 ⎦   ⎣ 0  0  0  −3 ⎦

The last matrix corresponds to x1 − 2x2 + 3x3 = −1, 3x2 − 6x3 = 8, and 0 = −3. Since 0 ≠ −3, the system has no solution.
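Inconsistency can be detected mechanically: reduce the augmented matrix to an echelon form and look for a row whose coefficients are all zero but whose right hand side is not. A small sketch (names are mine; exact rationals avoid roundoff):

```python
from fractions import Fraction

def echelon(aug):
    """Row-reduce an augmented matrix (list of rows) to an echelon form
    using exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in aug]
    r = 0
    for c in range(len(m[0]) - 1):     # never pivot on the right hand side
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [u - factor * w for u, w in zip(m[i], m[r])]
        r += 1
    return m

def is_inconsistent(aug):
    """True iff an echelon form contains a row [0 ... 0 | b] with b != 0."""
    return any(all(v == 0 for v in row[:-1]) and row[-1] != 0
               for row in echelon(aug))

bad = [[0, 3, -6, 8], [1, -2, 3, -1], [5, -7, 9, 0]]
```

On the augmented matrix of this example, elimination produces the row [0 0 0 | −3] found above, so `is_inconsistent(bad)` is true, while the consistent system of Example 9.1 passes the check.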
9.1.2 Row Reduction and Echelon Forms<br />
A matrix is in (row) echelon form when

1. All nonzero rows are above any row of zeros.
2. Each leading entry (that is, leftmost nonzero entry) in a row is in a column to the right of the leading entries of the rows above it.
3. All entries in a column below a leading entry are zero.
Table 9.1: 6-by-11 matrices in echelon (left) and reduced echelon (right) forms.

    ⎡ 0 • ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ⎤      ⎡ 0 1 ∗ 0 0 ∗ ∗ 0 0 ∗ ∗ ⎤
    ⎢ 0 0 0 • ∗ ∗ ∗ ∗ ∗ ∗ ∗ ⎥      ⎢ 0 0 0 1 0 ∗ ∗ 0 0 ∗ ∗ ⎥
    ⎢ 0 0 0 0 • ∗ ∗ ∗ ∗ ∗ ∗ ⎥      ⎢ 0 0 0 0 1 ∗ ∗ 0 0 ∗ ∗ ⎥
    ⎢ 0 0 0 0 0 0 0 • ∗ ∗ ∗ ⎥      ⎢ 0 0 0 0 0 0 0 1 0 ∗ ∗ ⎥
    ⎢ 0 0 0 0 0 0 0 0 • ∗ ∗ ⎥      ⎢ 0 0 0 0 0 0 0 0 1 ∗ ∗ ⎥
    ⎣ 0 0 0 0 0 0 0 0 0 0 0 ⎦      ⎣ 0 0 0 0 0 0 0 0 0 0 0 ⎦
An echelon form matrix is in reduced echelon form when (in addition)

4. The leading entry in each nonzero row is 1.
5. Each leading 1 is the only nonzero entry in its column.

The left part of Table 9.1 represents a 6-by-11 echelon form matrix, where • represents the leading entry, and ∗ represents any number (either zero or nonzero). The right part of Table 9.1 represents a 6-by-11 reduced echelon form matrix.

Starting with the augmented matrix of a system of linear equations, the forward phase of the row reduction process will produce a matrix in echelon form. Continuing the process through the backward phase, we get a reduced echelon form matrix. An important theorem is the following:

Theorem 9.4 (Uniqueness Theorem) Each matrix is row-equivalent to one and only one reduced echelon form matrix.

Further, we can “read” the solutions to the original system from the reduced echelon form of the augmented matrix.

Pivot positions. A pivot position is a position of a leading entry in an echelon form matrix. A column that contains a pivot position is called a pivot column. For example, in the left part of Table 9.1, each • is located at a pivot position, and columns 2, 4, 5, 8, and 9 are pivot columns.

A pivot is a nonzero number that either is used in a pivot position to create zeros or is changed into a leading 1, which in turn is used to create zeros. In general there is no more than one pivot in any row and no more than one pivot in any column.
Example 9.5 (Echelon Forms) Consider the following 3-by-6 matrix:

    ⎡ 0  3 −6  6 4 −5 ⎤
    ⎢ 3 −7  8 −5 8  9 ⎥
    ⎣ 3 −9 12 −9 6 15 ⎦

We first interchange R1 and R3:

    ⎡ 3 −9 12 −9 6 15 ⎤
    ⎢ 3 −7  8 −5 8  9 ⎥
    ⎣ 0  3 −6  6 4 −5 ⎦

Column 1 is a pivot column and we use the 3 in the upper left corner to convert the 3 in row 2 to 0 (R2 − R1 replaces R2). In addition, R1 is replaced by (1/3)R1 for simplicity:

    ⎡ 1 −3  4 −3 2  5 ⎤
    ⎢ 0  2 −4  4 2 −6 ⎥
    ⎣ 0  3 −6  6 4 −5 ⎦

Column 2 is a pivot column. We could use the current row 2, or interchange the second and third rows and use the new row two for the pivot. For simplicity, we use the 2 in the second row to convert the 3 in the third row to a zero (R3 − (3/2)R2 replaces R3). In addition, R2 is replaced by (1/2)R2 for simplicity:

    ⎡ 1 −3  4 −3 2  5 ⎤
    ⎢ 0  1 −2  2 1 −3 ⎥
    ⎣ 0  0  0  0 1  4 ⎦

This matrix is in echelon form; columns 1, 2, and 5 are pivot columns.

Working with rows 3 and 2 (in that order) to “clear” columns 5 and 2, we get the following reduced echelon form matrix:

    ⎡ 1 −3  4 −3 2  5 ⎤   ⎡ 1 −3  4 −3 0  −3 ⎤   ⎡ 1 0 −2 3 0 −24 ⎤
    ⎢ 0  1 −2  2 1 −3 ⎥ ∼ ⎢ 0  1 −2  2 0  −7 ⎥ ∼ ⎢ 0 1 −2 2 0  −7 ⎥
    ⎣ 0  0  0  0 1  4 ⎦   ⎣ 0  0  0  0 1   4 ⎦   ⎣ 0 0  0 0 1   4 ⎦

Note that if the original matrix was the augmented matrix of a 3-by-5 system of linear equations in x1, x2, x3, x4, x5, then the last matrix corresponds to the equivalent linear system x1 − 2x3 + 3x4 = −24, x2 − 2x3 + 2x4 = −7, x5 = 4. Thus,

    (s1, s2, s3, s4, s5) = (−24 + 2x3 − 3x4, −7 + 2x3 − 2x4, x3, x4, 4)

is a solution for any values of x3, x4.
Solving linear systems. When solving linear systems,

1. A basic variable is any variable that corresponds to a pivot column in the augmented matrix of the system, and a free variable is any nonbasic variable.

2. From the equivalent system produced using the reduced echelon form, we solve each equation for the basic variable in terms of the free variables (if any) in the equation.

3. If there is at least one free variable, then the original linear system has an infinite number of solutions.

4. If an echelon form of the augmented matrix has a row of the form

       [ 0 0 ··· 0 b ] where b ≠ 0,

   then the system is inconsistent. (Once we know the system is inconsistent, there is no need to continue to find the reduced echelon form.)
Example 9.6 (Echelon Forms, continued) Continuing with the last example, the basic<br />
variables are x1, x2, and x5; the free variables are x3 and x4. The free variables are used to<br />
characterize the infinite number of solutions to the 3-by-5 system.<br />
9.1.3 Vector and Matrix Equations

In linear algebra, we let R^m be the set of m-by-1 matrices of real numbers:

          ⎧ ⎡ x1 ⎤          ⎫
    R^m = ⎨ ⎢ x2 ⎥ : xi ∈ R ⎬ .
          ⎪ ⎢  ⋮ ⎥          ⎪
          ⎩ ⎣ xm ⎦          ⎭

Each element of R^m is called a point or a vector. If x is a vector in R^m with components x1, x2, ..., xm, we say that xi is the i-th component of x.
The zero vector, O, is the vector all of whose components equal zero.<br />
Addition and scalar multiplication. If x, y ∈ R^m and c ∈ R, then

1. the vector sum x + y is the vector obtained by componentwise addition; that is, the vector whose i-th component is xi + yi for each i; and

2. the scalar product cx is the vector obtained by multiplying each component by c; that is, the vector whose i-th component is c·xi for each i.

Linear combinations and spans. If v1, v2, ..., vn ∈ R^m and c1, c2, ..., cn ∈ R, then

    w = c1 v1 + c2 v2 + ··· + cn vn

is a linear combination of the vectors v1, v2, ..., vn. The span of the set {v1, v2, ..., vn} is the collection of all linear combinations of the vectors v1, v2, ..., vn:

    Span{v1, v2, ..., vn} = {c1 v1 + c2 v2 + ··· + cn vn : ci ∈ R} ⊆ R^m.

The span of a set of vectors always contains the zero vector O (let each ci = 0).
Vector equations. Consider an m-by-n system of equations in the variables x1, x2, ..., xn, where each equation is of the form

    ai1 x1 + ai2 x2 + ··· + ain xn = bi, for i = 1, 2, ..., m.

Let aj be the vector of coefficients of xj, and b be the vector of right hand side values. The m-by-n system can be rewritten as a vector equation:

    x1 a1 + x2 a2 + ··· + xn an = b.

The system is consistent if and only if b ∈ Span{a1, a2, ..., an}.
Matrix equations. The coefficient matrix A of the m-by-n system above is the m-by-n matrix whose columns are the aj's: A = [ a1 a2 ··· an ].

If x ∈ R^n is the vector of unknowns, then the matrix product of A with x is

    Ax = [ a1 a2 ··· an ] x = x1 a1 + x2 a2 + ··· + xn an,

and the matrix equation for the system is Ax = b.

Note that the matrix product of a linear combination of vectors is equal to that linear combination of matrix products:

    A(c1 v1 + c2 v2 + ··· + cn vn) = c1 Av1 + c2 Av2 + ··· + cn Avn

for all ci ∈ R and vi ∈ R^n. (Multiplication by A distributes over addition, and we can factor out constants.)
Consistency for all b. The following theorem gives criteria for determining when a system of equations is consistent for all b ∈ R^m:

Theorem 9.7 (Consistency Theorem) Let A be an m-by-n coefficient matrix, and x be an n-by-1 vector of unknowns. The following statements are equivalent:

1. The equation Ax = b has a solution for all b ∈ R^m.
2. Each b ∈ R^m is a linear combination of the coefficient vectors aj.
3. Span{a1, a2, ..., an} = R^m.
4. A has a pivot position in each row.

Note that the number of pivot positions is at most the minimum of m and n. Thus, by the consistency theorem, if m > n (there are more equations than unknowns), a system with coefficient matrix A cannot be consistent for all b ∈ R^m.
Homogeneous systems and solution sets. A homogeneous system is a linear system<br />
with matrix equation Ax = O.<br />
Homogeneous systems are consistent since x = O is a solution to the system (called the<br />
trivial solution). In addition, each linear combination of solutions is a solution. Thus, the<br />
solution set is a span of some set of vectors.<br />
Example 9.8 (Trivial Solution Only) Consider the matrix equation

    ⎡ 2 −2 ⎤           ⎡ 0 ⎤
    ⎢ 1  2 ⎥ [ x1 ] =  ⎢ 0 ⎥
    ⎣ 1  8 ⎦ [ x2 ]    ⎣ 0 ⎦

Since

    ⎡ 2 −2  0 ⎤           ⎡ 1 0 0 ⎤
    ⎢ 1  2  0 ⎥ ∼ ··· ∼  ⎢ 0 1 0 ⎥   ⇒   x1 = 0, x2 = 0,
    ⎣ 1  8  0 ⎦           ⎣ 0 0 0 ⎦

the system has the trivial solution only. The solution set is Span{O} = {O} ⊆ R².
Example 9.9 (One Generating Vector) Consider the matrix equation

    ⎡ 2 4  −6 ⎤ ⎡ x1 ⎤     ⎡ 0 ⎤
    ⎣ 4 8 −10 ⎦ ⎢ x2 ⎥  =  ⎣ 0 ⎦
                ⎣ x3 ⎦

Since

    ⎡ 2 4  −6  0 ⎤           ⎡ 1 2 0  0 ⎤        x1 = −2x2
    ⎣ 4 8 −10  0 ⎦ ∼ ··· ∼  ⎣ 0 0 1  0 ⎦   ⇒   x2 is free
                                                 x3 = 0

solutions can be written parametrically as follows:

        ⎡ −2x2 ⎤       ⎡ −2 ⎤
    s = ⎢  x2  ⎥ = x2 ⎢  1 ⎥ = x2 v,
        ⎣   0  ⎦       ⎣  0 ⎦

where x2 is free. The solution set is Span{v} ⊆ R³.
Example 9.10 (Two Generating Vectors) Consider the matrix equation

    ⎡ 2 4 −6 8 ⎤ ⎡ x1 ⎤     ⎡ 0 ⎤
    ⎢ 4 8  4 0 ⎥ ⎢ x2 ⎥  =  ⎢ 0 ⎥
    ⎣ 3 6 −1 4 ⎦ ⎢ x3 ⎥     ⎣ 0 ⎦
                 ⎣ x4 ⎦

Since

    ⎡ 2 4 −6 8  0 ⎤           ⎡ 1 2 0  1  0 ⎤        x1 = −2x2 − x4
    ⎢ 4 8  4 0  0 ⎥ ∼ ··· ∼  ⎢ 0 0 1 −1  0 ⎥   ⇒   x2 is free
    ⎣ 3 6 −1 4  0 ⎦           ⎣ 0 0 0  0  0 ⎦        x3 = x4, x4 is free

solutions can be written parametrically as:

        ⎡ −2x2 − x4 ⎤        ⎡ −2 ⎤        ⎡ −1 ⎤
    s = ⎢     x2    ⎥  = x2 ⎢  1 ⎥  + x4 ⎢  0 ⎥  = x2 v1 + x4 v2,
        ⎢     x4    ⎥        ⎢  0 ⎥        ⎢  1 ⎥
        ⎣     x4    ⎦        ⎣  0 ⎦        ⎣  1 ⎦

where x2 and x4 are free. The solution set is Span{v1, v2} ⊆ R⁴.
Nonhomogeneous systems and solution sets. A nonhomogeneous system is a linear system with matrix equation Ax = b where b ≠ O.

Theorem 9.11 (Solution Sets) Suppose that Ax = b is consistent and let p be a solution. Then the solution set of the system Ax = b is the set of all vectors of the form

    s = p + vh,

where vh is a solution of the homogeneous system Ax = O.

Example 9.12 (Two Generating Vectors, continued) Consider the matrix equation

    ⎡ 2 4 −6 8 ⎤ ⎡ x1 ⎤     ⎡ 18 ⎤
    ⎢ 4 8  4 0 ⎥ ⎢ x2 ⎥  =  ⎢  4 ⎥
    ⎣ 3 6 −1 4 ⎦ ⎢ x3 ⎥     ⎣ 11 ⎦
                 ⎣ x4 ⎦

Since

    ⎡ 2 4 −6 8  18 ⎤           ⎡ 1 2 0  1   3 ⎤        x1 = 3 − 2x2 − x4
    ⎢ 4 8  4 0   4 ⎥ ∼ ··· ∼  ⎢ 0 0 1 −1  −2 ⎥   ⇒   x2 is free
    ⎣ 3 6 −1 4  11 ⎦           ⎣ 0 0 0  0   0 ⎦        x3 = −2 + x4, x4 is free

solutions can be written parametrically as:

        ⎡ 3 − 2x2 − x4 ⎤   ⎡  3 ⎤        ⎡ −2 ⎤        ⎡ −1 ⎤
    s = ⎢      x2      ⎥ = ⎢  0 ⎥  + x2 ⎢  1 ⎥  + x4 ⎢  0 ⎥  = p + (x2 v1 + x4 v2) = p + vh,
        ⎢   −2 + x4    ⎥   ⎢ −2 ⎥        ⎢  0 ⎥        ⎢  1 ⎥
        ⎣      x4      ⎦   ⎣  0 ⎦        ⎣  0 ⎦        ⎣  1 ⎦

where x2 and x4 are free. (The solution set is a shift of a span.)
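Theorem 9.11 is easy to verify numerically for this example: every choice of the free variables x2 and x4 gives a vector p + x2·v1 + x4·v2 that satisfies Ax = b. A small sketch (names are mine):

```python
def matvec(a, x):
    """Matrix-vector product A x (A stored as a list of rows)."""
    return [sum(r * xi for r, xi in zip(row, x)) for row in a]

A = [[2, 4, -6, 8], [4, 8, 4, 0], [3, 6, -1, 4]]
b = [18, 4, 11]

p = [3, 0, -2, 0]      # particular solution from the example
v1 = [-2, 1, 0, 0]     # generators of the homogeneous solution set
v2 = [-1, 0, 1, 1]

def solution(x2, x4):
    """The vector p + x2*v1 + x4*v2 for chosen free-variable values."""
    return [pi + x2 * w1 + x4 * w2 for pi, w1, w2 in zip(p, v1, v2)]
```

For instance, both solution(0, 0) = p and solution(5, −3) satisfy the matrix equation exactly, since Av1 = Av2 = O.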
9.2 Matrix Operations

An m-by-n matrix A is a rectangular array with m rows and n columns:

        ⎡ a11 a12 ··· a1n ⎤
    A = ⎢ a21 a22 ··· a2n ⎥ = [ a1 a2 ··· an ].
        ⎢  ⋮   ⋮        ⋮  ⎥
        ⎣ am1 am2 ··· amn ⎦

The (i,j)-entry of A (denoted by aij) is the element in the row i and column j position. In the first term above, the entries of A are shown explicitly. In the second term, A is written as n columns, where each column is a vector in R^m. (See Section 7.1 also.)
Square matrices. A square matrix is a matrix with m = n. An n-by-n square matrix is<br />
often called a square matrix of order n. If A is a square matrix of order n, then the diagonal<br />
elements of A are the elements: a11, a22, ..., ann.<br />
Symmetric matrices. Let A be a square matrix. Then A is said to be a symmetric matrix if aij = aji for all i and j.

Triangular matrices. Let A be a square matrix. Then A is said to be an upper triangular matrix if aij = 0 when i > j; A is said to be a lower triangular matrix if aij = 0 when i < j; and A is said to be a triangular matrix if it is either upper or lower triangular.
Diagonal matrices, identity matrices. The square matrix A is said to be a diagonal matrix when aij = 0 whenever i ≠ j.
Given d1, d2, ..., dn, Diag(d1,d2,...,dn) is used to denote the diagonal matrix of order n whose diagonal elements are the di’s:

                     [ d1  0  0 · · ·  0 ]
                     [  0 d2  0 · · ·  0 ]
Diag(d1,d2,...,dn) = [  0  0 d3 · · ·  0 ].
                     [  .  .  . · · ·  . ]
                     [  0  0  0 · · · dn ]
The identity matrix of order n, In, is the diagonal matrix with di = 1 for each i. We sometimes omit the subscript when the order of the matrix is clear from the problem.
Note that diagonal matrices are symmetric, upper triangular, and lower triangular.<br />
Vectors and matrices with all ones. For convenience, we let 1m be the vector in R m<br />
all of whose components equal one, and we let Jm×n be the m-by-n matrix all of whose<br />
entries equal one. We sometimes omit the subscripts when the orders of the vectors and<br />
matrices are clear from the problem.<br />
9.2.1 Basic Operations<br />
In this section, we consider the operations of addition and scalar multiplication, linear<br />
combinations of matrices, matrix transpose and matrix product.<br />
Addition and scalar multiplication. Let A and B be m-by-n matrices, and let c ∈ R<br />
be a scalar. Then matrix sum and scalar product are defined as follows:<br />
1. The matrix sum of A and B is the m-by-n matrix<br />
A + B = [a1 + b1 a2 + b2 · · · an + bn ].<br />
That is, A + B is the matrix with (i,j)-entry (A + B) ij = aij + bij.<br />
2. The scalar product of A by c is the m-by-n matrix<br />
cA = [ ca1 ca2 · · · can ].<br />
That is, cA is the matrix with (i,j)-entry (cA) ij = caij.<br />
Linear combinations of matrices. Let A and B be m-by-n matrices, and c and d be scalars. Then the m-by-n matrix

cA + dB = [ ca1 + db1   ca2 + db2   · · ·   can + dbn ]

is said to be a linear combination of A and B. The linear combination of A and B has (i,j)-entry

(cA + dB)ij = caij + dbij.

For example, if

A = [ 1 −2 ]        [  0  7 ]                  [  2 −11 ]
    [ 3 −4 ]    B = [  4  2 ],   then 2A − B = [  2 −10 ].
    [ 5 −6 ]        [ −2 −6 ]                  [ 12  −6 ]
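The arithmetic in this example can be confirmed with the entrywise rule; a minimal sketch (NumPy as an illustrative check):

```python
import numpy as np

A = np.array([[1, -2], [3, -4], [5, -6]])
B = np.array([[0, 7], [4, 2], [-2, -6]])

# linear combination computed entrywise: (2A - B)ij = 2*aij - bij
C = 2 * A - B
expected = np.array([[2, -11], [2, -10], [12, -6]])
assert np.array_equal(C, expected)
```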
Matrix transpose. The transpose of the m-by-n matrix A, denoted by A^T, is the n-by-m matrix whose (i,j)-entry is aji: (A^T)ij = aji.

For convenience, we let αi ∈ R n (for i = 1,2,...,m) be the columns of A^T and write

    [ α1^T ]
A = [   .  ]
    [ αm^T ]

when we want to view A as a “stack” of m row vectors.
Matrix product. Let A be an m-by-n matrix and let B = [ b1 b2 · · · bp ] be an n-by-p matrix. Then the matrix product, AB, is the m-by-p matrix whose j th column is the product of A with the j th column of B:

AB = [ Ab1 Ab2 · · · Abp ].

Equivalently, AB is the matrix whose (i,j)-entry is

(AB)ij = ai1b1j + ai2b2j + · · · + ainbnj.

(See Section 7.1.7 also.)
Dot and matrix products. Another convenient way to write the matrix product is in terms of dot products. Recall that if v and w are vectors in R n, then their dot product is the scalar

v·w = v^T w = v1w1 + v2w2 + · · · + vnwn.

Then

     [ α1^T ]                      [ α1^T b1   α1^T b2   · · ·   α1^T bp ]
AB = [   .  ] [ b1 b2 · · · bp ] = [ α2^T b1   α2^T b2   · · ·   α2^T bp ].
     [ αm^T ]                      [    .         .      · · ·      .    ]
                                   [ αm^T b1   αm^T b2   · · ·   αm^T bp ]
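Both descriptions of the product (column by column, and entrywise as dot products of rows with columns) can be checked against each other; a small sketch with arbitrary matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-5, 5, size=(3, 4)).astype(float)  # m-by-n
B = rng.integers(-5, 5, size=(4, 2)).astype(float)  # n-by-p

# column view: the j-th column of AB is A times the j-th column of B
cols = np.column_stack([A @ B[:, j] for j in range(B.shape[1])])

# entry view: (AB)ij is the dot product of row i of A with column j of B
entries = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
                    for i in range(A.shape[0])])

assert np.allclose(cols, A @ B)
assert np.allclose(entries, A @ B)
```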
Properties of the basic operations. Some properties of the basic operations are given<br />
below. In each case, the sizes of the matrices are assumed to be compatible.<br />
1. Commutative: A + B = B + A.<br />
2. Associative: A + (B + C) = (A + B) + C and A(BC) = (AB)C.<br />
3. Distributive: A(B + C) = (AB) + (AC), (B + C)A = (BA) + (CA).<br />
4. Scalar Multiples: Given c ∈ R:<br />
c(AB) = (cA)B = A(cB), c(A + B) = (cA) + (cB) and (cA) T = cA T .<br />
5. Identity: ImA = A = AIn when A is an m-by-n matrix.<br />
6. Transpose of Transpose: (A^T)^T = A.
7. Transpose of Sum: (A + B) T = A T + B T .<br />
8. Transpose of Product: (AB) T = B T A T .<br />
Note, in particular, that the matrix sum is commutative but the matrix product is not. In<br />
fact, it is possible that AB is defined but that BA is undefined.<br />
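A quick numerical illustration of the transpose-of-product rule and of non-commutativity (matrices chosen arbitrarily for the check):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [5.0, -2.0]])

# transpose of a product reverses the order of the factors
assert np.allclose((A @ B).T, B.T @ A.T)
# the matrix product is generally not commutative
assert not np.allclose(A @ B, B @ A)
```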
9.2.2 Inverses of Square Matrices<br />
The n-by-n matrix A is said to be invertible if there exists a square matrix C satisfying<br />
AC = CA = In where In is the n-by-n identity matrix.<br />
Otherwise, A is said to be not invertible. Square matrices that are not invertible are also<br />
called singular matrices; invertible square matrices are also called nonsingular matrices.<br />
If A is invertible, then the matrix C is unique. To see this, suppose that the matrix D also<br />
satisfies the equality AD = DA = In. Then, using properties of matrix products,<br />
D = DIn = D(AC) = (DA)C = InC = C.<br />
If A is invertible, then A −1 is used to denote the matrix inverse of A.<br />
Special case: 2-by-2 matrices. Let A = [ a b ]
                                       [ c d ]  be a 2-by-2 matrix. Then

A^{-1} = (1/(ad − bc)) [  d −b ]   when ad − bc ≠ 0.
                       [ −c  a ]

If ad − bc = 0, then A does not have an inverse. (Note: ad − bc is the determinant of A.)
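The 2-by-2 formula is easy to state in code; a minimal sketch compared against `numpy.linalg.inv` (the function name `inv2` is mine, not from the notes):

```python
import numpy as np

def inv2(A):
    """Inverse of a 2-by-2 matrix via the (1/(ad - bc)) [d -b; -c a] formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (ad - bc = 0)")
    return (1.0 / det) * np.array([[d, -b], [-c, a]])

A = np.array([[-7.0, 8.0], [2.0, -2.0]])   # used again in Example 9.13 below
assert np.allclose(inv2(A), np.linalg.inv(A))
assert np.allclose(A @ inv2(A), np.eye(2))
```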
Special case: diagonal matrices. Let A = Diag(a1,a2,...,an) be a diagonal matrix. If ai ≠ 0 for each i, then A is invertible with inverse

A^{-1} = Diag(1/a1, 1/a2, ..., 1/an).

If ai = 0 for some i, then A is not invertible.

Demonstration: Note first that if C = Diag(c1,c2,...,cn), then

AC = CA = Diag(a1c1, a2c2, ..., ancn).

To obtain AC = CA = In, we need aici = 1 for each i. If ai = 0 for some i, then we cannot find solutions; otherwise, we let ci = 1/ai for each i.
Properties of inverses. Let A and B be invertible square matrices of order n and let c be a nonzero constant. Then

1. Inverse of Identity: In^{-1} = In for each n.
2. Inverse of Inverse: (A^{-1})^{-1} = A.
3. Inverse of Product: (AB)^{-1} = B^{-1} A^{-1}.
4. Inverse of Scalar Multiple: (cA)^{-1} = (1/c)A^{-1}.
5. Inverse of Transpose: (A^T)^{-1} = (A^{-1})^T.
Algorithm for finding an inverse. Let A be an n-by-n matrix. Form the n-by-2n augmented matrix whose last n columns are the columns of the identity matrix: [ A In ].
If A is row equivalent to the identity matrix, then the n-by-2n augmented matrix will be<br />
transformed as follows:
[A In ] ∼ · · · ∼ [In A −1 ]<br />
That is, the last n columns will be the inverse of A. Otherwise, A does not have an inverse.<br />
Partial justification: For convenience, let C = A −1 and In = [e1 e2 · · · en ]. Then<br />
AC = In can be written as follows:

AC = [ Ac1 Ac2 · · · Acn ] = [ e1 e2 · · · en ].

Equivalently,

Ac1 = e1, Ac2 = e2, ..., Acn = en.
By using the n-by-2n augmented matrix, we are solving all n systems at once. The solutions to the n systems form the n columns of the inverse matrix.
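The justification says the inverse can be found by solving the n systems Aci = ei. A sketch doing exactly that with a linear solver in place of literal row reduction (the 3-by-3 matrix is the one from Example 9.14 below):

```python
import numpy as np

A = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [4.0, -3.0, 8.0]])
n = A.shape[0]
I = np.eye(n)

# solve A c_j = e_j for each column e_j of the identity;
# the solutions form the columns of the inverse
C = np.column_stack([np.linalg.solve(A, I[:, j]) for j in range(n)])

assert np.allclose(A @ C, I)
assert np.allclose(C, np.linalg.inv(A))
```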
<br />
Example 9.13 (2-by-2 Invertible Matrix) Let A = [ −7  8 ]
                                                [  2 −2 ].

Using the special formula for 2-by-2 matrices,

A^{-1} = (1/−2) [ −2 −8 ] = [ 1  4  ]
                [ −2 −7 ]   [ 1 7/2 ].

Using the algorithm for finding inverses,

[ A I2 ] = [ −7  8 | 1 0 ]   [ −7  8  |  1  0 ]   [ −7 8 | 1  0  ]
           [  2 −2 | 0 1 ] ∼ [  0 2/7 | 2/7 1 ] ∼ [  0 1 | 1 7/2 ]

         ∼ [ −7 0 | −7 −28 ]   [ 1 0 | 1  4  ]
           [  0 1 |  1 7/2 ] ∼ [ 0 1 | 1 7/2 ] = [ I2 A^{-1} ].

The last two columns of the augmented matrix match the inverse computed above.
Example 9.14 (3-by-3 Invertible Matrix) Let A = [ 0  1 2 ]
                                                [ 1  0 3 ]
                                                [ 4 −3 8 ].  Since

[ A I3 ] = [ 0  1 2 | 1 0 0 ]             [ 1 0 0 | −4.5  7.0 −1.5 ]
           [ 1  0 3 | 0 1 0 ] ∼ · · · ∼   [ 0 1 0 | −2.0  4.0 −1.0 ],
           [ 4 −3 8 | 0 0 1 ]             [ 0 0 1 |  1.5 −2.0  0.5 ]

A is invertible with inverse

A^{-1} = [ −4.5  7.0 −1.5 ]
         [ −2.0  4.0 −1.0 ].
         [  1.5 −2.0  0.5 ]
Example 9.15 (3-by-3 Singular Matrix) Let A = [ 0  1 2 ]
                                              [ 1  0 3 ]
                                              [ 4 −3 6 ].  Since

[ A I3 ] = [ 0  1 2 | 1 0 0 ]             [ 1 0 3 | 0  1 0 ]
           [ 1  0 3 | 0 1 0 ] ∼ · · · ∼   [ 0 1 2 | 1  0 0 ],
           [ 4 −3 6 | 0 0 1 ]             [ 0 0 0 | 3 −4 1 ]

A is not invertible. (Note that the third row of the left block is all zeros, so A is not row equivalent to I3.)
Elementary matrices and matrix inverses. An elementary matrix, E, is one obtained by performing a single elementary row operation on an identity matrix. Since each elementary row operation is reversible, each elementary matrix is invertible.
For example, let n = 4.<br />
1. (Replacement) Starting with I4, we replace R3 with R3 + aR1. The elementary matrix is E shown on the left below. The operation is reversed by replacing R3 with R3 − aR1. The inverse matrix is shown on the right below.

       [ 1 0 0 0 ]               [  1 0 0 0 ]
   E = [ 0 1 0 0 ]   and E^{-1} = [  0 1 0 0 ].
       [ a 0 1 0 ]               [ −a 0 1 0 ]
       [ 0 0 0 1 ]               [  0 0 0 1 ]

2. (Scaling) Starting with I4, we scale R2 using the nonzero constant c. The elementary matrix is E shown on the left below. The operation is reversed by scaling R2 by the constant 1/c. The inverse matrix is shown on the right below.

       [ 1 0 0 0 ]               [ 1  0  0 0 ]
   E = [ 0 c 0 0 ]   and E^{-1} = [ 0 1/c 0 0 ].
       [ 0 0 1 0 ]               [ 0  0  1 0 ]
       [ 0 0 0 1 ]               [ 0  0  0 1 ]

3. (Interchange) Starting with I4, we interchange R2 and R4. The elementary matrix is E shown below. The operation is reversed by interchanging R2 and R4. Thus, E^{-1} = E. Note also that E^2 = I4.

       [ 1 0 0 0 ]
   E = [ 0 0 0 1 ] = E^{-1}.
       [ 0 0 1 0 ]
       [ 0 1 0 0 ]
Let A be an n-by-n matrix. Elementary row operations on A correspond to left multiplication by the corresponding elementary matrix. If a sequence of p elementary row operations (with elementary matrices E1, E2, ..., Ep) transforms A to the identity matrix, then

(Ep(Ep−1 · · · (E2(E1A)))) = (EpEp−1 · · · E2E1)A = In,

so the inverse of A is the following product of elementary matrices:

A^{-1} = EpEp−1 · · · E2E1

and A is the product of the inverses in the opposite order:

A = E1^{-1} E2^{-1} · · · Ep−1^{-1} Ep^{-1}.
These computations can be used to formally justify the algorithm for finding inverses.
Example 9.16 (2-by-2 Matrix, continued) Continuing with the 2-by-2 example, four elementary row operations were used. Written as a matrix product:

E4E3E2E1A = [ −1/7 0 ][ 1 −8 ][ 1  0  ][  1  0 ][ −7  8 ] = [ 1 0 ] = I2.
            [   0  1 ][ 0  1 ][ 0 7/2 ][ 2/7 1 ][  2 −2 ]   [ 0 1 ]

Thus, A^{-1} = E4E3E2E1 = [ 1  4  ]
                          [ 1 7/2 ].
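The factorization in this example can be verified directly (matrices from Examples 9.13 and 9.16; NumPy as an illustrative check):

```python
import numpy as np

A  = np.array([[-7.0, 8.0], [2.0, -2.0]])
E1 = np.array([[1.0, 0.0], [2/7, 1.0]])    # R2 -> R2 + (2/7) R1
E2 = np.array([[1.0, 0.0], [0.0, 7/2]])    # R2 -> (7/2) R2
E3 = np.array([[1.0, -8.0], [0.0, 1.0]])   # R1 -> R1 - 8 R2
E4 = np.array([[-1/7, 0.0], [0.0, 1.0]])   # R1 -> (-1/7) R1

product = E4 @ E3 @ E2 @ E1
assert np.allclose(product @ A, np.eye(2))       # E4 E3 E2 E1 A = I2
assert np.allclose(product, np.linalg.inv(A))    # so the product is A^{-1}
```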
Relationship to solving systems of equations. The following theorem provides an<br />
important relationship between matrix inverses and systems of linear equations.<br />
Theorem 9.17 (Uniqueness Theorem) Let A be an invertible n-by-n matrix. Then<br />
the matrix equation Ax = b has the unique solution x = A −1 b <strong>for</strong> any b ∈ R n .<br />
9.2.3 Matrix Factorizations<br />
A factorization of a matrix A is an equation expressing A as the product of two or more<br />
matrices. This section introduces the LU-factorization, which supports an important practical<br />
method for solving large systems of linear equations. Additional factorizations will be
considered later in these notes.<br />
LU-Factorization. Let A be an m-by-n matrix. An LU-factorization of A is a matrix equation A = LU, where

1. L is an m-by-m unit lower triangular matrix, and
2. U is an m-by-n (upper) echelon form matrix.

Note that L is said to be a unit lower triangular matrix if L is a lower triangular matrix whose diagonal elements ℓii = 1 for all i.

The m-by-n matrix A has an LU-factorization when (and only when) A is row equivalent to an echelon form matrix using a sequence of row replacements only:

EpEp−1 · · · E2E1A = U   =⇒   A = LU where L = (EpEp−1 · · · E2E1)^{-1}.

In this equation, the Ei’s are the elementary matrices representing the row replacements.
Example 9.18 (3-by-3 Matrix) Consider the following matrix and equivalent forms:

A = [  3 1   0 ]   [ 3 1   0 ]   [ 3 1  0 ]
    [ −6 0  −4 ] ∼ [ 0 2  −4 ] ∼ [ 0 2 −4 ].
    [  0 6 −10 ]   [ 0 6 −10 ]   [ 0 0  2 ]

The last matrix is the echelon matrix U. Since two row replacements were used,

L = E1^{-1} E2^{-1} = [  1 0 0 ][ 1 0 0 ]   [  1 0 0 ]
                      [ −2 1 0 ][ 0 1 0 ] = [ −2 1 0 ].
                      [  0 0 1 ][ 0 3 1 ]   [  0 3 1 ]
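A direct check that these factors reproduce A (NumPy as an illustrative check):

```python
import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [-6.0, 0.0, -4.0],
              [0.0, 6.0, -10.0]])
L = np.array([[1.0, 0.0, 0.0],
              [-2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0]])
U = np.array([[3.0, 1.0, 0.0],
              [0.0, 2.0, -4.0],
              [0.0, 0.0, 2.0]])

assert np.allclose(L @ U, A)           # A = LU
assert np.allclose(np.tril(L), L)      # L is lower triangular ...
assert np.allclose(np.diag(L), 1.0)    # ... with unit diagonal
```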
Example 9.19 (3-by-4 Matrix) Consider the following matrix and equivalent forms:

A = [  2  4 −1  5 ]   [ 2  4 −1  5 ]   [ 2 4 −1 5 ]
    [ −4 −5  3 −8 ] ∼ [ 0  3  1  2 ] ∼ [ 0 3  1 2 ].
    [  2 −5 −4  1 ]   [ 0 −9 −3 −4 ]   [ 0 0  0 2 ]

The last matrix is the echelon matrix U. Since three row replacements were used,

L = E1^{-1} E2^{-1} E3^{-1} = [  1 0 0 ][ 1 0 0 ][ 1  0 0 ]   [  1  0 0 ]
                              [ −2 1 0 ][ 0 1 0 ][ 0  1 0 ] = [ −2  1 0 ].
                              [  0 0 1 ][ 1 0 1 ][ 0 −3 1 ]   [  1 −3 1 ]
Application. If A = LU and Ax = b is consistent, then the system can be solved<br />
efficiently. Consider the following setup:<br />
b = Ax = (LU)x = L(Ux) = Ly where y = Ux.<br />
The efficient solution is done in two steps:

1. Solve Ly = b for y.
2. Solve Ux = y for x.

The first step is essentially a forward substitution method, and the second step is essentially a backward substitution method. The process is useful when the same coefficient matrix is used with several different right hand sides.
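The two-step solve can be sketched with the factors from Example 9.18 (a generic triangular solve stands in for hand substitution here):

```python
import numpy as np

L = np.array([[1.0, 0.0, 0.0],
              [-2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0]])
U = np.array([[3.0, 1.0, 0.0],
              [0.0, 2.0, -4.0],
              [0.0, 0.0, 2.0]])
A = L @ U
b = np.array([4.0, -10.0, -4.0])

# step 1: solve Ly = b (forward substitution)
y = np.linalg.solve(L, b)
# step 2: solve Ux = y (backward substitution)
x = np.linalg.solve(U, y)

assert np.allclose(A @ x, b)
```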
Permutation matrices and factorizations. A permutation matrix P is a product of elementary matrices corresponding to row interchanges.

In the LU-factorization, L encodes information about row replacements, and the pivot positions of U encode information about scalings. If row interchanges are needed to solve a system, then a permutation matrix can be used to encode these operations. Specifically, let P be the product of interchange matrices needed to bring the rows of A into proper order for an LU-factorization. Then write PA = LU. (If row interchanges are needed, then A does not have an LU-factorization. The matrix PA does have an LU-factorization.)
9.2.4 Partitioned Matrices<br />
A partitioned matrix (or blocked matrix) is one whose entries are themselves matrices. An<br />
m-by-n matrix can be partitioned in many ways, including the following 2-by-2 form:

A = [ A11 A12 ]   where Aij is a pi-by-qj matrix,
    [ A21 A22 ]

p1 + p2 = m and q1 + q2 = n. (This can be generalized to 3-by-3 forms, etc.)

Each submatrix in a partitioned matrix is called a block. Partitioning is especially useful when you can take advantage of block patterns. For example, if certain Aij’s are zero matrices, identity matrices, diagonal matrices, or matrices all of whose entries are equal.

If all submatrices are conformable, then partitioned matrices can be added or multiplied by adding or multiplying the appropriate blocks. In particular,

AB = [ A11 A12 ][ B11 B12 ] = [ C11 C12 ]   where Cij = Ai1B1j + Ai2B2j for each i and j.
     [ A21 A22 ][ B21 B22 ]   [ C21 C22 ]
This section considers inverses of partitioned square matrices. Additional methods for partitioned matrices will be considered in later sections of these notes.
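Blockwise multiplication can be illustrated with `numpy.block` (block sizes chosen arbitrarily for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
A11, A12 = rng.standard_normal((2, 2)), rng.standard_normal((2, 3))
A21, A22 = rng.standard_normal((1, 2)), rng.standard_normal((1, 3))
B11, B12 = rng.standard_normal((2, 2)), rng.standard_normal((2, 1))
B21, B22 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))

A = np.block([[A11, A12], [A21, A22]])   # 3-by-5
B = np.block([[B11, B12], [B21, B22]])   # 5-by-3

# multiply blockwise: Cij = Ai1 B1j + Ai2 B2j
C = np.block([[A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
              [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22]])

assert np.allclose(C, A @ B)
```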
Block diagonal matrices. A 2-by-2 block diagonal matrix is a square matrix A of order n written as follows

A = [ A11  O  ]   where A11 is p-by-p, A22 is q-by-q,
    [  O  A22 ]

each O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

A is invertible if and only if both A11 and A22 are invertible. Then,

A^{-1} = [ A11^{-1}    O     ]
         [    O     A22^{-1} ].
Example 9.20 (4-by-4 Matrix) Let A be the 4-by-4 matrix shown on the left below. The inverse of A can be computed using the formula for 2-by-2 block diagonal matrices:

A = [  8 −5 0 0 ]              [ 2 5  0  0  ]
    [ −3  2 0 0 ]   =⇒ A^{-1} = [ 3 8  0  0  ].
    [  0  0 9 4 ]              [ 0 0  1 −2  ]
    [  0  0 4 2 ]              [ 0 0 −2 9/2 ]
Block upper triangular matrices. A 2-by-2 block upper triangular matrix is a square matrix A of order n written as follows

A = [ A11 A12 ]   where A11 is p-by-p, A12 is p-by-q, A22 is q-by-q,
    [  O  A22 ]

O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

A is invertible if and only if both A11 and A22 are invertible. Then,

A^{-1} = [ A11^{-1}     C     ]   where C = −A11^{-1} A12 A22^{-1}.
         [    O      A22^{-1} ]

In particular, if A11 and A22 are identity matrices, then A^{-1} = [ Ip −A12 ]
                                                                  [  O   Iq ].
Block lower triangular matrices. A 2-by-2 block lower triangular matrix is a square matrix A of order n written as follows

A = [ A11  O  ]   where A11 is p-by-p, A21 is q-by-p, A22 is q-by-q,
    [ A21 A22 ]

O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

Since

A^T = [ A11^T A21^T ]
      [   O   A22^T ]

is block upper triangular, the method above easily generalizes to handle this case.
Block elimination and Schur complements. Let M be a square matrix of order n written as follows:

M = [ A B ]   where the diagonal submatrices are square,
    [ C D ]

and assume that A is invertible. Then block elimination subtracts CA^{-1} times the first row from the second row. This leaves the Schur complement of A in the lower right corner of the resulting block upper triangular matrix, as illustrated below:

[    I     O ][ A B ]   [ A       B       ]
[ −CA^{-1} I ][ C D ] = [ O  D − CA^{-1}B ].

If the Schur complement (D − CA^{-1}B) is invertible, then the inverse of M can be found in block form using methods for block triangular matrices.
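One standard way to assemble the resulting block inverse is sketched below (the function name and the closed-form block expressions are a common restatement of the block-triangular method, not notation from the notes; it assumes A and the Schur complement are invertible, which holds almost surely for the random test matrices):

```python
import numpy as np

def block_inverse(A, B, C, D):
    """Invert M = [[A, B], [C, D]] via the Schur complement of A."""
    Ainv = np.linalg.inv(A)
    S = D - C @ Ainv @ B                     # Schur complement of A
    Sinv = np.linalg.inv(S)
    return np.block([
        [Ainv + Ainv @ B @ Sinv @ C @ Ainv, -Ainv @ B @ Sinv],
        [-Sinv @ C @ Ainv,                   Sinv]])

rng = np.random.default_rng(2)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 2))
C, D = rng.standard_normal((2, 3)), rng.standard_normal((2, 2))
M = np.block([[A, B], [C, D]])

assert np.allclose(block_inverse(A, B, C, D), np.linalg.inv(M))
```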
Example 9.21 (5-by-5 Matrix) Consider the 5-by-5 matrix M (written as a 2-by-2 block matrix) and its Schur complement:

M = [ A B ]   [ 3 1  0 −42  0 ]
    [ C D ] = [ 4 1 −2   0 −1 ]
              [ 2 0 −5   1 −7 ]   and (D − CA^{-1}B) = [ −289 −13 ].
              [ 1 0  1   2  1 ]                        [  378  17 ]
              [ 1 1  1   3  4 ]

Since A, D and (D − CA^{-1}B) are invertible, we can compute M^{-1} using block methods. Using properties of inverses,

M^{-1} = [ A       B       ]^{-1} [    I     O ]
         [ O  D − CA^{-1}B ]      [ −CA^{-1} I ]

       = [ −5   5 −2 −134 −103 ][  1 0  0 0 0 ]   [ −16  119  −95 −134 −103 ]
         [ 16 −15  6 1116  855 ][  0 1  0 0 0 ]   [ 133 −987  789 1116  855 ]
         [ −2   2 −1  479  366 ][  0 0  1 0 0 ] = [  57 −423  338  479  366 ].
         [  0   0  0   17   13 ][  7 −7 3 1 0 ]   [   2  −15   12   17   13 ]
         [  0   0  0 −378 −289 ][ −9 8 −3 0 1 ]   [ −45  334 −267 −378 −289 ]

9.3 Determinants of Square Matrices
Determinants of square matrices were introduced in Section 7.1.8. This section gives additional information about determinants.
Recursive or cofactor definition. Let A be a square matrix of order n, and let Aij<br />
be the submatrix obtained by eliminating the i th row and the j th column of A. Then the<br />
determinant of A can be defined recursively as follows:<br />
det(A) = Σ_{j=1}^{n} (−1)^{1+j} a1j det(A1j).

(The determinant has been computed by expanding across the first row.)
The value of the determinant remains the same if we expand across the i th row:

det(A) = Σ_{j=1}^{n} (−1)^{i+j} aij det(Aij)

or if we expand down the j th column:

det(A) = Σ_{i=1}^{n} (−1)^{i+j} aij det(Aij)

for any i or j. The number

cij = (−1)^{i+j} det(Aij)

is often called the ij-cofactor of A.
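The recursive definition translates directly into code; a small (and deliberately inefficient) sketch expanding across the first row, checked on the matrix of Example 9.22 below:

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion across the first row."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)  # submatrix A1j
        total += (-1) ** j * A[0, j] * det_cofactor(minor)     # (-1)**j is (-1)^(1+j), 1-based
    return total

A = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [4.0, -3.0, 8.0]])
assert np.isclose(det_cofactor(A), -2.0)             # Example 9.22
assert np.isclose(det_cofactor(A), np.linalg.det(A))
```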
Properties of determinants. Let A and B be square matrices of order n. Then<br />
<br />
1. Transpose: det(A^T) = det(A).
2. Triangular: If A is a triangular matrix, then det(A) = a11 a22 · · · ann.
3. Product: det(AB) = det(A) det(B).
4. Idempotent: If A^2 = A (A is an idempotent matrix), then det(A) = 0 or 1.
5. Inverse: A is an invertible matrix if and only if det(A) ≠ 0. Further, if the matrix A is invertible, then det(A^{-1}) = 1/det(A).
6. Elementary Matrices: Let E be an elementary matrix. Then

   det(E) = 1 when E corresponds to row replacement,
   det(E) = −1 when E corresponds to row interchange, and
   det(E) = c when E corresponds to scaling a row by c.
7. Zeroes: If A has a row or column of zeroes, then det(A) = 0.<br />
8. Identical: If two rows or two columns of A are identical, then det(A) = 0.<br />
Example 9.22 (3-by-3 Invertible Matrix, continued) If A = [ 0  1 2 ]
                                                          [ 1  0 3 ]
                                                          [ 4 −3 8 ],  then

         | 0  1 2 |
det(A) = | 1  0 3 | = 0 − 1 | 1 3 | + 2 | 1  0 | = 0 − (−4) + 2(−3) = −2 ≠ 0.
         | 4 −3 8 |         | 4 8 |     | 4 −3 |
Example 9.23 (3-by-3 Singular Matrix, continued) If A = [ 0  1 2 ]
                                                        [ 1  0 3 ]
                                                        [ 4 −3 6 ],  then

         | 0  1 2 |
det(A) = | 1  0 3 | = 0 − 1 | 1 3 | + 2 | 1  0 | = 0 − (−6) + 2(−3) = 0.
         | 4 −3 6 |         | 4 6 |     | 4 −3 |
Relationship to pivots. Consider the square matrix A of order n. Let K be a product of elementary matrices needed to bring A into echelon form, and let U be the resulting echelon form matrix: KA = U.

Since det(K) ≠ 0 by Properties 3 and 5, and det(U) = ∏_i uii by Property 2,

det(A) = (1/det(K)) ∏_i uii   (using Property 3).

Thus, det(A) = 0 when there are fewer than n pivot positions.
Reducing to echelon form. The determinant of A can be derived by reducing A to an echelon form matrix and observing the following:

1. If a multiple of one row is added to another, then the determinant is unchanged.
2. If two rows are interchanged, then the determinant changes sign.
3. If a row is multiplied by c, then the determinant is multiplied by c.

For example,

| 3 6 6 |     | 1 2 2 |     | 1 2  2 |      | 1 2  2 |
| 1 2 1 | = 3 | 1 2 1 | = 3 | 0 0 −1 | = −3 | 0 2  2 | = −3(−2) = 6.
| 1 4 4 |     | 1 4 4 |     | 0 2  2 |      | 0 0 −1 |
<br />
Explicit formula for the inverse using cofactors. If A is invertible, then the inverse of A can be written explicitly in terms of determinants as follows:

                    [ c11 c12 · · · c1n ]^T
A^{-1} = (1/det(A)) [ c21 c22 · · · c2n ]     where cij = (−1)^{i+j} det(Aij).
                    [  .   .  · · ·  .  ]
                    [ cn1 cn2 · · · cnn ]
(A^{-1} equals the reciprocal of det(A) times the transpose of the cofactor matrix.)

This explicit formula is often useful when proving theorems about inverses, but it is not a useful computational formula.
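Despite the computational caveat, the formula is straightforward to sketch for small matrices (the function name `adjugate_inverse` is mine; `numpy.linalg.det` stands in for the minor determinants):

```python
import numpy as np

def adjugate_inverse(A):
    """Inverse via the transposed cofactor (adjugate) matrix."""
    n = A.shape[0]
    C = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)  # submatrix Aij
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)       # cofactor cij
    return C.T / np.linalg.det(A)

A = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [4.0, -3.0, 8.0]])
assert np.allclose(adjugate_inverse(A), np.linalg.inv(A))
```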
9.4 Vector Spaces and Subspaces<br />
This section introduces vector space theory, emphasizing topics of interest in statistics.<br />
9.4.1 Definitions<br />
A vector space V over the reals is a set of objects (called vectors), and operations of addition<br />
and scalar multiplication satisfying the following axioms:<br />
1. Vector sum: Vector sum is commutative and associative.<br />
2. Zero vector: There is a zero vector O satisfying

   O + x = x = x + O for each x ∈ V .

3. Additive inverse: For each x ∈ V ,

   y + x = O = x + y for some y ∈ V .
4. Scalar multiples: Given x,y ∈ V and c,d ∈ R,<br />
c(x + y) = (cx) + (cy), (c + d)x = (cx) + (dx), (cd)x = c(dx), 1x = x.<br />
Examples of vector spaces include<br />
1. R m with the usual vector addition and scalar multiplication.<br />
2. The set of r-by-c matrices, Mr×c, with the usual operations of matrix addition and<br />
scalar multiplication.<br />
3. The set of polynomial functions of degree at most k, Pk,

   Pk = {p(t) = a0 + a1t + a2t^2 + · · · + akt^k : ai ∈ R}

   where polynomials are added by adding their values for each t, and the scalar multiple by c is obtained by multiplying each value of the polynomial by the scalar c.
4. The set of all polynomial functions, P∞,<br />
P∞ = P1 ∪ P2 ∪ P3 ∪ · · ·<br />
with addition and scalar multiplication as defined above.<br />
5. The set of continuous real-valued functions with domain [0,1], C[0, 1],<br />
C[0, 1] = {f : [0,1] → R : f is continuous}<br />
   where functions are added by adding their values for each x ∈ [0,1], and the scalar multiple by c is obtained by multiplying each value of the function by the scalar c.
Subspaces. A subset H ⊆ V is a subspace of the vector space V if<br />
1. Contains zero vector: O ∈ H.<br />
2. Closed under addition: If x,y ∈ H, then x + y ∈ H.<br />
3. Closed under scalar multiplication: If x ∈ H, then cx ∈ H for each c ∈ R.
Note that the subset {O} is a subspace, called the trivial subspace.<br />
Example 9.24 (Lines in R^2) The set of points in the plane satisfying y = mx + b for some m and b:

Lm,b = { (x, y) : y = mx + b } ⊆ R^2

is a subspace of R^2 when b = 0 (the line goes through the origin) and is not a subspace when b ≠ 0 (the line does not go through the origin).
Example 9.25 (Planes in R^3) The set of points in three-space satisfying z = ax + by + c for some a, b, c:

Pa,b,c = { (x, y, z) : z = ax + by + c } ⊆ R^3

is a subspace of R^3 when c = 0 (the plane goes through the origin) and is not a subspace when c ≠ 0 (the plane does not go through the origin).
Of interest. Subspaces are vector spaces in their own right. Of interest in statistics are<br />
the vector spaces R m , and the vector subspaces H ⊆ R m .<br />
9.4.2 Subspaces of R m<br />
Recall (from Section 9.1.3) that the span of the set {v1,v2,... ,vn} ⊆ R m , is the collection<br />
of all linear combinations of the vectors v1, v2, ..., vn:<br />
Span{v1,v2,...,vn} = {c1v1 + c2v2 + · · · + cnvn : ci ∈ R} ⊆ R m .<br />
Spans are subspaces. Of particular interest are spans related to coefficient matrices.<br />
Column space of A. Let A = [a1 a2 · · · an] be an m-by-n coefficient matrix. The<br />
column space of A is the span of the columns of A:<br />
Col(A) = Span{a1,a2,... ,an} ⊆ R m .<br />
The column space of A is the collection of all b so that Ax = b is consistent.<br />
Null space of A. Let A be an m-by-n coefficient matrix. The null space of A is the set<br />
of solutions of the homogeneous system Ax = O:<br />
Null(A) = {x : Ax = O} ⊆ R n .<br />
Since Null(A) can be written as a span of a set of vectors, it is a subspace of R n .<br />
9.4.3 Linear Independence, Basis and Dimension<br />
The set {v1,v2,... ,vn} ⊆ R m is said to be linearly independent when<br />
c1v1 + c2v2 + · · · + cnvn = O<br />
only when each ci = 0. Otherwise, the set is said to be linearly dependent.<br />
The following uniqueness theorem gives an important property of linearly independent sets.<br />
Theorem 9.26 (Uniqueness Theorem) If {v1,v2,...,vn} ⊆ R m is a linearly independent<br />
set and w ∈ Span{v1,v2,... ,vn}, then w has a unique representation as a linear<br />
combination of the vi’s. That is, we can write<br />
w = c1v1 + c2v2 + · · · + cnvn<br />
for a unique ordered list of scalars c1, c2, ..., cn (called the coordinates of w).
Note that the coordinates can be found by solving the system

[ v1 v2 · · · vn ] [ c1 ]   [ w1 ]
                   [  . ] = [  . ].
                   [ cn ]   [ wm ]

That is, by solving Ac = w, where A is the m-by-n matrix whose columns are the vi’s.
Additional facts. Some additional facts about linear independence/dependence are:

1. If vi = O for some i, then the set {v1,v2,...,vn} is linearly dependent.
2. If n > m, then the set {v1,v2,...,vn} is linearly dependent.
3. Assume vi ≠ O for each i. Then, the set is linearly dependent if and only if one vector in the set can be written as a linear combination of the others.
4. Let A be the m-by-n matrix whose columns are the vi’s: A = [ v1 v2 · · · vn ]. Then, the set is linearly independent if and only if A has a pivot in every column.
5. Let A be the m-by-n matrix whose columns are the vi’s: A = [ v1 v2 · · · vn ]. Then, the set is linearly independent if and only if the homogeneous system Ax = O has the trivial solution only.
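Fact 4 has a convenient numerical restatement: the number of pivots is the rank, so the columns are independent exactly when the rank equals the number of columns. A sketch (the helper name and the test vectors are mine):

```python
import numpy as np

def columns_independent(A):
    """Columns of A are linearly independent iff rank(A) = number of columns."""
    return np.linalg.matrix_rank(A) == A.shape[1]

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([2.0, 1.0, 0.0])
v3 = np.array([6.0, -2.0, 1.0])

# three independent vectors in R^3, then a dependent set (third vector = v1 + v2)
assert columns_independent(np.column_stack([v1, v2, v3]))
assert not columns_independent(np.column_stack([v1, v2, v1 + v2]))
```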
Bases. A basis for the vector space V ⊆ R m is a linearly independent set which spans V . That is, a basis is a set of linearly independent vectors {v1,v2,...,vn} satisfying

V = Span{v1,v2,...,vn}.
Example 9.27 (Bases <strong>for</strong> R3 ) Bases are not unique. For example, the sets<br />
⎧⎡<br />
⎨<br />
⎣<br />
⎩<br />
1<br />
⎤ ⎡<br />
0 ⎦, ⎣<br />
0<br />
0<br />
⎤ ⎡<br />
1⎦,<br />
⎣<br />
0<br />
0<br />
⎤⎫<br />
⎬<br />
0⎦<br />
⎭<br />
1<br />
and<br />
⎧⎡<br />
⎨<br />
⎣<br />
⎩<br />
−3<br />
⎤ ⎡<br />
0 ⎦ , ⎣<br />
0<br />
2<br />
⎤ ⎡<br />
1 ⎦, ⎣<br />
0<br />
6<br />
⎤⎫<br />
⎬<br />
−2 ⎦<br />
⎭<br />
1<br />
are each bases of V = R 3 .<br />
Further, if the components of x ∈ R 3 are x1, x2, x3, then the coordinates of x<br />
1. With respect to the first basis are x1, x2, x3.<br />
2. With respect to the second basis are (1/3)(−x1 + 2x2 + 10x3), x2 + 2x3, x3.<br />
Standard basis <strong>for</strong> R m . The standard basis <strong>for</strong> R m is the set {e1,e2,... ,em}, where<br />
the ei’s are the columns of the identity matrix Im. That is, where ei is the vector whose<br />
j th component equals 0 when j ≠ i and equals 1 when j = i.<br />
Note that the first basis given in the previous example is the standard basis <strong>for</strong> R 3 .<br />
Basis <strong>for</strong> the span of a set of vectors. Suppose that<br />
V = Span{v1,v2,... ,vn} ⊆ R m .<br />
A basis for V can be found using row reduction. Specifically,<br />
Theorem 9.28 (Spanning Set Theorem) If V = Span{v1,v2,...,vn} ⊆ R m is not<br />
the trivial subspace, and A is the m-by-n matrix whose columns are the vi’s:<br />
A = [v1 v2 · · · vn ],<br />
then the pivot columns of A <strong>for</strong>m a basis <strong>for</strong> the vector space V .<br />
Basis <strong>for</strong> the column space of A. In particular, the pivot columns of the coefficient<br />
matrix A <strong>for</strong>m a basis <strong>for</strong> the column space of A, Col(A).<br />
Example 9.29 (3-by-5 Matrix) Consider the following 3-by-5 coefficient matrix A and<br />
its equivalent reduced echelon <strong>for</strong>m matrix:<br />
A = [ a1 a2 a3 a4 a5 ] = [ 0 3 −6 6 4 ; 3 −7 8 −5 8 ; 3 −9 12 −9 6 ]<br />
∼ [ 1 0 −2 3 0 ; 0 1 −2 2 0 ; 0 0 0 0 1 ].<br />
Since the pivots are in columns 1, 2, 5, a basis <strong>for</strong> Col(A) is the set {a1,a2,a5}.<br />
Basis <strong>for</strong> the null space of A. The method <strong>for</strong> writing the solution set of a homogeneous<br />
system Ax = O given in Section 9.1.3 automatically produces a set of linearly independent<br />
vectors which span Null(A). Thus, no further work is needed to find a basis <strong>for</strong> Null(A).<br />
Example 9.30 (3-by-5 Matrix, continued) Continuing with the example above, since<br />
Ax = O =⇒ x1 = 2x3 − 3x4, x2 = 2x3 − 2x4, x3 is free, x4 is free, x5 = 0<br />
=⇒ s = x3 (2, 2, 1, 0, 0) + x4 (−3, −2, 0, 1, 0) = x3v1 + x4v2,<br />
where x3 and x4 are free, a basis for Null(A) is the set {v1,v2}.<br />
Dimension of a vector space. Although there are many choices <strong>for</strong> the basis vectors of<br />
a vector space V , the number of basis vectors does not change.<br />
Theorem 9.31 (Basis Theorem) If V ⊆ R m has a basis with p vectors, then<br />
1. Every basis of V must contain exactly p vectors.<br />
2. Any subset of V with p linearly independent vectors is a basis of V .<br />
3. Any subset of V with p vectors which span V is a basis of V .<br />
If V is a nontrivial subspace of R m , then the dimension of V , denoted by dim(V ), is the<br />
number of vectors in any basis of V . By the basis theorem, this number is well-defined. For<br />
convenience, we say that the dimension of the trivial subspace, {O}, is zero.<br />
Note that <strong>for</strong> every subspace V ⊆ R m , 0 ≤ dim(V ) ≤ m.<br />
9.4.4 Rank and Nullity<br />
Let A be an m-by-n matrix. The rank of A, denoted by rank(A), is the number of pivot<br />
columns of A. Equivalently, the rank of A is the dimension of the column space of A,<br />
rank(A) = dim(Col(A)).<br />
The nullity of A, denoted by nullity(A), is the dimension of its null space,<br />
nullity(A) = dim(Null(A)).<br />
If A is the coefficient matrix of an m-by-n system of linear equations and r = rank(A),<br />
then the system has r basic variables and n − r free variables. Since the number of free<br />
variables corresponds to the dimension of the null space of A, we have<br />
rank(A) + nullity(A) = n.<br />
Example 9.32 (3-by-5 Matrix, continued) For example, the 3-by-5 coefficient matrix<br />
A considered earlier has rank 3 and nullity 2. The sum of these dimensions equals the<br />
number of columns of the matrix.<br />
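The rank, nullity, and null space basis in Examples 9.29, 9.30 and 9.32 can be checked numerically. A sketch, assuming numpy is available:<br />

```python
import numpy as np

# The 3-by-5 coefficient matrix A of Examples 9.29, 9.30 and 9.32.
A = np.array([[0.0, 3.0, -6.0, 6.0, 4.0],
              [3.0, -7.0, 8.0, -5.0, 8.0],
              [3.0, -9.0, 12.0, -9.0, 6.0]])

r = np.linalg.matrix_rank(A)   # number of pivot columns
nullity = A.shape[1] - r       # rank(A) + nullity(A) = n

# The basis vectors for Null(A) found in Example 9.30.
v1 = np.array([2.0, 2.0, 1.0, 0.0, 0.0])
v2 = np.array([-3.0, -2.0, 0.0, 1.0, 0.0])

print(r, nullity)                                        # 3 2
print(np.allclose(A @ v1, 0), np.allclose(A @ v2, 0))    # True True
```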
Subspaces related to A and A T . Let A be an m-by-n coefficient matrix. There are<br />
four important subspaces related to A:<br />
1. Column space of A: The column space of A, Col(A) ⊆ R m .<br />
2. Null space of A: The null space of A, Null(A) ⊆ R n .<br />
3. Row space of A: The column space of A T , Col(A T ) ⊆ R n .<br />
4. Left null space of A: The null space of A T , Null(A T ) ⊆ R m .<br />
The dimensions of these spaces are related as follows:<br />
Theorem 9.33 (Fundamental Theorem of Linear Algebra, Part I) If A is an<br />
m-by-n coefficient matrix of rank r, then<br />
1. rank(A) = rank(A T ) = r,<br />
2. nullity(A) = n − r and<br />
3. nullity(A T ) = m − r.<br />
Matrices of rank one. If the m-by-n matrix A has rank 1, then the columns of A are<br />
all multiples of a single vector, say ai = civ <strong>for</strong> some constants ci and vector v ∈ R m . Let<br />
c ∈ R n be the vector whose components are the ci’s. Then<br />
A = [ c1v c2v · · · cnv ] = vc T .<br />
9.5 Orthogonality<br />
Let v,w ∈ R m . Recall (from Section 7.1.4) that<br />
1. The dot product of v with w is the number<br />
v·w = v T w = v1w1 + v2w2 + · · · + vmwm.<br />
2. If θ is the angle between v and w, then<br />
v·w = ‖v‖ ‖w‖ cos(θ).<br />
3. The vectors v and w are said to be orthogonal if v·w = 0.<br />
When m = 2 or m = 3, orthogonality corresponds to being at right angles in the usual<br />
geometric sense. When m > 3, the definition of orthogonality in terms of the dot product<br />
is a useful generalization.<br />
9.5.1 Orthogonal Complements<br />
If V ⊆ R m is a subspace, then the orthogonal complement of V is the set of all vectors<br />
x ∈ R m which are orthogonal to every vector in V :<br />
V ⊥ = {x : v·x = 0 for every v ∈ V } ⊆ R m .<br />
(V ⊥ is read “V -perp.”)<br />
Properties. Orthogonal complements satisfy the following properties:<br />
1. Subspace: V ⊥ is a subspace of R m .<br />
2. Trivial intersection: Since x·x = 0 only when x = O, V ∩ V ⊥ = {O}.<br />
3. Complement of complement: (V ⊥ ) ⊥ = V .<br />
4. Bases: If you pool bases <strong>for</strong> V and V ⊥ , you get a basis <strong>for</strong> R m .<br />
5. Dimensions: dim(V ) + dim(V ⊥ ) = m.<br />
Example 9.34 (Complements in R3 ) Let V be the following subspace:<br />
V = Span{v1,v2} = Span{ (1, −1, 0), (2, 0, −2) } ⊆ R3 .<br />
V is a two-dimensional subspace of R 3 since the set {v1,v2} is linearly independent. To<br />
find V ⊥ it is sufficient to find those vectors x satisfying v1·x = 0 and v2·x = 0.<br />
Equivalently, we need to solve the following matrix equation<br />
[ v1 T ; v2 T ] x = O.<br />
Since<br />
[ 1 −1 0 0 ; 2 0 −2 0 ] ∼ [ 1 0 −1 0 ; 0 1 −1 0 ] =⇒ x1 = x3, x2 = x3, x3 is free,<br />
V ⊥ is the span of the single vector v3 = (1, 1, 1). Further, {v1,v2,v3} is a basis for R 3 .<br />
The example above suggests the second part of the Fundamental Theorem of Linear Algebra:<br />
Theorem 9.35 (Fundamental Theorem of Linear Algebra, Part II) Let A be an<br />
m-by-n coefficient matrix. Then<br />
1. Null(A T ) and Col(A) are orthogonal complements in R m .<br />
2. Null(A) and Col(A T ) are orthogonal complements in R n .<br />
9.5.2 Orthogonal Sets, Bases and Projections<br />
The nonzero vectors v1, v2, ..., vn are said to be mutually orthogonal if vi·vj = 0 when<br />
i = j. A set of mutually orthogonal vectors is known as an orthogonal set.<br />
The following theorem summarizes several important properties of orthogonal sets.<br />
Theorem 9.36 (Orthogonal Sets) Let {v1,v2,...,vn} be an orthogonal set of vectors<br />
in R m , and let V be the subspace spanned by these vectors. Then<br />
1. {v1,v2,...,vn} is a basis <strong>for</strong> V .<br />
2. If y ∈ V , then<br />
y = c1v1 + c2v2 + · · · + cnvn where ci = (y·vi)/(vi·vi) for each i.<br />
3. If y ∉ V , then<br />
ŷ = c1v1 + c2v2 + · · · + cnvn where ci = (y·vi)/(vi·vi) for each i<br />
is the point closest to y in V . Further, the difference between the vector y and the<br />
vector ŷ is a vector in the orthogonal complement of V : (y − ŷ) ∈ V ⊥ .<br />
The first part of the theorem tells us that orthogonal sets are linearly independent. The<br />
second part gives a convenient way of finding the coordinates of a vector with respect to a<br />
basis of mutually orthogonal vectors (called an orthogonal basis). The third part generalizes<br />
the idea from analytic geometry of “dropping a perpendicular.”<br />
Projections and orthogonal decompositions. If y ∉ V , then ŷ in the theorem above<br />
is called the projection of y on V , and is denoted by<br />
ŷ = proj V (y).<br />
Writing y as the sum of a vector in V and a vector in V ⊥ :<br />
y = ŷ + (y − ŷ)<br />
is called the orthogonal decomposition of y with respect to V . For each y ∈ R m , the<br />
orthogonal decomposition of y with respect to V is unique.<br />
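The coordinate formula ci = (y·vi)/(vi·vi) translates directly into code. A sketch of the projection, assuming numpy is available; the basis and the vector y below are illustrative choices, not from the text:<br />

```python
import numpy as np

def proj(y, basis):
    """Projection of y on Span(basis), where basis is a list of mutually
    orthogonal nonzero vectors: sum of ((y.vi)/(vi.vi)) vi over the basis."""
    return sum((y @ v) / (v @ v) * v for v in basis)

# An orthogonal basis of a plane in R^3 (illustrative choice).
v1 = np.array([1.0, -1.0, 0.0])
v2 = np.array([1.0, 1.0, 0.0])            # v1 . v2 = 0
y = np.array([3.0, 5.0, 7.0])

y_hat = proj(y, [v1, v2])                 # closest point to y in the plane
residual = y - y_hat                      # lies in the orthogonal complement

print(np.allclose(y_hat, [3, 5, 0]), np.allclose(residual, [0, 0, 7]))   # True True
print(np.isclose(residual @ v1, 0.0), np.isclose(residual @ v2, 0.0))    # True True
```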
9.5.3 Gram-Schmidt Orthogonalization<br />
The Gram-Schmidt orthogonalization process allows us to convert any basis of a subspace<br />
of R m to an orthogonal basis <strong>for</strong> the space.<br />
Theorem 9.37 (Gram-Schmidt Process) Let {v1,v2,... ,vn} be a basis <strong>for</strong> the vector<br />
space V ⊆ R m , and let {u1,u2,... ,un} be defined as follows:<br />
u1 = v1<br />
u2 = v2 − proj V1 (v2) where V1 = Span{u1} = Span{v1}.<br />
u3 = v3 − proj V2 (v3) where V2 = Span{u1,u2} = Span{v1,v2}.<br />
.<br />
un = vn − proj Vn−1 (vn) where Vn−1 = Span{u1,... ,un−1} = Span{v1,...,vn−1}.<br />
Then {u1,u2,... ,un} is an orthogonal basis of V .<br />
Example 9.38 (Subspace of R4 ) If V = Span{ (1, −1, 0, 1), (2, 0, −2, 1), (4, 0, 5, 2) }, then<br />
u1 = v1 = (1, −1, 0, 1),<br />
u2 = v2 − (v2·u1)/(u1·u1) u1 = (2, 0, −2, 1) − (3/3)(1, −1, 0, 1) = (1, 1, −2, 0),<br />
and<br />
u3 = v3 − (v3·u1)/(u1·u1) u1 − (v3·u2)/(u2·u2) u2<br />
= (4, 0, 5, 2) − (6/3)(1, −1, 0, 1) − (−6/6)(1, 1, −2, 0) = (3, 3, 3, 0).<br />
{u1,u2,u3} is an orthogonal basis of V ⊆ R 4 .<br />
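Theorem 9.37 is short to implement. A sketch assuming numpy is available, checked against Example 9.38:<br />

```python
import numpy as np

def gram_schmidt(vectors):
    """Convert a basis (list of vectors) to an orthogonal basis by
    subtracting from each vi its projection on the previous ui's."""
    us = []
    for v in vectors:
        u = v - sum((v @ w) / (w @ w) * w for w in us)
        us.append(u)
    return us

# The basis of V in Example 9.38.
v1 = np.array([1.0, -1.0, 0.0, 1.0])
v2 = np.array([2.0, 0.0, -2.0, 1.0])
v3 = np.array([4.0, 0.0, 5.0, 2.0])

u1, u2, u3 = gram_schmidt([v1, v2, v3])
print(np.allclose(u2, [1, 1, -2, 0]), np.allclose(u3, [3, 3, 3, 0]))      # True True
print(np.isclose(u1 @ u2, 0), np.isclose(u1 @ u3, 0), np.isclose(u2 @ u3, 0))
```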
9.5.4 QR-Factorization<br />
Orthogonal vectors make many calculations easier. For example, if {u1,u2,...,un} is an<br />
orthogonal set in R m , and A is the matrix whose columns are the ui’s, then A T A is a<br />
diagonal matrix. This idea motivates a technique known as QR-factorization.<br />
Orthonormal sets. The orthogonal set {q1,q2,... ,qn} is said to be an orthonormal set<br />
of vectors if each qi is a unit vector. (Each vector has been normalized to have length 1.)<br />
Theorem 9.39 (Orthonormal Sets) If {q1,q2,... ,qn} is an orthonormal set of vectors<br />
in R m , Q is the matrix whose columns are the qi’s, and x,y ∈ R m , then<br />
1. Q T Q = In (and, if m = n, QQ T = Im).<br />
2. ‖Qx‖ = ‖x‖.<br />
3. (Qx)·(Qy) = x·y.<br />
4. (Qx)·(Qy) = 0 if and only if x·y = 0.<br />
Orthogonal matrices. The square matrix Q is said to be an orthogonal matrix if its<br />
columns <strong>for</strong>m an orthonormal set. If Q is an orthogonal matrix of order n, then the first<br />
part of the theorem above implies<br />
Q T Q = QQ T = In =⇒ Q −1 = Q T .<br />
That is, the inverse of Q is equal to its transpose.<br />
218
QR-factorization. Let A = [v1 v2 · · · vn] be an m-by-n matrix whose columns are<br />
linearly independent. Use the Gram-Schmidt process to convert the vi’s to ui’s, let<br />
qi = (1/‖ui‖) ui for i = 1,2,... ,n,<br />
and let Q be the m-by-n matrix whose columns are the qi’s.<br />
In the Gram-Schmidt process,<br />
vi ∈ Span{v1,... ,vi} = Span{u1,... ,ui} <strong>for</strong> i = 1,2,... ,n.<br />
By the uniqueness theorem <strong>for</strong> coordinates, Q T A = R must be an upper triangular matrix<br />
of order n. Further, it can be shown that the diagonal elements of R are all positive. Since<br />
QQ T x = x for every x in the column space of Q, and each column of A lies in Col(Q),<br />
Q T A = R =⇒ A = Q(Q T A) = QR.<br />
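In practice the factorization is computed by library routines rather than by hand. A sketch using numpy's built-in QR (numpy assumed available); note that numpy may choose signs so that some diagonal entries of R are negative, whereas the construction above makes them positive:<br />

```python
import numpy as np

# The 4-by-3 matrix with independent columns from Examples 9.38 and 9.40.
A = np.array([[1.0, 2.0, 4.0],
              [-1.0, 0.0, 0.0],
              [0.0, -2.0, 5.0],
              [1.0, 1.0, 2.0]])

Q, R = np.linalg.qr(A)   # "reduced" factorization: Q is 4-by-3, R is 3-by-3

print(np.allclose(Q @ R, A))              # True: A = QR
print(np.allclose(Q.T @ Q, np.eye(3)))    # True: orthonormal columns
print(np.allclose(R, np.triu(R)))         # True: R is upper triangular
```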
Example 9.40 (Subspace of R 4 , continued) For example, the QR-factorization of the<br />
4-by-3 matrix whose columns are the vi’s from the Gram-Schmidt example above is<br />
[ 1 2 4 ; −1 0 0 ; 0 −2 5 ; 1 1 2 ]<br />
= [ 1/√3 1/√6 1/√3 ; −1/√3 1/√6 1/√3 ; 0 −2/√6 1/√3 ; 1/√3 0 0 ] [ √3 √3 2√3 ; 0 √6 −√6 ; 0 0 3√3 ].<br />
9.5.5 Linear Least Squares<br />
One of the most useful applications of the methods from this section is to the linear least<br />
squares problem. The introduction here focuses on the general problem of finding approximate<br />
solutions to inconsistent systems.<br />
Let A be an m-by-n coefficient matrix, and assume that Ax = b is inconsistent (b is not in<br />
the column space of A). To find approximate solutions to the original system, we propose<br />
to do the following:<br />
1. Find the projection of b on Col(A), denoted b̂, and<br />
2. Report solutions to the consistent system Ax = b̂.<br />
There are two key observations:<br />
Observation 1: Since b̂ is as close to b as possible, each approximate solution x̂ makes<br />
‖b − b̂‖ = ‖b − Ax̂‖ as small as possible.<br />
The difference vector b − Ax̂ has i th component bi − (ai,1x̂1 + ai,2x̂2 + · · · + ai,nx̂n),<br />
and the square of the length of the difference vector is<br />
∑_{i=1}^{m} ( bi − (ai,1x̂1 + ai,2x̂2 + · · · + ai,nx̂n) )² .<br />
Each approximate solution x̂ will minimize the above sum of squared differences. For this<br />
reason the approximate solutions are called least squares solutions.<br />
Observation 2: Since the difference vector<br />
(b − b̂) = (b − Ax̂) ∈ Col(A) ⊥ = Null(A T )<br />
(by the Fundamental Theorem of Linear Algebra), we know that<br />
O = A T (b − Ax̂) = A T b − A T Ax̂ =⇒ A T Ax̂ = A T b.<br />
Thus, least squares solutions can be found by solving the consistent system on the right;<br />
there is no need to explicitly compute the projection of b on the column space of A.<br />
The matrix equation A T Ax = A T b is often called the normal equation of the system.<br />
Example 9.41 (3-by-2 Inconsistent System) Consider the inconsistent system<br />
[ 1 0 ; 1 1 ; 1 2 ] [ x ; y ] = [ 6 ; 0 ; 0 ].<br />
Then<br />
A T Ax̂ = A T b ⇒ [ 3 3 ; 3 5 ] [ x ; y ] = [ 6 ; 0 ] ⇒ x = 5, y = −3.<br />
Thus, x = 5 and y = −3 is the unique least squares solution to the system.<br />
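The normal equations are easy to set up and solve numerically. A sketch of Example 9.41 assuming numpy is available, solving both through A T Ax̂ = A T b and through the library least squares routine:<br />

```python
import numpy as np

# The inconsistent system of Example 9.41.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Solve the normal equations A^T A x = A^T b directly ...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# ... or call the built-in least squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, [5, -3]), np.allclose(x_lstsq, [5, -3]))   # True True
```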
Example 9.42 (4-by-3 Inconsistent System) Consider the inconsistent system<br />
[ 1 3 5 ; 1 1 0 ; 1 1 2 ; 1 3 3 ] [ x ; y ; z ] = [ 3 ; 5 ; 7 ; −3 ].<br />
Then<br />
A T Ax̂ = A T b ⇒ [ 4 8 10 ; 8 20 26 ; 10 26 38 ] [ x ; y ; z ] = [ 12 ; 12 ; 20 ] ⇒ x = 10, y = −6, z = 2.<br />
Thus, x = 10, y = −6 and z = 2 is the unique least squares solution to the system.<br />
Remark. If the columns of the coefficient matrix A are linearly independent, then the<br />
symmetric matrix A T A is invertible and<br />
x̂ = (A T A) −1 A T b<br />
is the unique least squares solution. (The proof uses the QR-factorization of A.)<br />
9.6 Eigenvalues and Eigenvectors<br />
Let A be a square matrix of order n. The constant λ is said to be an eigenvalue of A if<br />
Ax = λx <strong>for</strong> some nonzero vector x ∈ R n .<br />
Each nonzero x satisfying this equation is said to be an eigenvector with eigenvalue λ.<br />
If x is an eigenvector of A with eigenvalue λ, then so is each scalar multiple of x, since<br />
A(cx) = c(Ax) = c(λx) = λ(cx) <strong>for</strong> each nonzero constant c.<br />
If we think of the collection of scalar multiples of x as defining a line in n-space, then A<br />
takes each point on the line to another point on the line.<br />
9.6.1 Characteristic Equation<br />
Since x = Inx, where In is the identity matrix,<br />
Ax = λx ⇐⇒ Ax − λx = O ⇐⇒ (A − λIn)x = O.<br />
That is, λ is an eigenvalue of A if and only if the homogeneous system of equations on the<br />
right has nontrivial solutions. The system has nontrivial solutions if and only if (A − λIn)<br />
is a singular matrix. The matrix is singular if and only if its determinant is zero.<br />
Thus, the eigenvalues of A can be found by solving<br />
det(A − λIn) = 0 <strong>for</strong> λ.<br />
This equation is known as the characteristic equation <strong>for</strong> the matrix A. Further, if λ is an<br />
eigenvalue of A, then its eigenvectors are the nonzero vectors in the null space of (A −λIn).<br />
Example 9.43 (2-by-2 Matrix) Let A = [ −8 −10 ; 5 7 ]. Then<br />
det(A − λI2) = (−8 − λ)(7 − λ) + 50 = λ² + λ − 6 = (λ + 3)(λ − 2),<br />
and the determinant equals 0 when λ = 2 or λ = −3.<br />
Using the techniques of Section 9.1.3,<br />
Null(A − 2I2) = Span{ (1, −1) } and Null(A + 3I2) = Span{ (−2, 1) }.<br />
Finally, note that the set {v1,v2} = { (1, −1), (−2, 1) } is a basis for R2 .<br />
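The eigenvalues and eigenvectors of Example 9.43 can be recovered numerically. A sketch assuming numpy is available:<br />

```python
import numpy as np

# The 2-by-2 matrix of Example 9.43.
A = np.array([[-8.0, -10.0],
              [5.0, 7.0]])

lams, vecs = np.linalg.eig(A)   # columns of vecs are unit eigenvectors

print(np.allclose(sorted(lams.real), [-3.0, 2.0]))                        # True
print(all(np.allclose(A @ v, lam * v) for lam, v in zip(lams, vecs.T)))   # True
```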
Properties. The following are important properties of eigenvalues and eigenvectors:<br />
1. 0 is an eigenvalue of A if and only if A is singular (does not have an inverse).<br />
2. If A is a triangular matrix, then the eigenvalues of A are its diagonal elements.<br />
3. If λ is an eigenvalue of A with eigenvector x and A k = AA · · · A is the k th power of<br />
the matrix A, then λ k is an eigenvalue of A k with eigenvector x.<br />
4. Suppose that A has k distinct eigenvalues λ1, λ2, ..., λk. If you pool the bases of<br />
the null spaces Null(A − λiIn), <strong>for</strong> i = 1,2,... ,k, then the resulting set is linearly<br />
independent. If the resulting set has n elements, the set is a basis <strong>for</strong> R n .<br />
Remarks. If A is a square matrix of order n, then the characteristic equation is a polynomial<br />
equation of degree n. Polynomials of degree n cannot always be factored into n terms<br />
of the <strong>for</strong>m “(λ−c)” where the c’s are real numbers, but can be factored into n linear terms<br />
if we allow some c’s to be complex numbers.<br />
9.6.2 Diagonalization<br />
Suppose that A has n linearly independent eigenvectors,<br />
vi <strong>for</strong> i = 1,2,... ,n,<br />
with corresponding eigenvalues λi <strong>for</strong> i = 1,2,... ,n.<br />
Let P be the matrix whose columns are the vi’s, and let Λ be the diagonal matrix whose<br />
diagonal elements are the λi’s,<br />
Then, since<br />
P = [v1 v2 · · · vn], Λ = Diag(λ1,λ2,...,λn).<br />
AP = [ Av1 Av2 · · · Avn ] = [ λ1v1 λ2v2 · · · λnvn ] = PΛ<br />
and P is invertible, P −1 AP = Λ and A = PΛP −1 .<br />
The matrix equation A = PΛP −1 , where Λ is a diagonal matrix, is called a diagonalization<br />
of the matrix A. Diagonalizations do not always exist.<br />
Theorem 9.44 (Diagonalization Theorem) The square matrix A of order n is diagonalizable<br />
if and only if A has n linearly independent eigenvectors.<br />
Example 9.45 (Diagonalizable 3-by-3 Matrix) Let A = [ 3 0 0 ; 0 3 0 ; −1 1 2 ].<br />
Since A is a triangular matrix, its eigenvalues are 3, 3, 2. (Since 3 corresponds to two roots<br />
of the characteristic equation, it appears twice in this list.)<br />
Using the techniques of Section 9.1.3,<br />
Null(A − 3I3) = Span{ (1, 1, 0), (−1, 0, 1) } and Null(A − 2I3) = Span{ (0, 0, 1) }.<br />
Since there are three linearly independent eigenvectors, A is diagonalizable. Further, we<br />
can factor A as follows:<br />
A = PΛP −1 = [ 1 −1 0 ; 1 0 0 ; 0 1 1 ] [ 3 0 0 ; 0 3 0 ; 0 0 2 ] [ 0 1 0 ; −1 1 0 ; 1 −1 1 ].<br />
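The factorization of Example 9.45 can be verified numerically. A sketch assuming numpy is available:<br />

```python
import numpy as np

# A, P and Lambda from Example 9.45.
A = np.array([[3.0, 0.0, 0.0],
              [0.0, 3.0, 0.0],
              [-1.0, 1.0, 2.0]])
P = np.array([[1.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
Lam = np.diag([3.0, 3.0, 2.0])

print(np.allclose(A @ P, P @ Lam))                  # True: AP = P Lambda
print(np.allclose(P @ Lam @ np.linalg.inv(P), A))   # True: A = P Lambda P^{-1}
```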
Example 9.46 (Nondiagonalizable 3-by-3 Matrix) Let A = [ 3 0 0 ; 2 3 0 ; 2 1 2 ].<br />
As above, since A is a triangular matrix, its eigenvalues are 3, 3, 2.<br />
Using the techniques of Section 9.1.3,<br />
Null(A − 3I3) = Span{ (0, 1, 1) } and Null(A − 2I3) = Span{ (0, 0, 1) }.<br />
Since there are only two linearly independent eigenvectors, A is not diagonalizable.<br />
9.6.3 Symmetric Matrices<br />
Recall that the square matrix A is symmetric if aij = aji <strong>for</strong> all i and j. Written succinctly,<br />
the square matrix A is symmetric if A = A T .<br />
Symmetric matrices are always diagonalizable. Further, the eigenvectors can be chosen to<br />
be orthonormal. These facts are summarized in the following theorem.<br />
Theorem 9.47 (Spectral Theorem) Let A be a symmetric matrix of order n. Then<br />
there exists a diagonal matrix Λ and an orthogonal matrix Q satisfying<br />
A = QΛQ −1 = QΛQ T<br />
(since the inverse of Q is its transpose).<br />
The spectral theorem says that if A is symmetric, then (1) the diagonalization process<br />
outlined above always results in finding n linearly independent eigenvectors, and (2) eigenvectors<br />
corresponding to different eigenvalues are always orthogonal.<br />
To construct an orthonormal eigenvector basis, you just need to convert the basis of each<br />
null space, Null(A − λiIn) <strong>for</strong> each i, to an orthonormal basis using the normalized Gram-<br />
Schmidt process.<br />
Spectral decomposition. Let Q = [q1 q2 · · · qn ] and let λi be the eigenvalue<br />
associated with qi <strong>for</strong> each i. Then<br />
A = QΛQ T = λ1q1q1 T + λ2q2q2 T + · · · + λnqnqn T<br />
is called the spectral decomposition of A. The spectral decomposition breaks up A into pieces<br />
determined by its “spectrum” of eigenvalues. Each term in the spectral decomposition of<br />
A is a rank 1 matrix.<br />
9.6.4 Positive Definite Matrices<br />
Let A be a symmetric matrix of order n and let Quad(x) be the following real-valued<br />
n-variable function:<br />
Quad(x) = x T Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} aij xi xj .<br />
(Quad(x) is a quadratic form in x.) Then, A is said to be positive definite if<br />
Quad(x) > 0 when x ≠ O.<br />
The following theorem relates the method <strong>for</strong> determining when a matrix is positive definite<br />
given in Section 7.4.2 (on the second derivative test) to the eigenvalues of A.<br />
Theorem 9.48 (Positive Definite Matrices) Let A be a symmetric matrix. If A has<br />
one of the following three properties, it has them all:<br />
1. All the eigenvalues of A are positive.<br />
2. All the leading principal minors of A are positive.<br />
3. Quad(x) = x T Ax > 0 except when x = O.<br />
To demonstrate that the first statement implies the third, <strong>for</strong> example, write A = QΛQ T ,<br />
as given in the spectral theorem above. Then<br />
x T Ax = x T (QΛQ T )x = y T Λy = ∑_{i=1}^{n} λi yi² where y = Q T x.<br />
Since Q is an invertible matrix (with inverse Q T ), y = Q T x ≠ O when x ≠ O. Further,<br />
since each λi is positive, the sum on the right is positive when y ≠ O.<br />
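Property 1 gives the usual numerical test for positive definiteness. A sketch assuming numpy is available; the two matrices are illustrative, not from the text:<br />

```python
import numpy as np

def is_positive_definite(A, tol=1e-12):
    """Test property 1: all eigenvalues of the symmetric matrix A positive.
    eigvalsh computes the eigenvalues of a symmetric matrix."""
    return bool(np.all(np.linalg.eigvalsh(A) > tol))

A1 = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3
A2 = np.array([[1.0, 2.0], [2.0, 1.0]])     # eigenvalues -1 and 3

print(is_positive_definite(A1), is_positive_definite(A2))   # True False
```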
9.6.5 Singular Value Decomposition<br />
The singular value decomposition is a useful factorization method <strong>for</strong> rectangular matrices,<br />
generalizing the idea of diagonalization.<br />
Singular values. Let A be an m-by-n matrix. By the spectral theorem, the symmetric<br />
matrix A T A can be orthogonally diagonalized. Let<br />
A T A = V ΛV T where V = [v1 v2 · · · vn ], Λ = Diag(λ1,λ2,... ,λn),<br />
the columns of V form an orthonormal set, and the λi’s (and vi’s) have been ordered so that<br />
λ1 ≥ λ2 ≥ · · · ≥ λn.<br />
The λi’s are nonnegative since they correspond to squared lengths:<br />
‖Avi‖² = (Avi) T (Avi) = vi T A T Avi = vi T λivi = λi ≥ 0 for each i.<br />
The singular values of A are the square roots of the eigenvalues of A T A,<br />
σi = √ λi <strong>for</strong> i = 1,2,... ,n<br />
written in descending order. (The singular values are the lengths of the vectors Avi.)<br />
Theorem 9.49 (Singular Values) Given the setup above, if the rank of A is r, then<br />
1. σi > 0 when i ≤ r and σi = 0 when i > r.<br />
2. The set {Av1,Av2,... ,Avr} is an orthogonal basis of Col(A).<br />
Decomposition. To complete the singular value decomposition, let<br />
ui = (1/‖Avi‖) Avi = (1/σi) Avi for i = 1,2,... ,r.<br />
Then {u1,u2,...,ur} is an orthonormal basis of Col(A). This basis can be completed to<br />
an orthonormal basis of R m , say {u1,u2,...,um}.<br />
Let U be the orthogonal matrix of order m whose columns are the ui’s,<br />
U = [u1 u2 · · · um],<br />
let D be the diagonal matrix of order r whose diagonal elements are the nonzero σi’s,<br />
D = Diag(σ1,σ2,... ,σr),<br />
and let Σ be the m-by-n partitioned matrix<br />
Σ = [ D O ; O O ]<br />
where each O is a zero matrix.<br />
Then, since<br />
UΣ = [σ1u1 · · · σrur 0 · · · 0] = [Av1 · · · Avr 0 · · · 0] = AV<br />
and V is invertible (with inverse V T ), UΣV T = AV V T = A.<br />
The matrix equation<br />
A = UΣV T ,<br />
where U and V are orthogonal matrices and Σ has the <strong>for</strong>m shown above, is called a singular<br />
value decomposition of A. Singular value decompositions always exist.<br />
Example 9.50 (2-by-2 Matrix of Rank 1) Let A = [ 2 2 ; 1 1 ].<br />
Then A T A = [ 5 5 ; 5 5 ] = V ΛV T where<br />
V = [ v1 v2 ] = [ 1/√2 −1/√2 ; 1/√2 1/√2 ] and Λ = [ λ1 0 ; 0 λ2 ] = [ 10 0 ; 0 0 ].<br />
The column space of A is spanned by Av1, the first singular value is σ1 = √10, and<br />
u1 = (1/σ1) Av1 = (2/√5, 1/√5).<br />
A unit vector orthogonal to u1 is u2 = (−1/√5, 2/√5).<br />
Thus, a singular value decomposition of A is<br />
A = UΣV T = [ 2/√5 −1/√5 ; 1/√5 2/√5 ] [ √10 0 ; 0 0 ] [ 1/√2 1/√2 ; −1/√2 1/√2 ].<br />
Example 9.51 (3-by-2 Matrix of Rank 2) Let A = [ −1 13 ; 4 8 ; 10 −10 ].<br />
Then A T A = [ 117 −81 ; −81 333 ] = V ΛV T where<br />
V = [ v1 v2 ] = [ −1/√10 3/√10 ; 3/√10 1/√10 ] and Λ = [ λ1 0 ; 0 λ2 ] = [ 360 0 ; 0 90 ].<br />
The column space of A is spanned by {Av1,Av2}, σ1 = 6√10, σ2 = 3√10,<br />
u1 = (1/σ1) Av1 = (2/3, 1/3, −2/3) and u2 = (1/σ2) Av2 = (1/3, 2/3, 2/3).<br />
A unit vector orthogonal to these vectors is u3 = (2/3, −2/3, 1/3).<br />
Thus, a singular value decomposition of A is<br />
A = UΣV T = [ 2/3 1/3 2/3 ; 1/3 2/3 −2/3 ; −2/3 2/3 1/3 ] [ 6√10 0 ; 0 3√10 ; 0 0 ] [ −1/√10 3/√10 ; 3/√10 1/√10 ].<br />
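Example 9.51 can be reproduced with numpy's built-in SVD. A sketch (numpy assumed available); numpy returns the singular values in descending order, though it may choose different signs for the columns of U and V than the hand computation above:<br />

```python
import numpy as np

# The 3-by-2 matrix of Example 9.51.
A = np.array([[-1.0, 13.0],
              [4.0, 8.0],
              [10.0, -10.0]])

U, s, Vt = np.linalg.svd(A)        # U is 3-by-3, s holds the singular values

print(np.allclose(s, [6 * np.sqrt(10), 3 * np.sqrt(10)]))   # True

# Rebuild A = U Sigma V^T, where Sigma is 3-by-2.
Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))                       # True
```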
Bases <strong>for</strong> the four important subspaces. Let A be an m-by-n matrix of rank r, and<br />
let A = UΣV T be a singular value decomposition. Let U = [Ur Um−r ] be a partition of<br />
U into matrices with r and (m − r) columns. Similarly, let V = [Vr Vn−r ] be a partition<br />
of V into matrices with r and (n − r) columns. Then<br />
1. The columns of Ur <strong>for</strong>m an orthonormal basis <strong>for</strong> Col(A) ⊆ R m .<br />
2. The columns of Um−r <strong>for</strong>m an orthonormal basis <strong>for</strong> Null(A T ) ⊆ R m .<br />
3. The columns of Vr <strong>for</strong>m an orthonormal basis <strong>for</strong> Col(A T ) ⊆ R n .<br />
4. The columns of Vn−r <strong>for</strong>m an orthonormal basis <strong>for</strong> Null(A) ⊆ R n .<br />
For this reason, the singular value decomposition is often thought of as the third part of<br />
the Fundamental Theorem of Linear Algebra.<br />
Note that we can now write:<br />
A = UΣV T = [ Ur Um−r ] [ D O ; O O ] [ Vr T ; Vn−r T ] = Ur D Vr T ,<br />
where D = Diag(σ1,σ2,... ,σr) is the diagonal matrix of order r whose diagonal elements<br />
are the nonzero singular values, written in descending order.<br />
Moore-Penrose pseudoinverse. Working from the decomposition above, the matrix<br />
A + = VrD −1 U T r<br />
is called the Moore-Penrose pseudoinverse of the matrix A. A + is an n-by-m matrix.<br />
Using properties of orthogonal matrices: AA + A = A and A + AA + = A + .<br />
The Moore-Penrose pseudoinverse can be applied to least squares problems (Section 9.5.5).<br />
If Ax = b is inconsistent and A T A does not have an inverse, then there are infinitely many<br />
least squares solutions. By using A + as if it were a true inverse:<br />
Ax = b =⇒ x + = A + b<br />
we obtain the least squares solution of minimum length.<br />
Example 9.52 (Solution of Minimum Length) Consider the inconsistent system<br />
[ 1 −2 ; 2 −4 ; −1 2 ] [ x ; y ] = [ 3 ; −4 ; 15 ].<br />
Then<br />
A T Ax̂ = A T b ⇒ [ 6 −12 ; −12 24 ] [ x ; y ] = [ −20 ; 40 ] ⇒ x = 2y − 10/3, y is free.<br />
Thus, there are an infinite number of least squares solutions of the <strong>for</strong>m<br />
(2y − 10/3,y) where y is free.<br />
Since A = u1σ1v1 T with u1 = (−1/√6, −2/√6, 1/√6), σ1 = √30, and v1 = (−1/√5, 2/√5),<br />
the pseudoinverse is<br />
A + = v1 (1/σ1) u1 T = [ −1/√5 ; 2/√5 ] (1/√30) [ −1/√6 −2/√6 1/√6 ]<br />
and x + = A + b = (−2/3, 4/3).<br />
Note that the square of the length of each solution is f(y) = (2y − 10/3) 2 + y 2 , and that<br />
the function f(y) is minimized when y = 4/3.<br />
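The computation of Example 9.52 matches numpy's built-in pseudoinverse. A sketch assuming numpy is available:<br />

```python
import numpy as np

# The rank-1 inconsistent system of Example 9.52.
A = np.array([[1.0, -2.0],
              [2.0, -4.0],
              [-1.0, 2.0]])
b = np.array([3.0, -4.0, 15.0])

x_plus = np.linalg.pinv(A) @ b   # Moore-Penrose pseudoinverse applied to b

print(np.allclose(x_plus, [-2 / 3, 4 / 3]))   # True: the minimum-length solution
```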
Numerical linear algebra. In most practical applications of linear algebra, solutions<br />
are subject to roundoff errors and care must be taken to minimize these errors. Often, a<br />
singular value decomposition of the coefficient matrix is used in the computations, and the<br />
range of nonzero singular values is used to analyze errors.<br />
If the coefficient matrix is invertible, then the ratio of the largest to the smallest singular
value, c = σ1/σn, is called the condition number of the matrix. The larger the condition
number, the more error-prone the numerical algorithm. A rule of thumb, which has been
experimentally verified in a large number of trial cases, is that the computer can lose about
log10(c) decimal places of accuracy while solving linear equations numerically.
In linear regression analysis, for example, if two columns of the design matrix X are nearly
linearly related (nearly collinear), then the condition number of X^T X will be quite high,
affecting estimates of coefficients and variances.
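As an illustration (NumPy assumed; the design matrix below is invented for the example), a nearly collinear design produces a large condition number:

```python
import numpy as np

# Hypothetical design whose second column is nearly a multiple of the first.
X = np.array([[1.0, 1.00],
              [1.0, 1.01],
              [1.0, 1.02]])

# Condition number of X^T X: ratio of its largest to smallest singular value.
c = np.linalg.cond(X.T @ X)

# Rule of thumb: expect to lose about log10(c) decimal digits of accuracy.
print(c > 1e4, round(float(np.log10(c)), 1))
```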
9.7 Linear Transformations
Let V and W be vector spaces and let T : V → W be a function assigning to each v ∈ V
a vector T(v) ∈ W. Then T is said to be a linear transformation if

1. T(v1 + v2) = T(v1) + T(v2) for all v1, v2 ∈ V and
2. T(cv) = cT(v) for all v ∈ V and c ∈ R.

(That is, if T “respects” vector addition and scalar multiplication.)
Example 9.53 (Transformations on R^2) For example, let V = W = R^2,

T1(x, y) = (2x + 3y, −y) and T2(x, y) = (x − y, xy).

Then T1 satisfies the criteria for linear transformations, but T2 does not. In particular,

T2(3, 3) = (0, 9) ≠ (0, 3) = 3 T2(1, 1).
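The failure of linearity for T2 can also be checked numerically (a sketch assuming NumPy; the function names follow the example):

```python
import numpy as np

def T1(v):
    x, y = v
    return np.array([2 * x + 3 * y, -y])  # linear in x and y

def T2(v):
    x, y = v
    return np.array([x - y, x * y])       # the product xy is not linear

v = np.array([1.0, 1.0])
print(np.allclose(T1(3 * v), 3 * T1(v)))  # True: T1 respects scaling
print(np.allclose(T2(3 * v), 3 * T2(v)))  # False: T2(3,3) = (0,9) but 3*T2(1,1) = (0,3)
```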
Figure 9.1: The effects of three different linear transformations on the unit square (left).
Linear combinations. If T is a linear transformation, then T(O) = O and

T(c1 v1 + c2 v2 + · · · + cn vn) = c1 T(v1) + c2 T(v2) + · · · + cn T(vn)

for each linear combination of vectors in V.
9.7.1 Range and Kernel
If T is a linear transformation, then the range of T,

Range(T) = {w : w = T(v) for some v ∈ V} ⊆ W,

is a subspace of W. Similarly, the kernel of T,

Kernel(T) = {v : T(v) = O} ⊆ V,

is a subspace of V.
Isomorphism. T is said to be one-to-one if its kernel is trivial, Kernel(T) = {O}. T is
said to be onto if its range is the entire space, Range(T) = W.

If T is both one-to-one and onto, then T is an isomorphism between the vector spaces.
9.7.2 Linear Transformations and Matrices

If V ⊆ R^n and W ⊆ R^m, then the linear transformation T corresponds to matrix
multiplication. That is,

T(v) = Av for some m-by-n matrix A.

Further, Range(T) = Col(A) and Kernel(T) = Null(A).
Example 9.54 (Planar Transformations) When V = W = R^2, linear transformations
are often viewed geometrically. For example, Figure 9.1 shows the image of the unit square
(the set [0,1] × [0,1]) under linear transformations whose matrices are

[1 1.5; 0 1], [−1.5 0; 0 2], and [−2 1; −1 2.5], respectively.
Figure 9.2: A transformation of R^2 (left) whose range (right) is isomorphic to R^2.
A linear transformation whose matrix is invertible will map parallelograms to parallelograms.
Since the second matrix is a diagonal matrix, the image of the unit square is a
rectangle whose sides have lengths equal to the absolute values of the diagonal elements.

In each case, the area of the image of the unit square is the absolute value of the
determinant of the matrix. The resulting areas in this case are 1, 3, and 4, respectively.
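The area statement can be verified from the determinants (NumPy assumed; the matrices restate the three transformations of Figure 9.1):

```python
import numpy as np

# Matrices of the three transformations applied to the unit square.
mats = [np.array([[1.0, 1.5], [0.0, 1.0]]),
        np.array([[-1.5, 0.0], [0.0, 2.0]]),
        np.array([[-2.0, 1.0], [-1.0, 2.5]])]

# Area of the image of the unit square = |det A|.
areas = [abs(np.linalg.det(M)) for M in mats]
print(areas)  # approximately [1, 3, 4]
```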
9.7.3 Linearly Independent Columns
Let T : R^n → R^m be a linear transformation with matrix A. If the columns of A are
linearly independent, then

Range(T) = Col(A) ⊆ R^m is isomorphic to R^n.
Example 9.55 (Dimension 2) Let T : R^2 → R^3 be defined as follows:

T(x) = A (x1, y1)^T = [1 0; 0 −1; 1 1] (x1, y1)^T = (x1, −y1, x1 + y1)^T.
Since the columns of A are linearly independent, the range of T is the 2-dimensional
subspace of R^3 shown in the right part of Figure 9.2. (Plane with equation z2 = x2 − y2.)

The figure also shows how a square grid on the x1y1-plane is transformed by T to a
parallelogram grid on the surface in x2y2z2-space with equation z2 = x2 − y2.
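A quick numerical check of Example 9.55 (a sketch assuming NumPy): the matrix has rank 2, and image points satisfy the plane equation.

```python
import numpy as np

# Matrix of T from Example 9.55.
A = np.array([[1.0, 0.0],
              [0.0, -1.0],
              [1.0, 1.0]])

# Independent columns: rank 2, so Range(T) is isomorphic to R^2.
print(np.linalg.matrix_rank(A))  # 2

# Any image point (x2, y2, z2) = T(x1, y1) lies on the plane z2 = x2 - y2.
x2, y2, z2 = A @ np.array([0.7, -1.3])
print(np.isclose(z2, x2 - y2))  # True
```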
9.7.4 Composition of Functions and Matrix Multiplication

Let U ⊆ R^p, V ⊆ R^n, and W ⊆ R^m be subspaces, let

TB : U → V and TA : V → W

be linear transformations with matrices B and A, respectively, and let

T : U → W be the composition T(u) = TA(TB(u)).
Figure 9.3: A linear transformation of the [−2,2] × [−2,2] square viewed through the
singular value decomposition of the matrix of the transformation.
Then the matrix of the linear transformation T is the matrix product AB.
Example 9.56 (SVD) The planar transformation whose matrix is A = [2 −1; 2 2]
maps the x1y1-plane to the x4y4-plane, shown in the first and last parts of Figure 9.3.
Consider the following singular value decomposition of A,

A = UΣV^T = [1/√5 −2/√5; 2/√5 1/√5] [3 0; 0 2] [2/√5 1/√5; −1/√5 2/√5],

and its effect on the [−2,2] × [−2,2] square.
1. The x1y1-plane is mapped to the x2y2-plane by the transformation whose matrix is
V^T. Since V is orthogonal, the image of the square is a tilted square of the same size.

2. The x2y2-plane is mapped to the x3y3-plane by the transformation whose matrix is
Σ. Since Σ is diagonal, the tilted square is stretched in the horizontal and vertical
directions by 3 and 2, respectively, yielding a parallelogram.

3. The x3y3-plane is mapped to the x4y4-plane by the transformation whose matrix is U.
Since U is orthogonal, the image of the parallelogram is a congruent parallelogram.
The figure also shows that the unit circle is transformed to an ellipse by the action of the
diagonal matrix in the singular value decomposition.
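The decomposition in Example 9.56 can be recomputed numerically (NumPy assumed; the signs of the computed singular vectors may differ from those displayed above):

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [2.0, 2.0]])

# Singular value decomposition A = U Sigma V^T.
U, s, Vt = np.linalg.svd(A)
print(s)  # singular values [3. 2.]

# The product recovers A regardless of sign conventions.
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True
```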
9.7.5 Diagonalization and Change of Bases
Suppose that A is a diagonalizable matrix of order n (Section 9.6.2), with diagonalization

A = P Λ P^-1, where P = [v1 v2 · · · vn] and Λ = Diag(λ1, λ2, ..., λn).

The columns of P form an eigenvector basis of R^n. Given x ∈ R^n, the coordinates of x
with respect to this basis, say c, are found by solving the matrix equation:

Pc = x =⇒ c = P^-1 x.
Thus, the linear transformation whose matrix is P^-1 can be thought of as a transformation
from the standard basis of R^n to the eigenvector basis {v1, v2, ..., vn}. Similarly, the
transformation whose matrix is P can be thought of as a change of basis from the
eigenvector basis back to the standard basis.

Now, consider the transformation T whose matrix is A. Since

T(x) = Ax = (P Λ P^-1) x = P (Λ (P^-1 x)) = P (Λ c),

the action of T is (1) to change from the standard basis to the eigenvector basis, (2) to
operate in the eigenvector-coordinate system as a diagonal matrix, and (3) to interpret the
results back in standard-basis form.
Powers of A. The kth power of A can be written as follows:

A^k = (P Λ P^-1)^k = (P Λ P^-1)(P Λ P^-1) · · · (P Λ P^-1) = P Λ^k P^-1.

Thus, the action of the linear transformation whose matrix is A^k is (1) to change to the
eigenvector basis, (2) to operate in the eigenvector-coordinate system by applying Λ k
times, and (3) to interpret the results back in standard-basis form.
This geometric interpretation is especially useful in practice when considering, for example,
transition matrices (from probability theory) or Leslie matrices (from population dynamics).
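For instance, a power of a transition matrix can be computed through its diagonalization (a sketch assuming NumPy; the 2-state matrix below is a hypothetical example, not taken from the notes):

```python
import numpy as np

# Hypothetical 2-state transition matrix (rows sum to one).
A = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Diagonalize: A = P Lambda P^-1, so A^k = P Lambda^k P^-1.
lam, P = np.linalg.eig(A)
k = 10
A_k = P @ np.diag(lam ** k) @ np.linalg.inv(P)

# Agrees with repeated matrix multiplication.
print(np.allclose(A_k, np.linalg.matrix_power(A, k)))  # True
```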
9.7.6 Random Vectors and Linear Transformations

Linear transformations of vectors of random variables are commonly used. Let
X = (X1, X2, ..., Xn)^T and µ = E(X) = (E(X1), E(X2), ..., E(Xn))^T

be n-by-1 vectors of random variables and their means, respectively, and let

Var(X) = E((X − µ)(X − µ)^T)

be the n-by-n matrix of covariances of the random variables. That is, Var(X) is the matrix
whose ij-entry is equal to the covariance between Xi and Xj when i ≠ j, and is equal to
the variance of Xi when i = j.
Linear transformations. If A is an m-by-n matrix, then the linearly transformed random
vector Y = AX has an m-variate distribution with mean vector

E(Y) = E(AX) = A E(X) = Aµ

and with covariance matrix

Var(Y) = E((AX − Aµ)(AX − Aµ)^T) = E(A(X − µ)(X − µ)^T A^T) = A Var(X) A^T.
Figure 9.4: Joint density (left) and contours enclosing 20%, 40% and 60% of the probability
for the standard bivariate normal distribution with correlation ρ = 0.50.
Example 9.57 (2-by-2 Matrix) Let

X = (X1, X2)^T, A = [1 1; 2 −4], and Y = AX = (X1 + X2, 2X1 − 4X2)^T.

If

µ = E(X) = (2, 1)^T and Var(X) = [1 1/2; 1/2 1],

then

E(Y) = Aµ = (3, 0)^T and Var(Y) = A Var(X) A^T = [3 −3; −3 12].

Note that Corr(X1, X2) = 1/2 and Corr(Y1, Y2) = Cov(Y1, Y2)/√(Var(Y1) Var(Y2)) = −1/2.
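The mean and covariance propagation rules of Example 9.57 can be checked directly (NumPy assumed):

```python
import numpy as np

# Quantities from Example 9.57.
A = np.array([[1.0, 1.0],
              [2.0, -4.0]])
mu = np.array([2.0, 1.0])
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# E(Y) = A mu and Var(Y) = A Var(X) A^T.
mu_Y = A @ mu
V_Y = A @ V @ A.T
print(mu_Y)  # [3. 0.]
print(V_Y)   # [[ 3. -3.] [-3. 12.]]

# Correlation of Y1 and Y2.
print(V_Y[0, 1] / np.sqrt(V_Y[0, 0] * V_Y[1, 1]))  # -0.5
```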
9.7.7 Example: Multivariate Normal Distribution

Let X = (X1, X2, ..., Xk)^T be a random k-tuple.

X is said to have a multivariate normal distribution if its joint PDF has the following form:

f(x) = (1 / ((2π)^(k/2) |V|^(1/2))) exp(−(1/2) (x − µ)^T V^-1 (x − µ)) for all x ∈ R^k.
In this formula, µ = E(X) is the k-by-1 vector of means and V = Var(X) is the k-by-k
covariance matrix. (See Section 8.1.9 also.)

Note that to find the probability that a normal random k-tuple takes values in a domain
D ⊆ R^k, you need to compute a k-variate integral.
Figure 9.5: Transformations of probability contours.

Example 9.58 (k = 2) Let X = (X, Y)^T, µ = O, and V = [1 0.5; 0.5 1].
Then the random pair has a standard bivariate normal distribution with parameter ρ = 0.50.
The left part of Figure 9.4 is a graph of the surface z = f(x, y). The right part of the figure
shows contours enclosing 20%, 40% and 60% of the probability.

Each contour in the right plot is an ellipse whose major and minor axes are along the lines
y = x and y = −x (gray lines), respectively. Let D1, D2, and D3 be the regions bounded
by the three ellipses in increasing size.

Orthogonal diagonalization. Since V is symmetric, the spectral theorem (Section 9.6.3)
implies that it can be diagonalized as follows:

V = QΛQ^T,

where Λ is a diagonal matrix of eigenvalues and Q is orthogonal.
As discussed earlier, the linear transformation whose matrix is Q^T can be thought of as a
transformation from the standard basis of R^k to the basis whose elements are the columns
of Q. Further, if Y = Q^T X, then

E(Y) = Q^T µ and Var(Y) = Q^T Var(X) Q = Q^T (Q Λ Q^T) Q = Λ.
Thus, the Yi’s are uncorrelated, with variances equal to the eigenvalues of V . (The Yi’s are,<br />
in fact, independent normal random variables.)<br />
Example 9.59 (k = 2, continued) Continuing with the example above,

V = Q Λ Q^T = [1/√2 −1/√2; 1/√2 1/√2] [1.5 0; 0 0.5] [1/√2 1/√2; −1/√2 1/√2].

The action of the matrix product Λ^(-1/2) Q^T is illustrated in Figure 9.5.

Specifically, if we let x2 = Q^T x1, then points on the ellipses shown in the x1y1-plane (left
plot) are rotated by 45 degrees in the clockwise direction to points on the ellipses shown in
the x2y2-plane (center plot).
Further, if we let x3 = Λ^(-1/2) x2, then points on the rotated ellipses in the x2y2-plane
(center plot) are mapped to circles in the x3y3-plane (right plot).
These transformations (plus a polar transformation) can be used to demonstrate that

P((X,Y) ∈ D1) = 0.20, P((X,Y) ∈ D2) = 0.40, P((X,Y) ∈ D3) = 0.60.
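The decorrelating change of basis can be sketched numerically (NumPy assumed; `eigh` is used because V is symmetric):

```python
import numpy as np

# Covariance matrix of the standard bivariate normal pair with rho = 0.5.
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# Spectral decomposition V = Q Lambda Q^T; eigh returns ascending eigenvalues.
lam, Q = np.linalg.eigh(V)
print(lam)  # [0.5 1.5]

# Y = Q^T X has diagonal covariance Q^T V Q = Lambda.
print(np.allclose(Q.T @ V @ Q, np.diag(lam)))  # True
```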
9.8 Applications in Statistics

Linear algebra is one of the most useful branches of mathematics. In statistics, linear
algebra methods are used to simplify computations related to multivariate probability
distributions and to summarizing random samples. This section focuses on two applications.
9.8.1 Least Squares Estimation

Given n cases, (x1,i, x2,i, ..., xp−1,i, yi) for i = 1, 2, ..., n, whose values lie close to a
linear equation of the form

y = β0 + β1 x1 + β2 x2 + · · · + βp−1 xp−1,

the method of least squares can be used to estimate the unknown parameters.
The n-by-p design matrix X, the p-by-1 vector of unknowns β, and the n-by-1 vector of
right-hand-side values y are defined as follows:

X = [1 x1,1 x2,1 · · · xp−1,1; 1 x1,2 x2,2 · · · xp−1,2; ...; 1 x1,n x2,n · · · xp−1,n],

β = (β0, β1, ..., βp−1)^T, and y = (y1, y2, ..., yn)^T.
We wish to find least squares solutions to the inconsistent system Xβ = y.<br />
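A least squares solution can be obtained from the normal equations or, more stably, from a dedicated routine (a sketch assuming NumPy; the small data set below is invented for illustration):

```python
import numpy as np

# Hypothetical data: intercept plus one predictor (p = 2, n = 4).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])  # design matrix

# Normal equations: X^T X beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred: a least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True
```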
Example 9.60 (Sleep Study, Part 1) (Allison & Cicchetti, Science, 194:732–734, 1976;
lib.stat.cmu.edu/DASL.) As part of a study on sleep in mammals, researchers collected
information on the average hours of nondreaming and dreaming sleep for 43 different species.

The left part of Figure 9.6 compares the common logarithm of nondreaming sleep (vertical
axis) to the common logarithm of dreaming sleep (horizontal axis). A simple linear
prediction formula will be used to write the common logarithm of hours of nondreaming
sleep as a function of the common logarithm of hours of dreaming sleep.
For these data, Xβ = y ⇒ X^T X β = X^T y ⇒

[43 8.75915; 8.75915 5.33664] (β0, β1)^T = (38.3639, 9.33209)^T ⇒ β0 = 0.805178, β1 = 0.427124.

Thus, the simple linear prediction equation is y = 0.427124x + 0.805178.
Figure 9.6: Comparison of nondreaming sleep in log-hours with dreaming sleep in log-hours
(left plot) and with body weight in log-kilograms (right plot) for data from the sleep study.
Least squares linear fits are superimposed.

The simple linear prediction equation is shown in the left part of Figure 9.6.
Example 9.61 (Sleep Study, Part 2) The researchers also collected information on the
average body weight in kilograms for the 43 species, and on other variables. They developed
a danger index using a five-point scale, where a higher value corresponds to a higher risk
of exposure to the elements while sleeping and/or a higher risk of predation while sleeping.
Table 9.2 gives the values of the danger index for each species in the study.

The right part of Figure 9.6 compares the common logarithm of nondreaming sleep (vertical
axis) to the common logarithm of body weight (horizontal axis). A linear prediction
formula will be used to write the common logarithm of hours of nondreaming sleep as a
function of the common logarithm of kilograms of body weight (variable 1) and the danger
index (variable 2).
For these data, Xβ = y ⇒ X^T X β = X^T y ⇒

[43 13.4047 115; 13.4047 78.5992 63.7371; 115 63.7371 389] (β0, β1, β2)^T = (38.3639, 3.22446, 95.4993)^T

⇒ β0 = 1.06671, β1 = −0.097165, β2 = −0.0539304.

Thus, the linear prediction equation is y = −0.097165x − 0.0539304d + 1.06671.
Nondreaming sleep is negatively associated with both body weight and danger index. The
right part of Figure 9.6 shows the simple linear prediction formulas when d = 1, 2, 3, 4, 5.
9.8.2 Principal Components Analysis
Let X = [x1 x2 · · · xn] be a p-by-n data matrix, whose columns are a sample of size
n from a p-variate distribution, and let x̄ be the p-by-1 sample mean,

x̄ = (1/n) (x1 + x2 + · · · + xn),
Table 9.2: Danger indices for 43 species in the sleep study.

Danger = 1          Danger = 2         Danger = 3         Danger = 4        Danger = 5
Big brown bat       European hedgehog  African giant rat  Asian elephant    Cow
Cat                 Galago             Ground squirrel    Baboon            Goat
Chimpanzee          Golden hamster     Mountain beaver    Brazilian tapir   Horse
E. Amer. mole       Owl monkey         Mouse              Chinchilla        Rabbit
Gray seal           Phalanger          Musk shrew         Guinea pig        Sheep
Little brown bat    Rhesus monkey      Rat                Short tail shrew
Man                 Rock hyrax (h.b.)  Rock hyrax (p.h.)  Patas monkey
Mole rat            Tenrec             Tree hyrax         Pig
N. Amer. opossum    Tree shrew         Vervet
9-banded armadillo
Red fox
Water opossum
and let X̃ be the p-by-n centered data matrix whose jth column is the deviation of the jth
observation from the sample mean: x̃j = xj − x̄.

Let S be the sample covariance matrix. S can be computed using the matrix equation

S = (1/(n − 1)) X̃ X̃^T,

where X̃ is the centered data matrix above.
Since S is symmetric, we can use the spectral theorem (Section 9.6.3) to write

S = QΛQ^T,

where Λ is a diagonal matrix of eigenvalues ordered so that λ1 ≥ λ2 ≥ · · · ≥ λp, and where
Q is an orthogonal matrix.
Consider the linear transformation

Y = Q^T X̃ = [q1·x̃1 q1·x̃2 · · · q1·x̃n; q2·x̃1 q2·x̃2 · · · q2·x̃n; ...; qp·x̃1 qp·x̃2 · · · qp·x̃n],
and let SY be the sample covariance matrix of Y. Then

SY = Q^T S Q = Q^T (Q Λ Q^T) Q = Λ,
indicating that the p transformed centered samples are uncorrelated. Further, the first
transformed centered sample (first row of Y) has the largest sample variance, the second
(second row of Y) has the next largest sample variance, and so forth.
Since the ith column of Q was used to transform the ith centered sample, the vector qi is
called the ith principal component of the centered data matrix X̃.
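The whole construction can be sketched as follows (NumPy assumed; the data matrix is simulated, not taken from the study):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated p-by-n data matrix: p = 2 variables, n = 200 observations.
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=200).T

# Center each observation (column) about the sample mean.
Xc = X - X.mean(axis=1, keepdims=True)

# Sample covariance matrix and its spectral decomposition S = Q Lambda Q^T.
S = (Xc @ Xc.T) / (X.shape[1] - 1)
lam, Q = np.linalg.eigh(S)
lam, Q = lam[::-1], Q[:, ::-1]  # reorder so lambda_1 >= lambda_2

# Rows of Y = Q^T Xc are uncorrelated with variances lambda_i.
Y = Q.T @ Xc
S_Y = (Y @ Y.T) / (Y.shape[1] - 1)
print(np.allclose(S_Y, np.diag(lam)))  # True
```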
Figure 9.7: Abdomen and chest measurements (left), and abdomen, chest and hip
measurements (right) for 100 men. Principal component axes are highlighted in each plot.
Example 9.62 (Body Fat Study, Part 1) (Johnson, JSE 4(1), 1996) As part of a study
to determine if simple body measurements could be used to predict body fat, abdomen
and chest measurements in centimeters were made on 100 men. The left part of Figure 9.7
compares chest (variable 1) and abdomen (variable 2) measurements for these men.
For these data,

x̄ = (101.068, 92.183)^T and S = [68.3283 76.3466; 76.3466 102.599],

and S = Q Λ Q^T, where

Λ = [λ1 0; 0 λ2] ≈ [163.709 0; 0 7.218] and Q = [q1 q2] ≈ [0.625 −0.781; 0.781 0.625].
The principal component axes shown in the left part of Figure 9.7 are lines described
parametrically as ℓi(t) = x̄ + t qi, for i = 1, 2. The lines are centered at x̄.
Since λ1 is much greater than λ2, most of the variation in the data is along the first
principal component axis. In fact, since

λ1/(λ1 + λ2) ≈ 0.958 and λ2/(λ1 + λ2) ≈ 0.042,

about 95.8% of the total variation in the data is due to the first component.
Further, if c is the chest measurement and a is the abdomen measurement of a man in the
study, then the formula

y = 0.625(c − 101.068) + 0.781(a − 92.183)

can be used as a body “size index” in later analyses.
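The eigenvalues and the proportion of variation in Example 9.62 can be recomputed from S (NumPy assumed):

```python
import numpy as np

# Sample covariance matrix from Example 9.62.
S = np.array([[68.3283, 76.3466],
              [76.3466, 102.599]])

lam, Q = np.linalg.eigh(S)
lam, Q = lam[::-1], Q[:, ::-1]  # largest eigenvalue first

print(np.round(lam, 3))                     # approximately [163.709, 7.218]
print(round(float(lam[0] / lam.sum()), 3))  # approximately 0.958
```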
Example 9.63 (Body Fat Study, Part 2) The researchers also took hip measurements
in centimeters. The right part of Figure 9.7 compares chest (variable 1), abdomen
(variable 2), and hip (variable 3) measurements for the 100 men.
For these data,

x̄ = (101.068, 92.183, 99.594)^T and
S = [68.3283 76.3466 43.8397; 76.3466 102.599 56.79; 43.8397 56.79 42.1284],

and S = Q Λ Q^T, where

Λ = Diag(λ1, λ2, λ3) ≈ Diag(196.946, 9.474, 6.635)

and

Q = [q1 q2 q3] ≈ [0.565 0.588 0.579; 0.710 0.011 −0.704; 0.420 −0.809 0.412].
The principal component axes shown in the right part of Figure 9.7 are lines described
parametrically as ℓi(t) = x̄ + t qi, for i = 1, 2, 3. The lines are centered at x̄.

Since λ1 is much greater than λ2 and λ3, most of the variation in the data is along the
first principal component axis. In fact, since
λ1/(λ1 + λ2 + λ3) ≈ 0.924, λ2/(λ1 + λ2 + λ3) ≈ 0.044, λ3/(λ1 + λ2 + λ3) ≈ 0.031,
about 92.4% of the total variation in the data is due to the first component.

Further, if c is the chest measurement, a is the abdomen measurement, and h is the hip
measurement of a man in the study, then the formula

y = 0.565(c − 101.068) + 0.710(a − 92.183) + 0.420(h − 99.594)

can be used as a body “size index” in later analyses.
10 Additional Reading

(1-2) An Introduction to Mathematical Statistics, by Larsen and Marx (Prentice Hall), and
Mathematical Statistics and Data Analysis, by Rice (Duxbury Press), are introductions to
probability theory and mathematical statistics with emphasis on applications. The book
by Larsen and Marx is currently used as the textbook for the MT426-427 mathematical
statistics sequence.

(3-4) Calculus, by Hughes-Hallett, Gleason, et al. (Wiley), and Multivariable Calculus, by
McCallum, Hughes-Hallett, Gleason, et al. (Wiley), are gentle introductions to single
variable and multivariable calculus methods. The first book is currently used as the
textbook for the MT100-101 calculus sequence.

(5) Vector Calculus, by Colley (Prentice Hall), is a more sophisticated introduction to
multivariable calculus. Colley emphasizes the use of linear algebra methods. This book is
currently used as the textbook for the MT202 multivariable calculus course.

(6-7) Linear Algebra and Its Applications, by Lay (Addison-Wesley), and Introduction to
Linear Algebra, by Strang (Wellesley-Cambridge), are good introductions to linear algebra
methods. Each book emphasizes both theory and practice. The book by Lay is currently
used as the textbook for the MT210 linear algebra course.

(8) Advanced Calculus with Applications in Statistics, by Khuri (Wiley), is a more advanced
text designed specifically to tie topics in advanced calculus to applications in statistics.

(9-10) Introduction to Matrices with Applications in Statistics, by Graybill (Wadsworth),
and Matrix Algebra Useful for Statistics, by Searle (Wiley), are well-respected texts designed
specifically to tie linear algebra methods to applications in statistics. Note, however, that
each book is quite old, and each uses out-of-date notation.

(11) Statistical Models, by Freedman (Cambridge University Press), is a recent text for a
second course in statistics. From the Preface: “The contents of the book can be fairly
described as what you have to know in order to start reading empirical papers that use
statistical models.”