
MT580 Mathematics for Statistics:
"All the Math You Never Had"

Jenny A. Baglivo
Mathematics Department
Boston College
Chestnut Hill, MA 02467-3806
(617) 552 3772
jenny.baglivo@bc.edu

Third Draft, Summer 2006

MT580 is an introduction to probability, discrete and continuous methods, and linear algebra for graduate students in applied statistics with little or no formal training in these fields. Applications in statistics are emphasized throughout.

The topics covered in MT580 are drawn from several subject areas; hence the need for a specialized set of course notes. I would like to thank GSAS Dean Mick Smyer for supporting the development of these notes and Mathematics Professor Dan Chambers for many useful comments and suggestions on an earlier draft.

© Jenny A. Baglivo 2006. All rights reserved.


Contents

1 Introductory Probability Concepts
   1.1 Set Operations
   1.2 Sample Spaces and Events
   1.3 Probability Axioms and Properties Following From the Axioms
      1.3.1 Kolmogorov Axioms for Finite Sample Spaces
      1.3.2 Equally Likely Outcomes
      1.3.3 Properties Following from the Axioms
   1.4 Counting Methods
   1.5 Permutations and Combinations
      1.5.1 Example: Simple Urn Model
      1.5.2 Binomial Coefficients
   1.6 Partitioning Sets
      1.6.1 Multinomial Coefficients
      1.6.2 Example: Urn Models
   1.7 Applications in Statistics
      1.7.1 Maximum Likelihood Estimation in Simple Urn Models
      1.7.2 Permutation and Nonparametric Methods
   1.8 Conditional Probability
      1.8.1 Multiplication Rules for Probability
      1.8.2 Law of Total Probability
      1.8.3 Bayes Rule
   1.9 Independent Events and Mutually Independent Events

2 Discrete Random Variables
   2.1 Definitions
      2.1.1 PDF and CDF for Discrete Distributions
   2.2 Families of Distributions
      2.2.1 Example: Discrete Uniform Distribution
      2.2.2 Example: Hypergeometric Distribution
      2.2.3 Distributions Related to Bernoulli Experiments
      2.2.4 Simple Random Samples
   2.3 Mathematical Expectation
      2.3.1 Definitions for Finite Discrete Random Variables
      2.3.2 Properties of Expectation
      2.3.3 Mean, Variance and Standard Deviation
      2.3.4 Chebyshev's Inequality

3 Joint Discrete Distributions
   3.1 Bivariate Distributions
      3.1.1 Joint PDF
      3.1.2 Joint CDF
      3.1.3 Marginal Distributions
      3.1.4 Conditional Distributions
      3.1.5 Independent Random Variables
      3.1.6 Mathematical Expectation for Finite Bivariate Distributions
      3.1.7 Covariance and Correlation
      3.1.8 Conditional Expectation and Regression
      3.1.9 Example: Bivariate Hypergeometric Distribution
      3.1.10 Example: Trinomial Distribution
      3.1.11 Simple Random Samples
   3.2 Multivariate Distributions
      3.2.1 Joint PDF and Joint CDF
      3.2.2 Mutually Independent Random Variables and Random Samples
      3.2.3 Mathematical Expectation for Finite Multivariate Distributions
      3.2.4 Example: Multivariate Hypergeometric Distribution
      3.2.5 Example: Multinomial Distribution
      3.2.6 Simple Random Samples
      3.2.7 Sample Summaries

4 Introductory Calculus Concepts
   4.1 Functions
   4.2 Limits and Continuity
      4.2.1 Neighborhoods, Deleted Neighborhoods and Limits
      4.2.2 Continuous Functions
   4.3 Differentiation
      4.3.1 Derivative and Tangent Line
      4.3.2 Approximations and Local Linearity
      4.3.3 First and Second Derivative Functions
      4.3.4 Some Rules for Finding Derivative Functions
      4.3.5 Chain Rule for Composition of Functions
      4.3.6 Rules for Exponential and Logarithmic Functions
      4.3.7 Rules for Trigonometric and Inverse Trigonometric Functions
      4.3.8 Indeterminate Forms and L'Hospital's Rule
   4.4 Optimization
      4.4.1 Local Extrema and Derivatives
      4.4.2 First Derivative Test for Local Extrema
      4.4.3 Second Derivative Test for Local Extrema
      4.4.4 Global Extrema on Closed Intervals
      4.4.5 Newton's Method
   4.5 Applications in Statistics
      4.5.1 Maximum Likelihood Estimation in Binomial Models
      4.5.2 Maximum Likelihood Estimation in Multinomial Models
      4.5.3 Minimum Chi-Square Estimation in Multinomial Models
   4.6 Rolle's Theorem and the Mean Value Theorem
      4.6.1 Generalized Mean Value Theorem and Error Analysis
   4.7 Integration
      4.7.1 Riemann Sums and Definite Integrals
      4.7.2 Properties of Definite Integrals
      4.7.3 Fundamental Theorem of Calculus
      4.7.4 Antiderivatives and Indefinite Integrals
      4.7.5 Some Formulas for Indefinite Integrals
      4.7.6 Approximation Methods
      4.7.7 Improper Integrals
   4.8 Applications in Statistics
      4.8.1 deMoivre-Laplace Limit Theorem
      4.8.2 Computing Quantiles

5 Continuous Random Variables
   5.1 Definitions
      5.1.1 PDF and CDF for Continuous Distributions
      5.1.2 Quantiles and Percentiles
   5.2 Families of Distributions
      5.2.1 Example: Uniform Distribution
      5.2.2 Example: Exponential Distribution
      5.2.3 Euler Gamma Function
      5.2.4 Example: Gamma Distribution
      5.2.5 Example: Normal Distribution
      5.2.6 Transforming Continuous Random Variables
   5.3 Mathematical Expectation
      5.3.1 Definitions for Continuous Random Variables
      5.3.2 Properties of Expectation
      5.3.3 Mean, Variance and Standard Deviation
      5.3.4 Chebyshev's Inequality

6 Infinite Sequences and Series
   6.1 Limits and Convergence of Infinite Sequences
      6.1.1 Convergence Properties
      6.1.2 Geometric Sequences
      6.1.3 Bounded Sequences
      6.1.4 Monotone Sequences
   6.2 Partial Sums and Convergence of Infinite Series
      6.2.1 Convergence Properties
      6.2.2 Geometric Series
   6.3 Convergence Tests for Positive Series
   6.4 Alternating Series and Error Analysis
   6.5 Absolute and Conditional Convergence of Series
      6.5.1 Product of Series
   6.6 Taylor Polynomials and Taylor Series
      6.6.1 Higher Order Derivatives
      6.6.2 Taylor Polynomials
      6.6.3 Taylor Series
      6.6.4 Radius of Convergence
   6.7 Power Series
      6.7.1 Power Series and Taylor Series
      6.7.2 Differentiation and Integration of Power Series
   6.8 Discrete Random Variables, revisited
      6.8.1 Countably Infinite Sample Spaces
      6.8.2 Infinite Discrete Random Variables
      6.8.3 Example: Geometric Distribution
      6.8.4 Example: Poisson Distribution
   6.9 Continuous Random Variables, revisited
      6.9.1 Exponential Distribution
      6.9.2 Gamma Distribution
   6.10 Summary: Poisson Processes

7 Multivariable Calculus Concepts
   7.1 Vector And Matrix Operations
      7.1.1 Representations of Vectors
      7.1.2 Addition and Scalar Multiplication of Vectors
      7.1.3 Length and Distance
      7.1.4 Dot Product of Vectors
      7.1.5 Matrix Notation and Matrix Transpose
      7.1.6 Addition and Scalar Multiplication of Matrices
      7.1.7 Matrix Multiplication
      7.1.8 Determinants of Square Matrices
   7.2 Limits and Continuity
      7.2.1 Neighborhoods, Deleted Neighborhoods and Limits
      7.2.2 Continuous Functions
   7.3 Differentiation in Several Variables
      7.3.1 Partial Derivatives and the Gradient Vector
      7.3.2 Second-Order Partial Derivatives
      7.3.3 Differentiability, Local Linearity and Tangent (Hyper)Planes
      7.3.4 Total Derivative
      7.3.5 Directional Derivatives
   7.4 Optimization
      7.4.1 Linear and Quadratic Approximations
      7.4.2 Second Derivative Test
      7.4.3 Constrained Optimization and Lagrange Multipliers
      7.4.4 Method of Steepest Descent/Ascent
   7.5 Applications in Statistics
      7.5.1 Least Squares Estimation in Quadratic Models
      7.5.2 Maximum Likelihood Estimation in Multinomial Models
   7.6 Multiple Integration
      7.6.1 Riemann Sums and Double Integrals over Rectangles
      7.6.2 Iterated Integrals over Rectangles
      7.6.3 Properties of Double Integrals
      7.6.4 Double and Iterated Integrals over General Domains
      7.6.5 Triple and Iterated Integrals over Boxes
      7.6.6 Triple Integrals over General Domains
      7.6.7 Partial Anti-Differentiation and Partial Differentiation
      7.6.8 Change of Variables and the Jacobian
   7.7 Applications in Statistics
      7.7.1 Computing Probabilities for Continuous Random Pairs
      7.7.2 Contours for Independent Standard Normal Random Variables

8 Joint Continuous Distributions
   8.1 Bivariate Distributions
      8.1.1 Joint PDF and Joint CDF
      8.1.2 Marginal Distributions
      8.1.3 Conditional Distributions
      8.1.4 Independent Random Variables
      8.1.5 Mathematical Expectation for Bivariate Continuous Distributions
      8.1.6 Covariance and Correlation
      8.1.7 Conditional Expectation and Regression
      8.1.8 Example: Bivariate Uniform Distribution
      8.1.9 Example: Bivariate Normal Distribution
      8.1.10 Transforming Continuous Random Variables
   8.2 Multivariate Distributions
      8.2.1 Joint PDF and Joint CDF
      8.2.2 Marginal and Conditional Distributions
      8.2.3 Mutually Independent Random Variables and Random Samples
      8.2.4 Mathematical Expectation For Multivariate Continuous Distributions
      8.2.5 Sample Summaries

9 Linear Algebra Concepts
   9.1 Solving Systems of Equations
      9.1.1 Elementary Row Operations
      9.1.2 Row Reduction and Echelon Forms
      9.1.3 Vector and Matrix Equations
   9.2 Matrix Operations
      9.2.1 Basic Operations
      9.2.2 Inverses of Square Matrices
      9.2.3 Matrix Factorizations
      9.2.4 Partitioned Matrices
   9.3 Determinants of Square Matrices
   9.4 Vector Spaces and Subspaces
      9.4.1 Definitions
      9.4.2 Subspaces of R^m
      9.4.3 Linear Independence, Basis and Dimension
      9.4.4 Rank and Nullity
   9.5 Orthogonality
      9.5.1 Orthogonal Complements
      9.5.2 Orthogonal Sets, Bases and Projections
      9.5.3 Gram-Schmidt Orthogonalization
      9.5.4 QR-Factorization
      9.5.5 Linear Least Squares
   9.6 Eigenvalues and Eigenvectors
      9.6.1 Characteristic Equation
      9.6.2 Diagonalization
      9.6.3 Symmetric Matrices
      9.6.4 Positive Definite Matrices
      9.6.5 Singular Value Decomposition
   9.7 Linear Transformations
      9.7.1 Range and Kernel
      9.7.2 Linear Transformations and Matrices
      9.7.3 Linearly Independent Columns
      9.7.4 Composition of Functions and Matrix Multiplication
      9.7.5 Diagonalization and Change of Bases
      9.7.6 Random Vectors and Linear Transformations
      9.7.7 Example: Multivariate Normal Distribution
   9.8 Applications in Statistics
      9.8.1 Least Squares Estimation
      9.8.2 Principal Components Analysis

10 Additional Reading


1 Introductory Probability Concepts

Probability is the study of random phenomena. Probability theory can be applied, for example, to study games of chance (e.g. roulette games, card games), occurrences of catastrophic events (e.g. tornados, earthquakes), survival of animal species, and changes in stock and commodity markets.

This chapter introduces probability theory.

1.1 Set Operations

Sets and operations on sets are fundamental to any area of mathematics. A set is a well-defined collection of objects called the elements or members of the set. (Well-defined means that there is a definite way of determining whether a given element belongs to the set.) The notation x ∈ A means "x is an element of A," and the notation x ∉ A means "x is not an element of A."

The empty set is the set containing no elements, and the universal set is the set containing all elements under consideration in a given study. The notation ∅ is used to denote the empty set, and the notation Ω is used to denote the universal set.

A is a subset of B (denoted by A ⊆ B) if each element of A is an element of B.

Union and intersection. The union of the sets A and B (denoted by A ∪ B) is the set of elements that belong to either A or B, and the intersection of the sets A and B (denoted by A ∩ B) is the set of elements that belong to both A and B. That is,

    A ∪ B = {x | x ∈ A or x ∈ B}  and  A ∩ B = {x | x ∈ A and x ∈ B}.

Note that A ∪ B = B ∪ A and A ∩ B = B ∩ A.

The operations of union and intersection are associative, that is,

    (A ∪ B) ∪ C = A ∪ (B ∪ C)  and  (A ∩ B) ∩ C = A ∩ (B ∩ C)  for sets A, B, and C.

Thus, the union A ∪ B ∪ C and the intersection A ∩ B ∩ C are well-defined. More generally, the union of the k sets A1, A2, ..., Ak (denoted by ∪_{i=1}^k Ai) consists of all elements belonging to at least one Ai, and the intersection of the k sets A1, A2, ..., Ak (denoted by ∩_{i=1}^k Ai) consists of all elements belonging to all k sets.

The operations of union and intersection satisfy the following distributive laws:

    A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)  and  A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)  for sets A, B, and C.

Complement. The complement of the set A in the universal set (denoted by A^c = Ω − A) is the set of all elements in the universal set that do not belong to A. Note, in particular, that A ∩ A^c = ∅, A ∪ A^c = Ω, and (A^c)^c = A.


Pairwise disjoint sets. The sets A1 and A2 are said to be disjoint if A1 ∩ A2 = ∅. The sets A1, A2, ..., Ak are said to be pairwise disjoint if Ai ∩ Aj = ∅ whenever i ≠ j.

If A and B are sets, then B can be written as the disjoint union of the part of B contained in A with the part of B contained in the complement of A. That is,

    B = (B ∩ A) ∪ (B ∩ A^c)  where  (B ∩ A) ∩ (B ∩ A^c) = ∅.

Partitions. A partition of Ω is a collection A1, A2, ..., Ak of pairwise disjoint sets whose union is Ω. If A1, A2, ..., Ak partitions Ω, then (B ∩ A1), (B ∩ A2), ..., (B ∩ Ak) partitions the set B. That is,

    B = ∪_{i=1}^k (B ∩ Ai)  where  (B ∩ Ai) ∩ (B ∩ Aj) = ∅ when i ≠ j.

De Morgan's laws. De Morgan's laws relate complements to unions and intersections:

    (A ∪ B)^c = A^c ∩ B^c  and  (A ∩ B)^c = A^c ∪ B^c  for sets A and B.

More generally,

    (∪_{i=1}^k Ai)^c = ∩_{i=1}^k Ai^c  and  (∩_{i=1}^k Ai)^c = ∪_{i=1}^k Ai^c.

1.2 Sample Spaces and Events

The term experiment (or random experiment) is used in probability theory to describe a procedure whose outcome is not known in advance with certainty. Further, experiments are assumed to be repeatable (at least in theory) and to have a well-defined set of possible outcomes.

The sample space S is the set of all possible outcomes of an experiment. An event is a subset of the sample space. A simple event is an event with a single outcome. Events are usually denoted by capital letters (A, B, C, ...) and outcomes by lower case letters (x, y, z, ...). If x ∈ A is observed, then A is said to have occurred. The favorable outcomes of an experiment form the event of interest.

Each repetition of an experiment is called a trial. Repeated trials are repetitions of the experiment using the specified procedure, with the outcomes of the trials having no influence on one another.

Example 1.1 (Coin Tossing) Suppose you toss a fair coin 5 times and record h (for head) or t (for tail) each time. The sample space for this experiment is the collection of 32 = 2^5 sequences of 5 h's or t's:

    S = {hhhhh, hhhht, hhhth, hhthh, hthhh, thhhh, hhhtt, hhtht, hthht, thhht,
         hhtth, hthth, thhth, htthh, ththh, tthhh, ttthh, tthth, thtth, httth, tthht,
         ththt, httht, thhtt, hthtt, hhttt, htttt, thttt, tthtt, tttht, tttth, ttttt}.

If you are interested in getting exactly 5 heads, then the event of interest is the simple event A = {hhhhh}. If you are interested in getting exactly 3 heads, then the event of interest is

    A = {hhhtt, hhtht, hthht, thhht, hhtth, hthth, thhth, htthh, ththh, tthhh}.

Example 1.2 (Birthdays) Consider the experiment "Ask ten people their birthdays and record the list of dates." The sample space, S, is the collection of all sequences of 10 birthdays. If you are interested in getting at least one match, then the event of interest, A, is the collection of all sequences of ten birthdays with at least one match.

1.3 Probability Axioms and Properties Following From the Axioms

The basic rules (or axioms) of probability were introduced by A. Kolmogorov in the 1930's. Let A ⊆ S be an event, and P(A) be the probability that A will occur.

1.3.1 Kolmogorov Axioms for Finite Sample Spaces

A probability distribution, or simply a probability, on a finite sample space S is a specification of numbers P(A) satisfying the following axioms:

1. P(S) = 1.

2. If A is an event, then 0 ≤ P(A) ≤ 1.

3. If A1 and A2 are disjoint, then P(A1 ∪ A2) = P(A1) + P(A2).

3′. More generally, if the events A1, A2, ..., Ak are pairwise disjoint, then

       P(A1 ∪ A2 ∪ ··· ∪ Ak) = P(A1) + P(A2) + ··· + P(Ak).

Since S is the set of all possible outcomes, an outcome in S is certain to occur; the probability of an event that is certain to occur must be 1 (axiom 1). Probabilities must be between 0 and 1 (axiom 2), and probabilities must be additive when events are pairwise disjoint (axiom 3).

Note that the second part of axiom 3 can be proven from the first part using mathematical induction. The inductive step of the proof assumes that the equality holds for k − 1 events. Then

    P(A1 ∪ A2 ∪ ··· ∪ Ak)
      = P((A1 ∪ A2 ∪ ··· ∪ Ak−1) ∪ Ak)            (associativity of union)
      = P(A1 ∪ A2 ∪ ··· ∪ Ak−1) + P(Ak)            (axiom 3)
      = P(A1) + P(A2) + ··· + P(Ak−1) + P(Ak)      (induction hypothesis)

An additional extension of axiom 3 is needed for infinite sample spaces. Namely,

3′′. If A1, A2, ... are pairwise disjoint, then

       P(∪_{i=1}^∞ Ai) = P(A1) + P(A2) + P(A3) + ···

where the righthand side is understood to be the sum of a convergent infinite series. Methods for sequences and series are covered in Chapter 6 of these notes.

Relative frequencies. The probability of an event can be written as the limit of relative frequencies. That is, if A ⊆ S is an event, then

    P(A) = lim_{n→∞} #(A)/n,

where #(A) is the number of occurrences of event A in n repeated trials of the experiment.

Expected number. If P(A) is the probability of event A, then nP(A) is the expected number of occurrences of event A in n repeated trials of the experiment.

1.3.2 Equally Likely Outcomes

If S is a finite set with N elements, A is a subset of S with n elements, and each outcome is equally likely, then

    P(A) = (Number of elements in A)/(Number of elements in S) = |A|/|S| = n/N.

Example 1.3 (Coin Tossing, continued) For example, if you toss a fair coin 5 times and record heads or tails each time, then the probability of getting exactly 3 heads is 10/32 = 0.3125. In 2000 repetitions of the experiment, you expect to observe exactly 3 heads 2000·P(exactly 3 heads) = 2000(0.3125) = 625 times.
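The calculation is easy to verify numerically. The following small Python sketch (not part of the original notes; names are illustrative) counts the favorable sequences and the expected number of occurrences:

    from math import comb

    n_tosses = 5
    favorable = comb(n_tosses, 3)         # sequences with exactly 3 heads: C(5, 3) = 10
    total = 2 ** n_tosses                 # all equally likely outcomes: 32
    p = favorable / total                 # 0.3125
    print(favorable, total, p, 2000 * p)  # expected count in 2000 repetitions: 625.0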

1.3.3 Properties Following from the Axioms

Properties following from the Kolmogorov axioms include

1. Complement rule: Let A^c = S − A be the complement of A in S. Then

       P(A^c) = 1 − P(A).

   In particular, P(∅) = 0.

2. Subset rule: If A is a subset of B, then P(A) ≤ P(B).

3. Inclusion-exclusion rule: If A and B are events, then

       P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Example 1.4 (Demonstration of Inclusion-Exclusion) To demonstrate the inclusion-exclusion rule using the axioms, first note that

    A ∪ B = A ∪ (B ∩ A^c)  where  A ∩ (B ∩ A^c) = ∅.

Since the intersection is empty, we know that P(A ∪ B) = P(A) + P(B ∩ A^c) by axiom 3. Similarly, since

    B = (B ∩ A) ∪ (B ∩ A^c)  where  (B ∩ A) ∩ (B ∩ A^c) = ∅,

P(B) = P(B ∩ A) + P(B ∩ A^c), or P(B ∩ A^c) = P(B) − P(B ∩ A). Thus,

    P(A ∪ B) = P(A) + P(B ∩ A^c) = P(A) + P(B) − P(B ∩ A) = P(A) + P(B) − P(A ∩ B),

since P(A ∩ B) = P(B ∩ A).

Inclusion-exclusion rules. The inclusion-exclusion rule can be generalized to more than two events. For three events, for example, the rule is

    P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

1.4 Counting Methods

Methods for counting the number of outcomes in a sample space or event are important in probability. The multiplication rule is the basic counting formula.

Theorem 1.5 (Multiplication Rule) If an operation consists of r steps of which the first can be done in n1 ways, for each of these the second can be done in n2 ways, for each of the first and second steps the third can be done in n3 ways, etc., then the entire operation can be done in n1 × n2 × ··· × nr ways.

Sampling with/without replacement. Two special cases of the multiplication rule are

1. Sampling with replacement. For a set of size n and a sample of size r, there are a total of n^r = n × n × ··· × n ordered samples, if duplication is allowed.

2. Sampling without replacement. For a set of size n and a sample of size r, there are a total of

       n!/(n − r)! = n × (n − 1) × ··· × (n − r + 1)

   ordered samples, if duplication is not allowed.

If n is a positive integer, the notation n! ("n factorial") is used for the product n! = n × (n − 1) × ··· × 1. For convenience, 0! is defined to equal 1 (0! = 1).

Example 1.6 (Birthdays, continued) For example, suppose there are r unrelated people in a room, none of whom was born on February 29th of a leap year. You would like to determine the probability that at least two people have the same birthday.

1. You ask, and record, each person's birthday. There are

       365^r = 365 × 365 × ··· × 365

   possible outcomes, where an outcome is a sequence of r responses.

2. Consider the event "everyone has a different birthday." The number of outcomes in this event is

       365!/(365 − r)! = 365 × 364 × ··· × (365 − r + 1).

3. Suppose that each sequence of birthdays is equally likely. The probability that at least two people have a common birthday is 1 minus the probability that everyone has a different birthday, or

       1 − (365 × 364 × ··· × (365 − r + 1))/365^r.

Let A be the event "at least two people have a common birthday". The following table demonstrates how quickly P(A) increases as the number of people in the room increases:

    r     5     10    15    20    25    30    35    40    45    50
    P(A)  0.03  0.12  0.25  0.41  0.57  0.71  0.81  0.89  0.94  0.97
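The table can be reproduced with a few lines of code. A minimal Python sketch (illustrative only, not from the notes):

    from math import prod

    def p_common_birthday(r):
        # P(at least two of r people share a birthday), assuming 365 equally likely days
        p_all_different = prod(365 - i for i in range(r)) / 365 ** r
        return 1 - p_all_different

    for r in (5, 10, 15, 20, 25, 30, 35, 40, 45, 50):
        print(r, round(p_common_birthday(r), 2))   # reproduces the table, e.g. 0.57 at r = 25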

1.5 Permutations and Combinations

A permutation is an ordered subset of r distinct objects out of a set of n objects. A combination is an unordered subset of r distinct objects out of the n objects.

By the multiplication rule (Theorem 1.5), there are a total of

    $_nP_r = \frac{n!}{(n-r)!} = n \times (n-1) \times \cdots \times (n-r+1)$

permutations of r objects out of n objects.

Since each unordered subset corresponds to r! ordered subsets (the r chosen elements are permuted in all possible ways), there are a total of

    $_nC_r = \frac{_nP_r}{r!} = \frac{n!}{(n-r)!\,r!} = \frac{n \times (n-1) \times \cdots \times (n-r+1)}{r \times (r-1) \times \cdots \times 1}$

combinations of r objects out of n objects.

For example, there are a total of 5040 ordered subsets of size 4 from a set of size 10, and a total of 5040/24 = 210 unordered subsets.

The notation

    $\binom{n}{r} = {_nC_r} = \frac{n!}{(n-r)!\,r!}$    (read "n choose r")

is used to denote the total number of combinations.


Special cases are as follows:

    $\binom{n}{0} = \binom{n}{n} = 1$  and  $\binom{n}{1} = \binom{n}{n-1} = n$.

Further, since choosing r elements to form a subset is equivalent to choosing the remaining n − r elements to form the complementary subset,

    $\binom{n}{r} = \binom{n}{n-r}$  for r = 0, 1, ..., n.

1.5.1 Example: Simple Urn Model

Suppose there are M special objects in an urn containing a total of N objects. In a subset of size n chosen from the urn, exactly m are special.

Unordered subsets. There are a total of

    $\binom{M}{m} \times \binom{N-M}{n-m}$

unordered subsets with exactly m special objects (and exactly n − m other objects). If each choice of subset is equally likely, then for each m

    $P(m \text{ special objects}) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}.$

Ordered subsets. There are a total of

    $\binom{n}{m} \times {_MP_m} \times {_{N-M}P_{n-m}}$

ordered subsets with exactly m special objects. (The positions of the special objects are selected first, followed by the special objects to fill these positions, followed by the nonspecial objects to fill the remaining positions.) If each choice of subset is equally likely, then for each m

    $P(m \text{ special objects}) = \frac{\binom{n}{m} \times {_MP_m} \times {_{N-M}P_{n-m}}}{{_NP_n}}.$

Computing probabilities. Interestingly, P(m special objects) is the same in both cases. For example, let N = 25, M = 10, n = 8, and m = 3. Then, using the first formula,

    $P(3 \text{ special objects}) = \frac{\binom{10}{3}\binom{15}{5}}{\binom{25}{8}} = \frac{120 \times 3003}{1081575} = \frac{728}{2185} \approx 0.333.$

Using the second formula, the probability is

    $P(3 \text{ special objects}) = \frac{\binom{8}{3} \times {_{10}P_3} \times {_{15}P_5}}{{_{25}P_8}} = \frac{56 \times 720 \times 360360}{43609104000} = \frac{728}{2185} \approx 0.333.$
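Both formulas are easy to check numerically with the standard math.comb and math.perm functions. A short Python sketch (illustrative only, not from the notes):

    from math import comb, perm

    N, M, n, m = 25, 10, 8, 3

    # unordered-subset formula
    p_unordered = comb(M, m) * comb(N - M, n - m) / comb(N, n)

    # ordered-subset formula
    p_ordered = comb(n, m) * perm(M, m) * perm(N - M, n - m) / perm(N, n)

    print(p_unordered, p_ordered)   # both equal 728/2185, about 0.333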


Example 1.7 (Quality Management) A batch of 20 microwaves contains 4 defective and 16 good machines. You decide to select 3 machines at random and test them. If each choice of subset is equally likely, then the probability that there are k defective machines in the chosen subset is

    $P(k \text{ defectives}) = \frac{\binom{4}{k}\binom{16}{3-k}}{\binom{20}{3}}, \qquad k = 0, 1, 2, 3.$

The following table shows the distribution of probabilities:

    k                 0          1          2         3
    P(k defectives)   560/1140   480/1140   96/1140   4/1140
                      = 0.4912   = 0.4211   = 0.0842  = 0.0035

If you decide to accept the batch as long as none of the 3 chosen machines is defective, then there is a 49.12% chance that you will accept a batch containing 4 defective machines.
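A short Python sketch (illustrative only; the helper name is not from the notes) that reproduces the distribution and the acceptance probability:

    from math import comb

    def p_defectives(k, defective=4, good=16, sample=3):
        # hypergeometric probability for the microwave batch
        return comb(defective, k) * comb(good, sample - k) / comb(defective + good, sample)

    print([round(p_defectives(k), 4) for k in range(4)])  # [0.4912, 0.4211, 0.0842, 0.0035]
    print(p_defectives(0))                                # chance of accepting the batch, about 0.4912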

1.5.2 Binomial Coefficients

The quantities $\binom{n}{r}$, r = 0, 1, ..., n, are often referred to as the binomial coefficients because of the following theorem.

Theorem 1.8 (Binomial Theorem) For all real numbers x and y and each positive integer n,

    $(x + y)^n = \sum_{r=0}^{n} \binom{n}{r} x^r y^{n-r}.$

Idea of proof. The product on the left can be written as a sequence of n factors:

    $(x + y)^n = (x + y) \times (x + y) \times \cdots \times (x + y).$

The product expands to 2^n summands, where each summand is a sequence of n letters, where one letter is chosen from each factor. For each r, exactly $\binom{n}{r}$ sequences have r copies of x and n − r copies of y.

Example 1.9 (n = 12) If n = 12, then the numbers of subsets of size r are as follows:

    r               0  1   2   3    4    5    6    7    8    9    10  11  12
    (12 choose r)   1  12  66  220  495  792  924  792  495  220  66  12  1

and the total number of subsets is $\sum_{r=0}^{12} \binom{12}{r} = 4096 = 2^{12}$.
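A quick numerical check of the row and its sum, as a Python sketch (illustrative only):

    from math import comb

    row = [comb(12, r) for r in range(13)]
    print(row)        # 1, 12, 66, 220, 495, 792, 924, 792, 495, 220, 66, 12, 1
    print(sum(row))   # 4096 = 2**12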

1.6 Partitioning Sets

The multiplication rule (Theorem 1.5) can be used to find the number of partitions of a set of n elements into k distinguishable subsets of sizes r1, r2, ..., rk.

Specifically, r1 of the n elements are chosen for the first subset, r2 of the remaining n − r1 elements are chosen for the second subset, and so forth. The result is the product of the numbers of ways to perform each step:

    $\binom{n}{r_1} \times \binom{n - r_1}{r_2} \times \cdots \times \binom{n - r_1 - \cdots - r_{k-1}}{r_k}.$

The product simplifies to

    $\frac{n!}{r_1!\, r_2! \cdots r_k!}$

and is denoted by $\binom{n}{r_1, r_2, \ldots, r_k}$ (read "n choose r1, r2, ..., rk").

Example 1.10 (Partitioning Students) For example, there are a total of

    $\binom{15}{5,5,5} = \frac{15!}{5!\, 5!\, 5!}$ = 756,756

ways to partition the members of a class of 15 students into recitation sections of size 5 each led by Joe, Sally, and Mary, respectively. (The recitation sections are distinguished by their group leaders.)

Permutations of indistinguishable objects. The formula above also represents the number of ways to permute n objects, where the first r1 are indistinguishable, the next r2 are indistinguishable, ..., the last rk are indistinguishable. The computation is done as follows: r1 of the n positions are chosen for the first type of object, r2 of the remaining n − r1 positions are chosen for the second type of object, etc.

In the example above, imagine that the 15 positions correspond to the 15 students in alphabetical order, and that there are 5 J's (for Joe), 5 S's (for Sally), and 5 M's (for Mary) to permute. Each permutation of the 15 letters corresponds to an assignment of the 15 students to the recitation sections led by Joe, Sally, and Mary.

Footnote on partitioning students. Suppose instead that the 15 students are assigned to 3 teams of 5 students each and that each team will work on the same project. Then the total number of different assignments becomes $\binom{15}{5,5,5} / 3!$ = 126,126.
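The two counts can be verified directly. A minimal Python sketch (the helper name multinomial is illustrative, not from the notes):

    from math import factorial

    def multinomial(n, *parts):
        # number of ways to split n labeled items into groups of the given sizes
        assert sum(parts) == n
        count = factorial(n)
        for r in parts:
            count //= factorial(r)
        return count

    labeled = multinomial(15, 5, 5, 5)    # sections led by Joe, Sally, Mary: 756756
    unlabeled = labeled // factorial(3)   # indistinguishable teams: 126126
    print(labeled, unlabeled)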

1.6.1 Multinomial Coefficients

The quantities $\binom{n}{r_1, r_2, \ldots, r_k}$ are often referred to as the multinomial coefficients because of the following theorem.

Theorem 1.11 (Multinomial Theorem) For all real numbers x1, x2, ..., xk and each positive integer n,

    $(x_1 + x_2 + \cdots + x_k)^n = \sum_{(r_1, r_2, \ldots, r_k)} \binom{n}{r_1, r_2, \ldots, r_k} x_1^{r_1} x_2^{r_2} \cdots x_k^{r_k},$

where the sum is over all k-tuples of nonnegative integers with $\sum_i r_i = n$.

Idea of proof: The product on the left can be written as a sequence of n factors

    $(x_1 + x_2 + \cdots + x_k)^n = (x_1 + x_2 + \cdots + x_k) \times (x_1 + x_2 + \cdots + x_k) \times \cdots \times (x_1 + x_2 + \cdots + x_k).$

The product expands to k^n summands, where each summand is a sequence of n letters (one from each factor). For each r1, r2, ..., rk, exactly $\binom{n}{r_1, r_2, \ldots, r_k}$ sequences have r1 copies of x1, r2 copies of x2, etc.

1.6.2 Example: Urn Models

The simple urn model from Section 1.5.1 can be generalized.

Assume that an urn contains k types of objects, Mi is the number of objects of type i, and N = ∑_i Mi is the total number of objects in the urn. A subset of size n is chosen. If each choice of subset is equally likely, then the probability that there are exactly mi objects of type i for i = 1, 2, ..., k is

    $P(\text{Exactly } m_i \text{ objects of type } i \text{ for each } i) = \frac{\binom{M_1}{m_1}\binom{M_2}{m_2} \cdots \binom{M_k}{m_k}}{\binom{N}{n}}.$

Example 1.12 (3 Types of Objects) Let M1 = 3, M2 = 4, M3 = 5 and n = 4. The following table gives the probability distribution for all possible values of m1 and m2 (with m3 = 4 − m1 − m2):

              m2 = 0   m2 = 1    m2 = 2   m2 = 3   m2 = 4   total
    m1 = 0    5/495    40/495    60/495   20/495   1/495    126/495
    m1 = 1    30/495   120/495   90/495   12/495            252/495
    m1 = 2    30/495   60/495    18/495                     108/495
    m1 = 3    5/495    4/495                                9/495
    total     70/495   224/495   168/495  32/495   1/495    495/495

Since N = M1 + M2 + M3 = 12, the total number of subsets of size 4 is $\binom{12}{4}$ = 495. To get the numerator for m1 = 2 and m2 = 1, for example, we compute $\binom{3}{2}\binom{4}{1}\binom{5}{1}$ = 60.

The rightmost column gives the proportion of time we expect to have 0, 1, 2, or 3 elements of type 1 in the subset of size 4. The bottom row gives the proportion of time we expect to have 0, 1, 2, 3, or 4 of type 2 in the subset of size 4.
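A small Python sketch (illustrative only) that reproduces the numerators of the table:

    from math import comb

    M = [3, 4, 5]            # counts of objects of types 1, 2, 3
    N, n = sum(M), 4

    def p(m1, m2):
        m3 = n - m1 - m2
        if m3 < 0 or m3 > M[2]:
            return 0
        return comb(M[0], m1) * comb(M[1], m2) * comb(M[2], m3) / comb(N, n)

    # numerators out of 495; for example, p(2, 1) corresponds to 60/495
    for m1 in range(4):
        print([round(p(m1, m2) * 495) for m2 in range(5)])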

Urn models can be used to analyze surveys conducted in small populations. Additional examples will be discussed in the context of joint discrete distributions (Chapter 3).

1.7 Applications in Statistics

This section introduces two important applications of the methods we have seen so far.


[Figure 1.1: Likelihood for survey example (left) and capture-recapture example (right). Each panel plots probability (Prob) against M (left) or N (right).]

1.7.1 Maximum Likelihood Estimation in Simple Urn Models

In statistics, information from a sample is used when complete information is impossible to obtain or would take too long to gather. For example, information from a randomly chosen subset of elements from an urn can be used to

1. Estimate the number of special elements in an urn, M, when N is known.

2. Estimate the total number of elements in an urn, N, when M is known.

The method uses proportions: if all four numbers (N, M, n, m) are positive, then the proportion of special elements in the subset (m/n) is set equal to the proportion of special elements in the urn (M/N), and the equation is solved for the unknown quantity.

The method is intuitively appealing. Further, it can be shown that estimates obtained using proportions have maximum (or close to maximum) probabilities.

Example 1.13 (Survey Analysis) Assume that in a well-conducted survey of 150 of the 1000 students in the senior class at a local university, 82 students preferred Susan for class president. Since (82/150)1000 ≈ 547, we estimate that 547 seniors support Susan.

The left part of Figure 1.1 is a plot of the function

    $f(M) = \frac{\binom{M}{82}\binom{1000-M}{68}}{\binom{1000}{150}}$  for M = 450, 451, ..., 625.

If the subset of students chosen for the survey was one of $\binom{1000}{150}$ equally likely choices, then the plot suggests that 547 is a maximum likelihood estimate of M.

Example 1.14 (Capture-Recapture Model) Two independent techniques were used to estimate the total number of potentially fatal errors in a complex computer program. The first method identified M = 380 errors. The second method identified a total of n = 278 errors, m = 250 of which were already identified by the first method.

Since (278/250)380 ≈ 423, we estimate that there are a total of 423 potentially fatal errors in the program (408 identified by at least one method; 15 not identified by either method).

The right part of Figure 1.1 is a plot of

    $f(N) = \frac{\binom{380}{250}\binom{N-380}{28}}{\binom{N}{278}}$  for N = 411, 412, ..., 437.

If the subset of errors identified by the second method was one of $\binom{N}{278}$ equally likely choices, then the plot suggests that 423 is a maximum likelihood estimate of N.

Probability (or likelihood) ratio. An alternative to reporting a single value of M or N with maximum probability is to report a range of likely values. Often, the range is determined using probability ratios. That is, M or N is in the reported range if

    (Probability for M or N) / (Maximum Probability) ≥ p  for a fixed proportion p.

Using p = 0.25, the range of likely values in the survey example is {485, ..., 608} and the range of likely values in the capture-recapture example is {415, ..., 431}.
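The maximum likelihood estimate and the ratio-based range for the survey example can be computed by brute force. A minimal Python sketch (function and variable names are illustrative, not from the notes):

    from math import comb

    def likelihood(M, N=1000, n=150, m=82):
        # f(M): probability of m special objects in a sample of n (survey example)
        return comb(M, m) * comb(N - M, n - m) / comb(N, n)

    values = {M: likelihood(M) for M in range(450, 626)}
    mle = max(values, key=values.get)
    # values of M whose probability is at least 25% of the maximum
    likely = [M for M, f in values.items() if f >= 0.25 * values[mle]]
    print(mle, likely[0], likely[-1])   # expect 547, and (per the notes) a range of about 485 to 608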

Survey analysis when N is large. When N is large, we usually work with the unknown proportion of special objects p = M/N and the estimate p̂ = m/n. Further, as an approximation, we assume that p takes values anywhere in the interval 0 ≤ p ≤ 1 and we use "sampling with replacement" in place of "sampling without replacement."

Survey analysis with three types of objects. The simple method of proportions can be generalized to urn models with three (or more) types of objects. For example, suppose we wish to estimate the numbers of students who will vote for each of three candidates (Susan plus two others) and that in a well-conducted survey of 150 of the 1000 students in the senior class, 72 prefer Susan (candidate 1), 56 prefer the second candidate and 22 prefer the third candidate. We estimate that

    (72/150)1000 ≈ 480,  (56/150)1000 ≈ 373  and  (22/150)1000 ≈ 147

will vote for candidates 1, 2, 3, respectively. Further, these estimates represent maximum (or near maximum) probabilities over all triples (M1, M2, M3).

1.7.2 Permutation and Nonparametric Methods

In many statistical applications, the null and alternative hypotheses of interest can be paraphrased in the following simple terms:

    Ho: Any patterns appearing in the data are due to chance alone.
    Ha: There is a tendency for a certain type of pattern to appear.

Permutation methods allow researchers to determine whether to accept or reject a null hypothesis of randomness and, in some cases, to construct confidence intervals for unknown parameters. The methods are applicable in many settings since they require few mathematical assumptions.

For example, consider the analysis of two samples. Let {x1, x2, ..., xn} and {y1, y2, ..., ym} be the observed samples, where each is a list of numbers with repetitions.

The data could have been generated under one of two sampling models.

Population model. Data generated under a population model can be treated as the values of independent random samples from continuous distributions. (These concepts are defined formally in Chapter 8 of these notes.)

For example, consider testing the null hypothesis that the X and Y distributions are equal versus the alternative hypothesis that X is stochastically smaller than Y. (That is, that the values of X tend to be smaller than the values of Y.)

If n = 3, m = 5, and the observed lists are

    {x1, x2, x3} = {1.4, 1.2, 2.8}  and  {y1, y2, y3, y4, y5} = {1.7, 2.9, 3.7, 2.3, 1.3},

then the event

    X2 < Y5 < X1 < Y1 < Y4 < X3 < Y2 < Y3

has occurred. Under the null hypothesis, each permutation of the n + m = 8 random variables is equally likely; thus, the observed event is one of (n + m)! = 40,320 equally likely choices. Patterns of interest under the alternative hypothesis are events where the Xi's tend to be smaller than the Yj's.

Randomization model. Data generated under a randomization model cannot be treated as the values of independent random samples from continuous distributions, but can be thought of as one of N equally likely choices under the null hypothesis.

In the example above, the null hypothesis of randomness is that the observed data is one of N = 40,320 equally likely choices, where a choice corresponds to a matching of the 8 numbers to the labels x1, x2, x3, y1, y2, y3, y4, y5.

Permutation tests. To conduct a permutation test using a test statistic T,

1. The sampling distribution of T is obtained by computing the value of the statistic for each reordering of the data, and

2. The observed significance level (or p value) is computed by comparing the observed value of T to the sampling distribution from step 1.

The sampling distribution from the first step is called the permutation distribution of the statistic T, and the p value is called a permutation p value.


Table 1.1: Sampling distribution of the sum of observations in the first sample.<br />

sum x sample sum x sample sum x sample<br />

3.9 {1.2,1.3,1.4} 5.9 {1.4,1.7,2.8} 7.1 {1.4,2.8,2.9}<br />

4.2 {1.2,1.3,1.7} 5.9 {1.3,1.7,2.9} 7.2 {1.2,2.3,3.7}<br />

4.3 {1.2,1.4,1.7} 6.0 {1.4,1.7,2.9} 7.3 {1.3,2.3,3.7}<br />

4.4 {1.3,1.4,1.7} 6.2 {1.2,1.3,3.7} 7.4 {1.4,2.3,3.7}<br />

4.8 {1.2,1.3,2.3} 6.3 {1.2,1.4,3.7} 7.4 {1.7,2.8,2.9}<br />

4.9 {1.2,1.4,2.3} 6.3 {1.2,2.3,2.8} 7.7 {1.2,2.8,3.7}<br />

5.0 {1.3,1.4,2.3} 6.4 {1.3,2.3,2.8} 7.7 {1.7,2.3,3.7}<br />

5.2 {1.2,1.7,2.3} 6.4 {1.2,2.3,2.9} 7.8 {1.2,2.9,3.7}<br />

5.3 {1.2,1.3,2.8} 6.4 {1.3,1.4,3.7} 7.8 {1.3,2.8,3.7}<br />

5.3 {1.3,1.7,2.3} 6.5 {1.3,2.3,2.9} 7.9 {1.4,2.8,3.7}<br />

5.4 {1.2,1.4,2.8} 6.5 {1.4,2.3,2.8} 7.9 {1.3,2.9,3.7}<br />

5.4 {1.4,1.7,2.3} 6.6 {1.2,1.7,3.7} 8.0 {1.4,2.9,3.7}<br />

5.4 {1.2,1.3,2.9} 6.6 {1.4,2.3,2.9} 8.0 {2.3,2.8,2.9}<br />

5.5 {1.2,1.4,2.9} 6.7 {1.3,1.7,3.7} 8.2 {1.7,2.8,3.7}<br />

5.5 {1.3,1.4,2.8} 6.8 {1.4,1.7,3.7} 8.3 {1.7,2.9,3.7}<br />

5.6 {1.3,1.4,2.9} 6.8 {1.7,2.3,2.8} 8.8 {2.3,2.8,3.7}<br />

5.7 {1.2,1.7,2.8} 6.9 {1.2,2.8,2.9} 8.9 {2.3,2.9,3.7}<br />

5.8 {1.2,1.7,2.9} 6.9 {1.7,2.3,2.9} 9.4 {2.8,2.9,3.7}<br />

5.8 {1.3,1.7,2.8} 7.0 {1.3,2.8,2.9}<br />

Continuing with the example above, let S be the sum of numbers in the x sample. Since<br />

the value of S depends only on which observations are labelled x’s, and not on the relative<br />

ordering of all 8 observations, the sampling distribution of S under the null hypothesis is<br />

obtained by computing its value <strong>for</strong> each partition of the 8 numbers into subsets of sizes 3<br />

and 5, respectively.<br />

Table 1.1 shows the values of S for each of the C(8, 3) = 56 choices of observations for the first
sample. The observed value of S is 5.4. Since small values of S support the alternative

hypothesis that x values tend to be smaller than y values, the observed significance level is<br />

P(S ≤ 5.4) = 13/56.<br />
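The permutation distribution of S can be generated by brute force. The following is a minimal sketch (not part of the original notes), assuming Python 3 with only the standard library; the data are the eight observations of the example.

```python
# Enumerate all C(8,3) = 56 choices of the x sample and compute the sum for each.
from itertools import combinations

x = [1.4, 1.2, 2.8]
y = [1.7, 2.9, 3.7, 2.3, 1.3]
pooled = x + y

observed = sum(x)                                            # 5.4
sums = [sum(subset) for subset in combinations(pooled, 3)]   # 56 sums, as in Table 1.1

# Small sums support the alternative that x values tend to be smaller than y values.
p_value = sum(1 for s in sums if s <= observed + 1e-9) / len(sums)
print(len(sums), p_value)                                    # 56  0.232... (= 13/56)
```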

Example 1.15 (Olympic Sprinting Times) As a second permutation test example,<br />

consider a test of association between x-values and y-values using data on Olympic sprinting<br />

times (Hand et al., Chapman & Hall, 1994, p. 248), and the sample correlation statistic<br />

R = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^n (Xi − X̄)² Σ_{i=1}^n (Yi − Ȳ)² ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively, and n is the

total number of data pairs.<br />

The left part of Figure 1.2 shows the winning times in seconds (vertical axis) for the men's
100 meter finals versus the year since 1900 (horizontal axis) for the years 1912, 1924, ..., 1972.



Figure 1.2: Winning time in seconds (y) versus year since 1900 (x) for Olympic sprinting
times example (left plot). Triglycerides in mg/dL (y) versus cholesterol in mg/dL (x) for
heart disease example (right plot).

The data are reproduced in the following table:<br />

x 12 24 36 48 60 72<br />

y 10.8 10.6 10.3 10.3 10.2 10.14<br />

For these data, the observed value of R is −0.941.<br />

To determine if the observed association is due to chance alone, a permutation analysis of
R = 0 versus R ≠ 0 was conducted at the 5% significance level.

The sampling distribution of R under the null hypothesis is obtained by matching x-values<br />

to permuted y-values in all possible ways, and computing the value of R in each case. Since<br />

two y-values are equal, there are a total of 6!/2 = 360 permutations to consider. Since the<br />

permutation p value is<br />

p value = #(|R| ≥ |−0.941|) / 360 = 2/360 ≈ 0.006

(obtained using the computer), there is evidence that the observed negative association is<br />

not due to chance alone.<br />
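A sketch of this exact permutation analysis is given below (assumed code, not taken from the notes). Because the two tied y values make each distinct matching appear twice among the 6! orderings, the proportion of extreme correlations is unchanged.

```python
from itertools import permutations
from math import sqrt

x = [12, 24, 36, 48, 60, 72]
y = [10.8, 10.6, 10.3, 10.3, 10.2, 10.14]

def corr(a, b):
    # the sample correlation statistic R defined above
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = sqrt(sum((ai - ma) ** 2 for ai in a))
    sb = sqrt(sum((bi - mb) ** 2 for bi in b))
    return sab / (sa * sb)

r_obs = corr(x, y)                                        # about -0.941
orderings = list(permutations(y))                         # 6! = 720 orderings
extreme = sum(1 for p in orderings if abs(corr(x, p)) >= abs(r_obs) - 1e-12)
print(round(r_obs, 3), extreme / len(orderings))          # -0.941  about 0.006
```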

Conditional test, nonparametric test. A permutation test is an example of a conditional<br />

test, since the sampling distribution of T is computed conditional on the observations.<br />

Note that the term nonparametric test is often used to describe permutation tests where<br />

observations have been replaced by ranks. The Wilcoxon rank sum and Mann-Whitney<br />

tests <strong>for</strong> the analysis of two samples, and the Wilcoxon signed rank test <strong>for</strong> the analysis of<br />

paired samples, are examples of nonparametric tests.<br />

Monte Carlo test. If the number of reorderings of the data is very large, then the<br />

computer can be used to approximate the sampling distribution of T by computing its<br />



value <strong>for</strong> a fixed number of random reorderings of the data. The approximate sampling<br />

distribution can then be used to estimate the p value.<br />

A test conducted in this way is an example of a Monte Carlo test. (In a Monte Carlo<br />

analysis, simulation is used to estimate a quantity of interest. Here the quantity of interest<br />

is a p value.)<br />

Example 1.16 (Heart Disease Study) (Hand et al., Chapman & Hall, 1994, p. 221)<br />

Cholesterol and triglycerides belong to the class of chemicals known as lipids (fats). As part<br />

of a study to determine the relationship between high levels of lipids and coronary artery<br />

disease, researchers measured plasma levels of cholesterol and triglycerides in milligrams<br />

per deciliter (mg/dL) in 371 men complaining of chest pain.<br />

The right part of Figure 1.2 compares the cholesterol (x) and triglycerides (y) levels <strong>for</strong> the<br />

51 men with no evidence of heart disease. The sample correlation is 0.325. (See the last<br />

example <strong>for</strong> the definition of the sample correlation statistic.)<br />

To determine if the observed association is due to chance alone, a permutation analysis can<br />

be conducted using the sample correlation statistic R as test statistic.<br />

Consider testing R = 0 versus R ≠ 0 at the 5% significance level. Under the null hypothesis

of randomness, each matching of observed x values to permuted y values is equally likely.<br />

Thus, the observed matching is one of about 52! ≈ 8 × 10^67 equally likely choices.

In a Monte Carlo analysis using 5000 random matchings (including the observed matching),<br />

the estimated permutation p value was<br />

estimated p value = #(|R| ≥ |0.325|) / 5000 = 98/5000 ≈ 0.0196.

Thus, there is evidence that the positive association between levels of cholesterol and triglycerides<br />

is not due to chance alone.<br />
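A minimal Monte Carlo sketch is shown below (assumed code; it requires Python 3.10+ for statistics.correlation). The observed matching is counted along with the random re-matchings.

```python
import random
from statistics import correlation

def monte_carlo_p(x, y, n_rand=4999, seed=0):
    rng = random.Random(seed)
    r_obs = correlation(x, y)
    count = 1                                   # include the observed matching
    y_perm = list(y)
    for _ in range(n_rand):
        rng.shuffle(y_perm)                     # a random re-matching of y values to x values
        if abs(correlation(x, y_perm)) >= abs(r_obs):
            count += 1
    return count / (n_rand + 1)                 # estimated permutation p value
```

With n_rand = 4999, the estimate is based on 5000 matchings, as in the heart disease example.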

1.8 Conditional Probability<br />

Assume that A and B are events, and that P(B) > 0. Then the conditional probability of<br />

A given B is defined as follows:<br />

P(A|B) = P(A ∩ B) / P(B).

Event B is often referred to as the conditional sample space. The conditional probability<br />

P(A|B) is the relative “size” of A within B.<br />

Example 1.17 (Respiratory Problems) For example, suppose that 40% of the adults in<br />

a certain population smoke cigarettes and that 28% smoke cigarettes and have respiratory<br />

problems. Then<br />

P(Respiratory problems | Smoker) = 0.28/0.40 = 0.70

is the probability that an adult has respiratory problems given that the adult is a smoker.<br />

(70% of smokers have respiratory problems.)<br />



Equally likely outcomes. Note that if the sample space S is finite and each outcome is<br />

equally likely, then the conditional probability of A given B simplifies to the following:<br />

P(A|B) = (|A ∩ B|/|S|) / (|B|/|S|) = |A ∩ B| / |B| = (number of elements in A ∩ B) / (number of elements in B).

1.8.1 Multiplication Rules for Probability

Assume that A and B are events with positive probability. Then the definition of conditional<br />

probability implies that the probability of the intersection, P(A ∩ B), can be written as a<br />

product of probabilities in two different ways:<br />

P(A ∩ B) = P(B) × P(A|B) = P(A) × P(B|A).

More generally,

Theorem 1.18 (Multiplication Rule <strong>for</strong> Probability) If A1,A2,... ,Ak are events<br />

and P(A1 ∩ A2 ∩ · · · ∩ Ak−1) > 0, then<br />

P(A1 ∩ A2 ∩ · · · ∩ Ak) =<br />

P(A1) × P(A2|A1) × P(A3|A1 ∩ A2) × · · · × P(Ak|A1 ∩ A2 ∩ · · · ∩ Ak−1).<br />

Example 1.19 (Sampling Without Replacement) For example, suppose that four<br />

slips of paper are sampled without replacement from a well-mixed urn containing twenty-five
slips of paper: fifteen slips with the letter X written on each and ten slips of paper
with the letter Y written on each. Then the probability of observing the sequence XYXX is

P(XYXX) = P(X) × P(Y|X) × P(X|XY) × P(X|XYX) = (15/25) × (10/24) × (14/23) × (13/22) ≈ 0.09.

(The probability of choosing an X slip is 15/25; with an X removed from the urn, the<br />

probability of drawing a Y slip is 10/24; with an X and Y removed from the urn, the<br />

probability of drawing an X slip is 14/23; with two X slips and one Y slip removed from<br />

the urn, the probability of drawing an X slip is 13/22.)<br />

1.8.2 Law of Total Probability<br />

The law of total probability can be used to write an unconditional probability as the<br />

weighted average of conditional probabilities. Specifically,<br />

Theorem 1.20 (Law of Total Probability) Let A1, A2, ..., Ak and B be events with<br />

nonzero probability. If A1, A2, ..., Ak are pairwise disjoint with union S, then<br />

P(B) = P(A1) × P(B|A1) + P(A2) × P(B|A2) + · · · + P(Ak) × P(B|Ak).<br />



To demonstrate the law of total probability, note that if A1, A2, ..., Ak are pairwise disjoint<br />

with union S, then the sets B ∩ A1, B ∩ A2, ..., B ∩ Ak are pairwise disjoint with union B.<br />

Thus, axiom 3 and the definition of conditional probability imply that<br />

P(B) = Σ_{j=1}^k P(B ∩ Aj) = Σ_{j=1}^k P(Aj) P(B|Aj).

Example 1.21 (Respiratory Problems, continued) For example, suppose that 70% of<br />

smokers and 15% of nonsmokers in a certain population of adults have respiratory problems.<br />

If 40% of the population smoke cigarettes, then<br />

P(Respiratory problems) = 0.40(0.70) + 0.60(0.15) = 0.37<br />

is the probability of having respiratory problems.<br />

Law of average conditional probabilities. The law of total probability is often called<br />

the law of average conditional probabilities. Specifically, P(B) is the weighted average of<br />

the collection of conditional probabilities {P(B|Aj)}, using the collection of unconditional<br />

probabilities {P(Aj)} as weights.<br />

In the respiratory problems example above, 0.37 is the weighted average of 0.70 (the probability<br />

that a smoker has respiratory problems) and 0.15 (the probability that a nonsmoker<br />

has respiratory problems).<br />

1.8.3 Bayes Rule<br />

Bayes rule, proven by the Reverend T. Bayes in the 1760’s, can be used to update probabilities<br />

given that an event has occurred. Specifically,<br />

Theorem 1.22 (Bayes Rule) Let A1, A2, ..., Ak and B be events with nonzero probability.<br />

If A1, A2, ..., Ak are pairwise disjoint with union S, then<br />

P(Aj|B) = [P(Aj) × P(B|Aj)] / [Σ_{i=1}^k P(Ai) × P(B|Ai)], j = 1, 2, ..., k.

Bayes rule is a restatement of the definition of conditional probability: the numerator in<br />

the <strong>for</strong>mula is P(Aj ∩ B), the denominator is P(B) (by the law of total probability), and<br />

the ratio is P(Aj|B).<br />

Example 1.23 (Defective Products) For example, suppose that 2% of the products<br />

assembled during the day shift and 6% of the products assembled during the night shift at<br />

a small company are defective and need reworking. If the day shift accounts <strong>for</strong> 55% of the<br />

products assembled by the company, then<br />

P(Day shift | Defective) = 0.55(0.02) / [0.55(0.02) + 0.45(0.06)] ≈ 0.289

is the probability that a product was assembled during the day shift given that the product<br />

is defective. (About 28.9% of defective products are assembled during the day shift.)<br />
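Bayes rule is easy to code directly. The following sketch (assumed code, not from the notes) takes the prior probabilities P(Aj) and the conditional probabilities P(B|Aj) and returns the posterior probabilities P(Aj|B).

```python
def posterior(priors, likelihoods):
    joint = [p * q for p, q in zip(priors, likelihoods)]   # P(Aj) x P(B|Aj)
    total = sum(joint)                                     # P(B), by the law of total probability
    return [j / total for j in joint]

# Defective products example: day and night shifts with defect rates 2% and 6%.
print(posterior([0.55, 0.45], [0.02, 0.06]))               # [0.289..., 0.710...]
```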



Table 1.2: Distribution of unaided distance scores for right (x) and left (y) eyes.

        y = 1   y = 2   y = 3   y = 4   total
x = 1   0.203   0.036   0.017   0.009   0.265
x = 2   0.031   0.202   0.058   0.010   0.301
x = 3   0.016   0.048   0.237   0.027   0.328
x = 4   0.005   0.011   0.024   0.066   0.106
total   0.255   0.297   0.336   0.112   1.000

Figure 1.3: Distributions of left eye scores given right eye score is 1 (left) or 4 (right).<br />

Prior and posterior distributions. In applications of Bayes rule, the collection of<br />

probabilities {P(Aj)} are often referred to as the prior probabilities (the probabilities be<strong>for</strong>e<br />

observing an outcome in B), and the collection of probabilities {P(Aj|B)} are often referred<br />

to as the posterior probabilities (the probabilities after event B has occurred).<br />

Example 1.24 (Eyesight Study) (Stuart, Biometrika 42:412-416, 1955) Between 1943<br />

and 1946 the eyesight of more than seven thousand women aged 30-39 employed in the<br />

British weapons industry was tested. For each woman the unaided distance vision of the<br />

right eye (x) and of the left eye (y) was recorded using the four-point scale 1, 2, 3, 4, where<br />

1 is the highest grade and 4 is the lowest grade. Table 1.2 summarizes the results of the<br />

experiment.<br />

For example, <strong>for</strong> 20.3% of women who participated in the study both eyes were given the<br />

highest score (x = y = 1) and <strong>for</strong> 6.6% of women who participated in the study both eyes<br />

were given the lowest score (x = y = 4).<br />

Consider the experiment “Choose a woman from the set of women studied by Stuart and<br />

record the scores <strong>for</strong> both eyes,” assume each choice is equally likely, and let Aj be the<br />

event that the left eye score equals j.<br />

The left part of Figure 1.3 shows the posterior distribution {P(Aj|B)} (in black) superimposed<br />

on the prior distribution {P(Aj)} (in gray) when B is the event that the right eye<br />



score is 1. (If the right eye score is 1, then the left eye score is most likely 1.)<br />

The right part of Figure 1.3 shows the posterior distribution {P(Aj|B)} (in black) superimposed<br />

on the prior distribution {P(Aj)} (in gray) when B is the event that the right eye<br />

score is 4. (If the right eye score is 4, then the left eye score is most likely 4.)<br />

Example 1.25 (Religion and Education) As a last example, suppose that 45% of<br />

the population of a certain adult community are Catholic, 15% are Jewish, and 40% are<br />

Protestant. Further, suppose that<br />

1. Education levels among the Catholic population are as follows:

   Level:   8th grade or less  Some High School  HS Graduate  Some College  College Graduate  Post-Graduate
   Percent: 21%                28%               32%          8%            7%                4%

2. Education levels among the Jewish population are as follows:

   Level:   8th grade or less  Some High School  HS Graduate  Some College  College Graduate  Post-Graduate
   Percent: 7%                 23%               24%          15%           18%               13%

3. Education levels among the Protestant population are as follows:

   Level:   8th grade or less  Some High School  HS Graduate  Some College  College Graduate  Post-Graduate
   Percent: 13%                22%               27%          11%           19%               8%

The information above can be used to construct a contingency table of proportions in each
religion-education group in this community:

              8th grade or less  Some High School  HS Graduate  Some College  College Graduate  Post-Graduate  Total
Catholic:     0.0945             0.126             0.144        0.036         0.0315            0.018          0.45
Jewish:       0.0105             0.0345            0.036        0.0225        0.027             0.0195         0.15
Protestant:   0.052              0.088             0.108        0.044         0.076             0.032          0.40
Total:        0.157              0.2485            0.288        0.1025        0.1345            0.0695         1

Each number in the body of the table is obtained using the multiplication rule. For example,<br />

P(Catholic and HS Graduate) = 0.45 × 0.32 = 0.144.<br />

Each number in the bottom row is obtained using the law of total probability. For example,<br />

P(HS Graduate) = 0.45 × 0.32 + 0.15 × 0.24 + 0.40 × 0.27 = 0.288.<br />

The prior distribution of Catholic, Jewish, and Protestant is 0.45, 0.15, and 0.40, respectively.<br />

Given that a person is a high school graduate (and no more), <strong>for</strong> example, the<br />

posterior distribution of Catholic, Jewish, and Protestant is<br />

0.144/0.288 = 0.5, 0.036/0.288 = 0.125, and 0.108/0.288 = 0.375, respectively.

This posterior distribution was computed using Bayes rule.<br />



1.9 Independent Events and Mutually Independent Events<br />

Events A and B are said to be independent if<br />

P(A ∩ B) = P(A) × P(B).<br />

Otherwise, A and B are said to be dependent.<br />

If A and B are independent and have positive probabilities, then the multiplication rule <strong>for</strong><br />

probability (page 17) implies that<br />

P(A|B) = P(A) and P(B|A) = P(B).<br />

(The relative size of A within B is the same as its relative size within S; the relative size of<br />

B within A is the same as its relative size within S.)<br />

If A and B are independent, 0 < P(A) < 1, and 0 < P(B) < 1, then A and B^c are independent,
A^c and B are independent, and A^c and B^c are independent.

More generally, events A1,A2,...,Ak are said to be mutually independent if<br />

• For each pair of distinct indices (i1, i2): P(Ai1 ∩ Ai2) = P(Ai1) × P(Ai2),

• For each triple of distinct indices (i1, i2, i3): P(Ai1 ∩ Ai2 ∩ Ai3) = P(Ai1) × P(Ai2) × P(Ai3),

• and so forth.

Example 1.26 (Sampling With Replacement) For example, suppose that four slips<br />

of paper are sampled with replacement from a well-mixed urn containing twenty-five slips<br />

of paper: fifteen slips with the letter X written on each and ten slips of paper with the<br />

letter Y written on each. Then the probability of observing the sequence XY XX is<br />

P(XYXX) = P(X) × P(Y) × P(X) × P(X) = (15/25)³ (10/25) = 54/625 = 0.0864.

Further, since C(4, 3) = 4 sequences have exactly 3 X's, the probability of observing a sequence
with exactly 3 X's is 4(0.0864) = 0.3456.

Example 1.27 (Three Coins) You toss a fair coin three times and record an h (<strong>for</strong><br />

heads) or a t (<strong>for</strong> tails) each time. The sample space is<br />

S = {hhh, hht, hth, thh, htt, tht, tth, ttt}.<br />

Let Hi be the event that the i th toss results in a head:<br />

H1 = {hhh, hht, hth, htt}, H2 = {hhh, hht, thh, tht}, H3 = {hhh, hth, thh, tth}.<br />



Since P(H1) = P(H2) = P(H3) = 1/2,

P(Hi ∩ Hj) = 1/4 = P(Hi) × P(Hj) for (i, j) = (1,2), (1,3), (2,3),

and P(H1 ∩ H2 ∩ H3) = 1/8 = P(H1)P(H2)P(H3), the events are mutually independent.

Example 1.28 (Not Mutually Independent) Suppose that the sample space of an<br />

experiment consists of the 3! = 6 permutations of the letters a, b, and c, along with the<br />

triples of each letter:<br />

S = {aaa, bbb, ccc, abc, acb, bac, bca, cab, cba}.<br />

Assume each outcome is equally likely, and let Ai be the event that the i th letter is an a:<br />

A1 = {aaa, abc, acb}, A2 = {aaa, bac, cab}, A3 = {aaa, bca,cba}.<br />

Then P(A1) = P(A2) = P(A3) = 1/3 and

P(A1 ∩ A2) = P(A1 ∩ A3) = P(A2 ∩ A3) = P(A1 ∩ A2 ∩ A3) = 1/9,

since each intersection set equals the simple event {aaa}.

Since P(Ai ∩ Aj) = 1/9 = P(Ai) × P(Aj) for (i, j) = (1,2), (1,3), (2,3), the events are
independent in pairs. However, since P(A1 ∩ A2 ∩ A3) = 1/9 ≠ 1/27 = P(A1)P(A2)P(A3),
the three events are not mutually independent.

Repeated trials and mutual independence. As stated at the beginning of the chapter,<br />

the term experiment is used in probability theory to describe a procedure whose outcome<br />

is not known in advance with certainty. Experiments are assumed to be repeatable, and to<br />

have a well-defined set of outcomes. Repeated trials are repetitions of an experiment using<br />

the specified procedure, with the outcomes of the trials having no influence on one another.<br />

The results of repeated trials of an experiment are mutually independent.<br />



2 Discrete Random Variables<br />

Researchers use random variables to describe the numerical results of experiments. For<br />

example, if a fair coin is tossed 5 times and the total number of heads is recorded, then a<br />

random variable whose values are 0, 1, 2, 3, 4, 5 is used to give a numerical description of<br />

the results.<br />

This chapter focuses on discrete random variables and their probability distributions.<br />

2.1 Definitions<br />

A random variable is a function from the sample space of an experiment to the real numbers.<br />

The range of a random variable is the set of values the random variable assumes. Random<br />

variables are usually denoted by capital letters (X, Y , Z, ...) and their values by lower case<br />

letters (x, y, z, ...).<br />

If the range of a random variable is a finite or countably infinite set, then the random<br />

variable is said to be discrete; if the range is an interval or a union of intervals, the random<br />

variable is said to be continuous; otherwise, the random variable is said to be mixed.<br />

If X is a discrete random variable, then P(X = x) is the probability that an outcome<br />

has value x. Similarly, P(X ≤ x) is the probability that an outcome has value x or less,<br />

P(a < X < b) is the probability that an outcome has value strictly between a and b, and<br />

so <strong>for</strong>th.<br />

Example 2.1 (Coin Tossing) Let H be the number of heads in 8 tosses of a fair coin<br />

and let X be the difference between the number of heads and the number of tails. That is,<br />

let X = H − (8 − H) = 2H − 8. Since<br />

P(H = h) = C(8, h)/256 when h = 0, 1, ..., 8 and 0 otherwise,

we know that X is a discrete random variable with range {−8, −6, −4, −2, 0, 2, 4, 6, 8}.
The following table displays P(X = x) for each x in the range of X, using a common
denominator throughout:

h           0      1      2      3      4      5      6      7      8
x          −8     −6     −4     −2      0      2      4      6      8
P(X = x)  1/256  8/256  28/256 56/256 70/256 56/256 28/256  8/256  1/256

From this table, we can compute all probabilities of interest. For example,

P(X ≥ 3) = P(X = 4, 6, or 8) = 37/256 ≈ 0.1445.

(There are 256 sequences of heads and tails. Of these, 37 have either 6, 7, or 8 heads.)
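A short sketch (assumed code, not from the notes) reproducing this table and the probability P(X ≥ 3):

```python
from math import comb

# PDF of X = 2H - 8, where H is the number of heads in 8 tosses of a fair coin.
f = {2 * h - 8: comb(8, h) / 2 ** 8 for h in range(9)}

print(f[0])                                    # 70/256 = 0.2734375
print(sum(p for x, p in f.items() if x >= 3))  # P(X >= 3) = 37/256, about 0.1445
```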


2.1.1 PDF and CDF for Discrete Distributions

If X is a discrete random variable, then the frequency function (FF) or probability density<br />

function (PDF) of X is defined as follows:

f(x) = P(X = x) for all real numbers x.

PDFs satisfy the following properties:

1. f(x) ≥ 0 for all real numbers x.

2. Σ_{x∈R} f(x) = 1, where R is the range of X.

Since f(x) is the probability of an event and events have nonnegative probabilities, the<br />

values of f(x) must be nonnegative (property 1). Since the events X = x <strong>for</strong> x ∈ R are<br />

mutually disjoint with union S (the sample space), the sum of the probabilities of these<br />

events must be 1 (property 2).<br />

The cumulative distribution function (CDF) of the discrete random variable X is defined<br />

as follows<br />

F(x) = P(X ≤ x) for all real numbers x.

Note that F(x) is a sum of probabilities:

F(x) = Σ_{y∈Rx} f(y), where Rx = {y ∈ R : y ≤ x}.

CDFs satisfy the following properties:

1. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.

2. If x1 ≤ x2, then F(x1) ≤ F(x2).

3. F(x) is right continuous. That is, for each a, lim_{x→a+} F(x) = F(a).

F(x) represents cumulative probability, with limits 0 and 1 (property 1). Cumulative probability<br />

increases with increasing x (property 2), and has discrete jumps at values of x in<br />

the range of the random variable (property 3). (The concept of limit is central to calculus,<br />

and will be defined in Chapter 4 of these notes.)<br />

Plotting PDF and CDF functions. The PDF of a discrete random variable is represented<br />

graphically<br />

1. By using a scatter plot of pairs (x,f(x)) <strong>for</strong> x ∈ R, or<br />

2. By using a probability histogram, where area is used to represent probability.<br />



Figure 2.1: Probability histogram (left) and cumulative distribution function (right) for the
difference between the number of heads and the number of tails in eight tosses of a fair coin.

The CDF of a discrete random variable is represented graphically as a step function, with<br />

steps of height f(x) at each x ∈ R.<br />

Example 2.2 (Coin Tossing, continued) For example, the left part of Figure 2.1 is the<br />

probability histogram <strong>for</strong> the difference between the number of heads and the number of<br />

tails in eight tosses of a fair coin. For each x in the range of the random variable, a rectangle<br />

with base equal to the interval [x − 0.50,x + 0.50] and with height equal to f(x) is drawn.<br />

The total area is 1.0. The right part is a representation of the cumulative distribution<br />

function. Note that F(x) is nondecreasing, F(x) = 0 when x < −8, and F(x) = 1 when<br />

x > 8. Steps occur at x = −8, −6,... ,6,8.<br />

2.2 Families of Distributions<br />

This section briefly describes four families of discrete probability distributions, and states<br />

properties of these distributions.<br />

2.2.1 Example: Discrete Uniform Distribution

Let n be a positive integer. The random variable X is said to be a discrete uniform random
variable, or to have a discrete uniform distribution, with parameter n when its PDF is as
follows:

f(x) = 1/n when x = 1, 2, ..., n and 0 otherwise.

For example, if you roll a fair six-sided die and let X equal the number of dots on the top<br />

face, then X has a discrete uni<strong>for</strong>m distribution with n = 6.<br />

2.2.2 Example: Hypergeometric Distribution<br />

Let N, M, and n be integers with 0 < M < N and 0 < n < N. The random variable X is<br />

said to be a hypergeometric random variable, or to have a hypergeometric distribution, with<br />



Figure 2.2: Probability histograms for hypergeometric (left) and binomial (right) examples.

parameters n, M, and N, when its PDF is as follows:

f(x) = C(M, x) C(N − M, n − x) / C(N, n),

for integers x between max(0, n + M − N) and min(n, M) (and zero otherwise). Note that

the restrictions on x ensure that all combinations are defined and positive.<br />

Example 2.3 (n = 8, M = 14, N = 35) Assume that X has a hypergeometric distribution<br />

with parameters n = 8, M = 14 and N = 35. The following table gives the values<br />

of the PDF of X <strong>for</strong> all x in its range, using 4 decimal place accuracy.<br />

x 0 1 2 3 4 5 6 7 8<br />

f(x) 0.0086 0.0692 0.2098 0.3147 0.2545 0.1131 0.0268 0.0031 0.0001<br />

The left part of Figure 2.2 is a probability histogram <strong>for</strong> X.<br />

Hypergeometric distributions are used to model urn experiments (see Section 1.5.1). Suppose<br />

there are M special objects in an urn containing a total of N objects. Let X be the<br />

number of special objects in a subset of size n chosen from the urn. If each choice of subset<br />

is equally likely, then X has a hypergeometric distribution. In the example above, 40%<br />

(14/35) of the elements in the urn are special.<br />
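A minimal sketch (assumed code, not from the notes) of the hypergeometric PDF, checked against Example 2.3:

```python
from math import comb

def hyper_pdf(x, n, M, N):
    # probability of x special objects in a subset of size n from an urn with M special of N objects
    if x < max(0, n + M - N) or x > min(n, M):
        return 0.0
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

print([round(hyper_pdf(x, 8, 14, 35), 4) for x in range(9)])
# [0.0086, 0.0692, 0.2098, 0.3147, 0.2545, 0.1131, 0.0268, 0.0031, 0.0001]
```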

2.2.3 Distributions Related to Bernoulli Experiments<br />

A Bernoulli experiment is an experiment with two possible outcomes. The outcome of<br />

chief interest is often called “success” and the other outcome “failure.” Let p equal the<br />

probability of success.<br />

Imagine repeating a Bernoulli experiment n times. The expected number of successes in n<br />

independent trials of a Bernoulli experiment with success probability p is np.<br />

For example, suppose that you roll a fair six-sided die and observe the number on the top<br />

face each time. Let success be a 1 or 4 on the top face, and failure be a 2, 3, 5, or 6 on the<br />

top face. Then p = 1/3 is the probability of success. In 600 trials of the experiment, you<br />

expect 200 successes.<br />



Example: Bernoulli distribution. Suppose that a Bernoulli experiment is run once.<br />

Let X equal 1 if a success occurs and 0 if a failure occurs. Then X is said to be a Bernoulli<br />

random variable, or to have a Bernoulli distribution, with parameter p. The PDF of X is<br />

as follows:<br />

f(1) = p, f(0) = 1 − p, and f(x) = 0 otherwise.<br />

Example: binomial distribution. Let X be the number of successes in n independent<br />

trials of a Bernoulli experiment with success probability p. Then X is said to be a binomial<br />

random variable, or to have a binomial distribution, with parameters n and p. The PDF of<br />

X is as follows:<br />

f(x) = C(n, x) p^x (1 − p)^(n−x) when x = 0, 1, 2, ..., n, and 0 otherwise.

For each x, f(x) is the probability of the event “exactly x successes in n independent trials.”
(There are a total of C(n, x) sequences with exactly x successes and n − x failures; each such
sequence has probability p^x (1 − p)^(n−x).)

Example 2.4 (n = 8, p = 0.4) Assume that X has a binomial distribution with parameters<br />

n = 8 and p = 0.4. The following table gives the values of the PDF of X <strong>for</strong> all x in<br />

its range:<br />

x 0 1 2 3 4 5 6 7 8<br />

f(x) 0.0168 0.0896 0.209 0.2787 0.2322 0.1239 0.0413 0.0079 0.0007<br />

The right part of Figure 2.2 is a probability histogram <strong>for</strong> X.<br />

Notes. The binomial theorem (page 8) can be used to demonstrate that<br />

f(0) + f(1) + · · · + f(n) = 1.<br />

Note also that a Bernoulli random variable is a binomial random variable with n = 1.<br />

2.2.4 Simple Random Samples<br />

Suppose that an urn contains N objects. A simple random sample of size n is a sequence<br />

of n objects chosen without replacement from the urn, where the choice of each sequence is<br />

equally likely.<br />

Let M be the number of special objects in the urn and X be the number of special objects<br />

in a simple random sample of size n. Then X has a hypergeometric distribution with<br />

parameters n, M, N. Further, if N is very large, then binomial probabilities can often be<br />

used to approximate hypergeometric probabilities.<br />

Theorem 2.5 (Binomial Approximation) If N is large, then the binomial distribution<br />

with parameters n and p = M/N can be used to approximate the hypergeometric<br />

distribution with parameters n, M, N. Specifically,<br />

P(x special objects) = C(M, x) C(N − M, n − x) / C(N, n) ≈ C(n, x) p^x (1 − p)^(n−x) for each x.


Note that if X is the number of special objects in a sequence of n objects chosen with<br />

replacement from the urn and if the choice of each sequence is equally likely, then X has<br />

a binomial distribution with parameters n and p = M/N. The theorem says that if N is<br />

large, then the model where sampling is done with replacement can be used to approximate<br />

the model where sampling is done without replacement.<br />

Adequacy of the approximation. One rule of thumb <strong>for</strong> judging the adequacy of the<br />

approximation is the following: the binomial approximation is adequate when<br />

20n < N and 0.05 < p < 0.95.<br />

Survey analysis. Simple random samples are used in surveys. If the survey population<br />

is small, then hypergeometric distributions are used to analyze the results. If the survey<br />

population is large, then binomial distributions are used to analyze the results, even though<br />

each person’s opinion is solicited at most once.<br />

Example 2.6 (Survey Analysis) For example, suppose that a surveyor is interested in<br />

determining the level of support <strong>for</strong> a proposal to change the local tax structure, and decides<br />

to choose a simple random sample of size 10 from the registered voter list. If there are a<br />

total of 120 registered voters, one-third of whom support the proposal, then the probability<br />

that exactly 3 of the 10 chosen voters support the proposal is<br />

P(X = 3) = C(40, 3) C(80, 7) / C(120, 10) ≈ 0.27.

If there are thousands of registered voters, the probability is

P(X = 3) ≈ C(10, 3) (1/3)³ (2/3)⁷ ≈ 0.26.

Note that the exact number of registered voters is not needed in the approximation.<br />
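The survey calculation can be checked with a short sketch (assumed code, not from the notes) comparing the exact hypergeometric probability with its binomial approximation:

```python
from math import comb

def hyper_pdf(x, n, M, N):
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def binom_pdf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(round(hyper_pdf(3, 10, 40, 120), 4))   # exact probability, about 0.27
print(round(binom_pdf(3, 10, 1 / 3), 4))     # binomial approximation, about 0.26
```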

2.3 Mathematical Expectation<br />

Mathematical expectation generalizes the idea of a weighted average, where probability<br />

distributions are used as the weights.<br />

2.3.1 Definitions for Finite Discrete Random Variables

Let X be a discrete random variable with finite range R and PDF f(x). The mean of X<br />

(or the expected value of X or the expectation of X) is defined as follows:<br />

E(X) = Σ_{x∈R} x f(x).



Similarly, if g(X) is a real-valued function of X, then the mean of g(X) (or the expected<br />

value of g(X) or the expectation of g(X)) is<br />

E(g(X)) = Σ_{x∈R} g(x) f(x).

Example 2.7 (Nickels and Dimes) Assume you have 3 dimes and 5 nickels in your<br />

pocket. You choose a subset of 4 coins, let X equal the number of dimes in the subset and<br />

g(X) = 10X + 5(4 − X) = 20 + 5X<br />

be the total value (in cents) of the chosen coins. If each choice of subset is equally likely,<br />

then X has a hypergeometric distribution with parameters n = 4, M = 3, and N = 8, the<br />

expected number of dimes in the subset is<br />

E(X) = Σ_{x=0}^{3} x f(x) = 0(5/70) + 1(30/70) + 2(30/70) + 3(5/70) = 1.5

and the expected total value of the chosen coins is

E(g(X)) = Σ_{x=0}^{3} (20 + 5x) f(x) = 20(5/70) + 25(30/70) + 30(30/70) + 35(5/70) = 27.5 cents.

Note that E(g(X)) = g(E(X)).

2.3.2 Properties of Expectation

Properties of sums imply the following properties of expectation:

1. If a is a constant, then E(a) = a.

2. If a and b are constants, then E(a + bX) = a + bE(X).

3. If ci is a constant and gi(X) is a real-valued function for i = 1, 2, ..., k, then

   E(c1 g1(X) + c2 g2(X) + · · · + ck gk(X)) = c1 E(g1(X)) + c2 E(g2(X)) + · · · + ck E(gk(X)).

The first property says that the mean of a constant function is the constant itself. The
second property says that if g(X) = a + bX, then E(g(X)) = g(E(X)). The third property
generalizes the first two.

Note that if g(X) is not of the form a + bX, then E(g(X)) and g(E(X)) may be different. For example,
let g(X) = X² in the nickels and dimes example above. Then

E(g(X)) = 0²(5/70) + 1²(30/70) + 2²(30/70) + 3²(5/70) = 39/14.

Since 39/14 ≠ (3/2)², E(g(X)) ≠ g(E(X)) in this case.
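These expectations can be verified with a short sketch (assumed code, not from the notes); X is hypergeometric with n = 4, M = 3, N = 8, and g(X) = 20 + 5X is the value of the chosen coins in cents.

```python
from math import comb

f = {x: comb(3, x) * comb(5, 4 - x) / comb(8, 4) for x in range(4)}   # PDF of X

EX  = sum(x * p for x, p in f.items())              # 1.5 dimes on average
Eg  = sum((20 + 5 * x) * p for x, p in f.items())   # 27.5 cents; equals g(E(X)) since g is linear
EX2 = sum(x ** 2 * p for x, p in f.items())         # 39/14, which is not (E(X))^2 = 2.25
print(EX, Eg, EX2)
```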



Table 2.1: Means and variances for four families of discrete distributions.

Distribution              E(X)        Var(X)
Discrete Uniform n        (n + 1)/2   (n² − 1)/12
Hypergeometric n, M, N    nM/N        n(M/N)(1 − M/N)(N − n)/(N − 1)
Bernoulli p               p           p(1 − p)
Binomial n, p             np          np(1 − p)

2.3.3 Mean, Variance and Standard Deviation

Let X be a random variable and let µ = E(X) be its mean. The variance of X, V ar(X),<br />

is defined as follows:

Var(X) = E((X − µ)²).

The notation σ 2 = V ar(X) is used to denote the variance. The standard deviation of X,<br />

σ = SD(X), is the positive square root of the variance.<br />

The mean is a measure of the center (or location) of a distribution. The variance and<br />

standard deviation are measures of the scale (or spread) of a distribution. If X is the height<br />

of an individual in inches, say, then the values of E(X) and SD(X) are in inches, while the<br />

value of V ar(X) is in square-inches.<br />

Table 2.1 gives general <strong>for</strong>mulas <strong>for</strong> the mean and variance of the four families of discrete<br />

probability distributions discussed earlier.<br />

Example 2.8 (Drilling <strong>for</strong> Oil) Assume that the probability of finding oil in a given<br />

drill hole is 0.3 and that the results (finding oil or not) are independent from drill hole to<br />

drill hole.<br />

An oil company drills one hole at a time. If they find oil, then they stop; otherwise, they<br />

continue. However, the company only has enough money to drill five holes. Let X be the<br />

number of holes drilled.<br />

The range of X is {1,2,3,4,5} and its PDF is given in Table 2.2. (The values of f(x) are<br />

zero otherwise.) The mean number of holes drilled is

E(X) = Σ_{x=1}^{5} x f(x) = 1(0.3) + 2(0.21) + 3(0.147) + 4(0.1029) + 5(0.2401) ≈ 2.7731.

Similarly,

Var(X) = Σ_{x=1}^{5} (x − E(X))² f(x) ≈ 2.42182 and SD(X) = √Var(X) ≈ 1.55622.



Table 2.2: Probability distribution for the oil drilling example.

f(x) = P(X = x) Comments:<br />

f(1) = 0.3 Success on first try<br />

f(2) = (0.7)0.3 = 0.21 Failure followed by success<br />

f(3) = (0.7) 2 0.3 = 0.147 Two failures followed by success<br />

f(4) = (0.7) 3 0.3 = 0.1029 Three failures followed by success<br />

f(5) = (0.7) 4 0.3 + (0.7) 5 = 0.2401 Four failures followed by success<br />

or failure on all 5 tries<br />
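The mean, variance, and standard deviation for the oil drilling example can be reproduced with a few lines (assumed code, not from the notes):

```python
from math import sqrt

# PDF of the number of holes drilled, as in Table 2.2
f = {1: 0.3, 2: 0.7 * 0.3, 3: 0.7 ** 2 * 0.3, 4: 0.7 ** 3 * 0.3, 5: 0.7 ** 4 * 0.3 + 0.7 ** 5}

mean = sum(x * p for x, p in f.items())                   # about 2.7731
var  = sum((x - mean) ** 2 * p for x, p in f.items())     # about 2.4218
print(mean, var, sqrt(var))                               # SD about 1.5562
```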

Properties of variance and standard deviation. Properties of sums can be used to<br />

prove the following properties of the variance and standard deviation:<br />

1. V ar(X) = E(X 2 ) − (E(X)) 2 .<br />

2. If Y = a + bX, then V ar(Y ) = b 2 V ar(X) and SD(Y ) = |b|SD(X).<br />

The first property provides a quick by-hand method <strong>for</strong> computing the variance. For example,<br />

the variance of X in the nickels and dimes example is 39/14 − (3/2) 2 = 15/28<br />

(confirming the result <strong>for</strong> hypergeometric random variables given in Table 2.1).<br />

2.3.4 Chebyshev’s Inequality<br />

The Chebyshev inequality illustrates the importance of the concepts of mean, variance, and<br />

standard deviation.<br />

Theorem 2.9 (Chebyshev’s Inequality) Let X be a random variable with finite mean<br />

µ and standard deviation σ, and let ε be a positive constant. Then

P(µ − ε < X < µ + ε) > 1 − σ²/ε².

In particular, if ε = kσ for some k, then

P(µ − kσ < X < µ + kσ) > 1 − 1/k².

For example, if ε = 2.5σ, then Chebyshev's inequality states that

• At least 84% of the distribution of X is in the interval |x − µ| < 2.5σ and<br />

• At most 16% of the distribution is in the complementary interval |x − µ| ≥ 2.5σ.<br />



Proof of Chebyshev's inequality: Using complements, it is sufficient to demonstrate that

P(|X − µ| ≥ ε) ≤ σ²/ε².

Since σ² = E((X − µ)²) = Σ_x (x − µ)² f(x), where f(x) is the PDF of X, and the sum of
two nonnegative terms is greater than or equal to either term,

σ² = Σ_{|x−µ|≥ε} (x − µ)² f(x) + Σ_{|x−µ|<ε} (x − µ)² f(x) ≥ Σ_{|x−µ|≥ε} (x − µ)² f(x) ≥ ε² Σ_{|x−µ|≥ε} f(x) = ε² P(|X − µ| ≥ ε).

Dividing both sides by ε² gives the result.


3 Joint Discrete Distributions<br />

A probability distribution describing the joint variability of two or more random variables<br />

is called a joint distribution.<br />

For example, if X is the height (in feet), Y is the weight (in pounds), and Z is the serum<br />

cholesterol level (in mg/dL) of a person chosen from a given population, then we may be<br />

interested in describing the joint distribution of the triple (X,Y,Z).<br />

3.1 Bivariate Distributions<br />

A bivariate distribution is the joint distribution of a pair of random variables.<br />

3.1.1 Joint PDF<br />

Assume that X and Y are discrete random variables. The joint frequency function (joint<br />

FF) or joint probability density function (joint PDF) of (X,Y ) is defined as follows:<br />

f(x,y) = P(X = x,Y = y) <strong>for</strong> all real pairs (x,y),<br />

where the comma is understood to mean the intersection of the events. The notation<br />

fXY (x,y) is sometimes used to emphasize the two random variables.<br />

3.1.2 Joint CDF<br />

Assume that X and Y are discrete random variables. The joint cumulative distribution<br />

function (joint CDF) of (X,Y ) is defined as follows:<br />

F(x,y) = P(X ≤ x,Y ≤ y) <strong>for</strong> all real pairs (x,y),<br />

where the comma is understood to mean the intersection of the events. The notation<br />

FXY (x,y) is sometimes used to emphasize the two random variables.<br />

3.1.3 Marginal Distributions<br />

Let X and Y be discrete random variables with joint PDF f(x,y). The marginal frequency<br />

function (marginal FF) or marginal probability density function (marginal PDF) of X is<br />

fX(x) = Σ_y f(x, y) for all real numbers x,

where the sum is taken over all y in the range of Y. Similarly, the marginal frequency
function or marginal PDF of Y is

fY(y) = Σ_x f(x, y) for all real numbers y,

where the sum is taken over all x in the range of X.



3.1.4 Conditional Distributions<br />

Let X and Y be discrete random variables with joint PDF f(x,y).<br />

If fY(y) ≠ 0, then the conditional frequency function (conditional FF) or conditional probability
density function (conditional PDF) of X given Y = y is defined as follows:

f_{X|Y=y}(x|y) = f(x, y) / fY(y) for all real numbers x.

Similarly, if fX(x) ≠ 0, then the conditional PDF of Y given X = x is

f_{Y|X=x}(y|x) = f(x, y) / fX(x) for all real numbers y.

Note that in the first case, the conditional sample space is the collection of outcomes with<br />

Y = y; in the second case, it is the collection of outcomes with X = x.<br />

Example 3.1 (Eyesight Study, continued) Consider again the eyesight study from<br />

page 19. Let X be the score <strong>for</strong> the right eye and Y be the score <strong>for</strong> the left eye. Then<br />

Table 1.2 displays the joint (X,Y ) distribution. The probability that both eyes have the<br />

same score, <strong>for</strong> example, is<br />

P(X = Y ) = f(1,1) + f(2,2) + f(3,3) + f(4,4) = 0.708.<br />

The marginal distribution of X is given in the right column of Table 1.2 and the marginal<br />

distribution of Y is given in the bottom row of the table.<br />

Figure 1.3 displays the marginal distribution of Y (in gray), the conditional distribution<br />

of Y given X = 1 (left plot, in black), and the conditional distribution of Y given X = 4<br />

(right plot, in black).<br />
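Marginal and conditional distributions are simple sums and ratios of the joint PDF. The following sketch (assumed code, not from the notes) uses the joint distribution of Table 1.2:

```python
# joint[(x, y)] = P(X = x, Y = y) for right-eye score x and left-eye score y
joint = {
    (1, 1): 0.203, (1, 2): 0.036, (1, 3): 0.017, (1, 4): 0.009,
    (2, 1): 0.031, (2, 2): 0.202, (2, 3): 0.058, (2, 4): 0.010,
    (3, 1): 0.016, (3, 2): 0.048, (3, 3): 0.237, (3, 4): 0.027,
    (4, 1): 0.005, (4, 2): 0.011, (4, 3): 0.024, (4, 4): 0.066,
}

fX = {x: sum(p for (a, b), p in joint.items() if a == x) for x in range(1, 5)}
fY = {y: sum(p for (a, b), p in joint.items() if b == y) for y in range(1, 5)}
condY_given_X1 = {y: joint[(1, y)] / fX[1] for y in range(1, 5)}   # conditional PDF of Y given X = 1
print(fX, fY, condY_given_X1)
```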

Example 3.2 (Urn Experiment) Assume that an urn contains 4 red, 3 white and 1 blue<br />

chip. Consider the following 2 step experiment:<br />

Step 1. Thoroughly mix the contents of the urn. Choose a chip and record its color.<br />

Return the chip plus two more chips of the same color to the urn.<br />

Step 2. Thoroughly mix the contents of the urn. Choose a chip and record its color.<br />

Let X equal the number of red chips chosen and Y equal the number of white chips chosen.<br />

Then Table 3.1 displays the joint (X,Y ) distribution. The marginal distribution of X is<br />

given in the right column and the marginal distribution of Y is given in the bottom row.<br />

The conditional distribution of Y given X = 1, for example, is

f_{Y|X=1}(0|1) = 0.1/0.4 = 1/4, f_{Y|X=1}(1|1) = 0.3/0.4 = 3/4,

and f_{Y|X=1}(y|1) = 0 otherwise.



Table 3.1: Joint distribution for the urn experiment example.

        y = 0    y = 1    y = 2    total
x = 0   0.0375   0.0750   0.1875   0.3
x = 1   0.1000   0.3000   0        0.4
x = 2   0.3000   0        0        0.3
total   0.4375   0.3750   0.1875   1.0

3.1.5 Independent Random Variables

The discrete random variables X and Y are said to be independent if<br />

f(x,y) = fX(x)fY (y) <strong>for</strong> all real pairs (x,y).<br />

Otherwise, X and Y are said to be dependent.<br />

The discrete random variables X and Y are independent if the probability of the intersection is
equal to the product of the probabilities,

P(X = x, Y = y) = P(X = x) P(Y = y),

for all events of interest (for all x, y).

Example 3.3 (Independent Bernoulli Random Variables) Assume that (X,Y ) has<br />

the joint distribution given in the following table:<br />

y = 0 y = 1 total<br />

x = 0 0.28 0.42 0.70<br />

x = 1 0.12 0.18 0.30<br />

total 0.40 0.60 1.00<br />

X is a Bernoulli random variable with parameter 0.30 and Y is a Bernoulli random variable<br />

with parameter 0.60. Further, since each probability in the body of the 2-by-2 table is the<br />

product of the appropriate marginal probabilities, X and Y are independent.<br />

Example 3.4 (Eyesight Study, continued) The random variables X and Y in the eyesight
study are dependent. To demonstrate this, we need to show that f(x, y) ≠ fX(x)fY(y)
for some pair (x, y). For example,

f(1, 1) = 0.203 ≠ 0.265 × 0.255 = fX(1)fY(1).

Example 3.5 (Urn Experiment, continued) Similarly, the random variables X and Y
in the urn experiment are dependent. For example,

f(1, 1) = 0.3 ≠ 0.4 × 0.375 = fX(1)fY(1).



3.1.6 Mathematical Expectation for Finite Bivariate Distributions

Let X and Y be discrete random variables with joint PDF f(x,y) = P(X = x,Y = y),<br />

and let g(X,Y ) be a real-valued function. The mean of g(X,Y ) (or the expected value of<br />

g(X,Y ) or the expectation of g(X,Y )) is<br />

E(g(X, Y)) = Σ_x Σ_y g(x, y) f(x, y),

where the double sum includes all pairs with nonzero joint PDF.

Example 3.6 (Eyesight Study, continued) Continuing with the eyesight study, let<br />

g(X,Y ) = |X − Y | be the absolute value of the difference in scores <strong>for</strong> the two eyes. Then<br />

E(g(X, Y)) = Σ_{x=1}^{4} Σ_{y=1}^{4} |x − y| f(x, y) = 0.374.

(The average absolute difference in scores <strong>for</strong> the two eyes is small.)<br />

Properties. Properties of sums and the fact that the joint PDF of independent random<br />

variables equals the product of the marginal PDFs can be used to show the following:<br />


1. If a and b1,b2,... ,bn are constants and gi(X,Y ) are real-valued functions <strong>for</strong> each<br />

i = 1, 2, ..., n, then

E(a + Σ_{i=1}^{n} bi gi(X, Y)) = a + Σ_{i=1}^{n} bi E(gi(X, Y)).

2. If X and Y are independent and g(X) and h(Y) are real-valued functions, then

E(g(X)h(Y)) = E(g(X))E(h(Y)).

The first property generalizes properties from Section 2.3.2. The second property is useful<br />

when studying associations between variables.<br />

3.1.7 Covariance and Correlation<br />

Let X and Y be random variables with finite means (µx, µy) and finite standard deviations<br />

(σx, σy). The covariance of X and Y , Cov(X,Y ), is defined as follows:<br />

Cov(X,Y ) = E((X − µx)(Y − µy)).<br />

The notation σxy = Cov(X,Y ) is used to denote the covariance. The correlation of X and<br />

Y , Corr(X,Y ), is defined as follows:<br />

Corr(X, Y) = Cov(X, Y) / (σx σy) = σxy / (σx σy).

The notation ρ = Corr(X,Y ) is used to denote the correlation of X and Y ; ρ is called the<br />

correlation coefficient.<br />



Positive and negative association. Covariance and correlation are measures of the<br />

association between two random variables. Specifically,<br />

1. The random variables X and Y are said to be positively associated if as X increases,<br />

Y tends to increase. If X and Y are positively associated, then Cov(X,Y ) and<br />

Corr(X,Y ) will be positive.<br />

2. The random variables X and Y are said to be negatively associated if as X increases,<br />

Y tends to decrease. If X and Y are negatively associated, then Cov(X,Y ) and<br />

Corr(X,Y ) will be negative.<br />

For example, the height and weight of individuals in a given population are positively<br />

associated. Educational level and indices of poor health are often negatively associated.<br />

Properties of covariance. The following properties of covariance are important:<br />

1. Cov(X,X) = V ar(X).<br />

2. Cov(X,Y ) = Cov(Y,X).<br />

3. Cov(a + bX,c + dY ) = bdCov(X,Y ), where a, b, c, and d are constants.<br />

4. Cov(X,Y ) = E(XY ) − E(X)E(Y ).<br />

5. If X and Y are independent, then Cov(X,Y ) = 0.<br />

The first two properties follow immediately from the definition of covariance. The third<br />

property relates the covariance of linearly trans<strong>for</strong>med random variables to the covariance<br />

of the original random variables. In particular, if b = d = 1 (values are shifted only), then<br />

the covariance is unchanged; if a = c = 0 and b,d > 0 (values are re-scaled), then the<br />

covariance is multiplied by bd.<br />

The fourth property gives an alternative method <strong>for</strong> calculating the covariance; the method<br />

is particularly well-suited <strong>for</strong> by-hand computations. Since E(XY ) = E(X)E(Y ) <strong>for</strong> independent<br />

random variables, the fourth property can be used to prove that the covariance of<br />

independent random variables is zero (property 5).<br />

Properties of correlation. The following properties of correlation are important:<br />

1. −1 ≤ Corr(X,Y ) ≤ 1.<br />

2. If a, b, c, and d are constants, then<br />

Corr(a + bX, c + dY) = Corr(X, Y) when bd > 0, and −Corr(X, Y) when bd < 0.

In particular, Corr(X,a + bX) equals 1 if b > 0 and equals −1 if b < 0.<br />

3. If X and Y are independent, then Corr(X,Y ) = 0.<br />



Correlation is often called standardized covariance since, by the first property, its values<br />

always lie in the [−1,1] interval. If the correlation is close to −1, then there is a strong<br />

negative association between the variables; if the correlation is close to 1, then there is a<br />

strong positive association between the variables.<br />

The second property relates the correlation of linearly trans<strong>for</strong>med variables to the correlation<br />

of the original variables. Note, in particular, that the correlation is unchanged if the<br />

random variables are shifted (b = d = 1) or if the random variables are re-scaled (a = c = 0<br />

and b,d > 0). For example, the correlation between the height and weight of individuals in<br />

a population is the same no matter which measurement scale is used <strong>for</strong> height (e.g. inches,<br />

feet) and which measurement scale is used <strong>for</strong> weight (e.g. pounds, kilograms).<br />

If Corr(X,Y ) = 0, then X and Y are said to be uncorrelated; otherwise, they are said to<br />

be correlated. The third property says that independent random variables are uncorrelated.<br />

Note that uncorrelated random variables are not necessarily independent.<br />

Example 3.7 (Eyesight Study, continued) For (X,Y ) from the eyesight study,<br />

E(X) = 2.275, E(Y ) = 2.305, E(X 2 ) = 6.117, E(Y 2 ) = 6.259, E(XY ) = 5.905.<br />

Using properties of variance and covariance:

Var(X) ≈ 0.941, Var(Y) ≈ 0.946, Cov(X, Y) ≈ 0.661.

Finally, ρ = Corr(X, Y) ≈ 0.701.
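These values follow directly from the moments listed above; a short sketch (assumed code, not from the notes) using the property Cov(X, Y) = E(XY) − E(X)E(Y):

```python
from math import sqrt

EX, EY, EX2, EY2, EXY = 2.275, 2.305, 6.117, 6.259, 5.905

var_x = EX2 - EX ** 2              # about 0.941
var_y = EY2 - EY ** 2              # about 0.946
cov   = EXY - EX * EY              # about 0.661
rho   = cov / sqrt(var_x * var_y)  # about 0.701
print(var_x, var_y, cov, rho)
```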

Example 3.8 (Uncorrelated but Dependent) Assume that the random pair (X,Y )<br />

has the joint distribution given in the following table:<br />

y = −1 y = 0 y = 1 total<br />

x = −1 0.05 0.10 0.05 0.20<br />

x = 0 0.10 0.40 0.10 0.60<br />

x = 1 0.05 0.10 0.05 0.20<br />

total 0.20 0.60 0.20 1.00<br />

Since E(X) = E(Y ) = E(XY ) = 0, the covariance and correlation must equal 0.<br />

But X and Y are dependent since, for example, f(0, 0) ≠ fX(0)fY(0).

3.1.8 Conditional Expectation and Regression<br />

Let X and Y be discrete random variables with finite ranges and joint PDF f(x,y).<br />

If fX(x) ≠ 0, then the conditional expectation (or the conditional mean) of Y given X = x,
E(Y|X = x), is defined as follows:

E(Y|X = x) = Σ_y y f_{Y|X=x}(y|x),

where the sum is over all y with nonzero conditional PDF (f_{Y|X=x}(y|x) ≠ 0).



      y1     y2     y3     y4     y5
x0    0.02   0.02   0.02   0.01   0.01
x1    0.02   0.02   0.01   0.04   0.04
x2    0.02   0.01   0.04   0.04   0.10
x3    0.01   0.04   0.04   0.10   0.10
x4    0.01   0.04   0.10   0.10   0.04

Figure 3.1: Joint (X, Y) distribution (left) and plot of (x, E(Y|X = x)) pairs (right) for the
conditional expectation example.

The conditional expectation of X given Y = y, E(X|Y = y), is defined similarly.<br />

Example 3.9 (Conditional Expectation) Let (X,Y ) be the random pair whose joint<br />

distribution is shown in the left part of Figure 3.1. Then

E(Y|X = 0) = 1(.02/.08) + 2(.02/.08) + 3(.02/.08) + 4(.01/.08) + 5(.01/.08) = 2.625.

Similarly,<br />

E(Y |X = 1) = 3.462, E(Y |X = 2) = 3.905, E(Y |X = 3) = 3.828 and E(Y |X = 4) = 3.414.<br />

The right part of Figure 3.1 is a plot of pairs (x,E(Y |X = x)) <strong>for</strong> x = 0,1,2,3,4.<br />
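The regression of Y on X can be computed directly from the joint PDF. A sketch (assumed code, not from the notes) using the distribution in Figure 3.1 (rows x = 0, ..., 4; columns y = 1, ..., 5):

```python
rows = [
    [0.02, 0.02, 0.02, 0.01, 0.01],   # x = 0
    [0.02, 0.02, 0.01, 0.04, 0.04],   # x = 1
    [0.02, 0.01, 0.04, 0.04, 0.10],   # x = 2
    [0.01, 0.04, 0.04, 0.10, 0.10],   # x = 3
    [0.01, 0.04, 0.10, 0.10, 0.04],   # x = 4
]

for x, row in enumerate(rows):
    fx = sum(row)                                                 # marginal P(X = x)
    cond_mean = sum((y + 1) * p for y, p in enumerate(row)) / fx  # E(Y | X = x)
    print(x, round(cond_mean, 3))    # 2.625, 3.462, 3.905, 3.828, 3.414
```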

Regression equation. The <strong>for</strong>mula <strong>for</strong> the conditional expectation<br />

E(Y |X = x) as a function of x<br />

is called the regression equation of Y on X. Similarly, the <strong>for</strong>mula <strong>for</strong> the conditional<br />

expectation E(X|Y = y) as a function of y is called the regression equation of X on Y .<br />

Linear conditional means. Let X and Y be random variables with finite means (µx,<br />

µy), standard deviations (σx, σy), and correlation (ρ).<br />

If E(Y|X = x) is a linear function of x, then the formula is of the form:

E(Y|X = x) = µy + ρ (σy/σx) (x − µx).

Similarly, if E(X|Y = y) is a linear function of y, then the formula is of the form:

E(X|Y = y) = µx + ρ (σx/σy) (y − µy).


3.1.9 Example: Bivariate Hypergeometric Distribution<br />

Let n, M1, M2, and M3 be positive integers with n ≤ M1 + M2 + M3. The random<br />

pair (X,Y ) is said to have a bivariate hypergeometric distribution with parameters n and<br />

(M1,M2,M3) if its joint PDF has the following <strong>for</strong>m:<br />

f(x, y) = C(M1, x) C(M2, y) C(M3, n − x − y) / C(M1 + M2 + M3, n)

when x and y are nonnegative integers with

x ≤ min(n, M1), y ≤ min(n, M2) and max(0, n − M3) ≤ x + y ≤ n,

and is equal to zero otherwise. Note that the restrictions on x and y ensure that all
combinations are defined and positive.

Example 3.10 (n = 4, M1 = 2, M2 = 3, M3 = 5) Assume that (X,Y ) has a bivariate<br />

hypergeometric distribution with parameters n = 4, M1 = 2, M2 = 3 and M3 = 5. The<br />

joint PDF is<br />

f(x, y) = C(2, x) C(3, y) C(5, 4 − x − y) / C(10, 4), x = 0, 1, 2; y = 0, 1, ..., min(3, 4 − x),

and 0 otherwise. The joint (X,Y ) distribution and the marginal X and Y distributions are<br />

displayed in the following table:<br />

y = 0 y = 1 y = 2 y = 3 total<br />

x = 0 0.0238 0.1429 0.1429 0.0238 0.3334<br />

x = 1 0.0952 0.2857 0.1429 0.0095 0.5333<br />

x = 2 0.0476 0.0714 0.0143 0.1333<br />

total 0.1666 0.5000 0.3001 0.0333 1.0000<br />

Urn experiments. Bivariate hypergeometric distributions are used to model urn experiments. Specifically, suppose that an urn contains N objects: M1 of type 1, M2 of type 2, and M3 of type 3 (N = M1 + M2 + M3). Let X equal the number of objects of type 1 and Y equal the number of objects of type 2 in a subset of size n chosen from the urn. If each choice of subset is equally likely, then (X,Y ) has a bivariate hypergeometric distribution with parameters n and (M1, M2, M3). In the example above, 20% (2/10) of the objects in the urn are of type 1, 30% (3/10) are of type 2 and 50% (5/10) are of type 3.

Negative association. If (X,Y ) has a bivariate hypergeometric distribution, then X and Y are negatively associated, with correlation

ρ = Corr(X,Y ) = −√( (M1/(M2 + M3)) (M2/(M1 + M3)) ).


Marginal and conditional distributions. If (X,Y ) has a bivariate hypergeometric distribution, then

1. X has a hypergeometric distribution with parameters n, M1, and N; and

2. Y has a hypergeometric distribution with parameters n, M2, and N.

In addition, each conditional distribution is hypergeometric. Finally, bivariate hypergeometric distributions have linear conditional means.

3.1.10 Example: Trinomial Distribution

Let n be a positive integer, and let p1, p2, and p3 be positive proportions with sum 1. The random pair (X,Y ) is said to have a trinomial distribution with parameters n and (p1, p2, p3) when its joint PDF has the following form:

f(x,y) = (n! / (x! y! (n − x − y)!)) p1^x p2^y p3^(n−x−y)

when x = 0, 1, ..., n; y = 0, 1, ..., n; x + y ≤ n, and is equal to zero otherwise.

Example 3.11 (n = 4, p1 = 0.2, p2 = 0.3, p3 = 0.5) Assume that (X,Y ) has a trinomial distribution with parameters n = 4, p1 = 0.2, p2 = 0.3 and p3 = 0.5. The joint PDF is

f(x,y) = (4! / (x! y! (4 − x − y)!)) (0.2)^x (0.3)^y (0.5)^(4−x−y)   when x, y = 0, 1, 2, 3, 4; x + y ≤ 4,

and 0 otherwise. The joint (X,Y ) distribution and the marginal X and Y distributions are displayed in the following table:

          y = 0     y = 1     y = 2     y = 3     y = 4     total
x = 0     0.0625    0.1500    0.1350    0.0540    0.0081    0.4096
x = 1     0.1000    0.1800    0.1080    0.0216              0.4096
x = 2     0.0600    0.0720    0.0216                        0.1536
x = 3     0.0160    0.0096                                  0.0256
x = 4     0.0016                                            0.0016
total     0.2401    0.4116    0.2646    0.0756    0.0081    1.0000
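The same check works for the trinomial table. A small sketch (again in Python, with the parameter values of Example 3.11) follows:

    from math import factorial

    # Sketch: trinomial PDF for n = 4, (p1, p2, p3) = (0.2, 0.3, 0.5).
    n, p1, p2, p3 = 4, 0.2, 0.3, 0.5
    def trinomial(x, y):
        z = n - x - y
        coef = factorial(n) // (factorial(x) * factorial(y) * factorial(z))
        return coef * p1**x * p2**y * p3**z
    for x in range(n + 1):
        print(x, [round(trinomial(x, y), 4) for y in range(n - x + 1)])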

Experiments with three outcomes. Trinomial distributions are used to model experiments with exactly three outcomes. Specifically, suppose that an experiment has three outcomes, which occur with probabilities p1, p2, and p3, respectively. Let X be the number of occurrences of outcome 1 and Y be the number of occurrences of outcome 2 in n independent trials of the experiment. Then (X,Y ) has a trinomial distribution with parameters n and (p1, p2, p3). In the example above, the probabilities of the three outcomes are 0.2, 0.3 and 0.5, respectively.


Negative association. If (X,Y ) has a trinomial distribution, then X and Y are negatively associated, with correlation

ρ = Corr(X,Y ) = −√( (p1/(1 − p1)) (p2/(1 − p2)) ).

Marginal and conditional distributions. If (X,Y ) has a trinomial distribution, then

1. X has a binomial distribution with parameters n and p1, and

2. Y has a binomial distribution with parameters n and p2.

In addition, each conditional distribution is binomial. Finally, trinomial distributions have linear conditional means.

3.1.11 Simple Random Samples

The results of Section 2.2.4 can be generalized. In particular, trinomial probabilities can often be used to approximate bivariate hypergeometric probabilities when N is large enough, and each family of distributions can be used in survey analysis.

Example 3.12 (Survey Analysis) For example, suppose that a surveyor is interested in determining the level of support for a proposal to change the local tax structure, and decides to choose a simple random sample of size 10 from the registered voter list. If there are a total of 120 registered voters, where one-third support the proposal, one-half oppose the proposal, and one-sixth have no opinion, then the probability that exactly 3 support, 5 oppose, and 2 have no opinion is

P(X = 3, Y = 5) = C(40, 3) C(60, 5) C(20, 2) / C(120, 10) ≈ 0.088.

If there are thousands of registered voters, then the probability is approximately

P(X = 3, Y = 5) ≈ (10! / (3! 5! 2!)) (1/3)^3 (1/2)^5 (1/6)^2 ≈ 0.081.

As before, you do not need to know the exact number of registered voters when you use the trinomial approximation.
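To see how close the two answers are, here is a sketch that computes both the exact bivariate hypergeometric probability and its trinomial approximation for this survey (counts 40/60/20 out of N = 120, sample size 10):

    from math import comb, factorial

    # Sketch: exact probability vs. trinomial approximation for Example 3.12.
    exact = comb(40, 3) * comb(60, 5) * comb(20, 2) / comb(120, 10)
    coef = factorial(10) // (factorial(3) * factorial(5) * factorial(2))
    approx = coef * (1/3)**3 * (1/2)**5 * (1/6)**2
    print(round(exact, 3), round(approx, 3))    # approximately 0.088 and 0.081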

3.2 Multivariate Distributions

A multivariate distribution is the joint distribution of k random variables. Ideas studied in the bivariate case (k = 2) can be generalized to the case where k > 2.


3.2.1 Joint PDF and Joint CDF

Let X1, X2, ..., Xk be discrete random variables and let

X = (X1, X2, ..., Xk).

The joint frequency function (joint FF) or joint probability density function (joint PDF) of X is defined as follows:

f(x) = P(X1 = x1, X2 = x2, ..., Xk = xk)

for all real k-tuples x = (x1, x2, ..., xk), where commas are understood to mean the intersection of events.

The joint cumulative distribution function (joint CDF) of X is defined as follows:

F(x) = P(X1 ≤ x1, X2 ≤ x2, ..., Xk ≤ xk)

for all real k-tuples x = (x1, x2, ..., xk), where commas are understood to mean the intersection of events.

Example 3.13 (k = 3) Assume that (X,Y,Z) has joint PDF

f(x,y,z) = 5 / (28(x + y + z))   when x, y = 0, 1 and z = 1, 2, 3, 4, and 0 otherwise.

The following table shows the joint (X,Y,Z) distribution. The marginal (X,Y ) distribution is displayed in the right column and the marginal Z distribution is displayed in the bottom row. A common denominator is used throughout.

                  z = 1      z = 2      z = 3      z = 4
x = 0   y = 0    60/336     30/336     20/336     15/336     125/336
        y = 1    30/336     20/336     15/336     12/336      77/336
x = 1   y = 0    30/336     20/336     15/336     12/336      77/336
        y = 1    20/336     15/336     12/336     10/336      57/336
                140/336     85/336     62/336     49/336        1

X and Y have the same marginal distribution. Namely,

fX(0) = fY(0) = 101/168 and fX(1) = fY(1) = 67/168

(and 0 otherwise).

The probability that X + Y + Z is at most 3, for example, is

P(X + Y + Z ≤ 3) = Σ_{x=0}^{1} Σ_{y=0}^{1} Σ_{z=1}^{3−x−y} f(x,y,z) = 115/168 ≈ 0.685.
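These fractions can be verified by brute force. A sketch (using Python's Fraction type so the arithmetic stays exact) follows:

    from fractions import Fraction

    # Sketch: check the total, a marginal, and P(X + Y + Z <= 3) for Example 3.13.
    def f(x, y, z):
        return Fraction(5, 28 * (x + y + z))
    triples = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (1, 2, 3, 4)]
    total = sum(f(*t) for t in triples)
    fX0 = sum(f(x, y, z) for (x, y, z) in triples if x == 0)
    prob = sum(f(*t) for t in triples if sum(t) <= 3)
    print(total, fX0, prob)    # 1, 101/168, 115/168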


3.2.2 Mutually Independent Random Variables and Random Samples

The discrete random variables X1, X2, ..., Xk are said to be mutually independent (or independent) if

f(x) = f1(x1) f2(x2) · · · fk(xk) for all real k-tuples x = (x1, x2, ..., xk),

where fi(xi) = P(Xi = xi) for i = 1, 2, ..., k. (The probability of the intersection is equal to the product of the probabilities for all events of interest.)

Equivalently, X1, X2, ..., Xk are said to be mutually independent if

F(x) = F1(x1) F2(x2) · · · Fk(xk) for all real k-tuples x = (x1, x2, ..., xk),

where Fi(xi) = P(Xi ≤ xi) for i = 1, 2, ..., k.

X1, X2, ..., Xk are said to be dependent if they are not mutually independent.

Example 3.14 (k = 3, continued) The random variables X, Y and Z in the previous example are dependent. To see this, we just need to show that f(x,y,z) ≠ fX(x) fY(y) fZ(z) for some triple (x,y,z). For example, f(1,1,1) ≠ fX(1) fY(1) fZ(1).

Random sample. If the discrete random variables X1, X2, ..., Xk are mutually independent and have a common distribution (each marginal PDF is the same), then the random variables X1, X2, ..., Xk are said to be a random sample from that distribution.

3.2.3 Mathematical Expectation for Finite Multivariate Distributions

Let X1, X2, ..., Xk be discrete random variables with joint PDF f(x), and let g(X) be a real-valued k-variable function. The mean (or expected value or expectation) of g(X) is

E(g(X)) = Σ_{x1} Σ_{x2} · · · Σ_{xk} g(x1, x2, ..., xk) f(x1, x2, ..., xk),

where the multiple sum includes all k-tuples with nonzero joint PDF.

Example 3.15 (Maximum Roll) You roll a fair die three times. Let Xi be the number on the top face after the i-th roll and g(X1, X2, X3) be the maximum of the three numbers. X1, X2, X3 is a random sample from a uniform distribution with parameter 6. The joint PDF of the random triple is

f(x1, x2, x3) = 1/216   when x1, x2, x3 = 1, 2, 3, 4, 5, 6 and 0 otherwise,

and the expected value of the maximum of the three rolls reduces to

E(g(X1, X2, X3)) = 1(1/216) + 2(7/216) + 3(19/216) + 4(37/216) + 5(61/216) + 6(91/216) = 119/24 ≈ 4.96.
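The expectation can also be found by enumerating all 216 equally likely triples; a minimal sketch:

    from fractions import Fraction
    from itertools import product

    # Sketch: expected maximum of three fair-die rolls by direct enumeration.
    expected = sum(Fraction(max(t), 216) for t in product(range(1, 7), repeat=3))
    print(expected, float(expected))    # 119/24, approximately 4.958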


Properties of expectation. Properties in the multivariate case directly generalize properties in the univariate and bivariate cases (Sections 2.3.2 and 3.1.6).

3.2.4 Example: Multivariate Hypergeometric Distribution

Suppose that an urn contains N objects:

M1 of type 1, M2 of type 2, ..., and Mk of type k (M1 + M2 + · · · + Mk = N).

Let Xi be the number of objects of type i in a subset of size n chosen from the urn. If each choice of subset is equally likely, then the random k-tuple (X1, X2, ..., Xk) is said to have a multivariate hypergeometric distribution with parameters n and (M1, M2, ..., Mk).

The joint PDF for the k-tuple is

f(x1, x2, ..., xk) = C(M1, x1) C(M2, x2) · · · C(Mk, xk) / C(N, n)

when each xi is a nonnegative integer satisfying xi ≤ min(n, Mi) and Σ_i xi = n (and zero otherwise).

Note that

1. If X is a hypergeometric random variable with parameters n, M, N, then (X, n − X) has a multivariate hypergeometric distribution with parameters n and (M, N − M).

2. If (X,Y ) is bivariate hypergeometric with parameters n and (M1, M2, M3), then (X, Y, n − X − Y ) is multivariate hypergeometric with parameters n and (M1, M2, M3).

3. If (X1, X2, ..., Xk) has a multivariate hypergeometric distribution, then

(a) For each i, Xi is a hypergeometric random variable with parameters n, Mi, N.

(b) For each i ≠ j, (Xi, Xj) is bivariate hypergeometric with parameters n and (Mi, Mj, N − Mi − Mj). In particular, Xi and Xj are negatively associated.

Example 3.16 (Physical Activity) Suppose there are 600 women and 400 men living in an adult community. Among the women, 180 are physically active and 420 are not. Among the men, 120 are physically active and 280 are not.

Let X1 be the number of women who are physically active, X2 be the number of women who are not physically active, X3 be the number of men who are physically active and X4 be the number of men who are not physically active in a simple random sample of size 15 chosen from those living in the community. Then, for example,

P(X1 = 3, X2 = 7, X3 = 1, X4 = 4) = C(180, 3) C(420, 7) C(120, 1) C(280, 4) / C(1000, 15) ≈ 0.0182.
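The probability above involves very large binomial coefficients, so it is convenient to compute it exactly by machine; a short sketch with the counts of Example 3.16:

    from math import comb

    # Sketch: multivariate hypergeometric probability from Example 3.16.
    counts = {180: 3, 420: 7, 120: 1, 280: 4}            # Mi : xi
    num = 1
    for M, x in counts.items():
        num *= comb(M, x)
    prob = num / comb(1000, 15)
    print(round(prob, 4))    # approximately 0.0182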


3.2.5 Example: Multinomial Distribution

Suppose that an experiment has exactly k outcomes, and let

pi be the probability of the i-th outcome, i = 1, 2, ..., k (p1 + p2 + · · · + pk = 1).

Let Xi be the number of occurrences of the i-th outcome in n independent trials of the experiment. Then the random k-tuple (X1, X2, ..., Xk) is said to have a multinomial distribution with parameters n and (p1, p2, ..., pk).

The joint PDF for the k-tuple is

f(x1, x2, ..., xk) = (n! / (x1! x2! · · · xk!)) p1^x1 p2^x2 · · · pk^xk

when x1, x2, ..., xk = 0, 1, ..., n and Σ_i xi = n (and zero otherwise).

Note that

1. If X is a binomial random variable with parameters n and p, then (X, n − X) has a multinomial distribution with parameters n and (p, 1 − p).

2. If (X,Y ) has a trinomial distribution with parameters n and (p1, p2, p3), then (X, Y, n − X − Y ) has a multinomial distribution with parameters n and (p1, p2, p3).

3. If (X1, X2, ..., Xk) has a multinomial distribution, then

(a) For each i, Xi is a binomial random variable with parameters n and pi.

(b) For each i ≠ j, (Xi, Xj) has a trinomial distribution with parameters n and (pi, pj, 1 − pi − pj). In particular, Xi and Xj are negatively associated.

Example 3.17 (Independence Model) Assume that A and B are independent events of interest to researchers. Let outcomes 1, 2, 3, 4 correspond to

"both A and B occur," "only A occurs," "only B occurs," "neither A nor B occurs,"

respectively. If P(A) = α and P(B) = β, then

(p1, p2, p3, p4) = (αβ, α(1 − β), (1 − α)β, (1 − α)(1 − β))

can be used as the list of probabilities in a multinomial model.

If P(A) = 0.6, P(B) = 0.3, and Xi is the number of occurrences of the i-th outcome in 15 independent trials (i = 1, 2, 3, 4), then, for example,

P(X1 = 3, X2 = 7, X3 = 1, X4 = 4) = (15! / (3! 7! 1! 4!)) (0.18)^3 (0.42)^7 (0.12)^1 (0.28)^4 ≈ 0.0179.
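A sketch of the last computation (cell probabilities 0.18, 0.42, 0.12, 0.28 follow from α = 0.6 and β = 0.3):

    from math import factorial

    # Sketch: multinomial probability from Example 3.17.
    x = [3, 7, 1, 4]
    p = [0.18, 0.42, 0.12, 0.28]
    coef = factorial(15)
    for xi in x:
        coef //= factorial(xi)                  # multinomial coefficient 15!/(3! 7! 1! 4!)
    prob = coef
    for xi, pi in zip(x, p):
        prob *= pi**xi
    print(round(prob, 4))    # approximately 0.0179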


3.2.6 Simple Random Samples

The results of Sections 2.2.4 and 3.1.11 can be directly generalized to the multivariate case. In particular, multinomial probabilities can be used to approximate multivariate hypergeometric probabilities when N is large enough, and each family of distributions can be used in survey analysis.

3.2.7 Sample Summaries

If X1, X2, ..., Xn is a random sample from a distribution with mean µ and standard deviation σ, then the sample mean, X̄, is the random variable

X̄ = (1/n)(X1 + X2 + · · · + Xn),

the sample variance, S², is the random variable

S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²,

and the sample standard deviation, S, is the positive square root of the sample variance.

The following theorem can be proven using properties of expectation:

Theorem 3.18 (Sample Summaries) If X̄ is the sample mean and S² is the sample variance of a random sample of size n from a distribution with mean µ and standard deviation σ, then

1. E(X̄) = µ and Var(X̄) = σ²/n.

2. E(S²) = σ².

Demonstration 1. To demonstrate that E(X̄) = E(X) = µ,

E(X̄) = E( (1/n)(X1 + X2 + · · · + Xn) )
      = (1/n)(E(X1) + E(X2) + · · · + E(Xn))
      = (1/n)(n E(X)) = (1/n)(nµ) = µ.

Demonstration 2. To demonstrate that Var(X̄) = Var(X)/n = σ²/n, first note that

Var(X̄) = E( (X̄ − µ)² ) = E( ( (1/n) Σ_{i=1}^{n} Xi − µ )² ) = E( ( (1/n) Σ_{i=1}^{n} (Xi − µ) )² ).


Now,

Var(X̄) = (1/n²) E( ( Σ_{i=1}^{n} (Xi − µ) )² )
        = (1/n²) [ Σ_i E((Xi − µ)²) + Σ_{i ≠ j} E((Xi − µ)(Xj − µ)) ].

Since the Xi's are independent, the last sum is zero. Finally,

Var(X̄) = (1/n²)(nσ² + 0) = σ²/n.

Demonstration 3. To demonstrate that E(S²) = Var(X) = σ², first note that

S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²
   = (1/(n − 1)) Σ_{i=1}^{n} (Xi² − 2X̄ Xi + X̄²)
   = (1/(n − 1)) ( Σ_{i=1}^{n} Xi² − 2X̄(nX̄) + nX̄² )     since Σ_{i=1}^{n} Xi = nX̄
   = (1/(n − 1)) ( Σ_{i=1}^{n} Xi² − nX̄² ).

Observe that

E(Xi²) = Var(X) + (E(X))² = σ² + µ²   and   E(X̄²) = Var(X̄) + (E(X̄))² = σ²/n + µ².

Finally,

E(S²) = (1/(n − 1)) ( Σ_{i=1}^{n} E(Xi²) − n E(X̄²) )
      = (1/(n − 1)) ( n(σ² + µ²) − n(σ²/n + µ²) )
      = (1/(n − 1)) ( nσ² + nµ² − σ² − nµ² )
      = (1/(n − 1)) (n − 1)σ² = σ².
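Theorem 3.18 is also easy to check by simulation. The following sketch assumes (arbitrarily) samples of size 5 from the uniform distribution on {1, ..., 6}, for which µ = 3.5 and σ² = 35/12:

    import random

    # Sketch: Monte Carlo check that E(sample mean) = mu and E(sample variance) = sigma^2.
    random.seed(1)
    n, reps = 5, 200_000
    means, variances = [], []
    for _ in range(reps):
        x = [random.randint(1, 6) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
        means.append(xbar)
        variances.append(s2)
    print(sum(means) / reps, sum(variances) / reps)    # close to 3.5 and 2.9167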

Sample correlation. A random sample of size n from the joint (X,Y ) distribution is a list of n mutually independent random pairs, each with the same distribution as (X,Y ).

If (X1,Y1), (X2,Y2), ..., (Xn,Yn) is a random sample of size n from a bivariate distribution with correlation ρ = Corr(X,Y ), then the sample correlation, R, is the random variable

R = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)² Σ_{i=1}^{n} (Yi − Ȳ)² ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively.


Applications. In statistical applications, observed values of the sample mean, sample variance, and sample correlation are used to estimate unknown values of the parameters µ, σ² and ρ, respectively.

Example 3.19 (n = 600) Assume that the following table summarizes the values of a random sample of size 600 from the joint (X,Y ) distribution:

        y=1  y=2  y=3  y=4  y=5  y=6  y=7  y=8  y=9  y=10  total
x = 0         2    1    5   12   15   15    7    2     1     60
x = 1         2    9   25   36   42   31   13               158
x = 2    3    4   15   39   55   41   21    5               183
x = 3    2    7   20   27   39   14    1                    110
x = 4    1   13   11   16   18    4                          63
x = 5    1    6    5    6    2                                20
x = 6    1    2    3                                           6
total    8   36   64  118  162  116   68   25    2     1    600

Two pairs of the form (0,2) were observed, one pair of the form (0,3) was observed, five pairs of the form (0,4) were observed, and so forth.

The sample mean of the x-coordinates is

x̄ = (1/600) Σ xi = (1/600)(0(60) + 1(158) + · · · + 6(6)) ≈ 2.07

and the sample variance is

s²x = (1/599) Σ (xi − x̄)² = (1/599)( (0 − x̄)²(60) + (1 − x̄)²(158) + · · · + (6 − x̄)²(6) ) ≈ 1.72464.

Similarly,

ȳ ≈ 4.92333 and s²y ≈ 2.49161.

The observed sample correlation is

r = −634.78 / √( (1033.06)(1492.47) ) ≈ −0.511219.

The results suggest that X and Y are negatively associated.
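All of the summaries in Example 3.19 can be computed directly from the frequency table. The sketch below stores the counts as rows indexed by x (blank cells become zeros); the printed values should agree with the text to rounding:

    # Sketch: sample summaries from the frequency table of Example 3.19.
    counts = [
        [0, 2, 1, 5, 12, 15, 15, 7, 2, 1],     # x = 0, counts for y = 1,...,10
        [0, 2, 9, 25, 36, 42, 31, 13, 0, 0],   # x = 1
        [3, 4, 15, 39, 55, 41, 21, 5, 0, 0],   # x = 2
        [2, 7, 20, 27, 39, 14, 1, 0, 0, 0],    # x = 3
        [1, 13, 11, 16, 18, 4, 0, 0, 0, 0],    # x = 4
        [1, 6, 5, 6, 2, 0, 0, 0, 0, 0],        # x = 5
        [1, 2, 3, 0, 0, 0, 0, 0, 0, 0],        # x = 6
    ]
    n = sum(map(sum, counts))                   # 600
    pairs = [(x, y + 1, c) for x, row in enumerate(counts) for y, c in enumerate(row)]
    xbar = sum(x * c for x, y, c in pairs) / n
    ybar = sum(y * c for x, y, c in pairs) / n
    sxx = sum(c * (x - xbar) ** 2 for x, y, c in pairs)
    syy = sum(c * (y - ybar) ** 2 for x, y, c in pairs)
    sxy = sum(c * (x - xbar) * (y - ybar) for x, y, c in pairs)
    r = sxy / (sxx * syy) ** 0.5
    print(round(xbar, 2), round(sxx / 599, 5), round(ybar, 5), round(syy / 599, 5), round(r, 6))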


4 Introductory Calculus Concepts

In introductory calculus, we study real-valued functions whose domain (set of input values) is a subset of the real numbers. We often use the shorthand

f : D ⊆ R → R,

where D is the domain and R is the set of real numbers, to denote such a function.

This chapter covers limits, derivatives, and integrals, together with applications in statistics.

4.1 Functions

This section reviews functions commonly used in calculus.

Power functions. A power function is a function with rule

f(x) = c x^p, where the coefficient c and the power p are constants.

(The values of f are proportional to a power of the input x.) The domain of f depends on the power p. For example, if p = 2, the domain of f is the set of all real numbers; if p = 1/2, the domain of f is the set of nonnegative real numbers.

Polynomial and rational functions. A polynomial function is a function with rule

f(x) = an x^n + an−1 x^(n−1) + · · · + a1 x + a0,

where the coefficients ai are constants, n is a nonnegative integer and x is a real number. If an ≠ 0, then n is the degree of the polynomial.

Special cases of polynomials include: if the degree of f is 0, then f is a constant function; if the degree is 1, then f is a linear function; if the degree is 2, then f is a quadratic function; if the degree is 3, then f is a cubic function.

A rational function is the ratio of two polynomials:

f(x) = p(x)/q(x), where p(x) and q(x) are polynomials.

The domain of f is the set of real numbers satisfying q(x) ≠ 0.

Exponential functions. An exponential function is a function with rule

f(x) = a^x, where a > 0 is a constant and x is a real number.

In this formula, a is the base and x is the exponent. The exponential function is strictly increasing when a > 1 and strictly decreasing when 0 < a < 1.


The exponential function with base e,

f(x) = e^x, where e ≈ 2.71828 is Euler's number,

is used most often in calculus. (e is the natural base for calculus.)

Exponential functions satisfy the following laws of exponents:

(1) a^(x+y) = a^x a^y,   (2) a^(−x) = 1/a^x,   (3) (a^x)^y = a^(xy),

where a > 0 is a positive number, and x and y are any real numbers.

Logarithmic functions. A logarithmic function is a function with rule

f(x) = log_a(x), where a is a positive constant and x is a positive real number.

As above, a is the base of the function. When a = e,

f(x) = log_e(x) = ln(x) is called the natural logarithm.

Exponential and logarithmic functions with the same base are related as follows:

y = log_a(x) ⟺ a^y = x,

where a and x are positive numbers and y is any real number. Thus,

log_a(a^x) = x for all x ∈ R and a^(log_a(x)) = x for all x > 0.

Logarithmic functions satisfy the following laws of logarithms:

(1) log_a(xy) = log_a(x) + log_a(y),   (2) log_a(x/y) = log_a(x) − log_a(y),   and   (3) log_a(x^r) = r log_a(x),

where a, x and y are positive numbers and r is any real number.

Using the natural base. Since a = e^(ln(a)) for any a > 0, we can write

a^x = (e^(ln(a)))^x = e^(x ln(a)), where x is a real number.

Similarly, suppose that y = log_a(x) (equivalently, x = a^y). Then

ln(x) = ln(a^y) = y ln(a) = log_a(x) ln(a)  ⟹  log_a(x) = ln(x)/ln(a).

Thus, exponential and logarithmic functions in base a can always be written in terms of the corresponding functions in the natural base.


Figure 4.1: A positive angle t (left plot) and a negative angle t (right plot). In each case, the point P(x,y) lies on the unit circle centered at the origin, x = cos(t) and y = sin(t).

Trigonometric functions. The trigonometric functions are defined using a unit circle centered at the origin and the radian measure of an angle, as illustrated in Figure 4.1. In the left part of the figure, t is the length of the arc of the unit circle measured counterclockwise from (1,0) to P(x,y); in the right part of the figure, t is the negative of the length of the arc measured clockwise from (1,0) to P(x,y).

The sine and cosine functions have the following rules:

sin(t) = y and cos(t) = x,

where t is any real number, and x and y are described above. Since the circumference of a circle of radius one is 2π, we know that

sin(t + k(2π)) = sin(t) and cos(t + k(2π)) = cos(t)

for every integer k and real number t.

The tangent, cotangent, secant and cosecant functions are defined in terms of sine and cosine. Specifically,

tan(t) = sin(t)/cos(t),   cot(t) = cos(t)/sin(t),   sec(t) = 1/cos(t)   and   csc(t) = 1/sin(t).

In each case, the domain of the function is all t for which the denominator is not 0.

Inverse trigonometric functions. Each trigonometric function has an inverse function:

1. Inverse sine function: For −1 ≤ z ≤ 1,

sin⁻¹(z) = t ⟺ sin(t) = z and −π/2 ≤ t ≤ π/2.

2. Inverse cosine function: For −1 ≤ z ≤ 1,

cos⁻¹(z) = t ⟺ cos(t) = z and 0 ≤ t ≤ π.

3. Inverse tangent function: For any real number z,

tan⁻¹(z) = t ⟺ tan(t) = z and −π/2 < t < π/2.

4. Inverse cotangent function: For any real number z,

cot⁻¹(z) = t ⟺ cot(t) = z and 0 < t < π.

5. Inverse secant function: For |z| ≥ 1,

sec⁻¹(z) = t ⟺ sec(t) = z and (0 ≤ t < π/2 or π ≤ t < 3π/2).

6. Inverse cosecant function: For |z| ≥ 1,

csc⁻¹(z) = t ⟺ csc(t) = z and (0 < t ≤ π/2 or π < t ≤ 3π/2).

Radian and degree measures. Angles can be measured in degrees or in radians. Radian measure is preferred in calculus. The following table gives some common conversions:

Degrees   0    30    45    60    90    120    135    150    180    270    360
Radians   0   π/6   π/4   π/3   π/2   2π/3   3π/4   5π/6    π     3π/2    2π

Note that 2π radians (or 360°) corresponds to a complete revolution of the unit circle.

Some values of sine and cosine. The following table gives some often used values of the sine and cosine functions:

t        0    π/6    π/4    π/3    π/2    2π/3    3π/4    5π/6     π
sin(t)   0    1/2   √2/2   √3/2     1    √3/2    √2/2     1/2      0
cos(t)   1   √3/2   √2/2    1/2     0    −1/2   −√2/2   −√3/2     −1

4.2 Limits and Continuity

This section introduces the concepts of limit and continuity, which are central to calculus.

4.2.1 Neighborhoods, Deleted Neighborhoods and Limits

A neighborhood of the real number a is the set of real numbers x satisfying

|x − a| < r for some positive number r.

A deleted neighborhood of a is the neighborhood with a removed; that is, the set of real numbers x satisfying 0 < |x − a| < r for some positive r.


Figure 4.2: Plots of y = (x − 1)/(x² − 1) (left) and y = |x − 1|/(x² − 1) (right).

Suppose that f(x) is defined for all x in a deleted neighborhood of a. We write

lim_{x→a} f(x) = L

and say "the limit of f(x), as x approaches a, equals L" if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < |x − a| < δ, then f(x) satisfies |f(x) − L| < ε.

(The deleted δ-neighborhood of a is mapped into the ε-neighborhood of L.)

Example 4.1 (Figure 4.2) Let f(x) = (x − 1)/(x² − 1) and a = 1 (left part of Figure 4.2). Although f(1) is undefined, the limit as x approaches 1 exists. Specifically,

lim_{x→1} f(x) = lim_{x→1} (x − 1)/(x² − 1) = lim_{x→1} (x − 1)/((x − 1)(x + 1)) = lim_{x→1} 1/(x + 1) = 1/2.

By contrast, if f(x) = |x − 1|/(x² − 1) and a = 1 (right part of Figure 4.2), then a limit does not exist, since function values approach either −1/2 or +1/2 depending on whether x < 1 or x > 1.

Example 4.2 (Figure 4.3) Let f(x) = sin(π/x) and a = 0 (left part of Figure 4.3). f(x) is undefined at 0. Further, as x approaches 0, values of f(x) fluctuate between −1 and +1, and no limit exists.

By contrast, if f(x) = x sin(π/x) and a = 0 (right part of Figure 4.3), then a limit does exist. Specifically, function values approach 0 as x approaches 0.

One-sided limits. We write

lim_{x→a⁺} f(x) = L

and say "the limit of f(x), as x approaches a from above, equals L" if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < x − a < δ, then f(x) satisfies |f(x) − L| < ε.


Figure 4.3: Plots of y = sin(π/x) (left) and y = x sin(π/x) (right).

Similarly, we write

lim_{x→a⁻} f(x) = L

and say "the limit of f(x), as x approaches a from below, equals L" if for every positive number ε there exists a positive number δ such that

if x satisfies 0 < a − x < δ, then f(x) satisfies |f(x) − L| < ε.

In the first case, we approach a from the right only; in the second case, from the left only. For example, if f(x) = |x − 1|/(x² − 1) and a = 1 (right part of Figure 4.2), then

lim_{x→1⁻} f(x) = −1/2 and lim_{x→1⁺} f(x) = 1/2.

Infinite limits. We write

lim_{x→∞} f(x) = L

and say "the limit of f(x), as x approaches ∞, equals L" if for every positive number ε there exists a number M such that

if x satisfies x > M, then f(x) satisfies |f(x) − L| < ε.

Similarly, we write

lim_{x→−∞} f(x) = L

and say "the limit of f(x), as x approaches −∞, equals L" if for every positive number ε there exists a number M such that

if x satisfies x < M, then f(x) satisfies |f(x) − L| < ε.

Example 4.3 (Exponential Function) Let f(x) = e^x. Then

lim_{x→−∞} f(x) = 0,

but no finite limit exists as x → ∞. We often write lim_{x→∞} f(x) = ∞ when function values grow large without bound.


Application in statistics. Let F(x) = P(X ≤ x) be the cumulative distribution function (CDF) of a discrete random variable (Section 2.1.1) and let a be an element of the range of the random variable. Then both one-sided limits exist:

lim_{x→a⁺} F(x) = F(a) and lim_{x→a⁻} F(x) = F(a) − P(X = a).

Further,

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

Properties of limits. Suppose that

lim_{x→a} f(x) = L and lim_{x→a} g(x) = M,

and let c be a constant. Then

1. Constant rule: lim_{x→a} c = c.

2. Rule for x: lim_{x→a} x = a.

3. Constant multiple rule: lim_{x→a} (c f(x)) = cL.

4. Sum rule: lim_{x→a} (f(x) + g(x)) = L + M.

5. Difference rule: lim_{x→a} (f(x) − g(x)) = L − M.

6. Product rule: lim_{x→a} (f(x)g(x)) = LM.

7. Quotient rule: lim_{x→a} (f(x)/g(x)) = L/M when M ≠ 0.

8. Power rule: lim_{x→a} (f(x))^n = L^n when n is a positive integer.

9. Power rule for x: lim_{x→a} x^n = a^n when n is a positive integer.

10. Rule for roots: lim_{x→a} ⁿ√f(x) = ⁿ√L when n is an odd positive integer, or when n is an even positive integer and L is greater than 0.

11. Rule for ordering: If f(x) ≤ g(x) for all x near a (except possibly at a), then L ≤ M.

Squeezing principle. If ℓ(x) ≤ f(x) ≤ u(x) for all x near a (except possibly at a) and

lim_{x→a} ℓ(x) = lim_{x→a} u(x) = L,

then f(x) approaches L as well: lim_{x→a} f(x) = L.

For example, let f(x) = x sin(π/x) and a = 0 (right part of Figure 4.3). Since

−|x| ≤ x sin(π/x) ≤ |x|

and the lower and upper functions each approach 0 as x approaches 0, we know that f(x) approaches 0 as well.


Big O and little o notation. If

|f(x)| ≤ K g(x) for some positive K and all x near a (except possibly at a),

we say that f(x) = O(g(x)) for x near a. If

lim_{x→a} f(x)/g(x) = 0,

we say that f(x) = o(g(x)) for x near a.

4.2.2 Continuous Functions

The function f(x) is said to be continuous at a if

lim_{x→a} f(x) = f(a).

The function is said to be discontinuous at a otherwise. Note that f(x) is discontinuous at a when f(a) is undefined, or when the limit does not exist, or when the limit and function values exist but are not equal.

Polynomial functions, rational functions (ratios of polynomials), exponential functions, logarithmic functions, and trigonometric functions are continuous throughout their domains. Step functions are discontinuous at each "step".

Note that the properties of limits given above imply that sums, differences, products, quotients, powers, roots, and constant multiples of continuous functions are continuous whenever the appropriate operation is defined.

Example 4.4 (Split Definition Functions) Let

f(x) = −4x + 10 when x < 2, and f(x) = −0.25x + 2.5 when x ≥ 2,

and let

g(x) = 0 when x < 0, 0.25 when 0 ≤ x < 1, 0.75 when 1 ≤ x < 2, and 1 when x ≥ 2.

The function f(x) (left part of Figure 4.4) is continuous everywhere. In particular,

lim_{x→2⁻} f(x) = lim_{x→2⁻} (−4x + 10) = 2 and lim_{x→2⁺} f(x) = lim_{x→2⁺} (−0.25x + 2.5) = 2.

The step function g(x) (right part of Figure 4.4) is discontinuous at 0, 1 and 2. Note that g(x) is the cumulative distribution function of a binomial random variable with parameters n = 2 and p = 1/2.

Compositions. If g(x) is continuous at a and f(x) is continuous at g(a), then the composite function f(g(x)) is continuous at a. In other words,

lim_{x→a} f(g(x)) = f( lim_{x→a} g(x) ) = f(g(a)).

For example, lim_{x→5} ln(6 − x) = ln(1) = 0, where ln() is the natural logarithm function.


Figure 4.4: Plots of split definition functions.

4.3 Differentiation

Suppose that f(x) is defined for all x near a. The average rate of change of the function f over the interval [a, x] if x > a (or [x, a] if x < a) is the ratio

(f(x) − f(a)) / (x − a).

The average rate of change corresponds to the slope of the secant line through the points (a, f(a)) and (x, f(x)).

Interest focuses on whether we can define the instantaneous rate of change at a or, equivalently, the slope of the curve y = f(x) at the point (a, f(a)).

4.3.1 Derivative and Tangent Line

The derivative of the function f at the number a, denoted by f′(a) ("f prime of a"), is

f′(a) = lim_{x→a} (f(x) − f(a)) / (x − a),

if the limit exists.

The function f(x) is said to be differentiable at a if the limit above exists. Otherwise, f(x) is not differentiable at a.

If f(x) represents your position at time x, then (f(x) − f(a))/(x − a) is your average velocity over the interval [a, x] (or [x, a]) and f′(a) is your instantaneous velocity at time a.

Tangent line. If f(x) is differentiable at a, then the line with equation

y = f(a) + f′(a)(x − a)

is tangent to the curve y = f(x) at the point (a, f(a)), and f′(a) is said to be the slope of the curve at this point.


Figure 4.5: Plots of y = x/(1 + 2x) (left part) and y = 7 − 3|x − 2| (right part).

Example 4.5 (Derivative Exists) Consider the function f(x) = x/(1 + 2x) and let a = 1 (left part of Figure 4.5). The derivative is computed as follows:

f′(1) = lim_{x→1} ( x/(1 + 2x) − 1/3 ) / (x − 1) = lim_{x→1} 1/(3 + 6x) = 1/9.

The equation of the tangent line is y = (1/3) + (1/9)(x − 1).

Example 4.6 (Derivative Does Not Exist) Consider the function f(x) = 7 − 3|x − 2| and let a = 2 (right part of Figure 4.5). Then

1. If x > 2, then f(x) = 7 − 3(x − 2) and lim_{x→2⁺} (f(x) − f(2))/(x − 2) = −3.

2. If x < 2, then f(x) = 7 − 3(2 − x) and lim_{x→2⁻} (f(x) − f(2))/(x − 2) = 3.

Since the limits from above and below are not equal, the derivative does not exist.

4.3.2 Approximations and Local Linearity

If f(x) is differentiable at a, then f(x) is locally linear near a. That is, values on the tangent line can be used to approximate values on the curve itself.

Example 4.7 (Derivative Exists, continued) For example, if f(x) = x/(1 + 2x), a = 1, and ℓ(x) = (1/3) + (1/9)(x − 1) (as calculated above), then the following table confirms that values of the linear function ℓ(x) are close to values of f(x) when x is near 1:

x      0.75   0.80   0.85   0.90   0.95   1.00   1.05   1.10   1.15   1.20   1.25
f(x)   0.300  0.308  0.315  0.321  0.328  0.333  0.339  0.344  0.348  0.353  0.357
ℓ(x)   0.306  0.311  0.317  0.322  0.328  0.333  0.339  0.344  0.350  0.356  0.361
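A table like this is easy to regenerate. The following sketch (the step size 0.05 and the interval are just the choices used in the table) compares f with its tangent-line approximation at a = 1:

    # Sketch: f(x) = x/(1 + 2x) versus its tangent line at (1, 1/3).
    def f(x):
        return x / (1 + 2 * x)
    def ell(x):
        return 1/3 + (1/9) * (x - 1)
    for i in range(11):
        x = 0.75 + 0.05 * i
        print(round(x, 2), round(f(x), 3), round(ell(x), 3))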


Figure 4.6: Plots of y = 2 + 9x − 6x² + x³ with different points highlighted.

4.3.3 First and Second Derivative Functions

Since the derivative can take different values at different points, we can think of it as a function in its own right. For any function f, we define the derivative function, f′, by

f′(x) = lim_{h→0} (f(x + h) − f(x)) / h.

If the limit exists at a given x, we say that f is differentiable at x. If the derivative exists for all x in the domain of the function, we say that f is differentiable everywhere.

Alternative notations. Starting with y = f(x) (where x represents the independent variable and y represents the dependent variable), there are several different ways to write the derivative function:

f′(x) = dy/dx = d/dx (f(x)) = df/dx = Df(x) = y′.

The "prime" notation (f′(x), y′) is due to Newton. The d/dx ("d-dx") form is due to Leibniz.

Second derivative function. If f is a differentiable function, then its derivative f′ is also a function, so f′ may have a derivative of its own. The second derivative of f, denoted by f′′ ("f double prime"), is defined as follows:

f′′(x) = lim_{h→0} (f′(x + h) − f′(x)) / h.

Starting with y = f(x), alternative notations are

f′′(x) = d²y/dx² = d²/dx² (f(x)) = d²f/dx² = D²f(x) = y′′.

The second derivative is the rate of change of the rate of change. If f′′(a) > 0, then the curve y = f(x) is concave up at (a, f(a)). If f′′(a) < 0, then the curve is concave down at (a, f(a)).


Continuity and differentiability. Differentiable functions are continuous, but continuous functions may not be differentiable. For example, the absolute value function f(x) = 7 − 3|x − 2| (see Figure 4.5) is continuous everywhere and is differentiable for all values of x other than 2.

4.3.4 Some Rules for Finding Derivative Functions

Assume that f and g are differentiable on a common domain D, and let a, b, c, and m be constants. Then

1. Constant functions: d/dx (c) = 0.

2. Linear functions: d/dx (mx + b) = m.

3. Constant multiples: d/dx (c f(x)) = c f′(x).

4. Linear combinations: d/dx (a f(x) + b g(x)) = a f′(x) + b g′(x).

5. Products of functions: d/dx (f(x)g(x)) = f′(x)g(x) + f(x)g′(x).

6. Quotients of functions: d/dx (f(x)/g(x)) = ( g(x)f′(x) − f(x)g′(x) ) / (g(x))² when g(x) ≠ 0.

7. Powers of x: d/dx (x^n) = n x^(n−1) for each constant n.

Polynomial functions. Recall that a polynomial function, f(x), has the form

f(x) = an x^n + an−1 x^(n−1) + · · · + a2 x² + a1 x + a0,

where a0, a1, ..., an are constants, and n is a nonnegative integer. The rules given above imply that

f′(x) = n an x^(n−1) + (n − 1) an−1 x^(n−2) + · · · + 2 a2 x + a1.

Example 4.8 (Cubic Function) Let f(x) = 2 + 9x − 6x² + x³. Then

f′(x) = 9 − 12x + 3x² and f′′(x) = −12 + 6x.

Since f′′(x) < 0 when x < 2, f is concave down when x < 2 (left part of Figure 4.6). Since f′′(x) > 0 when x > 2, f is concave up when x > 2 (right part of Figure 4.6).

Rational functions. Recall that a rational function, r(x), is the ratio of polynomials. The derivative of r(x) is computed using the quotient rule.

Example 4.9 (Rational Function) For example, if

r(x) = p(x)/q(x) = (3 + 4x − x²)/(7 + 2x),

then

r′(x) = ( q(x)p′(x) − p(x)q′(x) ) / (q(x))² = ( (7 + 2x)(4 − 2x) − (3 + 4x − x²)(2) ) / (7 + 2x)² = −2(−11 + 7x + x²) / (7 + 2x)².


4.3.5 Chain Rule for Composition of Functions

If f and g are differentiable functions, then

d/dx (f(g(x))) = f′(g(x)) g′(x).

(The derivative of the outside function evaluated at the inside function, times the derivative of the inside function evaluated at x.)

Starting with y = f(u) and u = g(x), the chain rule can be written as follows:

dy/dx = (dy/du)(du/dx).

For example, since √u = u^(1/2) and d/du (u^(1/2)) = (1/2) u^(−1/2) = 1/(2√u),

d/dx ( √(x² + 1) ) = (1/(2√(x² + 1))) (2x) = x/√(x² + 1).

Inverse functions. The functions f and g are said to be inverse functions if f(g(x)) = x. Using the chain rule:

d/dx (f(g(x))) = f′(g(x)) g′(x)  ⟹  1 = f′(g(x)) g′(x)  ⟹  g′(x) = 1/f′(g(x)).

(The derivative of g at x is the reciprocal of the derivative of f at g(x).)

For example, f(u) = u² and g(x) = √x are inverse functions when x is positive. Since f′(u) = 2u,

g′(x) = 1/(2√x) = 1/f′(g(x)).

4.3.6 Rules for Exponential and Logarithmic Functions

The natural base for exponential and logarithmic functions in calculus is the base e, where

e = lim_{n→∞} (1 + 1/n)^n = 2.71828... is Euler's number.

The limit above is illustrated in the following table of approximate values:

n              10        100       1000      10000     100000    1000000
(1 + 1/n)^n    2.59374   2.70481   2.71692   2.71815   2.71827   2.71828

The domain of the exponential function e^x = exp(x) is all real numbers and the domain of the logarithmic function ln(x) = log_e(x) is the positive real numbers. Plots of the exponential and logarithmic functions are given in Figure 4.7.


Figure 4.7: Plots of y = e^x (left) and y = ln(x) (right).

The functions e^x and ln(x) are inverses. That is,

e^(ln(x)) = x when x > 0 and ln(e^x) = x for all real numbers x.

Their derivatives are as follows:

d/dx (e^x) = e^x and d/dx (ln(x)) = 1/x.

More generally,

d/dx (e^u) = e^u (du/dx) and d/dx (ln(u)) = (1/u)(du/dx).

For example,

d/dx ( e^(12 − x³) ) = e^(12 − x³) (−3x²) and d/dx ( ln(x⁴ + 2x² + 7) ) = (4x³ + 4x)/(x⁴ + 2x² + 7).

4.3.7 Rules for Trigonometric and Inverse Trigonometric Functions

In calculus, we work with trigonometric functions using radian measure. That is, the input x is the radian measure of an angle (Section 4.1). Figure 4.8 shows plots of the two most commonly used functions, sin(x) and cos(x).

Trigonometric functions have specialized derivative formulas:

d/dx (sin(x)) = cos(x)            d/dx (cos(x)) = −sin(x)
d/dx (tan(x)) = sec(x)²           d/dx (cot(x)) = −csc(x)²
d/dx (sec(x)) = sec(x) tan(x)     d/dx (csc(x)) = −csc(x) cot(x)


Figure 4.8: Plots of y = sin(x) (left) and y = cos(x) (right).

Similarly, inverse trigonometric functions have specialized derivative formulas:

d/dx (sin⁻¹(x)) = 1/√(1 − x²)              d/dx (cos⁻¹(x)) = −1/√(1 − x²)
d/dx (tan⁻¹(x)) = 1/(1 + x²)               d/dx (cot⁻¹(x)) = −1/(1 + x²)
d/dx (sec⁻¹(x)) = 1/(x²√(1 − x⁻²))         d/dx (csc⁻¹(x)) = −1/(x²√(1 − x⁻²))

For example,

d/dx ( sin(3x² + 8) ) = cos(3x² + 8)(6x)

and

d/dx ( x cos(√x + x) ) = cos(√x + x) − x sin(√x + x)( 1/(2√x) + 1 ) = cos(√x + x) − sin(√x + x)( √x/2 + x ).

4.3.8 Indeterminate Forms and L'Hospital's Rule

If f(x) → 0 ("f(x) approaches 0") as x → a and g(x) → 0 as x → a, then

lim_{x→a} f(x)/g(x)

may or may not exist, and is called an indeterminate form of type 0/0.

For example,

lim_{x→1} (x − 1)/(x² + 11x − 12)

is an indeterminate form of type 0/0 whose limit can be computed after cancelling the common factor (x − 1) from numerator and denominator:

lim_{x→1} (x − 1)/(x² + 11x − 12) = lim_{x→1} (x − 1)/((x − 1)(x + 12)) = lim_{x→1} 1/(x + 12) = 1/13.


Similarly, if f(x) → ±∞ and g(x) → ±∞ as x → a, then the limit of the ratio f(x)/g(x) is an indeterminate form of type ∞/∞.

Theorem 4.10 (L'Hospital's Rule) Suppose that f and g are differentiable and g′(x) ≠ 0 near a (except possibly at a). Further, suppose that

1. f(x) → 0 and g(x) → 0 as x → a, or

2. f(x) → ±∞ and g(x) → ±∞ as x → a.

Then

lim_{x→a} f(x)/g(x) = lim_{x→a} f′(x)/g′(x),

if the limit on the righthand side exists.

L'Hospital's rule says that the limit of the ratio of the functions is the same as the limit of the ratio of the rates. For example,

lim_{x→1} (x²⁰ − 1)/(x¹⁸ − 1) = lim_{x→1} 20x¹⁹/(18x¹⁷) = lim_{x→1} 10x²/9 = 10/9.

In some cases, L'Hospital's rule needs to be used more than once. For example,

lim_{x→0} (1 − cos(x))/x² = lim_{x→0} sin(x)/(2x) = lim_{x→0} cos(x)/2 = 1/2.

(The first and second limits are indeterminate forms of type 0/0.)

One-sided and infinite limits. Note that L'Hospital's rule remains true for one-sided limits (x → a⁺ or x → a⁻) and for limits where x → ±∞.

Indeterminate products. If f(x) → 0 and g(x) → ±∞ as x → a, then

lim_{x→a} f(x)g(x)

is an indeterminate form of type 0 · ∞. L'Hospital's rule can sometimes be used to evaluate the limit. The trick is to rewrite the product as a quotient:

lim_{x→a} f(x)g(x) = lim_{x→a} f(x)/(1/g(x))   (type 0/0)   or   lim_{x→a} f(x)g(x) = lim_{x→a} g(x)/(1/f(x))   (type ∞/∞).

For example,

lim_{x→∞} x e^(−x) = lim_{x→∞} x/e^x = lim_{x→∞} 1/e^x = 0   (using type ∞/∞).


Remarks. There are many other types of indeterminate forms, including

∞ − ∞, 0⁰, ∞⁰, 1^∞.

In each case, anything can happen (that is, a limit may or may not exist).

Note that the limit used to define Euler's number (Section 4.3.6) is an example of an indeterminate form of type 1^∞.

4.4 Optimization

The function f : D ⊆ R → R has a local minimum (or relative minimum) at c ∈ D if

f(c) ≤ f(x) for each x in the intersection of a neighborhood of c with D.

Similarly, f has a local maximum (or relative maximum) at c ∈ D if

f(c) ≥ f(x) for each x in the intersection of a neighborhood of c with D.

Local maxima and minima are called local extrema of the function.

f has a global (or absolute) minimum at c if f(c) ≤ f(x) for all x ∈ D. Similarly, f has a global (or absolute) maximum at c if f(c) ≥ f(x) for all x ∈ D. Global maxima and minima are called global extrema of the function.

4.4.1 Local Extrema and Derivatives

The following theorem establishes a relationship between local extrema and derivatives.

Theorem 4.11 (Fermat's Theorem) If the continuous function f has a local extremum at c, and if f′(c) exists, then f′(c) = 0.

Continuing with the cubic function example (Figure 4.6), f(x) = 2 + 9x − 6x² + x³ has a local maximum at 1 (with maximum value 6) and a local minimum at 3 (with minimum value 2). Since the derivative function is f′(x) = 9 − 12x + 3x², it is easy to check that f′(1) = f′(3) = 0.

Critical numbers. A number c is a critical number of f if either f′(c) = 0 or f′(c) does not exist. If f has a local extremum at c, then Fermat's theorem implies that c is a critical number of the function f.

The cubic function f(x) = 2 + 9x − 6x² + x³ has critical numbers 1 and 3. The absolute value function f(x) = 7 − 3|x − 2| (Figure 4.5) has critical number 2.

4.4.2 First Derivative Test for Local Extrema

Suppose that f is differentiable for all x near c. If c is a critical number of f and if f′ changes sign at c, then f has a local extremum at c. Specifically,

1. If f′(x) is positive below c and negative above c, then f is increasing to the left of c and decreasing to the right of c. Thus, f has a local maximum at c.

2. If f′(x) is negative below c and positive above c, then f is decreasing to the left of c and increasing to the right of c. Thus, f has a local minimum at c.

Continuing with the cubic function f(x) = 2 + 9x − 6x² + x³, it is easy to check that

f′(x) > 0 when x < 1, f′(x) < 0 when 1 < x < 3, and f′(x) > 0 when x > 3.

These computations confirm that f has a local maximum at 1 and a local minimum at 3.

4.4.3 Second Derivative Test for Local Extrema

Suppose that f is twice differentiable for all x near c, and that c is a critical number for f. (In this case, f′(c) = 0.) Then

1. If f′′(c) > 0, then f has a local minimum at c.

2. If f′′(c) < 0, then f has a local maximum at c.

Note that if f′(c) = 0 and f′′(c) = 0, then anything can happen. For example, if f is the quartic function f(x) = x⁴, then f′(0) = f′′(0) = 0 and f has a local (and global) minimum at 0. By contrast, if f is the cubic function f(x) = x³, then f′(0) = f′′(0) = 0, but f has neither a local minimum nor a local maximum at 0.

4.4.4 Global Extrema on Closed Intervals

Consider f : [a,b] ⊆ R → R (the domain of interest is the closed interval [a,b]). The following theorem says that continuous functions have global extrema on closed intervals.

Theorem 4.12 (Extreme Value Theorem) If f is continuous on the closed interval [a,b], then there are numbers c, d ∈ [a,b] such that f has a global minimum at c and a global maximum at d.

For example, if the domain of the cubic function f(x) = 2 + 9x − 6x² + x³ (Figure 4.6) is restricted to the closed interval [−0.5, 3.5], then f has a global minimum value of −4.125 at −0.5, and a global maximum value of 6 at 1.

Method for finding global extrema. To find the global extrema of the continuous function f on the closed interval [a,b]:

1. Find the values of f at the critical numbers in the open interval (a,b).

2. Find the values of f at the endpoints: f(a), f(b).


3. The largest function value from the first two steps is the global maximum, and the smallest function value is the global minimum.

Figure 4.9: Plots of y = 40 + 12x² + 4x³ − 3x⁴. In the right plot, the domain is restricted to the interval [0,3].

Example 4.13 (Quartic Function) Let f(x) = 40 + 12x² + 4x³ − 3x⁴ for all x.

1. As illustrated in the left part of Figure 4.9, f(x) has a local maximum of 45 = f(−1) when x = −1, a local minimum of 40 = f(0) when x = 0, and both a local and global maximum of 72 = f(2) when x = 2.

2. Since f′(x) = 24x + 12x² − 12x³ = −12(x − 2)x(x + 1), we know that f′(x) = 0 when x = −1, 0, 2. Further,

(a) f′(x) > 0 when x < −1 and when 0 < x < 2, and

(b) f′(x) < 0 when −1 < x < 0 and when x > 2.

The values of f(x) are increasing when x < −1, decreasing when −1 < x < 0, increasing when 0 < x < 2, and decreasing when x > 2.

3. Since f′′(x) = 24 + 24x − 36x², we know that f′′(x) = 0 when

x = (1 − √7)/3 ≈ −0.549 and x = (1 + √7)/3 ≈ 1.215.

Further,

(a) f′′(x) < 0 when x < −0.549 and when x > 1.215, and

(b) f′′(x) > 0 when −0.549 < x < 1.215.

Concavity changes at both roots of the equation f′′(x) = 0. Specifically, f(x) is concave down when x < −0.549, concave up when −0.549 < x < 1.215, and concave down when x > 1.215.

Example 4.14 (Restricted Domain) Let f(x) = 40 + 12x² + 4x³ − 3x⁴ for x ∈ [0,3].

1. As illustrated in the right part of Figure 4.9, f(x) has a local minimum of 40 = f(0) at x = 0, a local and global maximum of 72 = f(2) at x = 2, and a local and global minimum of 13 = f(3) at x = 3.

2. Since f′(x) = 24x + 12x² − 12x³ = −12(x − 2)x(x + 1), we know that f′(x) = 0 when x = −1, 0, 2. For the two roots in the restricted domain, we consider f(0) = 40 and f(2) = 72 when determining global extrema.

3. The function values at the endpoints are f(0) = 40 and f(3) = 13.

4. The global maximum value is the largest of the numbers computed in items 2 and 3. The global minimum value is the smallest of the numbers computed in items 2 and 3.
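The same bookkeeping can be scripted. A minimal sketch for Example 4.14 (the candidate list is just the critical numbers in [0,3] together with the endpoints):

    # Sketch: global extrema of f(x) = 40 + 12x^2 + 4x^3 - 3x^4 on [0, 3].
    def f(x):
        return 40 + 12 * x**2 + 4 * x**3 - 3 * x**4
    candidates = [0, 2, 3]                      # critical numbers 0 and 2, plus endpoint 3
    values = {x: f(x) for x in candidates}
    print(values)                                               # {0: 40, 2: 72, 3: 13}
    print(max(values, key=values.get), min(values, key=values.get))   # max at x = 2, min at x = 3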

4.4.5 Newton’s Method<br />

Many problems in mathematics involve finding roots. For example, to find the critical<br />

numbers of the differentiable function f, we need to solve the derivative equation f ′ (x) = 0.<br />

Newton’s method is an iterative method <strong>for</strong> approximating the roots of an equation of the<br />

<strong>for</strong>m g(x) = 0 when g is a differentiable function. The method uses local linearity.<br />

Specifically, if r is a root of the equation, x0 (an initial estimate of r) is a number near r,<br />

and g ′ (x0) = 0, then the tangent line through (x0,g(x0))<br />

y = g(x0) + g ′ (x0)(x − x0)<br />

can be used to approximate the curve y = g(x) and the x-coordinate of the point where the<br />

tangent line crosses the x-axis (call it x1) is likely to be closer to r than x0. To find x1:<br />

0 = g(x0) + g ′ (x0)(x1 − x0) ⇒ x1 = x0 − g(x0)<br />

g ′ (x0) .<br />

If g ′ (x1) = 0, the method can be repeated to yield x2 = x1 − g(x1)<br />

g ′ (x1)<br />

general <strong>for</strong>mula is<br />

xk+1 = xk − g(xk)<br />

g ′ (xk) .<br />

, and so <strong>for</strong>th. The<br />

The iteration stops when the error term |g(xk)/g ′ (xk)| falls below a pre-specified bound.<br />

Example 4.15 (Root of Cubic Function) The cubic function g(x) = x 3 − x − 1 has a<br />

root between 1 and 2, as illustrated in the left part of Figure 4.10. Newton’s method can<br />

be used to find the root. The following table shows the results of the first five iterations<br />

starting with initial value x0 = 1.0.<br />

k 0 1 2 3 4 5<br />

xk 1.0 1.5 1.34783 1.3252 1.32472 1.32472<br />

Since x4 and x5 are identical to within five decimal places of accuracy, we estimate that the<br />

root occurs when x = 1.32472.<br />


Figure 4.10: Newton’s method examples.<br />

Example 4.16 (Point of Intersection) The functions<br />

f1(x) = x + 4 and f2(x) = e^(x/3)

have a point of intersection between 5 and 10, as illustrated in the right part of Figure 4.10.<br />

One way to find the point of intersection is to use Newton’s method to solve the equation<br />

g(x) = f1(x) − f2(x) = 0. The following table shows the results of the first six iterations<br />

starting with initial value x0 = 5.5.<br />

k 0 1 2 3 4 5 6<br />

xk 5.5 8.49133 7.53204 7.28038 7.26519 7.26514 7.26514<br />

Since x5 and x6 are identical to within five decimal places of accuracy, we estimate that the<br />

point of intersection occurs when x = 7.26514.<br />

Warning. Newton’s method is powerful and fast (in general, few iterations are required<br />

<strong>for</strong> convergence), but is sensitive to the initial value x0. If we let x0 = 0.57 in the cubic<br />

example above, <strong>for</strong> example, then 35 iterations are needed to approximate the root to within<br />

5 decimal places of accuracy.<br />

If g(x) has more than one root and the chosen x0 is far from r, then the method could<br />

converge to the wrong root. If g ′ (xk) ≈ 0 <strong>for</strong> some k, then the sequence of iterates may not<br />

converge at all.<br />

4.5 Applications in <strong>Statistics</strong><br />

Optimization is an important component of many statistical analyses. This section illustrates<br />

two applications.<br />

4.5.1 Maximum Likelihood Estimation in Binomial Models<br />

Consider the problem of estimating the binomial probability of success, p. If x successes are<br />

observed in n trials, then a natural estimate of p is the sample proportion, p = x/n. If the<br />

data summarize the results of n independent trials of a Bernoulli experiment with success probability p, then p = x/n is also the maximum likelihood (ML) estimate of p.

Figure 4.11: Examples of binomial (left) and multinomial (right) likelihoods.

The ML estimate of p is the value of p which maximizes the likelihood function, Lik(p), or<br />

(equivalently) the log-likelihood function, ℓ(p). Lik(p) is the binomial PDF with n and x<br />

fixed and p varying<br />

Lik(p) = C(n,x) p^x (1 − p)^(n−x),

where C(n,x) denotes the binomial coefficient, and ℓ(p) is the natural logarithm of the likelihood:

ℓ(p) = ln(Lik(p)) = ln C(n,x) + x ln(p) + (n − x) ln(1 − p).

If 0 < x < n, then the ML estimate is obtained by solving ℓ′(p) = 0 for p:

ℓ′(p) = x/p − (n − x)/(1 − p) = 0 ⇒ x/p = (n − x)/(1 − p) ⇒ p = x/n.

Further, since ℓ′(p) > 0 when 0 < p < x/n and ℓ′(p) < 0 when x/n < p < 1, we know that the likelihood is maximized when p = x/n.

For example, the left part of Figure 4.11 shows the binomial likelihood when 17 successes<br />

are observed in 25 trials. The function is maximized at the ML estimate, p = 0.68.<br />
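As a numerical check (a sketch, not part of the original notes), the log-likelihood for x = 17 successes in n = 25 trials can be evaluated on a fine grid of p values; the grid maximum falls at the sample proportion 0.68, in agreement with the calculus argument above.

    import math

    n, x = 25, 17
    log_binom = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

    def loglik(p):
        return log_binom + x * math.log(p) + (n - x) * math.log(1 - p)

    grid = [i / 10000 for i in range(1, 10000)]   # p values in (0, 1)
    p_hat = max(grid, key=loglik)
    print(p_hat)  # 0.68, the sample proportion x/n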

4.5.2 Maximum Likelihood Estimation in Multinomial Models<br />

Maximum likelihood estimation can be generalized to multinomial models. Consider, <strong>for</strong><br />

example, a multinomial model with three categories and probabilities<br />

(p1,p2,p3) = (θ,(1 − θ)θ,(1 − θ) 2 ), where 0 ≤ θ ≤ 1 is unknown,<br />

and suppose that in 60 independent trials of the experiment, the first outcome was observed<br />

15 times, the second outcome 20 times and the third outcome 25 times. That is, suppose<br />

that n = 60, x1 = 15, x2 = 20 and x3 = 25.<br />



The likelihood function, Lik(θ), is the multinomial PDF written as a function of θ,<br />

<br />

Lik(θ) = C(60; 15,20,25) θ^15 ((1 − θ)θ)^20 ((1 − θ)²)^25 = C(60; 15,20,25) θ^35 (1 − θ)^70,

where C(60; 15,20,25) denotes the multinomial coefficient, as shown in the right part of Figure 4.11.

The log-likelihood function is the natural logarithm of the likelihood,

ℓ(θ) = ln(Lik(θ)) = ln C(60; 15,20,25) + 35 ln(θ) + 70 ln(1 − θ).

ℓ′(θ) = 0 when θ = 35/105 = 1/3. You can use either the first or second derivative test to demonstrate that the log-likelihood is maximized when θ = 1/3.

The estimated model has probabilities (1/3, 2/9, 4/9).
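A similar grid search confirms the multinomial ML estimate (an illustrative sketch, not part of the notes; the constant multinomial coefficient is omitted because it does not affect the maximizing value of θ).

    import math

    x1, x2, x3 = 15, 20, 25   # observed counts, n = 60
    a = x1 + x2               # exponent of theta: 35
    b = x2 + 2 * x3           # exponent of (1 - theta): 70

    def loglik(theta):
        return a * math.log(theta) + b * math.log(1 - theta)

    grid = [i / 100000 for i in range(1, 100000)]
    theta_hat = max(grid, key=loglik)
    print(round(theta_hat, 5))  # approximately 1/3
    print(theta_hat, (1 - theta_hat) * theta_hat, (1 - theta_hat) ** 2)  # estimated (p1, p2, p3)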

4.5.3 Minimum Chi-Square Estimation in Multinomial Models<br />

In 1900, K. Pearson developed a quantitative method to determine if observed data are<br />

consistent with a given multinomial model.<br />

If (X1,X2,... ,Xk) has a multinomial distribution with parameters n and (p1,p2,... ,pk),<br />

then Pearson’s statistic is the following random variable:<br />

X² = Σ_{i=1}^{k} (Xi − n pi)² / (n pi).

For each i, the observed frequency, Xi, is compared to the expected frequency, E(Xi) = npi,<br />

under the multinomial model. If each observed frequency is close to expected, then the value<br />

of X 2 will be close to zero. If at least one observed frequency is far from expected, then<br />

the value of X 2 will be large and the appropriateness of the given multinomial model will<br />

be called into question.<br />

In many practical situations, certain parameters of the multinomial model need to be estimated<br />

from the sample data. For example, if<br />

(p1,p2,... ,pk) = (p1(θ),p2(θ),... ,pk(θ)) <strong>for</strong> some parameter θ,<br />

then the observed value of Pearson’s statistic<br />

f(θ) = Σ_{i=1}^{k} (xi − n pi(θ))² / (n pi(θ))

is a function of θ, and a natural estimate of θ is the value of θ which minimizes f. This<br />

estimate is known as the minimum chi-square estimate of θ <strong>for</strong> the given (x1,x2,... ,xk).<br />

Example 4.17 (n = 60, k = 3, continued) Consider again the model with probabilities<br />

(p1,p2,p3) = (θ,(1 − θ)θ,(1 − θ) 2 ), where 0 ≤ θ ≤ 1 is unknown,<br />


Figure 4.12: Plots of y = f(θ) <strong>for</strong> the minimum chi-square estimation examples.<br />

and assume that in 60 independent trials we obtained x1 = 15, x2 = 20 and x3 = 25.<br />

The function we would like to minimize is<br />

f(θ) = (15 − 60θ)²/(60θ) + (20 − 60(1 − θ)θ)²/(60(1 − θ)θ) + (25 − 60(1 − θ)²)²/(60(1 − θ)²)

(left part of Figure 4.12) and the derivative of this function (after simplification) is

f′(θ) = −5(9θ³ − 9θ² + 75θ − 25) / (12(θ − 1)³θ²).

To find the minimum chi-square estimate we need to find the root of

g(θ) = 9θ³ − 9θ² + 75θ − 25 in the [0,1] interval.

Using Newton's method, 4 decimal places of accuracy, and θ0 = 0.5 we get:

θ0 = 0.5, θ1 = 0.343643, θ2 = 0.342592, θ3 = 0.342592.

Let θ̂ = 0.342592. Since

f′(θ) < 0 when 0 < θ < θ̂ and f′(θ) > 0 when θ̂ < θ < 1,

the function is minimized at θ̂.

The estimated model has (approximate) probabilities (0.3426,0.2252,0.4322).<br />
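The minimum chi-square estimate can also be checked by evaluating f(θ) directly on a grid (an illustrative sketch, not part of the notes).

    def f(theta, n=60, counts=(15, 20, 25)):
        """Pearson statistic as a function of theta for this model."""
        probs = (theta, (1 - theta) * theta, (1 - theta) ** 2)
        return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))

    grid = [i / 100000 for i in range(1, 100000)]
    theta_hat = min(grid, key=f)
    print(round(theta_hat, 4))  # approximately 0.3426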

Example 4.18 (Genetics Study) (Larsen & Marx, Prentice-Hall, 1986, p.418) Researchers<br />

were interested in studying two genetic characteristics of corn:

• Starchy (S) or sugary (s), where starchy is the dominant characteristic, and<br />

• Green base leaf (G) or white base leaf (g), where green base leaf is the dominant<br />

characteristic,<br />



Table 4.1: Summary of results of the genetics study.<br />

1 Starchy green (SG) 1997<br />

2 Starchy white (Sg) 906<br />

3 Sugary green (sG) 904<br />

4 Sugary white (sg) 32<br />

in the offspring of self-fertilized hybrid corn plants. The results <strong>for</strong> 3839 offspring are<br />

summarized in Table 4.1.<br />

A model of interest to geneticists hypothesizes that the probabilities of the four outcomes (SG, Sg, sG, sg) are

(p1,p2,p3,p4) = (0.25(2 + θ),0.25(1 − θ),0.25(1 − θ),0.25θ),<br />

where θ is a parameter in the interval (0, 1). θ is a cross-over parameter, indicating a<br />

tendency <strong>for</strong> chromosome pairs Sg and sG in the hybrid parent to cross-over to SG and sg<br />

at the time of reproduction. (As θ increases the tendency to cross-over increases.)<br />

The right part of Figure 4.12 suggests that f(θ) is minimized in the interval 0.02 < θ < 0.05.<br />

Using Newton's method to solve the derivative equation d/dθ (f(θ)) = 0 with initial value 0.03, the minimum chi-square estimate of θ is θ̂ = 0.0357852.

4.6 Rolle’s Theorem and the Mean Value Theorem<br />

The Mean Value Theorem (MVT) connects the local picture about a curve (namely, the<br />

slope of the curve at a given point) to the global picture (namely, the average rate of change<br />

over an interval). Rolle’s Theorem is the first step.<br />

Theorem 4.19 (Rolle’s Theorem) Suppose f is continuous on [a,b], differentiable on<br />

(a,b) and that f(a) = f(b). Then there is a c ∈ (a,b) satisfying f ′ (c) = 0.<br />

Example 4.20 (Cubic Polynomial) For example, let f(x) = 48 − 44x + 12x 2 − x 3 on<br />

[a,b] = [2,4] (left part of Figure 4.13). Since f(2) = f(4), Rolle’s theorem applies. On this<br />

interval, f ′ (c) = 0 when c = 4 − 2/ √ 3 ≈ 2.8453. (c is a critical number corresponding to<br />

the global minimum on the interval.)<br />

Theorem 4.21 (Mean Value Theorem) Suppose f is continuous on the closed interval<br />

[a,b] and differentiable on the open interval (a,b). Then<br />

(f(b) − f(a))/(b − a) = f′(c) for some number c satisfying a < c < b.

Figure 4.13: Plots of y = 48 − 44x + 12x² − x³ (left) and y = 17 + 2x − 3x² + 0.1x⁴ (right).

The MVT can be proven using Rolle’s theorem. Let<br />

s(x) = f(a) + [(f(b) − f(a))/(b − a)](x − a) and h(x) = f(x) − s(x).

Since h(a) = f(a) − f(a) = 0 and h(b) = f(b) − f(b) = 0, the conditions of Rolle's theorem are met. Thus, there is a c ∈ (a,b) satisfying

0 = h′(c) = f′(c) − (f(b) − f(a))/(b − a) ⇒ f′(c) = (f(b) − f(a))/(b − a).

Example 4.22 (Quartic Polynomial) For example, if f(x) = 17+2x−3x 2 +0.1x 4 and<br />

[a,b] = [−5,5] (right part of Figure 4.13), then the average rate of change over the interval is<br />

2. The instantaneous rate of change of the function is 2 when c = 0, c = ± √ 15 ≈ ±3.87298.<br />

4.6.1 Generalized Mean Value Theorem and Error Analysis<br />

The MVT can be extended to two functions.<br />

Theorem 4.23 (Generalized MVT) Suppose f and g are continuous on the closed<br />

interval [a,b] and differentiable on the open interval (a,b). Then<br />

(f(b) − f(a))g ′ (c) = (g(b) − g(a))f ′ (c) <strong>for</strong> some number c satisfying a < c < b.<br />

The generalized MVT can be proven using Rolle’s theorem applied to<br />

h(x) = (f(b) − f(a))g(x) − (g(b) − g(a))f(x).<br />

Since h(a) = f(b)g(a) − g(b)f(a) and h(b) = f(b)g(a) − g(b)f(a), the conditions of Rolle’s<br />

theorem are met. Thus, there is a c ∈ (a,b) satisfying<br />

0 = h ′ (c) = (f(b)−f(a))g ′ (c)−(g(b)−g(a))f ′ (c) ⇒ (f(b)−f(a))g ′ (c) = (g(b)−g(a))f ′ (c).<br />



Error in linear approximations. If we apply the MVT to the function f on the interval<br />

[a,x], then we get<br />

f(x) = f(a) + f ′ (c)(x − a) <strong>for</strong> some c satisfying a < c < x.<br />

That is, the error in using f(a) instead of using f(x) <strong>for</strong> values of x near a is exactly<br />

f ′ (c)(x − a) <strong>for</strong> some number c between a and x.<br />

How about the error in using the tangent line instead of f(x)?<br />

Suppose f is twice differentiable. Let ℓ(x) = f(a) + f ′ (a)(x − a) be the tangent line<br />

to y = f(x) at (a,f(a)), e(x) = f(x) − ℓ(x) be the difference between f and ℓ, and<br />

g(x) = (x − a) 2 . (The function e(x) is often called the error function.)<br />

If we apply the generalized MVT twice to the functions e and g on the interval [a,x], then<br />

we obtain the following equation<br />

f(x) = f(a) + f′(a)(x − a) + (f′′(c)/2)(x − a)² for some c ∈ (a,x).

That is, the error in using values on the tangent line instead of using f(x) when x is near a is exactly (f′′(c)/2)(x − a)² for some number c between a and x. The form of the result is the same if you apply the generalized MVT twice to e and g on the interval [x,a].

Error in quadratic approximations. In Section 6.6 we will consider quadratic, cubic,<br />

quartic and higher order polynomial approximations to a function f, and error analyses. As<br />

a preview, suppose that f is differentiable to order 3 and consider the following quadratic<br />

approximation to f at a,<br />

q(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)².

(q(x) is a quadratic polynomial satisfying q(a) = f(a), q′(a) = f′(a) and q′′(a) = f′′(a).)

Let e(x) = f(x) − q(x) and g(x) = (x − a)³. If we apply the generalized MVT three times to the functions e and g on the interval [a,x], then we obtain the following equation

f(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)² + (f′′′(c)/6)(x − a)³ for some c ∈ (a,x).

That is, the error in using values on the quadratic curve instead of using f(x) when x is near a is exactly (f′′′(c)/6)(x − a)³ for some number c between a and x. The form of the result is the same if you apply the generalized MVT three times to e and g on the interval [x,a].
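As an illustration (a sketch, not part of the notes), the two error formulas can be checked numerically for f(x) = e^x near a = 0, where f′, f′′ and f′′′ all equal e^x, so on [a,x] the errors are bounded by e^x (x − a)²/2 and e^x (x − a)³/6, respectively.

    import math

    a, x = 0.0, 0.3
    f = math.exp

    linear = f(a) + f(a) * (x - a)                  # tangent line (f'(a) = e^a)
    quadratic = linear + (f(a) / 2) * (x - a) ** 2  # quadratic approximation (f''(a) = e^a)

    err_lin = f(x) - linear       # equals f''(c)/2 * (x - a)^2 for some c in (a, x)
    err_quad = f(x) - quadratic   # equals f'''(c)/6 * (x - a)^3 for some c in (a, x)

    print(err_lin, math.exp(x) * (x - a) ** 2 / 2)   # error and its bound
    print(err_quad, math.exp(x) * (x - a) ** 3 / 6)  # error and its bound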

4.7 Integration<br />

If f is the derivative of F, then we can interpret the function f as the instantaneous rate<br />

of change of the function F. Starting with a function f representing instantaneous rate of<br />

change, the techniques of integration allow us to recover F and to compute the total change<br />

in F over an interval.<br />


Figure 4.14: Definite integral as area (left) and difference in areas (right).<br />

4.7.1 Riemann Sums and Definite Integrals<br />

Suppose f is continuous on the closed interval [a,b], and let n be a positive integer. Further,<br />

let △x = (b − a)/n and xi = a + i△x <strong>for</strong> i = 0,1,2,... ,n. Then the n + 1 numbers<br />

a = x0 < x1 < x2 < · · · < xn = b<br />

partition the interval [a,b] into n subintervals of equal length. Lastly, let x_i^* be the midpoint of the i-th subinterval: x_i^* = (x_{i−1} + x_i)/2.

The definite integral of f from a to b is

∫_a^b f(x) dx = lim_{n→∞} Σ_{i=1}^{n} f(x_i^*) △x.

In this <strong>for</strong>mula: a is the lower limit of integration, b is the upper limit of integration, f(x)<br />

is the integrand, and the sum on the right is called a Riemann sum.<br />

Continuity and Integrability. f is said to be integrable on the interval [a,b] if the limit<br />

above exists. Continuous functions are always integrable. Further, it is possible to show<br />

that the limit exists even if f has a finite number of “jump discontinuities” on [a,b].<br />

Area and difference in areas. If f is nonnegative on [a,b], then ∫_a^b f(x) dx can be

interpreted as the area under the curve y = f(x) and above the x-axis when a ≤ x ≤ b.<br />

Otherwise, the integral can be interpreted as the difference between two areas.<br />

Example 4.24 (Computing Area) Let f(x) = √ x on the interval [0,4]. To estimate<br />

the area under y = f(x) and above the x-axis we let n = 10 and ∆x = 0.4, as illustrated in<br />

the left part of Figure 4.14. The 10 midpoints are

0.2, 0.6, 1.0, 1.4, 1.8, 2.2, 2.6, 3.0, 3.4, 3.8,

and the sum of the areas is 5.347.


Figure 4.15: Definite integral as average (left) and arc length (right).<br />

The following list of approximate values suggests that the limit is 16/3.<br />

n 10 50 90 130 170 210 250 290 330 370 410<br />

Sum 5.347 5.335 5.334 5.334 5.334 5.333 5.333 5.333 5.333 5.333 5.333<br />
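The entries in the table can be reproduced with a few lines of code; the following Python sketch (not part of the notes) implements the midpoint-rule Riemann sum described above.

    def midpoint_sum(f, a, b, n):
        """Midpoint-rule Riemann sum for the definite integral of f over [a, b]."""
        dx = (b - a) / n
        return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

    for n in (10, 50, 250):
        print(n, round(midpoint_sum(lambda x: x ** 0.5, 0, 4, n), 3))
    # the values approach 16/3 = 5.333... as n grows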

Example 4.25 (Difference in Areas) Let f(x) = x3 − 4x on the interval [0,3]. To<br />

estimate the definite integral we let n = 9 and ∆x = 1/3, as illustrated in the right part of Figure 4.14. The 9 midpoints are

0.167, 0.5, 0.833, 1.167, 1.5, 1.833, 2.167, 2.5, 2.833,

and the sum is Σ_i f(x_i^*)∆x = 2.125.

The following list of approximate values suggests that the limit is 2.25.<br />

n 9 24 39 54 69 84 99 114 129 144 159<br />

Sum 2.125 2.232 2.243 2.247 2.248 2.249 2.249 2.249 2.249 2.25 2.25<br />

Geometrically, the limit is the difference between areas:<br />

2.25 = ∫_0^3 f(x) dx = ∫_0^2 f(x) dx + ∫_2^3 f(x) dx = −4 + 6.25.

(The area between 0 and 2 is negated because it lies below the x-axis.)<br />

Note that the total area bounded by y = f(x), y = 0, x = 0 and x = 3 is 4 + 6.25 = 10.25.<br />

Average values. The quantity (1/(b − a)) ∫_a^b f(x) dx can be interpreted as the average value of f(x) on the interval [a,b].

To see this, let f̄ = (1/n) Σ_{i=1}^{n} f(x_i^*) be the sample average. Then

f̄ = (1/(b − a)) · ((b − a)/n) Σ_{i=1}^{n} f(x_i^*) = (1/(b − a)) Σ_{i=1}^{n} f(x_i^*) △x ⇒ lim_{n→∞} f̄ = (1/(b − a)) ∫_a^b f(x) dx.


Example 4.26 (Average Value) Let f(x) = 225e^(0.08x) on the interval [0,35]. To estimate

the average of f(x) we let n = 5 and ∆x = 35/5 = 7, as illustrated in the left part of<br />

Figure 4.15. Function values at the 5 midpoints are<br />

f(3.5) ≈ 297.704, f(10.5) ≈ 521.183, f(17.5) ≈ 912.42, f(24.5) ≈ 1597.35, f(31.5) ≈ 2796.43,<br />

and the average of these 5 numbers is 1225.02.<br />

The following list of approximate values suggests that the limit is 1241.08.<br />

n 5 30 55 80 105 130 155 180 205<br />

Avg 1225.02 1240.64 1240.95 1241.02 1241.05 1241.06 1241.07 1241.08 1241.08<br />

Arc lengths. Under mild conditions, the quantity ∫_a^b √(1 + (f′(x))²) dx can be interpreted as the length of the curve y = f(x) for x ∈ [a,b].

To see this, let y_i = f(x_i), ∆y_i = y_i − y_{i−1}, and

Σ_{i=1}^{n} √((∆x)² + (∆y_i)²) = Σ_{i=1}^{n} √(1 + (∆y_i/∆x)²) ∆x

be the sum of lengths of line segments joining successive points:

(x0,y0) → (x1,y1) → · · · → (xn,yn).

If f′ is continuous and not zero on [a,b] (except possibly at the endpoints), then

∆y_i/∆x ≈ f′(x_i^*) for i = 1,2,...,n,

the sum of segment lengths can be approximated as follows,

Σ_{i=1}^{n} √((∆x)² + (∆y_i)²) ≈ Σ_{i=1}^{n} √(1 + (f′(x_i^*))²) ∆x,

and the limit of the Riemann sum on the right is the arc length.<br />

Example 4.27 (Arc Length) Let f(x) = x 3 on the interval [0,4]. To estimate the length<br />

of the curve we let n = 4 and ∆x = 4/4 = 1, as illustrated in the right part of Figure 4.15.<br />

The sum of the lengths of the segments connecting the points (0,0), (1,1), (2,8), (3,27), (4,64)<br />

is 64.525.<br />

The following list of approximate values suggests that the limit is 64.672.<br />

n 4 12 20 28 36 44 52 60 68 76<br />

Sum 64.525 64.659 64.667 64.669 64.67 64.671 64.671 64.671 64.672 64.672<br />
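The polyline approximation can be reproduced directly (an illustrative sketch, not part of the notes): sum the lengths of the segments joining successive points on the curve.

    import math

    def polyline_length(f, a, b, n):
        """Length of the piecewise-linear curve through (x_i, f(x_i)), i = 0..n."""
        dx = (b - a) / n
        ys = [f(a + i * dx) for i in range(n + 1)]
        return sum(math.hypot(dx, ys[i] - ys[i - 1]) for i in range(1, n + 1))

    for n in (4, 20, 76):
        print(n, round(polyline_length(lambda x: x ** 3, 0, 4, n), 3))
    # the values approach the arc length, approximately 64.672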



4.7.2 Properties of Definite Integrals<br />

Suppose that f and g are integrable on [a,b], and let c be a constant. Then properties of<br />

limits imply the following properties of definite integrals:<br />

1. Constant functions: ∫_a^b c dx = c(b − a).

2. Sums and differences: ∫_a^b (f(x) ± g(x)) dx = ∫_a^b f(x) dx ± ∫_a^b g(x) dx.

3. Constant multiples: ∫_a^b c f(x) dx = c ∫_a^b f(x) dx.

4. Adjacent intervals: ∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx when a < c < b.

5. Null interval: ∫_a^a f(x) dx = 0.

6. Reverse limits of integration: ∫_a^b f(x) dx = −∫_b^a f(x) dx.

7. Ordering: If f(x) ≤ g(x) for all x on [a,b], then ∫_a^b f(x) dx ≤ ∫_a^b g(x) dx.

8. Absolute value: If |f| is integrable, then |∫_a^b f(x) dx| ≤ ∫_a^b |f(x)| dx.

4.7.3 Fundamental Theorem of Calculus<br />

The Fundamental Theorem of Calculus (FTC) relates integration and differentiation, and<br />

gives us a method <strong>for</strong> computing definite integrals when f(x) is a derivative function.<br />

Theorem 4.28 (Fundamental Theorem of Calculus) Let f be a continuous function<br />

on the closed interval [a,b]. Then<br />

1. If F(x) = ∫_a^x f(t) dt, then F′(x) = f(x) for x ∈ [a,b].

2. If f(x) = F′(x) for all x in [a,b], then ∫_a^b f(x) dx = F(b) − F(a).

The first part of the fundamental theorem says that f(x) is the rate of change of F(x) on<br />

the interval [a,b]. The second part of the theorem says that the definite integral of a rate<br />

of change is the total change over the interval.<br />

Derivatives of integrals with variable endpoints. A useful generalization of the<br />

Fundamental Theorem of Calculus is as follows: If ℓ(x) ≤ u(x) are differentiable functions<br />

(<strong>for</strong> the lower and upper endpoints), f is continuous, and<br />

G(x) = ∫_{ℓ(x)}^{u(x)} f(t) dt,

then G′(x) = f(u(x)) u′(x) − f(ℓ(x)) ℓ′(x).


Table 4.2: Table of Indefinite Integrals.

∫ c f(x) dx = c ∫ f(x) dx
∫ (f(x) + g(x)) dx = ∫ f(x) dx + ∫ g(x) dx
∫ x^n dx = x^{n+1}/(n+1) + C (n ≠ −1)
∫ (1/x) dx = ln|x| + C (x ≠ 0)
∫ e^x dx = e^x + C
∫ ln(x) dx = x ln(x) − x + C (x > 0)
∫ sin(x) dx = −cos(x) + C
∫ cos(x) dx = sin(x) + C
∫ sec²(x) dx = tan(x) + C
∫ csc²(x) dx = −cot(x) + C
∫ sec(x) tan(x) dx = sec(x) + C
∫ csc(x) cot(x) dx = −csc(x) + C
∫ 1/(1 + x²) dx = tan⁻¹(x) + C
∫ 1/√(1 − x²) dx = sin⁻¹(x) + C

4.7.4 Antiderivatives and Indefinite Integrals

The function F(x) in the FTC is called an antiderivative of f(x). Since the derivative of a<br />

constant function is zero, antiderivative functions are not unique. The family of antiderivatives,<br />

known as the indefinite integral of f(x), is often written as follows:<br />

<br />

∫ f(x) dx = F(x) + C where C is a constant (the constant of integration).

If F(x) is an antiderivative of f(x), then the following notation is used when evaluating a<br />

definite integral using the FTC:<br />

∫_a^b f(x) dx = [F(x)]_a^b = F(b) − F(a).

That is, we compute the difference between the function value at the upper endpoint and<br />

the function value at the lower endpoint.<br />

4.7.5 Some Formulas <strong>for</strong> Indefinite Integrals<br />

A useful first list of indefinite integrals is given in Table 4.2.<br />

Example 4.29 (Computing Area, continued) Since ∫ x^{1/2} dx = (2/3)x^{3/2} + C, the area under the curve y = √x = x^{1/2} and above the x-axis for x ∈ [0,4] is

∫_0^4 √x dx = [(2/3)x^{3/2}]_0^4 = 16/3.

This answer confirms the earlier result.



Rules for polynomials. If p(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0 is a polynomial function, then information in Table 4.2 and properties of integrals given earlier imply that

∫ p(x) dx = (a_n/(n + 1)) x^{n+1} + (a_{n−1}/n) x^n + · · · + (a_2/3) x^3 + (a_1/2) x^2 + a_0 x + C.

In particular, ∫ a dx = ax + C and ∫ (ax + b) dx = (a/2)x² + bx + C.

Example 4.30 (Difference in Areas, continued) Since ∫ (x³ − 4x) dx = (1/4)x⁴ − 2x² + C, the definite integral of f(x) = x³ − 4x for x ∈ [0,3] is

∫_0^3 (x³ − 4x) dx = [(1/4)x⁴ − 2x²]_0^3 = 9/4.

This answer confirms the earlier result.

Substitution rule for composite functions. The chain rule for compositions says that

d/dx (f(g(x))) = f′(g(x))g′(x) where f and g are differentiable functions.

Thus,

∫ f′(g(x))g′(x) dx = f(g(x)) + C

is a general rule for finding antiderivatives.

If we let u = g(x) and du = g′(x) dx, then we can rewrite the formula above as

∫ f′(u) du = f(u) + C.

(We substitute u for g(x) and du for g′(x) dx.)

For example, consider finding

∫ x/√(x² + 1) dx.

To apply substitution, let u = x² + 1. Then du = 2x dx (or x dx = (1/2) du), and the integral becomes

∫ 1/(2√u) du = u^{1/2} + C = √(x² + 1) + C.

Compositions when g(x) is linear. If g(x) = ax + b, where a and b are constants and a ≠ 0, then a general rule for finding the indefinite integral is as follows:

∫ f′(ax + b) dx = (1/a) f(ax + b) + C.

Example 4.31 (Average Value, continued) Since ∫ 225e^{0.08x} dx = 2812.5e^{0.08x} + C, the average value of f(x) = 225e^{0.08x} for x ∈ [0,35] is

(1/35) ∫_0^35 225e^{0.08x} dx = (1/35) [2812.5e^{0.08x}]_0^35 ≈ 1241.08.

This answer confirms the earlier result.



Integration by parts. The derivative rule <strong>for</strong> products says that<br />

d/dx (f(x)g(x)) = f′(x)g(x) + f(x)g′(x) where f and g are differentiable functions.

Thus,

∫ (f′(x)g(x) + f(x)g′(x)) dx = f(x)g(x) + C ⇒ ∫ f(x)g′(x) dx = f(x)g(x) − ∫ f′(x)g(x) dx

is a general rule for finding antiderivatives.

If we let u = f(x), du = f′(x) dx, v = g(x), and dv = g′(x) dx, then we can rewrite the formula above as follows:

∫ u dv = uv − ∫ v du.

In integration by parts, the problem of finding the indefinite integral ∫ u dv is replaced by the problem of finding ∫ v du. Care must be taken to choose u and v so that the second integral is easier to compute than the first.

For example, to find ∫ x e^x dx, we let u = x and dv = e^x dx. Then du = dx. Since

∫ dv = ∫ e^x dx = e^x + C

and v can be any antiderivative, we choose v = e^x. Now

∫ u dv = uv − ∫ v du = x e^x − ∫ e^x dx = x e^x − e^x + C.

Similarly, to find ∫ x ln(x) dx, we let u = ln(x) and dv = x dx. Then du = (1/x) dx. Since

∫ dv = ∫ x dx = x²/2 + C

and v can be any antiderivative, we choose v = x²/2. Now,

∫ u dv = uv − ∫ v du = (1/2)x² ln(x) − ∫ (x²/2)(1/x) dx = (1/2)x² ln(x) − x²/4 + C.

4.7.6 Approximation Methods<br />

Finding antiderivatives is not as straight<strong>for</strong>ward as finding derivatives. In fact, in many<br />

situations an antiderivative cannot be written in closed form. (For example, the antiderivative of f(x) = e^x/x cannot be expressed in terms of elementary functions.) When a closed-form antiderivative is not available, an approximation method can

be used to estimate a definite integral.<br />

Midpoint rule. A natural estimate to use is the value of the Riemann sum<br />

Σ_{i=1}^{n} f(x_i^*) △x when n is large.

Since x ∗ i was chosen to be the midpoint of [xi−1,xi] <strong>for</strong> each i, this method of estimating<br />

the definite integral is called the midpoint rule.<br />


Figure 4.16: Plots of y = e x /x with trapezoidal (left) and Simpson (right) approximations.<br />

Trapezoidal and Simpson’s rules. In the trapezoidal rule, the area under the piecewise<br />

linear curve defined by the points<br />

(xi,f(xi)) <strong>for</strong> i = 0,1,2,... ,n<br />

is used to estimate the area under the nonnegative curve y = f(x). (Geometrically, a sum<br />

of areas of trapezoids replaces a sum of areas of rectangles.)<br />

In Simpson’s rule, the area under a piecewise quadratic curve is used to estimate the area un-<br />

der the nonnegative curve y = f(x), where the quadratic polynomial <strong>for</strong> the ith subinterval<br />

passes through the points (xi−1,f(xi−1)), (x∗ i ,f(x∗ i )), and (xi,f(xi)).<br />

Example 4.32 (Figure 4.16) Let f(x) = e x /x on the interval [1,7]. The left part of<br />

Figure 4.16 shows the setup <strong>for</strong> the trapezoidal rule, and the right part shows the setup<br />

<strong>for</strong> Simpson’s rule, when n = 3 subintervals are used. The approximation in the right plot<br />

is significantly better than the one in the left plot. (Note that the numerical value of the<br />

integral is 189.61.)<br />
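The following Python sketch (not part of the notes) compares the three rules for this integral; with only n = 3 subintervals, Simpson's rule is already close to the numerical value 189.61.

    import math

    def midpoint(f, a, b, n):
        dx = (b - a) / n
        return dx * sum(f(a + (i + 0.5) * dx) for i in range(n))

    def trapezoid(f, a, b, n):
        dx = (b - a) / n
        xs = [a + i * dx for i in range(n + 1)]
        return dx * (sum(f(x) for x in xs) - 0.5 * (f(a) + f(b)))

    def simpson(f, a, b, n):
        # piecewise-quadratic rule through the endpoints and midpoint of each subinterval
        dx = (b - a) / n
        total = 0.0
        for i in range(n):
            x0, x1 = a + i * dx, a + (i + 1) * dx
            total += dx / 6 * (f(x0) + 4 * f((x0 + x1) / 2) + f(x1))
        return total

    f = lambda x: math.exp(x) / x
    for rule in (midpoint, trapezoid, simpson):
        print(rule.__name__, round(rule(f, 1, 7, 3), 2))  # Simpson is closest to 189.61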

Newton-Cotes and Gaussian methods. The midpoint rule gives exact answers if f is<br />

a constant function, the trapezoidal rule gives exact answers if f is a constant function or a<br />

linear function, and Simpson’s rule gives exact answers if f is a constant function, a linear<br />

function or a quadratic function. They are examples of Newton-Cotes approximations.<br />

An approximation of the Newton-Cotes type will give exact answers when f is a polynomial<br />

function of degree at most a fixed constant. Gaussian quadrature methods take Newton-<br />

Cotes methods a step further, by choosing partition points that are not necessarily equally spaced;

the spacing is chosen to minimize the approximation error.<br />

Monte Carlo methods. Monte Carlo methods are based on the idea of averaging (see<br />

the discussion of average values in Section 4.7.1). Let x1, x2, ..., xn be n numbers chosen uniformly from the interval [a,b]. Then a Monte Carlo estimate of ∫_a^b f(x) dx is

(b − a) (1/n) Σ_{i=1}^{n} f(x_i).



Figure 4.17: Improper integral examples on finite intervals.<br />

Let f(x) = e x /x and [a,b] = [1,7] (see Figure 4.16). A Monte Carlo estimate of the integral,<br />

based on n = 50000 random numbers, was 190.061.<br />
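A sketch of the Monte Carlo estimate described above (not part of the notes; the seed is an arbitrary choice, and each run gives a slightly different answer near the true value).

    import math
    import random

    def monte_carlo_integral(f, a, b, n, seed=0):
        rng = random.Random(seed)
        xs = (rng.uniform(a, b) for _ in range(n))
        return (b - a) * sum(f(x) for x in xs) / n

    estimate = monte_carlo_integral(lambda x: math.exp(x) / x, 1, 7, 50000)
    print(round(estimate, 2))  # a random estimate near the numerical value 189.61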

Power series methods. Methods based on power series expansions can also be used to<br />

estimate an integral. See Section 6.7.2.<br />

4.7.7 Improper Integrals<br />

The definition of the definite integral can be extended to include situations where f is<br />

undefined at either the lower or upper endpoint of integration, and to include situations<br />

where a = −∞ or b = ∞.<br />

Specifically,<br />

1. If f is integrable on [c,b] for all a < c < b, then ∫_a^b f(x) dx = lim_{c→a⁺} ∫_c^b f(x) dx when this limit exists.

2. If f is integrable on [a,c] for all a < c < b, then ∫_a^b f(x) dx = lim_{c→b⁻} ∫_a^c f(x) dx when this limit exists.

3. If f is integrable on [a,b] for all b greater than a fixed a, then ∫_a^∞ f(x) dx = lim_{b→∞} ∫_a^b f(x) dx when this limit exists.

4. If f is integrable on [a,b] for all a less than a fixed b, then ∫_{−∞}^b f(x) dx = lim_{a→−∞} ∫_a^b f(x) dx when this limit exists.




Figure 4.18: Improper integral examples on infinite intervals.<br />

When the limit exists, the improper integral is said to converge; otherwise, the improper<br />

integral is said to diverge.<br />

Example 4.33 (Improper Integrals on Finite Intervals) Let f(x) = 10/x² on the interval (0,2], as illustrated in the left part of Figure 4.17. Since

∫_0^2 f(x) dx = lim_{c→0⁺} [−10/x]_c^2 = −5 − lim_{c→0⁺} (−10/c) does not exist,

the improper integral diverges.

Similarly, let f(x) = 10/√(5 − x) on the interval [0,5), as illustrated in the right part of Figure 4.17. Since

∫_0^5 f(x) dx = lim_{c→5⁻} [−20√(5 − x)]_0^c = lim_{c→5⁻} (−20√(5 − c) + 20√5) = 0 + 20√5 = 20√5,

the improper integral converges to 20√5 ≈ 44.7214.

Example 4.34 (Improper Integrals on Infinite Intervals) Let f(x) = xe^{−x} on the interval [0, ∞), as illustrated in the left part of Figure 4.18. Since

∫_0^∞ f(x) dx = lim_{c→∞} [−xe^{−x} − e^{−x}]_0^c   (using integration-by-parts)
            = lim_{c→∞} (−ce^{−c} − e^{−c}) + 1
            = lim_{c→∞} (−c/e^c) − lim_{c→∞} (e^{−c}) + 1
            = lim_{c→∞} (−1/e^c) − 0 + 1   (using L'Hospital's Rule)
            = 0 − 0 + 1 = 1,

the improper integral converges to 1.

Similarly, let f(x) = 10/√x on the interval [1, ∞), as illustrated in the right part of Figure 4.18. Since

∫_1^∞ f(x) dx = lim_{c→∞} [20√x]_1^c = lim_{c→∞} (20√c − 20) does not exist,

the improper integral diverges.




Figure 4.19: Standardized binomial distribution with standard normal approximation.<br />

4.8 Applications in <strong>Statistics</strong><br />

Integration methods are important in statistics, and, in particular, when working with<br />

density functions. Density functions are used to “smooth over” probability histograms of<br />

discrete random variables, and as functions supporting the theory of continuous random<br />

variables. Continuous random variables are studied in Chapter 5 of these notes.<br />

4.8.1 deMoivre-Laplace Limit Theorem<br />

The most famous example of “smoothing over” is given in the following theorem.<br />

Theorem 4.35 (deMoivre-Laplace Limit Theorem) Let Xn be a binomial random<br />

variable based on n trials of a Bernoulli experiment with success probability p, and let<br />

Zn = (Xn − np)/√(npq)

be the standardized form of Xn. (E(Zn) = 0 and SD(Zn) = 1.) Then, for each real number x,

lim_{n→∞} P(Zn ≤ x) = ∫_{−∞}^x (1/√(2π)) e^{−z²/2} dz.

The deMoivre-Laplace limit theorem implies that the area under the standard normal density<br />

function, φ(z) = (1/√(2π)) e^{−z²/2}, can be used to estimate binomial probabilities when n is large enough. (See Section 5.2.5 for more details on the normal distribution.)

Example 4.36 (Binomial Approximation) Let n = 40 and p = 1/2. In Figure 4.19, the<br />

function φ(z) is superimposed on the probability histogram <strong>for</strong> the standardized random<br />

variable Z40 = (X40 − 20)/ √ 10.<br />

The probability P(X40 ≤ 22) can be computed using the binomial PDF:<br />

<br />

P(X40 ≤ 22) = Σ_{x=0}^{22} C(40, x) (1/2)^40 = 0.785205.


Using the normal approximation (with continuity correction) we get<br />

P(X40 ≤ 22) ≈ Φ((22.5 − 20)/√10) = 0.785402,

where Φ(x) is the area under φ(z) for z ≤ x. The values are close.
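Both numbers can be reproduced with a short computation (a sketch, not part of the notes), using the exact binomial sum and the identity Φ(x) = (1 + erf(x/√2))/2 for the standard normal CDF.

    import math

    n, p = 40, 0.5

    def binom_cdf(k, n, p):
        return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

    def Phi(z):
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    exact = binom_cdf(22, n, p)
    approx = Phi((22.5 - n * p) / math.sqrt(n * p * (1 - p)))   # continuity correction
    print(round(exact, 6), round(approx, 6))  # approximately 0.785205 and 0.785402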

Stirling’s <strong>for</strong>mula. An important component of the proof of the deMoivre-Laplace limit<br />

theorem is the use of Stirling’s <strong>for</strong>mula <strong>for</strong> factorials. Stirling proved that<br />

lim_{n→∞} n! / (√(2π) n^{n+1/2} e^{−n}) = 1.

Thus, for large values of n, the formula √(2π) n^{n+1/2} e^{−n} can be used to approximate n!.

4.8.2 Computing Quantiles

If the continuous random variable X has density function f(x), then the cumulative distribution function (CDF) of X can be obtained using integration:

F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt.

If 0 < p < 1, then the pth quantile of the X distribution (when it exists) is the number, xp,<br />

satisfying the equation<br />

F(xp) = ∫_{−∞}^{xp} f(t) dt = p.

Example 4.37 (Pareto Distribution) The Pareto distribution has density function<br />

f(x) = α/x^{α+1} when x > 1 and 0 otherwise,

where α is a positive constant.

Pareto distributions are used to model incomes in a population. The CDF of the Pareto distribution is found using integration. Specifically,

F(x) = ∫_{−∞}^x f(t) dt = ∫_1^x (α/t^{α+1}) dt = [−t^{−α}]_1^x = 1 − x^{−α}

when x > 1 and 0 otherwise. For each p, the p-th quantile is

p = F(xp) ⇒ xp = (1 − p)^{−1/α}.

For example, the quartiles (25 th , 50 th , and 75 th percentiles) of the Pareto distribution with<br />

parameter α = 2.5 are 1.122, 1.320, and 1.741, respectively.<br />
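A sketch (not part of the notes) that evaluates the closed-form quantile function x_p = (1 − p)^{−1/α} for α = 2.5:

    def pareto_quantile(p, alpha):
        """p-th quantile of the Pareto distribution with parameter alpha."""
        return (1 - p) ** (-1 / alpha)

    alpha = 2.5
    print([round(pareto_quantile(p, alpha), 3) for p in (0.25, 0.50, 0.75)])
    # [1.122, 1.32, 1.741], the quartiles quoted above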

Additional examples will be given in the next chapter.<br />



5 Continuous Random Variables<br />

Researchers use random variables to describe the numerical results of experiments. In the<br />

continuous setting, the possible numerical values <strong>for</strong>m an interval or a union of intervals.<br />

For example, a random variable whose values are the positive real numbers might be used<br />

to describe the lifetimes of individuals in a population.<br />

This chapter focuses on continuous random variables and their probability distributions.<br />

5.1 Definitions<br />

Recall that a random variable is a function from the sample space of an experiment to the<br />

real numbers and that the random variable X is said to be continuous if its range is an<br />

interval or a union of intervals.<br />

5.1.1 PDF and CDF <strong>for</strong> Continuous Distributions<br />

Let X be a continuous random variable. We assume there exists a nonnegative function<br />

f(x) defined over the entire real line such that<br />

<br />

P(X ∈ D) = ∫_D f(x) dx,

where D ⊆ R is an interval or union of intervals. The function f(x) is called the probability<br />

density function (PDF) (or density function) of the continuous random variable X.<br />

PDFs satisfy the following properties:<br />

1. f(x) ≥ 0 whenever it exists.<br />

2. ∫_R f(x) dx equals 1, where R is the range of X.

f represents the rate of change of probability; the rate must be nonnegative (property 1).<br />

The area under f represents probability; the total probability must be 1 (property 2).<br />

The cumulative distribution function (CDF) of X, denoted by F(x), is the function<br />

<br />

F(x) = P(X ≤ x) = ∫_{(−∞,x]} f(x) dx for all real numbers x.

CDFs satisfy the following properties:

1. lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.

2. If x1 ≤ x2, then F(x1) ≤ F(x2).

3. F(x) is continuous.


Figure 5.1: Probability density function (left plot) and cumulative distribution function<br />

(right plot) <strong>for</strong> a continuous random variable with range x ≥ 0.<br />

F represents cumulative probability, with limits 0 and 1 (property 1). Cumulative probability<br />

increases with increasing x (property 2). For continuous random variables, the<br />

cumulative distribution function is continuous (property 3).<br />

Note that if R is the range of the continuous random variable X and I is an interval, then<br />

the probability of the event “the value of X is in the interval I” is computed by finding the<br />

area under the density curve <strong>for</strong> x ∈ I ∩ R:<br />

<br />

P(X ∈ I) = ∫_{I∩R} f(x) dx.

In particular, if a ∈ R, then P(X = a) = 0 since the area under the curve over an interval of length zero is zero.

Example 5.1 (Distribution on Nonnegative Reals) Let X be a continuous random variable whose range is the nonnegative real numbers and whose PDF is

f(x) = 200/(10 + x)³ when x ≥ 0 and 0 otherwise.

The left part of Figure 5.1 is a plot of the density function of X and the right part is a plot<br />

of its cumulative distribution function:<br />

F(x) = 1 − 100/(10 + x)² when x ≥ 0 and 0 otherwise.

Further, for this random variable, the probability that X is greater than 8 is

P(X > 8) = ∫_8^∞ f(x) dx = 1 − F(8) = 100/18² ≈ 0.3086.

5.1.2 Quantiles and Percentiles

Assume that 0 < p < 1. The p th quantile (or 100p th percentile) of the X distribution (when<br />

it exists) is the point, xp, satisfying the equation<br />

P(X ≤ xp) = p.<br />



To find xp, solve the equation F(x) = p <strong>for</strong> x.<br />

Important special cases of quantiles are as follows:<br />

1. The median of the X distribution: the 50 th percentile.<br />

2. The quartiles of the X distribution: the 25 th , 50 th , and 75 th percentiles.<br />

3. The deciles of the X distribution: the 10 th , 20 th , ..., 90 th percentiles.<br />

Interquartile range. The interquartile range (IQR) is the difference between the 75 th<br />

and 25 th percentiles: IQR = x0.75 − x0.25.<br />

Example 5.2 (Distribution on Nonnegative Reals, continued) Continuing with the<br />

example above, a general <strong>for</strong>mula <strong>for</strong> the p th quantile is<br />

p = 1 − 100/(10 + xp)² ⇒ xp = 10/√(1 − p) − 10.

In particular, the quartiles are

x0.25 = 20/√3 − 10 ≈ 1.547, x0.50 = 10√2 − 10 ≈ 4.142 and x0.75 = 10.

The IQR of this distribution is IQR = 20 − 20/√3 ≈ 8.453.

Measures of center and spread. The median is a measure of the center (or location)<br />

of a distribution. The IQR is a measure of the scale (or spread) of a distribution.<br />

Additional measures of center and spread are the mean and standard deviation, respectively.<br />

Definitions <strong>for</strong> continuous random variables are given in Section 5.3.3.<br />

5.2 Families of Distributions<br />

This section briefly describes four families of continuous probability distributions, considers<br />

properties of these distributions, and considers continuous trans<strong>for</strong>mations.<br />

5.2.1 Example: Uni<strong>for</strong>m Distribution<br />

Let a and b be real numbers with a < b. The continuous random variable X is said to be a<br />

uni<strong>for</strong>m random variable, or to have a uni<strong>for</strong>m distribution, on the interval [a,b] when its<br />

PDF is as follows:<br />

f(x) = 1/(b − a) when a ≤ x ≤ b and 0 otherwise.

Note that the open interval (a,b), or one of the half-closed intervals [a,b) or (a,b], can be<br />

used instead of [a,b] as the range of a uni<strong>for</strong>m random variable.<br />


Figure 5.2: PDF (left) and CDF (right) <strong>for</strong> the uni<strong>for</strong>m random variable on [10,50].<br />

Uni<strong>for</strong>m distributions have constant density over an interval. That density is the reciprocal<br />

of the length of the interval. If X is a uni<strong>for</strong>m random variable on the interval [a,b], and<br />

[c,d] ⊆ [a,b] is a subinterval, then<br />

P(c ≤ X ≤ d) = ∫_c^d 1/(b − a) dx = (d − c)/(b − a) = (Length of [c,d])/(Length of [a,b]).

Also, P(c < X < d) = P(c < X ≤ d) = P(c ≤ X < d) = (d − c)/(b − a).<br />

If 0 < p < 1, then the p th quantile is xp = a + p(b − a).<br />

Example 5.3 (a = 10, b = 50) Let X be a uni<strong>for</strong>m random variable on the interval<br />

[10,50]. X has median 30 and IQR 20. Further,<br />

P(15 ≤ X ≤ 42) = 27/40 = 0.675.

The PDF and CDF of X are shown in Figure 5.2.<br />

Computer-generated random numbers. Note that computer commands that return<br />

“random” numbers in the interval [0,1] are simulating random numbers from the uni<strong>for</strong>m<br />

distribution on the interval [0,1].<br />

5.2.2 Example: Exponential Distribution<br />

Let λ be a positive real number. The continuous random variable X is said to be an<br />

exponential random variable, or to have an exponential distribution, with parameter λ when<br />

its PDF is as follows:<br />

f(x) = λe −λx when x ≥ 0 and 0 otherwise.<br />

Note that the interval x > 0 can be used instead of the interval x ≥ 0 as the range of an<br />

exponential random variable.<br />




Figure 5.3: PDF (left) and CDF (right) <strong>for</strong> the exponential random variable with λ = 4.<br />

Exponential distributions are often used to represent the time that elapses be<strong>for</strong>e the occurrence<br />

of an event. For example, the time that a machine component will operate be<strong>for</strong>e<br />

breaking down.<br />

If X is an exponential random variable with parameter λ and [a,b] ⊆ [0, ∞), then<br />

P(a ≤ X ≤ b) = ∫_a^b λe^{−λx} dx = [−e^{−λx}]_a^b = −e^{−λb} + e^{−λa}.

If 0 < p < 1, then the p-th quantile is xp = −ln(1 − p)/λ.

Example 5.4 (λ = 4) Let X be an exponential random variable with parameter 4. Using<br />

properties of logarithms, X has median ln(2)/4 ≈ 0.173 and IQR ln(3)/4 ≈ 0.275. Further,<br />

P(1/4 ≤ X ≤ 3/4) = −e^{−3} + e^{−1} ≈ 0.318.

The PDF and CDF of X are shown in Figure 5.3.<br />

The exponential distribution is discussed further in Section 6.9.1.<br />
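The median, IQR, and interval probability quoted in Example 5.4 follow directly from the formulas above; a short Python check (a sketch, not part of the notes):

    import math

    lam = 4.0
    quantile = lambda p: -math.log(1 - p) / lam

    median = quantile(0.5)                                     # ln(2)/4
    iqr = quantile(0.75) - quantile(0.25)                      # ln(3)/4
    prob = -math.exp(-lam * 0.75) + math.exp(-lam * 0.25)      # P(1/4 <= X <= 3/4)
    print(round(median, 3), round(iqr, 3), round(prob, 3))     # 0.173, 0.275, 0.318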

5.2.3 Euler Gamma Function<br />

Let r be a positive real number. The Euler gamma function is defined as follows:<br />

Γ(r) = ∫_0^∞ x^{r−1} e^{−x} dx.

If r is a positive integer, then Γ(r) = (r−1)!. Thus, the gamma function is said to interpolate<br />

the factorials.<br />

Remark. The property Γ(r) = (r−1)! <strong>for</strong> positive integers can be proven using induction.<br />

To start the induction, you need to demonstrate that Γ(1) = 1. To prove the induction<br />

step, you need to use integration-by-parts to demonstrate that Γ(r + 1) = r × Γ(r).<br />


Figure 5.4: PDF (left) and CDF (right) <strong>for</strong> the gamma random variable with shape parameter<br />

α = 2 and scale parameter β = 5.<br />

5.2.4 Example: Gamma Distribution<br />

Let α and β be positive real numbers. The continuous random variable X is said to be a<br />

gamma random variable, or to have a gamma distribution, with parameters α and β when<br />

its PDF is as follows:<br />

f(x) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β} when x > 0 and 0 otherwise.

Note that if α ≥ 1, then the interval x ≥ 0 can be used instead of the interval x > 0 as the<br />

range of a gamma random variable.<br />

The parameter α is called a shape parameter; gamma distributions with the same value<br />

of α have the same shape. The parameter β is called a scale parameter; <strong>for</strong> fixed α, as β<br />

changes, the scale on the vertical and horizontal axes change, but the shape remains the<br />

same.<br />

Example 5.5 (α = 2, β = 5) Let X be a gamma random variable with shape parameter<br />

2 and scale parameter 5. Using Newton’s method, the median of X is approximately 8.39<br />

and the IQR is approximately 8.65. Further,<br />

P(8 ≤ X ≤ 20) = ∫_8^20 (1/25) x e^{−x/5} dx = [−e^{−x/5} − (1/5) x e^{−x/5}]_8^20 ≈ 0.433.

The PDF and CDF of X are shown in Figure 5.4.

Remarks. The fact that f(x) is a valid PDF can be demonstrated using the method of<br />

substitution and the definition of the Euler gamma function. If the shape parameter α = 1,<br />

then the gamma distribution is the same as the exponential distribution with λ = 1/β.<br />

The gamma distribution is discussed further in Section 6.9.2.<br />



5.2.5 Example: Normal Distribution<br />

Let µ be a real number and σ be a positive real number. The continuous random variable<br />

X is said to be a normal random variable, or to have a normal distribution, with parameters<br />

µ and σ when its PDF is as follows:<br />

f(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)) for all real numbers x,

where exp() is the exponential function. The normal distribution is also called the Gaussian<br />

distribution, in honor of the mathematician Carl Friedrich Gauss.<br />

The graph of the PDF of a normal random variable is the famous bell-shaped curve. The<br />

parameter µ is called the mean (or center) of the normal distribution, since the graph of<br />

the PDF is symmetric around x = µ. The median of the normal distribution is µ. The<br />

parameter σ is called the standard deviation (or spread) of the normal distribution. The<br />

interquartile range of the normal distribution is (approximately) 1.35σ.<br />

Normal distributions have many applications. For example, normal random variables can<br />

be used to model measurements of manufactured items made under strict controls; normal<br />

random variables can be used to model physical measurements (e.g. height, weight, blood<br />

values) in homogeneous populations.<br />

Note that the PDF of the normal distribution cannot be integrated in closed <strong>for</strong>m. Most<br />

computer programs provide functions to compute probabilities and quantiles of the normal<br />

distribution.<br />

Standard normal distribution. The continuous random variable Z is said to be a<br />

standard normal random variable, or to have a standard normal distribution, when Z is a<br />

normal random variable with µ = 0 and σ = 1. The PDF and CDF of Z have special<br />

symbols:<br />

φ(z) = (1/√(2π)) e^{−z²/2} and Φ(z) = ∫_{−∞}^z φ(t) dt, where z is a real number.

The notation zp is used <strong>for</strong> the p th quantile of the Z distribution.<br />

Working with tables. Table 5.1 gives values of Φ(z) <strong>for</strong> z ≥ 0, where<br />

z = Row Value + Column Value.<br />

For z < 0, use Φ(z) = 1 − Φ(−z). For example,<br />

1. Φ(1.25) = 0.894 (row 1.2, column 0.05).<br />

2. Φ(−0.70) = 1 − Φ(0.70) = 1 − 0.758 = 0.242 (row 0.7, column 0.00).<br />

3. z0.25 ≈ −0.67 and z0.75 ≈ 0.67 since Φ(0.67) = 0.749 (row 0.6, column 0.07).<br />

The graph shown in the table illustrates Φ(z) as an area.<br />
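When a computer is available, the table can be bypassed entirely; the following sketch (not part of the notes) computes Φ(z) from the error function in the standard library, using Φ(z) = (1 + erf(z/√2))/2.

    import math

    def Phi(z):
        """Standard normal CDF."""
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    print(round(Phi(1.25), 3))   # 0.894, matching row 1.2, column 0.05
    print(round(Phi(-0.70), 3))  # 0.242
    print(round(Phi(0.67), 3))   # 0.749, so z_0.75 is approximately 0.67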



Table 5.1: Cumulative probabilities of the standard normal random variable<br />

0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09<br />

0.0 0.500 0.504 0.508 0.512 0.516 0.520 0.524 0.528 0.532 0.536<br />

0.1 0.540 0.544 0.548 0.552 0.556 0.560 0.564 0.567 0.571 0.575<br />

0.2 0.579 0.583 0.587 0.591 0.595 0.599 0.603 0.606 0.610 0.614<br />

0.3 0.618 0.622 0.626 0.629 0.633 0.637 0.641 0.644 0.648 0.652<br />

0.4 0.655 0.659 0.663 0.666 0.670 0.674 0.677 0.681 0.684 0.688<br />

0.5 0.691 0.695 0.698 0.702 0.705 0.709 0.712 0.716 0.719 0.722<br />

0.6 0.726 0.729 0.732 0.736 0.739 0.742 0.745 0.749 0.752 0.755<br />

0.7 0.758 0.761 0.764 0.767 0.770 0.773 0.776 0.779 0.782 0.785<br />

0.8 0.788 0.791 0.794 0.797 0.800 0.802 0.805 0.808 0.811 0.813<br />

0.9 0.816 0.819 0.821 0.824 0.826 0.829 0.831 0.834 0.836 0.839<br />

1.0 0.841 0.844 0.846 0.848 0.851 0.853 0.855 0.858 0.860 0.862<br />

1.1 0.864 0.867 0.869 0.871 0.873 0.875 0.877 0.879 0.881 0.883<br />

1.2 0.885 0.887 0.889 0.891 0.893 0.894 0.896 0.898 0.900 0.901<br />

1.3 0.903 0.905 0.907 0.908 0.910 0.911 0.913 0.915 0.916 0.918<br />

1.4 0.919 0.921 0.922 0.924 0.925 0.926 0.928 0.929 0.931 0.932<br />

1.5 0.933 0.934 0.936 0.937 0.938 0.939 0.941 0.942 0.943 0.944<br />

1.6 0.945 0.946 0.947 0.948 0.949 0.951 0.952 0.953 0.954 0.954<br />

1.7 0.955 0.956 0.957 0.958 0.959 0.960 0.961 0.962 0.962 0.963<br />

1.8 0.964 0.965 0.966 0.966 0.967 0.968 0.969 0.969 0.970 0.971<br />

1.9 0.971 0.972 0.973 0.973 0.974 0.974 0.975 0.976 0.976 0.977<br />

2.0 0.977 0.978 0.978 0.979 0.979 0.980 0.980 0.981 0.981 0.982<br />

2.1 0.982 0.983 0.983 0.983 0.984 0.984 0.985 0.985 0.985 0.986<br />

2.2 0.986 0.986 0.987 0.987 0.987 0.988 0.988 0.988 0.989 0.989<br />

2.3 0.989 0.990 0.990 0.990 0.990 0.991 0.991 0.991 0.991 0.992<br />

2.4 0.992 0.992 0.992 0.992 0.993 0.993 0.993 0.993 0.993 0.994<br />

2.5 0.994 0.994 0.994 0.994 0.994 0.995 0.995 0.995 0.995 0.995<br />

2.6 0.995 0.995 0.996 0.996 0.996 0.996 0.996 0.996 0.996 0.996<br />

2.7 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997<br />

2.8 0.997 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.998<br />

2.9 0.998 0.998 0.998 0.998 0.998 0.998 0.998 0.999 0.999 0.999<br />

3.0 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />

3.1 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />

3.2 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999<br />




Figure 5.5: Probability density functions of a continuous random variable with values on<br />

0 < x < 2 (left plot) and of its reciprocal with values on y > 0.5 (right plot).<br />

Nonstandard normal distributions. If X is a normal random variable with mean µ<br />

and standard deviation σ, then<br />

<br />

P(X ≤ x) = Φ((x − µ)/σ)

where Φ() is the CDF of the standard normal random variable, and

xp = µ + zp σ

where zp is the p-th quantile of the standard normal distribution.

Example 5.6 (µ = 100, σ = 20) Let X be a normal random variable with mean 100<br />

and standard deviation 20. The median of X is 100 and the IQR is 20(z0.75 − z0.25) ≈ 26.8.<br />

Further,<br />

<br />

P(60 ≤ X ≤ 110) = Φ((110 − 100)/20) − Φ((60 − 100)/20) ≈ 0.668.

5.2.6 Trans<strong>for</strong>ming Continuous Random Variables<br />

If X and Y = g(X) are continuous random variables, then the CDF of X can be used to<br />

determine the CDF and PDF of Y .<br />

Example 5.7 (Reciprocal Transformation) Let X be a continuous random variable whose range is the open interval (0,2) and whose PDF is

fX(x) = x/2 when 0 < x < 2 and 0 otherwise,

and let Y equal the reciprocal of X, Y = 1/X. Then for y > 1/2,

P(Y ≤ y) = P(1/X ≤ y) = P(X ≥ 1/y) = ∫_{x=1/y}^{2} (x/2) dx = 1 − 1/(4y²)

and d/dy P(Y ≤ y) = 1/(2y³). Thus, the PDF of Y is

fY(y) = 1/(2y³) when y > 1/2 and 0 otherwise,

and the CDF of Y is

FY(y) = 1 − 1/(4y²) when y > 1/2 and 0 otherwise.

The left part of Figure 5.5 is a plot of the density function of X, and the right part is a plot of the density function of Y.
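As a quick check of Example 5.7 (an added sketch, not in the original notes; it assumes NumPy), we can simulate X by inverting its CDF FX(x) = x²/4 and compare the empirical CDF of Y = 1/X with 1 − 1/(4y²):

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(size=100_000)
    x = 2 * np.sqrt(u)               # F_X(x) = x^2/4 on (0,2), so X = 2*sqrt(U)
    y = 1 / x                        # the reciprocal transformation

    for y0 in (0.6, 1.0, 2.0):
        empirical = np.mean(y <= y0)
        exact = 1 - 1 / (4 * y0**2)  # F_Y(y) = 1 - 1/(4 y^2) for y > 1/2
        print(y0, round(empirical, 4), round(exact, 4))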

Example 5.8 (Square Transformation) Let Z be the standard normal random variable and W be the square of Z, W = Z². Then for w > 0,

P(W ≤ w) = P(Z² ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w).

Since the graph of the PDF of Z is symmetric around z = 0, Φ(−z) = 1 − Φ(z) for every z. Thus, the CDF of W is

FW(w) = 2Φ(√w) − 1 when w > 0 and 0 otherwise,

and the PDF of W is

fW(w) = d/dw FW(w) = (1/√(2πw)) e^{−w/2} when w > 0 and 0 otherwise.

(Note that W = Z² is said to have a chi-square distribution with one degree of freedom.)

Monotone transformations. If the differentiable transformation g is strictly monotone (either always increasing or always decreasing) on the range of X, Y = g(X), and y is in the range of Y, then the PDF of Y has the following form:

fY(y) = fX(x) / |dy/dx|,

where y = g(x) (and x = g⁻¹(y)).

Example 5.9 (Reciprocal Transformation, continued) In the reciprocal transformation example above, y = g(x) = 1/x and x = g⁻¹(y) = 1/y (the inverse of the reciprocal function is the reciprocal function), and

fY(y) = fX(g⁻¹(y)) / |−1/(g⁻¹(y))²| = (1/(2y)) / |−y²| = 1/(2y³) when y > 0.5.

Linear transformations. Linear transformations with nonzero slope are examples of differentiable monotone transformations. In particular, if Y = aX + b where a ≠ 0 and y is in the range of Y, then

fY(y) = (1/|a|) fX((y − b)/a).



Example 5.10 (Normal Distribution) If X is a normal random variable with mean µ and standard deviation σ and Z is the standard normal random variable with PDF φ(z), then X = σZ + µ with PDF

fX(x) = (1/σ) φ((x − µ)/σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) for all x.

This formula for the PDF agrees with the one given earlier.

5.3 Mathematical Expectation<br />

Mathematical expectation generalizes the idea of a weighted average, where probability<br />

distributions are used as the weights. (See Section 2.3 also.)<br />

5.3.1 Definitions <strong>for</strong> Continuous Random Variables<br />

Let X be a continuous random variable with range R and PDF f(x). The mean of X (or expected value of X or expectation of X) is defined as follows:

E(X) = ∫_{x∈R} x f(x) dx,

provided that ∫_{x∈R} |x| f(x) dx < ∞ (i.e. the integral converges absolutely).

Similarly, if g(X) is a real-valued function of X, then the mean of g(X) (or expected value of g(X) or expectation of g(X)) is

E(g(X)) = ∫_{x∈R} g(x) f(x) dx,

provided that ∫_{x∈R} |g(x)| f(x) dx < ∞.

Note that the absolute convergence of an integral is not guaranteed. In cases where an<br />

integral with absolute values diverges, we say that the expectation is indeterminate.<br />

Example 5.11 (Triangular Distribution) Let X be a continuous random variable whose range is the interval [1,4] and whose PDF is as follows

f(x) = (2/9)(x − 1) when 1 ≤ x ≤ 4 and 0 otherwise,

and let g(X) = X² be the square of X. Then

E(X) = ∫_1^4 x f(x) dx = ∫_1^4 (2/9) x (x − 1) dx = 3

and

E(g(X)) = ∫_1^4 x² f(x) dx = ∫_1^4 (2/9) x² (x − 1) dx = 19/2.

Note that E(g(X)) ≠ g(E(X)).
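The two integrals in Example 5.11 can also be evaluated numerically; the following sketch (added here, assuming SciPy's quad routine is available) reproduces E(X) = 3 and E(X²) = 19/2.

    from scipy.integrate import quad

    f = lambda x: (2 / 9) * (x - 1)                # triangular PDF on [1, 4]
    EX, _ = quad(lambda x: x * f(x), 1, 4)         # E(X)
    EX2, _ = quad(lambda x: x**2 * f(x), 1, 4)     # E(X^2)
    print(round(EX, 6), round(EX2, 6))             # 3.0 and 9.5, so E(g(X)) != g(E(X))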


Table 5.2: Means and variances for four families of continuous distributions.

Distribution      E(X)        Var(X)
Uniform a, b      (a + b)/2   (b − a)²/12
Exponential λ     1/λ         1/λ²
Gamma α, β        αβ          αβ²
Normal µ, σ       µ           σ²

Example 5.12 (Indeterminate Mean) Let X be a continuous random variable whose range is x ≥ 1 and whose PDF is as follows:

f(x) = 1/x² when x ≥ 1 and 0 otherwise.

Since ∫_1^∞ x f(x) dx does not converge, the expectation of X is indeterminate.

5.3.2 Properties of Expectation<br />

The properties of expectation stated in Section 2.3.2 for discrete random variables are also true for continuous random variables.

5.3.3 Mean, Variance and Standard Deviation<br />

Let X be a random variable and let µ = E(X) be its mean. The variance of X, Var(X), is defined as follows:

Var(X) = E((X − µ)²).

The notation σ² = Var(X) is used to denote the variance. The standard deviation of X, σ = SD(X), is the positive square root of the variance.

These summary measures (mean, variance, standard deviation) have the same properties<br />

as those stated in the discrete case in Section 2.3.3.<br />

Table 5.2 gives general formulas for the mean and variance of the four families of continuous probability distributions discussed earlier.

5.3.4 Chebyshev’s Inequality<br />

Chebyshev’s inequality, stated in Section 2.3.4 for discrete random variables, remains true in the continuous case. The proof (with integration replacing summation) is also the same.



6 Infinite Sequences and Series<br />

An infinite sequence (or sequence) is a real-valued function whose domain is the positive<br />

integers. We write a sequence as a list of numbers in a definite order<br />

a1, a2, a3, ..., an, ... ,<br />

where a1 is the first term, a2 is the second term, etc. The notations<br />

{a1, a2, a3, ...} or {an} or {an} ∞ n=1<br />

are used to denote sequences.<br />

In addition, sequences may start with n = 0 (or with some other integer).<br />

If {an}∞n=1 is a sequence, then the formal sum

∑_{n=1}^∞ an = a1 + a2 + a3 + · · ·

is called an infinite series (or series). The numbers a1, a2, ... are the terms of the series.

6.1 Limits and Convergence of Infinite Sequences<br />

Let {an} be an infinite sequence. We say that {an} has limit L, and write<br />

lim_{n→∞} an = L   or   an → L as n → ∞

if for every positive real number ε there exists an integer M satisfying

|an − L| < ε when n > M.

If the sequence {an} has limit L, then the sequence converges (or is convergent). Otherwise,<br />

the sequence diverges (or is divergent).<br />

Example 6.1 (Convergent and Divergent Sequences) Since

lim_{n→∞} (n² + 1)/(5n² + 7) = lim_{n→∞} (1 + 1/n²)/(5 + 7/n²) = 1/5,

the sequence {(n² + 1)/(5n² + 7)} converges to 1/5. Similarly, the sequence {(−0.5)ⁿ} converges to 0.

By contrast, the sequence {ln(n)} diverges. (In fact, the values grow large without bound.)

6.1.1 Convergence Properties<br />

Assume that {an} converges to L and {bn} converges to M. Then<br />

1. Constant multiples: {can} converges to cL, where c is a constant.<br />



2. Sums and differences: {an ± bn} converges to L ± M.

3. Products: {anbn} converges to LM.

4. Quotients: If bn ≠ 0 for all n and M ≠ 0, then {an/bn} converges to L/M.

5. Convergence to zero: {an} converges to 0 if and only if {|an|} converges to 0.

6. Squeezing principle: If an ≤ bn ≤ cn for all n ≥ n0, where n0 is a constant, and lim_{n→∞} an = lim_{n→∞} cn = L, then {bn} converges to L as well.

Example 6.2 (Squeezing Principle) Consider the sequence {sin²(n)/n}. Since

0 ≤ sin²(n)/n ≤ 1/n   and   lim_{n→∞} 1/n = 0,

the sequence {sin²(n)/n} converges to 0 as well.

6.1.2 Geometric Sequences

Consider the sequence {rⁿ} where r is a fixed constant (called the ratio of the geometric sequence). Then

{rⁿ} converges to 0 when |r| < 1 and converges to 1 when r = 1.

In all other cases, the sequence diverges.

6.1.3 Bounded Sequences<br />

The sequence {an} is said to be bounded above if there is a number M satisfying

an ≤ M for all n ≥ 1.

Similarly, {an} is said to be bounded below if there is a number m satisfying

an ≥ m for all n ≥ 1.

The sequence {an} is said to be bounded if it is bounded above and below. Otherwise, {an} is said to be unbounded.

Theorem 6.3 (Bounded Sequence Theorem) Let {an} be a sequence.

1. If {an} converges, then it is bounded.

2. If {an} is unbounded, then it diverges.

Every convergent sequence is bounded, but not every bounded sequence is convergent. For example, the sequence {(−1)ⁿ} is bounded (since −1 ≤ an ≤ 1 for all n) but not convergent.



6.1.4 Monotone Sequences<br />

The sequence {an} is said to be increasing if

a1 < a2 < · · · < an < · · · .

Similarly, {an} is said to be decreasing if

a1 > a2 > · · · > an > · · · .

{an} is said to be monotone if it is either increasing or decreasing.<br />

Theorem 6.4 (Monotone Convergence Theorem) If {an} is a bounded and monotone<br />

sequence, then {an} converges.<br />

Example 6.5 (Convergent Monotone Sequences) Since 0 ≤ 1 − 1/n ≤ 1 for all n ≥ 1 and the terms are increasing, the sequence {1 − 1/n}∞n=1 converges. In fact, the limit is 1.

Similarly, since 0 ≤ sin(1/n) ≤ sin(1) for all n ≥ 1 and the terms are decreasing, the sequence {sin(1/n)}∞n=1 converges. In fact, the limit is 0.

Note. The proof of the Monotone Convergence Theorem uses the fact that sets of real<br />

numbers that are bounded above have a least upper bound (or supremum), and that sets of<br />

real numbers that are bounded below have a greatest lower bound (or infimum).<br />

6.2 Partial Sums and Convergence of Infinite Series<br />

Let ∑_{n=1}^∞ an be an infinite series. The sequence with terms

sn = ∑_{i=1}^n ai,   n = 1,2,3,...

is known as the sequence of partial sums of the series.

If the sequence of partial sums {sn} converges to L, then the series converges to L:

∑_{n=1}^∞ an = lim_{n→∞} sn = L

and L is said to be the sum of the series. If the sequence of partial sums {sn} diverges, then the series is said to diverge.

Example 6.6 (Convergent Series) The series ∑_{n=1}^∞ 1/(n(n+1)) converges to 1.

To see this, note that 1/(n(n+1)) = 1/n − 1/(n+1) for n ≥ 1. Thus,

sn = (1 − 1/2) + (1/2 − 1/3) + · · · + (1/n − 1/(n+1)) = 1 − 1/(n+1)

(all middle terms cancel) and sn → 1 as n → ∞.
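A few partial sums make the telescoping behavior concrete; this short Python sketch (not in the original notes) compares the direct sums with the closed form 1 − 1/(n+1):

    # Partial sums s_n of the series in Example 6.6; they should approach 1.
    for n in (10, 100, 1000):
        s_n = sum(1 / (i * (i + 1)) for i in range(1, n + 1))
        print(n, s_n, 1 - 1 / (n + 1))   # direct sum vs. the closed form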



6.2.1 Convergence Properties<br />

Assume that ∑_{n=1}^∞ an = L and ∑_{n=1}^∞ bn = M. Then

1. Constant multiples: ∑_{n=1}^∞ c an = c ∑_{n=1}^∞ an = cL, where c is a constant.

2. Sums and differences: ∑_{n=1}^∞ (an ± bn) = ∑_{n=1}^∞ an ± ∑_{n=1}^∞ bn = L ± M.

3. Convergence to zero: If the series ∑_{n=1}^∞ an converges, then {an} converges to 0.

Note that, if the sequence of terms of a series does not have limit zero, then the series must diverge. For example,

∑_{n=1}^∞ (n² + 1)/(5n² + 7) diverges since lim_{n→∞} (n² + 1)/(5n² + 7) = 1/5 ≠ 0.

(The sequence of partial sums is unbounded, since sn+1 ≈ sn + 1/5 when n is large.)

6.2.2 Geometric Series

Consider the series ∑_{n=1}^∞ a rⁿ⁻¹ where a and r are fixed constants. Then the n th partial sum of the geometric series is

sn = a + ar + ar² + · · · + arⁿ⁻¹ = a(1 − rⁿ)/(1 − r) when r ≠ 1

(and sn = an when r = 1). Further, the series converges when |r| < 1 and its sum is

∑_{n=1}^∞ a rⁿ⁻¹ = a/(1 − r)   (|r| < 1).

If |r| ≥ 1, the series diverges.

Example 6.7 (a = −10/27, r = −2/3) The series ∑_{n=1}^∞ 5(−2)ⁿ/3ⁿ⁺² converges to −2/9 since

∑_{n=1}^∞ 5(−2)ⁿ/3ⁿ⁺² = −10/27 + 20/81 − 40/243 + 80/729 ∓ · · ·
                      = −(10/27)(1 − 2/3 + 4/9 − 8/27 ± · · ·)
                      = (−10/27)/(1 − (−2/3)) = −2/9.
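Partial sums confirm Example 6.7 numerically; the sketch below (an addition, plain Python) watches the sums approach −2/9:

    target = -2 / 9
    for N in (5, 10, 20, 40):
        s_N = sum(5 * (-2)**n / 3**(n + 2) for n in range(1, N + 1))
        print(N, s_N, abs(s_N - target))   # the error shrinks geometrically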


6.3 Convergence Tests for Positive Series

Let ∑_{n=1}^∞ an and ∑_{n=1}^∞ bn be series with positive terms. Then

1. Comparison test: Assume that an ≤ bn for all n ≥ n0. Then

(a) If ∑_{n=1}^∞ bn converges, then ∑_{n=1}^∞ an converges.

(b) If ∑_{n=1}^∞ an diverges, then ∑_{n=1}^∞ bn diverges.

2. Integral test: If f(x) is a decreasing function with f(n) = an, then

∑_{n=1}^∞ an and ∫_1^∞ f(x) dx both converge or both diverge.

3. Ratio test: Assume that (an+1/an) → L as n → ∞.

(a) If 0 ≤ L < 1, then ∑_{n=1}^∞ an converges.

(b) If L > 1, then ∑_{n=1}^∞ an diverges.

4. Root test: Assume that ⁿ√an → L as n → ∞.

(a) If 0 ≤ L < 1, then ∑_{n=1}^∞ an converges.

(b) If L > 1, then ∑_{n=1}^∞ an diverges.

5. Limit comparison test: If {an/bn} converges to L > 0, then

∑_{n=1}^∞ an and ∑_{n=1}^∞ bn both converge or both diverge.

The comparison test follows from the Monotone Convergence Theorem since the sequences of partial sums are monotone increasing: if the b-series converges, then its sum is an upper bound for the sequence of partial sums of the a-series; if the a-series diverges, then the sequence of partial sums of the b-series must be unbounded.

The ratio and root tests apply to series that are “sufficiently like” geometric series. In each<br />

case, if L = 1, then no conclusion can be drawn. The limit comparison test applies to series<br />

that are approximately constant multiples of one another.<br />

Example 6.8 (p Series) The integral test can be used to demonstrate that

∑_{n=1}^∞ 1/nᵖ converges if and only if p > 1.

To see this, let f(x) = 1/xᵖ = x⁻ᵖ. If p > 1, then

∫_1^∞ f(x) dx = [x^(−p+1)/(−p + 1)]_{x=1}^{x→∞} = lim_{x→∞} x^(−p+1)/(−p + 1) − 1/(−p + 1) = 0 − 1/(−p + 1) = 1/(p − 1).



(The limit is 0 since x has a negative exponent.) Since the improper integral converges, so<br />

does the series.<br />

If p < 1, then the improper integral diverges since x has a positive exponent in the limit<br />

above. If p = 1, then the antiderivative of f(x) is ln(x) and again the improper integral<br />

diverges.<br />

Example 6.9 (Convergent Exponential Series) An important application of the ratio test is to the series

∑_{n=0}^∞ xⁿ/n!   where x is a positive real number.

Since the ratio

an+1/an = (x^(n+1)/(n + 1)!)/(xⁿ/n!) = x/(n + 1) → 0 as n → ∞,

the series above converges. In addition, since the series converges, the sequence of terms must approach 0. That is, xⁿ/n! → 0 as n → ∞.

Example 6.10 (Comparison Test) The comparison test can be used to demonstrate that

∑_{n=0}^∞ 10 sin²(n)/n! converges   and that   ∑_{n=1}^∞ 4/√(5n − 1) diverges.

In the first case, let an = 10 sin²(n)/n! and bn = 10/n!. The ratio test can be used to demonstrate that the b-series converges (as in the last example). Thus, the a-series converges.

In the second case, let an = 4/√(5n) and bn = 4/√(5n − 1). The integral test can be used to demonstrate that the a-series diverges (as in the example above). Thus, the b-series diverges.

Example 6.11 (Limit Comparison Test) The limit comparison test can be used to demonstrate that

∑_{n=1}^∞ (4n + 3)/(n³ + 7n − 4) converges   and that   ∑_{n=1}^∞ (n + 5)/(3n² + 2n + 2) diverges.

In the first case, let an = (4n + 3)/(n³ + 7n − 4) and bn = 1/n². Since

lim_{n→∞} an/bn = lim_{n→∞} (4n³ + 3n²)/(n³ + 7n − 4) = lim_{n→∞} (4 + 3/n)/(1 + 7/n² − 4/n³) = 4

and the b-series converges (by a previous example), the a-series converges.

In the second case, let an = (n + 5)/(3n² + 2n + 2) and bn = 1/n. Since

lim_{n→∞} an/bn = lim_{n→∞} (n² + 5n)/(3n² + 2n + 2) = lim_{n→∞} (1 + 5/n)/(3 + 2/n + 2/n²) = 1/3

and the b-series diverges (by the same example), the a-series diverges.



Example 6.12 (Root Test) The root test can be used to demonstrate that

∑_{n=1}^∞ ((n + 5)/(2n))ⁿ converges.

Specifically, since

ⁿ√an = (n + 5)/(2n) = (1 + 5/n)/2 → 1/2 as n → ∞

and the limit is less than 1, the series converges. (For large n, the terms of the series are like those of a convergent geometric series with r = 1/2.)

Remarks. The convergence tests above can be used to determine if a positive series converges, but tell us nothing about the sum of the series. In most cases, finding an exact sum is challenging. In Section 6.6, we demonstrate that

∑_{n=0}^∞ xⁿ/n! = eˣ for all x.

Harmonic series. The divergent series ∑_{n=1}^∞ 1/n is called the harmonic series. Although the terms of the harmonic series approach 0, the series diverges by the integral test.

6.4 Alternating Series and Error Analysis<br />

An alternating series is a series whose terms alternate in sign. The usual notation is

∑_{n=1}^∞ (−1)ⁿ⁺¹ an = a1 − a2 + a3 − a4 ± · · · ,

where an > 0 for each n, for series starting with positive terms. For example,

∑_{n=1}^∞ (−1)ⁿ⁺¹/2ⁿ = 1/2 − 1/4 + 1/8 − 1/16 ± · · ·

is an alternating series starting with a positive term.

A simple convergence test for alternating series is based on the following theorem.

Theorem 6.13 (Alternating Series Theorem) Assume that an+1 ≤ an for each n and that an → 0 as n → ∞. Then

1. ∑_{n=1}^∞ (−1)ⁿ⁺¹ an and ∑_{n=1}^∞ (−1)ⁿ an are convergent series.

2. If sn is the n th partial sum and L is the sum of one of these series, then

|sn − L| ≤ an+1 for each n.



The partial sums of an alternating series are alternately larger and smaller than the limit, L. When the first term is positive, the even partial sums form an increasing sequence and the odd partial sums form a decreasing sequence:

s2 ≤ s4 ≤ s6 ≤ s8 ≤ s10 ≤ · · · ≤ L ≤ · · · ≤ s9 ≤ s7 ≤ s5 ≤ s3 ≤ s1.<br />

When the first term is negative, the even partial sums are decreasing and the odd partial<br />

sums are increasing. Each subsequence of partial sums must converge, since each is bounded<br />

and monotone. Further, since<br />

|sn − sn−1| = an → 0 as n → ∞,<br />

the limits of the subsequences must be equal.<br />

Example 6.14 (Alternating Harmonic Series) The series

∑_{n=1}^∞ (−1)ⁿ⁺¹/n = 1 − 1/2 + 1/3 − 1/4 ± · · ·

converges since the sequence {1/n} is decreasing with limit 0.

To estimate the sum of the alternating harmonic series to within 2 decimal places, we need to use 200 or more terms since

|sn − L| ≤ an+1 = 1/(n + 1) < 0.005 =⇒ n > 199.

Our estimate is s200 ≈ 0.690653. (In fact, L = ln(2). See Section 6.6.)
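The estimate in Example 6.14 is easy to reproduce; the sketch below (added, plain Python) computes s200 and checks that it is within a201 = 1/201 of ln(2):

    import math

    s200 = sum((-1)**(n + 1) / n for n in range(1, 201))
    print(round(s200, 6), round(math.log(2), 6), abs(s200 - math.log(2)) < 1 / 201)
    # prints approximately 0.690653, 0.693147, True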

Example 6.15 (Alternating p Series) More generally, the series

∑_{n=1}^∞ (−1)ⁿ⁺¹/nᵖ = 1 − 1/2ᵖ + 1/3ᵖ − 1/4ᵖ ± · · ·

converges for all p > 0 since {1/nᵖ} is decreasing with limit 0.

To estimate the sum when p = 3 to within 2 decimal places, for example, we need to use 5 or more terms since

|sn − L| ≤ an+1 = 1/(n + 1)³ < 0.005 =⇒ n > 4.84804.

Our estimate is s5 ≈ 0.904412.

6.5 Absolute and Conditional Convergence of Series<br />

The following theorem is used to study convergence properties of general series.<br />

Theorem 6.16 (Absolute Convergence Theorem) If the series ∑_{n=1}^∞ |an| converges, then the series ∑_{n=1}^∞ an converges as well.

Thus, tests for positive series can be used to determine whether the series with terms |an| converges. If that series converges, then we can conclude that the series with terms an converges as well. Note, however, that if the series with terms |an| diverges, then we cannot draw any conclusion about the series with terms an.

Let ∑_{n=1}^∞ an be a convergent series. If the series ∑_{n=1}^∞ |an| converges, then ∑_{n=1}^∞ an is said to be an absolutely convergent series. Otherwise, ∑_{n=1}^∞ an is said to be a conditionally convergent series.

Example 6.17 (Alternating p Series, continued) The series

∑_{n=1}^∞ (−1)ⁿ⁺¹/nᵖ = 1 − 1/2ᵖ + 1/3ᵖ − 1/4ᵖ ± · · ·

is absolutely convergent when p > 1 (by the integral test) and conditionally convergent when 0 < p ≤ 1 (by the integral and alternating series tests).

Example 6.18 (Exponential Series, continued) The series

∑_{n=0}^∞ xⁿ/n! = 1 + x + x²/2 + x³/6 + x⁴/24 + · · ·

converges for all real numbers x. Specifically, if x = 0, the sum is 1. Otherwise, the series converges absolutely by the ratio test (see page 108).

Example 6.19 (Power Series) The series

∑_{n=1}^∞ xⁿ/n = x + x²/2 + x³/3 + x⁴/4 + · · ·

converges absolutely when |x| < 1, converges conditionally when x = −1, and diverges otherwise. To see this,

1. If x = 0, then the sum is 0. If 0 < |x| < 1, then

|an+1|/|an| = (n/(n + 1)) |x| → |x| < 1 as n → ∞,

implying that the series converges absolutely by the ratio test. Thus, the series converges absolutely when |x| < 1.

2. When x = 1, the series reduces to the harmonic series, which diverges by the integral test. When x = −1, the series reduces to the (negative) alternating harmonic series, which converges by the alternating series test. Thus, the series converges conditionally when x = −1.

3. When |x| > 1, the terms do not converge to 0, and the series diverges.

6.5.1 Product of Series<br />

Let ∑_{n=0}^∞ an and ∑_{n=0}^∞ bn be series with index starting at 0. The series ∑_{n=0}^∞ cn with terms

cn = ∑_{i=0}^n ai bn−i = a0bn + a1bn−1 + · · · + an−1b1 + anb0   for n = 0,1,2,...

is the product (or the convolution) of ∑_{n=0}^∞ an and ∑_{n=0}^∞ bn.

Theorem 6.20 (Convolution Theorem) If ∑_{n=0}^∞ an = L, ∑_{n=0}^∞ bn = M and both series are absolutely convergent, then ∑_{n=0}^∞ cn is an absolutely convergent series with sum LM.

Example 6.21 (Square of Geometric Series) The geometric series

∑_{n=0}^∞ xⁿ = 1 + x + x² + x³ + · · ·

converges to 1/(1 − x) when |x| < 1 and diverges otherwise.

The square of this series is

(∑_{n=0}^∞ xⁿ)² = ∑_{n=0}^∞ (n + 1)xⁿ = 1 + 2x + 3x² + 4x³ + · · · .

By the convolution theorem, the square series converges to 1/(1 − x)² when |x| < 1.

Remarks. The last example demonstrates that we can “represent” the functions f(x) = 1/(1 − x) and g(x) = 1/(1 − x)² as sums of convergent series for all x satisfying |x| < 1. Series representations of functions will be explored further in the next two sections.

6.6 Taylor Polynomials and Taylor Series<br />

We have already seen how to approximate a function using a tangent line. This section considers using polynomials of higher orders to approximate functions with higher precision, and using the limiting forms of these polynomials to give exact representations of the functions on intervals.

6.6.1 Higher Order Derivatives<br />

The n th derivative of the function f is obtained from the function by differentiating n times: f′′′ is the derivative of f′′, f′′′′ is the derivative of f′′′, and so forth.

Starting with y = f(x), notations for the n th derivative are

f⁽ⁿ⁾(x) = dⁿy/dxⁿ = dⁿ/dxⁿ (f(x)) = dⁿf/dxⁿ = Dⁿf(x) = y⁽ⁿ⁾.

For convenience, we let f⁽⁰⁾(x) = f(x).



Figure 6.1: Plots of y = cos(x) (left) and y = 1/(1 − x) (right) with Taylor approximations.

6.6.2 Taylor Polynomials<br />

The Taylor polynomial of order n centered at a is the polynomial

pn(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)² + · · · + (f⁽ⁿ⁾(a)/n!)(x − a)ⁿ.

Note that pn(a) = f(a) and that pn⁽ᵏ⁾(a) = f⁽ᵏ⁾(a) for k = 1,...,n.

Theorem 6.22 (Taylor’s Theorem, Part 1) If the n th derivative of f is continuous on the closed interval [a,b] and the (n + 1) st derivative exists on the open interval (a,b), then for each x in [a,b]

f(x) = pn(x) + (f⁽ⁿ⁺¹⁾(c)/(n + 1)!)(x − a)ⁿ⁺¹ for some c satisfying a < c < x.

In words, the error in using the Taylor polynomial pn(x) instead of the function f(x) for values of x near a is exactly the remainder

Rn(x) = (f⁽ⁿ⁺¹⁾(c)/(n + 1)!)(x − a)ⁿ⁺¹ for some number c between a and x.

Thus, Taylor’s theorem generalizes results on errors in linear approximations (Section 4.6). Taylor’s theorem can be proven using the Mean Value Theorem.

Note that if |f⁽ⁿ⁺¹⁾(x)| is bounded in a neighborhood of a, then |Rn(x)| is also bounded. Specifically,

|f⁽ⁿ⁺¹⁾(x)| ≤ M for |x − a| < r   ⇒   |Rn(x)| ≤ (M/(n + 1)!)|x − a|ⁿ⁺¹ for |x − a| < r.

Example 6.23 (Cosine Function) Consider f(x) = cos(x) and let a = 0. The left part of<br />

Figure 6.1 shows graphs of y = f(x), y = p2(x), y = p8(x), and y = p14(x). As n increases,<br />

the polynomial approximates the function well over wider intervals centered at 0.<br />



Further, since |f⁽ⁿ⁾(x)| ≤ 1 (each derivative is either ±sin(x) or ±cos(x)) and a = 0,

|Rn(x)| ≤ |x|ⁿ⁺¹/(n + 1)! for all x.
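To see the error bound in action, the sketch below (an added Python illustration, not from the notes) evaluates the Taylor polynomials p2, p8, and p14 of cos(x) at x = 2 and compares the actual error with the bound |x|^(n+1)/(n+1)!:

    import math

    def taylor_cos(x, n):
        """Taylor polynomial of cos(x) centered at 0, through degree n (n even)."""
        return sum((-1)**k * x**(2 * k) / math.factorial(2 * k) for k in range(n // 2 + 1))

    x = 2.0
    for n in (2, 8, 14):
        pn = taylor_cos(x, n)
        bound = abs(x)**(n + 1) / math.factorial(n + 1)   # |R_n(x)| <= |x|^(n+1)/(n+1)!
        print(n, round(pn, 6), round(abs(pn - math.cos(x)), 6), round(bound, 6))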

Example 6.24 (Geometric Function) Consider f(x) = 1/(1 − x) and let a = 0. The<br />

right part of Figure 6.1 shows graphs of y = f(x), y = p1(x), y = p3(x), and y = p5(x).<br />

As n increases, the polynomial approximates the function well over wider subintervals of<br />

(−1,1).<br />

Since f (n) (x) = n!/(1 − x) n+1 and f (n) (0) = n!, pn(x) = 1 + x + x 2 + · · · + x n .<br />

Further, since

(1 − xⁿ⁺¹)/(1 − x) = 1 + x + x² + · · · + xⁿ   =⇒   1/(1 − x) = 1 + x + x² + · · · + xⁿ + xⁿ⁺¹/(1 − x)

(using properties of geometric series), the remainder is Rn(x) = xⁿ⁺¹/(1 − x).

6.6.3 Taylor Series<br />

The Taylor series representation of f(x) centered at a is

f(x) = ∑_{n=0}^∞ (f⁽ⁿ⁾(a)/n!)(x − a)ⁿ = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)² + · · ·

when the series converges. Note that the Taylor series centered at 0 is often called the Maclaurin series.

Theorem 6.25 (Taylor’s Theorem, Part 2) If f(x) = pn(x) + Rn(x) and

lim_{n→∞} Rn(x) = 0 for all x satisfying |x − a| < r,

then f(x) is equal to the sum of its Taylor series on the interval |x − a| < r.

Example 6.26 (Exponential Function) Consider f(x) = eˣ and let a = 0.

Since f⁽ⁿ⁾(x) = eˣ and f⁽ⁿ⁾(0) = 1 for all n,

eˣ = 1 + x + x²/2 + · · · + xⁿ/n! + Rn(x)   where Rn(x) = (e^c/(n + 1)!) xⁿ⁺¹

for some c between 0 and x. Since

|Rn(x)| ≤ C |x|ⁿ⁺¹/(n + 1)!   where C = max(eˣ, 1)

and |x|ⁿ/n! → 0 as n → ∞ for all x (page 108), Rn(x) → 0 as n → ∞ as well. Thus, eˣ is equal to the sum of its Taylor series for all real numbers.



Table 6.1: Some Taylor series expansions centered at 0.

1/(1 − x)   = 1 + x + x² + x³ + · · ·                for −1 < x < 1
1/(1 − x)²  = 1 + 2x + 3x² + 4x³ + · · ·             for −1 < x < 1
ln(1 + x)   = x − x²/2 + x³/3 − x⁴/4 + · · ·         for −1 < x ≤ 1
eˣ          = 1 + x + x²/2 + x³/3! + x⁴/4! + · · ·   for all x
cos(x)      = 1 − x²/2 + x⁴/4! − x⁶/6! + · · ·       for all x
sin(x)      = x − x³/3! + x⁵/5! − x⁷/7! + · · ·      for all x

Example 6.27 (Exponential Function, continued) Using the result above, a series representation of e⁻¹ is

e⁻¹ = ∑_{n=0}^∞ (−1)ⁿ/n! = 1 − 1 + 1/2 − 1/6 + 1/24 ∓ · · · .

To estimate e⁻¹ to within 2 decimal places of accuracy, for example, we need to use 6 terms since

|Rn(x)| ≤ 1/(n + 1)! < 0.005 =⇒ n > 5.

Our estimate is s6 = 53/144 ≈ 0.368056. (Note that this result could also be obtained using the methods of Section 6.4.)
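Successive partial sums of the series for e⁻¹ can be listed directly; the sketch below (added, plain Python) shows how quickly they settle around e⁻¹ ≈ 0.367879, with the error shrinking factorially:

    import math

    s = 0.0
    for n in range(8):
        s += (-1)**n / math.factorial(n)
        print(n, round(s, 6), round(abs(s - math.exp(-1)), 6))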

6.6.4 Radius of Convergence<br />

The largest value of r satisfying Taylor’s theorem is called the radius of convergence of the series. The example above demonstrates that the radius of convergence for the exponential series centered at 0 is ∞.

Table 6.1 lists several commonly used series expansions. In the first three cases, the radius of convergence is 1. In the last three cases, the radius of convergence is ∞.

6.7 Power Series<br />

A power series centered at a is a series of the form

∑_{n=0}^∞ cn(x − a)ⁿ = c0 + c1(x − a) + c2(x − a)² + c3(x − a)³ + · · · ,

where {cn} is a sequence of constants (the coefficients of the series). The series converges to c0 when x = a. Interest focuses on finding the values of x for which the series converges and finding exact forms for the sums.

Theorem 6.28 (Convergence Theorem) Let ∑_{n=0}^∞ cn(x − a)ⁿ be a power series centered at a. Then exactly one of the following holds:

1. The series converges for x = a only.

2. The series converges for all x.

3. There is a number r > 0 such that the series converges when |x − a| < r and diverges when |x − a| > r.

The number r given in the third part of the theorem is called the radius of convergence of<br />

the power series. If r is the radius of convergence, then the theorem tells us nothing about<br />

convergence at x = a ± r. Thus, the interval of convergence could be any one of<br />

[a − r,a + r], (a − r,a + r], [a − r,a + r), (a − r,a + r).<br />

Example 6.29 (Logarithm Function) Consider the Taylor series for f(x) = ln(x) centered at 1. Since

ln(1) = 0,   f⁽ⁿ⁾(x) = (−1)ⁿ⁻¹(n − 1)!/xⁿ   and   f⁽ⁿ⁾(1) = (−1)ⁿ⁻¹(n − 1)!,

the form of the Taylor series is

∑_{n=1}^∞ (−1)ⁿ⁻¹(x − 1)ⁿ/n = (x − 1) − (x − 1)²/2 + (x − 1)³/3 − (x − 1)⁴/4 ± · · · .

The series converges absolutely when |x − 1| < 1 by the ratio test. If |x − 1| > 1, then the terms do not converge to 0, and the series diverges. If x = 0, then the series diverges since it is the negative of the harmonic series. If x = 2, then the series converges since it is the alternating harmonic series.

Thus, the radius of convergence is r = 1 and the interval of convergence is (0,2].

6.7.1 Power Series and Taylor Series<br />

Taylor series are examples of power series. The following theorem states that convergent<br />

power series must be Taylor series within their radius of convergence.<br />

Theorem 6.30 (Representation Theorem) Assume that the radius of convergence of the power series ∑_{n=0}^∞ cn(x − a)ⁿ is r > 0. Let

f(x) = ∑_{n=0}^∞ cn(x − a)ⁿ   for a − r < x < a + r.

Then f has derivatives of all orders on (a − r, a + r) and

f⁽ⁿ⁾(a) = cn n! for all n.

Thus, the power series is the Taylor series of f centered at a.



6.7.2 Differentiation and Integration of Power Series

Power series can be differentiated and integrated term-by-term within their radius of convergence. Specifically,

Theorem 6.31 (Term-by-Term Analysis) Assume that the radius of convergence of the power series ∑_{n=0}^∞ cn(x − a)ⁿ is r > 0. Let

f(x) = ∑_{n=0}^∞ cn(x − a)ⁿ   for a − r < x < a + r.

Then f is differentiable (and continuous) on the interval (a − r, a + r). Further,

1. the derivative of f has the following form:

f′(x) = ∑_{n=1}^∞ n cn(x − a)ⁿ⁻¹ = c1 + 2c2(x − a) + 3c3(x − a)² + · · · .

2. the family of antiderivatives of f has the following form:

∫ f(x) dx = C + ∑_{n=0}^∞ (cn/(n + 1))(x − a)ⁿ⁺¹ = C + c0(x − a) + (c1/2)(x − a)² + · · · .

The radius of convergence of each series is r.

Example 6.32 (Geometric Function) Let

f(x) = 1/(1 − x) = ∑_{n=0}^∞ xⁿ = 1 + x + x² + x³ + · · ·   for −1 < x < 1.

The derivative of f can be computed two ways:

f′(x) = d/dx (1/(1 − x)) = 1/(1 − x)²

and

f′(x) = d/dx (∑_{n=0}^∞ xⁿ) = ∑_{n=1}^∞ n xⁿ⁻¹ = 1 + 2x + 3x² + 4x³ + · · · .

Thus,

1/(1 − x)² = ∑_{n=1}^∞ n xⁿ⁻¹ = 1 + 2x + 3x² + 4x³ + · · ·   for −1 < x < 1,

confirming results obtained on page 112.

Example 6.33 (Standard Normal Distribution) Let Z be the standard normal random variable. Using the Taylor series expansion of eˣ centered at 0, the PDF of Z can be written as follows:

φ(z) = (1/√(2π)) e^{−z²/2} = (1/√(2π)) ∑_{n=0}^∞ (−z²/2)ⁿ/n! = (1/√(2π)) ∑_{n=0}^∞ (−1)ⁿ z²ⁿ/(2ⁿ n!)   for all z.

φ(z) does not have an exact antiderivative. However, by using term-by-term analysis, we can write the indefinite integral as a family of power series. Specifically,

∫ φ(z) dz = C + (1/√(2π)) ∑_{n=0}^∞ (−1)ⁿ z²ⁿ⁺¹/((2n + 1) 2ⁿ n!).

This power series representation can be used to compute probabilities for standard normal random variables to within any degree of accuracy.
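The sketch below (an added Python illustration, not part of the notes) turns the power series into a routine for Φ(z), fixing the constant of integration with Φ(0) = 1/2; the values can be compared with the normal CDF table earlier in this chapter.

    import math

    def Phi_series(z, nterms=30):
        """Standard normal CDF from the term-by-term antiderivative of phi(z)."""
        total = sum((-1)**n * z**(2 * n + 1) / ((2 * n + 1) * 2**n * math.factorial(n))
                    for n in range(nterms))
        return 0.5 + total / math.sqrt(2 * math.pi)

    for z in (0.5, 1.0, 2.0):
        print(z, round(Phi_series(z), 4))   # about 0.6915, 0.8413, 0.9772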

6.8 Discrete Random Variables, revisited<br />

There are many important applications of sequences and series in probability and statistics.<br />

This section focuses on discrete random variables whose ranges are countably infinite.<br />

6.8.1 Countably Infinite Sample Spaces<br />

A set is countably infinite when it corresponds to an infinite sequence. When a sample space<br />

S is countably infinite, we need to include the possibility of infinite unions, and of infinite<br />

sums of probabilities. In particular, the third Kolmogorov axiom is extended as follows:<br />

3 ′′ . If A1,A2,... are pairwise disjoint, then<br />

P (∪ ∞ i=1Ai) = P(A1) + P(A2) + P(A3) + · · ·<br />

where the righthand side is understood to be the sum of a convergent infinite series.<br />

6.8.2 Infinite Discrete Random Variables<br />

Recall that a discrete random variable is one whose range, R, is either finite or countably<br />

infinite. If the range of X is countably infinite, f(x) is the PDF of X, and g(X) is a<br />

real-valued function, then the expected value (or mean or expectation) of g(X) is<br />

E(g(X)) = ∑_{x∈R} g(x)f(x) when the series converges absolutely.

That is, when ∑_{x∈R} |g(x)|f(x) converges. If the series with absolute values diverges, we say that the expectation is indeterminate.

Example 6.34 (Coin Tossing) You toss a fair coin until you get a tail and record the sequence of h's and t's (for heads and tails, respectively). The sample space is

S = {t, ht, hht, hhht, hhhht, hhhhht, ...}.

Let X be the total number of tosses. Then X is a discrete random variable with PDF

f(x) = P(X = x) = 1/2ˣ for x = 1,2,3,... and 0 otherwise.



Figure 6.2: Probability density function of the uniform random variable on the open interval (0,1) (left plot) and of the floor of its reciprocal (right plot). The range of the floor of the reciprocal is the positive integers. Vertical lines in the left plot are drawn at u = 1/2, 1/3, 1/4, ....

The probability that you get a tail in 3 or fewer tosses, for example, is

P(X ≤ 3) = P({t, ht, hht}) = f(1) + f(2) + f(3) = 1/2 + 1/4 + 1/8 = 7/8.

To find the expected value of X, we need to determine the sum of the following series:

E(X) = ∑_{x=1}^∞ x(1/2ˣ) = 1(1/2) + 2(1/4) + 3(1/8) + 4(1/16) + · · · .

By using a simple trick, we can reduce the problem to finding the sum of a convergent geometric series. Specifically,

E(X) − (1/2)E(X) = ∑_{x=1}^∞ x(1/2ˣ) − ∑_{x=1}^∞ x(1/2^(x+1)) = ∑_{x=1}^∞ (x(1/2ˣ) − (x − 1)(1/2ˣ)) = ∑_{x=1}^∞ 1/2ˣ = 1.

Since (1/2)E(X) = 1, we know that E(X) = 2. (The expected number of tosses is 2.) This trick is valid since the series is absolutely convergent.
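Both the series and the answer E(X) = 2 are easy to check; the sketch below (added, assuming NumPy for the simulation half) sums the series numerically and also simulates the number of tosses until the first tail:

    import numpy as np

    partial = sum(x / 2**x for x in range(1, 40))      # partial sum of E(X) = sum x/2^x
    rng = np.random.default_rng(1)
    tosses = rng.geometric(p=0.5, size=100_000)        # tosses needed for the first tail
    print(round(partial, 6), round(tosses.mean(), 3))  # both close to 2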

Example 6.35 (Indeterminate Mean) Let U be a continuous uniform random variable on the open interval (0,1) and let X be the greatest integer less than or equal to the reciprocal of U (the “floor” of the reciprocal of U), X = ⌊1/U⌋. For x in the positive integers,

f(x) = P(X = x) = P(1/(x + 1) < U ≤ 1/x) = 1/(x(x + 1));

f(x) = 0 otherwise. The expectation of X is

E(X) = ∑_{x=1}^∞ x · 1/(x(x + 1)) = ∑_{x=1}^∞ 1/(x + 1).


Figure 6.3: Probability histograms for geometric (left) and Poisson (right) examples.

The integral test can be used to demonstrate that the series on the right diverges. Thus,<br />

the expectation of X is indeterminate.<br />

The left part of Figure 6.2 is a plot of the PDF of U with the vertical lines<br />

u = 1/2, 1/3, 1/4, 1/5, ...<br />

superimposed. The right part is a probability histogram of the distribution of X = ⌊1/U⌋.<br />

Note that the area between u = 1/2 and u = 1 on the left is P(X = 1), the area between<br />

u = 1/3 and u = 1/2 is P(X = 2), etc.<br />

6.8.3 Example: Geometric Distribution<br />

Let X be the number of failures before the first success in a sequence of independent Bernoulli experiments with success probability p. Then X is said to be a geometric random variable, or to have a geometric distribution, with parameter p. The PDF of X is as follows:

f(x) = (1 − p)ˣ p when x = 0,1,2,..., and 0 otherwise.

For each x, f(x) is the probability of the sequence of x failures (F) followed by a success (S). For example, f(5) = P({FFFFFS}) = (1 − p)⁵ p.

The probabilities f(0), f(1), ... form a geometric sequence whose sum is 1. Further,

E(X) = (1 − p)/p   and   Var(X) = E((X − E(X))²) = (1 − p)/p².

Example 6.36 (p = 1/4) Let X be a geometric random variable with parameter 1/4. Then

E(X) = 3 and Var(X) = 12.

The left part of Figure 6.3 is a probability histogram for X, for values of X at most 20. Note that

P(X > 20) = ∑_{x=21}^∞ (3/4)ˣ(1/4) = (3/4)²¹ ≈ 0.00237841.



Alternative definition. An alternative definition of the geometric random variable is as<br />

follows: X is the trial number of the first success in a sequence of independent Bernoulli<br />

experiments with success probability p. In this case,<br />

f(x) = (1 − p) x−1 p when x = 1,2,3,..., and 0 otherwise.<br />

In particular, the range is now the positive integers.<br />

6.8.4 Example: Poisson Distribution<br />

Let λ be a positive real number. The random variable X is said to be a Poisson random variable, or to have a Poisson distribution, with parameter λ if its PDF is as follows:

f(x) = e^{−λ} λˣ/x! when x = 0,1,2,..., and 0 otherwise.

The series expansion for eˣ centered at 0 (see Table 6.1) can be used to demonstrate that the sequence f(0), f(1), ... has sum 1. Further,

E(X) = λ   and   Var(X) = E((X − E(X))²) = λ.

Example 6.37 (λ = 3) Let X be the Poisson random variable with mean 3 (and with variance 3). The right part of Figure 6.3 is a probability histogram for X, for values of X at most 10. Note that

P(X > 10) = ∑_{x=11}^∞ e⁻³ 3ˣ/x! ≈ 0.000292337.

Rare events. The idea for the Poisson distribution comes from a limit theorem proven by the mathematician S. Poisson in the 1830’s.

Theorem 6.38 (Poisson Limit Theorem) Let λ be a positive real number, n a positive integer, and p = λ/n. Then

lim_{n→∞} (n choose x) pˣ(1 − p)ⁿ⁻ˣ = e^{−λ} λˣ/x!   when x = 0,1,2,....

This theorem can be used to estimate binomial probabilities when the number of trials is large and the probability of success is small. That is, when success is a rare event.

For example, if n = 10000, p = 2/5000, and x = 3, then the probability of 3 successes in 10000 trials is

(10000 choose 3) (2/5000)³ (4998/5000)^9997 ≈ 0.195386

and Poisson’s approximation is

e⁻⁴ 4³/3! ≈ 0.195367,
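A two-line check of these numbers (an addition, assuming SciPy) uses the binomial and Poisson probability mass functions directly:

    from scipy.stats import binom, poisson

    n, p, x = 10_000, 2 / 5_000, 3
    print(binom.pmf(x, n, p))       # about 0.195386
    print(poisson.pmf(x, n * p))    # about 0.195367, with lambda = np = 4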

Poisson process. Events occurring in time are said to be generated by an (approximate)<br />

Poisson process with rate λ when the following conditions are satisfied:<br />

1. The numbers of events occurring in disjoint subintervals of time are independent of one another.

2. The probability of one event occurring in a sufficiently small subinterval of<br />

time is proportional to the size of the subinterval. If h is the size, then the<br />

probability is λh.<br />

3. The probability of two or more events occurring in a sufficiently small<br />

subinterval of time is virtually zero.<br />

In this definition, λ represents the average number of events per unit time. If events follow<br />

an (approximate) Poisson process and X is the number of events observed in one unit of<br />

time, then X has a Poisson distribution with parameter λ.<br />

Typical applications of Poisson distributions include the numbers of cars passing an intersection<br />

in a fixed period of time during a workday, or the number of phone calls received in<br />

a fixed period of time during a workday.<br />

The idea of a Poisson process can be generalized to include events occurring over regions<br />

of space instead of intervals of time. (“Subregions” take the place of “subintervals” in the<br />

conditions above. In this case, λ represents the average number of events per unit area or<br />

per unit volume.)<br />

Remark. The definition of Poisson process allows you to think of the PDF of X as<br />

the limit of a sequence of binomial PDFs: The observation interval is subdivided into n<br />

nonoverlapping subintervals; the i th Bernoulli trial results in success if an event occurs in<br />

the i th subinterval, and failure otherwise. If n is large enough, then the probability that 2<br />

or more events occur in one subinterval can be assumed to be zero.<br />

6.9 Continuous Random Variables, revisited<br />

The exponential and gamma families of distributions are related to Poisson processes.<br />



6.9.1 Exponential Distribution<br />

Recall that the continuous random variable X has an exponential distribution with parameter λ when its PDF has the form

f(x) = λe^{−λx} when x ≥ 0 and 0 otherwise,

where λ is a positive constant. (See Section 5.2.2.)

If events occurring over time follow an approximate Poisson process with rate λ, where λ<br />

is the average number of events per unit time, then the time between successive events has<br />

an exponential distribution with parameter λ. To see this,<br />

1. If you observe the process for t units of time and let Y equal the number of observed events, then Y has a Poisson distribution with parameter λt. The PDF of Y is as follows:

P(Y = y) = e^{−λt}(λt)^y / y! when y = 0,1,2,... and 0 otherwise.

2. An event occurs, the clock is reset to time 0, and X is the time until the<br />

next event occurs. Then X is a continuous random variable whose range<br />

is x > 0. Further,<br />

P(X > t) = P(0 events in the interval [0,t]) = P(Y = 0) = e −λt<br />

and P(X ≤ t) = 1 − e −λt .<br />

3. Since f(t) = d/dt P(X ≤ t) = d/dt (1 − e^{−λt}) = λe^{−λt} when t > 0 (and 0 otherwise) is the same as the PDF of an exponential random variable with parameter λ, X has an exponential distribution with parameter λ.

6.9.2 Gamma Distribution<br />

Recall that the continuous random variable X has a gamma distribution with shape parameter α and scale parameter β when its PDF has the following form

f(x) = (1/(β^α Γ(α))) x^(α−1) e^{−x/β} when x > 0 and 0 otherwise,

where α and β are positive constants. (See Section 5.2.4.)

If events occurring over time follow an approximate Poisson process with rate λ, where λ<br />

is the average number of events per unit time and if r is a positive integer, then the time<br />

until the r th event occurs has a gamma distribution with α = r and β = 1/λ. To see this,<br />

1. If you observe the process for t units of time and let Y equal the number of observed events, then Y has a Poisson distribution with parameter λt. The PDF of Y is as follows:

P(Y = y) = e^{−λt}(λt)^y / y! when y = 0,1,2,... and 0 otherwise.



2. Let X be the time you observe the r th event, starting from time 0. Then X<br />

is a continuous random variable whose range is x > 0. Further, P(X > t)<br />

is the same as the probability that there are fewer than r events in the<br />

interval [0,t]. Thus,<br />

P(X > t) = P(Y < r) = ∑_{y=0}^{r−1} e^{−λt}(λt)^y / y!

and P(X ≤ t) = 1 − P(X > t).

3. The PDF f(t) = d/dt P(X ≤ t) is computed using the product rule:

d/dt [1 − e^{−λt} ∑_{y=0}^{r−1} λ^y t^y / y!]
   = λe^{−λt} ∑_{y=0}^{r−1} λ^y t^y / y! − e^{−λt} ∑_{y=1}^{r−1} λ^y y t^{y−1} / y!
   = λe^{−λt} [ ∑_{y=0}^{r−1} λ^y t^y / y! − ∑_{y=1}^{r−1} λ^{y−1} t^{y−1} / (y−1)! ]
   = λe^{−λt} λ^{r−1} t^{r−1} / (r−1)!
   = (λ^r / (r−1)!) t^{r−1} e^{−λt}.

Since f(t) is the same as the PDF of a gamma random variable with parameters α = r and β = 1/λ, X has a gamma distribution with parameters α = r and β = 1/λ.

6.10 Summary: Poisson Processes<br />

In summary, there are three distributions related to Poisson processes:<br />

1. If X is the number of events observed in one unit of time, then X is a Poisson random<br />

variable with parameter λ, where λ is the average number of events per unit time.<br />

The probability that exactly x events occur is<br />

P(X = x) = e^{−λ} λˣ/x! when x = 0,1,2,..., and 0 otherwise.

2. If X is the time between successive events, then X is an exponential random variable<br />

with parameter λ, where λ is the average number of events per unit time. The CDF<br />

of X is<br />

F(x) = 1 − e −λx when x > 0 and 0 otherwise.<br />

3. If X is the time to the rth event, then X is a gamma random variable with parameters<br />

α = r and β = 1/λ, where λ is the average number of events per unit time. The CDF<br />

of X is<br />

F(x) = 1 − ∑_{y=0}^{r−1} e^{−λx}(λx)^y / y! when x > 0 and 0 otherwise.

Remark. If α = r and r is not too large, then gamma probabilities can be computed by hand using the formula for the CDF above. Otherwise, the computer can be used to find probabilities for gamma distributions.
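The hand formula and the computer agree, of course; the sketch below (added, assuming SciPy, and using arbitrary illustrative values λ = 2, r = 3, x = 1.5) evaluates the gamma CDF both ways:

    import math
    from scipy.stats import gamma

    lam, r, x = 2.0, 3, 1.5
    hand = 1 - sum(math.exp(-lam * x) * (lam * x)**y / math.factorial(y) for y in range(r))
    library = gamma.cdf(x, a=r, scale=1 / lam)   # alpha = r, scale = beta = 1/lambda
    print(round(hand, 6), round(library, 6))     # the two values match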



7 Multivariable Calculus Concepts<br />

Let R k be the set of ordered k-tuples of real numbers,<br />

R k = {(x1,x2,...,xk) : xi ∈ R} .<br />

Each element of R k is called a point or a vector. Initially, we use the notation<br />

x = (x1,x2,...,xk)<br />

to denote vectors in R k . We say that xi is the i th component of x.<br />

This chapter considers real-valued functions whose domain, D, is a subset of Rᵏ,

f : D ⊆ Rᵏ −→ R,

and their applications in statistics.

7.1 Vector And Matrix Operations

This section introduces basic vector and matrix operations. Ideas introduced here will be studied in more depth in Chapter 9 of these notes.

7.1.1 Representations of Vectors<br />

The origin (or zero vector) O in R k is the vector all of whose components are zero,<br />

O = (0,0,... ,0).<br />

Let v = (v1,v2,...,vk) be a vector in R k . If v is not the zero vector, then v can be<br />

represented as a directed line segment in two ways:<br />

1. As a position vector drawn from the origin to the point P(v1,v2,...,vk): v = OP .<br />

2. As a displacement vector drawn from point P(p1,p2,... ,pk) to point Q(q1,q2,... ,qk),<br />

where vi = qi − pi <strong>for</strong> each i: v = P Q.<br />

For example, the left part of Figure 7.1 shows the vector v = (−4,3) represented as the<br />

position vector OP and the displacement vectors AB and V W. The right part of the<br />

figure shows the position vector v = OP, where P is the point P(−4,3,2).<br />

7.1.2 Addition and Scalar Multiplication of Vectors<br />

If v and w are vectors in R k and c ∈ R is a scalar, then<br />



Figure 7.1: Representations of vectors in 2-space (left) and 3-space (right).

1. the vector sum v + w is the vector obtained by componentwise addition,<br />

v + w = (v1 + w1,v2 + w2,...,vk + wk), and<br />

2. the scalar product cv is the vector obtained by multiplying each component of v by c,<br />

cv = (cv1,cv2,...,cvk).<br />

Example 7.1 (Vectors in R 3 ) If v = (8, −3,2) and w = (−9,2,3), then<br />

v + w = (−1, −1,5), 5v = (40, −15,10) and 5v − w = (49, −17,7).<br />
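These componentwise operations correspond directly to array arithmetic; the sketch below (added here, assuming NumPy) reproduces Example 7.1:

    import numpy as np

    v = np.array([8, -3, 2])
    w = np.array([-9, 2, 3])
    print(v + w)        # [-1 -1  5]
    print(5 * v)        # [40 -15 10]
    print(5 * v - w)    # [49 -17  7]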

Properties of addition and scalar multiplication. The following properties hold for vectors u, v, w in Rᵏ and c, d ∈ R:

1. Commutative: v + w = w + v.<br />

2. Associative: (u + v) + w = u + (v + w), c(dv) = (cd)v.<br />

3. Distributive: (c + d)v = cv + dv, c(v + w) = cv + cw.<br />

4. Identity: 1v = v, 0v = O, v + O = v, and<br />

v + (−1)w = v − w (that is, we have the operation of subtraction).<br />

Parallelogram or triangle rules. Figure 7.2 is a graphical way to represent addition (left part) and subtraction (right part), in 2-space and 3-space.

In the left plot, the vector sum v + w is obtained by starting at the lower left and going<br />

across the bottom and then up the right side, or by starting at the lower left and going up<br />

the left side and across the top (since v + w = w + v). Geometrically, the vector sum is<br />

the diagonal of the parallelogram.<br />

In the right plot, observe that w + u = v implies that u = v − w.<br />



Figure 7.2: Graphical representation of vector addition (left) and subtraction (right).

Figure 7.3: Graphical representation of scalar multiplication of vectors.

Parallel vectors. Vectors v and w are said to be parallel when w = mv <strong>for</strong> some constant<br />

m. If m > 0, then the vectors are pointing in the same direction; if m < 0, then the vectors<br />

are pointing in the opposite direction.<br />

Figure 7.3 is a graphical way to represent scalar multiplication in 2-space and 3-space.<br />

Vector 2v is a vector pointing in the same direction as v but of twice the length; vector<br />

−1.5v is a vector of length 1.5 times the length of v, but pointing in the opposite direction.<br />

7.1.3 Length and Distance<br />

If v is a vector in Rᵏ, then the length of v is

‖v‖ = √(v1² + v2² + · · · + vk²).

The zero vector has length 0; all other vectors have positive length.

A unit vector is a vector of length 1. If v is not the zero vector, then the unit vector in the direction of v is the vector u = (1/‖v‖) v.

The distance between vectors v and w is the length of the difference vector v − w:

‖v − w‖ = √((v1 − w1)² + (v2 − w2)² + · · · + (vk − wk)²).

Example 7.2 (Vectors in R³, continued) If v = (8, −3, 2) and w = (−9, 2, 3), then the length of v is ‖v‖ = √77 ≈ 8.77, the length of w is ‖w‖ = √94 ≈ 9.70, and the distance between v and w is ‖v − w‖ = 3√35 ≈ 17.75.

The unit vector in the direction of v is u = (8/√77, −3/√77, 2/√77).

Theorem 7.3 (Triangle Inequality) If v, w ∈ Rᵏ, then ‖v + w‖ ≤ ‖v‖ + ‖w‖.

Note that if k = 2 or k = 3, then the triangle with corners O, v and v + w has sides of lengths ‖v‖, ‖w‖, and ‖v + w‖.

7.1.4 Dot Product of Vectors<br />

If v,w ∈ R k , then the dot product (or inner product or scalar product) is the number<br />

Note that v·v = v 2 .<br />

v·w = v1w1 + v2w2 + · · · + vkwk.<br />

Let k = 2 or 3, v = OP and w = OQ be representations of v and w as position vectors,<br />

and θ be the smaller of the two angles defined by P OQ (the connected segments from P<br />

to the origin to Q). Then, analytic geometry can be used to demonstrate that<br />

v·w = ‖v‖ ‖w‖ cos(θ).

If θ is unknown, then it can be found by solving the equation above.<br />

Example 7.4 (Vectors in R^3, continued) If v = (8, −3, 2) and w = (−9, 2, 3), then
v·w = −72  and  θ = cos⁻¹( v·w / (‖v‖ ‖w‖) ) ≈ 2.58 radians (≈ 148 degrees).
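The vector operations above are easy to check numerically. The following short sketch (assuming the NumPy library is available; it is not part of the original notes) reproduces the computations of Examples 7.1, 7.2, and 7.4.

    import numpy as np

    v = np.array([8.0, -3.0, 2.0])
    w = np.array([-9.0, 2.0, 3.0])

    # Componentwise addition and scalar multiplication (Example 7.1)
    print(v + w)          # [-1. -1.  5.]
    print(5 * v)          # [40. -15. 10.]
    print(5 * v - w)      # [49. -17.  7.]

    # Lengths and distance (Example 7.2)
    print(np.linalg.norm(v))        # sqrt(77) ~ 8.77
    print(np.linalg.norm(w))        # sqrt(94) ~ 9.70
    print(np.linalg.norm(v - w))    # 3*sqrt(35) ~ 17.75

    # Unit vector in the direction of v
    u = v / np.linalg.norm(v)

    # Dot product and angle (Example 7.4)
    dot = np.dot(v, w)                                            # -72
    theta = np.arccos(dot / (np.linalg.norm(v) * np.linalg.norm(w)))
    print(dot, theta, np.degrees(theta))                          # ~ 2.58 rad, ~ 148 degrees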

Angle between vectors. Analytic geometry can be used to prove the following theorem.<br />

Theorem 7.5 (Cauchy-Schwarz Inequality) If v, w ∈ R^k, then |v·w| ≤ ‖v‖ ‖w‖.

The Cauchy-Schwarz inequality allows us to formally define the angle between two nonzero
vectors. Specifically, if v, w ∈ R^k are nonzero vectors, then θ is defined to be the angle
satisfying the following equation:
θ = cos⁻¹( v·w / (‖v‖ ‖w‖) ).

This definition generalizes ideas about angles from 2-space and 3-space.<br />

Orthogonal vectors. The nonzero vectors v,w ∈ R k are said to be orthogonal if their<br />

dot product is 0, v·w = 0.<br />

Note that the angle between orthogonal vectors is π/2 (or 90 degrees), since cos(π/2) = 0 and
π/2 is in the range of the inverse cosine function (see page 53).



7.1.5 Matrix Notation and Matrix Transpose<br />

An m-by-n matrix A is a rectangular array with m rows and n columns:<br />

A = [ a11 a12 · · · a1n ; a21 a22 · · · a2n ; ... ; am1 am2 · · · amn ] = [aij]
(rows are separated by semicolons here and in the displays below).

The (i,j)-entry of A (denoted by aij) is the element in the row i and column j position.<br />

Entries are shown explicitly in the first term above. The second term is used as shorthand<br />

when the size of the matrix (the numbers m and n) is understood.<br />

A 1-by-n matrix is known as a row vector and an m-by-1 matrix is known as a column<br />

vector. The m-by-n matrix A can be thought of as a single column of m row vectors or as<br />

a single row of n column vectors.<br />

If A is an m-by-n matrix, then the transpose of A, denoted by A^T, is the n-by-m matrix
whose (i,j)-entry is aji. For example, if
A = [ 1 2 3 ; 4 5 6 ],  then  A^T = [ 1 4 ; 2 5 ; 3 6 ].

7.1.6 Addition and Scalar Multiplication of Matrices<br />

Let A = [aij] and B = [bij] be m-by-n matrices of numbers and let c ∈ R be a scalar. Then<br />

1. the matrix sum, A + B, is the m-by-n matrix obtained by componentwise addition,<br />

A + B = [aij + bij], and<br />

2. the scalar product, cA, is the m-by-n matrix obtained by multiplying each component<br />

of A by c,<br />

cA = [caij].<br />

<br />

Example 7.6 (2-by-3 Matrices) If A = [ 2 3 6 ; −2 0 1 ] and B = [ 5 −6 2 ; 4 5 0 ], then
A + B = [ 7 −3 8 ; 2 5 1 ],  5A = [ 10 15 30 ; −10 0 5 ],  and  5A − 2B = [ 0 27 26 ; −18 −10 5 ].

Properties of addition and scalar multiplication. Since operations are per<strong>for</strong>med<br />

componentwise, properties of addition and scalar multiplication of matrices are similar to<br />

those <strong>for</strong> vectors.<br />



7.1.7 Matrix Multiplication<br />

Let A = [aij] be an m-by-n matrix and let B = [bij] be an n-by-p matrix.<br />

The matrix product, C = AB, is the m-by-p matrix whose (i,j)-entry is the dot product of<br />

the i th row of A with the j th column of B:<br />

cij = Σ_{k=1}^{n} aik bkj   for i = 1, 2, ..., m;  j = 1, 2, ..., p.

<br />

Example 7.7 (2-by-3 and 3-by-2 Matrices) Let A = [ 2 3 6 ; −1 0 1 ] and B = [ 2 −3 ; 0 1 ; 4 −7 ].
Then
AB = [ 28 −45 ; 2 −4 ]  and  BA = [ 7 6 9 ; −1 0 1 ; 15 12 17 ].
For example, the (1,1)-entry of the product AB is
(2, 3, 6)·(2, 0, 4) = (2)(2) + (3)(0) + (6)(4) = 28.
Similarly, the (1,1)-entry of the product BA is (2)(2) + (−3)(−1) = 7.
Properties. Matrix products are associative. Thus, the product
ABC = A(BC) = (AB)C
is well-defined. Matrix products are not necessarily commutative, as illustrated in the
example above. In fact, if the number of rows of A is not equal to the number of columns
of B, then the product BA is not defined.

Note also that the transpose of the product of two matrices is equal to the product of the
transposes in the opposite order: (AB)^T = B^T A^T.
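As a quick numerical check, the sketch below (NumPy assumed; not part of the original notes) reproduces Example 7.7 and verifies the transpose rule.

    import numpy as np

    A = np.array([[2, 3, 6],
                  [-1, 0, 1]])
    B = np.array([[2, -3],
                  [0, 1],
                  [4, -7]])

    AB = A @ B      # 2-by-2 product
    BA = B @ A      # 3-by-3 product
    print(AB)       # [[28 -45], [ 2  -4]]
    print(BA)       # [[ 7  6  9], [-1  0  1], [15 12 17]]

    # (AB)^T equals B^T A^T
    print(np.array_equal(AB.T, B.T @ A.T))   # True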

7.1.8 Determinants of Square Matrices<br />

A square matrix is a matrix with m = n. If A is a square matrix, then the determinant<br />

of A is a number which summarizes certain properties of the matrix. Notations <strong>for</strong> the<br />

determinant of A are det(A) or |A|.<br />

The determinant is often defined recursively, as illustrated in the following special cases:<br />

1. If A = [ a1 a2 ; b1 b2 ] is a 2-by-2 matrix, then
det(A) = det[ a1 a2 ; b1 b2 ] = a1 b2 − b1 a2.


2. If A = [ a1 a2 a3 ; b1 b2 b3 ; c1 c2 c3 ] is a 3-by-3 matrix, then
det(A) = a1 · det[ b2 b3 ; c2 c3 ] − a2 · det[ b1 b3 ; c1 c3 ] + a3 · det[ b1 b2 ; c1 c2 ].
(The formula uses an alternating sum of 3 determinants of 2-by-2 matrices.)

3. If A = [ a1 a2 a3 a4 ; b1 b2 b3 b4 ; c1 c2 c3 c4 ; d1 d2 d3 d4 ] is a 4-by-4 matrix, then
det(A) = a1 · det[ b2 b3 b4 ; c2 c3 c4 ; d2 d3 d4 ] − a2 · det[ b1 b3 b4 ; c1 c3 c4 ; d1 d3 d4 ]
       + a3 · det[ b1 b2 b4 ; c1 c2 c4 ; d1 d2 d4 ] − a4 · det[ b1 b2 b3 ; c1 c2 c3 ; d1 d2 d3 ].
(The formula uses an alternating sum of 4 determinants of 3-by-3 matrices.)

Note that the notation <strong>for</strong> the entries in the special cases was chosen to emphasize the<br />

recursive nature of the definition.<br />

In general, if Aij is the (n − 1)-by-(n − 1) submatrix obtained by eliminating the i th row<br />

and the j th column of A, then<br />

det(A) = Σ_{j=1}^{n} (−1)^{1+j} a1j det(A1j).

The <strong>for</strong>mula uses an alternating sum of n determinants of (n − 1)-by-(n − 1) matrices. We<br />

say that the determinant is obtained by “expanding in the first row.” (Additional methods<br />

<strong>for</strong> computing det(A) are discussed in Section 9.3.)<br />
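A direct translation of the recursive definition into code can make the "expand in the first row" idea concrete. The sketch below (Python with NumPy; a naive implementation with exponential running time, suitable only for small matrices) uses the 3-by-3 matrix of Example 7.8 below as a test case, with numpy.linalg.det as an independent check.

    import numpy as np

    def det_by_first_row(A):
        """Determinant via cofactor expansion along the first row (recursive)."""
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        if n == 1:
            return A[0, 0]
        total = 0.0
        for j in range(n):
            # A1j: delete row 0 and column j
            minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
            total += (-1) ** j * A[0, j] * det_by_first_row(minor)
        return total

    A = [[4, 1, 2], [-1, 2, 1], [1, 3, -3]]
    print(det_by_first_row(A))     # -48.0
    print(np.linalg.det(A))        # ~ -48.0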

Example 7.8 (3-by-3 Matrix) If A = [ 4 1 2 ; −1 2 1 ; 1 3 −3 ], then
det(A) = 4 · det[ 2 1 ; 3 −3 ] − 1 · det[ −1 1 ; 1 −3 ] + 2 · det[ −1 2 ; 1 3 ]
       = 4(−6 − 3) − 1(3 − 1) + 2(−3 − 2)
       = 4(−9) − 1(2) + 2(−5) = −48.

7.2 Limits and Continuity

This section generalizes ideas from Section 4.2.


7.2.1 Neighborhoods, Deleted Neighborhoods and Limits<br />

A neighborhood of the vector a ∈ R k is the set of vectors x ∈ R k satisfying<br />

‖x − a‖ < r for some positive number r.

A deleted neighborhood of a is the neighborhood with a removed; that is, the set of vectors<br />

x satisfying 0 < ‖x − a‖ < r for some positive r.

The vector a ∈ D is said to be a limit point of the domain D if every deleted neighborhood<br />

of a contains points of D.<br />

Consider f : D ⊆ R k −→ R, and suppose that a is a limit point of D. We say that the<br />

limit as x approaches a of f(x) is L,<br />

lim f(x) = L,<br />

x→a<br />

if for every positive number ε there exists a positive number δ such that
if x ∈ D satisfies 0 < ‖x − a‖ < δ, then f(x) satisfies |f(x) − L| < ε.

(Vectors in the intersection of the domain of the function with the deleted δ-neighborhood<br />

of a are mapped into the ε-neighborhood of L.)

Example 7.9 (Limit Exists) Consider the 2-variable function
f(x,y) = (x³ − y³)/(x − y)  when x ≠ y.
Let x = (x,y) and a = (0,0). Then
lim_{x→a} f(x) = lim_{(x,y)→(0,0)} (x³ − y³)/(x − y) = lim_{(x,y)→(0,0)} (x² + xy + y²) = 0.

(As (x,y) approaches the origin from any direction not crossing the line y = x, the function<br />

values f(x,y) approach zero.)<br />

Example 7.10 (Limit Does Not Exist) Consider the 2-variable function
f(x,y) = (x² − y²)/(x² + y²)  when (x,y) ≠ (0,0).
Let x = (x,y) and a = (0,0), and consider approaching (0,0) along the line y = mx. Then
f(x,y) = f(x, mx) = (x² − m²x²)/(x² + m²x²) = (1 − m²)/(1 + m²)  on this line (when x ≠ 0).
Since
lim_{x→a along y = mx} f(x) = (1 − m²)/(1 + m²)
and m can be any number, the two-dimensional limit does not exist.


Properties of limits. Properties of limits <strong>for</strong> 1-variable functions (see page 57) can be<br />

generalized to limits <strong>for</strong> k-variable functions.<br />

Example 7.11 (Squeezing Principle) Consider the 2-variable function
f(x,y) = 3x²y/(x² + y²)  when (x,y) ≠ (0,0).
Let x = (x,y) and a = (0,0). The squeezing principle can be used to demonstrate that
lim_{(x,y)→(0,0)} f(x,y) = 0.
To see this, note that
0 ≤ x²/(x² + y²) ≤ 1  when (x,y) ≠ (0,0)
implies that values of f have the following simple bounds:
−3|y| ≤ f(x,y) ≤ 3|y|  when (x,y) ≠ (0,0).

Since ±3|y| → 0 as y → 0, f(x,y) must approach 0 as well.<br />

7.2.2 Continuous Functions<br />

The function f(x) is said to be continuous at a if<br />

lim_{x→a} f(x) = f(a).

The function is said to be discontinuous at a otherwise. Note that f(x) is discontinuous at<br />

a when f(a) is undefined, or when the limit does not exist, or when the limit and function<br />

values exist but are not equal.<br />

Linear functions. A linear function in k variables is a function of the <strong>for</strong>m<br />

f(x) = c0 + c1x1 + c2x2 + · · · + ckxk<br />

where each coefficient ci ∈ R. Linear functions are continuous <strong>for</strong> all x ∈ R k .<br />

Polynomial functions. A polynomial function in k variables is a function of the form
f(x) = Σ_{j1,j2,...,jk=0}^{d} c_{j1,j2,...,jk} x1^{j1} x2^{j2} · · · xk^{jk},
where d is a nonnegative integer and the c's are real numbers. For example, f(x,y) = 3x² − 10xy⁸ + y² + 12 is a polynomial
function in two variables. Polynomial functions are continuous for all x ∈ R^k.

Rational functions. A rational function in k variables is a ratio of polynomials, f(x) =<br />

p(x)/q(x). Rational functions are continuous whenever q(x) ≠ 0.



More generally. More generally, properties of limits imply that sums, differences, products,<br />

quotients, and constant multiples of continuous functions are continuous whenever the<br />

appropriate operation is defined.<br />

Composition of continuous functions. If the k-variable function g(x) is continuous<br />

at a and the 1-variable function f(x) is continuous at g(a), then the composite function<br />

f(g(x)) is continuous at a. In other words,<br />

<br />

lim_{x→a} f(g(x)) = f( lim_{x→a} g(x) ) = f(g(a)).

For example, let g(x,y) = x² − 4y, f(x) = √x and a = (5,1). Then the composite function
f(g(x,y)) is continuous at (5,1) since
lim_{(x,y)→(5,1)} f(g(x,y)) = lim_{(x,y)→(5,1)} √(x² − 4y) = √(25 − 4) = √21 = f(g(5,1)).

7.3 Differentiation in Several Variables<br />

This section generalizes differentiability from Section 4.3.<br />

7.3.1 Partial Derivatives and the Gradient Vector<br />

The partial derivative of f(x) with respect to xi is
fxi(x) = lim_{h→0} [ f(x1, ..., xi−1, xi + h, xi+1, ..., xk) − f(x) ] / h.
(The ith partial derivative is the ordinary derivative of the function where all variables
except the ith variable are held fixed.) Alternative notations are Dxi f(x) and ∂f/∂xi (x).

Example 7.12 (k = 3) Let x = (x,y,z) and f(x,y,z) = −2xy² + z³ + 3x²y z⁴. Then
fx(x,y,z) = −2y² + 6xy z⁴,  fy(x,y,z) = −4xy + 3x²z⁴,  fz(x,y,z) = 3z² + 12x²y z³.
The gradient vector, ∇f(x) ("grad f of x"), is the vector of partial derivatives:
∇f(x) = ( fx1(x), fx2(x), ..., fxk(x) ).
For example, for the 3-variable function above,
∇f(x,y,z) = ( −2y² + 6xy z⁴, −4xy + 3x²z⁴, 3z² + 12x²y z³ ).



Matrix notation. We sometimes write the list of partial derivatives as a 1-by-k matrix,<br />

Df(x) = [fx1 (x) fx2 (x) ... fxk (x)] .<br />

Note that matrix notation is especially useful when considering vector-valued functions. For<br />

example, if the domain of f is a subset of R n and the range of f is a subset of R m , then<br />

Df(x) is an m-by-n matrix of partial derivatives.<br />

7.3.2 Second-Order Partial Derivatives<br />

The second-order partial derivative of f(x) with respect to xi and xj is the partial derivative<br />

of fxi (x) with respect to xj:<br />

<br />

fxi,xj(x) = ∂/∂xj ( ∂f/∂xi )(x) = ∂²f/(∂xj ∂xi)(x) = Dxi,xj f(x).

When i = j, the second-order partial derivative can be interpreted as the concavity of a
partial function. When i ≠ j, the second-order partial derivative is often called a mixed
partial derivative. Additional notations when i = j are:

fxi,xi(x) = ∂²f/∂xi² (x) = D²xi f(x).

Theorem 7.13 (Clairaut's Theorem) Suppose that the k-variable function f(x) has

continuous first- and second-order partial derivatives in a neighborhood of a. Then <strong>for</strong> each<br />

i and j, the mixed partial derivatives at a are equal:<br />

fxi,xj (a) = fxj,xi (a) <strong>for</strong> all i,j = 1,2,... ,k.<br />

Example 7.14 (k = 3, continued) The second-order partial derivatives (with arguments<br />

omitted) of the 3-variable function above are as follows:<br />

fx,x = 6y z⁴,  fy,y = −4x,  fz,z = 6z + 36x²y z²,
fx,y = fy,x = −4y + 6xz⁴,  fx,z = fz,x = 24xy z³,  fy,z = fz,y = 12x²z³.
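The derivatives in Examples 7.12 and 7.14 can be verified symbolically. The sketch below assumes the SymPy library is available (it is not part of the original notes); it computes the gradient and checks the equality of mixed partials asserted by Clairaut's theorem.

    import sympy as sp

    x, y, z = sp.symbols('x y z')
    f = -2*x*y**2 + z**3 + 3*x**2*y*z**4

    grad = [sp.diff(f, v) for v in (x, y, z)]
    print(grad)   # [-2*y**2 + 6*x*y*z**4, -4*x*y + 3*x**2*z**4, 3*z**2 + 12*x**2*y*z**3]

    # Mixed second-order partials agree (Clairaut's theorem)
    for u in (x, y, z):
        for v in (x, y, z):
            assert sp.simplify(sp.diff(f, u, v) - sp.diff(f, v, u)) == 0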

Remark. Higher-order partial derivatives can also be defined.<br />

7.3.3 Differentiability, Local Linearity and Tangent (Hyper)Planes<br />

Suppose that f(x) is defined <strong>for</strong> all x in a neighborhood of a and that<br />

fxi (a) exists <strong>for</strong> i = 1,2,... ,k.<br />


Figure 7.4: Plots of z = 4 − x² − y² (left part) and z = x³ − 3xy + y³ (right part). In each

case, partial functions are superimposed on the surface. (See Figure 7.5 also.)<br />

Let
h(x) = f(a) + ∇f(a)·(x − a) = f(a) + fx1(a)(x1 − a1) + · · · + fxk(a)(xk − ak).

We say that f is differentiable at a if the following limit holds:<br />

lim_{x→a} [ f(x) − h(x) ] / ‖x − a‖ = 0.

That is, if the error in using h as an approximation to f approaches 0 at a rate faster than<br />

the rate at which the distance between x and a approaches 0.<br />

Local linearity. The definition of differentiability says that f is locally linear; the error<br />

in using the linear function h as an approximation to f can be made as small as we like by<br />

taking x to be close enough to a.<br />

Tangent (hyper)planes. If f is differentiable at a, then y = h(x) is called the tangent<br />

plane at (a,f(a)) when k = 2 and is called the tangent hyperplane at (a,f(a)) when k > 2.<br />

Example 7.15 (Quadratic Function) Consider the 2-variable function<br />

f(x,y) = 4 − x² − y² and let a = (0.5, −1).

The left part of Figure 7.4 shows a graph of the surface z = f(x,y), with graphs of the<br />

partial functions z = f(0.5,y) and z = f(x, −1) superimposed.<br />

1. The partial derivatives of f are fx(x,y) = −2x and fy(x,y) = −2y.<br />

2. The slope of z = f(x, −1) when x = 0.5 is fx(0.5, −1) = −1.<br />

3. The slope of z = f(0.5,y) when y = −1 is fy(0.5, −1) = 2.<br />



4. Since f(0.5, −1) = 2.75, the linear approximation at (0.5, −1) is<br />

h(x,y) = 2.75 − (x − 0.5) + 2(y + 1).<br />

In addition, f is differentiable at this point (and everywhere).<br />

Example 7.16 (Cubic Function) Consider the 2-variable function<br />

f(x,y) = x³ − 3xy + y³ and let a = (1.5, 1).

The right part of Figure 7.4 shows a graph of the surface z = f(x,y), with graphs of the<br />

partial functions z = f(1.5,y) and z = f(x,1) superimposed.<br />

1. The partial derivatives of f are fx(x,y) = 3x² − 3y and fy(x,y) = 3y² − 3x.

2. The slope of z = f(x,1) when x = 1.5 is fx(1.5,1) = 3.75.<br />

3. The slope of z = f(1.5,y) when y = 1 is fy(1.5,1) = −1.5.<br />

4. Since f(1.5,1) = −0.125, the linear approximation at (1.5,1) is<br />

h(x,y) = −0.125 + 3.75(x − 1.5) − 1.5(y − 1).<br />

In addition, f is differentiable at this point (and everywhere).<br />

Example 7.17 (Derivative Does Not Exist) Consider the 2-variable function
f(x,y) = ||x| − |y|| − |x| − |y|  and let a = (0,0).
Since f(x,0) = 0 and f(0,y) = 0, fx(x,0) = 0 and fy(0,y) = 0 for each x and y, and the
linear approximation h(x,y) = 0.
But f is not differentiable at the origin since the limit given in the definition of differentiability
does not exist. If you evaluate the limit by approaching along y = 0 and along y = x,
for example, you get two different answers:
lim_{(x,0)→(0,0)} [ f(x,0) − 0 ] / ‖(x,0)‖ = 0  and  lim_{(x,x)→(0,0)} [ f(x,x) − 0 ] / ‖(x,x)‖ = −√2.

Theorem 7.18 (Differentiability and Continuity) Suppose that the k-variable function

f(x) is defined <strong>for</strong> all x in a neighborhood of a. Then<br />

1. If f is differentiable at a, then f is continuous at a.<br />

2. If each partial derivative function, fxi (x) <strong>for</strong> i = 1,2,... ,k, is continuous in a neighborhood<br />

of a, then f is differentiable at a.<br />

Note that in the quadratic and cubic function examples, the partial derivative functions (fx<br />

and fy) are defined and continuous <strong>for</strong> all pairs. In the absolute value function example,<br />

partial derivative functions are not defined in a neighborhood of the origin.<br />



7.3.4 Total Derivative<br />

If each coordinate xi is a function of t,<br />

xi = xi(t) <strong>for</strong> i = 1,2,... ,k<br />

(or simply x = x(t)), then f is also a function of t. The derivative of f with respect to t<br />

(when it exists) is called the total derivative of f.<br />

Theorem 7.19 (Total Derivatives) If f is differentiable in a neighborhood of x0 = x(t0)<br />

and each coordinate function is differentiable at t0, then the total derivative exists and its<br />

value can be computed as follows:<br />

df/dt (t0) = fx1(x0) x′1(t0) + fx2(x0) x′2(t0) + · · · + fxk(x0) x′k(t0).

Note that if we let x ′ (t) = (x ′ 1 (t),x′ 2 (t),... ,x′ k (t)) be the vector of derivatives of the coordinate<br />

functions, then the total derivative can be written as the dot product of the gradient<br />

vector of f with the derivative vector of x:<br />

df/dt (t0) = ∇f(x0)·x′(t0).

Thus, the <strong>for</strong>mula <strong>for</strong> the total derivative is a generalization of the chain rule <strong>for</strong> compositions<br />

of 1-variable functions.<br />

Example 7.20 (k = 3, continued) Let f(x,y,z) = −2xy² + z³ + 3x²y z⁴,
x(t) = ( 1/t, 1/t², −2t ),
and t0 = 1. Since
x = x(t) = 1/t,  y = y(t) = 1/t²,  z = z(t) = −2t  ⇒  x′(t) = ( −1/t², −2/t³, −2 )  ⇒  x′(1) = (−1, −2, −2),
and ∇f(x(1)) = ∇f(1, 1, −2) = (94, 44, −84) (using the formula for ∇f derived earlier),
the total derivative of f when t = 1 is
∇f(x(1))·x′(1) = (94, 44, −84)·(−1, −2, −2) = −14.
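A quick numerical check of Example 7.20 (a sketch, assuming NumPy; the helper names are ours): compare the chain-rule value ∇f(x(t0))·x′(t0) with a finite-difference approximation of d/dt f(x(t)) at t0 = 1.

    import numpy as np

    def f(x, y, z):
        return -2*x*y**2 + z**3 + 3*x**2*y*z**4

    def grad_f(x, y, z):
        return np.array([-2*y**2 + 6*x*y*z**4,
                         -4*x*y + 3*x**2*z**4,
                         3*z**2 + 12*x**2*y*z**3])

    def path(t):
        return np.array([1/t, 1/t**2, -2*t])

    def path_prime(t):
        return np.array([-1/t**2, -2/t**3, -2.0])

    t0, h = 1.0, 1e-6
    chain_rule = grad_f(*path(t0)) @ path_prime(t0)
    finite_diff = (f(*path(t0 + h)) - f(*path(t0 - h))) / (2*h)
    print(chain_rule, finite_diff)   # both approximately -14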

7.3.5 Directional Derivatives<br />


If u is a unit vector, and f is a k-variable function defined in a neighborhood of a, then the

directional derivative of f at a is
Duf(a) = lim_{h→0} [ f(a + hu) − f(a) ] / h,
provided the limit exists.


In general, the function Duf(x) represents the rate of change of the function f with respect<br />

to a unit change in the direction given by u.<br />

Theorem 7.21 (Directional Derivatives) If the k-variable function f is differentiable<br />

in a neighborhood of a and u is a unit vector, then<br />

Duf(a) = ∇f(a)·u .<br />

That is, the rate of change is the dot product of the gradient vector of f at a with u.<br />

Bounds on directional derivatives. Let θ be the angle between vectors ∇f(a) and u
(see page 128). Then, since ‖u‖ = 1,
Duf(a) = ∇f(a)·u = ‖∇f(a)‖ cos(θ).
Further, since the cosine function takes values in the interval [−1,1], directional derivatives
take values between −‖∇f(a)‖ and ‖∇f(a)‖. If ∇f(a) ≠ O, then
1. The minimum value is assumed when u = −(1/‖∇f(a)‖) ∇f(a), and
2. The maximum value is assumed when u = (1/‖∇f(a)‖) ∇f(a).

In the first case, the direction of u is called the direction of steepest descent; in the second<br />

case, it is called the direction of steepest ascent.<br />

Example 7.22 (Quadratic Function, continued) Consider again the function
f(x,y) = 4 − x² − y² and let a = (0.5, −1).
The left part of Figure 7.5 shows contours corresponding to z = 0, 1, 2, 3, 4. The unit vector
in the direction of steepest ascent at (0.5, −1) is highlighted.
Since ∇f(a) = (−1, 2), the direction of steepest ascent is u = (1/√5)(−1, 2), and the slope of
the partial function in this direction is √5 ≈ 2.24.

Note that the contour corresponding to z = k is the set of all pairs (x, y) satisfying f(x, y) = k.<br />

Example 7.23 (Cubic Function, continued) Consider again the function
f(x,y) = x³ − 3xy + y³ and let a = (1.5, 1).
The right part of Figure 7.5 shows contours corresponding to z = −1, 0, 1, 2, 3, 4. The unit
vector in the direction of steepest ascent at (1.5, 1) is highlighted.
Since ∇f(a) = (3.75, −1.5), the direction of steepest ascent is u = (1/√16.3125)(3.75, −1.5), and
the slope of the partial function in this direction is √16.3125 ≈ 4.04.
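The directions of steepest ascent in Examples 7.22 and 7.23 come directly from normalizing the gradient. A minimal sketch (NumPy assumed; the function name is ours):

    import numpy as np

    def steepest_ascent_direction(grad):
        """Unit vector in the direction of the gradient; the slope there is ||grad||."""
        grad = np.asarray(grad, dtype=float)
        return grad / np.linalg.norm(grad), np.linalg.norm(grad)

    # Example 7.22: f(x,y) = 4 - x^2 - y^2 at a = (0.5, -1), gradient (-1, 2)
    print(steepest_ascent_direction([-1.0, 2.0]))     # direction (-1, 2)/sqrt(5), slope ~ 2.24

    # Example 7.23: f(x,y) = x^3 - 3xy + y^3 at a = (1.5, 1), gradient (3.75, -1.5)
    print(steepest_ascent_direction([3.75, -1.5]))    # slope ~ 4.04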


Figure 7.5: Contour plots of z = 4 − x² − y² (left part) and z = x³ − 3xy + y³ (right part),

with directions of steepest ascent highlighted. (See Figure 7.4 also.)<br />

7.4 Optimization<br />

Definitions of local and global extrema <strong>for</strong> functions of k-variables directly generalize the<br />

definitions given in Section 4.4 <strong>for</strong> 1-variable functions. In addition, Fermat’s theorem can<br />

be generalized as follows:<br />

Theorem 7.24 (Fermat’s Theorem) If the k-variable function f has a local extremum<br />

at c and if f is differentiable in a neighborhood of c, then ∇f(c) = O.<br />

Example 7.25 (Quadratic Function, continued) Consider again<br />

f(x,y) = 4 − x² − y² (left parts of Figures 7.4 and 7.5).

f has a local (and global) maximum at the origin and ∇f(0,0) = (0,0).<br />

Example 7.26 (Cubic Function, continued) Consider again<br />

f(x,y) = x³ − 3xy + y³ (right parts of Figures 7.4 and 7.5).

f has a local minimum at (1,1) and ∇f(1,1) = (0,0).<br />

Note that ∇f(0,0) = (0,0), but (0,0) is neither a local max nor a local min.<br />

Critical points. A point c is a critical point (or stationary point) of the k-variable function<br />

f if either ∇f(c) = O or if one of the partial derivatives does not exist at c.<br />

For example, the quadratic function f(x,y) = 4 − x² − y² has a single critical point at the
origin; the cubic function f(x,y) = x³ − 3xy + y³ has critical points (1,1) and (0,0).

7.4.1 Linear and Quadratic Approximations<br />

The definition of Taylor polynomials given in Section 6.6 can be generalized to k-variable<br />

functions. This section considers first-order and second-order approximations only.<br />



Linear approximation. Suppose that the k-variable function f is differentiable in a<br />

neighborhood of a. The first-order Taylor polynomial centered at a is defined as follows:<br />

p1(x) = f(a) + ∇f(a)·(x − a).<br />

Note that, since f is differentiable, the error in using p1 as an approximation to f can be<br />

made as small as we like by taking x to be close enough to a.<br />

If we write the list of partial derivatives as the 1-by-k matrix Df(a), the difference vector<br />

(x − a) as a k-by-1 matrix, and consider the matrix product<br />

Df(a)(x − a) = [ fx1(a) fx2(a) · · · fxk(a) ] [ x1 − a1 ; x2 − a2 ; ... ; xk − ak ]
= fx1(a)(x1 − a1) + fx2(a)(x2 − a2) + · · · + fxk(a)(xk − ak)

(a 1-by-1 matrix is equivalent to a scalar), then the first-order Taylor polynomial can be<br />

written as follows:<br />

p1(x) = f(a) + Df(a)(x − a).<br />

Quadratic approximation. The Hessian of a k-variable function f is the k-by-k matrix<br />

whose (i,j)-entry is fxi,xj :<br />

Hf(x) = [ fx1,x1(x) fx1,x2(x) · · · fx1,xk(x) ; fx2,x1(x) fx2,x2(x) · · · fx2,xk(x) ; ... ; fxk,x1(x) fxk,x2(x) · · · fxk,xk(x) ].

Suppose that the k-variable function f has continuous first- and second-order partial derivatives<br />

in a neighborhood of a. Then the second-order Taylor polynomial centered at a is<br />

defined as follows:<br />

p2(x) = f(a) + Df(a)(x − a) + (1/2)(x − a)^T Hf(a)(x − a),

where Df(a) is the list of partial derivatives written as a 1-by-k matrix, (x − a) is the<br />

difference vector written as a k-by-1 matrix, and Hf(a) is the k-by-k Hessian matrix at a.<br />

Example 7.27 (Ratio Function) Let f(x,y) = y/x and a = (1,1).
1. The first- and second-order partial derivatives of f are:
fx(x,y) = −y/x²,  fy(x,y) = 1/x,
fxx(x,y) = 2y/x³,  fxy(x,y) = fyx(x,y) = −1/x²,  and  fyy(x,y) = 0.


2. Since f(a) = 1, fx(a) = −1 and fy(a) = 1, the first-order Taylor polynomial centered
at a is
p1(x,y) = 1 + [ −1 1 ] [ x − 1 ; y − 1 ] = 1 − (x − 1) + (y − 1).
3. Further, since fxx(a) = 2, fxy(a) = fyx(a) = −1 and fyy(a) = 0, the second-order
Taylor polynomial centered at a is
p2(x,y) = 1 + [ −1 1 ] [ x − 1 ; y − 1 ] + (1/2) [ x − 1  y − 1 ] [ 2 −1 ; −1 0 ] [ x − 1 ; y − 1 ]
= 1 − (x − 1) + (y − 1) + (x − 1)² − (x − 1)(y − 1).
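A short sketch comparing f(x,y) = y/x with the two Taylor approximations of Example 7.27 near a = (1,1) (NumPy-free Python; not part of the original notes). The error for p2 shrinks faster than the error for p1 as the point approaches a.

    def f(x, y):
        return y / x

    def p1(x, y):
        return 1 - (x - 1) + (y - 1)

    def p2(x, y):
        return 1 - (x - 1) + (y - 1) + (x - 1)**2 - (x - 1)*(y - 1)

    for h in (0.5, 0.1, 0.01):
        x, y = 1 + h, 1 + h/2
        print(h, abs(f(x, y) - p1(x, y)), abs(f(x, y) - p2(x, y)))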

Theorem 7.28 (Second-Order Taylor Polynomials) Suppose that the k-variable<br />

function f has continuous first- and second-order partial derivatives in a neighborhood of<br />

a, and let p2(x) be the second-order Taylor polynomial centered at a. Then<br />

lim_{x→a} [ f(x) − p2(x) ] / ‖x − a‖² = 0.
That is, the error function e(x) = f(x) − p2(x) approaches zero at a rate faster than the
rate at which the square of the distance between x and a approaches zero.

7.4.2 Second Derivative Test<br />

Suppose that the k-variable function f has continuous first- and second-order partial derivatives<br />

in a neighborhood of the critical point a. Since Df(a) is the zero vector,<br />

p2(x) = f(a) + (1/2)(x − a)^T Hf(a)(x − a).

If the term (x −a) T Hf(a)(x −a) is positive <strong>for</strong> all x sufficiently close (but not equal) to a,<br />

then f has a local minimum at a. Similarly, if the term (x − a) T Hf(a)(x − a) is negative<br />

<strong>for</strong> all x sufficiently close (but not equal) to a, then f has a local maximum at a.<br />

The second derivative test gives criteria <strong>for</strong> determining when the quadratic term in the<br />

second-order approximation is always positive or always negative <strong>for</strong> x near a. Matrix<br />

notation and operations are useful in developing the test.<br />

Symmetric matrices and quadratic <strong>for</strong>ms. The n-by-n matrix A is symmetric if<br />

aij = aji <strong>for</strong> all i and j. Symmetric matrices have the property that A T = A.<br />

If A is a symmetric n-by-n matrix and h is an n-by-1 matrix (a column vector), then
Q(h) = h^T A h = Σ_{i=1}^{n} Σ_{j=1}^{n} aij hi hj
is called a quadratic form in h.


Positive and negative definite quadratic forms. The quadratic form Q (respectively,
the associated symmetric matrix A) is said to be positive definite if Q(h) > 0 when h ≠ O
and is said to be negative definite if Q(h) < 0 when h ≠ O.

If the k-variable function f has continuous first- and second-partial derivatives in a neighborhood<br />

of a, then by Clairaut’s theorem (page 135) the Hessian matrix Hf(a) is a symmetric<br />

k-by-k matrix. Our concern is whether Hf(a) is positive definite or negative definite.<br />

Principal minors and determinants. If A is a k-by-k matrix, then the j-by-j principal
minor of A is the matrix
Aj = [ a11 a12 · · · a1j ; a21 a22 · · · a2j ; ... ; aj1 aj2 · · · ajj ]   for j = 1, 2, ..., k.

Let dj = det(Aj) be the determinant of the j-by-j principal minor (j > 1) and let d1 = a11.<br />

For example, if A = [ 4 1 2 ; −1 2 1 ; 1 3 −3 ], then d1 = 4, d2 = 9, and d3 = −48.

Second derivative test <strong>for</strong> local extrema. Suppose that the k-variable function f<br />

has continuous first- and second-order partial derivatives in a neighborhood of the critical<br />

point a, let Hf(a) be the Hessian matrix at a, and let d1, d2, · · ·, dk be the sequence of<br />

determinants of the principal minors of Hf(a).<br />

Assume that dk = det(Hf(a)) ≠ 0. Then

1. If di > 0 <strong>for</strong> i = 1,2,... ,k, then f has a local minimum at a.<br />

2. If d1 < 0, d2 > 0, d3 < 0, ..., then f has a local maximum at a.<br />

3. If the patterns given in the first two steps do not hold, then f has a saddle point at<br />

the point a. That is, there is neither a maximum nor a minimum at a.<br />

If dk = det(Hf(a)) = 0, then anything can happen. (The critical point may correspond to<br />

a local maximum, a local minimum, or a saddle point.)<br />

Example 7.29 (Quadratic Function, continued) Consider again
f(x,y) = 4 − x² − y² (left parts of Figures 7.4 and 7.5).
Since
∇f(x,y) = (−2x, −2y) = (0,0)  ⇒  x = y = 0,
the only critical point is the origin. Since
Hf(0,0) = [ −2 0 ; 0 −2 ]  ⇒  d1 = −2 and d2 = 4,
the second derivative test implies that f has a local maximum at the origin.



Example 7.30 (Cubic Function, continued) Consider again
f(x,y) = x³ − 3xy + y³ (right parts of Figures 7.4 and 7.5).
Since
∇f(x,y) = (3x² − 3y, −3x + 3y²) = (0,0)  ⇒  (x,y) = (1,1) or (0,0),
there are two critical points to consider.
1. Let (x,y) = (1,1). Since
Hf(1,1) = [ 6 −3 ; −3 6 ]  ⇒  d1 = 6 and d2 = 27,
the second derivative test implies that f has a local minimum at (1,1).
2. Let (x,y) = (0,0). Since
Hf(0,0) = [ 0 −3 ; −3 0 ]  ⇒  d1 = 0 and d2 = −9,
the second derivative test implies that f has a saddle point at the origin.
Note that if we let y = −x, then the 1-variable function z = f(x, −x) = 3x² has a local
minimum at the origin; if we let y = x, then the 1-variable function z = f(x, x) = 2x³ − 3x²
has a local maximum at the origin.
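The principal-minor calculations in the second derivative test are easy to automate. A minimal sketch (NumPy assumed; the function name is ours), applied to the two critical points of the cubic function:

    import numpy as np

    def classify(hessian):
        """Second derivative test via determinants of the principal minors."""
        H = np.asarray(hessian, dtype=float)
        k = H.shape[0]
        d = [np.linalg.det(H[:j, :j]) for j in range(1, k + 1)]
        if abs(d[-1]) < 1e-12:
            return d, "test inconclusive (det Hf(a) = 0)"
        if all(dj > 0 for dj in d):
            return d, "local minimum"
        if all((-1) ** (j + 1) * dj > 0 for j, dj in enumerate(d)):
            return d, "local maximum"
        return d, "saddle point"

    # f(x,y) = x^3 - 3xy + y^3: Hf(x,y) = [[6x, -3], [-3, 6y]]
    print(classify([[6, -3], [-3, 6]]))   # at (1,1): d = [6, 27] -> local minimum
    print(classify([[0, -3], [-3, 0]]))   # at (0,0): d = [0, -9] -> saddle point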

7.4.3 Constrained Optimization and Lagrange Multipliers<br />

In many applications of calculus the goal is to optimize (either maximize or minimize) a<br />

real-valued function of k-variables, f(x), subject to one or more constraints:<br />

g1(x) = c1, g2(x) = c2, ... , gp(x) = cp, where the ci are constants, and p < k.<br />

Solve and substitute method. It may be possible to use the constraints to reduce the<br />

number of variables in the optimization problem, as illustrated in the following example.<br />

Example 7.31 (Constrained Minimization) Let x represent length, y represent width,<br />

and z represent height of an aquarium. Assume that the cost of the slate needed <strong>for</strong><br />

the bottom of the aquarium is 4 times the cost of the glass needed <strong>for</strong> the sides. Then<br />

4xy + 2xz + 2yz is proportional to the total cost <strong>for</strong> these materials.<br />

The problem of interest is to find positive numbers x, y, z to<br />

Minimize 4xy + 2xz + 2yz when the volume V = xyz = 16 cubic units.<br />

The volume constraint can be solved <strong>for</strong> z, and the result substituted into the 3-variable<br />

function we would like to minimize, yielding the following 2-variable problem:<br />

Minimize f(x,y) = 4xy + 32/y + 32/x over the domain where x > 0 and y > 0.


Figure 7.6: Surface (left) and contour (right) plots of f(x,y) = 4xy + 32/y + 32/x. Contours
correspond to f(x,y) = 49, 51, 53, 55, 57.

(See Figure 7.6.) Working with this function,
∇f(x,y) = ( 4y − 32/x², 4x − 32/y² )  and  Hf(x,y) = [ 64/x³ 4 ; 4 64/y³ ].
Since ∇f(x,y) = (0,0) when x = y = 2, Hf(2,2) = [ 8 4 ; 4 8 ], d1 = 8 and d2 = 48, we know
that f has a local (and global) minimum at (2,2). The minimum value is 48.
The dimensions of the aquarium should be x = 2, y = 2, z = 4.
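The same minimum can be found numerically. A sketch (an assumption, using scipy.optimize.minimize; any numerical optimizer would do) applied to the 2-variable cost function:

    from scipy.optimize import minimize

    def cost(v):
        x, y = v
        return 4*x*y + 32/y + 32/x

    res = minimize(cost, x0=[1.0, 1.0], bounds=[(1e-6, None), (1e-6, None)])
    x, y = res.x
    print(x, y, res.fun)     # approximately 2, 2, 48
    print(16 / (x * y))      # z = 16/(xy), approximately 4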

Method of Lagrange Multipliers. A general method, which does not require the “solve<br />

and substitute” approach outlined in the example above, depends on the following theorem.<br />

Theorem 7.32 (Lagrange Multipliers) Assume that the k-variable functions f and<br />

g1, g2, ..., gp have continuous first partial derivatives on the domain<br />

D = {x : g1(x) = c1, g2(x) = c2, ..., gp(x) = cp},<br />

f has an extremum on this domain at a, and that<br />

{∇g1(a), ∇g2(a), ..., ∇gp(a)} is a linearly independent set (see below).<br />

Then there exist constants λ1, ..., λp (called Lagrange multipliers) satisfying<br />

∇f(a) = λ1∇g1(a) + λ2∇g2(a) + · · · + λp∇gp(a).<br />

Note that when p = 1, the linear independence requirement in Lagrange’s theorem reduces<br />

to ∇g1(a) ≠ O. When p > 1, the requirement means that

the linear combination Σ_{i=1}^{p} ai ∇gi(a) = O only when each coefficient ai = 0.


The theorem can be used to find critical points. When p = 1, we drop the subscript on the<br />

single constraint and solve<br />

∇f(x) = λ∇g(x) and g(x) = c <strong>for</strong> x1,x2,...,xk and λ.<br />

Example 7.33 (Constrained Minimization, continued) Consider again the aquarium<br />

problem and let f(x,y,z) = 4xy + 2xz + 2yz and g(x,y,z) = xyz. Then<br />

∇f(x,y,z) = (4y + 2z,4x + 2z,2x + 2y) and ∇g(x,y,z) = (yz,xz,xy).<br />

Critical points need to satisfy the following system of 4 equations in 4 unknowns:<br />

4y + 2z = λyz, 4x + 2z = λxz, 2x + 2y = λxy and xyz = 16.<br />

Since x, y and z must be positive, we can solve the first 3 equations for λ:
λ = 4/z + 2/y,  λ = 4/z + 2/x,  λ = 2/y + 2/x.

From equations 1 and 2 we get y = x, and from equations 1 and 3 we get z = 2x. Now,<br />

16 = xyz = x(x)(2x) ⇒ x = 2, y = 2, z = 4.<br />

Finally, λ = 2 as well. (Note that λ can be interpreted as the rate of change of cost with<br />

respect to a unit change in the volume constraint.)<br />
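The critical-point equations from Lagrange's theorem can also be handed to a computer algebra system. A sketch assuming SymPy (not part of the original notes):

    import sympy as sp

    x, y, z, lam = sp.symbols('x y z lam', positive=True)
    f = 4*x*y + 2*x*z + 2*y*z
    g = x*y*z

    equations = [sp.Eq(sp.diff(f, v), lam * sp.diff(g, v)) for v in (x, y, z)]
    equations.append(sp.Eq(g, 16))

    print(sp.solve(equations, [x, y, z, lam], dict=True))
    # expected: [{x: 2, y: 2, z: 4, lam: 2}]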

Lagrangian function. The method outlined in Lagrange’s theorem can be coded using<br />

the following (p + k)-variable Lagrangian function:<br />

L(λ1, ..., λp, x1, ..., xk) = f(x1, ..., xk) − Σ_{i=1}^{p} λi ( gi(x1, ..., xk) − ci ).

For simplicity, we use the notation L(λ,x).<br />

Critical points of L satisfy the result of Lagrange’s theorem. A special <strong>for</strong>m of the second<br />

derivative test allows us to determine in some cases whether critical points are local extrema.<br />

Second derivative test <strong>for</strong> constrained local extrema. Let a be a constrained critical<br />

point of f subject to the constraints gi(x) = ci (<strong>for</strong> all i), and assume that all first- and<br />

second-partial derivatives are continuous on the constrained domain.<br />

Let d1, d2, ..., dp+k be the determinants of the principal minors of HL(λ,a) and consider<br />

the subsequence of (k − p) values:<br />

(−1) p d2p+1, (−1) p d2p+2, ..., (−1) p dp+k.<br />

Assume that dp+k = det(HL(λ,a)) ≠ 0. Then

1. If all terms in the subsequence are positive, then f has a constrained local<br />

minimum value at a.<br />



2. If the first term in the subsequence is negative, and the signs alternate,<br />

then f has a constrained local maximum value at a.<br />

3. If the patterns given in the first two steps do not hold, then f has a constrained<br />

saddle point at a.<br />

Note that if dp+k = det(HL(λ,a)) = 0, then anything can happen.<br />

Example 7.34 (Constrained Minimization, continued) Returning to the problem
above,
L(λ,x,y,z) = 4xy + 2xz + 2yz − λ(xyz − 16),
∇L(λ,x,y,z) = ( 16 − xyz, 4y + 2z − yzλ, 4x + 2z − xzλ, 2x + 2y − xyλ ), and
HL(λ,x,y,z) = [ 0 −yz −xz −xy ; −yz 0 4−zλ 2−yλ ; −xz 4−zλ 0 2−xλ ; −xy 2−yλ 2−xλ 0 ].
If λ = 2, x = 2, y = 2, z = 4, then d1 = 0, d2 = −64, d3 = −512, d4 = −768.
We need to consider the subsequence −d3 = 512 and −d4 = 768. Since both terms are
positive, f has a constrained local minimum when x = 2, y = 2, z = 4.

Example 7.35 (k = 3, p = 2) Consider the following minimization problem:
Minimize x² + y² + z² subject to x + y + z = 12 and x + 2y + 3z = 18,
and let
L(λ,µ,x,y,z) = x² + y² + z² − λ(x + y + z − 12) − µ(x + 2y + 3z − 18).
Then
∇L(λ,µ,x,y,z) = ( 12 − x − y − z, 18 − x − 2y − 3z, 2x − λ − µ, 2y − λ − 2µ, 2z − λ − 3µ ),
HL(λ,µ,x,y,z) = [ 0 0 −1 −1 −1 ; 0 0 −1 −2 −3 ; −1 −1 2 0 0 ; −1 −2 0 2 0 ; −1 −3 0 0 2 ],
and d1 = d2 = d3 = 0, d4 = 1 and d5 = 12 (for all x, y, z).
Since
∇L(λ,µ,x,y,z) = O  ⇒  λ = 20, µ = −6, x = 7, y = 4, z = 1,
and (−1)² d5 = 12 > 0, there is a constrained minimum at (x,y,z) = (7,4,1).
Footnote: The minimization problem above corresponds to finding the point on the intersection of
the planes x + y + z = 12 and x + 2y + 3z = 18 which is closest to the origin.


7.4.4 Method of Steepest Descent/Ascent<br />

The first step in solving an optimization problem is to find the critical points of f by solving<br />

the derivative equation ∇f(x) = O. Since this equation may be difficult to solve, a variety<br />

of approximate methods have been developed. This section focuses on methods based on<br />

properties of directional derivatives and gradients. (See Section 7.3.5.)<br />

Method of steepest descent. Suppose that f has continuous partial derivatives in a<br />

neighborhood of a and has a local minimum at a. If x0 (an initial point) is close enough<br />

to the point a and ∇f(x0) = O, then<br />

The minimum value of Duf(x0) occurs when u = −∇f(x0)/‖∇f(x0)‖.

That is, the directional derivative is minimized in the direction of the negative of the gradient<br />

at the point x0.<br />

The idea of the method of steepest descent is to move in the direction of steepest descent<br />

as far as possible with the hope of getting closer to a. Specifically, let g(t) be the following<br />

1-variable function:<br />

<br />

g(t) = f( x0 − t ∇f(x0)/‖∇f(x0)‖ ).

Note that g(0) = f(x0). In addition, <strong>for</strong> values of t near zero, g(t) is decreasing. Let t0 > 0<br />

be the critical number of g closest to 0 and let x1 = x0 − t0 (∇f(x0)/‖∇f(x0)‖).

In general, the process is repeated, defining<br />

xi+1 = xi − ti (∇f(xi)/‖∇f(xi)‖),
until ti is close enough to zero.

Example 7.36 (Constrained Minimization, continued) The method is illustrated
using a familiar situation:
Minimize f(x,y) = 4xy + 32/y + 32/x over the domain where x > 0 and y > 0.
(See Figure 7.6.)

The following table shows the results of the first eight steps of the algorithm, using (3.2,4.1)<br />

as the initial point, and using four decimal places of accuracy:<br />

i 0 1 2 3 4 5 6 7<br />

ti 2.0973 0.9084 0.1588 0.0458 0.0038 0.0007 0.0000 0.0000<br />

xi 3.2 1.5789 2.1553 2.0325 2.0034 2.0005 2.0000 2.0000<br />

yi 4.1 2.7694 2.0672 1.9664 2.0019 1.9995 2.0000 2.0000<br />

Both t6 and t7 are virtually zero (and x6 and x7 are virtually equal), indicating that the<br />

algorithm has identified (2,2) as a local minimum point.<br />

Note that Newton’s method (Section 4.4.5) was used to find each ti.<br />
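A compact sketch of the descent iteration described above (NumPy assumed; not part of the original notes). For simplicity the inner one-dimensional problem is handled by a crude grid search over t rather than by Newton's method, so the iterates differ slightly from the table, but they approach the same point (2, 2).

    import numpy as np

    def f(v):
        x, y = v
        return 4*x*y + 32/y + 32/x

    def grad_f(v):
        x, y = v
        return np.array([4*y - 32/x**2, 4*x - 32/y**2])

    def line_value(x, u, t):
        p = x - t*u
        if p[0] <= 0 or p[1] <= 0:
            return np.inf              # stay inside the domain x > 0, y > 0
        return f(p)

    x = np.array([3.2, 4.1])
    for i in range(50):
        g = grad_f(x)
        u = g / np.linalg.norm(g)      # unit gradient; step in the opposite direction
        ts = np.linspace(0.0, 3.0, 3001)
        t_best = ts[int(np.argmin([line_value(x, u, t) for t in ts]))]
        if t_best < 1e-8:
            break
        x = x - t_best * u

    print(x, f(x))                     # approaches (2, 2) with minimum value 48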


Figure 7.7: Weight change in grams (vertical axis) versus dosage level in 100 mg/kg/day<br />

(horizontal axis) for ten animals. The gray curve is y = −0.49x² + 1.98x + 6.70.

Method of steepest ascent. To approximate a local maximum, the function
g(t) = f( xi + t ∇f(xi)/‖∇f(xi)‖ )
is used to find ti, and the formula
xi+1 = xi + ti (∇f(xi)/‖∇f(xi)‖)
is used to find the next approximate point.

7.5 Applications in Statistics

Optimization is an essential part of many problems in statistics. In this section, optimization<br />

techniques are used to solve problems in least squares and maximum likelihood estimation.<br />

7.5.1 Least Squares Estimation in Quadratic Models<br />

Given n data pairs (x1,y1), (x2,y2), ..., (xn,yn), whose values lie close to a quadratic curve<br />

of the <strong>for</strong>m<br />

y = ax² + bx + c,

the method of least squares, developed by Legendre in the early 1800’s, allows us to estimate<br />

the unknown parameters: Find the values of a, b, and c which minimize the sum of squared<br />

deviations of observed y-coordinates from the values expected if the data points fit the curve<br />

exactly. That is, minimize the following 3-variable function:<br />

f(a,b,c) = Σ_{i=1}^{n} ( yi − (a xi² + b xi + c) )².

Example 7.37 (Toxicology Study) (Fine & Bosch, JASA 94:375-382, 2000) As part of a<br />

study to determine the adverse effects of a proposed drug <strong>for</strong> the treatment of tuberculosis,<br />

female rats were given the drug <strong>for</strong> a period of 14 days at each of five dosage levels (in 100<br />

milligrams per kilogram per day).<br />



The vertical axis in Figure 7.7 is the weight change in grams (defined as the weight at the<br />

end of the period minus the weight at the beginning of the period) <strong>for</strong> ten animals in the<br />

study, and the horizontal axis shows the dose in 100 mg/kg/day.<br />

For these data, f(a,b,c) =<br />

562.63 + 703.45a + 7612.13a² − 15.1b + 2223.5ab + 172.5b² − 88.6c + 345ac + 62bc + 10c²,

∇f(a,b,c) =<br />

(703.45+15224.3a+2223.5b+345c, −15.1+2223.5a+345b+62c, −88.6+345a+62b+20c)<br />

and ∇f(a,b,c) = O when a = −0.49, b = 1.98, and c = 6.70.<br />

Since
Hf = [ 15224.3 2223.5 345 ; 2223.5 345 62 ; 345 62 20 ],  d1 = 15224.3,  d2 ≈ 3.1 × 10⁵,  d3 ≈ 1.7 × 10⁶,
and f is a quadratic function, f is minimized when a = −0.49, b = 1.98, and c = 6.70. The
least squares quadratic fit formula is shown in Figure 7.7.
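Because f(a,b,c) is quadratic, setting ∇f(a,b,c) = O gives a linear system whose coefficient matrix is the Hessian shown above. A sketch (NumPy assumed; not part of the original notes):

    import numpy as np

    H = np.array([[15224.3, 2223.5, 345.0],
                  [2223.5,   345.0,  62.0],
                  [345.0,     62.0,  20.0]])
    const = np.array([703.45, -15.1, -88.6])   # constant terms of the gradient

    # Gradient = const + H @ (a, b, c); setting it to zero gives a linear system.
    a, b, c = np.linalg.solve(H, -const)
    print(a, b, c)                              # approximately -0.49, 1.98, 6.70

    # Determinants of the principal minors of the Hessian confirm a minimum.
    print([np.linalg.det(H[:j, :j]) for j in (1, 2, 3)])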

Linear algebra and least squares. A linear algebra approach to least squares analysis<br />

is discussed in Sections 9.5.5 and 9.8.1 of these notes.<br />

7.5.2 Maximum Likelihood Estimation in Multinomial Models<br />

This section generalizes ideas from Section 4.5.1.<br />

Consider the problem of estimating the proportions p1,... ,pk in a multinomial experiment.<br />

Let p = (p1,p2,... ,pk), and let xi be the observed number of occurrences of outcome i in<br />

n trials (<strong>for</strong> i = 1,2,... ,k). We will demonstrate that the sample proportions, pi = xi/n,<br />

are maximum likelihood estimates of the proportions in the multinomial experiment using<br />

constrained optimization techniques.

The likelihood function, Lik(p), is the joint PDF with n and the xi fixed and p varying:
Lik(p) = C(n; x1, x2, ..., xk) · p1^{x1} p2^{x2} · · · pk^{xk},
where C(n; x1, x2, ..., xk) denotes the multinomial coefficient.
The log-likelihood function, ℓ(p), is the natural logarithm of the likelihood:
ℓ(p) = ln(Lik(p)) = ln C(n; x1, x2, ..., xk) + x1 ln(p1) + x2 ln(p2) + · · · + xk ln(pk).
We wish to maximize ℓ(p) subject to the constraint that the sum of the proportions is 1.
The Lagrangian function is
L(λ,p) = ℓ(p) − λ(p1 + p2 + · · · + pk − 1),


and the gradient of the Lagrangian function is<br />

∇L(λ,p) = (1 − p1 − p2 − · · · − pk,x1/p1 − λ,x2/p2 − λ,... ,xk/pk − λ).<br />

If all quantities are positive, then ∇L(λ,p) = O implies<br />

1 = Σi pi  and  pi = xi/λ for each i.

These equations imply that (n,x1/n,x2/n,...,xk/n) is a critical point.<br />

Example 7.38 (k = 3) To check that the result is a maximum, we consider the special
case when
k = 3, n = 40, x1 = 18, x2 = 12 and x3 = 10.
Then
HL = [ 0 −1 −1 −1 ; −1 −800/9 0 0 ; −1 0 −400/3 0 ; −1 0 0 −160 ],
d1 = 0,  d2 = −1,  d3 = 2000/9,  d4 = −1280000/27.
Since −d3 < 0 and −d4 > 0, the second derivative test for constrained local extrema implies
that there is a constrained local maximum when p1 = 0.45, p2 = 0.30, and p3 = 0.25 (and
λ = 40). There is both a local and global maximum at (p1, p2, p3) = (0.45, 0.30, 0.25).
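A simple numerical sanity check of this result (NumPy assumed; not part of the original notes): the log-likelihood, up to the additive constant ln C(n; x1,...,xk), is largest at the sample proportions.

    import numpy as np

    x = np.array([18.0, 12.0, 10.0])
    n = x.sum()                     # 40
    p_hat = x / n                   # (0.45, 0.30, 0.25)

    def loglik(p):
        """x1 ln p1 + ... + xk ln pk (the constant term does not affect the maximizer)."""
        p = np.asarray(p, dtype=float)
        return float(np.sum(x * np.log(p)))

    print(p_hat, loglik(p_hat))

    # Any other probability vector on the simplex gives a smaller value.
    rng = np.random.default_rng(0)
    for _ in range(5):
        q = rng.dirichlet([1.0, 1.0, 1.0])
        print(q, loglik(q) <= loglik(p_hat))    # True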

7.6 Multiple Integration<br />

This section generalizes ideas from Section 4.7.<br />

7.6.1 Riemann Sums and Double Integrals over Rectangles<br />

Suppose that the 2-variable function f(x,y) is continuous on the rectangle<br />

R = [a,b] × [c,d] = {(x,y) : a ≤ x ≤ b, c ≤ y ≤ d} ⊂ R 2<br />

and let m and n be positive integers. Further, let △x = (b − a)/m and △y = (d − c)/n,

1. xi = a + i△x <strong>for</strong> i = 0,1,... ,m, x ∗ i = (xi−1 + xi)/2 <strong>for</strong> i = 1,2,... ,m,<br />

2. yj = c + j△y <strong>for</strong> j = 0,1,... ,n, and y ∗ j = (yj−1 + yj)/2 <strong>for</strong> j = 1,2,... ,n.<br />

Then the (m + 1) numbers<br />

a = x0 < x1 < x2 < · · · < xm = b<br />

partition [a,b] into m subintervals of equal length with midpoints x∗ i , the (n + 1) numbers<br />

c = y0 < y1 < y2 < · · · < yn = d<br />


Figure 7.8: Plot of z = 3x² + y + 2 on [0,4] × [0,6] (left) and approximating boxes (right).

partition [c,d] into n subintervals of equal length with midpoints y*j, and the mn subrectangles
Ri,j = [xi−1, xi] × [yj−1, yj]
form an m-by-n partition of the rectangle R. Each subrectangle has area △A = (△x)(△y).

The double integral of f over R is<br />

<br />

∫∫_R f(x,y) dA = lim_{m,n→∞} Σ_{i=1}^{m} Σ_{j=1}^{n} f(x*i, y*j) △A.

In this <strong>for</strong>mula, f(x,y) is the integrand, and the sum on the right is called a Riemann sum.<br />

Continuity and Integrability. f(x,y) is said to be integrable on the rectangle R if<br />

the limit above exists. Continuous functions are always integrable. Further, it is possible<br />

to show that the limit exists when the values of f are bounded on R and the set of<br />

discontinuities has area zero.<br />

Volumes. If f is nonnegative on R, then the double integral over R can be interpreted<br />

as the volume under the surface z = f(x,y) and above the xy-plane <strong>for</strong> (x,y) ∈ R.<br />

Example 7.39 (Computing Volume) Let f(x,y) = 3x² + y + 2 on R = [0,4] × [0,6].

To estimate the volume under the surface z = f(x,y) and above the xy-plane over R (left<br />

part of Figure 7.8), we let m = 4, n = 6 and ∆x = ∆y = 1. The 24 midpoints are<br />

(x ∗ ,y ∗ ), where x ∗ = 0.5,1.5,2.5,3.5 and y ∗ = 0.5,1.5,... ,5.5,<br />

and the value of the Riemann sum is 498.0. This value is the sum of the volumes of the 24<br />

rectangular boxes shown in the right part of Figure 7.8.<br />

The following list of approximate values suggests that the limit is 504.<br />

m 4 8 16 32 64 128 256<br />

n 6 12 24 48 96 192 384<br />

Sum 498.0 502.5 503.625 503.906 503.977 503.994 503.999<br />
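The table of Riemann sums is straightforward to reproduce. A sketch of the midpoint rule on the rectangle [0,4] × [0,6] (NumPy assumed; the function name is ours):

    import numpy as np

    def midpoint_double_sum(f, a, b, c, d, m, n):
        """Midpoint Riemann sum for f over [a,b] x [c,d] with an m-by-n partition."""
        dx, dy = (b - a) / m, (d - c) / n
        xs = a + dx * (np.arange(m) + 0.5)      # midpoints x*_i
        ys = c + dy * (np.arange(n) + 0.5)      # midpoints y*_j
        X, Y = np.meshgrid(xs, ys)
        return float(np.sum(f(X, Y)) * dx * dy)

    f = lambda x, y: 3*x**2 + y + 2
    for m, n in [(4, 6), (8, 12), (16, 24), (32, 48)]:
        print(m, n, midpoint_double_sum(f, 0, 4, 0, 6, m, n))
    # 498.0, 502.5, 503.625, 503.906..., approaching 504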


Figure 7.9: Area slices of z = 3x² + y + 2 on [0,4] × [0,6] for fixed x = 0, 0.5, ..., 4 (left)

and <strong>for</strong> fixed y = 0,0.5,... ,6 (right).<br />

Average values. The quantity
( 1/((b − a)(d − c)) ) ∫∫_R f(x,y) dA
can be interpreted as the average value of f(x,y) on R.

To see this, let f̄ = (1/(mn)) Σ_{i,j} f(x*i, y*j) be the sample average, and let A = (b − a)(d − c) be
the area of the rectangle. Then
f̄ = (1/A) Σ_{i,j} f(x*i, y*j) △A   ⇒   lim_{m,n→∞} f̄ = (1/A) ∫∫_R f(x,y) dA.

For example, the list of approximations in the previous example suggests that the average
value of f(x,y) = 3x² + y + 2 on [0,4] × [0,6] is 504/24 = 21.

7.6.2 Iterated Integrals over Rectangles<br />

Suppose that f is a continuous, 2-variable function on the rectangle R = [a,b] × [c,d].<br />

We use the notation
∫_c^d f(x,y) dy
to mean that x is held fixed and f is integrated as y varies from c to d. Similarly, we use
the notation
∫_a^b f(x,y) dx
to mean that y is held fixed and f is integrated as x varies from a to b.
If f is nonnegative on R, then the results can be interpreted as areas.

Example 7.40 (Area Slices) Consider again f(x,y) = 3x² + y + 2 on [0,4] × [0,6].



The left part of Figure 7.9 shows area slices related to the partial functions of f when
x = 0, 0.5, 1, ..., 4. The area of each slice as a function of x is
Ax(x) = ∫_0^6 (3x² + y + 2) dy = [ 3x²y + (1/2)y² + 2y ]_0^6 = 18x² + 30,
and the areas of the displayed slices are 30, 34.5, 48, ..., 318.
The right part of Figure 7.9 shows area slices related to the partial functions of f when
y = 0, 0.5, 1, ..., 6. The area of each slice as a function of y is
Ay(y) = ∫_0^4 (3x² + y + 2) dx = [ x³ + yx + 2x ]_0^4 = 4y + 72,
and the areas of the displayed slices are 72, 74, 76, ..., 96.

Cavalieri’s principle. Continuing with the area application, Cavalieri’s principle (also<br />

called the method of slicing) says that the volume under the surface can be obtained by<br />

integrating the area function:<br />

Volume = ∫_c^d Ay(y) dy = ∫_c^d ∫_a^b f(x,y) dx dy = ∫_a^b Ax(x) dx = ∫_a^b ∫_c^d f(x,y) dy dx.

The right most integrals on each line above are examples of iterated integrals. In each case,<br />

the inside integral is evaluated first, followed by the outside integral.<br />

Example 7.41 (Cavalieri's Principle) Continuing with the example above, since
∫_0^4 (18x² + 30) dx = [ 6x³ + 30x ]_0^4 = 504  and  ∫_0^6 (4y + 72) dy = [ 2y² + 72y ]_0^6 = 504,
we know that the volume of the solid shown in the left part of Figure 7.8 is 504 cubic units.

Double versus iterated integrals. Fubini’s theorem says that we can compute double<br />

integrals by computing iterated integrals, where the iteration can be done in either order.<br />

Theorem 7.42 (Fubini’s Theorem) Suppose that f is a continuous, 2-variable function<br />

on the rectangle R = [a,b] × [c,d]. Then<br />

<br />

∫∫_R f(x,y) dA = ∫_c^d ∫_a^b f(x,y) dx dy = ∫_a^b ∫_c^d f(x,y) dy dx.

Note that Fubini’s theorem remains true if f is a bounded function on R satisfying (1) the<br />

set of discontinuities of f has area 0, and (2) each slice (<strong>for</strong> fixed x or <strong>for</strong> fixed y) intersects<br />

the set of discontinuities in at most a finite number of points.<br />
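Fubini's theorem also justifies checking such computations with a numerical integrator. A sketch using scipy.integrate.dblquad (an assumption, not part of the original notes; note that dblquad takes the integrand as func(y, x), with the x-limits listed first):

    from scipy.integrate import dblquad

    # Examples 7.39/7.41: 3x^2 + y + 2 over [0,4] x [0,6]
    value, err = dblquad(lambda y, x: 3*x**2 + y + 2, 0, 4, 0, 6)
    print(value)      # approximately 504

    # Example 7.43 (below): 10/x^2 - 90/(x + 2y)^2 over [1,2] x [0,2]
    value2, err2 = dblquad(lambda y, x: 10/x**2 - 90/(x + 2*y)**2, 1, 2, 0, 2)
    print(value2)     # approximately -12.9871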


Figure 7.10: Plot of z = 10/x² − 90/(x + 2y)² on [1,2] × [0,2] (left plot) and function domain

with y = x superimposed (right plot).<br />

Example 7.43 (Difference in Volumes) Let
f(x,y) = 10/x² − 90/(x + 2y)²  on R = [1,2] × [0,2],
as illustrated in Figure 7.10.
By Fubini's theorem,
∫∫_R f(x,y) dA = ∫_1^2 ∫_0^2 f(x,y) dy dx
= ∫_1^2 ( 20/x² − 45/x + 45/(4 + x) ) dx
= [ −20/x − 45 ln(x) + 45 ln(4 + x) ]_1^2 ≈ −12.9871.

Note that, given (x,y) ∈ R, f(x,y) ≥ 0 when y ≥ x and f(x,y) ≤ 0 when y ≤ x. Thus, this<br />

answer can be interpreted as the difference between (1) the volume of the solid bounded<br />

by z = f(x,y) and the xy-plane <strong>for</strong> (x,y) ∈ R with y ≥ x and (2) the volume of the solid<br />

bounded by z = f(x,y) and the xy-plane <strong>for</strong> (x,y) ∈ R with y ≤ x.<br />

7.6.3 Properties of Double Integrals<br />

Properties of double integrals generalize those for single integrals from Section 4.7.2.
Suppose that f(x,y) and g(x,y) are integrable functions on R, and let c be a constant. Then properties of limits imply:
1. Constant functions: ∬_R c dA = c × (Area of R).
2. Sums and differences: ∬_R (f(x,y) ± g(x,y)) dA = ∬_R f(x,y) dA ± ∬_R g(x,y) dA.
3. Constant multiples: ∬_R c f(x,y) dA = c ∬_R f(x,y) dA.
4. Null domain: If the area of R is 0, then ∬_R f(x,y) dA = 0.


y<br />

2<br />

1<br />

0.5 1 x<br />

1<br />

0.5<br />

y<br />

0.5 1 x<br />

Figure 7.11: Domains of type 1 (left) and of both types (right).<br />

5. Ordering: If f(x,y) ≤ g(x,y) for all (x,y) ∈ R, then ∬_R f(x,y) dA ≤ ∬_R g(x,y) dA.
6. Absolute value: If |f| is integrable, then |∬_R f(x,y) dA| ≤ ∬_R |f(x,y)| dA.

7.6.4 Double and Iterated Integrals over General Domains<br />

Double and iterated integrals can be evaluated over more general domains. Specifically,
1. Type 1: Let D = {(x,y) : a ≤ x ≤ b, ℓ(x) ≤ y ≤ u(x)}, where ℓ(x) and u(x) are continuous on [a,b], and let f be a continuous, 2-variable function on D. Then
∬_D f(x,y) dA = ∫_a^b ∫_{ℓ(x)}^{u(x)} f(x,y) dy dx.
2. Type 2: Let D = {(x,y) : c ≤ y ≤ d, ℓ(y) ≤ x ≤ u(y)}, where ℓ(y) and u(y) are continuous on [c,d], and let f be a continuous, 2-variable function on D. Then
∬_D f(x,y) dA = ∫_c^d ∫_{ℓ(y)}^{u(y)} f(x,y) dx dy.
The left part of Figure 7.11 illustrates a domain of type 1:
Dleft = {(x,y) : 0 ≤ x ≤ 1, 1 − x ≤ y ≤ 1 + x}.
The right part of the figure illustrates a domain of both types:
Dright = {(x,y) : 0 ≤ x ≤ 1, x² ≤ y ≤ x} = {(x,y) : 0 ≤ y ≤ 1, y ≤ x ≤ √y}.

Example 7.44 (Figure 7.11, Left Part) To illustrate computations, we compute the double integral of f(x,y) = 70xy² over the domain shown in the left part of Figure 7.11:
∬_{Dleft} 70xy² dA = ∫_0^1 ∫_{1−x}^{1+x} 70xy² dy dx = ∫_0^1 (140x² + 140x⁴/3) dx = 56.
Note that Dleft can be written as the union of two type 2 domains whose intersection is a set with area 0, and that the double integral can be computed as the sum of the values of two iterated integrals. Specifically,
Dleft = {(x,y) : 0 ≤ y ≤ 1, 1 − y ≤ x ≤ 1} ∪ {(x,y) : 1 ≤ y ≤ 2, y − 1 ≤ x ≤ 1}
(union of triangular regions with intersection the segment from (0,1) to (1,1)), and
∬_{Dleft} 70xy² dA = ∫_0^1 ∫_{1−y}^1 70xy² dx dy + ∫_1^2 ∫_{y−1}^1 70xy² dx dy = 21/2 + 91/2 = 56.
Example 7.45 (Difference in Volumes, continued) Consider again
f(x,y) = 10/x² − 90/(x + 2y)² on R = [1,2] × [0,2] (Figure 7.10).

The rectangle R can be written as the union of two type 1 domains whose intersection is<br />

the segment from (1,1) to (2,2):<br />

D1 = R ∩ {(x,y) : y ≥ x} and D2 = R ∩ {(x,y) : y ≤ x}.
Since
∬_{D1} f(x,y) dA = ∫_1^2 ∫_x^2 f(x,y) dy dx = ∫_1^2 (20/x² − 25/x + 45/(4 + x)) dx ≈ 0.8758
and
∬_{D2} f(x,y) dA = ∫_1^2 ∫_0^x f(x,y) dy dx = ∫_1^2 (−20/x) dx = [−20 ln(x)]_1^2 ≈ −13.8629,
the double integral ∬_R f(x,y) dA ≈ −12.9871, confirming the result obtained earlier.
Further, the total volume enclosed by
z = f(x,y), the xy-plane, x = 1, x = 2, y = 0 and y = 2
is approximately 0.8758 + 13.8629 = 14.7387.
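As a numerical cross-check of the three values above, iterated integrals with variable limits can be evaluated with a routine such as scipy.integrate.dblquad. This is only an illustrative sketch, assuming SciPy is available; it is not part of the original computation.

```python
from scipy.integrate import dblquad

# f(x, y) = 10/x^2 - 90/(x + 2y)^2; dblquad expects the integrand as g(y, x),
# with the inner variable (y) listed first.
f = lambda y, x: 10 / x**2 - 90 / (x + 2 * y)**2

whole, _ = dblquad(f, 1, 2, lambda x: 0, lambda x: 2)   # over R = [1,2] x [0,2]
upper, _ = dblquad(f, 1, 2, lambda x: x, lambda x: 2)   # over D1 (y >= x)
lower, _ = dblquad(f, 1, 2, lambda x: 0, lambda x: x)   # over D2 (y <= x)

print(round(whole, 4), round(upper, 4), round(lower, 4))
# approximately -12.9871, 0.8758, -13.8629
```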

Average values over general domains. If f is integrable over the general domain D with positive area A(D), then the average value of f over D is
Average of f over D = (1/A(D)) ∬_D f(x,y) dA.

Example 7.46 (Figure 7.11, Right Part) To illustrate computations, we find the<br />

average value of f(x,y) = 70xy 2 over the domain shown in the right part of Figure 7.11,<br />

and view the domain as a domain of type 1.<br />

The double integral of f over Dright is
∬_{Dright} 70xy² dA = ∫_0^1 ∫_{x²}^x 70xy² dy dx = ∫_0^1 (70x⁴/3 − 70x⁷/3) dx = 7/4.
The area of Dright is the double integral of the function 1 (or, equivalently, the single integral over [0,1] of the difference between the value of y on the upper curve and the value of y on the lower curve):
∬_{Dright} 1 dA = ∫_0^1 ∫_{x²}^x 1 dy dx = ∫_0^1 (x − x²) dx = 1/6.
Finally, the average value is (7/4)/(1/6) = 21/2.
7.6.5 Triple and Iterated Integrals over Boxes

Suppose that the 3-variable function f(x,y,z) is continuous on the box<br />

B = [a,b] × [c,d] × [p,q] = {(x,y,z) : a ≤ x ≤ b, c ≤ y ≤ d, p ≤ z ≤ q} ⊂ R 3<br />

and let ℓ, m and n be positive integers. Further, let ∆x = (b − a)/ℓ, ∆y = (d − c)/m and ∆z = (q − p)/n,

1. xi = a + i∆x <strong>for</strong> i = 0,1,... ,ℓ, x ∗ i = (xi−1 + xi)/2 <strong>for</strong> i = 1,2,... ,ℓ,<br />

2. yj = c + j∆y <strong>for</strong> j = 0,1,... ,m, y ∗ j = (yj−1 + yj)/2 <strong>for</strong> j = 1,2,... ,m,<br />

3. zk = p + k∆z <strong>for</strong> k = 0,1,... ,n, and z ∗ k = (zk−1 + zk)/2 <strong>for</strong> k = 1,2,... ,n.<br />

Then the (ℓ + 1) numbers<br />

a = x0 < x1 < · · · < xℓ = b<br />

partition [a,b] into ℓ subintervals of equal length with midpoints x∗ i , the (m + 1) numbers<br />

c = y0 < y1 < · · · < ym = d<br />

partition [c,d] into m subintervals of equal length with midpoints y∗ j , the (n + 1) numbers<br />

p = z0 < z1 < · · · < zn = q<br />

partition [p,q] into n subintervals of equal length with midpoints z∗ k , and the ℓmn subboxes<br />

Bi,j,k = [xi−1,xi] × [yj−1,yj] × [zk−1,zk] (i = 1,... ,ℓ; j = 1,... ,m; k = 1,... ,n)<br />

<strong>for</strong>m an ℓ-by-m-by-n partition of B. Each subbox has volume △V = (△x)(△y)(△z).<br />

The triple integral of f over B is
∭_B f(x,y,z) dV = lim_{ℓ,m,n→∞} Σ_{i=1}^{ℓ} Σ_{j=1}^{m} Σ_{k=1}^{n} f(x*_i, y*_j, z*_k) △V.
In this formula, f(x,y,z) is the integrand and the sum on the right is a Riemann sum.



Continuity and integrability. f(x,y,z) is said to be integrable on the box B if the limit<br />

above exists. Continuous functions are always integrable. Further, it is possible to show<br />

that the limit above exists if the values of f are bounded on B and the set of discontinuities<br />

of f on B has volume zero.<br />

Iterated integrals and Fubini’s theorem. A generalization of Fubini’s theorem (p. 154)<br />

states that if f is continuous on B, then the triple integral can be computed using iterated<br />

integrals, where the iteration can be done in any order. In particular,<br />

∭_B f(x,y,z) dV = ∫_a^b ∫_c^d ∫_p^q f(x,y,z) dz dy dx.
(There are a total of six orders.)

More generally, if f is a bounded function on B satisfying (1) the set of discontinuities of<br />

f has volume 0, and (2) each line parallel to one of the coordinate axes intersects the set<br />

of discontinuities of f in at most a finite number of points, then the triple integral can be<br />

computed using iterated integrals in any order.<br />

Average values. The quantity
(1/((b − a)(d − c)(q − p))) ∭_B f(x,y,z) dV
can be interpreted as the average value of f on B.
To see this, let
f̄ = (1/(ℓmn)) Σ_{i,j,k} f(x*_i, y*_j, z*_k)
be the sample average, and let V = (b − a)(d − c)(q − p) be the volume of the box. Then
f̄ = (1/V) Σ_{i,j,k} f(x*_i, y*_j, z*_k) (V/(ℓmn)) = (1/V) Σ_{i,j,k} f(x*_i, y*_j, z*_k) △V
⇒ lim_{ℓ,m,n→∞} f̄ = (1/V) ∭_B f(x,y,z) dV.

Example 7.47 (Average Value) To illustrate computations, we find the average of
f(x,y,z) = 12x²(y − z) on B = [−1,1] × [−1,2] × [0,3].
Using the generalization of Fubini’s theorem,
∭_B f(x,y,z) dV = ∫_{−1}^1 ∫_{−1}^2 ∫_0^3 12x²(y − z) dz dy dx
= ∫_{−1}^1 ∫_{−1}^2 (−54x² + 36x²y) dy dx
= ∫_{−1}^1 (−108x²) dx = −72.
Since the volume of B is 18, the average of f on B is −72/18 = −4.
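A symbolic check of this example, assuming the SymPy library, evaluates the iterated integral in the order dz, dy, dx and then divides by the volume of the box.

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
f = 12 * x**2 * (y - z)

# Iterated integration, innermost variable first (any order gives the same value).
triple = sp.integrate(f, (z, 0, 3), (y, -1, 2), (x, -1, 1))
volume = (1 - (-1)) * (2 - (-1)) * (3 - 0)

print(triple, triple / volume)   # -72 and -4
```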


6<br />

z<br />

4<br />

2<br />

0<br />

0<br />

1<br />

x<br />

0<br />

2<br />

1<br />

0.5<br />

y<br />

25<br />

z<br />

0<br />

0<br />

5<br />

y<br />

10 0<br />

Figure 7.12: Domains of integration <strong>for</strong> triple integral examples.<br />

7.6.6 Triple Integrals over General Domains<br />

Triple integrals can also be computed over more general domains. There are six basic types,<br />

which depend on the order of integration.<br />

One of the six types can be defined as follows:
W = {(x,y,z) : a ≤ x ≤ b, ℓ1(x) ≤ y ≤ u1(x), ℓ2(x,y) ≤ z ≤ u2(x,y)},
where
1. ℓ1(x) and u1(x) are continuous on [a,b] and
2. ℓ2(x,y) and u2(x,y) are continuous on the 2-dimensional domain {(x,y) : a ≤ x ≤ b, ℓ1(x) ≤ y ≤ u1(x)}.
If f(x,y,z) is continuous on W, then
∭_W f(x,y,z) dV = ∫_a^b ∫_{ℓ1(x)}^{u1(x)} ∫_{ℓ2(x,y)}^{u2(x,y)} f(x,y,z) dz dy dx.

Example 7.48 (Figure 7.12, Left Part) Let f(x,y,z) = xy on
W = {(x,y,z) : 0 ≤ x ≤ 2, 0 ≤ y ≤ 1, 0 ≤ z ≤ 4 + x + y}
(left part of Figure 7.12).
Then, the triple integral of f over W is
∭_W f dV = ∫_0^2 ∫_0^1 ∫_0^{4+x+y} xy dz dy dx = ∫_0^2 ∫_0^1 (4xy + x²y + xy²) dy dx = ∫_0^2 (7x/3 + x²/2) dx = 6.


Average values over general domains. If f is integrable over the general domain W with positive volume V(W), then the average value of f over W is
Average value of f over W = (1/V(W)) ∭_W f(x,y,z) dV.

Example 7.49 (Figure 7.12, Right Part) Let f(x,y,z) = 3x on
W = {(x,y,z) : 0 ≤ y ≤ 10, y ≤ x ≤ 20 − y, 0 ≤ z ≤ 25 − y}
(right part of Figure 7.12).
The triple integral of f over W is
∭_W f dV = ∫_0^{10} ∫_y^{20−y} ∫_0^{25−y} 3x dz dx dy = ∫_0^{10} ∫_y^{20−y} (75x − 3xy) dx dy = ∫_0^{10} (15000 − 2100y + 60y²) dy = 65000.
The volume of W is the triple integral of the function 1 (equivalently, the volume is the double integral of 25 − y over {(x,y) : 0 ≤ y ≤ 10, y ≤ x ≤ 20 − y}):
V(W) = ∫_0^{10} ∫_y^{20−y} ∫_0^{25−y} 1 dz dx dy = ∫_0^{10} ∫_y^{20−y} (25 − y) dx dy = ∫_0^{10} (500 − 70y + 2y²) dy = 6500/3.
Finally, the average is (65000)/(6500/3) = 30.
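Averages over general domains can also be estimated by simulation: sample points uniformly from a bounding box, keep those that land in W, and average f over the kept points. The sketch below is a Monte Carlo check of the value 30, assuming Python with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Sample uniformly in a bounding box of W and keep the points that fall in W.
x = rng.uniform(0, 20, N)
y = rng.uniform(0, 10, N)
z = rng.uniform(0, 25, N)
in_W = (y <= x) & (x <= 20 - y) & (z <= 25 - y)

print(np.mean(3 * x[in_W]))   # close to 30
```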

Extensions to k variables. Definitions and techniques for double and triple integrals can be extended to define integration methods for k-variable functions. The extensions are straightforward, but the computations can be quite tedious.

7.6.7 Partial Anti-Differentiation and Partial Differentiation<br />

Computing iterated integrals involves computing partial anti-derivatives. This section explores the processes of partial differentiation and partial anti-differentiation further in the 2-variable setting.

Theorem 7.50 (Fixed Endpoints) If the 2-variable function f has continuous first<br />

partial derivatives on the rectangle [a,b] × [c,d], then<br />

(d/dy) ∫_a^b f(x,y) dx = ∫_a^b fy(x,y) dx for y ∈ [c,d]
and
(d/dx) ∫_c^d f(x,y) dy = ∫_c^d fx(x,y) dy for x ∈ [a,b].


To illustrate the first equality, let f(x,y) = 4x³y² and [a,b] = [1,2]. Then
(d/dy) ∫_1^2 4x³y² dx = (d/dy)(15y²) = 30y and ∫_1^2 ∂/∂y (4x³y²) dx = ∫_1^2 8x³y dx = 30y.
The theorem can be extended to integrals with variable endpoints, using the rule for the total derivative from Section 7.3.4.

7.6.8 Change of Variables and the Jacobian<br />

The change of variables technique <strong>for</strong> multiple integrals generalizes the method of substitution<br />

<strong>for</strong> single integrals (see page 83). In this section, we introduce the technique in the<br />

2-variable setting.<br />

Consider a vector-valued function T : D ∗ ⊆ R 2 → R 2 , with <strong>for</strong>mula<br />

T(u,v) = (x(u,v),y(u,v)) <strong>for</strong> all (u,v) ∈ D ∗ ,<br />

and let D = T(D ∗ ) be the image of D ∗ . We say that the trans<strong>for</strong>mation T maps the<br />

domain D ∗ in the uv-plane to the domain D in the xy-plane.<br />

The functions x(u,v) and y(u,v) are called the coordinate functions of T.<br />

The Jacobian. Assume that each coordinate function has continuous first partial derivatives. The Jacobian of the transformation T, denoted by ∂(x,y)/∂(u,v), is defined as follows:
∂(x,y)/∂(u,v) = det[ xu  xv ; yu  yv ] = xu yv − xv yu.
(The Jacobian is the determinant of the matrix of partial derivatives.)

Example 7.51 (Linear Transformation) Let
(x,y) = T(u,v) = (au + bv, cu + dv),
where a, b, c and d are constants.
The formula for T can be written using matrix notation. Specifically,
x = T(u) = Au,
where
A = [ a  b ; c  d ],  x = [ x ; y ]  and  u = [ u ; v ].
A is the matrix of partial derivatives of T; the Jacobian of T is det(A) = ad − bc.


Figure 7.13: Unit square in uv-plane (left) and parallelogram in xy-plane (right).

Example 7.52 (Nonlinear Transformation) Let
(x,y) = T(u,v) = (√(u/v), √(uv)),
where u and v are positive.
The (simplified) matrix of partial derivatives is
[ xu  xv ; yu  yv ] = [ 1/(2√(uv))   −√u/(2v√v) ; √v/(2√u)   √u/(2√v) ],
and the (simplified) Jacobian is 1/(2v).

Stretching/shrinking factor. The absolute value of the Jacobian is the “stretching”<br />

(or “shrinking”) factor which tells us how areas in xy-space are related to areas in uv-space.<br />

For linear trans<strong>for</strong>mations, the relationship between areas is easy to state.<br />

Theorem 7.53 (Linear Transformations) Let x = T(u) = Au, where
A = [ a  b ; c  d ] and det(A) = ad − bc ≠ 0.
Then
1. T is a one-to-one and onto transformation.
2. T maps triangles to triangles and parallelograms to parallelograms in such a way that vertices are mapped to vertices.
3. If D∗ is a parallelogram in the uv-plane and D = T(D∗) in the xy-plane, then
(Area of D) = |∂(x,y)/∂(u,v)| (Area of D∗) = |det(A)| (Area of D∗).


Example 7.54 (Area of Parallelogram) The parallelogram D in the xy-plane whose corners are (0,0), (1,−1), (4,0) and (3,1) is the image of D∗ = [0,1] × [0,1] in the uv-plane under the linear transformation
(x,y) = T(u,v) = (3u + v, u − v) for (u,v) ∈ D∗,
as illustrated in Figure 7.13.
The Jacobian of T is det[ 3  1 ; 1  −1 ] = −4.
Since the area of D∗ is 1, the area of D is
(Area of D) = |−4| (Area of D∗) = 4.
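The area relationship can be verified numerically: compute |det(A)| and compare it with the area of the image of the unit square obtained from the mapped corners (shoelace formula). A small sketch, assuming NumPy is available:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, -1.0]])            # matrix of T(u, v) = (3u + v, u - v)
unit_square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])
corners = unit_square @ A.T            # images of the corners of [0,1] x [0,1]

# Shoelace formula for the area of the image parallelogram.
xs, ys = corners[:, 0], corners[:, 1]
shoelace = 0.5 * abs(np.dot(xs, np.roll(ys, -1)) - np.dot(ys, np.roll(xs, -1)))

print(abs(np.linalg.det(A)), shoelace)   # both equal 4.0
```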

Change of variables. In 2-variable substitution, we replace
dA = dx dy with dA = |∂(x,y)/∂(u,v)| du dv.
The complete formula is given in the following theorem.

Theorem 7.55 (Change of Variables) Let D and D ∗ be basic regions (of types 1 or<br />

2) in the xy- and uv-planes, respectively. Suppose that<br />

T : D ∗ ⊆ R 2 → R 2<br />

maps D∗ onto D in a one-to-one fashion, and that each coordinate function of T has<br />

continuous first partial derivatives. Further, suppose that f is a 2-variable, continuous<br />

function on D. Then<br />

∬_D f(x,y) dx dy = ∬_{D∗} f(x(u,v), y(u,v)) |∂(x,y)/∂(u,v)| du dv.

The change of variables technique is used to simplify the boundaries of integration, to<br />

simplify the integrand, or both.<br />

Example 7.56 (Simplifying Integrand) Consider evaluating the double integral
∬_D cos((x − y)/(x + y)) dA
where D is the triangular region in the xy-plane bounded by x = 0, y = 0, x + y = 1.
The integrand can be simplified by using
u = x − y, v = x + y ⟺ x = (u + v)/2, y = (−u + v)/2.
Since
(1) x = 0 ⇒ u = −v, (2) y = 0 ⇒ u = v and (3) x + y = 1 ⇒ v = 1,
the domain D∗ = {(u,v) : 0 ≤ v ≤ 1, −v ≤ u ≤ v}. The Jacobian is 1/2.
The integral becomes
∬_D cos((x − y)/(x + y)) dA = ∫_0^1 ∫_{−v}^v cos(u/v) (1/2) du dv = ∫_0^1 v sin(1) dv = (1/2) sin(1).
(Note that, by substitution, ∫ cos(u/v) du = v sin(u/v) + C.)

Example 7.57 (Simplifying Boundary) Consider evaluating the double integral
∬_D xy dA
where D is the region bounded by xy = 1, xy = 4, xy² = 1, xy² = 4.
The boundary can be simplified by using
u = xy, v = xy² ⟺ x = u²/v, y = v/u.
The region D∗ = [1,4] × [1,4]. After simplification, the Jacobian is 1/v.
The integral becomes
∬_D xy dA = ∫_1^4 ∫_1^4 u (1/v) du dv = ∫_1^4 15/(2v) dv = 15 ln(4)/2.
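A symbolic check of this change of variables, assuming SymPy is available: compute the Jacobian of (x,y) = (u²/v, v/u) and evaluate the transformed integral.

```python
import sympy as sp

u, v = sp.symbols('u v', positive=True)
x = u**2 / v
y = v / u

# Jacobian of the transformation (x, y) = (u^2/v, v/u).
J = sp.simplify(sp.Matrix([[sp.diff(x, u), sp.diff(x, v)],
                           [sp.diff(y, u), sp.diff(y, v)]]).det())
print(J)                                         # 1/v

# Transformed integral: integrand xy = u, times |Jacobian| = 1/v, over [1,4] x [1,4].
val = sp.integrate(u * (1 / v), (u, 1, 4), (v, 1, 4))
print(sp.simplify(val))                          # 15*log(4)/2
```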

Remarks. As stated above, the absolute value of the Jacobian of a linear trans<strong>for</strong>mation<br />

is related to a change in area. If T is a nonlinear trans<strong>for</strong>mation with continuous first<br />

partial derivatives, then it is approximately linear in small enough regions. Thus, <strong>for</strong> small<br />

regions, the absolute value of the Jacobian is an approximate change in area.<br />

Linear trans<strong>for</strong>mations will be studied further in Chapter 9 of these notes.<br />

Polar coordinate transformations. Consider (x,y) in polar coordinates
x = r cos(θ), y = r sin(θ),
where r = √(x² + y²) is the distance between (x,y) and (0,0), and θ is the angle measured counterclockwise from the positive x-axis to the ray from (0,0) to (x,y).
Transformations from domains in the rθ-plane to domains in the xy-plane have Jacobian
∂(x,y)/∂(r,θ) = det[ xr  xθ ; yr  yθ ] = det[ cos(θ)  −r sin(θ) ; sin(θ)  r cos(θ) ] = r cos²(θ) + r sin²(θ) = r,
since sin²(θ) + cos²(θ) = 1 for all θ.
Polar coordinate transformations are particularly useful when the domain of integration is a disk (the boundary plus the interior of a circle), or a wedge of a disk.


Example 7.58 (Quarter Disk) Consider evaluating
∬_D (x² + y²) dA
where D is the subset of the disk x² + y² ≤ 1 in the first quadrant. Then D can be described in polar coordinates as D∗ = {(r,θ) : 0 ≤ r ≤ 1, 0 ≤ θ ≤ π/2}, and
∬_D (x² + y²) dA = ∫_0^1 ∫_0^{π/2} r² r dθ dr = ∫_0^1 (π/2) r³ dr = π/8.

Application of polar transformation. An important application of the polar coordinate transformation in probability and statistics is the evaluation of the double integral
∬_D e^{−(x²+y²)} dA, where D = R².
D can be described in polar coordinates as D∗ = {(r,θ) : 0 ≤ r < ∞, 0 ≤ θ ≤ 2π} and
∬_D e^{−(x²+y²)} dA = ∫_0^∞ ∫_0^{2π} e^{−r²} r dθ dr
= π ∫_0^∞ e^{−r²} (2r) dr
= π [−e^{−r²}]_0^{r→∞}   (by substitution)
= π (lim_{r→∞}(−e^{−r²}) + 1)
= π(0 + 1) = π.
7.7 Applications in Statistics

Multiple integration methods are important in statistics, and, in particular, when working<br />

with joint density functions. Joint density functions are used to “smooth over” joint<br />

probability histograms, and as functions supporting the theory of multivariate continuous<br />

random variables (Chapter 8). This section previews some ideas.<br />

7.7.1 Computing Probabilities for Continuous Random Pairs
If the continuous random pair (X,Y) has joint density function f(x,y), then the probability that (X,Y) takes values in the domain D is the double integral of f over D:
P((X,Y) ∈ D) = ∬_D f(x,y) dA.

Example 7.59 (Joint Distribution on Rectangle) Let
f(x,y) = (1/8)(x² + y²) for (x,y) ∈ [−1,2] × [−1,1]


Figure 7.14: Volume (left) over D (right) is P(X ≥ Y) for a random pair on [−1,2] × [−1,1].

(and 0 otherwise) be the joint density function of a continuous random pair (X,Y ).<br />

To find the probability that X is greater than or equal to Y, for example, we need to compute the double integral of f over the domain D, where D is the intersection of the rectangle [−1,2] × [−1,1] with the half-plane x ≥ y:
P(X ≥ Y) = ∫_{−1}^1 ∫_y^2 (1/8)(x² + y²) dx dy = ∫_{−1}^1 (1/3 + y²/4 − y³/6) dy = 5/6.
The domain D is pictured in the right part of Figure 7.14; the probability P(X ≥ Y) is the volume of the solid shown in the left part of the figure.
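The value 5/6 can be checked by numerical integration over the part of the rectangle where x ≥ y, for example with scipy.integrate.dblquad (an assumption; any numerical integrator would serve).

```python
from scipy.integrate import dblquad

# P(X >= Y): integrate the joint density over y in [-1, 1], x in [y, 2].
# dblquad integrates func(inner, outer); here inner = x and outer = y.
density = lambda x, y: (x**2 + y**2) / 8
prob, _ = dblquad(density, -1, 1, lambda y: y, lambda y: 2)

print(prob, 5 / 6)   # both approximately 0.8333
```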

7.7.2 Contours for Independent Standard Normal Random Variables
Assume that the continuous random pair (X,Y) has joint density function
f(x,y) = (1/(2π)) exp(−(x² + y²)/2) for (x,y) ∈ R²,

where exp() is the exponential function. The left part of Figure 7.15 is a graph of the<br />

surface z = f(x,y) when (x,y) ∈ [−3,3] × [−3,3].<br />

Contours <strong>for</strong> the joint density are circles in the xy-plane centered at the origin. The right<br />

part of Figure 7.15 shows contours enclosing 5%, 20%, ..., 95% of the joint probability.<br />

Given a > 0, let
Da = {(x,y) : x² + y² ≤ a²}
be the disk of radius a centered at the origin. The polar transformation can be used to compute the probability that values of (X,Y) lie in Da. Specifically,
P(X² + Y² ≤ a²) = ∬_{Da} f(x,y) dA = ∫_0^{2π} ∫_0^a (1/(2π)) e^{−r²/2} r dr dθ = 1 − e^{−a²/2}.
(See the end of the last section for a similar computation.)


z<br />

-3<br />

0<br />

x<br />

3 -3<br />

0<br />

y<br />

3<br />

-3 -1.5 1.5 3 x<br />

-1.5<br />

Figure 7.15: Joint density (left) and contours enclosing 5%, 20%, ..., 95% of probability<br />

when X and Y are independent standard normal random variables.<br />

To find the value of a so that Da encloses a given proportion p of probability, we solve<br />

p = P(X 2 + Y 2 ≤ a 2 ) = 1 − e −a2 /2 ⇒ a =<br />

3<br />

1.5<br />

-3<br />

y<br />

<br />

−2ln(1 − p).<br />

Further, <strong>for</strong> (x,y) satisfying x 2 + y 2 = a 2 = −2ln(1 − p), f(x,y) = (1 − p)/(2π).<br />
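The relationship between p and the contour radius a can be checked by simulation, assuming Python with NumPy: generate standard normal pairs and count the fraction falling inside the disk of radius a = √(−2 ln(1 − p)).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500_000)
y = rng.standard_normal(500_000)

for p in (0.05, 0.20, 0.50, 0.95):
    a = np.sqrt(-2 * np.log(1 - p))          # radius of the contour enclosing p
    inside = np.mean(x**2 + y**2 <= a**2)    # simulated probability
    print(round(a, 3), p, round(inside, 3))  # inside is close to p
```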

Remarks. The joint density given in this section corresponds to the density of independent<br />

standard normal random variables. See Section 8.1.9 <strong>for</strong> details about bivariate normal<br />

distributions.<br />

Probability contours <strong>for</strong> bivariate continuous distributions like the bivariate normal distribution<br />

take the place of quantiles <strong>for</strong> univariate continuous distributions.<br />



8 Joint Continuous Distributions<br />

Recall, from Chapter 3, that a probability distribution describing the joint variability of<br />

two or more random variables is called a joint distribution, and that a bivariate distribution<br />

is the joint distribution of a pair of random variables.<br />

8.1 Bivariate Distributions<br />

This section considers joint distributions of pairs of continuous random variables.<br />

8.1.1 Joint PDF and Joint CDF<br />

In the bivariate continuous setting, we assume that X and Y are continuous random variables<br />

and that there exists a nonnegative function f(x,y) defined over the entire xy-plane<br />

such that<br />

<br />

P((X,Y ) ∈ D) = f dA <strong>for</strong> every D ⊆ R2 .<br />

D<br />

The function f(x,y) is called the joint probability density function (joint PDF) (or joint<br />

density function) of the random pair (X,Y ). The notation fXY (x,y) is sometimes used to<br />

emphasize the two random variables.<br />

Joint PDFs satisfy the following two properties:<br />

1. f(x,y) ≥ 0 for all pairs (x,y).
2. ∬_{R²} f(x,y) dA = 1.

The joint cumulative distribution function (joint CDF) of the random pair (X,Y), denoted by F(x,y), is the function
F(x,y) = P(X ≤ x, Y ≤ y) = ∬_{Dxy} f dA, for all real pairs (x,y),
where Dxy = {(u,v) : u ≤ x, v ≤ y}.
Joint CDFs satisfy the following properties:
1. lim_{(x,y)→(−∞,−∞)} F(x,y) = 0 and lim_{(x,y)→(∞,∞)} F(x,y) = 1.
2. If x1 ≤ x2 and y1 ≤ y2, then F(x1,y1) ≤ F(x2,y2).
3. F(x,y) is continuous.

Example 8.1 (Joint Distribution in 1st Quadrant) Assume that (X,Y ) has the<br />

following joint density function:<br />

f(x,y) = 2e −x−2y when x,y ≥ 0 and 0 otherwise.<br />

Given (x,y) in the first quadrant (both coordinates nonnegative),
F(x,y) = ∫_0^y ∫_0^x 2e^{−u−2v} du dv = ∫_0^y 2e^{−2v}(1 − e^{−x}) dv = (1 − e^{−x})(1 − e^{−2y}).
F(x,y) is equal to 0 otherwise.

Further, to compute the probability that X is greater than or equal to Y , <strong>for</strong> example, we<br />

compute the double integral of the joint PDF over the intersection of the half-plane x ≥ y<br />

with the first quadrant:<br />

P(X ≥ Y) = ∫_0^∞ ∫_y^∞ 2e^{−x−2y} dx dy = ∫_0^∞ 2e^{−3y} dy = 2/3.
8.1.2 Marginal Distributions

The marginal probability density function (marginal PDF) (or marginal density function) of X is
fX(x) = ∫_y f(x,y) dy for x in the range of X
and 0 otherwise, where the integral is taken over all y in the range of Y.
Similarly, the marginal PDF (or marginal density function) of Y is
fY(y) = ∫_x f(x,y) dx for y in the range of Y
and 0 otherwise, where the integral is taken over all x in the range of X.

Example 8.2 (Joint Distribution in 1st Quadrant, continued) For the joint distribution<br />

described above:<br />

1. When x ≥ 0, ∫_0^∞ f(x,y) dy = e^{−x}. Thus,
fX(x) = e^{−x} when x ≥ 0 and 0 otherwise.
(X is an exponential random variable with parameter 1.)
2. When y ≥ 0, ∫_0^∞ f(x,y) dx = 2e^{−2y}. Thus,
fY(y) = 2e^{−2y} when y ≥ 0 and 0 otherwise.
(Y is an exponential random variable with parameter 2.)
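Marginal densities can be obtained symbolically by integrating the joint PDF over the other variable. A minimal sketch for the joint density above, assuming SymPy is available:

```python
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = 2 * sp.exp(-x - 2 * y)          # joint PDF on the first quadrant

f_X = sp.integrate(f, (y, 0, sp.oo))   # marginal of X
f_Y = sp.integrate(f, (x, 0, sp.oo))   # marginal of Y

print(f_X, f_Y)                        # exp(-x) and 2*exp(-2*y)
```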

Example 8.3 (Joint Distribution on Eighth Plane) Assume that (X,Y ) has the<br />

following joint density function:<br />

f(x,y) = (1/4) e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise.

Figure 8.1: Plot of z = (1/4)e^{−y/2} on 0 ≤ x ≤ y (left) and region of nonzero density (right).

The left plot in Figure 8.1 shows part of the surface z = f(x,y) (with vertical planes added)<br />

and the right plot shows part of the region of nonzero density.<br />

The volume of the solid shown in the left part of Figure 8.1 is equal to the probability that Y is at most 11:
P(Y ≤ 11) = ∫_0^{11} ∫_0^y (1/4) e^{−y/2} dx dy = ∫_0^{11} (1/4) y e^{−y/2} dy = 1 − (13/2) e^{−11/2} ≈ 0.9734.
Further, for this random pair,
1. When x ≥ 0, ∫_x^∞ f(x,y) dy = (1/2) e^{−x/2}. Thus,
fX(x) = (1/2) e^{−x/2} when x ≥ 0 and 0 otherwise.
(X is an exponential random variable with parameter 1/2.)
2. When y ≥ 0, ∫_0^y f(x,y) dx = (1/4) y e^{−y/2}. Thus,
fY(y) = (1/4) y e^{−y/2} when y ≥ 0 and 0 otherwise.
(Y is a gamma random variable with shape parameter 2 and scale parameter 2.)

8.1.3 Conditional Distributions<br />

Let X and Y be continuous random variables with joint PDF f(x,y), and marginal PDFs<br />

fX(x) and fY (y), respectively.<br />

If fY(y) ≠ 0, then the conditional probability density function (conditional PDF) (or conditional density function) of X given Y = y is defined as follows:
f_{X|Y=y}(x|y) = f(x,y)/fY(y) for all real numbers x.
Similarly, if fX(x) ≠ 0, then the conditional PDF (or conditional density function) of Y given X = x is
f_{Y|X=x}(y|x) = f(x,y)/fX(x) for all real numbers y.

Example 8.4 (Joint Distribution in 1st Quadrant, continued) Consider again the<br />

random pair with joint density function<br />

f(x,y) = 2e −x−2y when x,y ≥ 0 and 0 otherwise.<br />

1. Given x ≥ 0, the conditional distribution of Y given X = x has PDF<br />

f Y |X=x(y|x) = 2e −2y when y ≥ 0 and 0 otherwise.<br />

(The conditional distribution is the same as the marginal distribution.)<br />

2. Given y ≥ 0, the conditional distribution of X given Y = y has PDF<br />

f X|Y =y(x|y) = e −x when x ≥ 0 and 0 otherwise.<br />

(The conditional distribution is the same as the marginal distribution.)<br />

Example 8.5 (Joint Distribution on Eighth Plane, continued) Consider again the<br />

random pair with joint density function<br />

f(x,y) = (1/4) e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise. (Figure 8.1)
1. Given x ≥ 0, the conditional distribution of Y given X = x has PDF
f_{Y|X=x}(y|x) = (1/2) e^{−(y−x)/2} when y ≥ x and 0 otherwise.
(The conditional distribution is a shifted exponential distribution.)
2. Given y > 0, the conditional distribution of X given Y = y has PDF
f_{X|Y=y}(x|y) = 1/y when 0 ≤ x ≤ y and 0 otherwise.
(The conditional distribution is uniform on the interval [0,y].)

Remarks. The function f_{X|Y=y}(x|y) is a valid PDF since f_{X|Y=y}(x|y) ≥ 0 and
∫_x f_{X|Y=y}(x|y) dx = (1/fY(y)) ∫_x f(x,y) dx = (1/fY(y)) fY(y) = 1.
Similarly, the function f_{Y|X=x}(y|x) is a valid PDF. Thus, conditional distributions are distributions in their own right.
Although, for example, fY(y) does not represent probability, we can think of the conditional density function f_{X|Y=y}(x|y) as a limit of probabilities. Specifically,
f_{X|Y=y}(x|y) = lim_{ǫ→0} ∂/∂x P(X ≤ x | y − ǫ ≤ Y ≤ y + ǫ).
Thus, the conditional PDF of X given Y = y can be thought of as the conditional PDF of X given that Y is very close to y.


8.1.4 Independent Random Variables<br />

The continuous random variables X and Y are said to be independent if their joint density<br />

function equals the product of their marginal density functions <strong>for</strong> all real pairs,<br />

f(x,y) = fX(x)fY (y) <strong>for</strong> all (x,y).<br />

Equivalently, X and Y are independent if their joint CDF equals the product of their<br />

marginal CDFs <strong>for</strong> all real pairs,<br />

F(x,y) = FX(x)FY (y) <strong>for</strong> all (x,y).<br />

Random variables that are not independent are said to be dependent.<br />

Example 8.6 (Joint Distribution in 1st Quadrant, continued) The random variables<br />

with joint density function<br />

are independent.<br />

f(x,y) = 2e −x−2y = (e −x )(2e −2y ) when x,y ≥ 0 and 0 otherwise<br />

Example 8.7 (Joint Distribution on Eighth Plane, continued) The random variables<br />

with joint density function<br />

f(x,y) = (1/4) e^{−y/2} when 0 ≤ x ≤ y and 0 otherwise (Figure 8.1)
are dependent. To demonstrate the dependence of X and Y, we just need to show that f(x,y) ≠ fX(x)fY(y) for some pair (x,y). For example, f(2,1) ≠ fX(2)fY(1).

8.1.5 Mathematical Expectation for Bivariate Continuous Distributions
Let X and Y be continuous random variables with joint density function f(x,y) and let g(X,Y) be a real-valued function.
The mean of g(X,Y) (or the expected value of g(X,Y) or the expectation of g(X,Y)) is
E(g(X,Y)) = ∫_x ∫_y g(x,y) f(x,y) dy dx
provided the integral converges absolutely. The double integral is assumed to include all pairs with nonzero joint density.

Example 8.8 (Joint Distribution on Triangle) Assume that (X,Y) has the following joint density function
f(x,y) = 2/15 when 0 ≤ x ≤ 3, 0 ≤ y ≤ 5x/3 and 0 otherwise.
(The random pair has constant density on the triangle with corners (0,0), (3,0), (3,5).)
Let g(X,Y) = XY be the product of the variables. Then
E(g(X,Y)) = ∫_0^3 ∫_0^{5x/3} xy (2/15) dy dx = ∫_0^3 (5/27) x³ dx = 15/4.
4 .


Properties of expectation. The properties of expectation stated in Section 3.1.6 in the<br />

bivariate discrete case are also true in the bivariate continuous case.<br />

8.1.6 Covariance and Correlation<br />

Let X and Y be random variables with finite means (µx, µy) and finite standard deviations<br />

(σx, σy). The covariance of X and Y , Cov(X,Y ), is defined as follows:<br />

Cov(X,Y ) = E((X − µx)(Y − µy)).<br />

The notation σxy = Cov(X,Y) is used to denote the covariance. The correlation of X and Y, Corr(X,Y), is defined as follows:
Corr(X,Y) = Cov(X,Y)/(σx σy) = σxy/(σx σy).

The notation ρ = Corr(X,Y ) is used to denote the correlation of X and Y ; ρ is called the<br />

correlation coefficient.<br />

The properties of covariance and correlation stated in Section 3.1.7 in the bivariate discrete<br />

case are also true in the bivariate continuous case.<br />

Example 8.9 (Joint Distribution on Triangle, continued) Continuing with the joint distribution above,
E(X) = 2, E(Y) = 5/3, E(X²) = 9/2, E(Y²) = 25/6, E(XY) = 15/4.
Using properties of variance and covariance:
Var(X) = E(X²) − (E(X))² = 1/2, Var(Y) = E(Y²) − (E(Y))² = 25/18,
Cov(X,Y) = E(XY) − E(X)E(Y) = 5/12 and ρ = Corr(X,Y) = 1/2.
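The moments in this example can be verified symbolically, assuming SymPy is available, by integrating over the triangular region (y from 0 to 5x/3, then x from 0 to 3):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.Rational(2, 15)                 # constant density on the triangle

def E(g):
    # expectation of g(X, Y): integrate g * f over the triangle
    return sp.integrate(g * f, (y, 0, 5 * x / 3), (x, 0, 3))

EX, EY, EXY = E(x), E(y), E(x * y)
VarX, VarY = E(x**2) - EX**2, E(y**2) - EY**2
cov = EXY - EX * EY
rho = cov / sp.sqrt(VarX * VarY)

print(EX, EY, cov, sp.simplify(rho))   # 2, 5/3, 5/12, 1/2
```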

Example 8.10 (Joint Distribution on Diamond) Assume that the continuous pair (X,Y) has joint density function
f(x,y) = 1/2 when |x| + |y| ≤ 1 and 0 otherwise.
(Joint density is constant on the “diamond” with corners (−1,0), (0,1), (1,0), (0,−1).)
For this joint distribution,
1. E(X) = E(Y) = E(XY) = 0.
2. Cov(X,Y) = 0 and Corr(X,Y) = 0.
3. The marginal PDF of X is fX(x) = 1 − |x| when −1 ≤ x ≤ 1 and 0 otherwise.
4. The marginal PDF of Y is fY(y) = 1 − |y| when −1 ≤ y ≤ 1 and 0 otherwise.
Since Corr(X,Y) = 0, X and Y are uncorrelated. But, X and Y are dependent since, for example, f(0,0) ≠ fX(0)fY(0).


Figure 8.2: Graph of z = f(x,y) for a continuous bivariate distribution (left plot) and contour plot with z = 1/2, 1/4, 1/6, ..., 1/20 and conditional expectation of Y given X = x superimposed (right plot).

8.1.7 Conditional Expectation and Regression<br />

Let X and Y be continuous random variables with joint density function f(x,y).<br />

If fX(x) ≠ 0, then the conditional expectation (or the conditional mean) of Y given X = x, E(Y|X = x), is defined as follows:
E(Y|X = x) = ∫_y y f_{Y|X=x}(y|x) dy,
where the integral is over all y with nonzero conditional density (f_{Y|X=x}(y|x) ≠ 0), provided the integral converges absolutely.

The conditional expectation of X given Y = y, E(X|Y = y), is defined similarly.<br />

Example 8.11 (Conditional Expectation) Assume that the continuous pair (X,Y) has joint density function
f(x,y) = e^{−yx}/(2√x) when x ≥ 1, y ≥ 0 and 0 otherwise.
1. The marginal PDF of X is
fX(x) = 1/(2x^{3/2}) when x ≥ 1 and 0 otherwise.
2. The conditional PDF of Y given X = x is
f_{Y|X=x}(y|x) = x e^{−xy} when y ≥ 0 and 0 otherwise.
(The conditional distribution is exponential with parameter x.)
3. Since the expected value of an exponential random variable is the reciprocal of its parameter (see Table 5.2), the conditional mean formula is
E(Y|X = x) = 1/x for x ≥ 1.
The left part of Figure 8.2 is a graph of z = f(x,y) (with vertical planes added) and the right part is a contour plot with z = 1/2, 1/4, 1/6, ..., 1/20 (in gray). The formula for the conditional mean, y = 1/x, is superimposed on the contour plot (in black).

Regression equation. Note that the formula for the conditional expectation
E(Y|X = x) as a function of x
is called the regression equation of Y on X. Similarly, the formula for the conditional expectation E(X|Y = y) as a function of y is called the regression equation of X on Y.
Linear conditional means. Let X and Y be random variables with finite means (µx, µy), standard deviations (σx, σy), and correlation (ρ).
If E(Y|X = x) is a linear function of x, then the formula is of the form:
E(Y|X = x) = µy + ρ (σy/σx)(x − µx).
Similarly, if E(X|Y = y) is a linear function of y, then the formula is of the form:
E(X|Y = y) = µx + ρ (σx/σy)(y − µy).

Example 8.12 (Two Step Experiment) Consider the following 2 step experiment:
Step 1: Choose X uniformly from the interval [10,20].
Step 2: Given that X = x, choose Y uniformly from the interval [0,x].
Then the continuous pair (X,Y) has joint density function
f(x,y) = fX(x) f_{Y|X=x}(y|x) = (1/10)(1/x) = 1/(10x)
when 10 ≤ x ≤ 20, 0 ≤ y ≤ x and 0 otherwise. (The region of nonzero density is the trapezoidal region with corners (10,0), (10,10), (20,20), (20,0).)
Since the conditional distribution of Y given X = x is uniform on the interval [0,x], and the expected value of a uniform random variable is the midpoint of its interval (see Table 5.2), the conditional mean formula is
E(Y|X = x) = x/2 for 10 ≤ x ≤ 20
(a linear function of x). Further, since
µx = 15, µy = 15/2, σx = 5/√3, σy = 5√31/6 and ρ = √(3/31)
for this random pair, µy + ρ(σy/σx)(x − µx) = x/2, verifying the formula given above.
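The two-step construction is easy to simulate, which gives a check on the moments and on the linear conditional mean formula. A sketch assuming Python with NumPy (the conditioning on X near 16 is only an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.uniform(10, 20, n)      # Step 1
y = rng.uniform(0, x)           # Step 2: uniform on [0, x], componentwise

print(y.mean(), 15 / 2)                              # about 7.5
print(np.corrcoef(x, y)[0, 1], np.sqrt(3 / 31))      # about 0.311

# Conditional mean check: the average of Y for X near 16 should be close to 8.
near_16 = np.abs(x - 16) < 0.05
print(y[near_16].mean())                             # about 8
```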


8.1.8 Example: Bivariate Uniform Distribution
Let R be a region of the plane with finite positive area. The continuous random pair (X,Y) is said to have a bivariate uniform distribution on the region R if its joint PDF is as follows:
f(x,y) = 1/(Area of R) when (x,y) ∈ R and 0 otherwise.
Bivariate uniform distributions have constant density over the region of nonzero density. That constant density is the reciprocal of the area. If A is a subregion of R (A ⊆ R), then
P((X,Y) ∈ A) = (Area of A)/(Area of R).

That is, the probability of the event “the point (x,y) is in A” is the ratio of the area of the<br />

subregion to the area of the full region.<br />

The joint distributions on the triangle and on the diamond considered earlier are examples<br />

of bivariate uni<strong>for</strong>m distributions on regions with areas 15/2 and 2, respectively.<br />

Expectations. If (X,Y ) has a bivariate uni<strong>for</strong>m distribution on R and g(X,Y ) is a realvalued<br />

function, then E(g(X,Y )) is the same as the average value of the 2-variable function<br />

g over the region R, as discussed on page 157.<br />

Marginal distributions. If (X,Y ) has a bivariate uni<strong>for</strong>m distribution, then X and Y<br />

may not be uni<strong>for</strong>m random variables.<br />

Consider again the joint distribution on the diamond. X is not a uni<strong>for</strong>m random variable<br />

since its PDF is not constant on the interval [−1,1]. Similarly, Y is not a uni<strong>for</strong>m random<br />

variable since its PDF is not constant on [−1,1].<br />

8.1.9 Example: Bivariate Normal Distribution<br />

Let µx and µy be real numbers, σx and σy be positive real numbers, and ρ be a number<br />

in the interval −1 < ρ < 1. The random pair (X,Y ) is said to have a bivariate normal<br />

distribution with parameters µx, µy, σx, σy and ρ when its joint PDF is as follows
f(x,y) = (1/(2π σx σy √(1 − ρ²))) exp( −term/(2(1 − ρ²)) ) for all real pairs (x,y),
where
term = ((x − µx)/σx)² − 2ρ ((x − µx)/σx)((y − µy)/σy) + ((y − µy)/σy)²
and exp() is the exponential function.
For this random pair,
2


Figure 8.3: Joint density (left) and contours enclosing 5%, 35%, 65% and 95% of the probability for the standard bivariate normal distribution with correlation ρ = −0.80.

1. µx and σx are the mean and standard deviation of X,<br />

2. µy and σy are the mean and standard deviation of Y and<br />

3. ρ (the correlation coefficient) is the correlation between X and Y .<br />

Further, if ρ = 0, then X and Y are independent.<br />

Standard bivariate normal distribution. The random pair (X,Y ) is said to have a<br />

standard bivariate normal distribution with parameter ρ when (X,Y ) has a bivariate normal<br />

distribution with µx = µy = 0 and σx = σy = 1.<br />

The joint PDF of (X,Y) is as follows:
f(x,y) = (1/(2π√(1 − ρ²))) exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) ) for all real pairs (x,y).

Example 8.13 (ρ = −0.80) The left part of Figure 8.3 shows the joint density function<br />

of the bivariate standard normal distribution with correlation ρ = −0.80.<br />

The right part of the figure shows contours enclosing 5%, 35%, 65% and 95% of the probability<br />

of this joint distribution. Each contour is an ellipse whose major axis lies along the<br />

line y = −x and whose minor axis lies along the line y = x.<br />

Marginal and conditional distributions. If (X,Y ) has a bivariate normal distribution,<br />

then each marginal and conditional distribution is normal. Specifically,<br />

1. X is a normal random variable with mean µx and standard deviation σx,<br />

2. Y is a normal random variable with mean µy and standard deviation σy,<br />

Figure 8.4: The region of nonzero density for a bivariate uniform distribution on the open rectangle (0,2) × (0,1) (left plot) and the probability density function of the product of the coordinates with values in (0,2) (right plot). Contours for products equal to 0.2, 0.6, 1.0, and 1.4 are shown in the left plot.

3. the conditional distribution of Y given X = x is normal with mean
E(Y|X = x) = µy + ρ (σy/σx)(x − µx)
and standard deviation σy√(1 − ρ²), and
4. the conditional distribution of X given Y = y is normal with mean
E(X|Y = y) = µx + ρ (σx/σy)(y − µy)
and standard deviation σx√(1 − ρ²).

Remarks. The computations needed to find probability contours <strong>for</strong> bivariate normal<br />

distributions generalize ideas from Section 7.7.2 and are discussed further in the next section.<br />

8.1.10 Trans<strong>for</strong>ming Continuous Random Variables<br />

Let (U,V ) be a continuous random pair with joint density function f(u,v). This section<br />

considers scalar-valued and vector-valued continuous trans<strong>for</strong>mations of (U,V ).<br />

Scalar-valued functions. We first consider scalar-valued functions.<br />

Example 8.14 (Product Trans<strong>for</strong>mation) Let U be the length, V be the width, and<br />

X = UV be the area of a random rectangle. Specifically, assume that U is a uni<strong>for</strong>m<br />

random variable on the interval (0,2), V is a uni<strong>for</strong>m random variable on the interval (0,1),<br />

and that U and V are independent.<br />

Since the joint PDF of (U,V) is
f(u,v) = fU(u) fV(v) = 1/2 when (u,v) ∈ (0,2) × (0,1) and 0 otherwise,


(U,V ) has a bivariate uni<strong>for</strong>m distribution on the open rectangle.<br />

For 0 < x < 2,
P(X ≤ x) = P(UV ≤ x) = x/2 + ∫_x^2 (x/(2u)) du = (1/2)(x + x ln(2) − x ln(x))
and (d/dx) P(X ≤ x) = (1/2)(ln(2) − ln(x)) = (1/2) ln(2/x). Thus,
fX(x) = (1/2) ln(2/x) when 0 < x < 2 and 0 otherwise.

The left part of Figure 8.4 shows the region of nonzero joint density, with contours corresponding<br />

to x = 0.2,0.6,1.0,1.4 superimposed. The right part is a plot of the density<br />

function of X = UV .<br />

Example 8.15 (Sum Trans<strong>for</strong>mation) Let X = U +V , where U and V are independent<br />

exponential random variables with parameter 1. The range of X is the nonnegative real<br />

numbers.<br />

Since U and V are independent, the joint density function of (U,V) is
f(u,v) = fU(u) fV(v) = e^{−u−v} when u,v ≥ 0 and 0 otherwise.
Given x ≥ 0,
P(X ≤ x) = ∫_0^x ∫_0^{x−u} e^{−u−v} dv du = ∫_0^x (e^{−u} − e^{−x}) du = 1 − e^{−x} − x e^{−x},
and (d/dx) P(X ≤ x) = x e^{−x} (using the product rule for derivatives). Thus,

fX(x) = xe −x when x ≥ 0 and 0 otherwise.<br />

(X is a gamma random variable with shape parameter 2 and scale parameter 1.)<br />

Working with sums. If X = U + V and f(u,v) has continuous partial derivatives, then the PDF of X can be computed as follows:
fX(x) = ∫_{u∈R} f(u, x − u) du, where R is the range of U.
This result can be proven using the relationship between partial differentiation and partial anti-differentiation given in Section 7.6.7.
For the example above, since the joint density function is nonzero when both arguments are nonnegative, the integral reduces to
fX(x) = ∫_0^x e^{−x} du = x e^{−x} when x ≥ 0 and 0 otherwise,
confirming the result obtained using double integration followed by differentiation.


Vector-valued functions. We next consider vector-valued functions.<br />

Let (x,y) = T(u,v) be a vector-valued function of two variables. Assume that the coordinate<br />

functions of the transformation, x = x(u,v) and y = y(u,v), have continuous partial derivatives and that the Jacobian of the transformation
∂(x,y)/∂(u,v) = det[ xu  xv ; yu  yv ] = xu yv − xv yu ≠ 0 for all (u,v).
(See Section 7.6.8 for a discussion of the Jacobian.)
If (X,Y) = T(U,V) and f(u,v) has continuous partial derivatives, then, for all (x,y) with nonzero joint density,
fXY(x,y) = f(u,v) / |∂(x,y)/∂(u,v)|, where (x,y) = T(u,v).
(This result generalizes the result for monotone transformations from page 100.)

Example 8.16 (Standard Bivariate Normal Distribution) Let U and V be independent standard normal random variables and let
(X,Y) = T(U,V) = (U, ρU + √(1 − ρ²) V),
where −1 < ρ < 1. Then (X,Y) has a standard bivariate normal distribution with correlation ρ. To see this,
1. The Jacobian of the transformation is
∂(x,y)/∂(u,v) = det[ xu  xv ; yu  yv ] = det[ 1  0 ; ρ  √(1 − ρ²) ] = √(1 − ρ²).
2. Since x = u, y = ρu + √(1 − ρ²) v implies u = x, v = (1/√(1 − ρ²))(−ρx + y),
u² + v² = x² + (1/(1 − ρ²))(−ρx + y)² = (x² − 2ρxy + y²)/(1 − ρ²).
3. Since f(u,v) = exp(−(u² + v²)/2)/(2π), the formula above implies that
fXY(x,y) = (1/(2π√(1 − ρ²))) exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) ) for all (x,y).

The result is the same as the joint density function of the standard bivariate normal<br />

distribution with correlation ρ given earlier.<br />
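The transformation can be checked by simulation, assuming Python with NumPy: generate independent standard normals, apply T, and compare the sample moments with those of the standard bivariate normal distribution with correlation ρ.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = -0.80
u = rng.standard_normal(1_000_000)
v = rng.standard_normal(1_000_000)

x = u
y = rho * u + np.sqrt(1 - rho**2) * v

# Sample moments should match the standard bivariate normal with correlation rho.
print(x.mean(), y.mean())           # both near 0
print(x.std(), y.std())             # both near 1
print(np.corrcoef(x, y)[0, 1])      # near -0.80
```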

Example 8.17 (Bivariate Normal Distribution) More generally, if U and V are independent standard normal random variables and we let
(X,Y) = T(U,V) = (µx + σx U, µy + σy (ρU + √(1 − ρ²) V)),
then (X,Y) has a bivariate normal distribution with parameters µx, µy, σx, σy and ρ.
The steps in the derivation follow those in the previous example.


Remarks. The linear transformations given in the last two examples can be used to find probability contours for bivariate normal distributions. That is, the transformations can be used to find contours enclosing probability p, for 0 < p < 1.
We consider the first transformation.
1. Let U and V be independent standard normal random variables, Ca be the circle of radius a centered at the origin and
Da = {(u,v) : u² + v² ≤ a²}
be the disk of radius a centered at the origin. If a = √(−2 ln(1 − p)), then
P((U,V) ∈ Da) = p
(see Section 7.7.2) and Ca is the contour enclosing probability p.
Further, if (u,v) ∈ Ca, then f(u,v) = (1 − p)/(2π).
2. Let (X,Y) = T(U,V) = (U, ρU + √(1 − ρ²) V), where −1 < ρ < 1.
The image of Da under T,
T(Da) = {(x,y) : x² − 2ρxy + y² ≤ a²(1 − ρ²)},
is an elliptical region centered at the origin, whose boundary, T(Ca), is the contour enclosing probability p.
Further, if (x,y) ∈ T(Ca), then fXY(x,y) = (1 − p)/(2π√(1 − ρ²)).
Further, if (x,y) ∈ T(Ca), then fXY (x,y) = (1 − p)/(2π 1 − ρ 2 ).<br />

Note that the right part of Figure 8.3 shows contours corresponding to p = 0.05, 0.35, 0.65<br />

and 0.95 when ρ = −0.80.<br />

Bivariate normal distributions (and multivariate normal distributions) will be studied further<br />

in Chapter 9 of these notes.<br />

8.2 Multivariate Distributions<br />

Recall (from Section 3.2) that a multivariate distribution is the joint distribution of k<br />

random variables. Ideas studied in the bivariate case (k = 2) can be generalized to the case<br />

where k > 2.<br />

8.2.1 Joint PDF and Joint CDF<br />

Let X1, X2, ..., Xk be continuous random variables and let<br />

X = (X1,X2,...,Xk).<br />

We assume there exists a nonnegative k-variable function f with domain Rᵏ such that
P(X ∈ W) = ∫ ··· ∫_W f(x1,x2,...,xk) dx1 ··· dxk
for every W ⊆ Rᵏ. The function f(x1,x2,...,xk) is called the joint probability density function (joint PDF) (or joint density function) of the random k-tuple X.

The joint cumulative distribution function (joint CDF) of X is defined as follows:<br />

F(x) = P(X1 ≤ x1,X2 ≤ x2,... ,Xk ≤ xk) <strong>for</strong> all x = (x1,x2,... ,xk) ∈ R k ,<br />

where commas are understood to mean the intersection of events.<br />

Example 8.18 (Joint Distribution on 1st Octant) Assume that X = (X,Y,Z) has joint density function
f(x,y,z) = 6/(1 + x + y + z)⁴ when x,y,z ≥ 0 and 0 otherwise.
Given x,y,z ≥ 0,
F(x,y,z) = ∫_0^x ∫_0^y ∫_0^z f(u,v,w) dw dv du
= yz(2 + y + z)/((1 + y)(1 + z)(1 + y + z)) − yz(2 + 2x + y + z)/((1 + x)(1 + x + y)(1 + x + z)(1 + x + y + z)).
The joint CDF is equal to 0 otherwise.
Let c be a positive constant. To find the probability that the sum of the three variables is at most c, for example, we compute the triple integral of f over the region
W = {(x,y,z) : 0 ≤ x ≤ c, 0 ≤ y ≤ c − x, 0 ≤ z ≤ c − x − y} ⊂ R³.
The probability is
P(X + Y + Z ≤ c) = ∭_W f(x,y,z) dV = (c/(c + 1))³.
8.2.2 Marginal and Conditional Distributions

The concepts of marginal and conditional distributions can also be generalized.<br />

Example 8.19 (Joint Distribution on 1st Octant, continued) To illustrate computations, we continue with the joint distribution defined above.
1. The marginal distribution of the random pair (X,Y) is
fXY(x,y) = ∫_0^∞ f(x,y,z) dz = 2/(1 + x + y)³ when x,y ≥ 0 and 0 otherwise.
2. The marginal distribution of Z is
fZ(z) = ∫_0^∞ ∫_0^∞ f(x,y,z) dx dy = 1/(1 + z)² when z ≥ 0 and 0 otherwise.
3. Let x and y be fixed nonnegative constants. The conditional distribution of Z given (X,Y) = (x,y) is
f_{Z|(X,Y)=(x,y)}(z|(x,y)) = f(x,y,z)/fXY(x,y) = 3(1 + x + y)³/(1 + x + y + z)⁴
when z ≥ 0 and 0 otherwise.
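These marginal and conditional computations can be verified symbolically, assuming SymPy is available:

```python
import sympy as sp

x, y, z = sp.symbols('x y z', nonnegative=True)
f = 6 / (1 + x + y + z)**4

f_XY = sp.simplify(sp.integrate(f, (z, 0, sp.oo)))                # 2/(1 + x + y)^3
f_Z = sp.simplify(sp.integrate(f, (x, 0, sp.oo), (y, 0, sp.oo)))  # 1/(1 + z)^2
cond = sp.simplify(f / f_XY)                                      # 3*(1 + x + y)^3/(1 + x + y + z)^4

print(f_XY, f_Z, cond)
```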

8.2.3 Mutually Independent Random Variables and Random Samples<br />

The continuous random variables X1, X2, ..., Xk are said to be mutually independent (or independent) if

    f(x) = f1(x1) f2(x2) ··· fk(xk)   for all x = (x1, x2, ..., xk) ∈ R^k,

where fi(xi) is the PDF of Xi for each i.

Equivalently, X1, X2, ..., Xk are said to be mutually independent if

    F(x) = F1(x1) F2(x2) ··· Fk(xk)   for all x = (x1, x2, ..., xk) ∈ R^k,

where Fi(xi) is the CDF of Xi for each i.

X1, X2, ..., Xk are said to be dependent if they are not independent.

Example 8.20 (Joint Distribution on 1st Octant, continued) The random variables X, Y and Z in the example above are dependent. To see this, first note that each random variable has the same marginal density function:

    fX(u) = fY(u) = fZ(u) = 1/(1 + u)^2   when u ≥ 0, and 0 otherwise.

To demonstrate dependence, we just need to show that f(x,y,z) ≠ fX(x)fY(y)fZ(z) for some triple (x,y,z). For example, f(1,1,1) = 6/4^4 = 3/128, while fX(1)fY(1)fZ(1) = (1/4)^3 = 1/64.

Random sample. If the continuous random variables X1, X2, ..., Xk are mutually<br />

independent and have a common distribution (each marginal distribution is the same),<br />

then X1, X2, ..., Xk are said to be a random sample from that distribution.<br />

8.2.4 Mathematical Expectation For Multivariate Continuous Distributions<br />

Let X1, X2, ..., Xk be continuous random variables with joint PDF f(x) and let g(X) be a real-valued function. The mean or expected value or expectation of g(X) is

    E(g(X)) = ∫_{xk} ··· ∫_{x1} g(x1, x2, ..., xk) f(x1, x2, ..., xk) dx1 dx2 ··· dxk,

provided that the integral converges absolutely. The multiple integral is assumed to include all k-tuples with nonzero joint PDF.


Example 8.21 (Product Function) You pick three numbers independently and uniformly from the interval [0,5]. Let X1, X2, X3 be the three numbers and let g(X1,X2,X3) be the function with rule g(X1,X2,X3) = X1 X2^2 X3.

By independence,

    f(x1,x2,x3) = 1/125   on the [0,5] × [0,5] × [0,5] box, and 0 otherwise.

The expected product is

    E(g(X1,X2,X3)) = ∫_0^5 ∫_0^5 ∫_0^5 x1 x2^2 x3 (1/125) dx3 dx2 dx1
                   = ∫_0^5 ∫_0^5 (1/10) x1 x2^2 dx2 dx1
                   = ∫_0^5 (25/6) x1 dx1 = 625/12 ≈ 52.08.

Properties of expectation. Properties of expectation in the multivariate case directly<br />

generalize properties in the univariate and bivariate cases. In particular,<br />

1. If a and b1, b2, ..., bn are constants and gi(X) are real-valued k-variable functions for each i = 1, 2, ..., n, then

       E( a + Σ_{i=1}^n bi gi(X) ) = a + Σ_{i=1}^n bi E(gi(X)).

2. If X1, X2, ..., Xk are mutually independent and gi(Xi) are real-valued 1-variable functions for i = 1, 2, ..., k, then

       E( g1(X1) g2(X2) ··· gk(Xk) ) = Π_{i=1}^k E(gi(Xi)).

Example 8.22 (Product Function, continued) Continuing with the example above, and using properties of expectation, the expected product is

    E(X1 X2^2 X3) = E(X1) E(X2^2) E(X3) = (5/2)(25/3)(5/2) = 625/12,

confirming the result obtained above.
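A simulation gives the same answer to within sampling error. The sketch below (using NumPy; the seed and the sample size of one million draws are arbitrary choices) estimates the expected product directly:

    # Monte Carlo check of Examples 8.21 and 8.22: E(X1 * X2^2 * X3) = 625/12 = 52.083...
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 5.0, size=(1_000_000, 3))     # three independent Uniform[0,5] draws per row
    estimate = np.mean(x[:, 0] * x[:, 1]**2 * x[:, 2])
    print(estimate, 625 / 12)                          # estimate is close to 52.083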

8.2.5 Sample Summaries<br />

If X1, X2, ..., Xn is a random sample from a distribution with mean µ and standard deviation σ, then the sample mean, X̄, is the random variable

    X̄ = (1/n)(X1 + X2 + ··· + Xn),

the sample variance, S^2, is the random variable

    S^2 = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)^2,


and the sample standard deviation, S, is the positive square root of the sample variance.<br />

The following theorem can be proven using properties of expectation (see Section 3.2.7):<br />

Theorem 8.23 (Sample Summaries) If X is the sample mean and S 2 is the sample<br />

variance of a random sample of size n from a distribution with mean µ and standard<br />

deviation σ, then<br />

1. E(X) = µ and V ar(X) = σ 2 /n.<br />

2. E(S 2 ) = σ 2 .<br />

Sample correlation. A random sample of size n from the joint (X,Y ) distribution is a<br />

list of n mutually independent random pairs, each with the same distribution as (X,Y ).<br />

If (X1,Y1), (X2,Y2), ..., (Xn,Yn) is a random sample of size n from a bivariate distribution with correlation ρ = Corr(X,Y), then the sample correlation, R, is the random variable

    R = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / sqrt( Σ_{i=1}^n (Xi − X̄)^2 · Σ_{i=1}^n (Yi − Ȳ)^2 ),

where X̄ and Ȳ are the sample means of the X and Y samples, respectively.

Applications. In statistical applications, observed values of the sample mean, sample<br />

variance, and sample correlation are used to estimate unknown values of the parameters µ,<br />

σ 2 and ρ, respectively.<br />
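In practice these summaries are computed directly from the observed data. The sketch below (using NumPy; the two short data vectors are made-up numbers, not the brain-body data of the next example) computes observed sample means, sample standard deviations, and the sample correlation:

    # Observed sample summaries for paired data (Section 8.2.5).
    import numpy as np

    x = np.array([0.12, -1.50, 2.30, 0.75, -0.40])   # hypothetical observations
    y = np.array([0.90, -0.80, 3.10, 1.40, 0.10])    # hypothetical observations

    xbar, ybar = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)            # ddof=1 gives the 1/(n-1) sample variance
    r = np.sum((x - xbar) * (y - ybar)) / np.sqrt(np.sum((x - xbar)**2) * np.sum((y - ybar)**2))
    print(xbar, sx, ybar, sy, r)
    print(np.corrcoef(x, y)[0, 1])                   # same value of r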

Example 8.24 (Brain-Body Weights) (Allison & Cicchetti, Science, 194:732-734, 1976;

lib.stat.cmu.edu/DASL.) As part of a study on sleep in mammals, researchers collected<br />

in<strong>for</strong>mation on the average brain weight and average body weight <strong>for</strong> 43 different species.<br />

(Body-weight, brain-weight) combinations ranged from (0.05kg,0.14g) <strong>for</strong> the short-tail<br />

shrew to (2547.0kg,4603.0g) <strong>for</strong> the Asian elephant. The data <strong>for</strong> “man” are<br />

(62kg,1320g) = (136.4lb,2.9lb).<br />

Let X be the common logarithm of body weight in kilograms, and Y be the common<br />

logarithm of brain weight in grams. Sample summaries are as follows:<br />

1. Mean log-body weight is x = 0.311738, with a SD of sx = 1.33113.<br />

2. Mean log-brain weight is y = 1.17421, with a SD of sy = 1.09415.<br />

3. Correlation between log-body weight and log-brain weight is r = 0.951693.<br />


Figure 8.5: Log-brain weight (vertical axis) versus log-body weight (horizontal axis) <strong>for</strong> the<br />

brain-body study. Probability contours of the estimated bivariate normal model and the<br />

estimated conditional expectation are superimposed.<br />

The researchers determined that a bivariate normal model was reasonable <strong>for</strong> these data.<br />

Figure 8.5 compares the common logarithms of average brain weight in grams (vertical<br />

axis) and average body weight in kilograms (horizontal axis) <strong>for</strong> the 43 species. Probability<br />

contours <strong>for</strong> the model with parameters equal to the sample summaries are shown on the<br />

plot; contours enclose 5%, 35%, 65% and 95% of probability. The pair <strong>for</strong> “man” is the<br />

only point outside the outermost contour.<br />

The estimated conditional expectation line,<br />

y = 1.17421 + 0.782263 (x − 0.311738),<br />

is displayed in the plot. Note, in particular, that the slope of the line is r(sy/sx) = 0.782263.<br />



9 Linear Algebra Concepts<br />

The central equations of interest in linear algebra are of the <strong>for</strong>m Ax = b, where A is an<br />

m-by-n coefficient matrix, x is an n-by-1 vector of unknowns, and b is an m-by-1 vector<br />

of values. Solutions come at three levels of sophistication: (1) direct solution methods, (2)<br />

matrix algebra methods, and (3) vector space methods.<br />

This chapter introduces linear algebra and its applications in statistics.

9.1 Solving Systems of Equations<br />

A linear equation in the variables x1,x2,... ,xn is an equation of the <strong>for</strong>m<br />

a1x1 + a2x2 + · · · + anxn = b<br />

where a1,a2,... ,an and b are constants. A system of linear equations in the variables<br />

x1,x2,... ,xn is a collection of one or more linear equations in these variables. For example,<br />

x1 −2x2 + x3 = 0<br />

2x2 −8x3 = 8<br />

−4x1 +5x2 +9x3 = −9<br />

is a system of 3 linear equations in the three unknowns x1,x2,x3 (a “3-by-3 system”).<br />

A solution of a linear system is a list (s1,s2,... ,sn) of numbers that makes each equation<br />

in the system true when si is substituted <strong>for</strong> xi <strong>for</strong> i = 1,2,... ,n. The list (29, 16, 3) is a<br />

solution to the system above. To check this, note that<br />

(29) −2(16) + (3) = 0<br />

2(16) −8(3) = 8<br />

−4(29) +5(16) +9(3) = −9<br />

The solution set of a linear system is the collection of all possible solutions of the system.<br />

For the example above, (29, 16, 3) is the unique solution.<br />

Two linear systems are equivalent if each has the same solution set.<br />

9.1.1 Elementary Row Operations<br />

The strategy <strong>for</strong> solving a linear system is to replace the system with an equivalent one that<br />

is easier to solve. The equivalent system is obtained using elementary row operations:<br />

1. (Replacement) Add a multiple of one row to another row,<br />

2. (Interchange) Interchange two rows,<br />

3. (Scaling) Multiply all entries in a row by a nonzero constant,<br />

where each “row” corresponds to an equation in the system.<br />

189


Example 9.1 (Using Row Operations) We work out the strategy using the 3-by-3 system given above, where a 3-by-4 augmented matrix of the coefficients and right hand side values follows our progress toward a solution.

(1)  R1:        x1 − 2x2 +  x3 =  0        [  1 −2  1 |  0 ]
     R2:              2x2 − 8x3 =  8        [  0  2 −8 |  8 ]
     R3:       −4x1 + 5x2 + 9x3 = −9        [ −4  5  9 | −9 ]

(2)  R1:        x1 − 2x2 +  x3 =  0        [  1 −2   1 |  0 ]
     R2:              2x2 − 8x3 =  8        [  0  2  −8 |  8 ]
     R3 + 4R1:      −3x2 + 13x3 = −9        [  0 −3  13 | −9 ]

(3)  R1:        x1 − 2x2 +  x3 =  0        [  1 −2   1 |  0 ]
     (1/2)R2:          x2 − 4x3 =  4        [  0  1  −4 |  4 ]
     R3:            −3x2 + 13x3 = −9        [  0 −3  13 | −9 ]

(4)  R1:        x1 − 2x2 +  x3 =  0        [  1 −2   1 |  0 ]
     R2:               x2 − 4x3 =  4        [  0  1  −4 |  4 ]
     R3 + 3R2:               x3 =  3        [  0  0   1 |  3 ]

(5)  R1 − R3:   x1 − 2x2 = −3               [  1 −2  0 | −3 ]
     R2 + 4R3:        x2 = 16               [  0  1  0 | 16 ]
     R3:              x3 =  3               [  0  0  1 |  3 ]

(6)  R1 + 2R2:  x1 = 29                     [  1  0  0 | 29 ]
     R2:        x2 = 16                     [  0  1  0 | 16 ]
     R3:        x3 =  3                     [  0  0  1 |  3 ]

Thus, the solution is x1 = 29, x2 = 16, and x3 = 3.
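The computation can be confirmed numerically. The sketch below (using NumPy; numpy.linalg.solve performs a similar elimination internally) solves the system directly:

    # Numerical check of Example 9.1.
    import numpy as np

    A = np.array([[ 1.0, -2.0,  1.0],
                  [ 0.0,  2.0, -8.0],
                  [-4.0,  5.0,  9.0]])
    b = np.array([0.0, 8.0, -9.0])
    print(np.linalg.solve(A, b))    # -> [29. 16.  3.]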

Forward and backward phases. In the <strong>for</strong>ward phase of the row reduction process of<br />

our example (corresponding to systems (1) through (4)), R1 is used to eliminate x1 from<br />

R2 and R3; then R2 is used to eliminate x2 from R3. In the backward phase of the process<br />

(corresponding to systems (5) and (6)), R3 is used to eliminate x3 from R1 and R2; then<br />

R2 is used to eliminate x2 from R1.<br />

Existence and uniqueness. Two fundamental questions about linear systems are:<br />

1. Does at least one solution exist?<br />

2. If a solution exists, is the solution unique?<br />

If a linear system has one or more solutions, then it is said to be consistent; otherwise, it is<br />

said to be inconsistent. A linear system can be inconsistent, or have a unique solution, or<br />

have infinitely many solutions.<br />

Note that in the example above, we knew that the system was consistent once we reached<br />

equivalent system (4); the remaining steps allowed us to find the unique solution.<br />



Example 9.2 (Infinitely Many Solutions) Consider the following 3-by-4 system:

      x1 +  x2 −  x3        =   9
      x1 + 2x2 − 3x3 − 4x4  =   9
    −3x1 − 2x2 + 2x3 − 3x4  = −23

and the following sequence of equivalent augmented matrices:

    [  1  1 −1  0 |   9 ]   [ 1 1 −1  0 | 9 ]   [ 1 1 −1  0 | 9 ]
    [  1  2 −3 −4 |   9 ] ~ [ 0 1 −2 −4 | 0 ] ~ [ 0 1 −2 −4 | 0 ]
    [ −3 −2  2 −3 | −23 ]   [ 0 1 −1 −3 | 4 ]   [ 0 0  1  1 | 4 ]

      [ 1 1 0  1 | 13 ]   [ 1 0 0  3 | 5 ]
    ~ [ 0 1 0 −2 |  8 ] ~ [ 0 1 0 −2 | 8 ].
      [ 0 0 1  1 |  4 ]   [ 0 0 1  1 | 4 ]

The last matrix corresponds to x1 + 3x4 = 5, x2 − 2x4 = 8, and x3 + x4 = 4. Thus,

    (s1, s2, s3, s4) = (5 − 3x4, 8 + 2x4, 4 − x4, x4)

is a solution for any value of x4.

Example 9.3 (Inconsistent System) Consider the following 3-by-3 system:

           3x2 − 6x3 =  8
      x1 − 2x2 + 3x3 = −1
     5x1 − 7x2 + 9x3 =  0

and the following sequence of equivalent augmented matrices:

    [ 0  3 −6 |  8 ]   [ 1 −2  3 | −1 ]   [ 1 −2  3 | −1 ]   [ 1 −2  3 | −1 ]
    [ 1 −2  3 | −1 ] ~ [ 0  3 −6 |  8 ] ~ [ 0  3 −6 |  8 ] ~ [ 0  3 −6 |  8 ].
    [ 5 −7  9 |  0 ]   [ 5 −7  9 |  0 ]   [ 0  3 −6 |  5 ]   [ 0  0  0 | −3 ]

The last matrix corresponds to x1 − 2x2 + 3x3 = −1, 3x2 − 6x3 = 8, and 0 = −3. Since the equation 0 = −3 is false, the system has no solution.

9.1.2 Row Reduction and Echelon Forms<br />

A matrix is in (row) echelon <strong>for</strong>m when<br />

1. All nonzero rows are above any row of zeros.<br />

2. Each leading entry (that is, leftmost nonzero entry) in a row is in a column to the<br />

right of the leading entries of the rows above it.<br />

3. All entries in a column below a leading entry are zero.<br />

Table 9.1: 6-by-11 matrices in echelon (left) and reduced echelon (right) forms.

    Echelon form:                        Reduced echelon form:
    [ 0 • ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]            [ 0 1 ∗ 0 0 ∗ ∗ 0 0 ∗ ∗ ]
    [ 0 0 0 • ∗ ∗ ∗ ∗ ∗ ∗ ∗ ]            [ 0 0 0 1 0 ∗ ∗ 0 0 ∗ ∗ ]
    [ 0 0 0 0 • ∗ ∗ ∗ ∗ ∗ ∗ ]            [ 0 0 0 0 1 ∗ ∗ 0 0 ∗ ∗ ]
    [ 0 0 0 0 0 0 0 • ∗ ∗ ∗ ]            [ 0 0 0 0 0 0 0 1 0 ∗ ∗ ]
    [ 0 0 0 0 0 0 0 0 • ∗ ∗ ]            [ 0 0 0 0 0 0 0 0 1 ∗ ∗ ]
    [ 0 0 0 0 0 0 0 0 0 0 0 ]            [ 0 0 0 0 0 0 0 0 0 0 0 ]

An echelon <strong>for</strong>m matrix is in reduced echelon <strong>for</strong>m when (in addition)<br />

4. The leading entry in each nonzero row is 1.<br />

5. Each leading 1 is the only nonzero entry in its column.<br />

The left part of Table 9.1 represents a 6-by-11 echelon <strong>for</strong>m matrix, where • represents<br />

the leading entry, and ∗ represents any number (either zero or nonzero). The right part of<br />

Table 9.1 represents a 6-by-11 reduced echelon <strong>for</strong>m matrix.<br />

Starting with the augmented matrix of a system of linear equations, the <strong>for</strong>ward phase of<br />

the row reduction process will produce a matrix in echelon <strong>for</strong>m. Continuing the process<br />

through the backward phase, we get a reduced echelon <strong>for</strong>m matrix. An important theorem<br />

is the following:<br />

Theorem 9.4 (Uniqueness Theorem) Each matrix is row-equivalent to one and only<br />

one reduced echelon <strong>for</strong>m matrix.<br />

Further, we can “read” the solutions to the original system from the reduced echelon <strong>for</strong>m<br />

of the augmented matrix.<br />

Pivot positions. A pivot position is a position of a leading entry in an echelon <strong>for</strong>m<br />

matrix. A column that contains a pivot position is called a pivot column. For example, in<br />

the left part of Table 9.1, each • is located at a pivot position, and columns 2, 4, 5, 8, and<br />

9 are pivot columns.<br />

A pivot is a nonzero number that either is used in a pivot position to create zeros or is<br />

changed into a leading 1, which in turn is used to create zeros. In general there is no more<br />

than one pivot in any row and no more than one pivot in any column.<br />

Example 9.5 (Echelon Forms) Consider the following 3-by-6 matrix:

    [ 0  3 −6  6 4 −5 ]
    [ 3 −7  8 −5 8  9 ]
    [ 3 −9 12 −9 6 15 ]

We first interchange R1 and R3:

    [ 3 −9 12 −9 6 15 ]
    [ 3 −7  8 −5 8  9 ]
    [ 0  3 −6  6 4 −5 ]

Column 1 is a pivot column and we use the 3 in the upper left corner to convert the 3 in row 2 to 0 (R2 − R1 replaces R2). In addition, R1 is replaced by (1/3)R1 for simplicity:

    [ 1 −3  4 −3 2  5 ]
    [ 0  2 −4  4 2 −6 ]
    [ 0  3 −6  6 4 −5 ]

Column 2 is a pivot column. We could use the current row 2, or interchange the second and third rows and use the new row two for the pivot. For simplicity, we use the 2 in the second row to convert the 3 in the third row to a zero (R3 − (3/2)R2 replaces R3). In addition, R2 is replaced by (1/2)R2 for simplicity:

    [ 1 −3  4 −3 2  5 ]
    [ 0  1 −2  2 1 −3 ]
    [ 0  0  0  0 1  4 ]

This matrix is in echelon form; columns 1, 2, and 5 are pivot columns.

Working with rows 3 and 2 (in that order) to “clear” columns 5 and 2, we get the following reduced echelon form matrix:

    [ 1 −3  4 −3 2  5 ]   [ 1 −3  4 −3 0 −3 ]   [ 1 0 −2 3 0 −24 ]
    [ 0  1 −2  2 1 −3 ] ~ [ 0  1 −2  2 0 −7 ] ~ [ 0 1 −2 2 0  −7 ]
    [ 0  0  0  0 1  4 ]   [ 0  0  0  0 1  4 ]   [ 0 0  0 0 1   4 ]

Note that if the original matrix was the augmented matrix of a 3-by-5 system of linear equations in x1, x2, x3, x4, x5, then the last matrix corresponds to the equivalent linear system x1 − 2x3 + 3x4 = −24, x2 − 2x3 + 2x4 = −7, x5 = 4. Thus,

    (s1, s2, s3, s4, s5) = (−24 + 2x3 − 3x4, −7 + 2x3 − 2x4, x3, x4, 4)

is a solution for any values of x3, x4.
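Echelon forms can also be produced by software. The sketch below (using SymPy; the rref method returns the reduced echelon form together with the indices of the pivot columns, counted from 0) reproduces the reduction above:

    # Reduced echelon form of the matrix in Example 9.5.
    from sympy import Matrix

    A = Matrix([[0,  3, -6,  6, 4, -5],
                [3, -7,  8, -5, 8,  9],
                [3, -9, 12, -9, 6, 15]])
    R, pivots = A.rref()
    print(pivots)   # (0, 1, 4): columns 1, 2, and 5 are the pivot columns
    print(R)        # matches the reduced echelon form above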

Solving linear systems. When solving linear systems,<br />

1. A basic variable is any variable that corresponds to a pivot column in the augmented<br />

matrix of the system, and a free variable is any nonbasic variable.<br />

2. From the equivalent system produced using the reduced echelon <strong>for</strong>m, we solve each<br />

equation <strong>for</strong> the basic variable in terms of the free variables (if any) in the equation.<br />

3. If there is at least one free variable, then the original linear system has an infinite<br />

number of solutions.<br />

4. If an echelon form of the augmented matrix has a row of the form

       [ 0 0 ··· 0 b ]   where b ≠ 0,

then the system is inconsistent. (Once we know the system is inconsistent, there is<br />

no need to continue to find the reduced echelon <strong>for</strong>m.)<br />



Example 9.6 (Echelon Forms, continued) Continuing with the last example, the basic<br />

variables are x1, x2, and x5; the free variables are x3 and x4. The free variables are used to<br />

characterize the infinite number of solutions to the 3-by-5 system.<br />

9.1.3 Vector and Matrix Equations<br />

In linear algebra, we let R^m be the set of m-by-1 matrices of real numbers:

    R^m = { x = [x1, x2, ..., xm]^T : xi ∈ R }.

Each element of R^m is called a point or a vector. If x = [x1, x2, ..., xm]^T is a vector in R^m, we say that xi is the i-th component of x.

The zero vector, O, is the vector all of whose components equal zero.<br />

Addition and scalar multiplication. If x,y ∈ R m and c ∈ R, then<br />

1. the vector sum x + y is the vector obtained by componentwise addition. That is, the<br />

vector whose i th component is xi + yi <strong>for</strong> each i. And,<br />

2. the scalar product cx is the vector obtained by multiplying each component by c.<br />

That is, the vector whose i th component is cxi <strong>for</strong> each i.<br />

Linear combinations and spans. If v1,v2,... ,vn ∈ R m and c1,c2,...,cn ∈ R, then<br />

w = c1v1 + c2v2 + · · · + cnvn<br />

is a linear combination of the vectors v1, v2, ..., vn. The span of the set {v1,v2,...,vn}<br />

is the collection of all linear combinations of the vectors v1, v2, ..., vn:<br />

Span{v1,v2,...,vn} = {c1v1 + c2v2 + · · · + cnvn : ci ∈ R} ⊆ R m .<br />

The span of a set of vectors always contains the zero vector O (let each ci = 0).<br />

Vector equations. Consider an m-by-n system of equations in the variables x1, x2, ...,<br />

xn, where each equation is of the <strong>for</strong>m<br />

ai1x1 + ai2x2 + · · · + ainxn = bi <strong>for</strong> i = 1,2,... ,m.<br />

Let aj be the vector of coefficients of xj, and b be the vector of right hand side values.<br />



The m-by-n system can be rewritten as a vector equation:<br />

x1a1 + x2a2 + · · · + xnan = b.<br />

The system is consistent if and only if b ∈ Span{a1,a2,... ,an}.<br />

Matrix equations. The coefficient matrix A of the m-by-n system above is the m-by-n<br />

matrix whose columns are the ai’s: A = [a1 a2 · · · an ].<br />

If x ∈ R^n is the vector of unknowns, then the matrix product of A with x is

    Ax = [ a1 a2 ··· an ] [x1, x2, ..., xn]^T = x1 a1 + x2 a2 + ··· + xn an,

and the matrix equation for the system is Ax = b.

Note that the matrix product of a linear combination of vectors is equal to that linear combination of matrix products:

    A(c1 v1 + c2 v2 + ··· + cn vn) = c1 Av1 + c2 Av2 + ··· + cn Avn

<strong>for</strong> all ci ∈ R and vi ∈ R m . (Multiplication by A distributes over addition, and we can<br />

factor out constants.)<br />

Consistency <strong>for</strong> all b. The following theorem gives criteria <strong>for</strong> determining when a<br />

system of equations is consistent <strong>for</strong> all b ∈ R m :<br />

Theorem 9.7 (Consistency Theorem) Let A be an m-by-n coefficient matrix, and x<br />

be an n-by-1 vector of unknowns. The following statements are equivalent:<br />

1. The equation Ax = b has a solution <strong>for</strong> all b ∈ R m .<br />

2. Each b ∈ R m is a linear combination of the coefficient vectors aj.<br />

3. Span{a1,a2,...,an} = R m .<br />

4. A has a pivot position in each row.<br />

Note that the number of pivot positions is at most the minimum of m and n. Thus, by the<br />

consistency theorem, if m > n (there are more equations than unknowns), a system with<br />

coefficient matrix A cannot be consistent <strong>for</strong> all b ∈ R m .<br />

Homogeneous systems and solution sets. A homogeneous system is a linear system<br />

with matrix equation Ax = O.<br />

Homogeneous systems are consistent since x = O is a solution to the system (called the<br />

trivial solution). In addition, each linear combination of solutions is a solution. Thus, the<br />

solution set is a span of some set of vectors.<br />



Example 9.8 (Trivial Solution Only) Consider the matrix equation

    [ 2 −2 ]                 [ 0 ]
    [ 1  2 ] [x1, x2]^T  =   [ 0 ].
    [ 1  8 ]                 [ 0 ]

Since

    [ 2 −2 0 ]         [ 1 0 0 ]
    [ 1  2 0 ] ~ ··· ~ [ 0 1 0 ]   ⇒   x1 = 0,  x2 = 0,
    [ 1  8 0 ]         [ 0 0 0 ]

the system has the trivial solution only. The solution set is Span{O} = {O} ⊆ R^2.

Example 9.9 (One Generating Vector) Consider the matrix equation

    [ 2 4  −6 ]                     [ 0 ]
    [ 4 8 −10 ] [x1, x2, x3]^T  =   [ 0 ].

Since

    [ 2 4  −6 0 ]         [ 1 2 0 0 ]        x1 = −2x2
    [ 4 8 −10 0 ] ~ ··· ~ [ 0 0 1 0 ]   ⇒    x2 is free
                                             x3 = 0,

solutions can be written parametrically as follows:

    s = [ −2x2, x2, 0 ]^T = x2 [ −2, 1, 0 ]^T = x2 v

where x2 is free. The solution set is Span{v} ⊆ R^3.

Example 9.10 (Two Generating Vectors) Consider the matrix equation

    [ 2 4 −6 8 ]                         [ 0 ]
    [ 4 8  4 0 ] [x1, x2, x3, x4]^T  =   [ 0 ].
    [ 3 6 −1 4 ]                         [ 0 ]

Since

    [ 2 4 −6 8 0 ]         [ 1 2 0  1 0 ]        x1 = −2x2 − x4
    [ 4 8  4 0 0 ] ~ ··· ~ [ 0 0 1 −1 0 ]   ⇒    x2 is free
    [ 3 6 −1 4 0 ]         [ 0 0 0  0 0 ]        x3 = x4
                                                 x4 is free,

solutions can be written parametrically as:

    s = [ −2x2 − x4, x2, x4, x4 ]^T = x2 [ −2, 1, 0, 0 ]^T + x4 [ −1, 0, 1, 1 ]^T = x2 v1 + x4 v2

where x2 and x4 are free. The solution set is Span{v1, v2} ⊆ R^4.
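The generating vectors can be recovered by a null-space computation. The sketch below (using SymPy) returns a basis for the solution set of Ax = O:

    # Generating vectors for the homogeneous system in Example 9.10.
    from sympy import Matrix

    A = Matrix([[2, 4, -6, 8],
                [4, 8,  4, 0],
                [3, 6, -1, 4]])
    for v in A.nullspace():
        print(v.T)   # Matrix([[-2, 1, 0, 0]]) and Matrix([[-1, 0, 1, 1]])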


Nonhomogeneous systems and solution sets. A nonhomogeneous system is a linear system with matrix equation Ax = b where b ≠ O.

Theorem 9.11 (Solution Sets) Suppose that Ax = b is consistent and let p be a<br />

solution. Then the solution set of the system Ax = b is the set of all vectors of the <strong>for</strong>m<br />

s = p + vh<br />

where vh is a solution of the homogeneous system Ax = O.<br />

Example 9.12 (Two Generating Vectors, continued) Consider the matrix equation

    [ 2 4 −6 8 ]                         [ 18 ]
    [ 4 8  4 0 ] [x1, x2, x3, x4]^T  =   [  4 ].
    [ 3 6 −1 4 ]                         [ 11 ]

Since

    [ 2 4 −6 8 18 ]         [ 1 2 0  1  3 ]        x1 = 3 − 2x2 − x4
    [ 4 8  4 0  4 ] ~ ··· ~ [ 0 0 1 −1 −2 ]   ⇒    x2 is free
    [ 3 6 −1 4 11 ]         [ 0 0 0  0  0 ]        x3 = −2 + x4
                                                   x4 is free,

solutions can be written parametrically as:

    s = [ 3 − 2x2 − x4, x2, −2 + x4, x4 ]^T
      = [ 3, 0, −2, 0 ]^T + x2 [ −2, 1, 0, 0 ]^T + x4 [ −1, 0, 1, 1 ]^T
      = p + (x2 v1 + x4 v2) = p + vh

where x2 and x4 are free. (The solution set is a shift of a span.)
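The same parametric description can be produced symbolically. The sketch below (using SymPy's linsolve; the unknowns x2 and x4 reappear as the free parameters) solves the system of Example 9.12:

    # General solution of the nonhomogeneous system in Example 9.12.
    from sympy import Matrix, symbols, linsolve

    x1, x2, x3, x4 = symbols('x1 x2 x3 x4')
    A = Matrix([[2, 4, -6, 8],
                [4, 8,  4, 0],
                [3, 6, -1, 4]])
    b = Matrix([18, 4, 11])
    print(linsolve((A, b), x1, x2, x3, x4))
    # -> {(-2*x2 - x4 + 3, x2, x4 - 2, x4)}, i.e. p + x2*v1 + x4*v2 as above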

9.2 Matrix Operations<br />

An m-by-n matrix A is a rectangular array with m rows and n columns:

    A = [ a11 a12 ··· a1n ]
        [ a21 a22 ··· a2n ]
        [  .   .  ···  .  ]
        [ am1 am2 ··· amn ]   = [ a1 a2 ··· an ].

The (i,j)-entry of A (denoted by aij) is the element in the row i and column j position. In the first term above, the entries of A are shown explicitly. In the second term, A is written as n columns, where each column is a vector in R^m. (See Section 7.1 also.)

Square matrices. A square matrix is a matrix with m = n. An n-by-n square matrix is<br />

often called a square matrix of order n. If A is a square matrix of order n, then the diagonal<br />

elements of A are the elements: a11, a22, ..., ann.<br />

Symmetric matrices. Let A be a square matrix. Then A is said to be a symmetric<br />

matrix if aij = aji <strong>for</strong> all i and j.<br />



Triangular matrices. Let A be a square matrix. Then A is said to be an upper triangular matrix if aij = 0 when i > j; A is said to be a lower triangular matrix if aij = 0 when i < j; and A is said to be a triangular matrix if it is either upper or lower triangular.

Diagonal matrices, identity matrices. The square matrix A is said to be a diagonal matrix when aij = 0 whenever i ≠ j.

Given d1, d2, ..., dn, Diag(d1, d2, ..., dn) is used to denote the diagonal matrix of order n whose diagonal elements are the di's:

    Diag(d1, d2, ..., dn) = [ d1  0  0 ···  0 ]
                            [  0 d2  0 ···  0 ]
                            [  0  0 d3 ···  0 ]
                            [  .  .  . ···  . ]
                            [  0  0  0 ··· dn ].

The identity matrix of order n, In, is the diagonal matrix with di = 1 <strong>for</strong> each i. We<br />

sometimes omit the subscript when the order of the matrix is clear from the problem.<br />

Note that diagonal matrices are symmetric, upper triangular, and lower triangular.<br />

Vectors and matrices with all ones. For convenience, we let 1m be the vector in R m<br />

all of whose components equal one, and we let Jm×n be the m-by-n matrix all of whose<br />

entries equal one. We sometimes omit the subscripts when the orders of the vectors and<br />

matrices are clear from the problem.<br />

9.2.1 Basic Operations<br />

In this section, we consider the operations of addition and scalar multiplication, linear<br />

combinations of matrices, matrix transpose and matrix product.<br />

Addition and scalar multiplication. Let A and B be m-by-n matrices, and let c ∈ R<br />

be a scalar. Then matrix sum and scalar product are defined as follows:<br />

1. The matrix sum of A and B is the m-by-n matrix<br />

A + B = [a1 + b1 a2 + b2 · · · an + bn ].<br />

That is, A + B is the matrix with (i,j)-entry (A + B) ij = aij + bij.<br />

2. The scalar product of A by c is the m-by-n matrix<br />

cA = [ ca1 ca2 · · · can ].<br />

That is, cA is the matrix with (i,j)-entry (cA) ij = caij.<br />



Linear combinations of matrices. Let A and B be m-by-n matrices, and c and d be scalars. Then the m-by-n matrix

    cA + dB = [ ca1 + db1  ca2 + db2  ···  can + dbn ]

is said to be a linear combination of A and B. The linear combination of A and B has (i,j)-entry

    (cA + dB)ij = c aij + d bij.

For example, if

    A = [ 1 −2 ]            [  0  7 ]                     [  2 −11 ]
        [ 3 −4 ]   and  B = [  4  2 ],   then   2A − B =  [  2 −10 ].
        [ 5 −6 ]            [ −2 −6 ]                     [ 12  −6 ]
Matrix transpose. The transpose of the m-by-n matrix A, denoted by A^T, is the n-by-m matrix whose (i,j)-entry is aji: (A^T)ij = aji.

For convenience, we let αi ∈ R^n (for i = 1, 2, ..., m) be the columns of A^T and write

    A = [ α1^T ]
        [  ...  ]
        [ αm^T ]

when we want to view A as a “stack” of m row vectors.

Matrix product. Let A be an m-by-n matrix and let B = [ b1 b2 ··· bp ] be an n-by-p matrix. Then the matrix product, AB, is the m-by-p matrix whose j-th column is the product of A with the j-th column of B:

    AB = [ Ab1 Ab2 ··· Abp ].

Equivalently, AB is the matrix whose (i,j)-entry is

    (AB)ij = ai1 b1j + ai2 b2j + ··· + ain bnj.

(See Section 7.1.7 also.)

Dot and matrix products. Another convenient way to write the matrix product is in terms of dot products. Recall that if v and w are vectors in R^n, then their dot product is the scalar

    v·w = v^T w = v1 w1 + v2 w2 + ··· + vn wn.

Then

    AB = [ α1^T ]                        [ α1^T b1  α1^T b2  ···  α1^T bp ]
         [  ...  ] [ b1 b2 ··· bp ]  =   [ α2^T b1  α2^T b2  ···  α2^T bp ]
         [ αm^T ]                        [    .        .     ···     .    ]
                                         [ αm^T b1  αm^T b2  ···  αm^T bp ].


Properties of the basic operations. Some properties of the basic operations are given<br />

below. In each case, the sizes of the matrices are assumed to be compatible.<br />

1. Commutative: A + B = B + A.<br />

2. Associative: A + (B + C) = (A + B) + C and A(BC) = (AB)C.<br />

3. Distributive: A(B + C) = (AB) + (AC), (B + C)A = (BA) + (CA).<br />

4. Scalar Multiples: Given c ∈ R:<br />

c(AB) = (cA)B = A(cB), c(A + B) = (cA) + (cB) and (cA) T = cA T .<br />

5. Identity: ImA = A = AIn when A is an m-by-n matrix.<br />

6. Transpose of Transpose:<br />

<br />

A T T<br />

= A.<br />

7. Transpose of Sum: (A + B) T = A T + B T .<br />

8. Transpose of Product: (AB) T = B T A T .<br />

Note, in particular, that the matrix sum is commutative but the matrix product is not. In<br />

fact, it is possible that AB is defined but that BA is undefined.<br />
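A small numerical example makes the point. The sketch below (using NumPy; the two matrices are arbitrary) shows that A + B = B + A while AB and BA differ:

    # Matrix sum is commutative; matrix product generally is not.
    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[0, 1], [1, 0]])
    print(np.array_equal(A + B, B + A))   # True
    print(A @ B)                          # [[2 1] [4 3]]
    print(B @ A)                          # [[3 4] [1 2]]  -- not the same as A @ B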

9.2.2 Inverses of Square Matrices<br />

The n-by-n matrix A is said to be invertible if there exists a square matrix C satisfying<br />

AC = CA = In where In is the n-by-n identity matrix.<br />

Otherwise, A is said to be not invertible. Square matrices that are not invertible are also<br />

called singular matrices; invertible square matrices are also called nonsingular matrices.<br />

If A is invertible, then the matrix C is unique. To see this, suppose that the matrix D also<br />

satisfies the equality AD = DA = In. Then, using properties of matrix products,<br />

D = DIn = D(AC) = (DA)C = InC = C.<br />

If A is invertible, then A −1 is used to denote the matrix inverse of A.<br />

Special case: 2-by-2 matrices. Let A = [ a b ; c d ] be a 2-by-2 matrix. Then

    A^{-1} = (1/(ad − bc)) [  d −b ]   when ad − bc ≠ 0.
                           [ −c  a ]

If ad − bc = 0, then A does not have an inverse. (Note: ad − bc is the determinant of A.)


Special case: diagonal matrices. Let A = Diag(a1, a2, ..., an) be a diagonal matrix. If ai ≠ 0 for each i, then A is invertible with inverse

    A^{-1} = Diag(1/a1, 1/a2, ..., 1/an).

If ai = 0 for some i, then A is not invertible.

Demonstration: Note first that if C = Diag(c1, c2, ..., cn), then

    AC = CA = Diag(a1 c1, a2 c2, ..., an cn).

To obtain AC = CA = In, we need ai ci = 1 for each i. If ai = 0 for some i, then we cannot find solutions; otherwise, we let ci = 1/ai for each i.

Properties of inverses. Let A and B be invertible square matrices of order n and let c be a nonzero constant. Then

1. Inverse of Identity: In^{-1} = In for each n.

2. Inverse of Inverse: (A^{-1})^{-1} = A.

3. Inverse of Product: (AB)^{-1} = B^{-1} A^{-1}.

4. Inverse of Scalar Multiple: (cA)^{-1} = (1/c) A^{-1}.

5. Inverse of Transpose: (A^T)^{-1} = (A^{-1})^T.

Algorithm <strong>for</strong> finding an inverse. Let A be an n-by-n matrix. Form the n-by-2n<br />

augmented matrix whose last n columns are the columns of the identity matrix: [A In ].<br />

If A is row equivalent to the identity matrix, then the n-by-2n augmented matrix will be<br />

trans<strong>for</strong>med as follows:<br />

[A In ] ∼ · · · ∼ [In A −1 ]<br />

That is, the last n columns will be the inverse of A. Otherwise, A does not have an inverse.<br />

Partial justification: For convenience, let C = A^{-1} and In = [ e1 e2 ··· en ]. Then AC = In can be written as follows:

    AC = [ Ac1 Ac2 ··· Acn ] = [ e1 e2 ··· en ].

Equivalently,

    Ac1 = e1,  Ac2 = e2,  ...,  Acn = en.

By using the n-by-2n augmented matrix, we are solving all n systems at once. The solutions<br />

to the n systems <strong>for</strong>m the n columns of the inverse matrix.<br />
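The row-reduction algorithm can be written out directly. The sketch below (plain NumPy; the pivoting strategy and tolerance are implementation choices, not part of the definition) reduces [A I] to [I A^{-1}]:

    # Compute an inverse by row-reducing the augmented matrix [A | I].
    import numpy as np

    def inverse_via_row_reduction(A):
        n = A.shape[0]
        M = np.hstack([A.astype(float), np.eye(n)])    # augmented matrix [A I]
        for j in range(n):
            p = j + np.argmax(np.abs(M[j:, j]))        # choose a pivot row (partial pivoting)
            if np.isclose(M[p, j], 0.0):
                raise ValueError("matrix is not invertible")
            M[[j, p]] = M[[p, j]]                      # interchange
            M[j] /= M[j, j]                            # scale the pivot row to get a leading 1
            for i in range(n):
                if i != j:
                    M[i] -= M[i, j] * M[j]             # replacement: clear the rest of column j
        return M[:, n:]

    A = np.array([[0, 1, 2], [1, 0, 3], [4, -3, 8]])
    print(inverse_via_row_reduction(A))                # agrees with np.linalg.inv(A)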

<br />

Example 9.13 (2-by-2 Invertible Matrix) Let A = [ −7 8 ; 2 −2 ]. Using the special formula for 2-by-2 matrices,

    A^{-1} = (1/(−2)) [ −2 −8 ]   =   [ 1   4  ]
                      [ −2 −7 ]       [ 1  7/2 ].

Using the algorithm for finding inverses,

    [ A I2 ] = [ −7  8 | 1 0 ] ~ [ −7  8  | 1   0 ] ~ [ −7 8 | 1  0  ]
               [  2 −2 | 0 1 ]   [  0 2/7 | 2/7 1 ]   [  0 1 | 1 7/2 ]

                               ~ [ −7 0 | −7 −28 ] ~ [ 1 0 | 1  4  ]  =  [ I2 A^{-1} ].
                                 [  0 1 |  1 7/2 ]   [ 0 1 | 1 7/2 ]

The last two columns of the augmented matrix match the inverse computed above.

Example 9.14 (3-by-3 Invertible Matrix) Let A = [ 0 1 2 ; 1 0 3 ; 4 −3 8 ]. Since

    [ A I3 ] = [ 0  1 2 | 1 0 0 ]           [ 1 0 0 | −4.5  7.0 −1.5 ]
               [ 1  0 3 | 0 1 0 ]  ~ ··· ~  [ 0 1 0 | −2.0  4.0 −1.0 ],
               [ 4 −3 8 | 0 0 1 ]           [ 0 0 1 |  1.5 −2.0  0.5 ]

A is invertible with inverse

    A^{-1} = [ −4.5  7.0 −1.5 ]
             [ −2.0  4.0 −1.0 ].
             [  1.5 −2.0  0.5 ]

Example 9.15 (3-by-3 Singular Matrix) Let A = [ 0 1 2 ; 1 0 3 ; 4 −3 6 ]. Since

    [ A I3 ] = [ 0  1 2 | 1 0 0 ]           [ 1 0 3 | 0  1 0 ]
               [ 1  0 3 | 0 1 0 ]  ~ ··· ~  [ 0 1 2 | 1  0 0 ],
               [ 4 −3 6 | 0 0 1 ]           [ 0 0 0 | 3 −4 1 ]

A is not invertible. (Note that the third row of the left block is a row of zeros, so A cannot be row-reduced to the identity.)

Elementary matrices and matrix inverses. An elementary matrix, E, is one obtained<br />

by per<strong>for</strong>ming a single elementary row operation on an identity matrix. Since each elementary<br />

row operation is reversible, each elementary matrix is invertible.<br />

For example, let n = 4.<br />

1. (Replacement) Starting with I4, we replace R3 with R3+aR1. The elementary matrix<br />

is E shown on the left below. The operation is reversed by replacing R3 with R3−aR1.<br />

The inverse matrix is shown on the right below.<br />

    E = [ 1 0 0 0 ]                  [  1 0 0 0 ]
        [ 0 1 0 0 ]    and  E^{-1} = [  0 1 0 0 ].
        [ a 0 1 0 ]                  [ −a 0 1 0 ]
        [ 0 0 0 1 ]                  [  0 0 0 1 ]

2. (Scaling) Starting with I4, we scale R2 using the nonzero constant c. The elementary matrix is E shown on the left below. The operation is reversed by scaling R2 by the constant 1/c. The inverse matrix is shown on the right below.

    E = [ 1 0 0 0 ]                  [ 1  0  0 0 ]
        [ 0 c 0 0 ]    and  E^{-1} = [ 0 1/c 0 0 ].
        [ 0 0 1 0 ]                  [ 0  0  1 0 ]
        [ 0 0 0 1 ]                  [ 0  0  0 1 ]

3. (Interchange) Starting with I4, we interchange R2 and R4. The elementary matrix is E shown below. The operation is reversed by interchanging R2 and R4 again. Thus, E^{-1} = E. Note also that E^2 = I4.

    E = [ 1 0 0 0 ]
        [ 0 0 0 1 ]   = E^{-1}.
        [ 0 0 1 0 ]
        [ 0 1 0 0 ]

Let A be an n-by-n matrix. Elementary row operations on A correspond to left multiplication by the corresponding elementary matrix. If a sequence of p elementary row operations (with elementary matrices E1, E2, ..., Ep) transforms A to the identity matrix, then

    (Ep(Ep−1 ··· (E2(E1 A)))) = (Ep Ep−1 ··· E2 E1) A = In,

the inverse of A is the following product of elementary matrices:

    A^{-1} = Ep Ep−1 ··· E2 E1,

and A is the product of the inverses in the opposite order:

    A = E1^{-1} E2^{-1} ··· Ep−1^{-1} Ep^{-1}.

These computations can be used to formally justify the algorithm for finding inverses.

Example 9.16 (2-by-2 Matrix, continued) Continuing with the 2-by-2 example, four elementary row operations were used. Written as a matrix product:

    E4 E3 E2 E1 A = [ −1/7 0 ] [ 1 −8 ] [ 1  0  ] [  1  0 ] [ −7  8 ]   =   [ 1 0 ]   = I2.
                    [   0  1 ] [ 0  1 ] [ 0 7/2 ] [ 2/7 1 ] [  2 −2 ]       [ 0 1 ]

Thus,

    A^{-1} = E4 E3 E2 E1 = [ 1  4  ].
                           [ 1 7/2 ]

Relationship to solving systems of equations. The following theorem provides an<br />

important relationship between matrix inverses and systems of linear equations.<br />

Theorem 9.17 (Uniqueness Theorem) Let A be an invertible n-by-n matrix. Then<br />

the matrix equation Ax = b has the unique solution x = A −1 b <strong>for</strong> any b ∈ R n .<br />

9.2.3 Matrix Factorizations<br />

A factorization of a matrix A is an equation expressing A as the product of two or more<br />

matrices. This section introduces the LU-factorization, which supports an important practical<br />

method <strong>for</strong> solving large systems of linear equations. Additional factorizations will be<br />

considered later in these notes.<br />



LU-Factorization. Let A be an m-by-n matrix. An LU-factorization of A is a matrix<br />

equation A = LU, where<br />

1. L is an m-by-m unit lower triangular matrix, and<br />

2. U is an m-by-n (upper) echelon <strong>for</strong>m matrix.<br />

Note that L is said to be a unit lower triangular matrix if L is a lower triangular matrix<br />

whose diagonal elements ℓii = 1 <strong>for</strong> all i.<br />

The m-by-n matrix A has an LU-factorization when (and only when) A is row equivalent<br />

to an echelon <strong>for</strong>m matrix using a sequence of row replacements only:<br />

EpEp−1 · · · E2E1A = U =⇒ A = LU where L = (EpEp−1 · · · E2E1) −1 .<br />

In this equation, the Ei’s are the elementary matrices representing the row replacements.<br />

Example 9.18 (3-by-3 Matrix) Consider the following matrix and equivalent forms:

    A = [  3 1   0 ]   [ 3 1   0 ]   [ 3 1  0 ]
        [ −6 0  −4 ] ~ [ 0 2  −4 ] ~ [ 0 2 −4 ].
        [  0 6 −10 ]   [ 0 6 −10 ]   [ 0 0  2 ]

The last matrix is the echelon matrix U. Since two row replacements were used,

    L = E1^{-1} E2^{-1} = [  1 0 0 ] [ 1 0 0 ]   [  1 0 0 ]
                          [ −2 1 0 ] [ 0 1 0 ] = [ −2 1 0 ].
                          [  0 0 1 ] [ 0 3 1 ]   [  0 3 1 ]

Example 9.19 (3-by-4 Matrix) Consider the following matrix and equivalent forms:

    A = [  2  4 −1  5 ]   [ 2  4 −1  5 ]   [ 2 4 −1 5 ]
        [ −4 −5  3 −8 ] ~ [ 0  3  1  2 ] ~ [ 0 3  1 2 ].
        [  2 −5 −4  1 ]   [ 0 −9 −3 −4 ]   [ 0 0  0 2 ]

The last matrix is the echelon matrix U. Since three row replacements were used,

    L = E1^{-1} E2^{-1} E3^{-1} = [  1 0 0 ] [ 1 0 0 ] [ 1  0 0 ]   [  1  0 0 ]
                                  [ −2 1 0 ] [ 0 1 0 ] [ 0  1 0 ] = [ −2  1 0 ].
                                  [  0 0 1 ] [ 1 0 1 ] [ 0 −3 1 ]   [  1 −3 1 ]

Application. If A = LU and Ax = b is consistent, then the system can be solved<br />

efficiently. Consider the following setup:<br />

b = Ax = (LU)x = L(Ux) = Ly where y = Ux.<br />

The efficient solution is done in two steps:<br />

1. Solve Ly = b <strong>for</strong> y.<br />

2. Solve Ux = y <strong>for</strong> x.<br />

The first step is essentially a <strong>for</strong>ward substitution method, and the second step is essentially<br />

a backward substitution method. The process is useful when the same coefficient matrix is<br />

used with several different right hand sides.<br />
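Standard libraries implement exactly this two-step approach. The sketch below (using SciPy; lu_factor applies partial pivoting, so it actually factors a row-permuted copy of A) factors once and then reuses the factorization for several right hand sides:

    # Solve Ax = b repeatedly using one LU-factorization.
    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    A = np.array([[ 3.0, 1.0,   0.0],
                  [-6.0, 0.0,  -4.0],
                  [ 0.0, 6.0, -10.0]])     # the matrix of Example 9.18
    lu, piv = lu_factor(A)                 # factor once ...
    for b in (np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 4.0])):
        print(lu_solve((lu, piv), b))      # ... then solve for each right hand side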



Permutation matrices and factorizations. A permutation matrix P is a product of<br />

elementary matrices corresponding to row interchanges.<br />

In the LU-factorization, L encodes in<strong>for</strong>mation about row replacements, and the pivot<br />

positions of U encode in<strong>for</strong>mation about scalings. If row interchanges are needed to solve a<br />

system, then a permutation matrix can be used to encode these operations. Specifically, let<br />

P be the product of interchange matrices needed to bring the rows of A into proper order<br />

<strong>for</strong> an LU-factorization. Then write PA = LU. (If row interchanges are needed, then A<br />

does not have an LU-factorization. The matrix PA does have an LU-factorization.)<br />

9.2.4 Partitioned Matrices<br />

A partitioned matrix (or blocked matrix) is one whose entries are themselves matrices. An m-by-n matrix can be partitioned in many ways, including the following 2-by-2 form:

    A = [ A11 A12 ]   where Aij is a pi-by-qj matrix,
        [ A21 A22 ]

p1 + p2 = m and q1 + q2 = n. (This can be generalized to 3-by-3 forms, etc.)

Each submatrix in a partitioned matrix is called a block. Partitioning is especially useful when you can take advantage of block patterns: for example, if certain Aij's are zero matrices, identity matrices, diagonal matrices, or matrices all of whose entries are equal.

If all submatrices are conformable, then partitioned matrices can be added or multiplied by adding or multiplying the appropriate blocks. In particular,

    AB = [ A11 A12 ] [ B11 B12 ]   [ C11 C12 ]
         [ A21 A22 ] [ B21 B22 ] = [ C21 C22 ]

where Cij = Ai1 B1j + Ai2 B2j for each i and j.

This section considers inverses of partitioned square matrices. Additional methods <strong>for</strong><br />

partitioned matrices will be considered in later sections of these notes.<br />

Block diagonal matrices. A 2-by-2 block diagonal matrix is a square matrix A of order n written as follows

    A = [ A11  O  ]   where A11 is p-by-p, A22 is q-by-q,
        [  O  A22 ]

each O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

A is invertible if and only if both A11 and A22 are invertible. Then,

    A^{-1} = [ A11^{-1}     O     ].
             [    O      A22^{-1} ]

Example 9.20 (4-by-4 Matrix) Let A be the 4-by-4 matrix shown on the left below. The inverse of A can be computed using the formula for 2-by-2 block diagonal matrices:

    A = [  8 −5 0 0 ]                   [ 2 5  0   0  ]
        [ −3  2 0 0 ]    =⇒   A^{-1} =  [ 3 8  0   0  ].
        [  0  0 9 4 ]                   [ 0 0  1  −2  ]
        [  0  0 4 2 ]                   [ 0 0 −2  9/2 ]


Block upper triangular matrices. A 2-by-2 block upper triangular matrix is a square matrix A of order n written as follows

    A = [ A11 A12 ]   where A11 is p-by-p, A12 is p-by-q, A22 is q-by-q,
        [  O  A22 ]

O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

A is invertible if and only if both A11 and A22 are invertible. Then,

    A^{-1} = [ A11^{-1}     C     ]    where C = −A11^{-1} A12 A22^{-1}.
             [    O      A22^{-1} ]

In particular, if A11 and A22 are identity matrices, then

    A^{-1} = [ Ip −A12 ].
             [ O    Iq ]

Block lower triangular matrices. A 2-by-2 block lower triangular matrix is a square matrix A of order n written as follows

    A = [ A11  O  ]   where A11 is p-by-p, A21 is q-by-p, A22 is q-by-q,
        [ A21 A22 ]

O is a zero matrix, and p + q = n. (This can be generalized to 3-by-3, etc.)

Since

    A^T = [ A11^T A21^T ]
          [   O   A22^T ]

is block upper triangular, the method above easily generalizes to handle this case.

Block elimination and Schur complements. Let M be a square matrix of order n written as follows:

    M = [ A B ]   where the diagonal submatrices are square,
        [ C D ]

and assume that A is invertible. Then block elimination subtracts CA^{-1} times the first row from the second row. This leaves the Schur complement of A in the lower right corner of the resulting block upper triangular matrix, as illustrated below:

    [    I     O ] [ A B ]   [ A       B        ]
    [ −CA^{-1} I ] [ C D ] = [ O  D − CA^{-1} B ].

If the Schur complement (D − CA^{-1} B) is invertible, then the inverse of M can be found in block form using methods for block triangular matrices.

matrix) and its Schur complement:<br />

M =<br />

<br />

A B<br />

=<br />

C D<br />

⎡<br />

⎢<br />

⎣<br />

3 1 0 −42 0<br />

4 1 −2 0 −1<br />

2 0 −5 1 −7<br />

1 0 1 2 1<br />

1 1 1 3 4<br />

⎤<br />

⎥<br />

⎦<br />

206<br />

and (D − CA −1 B) =<br />

<br />

−289 −13<br />

378 17<br />

<br />

.<br />

<br />

,


Since A, D and (D − CA^{-1}B) are invertible, we can compute M^{-1} using block methods. Using properties of inverses,

    M^{-1} = [ A       B        ]^{-1} [    I     O ]
             [ O  D − CA^{-1}B  ]      [ −CA^{-1} I ]

           = [ −5   5  −2 −134 −103 ] [  1  0  0 0 0 ]   [ −16  119  −95 −134 −103 ]
             [ 16 −15   6 1116  855 ] [  0  1  0 0 0 ]   [ 133 −987  789 1116  855 ]
             [ −2   2  −1  479  366 ] [  0  0  1 0 0 ] = [  57 −423  338  479  366 ].
             [  0   0   0   17   13 ] [  7 −7  3 1 0 ]   [   2  −15   12   17   13 ]
             [  0   0   0 −378 −289 ] [ −9  8 −3 0 1 ]   [ −45  334 −267 −378 −289 ]

9.3 Determinants of Square Matrices

Determinants of square matrices were introduced in Section 7.1.8. This section gives additional<br />

in<strong>for</strong>mation about determinants.<br />

Recursive or cofactor definition. Let A be a square matrix of order n, and let Aij be the submatrix obtained by eliminating the i-th row and the j-th column of A. Then the determinant of A can be defined recursively as follows:

    det(A) = Σ_{j=1}^n (−1)^{1+j} a1j det(A1j).

(The determinant has been computed by expanding across the first row.)

The value of the determinant remains the same if we expand across the i-th row:

    det(A) = Σ_{j=1}^n (−1)^{i+j} aij det(Aij),

or if we expand down the j-th column:

    det(A) = Σ_{i=1}^n (−1)^{i+j} aij det(Aij),

for any i or j. The number

    cij = (−1)^{i+j} det(Aij)

is often called the ij-cofactor of A.


Properties of determinants. Let A and B be square matrices of order n. Then

1. Transpose: det(A^T) = det(A).

2. Triangular: If A is a triangular matrix, then det(A) = a11 a22 ··· ann.

3. Product: det(AB) = det(A) det(B).

4. Idempotent: If A^2 = A (A is an idempotent matrix), then det(A) = 0 or 1.

5. Inverse: A is an invertible matrix if and only if det(A) ≠ 0. Further, if the matrix A is invertible, then det(A^{-1}) = 1/det(A).

6. Elementary Matrices: Let E be an elementary matrix. Then

       det(E) =  1   when E corresponds to row replacement,
              = −1   when E corresponds to row interchange,
              =  c   when E corresponds to scaling a row by c.

7. Zeroes: If A has a row or column of zeroes, then det(A) = 0.

8. Identical: If two rows or two columns of A are identical, then det(A) = 0.

Example 9.22 (3-by-3 Invertible Matrix, continued) If A = [ 0 1 2 ; 1 0 3 ; 4 −3 8 ], then

    det(A) = 0 · | 0  3 |  −  1 · | 1 3 |  +  2 · | 1  0 |  =  0 − (−4) + 2(−3) = −2 ≠ 0.
                 | −3 8 |         | 4 8 |         | 4 −3 |

Example 9.23 (3-by-3 Singular Matrix, continued) If A = [ 0 1 2 ; 1 0 3 ; 4 −3 6 ], then

    det(A) = 0 · | 0  3 |  −  1 · | 1 3 |  +  2 · | 1  0 |  =  0 − (−6) + 2(−3) = 0.
                 | −3 6 |         | 4 6 |         | 4 −3 |

Relationship to pivots. Consider the square matrix A of order n. Let K be a product of elementary matrices needed to bring A into echelon form, and U be the resulting echelon form matrix: KA = U.

Since det(K) ≠ 0 by Properties 3 and 5, and det(U) = Π_i uii by Property 2,

    det(A) = (1/det(K)) Π_i uii   (using Property 3).

Thus, det(A) = 0 when there are fewer than n pivot positions.
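These properties are easy to spot-check numerically. The sketch below (using NumPy; the second matrix B is arbitrary) verifies Properties 1, 3 and 5 for the matrix of Example 9.22:

    # Spot checks of determinant properties 1, 3, and 5.
    import numpy as np

    A = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [4.0, -3.0, 8.0]])   # det(A) = -2
    B = np.array([[2.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 3.0]])

    print(np.linalg.det(A.T), np.linalg.det(A))                         # equal
    print(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))    # equal
    print(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A))        # equal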


Reducing to echelon form. The determinant of A can be derived by reducing A to an echelon form matrix and observing the following:

1. If a multiple of one row is added to another, then the determinant is unchanged.

2. If two rows are interchanged, then the determinant changes sign.

3. If a row is multiplied by c, then the determinant is multiplied by c.

For example,

    | 3 6 6 |       | 1 2 2 |       | 1 2  2 |        | 1 2  2 |
    | 1 2 1 |  =  3 | 1 2 1 |  =  3 | 0 0 −1 |  =  −3 | 0 2  2 |  =  −3(−2)  =  6.
    | 1 4 4 |       | 1 4 4 |       | 0 2  2 |        | 0 0 −1 |

<br />

Explicit formula for the inverse using cofactors. If A is invertible, then the inverse of A can be written explicitly in terms of determinants as follows:

    A^{-1} = (1/det(A)) [ c11 c12 ··· c1n ]^T
                        [ c21 c22 ··· c2n ]        where cij = (−1)^{i+j} det(Aij).
                        [  .   .  ···  .  ]
                        [ cn1 cn2 ··· cnn ]

(A^{-1} equals the reciprocal of det(A) times the transpose of the cofactor matrix.)

This explicit formula is often useful when proving theorems about inverses, but it is not a useful computational formula.

9.4 Vector Spaces and Subspaces<br />

This section introduces vector space theory, emphasizing topics of interest in statistics.<br />

9.4.1 Definitions<br />

A vector space V over the reals is a set of objects (called vectors), and operations of addition<br />

and scalar multiplication satisfying the following axioms:<br />

1. Vector sum: Vector sum is commutative and associative.<br />

2. Zero vector: There is a zero vector O satisfying

       O + x = x = x + O   for each x ∈ V.

3. Additive inverse: For each x ∈ V,

       y + x = O = x + y   for some y ∈ V.

4. Scalar multiples: Given x, y ∈ V and c, d ∈ R,

       c(x + y) = (cx) + (cy),   (c + d)x = (cx) + (dx),   (cd)x = c(dx),   1x = x.

Examples of vector spaces include<br />

1. R m with the usual vector addition and scalar multiplication.<br />

2. The set of r-by-c matrices, Mr×c, with the usual operations of matrix addition and<br />

scalar multiplication.<br />

3. The set of polynomial functions of degree at most k, Pk,<br />

Pk = {p(t) = a0 + a1t + a2t 2 + · · · + akt k : ai ∈ R}<br />

where polynomials are added by adding their values <strong>for</strong> each t, and the scalar multiple<br />

by c is obtained by multiplying each value of the polynomial by the scalar c.<br />

4. The set of all polynomial functions, P∞,<br />

P∞ = P1 ∪ P2 ∪ P3 ∪ · · ·<br />

with addition and scalar multiplication as defined above.<br />

5. The set of continuous real-valued functions with domain [0,1], C[0, 1],<br />

C[0, 1] = {f : [0,1] → R : f is continuous}<br />

where functions are added by adding their values <strong>for</strong> each x ∈ [0,1], and the scalar<br />

multiple by c is obtained by multiplying each value of the function by the scalar c.<br />

Subspaces. A subset H ⊆ V is a subspace of the vector space V if<br />

1. Contains zero vector: O ∈ H.<br />

2. Closed under addition: If x,y ∈ H, then x + y ∈ H.<br />

3. Closed under scalar multiplication: If x ∈ H, then cx ∈ H <strong>for</strong> each c ∈ R.<br />

Note that the subset {O} is a subspace, called the trivial subspace.<br />

Example 9.24 (Lines in R^2) The set of points in the plane satisfying y = mx + b for some m and b:

    Lm,b = { [x, y]^T : y = mx + b } ⊆ R^2

is a subspace of R^2 when b = 0 (the line goes through the origin) and is not a subspace when b ≠ 0 (the line does not go through the origin).


Example 9.25 (Planes in R3 ) The set of points in three-space satisfying z = ax+by +c<br />

for some a, b, c,

Pa,b,c = { [ x ; y ; z ] : z = ax + by + c } ⊆ R 3 ,

is a subspace of R 3 when c = 0 (the plane goes through the origin) and is not a subspace when c ≠ 0 (the plane does not go through the origin).

Of interest. Subspaces are vector spaces in their own right. Of interest in statistics are<br />

the vector spaces R m , and the vector subspaces H ⊆ R m .<br />

9.4.2 Subspaces of R m<br />

Recall (from Section 9.1.3) that the span of the set {v1,v2,... ,vn} ⊆ R m , is the collection<br />

of all linear combinations of the vectors v1, v2, ..., vn:<br />

Span{v1,v2,...,vn} = {c1v1 + c2v2 + · · · + cnvn : ci ∈ R} ⊆ R m .<br />

Spans are subspaces. Of particular interest are spans related to coefficient matrices.<br />

Column space of A. Let A = [a1 a2 · · · an] be an m-by-n coefficient matrix. The<br />

column space of A is the span of the columns of A:<br />

Col(A) = Span{a1,a2,... ,an} ⊆ R m .<br />

The column space of A is the collection of all b so that Ax = b is consistent.<br />

Null space of A. Let A be an m-by-n coefficient matrix. The null space of A is the set<br />

of solutions of the homogeneous system Ax = O:<br />

Null(A) = {x : Ax = O} ⊆ R n .<br />

Since Null(A) can be written as a span of a set of vectors, it is a subspace of R n .<br />

9.4.3 Linear Independence, Basis and Dimension<br />

The set {v1,v2,... ,vn} ⊆ R m is said to be linearly independent when<br />

c1v1 + c2v2 + · · · + cnvn = O<br />

only when each ci = 0. Otherwise, the set is said to be linearly dependent.<br />

The following uniqueness theorem gives an important property of linearly independent sets.<br />



Theorem 9.26 (Uniqueness Theorem) If {v1,v2,...,vn} ⊆ R m is a linearly independent<br />

set and w ∈ Span{v1,v2,... ,vn}, then w has a unique representation as a linear<br />

combination of the vi’s. That is, we can write<br />

w = c1v1 + c2v2 + · · · + cnvn<br />

<strong>for</strong> a unique ordered list of scalars c1, c2, ..., cn (called the coordinates of w).<br />

Note that the coordinates can be found by solving the system<br />

[ v1 v2 · · · vn ] [ c1 ; c2 ; ... ; cn ] = [ w1 ; w2 ; ... ; wm ].

That is, by solving Ac = w, where A is the m-by-n matrix whose columns are the vi’s.<br />

Additional facts. Some additional facts about linear independence/dependence are:<br />

1. If vi = O <strong>for</strong> some i, then the set {v1,v2,...,vn} is linearly dependent.<br />

2. If n > m, then the set {v1,v2,... ,vn} is linearly dependent.<br />

3. Assume vi ≠ O for each i. Then, the set is linearly dependent if and only if one vector in the set can be written as a linear combination of the others.

4. Let A be the m-by-n matrix whose columns are the vi's: A = [ v1 v2 · · · vn ]. Then, the set is linearly independent if and only if A has a pivot in every column.

5. Let A be the m-by-n matrix whose columns are the vi's: A = [ v1 v2 · · · vn ]. Then, the set is linearly independent if and only if the homogeneous system Ax = O has only the trivial solution. (See the sketch below.)
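A short NumPy sketch (illustration only; the three vectors are arbitrary examples) of facts 4 and 5: the columns of A are linearly independent exactly when the rank of A equals the number of columns.

import numpy as np

v1, v2, v3 = np.array([1., 0., 0.]), np.array([0., 1., 1.]), np.array([1., 1., 0.])
A = np.column_stack([v1, v2, v3])
independent = np.linalg.matrix_rank(A) == A.shape[1]
print(independent)   # True here; replacing v3 by v1 + v2 would make it False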

Bases. A basis for the vector space V ⊆ R m is a linearly independent set which spans V . That is, a basis is a set of linearly independent vectors {v1,v2,... ,vn} satisfying

V = Span{v1,v2,... ,vn}.

Example 9.27 (Bases for R 3 ) Bases are not unique. For example, the sets

{ [ 1 ; 0 ; 0 ], [ 0 ; 1 ; 0 ], [ 0 ; 0 ; 1 ] }   and   { [ −3 ; 0 ; 0 ], [ 2 ; 1 ; 0 ], [ 6 ; −2 ; 1 ] }

are each bases of V = R 3 .

Further, if the components of x ∈ R 3 are x1, x2, x3, then the coordinates of x

1. With respect to the first basis are x1, x2, x3.

2. With respect to the second basis are (1/3)(−x1 + 2x2 + 10x3), x2 + 2x3, x3.


Standard basis <strong>for</strong> R m . The standard basis <strong>for</strong> R m is the set {e1,e2,... ,em}, where<br />

the ei’s are the columns of the identity matrix Im. That is, where ei is the vector whose<br />

j th component equals 0 when j ≠ i and equals 1 when j = i.

Note that the first basis given in the previous example is the standard basis <strong>for</strong> R 3 .<br />

Basis <strong>for</strong> the span of a set of vectors. Suppose that<br />

V = Span{v1,v2,... ,vn} ⊆ R m .

A basis for V can be found using row reductions. Specifically:

Theorem 9.28 (Spanning Set Theorem) If V = Span{v1,v2,...,vn} ⊆ R m is not<br />

the trivial subspace, and A is the m-by-n matrix whose columns are the vi’s:<br />

A = [v1 v2 · · · vn ],<br />

then the pivot columns of A <strong>for</strong>m a basis <strong>for</strong> the vector space V .<br />

Basis <strong>for</strong> the column space of A. In particular, the pivot columns of the coefficient<br />

matrix A <strong>for</strong>m a basis <strong>for</strong> the column space of A, Col(A).<br />

Example 9.29 (3-by-5 Matrix) Consider the following 3-by-5 coefficient matrix A and<br />

its equivalent reduced echelon <strong>for</strong>m matrix:<br />

A = [ a1 a2 a3 a4 a5 ] = ⎡ 0  3  −6   6  4 ⎤     ⎡ 1  0  −2  3  0 ⎤
                         ⎢ 3 −7   8  −5  8 ⎥  ∼  ⎢ 0  1  −2  2  0 ⎥
                         ⎣ 3 −9  12  −9  6 ⎦     ⎣ 0  0   0  0  1 ⎦

Since the pivots are in columns 1, 2, 5, a basis <strong>for</strong> Col(A) is the set {a1,a2,a5}.<br />

Basis <strong>for</strong> the null space of A. The method <strong>for</strong> writing the solution set of a homogeneous<br />

system Ax = O given in Section 9.1.3 automatically produces a set of linearly independent<br />

vectors which span Null(A). Thus, no further work is needed to find a basis <strong>for</strong> Null(A).<br />

Example 9.30 (3-by-5 Matrix, continued) Continuing with the example above, since<br />

Ax = O  ⇒  x1 = 2x3 − 3x4,  x2 = 2x3 − 2x4,  x3 is free,  x4 is free,  x5 = 0

⇒  s = x3 [ 2 ; 2 ; 1 ; 0 ; 0 ] + x4 [ −3 ; −2 ; 0 ; 1 ; 0 ] = x3 v1 + x4 v2,

where x3 and x4 are free, a basis for Null(A) is the set {v1,v2}.
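As a numerical cross-check of Examples 9.29 and 9.30 (a sketch only, assuming NumPy and SciPy are available): the rank gives the number of pivot columns, and scipy's null_space returns an orthonormal basis whose span agrees with Span{v1, v2}.

import numpy as np
from scipy.linalg import null_space

A = np.array([[0., 3., -6., 6., 4.],
              [3., -7., 8., -5., 8.],
              [3., -9., 12., -9., 6.]])
print(np.linalg.matrix_rank(A))   # 3 pivot columns
N = null_space(A)                  # 5-by-2 matrix; its columns span Null(A)
print(np.allclose(A @ N, 0))       # True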


Dimension of a vector space. Although there are many choices <strong>for</strong> the basis vectors of<br />

a vector space V , the number of basis vectors does not change.<br />

Theorem 9.31 (Basis Theorem) If V ⊆ R m has a basis with p vectors, then<br />

1. Every basis of V must contain exactly p vectors.<br />

2. Any subset of V with p linearly independent vectors is a basis of V .<br />

3. Any subset of V with p vectors which span V is a basis of V .<br />

If V is a nontrivial subspace of R m , then the dimension of V , denoted by dim(V ), is the<br />

number of vectors in any basis of V . By the basis theorem, this number is well-defined. For<br />

convenience, we say that the dimension of the trivial subspace, {O}, is zero.<br />

Note that <strong>for</strong> every subspace V ⊆ R m , 0 ≤ dim(V ) ≤ m.<br />

9.4.4 Rank and Nullity<br />

Let A be an m-by-n matrix. The rank of A, denoted by rank(A), is the number of pivot<br />

columns of A. Equivalently, the rank of A is the dimension of the column space of A,<br />

rank(A) = dim(Col(A)).<br />

The nullity of A, denoted by nullity(A), is the dimension of its null space,<br />

nullity(A) = dim(Null(A)).<br />

If A is the coefficient matrix of an m-by-n system of linear equations and r = rank(A),<br />

then the system has r basic variables and n − r free variables. Since the number of free<br />

variables corresponds to the dimension of the null space of A, we have<br />

rank(A) + nullity(A) = n.<br />

Example 9.32 (3-by-5 Matrix, continued) For example, the 3-by-5 coefficient matrix<br />

A considered earlier has rank 3 and nullity 2. The sum of these dimensions equals the<br />

number of columns of the matrix.<br />

Subspaces related to A and A T . Let A be an m-by-n coefficient matrix. There are<br />

four important subspaces related to A:

1. Column space of A: The column space of A, Col(A) ⊆ R m .<br />

2. Null space of A: The null space of A, Null(A) ⊆ R n .<br />

3. Row space of A: The column space of A T , Col(A T ) ⊆ R n .<br />

4. Left null space of A: The null space of A T , Null(A T ) ⊆ R m .<br />



The dimensions of these spaces are related as follows:<br />

Theorem 9.33 (Fundamental Theorem of Linear Algebra, Part I) If A is an<br />

m-by-n coefficient matrix of rank r, then<br />

1. rank(A) = rank(A T ) = r,<br />

2. nullity(A) = n − r and<br />

3. nullity(A T ) = m − r.<br />

Matrices of rank one. If the m-by-n matrix A has rank 1, then the columns of A are<br />

all multiples of a single vector, say ai = civ <strong>for</strong> some constants ci and vector v ∈ R m . Let<br />

c ∈ R n be the vector whose components are the ci's. Then

A = [ c1v c2v · · · cnv ] = v c T .

9.5 Orthogonality

Let v,w ∈ R m . Recall (from Section 7.1.4) that<br />

1. The dot product of v with w is the number

v·w = v T w = v1w1 + v2w2 + · · · + vmwm.

2. If θ is the angle between v and w, then

v·w = ‖v‖ ‖w‖ cos(θ).

3. The vectors v and w are said to be orthogonal if v·w = 0.

When m = 2 or m = 3, orthogonality corresponds to being at right angles in the usual<br />

geometric sense. When m > 3, the definition of orthogonality in terms of the dot product<br />

is a useful generalization.<br />

9.5.1 Orthogonal Complements<br />

If V ⊆ R m is a subspace, then the orthogonal complement of V is the set of all vectors<br />

x ∈ R m which are orthogonal to every vector in V :<br />

V ⊥ = {x : v·x = 0 for every v ∈ V } ⊆ R m .

(V ⊥ is read “V -perp.”)


Properties. Orthogonal complements satisfy the following properties:<br />

1. Subspace: V ⊥ is a subspace of R m .<br />

2. Trivial intersection: Since x·x = 0 only when x = O, V ∩ V ⊥ = {O}.<br />

3. Complement of complement: (V ⊥ ) ⊥ = V .

4. Bases: If you pool bases <strong>for</strong> V and V ⊥ , you get a basis <strong>for</strong> R m .<br />

5. Dimensions: dim(V ) + dim(V ⊥ ) = m.<br />

Example 9.34 (Complements in R3 ) Let V be the following subspace:<br />

V = Span{v1,v2} = Span{ [ 1 ; −1 ; 0 ], [ 2 ; 0 ; −2 ] } ⊆ R 3 .

V is a two-dimensional subspace of R 3 since the set {v1,v2} is linearly independent. To find V ⊥ it is sufficient to find those vectors x satisfying v1·x = 0 and v2·x = 0. Equivalently, we need to solve the following matrix equation:

[ v1 T ; v2 T ] x = O.

Since

[ 1 −1  0  0 ]  ∼  [ 1  0  −1  0 ]   ⇒   x1 = x3,  x2 = x3,  x3 is free,
[ 2  0 −2  0 ]     [ 0  1  −1  0 ]

V ⊥ is the span of the single vector v3 = [ 1 ; 1 ; 1 ]. Further, {v1,v2,v3} is a basis for R 3 .

The example above suggests the second part of the Fundamental Theorem of Linear Algebra:<br />

Theorem 9.35 (Fundamental Theorem of Linear Algebra, Part II) Let A be an<br />

m-by-n coefficient matrix. Then<br />

1. Null(A T ) and Col(A) are orthogonal complements in R m .<br />

2. Null(A) and Col(A T ) are orthogonal complements in R n .<br />

9.5.2 Orthogonal Sets, Bases and Projections<br />

The nonzero vectors v1, v2, ..., vn are said to be mutually orthogonal if vi·vj = 0 when<br />

i ≠ j. A set of mutually orthogonal vectors is known as an orthogonal set.

The following theorem summarizes several important properties of orthogonal sets.<br />

Theorem 9.36 (Orthogonal Sets) Let {v1,v2,...,vn} be an orthogonal set of vectors<br />

in R m , and let V be the subspace spanned by these vectors. Then<br />



1. {v1,v2,...,vn} is a basis for V .

2. If y ∈ V , then

y = c1v1 + c2v2 + · · · + cnvn, where ci = (y·vi)/(vi·vi) for each i.

3. If y ∉ V , then

ŷ = c1v1 + c2v2 + · · · + cnvn, where ci = (y·vi)/(vi·vi) for each i,

is the point closest to y in V . Further, the difference between the vector y and the vector ŷ is a vector in the orthogonal complement of V : (y − ŷ) ∈ V ⊥ .

The first part of the theorem tells us that orthogonal sets are linearly independent. The<br />

second part gives a convenient way of finding the coordinates of a vector with respect to a<br />

basis of mutually orthogonal vectors (called an orthogonal basis). The third part generalizes<br />

the idea from analytic geometry of “dropping a perpendicular.”<br />

Projections and orthogonal decompositions. If y ∉ V , then ŷ in the theorem above is called the projection of y on V , and is denoted by

ŷ = proj V (y).

Writing y as the sum of a vector in V and a vector in V ⊥ :

y = ŷ + (y − ŷ)

is called the orthogonal decomposition of y with respect to V . For each y ∈ R m , the<br />

orthogonal decomposition of y with respect to V is unique.<br />
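A small NumPy illustration of the projection formula (the vectors used here are arbitrary examples, not from the text): project y onto V = Span{v1, v2} using ci = (y·vi)/(vi·vi) for an orthogonal pair {v1, v2}, then check that the residual lies in V ⊥ .

import numpy as np

v1 = np.array([1., 1., 0., 0.])
v2 = np.array([0., 0., 1., -1.])            # v1 . v2 = 0, so {v1, v2} is orthogonal
y  = np.array([2., 0., 3., 1.])

yhat = sum((y @ v) / (v @ v) * v for v in (v1, v2))
resid = y - yhat
print(yhat)                                  # projection of y on V
print(resid @ v1, resid @ v2)                # both 0 (up to roundoff)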

9.5.3 Gram-Schmidt Orthogonalization<br />

The Gram-Schmidt orthogonalization process allows us to convert any basis of a subspace<br />

of R m to an orthogonal basis <strong>for</strong> the space.<br />

Theorem 9.37 (Gram-Schmidt Process) Let {v1,v2,... ,vn} be a basis <strong>for</strong> the vector<br />

space V ⊆ R m , and let {u1,u2,... ,un} be defined as follows:<br />

u1 = v1<br />

u2 = v2 − proj V1 (v2) where V1 = Span{u1} = Span{v1}.<br />

u3 = v3 − proj V2 (v3) where V2 = Span{u1,u2} = Span{v1,v2}.<br />

.<br />

un = vn − proj Vn−1 (vn) where Vn−1 = Span{u1,... ,un−1} = Span{v1,...,vn−1}.<br />

Then {u1,u2,... ,un} is an orthogonal basis of V .<br />



Example 9.38 (Subspace of R 4 ) If V = Span{ [ 1 ; −1 ; 0 ; 1 ], [ 2 ; 0 ; −2 ; 1 ], [ 4 ; 0 ; 5 ; 2 ] }, then

u1 = v1 = [ 1 ; −1 ; 0 ; 1 ],

u2 = v2 − (v2·u1 / u1·u1) u1 = [ 2 ; 0 ; −2 ; 1 ] − (3/3) [ 1 ; −1 ; 0 ; 1 ] = [ 1 ; 1 ; −2 ; 0 ],

and

u3 = v3 − (v3·u1 / u1·u1) u1 − (v3·u2 / u2·u2) u2
   = [ 4 ; 0 ; 5 ; 2 ] − (6/3) [ 1 ; −1 ; 0 ; 1 ] − (−6/6) [ 1 ; 1 ; −2 ; 0 ] = [ 3 ; 3 ; 3 ; 0 ].

{u1,u2,u3} is an orthogonal basis of V ⊆ R 4 .<br />
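A compact Gram-Schmidt sketch in NumPy (illustration only), applied to the three vectors of Example 9.38; it reproduces u1, u2, u3 up to roundoff.

import numpy as np

def gram_schmidt(vectors):
    basis = []
    for v in vectors:
        u = v - sum((v @ u) / (u @ u) * u for u in basis)   # subtract projections
        basis.append(u)
    return basis

V = [np.array([1., -1., 0., 1.]),
     np.array([2., 0., -2., 1.]),
     np.array([4., 0., 5., 2.])]
for u in gram_schmidt(V):
    print(u)       # [1 -1 0 1], [1 1 -2 0], [3 3 3 0]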

9.5.4 QR-Factorization<br />

Orthogonal vectors make many calculations easier. For example, if {u1,u2,...,un} is an<br />

orthogonal set in R m , and A is the matrix whose columns are the ui’s, then A T A is a<br />

diagonal matrix. This idea motivates a technique known as QR-factorization.<br />

Orthonormal sets. The orthogonal set {q1,q2,... ,qn} is said to be an orthonormal set<br />

of vectors if each qi is a unit vector. (Each vector has been normalized to have length 1.)<br />

Theorem 9.39 (Orthonormal Sets) If {q1,q2,... ,qn} is an orthonormal set of vectors<br />

in R m , Q is the matrix whose columns are the qi’s, and x,y ∈ R m , then<br />

1. Q T Q = In. (If Q is square, that is n = m, then QQ T = Im as well.)

2. ‖Qx‖ = ‖x‖.

3. (Qx)·(Qy) = x·y.<br />

4. (Qx)·(Qy) = 0 if and only if x·y = 0.<br />

Orthogonal matrices. The square matrix Q is said to be an orthogonal matrix if its<br />

columns <strong>for</strong>m an orthonormal set. If Q is an orthogonal matrix of order n, then the first<br />

part of the theorem above implies<br />

Q T Q = QQ T = In =⇒ Q −1 = Q T .<br />

That is, the inverse of Q is equal to its transpose.<br />



QR-factorization. Let A = [v1 v2 · · · vn] be an m-by-n matrix whose columns are<br />

linearly independent. Use the Gram-Schmidt process to convert the vi’s to ui’s, let<br />

qi = (1/‖ui‖) ui   for i = 1,2,... ,n,

and let Q be the m-by-n matrix whose columns are the qi’s.<br />

In the Gram-Schmidt process,<br />

vi ∈ Span{v1,... ,vi} = Span{u1,... ,ui} <strong>for</strong> i = 1,2,... ,n.<br />

By the uniqueness theorem <strong>for</strong> coordinates, Q T A = R must be an upper triangular matrix<br />

of order n. Further, it can be shown that the diagonal elements of R are all positive. Since<br />

QQ T x = x for every x in the column space of Q, and each column of A lies in that column space,

Q T A = R =⇒ Q(Q T A) = QR =⇒ A = QR.
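A brief NumPy sketch (illustration only; the 4-by-3 matrix below is the one from Example 9.40): numpy.linalg.qr computes a QR-factorization directly. The library may flip signs of columns of Q and rows of R, so the factorization is checked through the product.

import numpy as np

A = np.array([[1., 2., 4.],
              [-1., 0., 0.],
              [0., -2., 5.],
              [1., 1., 2.]])
Q, R = np.linalg.qr(A)                    # Q is 4-by-3 with orthonormal columns, R is 3-by-3
print(np.allclose(Q @ R, A))              # True
print(np.allclose(Q.T @ Q, np.eye(3)))    # True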

Example 9.40 (Subspace of R 4 , continued) For example, the QR-factorization of the<br />

4-by-3 matrix whose columns are the vi’s from the Gram-Schmidt example above is<br />

⎡  1   2   4 ⎤   ⎡  1/√3   1/√6   1/√3 ⎤ ⎡ √3   √3   2√3 ⎤
⎢ −1   0   0 ⎥ = ⎢ −1/√3   1/√6   1/√3 ⎥ ⎢  0   √6   −√6 ⎥
⎢  0  −2   5 ⎥   ⎢   0    −2/√6   1/√3 ⎥ ⎣  0    0   3√3 ⎦ .
⎣  1   1   2 ⎦   ⎣  1/√3    0      0   ⎦

9.5.5 Linear Least Squares

One of the most useful applications of the methods from this section is to the linear least<br />

squares problem. The introduction here focuses on the general problem of finding approximate<br />

solutions to inconsistent systems.<br />

Let A be an m-by-n coefficient matrix, and assume that Ax = b is inconsistent (b is not in<br />

the column space of A). To find approximate solutions to the original system, we propose<br />

to do the following:<br />

1. Find the projection of b on Col(A), b̂, and

2. Report solutions to the consistent system Ax = b̂.

There are two key observations:<br />

Observation 1: Since b̂ is as close to b as possible, each approximate solution x satisfies:

‖b − b̂‖ = ‖b − Ax‖ is as small as possible.


The difference vector is<br />

b − Ax = ⎡ b1 − (a1,1x1 + a1,2x2 + · · · + a1,nxn) ⎤
         ⎢ b2 − (a2,1x1 + a2,2x2 + · · · + a2,nxn) ⎥
         ⎢                    .                    ⎥
         ⎣ bm − (am,1x1 + am,2x2 + · · · + am,nxn) ⎦

and the square of the length of the difference vector is

Σ_{i=1}^{m} ( bi − (ai,1x1 + ai,2x2 + · · · + ai,nxn) )² .

Each approximate solution x will minimize the above sum of squared differences. For this<br />

reason the approximate solutions are called least squares solutions.<br />

Observation 2: Since the difference vector<br />

(b − b̂) = (b − Ax) ∈ Col(A) ⊥ = Null(A T )

(by the Fundamental Theorem of Linear Algebra), we know that<br />

O = A T (b−Ax) = A T b − A T Ax =⇒ A T Ax = A T b.<br />

Thus, least squares solutions can be found by solving the consistent system on the right;<br />

there is no need to explicitly compute the projection of b on the column space of A.<br />

The matrix equation A T Ax = A T b is often called the normal equation of the system.<br />
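For illustration (a minimal NumPy sketch using the system of Example 9.41 below): the normal-equation solution agrees with numpy's built-in least squares routine.

import numpy as np

A = np.array([[1., 0.], [1., 1.], [1., 2.]])
b = np.array([6., 0., 0.])
x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # solve  A^T A x = A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # library least squares
print(x_normal, x_lstsq)                           # both [ 5. -3.]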

Example 9.41 (3-by-2 Inconsistent System) Consider the inconsistent system<br />

⎡ 1 0 ⎤ ⎡ x ⎤   ⎡ 6 ⎤
⎢ 1 1 ⎥ ⎣ y ⎦ = ⎢ 0 ⎥ .
⎣ 1 2 ⎦         ⎣ 0 ⎦

Then

A T Ax = A T b  ⇒  [ 3 3 ; 3 5 ] [ x ; y ] = [ 6 ; 0 ]  ⇒  x = 5, y = −3.

Thus, x = 5 and y = −3 is the unique least squares solution to the system.<br />

Example 9.42 (4-by-3 Inconsistent System) Consider the inconsistent system<br />

⎡ 1 3 5 ⎤ ⎡ x ⎤   ⎡  3 ⎤
⎢ 1 1 0 ⎥ ⎢ y ⎥ = ⎢  5 ⎥
⎢ 1 1 2 ⎥ ⎣ z ⎦   ⎢  7 ⎥ .
⎣ 1 3 3 ⎦         ⎣ −3 ⎦

Then

A T Ax = A T b  ⇒  ⎡  4  8 10 ⎤ ⎡ x ⎤   ⎡ 12 ⎤        x = 10
                   ⎢  8 20 26 ⎥ ⎢ y ⎥ = ⎢ 12 ⎥   ⇒   y = −6
                   ⎣ 10 26 38 ⎦ ⎣ z ⎦   ⎣ 20 ⎦        z = 2

Thus, x = 10, y = −6 and z = 2 is the unique least squares solution to the system.<br />



Remark. If the columns of the coefficient matrix A are linearly independent, then the<br />

symmetric matrix A T A is invertible and

x = (A T A) −1 A T b

is the unique least squares solution. (The proof uses the QR-factorization of A.)<br />

9.6 Eigenvalues and Eigenvectors<br />

Let A be a square matrix of order n. The constant λ is said to be an eigenvalue of A if<br />

Ax = λx <strong>for</strong> some nonzero vector x ∈ R n .<br />

Each nonzero x satisfying this equation is said to be an eigenvector with eigenvalue λ.<br />

If x is an eigenvector of A with eigenvalue λ, then so is each scalar multiple of x, since<br />

A(cx) = c(Ax) = c(λx) = λ(cx) <strong>for</strong> each nonzero constant c.<br />

If we think of the collection of scalar multiples of x as defining a line in n-space, then A<br />

takes each point on the line to another point on the line.<br />

9.6.1 Characteristic Equation<br />

Since x = Inx, where In is the identity matrix,<br />

Ax = λx ⇐⇒ Ax − λx = O ⇐⇒ (A − λIn)x = O.<br />

That is, λ is an eigenvalue of A if and only if the homogeneous system of equations on the<br />

right has nontrivial solutions. The system has nontrivial solutions if and only if (A − λIn)<br />

is a singular matrix. The matrix is singular if and only if its determinant is zero.<br />

Thus, the eigenvalues of A can be found by solving<br />

det(A − λIn) = 0 <strong>for</strong> λ.<br />

This equation is known as the characteristic equation <strong>for</strong> the matrix A. Further, if λ is an<br />

eigenvalue of A, then its eigenvectors are the nonzero vectors in the null space of (A −λIn).<br />

<br />

Example 9.43 (2-by-2 Matrix) Let A = [ −8 −10 ; 5 7 ]. Then

det(A − λI2) = det[ −8 − λ  −10 ; 5  7 − λ ] = (−8 − λ)(7 − λ) + 50 = λ² + λ − 6 = (λ + 3)(λ − 2),

and the determinant equals 0 when λ = 2 or λ = −3.

Using the techniques of Section 9.1.3,

Null(A − 2I2) = Span{ [ 1 ; −1 ] }   and   Null(A + 3I2) = Span{ [ −2 ; 1 ] }.

Finally, note that the set {v1,v2} = { [ 1 ; −1 ], [ −2 ; 1 ] } is a basis for R 2 .
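As a quick numerical cross-check (illustration only, assuming NumPy), numpy.linalg.eig recovers the eigenvalues 2 and −3 of Example 9.43 and unit eigenvectors proportional to (1, −1) and (−2, 1).

import numpy as np

A = np.array([[-8., -10.], [5., 7.]])
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)      # [ 2. -3.]  (order may vary)
print(eigvecs)      # columns are unit eigenvectors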


Properties. The following are important properties of eigenvalues and eigenvectors:<br />

1. 0 is an eigenvalue of A if and only if A is singular (does not have an inverse).<br />

2. If A is a triangular matrix, then the eigenvalues of A are its diagonal elements.<br />

3. If λ is an eigenvalue of A with eigenvector x and A k = AA · · · A is the k th power of<br />

the matrix A, then λ k is an eigenvalue of A k with eigenvector x.<br />

4. Suppose that A has k distinct eigenvalues λ1, λ2, ..., λk. If you pool the bases of<br />

the null spaces Null(A − λiIn), <strong>for</strong> i = 1,2,... ,k, then the resulting set is linearly<br />

independent. If the resulting set has n elements, the set is a basis <strong>for</strong> R n .<br />

Remarks. If A is a square matrix of order n, then the characteristic equation is a polynomial<br />

equation of degree n. Polynomials of degree n cannot always be factored into n terms<br />

of the <strong>for</strong>m “(λ−c)” where the c’s are real numbers, but can be factored into n linear terms<br />

if we allow some c’s to be complex numbers.<br />

9.6.2 Diagonalization<br />

Suppose that A has n linearly independent eigenvectors,<br />

vi <strong>for</strong> i = 1,2,... ,n,<br />

with corresponding eigenvalues λi <strong>for</strong> i = 1,2,... ,n.<br />

Let P be the matrix whose columns are the vi’s, and let Λ be the diagonal matrix whose<br />

diagonal elements are the λi’s,<br />

Then, since<br />

P = [v1 v2 · · · vn], Λ = Diag(λ1,λ2,...,λn).<br />

AP = [Av1 Av2 · · · Avn ] = [λ1v1 λ2v2 · · · λnvn ] = PΛ<br />

and P is invertible, P −1 AP = Λ and A = PΛP −1 .<br />

The matrix equation A = PΛP −1 , where Λ is a diagonal matrix, is called a diagonalization<br />

of the matrix A. Diagonalizations do not always exist.<br />

Theorem 9.44 (Diagonalization Theorem) The square matrix A of order n is diagonalizable<br />

if and only if A has n linearly independent eigenvectors.<br />

Example 9.45 (Diagonalizable 3-by-3 Matrix) Let A = [ 3 0 0 ; 0 3 0 ; −1 1 2 ].

Since A is a triangular matrix, its eigenvalues are 3, 3, 2. (Since 3 corresponds to two roots of the characteristic equation, it appears twice in this list.)

Using the techniques of Section 9.1.3,

Null(A − 3I3) = Span{ [ 1 ; 1 ; 0 ], [ −1 ; 0 ; 1 ] }   and   Null(A − 2I3) = Span{ [ 0 ; 0 ; 1 ] }.

Since there are three linearly independent eigenvectors, A is diagonalizable. Further, we can factor A as follows:

A = PΛP −1 = ⎡ 1 −1 0 ⎤ ⎡ 3 0 0 ⎤ ⎡  0  1 0 ⎤
             ⎢ 1  0 0 ⎥ ⎢ 0 3 0 ⎥ ⎢ −1  1 0 ⎥ .
             ⎣ 0  1 1 ⎦ ⎣ 0 0 2 ⎦ ⎣  1 −1 1 ⎦

Example 9.46 (Nondiagonalizable 3-by-3 Matrix) Let A = [ 3 0 0 ; 2 3 0 ; 2 1 2 ].

As above, since A is a triangular matrix, its eigenvalues are 3, 3, 2.

Using the techniques of Section 9.1.3,

Null(A − 3I3) = Span{ [ 0 ; 1 ; 1 ] }   and   Null(A − 2I3) = Span{ [ 0 ; 0 ; 1 ] }.

Since there are only two linearly independent eigenvectors, A is not diagonalizable.<br />
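A small NumPy check (illustration only) of Example 9.45: the eigenvector matrix P and eigenvalue matrix Λ reconstruct A.

import numpy as np

A = np.array([[3., 0., 0.], [0., 3., 0.], [-1., 1., 2.]])
P = np.array([[1., -1., 0.], [1., 0., 0.], [0., 1., 1.]])
Lam = np.diag([3., 3., 2.])
print(np.allclose(P @ Lam @ np.linalg.inv(P), A))   # True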

9.6.3 Symmetric Matrices<br />

Recall that the square matrix A is symmetric if aij = aji <strong>for</strong> all i and j. Written succinctly,<br />

the square matrix A is symmetric if A = A T .<br />

Symmetric matrices are always diagonalizable. Further, the eigenvectors can be chosen to<br />

be orthonormal. These facts are summarized in the following theorem.<br />

Theorem 9.47 (Spectral Theorem) Let A be a symmetric matrix of order n. Then<br />

there exists a diagonal matrix Λ and an orthogonal matrix Q satisfying<br />

A = QΛQ −1 = QΛQ T

(since the inverse of Q is its transpose).

The spectral theorem says that if A is symmetric, then (1) the diagonalization process<br />

outlined above always results in finding n linearly independent eigenvectors, and (2) eigenvectors<br />

corresponding to different eigenvalues are always orthogonal.<br />

To construct an orthonormal eigenvector basis, you just need to convert the basis of each<br />

null space, Null(A − λiIn) <strong>for</strong> each i, to an orthonormal basis using the normalized Gram-<br />

Schmidt process.<br />
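For symmetric matrices, a routine such as numpy.linalg.eigh returns exactly this kind of factorization; a minimal sketch (the 2-by-2 matrix is an arbitrary example):

import numpy as np

S = np.array([[2., 1.], [1., 2.]])
lam, Q = np.linalg.eigh(S)                        # eigenvalues in ascending order
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))     # True
print(np.allclose(Q.T @ Q, np.eye(2)))            # True: columns of Q are orthonormal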



Spectral decomposition. Let Q = [q1 q2 · · · qn ] and let λi be the eigenvalue<br />

associated with qi <strong>for</strong> each i. Then<br />

A = QΛQ T = λ1q1q1 T + λ2q2q2 T + · · · + λnqnqn T<br />

is called the spectral decomposition of A. The spectral decomposition breaks up A into pieces<br />

determined by its “spectrum” of eigenvalues. Each term in the spectral decomposition of<br />

A is a rank 1 matrix.<br />

9.6.4 Positive Definite Matrices<br />

Let A be a symmetric matrix of order n and let Quad(x) be the following real-valued<br />

n-variable function:<br />

Quad(x) = x T Ax = Σ_{i=1}^{n} Σ_{j=1}^{n} aij xi xj .

(Quad(x) is a quadratic form in x.) Then, A is said to be positive definite if

Quad(x) > 0 when x ≠ O.

The following theorem relates the method <strong>for</strong> determining when a matrix is positive definite<br />

given in Section 7.4.2 (on the second derivative test) to the eigenvalues of A.<br />

Theorem 9.48 (Positive Definite Matrices) Let A be a symmetric matrix. If A has<br />

one of the following three properties, it has them all:<br />

1. All the eigenvalues of A are positive.<br />

2. The determinants of all the principal minors are positive.<br />

3. Quad(x) = x T Ax > 0 except when x = O.<br />

To demonstrate that the first statement implies the third, <strong>for</strong> example, write A = QΛQ T ,<br />

as given in the spectral theorem above. Then<br />

x T Ax = x T (QΛQ T )x = y T Λy = Σ_{i=1}^{n} λi yi² ,   where y = Q T x.

Since Q is an invertible matrix (with inverse Q T ), y = Q T x ≠ O when x ≠ O. Further, since each λi is positive, the sum on the right is positive when y ≠ O.
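A brief NumPy sketch (illustration only; the matrix is an arbitrary example) of the eigenvalue criterion of Theorem 9.48:

import numpy as np

A = np.array([[2., -1.], [-1., 2.]])
print(np.all(np.linalg.eigvalsh(A) > 0))   # True: both eigenvalues (1 and 3) are positive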

9.6.5 Singular Value Decomposition<br />

The singular value decomposition is a useful factorization method <strong>for</strong> rectangular matrices,<br />

generalizing the idea of diagonalization.<br />



Singular values. Let A be an m-by-n matrix. By the spectral theorem, the symmetric<br />

matrix A T A can be orthogonally diagonalized. Let<br />

A T A = V ΛV T where V = [v1 v2 · · · vn ], Λ = Diag(λ1,λ2,... ,λn),<br />

the columns of V <strong>for</strong>m an orthogonal set, and the λi’s (and vi’s) have been ordered so that<br />

λ1 ≥ λ2 ≥ · · · ≥ λn.<br />

The λi’s are nonnegative since they correspond to squared lengths:<br />

‖Avi‖² = (Avi) T (Avi) = vi T A T Avi = vi T (λivi) = λi ≥ 0 for each i.

The singular values of A are the square roots of the eigenvalues of A T A,<br />

σi = √ λi <strong>for</strong> i = 1,2,... ,n<br />

written in descending order. (The singular values are the lengths of the vectors Avi.)<br />

Theorem 9.49 (Singular Values) Given the setup above, if the rank of A is r, then<br />

1. σi > 0 when i ≤ r and σi = 0 when i > r.<br />

2. The set {Av1,Av2,... ,Avr} is an orthogonal basis of Col(A).<br />

Decomposition. To complete the singular value decomposition, let<br />

ui = (1/‖Avi‖) Avi = (1/σi) Avi   for i = 1,2,... ,r.

Then {u1,u2,...,ur} is an orthonormal basis of Col(A). This basis can be completed to<br />

an orthonormal basis of R m , say {u1,u2,...,um}.<br />

Let U be the orthogonal matrix of order m whose columns are the ui’s,<br />

U = [u1 u2 · · · um],<br />

let D be the diagonal matrix of order r whose diagonal elements are the nonzero σi’s,<br />

D = Diag(σ1,σ2,... ,σr),<br />

and let Σ be the m-by-n partitioned matrix<br />

<br />

Σ = ⎡ D  O ⎤
    ⎣ O  O ⎦ ,

where each O is a zero matrix.

Then, since<br />

UΣ = [σ1u1 · · · σrur 0 · · · 0] = [Av1 · · · Avr 0 · · · 0] = AV<br />

and V is invertible (with inverse V T ), UΣV T = AV V T = A.<br />



The matrix equation<br />

A = UΣV T ,<br />

where U and V are orthogonal matrices and Σ has the <strong>for</strong>m shown above, is called a singular<br />

value decomposition of A. Singular value decompositions always exist.<br />

<br />

Example 9.50 (2-by-2 Matrix of Rank 1) Let A = [ 2 2 ; 1 1 ].

Then A T A = [ 5 5 ; 5 5 ] = V ΛV T where

V = [ v1 v2 ] = [ 1/√2  −1/√2 ; 1/√2  1/√2 ]   and   Λ = [ λ1 0 ; 0 λ2 ] = [ 10 0 ; 0 0 ].

The column space of A is spanned by Av1, the first singular value is σ1 = √10, and

u1 = (1/σ1) Av1 = [ 2/√5 ; 1/√5 ].

A unit vector orthogonal to u1 is u2 = [ −1/√5 ; 2/√5 ].

Thus, a singular value decomposition of A is

A = UΣV T = [ 2/√5  −1/√5 ; 1/√5  2/√5 ] [ √10 0 ; 0 0 ] [ 1/√2  1/√2 ; −1/√2  1/√2 ].

Example 9.51 (3-by-2 Matrix of Rank 2) Let A = [ −1 13 ; 4 8 ; 10 −10 ].

Then A T A = [ 117 −81 ; −81 333 ] = V ΛV T where

V = [ v1 v2 ] = [ −1/√10  3/√10 ; 3/√10  1/√10 ]   and   Λ = [ λ1 0 ; 0 λ2 ] = [ 360 0 ; 0 90 ].

The column space of A is spanned by {Av1, Av2}, σ1 = 6√10, σ2 = 3√10,

u1 = (1/σ1) Av1 = [ 2/3 ; 1/3 ; −2/3 ]   and   u2 = (1/σ2) Av2 = [ 1/3 ; 2/3 ; 2/3 ].

A unit vector orthogonal to these vectors is u3 = [ 2/3 ; −2/3 ; 1/3 ].

Thus, a singular value decomposition of A is

A = UΣV T = ⎡  2/3  1/3   2/3 ⎤ ⎡ 6√10    0  ⎤
            ⎢  1/3  2/3  −2/3 ⎥ ⎢   0   3√10 ⎥ [ −1/√10  3/√10 ; 3/√10  1/√10 ] .
            ⎣ −2/3  2/3   1/3 ⎦ ⎣   0     0  ⎦
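A quick NumPy check (illustration only) of Example 9.51: the singular values of A are 6√10 and 3√10, and the computed factors reproduce A.

import numpy as np

A = np.array([[-1., 13.], [4., 8.], [10., -10.]])
U, s, Vt = np.linalg.svd(A)                    # singular values in descending order
print(s)                                        # approx [18.974  9.487]
print(np.allclose((U[:, :2] * s) @ Vt, A))      # True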


Bases <strong>for</strong> the four important subspaces. Let A be an m-by-n matrix of rank r, and<br />

let A = UΣV T be a singular value decomposition. Let U = [Ur Um−r ] be a partition of<br />

U into matrices with r and (m − r) columns. Similarly, let V = [Vr Vn−r ] be a partition<br />

of V into matrices with r and (n − r) columns. Then<br />

1. The columns of Ur <strong>for</strong>m an orthonormal basis <strong>for</strong> Col(A) ⊆ R m .<br />

2. The columns of Um−r <strong>for</strong>m an orthonormal basis <strong>for</strong> Null(A T ) ⊆ R m .<br />

3. The columns of Vr <strong>for</strong>m an orthonormal basis <strong>for</strong> Col(A T ) ⊆ R n .<br />

4. The columns of Vn−r <strong>for</strong>m an orthonormal basis <strong>for</strong> Null(A) ⊆ R n .<br />

For this reason, the singular value decomposition is often thought of as the third part of<br />

the Fundamental Theorem of Linear Algebra.<br />

Note that we can now write:<br />

A = UΣV T = [ Ur  Um−r ] ⎡ D  O ⎤ ⎡ Vr T   ⎤ = Ur D Vr T ,
                         ⎣ O  O ⎦ ⎣ Vn−r T ⎦

where D = Diag(σ1,σ2,... ,σr) is the diagonal matrix of order r whose diagonal elements<br />

are the nonzero singular values, written in descending order.<br />

Moore-Penrose pseudoinverse. Working from the decomposition above, the matrix<br />

A + = VrD −1 U T r<br />

is called the Moore-Penrose pseudoinverse of the matrix A. A + is an n-by-m matrix.<br />

Using properties of orthogonal matrices: A + AA + = A + and AA + A = A.

The Moore-Penrose pseudoinverse can be applied to least squares problems (Section 9.5.5).<br />

If Ax = b is inconsistent and A T A does not have an inverse, then there are infinitely many<br />

least squares solutions. By using A + as if it were a true inverse:<br />

Ax = b =⇒ x + = A + b<br />

we obtain the least squares solution of minimum length.<br />

Example 9.52 (Solution of Minimum Length) Consider the inconsistent system<br />

⎡  1  −2 ⎤ ⎡ x ⎤   ⎡  3 ⎤
⎢  2  −4 ⎥ ⎣ y ⎦ = ⎢ −4 ⎥ .
⎣ −1   2 ⎦         ⎣ 15 ⎦

Then

A T Ax = A T b  ⇒  [ 6 −12 ; −12 24 ] [ x ; y ] = [ −20 ; 40 ]  ⇒  x = 2y − 10/3, y is free.

Thus, there are an infinite number of least squares solutions, of the form

(2y − 10/3, y) where y is free.

Since A = [ −1/√6 ; −2/√6 ; 1/√6 ] √30 [ −1/√5  2/√5 ], the pseudoinverse is

A + = [ −1/√5 ; 2/√5 ] (1/√30) [ −1/√6  −2/√6  1/√6 ]   and   x + = A + b = [ −2/3 ; 4/3 ].

Note that the square of the length of each solution is f(y) = (2y − 10/3)² + y², and that the function f(y) is minimized when y = 4/3.
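A short NumPy check (illustration only) of Example 9.52: numpy.linalg.pinv builds the Moore-Penrose pseudoinverse from the SVD and returns the minimum-length least squares solution.

import numpy as np

A = np.array([[1., -2.], [2., -4.], [-1., 2.]])
b = np.array([3., -4., 15.])
x_plus = np.linalg.pinv(A) @ b
print(x_plus)      # [-0.6667  1.3333], i.e. (-2/3, 4/3)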

Numerical linear algebra. In most practical applications of linear algebra, solutions<br />

are subject to roundoff errors and care must be taken to minimize these errors. Often, a<br />

singular value decomposition of the coefficient matrix is used in the computations, and the<br />

range of nonzero singular values is used to analyze errors.<br />

If the coefficient matrix is invertible, then the ratio of the largest to the smallest singular<br />

value c = σ1/σn is called the condition number of the matrix. The larger the condition<br />

number, the more error prone the numerical algorithm. A rule of thumb, which has been<br />

experimentally verified in a large number of trial cases, is that the computer can lose about<br />

log 10(c) decimal places of accuracy while solving linear equations numerically.<br />

In linear regression analysis, <strong>for</strong> example, if two columns of the design matrix X are nearly<br />

linearly related (nearly collinear), then the condition number of X T X will be quite high,<br />

affecting estimates of coefficients and variances.<br />

9.7 Linear Trans<strong>for</strong>mations<br />

Let V and W be vector spaces and let T : V → W be a function assigning to each v ∈ V<br />

a vector T(v) ∈ W. Then T is said to be a linear trans<strong>for</strong>mation if<br />

1. T(v1 + v2) = T(v1) + T(v2) <strong>for</strong> all v1,v2 ∈ V and<br />

2. T(cv) = cT(v) <strong>for</strong> all v ∈ V and c ∈ R.<br />

(That is, if T “respects” vector addition and scalar multiplication.)<br />

Example 9.53 (Trans<strong>for</strong>mations on R2 ) For example, let V = W = R2 ,<br />

<br />

T1( [ x ; y ] ) = [ 2x + 3y ; −y ]   and   T2( [ x ; y ] ) = [ x − y ; xy ].

Then T1 satisfies the criteria for linear transformations, but T2 does not. In particular,

T2( [ 3 ; 3 ] ) = [ 0 ; 9 ] ≠ [ 0 ; 3 ] = 3 T2( [ 1 ; 1 ] ).



Figure 9.1: The effects of three different linear trans<strong>for</strong>mations on the unit square (left).<br />

Linear combinations. If T is a linear trans<strong>for</strong>mation, then T(O) = O and<br />

T(c1v1 + c2v2 + · · · + cnvn) = c1T(v1) + c2T(v2) + · · · + cnT(vn)<br />

<strong>for</strong> each linear combination of vectors in V .<br />

9.7.1 Range and Kernel<br />

If T is a linear trans<strong>for</strong>mation, then the range of T,<br />

Range(T) = {w : w = T(v) <strong>for</strong> some v ∈ V } ⊆ W,<br />

is a subspace of W. Similarly, the kernel of T,<br />

Kernel(T) = {v : T(v) = O} ⊆ V,

is a subspace of V .

Isomorphism. T is said to be one-to-one if its kernel is trivial, Kernel(T) = {O}. T is<br />

said to be onto if its range is the entire space, Range(T) = W.<br />

If T is both one-to-one and onto, then T is an isomorphism between the vector spaces.<br />

9.7.2 Linear Trans<strong>for</strong>mations and Matrices<br />

If V ⊆ R n and W ⊆ R m , then the linear trans<strong>for</strong>mation T corresponds to matrix multiplication.<br />

That is,<br />

T(v) = Av <strong>for</strong> some m-by-n matrix A.<br />

Further, Range(T) = Col(A) and Kernel(T) = Null(A).<br />

Example 9.54 (Planar Trans<strong>for</strong>mations) When V = W = R2 , linear trans<strong>for</strong>mations<br />

are often viewed geometrically. For example, Figure 9.1 shows the image of the unit square<br />

(the set [0,1] × [0,1]) under linear trans<strong>for</strong>mations whose matrices are<br />

<br />

[ 1  1.5 ; 0  1 ],    [ −1.5  0 ; 0  2 ],    and    [ −2  1 ; −1  2.5 ],   respectively.



Figure 9.2: A trans<strong>for</strong>mation of R 2 (left) whose range (right) is isomorphic to R 2 .<br />

A linear trans<strong>for</strong>mation whose matrix is invertible will map parallelograms to parallelograms.<br />

Since the second matrix is a diagonal matrix, the image of the unit square is a<br />

rectangle whose sides have lengths equal to the absolute values of the diagonal elements.<br />

In each case, the area of the image of the unit square is the absolute value of the determinant<br />

of the matrix. The resulting areas in this case are 1, 3, and 4, respectively.<br />

9.7.3 Linearly Independent Columns<br />

Let T : R n → R m be a linear trans<strong>for</strong>mation with matrix A. If the columns of A are<br />

linearly independent, then<br />


Range(T) = Col(A) ⊆ R m is isomorphic to R n .<br />

Example 9.55 (Dimension 2) Let T : R2 → R3 be defined as follows:<br />

T(x) = A [ x1 ; y1 ] = [ 1 0 ; 0 −1 ; 1 1 ] [ x1 ; y1 ] = [ x1 ; −y1 ; x1 + y1 ].

Since the columns of A are linearly independent, the range of T is the 2-dimensional subspace<br />

of R 3 shown in the right part of Figure 9.2. (Plane with equation z2 = x2 − y2.)<br />

The figure also shows how a square grid on the x1y1-plane is trans<strong>for</strong>med by T to a parallelogram<br />

grid on the surface in x2y2z2-space with equation z2 = x2 − y2.<br />

9.7.4 Composition of Functions and Matrix Multiplication<br />

Let U ⊆ R p , V ⊆ R n , and W ⊆ R m be subspaces, let<br />

TB : U → V and TA : V → W<br />

be linear trans<strong>for</strong>mations with matrices B and A, respectively, and let<br />

T : U → W be the composition T(u) = TA(TB(u)).<br />




Figure 9.3: A linear trans<strong>for</strong>mation of the [−2,2]×[−2,2] square viewed through the singular<br />

value decomposition of the matrix of the trans<strong>for</strong>mation.<br />

Then the matrix of the linear trans<strong>for</strong>mation T is the matrix product AB.<br />

<br />

Example 9.56 (SVD) The planar transformation whose matrix is A = [ 2 −1 ; 2 2 ] maps the x1y1-plane to the x4y4-plane, shown in the first and last parts of Figure 9.3.

Consider the following singular value decomposition of A,

A = UΣV T = [ 1/√5  −2/√5 ; 2/√5  1/√5 ] [ 3 0 ; 0 2 ] [ 2/√5  1/√5 ; −1/√5  2/√5 ],

and its effect on the [−2,2] × [−2,2] square.

1. The x1y1-plane is mapped to the x2y2-plane by the trans<strong>for</strong>mation whose matrix is<br />

V T . Since V is orthogonal, the image of the square is a tilted square of the same size.<br />

2. The x2y2-plane is mapped to the x3y3-plane by the trans<strong>for</strong>mation whose matrix is<br />

Σ. Since Σ is diagonal, the tilted square is stretched in the horizontal and vertical<br />

directions by 3 and 2, respectively, yielding a parallelogram.<br />

3. The x3y3-plane is mapped to the x4y4-plane by the trans<strong>for</strong>mation whose matrix is U.<br />

Since U is orthogonal, the image of the parallelogram is a congruent parallelogram.<br />

The figure also shows that the unit circle is trans<strong>for</strong>med to an ellipse by the action of the<br />

diagonal matrix in the singular value decomposition.<br />

9.7.5 Diagonalization and Change of Bases<br />

Suppose that A is a diagonalizable matrix of order n (Section 9.6.2), with diagonalization<br />

A = PΛP −1 , where P = [v1 v2 · · · vn ] and Λ = Diag(λ1,λ2,... ,λn).<br />

The columns of P <strong>for</strong>m an eigenvector basis of R n . Given x ∈ R n , the coordinates of x<br />

with respect to this basis, say c, are found by solving the matrix equation:<br />

Pc = x =⇒ c = P −1 x.<br />



Thus, the linear trans<strong>for</strong>mation whose matrix is P −1 can be thought of as a trans<strong>for</strong>mation<br />

from the standard basis of R n to the eigenvector basis {v1,v2,...,vn}. Similarly, the<br />

trans<strong>for</strong>mation whose matrix is P can be thought of as a change of basis from the eigenvector<br />

basis back to the standard basis.<br />

Now, consider the trans<strong>for</strong>mation T whose matrix is A. Since<br />

<br />

T(x) = Ax = (PΛP −1 )x = P(Λ(P −1 x)) = P(Λc),

the action of T is (1) to change from the standard basis to the eigenvector basis, (2) to<br />

operate in the eigenvector-coordinate system as a diagonal matrix, and (3) to interpret the<br />

results back in standard-basis <strong>for</strong>m.<br />

Powers of A. The k th power of A can be written as follows:<br />

A k = (PΛP −1 ) k = (PΛP −1 )(PΛP −1 ) · · · (PΛP −1 ) = PΛ k P −1 .

Thus, the action of the linear trans<strong>for</strong>mation whose matrix is A k is (1) to change to the<br />

eigenvector basis, (2) to operate in the eigenvector-coordinate system by applying Λ k times,<br />

and (3) to interpret the results back in standard-basis <strong>for</strong>m.<br />

This geometric interpretation is especially useful in practice when considering, <strong>for</strong> example,<br />

transition matrices (from probability theory) or Leslie matrices (from population dynamics).<br />

9.7.6 Random Vectors and Linear Trans<strong>for</strong>mations<br />

Linear trans<strong>for</strong>mations of vectors of random variables are commonly used. Let<br />

X = [ X1 ; X2 ; ... ; Xn ]   and   µ = E(X) = [ E(X1) ; E(X2) ; ... ; E(Xn) ]

be n-by-1 vectors of random variables and their means, respectively, and let<br />

V ar(X) = E((X − µ)(X − µ) T )<br />

be the n-by-n matrix of covariances of the random variables. That is, V ar(X) is the matrix<br />

whose ij-entry is equal to the covariance between Xi and Xj when i ≠ j, and is equal to

the variance of Xi when i = j.<br />

Linear trans<strong>for</strong>mations. If A is an m-by-n matrix, then the linearly trans<strong>for</strong>med random<br />

vector Y = AX has an m-variate distribution with mean vector<br />

E(Y ) = E(AX) = AE(X) = Aµ

and with covariance matrix

V ar(Y ) = E((AX − Aµ)(AX − Aµ) T ) = E(A(X − µ)(X − µ) T A T ) = A V ar(X) A T .




Figure 9.4: Joint density (left) and contours enclosing 20%, 40% and 60% of the probability<br />

<strong>for</strong> the standard bivariate normal distribution with correlation ρ = 0.50.<br />

Example 9.57 (2-by-2 Matrix) Let

X = [ X1 ; X2 ],   A = [ 1 1 ; 2 −4 ],   and   Y = AX = [ X1 + X2 ; 2X1 − 4X2 ].

If

µ = E(X) = [ 2 ; 1 ]   and   V ar(X) = [ 1  1/2 ; 1/2  1 ],

then

E(Y ) = Aµ = [ 3 ; 0 ]   and   V ar(Y ) = A V ar(X) A T = [ 3  −3 ; −3  12 ].

Note that Corr(X1,X2) = 1/2 and Corr(Y1,Y2) = Cov(Y1,Y2) / √(V ar(Y1) V ar(Y2)) = −1/2.
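A tiny NumPy check (illustration only) of Example 9.57: the mean vector and covariance matrix transform as Aµ and A V A T .

import numpy as np

A  = np.array([[1., 1.], [2., -4.]])
mu = np.array([2., 1.])
V  = np.array([[1., 0.5], [0.5, 1.]])
print(A @ mu)        # [ 3.  0.]
print(A @ V @ A.T)   # [[ 3. -3.] [-3. 12.]]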

9.7.7 Example: Multivariate Normal Distribution<br />

Let X = [ X1 ; X2 ; ... ; Xk ] be a random k-tuple.

X is said to have a multivariate normal distribution if its joint PDF has the following <strong>for</strong>m<br />

f(x) = (2π)^(−k/2) |V |^(−1/2) exp( −(1/2) (x − µ) T V −1 (x − µ) )   for all x ∈ R k .

In this <strong>for</strong>mula, µ = E(X) is the k-by-1 vector of means and V = V ar(X) is the k-by-k<br />

covariance matrix. (See Section 8.1.9 also.)<br />

Note that to find the probability that a normal random k-tuple takes values in a domain<br />

D ⊆ R k , you need to compute a k-variate integral.<br />



Figure 9.5: Transformations of probability contours.

Example 9.58 (k = 2) Let X = [ X ; Y ], µ = O and V = [ 1  0.5 ; 0.5  1 ].

Then the random pair has a standard bivariate normal distribution with parameter ρ = 0.50.<br />

The left part of Figure 9.4 is a graph of the surface z = f(x,y). The right part of the figure<br />

shows contours enclosing 20%, 40% and 60% of the probability.<br />

Each contour in the right plot is an ellipse whose major and minor axes are along the lines<br />

y = x and y = −x (gray lines), respectively. Let D1, D2, and D3 be the regions bounded<br />

by the three ellipses in increasing size.<br />

Orthogonal diagonalization. Since V is symmetric, the spectral theorem (Section 9.6.3)<br />

implies that it can be diagonalized as follows:<br />

V = QΛQ T ,<br />

where Λ is a diagonal matrix of eigenvalues and Q is orthogonal.<br />

As discussed earlier, the linear trans<strong>for</strong>mation whose matrix is QT can be thought of as a<br />

trans<strong>for</strong>mation from the standard basis of Rk to the basis whose elements are the columns<br />

of Q. Further, if Y = QTX, then<br />

E(Y ) = Q T µ   and   V ar(Y ) = Q T V ar(X) Q = Q T (QΛQ T )Q = Λ.

Thus, the Yi’s are uncorrelated, with variances equal to the eigenvalues of V . (The Yi’s are,<br />

in fact, independent normal random variables.)<br />

Example 9.59 (k = 2, continued) Continuing with the example above,<br />

V = QΛQ T = [ 1/√2  −1/√2 ; 1/√2  1/√2 ] [ 1.5  0 ; 0  0.5 ] [ 1/√2  1/√2 ; −1/√2  1/√2 ].

The action of the matrix product Λ^(−1/2) Q T is illustrated in Figure 9.5.

Specifically, if we let x2 = Q T x1, then points on the ellipses shown in the x1y1-plane (left<br />

plot) are rotated by 45 degrees in the clockwise direction to points on the ellipses shown in<br />

the x2y2-plane (center plot).<br />



Further, if we let x3 = Λ^(−1/2) x2, then points on the rotated ellipses in the x2y2-plane (center

plot) are mapped to circles in the x3y3-plane (right plot).<br />

These trans<strong>for</strong>mations (plus a polar trans<strong>for</strong>mation) can be used to demonstrate that<br />

P((X,Y ) ∈ D1) = 0.20, P((X,Y ) ∈ D2) = 0.40, P((X,Y ) ∈ D3) = 0.60.<br />

9.8 Applications in <strong>Statistics</strong><br />

Linear algebra is one of the most useful branches of mathematics. In statistics, linear algebra<br />

methods are used to simplify computations related to multivariate probability distributions<br />

and to simplify computations related to summarizing random samples. This section focuses<br />

on two applications.<br />

9.8.1 Least Squares Estimation<br />

Given n cases, (x1,i,x2,i,... ,xp−1,i,yi) <strong>for</strong> i = 1,2,... ,n, whose values lie close to a linear<br />

equation of the <strong>for</strong>m<br />

y = β0 + β1x1 + β2x2 + · · · + βp−1xp−1,<br />

the method of least squares can be used to estimate the unknown parameters.<br />

The n-by-p design matrix, X, the p-by-1 vector of unknowns, β, and the n-by-1 vector of<br />

right hand side values, y, are defined as follows:<br />

X = ⎡ 1  x1,1  x2,1  · · ·  xp−1,1 ⎤        ⎡ β0   ⎤             ⎡ y1 ⎤
    ⎢ 1  x1,2  x2,2  · · ·  xp−1,2 ⎥        ⎢ β1   ⎥             ⎢ y2 ⎥
    ⎢ .    .     .   · · ·     .   ⎥ ,  β = ⎢  .   ⎥ ,  and  y = ⎢  . ⎥ .
    ⎣ 1  x1,n  x2,n  · · ·  xp−1,n ⎦        ⎣ βp−1 ⎦             ⎣ yn ⎦

We wish to find least squares solutions to the inconsistent system Xβ = y.<br />

Example 9.60 (Sleep Study, Part 1) (Allison & Cicchetti, Science, 194:732-734, 1976;

lib.stat.cmu.edu/DASL.) As part of a study on sleep in mammals, researchers collected<br />

in<strong>for</strong>mation on the average hours of nondreaming and dreaming sleep <strong>for</strong> 43 different species.<br />

The left part of Figure 9.6 compares the common logarithm of nondreaming sleep (vertical<br />

axis) to the common logarithm of dreaming sleep (horizontal axis). A simple linear prediction<br />

<strong>for</strong>mula will be used to write the common logarithm of hours of nondreaming sleep as<br />

a function of the common logarithm of hours of dreaming sleep.<br />

For these data, Xβ = y ⇒ X T Xβ = X T y ⇒<br />

<br />

<br />

[ 43       8.75915 ] [ β0 ]   [ 38.3639 ]        β0 = 0.805178
[ 8.75915  5.33664 ] [ β1 ] = [ 9.33209 ]   ⇒   β1 = 0.427124 .

Thus, the simple linear prediction equation is y = 0.427124x + 0.805178.
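A short NumPy sketch (illustration only) reproducing the fit in Example 9.60 from the normal equations X T Xβ = X T y, using the summary values quoted above:

import numpy as np

XtX = np.array([[43., 8.75915], [8.75915, 5.33664]])
Xty = np.array([38.3639, 9.33209])
beta = np.linalg.solve(XtX, Xty)
print(beta)     # approximately [0.805178  0.427124]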




Figure 9.6: Comparison of nondreaming sleep in log-hours with dreaming sleep in log-hours<br />

(left plot) and with body weight in log-kilograms (right plot) <strong>for</strong> data from the sleep study.<br />

Least squares linear fits are superimposed.<br />

The simple linear prediction equation is shown in the left part of Figure 9.6.<br />

Example 9.61 (Sleep Study, Part 2) The researchers also collected in<strong>for</strong>mation on the<br />

average body weight in kilograms <strong>for</strong> the 43 species, and on other variables. They developed<br />

a danger index using a five-point scale, where a higher value corresponds to a higher risk<br />

of exposure to the elements while sleeping and/or a higher risk of predation while sleeping.<br />

Table 9.2 gives the values of the danger index <strong>for</strong> each species in the study.<br />

The right part of Figure 9.6 compares the common logarithm of nondreaming sleep (vertical<br />

axis) to the common logarithm of body weight (horizontal axis). A linear prediction <strong>for</strong>mula<br />

will be used to write the common logarithm of hours of nondreaming sleep as a function<br />

of the common logarithm of kilograms of body weight (variable 1) and the danger index<br />

(variable 2).<br />

For these data, Xβ = y ⇒ X T Xβ = X T y ⇒<br />

⎡ 43       13.4047  115     ⎤ ⎡ β0 ⎤   ⎡ 38.3639 ⎤        β0 = 1.06671
⎢ 13.4047  78.5992  63.7371 ⎥ ⎢ β1 ⎥ = ⎢ 3.22446 ⎥   ⇒   β1 = −0.097165
⎣ 115      63.7371  389     ⎦ ⎣ β2 ⎦   ⎣ 95.4993 ⎦        β2 = −0.0539304 .

Thus, the linear prediction equation is y = −0.097165x − 0.0539304d + 1.06671.

Nondreaming sleep is negatively associated with both body weight and danger index. The<br />

right part of Figure 9.6 shows the simple linear prediction <strong>for</strong>mulas when d = 1,2,3,4,5.<br />

9.8.2 Principal Components Analysis<br />

Let X = [ x1 x2 · · · xn ] be a p-by-n data matrix, whose columns are a sample of size<br />

n from a p-variate distribution, let x̄ be the p-by-1 sample mean,

x̄ = (1/n) Σ_{i=1}^{n} xi ,


Table 9.2: Danger indices <strong>for</strong> 43 species in the sleep study.<br />

Danger = 1           Danger = 2          Danger = 3          Danger = 4          Danger = 5
Big brown bat        European hedgehog   African giant rat   Asian elephant      Cow
Cat                  Galago              ground squirrel     Baboon              Goat
Chimpanzee           Golden hamster      Mountain beaver     Brazilian tapir     Horse
E.Amer.mole          Owl monkey          Mouse               Chincilla           Rabbit
Gray seal            Phanlanger          Musk shrew          guinea pig          Sheep
Little brown bat     Rhesus monkey       Rat                 Short tail shrew
Man                  Rock hyrax (h.b.)   Rockhyrax (p.h.)    Patas monkey
Mole rat             Tenrec              Tree hyrax          Pig
N.Amer.opossum       Tree shrew          Vervet
9 banded armadillo
Red fox
Water opossum

Let X̃ be the p-by-n matrix whose jth column is the deviation of the jth observation from the sample mean: x̃j = xj − x̄.

Let S be the sample covariance matrix. S can be computed using the matrix equation

\[
S = \frac{1}{n-1} \widetilde{X} \widetilde{X}^T ,
\]

where X̃ is the centered data matrix above.
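A minimal numpy sketch of this computation (the function and variable names are illustrative, not from the notes):

    import numpy as np

    def sample_covariance(X):
        """X is p-by-n, with one observation per column.
        Returns the p-by-p sample covariance matrix S = (1/(n-1)) Xc Xc^T."""
        n = X.shape[1]
        xbar = X.mean(axis=1, keepdims=True)   # p-by-1 sample mean
        Xc = X - xbar                          # centered data matrix (deviations from the mean)
        return (Xc @ Xc.T) / (n - 1)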

Since S is symmetric, we can use the spectral theorem (Section 9.6.3) to write

S = QΛQ^T,

where Λ is a diagonal matrix of eigenvalues ordered so that λ1 ≥ λ2 ≥ · · · ≥ λp, and where Q is an orthogonal matrix whose columns q1, q2, . . . , qp are corresponding unit eigenvectors.

Consider the linear transformation

\[
Y = Q^T \widetilde{X} =
\begin{bmatrix}
q_1 \cdot \tilde{x}_1 & q_1 \cdot \tilde{x}_2 & \cdots & q_1 \cdot \tilde{x}_n \\
q_2 \cdot \tilde{x}_1 & q_2 \cdot \tilde{x}_2 & \cdots & q_2 \cdot \tilde{x}_n \\
\vdots & \vdots & & \vdots \\
q_p \cdot \tilde{x}_1 & q_p \cdot \tilde{x}_2 & \cdots & q_p \cdot \tilde{x}_n
\end{bmatrix} ,
\]

and let S_Y be the sample covariance matrix of Y. Then

\[
S_Y = Q^T S Q = Q^T \left( Q \Lambda Q^T \right) Q = \Lambda ,
\]

indicating that the p transformed centered samples are uncorrelated. Further, the first transformed centered sample (first row of Y) has the largest sample variance, λ1; the second (second row of Y) has the next largest sample variance, λ2; and so forth.

Since the ith column of Q, namely qi, is used to produce the ith transformed sample (the ith row of Y), the vector qi is called the ith principal component of the centered data matrix X̃.
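Putting the pieces together, the principal components of any p-by-n data matrix can be computed with a few lines of numpy. The sketch below (illustrative, not part of the notes) mirrors the steps above; the sample covariance matrix of the rows of the returned Y is, up to rounding, the diagonal matrix Λ.

    import numpy as np

    def principal_components(X):
        """X is p-by-n, with one observation per column.
        Returns the eigenvalues in decreasing order, the orthogonal matrix Q whose
        columns are the principal components, and the transformed data Y = Q^T Xc."""
        n = X.shape[1]
        Xc = X - X.mean(axis=1, keepdims=True)   # centered data matrix
        S = (Xc @ Xc.T) / (n - 1)                # sample covariance matrix
        lam, Q = np.linalg.eigh(S)               # eigh handles symmetric matrices, but
        order = np.argsort(lam)[::-1]            # returns eigenvalues in increasing order,
        lam, Q = lam[order], Q[:, order]         # so reverse to get lam[0] >= lam[1] >= ...
        Y = Q.T @ Xc                             # rows of Y are the transformed samples
        return lam, Q, Y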




Figure 9.7: Abdomen and chest measurements (left), and abdomen, chest, and hip measurements (right) for 100 men. Principal component axes are highlighted in each plot.

Example 9.62 (Body Fat Study, Part 1) (Johnson, JSE:4(1), 1996) As part of a study to determine if simple body measurements could be used to predict body fat, abdomen and chest measurements in centimeters were made on 100 men. The left part of Figure 9.7 compares chest (variable 1) and abdomen (variable 2) measurements for these men.

For these data,

\[
\bar{x} = \begin{bmatrix} 101.068 \\ 92.183 \end{bmatrix} , \qquad
S = \begin{bmatrix} 68.3283 & 76.3466 \\ 76.3466 & 102.599 \end{bmatrix} ,
\]

and S = QΛQ^T, where

\[
\Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}
\approx \begin{bmatrix} 163.709 & 0 \\ 0 & 7.218 \end{bmatrix}
\qquad \text{and} \qquad
Q = \begin{bmatrix} q_1 & q_2 \end{bmatrix}
\approx \begin{bmatrix} 0.625 & -0.781 \\ 0.781 & 0.625 \end{bmatrix} .
\]

The principal component axes shown in the left part of Figure 9.7 are lines described parametrically as ℓi(t) = x̄ + t qi, for i = 1, 2. The lines are centered at the sample mean x̄.

Since λ1 is much greater than λ2, most of the variation in the data is along the first principal component axis. In fact, since

\[
\frac{\lambda_1}{\lambda_1 + \lambda_2} \approx 0.958
\qquad \text{and} \qquad
\frac{\lambda_2}{\lambda_1 + \lambda_2} \approx 0.042 ,
\]

about 95.8% of the total variation in the data is due to the first component.

Further, if c is the chest measurement and a is the abdomen measurement of a man in the study, then the formula

y = 0.625(c − 101.068) + 0.781(a − 92.183)

can be used as a body “size index” in later analyses.
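The eigenvalues, the proportion of variation, and the size index can be reproduced from the quoted sample mean and covariance matrix; a minimal numpy sketch (illustrative) is:

    import numpy as np

    xbar = np.array([101.068, 92.183])          # mean chest and abdomen measurements (cm)
    S = np.array([[68.3283,  76.3466],
                  [76.3466, 102.599 ]])

    lam, Q = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]               # decreasing eigenvalue order
    lam, Q = lam[order], Q[:, order]
    print(lam)                                  # approximately [163.709, 7.218]
    print(lam[0] / lam.sum())                   # approximately 0.958

    q1 = Q[:, 0]
    if q1[0] < 0:                               # an eigenvector's sign is arbitrary;
        q1 = -q1                                # match the signs used in the text

    def size_index(c, a):
        """First principal component score for a (chest, abdomen) pair, in centimeters."""
        return q1 @ (np.array([c, a]) - xbar)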



Example 9.63 (Body Fat Study, Part 2) The researchers also took hip measurements in centimeters. The right part of Figure 9.7 compares chest (variable 1), abdomen (variable 2), and hip (variable 3) measurements for the 100 men.

For these data,

\[
\bar{x} = \begin{bmatrix} 101.068 \\ 92.183 \\ 99.594 \end{bmatrix} , \qquad
S = \begin{bmatrix} 68.3283 & 76.3466 & 43.8397 \\ 76.3466 & 102.599 & 56.79 \\ 43.8397 & 56.79 & 42.1284 \end{bmatrix} ,
\]

and S = QΛQ^T, where

\[
\Lambda = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}
\approx \begin{bmatrix} 196.946 & 0 & 0 \\ 0 & 9.474 & 0 \\ 0 & 0 & 6.635 \end{bmatrix}
\]

and

\[
Q = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}
\approx \begin{bmatrix} 0.565 & 0.588 & 0.579 \\ 0.710 & 0.011 & -0.704 \\ 0.420 & -0.809 & 0.412 \end{bmatrix} .
\]

The principal component axes shown in the right part of Figure 9.7 are lines described parametrically as ℓi(t) = x̄ + t qi, for i = 1, 2, 3. The lines are centered at the sample mean x̄.

Since λ1 is much greater than λ2 and λ3, most of the variation in the data is along the first principal component axis. In fact, since

\[
\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3} \approx 0.924 , \qquad
\frac{\lambda_2}{\lambda_1 + \lambda_2 + \lambda_3} \approx 0.044 , \qquad
\frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3} \approx 0.031 ,
\]

about 92.4% of the total variation in the data is due to the first component.

Further, if c is the chest measurement, a is the abdomen measurement, and h is the hip measurement of a man in the study, then the formula

y = 0.565(c − 101.068) + 0.710(a − 92.183) + 0.420(h − 99.594)

can be used as a body “size index” in later analyses.
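The same numerical check works for the three-variable decomposition; the short sketch below (illustrative) recovers the eigenvalues and proportions of variation quoted above:

    import numpy as np

    S = np.array([[68.3283,  76.3466, 43.8397],
                  [76.3466, 102.599,  56.79  ],
                  [43.8397,  56.79,   42.1284]])

    lam = np.linalg.eigvalsh(S)[::-1]   # eigenvalues of the symmetric matrix, decreasing order
    print(lam)                          # approximately [196.946, 9.474, 6.635]
    print(lam / lam.sum())              # approximately [0.924, 0.044, 0.031]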





10 Additional Reading

(1-2) An Introduction to Mathematical Statistics, by Larsen and Marx (Prentice Hall), and Mathematical Statistics and Data Analysis, by Rice (Duxbury Press), are introductions to probability theory and mathematical statistics with emphasis on applications. The book by Larsen and Marx is currently used as the textbook for the MT426-427 mathematical statistics sequence.

(3-4) Calculus, by Hughes-Hallett, Gleason, et al. (Wiley), and Multivariable Calculus, by McCallum, Hughes-Hallett, Gleason, et al. (Wiley), are gentle introductions to single-variable and multivariable calculus methods. The first book is currently used as the textbook for the MT100-101 calculus sequence.

(5) Vector Calculus, by Colley (Prentice Hall), is a more sophisticated introduction to multivariable calculus. Colley emphasizes the use of linear algebra methods. This book is currently used as the textbook for the MT202 multivariable calculus course.

(6-7) Linear Algebra and Its Applications, by Lay (Addison-Wesley), and Introduction to Linear Algebra, by Strang (Wellesley-Cambridge), are good introductions to linear algebra methods. Each book emphasizes both theory and practice. The book by Lay is currently used as the textbook for the MT210 linear algebra course.

(8) Advanced Calculus with Applications in Statistics, by Khuri (Wiley), is a more advanced text designed specifically to tie topics in advanced calculus to applications in statistics.

(9-10) Introduction to Matrices with Applications in Statistics, by Graybill (Wadsworth), and Matrix Algebra Useful for Statistics, by Searle (Wiley), are well-respected texts designed specifically to tie linear algebra methods to applications in statistics. Note, however, that each book is quite old, and each uses out-of-date notation.

(11) Statistical Models, by Freedman (Cambridge University Press), is a recent text for a second course in statistics. From the Preface: “The contents of the book can be fairly described as what you have to know in order to start reading empirical papers that use statistical models.”

