
MTH3051
Introduction to Computational Mathematics

Lecture notes

Clayton Campus, 2014



School of Mathematical Sciences, Monash University

Contents

1. Computing π
   1.1 A slice of π
   1.2 A formula for π
   1.3 Programming
   1.4 A flood of formulae
   1.5 Mnemonics for π
   1.6 Some useful references on π

2. An Introduction to Programming and Matlab
   2.1 Introduction
   2.2 A quadratic equation
       2.2.1 Why IF (a == 0) won't work
   2.3 A finite series
   2.4 Matrix multiplication
   2.5 An infinite series

3. Truncation and Round-off Errors
   3.1 Order estimates
   3.2 Absolute and relative errors
   3.3 Truncation errors
   3.4 Round-off errors
   3.5 Understanding round-off errors
   3.6 Examples
   3.7 Addition
   3.8 Subtraction
   3.9 Division

4. Solutions of Equations in One Variable
   4.1 Introduction
   4.2 Fixed point iteration
       4.2.1 Why the funny name?
       4.2.2 Fixed point in pictures
       4.2.3 Cyclic Fixed point iterations
       4.2.4 Convergence
       4.2.5 Programming notes
       4.2.6 A Matlab program
   4.3 Newton-Raphson iteration
       4.3.1 Some notation
       4.3.2 Newton-Raphson for multiple roots
       4.3.3 Cycling
   4.4 Interval methods
       4.4.1 Half Interval Search
       4.4.2 False Position
   4.5 Summary

5. Solving Systems of Linear Equations
   5.1 Introduction
   5.2 Gaussian elimination with back substitution
       5.2.1 Pivoting
       5.2.2 Tri-diagonal systems
       5.2.3 Round-off Errors and Pivoting
   5.3 Ill-conditioned systems
   5.4 Operational cost
   5.5 Iterative methods
       5.5.1 Jacobi iteration
       5.5.2 Gauss-Seidel iteration
       5.5.3 Diagonal dominance and convergence
   5.6 Operational counts

6. Solving Systems of Nonlinear Equations
   6.1 Introduction
   6.2 Generalised Fixed Point iteration
       6.2.1 Convergence
       6.2.2 Matlab example
   6.3 Generalised Newton-Raphson
       6.3.1 Matlab code

7. Interpolation and Approximation of Data
   7.1 The what and why of interpolation
   7.2 Lagrangian interpolation
       7.2.1 The whole polynomial or just its value?
       7.2.2 Matlab code
   7.3 Newton polynomials
       7.3.1 Horner's form of the Newton polynomial
   7.4 Uniqueness
   7.5 Piecewise polynomial interpolation
       7.5.1 Cubic Splines
       7.5.2 Example
   7.6 Non-polynomial interpolation
       7.6.1 Fourier series
       7.6.2 Estimating the Fourier coefficients
       7.6.3 Example
   7.7 Approximating functions
       7.7.1 Example
       7.7.2 Least Squares
       7.7.3 Generalised least squares
       7.7.4 Variations on a theme
       7.7.5 Applications
       7.7.6 Notes
       7.7.7 Matlab example

8. Extrapolation Methods
   8.1 Richardson extrapolation
       8.1.1 Example – computing π
       8.1.2 Conclusion

9. Numerical integration
   9.1 Introduction
   9.2 The Left and Right hand sum rules
       9.2.1 The Left Hand Rule
       9.2.2 The Right Hand Rule
       9.2.3 The Mid Point rule
   9.3 The Trapezoidal rule
       9.3.1 Choices, choices, so many choices
   9.4 Simpson's rule and Romberg integration
   9.5 Simpson's rule
       9.5.1 Example
   9.6 Romberg integration
       9.6.1 Example

10. Numerical differentiation
    10.1 Introduction
    10.2 Finite differences
        10.2.1 First derivatives
        10.2.2 Second derivatives
    10.3 Truncation and round-off errors
        10.3.1 Example – Forward finite differences
        10.3.2 Example – Centred finite differences

11. Numerical Solutions of Ordinary Differential Equations
    11.1 Introduction
    11.2 Initial value problems
    11.3 Euler's method
    11.4 Improved Euler scheme
    11.5 Taylor series method
    11.6 Runge-Kutta schemes
        11.6.1 Second Order Runge-Kutta
        11.6.2 Fourth Order Runge-Kutta
    11.7 Error analysis
        11.7.1 Discretization errors
        11.7.2 Local discretization error
        11.7.3 Global Discretization Error
        11.7.4 Numerical results
    11.8 Stability
        11.8.1 Example 1
        11.8.2 Example 2
        11.8.3 Example 3
        11.8.4 Example 4

12. Optimisation
    12.1 Golden Search
        12.1.1 Example
    12.2 Steepest descent
    12.3 Genetic Algorithms
        12.3.1 Selection
        12.3.2 Breeding
        12.3.3 Example

13. Random numbers
    13.1 Uniform random numbers
    13.2 Non-uniform random numbers


1. Computing π



1.1 A slice of π

Everybody knows that 22/7 is only an approximation to π. Here is a better approximation, accurate to 200 decimal places.

π ≈ 3.14159265358979323846264338327950288419716939937510582
      09749445923078164062862089986280348253421170679821480
      86513282306647093844609550582231725359408128481117450
      28410270193852110555964462294895493038196...

One Freddo frog for the first person who can recite this, from memory, in class!

Some questions come to mind.

◮ Why would we want so many decimal digits?
◮ How was the above approximation obtained?
◮ How might we get better approximations?

1.2 A formula for π

To compute π we need a formula. Here is one simple formula,

\[
\pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots + \frac{(-1)^{k+1}}{2k-1} + \cdots\right)
    = 4\sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{2k-1}
\]

This suggests a scheme for approximating π – terminate the infinite series at some chosen term, say n,

\[
\pi \approx S_n = 4\sum_{k=1}^{n} \frac{(-1)^{k+1}}{2k-1}
\]

We call this an algorithm for π.

The big question is: How well does this algorithm work? Here is a table of successive approximations.

    n       S_n               |S_n - π|
    1       4.000000000e+00   8.584e-01
    10      3.041839619e+00   9.975e-02
    100     3.131592904e+00   1.000e-02
    1000    3.140592654e+00   1.000e-03
    10000   3.141492654e+00   1.000e-04


What do we observe? First (and most importantly) it appears that our successive approximations are converging to π. Second, each extra digit of accuracy requires a 10-fold increase in the number of terms. This is extremely inefficient – to recover the above 200 digits of π would require over 10^200 terms. Clearly we need a better algorithm.

This time we will use

\[
\pi \approx S'_n = \left(12\sum_{k=1}^{n} \frac{(-1)^{k+1}}{k^2}\right)^{1/2}
\]

for which we find

    n       S'_n              |S'_n - π|
    1       3.464101615e+00   3.225e-01
    10      3.132977195e+00   8.615e-03
    100     3.141498114e+00   9.454e-05
    1000    3.141591700e+00   9.540e-07
    10000   3.141592644e+00   9.548e-09

Again we notice that the series converges (good!) and that now we get two extra digits of accuracy for every 10-fold increase in the number of terms. This is an improvement (but not enough to tackle the 200-digit calculation, which requires much more sophisticated algorithms than we have time to explore).

The point to take home from this pair of examples is that you may need to trawl through various formulae, all mathematically equivalent, to find an algorithm that is efficient and accurate – we want an accurate answer with a minimum of computation. Much of what we will do in this subject is to search for suitable algorithms for various mathematical tasks.

1.3 Programming

How were the above tables generated? For small values of n we could imagine doing the computations by hand, but for large values our patience might wear thin. Clearly we need a way to automate the process. This is where computers come into the picture. The simple idea is that we provide the computer with a set of instructions which it faithfully executes, and out pop our answers. Here are the instructions that were used for the first algorithm.


n = 100;                  % set number of terms
sum = 0;                  % set initial value for sum
sign = 1;                 % an integer +/- 1
for k = 1 : n             % loop over k from 1 to n
   term = sign/(2*k-1);   % compute the term
   sum = sum + term;      % update rolling sum
   sign = - sign;         % flip the sign
end;
disp(n);                  % print n
disp(4*sum);              % print the approximation to pi
disp(4*sum-pi);           % print the error

This is an example of Matlab syntax. Matlab is a programming language well suited to numerical computations. There are many other languages such as Fortran, C, Java, Maple and Mathematica. They all have a similar flavour and they all serve the one purpose of getting the computer to do useful work for us. We will use Matlab throughout this course as it is (arguably) the easiest to learn for someone with no programming experience. But don't let this stop you from learning about other programming languages (in your spare time).

In reading the above code you must keep in mind one extremely important fact: the equals sign is not what you might expect it to be. The computer will (usually) treat the equals sign as a replacement operator. Thus in executing a line like x = 2*y the computer will first evaluate the right-hand side and then assign that value to the left-hand side. Whatever value x had before will be wiped out. Its value after the line is executed will be 2*y. This use of the equals sign allows us to write lines like x = x + 1 to increment the current value of x by 1. In contrast, if you showed a mathematician a line like x = x + 1 he or she would look at you very very strangely (why?).

The above Matlab code is fairly easy to read. The first three lines set initial values to various symbols (aka variables). Then we encounter a for-loop. Matlab will repeat the code in this loop for each value of k from 1 to n in strict order. Each time through this loop we are calculating one term in the series. After the loop finishes we print out the various numbers that interest us (using the Matlab command disp, which is short for display).

If the above code baffles you then one way to understand it is to pretend you are the computer and follow the Matlab commands. Get out a pencil and paper and start following the instructions. Work your way through the first 5 or so terms. You should see that it is correct and does compute the series.


Example 1.1

Modify the above Matlab code to use the second algorithm for π. You'll be able to test your code later in your tutorial classes.
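For reference, here is a minimal sketch (an addition to these notes, not part of the original text) of what such a modification might look like; the number of terms n = 100 is an arbitrary choice.

n = 100;                    % set number of terms
sum = 0;                    % rolling sum of the series
sign = 1;                   % alternating sign, +/- 1
for k = 1 : n
   term = sign/(k*k);       % k-th term (-1)^(k+1)/k^2
   sum = sum + term;
   sign = - sign;
end;
approx = sqrt(12*sum);      % S'_n = (12*sum)^(1/2)
disp(approx);               % print the approximation to pi
disp(abs(approx - pi));     % print the error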

1.4 A flood of formulae

Mathematicians have a deep rooted love for π and not surprisingly they have derived many varied formulae for π. Here, just for the curious, are some other formulae that you might like to play with.

Many of these use the following infinite series for arc-tan,

\[
\tan^{-1}(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \cdots + (-1)^k\frac{x^{2k+1}}{2k+1} + \cdots
\]

\[
\pi = 4\tan^{-1}(1)
\]
\[
\pi = 16\tan^{-1}(1/5) - 4\tan^{-1}(1/239)
\]
\[
\pi = 16\tan^{-1}(1/5) - 4\tan^{-1}(1/70) + 4\tan^{-1}(1/99)
\]
\[
\frac{\pi^2}{6} = \frac{1}{1^2} + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \cdots
\]
\[
\frac{\pi^3}{32} = \frac{1}{1^3} - \frac{1}{3^3} + \frac{1}{5^3} - \frac{1}{7^3} + \cdots
\]
\[
\frac{\pi}{2} = \frac{2 \times 2 \times 4 \times 4 \times 6 \times 6 \times 8 \cdots}{1 \times 3 \times 3 \times 5 \times 5 \times 7 \times 7 \cdots}
\]
\[
\frac{2}{\pi} = \sqrt{\frac{1}{2}} \times \sqrt{\frac{1}{2} + \frac{1}{2}\sqrt{\frac{1}{2}}} \times \sqrt{\frac{1}{2} + \frac{1}{2}\sqrt{\frac{1}{2} + \frac{1}{2}\sqrt{\frac{1}{2}}}} \times \cdots
\]
\[
\frac{4}{\pi} = 1 + \cfrac{1^2}{2 + \cfrac{3^2}{2 + \cfrac{5^2}{2 + \cfrac{7^2}{2 + \cdots}}}}
\]
\[
\frac{1}{\pi} = 12\sum_{k=0}^{\infty} \frac{(-1)^k\,(6k)!}{(k!)^3\,(3k)!}\,
               \frac{13591409 + 545140134k}{640320^{3(2k+1)/2}}
\]
\[
\pi = \sum_{k=0}^{\infty} \left(\frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6}\right)\left(\frac{1}{16}\right)^{k}
\]
\[
\pi = \lim_{k\to\infty} f_k\,, \quad\text{where}\quad f_k = f_{k-1} + \sin(f_{k-1})\,,\quad f_0 = 1
\]
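As a taste of how much faster some of these converge, here is a small sketch (not part of the original notes) that evaluates Machin's formula π = 16 tan⁻¹(1/5) − 4 tan⁻¹(1/239) using the arc-tan series above, truncated after n terms; n = 10 is an arbitrary choice.

n = 10;
s1 = 0;                   % arc-tan series for x = 1/5
s2 = 0;                   % arc-tan series for x = 1/239
for k = 0 : n-1
   s1 = s1 + (-1)^k * (1/5)^(2*k+1) / (2*k+1);
   s2 = s2 + (-1)^k * (1/239)^(2*k+1) / (2*k+1);
end;
approx = 16*s1 - 4*s2;
disp(abs(approx - pi));   % already close to machine precision for n = 10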


1.5 Mnemonics for π

Here are a few simple mnemonics that people have used to help memorise the decimal digits of π. Each works by taking the number of letters in each word as the value of the digit at that point in the decimal expansion of π.

How I wish I could calculate pi

How I like a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.

Sir, I bear a rhyme excelling
In mystic force and magic spelling
Celestial sprites elucidate
All my own striving can't relate
Or locate they who can cogitate
And so finally terminate. Finis.

1.6 Some useful references on π

http://www.joyofpi.com
http://mathworld.wolfram.com/PiFormulas.html



2. An Introduction to Programming and Matlab



2.1 Introduction

One of the main hurdles that new-comers to numerical methods face is how to convert a mathematical equation like

\[
\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} + \cdots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!}
\]

into a Matlab program, such as the following

% --- set initial values ----------------------------------------
k = 0;
x = 1.23;
sum = x;
term = x;
k_max = 100;
looping = true;
x_square = x*x;

% --- compute successive terms in the series --------------------
while looping
   k = k + 1;
   term = - term*x_square/( (2*k+1)*(2*k) );
   sum = sum + term;
   if ( k >= k_max )
      looping = false;
   end
   if ( abs(term) < 0.00001 )
      looping = false;
   end
end

answer = sum;

The aim of this set of notes is to show you how to make the transition from Maths to Matlab. This will hardly be an exhaustive study but it will give you a kick start (from which you can go on to scale great heights!).

In each of the following examples we will start by writing, usually in just one line of code, the heart of the mathematics in basic Matlab form. We will then look at this code and ask two basic questions


◮ What more do we need to make this code work?
◮ What can go wrong with this code?

Both questions are very important. Their answers will force you to add extra Matlab code and in this way you will build a complete working program. What began as one line may turn out to be dozens of lines.

Once you have a working program you might also want to ask a third question,

◮ What improvements can we make to the code?

This covers a wide raft of issues, such as efficiency, readability, utility and so on.

2.2 A quadratic equation

Given the quadratic equation

\[
0 = ax^2 + bx + c
\]

we know that the two roots are given by

\[
r_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}\,, \qquad r_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}
\]

This can be written in Matlab as

r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;

Notes

◮ Matlab executes each line in turn, from top to bottom.
◮ Lines that end with ; are silent – Matlab prints nothing as it executes that line.
◮ Multiplication is denoted by the * symbol.
◮ The symbol ^ denotes exponentiation (raising to a power).
◮ Names like a, b, c etc. are known as Matlab variables.
◮ Variables have values. These values may change during the program's execution.
◮ Matlab reads the right hand side, computes a value, then assigns it to the variable on the left hand side of the equals sign.


What more do we need to make this code work?

We need values for a, b and c. This is easy – we simply include lines like a = 2; b = 1; c = 5; before the above pair of lines.

What can go wrong with this code?

We might encounter complex roots. Though Matlab can handle complex numbers (without any fuss) we'll declare (for this example) that complex roots are forbidden. So we need to avoid computing the square root when b² − 4ac < 0. This we do by using an if statement.

Here is our updated code.

a = 2; b = 1; c = 5;
if ( b*b - 4*a*c >= 0 )
   r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
   r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;
end

Notes

◮ The ; also allows us to put more than one expression on each line.
◮ The if (...) and end lines define the block of lines to be executed only when b² − 4ac is greater than or equal to zero.

If we left the code as it is we could run into another problem further down the track. How so? Well, we have deliberately chosen (or forgotten?) to not set the values for r_1 and r_2 in the case of complex roots. But what would happen if we tried to use r_1 and r_2 later, in some other part of the code? Heaven only knows what values would be used (never assume that a variable starts with an initial value of zero). Clearly we need to provide values for r_1 and r_2 for all possible cases. Thus we modify the if statement as follows

a = 2; b = 1; c = 5;
if ( b*b - 4*a*c >= 0 )
   r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
   r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;
else
   r_1 = 0;
   r_2 = 0;
end

Note that setting r_1 = 0 and r_2 = 0 is mathematically wrong but at least we can now proceed with known values for r_1 and r_2. We could also use these zero values as


an indicator to other parts of the program that we have a special case to consider (i.e. complex roots).

2.2.1 Why IF (a == 0) won't work

Once again we ask: What can go wrong with this code? We've already handled the case of complex roots but we have yet to deal with the glaring problem that arises when a = 0.

Blindly running the above Matlab code with a = 0 will surely be very disappointing! So in despair we return to the mathematics and quickly realise that when a = 0 our quadratic actually reduces to the simple linear equation

\[
0 = bx + c
\]

for which the single solution is x = −c/b (assuming b ≠ 0). With that in mind we might modify our Matlab code to look like

a = 2; b = 1; c = 5;
if ( a ~= 0 )
   if ( b*b - 4*a*c >= 0 )
      r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
      r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;
   else
      r_1 = 0;
      r_2 = 0;
   end
else
   r_1 = -c/b;
   r_2 = r_1;
end;

But this too may give you heartache – why? Because small round-off errors may push a slightly away from zero. Thus even though a should be exactly zero it might be stored as some small number such as 1.0 × 10⁻¹². Even worse – it may be stored as a negative number! The solution is to compare a against some pre-chosen small number as in the following example


a = 2; b = 1; c = 5;
if ( abs(a) > 1e-10 )
   if ( b*b - 4*a*c >= 0 )
      r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
      r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;
   else
      r_1 = 0;
      r_2 = 0;
   end
else
   r_1 = -c/b;
   r_2 = r_1;
end;

The message here is: never test two real numbers for equality. Thus if the mathematics tells you something special happens when p = q (for example) then in your Matlab code you should never use an if statement like

if (p == q)
   ...
end

but rather

if abs(p - q) < very_small
   ...
end

There still remains the niggling problem of how to handle the case where both a and b are zero. Here I invoke the classic excuse of a lecturer – this case is left as an exercise for the student.
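To see why such a tolerance matters in practice, here is a tiny illustration (an addition, not part of the original notes; the tolerance 1e-10 is an arbitrary choice):

a = 0.1 + 0.2;
disp(a == 0.3);               % prints 0 (false) - the stored values differ slightly
disp(abs(a - 0.3) < 1e-10);   % prints 1 (true)  - comparison with a tolerance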

2.3 A finite series

How would you get Matlab to compute the following sum?

\[
S = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots + \frac{1}{100}
\]

This seems simple enough, just add up 100 numbers. Too easy really (agreed?). If you were to compute this by hand (always a good place to start when writing computer code) you most probably would start with the first term, add on the second term, then the


third term and so on, stopping only after adding on 1/100. The heart of the calculation looks like this

term = ...;
sum = sum + term;

where the three dots denote the typical number (e.g. 1/47). Once again we ask: What more do we need to make this code work? Clearly we need

◮ A rule for computing each term (the three dots),
◮ A mechanism for stepping through all the numbers from 1 to 100, and
◮ An initial value for the variable sum.

Here is a Matlab program that does the job

sum = 0;
for num = 1:100
   term = 1/num;
   sum = sum + term;
end;
answer = sum;

The main new construct in this code is the for loop. The loop begins with the keyword for and ends with the line end;. The contents of the loop are then repeatedly executed for the values of num from 1 to 100 (in strict sequence!). Once the loop is finished (i.e. after all 100 numbers have been added to sum) Matlab will continue execution on the line directly following the loop; in this case it assigns (copies) the value of sum to the variable answer.

This structure is very common. It contains a section that initialises some data (sum = 0), followed by a repeated set of calculations (the for loop) and then a section that records the answers (answer = sum).

You could ask the other popular question What can go wrong with this code? but I think you'll see that it's bullet proof (for which we rejoice as this is rarely the case).
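As an aside on the third question (what improvements can we make?), Matlab's element-wise operations let us write the same sum without an explicit loop. This is just an alternative sketch, not part of the original notes:

answer = sum( 1 ./ (1:100) );   % 1./(1:100) is the vector [1, 1/2, ..., 1/100];
                                % note this uses the built-in function sum, so the
                                % variable called sum above must not shadow it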

2.4 Matrix multiplication

Given an n × p matrix U and a p × m matrix V their product UV is an n × m matrix. We will use subscripts like ij to denote the entry in row i and column j. Then we have

\[
(UV)_{ij} = \sum_{k=1}^{p} U_{ik} V_{kj}\,, \qquad 1 \le i \le n\,, \quad 1 \le j \le m
\]


For a single entry in the product matrix (at (i,j) for example) we need to compute a finite sum of products. This is easy to write in Matlab form

sum = 0;
for k = 1:p
   sum = sum + U(i,k)*V(k,j);
end
UV(i,j) = sum;

This needs to be repeated for all choices of i and j and that is best done using a pair of for loops. This leads to

for i=1:n
   for j=1:m
      sum = 0;
      for k = 1:p
         sum = sum + U(i,k)*V(k,j);
      end
      UV(i,j) = sum;
   end
end

It may please you to know that Matlab is matrix savvy – it knows how to directly multiply matrices. Thus the above could also be written simply in one single statement UV = U*V. This applies to any pair of matrices (provided they have suitable sizes). Thus if A is a 3 element column vector and B is a 3 element row vector then B*A is a single number (the dot product of the vectors) while A*B is a new 3 × 3 matrix. Why am I telling you this now? Because there will be many many times when you will need to access parts of matrices in ways that might not be possible without using explicit index notation.
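A quick way to convince yourself that the triple loop above really does compute the matrix product is to compare it against Matlab's built-in multiplication. This is a small sketch, not part of the original notes; the sizes n, p, m are arbitrary choices:

n = 3; p = 4; m = 2;
U = rand(n,p);                 % random test matrices
V = rand(p,m);
UV = zeros(n,m);
for i = 1:n
   for j = 1:m
      s = 0;
      for k = 1:p
         s = s + U(i,k)*V(k,j);
      end
      UV(i,j) = s;
   end
end
disp( max(max(abs( UV - U*V ))) );   % difference is zero or at round-off level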

2.5 An infinite series

We (should) know that

\[
\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} + \cdots = \sum_{k=0}^{\infty} (-1)^k \frac{x^{2k}}{(2k)!}
\]

This is an infinite series and that presents us with our first challenge – how does Matlab cope with an infinite number of terms? The simple answer is that it can't, so we are forced to approximate the infinite series, for example with a finite series, such as the

first 1001 terms,

\[
\cos(x) \approx 1 - \frac{x^2}{2!} + \frac{x^4}{4!} + \cdots + \frac{x^{2000}}{2000!} = \sum_{k=0}^{1000} (-1)^k \frac{x^{2k}}{(2k)!}
\]

This is now similar to our previous example of a finite series (although somewhat more challenging). As before we take the approach that in our Matlab code we will compute the sum term by term. Let's suppose Matlab has computed the first k − 1 terms and recorded the sum (so far) in the variable sum. The next step would be to add the next term (−1)^k x^{2k}/(2k)! to sum. For this we might propose a Matlab fragment like

sum = sum + ( (-1)^k )*( x^(2*k) )/( (2*k)! );

What more do we need to make this code work?

Clearly we need a value for x. We also need to set an initial value for sum and we need to run through the allowed values of k.

What can go wrong with this code?

It is an infinite sum and as we do not want the computer to run forever we need to keep track of how many terms we have computed. If that exceeds a predefined limit we should then terminate the computations (taking the last value of sum as the best approximation to the infinite series).

With these variations in mind we now propose

x = 1.23;
sum = 1;
for k = 1:1000
   sum = sum + ( (-1)^k )*( x^(2*k) )/( (2*k)! );
end
answer = sum;

And yet there remains one major problem – Matlab does not understand the factorial symbol "!". So we need to compute it ourselves. We know that

\[
n! = 1 \times 2 \times 3 \times 4 \times \cdots \times n
\]

This is similar to what we have already been playing with, but rather than taking a sum of numbers here we have to compute a product of numbers. Thus it's not hard to see that n! could be computed using

fact = 1;
for num = 1:n
   fact = fact * num;
end


We can use this fragment to compute (2k)! and so our code now looks like

x = 1.23;
sum = 1;
for k = 1:1000
   fact = 1;
   for num = 1:(2*k)
      fact = fact * num;
   end
   sum = sum + ( (-1)^k )*( x^(2*k) )/( fact );
end
answer = sum;

Though this program will work it is worth asking if it's the best we can do. Surprise, surprise – we can do far better. But what is it that we see as being problematic? (shades of "if it ain't broke, don't fix it").

◮ Do we really need to compute 1000 terms of the series?
◮ Can Matlab accurately compute both x^(2k) and (2k)! for large k?

The simple answer to both questions is no. What do we do? Here is a neat trick that deals with the second objection (we will deal with the first objection a little later on). A typical pair of terms in the infinite series are

\[
a_k = (-1)^k \frac{x^{2k}}{(2k)!}\,, \qquad a_{k-1} = (-1)^{k-1} \frac{x^{2k-2}}{(2k-2)!}
\]

and thus

\[
a_k = -a_{k-1} \frac{x^2}{(2k)(2k-1)}
\]

Thus each new term in our infinite series can be generated from the previous term by this simple formula. It's clearly very easy to compute and much more efficient than our previous formula. But to use this we need an initial value, a_0. That is not hard to determine – we choose it so that we get the correct value for a_1. That is, a_0 = 1. We will use term as the Matlab variable for both a_{k−1} and a_k. Then our Matlab code for cos(x), with blank lines added for clarity, can be streamlined to


x = 1.23;
sum = 1;
term = 1;

for k = 1:1000
   term = - term*(x*x)/( (2*k)*(2*k-1) );
   sum = sum + term;
end

answer = sum;

This is a significant improvement – we have only one loop (rather than two) and the computations are simple and unlikely to cause problems (no large numbers such as (2k)!). But we still have the crazy situation of always computing 1000 terms. If we are looking for an answer that is accurate to say five decimal places then that might occur at k = 37 for example. That is, when the (absolute) value of term is less than 0.00001 (five decimal places) we should bail out of the loop. Here is one way to do that

x = 1.23;
sum = 1;
term = 1;

for k = 1:1000
   term = - term*(x*x)/( (2*k)*(2*k-1) );
   sum = sum + term;
   if ( abs(term) < 0.00001 )
      break
   end
end

answer = sum;

When the break command is executed Matlab will jump out of the surrounding loop and continue execution at the first line directly after the loop, in this case the final line answer = sum.

There is a more elegant way to achieve the same outcome and it uses a special type of variable known as a boolean variable. These take on just two values, true or false, and are often used to control the flow of the program. We shall modify the above code by introducing a new boolean variable looping which will be true only when we have not reached our target accuracy of five decimal places. We will need to set the value for


looping as each term is calculated. Here is the final code

% --- set initial values ----------------------------------------
k = 0;
x = 1.23;
sum = 1;
term = 1;
k_max = 100;
looping = true;
x_square = x*x;

% --- compute successive terms in the series --------------------
while looping
   k = k + 1;
   term = - term*x_square/( (2*k)*(2*k-1) );
   sum = sum + term;
   if ( abs(term) < 0.00001 )
      looping = false;
   end
   if ( k > k_max )
      looping = false;
   end
end

answer = sum;

There are quite a few changes introduced in the above code. We have replaced the for loop with a while loop. Thus we have also been forced to explicitly increment k (i.e. the line k = k + 1) and to limit the number of terms (hence the new variable k_max). Though the above code is longer than the previous version it is (in my opinion) easier to read and better conveys what we are actually doing (i.e. looping until we reach a given accuracy).

Compare this code with that given for sin(x) at the start of these notes. You will see that they are similar but they do have important differences (as they should, after all sin(x) and cos(x) are different functions). Note in particular the way term is calculated, and the initial values for k and sum. If you have any doubts that they are correct, one way to check is to follow the Matlab code (pretend you are the computer), writing out the loops for the first few terms. You will very quickly see that both programs as written are correct.



3. Truncation and Round-off Errors



3.1 Order estimates

Quite commonly, when analysing the performance of an algorithm, we find ourselves making statements about pairs of related numbers, say x, y(x), along the lines of

\[
y(x) = Ax^m + \text{terms in } x \text{ smaller than } Ax^m
\]

where A and m are some numbers (which we may or may not know). In this way we are isolating the dominant term Ax^m of y(x) for small values of x (i.e. |x| ≪ 1). The value of A is often of little importance and so we use a notation which draws our attention to the important number m, that is

\[
y = O\left(x^m\right)
\]

The formal definition is as follows. If lim_{x→0} y(x)/x^m = A ≠ 0 then we say y(x) = O(x^m) for |x| ≪ 1. In words, we say that y(x) is of order x^m for small x.
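As a quick illustration of this definition (an added example, not one of the exercises below), take y(x) = sin(x). Since

\[
\lim_{x\to 0} \frac{\sin(x)}{x} = 1 \neq 0
\]

we conclude that sin(x) = O(x) for |x| ≪ 1.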

Example 3.1

Show that y(x) = x + x² is O(x) for small x.

Example 3.2

Show that y(x) = 1 − cos(x) is O(x²) for small x.

Example 3.3

If u(x) = O(x³) and v(x) = O(x²) what can you say about u(x) + v(x) and u(x)v(x) for small x?

3.2 Absolute and relative errors

It's a sad fact that our computations are never perfect and they will carry (hopefully) small errors. There are two primary sources of error: truncation errors (which arise largely from our choice of algorithm) and round-off errors (introduced by the computer and much less in our control). Both of these will be discussed below. Whatever the nature of the error, we often speak of two ways in which to measure that error. These are known as absolute and relative errors and they are defined as follows. Suppose x̃ is our approximation to some exact number x. Then we define

◮ Absolute error: |x̃ − x|
◮ Relative error: |x̃ − x|/|x|
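For instance (an added illustration, not from the original notes), taking x = π and the approximation x̃ = 22/7:

x       = pi;
x_appr  = 22/7;
abs_err = abs(x_appr - x);       % approximately 1.26e-03
rel_err = abs_err/abs(x);        % approximately 4.02e-04
disp([abs_err, rel_err]);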


3.3 Truncation errors

The two algorithms

\[
\pi \approx S_n = 4\sum_{k=1}^{n} \frac{(-1)^{k+1}}{2k-1}\,, \qquad
\pi \approx S'_n = \left(12\sum_{k=1}^{n} \frac{(-1)^{k+1}}{k^2}\right)^{1/2},
\]

which we have seen provide convergent approximations to π, were each obtained by truncating the related infinite series. Not surprisingly the error incurred in doing so is known as the truncation error. This is one of a variety of errors that can enter our numerical computations (the other main source of error is known as round-off error which we will discuss soon). This type of error is of our own making and by choosing where to terminate the series or even choosing a different series we can control the size of the truncation error. The point is that the truncation error is introduced by our mathematical manipulations prior to turning to the computer or calculator.

Let us write E_t(n) for the truncation error for our first algorithm, that is

\[
E_t(n) = |S_n - \pi| = 4\left|\sum_{k=n+1}^{\infty} \frac{(-1)^{k+1}}{2k-1}\right|
\]

Our numerical results, that we get one extra decimal digit for every 10-fold increase in n, strongly suggest that E_t(n) must vary in proportion to 1/n. That is,

\[
E_t \approx \frac{A}{n}
\]

for some unknown constant A that does not depend on n. This is a numerical observation. Can we do any better than this? Yes.

Example 3.4

Using the data from the previous tables verify that |S_n − π| = O(n⁻¹) and |S'_n − π| = O(n⁻²).
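A small sketch of such a check (an addition, not part of the original notes): tabulate n|S_n − π| and n²|S'_n − π|; if the orders are right, both columns should settle down to constants as n grows.

for n = [10 100 1000 10000]
   k   = 1:n;
   Sn  = 4*sum( (-1).^(k+1) ./ (2*k-1) );          % first algorithm
   Spn = sqrt( 12*sum( (-1).^(k+1) ./ k.^2 ) );    % second algorithm
   fprintf('%6d  %8.4f  %8.4f\n', n, n*abs(Sn-pi), n^2*abs(Spn-pi));
end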

Example 3.5

Prove that the truncation error E_t(n) = |S_n − π| is bounded by

\[
\frac{2}{(n+2)^2} < E_t(n) < \frac{2}{n}
\]

This shows that for each ten-fold increase in n we can expect at least one extra decimal digit of accuracy but no more than two extra digits.


Example 3.6

The series S''_n = Σ_{k=1}^{n} (−1)^{k+1}/k² converges to π²/12. Show that for this series the truncation error E_t(n) = |S''_n − π²/12| is bounded by

\[
\frac{2}{(n+2)^3} < E_t(n) < \frac{1}{(n+1)^2}
\]

From this we would infer that we would get at least two (and no more than three) extra decimal digits for each 10-fold increase in n.

This kind of analysis is often used as a way of checking that our numerical calculations are on track (i.e. we are looking for consistent behaviour between our numerical and theoretical calculations, but be warned – simply observing consistency does not prove that our computer calculations are correct – that requires more work!).

3.4 Round-off errors

All information in a computer is stored in fixed length arrays or registers. This has significant consequences for the storage of decimal numbers such as 1/3 = 0.333···. This number has an infinite number of digits and so only a finite number of the leading digits (usually no more than 15) can be stored. The remaining digits must be discarded. This introduces a small error in storing the number. The important issue is to what extent this small error affects subsequent calculations. Errors of this kind are known as round-off errors and, depending on the nature of the following calculations, the round-off errors (which occur with every calculation) may remain small or they may accumulate and eventually swamp the calculation – at which point any further computation is meaningless.

Example 3.7

Working to 2, 4 and 8 decimal digits compute 2.34567 + 1.23456 and 2.34567 − 1.23456 and their absolute and relative errors.

    digits  Sum               Abs. error  Rel. error
    2       3.500000000e+00   8.023e-02   2.241e-02
    4       3.581000000e+00   7.700e-04   2.151e-04
    8       3.580230000e+00   0.000e+00   0.000e+00

    digits  Diff              Abs. error  Rel. error
    2       1.100000000e+00   1.111e-02   9.999e-03
    4       1.111000000e+00   1.100e-04   9.900e-05
    8       1.111110000e+00   2.220e-16   1.998e-16

Do these results look reasonable? Yes, the errors are small and we get improved accuracy when we use more digits. A benign example.


Example 3.8

Working to 2, 4 and 8 decimal digits compute 1.23457 + 1.23456 and 1.23457 − 1.23456 and their absolute and relative errors.

    digits  Sum               Abs. error  Rel. error
    2       2.400000000e+00   6.913e-02   2.800e-02
    4       2.470000000e+00   8.700e-04   3.524e-04
    8       2.469130000e+00   4.441e-16   1.799e-16

    digits  Diff              Abs. error  Rel. error
    2       0.000000000e+00   1.000e-05   1.000e+00
    4       0.000000000e+00   1.000e-05   1.000e+00
    8       1.000000000e-05   1.565e-16   1.565e-11

Notice that the relative errors in computing the difference 1.23457 − 1.23456 are now much much larger than in the previous example 2.34567 − 1.23456. This is a classic example of round-off error – when two nearly equal numbers are subtracted the leading digits in each number cancel, leaving behind only the (inaccurate) trailing digits and thus also introducing a large relative error in the result. This effect is sometimes referred to as a loss of precision or a loss of significance.
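You can reproduce the flavour of the last row of that table directly in Matlab (an added sketch, not from the original notes):

x = 1.23457;
y = 1.23456;
d = x - y;                          % exact answer is 1.0e-05
disp( abs(d - 1.0e-05)/1.0e-05 );   % relative error around 1e-11 (as in the
                                    % table above), not the ~1e-16 of x and y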

Example 3.9

The above tables were generated by some fancy programming on a 16-digit computer. In the 8-digit calculations you will see that the absolute errors are listed as approximately 10⁻¹⁶. But should not this be zero? After all, an 8-digit computer should be able to compute 2.34567 + 1.23456 without any error. What is going on here?

3.5 Understanding round-off errors

To properly understand round-off errors we need a model of how decimal numbers are stored in a computer's registers. Suppose our computer can only store the first N decimal digits of any number. All digits after the first N digits will be lost – we call this the round-off error.

Here is a little picture showing how a typical number x might be stored on our N-digit computer.


[Register picture: x = 0.123456789012345678..., of which only the first N digits are stored.]

We humans often write numbers like 0.0001234 or 675.123. This is not how the computer stores them. It will always shuffle the decimal point left or right to put the number in the form shown above. This process is called normalisation and it is applied after every computation. Numbers such as 0.0001234 and 675.123 are known as un-normalised numbers.

For the time being we will work only with positive numbers (it saves writing ± with every number).

3.6 Examples

[Register picture: 0.314159265358979 – π stored in normalised form.]

[Register picture: 0.256123400000000 – the number 0.2561234 stored with trailing zeros.]


[Register picture: 0.123456700000000 – the number 0.1234567 stored with trailing zeros.]

From these examples we can see that every number x can be written in the form

\[
x = \left(a + 10^{-N} b\right) \times 10^{m}
\]

with both a and b simple numbers in the range 0.1 to 0.99999···.

In this representation for x we have

\[
\tilde{x} = a \times 10^{m}\,, \qquad E_R(x) = b \times 10^{m} \times 10^{-N}
\]

(Note: a has exactly N decimal digits while b has an infinite number of digits.)
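For instance (an added illustration, not from the original notes), with N = 8 and x = π = 0.314159265358979... × 10¹ we have

\[
a = 0.31415926\,, \qquad b = 0.5358979\ldots\,, \qquad m = 1\,,
\]

so that x̃ = 0.31415926 × 10¹ and E_R(x) = 0.5358979... × 10¹ × 10⁻⁸ ≈ 5.4 × 10⁻⁸.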

In most of what we do with E_R(x) we will not be too concerned with the exact value of b. Thus, as both x and b × 10^m carry a factor of 10^m, we write

\[
E_R(x) = 10^{-N}\, O(x) \tag{1}
\]

Here is a very important question:

How do round-off errors propagate through a series of calculations?

We might be tempted to say that at the end of a series of calculations the round-off error in the answer will be

\[
E_R(y) = 10^{-N}\, O(y)
\]

on an N-digit computer. This is not always true (the above only applies when storing an exact number – it makes no account for the computations that may have preceded this number).

As an example, suppose you calculated y = f(x) for some given function f(x). This will entail at least three sources of round-off error. First, there will be a round-off error in x (due to its own prior history). Second, there will be a round-off error in computing f, and third there may be an error in storing the computed value of f (do we round up or round down?). All three errors will combine and the best we can expect will be

\[
E_R(y) = 10^{-M}\, O(y)
\]

with M ≤ N.


3.7 Addition

[Register picture: x = 0.1234567890, y = 0.5678123456 and their sum x + y = 0.6912691346.]

From this we observe (assuming no carries)

\[
E_R(x + y) = E_R(x) + E_R(y) = 10^{-N}\, O(x) + 10^{-N}\, O(y) = 10^{-N}\, O(x + y) \quad (\text{since } x, y > 0)
\]

So we expect N digits in the final answer – i.e. no loss of significant digits.

Example 3.10

Given that z̃ is the numerical result for z = x + y, how many digits in z̃ can we take as being exactly correct? Which digits might be in error (and why)? Hint: think about carries (if any).

3.8 Subtraction

Let us suppose that the first Q digits of x and y match.


0 . 1 2 3 4 5 9 8 7 6 5 4 3 2 1 0<br />

0 . 1 2 3 4 5 0 1 2 3 4 5 6 7 8 9<br />

0 . 0 0 0 0 0 9 7 5 3 0 8 6 4 2 1<br />

0 . 9 7 5 3 0 8 6 4 2 1 ? ? ? ? ?<br />

Here we only have N − Q accurate digits, all of the others are junk.<br />

⇒ E R (x − y) = 10 −(N−Q) O (x − y)<br />

We have lost Q digits – this is known as a loss of precision. The worst case occurs when<br />

Q = N i.e. when ˜x = ỹ while x ≠ y. So we have<br />

E R (x − y) = 10 −(N−Q) O (x − y)<br />

and as we have just noted this form shows us that only N −Q digits of ˜x − y are accurate.<br />

But there is another way of expressing this result which will be useful later on when we<br />

look at finite differences.<br />

Now we know x ≈ y and x = O(10^m) and x − y = O(10^{m−Q}), thus we have

(x − y) × 10^{+Q} = O(10^{m−Q}) × 10^{+Q}
                  = O(10^m)
                  = O(x)

Thus we also have

E_R(x − y) = 10^{-N} O(x) ,     when x ≈ y
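To see this loss of precision on a real machine, here is a small Matlab experiment (not part of the original notes) using the two numbers from the table above. In ordinary double precision N ≈ 16, and with Q = 5 matching leading digits we expect roughly Q digits to be lost in the difference.

% Loss of precision when differencing two nearly equal numbers.
% The first Q = 5 digits of x and y agree.
x = 0.123459876543210;
y = 0.123450123456789;

d     = x - y;                   % computed difference
exact = 9.753086421e-06;         % the difference worked out by hand

rel_err = abs(d - exact)/abs(exact);
fprintf('difference     = %.15e\n', d);
fprintf('relative error = %.2e\n', rel_err);   % far larger than eps = 2.2e-16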

3.9 Division<br />

Here we want to compute (well, estimate) the round-off error in z = x/y. On this occasion drawing little pictures is not much help. So this time we shall use a purely algebraic approach (one that can be used for other, more challenging computations). We will start by writing both x and y in the standard form

x = (a + 10^{-N} b) × 10^m
y = (c + 10^{-N} d) × 10^n

then

x̃ = a × 10^m ,     E_R(x) = 10^{-N} × (b × 10^m)
ỹ = c × 10^n ,     E_R(y) = 10^{-N} × (d × 10^n)

Thus we have

x/y = (a + 10^{-N} b)/(c + 10^{-N} d) × 10^{m−n}

On most computers N ≈ 15. So 10^{-N} d is much smaller than c. Thus we can use a Taylor series for 1/(1 + 10^{-N}(d/c)) in powers of 10^{-N} to produce

x/y = (1/c)(a + 10^{-N} b) 10^{m−n} ( 1 − 10^{-N} d/c + 10^{-2N} d²/c² − ··· )

If we retain just the first two terms in the series then we obtain (after a wee bit of algebra)

x/y = ( a/c + 10^{-N} ( b/c − ad/c² ) ) 10^{m−n} + O(10^{-2N})

Our computer will compute the approximation z̃ to the exact value z = x/y with a round-off error which we denote by E_R(z). We have z = z̃ + E_R(z) and (ignoring any carries)

z̃ = (a/c) × 10^{m−n}

E_R(z) = 10^{-N} ( b/c − ad/c² ) 10^{m−n}
       = 10^{-N} O(z)

This last line shows that our N-digit computer will return an estimate for z = x/y accurate to N digits. This is good! We rejoice (in moderation).

This is a very general technique and it can be applied to any function to analyse its sensitivity to round-off errors. It is well worth your time studying the above example in detail (as you always do, n'est-ce pas?).


Example 3.11 Round-off errors<br />

Lest you think that round-off errors are not important, here is a simple example that proves otherwise. Suppose we need to compute the roots x of the quadratic

0 = x² − x + λ

for various values of the parameter λ in the range 0 ≤ λ ≤ 1. Solving this quadratic exactly is easy; we all know that

x = ( 1 ± √(1 − 4λ) )/2

For λ = 0 we expect two roots, x = 0 and x = 1. How well does a finite precision<br />

computer handle this job? Here are some results.<br />

Exact and Computed values for x = (1 − √(1 − 4λ))/2

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 1.0000e-01 1.1250e-01 1.1271e-01 1.1270e-01<br />

1.00e-02 1.0102e-02 0.0000e+00 1.0000e-02 1.0100e-02 1.0102e-02<br />

1.00e-03 1.0010e-03 0.0000e+00 1.0000e-03 1.0000e-03 1.0010e-03<br />

1.00e-04 1.0001e-04 0.0000e+00 0.0000e+00 1.0000e-04 1.0000e-04<br />

Absolute errors for x = (1 − √(1 − 4λ))/2

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 1.2702e-02 2.0167e-04 3.3346e-06 1.5379e-08<br />

1.00e-02 1.0102e-02 1.0102e-02 1.0205e-04 2.0514e-06 1.4434e-09<br />

1.00e-03 1.0010e-03 1.0010e-03 1.0020e-06 1.0020e-06 2.0050e-09<br />

1.00e-04 1.0001e-04 1.0001e-04 1.0001e-04 1.0002e-08 1.0002e-08<br />

Relative errors for x = (1 − √(1 − 4λ))/2

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 1.1270e-01 1.7894e-03 2.9588e-05 1.3646e-07<br />

1.00e-02 1.0102e-02 1.0000e+00 1.0102e-02 2.0307e-04 1.4288e-07<br />

1.00e-03 1.0010e-03 1.0000e+00 1.0010e-03 1.0010e-03 2.0030e-06<br />

1.00e-04 1.0001e-04 1.0000e+00 1.0000e+00 1.0001e-04 1.0001e-04<br />

It is clear that significant errors arise when λ ≪ 1. For the computer with 4-digits<br />

the absolute errors might seem small but notice that they are of the same scale as the<br />

number we are trying <strong>to</strong> compute (the exact value for x). Thus, when λ is small, we have<br />


no reason <strong>to</strong> trust the answers from a 4-digit computer. This is better seen in the third<br />

table which lists the relative errors. Here you can see that the 2 and 4 digit computers<br />

yield answers that are 100% in error – they produce junk!<br />

The first big question is: What causes this problem? When λ ≪ 1, the numbers 1 and √(1 − 4λ) are almost equal. Thus when we compute their difference we introduce a large relative error.

The second big question is Can we cure this problem? In this case, yes, we have at least<br />

two options, both of which entail a re-working of the mathematics prior <strong>to</strong> handing<br />

control over <strong>to</strong> the computer. That is we search for different algorithms <strong>to</strong> compute x<br />

in the case where λ ≪ 1.<br />

Option 1. Here we do some simple algebra

x = ( 1 − (1 − 4λ)^{1/2} )/2

  = ( 1 − (1 − 4λ)^{1/2} )/2 × ( 1 + (1 − 4λ)^{1/2} )/( 1 + (1 − 4λ)^{1/2} )

  = 2λ / ( 1 + (1 − 4λ)^{1/2} )

We see that when λ ≪ 1 we do not have any problem with cancellation between nearly equal numbers (i.e. no loss of precision).

Relative errors for Option 1<br />

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 2.3972e-02 1.4777e-05 2.9691e-06 4.7730e-08<br />

1.00e-02 1.0102e-02 1.0102e-02 2.0307e-04 5.0924e-06 4.3889e-08<br />

1.00e-03 1.0010e-03 1.0010e-03 2.0030e-06 2.0030e-06 5.0090e-09<br />

1.00e-04 1.0001e-04 1.0001e-04 1.0001e-04 2.0003e-08 2.0003e-08<br />

Option 2. This time we use a Taylor series. For λ ≪ 1 we can expand √(1 − 4λ) as 1 − 2λ + O(λ²). Thus we find, for λ ≪ 1,

x = λ + O(λ²)

Relative errors for Option 2<br />

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 1.1270e-01 1.1270e-01 1.1270e-01 1.1270e-01<br />

1.00e-02 1.0102e-02 1.0102e-02 1.0102e-02 1.0102e-02 1.0102e-02<br />

1.00e-03 1.0010e-03 1.0010e-03 1.0010e-03 1.0010e-03 1.0010e-03<br />

1.00e-04 1.0001e-04 1.0001e-04 1.0001e-04 1.0001e-04 1.0001e-04<br />


And with this we are pleased – the relative errors are small and well behaved when<br />

λ ≪ 1. This is good. This is another example where a judicious choice of algorithm can<br />

save the day. This time we were lucky, the reliable algorithms were easy <strong>to</strong> find but you<br />

will find many problems that may require more head scratching than you might care <strong>to</strong><br />

endure – sadly you have no other choice (unless you can live with inaccurate answers!).<br />
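To see these numbers for yourself on an ordinary double-precision machine (roughly a 16-digit computer), here is a small Matlab sketch (my own illustration, not part of the original notes) comparing the naive formula with the Option 1 rearrangement for some very small values of λ:

% Small root of 0 = x^2 - x + lambda, computed two ways.
lambda = [1e-8 1e-10 1e-12 1e-14];

x_naive = (1 - sqrt(1 - 4*lambda))/2;            % suffers from cancellation
x_opt1  = 2*lambda ./ (1 + sqrt(1 - 4*lambda));  % the rearranged formula, no cancellation

rel_diff = abs(x_naive - x_opt1)./abs(x_opt1);   % how far the naive answer has drifted
disp([lambda' x_naive' x_opt1' rel_diff']);

For the smallest λ the naive formula loses several digits, while the rearranged formula stays accurate, exactly as the tables above suggest.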

Example 3.12<br />

Look at the numbers in the last row of the previous table. It appears that there is no<br />

improvement in accuracy as we move from a 2-digit computer <strong>to</strong> an 8-digit computer?<br />

Is this a surprise? Can you explain why this might be so?<br />

4. Solutions of Equations in One Variable



4.1 Introduction

The game here is, given a function f(x), to find x such that

0 = f(x)

In some cases this is easy,

0 = 3x² + 2x − 7   ⇒   x = ( −2 ± 2√22 )/6

but in other cases, such as

0 = x − e^{−x}

we do not have any choice but to resort to numerical means.

Our basic strategy will be <strong>to</strong> invent some way <strong>to</strong> create a sequence x 1 , x 2 , x 3 , · · · which<br />

(we hope) converges <strong>to</strong> the root x.<br />

The main issues that we will look at are

◮ Algorithms: How do we generate the sequence x_1, x_2, x_3, ···?
◮ Convergence: For what values of x_1 will the sequence converge? And if it does, how quickly?
◮ Robustness: For what class of functions f(x) will the algorithm work?

Here is your first reality check – any hope of finding a perfect algorithm – one that<br />

converges for all functions and for all initial guesses – is pure fantasy. All algorithms<br />

will have trouble under certain conditions. Our game will be <strong>to</strong> find a range of algorithms<br />

that collectively will allow us <strong>to</strong> solve most problems. We will also want <strong>to</strong> investigate<br />

what it is that causes one algorithm <strong>to</strong> succeed where others fail. Thus our work will be<br />

a mix of empirical tinkering (i.e. algorithm design) and solid mathematical analysis.<br />

4.2 Fixed point iteration<br />

Given 0 = x − e^{−x} we also have x = e^{−x} and this suggests the following sequence

x_{n+1} = e^{−x_n} ,     n = 1, 2, 3, ···

To get the ball rolling we need an initial guess, let’s take x 1 = 0.5. How well does this<br />

work? Here are the results for the first ten iterations.<br />


Fixed point iterations x_{n+1} = e^{−x_n}
Iteration n    Old guess x_n    New guess x_{n+1}    x_{n+1} − e^{−x_{n+1}}

0 0.500000000000 -1.065e-01<br />

1 0.500000000000 0.606530659713 6.129e-02<br />

2 0.606530659713 0.545239211893 -3.446e-02<br />

3 0.545239211893 0.579703094878 1.964e-02<br />

4 0.579703094878 0.560064627939 -1.111e-02<br />

5 0.560064627939 0.571172148977 6.309e-03<br />

6 0.571172148977 0.564862946980 -3.575e-03<br />

7 0.564862946980 0.568438047570 2.029e-03<br />

8 0.568438047570 0.566409452747 -1.150e-03<br />

9 0.566409452747 0.567559634262 6.524e-04<br />

10 0.567559634262 0.566907212935 -3.700e-04<br />

It appears to be converging, the last column seems to be getting smaller with each iteration, but it seems to be a slow convergence. Here are the results at every 10-th iteration.

Fixed point iterations x_{n+1} = e^{−x_n}
Iteration n    Old guess x_n    New guess x_{n+1}    x_{n+1} − e^{−x_{n+1}}

1 0.500000000000 0.606530659713 6.129e-02<br />

11 0.566907212935 0.567277195971 2.098e-04<br />

21 0.567142477551 0.567143751417 7.225e-07<br />

31 0.567143287611 0.567143291997 2.487e-09<br />

41 0.567143290400 0.567143290415 8.564e-12<br />

51 0.567143290410 0.567143290410 2.931e-14<br />

61 0.567143290410 0.567143290410 1.110e-16<br />

Clearly the algorithm worked, we have a solution, but it <strong>to</strong>ok over 60 iterations. For<br />

functions as simple as this, 61 iterations does not take <strong>to</strong>o much time <strong>to</strong> compute. So we<br />

might feel that this is a good result. Not quite. If this happened <strong>to</strong> be part of a much<br />

larger computation, where we needed <strong>to</strong> compute the root many thousands of times (for<br />

example) then we have good reason <strong>to</strong> want <strong>to</strong> improve on 60 iterations per root. As we<br />

shall see in later lectures there are algorithms that can converge in (usually) less than<br />

about 5 iterations. This is a significant improvement.<br />

You might ask how well does this algorithm work for other choices of the initial guess<br />

x 1 ? It converges for a wide range of values! By direct trial and error you can verify<br />

(you’ll need a program!) that it converges (at least) for any initial guess in the range<br />

−5 < x 1 < +5. Again this is encouraging. Flush with confidence you might try rewriting<br />

the original equation 0 = x − e −x as x = − log(x) and thus create the sequence<br />

x_{n+1} = − log(x_n) ,     n = 1, 2, 3, ···


If you start with x_1 = 2 you will get an error very quickly

x_1 = 2
x_2 = − log(x_1) = − log(2)
x_3 = − log(x_2) = − log(− log(2))

Since − log(2) is negative, the third step asks for the log of a negative number and the iteration fails. Starting with x_1 = 0.5 leads to the same problem at n = 15.

Fixed Point Iteration<br />

If an equation 0 = f(x) can be re-written in the form<br />

x = g(x)<br />

then the sequence<br />

x n+1 = g(x n ) n = 1, 2, 3, · · ·<br />

is known as fixed point iteration. The sequence is not guaranteed <strong>to</strong> converge <strong>to</strong> x.<br />

4.2.1 Why the funny name?<br />

You might be wondering why we call this method the fixed point method. It is simple.<br />

If we have found the root x then one extra iteration x = g(x) produces no change in x,<br />

that is, x is fixed, and so we call it a fixed point of g(x).

4.2.2 Fixed point in pictures<br />

The sequence x n+1 = g(x n ) can be drawn as points in the (x, y) plane. All you do is<br />

plot the points (x 1 , y 1 ), (x 2 , y 2 ), (x 3 , y 3 ) · · · where y n = g(x n ). This gives us a cute way<br />

<strong>to</strong> display the convergent and divergent sequences.<br />


[Figure: Convergent fixed point iteration – the curves y = g(x) and y = x, with the iterates x_1, x_2, x_3, x_4 approaching their intersection.]

[Figure: Divergent fixed point iteration – the curves y = g(x) and y = x, with the iterates x_1, ..., x_5 moving away from their intersection.]


4.2.3 Cyclic Fixed point iterations<br />

If we are unlucky we may bump into an x for which x = g(g(x)) and x ≠ g(x). This is an example of a cyclic sequence, as shown in the following diagram.

[Figure: Cyclic fixed point iteration – the curves y = g(x) and y = x; the iterates alternate between two values, x_1, x_3, x_5, x_7, ... and x_2, x_4, x_6, x_8, ...]

In cases like this you have no choice but <strong>to</strong> start again with a new guess for x 1 and there<br />

is no guarantee that you won’t bump in<strong>to</strong> the same cyclic sequence. Good luck!<br />

4.2.4 Convergence<br />

Is there anything that we can say about when a fixed point iteration will converge? Yes! Suppose we have

x_{n+1} = g(x_n)

which we hope will converge to the root of

x = g(x)

Suppose we are close, that is x_n ≈ x, and also that g(x) is a nice smooth function. We can then write

g(x_n) = g(x) + g'(x)(x_n − x) + O( (x_n − x)² )

But x_{n+1} = g(x_n) and x = g(x) and so we also have

x_{n+1} = x + g'(x)(x_n − x) + O( (x_n − x)² )


which we re-write as

x_{n+1} − x = g'(x)(x_n − x) + O( (x_n − x)² )

Now notice that x_n − x is the error in our approximation at iteration n. So let's define the error at each iteration by

ε_n = |x_n − x|

then we have

ε_{n+1} = |g'(x)| ε_n + O(ε_n²)

Notice that if |g ′ (x)| < 1 then successive errors will be smaller than the previous errors,<br />

that is the iterations will converge. On the other hand, if |g ′ (x)| > 1 the errors will grow<br />

and the sequence diverges. The only other case is when |g ′ (x)| = 1 and in this case the<br />

errors neither grow nor decay.<br />

Note also that the above arguments all hinge on the assumption that we are near the<br />

fixed point. All bets are off if we have a bad initial guess (i.e. x 1 is far from the fixed<br />

point). However once the sequence gets close <strong>to</strong> the fixed point (if we should be so lucky)<br />

then the above arguments do apply.<br />

Fixed Point Iteration Convergence<br />

Given an x 1 close <strong>to</strong> the fixed point x = g(x), then the sequence<br />

x n+1 = g(x n ) n = 1, 2, 3, · · ·<br />

Converges when |g ′ (x)| < 1<br />

Diverges when |g ′ (x)| > 1<br />

Useless when |g ′ (x)| = 1<br />

Example 4.1<br />

Verify, using the above conditions, that the scheme x_{n+1} = e^{−x_n} will converge, while the scheme x_{n+1} = − log(x_n) should diverge.
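A quick numerical check of these conditions (a sketch only, not part of the original example) is to evaluate |g'(x)| at the fixed point x ≈ 0.567143290410 for both choices of g:

% Check the convergence condition |g'(x)| < 1 at the fixed point of x = exp(-x).
x = 0.567143290410;

g1prime = abs(-exp(-x));   % derivative of g(x) = exp(-x)
g2prime = abs(-1/x);       % derivative of g(x) = -log(x)

fprintf('|g1''(x)| = %.6f  (less than 1, so x_{n+1} = exp(-x_n) converges)\n', g1prime);
fprintf('|g2''(x)| = %.6f  (greater than 1, so x_{n+1} = -log(x_n) diverges)\n', g2prime);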

4.2.5 Programming notes<br />

Since we know that the fixed point iterations might fail we must take care when we write<br />

our computer programs. We need <strong>to</strong> ask ourselves what can go wrong in the calculations<br />

(<strong>to</strong>o many iterations? not converging? stalled? etc.) and put appropriate tests in our<br />

programs. Here is a very rough sketch of a program that could be used for any algorithm<br />

<strong>to</strong> solve 0 = f(x) for x.<br />


Set initial guess for x<br />

Set the desired accuracy for x<br />

Set limit for number of iterations<br />

While iterating do<br />

Choose a new guess for x<br />

Compute the new f(x)<br />

If the change in x is small, then exit<br />

If the new f is small, then exit<br />

If the number of iterations is <strong>to</strong>o large, then exit<br />

Otherwise, prepare for the next iteration<br />

Finished, print x<br />


4.2.6 A Matlab program<br />

Here is a rough Matlab program that does the job. You might like <strong>to</strong> look at this code<br />

very carefully.<br />

x_old = 0.5;                      % initial guess
loop = 0;                         % set loop to zero
loop_max = 50;                    % maximum number of iterations
small_number = 0.0001;            % target accuracy

looping = (loop < loop_max);

while looping                     % start of iterations
   x_new = exp(-x_old);           % next iteration
   f_old = x_old - exp(-x_old);
   f_new = x_new - exp(-x_new);
   loop = loop + 1;
   if loop > loop_max
      looping = false;            % too many iterations, exit
   end
   if abs(x_new-x_old) < small_number
      looping = false;            % x values converged, exit
   end
   if abs(f_new) < small_number
      looping = false;            % f values very small, exit
   end
   x_old = x_new;                 % prepare for next iteration
   disp([loop x_new f_new ]);     % display current approximation
end;

4.3 Newton-Raphson iteration

Once again we have the problem of finding x such that 0 = f(x). And once again we will generate a sequence x_1, x_2, x_3, ··· which we hope will converge to x.

How? Okay, let's suppose we have a guess, call it x_n, and we wish to create a new (improved) guess x_{n+1}. This new guess will only be an approximation to x. Let δx be the error in x_n, that is

x_n = x + δx

Our game now is to compute (or estimate) δx.


Given 0 = f(x) we have

0 = f(x_n − δx)

and if δx is small, then we can expand the right hand side using a Taylor series. Thus

0 = f(x_n) − f'(x_n) δx + O(δx²)

Since δx is small we can discard the second order terms, thus we can solve for δx,

δx = f(x_n)/f'(x_n)

You might be tempted to write

x = x_n − δx

and think your job is done, that you have found the exact root. But don't forget that we truncated the Taylor series, and thus we introduced an approximation for δx. Thus the previous line is only approximately true. We thus take the right hand side as our next approximation, that is x_{n+1} = x_n − δx = x_n − f(x_n)/f'(x_n).
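Here, as a rough sketch (the derivative f'(x) = 1 + e^{−x} is coded by hand, and the stopping tests are of the same kind as in the fixed point program above), is what a minimal Matlab version of this iteration might look like for our standard problem f(x) = x − e^{−x}:

% Newton-Raphson iteration for f(x) = x - exp(-x), a rough sketch.
f      = @(x) x - exp(-x);     % the function
fprime = @(x) 1 + exp(-x);     % its derivative, coded by hand

x_old = 0.5;                   % initial guess
for loop = 1 : 20              % at most 20 iterations
   x_new = x_old - f(x_old)/fprime(x_old);   % the Newton-Raphson update
   disp([loop x_new f(x_new)]);              % show progress
   if abs(x_new - x_old) < 1.0e-12           % x values converged, exit
      break;
   end
   x_old = x_new;              % prepare for next iteration
end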

Example 4.2 Newton-Raphson

Here are the results for our standard problem 0 = x − e −x .<br />

Newton-Raphson iterations x_{n+1} = x_n − f_n/f'_n ,   f(x) = x − e^{−x}
Iteration n    Old guess x_n    New guess x_{n+1}    x_{n+1} − e^{−x_{n+1}}    ε_{n+1}/ε_n²

0 0.500000000000 -1.065e-01<br />

1 0.500000000000 0.566311003197 -1.305e-03 1.846e-01<br />

2 0.566311003197 0.567143165035 -1.965e-07 1.810e-01<br />

3 0.567143165035 0.567143290410 -4.441e-15 1.393e+01<br />

4 0.567143290410 0.567143290410 1.110e-16 4.507e+12<br />

5 0.567143290410 0.567143290410 0.000e+00 4.631e+12<br />

Notice how quickly the iterations converge. This is a very good result! We get accurate<br />

answers for very little effort! Yippee (well, I get easily carried away).<br />

Example 4.3<br />

Let ɛ n be defined by ɛ n = |x − x n |. Show, using a Taylor series, that ɛ n+1 = O (ɛ 2 n). This<br />

means that each successive iteration will double the number of digits in our approximation.<br />

This is known as quadratic convergence. In contrast, the fixed point iterations<br />

have ɛ n+1 = O (ɛ n ) and this is called linear convergence.<br />

But note one important fact – the proof that ɛ n+1 = O (ɛ 2 n) assumes that f ′ (x) ≠ 0.<br />

What happens when f ′ (x) = 0?<br />


Example 4.4 Newton-Raphson with f'(x) = 0

Find x such that 0 = f(x) = (x − 1) 2 . Clearly f(x) = 0 and f ′ (x) = 0 at x = 1 so x = 1<br />

is our root. Let’s take the initial guess of x = 0.5. Here is what we get from a naive<br />

application of the New<strong>to</strong>n-Raphson algorithm.<br />

Newton-Raphson iterations x_{n+1} = x_n − f_n/f'_n ,   f(x) = (x − 1)²
Iteration n    Old guess x_n    New guess x_{n+1}    (x_{n+1} − 1)²    ε_{n+1}/ε_n

0 0.500000000000 2.500e-01<br />

1 0.500000000000 0.750000000000 6.250e-02 5.000e-01<br />

2 0.750000000000 0.875000000000 1.563e-02 5.000e-01<br />

3 0.875000000000 0.937500000000 3.906e-03 5.000e-01<br />

4 0.937500000000 0.968750000000 9.766e-04 5.000e-01<br />

5 0.968750000000 0.984375000000 2.441e-04 5.000e-01<br />

6 0.984375000000 0.992187500000 6.104e-05 5.000e-01<br />

7 0.992187500000 0.996093750000 1.526e-05 5.000e-01<br />

8 0.996093750000 0.998046875000 3.815e-06 5.000e-01<br />

9 0.998046875000 0.999023437500 9.537e-07 5.000e-01<br />

10 0.999023437500 0.999511718750 2.384e-07 5.000e-01<br />

The iterations do converge, but not as fast as in the previous example. Notice that ε_{n+1}/ε_n remains constant. That is, each iteration reduces the error by a constant factor (in this case 1/2). This is typical of Newton-Raphson iteration when x is a root of both f(x) = 0 and f'(x) = 0.

4.3.1 Some notation.<br />

◮ Simple root: when f(x) = 0 and f'(x) ≠ 0.
◮ Double root: when f(x) = 0, f'(x) = 0 and f''(x) ≠ 0.
◮ Root of order m: when f(x) = 0 and all derivatives up to f^{(m−1)}(x) are zero at x while f^{(m)}(x) ≠ 0.


Newton-Raphson Iteration

For the equation 0 = f(x) compute the sequence

x_{n+1} = x_n − f(x_n)/f'(x_n) ,     n = 1, 2, 3, ···

If this sequence converges then

ε_{n+1} = O(ε_n²)   at a simple root
ε_{n+1} = O(ε_n)    at a multiple root

4.3.2 Newton-Raphson for multiple roots

We have seen that the standard Newton-Raphson, when applied to a function with a multiple root, does converge but only linearly. Can we modify the algorithm to recover quadratic convergence? Yes (you knew that).

Let f(x) have a multiple root at x = p. Then for x near p we must have

f(x) = (x − p)^m h(x)

where h(x) is some other (unknown) function with h(p) ≠ 0. Thus

f^{1/m}(x) = (x − p) h^{1/m}(x)

This new function has a simple root at x = p and thus we expect quadratic convergence when the Newton-Raphson method is applied to f^{1/m}(x).

What does the Newton-Raphson iteration look like for this new function? Put g(x) = f^{1/m}(x), then

x_{n+1} = x_n − g(x_n)/g'(x_n)
        = x_n − f^{1/m}(x_n) / ( (1/m) f^{1/m−1}(x_n) f'(x_n) )
        = x_n − m f(x_n)/f'(x_n)

Here is what we get for the simple function f(x) = (x − 1) 2 .<br />

Newton-Raphson iterations x_{n+1} = x_n − 2f_n/f'_n ,   f(x) = (x − 1)²
Iteration n    Old guess x_n    New guess x_{n+1}    (x_{n+1} − 1)²    ε_{n+1}/ε_n

0 0.500000000000 2.500e-01<br />

1 0.500000000000 1.000000000000 0.000e+00 0.000e+00<br />


Yes, this is correct, it does converge in one iteration. But do not think that all functions<br />

with a double root will converge so quickly. Here is another example, based on f(x) =<br />

x 3 − 3x + 2 which has a double root at x = 1.<br />

Newton-Raphson iterations x_{n+1} = x_n − 2f_n/f'_n ,   f(x) = x³ − 3x + 2
Iteration n    Old guess x_n    New guess x_{n+1}    (x_{n+1} − 1)²    ε_{n+1}/ε_n²

0 1.200000000000 1.280e-01<br />

1 1.200000000000 1.006060606061 1.104e-04 1.515e-01<br />

2 1.006060606061 1.000006103329 1.118e-10 1.662e-01<br />

3 1.000006103329 1.000000000004 -2.220e-16 9.683e-02<br />

Example 4.5<br />

Repeat the above calculations for 7 iterations. What do you notice? Can you explain<br />

this?<br />

Example 4.6<br />

Can you write out a formula for ɛ n as a function of n for the double root (easy) and for<br />

the simple root (not so easy – hint, use the above tables and write out successive ɛ n ’s<br />

and express each in terms of ɛ 1 ).<br />

Example 4.7<br />

For multiple roots we could apply the Newton-Raphson method to the function g(x) = f^{(m−1)}(x), i.e. the (m − 1)st derivative of f(x). What do you think would be the pros and cons of this approach? (Hint – two words: efficiency, round-off).

4.3.3 Cycling<br />

If we are unlucky we will find that the New<strong>to</strong>n-Raphson may cycle, just as we saw with<br />

the fixed point algorithm.<br />


[Figure: Cyclic Newton-Raphson iteration on y = f(x); the iterates alternate between two sets of values, x_1, x_3, x_5, x_7, ... and x_2, x_4, x_6, x_8, ...]

There is not much we can do in this case. We could try different initial guesses and we<br />

may be lucky and avoid the cycling but there is no guarantee that we’ll be so lucky. You<br />

may have <strong>to</strong> give up and try another algorithm (for example, half-interval search).<br />

4.4 Interval methods<br />

Our game so far has been <strong>to</strong> create a sequence x 1 , x 2 , x 3 , · · · of approximations <strong>to</strong> the<br />

root of f(x) with the hope that x n+1 is a better approximation than x n <strong>to</strong> the root.<br />

At any stage in this process we have a single point approximation <strong>to</strong> the root. In the<br />

following two algorithms (half-interval search and false position) we will use pairs of<br />

points <strong>to</strong> define a range of values that are guaranteed <strong>to</strong> contain the root.<br />

This is a good thing. If the root lies in the range a < x < b then we have a concrete<br />

measure of how large the error might be in taking any number from the interval as an<br />

approximation <strong>to</strong> x. The worst case is that the error is no larger than b − a. If we can<br />

find an algorithm that successively shrinks the interval then we are guaranteed that the<br />

iterations will converge <strong>to</strong> the root.<br />

Algorithms of this type are also referred <strong>to</strong> as bracketing methods.<br />

The success of these methods depends on two things<br />

◮ The function f(x) must be continuous and<br />

◮ That we can find at least one interval that contains just this one root.<br />


4.4.1 Half Interval Search<br />

This is the classic example of interval methods. Let’s suppose we have a simple function<br />

and that by some means (usually a table or a plot) we have found two points a and b<br />

with a < b and most importantly that f(a) and f(b) are of opposite sign. This is the<br />

key condition for the choice of a and b. Let’s call this the root condition.<br />

If the function is continuous (which we always assume) then the function must have a<br />

root in the range a < x < b. Good. Now the big question is How do we choose a new<br />

smaller interval that also contains the root?<br />

Given the interval [a, b] we create two new intervals based on the mid-point c = (a+b)/2.<br />

Thus we can split the original interval in<strong>to</strong> two smaller non-overlapping intervals. Here<br />

is the big thing – Only one of the intervals will satisfy the root condition. Whichever<br />

interval that happens <strong>to</strong> be, we take it as the next interval and start the process all over<br />

again.<br />

In this way we generate a sequence of intervals, each 1/2 the size of the previous interval,<br />

that are guaranteed <strong>to</strong> converge <strong>to</strong> the root. This certainty is very comforting (point<br />

algorithms are not guaranteed <strong>to</strong> converge).<br />

Half Interval Search iterations f(x) = x − e −x<br />

n x left x middle x right f left f middle f right<br />

1 0.000000 1.500000 3.000000 -1.0e+00 1.3e+00 3.0e+00<br />

2 0.000000 0.750000 1.500000 -1.0e+00 2.8e-01 1.3e+00<br />

3 0.000000 0.375000 0.750000 -1.0e+00 -3.1e-01 2.8e-01<br />

4 0.375000 0.562500 0.750000 -3.1e-01 -7.3e-03 2.8e-01<br />

5 0.562500 0.656250 0.750000 -7.3e-03 1.4e-01 2.8e-01<br />

6 0.562500 0.609375 0.656250 -7.3e-03 6.6e-02 1.4e-01<br />

7 0.562500 0.585938 0.609375 -7.3e-03 2.9e-02 6.6e-02<br />

8 0.562500 0.574219 0.585938 -7.3e-03 1.1e-02 2.9e-02<br />

9 0.562500 0.568359 0.574219 -7.3e-03 1.9e-03 1.1e-02<br />

10 0.562500 0.565430 0.568359 -7.3e-03 -2.7e-03 1.9e-03<br />

Half-Interval Search<br />

Given a continuous function f(x) and an interval [a, b] with f(a)f(b) < 0 then<br />

Compute c = a + (b − a)/2, the mid point<br />

If f(a)f(c) < 0 then<br />

Choose the new interval as [a, c]<br />

Else<br />

Choose the new interval as [c, b]<br />


Notes<br />

◮ Half-interval search is also commonly known as the Bisection method.<br />

◮ It is far better to use c = a + (b − a)/2 than c = (a + b)/2 to compute the mid-point. The latter method can, due to round-off errors, produce a c that is not contained in the interval [a, b].
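As a rough illustration (a sketch, not taken from the original notes), a minimal Matlab version of half-interval search for f(x) = x − e^{−x} on the starting interval [0, 3] might look like this:

% Half-interval (bisection) search for f(x) = x - exp(-x), a rough sketch.
f = @(x) x - exp(-x);
a = 0.0;  b = 3.0;                 % starting interval with f(a)*f(b) < 0

for n = 1 : 40
   c = a + (b - a)/2;              % mid point (the safer formula)
   if f(a)*f(c) < 0
      b = c;                       % root lies in [a, c]
   else
      a = c;                       % root lies in [c, b]
   end
   if (b - a) < 1.0e-10            % interval small enough, stop
      break;
   end
end
disp([n c f(c)]);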

4.4.2 False Position<br />

This differs from the Half-Interval method only in the way the interval is split in two.<br />

In this case we draw a line joining the two points on the curve and locate where this line<br />

crosses the x-axis. That splits the interval in two, everything else remains the same. It<br />

is easy to verify that this point is given by

c = ( a f(b) − b f(a) ) / ( f(b) − f(a) )

Here are our results.

False Position iterations f(x) = x − e −x<br />

n x left x new x right f left f new f right<br />

1 0.000000 0.759453 3.000000 -1.0e+00 2.9e-01 3.0e+00<br />

2 0.000000 0.588025 0.759453 -1.0e+00 3.3e-02 2.9e-01<br />

3 0.000000 0.569460 0.588025 -1.0e+00 3.6e-03 3.3e-02<br />

4 0.000000 0.567401 0.569460 -1.0e+00 4.0e-04 3.6e-03<br />

5 0.000000 0.567172 0.567401 -1.0e+00 4.5e-05 4.0e-04<br />

6 0.000000 0.567146 0.567172 -1.0e+00 5.0e-06 4.5e-05<br />

7 0.000000 0.567144 0.567146 -1.0e+00 5.5e-07 5.0e-06<br />

8 0.000000 0.567143 0.567144 -1.0e+00 6.2e-08 5.5e-07<br />
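A corresponding Matlab sketch for false position (again illustrative only, not part of the original notes) differs from the bisection sketch above in just one line, the choice of the splitting point:

% False position for f(x) = x - exp(-x), a rough sketch.
f = @(x) x - exp(-x);
a = 0.0;  b = 3.0;                      % starting interval with f(a)*f(b) < 0

for n = 1 : 40
   c = (a*f(b) - b*f(a))/(f(b) - f(a)); % where the chord crosses the x-axis
   if f(a)*f(c) < 0
      b = c;                            % root lies in [a, c]
   else
      a = c;                            % root lies in [c, b]
   end
   if abs(f(c)) < 1.0e-10               % f small enough, stop
      break;
   end
end
disp([n c f(c)]);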


4.5 Summary

Fixed point:   x_{n+1} = g(x_n)
   Convergence criterion:  |g'(x)| < 1
   Convergence rate:       ε_{n+1} = O(ε_n)
   ✔ simple   ✔ no derivatives
   ✗ slow   ✗ may cycle

Newton-Raphson:   x_{n+1} = x_n − f(x_n)/f'(x_n)
   Convergence criterion:  x_n ≈ x
   Convergence rate:       ε_{n+1} = O(ε_n²)
   ✔ quadratic convergence
   ✗ may not find all roots   ✗ requires derivatives   ✗ may cycle
   ✗ linear convergence at multiple roots

Half-Interval Search:   c = (a + b)/2
   Convergence criterion:  f(a)f(b) < 0
   Convergence rate:       ε_{n+1} = (1/2) ε_n
   ✔ simple   ✔ no derivatives   ✔ guaranteed to converge
   ✗ slow   ✗ fails at even powered multiple roots

False Position:   c = ( b f(a) − a f(b) ) / ( f(a) − f(b) )
   Convergence criterion:  f(a)f(b) < 0
   Convergence rate:       ε_{n+1} = O(ε_n)
   ✔ simple   ✔ no derivatives   ✔ guaranteed to converge
   ✗ slow   ✗ fails at even powered multiple roots

5. Solving Systems of Linear Equations



5.1 Introduction

Remember all the fun you had using Gaussian elimination <strong>to</strong> solve equations like<br />

2x + 3y + z = 10<br />

x + 2y + 2z = 10<br />

4x + 8y + 11z = 49<br />

for x, y and z? You may be disappointed <strong>to</strong> learn that we’ll be leaving all that fun behind<br />

us by handing the job over <strong>to</strong> the computer. We will study two classes of algorithms,<br />

direct methods where we use variations on Gaussian elimination <strong>to</strong> obtain the solution<br />

and iterative methods where we invent (yet another) algorithm <strong>to</strong> create a sequence of<br />

approximations <strong>to</strong> the solution.<br />

5.2 Gaussian elimination with back substitution<br />

The following steps should serve as a basic reminder of Gaussian elimination with back<br />

substitution.<br />

Using the above set of equations, your pen and paper calculations might look like the<br />

following.<br />

2x + 3y +  z = 10     (1)
 x + 2y + 2z = 10     (2)
4x + 8y + 11z = 49    (3)

(2)' ← 2(2) − (1),   (3)' ← (3) − 2(1):

2x + 3y + z = 10      (1)
     y + 3z = 10      (2)'
    2y + 9z = 29      (3)'

(3)'' ← (3)' − 2(2)':

2x + 3y + z = 10      (1)
     y + 3z = 10      (2)'
         3z = 9       (3)''

Now we solve this system using back-substitution, z = 3, y = 1, x = 2.

There are two basic stages. First, we apply a series of row operations <strong>to</strong> reduce the<br />

system <strong>to</strong> an upper triangular form. Second, we apply the back substitution where we<br />

solve from the last up <strong>to</strong> the first equation. There are minor variations on this pattern<br />

(such as full Gaussian elimination) but for the moment we will not worry about such<br />

matters.<br />


For bookkeeping purposes we normally write the equations in matrix form, such as

[ 2  3   1 ] [ x ]   [ 10 ]
[ 1  2   2 ] [ y ] = [ 10 ]
[ 4  8  11 ] [ z ]   [ 49 ]

or more simply as just

AX = f
where each of A, X and f are matrices (all of this should be very familiar <strong>to</strong> you).<br />

Our game will be <strong>to</strong> write a short Matlab program that implements Gaussian elimination<br />

with back substitution on the system AX = f. We will do this in simple stages, starting<br />

with the basic mathematics and slowly adding fragments of Matlab code <strong>to</strong> build a fully<br />

working program.<br />

Matlab code for Gaussian elimination We will assume that we have N equations<br />

in N unknowns with augmented matrix M. In our Matlab program we can access the<br />

entry in row i and column j of M by writing M(i, j).<br />

Let’s suppose we have completed the eliminations for columns 1 <strong>to</strong> a − 1. We now need<br />

<strong>to</strong> apply row operations <strong>to</strong> M <strong>to</strong> eliminate all of the entries in the column below the<br />

entry at (a, a) (i.e. entries (a+1, a), (a+2, a), (a+3, a), · · · (N, a)). Let’s suppose we are<br />

about <strong>to</strong> eliminate the entry at (b, a). Looking back on our pen-and-paper calculations<br />

we see that we need <strong>to</strong> do three tasks<br />

◮ Divide row a by M(a, a)<br />

◮ Multiply row a by M(b, a)<br />

◮ Subtract row a from row b<br />

Here is a Matlab fragment that does the job.<br />

factor = M(b,a)/M(a,a);              % want M(b,a) to be zero
for c = a : N+1                      % col's to right of (a,a)
   M(b,c) = M(b,c) - factor*M(a,c);  % row b - factor * row a
end


Example 5.1<br />

Were we not meant to process the whole row? Does not the above only process part of the row? And why does the loop go up to N + 1? Have we made two mistakes?

This does the work for one row. We now need <strong>to</strong> process every row below row a. This<br />

is easy – we simply wrap up our Matlab fragment in another for-loop.<br />

for b = a+1 : N                         % rows below a
   factor = M(b,a)/M(a,a);              % want M(b,a) to be zero
   for c = a : N+1                      % col's to right of (a,a)
      M(b,c) = M(b,c) - factor*M(a,c);  % row b - factor * row a
   end
end

This is looking good. We have completed all the eliminations for column a. All we need<br />

do now is repeat this process for the remaining columns. This introduces one more loop,<br />

for a = 1 : N-1                            % first N-1 rows
   for b = a+1 : N                         % rows below a
      factor = M(b,a)/M(a,a);              % want M(b,a) to be zero
      for c = a : N+1                      % col's to right of (a,a)
         M(b,c) = M(b,c) - factor*M(a,c);  % row b - factor * row a
      end
   end
end

Matlab code for back substitution Okay, our matrix M is now in upper triangular<br />

form and we are ready <strong>to</strong> do the back substitution. We need a vec<strong>to</strong>r <strong>to</strong> s<strong>to</strong>re the<br />

unknowns. Let’s call it X (with entries X(i)). Recall that in the back substitution we<br />

use equation N <strong>to</strong> solve for X(N), then equation N − 1 <strong>to</strong> solve for X(N − 1) and so<br />

on finishing with the first equation giving X(1). As before, we will build our Matlab<br />

code by first assuming we are mid-way through our calculations. So suppose we have<br />

computed X(N), X(N − 1), X(N − 2), · · · X(a + 1). We now solve for X(a) from row<br />

(a).<br />

sum = M(a,N+1);                  % RHS of row a
for b = a+1 : N                  % shuffle all known X's
   sum = sum - M(a,b)*X(b);      % across to the RHS
end
X(a) = sum/M(a,a);               % compute X(a)

And as before we wrap this in one more loop <strong>to</strong> get all of the X’s,<br />


for a = N : -1 : 1               % step backwards
   sum = M(a,N+1);               % RHS of row a
   for b = a+1 : N               % shuffle all known X's
      sum = sum - M(a,b)*X(b);   % across to the RHS
   end
   X(a) = sum/M(a,a);            % compute X(a)
end

That completes the job, we have written a very basic Matlab program that implements a<br />

simple Gaussian elimination algorithm with back substitution. This is our final program<br />

%--- Gaussian elimination -----------------------------------------
for a = 1 : N-1                            % first N-1 rows
   for b = a+1 : N                         % rows below a
      factor = M(b,a)/M(a,a);              % want M(b,a) to be zero
      for c = a : N+1                      % col's to right of (a,a)
         M(b,c) = M(b,c) - factor*M(a,c);  % row b - factor * row a
      end
   end
end

%--- Back substitution --------------------------------------------
for a = N : -1 : 1                         % step backwards
   sum = M(a,N+1);                         % RHS of row a
   for b = a+1 : N                         % shuffle all known X's
      sum = sum - M(a,b)*X(b);             % across to the RHS
   end
   X(a) = sum/M(a,a);                      % compute X(a)
end
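To try this program on the 3 × 3 system from the start of this chapter you still need to supply N, the augmented matrix M = [A f] and a vector X to hold the answers. Here is a minimal driver (a sketch, not part of the original notes) with those pieces filled in:

% A small driver for the Gaussian elimination + back substitution code above.
N = 3;
M = [ 2  3  1  10 ;          % augmented matrix [A f] for the example system
      1  2  2  10 ;
      4  8 11  49 ];
X = zeros(N,1);              % storage for the unknowns

for a = 1 : N-1                               % Gaussian elimination
   for b = a+1 : N
      factor = M(b,a)/M(a,a);
      for c = a : N+1
         M(b,c) = M(b,c) - factor*M(a,c);
      end
   end
end

for a = N : -1 : 1                            % back substitution
   sum = M(a,N+1);
   for b = a+1 : N
      sum = sum - M(a,b)*X(b);
   end
   X(a) = sum/M(a,a);
end

disp(X')                     % should print 2 1 3, i.e. x = 2, y = 1, z = 3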

5.2.1 Pivoting<br />

Our Matlab program suffers from one very obvious drawback – it will fail when the<br />

diagonal element M(a, a) is zero (why?). The usual trick is <strong>to</strong> swap such a row with<br />

one of the other lower rows. But which one? The simplest strategy is <strong>to</strong> look at each<br />

element below the diagonal and find that which is the largest in absolute value. Then<br />

we swap those two rows. Here is a little Matlab fragment that does the job.<br />


% --- Find row containing the biggest number ----------------------
big_b   = a;                         % default row to swap
big_num = abs( M(a,a) );             % start with M(a,a)
for b = a+1 : N                      % all rows below row a
   if abs( M(b,a) ) > big_num        % a new big number?
      big_b   = b;                   % save this row
      big_num = abs( M(b,a) );       % save this number
   end
end
swap = big_b;                        % target row to swap

% --- Swap this row with the diagonal row -------------------------
if swap ~= a                         % don't swap a with a
   for b = a : N+1                   % all col's to the right
      save      = M(a,b);            % avoid over write
      M(a,b)    = M(swap,b);         % 1st part of swap
      M(swap,b) = save;              % 2nd part of swap
   end
end

You can fairly easily merge this code fragment back in<strong>to</strong> our previous code (exercise!).<br />

Despite the origins of pivoting, <strong>to</strong> overcome a zero on the diagonal, it happens <strong>to</strong> be a<br />

good thing <strong>to</strong> do even when the diagonal is not zero. Why? Because it tends <strong>to</strong> reduce<br />

the effects of round off errors (we’ll see an example in one of the following lectures).<br />

The pivoting that we have just seen is actually a particular form of pivoting known as<br />

partial pivoting. There is another form known as full pivoting in which we search for the<br />

largest element not only down the column but also across the row. This could lead <strong>to</strong> a<br />

swap of columns rather than rows (depending on the outcome of the search). Swapping<br />

columns is allowed if you also swap appropriate rows in the solution vec<strong>to</strong>r X.<br />

5.2.2 Tri-diagonal systems<br />

There is a particularly simple but important system of linear equations that crops up<br />

frequently when studying numerical approximations <strong>to</strong> differential equations. The augmented<br />

matrix has this simple structure<br />

[ β_1  γ_1   0    0    0    0    0   ···   λ_1 ]
[ α_2  β_2  γ_2   0    0    0    0   ···   λ_2 ]
[  0   α_3  β_3  γ_3   0    0    0   ···   λ_3 ]
[  0    0   α_4  β_4  γ_4   0    0   ···   λ_4 ]
[  0    0    0   α_5  β_5  γ_5   0   ···   λ_5 ]
[  .    .    .    .    .    .    .   ···    .  ]
[  0    0    0   ···   0    0   α_N  β_N   λ_N ]

When Gaussian elimination is applied to this system we get the following solution for X (it's easy, try it!). This is known as the Thomas algorithm for a tri-diagonal system.


% Here the arrays alpha(i), beta(i), gam(i), lam(i) hold the coefficients
% alpha_i, beta_i, gamma_i, lambda_i of the tri-diagonal system.

%--- Gaussian elimination -----------------------------------------
for i = 2 : N
   beta(i) = beta(i) - (alpha(i)/beta(i-1))*gam(i-1);
   lam(i)  = lam(i)  - (alpha(i)/beta(i-1))*lam(i-1);
end

%--- Back substitution --------------------------------------------
X(N) = lam(N)/beta(N);
for i = N-1 : -1 : 1
   X(i) = (lam(i) - gam(i)*X(i+1))/beta(i);
end

Example 5.2<br />

Verify that the above Matlab code is correct (i.e. follow the steps of the Gaussian elimination and see that it leads to the above code).

5.2.3 Round-off Errors and Pivoting<br />

The collected wisdom of many experts is that Gaussian elimination with back substitution<br />

is highly susceptible <strong>to</strong> the accumulation of round-off errors. This is particularly<br />

prevalent for large systems (and this may happen even when N ≈ 10!). You might well<br />

be pondering how we might measure the error in the computed solution. Let’s suppose<br />

the exact (pencil and paper) solution is X and that the computer has returned ˜X as its<br />

approximation <strong>to</strong> X. How might we check ˜X? If we knew X then we would hardly need<br />

<strong>to</strong> compute ˜X. So in the absence of X the best we can do is substitute ˜X back in<strong>to</strong> our<br />

system of equations <strong>to</strong> compute the residual r defined by<br />

r = AX̃ − f

The entries in r will measure the error in each equation. Ideally we would like each entry<br />

in r <strong>to</strong> be zero. But in the real world, the entries in r will not be zero. How large can<br />

we expect the entries in r to be? And when should we be worried (i.e. when would we say that the errors are too large and thus that we should be very wary about the quality of the approximation X̃)? Well, whole books are written on this matter. One

approach would be <strong>to</strong> compare the length of r <strong>to</strong> that of X. If we define L(u) <strong>to</strong> be the<br />

length of a vec<strong>to</strong>r u then, in this example, we would say that if L(r) ≪ L(˜X) then ˜X<br />

is probably a good (accurate) solution. There are other measures which we will not go<br />

in<strong>to</strong>.<br />
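As a quick illustration (a sketch only; here Matlab's backslash operator stands in for our own Gaussian elimination program), the residual test takes just a couple of lines once A, f and the computed solution are to hand:

% Residual check for a computed solution X_tilde of A*X = f.
A = [ 2  3  1 ; 1  2  2 ; 4  8 11 ];
f = [ 10 ; 10 ; 49 ];

X_tilde = A\f;                 % stand-in for the solution from our own program

r = A*X_tilde - f;             % residual: ideally the zero vector
fprintf('L(r)  = %.3e\n', norm(r));
fprintf('L(X~) = %.3e\n', norm(X_tilde));   % if norm(r) << norm(X_tilde), X_tilde is probably good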

The point to take home here is that for large systems of equations it is best not to use Gaussian elimination in any of its forms.

Example 5.3<br />

Why did we make no mention of truncation errors?<br />


Example 5.4<br />

The system of equations

[ −0.002   4.000   4.000 ] [ x ]   [  7.998 ]
[ −2.000   2.906  −5.387 ] [ y ] = [ −4.481 ]
[  3.000  −4.031  −3.112 ] [ z ]   [ −4.143 ]

has the exact solution

x = y = z = 1

This system is nearly singular (the determinant is close to zero) so we can expect some troubles, particularly when we only use a limited number of decimal places of accuracy. Here are our results performed without pivoting.

Gaussian elimination with B-S but without pivoting<br />

Digits [x, y, z]<br />

2 0.00000e+00 0.00000e+00 0.00000e+00<br />

4 5.00000e+00 2.00200e+00 0.00000e+00<br />

6 1.02000e+00 1.00692e+00 9.93092e-01<br />

8 1.00030e+00 1.00007e+00 9.99931e-01<br />

15 1.00000e+00 1.00000e+00 1.00000e+00<br />

Notice how poor the answer is when we use less than 8 decimal digits. If on the other<br />

hand we use pivoting then we find<br />

Gaussian elimination with B-S and with pivoting<br />

Digits [x, y, z]<br />

2 9.00000e-01 1.00000e+00 1.00000e+00<br />

4 1.00000e+00 1.00000e+00 1.00000e+00<br />

6 1.00000e+00 1.00000e+00 1.00000e+00<br />

8 1.00000e+00 1.00000e+00 1.00000e+00<br />

15 1.00000e+00 1.00000e+00 1.00000e+00<br />

Thus this simple change (<strong>to</strong> use pivoting) has brought about a dramatic improvement<br />

in the computations (even on a 2 digit computer).<br />

How did this occur? If you follow the exact steps of the Gaussian elimination stage<br />

(without pivoting) you will find that at some stage you will subtract two nearly equal<br />

numbers. Later you will divide this by the very small number −0.002 from the element<br />

(1, 1) and this will amplify the already significant relative error (in the earlier<br />

subtraction). This amplified error will then be propagated through the system during<br />

the remainder of the Gaussian elimination. When you begin the back substitution the<br />


last equation will be junk – it has been swamped by round-off error. So all subsequent<br />

computations will return junk – the whole scheme fails.<br />

On the other hand when you do use pivoting you will only divide by large numbers thus<br />

avoiding the problem of amplifying the round-off error. This keeps the calculations in<br />

check and gives us a reasonable answer. But be warned – pivoting is not the universal<br />

panacea <strong>to</strong> the problems such as the above. There are many systems that even with<br />

pivoting will yield poor results unless you use a sufficient number of digits (15 will not<br />

always suffice!).<br />

Example 5.5<br />

Do exactly as outlined in the last two paragraphs. This will show you exactly where<br />

the round-off error creeps in. Remember, at each stage you can test the quality of your<br />

system by substituting in the exact solution x = y = z = 1.<br />

5.3 Ill-conditioned systems<br />

These are systems for which small changes in the coefficient matrix lead to large changes in the solutions. This is not a good thing!

Example 5.6<br />

Here is a series of simple pairs of equations and their exact (pen and paper) solutions.

[ 1.000  2.000 ] [ x ]   [ 6.000 ]        [ x ]   [ 6.000 ]
[ 1.000  2.001 ] [ y ] = [ 6.000 ]   ⇒   [ y ] = [ 0.000 ]

[ 1.000  2.000 ] [ x ]   [ 6.000 ]        [ x ]   [ 0.000 ]
[ 1.001  2.000 ] [ y ] = [ 6.000 ]   ⇒   [ y ] = [ 3.000 ]

[ 1.000  2.000 ] [ x ]   [ 6.000 ]        [ x ]   [ 2.000 ]
[ 1.001  2.000 ] [ y ] = [ 6.002 ]   ⇒   [ y ] = [ 2.000 ]

As you can see, very small changes in the coefficient matrix lead to significant changes in the exact solution. This is not a computer problem, it is a problem intrinsic to the pair of equations. As such we can expect that when such systems are solved on a finite precision computer we will see problems. Here is what we get for our 2, 4, 6 and 8-digit computers using Gaussian elimination with back substitution and pivoting.

1st system, exact solution x = 6, y = 0<br />

Digits x y<br />

2 0.00000e+00 0.00000e+00<br />

4 6.00000e+00 0.00000e+00<br />

6 6.00000e+00 0.00000e+00<br />

8 6.00000e+00 0.00000e+00<br />


2nd system, exact solution x = 0, y = 3<br />

Digits x y<br />

2 0.00000e+00 0.00000e+00<br />

4 -4.00000e+00 5.00000e+00<br />

6 2.00000e-05 3.00000e+00<br />

8 0.00000e+00 3.00000e+00<br />

3rd system, exact solution x = 2, y = 2<br />

Digits x y<br />

2 0.00000e+00 0.00000e+00<br />

4 5.99600e+00 0.00000e+00<br />

6 2.00000e+00 2.00000e+00<br />

8 2.00000e+00 2.00000e+00<br />

Notice that each of these systems is nearly singular, that is the determinant of the coefficient matrix is very small (typically 0.001). This is one way of spotting an ill-conditioned system. But this is a limited tool, for it could be that all entries in the matrix are small and yet it is far from singular (for example, take 0.001 times the 2 × 2 identity matrix). More precise measures of ill-conditioned systems do exist but to discuss them now would take us into messy territory – something for you to study in your own time (keywords are: ill-conditioned, matrix norm and condition number).
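For what it is worth, Matlab can compute these quantities directly. The following sketch (not part of the original notes) reports the determinant and the condition number of the first coefficient matrix above; a large condition number flags an ill-conditioned system:

% Determinant and condition number of the first ill-conditioned system.
A = [ 1.000  2.000 ;
      1.000  2.001 ];

fprintf('det(A)  = %.3e\n', det(A));    % close to zero: nearly singular
fprintf('cond(A) = %.3e\n', cond(A));   % large: small changes in A or f cause large changes in X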

5.4 Operational cost<br />

It’s no surprise that it takes more time <strong>to</strong> solve a 10 by 10 system than a 3 by 3 system.<br />

But how much more time? Well, we could use a s<strong>to</strong>p-watch <strong>to</strong> get the actual times<br />

but, as useful as that might be, we can do better – we can develop a simple theoretical<br />

estimate. What is important is not the actual time for any one calculation but how<br />

many times longer a 10 by 10 system takes over a 3 by 3 system. We will measure each<br />

time in units of floating point operations. That is, we will count (approximately) the<br />

number of floating point operations (typically just the multiplies and divides) required<br />

<strong>to</strong> completely solve a N by N system.<br />

Why do we only count the multiplies and divides and not the additions and subtractions?<br />

Because the latter operations are done much more quickly than the former operations<br />

and thus its reasonable <strong>to</strong> ignore them when estimating the <strong>to</strong>tal computational effort.<br />

So how do we apply this idea <strong>to</strong> Gaussian elimination with back substitution (but without<br />

pivoting)? We will examine each stage in turn. First, we look at the Gaussian elimination<br />

stage. At the heart of this stage is the line<br />

M(b,c) = M(b,c) - fac<strong>to</strong>r*M(a,c)<br />


This involves just one floating point operation (the multiply). And this is buried inside<br />

three loops each running (approximately) from 1 <strong>to</strong> N giving a <strong>to</strong>tal of O (N 3 ) flops<br />

(flops = Floating Point Operations). In the outer two loops we also have one division<br />

in calculating fac<strong>to</strong>r. This adds a further O (N 2 ) flops <strong>to</strong> the <strong>to</strong>tal. Thus in this first<br />

stage we estimate the <strong>to</strong>tal operational count <strong>to</strong> be O (N 3 ) + O (N 2 ) flops. Now for the<br />

back substitution we see that we have just two loops, with one multiply in both loops<br />

and one divide in just the outer loop. Thus we estimate the operation count for this<br />

stage <strong>to</strong> be O (N 2 ) + O (N 1 ). Finally, we estimate the combined operation count <strong>to</strong> be<br />

O(N³) + O(N²) + O(N¹). For large N (say N ≳ 10) this is dominated by O(N³) and so we take this as our final estimate of the operational cost for Gaussian elimination.

Operation Count : Gaussian elimination<br />

The operational cost <strong>to</strong> solve an N by N system of equations using Gaussian elimination<br />

is O (N 3 ).<br />
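If you want to see the N³ growth for yourself, a rough timing experiment is easy to set up (a sketch only – here the backslash solver stands in for our own Gaussian elimination code, and the actual times will vary from machine to machine):

% Rough timing experiment: solving random N x N systems.
% Doubling N should increase the time by roughly a factor of 2^3 = 8 for large N.
for N = [500 1000 2000]
   A = rand(N);  f = rand(N,1);
   tic;  X = A\f;  t = toc;
   fprintf('N = %4d   time = %.3f s\n', N, t);
end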

5.5 Iterative methods<br />

It's been said a few times that Gaussian elimination performs badly on large systems. Here is an example.

We start with the simple function f(t) = (1 − t^{n+1})/(1 − t) with n any positive integer. You might (should?) recognise this function – it's the sum of the geometric series with common ratio t. That is

f(t) = (1 − t^{n+1})/(1 − t) = 1 + t + t² + t³ + ··· + t^n ,     n = 1, 2, 3, ···

This is a polynomial in t with each coefficient equal <strong>to</strong> one. Is it possible <strong>to</strong> recover<br />

these coefficients by sampling f(t) for various values of t? Suppose we wrote<br />

f(t) = 1 + a_1 t + a_2 t² + a_3 t³ + ··· + a_n t^n

We could evaluate both left and right hand sides for n distinct choices of t. In this<br />

example we will choose t = 1, 2, 3, · · · n + 1. This leads us <strong>to</strong> the system of equations<br />

f(1) − 1 = 1 1 a 1 + 1 2 a 2 + 1 3 a 3 + · · · 1 n a n<br />

f(2) − 1 = 2 1 a 1 + 2 2 a 2 + 2 3 a 3 + · · · 2 n a n<br />

f(3) − 1 = 3 1 a 1 + 3 2 a 2 + 3 3 a 3 + · · · 3 n a n<br />

. = .<br />

f(n) − 1 = n 1 a 1 + n 2 a 2 + n 3 a 3 + · · · n n a n<br />


Since we know the values for f(t) we can treat this as a system of n equations in the n unknowns a_1, a_2, a_3, · · · a_n. On a perfect computer (i.e. infinite precision) we would expect all a_j = 1, so we can define an error by

E_n = ( ( (1 − a_1)^2 + (1 − a_2)^2 + (1 − a_3)^2 + · · · + (1 − a_n)^2 ) / n )^(1/2)

So much for definitions, what do we get? Here are the results for a 15 digit computer.

n     11          12          13          14          15          16
E_n   0.000e+00   0.000e+00   0.000e+00   1.150e+02   6.482e+04   1.603e+11

We see that things go way off the rails for n around 14. This is not a very large system of equations and yet things have gone terribly wrong. The cause is simply the accumulation of round-off errors. If you take a look at the entries in the coefficient matrix you will see some wildly varying numbers. For N = 5 the coefficient matrix is

⎡ 1.000e+00  1.000e+00  1.000e+00  1.000e+00  1.000e+00  1.000e+00 ⎤
⎢ 2.000e+00  4.000e+00  8.000e+00  1.600e+01  3.200e+01  6.400e+01 ⎥
⎢ 3.000e+00  9.000e+00  2.700e+01  8.100e+01  2.430e+02  7.290e+02 ⎥
⎢ 4.000e+00  1.600e+01  6.400e+01  2.560e+02  1.024e+03  4.096e+03 ⎥
⎢ 5.000e+00  2.500e+01  1.250e+02  6.250e+02  3.125e+03  1.563e+04 ⎥
⎣ 6.000e+00  3.600e+01  2.160e+02  1.296e+03  7.776e+03  4.666e+04 ⎦

As you can see, the numbers in the bottom right hand corner are large. For N = 15 the bottom right hand corner is approximately 4.4 × 10^17. With such a wide range of numbers (the top left hand corner is always 1) it is no surprise that a 15 digit computer will have trouble maintaining numerical accuracy.

What is the lesson here? We should not place undying faith in Gaussian elimination. We must develop alternatives (whether or not it helps us on the above system).

In this section we will look at iterative methods, similar to the fixed point methods we saw earlier, applied to the solution of linear systems of equations.

Let it be said at the outset – do not expect miracles with these methods. We will see that they can work but that their convergence is slow.

5.5.1 Jacobi iteration

Most of us should have little trouble solving a system such as

⎡ 400   0    0  ⎤ ⎡ x ⎤   ⎡ 400 ⎤
⎢  0   200   0  ⎥ ⎢ y ⎥ = ⎢ 400 ⎥
⎣  0    0   300 ⎦ ⎣ z ⎦   ⎣ 300 ⎦


The solution is x = 1, y = 2 and z = 1. Now suppose we want to solve the related system

⎡ 400   1    2  ⎤ ⎡ x ⎤   ⎡ 400 ⎤
⎢  3   200   1  ⎥ ⎢ y ⎥ = ⎢ 400 ⎥
⎣  1    3   300 ⎦ ⎣ z ⎦   ⎣ 300 ⎦

We could reasonably guess that x = 1, y = 2 and z = 1 would be a good approximation to the exact solution. How might we improve on this solution? We could apply Gaussian elimination and it would work well. But for the purposes of showing you a new technique, let's rule out Gaussian elimination.

Re-arrange the equations in the following way.

solve 1st equation for x  ⇒  x = (400 − y − 2z)/400
solve 2nd equation for y  ⇒  y = (400 − 3x − z)/200
solve 3rd equation for z  ⇒  z = (300 − x − 3y)/300

This suggests the iterative scheme

x_{n+1} = (400 − y_n − 2z_n)/400
y_{n+1} = (400 − 3x_n − z_n)/200
z_{n+1} = (300 − x_n − 3y_n)/300

To get the ball rolling we need an initial guess; let's take x_0 = y_0 = z_0 = 0. Here are our results.

Jacobi iteration

Iteration   x               y               z
0           0.0000000e+00   0.0000000e+00   0.0000000e+00
1           1.0000000e+00   2.0000000e+00   1.0000000e+00
2           9.9000000e-01   1.9800000e+00   9.7666667e-01
3           9.9016667e-01   1.9802667e+00   9.7690000e-01
4           9.9016483e-01   1.9802630e+00   9.7689678e-01
5           9.9016486e-01   1.9802630e+00   9.7689682e-01
6           9.9016486e-01   1.9802630e+00   9.7689682e-01
7           9.9016486e-01   1.9802630e+00   9.7689682e-01

It converges quickly, and you can check that the x, y, z values are correct, but this is probably not much of a surprise since the system is almost trivial, with just a few small off-diagonal terms. How well does this idea work for other systems? There is only one real test – try it on other systems!


Here is another system

⎡ 5  1  2 ⎤ ⎡ x ⎤   ⎡ 5 ⎤
⎢ 1  6  2 ⎥ ⎢ y ⎥ = ⎢ 9 ⎥
⎣ 1  1  3 ⎦ ⎣ z ⎦   ⎣ 4 ⎦

from which we create the following iteration formula

x_{n+1} = (5 − y_n − 2z_n)/5
y_{n+1} = (9 − x_n − 2z_n)/6
z_{n+1} = (4 − x_n − y_n)/3

Here are our results, starting with x_0 = y_0 = z_0 = 0.

Jacobi iteration

Iteration   x               y               z
0           0.0000000e+00   0.0000000e+00   0.0000000e+00
2           1.6666667e-01   8.8888889e-01   5.0000000e-01
4           3.4629630e-01   1.0691358e+00   6.9074074e-01
6           4.1298354e-01   1.1278464e+00   7.5936214e-01
8           4.3650091e-01   1.1483112e+00   7.8375514e-01
10          4.4477533e-01   1.1555065e+00   7.9238872e-01
20          4.4925085e-01   1.1593990e+00   7.9707572e-01
30          4.4927523e-01   1.1594202e+00   7.9710131e-01
40          4.4927536e-01   1.1594203e+00   7.9710145e-01
50          4.4927536e-01   1.1594203e+00   7.9710145e-01

This is not so good. It does converge, but rather slowly. As we shall see, there is one way to accelerate the convergence (but do not get your hopes too high – this class of iterations is notoriously slow to converge, sometimes requiring many thousands of iterations to achieve even a modest 5 decimal digits of accuracy).


5.5.2 Gauss-Seidel iteration

This is a minor variation on the Jacobi iteration where now we use the new values as soon as they become available. Thus for the second Jacobi example we would use

x_{n+1} = (5 − y_n − 2z_n)/5
y_{n+1} = (9 − x_{n+1} − 2z_n)/6
z_{n+1} = (4 − x_{n+1} − y_{n+1})/3

Notice the x_{n+1} and y_{n+1} appearing on the right hand side. These equations must be executed in order from first to last (how else could we use x_{n+1} in the second equation?). This minor change produced the following results.

Gauss-Seidel iteration

Iteration   x               y               z
0           0.0000000e+00   0.0000000e+00   0.0000000e+00
2           5.1111111e-01   1.2296296e+00   7.5308642e-01
4           4.4881207e-01   1.1614577e+00   7.9657674e-01
6           4.4923516e-01   1.1594281e+00   7.9711224e-01
8           4.4927475e-01   1.1594194e+00   7.9710193e-01
10          4.4927537e-01   1.1594203e+00   7.9710145e-01
20          4.4927536e-01   1.1594203e+00   7.9710145e-01
This is better, but its far from ideal.<br />

Can we do better? Remember how New<strong>to</strong>n-Raphson iterations converged quadratically<br />

(every iteration double the number of decimal places of accuracy). Can we obtain<br />

similar convergence for iterative solutions of linear systems? In general no. Dash. Dang.<br />

Fiddlesticks. ’Doh. The sad fact is that solving large systems of linear system by iterative<br />

methods will be slow – patience, patience and more patience!<br />


Jacobi Iteration

For an n × n system of equations such as

⎡ a_11  a_12  a_13  · · ·  a_1n ⎤ ⎡ x_1 ⎤   ⎡ b_1 ⎤
⎢ a_21  a_22  a_23  · · ·  a_2n ⎥ ⎢ x_2 ⎥   ⎢ b_2 ⎥
⎢ a_31  a_32  a_33  · · ·  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥
⎢  .     .     .    · · ·   .   ⎥ ⎢  .  ⎥   ⎢  .  ⎥
⎣ a_n1  a_n2  a_n3  · · ·  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦

The Jacobi iteration for this system is defined to be

(x_i)_{n+1} = (1/a_ii) ( b_i − Σ_{j≠i} a_ij (x_j)_n ) ,   i = 1, 2, 3, · · · n

where the notation (x_i)_n means the n-th iteration for the exact value x_i.

Gauss-Seidel Iteration

For an n × n system of equations such as

⎡ a_11  a_12  a_13  · · ·  a_1n ⎤ ⎡ x_1 ⎤   ⎡ b_1 ⎤
⎢ a_21  a_22  a_23  · · ·  a_2n ⎥ ⎢ x_2 ⎥   ⎢ b_2 ⎥
⎢ a_31  a_32  a_33  · · ·  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥
⎢  .     .     .    · · ·   .   ⎥ ⎢  .  ⎥   ⎢  .  ⎥
⎣ a_n1  a_n2  a_n3  · · ·  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦

The Gauss-Seidel iteration for this system is defined to be

(x_i)_{n+1} = (1/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij (x_j)_{n+1} − Σ_{j=i+1}^{n} a_ij (x_j)_n ) ,   i = 1, 2, 3, · · · n

where the notation (x_i)_n means the n-th iteration for the exact value x_i.
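Both definitions translate almost line for line into Matlab. The sketch below is not part of these notes' code – it is a minimal illustration, with an invented function name, of one or more sweeps applied to A x = b; the flag use_new_values switches between the Jacobi and Gauss-Seidel updates.

function x = iterate_linear(A, b, x, nsweeps, use_new_values)
% Jacobi (use_new_values = false) or Gauss-Seidel (true) sweeps for A*x = b.
% Illustrative sketch only; the function name is invented for this example.
n = length(b);
for sweep = 1:nsweeps
    x_old = x;                                    % Jacobi works only from the previous sweep
    for i = 1:n
        if use_new_values
            s = A(i,:)*x - A(i,i)*x(i);           % Gauss-Seidel: updated entries used at once
        else
            s = A(i,:)*x_old - A(i,i)*x_old(i);   % Jacobi: old entries only
        end
        x(i) = (b(i) - s)/A(i,i);
    end
end

For example, iterate_linear([5 1 2; 1 6 2; 1 1 3], [5; 9; 4], zeros(3,1), 50, false) should reproduce (to the digits shown) the converged values in the Jacobi table above.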

5.5.3 Diagonal dominance and convergence

Will Jacobi or Gauss-Seidel iterations always converge? No prizes for correctly guessing No. Here is an example.

We start with the system

⎡ 1  3  5 ⎤ ⎡ x ⎤   ⎡ 1 ⎤
⎢ 5  1  2 ⎥ ⎢ y ⎥ = ⎢ 3 ⎥
⎣ 2  4  1 ⎦ ⎣ z ⎦   ⎣ 2 ⎦


for which a Jacobi iteration scheme would be

x_{n+1} = 1 − 3y_n − 5z_n
y_{n+1} = 3 − 5x_n − 2z_n
z_{n+1} = 2 − 2x_n − 4y_n

with the following results

Divergent Jacobi iteration

Iteration   x                y                z
0           0.0000000e+00    0.0000000e+00    0.0000000e+00
2           -1.8000000e+01   -6.0000000e+00   -1.2000000e+01
4           -6.6000000e+02   -5.1600000e+02   -6.2400000e+02
6           -3.0582000e+04   -3.0114000e+04   -2.7540000e+04
8           -1.5320880e+06   -1.5034560e+06   -1.2880560e+06
10          -7.6099674e+07   -7.2909246e+07   -6.2847516e+07
20          -2.1522512e+16   -2.0473354e+16   -1.7848334e+16

Whoops, things have gone horribly astray. Why? Look at the coefficients in the equation for x_{n+1}. These are dominated by those for y_n and z_n and, importantly, are larger than 1. Thus any errors in y_n and z_n will be amplified and passed on as a larger error in x_{n+1}. So any small error will quickly swamp the iterations (as the above results show).

If that logic is correct (it is) then perhaps a re-arrangement of the original equations might help. If we move the first equation to the last in the original system, then we obtain

⎡ 5  1  2 ⎤ ⎡ x ⎤   ⎡ 3 ⎤
⎢ 2  4  1 ⎥ ⎢ y ⎥ = ⎢ 2 ⎥
⎣ 1  3  5 ⎦ ⎣ z ⎦   ⎣ 1 ⎦

for which a Jacobi iteration scheme would be

x_{n+1} = (3 − y_n − 2z_n)/5
y_{n+1} = (2 − 2x_n − z_n)/4
z_{n+1} = (1 − x_n − 3y_n)/5

Does this save the day? Okay, here are the results


Convergent Jacobi iteration

Iteration   x               y               z
0           0.0000000e+00   0.0000000e+00   0.0000000e+00
2           4.2000000e-01   1.5000000e-01   -2.2000000e-01
4           5.2060000e-01   1.6450000e-01   -1.3860000e-01
6           5.4625800e-01   1.8943500e-01   -8.9118000e-02
8           5.5933494e-01   2.0684805e-01   -6.9042340e-02
10          5.6687170e-01   2.1587029e-01   -5.9805334e-02
20          5.7471659e-01   2.2468058e-01   -5.0347124e-02
30          5.7499005e-01   2.2498879e-01   -5.0012182e-02
40          5.7499965e-01   2.2499961e-01   -5.0000427e-02
50          5.7499999e-01   2.2499999e-01   -5.0000015e-02

There is a theorem that makes sense of what we have just observed.

Diagonal Dominance and Convergence

For an n × n system of equations such as

⎡ a_11  a_12  a_13  · · ·  a_1n ⎤ ⎡ x_1 ⎤   ⎡ b_1 ⎤
⎢ a_21  a_22  a_23  · · ·  a_2n ⎥ ⎢ x_2 ⎥   ⎢ b_2 ⎥
⎢ a_31  a_32  a_33  · · ·  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥
⎢  .     .     .    · · ·   .   ⎥ ⎢  .  ⎥   ⎢  .  ⎥
⎣ a_n1  a_n2  a_n3  · · ·  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦

the Jacobi and Gauss-Seidel iterations will converge when

|a_ii| > Σ_{j≠i} |a_ij|

for each i = 1, 2, 3, · · · n.

A matrix which satisfies this condition is said to be diagonally dominant.
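The condition is easy to test in Matlab. Here is a small illustrative helper (not part of these notes' code, and with an invented name) that checks strict diagonal dominance row by row; applied to the re-ordered matrix above it returns true, while the original ordering returns false.

function ok = is_diagonally_dominant(A)
% True if |a_ii| > sum over j ~= i of |a_ij| for every row i.
% Illustrative sketch only.
d = abs(diag(A));               % the |a_ii|
offdiag = sum(abs(A),2) - d;    % row sums of |a_ij| with the diagonal removed
ok = all(d > offdiag);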

Example 5.7

Consider the Jacobi iteration. Define the error in x_i at iteration p by (ε_i)_p = (x_i)_p − x_i. Express the Jacobi iteration for (x_i)_p as an iteration scheme for (ε_i)_p. Then, using |a + b + c + · · ·| ≤ |a| + |b| + |c| + · · · (which is true for any numbers a, b, c, . . .), show that, if the matrix is diagonally dominant, then

max_i |(ε_i)_{p+1}| < max_i |(ε_i)_p|

What can you infer from this last line?


5.6 Operational counts

We know that solving an N × N system of equations using Gaussian elimination requires O(N^3) flops. What do we make of the Jacobi and Gauss-Seidel iterations? For each new (x_i)_{n+1} we have to complete a sum of N − 1 terms, each of which involves just one multiply. Thus we have N − 1 multiplications and one division for each of the N values (x_i)_{n+1}. This gives us a total operation count of O(N^2) per iteration. This is much smaller than the O(N^3) for the standard Gaussian elimination.

What does this tell us? If, for example, N = 100, the computer would take roughly as much time to compute one Gaussian elimination as 100 Jacobi iterations. Thus 100 iterations is not as computationally demanding as it may have seemed at first. Thus for large N, successive Jacobi or Gauss-Seidel iterations may well be computationally more efficient than a single Gaussian elimination.


6. Solving Systems of Nonlinear Equations



6.1 Introduction

Let's start with a simple example. Suppose we wish to find the (x, y) values such that

0 = f(x, y) = y − 7x^3 + 10x + 1     (1)
0 = g(x, y) = x + 8y^3 − 11y − 1     (2)

If we only had one equation, say 0 = f(x, y), then for any value of x we could solve for y (using methods like Newton-Raphson for a function of one variable). Thus this one equation describes a curve in the xy-plane. Likewise the second equation, 0 = g(x, y), also describes a curve. What we are looking for are the intersection points of these curves.

[Figure: the curves f(x, y) = 0 and g(x, y) = 0 plotted in the xy-plane, with a second panel zoomed in on the region near the origin.]

From the plots we can see that we have nine points of intersection. Our game is to find these points. We will do this using two different schemes, both generalised versions of the algorithms we used for functions of one variable. Each algorithm will generate a series of approximations (x, y)_0, (x, y)_1, (x, y)_2, · · · which we hope will converge to at least one of the nine intersection points.

6.2 Generalised Fixed Point iteration

It is a simple matter to rearrange the pair of equations (1) and (2) into

y = 7x^3 − 10x − 1
x = −8y^3 + 11y + 1

which then suggests the iteration scheme

y_{n+1} = 7x_n^3 − 10x_n − 1     (3)
x_{n+1} = −8y_n^3 + 11y_n + 1    (4)

Starting with (x, y)_0 = (0, 0) (chosen by inspection of the above plots) we get for the first five iterations

Divergent Generalised Fixed Point iterations

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      0.000e+00    0.000e+00    1.000e+00     -1.000e+00
1      1.000e+00    -4.000e+00   0.000e+00     -4.680e+02
2      4.690e+02    7.221e+08    0.000e+00     3.013e+27
3      -3.013e+27   -1.914e+83   -3.013e+28    -5.61e+250
4      5.61e+250    Inf          NaN           NaN
5      NaN          NaN          NaN           NaN

This is not a good start – the iterations diverge very rapidly. It is not hard to see why – the coefficients in (3) and (4) are all greater than one and so any errors in (x, y)_n will be amplified with each iteration.

Undaunted by this first round failure, we press on with another choice of iteration scheme, such as the following

x_{n+1} = ( 7x_n^3 − y_n − 1 ) / 10     (5)
y_{n+1} = ( 8y_n^3 + x_n − 1 ) / 11     (6)

Starting with (x, y)_0 = (0, 0) (again by inspection of the above plots) we get for the first five iterations


Convergent Generalised Fixed Point iterations

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      0.000e+00    0.000e+00    1.000e+00     -1.000e+00
1      -1.000e-01   -1.000e-01   -9.300e-02    -8.000e-03
2      -9.070e-02   -9.988e-02   -1.659e-03    2.833e-05
3      -9.053e-02   -9.986e-02   -1.095e-05    4.227e-06
4      -9.053e-02   -9.986e-02   2.953e-07     1.158e-07
5      -9.053e-02   -9.986e-02   1.292e-08     1.877e-09

This is much better. The iterations are converging. Great. But can we get the remaining 8 intersection points? No; try as you may, for all choices of initial guess the series converges to just this one point, (−9.053e−02, −9.986e−02). To recover other points you will need to work with other fixed point versions of the original system of equations (1,2).

Example 6.1

See if you can find any of the other points of intersection by constructing alternative forms for the fixed point iterations for the system (1,2).

Generalised Fixed Point Iteration

If a system of equations

0 = f_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

can be re-written in the form

x_i = g_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

then the sequence

x_i^{p+1} = g_i(x_1^p, x_2^p, x_3^p, · · · , x_n^p) ,   i = 1, 2, 3, · · · , n

for p = 0, 1, 2, 3 · · · is known as generalised fixed point iteration. In the above, x_i^p denotes the p-th iteration for x_i. The sequence is not guaranteed to converge to any root of the system.

6.2.1 Convergence

What can we say about the convergence of the generalised fixed point iterations? As we have seen, they may or may not converge. When we saw that the first form of fixed point iteration diverged, we realised that this was to be expected because the coefficients in the iteration equations were all larger than 1 and thus any errors would be amplified in each iteration. Though this is true (in this case) it is a loose mathematical statement (e.g. x_{n+1} = 7x_n^3 will converge to x = 0) and not easily applied to other systems. Here we shall explore a better and more mathematically based criterion for convergence.

We begin by writing our system of equations in the form

x_i = g_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

Just so we all agree – this is a set of n equations in the n variables (x_1, x_2, x_3, · · · , x_n). Now we can define the fixed point iteration equations as

x_i^{p+1} = g_i(x_1^p, x_2^p, x_3^p, · · · , x_n^p) ,   i = 1, 2, 3, · · · , n

Next, we define the error at the p-th iteration of x_i by

ε_i^p = x_i^p − x_i ,   i = 1, 2, 3, · · · , n

Thus x_i^p = x_i + ε_i^p and we can substitute this into our fixed point equations

x_i + ε_i^{p+1} = g_i(x_1 + ε_1^p, x_2 + ε_2^p, x_3 + ε_3^p, · · · , x_n + ε_n^p) ,   i = 1, 2, 3, · · · , n

But our hope is that the sequence is converging. Thus every error ε_i is small, allowing us to expand each function as a Taylor series in powers of ε_i. If we retain just the first order terms (i.e. ignore higher powers of ε_i) then we find

ε_i^{p+1} = (∂g_i/∂x_1) ε_1^p + (∂g_i/∂x_2) ε_2^p + (∂g_i/∂x_3) ε_3^p + · · · + (∂g_i/∂x_n) ε_n^p ,   i = 1, 2, 3, · · · , n

Each of the partial derivatives is evaluated at the root x_i. It helps to write this in matrix form,

⎡ ε_1^{p+1} ⎤   ⎡ ∂g_1/∂x_1   ∂g_1/∂x_2   ∂g_1/∂x_3   · · ·   ∂g_1/∂x_n ⎤ ⎡ ε_1^p ⎤
⎢ ε_2^{p+1} ⎥   ⎢ ∂g_2/∂x_1   ∂g_2/∂x_2   ∂g_2/∂x_3   · · ·   ∂g_2/∂x_n ⎥ ⎢ ε_2^p ⎥
⎢ ε_3^{p+1} ⎥ = ⎢ ∂g_3/∂x_1   ∂g_3/∂x_2   ∂g_3/∂x_3   · · ·   ∂g_3/∂x_n ⎥ ⎢ ε_3^p ⎥
⎢     .     ⎥   ⎢     .           .           .       · · ·       .     ⎥ ⎢   .   ⎥
⎣ ε_n^{p+1} ⎦   ⎣ ∂g_n/∂x_1   ∂g_n/∂x_2   ∂g_n/∂x_3   · · ·   ∂g_n/∂x_n ⎦ ⎣ ε_n^p ⎦

What do we make of this equation? To help see what's going on, let's condense these equations into

E^{p+1} = J E^p

where E is the column vector of errors and J is the matrix of partial derivatives. We can wind this equation all the way back to the initial iteration

E^{p+1} = J E^p = J ( J E^{p−1} ) = J ( J ( J E^{p−2} ) ) = J ( J ( J ( J E^{p−3} ) ) ) = · · · = J^{p+1} E^0


Now we come to the crunch. We want the sequence to converge. That is, we want E^p → 0 as p → ∞. This can only occur when J^p → 0 as p → ∞. Very interesting you might say, but what use is this? Well, a theorem from linear algebra tells us that if J^p → 0 as p → ∞ then all of the eigenvalues of J must have absolute value less than 1. It has been a long journey but here is the result we want,

Convergence of Generalised Fixed Point Iteration

For a system of equations

x_i = g_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

construct the n × n matrix J with entries

J_ij = ∂g_i/∂x_j

Let λ_i be the eigenvalues of J. If |λ_i| < 1 , i = 1, 2, 3, · · · , n then the sequence

x_i^{p+1} = g_i(x_1^p, x_2^p, x_3^p, · · · , x_n^p) ,   i = 1, 2, 3, · · · , n

will converge to a root of the given system.
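In practice the partial derivatives can also be estimated numerically. The sketch below is an illustration only (not part of these notes): it assumes a user-supplied function gfun that returns the column vector [g_1(x); · · · ; g_n(x)], builds an approximate J by finite differences, and returns its eigenvalues so you can check whether they all lie inside the unit circle before committing to an iteration.

function lambda = fixed_point_eigs(gfun, x)
% Estimate J_ij = dg_i/dx_j at the point x by finite differences and
% return the eigenvalues of J. Illustrative sketch only.
n  = length(x);
J  = zeros(n,n);
h  = 1e-6;                         % small step for the finite differences
g0 = gfun(x);
for j = 1:n
    xp = x;  xp(j) = xp(j) + h;
    J(:,j) = (gfun(xp) - g0)/h;    % numerical estimate of column j of J
end
lambda = eig(J);                   % want all |lambda| < 1 for convergence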

Example 6.2

When we were playing with functions of one variable, we created the fixed point sequence x_{p+1} = g(x_p) and we found the convergence criterion to be |dg/dx| < 1. How does this fit in with what we have just found?

Note that to be able to apply the above test for convergence we must construct the matrix J at the root of the system. But we don't know the root – that's why we are using an iterative scheme. The best you can do is estimate J by using your current approximation to x_i.

Example 6.3

Estimate J for the first form of the fixed point iteration equations (3,4) at the initial guess x = 0, y = 0. Hence infer, without any further iterations, whether or not you would expect the subsequent iterations to converge or diverge.


Example 6.4

Modify the following Matlab code to implement a Gauss-Seidel style of iteration in which new values are used immediately (see the discussion on Gauss-Seidel iterations for linear systems given in previous lectures).

6.2.2 Matlab example

Here is a Matlab function which implements our first fixed point scheme. You will need to save this in a file called gen_fp.m in your workspace. You can run this example by typing the Matlab command

Matlab>> [x,y] = gen_fp([0 0],0.001,20)

This runs the code with initial guess (0, 0), requesting an accuracy of 0.001 within 20 iterations. The final approximations are returned as x and y (how inventive).

Here is the code


function [x,y] = gen_fp(x_start,target_error,max_loop)

% --- Set starting values
x_old = x_start;
x_new = x_start;
loop = 1;
looping = (loop < max_loop);
converged = false;

% --- Loop until we get target accuracy or too many loops
while (looping) & (~converged)
    x_new = onestep(x_old);               % do one iteration
    error = norm(x_new - x_old);          % measure the change
    loop = loop + 1;                      % update loop counter
    looping = (loop < max_loop);          % too many iterations?
    converged = (error < target_error);   % are we done?
    x_old = x_new;                        % prepare for next iteration
end

% --- Finished, save data
x = x_new(1);
y = x_new(2);

% --- The iteration formula
function y = onestep(x)
y(1) = -8*x(2)^3+11*x(2)+1;    % x = g1(x,y)
y(2) = 7*x(1)^3-10*x(1)-1;     % y = g2(x,y)

function y = norm(x)
% compute the length of a vector
y = sqrt(dot(x,x));


6.3 Generalised Newton-Raphson

For the pair of equations

0 = f(x, y) = y − 7x^3 + 10x + 1
0 = g(x, y) = x + 8y^3 − 11y − 1

we know there are 9 intersection points and yet, try as we might, the generalised fixed point method does not recover all 9 points. What do we do? Ten points to Gryffindor for suggesting the Newton-Raphson method. But how do we apply it to a pair of equations? Good point. We will have to return to basics, following steps similar to those we used for functions of one variable.

Let (x, y) be the exact root of the above equations and, as usual, let (x, y)_n be our current approximation to (x, y). Can we compute small changes δx and δy so that we jump to the exact root in the next iteration? That is, we want

0 = f(x_n + δx, y_n + δy)
0 = g(x_n + δx, y_n + δy)

We need access to δx and δy yet they are buried inside the functions f and g. We can get hold of them by expanding the pair in powers of δx and δy using a Taylor series. This leads to

0 = f(x_n, y_n) + (∂f/∂x) δx + (∂f/∂y) δy + · · ·
0 = g(x_n, y_n) + (∂g/∂x) δx + (∂g/∂y) δy + · · ·

where we have hidden all the higher order terms (e.g. δx^2, δx δy^2) in the trailing dots. If we assume (as we usually do) that the δ's are small, then the higher order terms can be discarded and we are left with

−f(x_n, y_n) = (∂f/∂x) δx + (∂f/∂y) δy
−g(x_n, y_n) = (∂g/∂x) δx + (∂g/∂y) δy

This is a simple 2 × 2 system of linear equations for δx and δy which we can solve (easily) by any suitable method (pencil and paper, Gaussian elimination). We can then update our approximations by

x_{n+1} = x_n + δx
y_{n+1} = y_n + δy


Example 6.5

Didn't we say we were going to choose δx and δy so that we would jump to the exact root in one iteration? Why then is this only an improved approximation?

How well does this new method work? Let's apply it to the above pair. Here is what we get for a few starting guesses.

Generalised Newton-Raphson iterations

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      0.000e+00    0.000e+00    1.000e+00     -1.000e+00
1      -9.009e-02   -9.910e-02   5.118e-03     -7.786e-03
2      -9.053e-02   -9.986e-02   3.718e-07     -1.393e-06
3      -9.053e-02   -9.986e-02   9.992e-16     -4.186e-14
4      -9.053e-02   -9.986e-02   1.110e-16     0.000e+00
5      -9.053e-02   -9.986e-02   0.000e+00     -1.110e-16

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      1.000e+00    1.000e+00    5.000e+00     -3.000e+00
1      1.472e+00    1.194e+00    -5.420e+00    9.662e-01
2      1.319e+00    1.159e+00    -7.040e-01    3.471e-02
3      1.292e+00    1.159e+00    -1.941e-02    4.078e-06
4      1.291e+00    1.159e+00    -1.622e-05    3.646e-08
5      1.291e+00    1.159e+00    -1.135e-11    1.954e-14

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      1.000e+00    -1.000e+00   3.000e+00     3.000e+00
1      1.250e+00    -1.250e+00   -1.422e+00    -1.625e+00
2      1.190e+00    -1.186e+00   -9.159e-02    -1.192e-01
3      1.186e+00    -1.181e+00   -4.747e-04    -8.358e-04
4      1.186e+00    -1.181e+00   -1.243e-08    -4.133e-08
5      1.186e+00    -1.181e+00   0.000e+00     0.000e+00

It works! Yippee. The remaining 6 points are also easily found simply by choosing an initial guess close to the desired point (which we get by inspection of the plots of the functions, as shown in a previous lecture).

Notice also that each iteration appears to double the number of digits of accuracy (look at the values of f and g after each iteration; they are roughly the square of the previous value). Thus not only do we have a method that seems capable of finding all the roots, it does so very efficiently. This is the preferred method whenever we want quick and accurate roots. The price you pay is that you have more work to do in evaluating the partial derivatives (most times this should not be too problematic).

Generalised Newton-Raphson Iteration

The Generalised Newton-Raphson method for a system of equations

0 = f_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

is given by

x_i^{p+1} = x_i^p + δx_i ,   i = 1, 2, 3, · · · , n

where the δx_i are obtained by solving the n × n linear system

−f_i = (∂f_i/∂x_1) δx_1 + (∂f_i/∂x_2) δx_2 + (∂f_i/∂x_3) δx_3 + · · · + (∂f_i/∂x_n) δx_n ,   i = 1, 2, 3, · · · , n

in which each f_i and ∂f_i/∂x_j is evaluated at the x_i^p.

If the initial guess is close to the root, then the sequence will converge quadratically,

ε_i^{p+1} = O( (ε_i^p)^2 ) ,   i = 1, 2, 3, · · · , n

where ε_i^p = x_i^p − x_i is the error in the p-th iteration for x_i.

Example 6.6

Using techniques similar to those used for a function of one variable, show that the convergence of the Generalised Newton-Raphson scheme is quadratic (i.e. the error in the next iteration is of order the square of the error in the previous iteration).

6.3.1 Matlab code

The Matlab code which we used in the Generalised Fixed Point method can be very easily adapted to the Generalised Newton-Raphson method – all we need do is change the iteration formula inside the function onestep. Here is the modified Matlab function.


function y = onestep(x)

% --- the function values ---------------------------------
f(1) = x(2) - 7*x(1)^3 + 10*x(1) + 1;
f(2) = x(1) + 8*x(2)^3 - 11*x(2) - 1;

% --- the partial derivatives -----------------------------
a(1,1) = -21*x(1)^2 + 10;
a(1,2) = 1;
a(2,1) = 1;
a(2,2) = +24*x(2)^2 - 11;

% --- solve the linear system for dx ----------------------
dx = inv(a)*( -f' );     % note the transpose: f was built as a row vector

% --- the next iteration ----------------------------------
y(1) = x(1) + dx(1);
y(2) = x(2) + dx(2);


7. Interpolation and Approximation of Data



7.1 The what and why of interpolation

Suppose you are asked to evaluate a function f(x) at x = 3.4 but all you are given is the following table of numbers

x      0.000   1.200   2.400   3.600    4.000    6.000    7.000
f(x)   0.000   0.932   0.675   -0.443   -0.757   -0.279   0.657

Since the target point is not in the table, the best we can hope to do is to estimate f(3.4). How might we do this? We could construct a straight line built on the two points nearest x = 3.4. Or we could build a quadratic on any three points that straddle x = 3.4. If we got really excited we might try building a cubic by selecting four points around x = 3.4. You get the picture – we use a set of points near the target point to build a polynomial. Then we estimate f by evaluating the polynomial at the target point. This process is called polynomial interpolation (surprised?).

Let's be a bit more specific. We are given a table of n + 1 data points (x, f)_i, i = 0, 1, 2, · · · , n and we wish to estimate f at x = x⋆. We decide on building a polynomial of degree n by which we will estimate f(x⋆). Let's call this polynomial P_n(x) and let's write it out in the form

P_n(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n

where a_0, a_1, a_2, · · · a_n are some set of numbers. How might we compute the coefficients a_0, a_1, a_2, · · · a_n? Simply demand that the polynomial pass through all n + 1 of the given data points. That is, we demand the following interpolation condition

Interpolation Condition

Demand that P_n(x) be such that

f_j = P_n(x_j) = a_0 + a_1 x_j + a_2 x_j^2 + · · · + a_n x_j^n ,   j = 0, 1, 2, · · · , n

This gives us n + 1 linear equations for the n + 1 unknowns a_0, a_1, a_2, · · · a_n.

So in principle we can solve this system and thus we have our P_n(x). Our job is done. Yes? No?

We played this game once before, when we set out to recover the polynomial

f(t) = (1 − t^(n+1))/(1 − t) = 1 + t + t^2 + t^3 + · · · + t^n

(see section (5.5)). There we found that the resulting matrix equations could easily cause great grief when using Gaussian elimination. So we need another way to do the same job. We will look at two methods: Lagrangian interpolation and Newton's divided differences. Both have their merits and failings.


7.2 Lagrangian interpolation

Suppose we have just three points (x, y)_i, i = 0, 1, 2. Now let us build the polynomials

L_0(x) = (x − x_1)(x − x_2) / ((x_0 − x_1)(x_0 − x_2))
L_1(x) = (x − x_0)(x − x_2) / ((x_1 − x_0)(x_1 − x_2))
L_2(x) = (x − x_0)(x − x_1) / ((x_2 − x_0)(x_2 − x_1))

The L_i(x) were constructed in this way for a very good reason. Each of these is a quadratic in x and they have the following properties

L_0(x_0) = 1   L_0(x_1) = 0   L_0(x_2) = 0
L_1(x_0) = 0   L_1(x_1) = 1   L_1(x_2) = 0
L_2(x_0) = 0   L_2(x_1) = 0   L_2(x_2) = 1

Since they take on only the values 0 and 1 at the tabulated points we are certain that the combination

f̃(x) = f_0 L_0(x) + f_1 L_1(x) + f_2 L_2(x)

satisfies the interpolation condition. Thus we have a quadratic polynomial that passes through the given points (i.e. f̃(x) satisfies the interpolation condition). Now our job is done. We have a quadratic passing through the given points and we can now compute f̃(x⋆).

Note that in all that follows we will always use a tilde, as in f̃, to remind us that we are estimating a function's value at some point.

Note also that it is common practice to write P_N(x) instead of f̃(x) for the Lagrange polynomial of degree N.

Lagrange interpolation

The Lagrange polynomial P_N(x), based on the N + 1 data points (x, y)_i, i = 0, 1, 2, · · · N, is given by

P_N(x) = Σ_{i=0}^{N} y_i Π_{j=0, j≠i}^{N} (x − x_j)/(x_i − x_j)


Example 7.1

Compute the quadratic that passes through the following data

x      0.0   1.0   3.0
f(x)   3.0   2.0   6.0

7.2.1 The whole polynomial or just its value?

In the previous example we pushed the Lagrangian interpolation all the way to an explicit function of x, that is, we found f̃(x) = x^2 − 2x + 3. Normally nobody would ever bother going this far. Rather, it is standard practice to leave the pieces of the Lagrange polynomial in their unexpanded form (why do more than you have to?).

Example 7.2 A smooth function y = sin(x)

Polynomials are nice smooth functions. Thus we can expect that they can provide good approximations to other smooth functions. Here we shall explore a few low order (i.e. N = 2, 3, 4 and 5) polynomial approximations to sin(x) over the interval 0 < x < π. In each of the following we first sampled sin(x) at a few points in the interval 0 < x < π and then built the Lagrange polynomial on those points. Here are the plots.

[Figure: Lagrangian interpolation of sin(x). Four panels show the N = 2 (quadratic), N = 3 (cubic), N = 4 (quartic) and N = 5 (quintic) Lagrange polynomials, each built on sample points (marked by squares) over 0 < x < π.]


What do we observe from the above plots? First, that each Lagrange polynomial passes through the given data points (as they must!). Second, that all the approximations look good, with the N = 5 quintic interpolation being almost indistinguishable from the source function sin(x). But the picture changes dramatically if we plot the Lagrange polynomials outside the range on which they were built. Here is what our collection looks like over the interval 0 < x < 2π.

[Figure: the N = 2, 3, 4, 5 Lagrangian interpolations of sin(x), now plotted over the extended interval 0 < x < 2π, together with the original sample points.]

Clearly things have gone off the rails. Are we surprised? Not really. Using the polynomials in this way is not much different from trying to predict the future (on the basis of past events). This process (predicting outside the given data) is known as extrapolation. It almost always gives junk like the above – so be very very afraid when you venture down this path.


Example 7.3 A not so smooth function y = H(x)

The Heaviside step function is very simple and is defined by

H(x) = −0.5 : x < 0
        0   : x = 0
        0.5 : 0 < x

[Figure: Lagrangian interpolation of H(x). Four panels show the interpolating polynomials built on N = 4, 8, 16 and 32 evenly spaced sample points over −1 ≤ x ≤ 1.]

The plot in the lower right corner, which uses an overly ambitious number of points, N = 32, is not a pretty sight. We can see that all of the polynomials pass through the chosen data points, but the price we pay is that the higher order interpolations (which we expect to be accurate) oscillate wildly near the ends of the domain. This is serious. The cause of this problem is the sharp jump at x = 0. The best advice we can give is that if you must use polynomial interpolation (there are other choices, and we will look at these in later lectures) then stick to lower order polynomials (e.g. cubic).


7.2.2 Matlab code

Here is a simple Matlab function which will return the value of the Lagrange polynomial for a single x value, x_target.

function answer = lagrngpoly(x,f,n,x_target)

% --- loop to compute each term in sum_k f_k L_k(x*)
sum = 0;                  % start sum at zero
for k=1:n                 % add on one term at a time
    term=1;               % start each term at one
    % --- 1st half loop
    for j=1:k-1
        term=term*(x_target-x(j))/(x(k)-x(j));
    end
    % --- 2nd half loop
    for j=k+1:n
        term=term*(x_target-x(j))/(x(k)-x(j));
    end
    % --- term now equals L_k(x*)
    sum=sum+f(k)*term;    % update the sum
end
answer = sum;             % save the final answer

Here is an example of how you might run the above code

Matlab>> x_data = [0.1 0.2 0.4 0.8 1.2];
Matlab>> f_data = [1.2 1.0 0.6 0.8 1.6];
Matlab>> fapprox = lagrngpoly(x_data,f_data,5,0.123)

This runs the code and requests a quartic (degree 4, built on the 5 data points) approximation at x = 0.123.


7.3 Newton polynomials

Suppose you are playing the game of polynomial interpolation. You might first use linear interpolation, then quadratic and so on until you find an estimate that you are happy with. If you chose Lagrangian interpolation then you will have to scrap all previous calculations as you move through this chain of polynomials. For example, you cannot use any parts of the quadratic Lagrange polynomial to help you build the cubic Lagrange polynomial. So this is an expensive process. It would be much better if we could find an algorithm that does draw upon past calculations and allows higher order polynomials to be built from previously computed lower order polynomials. This brings us to an algorithm known as Newton's divided differences. It will produce exactly the same polynomials as Lagrangian interpolation (more on this later) but it will do so in a different way.

Here is a set of examples that shows where we are headed. In all of these examples we are given N data points (x, f)_i, i = 0, 1, 2, · · · N. We will build each polynomial in terms of some unknown numbers a_i, i = 0, 1, 2, · · · N and then use the interpolation condition f̃(x_i) = f_i to compute the a_i. We will write f̃_j(x), j = 0, 1, 2, · · · for our chain of polynomials (i.e. f̃_1(x) is our linear polynomial and f̃_3(x) is our cubic polynomial).

Example 7.4 One data point : constant interpolation

With N = 1 we have just one data point and our interpolation is of the form

f̃_0(x) = a_0

for some number a_0. There is only one interpolation condition and we easily see that

a_0 = f_0

Example 7.5 Two data points : linear interpolation

Now we have N = 2 and we propose an interpolation like

f̃_1(x) = f̃_0(x) + a_1 (x − x_0)

We choose it in this form because it clearly satisfies the interpolation condition at x = x_0. We only have one other interpolation condition to impose, at x = x_1. This gives us

a_1 = (f_1 − f̃_0(x_1)) / (x_1 − x_0)


Example 7.6 Three data points : quadratic interpolation

With N = 3 we now choose

f̃_2(x) = f̃_1(x) + a_2 (x − x_0)(x − x_1)

Again we have been sneaky in the way we have built this polynomial. It clearly passes through the first two points. All we need do is choose a_2 so that it passes through the third point. This leads to

a_2 = (f_2 − f̃_1(x_2)) / ((x_2 − x_0)(x_2 − x_1))

Example 7.7 Four data points : cubic interpolation

By now you may be seeing a pattern. This time we have N = 4 and we put

f̃_3(x) = f̃_2(x) + a_3 (x − x_0)(x − x_1)(x − x_2)

and we find

a_3 = (f_3 − f̃_2(x_3)) / ((x_3 − x_0)(x_3 − x_1)(x_3 − x_2))

So much for examples, how do we do this in general? Let's suppose we have built f̃_j(x) based on the first j + 1 points. Then we build the next polynomial as

f̃_{j+1}(x) = f̃_j(x) + a_{j+1} (x − x_0)(x − x_1)(x − x_2) · · · (x − x_j)

with

a_{j+1} = (f_{j+1} − f̃_j(x_{j+1})) / ((x_{j+1} − x_0)(x_{j+1} − x_1)(x_{j+1} − x_2) · · · (x_{j+1} − x_j))

This might look a bit tedious but there is a further trick that makes this computation rather easy. We will build a triangular table of data from which we will later pick out the various a_j. The entries in the table will be denoted d_ij with j being the column index. We will build the table column by column, from left to right, using the recursive formula

d_ij = (d_{i,j−1} − d_{i−1,j−1}) / (x_i − x_{i−j}) ,   1 ≤ j ≤ i ≤ N

with the first column set as d_i0 = f_i, i = 0, 1, 2, · · · N. Then we have

a_j = d_jj ,   j = 0, 1, 2, · · · N


Newton interpolation

x_i       f_i = d_i0   d_i1        d_i2        d_i3       d_i4       d_i5
x_0 = 1   d_00 = −3
x_1 = 2   d_10 = 0     d_11 = 3
x_2 = 3   d_20 = 15    d_21 = 15   d_22 = 6
x_3 = 4   d_30 = 48    d_31 = 33   d_32 = 9    d_33 = 1
x_4 = 5   d_40 = 105   d_41 = 57   d_42 = 12   d_43 = 1   d_44 = 0
x_5 = 6   d_50 = 192   d_51 = 87   d_52 = 15   d_53 = 1   d_54 = 0   d_55 = 0

The column headings simply remind us which column we are building and the numbers in the body of the table are the various d_ij's.

To build the polynomial we simply read off the coefficients along the leading diagonal

f̃_5(x) = d_00 + d_11 (x − x_0) + d_22 (x − x_1)(x − x_0) + · · ·
        = −3 + 3(x − 1) + 6(x − 1)(x − 2) + 1(x − 1)(x − 2)(x − 3)

Equally we could use the coefficients along the lower diagonal

f̃_5(x) = d_N0 + d_{N−1,1} (x − x_N) + d_{N−2,2} (x − x_N)(x − x_{N−1}) + · · ·
        = 192 + 87(x − 6) + 15(x − 6)(x − 5) + 1(x − 6)(x − 5)(x − 4)

Note that d_i4 = d_i5 = 0, thus showing that our data was actually a pure cubic (which we have now recovered from the data).


Example 7.8

Show that the two polynomials given above are one and the same polynomial, f(x) = x^3 − 4x.

Newton interpolation polynomial

Given a set of N + 1 data points (x, f)_i, i = 0, 1, 2, · · · N, the Newton interpolation polynomial is given by

f̃_N(x) = d_00 + d_11 (x − x_0) + d_22 (x − x_0)(x − x_1)
          + · · · + d_NN (x − x_0)(x − x_1)(x − x_2) · · · (x − x_{N−1})

where the d_ij are computed by the recursive formula

d_ij = (d_{i,j−1} − d_{i−1,j−1}) / (x_i − x_{i−j}) ,   1 ≤ j ≤ i ≤ N

with d_i0 = f_i, i = 0, 1, 2, · · · N.

7.3.1 Horner's form of the Newton polynomial

There is a computationally efficient way to evaluate a Newton polynomial. It goes like this. Start with the general form

f̃_N(x) = d_00 + d_11 (x − x_0) + d_22 (x − x_0)(x − x_1)
          + · · · + d_NN (x − x_0)(x − x_1)(x − x_2) · · · (x − x_{N−1})

and then group all the common factors

f̃_N(x) = d_00 + (x − x_0)( d_11 + (x − x_1)( d_22 + (x − x_2)( d_33 + · · · + (x − x_{N−1}) d_NN ) ) )
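The divided-difference table and the Horner-style evaluation fit comfortably into a few lines of Matlab. The sketch below is an illustration only (the function name is invented, and it is not taken from these notes); x and f hold the N + 1 data points and x_target is the point at which we want the estimate.

function val = newton_interp(x, f, x_target)
% Build the divided-difference table, then evaluate the Newton polynomial
% in Horner form. Illustrative sketch only.
N = length(x) - 1;
d = zeros(N+1, N+1);
d(:,1) = f(:);                                   % first column: d_i0 = f_i
for j = 2:N+1                                    % build the table column by column
    for i = j:N+1
        d(i,j) = (d(i,j-1) - d(i-1,j-1)) / (x(i) - x(i-j+1));
    end
end
val = d(N+1,N+1);                                % Horner-style nested evaluation
for i = N:-1:1
    val = d(i,i) + (x_target - x(i))*val;
end

For example, newton_interp([1 2 3 4 5 6], [-3 0 15 48 105 192], 2.5) should return the value of the cubic x^3 − 4x at x = 2.5 (that is, 5.625), since those are the data used in the table above.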

7.4 Uniqueness

Are the Lagrange and Newton polynomials the same for a given dataset? Yes – there is only one polynomial of degree N that passes through the N + 1 data points. Thus even though we have used different formulae to compute these polynomials, they are in fact identical. Use whichever method you feel most comfortable with.

The polynomial of degree N is often written as P_N(x) rather than our f̃_N(x).


7.5 Piecewise polynomial interpolation

Suppose we are given a simple dataset and we are asked to estimate the derivative at, say, x = 0.35. How do we proceed? Here is one approach.

Construct, by whatever means, a smooth approximation ỹ(x) to y(x) near x = 0.35. Then put y′(0.35) ≈ ỹ′(0.35).

We have some options for constructing ỹ(x)

◮ Least squares estimation. This is easy to apply but it does not interpolate the data and it can develop spurious wiggles – a disaster for derivatives.

◮ Piecewise polynomial interpolation. This is also easy to apply, but which local set of points do we use? Different choices will produce different estimates for y(x) and y′(x).


What we want is a method which

◮ Produces a unique approximation,
◮ Is continuous over the domain and
◮ Has, at least, a continuous first derivative over the domain.

If we get continuity in any higher derivatives then we shall be pleased (but not greedy). We will look at one way of achieving this, known as cubic spline interpolation.

7.5.1 Cubic Splines

Here is our problem. We have a set of data points (x_i, y_i), i = 0, 1, 2, · · · n and we wish to build an approximation ỹ(x) which has as much continuity as we can get (now we are being greedy).

Between each pair of points we will construct a cubic. Let ỹ_i(x) be the cubic for the interval x_i ≤ x ≤ x_{i+1}. We demand that

Interpolation condition                    y_i = ỹ_i(x_i)                   (1)
Continuity of the function                 ỹ_{i−1}(x_i) = ỹ_i(x_i)          (2)
Continuity of the first derivative         ỹ′_{i−1}(x_i) = ỹ′_i(x_i)        (3)
Continuity of the second derivative        ỹ′′_{i−1}(x_i) = ỹ′′_i(x_i)      (4)

Can we solve this system of equations? We need to balance the number of unknowns against the number of equations. We have n + 1 data points and thus n cubics ỹ_i(x) to compute. Each cubic has 4 coefficients, thus we have 4n unknowns. And how many equations? From the above we count n + 1 equations in (1), and n − 1 equations in each of (2), (3) and (4). A total of 4n − 2 equations for 4n unknowns. We see that we will have to provide two extra pieces of information. No matter, we'll press on and see what comes up.

Start by putting


ỹ_i(x) = y_i + a_i (x − x_i) + b_i (x − x_i)^2 + c_i (x − x_i)^3     (5)

which automatically satisfies equation (1). For the moment suppose we happen to know all of the second derivatives y′′_i. We then have ỹ′′_i(x) = 2b_i + 6c_i (x − x_i) and evaluating this at x = x_i leads to

b_i = y′′_i / 2     (6)

Now we turn to equation (4), y′′_{i+1} = y′′_i + 6c_i (x_{i+1} − x_i), which gives

c_i = (y′′_{i+1} − y′′_i) / (6 h_i)     (7)

where we have introduced h_i = x_{i+1} − x_i. Next we compute the a_i by applying equation (2),

y_{i+1} = y_i + a_i h_i + (1/6)(y′′_{i+1} + 2 y′′_i) h_i^2     (8)

and so

a_i = (y_{i+1} − y_i)/h_i − (1/6) h_i (y′′_{i+1} + 2 y′′_i)     (9)

It appears that we have completely determined each of the cubics, though we have yet to use (3), continuity in the first derivative. But remember that we don't yet know the values of y′′_i. Thus equation (3) will be used to compute the y′′_i. Using our values for a_i, b_i and c_i we find (after much fiddling) that equation (3) is

6( (y_{i+1} − y_i)/h_i − (y_i − y_{i−1})/h_{i−1} ) = h_i y′′_{i+1} + 2(h_i + h_{i−1}) y′′_i + h_{i−1} y′′_{i−1}     (10)

The only unknowns in this equation are the y′′_i, of which there are n + 1. But there are only n − 1 equations. Thus we must supply two extra pieces of information.

The simplest choice is to set y′′_0 = y′′_n = 0. Then we have a tri-diagonal system of equations to solve for the y′′_i. That's as far as we need push the algebra – we can call a Matlab routine to solve the tri-diagonal system.


The recipe

◮ Solve equation (10) for the y′′_i,
◮ Compute all of the a_i from equation (9),
◮ Compute all of the b_i from equation (6),
◮ Compute all of the c_i from equation (7) and finally
◮ Assemble all of the cubics using equation (5).

Our job is done. We have computed the cubic spline for each interval.
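For completeness, here is a hedged Matlab sketch of the recipe using the simplest end conditions y′′_0 = y′′_n = 0. It is an illustration only, not part of these notes' code; the function name is invented, and the tri-diagonal system is solved with Matlab's backslash rather than a dedicated tri-diagonal routine.

function ys = cubic_spline_eval(x, y, xs)
% Cubic spline through the n+1 points (x,y) with y''_0 = y''_n = 0,
% evaluated at the points xs. Illustrative sketch only.
n  = length(x) - 1;
h  = diff(x(:));                           % h_i = x_{i+1} - x_i
dy = diff(y(:))./h;                        % (y_{i+1} - y_i)/h_i
% --- solve equation (10) for the interior second derivatives
A = zeros(n-1);  r = zeros(n-1,1);
for i = 1:n-1
    A(i,i) = 2*(h(i) + h(i+1));
    if i > 1,   A(i,i-1) = h(i);   end
    if i < n-1, A(i,i+1) = h(i+1); end
    r(i) = 6*(dy(i+1) - dy(i));
end
ypp = [0; A\r; 0];                         % the chosen end conditions y''_0 = y''_n = 0
% --- coefficients from equations (9), (6) and (7)
a = dy - h.*(ypp(2:n+1) + 2*ypp(1:n))/6;
b = ypp(1:n)/2;
c = (ypp(2:n+1) - ypp(1:n))./(6*h);
% --- assemble equation (5) and evaluate at each requested point
ys = zeros(size(xs));
for k = 1:numel(xs)
    i = find(xs(k) >= x(1:n), 1, 'last');  % interval containing xs(k)
    if isempty(i), i = 1; end
    t = xs(k) - x(i);
    ys(k) = y(i) + a(i)*t + b(i)*t^2 + c(i)*t^3;
end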


7.5.2 Example

Here we compare a cubic spline interpolation against a set of polynomial interpolations for a simple step function. Notice how the cubic splines are much smoother than the polynomials. In each plot there are four different interpolations, based on 4, 8, 16 and 32 evenly spaced points over the interval −1 ≤ x ≤ 1. This corresponds to polynomials of degree 3, 7, 15 and 31. The key observation here is that the higher order polynomials contain very large oscillations near the limits of the dataset. It is far better to choose a low order (less than 5) polynomial for the interpolation. Even better, use a cubic spline.

[Figure: two panels, "Cubic splines" and "Polynomial interpolation", showing the step-function
data and the interpolants over −1 ≤ x ≤ 1.]

7.6 Non-polynomial interpolation

In previous lectures we used polynomials to interpolate data. But why stop at polynomials?
There are many other classes of functions which we can use, in particular sin
and cos functions. Why would we want to make such a change? We know that cyclic
behaviour is a common feature in many natural phenomena. For example, we all know
that the Earth's temperature varies on a daily and seasonal basis. If we didn't know
this, how might we extract the periods of these cycles from measured data? By fitting
suitably chosen periodic functions, such as sin and cos functions, to the data.

So our game today is to see how we might fit a function of the form

f(x) = a_0/2 + a_1 cos(x) + a_2 cos(2x) + a_3 cos(3x) + · · ·
             + b_1 sin(x) + b_2 sin(2x) + b_3 sin(3x) + · · ·

to a set of data (x_i, f_i) for f(x). This leads us to Fourier series.

7.6.1 Fourier series

Here are some useful facts (aka theorems).

If f(x) is defined over the interval −π ≤ x ≤ +π then

f(x) = a_0/2 + ∑_{j=1}^∞ ( a_j cos(jx) + b_j sin(jx) )

where the coefficients are given by

a_j = (1/π) ∫_{−π}^{+π} f(x) cos(jx) dx
b_j = (1/π) ∫_{−π}^{+π} f(x) sin(jx) dx

The infinite series converges to the mid-point of any discontinuity in f(x).

The infinite series provides a natural periodic extension of f(x) to all values of x. Simply put,
f(x ± 2nπ) = f(x) for any integer n.

The a_j, b_j are known as the Fourier coefficients and they measure the amplitude of the
particular harmonic in the function.

7.6.2 Estimating the Fourier coefficients

Suppose we are given the (x_i, f(x_i)) (with the x_i evenly spaced in −π < x < π) and
that we wish to estimate the Fourier coefficients. We have two options; we can either

◮ Estimate the integrals using a left hand sum rule, or
◮ Use the interpolation condition f̃(x_i) = f_i.

In the following we will assume that x_j = −π + (2jπ)/N for j = 0, 1, 2, · · · N and that
N is an even integer.

Approximating the integrals

Here we estimate the integrals by a left hand sum,

a_j ≈ ã_j = (2/N) ∑_{k=0}^{N−1} f_k cos(j x_k)
b_j ≈ b̃_j = (2/N) ∑_{k=0}^{N−1} f_k sin(j x_k)

for j = 0, 1, 2, · · ·.

Interpolation condition

The interpolation condition f̃(x_i) = f_i is just

f_k = ã_0/2 + ∑_{j=1}^∞ ( ã_j cos(j x_k) + b̃_j sin(j x_k) )

for k = 0, 1, 2, · · · N − 1. (Exercise: why do we stop at k = N − 1?)

From this set of N equations we need to compute the ã_j, b̃_j. But we have a problem –
there are N equations and an infinite number of coefficients. Clearly we have to reduce
the number of coefficients to N. This involves two tricks.

First, note that, since k is an integer and N is an even integer,

cos((j + N)x_k) = cos(j x_k + N x_k) = cos(j x_k + (−Nπ + 2πk)) = cos(j x_k)
sin((j + N)x_k) = sin(j x_k + N x_k) = sin(j x_k + (−Nπ + 2πk)) = sin(j x_k)

This allows us to combine terms j, j + N, j + 2N, j + 3N, · · · in the infinite series,

f_k = ã_0/2 + ∑_{j=1}^{N} ( ã'_j cos(j x_k) + b̃'_j sin(j x_k) )

where ã'_j = ã_j + ã_{j+N} + ã_{j+2N} + · · · and b̃'_j = b̃_j + b̃_{j+N} + b̃_{j+2N} + · · ·. The second
trick is almost a repeat of the first. This time we note that

cos((N − j)x_k) = cos(N x_k − j x_k) = cos(−j x_k) = cos(j x_k)
sin((N − j)x_k) = sin(N x_k − j x_k) = sin(−j x_k) = − sin(j x_k)

This allows us to combine terms j and N − j, leading to

f_k = ã_0/2 + ∑_{j=1}^{m} ( ã''_j cos(j x_k) + b̃''_j sin(j x_k) )

where ã''_j = ã'_j + ã'_{N−j} and b̃''_j = b̃'_j − b̃'_{N−j}, and N = 2m. The final thing to note is
that when j = m we have sin(j x_k) = 0. That is, the last term in the sum drops out –

there is no b̃''_m in the equations. We have gone as far as we need in massaging the terms
in the series, so we will now drop the double primes on the coefficients.

Okay, how many coefficients do we have? One ã_0, plus m of the ã_j and m − 1 of the b̃_j, for a
total of 2m = N. And we have exactly N equations. So in principle we should now be
able to solve the system of equations (by, e.g., Gaussian elimination) for the ã_j and b̃_j.

The surprise is that the solution is almost exactly what we found using the left hand
rule for the integrals – the sum overestimates a_m by a factor of two. We can account
for this by a slight alteration to the way we write the Fourier sum. Thus we have

Discrete Fourier Series

f(x) ≈ f̃(x) = ã_0/2 + (ã_m/2) cos(mx) + ∑_{j=1}^{m−1} ( ã_j cos(jx) + b̃_j sin(jx) )

where N = 2m and

ã_j = (2/N) ∑_{k=0}^{N−1} f_k cos(j x_k) ,    b̃_j = (2/N) ∑_{k=0}^{N−1} f_k sin(j x_k) ,    j = 0, 1, 2, · · · m
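As a rough illustration, here is a minimal Matlab sketch of this discrete Fourier
interpolation. The test function and variable names are ours; only the formulas for
ã_j, b̃_j and f̃(x) come from the notes.

f  = @(x) exp(sin(x));            % an assumed smooth, 2*pi periodic test function
N  = 16;  m = N/2;                % N must be even
k  = 0:N-1;
xk = -pi + 2*pi*k/N;              % sample points in [-pi, pi)
fk = f(xk);                       % sampled data f_k

a = zeros(1, m+1);  b = zeros(1, m+1);
for j = 0:m                       % left hand sum estimates of the coefficients
    a(j+1) = (2/N) * sum(fk .* cos(j*xk));
    b(j+1) = (2/N) * sum(fk .* sin(j*xk));
end

% Evaluate the discrete Fourier series f~(x) on a fine grid
x  = linspace(-pi, pi, 200);
ft = a(1)/2 + a(m+1)/2 * cos(m*x);
for j = 1:m-1
    ft = ft + a(j+1)*cos(j*x) + b(j+1)*sin(j*x);
end
plot(x, f(x), '-', x, ft, '--', xk, fk, 'o')   % compare f, f~ and the data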

7.6.3 Example

This is almost the same example as we used in the lecture on cubic splines – a step
function. In this instance we have carefully defined the step function at the discontinuities
to ensure that the function has a consistent periodic extension. We defined the
value of f(x) at each discontinuity to be the mid-point of the jump. This is necessary
if we want the successive copies f(x), f(x + 2π), f(x + 4π) · · · to fit together neatly at
x = 2π, 4π, 6π · · ·

The Fourier interpolation is seen to be free of the large oscillations that plague the polynomial
interpolations. The Fourier and polynomial interpolations yield similar values
near the centre of the dataset.

[Figure: two panels, "Fourier interpolation" and "Polynomial interpolation", showing the
step-function data and the interpolants over −3 ≤ x ≤ 3.]

7.7 Approximating functions

The game here is simple – given a table of data points, estimate the value of a function
at some point not in the table.

The general approach is to construct a new function which captures as well as possible
the information in the table.

Let's write (x_i, y_i) for the data points, y(x) for the underlying function and ỹ(x) for the
approximation to y(x).

Interpolation. In this case we demand that ỹ(x) gives the exact values when evaluated
at the tabulated points, ỹ(x_i) = y_i. This has been covered in previous lectures.

Approximation. This time we choose ỹ(x) so that it passes close to, but not necessarily
through, each data point. This extra flexibility can often give us better approximations
than would otherwise be given by polynomial interpolation.

If the data in the table are drawn from an underlying smooth function then it often
doesn't matter which method you choose; both will give reasonable answers. (But what's
reasonable? Good question.) However, if the data happens to contain high frequency
oscillations (e.g. noisy experimental data) then it would not make sense to interpolate
the data. In this case an approximation to the function would make much more sense
and would more likely give a much better answer.

7.7.1 Example

[Figure: two panels showing the same dataset over 0 ≤ x ≤ 2, illustrating interpolation
versus approximation of the data.]

7.7.2 Least Squares

Suppose we have the following table of data

     x     y(x)
    0.0    0.00
    0.4    0.64
    0.8    0.42
    1.2    1.58
    1.6    1.36
    2.0    2.00

[Figure: the data points plotted over 0 ≤ x ≤ 2.]

and that we choose to approximate the function with a straight line

y(x) ≈ ỹ(x) = Ax + C

How do we compute the coefficients A and C?

We can't use standard polynomial interpolation as there are 6 data points but only 2
parameters, A and C, to compute. The best we can do is to choose A and C to minimise
the error between y(x) and ỹ(x).

We define the error (also called the residual) by

E(A, C) = ∑_i ( ỹ(x_i) − y_i )^2 = ∑_i ( A x_i + C − y_i )^2

And for the minimum we set 0 = ∂E/∂A and 0 = ∂E/∂C, leading to

∑_i y_i = A ∑_i x_i + C ∑_i 1

∑_i x_i y_i = A ∑_i x_i^2 + C ∑_i x_i

These are known as the normal equations. For our previous example we get

6.000 = 6.00 A + 6.00 C
8.664 = 8.80 A + 6.00 C

This 2 by 2 system is easily solved, A = 0.9514 and C = 0.0486, and so

y(x) ≈ ỹ(x) = 0.9514x + 0.0486

[Figure: the data points and the fitted line ỹ(x) = 0.9514x + 0.0486 over 0 ≤ x ≤ 2.]

The line does not pass through the data but it does capture the general trend in the
data.
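Here is a minimal Matlab sketch of this calculation. It forms and solves the normal
equations directly for the table above; the variable names are ours.

x = [0.0 0.4 0.8 1.2 1.6 2.0]';       % data from the table
y = [0.00 0.64 0.42 1.58 1.36 2.00]';

% Normal equations: [sum(x) n; sum(x.^2) sum(x)] * [A; C] = [sum(y); sum(x.*y)]
M   = [sum(x)    length(x);
       sum(x.^2) sum(x)   ];
rhs = [sum(y); sum(x.*y)];
AC  = M \ rhs;                        % AC(1) = A (slope), AC(2) = C (intercept)

A = AC(1);  C = AC(2);                % expect A = 0.9514, C = 0.0486
plot(x, y, 'o', x, A*x + C, '-')      % the data and the fitted line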

7.7.3 Generalised least squares

Suppose we now set

ỹ(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n

Our aim is to choose the a_i so that ỹ(x) is the best possible approximation to y(x). We
set about minimising the residual

E(a_i) = ∑_i ( ỹ(x_i) − y_i )^2

over all possible choices of a_i. Thus we put 0 = ∂E/∂a_i for each a_i. This gives us n + 1
equations for n + 1 unknowns,

∑_i x_i^0 y_i = a_0 ∑_i x_i^0 + a_1 ∑_i x_i^1 + a_2 ∑_i x_i^2 + · · · + a_n ∑_i x_i^{n+0}

∑_i x_i^1 y_i = a_0 ∑_i x_i^1 + a_1 ∑_i x_i^2 + a_2 ∑_i x_i^3 + · · · + a_n ∑_i x_i^{n+1}

∑_i x_i^2 y_i = a_0 ∑_i x_i^2 + a_1 ∑_i x_i^3 + a_2 ∑_i x_i^4 + · · · + a_n ∑_i x_i^{n+2}

    ⋮

∑_i x_i^n y_i = a_0 ∑_i x_i^n + a_1 ∑_i x_i^{n+1} + a_2 ∑_i x_i^{n+2} + · · · + a_n ∑_i x_i^{2n}

These are the normal equations for the function ỹ(x). They can be solved by standard
matrix methods, but note that for n > 3 these equations are often ill-conditioned.
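A minimal Matlab sketch of building and solving these normal equations is given below.
The degree, the sample data and the variable names are illustrative assumptions; in
practice the built-in polyfit (shown later in these notes) does the same job.

x = (0:0.25:2)';            % some assumed sample data, for illustration only
y = sin(x) + 0.05*randn(size(x));
n = 2;                      % degree of the fitting polynomial

% Normal equations: M(j+1,k+1) = sum(x.^(j+k)), rhs(j+1) = sum(x.^j .* y)
M   = zeros(n+1, n+1);
rhs = zeros(n+1, 1);
for j = 0:n
    for k = 0:n
        M(j+1, k+1) = sum(x.^(j+k));
    end
    rhs(j+1) = sum(x.^j .* y);
end
a = M \ rhs;                       % a(k+1) is the coefficient a_k of x^k

ytilde = polyval(flipud(a)', x);   % evaluate the fitted polynomial at the data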

7.7.4 Variations on a theme

Okay, so much for linear functions. Can we apply the least squares idea to functions
such as ỹ(x) = αe^{βx}? Yes, and there are two common approaches.

Method 1 This involves a change of variables. We put u(x) = ln y(x) and thus
u(x) = ln α + βx. As this is a linear function, we can apply the simple linear least
squares method to the data (x_i, u_i). We set u(x) ≈ ũ(x) = Ax + C and compute A and
C by minimising the residual. Then we return to our original variable, y(x). This gives
us α = e^C and β = A.

Method 2 This is the head-on approach. We make no change of variable and simply
define the residual as

E(α, β) = ∑_i ( ỹ(x_i) − y_i )^2 = ∑_i ( αe^{βx_i} − y_i )^2

As usual, we set 0 = ∂E/∂α and 0 = ∂E/∂β to obtain the best choice for α and β.
The problem with this method is that the normal equations will be non-linear functions
of the parameters. This is harder to solve (but not impossible).

Most people opt for the first method.

Example : Method 1

Here is the data

x      0.000  0.142  0.285  0.428  0.571  0.714  0.857  1.000
y(x)   1.500  1.495  1.040  0.821  1.003  0.821  0.442  0.552

to which we'll fit ỹ(x) = αe^{βx}.

First we compute a new table with u(x) = ln y(x),

x      0.000  0.142  0.285  0.428   0.571  0.714   0.857   1.000
u(x)   0.405  0.402  0.039  -0.197  0.003  -0.197  -0.816  -0.594

We then construct the normal equations,

−0.962 = 3.97 A + 8.00 C
−1.451 = 2.85 A + 3.97 C

whose solution is A = −1.132 and C = 0.445. Finally we convert back to the original
variable y(x). That is α = e^C = 1.561 and β = A = −1.132. So we have found

y(x) ≈ ỹ(x) = 1.561 e^{−1.132x}

[Figure: the data points and the fitted curve ỹ(x) = 1.561 e^{−1.132x} over 0 ≤ x ≤ 1.]
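A minimal Matlab sketch of Method 1 for this data set is shown below; the variable
names are ours.

x = [0.000 0.142 0.285 0.428 0.571 0.714 0.857 1.000]';
y = [1.500 1.495 1.040 0.821 1.003 0.821 0.442 0.552]';

u = log(y);                        % change of variables u = ln y
p = polyfit(x, u, 1);              % straight line fit: u ~ A*x + C
A = p(1);  C = p(2);               % slope and intercept

alpha = exp(C);                    % convert back: alpha = e^C, beta = A
beta  = A;                         % expect alpha ~ 1.56 and beta ~ -1.13

xx = linspace(0, 1, 100);
plot(x, y, 'o', xx, alpha*exp(beta*xx), '-')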

Example : Method 2

Same data, same function, but now we set

E(α, β) = ∑_i ( αe^{βx_i} − y_i )^2

0 = ∂E/∂α = 2 ∑_i ( αe^{βx_i} − y_i ) e^{βx_i}

0 = ∂E/∂β = 2 ∑_i ( αe^{βx_i} − y_i ) e^{βx_i} (α x_i)

This is a non-linear pair of equations for α and β and so must be solved with fancy
methods (e.g. Newton-Raphson).

We expect the solution for α, β to be similar (but not identical) to the solution found using
method 1.

7.7.5 Applications

◮ Smoothing. Experimental data often contains an element of noise. If you happen
  to know the expected shape of the function then this noise can be smoothed out
  by applying a least squares method.

◮ Function estimation. Equations generated by least squares methods can be
  used to evaluate the function at points not in the table.

◮ Parameter estimation. Sometimes it is the parameters in the function that
  are important (e.g. the slope of the function) rather than the specific values of the
  function.

7.7.6 Notes

◮ Avoid non-linear least squares wherever possible.

◮ Use low order polynomials. This minimises the problems of ill-conditioning. It
  also reduces the effect of spurious wiggles in the function.

◮ Try using small sub-sets of data near the target point. This gives the least squares
  method a chance at getting a good fit to the data, but at the risk of throwing out
  important information about the function.

◮ The method is called least squares because we chose the residual to be the sum of
  the squares of the errors.

◮ There are other choices for residuals, such as E = ∑ |ỹ(x_i) − y_i|, but such schemes
  are not least squares methods.

◮ The values of parameters estimated from these alternative methods should be
  comparable to the values given by the least squares method.

7.7.7 Matlab example

The Matlab procedure for polynomial least squares is polyfit. Here is a typical example

x = (0:0.1:5)';       % x from 0 to 5 in steps of 0.1
y = sin(x);           % get y values
p = polyfit(x,y,3);   % fit a cubic to the data
f = polyval(p,x);     % evaluate the cubic on the x data
plot(x,y,'o',x,f,'-') % plot y and its approximation f

8. Extrapolation Methods



8.1 Richardson extrapolation

In trying to estimate a particular number, say L⋆, it is quite common to use algorithms
that generate a sequence of approximations such as L_0, L_1, L_2 · · ·. Each approximation
is an improvement on the previous one.

Richardson extrapolation is a scheme whereby we can use the previously computed values
to provide improved estimates. The key to the method is knowing the exact form of
the error in the approximation at each iteration.

8.1.1 Example – computing π

The circumference of a circle of radius 1 is 2π and we can estimate this by approximating the
circle by a series of straight line segments (the chords). If there are n segments each of
length ∆L then the total length is n∆L. For the first few values of n we can compute
L(n) = n∆L by hand.

    n          2        4             8
    L(n)       4        4√2           8√(2 − √2)
    L(n)       4        ≈ 5.6568542   ≈ 6.1229349
    % error    36%      10%           2.5%

To improve our approximation to 2π we could repeat this calculation for larger values
of n. But this becomes harder and harder. Is there a way in which we can use just the
above data to get a better estimate for 2π? Yes – Richardson extrapolation.

All applications of Richardson extrapolation begin with a formal statement of the error
term in the approximation.

We divided the circle into n equal segments. The angle subtended at the centre of
the circle by one segment will be ∆θ = 2π/n. The length of the chord will be ∆L =
2 sin(∆θ/2) while the length of the arc will be ∆θ. Thus we have

2π = L⋆ ≈ L = 2n sin(π/n)

Suppose n is large, then we can expand the sine function as a Taylor series

L(n) = 2n ( π/n − (1/3!)(π/n)^3 + (1/5!)(π/n)^5 − (1/7!)(π/n)^7 + · · · )
     = L⋆ + a/n^2 + b/n^4 + c/n^6 + · · ·

where a, b, c are constants (that do not depend on n). All terms after L⋆ represent the
error in L. The errors form a power series in 1/n^2 and the leading error term is O(1/n^2).

The trick now is to form linear combinations of L(n), L(2n), L(4n) · · · to successively
knock out the leading order error terms. From the above we have

L(2n) = L⋆ + a/(2n)^2 + b/(2n)^4 + c/(2n)^6 + · · ·

and thus

(4/3) L(2n) − (1/3) L(n) = L⋆ + b′/(2n)^4 + c′/(2n)^6 + · · ·

for some new numbers b′, c′. This approximation converges to L⋆ and has a leading error
term of order 1/n^4. This should be a much better approximation to 2π than L(n).

Define M(n) = (4L(2n) − L(n))/3, then

    n          2           4
    M(n)       6.2091390   6.2782951
    % error    1%          0.08%

We might be tempted to apply this four-thirds, one-third rule once again, hoping to get
a further improvement. But that would be very wrong. Those coefficients were derived
on the basis that the leading error term was O(1/n^2), but for the new series M(n) we
found the leading error term to be O(1/n^4). To eliminate this term we must recompute
the coefficients. This time we find the combination to be (16/15)M(2n) − (1/15)M(n).
Call this Q(n). Then we get

    n          2
    Q(n)       6.2829055
    % error    0.004%

For Q(n) the leading error term is O(1/n^6).
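Here is a minimal Matlab sketch of these two rounds of Richardson extrapolation applied
to L(n) = 2n sin(π/n); the variable names are ours.

L = @(n) 2*n*sin(pi/n);              % chord-length estimate of 2*pi

n  = [2 4 8 16];
Ln = arrayfun(L, n);                 % L(n), error O(1/n^2)

M  = (4*Ln(2:end) - Ln(1:end-1))/3;  % M(n) = (4L(2n) - L(n))/3, error O(1/n^4)
Q  = (16*M(2:end) - M(1:end-1))/15;  % Q(n) = (16M(2n) - M(n))/15, error O(1/n^6)

err = abs([Ln(end) M(end) Q(end)] - 2*pi)   % successive improvement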

8.1.2 Conclusion

Each stage in this process (of eliminating leading order error terms) is one application
of Richardson extrapolation.

The great value in this method is that it can significantly improve the quality of our approximations
at next to no extra computational effort. The key of course is knowing the
exact form of the error series. With this in hand you can form the correct combinations
to kill off successive leading order error terms.

Note that in our example the error series was a power series in 1/n^2. This need not always
be the case; in some cases you may get a power series in 1/n. In either case the logic
developed above still applies.

9. Numerical integration



9.1 Introduction

In first year calculus we learnt that

I(a, b) = ∫_a^b f(x) dx = F(b) − F(a)

where F(x) is the anti-derivative of f(x). What do we do when the anti-derivative is
too hard to compute or when it cannot be expressed in terms of familiar functions like
sin, cos, log etc.? (e.g. try finding the anti-derivative of e^{−x^2}). One option is to try a
numerical method. This will be our focus for this and the next lecture.

The general method will be to estimate I(a, b) by a finite sum of the form

I(a, b) ≈ Ĩ(a, b) = ∑_{j=0}^{N} w_j f(x_j)

where the x_j, w_j are chosen according to some set rule. The main questions will be

◮ How do we choose the x_j and w_j? and, as always,
◮ How does the error |I(a, b) − Ĩ(a, b)| depend on the number of points N (for a
  fixed rule for computing x_j, w_j)?

One point which we should keep in mind is that as we are evaluating f(x) at various
x_j's we may run into a problem if f(x) is not defined or, even worse, singular at one or
more of the x_j's. For the moment we will put such pathological cases aside by assuming
that f(x) is a well behaved function throughout the closed interval [a, b].

All of the choices for x_j, w_j arise by making simple piecewise polynomial approximations
to f(x). Why polynomials? Because they are easy to integrate and, with a suitable
choice of node points, we should be able to get any desired accuracy.

9.2 The Left and Right hand sum rules

This is about as easy as it gets. Here we subdivide [a, b] into N intervals of width
(b − a)/N and we set x_j = a + j(b − a)/N for j = 0, 1, 2, · · · N.

9.2.1 The Left Hand Rule

In each interval x_j ≤ x ≤ x_{j+1} we use the constant approximation f(x) ≈ f̃(x) = f(x_j).
Then

∫_a^b f(x) dx = ∑_{j=0}^{N−1} ∫_{x_j}^{x_{j+1}} f(x) dx
             ≈ ∑_{j=0}^{N−1} ∫_{x_j}^{x_{j+1}} f(x_j) dx
             = ∑_{j=0}^{N−1} f(x_j) (x_{j+1} − x_j)
             = (b − a)/N ∑_{j=0}^{N−1} f(x_j)

As a test case we might choose I = ∫_0^1 x^4 dx = 1/5. And here are the results.

    N     Ĩ(0, 1)     I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     9.570E-02   2.000E-01   1.043E-01
    8     1.427E-01   2.000E-01   5.730E-02        1.820E+00
    16    1.701E-01   2.000E-01   2.995E-02        1.913E+00
    32    1.847E-01   2.000E-01   1.530E-02        1.957E+00
    64    1.923E-01   2.000E-01   7.731E-03        1.979E+00
    128   1.961E-01   2.000E-01   3.886E-03        1.990E+00
    256   1.981E-01   2.000E-01   1.948E-03        1.995E+00
    512   1.990E-01   2.000E-01   9.753E-04        1.997E+00

Notice that the numbers in the last column are close to 2. This suggests that each time
we double N we halve the error, that is E = O(1/N). The method converges, but too
slowly to be of any practical use.

Note that in the above table (and all others that follow in this chapter) we use n to record
the number of points and N to record the number of intervals. Clearly n = N + 1.
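A minimal Matlab sketch of the left hand rule for this test case (function and variable
names are ours):

f = @(x) x.^4;                  % test integrand on [0, 1]
a = 0;  b = 1;  N = 64;         % N intervals

xj = a + (0:N-1)*(b - a)/N;     % left hand end points x_0 ... x_{N-1}
I_left = (b - a)/N * sum(f(xj));

err = abs(I_left - 1/5)         % compare with the exact value 1/5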

9.2.2 The Right Hand Rule

In the same way we could approximate the integrand by its value at the right hand edge
of the interval, f(x) ≈ f̃(x) = f(x_{j+1}). This leads to

∫_a^b f(x) dx ≈ (b − a)/N ∑_{j=1}^{N} f(x_j)

with the following results

    N     Ĩ(0, 1)     I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     3.457E-01   2.000E-01   1.457E-01
    8     2.677E-01   2.000E-01   6.770E-02        2.152E+00
    16    2.326E-01   2.000E-01   3.255E-02        2.080E+00
    32    2.160E-01   2.000E-01   1.595E-02        2.041E+00
    64    2.079E-01   2.000E-01   7.894E-03        2.021E+00
    128   2.039E-01   2.000E-01   3.927E-03        2.010E+00
    256   2.020E-01   2.000E-01   1.958E-03        2.005E+00
    512   2.010E-01   2.000E-01   9.778E-04        2.003E+00

Clearly this scheme is also O(1/N) accurate.

9.2.3 The Mid Point rule

This again uses a constant approximation for f(x) in each interval. This time we choose
the mid-point (surprise!) of the interval, f(x) ≈ f((x_j + x_{j+1})/2). Doing the integration
as before leads to

∫_a^b f(x) dx ≈ (b − a)/N ∑_{j=0}^{N−1} f( (x_j + x_{j+1})/2 )

with the following results

    N     Ĩ(0, 1)     I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     1.897E-01   2.000E-01   1.030E-02
    8     1.974E-01   2.000E-01   2.597E-03        3.967E+00
    16    1.993E-01   2.000E-01   6.506E-04        3.992E+00
    32    1.998E-01   2.000E-01   1.627E-04        3.998E+00
    64    2.000E-01   2.000E-01   4.069E-05        3.999E+00
    128   2.000E-01   2.000E-01   1.017E-05        4.000E+00
    256   2.000E-01   2.000E-01   2.543E-06        4.000E+00
    512   2.000E-01   2.000E-01   6.358E-07        4.000E+00

Now we see something nice – the error is reduced by a factor of four every time we
double N. This means the error varies as E = O(1/N^2). This is a worthwhile change.
For no extra computational effort we have got a much better approximation.

But why? Simple sketches of f(x) and f̃(x) for the Left Hand Sum reveal that f̃(x)
consistently underestimates f(x) while the Right Hand Sum produces an overestimate.
The mid-point rule on the other hand will have a small overestimate and underestimate
in each interval. These will cancel each other out, thus improving the approximation.
Of course all of these statements can be made mathematically rigorous by explicitly
accounting for the error between f(x) and f̃(x).

9.3 The Trapezoidal rule

This is similar to the three previous methods except that here we use a straight line
approximation for f(x) in the interval [x_j, x_{j+1}]. That is, we use

f(x) ≈ f̃(x) = ( f(x_j)(x_{j+1} − x) + f(x_{j+1})(x − x_j) ) / (x_{j+1} − x_j)

The integration may be slightly more tricky than before (not much) but it can be done
and this is the result

I(a, b) ≈ Ĩ(a, b) = (b − a)/N ( (1/2)(f(x_0) + f(x_N)) + ∑_{j=1}^{N−1} f(x_j) )

And here are the numerical results,

    N     Ĩ(0, 1)     I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     2.207E-01   2.000E-01   2.070E-02
    8     2.052E-01   2.000E-01   5.200E-03        3.981E+00
    16    2.013E-01   2.000E-01   1.302E-03        3.995E+00
    32    2.003E-01   2.000E-01   3.255E-04        3.999E+00
    64    2.001E-01   2.000E-01   8.138E-05        4.000E+00
    128   2.000E-01   2.000E-01   2.034E-05        4.000E+00
    256   2.000E-01   2.000E-01   5.086E-06        4.000E+00
    512   2.000E-01   2.000E-01   1.272E-06        4.000E+00

Surprise, surprise, it also has E(N) = O(1/N^2). This is very good.

9.3.1 Choices, choices, so many choices

Both the mid-point and trapezoidal rules have O(1/N^2) accuracy. Which should we
choose? There is one very good reason for choosing the Trapezoidal rule. Consider two
instances of applying the Trapezoidal rule, once with N + 1 points and once with 2N + 1
points. The x_j's in the first integration also appear as every second point in the second
integration. Thus the f(x) values computed in the first integration can be saved and
reused in the second integration. In this way we can avoid duplicating our efforts as we
compute successive approximations for N = 2, 4, 8, 16, 32 · · ·. This is not possible with
the mid-point rule as there are no x_j's shared by both I_N and I_{2N}.

In fact it is easy to show that if I_N is the Trapezoidal approximation with N + 1 points
then

I_{2N}(a, b) = (1/2) I_N(a, b) + (b − a)/(2N) ∑_{j=0}^{N−1} f(x_{2j+1})

where x_j = a + j(b − a)/(2N). The sum on the right contains just the new f's not
previously seen in the lower values of N. This is a very efficient way of computing the
successive I_N(a, b).
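A minimal Matlab sketch of this idea (variable names ours): each pass doubles N and
reuses the previous trapezoidal estimate, so only the new points are evaluated.

f = @(x) x.^4;  a = 0;  b = 1;

N = 1;
I = (b - a)*(f(a) + f(b))/2;          % trapezoidal rule with a single interval
for pass = 1:8
    xnew = a + (2*(0:N-1)+1)*(b - a)/(2*N);   % the new (odd-indexed) points
    I = I/2 + (b - a)/(2*N) * sum(f(xnew));   % I_{2N} from I_N plus new values only
    N = 2*N;
    fprintf('N = %4d   I = %.8f\n', N, I);
end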

9.4 Simpson's rule and Romberg integration

It seems reasonable to explore what comes of choosing higher order interpolations for
the integrand than the one we used for the Trapezoidal rule. Doing so should produce, for
a given number of grid points, a better approximation to the integral.

We will start the ball rolling with Simpson's rule, which will give us an insight into how
we can automate the process of producing successively higher order approximations. The
result will be a very powerful algorithm known as Romberg integration. It is nothing more
than an elegant combination of the Trapezoidal rule with Richardson extrapolation.

9.5 Simpson's rule

This is the next step beyond the Trapezoidal rule – here we use a piecewise quadratic to
approximate f(x). Note that since a quadratic requires three data points, we build each
f̃ over successive pairs of intervals, e.g. [x_{j−1}, x_j] and [x_j, x_{j+1}]. Thus for Simpson's rule
we must choose N to be an even integer.

It's a bit messy, but the quadratic approximation f̃(x) to f(x) in the interval [x_{j−1}, x_{j+1}]
is given by the 2nd order Lagrange polynomial

f(x) ≈ f̃(x) = f(x_{j+1}) (x − x_{j−1})(x − x_j) / ( (x_{j+1} − x_{j−1})(x_{j+1} − x_j) )
            + f(x_j) (x − x_{j+1})(x − x_{j−1}) / ( (x_j − x_{j+1})(x_j − x_{j−1}) )
            + f(x_{j−1}) (x − x_{j+1})(x − x_j) / ( (x_{j−1} − x_{j+1})(x_{j−1} − x_j) )

We now use this to form our estimate of the integral,

I(a, b) = ∫_a^b f(x) dx = ∑_j ∫_{x_{j−1}}^{x_{j+1}} f(x) dx ≈ ∑_j ∫_{x_{j−1}}^{x_{j+1}} f̃(x) dx
        = (2/3) (b − a)/(2N) ( f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + · · · + 2f_{N−2} + 4f_{N−1} + f_N )

where f_j = f(x_j) and both sums over j use only the odd integers j = 1, 3, 5, · · · N − 1.
Note the alternating pattern of 4's and 2's.
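A minimal Matlab sketch of the composite Simpson rule (names ours):

f = @(x) x.^4;  a = 0;  b = 1;  N = 16;   % N must be even

x = a + (0:N)*(b - a)/N;                  % the N+1 grid points
w = 2*ones(1, N+1);                       % weight pattern 1 4 2 4 ... 2 4 1
w(2:2:N) = 4;
w([1 N+1]) = 1;

I_simp = (b - a)/(3*N) * sum(w .* f(x));
err = abs(I_simp - 1/5)                   % expect roughly 2e-6 for N = 16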

9.5.1 Example

Returning to our familiar test case I = ∫_0^1 x^4 dx, we find the following results

    N     Ĩ(0, 1)       I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     2.00521E-01   2.000E-01   5.208E-04
    8     2.00033E-01   2.000E-01   3.255E-05        1.600E+01
    16    2.00002E-01   2.000E-01   2.035E-06        1.600E+01
    32    2.00000E-01   2.000E-01   1.272E-07        1.600E+01
    64    2.00000E-01   2.000E-01   7.947E-09        1.600E+01
    128   2.00000E-01   2.000E-01   4.967E-10        1.600E+01
    256   2.00000E-01   2.000E-01   3.104E-11        1.600E+01
    512   2.00000E-01   2.000E-01   1.940E-12        1.600E+01

This time we find a 16-fold improvement in the error each time we double N. Thus we
see that E = O(1/N^4). This is good, but we can do even better!

Exercise. Let T(N), S(N) be the Trapezoidal and Simpson's rule as defined above.
Show that S(2N) = (4/3)T(2N) − (1/3)T(N).

9.6 Romberg integration

The coefficients 4/3 and 1/3 found in the previous exercise remind us of Richardson
extrapolation. In fact for the Trapezoidal rule we can show that

I_N = I + a/N^2 + b/N^4 + c/N^6 + · · ·

thus we are in a position to apply a number of Richardson extrapolations starting from just
a table of Trapezoidal approximations. We could start by assembling the Trapezoidal
approximations into a column and then use Richardson extrapolation to generate
further columns to the right. We will define R(N, j) to be the result of applying j rounds
of Richardson extrapolation having started from R(N, 0).

In a similar fashion we can define E(N, j) to be the error in R(N, j). From the above
error formula we can easily see that E(N, j) = O(1/N^{2(j+1)}). Now we can apply one
level of Richardson extrapolation; the result is

R(2N, j + 1) = ( 4^{j+1} R(2N, j) − R(N, j) ) / ( 4^{j+1} − 1 )

This is a recursive formula – it allows us to calculate the successive columns in the R(N, j)
table. We start the computation by setting the first column to the Trapezoidal data,
R(N, 0) = T(N). Then we use the above equation to fill in the remaining columns, one
by one to the right. Each successive column should be more accurate than the previous,
with the error varying as E(N, j) = O(1/N^{2(j+1)}).

This process is called Romberg integration. The last number generated is usually taken
as the best approximation to the integral.
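A minimal Matlab sketch of Romberg integration (names ours). The first column holds
the trapezoidal values R(N, 0); each further column applies one round of Richardson
extrapolation.

f = @(x) 4./(1 + x.^2);  a = 0;  b = 1;     % the test integral used below, I = pi
levels = 7;                                 % N = 2, 4, ..., 2^levels

R = zeros(levels, levels);
for i = 1:levels
    N = 2^i;
    x = a + (0:N)*(b - a)/N;
    R(i,1) = (b - a)/N * (sum(f(x)) - (f(x(1)) + f(x(end)))/2);   % trapezoidal rule
    for j = 1:i-1                            % Richardson extrapolation across the row
        R(i,j+1) = (4^j * R(i,j) - R(i-1,j)) / (4^j - 1);
    end
end
best = R(levels, levels);                    % expect best ~ pi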

9.6.1 Example

Our previous example I = ∫_0^1 x^4 dx provides no challenge for Romberg integration so, in
this instance, we have used I = ∫_0^1 4/(1 + x^2) dx, for which we know the exact answer
to be I = π. Here are the results

    N     R(N, 0)       R(N, 1)       R(N, 2)       R(N, 3)       R(N, 4)
    2     3.00000E+00
    4     3.10000E+00   3.13333E+00
    8     3.13118E+00   3.14157E+00   3.14212E+00
    16    3.13899E+00   3.14159E+00   3.14159E+00   3.14159E+00
    32    3.14094E+00   3.14159E+00   3.14159E+00   3.14159E+00   3.14159E+00
    64    3.14143E+00   3.14159E+00   3.14159E+00   3.14159E+00   3.14159E+00
    128   3.14155E+00   3.14159E+00   3.14159E+00   3.14159E+00   3.14159E+00

We can see that the convergence is very rapid and that our best answer would be I ≈
R(128, 4) = 3.14159265359 (obtained by forcing the program to print more significant
figures). Since we know the exact answer to be I = π we can also compute the errors
E(N, j), and here they are

    N     E(N, 0)     E(N, 1)     E(N, 2)     E(N, 3)     E(N, 4)
    2     1.416E-01
    4     4.159E-02   8.259E-03
    8     1.042E-02   2.403E-05   5.250E-04
    16    2.604E-03   1.511E-07   1.441E-06   6.870E-06
    32    6.510E-04   2.365E-09   7.553E-09   1.519E-08   1.169E-08
    64    1.628E-04   3.696E-11   1.182E-10   2.349E-13   5.982E-11
    128   4.069E-05   5.769E-13   1.848E-12   4.441E-16   4.441E-16

The computer on which these calculations were performed can store no more than about
15 decimal digits in the mantissa. That is, it will always have a round-off error of around
10^{−15} in each computation. Since R(128, 4) ≈ 3.14159 and E(128, 4) ≈ 4 × 10^{−16} we see
that we have hit the level of round-off errors in R(128, 4) and so no further rounds of
Richardson extrapolation on this computer could improve upon our best approximation.

10. Numerical differentiation



10.1 Introduction

If we are given a set of data (x_i, f_i) for a function f(x), how might we estimate the
derivative at a point?

One approach would be to plot the data in a graph and measure the slope of the tangent
line. This is tedious and prone to errors.

Another approach is to construct an interpolation f̃(x) to f(x) and then compute
df/dx ≈ df̃/dx. This works but it is computationally intensive (each time we have
to build the complete polynomial yet we may only be looking for one number, df̃/dx).

A better approach is to use simple generic formulae that can turn a table of data (x_i, f_i)
into a similar table (x_i, df/dx) for the derivative. The technique we will follow is known
as Finite Differences.

10.2 Finite differences

This is a simple method based upon the following Taylor series

f(x + h) = f(x) + (df/dx) h + (d^2f/dx^2) h^2/2! + (d^3f/dx^3) h^3/3! + · · ·     (1)

f(x − h) = f(x) − (df/dx) h + (d^2f/dx^2) h^2/2! − (d^3f/dx^3) h^3/3! + · · ·     (2)

All finite difference approximations can be derived from these and related Taylor series
by taking suitable linear combinations.

Though it is possible to develop finite difference approximations for unevenly spaced
data, we will limit ourselves to equally spaced data.

10.2.1 First derivatives

From the above we can immediately obtain three simple estimates

df/dx ≈ df̃/dx = ( f(x + h) − f(x) ) / h          Forward differences    (3)

df/dx ≈ df̃/dx = ( f(x) − f(x − h) ) / h          Backward differences   (4)

df/dx ≈ df̃/dx = ( f(x + h) − f(x − h) ) / (2h)   Centered differences   (5)

Exercise. Show that each of these finite difference estimates can also be derived by
constructing straight lines through the data (x_i, f_i).
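A minimal Matlab sketch of the three estimates at a single point (test function and
names ours):

f = @(x) exp(x);             % test function with known derivative exp(x)
x = 1;  h = 0.1;

d_fwd = (f(x + h) - f(x)) / h;          % forward differences  (3)
d_bwd = (f(x) - f(x - h)) / h;          % backward differences (4)
d_cen = (f(x + h) - f(x - h)) / (2*h);  % centered differences (5)

errs = abs([d_fwd d_bwd d_cen] - exp(x)) % centered is clearly the most accurate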

Example 1 : forward differences In all of the following examples we will use f(x) =
e^x. Since df/dx = e^x it is easy to compute the error, which we define by err = |df/dx − df̃/dx|.

For the forward differences we obtain, for h = 0.1,

    x        0.00E+00   1.00E-01   2.00E-01   3.00E-01   4.00E-01   5.00E-01
    df̃/dx    1.05E+00   1.16E+00   1.28E+00   1.42E+00   1.57E+00   1.73E+00
    error    5.17E-02   5.71E-02   6.32E-02   6.98E-02   7.71E-02   8.53E-02

There are two questions we might like to ask

◮ How does the error depend on the step length h? and
◮ How does the error depend on the function f(x)?

The first question can be answered by direct calculation,

    h        1.00E+00   1.00E-01   1.00E-02   1.00E-03   1.00E-04
    df̃/dx    4.67E+00   2.86E+00   2.73E+00   2.72E+00   2.72E+00
    error    1.95E+00   1.41E-01   1.36E-02   1.36E-03   1.36E-04
    error/h  1.95E+00   1.41E+00   1.36E+00   1.36E+00   1.36E+00

The last line shows that the error varies as O(h), that is, linearly with h.

The second question is best explored by inspection of the Taylor series. If we retain the
higher order terms in the Taylor series then we find

df/dx = ( f(x + h) − f(x) ) / h + O( h d^2f/dx^2 )

This shows that the error should vary as O(h), a fact we have already seen, and that the
error also varies as O(d^2f/dx^2). This makes perfect sense. Since the forward difference
approximation arises from the approximation of the data by a straight line, the error
should arise from the failure of a straight line to approximate the data. This will occur
in the quadratic and higher terms, hence the appearance of the second derivative.

Example 2 : centred differences We can repeat the above computations but this
time using centred finite differences,

    x        0.00E+00   1.00E-01   2.00E-01   3.00E-01   4.00E-01   5.00E-01
    df̃/dx    1.00E+00   1.11E+00   1.22E+00   1.35E+00   1.49E+00   1.65E+00
    error    1.67E-03   1.84E-03   2.04E-03   2.25E-03   2.49E-03   2.75E-03

Note how the error now is more than ten times smaller than that for forward differences.

    h          1.00E+00   1.00E-01   1.00E-02   1.00E-03   1.00E-04
    df̃/dx      3.19E+00   2.72E+00   2.72E+00   2.72E+00   2.72E+00
    error      4.76E-01   4.53E-03   4.53E-05   4.53E-07   4.53E-09
    error/h^2  4.76E-01   4.53E-01   4.53E-01   4.53E-01   4.53E-01

This time we observe that the error decreases as O(h^2), that is, if we reduce h by a factor
of 10 then the error will be reduced by a factor of 100. This is a considerable advantage
over forward differences.

Exercise. Confirm this result by retaining the higher order terms in the Taylor series.

10.2.2 Second derivatives

To estimate the second derivative we once again use the Taylor series expansions. Using
equations (1) and (2) we can easily show that

d^2f/dx^2 ≈ ( f(x + h) − 2f(x) + f(x − h) ) / h^2 + O( h^2 f^(iv) )     (6)

Exercise. Confirm this result by retaining the higher order terms in the Taylor series.

Example Using our standard test function f(x) = e^x, we find the following estimates
for d^2f/dx^2 over 0 ≤ x ≤ 0.5 and with h = 0.1

    x            0.00E+00   1.00E-01   2.00E-01   3.00E-01   4.00E-01   5.00E-01
    d^2f̃/dx^2    1.00E+00   1.11E+00   1.22E+00   1.35E+00   1.49E+00   1.65E+00
    error        8.34E-04   9.21E-04   1.02E-03   1.13E-03   1.24E-03   1.37E-03

Setting x = 1 and varying h we find the following estimates for the error in the centered
finite difference approximation

    h            1.00E-01   1.00E-02   1.00E-03   1.00E-04   1.00E-05
    d^2f̃/dx^2    2.72E+00   2.72E+00   2.72E+00   2.72E+00   2.72E+00
    error        2.27E-03   2.27E-05   2.27E-07   3.78E-08   5.99E-06
    error/h^2    2.27E-01   2.27E-01   2.27E-01   3.78E+00   5.99E+04

Once again, this shows that for centred differences we have an error that varies as O(h^2).
This is good!

Notice also how the error suddenly becomes very large when h is reduced to very small
values. This is an example of how round-off errors can seriously affect the quality of the
computations.

Why does this occur in this computation? Because for very small values of h all three
numbers f(x + h), f(x), f(x − h) will be almost equal and thus there will be a significant
loss of accuracy in computing f(x + h) − 2f(x) + f(x − h). This point will be explored
in more detail in the next lecture.

The upshot of this example is that the accuracy of a numerical derivative need not
improve with every reduction in the step length h and thus there will be an optimal
value for h. This point will also be explored in the next lecture.

10.3 Truncation and round-off errors

In any calculation there will exist both truncation and round-off errors and, depending on
the nature of the computations, one or the other may dominate. It may also be possible
to examine the conditions under which their combined effect can be minimised. It's wise
to do so if this option is available.

10.3.1 Example – Forward finite differences

For a smooth function y(x) we can approximate its first derivative by taking forward
differences

dy/dx ≈ dỹ/dx = ( y(x + h) − y(x) ) / h

We first ask – what is the likely total error, from both round-off and truncation errors, in
the estimate given by the right hand side?

First, the round off error. Let us define a quantity P by

P = ( y(x + h) − y(x) ) / h

This P will later be our approximation to the first derivative, but for the moment think
of this as just a simple function and suppose that we are exploring just the round-off errors in
computing P.

We now ask: How will the round off error enter into this calculation? As we are
subtracting two nearly equal numbers (since h is meant to be small) we can expect a
significant round off error as we let h → 0. We will follow the method developed when
looking at round-off errors for division. Put

P = P̃ + E_R(P)

We seek an expression for E_R(P) in terms of h and y and possibly their respective round
off errors. Put

y(x + h) − y(x) = ˜(y(x + h) − y(x)) + E_R( y(x + h) − y(x) )
h = h̃ + E_R(h) = h̃ + 10^{−N} O(h)

When h is very small we have

y(x + h) ≈ y(x)

which allows us to write (recall the much earlier discussion on round-off errors for subtraction)

E_R( y(x + h) − y(x) ) = 10^{−N} O(y(x))

Substitute all of this back into the original equation for P to obtain

P = ( ˜(y(x + h) − y(x)) + 10^{−N} O(y(x)) ) / ( h̃ + 10^{−N} O(h) )

and since N is (usually) at least 15 we can expand this as a power series in 10^{−N} and
retain just the leading terms. The result is

P = P̃ + 10^{−N} O(y)/h̃ − 10^{−N} O(h) P̃/h̃ + · · ·

where P̃ = ˜(y(x + h) − y(x))/h̃ is the computer's estimate for P. From this we deduce
that

E_R(P) = 10^{−N} O(y)/h̃ − 10^{−N} O(h) P̃/h̃

How does E_R(P) behave as h̃ → 0? First we expect P̃ ≈ P for some range of small values
of h. Thus the second term should remain approximately constant as h̃ is reduced.
In contrast the first term will diverge for ever reducing values of h̃. Thus E_R(P) is
dominated by its first term, and we are correct in writing, for small h,

E_R(P) = 10^{−N} O(y)/h

Now we turn to the job of estimating the truncation error. This is much easier than
computing the round off error. We start with a standard Taylor series

y(x + h) = y(x) + h dy/dx + (h^2/2!) d^2y/dx^2 + · · ·

then

( y(x + h) − y(x) ) / h = dy/dx + (h/2!) d^2y/dx^2 + · · ·

from which we conclude that

E_T = (h/2!) d^2y/dx^2

where (as is usual in this game) we have discarded all higher order terms (we are only
after an estimate of the size of the error, not its exact value).

The total error is E = E_T + E_R,

E(h) = hA + 10^{−N} B/h

where we have written A = (1/2) d^2y/dx^2 and B = O(y). Note that neither A nor B
depends on h. Our aim is to find the best estimate for the derivative. This in turn
requires us to choose h so that the total error is a minimum. Thus we set dE/dh to
zero,

0 = dE/dh = A − 10^{−N} B/h^2

Solving this for h > 0 we find

h = 10^{−N/2} (B/A)^{1/2}

which leads to

E_T = E_R = 10^{−N/2} (AB)^{1/2}

What do we learn from all this? We have just found, for the best choice of h, that
E_R = 10^{−N/2} (AB)^{1/2}. If we can turn this into the standard form

E_R(P) = 10^{−Q} O(P)

then we could say that E_R(P) has Q decimal digits of accuracy.

Notice that A, B and P are all finite and they do not depend on N. Thus (looking back
at the definitions for A and B) we find (AB)^{1/2} = O(P) and thus

E_R(P) = 10^{−N/2} O(P)

So finally we conclude that the best we can ever hope for in using (y(x + h) − y(x))/h as
an approximation to dy/dx will be to get at most N/2 digits of accuracy on an N-digit
computer.

There are two important points to note

◮ The optimal choice of step length is not zero! You can verify this by looking at
  the table of results in the lecture on finite differences.

◮ Even though we can store up to N decimal digits of accuracy, we see that this
  algorithm, at best, will produce estimates with only N/2 decimal digits of accuracy
  (i.e. 7 digits with N = 14).

10.3.2 Example – Centred finite differences

In this case we have

dy/dx ≈ dỹ/dx = ( y(x + h) − y(x − h) ) / (2h)

and we follow the same arguments as before to arrive at

E_R + E_T = 10^{−N} O(y)/h + h^2 O( d^3y/dx^3 )

Again we wish to set this to be a minimum for a suitable choice of h. Setting the
derivative equal to zero and solving for h gives us

h = O( 10^{−N/3} )

E_R = O( 10^{−2N/3} ) ,    E_T = O( 10^{−2N/3} )

Now we see that

◮ The optimal step length for centred differences is larger than that for forward
  differences, and

◮ We now have 2N/3 digits of accuracy (i.e. 10 digits with N = 15). This is consistent
  with our previous empirical observations that centred differences gave far better
  answers than forward differences.

11. Numerical Solutions of Ordinary Differential Equations



11.1 Introduction

If somebody asked you to evaluate y(b) given that y(x) is the solution of

dy/dx = f(x)

subject to y(a) = 0, you could simply evaluate the definite integral

y(b) = ∫_a^b f(x) dx

But how would you answer the similar question for the differential equation

dy/dx = f(x, y)?

This is a much harder problem and requires a wider range of numerical techniques than
those we developed for definite integrals. This will be the purpose of the next few
lectures – to develop suitable integration schemes for ordinary differential equations.

11.2 Initial value problems

In finding solutions of ordinary differential equations of the form

dy/dx = f(x, y) ,    y(a) = y_a

we usually use a technique that marches the solution through increasing values of x
having started from x = a. It's not surprising then that these are known as initial value
problems (often abbreviated as IVPs).

We begin by subdividing the x-axis into equal step lengths h with x_j = a + jh for
j = 0, 1, 2, · · ·. The numerical estimates for the solution y(x_j) will be denoted by y_j. All
of the following schemes provide a way to generate the y_j in succession starting with y_0.

11.3 Euler's method

This is the simplest integration scheme of all for IVPs. It can be derived by making the
simple forward difference approximation to the derivative

(dy/dx)_j ≈ ( y_{j+1} − y_j ) / ( x_{j+1} − x_j ) = (1/h)( y_{j+1} − y_j )

This then gives us

y_{j+1} = y_j + h f(x_j, y_j)

We start at j = 0 with y = y_0 and x = x_0 and compute successive values for y_j. If we
are lucky the y_j will be good approximations to y(x_j). The accuracy will depend on a

number of factors, such as the size of the step length h, the number of steps and the
character of the solution y(x).
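A minimal Matlab sketch of Euler's method for the test problem used below, dy/dx = −y
with y(0) = 1 (names ours):

f  = @(x, y) -y;                 % right hand side of the IVP
h  = 0.1;  x0 = 0;  y0 = 1;      % step length and initial condition
M  = 50;                         % number of steps

x = x0 + (0:M)*h;
y = zeros(1, M+1);  y(1) = y0;
for j = 1:M
    y(j+1) = y(j) + h*f(x(j), y(j));   % Euler step y_{j+1} = y_j + h f(x_j, y_j)
end
plot(x, y, 'o-', x, exp(-x), '-')      % compare with the exact solution e^{-x}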

Consider the simple case where f(x, y) = −y with y(0) = 1. The exact
solution is y(x) = e^{−x}. Here are the results for four choices of step length, h = 0.1, 0.5, 1.5
and 2.5.

[Figure: "Numerical integration using Euler's method" – four panels showing the numerical
solution for h = 0.1, 0.5, 1.5 and 2.5.]

We see that for small values of the step length the scheme gives good, accurate answers. But
for larger values we lose accuracy, and even stability! We can see oscillations
developing in the solution for h = 1.5 but they do appear to die away. However at h = 2.5
the oscillations grow without bound. This is our first lesson – numerical integration of
ODEs may lead to unstable schemes for a poor choice of step length.

It is not hard to see why this occurs. The Euler scheme for this IVP is just

y_{j+1} = y_j − h y_j = y_j (1 − h) = y_0 (1 − h)^{j+1}

We know that the correct solution has y → 0 as x → ∞. This will occur in our Euler
scheme provided |1 − h| < 1, which gives 0 < h < 2. So it's no surprise that the successive
y_j diverged when h = 2.5.

Fortunately, in this case, there is a simple fix. Had we chosen a backward finite difference<br />

( ) dy<br />

≈ y j − y j−1<br />

= 1 dx x j − x j−1 h (y j+1 − y j )<br />

j<br />


This then gives us

  y_j = y_{j−1} − h y_j

which we then solve for y_j in terms of y_{j−1}. Schemes such as this, where the new value of y appears on both sides of the equation, are known as implicit schemes. In contrast, schemes which provide the next value for y only on the left hand side of the equation are known as explicit schemes. The forward difference Euler scheme is an explicit scheme. Implicit schemes are usually harder to apply, but they are usually stable for a much wider range of step lengths than explicit schemes.
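For this test problem the implicit update can be solved by hand, so the backward scheme costs no more than the forward one. A minimal Matlab sketch (our own illustration, not from the notes):

% Backward (implicit) Euler for dy/dx = -y, y(0) = 1  -- minimal sketch
% The implicit update y_j = y_{j-1} - h*y_j rearranges to y_j = y_{j-1}/(1 + h).
h = 2.5;  n = 8;        % a step length that made the explicit scheme blow up
x = 0;  y = 1;
for j = 1 : n
    y = y/(1 + h);      % exact solution of the implicit update
    x = x + h;
end
disp([x y]);            % the numerical solution decays, with no oscillations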

[Figure: Stability of implicit integration. Four panels show the backward (implicit) Euler solution y against x for step lengths h = 0.1, 0.5, 1.5 and 2.5; the solutions remain bounded for all four step lengths.]

11.4 Improved Euler scheme

The basic Euler scheme presented in the previous lecture served as a very simple model for numerical integration of a first order ordinary differential equation. Its simplicity is appealing but that feature is also the source of its main weakness – it often gives poor answers (i.e. large errors or even unstable integrations). We need a more reliable integration scheme, one which has better accuracy and stability properties than the Euler scheme.

Once again we start with the ODE

  dy/dx = f(x, y)


and ask ourselves how we can convert this ODE into an algebraic equation involving successive values of y_j. We recall that approximating a derivative at the centre of an interval gave far better answers than estimating it at either end of the interval (centred differences versus forward or backward differences). So can we compute dy/dx at the mid-point of [x_j, x_{j+1}]? Here is one method,

  (dy/dx)_{j+1/2} = (1/2) [ (dy/dx)_j + (dy/dx)_{j+1} ] = (1/2) [ f(x_j, y_j) + f(x_{j+1}, y_{j+1}) ]

But we could also write

  (dy/dx)_{j+1/2} = (y_{j+1} − y_j)/(x_{j+1} − x_j) = (y_{j+1} − y_j)/h

and combining these we get

  (y_{j+1} − y_j)/h = (1/2) [ f(x_j, y_j) + f(x_{j+1}, y_{j+1}) ]

Okay, we have an algebraic equation but it is an implicit scheme (y_{j+1} appears more than once in this equation). To use this as it stands would require a root finding method for each step in the integration. This would be exceedingly slow. We need another trick to make this practical.

For the y_{j+1} on the right hand side we can make the simple Euler approximation

  y_{j+1} ≈ y_j + h f(x_j, y_j)

then we compute the true (well, a better estimate of) y_{j+1} using our fancy scheme above. This gives us the Improved Euler Scheme (sometimes also called the Modified Euler Scheme).

Here is the scheme,

  ỹ_{j+1} = y_j + h f(x_j, y_j)
  y_{j+1} = y_j + (h/2) [ f(x_j, y_j) + f(x_{j+1}, ỹ_{j+1}) ]
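In Matlab the predictor–corrector pair is only two lines inside the loop. A minimal sketch (our own illustration, using the same test problem as before):

% Improved (modified) Euler for dy/dx = f(x,y)  -- minimal sketch
f = @(x,y) -y;
h = 0.1;  n = 50;
x = 0;  y = 1;
for j = 1 : n
    ypred = y + h*f(x,y);                      % predictor: simple Euler estimate
    y = y + (h/2)*( f(x,y) + f(x+h,ypred) );   % corrector: improved Euler update
    x = x + h;
end
disp([x y exp(-x)]);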

How well does it work? Take the same example as before, dy/dx = −y with y(0) = 1. Here are the results, showing the errors E_j = y_j − y(x_j).


[Figure: Errors in Improved Euler. Four panels show the errors against x for step lengths h = 0.1, 0.5, 1.5 and 2.5; the errors for h = 2.5 grow without bound.]

We see that once again the numerical errors grow without bound for h = 2.5. This is not much better than what we had with the Euler method. But in both cases a step length as large as h = 2.5 for a problem such as this is clearly crazy. The single thing to learn from these examples is that the choice of step length can have a significant effect on the accuracy and stability of the integration.

We could at this point delve into the formal calculation of the error in the approximations y_j and how that error depends on the step length. However we'll defer that until we've had a look at some other schemes.

11.5 Taylor series method

This is a very nice way to generate a whole family of integration schemes.

We start with our friend

  dy/dx = f(x, y)

and then we recall that a Taylor series for y(x) is

  y(x + h) = y(x) + h dy/dx + (h²/2) d²y/dx² + (h³/6) d³y/dx³ + · · ·


This gives us the option of replacing all the derivatives on the right with f(x, y) and its derivatives. Thus we get

  y(x + h) = y(x) + h f(x, y) + (h²/2) ( ∂f/∂x + f(x, y) ∂f/∂y ) + · · ·

This is very elegant and it gives us a ready handle on the nature of the error, roughly the first term we left off in the tail of the series.
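For the example used below, f(x, y) = −xy, the partial derivatives are easy to work out by hand: f_x + f f_y = −y + x²y. A minimal Matlab sketch of the resulting second order Taylor scheme (our own illustration):

% Second order Taylor series scheme for dy/dx = -x*y, y(0) = 1  -- minimal sketch
% Here f = -x*y, so f_x + f*f_y = -y + x^2*y (computed by hand).
h = 0.1;  n = 50;
x = 0;  y = 1;
for j = 1 : n
    y = y + h*(-x*y) + (h^2/2)*(-y + x^2*y);
    x = x + h;
end
disp([x y exp(-x^2/2)]);   % compare with the exact solution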

Here are the results for h = 0.1 and h = 0.5 for a new example, f(x, y) = −xy with y(0) = 1. The exact solution is y(x) = exp(−x²/2) and we plot just the errors y_j − y(x_j).

[Figure: Errors in Taylor series integration. Two panels show the errors y_j − y(x_j) against x for step lengths h = 0.1 and h = 0.5.]

11.6 Runge-Kutta schemes

Designing integration schemes around Taylor series seems like a great idea – it's elegant, systematic and we get, at no cost, an estimate of the likely error in our solutions. But there's a catch (there's always a catch).

The big problem with the Taylor series approach is that it can be very tedious to evaluate the higher derivatives of f(x, y). For example, try computing d³f(x, y)/dx³ – it's a mess. The objective with Runge-Kutta methods is to mimic the Taylor series but without computing the derivatives of f(x, y).


11.6.1 Second Order Runge-Kutta

Our objective here is to build a scheme which does not involve any derivatives of f(x, y), yet matches the Taylor series schemes up to and including the order h² terms.

We look to the Improved Euler scheme (another derivative-free scheme) for inspiration. That scheme can be cast in the form

  k_1 = h f(x_j, y_j)                          (1)
  k_2 = h f(x_j + αh, y_j + βk_1)              (2)
  y_{j+1} = y_j + a k_1 + b k_2                (3)

with 1 = α = β = 2a = 2b.

Exercise. Show that this is indeed the Improved Euler scheme.

Why do we cast it in this odd form? Because it gives us flexibility. We are free to choose the four parameters a, b, α, β to be any set of numbers provided the scheme remains second order. That's our objective for now – find the constraints on a, b, α, β so that the scheme is second order.

We begin by writing down the second order Taylor series scheme evaluated at x_j

  y_{j+1} = y_j + h f_j + (h²/2) (f_x + f f_y)_j + O(h³)        (4)

where the x, y subscripts now denote partial derivatives.

Next we use a (different) Taylor series (a Taylor series for a function of two variables) to expand k_2 around x = x_j, y = y_j,

  k_2 = h ( f_j + αh (f_x)_j + βh (f f_y)_j + O(h²) )

Exercise. Why did we stop at O(h²) in this expansion?

Now combine this with the previous equations (1,3) to get

  y_{j+1} = y_j + a h f_j + b h ( f_j + αh (f_x)_j + βh (f f_y)_j ) + O(h³)

We want this to be exactly the same as the Taylor series. So we compare the various terms and demand that

  a + b = 1,    1 = 2αb = 2βb

We have found three constraints amongst our four parameters α, β, a and b. Clearly we are free to choose one of the parameters and thus we get a whole family of integration schemes, all of which are 2nd order accurate.


Examples

◮ Euler. This is simply a = 1, b = 0.

◮ Improved Euler. This corresponds to setting 1 = α = β = 2a = 2b.

◮ Mid-point rule. For this we set 1/2 = α = β, 0 = a, 1 = b, for which we get

  y_{j+1} = y_j + h f( x_j + h/2, y_j + (h/2) f_j )

Exercise. Why do you think this is called the mid-point rule? (Too easy really.)

11.6.2 Fourth Order Runge-Kutta

This is a simple extension of the above ideas – this time we force the numerical scheme to match the Taylor series up to and including the order h⁴ terms. It's a lengthy calculation, but nothing that we haven't already seen. As with the 2nd order Runge-Kutta we get a whole family of schemes. One of the most popular is

  y_{j+1} = y_j + (1/6) (k_1 + 2k_2 + 2k_3 + k_4)
  k_1 = h f(x_j, y_j)
  k_2 = h f(x_j + h/2, y_j + k_1/2)
  k_3 = h f(x_j + h/2, y_j + k_2/2)
  k_4 = h f(x_j + h, y_j + k_3)

This is not the only choice, but it is the most common choice.

We know that our numerical integration schemes aren’t prefect, they have errors and<br />

they can be unstable. We will look at stability a little later on but for the moment let’s<br />

focus on the errors.<br />


11.7.1 Discretization errors

There are a few things we might like to know about our integration schemes, such as what the error is at a particular stage in the integration and, further, how that error might grow with successive steps. For this we define two types of error.

◮ Global discretization error
This is defined by E_j = y(x_j) − y_j and is equal to the error at a specific x-value. It is principally made up of the accumulated local discretization errors (plus a small component of round-off errors).

◮ Local discretization error
This is defined by e_j = E_{j+1} − E_j, which is the error introduced in one step of the integration.

What we shall find is that if e_j = O(h^(n+1)) for some n then, generally, E_j = O(h^n).

The starting point for the formal error analysis is, as always, a Taylor series. We shall use the Improved Euler Scheme as an example.

11.7.2 Local discretization error

Suppose (in a dream) that our integration up to x_j is perfect, that there is no error in y_j. That is, y_j = y(x_j) and thus E_j = 0. Then the local truncation error will be given by e_j = E_{j+1} = y(x_{j+1}) − y_{j+1}, and this we can calculate.

We start with the Taylor series of the exact solution

  y(x_{j+1}) = y(x_j) + h (dy/dx)_j + (h²/2) (d²y/dx²)_j + (h³/6) (d³y/dx³)_j + O(h⁴)
             = y(x_j) + h f_j + (h²/2) (f_x + f f_y)_j + (h³/6) (d²f/dx²)_j + O(h⁴)

and a similar Taylor series for the Improved Euler Scheme

  y_{j+1} = y_j + (h/2) [ f(x_j, y_j) + f(x_j + h, y_j + h f_j) ]
          = y_j + (h/2) [ 2f + h (f_x + f f_y) + (h²/2) d²f/dx² ]_j + O(h⁴)

Now since y_j = y(x_j) we find

  e_j = −(h³/12) (d²f/dx²)_j + O(h⁴)


What do we learn from this? First, the local discretization error is O(h³) and second, the Improved Euler Scheme will be exact if d²f/dx² = 0 for all x, y.

We can also use this calculation to estimate the Global Discretization Error.

11.7.3 Global Discretization Error

We know two things, e_j = E_{j+1} − E_j and e_j = O(h³ f''), from which we can now easily compute E_j.

  E_{j+1} − E_0 = Σ_{k=0}^{j} (E_{k+1} − E_k) = Σ_{k=0}^{j} e_k
               = Σ_{k=0}^{j} [ −(h³/12) (d²f/dx²)_k + O(h⁴) ]
               = h² Σ_{k=0}^{j} [ −(h/12) (d²f/dx²)_k ] + O(h⁴)

Notice that the sum on the right is a Riemann sum for the integral ∫ from x_0 to x_{j+1} of f'' dx, which in turn we estimate using the Mean Value Theorem. Thus

  E_{j+1} − E_0 = O( (h²/12) (d²f/dx²) )

with the right hand side evaluated at some (unknown) point inside [x_0, x_{j+1}].

But since we are given exact initial values (i.e. we are given y = y_0 at x = x_0), E_0 must be zero, so we have our final result

  E_{j+1} = O( (h²/12) (d²f/dx²) )

You might feel uneasy that the right hand side has to be evaluated at some unknown point. But this is not a major sticking point because we never really need to compute a number for the right hand side. Instead we use it only as a formal statement that the error varies as O(h²).


11.7.4 Numerical results

The theory is all well and good but we should also be able to demonstrate these results directly from numerical calculations. This is very easy to do – just propose a problem for which you know the exact solution. Then compare the exact versus the numerical solutions. A piece of cake.

Here are some results for the simple ODE dy/dx = −2xy with y(0) = 1. The exact solution is y(x) = exp(−x²).

  h           y-approx    y-exact     E(h)        E(2h)/E(h)
  5.000E-01   3.750E-01   3.679E-01   7.121E-03
  2.500E-01   3.742E-01   3.679E-01   6.278E-03   1.134E+00
  1.250E-01   3.697E-01   3.679E-01   1.804E-03   3.479E+00
  6.250E-02   3.683E-01   3.679E-01   4.678E-04   3.857E+00
  3.125E-02   3.680E-01   3.679E-01   1.185E-04   3.948E+00
  1.562E-02   3.679E-01   3.679E-01   2.979E-05   3.978E+00
  7.812E-03   3.679E-01   3.679E-01   7.467E-06   3.990E+00

In each row the step length is one half the previous value, and the Global Discretization Errors are evaluated at x = 1. The final column shows that E(h) = O(h²), as expected.
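A table like this is easy to reproduce. Here is a minimal Matlab sketch using the Improved Euler scheme (our own illustration); halving h should divide the global error at x = 1 by roughly 4.

% Global error of the Improved Euler scheme at x = 1 for dy/dx = -2*x*y, y(0) = 1
f = @(x,y) -2*x*y;
for p = 1 : 7
    h = 0.5^p;  n = round(1/h);
    x = 0;  y = 1;
    for j = 1 : n
        ypred = y + h*f(x,y);
        y = y + (h/2)*( f(x,y) + f(x+h,ypred) );
        x = x + h;
    end
    fprintf('%10.3e  %12.5e\n', h, abs(y - exp(-1)));   % step length and E(h)
end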

Exercise. Repeat the above calculations for both the Euler and Mid-point schemes.

11.8 Stability

In deriving the previous results we made two crucial assumptions – that the various Taylor series converged and that they were dominated by their leading terms. What happens when either of these assumptions does not apply? Quite often the integration scheme develops an instability, with an exponential growth in the error in the numerical solution. How can we test for such a situation? The simplest answer is to just run the program and inspect the results. A slightly better approach would be to consider two separate integrations differing only by a small change in the initial conditions. If the two solutions remain close throughout the integration then most likely both integrations are stable.

In the previous section we considered the errors by comparing the numerical solution to the exact solution. In this section we will study (briefly) the difference between two, initially close, numerical solutions.

11.8.1 Example 1

Here is a simple equation,

  dy/dx = −y


for which we might use the Forward Euler scheme,

  y_{j+1} = y_j + h f(x_j, y_j) = y_j − h y_j

Now suppose we generate two solutions, one for y_0 = 1 and one for y_0 = 1 + ε_0 for some small number ε_0. Denote the two solutions by y_j^(1) and y_j^(2). We can track the two solutions by computing ε_j = y_j^(2) − y_j^(1), and from the above equation we find

  ε_{j+1} = (1 − h) ε_j

If the original equations for y_j are to be stable then we could reasonably require that |ε_{j+1}/ε_j| remains bounded for increasing j. Thus we must have

  −1 ≤ 1 − h ≤ 1

from which we find 0 ≤ h ≤ 2.

This same technique can be applied in many other situations.

11.8.2 Example 2

Suppose we had used a Backward Euler scheme; then we would have found

  ε_{j+1} = ε_j / (1 + h)

and the requirement that |ε_{j+1}/ε_j| remains bounded leads to h > 0. That is, this scheme can be expected to be stable for all values of h. It may be inaccurate for large values, but at least it will be stable (according to our definition – there are others).

11.8.3 Example 3

Now consider this equation

  dy/dx = y

Demanding that |ε_{j+1}/ε_j| remains bounded leads to the absurd condition that h < 0 for both the Forward and Backward Euler schemes. We expect the numerical solutions to have exponentially growing errors for any choice of step length.

This is no surprise. Since the exact solution is y(x) = C exp(+x), we see that ε(x + h)/ε(x) = exp(h), which is greater than one for h > 0.

How might we rescue this situation? The hint comes from the above analysis – choose h < 0. That is, we do a backward integration.

How? Start at, say, x = 10, guess a value for y(10) and integrate backwards to x = 0. Then adjust the guess so that your computed value of y_0 matches the given initial condition. This involves a root finding process. You have to do a lot more work, but at least you get a stable scheme.


11.8.4 Example 4

The general solution of

  dy/dx = y + 2e^(−x)

is y(x) = C e^x − e^(−x), where C is a constant of integration.

If y(0) = −1 then we must set C = 0 and the exact solution is y(x) = −e^(−x). But our numerical calculations will always contain some round-off errors. Thus even if we are able to set the initial values exactly, the subsequent calculations may introduce a small round-off error. The effect will be as if we had started with a small but non-zero value for C, and thus the e^x term will eventually dominate the numerical solution. Any forward (i.e. explicit) integration scheme is doomed to fail. The only option is to use a fully backward scheme, such as that used in the previous example.

Here are the results using the Forward and Backward Euler schemes. We know that the exact solution has the property that y → 0 as x → ∞. The numerical solutions, if they are of any use, should preserve this property. In the early stages of the integration both schemes seem to work well, but later on the Forward Euler scheme has clearly run amok, courtesy of the round-off error allowing the e^x term to rear its ugly head.

[Figure: Growth of the dominant term in the general solution. The upper panel shows the numerical solution y, which eventually grows towards 30 by x ≈ 11; the lower panel shows the exact solution y(x) = −e^(−x), which decays to zero.]

12. Optimisation



12.1 Golden Search

Suppose we are given a function f(x) and that we wish to find the points (there may be more than one) at which f is locally minimised. As a guide in developing a strategy for finding such points we begin by asking how we might verify that a point, say x = x⋆, is a minimum point for the function. We would need to show that for two other points, a and b, either side of x⋆ (that is a < x⋆ < b), we have f(a) > f(x⋆) and f(b) > f(x⋆). For this to work properly we would need to choose a and b to be very close to x⋆ (to avoid the possibility of the function wiggling its way below f(x⋆)). We will call this the minimisation condition.

Usually we do not know x⋆ but instead some approximation to x⋆, which we will call c. The game now is to find a scheme where we can iterate on the approximation c such that successive iterations converge to x⋆. The strategy we will use will be to create a triple of numbers a < c < b and to use this as a bracket for x⋆. That is, at each stage in the iteration, we will take c as our current approximation to x⋆, while a and b will be the bounds by which we can assert that the true minimum lies within the interval [a, b].

How do we choose a, b and c? Let's put that aside for the moment and consider a related question: How do we update a, b and c?

The point c will lie (usually) closer to one of a or b. Suppose it happens to be a. Then suppose we introduce a fourth point d such that a < c < d < b. Now we have two overlapping triples, a < c < d and c < d < b, of which only one can satisfy the minimisation condition (this point is crucial!). And it is that triple which we take into the next iteration. Thus we have a new set of values for a, b and c. This completes the update. Notice that the new interval is smaller than the previous interval and thus the successive iterations will close in on x⋆. After a number of iterations we should have a good approximation to x⋆.

There remain two issues: how to choose the initial values for a, b and c, and also how to choose the fourth point d. Consider first the choice of d. It's up to us to invent a reasonable strategy; here is one that works well.

Choose d to lie in the larger of the two intervals [a, c] and [c, b]. Suppose for example that this happens to be the [c, b] interval. Then demand that

  c − a = b − d

which leads to

  d = b − (c − a)

There is actually a small problem with this strategy. If the point c was actually the mid-point of [a, b] then we would have d = c and the new interval would be exactly the same as the original interval. This would be of no use to us and so we need a scheme that avoids this problem. This must impose some constraint on a, b and c. Again, like the choice for d, we have some flexibility. All that we need guarantee is that in each iteration c is not the mid-point of a, b and c. We can achieve this by demanding that the new and old intervals are split in the same proportions.


Let the right sub-interval (e.g. [c, b]) be a fraction r of the total interval (e.g. [a, b]). Then we demand that

  b − c = r(b − a)
  b − d = r(b − c)
  c − a = (1 − r)(b − a)

When these are combined with the above equation c − a = b − d we find (it's easy) that

  r² = 1 − r

which leads to

  r = (−1 ± √5)/2

We can discard the negative square root as r must be positive. Thus we have found

  r = (−1 + √5)/2 ≈ 0.618

This number is famous; it was used by the ancient Greeks in much of their architecture, and is often called the Golden Ratio.

All that remains now is to decide how we might choose the initial values of a, b and c. This is not too hard. We make a guess for a and b. Then we choose c so that b − c = r(b − a). If this triple a, b and c satisfies the minimisation condition then we start with these values; if not, we make another guess for a and b and start again.

Minimisation by Golden Ratio search

Suppose we seek x⋆ such that f(x⋆) is a minimum.

1. Choose any a, b, c with c = b − r(b − a) such that f(c) < f(a) and f(c) < f(b).
2. If (b − c) > (c − a)
   set d = b − r(b − c)
   If f(d) < f(c) then set a = c and c = d, else set b = d
3. Else
   set d = c − r(c − a)
   If f(d) < f(c) then set b = c and c = d, else set a = d
4. Take x⋆ ≈ c; repeat from step 2 as required.
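A minimal Matlab sketch of this search (our own illustration, using the test function and the starting bracket of the example below):

% Golden ratio search for a local minimum of f(x)  -- minimal sketch
f = @(x) 1 + x.^2 - cos(10*x)/4;
r = (sqrt(5) - 1)/2;                 % the golden ratio, about 0.618
a = -0.7;  b = -0.4;                 % initial guesses
c = b - r*(b - a);                   % f(c) < f(a) and f(c) < f(b) for this bracket
for loop = 1 : 40
    if (b - c) > (c - a)             % put the new point d in the larger sub-interval
        d = b - r*(b - c);
        if f(d) < f(c), a = c; c = d; else b = d; end
    else
        d = c - r*(c - a);
        if f(d) < f(c), b = c; c = d; else a = d; end
    end
end
disp([c f(c)]);                      % approximation to x* and f(x*)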

12.1.1 Example

To test our methods we will use f(x) = 1 + x² − cos(10x)/4. This has three local minima in the interval −1 < x < 1, at x⋆ ≈ ±0.6 and x⋆ ≈ 0.


[Figure: the test function y = 1 + x² − cos(10x)/4 plotted for −1 ≤ x ≤ 1, showing its three local minima.]

The results for this example can be found in the following table.


  Loop  a           c:d         c:d         b           f(a)       f(c):f(d)  f(c):f(d)  f(b)
     0  -7.000E-01  -5.854E-01              -4.000E-01  1.302E+00  1.115E+00             1.323E+00
  l  1  -7.000E-01  -5.854E-01  -5.146E-01  -4.000E-01  1.302E+00  1.115E+00  1.160E+00  1.323E+00
  r  2  -7.000E-01  -6.562E-01  -5.854E-01  -5.146E-01  1.302E+00  1.190E+00  1.115E+00  1.160E+00
  r  3  -6.562E-01  -6.292E-01  -5.854E-01  -5.146E-01  1.190E+00  1.146E+00  1.115E+00  1.160E+00
  l  4  -6.292E-01  -5.854E-01  -5.584E-01  -5.146E-01  1.146E+00  1.115E+00  1.120E+00  1.160E+00
  r  5  -6.292E-01  -6.125E-01  -5.854E-01  -5.584E-01  1.146E+00  1.128E+00  1.115E+00  1.120E+00
  r 10  -5.854E-01  -5.815E-01  -5.790E-01  -5.751E-01  1.115E+00  1.115E+00  1.115E+00  1.115E+00
  l 20  -5.801E-01  -5.800E-01  -5.800E-01  -5.800E-01  1.115E+00  1.115E+00  1.115E+00  1.115E+00
  l 30  -5.801E-01  -5.801E-01  -5.801E-01  -5.801E-01  1.115E+00  1.115E+00  1.115E+00  1.115E+00
  l 40  -5.801E-01  -5.801E-01  -5.801E-01  -5.801E-01  1.115E+00  1.115E+00  1.115E+00  1.115E+00

The initial values were taken as a = −0.7 and b = −0.4. This forced us to take c = b − r(b − a) ≈ −0.5854.

Note that the formatting in the table has to take account of the possibility that the test point d can be introduced either in the interval [a, c] or in the interval [c, b]. Thus the notation c : d indicates that there are two types of data in this column, c and d. A similar idea applies to the columns headed by f(c) : f(d).

The first column contains the letters 'l' and 'r'. These record which triple of points is taken for the next iteration. Suppose d was created in the interval [a, c]; then 'l' records that the left triple (a, d, c) was chosen, while 'r' indicates that the right triple (d, c, b) was chosen.

By careful inspection of the table you can follow the progress of the triple (a, c, b). In each line three of (a, b, c, d) are carried forward to the next line. The introduced number is always d and it is always introduced into the larger of the two intervals [a, c] and [c, b].

Note that the overall length of the bracketing interval is reduced by a factor of approximately 0.6 with every iteration. Thus to gain one decimal digit of accuracy we need to apply about five extra iterations (0.6⁴ ≈ 0.13 and 0.6⁵ ≈ 0.08). This is very slow. But the main thing is that we are guaranteed that the method will converge to a local minimum.


12.2 Steepest descent

This method uses the classic calculus formulation of a local extremum: df/dx = 0 at the extremum.

Given f(x) = x² + 1 − cos(10x)/4 we have to find the roots of

  0 = 2x + (5/2) sin(10x)

This we do using a Newton-Raphson method: for x such that 0 = g(x), compute x_{n+1} = x_n − g_n/g'_n. In our case g(x) = f'(x) and g'(x) = f''(x).
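A minimal Matlab sketch of this Newton-Raphson iteration (our own illustration, starting from the same initial guess x = −0.7 as the table below):

% Newton-Raphson on f'(x) = 0 for f(x) = 1 + x^2 - cos(10*x)/4  -- minimal sketch
g  = @(x) 2*x + (5/2)*sin(10*x);     % g(x)  = f'(x)
gp = @(x) 2 + 25*cos(10*x);          % g'(x) = f''(x)
x = -0.7;                            % initial guess
for loop = 1 : 5
    x = x - g(x)/gp(x);              % Newton-Raphson update
    fprintf('%3d  %14.6e  %14.6e\n', loop, x, g(x));
end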

  Loop  x             f(x)         f'(x)
     0  -7.00000E-01  1.30152E+00  -3.04247E+00
     1  -5.54061E-01  1.12280E+00   5.82342E-01
     2  -5.82582E-01  1.11510E+00  -6.11959E-02
     3  -5.80077E-01  1.11502E+00  -3.52201E-04
     4  -5.80062E-01  1.11502E+00  -1.23382E-08
     5  -5.80062E-01  1.11502E+00   6.66134E-16

This table displays the one significant advantage of this method – it converges very quickly! In fact, it can be shown that the sequence x_n converges quadratically to the true minimum. That is, provided x_n is close to x⋆, then

  |x_{n+1} − x⋆| = O( |x_n − x⋆|² )

A hint of this quadratic convergence can be seen in the last column, which shows that f'(x) converges quadratically to zero. The upshot of quadratic convergence is that each iteration will double the number of correct digits. Thus it is very common to see this method converge in about 5 iterations.

There are also a number of disadvantages with this method.

◮ The method makes no distinction between maxima and minima.
◮ The method makes no distinction between local and global minima.
◮ The method can fail when f''(x) = 0 at the minimum of f(x), for example when f(x) = x⁴.
◮ The method is applicable only to functions with smooth derivatives.


12.3 Genetic Algorithms

Once again we are faced with the problem of finding the minimum of some given function. Unlike the previous methods, in which a single approximation was successively improved towards a minimum, these methods work with a whole family of approximations. In each iteration we will manipulate the family in such a way as to focus the family towards the minimum.

The remarkable thing about the methods we are about to develop is that they are inspired by ideas drawn from evolutionary biology. Nature, in the Darwinian paradigm of the survival of the fittest, solves its own optimisation problem, namely how to produce life forms that are well adapted to their environment.

Here is a very rough description of the genetic theory of evolution. Each individual is comprised of a set of genetic material and this genetic material fully defines the individual. Some individuals are well adapted to their environment and these are called fit individuals. Other individuals are not well suited to their environment. These are called weak or less fit. The survival of the fittest means that it is much more likely that fit rather than weak individuals will pass on their genetic material to the next generation. In this way successive generations will become dominated by fit individuals.

We will apply these ideas to our problem of finding the minimum of a given function. Along the way we will need to translate the language of evolutionary genetics into familiar mathematical terms. At the end of our journey (it's not a long journey) we will have done something amazing (well, I think it is) – we will have solved a purely mathematical problem by drawing on strong analogies with processes that occur in evolutionary biology.

Let's get serious. We are looking for the value x⋆ at which the function f(x) is minimised. We will start with a randomly chosen collection of candidates for x⋆. Denote these by x_i, i = 1, 2, ..., N. This will be our first generation. Some of the x_i will be close to x⋆ while others will be far from x⋆. Our aim is to breed successive generations so that the spread of x-values shrinks to an arbitrarily small range centred on x⋆.

Let x_i be a typical individual. We define the fitness of this individual to be the value of the function, f(x_i). Since we wish to minimise the function it is natural to call an individual fit if its f-value (fitness) is less than that of other, weaker individuals (yes, this does sound like a tautology).

We need to decide how we are going to build one generation from another. There are many possibilities and we will be guided by the simplest biological analogy. Here is a rough outline of how we might create one new generation. First we select a pair of individuals. These will be our parents. We then create two children by exchanging genetic material between the parents. These children form two members of the next generation. This process is repeated until the new generation is complete (i.e. the same number of children as parents). At this point we delete the parents' generation and start afresh with the children's generation. After each generation we have N individuals.

There are two questions that we must ask.


◮ How do we choose the parents?
◮ How do the parents exchange genetic material?

12.3.1 Selection

In selecting a parent we want a scheme that favours fit individuals over weak individuals. Here is one simple scheme.

Compute the average fitness f̄ by

  f̄ = (1/N) Σ_{i=1}^{N} f(x_i)

Select a random individual, say x_j. If f(x_j) < f̄ then we accept this individual as a parent. If the condition is not met then step through successive individuals, j, j+1, j+2, ..., N, 1, 2, ..., until the condition is satisfied.

Note that the selection of successive parents is done without knowledge of the previous parents (i.e. selection with replacement). This means that in one generation a very fit individual may be selected many, many times (and may even be partnered with itself). This preferential selection is one of the key elements of genetic algorithms.

12.3.2 Breeding

We need to exchange genetic material between the parents. So obviously we need access to their genetic material. What might this be? Again we have many options, but whatever we do we need to represent the individuals by a string of letters (e.g. the base pairs in DNA). For our simple problem we can choose this to be the binary form of the number represented by x_i. For example, we might have x_23 = 111010101100101. This expresses x_23 as a 15-digit binary number. This is our genetic material.

Now suppose we have two parents,

  x_23 = 111010101100101
  x_47 = 100111011001101

And now the breeding begins (turn the lights out please). First we choose a random number between 1 and 15, say we get 6. Then we cleave mum and dad after the first 6 binary digits.

  x_23 = 111010 101100101
  x_47 = 100111 011001101

Then we swap over the leading 6 binary digits to form two children, y_01 and y_02,

  y_01 = 100111 101100101 = 100111101100101
  y_02 = 111010 011001101 = 111010011001101


This method of breeding is known as crossover. A variation on the method is to also allow for mutations. This is applied after the two children are formed. Each binary digit in each child is flipped (1's and 0's swapped) with a very small probability. This again is done by analogy with the genetic paradigm.
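Here is a minimal Matlab sketch of one crossover step, with an optional mutation applied to the first child (our own illustration; the parent strings and the mutation probability are just examples):

% Crossover and mutation on two binary-string parents  -- minimal sketch
p1 = '111010101100101';  p2 = '100111011001101';   % parents stored as character strings
n  = length(p1);
k  = randi(n - 1);                       % random cleave point
c1 = [p2(1:k) p1(k+1:end)];              % swap the leading k digits ...
c2 = [p1(1:k) p2(k+1:end)];              % ... to form the two children
pm = 0.01;                               % small mutation probability per digit
flip = rand(1,n) < pm;                   % digits to mutate (shown for child 1 only)
c1(flip) = char('0' + '1' - c1(flip));   % flip '0' <-> '1' on the chosen digits
disp(c1);  disp(c2);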

We now have all of the elements in place – it's time to see how this method works on a specific example.

12.3.3 Example

Here we will minimise the simple function f(x) = x² subject to the condition that 0 ≤ x ≤ 1. We know the answer must be x⋆ = 0 and we can use this as a measure of how well the method works.

We begin by choosing N random numbers in the interval 0 ≤ x ≤ 1. We record each x_i in its binary form; for example, with 5 digits we might have

  x_43 = 11010 = 1 × 2⁻¹ + 1 × 2⁻² + 0 × 2⁻³ + 1 × 2⁻⁴ + 0 × 2⁻⁵

Then we apply the above ideas. Here is the initial population (the right hand column is the decimal value of x_i).

  110111011100101000000001011101 : 0.86636
  011101110111011001011001100111 : 0.46665
  000000001001100100100001011001 : 0.00234
  100100010001011101001111111001 : 0.56676
  001110111011110000111011110111 : 0.23334
  000010001110110000001101001001 : 0.03485
  010111000110010010001000100110 : 0.36091
  111101001101010010011001011111 : 0.95637
  000110100110111110100010010110 : 0.10327
  111111010100001001011100000100 : 0.98929
  011110111100110010110111101000 : 0.48359
  000110111011000111110001100101 : 0.10818
  011011010101011110000010001001 : 0.42712
  011110110000010001011011001001 : 0.48054
  110000100110100101110111011011 : 0.75942
  100110111011001110001100100001 : 0.60821
  111000101110110000110101111110 : 0.88642
  110011110101011111111110001001 : 0.80994
  001000101110100001011110110100 : 0.13636
  010110001000111000011011100110 : 0.34592

And after 100 generations we get

  000010101110100001011001100000 : 0.04261


  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261

As you can see, the initial population is randomly scattered in the range 0 < x < 1 but the final population (after 100 generations) has all members equal to the same value, x = 0.04261. The population now consists of a set of clones; no improvement will be possible with extra generations because there is no genetic diversity in the population. One way to fix this is to introduce a small degree of mutation.

13. Random numbers



13.1 Uniform random numbers

Computers are very predictable. They do exactly what we tell them to do. So how can they be used to compute random numbers? Quite simply – they cannot! The best we can expect from them is the appearance of randomness. Each time we run the program we will get the exact same sequence of (apparently) random numbers.

Suppose we have a very long list of integers (which we will take as a sample of random numbers). We could imagine some function f(x) that delivers these integers one at a time by way of a recursive formula such as

  I_{n+1} = f(I_n)

Given an initial value for I_0 we can use this formula to compute I_1, I_2, I_3, .... How many distinct numbers can we generate in this sequence? Clearly we would want this sequence to be as long as possible if we are to have the best possible sequence of random numbers (if ever we repeat a number then the sequence will repeat, and that will be end of game for our so-called random numbers). Suppose our computer can store integers in the range 0 to M. That is

  0 ≤ I_n ≤ M

Then the longest sequence of random numbers will contain M + 1 distinct entries. This is the best we can hope for. On a typical 32 bit computer

  M = 2³¹ − 1 = 2,147,483,647

That's a big number and should be adequate for most purposes!

How might we choose the magic function f(x)? The main criterion is that it produces very, very long sequences of numbers without repetition. Any f that does this is suitable. Here is one popular choice

  I_{n+1} = 16807 I_n mod M

with I_0 chosen to be any number (other than 0!). (Recall that a mod b is the remainder after dividing a by b.)
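A minimal Matlab sketch of this generator (our own illustration; in Matlab the product 16807·I_n is small enough to be held exactly in a double, so no special tricks are needed here):

% Multiplicative congruential generator I_{n+1} = 16807*I_n mod M  -- minimal sketch
M = 2^31 - 1;
I = 1;                        % any non-zero seed I_0
x = zeros(1, 10);
for n = 1 : 10
    I = mod(16807*I, M);      % 16807*I < 2^46, so it is represented exactly in a double
    x(n) = I/M;               % a "uniform" random number in (0,1)
end
disp(x);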

Exercise. If I_n happens to be very large (as it will be from time to time) then the product 16807 I_n might very well exceed M, and recall that M is the largest integer the computer can handle. How then does the computer compute I_{n+1} in this case?

How well does this work? That is, how random are these numbers? We can check the quality of the sequence by forming a simple probability distribution. Let

  x = I_n / M

then 0 ≤ x ≤ 1 and the successive values of x should appear to be uniformly (i.e. without favouring one value over another) and randomly distributed over the range 0 to 1. Here are two frequency histograms. In each case we took a fixed number of x's and then counted the number of times the x's fell inside 100 equally spaced sub-intervals.


x’s in each of the sub-intervals – that is, the his<strong>to</strong>gram should look flat. In the first<br />

his<strong>to</strong>gram, with only 10000 numbers the his<strong>to</strong>gram is far from flat. However, by the<br />

time we have used 1000000 numbers the profile is indeed very flat.<br />

y<br />

5 10 15 20<br />

N = 10 4<br />

y<br />

2000 4000 6000 8000 10000 12000<br />

0.0 0.2 0.4 0.6 0.8 1.0<br />

x<br />

N = 10 6<br />

0.0 0.2 0.4 0.6 0.8 1.0<br />

x<br />


13.2 Non-uniform random numbers

In the previous section we found that we could easily create a uniform random distribution by setting

  x = I_n / M

Now we ask a slightly more interesting question: How do we create a series of x's that follow a prescribed non-uniform behaviour? We still want the x's to be random (i.e. we have no way of predicting what the next number will be) but we want the numbers to favour certain x values more often than others. Let's begin by reminding ourselves of some basic probability theory.

If X is a random variable with probability density function ρ(x), and if X takes on values only in the range 0 to 1, then the probability that X is found in the range a to b is given by

  Pr(a < X < b) = ∫_a^b ρ(x) dx

So our question now is: How do we choose the successive x's randomly such that the histogram, for a very long series, has the same shape as ρ(x)? Here is one way to do so (this is the rejection method, the same accept/reject idea that underlies the Metropolis algorithm). First we build a rectangular box that just contains ρ(x). Next, we uniformly sprinkle points throughout that box. That is easy to do: just choose (x, y) pairs as follows

  x = I_n / M,    y = I_{n+1} / M

Now here comes the interesting bit. If we now throw away all the points that lie above y = ρ(x), then the remaining points will have x-values with a probability distribution governed by ρ(x).

Non-uniform random numbers

To generate a sequence of random numbers in the range 0 < x < 1 that follow the probability density function ρ(x):

◮ Compute x = I_n/M and y = I_{n+1}/M.
◮ If y > ρ(x) then reject (x, y) and try again.
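A minimal Matlab sketch of this rejection idea (our own illustration; ρ(x) = 2x is just an example density on (0,1), and we use Matlab's rand in place of the I_n/M values):

% Rejection sampling of x in (0,1) with density rho(x)  -- minimal sketch
rho  = @(x) 2*x;                  % an example density on (0,1)
rmax = 2;                         % height of the enclosing box (the maximum of rho)
N = 10000;  xs = zeros(1, N);  count = 0;
while count < N
    x = rand;  y = rmax*rand;     % a point sprinkled uniformly in the box
    if y <= rho(x)                % keep the point only if it lies below rho(x)
        count = count + 1;
        xs(count) = x;
    end
end
hist(xs, 50);                     % the histogram should follow the shape of rho(x)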

Exercise. Prove the above claim: that by rejecting the points with y > ρ(x), the resulting x's will have ρ(x) as their probability density function.


Laboratory class notes




Laboratory class 1

Introduction to Matlab

Getting started

Today is a day for playing with Matlab.

Logon to the PC and click on the Matlab icon on your desktop. You should see a screen like the following.

[Screenshot of the Matlab desktop, with labels pointing out your current workspace, your Matlab files, the window where you enter Matlab commands, and your command history.]

This contains two small windows and one large window. The larger window is where you will enter your Matlab commands. The other windows are simply for your information. They show the history of your previous commands and your Matlab files (you'll learn how to use and create Matlab files in the next laboratory class).



Setting your workspace

Matlab likes to store files (yours and Matlab's) in various directories on the computer. One thing you should do is tell Matlab where it should store your files – in your U: drive. Do this by setting your Current Directory, by clicking on the icon on the menu bar. Then navigate through to your U: drive and select okay. If you don't get into the habit of doing this then your files may be written to the local hard drive of the computer you are currently on (so when you move to another computer you will not be able to access that file).

You may need to do this every time you start up Matlab. Your tutor may know a way to force Matlab to remember this setting, but I do not know this (yet).

Setting your Matlab Path

This is something you should only need to do once. Matlab will look in various directories when searching for files (i.e. when you ask it to perform a command). You will (soon) be creating your own files and you will need to tell Matlab where to find them. This is what Matlab calls the path. You can set this by selecting the File menu and then Set Path.... You will see a new window and you should click on Add Folder. Then navigate to your U: drive. In this example I selected my own private directory; it's the top entry in this diagram (this is how your window will look after clicking on Add Folder and after selecting your U: drive).

You might like to create a new directory on your U: drive specifically for MTH3051, in which case you might want to repeat the above operation.


The final thing to do, to set your path, is to click on Save. That's it. You are now ready to play with Matlab.

Exercises

On the following pages you will find scanned copies of the first four lessons from Getting Started with MATLAB 7 by Rudra Pratap. Work your way through the lessons and the exercises. This should prepare you well for the fun times ahead (where we, that is you, will develop Matlab programs that actually do something useful, quelle surprise?).

[Scanned pages: the first four lessons of Getting Started with MATLAB 7 by Rudra Pratap, not reproduced here.]



Laboratory class 2

Truncation errors

1. Estimate the truncation error for each of the following approximations (for |x| ≪ 1)

  cos(x) ≈ 1
  sin(x) ≈ x − x³/3!
  exp(x) ≈ 1 + x + x²/2!

2. Use a series expansion in u to show that

  ∫_0^1 du/(1 + u²) = 1 − 1/3 + 1/5 − 1/7 + · · ·

Re-evaluate the integral using the substitution u = tan θ. What use can you make of this result?

Order estimates

3. If u(x) = O(x²) and v(x) = O(x⁴) for |x| ≪ 1, what can you say about the functions u(x) + v(x) and u(x)v(x) when |x| ≪ 1?

4. Suppose y(x) = 3 + O(2x) and g(x) = cos(x) + O(x³) for x



Programming

5. Pretend you are the computer and follow this algorithm.

  Set x to 0.5
  For i = 1 to 6 do
      If x > 3 then
          Replace x by x - 1
      Else
          Replace x by x + 1
  End

After you complete the instructions, what will be the value of x?

  (a) -1.5   (b) 3.5   (c) 4.5   (d) none of these

6. Which mathematical expression does the following pseudo-code represent?

  Read x
  Set sum = 1
  Set term = 1
  Set k = 1
  While k < 101 do
      Replace term by - term*x*x/( (2k)*(2k-1) )
      Replace sum by sum + term
      Replace k by k + 1
  End

  (a) Σ_{k=1}^{100} (−1)^(k+1) x^k / k!        (b) Σ_{k=0}^{100} x² / (2k)
  (c) Σ_{k=0}^{100} (−1)^k x^(2k) / (2k)!      (d) Σ_{k=1}^{100} (−1)^k x² / (2k)

7. Which of the following is the correct algorithm for approximating Σ_{n=1}^{∞} n⁻²?

  (a) Set n to 0
      Set sum to 0
      Repeat
          Replace n by n+1
          Set sum to 1/(n*n)
      Until sum < 0.000001

  (b) Set n to 0
      Set sum to 0
      Repeat
          Replace n by n+1
          Replace sum by sum+(1/(n*n))
      Until sum < 0.000001


   (c) Set n to 1
       Set sum to 1
       Repeat
           Replace n by n+1
           Replace sum by sum+(1/(n*n))
       Until 1/(n*n) < 0.000001

   (d) Set n to 1
       Set sum to 1
       Repeat
           Replace sum by sum+(1/(n*n))
           Replace n by n+1
       Until 1/(n*n) < 0.000001

8. Here is the code given in lectures to estimate π. Write a Matlab M-file that contains this code and use it to generate a series of approximations for π.

       n = 100;                 % set number of terms
       sum = 0;                 % set initial value for sum
       sign = 1;                % an integer +/- 1
       for k = 1 : n            % loop over k from 1 to n
           term = sign/(2*k-1); % compute the term
           sum = sum + term;    % update rolling sum
           sign = - sign;       % flip the sign
       end;
       disp(n);                 % print n
       disp(4*sum);             % print the approximation to π
       disp(4*sum-pi);          % print the error

9. Modify the above code to use the series π^2 = 12 ∑_{k=1}^{∞} (−1)^{k+1}/k^2. Verify that, for the same number of terms, this series gives much better answers than the original code above.
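
   One possible arrangement of the modified code (a minimal sketch only; the loop could equally well be written as a while loop or vectorised):

       n = 100;                  % number of terms
       sum = 0;                  % rolling sum of (-1)^(k+1)/k^2
       sign = 1;                 % alternating +/- 1
       for k = 1 : n
           term = sign/(k*k);    % the k-th term
           sum = sum + term;     % update rolling sum
           sign = - sign;        % flip the sign
       end;
       disp(n);                  % print n
       disp(sqrt(12*sum));       % print the approximation to π
       disp(sqrt(12*sum)-pi);    % print the error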



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 3

Finite precision arithmetic

1. We saw in lectures that doing arithmetic with a limited number of digits can cause numerical problems. Here you will explore directly some of those problems. Your first task will be to write a short Matlab program that takes one integer in and converts it to an integer with only 5 significant digits. Let's suppose you call this function Trunc. Here are some examples that your function should reproduce.

       123       ← Trunc(123)
       12345     ← Trunc(12345)
       12345e+03 ← Trunc(12345678)
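
   Here is one possible shape for such a function (a minimal sketch only; it chops rather than rounds, which is an assumption you may prefer to change):

       function y = Trunc(x)
       % Trunc : reduce the integer x to at most 5 significant digits
       if x == 0
           y = 0;                        % nothing to chop
           return
       end
       d = floor(log10(abs(x))) + 1;     % number of digits in x
       if d <= 5
           y = x;                        % already short enough
       else
           f = 10^(d-5);                 % scale factor
           y = floor(x/f)*f;             % chop the trailing digits, keep the magnitude
       end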

2. Modify the code so that you can specify exactly how many significant digits Trunc should return.

3. Write two Matlab programs, one to add a pair of integers, another to subtract a pair of integers. All of the integers (the two inputs and the one result) should be expressed as fixed precision integers. You may use the Trunc program from the previous questions.

   Test your programs on the following examples (which use 5-digit calculations). In each case compute the actual and relative errors in the final answer. Do this for 3, 5 and 7 significant digits.

       579       ← Plus(123,456)
       12345     ← Plus(12e+03,345)
       12345e+03 ← Plus(12345e+03,678)
       333       ← Minus(456,123)
       1         ← Minus(124,123)
       500       ← Minus(1235e+03,Plus(1234e+03,567))

   Are your results consistent with what we saw in lectures?

4. Here are two ways in which you could estimate e^{−5}.

       (a)  e^{−5} ≈ ∑_{k=0}^{9} (−1)^k 5^k / k!

       (b)  e^{−5} ≈ 1 / ( ∑_{k=0}^{9} 5^k / k! )

   Both of these series are based on the standard Taylor series for e^x. The correct value of e^{−5}, to three significant digits, is 6.74e-3. Which series, (a) or (b), will give the most accuracy when all calculations are done to 3 significant figures? Why?

5. The finite sum ∑_{k=1}^{10} 1/k^2 could be evaluated in many ways, for example as 1 + (1/2^2) + (1/3^2) + · · · + (1/10^2) or as (1/10^2) + (1/9^2) + (1/8^2) + · · · + 1. If these two sums are performed on a 3-digit computer, which sum do you think will be most accurate? Why?

6. Look carefully at the following mathematical expression.

       S = ∑_{i=1}^{n} ∑_{j=1}^{n} a_i b_j

   (a) How many multiplications are required to compute S using the above formula?

   (b) Are there other (better) ways to evaluate S? (Hint: you should only require one multiplication.)

Elementary probability distributions

7. We all know and love the Binomial and Poisson distributions. These are given by

       Poisson:   Pr(X = n) = e^{−λ} λ^n / n! ,                    n = 0, 1, 2, · · · ∞

       Binomial:  Pr(X = n) = (N choose n) p^n (1 − p)^{N−n} ,     n = 0, 1, 2, · · · N

   Your job is to write Matlab programs that can accurately evaluate these distributions, which for simplicity we will denote by Binomial(N, n, p) and Poisson(n, λ).

   Here are some test cases.

       7.29000e-02 ← Binomial(5,2,0.1)
       1.84865e-01 ← Binomial(100,2,0.01)
       3.60610e-02 ← Binomial(256,70,0.3)
       9.02235e-02 ← Poisson(4,2)
       1.86608e-03 ← Poisson(20,10)
       2.31105e-03 ← Poisson(45,30)

Bases other than 10

8. Suppose you have a number x and that it happens to have the value 1234 in base 10. This means that

       x = 1 × 10^3 + 2 × 10^2 + 3 × 10^1 + 4 × 10^0 = (1234)_10

   (the little 10 written as a subscript is just to remind us that in this instance we are using base 10).

   But this same x could also be written in base 3, that is

       x = 1 × 3^6 + 2 × 3^5 + 2 × 3^2 + 1 = (1200201)_3

   So we have

       x = (1234)_10 = (1200201)_3

   Your game is to write a Matlab program that allows you to convert any base 10 number into any other base (less than 10). In your quest you will find two standard Matlab commands, mod and floor, very useful. The function mod returns the remainder after division, e.g. mod(17,5) = 2. The function floor chops off any decimal part in a division, e.g. floor(10/4) = 2.
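
   As a starting point, here is one way the conversion loop might look (a minimal sketch; ToBase is simply a name I have made up, and it assumes x is a non-negative integer):

       function digits = ToBase(x, b)
       % ToBase : digits of the integer x written in base b (b less than 10)
       % digits(1) is the most significant digit
       digits = [];
       while x > 0
           digits = [mod(x,b), digits];   % peel off the least significant digit
           x = floor(x/b);                % remove that digit from x
       end

   For example ToBase(1234,3) should return the digits 1 2 0 0 2 0 1.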

9. Compute by hand 0.125 in base 2. Do the same for 0.3. Write a Matlab program that computes the first 6 binary digits of any number in the range (0.1)_10 to 1.



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 4

Fixed point iteration

1. Which of the following is a suitable fixed point method for solving

       0 = 2 + x − tan x

   (a) x = 2 + tan^{−1} x        (b) x = 2 + tan x
   (c) x = tan^{−1}(2 + x)       (d) none of these

2. Two iterations of the fixed point method for the equation x = 1/(3 + x^2), starting with x = 0, yields, as an approximation to the root

   (a) x = 3    (b) x = 1/3    (c) x = 28/9    (d) x = 9/28

3. Show that the Newton-Raphson method

       x_{n+1} = x_n − f(x_n)/f′(x_n)

   can also be viewed as a fixed-point iteration (i.e. x_{n+1} = g(x_n)). Using the convergence criteria for fixed-point iterations, what can you say about the convergence of the Newton-Raphson iterations?

Newton-Raphson

4. Use a Newton-Raphson iteration to find a root, accurate to 4 decimal places, of each of the following equations (for x in the stated interval).

   (a) x^3 + 10x − 37 = 0 for 2 ≤ x ≤ 3.
   (b) x^4 + x^3 − 22 = 0 for 1 ≤ x ≤ 2.
   (c) x − ln(x + 2) = 0 for 1 ≤ x ≤ 2.

5. Show, using a Newton-Raphson iteration, that the square root of any number y can be computed from the iteration

       x_{n+1} = (1/2) ( x_n + y/x_n )
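
   In Matlab this might look something like the following (a minimal sketch; the starting guess and the fixed number of iterations are arbitrary choices):

       function x = MySqrt(y)
       % MySqrt : square root of y via the iteration x_{n+1} = (x_n + y/x_n)/2
       x = y;                      % a crude starting guess
       for n = 1 : 50
           x = 0.5*(x + y/x);      % one Newton-Raphson step
       end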



6. Use the previous algorithm to estimate the square roots of 2, 3 and 10. What goes wrong when you try to find the square root of -1?

7. Imagine this scenario. You have a job, you are earning mucho $$$ and you are stashing it away in a bank with compound interest paid monthly. After n months, your initial saving P will be worth

       P_n = P (1 − i^n)/(1 − i) ,   n ≥ 1

   where i is the interest per payment period (one month). Suppose that in 20 years you wish to amass a fortune of $750,000 from monthly deposits of $1500. What interest rate per year would you need to secure your fortune?

Half-interval search

8. Plot each of the functions listed in question (4) and determine which are suitable for a half-interval search method, and in such cases apply that method to find the root (accurate to 4 decimal places).

9. Can the half-interval search method be used to find the root of a function such as f(x) = (x − 7)^2 g(x) with g(7) ≠ 0? What might you do in such cases (while still using a half-interval search)?

10. Given that f(x) = √x − cos(x) has one root in the interval 0 < x < 1, perform three rounds of the bisection method to estimate this root.

11. Suppose you are told that f(x) = (x + 2)(x + 1)x(x − 1)^3(x − 2). To which zero of f would the bisection method converge for the following intervals?

    (a) [−3, 2.5]      (b) [−2.5, 3]
    (c) [−1.75, 1.5]   (d) [−1.5, 1.75]


Round-off errors : Bessel functions

12. When studying certain physical systems (e.g. the vibrations on a drum) the following linear differential equation arises

        x^2 d^2y/dx^2 + x dy/dx + (x^2 − n^2) y = 0

    Solutions of this equation are known as Bessel functions of order n. As it's a second order equation we expect two (linearly independent) solutions. These are denoted by J_n(x) and Y_n(x). They differ most notably in their behaviour near x = 0, with J_n remaining finite while Y_n diverges to −∞ as x → 0. We will compute J_n(x) for various values of n at x = 1.

    The J_n(x) have the following properties

        lim_{n→∞} J_n(x) = 0
        |J_n(x)| ≤ 1 , for |x| < ∞
        J_0(0) = 1 , J_n(0) = 0 , n ≠ 0
        1 = J_0(x) + 2J_2(x) + 2J_4(x) + 2J_6(x) + · · ·
        J_{n+1}(x) + J_{n−1}(x) = (2n/x) J_n(x)

    (a) The last equation can be used to compute J_2 given J_1 and J_0. This then can be used to compute J_3 which in turn can be used to compute J_4. In this fashion we can generate successive J_n's from previous J_n's. Not surprisingly this is known as a recurrence relation. Given that J_0 = 0.76519768655796655145 and J_1 = 0.44005058574493351596 (both accurate to 20 decimal places), use the recurrence relation to compute (in strict order) J_2, J_3, J_4 · · · J_20. What do you observe? (A short sketch of this upward recurrence is given after part (c).)

    (b) The recurrence relation can also be used in the reverse direction. That is, J_20 and J_19 can be used to compute J_18 and so on down to J_0. Since we know that lim_{n→∞} J_n(x) = 0 we might make the reasonable estimate J_20(1) = 0. But what guess should we make for J_19? Let's just call it α, some unknown number. Now proceed to compute (in strict order) J_18, J_17, J_16 · · · J_0. Given that we know the correct value for J_0 you should then be able to determine the correct value for α. Follow this algorithm and compute the J_n for n = 0, 1, 2, · · · 20. How do your answers compare with those from part (a)? Which set of J_n's do you trust?

    (c) Can you modify the algorithm given in part (b) so that you do not need to know the exact value of J_0? (i.e. can you compute the J_n without being told in advance the value of J_0?)
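
    For part (a), the upward recurrence might be coded along these lines (a minimal sketch; note that Matlab arrays start at index 1, so J(n+1) stores J_n):

        J = zeros(1,21);                       % J(n+1) holds J_n(1)
        J(1) = 0.76519768655796655145;         % J_0(1), given
        J(2) = 0.44005058574493351596;         % J_1(1), given
        x = 1;
        for n = 1 : 19
            J(n+2) = (2*n/x)*J(n+1) - J(n);    % J_{n+1} = (2n/x) J_n - J_{n-1}
        end
        disp(J');                              % the computed J_0 ... J_20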


The Error function

13. In statistical analysis one often encounters the following integral

        erf(x) = (2/√π) ∫_0^x e^{−t^2} dt

    and it is used to define what we call the error function, namely erf(x). The disappointing thing is that this new function cannot be expressed in terms of what we call elementary functions (such as polynomials, trigonometric and exponential functions). If we need values for erf(x) then we must obtain those values from the above integral. This is our challenge – to construct a Matlab code that can compute erf(x).

    (a) Using the Taylor series around t = 0 for e^{−t^2}, show that

            erf(x) = (2/√π) ∑_{k=0}^{∞} (−1)^k x^{2k+1} / ( (2k + 1) k! )

    (b) Take it as fact that erf(x) can also be computed from

            erf(x) = (2/√π) e^{−x^2} ∑_{k=0}^{∞} 2^k x^{2k+1} / ( 1 · 3 · 5 · · · (2k + 1) )

        Verify that the two series have the same leading terms (e.g. the terms match for k = 0, 1, 2).

    (c) Use the series in part (a) to estimate erf(1) to six decimal places. (A small sketch of this step is given after part (e).)

    (d) How many terms did you use in part (c)? Use the same number of terms this time with the series from part (b). What answer did you get? How does it compare with the answer from part (a)?

    (e) Argue why you would expect the series in part (b) to be inferior to the series from part (a) for estimating erf(x).
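
    For part (c), the partial sums of the series in part (a) might be accumulated as follows (a minimal sketch; the stopping tolerance of 1.0e-8 is my own choice for "six decimal places"):

        x = 1;
        k = 0;
        term = x;                              % the k = 0 term of the series
        sum = 0;
        while abs(term) > 1.0e-8               % stop once the terms are tiny
            sum = sum + term;
            k = k + 1;
            term = (-1)^k * x^(2*k+1) / ((2*k+1)*factorial(k));
        end
        disp(2*sum/sqrt(pi));                  % the estimate of erf(1)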



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 5

Gaussian elimination

You are welcome to write your own Matlab programs but if you would prefer to look at my examples, you can find them on our MUSO website (they are under the link Programs). For this class you will only need the GEsimple.m file (this does Gaussian elimination without pivoting, while GEpivot.m includes pivoting).

1. Use your (Leo's?) Gaussian elimination code to solve the following systems. Verify that your solution is indeed the correct solution of the equations.

       (a)  [ 1 −1 ] [ u ]   [ 1 ]          (b)  [  1  2 −1 ] [ x ]   [ 1 ]
            [ 2  3 ] [ v ] = [ 0 ]               [  2  3  1 ] [ y ] = [ 0 ]
                                                 [ −1  2  5 ] [ z ]   [ 2 ]

       (c)  [ 3 1 2 ] [ x ]   [ 2 ]         (d)  [  1  2 −1  2 ] [ p ]   [ 1 ]
            [ 1 0 2 ] [ y ] = [ 1 ]              [  2  3  1  1 ] [ q ]   [ 0 ]
            [ 3 1 7 ] [ z ]   [ 9 ]              [ −1  2  5  3 ] [ r ] = [ 2 ]
                                                 [  1  2 −1  3 ] [ s ]   [ 3 ]

2. Here is a piece of Matlab code that will print out an n × m matrix M.

       for i = 1 : n                            % step through rows of M
           disp(sprintf(' %12.4f',M(i,:)))      % print this row of M
       end
       disp('-----------------------------')    % mark the end of the matrix

   Insert the above in your Gaussian elimination code so that you can see the various stages in which the matrix is reduced to upper triangular form. This is another way to verify that the code is doing what it should do.

3. Modify your Gaussian elimination with back substitution code to do full Gaussian elimination (i.e. reduce the matrix to diagonal form, not just upper triangular form). Verify that your new code works by applying it to the systems given in question (1).

4. Modify the code created in question (2) to compute the inverse of an n × n matrix. Recall that A^{−1} can be computed as follows. Start with A and then form a new n × 2n matrix [A|I] where I is the n × n identity matrix. Then apply full Gaussian elimination to reduce this to [I|B]. Then the matrix B will be the inverse of A.
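
   The bookkeeping around your elimination routine might look like this (a minimal sketch; FullGE is just a stand-in name for whatever you called your full-elimination code in the previous question):

       n = size(A,1);                  % A is the n x n matrix to invert
       M = [A, eye(n)];                % form the augmented matrix [A|I]
       M = FullGE(M);                  % reduce to [I|B] with your own routine
       Ainv = M(:, n+1:2*n);           % the right-hand block is the inverse B
       disp(Ainv*A);                   % should be (very close to) the identity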



5. The determinant of a matrix can be computed using steps very similar to those used in the Gaussian elimination code. Let's suppose we want to compute det(A). Suppose we feed A into a standard Gaussian elimination code and that we recover the upper triangular matrix A_u. Then it can be shown that

       det(A) = (−1)^m λ_1 λ_2 λ_3 · · · λ_q det(A_u)

   where m is the number of times we swapped rows and each 1/λ equals the scale factor used whenever we did a row operation of the form row_j ← row_j / λ. Modify your Gaussian elimination code so that it can correctly compute the determinant of an n × n matrix A. Note that det(A_u) is easy – it's just the product of its diagonal elements.

6. The inverse of a matrix can also be computed using Cramer's rule. This is how it works. Let B be the inverse of A. Then the entries in B are calculated as follows

       B(i, j) = det(R(j, i)) / det(A) ,   i, j = 1, 2, 3, · · · , n

   where R(i, j) is the matrix obtained from A by first replacing row i and column j with zeroes and then setting R(j, i) = 1 (here we take the first index as a row index and the second as a column index). Modify your code from the previous question to compute the inverse of A. You can verify your answers using the Matlab routine inv (i.e. inv(A)).

7. Estimate the operational cost (number of floating point operations, flops) to compute the inverse of an n × n matrix A using (i) standard Gaussian elimination and (ii) using Cramer's rule. Which method should you use for large n?



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 6

Matlab programs

For this class you will need the Matlab program jacobi.m which you can find on the subject web site (in MUSO). As always, you are welcome to write your own code from scratch. Later you will be asked to make various changes to implement Gauss-Seidel iterations, generalised fixed point and generalised Newton-Raphson iterations.

Jacobi iteration

1. Two iterations of the Jacobi method for solving

       3 = 7x − 2y
       1 = x + 4y

   starting from x_0 = 0 and y_0 = 0 yields

   (a) x_2 = 7/3 and y_2 = 4        (b) x_2 = 1/2 and y_2 = 1/7
   (c) x_2 = 3/7 and y_2 = 1/4      (d) x_2 = −1/2 and y_2 = 1/7

Gauss-Seidel iteration

2. Which of the following is a suitable Gauss-Seidel scheme for solving

       3 = 7x + 9y
       5 = 4x + 2y

   (a) x_{n+1} = (3 − 9y_n)/7            (b) x_{n+1} = (3 − 9y_n)/7
       y_{n+1} = (5 − 4x_{n+1})/2            y_{n+1} = (5 − 4x_n)/2

   (c) x_{n+1} = (5 − 2y_n)/4            (d) None of these
       y_{n+1} = (3 − 7x_{n+1})/9



3. Write down a convergent Gauss-Seidel scheme for the following system of equations

       5x + y + 2z = 5
       x + y + 3z = 6
       −x + 4y − z = −9

   Perform three iterations, by hand, of this scheme starting with the initial guess x = 0, y = 0 and z = 0.

4. Modify your Jacobi code to perform Gauss-Seidel iterations. Test your new code by applying it to the system of equations from the previous question.
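
   For reference, one convergent rearrangement of the question-3 system, written as a Gauss-Seidel sweep, might look like this (a minimal sketch; the ordering of the equations and the 20 sweeps are my own choices):

       x = 0; y = 0; z = 0;                % initial guess
       for iter = 1 : 20
           x = (5 - y - 2*z)/5;            % from 5x + y + 2z = 5
           y = (-9 + x + z)/4;             % from -x + 4y - z = -9, using the new x
           z = (6 - x - y)/3;              % from x + y + 3z = 6, using the new x and y
           disp([iter x y z]);
       end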

5. Modify your Jacobi and Gauss-Seidel codes so that they accept the augmented matrix of the linear system of equations and the number of equations as input arguments. That is, modify your code so that the following system

       7x − 2y = 3
       x + 4y = 1

   could be solved by typing

       Matlab>> M = [ 7 -2 3; 1 4 1 ];
       Matlab>> x_start = [ 0 0 ];
       Matlab>> x_final = jacobi(M,x_start,2,0.001,20);

   In this way you will not need to continually modify the function onestep each time you encounter a new system of equations.

6. In the lecture notes you will find the theorem that if a matrix is diagonally dominant then both the Jacobi and Gauss-Seidel iterations will converge. Will the iterations converge when the matrix is not diagonally dominant? To explore this question, use Jacobi iterations on the following system

       x + z = 2
       −x + y = 0
       x + 2y − 3z = 0

   You should find that the iterations converge. Repeat the calculations using a Gauss-Seidel iteration. What do you observe?


Generalised Fixed-Point iteration

7. Create a Matlab program for the generalised fixed-point algorithm by making suitable modifications to the function onestep in the Matlab program jacobi.m. But don't overwrite your old jacobi.m program; save your new program in a new file, say gen_fp.m. And remember that whatever name you choose for your file you must also choose the same name for the function. Test your program by searching for the roots of the system

       x_{n+1} = ( 7x_n^3 − y_n − 1 ) / 10
       y_{n+1} = ( −8y_n^3 + x_n − 1 ) / 11

   starting with (x, y)_0 = (0, 0). Compare your results against those given in lectures.
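
   Stripped of the function-file structure, the iteration itself is just (a minimal sketch; 30 iterations is an arbitrary choice):

       x = 0; y = 0;                           % starting guess (x,y)_0 = (0,0)
       for n = 1 : 30
           xnew = (7*x^3 - y - 1)/10;          % x_{n+1} from the current (x_n, y_n)
           ynew = (-8*y^3 + x - 1)/11;         % y_{n+1} from the current (x_n, y_n)
           x = xnew;  y = ynew;                % update both together (fixed point, not Gauss-Seidel)
           disp([n x y]);
       end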

8. See if you can find any other roots of the system of equations

       y = 7x^3 − 10x − 1
       x = −8y^3 + 11y + 1

   by exploring variations of the iterations used in the previous question.

9. Look back at the pair of equations in Question 1. Though these are linear equations there is nothing stopping you from solving them by a generalised fixed point algorithm. Write out the fixed point equations for this system. How do they compare with the Jacobi iterations?

Generalised Newton-Raphson iteration

10. Modify your generalised fixed-point code to implement the generalised Newton-Raphson method. Test your code by solving the above system of equations. You should be able to find any and all of the nine roots (some of which were given in lectures).

11. (a) Verify that (x, y) = (1, 1) and (x, y) = (−1, −1) are solutions of

            0 = x^2 + y^2 − 2
            0 = xy − 1

    (b) Write out the Newton-Raphson iteration equations in full for this system.

    (c) What problems do you think might arise as the iterations converge to (x, y) = (1, 1)?

12. The following pair of equations

        0 = x^2 − y^2 + 2y
        0 = 2x + y^2 − 6

    are known to have four solutions near the following points (−5, −4), (2, −1), (0.5, 2) and (−2, 3).

    (a) Re-arrange the system of equations into a fixed-point iteration form (note that there are many ways to do this).

    (b) Using the above values as initial guesses, determine which of the roots can be found with your choice of fixed point iteration.

    (c) Repeat part (b), this time using the generalised Newton-Raphson method.



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 7

Matlab programs

You should attempt the following questions by hand. Programs are good, but they should only be used once you know what they are doing (it's the same with calculators – you need to know basic arithmetic before you use them! Well, that's my 2 cents worth and I enjoy the view from this soapbox).

Lagrangian interpolation

1. Given the following data

       x      0.0   1.0   2.0    4.0
       f(x)   0.0   1.0   16.0   256.0

   use a cubic Lagrange polynomial to estimate f(1.5). You can probably guess that the underlying function is f(x) = x^4. Use this to compute the error in your estimate for f(1.5).
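
   If you later want to check your hand calculation in Matlab, the cubic Lagrange estimate can be built directly from its basis polynomials (a minimal sketch):

       xd = [0.0 1.0 2.0 4.0];             % the data points
       fd = [0.0 1.0 16.0 256.0];
       xs = 1.5;                           % where we want the estimate
       p = 0;
       for i = 1 : 4
           L = 1;                          % the i-th Lagrange basis polynomial at xs
           for j = 1 : 4
               if j ~= i
                   L = L*(xs - xd(j))/(xd(i) - xd(j));
               end
           end
           p = p + fd(i)*L;                % add its contribution
       end
       disp(p);                            % the cubic estimate of f(1.5)
       disp(p - 1.5^4);                    % the error, since f(x) = x^4 here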

2. Using the same table, construct two quadratic polynomials to interpolate f at x = 1.5. Compare this pair of estimates with the exact answer and the cubic estimate (using the cubic you built in the previous question). How well do these errors compare? If you did not know that the underlying function was f(x) = x^4, could you still estimate the error in the cubic interpolation?

3. Below are a set of functions f(x). For each function use the node points x_0 = 0, x_1 = 0.6 and x_2 = 0.9 to estimate f(0.45) and the error.

   (a) f(x) = cos(x)           (b) f(x) = √(1 + x)
   (c) f(x) = log_e(1 + x)     (d) f(x) = tan(x)

Newton interpolation

4. Using the same data as above, construct the triangular table of numbers (known as divided differences) as given in the lecture notes. Hence write down the cubic interpolating polynomial. Verify that your polynomial does pass through the given data points. Estimate f at x = 1.5. How does this compare with the estimate from Question 1?

5. Without doing any more calculations, write down, from this table, two quadratics that could be used to interpolate f(x) at x = 1.5. Then repeat as per Question 2. How do these estimates compare with those from Question 2? Are you surprised? (You shouldn't be!)

6. Consider a typical cubic polynomial in the form

       f(x) = a + bx + cx(x − 1) + dx(x − 1)(x − 2)

   Use this to estimate the derivatives of f(x) at x = 0. Could you then estimate d given just the derivatives? Could this also be used to estimate the other coefficients a, b and c? If so, how?

7. (a) Construct a divided difference table for the following data

           x      -2.0   -1.0   0.0   1.0   2.0
           f(x)   -1.0    3.0   1.0  -1.0   3.0

   (b) From your table construct the following pair of polynomials.

           P(x) = 3 − 2(x + 1) + 0(x + 1)(x) + (x + 1)(x)(x − 1)
           Q(x) = −1 + 4(x + 2) − 3(x + 2)(x + 1) + (x + 2)(x + 1)(x)

   (c) Show that both P(x) and Q(x) interpolate the above data.

   (d) In lectures it was claimed that the interpolating polynomial is unique. How can this be so? Surely P and Q are different (or are they?).

8. You are told that the following should be a divided difference table. Yet it contains some missing entries. Fill in the missing entries.

       x_i    f_i = d_i0    d_i1    d_i2
       0      ??
                            ??
       0.4    ??                    50/7
                            10
       0.7    6

Cubic splines

9. Consider the use of cubic splines to interpolate a set of data. Suppose at some stage in the calculation we arrive at the following spline functions for two consecutive intervals

       f̃_0(x) = x^3 + ax^2 + bx + c ,   −1 ≤ x ≤ 1
       f̃_1(x) = 2x^3 + x^2 − x + 4 ,    1 ≤ x ≤ 2

   (a) State the conditions that should be imposed on the two functions.

   (b) Hence compute a, b and c.



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 8

Finite Differences

1. Use a standard Taylor series expansion

       f(x + h) = f(x) + (df/dx) h + (d^2f/dx^2) h^2/2! + (d^3f/dx^3) h^3/3! + · · ·

   to verify the formulae given in lectures

       df(x)/dx     = ( f(x + h) − f(x − h) ) / (2h) + O(h^2)

       d^2f(x)/dx^2 = ( f(x + h) − 2f(x) + f(x − h) ) / h^2 + O(h^2)

2. (a) Write out the Taylor series for f(x ± nh) for n = 1, 2.

   (b) Hence show that the first derivative, at x = x_0, may be approximated by

           f′(x_0) = (1/(2h)) ( −3f(x_0) + 4f(x_0 + h) − f(x_0 + 2h) ) + O(h^2)

   (c) For those who love long tedious calculations, go on and show that

           f′(x_0) = (1/(12h)) ( −25f(x_0) + 48f(x_0 + h) − 36f(x_0 + 2h) + 16f(x_0 + 3h) − 3f(x_0 + 4h) ) + O(h^2)

3. Using the same methods as in the previous question, show also that

       f′(x_0)  ≈ (1/(5h)) ( 2f(x_2) + f(x_1) − 3f(x_0) )

       f′′(x_0) ≈ (1/(2h^2)) ( f(x_2) − f(x_1) − f(x_0) + f(x_{−1}) )

4. Consider three points (x_{−1}, f(x_{−1})), (x_0, f(x_0)) and (x_1, f(x_1)). Construct a quadratic through these three points and then show that

       f′′(x_0) ≈ (1/h^2) ( f(x_1) − 2f(x_0) + f(x_{−1}) )



5. Use Matlab to compute 2nd order finite difference approximations to d^2f(x)/dx^2 for f(x) = sin(x) at x = π/2. What is the exact value? How do your numerical estimates compare with the exact values? Try using smaller and smaller values of the step length. What do you observe? Can you explain this behaviour?

   You might like to use the Matlab program NumDeriv.m which is available on the subject web page. (But do note that that program is for first derivatives.)
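
   A direct calculation, without NumDeriv.m, could be as simple as this (a minimal sketch; the range of step lengths is my own choice):

       x = pi/2;
       exact = -sin(x);                                 % exact second derivative, i.e. -1
       for n = 1 : 10
           h = 10^(-n);
           d2 = (sin(x+h) - 2*sin(x) + sin(x-h))/h^2;   % central 2nd-difference estimate
           disp([h, d2, d2 - exact]);                   % step, estimate, error
       end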

6. Here is the school-yard definition (yes, it's that well known) of the derivative of a function f(x)

       df/dx = lim_{h→0} ( f(x + h) − f(x) ) / h

   Choose your favourite function f(x), any non-zero number x, and compute the following approximation for the derivative

       (df/dx)_n = ( f(x + 10^{−n}) − f(x) ) / 10^{−n}

   for n = 1, 2, · · · , 20. What do you notice? Can you explain this behaviour?
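
   In Matlab this experiment might look like the following (a minimal sketch; exp and x = 1 are simply my example choices):

       f = @(x) exp(x);                    % a choice of "favourite" function
       x = 1.0;                            % any non-zero point
       for n = 1 : 20
           h = 10^(-n);
           d = (f(x+h) - f(x))/h;          % the one-sided difference quotient
           disp([n, d, d - exp(x)]);       % n, estimate, error (exact derivative is exp(x))
       end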

7. We saw in lectures that the first derivative of f(x) at x = x_i can be approximated by

       f′(x_i) = (1/(2h)) ( f(x_{i+1}) − f(x_{i−1}) )

   Given that f(x) is the derivative of some other function, say g(x), that is f(x) = g′(x), use the above approximation to form an approximation for g′′(x_i). How does this compare with that approximation given in lectures?

Numerical Errors

8. Download the Matlab program MatlabEps.m. This is a simple program that you can use to determine the number of decimal digits that Matlab uses. Study the program, run it and then determine how many decimal digits are used by Matlab.

9. Verify the calculations given in lectures that showed that the optimal choice for h in

       df/dx ≈ ( f(x + h) − f(x) ) / h

   was h = O(10^{−N/2}) where N is the number of decimal digits used by the computer.

10. Using the same method as used in the previous question, determine the optimal choice of h for the approximation

        d^2f/dx^2 ≈ ( f(x + h) − 2f(x) + f(x − h) ) / h^2

    How many decimal digits of accuracy can you expect (for the optimal choice of h)?

11. Suppose someone decided to build a computer in which all real numbers were approximated by fractions such as a/b where a and b are integers. Suppose this computer can store N decimal digits for each of a and b. Formulate a model of the round-off error for this computer. Contrast this against a standard computer in which each real number is stored in 2N decimal digits. In answering this question you can assume that all the real numbers lie between 0 and 1.

Improved Euler method

12. Copy the sample program ImpEuler.m from the subject home page. Use this to explore some of the claims made in lectures (a minimal improved-Euler step is also sketched after this list, if you want something to compare against).

    (a) Plot solutions for the ODE dy/dx = ky for various choices of k and initial conditions (x, y)_0.

    (b) Is the system stable? Try different values of the step length.

    (c) Plot the solutions for dy/dx = y + 2e^{−x} for various choices of initial conditions. What do you observe? (Try the case y(0) = −1).

    (d) Modify your code to return the error as a function of x.

    (e) Verify that the global discretization error is O(h^2).

    (f) Use this fact to derive a higher order scheme (hint: Richardson extrapolation).

    (g) Modify your code to implement your scheme and verify that its global discretization error is what you expect.
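
    For comparison with ImpEuler.m, one improved-Euler (Heun) step applied to dy/dx = ky looks like this (a minimal sketch; k, the initial condition and the step length are example choices only):

        k = -1.0;
        f = @(x,y) k*y;                     % right-hand side of the ODE
        x = 0;  y = 1;                      % example initial condition
        h = 0.1;                            % step length
        for i = 1 : 50
            s1 = f(x, y);                   % slope at the start of the step
            s2 = f(x + h, y + h*s1);        % slope at the Euler-predicted end point
            y  = y + 0.5*h*(s1 + s2);       % improved Euler update
            x  = x + h;
        end
        disp([x, y, exp(k*x)]);             % numerical solution vs the exact solution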



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 9

Richardson extrapolation

1. In lectures we approximated 2π by the total sum of the chord lengths of a unit circle subdivided into many equal parts. A similar calculation can be made using the areas of the triangles formed by the circle's centre and each chord. Do the calculations for the first few subdivisions, then apply two levels of Richardson extrapolation. You will need to develop a formula for the error terms in the approximation.

2. The finite difference approximations

       df(x)/dx     ≈ ( f(x + h) − f(x − h) ) / (2h)

       d^2f(x)/dx^2 ≈ ( f(x + h) − 2f(x) + f(x − h) ) / h^2

   are known to have a leading truncation error of O(h^2). Use this information to obtain higher order approximations.

3. Suppose that you have established the following series expansion

       f̃(h) = f + Ah^2 + Bh^4 + Ch^6 + · · · + Dh^{2n} + · · ·

   where f̃(h) is an approximation to some exact quantity f and h is some freely chosen (small) parameter. Let F_n be the result of n applications of the Richardson extrapolation method. Derive a general formula that links F_n to F_{n−1}.

4. Suppose that N(h) is an approximation to some quantity M and that for every h > 0 we have

       M = N(h) + ah + bh^2 + ch^3 + · · ·

   where a, b, c, · · · are numbers that do not depend on h. Clearly as h → 0 we have N(h) → M. Thus the N(h) for various h can be used as approximations to M. This much is standard Richardson extrapolation. Now use the values N(h), N(h/3) and N(h/9) to produce an O(h^3) approximation for M.

5. Here we will start with the equation

       e = lim_{h→0} ( (2 + h)/(2 − h) )^{1/h}

   and use this as a basis to estimate e. The adventurous might like to prove this equation.

   (a) Define

           N(h) = ( (2 + h)/(2 − h) )^{1/h}

       and use this to compute N(h) for h = 0.04, 0.02 and 0.01.

   (b) Assume that e = N(h) + ah + bh^2 + ch^3 + · · · for some a, b, c, · · ·. Use standard Richardson extrapolation to compute the best estimate for e (a small sketch of this step is given after this question).

   (c) Show that N(−h) = N(h). What does this tell you about a, b, c, · · ·?

   (d) Use part (c) to show that e = N(h) + bh^2 + dh^4 + fh^6 + · · · and hence rework your Richardson extrapolation. What best estimate do you now get for e?

   (e) Which answer, part (b) or part (d), is the most accurate? Why?
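
   For parts (a) and (b), the first two levels of extrapolation might be computed as follows (a minimal sketch, assuming the error series starts at the ah term as stated in part (b)):

       N  = @(h) ((2+h)./(2-h)).^(1./h);     % N(h) from part (a)
       h  = [0.04 0.02 0.01];
       R1 = N(h);                            % N(0.04), N(0.02), N(0.01)
       R2 = 2*R1(2:3) - R1(1:2);             % remove the a*h term: 2N(h/2) - N(h)
       R3 = (4*R2(2) - R2(1))/3;             % remove the b*h^2 term
       disp([R1, R2, R3] - exp(1));          % the error at each stage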

Numerical integration

6. Download the Matlab programs Romberg.m and Trapezoidal.m. Both of the programs produce numerical estimates for I = ∫_0^1 4/(1 + x^2) dx. What is the exact value for I? Study the Romberg.m code. As you will see it computes successive columns in the Romberg table. Modify the code so that it computes the table row by row. Which version, row by row or column by column, would be better suited to an automatic integration package (i.e. given an integral and a desired accuracy the package returns an estimate of the integral)?



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 10

Random numbers

1. In Matlab you can create uniform random numbers using the Matlab command rand. Here is an example,

       z = -1 + 2*rand(2000,1);   % create 2000 uniform random nums from -1 to 1
       hist(z,100);               % draw a histogram with 100 bins

   In this example the Matlab command rand(2000,1) creates 2000 uniform random numbers in the range 0 to 1. So to create N uniformly distributed random numbers in the interval [a, b] you can use a+(b-a)*rand(N,1).

   Use the above code to display various histograms for 10, 100, 1000 etc. random numbers. Do these histograms look reasonable to you?

2. Here is another way to create random numbers in the interval 0 < x < 1. Define x_i by

       x_i = frac( (π + x_{i−1})^5 ) ,   i = 1, 2, 3, · · ·

   where the function frac(x) returns just the decimal part (e.g. 0.245 = frac(73.245)).

   (a) Modify the Matlab code from the previous question to compute this sequence. In Matlab you can compute the fractional part of x by using x − fix(x).

   (b) Does this sequence look random to you?

3. Here is a short Matlab program that creates a uniform distribution of points inside a square box (in the region −1 < x < 1 and −1 < y < 1).

       z = -1 + 2*rand(2000,1);   % create 2000 uniform random nums from -1 to 1
       x = z(1:2:1999);           % x = odd entries in z
       y = z(2:2:2000);           % y = even entries in z
       plot(x,y,'.');             % plot the (x,y) points

   Modify the code so as to produce uniform random points inside the unit circle x^2 + y^2 = 1. Re-draw the plot and confirm that the pattern of points still looks uniform (to your eye).
Re-draw the plot and confirm that the pattern of points still looks uniform (<strong>to</strong> your eye).



Golden Search optimisation

4. Download the Matlab program MinGolden.m. This is a simple Matlab program that applies the Golden Search algorithm to minimise a given function. Run the program (answer = MinGolden(-0.3,0.1,0.3)). At each iteration you should see the three active points plotted in red on the curve. The blue square is the new fourth point. To continue through the iterations press the return key. You can now run your own experiments. Try different starting values. What happens if you break the constraint on the three points (that there is a true minimum in the interval)? Try changing the logic used to create the fourth point (e.g. try using the mid-point). Try other functions to minimise (e.g. non-continuous functions such as tan^{−1}).

5. Suppose the three points are a, b and c with a < b < c, and suppose further that the true minimum x of f(x) lies somewhere in the interval a < x < c. Thus the error in setting x = b is no greater than |c − a|. This is called an error bound for the minimum. Find a similar error bound after n iterations of the Golden Search algorithm.

Genetic Algorithms

On the MUSO webpage you will find a set of Matlab functions that implements a simple Genetic Algorithm. They are all quite short and should be fairly easy to read. As an example type GAmain(30,20) in Matlab. This will use a population of 30 individuals and run over 20 generations.

6. Run the program a few times. What do you observe? Does the population always find the true minimum? Try using different values for the population size and the number of generations. What do you observe? Is it better to start with a large population and run for a few generations or a small population run over many generations?

7. Modify the main program GAmain.m so that it also plots a graph of the average fitness as a function of the generation. What type of curve do you observe? What does this tell you about the rate of convergence as a function of the number of generations? Can you (roughly) explain this behaviour?

8. In the standard Genetic Algorithm the next generation is built by adding successive pairs of children from each pair of parents. However other strategies are possible. For example, we could say that of the four individuals consisting of the two parents and the two children we should select the two fittest individuals to be passed into the next generation. Modify your main program to implement this strategy. How well does it work?

9. You may have observed that sometimes the population converges to a point which is not the true minimum. This occurs when all of the genes are identical and thus subsequent generations will be identical to previous generations. Somehow we need to kick the population so that the individuals are not all clones of themselves. This same problem is overcome in biological systems by the process of mutation. As each gene is copied from the parent to the child, small errors are introduced. Thus if both parents were clones of each other then the children will not be clones of their parents. This introduces a small element of random variation in the population. This process is known as mutation.

   We can adapt this idea to our genetic algorithm as follows. After each child is created we scan the gene for that child. At each position in the gene we look at the binary bit. If it's a 1 we flip it to a 0 with a very small probability. If it's a 0 we likewise flip it to a 1 with the same small probability. We do this across the whole length of the gene for each child. Your task is to implement this scheme. You should choose the mutation probability to be small, about 0.05. You will need to use the Matlab random number generator rand. The statement if rand < 0.05 will be true when the random number is less than 0.05.

   Modify the Matlab programs to implement mutation. What do you observe? Does the evolution stall? What price do we pay in making this change (i.e. how does this change affect the accuracy in our final estimate for the minimum)?
effect the accuracy in our final estimate for the minimum)?<br />

