
MTH3051
Introduction to Computational Mathematics

Lecture notes

Clayton Campus, 2014



School of Mathematical Sciences, Monash University

Contents

1. Computing π
   1.1 A slice of π
   1.2 A formula for π
   1.3 Programming
   1.4 A flood of formulae
   1.5 Mnemonics for π
   1.6 Some useful references on π

2. An Introduction to Programming and Matlab
   2.1 Introduction
   2.2 A quadratic equation
       2.2.1 Why IF (a == 0) won't work
   2.3 A finite series
   2.4 Matrix multiplication
   2.5 An infinite series

3. Truncation and Round-off Errors
   3.1 Order estimates
   3.2 Absolute and relative errors
   3.3 Truncation errors
   3.4 Round-off errors
   3.5 Understanding round-off errors
   3.6 Examples
   3.7 Addition
   3.8 Subtraction
   3.9 Division

4. Solutions of Equations in One Variable
   4.1 Introduction
   4.2 Fixed point iteration
       4.2.1 Why the funny name?
       4.2.2 Fixed point in pictures
       4.2.3 Cyclic Fixed point iterations
       4.2.4 Convergence
       4.2.5 Programming notes
       4.2.6 A Matlab program
   4.3 Newton-Raphson iteration
       4.3.1 Some notation
       4.3.2 Newton-Raphson for multiple roots
       4.3.3 Cycling
   4.4 Interval methods
       4.4.1 Half Interval Search
       4.4.2 False Position
   4.5 Summary

5. Solving Systems of Linear Equations
   5.1 Introduction
   5.2 Gaussian elimination with back substitution
       5.2.1 Pivoting
       5.2.2 Tri-diagonal systems
       5.2.3 Round-off Errors and Pivoting
   5.3 Ill-conditioned systems
   5.4 Operational cost
   5.5 Iterative methods
       5.5.1 Jacobi iteration
       5.5.2 Gauss-Seidel iteration
       5.5.3 Diagonal dominance and convergence
   5.6 Operational counts

6. Solving Systems of Nonlinear Equations
   6.1 Introduction
   6.2 Generalised Fixed Point iteration
       6.2.1 Convergence
       6.2.2 Matlab example
   6.3 Generalised Newton-Raphson
       6.3.1 Matlab code

7. Interpolation and Approximation of Data
   7.1 The what and why of interpolation
   7.2 Lagrangian interpolation
       7.2.1 The whole polynomial or just its value?
       7.2.2 Matlab code
   7.3 Newton polynomials
       7.3.1 Horner's form of the Newton polynomial
   7.4 Uniqueness
   7.5 Piecewise polynomial interpolation
       7.5.1 Cubic Splines
       7.5.2 Example
   7.6 Non-polynomial interpolation
       7.6.1 Fourier series
       7.6.2 Estimating the Fourier coefficients
       7.6.3 Example
   7.7 Approximating functions
       7.7.1 Example
       7.7.2 Least Squares
       7.7.3 Generalised least squares
       7.7.4 Variations on a theme
       7.7.5 Applications
       7.7.6 Notes
       7.7.7 Matlab example

8. Extrapolation Methods
   8.1 Richardson extrapolation
       8.1.1 Example – computing π
       8.1.2 Conclusion

9. Numerical integration
   9.1 Introduction
   9.2 The Left and Right hand sum rules
       9.2.1 The Left Hand Rule
       9.2.2 The Right Hand Rule
       9.2.3 The Mid Point rule
   9.3 The Trapezoidal rule
       9.3.1 Choices, choices, so many choices
   9.4 Simpson's rule and Romberg integration
   9.5 Simpson's rule
       9.5.1 Example
   9.6 Romberg integration
       9.6.1 Example

10. Numerical differentiation
    10.1 Introduction
    10.2 Finite differences
        10.2.1 First derivatives
        10.2.2 Second derivatives
    10.3 Truncation and round-off errors
        10.3.1 Example – Forward finite differences
        10.3.2 Example – Centred finite differences

11. Numerical Solutions of Ordinary Differential Equations
    11.1 Introduction
    11.2 Initial value problems
    11.3 Euler's method
    11.4 Improved Euler scheme
    11.5 Taylor series method
    11.6 Runge-Kutta schemes
        11.6.1 Second Order Runge-Kutta
        11.6.2 Fourth Order Runge-Kutta
    11.7 Error analysis
        11.7.1 Discretization errors
        11.7.2 Local discretization error
        11.7.3 Global Discretization Error
        11.7.4 Numerical results
    11.8 Stability
        11.8.1 Example 1
        11.8.2 Example 2
        11.8.3 Example 3
        11.8.4 Example 4

12. Optimisation
    12.1 Golden Search
        12.1.1 Example
    12.2 Steepest descent
    12.3 Genetic Algorithms
        12.3.1 Selection
        12.3.2 Breeding
        12.3.3 Example

13. Random numbers
    13.1 Uniform random numbers
    13.2 Non-uniform random numbers


1. Computing π



1.1 A slice of π

Everybody knows that 22/7 is only an approximation to π. Here is a better approximation, accurate to 200 decimal places.

π ≈ 3.14159265358979323846264338327950288419716939937510582
      09749445923078164062862089986280348253421170679821480
      86513282306647093844609550582231725359408128481117450
      28410270193852110555964462294895493038196...

One Freddo frog for the first person who can recite this, from memory, in class!

Some questions come to mind.

◮ Why would we want so many decimal digits?
◮ How was the above approximation obtained?
◮ How might we get better approximations?

1.2 A formula for π

To compute π we need a formula. Here is one simple formula,

\[
\pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots + \frac{(-1)^{k+1}}{2k-1} + \cdots\right)
    = 4\sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{2k-1}
\]

This suggests a scheme for approximating π – terminate the infinite series at some chosen term, say n,

\[
\pi \approx S_n = 4\sum_{k=1}^{n} \frac{(-1)^{k+1}}{2k-1}
\]

We call this an algorithm for π.

The big question is: How well does this algorithm work? Here is a table of successive approximations.

    n       S_n               |S_n - π|
    1       4.000000000e+00   8.584e-01
    10      3.041839619e+00   9.975e-02
    100     3.131592904e+00   1.000e-02
    1000    3.140592654e+00   1.000e-03
    10000   3.141492654e+00   1.000e-04


What do we observe? First (and most importantly) it appears that our successive approximations are converging to π. Second, each extra digit of accuracy requires a 10-fold increase in the number of terms. This is extremely inefficient – to recover the above 200 digits of π would require over 10^200 terms. Clearly we need a better algorithm.

This time we will use

\[
\pi \approx S'_n = \left(12\sum_{k=1}^{n} \frac{(-1)^{k+1}}{k^2}\right)^{1/2}
\]

for which we find

    n       S'_n              |S'_n - π|
    1       3.464101615e+00   3.225e-01
    10      3.132977195e+00   8.615e-03
    100     3.141498114e+00   9.454e-05
    1000    3.141591700e+00   9.540e-07
    10000   3.141592644e+00   9.548e-09

Again we notice that the series converges (good!) and that now we get two extra digits of accuracy for every 10-fold increase in the number of terms. This is an improvement (but not enough to tackle the 200-digit calculation, which requires much more sophisticated algorithms than we have time to explore).

The point to take home from this pair of examples is that you may need to trawl through various formulae, all mathematically equivalent, to find an algorithm that is efficient and accurate – we want an accurate answer with a minimum of computation. Much of what we will do in this subject is to search for suitable algorithms for various mathematical tasks.

1.3 Programming

How were the above tables generated? For small values of n we could imagine doing the computations by hand, but for large values our patience might wear thin. Clearly we need a way to automate the process. This is where computers come into the picture. The simple idea is that we provide the computer with a set of instructions which it faithfully executes, and out pop our answers. Here are the instructions that were used for the first algorithm.


n = 100;                  % set number of terms
sum = 0;                  % set initial value for sum
sign = 1;                 % an integer +/- 1
for k = 1 : n             % loop over k from 1 to n
   term = sign/(2*k-1);   % compute the term
   sum = sum + term;      % update rolling sum
   sign = - sign;         % flip the sign
end;
disp(n);                  % print n
disp(4*sum);              % print the approximation to pi
disp(4*sum-pi);           % print the error

This is an example of Matlab syntax. Matlab is a programming language well suited to numerical computations. There are many other languages such as Fortran, C, Java, Maple and Mathematica. They all have a similar flavour and they all serve the one purpose of getting the computer to do useful work for us. We will use Matlab throughout this course as it is (arguably) the easiest to learn for someone with no programming experience. But don't let this stop you from learning about other programming languages (in your spare time).

In reading the above code you must keep in mind one extremely important fact: the equals sign is not what you might expect it to be. The computer will (usually) treat the equals sign as a replacement operator. Thus in executing a line like x = 2*y the computer will first evaluate the right-hand side and then assign that value to the left-hand side. Whatever value x had before will be wiped out. Its value after the line is executed will be 2*y. This use of the equals sign allows us to write lines like x = x + 1 to increment the current value of x by 1. In contrast, if you showed a mathematician a line like x = x + 1 he or she would look at you very very strangely (why?).

The above Matlab code is fairly easy to read. The first three lines set initial values to various symbols (aka variables). Then we encounter a for-loop. Matlab will repeat the code in this loop for each value of k from 1 to n in strict order. Each time through this loop we are calculating one term in the series. After the loop finishes we print out the various numbers that interest us (using the Matlab command disp, which is short for display).

If the above code baffles you then one way to understand it is to pretend you are the computer and follow the Matlab commands. Get out a pencil and paper and start following the instructions. Work your way through the first 5 or so terms. You should see that it is correct and does compute the series.


Example 1.1

Modify the above Matlab code to use the second algorithm for π. You'll be able to test your code later in your tutorial classes.
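For reference, here is a minimal sketch (an addition to these notes, not part of the original text) of what such a modification might look like; the number of terms n = 100 is an arbitrary choice.

n = 100;                    % set number of terms
sum = 0;                    % rolling sum of the series
sign = 1;                   % alternating sign, +/- 1
for k = 1 : n
   term = sign/(k*k);       % k-th term (-1)^(k+1)/k^2
   sum = sum + term;
   sign = - sign;
end;
approx = sqrt(12*sum);      % S'_n = (12*sum)^(1/2)
disp(approx);               % print the approximation to pi
disp(abs(approx - pi));     % print the error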

1.4 A flood of formulae

Mathematicians have a deep rooted love for π and not surprisingly they have derived many varied formulae for π. Here, just for the curious, are some other formulae that you might like to play with.

Many of these use the following infinite series for arc-tan,

\[
\tan^{-1}(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \cdots + (-1)^k\frac{x^{2k+1}}{2k+1} + \cdots
\]

\[
\pi = 4\tan^{-1}(1)
\]
\[
\pi = 16\tan^{-1}(1/5) - 4\tan^{-1}(1/239)
\]
\[
\pi = 16\tan^{-1}(1/5) - 4\tan^{-1}(1/70) + 4\tan^{-1}(1/99)
\]
\[
\frac{\pi^2}{6} = \frac{1}{1^2} + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \cdots
\]
\[
\frac{\pi^3}{32} = \frac{1}{1^3} - \frac{1}{3^3} + \frac{1}{5^3} - \frac{1}{7^3} + \cdots
\]
\[
\frac{\pi}{2} = \frac{2 \times 2 \times 4 \times 4 \times 6 \times 6 \times 8 \cdots}{1 \times 3 \times 3 \times 5 \times 5 \times 7 \times 7 \cdots}
\]
\[
\frac{2}{\pi} = \sqrt{\frac{1}{2}} \times \sqrt{\frac{1}{2} + \frac{1}{2}\sqrt{\frac{1}{2}}} \times \sqrt{\frac{1}{2} + \frac{1}{2}\sqrt{\frac{1}{2} + \frac{1}{2}\sqrt{\frac{1}{2}}}} \times \cdots
\]
\[
\frac{4}{\pi} = 1 + \cfrac{1^2}{2 + \cfrac{3^2}{2 + \cfrac{5^2}{2 + \cfrac{7^2}{2 + \cdots}}}}
\]
\[
\frac{1}{\pi} = 12\sum_{k=0}^{\infty} \frac{(-1)^k\,(6k)!}{(k!)^3\,(3k)!}\,
               \frac{13591409 + 545140134k}{640320^{3(2k+1)/2}}
\]
\[
\pi = \sum_{k=0}^{\infty} \left(\frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6}\right)\left(\frac{1}{16}\right)^{k}
\]
\[
\pi = \lim_{k\to\infty} f_k\,, \quad\text{where}\quad f_k = f_{k-1} + \sin(f_{k-1})\,,\quad f_0 = 1
\]
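As a taste of how much faster some of these converge, here is a small sketch (not part of the original notes) that evaluates Machin's formula π = 16 tan⁻¹(1/5) − 4 tan⁻¹(1/239) using the arc-tan series above, truncated after n terms; n = 10 is an arbitrary choice.

n = 10;
s1 = 0;                   % arc-tan series for x = 1/5
s2 = 0;                   % arc-tan series for x = 1/239
for k = 0 : n-1
   s1 = s1 + (-1)^k * (1/5)^(2*k+1) / (2*k+1);
   s2 = s2 + (-1)^k * (1/239)^(2*k+1) / (2*k+1);
end;
approx = 16*s1 - 4*s2;
disp(abs(approx - pi));   % already close to machine precision for n = 10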


1.5 Mnemonics for π

Here are a few simple mnemonics that people have used to help memorise the decimal digits of π. Each works by taking the number of letters in each word as the value of the digit at that point in the decimal expansion of π.

How I wish I could calculate pi

How I like a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.

Sir, I bear a rhyme excelling
In mystic force and magic spelling
Celestial sprites elucidate
All my own striving can't relate
Or locate they who can cogitate
And so finally terminate. Finis.

1.6 Some useful references on π

http://www.joyofpi.com
http://mathworld.wolfram.com/PiFormulas.html



2. An Introduction to Programming and Matlab



2.1 Introduction

One of the main hurdles that new-comers to numerical methods face is how to convert a mathematical equation like

\[
\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} + \cdots = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!}
\]

into a Matlab program, such as the following

% --- set initial values ----------------------------------------
k = 0;
x = 1.23;
sum = x;
term = x;
k_max = 100;
looping = true;
x_square = x*x;

% --- compute successive terms in the series --------------------
while looping
   k = k + 1;
   term = - term*x_square/( (2*k+1)*(2*k) );
   sum = sum + term;
   if ( k >= k_max )
      looping = false;
   end
   if ( abs(term) < 0.00001 )
      looping = false;
   end
end

answer = sum;

The aim of this set of notes is to show you how to make the transition from Maths to Matlab. This will hardly be an exhaustive study but it will give you a kick start (from which you can go on to scale great heights!).

In each of the following examples we will start by writing, usually in just one line of code, the heart of the mathematics in basic Matlab form. We will then look at this code and ask two basic questions


◮ What more do we need to make this code work?
◮ What can go wrong with this code?

Both questions are very important. Their answers will force you to add extra Matlab code and in this way you will build a complete working program. What began as one line may turn out to be dozens of lines.

Once you have a working program you might also want to ask a third question,

◮ What improvements can we make to the code?

This covers a wide raft of issues, such as efficiency, readability, utility and so on.

2.2 A quadratic equation

Given the quadratic equation

\[
0 = ax^2 + bx + c
\]

we know that the two roots are given by

\[
r_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}\,, \qquad r_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}
\]

This can be written in Matlab as

r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;

Notes

◮ Matlab executes each line in turn, from top to bottom.
◮ Lines that end with ; are silent – Matlab prints nothing as it executes that line.
◮ Multiplication is denoted by the * symbol.
◮ The symbol ^ denotes exponentiation (raising to a power).
◮ Names like a, b, c etc. are known as Matlab variables.
◮ Variables have values. These values may change during the program's execution.
◮ Matlab reads the right hand side, computes a value, then assigns it to the variable on the left hand side of the equals sign.


What more do we need to make this code work?

We need values for a, b and c. This is easy – we simply include lines like a = 2; b = 1; c = 5; before the above pair of lines.

What can go wrong with this code?

We might encounter complex roots. Though Matlab can handle complex numbers (without any fuss) we'll declare (for this example) that complex roots are forbidden. So we need to avoid computing the square root when b² − 4ac < 0. This we do by using an if statement.

Here is our updated code.

a = 2; b = 1; c = 5;
if ( b*b - 4*a*c >= 0 )
   r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
   r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;
end

Notes

◮ The ; also allows us to put more than one expression on each line.
◮ The if (...) and end lines define the block of lines to be executed only when b² − 4ac is greater than or equal to zero.

If we left the code as it is we could run into another problem further down the track. How so? Well, we have deliberately chosen (or forgotten?) to not set the values for r_1 and r_2 in the case of complex roots. But what would happen if we tried to use r_1 and r_2 later, in some other part of the code? Heaven only knows what values would be used (never assume that a variable starts with an initial value of zero). Clearly we need to provide values for r_1 and r_2 for all possible cases. Thus we modify the if statement as follows

a = 2; b = 1; c = 5;
if ( b*b - 4*a*c >= 0 )
   r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
   r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;
else
   r_1 = 0;
   r_2 = 0;
end

Note that setting r_1 = 0 and r_2 = 0 is mathematically wrong but at least we can now proceed with known values for r_1 and r_2. We could also use these zero values as


an indicator to other parts of the program that we have a special case to consider (i.e. complex roots).

2.2.1 Why IF (a == 0) won't work

Once again we ask: What can go wrong with this code? We've already handled the case of complex roots but we have yet to deal with the glaring problem that arises when a = 0.

Blindly running the above Matlab code with a = 0 will surely be very disappointing! So in despair we return to the mathematics and quickly realise that when a = 0 our quadratic actually reduces to the simple linear equation

\[
0 = bx + c
\]

for which the single solution is x = −c/b (assuming b ≠ 0). With that in mind we might modify our Matlab code to look like

a = 2; b = 1; c = 5;
if ( a ~= 0 )
   if ( b*b - 4*a*c >= 0 )
      r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
      r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;
   else
      r_1 = 0;
      r_2 = 0;
   end
else
   r_1 = -c/b;
   r_2 = r_1;
end;

But this too may give you heartache – why? Because small round-off errors may push a slightly away from zero. Thus even though a should be exactly zero it might be stored as some small number such as 1.0 × 10⁻¹². Even worse – it may be stored as a negative number! The solution is to compare a against some pre-chosen small number as in the following example


a = 2; b = 1; c = 5;
if ( abs(a) > 1e-10 )
   if ( b*b - 4*a*c >= 0 )
      r_1 = ( -b + (b*b - 4*a*c)^(1/2) )/(2*a) ;
      r_2 = ( -b - (b*b - 4*a*c)^(1/2) )/(2*a) ;
   else
      r_1 = 0;
      r_2 = 0;
   end
else
   r_1 = -c/b;
   r_2 = r_1;
end;

The message here is: never test two real numbers for equality. Thus if the mathematics tells you something special happens when p = q (for example) then in your Matlab code you should never use an if statement like

if (p == q)
   ...
end

but rather

if abs(p - q) < very_small
   ...
end

There still remains the niggling problem of how to handle the case where both a and b are zero. Here I invoke the classic excuse of a lecturer – this case is left as an exercise for the student.
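To see why such a tolerance matters in practice, here is a tiny illustration (an addition, not part of the original notes; the tolerance 1e-10 is an arbitrary choice):

a = 0.1 + 0.2;
disp(a == 0.3);               % prints 0 (false) - the stored values differ slightly
disp(abs(a - 0.3) < 1e-10);   % prints 1 (true)  - comparison with a tolerance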

2.3 A finite series

How would you get Matlab to compute the following sum?

\[
S = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots + \frac{1}{100}
\]

This seems simple enough, just add up 100 numbers. Too easy really (agreed?). If you were to compute this by hand (always a good place to start when writing computer code) you most probably would start with the first term, add on the second term, then the


third term and so on, stopping only after adding on 1/100. The heart of the calculation looks like this

term = ...;
sum = sum + term;

where the three dots denote the typical number (e.g. 1/47). Once again we ask: What more do we need to make this code work? Clearly we need

◮ A rule for computing each term (the three dots),
◮ A mechanism for stepping through all the numbers from 1 to 100, and
◮ An initial value for the variable sum.

Here is a Matlab program that does the job

sum = 0;
for num = 1:100
   term = 1/num;
   sum = sum + term;
end;
answer = sum;

The main new construct in this code is the for loop. The loop begins with the keyword for and ends with the line end;. The contents of the loop are then repeatedly executed for the values of num from 1 to 100 (in strict sequence!). Once the loop is finished (i.e. after all 100 numbers have been added to sum) Matlab will continue execution on the line directly following the loop; in this case it assigns (copies) the value of sum to the variable answer.

This structure is very common. It contains a section that initialises some data (sum = 0), followed by a repeated set of calculations (the for loop) and then a section that records the answers (answer = sum).

You could ask the other popular question What can go wrong with this code? but I think you'll see that it's bullet proof (for which we rejoice as this is rarely the case).
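As an aside on the third question (what improvements can we make?), Matlab's element-wise operations let us write the same sum without an explicit loop. This is just an alternative sketch, not part of the original notes:

answer = sum( 1 ./ (1:100) );   % 1./(1:100) is the vector [1, 1/2, ..., 1/100];
                                % note this uses the built-in function sum, so the
                                % variable called sum above must not shadow it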

2.4 Matrix multiplication

Given an n × p matrix U and a p × m matrix V their product UV is an n × m matrix. We will use subscripts like ij to denote the entry in row i and column j. Then we have

\[
(UV)_{ij} = \sum_{k=1}^{p} U_{ik} V_{kj}\,, \qquad 1 \le i \le n\,, \quad 1 \le j \le m
\]


For a single entry in the product matrix (at (i,j) for example) we need to compute a finite sum of products. This is easy to write in Matlab form

sum = 0;
for k = 1:p
   sum = sum + U(i,k)*V(k,j);
end
UV(i,j) = sum;

This needs to be repeated for all choices of i and j and that is best done using a pair of for loops. This leads to

for i=1:n
   for j=1:m
      sum = 0;
      for k = 1:p
         sum = sum + U(i,k)*V(k,j);
      end
      UV(i,j) = sum;
   end
end

It may please you to know that Matlab is matrix savvy – it knows how to directly multiply matrices. Thus the above could also be written simply in one single statement UV = U*V. This applies to any pair of matrices (provided they have suitable sizes). Thus if A is a 3 element column vector and B is a 3 element row vector then B*A is a single number (the dot product of the vectors) while A*B is a new 3 × 3 matrix. Why am I telling you this now? Because there will be many many times when you will need to access parts of matrices in ways that might not be possible without using explicit index notation.
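A quick way to convince yourself that the triple loop above really does compute the matrix product is to compare it against Matlab's built-in multiplication. This is a small sketch, not part of the original notes; the sizes n, p, m are arbitrary choices:

n = 3; p = 4; m = 2;
U = rand(n,p);                 % random test matrices
V = rand(p,m);
UV = zeros(n,m);
for i = 1:n
   for j = 1:m
      s = 0;
      for k = 1:p
         s = s + U(i,k)*V(k,j);
      end
      UV(i,j) = s;
   end
end
disp( max(max(abs( UV - U*V ))) );   % difference is zero or at round-off level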

2.5 An infinite series

We (should) know that

\[
\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} + \cdots = \sum_{k=0}^{\infty} (-1)^k \frac{x^{2k}}{(2k)!}
\]

This is an infinite series and that presents us with our first challenge – how does Matlab cope with an infinite number of terms? The simple answer is that it can't, so we are forced to approximate the infinite series, for example with a finite series, such as the

first 1001 terms,

\[
\cos(x) \approx 1 - \frac{x^2}{2!} + \frac{x^4}{4!} + \cdots + \frac{x^{2000}}{2000!} = \sum_{k=0}^{1000} (-1)^k \frac{x^{2k}}{(2k)!}
\]

This is now similar to our previous example of a finite series (although somewhat more challenging). As before we take the approach that in our Matlab code we will compute the sum term by term. Let's suppose Matlab has computed the first k − 1 terms and recorded the sum (so far) in the variable sum. The next step would be to add the next term (−1)^k x^{2k}/(2k)! to sum. For this we might propose a Matlab fragment like

sum = sum + ( (-1)^k )*( x^(2*k) )/( (2*k)! );

What more do we need to make this code work?

Clearly we need a value for x. We also need to set an initial value for sum and we need to run through the allowed values of k.

What can go wrong with this code?

It is an infinite sum and as we do not want the computer to run forever we need to keep track of how many terms we have computed. If that exceeds a predefined limit we should then terminate the computations (taking the last value of sum as the best approximation to the infinite series).

With these variations in mind we now propose

x = 1.23;
sum = 1;
for k = 1:1000
   sum = sum + ( (-1)^k )*( x^(2*k) )/( (2*k)! );
end
answer = sum;

And yet there remains one major problem – Matlab does not understand the factorial symbol "!". So we need to compute it ourselves. We know that

\[
n! = 1 \times 2 \times 3 \times 4 \times \cdots \times n
\]

This is similar to what we have already been playing with, but rather than taking a sum of numbers here we have to compute a product of numbers. Thus it's not hard to see that n! could be computed using

fact = 1;
for num = 1:n
   fact = fact * num;
end


We can use this fragment to compute (2k)! and so our code now looks like

x = 1.23;
sum = 1;
for k = 1:1000
   fact = 1;
   for num = 1:(2*k)
      fact = fact * num;
   end
   sum = sum + ( (-1)^k )*( x^(2*k) )/( fact );
end
answer = sum;

Though this program will work it is worth asking if it's the best we can do. Surprise, surprise – we can do far better. But what is it that we see as being problematic? (shades of "if it ain't broke, don't fix it").

◮ Do we really need to compute 1000 terms of the series?
◮ Can Matlab accurately compute both x^(2k) and (2k)! for large k?

The simple answer to both questions is no. What do we do? Here is a neat trick that deals with the second objection (we will deal with the first objection a little later on). A typical pair of terms in the infinite series are

\[
a_k = (-1)^k \frac{x^{2k}}{(2k)!}\,, \qquad a_{k-1} = (-1)^{k-1} \frac{x^{2k-2}}{(2k-2)!}
\]

and thus

\[
a_k = -a_{k-1} \frac{x^2}{(2k)(2k-1)}
\]

Thus each new term in our infinite series can be generated from the previous term by this simple formula. It's clearly very easy to compute and much more efficient than our previous formula. But to use this we need an initial value, a_0. That is not hard to determine – we choose it so that we get the correct value for a_1. That is, a_0 = 1. We will use term as the Matlab variable for both a_{k−1} and a_k. Then our Matlab code for cos(x), with blank lines added for clarity, can be streamlined to


x = 1.23;
sum = 1;
term = 1;

for k = 1:1000
   term = - term*(x*x)/( (2*k)*(2*k-1) );
   sum = sum + term;
end

answer = sum;

This is a significant improvement – we have only one loop (rather than two) and the computations are simple and unlikely to cause problems (no large numbers such as (2k)!). But we still have the crazy situation of always computing 1000 terms. If we are looking for an answer that is accurate to say five decimal places then that might occur at k = 37 for example. That is, when the (absolute) value of term is less than 0.00001 (five decimal places) we should bail out of the loop. Here is one way to do that

x = 1.23;
sum = 1;
term = 1;

for k = 1:1000
   term = - term*(x*x)/( (2*k)*(2*k-1) );
   sum = sum + term;
   if ( abs(term) < 0.00001 )
      break
   end
end

answer = sum;

When the break command is executed Matlab will jump out of the surrounding loop and continue execution at the first line directly after the loop, in this case the final line answer = sum.

There is a more elegant way to achieve the same outcome and it uses a special type of variable known as a boolean variable. These take on just two values, true or false, and are often used to control the flow of the program. We shall modify the above code by introducing a new boolean variable looping which will be true only when we have not reached our target accuracy of five decimal places. We will need to set the value for


looping as each term is calculated. Here is the final code

% --- set initial values ----------------------------------------
k = 0;
x = 1.23;
sum = 1;
term = 1;
k_max = 100;
looping = true;
x_square = x*x;

% --- compute successive terms in the series --------------------
while looping
   k = k + 1;
   term = - term*x_square/( (2*k)*(2*k-1) );
   sum = sum + term;
   if ( abs(term) < 0.00001 )
      looping = false;
   end
   if ( k > k_max )
      looping = false;
   end
end

answer = sum;

There are quite a few changes introduced in the above code. We have replaced the for loop with a while loop. Thus we have also been forced to explicitly increment k (i.e. the line k = k + 1) and to limit the number of terms (hence the new variable k_max). Though the above code is longer than the previous version it is (in my opinion) easier to read and better conveys what we are actually doing (i.e. looping until we reach a given accuracy).

Compare this code with that given for sin(x) at the start of these notes. You will see that they are similar but they do have important differences (as they should, after all sin(x) and cos(x) are different functions). Note in particular the way term is calculated, and the initial values for k and sum. If you have any doubts that they are correct, one way to check is to follow the Matlab code (pretend you are the computer), writing out the loops for the first few terms. You will very quickly see that both programs as written are correct.



3. Truncation and Round-off Errors



3.1 Order estimates

Quite commonly, when analysing the performance of an algorithm, we find ourselves making statements about pairs of related numbers, say x, y(x), along the lines of

\[
y(x) = Ax^m + \text{terms in } x \text{ smaller than } Ax^m
\]

where A and m are some numbers (which we may or may not know). In this way we are isolating the dominant term Ax^m of y(x) for small values of x (i.e. |x| ≪ 1). The value of A is often of little importance and so we use a notation which draws our attention to the important number m, that is

\[
y = O\left(x^m\right)
\]

The formal definition is as follows. If lim_{x→0} y(x)/x^m = A ≠ 0 then we say y(x) = O(x^m) for |x| ≪ 1. In words, we say that y(x) is of order x^m for small x.
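As a quick illustration of this definition (an added example, not one of the exercises below), take y(x) = sin(x). Since

\[
\lim_{x\to 0} \frac{\sin(x)}{x} = 1 \neq 0
\]

we conclude that sin(x) = O(x) for |x| ≪ 1.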

Example 3.1

Show that y(x) = x + x² is O(x) for small x.

Example 3.2

Show that y(x) = 1 − cos(x) is O(x²) for small x.

Example 3.3

If u(x) = O(x³) and v(x) = O(x²) what can you say about u(x) + v(x) and u(x)v(x) for small x?

3.2 Absolute and relative errors

It's a sad fact that our computations are never perfect and they will carry (hopefully) small errors. There are two primary sources of error: truncation errors (which arise largely from our choice of algorithm) and round-off errors (introduced by the computer and much less in our control). Both of these will be discussed below. Whatever the nature of the error, we often speak of two ways in which to measure that error. These are known as absolute and relative errors and they are defined as follows. Suppose x̃ is our approximation to some exact number x. Then we define

◮ Absolute error: |x̃ − x|
◮ Relative error: |x̃ − x|/|x|
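For instance (an added illustration, not from the original notes), taking x = π and the approximation x̃ = 22/7:

x       = pi;
x_appr  = 22/7;
abs_err = abs(x_appr - x);       % approximately 1.26e-03
rel_err = abs_err/abs(x);        % approximately 4.02e-04
disp([abs_err, rel_err]);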


3.3 Truncation errors

The two algorithms

\[
\pi \approx S_n = 4\sum_{k=1}^{n} \frac{(-1)^{k+1}}{2k-1}\,, \qquad
\pi \approx S'_n = \left(12\sum_{k=1}^{n} \frac{(-1)^{k+1}}{k^2}\right)^{1/2},
\]

which we have seen provide convergent approximations to π, were each obtained by truncating the related infinite series. Not surprisingly the error incurred in doing so is known as the truncation error. This is one of a variety of errors that can enter our numerical computations (the other main source of error is known as round-off error which we will discuss soon). This type of error is of our own making and by choosing where to terminate the series or even choosing a different series we can control the size of the truncation error. The point is that the truncation error is introduced by our mathematical manipulations prior to turning to the computer or calculator.

Let us write E_t(n) for the truncation error for our first algorithm, that is

\[
E_t(n) = |S_n - \pi| = 4\left|\sum_{k=n+1}^{\infty} \frac{(-1)^{k+1}}{2k-1}\right|
\]

Our numerical results, that we get one extra decimal digit for every 10-fold increase in n, strongly suggest that E_t(n) must vary in proportion to 1/n. That is,

\[
E_t \approx \frac{A}{n}
\]

for some unknown constant A that does not depend on n. This is a numerical observation. Can we do any better than this? Yes.

Example 3.4

Using the data from the previous tables verify that |S_n − π| = O(n⁻¹) and |S'_n − π| = O(n⁻²).
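A small sketch of such a check (an addition, not part of the original notes): tabulate n|S_n − π| and n²|S'_n − π|; if the orders are right, both columns should settle down to constants as n grows.

for n = [10 100 1000 10000]
   k   = 1:n;
   Sn  = 4*sum( (-1).^(k+1) ./ (2*k-1) );          % first algorithm
   Spn = sqrt( 12*sum( (-1).^(k+1) ./ k.^2 ) );    % second algorithm
   fprintf('%6d  %8.4f  %8.4f\n', n, n*abs(Sn-pi), n^2*abs(Spn-pi));
end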

Example 3.5

Prove that the truncation error E_t(n) = |S_n − π| is bounded by

\[
\frac{2}{(n+2)^2} < E_t(n) < \frac{2}{n}
\]

This shows that for each ten-fold increase in n we can expect at least one extra decimal digit of accuracy but no more than two extra digits.


Example 3.6

The series S''_n = Σ_{k=1}^{n} (−1)^{k+1}/k² converges to π²/12. Show that for this series the truncation error E_t(n) = |S''_n − π²/12| is bounded by

\[
\frac{2}{(n+2)^3} < E_t(n) < \frac{1}{(n+1)^2}
\]

From this we would infer that we would get at least two (and no more than three) extra decimal digits for each 10-fold increase in n.

This kind of analysis is often used as a way of checking that our numerical calculations are on track (i.e. we are looking for consistent behaviour between our numerical and theoretical calculations, but be warned – simply observing consistency does not prove that our computer calculations are correct – that requires more work!).

3.4 Round-off errors

All information in a computer is stored in fixed length arrays or registers. This has significant consequences for the storage of decimal numbers such as 1/3 = 0.333···. This number has an infinite number of digits and so only a finite number of the leading digits (usually no more than 15) can be stored. The remaining digits must be discarded. This introduces a small error in storing the number. The important issue is to what extent this small error affects subsequent calculations. Errors of this kind are known as round-off errors and, depending on the nature of the following calculations, the round-off errors (which occur with every calculation) may remain small or they may accumulate and eventually swamp the calculation – at which point any further computation is meaningless.

Example 3.7

Working to 2, 4 and 8 decimal digits compute 2.34567 + 1.23456 and 2.34567 − 1.23456 and their absolute and relative errors.

    digits  Sum               Abs. error  Rel. error
    2       3.500000000e+00   8.023e-02   2.241e-02
    4       3.581000000e+00   7.700e-04   2.151e-04
    8       3.580230000e+00   0.000e+00   0.000e+00

    digits  Diff              Abs. error  Rel. error
    2       1.100000000e+00   1.111e-02   9.999e-03
    4       1.111000000e+00   1.100e-04   9.900e-05
    8       1.111110000e+00   2.220e-16   1.998e-16

Do these results look reasonable? Yes, the errors are small and we get improved accuracy when we use more digits. A benign example.


Example 3.8

Working to 2, 4 and 8 decimal digits compute 1.23457 + 1.23456 and 1.23457 − 1.23456 and their absolute and relative errors.

    digits  Sum               Abs. error  Rel. error
    2       2.400000000e+00   6.913e-02   2.800e-02
    4       2.470000000e+00   8.700e-04   3.524e-04
    8       2.469130000e+00   4.441e-16   1.799e-16

    digits  Diff              Abs. error  Rel. error
    2       0.000000000e+00   1.000e-05   1.000e+00
    4       0.000000000e+00   1.000e-05   1.000e+00
    8       1.000000000e-05   1.565e-16   1.565e-11

Notice that the relative errors in computing the difference 1.23457 − 1.23456 are now much much larger than in the previous example 2.34567 − 1.23456. This is a classic example of round-off error – when two nearly equal numbers are subtracted the leading digits in each number cancel, leaving behind only the (inaccurate) trailing digits and thus also introducing a large relative error in the result. This effect is sometimes referred to as a loss of precision or a loss of significance.
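You can reproduce the flavour of the last row of that table directly in Matlab (an added sketch, not from the original notes):

x = 1.23457;
y = 1.23456;
d = x - y;                          % exact answer is 1.0e-05
disp( abs(d - 1.0e-05)/1.0e-05 );   % relative error around 1e-11 (as in the
                                    % table above), not the ~1e-16 of x and y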

Example 3.9

The above tables were generated by some fancy programming on a 16-digit computer. In the 8-digit calculations you will see that the absolute errors are listed as approximately 10⁻¹⁶. But should not this be zero? After all, an 8-digit computer should be able to compute 2.34567 + 1.23456 without any error. What is going on here?

3.5 Understanding round-off errors

To properly understand round-off errors we need a model of how decimal numbers are stored in a computer's registers. Suppose our computer can only store the first N decimal digits of any number. All digits after the first N digits will be lost – we call this the round-off error.

Here is a little picture showing how a typical number x might be stored on our N-digit computer.


[Register picture: x = 0.123456789012345678..., of which only the first N digits are stored.]

We humans often write numbers like 0.0001234 or 675.123. This is not how the computer stores them. It will always shuffle the decimal point left or right to put the number in the form shown above. This process is called normalisation and it is applied after every computation. Numbers such as 0.0001234 and 675.123 are known as un-normalised numbers.

For the time being we will work only with positive numbers (it saves writing ± with every number).

3.6 Examples

[Register picture: 0.314159265358979 – π stored in normalised form.]

[Register picture: 0.256123400000000 – the number 0.2561234 stored with trailing zeros.]


[Register picture: 0.123456700000000 – the number 0.1234567 stored with trailing zeros.]

From these examples we can see that every number x can be written in the form

\[
x = \left(a + 10^{-N} b\right) \times 10^{m}
\]

with both a and b simple numbers in the range 0.1 to 0.99999···.

In this representation for x we have

\[
\tilde{x} = a \times 10^{m}\,, \qquad E_R(x) = b \times 10^{m} \times 10^{-N}
\]

(Note: a has exactly N decimal digits while b has an infinite number of digits.)
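For instance (an added illustration, not from the original notes), with N = 8 and x = π = 0.314159265358979... × 10¹ we have

\[
a = 0.31415926\,, \qquad b = 0.5358979\ldots\,, \qquad m = 1\,,
\]

so that x̃ = 0.31415926 × 10¹ and E_R(x) = 0.5358979... × 10¹ × 10⁻⁸ ≈ 5.4 × 10⁻⁸.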

In most of what we do with E_R(x) we will not be too concerned with the exact value of b. Thus, as both x and b × 10^m carry a factor of 10^m, we write

\[
E_R(x) = 10^{-N}\, O(x) \tag{1}
\]

Here is a very important question:

How do round-off errors propagate through a series of calculations?

We might be tempted to say that at the end of a series of calculations the round-off error in the answer will be

\[
E_R(y) = 10^{-N}\, O(y)
\]

on an N-digit computer. This is not always true (the above only applies when storing an exact number – it makes no account for the computations that may have preceded this number).

As an example, suppose you calculated y = f(x) for some given function f(x). This will entail at least three sources of round-off error. First, there will be a round-off error in x (due to its own prior history). Second, there will be a round-off error in computing f, and third there may be an error in storing the computed value of f (do we round up or round down?). All three errors will combine and the best we can expect will be

\[
E_R(y) = 10^{-M}\, O(y)
\]

with M ≤ N.


3.7 Addition

[Register picture: x = 0.1234567890, y = 0.5678123456 and their sum x + y = 0.6912691346.]

From this we observe (assuming no carries)

\[
E_R(x + y) = E_R(x) + E_R(y) = 10^{-N}\, O(x) + 10^{-N}\, O(y) = 10^{-N}\, O(x + y) \quad (\text{since } x, y > 0)
\]

So we expect N digits in the final answer – i.e. no loss of significant digits.

Example 3.10

Given that z̃ is the numerical result for z = x + y, how many digits in z̃ can we take as being exactly correct? Which digits might be in error (and why)? Hint: think about carries (if any).

3.8 Subtraction

Let us suppose that the first Q digits of x and y match.


0 . 1 2 3 4 5 9 8 7 6 5 4 3 2 1 0<br />

0 . 1 2 3 4 5 0 1 2 3 4 5 6 7 8 9<br />

0 . 0 0 0 0 0 9 7 5 3 0 8 6 4 2 1<br />

0 . 9 7 5 3 0 8 6 4 2 1 ? ? ? ? ?<br />

Here we only have N − Q accurate digits, all of the others are junk.<br />

⇒ E R (x − y) = 10 −(N−Q) O (x − y)<br />

We have lost Q digits – this is known as a loss of precision. The worst case occurs when<br />

Q = N i.e. when ˜x = ỹ while x ≠ y. So we have<br />

E R (x − y) = 10 −(N−Q) O (x − y)<br />

and as we have just noted this form shows us that only N −Q digits of ˜x − y are accurate.<br />

But there is another way of expressing this result which will be useful later on when we<br />

look at finite differences.<br />

Now we know x ≈ y and x = O(10^m) and x − y = O(10^{m−Q}), thus we have

(x − y) × 10^{+Q} = O(10^{m−Q}) × 10^{+Q}
                  = O(10^m)
                  = O(x)

Thus we also have

E_R(x − y) = 10^{-N} O(x) ,     when x ≈ y
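To see this loss of precision on a real machine, here is a small Matlab experiment (not part of the original notes) using the two numbers from the table above. In ordinary double precision N ≈ 16, and with Q = 5 matching leading digits we expect roughly Q digits to be lost in the difference.

% Loss of precision when differencing two nearly equal numbers.
% The first Q = 5 digits of x and y agree.
x = 0.123459876543210;
y = 0.123450123456789;

d     = x - y;                   % computed difference
exact = 9.753086421e-06;         % the difference worked out by hand

rel_err = abs(d - exact)/abs(exact);
fprintf('difference     = %.15e\n', d);
fprintf('relative error = %.2e\n', rel_err);   % far larger than eps = 2.2e-16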

3.9 Division<br />

Here we want to compute (well, estimate) the round-off error in z = x/y. On this occasion drawing little pictures is not much help. So this time we shall use a purely algebraic approach (one that can be used for other, more challenging computations). We will start by writing both x and y in the standard form

x = (a + 10^{-N} b) × 10^m
y = (c + 10^{-N} d) × 10^n

then

x̃ = a × 10^m ,     E_R(x) = 10^{-N} × (b × 10^m)
ỹ = c × 10^n ,     E_R(y) = 10^{-N} × (d × 10^n)

Thus we have

x/y = (a + 10^{-N} b)/(c + 10^{-N} d) × 10^{m−n}

On most computers N ≈ 15. So 10^{-N} d is much smaller than c. Thus we can use a Taylor series for 1/(1 + 10^{-N}(d/c)) in powers of 10^{-N} to produce

x/y = (1/c)(a + 10^{-N} b) 10^{m−n} ( 1 − 10^{-N} d/c + 10^{-2N} d²/c² − ··· )

If we retain just the first two terms in the series then we obtain (after a wee bit of algebra)

x/y = ( a/c + 10^{-N} ( b/c − ad/c² ) ) 10^{m−n} + O(10^{-2N})

Our computer will compute the approximation z̃ to the exact value z = x/y with a round-off error which we denote by E_R(z). We have z = z̃ + E_R(z) and (ignoring any carries)

z̃ = (a/c) × 10^{m−n}

E_R(z) = 10^{-N} ( b/c − ad/c² ) 10^{m−n}
       = 10^{-N} O(z)

This last line shows that our N-digit computer will return an estimate for z = x/y accurate to N digits. This is good! We rejoice (in moderation).

This is a very general technique and it can be applied to any function to analyse its sensitivity to round-off errors. It is well worth your time studying the above example in detail (as you always do, n'est-ce pas?).


Example 3.11 Round-off errors<br />

Lest you think that round-off errors are not important, here is a simple example that proves otherwise. Suppose we need to compute the roots x of the quadratic

0 = x² − x + λ

for various values of the parameter λ in the range 0 ≤ λ ≤ 1. Solving this quadratic exactly is easy; we all know that

x = ( 1 ± √(1 − 4λ) )/2

For λ = 0 we expect two roots, x = 0 and x = 1. How well does a finite precision<br />

computer handle this job? Here are some results.<br />

Exact and Computed values for x = (1 − √(1 − 4λ))/2

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 1.0000e-01 1.1250e-01 1.1271e-01 1.1270e-01<br />

1.00e-02 1.0102e-02 0.0000e+00 1.0000e-02 1.0100e-02 1.0102e-02<br />

1.00e-03 1.0010e-03 0.0000e+00 1.0000e-03 1.0000e-03 1.0010e-03<br />

1.00e-04 1.0001e-04 0.0000e+00 0.0000e+00 1.0000e-04 1.0000e-04<br />

Absolute errors for x = (1 − √(1 − 4λ))/2

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 1.2702e-02 2.0167e-04 3.3346e-06 1.5379e-08<br />

1.00e-02 1.0102e-02 1.0102e-02 1.0205e-04 2.0514e-06 1.4434e-09<br />

1.00e-03 1.0010e-03 1.0010e-03 1.0020e-06 1.0020e-06 2.0050e-09<br />

1.00e-04 1.0001e-04 1.0001e-04 1.0001e-04 1.0002e-08 1.0002e-08<br />

Relative errors for x = (1 − √(1 − 4λ))/2

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 1.1270e-01 1.7894e-03 2.9588e-05 1.3646e-07<br />

1.00e-02 1.0102e-02 1.0000e+00 1.0102e-02 2.0307e-04 1.4288e-07<br />

1.00e-03 1.0010e-03 1.0000e+00 1.0010e-03 1.0010e-03 2.0030e-06<br />

1.00e-04 1.0001e-04 1.0000e+00 1.0000e+00 1.0001e-04 1.0001e-04<br />

It is clear that significant errors arise when λ ≪ 1. For the computer with 4-digits<br />

the absolute errors might seem small but notice that they are of the same scale as the<br />

number we are trying <strong>to</strong> compute (the exact value for x). Thus, when λ is small, we have<br />


no reason <strong>to</strong> trust the answers from a 4-digit computer. This is better seen in the third<br />

table which lists the relative errors. Here you can see that the 2 and 4 digit computers<br />

yield answers that are 100% in error – they produce junk!<br />

The first big question is: What causes this problem? When λ ≪ 1, the numbers 1 and √(1 − 4λ) are almost equal. Thus when we compute their difference we introduce a large relative error.

The second big question is Can we cure this problem? In this case, yes, we have at least<br />

two options, both of which entail a re-working of the mathematics prior <strong>to</strong> handing<br />

control over <strong>to</strong> the computer. That is we search for different algorithms <strong>to</strong> compute x<br />

in the case where λ ≪ 1.<br />

Option 1. Here we do some simple algebra

x = ( 1 − (1 − 4λ)^{1/2} )/2

  = ( 1 − (1 − 4λ)^{1/2} )/2 × ( 1 + (1 − 4λ)^{1/2} )/( 1 + (1 − 4λ)^{1/2} )

  = 2λ / ( 1 + (1 − 4λ)^{1/2} )

We see that when λ ≪ 1 we do not have any problem with cancellation between nearly equal numbers (i.e. no loss of precision).

Relative errors for Option 1<br />

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 2.3972e-02 1.4777e-05 2.9691e-06 4.7730e-08<br />

1.00e-02 1.0102e-02 1.0102e-02 2.0307e-04 5.0924e-06 4.3889e-08<br />

1.00e-03 1.0010e-03 1.0010e-03 2.0030e-06 2.0030e-06 5.0090e-09<br />

1.00e-04 1.0001e-04 1.0001e-04 1.0001e-04 2.0003e-08 2.0003e-08<br />

Option 2. This time we use a Taylor series. For λ ≪ 1 we can expand √(1 − 4λ) as 1 − 2λ + O(λ²). Thus we find, for λ ≪ 1,

x = λ + O(λ²)

Relative errors for Option 2<br />

λ Exact 2-digits 4-digits 6-digits 8-digits<br />

1.00e-01 1.1270e-01 1.1270e-01 1.1270e-01 1.1270e-01 1.1270e-01<br />

1.00e-02 1.0102e-02 1.0102e-02 1.0102e-02 1.0102e-02 1.0102e-02<br />

1.00e-03 1.0010e-03 1.0010e-03 1.0010e-03 1.0010e-03 1.0010e-03<br />

1.00e-04 1.0001e-04 1.0001e-04 1.0001e-04 1.0001e-04 1.0001e-04<br />


And with this we are pleased – the relative errors are small and well behaved when<br />

λ ≪ 1. This is good. This is another example where a judicious choice of algorithm can<br />

save the day. This time we were lucky, the reliable algorithms were easy <strong>to</strong> find but you<br />

will find many problems that may require more head scratching than you might care <strong>to</strong><br />

endure – sadly you have no other choice (unless you can live with inaccurate answers!).<br />
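To see these numbers for yourself on an ordinary double-precision machine (roughly a 16-digit computer), here is a small Matlab sketch (my own illustration, not part of the original notes) comparing the naive formula with the Option 1 rearrangement for some very small values of λ:

% Small root of 0 = x^2 - x + lambda, computed two ways.
lambda = [1e-8 1e-10 1e-12 1e-14];

x_naive = (1 - sqrt(1 - 4*lambda))/2;            % suffers from cancellation
x_opt1  = 2*lambda ./ (1 + sqrt(1 - 4*lambda));  % the rearranged formula, no cancellation

rel_diff = abs(x_naive - x_opt1)./abs(x_opt1);   % how far the naive answer has drifted
disp([lambda' x_naive' x_opt1' rel_diff']);

For the smallest λ the naive formula loses several digits, while the rearranged formula stays accurate, exactly as the tables above suggest.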

Example 3.12<br />

Look at the numbers in the last row of the previous table. It appears that there is no<br />

improvement in accuracy as we move from a 2-digit computer <strong>to</strong> an 8-digit computer?<br />

Is this a surprise? Can you explain why this might be so?<br />

4. Solutions of Equations in One Variable



4.1 Introduction

The game here is, given a function f(x), to find x such that

0 = f(x)

In some cases this is easy,

0 = 3x² + 2x − 7   ⇒   x = ( −2 ± 2√22 )/6

but in other cases, such as

0 = x − e^{−x}

we do not have any choice but to resort to numerical means.

Our basic strategy will be <strong>to</strong> invent some way <strong>to</strong> create a sequence x 1 , x 2 , x 3 , · · · which<br />

(we hope) converges <strong>to</strong> the root x.<br />

The main issues that we will look at are

◮ Algorithms: How do we generate the sequence x_1, x_2, x_3, ···?
◮ Convergence: For what values of x_1 will the sequence converge? And if it does, how quickly?
◮ Robustness: For what class of functions f(x) will the algorithm work?

Here is your first reality check – any hope of finding a perfect algorithm – one that<br />

converges for all functions and for all initial guesses – is pure fantasy. All algorithms<br />

will have trouble under certain conditions. Our game will be <strong>to</strong> find a range of algorithms<br />

that collectively will allow us <strong>to</strong> solve most problems. We will also want <strong>to</strong> investigate<br />

what it is that causes one algorithm <strong>to</strong> succeed where others fail. Thus our work will be<br />

a mix of empirical tinkering (i.e. algorithm design) and solid mathematical analysis.<br />

4.2 Fixed point iteration<br />

Given 0 = x − e^{−x} we also have x = e^{−x} and this suggests the following sequence

x_{n+1} = e^{−x_n} ,     n = 1, 2, 3, ···

To get the ball rolling we need an initial guess, let’s take x 1 = 0.5. How well does this<br />

work? Here are the results for the first ten iterations.<br />


Fixed point iterations x_{n+1} = e^{−x_n}
Iteration n    Old guess x_n    New guess x_{n+1}    x_{n+1} − e^{−x_{n+1}}

0 0.500000000000 -1.065e-01<br />

1 0.500000000000 0.606530659713 6.129e-02<br />

2 0.606530659713 0.545239211893 -3.446e-02<br />

3 0.545239211893 0.579703094878 1.964e-02<br />

4 0.579703094878 0.560064627939 -1.111e-02<br />

5 0.560064627939 0.571172148977 6.309e-03<br />

6 0.571172148977 0.564862946980 -3.575e-03<br />

7 0.564862946980 0.568438047570 2.029e-03<br />

8 0.568438047570 0.566409452747 -1.150e-03<br />

9 0.566409452747 0.567559634262 6.524e-04<br />

10 0.567559634262 0.566907212935 -3.700e-04<br />

It appears to be converging, the last column seems to be getting smaller with each iteration, but it seems to be a slow convergence. Here are the results at every 10-th iteration.

Fixed point iterations x_{n+1} = e^{−x_n}
Iteration n    Old guess x_n    New guess x_{n+1}    x_{n+1} − e^{−x_{n+1}}

1 0.500000000000 0.606530659713 6.129e-02<br />

11 0.566907212935 0.567277195971 2.098e-04<br />

21 0.567142477551 0.567143751417 7.225e-07<br />

31 0.567143287611 0.567143291997 2.487e-09<br />

41 0.567143290400 0.567143290415 8.564e-12<br />

51 0.567143290410 0.567143290410 2.931e-14<br />

61 0.567143290410 0.567143290410 1.110e-16<br />

Clearly the algorithm worked, we have a solution, but it <strong>to</strong>ok over 60 iterations. For<br />

functions as simple as this, 61 iterations does not take <strong>to</strong>o much time <strong>to</strong> compute. So we<br />

might feel that this is a good result. Not quite. If this happened <strong>to</strong> be part of a much<br />

larger computation, where we needed <strong>to</strong> compute the root many thousands of times (for<br />

example) then we have good reason <strong>to</strong> want <strong>to</strong> improve on 60 iterations per root. As we<br />

shall see in later lectures there are algorithms that can converge in (usually) less than<br />

about 5 iterations. This is a significant improvement.<br />

You might ask how well does this algorithm work for other choices of the initial guess<br />

x 1 ? It converges for a wide range of values! By direct trial and error you can verify<br />

(you’ll need a program!) that it converges (at least) for any initial guess in the range<br />

−5 < x 1 < +5. Again this is encouraging. Flush with confidence you might try rewriting<br />

the original equation 0 = x − e −x as x = − log(x) and thus create the sequence<br />

x_{n+1} = − log(x_n) ,     n = 1, 2, 3, ···


If you start with x_1 = 2 you will get an error very quickly

x_1 = 2
x_2 = − log(x_1) = − log(2)
x_3 = − log(x_2) = − log(− log(2))

Since − log(2) is negative, the third step asks for the log of a negative number and the iteration fails. Starting with x_1 = 0.5 leads to the same problem at n = 15.

Fixed Point Iteration<br />

If an equation 0 = f(x) can be re-written in the form<br />

x = g(x)<br />

then the sequence<br />

x n+1 = g(x n ) n = 1, 2, 3, · · ·<br />

is known as fixed point iteration. The sequence is not guaranteed <strong>to</strong> converge <strong>to</strong> x.<br />

4.2.1 Why the funny name?<br />

You might be wondering why we call this method the fixed point method. It is simple.<br />

If we have found the root x then one extra iteration x = g(x) produces no change in x,<br />

that is, x is fixed, and so we call it a fixed point of g(x).

4.2.2 Fixed point in pictures<br />

The sequence x n+1 = g(x n ) can be drawn as points in the (x, y) plane. All you do is<br />

plot the points (x 1 , y 1 ), (x 2 , y 2 ), (x 3 , y 3 ) · · · where y n = g(x n ). This gives us a cute way<br />

<strong>to</strong> display the convergent and divergent sequences.<br />


[Figure: Convergent fixed point iteration – the curves y = g(x) and y = x, with the iterates x_1, x_2, x_3, x_4 approaching their intersection.]

[Figure: Divergent fixed point iteration – the curves y = g(x) and y = x, with the iterates x_1, ..., x_5 moving away from their intersection.]


4.2.3 Cyclic Fixed point iterations<br />

If we are unlucky we may bump into an x for which x = g(g(x)) and x ≠ g(x). This is an example of a cyclic sequence, as shown in the following diagram.

[Figure: Cyclic fixed point iteration – the curves y = g(x) and y = x; the iterates alternate between two values, x_1, x_3, x_5, x_7, ... and x_2, x_4, x_6, x_8, ...]

In cases like this you have no choice but <strong>to</strong> start again with a new guess for x 1 and there<br />

is no guarantee that you won’t bump in<strong>to</strong> the same cyclic sequence. Good luck!<br />

4.2.4 Convergence<br />

Is there anything that we can say about when a fixed point iteration will converge? Yes! Suppose we have

x_{n+1} = g(x_n)

which we hope will converge to the root of

x = g(x)

Suppose we are close, that is x_n ≈ x, and also that g(x) is a nice smooth function. We can then write

g(x_n) = g(x) + g'(x)(x_n − x) + O( (x_n − x)² )

But x_{n+1} = g(x_n) and x = g(x) and so we also have

x_{n+1} = x + g'(x)(x_n − x) + O( (x_n − x)² )


which we re-write as

x_{n+1} − x = g'(x)(x_n − x) + O( (x_n − x)² )

Now notice that x_n − x is the error in our approximation at iteration n. So let's define the error at each iteration by

ε_n = |x_n − x|

then we have

ε_{n+1} = |g'(x)| ε_n + O(ε_n²)

Notice that if |g ′ (x)| < 1 then successive errors will be smaller than the previous errors,<br />

that is the iterations will converge. On the other hand, if |g ′ (x)| > 1 the errors will grow<br />

and the sequence diverges. The only other case is when |g ′ (x)| = 1 and in this case the<br />

errors neither grow nor decay.<br />

Note also that the above arguments all hinge on the assumption that we are near the<br />

fixed point. All bets are off if we have a bad initial guess (i.e. x 1 is far from the fixed<br />

point). However once the sequence gets close <strong>to</strong> the fixed point (if we should be so lucky)<br />

then the above arguments do apply.<br />

Fixed Point Iteration Convergence<br />

Given an x 1 close <strong>to</strong> the fixed point x = g(x), then the sequence<br />

x n+1 = g(x n ) n = 1, 2, 3, · · ·<br />

Converges when |g ′ (x)| < 1<br />

Diverges when |g ′ (x)| > 1<br />

Useless when |g ′ (x)| = 1<br />

Example 4.1<br />

Verify, using the above conditions, that the scheme x_{n+1} = e^{−x_n} will converge, while the scheme x_{n+1} = − log(x_n) should diverge.
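A quick numerical check of these conditions (a sketch only, not part of the original example) is to evaluate |g'(x)| at the fixed point x ≈ 0.567143290410 for both choices of g:

% Check the convergence condition |g'(x)| < 1 at the fixed point of x = exp(-x).
x = 0.567143290410;

g1prime = abs(-exp(-x));   % derivative of g(x) = exp(-x)
g2prime = abs(-1/x);       % derivative of g(x) = -log(x)

fprintf('|g1''(x)| = %.6f  (less than 1, so x_{n+1} = exp(-x_n) converges)\n', g1prime);
fprintf('|g2''(x)| = %.6f  (greater than 1, so x_{n+1} = -log(x_n) diverges)\n', g2prime);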

4.2.5 Programming notes<br />

Since we know that the fixed point iterations might fail we must take care when we write<br />

our computer programs. We need <strong>to</strong> ask ourselves what can go wrong in the calculations<br />

(<strong>to</strong>o many iterations? not converging? stalled? etc.) and put appropriate tests in our<br />

programs. Here is a very rough sketch of a program that could be used for any algorithm<br />

<strong>to</strong> solve 0 = f(x) for x.<br />


Set initial guess for x<br />

Set the desired accuracy for x<br />

Set limit for number of iterations<br />

While iterating do<br />

Choose a new guess for x<br />

Compute the new f(x)<br />

If the change in x is small, then exit<br />

If the new f is small, then exit<br />

If the number of iterations is <strong>to</strong>o large, then exit<br />

Otherwise, prepare for the next iteration<br />

Finished, print x<br />


4.2.6 A Matlab program<br />

Here is a rough Matlab program that does the job. You might like <strong>to</strong> look at this code<br />

very carefully.<br />

x_old = 0.5;                      % initial guess
loop = 0;                         % set loop to zero
loop_max = 50;                    % maximum number of iterations
small_number = 0.0001;            % target accuracy

looping = (loop < loop_max);

while looping                     % start of iterations
   x_new = exp(-x_old);           % next iteration
   f_old = x_old - exp(-x_old);
   f_new = x_new - exp(-x_new);
   loop = loop + 1;
   if loop > loop_max
      looping = false;            % too many iterations, exit
   end
   if abs(x_new-x_old) < small_number
      looping = false;            % x values converged, exit
   end
   if abs(f_new) < small_number
      looping = false;            % f values very small, exit
   end
   x_old = x_new;                 % prepare for next iteration
   disp([loop x_new f_new ]);     % display current approximation
end;

4.3 Newton-Raphson iteration

Once again we have the problem of finding x such that 0 = f(x). And once again we will generate a sequence x_1, x_2, x_3, ··· which we hope will converge to x.

How? Okay, let's suppose we have a guess, call it x_n, and we wish to create a new (improved) guess x_{n+1}. This new guess will only be an approximation to x. Let δx be the error in x_n, that is

x_n = x + δx

Our game now is to compute (or estimate) δx.


Given 0 = f(x) we have

0 = f(x_n − δx)

and if δx is small, then we can expand the right hand side using a Taylor series. Thus

0 = f(x_n) − f'(x_n) δx + O(δx²)

Since δx is small we can discard the second order terms, thus we can solve for δx,

δx = f(x_n)/f'(x_n)

You might be tempted to write

x = x_n − δx

and think your job is done, that you have found the exact root. But don't forget that we truncated the Taylor series, and thus we introduced an approximation for δx. Thus the previous line is only approximately true. We thus take the right hand side as our next approximation, that is x_{n+1} = x_n − δx = x_n − f(x_n)/f'(x_n).
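Here, as a rough sketch (the derivative f'(x) = 1 + e^{−x} is coded by hand, and the stopping tests are of the same kind as in the fixed point program above), is what a minimal Matlab version of this iteration might look like for our standard problem f(x) = x − e^{−x}:

% Newton-Raphson iteration for f(x) = x - exp(-x), a rough sketch.
f      = @(x) x - exp(-x);     % the function
fprime = @(x) 1 + exp(-x);     % its derivative, coded by hand

x_old = 0.5;                   % initial guess
for loop = 1 : 20              % at most 20 iterations
   x_new = x_old - f(x_old)/fprime(x_old);   % the Newton-Raphson update
   disp([loop x_new f(x_new)]);              % show progress
   if abs(x_new - x_old) < 1.0e-12           % x values converged, exit
      break;
   end
   x_old = x_new;              % prepare for next iteration
end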

Example 4.2 Newton-Raphson

Here are the results for our standard problem 0 = x − e −x .<br />

Newton-Raphson iterations x_{n+1} = x_n − f_n/f'_n ,   f(x) = x − e^{−x}
Iteration n    Old guess x_n    New guess x_{n+1}    x_{n+1} − e^{−x_{n+1}}    ε_{n+1}/ε_n²

0 0.500000000000 -1.065e-01<br />

1 0.500000000000 0.566311003197 -1.305e-03 1.846e-01<br />

2 0.566311003197 0.567143165035 -1.965e-07 1.810e-01<br />

3 0.567143165035 0.567143290410 -4.441e-15 1.393e+01<br />

4 0.567143290410 0.567143290410 1.110e-16 4.507e+12<br />

5 0.567143290410 0.567143290410 0.000e+00 4.631e+12<br />

Notice how quickly the iterations converge. This is a very good result! We get accurate<br />

answers for very little effort! Yippee (well, I get easily carried away).<br />

Example 4.3<br />

Let ɛ n be defined by ɛ n = |x − x n |. Show, using a Taylor series, that ɛ n+1 = O (ɛ 2 n). This<br />

means that each successive iteration will double the number of digits in our approximation.<br />

This is known as quadratic convergence. In contrast, the fixed point iterations<br />

have ɛ n+1 = O (ɛ n ) and this is called linear convergence.<br />

But note one important fact – the proof that ɛ n+1 = O (ɛ 2 n) assumes that f ′ (x) ≠ 0.<br />

What happens when f ′ (x) = 0?<br />


Example 4.4 Newton-Raphson with f'(x) = 0

Find x such that 0 = f(x) = (x − 1) 2 . Clearly f(x) = 0 and f ′ (x) = 0 at x = 1 so x = 1<br />

is our root. Let’s take the initial guess of x = 0.5. Here is what we get from a naive<br />

application of the New<strong>to</strong>n-Raphson algorithm.<br />

Newton-Raphson iterations x_{n+1} = x_n − f_n/f'_n ,   f(x) = (x − 1)²
Iteration n    Old guess x_n    New guess x_{n+1}    (x_{n+1} − 1)²    ε_{n+1}/ε_n

0 0.500000000000 2.500e-01<br />

1 0.500000000000 0.750000000000 6.250e-02 5.000e-01<br />

2 0.750000000000 0.875000000000 1.563e-02 5.000e-01<br />

3 0.875000000000 0.937500000000 3.906e-03 5.000e-01<br />

4 0.937500000000 0.968750000000 9.766e-04 5.000e-01<br />

5 0.968750000000 0.984375000000 2.441e-04 5.000e-01<br />

6 0.984375000000 0.992187500000 6.104e-05 5.000e-01<br />

7 0.992187500000 0.996093750000 1.526e-05 5.000e-01<br />

8 0.996093750000 0.998046875000 3.815e-06 5.000e-01<br />

9 0.998046875000 0.999023437500 9.537e-07 5.000e-01<br />

10 0.999023437500 0.999511718750 2.384e-07 5.000e-01<br />

The iterations do converge, but not as fast as in the previous example. Notice that ε_{n+1}/ε_n remains constant. That is, each iteration reduces the error by a constant factor (in this case 1/2). This is typical of Newton-Raphson iteration when x is a root of both f(x) = 0 and f'(x) = 0.

4.3.1 Some notation.<br />

◮ Simple root: when f(x) = 0 and f'(x) ≠ 0.
◮ Double root: when f(x) = 0, f'(x) = 0 and f''(x) ≠ 0.
◮ Root of order m: when f(x) = 0 and all derivatives up to f^{(m−1)}(x) are zero at x while f^{(m)}(x) ≠ 0.


Newton-Raphson Iteration

For the equation 0 = f(x) compute the sequence

x_{n+1} = x_n − f(x_n)/f'(x_n) ,     n = 1, 2, 3, ···

If this sequence converges then

ε_{n+1} = O(ε_n²)   at a simple root
ε_{n+1} = O(ε_n)    at a multiple root

4.3.2 Newton-Raphson for multiple roots

We have seen that the standard Newton-Raphson, when applied to a function with a multiple root, does converge but only linearly. Can we modify the algorithm to recover quadratic convergence? Yes (you knew that).

Let f(x) have a multiple root at x = p. Then for x near p we must have

f(x) = (x − p)^m h(x)

where h(x) is some other (unknown) function with h(p) ≠ 0. Thus

f^{1/m}(x) = (x − p) h^{1/m}(x)

This new function has a simple root at x = p and thus we expect quadratic convergence when the Newton-Raphson method is applied to f^{1/m}(x).

What does the Newton-Raphson iteration look like for this new function? Put g(x) = f^{1/m}(x), then

x_{n+1} = x_n − g(x_n)/g'(x_n)
        = x_n − f^{1/m}(x_n) / ( (1/m) f^{1/m−1}(x_n) f'(x_n) )
        = x_n − m f(x_n)/f'(x_n)

Here is what we get for the simple function f(x) = (x − 1) 2 .<br />

Newton-Raphson iterations x_{n+1} = x_n − 2f_n/f'_n ,   f(x) = (x − 1)²
Iteration n    Old guess x_n    New guess x_{n+1}    (x_{n+1} − 1)²    ε_{n+1}/ε_n

0 0.500000000000 2.500e-01<br />

1 0.500000000000 1.000000000000 0.000e+00 0.000e+00<br />


Yes, this is correct, it does converge in one iteration. But do not think that all functions<br />

with a double root will converge so quickly. Here is another example, based on f(x) =<br />

x 3 − 3x + 2 which has a double root at x = 1.<br />

Newton-Raphson iterations x_{n+1} = x_n − 2f_n/f'_n ,   f(x) = x³ − 3x + 2
Iteration n    Old guess x_n    New guess x_{n+1}    (x_{n+1} − 1)²    ε_{n+1}/ε_n²

0 1.200000000000 1.280e-01<br />

1 1.200000000000 1.006060606061 1.104e-04 1.515e-01<br />

2 1.006060606061 1.000006103329 1.118e-10 1.662e-01<br />

3 1.000006103329 1.000000000004 -2.220e-16 9.683e-02<br />

Example 4.5<br />

Repeat the above calculations for 7 iterations. What do you notice? Can you explain<br />

this?<br />

Example 4.6<br />

Can you write out a formula for ɛ n as a function of n for the double root (easy) and for<br />

the simple root (not so easy – hint, use the above tables and write out successive ɛ n ’s<br />

and express each in terms of ɛ 1 ).<br />

Example 4.7<br />

For multiple roots we could apply the Newton-Raphson method to the function g(x) = f^{(m−1)}(x), i.e. the (m − 1)st derivative of f(x). What do you think would be the pros and cons of this approach? (Hint – two words: efficiency, round-off).

4.3.3 Cycling<br />

If we are unlucky we will find that the New<strong>to</strong>n-Raphson may cycle, just as we saw with<br />

the fixed point algorithm.<br />


[Figure: Cyclic Newton-Raphson iteration on y = f(x); the iterates alternate between two sets of values, x_1, x_3, x_5, x_7, ... and x_2, x_4, x_6, x_8, ...]

There is not much we can do in this case. We could try different initial guesses and we<br />

may be lucky and avoid the cycling but there is no guarantee that we’ll be so lucky. You<br />

may have <strong>to</strong> give up and try another algorithm (for example, half-interval search).<br />

4.4 Interval methods<br />

Our game so far has been <strong>to</strong> create a sequence x 1 , x 2 , x 3 , · · · of approximations <strong>to</strong> the<br />

root of f(x) with the hope that x n+1 is a better approximation than x n <strong>to</strong> the root.<br />

At any stage in this process we have a single point approximation <strong>to</strong> the root. In the<br />

following two algorithms (half-interval search and false position) we will use pairs of<br />

points <strong>to</strong> define a range of values that are guaranteed <strong>to</strong> contain the root.<br />

This is a good thing. If the root lies in the range a < x < b then we have a concrete<br />

measure of how large the error might be in taking any number from the interval as an<br />

approximation <strong>to</strong> x. The worst case is that the error is no larger than b − a. If we can<br />

find an algorithm that successively shrinks the interval then we are guaranteed that the<br />

iterations will converge <strong>to</strong> the root.<br />

Algorithms of this type are also referred <strong>to</strong> as bracketing methods.<br />

The success of these methods depends on two things<br />

◮ The function f(x) must be continuous and<br />

◮ That we can find at least one interval that contains just this one root.<br />


4.4.1 Half Interval Search<br />

This is the classic example of interval methods. Let’s suppose we have a simple function<br />

and that by some means (usually a table or a plot) we have found two points a and b<br />

with a < b and most importantly that f(a) and f(b) are of opposite sign. This is the<br />

key condition for the choice of a and b. Let’s call this the root condition.<br />

If the function is continuous (which we always assume) then the function must have a<br />

root in the range a < x < b. Good. Now the big question is How do we choose a new<br />

smaller interval that also contains the root?<br />

Given the interval [a, b] we create two new intervals based on the mid-point c = (a+b)/2.<br />

Thus we can split the original interval in<strong>to</strong> two smaller non-overlapping intervals. Here<br />

is the big thing – Only one of the intervals will satisfy the root condition. Whichever<br />

interval that happens <strong>to</strong> be, we take it as the next interval and start the process all over<br />

again.<br />

In this way we generate a sequence of intervals, each 1/2 the size of the previous interval,<br />

that are guaranteed <strong>to</strong> converge <strong>to</strong> the root. This certainty is very comforting (point<br />

algorithms are not guaranteed <strong>to</strong> converge).<br />

Half Interval Search iterations f(x) = x − e −x<br />

n x left x middle x right f left f middle f right<br />

1 0.000000 1.500000 3.000000 -1.0e+00 1.3e+00 3.0e+00<br />

2 0.000000 0.750000 1.500000 -1.0e+00 2.8e-01 1.3e+00<br />

3 0.000000 0.375000 0.750000 -1.0e+00 -3.1e-01 2.8e-01<br />

4 0.375000 0.562500 0.750000 -3.1e-01 -7.3e-03 2.8e-01<br />

5 0.562500 0.656250 0.750000 -7.3e-03 1.4e-01 2.8e-01<br />

6 0.562500 0.609375 0.656250 -7.3e-03 6.6e-02 1.4e-01<br />

7 0.562500 0.585938 0.609375 -7.3e-03 2.9e-02 6.6e-02<br />

8 0.562500 0.574219 0.585938 -7.3e-03 1.1e-02 2.9e-02<br />

9 0.562500 0.568359 0.574219 -7.3e-03 1.9e-03 1.1e-02<br />

10 0.562500 0.565430 0.568359 -7.3e-03 -2.7e-03 1.9e-03<br />

Half-Interval Search<br />

Given a continuous function f(x) and an interval [a, b] with f(a)f(b) < 0 then<br />

Compute c = a + (b − a)/2, the mid point<br />

If f(a)f(c) < 0 then<br />

Choose the new interval as [a, c]<br />

Else<br />

Choose the new interval as [c, b]<br />


Notes<br />

◮ Half-interval search is also commonly known as the Bisection method.<br />

◮ It is far better to use c = a + (b − a)/2 than c = (a + b)/2 to compute the mid-point. The latter method can, due to round-off errors, produce a c that is not contained in the interval [a, b].
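As a rough illustration (a sketch, not taken from the original notes), a minimal Matlab version of half-interval search for f(x) = x − e^{−x} on the starting interval [0, 3] might look like this:

% Half-interval (bisection) search for f(x) = x - exp(-x), a rough sketch.
f = @(x) x - exp(-x);
a = 0.0;  b = 3.0;                 % starting interval with f(a)*f(b) < 0

for n = 1 : 40
   c = a + (b - a)/2;              % mid point (the safer formula)
   if f(a)*f(c) < 0
      b = c;                       % root lies in [a, c]
   else
      a = c;                       % root lies in [c, b]
   end
   if (b - a) < 1.0e-10            % interval small enough, stop
      break;
   end
end
disp([n c f(c)]);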

4.4.2 False Position<br />

This differs from the Half-Interval method only in the way the interval is split in two.<br />

In this case we draw a line joining the two points on the curve and locate where this line<br />

crosses the x-axis. That splits the interval in two, everything else remains the same. It<br />

is easy to verify that this point is given by

c = ( a f(b) − b f(a) ) / ( f(b) − f(a) )

Here are our results.

False Position iterations f(x) = x − e −x<br />

n x left x new x right f left f new f right<br />

1 0.000000 0.759453 3.000000 -1.0e+00 2.9e-01 3.0e+00<br />

2 0.000000 0.588025 0.759453 -1.0e+00 3.3e-02 2.9e-01<br />

3 0.000000 0.569460 0.588025 -1.0e+00 3.6e-03 3.3e-02<br />

4 0.000000 0.567401 0.569460 -1.0e+00 4.0e-04 3.6e-03<br />

5 0.000000 0.567172 0.567401 -1.0e+00 4.5e-05 4.0e-04<br />

6 0.000000 0.567146 0.567172 -1.0e+00 5.0e-06 4.5e-05<br />

7 0.000000 0.567144 0.567146 -1.0e+00 5.5e-07 5.0e-06<br />

8 0.000000 0.567143 0.567144 -1.0e+00 6.2e-08 5.5e-07<br />
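A corresponding Matlab sketch for false position (again illustrative only, not part of the original notes) differs from the bisection sketch above in just one line, the choice of the splitting point:

% False position for f(x) = x - exp(-x), a rough sketch.
f = @(x) x - exp(-x);
a = 0.0;  b = 3.0;                      % starting interval with f(a)*f(b) < 0

for n = 1 : 40
   c = (a*f(b) - b*f(a))/(f(b) - f(a)); % where the chord crosses the x-axis
   if f(a)*f(c) < 0
      b = c;                            % root lies in [a, c]
   else
      a = c;                            % root lies in [c, b]
   end
   if abs(f(c)) < 1.0e-10               % f small enough, stop
      break;
   end
end
disp([n c f(c)]);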


4.5 Summary

Fixed point:   x_{n+1} = g(x_n)
   Convergence criterion:  |g'(x)| < 1
   Convergence rate:       ε_{n+1} = O(ε_n)
   ✔ simple   ✔ no derivatives
   ✗ slow   ✗ may cycle

Newton-Raphson:   x_{n+1} = x_n − f(x_n)/f'(x_n)
   Convergence criterion:  x_n ≈ x
   Convergence rate:       ε_{n+1} = O(ε_n²)
   ✔ quadratic convergence
   ✗ may not find all roots   ✗ requires derivatives   ✗ may cycle
   ✗ linear convergence at multiple roots

Half-Interval Search:   c = (a + b)/2
   Convergence criterion:  f(a)f(b) < 0
   Convergence rate:       ε_{n+1} = (1/2) ε_n
   ✔ simple   ✔ no derivatives   ✔ guaranteed to converge
   ✗ slow   ✗ fails at even powered multiple roots

False Position:   c = ( b f(a) − a f(b) ) / ( f(a) − f(b) )
   Convergence criterion:  f(a)f(b) < 0
   Convergence rate:       ε_{n+1} = O(ε_n)
   ✔ simple   ✔ no derivatives   ✔ guaranteed to converge
   ✗ slow   ✗ fails at even powered multiple roots

5. Solving Systems of Linear Equations



5.1 Introduction

Remember all the fun you had using Gaussian elimination <strong>to</strong> solve equations like<br />

2x + 3y + z = 10<br />

x + 2y + 2z = 10<br />

4x + 8y + 11z = 49<br />

for x, y and z? You may be disappointed <strong>to</strong> learn that we’ll be leaving all that fun behind<br />

us by handing the job over <strong>to</strong> the computer. We will study two classes of algorithms,<br />

direct methods where we use variations on Gaussian elimination <strong>to</strong> obtain the solution<br />

and iterative methods where we invent (yet another) algorithm <strong>to</strong> create a sequence of<br />

approximations <strong>to</strong> the solution.<br />

5.2 Gaussian elimination with back substitution<br />

The following steps should serve as a basic reminder of Gaussian elimination with back<br />

substitution.<br />

Using the above set of equations, your pen and paper calculations might look like the<br />

following.<br />

2x + 3y +  z = 10     (1)
 x + 2y + 2z = 10     (2)
4x + 8y + 11z = 49    (3)

(2)' ← 2(2) − (1),   (3)' ← (3) − 2(1):

2x + 3y + z = 10      (1)
     y + 3z = 10      (2)'
    2y + 9z = 29      (3)'

(3)'' ← (3)' − 2(2)':

2x + 3y + z = 10      (1)
     y + 3z = 10      (2)'
         3z = 9       (3)''

Now we solve this system using back-substitution, z = 3, y = 1, x = 2.

There are two basic stages. First, we apply a series of row operations <strong>to</strong> reduce the<br />

system <strong>to</strong> an upper triangular form. Second, we apply the back substitution where we<br />

solve from the last up <strong>to</strong> the first equation. There are minor variations on this pattern<br />

(such as full Gaussian elimination) but for the moment we will not worry about such<br />

matters.<br />


For bookkeeping purposes we normally write the equations in matrix form, such as

[ 2  3   1 ] [ x ]   [ 10 ]
[ 1  2   2 ] [ y ] = [ 10 ]
[ 4  8  11 ] [ z ]   [ 49 ]

or more simply as just

AX = f
where each of A, X and f are matrices (all of this should be very familiar <strong>to</strong> you).<br />

Our game will be <strong>to</strong> write a short Matlab program that implements Gaussian elimination<br />

with back substitution on the system AX = f. We will do this in simple stages, starting<br />

with the basic mathematics and slowly adding fragments of Matlab code <strong>to</strong> build a fully<br />

working program.<br />

Matlab code for Gaussian elimination We will assume that we have N equations<br />

in N unknowns with augmented matrix M. In our Matlab program we can access the<br />

entry in row i and column j of M by writing M(i, j).<br />

Let’s suppose we have completed the eliminations for columns 1 <strong>to</strong> a − 1. We now need<br />

<strong>to</strong> apply row operations <strong>to</strong> M <strong>to</strong> eliminate all of the entries in the column below the<br />

entry at (a, a) (i.e. entries (a+1, a), (a+2, a), (a+3, a), · · · (N, a)). Let’s suppose we are<br />

about <strong>to</strong> eliminate the entry at (b, a). Looking back on our pen-and-paper calculations<br />

we see that we need <strong>to</strong> do three tasks<br />

◮ Divide row a by M(a, a)<br />

◮ Multiply row a by M(b, a)<br />

◮ Subtract row a from row b<br />

Here is a Matlab fragment that does the job.<br />

factor = M(b,a)/M(a,a);              % want M(b,a) to be zero
for c = a : N+1                      % col's to right of (a,a)
   M(b,c) = M(b,c) - factor*M(a,c);  % row b - factor * row a
end


Example 5.1<br />

Were we not meant to process the whole row? Does not the above only process part of the row? And why does the loop go up to N + 1? Have we made two mistakes?

This does the work for one row. We now need <strong>to</strong> process every row below row a. This<br />

is easy – we simply wrap up our Matlab fragment in another for-loop.<br />

for b = a+1 : N                         % rows below a
   factor = M(b,a)/M(a,a);              % want M(b,a) to be zero
   for c = a : N+1                      % col's to right of (a,a)
      M(b,c) = M(b,c) - factor*M(a,c);  % row b - factor * row a
   end
end

This is looking good. We have completed all the eliminations for column a. All we need<br />

do now is repeat this process for the remaining columns. This introduces one more loop,<br />

for a = 1 : N-1                            % first N-1 rows
   for b = a+1 : N                         % rows below a
      factor = M(b,a)/M(a,a);              % want M(b,a) to be zero
      for c = a : N+1                      % col's to right of (a,a)
         M(b,c) = M(b,c) - factor*M(a,c);  % row b - factor * row a
      end
   end
end

Matlab code for back substitution Okay, our matrix M is now in upper triangular<br />

form and we are ready <strong>to</strong> do the back substitution. We need a vec<strong>to</strong>r <strong>to</strong> s<strong>to</strong>re the<br />

unknowns. Let’s call it X (with entries X(i)). Recall that in the back substitution we<br />

use equation N <strong>to</strong> solve for X(N), then equation N − 1 <strong>to</strong> solve for X(N − 1) and so<br />

on finishing with the first equation giving X(1). As before, we will build our Matlab<br />

code by first assuming we are mid-way through our calculations. So suppose we have<br />

computed X(N), X(N − 1), X(N − 2), · · · X(a + 1). We now solve for X(a) from row<br />

(a).<br />

sum = M(a,N+1);                  % RHS of row a
for b = a+1 : N                  % shuffle all known X's
   sum = sum - M(a,b)*X(b);      % across to the RHS
end
X(a) = sum/M(a,a);               % compute X(a)

And as before we wrap this in one more loop <strong>to</strong> get all of the X’s,<br />


for a = N : -1 : 1               % step backwards
   sum = M(a,N+1);               % RHS of row a
   for b = a+1 : N               % shuffle all known X's
      sum = sum - M(a,b)*X(b);   % across to the RHS
   end
   X(a) = sum/M(a,a);            % compute X(a)
end

That completes the job, we have written a very basic Matlab program that implements a<br />

simple Gaussian elimination algorithm with back substitution. This is our final program<br />

%--- Gaussian elimination -----------------------------------------
for a = 1 : N-1                            % first N-1 rows
   for b = a+1 : N                         % rows below a
      factor = M(b,a)/M(a,a);              % want M(b,a) to be zero
      for c = a : N+1                      % col's to right of (a,a)
         M(b,c) = M(b,c) - factor*M(a,c);  % row b - factor * row a
      end
   end
end

%--- Back substitution --------------------------------------------
for a = N : -1 : 1                         % step backwards
   sum = M(a,N+1);                         % RHS of row a
   for b = a+1 : N                         % shuffle all known X's
      sum = sum - M(a,b)*X(b);             % across to the RHS
   end
   X(a) = sum/M(a,a);                      % compute X(a)
end
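To try this program on the 3 × 3 system from the start of this chapter you still need to supply N, the augmented matrix M = [A f] and a vector X to hold the answers. Here is a minimal driver (a sketch, not part of the original notes) with those pieces filled in:

% A small driver for the Gaussian elimination + back substitution code above.
N = 3;
M = [ 2  3  1  10 ;          % augmented matrix [A f] for the example system
      1  2  2  10 ;
      4  8 11  49 ];
X = zeros(N,1);              % storage for the unknowns

for a = 1 : N-1                               % Gaussian elimination
   for b = a+1 : N
      factor = M(b,a)/M(a,a);
      for c = a : N+1
         M(b,c) = M(b,c) - factor*M(a,c);
      end
   end
end

for a = N : -1 : 1                            % back substitution
   sum = M(a,N+1);
   for b = a+1 : N
      sum = sum - M(a,b)*X(b);
   end
   X(a) = sum/M(a,a);
end

disp(X')                     % should print 2 1 3, i.e. x = 2, y = 1, z = 3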

5.2.1 Pivoting<br />

Our Matlab program suffers from one very obvious drawback – it will fail when the<br />

diagonal element M(a, a) is zero (why?). The usual trick is <strong>to</strong> swap such a row with<br />

one of the other lower rows. But which one? The simplest strategy is <strong>to</strong> look at each<br />

element below the diagonal and find that which is the largest in absolute value. Then<br />

we swap those two rows. Here is a little Matlab fragment that does the job.<br />


% --- Find row containing the biggest number ----------------------
big_b   = a;                         % default row to swap
big_num = abs( M(a,a) );             % start with M(a,a)
for b = a+1 : N                      % all rows below row a
   if abs( M(b,a) ) > big_num        % a new big number?
      big_b   = b;                   % save this row
      big_num = abs( M(b,a) );       % save this number
   end
end
swap = big_b;                        % target row to swap

% --- Swap this row with the diagonal row -------------------------
if swap ~= a                         % don't swap a with a
   for b = a : N+1                   % all col's to the right
      save      = M(a,b);            % avoid over write
      M(a,b)    = M(swap,b);         % 1st part of swap
      M(swap,b) = save;              % 2nd part of swap
   end
end

You can fairly easily merge this code fragment back in<strong>to</strong> our previous code (exercise!).<br />

Despite the origins of pivoting, <strong>to</strong> overcome a zero on the diagonal, it happens <strong>to</strong> be a<br />

good thing <strong>to</strong> do even when the diagonal is not zero. Why? Because it tends <strong>to</strong> reduce<br />

the effects of round off errors (we’ll see an example in one of the following lectures).<br />

The pivoting that we have just seen is actually a particular form of pivoting known as<br />

partial pivoting. There is another form known as full pivoting in which we search for the<br />

largest element not only down the column but also across the row. This could lead <strong>to</strong> a<br />

swap of columns rather than rows (depending on the outcome of the search). Swapping<br />

columns is allowed if you also swap appropriate rows in the solution vec<strong>to</strong>r X.<br />

5.2.2 Tri-diagonal systems<br />

There is a particularly simple but important system of linear equations that crops up<br />

frequently when studying numerical approximations <strong>to</strong> differential equations. The augmented<br />

matrix has this simple structure<br />

[ β_1  γ_1   0    0    0    0    0   ···   λ_1 ]
[ α_2  β_2  γ_2   0    0    0    0   ···   λ_2 ]
[  0   α_3  β_3  γ_3   0    0    0   ···   λ_3 ]
[  0    0   α_4  β_4  γ_4   0    0   ···   λ_4 ]
[  0    0    0   α_5  β_5  γ_5   0   ···   λ_5 ]
[  .    .    .    .    .    .    .   ···    .  ]
[  0    0    0   ···   0    0   α_N  β_N   λ_N ]

When Gaussian elimination is applied to this system we get the following solution for X (it's easy, try it!). This is known as the Thomas algorithm for a tri-diagonal system.


% Here the arrays alpha(i), beta(i), gam(i), lam(i) hold the coefficients
% alpha_i, beta_i, gamma_i, lambda_i of the tri-diagonal system.

%--- Gaussian elimination -----------------------------------------
for i = 2 : N
   beta(i) = beta(i) - (alpha(i)/beta(i-1))*gam(i-1);
   lam(i)  = lam(i)  - (alpha(i)/beta(i-1))*lam(i-1);
end

%--- Back substitution --------------------------------------------
X(N) = lam(N)/beta(N);
for i = N-1 : -1 : 1
   X(i) = (lam(i) - gam(i)*X(i+1))/beta(i);
end

Example 5.2<br />

Verify that the above Matlab code is correct (i.e. follow the steps of the Gaussian elimination and see that it leads to the above code).

5.2.3 Round-off Errors and Pivoting<br />

The collected wisdom of many experts is that Gaussian elimination with back substitution<br />

is highly susceptible <strong>to</strong> the accumulation of round-off errors. This is particularly<br />

prevalent for large systems (and this may happen even when N ≈ 10!). You might well<br />

be pondering how we might measure the error in the computed solution. Let’s suppose<br />

the exact (pencil and paper) solution is X and that the computer has returned ˜X as its<br />

approximation <strong>to</strong> X. How might we check ˜X? If we knew X then we would hardly need<br />

<strong>to</strong> compute ˜X. So in the absence of X the best we can do is substitute ˜X back in<strong>to</strong> our<br />

system of equations <strong>to</strong> compute the residual r defined by<br />

r = AX̃ − f

The entries in r will measure the error in each equation. Ideally we would like each entry<br />

in r <strong>to</strong> be zero. But in the real world, the entries in r will not be zero. How large can<br />

we expect the entries in r to be? And when should we be worried (i.e. when would we say that the errors are too large and thus that we should be very wary about the quality of the approximation X̃)? Well, whole books are written on this matter. One

approach would be <strong>to</strong> compare the length of r <strong>to</strong> that of X. If we define L(u) <strong>to</strong> be the<br />

length of a vec<strong>to</strong>r u then, in this example, we would say that if L(r) ≪ L(˜X) then ˜X<br />

is probably a good (accurate) solution. There are other measures which we will not go<br />

in<strong>to</strong>.<br />
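As a quick illustration (a sketch only; here Matlab's backslash operator stands in for our own Gaussian elimination program), the residual test takes just a couple of lines once A, f and the computed solution are to hand:

% Residual check for a computed solution X_tilde of A*X = f.
A = [ 2  3  1 ; 1  2  2 ; 4  8 11 ];
f = [ 10 ; 10 ; 49 ];

X_tilde = A\f;                 % stand-in for the solution from our own program

r = A*X_tilde - f;             % residual: ideally the zero vector
fprintf('L(r)  = %.3e\n', norm(r));
fprintf('L(X~) = %.3e\n', norm(X_tilde));   % if norm(r) << norm(X_tilde), X_tilde is probably good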

The point to take home here is that for large systems of equations it is best not to use Gaussian elimination in any of its forms.

Example 5.3<br />

Why did we make no mention of truncation errors?<br />


Example 5.4<br />

The system of equations

[ −0.002   4.000   4.000 ] [ x ]   [  7.998 ]
[ −2.000   2.906  −5.387 ] [ y ] = [ −4.481 ]
[  3.000  −4.031  −3.112 ] [ z ]   [ −4.143 ]

has the exact solution

x = y = z = 1

This system is nearly singular (the determinant is close to zero) so we can expect some troubles, particularly when we only use a limited number of decimal places of accuracy. Here are our results performed without pivoting.

Gaussian elimination with B-S but without pivoting<br />

Digits [x, y, z]<br />

2 0.00000e+00 0.00000e+00 0.00000e+00<br />

4 5.00000e+00 2.00200e+00 0.00000e+00<br />

6 1.02000e+00 1.00692e+00 9.93092e-01<br />

8 1.00030e+00 1.00007e+00 9.99931e-01<br />

15 1.00000e+00 1.00000e+00 1.00000e+00<br />

Notice how poor the answer is when we use less than 8 decimal digits. If on the other<br />

hand we use pivoting then we find<br />

Gaussian elimination with B-S and with pivoting<br />

Digits [x, y, z]<br />

2 9.00000e-01 1.00000e+00 1.00000e+00<br />

4 1.00000e+00 1.00000e+00 1.00000e+00<br />

6 1.00000e+00 1.00000e+00 1.00000e+00<br />

8 1.00000e+00 1.00000e+00 1.00000e+00<br />

15 1.00000e+00 1.00000e+00 1.00000e+00<br />

Thus this simple change (<strong>to</strong> use pivoting) has brought about a dramatic improvement<br />

in the computations (even on a 2 digit computer).<br />

How did this occur? If you follow the exact steps of the Gaussian elimination stage<br />

(without pivoting) you will find that at some stage you will subtract two nearly equal<br />

numbers. Later you will divide this by the very small number −0.002 from the element<br />

(1, 1) and this will amplify the already significant relative error (in the earlier<br />

subtraction). This amplified error will then be propagated through the system during<br />

the remainder of the Gaussian elimination. When you begin the back substitution the<br />


last equation will be junk – it has been swamped by round-off error. So all subsequent<br />

computations will return junk – the whole scheme fails.<br />

On the other hand when you do use pivoting you will only divide by large numbers thus<br />

avoiding the problem of amplifying the round-off error. This keeps the calculations in<br />

check and gives us a reasonable answer. But be warned – pivoting is not the universal<br />

panacea <strong>to</strong> the problems such as the above. There are many systems that even with<br />

pivoting will yield poor results unless you use a sufficient number of digits (15 will not<br />

always suffice!).<br />

Example 5.5<br />

Do exactly as outlined in the last two paragraphs. This will show you exactly where<br />

the round-off error creeps in. Remember, at each stage you can test the quality of your<br />

system by substituting in the exact solution x = y = z = 1.<br />

5.3 Ill-conditioned systems<br />

These are systems for which small changes in the coefficient matrix lead to large changes in the solutions. This is not a good thing!

Example 5.6<br />

Here is a series of simple pairs of equations and their exact (pen and paper) solutions.

[ 1.000  2.000 ] [ x ]   [ 6.000 ]        [ x ]   [ 6.000 ]
[ 1.000  2.001 ] [ y ] = [ 6.000 ]   ⇒   [ y ] = [ 0.000 ]

[ 1.000  2.000 ] [ x ]   [ 6.000 ]        [ x ]   [ 0.000 ]
[ 1.001  2.000 ] [ y ] = [ 6.000 ]   ⇒   [ y ] = [ 3.000 ]

[ 1.000  2.000 ] [ x ]   [ 6.000 ]        [ x ]   [ 2.000 ]
[ 1.001  2.000 ] [ y ] = [ 6.002 ]   ⇒   [ y ] = [ 2.000 ]

As you can see, very small changes in the coefficient matrix lead to significant changes in the exact solution. This is not a computer problem, it is a problem intrinsic to the pair of equations. As such we can expect that when such systems are solved on a finite precision computer we will see problems. Here is what we get for our 2, 4, 6 and 8-digit computers using Gaussian elimination with back substitution and pivoting.

1st system, exact solution x = 6, y = 0<br />

Digits x y<br />

2 0.00000e+00 0.00000e+00<br />

4 6.00000e+00 0.00000e+00<br />

6 6.00000e+00 0.00000e+00<br />

8 6.00000e+00 0.00000e+00<br />


2nd system, exact solution x = 0, y = 3<br />

Digits x y<br />

2 0.00000e+00 0.00000e+00<br />

4 -4.00000e+00 5.00000e+00<br />

6 2.00000e-05 3.00000e+00<br />

8 0.00000e+00 3.00000e+00<br />

3rd system, exact solution x = 2, y = 2<br />

Digits x y<br />

2 0.00000e+00 0.00000e+00<br />

4 5.99600e+00 0.00000e+00<br />

6 2.00000e+00 2.00000e+00<br />

8 2.00000e+00 2.00000e+00<br />

Notice that each of these systems is nearly singular, that is the determinant of the coefficient matrix is very small (typically 0.001). This is one way of spotting an ill-conditioned system. But this is a limited tool, for it could be that all entries in the matrix are small and yet it is far from singular (for example, take 0.001 times the 2 × 2 identity matrix). More precise measures of ill-conditioned systems do exist but to discuss them now would take us into messy territory – something for you to study in your own time (keywords are: ill-conditioned, matrix norm and condition number).
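For what it is worth, Matlab can compute these quantities directly. The following sketch (not part of the original notes) reports the determinant and the condition number of the first coefficient matrix above; a large condition number flags an ill-conditioned system:

% Determinant and condition number of the first ill-conditioned system.
A = [ 1.000  2.000 ;
      1.000  2.001 ];

fprintf('det(A)  = %.3e\n', det(A));    % close to zero: nearly singular
fprintf('cond(A) = %.3e\n', cond(A));   % large: small changes in A or f cause large changes in X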

5.4 Operational cost<br />

It’s no surprise that it takes more time <strong>to</strong> solve a 10 by 10 system than a 3 by 3 system.<br />

But how much more time? Well, we could use a s<strong>to</strong>p-watch <strong>to</strong> get the actual times<br />

but, as useful as that might be, we can do better – we can develop a simple theoretical<br />

estimate. What is important is not the actual time for any one calculation but how<br />

many times longer a 10 by 10 system takes over a 3 by 3 system. We will measure each<br />

time in units of floating point operations. That is, we will count (approximately) the<br />

number of floating point operations (typically just the multiplies and divides) required<br />

<strong>to</strong> completely solve a N by N system.<br />

Why do we only count the multiplies and divides and not the additions and subtractions?<br />

Because the latter operations are done much more quickly than the former operations<br />

and thus its reasonable <strong>to</strong> ignore them when estimating the <strong>to</strong>tal computational effort.<br />

So how do we apply this idea <strong>to</strong> Gaussian elimination with back substitution (but without<br />

pivoting)? We will examine each stage in turn. First, we look at the Gaussian elimination<br />

stage. At the heart of this stage is the line<br />

M(b,c) = M(b,c) - fac<strong>to</strong>r*M(a,c)<br />


This involves just one floating point operation (the multiply). And this is buried inside<br />

three loops each running (approximately) from 1 <strong>to</strong> N giving a <strong>to</strong>tal of O (N 3 ) flops<br />

(flops = Floating Point Operations). In the outer two loops we also have one division<br />

in calculating fac<strong>to</strong>r. This adds a further O (N 2 ) flops <strong>to</strong> the <strong>to</strong>tal. Thus in this first<br />

stage we estimate the <strong>to</strong>tal operational count <strong>to</strong> be O (N 3 ) + O (N 2 ) flops. Now for the<br />

back substitution we see that we have just two loops, with one multiply in both loops<br />

and one divide in just the outer loop. Thus we estimate the operation count for this<br />

stage <strong>to</strong> be O (N 2 ) + O (N 1 ). Finally, we estimate the combined operation count <strong>to</strong> be<br />

O(N³) + O(N²) + O(N¹). For large N (say N ≳ 10) this is dominated by O(N³) and so we take this as our final estimate of the operational cost for Gaussian elimination.

Operation Count : Gaussian elimination<br />

The operational cost <strong>to</strong> solve an N by N system of equations using Gaussian elimination<br />

is O (N 3 ).<br />
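If you want to see the N³ growth for yourself, a rough timing experiment is easy to set up (a sketch only – here the backslash solver stands in for our own Gaussian elimination code, and the actual times will vary from machine to machine):

% Rough timing experiment: solving random N x N systems.
% Doubling N should increase the time by roughly a factor of 2^3 = 8 for large N.
for N = [500 1000 2000]
   A = rand(N);  f = rand(N,1);
   tic;  X = A\f;  t = toc;
   fprintf('N = %4d   time = %.3f s\n', N, t);
end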

5.5 Iterative methods<br />

It's been said a few times that Gaussian elimination performs badly on large systems. Here is an example.

We start with the simple function f(t) = (1 − t^{n+1})/(1 − t) with n any positive integer. You might (should?) recognise this function – it's the sum of the geometric series with common ratio t. That is

f(t) = (1 − t^{n+1})/(1 − t) = 1 + t + t² + t³ + ··· + t^n ,     n = 1, 2, 3, ···

This is a polynomial in t with each coefficient equal <strong>to</strong> one. Is it possible <strong>to</strong> recover<br />

these coefficients by sampling f(t) for various values of t? Suppose we wrote<br />

f(t) = 1 + a_1 t + a_2 t² + a_3 t³ + ··· + a_n t^n

We could evaluate both left and right hand sides for n distinct choices of t. In this<br />

example we will choose t = 1, 2, 3, · · · n + 1. This leads us <strong>to</strong> the system of equations<br />

f(1) − 1 = 1 1 a 1 + 1 2 a 2 + 1 3 a 3 + · · · 1 n a n<br />

f(2) − 1 = 2 1 a 1 + 2 2 a 2 + 2 3 a 3 + · · · 2 n a n<br />

f(3) − 1 = 3 1 a 1 + 3 2 a 2 + 3 3 a 3 + · · · 3 n a n<br />

. = .<br />

f(n) − 1 = n 1 a 1 + n 2 a 2 + n 3 a 3 + · · · n n a n<br />


Since we know the values for f(t) we can treat this as a system of n equations in the n unknowns a_1, a_2, a_3, · · · a_n. On a perfect computer (i.e. infinite precision) we would expect all a_j = 1, so we can define an error by

E_n = ( ( (1 − a_1)^2 + (1 − a_2)^2 + (1 − a_3)^2 + · · · + (1 − a_n)^2 ) / n )^(1/2)

So much for definitions, what do we get? Here are the results for a 15 digit computer.

n     11          12          13          14          15          16
E_n   0.000e+00   0.000e+00   0.000e+00   1.150e+02   6.482e+04   1.603e+11

We see that things go way off the rails for n around 14. This is not a very large system of equations and yet things have gone terribly wrong. The cause is simply the accumulation of round-off errors. If you take a look at the entries in the coefficient matrix you will see some wildly varying numbers. For N = 5 the coefficient matrix is

⎡ 1.000e+00  1.000e+00  1.000e+00  1.000e+00  1.000e+00  1.000e+00 ⎤
⎢ 2.000e+00  4.000e+00  8.000e+00  1.600e+01  3.200e+01  6.400e+01 ⎥
⎢ 3.000e+00  9.000e+00  2.700e+01  8.100e+01  2.430e+02  7.290e+02 ⎥
⎢ 4.000e+00  1.600e+01  6.400e+01  2.560e+02  1.024e+03  4.096e+03 ⎥
⎢ 5.000e+00  2.500e+01  1.250e+02  6.250e+02  3.125e+03  1.563e+04 ⎥
⎣ 6.000e+00  3.600e+01  2.160e+02  1.296e+03  7.776e+03  4.666e+04 ⎦

As you can see, the numbers in the bottom right hand corner are large. For N = 15 the bottom right hand corner is approximately 4.4 × 10^17. With such a wide range of numbers (the top left hand corner is always 1) it is no surprise that a 15 digit computer will have trouble maintaining numerical accuracy.

What is the lesson here? We should not place undying faith in Gaussian elimination. We must develop alternatives (whether or not it helps us on the above system).

In this section we will look at iterative methods, similar to the fixed point methods we saw earlier, applied to the solution of linear systems of equations.

Let it be said at the outset – do not expect miracles with these methods. We will see that they can work but that their convergence is slow.

5.5.1 Jacobi iteration

Most of us should have little trouble solving a system such as

⎡ 400   0    0  ⎤ ⎡ x ⎤   ⎡ 400 ⎤
⎢  0   200   0  ⎥ ⎢ y ⎥ = ⎢ 400 ⎥
⎣  0    0   300 ⎦ ⎣ z ⎦   ⎣ 300 ⎦


The solution is x = 1, y = 2 and z = 1. Now suppose we want to solve the related system

⎡ 400   1    2  ⎤ ⎡ x ⎤   ⎡ 400 ⎤
⎢  3   200   1  ⎥ ⎢ y ⎥ = ⎢ 400 ⎥
⎣  1    3   300 ⎦ ⎣ z ⎦   ⎣ 300 ⎦

We could reasonably guess that x = 1, y = 2 and z = 1 would be a good approximation to the exact solution. How might we improve on this solution? We could apply Gaussian elimination and it would work well. But for the purposes of showing you a new technique, let's rule out Gaussian elimination.

Re-arrange the equations in the following way.

solve 1st equation for x  ⇒  x = (400 − y − 2z)/400
solve 2nd equation for y  ⇒  y = (400 − 3x − z)/200
solve 3rd equation for z  ⇒  z = (300 − x − 3y)/300

This suggests the iterative scheme

x_{n+1} = (400 − y_n − 2z_n)/400
y_{n+1} = (400 − 3x_n − z_n)/200
z_{n+1} = (300 − x_n − 3y_n)/300

To get the ball rolling we need an initial guess; let's take x_0 = y_0 = z_0 = 0. Here are our results.

Jacobi iteration

Iteration   x               y               z
0           0.0000000e+00   0.0000000e+00   0.0000000e+00
1           1.0000000e+00   2.0000000e+00   1.0000000e+00
2           9.9000000e-01   1.9800000e+00   9.7666667e-01
3           9.9016667e-01   1.9802667e+00   9.7690000e-01
4           9.9016483e-01   1.9802630e+00   9.7689678e-01
5           9.9016486e-01   1.9802630e+00   9.7689682e-01
6           9.9016486e-01   1.9802630e+00   9.7689682e-01
7           9.9016486e-01   1.9802630e+00   9.7689682e-01

It converges quickly, and you can check that the x, y, z values are correct, but this is probably not much of a surprise since the system is almost trivial, with just a few small off-diagonal terms. How well does this idea work for other systems? There is only one real test – try it on other systems!


Here is another system

⎡ 5  1  2 ⎤ ⎡ x ⎤   ⎡ 5 ⎤
⎢ 1  6  2 ⎥ ⎢ y ⎥ = ⎢ 9 ⎥
⎣ 1  1  3 ⎦ ⎣ z ⎦   ⎣ 4 ⎦

from which we create the following iteration formula

x_{n+1} = (5 − y_n − 2z_n)/5
y_{n+1} = (9 − x_n − 2z_n)/6
z_{n+1} = (4 − x_n − y_n)/3

Here are our results, starting with x_0 = y_0 = z_0 = 0.

Jacobi iteration

Iteration   x               y               z
0           0.0000000e+00   0.0000000e+00   0.0000000e+00
2           1.6666667e-01   8.8888889e-01   5.0000000e-01
4           3.4629630e-01   1.0691358e+00   6.9074074e-01
6           4.1298354e-01   1.1278464e+00   7.5936214e-01
8           4.3650091e-01   1.1483112e+00   7.8375514e-01
10          4.4477533e-01   1.1555065e+00   7.9238872e-01
20          4.4925085e-01   1.1593990e+00   7.9707572e-01
30          4.4927523e-01   1.1594202e+00   7.9710131e-01
40          4.4927536e-01   1.1594203e+00   7.9710145e-01
50          4.4927536e-01   1.1594203e+00   7.9710145e-01

This is not so good. It does converge, but rather slowly. As we shall see, there is one way to accelerate the convergence (but do not get your hopes too high – this class of iterations is notoriously slow to converge, sometimes requiring many thousands of iterations to achieve even a modest 5 decimal digits of accuracy).


5.5.2 Gauss-Seidel iteration

This is a minor variation on the Jacobi iteration where now we use the new values as soon as they become available. Thus for the second Jacobi example we would use

x_{n+1} = (5 − y_n − 2z_n)/5
y_{n+1} = (9 − x_{n+1} − 2z_n)/6
z_{n+1} = (4 − x_{n+1} − y_{n+1})/3

Notice the x_{n+1} and y_{n+1} appearing on the right hand side. These equations must be executed in order from first to last (how else could we use x_{n+1} in the second equation?). This minor change produced the following results.

Gauss-Seidel iteration

Iteration   x               y               z
0           0.0000000e+00   0.0000000e+00   0.0000000e+00
2           5.1111111e-01   1.2296296e+00   7.5308642e-01
4           4.4881207e-01   1.1614577e+00   7.9657674e-01
6           4.4923516e-01   1.1594281e+00   7.9711224e-01
8           4.4927475e-01   1.1594194e+00   7.9710193e-01
10          4.4927537e-01   1.1594203e+00   7.9710145e-01
20          4.4927536e-01   1.1594203e+00   7.9710145e-01
This is better, but its far from ideal.<br />

Can we do better? Remember how New<strong>to</strong>n-Raphson iterations converged quadratically<br />

(every iteration double the number of decimal places of accuracy). Can we obtain<br />

similar convergence for iterative solutions of linear systems? In general no. Dash. Dang.<br />

Fiddlesticks. ’Doh. The sad fact is that solving large systems of linear system by iterative<br />

methods will be slow – patience, patience and more patience!<br />


Jacobi Iteration

For an n × n system of equations such as

⎡ a_11  a_12  a_13  · · ·  a_1n ⎤ ⎡ x_1 ⎤   ⎡ b_1 ⎤
⎢ a_21  a_22  a_23  · · ·  a_2n ⎥ ⎢ x_2 ⎥   ⎢ b_2 ⎥
⎢ a_31  a_32  a_33  · · ·  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥
⎢  .     .     .    · · ·   .   ⎥ ⎢  .  ⎥   ⎢  .  ⎥
⎣ a_n1  a_n2  a_n3  · · ·  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦

The Jacobi iteration for this system is defined to be

(x_i)_{n+1} = (1/a_ii) ( b_i − Σ_{j≠i} a_ij (x_j)_n ) ,   i = 1, 2, 3, · · · n

where the notation (x_i)_n means the n-th iteration for the exact value x_i.

Gauss-Seidel Iteration

For an n × n system of equations such as

⎡ a_11  a_12  a_13  · · ·  a_1n ⎤ ⎡ x_1 ⎤   ⎡ b_1 ⎤
⎢ a_21  a_22  a_23  · · ·  a_2n ⎥ ⎢ x_2 ⎥   ⎢ b_2 ⎥
⎢ a_31  a_32  a_33  · · ·  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥
⎢  .     .     .    · · ·   .   ⎥ ⎢  .  ⎥   ⎢  .  ⎥
⎣ a_n1  a_n2  a_n3  · · ·  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦

The Gauss-Seidel iteration for this system is defined to be

(x_i)_{n+1} = (1/a_ii) ( b_i − Σ_{j=1}^{i−1} a_ij (x_j)_{n+1} − Σ_{j=i+1}^{n} a_ij (x_j)_n ) ,   i = 1, 2, 3, · · · n

where the notation (x_i)_n means the n-th iteration for the exact value x_i.
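Both definitions translate almost line for line into Matlab. The sketch below is not part of these notes' code – it is a minimal illustration, with an invented function name, of one or more sweeps applied to A x = b; the flag use_new_values switches between the Jacobi and Gauss-Seidel updates.

function x = iterate_linear(A, b, x, nsweeps, use_new_values)
% Jacobi (use_new_values = false) or Gauss-Seidel (true) sweeps for A*x = b.
% Illustrative sketch only; the function name is invented for this example.
n = length(b);
for sweep = 1:nsweeps
    x_old = x;                                    % Jacobi works only from the previous sweep
    for i = 1:n
        if use_new_values
            s = A(i,:)*x - A(i,i)*x(i);           % Gauss-Seidel: updated entries used at once
        else
            s = A(i,:)*x_old - A(i,i)*x_old(i);   % Jacobi: old entries only
        end
        x(i) = (b(i) - s)/A(i,i);
    end
end

For example, iterate_linear([5 1 2; 1 6 2; 1 1 3], [5; 9; 4], zeros(3,1), 50, false) should reproduce (to the digits shown) the converged values in the Jacobi table above.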

5.5.3 Diagonal dominance and convergence

Will Jacobi or Gauss-Seidel iterations always converge? No prizes for correctly guessing No. Here is an example.

We start with the system

⎡ 1  3  5 ⎤ ⎡ x ⎤   ⎡ 1 ⎤
⎢ 5  1  2 ⎥ ⎢ y ⎥ = ⎢ 3 ⎥
⎣ 2  4  1 ⎦ ⎣ z ⎦   ⎣ 2 ⎦


for which a Jacobi iteration scheme would be

x_{n+1} = 1 − 3y_n − 5z_n
y_{n+1} = 3 − 5x_n − 2z_n
z_{n+1} = 2 − 2x_n − 4y_n

with the following results

Divergent Jacobi iteration

Iteration   x                y                z
0           0.0000000e+00    0.0000000e+00    0.0000000e+00
2           -1.8000000e+01   -6.0000000e+00   -1.2000000e+01
4           -6.6000000e+02   -5.1600000e+02   -6.2400000e+02
6           -3.0582000e+04   -3.0114000e+04   -2.7540000e+04
8           -1.5320880e+06   -1.5034560e+06   -1.2880560e+06
10          -7.6099674e+07   -7.2909246e+07   -6.2847516e+07
20          -2.1522512e+16   -2.0473354e+16   -1.7848334e+16

Whoops, things have gone horribly astray. Why? Look at the coefficients in the equation for x_{n+1}. These are dominated by those for y_n and z_n and, importantly, are larger than 1. Thus any errors in y_n and z_n will be amplified and passed on as a larger error in x_{n+1}. So any small error will quickly swamp the iterations (as the above results show).

If that logic is correct (it is) then perhaps a re-arrangement of the original equations might help. If we move the first equation to the last in the original system, then we obtain

⎡ 5  1  2 ⎤ ⎡ x ⎤   ⎡ 3 ⎤
⎢ 2  4  1 ⎥ ⎢ y ⎥ = ⎢ 2 ⎥
⎣ 1  3  5 ⎦ ⎣ z ⎦   ⎣ 1 ⎦

for which a Jacobi iteration scheme would be

x_{n+1} = (3 − y_n − 2z_n)/5
y_{n+1} = (2 − 2x_n − z_n)/4
z_{n+1} = (1 − x_n − 3y_n)/5

Does this save the day? Okay, here are the results


Convergent Jacobi iteration

Iteration   x               y               z
0           0.0000000e+00   0.0000000e+00   0.0000000e+00
2           4.2000000e-01   1.5000000e-01   -2.2000000e-01
4           5.2060000e-01   1.6450000e-01   -1.3860000e-01
6           5.4625800e-01   1.8943500e-01   -8.9118000e-02
8           5.5933494e-01   2.0684805e-01   -6.9042340e-02
10          5.6687170e-01   2.1587029e-01   -5.9805334e-02
20          5.7471659e-01   2.2468058e-01   -5.0347124e-02
30          5.7499005e-01   2.2498879e-01   -5.0012182e-02
40          5.7499965e-01   2.2499961e-01   -5.0000427e-02
50          5.7499999e-01   2.2499999e-01   -5.0000015e-02

There is a theorem that makes sense of what we have just observed.

Diagonal Dominance and Convergence

For an n × n system of equations such as

⎡ a_11  a_12  a_13  · · ·  a_1n ⎤ ⎡ x_1 ⎤   ⎡ b_1 ⎤
⎢ a_21  a_22  a_23  · · ·  a_2n ⎥ ⎢ x_2 ⎥   ⎢ b_2 ⎥
⎢ a_31  a_32  a_33  · · ·  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥
⎢  .     .     .    · · ·   .   ⎥ ⎢  .  ⎥   ⎢  .  ⎥
⎣ a_n1  a_n2  a_n3  · · ·  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦

the Jacobi and Gauss-Seidel iterations will converge when

|a_ii| > Σ_{j≠i} |a_ij|

for each i = 1, 2, 3, · · · n.

A matrix which satisfies this condition is said to be diagonally dominant.
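The condition is easy to test in Matlab. Here is a small illustrative helper (not part of these notes' code, and with an invented name) that checks strict diagonal dominance row by row; applied to the re-ordered matrix above it returns true, while the original ordering returns false.

function ok = is_diagonally_dominant(A)
% True if |a_ii| > sum over j ~= i of |a_ij| for every row i.
% Illustrative sketch only.
d = abs(diag(A));               % the |a_ii|
offdiag = sum(abs(A),2) - d;    % row sums of |a_ij| with the diagonal removed
ok = all(d > offdiag);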

Example 5.7

Consider the Jacobi iteration. Define the error in x_i at iteration p by (ε_i)_p = (x_i)_p − x_i. Express the Jacobi iteration for (x_i)_p as an iteration scheme for (ε_i)_p. Then, using |a + b + c + · · ·| ≤ |a| + |b| + |c| + · · · (which is true for any numbers a, b, c, . . .), show that, if the matrix is diagonally dominant, then

max_i |(ε_i)_{p+1}| < max_i |(ε_i)_p|

What can you infer from this last line?


5.6 Operational counts

We know that solving an N × N system of equations using Gaussian elimination requires O(N^3) flops. What do we make of the Jacobi and Gauss-Seidel iterations? For each new (x_i)_{n+1} we have to complete a sum of N − 1 terms, each of which involves just one multiply. Thus we have N − 1 multiplications and one division for each of the N values (x_i)_{n+1}. This gives us a total operation count of O(N^2) per iteration. This is much smaller than the O(N^3) for the standard Gaussian elimination.

What does this tell us? If, for example, N = 100, the computer would take roughly as much time to compute one Gaussian elimination as 100 Jacobi iterations. Thus 100 iterations is not as computationally demanding as it may have seemed at first. Thus for large N, successive Jacobi or Gauss-Seidel iterations may well be computationally more efficient than a single Gaussian elimination.


6. Solving Systems of Nonlinear Equations



6.1 Introduction

Let's start with a simple example. Suppose we wish to find the (x, y) values such that

0 = f(x, y) = y − 7x^3 + 10x + 1     (1)
0 = g(x, y) = x + 8y^3 − 11y − 1     (2)

If we only had one equation, say 0 = f(x, y), then for any value of x we could solve for y (using methods like Newton-Raphson for a function of one variable). Thus this one equation describes a curve in the xy-plane. Likewise the second equation, 0 = g(x, y), also describes a curve. What we are looking for are the intersection points of these curves.

[Figure: the curves f(x, y) = 0 and g(x, y) = 0 plotted in the xy-plane, with a second panel zoomed in on the region near the origin.]

From the plots we can see that we have nine points of intersection. Our game is to find these points. We will do this using two different schemes, both generalised versions of the algorithms we used for functions of one variable. Each algorithm will generate a series of approximations (x, y)_0, (x, y)_1, (x, y)_2, · · · which we hope will converge to at least one of the nine intersection points.

6.2 Generalised Fixed Point iteration

It is a simple matter to rearrange the pair of equations (1) and (2) into

y = 7x^3 − 10x − 1
x = −8y^3 + 11y + 1

which then suggests the iteration scheme

y_{n+1} = 7x_n^3 − 10x_n − 1     (3)
x_{n+1} = −8y_n^3 + 11y_n + 1    (4)

Starting with (x, y)_0 = (0, 0) (chosen by inspection of the above plots) we get for the first five iterations

Divergent Generalised Fixed Point iterations

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      0.000e+00    0.000e+00    1.000e+00     -1.000e+00
1      1.000e+00    -4.000e+00   0.000e+00     -4.680e+02
2      4.690e+02    7.221e+08    0.000e+00     3.013e+27
3      -3.013e+27   -1.914e+83   -3.013e+28    -5.61e+250
4      5.61e+250    Inf          NaN           NaN
5      NaN          NaN          NaN           NaN

This is not a good start – the iterations diverge very rapidly. It is not hard to see why – the coefficients in (3) and (4) are all greater than one and so any errors in (x, y)_n will be amplified with each iteration.

Undaunted by this first round failure, we press on with another choice of iteration scheme, such as the following

x_{n+1} = ( 7x_n^3 − y_n − 1 ) / 10     (5)
y_{n+1} = ( 8y_n^3 + x_n − 1 ) / 11     (6)

Starting with (x, y)_0 = (0, 0) (again by inspection of the above plots) we get for the first five iterations


Convergent Generalised Fixed Point iterations

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      0.000e+00    0.000e+00    1.000e+00     -1.000e+00
1      -1.000e-01   -1.000e-01   -9.300e-02    -8.000e-03
2      -9.070e-02   -9.988e-02   -1.659e-03    2.833e-05
3      -9.053e-02   -9.986e-02   -1.095e-05    4.227e-06
4      -9.053e-02   -9.986e-02   2.953e-07     1.158e-07
5      -9.053e-02   -9.986e-02   1.292e-08     1.877e-09

This is much better. The iterations are converging. Great. But can we get the remaining 8 intersection points? No; try as you may, for all choices of initial guess the series converges to just this one point, (−9.053e−02, −9.986e−02). To recover other points you will need to work with other fixed point versions of the original system of equations (1,2).

Example 6.1

See if you can find any of the other points of intersection by constructing alternative forms for the fixed point iterations for the system (1,2).

Generalised Fixed Point Iteration

If a system of equations

0 = f_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

can be re-written in the form

x_i = g_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

then the sequence

x_i^{p+1} = g_i(x_1^p, x_2^p, x_3^p, · · · , x_n^p) ,   i = 1, 2, 3, · · · , n

for p = 0, 1, 2, 3 · · · is known as generalised fixed point iteration. In the above, x_i^p denotes the p-th iteration for x_i. The sequence is not guaranteed to converge to any root of the system.

6.2.1 Convergence

What can we say about the convergence of the generalised fixed point iterations? As we have seen, they may or may not converge. When we saw that the first form of fixed point iteration diverged, we realised that this was to be expected because the coefficients in the iteration equations were all larger than 1 and thus any errors would be amplified in each iteration. Though this is true (in this case) it is a loose mathematical statement (e.g. x_{n+1} = 7x_n^3 will converge to x = 0) and not easily applied to other systems. Here we shall explore a better and more mathematically based criterion for convergence.

We begin by writing our system of equations in the form

x_i = g_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

Just so we all agree – this is a set of n equations in the n variables (x_1, x_2, x_3, · · · , x_n). Now we can define the fixed point iteration equations as

x_i^{p+1} = g_i(x_1^p, x_2^p, x_3^p, · · · , x_n^p) ,   i = 1, 2, 3, · · · , n

Next, we define the error at the p-th iteration of x_i by

ε_i^p = x_i^p − x_i ,   i = 1, 2, 3, · · · , n

Thus x_i^p = x_i + ε_i^p and we can substitute this into our fixed point equations

x_i + ε_i^{p+1} = g_i(x_1 + ε_1^p, x_2 + ε_2^p, x_3 + ε_3^p, · · · , x_n + ε_n^p) ,   i = 1, 2, 3, · · · , n

But our hope is that the sequence is converging. Thus every error ε_i is small, allowing us to expand each function as a Taylor series in powers of ε_i. If we retain just the first order terms (i.e. ignore higher powers of ε_i) then we find

ε_i^{p+1} = (∂g_i/∂x_1) ε_1^p + (∂g_i/∂x_2) ε_2^p + (∂g_i/∂x_3) ε_3^p + · · · + (∂g_i/∂x_n) ε_n^p ,   i = 1, 2, 3, · · · , n

Each of the partial derivatives is evaluated at the root x_i. It helps to write this in matrix form,

⎡ ε_1^{p+1} ⎤   ⎡ ∂g_1/∂x_1   ∂g_1/∂x_2   ∂g_1/∂x_3   · · ·   ∂g_1/∂x_n ⎤ ⎡ ε_1^p ⎤
⎢ ε_2^{p+1} ⎥   ⎢ ∂g_2/∂x_1   ∂g_2/∂x_2   ∂g_2/∂x_3   · · ·   ∂g_2/∂x_n ⎥ ⎢ ε_2^p ⎥
⎢ ε_3^{p+1} ⎥ = ⎢ ∂g_3/∂x_1   ∂g_3/∂x_2   ∂g_3/∂x_3   · · ·   ∂g_3/∂x_n ⎥ ⎢ ε_3^p ⎥
⎢     .     ⎥   ⎢     .           .           .       · · ·       .     ⎥ ⎢   .   ⎥
⎣ ε_n^{p+1} ⎦   ⎣ ∂g_n/∂x_1   ∂g_n/∂x_2   ∂g_n/∂x_3   · · ·   ∂g_n/∂x_n ⎦ ⎣ ε_n^p ⎦

What do we make of this equation? To help see what's going on, let's condense these equations into

E^{p+1} = J E^p

where E is the column vector of errors and J is the matrix of partial derivatives. We can wind this equation all the way back to the initial iteration

E^{p+1} = J E^p = J ( J E^{p−1} ) = J ( J ( J E^{p−2} ) ) = J ( J ( J ( J E^{p−3} ) ) ) = · · · = J^{p+1} E^0


Now we come to the crunch. We want the sequence to converge. That is, we want E^p → 0 as p → ∞. This can only occur when J^p → 0 as p → ∞. Very interesting you might say, but what use is this? Well, a theorem from linear algebra tells us that if J^p → 0 as p → ∞ then all of the eigenvalues of J must have absolute value less than 1. It has been a long journey but here is the result we want,

Convergence of Generalised Fixed Point Iteration

For a system of equations

x_i = g_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

construct the n × n matrix J with entries

J_ij = ∂g_i/∂x_j

Let λ_i be the eigenvalues of J. If |λ_i| < 1 , i = 1, 2, 3, · · · , n then the sequence

x_i^{p+1} = g_i(x_1^p, x_2^p, x_3^p, · · · , x_n^p) ,   i = 1, 2, 3, · · · , n

will converge to a root of the given system.
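In practice the partial derivatives can also be estimated numerically. The sketch below is an illustration only (not part of these notes): it assumes a user-supplied function gfun that returns the column vector [g_1(x); · · · ; g_n(x)], builds an approximate J by finite differences, and returns its eigenvalues so you can check whether they all lie inside the unit circle before committing to an iteration.

function lambda = fixed_point_eigs(gfun, x)
% Estimate J_ij = dg_i/dx_j at the point x by finite differences and
% return the eigenvalues of J. Illustrative sketch only.
n  = length(x);
J  = zeros(n,n);
h  = 1e-6;                         % small step for the finite differences
g0 = gfun(x);
for j = 1:n
    xp = x;  xp(j) = xp(j) + h;
    J(:,j) = (gfun(xp) - g0)/h;    % numerical estimate of column j of J
end
lambda = eig(J);                   % want all |lambda| < 1 for convergence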

Example 6.2

When we were playing with functions of one variable, we created the fixed point sequence x_{p+1} = g(x_p) and we found the convergence criterion to be |dg/dx| < 1. How does this fit in with what we have just found?

Note that to be able to apply the above test for convergence we must construct the matrix J at the root of the system. But we don't know the root – that's why we are using an iterative scheme. The best you can do is estimate J by using your current approximation to x_i.

Example 6.3

Estimate J for the first form of the fixed point iteration equations (3,4) at the initial guess x = 0, y = 0. Hence infer, without any further iterations, whether or not you would expect the subsequent iterations to converge or diverge.


Example 6.4

Modify the following Matlab code to implement a Gauss-Seidel style of iteration in which new values are used immediately (see the discussion on Gauss-Seidel iterations for linear systems given in previous lectures).

6.2.2 Matlab example

Here is a Matlab function which implements our first fixed point scheme. You will need to save this in a file called gen_fp.m in your workspace. You can run this example by typing the Matlab command

Matlab>> [x,y] = gen_fp([0 0],0.001,20)

This runs the code with initial guess (0, 0), requesting an accuracy of 0.001 within 20 iterations. The final approximations are returned as x and y (how inventive).

Here is the code


function [x,y] = gen_fp(x_start,target_error,max_loop)

% --- Set starting values
x_old = x_start;
x_new = x_start;
loop = 1;
looping = (loop < max_loop);
converged = false;

% --- Loop until we get target accuracy or too many loops
while (looping) & (~converged)
    x_new = onestep(x_old);               % do one iteration
    error = norm(x_new - x_old);          % measure the change
    loop = loop + 1;                      % update loop counter
    looping = (loop < max_loop);          % too many iterations?
    converged = (error < target_error);   % are we done?
    x_old = x_new;                        % prepare for next iteration
end

% --- Finished, save data
x = x_new(1);
y = x_new(2);

% --- The iteration formula
function y = onestep(x)
y(1) = -8*x(2)^3+11*x(2)+1;    % x = g1(x,y)
y(2) = 7*x(1)^3-10*x(1)-1;     % y = g2(x,y)

function y = norm(x)
% compute the length of a vector
y = sqrt(dot(x,x));


6.3 Generalised Newton-Raphson

For the pair of equations

0 = f(x, y) = y − 7x^3 + 10x + 1
0 = g(x, y) = x + 8y^3 − 11y − 1

we know there are 9 intersection points and yet, try as we might, the generalised fixed point method does not recover all 9 points. What do we do? Ten points to Gryffindor for suggesting the Newton-Raphson method. But how do we apply it to a pair of equations? Good point. We will have to return to basics, following steps similar to those we used for functions of one variable.

Let (x, y) be the exact root of the above equations and, as usual, let (x, y)_n be our current approximation to (x, y). Can we compute small changes δx and δy so that we jump to the exact root in the next iteration? That is, we want

0 = f(x_n + δx, y_n + δy)
0 = g(x_n + δx, y_n + δy)

We need access to δx and δy yet they are buried inside the functions f and g. We can get hold of them by expanding the pair in powers of δx and δy using a Taylor series. This leads to

0 = f(x_n, y_n) + (∂f/∂x) δx + (∂f/∂y) δy + · · ·
0 = g(x_n, y_n) + (∂g/∂x) δx + (∂g/∂y) δy + · · ·

where we have hidden all the higher order terms (e.g. δx^2, δx δy^2) in the trailing dots. If we assume (as we usually do) that the δ's are small, then the higher order terms can be discarded and we are left with

−f(x_n, y_n) = (∂f/∂x) δx + (∂f/∂y) δy
−g(x_n, y_n) = (∂g/∂x) δx + (∂g/∂y) δy

This is a simple 2 × 2 system of linear equations for δx and δy which we can solve (easily) by any suitable method (pencil and paper, Gaussian elimination). We can then update our approximations by

x_{n+1} = x_n + δx
y_{n+1} = y_n + δy


Example 6.5

Didn't we say we were going to choose δx and δy so that we would jump to the exact root in one iteration? Why then is this only an improved approximation?

How well does this new method work? Let's apply it to the above pair. Here is what we get for a few starting guesses.

Generalised Newton-Raphson iterations

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      0.000e+00    0.000e+00    1.000e+00     -1.000e+00
1      -9.009e-02   -9.910e-02   5.118e-03     -7.786e-03
2      -9.053e-02   -9.986e-02   3.718e-07     -1.393e-06
3      -9.053e-02   -9.986e-02   9.992e-16     -4.186e-14
4      -9.053e-02   -9.986e-02   1.110e-16     0.000e+00
5      -9.053e-02   -9.986e-02   0.000e+00     -1.110e-16

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      1.000e+00    1.000e+00    5.000e+00     -3.000e+00
1      1.472e+00    1.194e+00    -5.420e+00    9.662e-01
2      1.319e+00    1.159e+00    -7.040e-01    3.471e-02
3      1.292e+00    1.159e+00    -1.941e-02    4.078e-06
4      1.291e+00    1.159e+00    -1.622e-05    3.646e-08
5      1.291e+00    1.159e+00    -1.135e-11    1.954e-14

Loop   x_n          y_n          f(x_n, y_n)   g(x_n, y_n)
0      1.000e+00    -1.000e+00   3.000e+00     3.000e+00
1      1.250e+00    -1.250e+00   -1.422e+00    -1.625e+00
2      1.190e+00    -1.186e+00   -9.159e-02    -1.192e-01
3      1.186e+00    -1.181e+00   -4.747e-04    -8.358e-04
4      1.186e+00    -1.181e+00   -1.243e-08    -4.133e-08
5      1.186e+00    -1.181e+00   0.000e+00     0.000e+00

It works! Yippee. The remaining 6 points are also easily found simply by choosing an initial guess close to the desired point (which we get by inspection of the plots of the functions, as shown in a previous lecture).

Notice also that each iteration appears to double the number of digits of accuracy (look at the values of f and g after each iteration; they are roughly the square of the previous value). Thus not only do we have a method that seems capable of finding all the roots, it does so very efficiently. This is the preferred method whenever we want quick and accurate roots. The price you pay is that you have more work to do in evaluating the partial derivatives (most times this should not be too problematic).

Generalised Newton-Raphson Iteration

The Generalised Newton-Raphson method for a system of equations

0 = f_i(x_1, x_2, x_3, · · · , x_n) ,   i = 1, 2, 3, · · · , n

is given by

x_i^{p+1} = x_i^p + δx_i ,   i = 1, 2, 3, · · · , n

where the δx_i are obtained by solving the n × n linear system

−f_i = (∂f_i/∂x_1) δx_1 + (∂f_i/∂x_2) δx_2 + (∂f_i/∂x_3) δx_3 + · · · + (∂f_i/∂x_n) δx_n ,   i = 1, 2, 3, · · · , n

in which each f_i and ∂f_i/∂x_j is evaluated at the x_i^p.

If the initial guess is close to the root, then the sequence will converge quadratically,

ε_i^{p+1} = O( (ε_i^p)^2 ) ,   i = 1, 2, 3, · · · , n

where ε_i^p = x_i^p − x_i is the error in the p-th iteration for x_i.

Example 6.6

Using techniques similar to those used for a function of one variable, show that the convergence of the Generalised Newton-Raphson scheme is quadratic (i.e. the error in the next iteration is of order the square of the error in the previous iteration).

6.3.1 Matlab code

The Matlab code which we used in the Generalised Fixed Point method can be very easily adapted to the Generalised Newton-Raphson method – all we need do is change the iteration formula inside the function onestep. Here is the modified Matlab function.


function y = onestep(x)

% --- the function values ---------------------------------
f(1) = x(2) - 7*x(1)^3 + 10*x(1) + 1;
f(2) = x(1) + 8*x(2)^3 - 11*x(2) - 1;

% --- the partial derivatives -----------------------------
a(1,1) = -21*x(1)^2 + 10;
a(1,2) = 1;
a(2,1) = 1;
a(2,2) = +24*x(2)^2 - 11;

% --- solve the linear system for dx ----------------------
dx = inv(a)*( -f' );     % note the transpose: f was built as a row vector

% --- the next iteration ----------------------------------
y(1) = x(1) + dx(1);
y(2) = x(2) + dx(2);


7. Interpolation and Approximation of Data



7.1 The what and why of interpolation

Suppose you are asked to evaluate a function f(x) at x = 3.4 but all you are given is the following table of numbers

x      0.000   1.200   2.400   3.600    4.000    6.000    7.000
f(x)   0.000   0.932   0.675   -0.443   -0.757   -0.279   0.657

Since the target point is not in the table, the best we can hope to do is to estimate f(3.4). How might we do this? We could construct a straight line built on the two points nearest x = 3.4. Or we could build a quadratic on any three points that straddle x = 3.4. If we got really excited we might try building a cubic by selecting four points around x = 3.4. You get the picture – we use a set of points near the target point to build a polynomial. Then we estimate f by evaluating the polynomial at the target point. This process is called polynomial interpolation (surprised?).

Let's be a bit more specific. We are given a table of n + 1 data points (x, f)_i, i = 0, 1, 2, · · · , n and we wish to estimate f at x = x⋆. We decide on building a polynomial of degree n by which we will estimate f(x⋆). Let's call this polynomial P_n(x) and let's write it out in the form

P_n(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n

where a_0, a_1, a_2, · · · a_n are some set of numbers. How might we compute the coefficients a_0, a_1, a_2, · · · a_n? Simply demand that the polynomial pass through all n + 1 of the given data points. That is, we demand the following interpolation condition

Interpolation Condition

Demand that P_n(x) be such that

f_j = P_n(x_j) = a_0 + a_1 x_j + a_2 x_j^2 + · · · + a_n x_j^n ,   j = 0, 1, 2, · · · , n

This gives us n + 1 linear equations for the n + 1 unknowns a_0, a_1, a_2, · · · a_n.

So in principle we can solve this system and thus we have our P_n(x). Our job is done. Yes? No?

We played this game once before, when we set out to recover the polynomial

f(t) = (1 − t^(n+1))/(1 − t) = 1 + t + t^2 + t^3 + · · · + t^n

(see section (5.5)). There we found that the resulting matrix equations could easily cause great grief when using Gaussian elimination. So we need another way to do the same job. We will look at two methods: Lagrangian interpolation and Newton's divided differences. Both have their merits and failings.


7.2 Lagrangian interpolation

Suppose we have just three points (x, y)_i, i = 0, 1, 2. Now let us build the polynomials

L_0(x) = (x − x_1)(x − x_2) / ((x_0 − x_1)(x_0 − x_2))
L_1(x) = (x − x_0)(x − x_2) / ((x_1 − x_0)(x_1 − x_2))
L_2(x) = (x − x_0)(x − x_1) / ((x_2 − x_0)(x_2 − x_1))

The L_i(x) were constructed in this way for a very good reason. Each of these is a quadratic in x and they have the following properties

L_0(x_0) = 1   L_0(x_1) = 0   L_0(x_2) = 0
L_1(x_0) = 0   L_1(x_1) = 1   L_1(x_2) = 0
L_2(x_0) = 0   L_2(x_1) = 0   L_2(x_2) = 1

Since they take on only the values 0 and 1 at the tabulated points we are certain that the combination

f̃(x) = f_0 L_0(x) + f_1 L_1(x) + f_2 L_2(x)

satisfies the interpolation condition. Thus we have a quadratic polynomial that passes through the given points (i.e. f̃(x) satisfies the interpolation condition). Now our job is done. We have a quadratic passing through the given points and we can now compute f̃(x⋆).

Note that in all that follows we will always use a tilde, as in f̃, to remind us that we are estimating a function's value at some point.

Note also that it is common practice to write P_N(x) instead of f̃(x) for the Lagrange polynomial of degree N.

Lagrange interpolation

The Lagrange polynomial P_N(x), based on the N + 1 data points (x, y)_i, i = 0, 1, 2, · · · N, is given by

P_N(x) = Σ_{i=0}^{N} y_i Π_{j=0, j≠i}^{N} (x − x_j)/(x_i − x_j)


Example 7.1

Compute the quadratic that passes through the following data

x      0.0   1.0   3.0
f(x)   3.0   2.0   6.0

7.2.1 The whole polynomial or just its value?

In the previous example we pushed the Lagrangian interpolation all the way to an explicit function of x, that is, we found f̃(x) = x^2 − 2x + 3. Normally nobody would ever bother going this far. Rather, it is standard practice to leave the pieces of the Lagrange polynomial in their unexpanded form (why do more than you have to?).

Example 7.2 A smooth function y = sin(x)

Polynomials are nice smooth functions. Thus we can expect that they can provide good approximations to other smooth functions. Here we shall explore a few low order (i.e. N = 2, 3, 4 and 5) polynomial approximations to sin(x) over the interval 0 < x < π. In each of the following we first sampled sin(x) at a few points in the interval 0 < x < π and then built the Lagrange polynomial on those points. Here are the plots.

[Figure: Lagrangian interpolation of sin(x). Four panels show the N = 2 (quadratic), N = 3 (cubic), N = 4 (quartic) and N = 5 (quintic) Lagrange polynomials, each built on sample points (marked by squares) over 0 < x < π.]


What do we observe from the above plots? First, that each Lagrange polynomial passes through the given data points (as they must!). Second, that all the approximations look good, with the N = 5 quintic interpolation being almost indistinguishable from the source function sin(x). But the picture changes dramatically if we plot the Lagrange polynomials outside the range on which they were built. Here is what our collection looks like over the interval 0 < x < 2π.

[Figure: the N = 2, 3, 4, 5 Lagrangian interpolations of sin(x), now plotted over the extended interval 0 < x < 2π, together with the original sample points.]

Clearly things have gone off the rails. Are we surprised? Not really. Using the polynomials in this way is not much different from trying to predict the future (on the basis of past events). This process (predicting outside the given data) is known as extrapolation. It almost always gives junk like the above – so be very very afraid when you venture down this path.


Example 7.3 A not so smooth function y = H(x)

The Heaviside step function is very simple and is defined by

H(x) = −0.5 : x < 0
        0   : x = 0
        0.5 : 0 < x

[Figure: Lagrangian interpolation of H(x). Four panels show the interpolating polynomials built on N = 4, 8, 16 and 32 evenly spaced sample points over −1 ≤ x ≤ 1.]

The plot in the lower right corner, which uses an overly ambitious number of points, N = 32, is not a pretty sight. We can see that all of the polynomials pass through the chosen data points, but the price we pay is that the higher order interpolations (which we expect to be accurate) oscillate wildly near the ends of the domain. This is serious. The cause of this problem is the sharp jump at x = 0. The best advice we can give is that if you must use polynomial interpolation (there are other choices, and we will look at these in later lectures) then stick to lower order polynomials (e.g. cubic).


7.2.2 Matlab code

Here is a simple Matlab function which will return the value of the Lagrange polynomial for a single x value, x_target.

function answer = lagrngpoly(x,f,n,x_target)

% --- loop to compute each term in sum_k f_k L_k(x*)
sum = 0;                  % start sum at zero
for k=1:n                 % add on one term at a time
    term=1;               % start each term at one
    % --- 1st half loop
    for j=1:k-1
        term=term*(x_target-x(j))/(x(k)-x(j));
    end
    % --- 2nd half loop
    for j=k+1:n
        term=term*(x_target-x(j))/(x(k)-x(j));
    end
    % --- term now equals L_k(x*)
    sum=sum+f(k)*term;    % update the sum
end
answer = sum;             % save the final answer

Here is an example of how you might run the above code

Matlab>> x_data = [0.1 0.2 0.4 0.8 1.2];
Matlab>> f_data = [1.2 1.0 0.6 0.8 1.6];
Matlab>> fapprox = lagrngpoly(x_data,f_data,5,0.123)

This runs the code and requests a quartic (degree 4, built on the 5 data points) approximation at x = 0.123.


7.3 Newton polynomials

Suppose you are playing the game of polynomial interpolation. You might first use linear interpolation, then quadratic and so on until you find an estimate that you are happy with. If you chose Lagrangian interpolation then you will have to scrap all previous calculations as you move through this chain of polynomials. For example, you cannot use any parts of the quadratic Lagrange polynomial to help you build the cubic Lagrange polynomial. So this is an expensive process. It would be much better if we could find an algorithm that does draw upon past calculations and allows higher order polynomials to be built from previously computed lower order polynomials. This brings us to an algorithm known as Newton's divided differences. It will produce exactly the same polynomials as Lagrangian interpolation (more on this later) but it will do so in a different way.

Here is a set of examples that shows where we are headed. In all of these examples we are given N data points (x, f)_i, i = 0, 1, 2, · · · N. We will build each polynomial in terms of some unknown numbers a_i, i = 0, 1, 2, · · · N and then use the interpolation condition f̃(x_i) = f_i to compute the a_i. We will write f̃_j(x), j = 0, 1, 2, · · · for our chain of polynomials (i.e. f̃_1(x) is our linear polynomial and f̃_3(x) is our cubic polynomial).

Example 7.4 One data point : constant interpolation

With N = 1 we have just one data point and our interpolation is of the form

f̃_0(x) = a_0

for some number a_0. There is only one interpolation condition and we easily see that

a_0 = f_0

Example 7.5 Two data points : linear interpolation

Now we have N = 2 and we propose an interpolation like

f̃_1(x) = f̃_0(x) + a_1 (x − x_0)

We choose it in this form because it clearly satisfies the interpolation condition at x = x_0. We only have one other interpolation condition to impose, at x = x_1. This gives us

a_1 = (f_1 − f̃_0(x_1)) / (x_1 − x_0)


Example 7.6 Three data points : quadratic interpolation

With N = 3 we now choose

f̃_2(x) = f̃_1(x) + a_2 (x − x_0)(x − x_1)

Again we have been sneaky in the way we have built this polynomial. It clearly passes through the first two points. All we need do is choose a_2 so that it passes through the third point. This leads to

a_2 = (f_2 − f̃_1(x_2)) / ((x_2 − x_0)(x_2 − x_1))

Example 7.7 Four data points : cubic interpolation

By now you may be seeing a pattern. This time we have N = 4 and we put

f̃_3(x) = f̃_2(x) + a_3 (x − x_0)(x − x_1)(x − x_2)

and we find

a_3 = (f_3 − f̃_2(x_3)) / ((x_3 − x_0)(x_3 − x_1)(x_3 − x_2))

So much for examples, how do we do this in general? Let's suppose we have built f̃_j(x) based on the first j + 1 points. Then we build the next polynomial as

f̃_{j+1}(x) = f̃_j(x) + a_{j+1} (x − x_0)(x − x_1)(x − x_2) · · · (x − x_j)

with

a_{j+1} = (f_{j+1} − f̃_j(x_{j+1})) / ((x_{j+1} − x_0)(x_{j+1} − x_1)(x_{j+1} − x_2) · · · (x_{j+1} − x_j))

This might look a bit tedious but there is a further trick that makes this computation rather easy. We will build a triangular table of data from which we will later pick out the various a_j. The entries in the table will be denoted d_ij with j being the column index. We will build the table column by column, from left to right, using the recursive formula

d_ij = (d_{i,j−1} − d_{i−1,j−1}) / (x_i − x_{i−j}) ,   1 ≤ j ≤ i ≤ N

with the first column set as d_i0 = f_i, i = 0, 1, 2, · · · N. Then we have

a_j = d_jj ,   j = 0, 1, 2, · · · N


Newton interpolation

x_i       f_i = d_i0   d_i1        d_i2        d_i3       d_i4       d_i5
x_0 = 1   d_00 = −3
x_1 = 2   d_10 = 0     d_11 = 3
x_2 = 3   d_20 = 15    d_21 = 15   d_22 = 6
x_3 = 4   d_30 = 48    d_31 = 33   d_32 = 9    d_33 = 1
x_4 = 5   d_40 = 105   d_41 = 57   d_42 = 12   d_43 = 1   d_44 = 0
x_5 = 6   d_50 = 192   d_51 = 87   d_52 = 15   d_53 = 1   d_54 = 0   d_55 = 0

The column headings simply remind us which column we are building and the numbers in the body of the table are the various d_ij's.

To build the polynomial we simply read off the coefficients along the leading diagonal

f̃_5(x) = d_00 + d_11 (x − x_0) + d_22 (x − x_1)(x − x_0) + · · ·
        = −3 + 3(x − 1) + 6(x − 1)(x − 2) + 1(x − 1)(x − 2)(x − 3)

Equally we could use the coefficients along the lower diagonal

f̃_5(x) = d_N0 + d_{N−1,1} (x − x_N) + d_{N−2,2} (x − x_N)(x − x_{N−1}) + · · ·
        = 192 + 87(x − 6) + 15(x − 6)(x − 5) + 1(x − 6)(x − 5)(x − 4)

Note that d_i4 = d_i5 = 0, thus showing that our data was actually a pure cubic (which we have now recovered from the data).


Example 7.8

Show that the two polynomials given above are one and the same polynomial, f(x) = x^3 − 4x.

Newton interpolation polynomial

Given a set of N + 1 data points (x, f)_i, i = 0, 1, 2, · · · N, the Newton interpolation polynomial is given by

f̃_N(x) = d_00 + d_11 (x − x_0) + d_22 (x − x_0)(x − x_1)
          + · · · + d_NN (x − x_0)(x − x_1)(x − x_2) · · · (x − x_{N−1})

where the d_ij are computed by the recursive formula

d_ij = (d_{i,j−1} − d_{i−1,j−1}) / (x_i − x_{i−j}) ,   1 ≤ j ≤ i ≤ N

with d_i0 = f_i, i = 0, 1, 2, · · · N.

7.3.1 Horner's form of the Newton polynomial

There is a computationally efficient way to evaluate a Newton polynomial. It goes like this. Start with the general form

f̃_N(x) = d_00 + d_11 (x − x_0) + d_22 (x − x_0)(x − x_1)
          + · · · + d_NN (x − x_0)(x − x_1)(x − x_2) · · · (x − x_{N−1})

and then group all the common factors

f̃_N(x) = d_00 + (x − x_0)( d_11 + (x − x_1)( d_22 + (x − x_2)( d_33 + · · · + (x − x_{N−1}) d_NN ) ) )
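The divided-difference table and the Horner-style evaluation fit comfortably into a few lines of Matlab. The sketch below is an illustration only (the function name is invented, and it is not taken from these notes); x and f hold the N + 1 data points and x_target is the point at which we want the estimate.

function val = newton_interp(x, f, x_target)
% Build the divided-difference table, then evaluate the Newton polynomial
% in Horner form. Illustrative sketch only.
N = length(x) - 1;
d = zeros(N+1, N+1);
d(:,1) = f(:);                                   % first column: d_i0 = f_i
for j = 2:N+1                                    % build the table column by column
    for i = j:N+1
        d(i,j) = (d(i,j-1) - d(i-1,j-1)) / (x(i) - x(i-j+1));
    end
end
val = d(N+1,N+1);                                % Horner-style nested evaluation
for i = N:-1:1
    val = d(i,i) + (x_target - x(i))*val;
end

For example, newton_interp([1 2 3 4 5 6], [-3 0 15 48 105 192], 2.5) should return the value of the cubic x^3 − 4x at x = 2.5 (that is, 5.625), since those are the data used in the table above.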

7.4 Uniqueness

Are the Lagrange and Newton polynomials the same for a given dataset? Yes – there is only one polynomial of degree N that passes through the N + 1 data points. Thus even though we have used different formulae to compute these polynomials, they are in fact identical. Use whichever method you feel most comfortable with.

The polynomial of degree N is often written as P_N(x) rather than our f̃_N(x).


7.5 Piecewise polynomial interpolation

Suppose we are given a simple dataset and we are asked to estimate the derivative at, say, x = 0.35. How do we proceed? Here is one approach.

Construct, by whatever means, a smooth approximation ỹ(x) to y(x) near x = 0.35. Then put y′(0.35) ≈ ỹ′(0.35).

We have some options for constructing ỹ(x)

◮ Least squares estimation. This is easy to apply but it does not interpolate the data and it can develop spurious wiggles – a disaster for derivatives.

◮ Piecewise polynomial interpolation. This is also easy to apply, but which local set of points do we use? Different choices will produce different estimates for y(x) and y′(x).


What we want is a method which

◮ Produces a unique approximation,
◮ Is continuous over the domain and
◮ Has, at least, a continuous first derivative over the domain.

If we get continuity in any higher derivatives then we shall be pleased (but not greedy). We will look at one way of achieving this, known as cubic spline interpolation.

7.5.1 Cubic Splines

Here is our problem. We have a set of data points (x_i, y_i), i = 0, 1, 2, · · · n and we wish to build an approximation ỹ(x) which has as much continuity as we can get (now we are being greedy).

Between each pair of points we will construct a cubic. Let ỹ_i(x) be the cubic for the interval x_i ≤ x ≤ x_{i+1}. We demand that

Interpolation condition                    y_i = ỹ_i(x_i)                   (1)
Continuity of the function                 ỹ_{i−1}(x_i) = ỹ_i(x_i)          (2)
Continuity of the first derivative         ỹ′_{i−1}(x_i) = ỹ′_i(x_i)        (3)
Continuity of the second derivative        ỹ′′_{i−1}(x_i) = ỹ′′_i(x_i)      (4)

Can we solve this system of equations? We need to balance the number of unknowns against the number of equations. We have n + 1 data points and thus n cubics ỹ_i(x) to compute. Each cubic has 4 coefficients, thus we have 4n unknowns. And how many equations? From the above we count n + 1 equations in (1), and n − 1 equations in each of (2), (3) and (4). A total of 4n − 2 equations for 4n unknowns. We see that we will have to provide two extra pieces of information. No matter, we'll press on and see what comes up.

Start by putting


ỹ_i(x) = y_i + a_i (x − x_i) + b_i (x − x_i)^2 + c_i (x − x_i)^3     (5)

which automatically satisfies equation (1). For the moment suppose we happen to know all of the second derivatives y′′_i. We then have ỹ′′_i(x) = 2b_i + 6c_i (x − x_i) and evaluating this at x = x_i leads to

b_i = y′′_i / 2     (6)

Now we turn to equation (4), y′′_{i+1} = y′′_i + 6c_i (x_{i+1} − x_i), which gives

c_i = (y′′_{i+1} − y′′_i) / (6 h_i)     (7)

where we have introduced h_i = x_{i+1} − x_i. Next we compute the a_i by applying equation (2),

y_{i+1} = y_i + a_i h_i + (1/6)(y′′_{i+1} + 2 y′′_i) h_i^2     (8)

and so

a_i = (y_{i+1} − y_i)/h_i − (1/6) h_i (y′′_{i+1} + 2 y′′_i)     (9)

It appears that we have completely determined each of the cubics, though we have yet to use (3), continuity in the first derivative. But remember that we don't yet know the values of y′′_i. Thus equation (3) will be used to compute the y′′_i. Using our values for a_i, b_i and c_i we find (after much fiddling) that equation (3) is

6( (y_{i+1} − y_i)/h_i − (y_i − y_{i−1})/h_{i−1} ) = h_i y′′_{i+1} + 2(h_i + h_{i−1}) y′′_i + h_{i−1} y′′_{i−1}     (10)

The only unknowns in this equation are the y′′_i, of which there are n + 1. But there are only n − 1 equations. Thus we must supply two extra pieces of information.

The simplest choice is to set y′′_0 = y′′_n = 0. Then we have a tri-diagonal system of equations to solve for the y′′_i. That's as far as we need push the algebra – we can call a Matlab routine to solve the tri-diagonal system.


The recipe

◮ Solve equation (10) for the y′′_i,
◮ Compute all of the a_i from equation (9),
◮ Compute all of the b_i from equation (6),
◮ Compute all of the c_i from equation (7) and finally
◮ Assemble all of the cubics using equation (5).

Our job is done. We have computed the cubic spline for each interval.
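For completeness, here is a hedged Matlab sketch of the recipe using the simplest end conditions y′′_0 = y′′_n = 0. It is an illustration only, not part of these notes' code; the function name is invented, and the tri-diagonal system is solved with Matlab's backslash rather than a dedicated tri-diagonal routine.

function ys = cubic_spline_eval(x, y, xs)
% Cubic spline through the n+1 points (x,y) with y''_0 = y''_n = 0,
% evaluated at the points xs. Illustrative sketch only.
n  = length(x) - 1;
h  = diff(x(:));                           % h_i = x_{i+1} - x_i
dy = diff(y(:))./h;                        % (y_{i+1} - y_i)/h_i
% --- solve equation (10) for the interior second derivatives
A = zeros(n-1);  r = zeros(n-1,1);
for i = 1:n-1
    A(i,i) = 2*(h(i) + h(i+1));
    if i > 1,   A(i,i-1) = h(i);   end
    if i < n-1, A(i,i+1) = h(i+1); end
    r(i) = 6*(dy(i+1) - dy(i));
end
ypp = [0; A\r; 0];                         % the chosen end conditions y''_0 = y''_n = 0
% --- coefficients from equations (9), (6) and (7)
a = dy - h.*(ypp(2:n+1) + 2*ypp(1:n))/6;
b = ypp(1:n)/2;
c = (ypp(2:n+1) - ypp(1:n))./(6*h);
% --- assemble equation (5) and evaluate at each requested point
ys = zeros(size(xs));
for k = 1:numel(xs)
    i = find(xs(k) >= x(1:n), 1, 'last');  % interval containing xs(k)
    if isempty(i), i = 1; end
    t = xs(k) - x(i);
    ys(k) = y(i) + a(i)*t + b(i)*t^2 + c(i)*t^3;
end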


7.5.2 Example

Here we compare a cubic spline interpolation against a set of polynomial interpolations for a simple step function. Notice how the cubic splines are much smoother than the polynomials. In each plot there are four different interpolations, based on 4, 8, 16 and 32 evenly spaced points over the interval −1 ≤ x ≤ 1. This corresponds to polynomials of degree 3, 7, 15 and 31. The key observation here is that the higher order polynomials contain very large oscillations near the limits of the dataset. It is far better to choose a low order (less than 5) polynomial for the interpolation. Even better, use a cubic spline.

[Figure: two panels, "Cubic splines" and "Polynomial interpolation", showing the step-function
data and the interpolants over −1 ≤ x ≤ 1.]

7.6 Non-polynomial interpolation

In previous lectures we used polynomials to interpolate data. But why stop at polynomials?
There are many other classes of functions which we can use, in particular sin
and cos functions. Why would we want to make such a change? We know that cyclic
behaviour is a common feature in many natural phenomena. For example, we all know
that the Earth's temperature varies on a daily and seasonal basis. If we didn't know
this, how might we extract the periods of these cycles from measured data? By fitting
suitably chosen periodic functions, such as sin and cos functions, to the data.

So our game today is to see how we might fit a function of the form

f(x) = a_0/2 + a_1 cos(x) + a_2 cos(2x) + a_3 cos(3x) + · · ·
             + b_1 sin(x) + b_2 sin(2x) + b_3 sin(3x) + · · ·

to a set of data (x_i, f_i) for f(x). This leads us to Fourier series.

7.6.1 Fourier series

Here are some useful facts (aka theorems).

If f(x) is defined over the interval −π ≤ x ≤ +π then

f(x) = a_0/2 + ∑_{j=1}^∞ ( a_j cos(jx) + b_j sin(jx) )

where the coefficients are given by

a_j = (1/π) ∫_{−π}^{+π} f(x) cos(jx) dx
b_j = (1/π) ∫_{−π}^{+π} f(x) sin(jx) dx

The infinite series converges to the mid-point of any discontinuity in f(x).

The infinite series provides a natural periodic extension of f(x) to all values of x. Simply put,
f(x ± 2nπ) = f(x) for any integer n.

The a_j, b_j are known as the Fourier coefficients and they measure the amplitude of the
particular harmonic in the function.

7.6.2 Estimating the Fourier coefficients

Suppose we are given the (x_i, f(x_i)) (with the x_i evenly spaced in −π < x < π) and
that we wish to estimate the Fourier coefficients. We have two options; we can either

◮ Estimate the integrals using a left hand sum rule, or
◮ Use the interpolation condition f̃(x_i) = f_i.

In the following we will assume that x_j = −π + (2jπ)/N for j = 0, 1, 2, · · · N and that
N is an even integer.

Approximating the integrals

Here we estimate the integrals by a left hand sum,

a_j ≈ ã_j = (2/N) ∑_{k=0}^{N−1} f_k cos(j x_k)
b_j ≈ b̃_j = (2/N) ∑_{k=0}^{N−1} f_k sin(j x_k)

for j = 0, 1, 2, · · ·.

Interpolation condition

The interpolation condition f̃(x_i) = f_i is just

f_k = ã_0/2 + ∑_{j=1}^∞ ( ã_j cos(j x_k) + b̃_j sin(j x_k) )

for k = 0, 1, 2, · · · N − 1. (Exercise: why do we stop at k = N − 1?)

From this set of N equations we need to compute the ã_j, b̃_j. But we have a problem –
there are N equations and an infinite number of coefficients. Clearly we have to reduce
the number of coefficients to N. This involves two tricks.

First, note that, since k is an integer and N is an even integer,

cos((j + N)x_k) = cos(j x_k + N x_k) = cos(j x_k + (−Nπ + 2πk)) = cos(j x_k)
sin((j + N)x_k) = sin(j x_k + N x_k) = sin(j x_k + (−Nπ + 2πk)) = sin(j x_k)

This allows us to combine terms j, j + N, j + 2N, j + 3N, · · · in the infinite series,

f_k = ã_0/2 + ∑_{j=1}^{N} ( ã'_j cos(j x_k) + b̃'_j sin(j x_k) )

where ã'_j = ã_j + ã_{j+N} + ã_{j+2N} + · · · and b̃'_j = b̃_j + b̃_{j+N} + b̃_{j+2N} + · · ·. The second
trick is almost a repeat of the first. This time we note that

cos((N − j)x_k) = cos(N x_k − j x_k) = cos(−j x_k) = cos(j x_k)
sin((N − j)x_k) = sin(N x_k − j x_k) = sin(−j x_k) = − sin(j x_k)

This allows us to combine terms j and N − j, leading to

f_k = ã_0/2 + ∑_{j=1}^{m} ( ã''_j cos(j x_k) + b̃''_j sin(j x_k) )

where ã''_j = ã'_j + ã'_{N−j} and b̃''_j = b̃'_j − b̃'_{N−j}, and N = 2m. The final thing to note is
that when j = m we have sin(j x_k) = 0. That is, the last term in the sum drops out –

there is no b̃''_m in the equations. We have gone as far as we need in massaging the terms
in the series, so we will now drop the double primes on the coefficients.

Okay, how many coefficients do we have? One ã_0, plus m of the ã_j and m − 1 of the b̃_j, for a
total of 2m = N. And we have exactly N equations. So in principle we should now be
able to solve the system of equations (by, e.g., Gaussian elimination) for the ã_j and b̃_j.

The surprise is that the solution is almost exactly what we found using the left hand
rule for the integrals – the sum overestimates a_m by a factor of two. We can account
for this by a slight alteration to the way we write the Fourier sum. Thus we have

Discrete Fourier Series

f(x) ≈ f̃(x) = ã_0/2 + (ã_m/2) cos(mx) + ∑_{j=1}^{m−1} ( ã_j cos(jx) + b̃_j sin(jx) )

where N = 2m and

ã_j = (2/N) ∑_{k=0}^{N−1} f_k cos(j x_k) ,    b̃_j = (2/N) ∑_{k=0}^{N−1} f_k sin(j x_k) ,    j = 0, 1, 2, · · · m
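As a rough illustration, here is a minimal Matlab sketch of this discrete Fourier
interpolation. The test function and variable names are ours; only the formulas for
ã_j, b̃_j and f̃(x) come from the notes.

f  = @(x) exp(sin(x));            % an assumed smooth, 2*pi periodic test function
N  = 16;  m = N/2;                % N must be even
k  = 0:N-1;
xk = -pi + 2*pi*k/N;              % sample points in [-pi, pi)
fk = f(xk);                       % sampled data f_k

a = zeros(1, m+1);  b = zeros(1, m+1);
for j = 0:m                       % left hand sum estimates of the coefficients
    a(j+1) = (2/N) * sum(fk .* cos(j*xk));
    b(j+1) = (2/N) * sum(fk .* sin(j*xk));
end

% Evaluate the discrete Fourier series f~(x) on a fine grid
x  = linspace(-pi, pi, 200);
ft = a(1)/2 + a(m+1)/2 * cos(m*x);
for j = 1:m-1
    ft = ft + a(j+1)*cos(j*x) + b(j+1)*sin(j*x);
end
plot(x, f(x), '-', x, ft, '--', xk, fk, 'o')   % compare f, f~ and the data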

7.6.3 Example

This is almost the same example as we used in the lecture on cubic splines – a step
function. In this instance we have carefully defined the step function at the discontinuities
to ensure that the function has a consistent periodic extension. We defined the
value of f(x) at each discontinuity to be the mid-point of the jump. This is necessary
if we want the successive copies f(x), f(x + 2π), f(x + 4π) · · · to fit together neatly at
x = 2π, 4π, 6π · · ·

The Fourier interpolation is seen to be free of the large oscillations that plague the polynomial
interpolations. The Fourier and polynomial interpolations yield similar values
near the centre of the dataset.

[Figure: two panels, "Fourier interpolation" and "Polynomial interpolation", showing the
step-function data and the interpolants over −3 ≤ x ≤ 3.]

7.7 Approximating functions

The game here is simple – given a table of data points, estimate the value of a function
at some point not in the table.

The general approach is to construct a new function which captures as well as possible
the information in the table.

Let's write (x_i, y_i) for the data points, y(x) for the underlying function and ỹ(x) for the
approximation to y(x).

Interpolation. In this case we demand that ỹ(x) gives the exact values when evaluated
at the tabulated points, ỹ(x_i) = y_i. This has been covered in previous lectures.

Approximation. This time we choose ỹ(x) so that it passes close to, but not necessarily
through, each data point. This extra flexibility can often give us better approximations
than would otherwise be given by polynomial interpolation.

If the data in the table are drawn from an underlying smooth function then it often
doesn't matter which method you choose; both will give reasonable answers. (But what's
reasonable? Good question.) However, if the data happens to contain high frequency
oscillations (e.g. noisy experimental data) then it would not make sense to interpolate
the data. In this case an approximation to the function would make much more sense
and would more likely give a much better answer.

7.7.1 Example

[Figure: two panels showing the same dataset over 0 ≤ x ≤ 2, illustrating interpolation
versus approximation of the data.]

7.7.2 Least Squares

Suppose we have the following table of data

     x     y(x)
    0.0    0.00
    0.4    0.64
    0.8    0.42
    1.2    1.58
    1.6    1.36
    2.0    2.00

[Figure: the data points plotted over 0 ≤ x ≤ 2.]

and that we choose to approximate the function with a straight line

y(x) ≈ ỹ(x) = Ax + C

How do we compute the coefficients A and C?

We can't use standard polynomial interpolation as there are 6 data points but only 2
parameters, A and C, to compute. The best we can do is to choose A and C to minimise
the error between y(x) and ỹ(x).

We define the error (also called the residual) by

E(A, C) = ∑_i ( ỹ(x_i) − y_i )^2 = ∑_i ( A x_i + C − y_i )^2

And for the minimum we set 0 = ∂E/∂A and 0 = ∂E/∂C, leading to

∑_i y_i = A ∑_i x_i + C ∑_i 1

∑_i x_i y_i = A ∑_i x_i^2 + C ∑_i x_i

These are known as the normal equations. For our previous example we get

6.000 = 6.00 A + 6.00 C
8.664 = 8.80 A + 6.00 C

This 2 by 2 system is easily solved, A = 0.9514 and C = 0.0486, and so

y(x) ≈ ỹ(x) = 0.9514x + 0.0486

[Figure: the data points and the fitted line ỹ(x) = 0.9514x + 0.0486 over 0 ≤ x ≤ 2.]

The line does not pass through the data but it does capture the general trend in the
data.
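Here is a minimal Matlab sketch of this calculation. It forms and solves the normal
equations directly for the table above; the variable names are ours.

x = [0.0 0.4 0.8 1.2 1.6 2.0]';       % data from the table
y = [0.00 0.64 0.42 1.58 1.36 2.00]';

% Normal equations: [sum(x) n; sum(x.^2) sum(x)] * [A; C] = [sum(y); sum(x.*y)]
M   = [sum(x)    length(x);
       sum(x.^2) sum(x)   ];
rhs = [sum(y); sum(x.*y)];
AC  = M \ rhs;                        % AC(1) = A (slope), AC(2) = C (intercept)

A = AC(1);  C = AC(2);                % expect A = 0.9514, C = 0.0486
plot(x, y, 'o', x, A*x + C, '-')      % the data and the fitted line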

7.7.3 Generalised least squares

Suppose we now set

ỹ(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n

Our aim is to choose the a_i so that ỹ(x) is the best possible approximation to y(x). We
set about minimising the residual

E(a_i) = ∑_i ( ỹ(x_i) − y_i )^2

over all possible choices of a_i. Thus we put 0 = ∂E/∂a_i for each a_i. This gives us n + 1
equations for n + 1 unknowns,

∑_i x_i^0 y_i = a_0 ∑_i x_i^0 + a_1 ∑_i x_i^1 + a_2 ∑_i x_i^2 + · · · + a_n ∑_i x_i^{n+0}

∑_i x_i^1 y_i = a_0 ∑_i x_i^1 + a_1 ∑_i x_i^2 + a_2 ∑_i x_i^3 + · · · + a_n ∑_i x_i^{n+1}

∑_i x_i^2 y_i = a_0 ∑_i x_i^2 + a_1 ∑_i x_i^3 + a_2 ∑_i x_i^4 + · · · + a_n ∑_i x_i^{n+2}

    ⋮

∑_i x_i^n y_i = a_0 ∑_i x_i^n + a_1 ∑_i x_i^{n+1} + a_2 ∑_i x_i^{n+2} + · · · + a_n ∑_i x_i^{2n}

These are the normal equations for the function ỹ(x). They can be solved by standard
matrix methods, but note that for n > 3 these equations are often ill-conditioned.
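A minimal Matlab sketch of building and solving these normal equations is given below.
The degree, the sample data and the variable names are illustrative assumptions; in
practice the built-in polyfit (shown later in these notes) does the same job.

x = (0:0.25:2)';            % some assumed sample data, for illustration only
y = sin(x) + 0.05*randn(size(x));
n = 2;                      % degree of the fitting polynomial

% Normal equations: M(j+1,k+1) = sum(x.^(j+k)), rhs(j+1) = sum(x.^j .* y)
M   = zeros(n+1, n+1);
rhs = zeros(n+1, 1);
for j = 0:n
    for k = 0:n
        M(j+1, k+1) = sum(x.^(j+k));
    end
    rhs(j+1) = sum(x.^j .* y);
end
a = M \ rhs;                       % a(k+1) is the coefficient a_k of x^k

ytilde = polyval(flipud(a)', x);   % evaluate the fitted polynomial at the data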

7.7.4 Variations on a theme

Okay, so much for linear functions. Can we apply the least squares idea to functions
such as ỹ(x) = αe^{βx}? Yes, and there are two common approaches.

Method 1 This involves a change of variables. We put u(x) = ln y(x) and thus
u(x) = ln α + βx. As this is a linear function, we can apply the simple linear least
squares method to the data (x_i, u_i). We set u(x) ≈ ũ(x) = Ax + C and compute A and
C by minimising the residual. Then we return to our original variable, y(x). This gives
us α = e^C and β = A.

Method 2 This is the head-on approach. We make no change of variable and simply
define the residual as

E(α, β) = ∑_i ( ỹ(x_i) − y_i )^2 = ∑_i ( αe^{βx_i} − y_i )^2

As usual, we set 0 = ∂E/∂α and 0 = ∂E/∂β to obtain the best choice for α and β.
The problem with this method is that the normal equations will be non-linear functions
of the parameters. This is harder to solve (but not impossible).

Most people opt for the first method.

Example : Method 1

Here is the data

x      0.000  0.142  0.285  0.428  0.571  0.714  0.857  1.000
y(x)   1.500  1.495  1.040  0.821  1.003  0.821  0.442  0.552

to which we'll fit ỹ(x) = αe^{βx}.

First we compute a new table with u(x) = ln y(x),

x      0.000  0.142  0.285  0.428   0.571  0.714   0.857   1.000
u(x)   0.405  0.402  0.039  -0.197  0.003  -0.197  -0.816  -0.594

We then construct the normal equations,

−0.962 = 3.97 A + 8.00 C
−1.451 = 2.85 A + 3.97 C

whose solution is A = −1.132 and C = 0.445. Finally we convert back to the original
variable y(x). That is α = e^C = 1.561 and β = A = −1.132. So we have found

y(x) ≈ ỹ(x) = 1.561 e^{−1.132x}

[Figure: the data points and the fitted curve ỹ(x) = 1.561 e^{−1.132x} over 0 ≤ x ≤ 1.]
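A minimal Matlab sketch of Method 1 for this data set is shown below; the variable
names are ours.

x = [0.000 0.142 0.285 0.428 0.571 0.714 0.857 1.000]';
y = [1.500 1.495 1.040 0.821 1.003 0.821 0.442 0.552]';

u = log(y);                        % change of variables u = ln y
p = polyfit(x, u, 1);              % straight line fit: u ~ A*x + C
A = p(1);  C = p(2);               % slope and intercept

alpha = exp(C);                    % convert back: alpha = e^C, beta = A
beta  = A;                         % expect alpha ~ 1.56 and beta ~ -1.13

xx = linspace(0, 1, 100);
plot(x, y, 'o', xx, alpha*exp(beta*xx), '-')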

Example : Method 2

Same data, same function, but now we set

E(α, β) = ∑_i ( αe^{βx_i} − y_i )^2

0 = ∂E/∂α = 2 ∑_i ( αe^{βx_i} − y_i ) e^{βx_i}

0 = ∂E/∂β = 2 ∑_i ( αe^{βx_i} − y_i ) e^{βx_i} (α x_i)

This is a non-linear pair of equations for α and β and so must be solved with fancy
methods (e.g. Newton-Raphson).

We expect the solution for α, β to be similar (but not identical) to the solution found using
method 1.

7.7.5 Applications

◮ Smoothing. Experimental data often contains an element of noise. If you happen
  to know the expected shape of the function then this noise can be smoothed out
  by applying a least squares method.

◮ Function estimation. Equations generated by least squares methods can be
  used to evaluate the function at points not in the table.

◮ Parameter estimation. Sometimes it is the parameters in the function that
  are important (e.g. the slope of the function) rather than the specific values of the
  function.

7.7.6 Notes

◮ Avoid non-linear least squares wherever possible.

◮ Use low order polynomials. This minimises the problems of ill-conditioning. It
  also reduces the effect of spurious wiggles in the function.

◮ Try using small sub-sets of data near the target point. This gives the least squares
  method a chance at getting a good fit to the data, but at the risk of throwing out
  important information about the function.

◮ The method is called least squares because we chose the residual to be the sum of
  the squares of the errors.

◮ There are other choices for residuals, such as E = ∑ |ỹ(x_i) − y_i|, but such schemes
  are not least squares methods.

◮ The values of parameters estimated from these alternative methods should be
  comparable to the values given by the least squares method.

7.7.7 Matlab example

The Matlab procedure for polynomial least squares is polyfit. Here is a typical example

x = (0:0.1:5)';       % x from 0 to 5 in steps of 0.1
y = sin(x);           % get y values
p = polyfit(x,y,3);   % fit a cubic to the data
f = polyval(p,x);     % evaluate the cubic on the x data
plot(x,y,'o',x,f,'-') % plot y and its approximation f

8. Extrapolation Methods



8.1 Richardson extrapolation

In trying to estimate a particular number, say L⋆, it is quite common to use algorithms
that generate a sequence of approximations such as L_0, L_1, L_2 · · ·. Each approximation
is an improvement on the previous one.

Richardson extrapolation is a scheme whereby we can use the previously computed values
to provide improved estimates. The key to the method is knowing the exact form of
the error in the approximation at each iteration.

8.1.1 Example – computing π

The circumference of a circle of radius 1 is 2π and we can estimate this by approximating the
circle by a series of straight line segments (the chords). If there are n segments each of
length ∆L then the total length is n∆L. For the first few values of n we can compute
L(n) = n∆L by hand.

    n          2        4             8
    L(n)       4        4√2           8√(2 − √2)
    L(n)       4        ≈ 5.6568542   ≈ 6.1229349
    % error    36%      10%           2.5%

To improve our approximation to 2π we could repeat this calculation for larger values
of n. But this becomes harder and harder. Is there a way in which we can use just the
above data to get a better estimate for 2π? Yes – Richardson extrapolation.

All applications of Richardson extrapolation begin with a formal statement of the error
term in the approximation.

We divided the circle into n equal segments. The angle subtended at the centre of
the circle by one segment will be ∆θ = 2π/n. The length of the chord will be ∆L =
2 sin(∆θ/2) while the length of the arc will be ∆θ. Thus we have

2π = L⋆ ≈ L = 2n sin(π/n)

Suppose n is large, then we can expand the sine function as a Taylor series

L(n) = 2n ( π/n − (1/3!)(π/n)^3 + (1/5!)(π/n)^5 − (1/7!)(π/n)^7 + · · · )
     = L⋆ + a/n^2 + b/n^4 + c/n^6 + · · ·

where a, b, c are constants (that do not depend on n). All terms after L⋆ represent the
error in L. The errors form a power series in 1/n^2 and the leading error term is O(1/n^2).

The trick now is to form linear combinations of L(n), L(2n), L(4n) · · · to successively
knock out the leading order error terms. From the above we have

L(2n) = L⋆ + a/(2n)^2 + b/(2n)^4 + c/(2n)^6 + · · ·

and thus

(4/3) L(2n) − (1/3) L(n) = L⋆ + b′/(2n)^4 + c′/(2n)^6 + · · ·

for some new numbers b′, c′. This approximation converges to L⋆ and has a leading error
term of order 1/n^4. This should be a much better approximation to 2π than L(n).

Define M(n) = (4L(2n) − L(n))/3, then

    n          2           4
    M(n)       6.2091390   6.2782951
    % error    1%          0.08%

We might be tempted to apply this four-thirds, one-third rule once again, hoping to get
a further improvement. But that would be very wrong. Those coefficients were derived
on the basis that the leading error term was O(1/n^2), but for the new series M(n) we
found the leading error term to be O(1/n^4). To eliminate this term we must recompute
the coefficients. This time we find the combination to be (16/15)M(2n) − (1/15)M(n).
Call this Q(n). Then we get

    n          2
    Q(n)       6.2829055
    % error    0.004%

For Q(n) the leading error term is O(1/n^6).
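Here is a minimal Matlab sketch of these two rounds of Richardson extrapolation applied
to L(n) = 2n sin(π/n); the variable names are ours.

L = @(n) 2*n*sin(pi/n);              % chord-length estimate of 2*pi

n  = [2 4 8 16];
Ln = arrayfun(L, n);                 % L(n), error O(1/n^2)

M  = (4*Ln(2:end) - Ln(1:end-1))/3;  % M(n) = (4L(2n) - L(n))/3, error O(1/n^4)
Q  = (16*M(2:end) - M(1:end-1))/15;  % Q(n) = (16M(2n) - M(n))/15, error O(1/n^6)

err = abs([Ln(end) M(end) Q(end)] - 2*pi)   % successive improvement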

8.1.2 Conclusion

Each stage in this process (of eliminating leading order error terms) is one application
of Richardson extrapolation.

The great value in this method is that it can significantly improve the quality of our approximations
at next to no extra computational effort. The key of course is knowing the
exact form of the error series. With this in hand you can form the correct combinations
to kill off successive leading order error terms.

Note that in our example the error series was a power series in 1/n^2. This need not always
be the case; in some cases you may get a power series in 1/n. In either case the logic
developed above still applies.

9. Numerical integration



9.1 Introduction

In first year calculus we learnt that

I(a, b) = ∫_a^b f(x) dx = F(b) − F(a)

where F(x) is the anti-derivative of f(x). What do we do when the anti-derivative is
too hard to compute or when it cannot be expressed in terms of familiar functions like
sin, cos, log etc.? (e.g. try finding the anti-derivative of e^{−x^2}). One option is to try a
numerical method. This will be our focus for this and the next lecture.

The general method will be to estimate I(a, b) by a finite sum of the form

I(a, b) ≈ Ĩ(a, b) = ∑_{j=0}^{N} w_j f(x_j)

where the x_j, w_j are chosen according to some set rule. The main questions will be

◮ How do we choose the x_j and w_j? and, as always,
◮ How does the error |I(a, b) − Ĩ(a, b)| depend on the number of points N (for a
  fixed rule for computing x_j, w_j)?

One point which we should keep in mind is that as we are evaluating f(x) at various
x_j's we may run into a problem if f(x) is not defined or, even worse, singular at one or
more of the x_j's. For the moment we will put such pathological cases aside by assuming
that f(x) is a well behaved function throughout the closed interval [a, b].

All of the choices for x_j, w_j arise by making simple piecewise polynomial approximations
to f(x). Why polynomials? Because they are easy to integrate and, with a suitable
choice of node points, we should be able to get any desired accuracy.

9.2 The Left and Right hand sum rules

This is about as easy as it gets. Here we subdivide [a, b] into N intervals of width
(b − a)/N and we set x_j = a + j(b − a)/N for j = 0, 1, 2, · · · N.

9.2.1 The Left Hand Rule

In each interval x_j ≤ x ≤ x_{j+1} we use the constant approximation f(x) ≈ f̃(x) = f(x_j).
Then

∫_a^b f(x) dx = ∑_{j=0}^{N−1} ∫_{x_j}^{x_{j+1}} f(x) dx
             ≈ ∑_{j=0}^{N−1} ∫_{x_j}^{x_{j+1}} f(x_j) dx
             = ∑_{j=0}^{N−1} f(x_j) (x_{j+1} − x_j)
             = (b − a)/N ∑_{j=0}^{N−1} f(x_j)

As a test case we might choose I = ∫_0^1 x^4 dx = 1/5. And here are the results.

    N     Ĩ(0, 1)     I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     9.570E-02   2.000E-01   1.043E-01
    8     1.427E-01   2.000E-01   5.730E-02        1.820E+00
    16    1.701E-01   2.000E-01   2.995E-02        1.913E+00
    32    1.847E-01   2.000E-01   1.530E-02        1.957E+00
    64    1.923E-01   2.000E-01   7.731E-03        1.979E+00
    128   1.961E-01   2.000E-01   3.886E-03        1.990E+00
    256   1.981E-01   2.000E-01   1.948E-03        1.995E+00
    512   1.990E-01   2.000E-01   9.753E-04        1.997E+00

Notice that the numbers in the last column are close to 2. This suggests that each time
we double N we halve the error, that is E = O(1/N). The method converges, but too
slowly to be of any practical use.

Note that in the above table (and all others that follow in this chapter) we use n to record
the number of points and N to record the number of intervals. Clearly n = N + 1.
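A minimal Matlab sketch of the left hand rule for this test case (function and variable
names are ours):

f = @(x) x.^4;                  % test integrand on [0, 1]
a = 0;  b = 1;  N = 64;         % N intervals

xj = a + (0:N-1)*(b - a)/N;     % left hand end points x_0 ... x_{N-1}
I_left = (b - a)/N * sum(f(xj));

err = abs(I_left - 1/5)         % compare with the exact value 1/5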

9.2.2 The Right Hand Rule

In the same way we could approximate the integrand by its value at the right hand edge
of the interval, f(x) ≈ f̃(x) = f(x_{j+1}). This leads to

∫_a^b f(x) dx ≈ (b − a)/N ∑_{j=1}^{N} f(x_j)

with the following results

    N     Ĩ(0, 1)     I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     3.457E-01   2.000E-01   1.457E-01
    8     2.677E-01   2.000E-01   6.770E-02        2.152E+00
    16    2.326E-01   2.000E-01   3.255E-02        2.080E+00
    32    2.160E-01   2.000E-01   1.595E-02        2.041E+00
    64    2.079E-01   2.000E-01   7.894E-03        2.021E+00
    128   2.039E-01   2.000E-01   3.927E-03        2.010E+00
    256   2.020E-01   2.000E-01   1.958E-03        2.005E+00
    512   2.010E-01   2.000E-01   9.778E-04        2.003E+00

Clearly this scheme is also O(1/N) accurate.

9.2.3 The Mid Point rule

This again uses a constant approximation for f(x) in each interval. This time we choose
the mid-point (surprise!) of the interval, f(x) ≈ f((x_j + x_{j+1})/2). Doing the integration
as before leads to

∫_a^b f(x) dx ≈ (b − a)/N ∑_{j=0}^{N−1} f( (x_j + x_{j+1})/2 )

with the following results

    N     Ĩ(0, 1)     I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     1.897E-01   2.000E-01   1.030E-02
    8     1.974E-01   2.000E-01   2.597E-03        3.967E+00
    16    1.993E-01   2.000E-01   6.506E-04        3.992E+00
    32    1.998E-01   2.000E-01   1.627E-04        3.998E+00
    64    2.000E-01   2.000E-01   4.069E-05        3.999E+00
    128   2.000E-01   2.000E-01   1.017E-05        4.000E+00
    256   2.000E-01   2.000E-01   2.543E-06        4.000E+00
    512   2.000E-01   2.000E-01   6.358E-07        4.000E+00

Now we see something nice – the error is reduced by a factor of four every time we
double N. This means the error varies as E = O(1/N^2). This is a worthwhile change.
For no extra computational effort we have got a much better approximation.

But why? Simple sketches of f(x) and f̃(x) for the Left Hand Sum reveal that f̃(x)
consistently underestimates f(x) while the Right Hand Sum produces an overestimate.
The mid-point rule on the other hand will have a small overestimate and underestimate
in each interval. These will cancel each other out, thus improving the approximation.
Of course all of these statements can be made mathematically rigorous by explicitly
accounting for the error between f(x) and f̃(x).

9.3 The Trapezoidal rule

This is similar to the three previous methods except that here we use a straight line
approximation for f(x) in the interval [x_j, x_{j+1}]. That is, we use

f(x) ≈ f̃(x) = ( f(x_j)(x_{j+1} − x) + f(x_{j+1})(x − x_j) ) / (x_{j+1} − x_j)

The integration may be slightly more tricky than before (not much) but it can be done
and this is the result

I(a, b) ≈ Ĩ(a, b) = (b − a)/N ( (1/2)(f(x_0) + f(x_N)) + ∑_{j=1}^{N−1} f(x_j) )

And here are the numerical results,

    N     Ĩ(0, 1)     I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     2.207E-01   2.000E-01   2.070E-02
    8     2.052E-01   2.000E-01   5.200E-03        3.981E+00
    16    2.013E-01   2.000E-01   1.302E-03        3.995E+00
    32    2.003E-01   2.000E-01   3.255E-04        3.999E+00
    64    2.001E-01   2.000E-01   8.138E-05        4.000E+00
    128   2.000E-01   2.000E-01   2.034E-05        4.000E+00
    256   2.000E-01   2.000E-01   5.086E-06        4.000E+00
    512   2.000E-01   2.000E-01   1.272E-06        4.000E+00

Surprise, surprise, it also has E(N) = O(1/N^2). This is very good.

9.3.1 Choices, choices, so many choices

Both the mid-point and trapezoidal rules have O(1/N^2) accuracy. Which should we
choose? There is one very good reason for choosing the Trapezoidal rule. Consider two
instances of applying the Trapezoidal rule, once with N + 1 points and once with 2N + 1
points. The x_j's in the first integration also appear as every second point in the second
integration. Thus the f(x) values computed in the first integration can be saved and
reused in the second integration. In this way we can avoid duplicating our efforts as we
compute successive approximations for N = 2, 4, 8, 16, 32 · · ·. This is not possible with
the mid-point rule as there are no x_j's shared by both I_N and I_{2N}.

In fact it is easy to show that if I_N is the Trapezoidal approximation with N + 1 points
then

I_{2N}(a, b) = (1/2) I_N(a, b) + (b − a)/(2N) ∑_{j=0}^{N−1} f(x_{2j+1})

where x_j = a + j(b − a)/(2N). The sum on the right contains just the new f's not
previously seen in the lower values of N. This is a very efficient way of computing the
successive I_N(a, b).
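A minimal Matlab sketch of this idea (variable names ours): each pass doubles N and
reuses the previous trapezoidal estimate, so only the new points are evaluated.

f = @(x) x.^4;  a = 0;  b = 1;

N = 1;
I = (b - a)*(f(a) + f(b))/2;          % trapezoidal rule with a single interval
for pass = 1:8
    xnew = a + (2*(0:N-1)+1)*(b - a)/(2*N);   % the new (odd-indexed) points
    I = I/2 + (b - a)/(2*N) * sum(f(xnew));   % I_{2N} from I_N plus new values only
    N = 2*N;
    fprintf('N = %4d   I = %.8f\n', N, I);
end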

9.4 Simpson's rule and Romberg integration

It seems reasonable to explore what comes of choosing higher order interpolations for
the integrand than the one we used for the Trapezoidal rule. Doing so should produce, for
a given number of grid points, a better approximation to the integral.

We will start the ball rolling with Simpson's rule, which will give us an insight into how
we can automate the process of producing successively higher order approximations. The
result will be a very powerful algorithm known as Romberg integration. It is nothing more
than an elegant combination of the Trapezoidal rule with Richardson extrapolation.

9.5 Simpson's rule

This is the next step beyond the Trapezoidal rule – here we use a piecewise quadratic to
approximate f(x). Note that since a quadratic requires three data points, we build each
f̃ over successive pairs of intervals, e.g. [x_{j−1}, x_j] and [x_j, x_{j+1}]. Thus for Simpson's rule
we must choose N to be an even integer.

It's a bit messy, but the quadratic approximation f̃(x) to f(x) in the interval [x_{j−1}, x_{j+1}]
is given by the 2nd order Lagrange polynomial

f(x) ≈ f̃(x) = f(x_{j+1}) (x − x_{j−1})(x − x_j) / ( (x_{j+1} − x_{j−1})(x_{j+1} − x_j) )
            + f(x_j) (x − x_{j+1})(x − x_{j−1}) / ( (x_j − x_{j+1})(x_j − x_{j−1}) )
            + f(x_{j−1}) (x − x_{j+1})(x − x_j) / ( (x_{j−1} − x_{j+1})(x_{j−1} − x_j) )

We now use this to form our estimate of the integral,

I(a, b) = ∫_a^b f(x) dx = ∑_j ∫_{x_{j−1}}^{x_{j+1}} f(x) dx ≈ ∑_j ∫_{x_{j−1}}^{x_{j+1}} f̃(x) dx
        = (2/3) (b − a)/(2N) ( f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + · · · + 2f_{N−2} + 4f_{N−1} + f_N )

where f_j = f(x_j) and both sums over j use only the odd integers j = 1, 3, 5, · · · N − 1.
Note the alternating pattern of 4's and 2's.
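A minimal Matlab sketch of the composite Simpson rule (names ours):

f = @(x) x.^4;  a = 0;  b = 1;  N = 16;   % N must be even

x = a + (0:N)*(b - a)/N;                  % the N+1 grid points
w = 2*ones(1, N+1);                       % weight pattern 1 4 2 4 ... 2 4 1
w(2:2:N) = 4;
w([1 N+1]) = 1;

I_simp = (b - a)/(3*N) * sum(w .* f(x));
err = abs(I_simp - 1/5)                   % expect roughly 2e-6 for N = 16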

9.5.1 Example

Returning to our familiar test case I = ∫_0^1 x^4 dx, we find the following results

    N     Ĩ(0, 1)       I(0, 1)     E(N) = |Ĩ − I|   E(N/2)/E(N)
    4     2.00521E-01   2.000E-01   5.208E-04
    8     2.00033E-01   2.000E-01   3.255E-05        1.600E+01
    16    2.00002E-01   2.000E-01   2.035E-06        1.600E+01
    32    2.00000E-01   2.000E-01   1.272E-07        1.600E+01
    64    2.00000E-01   2.000E-01   7.947E-09        1.600E+01
    128   2.00000E-01   2.000E-01   4.967E-10        1.600E+01
    256   2.00000E-01   2.000E-01   3.104E-11        1.600E+01
    512   2.00000E-01   2.000E-01   1.940E-12        1.600E+01

This time we find a 16-fold improvement in the error each time we double N. Thus we
see that E = O(1/N^4). This is good, but we can do even better!

Exercise. Let T(N), S(N) be the Trapezoidal and Simpson's rule as defined above.
Show that S(2N) = (4/3)T(2N) − (1/3)T(N).

9.6 Romberg integration

The coefficients 4/3 and 1/3 found in the previous exercise remind us of Richardson
extrapolation. In fact for the Trapezoidal rule we can show that

I_N = I + a/N^2 + b/N^4 + c/N^6 + · · ·

thus we are in a position to apply a number of Richardson extrapolations starting from just
a table of Trapezoidal approximations. We could start by assembling the Trapezoidal
approximations into a column and then use Richardson extrapolation to generate
further columns to the right. We will define R(N, j) to be the result of applying j rounds
of Richardson extrapolation having started from R(N, 0).

In a similar fashion we can define E(N, j) to be the error in R(N, j). From the above
error formula we can easily see that E(N, j) = O(1/N^{2(j+1)}). Now we can apply one
level of Richardson extrapolation; the result is

R(2N, j + 1) = ( 4^{j+1} R(2N, j) − R(N, j) ) / ( 4^{j+1} − 1 )

This is a recursive formula – it allows us to calculate the successive columns in the R(N, j)
table. We start the computation by setting the first column to the Trapezoidal data,
R(N, 0) = T(N). Then we use the above equation to fill in the remaining columns, one
by one to the right. Each successive column should be more accurate than the previous,
with the error varying as E(N, j) = O(1/N^{2(j+1)}).

This process is called Romberg integration. The last number generated is usually taken
as the best approximation to the integral.
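A minimal Matlab sketch of Romberg integration (names ours). The first column holds
the trapezoidal values R(N, 0); each further column applies one round of Richardson
extrapolation.

f = @(x) 4./(1 + x.^2);  a = 0;  b = 1;     % the test integral used below, I = pi
levels = 7;                                 % N = 2, 4, ..., 2^levels

R = zeros(levels, levels);
for i = 1:levels
    N = 2^i;
    x = a + (0:N)*(b - a)/N;
    R(i,1) = (b - a)/N * (sum(f(x)) - (f(x(1)) + f(x(end)))/2);   % trapezoidal rule
    for j = 1:i-1                            % Richardson extrapolation across the row
        R(i,j+1) = (4^j * R(i,j) - R(i-1,j)) / (4^j - 1);
    end
end
best = R(levels, levels);                    % expect best ~ pi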

9.6.1 Example

Our previous example I = ∫_0^1 x^4 dx provides no challenge for Romberg integration so, in
this instance, we have used I = ∫_0^1 4/(1 + x^2) dx, for which we know the exact answer
to be I = π. Here are the results

    N     R(N, 0)       R(N, 1)       R(N, 2)       R(N, 3)       R(N, 4)
    2     3.00000E+00
    4     3.10000E+00   3.13333E+00
    8     3.13118E+00   3.14157E+00   3.14212E+00
    16    3.13899E+00   3.14159E+00   3.14159E+00   3.14159E+00
    32    3.14094E+00   3.14159E+00   3.14159E+00   3.14159E+00   3.14159E+00
    64    3.14143E+00   3.14159E+00   3.14159E+00   3.14159E+00   3.14159E+00
    128   3.14155E+00   3.14159E+00   3.14159E+00   3.14159E+00   3.14159E+00

We can see that the convergence is very rapid and that our best answer would be I ≈
R(128, 4) = 3.14159265359 (obtained by forcing the program to print more significant
figures). Since we know the exact answer to be I = π we can also compute the errors
E(N, j), and here they are

    N     E(N, 0)     E(N, 1)     E(N, 2)     E(N, 3)     E(N, 4)
    2     1.416E-01
    4     4.159E-02   8.259E-03
    8     1.042E-02   2.403E-05   5.250E-04
    16    2.604E-03   1.511E-07   1.441E-06   6.870E-06
    32    6.510E-04   2.365E-09   7.553E-09   1.519E-08   1.169E-08
    64    1.628E-04   3.696E-11   1.182E-10   2.349E-13   5.982E-11
    128   4.069E-05   5.769E-13   1.848E-12   4.441E-16   4.441E-16

The computer on which these calculations were performed can store no more than about
15 decimal digits in the mantissa. That is, it will always have a round-off error of around
10^{−15} in each computation. Since R(128, 4) ≈ 3.14159 and E(128, 4) ≈ 4 × 10^{−16} we see
that we have hit the level of round-off errors in R(128, 4) and so no further rounds of
Richardson extrapolation on this computer could improve upon our best approximation.

10. Numerical differentiation



10.1 Introduction

If we are given a set of data (x_i, f_i) for a function f(x), how might we estimate the
derivative at a point?

One approach would be to plot the data in a graph and measure the slope of the tangent
line. This is tedious and prone to errors.

Another approach is to construct an interpolation f̃(x) to f(x) and then compute
df/dx ≈ df̃/dx. This works but it is computationally intensive (each time we have
to build the complete polynomial yet we may only be looking for one number, df̃/dx).

A better approach is to use simple generic formulae that can turn a table of data (x_i, f_i)
into a similar table (x_i, df/dx) for the derivative. The technique we will follow is known
as Finite Differences.

10.2 Finite differences

This is a simple method based upon the following Taylor series

f(x + h) = f(x) + (df/dx) h + (d^2f/dx^2) h^2/2! + (d^3f/dx^3) h^3/3! + · · ·     (1)

f(x − h) = f(x) − (df/dx) h + (d^2f/dx^2) h^2/2! − (d^3f/dx^3) h^3/3! + · · ·     (2)

All finite difference approximations can be derived from these and related Taylor series
by taking suitable linear combinations.

Though it is possible to develop finite difference approximations for unevenly spaced
data, we will limit ourselves to equally spaced data.

10.2.1 First derivatives

From the above we can immediately obtain three simple estimates

df/dx ≈ df̃/dx = ( f(x + h) − f(x) ) / h          Forward differences    (3)

df/dx ≈ df̃/dx = ( f(x) − f(x − h) ) / h          Backward differences   (4)

df/dx ≈ df̃/dx = ( f(x + h) − f(x − h) ) / (2h)   Centered differences   (5)

Exercise. Show that each of these finite difference estimates can also be derived by
constructing straight lines through the data (x_i, f_i).
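A minimal Matlab sketch of the three estimates at a single point (test function and
names ours):

f = @(x) exp(x);             % test function with known derivative exp(x)
x = 1;  h = 0.1;

d_fwd = (f(x + h) - f(x)) / h;          % forward differences  (3)
d_bwd = (f(x) - f(x - h)) / h;          % backward differences (4)
d_cen = (f(x + h) - f(x - h)) / (2*h);  % centered differences (5)

errs = abs([d_fwd d_bwd d_cen] - exp(x)) % centered is clearly the most accurate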

Example 1 : forward differences In all of the following examples we will use f(x) =
e^x. Since df/dx = e^x it is easy to compute the error, which we define by err = |df/dx − df̃/dx|.

For the forward differences we obtain, for h = 0.1,

    x        0.00E+00   1.00E-01   2.00E-01   3.00E-01   4.00E-01   5.00E-01
    df̃/dx    1.05E+00   1.16E+00   1.28E+00   1.42E+00   1.57E+00   1.73E+00
    error    5.17E-02   5.71E-02   6.32E-02   6.98E-02   7.71E-02   8.53E-02

There are two questions we might like to ask

◮ How does the error depend on the step length h? and
◮ How does the error depend on the function f(x)?

The first question can be answered by direct calculation,

    h        1.00E+00   1.00E-01   1.00E-02   1.00E-03   1.00E-04
    df̃/dx    4.67E+00   2.86E+00   2.73E+00   2.72E+00   2.72E+00
    error    1.95E+00   1.41E-01   1.36E-02   1.36E-03   1.36E-04
    error/h  1.95E+00   1.41E+00   1.36E+00   1.36E+00   1.36E+00

The last line shows that the error varies as O(h), that is, linearly with h.

The second question is best explored by inspection of the Taylor series. If we retain the
higher order terms in the Taylor series then we find

df/dx = ( f(x + h) − f(x) ) / h + O( h d^2f/dx^2 )

This shows that the error should vary as O(h), a fact we have already seen, and that the
error also varies as O(d^2f/dx^2). This makes perfect sense. Since the forward difference
approximation arises from the approximation of the data by a straight line, the error
should arise from the failure of a straight line to approximate the data. This will occur
in the quadratic and higher terms, hence the appearance of the second derivative.

Example 2 : centred differences We can repeat the above computations but this
time using centred finite differences,

    x        0.00E+00   1.00E-01   2.00E-01   3.00E-01   4.00E-01   5.00E-01
    df̃/dx    1.00E+00   1.11E+00   1.22E+00   1.35E+00   1.49E+00   1.65E+00
    error    1.67E-03   1.84E-03   2.04E-03   2.25E-03   2.49E-03   2.75E-03

Note how the error now is more than ten times smaller than that for forward differences.

    h          1.00E+00   1.00E-01   1.00E-02   1.00E-03   1.00E-04
    df̃/dx      3.19E+00   2.72E+00   2.72E+00   2.72E+00   2.72E+00
    error      4.76E-01   4.53E-03   4.53E-05   4.53E-07   4.53E-09
    error/h^2  4.76E-01   4.53E-01   4.53E-01   4.53E-01   4.53E-01

This time we observe that the error decreases as O(h^2), that is, if we reduce h by a factor
of 10 then the error will be reduced by a factor of 100. This is a considerable advantage
over forward differences.

Exercise. Confirm this result by retaining the higher order terms in the Taylor series.

10.2.2 Second derivatives

To estimate the second derivative we once again use the Taylor series expansions. Using
equations (1) and (2) we can easily show that

d^2f/dx^2 ≈ ( f(x + h) − 2f(x) + f(x − h) ) / h^2 + O( h^2 f^(iv) )     (6)

Exercise. Confirm this result by retaining the higher order terms in the Taylor series.

Example Using our standard test function f(x) = e^x, we find the following estimates
for d^2f/dx^2 over 0 ≤ x ≤ 0.5 and with h = 0.1

    x            0.00E+00   1.00E-01   2.00E-01   3.00E-01   4.00E-01   5.00E-01
    d^2f̃/dx^2    1.00E+00   1.11E+00   1.22E+00   1.35E+00   1.49E+00   1.65E+00
    error        8.34E-04   9.21E-04   1.02E-03   1.13E-03   1.24E-03   1.37E-03

Setting x = 1 and varying h we find the following estimates for the error in the centered
finite difference approximation

    h            1.00E-01   1.00E-02   1.00E-03   1.00E-04   1.00E-05
    d^2f̃/dx^2    2.72E+00   2.72E+00   2.72E+00   2.72E+00   2.72E+00
    error        2.27E-03   2.27E-05   2.27E-07   3.78E-08   5.99E-06
    error/h^2    2.27E-01   2.27E-01   2.27E-01   3.78E+00   5.99E+04

Once again, this shows that for centred differences we have an error that varies as O(h^2).
This is good!

Notice also how the error suddenly becomes very large when h is reduced to very small
values. This is an example of how round-off errors can seriously affect the quality of the
computations.

Why does this occur in this computation? Because for very small values of h all three
numbers f(x + h), f(x), f(x − h) will be almost equal and thus there will be a significant
loss of accuracy in computing f(x + h) − 2f(x) + f(x − h). This point will be explored
in more detail in the next lecture.

The upshot of this example is that the accuracy of a numerical derivative need not
improve with every reduction in the step length h and thus there will be an optimal
value for h. This point will also be explored in the next lecture.

10.3 Truncation and round-off errors

In any calculation there will exist both truncation and round-off errors and, depending on
the nature of the computations, one or the other may dominate. It may also be possible
to examine the conditions under which their combined effect can be minimised. It's wise
to do so if this option is available.

10.3.1 Example – Forward finite differences

For a smooth function y(x) we can approximate its first derivative by taking forward
differences

dy/dx ≈ dỹ/dx = ( y(x + h) − y(x) ) / h

We first ask – what is the likely total error, from both round-off and truncation errors, in
the estimate given by the right hand side?

First, the round off error. Let us define a quantity P by

P = ( y(x + h) − y(x) ) / h

This P will later be our approximation to the first derivative, but for the moment think
of this as just a simple function and suppose that we are exploring just the round-off errors in
computing P.

We now ask: How will the round off error enter into this calculation? As we are
subtracting two nearly equal numbers (since h is meant to be small) we can expect a
significant round off error as we let h → 0. We will follow the method developed when
looking at round-off errors for division. Put

P = P̃ + E_R(P)

We seek an expression for E_R(P) in terms of h and y and possibly their respective round
off errors. Put

y(x + h) − y(x) = ˜(y(x + h) − y(x)) + E_R( y(x + h) − y(x) )
h = h̃ + E_R(h) = h̃ + 10^{−N} O(h)

When h is very small we have

y(x + h) ≈ y(x)

which allows us to write (recall the much earlier discussion on round-off errors for subtraction)

E_R( y(x + h) − y(x) ) = 10^{−N} O(y(x))

Substitute all of this back into the original equation for P to obtain

P = ( ˜(y(x + h) − y(x)) + 10^{−N} O(y(x)) ) / ( h̃ + 10^{−N} O(h) )

and since N is (usually) at least 15 we can expand this as a power series in 10^{−N} and
retain just the leading terms. The result is

P = P̃ + 10^{−N} O(y)/h̃ − 10^{−N} O(h) P̃/h̃ + · · ·

where P̃ = ˜(y(x + h) − y(x))/h̃ is the computer's estimate for P. From this we deduce
that

E_R(P) = 10^{−N} O(y)/h̃ − 10^{−N} O(h) P̃/h̃

How does E_R(P) behave as h̃ → 0? First we expect P̃ ≈ P for some range of small values
of h. Thus the second term should remain approximately constant as h̃ is reduced.
In contrast the first term will diverge for ever reducing values of h̃. Thus E_R(P) is
dominated by its first term, and we are correct in writing, for small h,

E_R(P) = 10^{−N} O(y)/h

Now we turn to the job of estimating the truncation error. This is much easier than
computing the round off error. We start with a standard Taylor series

y(x + h) = y(x) + h dy/dx + (h^2/2!) d^2y/dx^2 + · · ·

then

( y(x + h) − y(x) ) / h = dy/dx + (h/2!) d^2y/dx^2 + · · ·

from which we conclude that

E_T = (h/2!) d^2y/dx^2

where (as is usual in this game) we have discarded all higher order terms (we are only
after an estimate of the size of the error, not its exact value).

The total error is E = E_T + E_R,

E(h) = hA + 10^{−N} B/h

where we have written A = (1/2) d^2y/dx^2 and B = O(y). Note that neither A nor B
depends on h. Our aim is to find the best estimate for the derivative. This in turn
requires us to choose h so that the total error is a minimum. Thus we set dE/dh to
zero,

0 = dE/dh = A − 10^{−N} B/h^2

Solving this for h > 0 we find

h = 10^{−N/2} (B/A)^{1/2}

which leads to

E_T = E_R = 10^{−N/2} (AB)^{1/2}

What do we learn from all this? We have just found, for the best choice of h, that
E_R = 10^{−N/2} (AB)^{1/2}. If we can turn this into the standard form

E_R(P) = 10^{−Q} O(P)

then we could say that E_R(P) has Q decimal digits of accuracy.

Notice that A, B and P are all finite and they do not depend on N. Thus (looking back
at the definitions for A and B) we find (AB)^{1/2} = O(P) and thus

E_R(P) = 10^{−N/2} O(P)

So finally we conclude that the best we can ever hope for in using (y(x + h) − y(x))/h as
an approximation to dy/dx will be to get at most N/2 digits of accuracy on an N-digit
computer.

There are two important points to note

◮ The optimal choice of step length is not zero! You can verify this by looking at
  the table of results in the lecture on finite differences.

◮ Even though we can store up to N decimal digits of accuracy, we see that this
  algorithm, at best, will produce estimates with only N/2 decimal digits of accuracy
  (i.e. 7 digits with N = 14).

10.3.2 Example – Centred finite differences

In this case we have

dy/dx ≈ dỹ/dx = ( y(x + h) − y(x − h) ) / (2h)

and we follow the same arguments as before to arrive at

E_R + E_T = 10^{−N} O(y)/h + h^2 O( d^3y/dx^3 )

Again we wish to set this to be a minimum for a suitable choice of h. Setting the
derivative equal to zero and solving for h gives us

h = O( 10^{−N/3} )

E_R = O( 10^{−2N/3} ) ,    E_T = O( 10^{−2N/3} )

Now we see that

◮ The optimal step length for centred differences is larger than that for forward
  differences, and

◮ We now have 2N/3 digits of accuracy (i.e. 10 digits with N = 15). This is consistent
  with our previous empirical observations that centred differences gave far better
  answers than forward differences.

11. Numerical Solutions of Ordinary Differential Equations



11.1 Introduction

If somebody asked you to evaluate y(b) given that y(x) is the solution of

dy/dx = f(x)

subject to y(a) = 0, you could simply evaluate the definite integral

y(b) = ∫_a^b f(x) dx

But how would you answer the similar question for the differential equation

dy/dx = f(x, y)?

This is a much harder problem and requires a wider range of numerical techniques than
those we developed for definite integrals. This will be the purpose of the next few
lectures – to develop suitable integration schemes for ordinary differential equations.

11.2 Initial value problems

In finding solutions of ordinary differential equations of the form

dy/dx = f(x, y) ,    y(a) = y_a

we usually use a technique that marches the solution through increasing values of x
having started from x = a. It's not surprising then that these are known as initial value
problems (often abbreviated as IVPs).

We begin by subdividing the x-axis into equal step lengths h with x_j = a + jh for
j = 0, 1, 2, · · ·. The numerical estimates for the solution y(x_j) will be denoted by y_j. All
of the following schemes provide a way to generate the y_j in succession starting with y_0.

11.3 Euler's method

This is the simplest integration scheme of all for IVPs. It can be derived by making the
simple forward difference approximation to the derivative

(dy/dx)_j ≈ ( y_{j+1} − y_j ) / ( x_{j+1} − x_j ) = (1/h)( y_{j+1} − y_j )

This then gives us

y_{j+1} = y_j + h f(x_j, y_j)

We start at j = 0 with y = y_0 and x = x_0 and compute successive values for y_j. If we
are lucky the y_j will be good approximations to y(x_j). The accuracy will depend on a

number of factors, such as the size of the step length h, the number of steps and the
character of the solution y(x).
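A minimal Matlab sketch of Euler's method for the test problem used below, dy/dx = −y
with y(0) = 1 (names ours):

f  = @(x, y) -y;                 % right hand side of the IVP
h  = 0.1;  x0 = 0;  y0 = 1;      % step length and initial condition
M  = 50;                         % number of steps

x = x0 + (0:M)*h;
y = zeros(1, M+1);  y(1) = y0;
for j = 1:M
    y(j+1) = y(j) + h*f(x(j), y(j));   % Euler step y_{j+1} = y_j + h f(x_j, y_j)
end
plot(x, y, 'o-', x, exp(-x), '-')      % compare with the exact solution e^{-x}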

Consider the simple case where f(x, y) = −y with y(0) = 1. The exact
solution is y(x) = e^{−x}. Here are the results for four choices of step length, h = 0.1, 0.5, 1.5
and 2.5.

[Figure: "Numerical integration using Euler's method" – four panels showing the numerical
solution for h = 0.1, 0.5, 1.5 and 2.5.]

We see that for small values of the step length the scheme gives good, accurate answers. But
for larger values we lose accuracy, and even stability! We can see oscillations
developing in the solution for h = 1.5 but they do appear to die away. However at h = 2.5
the oscillations grow without bound. This is our first lesson – numerical integration of
ODEs may lead to unstable schemes for a poor choice of step length.

It is not hard to see why this occurs. The Euler scheme for this IVP is just

y_{j+1} = y_j − h y_j = y_j (1 − h) = y_0 (1 − h)^{j+1}

We know that the correct solution has y → 0 as x → ∞. This will occur in our Euler
scheme provided |1 − h| < 1, which gives 0 < h < 2. So it's no surprise that the successive
y_j diverged when h = 2.5.

Fortunately, in this case, there is a simple fix. Had we chosen a backward finite difference<br />

( ) dy<br />

≈ y j − y j−1<br />

= 1 dx x j − x j−1 h (y j+1 − y j )<br />

j<br />


This then gives us

  y_j = y_{j−1} − h y_j

which we then solve for y_j in terms of y_{j−1}. Schemes such as this, where the new value of y appears on both sides of the equation, are known as implicit schemes. In contrast, schemes which provide the next value for y only on the left hand side of the equation are known as explicit schemes. The forward difference Euler scheme is an explicit scheme. Implicit schemes are usually harder to apply, but they are usually stable for a much wider range of step lengths than explicit schemes.
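For this test problem the implicit update can be solved by hand, so the backward scheme costs no more than the forward one. A minimal Matlab sketch (our own illustration, not from the notes):

% Backward (implicit) Euler for dy/dx = -y, y(0) = 1  -- minimal sketch
% The implicit update y_j = y_{j-1} - h*y_j rearranges to y_j = y_{j-1}/(1 + h).
h = 2.5;  n = 8;        % a step length that made the explicit scheme blow up
x = 0;  y = 1;
for j = 1 : n
    y = y/(1 + h);      % exact solution of the implicit update
    x = x + h;
end
disp([x y]);            % the numerical solution decays, with no oscillations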

[Figure: Stability of implicit integration. Four panels show the backward (implicit) Euler solution y against x for step lengths h = 0.1, 0.5, 1.5 and 2.5; the solutions remain bounded for all four step lengths.]

11.4 Improved Euler scheme

The basic Euler scheme presented in the previous lecture served as a very simple model for numerical integration of a first order ordinary differential equation. Its simplicity is appealing but that feature is also the source of its main weakness – it often gives poor answers (i.e. large errors or even unstable integrations). We need a more reliable integration scheme, one which has better accuracy and stability properties than the Euler scheme.

Once again we start with the ODE

  dy/dx = f(x, y)


and ask ourselves how we can convert this ODE into an algebraic equation involving successive values of y_j. We recall that approximating a derivative at the centre of an interval gave far better answers than estimating it at either end of the interval (centred differences versus forward or backward differences). So can we compute dy/dx at the mid-point of [x_j, x_{j+1}]? Here is one method,

  (dy/dx)_{j+1/2} = (1/2) [ (dy/dx)_j + (dy/dx)_{j+1} ] = (1/2) [ f(x_j, y_j) + f(x_{j+1}, y_{j+1}) ]

But we could also write

  (dy/dx)_{j+1/2} = (y_{j+1} − y_j)/(x_{j+1} − x_j) = (y_{j+1} − y_j)/h

and combining these we get

  (y_{j+1} − y_j)/h = (1/2) [ f(x_j, y_j) + f(x_{j+1}, y_{j+1}) ]

Okay, we have an algebraic equation but it is an implicit scheme (y_{j+1} appears more than once in this equation). To use this as it stands would require a root finding method for each step in the integration. This would be exceedingly slow. We need another trick to make this practical.

For the y_{j+1} on the right hand side we can make the simple Euler approximation

  y_{j+1} ≈ y_j + h f(x_j, y_j)

then we compute the true (well, a better estimate of) y_{j+1} using our fancy scheme above. This gives us the Improved Euler Scheme (sometimes also called the Modified Euler Scheme).

Here is the scheme,

  ỹ_{j+1} = y_j + h f(x_j, y_j)
  y_{j+1} = y_j + (h/2) [ f(x_j, y_j) + f(x_{j+1}, ỹ_{j+1}) ]
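In Matlab the predictor–corrector pair is only two lines inside the loop. A minimal sketch (our own illustration, using the same test problem as before):

% Improved (modified) Euler for dy/dx = f(x,y)  -- minimal sketch
f = @(x,y) -y;
h = 0.1;  n = 50;
x = 0;  y = 1;
for j = 1 : n
    ypred = y + h*f(x,y);                      % predictor: simple Euler estimate
    y = y + (h/2)*( f(x,y) + f(x+h,ypred) );   % corrector: improved Euler update
    x = x + h;
end
disp([x y exp(-x)]);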

How well does it work? Take the same example as before, dy/dx = −y with y(0) = 1. Here are the results, showing the errors E_j = y_j − y(x_j).


[Figure: Errors in Improved Euler. Four panels show the errors against x for step lengths h = 0.1, 0.5, 1.5 and 2.5; the errors for h = 2.5 grow without bound.]

We see that once again the numerical errors grow without bound for h = 2.5. This is not much better than what we had with the Euler method. But in both cases a step length as large as h = 2.5 for a problem such as this is clearly crazy. The single thing to learn from these examples is that the choice of step length can have a significant effect on the accuracy and stability of the integration.

We could at this point delve into the formal calculation of the error in the approximations y_j and how that error depends on the step length. However we'll defer that until we've had a look at some other schemes.

11.5 Taylor series method

This is a very nice way to generate a whole family of integration schemes.

We start with our friend

  dy/dx = f(x, y)

and then we recall that a Taylor series for y(x) is

  y(x + h) = y(x) + h dy/dx + (h²/2) d²y/dx² + (h³/6) d³y/dx³ + · · ·


This gives us the option of replacing all the derivatives on the right with f(x, y) and its derivatives. Thus we get

  y(x + h) = y(x) + h f(x, y) + (h²/2) ( ∂f/∂x + f(x, y) ∂f/∂y ) + · · ·

This is very elegant and it gives us a ready handle on the nature of the error, roughly the first term we left off in the tail of the series.
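For the example used below, f(x, y) = −xy, the partial derivatives are easy to work out by hand: f_x + f f_y = −y + x²y. A minimal Matlab sketch of the resulting second order Taylor scheme (our own illustration):

% Second order Taylor series scheme for dy/dx = -x*y, y(0) = 1  -- minimal sketch
% Here f = -x*y, so f_x + f*f_y = -y + x^2*y (computed by hand).
h = 0.1;  n = 50;
x = 0;  y = 1;
for j = 1 : n
    y = y + h*(-x*y) + (h^2/2)*(-y + x^2*y);
    x = x + h;
end
disp([x y exp(-x^2/2)]);   % compare with the exact solution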

Here are the results for h = 0.1 and h = 0.5 for a new example, f(x, y) = −xy with y(0) = 1. The exact solution is y(x) = exp(−x²/2) and we plot just the errors y_j − y(x_j).

[Figure: Errors in Taylor series integration. Two panels show the errors y_j − y(x_j) against x for step lengths h = 0.1 and h = 0.5.]

11.6 Runge-Kutta schemes

Designing integration schemes around Taylor series seems like a great idea – it's elegant, systematic and we get, at no cost, an estimate of the likely error in our solutions. But there's a catch (there's always a catch).

The big problem with the Taylor series approach is that it can be very tedious to evaluate the higher derivatives of f(x, y). For example, try computing d³f(x, y)/dx³ – it's a mess. The objective with Runge-Kutta methods is to mimic the Taylor series but without computing the derivatives of f(x, y).


11.6.1 Second Order Runge-Kutta

Our objective here is to build a scheme which does not involve any derivatives of f(x, y), yet matches the Taylor series schemes up to and including the order h² terms.

We look to the Improved Euler scheme (another derivative-free scheme) for inspiration. That scheme can be cast in the form

  k_1 = h f(x_j, y_j)                          (1)
  k_2 = h f(x_j + αh, y_j + βk_1)              (2)
  y_{j+1} = y_j + a k_1 + b k_2                (3)

with 1 = α = β = 2a = 2b.

Exercise. Show that this is indeed the Improved Euler scheme.

Why do we cast it in this odd form? Because it gives us flexibility. We are free to choose the four parameters a, b, α, β to be any set of numbers provided the scheme remains second order. That's our objective for now – find the constraints on a, b, α, β so that the scheme is second order.

We begin by writing down the second order Taylor series scheme evaluated at x_j

  y_{j+1} = y_j + h f_j + (h²/2) (f_x + f f_y)_j + O(h³)        (4)

where the x, y subscripts now denote partial derivatives.

Next we use a (different) Taylor series (a Taylor series for a function of two variables) to expand k_2 around x = x_j, y = y_j,

  k_2 = h ( f_j + αh (f_x)_j + βh (f f_y)_j + O(h²) )

Exercise. Why did we stop at O(h²) in this expansion?

Now combine this with the previous equations (1,3) to get

  y_{j+1} = y_j + a h f_j + b h ( f_j + αh (f_x)_j + βh (f f_y)_j ) + O(h³)

We want this to be exactly the same as the Taylor series. So we compare the various terms and demand that

  a + b = 1,    1 = 2αb = 2βb

We have found three constraints amongst our four parameters α, β, a and b. Clearly we are free to choose one of the parameters and thus we get a whole family of integration schemes, all of which are 2nd order accurate.


Examples

◮ Euler. This is simply a = 1, b = 0.

◮ Improved Euler. This corresponds to setting 1 = α = β = 2a = 2b.

◮ Mid-point rule. For this we set 1/2 = α = β, 0 = a, 1 = b, for which we get

  y_{j+1} = y_j + h f( x_j + h/2, y_j + (h/2) f_j )

Exercise. Why do you think this is called the mid-point rule? (Too easy really.)

11.6.2 Fourth Order Runge-Kutta

This is a simple extension of the above ideas – this time we force the numerical scheme to match the Taylor series up to and including the order h⁴ terms. It's a lengthy calculation, but nothing that we haven't already seen. As with the 2nd order Runge-Kutta we get a whole family of schemes. One of the most popular is

  y_{j+1} = y_j + (1/6) (k_1 + 2k_2 + 2k_3 + k_4)
  k_1 = h f(x_j, y_j)
  k_2 = h f(x_j + h/2, y_j + k_1/2)
  k_3 = h f(x_j + h/2, y_j + k_2/2)
  k_4 = h f(x_j + h, y_j + k_3)

This is not the only choice, but it is the most common choice.

We know that our numerical integration schemes aren’t prefect, they have errors and<br />

they can be unstable. We will look at stability a little later on but for the moment let’s<br />

focus on the errors.<br />


11.7.1 Discretization errors

There are a few things we might like to know about our integration schemes, such as what the error is at a particular stage in the integration and, further, how that error might grow with successive steps. For this we define two types of error.

◮ Global discretization error
This is defined by E_j = y(x_j) − y_j and is equal to the error at a specific x-value. It is principally made up of the accumulated local discretization errors (plus a small component of round-off errors).

◮ Local discretization error
This is defined by e_j = E_{j+1} − E_j, which is the error introduced in one step of the integration.

What we shall find is that if e_j = O(h^(n+1)) for some n then, generally, E_j = O(h^n).

The starting point for the formal error analysis is, as always, a Taylor series. We shall use the Improved Euler Scheme as an example.

11.7.2 Local discretization error

Suppose (in a dream) that our integration up to x_j is perfect, that there is no error in y_j. That is, y_j = y(x_j) and thus E_j = 0. Then the local truncation error will be given by e_j = E_{j+1} = y(x_{j+1}) − y_{j+1}, and this we can calculate.

We start with the Taylor series of the exact solution

  y(x_{j+1}) = y(x_j) + h (dy/dx)_j + (h²/2) (d²y/dx²)_j + (h³/6) (d³y/dx³)_j + O(h⁴)
             = y(x_j) + h f_j + (h²/2) (f_x + f f_y)_j + (h³/6) (d²f/dx²)_j + O(h⁴)

and a similar Taylor series for the Improved Euler Scheme

  y_{j+1} = y_j + (h/2) [ f(x_j, y_j) + f(x_j + h, y_j + h f_j) ]
          = y_j + (h/2) [ 2f + h (f_x + f f_y) + (h²/2) d²f/dx² ]_j + O(h⁴)

Now since y_j = y(x_j) we find

  e_j = −(h³/12) (d²f/dx²)_j + O(h⁴)


What do we learn from this? First, the local discretization error is O(h³) and second, the Improved Euler Scheme will be exact if d²f/dx² = 0 for all x, y.

We can also use this calculation to estimate the Global Discretization Error.

11.7.3 Global Discretization Error

We know two things, e_j = E_{j+1} − E_j and e_j = O(h³ f''), from which we can now easily compute E_j.

  E_{j+1} − E_0 = Σ_{k=0}^{j} (E_{k+1} − E_k) = Σ_{k=0}^{j} e_k
               = Σ_{k=0}^{j} [ −(h³/12) (d²f/dx²)_k + O(h⁴) ]
               = h² Σ_{k=0}^{j} [ −(h/12) (d²f/dx²)_k ] + O(h⁴)

Notice that the sum on the right is a Riemann sum for the integral ∫ from x_0 to x_{j+1} of f'' dx, which in turn we estimate using the Mean Value Theorem. Thus

  E_{j+1} − E_0 = O( (h²/12) (d²f/dx²) )

with the right hand side evaluated at some (unknown) point inside [x_0, x_{j+1}].

But since we are given exact initial values (i.e. we are given y = y_0 at x = x_0), E_0 must be zero, so we have our final result

  E_{j+1} = O( (h²/12) (d²f/dx²) )

You might feel uneasy that the right hand side has to be evaluated at some unknown point. But this is not a major sticking point because we never really need to compute a number for the right hand side. Instead we use it only as a formal statement that the error varies as O(h²).


11.7.4 Numerical results

The theory is all well and good but we should also be able to demonstrate these results directly from numerical calculations. This is very easy to do – just propose a problem for which you know the exact solution. Then compare the exact versus the numerical solutions. A piece of cake.

Here are some results for the simple ODE dy/dx = −2xy with y(0) = 1. The exact solution is y(x) = exp(−x²).

  h           y-approx    y-exact     E(h)        E(2h)/E(h)
  5.000E-01   3.750E-01   3.679E-01   7.121E-03
  2.500E-01   3.742E-01   3.679E-01   6.278E-03   1.134E+00
  1.250E-01   3.697E-01   3.679E-01   1.804E-03   3.479E+00
  6.250E-02   3.683E-01   3.679E-01   4.678E-04   3.857E+00
  3.125E-02   3.680E-01   3.679E-01   1.185E-04   3.948E+00
  1.562E-02   3.679E-01   3.679E-01   2.979E-05   3.978E+00
  7.812E-03   3.679E-01   3.679E-01   7.467E-06   3.990E+00

In each row the step length is one half the previous value, and the Global Discretization Errors are evaluated at x = 1. The final column shows that E(h) = O(h²), as expected.
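A table like this is easy to reproduce. Here is a minimal Matlab sketch using the Improved Euler scheme (our own illustration); halving h should divide the global error at x = 1 by roughly 4.

% Global error of the Improved Euler scheme at x = 1 for dy/dx = -2*x*y, y(0) = 1
f = @(x,y) -2*x*y;
for p = 1 : 7
    h = 0.5^p;  n = round(1/h);
    x = 0;  y = 1;
    for j = 1 : n
        ypred = y + h*f(x,y);
        y = y + (h/2)*( f(x,y) + f(x+h,ypred) );
        x = x + h;
    end
    fprintf('%10.3e  %12.5e\n', h, abs(y - exp(-1)));   % step length and E(h)
end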

Exercise. Repeat the above calculations for both the Euler and Mid-point schemes.

11.8 Stability

In deriving the previous results we made two crucial assumptions – that the various Taylor series converged and that they were dominated by their leading terms. What happens when either of these assumptions does not apply? Quite often the integration scheme develops an instability, with an exponential growth in the error in the numerical solution. How can we test for such a situation? The simplest answer is to just run the program and inspect the results. A slightly better approach would be to consider two separate integrations differing only by a small change in the initial conditions. If the two solutions remain close throughout the integration then most likely both integrations are stable.

In the previous section we considered the errors by comparing the numerical solution to the exact solution. In this section we will study (briefly) the difference between two, initially close, numerical solutions.

11.8.1 Example 1

Here is a simple equation,

  dy/dx = −y


for which we might use the Forward Euler scheme,

  y_{j+1} = y_j + h f(x_j, y_j) = y_j − h y_j

Now suppose we generate two solutions, one for y_0 = 1 and one for y_0 = 1 + ε_0 for some small number ε_0. Denote the two solutions by y_j^(1) and y_j^(2). We can track the two solutions by computing ε_j = y_j^(2) − y_j^(1), and from the above equation we find

  ε_{j+1} = (1 − h) ε_j

If the original equations for y_j are to be stable then we could reasonably require that |ε_{j+1}/ε_j| remains bounded for increasing j. Thus we must have

  −1 ≤ 1 − h ≤ 1

from which we find 0 ≤ h ≤ 2.

This same technique can be applied in many other situations.

11.8.2 Example 2

Suppose we had used a Backward Euler scheme; then we would have found

  ε_{j+1} = ε_j / (1 + h)

and the requirement that |ε_{j+1}/ε_j| remains bounded leads to h > 0. That is, this scheme can be expected to be stable for all values of h. It may be inaccurate for large values, but at least it will be stable (according to our definition – there are others).

11.8.3 Example 3

Now consider this equation

  dy/dx = y

Demanding that |ε_{j+1}/ε_j| remains bounded leads to the absurd condition that h < 0 for both the Forward and Backward Euler schemes. We expect the numerical solutions to have exponentially growing errors for any choice of step length.

This is no surprise. Since the exact solution is y(x) = C exp(+x), we see that ε(x + h)/ε(x) = exp(h), which is greater than one for h > 0.

How might we rescue this situation? The hint comes from the above analysis – choose h < 0. That is, we do a backward integration.

How? Start at, say, x = 10, guess a value for y(10) and integrate backwards to x = 0. Then adjust the guess so that your computed value of y_0 matches the given initial condition. This involves a root finding process. You have to do a lot more work, but at least you get a stable scheme.


11.8.4 Example 4

The general solution of

  dy/dx = y + 2e^(−x)

is y(x) = C e^x − e^(−x), where C is a constant of integration.

If y(0) = −1 then we must set C = 0 and the exact solution is y(x) = −e^(−x). But our numerical calculations will always contain some round-off errors. Thus even if we are able to set the initial values exactly, the subsequent calculations may introduce a small round-off error. The effect will be as if we had started with a small but non-zero value for C, and thus the e^x term will eventually dominate the numerical solution. Any forward (i.e. explicit) integration scheme is doomed to fail. The only option is to use a fully backward scheme, such as that used in the previous example.

Here are the results using the Forward and Backward Euler schemes. We know that the exact solution has the property that y → 0 as x → ∞. The numerical solutions, if they are of any use, should preserve this property. In the early stages of the integration both schemes seem to work well, but later on the Forward Euler scheme has clearly run amok, courtesy of the round-off error allowing the e^x term to rear its ugly head.

[Figure: Growth of the dominant term in the general solution. The upper panel shows the numerical solution y, which eventually grows towards 30 by x ≈ 11; the lower panel shows the exact solution y(x) = −e^(−x), which decays to zero.]

12. Optimisation



12.1 Golden Search

Suppose we are given a function f(x) and that we wish to find the points (there may be more than one) at which f is locally minimised. As a guide in developing a strategy for finding such points we begin by asking how we might verify that a point, say x = x⋆, is a minimum point for the function. We would need to show that for two other points, a and b, either side of x⋆ (that is a < x⋆ < b), we have f(a) > f(x⋆) and f(b) > f(x⋆). For this to work properly we would need to choose a and b to be very close to x⋆ (to avoid the possibility of the function wiggling its way below f(x⋆)). We will call this the minimisation condition.

Usually we do not know x⋆ but instead some approximation to x⋆, which we will call c. The game now is to find a scheme where we can iterate on the approximation c such that successive iterations converge to x⋆. The strategy we will use will be to create a triple of numbers a < c < b and to use this as a bracket for x⋆. That is, at each stage in the iteration, we will take c as our current approximation to x⋆, while a and b will be the bounds by which we can assert that the true minimum lies within the interval [a, b].

How do we choose a, b and c? Let's put that aside for the moment and consider a related question: How do we update a, b and c?

The point c will lie (usually) closer to one of a or b. Suppose it happens to be a. Then suppose we introduce a fourth point d such that a < c < d < b. Now we have two overlapping triples, a < c < d and c < d < b, of which only one can satisfy the minimisation condition (this point is crucial!). And it is that triple which we take into the next iteration. Thus we have a new set of values for a, b and c. This completes the update. Notice that the new interval is smaller than the previous interval and thus the successive iterations will close in on x⋆. After a number of iterations we should have a good approximation to x⋆.

There remain two issues: how to choose the initial values for a, b and c, and also how to choose the fourth point d. Consider first the choice of d. It's up to us to invent a reasonable strategy; here is one that works well.

Choose d to lie in the larger of the two intervals [a, c] and [c, b]. Suppose for example that this happens to be the [c, b] interval. Then demand that

  c − a = b − d

which leads to

  d = b − (c − a)

There is actually a small problem with this strategy. If the point c was actually the mid-point of [a, b] then we would have d = c and the new interval would be exactly the same as the original interval. This would be of no use to us and so we need a scheme that avoids this problem. This must impose some constraint on a, b and c. Again, like the choice for d, we have some flexibility. All that we need guarantee is that in each iteration c is not the mid-point of a, b and c. We can achieve this by demanding that the new and old intervals are split in the same proportions.


Let the right sub-interval (e.g. [c, b]) be a fraction r of the total interval (e.g. [a, b]). Then we demand that

  b − c = r(b − a)
  b − d = r(b − c)
  c − a = (1 − r)(b − a)

When these are combined with the above equation c − a = b − d we find (it's easy) that

  r² = 1 − r

which leads to

  r = (−1 ± √5)/2

We can discard the negative square root as r must be positive. Thus we have found

  r = (−1 + √5)/2 ≈ 0.618

This number is famous; it was used by the ancient Greeks in much of their architecture, and is often called the Golden Ratio.

All that remains now is to decide how we might choose the initial values of a, b and c. This is not too hard. We make a guess for a and b. Then we choose c so that b − c = r(b − a). If this triple a, b and c satisfies the minimisation condition then we start with these values; if not, we make another guess for a and b and start again.

Minimisation by Golden Ratio search

Suppose we seek x⋆ such that f(x⋆) is a minimum.

1. Choose any a, b, c with c = b − r(b − a) such that f(c) < f(a) and f(c) < f(b).
2. If (b − c) > (c − a)
   set d = b − r(b − c)
   If f(d) < f(c) then set a = c and c = d, else set b = d
3. Else
   set d = c − r(c − a)
   If f(d) < f(c) then set b = c and c = d, else set a = d
4. Take x⋆ ≈ c; repeat from step 2 as required.
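A minimal Matlab sketch of this search (our own illustration, using the test function and the starting bracket of the example below):

% Golden ratio search for a local minimum of f(x)  -- minimal sketch
f = @(x) 1 + x.^2 - cos(10*x)/4;
r = (sqrt(5) - 1)/2;                 % the golden ratio, about 0.618
a = -0.7;  b = -0.4;                 % initial guesses
c = b - r*(b - a);                   % f(c) < f(a) and f(c) < f(b) for this bracket
for loop = 1 : 40
    if (b - c) > (c - a)             % put the new point d in the larger sub-interval
        d = b - r*(b - c);
        if f(d) < f(c), a = c; c = d; else b = d; end
    else
        d = c - r*(c - a);
        if f(d) < f(c), b = c; c = d; else a = d; end
    end
end
disp([c f(c)]);                      % approximation to x* and f(x*)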

12.1.1 Example

To test our methods we will use f(x) = 1 + x² − cos(10x)/4. This has three local minima in the interval −1 < x < 1, at x⋆ ≈ ±0.6 and x⋆ ≈ 0.


[Figure: the test function y = 1 + x² − cos(10x)/4 plotted for −1 ≤ x ≤ 1, showing its three local minima.]

The results for this example can be found in the following table.


  Loop  a           c:d         c:d         b           f(a)       f(c):f(d)  f(c):f(d)  f(b)
     0  -7.000E-01  -5.854E-01              -4.000E-01  1.302E+00  1.115E+00             1.323E+00
  l  1  -7.000E-01  -5.854E-01  -5.146E-01  -4.000E-01  1.302E+00  1.115E+00  1.160E+00  1.323E+00
  r  2  -7.000E-01  -6.562E-01  -5.854E-01  -5.146E-01  1.302E+00  1.190E+00  1.115E+00  1.160E+00
  r  3  -6.562E-01  -6.292E-01  -5.854E-01  -5.146E-01  1.190E+00  1.146E+00  1.115E+00  1.160E+00
  l  4  -6.292E-01  -5.854E-01  -5.584E-01  -5.146E-01  1.146E+00  1.115E+00  1.120E+00  1.160E+00
  r  5  -6.292E-01  -6.125E-01  -5.854E-01  -5.584E-01  1.146E+00  1.128E+00  1.115E+00  1.120E+00
  r 10  -5.854E-01  -5.815E-01  -5.790E-01  -5.751E-01  1.115E+00  1.115E+00  1.115E+00  1.115E+00
  l 20  -5.801E-01  -5.800E-01  -5.800E-01  -5.800E-01  1.115E+00  1.115E+00  1.115E+00  1.115E+00
  l 30  -5.801E-01  -5.801E-01  -5.801E-01  -5.801E-01  1.115E+00  1.115E+00  1.115E+00  1.115E+00
  l 40  -5.801E-01  -5.801E-01  -5.801E-01  -5.801E-01  1.115E+00  1.115E+00  1.115E+00  1.115E+00

The initial values were taken as a = −0.7 and b = −0.4. This forced us to take c = b − r(b − a) ≈ −0.5854.

Note that the formatting in the table has to take account of the possibility that the test point d can be introduced either in the interval [a, c] or in the interval [c, b]. Thus the notation c : d indicates that there are two types of data in this column, c and d. A similar idea applies to the columns headed by f(c) : f(d).

The first column contains the letters 'l' and 'r'. These record which triple of points is taken for the next iteration. Suppose d was created in the interval [a, c]; then 'l' records that the left triple (a, d, c) was chosen, while 'r' indicates that the right triple (d, c, b) was chosen.

By careful inspection of the table you can follow the progress of the triple (a, c, b). In each line three of (a, b, c, d) are carried forward to the next line. The introduced number is always d and it is always introduced into the larger of the two intervals [a, c] and [c, b].

Note that the overall length of the bracketing interval is reduced by a factor of approximately 0.6 with every iteration. Thus to gain one decimal digit of accuracy we need to apply about five extra iterations (0.6⁴ ≈ 0.13 and 0.6⁵ ≈ 0.08). This is very slow. But the main thing is that we are guaranteed that the method will converge to a local minimum.


12.2 Steepest descent

This method uses the classic calculus formulation of a local extremum: df/dx = 0 at the extremum.

Given f(x) = x² + 1 − cos(10x)/4 we have to find the roots of

  0 = 2x + (5/2) sin(10x)

This we do using a Newton-Raphson method: for x such that 0 = g(x), compute x_{n+1} = x_n − g_n/g'_n. In our case g(x) = f'(x) and g'(x) = f''(x).
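A minimal Matlab sketch of this Newton-Raphson iteration (our own illustration, starting from the same initial guess x = −0.7 as the table below):

% Newton-Raphson on f'(x) = 0 for f(x) = 1 + x^2 - cos(10*x)/4  -- minimal sketch
g  = @(x) 2*x + (5/2)*sin(10*x);     % g(x)  = f'(x)
gp = @(x) 2 + 25*cos(10*x);          % g'(x) = f''(x)
x = -0.7;                            % initial guess
for loop = 1 : 5
    x = x - g(x)/gp(x);              % Newton-Raphson update
    fprintf('%3d  %14.6e  %14.6e\n', loop, x, g(x));
end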

  Loop  x             f(x)         f'(x)
     0  -7.00000E-01  1.30152E+00  -3.04247E+00
     1  -5.54061E-01  1.12280E+00   5.82342E-01
     2  -5.82582E-01  1.11510E+00  -6.11959E-02
     3  -5.80077E-01  1.11502E+00  -3.52201E-04
     4  -5.80062E-01  1.11502E+00  -1.23382E-08
     5  -5.80062E-01  1.11502E+00   6.66134E-16

This table displays the one significant advantage of this method – it converges very quickly! In fact, it can be shown that the sequence x_n converges quadratically to the true minimum. That is, provided x_n is close to x⋆, then

  |x_{n+1} − x⋆| = O( |x_n − x⋆|² )

A hint of this quadratic convergence can be seen in the last column, which shows that f'(x) converges quadratically to zero. The upshot of quadratic convergence is that each iteration will double the number of correct digits. Thus it is very common to see this method converge in about 5 iterations.

There are also a number of disadvantages with this method.

◮ The method makes no distinction between maxima and minima.
◮ The method makes no distinction between local and global minima.
◮ The method can fail when f''(x) = 0 at the minimum of f(x), for example when f(x) = x⁴.
◮ The method is applicable only to functions with smooth derivatives.


12.3 Genetic Algorithms

Once again we are faced with the problem of finding the minimum of some given function. Unlike the previous methods, in which a single approximation was successively improved towards a minimum, these methods work with a whole family of approximations. In each iteration we will manipulate the family in such a way as to focus the family towards the minimum.

The remarkable thing about the methods we are about to develop is that they are inspired by ideas drawn from evolutionary biology. Nature, in the Darwinian paradigm of the survival of the fittest, solves its own optimisation problem, namely how to produce life forms that are well adapted to their environment.

Here is a very rough description of the genetic theory of evolution. Each individual is comprised of a set of genetic material and this genetic material fully defines the individual. Some individuals are well adapted to their environment and these are called fit individuals. Other individuals are not well suited to their environment. These are called weak or less fit. The survival of the fittest means that it is much more likely that fit rather than weak individuals will pass on their genetic material to the next generation. In this way successive generations will become dominated by fit individuals.

We will apply these ideas to our problem of finding the minimum of a given function. Along the way we will need to translate the language of evolutionary genetics into familiar mathematical terms. At the end of our journey (it's not a long journey) we will have done something amazing (well, I think it is) – we will have solved a purely mathematical problem by drawing on strong analogies with processes that occur in evolutionary biology.

Let's get serious. We are looking for the value x⋆ at which the function f(x) is minimised. We will start with a randomly chosen collection of candidates for x⋆. Denote these by x_i, i = 1, 2, ..., N. This will be our first generation. Some of the x_i will be close to x⋆ while others will be far from x⋆. Our aim is to breed successive generations so that the spread of x-values shrinks to an arbitrarily small range centred on x⋆.

Let x_i be a typical individual. We define the fitness of this individual to be the value of the function, f(x_i). Since we wish to minimise the function it is natural to call an individual fit if its f-value (fitness) is less than that of other, weaker individuals (yes, this does sound like a tautology).

We need to decide how we are going to build one generation from another. There are many possibilities and we will be guided by the simplest biological analogy. Here is a rough outline of how we might create one new generation. First we select a pair of individuals. These will be our parents. We then create two children by exchanging genetic material between the parents. These children form two members of the next generation. This process is repeated until the new generation is complete (i.e. the same number of children as parents). At this point we delete the parents' generation and start afresh with the children's generation. After each generation we have N individuals.

There are two questions that we must ask.


◮ How do we choose the parents?
◮ How do the parents exchange genetic material?

12.3.1 Selection

In selecting a parent we want a scheme that favours fit individuals over weak individuals. Here is one simple scheme.

Compute the average fitness f̄ by

  f̄ = (1/N) Σ_{i=1}^{N} f(x_i)

Select a random individual, say x_j. If f(x_j) < f̄ then we accept this individual as a parent. If the condition is not met then step through successive individuals, j, j+1, j+2, ..., N, 1, 2, ..., until the condition is satisfied.

Note that the selection of successive parents is done without knowledge of the previous parents (i.e. selection with replacement). This means that in one generation a very fit individual may be selected many, many times (and may even be partnered with itself). This preferential selection is one of the key elements of genetic algorithms.

12.3.2 Breeding

We need to exchange genetic material between the parents. So obviously we need access to their genetic material. What might this be? Again we have many options, but whatever we do we need to represent the individuals by a string of letters (e.g. the base pairs in DNA). For our simple problem we can choose this to be the binary form of the number represented by x_i. For example, we might have x_23 = 111010101100101. This expresses x_23 as a 15-digit binary number. This is our genetic material.

Now suppose we have two parents,

  x_23 = 111010101100101
  x_47 = 100111011001101

And now the breeding begins (turn the lights out please). First we choose a random number between 1 and 15, say we get 6. Then we cleave mum and dad after the first 6 binary digits.

  x_23 = 111010 101100101
  x_47 = 100111 011001101

Then we swap over the leading 6 binary digits to form two children, y_01 and y_02,

  y_01 = 100111 101100101 = 100111101100101
  y_02 = 111010 011001101 = 111010011001101


This method of breeding is known as crossover. A variation on the method is to also allow for mutations. This is applied after the two children are formed. Each binary digit in each child is flipped (1's and 0's swapped) with a very small probability. This again is done by analogy with the genetic paradigm.
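Here is a minimal Matlab sketch of one crossover step, with an optional mutation applied to the first child (our own illustration; the parent strings and the mutation probability are just examples):

% Crossover and mutation on two binary-string parents  -- minimal sketch
p1 = '111010101100101';  p2 = '100111011001101';   % parents stored as character strings
n  = length(p1);
k  = randi(n - 1);                       % random cleave point
c1 = [p2(1:k) p1(k+1:end)];              % swap the leading k digits ...
c2 = [p1(1:k) p2(k+1:end)];              % ... to form the two children
pm = 0.01;                               % small mutation probability per digit
flip = rand(1,n) < pm;                   % digits to mutate (shown for child 1 only)
c1(flip) = char('0' + '1' - c1(flip));   % flip '0' <-> '1' on the chosen digits
disp(c1);  disp(c2);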

We now have all of the elements in place – it's time to see how this method works on a specific example.

12.3.3 Example

Here we will minimise the simple function f(x) = x² subject to the condition that 0 ≤ x ≤ 1. We know the answer must be x⋆ = 0 and we can use this as a measure of how well the method works.

We begin by choosing N random numbers in the interval 0 ≤ x ≤ 1. We record each x_i in its binary form; for example, with 5 digits we might have

  x_43 = 11010 = 1 × 2⁻¹ + 1 × 2⁻² + 0 × 2⁻³ + 1 × 2⁻⁴ + 0 × 2⁻⁵

Then we apply the above ideas. Here is the initial population (the right hand column is the decimal value of x_i).

  110111011100101000000001011101 : 0.86636
  011101110111011001011001100111 : 0.46665
  000000001001100100100001011001 : 0.00234
  100100010001011101001111111001 : 0.56676
  001110111011110000111011110111 : 0.23334
  000010001110110000001101001001 : 0.03485
  010111000110010010001000100110 : 0.36091
  111101001101010010011001011111 : 0.95637
  000110100110111110100010010110 : 0.10327
  111111010100001001011100000100 : 0.98929
  011110111100110010110111101000 : 0.48359
  000110111011000111110001100101 : 0.10818
  011011010101011110000010001001 : 0.42712
  011110110000010001011011001001 : 0.48054
  110000100110100101110111011011 : 0.75942
  100110111011001110001100100001 : 0.60821
  111000101110110000110101111110 : 0.88642
  110011110101011111111110001001 : 0.80994
  001000101110100001011110110100 : 0.13636
  010110001000111000011011100110 : 0.34592

And after 100 generations we get

  000010101110100001011001100000 : 0.04261


  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261
  000010101110100001011001100000 : 0.04261

As you can see, the initial population is randomly scattered in the range 0 < x < 1 but the final population (after 100 generations) has all members equal to the same value, x = 0.04261. The population now consists of a set of clones; no improvement will be possible with extra generations because there is no genetic diversity in the population. One way to fix this is to introduce a small degree of mutation.

13. Random numbers



13.1 Uniform random numbers

Computers are very predictable. They do exactly what we tell them to do. So how can they be used to compute random numbers? Quite simply – they cannot! The best we can expect from them is the appearance of randomness. Each time we run the program we will get the exact same sequence of (apparently) random numbers.

Suppose we have a very long list of integers (which we will take as a sample of random numbers). We could imagine some function f(x) that delivers these integers one at a time by way of a recursive formula such as

  I_{n+1} = f(I_n)

Given an initial value for I_0 we can use this formula to compute I_1, I_2, I_3, .... How many distinct numbers can we generate in this sequence? Clearly we would want this sequence to be as long as possible if we are to have the best possible sequence of random numbers (if ever we repeat a number then the sequence will repeat, and that will be end of game for our so-called random numbers). Suppose our computer can store integers in the range 0 to M. That is

  0 ≤ I_n ≤ M

Then the longest sequence of random numbers will contain M + 1 distinct entries. This is the best we can hope for. On a typical 32 bit computer

  M = 2³¹ − 1 = 2,147,483,647

That's a big number and should be adequate for most purposes!

How might we choose the magic function f(x)? The main criterion is that it produces very, very long sequences of numbers without repetition. Any f that does this is suitable. Here is one popular choice

  I_{n+1} = 16807 I_n mod M

with I_0 chosen to be any number (other than 0!). (Recall that a mod b is the remainder after dividing a by b.)
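A minimal Matlab sketch of this generator (our own illustration; in Matlab the product 16807·I_n is small enough to be held exactly in a double, so no special tricks are needed here):

% Multiplicative congruential generator I_{n+1} = 16807*I_n mod M  -- minimal sketch
M = 2^31 - 1;
I = 1;                        % any non-zero seed I_0
x = zeros(1, 10);
for n = 1 : 10
    I = mod(16807*I, M);      % 16807*I < 2^46, so it is represented exactly in a double
    x(n) = I/M;               % a "uniform" random number in (0,1)
end
disp(x);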

Exercise. If I_n happens to be very large (as it will be from time to time) then the product 16807 I_n might very well exceed M, and recall that M is the largest integer the computer can handle. How then does the computer compute I_{n+1} in this case?

How well does this work? That is, how random are these numbers? We can check the quality of the sequence by forming a simple probability distribution. Let

  x = I_n / M

then 0 ≤ x ≤ 1 and the successive values of x should appear to be uniformly (i.e. without favouring one value over another) and randomly distributed over the range 0 to 1. Here are two frequency histograms. In each case we took a fixed number of x's and then counted the number of times the x's fell inside 100 equally spaced sub-intervals.


x’s in each of the sub-intervals – that is, the his<strong>to</strong>gram should look flat. In the first<br />

his<strong>to</strong>gram, with only 10000 numbers the his<strong>to</strong>gram is far from flat. However, by the<br />

time we have used 1000000 numbers the profile is indeed very flat.<br />

y<br />

5 10 15 20<br />

N = 10 4<br />

y<br />

2000 4000 6000 8000 10000 12000<br />

0.0 0.2 0.4 0.6 0.8 1.0<br />

x<br />

N = 10 6<br />

0.0 0.2 0.4 0.6 0.8 1.0<br />

x<br />


13.2 Non-uniform random numbers

In the previous section we found that we could easily create a uniform random distribution by setting

  x = I_n / M

Now we ask a slightly more interesting question: How do we create a series of x's that follow a prescribed non-uniform behaviour? We still want the x's to be random (i.e. we have no way of predicting what the next number will be) but we want the numbers to favour certain x values more often than others. Let's begin by reminding ourselves of some basic probability theory.

If X is a random variable with probability density function ρ(x), and if X takes on values only in the range 0 to 1, then the probability that X is found in the range a to b is given by

  Pr(a < X < b) = ∫_a^b ρ(x) dx

So our question now is: How do we choose the successive x's randomly such that the histogram, for a very long series, has the same shape as ρ(x)? Here is one way to do so (this is the rejection method, the same accept/reject idea that underlies the Metropolis algorithm). First we build a rectangular box that just contains ρ(x). Next, we uniformly sprinkle points throughout that box. That is easy to do: just choose (x, y) pairs as follows

  x = I_n / M,    y = I_{n+1} / M

Now here comes the interesting bit. If we now throw away all the points that lie above y = ρ(x), then the remaining points will have x-values with a probability distribution governed by ρ(x).

Non-uniform random numbers

To generate a sequence of random numbers in the range 0 < x < 1 that follow the probability density function ρ(x):

◮ Compute x = I_n/M and y = I_{n+1}/M.
◮ If y > ρ(x) then reject (x, y) and try again.
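A minimal Matlab sketch of this rejection idea (our own illustration; ρ(x) = 2x is just an example density on (0,1), and we use Matlab's rand in place of the I_n/M values):

% Rejection sampling of x in (0,1) with density rho(x)  -- minimal sketch
rho  = @(x) 2*x;                  % an example density on (0,1)
rmax = 2;                         % height of the enclosing box (the maximum of rho)
N = 10000;  xs = zeros(1, N);  count = 0;
while count < N
    x = rand;  y = rmax*rand;     % a point sprinkled uniformly in the box
    if y <= rho(x)                % keep the point only if it lies below rho(x)
        count = count + 1;
        xs(count) = x;
    end
end
hist(xs, 50);                     % the histogram should follow the shape of rho(x)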

Exercise. Prove the above claim: that by rejecting the points with y > ρ(x), the resulting x's will have ρ(x) as their probability density function.


Laboratory class notes




Laboratory class 1

Introduction to Matlab

Getting started

Today is a day for playing with Matlab.

Logon to the PC and click on the Matlab icon on your desktop. You should see a screen like the following.

[Screenshot of the Matlab desktop, with labels pointing out your current workspace, your Matlab files, the window where you enter Matlab commands, and your command history.]

This contains two small windows and one large window. The larger window is where you will enter your Matlab commands. The other windows are simply for your information. They show the history of your previous commands and your Matlab files (you'll learn how to use and create Matlab files in the next laboratory class).



Setting your workspace

Matlab likes to store files (yours and Matlab's) in various directories on the computer. One thing you should do is tell Matlab where it should store your files – in your U: drive. Do this by setting your Current Directory, by clicking on the icon on the menu bar. Then navigate through to your U: drive and select okay. If you don't get into the habit of doing this then your files may be written to the local hard drive of the computer you are currently on (so when you move to another computer you will not be able to access that file).

You may need to do this every time you start up Matlab. Your tutor may know a way to force Matlab to remember this setting, but I do not know this (yet).

Setting your Matlab Path

This is something you should only need to do once. Matlab will look in various directories when searching for files (i.e. when you ask it to perform a command). You will (soon) be creating your own files and you will need to tell Matlab where to find them. This is what Matlab calls the path. You can set this by selecting the File menu and then Set Path.... You will see a new window and you should click on Add Folder. Then navigate to your U: drive. In this example I selected my own private directory; it's the top entry in this diagram (this is how your window will look after clicking on Add Folder and after selecting your U: drive).

You might like to create a new directory on your U: drive specifically for MTH3051, in which case you might want to repeat the above operation.


The final thing to do, to set your path, is to click on Save. That's it. You are now ready to play with Matlab.

Exercises

On the following pages you will find scanned copies of the first four lessons from Getting Started with MATLAB 7 by Rudra Pratap. Work your way through the lessons and the exercises. This should prepare you well for the fun times ahead (where we, that is you, will develop Matlab programs that actually do something useful, quelle surprise?).

[Scanned pages: the first four lessons of Getting Started with MATLAB 7 by Rudra Pratap, not reproduced here.]



Laboratory class 2

Truncation errors

1. Estimate the truncation error for each of the following approximations (for |x| ≪ 1)

  cos(x) ≈ 1
  sin(x) ≈ x − x³/3!
  exp(x) ≈ 1 + x + x²/2!

2. Use a series expansion in u to show that

  ∫_0^1 du/(1 + u²) = 1 − 1/3 + 1/5 − 1/7 + · · ·

Re-evaluate the integral using the substitution u = tan θ. What use can you make of this result?

Order estimates

3. If u(x) = O(x²) and v(x) = O(x⁴) for |x| ≪ 1, what can you say about the functions u(x) + v(x) and u(x)v(x) when |x| ≪ 1?

4. Suppose y(x) = 3 + O(2x) and g(x) = cos(x) + O(x³) for x



Programming

5. Pretend you are the computer and follow this algorithm.

  Set x to 0.5
  For i = 1 to 6 do
      If x > 3 then
          Replace x by x - 1
      Else
          Replace x by x + 1
  End

After you complete the instructions, what will be the value of x?

  (a) -1.5   (b) 3.5   (c) 4.5   (d) none of these

6. Which mathematical expression does the following pseudo-code represent?

  Read x
  Set sum = 1
  Set term = 1
  Set k = 1
  While k < 101 do
      Replace term by - term*x*x/( (2k)*(2k-1) )
      Replace sum by sum + term
      Replace k by k + 1
  End

  (a) Σ_{k=1}^{100} (−1)^(k+1) x^k / k!        (b) Σ_{k=0}^{100} x² / (2k)
  (c) Σ_{k=0}^{100} (−1)^k x^(2k) / (2k)!      (d) Σ_{k=1}^{100} (−1)^k x² / (2k)

7. Which of the following is the correct algorithm for approximating Σ_{n=1}^{∞} n⁻²?

  (a) Set n to 0
      Set sum to 0
      Repeat
          Replace n by n+1
          Set sum to 1/(n*n)
      Until sum < 0.000001

  (b) Set n to 0
      Set sum to 0
      Repeat
          Replace n by n+1
          Replace sum by sum+(1/(n*n))
      Until sum < 0.000001


   (c) Set n to 1
       Set sum to 1
       Repeat
           Replace n by n+1
           Replace sum by sum+(1/(n*n))
       Until 1/(n*n) < 0.000001

   (d) Set n to 1
       Set sum to 1
       Repeat
           Replace sum by sum+(1/(n*n))
           Replace n by n+1
       Until 1/(n*n) < 0.000001

8. Here is the code given in lectures to estimate π. Write a Matlab M-file that contains this code and use it to generate a series of approximations for π.

       n = 100;                 % set number of terms
       sum = 0;                 % set initial value for sum
       sign = 1;                % an integer +/- 1
       for k = 1 : n            % loop over k from 1 to n
           term = sign/(2*k-1); % compute the term
           sum = sum + term;    % update rolling sum
           sign = - sign;       % flip the sign
       end;
       disp(n);                 % print n
       disp(4*sum);             % print the approximation to π
       disp(4*sum-pi);          % print the error

9. Modify the above code to use the series π^2 = 12 ∑_{k=1}^{∞} (−1)^{k+1}/k^2. Verify that, for the same number of terms, this series gives much better answers than the original code above.
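
   One possible arrangement of the modified code (a minimal sketch only; the loop could equally well be written as a while loop or vectorised):

       n = 100;                  % number of terms
       sum = 0;                  % rolling sum of (-1)^(k+1)/k^2
       sign = 1;                 % alternating +/- 1
       for k = 1 : n
           term = sign/(k*k);    % the k-th term
           sum = sum + term;     % update rolling sum
           sign = - sign;        % flip the sign
       end;
       disp(n);                  % print n
       disp(sqrt(12*sum));       % print the approximation to π
       disp(sqrt(12*sum)-pi);    % print the error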



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 3

Finite precision arithmetic

1. We saw in lectures that doing arithmetic with a limited number of digits can cause numerical problems. Here you will explore directly some of those problems. Your first task will be to write a short Matlab program that takes one integer in and converts it to an integer with only 5 significant digits. Let's suppose you call this function Trunc. Here are some examples that your function should reproduce.

       123       ← Trunc(123)
       12345     ← Trunc(12345)
       12345e+03 ← Trunc(12345678)
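
   Here is one possible shape for such a function (a minimal sketch only; it chops rather than rounds, which is an assumption you may prefer to change):

       function y = Trunc(x)
       % Trunc : reduce the integer x to at most 5 significant digits
       if x == 0
           y = 0;                        % nothing to chop
           return
       end
       d = floor(log10(abs(x))) + 1;     % number of digits in x
       if d <= 5
           y = x;                        % already short enough
       else
           f = 10^(d-5);                 % scale factor
           y = floor(x/f)*f;             % chop the trailing digits, keep the magnitude
       end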

2. Modify the code so that you can specify exactly how many significant digits Trunc should return.

3. Write two Matlab programs, one to add a pair of integers, another to subtract a pair of integers. All of the integers (the two inputs and the one result) should be expressed as fixed precision integers. You may use the Trunc program from the previous questions.

   Test your programs on the following examples (which use 5-digit calculations). In each case compute the actual and relative errors in the final answer. Do this for 3, 5 and 7 significant digits.

       579       ← Plus(123,456)
       12345     ← Plus(12e+03,345)
       12345e+03 ← Plus(12345e+03,678)
       333       ← Minus(456,123)
       1         ← Minus(124,123)
       500       ← Minus(1235e+03,Plus(1234e+03,567))

   Are your results consistent with what we saw in lectures?

4. Here are two ways in which you could estimate e^{−5}.

       (a)  e^{−5} ≈ ∑_{k=0}^{9} (−1)^k 5^k / k!

       (b)  e^{−5} ≈ 1 / ( ∑_{k=0}^{9} 5^k / k! )

   Both of these series are based on the standard Taylor series for e^x. The correct value of e^{−5}, to three significant digits, is 6.74e-3. Which series, (a) or (b), will give the most accuracy when all calculations are done to 3 significant figures? Why?

5. The finite sum ∑_{k=1}^{10} 1/k^2 could be evaluated in many ways, for example as 1 + (1/2^2) + (1/3^2) + · · · + (1/10^2) or as (1/10^2) + (1/9^2) + (1/8^2) + · · · + 1. If these two sums are performed on a 3-digit computer, which sum do you think will be most accurate? Why?

6. Look carefully at the following mathematical expression.

       S = ∑_{i=1}^{n} ∑_{j=1}^{n} a_i b_j

   (a) How many multiplications are required to compute S using the above formula?

   (b) Are there other (better) ways to evaluate S? (Hint: you should only require one multiplication.)

Elementary probability distributions

7. We all know and love the Binomial and Poisson distributions. These are given by

       Poisson:   Pr(X = n) = e^{−λ} λ^n / n! ,                    n = 0, 1, 2, · · · ∞

       Binomial:  Pr(X = n) = (N choose n) p^n (1 − p)^{N−n} ,     n = 0, 1, 2, · · · N

   Your job is to write Matlab programs that can accurately evaluate these distributions, which for simplicity we will denote by Binomial(N, n, p) and Poisson(n, λ).

   Here are some test cases.

       7.29000e-02 ← Binomial(5,2,0.1)
       1.84865e-01 ← Binomial(100,2,0.01)
       3.60610e-02 ← Binomial(256,70,0.3)
       9.02235e-02 ← Poisson(4,2)
       1.86608e-03 ← Poisson(20,10)
       2.31105e-03 ← Poisson(45,30)

Bases other than 10

8. Suppose you have a number x and that it happens to have the value 1234 in base 10. This means that

       x = 1 × 10^3 + 2 × 10^2 + 3 × 10^1 + 4 × 10^0 = (1234)_10

   (the little 10 written as a subscript is just to remind us that in this instance we are using base 10).

   But this same x could also be written in base 3, that is

       x = 1 × 3^6 + 2 × 3^5 + 2 × 3^2 + 1 = (1200201)_3

   So we have

       x = (1234)_10 = (1200201)_3

   Your game is to write a Matlab program that allows you to convert any base 10 number into any other base (less than 10). In your quest you will find two standard Matlab commands, mod and floor, very useful. The function mod returns the remainder after division, e.g. mod(17,5) = 2. The function floor chops off any decimal part in a division, e.g. floor(10/4) = 2.
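
   As a starting point, here is one way the conversion loop might look (a minimal sketch; ToBase is simply a name I have made up, and it assumes x is a non-negative integer):

       function digits = ToBase(x, b)
       % ToBase : digits of the integer x written in base b (b less than 10)
       % digits(1) is the most significant digit
       digits = [];
       while x > 0
           digits = [mod(x,b), digits];   % peel off the least significant digit
           x = floor(x/b);                % remove that digit from x
       end

   For example ToBase(1234,3) should return the digits 1 2 0 0 2 0 1.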

9. Compute by hand 0.125 in base 2. Do the same for 0.3. Write a Matlab program that computes the first 6 binary digits of any number in the range (0.1)_10 to 1.



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 4

Fixed point iteration

1. Which of the following is a suitable fixed point method for solving

       0 = 2 + x − tan x

   (a) x = 2 + tan^{−1} x        (b) x = 2 + tan x
   (c) x = tan^{−1}(2 + x)       (d) none of these

2. Two iterations of the fixed point method for the equation x = 1/(3 + x^2), starting with x = 0, yields, as an approximation to the root

   (a) x = 3    (b) x = 1/3    (c) x = 28/9    (d) x = 9/28

3. Show that the Newton-Raphson method

       x_{n+1} = x_n − f(x_n)/f′(x_n)

   can also be viewed as a fixed-point iteration (i.e. x_{n+1} = g(x_n)). Using the convergence criteria for fixed-point iterations, what can you say about the convergence of the Newton-Raphson iterations?

Newton-Raphson

4. Use a Newton-Raphson iteration to find a root, accurate to 4 decimal places, of each of the following equations (for x in the stated interval).

   (a) x^3 + 10x − 37 = 0 for 2 ≤ x ≤ 3.
   (b) x^4 + x^3 − 22 = 0 for 1 ≤ x ≤ 2.
   (c) x − ln(x + 2) = 0 for 1 ≤ x ≤ 2.

5. Show, using a Newton-Raphson iteration, that the square root of any number y can be computed from the iteration

       x_{n+1} = (1/2) ( x_n + y/x_n )
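
   In Matlab this might look something like the following (a minimal sketch; the starting guess and the fixed number of iterations are arbitrary choices):

       function x = MySqrt(y)
       % MySqrt : square root of y via the iteration x_{n+1} = (x_n + y/x_n)/2
       x = y;                      % a crude starting guess
       for n = 1 : 50
           x = 0.5*(x + y/x);      % one Newton-Raphson step
       end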



6. Use the previous algorithm to estimate the square roots of 2, 3 and 10. What goes wrong when you try to find the square root of -1?

7. Imagine this scenario. You have a job, you are earning mucho $$$ and you are stashing it away in a bank with compound interest paid monthly. After n months, your initial saving P will be worth

       P_n = P (1 − i^n)/(1 − i) ,   n ≥ 1

   where i is the interest per payment period (one month). Suppose that in 20 years you wish to amass a fortune of $750,000 from monthly deposits of $1500. What interest rate per year would you need to secure your fortune?

Half-interval search

8. Plot each of the functions listed in question (4) and determine which are suitable for a half-interval search method, and in such cases apply that method to find the root (accurate to 4 decimal places).

9. Can the half-interval search method be used to find the root of a function such as f(x) = (x − 7)^2 g(x) with g(7) ≠ 0? What might you do in such cases (while still using a half-interval search)?

10. Given that f(x) = √x − cos(x) has one root in the interval 0 < x < 1, perform three rounds of the bisection method to estimate this root.

11. Suppose you are told that f(x) = (x + 2)(x + 1)x(x − 1)^3(x − 2). To which zero of f would the bisection method converge for the following intervals?

    (a) [−3, 2.5]      (b) [−2.5, 3]
    (c) [−1.75, 1.5]   (d) [−1.5, 1.75]


Round-off errors : Bessel functions

12. When studying certain physical systems (e.g. the vibrations on a drum) the following linear differential equation arises

        x^2 d^2y/dx^2 + x dy/dx + (x^2 − n^2) y = 0

    Solutions of this equation are known as Bessel functions of order n. As it's a second order equation we expect two (linearly independent) solutions. These are denoted by J_n(x) and Y_n(x). They differ most notably in their behaviour near x = 0, with J_n remaining finite while Y_n diverges to −∞ as x → 0. We will compute J_n(x) for various values of n at x = 1.

    The J_n(x) have the following properties

        lim_{n→∞} J_n(x) = 0
        |J_n(x)| ≤ 1 , for |x| < ∞
        J_0(0) = 1 , J_n(0) = 0 , n ≠ 0
        1 = J_0(x) + 2J_2(x) + 2J_4(x) + 2J_6(x) + · · ·
        J_{n+1}(x) + J_{n−1}(x) = (2n/x) J_n(x)

    (a) The last equation can be used to compute J_2 given J_1 and J_0. This then can be used to compute J_3 which in turn can be used to compute J_4. In this fashion we can generate successive J_n's from previous J_n's. Not surprisingly this is known as a recurrence relation. Given that J_0 = 0.76519768655796655145 and J_1 = 0.44005058574493351596 (both accurate to 20 decimal places), use the recurrence relation to compute (in strict order) J_2, J_3, J_4 · · · J_20. What do you observe? (A short sketch of this upward recurrence is given after part (c).)

    (b) The recurrence relation can also be used in the reverse direction. That is, J_20 and J_19 can be used to compute J_18 and so on down to J_0. Since we know that lim_{n→∞} J_n(x) = 0 we might make the reasonable estimate J_20(1) = 0. But what guess should we make for J_19? Let's just call it α, some unknown number. Now proceed to compute (in strict order) J_18, J_17, J_16 · · · J_0. Given that we know the correct value for J_0 you should then be able to determine the correct value for α. Follow this algorithm and compute the J_n for n = 0, 1, 2, · · · 20. How do your answers compare with those from part (a)? Which set of J_n's do you trust?

    (c) Can you modify the algorithm given in part (b) so that you do not need to know the exact value of J_0? (i.e. can you compute the J_n without being told in advance the value of J_0?)
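
    For part (a), the upward recurrence might be coded along these lines (a minimal sketch; note that Matlab arrays start at index 1, so J(n+1) stores J_n):

        J = zeros(1,21);                       % J(n+1) holds J_n(1)
        J(1) = 0.76519768655796655145;         % J_0(1), given
        J(2) = 0.44005058574493351596;         % J_1(1), given
        x = 1;
        for n = 1 : 19
            J(n+2) = (2*n/x)*J(n+1) - J(n);    % J_{n+1} = (2n/x) J_n - J_{n-1}
        end
        disp(J');                              % the computed J_0 ... J_20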


The Error function

13. In statistical analysis one often encounters the following integral

        erf(x) = (2/√π) ∫_0^x e^{−t^2} dt

    and it is used to define what we call the error function, namely erf(x). The disappointing thing is that this new function cannot be expressed in terms of what we call elementary functions (such as polynomials, trigonometric and exponential functions). If we need values for erf(x) then we must obtain those values from the above integral. This is our challenge – to construct a Matlab code that can compute erf(x).

    (a) Using the Taylor series around t = 0 for e^{−t^2}, show that

            erf(x) = (2/√π) ∑_{k=0}^{∞} (−1)^k x^{2k+1} / ( (2k + 1) k! )

    (b) Take it as fact that erf(x) can also be computed from

            erf(x) = (2/√π) e^{−x^2} ∑_{k=0}^{∞} 2^k x^{2k+1} / ( 1 · 3 · 5 · · · (2k + 1) )

        Verify that the two series have the same leading terms (e.g. the terms match for k = 0, 1, 2).

    (c) Use the series in part (a) to estimate erf(1) to six decimal places. (A small sketch of this step is given after part (e).)

    (d) How many terms did you use in part (c)? Use the same number of terms this time with the series from part (b). What answer did you get? How does it compare with the answer from part (a)?

    (e) Argue why you would expect the series in part (b) to be inferior to the series from part (a) for estimating erf(x).
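
    For part (c), the partial sums of the series in part (a) might be accumulated as follows (a minimal sketch; the stopping tolerance of 1.0e-8 is my own choice for "six decimal places"):

        x = 1;
        k = 0;
        term = x;                              % the k = 0 term of the series
        sum = 0;
        while abs(term) > 1.0e-8               % stop once the terms are tiny
            sum = sum + term;
            k = k + 1;
            term = (-1)^k * x^(2*k+1) / ((2*k+1)*factorial(k));
        end
        disp(2*sum/sqrt(pi));                  % the estimate of erf(1)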



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 5

Gaussian elimination

You are welcome to write your own Matlab programs but if you would prefer to look at my examples, you can find them on our MUSO website (they are under the link Programs). For this class you will only need the GEsimple.m file (this does Gaussian elimination without pivoting, while GEpivot.m includes pivoting).

1. Use your (Leo's?) Gaussian elimination code to solve the following systems. Verify that your solution is indeed the correct solution of the equations.

       (a)  [ 1 −1 ] [ u ]   [ 1 ]          (b)  [  1  2 −1 ] [ x ]   [ 1 ]
            [ 2  3 ] [ v ] = [ 0 ]               [  2  3  1 ] [ y ] = [ 0 ]
                                                 [ −1  2  5 ] [ z ]   [ 2 ]

       (c)  [ 3 1 2 ] [ x ]   [ 2 ]         (d)  [  1  2 −1  2 ] [ p ]   [ 1 ]
            [ 1 0 2 ] [ y ] = [ 1 ]              [  2  3  1  1 ] [ q ]   [ 0 ]
            [ 3 1 7 ] [ z ]   [ 9 ]              [ −1  2  5  3 ] [ r ] = [ 2 ]
                                                 [  1  2 −1  3 ] [ s ]   [ 3 ]

2. Here is a piece of Matlab code that will print out an n × m matrix M.

       for i = 1 : n                            % step through rows of M
           disp(sprintf(' %12.4f',M(i,:)))      % print this row of M
       end
       disp('-----------------------------')    % mark the end of the matrix

   Insert the above in your Gaussian elimination code so that you can see the various stages in which the matrix is reduced to upper triangular form. This is another way to verify that the code is doing what it should do.

3. Modify your Gaussian elimination with back substitution code to do full Gaussian elimination (i.e. reduce the matrix to diagonal form, not just upper triangular form). Verify that your new code works by applying it to the systems given in question (1).

4. Modify the code created in question (2) to compute the inverse of an n × n matrix. Recall that A^{−1} can be computed as follows. Start with A and then form a new n × 2n matrix [A|I] where I is the n × n identity matrix. Then apply full Gaussian elimination to reduce this to [I|B]. Then the matrix B will be the inverse of A.
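
   The bookkeeping around your elimination routine might look like this (a minimal sketch; FullGE is just a stand-in name for whatever you called your full-elimination code in the previous question):

       n = size(A,1);                  % A is the n x n matrix to invert
       M = [A, eye(n)];                % form the augmented matrix [A|I]
       M = FullGE(M);                  % reduce to [I|B] with your own routine
       Ainv = M(:, n+1:2*n);           % the right-hand block is the inverse B
       disp(Ainv*A);                   % should be (very close to) the identity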



5. The determinant of a matrix can be computed using steps very similar to those used in the Gaussian elimination code. Let's suppose we want to compute det(A). Suppose we feed A into a standard Gaussian elimination code and that we recover the upper triangular matrix A_u. Then it can be shown that

       det(A) = (−1)^m λ_1 λ_2 λ_3 · · · λ_q det(A_u)

   where m is the number of times we swapped rows and each 1/λ equals the scale factor used whenever we did a row operation of the form row_j ← row_j / λ. Modify your Gaussian elimination code so that it can correctly compute the determinant of an n × n matrix A. Note that det(A_u) is easy – it's just the product of its diagonal elements.

6. The inverse of a matrix can also be computed using Cramer's rule. This is how it works. Let B be the inverse of A. Then the entries in B are calculated as follows

       B(i, j) = det(R(j, i)) / det(A) ,   i, j = 1, 2, 3, · · · , n

   where R(i, j) is the matrix obtained from A by first replacing row i and column j with zeroes and then setting R(j, i) = 1 (here we take the first index as a row index and the second as a column index). Modify your code from the previous question to compute the inverse of A. You can verify your answers using the Matlab routine inv (i.e. inv(A)).

7. Estimate the operational cost (number of floating point operations, flops) to compute the inverse of an n × n matrix A using (i) standard Gaussian elimination and (ii) using Cramer's rule. Which method should you use for large n?



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 6

Matlab programs

For this class you will need the Matlab program jacobi.m which you can find on the subject web site (in MUSO). As always, you are welcome to write your own code from scratch. Later you will be asked to make various changes to implement Gauss-Seidel iterations, generalised fixed point and generalised Newton-Raphson iterations.

Jacobi iteration

1. Two iterations of the Jacobi method for solving

       3 = 7x − 2y
       1 = x + 4y

   starting from x_0 = 0 and y_0 = 0 yields

   (a) x_2 = 7/3 and y_2 = 4        (b) x_2 = 1/2 and y_2 = 1/7
   (c) x_2 = 3/7 and y_2 = 1/4      (d) x_2 = −1/2 and y_2 = 1/7

Gauss-Seidel iteration

2. Which of the following is a suitable Gauss-Seidel scheme for solving

       3 = 7x + 9y
       5 = 4x + 2y

   (a) x_{n+1} = (3 − 9y_n)/7            (b) x_{n+1} = (3 − 9y_n)/7
       y_{n+1} = (5 − 4x_{n+1})/2            y_{n+1} = (5 − 4x_n)/2

   (c) x_{n+1} = (5 − 2y_n)/4            (d) None of these
       y_{n+1} = (3 − 7x_{n+1})/9



3. Write down a convergent Gauss-Seidel scheme for the following system of equations

       5x + y + 2z = 5
       x + y + 3z = 6
       −x + 4y − z = −9

   Perform three iterations, by hand, of this scheme starting with the initial guess x = 0, y = 0 and z = 0.

4. Modify your Jacobi code to perform Gauss-Seidel iterations. Test your new code by applying it to the system of equations from the previous question.
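
   For reference, one convergent rearrangement of the question-3 system, written as a Gauss-Seidel sweep, might look like this (a minimal sketch; the ordering of the equations and the 20 sweeps are my own choices):

       x = 0; y = 0; z = 0;                % initial guess
       for iter = 1 : 20
           x = (5 - y - 2*z)/5;            % from 5x + y + 2z = 5
           y = (-9 + x + z)/4;             % from -x + 4y - z = -9, using the new x
           z = (6 - x - y)/3;              % from x + y + 3z = 6, using the new x and y
           disp([iter x y z]);
       end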

5. Modify your Jacobi and Gauss-Seidel codes so that they accept the augmented matrix of the linear system of equations and the number of equations as input arguments. That is, modify your code so that the following system

       7x − 2y = 3
       x + 4y = 1

   could be solved by typing

       Matlab>> M = [ 7 -2 3; 1 4 1 ];
       Matlab>> x_start = [ 0 0 ];
       Matlab>> x_final = jacobi(M,x_start,2,0.001,20);

   In this way you will not need to continually modify the function onestep each time you encounter a new system of equations.

6. In the lecture notes you will find the theorem that if a matrix is diagonally dominant then both the Jacobi and Gauss-Seidel iterations will converge. Will the iterations converge when the matrix is not diagonally dominant? To explore this question, use Jacobi iterations on the following system

       x + z = 2
       −x + y = 0
       x + 2y − 3z = 0

   You should find that the iterations converge. Repeat the calculations using a Gauss-Seidel iteration. What do you observe?


Generalised Fixed-Point iteration

7. Create a Matlab program for the generalised fixed-point algorithm by making suitable modifications to the function onestep in the Matlab program jacobi.m. But don't overwrite your old jacobi.m program; save your new program in a new file, say gen_fp.m. And remember that whatever name you choose for your file you must also choose the same name for the function. Test your program by searching for the roots of the system

       x_{n+1} = ( 7x_n^3 − y_n − 1 ) / 10
       y_{n+1} = ( −8y_n^3 + x_n − 1 ) / 11

   starting with (x, y)_0 = (0, 0). Compare your results against those given in lectures.
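
   Stripped of the function-file structure, the iteration itself is just (a minimal sketch; 30 iterations is an arbitrary choice):

       x = 0; y = 0;                           % starting guess (x,y)_0 = (0,0)
       for n = 1 : 30
           xnew = (7*x^3 - y - 1)/10;          % x_{n+1} from the current (x_n, y_n)
           ynew = (-8*y^3 + x - 1)/11;         % y_{n+1} from the current (x_n, y_n)
           x = xnew;  y = ynew;                % update both together (fixed point, not Gauss-Seidel)
           disp([n x y]);
       end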

8. See if you can find any other roots of the system of equations

       y = 7x^3 − 10x − 1
       x = −8y^3 + 11y + 1

   by exploring variations of the iterations used in the previous question.

9. Look back at the pair of equations in Question 1. Though these are linear equations there is nothing stopping you from solving them by a generalised fixed point algorithm. Write out the fixed point equations for this system. How do they compare with the Jacobi iterations?

Generalised Newton-Raphson iteration

10. Modify your generalised fixed-point code to implement the generalised Newton-Raphson method. Test your code by solving the above system of equations. You should be able to find any and all of the nine roots (some of which were given in lectures).

11. (a) Verify that (x, y) = (1, 1) and (x, y) = (−1, −1) are solutions of

            0 = x^2 + y^2 − 2
            0 = xy − 1

    (b) Write out the Newton-Raphson iteration equations in full for this system.

    (c) What problems do you think might arise as the iterations converge to (x, y) = (1, 1)?

12. The following pair of equations

        0 = x^2 − y^2 + 2y
        0 = 2x + y^2 − 6

    are known to have four solutions near the following points (−5, −4), (2, −1), (0.5, 2) and (−2, 3).

    (a) Re-arrange the system of equations into a fixed-point iteration form (note that there are many ways to do this).

    (b) Using the above values as initial guesses, determine which of the roots can be found with your choice of fixed point iteration.

    (c) Repeat part (b), this time using the generalised Newton-Raphson method.



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 7

Matlab programs

You should attempt the following questions by hand. Programs are good, but they should only be used once you know what they are doing (it's the same with calculators – you need to know basic arithmetic before you use them! Well, that's my 2 cents worth and I enjoy the view from this soapbox).

Lagrangian interpolation

1. Given the following data

       x      0.0   1.0   2.0    4.0
       f(x)   0.0   1.0   16.0   256.0

   use a cubic Lagrange polynomial to estimate f(1.5). You can probably guess that the underlying function is f(x) = x^4. Use this to compute the error in your estimate for f(1.5).
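
   If you later want to check your hand calculation in Matlab, the cubic Lagrange estimate can be built directly from its basis polynomials (a minimal sketch):

       xd = [0.0 1.0 2.0 4.0];             % the data points
       fd = [0.0 1.0 16.0 256.0];
       xs = 1.5;                           % where we want the estimate
       p = 0;
       for i = 1 : 4
           L = 1;                          % the i-th Lagrange basis polynomial at xs
           for j = 1 : 4
               if j ~= i
                   L = L*(xs - xd(j))/(xd(i) - xd(j));
               end
           end
           p = p + fd(i)*L;                % add its contribution
       end
       disp(p);                            % the cubic estimate of f(1.5)
       disp(p - 1.5^4);                    % the error, since f(x) = x^4 here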

2. Using the same table, construct two quadratic polynomials to interpolate f at x = 1.5. Compare this pair of estimates with the exact answer and the cubic estimate (using the cubic you built in the previous question). How well do these errors compare? If you did not know that the underlying function was f(x) = x^4, could you still estimate the error in the cubic interpolation?

3. Below are a set of functions f(x). For each function use the node points x_0 = 0, x_1 = 0.6 and x_2 = 0.9 to estimate f(0.45) and the error.

   (a) f(x) = cos(x)           (b) f(x) = √(1 + x)
   (c) f(x) = log_e(1 + x)     (d) f(x) = tan(x)

Newton interpolation

4. Using the same data as above, construct the triangular table of numbers (known as divided differences) as given in the lecture notes. Hence write down the cubic interpolating polynomial. Verify that your polynomial does pass through the given data points. Estimate f at x = 1.5. How does this compare with the estimate from Question 1?

5. Without doing any more calculations, write down, from this table, two quadratics that could be used to interpolate f(x) at x = 1.5. Then repeat as per Question 2. How do these estimates compare with those from Question 2? Are you surprised? (You shouldn't be!)

6. Consider a typical cubic polynomial in the form

       f(x) = a + bx + cx(x − 1) + dx(x − 1)(x − 2)

   Use this to estimate the derivatives of f(x) at x = 0. Could you then estimate d given just the derivatives? Could this also be used to estimate the other coefficients a, b and c? If so, how?

7. (a) Construct a divided difference table for the following data

           x      -2.0   -1.0   0.0   1.0   2.0
           f(x)   -1.0    3.0   1.0  -1.0   3.0

   (b) From your table construct the following pair of polynomials.

           P(x) = 3 − 2(x + 1) + 0(x + 1)(x) + (x + 1)(x)(x − 1)
           Q(x) = −1 + 4(x + 2) − 3(x + 2)(x + 1) + (x + 2)(x + 1)(x)

   (c) Show that both P(x) and Q(x) interpolate the above data.

   (d) In lectures it was claimed that the interpolating polynomial is unique. How can this be so? Surely P and Q are different (or are they?).

8. You are told that the following should be a divided difference table. Yet it contains some missing entries. Fill in the missing entries.

       x_i    f_i = d_i0    d_i1    d_i2
       0      ??
                            ??
       0.4    ??                    50/7
                            10
       0.7    6

Cubic splines

9. Consider the use of cubic splines to interpolate a set of data. Suppose at some stage in the calculation we arrive at the following spline functions for two consecutive intervals

       f̃_0(x) = x^3 + ax^2 + bx + c ,   −1 ≤ x ≤ 1
       f̃_1(x) = 2x^3 + x^2 − x + 4 ,    1 ≤ x ≤ 2

   (a) State the conditions that should be imposed on the two functions.

   (b) Hence compute a, b and c.



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 8

Finite Differences

1. Use a standard Taylor series expansion

       f(x + h) = f(x) + (df/dx) h + (d^2f/dx^2) h^2/2! + (d^3f/dx^3) h^3/3! + · · ·

   to verify the formulae given in lectures

       df(x)/dx     = ( f(x + h) − f(x − h) ) / (2h) + O(h^2)

       d^2f(x)/dx^2 = ( f(x + h) − 2f(x) + f(x − h) ) / h^2 + O(h^2)

2. (a) Write out the Taylor series for f(x ± nh) for n = 1, 2.

   (b) Hence show that the first derivative, at x = x_0, may be approximated by

           f′(x_0) = (1/(2h)) ( −3f(x_0) + 4f(x_0 + h) − f(x_0 + 2h) ) + O(h^2)

   (c) For those who love long tedious calculations, go on and show that

           f′(x_0) = (1/(12h)) ( −25f(x_0) + 48f(x_0 + h) − 36f(x_0 + 2h) + 16f(x_0 + 3h) − 3f(x_0 + 4h) ) + O(h^2)

3. Using the same methods as in the previous question, show also that

       f′(x_0)  ≈ (1/(5h)) ( 2f(x_2) + f(x_1) − 3f(x_0) )

       f′′(x_0) ≈ (1/(2h^2)) ( f(x_2) − f(x_1) − f(x_0) + f(x_{−1}) )

4. Consider three points (x_{−1}, f(x_{−1})), (x_0, f(x_0)) and (x_1, f(x_1)). Construct a quadratic through these three points and then show that

       f′′(x_0) ≈ (1/h^2) ( f(x_1) − 2f(x_0) + f(x_{−1}) )



5. Use Matlab to compute 2nd order finite difference approximations to d^2f(x)/dx^2 for f(x) = sin(x) at x = π/2. What is the exact value? How do your numerical estimates compare with the exact values? Try using smaller and smaller values of the step length. What do you observe? Can you explain this behaviour?

   You might like to use the Matlab program NumDeriv.m which is available on the subject web page. (But do note that that program is for first derivatives.)
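
   A direct calculation, without NumDeriv.m, could be as simple as this (a minimal sketch; the range of step lengths is my own choice):

       x = pi/2;
       exact = -sin(x);                                 % exact second derivative, i.e. -1
       for n = 1 : 10
           h = 10^(-n);
           d2 = (sin(x+h) - 2*sin(x) + sin(x-h))/h^2;   % central 2nd-difference estimate
           disp([h, d2, d2 - exact]);                   % step, estimate, error
       end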

6. Here is the school-yard definition (yes, it's that well known) of the derivative of a function f(x)

       df/dx = lim_{h→0} ( f(x + h) − f(x) ) / h

   Choose your favourite function f(x), any non-zero number x, and compute the following approximation for the derivative

       (df/dx)_n = ( f(x + 10^{−n}) − f(x) ) / 10^{−n}

   for n = 1, 2, · · · , 20. What do you notice? Can you explain this behaviour?
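
   In Matlab this experiment might look like the following (a minimal sketch; exp and x = 1 are simply my example choices):

       f = @(x) exp(x);                    % a choice of "favourite" function
       x = 1.0;                            % any non-zero point
       for n = 1 : 20
           h = 10^(-n);
           d = (f(x+h) - f(x))/h;          % the one-sided difference quotient
           disp([n, d, d - exp(x)]);       % n, estimate, error (exact derivative is exp(x))
       end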

7. We saw in lectures that the first derivative of f(x) at x = x_i can be approximated by

       f′(x_i) = (1/(2h)) ( f(x_{i+1}) − f(x_{i−1}) )

   Given that f(x) is the derivative of some other function, say g(x), that is f(x) = g′(x), use the above approximation to form an approximation for g′′(x_i). How does this compare with that approximation given in lectures?

Numerical Errors

8. Download the Matlab program MatlabEps.m. This is a simple program that you can use to determine the number of decimal digits that Matlab uses. Study the program, run it and then determine how many decimal digits are used by Matlab.

9. Verify the calculations given in lectures that showed that the optimal choice for h in

       df/dx ≈ ( f(x + h) − f(x) ) / h

   was h = O(10^{−N/2}) where N is the number of decimal digits used by the computer.

10. Using the same method as used in the previous question, determine the optimal choice of h for the approximation

        d^2f/dx^2 ≈ ( f(x + h) − 2f(x) + f(x − h) ) / h^2

    How many decimal digits of accuracy can you expect (for the optimal choice of h)?

11. Suppose someone decided to build a computer in which all real numbers were approximated by fractions such as a/b where a and b are integers. Suppose this computer can store N decimal digits for each of a and b. Formulate a model of the round-off error for this computer. Contrast this against a standard computer in which each real number is stored in 2N decimal digits. In answering this question you can assume that all the real numbers lie between 0 and 1.

Improved Euler method

12. Copy the sample program ImpEuler.m from the subject home page. Use this to explore some of the claims made in lectures (a minimal improved-Euler step is also sketched after this list, if you want something to compare against).

    (a) Plot solutions for the ODE dy/dx = ky for various choices of k and initial conditions (x, y)_0.

    (b) Is the system stable? Try different values of the step length.

    (c) Plot the solutions for dy/dx = y + 2e^{−x} for various choices of initial conditions. What do you observe? (Try the case y(0) = −1).

    (d) Modify your code to return the error as a function of x.

    (e) Verify that the global discretization error is O(h^2).

    (f) Use this fact to derive a higher order scheme (hint: Richardson extrapolation).

    (g) Modify your code to implement your scheme and verify that its global discretization error is what you expect.
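
    For comparison with ImpEuler.m, one improved-Euler (Heun) step applied to dy/dx = ky looks like this (a minimal sketch; k, the initial condition and the step length are example choices only):

        k = -1.0;
        f = @(x,y) k*y;                     % right-hand side of the ODE
        x = 0;  y = 1;                      % example initial condition
        h = 0.1;                            % step length
        for i = 1 : 50
            s1 = f(x, y);                   % slope at the start of the step
            s2 = f(x + h, y + h*s1);        % slope at the Euler-predicted end point
            y  = y + 0.5*h*(s1 + s2);       % improved Euler update
            x  = x + h;
        end
        disp([x, y, exp(k*x)]);             % numerical solution vs the exact solution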



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 9

Richardson extrapolation

1. In lectures we approximated 2π by the total sum of the chord lengths of a unit circle subdivided into many equal parts. A similar calculation can be made using the areas of the triangles formed by the circle's centre and each chord. Do the calculations for the first few subdivisions, then apply two levels of Richardson extrapolation. You will need to develop a formula for the error terms in the approximation.

2. The finite difference approximations

       df(x)/dx     ≈ ( f(x + h) − f(x − h) ) / (2h)

       d^2f(x)/dx^2 ≈ ( f(x + h) − 2f(x) + f(x − h) ) / h^2

   are known to have a leading truncation error of O(h^2). Use this information to obtain higher order approximations.

3. Suppose that you have established the following series expansion

       f̃(h) = f + Ah^2 + Bh^4 + Ch^6 + · · · + Dh^{2n} + · · ·

   where f̃(h) is an approximation to some exact quantity f and h is some freely chosen (small) parameter. Let F_n be the result of n applications of the Richardson extrapolation method. Derive a general formula that links F_n to F_{n−1}.

4. Suppose that N(h) is an approximation to some quantity M and that for every h > 0 we have

       M = N(h) + ah + bh^2 + ch^3 + · · ·

   where a, b, c, · · · are numbers that do not depend on h. Clearly as h → 0 we have N(h) → M. Thus the N(h) for various h can be used as approximations to M. This much is standard Richardson extrapolation. Now use the values N(h), N(h/3) and N(h/9) to produce an O(h^3) approximation for M.

5. Here we will start with the equation

       e = lim_{h→0} ( (2 + h)/(2 − h) )^{1/h}

   and use this as a basis to estimate e. The adventurous might like to prove this equation.

   (a) Define

           N(h) = ( (2 + h)/(2 − h) )^{1/h}

       and use this to compute N(h) for h = 0.04, 0.02 and 0.01.

   (b) Assume that e = N(h) + ah + bh^2 + ch^3 + · · · for some a, b, c, · · ·. Use standard Richardson extrapolation to compute the best estimate for e (a small sketch of this step is given after this question).

   (c) Show that N(−h) = N(h). What does this tell you about a, b, c, · · ·?

   (d) Use part (c) to show that e = N(h) + bh^2 + dh^4 + fh^6 + · · · and hence rework your Richardson extrapolation. What best estimate do you now get for e?

   (e) Which answer, part (b) or part (d), is the most accurate? Why?
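
   For parts (a) and (b), the first two levels of extrapolation might be computed as follows (a minimal sketch, assuming the error series starts at the ah term as stated in part (b)):

       N  = @(h) ((2+h)./(2-h)).^(1./h);     % N(h) from part (a)
       h  = [0.04 0.02 0.01];
       R1 = N(h);                            % N(0.04), N(0.02), N(0.01)
       R2 = 2*R1(2:3) - R1(1:2);             % remove the a*h term: 2N(h/2) - N(h)
       R3 = (4*R2(2) - R2(1))/3;             % remove the b*h^2 term
       disp([R1, R2, R3] - exp(1));          % the error at each stage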

Numerical integration

6. Download the Matlab programs Romberg.m and Trapezoidal.m. Both of the programs produce numerical estimates for I = ∫_0^1 4/(1 + x^2) dx. What is the exact value for I? Study the Romberg.m code. As you will see it computes successive columns in the Romberg table. Modify the code so that it computes the table row by row. Which version, row by row or column by column, would be better suited to an automatic integration package (i.e. given an integral and a desired accuracy the package returns an estimate of the integral)?



SCHOOL OF MATHEMATICAL SCIENCES

MTH3051
Introduction to Computational Mathematics

Laboratory class 10

Random numbers

1. In Matlab you can create uniform random numbers using the Matlab command rand. Here is an example,

       z = -1 + 2*rand(2000,1);   % create 2000 uniform random nums from -1 to 1
       hist(z,100);               % draw a histogram with 100 bins

   In this example the Matlab command rand(2000,1) creates 2000 uniform random numbers in the range 0 to 1. So to create N uniformly distributed random numbers in the interval [a, b] you can use a+(b-a)*rand(N,1).

   Use the above code to display various histograms for 10, 100, 1000 etc. random numbers. Do these histograms look reasonable to you?

2. Here is another way to create random numbers in the interval 0 < x < 1. Define x_i by

       x_i = frac( (π + x_{i−1})^5 ) ,   i = 1, 2, 3, · · ·

   where the function frac(x) returns just the decimal part (e.g. 0.245 = frac(73.245)).

   (a) Modify the Matlab code from the previous question to compute this sequence. In Matlab you can compute the fractional part of x by using x − fix(x).

   (b) Does this sequence look random to you?

3. Here is a short Matlab program that creates a uniform distribution of points inside a square box (in the region −1 < x < 1 and −1 < y < 1).

       z = -1 + 2*rand(2000,1);   % create 2000 uniform random nums from -1 to 1
       x = z(1:2:1999);           % x = odd entries in z
       y = z(2:2:2000);           % y = even entries in z
       plot(x,y,'.');             % plot the (x,y) points

   Modify the code so as to produce uniform random points inside the unit circle x^2 + y^2 = 1. Re-draw the plot and confirm that the pattern of points still looks uniform (to your eye).
Re-draw the plot and confirm that the pattern of points still looks uniform (<strong>to</strong> your eye).



Golden Search optimisation

4. Download the Matlab program MinGolden.m. This is a simple Matlab program that applies the Golden Search algorithm to minimise a given function. Run the program (answer = MinGolden(-0.3,0.1,0.3)). At each iteration you should see the three active points plotted in red on the curve. The blue square is the new fourth point. To continue through the iterations press the return key. You can now run your own experiments. Try different starting values. What happens if you break the constraint on the three points (that there is a true minimum in the interval)? Try changing the logic used to create the fourth point (e.g. try using the mid-point). Try other functions to minimise (e.g. non-continuous functions such as tan^{−1}).

5. Suppose the three points are a, b and c with a < b < c, and suppose further that the true minimum x of f(x) lies somewhere in the interval a < x < c. Thus the error in setting x = b is no greater than |c − a|. This is called an error bound for the minimum. Find a similar error bound after n iterations of the Golden Search algorithm.

Genetic Algorithms

On the MUSO webpage you will find a set of Matlab functions that implements a simple Genetic Algorithm. They are all quite short and should be fairly easy to read. As an example type GAmain(30,20) in Matlab. This will use a population of 30 individuals and run over 20 generations.

6. Run the program a few times. What do you observe? Does the population always find the true minimum? Try using different values for the population size and the number of generations. What do you observe? Is it better to start with a large population and run for a few generations or a small population run over many generations?

7. Modify the main program GAmain.m so that it also plots a graph of the average fitness as a function of the generation. What type of curve do you observe? What does this tell you about the rate of convergence as a function of the number of generations? Can you (roughly) explain this behaviour?

8. In the standard Genetic Algorithm the next generation is built by adding successive pairs of children from each pair of parents. However other strategies are possible. For example, we could say that of the four individuals consisting of the two parents and the two children we should select the two fittest individuals to be passed into the next generation. Modify your main program to implement this strategy. How well does it work?

9. You may have observed that sometimes the population converges to a point which is not the true minimum. This occurs when all of the genes are identical and thus subsequent generations will be identical to previous generations. Somehow we need to kick the population so that the individuals are not all clones of themselves. This same problem is overcome in biological systems by the process of mutation. As each gene is copied from the parent to the child, small errors are introduced. Thus if both parents were clones of each other then the children will not be clones of their parents. This introduces a small element of random variation in the population. This process is known as mutation.

   We can adapt this idea to our genetic algorithm as follows. After each child is created we scan the gene for that child. At each position in the gene we look at the binary bit. If it's a 1 we flip it to a 0 with a very small probability. If it's a 0 we likewise flip it to a 1 with the same small probability. We do this across the whole length of the gene for each child. Your task is to implement this scheme. You should choose the mutation probability to be small, about 0.05. You will need to use the Matlab random number generator rand. The statement if rand < 0.05 will be true when the random number is less than 0.05.

   Modify the Matlab programs to implement mutation. What do you observe? Does the evolution stall? What price do we pay in making this change (i.e. how does this change affect the accuracy in our final estimate for the minimum)?
effect the accuracy in our final estimate for the minimum)?<br />

