Lecture Notes
Course ÅMA 190: Numerical Mathematics, First Course
Autumn 2009

Sven-Åke Gustafson
University of Stavanger
4036 Stavanger
Norway

December 16, 2008


11.5.2 The trapezoidal rule for a finite interval, general functions and equidistant tables
11.5.3 The trapezoidal rule for periodical functions, integral over one period
11.5.4 The trapezoidal rule for the real line

12 Numerical treatment of ordinary differential equations
12.1 Introduction: Vector- and matrix-valued functions
12.2 Initial-value problems
12.3 Numerical schemes based on discretisation for initial-value problems
12.3.1 Euler's method
12.3.2 The trapezoidal method
12.4 Examples in initial-value problems
12.4.1 Euler's method
12.4.2 The trapezoidal method

13 A collection of useful formulas
13.1 Some formulas from the theory of differentiation and integration
13.1.1 Summation and integration
13.1.2 Factorials and the Γ-function
13.1.3 Stirling's formula
13.1.4 Binomial coefficients
13.1.5 Higher derivatives of a product
13.1.6 Change of variable in an integral
13.1.7 Integration by parts
13.1.8 Differentiation with respect to a parameter
13.1.9 Mean-value theorems, valid under appropriate conditions
13.1.10 Taylor's formula and power expansions
13.2 Vector spaces
13.3 Matrix algebra
13.4 Iterative methods for linear systems of equations
13.4.1 General iteration
13.4.2 Fix-point iteration
13.4.3 Jacobi iteration on Ax = b
13.4.4 Gauss-Seidel iteration on Ax = b
13.5 Least squares solution of linear systems of equations
13.6 Interpolation and approximation with polynomials
13.6.1 Remainders and error bounds
13.6.2 Lagrange's formula for the interpolating polynomial Q
13.6.3 Divided differences
13.6.4 Newton's interpolation formula with divided differences
13.7 Propagation of errors
13.8 Solution of a single nonlinear equation
13.8.1 Existence of a root in an interval [a, b]
13.8.2 Existence and uniqueness of a root in an interval [a, b]
13.8.3 Bisection method for f(x) = 0
13.8.4 Inverse linear interpolation for f(x) = 0
13.8.5 Newton-Raphson's method for f(x) = 0
13.8.6 Fix-point iteration for x = g(x)
13.9 Systems of nonlinear equations
13.9.1 Newton-Raphson's method for F(x) = 0, x ∈ R^n, F(x) ∈ R^n
13.9.2 Fix-point iteration for x = G(x), x ∈ R^n, G(x) ∈ R^n
13.10 Numerical integration with the trapezoidal rule
13.11 Differential and difference equations with constant coefficients
13.12 Initial-value problems for a single ordinary differential equation
13.12.1 Picard iteration
13.12.2 Euler's method
13.12.3 The trapezoidal method
13.13 Initial-value problems for systems of ordinary differential equations

14 Laboratory exercises and case studies
14.1 Introduction
14.2 Problems
14.3 Three methods for solving a linear system of equations with a tridiagonal matrix
14.3.1 Similar problem
14.3.2 Task
14.4 A least squares problem
14.4.1 Similar problem
14.4.2 Task
14.5 Three methods for solving a nonlinear equation in one variable
14.5.1 Similar problem
14.5.2 Task
14.6 Analysing a table over the solution to an initial-value problem
14.6.1 Similar problem
14.6.2 Task
14.7 Case studies
14.8 Evaluation of equivalent sums
14.8.1 Problem
14.8.2 Computational treatment
14.8.3 Conclusion
14.9 The stabilising effect of pivoting in Gauss elimination
14.9.1 Problem
14.9.2 Computational results
14.9.3 Conclusions
14.10 Least squares fit
14.10.1 Introduction
14.10.2 Fitting a straight line to a set of observations
14.10.3 Derivation of the normal equations
14.11 Interpolation with polynomials
14.11.1 Lagrange's formula
14.11.2 Estimate of the effect of round-off in given data on the calculated value of P(x)
14.11.3 Estimation of truncation error
14.11.4 Newton's interpolation formula
14.12 Numerical integration problems using the trapezoidal method
14.12.1 The trapezoidal rule on a periodic function over one period
14.12.2 The trapezoidal rule over the real line
14.12.3 The trapezoidal rule over the real line
14.12.4 Integral over a finite interval transformed to an integral over the real line
14.12.5 Integrand with integrable singularity


Abstract

Numerical analysts study the algorithms of continuous mathematics, e.g. the solution of linear and nonlinear systems of equations, the numerical evaluation of integrals and the treatment of differential equations. Hence integer programming does not belong to numerical analysis. The goal of the numerical treatment of a mathematical problem is to determine a solution and to quantify the uncertainty of the calculated results. A major topic of numerical mathematics is the study of the issues arising from the use of computers. The result of this work is often represented in the form of packages of computer programs. Computers are used not only because the number of arithmetic operations is large, but also because the computational schemes become large and complex. The user of such computer packages does not know all the details of the calculations to be performed, but (s)he needs to know the main principles. Research and development in applied mathematics can often be described as a three-phase process:

• Analysis of the problem under study. The result of this investigation may be represented in the form of theorems giving the mathematical properties of the solution sought.

• Treatment of simple model problems, illustrating the theoretical results obtained.

• Construction and documentation of computer programs solving the tasks above.

These lecture notes define the contents of Course ÅMA190, which gives 5 credits in the European Credit Transfer System, ECTS. The author has taught this and similar courses in Stavanger most academic years in the period 1986–2007. This is an introductory course and may even be looked upon as a user's course in the areas it introduces. It may not be sufficient for those aspiring to do professional work in the study and development of computational schemes. The course gives an introduction to some basic concepts and methods in numerical mathematics. These topics include the study of error propagation, the treatment of linear systems of equations with a stable variant of the Gauss elimination method, iterative schemes for linear and nonlinear systems of equations, as well as the evaluation of integrals by means of the trapezoidal and mid-point methods. The numerical treatment of ordinary differential equations by discretisation is illustrated by presenting the Euler and trapezoidal schemes. This knowledge will often be useful for those using the results of computational mathematics. The written exam tests the ability to solve relatively simple tasks using the lecture notes and other mathematical literature. Hence it is not a test of knowledge learnt by heart.


Chapter 1

Introduction and some standard definitions and results

1.1 Number systems in applied mathematical analysis

The following number systems are often encountered in applied mathematical work:

• integers
• rational numbers
• real numbers
• complex numbers

We note that each of the listed classes of numbers is a subset of the following ones. Integers and rational numbers may be processed by means of symbolic manipulation, e.g. in Maple. Rational numbers are then represented in the form

    r = p/q,

where p and q are integers (q ≠ 0). Complex numbers are written in the form

    z = x + iy,

where x and y are reals. See also [11], pages 4–9.


1.1.1 Two standard inequalities

Let a, s, b be real numbers such that a < s < b. Then

    |s − (a + b)/2| ≤ (b − a)/2.    (1.1)

If a, b are real or complex, then

    ||a| − |b|| ≤ |a + b| ≤ |a| + |b|.    (1.2)

1.1.2 On positional number systems

Our positional systems make it possible to represent any integer using a finite number of different symbols. This is a major difference from the Roman system, which in principle requires infinitely many different symbols. Thus in the usual decimal system

    321 = 1 · 10^0 + 2 · 10^1 + 3 · 10^2,

and in general any integer N is written

    N = sign(N)(a_n a_{n−1} . . . a_0) = sign(N) ∑_{r=0}^{n} a_r · 10^r,

where sign(N) is either + or −, and each integer a_r satisfies

    0 ≤ a_r ≤ 9,  r = 0, 1, . . . , n.

Hence we need only 10 different symbols, together with + and −, to define any integer N.

It is customary to work with decimal fractions. Let u, 0 < u ≤ 1, be a real number. Then u may be represented as the sum of an infinite series,

    u = ∑_{r=1}^{∞} a_r · 10^{−r},  0 ≤ a_r ≤ 9,  a_r integer.

It is easily verified that this series converges. The digits a_r are uniquely determined provided that we exclude expansions ending in an infinite string of nines; otherwise two distinct representations are indeed possible. We illustrate this with the example u = 0.123, which has the two representations

    u = 0.123000 . . .  and  u = 0.122999 . . .

It is also known that the expansion is eventually periodic, i.e. there are numbers n, p such that

    a_{r+p} = a_r,  r ≥ n,
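The digit expansion of an integer described above is easy to compute mechanically. The sketch below (the helper name digits_base10 is ours, not from the notes) extracts the digits a_0, a_1, . . . , a_n of a positive integer by repeated division by 10:

```python
def digits_base10(N):
    """Return the digit list [a_0, a_1, ..., a_n] of a positive integer N,
    so that N equals the sum of a_r * 10**r over r."""
    digits = []
    while N > 0:
        N, d = divmod(N, 10)   # peel off the least significant digit
        digits.append(d)
    return digits

# For 321 = 1*10^0 + 2*10^1 + 3*10^2 the digits are a_0 = 1, a_1 = 2, a_2 = 3.
```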


if and only if u is a rational number. Example:

    4/7 = 0.571428 571428 571428 . . .

In practical work we approximate real numbers with decimal fractions of finite length. There are two possibilities:

• fixed-point numbers
• floating-point numbers

1.1.3 Fixed-point numbers

In fixed-point arithmetic we work with a fixed number s of decimal places. Thus each number a is approximated by a number ā with s decimal places, and by correct rounding we determine ā such that

    |a − ā| ≤ 0.5 · 10^{−s}.

Example 1.1.1

    s = 3, a = √2, ā = 1.414,
    s = 6, a = √2, ā = 1.414214.

We next carry out our calculations with these rounded numbers and, if necessary, the result of each operation is rounded as well.

Example 1.1.2 Let c = ab with a = π, b = e, and put c̃ = ā b̄ and c̄ = c, correctly rounded. Thus:

    s = 3: ā = 3.142, b̄ = 2.718, c̃ = c̄ = 8.540,
    s = 6: ā = 3.141593, b̄ = 2.718282, c̃ = 8.539736, c̄ = 8.539734.

Thus the rounding errors propagate in each step of the calculation, and it is advisable to carry out intermediate calculations with one or two extra decimal places, so-called guarding digits. In the example above, π · e was not obtained with 6 correct decimals when each factor was rounded to this precision and the final result was also rounded to 6 decimals. In computational work one normally uses computers and calculators whose working accuracy is high, so that the effect of rounding errors can be ignored.

1.1.4 Floating-point numbers

Floating-point numbers are written in the form

    a = x_1 · 10^{x_2},

where x_2 is an integer and x_1 a fixed-point number with s decimal places.
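The loss of accuracy in Example 1.1.2 can be reproduced in a few lines. A minimal sketch (our own, not from the notes), using Python's built-in round for correct rounding to s decimal places:

```python
import math

s = 6  # number of decimal places carried

a_bar = round(math.pi, s)           # 3.141593
b_bar = round(math.e, s)            # 2.718282
c_tilde = round(a_bar * b_bar, s)   # product of the rounded factors
c_bar = round(math.pi * math.e, s)  # the correctly rounded exact product

# As in Example 1.1.2: c_tilde = 8.539736 while c_bar = 8.539734,
# so the product of the rounded factors is off by two units
# in the sixth decimal.
```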


Example 1.1.3 Examples with s = 3:

    N_A = 6.024 · 10^{26},  h = 6.625 · 10^{−34}.

Here N_A is Avogadro's number, with dimension atoms/kmol, and h is Planck's constant, in J s.

Definition 1.1.4 If we choose the exponent x_2 such that 0.1 ≤ x_1 < 1, then the number a is said to be in normalised form.

Thus we have for N_A and h:

    N_A = 0.6024 · 10^{27},  h = 0.6625 · 10^{−33}.

Definition 1.1.5 Let ā be an approximation of a. We introduce:

• δa = ā − a, the absolute error in ā as an approximation of a.
• δa/|a| = (ā − a)/|a|, the relative error in ā as an approximation of a, which is defined if a ≠ 0.

Let now ∆a satisfy ∆a ≥ |δa|. Then we introduce the concepts:

• ∆a, the absolute uncertainty in ā as an approximation of a.
• ∆a/|a|, the relative uncertainty in ā as an approximation of a, which is defined if a ≠ 0.

Normally the error δa itself is not known, but an upper bound ∆a, i.e. the uncertainty, is often available.

Let now ā be an approximation of a, given in normalised floating-point format with s decimal places. Then, by correct rounding,

    ∆a = |a − ā| ≤ 0.5 · 10^{x_2 − s}.

Further we find:

    ∆a/|a| = |a − ā|/|a| ≤ (0.5 · 10^{x_2 − s})/(x_1 · 10^{x_2}) = (0.5 · 10^{−s})/x_1 ≤ 0.5 · 10^{1−s}.

The last inequality is a consequence of the fact that a is a normalised number and hence 0.1 ≤ x_1 < 1.

Example 1.1.6

    ∆N_A ≤ 0.5 · 10^{−4} · 10^{27} = 0.5 · 10^{23},  ∆N_A/N_A ≤ 0.5 · 10^{−4}/0.6024 ≈ 0.83 · 10^{−4},
    ∆h ≤ 0.5 · 10^{−4} · 10^{−33} = 0.5 · 10^{−37},  ∆h/h ≤ 0.5 · 10^{−4}/0.6625 ≈ 0.76 · 10^{−4}.


1.2 Well-posedness and condition

We introduce

Definition 1.2.1 A problem P is said to be well-posed in the sense of Hadamard if:

• It has a solution.
• The solution is unique.
• The solution depends continuously on the input data.

If the problem P does not have all of the properties above, it is said to be ill-posed in the sense of Hadamard.

Remark 1.2.2 If a problem P does not have a unique solution, this may be remedied by introducing additional conditions. The input data belong to a certain set, and here continuity may be defined by introducing a distance function suitable for the problem at hand. Thus the well-posedness of a problem depends on the distance function chosen.

Example 1.2.3 Solve the equation

    x^2 = a,  a ≥ 0.

For a > 0 this equation has the two distinct roots

    x_1 = −√a,  x_2 = √a.

If a = 0 the root x = 0 is unique, but since for a < 0 there are no real roots, the problem is ill-posed for all real values of a, including 0. However, if we introduce the further condition that the roots should be nonnegative, then the problem is well-posed for a > 0.

Within the class of well-posed problems, some are more sensitive to changes in the input data than others. We now discuss the important class of well-posed problems where the output is a differentiable function of the input. To define this sensitivity we introduce the condition number, which estimates an upper bound of the quotient between the change of the output (answer) and the change of the input of a problem, caused by a generally small perturbation of the input data. The numerical value of the condition number depends on the measures chosen for the changes of input and output. Since the perturbation of the input is generally small, we may linearise the change of the output. We illustrate by defining the condition number of evaluating a real-valued function f admitting real arguments.

Definition 1.2.4 (Condition number of evaluating f(x)) The absolute condition number of evaluating f at x is

    κ_a(f, x, ∆x) = sup_{|δx| ≤ ∆x} |f(x + δx) − f(x)| / ∆x,


and the relative condition number is given by

    κ_r(f, x, ∆x) = κ_a(f, x, ∆x) / (|f(x)| / |x|).

When f has a continuous derivative in a neighbourhood of x and ∆x is small, we may use the approximations

    κ_a(f, x, ∆x) = |f′(x)|,  κ_r(f, x, ∆x) = |x f′(x)| / |f(x)|.

Example 1.2.5 We determine the condition number of evaluating the function

    f(x) = x^2.

We may linearise this problem using

    f′(x) = 2x,

and hence the change δf = 2x δx corresponds to the perturbation δx. Hence the condition number of the problem of evaluating f(x) is 2|x| if we measure in absolute errors. If we instead use relative errors, then the relative size of the perturbation is |δx/x|, the relative change of the output becomes |δf/f| = 2|δx/x|, and the relative condition number is thus 2 for all values of x ≠ 0.

1.3 Regularisation

Definition 1.3.1 Approximation of an ill-posed problem by a well-posed one is called regularisation of the first-mentioned problem.

We discuss numerical differentiation:

Example 1.3.2 Let f be a given function. We want to approximate the derivative f′(x) at the fixed point x with the difference quotient

    d_h(f)(x) = (f(x + h) − f(x − h)) / (2h).

The functional values f(x + h) and f(x − h) are not known exactly, but we have the approximations f̃(x + h) and f̃(x − h), which are such that

    |f̃(x + h) − f(x + h)| < ɛ,  |f̃(x − h) − f(x − h)| < ɛ,

where ɛ is a known bound. Thus we have the error R given by

    R = f′(x) − (f̃(x + h) − f̃(x − h)) / (2h),
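The linearised condition numbers of Example 1.2.5 can be checked directly. A minimal sketch (the helper names kappa_abs and kappa_rel are ours): for f(x) = x^2 the absolute condition number 2|x| grows with x, while the relative condition number equals 2 for every x ≠ 0.

```python
def kappa_abs(df, x):
    """Linearised absolute condition number |f'(x)|."""
    return abs(df(x))

def kappa_rel(f, df, x):
    """Linearised relative condition number |x f'(x)| / |f(x)|."""
    return abs(x * df(x)) / abs(f(x))

f  = lambda x: x * x
df = lambda x: 2 * x   # f'(x) = 2x

# kappa_abs(df, 3.0) is 6.0, while kappa_rel is 2.0 at any nonzero x.
```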


which may be written

    R = R_1 + R_2,

where

    R_1 = f′(x) − (f(x + h) − f(x − h)) / (2h),
    R_2 = (f(x + h) − f(x − h)) / (2h) − (f̃(x + h) − f̃(x − h)) / (2h).

If f has three continuous derivatives, we make use of Taylor expansion to find that there is a constant c and a number h_1 such that

    |R_1| ≤ c h^2,  h ≤ h_1,

and using the triangle inequality we establish

    |R_2| ≤ ɛ/h.

Thus the total error R satisfies

    |R| ≤ c h^2 + ɛ/h.

The constant c is generally not known. It depends on the function f as well as on x, but not on h. We note that the first part of the error bound, coming from R_1, can be made arbitrarily small by taking h sufficiently small, but the second part, coming from R_2, grows indefinitely when h decreases. Hence there is a positive value of h which renders the bound for R a minimum, but this value can as a rule not be calculated.

1.4 Linear spaces

We recall that a set of numbers S such that the operations of addition, subtraction, multiplication and division are defined, satisfying the laws we are used to from the real numbers, is called a set of scalars. Other examples of scalars are the complex numbers and the rational numbers.

Definition 1.4.1 E is called a vector space (or linear space) over the set of scalars F if vector addition and multiplication by scalars are defined, satisfying the laws we remember from the familiar example of R^n, the space of ordered n-tuples.

Example 1.4.2 Let S be a set and E the set of real-valued functions defined on S. We next define linear combinations as follows:

    (a f_1 + b f_2)(s) = a f_1(s) + b f_2(s),

where a, b are scalars. Thus f_1, f_2 are elements of E (vectors), and the new vector a f_1 + b f_2, the linear combination of f_1 and f_2, is defined by using the laws of the real numbers to evaluate the right-hand side.
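The trade-off between the two terms c h^2 and ɛ/h is easy to observe numerically. In the sketch below (our own illustration; the name d_h follows the text), f = exp at x = 0 and the round-off error of floating-point evaluation plays the role of ɛ: a moderate h beats both a large h, where the truncation term dominates, and a tiny h, where the round-off term ɛ/h dominates.

```python
import math

def d_h(f, x, h):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Exact derivative of exp at 0 is 1; record |f'(x) - d_h(f)(x)| for three h.
errors = {h: abs(1.0 - d_h(math.exp, 0.0, h)) for h in (1e-1, 1e-5, 1e-13)}
# errors[1e-1] is dominated by truncation (~ c h^2),
# errors[1e-13] by round-off (~ eps / h); errors[1e-5] is far smaller.
```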


This general example may be specialised in different ways to give familiar instances of linear spaces. If S consists of n distinct elements, we get a space like R^n, the space of n-tuples. We may also consider spaces of continuous functions defined on an interval. A particular instance is the linear space of polynomials.

Definition 1.4.3 Let E be a linear space and let the subset B ⊂ E be such that B is linearly independent and each element x ∈ E is a linear combination of elements in B. Then B is called a basis of E.

It can be shown that all linear spaces have a basis. In general, a given linear space has more than one basis. If a linear space has a basis with n elements, where n is a finite number, then all bases have the same number n of elements, and n is called the dimension of the space.

We next consider functions defined on linear spaces:

Definition 1.4.4 Let X, Y be linear spaces and f a function mapping elements of X on elements of Y. We write this

    f : X → Y,

which may also be written

    f(x) ∈ Y,  x ∈ X.

Instead of the word "function" we may use one of the words "mapping" or "operator". In particular, if the space Y is a set of scalars, we use the word "functional". f is said to be linear if

    f(a x_1 + b x_2) = a f(x_1) + b f(x_2),  x_1 ∈ X, x_2 ∈ X, a, b scalars.

Example 1.4.5 Let X be C[−1, 1], the linear space of functions continuous on [−1, 1]. Define I, T, δ and F by

    I(f) = ∫_{−1}^{1} f(t) dt,
    T(f) = 0.1 · ( (f(−1) + f(1))/2 + ∑_{r=1}^{19} f(−1 + r · 0.1) ),
    δf = f(0),
    F(f)(x) = ∫_{−1}^{1} exp(xt) f(t) dt.

Then all of I, T, δ and F could be said to be linear operators; in particular I, T, δ are linear functionals. We evaluate these functionals and operators for the argument f(t) = t. Then we find

    I(f) = ∫_{−1}^{1} t dt = 0,
    T(f) = 0.1 · ( (−1 + 1)/2 + ∑_{r=1}^{19} (−1 + r · 0.1) ) = 0,
    δf = 0,
    F(f)(x) = ∫_{−1}^{1} exp(xt) t dt = (e^x + e^{−x})/x + (e^{−x} − e^x)/x^2.
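The functionals T and δ of Example 1.4.5 can be written down directly; a minimal sketch (function names are ours) shows that both vanish for the odd function f(t) = t:

```python
def T(f):
    """Composite trapezoidal functional on [-1, 1] with step h = 0.1
    (20 subintervals), as in Example 1.4.5."""
    h = 0.1
    return h * ((f(-1.0) + f(1.0)) / 2
                + sum(f(-1.0 + r * h) for r in range(1, 20)))

def delta(f):
    """Point evaluation at 0, the functional called delta in the text."""
    return f(0.0)

f = lambda t: t   # the odd test function used in the notes
# T(f) and delta(f) both vanish; for the constant function g(t) = 1,
# T(g) returns the length of the interval, 2.
```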


Definition 1.4.6 Let E be a linear space. The functional ||f|| is called a norm on E if for all f, g ∈ E and scalars a:

• ||f|| ≥ 0, f ∈ E,
• ||f|| = 0 implies f = 0,
• ||a f|| = |a| · ||f||,
• ||f + g|| ≤ ||f|| + ||g||.

Example 1.4.7 Let X be C[−1, 1], the linear space of real-valued functions continuous on [−1, 1]. Then we often use the following norms:

• ||f||_∞ = max_{−1 ≤ t ≤ 1} |f(t)|,
• ||f||_2^2 = ∫_{−1}^{1} (f(t))^2 dt,
• ||f||_1 = ∫_{−1}^{1} |f(t)| dt.

If we now consider the special example f(t) = t, we find

    ||f||_∞ = 1,  ||f||_2 = √(2/3),  ||f||_1 = 1.

1.5 Dual systems in R^n

1.5.1 Dual systems

We consider R^n, the space of n-tuples. In this space linear functionals are represented by scalar products

    l(x) = c^T x = ∑_{r=1}^{n} c_r x_r,

and operators by square matrices.

Definition 1.5.1 (Dual systems) Let A be a square matrix and b, c, x, y be vectors. We introduce the two problems:

    Determine b^T x when A x = c,

and

    Determine c^T y when A^T y = b.

We will only consider the case when the two linear systems have unique solutions for all right-hand sides c, b. Then we have the following duality result:
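The three norms of Example 1.4.7 can be approximated by brute force, sampling for the max norm and midpoint quadrature for the integral norms. This is only an illustrative sketch (all function names are ours), not how one would compute norms in serious work:

```python
import math

def norm_inf(f, n=2001):
    """Approximate max norm on [-1, 1] by sampling on a uniform grid."""
    return max(abs(f(-1.0 + 2.0 * k / (n - 1))) for k in range(n))

def norm_2(f, n=10000):
    """Approximate L2 norm on [-1, 1] by the midpoint rule."""
    h = 2.0 / n
    return math.sqrt(h * sum(f(-1.0 + (k + 0.5) * h) ** 2 for k in range(n)))

def norm_1(f, n=10000):
    """Approximate L1 norm on [-1, 1] by the midpoint rule."""
    h = 2.0 / n
    return h * sum(abs(f(-1.0 + (k + 0.5) * h)) for k in range(n))

f = lambda t: t
# Expect norm_inf(f) = 1, norm_2(f) close to sqrt(2/3), norm_1(f) close to 1.
```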


Theorem 1.5.2 Let A be a square matrix and b, c, x, y be vectors. If A is such that the system

    A x = c

has a unique solution x for each c, we define the linear functional

    d(c) = b^T x.

If b allows the representation

    b = A^T y,

we have

    d(c) = c^T y.

Proof:

    d(c) = b^T x = (A^T y)^T x = y^T A x = y^T c.  •

This proof may also be carried out by calculating component-wise: We have

    d(c) = ∑_{i=1}^{n} b_i x_i,

and

    b_i = ∑_{r=1}^{n} a_{ri} y_r,  c_r = ∑_{i=1}^{n} a_{ri} x_i.

Hence

    d(c) = ∑_{i=1}^{n} b_i x_i = ∑_{i=1}^{n} x_i ∑_{r=1}^{n} a_{ri} y_r = ∑_{r=1}^{n} y_r ∑_{i=1}^{n} a_{ri} x_i = ∑_{r=1}^{n} y_r c_r.

Remark 1.5.3 This simple result is very useful in many contexts and allows generalisations in various directions. In R^n we consider the two linear systems

    A x = c,  A^T y = b,

and efficient methods for solving both at the same time are available. We note that evaluating the functional d(c) for various values of c may be looked upon as a generalisation of the task of tabulating d.

We also establish

Theorem 1.5.4 Let c̄ = c + ɛ. Then we have

    d(c̄) − d(c) = y^T ɛ.

Proof: We have

    d(c̄) = b^T x̄,  A x̄ = c + ɛ,

and

    d(c) = b^T x,  A x = c.


Subtracting the last relation from the preceding one, we get

    d(c̄) − d(c) = b^T (x̄ − x) = (A^T y)^T (x̄ − x) = y^T A (x̄ − x) = y^T (c̄ − c) = y^T ɛ.  •

Thus we may estimate the change in the value of the functional d(c) caused by a perturbation of c without studying the inverse of A.

1.5.2 Error analysis for linear systems of equations

In many practical situations the numerical treatment of a linear system of equations gives an approximate solution vector which is far from the exact solution, but whose residual is small. However, often we do not seek the solution vector itself, but need to enter it into some formula to obtain the end result. If we need to evaluate a linear functional to find the desired result, we may use the theorem above to estimate the error in our final result, which may be much smaller than the error in the calculated solution vector. We note that we may estimate the error in component x_i by putting b = (0, 0, . . . , 1, 0, . . . , 0)^T, i.e. by putting the i-th component equal to 1 and all the others to 0. It is worth mentioning that these error bounds are obtained without estimating the norm of the inverse of the square matrix A. However, if we need to estimate the errors in all the components of the calculated solution vector, then we need to perform a computational effort equivalent to inverting A.

1.6 Some standard definitions and results

1.6.1 Order of magnitude O and o

Definition 1.6.1 We write

    f(x) = O(g(x)), x → ∞, when lim_{x→∞} f(x)/g(x) = c ≠ 0.

Example: f(x) = x^3 + 5, g(x) = x^3.

Similarly,

    f(x) = o(g(x)), x → ∞, when lim_{x→∞} f(x)/g(x) = 0.

Example: f(x) = x^2, g(x) = e^x.

Analogously,

    f(x) = O(g(x)), x → 0, when lim_{x→0} f(x)/g(x) = c ≠ 0.

Example: f(x) = sin x, g(x) = x.

    f(x) = o(g(x)), x → 0, when lim_{x→0} f(x)/g(x) = 0.

Example: f(x) = 1 − cos x, g(x) = x.
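Theorem 1.5.4 is easy to verify numerically on a small system. The sketch below is our own illustration (solve2 uses Cramer's rule, which is fine for a 2 by 2 matrix but not a general method): the change in the functional predicted by y^T ɛ matches the recomputed value b^T x̄ − b^T x.

```python
def solve2(A, c):
    """Solve a 2x2 system by Cramer's rule; A is ((a11, a12), (a21, a22))."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return ((c[0] * a22 - a12 * c[1]) / det,
            (a11 * c[1] - c[0] * a21) / det)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

A  = ((2.0, 1.0), (0.0, 3.0))    # a small invertible matrix
At = ((2.0, 0.0), (1.0, 3.0))    # its transpose
b, c, eps = (1.0, 2.0), (4.0, 5.0), (0.01, -0.02)

x    = solve2(A, c)                                # A x = c
xbar = solve2(A, (c[0] + eps[0], c[1] + eps[1]))   # A xbar = c + eps
y    = solve2(At, b)                               # A^T y = b

lhs = dot(b, xbar) - dot(b, x)   # d(c + eps) - d(c)
rhs = dot(y, eps)                # y^T eps, the change Theorem 1.5.4 predicts
```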


1.6.2 Arithmetic sequence

    a_r = A + r d,  r = 0, . . . , n − 1,

with d, A constants, and

    s_n = ∑_{r=0}^{n−1} a_r = nA + (n(n − 1)/2) d = n (a_0 + a_{n−1})/2.

Example: s_n = 1 + 2 + . . . + n = n(n + 1)/2.

Note that the sum defining s_n has n terms and that s_n may be obtained by averaging the first and the last term and multiplying the result by the number of terms.

1.6.3 Geometric series

Writing

    s_n(x) = ∑_{r=0}^{n−1} x^r = (1 − x^n)/(1 − x), x ≠ 1,  s_n(1) = n,

    s(x) = 1/(1 − x),  s(x) = s_n(x) + x^n/(1 − x), x ≠ 1,

and

    s(x) = s_n(x) + R_n(x),

where R_n(x) is called the truncation error introduced when we approximate s(x) with s_n(x), we have

    R_n(x) = x^n/(1 − x), x ≠ 1,  and  R_{n+1}(x) = x · R_n(x).

Example:

    s = 1 + 0.9 + 0.81 + 0.729 + . . . = 1/(1 − 0.9) = 10,
    R_n = (0.9)^n/(1 − 0.9) = 10 · (0.9)^n.

1.6.4 Leibniz's theorem for alternating series

Let the series

    s = ∑_{r=0}^{∞} a_r,  with  s_n = ∑_{r=0}^{n−1} a_r,

be such that

    a_r = (−1)^r b_r,  0 ≤ b_{r+1} ≤ b_r,  lim_{r→∞} b_r = 0.

Then

    |R_n| = |s − s_n| ≤ b_n.

Example: s = 1 − 1/2 + 1/3 − . . . = ln 2 ≈ 0.6931.

    s_4 = 1 − 1/2 + 1/3 − 1/4 ≈ 0.5833,  s − s_4 ≈ 0.1098 < 1/5 = 0.2.
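The truncation-error formulas for the geometric series can be confirmed directly. A minimal sketch (our own; s_n follows the notation of the text), using the example x = 0.9 where s = 10:

```python
def s_n(x, n):
    """Partial sum s_n(x) = sum of x^r for r = 0, ..., n - 1."""
    return sum(x ** r for r in range(n))

x, n = 0.9, 20
s = 1.0 / (1.0 - x)     # s(0.9) = 10
R = s - s_n(x, n)       # truncation error R_n(x)
# R_n(x) should equal x^n / (1 - x), here 10 * 0.9**20,
# and the next remainder should satisfy R_{n+1} = x * R_n.
```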


1.6.5 Mean value theorems

The mean value formulas in Subsection 13.1.9 are often used in error estimations. If the interval [a, b] and the functions f and w, as well as the points t_i and weights w_i, are given, formulas (13.1), (13.2) and (13.3) define equations which are satisfied by a number s which may not be uniquely determined. We illustrate with some numerical examples.

In (13.1) we set a = −1, b = 1, f(t) = t^3. Thus we get the equation

    (f(1) − f(−1))/(1 − (−1)) = f′(s),  i.e.  1 = 3 s^2,

which has the solutions

    s = ±1/√3.

In (13.2) we put a = −1, b = 1, w(t) = 1, f(t) = t^4 to arrive at the equation

    0.4 = s^4 · 2,

which has the real solutions

    s = ± ⁴√0.2.

In (13.3) we put n = 3, t_1 = −1, t_2 = 0, t_3 = 1, w_1 = 1/4, w_2 = 1/2, w_3 = 1/4, f(t) = t^2. Then we find the equation

    1/4 + 1/4 = s^2,

which has the solutions

    s = ±1/√2.

1.6.6 Integral estimates for partial sums

Theorem 1.6.2 Let f be a function which is continuous and decreasing over the positive real half-line and let h be a positive constant. Then for n > m we have the inequalities

    | ∫_{mh}^{nh} f(t) dt − h ( (f(mh) + f(nh))/2 + ∑_{r=m+1}^{n−1} f(rh) ) | ≤ h (f(mh) − f(nh))/2,

i.e.

    h ∑_{r=m}^{n} f(rh) = ∫_{mh}^{nh} f(t) dt + h (f(mh) + f(nh))/2 ± h (f(mh) − f(nh))/2.

(This is a simple instance of the Euler-Maclaurin summation formula.) The proof is based on using the inequalities

    h f(h(r + 1)) ≤ ∫_{rh}^{h(r+1)} f(t) dt ≤ h f(rh),  r = m, m + 1, . . . , n − 1.
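Theorem 1.6.2 can be checked on a concrete decreasing function. A minimal sketch (our own illustration; trap_sum is our name for the bracketed trapezoidal sum), using f(t) = e^{-t}, whose integral over [mh, nh] is known in closed form:

```python
import math

def trap_sum(f, h, m, n):
    """h * [ (f(mh) + f(nh)) / 2 + sum of f(rh) for r = m+1, ..., n-1 ]."""
    return h * ((f(m * h) + f(n * h)) / 2
                + sum(f(r * h) for r in range(m + 1, n)))

f = lambda t: math.exp(-t)   # continuous and decreasing for t > 0
h, m, n = 0.1, 0, 50

integral = math.exp(-m * h) - math.exp(-n * h)   # exact integral over [0, 5]
gap = abs(integral - trap_sum(f, h, m, n))
bound = h * (f(m * h) - f(n * h)) / 2            # the bound of Theorem 1.6.2
# gap should not exceed bound.
```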


1.6.7 Taylor's formula

We derive Taylor's formula by means of repeated integration by parts. Let a be a fixed point, x a real number, and assume that the function f has at least n derivatives on the interval with endpoints a and a + x. Then we have the relation

    f(a + x) = f(a) + ∫_{a}^{a+x} f′(t) dt.

Integrating by parts, and remembering that the derivative with respect to t of the function t − a − x is 1, we get

    f(a + x) = f(a) + [f′(t)(t − a − x)]_{a}^{a+x} − ∫_{a}^{a+x} (t − a − x) f′′(t) dt.

Simplifying, we find

    f(a + x) = f(a) + x f′(a) − ∫_{a}^{a+x} (t − a − x) f′′(t) dt.

A new integration by parts gives

    f(a + x) = f(a) + x f′(a) − [ (t − a − x)^2/2 · f′′(t) ]_{a}^{a+x} + ∫_{a}^{a+x} ((t − a − x)^2/2) f^{(3)}(t) dt.

Upon simplification we have

    f(a + x) = f(a) + x f′(a) + (x^2/2) f′′(a) + ∫_{a}^{a+x} ((t − a − x)^2/2) f^{(3)}(t) dt.

This procedure is repeated several times and the formula in 13.1.10 emerges, upon using the mean-value theorem (13.2) on the last integral.

1.7 Introduction exercises

In Numerical Mathematics we take advantage of results from mathematical analysis. For the present course it is necessary that the students recall what they have learnt in earlier courses in Calculus and Linear Algebra. They should be able to use mathematical results from the literature and practise seeking such results in relevant books or on the internet. The exercises to follow illustrate material which is deemed to be useful for the study of Numerical Mathematics. The problems may be solved analytically or by using a simple pocket calculator.
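The integral remainder after the x^2 term lies between (x^3/6) min f^(3) and (x^3/6) max f^(3) over the interval, by the mean-value theorem. A minimal sketch (our own; taylor2 is a made-up helper name) checks this for f = exp at a = 0, where all derivatives equal 1 at a:

```python
import math

def taylor2(f0, d1, d2, x):
    """Second-order Taylor polynomial f(a) + x f'(a) + (x^2/2) f''(a)."""
    return f0 + x * d1 + x * x / 2 * d2

# f = exp at a = 0, so f(a) = f'(a) = f''(a) = 1.
x = 0.1
remainder = math.exp(x) - taylor2(1.0, 1.0, 1.0, x)
# The remainder equals (x^3/6) f'''(xi) for some xi in (0, x),
# so here it lies strictly between x^3/6 and (x^3/6) e^x.
```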


1.7.1 Arithmetic series

Evaluate the arithmetic sum

    A(n) = 1 + 3 + . . . + (2n − 1),  n ≥ 1.

Calculate A(100).

Solution: Write the sum forwards and backwards:

    A(n) = 1 + 3 + . . . + (2n − 1),
    A(n) = (2n − 1) + (2n − 3) + . . . + 1.

Adding the two lines term by term gives

    2 A(n) = n · 2n,

so

    A(n) = n^2,  A(100) = 10000.

1.7.2 Geometric series

Evaluate the geometric series

    s = 2 + 0.4 + 0.08 + . . . + 2 · (0.2)^n + . . .

Solution:

    s = 2/(1 − 0.2) = 2.5.

1.7.3 Limit values

Find the following limit values:

a) lim_{x→0} (sin x)/x.
b) lim_{x→0} x^2/(e^x − 1 − x).
c) lim_{x→+∞} x^2 e^{−x}.
d) lim_{n→∞} (1 + 2x/n)^n.
e) lim_{x→∞} (2x^3 + 7x + 6)/x^3.

Solution: L'Hôpital's rule in a) and b):

a)

    lim_{x→0} (sin x)/x = lim_{x→0} (cos x)/1 = 1.
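The pairing argument for the arithmetic series can be confirmed by brute force; the sketch below (our own, with A written as a function) sums the first n odd numbers directly:

```python
def A(n):
    """Sum of the first n odd numbers, A(n) = 1 + 3 + ... + (2n - 1)."""
    return sum(2 * r - 1 for r in range(1, n + 1))

# The pairing argument gives A(n) = n^2, so A(100) = 10000.
```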


b)

    lim_{x→0} x^2/(e^x − 1 − x) = lim_{x→0} 2x/(e^x − 1) = lim_{x→0} 2/e^x = 2.

c)

    lim_{x→+∞} x^2 e^{−x} = lim_{x→+∞} e^{−x + 2 ln x} = 0,

because

    −x + 2 ln x → −∞ when x → ∞.

Alternative derivation: We have

    x^2 e^{−x} > 0, x > 0,

and

    x^2 e^{−x} = x^2/e^x = x^2/(1 + x + x^2/2 + x^3/6 + . . .) < x^2/(x^3/6) = 6/x → 0 when x → ∞.

d)

    lim_{n→∞} (1 + 2x/n)^n = lim_{n→∞} exp(n ln(1 + 2x/n)) = exp(2x).

e)

    lim_{x→∞} (2x^3 + 7x + 6)/x^3 = lim_{x→∞} (2 + 7/x^2 + 6/x^3) = 2.

1.7.4 Moment integrals

Evaluate the integrals

    M_{2n} = ∫_{−1}^{1} x^{2n} dx,  M_{2n+1} = ∫_{−1}^{1} x^{2n+1} dx,  n = 1, 2, . . .

Compute M_0, M_1, M_2, M_3, M_4.

Solution:

    M_r = ∫_{−1}^{1} x^r dx = (1 − (−1)^{r+1})/(r + 1),

so

    M_{2n} = 2/(2n + 1),  M_{2n+1} = 0,  n = 1, 2, . . . ,

and

    M_0 = 2, M_1 = 0, M_2 = 2/3, M_3 = 0, M_4 = 2/5.
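The closed form for the moment integrals is a one-liner; a minimal sketch (our own, with M written as a function of r) reproduces the five requested values:

```python
def M(r):
    """Moment integral over [-1, 1] of x^r:
    M_r = (1 - (-1)^(r + 1)) / (r + 1)."""
    return (1 - (-1) ** (r + 1)) / (r + 1)

# Odd moments vanish; even moments give 2 / (r + 1).
```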


1.7.5 Upper bound for a sum

Set

    f = ɛ_1 + 2ɛ_2 + 3ɛ_3,

where

    |ɛ_i| ≤ ɛ,  i = 1, 2, 3,

and ɛ > 0 is a given number. Show that

    |f| ≤ 6ɛ.

Solution:

    |ɛ_1 + 2ɛ_2 + 3ɛ_3| ≤ |ɛ_1| + 2|ɛ_2| + 3|ɛ_3| ≤ 6ɛ.

1.7.6 Local extrema of a function

Put

    f(x) = x(x^2 − 1),  −2 ≤ x ≤ 2.

Find all local extrema of f.

Solution: Local extrema may occur at the endpoints x = −2 and x = 2, as well as at those points x satisfying

    f′(x) = 0 and |x| < 2.

We find

    f(−2) = (−2) · (4 − 1) = −6,  f(2) = 2 · (4 − 1) = 6,

and

    f′(x) = x^2 − 1 + x · 2x = 3x^2 − 1.

Thus f′(x) = 0 for x = x_1, x_2, where

    x_1 = −1/√3,  x_2 = 1/√3.

We have

    f(x_1) = 2√3/9 ≈ 0.384900,  f(x_2) = −2√3/9 ≈ −0.384900.

Hence x = −2 and x = x_2 give local minima, while x = x_1 and x = 2 give local maxima.


1.7.7 Locating roots

Put

f(x) = x³ + 6x − 4.

Show that the equation f(x) = 0 has a root s in [0, 1]. Is the root unique?
Solution:

f(0) = −4, f(1) = 1 + 6 − 4 = 3.

Hence f(0) · f(1) < 0 and f is a continuous function. Thus the equation f(x) = 0 has a root s in [0, 1]. We find also

f'(x) = 3x² + 6 > 0, all x.

The function f is strictly increasing and can therefore cut the x-axis only once.

1.7.8 Solutions of a linear system of equations

The system of equations Ax = b is given, with

    | 1 2 3 |       | 6 |
A = | 0 4 5 |,  b = | 9 |.
    | 0 0 a |       | b |

For which values of a and b does the system have solutions? Determine the solution set in these cases. The system is written componentwise:

x_1 + 2x_2 + 3x_3 = 6,
      4x_2 + 5x_3 = 9,
             a·x_3 = b.

Solution: We find the solutions by back-substitution:
a ≠ 0: x_3 = b/a, x_2 = (9 − 5x_3)/4, x_1 = 6 − 3x_3 − 2x_2. The system has a unique solution.
a = 0, b ≠ 0: a·x_3 = b has no solution. Hence the system is inconsistent.
a = 0, b = 0: a·x_3 = b is solved by all x_3. The system has infinitely many solutions.
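The sign-change argument in 1.7.7 can be turned directly into a computation: bisection keeps halving an interval on which f changes sign. A minimal Python sketch (not part of the original notes):

```python
def f(x):
    return x ** 3 + 6 * x - 4

# f(0) < 0 < f(1) and f is continuous, so a root lies in [0, 1]
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    # keep the half-interval on which f changes sign
    if f(lo) * f(mid) <= 0:
        hi = mid
    else:
        lo = mid
s = 0.5 * (lo + hi)
assert abs(f(s)) < 1e-10
```

Since f is strictly increasing, this s is the unique root.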


1.7.9 Least squares solutions

Solve the least squares problem Ax = b for all values of a, b and c, where

    | 1 2 3 |       | 6 |
A = | 0 4 5 |,  b = | 9 |.
    | 0 0 a |       | b |
    | 0 0 0 |       | c |

Solution: We will always be able to satisfy the first two equations by putting

x_2 = (9 − 5x_3)/4, x_1 = 6 − 3x_3 − 2x_2.

Hence we arrive at the problem: find

min_{x_3} (b − a·x_3)² + c².

If a ≠ 0, we put x_3 = b/a and get the minimal value c².
If a = 0, the minimum value becomes b² + c² and is achieved for all x_3.

1.7.10 Ordinary differential equation with constant coefficients

Solve the initial-value problem

ẍ + 4ẋ + 3x = t, x(0) = 1, ẋ(0) = 2.

Solution: We construct a particular solution and try

x(t) = A + Bt, giving ẋ(t) = B, ẍ(t) = 0.

Entering this into the equation we get the relation

4B + 3A + 3Bt = t,

which must hold for all t. Thus we must have

3B = 1, 4B + 3A = 0.

This gives B = 1/3, A = −4/9, and we arrive at the particular solution

x_p(t) = −4/9 + t/3.


We next construct the general solution of the homogeneous equation. The characteristic equation is

r² + 4r + 3 = 0,

which has the roots r_1 = −3 and r_2 = −1. Therefore the general solution of the differential equation becomes

x(t) = A e^{−3t} + B e^{−t} − 4/9 + t/3,

with

ẋ(t) = −3A e^{−3t} − B e^{−t} + 1/3.

Using the initial conditions we obtain the following linear system of equations for determining the constants A and B:

x(0) = A + B − 4/9 = 1,
ẋ(0) = −3A − B + 1/3 = 2,

or

A + B = 13/9,
−3A − B = 5/3,

giving

A = −14/9, B = 3.

The solution sought becomes

x(t) = −(14/9) e^{−3t} + 3 e^{−t} − 4/9 + t/3.
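The solution can be verified by substituting it back into the equation. A small Python check (illustrative; uses the analytic derivatives of the formula above):

```python
import math

def x(t):
    # solution found above
    return -14 / 9 * math.exp(-3 * t) + 3 * math.exp(-t) - 4 / 9 + t / 3

def xdot(t):
    return 14 / 3 * math.exp(-3 * t) - 3 * math.exp(-t) + 1 / 3

def xddot(t):
    return -14 * math.exp(-3 * t) + 3 * math.exp(-t)

# initial conditions
assert abs(x(0) - 1) < 1e-12
assert abs(xdot(0) - 2) < 1e-12
# residual of x'' + 4x' + 3x = t at some sample points
for t in (0.0, 0.5, 1.0, 2.0):
    assert abs(xddot(t) + 4 * xdot(t) + 3 * x(t) - t) < 1e-10
```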


Chapter 2
On the representation of numbers in a computer

2.1 Introduction

Computers may be used to process numbers in two different ways: as exact quantities which are processed by means of symbolic manipulations, or as data which are subject to numerical evaluation, often involving approximation. Software packages called computer algebra systems perform symbolic manipulations. Examples of such packages are Maple, Mathematica, Derive and Macsyma. There is a principal limit to the performance of these processors, since only a finite number of symbols may be stored in a given computer, while it is possible to devise mathematical expressions of any length, e.g.

s = ∑_{n=1}^{∞} ln(cos²(n^{1/n})) · e^{−n}.

If we accept approximate values, then our calculations may be carried out by means of an algorithmic language like Matlab, Fortran77, Fortran90 or C. By writing a short program in any of these languages it would be easy to determine the sum above to any desired accuracy, say 8 decimal places, by direct summation.

2.2 Implementation of expressions on a computer

Since a computer may only carry out a finite number of arithmetic operations and a finite number of logical choices, the most general expression which may be evaluated by means of a computer is a piecewise rational function. The value of such an expression is affected by round-off errors, which may accumulate. This also means that e.g. all standard functions must be approximated by piecewise rational functions. We now introduce


Definition 2.2.1 We distinguish between the following classes of errors:
• Round-off error, due to the approximation of a number by its computer representation and the fact that arithmetic operations are not carried out exactly.
• Data error, if given parameters are not exactly known.
• Truncation error: the replacement of an infinite sequence of operations by a finite one.

Remark 2.2.2 It is desirable to organise the computations such that the effects of round-offs and truncations are negligible in comparison to the effects of the uncertainties in given data. If e.g. data are given with an uncertainty of 1%, the calculations should be carried out such that the contributions of the round-offs and truncations are certainly less than 0.1%.

2.3 Representing numbers in a computer

When working on a computer, several modifications to the presentation above are called for. The number s of decimal places is fixed for a given computer and software. Also the range of the exponent x_2 is limited, such that there are two integers m, M with

m ≤ x_2 ≤ M.

The number 10, which is the basis of the decimal system, is replaced by a positive integer B. The choices B = 2, 8 or 16 are most common. Thus for an integer N we have

N = sign(N) · (b_m b_{m−1} ⋯ b_0)_B = sign(N) · ∑_{r=0}^{m} b_r B^r,

where the integers b_r satisfy 0 ≤ b_r < B. Thus the common decimal system has B = 10. Fixed-point and floating-point numbers are introduced as before. Thus the result of correctly rounding to s B-mals gives the bound

|a − ā| ≤ 0.5 · B^{−s}.

Example 2.3.1 a = 1/3, B = 8, s = 4. We write

1/3 = b_1/8 + b_2/8² + b_3/8³ + b_4/8⁴ + ɛ,

where |ɛ| ≤ 0.5 · 8^{−4}. Multiplying both sides by 8 we get the relation

8/3 = b_1 + b_2/8 + b_3/8² + b_4/8³ + ɛ·8.

Here we take b_1 = 2 and obtain

2/3 = b_2/8 + b_3/8² + b_4/8³ + ɛ·8.


This relation is again multiplied by 8, giving

16/3 = b_2 + b_3/8 + b_4/8² + ɛ·8².

Now we put b_2 = 5 and arrive at

1/3 = b_3/8 + b_4/8² + ɛ·8².

Multiplying by 8 we find

8/3 = b_3 + b_4/8 + ɛ·8³.

Thus b_3 = 2 and

2/3 = b_4/8 + ɛ·8³.

Finally we find

16/3 = b_4 + ɛ·8⁴.

This gives

b_4 = 5, ɛ = (1/3) · 8^{−4}.

We next establish

1/3 = (0.252525 …)_8.

We have namely

(0.252525 …)_8 = 2/8 + 5/8² + 2/8³ + 5/8⁴ + …
= (2/8)(1 + 8^{−2} + 8^{−4} + …) + (5/8²)(1 + 8^{−2} + 8^{−4} + …)
= (2/8 + 5/8²) · (1/(1 − 1/64)) = 1/3.

Remark 2.3.2 If B = B_0^p with p an integer, then one digit in the B-system corresponds to p digits in the B_0-system.

Example 2.3.3 B = 1000, p = 3, B_0 = 10:

π ≈ 3.141 592 654 = 3·B⁰ + 141·B^{−1} + 592·B^{−2} + 654·B^{−3}.

We also find that one digit in the system with B = 16 corresponds to 4 digits for B = 2. We now introduce:

Definition 2.3.4 Two real numbers a and b are said to be computationally equivalent if they have the same representation in a given computer.

We realise that whether or not two given numbers are computationally equivalent depends on the working accuracy of the computer used. The definition of computational equivalence is extended to vectors, matrices and functions in an obvious way.
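The digit-by-digit procedure of Example 2.3.1 (multiply by the base, read off the integer part) is mechanical and can be sketched in a few lines of Python; exact rational arithmetic avoids round-off in the check. Note this produces truncated, not rounded, digits, which for 1/3 coincide with the rounded ones:

```python
from fractions import Fraction

def base_digits(x, base, s):
    # first s digits after the point of x in the given base, 0 <= x < 1
    digits = []
    for _ in range(s):
        x *= base
        d = int(x)        # integer part is the next digit
        digits.append(d)
        x -= d
    return digits

# 1/3 = (0.252525...)_8
assert base_digits(Fraction(1, 3), 8, 6) == [2, 5, 2, 5, 2, 5]
```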


Example 2.3.5 Consider a computer with working relative accuracy 1.0 · 10^{−7}. Define the functions f and g according to

f(x) = exp(x), g(x) = ∑_{r=0}^{12} x^r/r!, −1 ≤ x ≤ 1.

Then the exponential function f and the polynomial g are computationally equivalent on the interval [−1, +1].

We next observe that there are only finitely many different computer representations of a real number on a given computer with a pre-defined working accuracy, but there are infinitely many real numbers. This means that there are infinitely many reals having an identical computer representation. This conclusion may be generalised to classes of computational problems, as illustrated by

Example 2.3.6 Let A be a matrix and b, c and x compatible vectors. Consider the task of evaluating

v = bᵀx, when Ax = c.

There are infinitely many different matrices and vectors A, b, c having identical computer representations, and they could conceivably define different values of the answer v. This is not desirable.
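The claim in Example 2.3.5 rests on the Taylor remainder being below 1/13! ≈ 1.6 · 10^{−10} on [−1, 1], far under the working accuracy 10^{−7}. A brief Python check (illustrative sketch, sampling the interval):

```python
import math

def g(x):
    # degree-12 Taylor polynomial of exp
    return sum(x ** r / math.factorial(r) for r in range(13))

# largest deviation from exp on a grid over [-1, 1]
worst = max(abs(math.exp(x) - g(x)) for x in [i / 100 for i in range(-100, 101)])
assert worst < 1e-7   # hence computationally equivalent at accuracy 1e-7
```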


Chapter 3
Gaussian elimination for Ax = b

3.1 Introduction

Gaussian elimination is based on systematic use of the following

Lemma 3.1.1 Let x = (x_1, …, x_n)ᵀ ∈ Rⁿ be a column vector, f_1(x), f_2(x) functions having x as argument. Let finally c ≠ 0 be a constant. Then the following two systems of equations have the same solution sets:

f_1(x) = 0, f_2(x) = 0,   (3.1)

and

f_1(x) = 0, f_1(x) + c·f_2(x) = 0.   (3.2)

Proof: If x satisfies (3.1) then it immediately follows that x satisfies (3.2). On the other hand, if f_1(x) = 0 and f_1(x) + c·f_2(x) = 0, we first see that f_1(x) = 0, giving that c·f_2(x) = 0, and since c ≠ 0 we find that f_2(x) = 0 as well, proving the lemma. •

We illustrate an application of the lemma with the following

Example 3.1.2 Solve the linear system

3x + y = 1,   (3.3)
2x − y = 4.   (3.4)

Using the lemma above we conclude that the system (3.3), (3.4) has the same solution set as the system (3.5) below for all nonzero values of c:

3x + y = 1, (2 + 3c)x + (c − 1)y = 4 + c.   (3.5)


We now determine c such that 2 + 3c = 0, i.e. c = −2/3, and hence arrive at the system

3x + y = 1, −(5/3)y = 10/3.   (3.6)

This last system is of triangular form and we easily find y = −2 and subsequently x = 1. This procedure is applied to general systems of equations to transform them into systems of a special structure whose solution set is easily determined. The scheme, called Gaussian elimination with pivoting, is described in Section 3.3.

3.2 Trapezoidal systems

Definition 3.2.1 Let A be a rectangular matrix with n rows and m columns, x and b column vectors with m and n elements respectively. Then

Ax = b

specifies a linear system of equations with n equations and m unknowns; b is the right-hand side, A the table of coefficients and x a solution vector to be determined.

The system may be inconsistent, have a unique solution or have infinitely many solution vectors. If n = m, the matrix A is called square. The element in row r and column k is denoted a_{r,k}. The matrix A is called trapezoidal or in echelon form if there is an integer k with the properties (1) and (2) below:
(1) a_{i,i} ≠ 0 and a_{i,j} = 0 for j < i, i = 1, …, k.
(2) If k < n, then a_{i,j} = 0 for i > k.
Important special case: if m = n = k then A is said to be triangular.

Example 3.2.2

    | 1 2 3 |      | 1 2 3 7 |      | 1 2 3 |
A = | 0 4 5 |, B = | 0 4 5 8 |, C = | 0 4 5 |.
    | 0 0 6 |      | 0 0 6 9 |      | 0 0 0 |

All three matrices above are trapezoidal. Matrix A is even triangular. Consider now the three linear systems of equations

Ax = b, By = b, Cz = b.


Then we easily verify: The system Ax = b has a unique solution for all right-hand sides b. The system By = b has infinitely many solutions, since y_4 may be chosen arbitrarily and then the remaining elements in the solution y may be expressed uniquely in terms of y_4. The system Cz = b is solvable only if b_3 = 0. In this case z_3 may be chosen arbitrarily and the remaining components of the solution vector may be expressed uniquely in terms of z_3.

Definition 3.2.3 The following two classes of square matrices are useful: the square matrix A is termed symmetric if

a_{i,j} = a_{j,i}.

A is called tri-diagonal if

a_{i,j} = 0 for |i − j| > 1.

Provided that the table of coefficients is of trapezoidal form, it is possible to determine all solutions of the system Ax = b by means of a finite number of arithmetic operations.

3.3 Reduction of a general system of equations to trapezoidal form

If A is of trapezoidal form, then the system Ax = b is such that the coefficient a_{i,j} in front of the unknown x_j in equation number i is 0 if i > j. The Gaussian elimination scheme may be used to reduce a general system of equations to a system of trapezoidal form which has the same solution set as the original system. To secure stability, pivoting, i.e. reordering of the equations, is performed according to certain rules to be described below. For certain classes of matrices pivoting is not required. Also, if the elimination is carried out exactly, as e.g. in theoretical analysis, pivoting is only carried out to avoid division by 0.
We describe now the Gaussian elimination for a general system Ax = b, where A has n rows and m columns. The main idea is that elimination step number k eliminates the coefficient in front of the unknown x_k in equations i > k, for k = 1, …, min(m, n). It is essential that the coefficient in front of x_k in equation k remains different from 0.
The algorithm may be described as follows. (In order to simplify the notation we denote by a_{i,j} the contents of the place in row i and column j, which could be thought of as a space in the computer.)

Step 1: Elimination of the coefficients in front of x_1:
Substep 1a: Exceptional case: if all elements in column 1 are 0, find another column which has a nonzero element and interchange it with the first column. (Observe that this interchange means a reordering of the unknowns, which should be recorded.) If there is no such column, A is already in trapezoidal form (with all elements zero) and the algorithm is stopped.
Substep 1b: Let a_{l,1} be an element in column 1 with the largest absolute value. (In case of a tie, any of the elements with largest absolute value will do.) Equation number l is called the pivot equation.
Substep 1c: If l ≠ 1, interchange equations number 1 and l, including the right-hand side. (This interchange is called pivoting.)
Substep 1d: Eliminate the coefficients in front of x_1 in equations number 2, …, n by subtracting suitable multiples of the pivot equation from the other equations. The transformation of the coefficients in equation number i is performed according to the formulas:

m_i = a_{i,1}/a_{1,1}, a_{i,j}^new = a_{i,j}^old − m_i · a_{1,j}, j = 2, 3, …, m.

An analogous formula gives the transformation of the right-hand side. This operation is carried out for equations 2, 3, …, n.

Step number k: The coefficients in front of x_k are eliminated in equations number k + 1, …, n by means of the following substeps:
Substep k(a): Exceptional case: if all elements in column k and rows number k, …, n are 0, find another column which has a nonzero element in one of these rows and interchange it with column k. (Observe that this interchange means a reordering of the unknowns, which should be recorded.) If there is no such column, A is already in trapezoidal form and the algorithm is stopped.
Substep k(b): Let a_{l,k} be an element in column k with l ≥ k having the largest absolute value. (In case of a tie, any of the elements with largest absolute value will do.) Equation number l is called the pivot equation.
Substep k(c): If l ≠ k, interchange equations number k and l, including the right-hand side. (This interchange is called pivoting.)
Substep k(d): Eliminate the coefficients in front of x_k in equations number k + 1, …, n by subtracting suitable multiples of the pivot equation from the other equations.
The transformation of the coefficients in equation number i is performed according to the formulas:

m_i = a_{i,k}/a_{k,k}, a_{i,j}^new = a_{i,j}^old − m_i · a_{k,j}, j = k + 1, k + 2, …, m.

An analogous formula gives the transformation of the right-hand side. This operation is carried out for equations k + 1, k + 2, …, n.

Example 3.3.1 We want to use Gaussian elimination with partial pivoting for investigating the set of solutions of the linear system

x_1 + 2x_2 + 3x_3 + 4x_4 = 1
2x_1 + 3x_2 + 4x_3 + 5x_4 = 1
3x_1 + 4x_2 + 5x_3 + 6x_4 = 1
4x_1 + 5x_2 + 6x_3 + 7x_4 = 1


Writing the system in matrix form with detached coefficients we get

| 1 2 3 4 | 1 |
| 2 3 4 5 | 1 |
| 3 4 5 6 | 1 |
| 4 5 6 7 | 1 |

The coefficient in front of x_1 has the largest absolute value in the fourth equation. Hence this equation will be used as pivot. We reorder the equations, interchanging the first and fourth equations, and obtain:

| 4 5 6 7 | 1 |
| 2 3 4 5 | 1 |
| 3 4 5 6 | 1 |
| 1 2 3 4 | 1 |

In the new system we multiply the first equation by the factors 0.5, 0.75, 0.25 and subtract from the second, third and fourth equations. The coefficients in front of x_1 become zero and the other coefficients in these equations are transformed according to:
Second equation: 3 − 0.5·5 = 0.5, 4 − 0.5·6 = 1, 5 − 0.5·7 = 1.5; right-hand side: 1 − 0.5·1 = 0.5.
Third equation: 4 − 0.75·5 = 0.25, 5 − 0.75·6 = 0.5, 6 − 0.75·7 = 0.75; right-hand side: 1 − 0.75·1 = 0.25.
Fourth equation: 2 − 0.25·5 = 0.75, 3 − 0.25·6 = 1.5, 4 − 0.25·7 = 2.25; right-hand side: 1 − 0.25·1 = 0.75.
After these row operations the detached form becomes

| 4 5    6   7    | 1    |
| 0 0.5  1   1.5  | 0.5  |
| 0 0.25 0.5 0.75 | 0.25 |
| 0 0.75 1.5 2.25 | 0.75 |

We next want to eliminate the coefficient in front of x_2 in two of the three last equations. We note that it has the largest absolute value in the fourth equation, which should be used as pivot. Thus we interchange the fourth and the second equations, and the detached form then becomes

| 4 5    6   7    | 1    |
| 0 0.75 1.5 2.25 | 0.75 |
| 0 0.25 0.5 0.75 | 0.25 |
| 0 0.5  1   1.5  | 0.5  |

We next multiply the second equation by the factors 0.25/0.75 and 0.5/0.75 and subtract it from the third and fourth equations. Then all coefficients both in


the left-hand side and the right-hand side become zero. Hence these equations become redundant and may be removed without changing the solution set. We finally end up with a system with 4 variables and two equations:

| 4 5    6   7    | 1    |
| 0 0.75 1.5 2.25 | 0.75 |

We may here select x_3 and x_4 arbitrarily and next determine x_2 and x_1 from

0.75x_2 + 1.5x_3 + 2.25x_4 = 0.75

and

4x_1 + 5x_2 + 6x_3 + 7x_4 = 1.

Remark 3.3.2 In Example 3.3.1 we have changed a linear system to a system of trapezoidal form by adding multiples of some equations to others. By Lemma 3.1.1 the set of solution vectors is not changed by such operations. We have chosen x_3 and x_4 arbitrarily, but we would have obtained the same solution set for any choice of two variables such that the other two can be determined uniquely in terms of the arbitrary ones. The elimination in the first part of the example above may also be carried out in different ways, but numerical stability is only guaranteed if we multiply the equations by factors with absolute values no greater than one.

Example 3.3.3 We want to determine the polynomial of degree < 4 which interpolates the function exp(t) at the 4 points −1, −0.5, 0.5, 1. Write the polynomial

Q(t) = a_1 + a_2 t + a_3 t² + a_4 t³.

The interpolation conditions give the following linear system of equations with the coefficients a_1, …, a_4 as unknowns:

| 1 −1   1    −1     | 0.3679 |
| 1 −0.5 0.25 −0.125 | 0.6065 |
| 1  0.5 0.25  0.125 | 1.6487 |
| 1  1   1     1     | 2.7183 |

We eliminate the coefficients in front of a_1 in the second and following equations by subtracting the first equation from the others. Thus the coefficients and right-hand sides change according to:
Second equation: −0.5 − (−1) = 0.5, 0.25 − 1 = −0.75, −0.125 − (−1) = 0.875, 0.6065 − 0.3679 = 0.2386.
Third equation: 0.5 − (−1) = 1.5, 0.25 − 1 = −0.75, 0.125 − (−1) = 1.125, 1.6487 − 0.3679 = 1.2808.
Fourth equation: 1 − (−1) = 2, 1 − 1 = 0, 1 − (−1) = 2, 2.7183 − 0.3679 = 2.3504.


The system then takes the form

| 1 −1  1     −1    | 0.3679 |
| 0 0.5 −0.75 0.875 | 0.2386 |
| 0 1.5 −0.75 1.125 | 1.2808 |
| 0 2   0     2     | 2.3504 |

The coefficient in front of a_2 has the largest absolute value in equation 4, and hence this equation is interchanged with the second one, i.e. we perform a pivoting operation before carrying out the elimination:

| 1 −1  1     −1    | 0.3679 |
| 0 2   0     2     | 2.3504 |
| 0 1.5 −0.75 1.125 | 1.2808 |
| 0 0.5 −0.75 0.875 | 0.2386 |

After the interchange we eliminate the coefficient in front of a_2 in the third and fourth equations. Thus we subtract the second equation multiplied by 1.5/2 = 0.75 from the third equation and by 0.5/2 = 0.25 from the fourth equation. The coefficients and the right-hand sides in the third and fourth equations change according to:
Third equation: −0.75 − 0.75·0 = −0.75, 1.125 − 0.75·2 = −0.375, 1.2808 − 0.75·2.3504 = −0.4820.
Fourth equation: −0.75 − 0.25·0 = −0.75, 0.875 − 0.25·2 = 0.375, 0.2386 − 0.25·2.3504 = −0.3490.
Then we obtain:

| 1 −1 1     −1     | 0.3679  |
| 0 2  0     2      | 2.3504  |
| 0 0  −0.75 −0.375 | −0.4820 |
| 0 0  −0.75 0.375  | −0.3490 |

Before the last elimination step no pivoting is required, and we obtain a triangular system after subtracting the third equation from the fourth:

| 1 −1 1     −1     | 0.3679  |
| 0 2  0     2      | 2.3504  |
| 0 0  −0.75 −0.375 | −0.4820 |
| 0 0  0     0.75   | 0.1330  |

Now back-substitution gives the solution:

a_4 = 0.1330/0.75 = 0.17733
a_3 = (−0.4820 + 0.375·a_4)/(−0.75) = 0.55400
a_2 = (2.3504 − 2a_4)/2 = 0.99787
a_1 = 0.3679 + a_2 − a_3 + a_4 = 0.98910

Hence we have a_1 = 0.98910, a_2 = 0.99787, a_3 = 0.55400, a_4 = 0.17733.
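The scheme of Section 3.3 for a square nonsingular system can be sketched compactly in Python (illustrative only; the notes' own examples are worked by hand or in Fortran). Run on Example 3.3.3 it reproduces the coefficients found above:

```python
def gauss_solve(A, b):
    # Gaussian elimination with partial pivoting, then back-substitution
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for k in range(n):
        # pivot: the row l >= k with the largest |M[l][k]|
        l = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[l] = M[l], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

A = [[1, -1, 1, -1], [1, -0.5, 0.25, -0.125],
     [1, 0.5, 0.25, 0.125], [1, 1, 1, 1]]
b = [0.3679, 0.6065, 1.6487, 2.7183]
a = gauss_solve(A, b)
# should reproduce a_1..a_4 from Example 3.3.3
assert all(abs(u - v) < 1e-4 for u, v in zip(a, [0.98910, 0.99787, 0.55400, 0.17733]))
```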


Chapter 4
Iterative methods for Ax = b

4.1 Introduction

We want to find an approximation to the smallest positive root s of the equation

x = 0.2 + x⁶.   (4.1)

Assuming that 0 < s < 1 we get the first approximation

x_0 = 0.2.

Entering this into the right-hand side we get the next approximation

x_1 = 0.2 + x_0⁶,

and the next

x_2 = 0.2 + x_1⁶,

and hence the general relation

x_{r+1} = 0.2 + x_r⁶,

which generates the infinite sequence x_0, x_1, …, which can be shown to converge towards the root s. Using Maple and working with 10 decimal places we find the numerical values

x_0 = 0.2, x_1 = 0.200064, x_2 = 0.2000641230, x_3 = 0.2000641232,

and hence we accept x_3 as our approximation of s. This is an example of an iterative scheme. In principle, the sequence of approximations is infinite, but we want to find a good approximation to the limit value after a finite number of steps. It is then necessary to estimate the error in the accepted approximation. Iterative schemes have been developed for many classes of problems, including systems of linear and nonlinear equations. The present chapter will be devoted to describing two classical schemes for systems of linear equations.
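The fixed-point iteration above is easy to reproduce; a small Python sketch (the notes used Maple) iterates until successive approximations agree to 12 places:

```python
def phi(x):
    # right-hand side of x = 0.2 + x^6
    return 0.2 + x ** 6

x = 0.2
for _ in range(50):
    x_new = phi(x)
    if abs(x_new - x) < 1e-12:
        break
    x = x_new

# agrees with the value 0.2000641232 found with Maple
assert abs(x - 0.2000641232) < 1e-9
```

Convergence is very fast here because |phi'(x)| ≈ 6 · 0.2⁵ ≈ 0.002 near the root.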


4.2 Jacobi and Gauss-Seidel iterations

4.2.1 Jacobi iteration

We consider again the linear system Ax = b, which we write in the form

x = Bx + f.

The latter is solved by iteration. We discuss only two methods, namely the Jacobi and the Gauss-Seidel schemes. By Jacobi iteration we form the sequence

x^{l+1} = Bx^l + f, l = 0, 1, …,

where the starting approximation may be arbitrary. Often one takes x⁰ = 0. The elements of x^{l+1} are updated according to

x_i^{l+1} = ∑_{j=1}^{n} b_{i,j} x_j^l + f_i.

This process is easily parallelised.

4.2.2 Gauss-Seidel iteration

The Gauss-Seidel iteration differs from the Jacobi one in that one uses the latest available components of x to make the next update. Thus one updates the components of x one at a time, beginning with x_1. When x_2 is updated, one uses the just calculated value of x_1 together with the earlier values of the remaining components. To update x_3 one uses the recent values of x_1, x_2 together with earlier values of x_4, … and so on. The two iteration schemes do not converge for all matrices B, but the following condition is sufficient for convergence of both methods:

max_i ∑_{j=1}^{n} |b_{i,j}| < 1.

Example 4.2.1 We solve the system Ax = b where

    | 10 1  2  |      | 13 |
A = | 3  11 1  |, b = | 15 |.
    | 2  3  12 |      | 17 |

We write this system in the form

x = Bx + f

by solving the first equation with respect to x_1, the second with respect to x_2 and the third with respect to x_3. Then we find

x_1 = (1/10)(13 − x_2 − 2x_3)   (4.2)
x_2 = (1/11)(15 − 3x_1 − x_3)   (4.3)
x_3 = (1/12)(17 − 2x_1 − 3x_2)   (4.4)


or, in matrix form with the coefficients rounded to 6 decimal places:

    | 0         −0.1  −0.2      |      | 1.3      |
B = | −0.272727 0     −0.090909 |, f = | 1.363636 |.
    | −0.166667 −0.25 0         |      | 1.416667 |

Jacobi iteration: We enter the starting vector x_1 = x_2 = x_3 = 0 into the system (4.2), (4.3) and (4.4) and get the first approximation, working with 6 decimal places: x_1 = 1.3, x_2 = 15/11 = 1.363636, x_3 = 17/12 = 1.416667. This vector is entered into the system to give the next approximation:

x_1 = (1/10)(13 − 1.363636 − 2·1.416667) = 0.880303
x_2 = (1/11)(15 − 3·1.3 − 1.416667) = 0.880303
x_3 = (1/12)(17 − 2·1.3 − 3·1.363636) = 0.859091

This procedure is repeated, generating the table below for Jacobi iterations. Each row of the table represents an approximation of the solution vector.

Gauss-Seidel iteration: In the Gauss-Seidel scheme we use the latest calculated value of each component of the solution vector in the updating. We start as before with the zero vector, and as before we update x_1 according to (4.2) and get x_1 = 1.3. (The latest values of x_2 and x_3 are 0.) We next update x_2 with (4.3), now using the latest values of x_1, x_3, namely 1.3 and 0. Hence

x_2 = (1/11)(15 − 3·1.3) = 1.009091.

We use (4.4) to update x_3, using the latest values x_1 = 1.3, x_2 = 1.009091, getting

x_3 = (1/12)(17 − 2·1.3 − 3·1.009091) = 0.947727.

We continue in this way, always using the latest value of each component. Thus in the next iteration

x_1 = (1/10)(13 − 1.009091 − 2·0.947727) = 1.009546
x_2 = (1/11)(15 − 3·1.009546 − 0.947727) = 1.002149
x_3 = (1/12)(17 − 2·1.009546 − 3·1.002149) = 0.997872

Continuing in this way we generate the Gauss-Seidel table.

Remark 4.2.2 If we enter the vector x with x_1 = x_2 = x_3 = 1 into each of the iteration formulas, we get the same vector back. It is hence the fixed point of the two iteration schemes and is also the solution of the given system of equations.


We compare the results of Jacobi and Gauss-Seidel iterations. The following output was obtained from simple Fortran programs implementing these methods and using the null vector as starting vector.

Jacobi iteration
Given system
 10.0000   1.0000   2.0000  13.0000
  3.0000  11.0000   1.0000  15.0000
  2.0000   3.0000  12.0000  17.0000
Starting vector
  .000000  .000000  .000000
Jacobi approximates
 1.300000 1.363636 1.416667
  .880303  .880303  .859091
 1.040151 1.045455 1.049874
  .985480  .984516  .981944
 1.005160 1.005601 1.006291
  .998182  .998021  .997740
 1.000650 1.000701 1.000798
  .999770  .999750  .999716
 1.000082 1.000088 1.000101
  .999971  .999969  .999964
 1.000010 1.000011 1.000013
  .999996  .999996  .999995
 1.000001 1.000001 1.000002
 1.000000  .999999  .999999
 1.000000 1.000000 1.000000

Gauss-Seidel iteration
Given system
 10.000000  1.000000  2.000000  13.000000
  3.000000 11.000000  1.000000  15.000000
  2.000000  3.000000 12.000000  17.000000
Starting vector
  .000000  .000000  .000000
Gauss-Seidel approximates
 1.300000 1.009091  .947727
 1.009546 1.002149  .997872
 1.000211 1.000136  .999931
 1.000000 1.000006  .999998
 1.000000 1.000000 1.000000
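The two schemes can be sketched in a few lines of Python (illustrative; the output above came from Fortran programs). Both converge here because A is strictly diagonally dominant, so the sufficient condition of Section 4.2.2 holds:

```python
A = [[10, 1, 2], [3, 11, 1], [2, 3, 12]]
b = [13, 15, 17]
n = 3

def jacobi_step(x):
    # all components updated from the previous vector
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def gauss_seidel_step(x):
    # components updated in place, latest values used immediately
    x = x[:]
    for i in range(n):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x

xj = xg = [0.0, 0.0, 0.0]
for _ in range(30):
    xj = jacobi_step(xj)
    xg = gauss_seidel_step(xg)

# both converge to the exact solution (1, 1, 1)
assert all(abs(v - 1) < 1e-9 for v in xj + xg)
```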


Chapter 5
Least squares fit

5.1 Normal equations

Definition 5.1.1 Let A be a rectangular matrix with n rows and m columns, x and b column vectors with m and n elements respectively. Then

Ax = b

specifies a linear system of equations with n equations and m unknowns; b is the right-hand side, A the table of coefficients and x a solution vector to be determined.

The system may be inconsistent, have a unique solution or have infinitely many solution vectors. We discuss here the case n > m. Then the system is, as a rule, inconsistent, i.e. one cannot find any vector x such that Ax = b exactly. Instead, one seeks a vector x which minimises the square of the length of the deviation vector. Hence we want to minimise

(Ax − b)ᵀ(Ax − b).

This problem always has a solution, which satisfies the normal equations

AᵀAx = Aᵀb.

This system may be solved using Gaussian elimination, and it can be verified that pivoting is not required for stability. We describe next how to form the elements of the matrix C = AᵀA and the vector f = Aᵀb. Using the definitions of the matrix-by-matrix and matrix-by-vector products we find the formulas:

c_{i,j} = ∑_{l=1}^{n} a_{l,i} a_{l,j}, f_i = ∑_{l=1}^{n} a_{l,i} b_l.


Thus the element c_{i,j} is the scalar product of columns number i and j of the matrix A, while f_i is the scalar product of column number i of A and the original right-hand side b. Since the resulting system AᵀAx = Aᵀb, often called the normal equations, is always consistent, it may have a unique solution or infinitely many solution vectors.

Example 5.1.2 We consider the same example as in Chapter 3, namely the task of approximating exp(t) with a polynomial of degree less than 4. Now we do a least squares fit with respect to the set of the five points −1, −0.5, 0, 0.5, 1. Thus we get an over-determined linear system with 5 equations and 4 unknowns. The following output was obtained:

given system
1.000000 -1.000000 1.000000 -1.000000  .367900
1.000000  -.500000  .250000  -.125000  .606500
1.000000   .000000  .000000   .000000 1.000000
1.000000   .500000  .250000   .125000 1.648700
1.000000  1.000000 1.000000  1.000000 2.718300

normal equations:
5.000000  .000000 2.500000  .000000 6.341400
 .000000 2.500000  .000000 2.125000 2.871500
2.500000  .000000 2.125000  .000000 3.650000
 .000000 2.125000  .000000 2.031250 2.480675

triangular system
5.000000  .000000 2.500000  .000000 6.341400
 .000000 2.500000  .000000 2.125000 2.871500
 .000000  .000000  .875000  .000000  .479300
 .000000  .000000  .000000  .225000  .039900

solution vector
 .994394  .997867  .547771  .177333

The coefficients c_{i,j} of the left-hand side and f_i of the right-hand side of the normal equations are obtained by forming the scalar products of the columns of the given (overdetermined) system. Thus the coefficients of the first normal equation are the scalar products of the first column with the other columns and with the right-hand side:
c_{1,1} is the scalar product of the first column with itself: c_{1,1} = 1 + 1 + 1 + 1 + 1 = 5.
c_{1,2} is the scalar product of the first and second columns: c_{1,2} = −1 − 0.5 + 0 + 0.5 + 1 = 0.
c_{1,3} is the scalar product of the first and third columns: c_{1,3} = 1 + 0.25 + 0 + 0.25 + 1 = 2.5.


c_{1,4} is the scalar product of the first and fourth columns: c_{1,4} = −1 − 0.125 + 0 + 0.125 + 1 = 0.
The right-hand side: f_1 is the scalar product of the first column and the right-hand side vector:

f_1 = 0.3679 + 0.6065 + 1 + 1.6487 + 2.7183 = 6.3414.

We form the second normal equation by forming scalar products with the second column. We observe that

c_{i,k} = c_{k,i},

thus c_{2,1} = c_{1,2} = 0.
c_{2,2} is the second column multiplied by itself: c_{2,2} = 1 + 0.25 + 0 + 0.25 + 1 = 2.5.
c_{2,3} is the second column multiplied by the third: c_{2,3} = −1 − 0.125 − 0 + 0.125 + 1 = 0.
c_{2,4} is the second column multiplied by the fourth: c_{2,4} = 1 + 0.0625 + 0 + 0.0625 + 1 = 2.125.
The right-hand side: f_2 is the scalar product of the second column and the right-hand side vector:

f_2 = −0.3679 − 0.30325 + 0 + 0.82435 + 2.7183 = 2.8715.

The third normal equation is obtained by forming scalar products with the third column, observing that c_{3,1} = c_{1,3}, c_{3,2} = c_{2,3}. The fourth normal equation results from forming scalar products with the fourth column. The normal equations are brought to triangular form by Gaussian elimination and afterwards solved by back-substitution. Alternative methods which are more efficient are often used, but they are not presented here due to space limitations.
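Forming C = AᵀA and f = Aᵀb by column scalar products, exactly as above, can be sketched in Python (illustrative only):

```python
def normal_equations(A, b):
    # C = A^T A and f = A^T b via scalar products of columns
    n, m = len(A), len(A[0])
    C = [[sum(A[l][i] * A[l][j] for l in range(n)) for j in range(m)]
         for i in range(m)]
    f = [sum(A[l][i] * b[l] for l in range(n)) for i in range(m)]
    return C, f

t = [-1, -0.5, 0, 0.5, 1]
y = [0.3679, 0.6065, 1.0, 1.6487, 2.7183]
A = [[1, ti, ti ** 2, ti ** 3] for ti in t]   # fit a cubic to exp(t)
C, f = normal_equations(A, y)

# first normal equation matches the hand computation above
assert C[0] == [5, 0, 2.5, 0]
assert abs(f[0] - 6.3414) < 1e-9
```

The resulting 4-by-4 system can then be solved by Gaussian elimination as in Chapter 3.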


Chapter 6
On interpolation with polynomials

6.1 Spaces of polynomials

6.1.1 On polynomials as approximating functions

In this chapter we will discuss some linear spaces often encountered in numerical work. As written earlier, we need to approximate a given function f by an expression g which may be evaluated with a finite number of arithmetic operations and logical choices. Sometimes we want to perform further operations, like integration or differentiation, on g, and it is an advantage if these latter operations can be carried out easily, or even analytically. As an example, assume that we have constructed g such that

f(t) ≈ g(t), −1 ≤ t ≤ 1.

We seek an approximation for the integral of f and make the approximation

∫_{−1}^{1} f(t) dt ≈ ∫_{−1}^{1} g(t) dt.

Hence it is desirable that the integral on the right-hand side may be evaluated more easily than that on the left-hand side. A common choice is to construct a polynomial g to approximate f. In the sequel we will present some systematic methods for doing this. We will need to describe some useful properties of polynomials. We begin with

Example 6.1.1 Let E_5 be the linear space of polynomials of degree < 5. We want to give some examples of bases for this space, which has dimension 5. Let now {t_1, t_2, t_3, t_4, t_5} be 5 distinct real points such that

t_1 < t_2 < t_3 < t_4 < t_5.


Put

P(t) = ∏_{i=1}^{5} (t − t_i),

and

L_i(t) = P(t)/(t − t_i), i = 1, …, 5.

Then the following three sets of functions are bases for the space E_5:
• 1, t, t², t³, t⁴,
• L_1, L_2, L_3, L_4, L_5,
• T_0, T_1, T_2, T_3, T_4,
where

T_0(t) = 1, T_1(t) = t, T_{r+1}(t) = 2t·T_r(t) − T_{r−1}(t), r = 1, 2, 3.

See [5], page 104.

6.1.2 The interpolation problem

We next establish the following result:

Theorem 6.1.2 Let n distinct real points t_i and n numbers y_i, i = 1, …, n, be given. Then there is a unique polynomial Q of degree < n such that

Q(t_i) = y_i, i = 1, …, n.   (6.1)

Proof: Existence of the interpolating polynomial Q: Form the sequence of polynomials according to

N_1(t) = 1, N_{i+1}(t) = (t − t_i) · N_i(t), i = 1, …, n − 1.

Next put

Q(t) = ∑_{i=1}^{n} c_i N_i(t),

and determine the coefficients c_i such that Q(t_i) = y_i, i = 1, …, n. We arrive at a linear system of equations of triangular form whose diagonal terms are different from 0. Thus this system may be solved, defining an interpolating


polynomial Q. We next prove uniqueness:
Let Q, R be two interpolating polynomials of degree < n and set

    S(t) = Q(t) − R(t).

Thus S is a polynomial of degree < n such that

    S(t_i) = 0,   i = 1, ..., n.

Thus S is of degree < n having n distinct zeros. This is only possible if S is identically 0, establishing uniqueness. •

6.1.3 Interpolation formula with remainder

Theorem 6.1.3 Let I = [a, b] be a real interval, let f be a function defined on I having n continuous derivatives there, and let t_1 < t_2 < ... < t_n be n distinct points in I. Let also Q be the polynomial of degree < n satisfying

    Q(t_i) = f(t_i),   i = 1, 2, ..., n.

Then we establish

    f(x) = Q(x) + K(x)P(x),   P(x) = ∏_{i=1}^{n} (x − t_i),   K(x) = f^(n)(ξ)/n!,

where ξ depends on x.

Proof: If x coincides with one of the points t_i, it is seen at once that the statement is true. We treat next the case when x is different from all of these n points. Let now x be fixed. Then we have

    f(x) = Q(x) + K(x)P(x),   K(x) = (f(x) − Q(x))/P(x).

Next form the new function F, given by

    F(t) = f(t) − Q(t) − K(x)P(t).

We verify that F is 0 at t_1, ..., t_n, x and hence F has in total n + 1 zeros. By repeated application of Rolle's theorem F^(n) has a zero at a point which we denote ξ. Differentiating n times with respect to t we arrive at

    F^(n)(t) = f^(n)(t) − K(x)·n!.

Using the fact that F^(n)(ξ) = 0 we finally get

    K(x) = f^(n)(ξ)/n!,

as claimed. •

Remark 6.1.4 Only in special cases are the higher-order derivatives available as useful formulas, a fact which often causes difficulties in practical situations.
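The existence proof of Theorem 6.1.2 is constructive: the interpolation conditions give a lower-triangular linear system for the coefficients c_i, which forward substitution solves directly. The sketch below is our own minimal Python illustration of this construction (the function name is not from the notes):

```python
def newton_basis_coeffs(t, y):
    """Coefficients c_i of Q(t) = sum_i c_i N_i(t), where N_1(t) = 1 and
    N_{i+1}(t) = (t - t_i) N_i(t), obtained by forward substitution in
    the triangular system Q(t_i) = y_i."""
    c = []
    for i in range(len(t)):
        s, basis = 0.0, 1.0          # basis holds N_j(t_i) as j runs
        for j in range(i):
            s += c[j] * basis
            basis *= t[i] - t[j]
        # now basis = N_i(t_i), the nonzero diagonal entry of the system
        c.append((y[i] - s) / basis)
    return c
```

The resulting coefficients are exactly the divided differences f[t_1, ..., t_i] that reappear in Newton's formula in Section 6.4.3.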


6.2 On the choice of nodes

If the t_i in (6.1) are the scaled Chebyshev points given by

    t_i = (a + b)/2 + ((b − a)/2) cos(θ_i),   θ_i = π(i − 1/2)/n,   i = 1, 2, ..., n,

then

    |f(t) − Q(t)| ≤ 2·((b − a)/4)^n · max_{a≤t≤b} |f^(n)(t)|/n!.

The numerical interpolation problem generally becomes more stable if a change of variable is performed such that the interpolation interval becomes [−c, c]. Often c is taken to be 1. Thus, if t is the original variable, lying in the interval [a, b], we introduce the new variable u in the interval [−1, 1] by means of the formula

    t = (a + b)/2 + ((b − a)/2) u.

6.3 Linear interpolation

Take n = 2 in (6.1). One verifies directly that Q may be written in one of the two equivalent forms

    Q(t) = f(t_1) + (t − t_1)(f(t_2) − f(t_1))/(t_2 − t_1),   (6.2)

    Q(t) = f(t_1)(t_2 − t)/(t_2 − t_1) + f(t_2)(t − t_1)/(t_2 − t_1),   (6.3)

and the remainder becomes

    R(t) = (t − t_1)(t − t_2) f''(ξ)/2,   t_1 ≤ ξ ≤ t_2.

Using the fact that

    |(t − t_1)(t − t_2)| ≤ (t_2 − t_1)^2/4,   t_1 ≤ t ≤ t_2,

we get the error bound for linear interpolation

    |R(t)| ≤ ((t_2 − t_1)^2/8) · max_{t_1≤t≤t_2} |f''(t)|,   if t_1 ≤ t ≤ t_2.

6.4 General interpolation formulas

6.4.1 Linear systems for coefficients of interpolating polynomials in power form

The interpolation problem may always be solved by using the interpolation conditions to formulate a linear system of equations whose solution defines the interpolating formula sought. This approach is illustrated by Example 3.3.3. Since the structure of this system is special, its solution may also be obtained using special formulas associated with the names of Lagrange and Newton. The use of these formulas is illustrated in Example 6.4.5 below.
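The scaled Chebyshev points are straightforward to generate. A short Python sketch of our own (the function name is not from the notes):

```python
import math

def chebyshev_nodes(a, b, n):
    """Scaled Chebyshev points t_i = (a+b)/2 + ((b-a)/2)*cos(theta_i),
    with theta_i = pi*(i - 1/2)/n, for i = 1, ..., n."""
    return [(a + b) / 2.0 + (b - a) / 2.0 * math.cos(math.pi * (i - 0.5) / n)
            for i in range(1, n + 1)]
```

Note that the nodes come out in decreasing order and cluster towards the endpoints of [a, b], which is what damps the growth of the factor P(t) in the remainder.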


6.4.2 Lagrange's formula

Theorem 6.4.1 (Lagrange's formula) Let the n nodes t_i be distinct. Then Lagrange's formula for the interpolating polynomial Q reads

    Q(t) = Σ_{i=1}^{n} f(t_i) P(t)/((t − t_i)P'(t_i)),   P(t) = ∏_{i=1}^{n} (t − t_i).   (6.4)

Proof. Set

    p_i(t) = P(t)/(t − t_i).

Using L'Hôpital's rule we find

    lim_{t→t_i} p_i(t) = P'(t_i).

For k ≠ i we conclude that p_i(t_k) = 0. Since the points t_i are distinct, we have that P'(t_i) ≠ 0. Also, p_i is a polynomial of exact degree n − 1. Thus (6.4) defines the interpolating polynomial as claimed. •

Remark 6.4.2 Explicit form of Lagrange's formula in a special case: We write

    Q(t) = Σ_{i=1}^{n} f(t_i) L_i(t),   L_i(t) = P(t)/((t − t_i)P'(t_i)).

We give the L_i for n = 4:

    L_1(t) = (t − t_2)(t − t_3)(t − t_4) / ((t_1 − t_2)(t_1 − t_3)(t_1 − t_4)),

    L_2(t) = (t − t_1)(t − t_3)(t − t_4) / ((t_2 − t_1)(t_2 − t_3)(t_2 − t_4)),

    L_3(t) = (t − t_1)(t − t_2)(t − t_4) / ((t_3 − t_1)(t_3 − t_2)(t_3 − t_4)),

    L_4(t) = (t − t_1)(t − t_2)(t − t_3) / ((t_4 − t_1)(t_4 − t_2)(t_4 − t_3)).

6.4.3 Newton's formula with divided differences

Divided differences with distinct arguments:
We define the divided differences recursively as follows. One argument:

    f[t_i] = f(t_i),   i = 1, 2, ..., n.

Two arguments:

    f[t_i, t_j] = (f[t_j] − f[t_i])/(t_j − t_i).


k arguments:

    f[t_{i_1}, ..., t_{i_k}] = (f[t_{i_2}, ..., t_{i_k}] − f[t_{i_1}, ..., t_{i_{k−1}}]) / (t_{i_k} − t_{i_1}).

The divided differences are symmetric functions of their arguments. We find from the definition above:

    f[t_1, t_2] = f(t_1)/(t_1 − t_2) + f(t_2)/(t_2 − t_1),

    f[t_1, t_2, t_3] = f(t_1)/((t_1 − t_2)(t_1 − t_3)) + f(t_2)/((t_2 − t_1)(t_2 − t_3)) + f(t_3)/((t_3 − t_1)(t_3 − t_2)).

It is now easy to derive a similar formula for a divided difference with n arguments and verify its correctness by induction.

Theorem 6.4.3 (Newton's formula) Let f be defined on [a, b] and let a ≤ t_1 < t_2 < ... < t_n ≤ b. Define Q by (6.1). Then the following relations hold:

    Q(t) = f[t_1] + (t − t_1)f[t_1, t_2] + (t − t_1)(t − t_2)f[t_1, t_2, t_3] + ...
         + (t − t_1)(t − t_2)·...·(t − t_{n−1})f[t_1, ..., t_n],

    f(t) = Q(t) + P(t)f[t, t_1, t_2, ..., t_n].

Proof: Let t be a fixed point in [a, b]. We will use the definition above of divided differences and also the fact that a reordering of the arguments does not change the value of a given divided difference. Since

    f[t_1] = f(t_1),   f[t_1, t] = (f[t] − f[t_1])/(t − t_1),

we have

    f(t) = f[t_1] + (t − t_1)f[t_1, t].

Next

    f[t_1, t_2, t] = (f[t_1, t] − f[t_1, t_2])/(t − t_2),

giving

    f[t_1, t] = f[t_1, t_2] + (t − t_2)f[t_1, t_2, t],

and

    f(t) = f[t_1] + (t − t_1)f[t_1, t_2] + (t − t_1)(t − t_2)f[t_1, t_2, t].

We now write

    f[t_1, t_2, t_3, t] = (f[t_1, t_2, t] − f[t_1, t_2, t_3])/(t − t_3),

to obtain

    f[t_1, t_2, t] = f[t_1, t_2, t_3] + (t − t_3)f[t_1, t_2, t_3, t],


resulting in

    f(t) = f[t_1] + (t − t_1)f[t_1, t_2] + (t − t_1)(t − t_2)f[t_1, t_2, t_3]
         + (t − t_1)(t − t_2)(t − t_3)f[t_1, t_2, t_3, t].

These operations are continued in an analogous manner to give the desired result. •

Example 6.4.4 For the special function f(t) = 1/(1 − xt), x constant, we have for general t_i:

    1/(1 − xt) = Q(t) + P(t)/((1 − xt)P(1/x)).

We give now a numerical example illustrating the use of Lagrange's and Newton's interpolation formulas:

Example 6.4.5 The following table of the function f is given:

    x     f(x)
    −2    4.6000
     0    3.0000
     1    4.0000
     2    6.2000

Construct the polynomial of degree < 4 which interpolates f at the 4 points.
a) Use Lagrange's interpolation formula.
b) Use Newton's interpolation formula with divided differences.

a) Lagrange's formula: We get

    Q(x) = 4.6 x(x − 1)(x − 2)/((−2 − 0)(−2 − 1)(−2 − 2))
         + 3.0 (x + 2)(x − 1)(x − 2)/((0 + 2)(0 − 1)(0 − 2))
         + 4.0 (x + 2)x(x − 2)/((1 + 2)·1·(1 − 2))
         + 6.2 (x + 2)x(x − 1)/((2 + 2)·2·(2 − 1)),

    Q(x) = −4.6 x(x^2 − 3x + 2)/24 + 3.0 (x − 1)(x^2 − 4)/4
         − 4.0 x(x^2 − 4)/3 + 6.2 x(x^2 + x − 2)/8
         = 0.6x^2 + 0.4x + 3.

b) Newton's interpolation formula with divided differences:
Table of divided differences:

    −2   4.6
     0   3.0   −0.8
     1   4.0    1.0   0.6
     2   6.2    2.2   0.6   0


    Q(x) = 4.6 − 0.8(x + 2) + 0.6x(x + 2) = 0.6x^2 + 0.4x + 3.

Alternative way of finding the divided differences: Set

    Q(x) = c_0 + c_1(x + 2) + c_2 x(x + 2) + c_3 x(x + 2)(x − 1).

The coefficients c_0, c_1, c_2, c_3 are the divided differences needed. We form the linear system of equations below using the interpolatory conditions on the values of Q(x):

    c_0 = 4.6,
    c_0 + (0 + 2)c_1 = 3.0,
    c_0 + (1 + 2)c_1 + 1·(1 + 2)c_2 = 4.0,
    c_0 + (2 + 2)c_1 + 2·(2 + 2)c_2 + 2·(2 + 2)(2 − 1)c_3 = 6.2.

The system may be written in the simpler form

    c_0 = 4.6,
    c_0 + 2c_1 = 3.0,
    c_0 + 3c_1 + 3c_2 = 4.0,
    c_0 + 4c_1 + 8c_2 + 8c_3 = 6.2.

This gives

    c_0 = 4.6,
    c_1 = (3.0 − c_0)/2 = (3.0 − 4.6)/2 = −0.8,
    c_2 = (4.0 − c_0 − 3c_1)/3 = (4.0 − 4.6 − 3·(−0.8))/3 = 0.6,
    c_3 = (6.2 − c_0 − 4c_1 − 8c_2)/8 = (6.2 − 4.6 − 4·(−0.8) − 8·0.6)/8 = 0.
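The divided-difference table and the nested (Horner-like) evaluation of Newton's form are easy to program. The sketch below is our own Python illustration (function names ours); it reproduces the table of Example 6.4.5:

```python
def divided_differences(t, y):
    """Return f[t_1], f[t_1,t_2], ..., f[t_1,...,t_n], i.e. the top
    diagonal of the divided-difference table, computed in place."""
    coef = list(y)
    n = len(t)
    for k in range(1, n):
        # update from the bottom up so entries of order k-1 are reused
        for i in range(n - 1, k - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (t[i] - t[i - k])
    return coef

def newton_eval(t, coef, x):
    """Evaluate Newton's form of Q at x by nested multiplication."""
    q = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        q = q * (x - t[i]) + coef[i]
    return q
```

With the data of Example 6.4.5 the coefficients come out as 4.6, −0.8, 0.6, 0, and newton_eval reproduces Q(x) = 0.6x^2 + 0.4x + 3.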


Chapter 7

On error propagation

7.1 Absolute and relative errors in input data

Example 7.1.1 Assume that we seek the value of a certain number s which satisfies the inequality

    a ≤ s ≤ b.

Then we may estimate s with the approximate value s* where

    s* = (a + b)/2.

We note that

    |s − s*| ≤ (b − a)/2.

Definition 7.1.2 Let x* be an approximation for x. Then

    δx = x* − x

is called the absolute error, and if x ≠ 0 then

    δx/|x| ≈ δx/|x*|

is termed the relative error.

Remark 7.1.3 We note that the true value x occurs in the definition of the relative error. However, x is unknown in general and one needs to approximate x with the known approximation x*.

Definition 7.1.4 If ∆x is a known number such that

    ∆x ≥ |δx|,

then ∆x is an absolute error bound and

    ∆x/|x| ≈ ∆x/|x*|


is a relative error bound if x ≠ 0. We may also say that x is known with the absolute uncertainty ∆x and relative uncertainty ∆x/|x|.

Remark 7.1.5 Error bounds are normally evaluated with a relative accuracy of about 1% and given with 2 significant figures.

Example 7.1.6 Assume that 0 < a ≤ s ≤ b where a and b are known numbers. Then as before we have

    s* = (a + b)/2,

with the absolute error bound

    |s − s*| ≤ ∆s = (b − a)/2,

and the relative error bound

    ∆s/|s| ≤ (b − a)/(2a).

Example 7.1.7 Put

    s = e·√2.

Then s rounded to 4 decimal places is s* = 3.8442. Hence

    3.84415 ≤ s ≤ 3.84425.

Thus the absolute uncertainty ∆s is 0.5·10^−4 and the relative uncertainty is ∆s/3.84415 = 1.3·10^−5. However, s rounded to 6 decimal places is 3.844231, and hence the absolute error of s* is |s* − s| ≈ 0.31·10^−4 and the corresponding relative error becomes about 0.81·10^−5. (In most common situations the error itself is unknown but an error bound is available.)

Example 7.1.8
a) Assume that s, correctly rounded to 3 decimal places, has the value

    s* = 3.456.

Then we have

    3.4555 ≤ s ≤ 3.4565.

We then get

    |δs| ≤ 0.0005,

and

    |δs|/s ≤ 0.0005/3.4555 ≈ 0.00014 = 1.4·10^−4.

b) Let x, correctly rounded, be

    x* = 0.248·10^7.

We now have

    0.2475·10^7 ≤ x ≤ 0.2485·10^7.

Thus

    ∆x ≤ 0.0005·10^7 = 5·10^3 = 5000,

and

    ∆x/x ≤ 0.0005/0.2475 = 0.0020.
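For a value known to be correctly rounded, the bounds used in these examples can be computed mechanically. A small Python helper of our own (not from the notes), returning the absolute bound 0.5·10^−d and the approximate relative bound ∆x/|x*|:

```python
def rounding_bounds(x_star, decimals):
    """Absolute and (approximate) relative error bounds for a value
    x_star that is correctly rounded to `decimals` decimal places."""
    dx = 0.5 * 10.0 ** (-decimals)
    return dx, dx / abs(x_star)
```

For Example 7.1.8 a), rounding_bounds(3.456, 3) gives ∆s = 0.0005 and a relative bound of about 1.4·10^−4.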


7.2 Error propagation during arithmetic operations

7.2.1 Addition and subtraction

Let

    x_i* = x_i + ε_i,   |ε_i| ≤ ∆x_i,   i = 1, 2, ..., n,

and let a_i, i = 1, 2, ..., n, be known exactly. Put

    s_n = Σ_{i=1}^{n} a_i x_i   and   s_n* = Σ_{i=1}^{n} a_i x_i*,

and seek a bound for |s_n* − s_n|. We have

    s_n* = Σ_{i=1}^{n} a_i x_i* = Σ_{i=1}^{n} a_i (x_i + ε_i),

implying

    s_n* = s_n + Σ_{i=1}^{n} a_i ε_i.

Hence

    |s_n − s_n*| = |Σ_{i=1}^{n} a_i ε_i| ≤ Σ_{i=1}^{n} |a_i||ε_i| ≤ Σ_{i=1}^{n} |a_i|∆x_i,

and the relative bound becomes

    |s_n − s_n*|/|s_n| ≤ Σ_{i=1}^{n} |a_i|∆x_i / |s_n|.

Example 7.2.1 Let x_1 and x_2, correctly rounded to 3 decimal places, be

    x_1* = 3.125,   x_2* = 3.123.

Put

    s_2 = x_1 − x_2.

We have

    3.1245 − 3.1235 ≤ s_2 ≤ 3.1255 − 3.1225,

or

    0.0010 ≤ s_2 ≤ 0.0030.

Thus s_2 = 0.0020 ± 0.0010 and the relative uncertainty becomes

    0.0010/0.0020 = 0.5.

Note that the relative uncertainty of x_2* is only

    0.0005/3.123 ≈ 1.6·10^−4.
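The bound Σ|a_i|∆x_i derived above translates into one line of code. A sketch of our own (function name not from the notes), applied to the subtraction of Example 7.2.1:

```python
def linear_comb_bound(a, dx):
    """Uncertainty bound sum_i |a_i| * Dx_i for s_n = sum_i a_i x_i,
    when each x_i carries the absolute uncertainty Dx_i."""
    return sum(abs(ai) * dxi for ai, dxi in zip(a, dx))
```

For s_2 = x_1 − x_2 with ∆x_1 = ∆x_2 = 0.0005 the bound is 0.0010, matching the example; the large *relative* uncertainty there comes from the cancellation in the subtraction, not from the bound itself.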


7.2.2 Multiplication

Put

    s = x_1 · x_2,

where

    0 < a_1 ≤ x_1 ≤ b_1,   0 < a_2 ≤ x_2 ≤ b_2.

Next put a = a_1 a_2, b = b_1 b_2. The optimal estimate, giving minimal bounds for the uncertainty of s, is given by

    s* = (a + b)/2,

with the error bound

    |s − s*| ≤ (b − a)/2.

However, to find s* we need to carry out two multiplications instead of just one, and hence our error bound costs as much in work as the actual result. Instead we use the following estimate:

    s̄ = x̄_1 · x̄_2,

where

    x̄_1 = (a_1 + b_1)/2,   x̄_2 = (a_2 + b_2)/2.

We need to get an expression for the error associated with this estimate. To this end we evaluate the difference s̄ − s*. Put

    δx_1 = (b_1 − a_1)/2,   δx_2 = (b_2 − a_2)/2.

Thus we have

    a_1 = x̄_1 − δx_1,   b_1 = x̄_1 + δx_1,
    a_2 = x̄_2 − δx_2,   b_2 = x̄_2 + δx_2.

Multiplying these relations we obtain

    a = a_1 a_2 = x̄_1·x̄_2 − x̄_1·δx_2 − x̄_2·δx_1 + δx_1·δx_2,
    b = b_1 b_2 = x̄_1·x̄_2 + x̄_1·δx_2 + x̄_2·δx_1 + δx_1·δx_2.

From the last relations we get

    s* = (a + b)/2 = s̄ + δx_1·δx_2.

We next want to bound the error incurred when s̄ is used as an estimate for s. We find that

    |s − s̄| ≤ max(d_1, d_2),

where

    d_1 = x̄_1·x̄_2 − a,


and

    d_2 = b − x̄_1·x̄_2,

with x̄_i = (a_i + b_i)/2 the midpoint estimates. Hence

    d_1 = x̄_1·x̄_2 − (x̄_1 − δx_1)(x̄_2 − δx_2) = x̄_1 δx_2 + x̄_2 δx_1 − δx_1·δx_2,
    d_2 = (x̄_1 + δx_1)(x̄_2 + δx_2) − x̄_1·x̄_2 = x̄_1 δx_2 + x̄_2 δx_1 + δx_1·δx_2.

Thus we always have

    0 ≤ d_1 ≤ d_2.

We find the following approximate expression for the relative error when s̄ is used as an approximation for s:

    d_2/s̄ = d_2/(x̄_1 x̄_2) = δx_1/x̄_1 + δx_2/x̄_2 + (δx_1/x̄_1)·(δx_2/x̄_2).

Hence we conclude that when the relative errors in the factors x_1 and x_2 are so small that the product of these errors may be neglected, the relative error in the estimate of the product s is the sum of the relative errors in the factors, and the same relation holds for the error bounds.

7.2.3 Division

Let x_1 and x_2 be positive numbers which are defined as before, but put

    z = x_1/x_2.

Arguing in a similar way as in the preceding subsection we may prove that the relative error in the estimated quotient may be considered to be the sum of the relative errors in the estimates of the two numbers.

7.3 Monotonic functions of one variable

Let f be a function which is positive and increasing on the interval [a, b]. We need to estimate f(x) when it is known that a ≤ x ≤ b. Let z be the value sought. Arguing as before we immediately have

    f(a) ≤ z ≤ f(b),

and we get the two approximations z* and z̄, where

    z* = (f(a) + f(b))/2,   z̄ = f((a + b)/2).

We note that the first estimate requires that the function f be evaluated at the two points a and b, while for the second only one function evaluation is called for. The uncertainty of the first approximation is

    (f(b) − f(a))/2.
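The two estimates, the endpoint average and the midpoint value, are compared numerically in Section 7.3.1 below for f(x) = √(3 + x) on [1.456, 1.457]. A short Python sketch of our own of that comparison:

```python
import math

def two_estimates(f, a, b):
    """Endpoint-average estimate z* and midpoint estimate z_bar for the
    value of an increasing function f when a <= x <= b."""
    z_star = (f(a) + f(b)) / 2.0   # two function evaluations
    z_bar = f((a + b) / 2.0)       # one function evaluation
    return z_star, z_bar

z_star, z_bar = two_estimates(lambda x: math.sqrt(3.0 + x), 1.456, 1.457)
```

Both estimates come out as 2.11104240 to 8 decimal places, while the midpoint estimate needs only half the work.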


We now assume that f has a continuous derivative and use the mean-value theorem of differentiation to estimate the error associated with the second approximation, observing that

    |x − (a + b)/2| ≤ (b − a)/2.

Hence

    |f(x) − f((a + b)/2)| ≤ ((b − a)/2) max_{a≤ξ≤b} |f'(ξ)| ≈ ((b − a)/2) |f'((a + b)/2)|.

7.3.1 Numerical example

Let f(x) = √(3 + x), a = 1.456, b = 1.457. We get

    f(a) = √(3 + 1.456) = √4.456 = 2.11092397,
    f(b) = √(3 + 1.457) = √4.457 = 2.11116082.

Hence we get the estimates

    z̄ = f((a + b)/2) = √4.4565 = 2.11104240,
    z* = (2.11092397 + 2.11116082)/2 = 2.11104240.

We note that the two estimates have the first 8 decimal places in common. Thus

    3 + x = 4.4565 ± 0.0005,   z = 2.11104 ± 0.00012.

7.3.2 Estimate for a monotonic function with two continuous derivatives

Let f be a function which is increasing and has two continuous derivatives on the interval [a, b]. As before we have

    z* = (f(a) + f(b))/2,   z̄ = f((a + b)/2).   (7.1)

We want to derive an expression for the difference between these two estimates. We write

    x̄ = (a + b)/2,   ∆x = (b − a)/2,

and hence

    a = x̄ − ∆x,   b = x̄ + ∆x.

Next put

    d = z* − z̄.

We immediately find

    d = (f(x̄ + ∆x) − 2f(x̄) + f(x̄ − ∆x))/2.


Thus the difference d between the two estimates is half the second difference of the function f with respect to the three equidistant points a, (a + b)/2, b. We next show that

    d = ((∆x)^2/2) f''(ξ),   a ≤ ξ ≤ b.

The number ξ is unknown but must belong to the specified interval. For the proof of this statement we will use the mean-value theorems in 13.1.9. We now define the function u by means of the expression

    u(h) = (f(x̄ + h) − 2f(x̄) + f(x̄ − h))/2.

We observe that u(∆x) = d and u(0) = 0. We next differentiate u with respect to h and get

    u'(h) = (f'(x̄ + h) − f'(x̄ − h))/2.

Using the mean-value theorem of differentiation we get

    u'(h) = h f''(ξ(h)),

where ξ(h) defines a continuous function. Since u(0) = 0 we also have

    u(h) = ∫_0^h u'(v) dv = ∫_0^h v f''(ξ(v)) dv = (h^2/2) f''(ξ*),

where ξ* is a point in [a, b]; here we use the mean-value theorem of integration. We thus have for the difference d between the two estimates

    d = u(∆x) = ((∆x)^2/2) f''(ξ*) ≈ ((∆x)^2/2) f''((a + b)/2).

For ∆x sufficiently small this difference may be neglected and the second approximation z̄ in (7.1) may be used. We thus have the estimate for the uncertainty in the argument:

    |x − x̄| ≤ (b − a)/2,

and for the uncertainty in the functional value:

    |z − z̄| ≈ |f'(x̄)| (b − a)/2.

7.4 Error in estimate of extreme value of a function of one variable

We consider now the problem of calculating the maximum value of a function which is defined and has two continuous derivatives on the interval [a, b]. The maximum value is attained either at the end-points a, b or at a point s such that f'(s) = 0. Thus we need to find all these points and compare the corresponding


functional values. Let s* be an approximation for s such that |s − s*| ≤ ∆s, where ∆s is a known number. We assume also that s − a > ∆s and b − s > ∆s. Put z = f(s) and z* = f(s*). We thus need to estimate the error

    δz = f(s*) − f(s).

We now use that f'(s) = 0 and that

    f(s*) = f(s) + ∫_s^{s*} f'(u) du,

and hence that

    δz = ∫_s^{s*} f'(u) du.

Integrating by parts we get

    δz = [f'(u)(u − s*)]_s^{s*} − ∫_s^{s*} (u − s*) f''(u) du = −∫_s^{s*} (u − s*) f''(u) du,

since the boundary term vanishes: at u = s* the factor u − s* is 0, and at u = s we have f'(s) = 0. Since the factor u − s* does not change sign over the interval of integration, we may use the mean-value theorem of integration to arrive at

    δz = ((s − s*)^2/2) f''(ξ),   ξ between s and s*,

and hence the uncertainty becomes

    ∆z ≈ ((∆s)^2/2) |f''(s*)|.

7.4.1 Numerical example of evaluating a maximum value

We consider the function

    f(x) = πx − x^3/3,   x ≥ 0.

This function has a global maximum for x = s = √π and we use the approximation s* = 1.772. Since s* is the value of s correctly rounded to 3 decimal places, we have

    |s − s*| ≤ 0.0005.

We find

    f(s*) = π·1.772 − (1.772)^3/3 = 3.71221830,

and

    f(s) = π√π − (√π)^3/3 = 2π√π/3 = 3.71221867.


In this case we have found that

    |f(s) − f(s*)| = 0.00000037,

and thus the error in the calculated functional value was much less than the uncertainty in the argument. Let z denote the functional value sought and denote by δz the error in z. In the same way let δs be the error in the argument. Since f'(s) = 0, we have the estimate

    δz ≈ ((δs)^2/2) |f''(s)| ≈ ((δs)^2/2) |f''(s*)|.

We have

    f''(x) = −2x,   s* = 1.772,   |δs| ≤ ∆s = 0.0005.

Hence we get the bound

    ∆z ≤ ((∆s)^2/2) |f''(s*)| = ((0.0005)^2/2)·2·1.772 = 44·10^−8.

We find that the calculated error bound is not much larger than the observed error, which is explained by the fact that the round-off error in s* is close to the maximum magnitude: we have namely that √π = 1.772454, so |s − s*| ≈ 0.00045.

7.5 General error propagation formula for functions of several variables

7.5.1 Error propagation formula for two variables

We also discuss error propagation in the evaluation of general functions of several variables. We first consider a function f of two variables, which we need to evaluate for the arguments x, y, when we know the approximations x*, y*. As usual we put

    x* = x + δx,   y* = y + δy,   z = f(x, y),   z* = f(x*, y*),   z* = z + δz.

Using Taylor's formula and keeping only the linear terms we get

    δz ≈ (∂f/∂x) δx + (∂f/∂y) δy.

We recall that the symbols δx and δy represent finite values. We have the bounds

    |δx| ≤ ∆x,   |δy| ≤ ∆y,

where ∆x and ∆y are the uncertainties in x and y. Using the triangle inequality we arrive at the following approximate bound for ∆z, the uncertainty in z:

    ∆z ≤ |∂f/∂x| ∆x + |∂f/∂y| ∆y.


7.5.2 Example: addition

Let

    z = x + y.

We then have

    ∂z/∂x = 1,   ∂z/∂y = 1,

and hence

    ∆z = ∆x + ∆y.

Hence the bounds for the absolute values of the two terms add up to form the bound for the absolute uncertainty in the sum.

7.5.3 Example: multiplication

Let

    z = x · y.

Then we find

    ∂z/∂x = y,   ∂z/∂y = x,

and thus

    ∆z = |y|·∆x + |x|·∆y.

From this last relation we obtain

    ∆z/|z| = ∆x/|x| + ∆y/|y|.

Hence the bound for the relative error in the product is the sum of the bounds for the relative errors in the factors. This result generalizes to products of more than two factors.

7.5.4 Example: division

Put

    z = x/y.

Using Taylor's formula and keeping the linear terms we find, using notations which are analogous to those of the preceding example,

    δz = δx/y − x δy/y^2,

and hence we get the error bound

    |δz| ≤ |δx|/|y| + |x δy|/|y^2|,


giving

    ∆z/|z| ≤ ∆x/|x| + ∆y/|y|.

Thus the bound for the relative error in the quotient z is the sum of the relative bounds for the errors in x and y.

7.5.5 Numerical example

Let

    z = (x + 4)/(y + x + 10).

Estimate the relative and absolute errors in z if x = 10, y = 5 and the relative errors are 0.01 for x and 0.02 for y. We get

    x = 10 ± 0.1,   y = 5 ± 0.1.

Thus

    z = (10 + 4)/(5 + 10 + 10) = 0.5600.

Further

    ∂z/∂x = 1/(y + x + 10) − (x + 4)/(y + x + 10)^2 = 1/25 − 14/25^2 = 0.0176,

    ∂z/∂y = −(x + 4)/(y + x + 10)^2 = −14/25^2 = −0.0224.

Hence the bound for the absolute error becomes

    ∆z = 0.1·0.0176 + 0.1·0.0224 = 0.0040,

and that for the relative error

    ∆z/z = 0.0040/0.56 = 0.0071.

7.5.6 Error propagation for functions of several variables

Let f be a real-valued function which is defined for arguments x ∈ R^n, the n-dimensional real space. The error in x is represented by the vector δx ∈ R^n. Using Taylor's formula and keeping the linear terms we get the following approximation for the error δf(x):

    δf(x) ≈ Σ_{i=1}^{n} (∂f(x)/∂x_i) δx_i,

and hence the absolute error bound

    ∆f(x) ≤ Σ_{i=1}^{n} |∂f(x)/∂x_i| · ∆x_i,

and the corresponding relative error bound ∆f(x)/|f(x)|.
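The general first-order bound Σ|∂f/∂x_i|·∆x_i can be evaluated numerically even when the partial derivatives are not worked out by hand, for instance with central differences. A sketch of our own (the function name and the step h are our choices, not from the notes), applied to the example of Section 7.5.5:

```python
def error_bound(f, x, dx, h=1e-6):
    """First-order uncertainty bound sum_i |df/dx_i| * Dx_i, with the
    partial derivatives approximated by central differences of step h."""
    bound = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        dfi = (f(xp) - f(xm)) / (2.0 * h)
        bound += abs(dfi) * dx[i]
    return bound

# z = (x + 4)/(y + x + 10) with x = 10 +- 0.1 and y = 5 +- 0.1
dz = error_bound(lambda v: (v[0] + 4.0) / (v[1] + v[0] + 10.0),
                 [10.0, 5.0], [0.1, 0.1])
```

Here dz comes out as about 0.0040, matching the hand computation.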


Chapter 8

Numerical treatment of nonlinear equations

8.1 Introduction

In this chapter we shall describe some numerical methods for the solution of equations. The discussion will focus on nonlinear equations in one unknown. We recall that some important classes of equations may be solved by means of special algorithms. Linear equations may be considered a special case of nonlinear ones, and methods for this important class have been presented in Chapters 3 and 4. The algorithm for solving the quadratic equation

    x^2 + a_1 x + a_2 = 0

is well known, at least when the coefficients are real. However, a straightforward application could give numerical problems due to cancellation. To avoid those, it is advisable first to determine the root with the largest absolute value. Let this root be x_1. Then the root with the smaller absolute value, x_2, is calculated according to x_2 = a_2/x_1. Formulas for polynomial equations of the third and fourth degree may be found in the literature, e.g. in [11], but these are seldom used. The computer-algebra language Maple may be used to determine expressions for all complex zeros of a polynomial of degree less than five with complex coefficients. Some other equations involving elementary functions may be solved analytically by means of special methods. Then an alternative to the general numerical methods to be described later may be available. We recall also that these equations may be solved numerically using Maple.


8.2 Treatment of a single equation f(x) = 0 when f is continuous: existence of roots

In this section we discuss the single equation

    f(x) = 0,

when f is only assumed to be continuous. The bisection method, to be given in Section 8.4 and applicable in this case, cannot be easily extended to systems of equations. It is based on the following theorem:

Theorem 8.2.1 If the function f is continuous on [a, b] and such that

    f(a)·f(b) < 0,

then the equation

    f(x) = 0

has a root s such that

    a < s < b.

We have also the general error bound (see (1.1)):

    |s − (a + b)/2| < (b − a)/2.

Example 8.2.2 Consider the equation f(x) = 0 with f(x) = e^x − 2, a = 0.69, b = 0.70. We find f(a) = −0.0063, f(b) = 0.0138. By Theorem 8.2.1 above we find that the equation has a root s in the interval [0.69, 0.70] and we get the estimate s = 0.695 ± 0.005. (The true root is ln 2 = 0.6931, which illustrates that the error bound in Theorem 8.2.1 is valid.) Note that under the assumptions of Theorem 8.2.1 it is not possible to draw any conclusions about the number of roots in [a, b]. This is illustrated by:

Example 8.2.3 In Theorem 8.2.1 choose

    f(x) = (x^3 − 1)x^2,   a = −2,   b = 2.

Then we have

    f(−2) = −36,   f(2) = 28.

It is easily seen that the equation f(x) = 0 has a double root at 0 and a single root at 1, while the equation in Example 8.2.2 has a unique root in [a, b].
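Theorem 8.2.1 also suggests a simple way of locating candidate intervals: scan a grid and keep the subintervals on which f changes sign. The sketch below is our own illustration, and applying it to the function of Example 8.2.3 also shows the limitation noted above: the simple root at 1 is bracketed, but the double root at 0 produces no sign change and is missed.

```python
def sign_change_brackets(f, a, b, steps):
    """Return the subintervals of a uniform grid on [a, b] where f
    changes sign; by Theorem 8.2.1 each such bracket contains a root."""
    h = (b - a) / steps
    brackets = []
    for k in range(steps):
        lo, hi = a + k * h, a + (k + 1) * h
        if f(lo) * f(hi) < 0.0:
            brackets.append((lo, hi))
    return brackets
```

The step count must be chosen so that no grid point falls exactly on a root, since the test uses a strict sign change.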


8.3 Determination of approximate roots. A general error bound. Uniqueness

Before one starts the numerical treatment of a nonlinear equation

    f(x) = 0,

one needs to determine the number of roots and to find approximate values of these. For each root one needs to find an interval which contains it. For a successful numerical treatment one also needs to know whether the root is isolated or multiple, i.e. whether f and its derivative have a common zero there. This information is obtained by analytic means, and there is no fixed procedure to follow which works for all equations.

We also give the following result, which may be used for bounding the error in an approximation s* for a root s.

Theorem 8.3.1 Let f be a function on the interval [a, b] which is monotonic and has a continuous derivative f' which is such that |f'(x)| ≥ m > 0 for a certain number m. Assume also that

    f(a)·f(b) < 0.

Then f has a unique root s in [a, b]. Let now s* be a number such that

    (s* − a)(b − s*) > 0.

Then we have the inequalities:

    |s − s*| < b − a,   |s − s*| ≤ |f(s*)|/m.

Example 8.3.2 We consider the problem: determine all real roots of the equation

    e^x = 4 + x,   (8.1)

with an uncertainty < 1.0·10^−5. We define the function

    f(x) = e^x − 4 − x.   (8.2)

We observe that when x → −∞ then f(x) ≈ −4 − x, and since

    f(x) = e^x (1 − 4·e^{−x} − x·e^{−x}),

f(x) grows unboundedly when x → ∞. We also have

    f'(x) = e^x − 1,   f''(x) = e^x > 0.

Hence f' has only one zero, namely x = 0, which corresponds to a local minimum, because f is decreasing for x < 0 and increasing for x > 0. Further,

    f(−4) = e^{−4} > 0,   f(0) = e^0 − 4 = −3,


    f(1) = e − 4 − 1 = −2.2817,   f(2) = e^2 − 4 − 2 = 1.389.

Thus we conclude that this equation has one negative root and one positive root. Let s be the positive root. From the above we have

    1 < s < 2,

since f(1)·f(2) < 0. We thus conclude that

    s = 1.5 ± 0.5.

We use the theorem above to get, if possible, a better bound than 0.5 for the approximation s* = 1.5 for the root s. We find

    f(1.5) = e^{1.5} − 4 − 1.5 = −1.018.

Since f'(x) = e^x − 1 is increasing we may take

    m = |f'(1)| = e − 1 = 1.718,

and we obtain the bound

    |s − 1.5| ≤ 1.018/1.718 = 0.593.

The theorem did not give any improvement in this case. In the sequel we will present three different numerical schemes for finding the positive root and illustrate them with numerical results.

8.4 The bisection method

If the assumptions of Theorem 8.2.1 are valid, then it is possible to construct a sequence of approximations to a root in [a, b]. The main idea is the following: Put c = (a + b)/2. If f(c)·f(a) > 0, then f(c) has the same sign as f(a) and hence f(c)·f(b) < 0. Thus Theorem 8.2.1 may be applied to the interval [c, b]. Otherwise f(a)·f(c) < 0 and the theorem applies to [a, c]. In both cases the interval guaranteed to contain a root has been halved. This argument may be repeated. We formulate the bisection algorithm, which is described in

Theorem 8.4.1 Let a, b and f be as in Theorem 8.2.1. Generate the sequences of numbers x_0, x_1, ..., y_0, y_1, ... and z_0, z_1, ... as follows:

Start: Put x_0 = a, y_0 = b.
Step 1: Put n = 0.
Step 2: Put z_n = (x_n + y_n)/2.
Step 3: If f(z_n)·f(x_n) > 0, put x_{n+1} = z_n and y_{n+1} = y_n, increment n by 1 and go to Step 2.
Step 4: If f(z_n)·f(x_n) ≤ 0, put y_{n+1} = z_n and x_{n+1} = x_n, increment n by 1 and go to Step 2.

Then there is a root s ∈ [a, b] such that the following inequality holds:

    |z_n − s| ≤ ((b − a)/2)·2^{−n}.
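The algorithm of Theorem 8.4.1 can be implemented directly. The following Python sketch of our own halves the bracket until the guaranteed error is below a tolerance, illustrated on the equation e^x = 4 + x of Example 8.3.2:

```python
import math

def bisect(f, a, b, tol):
    """Bisection per Theorem 8.4.1; requires f(a)*f(b) < 0.
    Returns the final midpoint z_n with |z_n - s| <= tol."""
    x, y = a, b
    z = (x + y) / 2.0
    while (y - x) / 2.0 > tol:
        if f(z) * f(x) > 0.0:
            x = z          # f(z) has the sign of f(x): root in [z, y]
        else:
            y = z          # otherwise the root lies in [x, z]
        z = (x + y) / 2.0
    return z

root = bisect(lambda x: math.exp(x) - 4.0 - x, 1.0, 2.0, 1e-6)
```

Here root comes out near 1.749031, in agreement with the bisection table below.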


Proof:
The case n = 0 is the statement of Theorem 8.2.1. Every time n is incremented by 1, the interval containing a root of f(x) = 0 is halved. •

We illustrate on Example 8.3.2 above. We find that f(1) = −2.28, f(2) = 1.39, and hence our first estimate for the positive root is

    x = 1.5 ± 0.5.

Using this starting information we get the following table for the bisection method:

    n    x_n        f(x_n)      y_n        f(y_n)     z_n        f(z_n)      h_n
    0    1.000000   −2.281718   2.000000   1.389056   1.500000   −1.018311   .500000
    1    1.500000   −1.018311   2.000000   1.389056   1.750000    .004603    .250000
    2    1.500000   −1.018311   1.750000    .004603   1.625000   −.546581    .125000
    3    1.625000   −.546581    1.750000    .004603   1.687500   −.281551    .062500
    4    1.687500   −.281551    1.750000    .004603   1.718750   −.141198    .031250
    5    1.718750   −.141198    1.750000    .004603   1.734375   −.068989    .015625
    6    1.734375   −.068989    1.750000    .004603   1.742188   −.032368    .007813
    7    1.742188   −.032368    1.750000    .004603   1.746094   −.013926    .003906
    8    1.746094   −.013926    1.750000    .004603   1.748047   −.004673    .001953
    9    1.748047   −.004673    1.750000    .004603   1.749023   −.000038    .000977
    10   1.749023   −.000038    1.750000    .004603   1.749512    .002282    .000488
    11   1.749023   −.000038    1.749512    .002282   1.749268    .001122    .000244
    12   1.749023   −.000038    1.749268    .001122   1.749146    .000542    .000122
    13   1.749023   −.000038    1.749146    .000542   1.749084    .000252    .000061
    14   1.749023   −.000038    1.749084    .000252   1.749054    .000107    .000031
    15   1.749023   −.000038    1.749054    .000107   1.749039    .000035    .000015
    16   1.749023   −.000038    1.749039    .000035   1.749031   −.000001    .000008
    17   1.749031   −.000001    1.749039    .000035   1.749035    .000017    .000004
    18   1.749031   −.000001    1.749035    .000017   1.749033    .000008    .000002
    19   1.749031   −.000001    1.749033    .000008   1.749032    .000003    .000001
    20   1.749031   −.000001    1.749032    .000003   1.749032    .000001    .000000

8.5 Newton-Raphson's method

To use this method we write the equation as

    f(x) = 0.

An approximation x_0 for a root s is known and one seeks to generate a sequence of points x_0, x_1, ... which converges to the desired root. Let x_n be a member of this sequence.
We write s = x_n + h and put x_{n+1} = x_n + h_n, where h_n should be a sufficiently good approximation of h to ensure that |x_{n+1} − s| < |x_n − s|. In Newton-Raphson's method h_n is constructed by using Taylor's series and


keeping only the linear term. (This idea is easily extended to systems of several equations.) Thus we write

    0 = f(s) = f(x_n + h) ≈ f(x_n) + h·f'(x_n).

Here we put

    h_n = −f(x_n)/f'(x_n).

To summarise, the Newton-Raphson iteration becomes: let x_0 be given and then put, for n = 0, 1, ...,

    x_{n+1} = x_n + h_n,   h_n = −f(x_n)/f'(x_n).

Example 8.5.1 We want to determine the positive root of the equation (8.1). In this case we have

    f(x) = e^x − 4 − x,   f'(x) = e^x − 1.

Taking x_0 = 1.8 and working with 6 decimal places we get

    f(x_0) = e^{1.8} − 4 − 1.8 = 0.249647,   f'(x_0) = e^{1.8} − 1 = 5.049647,

    h_0 = −0.249647/5.049647 = −0.049439.

Hence we find

    x_1 = x_0 + h_0 = 1.800000 − 0.049439 = 1.750561.

Continuing on we get the following table:

    n   x_n        f(x_n)    f'(x_n)    h_n
    0   1.800000   .249647   5.049647   −.049439
    1   1.750561   .007273   4.757834   −.001529
    2   1.749033   .000007   4.749040   −.000001
    3   1.749031   .000000   4.749032    .000000

Hence s = 1.749031 ± 0.000001; x_3 and h_3 are both rounded to 6 decimal places. The combined rounding error is therefore 0.000001. We use h_3 as the estimate of the truncation error; it is negligible. The Newton-Raphson method may give a convergent sequence of approximations {x_n} for different choices of the starting point x_0. We note that

    lim_{n→∞} x_n = s ⇒ lim_{n→∞} f'(x_n) = f'(s).

We also have the approximation

    s − x_n ≈ x_{n+1} − x_n = h_n.
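The iteration of Example 8.5.1 in code (a sketch of our own; the stopping criterion uses |h_n| as the error estimate, as in the text):

```python
import math

def newton(f, fprime, x0, tol, max_iter=50):
    """Newton-Raphson: x_{n+1} = x_n - f(x_n)/f'(x_n), stopping when
    the step |h_n|, the error estimate s - x_n ~ h_n, is below tol."""
    x = x0
    for _ in range(max_iter):
        h = -f(x) / fprime(x)
        x += h
        if abs(h) < tol:
            return x
    raise RuntimeError("no convergence within max_iter steps")

root = newton(lambda x: math.exp(x) - 4.0 - x,
              lambda x: math.exp(x) - 1.0, 1.8, 1e-9)
```

Starting from x_0 = 1.8, only a handful of iterations are needed to reach the root near 1.749031, reflecting the rapid convergence discussed in Section 8.6.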


8.6 The fixed-point method

To use this method we write the equation as

    x = F(x).

Starting with a suitable approximation x_0 for the sought root s we iterate according to the formula

    x_{n+1} = F(x_n),   n = 0, 1, ....

This sequence may or may not converge towards the desired root. We have the relation

    x_{n+1} − s ≈ F'(s)(x_n − s),

which holds if the sequence does indeed converge and x_n is close to the root. We note that if there is a number q such that

    |F'(s)| ≤ q < 1,

then the absolute value of the error is multiplied by a number not greater than q in each iteration. If the fixed-point method is to be more rapidly convergent than the bisection method, one needs to require that q < 1/2. The fixed-point method may be generalised to systems of several equations. As an example of this we mention the Jacobi scheme for linear systems of equations.

Remark
Newton-Raphson's method may be looked upon as a special instance of the fixed-point method. Assume that we solve the equation

    f(x) = 0

by writing this equation in the form

    x = F(x),   F(x) = x − f(x)/f'(x),

and then applying the fixed-point iteration to the equation x = F(x); the sequence of points becomes identical to that of Newton-Raphson's method. We note in particular that in this case F'(s) = 0, provided that f'(s) ≠ 0. This fact explains the rapid convergence of Newton-Raphson's method.

Example 8.6.1 (Heron's formula). To determine the square root of a positive number a, we treat the equation

    f(x) = 0,   f(x) = x^2 − a.


We find

f'(x) = 2x,

and hence

x_{n+1} = x_n - (x_n^2 - a)/(2x_n),

giving

x_{n+1} = F(x_n),  F(x) = (x + a/x)/2.

In this case,

s = √a,  F(x) = (x + a/x)/2,  F'(x) = (1 - a/x^2)/2,  F'(s) = 0.

Example 8.6.2 We determine the positive root of the equation (8.1), writing it first as

e^x = 4 + x,

and finally in the form

x = ln(4 + x).

Starting with x_0 = 2 and working with 6 decimal places we get x_1 = ln(4 + 2) = 1.791759, x_2 = ln(4 + x_1) = ln(5.791759) = 1.756436, etc. We have collected the results in the table below. We observe that in this Example

F'(s) = 1/(4 + s) ≈ 1/(4 + 1.75) = 0.174.

It should be pointed out that if we want to calculate the negative root with the fixed-point method, we may write the equation as

x = e^x - 4,

and use x_0 = -4 as the starting approximation.

n    x_n
0    2.000000
1    1.791759
2    1.756436
3    1.750319
4    1.749255
5    1.749070
6    1.749038
7    1.749033
8    1.749032
9    1.749031
10   1.749031
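As a small illustration (Python; the code and its names are mine, not part of the notes), the fixed-point iteration of Example 8.6.2 reproduces the table above:

```python
import math

def fixed_point(F, x0, n_iter):
    """Fixed-point iteration x_{n+1} = F(x_n)."""
    x = x0
    for _ in range(n_iter):
        x = F(x)
    return x

# Example 8.6.2: positive root of e^x - 4 - x = 0 via x = ln(4 + x)
x10 = fixed_point(lambda x: math.log(4 + x), 2.0, 10)
print(round(x10, 6))   # → 1.749031, as in the table
```

With |F'(s)| ≈ 0.174 the error shrinks by roughly that factor per step, so ten iterations suffice for six decimals here.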


Chapter 9

On approximation of functions

9.1 Calculation of standard functions

9.1.1 General remarks

The construction of schemes for the calculation of standard functions is a topic which has been studied for a very long time. We recall that Briggs devoted decades of his life to the construction of accurate tables of logarithms in the 17th century. Before the era of computers the elementary functions were tabulated over some reference intervals (see e.g. [1] or [4]), and the properties of the function were used to determine its values for arguments outside the range of the table. The table gives only the values at a discrete set of arguments in the tabulated range, and interpolation is called for in order to evaluate the function at arguments between these points. Tables of logarithms were used to simplify computational work, since addition of logarithms may replace multiplication of the corresponding numbers, and multiplication or division of a logarithm by a real number may replace the operation of determining powers and roots. We note the examples

log(ab) = log a + log b,  log(a^p) = p log a.

If a, b are given, one first determines log a and log b; then one adds log a to log b to get log(ab). Using the table, the number ab is found. This is easier than multiplying a by b by hand.

Example 9.1.1 (Extension of a table over the exponential function)
Assume that we seek

z = exp(4.756).

In [4] we find a table of exp(x), 0 ≤ x ≤ 3. Thus we write

exp(4.756) = exp(2) · exp(2.756) = 7.38906 · 15.73677 = 116.2799.
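A toy sketch of the table-extension idea in Example 9.1.1 (Python; the helper name is mine, and the "table" is simulated by `math.exp` restricted to [0, 3]):

```python
import math

def exp_via_table(x, table_max=3.0):
    """Evaluate exp(x) for x > table_max by splitting off integer parts,
    so that every factor lies in the tabulated range [0, table_max]."""
    k = 0.0
    while x - k > table_max:
        k += 1.0
    # both exp(k) and exp(x - k) would be read off the reference table
    return math.exp(k) * math.exp(x - k)

print(exp_via_table(4.756))   # compare Example 9.1.1: 7.38906 · 15.73677 ≈ 116.28
```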


When we need to construct a standard computer program for evaluating the exponential function for real arguments, we use a similar approach. First we determine a polynomial or rational function which approximates exp(x) in a reference interval to acceptable accuracy, and then we use the particular properties of this function to calculate its value at arguments outside the reference interval.

9.1.2 Division in simple computers

Some of the first computers did not have instructions for division. To determine 1/b, b > 0, one had to solve the following equation numerically (without using division):

1/x - b = 0.

We first determine an integer n such that

1 ≤ 2^n b ≤ 2.

Then we find

1/b = 2^n · 1/(2^n b),

and 2^{-n} is stored exactly in the computer; multiplication and division by this number correspond to simple and rapid shift operations. Thus we put a = 2^n b and proceed to solving the equation

1/x - a = 0.

Using Newton-Raphson's method we derive the sequence

x_0 = 1,  x_{n+1} = 2x_n - a·x_n^2,  n = 0, 1, ...,

which converges rapidly towards its limit value 1/a.

Example 9.1.2 Determine 1/17 without divisions. We have

17 = 2^4 · 17/16.

We need to invert 17/16 and form the recursion

x_0 = 1,  x_{n+1} = 2x_n - (17/16)·x_n^2,  n = 0, 1, ....

We find the numerical values

1.000000, 0.937500, 0.941162, 0.941176, 0.941176.

Hence we obtain 1/17 = 2^{-4} · 0.941176 = 0.0588235.
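The whole scheme, scaling included, can be sketched as follows (Python; not from the notes, and the iteration count is my own choice, generous since the convergence is quadratic):

```python
def reciprocal(b, n_iter=6):
    """1/b without a division instruction, as in Section 9.1.2:
    scale b by a power of two so that a = 2^n * b lies in [1, 2],
    invert a with x <- 2x - a*x^2 (Newton on 1/x - a = 0), shift back."""
    assert b > 0
    n, a = 0, b
    while a > 2.0:          # halving corresponds to a shift operation
        a *= 0.5
        n -= 1
    while a < 1.0:          # doubling corresponds to a shift operation
        a *= 2.0
        n += 1
    x = 1.0                 # x_0 = 1; the iteration converges to 1/a
    for _ in range(n_iter):
        x = 2.0*x - a*x*x
    return x * 2.0**n       # 1/b = 2^n * (1/a)

print(reciprocal(17.0))   # ≈ 0.0588235, as in Example 9.1.2
```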


9.1.3 The square root

To find √a, a > 0, one needs to solve the equation

x^2 = a

numerically. We consider here only the case 1 ≤ a ≤ 2. If a is outside of this interval, we form

b = 2^n a,

where the integer n is chosen such that 1 ≤ b ≤ 2. The classical method of determining the square root of b is Heron's formula, which is the name given to the recursion

x_0 = 1,  x_{n+1} = (x_n + b/x_n)/2,  n = 0, 1, ....

This recursion may be interpreted as follows. We seek a positive number x such that x^2 = b. Assume that we have found an approximation x_n. Then we may write

x_n · (b/x_n) = b.

We get the next approximation x_{n+1} by forming the average of the two factors x_n and b/x_n. This is repeated. The convergence to √b is quadratic. A numerical example, namely the calculation of √7, is given in Section 10.3. This example also demonstrates that Heron's formula can be used for finding the square root of numbers larger than 2.

9.1.4 Two examples of elementary functions

Polynomial approximations valid in standard intervals are found in [1]. Examples of accurate rational approximations (quotients of two polynomials) are given in [2]. We mention here

Example 9.1.3

sin x = (60x - 7x^3)/(60 + 3x^2) + R(x),  R(x) ≈ 11x^7/50400,

and

e^x ≈ (1680 + 840x + 180x^2 + 20x^3 + x^4)/(1680 - 840x + 180x^2 - 20x^3 + x^4).

9.2 Approximation by polynomials, some examples

Example 9.2.1 Assume that we want to derive a polynomial expression for the exponential function on [0, 1]. Using the expansion above we get

e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + ... + x^{n-1}/(n-1)! + R_n(x),  0 ≤ x ≤ 1,


where

R_n(x) = exp(θx) · x^n/n!,  0 < θ < 1.

Since 0 ≤ x ≤ 1, we obtain

|R_n(x)| ≤ exp(1) · x^n/n! ≤ exp(1)/n!,

and hence |R_n(x)| ≤ 1.0 · 10^{-6} if n ≥ 10.

If we instead expand around the mid-point of the interval, i.e. take a = 1/2 in the general formula above, we get the expansion

e^x = e^{1/2} · (1 + (x - 1/2) + (x - 1/2)^2/2 + (x - 1/2)^3/6 + (x - 1/2)^4/24 + ... + (x - 1/2)^{n-1}/(n - 1)!) + R_n(x),  0 ≤ x ≤ 1,

where R_n(x) now contains the factor (x - 1/2)^n, with |x - 1/2| ≤ 1/2. Hence the remainder satisfies the sharper bound |R_n(x)| ≤ exp(1) · 2^{-n}/n!, so fewer terms suffice for the same accuracy.


which is achieved for

t_i = 1/2 + (1/2)·cos ξ_i,  ξ_i = (2i - 1)π/(2n).

To get

||R||_∞ ≤ 1.0 · 10^{-6}

with this choice of t_i we need to take n ≥ 7.

9.2.1 Interpolation with piecewise polynomials

Assume that we are given a table with many pairs of values t_i, y_i, i = 1, ..., N. Thus the table is defined by the vectors t ∈ R^N and y ∈ R^N. If N is a large number, it is generally not advisable to determine a single polynomial which interpolates those points. The task of determining this polynomial is unstable when N is large, which may even cause the calculations to break down. One then often splits the table into subsets and constructs an interpolating function for each of these subsets. When the argument x passes from one subset to the next, the interpolating function and/or its derivatives may have a discontinuity. These points are called break-points. They need not belong to the set of arguments in the table. When the argument vector t ∈ R^N is fixed, there is a linear mapping from y ∈ R^N to the interpolating function s. Hence the value s(x) may be written

s(x) = Σ_{i=1}^{N} y_i f_i(x).    (9.1)

A special case of this is the Lagrange interpolation formula (6.4), where f_i corresponds to L_i. These functions are called finite elements. If we evaluate a functional of s, the value becomes a linear expression with y as argument. Hence, if we integrate (9.1) over [0, 1] we get

∫_0^1 s(x) dx = Σ_{i=1}^{N} y_i ∫_0^1 f_i(x) dx.

9.2.2 Piecewise linear interpolation

Let t ∈ R^N, y ∈ R^N be as before. We want to construct a piecewise linear interpolating function s(x) by using the linear interpolation formula (6.3) on each of the subintervals [t_{r-1}, t_r], r = 2, 3, ..., N, setting

h_r = t_r - t_{r-1}.    (9.2)

Collecting terms we get in (9.1)

f_r(x) = h_{r+1}^{-1}(t_{r+1} - x),  t_r ≤ x ≤ t_{r+1},  r = 1, ..., N-1,
f_r(x) = h_r^{-1}(x - t_{r-1}),  t_{r-1} ≤ x ≤ t_r,  r = 2, 3, ..., N,
f_r(x) = 0  otherwise;

then s(x) in (9.1) becomes a piecewise linear and continuous function with the value y_r for x = t_r.
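A minimal sketch of this piecewise linear interpolant (Python; the code and the tiny sample table are mine, not from the notes):

```python
def piecewise_linear(t, y, x):
    """Evaluate the piecewise linear interpolant s(x) of Section 9.2.2:
    on each subinterval [t_{r-1}, t_r] it is the straight line through
    the two table values, which is the hat-function expansion (9.1)."""
    if x <= t[0]:
        return y[0]
    if x >= t[-1]:
        return y[-1]
    r = 1
    while t[r] < x:          # locate the subinterval containing x
        r += 1
    h = t[r] - t[r-1]        # h_r of (9.2)
    return (y[r-1]*(t[r] - x) + y[r]*(x - t[r-1])) / h

t_nodes = [0.0, 1.0, 3.0]
y_vals  = [0.0, 2.0, 1.0]
print(piecewise_linear(t_nodes, y_vals, 2.0))   # → 1.5, midpoint of [1, 3]
```

Note that s is linear in the data y, as (9.1) states, and reproduces y_r exactly at the nodes.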


Chapter 10

On the numerical evaluation of limit values

10.1 Examples of limit values in computational mathematics

Many computational problems may be formulated as the task of computing limit values.

Example 10.1.1 (Numerical integration) If f is continuous on [-1, 1] we have

∫_{-1}^{1} f(t) dt = lim_{N→∞} (1/N) Σ_{i=-N+1}^{N} f((i - 1/2)/N).

Example 10.1.2 (Numerical differentiation) See Example 1.3.2.

Example 10.1.3 (Fourier integrals) We consider the Fourier integral

F(w) = ∫_{-∞}^{∞} e^{iwt} f(t) dt.

Since

e^{iwt} = cos wt + i sin wt,

we could write

F(w) = ∫_{-∞}^{∞} f(t) cos wt dt + i ∫_{-∞}^{∞} f(t) sin wt dt.

Provided certain general conditions hold we have

F(w) = lim_{h→0} F(h, w),    (10.1)


where

F(h, w) = h Σ_{n=-∞}^{∞} e^{iwnh} f(nh).    (10.2)

This defines a double limiting process, since we need to determine F(h, w) by evaluating an infinite series (10.2) and then find the limit value (10.1). We observe that the integral defining F(w) is not defined for all functions f.

10.2 Some measures of convergence speed of series and sequences

In this section we shall discuss calculating limit values and we shall present methods of estimating the errors in calculated results. We start by considering the general situation of a sequence

s_1, s_2, ...,    (10.3)

such that

s = lim_{n→∞} s_n    (10.4)

is defined. Here s_1, s_2, ..., s_n may be calculated numerically and the effort necessary depends on n. Put

R_n = s - s_n.

Thus

s = s_n + R_n.    (10.5)

We often make the approximation

s ≈ s_n.    (10.6)

Then R_n is the associated truncation error, which is unknown in general. Hence we need to derive bounds on R_n using any special properties which the sequence s_1, s_2, ... might be established to possess. It should be borne in mind that without any bounds on the truncation error, the estimate s_n does not bring any information about the value of s, even if n is large. Associated with the sequence (10.3) is a series with terms a_0, a_1, ..., where we define s_0 = 0 and

a_{r-1} = s_r - s_{r-1},  r = 1, 2, ....

Hence

s_n = Σ_{r=0}^{n-1} a_r.    (10.7)

Note that the sum defining s_n has n terms. We write the series corresponding to the sequence (10.3), (10.7)

s = Σ_{r=0}^{∞} a_r.    (10.8)


Thus

R_n = Σ_{r=n}^{∞} a_r.    (10.9)

If we now put

s ≈ s_n,

then we shall refer to a_n as the first neglected term.

Definition 10.2.1 The sequence (10.7) and the equivalent series (10.8) are said to be convergent if

lim_{n→∞} R_n = 0.

The convergence is said to be rapid if

lim_{n→∞} R_{n+1}/R_n = 0,

and the convergence is termed geometric (or exponential in n) if

lim_{n→∞} |R_{n+1}|/|R_n| = q,

where 0 < q < 1. Otherwise the convergence is said to be slow.

Remark 10.2.2 We recall the familiar fact that the requirement

lim_{n→∞} a_n = 0

is not sufficient for convergence. This is illustrated by the example

a_n = 1/n,  n = 1, 2, ...,

since in this case we may prove, using the integral inequality in 1.6.6,

lim_{n→∞} s_n/ln n = 1.

This explains why the definition of convergence itself is expressed in terms of the remainder R_n.

Theorem 10.2.3 (Sufficient condition for geometric and rapid convergence)
Assume that

lim_{n→∞} a_{n+1}/a_n = q.

Then if 0 < |q| < 1 the series is geometrically convergent; if q = 0 the series is rapidly convergent.


Proof Put

M_n = sup_{r>n} a_{r+1}/a_r,  m_n = inf_{r>n} a_{r+1}/a_r.

Then we have

m_n ≤ M_n,

and

M_{n+1} ≤ M_n,  m_{n+1} ≥ m_n,  n = 1, 2, ...,

since when n is increased by 1, one element is removed from the set having m_n and M_n as infimum and supremum. We also have the limiting values

lim_{n→∞} M_n = lim_{n→∞} m_n = q.

Thus we get the bounds

R_n = Σ_{r=n}^{∞} a_r ≥ a_n/(1 - m_n),

R_{n+1} ≤ a_{n+1}/(1 - M_{n+1}).

Hence

R_{n+1}/R_n ≤ (a_{n+1}/a_n) · (1 - m_n)/(1 - M_{n+1}),

and we arrive at the desired conclusion letting n → ∞ and observing that m_n and M_n both tend to q. •

Lemma 10.2.4 Let the series (10.8) be geometrically or rapidly convergent. Then the following statements follow from Definition 10.2.1:

lim_{n→∞} R_n/a_n = 1/(1 - q),    (10.10)

and there is an N such that

|a_{n+1}| < |a_n|,  n > N.    (10.11)

Proof: Using the definition of R_n we have

R_n = a_n + R_{n+1}.

Dividing this relation by R_n we get

1 = a_n/R_n + R_{n+1}/R_n.

Letting n → ∞ we obtain

1 = lim_{n→∞} a_n/R_n + q,

establishing the first statement. Since lim_{n→∞} |R_{n+1}/R_n| = q < 1, the second result is a consequence of (10.10). •


Remark 10.2.5 If (10.8) is a rapidly or geometrically convergent series, then we may use the estimate

R_n ≈ a_n/(1 - M_n).    (10.12)

In many practical situations a_n, "the first neglected term", is not available, and one uses instead the more conservative estimate

|R_n| < (M_n/(1 - M_n)) |a_{n-1}| = (M_n/(1 - M_n)) |s_n - s_{n-1}|,    (10.13)

i.e. "the last included term" in the sum s_n. Here M_n is defined as in Theorem 10.2.3.

Example 10.2.6 The series

x - x^3/6 + x^5/120 - ... = sin x

satisfies the conditions of Leibniz's theorem if |x| < 1. It is also rapidly convergent. Note that geometric convergence implies

log |R_n| < log C + n log q,

where log is the logarithm with base 10. This means that under geometric convergence the number of correct digits in our estimate s ≈ s_n grows linearly with n, the number of terms used in forming s_n. We observe that if we form the subsequence s_1, s_2, s_4, s_8, ..., this new sequence is even quadratically convergent. For efficient calculation of the limit s it is desirable that the series is either rapidly convergent or geometrically convergent with q not much greater than about 0.5. Often a series is transformed to improve its convergence.

Example 10.2.7 We illustrate Definition 10.2.1 with the following four series:

a_n = 1/n!,    (10.14)

a_n = 1/(n + 1)^2,    (10.15)

a_n = 0.1^n/√(n + 1),    (10.16)

a_n = 0.99^{n^2}.    (10.17)

(10.14) and (10.17) are rapidly convergent, (10.16) is geometrically convergent with q = 0.1, and (10.15) is slowly convergent.
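The classification in Example 10.2.7 can be checked numerically by estimating the remainder ratio |R_{n+1}/R_n| with far-out truncated tails. This Python sketch is my own, not from the notes, and the truncation lengths are chosen generously for these particular series:

```python
import math

def remainder_ratio(a, n, tail):
    """Numerical estimate of |R_{n+1}/R_n|, truncating both tails at n + tail."""
    Rn  = sum(a(r) for r in range(n,     n + tail))
    Rn1 = sum(a(r) for r in range(n + 1, n + tail))
    return abs(Rn1 / Rn)

# series (10.14): a_n = 1/n!  -- rapid: the ratio keeps shrinking as n grows
r_rapid = remainder_ratio(lambda n: 1.0/math.factorial(n), 20, 60)
# series (10.16): a_n = 0.1^n/sqrt(n+1)  -- geometric with quotient q = 0.1
r_geo = remainder_ratio(lambda n: 0.1**n/math.sqrt(n + 1), 20, 200)
print(r_rapid, r_geo)   # small and shrinking vs. close to 0.1
```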


10.3 Numerical treatment of rapidly convergent series and sequences in the presence of round-offs

In Definition 10.2.1 we introduced the concept of rapidly converging series and sequences, and we gave some properties which may be used to determine whether an analytically given series or sequence indeed converges rapidly. Let (10.3) be a rapidly converging sequence and let

s̃_1, s̃_2, ...,    (10.18)

be the calculated values of the partial sums s_1, s_2, .... Since we are using a computer working with finite accuracy, in general we cannot expect the calculated sequence (10.18) to be rapidly converging in the sense of Definition 10.2.1. Instead we use our knowledge about the exact, but unavailable, sequence (10.3) to draw conclusions about our calculated values (10.18). We illustrate this idea with the following simple example.

Example 10.3.1 Consider the sequence (10.3) with

s_1 = 1,  s_{n+1} = (s_n + 7/s_n)/2,  n = 1, 2, ....    (10.19)

The corresponding numerical values are listed in Table 10.1. This sequence is rapidly convergent.

i   s_i
1   1.0000000
2   4.0000000
3   2.8750000
4   2.6548913
5   2.6457670
6   2.6457512
7   2.6457512

Table 10.1 The sequence defined in (10.19).

We next consider the series (10.17), which can be shown to be rapidly convergent. We have namely in this case

a_{n+1}/a_n = 0.99^{2n+1} → 0, when n → ∞.

We give some numerical results in the following table:


n    a_{n-1}    s_n
1    1.0000000  1.0000000
2    .9900000   1.9900000
3    .9605960   2.9505961
4    .9135174   3.8641133
5    .8514579   4.7155714
10   .4430483   7.7695689
20   .0265648   9.2900343
30   .0002134   9.3398037
38   .0000011   9.3400545
39   .0000005   9.3400555
40   .0000002   9.3400555
41   .0000001   9.3400555

Table 10.2 The sequence defined in (10.17) for selected values of n.

In these two examples the sequence of calculated values converged to a definite value in each case. That does not always occur, since it is possible to construct sequences where the influence of round-offs increases with n; one then seeks an estimate such that the combined effect of truncation and round-off errors is minimal.

10.4 Stability of term by term summation of series

Convergence acceleration methods are not always used, maybe because the corresponding theory may be considered unusual or complicated. If supercomputers are available, it would be tempting to estimate the sum by summing the series term by term. We shall discuss three examples which illustrate that the apparently simplest approach is not always the best:

a_r = (-10)^r/r!,    (10.20)

a_r = (-1)^r/(r + 1),    (10.21)

a_r = 1/(r + 1)^2.    (10.22)

All three series are convergent; the first is even rapidly convergent according to Definition 10.2.1. The first two series are alternating. We define the sequences corresponding to these series according to (10.7). Since in these examples the true sums are known, we evaluated the differences between the partial sums and the true limiting values in the two latter examples. The results are shown in Tables 10.3, 10.4 and 10.5.


n    a_{n-1}      s_n
1    1.00000      1.00000
2    -10.00000    -9.00000
10   -2755.73210  -1413.14470
11   2755.73210   1342.58740
12   -2505.21110  -1162.62370
20   -82.20636    -27.70642
25   1.61174      .46418
30   -.01131      -.00291
35   .00003       -.00005
36   -.00001      -.00006
37   .00000       -.00006
38   .00000       -.00006

Table 10.3 The sequence defined in (10.20) for selected values of n.

n        a_n        s_{n+1}   s - s_{n+1}
0        1.0000000  1.0000000 -.3068528
1        -.5000000  .5000000  .1931472
2        .3333333   .8333334  -.1401862
3        -.2500000  .5833334  .1098138
4        .2000000   .7833334  -.0901862
5        -.1666667  .6166667  .0764805
6        .1428571   .7595238  -.0663766
7        -.1250000  .6345238  .0586234
8        .1111111   .7456349  -.0524877
9        -.1000000  .6456349  .0475123
10000    .0001000   .6931917  -.0000445
20000    .0000500   .6931655  -.0000184
30000    .0000333   .6931564  -.0000092
40000    .0000250   .6931525  -.0000053
50000    .0000200   .6931494  -.0000022
60000    .0000167   .6931478  -.0000006
70000    .0000143   .6931464  .0000008
80000    .0000125   .6931457  .0000015
90000    .0000111   .6931448  .0000024
100000   .0000100   .6931441  .0000031
200000   .0000050   .6931411  .0000061
400000   .0000025   .6931394  .0000077
800000   .0000012   .6931385  .0000086
1000000  .0000010   .6931383  .0000089

Table 10.4 The sequence defined in (10.21) for selected values of n.


n     a_n        s_{n+1}    s - s_{n+1}
0     1.0000000  1.0000000  .6449341
1     .2500000   1.2500000  .3949341
2     .1111111   1.3611112  .2838229
3     .0625000   1.4236112  .2213229
4     .0400000   1.4636111  .1813229
5     .0277778   1.4913889  .1535451
6     .0204082   1.5117971  .1331370
7     .0156250   1.5274221  .1175120
8     .0123457   1.5397677  .1051663
9     .0100000   1.5497677  .0951663
100   .0000980   1.6350820  .0098521
200   .0000248   1.6399715  .0049626
300   .0000110   1.6416175  .0033165
400   .0000062   1.6424432  .0024909
500   .0000040   1.6429399  .0019941
600   .0000028   1.6432716  .0016625
700   .0000020   1.6435084  .0014256
800   .0000016   1.6436863  .0012478
900   .0000012   1.6438249  .0011091
1000  .0000010   1.6439358  .0009983
1500  .0000004   1.6442682  .0006659
2000  .0000002   1.6444323  .0005018
3000  .0000001   1.6445948  .0003393
4000  .0000001   1.6447140  .0002201
4500  .0000000   1.6447253  .0002087

Table 10.5 The sequence defined in (10.22) for selected values of n.

The three series (10.20), (10.21) and (10.22) have the sums e^{-10}, ln 2 and π^2/6. We note that for large n the calculated partial sums s_n give very poor estimates of the true sums. In the case of Table 10.3 even the sign is wrong. To approximate the error in the calculated sum with the first neglected term would be incorrect in all three cases. For (10.22) we have

R_n ≈ 1/n.

Since (10.20) for n > 10 and (10.21) for all n satisfy the conditions of Leibniz's theorem, it is correct to estimate the truncation error by the first neglected term, provided the calculations are carried out exactly on exact data. In these two cases the influence of round-offs and data errors in single precision (relative accuracy ≈ 1.2 · 10^{-7}) is significant, but for (10.22) this source of error was not equally important. If n is large, the accumulated effect of round-offs during the addition of many terms could be serious. Modern computers often carry out the calculations of sums and scalar products in double precision.
However, the error caused by the fact that the terms are represented in finite precision cannot be eliminated, and its effect may be significant. To study this


phenomenon we need

Definition 10.4.1 Let the series (10.8) be convergent. The convergence is said to be absolute if

Σ_{r=0}^{∞} |a_r| ≤ M < ∞.    (10.23)

Otherwise the series is said to be conditionally convergent.

Remark 10.4.2 Geometrically and rapidly converging series are absolutely convergent.

Theorem 10.4.3 Term by term summation of a conditionally convergent series is numerically unstable. However, if the terms are given with a relative error ≤ e, then the relative error in the calculated sum of an absolutely convergent series is bounded by

e·M/|s|,

where s is defined by (10.8) and M by (10.23).

Proof: Let a_r be the exact value of a term, ã_r its computer representation. Put

ã_r = a_r + ε_r,  with |ε_r| ≤ e·|a_r|,  r = 0, 1, ....

Setting

s̃_n = Σ_{r=0}^{n-1} ã_r,  s_n = Σ_{r=0}^{n-1} a_r,

we find

|s̃_n - s_n| ≤ e Σ_{r=0}^{n-1} |a_r|.

Letting n → ∞ we immediately reach the desired conclusion upon dividing the limiting relation by |s| in the case of an absolutely convergent series. •

Remark 10.4.4 In the example (10.20) we have s = e^{-10}, M = e^{10}, and hence

|s̃ - s|/|s| ≤ e · exp(20).

For (10.21) we find

|s̃_n - s_n| ≤ e Σ_{r=0}^{n-1} 1/(r + 1) ≈ e ln n,

and hence the influence of the error could grow unboundedly when n is taken large. In (10.22) we get s = M = π^2/6, and hence the influence of data errors is modest. However, the truncation error is significant, close to 1/n.
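A small experiment (Python; mine, not from the notes, and in double rather than the notes' single precision) makes Theorem 10.4.3 concrete for the series (10.20): the intermediate terms grow to about 2756 in absolute value, so round-offs of size eps·M ≈ eps·e^{10} contaminate the tiny sum e^{-10}, exactly as the bound e·M/|s| predicts:

```python
import math

# term-by-term summation of sum_r (-10)^r / r!, true sum e^{-10}
s, term = 0.0, 1.0
for r in range(60):            # the series has fully converged by r = 60
    s += term
    term *= -10.0 / (r + 1)    # a_{r+1} = a_r * (-10)/(r+1), a stable recurrence

true = math.exp(-10.0)
print(s, true, abs(s - true) / true)   # the relative error is far above machine eps
```

In single precision the same computation even produces the wrong sign, as Table 10.3 shows.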


10.5 Stopping rule for term by term summation of rapidly converging sequences

If a series (10.8) is rapidly converging, numerical evaluation of its sum s using term by term summation is often efficient. Thus we calculate the sequence (10.18) using the recursion

s̃_{r+1} = s̃_r + ã_r,  r = 1, 2, ....

We note that according to (10.11) the absolute values of the terms will decrease if r > N, but N is unknown and needs to be determined in the course of the calculations. The error in the terms is in general not known. The same is true of the number M. We define

M_n = Σ_{r=0}^{n-1} |a_r|.

The latter sequence is also rapidly converging and may be evaluated in the same way as the original series, using term by term summation. We use the approximation

s ≈ s̃_n,

where n is such that

|s̃_{n+1} - s̃_n| ≥ |s̃_n - s̃_{n-1}|,

and at the same time

|s̃_n - s̃_{n-1}| < ε · M_n,

where ε is an upper bound for the relative error in the calculated terms. We observe that the absolute values of the terms may increase with the index in the beginning, but sooner or later the absolute values will decrease. However, the calculated values of the terms may behave differently, since the numerical values are influenced by errors. Hence we want to stop when new terms do not add information, and we also want to avoid a premature stop at a low index, when the asymptotic property is not yet valid. In the case of the rapidly convergent series (10.20) the first 10 terms show increasing absolute values, and hence we need to add more than 10 terms. On the other hand, the accuracy of the calculated sum does not increase by using more than 40 terms. In this case we need more than single precision in order to get a reasonably good estimate of the sum. This would be established by calculating the sequence M_n defined above.

10.6 Series which may be majorised by rapidly converging series

We have now described how to treat rapidly converging series, but need to extend our results to a slightly more general class. Consider the two series whose terms


a_r and b_r are given by

a_r = exp(-r^2) sin(rπ/6),  b_r = exp(-r^2).

The second series satisfies the sufficient condition for rapid convergence and may be treated as described above, but the first one is obviously not rapidly convergent, since a_6 = a_12 = ··· = 0. However, |a_r| ≤ b_r, and hence the absolute values of the remainders of the first series are no greater than those of the second; if we approximate the sum of the first series by the partial sum containing the first n terms, then the associated truncation error may be estimated by b_n instead of |a_n|. We define a general rule after introducing

Definition 10.6.1 Let two series with terms a_r and b_r be given, such that |a_r| ≤ b_r. Then the second series is said to majorise the first.

We now establish in a straightforward way:

Theorem 10.6.2 Let two series with terms a_r and b_r be given, such that |a_r| ≤ b_r, and let the second series be rapidly convergent. Let further s_n and s̃_n be the exact and calculated values of the partial sums of the first series, which has the sum s. Then we have

s - s̃_n = s - s_n + s_n - s̃_n,

giving

|s - s̃_n| ≤ |s - s_n| + |s_n - s̃_n|,

where for large values of n we may use the estimate |s - s_n| ≈ b_n, and for all n we have

|s_n - s̃_n| ≤ e · Σ_{r=0}^{n-1} |a_r|,

with

|a_r - ã_r| ≤ e · |a_r|.
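The stopping rule of Section 10.5 can be sketched as follows (Python; the function names and the choice ε = 2.3·10^{-16} for double precision are my own, not from the notes):

```python
import math

def stopped_sum(terms, eps=2.3e-16):
    """Term by term summation, stopped at the first n with
    |s~_{n+1} - s~_n| >= |s~_n - s~_{n-1}|  and  |s~_n - s~_{n-1}| < eps * M_n,
    where M_n is the running sum of |a_r| and eps bounds the terms' relative error."""
    it = iter(terms)
    a0, a1 = next(it), next(it)
    s_prev, s, M = a0, a0 + a1, abs(a0) + abs(a1)
    for a in it:
        s_next = s + a
        if abs(s_next - s) >= abs(s - s_prev) and abs(s - s_prev) < eps * M:
            return s
        M += abs(a)
        s_prev, s = s, s_next
    return s

def terms_1020():
    # terms of series (10.20): a_r = (-10)^r / r!, via a stable recurrence
    a, r = 1.0, 0
    while True:
        yield a
        r += 1
        a *= -10.0 / r

est = stopped_sum(terms_1020())
print(est, math.exp(-10.0))   # stops only after the terms' initial growth has passed
```

As the text predicts, the rule is not fooled by the first 10 growing terms and halts once further terms no longer add information.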


Chapter 11

Numerical integration with the trapezoidal and midpoint rules

11.1 Introduction

We discuss here the problem of evaluating the integral

∫_a^b f(t) dt,    (11.1)

when f is continuous and an equidistant table of f is given. We treat first the case when a and b are finite numbers. Let n be a positive integer and put h = (b - a)/n. Then we have

∫_a^b f(t) dt = Σ_{i=0}^{n-1} ∫_{a+ih}^{a+(i+1)h} f(t) dt.    (11.2)

Next we want to approximate the integrals on the right in (11.2). Making a change of variables we get

∫_{a+ih}^{a+(i+1)h} f(t) dt = ∫_{-h/2}^{h/2} g(u) du,    (11.3)

with t = a + (i + 1/2)h + u, g(u) = f(t(u)). Thus the problem of evaluating the integral (11.1) has been reduced to the task of determining n integrals of the type of the right hand side of (11.3).


11.2 Constructing the trapezoidal rule and the midpoint rule with the method of undetermined coefficients

We consider first the trapezoidal rule. Thus we seek constants A and B such that the rule

∫_{-h/2}^{h/2} g(u) du = h[A·g(-h/2) + B·g(h/2)]

gives exact results for the p test functions

g_r(u) = u^r,  r = 0, 1, ..., p - 1.    (11.4)

We want to determine A and B such that p can be chosen as large as possible. We note that

∫_{-h/2}^{h/2} g_r(u) du = 0 if r is odd,  and = 2·(h/2)^{r+1}/(r + 1) if r is even.

Taking p = 2 we get the linear system of equations

h = h(A + B),
0 = (h^2/2)(-A + B),

which has the solution A = B = 1/2. This gives the trapezoidal rule:

∫_{-h/2}^{h/2} g(u) du = (h/2)[g(-h/2) + g(h/2)] + R_T(g),    (11.5)

where R_T(g) is the truncation error associated with the function g. If we now put g(u) = u^2, we find

R_T(g) = h^3/12 - h^3/4 = -h^3/6.

If g is odd, i.e. if g(-u) = -g(u), we find that R_T(g) = 0. Hence the trapezoidal rule gives exact results for all polynomials of degree less than 2 and for all odd functions.

To construct the midpoint rule we determine the constant C such that the rule

∫_{-h/2}^{h/2} g(u) du = h·C·g(0)

gives exact results for the first p test functions (11.4), where p shall be taken as large as possible. Taking p = 1 we get the relation h = hC, i.e. C = 1,


and hence we get the midpoint rule:

∫_{-h/2}^{h/2} g(u) du = h·g(0) + R_M(g).    (11.6)

We find that R_M(g) = 0 if g(u) = 1 and if g(u) = u. Also, R_M(g) = 0 if g is an odd function.

11.3 Approximating general integrals with the trapezoidal and midpoint rules with step-size h

We next apply (11.5) to each integral on the right hand side of (11.2) and arrive at the approximation

I(f) = T(h, f) + R_T(f),    (11.7)

where

I(f) = ∫_a^b f(t) dt,

T(h, f) = h[(f(a) + f(b))/2 + Σ_{i=1}^{n-1} f(a + ih)],  h = (b - a)/n,

and R_T(f) is the truncation error associated with f. In the same way we find for the midpoint rule

I(f) = M(f) + R_M(f),    (11.8)

M(f) = h Σ_{i=1}^{n} f(a + (i - 1/2)h),  h = (b - a)/n,    (11.9)

where R_M(f) is the truncation error of the midpoint rule associated with f. We also give the trapezoidal approximations for integrals over infinite intervals. Thus

∫_0^∞ f(t) dt = h[f(0)/2 + Σ_{r=1}^{∞} f(rh)] + R_T(f),    (11.10)

∫_{-∞}^{∞} f(t) dt = h Σ_{r=-∞}^{∞} f(rh) + R_T(f).    (11.11)

11.4 Improving the accuracy of the trapezoidal rule by decreasing the step-size h

It may be shown that, under very general conditions,

lim_{h→0} T(h, f) = I(f).
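The three composite rules above can be sketched directly in code (Python; the function names and truncation length for the real-line rule are my own assumptions, not from the notes):

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal sum T(h, f) of (11.7), h = (b - a)/n."""
    h = (b - a) / n
    return h * (0.5*(f(a) + f(b)) + sum(f(a + i*h) for i in range(1, n)))

def midpoint(f, a, b, n):
    """Composite midpoint sum M(f) of (11.9), h = (b - a)/n."""
    h = (b - a) / n
    return h * sum(f(a + (i - 0.5)*h) for i in range(1, n + 1))

def trapezoid_line(f, h, r_max=60):
    """Rule (11.11) for the real line; the rapidly convergent sum is
    truncated at |r| = r_max, which is ample for fast-decaying f."""
    return h * sum(f(r*h) for r in range(-r_max, r_max + 1))

# ∫_0^1 t^2 dt = 1/3: both finite-interval rules have error O(h^2)
print(trapezoid(lambda t: t*t, 0.0, 1.0, 100),
      midpoint(lambda t: t*t, 0.0, 1.0, 100))
# ∫ exp(-t^2) dt over the real line equals sqrt(pi); here even h = 0.5
# is accurate to about 9 decimals, as Example 11.4.1 below reports
print(trapezoid_line(lambda t: math.exp(-t*t), 0.5))
```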


In the case of (11.2) it is sufficient to require that f is continuous. Thus we select a fixed h and evaluate the sequence

T_{1,r} = T(h_r, f)  with  h_r = 2^{-r} h,  r = 0, 1, ....

Thus we have

lim_{r→∞} T_{1,r} = I(f).

Under certain general conditions the convergence is rapid for (11.11), and for (11.8) it is rapid when f is periodic with period (b - a); see [12]. Otherwise the convergence is geometric with quotient 1/4, i.e. the truncation error is divided by approximately 4 when h is halved. We illustrate the rapid convergence with Examples 11.4.1 and 11.4.2. The geometric convergence is illustrated by Examples 11.4.3 and 11.4.4.

Example 11.4.1

∫_{-∞}^{∞} e^{-t^2} dt = √π.

(11.11) gives the following approximation for the integral:

2h(0.5 + e^{-h^2} + e^{-4h^2} + e^{-9h^2} + ...).

This sum converges rapidly for each h. Taking h = 1 we get the sum 1.7726, which approximates the integral with an error < 0.0002, while h = 0.5 gives the sum 1.772453851, which approximates the integral correctly to 9 decimal places. The error in the first case was caused by approximating the integral with a sum, not by truncating the infinite sum.

Example 11.4.2

∫_{-π}^{π} e^{cos θ} cos(sin θ) dθ = 2π.

Here we start using (11.8) with a = 0, b = 2π, taking h = 2π, π, π/2, π/4, .... The first approximations in this sequence are poor. We get the sequence of errors 10.97, 3.41, 0.25, 0.000156, 0.000001. Thus h = π/2 gives the first reasonable estimate in this sequence, with the error 0.25, but further halvings of h give rapidly improved accuracy, so that for h = π/8 the first 6 decimal places are correct.

Example 11.4.3

∫_0^∞ e^{-t} dt = 1.

Here we find the trapezoidal sum

T(h) = h(1/2 + e^{-h} + e^{-2h} + e^{-3h} + ...) = h(1 + e^{-h})/(2(1 - e^{-h})).


We evaluate the closed expression for T(h) for h = 1, 1/2, 1/4, ... and find the corresponding sequence of approximations for the integral: 1.081977, 1.020747, 1.005203, 1.001302, .... These results illustrate the geometric convergence. We verify directly that T(h) = T(-h) and hence conclude that the Taylor series expansion for T(h) contains only even powers of h.

Example 11.4.4

∫_0^1 1/(1 + t) dt = ln 2.

Here we evaluate T(h) for h = 1, 1/2, 1/4, .... The errors in the corresponding approximations for the integral are 0.056853, 0.015186, 0.003877, 0.000975, 0.000244. These numbers illustrate the geometric convergence with factor 1/4.

11.5 Numerical integration over general intervals

In numerical integration one seeks to construct approximation formulas of the form

∫_a^b f(t) dt ≈ Σ_{i=1}^{n} x_i f(t_i),

where the numbers x_i are called weights and the numbers t_i are termed abscissae. We note that the formula above is linear with respect to the function f. This implies that if the formula gives exact integral values for the functions u and v, then it is also exact for the functions w and g defined by

w(t) = u(t) + v(t),  g(t) = c · u(t),  a ≤ t ≤ b,

where c is a constant. In particular, if we require the formula to give exact results for a general linear function, i.e. for

f(t) = y_1 + y_2·t,

where y_1 and y_2 are constants, then we must require it to be exact for the functions u(t) = 1 and v(t) = t. Hence the abscissae and weights must satisfy the two conditions

∫_a^b dt = b - a = Σ_{i=1}^{n} x_i,

∫_a^b t dt = (b^2 - a^2)/2 = Σ_{i=1}^{n} x_i t_i.

The first relation implies that the sum of the weights should be equal to the length of the interval, which holds for most integration rules used in computational practice. We have seen that the requirement that the quadrature rule


should give exact results for all linear integrands f implies that the abscissae and weights must satisfy two equations. It is possible to generate further conditions by requiring that the rule should be exact for polynomials of higher degree, since one then gets one relation for each power of t. In particular, if we choose n distinct abscissae and require the formula to be exact for all polynomials of degree less than n, we get a system of n linear equations with the same number of unknowns. It is known that this system has a unique solution, which defines the weights. If we want to determine the abscissae as well as the weights from the requirement that the rule should be exact for polynomials of a prescribed degree, we get nonlinear systems of equations, whose direct numerical solution is an ill-conditioned task. We next show how the trapezoidal rule may be used for constructing a general integration formula.

11.5.1 The trapezoidal rule for a general table

We consider as before the general problem of evaluating an integral over the finite interval [a, b]. Values of the function f are given at the n points t_1, t_2, ..., t_n, which are such that

a = t_1 < t_2 < ... < t_{n-1} < t_n = b.

The general idea is to apply the trapezoidal rule to each of the n - 1 subintervals [t_1, t_2], [t_2, t_3], ..., [t_{n-1}, t_n] and add up the results. The resulting approximation for the integral is exact for all functions f which are linear in each subinterval. We illustrate with:

Example 11.5.1 The trapezoidal rule based on a non-equidistant table: Calculate

∫_0^{0.7} f(t) dt,  f(t) = e^t/(1 + t),

using the values in the table:

t     f(t)
0.0   1.000000
0.1   1.004701
0.4   1.065589
0.6   1.138824
0.7   1.184560

We next calculate the trapezoidal values for the four subintervals as follows:

∫_0^{0.1} f(t) dt ≈ 0.1 · (1.000000 + 1.004701)/2,

∫_{0.1}^{0.4} f(t) dt ≈ 0.3 · (1.004701 + 1.065589)/2,


$$\int_{0.4}^{0.6} f(t)\,dt \approx 0.2 \cdot \frac{1.065589 + 1.138824}{2},$$
$$\int_{0.6}^{0.7} f(t)\,dt \approx 0.1 \cdot \frac{1.138824 + 1.184560}{2}.$$
Adding up the contributions we get the estimate for the total integral. The computations may be arranged as follows:

t         f(t)       dsum
.000000   1.000000
.100000   1.004701   .100235
.400000   1.065589   .310544
.600000   1.138824   .220441
.700000   1.184560   .116169
integral             .747389

The exact value, rounded to 6 decimal places, is 0.745267.

11.5.2 The trapezoidal rule for a finite interval, general functions and equidistant tables

We illustrate with:

Example 11.5.2
$$\int_0^{1.6} f(t)\,dt, \qquad f(t) = \frac{1}{1+t^2}.$$
We use equidistant tables with step-lengths h = 1.6, 0.8, 0.4, 0.2. We construct an equidistant table over f with step-length 0.2. It contains 9 functional values and can be used to determine 4 different estimates for the integral sought, corresponding to the step-lengths mentioned above. We calculate these values as follows:

h = 1.6:
$$T(1.6) = 1.6 \cdot \frac{f(0) + f(1.6)}{2},$$
h = 0.8:
$$T(0.8) = 0.8 \cdot \Big(\frac{f(0) + f(1.6)}{2} + f(0.8)\Big),$$
h = 0.4:
$$T(0.4) = 0.4 \cdot \Big(\frac{f(0) + f(1.6)}{2} + f(0.4) + f(0.8) + f(1.2)\Big),$$


h = 0.2:
$$T(0.2) = 0.2 \cdot \Big(\frac{f(0) + f(1.6)}{2} + s\Big),$$
where
$$s = f(0.2) + f(0.4) + f(0.6) + f(0.8) + f(1.0) + f(1.2) + f(1.4).$$
The calculations may be arranged as in the following table.

t           f(t)        h=1.6       h=0.8       h=0.4       h=0.2
.00000000   1.00000000  x           x           x           x
.20000000   .96153849                                       x
.40000001   .86206901                           x           x
.60000002   .73529410                                       x
.80000001   .60975605               x           x           x
1.00000000  .50000000                                       x
1.20000005  .40983605                           x           x
1.39999998  .33783785                                       x
1.60000002  .28089887   x           x           x           x
T(h)                    1.02471912  1.00016439  1.00884426  1.01135612

The functional values appearing in the different estimates are marked with the letter x in the corresponding columns. The exact value, rounded to 6 decimal places, is 1.012197. As expected, the accuracy of the estimate becomes better when the step length decreases, and it is possible to show that the truncation error R(h), which arises when the integral is approximated with the trapezoidal sum, satisfies the relation
$$R(h) = O(h^2), \quad h \to 0.$$
Thus the error is roughly divided by 4 when the step-size is halved.

11.5.3 The trapezoidal rule for periodic functions, integral over one period

We discuss the example
$$\int_0^{2\pi} f(t)\,dt, \qquad f(t) = \frac{e^{\cos t}}{2 + \sin t}.$$
This integral may be treated numerically in the same way as the preceding example and we may use the same tabular arrangement. Then we get the following results.


t           f(t)        h=6.28318548  h=3.14159274  h=1.57079637  h=.78539819
.00000000   1.35914087  x             x             x             x
.78539819   .74918175                                             x
1.57079637  .33333331                               x             x
2.35619450  .18213862                                             x
3.14159274  .18393974                 x             x             x
3.92699099  .38136852                                             x
4.71238899  1.00000000                              x             x
5.49778748  1.56866407                                            x
6.28318548  1.35914075  x             x             x             x
T(h)                    8.53973389    4.84773064    4.51826048    4.52213955

The exact result, rounded to 6 decimal places, is 4.522171. We notice that the error decreases much more rapidly than in the preceding examples. This occurs because the integrand f satisfies the relation
$$f(t + 2\pi) = f(t).$$
This means that f is a periodic function with period 2π, and the integral extends over exactly one period.

11.5.4 The trapezoidal rule for the real line

We consider the example
$$\int_{-\infty}^{\infty} f(t)\,dt, \qquad f(t) = \frac{e^{-t^2}}{4 + \cos t}.$$
The corresponding trapezoidal approximation becomes
$$T(h) = h \cdot \sum_{n=-\infty}^{\infty} f(nh).$$
Since f is even, i.e. such that f(-t) = f(t), we get
$$T(h) = h \cdot f(0) + 2h \cdot \sum_{n=1}^{\infty} f(nh).$$
Thus, for each value of h we need to evaluate an infinite sum, which converges rapidly; a small number of terms is required for determining the sum to good approximation. The accuracy increases rapidly as the step-size is chosen smaller. We get the formulas:

h = 1.0:
$$T(1.0) = f(0) + 2 \cdot (f(1) + f(2) + f(3) + \ldots),$$


h = 0.5:
$$T(0.5) = 0.5 \cdot f(0) + f(0.5) + f(1.0) + f(1.5) + \ldots,$$
h = 0.25:
$$T(0.25) = 0.25 \cdot f(0) + 0.5 \cdot (f(0.25) + f(0.5) + f(0.75) + \ldots).$$
We give the following numerical results:

step-size  # values  int.-est.
1          13        0.3723539
0.5        19        0.3723434
0.25       35        0.3723434
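These real-line trapezoidal sums are easy to reproduce. The sketch below is an illustration, not the notes' own program; in particular the cutoff that stops the summation once a term drops below 1e-15 is my choice (the terms decay like e^{-n^2 h^2}, so very few are needed).

```python
import math

def f(t):
    return math.exp(-t * t) / (4.0 + math.cos(t))

def trapezoid_real_line(h):
    """T(h) = h*f(0) + 2h*sum_{n>=1} f(nh), truncated when the terms vanish."""
    total = f(0.0)
    n = 1
    while True:
        term = 2.0 * f(n * h)   # f is even, so n and -n contribute equally
        if term < 1e-15:
            break
        total += term
        n += 1
    return h * total

for h in (1.0, 0.5, 0.25):
    print(h, trapezoid_real_line(h))
```

The values agree with the table above; already h = 0.5 gives the integral to the printed accuracy.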


Chapter 12

Numerical treatment of ordinary differential equations

12.1 Introduction: Vector- and matrix-valued functions

Let t be a real-valued parameter. To each value of t we associate a vector x(t) ∈ R^n. Thus we have defined a mapping from a real t-interval to a curve in n-dimensional space. For n = 3, x(t) may be interpreted as the position of a particle at time t. More generally, x(t) could describe the state of some system at time t. Next we introduce the rate of change of x, which is defined by the derivative
$$\dot{x}(t) = \frac{dx}{dt},$$
which is obtained by differentiating each component of x with respect to t. This derivative is often interpreted as the velocity vector. In the same way, the second derivative of x may be interpreted as acceleration.

Example 12.1.1 Let x(t) = (3 cos t, 4 sin t)^T be the position at time t of a particle in the plane. It hence travels along an ellipse with half-axes 3 and 4. The motion is periodic and it returns to the point x(0) when the time 2π has elapsed. Its velocity is given by ẋ(t) = (-3 sin t, 4 cos t)^T and its acceleration by ẍ(t) = (-3 cos t, -4 sin t)^T.

Vector-valued functions are integrated and differentiated component-wise and we have the relation
$$x(t) = x(0) + \int_0^t \dot{x}(\tau)\,d\tau.$$


A system of ordinary differential equations of the first order is given by
$$\dot{x}(t) = f(t, x(t)), \tag{12.1}$$
where f is a vector-valued function with two arguments, namely the scalar t and the vector x(t). If t is missing as an argument of f, then the differential equation is termed autonomous. A system of order k is defined by
$$x^{(k)}(t) = f(t, x(t), \dot{x}(t), \ldots, x^{(k-1)}(t)), \tag{12.2}$$
where f has as arguments the scalar t and the k vectors x(t), ..., x^{(k-1)}(t). Such a system may be replaced by an equivalent system of the first order by introducing the block vector y(t) = (x(t), ..., x^{(k-1)}(t)), which has kn components.

We will also need matrix-valued functions. Then with each t we associate a square matrix A(t) ∈ R^{n×n}. Integrals and derivatives of A are defined by means of element-wise integration and differentiation. We also define the matrix exp(A) by means of the formula
$$\exp(A) = \sum_{r=0}^{\infty} \frac{A^r}{r!}.$$
This series converges for any square matrix A.

12.2 Initial-value problems

Definition 12.2.1 An initial-value problem is the task of determining a vector-valued function x such that for a given vector x_0
$$x(0) = x_0, \qquad \dot{x}(t) = f(t, x(t)). \tag{12.3}$$

Example 12.2.2 Initial-value problem: Find x(t) such that
$$x(0) = 1, \quad \dot{x}(0) = 2, \quad \ddot{x}(t) + \sin(t)\,\dot{x}(t) + \cos x(t) = t^2.$$
We have a differential equation of the second order which we replace by an equivalent system of the first order by setting
$$y_1(t) = x(t), \qquad y_2(t) = \dot{x}(t),$$
and arriving at
$$\dot{y}_1(t) = y_2(t),$$
$$\dot{y}_2(t) = -y_2(t)\sin t - \cos y_1(t) + t^2.$$
The side-conditions involve only the point t = 0 (initial-value problem).

Definition 12.2.3 The system (12.3) is termed linear if the dependent function x occurs linearly in (12.3); otherwise the system is said to be nonlinear.
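The series for exp(A) converges for every square matrix, so for small matrices a direct partial-sum evaluation is adequate (for matrices with large norm one would prefer scaling-and-squaring; the term count 30 below is an ad hoc choice). A sketch:

```python
import numpy as np

def expm_series(A, terms=30):
    """Partial sum of exp(A) = sum_r A^r / r!."""
    result = np.eye(A.shape[0])
    power = np.eye(A.shape[0])
    for r in range(1, terms):
        power = power @ A / r          # A^r / r! built incrementally
        result = result + power
    return result

# For A = [[0, 1], [-1, 0]] we have A^2 = -I, so exp(A) is the
# plane rotation [[cos 1, sin 1], [-sin 1, cos 1]].
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
E = expm_series(A)
```

Since the factorials dominate any fixed power of ||A||, 30 terms already give full double precision here.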


Example 12.2.4 The system in Example 12.2.2 is nonlinear, but the differential equation
$$\ddot{x}(t) + \sin(t)\,\dot{x}(t) + \cos(t)\,x(t) = e^t$$
is linear.

Picard iteration for (12.3): Iterate according to
$$x^0(t) = x_0, \quad t \ge 0, \qquad x^{l+1}(t) = x_0 + \int_0^t f(u, x^l(u))\,du.$$
It may be shown that this iteration converges to the solution x under fairly weak assumptions on f.

The linear initial-value problem for a single linear differential equation of the first order,
$$x'(t) + f(t)x(t) = g(t),$$
has the solution
$$x(t) = e^{-F(t)}x(0) + e^{-F(t)}\int_0^t e^{F(\tau)}g(\tau)\,d\tau, \qquad F(\tau) = \int_0^{\tau} f(u)\,du.$$
However, this expression for the general solution of a linear differential equation is not valid for systems. In the special case
$$x(0) = x_0, \qquad x'(t) = Ax(t) + g(t),$$
with A a constant matrix, we have
$$x(t) = e^{At}x_0 + \int_0^t e^{A(t-\tau)}g(\tau)\,d\tau,$$
or
$$x(t) = e^{At}\Big(x_0 + \int_0^t e^{-A\tau}g(\tau)\,d\tau\Big).$$

12.3 Numerical schemes based on discretisation for initial-value problems

We consider the system (12.3) with the initial-value condition x(0) = x_0 and describe two methods for constructing approximate values x_n for x(nh).
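The Picard iteration above can also be carried out numerically by representing each iterate x^l on a grid and applying the trapezoidal rule to the integral. A small sketch (the grid of 1000 steps on [0, 1] and the 25 iterations are my own discretisation choices) for the test problem x' = x, x(0) = 1, whose solution is e^t:

```python
import numpy as np

def picard(f, x0, t_end=1.0, steps=1000, iterations=25):
    t = np.linspace(0.0, t_end, steps + 1)
    h = t[1] - t[0]
    x = np.full_like(t, x0)                  # x^0(t) = x0
    for _ in range(iterations):
        integrand = f(t, x)
        # cumulative trapezoidal integral of f(u, x^l(u)) from 0 to t
        cumulative = np.concatenate(
            ([0.0], np.cumsum(h * (integrand[1:] + integrand[:-1]) / 2.0)))
        x = x0 + cumulative                  # x^{l+1}
    return t, x

t, x = picard(lambda t, x: x, 1.0)
```

For this problem the l-th Picard iterate is the degree-l Taylor polynomial of e^t, so after 25 iterations only the quadrature error of the trapezoidal rule remains.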


12.3.1 Euler's method
$$x_{r+1} = x_r + hf(t_r, x_r), \qquad t_r = rh.$$
Global error: O(h), h → 0. Asymptotic expression for the global error R_T(t) at a fixed point t:
$$R_T(t) = c_1(t)h + c_2(t)h^2 + \ldots \quad \text{(all powers of } h).$$

12.3.2 The trapezoidal method
$$x_{r+1} = x_r + h[f(t_r, x_r) + f(t_{r+1}, x_{r+1})]/2.$$
The equation for x_{r+1} may be solved iteratively according to the scheme:
$$x_{r+1,0} = x_r + hf(t_r, x_r),$$
$$x_{r+1,l+1} = x_r + hf(t_r, x_r)/2 + hf(t_{r+1}, x_{r+1,l})/2, \quad l = 0, 1, \ldots,$$
$$x_{r+1} = \lim_{l\to\infty} x_{r+1,l}.$$
Special case: the trapezoidal method for a system of linear differential equations
$$\dot{x}(t) + A(t)x(t) = g(t).$$
In each step we obtain the following system of linear equations for the unknown vector x_{r+1}:
$$\Big(I + \frac{h}{2}A(t_{r+1})\Big)x_{r+1} = \frac{h}{2}\big(g(t_r) + g(t_{r+1})\big) + \Big(I - \frac{h}{2}A(t_r)\Big)x_r.$$
Global error for the trapezoidal method in the general case: O(h²), h → 0. Asymptotic expression for the global error R_T(t) at a fixed point t:
$$R_T(t) = c_2(t)h^2 + c_4(t)h^4 + \ldots \quad \text{(even powers of } h).$$

12.4 Examples in initial-value problems

12.4.1 Euler's method

We discuss the following initial-value problem:
$$x' = t^2 - x, \qquad x(0) = 3.$$
It has the exact solution
$$x(t) = e^{-t} + (t-1)^2 + 1.$$
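The Euler recursion for this problem is reproduced below. This is a sketch in double precision; the notes' tables were computed in single precision, so the last digits can differ slightly.

```python
def euler(f, x0, h, n_steps):
    """Euler's method: x_{n+1} = x_n + h*f(t_n, x_n)."""
    t, x = 0.0, x0
    values = [x]
    for n in range(n_steps):
        x = x + h * f(t, x)
        t = t + h
        values.append(x)
    return values

# x' = t^2 - x, x(0) = 3; exact solution exp(-t) + (t-1)^2 + 1
values = euler(lambda t, x: t * t - x, 3.0, 0.1, 10)
```

Running this reproduces the column of x-values in Table 12.1, ending with the approximation for x(1).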


Table 12.1: Euler's method, step-size h = 0.1

t         x          k = h·f(t,x)
.000000   3.000000   -.300000
.100000   2.700000   -.269000
.200000   2.431000   -.239100
.300000   2.191900   -.210190
.400000   1.981710   -.182171
.500000   1.799539   -.154954
.600000   1.644585   -.128459
.700000   1.516127   -.102613
.800000   1.413514   -.077351
.900000   1.336163   -.052616
1.000000  1.283546

We use Euler's method to calculate a table of approximate values of the solution, (t_n, x_n), n = 0, 1, ..., where t_n = nh and x_n is the calculated approximation for x(t_n). The constant h is called the step-size. If we apply Euler's method to the present differential equation, we get the recursion
$$x_0 = 3, \qquad x_{n+1} = x_n + h \cdot (t_n^2 - x_n), \quad n = 0, 1, \ldots$$
Thus we find for h = 0.1:
$$x_0 = 3.000000, \qquad x_1 = 3 + 0.1 \cdot (0^2 - 3) = 2.700000,$$
$$x_2 = 2.7 + 0.1 \cdot ((0.1)^2 - 2.7) = 2.431000.$$
Continuing, we generate Table 12.1. If we repeat the calculation with smaller step-sizes, we get denser tables, yielding more accurate values for the solution. We use the step-sizes 0.1, 0.05 and 0.025 to calculate approximations for the solution at t = 1 and obtain the values, rounded to 4 decimal places: 1.2835, 1.3264, 1.3473. Since the exact value rounded to 4 decimals is 1.3679, the errors are -0.0844, -0.0415 and -0.0206 respectively, illustrating the fact that when we use Euler's method and the equation has a solution which may be differentiated an arbitrary number of times, the error in the calculated solution at a fixed point is proportional to the step-size used. We also note that the computational effort increases when the step-size is chosen smaller. In the present example the step-size 0.1 corresponds to 10 steps, 0.05 to 20 steps and 0.025 to 40 steps for constructing a table of the solution in the interval [0, 1].

12.4.2 The trapezoidal method

Example 12.4.1 We consider the problem
$$x'(t) - 2t\,x(t) = 1/(1+t), \qquad x(0) = 0.$$


It is directly verified that the solution of this problem may be written
$$x(t) = e^{t^2}\int_0^t e^{-u^2}\,\frac{1}{1+u}\,du.$$
This form is not very suitable for numerical evaluation. Instead we solve the differential equation by the trapezoidal method, using the step-sizes h = 0.1 and h = 0.05. In each step one needs to solve a linear equation, but this may be done explicitly. The following results were obtained.

Number of equal steps in [0,1]: 10
.00   .000000
.10   .096419
.20   .188270
.30   .280580
.40   .378306
.50   .486827
.60   .612502
.70   .763348
.80   .949970
.90   1.186880
1.00  1.494461

Number of equal steps in [0,1]: 20
.00   .000000
.05   .048932
.10   .096071
.15   .142084
.20   .187598
.25   .233222
.30   .279562
.35   .327231
.40   .376871
.45   .429163
.50   .484848
.55   .544746
.60   .609773
.65   .680975
.70   .759548
.75   .846882
.80   .944598
.85   1.054606
.90   1.179160
.95   1.320946
1.00  1.483170
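For this linear equation, f(t, x) = 2tx + 1/(1+t), the implicit trapezoidal step can be solved for x_{n+1} in closed form. The sketch below does this in double precision, so small deviations from the single-precision tables above are to be expected:

```python
def trapezoidal_linear(h, n_steps):
    """Trapezoidal method for x' - 2tx = 1/(1+t), x(0) = 0.
    Each implicit step is a linear equation solved explicitly:
    x_{n+1}(1 - h*t_{n+1}) = x_n(1 + h*t_n) + (h/2)(g_n + g_{n+1})."""
    x = 0.0
    for n in range(n_steps):
        t0, t1 = n * h, (n + 1) * h
        g0, g1 = 1.0 / (1.0 + t0), 1.0 / (1.0 + t1)
        x = (x * (1.0 + h * t0) + h * (g0 + g1) / 2.0) / (1.0 - h * t1)
    return x

x_h01 = trapezoidal_linear(0.1, 10)    # estimate of x(1), h = 0.1
x_h005 = trapezoidal_linear(0.05, 20)  # estimate of x(1), h = 0.05
```

Note the O(h²) behaviour: the difference between the two runs shrinks by a factor of about 4 when h is halved again.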


Example 12.4.2 We study the nonlinear problem
$$x'(t) = t + \cos x(t), \qquad x(0) = 0.$$
Application of the trapezoidal method means that one needs to solve a nonlinear equation in each step. This may be done by means of a fix-point method. An initial approximation is obtained by means of the explicit Heun's method (which is not treated in this course). The following numerical results were obtained.

Number of equal steps in [0,1]: 1
.00   .00000000
Heun:
1.00  1.27015114
Fix-point
1.00  1.14806819
1.00  1.20512497
1.00  1.17878819
1.00  1.19102252
1.00  1.18535519
1.00  1.18798399
1.00  1.18676543
1.00  1.18733037
1.00  1.18706846
1.00  1.18718994
Trapezoidal
1.00  1.18718994

Number of equal steps in [0,1]: 2
.00   .00000000
Heun:
.50   .59439564
Fix-point
.50   .58212179
.50   .58382452
.50   .58359015
.50   .58362246
.50   .58361799
.50   .58361864
.50   .58361852
.50   .58361852
Trapezoidal
.50   .58361852
Heun:
1.00  1.24586463
Fix-point
1.00  1.24704802
1.00  1.24676764


1.00  1.24683404
1.00  1.24681830
1.00  1.24682212
1.00  1.24682117
1.00  1.24682140
1.00  1.24682140
Trapezoidal
1.00  1.24682140

We also give tables of the results obtained by using the step-sizes 0.1 and 0.05. Only the values which are common to the two tables are printed:

Number of equal steps in [0,1]: 10
.00   .00000000
.10   .10472606
.20   .21826585
.30   .33923015
.40   .46604827
.50   .59706527
.60   .73065168
.70   .86530888
.80   .99975455
.90   1.13297820
1.00  1.26426411

Number of equal steps in [0,1]: 20
.00   .00000000
.10   .10479728
.20   .21842280
.30   .33948070
.40   .46639177
.50   .59749222
.60   .73114491
.70   .86584604
.80   1.00031114
.90   1.13353086
1.00  1.26479268
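A sketch of the nonlinear case: each trapezoidal step is solved by fix-point iteration. Unlike the runs above, the iteration here is started from the simple Euler predictor rather than Heun's method (the converged limit does not depend on the starting guess, since the map is a contraction with factor at most h/2):

```python
import math

def trapezoidal_fixpoint(f, x0, h, n_steps, tol=1e-12):
    x = x0
    for n in range(n_steps):
        t0, t1 = n * h, (n + 1) * h
        half = x + h * f(t0, x) / 2.0      # fixed part of the implicit step
        guess = x + h * f(t0, x)           # Euler predictor as starting value
        while True:
            new = half + h * f(t1, guess) / 2.0
            if abs(new - guess) < tol:
                break
            guess = new
        x = new
    return x

# x'(t) = t + cos x(t), x(0) = 0
rhs = lambda t, x: t + math.cos(x)
x_h01 = trapezoidal_fixpoint(rhs, 0.0, 0.1, 10)   # h = 0.1, value at t = 1
x_one = trapezoidal_fixpoint(rhs, 0.0, 1.0, 1)    # a single step of length 1
```

Both the single-step limit and the h = 0.1 value at t = 1 agree with the converged fix-point values printed above.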


Chapter 13

A collection of useful formulas

13.1 Some formulas from the theory of differentiation and integration

13.1.1 Summation and integration

We have
$$\sum_{i=p}^{q} a_i = \sum_{k=p}^{q} a_k.$$
If q ≥ p each of these sums has q - p + 1 terms; if q < p the value of the sum is 0.
$$\int_{s=a}^{b} f(s)\,ds = \int_{t=a}^{b} f(t)\,dt = \int_a^b f(s)\,ds.$$
$$\sum_{i=1}^{n}\sum_{k=1}^{n} f_i g_k h_{i,k} = \sum_{i=1}^{n} f_i \sum_{k=1}^{n} g_k h_{i,k} = \sum_{k=1}^{n} g_k \sum_{i=1}^{n} f_i h_{i,k}.$$
$$\int_{s=a}^{b}\int_{t=a}^{b} f(s)g(t)h(s,t)\,ds\,dt = \int_{s=a}^{b} f(s)\Big\{\int_{t=a}^{b} g(t)h(s,t)\,dt\Big\}\,ds = \int_{t=a}^{b} g(t)\Big\{\int_{s=a}^{b} f(s)h(s,t)\,ds\Big\}\,dt.$$

13.1.2 Factorials and the Γ-function
$$\Gamma(x) = \int_0^{\infty} t^{x-1}e^{-t}\,dt, \qquad \Gamma(n+1) = n! \quad (n \text{ integer}, \; n \ge 1),$$
$$0! = 1, \quad 1! = 1, \quad 2! = 2, \quad 3! = 1\cdot2\cdot3 = 6, \quad 4! = 1\cdot2\cdot3\cdot4 = 24, \qquad \Gamma(1/2) = \sqrt{\pi}.$$


13.1.3 Stirling's formula
$$n! = \sqrt{2\pi n}\,(n/e)^n\,[1 + O(1/n)], \quad n \to \infty.$$

13.1.4 Binomial coefficients
$$(1+x)^n = \sum_{k=0}^{n} \binom{n}{k}x^k.$$
For n, k nonnegative integers,
$$t_{n,k} = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \qquad 0 \le k \le n,$$
may be calculated according to the rules:
$$t_{n,0} = t_{n,n} = 1, \quad n = 0, 1, \ldots,$$
$$t_{n,k} = t_{n-1,k-1} + t_{n-1,k}, \quad k = 1, 2, \ldots, n-1.$$
The first few coefficients t_{n,k} are given in the table below:

Values of t_{n,k} for selected values of k and n
k →   0  1  2  3  4  5  6  7
n=0   1
n=1   1  1
n=2   1  2  1
n=3   1  3  3  1
n=4   1  4  6  4  1
n=5   1  5 10 10  5  1
n=6   1  6 15 20 15  6  1
n=7   1  7 21 35 35 21  7  1

13.1.5 Higher derivatives of a product
$$\frac{d^n}{dx^n}\,f(x)g(x) = \sum_{k=0}^{n} \binom{n}{k} f^{(k)}(x)\,g^{(n-k)}(x).$$

13.1.6 Change of variable in an integral

If g is monotonic on [a, b] then, under appropriate conditions,
$$\int_a^b f(g(t))\,g'(t)\,dt = \int_{g(a)}^{g(b)} f(u)\,du.$$
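The recursion for t_{n,k} in 13.1.4 builds Pascal's triangle row by row; a minimal sketch:

```python
import math

def pascal_rows(n_max):
    """Rows 0..n_max of Pascal's triangle from
    t_{n,0} = t_{n,n} = 1 and t_{n,k} = t_{n-1,k-1} + t_{n-1,k}."""
    rows = [[1]]
    for n in range(1, n_max + 1):
        prev = rows[-1]
        row = [1] + [prev[k - 1] + prev[k] for k in range(1, n)] + [1]
        rows.append(row)
    return rows

rows = pascal_rows(7)
```

The last row reproduces the n = 7 line of the table above, and every entry agrees with the factorial formula for the binomial coefficient.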


13.1.7 Integration by parts
$$\int_a^b f(t)g(t)\,dt = F(b)g(b) - F(a)g(a) - \int_a^b F(t)g'(t)\,dt,$$
where F is any function such that F'(t) = f(t).

13.1.8 Differentiation with respect to a parameter

If
$$F(x) = \int_{g_1(x)}^{g_2(x)} K(x,t)\,dt,$$
then under appropriate conditions
$$F'(x) = K(x, g_2(x))\,g_2'(x) - K(x, g_1(x))\,g_1'(x) + \int_{g_1(x)}^{g_2(x)} \frac{\partial K(x,t)}{\partial x}\,dt.$$
Special cases:
$$F(x) = \int_a^b K(x,t)\,dt, \qquad F'(x) = \int_a^b \frac{\partial K(x,t)}{\partial x}\,dt,$$
$$F(x) = \int_a^x f(t)\,dt, \qquad F'(x) = f(x).$$

13.1.9 Mean-value theorems, valid under appropriate conditions
$$\frac{f(b) - f(a)}{b - a} = f'(s), \quad a < s < b. \tag{13.1}$$
$$\int_a^b w(t)f(t)\,dt = f(s)\int_a^b w(t)\,dt, \quad w(t) \ge 0, \; a < s < b. \tag{13.2}$$
$$\sum_{k=1}^{n} w_k f(t_k) = f(s)\sum_{k=1}^{n} w_k, \quad w_k \ge 0, \quad a \le s \le b, \; a \le t_1 < \ldots < t_n \le b. \tag{13.3}$$

13.1.10 Taylor's formula and power expansions
$$f(a + x) = \sum_{k=0}^{n-1} \frac{f^{(k)}(a)}{k!}x^k + \frac{f^{(n)}(a + \theta x)}{n!}x^n, \quad 0 < \theta < 1.$$


Power expansions for some elementary functions:
$$e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + \ldots + x^n/n! + \ldots \quad \text{(all } x),$$
$$\sin x = x - x^3/6 + x^5/120 - x^7/5040 + \ldots + (-1)^n x^{2n+1}/(2n+1)! + \ldots \quad \text{(all } x),$$
$$\cos x = 1 - x^2/2 + x^4/24 - x^6/720 + \ldots + (-1)^n x^{2n}/(2n)! + \ldots \quad \text{(all } x),$$
$$\ln(1+x) = x - x^2/2 + x^3/3 - \ldots + (-1)^{n+1}x^n/n + \ldots \quad (|x| < 1),$$
$$\arctan x = x - x^3/3 + x^5/5 - x^7/7 + \ldots + (-1)^n x^{2n+1}/(2n+1) + \ldots \quad (|x| < 1),$$
$$(1+x)^p = 1 + px + p(p-1)x^2/2 + p(p-1)(p-2)x^3/6 + \ldots + \binom{p}{n}x^n + \ldots \quad (|x| < 1),$$
$$(1+x)^{-1} = 1 - x + x^2 - x^3 + \ldots + (-x)^n + \frac{(-x)^{n+1}}{1+x},$$
$$(1-x)^{-1} = 1 + x + x^2 + x^3 + \ldots + x^n + \frac{x^{n+1}}{1-x},$$
$$\sqrt{1+x} = 1 + x/2 - x^2/8 + x^3/16 - 5x^4/128 + \ldots + (-1)^{n-1}\,\frac{1\cdot3\cdot5\cdots(2n-3)}{2^n \cdot n!}\,x^n + \ldots \quad (|x| < 1).$$
(The general expression in the last series holds only for n ≥ 2.)

13.2 Vector spaces

Vector addition and multiplication by scalars: If α, β are scalars and x ∈ R^n, y ∈ R^n, we define z ∈ R^n by
$$z = \alpha x + \beta y,$$
where
$$z_i = \alpha x_i + \beta y_i, \quad i = 1, \ldots, n.$$
Definition: ||x|| is a vector norm if
$$||x|| \ge 0,$$


$$||x|| = 0 \Rightarrow x = 0,$$
$$||\alpha x|| = |\alpha| \cdot ||x||, \quad \alpha \text{ scalar},$$
$$||x + y|| \le ||x|| + ||y||.$$
Special vector norms in R^n:
$$||x||_1 = \sum_{r=1}^{n} |x_r|, \qquad ||x||_2 = \sqrt{\sum_{r=1}^{n} |x_r|^2}, \qquad ||x||_\infty = \max_{1\le r\le n} |x_r|.$$
Scalar product ⟨x, y⟩, for x, y, x_1, x_2 ∈ R^n:
$$\langle x, y\rangle = \langle y, x\rangle,$$
$$\langle \alpha_1 x_1 + \alpha_2 x_2, y\rangle = \alpha_1\langle x_1, y\rangle + \alpha_2\langle x_2, y\rangle, \quad \alpha_1, \alpha_2 \text{ scalars},$$
$$\langle x, x\rangle \ge 0,$$
$$\langle x, x\rangle = 0 \Rightarrow x = 0.$$
Special case:
$$\langle x, y\rangle = x^T y = \sum_{r=1}^{n} x_r y_r.$$

13.3 Matrix algebra

A ∈ R^{n×m} means that the matrix A has n rows, namely
$$A_{1.}, \ldots, A_{n.} \in R^m,$$
and m columns, namely
$$A_{.1}, \ldots, A_{.m} \in R^n.$$
Matrix-vector product:
$$y = Ax, \quad A \in R^{n\times m}, \; x \in R^m, \; y \in R^n,$$
$$y_i = \sum_{j=1}^{m} a_{i,j}x_j, \quad i = 1, \ldots, n, \qquad y = \sum_{j=1}^{m} A_{.j}\,x_j.$$
Matrix by matrix product:
$$C = AB, \quad A \in R^{n\times k}, \; B \in R^{k\times m}, \; C \in R^{n\times m},$$
$$c_{i,j} = \sum_{l=1}^{k} a_{i,l}b_{l,j} = A_{i.}B_{.j}, \quad i = 1, \ldots, n, \; j = 1, \ldots, m.$$


Transposition: Let A ∈ R^{n×m} and B = A^T. Then
$$B \in R^{m\times n}, \qquad b_{i,j} = a_{j,i}, \quad i = 1, \ldots, m, \; j = 1, \ldots, n.$$

Matrix norms: ||A|| is a matrix norm if
$$||A|| \ge 0,$$
$$||A|| = 0 \Rightarrow A = 0,$$
$$||\alpha A|| = |\alpha| \cdot ||A||, \quad \alpha \text{ scalar},$$
$$||A + B|| \le ||A|| + ||B||.$$
If, in particular,
$$||A|| = \max_{x \ne 0} ||Ax||/||x||,$$
then
$$||AB|| \le ||A|| \cdot ||B||.$$
Special norms for A ∈ R^{n×m}:
$$||A||_1 = \max_{1\le j\le m}\sum_{i=1}^{n} |a_{i,j}|, \qquad ||A||_2^2 = \max_{x\ne0} \frac{x^T A^T A x}{x^T x}, \qquad ||A||_\infty = \max_{1\le i\le n}\sum_{j=1}^{m} |a_{i,j}|.$$

Determinants for 2 by 2 and 3 by 3 matrices:
$$\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21},$$
$$\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}a_{22}a_{33} + a_{21}a_{32}a_{13} + a_{31}a_{12}a_{23} - a_{13}a_{22}a_{31} - a_{23}a_{32}a_{11} - a_{33}a_{12}a_{21}.$$

13.4 Iterative methods for linear systems of equations

13.4.1 General iteration

Consider
$$x = Ax + f,$$
where
$$A \in R^{n\times n}, \quad x \in R^n, \quad f \in R^n.$$


13.4.2 Fix-point iteration
$$x^0 = 0, \qquad x^{l+1} = Ax^l + f, \quad l = 0, 1, \ldots$$
Then
$$||x^{l+1} - x^l|| \le ||A|| \cdot ||x^l - x^{l-1}||,$$
and
$$||x^l - x|| \le \frac{||A||}{1 - ||A||} \cdot ||x^l - x^{l-1}|| \quad \text{when } x = Ax + f.$$

13.4.3 Jacobi iteration on Ax = b

Put x^0 = 0, and set for l = 0, 1, ...
$$x_1^{l+1} = \Big(b_1 - \sum_{j=2}^{n} a_{1,j}x_j^l\Big)\Big/a_{1,1},$$
$$x_i^{l+1} = \Big(b_i - \sum_{j=1}^{i-1} a_{i,j}x_j^l - \sum_{j=i+1}^{n} a_{i,j}x_j^l\Big)\Big/a_{i,i}, \quad i = 2, \ldots, n-1,$$
$$x_n^{l+1} = \Big(b_n - \sum_{j=1}^{n-1} a_{n,j}x_j^l\Big)\Big/a_{n,n}.$$

13.4.4 Gauss-Seidel iteration on Ax = b

Put x^0 = 0, and set for l = 0, 1, ...
$$x_1^{l+1} = \Big(b_1 - \sum_{j=2}^{n} a_{1,j}x_j^l\Big)\Big/a_{1,1},$$
$$x_i^{l+1} = \Big(b_i - \sum_{j=1}^{i-1} a_{i,j}x_j^{l+1} - \sum_{j=i+1}^{n} a_{i,j}x_j^l\Big)\Big/a_{i,i}, \quad i = 2, \ldots, n-1,$$
$$x_n^{l+1} = \Big(b_n - \sum_{j=1}^{n-1} a_{n,j}x_j^{l+1}\Big)\Big/a_{n,n}.$$
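The two sweeps in 13.4.3 and 13.4.4 differ only in whether already-updated components are used within the same sweep. A sketch on a small diagonally dominant system (the matrix and right-hand side are my own example; both iterations converge for it):

```python
def jacobi_step(A, b, x):
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def gauss_seidel_step(A, b, x):
    n = len(b)
    x = list(x)
    for i in range(n):   # updated components x[j], j < i, are used at once
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x

A = [[10.0, 1.5, 0.0], [1.5, 10.0, 1.5], [0.0, 1.5, 10.0]]
b = [1.0, 2.0, 3.0]
xj = [0.0, 0.0, 0.0]
xg = [0.0, 0.0, 0.0]
for _ in range(50):
    xj = jacobi_step(A, b, xj)
    xg = gauss_seidel_step(A, b, xg)
```

After 50 sweeps both iterates solve the system to well beyond single-precision accuracy; Gauss-Seidel gets there in fewer sweeps.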


13.5 Least squares solution of linear systems of equations

Let
$$Ax = b, \quad A \in R^{n\times m}, \; b \in R^n,$$
be a linear system of equations which may or may not have a solution vector. Then x solves the problem
$$\min_x ||Ax - b||_2$$
if and only if x solves the linear system
$$Bx = c, \qquad B = A^T A, \quad c = A^T b,$$
where B ∈ R^{m×m}, c ∈ R^m, and
$$b_{i,j} = \sum_{k=1}^{n} a_{k,i}a_{k,j}, \quad i, j = 1, \ldots, m, \qquad c_i = \sum_{k=1}^{n} a_{k,i}b_k, \quad i = 1, \ldots, m.$$

13.6 Interpolation and approximation with polynomials

13.6.1 Remainders and error bounds

Let f be defined on [a, b] and let a ≤ t_1 < t_2 < ... < t_n ≤ b be n given points. Denote by Q the interpolating polynomial of degree < n which is defined by the conditions
$$Q(t_i) = f(t_i), \quad i = 1, 2, \ldots, n.$$
Put
$$P(t) = \prod_{i=1}^{n} (t - t_i).$$
Then
$$f(t) = Q(t) + P(t)\,\frac{f^{(n)}(\xi)}{n!}, \quad a \le \xi \le b.$$
If the t_i are the scaled Chebyshev points given by
$$t_i = \frac{a+b}{2} + \frac{b-a}{2}\cos\theta_i, \qquad \theta_i = \frac{\pi(i - 1/2)}{n}, \quad i = 1, 2, \ldots, n,$$
then
$$|f(t) - Q(t)| \le 2 \cdot \frac{(b-a)^n}{4^n}\,\max_{a\le t\le b}|f^{(n)}(t)|/n!.$$
For the function f(t) = 1/(1 - xt), x constant, we have for general t_i:
$$\frac{1}{1 - xt} = Q(t) + \frac{P(t)}{(1 - xt)P(1/x)}.$$


13.6.2 Lagrange's formula for the interpolating polynomial Q
$$Q(t) = \sum_{i=1}^{n} f(t_i)\,\frac{P(t)}{(t - t_i)P'(t_i)}.$$

13.6.3 Divided differences

One argument:
$$f[t_i] = f(t_i), \quad i = 1, 2, \ldots, n.$$
Two arguments:
$$f[t_i, t_j] = \frac{f[t_j] - f[t_i]}{t_j - t_i}.$$
k arguments:
$$f[t_{i_1}, \ldots, t_{i_k}] = \frac{f[t_{i_2}, \ldots, t_{i_k}] - f[t_{i_1}, \ldots, t_{i_{k-1}}]}{t_{i_k} - t_{i_1}}.$$

13.6.4 Newton's interpolation formula with divided differences
$$Q(t) = f[t_1] + (t - t_1)f[t_1, t_2] + (t - t_1)(t - t_2)f[t_1, t_2, t_3] + \ldots + (t - t_1)(t - t_2)\cdots(t - t_{n-1})f[t_1, \ldots, t_n],$$
$$f(t) = Q(t) + P(t)\,f[t, t_1, t_2, \ldots, t_n].$$

13.7 Propagation of errors

Let x̄ be an approximate value of the real number x. Then the absolute error δx in x̄, when used as an estimate for the generally unknown x, is defined as
$$\delta x = x - \bar{x},$$
and the relative error is, for x ≠ 0, defined as δx/x. Let Δx be a known bound for |δx|. Then Δx is the absolute uncertainty in the value x̄, and Δx/|x| is the relative uncertainty in the approximation x̄ for x.


Let f be a function which may be differentiated several times. Then the absolute uncertainty of the approximation f(x̄) for f(x) is denoted by Δf(x) and satisfies
$$\Delta f(x) \le \Big|\frac{\partial f}{\partial x}\Big| \cdot \Delta x + o(\Delta x), \quad \Delta x \to 0, \qquad \text{if } \frac{\partial f}{\partial x} \ne 0,$$
and
$$\Delta f(x) \le \Big|\frac{\partial^2 f}{\partial x^2}\Big| \cdot \frac{(\Delta x)^2}{2} + o((\Delta x)^2), \quad \Delta x \to 0, \qquad \text{if } \frac{\partial f}{\partial x} = 0.$$
Let now x̄ denote an approximation for a vector x ∈ R^n and let Δx_r be the absolute uncertainty in the component x̄_r. If g is a differentiable function of the vector argument x, then the absolute uncertainty in g(x̄) is Δg(x), which satisfies
$$\Delta g(x) \le \sum_{r=1}^{n} \Big|\frac{\partial g}{\partial x_r}\Big| \cdot \Delta x_r + o(||\Delta x||), \quad ||\Delta x|| \to 0.$$

13.8 Solution of a single nonlinear equation

13.8.1 Existence of a root in an interval [a, b]

If the function f is continuous on [a, b] and such that
$$f(a) \cdot f(b) < 0,$$
then the equation f(x) = 0 has a root s such that a < s < b. We also have the general error bound
$$|s - (a+b)/2| < (b-a)/2.$$

13.8.2 Existence and uniqueness of a root in an interval [a, b]

If the function f is continuously differentiable on [a, b] and such that
$$f(a) \cdot f(b) < 0, \qquad |f'(x)| \ge m > 0, \quad a \le x \le b,$$
then the equation f(x) = 0 has a unique root s such that a < s < b. Let now x̄ be a fixed point in [a, b]. We then have the general error bounds:
$$|\bar{x} - s| < b - a, \qquad |\bar{x} - s| \le |f(\bar{x})|/m.$$


13.8.3 Bisection method for f(x) = 0

Start: Find x_0, y_0 such that x_0 < y_0 and f(x_0)·f(y_0) < 0.
General iteration step: For n = 0, 1, ... determine x_{n+1}, y_{n+1} according to the rules. Always put
$$z_n = (x_n + y_n)/2.$$
If f(x_n)·f(z_n) ≥ 0 then x_{n+1} = z_n, y_{n+1} = y_n.
If f(x_n)·f(z_n) < 0 then y_{n+1} = z_n, x_{n+1} = x_n.
We have the general error bound for a root s:
$$|z_n - s| \le 2^{-(n+1)}(y_0 - x_0) \approx (y_0 - x_0) \cdot 10^{-(n+1)\cdot 0.301}.$$

13.8.4 Inverse linear interpolation for f(x) = 0

Let the function f be continuously differentiable on [a, b] and such that
$$f(a) \cdot f(b) < 0, \qquad |f'(x)| \ge m > 0, \qquad |f''(x)| \le m_2, \quad a \le x \le b.$$
Put
$$\bar{x} = a - \frac{f(a)(b - a)}{f(b) - f(a)}.$$
Then
$$|\bar{x} - s| \le \frac{|f(a)f(b)|}{2} \cdot \frac{(b-a)^2\,m_2}{(f(b) - f(a))^2\,m}.$$

13.8.5 Newton-Raphson's method for f(x) = 0

Find an approximate root x_0. Then iterate according to
$$x_{n+1} = x_n + h_n, \qquad h_n = -f(x_n)/f'(x_n), \quad n = 0, 1, \ldots$$
Asymptotic estimate of the truncation error:
$$x_{n+1} - s \approx \frac{h_n^2}{2} \cdot \frac{f''(s)}{f'(s)}.$$

13.8.6 Fix-point iteration for x = g(x)

Find an approximate root x_0. Then iterate according to
$$x_{n+1} = g(x_n), \quad n = 0, 1, \ldots$$
Asymptotic estimate of the truncation error:
$$x_{n+1} - s \approx \frac{-g'(s)}{1 - g'(s)} \cdot (x_{n+1} - x_n).$$
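A sketch of the bisection and Newton-Raphson schemes of 13.8.3 and 13.8.5, applied to f(x) = x² - 2 (my own test equation), whose positive root is √2:

```python
import math

def bisect(f, x0, y0, n_steps):
    """n_steps bisections; the error of the final midpoint is
    bounded by 2^-(n_steps+1) * (y0 - x0)."""
    x, y = x0, y0
    for n in range(n_steps):
        z = (x + y) / 2.0
        if f(x) * f(z) >= 0:
            x = z
        else:
            y = z
    return (x + y) / 2.0

def newton(f, fprime, x0, n_steps):
    x = x0
    for n in range(n_steps):
        x = x - f(x) / fprime(x)   # h_n = -f(x_n)/f'(x_n)
    return x

f = lambda x: x * x - 2.0
root_b = bisect(f, 1.0, 2.0, 20)
root_n = newton(f, lambda x: 2.0 * x, 1.0, 6)
```

Each bisection gains one binary digit (about 0.3 decimal digits), while Newton's method roughly doubles the number of correct digits per step, in line with the asymptotic error estimates above.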


13.9 Systems of nonlinear equations

13.9.1 Newton-Raphson's method for F(x) = 0, x ∈ R^n, F(x) ∈ R^n

Find an approximate solution vector x^0 ∈ R^n. Then iterate according to
$$x^{l+1} = x^l + h^l, \qquad A_l h^l = -F(x^l), \quad l = 0, 1, \ldots,$$
where the matrix A_l ∈ R^{n×n} is given by
$$(A_l)_{i,j} = \frac{\partial F_i}{\partial x_j}(x^l).$$

13.9.2 Fix-point iteration for x = G(x), x ∈ R^n, G(x) ∈ R^n

Find an approximate solution vector x^0 ∈ R^n. Then iterate according to
$$x^{l+1} = G(x^l), \quad l = 0, 1, \ldots$$

13.10 Numerical integration with the trapezoidal rule

Let [a, b] be a finite interval. Let n be an integer and put h = (b - a)/n. Define
$$T(h) = h \cdot \Big[\frac{f(a) + f(b)}{2} + \sum_{i=1}^{n-1} f(a + ih)\Big].$$
If f has continuous derivatives of all orders on [a, b], then
$$T(h) = \int_a^b f(t)\,dt + c_1h^2 + c_2h^4 + c_3h^6 + \cdots + c_kh^{2k} + O(h^{2k+2}), \quad h \to 0, \; k = 1, 2, \ldots$$
If f also is periodic with period b - a, then
$$T(h) = \int_a^b f(t)\,dt + o(h^k), \quad k = 1, 2, \ldots$$
If f has continuous derivatives of all orders on [0, ∞), then
$$T(h) = h \cdot \Big[f(0)/2 + \sum_{i=1}^{\infty} f(ih)\Big],$$


and
$$T(h) = \int_0^{\infty} f(t)\,dt + c_1h^2 + c_2h^4 + c_3h^6 + \cdots + c_kh^{2k} + O(h^{2k+2}), \quad h \to 0, \; k = 1, 2, \ldots$$
If f has continuous derivatives of all orders on the entire real line and
$$\lim_{t\to\pm\infty} f^{(k)}(t) = 0, \quad k = 1, 2, \ldots,$$
then
$$T(h) = h \cdot \sum_{i=-\infty}^{\infty} f(ih),$$
and
$$T(h) = \int_{-\infty}^{\infty} f(t)\,dt + o(h^k), \quad k = 1, 2, \ldots$$

13.11 Differential and difference equations with constant coefficients

Consider the homogeneous differential equation with constant coefficients a_0, a_1, ..., a_n:
$$\sum_{r=0}^{n} a_r x^{(r)}(t) = 0.$$
Let the corresponding characteristic polynomial
$$P(z) = \sum_{r=0}^{n} a_r z^r$$
have the distinct roots z_i with multiplicities m_i, i = 1, 2, ..., k. Then the general solution is
$$x(t) = \sum_{i=1}^{k} Q_i(t)e^{z_i t},$$
where Q_i is a general polynomial of degree less than m_i. In the special case when k = n and m_i = 1, i = 1, 2, ..., k, the general solution of the homogeneous equation becomes
$$x(t) = \sum_{i=1}^{n} c_i e^{z_i t}, \quad c_i \text{ any constants}.$$
The general solution of the inhomogeneous equation
$$\sum_{r=0}^{n} a_r x^{(r)}(t) = f(t)$$


may be written
$$x(t) = x_h(t) + x_p(t),$$
where x_p is any solution of the inhomogeneous equation and x_h is the general solution of the homogeneous equation.

Consider the homogeneous difference equation with constant coefficients a_0, a_1, ..., a_n:
$$\sum_{r=0}^{n} a_r x_{l+r} = 0.$$
Let the corresponding characteristic polynomial
$$P(z) = \sum_{r=0}^{n} a_r z^r$$
have the distinct roots z_i with multiplicities m_i, i = 1, 2, ..., k. Then the general solution is
$$x_l = \sum_{i=1}^{k} Q_i(l)\,z_i^l,$$
where Q_i is a general polynomial of degree less than m_i. In the special case when k = n and m_i = 1, i = 1, 2, ..., k, the general solution of the homogeneous difference equation becomes
$$x_l = \sum_{i=1}^{n} c_i z_i^l, \quad c_i \text{ any constants}.$$
The general solution of the inhomogeneous equation
$$\sum_{r=0}^{n} a_r x_{l+r} = f(l)$$
may be written
$$x_l = x_h(l) + x_p(l),$$
where x_p is any fixed solution of the inhomogeneous equation and x_h is the general solution of the homogeneous equation.

13.12 Initial-value problems for a single ordinary differential equation

Given
$$x(0) = x_0, \qquad x'(t) = f(t, x(t)).$$


13.12.1 Picard iteration

Iterate according to:
$$x^0(t) = x_0, \quad t \ge 0, \qquad x^{l+1}(t) = x_0 + \int_0^t f(u, x^l(u))\,du.$$

13.12.2 Euler's method
$$x_{n+1} = x_n + hf(t_n, x_n), \qquad t_n = nh.$$
Global error: O(h), h → 0.

13.12.3 The trapezoidal method
$$x_{n+1} = x_n + h[f(t_n, x_n) + f(t_{n+1}, x_{n+1})]/2.$$
The equation for x_{n+1} may be solved iteratively according to the scheme:
$$x_{n+1}^0 = x_n + hf(t_n, x_n),$$
$$x_{n+1}^{l+1} = x_n + hf(t_n, x_n)/2 + hf(t_{n+1}, x_{n+1}^l)/2, \quad l = 0, 1, \ldots,$$
$$x_{n+1} = \lim_{l\to\infty} x_{n+1}^l.$$
Special case: the trapezoidal method for the linear differential equation
$$x'(t) + f(t)x(t) = g(t):$$
$$x_{n+1} = \frac{h\,(g(t_n) + g(t_{n+1}))/2 + x_n(1 - hf(t_n)/2)}{1 + hf(t_{n+1})/2}.$$
Global error for the trapezoidal method in the general case: O(h²), h → 0.

13.13 Initial-value problems for systems of ordinary differential equations

The same numerical methods may be used as in the case of a single differential equation, with the change that x becomes a vector and f, g etcetera are vector-valued functions. Note that the expression for the general solution of a linear differential equation is not valid. However, in the special case
$$x(0) = x_0, \qquad x'(t) = Ax(t) + g(t),$$
with A a constant matrix, we have
$$x(t) = e^{At}\Big(x_0 + \int_0^t e^{-A\tau}g(\tau)\,d\tau\Big).$$


Chapter 14

Laboratory exercises and case studies

14.1 Introduction

A term paper is not mandatory for the course ÅMA190 Numerical Mathematics, but a voluntary task for interested students who want to get feedback on the progress of their study. Each of the four following sections describes one problem to be treated by the students. Each problem is preceded by text dealing with a similar problem, which will be treated in a lecture. The second section contains the results of case studies which illustrate some important results in the areas of numerical analysis covered by the course. If nothing else is stated, the reported calculations are carried out either in single precision (machine-eps = 1.19·10^-7) or double precision (machine-eps = 2.22·10^-16).

14.2 Problems

14.3 Three methods for solving a linear system of equations with a tridiagonal matrix

14.3.1 Similar problem

Exercise 1.10

14.3.2 Task

The system of equations Ax = b is given, with


$$A = \begin{pmatrix} 10.0 & 1.5 & 0.0 & 0.0 \\ 1.5 & 10.0 & 1.5 & 0.0 \\ 0.0 & 1.5 & 10.0 & 1.5 \\ 0.0 & 0.0 & 1.5 & 10.0 \end{pmatrix}, \qquad b = \begin{pmatrix} \exp(0.4) \\ \sqrt{3.0} \\ \pi \\ \ln(2.0) \end{pmatrix}.$$

a) Solve the system with Gauss elimination. Work with 4 decimal places.
b) Use Jacobi's method. Start with x = (0, 0, 0, 0)^T and iterate 3 times. Work with 4 decimal places.
c) Do the same task using Gauss-Seidel's method.

14.4 A least squares problem

14.4.1 Similar problem

Exercise 2.8

14.4.2 Task

Determine the best approximation in the sense of least squares of the form
$$a + bx + c(x^2 - 1)$$
to the function
$$f(x) = \cos(\pi x/2)$$
at the 5 points
$$x_1 = -1, \quad x_2 = -0.5, \quad x_3 = 0, \quad x_4 = 0.5, \quad x_5 = 1,$$
where x is measured in radians. Work with 6 decimal places.

14.5 Three methods for solving a nonlinear equation in one variable

14.5.1 Similar problem

The equation
$$e^{-x} + \sqrt{x} - x = 0$$
is given. Show that it has a unique positive root s such that 1 < s < 2. We shall use different numerical methods for determining s:
a) Fix-point iteration: write the equation as
$$x = e^{-x} + \sqrt{x}.$$


Start with x_0 = 1.
b) Use the bisection method starting with the interval [1, 2].
c) Apply Newton's method. Start with x_0 = 1 and iterate until the first 5 decimals remain constant.
Find in each case the number of functional values one needs to determine the root with an uncertainty less than 1.0·10^-5.

14.5.2 Task

The equation
$$e^x + x^2 = 2$$
is given. Show that it has a unique positive root s such that 0 < s < 1. We shall use different numerical methods for determining s:
a) Fix-point iteration: write the equation as
$$x = \ln(2 - x^2).$$
Start with x_0 = 1.
b) Use the bisection method starting with the interval [0, 1].
c) Apply Newton's method. Start with x_0 = 1 and iterate until the first 5 decimals remain constant.
Find in each case the number of functional values one needs to determine the root with an uncertainty less than 1.0·10^-5.

14.6 Analysing a table over the solution to an initial-value problem

14.6.1 Similar problem

The initial-value problem
$$y'(x) = \ln(1 + y) + e^x - x, \qquad y(0) = 0.$$
This problem was treated with the classical Runge-Kutta method (RK4), with the step-lengths h = 0.2 and h = 0.1. Hence two tables emerged, having the solution points at x = 0, 0.2, ..., 1.0 in common. The smaller step-length gives the most accurate values, and since the global error is O(h⁴), the error in these best values may be estimated by 1/15 of the difference of the calculated values at these common points.
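The RK4 scheme is only named, not derived, in these notes; the sketch below uses its standard textbook formulation (an assumption on my part) to reproduce the run. Note that y' = ln(1+y) + e^x - x with y(0) = 0 has the exact solution y = e^x - 1, against which the 1/15 error estimate can be checked:

```python
import math

def rk4(f, y0, h, n_steps):
    """Classical fourth-order Runge-Kutta method (standard formulation)."""
    x, y = 0.0, y0
    for n in range(n_steps):
        k1 = f(x, y)
        k2 = f(x + h / 2.0, y + h * k1 / 2.0)
        k3 = f(x + h / 2.0, y + h * k2 / 2.0)
        k4 = f(x + h, y + h * k3)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        x += h
    return y

f = lambda x, y: math.log(1.0 + y) + math.exp(x) - x
y_h01 = rk4(f, 0.0, 0.1, 10)          # y(1) with h = 0.1
y_h02 = rk4(f, 0.0, 0.2, 5)           # y(1) with h = 0.2
estimate = (y_h01 - y_h02) / 15.0     # estimated error in y_h02's successor
```

The estimate is of the same size as the true error of the h = 0.1 value, |y_h01 - (e - 1)|.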


x         y, h = 0.10  y, h = 0.2  error estimate
.000000   .000000      0.000000    0.000000
.100000   .105171
.200000   .221402      0.221398    0.000000
.300000   .349858
.400000   .491824      0.491815    0.000001
.500000   .648720
.600000   .822118      0.822105    0.000001
.700000   1.013752
.800000   1.225540     1.225523    0.000001
.900000   1.459602
1.000000  1.718280     1.718260    0.000001

14.6.2 Task

a) Show that the initial-value problem
$$y'(x) + \frac{y(x)}{1+x} = \frac{\cos x}{1+x}, \qquad y(0) = 0,$$
has the exact solution
$$y(x) = \frac{\sin x}{1+x}.$$
b) This problem was treated with the classical Runge-Kutta method (RK4), with the step-lengths h = 0.5 and h = 0.25. Hence two tables emerged, having the solution points at x = 0.0, 0.5, 1.0 in common. The smaller step-length gives the most accurate values, and since the global error is O(h⁴), the error in these best values may be estimated by 1/15 of the difference of the calculated values at these common points.

x         y, h = 0.25  y, h = 0.5
.000000   .000000      0.000000
.250000   .197923
.500000   .319617      0.319624
.750000   .389508
1.000000  .420736      0.420745

Estimate the error. Compare with the exact solution to get a test on the accuracy of the error estimate.


14.7 Case studies

14.8 Evaluation of equivalent sums

14.8.1 Problem

As is well known,

    ∑_{n=1}^{∞} 1/n^2 = π^2/6 = 1.64493...

We want to confirm this result by approximating the infinite series by the sum of its first 10000 terms, i.e. by

    s_10000 = ∑_{n=1}^{10000} 1/n^2.

14.8.2 Computational treatment

Consider the two equivalent sums

    s_1 = ∑_{n=1}^{10000} 1/n^2,   s_2 = ∑_{n=1}^{10000} 1/(10001 − n)^2.

However, working in single precision we get

    s_1 = 1.64473,   s_2 = 1.64483.

If we instead evaluate the two sums in double precision, we get the value 1.64483 for both sums. We also note that approximating the series by the finite sum in double precision gives an error of about 1.0 · 10^-4, caused by neglecting all terms except the first 10000.

14.8.3 Conclusion

There are two sources of error, namely the truncation error, caused by approximating the infinite series by a finite sum consisting of the first 10000 terms, and the combined effect of the round-offs arising when the finite sum is computed, working in double or single precision. As expected, the round-off error in double precision is negligible in comparison to that in single precision. We also note that, working in single precision, we get a more accurate result when we add up the smallest terms first, compared to starting with the largest one. This observation is valid for most common computer systems.
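The single- versus double-precision behaviour above can be reproduced on any machine by rounding every intermediate result to IEEE single precision. The sketch below is illustrative only: the helper names are our own, and the `struct` module is used merely to emulate 32-bit rounding, so the exact digits may differ slightly from the machine used in the notes.

```python
import struct

def f32(x):
    """Round a Python double to the nearest IEEE single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def sum_f32(terms):
    # every partial sum is rounded to single precision, as in a float32 loop
    s = 0.0
    for t in terms:
        s = f32(s + f32(t))
    return s

N = 10000
s1 = sum_f32(1.0 / n ** 2 for n in range(1, N + 1))            # largest terms first
s2 = sum_f32(1.0 / (N + 1 - n) ** 2 for n in range(1, N + 1))  # smallest terms first
s_double = sum(1.0 / n ** 2 for n in range(N, 0, -1))          # double precision
```

Adding the smallest terms first (`s2`) keeps the small contributions from being swamped by the running sum before they have accumulated, which is exactly the effect observed in single precision above.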


14.9 The stabilising effect of pivoting in Gauss elimination

14.9.1 Problem

We consider the problem of solving a linear system of equations

    Ax = b,

using Gauss elimination with pivoting. In some cases, e.g. when the coefficient matrix is positive definite or diagonally dominant, the Gauss elimination process can be shown to be stable also without pivoting. If the matrix is structured, e.g. if it is banded, this structure is destroyed by pivoting, and therefore the pivoting is often not carried out; reliable results are frequently obtained nevertheless. It could therefore be tempting to skip the pivoting and hope for the best. We report here the results from numerical experiments where we compare the results of solving systems with and without pivoting. We generated 4 different classes of systems. The elements a_{i,j} were calculated as described below, and the right-hand sides were obtained by setting x = (1, 1, ..., 1, 1)^T and evaluating Ax. In Examples 1, 2 and 3 the elements of A and b were calculated in double precision and afterwards rounded to single precision accuracy; hence the data of the system were correct in single precision. In Example 4 the elements were not rounded to single precision and are hence much more accurate. Each system was subsequently solved with and without pivoting, carrying out the elimination in single and in double precision. In the latter case we worked with several guard digits. In Examples 1, 2, and 4, A was a Van der Monde matrix and hence

    a_{i,j} = t_j^{i−1},   i = 1, ..., n,   j = 1, ..., n.

In Example 1, t_j is the j-th zero of the shifted Chebyshev polynomial of degree n. Hence

    t_j = (1 + cos θ_j)/2,   θ_j = π(j − 1/2)/n.

These numbers tend to accumulate at the endpoints 0 and 1. In Examples 2 and 4 the t_j are equidistant and we have

    t_j = (j − 1)/(n − 1).

This point-set is uniformly distributed on the interval [0,1]. Finally, in Example 3 the matrix is not of Van der Monde type.
Instead the elements are random numbers, uniformly distributed on [0,1].

14.9.2 Computational results

We tabulate the observed error in the least accurately computed component of the solution vector x, i.e. the largest observed error.

Example 1 Van der Monde, shifted Chebyshev points, largest observed error.


    n    piv,sing   nonpiv,sing   piv,dbl   nonpiv,dbl
    5    .48D-06    .38D-05       .24D-06   .24D-06
    10   .36D-04    .23D-02       .14D-05   .14D-05
    15   .17D-02    .33D+01       .98D-04   .98D-04

Example 2 Van der Monde, equidistant points, largest observed error.

    n    piv,sing   nonpiv,sing   piv,dbl   nonpiv,dbl
    5    .76D-05    .00D+00       .12D-13   .00D+00
    10   .14D+00    .37D+00       .48D-01   .48D-01
    11   .23D+00    .71D+01       .17D+00   .17D+00

Example 3 Random matrix elements, largest observed error.

    n    piv,sing   nonpiv,sing   piv,dbl   nonpiv,dbl
    10   .76D-05    .36D-04       .84D-06   .84D-06
    20   .22D-04    .24D-03       .12D-05   .12D-05
    30   .82D-04    .28D-02       .14D-04   .14D-04
    40   .63D-04    .39D-02       .34D-05   .34D-05
    50   .98D-04    .50D-03       .60D-05   .60D-05
    60   .71D-04    .46D-01       .13D-04   .13D-04

Example 4 Van der Monde, equidistant points, double precision, largest observed error.

    n    piv,dbl    nonpiv,dbl
    5    .12D-13    .00D+00
    10   .42D-09    .11D-08
    15   .42D-05    .19D-03
    20   .23D+00    .14D+02

14.9.3 Conclusions

We note that the accuracy of the calculated results depends on the following factors:

• The input data, consisting of the elements of the matrix A, the vector b and the number of unknowns n.
• The accuracy of the input data: single or double precision.
• The accuracy with which the calculations are carried out: single or double precision.
• Whether or not pivoting is carried out.

In all examples the accuracy declines when n is increased, which is to be expected if one studies the condition numbers of the problems, but this topic is not treated in the present course. The difference between Examples 2 and 4 is the accuracy


of the input data, which explains why the results are more accurate in the last-mentioned example. In the other three examples the data were given in single precision and the calculations were done in single or double precision. We note that the results in column 3 are more accurate than those in column 1, and in the same way those in column 4 are more accurate than the corresponding results in column 2, in the first three tables. One could say that we carry some guard digits to contain the propagation of rounding errors when double precision is used. We note the stabilising effect of pivoting by comparing the results in the second column with those in the first column, and to a lesser extent by comparing column 4 with column 3. Finally, we observe that the problem with a matrix with random elements is much more stable than the problems with Van der Monde matrices. In addition, the problem with equidistant points (Examples 2 and 4) is less stable than the one with Chebyshev points.

14.10 Least squares fit

14.10.1 Introduction

We discuss problems where one or several parameters are to be determined from the fact that several conditions in the form of equalities need to be satisfied. This may be expressed as the problem of solving a set of equations. If these cannot be satisfied at the same time, the problem is said to be inconsistent. One may then pass to an approximation problem and seek to satisfy the conditions as closely as possible, where the error is measured by a suitable functional, which is then to be minimised.

Example 14.10.1 Let

    0 ≤ x_1 ≤ x_2 ≤ ... ≤ x_n

be n given real numbers which represent measurements of a constant c. We want to estimate c from these measurements. We introduce the following measures of the goodness of fit for any value of c:

    d_∞(c) = max_{1≤i≤n} |c − x_i|,

    d_2^2(c) = ∑_{i=1}^{n} (x_i − c)^2,

    d_1(c) = ∑_{i=1}^{n} |x_i − c|.

Next we determine c such that one of these gauges is minimised, for n = 5 and x_i given by the set {100, 101, 103, 104, 110}.


We put c = 100 + u and solve the problem of determining u from the set

    {0, 1, 3, 4, 10}.

In the case of d_∞ we need to find

    min_u max{|u|, |u − 1|, |u − 3|, |u − 4|, |u − 10|},

which is attained at

    u = 5.

We also have

    d_2^2(u) = u^2 + (u − 1)^2 + (u − 3)^2 + (u − 4)^2 + (u − 10)^2.

This parabolic function is minimised for

    u = (0 + 1 + 3 + 4 + 10)/5 = 18/5 = 3.6 .

Finally,

    d_1(u) = |u| + |u − 1| + |u − 3| + |u − 4| + |u − 10|

is a piecewise linear function having break-points at 0, 1, 3, 4, 10. It cannot have an isolated minimum at any other point. We find

    d_1(0) = 18,   d_1(1) = 15,   d_1(3) = 13,   d_1(4) = 14,   d_1(10) = 32,

and hence u = 3 gives the best fit.

14.10.2 Fitting a straight line to a set of observations

Assume that we want to fit the linear expression

    a + bt

to the table of N values

    {t_i, f_i},   i = 1, 2, ..., N.

We assume here that the N points t_i are exact, but the values f_i may be subject to errors. The deviations are given by

    δ_i = f_i − (a + b t_i).

The conditions on a and b may be represented as the over-determined system

    ( 1   t_1 | f_1 )
    ( 1   t_2 | f_2 )
    ( 1   t_3 | f_3 )
    ( ...           )
    ( 1   t_N | f_N )


In analogy with the example above we may determine a, b such that one of the following expressions is minimised:

    max_{1≤i≤N} |δ_i|,   ∑_{i=1}^{N} |δ_i|,   ∑_{i=1}^{N} (δ_i)^2.

All three formulas define functions of the two real variables a, b. The first two expressions may be minimised using simple algorithms based on linear programming and hence are not treated here. Instead we shall discuss the third expression. Put

    ∆(a, b) = ∑_{i=1}^{N} (f_i − a − b t_i)^2.

Expanding the square we get the sum

    ∆(a, b) = ∑_{i=1}^{N} (f_i^2 + a^2 + b^2 t_i^2 − 2a f_i − 2b f_i t_i + 2ab t_i),

or

    ∆(a, b) = N a^2 + b^2 ∑ t_i^2 + 2ab ∑ t_i − 2a ∑ f_i − 2b ∑ f_i t_i + ∑ f_i^2,

where all sums run over i = 1, ..., N. Thus ∆(a, b) is a quadratic form in a, b, and it is nonnegative, hence of the ellipsoidal type. We form the linear system

    ∂∆/∂a = 0,   ∂∆/∂b = 0,

which may be written

    A_{1,1} a + A_{1,2} b = B_1,
    A_{2,1} a + A_{2,2} b = B_2,

where

    A_{1,1} = N,   A_{1,2} = A_{2,1} = ∑ t_i,   B_1 = ∑ f_i,
    A_{2,2} = ∑ t_i^2,   B_2 = ∑ f_i t_i.

Example 14.10.2 Treat the following inconsistent system:

    x + y = 1,   x + y = 2.

We put x + y = z and get the conditions

    z = 1,   z = 2,


and hence we may consider minimising one of the following three expressions:

    max{|z − 1|, |2 − z|},   (z − 1)^2 + (2 − z)^2,   |z − 1| + |z − 2|.

All three expressions are minimised by z = 3/2. Hence the least squares solution satisfies x + y = 3/2 and is not unique. Among these solutions one may take the one which renders x^2 + y^2 a minimum, namely x = y = 3/4.

Example 14.10.3 Consider the least squares problem

    Ax = b,

which is given by

    A = ( 1  2  3 )      b = ( 1 )
        ( 0  1  2 )          ( 2 )
        ( 0  0  c )          ( 3 )
        ( 0  0  0 )          ( 4 )

where c is a parameter. The deviation in the last equation is 4 for all values of c. If c ≠ 0 we may get the residual 0 in the first three equations by putting x_3 = 3/c, x_2 = 2 − 2x_3, x_1 = 1 − 3x_3 − 2x_2, and this solution is unique. But if c = 0, then the residuals in the third and fourth equations are 3 and 4 for all x. To get the residual 0 in the first two equations we need to find a solution to the system formed by these equations. This may be done by choosing one of x_2, x_3 arbitrarily and then solving (uniquely) for the other two variables. Let x_3 be the free parameter. Then we get the system

    ( 1  2 | 1 − 3x_3 )
    ( 0  1 | 2 − 2x_3 )

We next eliminate the coefficient in front of x_2 in the first equation to arrive at

    ( 1  0 | −3 + x_3 )
    ( 0  1 | 2 − 2x_3 )

giving the parametric solution

    x_1 = −3 + x_3,   x_2 = 2 − 2x_3.

The least squares solution with the smallest norm is found by minimising

    x_1^2 + x_2^2 + x_3^2,

giving

    x_1 = −11/6,   x_2 = −1/3,   x_3 = 7/6.
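The minimum-norm solution in Example 14.10.3 (case c = 0) can be checked numerically. This sketch assumes NumPy is available; `numpy.linalg.lstsq` returns precisely the least squares solution of smallest 2-norm for a rank-deficient system.

```python
import numpy as np

# Example 14.10.3 with c = 0: a rank-deficient least squares problem
A = np.array([[1., 2., 3.],
              [0., 1., 2.],
              [0., 0., 0.],
              [0., 0., 0.]])
b = np.array([1., 2., 3., 4.])

# lstsq returns the minimum-norm least squares solution
x, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
# expected: rank 2 and x = (-11/6, -1/3, 7/6)
```

The computed rank is 2 (the two zero rows contribute nothing), and the solution agrees with the hand calculation above.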


14.10.3 Derivation of the normal equations

We consider a system of N rows and n variables, which we write

    Ax = b.

It may be inconsistent, and we want to determine x ∈ R^n such that a norm of the residual Ax − b is rendered a minimum. If we use ||Ax − b||_1 or ||Ax − b||_∞ we get optimisation problems which may be solved by means of linear programming. We will here derive the normal equations, which determine solutions to the minimisation problem in the 2-norm. Put

    ∆(x) = ||Ax − b||_2^2 = (Ax − b)^T (Ax − b).

Then

    ∆(x + h) = (Ax + Ah − b)^T (Ax + Ah − b) = (Ax − b + Ah)^T (Ax − b + Ah),

or

    ∆(x + h) = (Ax − b)^T (Ax − b) + (Ax − b)^T (Ah) + (Ah)^T (Ax − b) + (Ah)^T (Ah).

Since

    (Ax − b)^T (Ah) = (Ah)^T (Ax − b),   (Ah)^T (Ah) = ||Ah||_2^2,

we obtain

    ∆(x + h) = ∆(x) + 2 h^T A^T (Ax − b) + ||Ah||_2^2,

and conclude that

    ∆(x + h) ≥ ∆(x)

for all h, if and only if

    A^T (Ax − b) = 0,   or   A^T A x = A^T b.

These equations are called the normal equations.

14.11 Interpolation with polynomials

Let

    f(x) = exp(x),   −1 ≤ x ≤ 1.

We want to approximate f with a polynomial of degree less than 4 as well as calculate the integral

    ∫_{−1}^{1} f(x) dx.

The following table of functional values, correctly rounded to 4 decimals, is given, where the x-values are exact:


    x      f(x)
    −1.0   0.3679
    −0.5   0.6065
    0.5    1.6487
    1.0    2.7183

We determine the interpolating polynomial using the data in the table. A straight-forward approach, expressing the polynomial in power form, is given in Exercises, Task 3.6. Here we present alternative approaches, based on Lagrange's and Newton's interpolation formulas, which are alternative methods for solving the linear system which determines the interpolating polynomial.

14.11.1 Lagrange's formula

    L_1(x) = (x + 0.5)(x − 0.5)(x − 1) / ((−1 + 0.5)(−1 − 0.5)(−1 − 1)) = −(x^2 − 1/4)(x − 1)/1.5 .

    L_2(x) = (x + 1)(x − 0.5)(x − 1) / ((−0.5 + 1)(−0.5 − 0.5)(−0.5 − 1)) = (x^2 − 1)(x − 0.5)/0.75 .

    L_3(x) = (x + 1)(x + 0.5)(x − 1) / ((0.5 + 1)(0.5 + 0.5)(0.5 − 1)) = −(x^2 − 1)(x + 0.5)/0.75 .

    L_4(x) = (x + 1)(x + 0.5)(x − 0.5) / ((1 + 1)(1 + 0.5)(1 − 0.5)) = (x^2 − 1/4)(x + 1)/1.5 .

Hence

    P(x) = 0.3679 · L_1(x) + 0.6065 · L_2(x) + 1.6487 · L_3(x) + 2.7183 · L_4(x).

14.11.2 Estimate of the effect of round-off in given data on the calculated value of P(x)

Since the tabular values are rounded to 4 decimal places, the error in the functional values is bounded by 0.5 · 10^-4. Let δ_R P(x) be the error in the value of the interpolating polynomial P(x) caused by the round-off of the tabular values. Using the triangle inequality we find

    δ_R P(x) ≤ 0.5 · 10^-4 · (|L_1(x)| + |L_2(x)| + |L_3(x)| + |L_4(x)|).

Numerical example:

    δ_R P(0) ≤ 0.5 · 10^-4 · (1/6 + 2/3 + 2/3 + 1/6) = 0.8 · 10^-4.
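Lagrange's formula lends itself directly to code. The sketch below (the helper name is our own) evaluates the interpolating polynomial through the tabulated values; at a node it reproduces the tabular value exactly, and at x = 0 it may be compared with e^0 = 1.

```python
def lagrange_eval(xs, fs, x):
    """Evaluate the interpolating polynomial through (xs[i], fs[i]) at x."""
    total = 0.0
    for i, (xi, fi) in enumerate(zip(xs, fs)):
        L = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                L *= (x - xj) / (xi - xj)   # cardinal polynomial L_i(x)
        total += fi * L
    return total

xs = [-1.0, -0.5, 0.5, 1.0]
fs = [0.3679, 0.6065, 1.6487, 2.7183]
p0 = lagrange_eval(xs, fs, 0.0)   # compare with exp(0) = 1
```

The value P(0) differs from e^0 = 1 by about 0.011, which the truncation-error analysis in the next subsection explains.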


14.11.3 Estimation of the truncation error

Let the truncation error at the point x be R_T(x). The general expression is

    R_T(x) = (x − 1)(x − 0.5)(x + 0.5)(x + 1)/24 · f^(4)(ξ),   −1 ≤ ξ ≤ 1.

Since we know that f(x) = exp(x) we get the bound

    |R_T(x)| ≤ |(x^2 − 1)(x^2 − 1/4)|/24 · e,   −1 ≤ x ≤ 1.

Numerical example:

    |R_T(0)| ≤ |(0^2 − 1)(0^2 − 1/4)|/24 · e = e/96 = 0.028 .

We next determine a bound for the maximum value of |R_T(x)|. We have

    |R_T(x)| ≤ max_{−1≤x≤1} |(x^2 − 1)(x^2 − 1/4)|/24 · e.

Thus we need to find the maximal value of

    g(x) = |(x^2 − 1)(x^2 − 1/4)|.

Put u = x^2 and r(u) = |(u − 1)(u − 1/4)|. We need to determine

    max_{0≤u≤1} r(u).

The values at the endpoints are

    r(0) = 1/4,   r(1) = 0.

The equation

    r'(u) = 0

has the solution u = 5/8, where we find r(5/8) = 9/64. Since 9/64 < 1/4, the maximum is r(0) = 1/4, and hence the bound for the truncation error is largest at x = 0. We compare this result with the observed error when we evaluate P(0) by means of Lagrange's formula. We obtain

    P(0) = −0.3679/6 + 2 · 0.6065/3 + 2 · 1.6487/3 − 2.7183/6 = 0.9891,

and the observed error 0.011 is somewhat less than the bound 0.028 for the truncation error at x = 0.

14.11.4 Newton's interpolation formula

We calculate the divided differences and form the array

    −1.0   0.3679
    −0.5   0.6065   0.4772
    0.5    1.6487   1.0422   0.376667
    1.0    2.7183   2.1392   0.731333   0.177333

The corresponding formula for the interpolating polynomial becomes

    P(x) = 0.3679 + 0.4772(x + 1) + 0.376667(x + 1)(x + 0.5) + 0.177333(x + 1)(x + 0.5)(x − 0.5).
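The divided-difference array and the Newton form above can be generated programmatically. In this sketch (function names are our own) the in-place recurrence produces the top diagonal of the array, i.e. the coefficients f[x_0], f[x_0,x_1], ..., and the polynomial is evaluated with a Horner-like scheme.

```python
def divided_differences(xs, fs):
    """Return the Newton coefficients f[x0], f[x0,x1], ..., f[x0,...,xn-1]."""
    coef = list(fs)
    n = len(xs)
    for k in range(1, n):
        # update from the bottom so lower-order differences are still available
        for i in range(n - 1, k - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - k])
    return coef

def newton_eval(xs, coef, x):
    """Horner-like evaluation of the Newton form of the interpolating polynomial."""
    p = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        p = p * (x - xs[i]) + coef[i]
    return p

xs = [-1.0, -0.5, 0.5, 1.0]
fs = [0.3679, 0.6065, 1.6487, 2.7183]
c = divided_differences(xs, fs)   # expected: 0.3679, 0.4772, 0.376667, 0.177333
```

Since the Newton and Lagrange forms represent the same interpolating polynomial, `newton_eval(xs, c, 0.0)` reproduces the value P(0) = 0.9891 obtained above.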


14.12 Numerical integration problems using the trapezoidal method

14.12.1 The trapezoidal rule on a periodic function over one period

We wish to evaluate the integral

    ∫_0^{2π} ln(10 + sin t)/(20 + cos t) dt.

Using the trapezoidal rule we construct the table below. Note that we evaluate the integrand at 9 points in total and that all points are used to find the last and presumably best estimate. The crosses indicate which functional values were used to determine the estimate T(h) corresponding to the step-size h at the top of the column.

    t            f(t)        h = 6.28318548   h = 3.14159274   h = 1.57079637   h = .78539819
    .00000000    .10964691   x                x                x                x
    .78539819    .11449730                                                      x
    1.57079637   .11989477                                     x                x
    2.35619450   .12289022                                                      x
    3.14159274   .12118869                    x                x                x
    3.92699099   .11554773                                                      x
    4.71238899   .10986123                                     x                x
    5.49778748   .10765627                                                      x
    6.28318548   .10964691   x                x                x                x
    T(h)                     .68893188        .72519147        .72349560        .72349554

The sequence

    I_k = T(h_0/2^k),   k = 0, 1, ...

converges rapidly towards the exact value of the integral, and if I_k is taken as an estimate of the integral, the associated error may be assessed by the difference I_k − I_{k−1}, which as a rule is a conservative bound.

14.12.2 The trapezoidal rule over the real line

    ∫_{−∞}^{∞} exp(−t^2) dt.

The trapezoidal approximation is represented by an infinite series which converges rapidly. We have two sources of error, namely the discretisation error, which arises when the integral is replaced by the trapezoidal sum, and the truncation error, which arises when the infinite trapezoidal series is replaced by a finite sum. Using the trapezoidal rule with step-sizes h = 1, 1/2, 1/4 we got the results in the table below.


    nbr   step-size   integral            errsum    errtotal
    15    1.000       1.772637204826652   .10D-20   .18D-03
    27    .500        1.772453850905516   .45D-18   .22D-15
    51    .250        1.772453850905516   .54D-17   .22D-15

Here nbr is the number of functional values used, and the total error is obtained by comparing the calculated sum with the exact value, which is known for this test example. errsum is the error in the trapezoidal sum. These notations are used in the subsequent examples. We note the rapid convergence when the step-size h is made smaller. See also [12].

14.12.3 The trapezoidal rule over the real line

    ∫_{−∞}^{∞} √(1 + t^2)/(e^{2t} + e^{−t}) dt.

Using the trapezoidal rule with step-sizes h = 1, 1/2, 1/4, ... we got the results in the table below.

    nbr    step-size   integral            errsum
    83     1.000       1.823529232672016   .64D-16
    161    .500        1.822416522249511   .85D-16
    315    .250        1.822415984058650   .88D-16
    615    .125        1.822415984059720   .10D-15
    1205   .0625       1.822415984059720   .11D-15

We observe that more functional values are used than in the preceding example, due to the slower convergence of the trapezoidal series.

14.12.4 Integral over a finite interval transformed to an integral over the real line

    ∫_0^1 e^{cos t} dt.

Putting t = e^u/(1 + e^u) and using the trapezoidal rule with step-sizes h = 1, 1/2, 1/4, ... on the resulting integral over the real line, we got the results in the table below.

    nbr   step-size   integral            errsum
    77    1.000       2.341576671162078   .14D-15
    149   .500        2.341574841713079   .19D-15
    291   .250        2.341574841713053   .20D-15
    569   .125        2.341574841713051   .21D-15
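The truncated trapezoidal sums used in Sections 14.12.2 to 14.12.4 can be sketched as follows, here applied to ∫ exp(−t^2) dt over the real line. The function name and truncation tolerance are our own choices; the sum is cut off once new terms fall below the tolerance, which is safe for integrands that decay this fast.

```python
import math

def trapezoid_real_line(f, h, tol=1e-17):
    """h * sum over integers k of f(k*h), truncated when new terms drop below tol."""
    s = f(0.0)
    k = 1
    while True:
        term = f(k * h) + f(-k * h)
        s += term
        if abs(term) < tol:
            break
        k += 1
    return h * s

# the exact value of the integral of exp(-t^2) over the real line is sqrt(pi)
g = lambda t: math.exp(-t * t)
err_h1 = abs(trapezoid_real_line(g, 1.0)  - math.sqrt(math.pi))
err_h4 = abs(trapezoid_real_line(g, 0.25) - math.sqrt(math.pi))
```

As in the table for Section 14.12.2, the error at h = 1 is of order 10^-4 while at h = 1/4 the result is already accurate to essentially full double precision.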


14.12.5 Integrand with an integrable singularity

    ∫_0^1 e^{cos t}/√t dt.

Putting t = e^u/(1 + e^u) and using the trapezoidal rule with step-sizes h = 1, 1/2, 1/4, ... on the resulting integral over the real line, we got the results in the table below. We note that the method works independently of the singularity at the origin. See [12].

    nbr    step-size   integral
    147    1.000       4.978176663671881
    287    .500        4.978176950606336
    561    .250        4.978176950606429
    1099   .125        4.978176950606423
    2151   .063        4.978176950606420
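The reason the singularity causes no trouble is visible in the transformed integrand: with t = e^u/(1 + e^u) the Jacobian is dt/du = t(1 − t), so the factor 1/√t becomes √t · (1 − t)/t^... in effect e^{cos t} · √t · (1 − t), which decays exponentially in both directions. The sketch below (helper names and truncation tolerance our own) reuses a truncated trapezoidal sum on the transformed integrand.

```python
import math

def to_real_line(f):
    """Transform integral of f over (0,1) to the real line via t = e^u/(1+e^u)."""
    def g(u):
        t = 1.0 / (1.0 + math.exp(-u))   # always strictly between 0 and 1
        return f(t) * t * (1.0 - t)      # Jacobian dt/du = t(1-t)
    return g

def trapezoid(g, h, tol=1e-18):
    """Truncated trapezoidal sum of g over the real line with step h."""
    s, k = g(0.0), 1
    while True:
        term = g(k * h) + g(-k * h)
        s += term
        if term < tol:
            break
        k += 1
    return h * s

g = to_real_line(lambda t: math.exp(math.cos(t)) / math.sqrt(t))
val = trapezoid(g, 0.125)
```

Because t is never exactly 0 after the substitution, 1/√t is evaluated only at strictly positive arguments, and the result at h = 0.125 agrees with the tabulated value 4.978176950606... above.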


Bibliography

[1] M. Abramowitz and I. A. Stegun (Eds), Handbook of Mathematical Functions, Dover, New York, 1965.
[2] E. W. Cheney, Introduction to Approximation Theory, McGraw-Hill, 1966.
[3] E. Cohen, R. F. Riesenfeld and G. Elber, Geometric Modeling with Splines: An Introduction, A. K. Peters, Natick, Mass., USA, 2001.
[4] L. J. Comrie, Chambers's Six-Figure Mathematical Tables, Volume II, Natural Values, W. & R. Chambers, Edinburgh, London, 1959.
[5] G. Dahlquist and Å. Björck, Numerical Methods, Prentice-Hall, Englewood Cliffs, New Jersey, 1974.
[6] L. Eldén and L. Wittmeyer-Koch, Numerical Analysis, an Introduction, Academic Press, 1990.
[7] I. S. Gradshteyn and I. M. Ryzhik, Tables of Integrals, Series and Products, Academic Press, London, 1980.
[8] E. Kreyszig, Advanced Engineering Mathematics, 7th Edition, John Wiley & Sons.
[9] M. L. Overton, Numerical Computing with IEEE Floating Point Arithmetic, SIAM, Philadelphia, 2001.
[10] W. H. Press et al., Numerical Recipes in Fortran, Cambridge University Press, 1992.
[11] K. Rottmann, Matematisk Formelsamling, Spektrum, 2003.
[12] F. Stenger, Numerical Methods Based on Sinc and Analytic Functions, Springer-Verlag, New York, 1993.
[13] G. B. Thomas and R. L. Finney, Calculus and Analytic Geometry, 7th Edition, Addison-Wesley.
[14] G. Woan, The Cambridge Handbook of Physics Formulas, Cambridge University Press, 2003.
