An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
Edition 1 1/4

Jonathan Richard Shewchuk
August 4, 1994

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written with neither illustrations nor intuition, and their victims can be found to this day babbling senselessly in the corners of dusty libraries. For this reason, a deep, geometric understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas that almost anyone can understand. Of course, a reader as intelligent as yourself will learn them almost effortlessly.

The idea of quadratic forms is introduced and used to derive the methods of Steepest Descent, Conjugate Directions, and Conjugate Gradients. Eigenvectors are explained and used to examine the convergence of the Jacobi Method, Steepest Descent, and Conjugate Gradients. Other topics include preconditioning and the nonlinear Conjugate Gradient Method. I have taken pains to make this article easy to read. Sixty-six illustrations are provided. Dense prose is avoided. Concepts are explained in several different ways. Most equations are coupled with an intuitive interpretation.

Supported in part by the Natural Sciences and Engineering Research Council of Canada under a 1967 Science and Engineering Scholarship and by the National Science Foundation under Grant ASC-9318163. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either express or implied, of NSERC, NSF, or the U.S. Government.


Keywords: conjugate gradient method, preconditioning, convergence analysis, agonizing pain


Contents

1. Introduction
2. Notation
3. The Quadratic Form
4. The Method of Steepest Descent
5. Thinking with Eigenvectors and Eigenvalues
   5.1. Eigen do it if I try
   5.2. Jacobi iterations
   5.3. A Concrete Example
6. Convergence Analysis of Steepest Descent
   6.1. Instant Results
   6.2. General Convergence
7. The Method of Conjugate Directions
   7.1. Conjugacy
   7.2. Gram-Schmidt Conjugation
   7.3. Optimality of the Error Term
8. The Method of Conjugate Gradients
9. Convergence Analysis of Conjugate Gradients
   9.1. Picking Perfect Polynomials
   9.2. Chebyshev Polynomials
10. Complexity
11. Starting and Stopping
   11.1. Starting
   11.2. Stopping
12. Preconditioning
13. Conjugate Gradients on the Normal Equations
14. The Nonlinear Conjugate Gradient Method
   14.1. Outline of the Nonlinear Conjugate Gradient Method
   14.2. General Line Search
   14.3. Preconditioning
A. Notes
B. Canned Algorithms
   B1. Steepest Descent
   B2. Conjugate Gradients
   B3. Preconditioned Conjugate Gradients
   B4. Nonlinear Conjugate Gradients with Newton-Raphson and Fletcher-Reeves
   B5. Preconditioned Nonlinear Conjugate Gradients with Secant and Polak-Ribière
C. Ugly Proofs
   C1. The Solution to Ax = b Minimizes the Quadratic Form
   C2. A Symmetric Matrix Has Orthogonal Eigenvectors
   C3. Optimality of Chebyshev Polynomials
D. Homework


1. Introduction

When I decided to learn the Conjugate Gradient Method (henceforth, CG), I read four different descriptions, which I shall politely not identify. I understood none of them. Most of them simply wrote down the method, then proved its properties without any intuitive explanation or hint of how anybody might have invented CG in the first place. This article was born of my frustration, with the wish that future students of CG will learn a rich and elegant algorithm, rather than a confusing mass of equations.

CG is the most popular iterative method for solving large systems of linear equations. CG is effective for systems of the form

\[ A x = b, \tag{1} \]

where x is an unknown vector, b is a known vector, and A is a known, square, symmetric, positive-definite (or positive-indefinite) matrix. (Don't worry if you've forgotten what "positive-definite" means; we shall review it.) These systems arise in many important settings, such as finite difference and finite element methods for solving partial differential equations, structural analysis, circuit analysis, and math homework.

Iterative methods like CG are suited for use with sparse matrices. If A is dense, your best course of action is probably to factor A and solve the equation by backsubstitution. The time spent factoring a dense A is roughly equivalent to the time spent solving the system iteratively; and once A is factored, the system can be backsolved quickly for multiple values of b. Compare this dense matrix with a sparse matrix of larger size that fills the same amount of memory. The triangular factors of a sparse A usually have many more nonzero elements than A itself. Factoring may be impossible due to limited memory, and will be time-consuming as well; even the backsolving step may be slower than iterative solution. On the other hand, most iterative methods are memory-efficient and run quickly with sparse matrices.

I assume that you have taken a first course in linear algebra, and that you have a solid understanding of matrix multiplication and linear independence, although you probably don't remember what those eigenthingies were all about. From this foundation, I shall build the edifice of CG as clearly as I can.

2. Notation

Let us begin with a few definitions and notes on notation.

With a few exceptions, I shall use capital letters to denote matrices, lower case letters to denote vectors, and Greek letters to denote scalars. A is an n x n matrix, and x and b are vectors; that is, n x 1 matrices. Written out fully, Equation 1 is

\[
\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1n} \\
A_{21} & A_{22} & \cdots & A_{2n} \\
\vdots &        & \ddots & \vdots \\
A_{n1} & A_{n2} & \cdots & A_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.
\]

The inner product of two vectors is written x^T y, and represents the scalar sum \(\sum_{i=1}^{n} x_i y_i\). Note that x^T y = y^T x. If x and y are orthogonal, then x^T y = 0. In general, expressions that reduce to 1 x 1 matrices, such as x^T y and x^T A x, are treated as scalar values.


Figure 1: Sample two-dimensional linear system. The solution lies at the intersection of the lines.

A matrix A is positive-definite if, for every nonzero vector x,

\[ x^T A x > 0. \tag{2} \]

This may mean little to you, but don't feel bad; it's not a very intuitive idea, and it's hard to imagine how a matrix that is positive-definite might look differently from one that isn't. We will get a feeling for what positive-definiteness is about when we see how it affects the shape of quadratic forms.

Finally, don't forget the important basic identities (AB)^T = B^T A^T and (AB)^{-1} = B^{-1} A^{-1}.

3. The Quadratic Form

A quadratic form is simply a scalar, quadratic function of a vector with the form

\[ f(x) = \tfrac{1}{2} x^T A x - b^T x + c, \tag{3} \]

where A is a matrix, x and b are vectors, and c is a scalar constant. I shall show shortly that if A is symmetric and positive-definite, f(x) is minimized by the solution to Ax = b.

Throughout this paper, I will demonstrate ideas with the simple sample problem

\[ A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}, \qquad b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}, \qquad c = 0. \tag{4} \]

The system Ax = b is illustrated in Figure 1. In general, the solution x lies at the intersection point of n hyperplanes, each having dimension n - 1. For this problem, the solution is x = (2, -2)^T. The corresponding quadratic form f(x) appears in Figure 2. A contour plot of f(x) is illustrated in Figure 3.
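As a concrete companion to Equations 3 and 4, here is a minimal numerical sketch (Python with NumPy; my addition for illustration, not part of the original article) that sets up the sample problem and checks that the solution of Ax = b is where f is smallest:

```python
import numpy as np

# Sample problem from Equation 4.
A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])
c = 0.0

def f(x):
    """Quadratic form of Equation 3: f(x) = 1/2 x^T A x - b^T x + c."""
    return 0.5 * x @ A @ x - b @ x + c

x_star = np.linalg.solve(A, b)   # solution of Ax = b
print(x_star)                    # [ 2. -2.]
print(f(x_star))                 # -10.0, the minimum value of f

# f is larger at nearby points, consistent with x_star being the minimizer.
for dx in [np.array([0.1, 0.0]), np.array([0.0, 0.1]), np.array([-0.3, 0.2])]:
    assert f(x_star + dx) > f(x_star)
```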


Figure 2: Graph of a quadratic form f(x). The minimum point of this surface is the solution to Ax = b.

Figure 3: Contours of the quadratic form. Each ellipsoidal curve has constant f(x).


Figure 4: Gradient f'(x) of the quadratic form. For every x, the gradient points in the direction of steepest increase of f(x), and is orthogonal to the contour lines.

Because A is positive-definite, the surface defined by f(x) is shaped like a paraboloid bowl. (I'll have more to say about this in a moment.)

The gradient of a quadratic form is defined to be

\[ f'(x) = \begin{bmatrix} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \\ \vdots \\ \frac{\partial}{\partial x_n} f(x) \end{bmatrix}. \tag{5} \]

The gradient is a vector field that, for a given point x, points in the direction of greatest increase of f(x). Figure 4 illustrates the gradient vectors for Equation 3 with the constants given in Equation 4. At the bottom of the paraboloid bowl, the gradient is zero. One can minimize f(x) by setting f'(x) equal to zero.

With a little bit of tedious math, one can apply Equation 5 to Equation 3, and derive

\[ f'(x) = \tfrac{1}{2} A^T x + \tfrac{1}{2} A x - b. \tag{6} \]

If A is symmetric, this equation reduces to

\[ f'(x) = A x - b. \tag{7} \]

Setting the gradient to zero, we obtain Equation 1, the linear system we wish to solve. Therefore, the solution to Ax = b is a critical point of f(x). If A is positive-definite as well as symmetric, then this


solution is a minimum of f(x), so Ax = b can be solved by finding an x that minimizes f(x). (If A is not symmetric, then Equation 6 hints that CG will find a solution to the system (1/2)(A^T + A)x = b. Note that (1/2)(A^T + A) is symmetric.)

Why do symmetric positive-definite matrices have this nice property? Consider the relationship between f at some arbitrary point p and at the solution point x = A^{-1} b. From Equation 3 one can show (Appendix C1) that if A is symmetric (be it positive-definite or not),

\[ f(p) = f(x) + \tfrac{1}{2} (p - x)^T A (p - x). \tag{8} \]

If A is positive-definite as well, then by Inequality 2, the latter term is positive for all p not equal to x. It follows that x is a global minimum of f.

The fact that f(x) is a paraboloid is our best intuition of what it means for a matrix to be positive-definite. If A is not positive-definite, there are several other possibilities. A could be negative-definite, the result of negating a positive-definite matrix (see Figure 2, but hold it upside-down). A might be singular, in which case no solution is unique; the set of solutions is a line or hyperplane having a uniform value for f. If A is none of the above, then the solution is a saddle point, and techniques like Steepest Descent and CG will likely fail. Figure 5 demonstrates the possibilities. The values of b and c determine where the minimum point of the paraboloid lies, but do not affect the paraboloid's shape.

Figure 5: (a) Quadratic form for a positive-definite matrix. (b) For a negative-definite matrix. (c) For a singular (and positive-indefinite) matrix. A line that runs through the bottom of the valley is the set of solutions. (d) For an indefinite matrix. Because the solution is a saddle point, Steepest Descent and CG will not work. In three dimensions or higher, a singular matrix can also have a saddle.

Why go to the trouble of converting the linear system into a tougher-looking problem? The methods under study, Steepest Descent and CG, were developed and are intuitively understood in terms of minimization problems like Figure 2, not in terms of intersecting hyperplanes such as Figure 1.
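Equations 7 and 8 are easy to check numerically. The sketch below (my own illustrative Python, with an arbitrarily chosen test point p) compares Equation 7 against a finite-difference gradient and verifies the identity of Equation 8 for the sample problem:

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])   # Equation 4
b = np.array([2.0, -8.0])
c = 0.0

def f(x):
    return 0.5 * x @ A @ x - b @ x + c

def grad(x):
    return A @ x - b                     # Equation 7 (A is symmetric)

# Check Equation 7 against a centered finite-difference gradient.
x = np.array([1.0, 3.0])
h = 1e-6
fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
print(np.allclose(fd, grad(x), atol=1e-5))   # True

# Check Equation 8: f(p) = f(x*) + 1/2 (p - x*)^T A (p - x*).
x_star = np.linalg.solve(A, b)
p = np.array([-1.5, 4.0])
print(np.isclose(f(p), f(x_star) + 0.5 * (p - x_star) @ A @ (p - x_star)))  # True
```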


4. The Method of Steepest Descent

In the method of Steepest Descent, we start at an arbitrary point x_(0) and slide down to the bottom of the paraboloid. We take a series of steps x_(1), x_(2), ... until we are satisfied that we are close enough to the solution x.

When we take a step, we choose the direction in which f decreases most quickly, which is the direction opposite f'(x_(i)). According to Equation 7, this direction is -f'(x_(i)) = b - A x_(i).

Allow me to introduce a few definitions, which you should memorize. The error e_(i) = x_(i) - x is a vector that indicates how far we are from the solution. The residual r_(i) = b - A x_(i) indicates how far we are from the correct value of b. It is easy to see that r_(i) = -A e_(i), and you should think of the residual as being the error transformed by A into the same space as b. More importantly, r_(i) = -f'(x_(i)), and you should also think of the residual as the direction of steepest descent. For nonlinear problems, discussed in Section 14, only the latter definition applies. So remember, whenever you read "residual", think "direction of steepest descent."

Suppose we start at x_(0) = (-2, -2)^T. Our first step, along the direction of steepest descent, will fall somewhere on the solid line in Figure 6(a). In other words, we will choose a point

\[ x_{(1)} = x_{(0)} + \alpha r_{(0)}. \tag{9} \]

The question is, how big a step should we take?

A line search is a procedure that chooses \alpha to minimize f along a line. Figure 6(b) illustrates this task: we are restricted to choosing a point on the intersection of the vertical plane and the paraboloid. Figure 6(c) is the parabola defined by the intersection of these surfaces. What is the value of \alpha at the base of the parabola?

From basic calculus, \alpha minimizes f when the directional derivative \(\frac{d}{d\alpha} f(x_{(1)})\) is equal to zero. By the chain rule, \(\frac{d}{d\alpha} f(x_{(1)}) = f'(x_{(1)})^T \frac{d}{d\alpha} x_{(1)} = f'(x_{(1)})^T r_{(0)}\). Setting this expression to zero, we find that \alpha should be chosen so that r_(0) and f'(x_(1)) are orthogonal (see Figure 6(d)).

There is an intuitive reason why we should expect these vectors to be orthogonal at the minimum. Figure 7 shows the gradient vectors at various points along the search line. The slope of the parabola (Figure 6(c)) at any point is equal to the magnitude of the projection of the gradient onto the line (Figure 7, dotted arrows). These projections represent the rate of increase of f as one traverses the search line. f is minimized where the projection is zero, that is, where the gradient is orthogonal to the search line.

To determine \alpha, note that f'(x_(1)) = -r_(1), and we have

\[
\begin{aligned}
r_{(1)}^T r_{(0)} &= 0 \\
(b - A x_{(1)})^T r_{(0)} &= 0 \\
\bigl(b - A (x_{(0)} + \alpha r_{(0)})\bigr)^T r_{(0)} &= 0 \\
(b - A x_{(0)})^T r_{(0)} - \alpha (A r_{(0)})^T r_{(0)} &= 0 \\
r_{(0)}^T r_{(0)} &= \alpha\, r_{(0)}^T A r_{(0)} \\
\alpha &= \frac{r_{(0)}^T r_{(0)}}{r_{(0)}^T A r_{(0)}}.
\end{aligned}
\]
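A short numerical check of this line search (illustrative Python, my addition): take the single step of Equation 9 from the starting point of Figure 6 and confirm that the new gradient is orthogonal to the old search direction.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

x0 = np.array([-2.0, -2.0])          # starting point used in Figure 6
r0 = b - A @ x0                      # residual = direction of steepest descent
alpha = (r0 @ r0) / (r0 @ A @ r0)    # the line-search step derived above
x1 = x0 + alpha * r0                 # Equation 9

r1 = b - A @ x1
print(r1 @ r0)   # ~0: the new gradient is orthogonal to the search direction
```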


Figure 6: The method of Steepest Descent. (a) Starting at x_(0) = (-2, -2)^T, take a step in the direction of steepest descent of f. (b) Find the point on the intersection of these two surfaces that minimizes f. (c) This parabola is the intersection of surfaces. The bottommost point is our target. (d) The gradient at the bottommost point is orthogonal to the gradient of the previous step.

Figure 7: The gradient f' is shown at several locations along the search line (solid arrows). Each gradient's projection onto the line is also shown (dotted arrows). The gradient vectors represent the direction of steepest increase of f, and the projections represent the rate of increase as one traverses the search line. On the search line, f is minimized where the gradient is orthogonal to the search line.


Figure 8: Here, the method of Steepest Descent starts at (-2, -2)^T and converges at (2, -2)^T.

Putting it all together, the method of Steepest Descent is:

\[ r_{(i)} = b - A x_{(i)}, \tag{10} \]
\[ \alpha_{(i)} = \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}, \tag{11} \]
\[ x_{(i+1)} = x_{(i)} + \alpha_{(i)} r_{(i)}. \tag{12} \]

The example is run until it converges in Figure 8. Note the zigzag path, which appears because each gradient is orthogonal to the previous gradient.

The algorithm, as written above, requires two matrix-vector multiplications per iteration. The computational cost of Steepest Descent is dominated by matrix-vector products; fortunately, one can be eliminated. By premultiplying both sides of Equation 12 by -A and adding b, we have

\[ r_{(i+1)} = r_{(i)} - \alpha_{(i)} A r_{(i)}. \tag{13} \]

Although Equation 10 is still needed to compute r_(0), Equation 13 can be used for every iteration thereafter. The product A r_(i), which occurs in both Equations 11 and 13, need only be computed once. The disadvantage of using this recurrence is that the sequence defined by Equation 13 is generated without any feedback from the value of x_(i), so that accumulation of floating point roundoff error may cause x_(i) to converge to some point near x. This effect can be avoided by periodically using Equation 10 to recompute the correct residual.

Before analyzing the convergence of Steepest Descent, I must digress to ensure that you have a solid understanding of eigenvectors.
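The loop below is a small sketch of Equations 10 through 13 in Python (my own rendering for illustration; the article's canonical pseudocode appears in Appendix B1). The tolerance, iteration cap, and recomputation interval are arbitrary choices, not values prescribed by the article.

```python
import numpy as np

def steepest_descent(A, b, x, tol=1e-10, max_iter=1000, recompute_every=50):
    """Steepest Descent (Equations 10-13) with periodic exact residual recomputation."""
    r = b - A @ x                        # Equation 10
    for i in range(max_iter):
        Ar = A @ r                       # the single matrix-vector product per iteration
        alpha = (r @ r) / (r @ Ar)       # Equation 11
        x = x + alpha * r                # Equation 12
        if (i + 1) % recompute_every == 0:
            r = b - A @ x                # Equation 10: discard accumulated roundoff
        else:
            r = r - alpha * Ar           # Equation 13
        if np.linalg.norm(r) < tol:
            break
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(steepest_descent(A, b, np.array([-2.0, -2.0])))   # approx [ 2. -2.]
```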


5. Thinking with Eigenvectors and Eigenvalues

After my one course in linear algebra, I knew eigenvectors and eigenvalues like the back of my head. If your instructor was anything like mine, you recall solving problems involving eigendoohickeys, but you never really understood them. Unfortunately, without an intuitive grasp of them, CG won't make sense either. If you're already eigentalented, feel free to skip this section.

Eigenvectors are used primarily as an analysis tool; Steepest Descent and CG do not calculate the value of any eigenvectors as part of the algorithm.¹

5.1. Eigen do it if I try

An eigenvector v of a matrix B is a nonzero vector that does not rotate when B is applied to it (except perhaps to point in precisely the opposite direction). v may change length or reverse its direction, but it won't turn sideways. In other words, there is some scalar constant λ such that Bv = λv. The value λ is an eigenvalue of B. For any constant α, the vector αv is also an eigenvector with eigenvalue λ, because B(αv) = αBv = λαv. In other words, if you scale an eigenvector, it's still an eigenvector.

Why should you care? Iterative methods often depend on applying B to a vector over and over again. When B is repeatedly applied to an eigenvector v, one of two things can happen. If |λ| < 1, then B^i v = λ^i v will vanish as i approaches infinity (Figure 9). If |λ| > 1, then B^i v will grow to infinity (Figure 10). Each time B is applied, the vector grows or shrinks according to the value of |λ|.

Figure 9: v is an eigenvector of B with a corresponding eigenvalue of -0.5. As i increases, B^i v converges to zero.

Figure 10: Here, v has a corresponding eigenvalue of 2. As i increases, B^i v diverges to infinity.

¹ However, there are practical applications for eigenvectors. The eigenvectors of the stiffness matrix associated with a discretized structure of uniform density represent the natural modes of vibration of the structure being studied. For instance, the eigenvectors of the stiffness matrix associated with a one-dimensional uniformly-spaced mesh are sine waves, and to express vibrations as a linear combination of these eigenvectors is equivalent to performing a discrete Fourier transform.


If B is symmetric (and often if it is not), then there exists a set of n linearly independent eigenvectors of B, denoted v_1, v_2, ..., v_n. This set is not unique, because each eigenvector can be scaled by an arbitrary nonzero constant. Each eigenvector has a corresponding eigenvalue, denoted λ_1, λ_2, ..., λ_n. These are uniquely defined for a given matrix. The eigenvalues may or may not be equal to each other; for instance, the eigenvalues of the identity matrix I are all one, and every nonzero vector is an eigenvector of I.

What if B is applied to a vector that is not an eigenvector? A very important skill in understanding linear algebra (the skill this section is written to teach) is to think of a vector as a sum of other vectors whose behavior is understood. Consider that the set of eigenvectors {v_j} forms a basis for R^n (because a symmetric B has n eigenvectors that are linearly independent). Any n-dimensional vector can be expressed as a linear combination of these eigenvectors, and because matrix-vector multiplication is distributive, one can examine the effect of B on each eigenvector separately.

In Figure 11, a vector x is illustrated as a sum of two eigenvectors v_1 and v_2. Applying B to x is equivalent to applying B to the eigenvectors, and summing the result. On repeated application, we have B^i x = B^i v_1 + B^i v_2 = λ_1^i v_1 + λ_2^i v_2. If the magnitudes of all the eigenvalues are smaller than one, B^i x will converge to zero, because the eigenvectors that compose x converge to zero when B is repeatedly applied. If one of the eigenvalues has magnitude greater than one, x will diverge to infinity. This is why numerical analysts attach importance to the spectral radius of a matrix:

\[ \rho(B) = \max_j |\lambda_j|, \qquad \lambda_j \text{ an eigenvalue of } B. \]

If we want x to converge to zero quickly, ρ(B) should be less than one, and preferably as small as possible.

Figure 11: The vector x (solid arrow) can be expressed as a linear combination of eigenvectors (dashed arrows), whose associated eigenvalues are λ_1 = 0.7 and λ_2 = -2. The effect of repeatedly applying B to x is best understood by examining the effect of B on each eigenvector. When B is repeatedly applied, one eigenvector converges to zero while the other diverges; hence, B^i x also diverges.

It's important to mention that there is a family of nonsymmetric matrices that do not have a full set of n independent eigenvectors. These matrices are known as defective, a name that betrays the well-deserved hostility they have earned from frustrated linear algebraists. The details are too complicated to describe in this article, but the behavior of defective matrices can be analyzed in terms of generalized eigenvectors and generalized eigenvalues. The rule that B^i x converges to zero if and only if all the generalized eigenvalues have magnitude smaller than one still holds, but is harder to prove.

Here's a useful fact: the eigenvalues of a positive-definite matrix are all positive. This fact can be proven from the definition of eigenvalue:

\[ B v = \lambda v, \qquad v^T B v = \lambda\, v^T v. \]

By the definition of positive-definite, v^T B v is positive (for nonzero v). Hence, λ must be positive also.
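A tiny numerical illustration of the spectral radius (Python, my own example with arbitrarily chosen eigenvalues, not taken from the article's figures): when ρ(B) < 1, repeated application of B drives any vector toward zero at a rate governed by the largest eigenvalue magnitude.

```python
import numpy as np

# A matrix whose spectral radius is 0.7: repeated application shrinks every vector.
B = np.array([[0.7, 0.0],
              [0.0, -0.5]])
print(max(abs(np.linalg.eigvals(B))))    # 0.7, the spectral radius rho(B)

x = np.array([1.0, 1.0])
for i in range(1, 6):
    x = B @ x
    print(i, np.linalg.norm(x))          # shrinks roughly like 0.7**i
```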


5.2. Jacobi iterations

Of course, a procedure that always converges to zero isn't going to help you attract friends. Consider a more useful procedure: the Jacobi Method for solving Ax = b. The matrix A is split into two parts: D, whose diagonal elements are identical to those of A, and whose off-diagonal elements are zero; and E, whose diagonal elements are zero, and whose off-diagonal elements are identical to those of A. Thus, A = D + E. We derive the Jacobi Method:

\[
\begin{aligned}
A x &= b \\
D x &= -E x + b \\
x &= -D^{-1} E x + D^{-1} b \\
x &= B x + z, \qquad \text{where } B = -D^{-1} E, \quad z = D^{-1} b.
\end{aligned} \tag{14}
\]

Because D is diagonal, it is easy to invert. This identity can be converted into an iterative method by forming the recurrence

\[ x_{(i+1)} = B x_{(i)} + z. \tag{15} \]

Given a starting vector x_(0), this formula generates a sequence of vectors. Our hope is that each successive vector will be closer to the solution x than the last. x is called a stationary point of Equation 15, because if x_(i) = x, then x_(i+1) will also equal x.

Now, this derivation may seem quite arbitrary to you, and you're right. We could have formed any number of identities for x instead of Equation 14. In fact, simply by splitting A differently (that is, by choosing a different D and E) we could have derived the Gauss-Seidel method, or the method of Successive Over-Relaxation (SOR). Our hope is that we have chosen a splitting for which B has a small spectral radius. Here, I chose the Jacobi splitting arbitrarily for simplicity.

Suppose we start with some arbitrary vector x_(0). For each iteration, we apply B to this vector, then add z to the result. What does each iteration do?

Again, apply the principle of thinking of a vector as a sum of other, well-understood vectors. Express each iterate x_(i) as the sum of the exact solution x and the error term e_(i). Then, Equation 15 becomes

\[
\begin{aligned}
x_{(i+1)} &= B x_{(i)} + z \\
&= B (x + e_{(i)}) + z \\
&= B x + z + B e_{(i)} \\
&= x + B e_{(i)} \qquad \text{(by Equation 14),}
\end{aligned}
\]
so
\[ e_{(i+1)} = B e_{(i)}. \tag{16} \]

Each iteration does not affect the "correct part" of x_(i) (because x is a stationary point); but each iteration does affect the error term. It is apparent from Equation 16 that if ρ(B) < 1, then the error term e_(i) will converge to zero as i approaches infinity. Hence, the initial vector x_(0) has no effect on the inevitable outcome!

Of course, the choice of x_(0) does affect the number of iterations required to converge to x within a given tolerance. However, its effect is less important than that of the spectral radius ρ(B), which determines the speed of convergence. Suppose that v_j is the eigenvector of B with the largest eigenvalue (so that ρ(B) = λ_j). If the initial error e_(0), expressed as a linear combination of eigenvectors, includes a component in the direction of v_j, this component will be the slowest to converge.
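Here is a minimal Jacobi sketch for the sample problem (Python; my addition, with an arbitrary starting vector and iteration count), splitting A as D + E and iterating Equation 15:

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

D = np.diag(np.diag(A))          # diagonal part of A
E = A - D                        # off-diagonal part, so A = D + E
B = -np.linalg.solve(D, E)       # B = -D^{-1} E
z = np.linalg.solve(D, b)        # z = D^{-1} b

print(max(abs(np.linalg.eigvals(B))))    # spectral radius, about 0.47

x = np.array([-2.0, -2.0])               # arbitrary starting vector
x_star = np.linalg.solve(A, b)
for i in range(40):
    x = B @ x + z                        # Equation 15
print(np.linalg.norm(x - x_star))        # error has shrunk by roughly rho(B)**40
```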


Figure 12: The eigenvectors of A are directed along the axes of the paraboloid defined by the quadratic form f(x). Each eigenvector is labeled with its associated eigenvalue. Each eigenvalue is proportional to the steepness of the corresponding slope.

B is not generally symmetric (even if A is), and may even be defective. However, the rate of convergence of the Jacobi Method depends largely on ρ(B), which depends on A. Unfortunately, Jacobi does not converge for every A, or even for every positive-definite A.

5.3. A Concrete Example

To demonstrate these ideas, I shall solve the system specified by Equation 4. First, we need a method of finding eigenvalues and eigenvectors. By definition, for any eigenvector v with eigenvalue λ,

\[ A v = \lambda v = \lambda I v, \qquad (\lambda I - A) v = 0. \]

Eigenvectors are nonzero, so λI - A must be singular. Then,

\[ \det(\lambda I - A) = 0. \]

The determinant of λI - A is called the characteristic polynomial. It is a degree n polynomial in λ whose roots are the set of eigenvalues. The characteristic polynomial of A (from Equation 4) is

\[ \det \begin{bmatrix} \lambda - 3 & -2 \\ -2 & \lambda - 6 \end{bmatrix} = \lambda^2 - 9\lambda + 14 = (\lambda - 7)(\lambda - 2), \]

and the eigenvalues are 7 and 2. To find the eigenvector associated with λ = 7,

\[ (\lambda I - A) v = \begin{bmatrix} 4 & -2 \\ -2 & 1 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = 0.
\]


Any solution to this equation is an eigenvector; say, v = (1, 2)^T. By the same method, we find that (-2, 1)^T is an eigenvector corresponding to the eigenvalue 2. In Figure 12, we see that these eigenvectors coincide with the axes of the familiar ellipsoid, and that a larger eigenvalue corresponds to a steeper slope. (Negative eigenvalues indicate that f decreases along the axis, as in Figures 5(b) and 5(d).)

Now, let's see the Jacobi Method in action. Using the constants specified by Equation 4, we have

\[
x_{(i+1)} = -\begin{bmatrix} \tfrac{1}{3} & 0 \\ 0 & \tfrac{1}{6} \end{bmatrix}
\begin{bmatrix} 0 & 2 \\ 2 & 0 \end{bmatrix} x_{(i)}
+ \begin{bmatrix} \tfrac{1}{3} & 0 \\ 0 & \tfrac{1}{6} \end{bmatrix}
\begin{bmatrix} 2 \\ -8 \end{bmatrix}
= \begin{bmatrix} 0 & -\tfrac{2}{3} \\ -\tfrac{1}{3} & 0 \end{bmatrix} x_{(i)}
+ \begin{bmatrix} \tfrac{2}{3} \\ -\tfrac{4}{3} \end{bmatrix}.
\]

The eigenvectors of B are (\sqrt{2}, 1)^T with eigenvalue -\sqrt{2}/3, and (-\sqrt{2}, 1)^T with eigenvalue \sqrt{2}/3. These are graphed in Figure 13(a); note that they do not coincide with the eigenvectors of A, and are not related to the axes of the paraboloid.

Figure 13(b) shows the convergence of the Jacobi method. The mysterious path the algorithm takes can be understood by watching the eigenvector components of each successive error term (Figures 13(c), (d), and (e)). Figure 13(f) plots the eigenvector components as arrowheads. These are converging normally at the rate defined by their eigenvalues, as in Figure 11.

I hope that this section has convinced you that eigenvectors are useful tools, and not just bizarre torture devices inflicted on you by your professors for the pleasure of watching you suffer (although the latter is a nice fringe benefit).

6. Convergence Analysis of Steepest Descent

6.1. Instant Results

To understand the convergence of Steepest Descent, let's first consider the case where e_(i) is an eigenvector with eigenvalue λ_e. Then the residual r_(i) = -A e_(i) = -λ_e e_(i) is also an eigenvector. Equation 12 gives

\[
\begin{aligned}
e_{(i+1)} &= e_{(i)} + \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} r_{(i)} \\
&= e_{(i)} + \frac{r_{(i)}^T r_{(i)}}{\lambda_e\, r_{(i)}^T r_{(i)}} (-\lambda_e e_{(i)}) \\
&= 0.
\end{aligned}
\]

Figure 14 demonstrates why it takes only one step to converge to the exact solution. The point x_(i) lies on one of the axes of the ellipsoid, and so the residual points directly to the center of the ellipsoid. Choosing α_(i) = λ_e^{-1} gives us instant convergence.

Figure 14: Steepest Descent converges to the exact solution on the first iteration if the error term is an eigenvector.

For a more general analysis, we must express e_(i) as a linear combination of eigenvectors, and we shall furthermore require these eigenvectors to be orthonormal. It is proven in Appendix C2 that if A is


Figure 13: Convergence of the Jacobi Method. (a) The eigenvectors of B are shown with their corresponding eigenvalues. Unlike the eigenvectors of A, these eigenvectors are not the axes of the paraboloid. (b) The Jacobi Method starts at (-2, -2)^T and converges at (2, -2)^T. (c, d, e) The error vectors e_(0), e_(1), e_(2) (solid arrows) and their eigenvector components (dashed arrows). (f) Arrowheads represent the eigenvector components of the first four error vectors. Each eigenvector component of the error is converging to zero at the expected rate based on its eigenvalue.


symmetric, there exists a set of n orthogonal eigenvectors of A. As we can scale eigenvectors arbitrarily, let us choose so that each eigenvector is of unit length. This choice gives us the useful property that

\[ v_j^T v_k = \begin{cases} 1, & j = k, \\ 0, & j \neq k. \end{cases} \tag{17} \]

Express the error term as a linear combination of eigenvectors

\[ e_{(i)} = \sum_{j=1}^{n} \xi_j v_j, \tag{18} \]

where ξ_j is the length of each component of e_(i). From Equations 17 and 18 we have the following identities:

\[ r_{(i)} = -A e_{(i)} = -\sum_j \xi_j \lambda_j v_j, \tag{19} \]
\[ \|e_{(i)}\|^2 = e_{(i)}^T e_{(i)} = \sum_j \xi_j^2, \tag{20} \]
\[ e_{(i)}^T A e_{(i)} = \sum_j \xi_j^2 \lambda_j, \tag{21} \]
\[ \|r_{(i)}\|^2 = r_{(i)}^T r_{(i)} = \sum_j \xi_j^2 \lambda_j^2, \tag{22} \]
\[ r_{(i)}^T A r_{(i)} = \sum_j \xi_j^2 \lambda_j^3. \tag{23} \]
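Identities 19 through 23 are easy to confirm numerically. The following sketch (Python; my addition, with an arbitrarily chosen error vector) expresses an error term in the orthonormal eigenvector basis of the sample A and checks each identity:

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
lam, V = np.linalg.eigh(A)          # orthonormal eigenvectors of the symmetric A (columns of V)
e = np.array([1.0, -3.0])           # an arbitrary error vector
xi = V.T @ e                        # components of e in the eigenvector basis (Equation 18)
r = -A @ e                          # Equation 19

print(np.isclose(e @ e,     np.sum(xi**2)))            # Equation 20
print(np.isclose(e @ A @ e, np.sum(xi**2 * lam)))      # Equation 21
print(np.isclose(r @ r,     np.sum(xi**2 * lam**2)))   # Equation 22
print(np.isclose(r @ A @ r, np.sum(xi**2 * lam**3)))   # Equation 23
```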


Figure 15: Steepest Descent converges to the exact solution on the first iteration if the eigenvalues are all equal.

Equation 19 shows that r_(i) too can be expressed as a sum of eigenvector components, and the length of these components are -ξ_j λ_j. Equations 20 and 22 are just Pythagoras' Law.

Now we can proceed with the analysis. Equation 12 gives

\[ e_{(i+1)} = e_{(i)} + \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} r_{(i)}
 = e_{(i)} + \frac{\sum_j \xi_j^2 \lambda_j^2}{\sum_j \xi_j^2 \lambda_j^3} r_{(i)}. \tag{24} \]

We saw in the last example that, if e_(i) has only one eigenvector component, then convergence is achieved in one step by choosing α_(i) = λ_e^{-1}. Now let's examine the case where e_(i) is arbitrary, but all the eigenvectors have a common eigenvalue λ. Equation 24 becomes

\[ e_{(i+1)} = e_{(i)} + \frac{\lambda^2 \sum_j \xi_j^2}{\lambda^3 \sum_j \xi_j^2} (-\lambda e_{(i)}) = 0. \]

Figure 15 demonstrates why, once again, there is instant convergence. Because all the eigenvalues are equal, the ellipsoid is spherical; hence, no matter what point we start at, the residual must point to the center of the sphere. As before, choose α_(i) = λ^{-1}.
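A quick numerical illustration of the spherical case (Python; my own example with A = 2I and an arbitrary starting point, not one of the article's figures): when every eigenvalue equals λ, one Steepest Descent step with α = 1/λ lands exactly on the solution.

```python
import numpy as np

A = 2.0 * np.eye(2)                 # all eigenvalues equal: spherical contours
b = np.array([1.0, -3.0])
x_star = np.linalg.solve(A, b)

x = np.array([17.0, 42.0])          # arbitrary starting point
r = b - A @ x
alpha = (r @ r) / (r @ A @ r)       # evaluates to 1/lambda = 0.5 here
x = x + alpha * r
print(np.allclose(x, x_star))       # True: one step suffices
```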


However, if there are several unequal, nonzero eigenvalues, then no choice of α_(i) will eliminate all the eigenvector components, and our choice becomes a sort of compromise. In fact, the fraction in Equation 24 is best thought of as a weighted average of the values of λ_j^{-1}. The weights ξ_j^2 λ_j^3 ensure that longer components of e_(i) are given precedence. As a result, on any given iteration, some of the shorter components of e_(i) might actually increase in length (though never for long). For this reason, the methods of Steepest Descent and Conjugate Gradients are called roughers. By contrast, the Jacobi Method is a smoother, because every eigenvector component is reduced on every iteration. Steepest Descent and Conjugate Gradients are not smoothers, although they are often erroneously identified as such in the mathematical literature.

Figure 16: The energy norm of these two vectors is equal.

6.2. General Convergence

To bound the convergence of Steepest Descent in the general case, we shall define the energy norm \(\|e\|_A = (e^T A e)^{1/2}\) (see Figure 16). This norm is easier to work with than the Euclidean norm, and is in some sense a more natural norm; examination of Equation 8 shows that minimizing \(\|e_{(i)}\|_A\) is equivalent to


minimizing f(x_(i)). With this norm, we have

\[
\begin{aligned}
\|e_{(i+1)}\|_A^2 &= e_{(i+1)}^T A e_{(i+1)} \\
&= (e_{(i)} + \alpha_{(i)} r_{(i)})^T A (e_{(i)} + \alpha_{(i)} r_{(i)}) \qquad \text{(by Equation 12)} \\
&= e_{(i)}^T A e_{(i)} + 2 \alpha_{(i)}\, r_{(i)}^T A e_{(i)} + \alpha_{(i)}^2\, r_{(i)}^T A r_{(i)} \qquad \text{(by symmetry of } A\text{)} \\
&= \|e_{(i)}\|_A^2 + 2 \frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}} \bigl(-r_{(i)}^T r_{(i)}\bigr)
   + \left(\frac{r_{(i)}^T r_{(i)}}{r_{(i)}^T A r_{(i)}}\right)^2 r_{(i)}^T A r_{(i)} \\
&= \|e_{(i)}\|_A^2 - \frac{(r_{(i)}^T r_{(i)})^2}{r_{(i)}^T A r_{(i)}} \\
&= \|e_{(i)}\|_A^2 \left(1 - \frac{(r_{(i)}^T r_{(i)})^2}{(r_{(i)}^T A r_{(i)})(e_{(i)}^T A e_{(i)})}\right) \\
&= \|e_{(i)}\|_A^2 \left(1 - \frac{(\sum_j \xi_j^2 \lambda_j^2)^2}{(\sum_j \xi_j^2 \lambda_j^3)(\sum_j \xi_j^2 \lambda_j)}\right) \qquad \text{(by Identities 21, 22, 23)} \\
&= \|e_{(i)}\|_A^2\, \omega^2, \qquad \text{where } \omega^2 = 1 - \frac{(\sum_j \xi_j^2 \lambda_j^2)^2}{(\sum_j \xi_j^2 \lambda_j^3)(\sum_j \xi_j^2 \lambda_j)}.
\end{aligned} \tag{25}
\]

The analysis depends on finding an upper bound for ω. To demonstrate how the weights and eigenvalues affect convergence, I shall derive a result for n = 2. Assume that λ_1 ≥ λ_2. The spectral condition number of A is defined to be κ = λ_1/λ_2 ≥ 1. The slope of e_(i) (relative to the coordinate system defined by the eigenvectors), which depends on the starting point, is denoted μ = ξ_2/ξ_1. We have

\[ \omega^2 = 1 - \frac{(\kappa^2 + \mu^2)^2}{(\kappa + \mu^2)(\kappa^3 + \mu^2)}. \tag{26} \]

The value of ω, which determines the rate of convergence of Steepest Descent, is graphed as a function of μ and κ in Figure 17. The graph confirms my two examples. If e_(0) is an eigenvector, then the slope μ is zero (or infinite); we see from the graph that ω is zero, so convergence is instant. If the eigenvalues are equal, then the condition number κ is one; again, we see that ω is zero.

Figure 18 illustrates examples from near each of the four corners of Figure 17. These quadratic forms are graphed in the coordinate system defined by their eigenvectors. Figures 18(a) and 18(b) are examples with a large condition number. Steepest Descent can converge quickly if a fortunate starting point is chosen (Figure 18(a)), but is usually at its worst when κ is large (Figure 18(b)). The latter figure gives us our best intuition for why a large condition number can be bad: f(x) forms a trough, and Steepest Descent bounces back and forth between the sides of the trough while making little progress along its length. In Figures 18(c) and 18(d), the condition number is small, so the quadratic form is nearly spherical, and convergence is quick regardless of the starting point.

Holding κ constant (because A is fixed), a little basic calculus reveals that Equation 26 is maximized when μ = ±κ. In Figure 17, one can see a faint ridge defined by this line. Figure 19 plots worst-case starting points for our sample matrix A. These starting points fall on the lines defined by ξ_2/ξ_1 = ±κ. An


Figure 17: Convergence ω of Steepest Descent as a function of μ (the slope of e_(i)) and κ (the condition number of A). Convergence is fast when μ or κ are small. For a fixed matrix, convergence is worst when μ = ±κ.

Figure 18: These four examples represent points near the corresponding four corners of the graph in Figure 17. (a) Large κ, small μ. (b) An example of poor convergence. κ and μ are both large. (c) Small κ and μ. (d) Small κ, large μ.


Figure 19: Solid lines represent the starting points that give the worst convergence for Steepest Descent. Dashed lines represent steps toward convergence. If the first iteration starts from a worst-case point, so do all succeeding iterations. Each step taken intersects the paraboloid axes (gray arrows) at precisely a 45 degree angle. Here, κ = 3.5.

An upper bound for ω (corresponding to the worst-case starting points) is found by setting μ² = κ²:

\[
\begin{aligned}
\omega^2 &\le 1 - \frac{4\kappa^4}{\kappa^5 + 2\kappa^4 + \kappa^3} \\
&= \frac{\kappa^5 - 2\kappa^4 + \kappa^3}{\kappa^5 + 2\kappa^4 + \kappa^3} \\
&= \frac{(\kappa - 1)^2}{(\kappa + 1)^2}, \\
\omega &\le \frac{\kappa - 1}{\kappa + 1}.
\end{aligned} \tag{27}
\]

Inequality 27 is plotted in Figure 20. The more ill-conditioned the matrix (that is, the larger its condition number κ), the slower the convergence of Steepest Descent. In Section 9.2, it is proven that Equation 27 is also valid for n > 2, if the condition number of a symmetric, positive-definite matrix is defined to be κ = λ_max/λ_min, the ratio of the largest to smallest eigenvalue. The convergence results for Steepest Descent are

\[ \|e_{(i)}\|_A \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^i \|e_{(0)}\|_A, \quad \text{and} \tag{28} \]


\[ f(x_{(i)}) - f(x) = \tfrac{1}{2} e_{(i)}^T A e_{(i)} \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2i} \tfrac{1}{2}\, e_{(0)}^T A e_{(0)} \qquad \text{(by Equation 8).} \]

Figure 20: Convergence of Steepest Descent (per iteration) worsens as the condition number of the matrix increases.

7. The Method of Conjugate Directions

7.1. Conjugacy

Steepest Descent often finds itself taking steps in the same direction as earlier steps (see Figure 8). Wouldn't it be better if, every time we took a step, we got it right the first time? Here's an idea: let's pick a set of orthogonal search directions d_(0), d_(1), ..., d_(n-1). In each search direction, we'll take exactly one step, and that step will be just the right length to line up evenly with x. After n steps, we'll be done.

Figure 21 illustrates this idea, using the coordinate axes as search directions. The first (horizontal) step leads to the correct x_1-coordinate; the second (vertical) step will hit home. Notice that e_(1) is orthogonal to d_(0). In general, for each step we choose a point

\[ x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)}. \tag{29} \]

Figure 21: The Method of Orthogonal Directions. Unfortunately, this method only works if you already know the answer.

To find the value of α_(i), use the fact that e_(i+1) should be orthogonal to


d_(i), so that we need never step in the direction of d_(i) again. Using this condition, we have

\[
\begin{aligned}
d_{(i)}^T e_{(i+1)} &= 0 \\
d_{(i)}^T \bigl(e_{(i)} + \alpha_{(i)} d_{(i)}\bigr) &= 0 \qquad \text{(by Equation 29)} \\
\alpha_{(i)} &= -\frac{d_{(i)}^T e_{(i)}}{d_{(i)}^T d_{(i)}}.
\end{aligned} \tag{30}
\]

Unfortunately, we haven't accomplished anything, because we can't compute α_(i) without knowing e_(i); and if we knew e_(i), the problem would already be solved.

The solution is to make the search directions A-orthogonal instead of orthogonal. Two vectors d_(i) and d_(j) are A-orthogonal, or conjugate, if

\[ d_{(i)}^T A d_{(j)} = 0. \]

Figure 22(a) shows what A-orthogonal vectors look like. Imagine if this article were printed on bubble gum, and you grabbed Figure 22(a) by the ends and stretched it until the ellipses appeared circular. The vectors would then appear orthogonal, as in Figure 22(b).

Figure 22: These pairs of vectors are A-orthogonal ... because these pairs of vectors are orthogonal.

Our new requirement is that e_(i+1) be A-orthogonal to d_(i) (see Figure 23(a)). Not coincidentally, this orthogonality condition is equivalent to finding the minimum point along the search direction d_(i), as in


Steepest Descent. To see this, set the directional derivative to zero:

\[
\begin{aligned}
\frac{d}{d\alpha} f(x_{(i+1)}) &= 0 \\
f'(x_{(i+1)})^T \frac{d}{d\alpha} x_{(i+1)} &= 0 \\
-r_{(i+1)}^T d_{(i)} &= 0 \\
d_{(i)}^T A e_{(i+1)} &= 0.
\end{aligned} \tag{31}
\]

Following the derivation of Equation 30, here is the expression for α_(i) when the search directions are A-orthogonal:

\[ \alpha_{(i)} = -\frac{d_{(i)}^T A e_{(i)}}{d_{(i)}^T A d_{(i)}} = \frac{d_{(i)}^T r_{(i)}}{d_{(i)}^T A d_{(i)}}. \tag{32} \]

Unlike Equation 30, we can calculate this expression. Note that if the search vector were the residual, this formula would be identical to the formula used by Steepest Descent.
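A single Conjugate Directions step is easy to test numerically. The sketch below (Python; my own illustration, with an arbitrarily chosen starting point and a coordinate axis as the first search direction) takes one step with Equations 29 and 32 and confirms that the new error term is A-orthogonal to the search direction:

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x_star = np.linalg.solve(A, b)

x = np.array([-2.0, 2.0])                # arbitrary starting point
d = np.array([1.0, 0.0])                 # first search direction (a coordinate axis)
r = b - A @ x
alpha = (d @ r) / (d @ A @ d)            # Equation 32
x = x + alpha * d                        # Equation 29

e = x - x_star
print(d @ A @ e)    # ~0: the new error is A-orthogonal to the search direction
```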


Figure 23: The method of Conjugate Directions converges in n steps. (a) The first step is taken along some direction d_(0). The minimum point x_(1) is chosen by the constraint that e_(1) must be A-orthogonal to d_(0). (b) The initial error e_(0) can be expressed as a sum of A-orthogonal components (gray arrows). Each step of Conjugate Directions eliminates one of these components.

To prove that this procedure really does compute x in n steps, express the error term as a linear combination of search directions; namely,

\[ e_{(0)} = \sum_{j=0}^{n-1} \delta_j d_{(j)}. \tag{33} \]

The values of δ_j can be found by a mathematical trick. Because the search directions are A-orthogonal, it is possible to eliminate all the δ_j values but one from Expression 33 by premultiplying the expression by d_{(k)}^T A:

\[
\begin{aligned}
d_{(k)}^T A e_{(0)} &= \sum_j \delta_j\, d_{(k)}^T A d_{(j)} \\
&= \delta_k\, d_{(k)}^T A d_{(k)} \qquad \text{(by A-orthogonality of } d \text{ vectors)} \\
\delta_k &= \frac{d_{(k)}^T A e_{(0)}}{d_{(k)}^T A d_{(k)}} \\
&= \frac{d_{(k)}^T A \bigl(e_{(0)} + \sum_{i=0}^{k-1} \alpha_{(i)} d_{(i)}\bigr)}{d_{(k)}^T A d_{(k)}} \qquad \text{(by A-orthogonality of } d \text{ vectors)} \\
&= \frac{d_{(k)}^T A e_{(k)}}{d_{(k)}^T A d_{(k)}} \qquad \text{(by Equation 29).}
\end{aligned} \tag{34}
\]

By Equations 31 and 34, we find that α_(k) = -δ_k. This fact gives us a new way to look at the error term. As the following equation shows, the process of building up x component by component can also be


viewed as a process of cutting down the error term component by component (see Figure 23(b)):

\[ e_{(i)} = e_{(0)} + \sum_{j=0}^{i-1} \alpha_{(j)} d_{(j)}
 = \sum_{j=0}^{n-1} \delta_j d_{(j)} - \sum_{j=0}^{i-1} \delta_j d_{(j)}
 = \sum_{j=i}^{n-1} \delta_j d_{(j)}. \tag{35} \]

After n iterations, every component is cut away, and e_(n) = 0; the proof is complete.

7.2. Gram-Schmidt Conjugation

All that is needed now is a set of A-orthogonal search directions {d_(i)}. Fortunately, there is a simple way to generate them, called a conjugate Gram-Schmidt process.

Suppose we have a set of n linearly independent vectors u_0, u_1, ..., u_{n-1}. The coordinate axes will do in a pinch, although more intelligent choices are possible. To construct d_(i), take u_i and subtract out any components that are not A-orthogonal to the previous d vectors (see Figure 24). In other words, set d_(0) = u_0, and for i > 0, set

\[ d_{(i)} = u_i + \sum_{k=0}^{i-1} \beta_{ik} d_{(k)}, \tag{36} \]

where the β_{ik} are defined for i > k. To find their values, use the same trick used to find δ_j: for i > j,

\[
\begin{aligned}
0 = d_{(i)}^T A d_{(j)} &= u_i^T A d_{(j)} + \sum_{k=0}^{i-1} \beta_{ik}\, d_{(k)}^T A d_{(j)} \\
&= u_i^T A d_{(j)} + \beta_{ij}\, d_{(j)}^T A d_{(j)} \qquad \text{(by A-orthogonality of } d \text{ vectors)} \\
\beta_{ij} &= -\frac{u_i^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}.
\end{aligned} \tag{37}
\]

Figure 24: Gram-Schmidt conjugation of two vectors. Begin with two linearly independent vectors u_0 and u_1. Set d_(0) = u_0. The vector u_1 is composed of two components: u*, which is A-orthogonal (or conjugate) to d_(0), and u+, which is parallel to d_(0). After conjugation, only the A-orthogonal portion remains, and d_(1) = u*.

The difficulty with using Gram-Schmidt conjugation in the method of Conjugate Directions is that all the old search vectors must be kept in memory to construct each new one, and furthermore O(n^3) operations


are required to generate the full set. In fact, if the search vectors are constructed by conjugation of the axial unit vectors, Conjugate Directions becomes equivalent to performing Gaussian elimination (see Figure 25). As a result, the method of Conjugate Directions enjoyed little use until the discovery of CG (which is a method of Conjugate Directions) cured these disadvantages.

Figure 25: The method of Conjugate Directions using the axial unit vectors, also known as Gaussian elimination.

An important key to understanding the method of Conjugate Directions (and also CG) is to notice that Figure 25 is just a stretched copy of Figure 21! Remember that when one is performing the method of Conjugate Directions (including CG), one is simultaneously performing the method of Orthogonal Directions in a stretched (scaled) space.

7.3. Optimality of the Error Term

Conjugate Directions has an interesting property: it finds at every step the best solution within the bounds of where it's been allowed to explore. Where has it been allowed to explore? Let D_i be the i-dimensional subspace span{d_(0), d_(1), ..., d_(i-1)}; the value e_(i) is chosen from e_(0) + D_i. What do I mean by "best solution"? I mean that Conjugate Directions chooses the value from e_(0) + D_i that minimizes \|e_{(i)}\|_A (see Figure 26). In fact, some authors derive CG by trying to minimize \|e_{(i)}\|_A within e_(0) + D_i.

In the same way that the error term can be expressed as a linear combination of search directions


    \|e_{(i)}\|_A^2 \;=\; \sum_{j=i}^{n-1} \sum_{k=i}^{n-1} \delta_j \delta_k\, d_{(j)}^T A d_{(k)}  \qquad\text{(by Equation 35)}
                    \;=\; \sum_{j=i}^{n-1} \delta_j^2\, d_{(j)}^T A d_{(j)}  \qquad\text{(by A-orthogonality of the d vectors).}

Each term in this summation is associated with a search direction that has not yet been traversed. Any other vector e chosen from e_{(0)} + D_i must have these same terms in its expansion, which proves that e_{(i)} must have the minimum energy norm.

Figure 26: In this figure, the shaded area is e_{(0)} + D_2 = e_{(0)} + span\{d_{(0)}, d_{(1)}\}. The ellipsoid is a contour on which the energy norm is constant. After two steps, Conjugate Directions finds e_{(2)}, the point on e_{(0)} + D_2 that minimizes \|e\|_A.

Having proven the optimality with equations, let's turn to the intuition. Perhaps the best way to visualize the workings of Conjugate Directions is by comparing the space we are working in with a "stretched" space, as in Figure 22. Figures 27(a) and 27(c) demonstrate the behavior of Conjugate Directions in R^2 and R^3; lines that appear perpendicular in these illustrations are orthogonal. On the other hand, Figures 27(b) and 27(d) show the same drawings in spaces that are stretched (along the eigenvector axes) so that the ellipsoidal contour lines become spherical. Lines that appear perpendicular in these illustrations are A-orthogonal.

In Figure 27(a), the Method of Conjugate Directions begins at x_{(0)}, takes a step in the direction of d_{(0)}, and stops at the point x_{(1)}, where the error vector e_{(1)} is A-orthogonal to d_{(0)}. Why should we expect this to be the minimum point on x_{(0)} + D_1? The answer is found in Figure 27(b): in this stretched space, e_{(1)} appears perpendicular to d_{(0)} because they are A-orthogonal. The error vector e_{(1)} is a radius of the concentric circles that represent contours on which \|e\|_A is constant, so x_{(0)} + D_1 must be tangent at x_{(1)} to the circle that x_{(1)} lies on. Hence, x_{(1)} is the point on x_{(0)} + D_1 that minimizes \|e_{(1)}\|_A.

This is not surprising; we have already seen in Section 7.1 that A-conjugacy of the search direction and the error term is equivalent to minimizing f (and therefore \|e\|_A) along the search direction. However, after Conjugate Directions takes a second step, minimizing \|e\|_A along a second search direction d_{(1)}, why should we expect that \|e\|_A will still be minimized in the direction of d_{(0)}? After taking i steps, why will f(x_{(i)}) be minimized over all of x_{(0)} + D_i?


Figure 27: Optimality of the Method of Conjugate Directions. (a) A two-dimensional problem. Lines that appear perpendicular are orthogonal. (b) The same problem in a "stretched" space. Lines that appear perpendicular are A-orthogonal. (c) A three-dimensional problem. Two concentric ellipsoids are shown; x is at the center of both. The line x_{(0)} + D_1 is tangent to the outer ellipsoid at x_{(1)}. The plane x_{(0)} + D_2 is tangent to the inner ellipsoid at x_{(2)}. (d) Stretched view of the three-dimensional problem.


In Figure 27(b), x_{(1)} and d_{(1)} appear perpendicular because they are A-orthogonal. It is clear that d_{(1)} must point to the solution x, because d_{(1)} is tangent at x_{(1)} to a circle whose center is x. However, a three-dimensional example is more revealing. Figures 27(c) and 27(d) each show two concentric ellipsoids. The point x_{(1)} lies on the outer ellipsoid, and x_{(2)} lies on the inner ellipsoid. Look carefully at these figures: the plane x_{(0)} + D_2 slices through the larger ellipsoid, and is tangent to the smaller ellipsoid at x_{(2)}. The point x is at the center of the ellipsoids, underneath the plane.

Looking at Figure 27(c), we can rephrase our question. Suppose you and I are standing at x_{(1)}, and want to walk to the point that minimizes \|e\|_A on x_{(0)} + D_2; but we are constrained to walk along the search direction d_{(1)}. If d_{(1)} points to the minimum point, we will succeed. Is there any reason to expect that d_{(1)} will point the right way?

Figure 27(d) supplies an answer. Because d_{(1)} is A-orthogonal to d_{(0)}, they are perpendicular in this diagram. Now, suppose you were staring down the d_{(0)} axis at the plane x_{(0)} + D_2 as if it were a sheet of paper; the sight you'd see would be identical to Figure 27(b). The point x_{(2)} would be at the center of the paper, and the point x would lie underneath the paper, directly under the point x_{(2)}. Because d_{(0)} and d_{(1)} are perpendicular, d_{(1)} points directly to x_{(2)}, which is the point in x_{(0)} + D_2 closest to x. The plane x_{(0)} + D_2 is tangent to the sphere on which x_{(2)} lies. If you took a third step, it would be straight down from x_{(2)} to x, in a direction A-orthogonal to D_2.

Another way to understand what is happening in Figure 27(d) is to imagine yourself standing at the solution point x, pulling a string connected to a bead that is constrained to lie in x_{(0)} + D_i. Each time the expanding subspace D_i is enlarged by a dimension, the bead becomes free to move a little closer to you. Now if you stretch the space so it looks like Figure 27(c), you have the Method of Conjugate Directions.

Another important property of Conjugate Directions is visible in these illustrations. We have seen that, at each step, the hyperplane x_{(0)} + D_i is tangent to the ellipsoid on which x_{(i)} lies. Recall from Section 4 that the residual at any point is orthogonal to the ellipsoidal surface at that point. It follows that r_{(i)} is orthogonal to D_i as well.
To show this fact mathematically, premultiply Equation 35 by -d_{(i)}^T A:

    -d_{(i)}^T A e_{(j)} = -\sum_{k=j}^{n-1} \delta_k\, d_{(i)}^T A d_{(k)},        (38)
    d_{(i)}^T r_{(j)} = 0, \qquad i < j \qquad\text{(by A-orthogonality of the d vectors).}        (39)

We could have derived this identity by another tack. Recall that once we take a step in a search direction, we need never step in that direction again; the error term is evermore A-orthogonal to all the old search directions. Because r_{(i)} = -A e_{(i)}, the residual is evermore orthogonal to all the old search directions.

Because the search directions are constructed from the u vectors, the subspace spanned by u_0, \ldots, u_{i-1} is D_i, and the residual r_{(i)} is orthogonal to these previous u vectors as well (see Figure 28). This is proven by taking the inner product of Equation 36 and r_{(j)}:

    d_{(i)}^T r_{(j)} = u_i^T r_{(j)} + \sum_{k=0}^{i-1} \beta_{ik}\, d_{(k)}^T r_{(j)},        (40)
    0 = u_i^T r_{(j)}, \qquad i < j \qquad\text{(by Equation 39).}        (41)

There is one more identity we will use later. From Equation 40 (and Figure 28),

    d_{(i)}^T r_{(i)} = u_i^T r_{(i)}.        (42)


Figure 28: Because the search directions d_{(0)}, d_{(1)} are constructed from the vectors u_0, u_1, they span the same subspace D_2 (the gray-colored plane). The error term e_{(2)} is A-orthogonal to D_2, the residual r_{(2)} is orthogonal to D_2, and a new search direction d_{(2)} is constructed (from u_2) to be A-orthogonal to D_2. The endpoints of u_2 and d_{(2)} lie on a plane parallel to D_2, because d_{(2)} is constructed from u_2 by Gram-Schmidt conjugation.

To conclude this section, I note that as with the method of Steepest Descent, the number of matrix-vector products per iteration can be reduced to one by using a recurrence to find the residual:

    r_{(i+1)} = -A e_{(i+1)} = -A (e_{(i)} + \alpha_{(i)} d_{(i)}) = r_{(i)} - \alpha_{(i)} A d_{(i)}.        (43)

8. The Method of Conjugate Gradients

It may seem odd that an article about CG doesn't describe CG until page 30, but all the machinery is now in place. In fact, CG is simply the method of Conjugate Directions where the search directions are constructed by conjugation of the residuals (that is, by setting u_i = r_{(i)}).

This choice makes sense for many reasons. First, the residuals worked for Steepest Descent, so why not for Conjugate Directions? Second, the residual has the nice property that it's orthogonal to the previous search directions (Equation 39), so it's guaranteed always to produce a new, linearly independent search direction unless the residual is zero, in which case the problem is already solved. As we shall see, there is an even better reason to choose the residual.

Let's consider the implications of this choice. Because the search vectors are built from the residuals, the subspace span\{r_{(0)}, r_{(1)}, \ldots, r_{(i-1)}\} is equal to D_i. As each residual is orthogonal to the previous search directions, it is also orthogonal to the previous residuals (see Figure 29); Equation 41 becomes

    r_{(i)}^T r_{(j)} = 0, \qquad i \ne j.        (44)

Interestingly, Equation 43 shows that each new residual r_{(i+1)} is just a linear combination of the previous residual and A d_{(i)}. Recalling that d_{(i)} \in D_{i+1}, this fact implies that each new subspace D_{i+1} is formed from the union of the previous subspace D_i and the subspace A D_i. Hence,

    D_i = span\{d_{(0)}, A d_{(0)}, A^2 d_{(0)}, \ldots, A^{i-1} d_{(0)}\}
        = span\{r_{(0)}, A r_{(0)}, A^2 r_{(0)}, \ldots, A^{i-1} r_{(0)}\}.


Figure 29: In the method of Conjugate Gradients, each new residual is orthogonal to all the previous residuals and search directions; and each new search direction is constructed (from the residual) to be A-orthogonal to all the previous residuals and search directions. The endpoints of r_{(2)} and d_{(2)} lie on a plane parallel to D_2 (the shaded subspace). In CG, d_{(2)} is a linear combination of r_{(2)} and d_{(1)}.

This subspace is called a Krylov subspace, a subspace created by repeatedly applying a matrix to a vector. It has a pleasing property: because A D_i is included in D_{i+1}, the fact that the next residual r_{(i+1)} is orthogonal to D_{i+1} (Equation 39) implies that r_{(i+1)} is A-orthogonal to D_i. Gram-Schmidt conjugation becomes easy, because r_{(i+1)} is already A-orthogonal to all of the previous search directions except d_{(i)}!

Recall from Equation 37 that the Gram-Schmidt constants are \beta_{ij} = -r_{(i)}^T A d_{(j)} / (d_{(j)}^T A d_{(j)}); let us simplify this expression. Taking the inner product of r_{(i)} and Equation 43,

    r_{(i)}^T r_{(j+1)} = r_{(i)}^T r_{(j)} - \alpha_{(j)}\, r_{(i)}^T A d_{(j)},
    \alpha_{(j)}\, r_{(i)}^T A d_{(j)} = r_{(i)}^T r_{(j)} - r_{(i)}^T r_{(j+1)},

    r_{(i)}^T A d_{(j)} = \begin{cases} \dfrac{1}{\alpha_{(i)}}\, r_{(i)}^T r_{(i)}, & i = j, \\[4pt] -\dfrac{1}{\alpha_{(i-1)}}\, r_{(i)}^T r_{(i)}, & i = j + 1, \\[4pt] 0, & \text{otherwise} \end{cases} \qquad\text{(by Equation 44).}

Therefore,

    \beta_{ij} = \begin{cases} \dfrac{r_{(i)}^T r_{(i)}}{\alpha_{(i-1)}\, d_{(i-1)}^T A d_{(i-1)}}, & i = j + 1, \\[4pt] 0, & i > j + 1 \end{cases} \qquad\text{(by Equation 37).}

As if by magic, most of the \beta_{ij} terms have disappeared. It is no longer necessary to store old search vectors to ensure the A-orthogonality of new search vectors. This major advance is what makes CG as important an algorithm as it is, because both the space complexity and time complexity per iteration are reduced from O(n^2) to O(m), where m is the number of nonzero entries of A. Henceforth, I shall use the abbreviation \beta_{(i)} = \beta_{i,i-1}. Simplifying further:

    \beta_{(i)} = \frac{r_{(i)}^T r_{(i)}}{\alpha_{(i-1)}\, d_{(i-1)}^T A d_{(i-1)}}
               = \frac{r_{(i)}^T r_{(i)}}{d_{(i-1)}^T r_{(i-1)}} \qquad\text{(by Equation 32)}
               = \frac{r_{(i)}^T r_{(i)}}{r_{(i-1)}^T r_{(i-1)}} \qquad\text{(by Equation 42).}


Figure 30: The method of Conjugate Gradients.

Let's put it all together into one piece now. The method of Conjugate Gradients is:

    d_{(0)} = r_{(0)} = b - A x_{(0)},        (45)
    \alpha_{(i)} = \frac{r_{(i)}^T r_{(i)}}{d_{(i)}^T A d_{(i)}} \qquad\text{(by Equations 32 and 42)},        (46)
    x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)},
    r_{(i+1)} = r_{(i)} - \alpha_{(i)} A d_{(i)},        (47)
    \beta_{(i+1)} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}},        (48)
    d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)} d_{(i)}.        (49)

The performance of CG on our sample problem is demonstrated in Figure 30. The name "Conjugate Gradients" is a bit of a misnomer, because the gradients are not conjugate, and the conjugate directions are not all gradients. "Conjugated Gradients" would be more accurate.
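These updates translate almost line for line into code. Below is a minimal, unpreconditioned sketch in NumPy — an illustration of mine, not the "canned" version of Appendix B2, which adds a stopping test and periodic recomputation of the exact residual.

    import numpy as np

    def conjugate_gradients(A, b, x, n_iters):
        """Plain CG, following Equations 45-49 (fixed iteration count)."""
        r = b - A @ x                      # Equation 45
        d = r.copy()
        delta = r @ r
        for _ in range(n_iters):
            q = A @ d
            alpha = delta / (d @ q)        # Equation 46
            x = x + alpha * d              # step to the 1-D minimizer along d
            r = r - alpha * q              # Equation 47 (residual recurrence)
            delta_new = r @ r
            beta = delta_new / delta       # Equation 48
            d = r + beta * d               # Equation 49
            delta = delta_new
        return x

    # An arbitrary 2x2 symmetric positive-definite test system.
    A = np.array([[3.0, 2.0], [2.0, 6.0]])
    b = np.array([2.0, -8.0])
    print(conjugate_gradients(A, b, np.zeros(2), 2))   # exact after n = 2 steps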


9. Convergence Analysis of Conjugate Gradients

CG is complete after n iterations, so why should we care about convergence analysis? In practice, accumulated floating point roundoff error causes the residual to gradually lose accuracy, and cancellation error causes the search vectors to lose A-orthogonality. The former problem could be dealt with as it was for Steepest Descent, but the latter problem is not easily curable. Because of this loss of conjugacy, the mathematical community discarded CG during the 1960s, and interest only resurged when evidence for its effectiveness as an iterative procedure was published in the seventies.

Times have changed, and so has our outlook. Today, convergence analysis is important because CG is commonly used for problems so large it is not feasible to run even n iterations. Convergence analysis is seen less as a ward against floating point error, and more as a proof that CG is useful for problems that are out of the reach of any exact algorithm.

The first iteration of CG is identical to the first iteration of Steepest Descent, so without changes, Section 6.1 describes the conditions under which CG converges on the first iteration.

9.1. Picking Perfect Polynomials

We have seen that at each step of CG, the value e_{(i)} is chosen from e_{(0)} + D_i, where

    D_i = span\{r_{(0)}, A r_{(0)}, A^2 r_{(0)}, \ldots, A^{i-1} r_{(0)}\}.

Krylov subspaces such as this have another pleasing property. For a fixed i, the error term has the form

    e_{(i)} = \Bigl(I + \sum_{j=1}^{i} \psi_j A^j\Bigr) e_{(0)}.

The coefficients \psi_j are related to the values \alpha_{(i)} and \beta_{(i)}, but the precise relationship is not important here. What is important is the proof in Section 7.3 that CG chooses the \psi_j coefficients that minimize \|e_{(i)}\|_A.

The expression in parentheses above can be expressed as a polynomial. Let P_i(\lambda) be a polynomial of degree i. P_i can take either a scalar or a matrix as its argument, and will evaluate to the same; for instance, if P_2(\lambda) = 2\lambda^2 + 1, then P_2(A) = 2A^2 + I. This flexible notation comes in handy for eigenvectors, for which you should convince yourself that P_i(A) v = P_i(\lambda) v. (Note that A v = \lambda v, A^2 v = \lambda^2 v, and so on.)

Now we can express the error term as

    e_{(i)} = P_i(A)\, e_{(0)},

if we require that P_i(0) = 1. CG chooses this polynomial when it chooses the \psi_j coefficients. Let's examine the effect of applying this polynomial to e_{(0)}. As in the analysis of Steepest Descent, express e_{(0)} as a linear combination of orthogonal unit eigenvectors,

    e_{(0)} = \sum_{j=1}^{n} \xi_j v_j,

and we find that

    e_{(i)} = \sum_j \xi_j\, P_i(\lambda_j)\, v_j, \qquad
    A e_{(i)} = \sum_j \xi_j\, P_i(\lambda_j)\, \lambda_j v_j, \qquad
    \|e_{(i)}\|_A^2 = \sum_j \xi_j^2\, [P_i(\lambda_j)]^2\, \lambda_j.


Figure 31: The convergence of CG after i iterations depends on how close a polynomial P_i of degree i can be to zero on each eigenvalue, given the constraint that P_i(0) = 1.

CG finds the polynomial that minimizes this expression, but convergence is only as good as the convergence of the worst eigenvector. Letting \Lambda(A) be the set of eigenvalues of A, we have

    \|e_{(i)}\|_A^2 \;\le\; \min_{P_i} \max_{\lambda \in \Lambda(A)} [P_i(\lambda)]^2 \sum_j \xi_j^2 \lambda_j
                    \;=\; \min_{P_i} \max_{\lambda \in \Lambda(A)} [P_i(\lambda)]^2\; \|e_{(0)}\|_A^2.        (50)

Figure 31 illustrates, for several values of i, the P_i that minimizes this expression for our sample problem with eigenvalues 2 and 7. There is only one polynomial of degree zero that satisfies P_0(0) = 1, and that is P_0(\lambda) = 1, graphed in Figure 31(a). The optimal polynomial of degree one is P_1(\lambda) = 1 - 2\lambda/9, graphed in Figure 31(b). Note that P_1(2) = 5/9 and P_1(7) = -5/9, and so the energy norm of the error term after one iteration of CG is no greater than 5/9 its initial value. Figure 31(c) shows that, after two iterations, Equation 50 evaluates to zero. This is because a polynomial of degree two can be fit to three points (P_2(0) = 1, P_2(2) = 0, and P_2(7) = 0). In general, a polynomial of degree i can fit i + 1 points, and thereby accommodate i separate eigenvalues.

The foregoing discussion reinforces our understanding that CG yields the exact result after n iterations; and furthermore proves that CG is quicker if there are duplicated eigenvalues. Given infinite floating point precision, the number of iterations required to compute an exact solution is at most the number of distinct eigenvalues. (There is one other possibility for early termination: e_{(0)} may already be A-orthogonal to some of the eigenvectors of A. If eigenvectors are missing from the expansion of e_{(0)}, their eigenvalues may be omitted from consideration in Equation 50. Be forewarned, however, that these eigenvectors may be reintroduced by floating point roundoff error.)


Figure 32: Chebyshev polynomials of degree 2, 5, 10, and 49.

We also find that CG converges more quickly when eigenvalues are clustered together (Figure 31(d)) than when they are irregularly distributed between \lambda_{min} and \lambda_{max}, because it is easier for CG to choose a polynomial that makes Equation 50 small.

If we know something about the characteristics of the eigenvalues of A, it is sometimes possible to suggest a polynomial that leads to a proof of a fast convergence. For the remainder of this analysis, however, I shall assume the most general case: the eigenvalues are evenly distributed between \lambda_{min} and \lambda_{max}, the number of distinct eigenvalues is large, and floating point roundoff occurs.

9.2. Chebyshev Polynomials

A useful approach is to minimize Equation 50 over the range [\lambda_{min}, \lambda_{max}] rather than at a finite number of points. The polynomials that accomplish this are based on Chebyshev polynomials.

The Chebyshev polynomial of degree i is

    T_i(\omega) = \frac{1}{2}\Bigl[ \bigl(\omega + \sqrt{\omega^2 - 1}\bigr)^i + \bigl(\omega - \sqrt{\omega^2 - 1}\bigr)^i \Bigr].

(If this expression doesn't look like a polynomial to you, try working it out for i equal to 1 or 2.) Several Chebyshev polynomials are illustrated in Figure 32. The Chebyshev polynomials have the property that |T_i(\omega)| \le 1 (in fact, they oscillate between 1 and -1) on the domain \omega \in [-1, 1], and furthermore that |T_i(\omega)| is maximum on the domain \omega \notin [-1, 1] among all such polynomials. Loosely speaking, |T_i(\omega)| increases as quickly as possible outside the boxes in the illustration.


Figure 33: The polynomial P_2(\lambda) that minimizes Equation 50 for \lambda_{min} = 2 and \lambda_{max} = 7 in the general case. This curve is a scaled version of the Chebyshev polynomial of degree 2. The energy norm of the error term after two iterations is no greater than 0.183 times its initial value. Compare with Figure 31(c), where it is known that there are only two eigenvalues.

It is shown in Appendix C3 that Equation 50 is minimized by choosing

    P_i(\lambda) = \frac{T_i\!\left(\dfrac{\lambda_{max} + \lambda_{min} - 2\lambda}{\lambda_{max} - \lambda_{min}}\right)}{T_i\!\left(\dfrac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}}\right)}

(see Figure 33). The denominator enforces our requirement that P_i(0) = 1, and the resulting polynomial has the oscillating properties of Chebyshev polynomials on the domain \lambda_{min} \le \lambda \le \lambda_{max}. The numerator has a maximum value of one on the interval between \lambda_{min} and \lambda_{max}, so from Equation 50 we have

    \|e_{(i)}\|_A \;\le\; T_i\!\left(\frac{\lambda_{max} + \lambda_{min}}{\lambda_{max} - \lambda_{min}}\right)^{-1} \|e_{(0)}\|_A
                 \;=\; T_i\!\left(\frac{\kappa + 1}{\kappa - 1}\right)^{-1} \|e_{(0)}\|_A
                 \;=\; 2\left[ \left(\frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}\right)^{i} + \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{i} \right]^{-1} \|e_{(0)}\|_A.        (51)

The second addend inside the square brackets converges to zero as i grows, so it is more common to express the convergence of CG with the weaker inequality

    \|e_{(i)}\|_A \;\le\; 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{i} \|e_{(0)}\|_A.        (52)
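As a quick numerical illustration of Equation 52 (a small script of mine, not part of the original analysis), the bound on \|e_{(i)}\|_A / \|e_{(0)}\|_A can be tabulated for a chosen condition number:

    import math

    def cg_error_bound(kappa, i):
        """Upper bound on ||e_(i)||_A / ||e_(0)||_A from Equation 52."""
        w = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
        return 2.0 * w ** i

    for i in (1, 5, 10, 20):
        print(i, cg_error_bound(100.0, i))   # kappa = 100, chosen arbitrarily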


Figure 34: Convergence of Conjugate Gradients (per iteration) as a function of condition number. Compare with Figure 20.

The first step of CG is identical to a step of Steepest Descent. Setting i = 1 in Equation 51, we obtain Equation 28, the convergence result for Steepest Descent. This is just the linear polynomial case illustrated in Figure 31(b).

Figure 34 charts the convergence per iteration of CG, ignoring the lost factor of 2. In practice, CG usually converges faster than Equation 52 would suggest, because of good eigenvalue distributions or good starting points. Comparing Equations 52 and 28, it is clear that the convergence of CG is much quicker than that of Steepest Descent (see Figure 35). However, it is not necessarily true that every iteration of CG enjoys faster convergence; for example, the first iteration of CG is an iteration of Steepest Descent. The factor of 2 in Equation 52 allows CG a little slack for these poor iterations.

10. Complexity

The dominating operations during an iteration of either Steepest Descent or CG are matrix-vector products. In general, matrix-vector multiplication requires O(m) operations, where m is the number of non-zero entries in the matrix. For many problems, including those listed in the introduction, A is sparse and m \in O(n).

Suppose we wish to perform enough iterations to reduce the norm of the error by a factor of \varepsilon; that is, \|e_{(i)}\| \le \varepsilon \|e_{(0)}\|. Equation 28 can be used to show that the maximum number of iterations required to achieve this bound using Steepest Descent is

    i \le \left\lceil \tfrac{1}{2}\,\kappa\, \ln\!\left(\tfrac{1}{\varepsilon}\right) \right\rceil,

whereas Equation 52 suggests that the maximum number of iterations CG requires is

    i \le \left\lceil \tfrac{1}{2}\,\sqrt{\kappa}\, \ln\!\left(\tfrac{2}{\varepsilon}\right) \right\rceil.
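A tiny script (mine, with arbitrary values of \kappa and \varepsilon) makes the gap between these two bounds concrete:

    import math

    def sd_max_iters(kappa, eps):
        """Iteration bound for Steepest Descent: ceil(kappa/2 * ln(1/eps))."""
        return math.ceil(0.5 * kappa * math.log(1.0 / eps))

    def cg_max_iters(kappa, eps):
        """Iteration bound for CG: ceil(sqrt(kappa)/2 * ln(2/eps))."""
        return math.ceil(0.5 * math.sqrt(kappa) * math.log(2.0 / eps))

    kappa, eps = 1.0e4, 1.0e-6
    print(sd_max_iters(kappa, eps), cg_max_iters(kappa, eps))   # 69078 vs 726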


Figure 35: Number of iterations of Steepest Descent required to match one iteration of CG.

I conclude that Steepest Descent has a time complexity of O(m\kappa), whereas CG has a time complexity of O(m\sqrt{\kappa}). Both algorithms have a space complexity of O(m).

Finite difference and finite element approximations of second-order elliptic boundary value problems posed on d-dimensional domains often have \kappa \in O(n^{2/d}). Thus, Steepest Descent has a time complexity of O(n^2) for two-dimensional problems, versus O(n^{3/2}) for CG; and Steepest Descent has a time complexity of O(n^{5/3}) for three-dimensional problems, versus O(n^{4/3}) for CG.

11. Starting and Stopping

In the preceding presentation of the Steepest Descent and Conjugate Gradient algorithms, several details have been omitted; particularly, how to choose a starting point, and when to stop.

11.1. Starting

There's not much to say about starting. If you have a rough estimate of the value of x, use it as the starting value x_{(0)}. If not, set x_{(0)} = 0; either Steepest Descent or CG will eventually converge when used to solve linear systems. Nonlinear minimization (coming up in Section 14) is trickier, though, because there may be several local minima, and the choice of starting point will determine which minimum the procedure converges to, or whether it will converge at all.

11.2. Stopping

When Steepest Descent or CG reaches the minimum point, the residual becomes zero, and if Equation 11 or 48 is evaluated an iteration later, a division by zero will result. It seems, then, that one must stop immediately when the residual is zero.


To complicate things, though, accumulated roundoff error in the recursive formulation of the residual (Equation 47) may yield a false zero residual; this problem could be resolved by restarting with Equation 45.

Usually, however, one wishes to stop before convergence is complete. Because the error term is not available, it is customary to stop when the norm of the residual falls below a specified value; often, this value is some small fraction of the initial residual (\|r_{(i)}\| < \varepsilon \|r_{(0)}\|). See Appendix B for sample code.

12. Preconditioning

Preconditioning is a technique for improving the condition number of a matrix. Suppose that M is a symmetric, positive-definite matrix that approximates A, but is easier to invert. We can solve Ax = b indirectly by solving

    M^{-1} A x = M^{-1} b.        (53)

If \kappa(M^{-1}A) \ll \kappa(A), or if the eigenvalues of M^{-1}A are better clustered than those of A, we can iteratively solve Equation 53 more quickly than the original problem. The catch is that M^{-1}A is not generally symmetric nor definite, even if M and A are.

We can circumvent this difficulty, because for every symmetric, positive-definite M there is a (not necessarily unique) matrix E that has the property that E E^T = M. (Such an E can be obtained, for instance, by Cholesky factorization.) The matrices M^{-1}A and E^{-1}A E^{-T} have the same eigenvalues. This is true because if v is an eigenvector of M^{-1}A with eigenvalue \lambda, then E^T v is an eigenvector of E^{-1}A E^{-T} with eigenvalue \lambda:

    (E^{-1} A E^{-T})(E^T v) = E^{-1} A v = E^{-1} (\lambda M v) = \lambda E^{-1} E E^T v = \lambda E^T v.

The system Ax = b can be transformed into the problem

    E^{-1} A E^{-T} \hat{x} = E^{-1} b, \qquad \hat{x} = E^T x,

which we solve first for \hat{x}, then for x. Because E^{-1} A E^{-T} is symmetric and positive-definite, \hat{x} can be found by Steepest Descent or CG. The process of using CG to solve this system is called the Transformed Preconditioned Conjugate Gradient Method:

    \hat{d}_{(0)} = \hat{r}_{(0)} = E^{-1} b - E^{-1} A E^{-T} \hat{x}_{(0)},
    \alpha_{(i)} = \frac{\hat{r}_{(i)}^T \hat{r}_{(i)}}{\hat{d}_{(i)}^T E^{-1} A E^{-T} \hat{d}_{(i)}},
    \hat{x}_{(i+1)} = \hat{x}_{(i)} + \alpha_{(i)} \hat{d}_{(i)},
    \hat{r}_{(i+1)} = \hat{r}_{(i)} - \alpha_{(i)} E^{-1} A E^{-T} \hat{d}_{(i)},
    \beta_{(i+1)} = \frac{\hat{r}_{(i+1)}^T \hat{r}_{(i+1)}}{\hat{r}_{(i)}^T \hat{r}_{(i)}},
    \hat{d}_{(i+1)} = \hat{r}_{(i+1)} + \beta_{(i+1)} \hat{d}_{(i)}.


Figure 36: Contour lines of the quadratic form of the diagonally preconditioned sample problem.

This method has the undesirable characteristic that E must be computed. However, a few careful variable substitutions can eliminate E. Setting \hat{r}_{(i)} = E^{-1} r_{(i)} and \hat{d}_{(i)} = E^T d_{(i)}, and using the identities \hat{x}_{(i)} = E^T x_{(i)} and E^{-T} E^{-1} = M^{-1}, we derive the Untransformed Preconditioned Conjugate Gradient Method:

    r_{(0)} = b - A x_{(0)},
    d_{(0)} = M^{-1} r_{(0)},
    \alpha_{(i)} = \frac{r_{(i)}^T M^{-1} r_{(i)}}{d_{(i)}^T A d_{(i)}},
    x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)},
    r_{(i+1)} = r_{(i)} - \alpha_{(i)} A d_{(i)},
    \beta_{(i+1)} = \frac{r_{(i+1)}^T M^{-1} r_{(i+1)}}{r_{(i)}^T M^{-1} r_{(i)}},
    d_{(i+1)} = M^{-1} r_{(i+1)} + \beta_{(i+1)} d_{(i)}.

The matrix E does not appear in these equations; only M^{-1} is needed. By the same means, it is possible to derive a Preconditioned Steepest Descent Method that does not use E.

The effectiveness of a preconditioner M is determined by the condition number of M^{-1}A, and occasionally by its clustering of eigenvalues. The problem remains of finding a preconditioner that approximates A well enough to improve convergence enough to make up for the cost of computing the product M^{-1} r_{(i)} once per iteration. (It is not necessary to explicitly compute M or M^{-1}; it is only necessary to be able to compute the effect of applying M^{-1} to a vector.) Within this constraint, there is a surprisingly rich supply of possibilities, and I can only scratch the surface here.
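To underline that only the action of M^{-1} on a vector is required, here is a minimal NumPy sketch of the untransformed method above, with the preconditioner passed as a callable; the Jacobi (diagonal) choice used in the demonstration is my own pick for illustration and is discussed below.

    import numpy as np

    def preconditioned_cg(A, b, x, apply_Minv, n_iters):
        """Untransformed preconditioned CG; apply_Minv(v) must return M^{-1} v."""
        r = b - A @ x
        d = apply_Minv(r)
        delta = r @ d                       # r^T M^{-1} r
        for _ in range(n_iters):
            q = A @ d
            alpha = delta / (d @ q)
            x = x + alpha * d
            r = r - alpha * q
            s = apply_Minv(r)
            delta_new = r @ s
            d = s + (delta_new / delta) * d
            delta = delta_new
        return x

    A = np.array([[3.0, 2.0], [2.0, 6.0]])
    b = np.array([2.0, -8.0])
    jacobi = lambda v: v / np.diag(A)       # Jacobi (diagonal) preconditioning
    print(preconditioned_cg(A, b, np.zeros(2), jacobi, 2))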


Intuitively, preconditioning is an attempt to stretch the quadratic form to make it appear more spherical, so that the eigenvalues are close to each other. A perfect preconditioner is M = A; for this preconditioner, M^{-1}A has a condition number of one, and the quadratic form is perfectly spherical, so solution takes only one iteration. Unfortunately, the preconditioning step is solving the system Mx = b, so this isn't a useful preconditioner at all.

The simplest preconditioner is a diagonal matrix whose diagonal entries are identical to those of A. The process of applying this preconditioner, known as diagonal preconditioning or Jacobi preconditioning, is equivalent to scaling the quadratic form along the coordinate axes. (By comparison, the perfect preconditioner M = A scales the quadratic form along its eigenvector axes.) A diagonal matrix is trivial to invert, but is often only a mediocre preconditioner. The contour lines of our sample problem are shown, after diagonal preconditioning, in Figure 36. Comparing with Figure 3, it is clear that some improvement has occurred. The condition number has improved from 3.5 to roughly 2.8. Of course, this improvement is much more beneficial for systems where n \gg 2.

A more elaborate preconditioner is incomplete Cholesky preconditioning. Cholesky factorization is a technique for factoring a matrix A into the form L L^T, where L is a lower triangular matrix. Incomplete Cholesky factorization is a variant in which little or no fill is allowed; A is approximated by the product \hat{L}\hat{L}^T, where \hat{L} might be restricted to have the same pattern of nonzero elements as A; other elements of L are thrown away. To use \hat{L}\hat{L}^T as a preconditioner, the solution to \hat{L}\hat{L}^T w = z is computed by backsubstitution (the inverse of \hat{L}\hat{L}^T is never explicitly computed). Unfortunately, incomplete Cholesky preconditioning is not always stable.

Many preconditioners, some quite sophisticated, have been developed. Whatever your choice, it is generally accepted that for large-scale applications, CG should nearly always be used with a preconditioner.

13. Conjugate Gradients on the Normal Equations

CG can be used to solve systems where A is not symmetric, not positive-definite, and even not square. A solution to the least squares problem

    \min_x \|A x - b\|^2        (54)

can be found by setting the derivative of Expression 54 to zero:

    A^T A x = A^T b.        (55)

If A is square and nonsingular, the solution to Equation 55 is the solution to Ax = b.
If A is not square, and Ax = b is overconstrained — that is, has more linearly independent equations than variables — then there may or may not be a solution to Ax = b, but it is always possible to find a value of x that minimizes Expression 54, the sum of the squares of the errors of each linear equation.

A^T A is symmetric and positive (for any x, x^T A^T A x = \|Ax\|^2 \ge 0). If Ax = b is not underconstrained, then A^T A is nonsingular, and methods like Steepest Descent and CG can be used to solve Equation 55. The only nuisance in doing so is that the condition number of A^T A is the square of that of A, so convergence is significantly slower.

An important technical point is that the matrix A^T A is never formed explicitly, because it is less sparse than A. Instead, A^T A is multiplied by d by first finding the product A d, and then A^T (A d). Also, numerical stability is improved if the value d^T A^T A d (in Equation 46) is computed by taking the inner product of A d with itself.
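A small sketch of mine makes the point explicit: each CG iteration on the normal equations applies A and then A^T to a vector, and the denominator of \alpha is just \|A d\|^2. The test system is an arbitrary overconstrained example.

    import numpy as np

    def cgnr(A, b, x, n_iters):
        """CG on the normal equations A^T A x = A^T b, without forming A^T A."""
        r = A.T @ (b - A @ x)              # residual of the normal equations
        d = r.copy()
        delta = r @ r
        for _ in range(n_iters):
            q = A @ d                      # first A d ...
            alpha = delta / (q @ q)        # ... so d^T A^T A d is ||A d||^2
            x = x + alpha * d
            r = r - alpha * (A.T @ q)      # ... then A^T (A d)
            delta_new = r @ r
            d = r + (delta_new / delta) * d
            delta = delta_new
        return x

    A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # overconstrained 3x2 system
    b = np.array([1.0, 2.0, 2.0])
    print(cgnr(A, b, np.zeros(2), 2))                    # least-squares solution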


14. The Nonlinear Conjugate Gradient Method

CG can be used not only to find the minimum point of a quadratic form, but to minimize any continuous function f(x) for which the gradient f'(x) can be computed. Applications include a variety of optimization problems, such as engineering design, neural net training, and nonlinear regression.

14.1. Outline of the Nonlinear Conjugate Gradient Method

To derive nonlinear CG, there are three changes to the linear algorithm: the recursive formula for the residual cannot be used, it becomes more complicated to compute the step size \alpha, and there are several different choices for \beta.

In nonlinear CG, the residual is always set to the negation of the gradient; r_{(i)} = -f'(x_{(i)}). The search directions are computed by Gram-Schmidt conjugation of the residuals as with linear CG. Performing a line search along this search direction is much more difficult than in the linear case, and a variety of procedures can be used. As with linear CG, a value of \alpha_{(i)} that minimizes f(x_{(i)} + \alpha_{(i)} d_{(i)}) is found by ensuring that the gradient is orthogonal to the search direction. We can use any algorithm that finds the zeros of the expression [f'(x_{(i)} + \alpha_{(i)} d_{(i)})]^T d_{(i)}.

In linear CG, there are several equivalent expressions for the value of \beta. In nonlinear CG, these different expressions are no longer equivalent; researchers are still investigating the best choice. Two choices are the Fletcher-Reeves formula, which we used in linear CG for its ease of computation, and the Polak-Ribière formula:

    \beta_{(i+1)}^{FR} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}}, \qquad
    \beta_{(i+1)}^{PR} = \frac{r_{(i+1)}^T (r_{(i+1)} - r_{(i)})}{r_{(i)}^T r_{(i)}}.

The Fletcher-Reeves method converges if the starting point is sufficiently close to the desired minimum, whereas the Polak-Ribière method can, in rare cases, cycle infinitely without converging. However, Polak-Ribière often converges much more quickly.

Fortunately, convergence of the Polak-Ribière method can be guaranteed by choosing \beta = \max\{\beta^{PR}, 0\}. Using this value is equivalent to restarting CG if \beta^{PR} < 0. To restart CG is to forget the past search directions, and start CG anew in the direction of steepest descent.

Here is an outline of the nonlinear CG method:

    d_{(0)} = r_{(0)} = -f'(x_{(0)}),
    Find \alpha_{(i)} that minimizes f(x_{(i)} + \alpha_{(i)} d_{(i)}),
    x_{(i+1)} = x_{(i)} + \alpha_{(i)} d_{(i)},
    r_{(i+1)} = -f'(x_{(i+1)}),
    \beta_{(i+1)} = \frac{r_{(i+1)}^T r_{(i+1)}}{r_{(i)}^T r_{(i)}} \quad\text{or}\quad \beta_{(i+1)} = \max\left\{ \frac{r_{(i+1)}^T (r_{(i+1)} - r_{(i)})}{r_{(i)}^T r_{(i)}},\; 0 \right\},
    d_{(i+1)} = r_{(i+1)} + \beta_{(i+1)} d_{(i)}.
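The two \beta formulas are easy to state in code. A tiny sketch (the helper names are mine; r_new and r_old stand for the residuals, i.e. the negated gradients, at the current and previous iterates):

    import numpy as np

    def beta_fletcher_reeves(r_new, r_old):
        """Fletcher-Reeves: beta = r_new^T r_new / (r_old^T r_old)."""
        return (r_new @ r_new) / (r_old @ r_old)

    def beta_polak_ribiere(r_new, r_old):
        """Polak-Ribiere with the max{., 0} safeguard: a negative value
        means restart, i.e. fall back to the steepest descent direction."""
        return max((r_new @ (r_new - r_old)) / (r_old @ r_old), 0.0)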


Nonlinear CG comes with few of the convergence guarantees of linear CG. The less similar f is to a quadratic function, the more quickly the search directions lose conjugacy. (It will become clear shortly that the term "conjugacy" still has some meaning in nonlinear CG.) Another problem is that a general function f may have many local minima. CG is not guaranteed to converge to the global minimum, and may not even find a local minimum if f has no lower bound.

Figure 37 illustrates nonlinear CG. Figure 37(a) is a function with many local minima. Figure 37(b) demonstrates the convergence of nonlinear CG with the Fletcher-Reeves formula. In this example, CG is not nearly as effective as in the linear case; this function is deceptively difficult to minimize. Figure 37(c) shows a cross-section of the surface, corresponding to the first line search in Figure 37(b). Notice that there are several minima; the line search finds a value of \alpha corresponding to a nearby minimum. Figure 37(d) shows the superior convergence of Polak-Ribière CG.

Because CG can only generate n conjugate vectors in an n-dimensional space, it makes sense to restart CG every n iterations, especially if n is small. Figure 38 shows the effect when nonlinear CG is restarted every second iteration. (For this particular example, both the Fletcher-Reeves method and the Polak-Ribière method appear the same.)

14.2. General Line Search

Depending on the value of f', it might be possible to use a fast algorithm to find the zeros of f'^T d. For instance, if f' is polynomial in \alpha, then an efficient algorithm for polynomial zero-finding can be used. However, we will only consider general-purpose algorithms.

Two iterative methods for zero-finding are the Newton-Raphson method and the Secant method. Both methods require that f be twice continuously differentiable. Newton-Raphson also requires that it be possible to compute the second derivative of f(x + \alpha d) with respect to \alpha.

The Newton-Raphson method relies on the Taylor series approximation

    f(x + \alpha d) \;\approx\; f(x) + \alpha \left[\frac{d}{d\alpha} f(x + \alpha d)\right]_{\alpha = 0} + \frac{\alpha^2}{2} \left[\frac{d^2}{d\alpha^2} f(x + \alpha d)\right]_{\alpha = 0}
                    \;=\; f(x) + \alpha\, [f'(x)]^T d + \frac{\alpha^2}{2}\, d^T f''(x)\, d,        (56)

    \frac{d}{d\alpha} f(x + \alpha d) \;\approx\; [f'(x)]^T d + \alpha\, d^T f''(x)\, d,        (57)

where f''(x) is the Hessian matrix

    f''(x) = \begin{bmatrix}
        \frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
        \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
        \vdots & \vdots & \ddots & \vdots \\
        \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n}
    \end{bmatrix}.


Figure 37: Convergence of the nonlinear Conjugate Gradient Method. (a) A complicated function with many local minima and maxima. (b) Convergence path of Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribière CG.


Figure 38: Nonlinear CG can be more effective with periodic restarts.

Figure 39: The Newton-Raphson method for minimizing a one-dimensional function (solid curve). Starting from the point x, calculate the first and second derivatives, and use them to construct a quadratic approximation to the function (dashed curve). A new point is chosen at the base of the parabola. This procedure is iterated until convergence is reached.


The function f(x + \alpha d) is approximately minimized by setting Expression 57 to zero, giving

    \alpha = -\frac{f'^T d}{d^T f''\, d}.

The truncated Taylor series approximates f(x + \alpha d) with a parabola; we step to the bottom of the parabola (see Figure 39). In fact, if f is a quadratic form, then this parabolic approximation is exact, because f'' is just the familiar matrix A. In general, the search directions are conjugate if they are f''-orthogonal. The meaning of "conjugate" keeps changing, though, because f'' varies with x. The more quickly f'' varies with x, the more quickly the search directions lose conjugacy. On the other hand, the closer x_{(i)} is to the solution, the less f'' varies from iteration to iteration. The closer the starting point is to the solution, the more similar the convergence of nonlinear CG is to that of linear CG.

To perform an exact line search of a non-quadratic function, repeated steps must be taken along the line until f'^T d is zero; hence, one CG iteration may include many Newton-Raphson iterations. The values of f'^T d and d^T f'' d must be evaluated at each step. These evaluations may be inexpensive if d^T f'' d can be analytically simplified, but if the full matrix f'' must be evaluated repeatedly, the algorithm is prohibitively slow. For some applications, it is possible to circumvent this problem by performing an approximate line search that uses only the diagonal elements of f''. Of course, there are functions for which it is not possible to evaluate f'' at all.

To perform an exact line search without computing f'', the Secant method approximates the second derivative of f(x + \alpha d) by evaluating the first derivative at the distinct points \alpha = 0 and \alpha = \sigma, where \sigma is an arbitrary small nonzero number:

    \frac{d^2}{d\alpha^2} f(x + \alpha d) \;\approx\; \frac{[f'(x + \sigma d)]^T d - [f'(x)]^T d}{\sigma},        (58)

which becomes a better approximation to the second derivative as \alpha and \sigma approach zero. If we substitute Expression 58 for the third term of the Taylor expansion (Equation 56), we have

    \frac{d}{d\alpha} f(x + \alpha d) \;\approx\; [f'(x)]^T d + \frac{\alpha}{\sigma}\Bigl( [f'(x + \sigma d)]^T d - [f'(x)]^T d \Bigr).

Minimize f(x + \alpha d) by setting its derivative to zero:

    \alpha = -\sigma\, \frac{[f'(x)]^T d}{[f'(x + \sigma d)]^T d - [f'(x)]^T d}.        (59)

Like Newton-Raphson, the Secant method also approximates f(x + \alpha d) with a parabola, but instead of choosing the parabola by finding the first and second derivative at a point, it finds the first derivative at two different points (see Figure 40). Typically, we will choose an arbitrary \sigma on the first Secant method iteration; on subsequent iterations we will choose x + \sigma d to be the value of x from the previous Secant method iteration.
In other words, if we let \alpha_{[i]} denote the value of \alpha calculated during Secant iteration i, then \sigma_{[i+1]} = -\alpha_{[i]}.

Both the Newton-Raphson and Secant methods should be terminated when x_{(i)} is reasonably close to the exact solution. Demanding too little precision could cause a failure of convergence, but demanding too fine precision makes the computation unnecessarily slow and gains nothing, because conjugacy will break down quickly anyway if f''(x) varies much with x.
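Here is a minimal sketch of one Secant-method line search built around Equation 59; it mirrors the inner loop of the canned algorithm in Appendix B5, but the function names, default parameters, and the little quadratic test are illustrative choices of mine.

    import numpy as np

    def secant_line_search(grad, x, d, sigma0=0.1, j_max=10, tol=1e-8):
        """Approximately minimize f(x + alpha*d) over alpha by the Secant method.
        grad(x) must return the gradient f'(x)."""
        eta_prev = grad(x + sigma0 * d) @ d          # directional derivative at x + sigma0*d
        alpha = -sigma0
        for _ in range(j_max):
            eta = grad(x) @ d                        # directional derivative at the current point
            alpha = alpha * eta / (eta_prev - eta)   # Secant step (Equation 59)
            x = x + alpha * d
            eta_prev = eta
            if alpha * alpha * (d @ d) <= tol * tol: # update small enough; stop
                break
        return x

    # For a quadratic f, the directional derivative is linear in alpha, so one
    # Secant step lands on the exact one-dimensional minimizer.
    A = np.array([[3.0, 2.0], [2.0, 6.0]]); b = np.array([2.0, -8.0])
    grad = lambda x: A @ x - b
    print(secant_line_search(grad, np.zeros(2), b))  # line search along d = b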


Figure 40: The Secant method for minimizing a one-dimensional function (solid curve). Starting from the point x, calculate the first derivatives at two different points (here, \alpha = 0 and \alpha = 2), and use them to construct a quadratic approximation to the function (dashed curve). Notice that both curves have the same slope at \alpha = 0, and also at \alpha = 2. As in the Newton-Raphson method, a new point is chosen at the base of the parabola, and the procedure is iterated until convergence.

Therefore, a quick but inexact line search is often the better policy (for instance, use only a fixed number of Newton-Raphson or Secant method iterations). Unfortunately, inexact line search may lead to the construction of a search direction that is not a descent direction. A common solution is to test for this eventuality (is r^T d nonpositive?), and restart CG if necessary by setting d = r.

A bigger problem with both methods is that they cannot distinguish minima from maxima. The result of nonlinear CG generally depends strongly on the starting point, and if CG with the Newton-Raphson or Secant method starts near a local maximum, it is likely to converge to that point.

Each method has its own advantages. The Newton-Raphson method has a better convergence rate, and is to be preferred if d^T f'' d can be calculated (or well approximated) quickly (i.e., in O(n) time). The Secant method only requires first derivatives of f, but its success may depend on a good choice of the parameter \sigma. It is easy to derive a variety of other methods as well. For instance, by sampling f at three different points, it is possible to generate a parabola that approximates f(x + \alpha d) without the need to calculate even a first derivative of f.

14.3. Preconditioning

Nonlinear CG can be preconditioned by choosing a preconditioner M that approximates f'' and has the property that M^{-1} r is easy to compute. In linear CG, the preconditioner attempts to transform the quadratic form so that it is similar to a sphere; a nonlinear CG preconditioner performs this transformation for a region near x_{(i)}.

Even when it is too expensive to compute the full Hessian f'', it is often reasonable to compute its diagonal for use as a preconditioner.


However, be forewarned that if x is sufficiently far from a local minimum, the diagonal elements of the Hessian may not all be positive. A preconditioner should be positive-definite, so nonpositive diagonal elements cannot be allowed. A conservative solution is to not precondition (set M = I) when the Hessian cannot be guaranteed to be positive-definite. Figure 41 demonstrates the convergence of diagonally preconditioned nonlinear CG, with the Polak-Ribière formula, on the same function illustrated in Figure 37. Here, I have cheated by using the diagonal of f'' at the solution point x to precondition every iteration.

Figure 41: The preconditioned nonlinear Conjugate Gradient Method, using the Polak-Ribière formula and a diagonal preconditioner. The space has been "stretched" to show the improvement in circularity of the contour lines around the minimum.

A Notes

Conjugate Direction methods were probably first presented by Schmidt [14] in 1908, and were independently reinvented by Fox, Huskey, and Wilkinson [7] in 1948. In the early fifties, the method of Conjugate Gradients was discovered independently by Hestenes [10] and Stiefel [15]; shortly thereafter, they jointly published what is considered the seminal reference on CG [11]. Convergence bounds for CG in terms of Chebyshev polynomials were developed by Kaniel [12]. A more thorough analysis of CG convergence is provided by van der Sluis and van der Vorst [16]. CG was popularized as an iterative method for large, sparse matrices by Reid [13] in 1971.

CG was generalized to nonlinear problems in 1964 by Fletcher and Reeves [6], based on work by Davidon [4] and Fletcher and Powell [5]. Convergence of nonlinear CG with inexact line searches was analyzed by Daniel [3]. The choice of \beta for nonlinear CG is discussed by Gilbert and Nocedal [8].

A history and extensive annotated bibliography of CG to the mid-seventies is provided by Golub and O'Leary [9]. Most research since that time has focused on nonsymmetric systems. A survey of iterative methods for the solution of linear systems is offered by Barrett et al. [1].


B Canned Algorithms

The code given in this section represents efficient implementations of the algorithms discussed in this article.

B1. Steepest Descent

Given the inputs A, b, a starting value x, a maximum number of iterations i_max, and an error tolerance \varepsilon < 1:

    i \Leftarrow 0
    r \Leftarrow b - Ax
    \delta \Leftarrow r^T r
    \delta_0 \Leftarrow \delta
    While i < i_max and \delta > \varepsilon^2 \delta_0 do
        q \Leftarrow Ar
        \alpha \Leftarrow \delta / (r^T q)
        x \Leftarrow x + \alpha r
        If i is divisible by 50
            r \Leftarrow b - Ax
        else
            r \Leftarrow r - \alpha q
        \delta \Leftarrow r^T r
        i \Leftarrow i + 1

This algorithm terminates when the maximum number of iterations i_max has been exceeded, or when \|r_{(i)}\| \le \varepsilon \|r_{(0)}\|.

The fast recursive formula for the residual is usually used, but once every 50 iterations, the exact residual is recalculated to remove accumulated floating point error. Of course, the number 50 is arbitrary; for large n, \sqrt{n} might be appropriate. If the tolerance is large, the residual need not be corrected at all (in practice, this correction is rarely used). If the tolerance is close to the limits of the floating point precision of the machine, a test should be added after \delta is evaluated to check if \delta \le \varepsilon^2 \delta_0, and if this test holds true, the exact residual should also be recomputed and \delta reevaluated. This prevents the procedure from terminating early due to floating point roundoff error.


B2. Conjugate Gradients

Given the inputs A, b, a starting value x, a maximum number of iterations i_max, and an error tolerance \varepsilon < 1:

    i \Leftarrow 0
    r \Leftarrow b - Ax
    d \Leftarrow r
    \delta_{new} \Leftarrow r^T r
    \delta_0 \Leftarrow \delta_{new}
    While i < i_max and \delta_{new} > \varepsilon^2 \delta_0 do
        q \Leftarrow Ad
        \alpha \Leftarrow \delta_{new} / (d^T q)
        x \Leftarrow x + \alpha d
        If i is divisible by 50
            r \Leftarrow b - Ax
        else
            r \Leftarrow r - \alpha q
        \delta_{old} \Leftarrow \delta_{new}
        \delta_{new} \Leftarrow r^T r
        \beta \Leftarrow \delta_{new} / \delta_{old}
        d \Leftarrow r + \beta d
        i \Leftarrow i + 1

See the comments at the end of Section B1.


B3. Preconditioned Conjugate Gradients

Given the inputs A, b, a starting value x, a (perhaps implicitly defined) preconditioner M, a maximum number of iterations i_max, and an error tolerance \varepsilon < 1:

    i \Leftarrow 0
    r \Leftarrow b - Ax
    d \Leftarrow M^{-1} r
    \delta_{new} \Leftarrow r^T d
    \delta_0 \Leftarrow \delta_{new}
    While i < i_max and \delta_{new} > \varepsilon^2 \delta_0 do
        q \Leftarrow Ad
        \alpha \Leftarrow \delta_{new} / (d^T q)
        x \Leftarrow x + \alpha d
        If i is divisible by 50
            r \Leftarrow b - Ax
        else
            r \Leftarrow r - \alpha q
        s \Leftarrow M^{-1} r
        \delta_{old} \Leftarrow \delta_{new}
        \delta_{new} \Leftarrow r^T s
        \beta \Leftarrow \delta_{new} / \delta_{old}
        d \Leftarrow s + \beta d
        i \Leftarrow i + 1

The statement "s \Leftarrow M^{-1} r" implies that one should apply the preconditioner, which may not actually be in the form of a matrix.

See also the comments at the end of Section B1.


B4. Nonlinear Conjugate Gradients with Newton-Raphson and Fletcher-Reeves

Given a function f, a starting value x, a maximum number of CG iterations i_max, a CG error tolerance \varepsilon < 1, a maximum number of Newton-Raphson iterations j_max, and a Newton-Raphson error tolerance \epsilon < 1:

    i \Leftarrow 0
    k \Leftarrow 0
    r \Leftarrow -f'(x)
    d \Leftarrow r
    \delta_{new} \Leftarrow r^T r
    \delta_0 \Leftarrow \delta_{new}
    While i < i_max and \delta_{new} > \varepsilon^2 \delta_0 do
        j \Leftarrow 0
        \delta_d \Leftarrow d^T d
        Do
            \alpha \Leftarrow -[f'(x)]^T d / (d^T f''(x)\, d)
            x \Leftarrow x + \alpha d
            j \Leftarrow j + 1
        while j < j_max and \alpha^2 \delta_d > \epsilon^2
        r \Leftarrow -f'(x)
        \delta_{old} \Leftarrow \delta_{new}
        \delta_{new} \Leftarrow r^T r
        \beta \Leftarrow \delta_{new} / \delta_{old}
        d \Leftarrow r + \beta d
        k \Leftarrow k + 1
        If k = n or r^T d \le 0
            d \Leftarrow r
            k \Leftarrow 0
        i \Leftarrow i + 1

This algorithm terminates when the maximum number of iterations i_max has been exceeded, or when \|r_{(i)}\| \le \varepsilon \|r_{(0)}\|.

Each Newton-Raphson iteration adds \alpha d to x; the iterations are terminated when each update \alpha d falls below a given tolerance (\|\alpha d\| \le \epsilon), or when the number of iterations exceeds j_max. A fast inexact line search can be accomplished by using a small j_max and/or by approximating the Hessian f''(x) with its diagonal.

Nonlinear CG is restarted (by setting d \Leftarrow r) whenever a search direction is computed that is not a descent direction. It is also restarted once every n iterations, to improve convergence for small n.

The calculation of \alpha may result in a divide-by-zero error. This may occur because the starting point x_{(0)} is not sufficiently close to the desired minimum, or because f is not twice continuously differentiable. In the former case, the solution is to choose a better starting point or a more sophisticated line search. In the latter case, CG might not be the most appropriate minimization algorithm.


B5. Preconditioned Nonlinear Conjugate Gradients with Secant and Polak-Ribière

Given a function f, a starting value x, a maximum number of CG iterations i_max, a CG error tolerance ε < 1, a Secant method step parameter σ_0, a maximum number of Secant method iterations j_max, and a Secant method error tolerance ε < 1:

    i ⇐ 0
    k ⇐ 0
    r ⇐ −f′(x)
    Calculate a preconditioner M ≈ f″(x)
    s ⇐ M⁻¹r
    d ⇐ s
    δ_new ⇐ rᵀd
    δ_0 ⇐ δ_new
    While i < i_max and δ_new > ε²δ_0 do
        j ⇐ 0
        δ_d ⇐ dᵀd
        α ⇐ −σ_0
        η_prev ⇐ [f′(x + σ_0 d)]ᵀd
        Do
            η ⇐ [f′(x)]ᵀd
            α ⇐ α η / (η_prev − η)
            x ⇐ x + αd
            η_prev ⇐ η
            j ⇐ j + 1
        while j < j_max and α²δ_d > ε²
        r ⇐ −f′(x)
        δ_old ⇐ δ_new
        δ_mid ⇐ rᵀs
        Calculate a preconditioner M ≈ f″(x)
        s ⇐ M⁻¹r
        δ_new ⇐ rᵀs
        β ⇐ (δ_new − δ_mid) / δ_old
        k ⇐ k + 1
        If k = n or β ≤ 0
            d ⇐ s
            k ⇐ 0
        else
            d ⇐ s + βd
        i ⇐ i + 1

This algorithm terminates when the maximum number of iterations i_max has been exceeded, or when ‖r_(i)‖ ≤ ε‖r_(0)‖.

Each Secant method iteration adds αd to x; the iterations are terminated when each update αd falls below a given tolerance (‖αd‖ ≤ ε), or when the number of iterations exceeds j_max. A fast inexact line search can be accomplished by using a small j_max. The parameter σ_0 determines the value of σ in Equation 59 for the first step of each Secant method minimization. Unfortunately, it may be necessary to adjust this parameter to achieve convergence.


The Polak-Ribière β parameter is

    β = (r_(i+1)ᵀ s_(i+1) − r_(i+1)ᵀ s_(i)) / (r_(i)ᵀ s_(i)).

Care must be taken that the preconditioner M is always positive-definite. The preconditioner is not necessarily in the form of a matrix.

Nonlinear CG is restarted (by setting d ⇐ s) whenever the Polak-Ribière β parameter is negative. It is also restarted once every n iterations, to improve convergence for small n.

Nonlinear CG presents several choices: Preconditioned or not, Newton-Raphson method or Secant method or another method, Fletcher-Reeves or Polak-Ribière. It should be possible to construct any of the variations from the versions presented above. (Polak-Ribière is almost always to be preferred.)
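To close Appendix B, here is a NumPy sketch of the B5 procedure. As with the earlier sketches, the names, the test function, and the diagonal-Hessian preconditioner are illustrative choices of mine, not part of the article.

import numpy as np

def pcg_nonlinear_secant_pr(f_grad, apply_Minv, x, sigma0=1e-2,
                            i_max=200, j_max=10, eps=1e-8):
    """Sketch of Section B5: preconditioned nonlinear CG with a Secant line
    search and the Polak-Ribiere update.  apply_Minv(x, r) applies an
    approximate inverse Hessian at x to r."""
    n = x.size
    i = k = 0
    r = -f_grad(x)
    s = apply_Minv(x, r)
    d = s.copy()
    delta_new = r @ d
    delta0 = delta_new
    while i < i_max and delta_new > eps**2 * delta0:
        j = 0
        delta_d = d @ d
        alpha = -sigma0
        eta_prev = f_grad(x + sigma0 * d) @ d
        while True:                                  # Secant line search
            eta = f_grad(x) @ d
            alpha = alpha * eta / (eta_prev - eta)
            x = x + alpha * d
            eta_prev = eta
            j += 1
            if not (j < j_max and alpha**2 * delta_d > eps**2):
                break
        r = -f_grad(x)
        delta_old = delta_new
        delta_mid = r @ s
        s = apply_Minv(x, r)
        delta_new = r @ s
        beta = (delta_new - delta_mid) / delta_old   # Polak-Ribiere
        k += 1
        if k == n or beta <= 0:                      # restart
            d = s.copy()
            k = 0
        else:
            d = s + beta * d
        i += 1
    return x, i

# Same illustrative problem as before, with diagonal-Hessian preconditioning.
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
grad = lambda x: x**3 + A @ x - b
minv = lambda x, r: r / (3 * x**2 + np.diag(A))      # inverse of diag(f''(x))
print(pcg_nonlinear_secant_pr(grad, minv, np.array([1.0, 1.0])))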


C Ugly Proofs

C1. The Solution to Ax = b Minimizes the Quadratic Form

Suppose A is symmetric. Let x be a point that satisfies Ax = b, and let e be an error term. Then

    f(x + e) = ½(x + e)ᵀA(x + e) − bᵀ(x + e) + c        (by Equation 3)
             = ½xᵀAx + eᵀAx + ½eᵀAe − bᵀx − bᵀe + c     (by symmetry of A)
             = ½xᵀAx − bᵀx + c + ½eᵀAe                  (because Ax = b, so eᵀAx = bᵀe)
             = f(x) + ½eᵀAe.

If A is positive-definite, then the latter term is positive for all e ≠ 0; therefore x minimizes f.

C2. A Symmetric Matrix Has n Orthogonal Eigenvectors.

Any matrix has at least one eigenvector. To see this, note that det(A − λI) is a polynomial in λ, and therefore has at least one zero (possibly complex); call it λ_A. The matrix A − λ_A I has determinant zero and is therefore singular, so there must be a nonzero vector v for which (A − λ_A I)v = 0. The vector v is an eigenvector, because Av = λ_A v.

Any symmetric matrix has a full complement of n orthogonal eigenvectors. To prove this, I will demonstrate for the case of a 4 × 4 matrix A and let you make the obvious generalization to matrices of arbitrary size. It has already been proven that A has at least one eigenvector v with eigenvalue λ_A. Let x_1 = v/‖v‖, which has unit length. Choose three arbitrary vectors x_2, x_3, x_4 so that the x_i are mutually orthogonal and all of unit length (such vectors can be found by Gram-Schmidt orthogonalization). Let X = [x_1 x_2 x_3 x_4]. Because the x_i are orthonormal, XᵀX = I, and so Xᵀ = X⁻¹. Note also that for i ≠ 1, x_iᵀAx_1 = x_iᵀλ_A x_1 = 0. Thus,

    XᵀAX = [ λ_A   0 ]
           [  0    B ],

where B is a 3 × 3 symmetric matrix. B must have some eigenvector w with eigenvalue λ_B. Let ŵ be a vector of length 4 whose first element is zero, and whose remaining elements are the elements of w. Clearly,

    XᵀAXŵ = [ λ_A   0 ] ŵ = λ_B ŵ.
            [  0    B ]

In other words, AXŵ = λ_B Xŵ, and thus Xŵ is an eigenvector of A. Furthermore, x_1ᵀXŵ = [1 0 0 0]ŵ = 0, so x_1 and Xŵ are orthogonal. Thus, A has at least two orthogonal eigenvectors!

The more general statement that a symmetric matrix has n orthogonal eigenvectors is proven by induction. For the example above, assume any 3 × 3 matrix (such as B) has 3 orthogonal eigenvectors; call them ŵ_1, ŵ_2, and ŵ_3. Then Xŵ_1, Xŵ_2, and Xŵ_3 are eigenvectors of A, and they are orthogonal because X has orthonormal columns, and therefore maps orthogonal vectors to orthogonal vectors. Thus, A has 4 orthogonal eigenvectors.
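Both facts are easy to spot-check numerically. The following sketch (my own illustration) verifies that f(x + e) − f(x) = ½eᵀAe when Ax = b, and that the eigenvectors of a random symmetric matrix, as returned by numpy.linalg.eigh, are mutually orthogonal.

import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive-definite matrix A and a right-hand side b.
n = 4
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
b = rng.standard_normal(n)
c = 0.0

f = lambda y: 0.5 * y @ A @ y - b @ y + c

# C1: with x the solution of Ax = b, f(x + e) - f(x) equals 0.5*e'Ae > 0.
x = np.linalg.solve(A, b)
e = rng.standard_normal(n)
print(np.isclose(f(x + e) - f(x), 0.5 * e @ A @ e))    # True

# C2: a symmetric matrix has n orthogonal eigenvectors; V'V should be I.
w, V = np.linalg.eigh(A)
print(np.allclose(V.T @ V, np.eye(n)))                 # True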


C3. Optimality of Chebyshev Polynomials

Chebyshev polynomials are optimal for minimization of expressions like Expression 50 because they increase in magnitude more quickly outside the range [−1, 1] than any other polynomial that is restricted to have magnitude no greater than one inside the range [−1, 1].

The Chebyshev polynomial of degree i,

    T_i(ω) = ½[(ω + √(ω² − 1))^i + (ω − √(ω² − 1))^i],

can also be expressed on the region [−1, 1] as

    T_i(ω) = cos(i cos⁻¹ ω),    −1 ≤ ω ≤ 1.

From this expression (and from Figure 32), one can deduce that the Chebyshev polynomials have the property

    |T_i(ω)| ≤ 1,    −1 ≤ ω ≤ 1,

and oscillate rapidly between −1 and 1:

    T_i(cos(kπ/i)) = (−1)^k,    k = 0, 1, ..., i.

Notice that the i zeros of T_i must fall between the i + 1 extrema of T_i in the range [−1, 1]. For an example, see the five zeros of T_5(ω) in Figure 32.

Similarly, the function

    P_i(λ) = T_i((λ_max + λ_min − 2λ)/(λ_max − λ_min)) / T_i((λ_max + λ_min)/(λ_max − λ_min))

oscillates in the range ±T_i((λ_max + λ_min)/(λ_max − λ_min))⁻¹ on the domain [λ_min, λ_max]. P_i also satisfies the requirement that P_i(0) = 1.

The proof relies on a cute trick. There is no polynomial Q_i(λ) of degree i such that Q_i(0) = 1 and Q_i is better than P_i on the range [λ_min, λ_max]. To prove this, suppose that there is such a polynomial; then, |Q_i(λ)| < T_i((λ_max + λ_min)/(λ_max − λ_min))⁻¹ on the range [λ_min, λ_max]. It follows that the polynomial P_i − Q_i has a zero at λ = 0, and also has i zeros on the range [λ_min, λ_max]. Hence, P_i − Q_i is a polynomial of degree i with at least i + 1 zeros, which is impossible. By contradiction, I conclude that the Chebyshev polynomial of degree i optimizes Expression 50.
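The two expressions for T_i and the behavior of P_i can be checked numerically. The sketch below is an illustration of mine: it compares the closed form of T_i against cos(i cos⁻¹ ω) on [−1, 1], and confirms that P_i(0) = 1 while |P_i| stays within the stated bound on [λ_min, λ_max].

import numpy as np

def cheb(i, w):
    """Chebyshev polynomial T_i via the closed form; valid outside [-1, 1] too."""
    w = np.asarray(w, dtype=complex)           # sqrt may be imaginary on (-1, 1)
    return 0.5 * ((w + np.sqrt(w**2 - 1))**i + (w - np.sqrt(w**2 - 1))**i).real

i = 5
w = np.linspace(-1, 1, 201)
print(np.allclose(cheb(i, w), np.cos(i * np.arccos(w))))     # the two forms agree

# The scaled polynomial P_i used in the convergence bound (arbitrary spectrum bounds).
lmin, lmax = 2.0, 50.0
P = lambda lam: cheb(i, (lmax + lmin - 2 * lam) / (lmax - lmin)) / cheb(i, (lmax + lmin) / (lmax - lmin))
print(np.isclose(P(0.0), 1.0))                               # P_i(0) = 1
lam = np.linspace(lmin, lmax, 500)
bound = 1.0 / cheb(i, (lmax + lmin) / (lmax - lmin))
print(np.max(np.abs(P(lam))) <= bound + 1e-12)               # oscillates within +/- bound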


D Homework

For the following questions, assume (unless otherwise stated) that you are using exact arithmetic; there is no floating point roundoff error.

1. (Easy.) Prove that the preconditioned Conjugate Gradient method is not affected if the preconditioning matrix is scaled. In other words, for any nonzero constant κ, show that the sequence of steps taken using the preconditioner κM is identical to the steps taken using the preconditioner M.

2. (Hard, but interesting.) Suppose you wish to solve Ax = b for a symmetric, positive-definite n × n matrix A. Unfortunately, the trauma of your linear algebra course has caused you to repress all memories of the Conjugate Gradient algorithm. Seeing you in distress, the Good Eigenfairy materializes and grants you a list of the distinct eigenvalues (but not the eigenvectors) of A. However, you do not know how many times each eigenvalue is repeated.

Clever person that you are, you mumbled the following algorithm in your sleep this morning:

    Choose an arbitrary starting point x_(0)
    For i ⇐ 0 to n − 1
        Remove an arbitrary eigenvalue from the list and call it λ_i
        x_(i+1) ⇐ x_(i) − λ_i⁻¹(Ax_(i) − b)

No eigenvalue is used twice; on termination, the list is empty.

(a) Show that upon termination of this algorithm, x_(n) is the solution to Ax = b. Hint: Read Section 6.1 carefully. Graph the convergence of a two-dimensional example; draw the eigenvector axes, and try selecting each eigenvalue first. (It is most important to intuitively explain what the algorithm is doing; if you can, also give equations that prove it.)

(b) Although this routine finds the exact solution after n iterations, you would like each intermediate iterate x_(i) to be as close to the solution as possible. Give a crude rule of thumb for how you should choose an eigenvalue from the list on each iteration. (In other words, in what order should the eigenvalues be used?)

(c) What could go terribly wrong with this algorithm if floating point roundoff occurs?

(d) Give a rule of thumb for how you should choose an eigenvalue from the list on each iteration to prevent floating point roundoff error from escalating. Hint: The answer is not the same as that of question (b).
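If you would like to experiment before answering question 2, the following sketch (the matrix, starting point, and ordering are arbitrary choices of mine) runs the mumbled algorithm on a small example; trying different eigenvalue orderings is a good way to develop the intuition the hints ask for.

import numpy as np

# A small symmetric positive-definite example (my choice).
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
eigenvalues = list(np.linalg.eigvalsh(A))   # the Good Eigenfairy's list

x = np.array([5.0, 5.0])                    # arbitrary starting point
for lam in eigenvalues:                     # reorder this list and watch the iterates
    x = x - (A @ x - b) / lam
    print(x)

print(np.allclose(A @ x, b))                # True: x_(n) solves Ax = b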


