IV.21 Numerical Analysis
Lloyd N. Trefethen

1 The Need for Numerical Computation

Everyone knows that when scientists and engineers need numerical answers to mathematical problems, they turn to computers. Nevertheless, there is a widespread misconception about this process.

The power of numbers has been extraordinary. It is often noted that the scientific revolution was set in motion when Galileo and others made it a principle that everything must be measured. Numerical measurements led to physical laws expressed mathematically, and, in the remarkable cycle whose fruits are all around us, finer measurements led to refined laws, which in turn led to better technology and still finer measurements. The day has long since passed when an advance in the physical sciences could be achieved, or a significant engineering product developed, without numerical mathematics.

Computers certainly play a part in this story, yet there is a misunderstanding about what their role is. Many people imagine that scientists and mathematicians generate formulas, and then, by inserting numbers into these formulas, computers grind out the necessary results. The reality is nothing like this. What really goes on is a far more interesting process of execution of algorithms. In most cases the job could not be done even in principle by formulas, for most mathematical problems cannot be solved by a finite sequence of elementary operations. What happens instead is that fast algorithms quickly converge to "approximate" answers that are accurate to three or ten digits of precision, or a hundred. For a scientific or engineering application, such an answer may be as good as exact.

We can illustrate the complexities of exact versus approximate solutions by an elementary example. Suppose we have one polynomial of degree 4,

p(z) = c_0 + c_1 z + c_2 z^2 + c_3 z^3 + c_4 z^4,

and another of degree 5,

q(z) = d_0 + d_1 z + d_2 z^2 + d_3 z^3 + d_4 z^4 + d_5 z^5.

It is well-known that there is an explicit formula that expresses the roots of p in terms of radicals (discovered by Ferrari around 1540), but no such formula for the roots of q (as shown by Ruffini and Abel [VI.33] more than 250 years later; see the insolubility of the quintic [V.21] for more details). Thus, in a certain philosophical sense the root-finding problems for p and q are utterly different. Yet in practice they hardly differ at all.
If a scientist or a mathematician wants to know the roots of one of these polynomials, he or she will turn to a computer and get an answer to sixteen digits of precision in less than a millisecond. Did the computer use an explicit formula? In the case of q, the answer is certainly no, but what about p? Maybe, maybe not. Most of the time, the user neither knows nor cares, and probably not one mathematician in a hundred could write down formulas for the roots of p from memory.
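To make the point concrete, here is a small computation (a Python/NumPy sketch; the coefficient values are arbitrary choices for illustration, not taken from the text) in which the roots of a quartic and a quintic are obtained with equal ease, with no formula in sight:

```python
import numpy as np

# Coefficients of a quartic p and a quintic q (arbitrary illustrative values),
# listed from the highest-degree term down, as numpy.roots expects.
p_coeffs = [1.0, -3.0, 2.0, 5.0, -7.0]        # degree 4
q_coeffs = [1.0, 0.0, -4.0, 1.0, 6.0, -2.0]   # degree 5

# numpy.roots finds the eigenvalues of a companion matrix, so no closed-form
# formula is needed for either polynomial; both answers come back in well
# under a millisecond, accurate (for well-conditioned roots) to roughly
# sixteen digits.
print(np.roots(p_coeffs))
print(np.roots(q_coeffs))
```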
Here are three more examples of problems that can be solved in principle by a finite sequence of elementary operations, like finding the roots of p.

(i) Linear equations: solve a system of n linear equations in n unknowns.
(ii) Linear programming: minimize a linear function of n variables subject to m linear constraints.
(iii) Traveling salesman problem: find the shortest tour between n cities.

And here are five that, like finding the roots of q, cannot generally be solved in this manner.

(iv) Find an eigenvalue [I.3 §4.3] of an n × n matrix.
(v) Minimize a function of several variables.
(vi) Evaluate an integral.
(vii) Solve an ordinary differential equation (ODE).
(viii) Solve a partial differential equation (PDE).

Can we conclude that (i)–(iii) will be easier than (iv)–(viii) in practice? Absolutely not. Problem (iii) is usually very hard indeed if n is, say, in the hundreds or thousands. Problems (vi) and (vii) are usually rather easy, at least if the integral is in one dimension. Problems (i) and (iv) are of almost exactly the same difficulty: easy when n is small, like 100, and often very hard when n is large, like 1 000 000. In fact, in these matters philosophy is such a poor guide to practice that, for each of the three problems (i)–(iii), when n and m are large one often ignores the exact solution and uses approximate (but fast!) methods instead.

Numerical analysis is the study of algorithms for solving the problems of continuous mathematics, by which we mean problems involving real or complex variables. (This definition includes problems like linear programming and the traveling salesman problem posed over the real numbers, but not their discrete analogues.) In the remainder of this article we shall review some of its main branches, past accomplishments, and possible future trends.

2 A Brief History

Throughout history, leading mathematicians have been involved with scientific applications, and in many cases this has led to the discovery of numerical algorithms still in use today. Gauss [VI.26], as usual, is an outstanding example. Among many other contributions, he made crucial advances in least-squares data fitting (1795), systems of linear equations (1809), and numerical quadrature (1814), as well as inventing the fast Fourier transform [III.26] (1805), though the last did not become widely known until its rediscovery by Cooley and Tukey in 1965.

Around 1900, the numerical side of mathematics started to become less conspicuous in the activities of research mathematicians. This was a consequence of the growth of mathematics generally and of great advances in fields in which, for technical reasons, mathematical rigor had to be the heart of the matter. For example, many advances of the early twentieth century sprang from mathematicians' new ability to reason rigorously about infinity, a subject relatively far from numerical calculation.

A generation passed, and in the 1940s the computer was invented. From this moment numerical mathematics began to explode, but now mainly in the hands of specialists. New journals were founded such as Mathematics of Computation (1943) and Numerische Mathematik (1959). The revolution was sparked by hardware, but it included mathematical and algorithmic developments that had nothing to do with hardware. In the half-century from the 1950s, machines sped up by a factor of around 10^9, but so did the best algorithms known for some problems, generating a combined increase in speed of almost incomprehensible scale.

Half a century on, numerical analysis has grown into one of the largest branches of mathematics, the specialty of thousands of researchers who publish in dozens of mathematical journals as well as applications journals across the sciences and engineering. Thanks to the efforts of these people going back many decades, and thanks to ever more powerful computers, we have reached a point where most of the classical mathematical problems of the physical sciences can be solved numerically to high accuracy. Most of the algorithms that make this possible were invented since 1950.
Numerical analysis is built on a strong foundation: the mathematical subject of approximation theory. This field encompasses classical questions of interpolation, series expansions, and harmonic analysis [IV.11] associated with Newton [VI.14], Fourier [VI.25], Gauss, and others; semiclassical problems of polynomial and rational minimax approximation associated with names such as Chebyshev [VI.45] and Bernstein; and major newer topics, including splines, radial basis functions, and wavelets [VII.3]. We shall not have space to address these subjects, but in almost every area of numerical analysis it is a fact that, sooner or later, the discussion comes down to approximation theory.

3 Machine Arithmetic and Rounding Errors

It is well-known that computers cannot represent real or complex numbers exactly. A quotient like 1/7 evaluated on a computer, for example, will normally yield an inexact result. (It would be different if we designed machines to work in base 7!) Computers approximate real numbers by a system of floating-point arithmetic, in which each number is represented in a digital equivalent of scientific notation, so that the scale does not matter unless the number is so huge or tiny as to cause overflow or underflow. Floating-point arithmetic was invented by Konrad Zuse in Berlin in the 1930s, and
by the end of the 1950s it was standard across the computer industry.

Until the 1980s, different computers had widely different arithmetic properties. Then, in 1985, after years of discussion, the IEEE (Institute of Electrical and Electronics Engineers) standard for binary floating-point arithmetic was adopted, or IEEE arithmetic for short. This standard has subsequently become nearly universal on processors of many kinds. An IEEE (double precision) real number consists of a 64-bit word divided into 53 bits for a signed fraction in base 2 and 11 bits for a signed exponent. Since 2^-53 ≈ 1.1 × 10^-16, IEEE numbers represent the numbers of the real line to a relative accuracy of about 16 digits. Since 2^(±2^10) ≈ 10^(±308), this system works for numbers up to about 10^308 and down to about 10^-308.

Computers do not merely represent numbers, of course; they perform operations on them such as addition, subtraction, multiplication, and division, and more complicated results are obtained from sequences of these elementary operations. In floating-point arithmetic, the computed result of each elementary operation is almost exactly correct in the following sense: if "*" is one of these four operations in its ideal form and "⊛" is the same operation as realized on the computer, then for any floating-point numbers x and y, assuming that there is no underflow or overflow,

x ⊛ y = (x * y)(1 + ε).

Here ε is a very small quantity, no greater in absolute value than a number known as machine epsilon, denoted by ε_mach, that measures the accuracy of the computer. In the IEEE system, ε_mach = 2^-53 ≈ 1.1 × 10^-16.

Thus, on a computer, the interval [1, 2], for example, is approximated by about 10^16 numbers. It is interesting to compare the fineness of this discretization with that of the discretizations of physics. In a handful of solid or liquid or a balloonful of gas, the number of atoms or molecules in a line from one point to another is on the order of 10^8 (the cube root of Avogadro's number). Such a system behaves enough like a continuum to justify our definitions of physical quantities such as density, pressure, stress, strain, and temperature. Computer arithmetic, however, is more than a million times finer than this. Another comparison with physics concerns the precision to which fundamental constants are known, such as (roughly) 4 digits for the gravitational constant G, 7 digits for Planck's constant h and the elementary charge e, and 12 digits for the ratio μ_e/μ_B of the magnetic moment of the electron to the Bohr magneton. At present, almost nothing in physics is known to more than 12 or 13 digits of accuracy. Thus IEEE numbers are orders of magnitude more precise than any number in science. (Of course, purely mathematical quantities like π are another matter.)

In two senses, then, floating-point arithmetic is far closer to its ideal than is physics. It is a curious phenomenon that, nevertheless, it is floating-point arithmetic rather than the laws of physics that is widely regarded as an ugly and dangerous compromise.
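A short experiment makes these quantities tangible (a Python/NumPy sketch; the printed values are what IEEE double precision produces):

```python
import numpy as np

# Spacing of IEEE double-precision numbers around 1.0, as reported by NumPy.
# This is 2**-52; the text's eps_mach = 2**-53 is the corresponding unit roundoff.
print(np.finfo(float).eps)      # 2.220446049250313e-16

# Each elementary operation is rounded, so equalities that hold for real
# numbers can fail in floating point by an amount on the order of eps_mach.
print(0.1 + 0.2 == 0.3)         # False
print((0.1 + 0.2) - 0.3)        # roughly 5.6e-17
```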
Numerical analysts themselves are partly to blame for this perception. In the 1950s and 1960s, the founding fathers of the field discovered that inexact arithmetic can be a source of danger, causing errors in results that "ought" to be right. The source of such problems is numerical instability: that is, the amplification of rounding errors from microscopic to macroscopic scale by certain modes of computation. These men, including von Neumann [VI.91], Wilkinson, Forsythe, and Henrici, took great pains to publicize the risks of careless reliance on machine arithmetic. These risks are very real, but the message was communicated all too successfully, leading to the current widespread impression that the main business of numerical analysis is coping with rounding errors. In fact, the main business of numerical analysis is designing algorithms that converge quickly; rounding-error analysis, while often a part of the discussion, is rarely the central issue. If rounding errors vanished, 90% of numerical analysis would remain.

4 Numerical Linear Algebra

Linear algebra became a standard topic in undergraduate mathematics curriculums in the 1950s and 1960s, and has remained there ever since. There are several reasons for this, but I think one is at the bottom of it: the importance of linear algebra has exploded since the arrival of computers.

The starting point of this subject is Gaussian elimination, a procedure that can solve n linear equations in n unknowns using on the order of n^3 arithmetic operations. Equivalently, it solves equations of the form Ax = b, where A is an n × n matrix and x and b are column vectors of size n. Gaussian elimination is invoked on computers around the world almost every time a system of linear equations is solved. Even if n is as large as 1000, the time required is well under a second on a typical 2008 desktop machine. The idea of
elimination was first discovered by Chinese scholars about 2000 years ago, and more recent contributors include Lagrange [VI.22], Gauss, and Jacobi [VI.35]. The modern way of describing such algorithms, however, was apparently introduced as late as the 1930s.

Suppose that, say, α times the first row of A is subtracted from the second row. This operation can be interpreted as the multiplication of A on the left by the lower-triangular matrix M_1 consisting of the identity with the additional nonzero entry m_21 = −α. Further analogous row operations correspond to further multiplications on the left by lower-triangular matrices M_j. If k steps convert A to an upper-triangular matrix U, then we have MA = U with M = M_k ⋯ M_2 M_1, or, upon setting L = M^{-1},

A = LU.

Here L is unit lower-triangular, that is, lower-triangular with all its diagonal entries equal to 1. Since U represents the target structure and L encodes the operations carried out to get there, we can say that Gaussian elimination is a process of lower-triangular upper-triangularization.

Many other algorithms of numerical linear algebra are also based on writing a matrix as a product of matrices that have special properties. To borrow a phrase from biology, we may say that this field has a central dogma:

algorithms ←→ matrix factorizations.

In this framework we can quickly describe the next algorithm that needs to be considered. Not every matrix has an LU factorization; a 2 × 2 counterexample is the matrix

A = [0 1; 1 0].

Soon after computers came into use it was observed that even for matrices that do have LU factorizations, the pure form of Gaussian elimination is unstable, amplifying rounding errors by potentially large amounts. Stability can be achieved by interchanging rows during the elimination in order to bring maximal entries to the diagonal, a process known as pivoting. Since pivoting acts on rows, it again corresponds to a multiplication of A by other matrices on the left. The matrix factorization corresponding to Gaussian elimination with pivoting is

PA = LU,

where U is upper-triangular, L is unit lower-triangular, and P is a permutation matrix, i.e., the identity matrix with permuted rows. If the permutations are chosen to bring the largest entry below the diagonal in column k to the (k, k) position before the kth elimination step, then L has the additional property |ℓ_ij| ≤ 1 for all i and j.

The discovery of pivoting came quickly, but its theoretical analysis has proved astonishingly hard. In practice, pivoting makes Gaussian elimination almost perfectly stable, and it is routinely done by almost all computer programs that need to solve linear systems of equations. Yet it was realized in around 1960 by Wilkinson and others that for certain exceptional matrices, Gaussian elimination is still unstable, even with pivoting. The lack of an explanation of this discrepancy represents an embarrassing gap at the heart of numerical analysis. Experiments suggest that the fraction of matrices (for example, among random matrices with independent normally distributed entries) for which Gaussian elimination amplifies rounding errors by a factor greater than ρ n^{1/2} is in a certain sense exponentially small as a function of ρ as ρ → ∞, where n is the dimension, but a theorem to this effect has never been proved.
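The factorization PA = LU and the bound on the entries of L can be observed directly. The following is a bare-bones sketch in Python/NumPy of Gaussian elimination with partial pivoting (for illustration only; the function name and the random test matrix are our own, and library routines such as those in LAPACK should be used in practice):

```python
import numpy as np

def lu_partial_pivoting(A):
    """Return P, L, U with P @ A = L @ U, choosing the largest pivot in each column."""
    A = A.astype(float).copy()
    n = A.shape[0]
    P = np.eye(n)
    L = np.eye(n)
    for k in range(n - 1):
        # Pivot: bring the largest entry in column k (on or below the diagonal) to row k.
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p], :] = A[[p, k], :]
        P[[k, p], :] = P[[p, k], :]
        L[[k, p], :k] = L[[p, k], :k]
        # Eliminate entries below the pivot; the multipliers fill in column k of L.
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / A[k, k]
            A[i, k:] -= L[i, k] * A[k, k:]
    return P, L, np.triu(A)

A = np.random.randn(5, 5)
P, L, U = lu_partial_pivoting(A)
print(np.allclose(P @ A, L @ U))                  # True
print(np.max(np.abs(np.tril(L, -1))) <= 1.0)      # multipliers bounded by 1, as in the text
```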
Meanwhile, beginning in the late 1950s, the field of numerical linear algebra expanded in another direction: the use of algorithms based on orthogonal [III.50 §3] or unitary [III.50 §3] matrices, that is, real matrices with Q^{-1} = Q^T or complex ones with Q^{-1} = Q^*, where Q^* denotes the conjugate transpose. The starting point of such developments is the idea of QR factorization. If A is an m × n matrix with m ≥ n, a QR factorization of A is a product A = QR, where Q has orthonormal columns and R is upper-triangular. One can interpret this formula as a matrix expression of the familiar idea of Gram–Schmidt orthogonalization, in which the columns q_1, q_2, ... of Q are determined one after another. These column operations correspond to multiplication of A on the right by elementary upper-triangular matrices. One could say that the Gram–Schmidt algorithm aims for Q and gets R as a by-product, and is thus a process of triangular orthogonalization.

A big event was when Householder showed in 1958 that a dual strategy of orthogonal triangularization is more effective for many purposes. In this approach, by applying a succession of elementary matrix operations each of which reflects R^m across a hyperplane, one reduces A to upper-triangular form via orthogonal operations: one aims at R and gets Q
as a by-product. The Householder method turns out to be more stable numerically, because orthogonal operations preserve norms and thus do not amplify the rounding errors introduced at each step.

From the QR factorization sprang a rich collection of linear algebra algorithms in the 1960s. The QR factorization can be used by itself to solve least-squares problems and construct orthonormal bases. More remarkable is its use as a step in other algorithms. In particular, one of the central problems of numerical linear algebra is the determination of the eigenvalues and eigenvectors of a square matrix A. If A has a complete set of eigenvectors, then by forming a matrix X whose columns are these eigenvectors and a diagonal matrix D whose diagonal entries are the corresponding eigenvalues, we obtain AX = XD, and hence, since X is nonsingular,

A = XDX^{-1},

the eigenvalue decomposition. In the special case in which A is Hermitian [III.50 §3], a complete set of orthonormal eigenvectors always exists, giving

A = QDQ^*,

where Q is unitary.

The standard algorithm for computing these factorizations was developed in the early 1960s by Francis, Kublanovskaya, and Wilkinson: the QR algorithm. Because polynomials of degree 5 or more cannot be solved by a formula, we know that eigenvalues cannot generally be computed in closed form. The QR algorithm is therefore necessarily an iterative one, involving a sequence of QR factorizations that is in principle infinite. Nevertheless, its convergence is extraordinarily rapid. In the symmetric case, for a typical matrix A, the QR algorithm converges cubically, in the sense that at each step the number of correct digits in one of the eigenvalue–eigenvector pairs approximately triples.

The QR algorithm is one of the great triumphs of numerical analysis, and its impact through widely used software products has been enormous. Algorithms and analysis based on it led in the 1960s to computer codes in Algol and Fortran and later to the software library EISPACK ("Eigensystem Package") and its descendant LAPACK. The same methods have also been incorporated in general-purpose numerical libraries such as the NAG, IMSL, and Numerical Recipes collections, and in problem-solving environments such as MATLAB, Maple, and Mathematica. These developments have been so successful that the computation of matrix eigenvalues long ago became a "black box" operation for virtually every scientist, with nobody but a few specialists knowing the details of how it is done.

A curious related story is that EISPACK's relative LINPACK for solving linear systems of equations took on an unexpected function: it became the original basis for the benchmarks that all computer manufacturers run to test the speed of their computers. If a supercomputer is lucky enough to make the TOP500 list, updated twice a year since 1993, it is because of its prowess in solving certain matrix problems Ax = b of dimensions ranging from 100 into the millions.
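The iteration underlying the QR algorithm is easy to sketch in its simplest, unshifted form (a Python/NumPy illustration of our own; practical codes first reduce A to Hessenberg or tridiagonal form and use shifts, which this sketch omits):

```python
import numpy as np

np.random.seed(0)
B = np.random.randn(6, 6)
A = B + B.T                       # a symmetric test matrix (arbitrary)

# Unshifted QR iteration: factor A = QR, then form RQ and repeat.
# Each step is an orthogonal similarity transform, so eigenvalues are preserved,
# and for symmetric A the iterates drift toward a diagonal matrix
# (slowly if two eigenvalues are close in magnitude).
Ak = A.copy()
for _ in range(500):
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q

print(np.sort(np.diag(Ak)))              # approximate eigenvalues from the diagonal
print(np.sort(np.linalg.eigvalsh(A)))    # reference values from the library routine
```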
The eigenvalue decomposition is familiar to all mathematicians, but the development of numerical linear algebra has also brought its younger cousin onto the scene: the singular value decomposition (SVD). The SVD was discovered by Beltrami, Jordan [VI.52], and Sylvester [VI.42] in the late nineteenth century, and made famous by Golub and other numerical analysts beginning in around 1965. If A is an m × n matrix with m ≥ n, an SVD of A is a factorization

A = UΣV^*,

where U is m × n with orthonormal columns, V is n × n and unitary, and Σ is diagonal with diagonal entries σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_n ≥ 0. One could compute the SVD by relating it to the eigenvalue problems for AA^* and A^*A, but this proves numerically unstable; a better approach is to use a variant of the QR algorithm that does not square A. Computing the SVD is the standard route to determining the norm [III.62] ‖A‖ = σ_1 (here ‖·‖ is the Hilbert space [III.37] or "2" norm), the norm of the inverse ‖A^{-1}‖ = 1/σ_n in the case where A is square and nonsingular, or their product, known as the condition number, κ(A) = ‖A‖ ‖A^{-1}‖ = σ_1/σ_n. It is also a step in an extraordinary variety of further computational problems including rank-deficient least-squares, computation of ranges and nullspaces, determination of ranks, "total least-squares," low-rank approximation, and determination of angles between subspaces.
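Several of the quantities just listed can be read off from a computed SVD, as in this Python/NumPy sketch (the 8 × 5 random matrix is an arbitrary example):

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(8, 5)                    # an m x n matrix with m >= n

U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vh))   # True: A = U Sigma V*

# The singular values give the 2-norm, the numerical rank, and the condition number.
print(np.isclose(np.linalg.norm(A, 2), s[0]))       # ||A|| = sigma_1
print(np.linalg.matrix_rank(A), np.sum(s > 1e-12))  # both 5 for this example
print("condition number sigma_1/sigma_n:", s[0] / s[-1])
```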
All the discussion above concerns "classical" numerical linear algebra, born in the period 1950–75. The ensuing quarter-century brought in a whole new set of tools: methods for large-scale problems based on Krylov subspace iterations. The idea of these iterations is as follows. Suppose a linear algebra problem is given that involves a matrix of large dimension, say n ≫ 1000. The solution may be characterized as the vector x ∈ R^n that satisfies a certain variational property such as minimizing (1/2)x^T A x − x^T b (for solving Ax = b if A is symmetric positive definite) or being a stationary point of (x^T A x)/(x^T x) (for solving Ax = λx if A is symmetric). Now if K_k is a k-dimensional subspace of R^n with k ≪ n, then it may be possible to solve the same variational problem much more quickly in that subspace. The magical choice of K_k is a Krylov subspace

K_k(A, q) = span(q, Aq, ..., A^{k−1}q)

for an initial vector q. For reasons that have fascinating connections with approximation theory, solutions in these subspaces often converge very rapidly to the exact solution in R^n as k increases, if the eigenvalues of A are favorably distributed. For example, it is often possible to solve a matrix problem involving 10^5 unknowns to ten-digit precision in just a few hundred iterations. The speedup compared with the classical algorithms may be a factor of thousands.

Krylov subspace iterations originated with the conjugate gradient and Lanczos iterations published in 1952, but in those years computers were not powerful enough to solve problems of a large enough scale for the methods to be competitive. They took off in the 1970s with the work of Reid and Paige and especially van der Vorst and Meijerink, who made famous the idea of preconditioning. In preconditioning a system Ax = b, one replaces it by a mathematically equivalent system such as MAx = Mb for some nonsingular matrix M. If M is well chosen, the new problem involving MA may have favorably distributed eigenvalues and a Krylov subspace iteration may solve it quickly. Since the 1970s, preconditioned matrix iterations have emerged as an indispensable tool of computational science. As one indication of their prominence we may note that in 2001, Thomson ISI announced that the most heavily cited article in all of mathematics in the 1990s was the 1989 paper by van der Vorst introducing Bi-CGStab, a generalization of conjugate gradients for nonsymmetric matrices.

Finally, we must mention the biggest unsolved problem in numerical analysis. Can an arbitrary n × n matrix A be inverted in O(n^α) operations for every α > 2? (The problems of solving a system Ax = b or computing a matrix product AB are equivalent.) Gaussian elimination has α = 3, and the exponent shrinks as far as 2.376 for certain recursive (though impractical) algorithms published by Coppersmith and Winograd in 1990. Is there a "fast matrix inverse" in store for us?

5 Numerical Solution of Differential Equations

Long before much attention was paid to linear algebra, mathematicians developed numerical methods to solve problems of analysis. The problem of numerical integration or quadrature goes back to Gauss and Newton [VI.14], and even to Archimedes [VI.3]. The classic quadrature formulas are derived from the idea of interpolating data at n + 1 points by a polynomial of degree n, then integrating the polynomial exactly. Equally spaced interpolation points give the Newton–Cotes formulas, which are useful for small degrees but diverge at a rate as high as 2^n as n → ∞: the Runge phenomenon. If the points are chosen optimally, then the result is Gauss quadrature, which converges rapidly and is numerically stable. It turns out that these optimal points are roots of Legendre polynomials, which are clustered near the endpoints. (A proof is sketched in special functions [III.85].)
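The rapid convergence of Gauss quadrature is easy to observe numerically. The sketch below (Python/NumPy; the smooth integrand e^x on [−1, 1] is an arbitrary choice) uses the Gauss–Legendre nodes and weights:

```python
import numpy as np

exact = np.exp(1.0) - np.exp(-1.0)   # integral of e^x over [-1, 1]

for n in (2, 4, 6, 8, 10):
    # Nodes (roots of the degree-n Legendre polynomial) and weights.
    x, w = np.polynomial.legendre.leggauss(n)
    approx = np.dot(w, np.exp(x))
    print(n, abs(approx - exact))
# The error falls to roughly machine precision with only 8-10 nodes,
# illustrating the rapid convergence described above.
```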
Equally good for most purposes is Clenshaw–Curtis quadrature, where the interpolation points become cos(jπ/n), 0 ≤ j ≤ n. This quadrature method is also stable and rapidly convergent, and unlike Gauss quadrature can be executed in O(n log n) operations by the fast Fourier transform. The explanation of why clustered points are necessary for effective quadrature rules is related to the subject of potential theory.

Around 1850 another problem of analysis began to get attention: the solution of ODEs. The Adams formulas are based on polynomial interpolation in equally spaced points, which in practice typically number fewer than ten. These were the first of what are now called multistep methods for the numerical solution of ODEs. The idea here is that for an initial value problem u' = f(t, u) with independent variable t > 0, we pick a small time step Δt > 0 and consider a finite set of time values

t_n = nΔt,  n ≥ 0.

We then replace the ODE by an algebraic approximation that enables us to calculate a succession of approximate values

v^n ≈ u(t_n),  n ≥ 0.

(The superscript here is just a superscript, not a power.) The simplest such approximate formula, going back to Euler [VI.19], is

v^{n+1} = v^n + Δt f(t_n, v^n),
or, using the abbreviation f^n = f(t_n, v^n),

v^{n+1} = v^n + Δt f^n.

Both the ODE itself and its numerical approximation may involve one equation or many, in which case u(t, x) and v^n become vectors of an appropriate dimension. The Adams formulas are higher-order generalizations of Euler's formula that are much more efficient at generating accurate solutions. For example, the fourth-order Adams–Bashforth formula is

v^{n+1} = v^n + (Δt/24)(55 f^n − 59 f^{n−1} + 37 f^{n−2} − 9 f^{n−3}).

The term "fourth-order" reflects a new element in the numerical treatment of problems of analysis: the appearance of questions of convergence as Δt → 0. The formula above is of fourth order in the sense that it will normally converge at the rate O((Δt)^4). The orders employed in practice are most often in the range 3–6, enabling excellent accuracy for all kinds of computations, typically in the range of 3–10 digits, and higher-order formulas are occasionally used when still more accuracy is needed. Most unfortunately, the habit in the numerical analysis literature is to speak not of the convergence of these magnificently efficient methods, but of their error, or more precisely their discretization or truncation error as distinct from rounding error. This ubiquitous language of error analysis is dismal in tone, but seems ineradicable.

At the turn of the twentieth century, the second great class of ODE algorithms, known as Runge–Kutta or one-step methods, was developed by Runge, Heun, and Kutta. For example, here are the formulas of the famous fourth-order Runge–Kutta method, which advance a numerical solution (again scalar or system) from time step t_n to t_{n+1} with the aid of four evaluations of the function f:

a = Δt f(t_n, v^n),
b = Δt f(t_n + Δt/2, v^n + a/2),
c = Δt f(t_n + Δt/2, v^n + b/2),
d = Δt f(t_n + Δt, v^n + c),
v^{n+1} = v^n + (a + 2b + 2c + d)/6.

Runge–Kutta methods tend to be easier to implement but sometimes harder to analyze than multistep formulas. For example, for any s, it is a trivial matter to derive the coefficients of the s-step Adams–Bashforth formula, which has order of accuracy p = s. For Runge–Kutta methods, by contrast, there is no simple relationship between the number of "stages" (i.e., function evaluations per step) and the attainable order of accuracy. The classical methods with s = 1, 2, 3, 4 were known to Kutta in 1901 and have order p = s, but it was not until 1963 that it was proved that s = 6 stages are required to achieve order p = 5. The analysis of such problems involves beautiful mathematics from graph theory and other areas, and a key figure in this area since the 1960s has been John Butcher. For orders p = 6, 7, 8 the minimal numbers of stages are s = 7, 9, 11, while for p > 8 exact minima are not known. Fortunately, these higher orders are rarely needed for practical purposes.

When computers began to be used to solve differential equations after World War II, a phenomenon of the greatest practical importance appeared: once again, numerical instability. As before, this phrase refers to the unbounded amplification of local errors by a computational process, but now the dominant local errors are usually those of discretization rather than rounding. Instability typically manifests itself as an oscillatory error in the computed solution that blows up exponentially as more numerical steps are taken.
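A short experiment confirms the O((Δt)^4) convergence of the fourth-order Runge–Kutta formulas above (a Python sketch; the test problem u' = −u with exact solution e^{−t} is an arbitrary choice):

```python
import numpy as np

def rk4_solve(f, t0, u0, T, dt):
    """March the classical fourth-order Runge-Kutta method from t0 to T with step dt."""
    t, v = t0, u0
    while t < T - 1e-12:
        a = dt * f(t, v)
        b = dt * f(t + dt / 2, v + a / 2)
        c = dt * f(t + dt / 2, v + b / 2)
        d = dt * f(t + dt, v + c)
        v += (a + 2 * b + 2 * c + d) / 6
        t += dt
    return v

f = lambda t, u: -u                     # test equation u' = -u, exact solution exp(-t)
for dt in (0.1, 0.05, 0.025, 0.0125):
    err = abs(rk4_solve(f, 0.0, 1.0, 1.0, dt) - np.exp(-1.0))
    print(dt, err)                      # each halving of dt cuts the error by about 16
```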
One mathematician concerned with this effect was Germund Dahlquist. Dahlquist saw that the phenomenon could be analyzed with great power and generality, and some people regard the appearance of his 1956 paper as one of the events marking the birth of modern numerical analysis. This landmark paper introduced what might be called the fundamental theorem of numerical analysis:

consistency + stability = convergence.

The theory is based on precise definitions of these three notions along the following lines. Consistency is the property that the discrete formula has locally positive order of accuracy and thus models the right ODE. Stability is the property that errors introduced at one time step cannot grow unboundedly at later times. Convergence is the property that as Δt → 0, in the absence of rounding errors, the numerical solution converges to the correct result.

Before Dahlquist's paper, the idea of an equivalence of stability and convergence was perhaps in the air in the sense that practitioners realized that if a numerical scheme was not unstable, then it would probably give a good approximation to the right answer. His theory gave rigorous form to that idea for a wide class of numerical methods.

As computer methods for ODEs were being developed, the same was happening for the much bigger
subject of PDEs. Discrete numerical methods for solving PDEs had been invented around 1910 by Richardson for applications in stress analysis and meteorology, and further developed by Southwell; in 1928 there was also a theoretical paper on finite-difference methods by Courant [VI.83], Friedrichs, and Lewy. But although the Courant–Friedrichs–Lewy work later became famous, the impact of these ideas before computers came along was limited. After that point the subject developed quickly. Particularly influential in the early years was the group of researchers around von Neumann at the Los Alamos laboratory, including the young Peter Lax.

Just as for ODEs, von Neumann and his colleagues discovered that some numerical methods for PDEs were subject to catastrophic instabilities. For example, to solve the linear wave equation u_t = u_x numerically we may pick space and time steps Δx and Δt for a regular grid,

x_j = jΔx,  t_n = nΔt,  j, n ≥ 0,

and replace the PDE by algebraic formulas that compute a succession of approximate values:

v_j^n ≈ u(t_n, x_j),  j, n ≥ 0.

A well-known discretization for this purpose is the Lax–Wendroff formula:

v_j^{n+1} = v_j^n + (λ/2)(v_{j+1}^n − v_{j−1}^n) + (λ^2/2)(v_{j+1}^n − 2v_j^n + v_{j−1}^n),

where λ = Δt/Δx, which can be generalized to nonlinear systems of hyperbolic conservation laws in one dimension. For u_t = u_x, if λ is held fixed at a value less than or equal to 1, the method will converge to the correct solution as Δx, Δt → 0 (ignoring rounding errors). If λ is greater than 1, on the other hand, it will explode. Von Neumann and others realized that the presence or absence of such instabilities could be tested, at least for linear constant-coefficient problems, by discrete Fourier analysis [III.27] in x: "von Neumann analysis." Experience indicated that, as a practical matter, a method would succeed if it was not unstable. A theory soon appeared that gave rigor to this observation: the Lax equivalence theorem, published by Lax and Richtmyer in 1956, the same year as Dahlquist's paper. Many details were different—this theory was restricted to linear equations whereas Dahlquist's theory for ODEs also applied to nonlinear ones—but broadly speaking the new result followed the same pattern of equating convergence to consistency plus stability. Mathematically, the key point was the uniform boundedness principle.

In the half-century since von Neumann died, the Lax–Wendroff formula and its relatives have grown into a breathtakingly powerful subject known as computational fluid dynamics. Early treatments of linear and nonlinear equations in one space dimension soon moved to two dimensions and eventually to three. It is now a routine matter to solve problems involving millions of variables on computational grids with hundreds of points in each of three directions. The equations are linear or nonlinear; the grids are uniform or nonuniform, often adaptively refined to give special attention to boundary layers and other fast-changing features; the applications are everywhere. Numerical methods were used first to model airfoils, then whole wings, then whole aircraft. Engineers still use wind tunnels, but they rely more on computations.
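The stability boundary at λ = 1 is easy to reproduce. Here is a minimal Python/NumPy sketch of the Lax–Wendroff formula above on a periodic grid (the grid size, step counts, and initial bump are arbitrary choices for the illustration):

```python
import numpy as np

def lax_wendroff(lmbda, nx=200, nsteps=400):
    """Advance u_t = u_x on a periodic grid with the Lax-Wendroff formula."""
    x = np.linspace(0.0, 1.0, nx, endpoint=False)
    v = np.exp(-100 * (x - 0.5) ** 2)        # a smooth initial bump
    for _ in range(nsteps):
        vp = np.roll(v, -1)                  # v_{j+1}, periodic boundary
        vm = np.roll(v, +1)                  # v_{j-1}
        v = v + 0.5 * lmbda * (vp - vm) + 0.5 * lmbda**2 * (vp - 2 * v + vm)
    return np.max(np.abs(v))

print(lax_wendroff(0.9))   # stays of order 1: stable for lambda <= 1
print(lax_wendroff(1.1))   # enormous: the instability described above, seeded by rounding errors
```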
Many of these successes have been facilitated by another numerical technology for solving PDEs that emerged in the 1960s from diverse roots in engineering and mathematics: finite elements. Instead of approximating a differential operator by a difference quotient, finite-element methods approximate the solution itself by functions f that can be broken up into simple pieces. For instance, one might partition the domain of f into elementary sets such as triangles or tetrahedra and insist that the restriction of f to each piece is a polynomial of small degree. The solution is obtained by solving a variational form of the PDE within the corresponding finite-dimensional subspace, and there is often a guarantee that the computed solution is optimal within that subspace. Finite-element methods have taken advantage of tools of functional analysis to develop to a very mature state. These methods are known for their flexibility in handling complicated geometries, and in particular they are entirely dominant in applications in structural mechanics and civil engineering. The number of books and articles that have been published about finite-element methods is in excess of 10 000.

In the vast and mature field of numerical solution of PDEs, what aspect of the current state of the art would most surprise Richardson or Courant, Friedrichs, and Lewy? I think it is the universal dependence on exotic algorithms of linear algebra. The solution of a large-scale PDE problem in three dimensions may require a system of a million equations to be solved at each time step. This may be achieved by a GMRES matrix iteration that utilizes a finite-difference preconditioner implemented by a Bi-CGStab iteration relying on another multigrid preconditioner. Such stacking of tools was
surely not imagined by the early computer pioneers. The need for it ultimately traces to numerical instability, for as Crank and Nicolson first noted in 1947, the crucial tool for combating instability is the use of implicit formulas, which couple together unknowns at the new time step t_{n+1}, and it is in implementing this coupling that solutions of systems of equations are required.

Here are some examples that illustrate the successful reliance of today's science and engineering on the numerical solution of PDEs: chemistry (the Schrödinger equation [III.83]); structural mechanics (the equations of elasticity); weather prediction (the geostrophic equations); turbine design (the Navier–Stokes equations [III.23]); acoustics (the Helmholtz equation); telecommunications (Maxwell's equations [IV.13 §1.1]); cosmology (the Einstein equations); oil discovery (the migration equations); groundwater remediation (Darcy's law); integrated circuit design (the drift diffusion equations); tsunami modeling (the shallow-water equations); optical fibers (the nonlinear wave equations [III.49]); image enhancement (the Perona–Malik equation); metallurgy (the Cahn–Hilliard equation); pricing financial options (the Black–Scholes equation [VII.9 §2]).

6 Numerical Optimization

The third great branch of numerical analysis is optimization, that is, the minimization of functions of several variables and the closely related problem of solution of nonlinear systems of equations. The development of optimization has been somewhat independent of that of the rest of numerical analysis, carried forward in part by a community of scholars with close links to operations research and economics.

Calculus students learn that a smooth function may achieve an extremum at a point of zero derivative, or at a boundary. The same two possibilities characterize the two big strands of the field of optimization. At one end there are problems of finding interior zeros and minima of unconstrained nonlinear functions by methods related to multivariate calculus. At the other are problems of linear programming, where the function to be minimized is linear and therefore easy to understand, and all the challenge is in the boundary constraints.

Unconstrained nonlinear optimization is an old subject. Newton introduced the idea of approximating functions by the first few terms of what we now call their Taylor series; indeed, Arnol'd has argued that Taylor series were Newton's "main mathematical discovery." To find a zero x_* of a function F of a real variable x, everyone knows the idea of Newton's method: at the kth step, given an estimate x^(k) ≈ x_*, use the derivative F'(x^(k)) to define a linear approximation from which to derive a better estimate x^(k+1):

x^(k+1) = x^(k) − F(x^(k))/F'(x^(k)).

Newton (1669) and Raphson (1690) applied this idea to polynomials, and Simpson (1740) generalized it to other functions F and to systems of two equations. In today's language, for a system of n equations in n unknowns, we regard F as an n-vector whose derivative at a point x^(k) ∈ R^n is the n × n Jacobian matrix with entries

J_ij(x^(k)) = ∂F_i/∂x_j (x^(k)),  1 ≤ i, j ≤ n.

This matrix defines a linear approximation to F(x) that is accurate for x ≈ x^(k). Newton's method then takes the matrix form

x^(k+1) = x^(k) − (J(x^(k)))^{-1} F(x^(k)),

which in practice means that to get x^(k+1) from x^(k), we solve a linear system of equations:

J(x^(k))(x^(k+1) − x^(k)) = −F(x^(k)).
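Here is a sketch of this iteration in Python/NumPy for a small made-up system of two equations (the particular F is an arbitrary example with a root near (1.93, 0.52)):

```python
import numpy as np

def F(x):
    # An arbitrary pair of equations: x0^2 + x1^2 = 4 and x0*x1 = 1.
    return np.array([x[0]**2 + x[1]**2 - 4.0,
                     x[0] * x[1] - 1.0])

def J(x):
    # The Jacobian matrix J_ij = dF_i/dx_j of F.
    return np.array([[2 * x[0], 2 * x[1]],
                     [x[1],     x[0]]])

x = np.array([2.0, 0.5])                   # initial guess
for k in range(6):
    # One Newton step: solve J(x)(x_new - x) = -F(x), then update.
    x = x + np.linalg.solve(J(x), -F(x))
    print(k, np.linalg.norm(F(x)))         # the residual roughly squares at each step
```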
As long as J is Lipschitz continuous and nonsingular at x∗ and the initial guess is good enough, the convergence of this iteration is quadratic:

    ‖x(k+1) − x∗‖ = O(‖x(k) − x∗‖²).    (1)

Students often think it might be a good idea to develop formulas to enhance the exponent in this estimate to 3 or 4. However, this is an illusion. Taking two steps at a time of a quadratically convergent algorithm yields a quartically convergent one, so the difference in efficiency between quadratic and quartic is at best a constant factor. The same goes if the exponent 2, 3, or 4 is replaced by any other number greater than 1. The true distinction is between all of these algorithms that converge superlinearly, of which Newton's method is the prototype, and those that converge linearly or geometrically, where the exponent is just 1.

From the point of view of multivariate calculus, it is a small step from solving a system of equations to minimizing a scalar function f of a variable x ∈ Rⁿ: to find a (local) minimum, we seek a zero of the gradient g(x) = ∇f(x), an n-vector. The derivative of g is the Jacobian matrix known as the Hessian of f, with entries

    Hij(x(k)) = ∂²f/∂xi∂xj (x(k)),    1 ≤ i, j ≤ n.
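A small experiment makes estimate (1) concrete. In the sketch below, the function F(x) = x² − 2 and the starting guess x = 1 are illustrative choices; the printed errors roughly square at each step, so the number of correct digits roughly doubles until rounding error intervenes.

    import math

    # Quadratic convergence in practice (illustrative example, not from the text):
    # Newton's method for F(x) = x^2 - 2, whose root is x* = sqrt(2).
    F  = lambda x: x * x - 2.0
    dF = lambda x: 2.0 * x

    x, x_star = 1.0, math.sqrt(2.0)
    for k in range(6):
        x = x - F(x) / dF(x)
        # The error roughly squares each step -- about 9e-2, 2e-3, 2e-6, 2e-12 --
        # until it reaches the limits of double-precision rounding.
        print(k, abs(x - x_star))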
One may utilize this Hessian just as before in a Newton iteration to find a zero of g(x); the new feature is that a Hessian is always symmetric. Though the Newton formulas for minimization and finding zeros were already established, the arrival of computers created a new field of numerical optimization. One of the obstacles quickly encountered was that Newton's method often fails if the initial guess is not good. This problem has been comprehensively addressed, both practically and theoretically, by the algorithmic technologies known as line searches and trust regions.

For problems with more than a few variables, it also quickly became clear that the cost of evaluating Jacobians or Hessians at every step could be exorbitant. Faster methods were needed that might make use of inexact Jacobians or Hessians and/or inexact solutions of the associated linear equations, while still achieving superlinear convergence. An early breakthrough of this kind was the discovery of quasi-Newton methods in the 1960s by Broyden, Davidon, Fletcher, and Powell, in which partial information is used to generate steadily improving estimates of the true Jacobian or Hessian or its matrix factors. An illustration of the urgency of this subject at the time is the fact that in 1970 the optimal rank-two symmetric positive-definite quasi-Newton updating formula was published independently by no fewer than four different authors, namely Broyden, Fletcher, Goldfarb, and Shanno; their discovery has been known ever since as the BFGS formula. (A minimal sketch of a BFGS iteration is given at the end of this discussion below.)

In subsequent years, as the scale of tractable problems has increased exponentially, new ideas have also become important, including automatic differentiation, a technology that enables derivatives of computed functions to be determined automatically: the computer program itself is "differentiated," so that as well as producing numerical outputs it also produces their derivatives. The idea of automatic differentiation is an old one, but for various reasons, partly related to advances in sparse linear algebra and to the development of "reverse mode" formulations, it did not become fully practical until the work of Bischof, Carle, and Griewank in the 1990s.

Unconstrained optimization problems are relatively easy, but they are not typical; the true depth of this field is revealed by the methods that have been developed for dealing with constraints. Suppose a function f : Rⁿ → R is to be minimized subject to certain equality constraints cj(x) = 0 and inequality constraints dj(x) ≥ 0, where {cj} and {dj} are also functions from Rⁿ to R. Even the problem of stating local optimality conditions for solutions to such problems is nontrivial, a matter involving lagrange multipliers [III.64] and a distinction between active and inactive constraints. This problem was solved by what are now known as the KKT conditions, introduced by Kuhn and Tucker in 1951 and also, it was subsequently realized, twelve years earlier by Karush. Development of algorithms for constrained nonlinear optimization continues to be an active research topic today.

The problem of constraints brings us to the other strand of numerical optimization, linear programming. This subject was born in the 1930s and 1940s with Kantorovich in the Soviet Union and Dantzig in the United States. As an outgrowth of his work for the U.S. Air Force in the war, Dantzig invented in 1947 the famous simplex algorithm [III.84] for solving linear programs.
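Returning for a moment to the quasi-Newton methods described above, here is the bare-bones BFGS sketch promised earlier. It maintains an approximation B to the inverse Hessian, updated by the BFGS rank-two formula from gradient differences alone, and uses a simple backtracking line search. The Rosenbrock test function, tolerances, and safeguard thresholds are illustrative choices; production codes add further safeguards such as the Wolfe conditions.

    import numpy as np

    def bfgs(f, grad, x0, tol=1e-8, max_iter=500):
        """A bare-bones BFGS quasi-Newton method with backtracking line search.

        B approximates the *inverse* Hessian and is updated from gradient
        differences only, so no Hessian is ever formed or factored.
        """
        x = np.asarray(x0, dtype=float)
        n = x.size
        B = np.eye(n)                        # initial inverse-Hessian estimate
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            p = -B @ g                       # quasi-Newton search direction
            fx, gp = f(x), g @ p
            t = 1.0                          # backtracking (Armijo) line search
            while f(x + t * p) > fx + 1e-4 * t * gp and t > 1e-10:
                t *= 0.5
            s = t * p
            x_new = x + s
            g_new = grad(x_new)
            y = g_new - g
            ys = y @ s
            if ys > 1e-12:                   # skip update if curvature condition fails
                rho = 1.0 / ys
                I = np.eye(n)
                B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
                    + rho * np.outer(s, s)   # the BFGS rank-two update
            x, g = x_new, g_new
        return x

    # Illustrative use on the Rosenbrock function, a standard test problem:
    f = lambda v: (1 - v[0])**2 + 100 * (v[1] - v[0]**2)**2
    grad = lambda v: np.array([-2*(1 - v[0]) - 400*v[0]*(v[1] - v[0]**2),
                               200*(v[1] - v[0]**2)])
    print(bfgs(f, grad, [-1.2, 1.0]))        # should print a point close to [1, 1]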
A linear program is nothing more than a problem of minimizing a linear function of n variables subject to m linear equality and/or inequality constraints. How can this be a challenge? One answer is that m and n may be large. Large-scale problems may arise through discretization of continuous problems and also in their own right. A famous early example was Leontief's theory of input–output models in economics, which won him the Nobel Prize in 1973. Even in the 1970s the Soviet Union used an input–output computer model involving thousands of variables as a tool for planning the economy.

The simplex algorithm made medium- and large-scale linear programming problems tractable. Such a problem is defined by its objective function, the function f(x) to be minimized, and its feasible region, the set of vectors x ∈ Rⁿ that satisfy all the constraints. For a linear program the feasible region is a polyhedron, a closed domain bounded by hyperplanes, and the optimal value of f is guaranteed to be attained at one of the vertex points. (A point is called a vertex if it is the unique solution of some subset of the equations that define the constraints.) The simplex algorithm proceeds by moving systematically downhill from one vertex to another until an optimal point is reached. All of the iterates lie on the boundary of the feasible region.

In 1984, an upheaval occurred in this field, triggered by Narendra Karmarkar at AT&T Bell Laboratories. Karmarkar showed that one could sometimes do much better than the simplex algorithm by working in the interior of the feasible region instead. Once a connection was shown between Karmarkar's method and the logarithmic barrier methods popularized by Fiacco and