Advanced Review

Gaussian elimination

Nicholas J. Higham∗

∗Correspondence to: Nicholas.j.higham@manchester.ac.uk
School of Mathematics, The University of Manchester, Manchester, UK

As the standard method for solving systems of linear equations, Gaussian elimination (GE) is one of the most important and ubiquitous numerical algorithms. However, its successful use relies on understanding its numerical stability properties and how to organize its computations for efficient execution on modern computers. We give an overview of GE, ranging from theory to computation. We explain why GE computes an LU factorization and the various benefits of this matrix factorization viewpoint. Pivoting strategies for ensuring numerical stability are described. Special properties of GE for certain classes of structured matrices are summarized. How to implement GE in a way that efficiently exploits the hierarchical memories of modern computers is discussed. We also describe block LU factorization, corresponding to the use of pivot blocks instead of pivot elements, and explain how iterative refinement can be used to improve a solution computed by GE. Other topics are GE for sparse matrices and the role GE plays in the TOP500 ranking of the world's fastest computers. © 2011 John Wiley & Sons, Inc. WIREs Comp Stat 2011 3 230–238 DOI: 10.1002/wics.164

Keywords: Gaussian elimination; LU factorization; pivoting; numerical stability; iterative refinement

INTRODUCTION

Gaussian elimination (GE) is the standard method for solving a system of linear equations. As such, it is one of the most ubiquitous numerical algorithms and plays a fundamental role in scientific computation. GE was known to the ancient Chinese1 and is familiar to many school children as the intuitively natural method of eliminating variables from linear equations. Gauss used it in the context of the linear least squares problem.2–4 Undergraduates learn the method in linear algebra courses, where it is usually taught in conjunction with reduction to echelon form. In this linear algebra context, GE is shown to be a tool for obtaining all solutions to a linear system, for computing the determinant, and for deducing the rank of the coefficient matrix. However, there is much more to GE from the point of view of matrix analysis and matrix computations.

In this article, we survey the many facets of GE that are relevant to computation, in statistics or in other contexts. We begin in the next section by summarizing GE and its basic linear algebraic properties, including conditions for its success and the key interpretation of the elimination as LU factorization. Then we turn to the numerical properties of LU factorization and discuss pivoting strategies for ensuring numerical stability. In the section 'Structured Matrices', we describe some special results that hold for LU factorization when the matrix has particular properties. Computer implementation is then discussed, as well as a version of GE that uses block pivots. Iterative refinement, a means for improving the quality of a computed solution, is also described.

We will need the following notation. The unit roundoff (or machine precision) is denoted by $u$; in IEEE double precision arithmetic it has the value $u = 2^{-53} \approx 1.1 \times 10^{-16}$. We write $fl(A)$ for the result of rounding the elements of $A$ to floating point numbers.
The $i$th unit vector $e_i$ is the vector that is zero except for a 1 in the $i$th element. The notation $1\colon n$ denotes the vector $[1, 2, \dots, n]$, while $n\colon -1\colon 1$ denotes the vector $[n, n-1, \dots, 1]$. $A(i, j)$, with $i$ and $j$ vectors of indices, denotes the submatrix of $A$ comprising the intersection of the rows specified by $i$ and the columns specified by $j$. $\|A\|$ denotes any subordinate matrix norm, and we sometimes use the $\infty$-norm, given for $A \in \mathbb{R}^{n \times n}$ by the formula

$$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|.$$

LU FACTORIZATION

The aim of GE is to reduce a full system of $n$ linear equations in $n$ unknowns to triangular form using elementary row operations, thereby reducing a problem that we cannot solve to one that we can.
WIREs Computational Statistic -1 with A=A∈Rmx" s that GF c tem A)At the start of the th st we have converted the origl systm twhere ops,where a k-1#-k+1 nop denotes a f T e is n A肉= k-1 「A A图 difficulty in generalizing the LU factorization to rect A (1) angular matrices,though by far its most common use 知-k+10 is for r with A upper triangular.The kth stage of the elimi- form nation the elements below the pivot element al in the kth column of A according to the operations that this happens cipa submatrix f Thm.9.1).We define a+=a-mai=k+1:% (2a b+山=b-mb,1=k+1:n(2b) ngular for where the quantities =1:n-1.f Ak is singular for some1≤k≤n-1 then the factorization may exist,but if so it is not m达=a州/a你,i=k+1:n The the theorem are e in general matrices they can be shown always to hold;see the Back the p section'Structured Matrices' triangular system Ux=b is the recurrence The interpretation of GE as an LU f very impo ed tha paradigm for thinking and computings.In particular. 4=(b-∑4)/4,k=n-1-1:1 =k+1 Le give severa Much insight into ge is obtained by =b1 and Ux:thus x is obtained by solving two triangular systems.If we need to solve for another right-hand side b2 we where the Gauss transformation M=I-mge with m=[0,...,0m..Overall, can just carry e LU correspon ing trian that iss if w ork with the MR-1Mn-2…M1A=Am=:U. quations(2)that mix up operations on A and b the fact thatMit easy to Similarly,solving Ay=c redu to solving the 三2us1n A=...M-U the computation of the sca r o yA-.which =(1+m1e)1+m2e…1+mm-1e1)U ny or x)and s oagain requires just two triangula =(+∑m加 then 「1 m211 =eTU-len eTU-IL-1en eTA-len bum U=:LU 1 Volume 3.Mayuune 2011 2011 John wiley sons.Inc 231
There are $n-1$ stages, beginning with $A^{(1)} = A \in \mathbb{R}^{n \times n}$, $b^{(1)} = b$, and finishing with the upper triangular system $A^{(n)}x = b^{(n)}$. At the start of the $k$th stage we have converted the original system to $A^{(k)}x = b^{(k)}$, where

$$A^{(k)} = \begin{bmatrix} A^{(k)}_{11} & A^{(k)}_{12} \\ 0 & A^{(k)}_{22} \end{bmatrix}, \qquad A^{(k)}_{11} \in \mathbb{R}^{(k-1) \times (k-1)}, \quad A^{(k)}_{22} \in \mathbb{R}^{(n-k+1) \times (n-k+1)}, \quad (1)$$

with $A^{(k)}_{11}$ upper triangular. The $k$th stage of the elimination zeros the elements below the pivot element $a^{(k)}_{kk}$ in the $k$th column of $A^{(k)}$ according to the operations

$$a^{(k+1)}_{ij} = a^{(k)}_{ij} - m_{ik}\, a^{(k)}_{kj}, \quad i, j = k+1\colon n, \quad (2a)$$
$$b^{(k+1)}_{i} = b^{(k)}_{i} - m_{ik}\, b^{(k)}_{k}, \quad i = k+1\colon n, \quad (2b)$$

where the quantities

$$m_{ik} = a^{(k)}_{ik} / a^{(k)}_{kk}, \quad i = k+1\colon n$$

are called the multipliers and $a^{(k)}_{kk}$ is called the pivot. At the end of the $(n-1)$st stage, we have the upper triangular system $Ux \equiv A^{(n)}x = b^{(n)}$, which is solved by back substitution. Back substitution for the upper triangular system $Ux = b$ is the recurrence

$$x_n = b_n/u_{nn}, \qquad x_k = \Bigl(b_k - \sum_{j=k+1}^{n} u_{kj} x_j\Bigr)\Big/ u_{kk}, \quad k = n-1\colon -1\colon 1.$$

Much insight into GE is obtained by expressing it in matrix notation. We can write $A^{(k+1)} = M_k A^{(k)}$, where the Gauss transformation $M_k = I - m_k e_k^T$ with $m_k = [0, \dots, 0, m_{k+1,k}, \dots, m_{n,k}]^T$. Overall,

$$M_{n-1} M_{n-2} \cdots M_1 A = A^{(n)} =: U.$$

By using the fact that $M_k^{-1} = I + m_k e_k^T$ it is easy to show that

$$A = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} U = (I + m_1 e_1^T)(I + m_2 e_2^T) \cdots (I + m_{n-1} e_{n-1}^T)\, U = \Bigl(I + \sum_{i=1}^{n-1} m_i e_i^T\Bigr) U = \begin{bmatrix} 1 & & & \\ m_{21} & 1 & & \\ \vdots & \ddots & \ddots & \\ m_{n1} & \cdots & m_{n,n-1} & 1 \end{bmatrix} U =: LU.$$

The upshot is that GE computes an LU factorization $A = LU$ (also called an LU decomposition), where $L$ is unit lower triangular and $U$ is upper triangular. The cost of the computation is $(2/3)n^3 + O(n^2)$ flops, where a flop denotes a floating point addition, subtraction, multiplication, or division. There is no difficulty in generalizing the LU factorization to rectangular matrices, though by far its most common use is for square matrices.

GE may fail with a division by zero during formation of the multipliers. The following theorem shows that this happens precisely when $A$ has a singular leading principal submatrix of dimension less than $n$ (Ref 5, Thm. 9.1). We define $A_k = A(1\colon k, 1\colon k)$.

Theorem 1 There exists a unique LU factorization of $A \in \mathbb{R}^{n \times n}$ if and only if $A_k$ is nonsingular for $k = 1\colon n-1$. If $A_k$ is singular for some $1 \le k \le n-1$ then the factorization may exist, but if so it is not unique.

The conditions of the theorem are in general difficult to check, but for some classes of structured matrices they can be shown always to hold; see the section 'Structured Matrices'.

The interpretation of GE as an LU factorization is very important, because it is well established that the matrix factorization viewpoint is a powerful paradigm for thinking and computing.6,7 In particular, separating the computation of LU factorization from its application is beneficial. We give several examples.

First, note that given $A = LU$, we can write $Ax = b_1$ as $LUx = b_1$, or $Lz = b_1$ and $Ux = z$; thus $x$ is obtained by solving two triangular systems. If we need to solve for another right-hand side $b_2$ we can just carry out the corresponding triangular solves, re-using the LU factorization; this reuse is not so obvious if we work with the GE equations (2), which mix up operations on $A$ and $b$. Similarly, solving $A^T y = c$ reduces to solving the triangular systems $U^T z = c$ and $L^T y = z$ using the available factors $L$ and $U$.
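As a concrete illustration of this "factorize once, solve repeatedly" idea, here is a minimal sketch using SciPy's LU driver routines (which implement GE with partial pivoting); the matrix and right-hand sides are arbitrary test data.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    b1, b2, c = rng.standard_normal((3, 5))

    lu_piv = lu_factor(A)                 # one O(n^3) factorization, PA = LU

    x1 = lu_solve(lu_piv, b1)             # two O(n^2) triangular solves
    x2 = lu_solve(lu_piv, b2)             # same factors, new right-hand side
    y = lu_solve(lu_piv, c, trans=1)      # solves A^T y = c from the same factors

    print(np.linalg.norm(A @ x1 - b1), np.linalg.norm(A.T @ y - c))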
Another example is the computation of the scalar $\alpha = y^T A^{-1} x$, which can be rewritten $\alpha = y^T U^{-1} \cdot L^{-1} x$ (or $\alpha = y^T \cdot U^{-1} L^{-1} x$) and so again requires just two triangular solves and avoids the need to invert a matrix explicitly.

Finally, note that if $A = LU$ and $A^{-1} = (b_{ij})$ then

$$u_{nn}^{-1} = e_n^T U^{-1} e_n = e_n^T U^{-1} L^{-1} e_n = e_n^T A^{-1} e_n = b_{nn}.$$

Thus the reciprocal of $u_{nn}$ is an element of $A^{-1}$, and so we have the lower bound $\|A^{-1}\| \ge |u_{nn}^{-1}|$ for all the standard matrix norms.
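In the same spirit, a short sketch (with assumed test data) of computing $\alpha = y^T A^{-1} x$ via two triangular solves rather than an explicit inverse. With partial pivoting the factorization is $A = PLU$, so $A^{-1}x = U^{-1}(L^{-1}(P^T x))$; note that with pivoting $1/|u_{nn}|$ is still the magnitude of some element of $A^{-1}$, so the lower bound above continues to hold.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 6))
    x, y = rng.standard_normal((2, 6))

    P, L, U = lu(A)                                    # A = P @ L @ U
    z = solve_triangular(L, P.T @ x, lower=True, unit_diagonal=True)
    alpha = y @ solve_triangular(U, z)                 # y^T (A^{-1} x)

    print(alpha, y @ np.linalg.solve(A, x))            # agreement check
    print(1/abs(U[-1, -1]) <= np.linalg.norm(np.linalg.inv(A), np.inf))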
A very useful representation of the first stage of GE is

$$A = \begin{bmatrix} a_{11} & a^T \\ b & C \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ b/a_{11} & I_{n-1} \end{bmatrix} \begin{bmatrix} a_{11} & a^T \\ 0 & C - b a^T/a_{11} \end{bmatrix}, \qquad a, b \in \mathbb{R}^{n-1}, \quad C \in \mathbb{R}^{(n-1) \times (n-1)}.$$

The matrix $C - b a^T/a_{11}$ is the Schur complement of $a_{11}$ in $A$. More generally, it can be shown that the matrix $A^{(k)}_{22}$ in Eq. (1) can be expressed as $A^{(k)}_{22} = A_{22} - A_{21} A_{11}^{-1} A_{12}$, where $A_{ij} \equiv A^{(1)}_{ij}$; this is the Schur complement of $A_{11}$ in $A$. Various structures of $A$ can be shown to be inherited by the Schur complement (for example, symmetric positive definiteness and diagonal dominance), and this enables the proof of several interesting results about the LU factors (including some of those in the section 'Structured Matrices').

Explicit determinantal formulae exist for the elements of $L$ and $U$ (see, e.g., Ref 8, p. 11):

$$\ell_{ij} = \frac{\det A([1\colon j-1,\, i],\, 1\colon j)}{\det(A_j)}, \quad i \ge j, \qquad u_{ij} = \frac{\det A(1\colon i,\, [1\colon i-1,\, j])}{\det(A_{i-1})}, \quad i \le j.$$

Although elegant, these are of limited practical use.

PIVOTING AND NUMERICAL STABILITY

In practical computation, it is not just zero pivots that are unwelcome but also small pivots. The problem with small pivots is that they can lead to large multipliers $m_{ik}$. Indeed if $m_{ik}$ is large then there is a possible loss of significance in the subtraction $a^{(k)}_{ij} - m_{ik} a^{(k)}_{kj}$, with low-order digits of $a^{(k)}_{ij}$ being lost. Losing these digits could correspond to making a relatively large change to the original matrix $A$. The simplest example of this phenomenon is for the matrix

$$A = \begin{bmatrix} \epsilon & 1 \\ 1 & 1 \end{bmatrix},$$

where we assume $0 < \epsilon < u$. GE produces

$$\begin{bmatrix} \epsilon & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \epsilon^{-1} & 1 \end{bmatrix} \begin{bmatrix} \epsilon & 1 \\ 0 & 1 - \epsilon^{-1} \end{bmatrix} = LU.$$

In floating point arithmetic the factors are approximated by

$$fl(L) = \begin{bmatrix} 1 & 0 \\ \epsilon^{-1} & 1 \end{bmatrix} =: \widetilde{L}, \qquad fl(U) = \begin{bmatrix} \epsilon & 1 \\ 0 & -\epsilon^{-1} \end{bmatrix} =: \widetilde{U},$$

which would be the exact answer if we changed $a_{22}$ from 1 to 0. Hence $\widetilde{L}\widetilde{U} = A + \Delta A$ with $\|\Delta A\|_\infty / \|A\|_\infty = 1/2 \gg u$. This shows that for this matrix GE does not compute the LU factorization of a matrix close to $A$, which means that GE is behaving as an unstable algorithm.
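The following sketch reproduces this example numerically, with $\epsilon = 10^{-20} < u$: the product of the computed factors differs from $A$ in the (2, 2) element by 1, a backward error of order $\|A\|$ rather than $u\|A\|$.

    import numpy as np

    eps = 1e-20                                  # smaller than u ~ 1.1e-16
    A = np.array([[eps, 1.0], [1.0, 1.0]])

    # GE without pivoting: m21 = 1/eps, and fl(1 - 1/eps) rounds to -1/eps
    L = np.array([[1.0, 0.0], [1.0/eps, 1.0]])
    U = np.array([[eps, 1.0], [0.0, 1.0 - 1.0/eps]])

    print(np.abs(L @ U - A).max())               # ~1: an O(1) relative backward error
    # Partial pivoting would interchange the two rows first, giving |m21| <= 1.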
Three different pivoting strategies are available that attempt to avoid instability. All three strategies ensure that the multipliers are nicely bounded: $|m_{ik}| \le 1$, $i = k+1\colon n$.

Partial pivoting. At the start of the $k$th stage, the $k$th and $r$th rows are interchanged, where

$$|a^{(k)}_{rk}| := \max_{k \le i \le n} |a^{(k)}_{ik}|.$$

Thus an element of maximal magnitude in the pivot column is selected as pivot.

Complete pivoting. At the start of the $k$th stage, rows $k$ and $r$ and columns $k$ and $s$ are interchanged, where

$$|a^{(k)}_{rs}| := \max_{k \le i, j \le n} |a^{(k)}_{ij}|;$$

in other words, a pivot of maximal magnitude is chosen over the whole remaining submatrix.

Rook pivoting. At the start of the $k$th stage, rows $k$ and $r$ and columns $k$ and $s$ are interchanged, where

$$|a^{(k)}_{rs}| = \max_{k \le i \le n} |a^{(k)}_{is}| = \max_{k \le j \le n} |a^{(k)}_{rj}|;$$

in other words, a pivot is chosen that is the largest in magnitude in both its column (as for partial pivoting) and its row. The pivot search is done by repeatedly looking down a column and across a row for the largest element in modulus (Figure 1).

Partial pivoting requires $O(n^2)$ comparisons in total. Complete pivoting requires $O(n^3)$ comparisons, which is of the same order of magnitude as the arithmetic and so is a significant cost. The cost of rook pivoting is intermediate between the two and depends on the matrix.

The effect on the LU factorization of the row and column interchanges in these pivoting strategies can be captured in permutation matrices $P$ and $Q$; it can be shown that $PAQ = LU$ with a unit lower triangular $L$ and upper triangular $U$ (with $Q = I$ for partial pivoting). In other words, the triangular factors are those that would be obtained if all the interchanges were done at the start of the elimination and GE without pivoting were used.
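To make the rook search and the factorization $PAQ = LU$ concrete, here is a sketch; lu_rook and rook_search are our own illustrative helpers, not a library API.

    import numpy as np

    def rook_search(B):
        # Locate (r, s) with |B[r, s]| maximal in both its row and its column,
        # by alternately scanning down a column and across a row (cf. Figure 1).
        r, s = int(np.argmax(np.abs(B[:, 0]))), 0
        while True:
            s_new = int(np.argmax(np.abs(B[r, :])))
            if abs(B[r, s_new]) <= abs(B[r, s]):
                return r, s
            s = s_new
            r_new = int(np.argmax(np.abs(B[:, s])))
            if abs(B[r_new, s]) <= abs(B[r, s]):
                return r, s
            r = r_new

    def lu_rook(A):
        # GE with rook pivoting: returns P, Q, L, U with P @ A @ Q = L @ U.
        U = A.astype(float).copy()
        n = U.shape[0]
        L = np.eye(n)
        p, q = np.arange(n), np.arange(n)
        for k in range(n - 1):
            r, s = rook_search(U[k:, k:])
            r, s = r + k, s + k
            U[[k, r], :] = U[[r, k], :]; p[[k, r]] = p[[r, k]]   # row swap
            L[[k, r], :k] = L[[r, k], :k]                        # keep earlier multipliers aligned
            U[:, [k, s]] = U[:, [s, k]]; q[[k, s]] = q[[s, k]]   # column swap
            L[k+1:, k] = U[k+1:, k] / U[k, k]                    # multipliers, |m_ik| <= 1
            U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:])
        return np.eye(n)[p], np.eye(n)[:, q], L, np.triu(U)

    A = np.random.default_rng(2).standard_normal((5, 5))
    P, Q, L, U = lu_rook(A)
    print(np.allclose(P @ A @ Q, L @ U))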
FIGURE 1 | Illustration of how rook pivoting searches for the first pivot for a particular 6 × 6 matrix (with the positive integer entries shown). Each dot denotes a putative pivot that is tested to see if it is the largest in magnitude in both its row and its column. [Figure omitted; the matrix searched appears to be the array with rows (2, 1, 3, 2, 0, 1), (10, 5, 0, 2, 9, 13), (1, 2, 3, 14, 5, 3), (2, 3, 1, 2, 6, 4), (4, 5, 4, 1, 3, 0), (5, 6, 1, 0, 8, 1).]

In order to assess the success of these pivoting strategies in improving numerical stability we need a backward error analysis result. Such a result expresses the effects of all the rounding errors committed during the computation as an equivalent perturbation on the original data. Since we can assume $P = Q = I$ without loss of generality, the result is stated for GE without pivoting. This is the result of Wilkinson9 (which he originally proved for partial pivoting); for a modern proof see Ref 5, Thm. 9.3, Lemma 9.6.

Theorem 2 Let $A \in \mathbb{R}^{n \times n}$ and suppose GE produces a computed solution $\widehat{x}$ to $Ax = b$. Then

$$(A + \Delta A)\widehat{x} = b, \qquad \|\Delta A\|_\infty \le p(n)\,\rho_n\, u\, \|A\|_\infty,$$

where $p(n)$ is a cubic polynomial and the growth factor

$$\rho_n = \frac{\max_{i,j,k} |a^{(k)}_{ij}|}{\max_{i,j} |a_{ij}|}.$$

Ideally, we would like $\|\Delta A\|_\infty \le u \|A\|_\infty$, which reflects the uncertainty caused simply by rounding the elements of $A$. The growth factor $\rho_n \ge 1$ measures the growth of elements during the elimination. The cubic term $p(n)$ arises from many triangle inequalities in the proof and is pessimistic; replacing it by its square root gives a more realistic bound, but this term is in any case outside our control. The message of the theorem is that GE will be backward stable if $\rho_n$ is of order 1. A pivoting strategy should therefore aim to keep $\rho_n$ small.

If no pivoting is done $\rho_n$ can be arbitrarily large, as the matrix $A = \begin{bmatrix} \epsilon & 1 \\ 1 & 1 \end{bmatrix}$ ($0 < \epsilon < u$) considered earlier in this section shows. For partial pivoting, Wilkinson9 showed that $\rho_n \le 2^{n-1}$ and that this bound is attainable. For complete pivoting, Wilkinson9 showed that $\rho_n \le n^{1/2}(2 \cdot 3^{1/2} \cdots n^{1/(n-1)})^{1/2}$; however, this bound usually significantly overestimates the growth observed in practice, and it was long conjectured that $\rho_n \le n$ for complete pivoting on real matrices. The conjecture was shown to be false by Gould10 and Edelman,11 via a 13 × 13 matrix whose growth factor slightly exceeds 13. Research on certain aspects of the size of $\rho_n$ for complete pivoting is ongoing.12 Interestingly, $\rho_n \ge n$ for any Hadamard matrix (a matrix of 1s and −1s with orthogonal columns) and any pivoting strategy.13 For rook pivoting, the bound $\rho_n \le 1.5\, n^{(3/4)\log n}$ was obtained by Foster.14

In addition to the backward error, the relative error $\|x - \widehat{x}\| / \|x\|$ of the solution $\widehat{x}$ computed in floating point arithmetic is also of interest. A bound on the relative error is obtained by applying standard perturbation bounds for linear systems to Theorem 2 (Ref 5, Ch. 7). A typical bound is

$$\frac{\|x - \widehat{x}\|_\infty}{\|x\|_\infty} \le \frac{c_n u\, \kappa_\infty(A)}{1 - c_n u\, \kappa_\infty(A)},$$

where $\kappa_\infty(A) = \|A\|_\infty \|A^{-1}\|_\infty$ is the matrix condition number with respect to inversion and $c_n = p(n)\rho_n$.

STRUCTURED MATRICES

A great deal of research has been directed at specializing GE to take advantage of particular matrix structures and to proving properties of the LU factors and the growth factor. We give just a very brief overview. For further details of all these properties and results see Ref 5.

GE without pivoting exploits symmetry, in that if $A$ is symmetric then so are all the reduced matrices $A^{(k)}_{22}$ in Eq. (1), but symmetry does not by itself guarantee the existence or numerical stability of the LU factorization. If $A$ is also positive definite then GE succeeds (in light of Theorem 1) and the growth factor $\rho_n = 1$, so pivoting is not necessary. However, for symmetric positive definite matrices it is more common to use the Cholesky factorization $A = R^T R$, where $R$ is upper triangular with positive diagonal elements.15 For general symmetric indefinite matrices, factorizations of the form $PAP^T = LDL^T$ are used, where $P$ is a permutation matrix, $L$ is unit lower triangular, and $D$ is block diagonal with diagonal blocks of dimension 1 or 2. Several pivoting strategies are available for choosing $P$, of which one is a symmetric form of rook pivoting.
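A brief sketch of these symmetric alternatives, using SciPy's cholesky and ldl routines as stand-ins for the factorizations just described; the test matrices are arbitrary.

    import numpy as np
    from scipy.linalg import cholesky, ldl

    rng = np.random.default_rng(3)
    B = rng.standard_normal((5, 5))

    A_spd = B @ B.T + 5 * np.eye(5)        # symmetric positive definite
    R = cholesky(A_spd)                    # upper triangular, A = R^T R
    print(np.allclose(R.T @ R, A_spd))

    A_ind = B + B.T                        # symmetric but indefinite
    Lf, D, perm = ldl(A_ind)               # LDL^T with pivoting; D has 1x1/2x2 blocks
    print(np.allclose(Lf @ D @ Lf.T, A_ind))   # outer factor absorbs the permutation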
If $A$ is diagonally dominant by rows, that is,

$$\sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \le |a_{ii}|, \quad i = 1\colon n,$$

or $A$ is diagonally dominant by columns (that is, $A^T$ is diagonally dominant by rows), then it is safe not to use interchanges: the LU factorization without pivoting exists and the growth factor satisfies $\rho_n \le 2$.

If $A$ has bandwidth $p$, that is, $a_{ij} = 0$ for $|i - j| > p$, then in an LU factorization $L$ and $U$ also have bandwidth $p$ ($\ell_{ij} = 0$ for $i > j + p$ and $u_{ij} = 0$ for $j > i + p$). With partial pivoting the bandwidth is not preserved, but it is nevertheless true that in $PA = LU$ the upper triangular factor $U$ has bandwidth $2p$ and $L$ has at most $p + 1$ nonzeros per column; moreover, $\rho_n \le 2^{2p-1} - (p-1)2^{p-2}$. Tridiagonal matrices ($p = 1$) form an important special case. For general sparse matrices see Box 1.

BOX 1
SPARSE MATRICES

A matrix is sparse if it has a sufficiently large number of zero entries that it is worth taking advantage of them in storing the matrix and in computing with it. When GE is applied to a sparse matrix it can produce fill-in, which occurs when a zero entry becomes nonzero. Depending on the matrix there may be no fill-in (as for a tridiagonal matrix), total fill-in (e.g., for a sparse matrix with a full first row and column), or something in between. Various techniques are available for re-ordering the rows and columns in order to reduce fill-in. Since numerical stability is also an issue, these techniques must be combined with a strategy for ensuring that the pivots are sufficiently large. Modern techniques allow sparse GE to be successfully applied to extremely large matrices. For an up to date treatment that includes C codes see Ref 32. When the necessary memory or computation time for GE to solve $Ax = b$ becomes prohibitive we must resort to iterative methods, which typically require just the ability to compute matrix–vector products with $A$ (and possibly its transpose).33

A matrix is totally nonnegative if the determinant of every square submatrix is nonnegative. The Hilbert matrix $(1/(i + j - 1))$ is an example of such a matrix. If $A$ is nonsingular and totally nonnegative then it has an LU factorization $A = LU$ in which $L$ and $U$ are totally nonnegative, so that in particular $L$ and $U$ have nonnegative elements. Moreover, the growth factor $\rho_n = 1$. More importantly, for such matrices it can be shown that a much stronger componentwise form of Theorem 2 holds, with $|\Delta a_{ij}| \le c_n u |a_{ij}|$ for all $i$ and $j$, where $c_n \approx 3n$.

ALGORITHMS

In principle, GE is computationally straightforward. It can be expressed in pseudocode as follows:

    for k = 1:n−1
        for i = k+1:n
            m_ik = a_ik/a_kk
        end
        for j = k+1:n
            for i = k+1:n
                a_ij = a_ij − m_ik a_kj
            end
        end
    end

Here, pivoting has been omitted, and at the end of the computation the upper triangle of $A$ contains the upper triangular factor $U$ and the elements of $L$ are the $m_{ij}$. Incorporating partial pivoting, and forming the permutation matrix $P$ such that $PA = LU$, is straightforward.

There are 3! ways of ordering the three nested loops in this pseudocode, but not all are of equal efficiency for computer implementation. The kji ordering shown above forms the basis of early Fortran implementations of GE such as those in Refs 16, 17, and the LINPACK package18: the inner loop traverses the columns of $A$, which matches the order in which Fortran stores the elements of two-dimensional arrays.
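The pseudocode translates directly into the following sketch (our own helper, no pivoting), with the inner $i$ loop vectorized.

    import numpy as np

    def lu_no_pivoting(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        for k in range(n - 1):
            L[k+1:, k] = A[k+1:, k] / A[k, k]        # multipliers m_ik
            for j in range(k, n):                    # kji ordering: column sweeps
                A[k+1:, j] -= L[k+1:, k] * A[k, j]
        return L, np.triu(A)

    A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
    L, U = lu_no_pivoting(A)
    print(np.allclose(L @ U, A))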
The hierarchical computer memories prevalent from the 1980s onwards led to the need to modify implementations of GE in order to maintain good efficiency: now the loops must be broken up into pieces, leading to partitioned (or blocked) algorithms. For a given block size $r > 1$, we can derive a partitioned algorithm by writing

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I_{n-r} \end{bmatrix} \begin{bmatrix} I_r & 0 \\ 0 & S \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & I_{n-r} \end{bmatrix},$$

where $A_{11}, L_{11}, U_{11} \in \mathbb{R}^{r \times r}$. Ignoring pivoting, the idea is to compute an LU factorization $A_{11} = L_{11}U_{11}$ (by whatever means), solve the multiple right-hand side triangular systems $L_{21}U_{11} = A_{21}$ and $L_{11}U_{12} = A_{12}$ for $L_{21}$ and $U_{12}$, respectively, form $S = A_{22} - L_{21}U_{12}$, and apply the same process to $S$ to obtain $L_{22}$ and $U_{22}$.
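A sketch of the partitioned algorithm just described (no pivoting, recursing on $S$ with $r$ fixed, and re-using the lu_no_pivoting helper from the previous sketch); choosing $r \approx n/2$ instead gives the recursive variant mentioned below.

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, r=2):
        A = A.astype(float).copy()
        n = A.shape[0]
        if n <= r:
            return lu_no_pivoting(A)                 # unblocked base case
        L11, U11 = lu_no_pivoting(A[:r, :r])
        U12 = solve_triangular(L11, A[:r, r:], lower=True, unit_diagonal=True)
        L21 = solve_triangular(U11, A[r:, :r].T, trans=1).T   # solves L21 U11 = A21
        S = A[r:, r:] - L21 @ U12                    # Schur complement: level 3 work
        L22, U22 = blocked_lu(S, r)
        L = np.block([[L11, np.zeros((r, n - r))], [L21, L22]])
        U = np.block([[U11, U12], [np.zeros((n - r, r)), U22]])
        return L, U

    A = np.random.default_rng(4).standard_normal((6, 6)) + 6 * np.eye(6)
    L, U = blocked_lu(A, r=2)
    print(np.allclose(L @ U, A))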
The computations yielding $L_{21}$, $U_{12}$, and $S$ are all matrix–matrix operations and can be carried out using level 3 BLAS,19,20 for which highly optimized implementations are available for most machines. The optimal choice of the block size $r$ depends on the particular computer architecture. It is important to realize that this partitioned algorithm is mathematically equivalent to any other variant of GE: it does the same operations in a different order, but one that reduces the amount of data movement among different levels of the computer memory hierarchy. In an attempt to extract even better performance, recursive algorithms of this form with $r \approx n/2$ have also been developed.21,22

We mention two very active areas of current research in GE, and more generally in dense linear algebra computations, both of which are aiming to extend the capabilities of the state of the art package LAPACK23 to shared memory computers based on multicore processor architectures. The first is aimed at developing parallel algorithms that run efficiently on systems with multiple sockets of multicore processors. A key goal is to minimize the amount of communication between processors, since on such evolving architectures communication costs are increasingly significant relative to the costs of floating point arithmetic. A second area of research aims to exploit graphics processing units (GPUs) in conjunction with multicore processors. GPUs have the ability to perform floating point arithmetic at very high parallelism and are relatively inexpensive. Current projects addressing these areas include the PLASMA (http://icl.cs.utk.edu/plasma) and MAGMA (http://icl.cs.utk.edu/magma) projects. Representative papers are Refs 24, 25. Further activity is concerned with algorithms for distributed memory machines, aiming to improve upon those in the ScaLAPACK library26; see, for example, Ref 27.

See Box 2 for the role GE plays in the TOP500 ranking of the world's fastest computers.

BOX 2
TOP500

The TOP500 list (http://www.top500.org) ranks the world's fastest computers by their performance on the LINPACK benchmark,34 which solves a random linear system $Ax = b$ by an implementation of GE for parallel computers written in C and MPI. Performance is measured by the floating point execution rate, counted in floating point operations (flops) per second. The user is allowed to tune the code to obtain the best performance, by varying parameters such as the dimension $n$, the block size, the processor grid size, and so on. However, the computed solution $\widehat{x}$ must produce a small residual in order for the result to be valid, in the sense that $\|b - A\widehat{x}\|_\infty / (u \|A\|_\infty \|\widehat{x}\|_\infty)$ is of order 1. This benchmark has its origins in the LINPACK project,18 in which the performance of contemporary machines was compared by running the LINPACK GE code dgefa on a 100 × 100 system.

BLOCK LU FACTORIZATION

At each stage of GE a pivot element is used to eliminate elements below the diagonal in the pivot column. This notion can be generalized to use a pivot block to eliminate all elements below that block. For example, consider the factorization

$$A = \begin{bmatrix} 0 & 1 & 1 & 1 \\ -1 & 1 & 1 & 1 \\ -2 & 3 & 4 & 2 \\ -1 & 2 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 & 1 & 1 \\ -1 & 1 & 1 & 1 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 1 \end{bmatrix} \equiv L_1 U_1.$$

GE without pivoting fails on $A$ because of the zero (1, 1) pivot. The displayed factorization corresponds to using the leading $2 \times 2$ principal submatrix of $A$ to eliminate the elements in the (3:4, 1:2) submatrix.
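This factorization is easy to verify numerically; the following sketch rebuilds $L_1$ and $U_1$ from the block formulas, using the leading $2 \times 2$ block $A_{11}$ as the pivot block.

    import numpy as np

    A = np.array([[ 0., 1., 1., 1.],
                  [-1., 1., 1., 1.],
                  [-2., 3., 4., 2.],
                  [-1., 2., 1., 3.]])

    L21 = A[2:, :2] @ np.linalg.inv(A[:2, :2])   # equals [[1, 2], [1, 1]]
    S = A[2:, 2:] - L21 @ A[:2, 2:]              # Schur complement of A11 in A
    L1 = np.block([[np.eye(2), np.zeros((2, 2))], [L21, np.eye(2)]])
    U1 = np.block([[A[:2, :2], A[:2, 2:]], [np.zeros((2, 2)), S]])

    print(np.allclose(L1 @ U1, A))               # reproduces the factorization above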
In the context of a linear system $Ax = b$, we have effectively solved for the variables $x_1$ and $x_2$ in terms of $x_3$ and $x_4$ and then substituted for $x_1$ and $x_2$ in the last two equations. This is the key idea underlying block Gaussian elimination, or block LU factorization. In general, for a given partitioning $A = (A_{ij})_{i,j=1}^{m}$ with the diagonal blocks $A_{ii}$ square (but not necessarily all of the same dimension), a block LU factorization has the form

$$A = \begin{bmatrix} I & & & \\ L_{21} & I & & \\ \vdots & & \ddots & \\ L_{m1} & \cdots & L_{m,m-1} & I \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} & \cdots & U_{1m} \\ & U_{22} & & \vdots \\ & & \ddots & U_{m-1,m} \\ & & & U_{mm} \end{bmatrix} \equiv LU,$$

where $L$ and $U$ are block triangular but $U$ is not necessarily triangular. This is in general different from the usual LU factorization.
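For a uniform block size $r$, the process can be sketched as a loop (block_lu is an illustrative helper; it assumes, as Theorem 3 below requires, that the leading principal block submatrices are nonsingular).

    import numpy as np

    def block_lu(A, r):
        # The pivot block acts through its inverse, so U keeps full,
        # non-triangular r x r diagonal blocks.
        U = A.astype(float).copy()
        n = U.shape[0]
        L = np.eye(n)
        for k in range(0, n - r, r):
            L[k+r:, k:k+r] = U[k+r:, k:k+r] @ np.linalg.inv(U[k:k+r, k:k+r])
            U[k+r:, k:] -= L[k+r:, k:k+r] @ U[k:k+r, k:]
        return L, U

    A = np.random.default_rng(5).standard_normal((6, 6)) + 6 * np.eye(6)
    L, U = block_lu(A, 2)
    print(np.allclose(L @ U, A))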
A less restrictive analog of Theorem 1 holds (Ref 5, Thm. 13.2).

Theorem 3 The matrix $A = (A_{ij})_{i,j=1}^{m} \in \mathbb{R}^{n \times n}$ has a unique block LU factorization if and only if the first $m - 1$ leading principal block submatrices of $A$ are nonsingular.

The numerical stability of block LU factorization is less satisfactory than for the usual LU factorization. However, if $A$ is diagonally dominant by columns, or block diagonally dominant by columns in the sense that

$$\|A_{jj}^{-1}\|^{-1} - \sum_{\substack{i=1 \\ i \ne j}}^{m} \|A_{ij}\| \ge 0, \quad j = 1\colon m,$$

then the factorization can be shown to be numerically stable (Ref 5, Ch. 13). Block LU factorization is motivated by the desire to maximize efficiency on modern computers through the use of matrix–matrix operations. It has also been widely used for block tridiagonal matrices arising in the discretization of partial differential equations.

ITERATIVE REFINEMENT

Iterative refinement is a procedure for improving a computed solution $\widehat{x}$ to a linear system $Ax = b$, usually one computed by GE. The process repeats the three steps

1. Compute $r = b - A\widehat{x}$.
2. Solve $Ad = r$.
3. Update $\widehat{x} \leftarrow \widehat{x} + d$.

In the absence of rounding errors, $\widehat{x}$ is the exact solution to the system after one iteration of the three steps. In practice, rounding errors vitiate all three steps and the process is iterative. For $\widehat{x}$ computed by GE, the system $Ad = r$ is solved using the LU factorization already computed, so each iteration requires only $O(n^2)$ flops.

Iterative refinement was popular in the 1960s and 1970s, when it was implemented with the residual $r$ computed at twice the working precision, which we call mixed precision iterative refinement. On some machines of that era it was possible to accumulate inner products in extra precision in hardware, making implementation of the process easy. From the 1980s onward, computing extra precision residuals became problematic and this spurred research into fixed precision iterative refinement, where only one precision is used throughout. In the last few years mixed precision iterative refinement has come back into favor, partly because modern processors either have extra precision registers or can perform arithmetic in single precision much faster than in double precision, but also because standardized routines for extra precision computation are now available.28–30

The following theorem summarizes the benefits iterative refinement brings to the forward error (Ref 5, Sect. 12.1).

Theorem 4 Let iterative refinement be applied to the nonsingular linear system $Ax = b$ in conjunction with GE with partial pivoting. Provided $A$ is not too ill conditioned, iterative refinement reduces the forward error at each stage until it produces an $\widehat{x}$ for which

$$\frac{\|x - \widehat{x}\|_\infty}{\|x\|_\infty} \approx \begin{cases} u, & \text{for mixed precision}, \\ \mathrm{cond}(A, x)\, u, & \text{for fixed precision}, \end{cases}$$

where $\mathrm{cond}(A, x) = \|\, |A^{-1}|\,|A|\,|x| \,\|_\infty / \|x\|_\infty$.

This theorem tells only part of the story. Under suitable assumptions, iterative refinement leads to a small componentwise backward error, as first shown by Skeel,31 even for fixed precision refinement. For the definition of componentwise backward error and further details, see Ref 5, Sect. 12.2.
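A sketch of fixed precision refinement, with each correction equation solved from the stored LU factors (SciPy routines assumed; in the mixed precision variant the residual on the first line of the loop would be accumulated at twice the working precision).

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def refine(A, b, iters=3):
        lu_piv = lu_factor(A)            # GE with partial pivoting, O(n^3) once
        x = lu_solve(lu_piv, b)
        for _ in range(iters):
            r = b - A @ x                # step 1: residual
            d = lu_solve(lu_piv, r)      # step 2: correction, O(n^2) per iteration
            x = x + d                    # step 3: update
        return x

    A = np.random.default_rng(6).standard_normal((8, 8))
    b = A @ np.ones(8)
    print(np.linalg.norm(refine(A, b) - np.ones(8)))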
CONCLUSION

GE with partial pivoting continues to be the standard numerical method for solving linear systems that are not so large that considerations of computational cost or storage dictate the use of iterative methods. The first computer program for GE with partial pivoting was probably that of Wilkinson35 (his code implemented iterative refinement too). It is perhaps surprising that it is still not understood why the numerical stability of this method is so good in practice, or equivalently why large element growth with partial pivoting is not seen in practical computations.

This overview has omitted a number of GE-related topics, including

• row or column scaling (or equilibration),
• Gauss–Jordan elimination, in which at each stage of the elimination elements both above and below the diagonal are eliminated, and which is principally used as a method for matrix inversion,
• variants of GE motivated by parallel computing, such as pairwise elimination, in which eliminations are carried out between adjacent rows only,
• analyzing the extent to which (when computed in floating point arithmetic) an LU factorization reveals the rank of $A$,
• the sensitivity of the LU factors to perturbations in $A$.

For more on these topics see Ref 5 and the references therein.

REFERENCES

1. Lay-Yong L, Kangshen S. Methods of solving linear equations in traditional China. Hist Math 1989, 16:107–122.
2. Gauss CF. Theory of the Combination of Observations Least Subject to Errors. Part One, Part Two, Supplement. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1995, ISBN:0-89871-347-1. Translated from the Latin originals (1821–1828) by Stewart GW.
3. Grcar JF. How ordinary elimination became Gaussian elimination. Hist Math 2011, 38:163–218.
4. Stewart GW. Gauss, statistics, and Gaussian elimination. J Comput Graph Stat 1995, 4:1–11.
5. Higham NJ. Accuracy and Stability of Numerical Algorithms. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2002, ISBN:0-89871-521-0.
6. Golub GH, Van Loan CF. Matrix Computations. 3rd ed. Baltimore, MD: Johns Hopkins University Press; 1996, ISBN:0-8018-5413-X (hardback), 0-8018-5414-8 (paperback).
7. Stewart GW. The decompositional approach to matrix computation. Comput Sci Eng 2000, 2:50–59.
8. Householder AS. The Theory of Matrices in Numerical Analysis. New York: Blaisdell; 1964, ISBN:0-486-61781-5. Reprinted by New York: Dover, 1975.
9. Wilkinson JH. Error analysis of direct methods of matrix inversion. J Assoc Comput Mach 1961, 8:281–330.
10. Gould NIM. On growth in Gaussian elimination with complete pivoting. SIAM J Matrix Anal Appl 1991, 12:354–361.
11. Edelman A. The complete pivoting conjecture for Gaussian elimination is false. Mathematica J 1992, 2:58–61.
12. Kravvaritis C, Mitrouli M. The growth factor of a Hadamard matrix of order 16 is 16. Numer Linear Algebra Appl 2009, 16:715–743.
13. Higham NJ, Higham DJ. Large growth factors in Gaussian elimination with pivoting. SIAM J Matrix Anal Appl 1989, 10:155–164.
14. Foster LV. The growth factor and efficiency of Gaussian elimination with rook pivoting. J Comput Appl Math 1997, 86:177–194. Corrigendum in J Comput Appl Math 1998, 98:177.
15. Higham NJ. Cholesky factorization. WIREs Comput Stat 2009, 1:251–254.
16. Forsythe GE, Moler CB. Computer Solution of Linear Algebraic Systems. Englewood Cliffs, NJ: Prentice-Hall; 1967.
17. Moler CB. Matrix computations with Fortran and paging. Commun ACM 1972, 15:268–270.
18. Dongarra JJ, Bunch JR, Moler CB, Stewart GW. LINPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1979, ISBN:0-89871-172-X.
19. Dongarra JJ, Du Croz JJ, Duff IS, Hammarling SJ. A set of Level 3 basic linear algebra subprograms. ACM Trans Math Softw 1990, 16:1–17.
20. Dongarra JJ, Du Croz JJ, Duff IS, Hammarling SJ. Algorithm 679. A set of Level 3 basic linear algebra subprograms: model implementation and test programs. ACM Trans Math Softw 1990, 16:18–28.
21. Gustavson FG. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM J Res Dev 1997, 41:737–755.
22. Toledo S. Locality of reference in LU decomposition with partial pivoting. SIAM J Matrix Anal Appl 1997, 18:1065–1081.
23. Anderson E, Bai Z, Bischof CH, Blackford S, Demmel JW, Dongarra JJ, Du Croz JJ, Greenbaum A, Hammarling SJ, McKenney A, et al.
LAPACK Users' Guide. 3rd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1999, ISBN:0-89871-447-8.
24. Buttari A, Langou J, Kurzak J, Dongarra J. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput 2009, 35:38–53.
25. Tomov S, Dongarra J, Baboulin M. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Comput 2010, 36:232–240.
26. Blackford LS, Choi J, Cleary A, D'Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, et al. ScaLAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1997, ISBN:0-89871-397-8.
27. Grigori L, Demmel JW, Xiang H. Communication avoiding Gaussian elimination. SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing.
Piscataway, NJ: IEEE Press; 2008, 1–12, ISBN:978-1-4244-2835-9.
28. Basic Linear Algebra Subprograms Technical (BLAST) Forum. Basic Linear Algebra Subprograms Technical (BLAST) forum standard II. Int J High Perform Comput Appl 2002, 16:115–199.
29. Baboulin M, Buttari A, Dongarra J, Kurzak J, Langou J, Langou J, Luszczek P, Tomov S. Accelerating scientific computations with mixed precision algorithms. Comput Phys Commun 2009, 180:2526–2533.
30. Demmel JW, Hida Y, Kahan W, Li XS, Mukherjee S, Riedy EJ. Error bounds from extra precise iterative refinement. ACM Trans Math Softw 2006, 32:325–351.
31. Skeel RD. Iterative refinement implies numerical stability for Gaussian elimination. Math Comput 1980, 35:817–832.
32. Davis TA. Direct Methods for Sparse Linear Systems. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2006, ISBN:0-89871-613-6.
33. Saad Y. Iterative Methods for Sparse Linear Systems. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2003, ISBN:0-89871-534-2.
34. Dongarra JJ, Luszczek P, Petitet A. The LINPACK benchmark: past, present and future. Concurr Comput: Pract Exper 2003, 15:803–820.
35. Wilkinson JH. Progress report on the Automatic Computing Engine, Report MA/17/1024, Mathematics Division, Department of Scientific and Industrial Research, National Physical Laboratory, Teddington, UK, 1948.