L21, U12, and S are all matrix–matrix operations and can be carried out using level 3 BLAS,19,20 for which highly optimized implementations are available for most machines. The optimal choice of the block size r depends on the particular computer architecture. It is important to realize that this partitioned algorithm is mathematically equivalent to any other variant of GE: it does the same operations in a different order, but one that reduces the amount of data movement among different levels of the computer memory hierarchy. In an attempt to extract even better performance, recursive algorithms of this form with r ≈ n/2 have also been developed.21,22

We mention two very active areas of current research in GE, and more generally in dense linear algebra computations, both of which aim to extend the capabilities of the state-of-the-art package LAPACK23 to shared memory computers based on multicore processor architectures. The first is aimed at developing parallel algorithms that run efficiently on systems with multiple sockets of multicore processors. A key goal is to minimize the amount of communication between processors, since on such evolving architectures communication costs are increasingly significant relative to the costs of floating point arithmetic. A second area of research aims to exploit graphics processing units (GPUs) in conjunction with multicore processors. GPUs have the ability to perform floating point arithmetic at very high parallelism and are relatively inexpensive. Current projects addressing these areas include the PLASMA (http://icl.cs.utk.edu/plasma) and MAGMA (http://icl.cs.utk.edu/magma) projects. Representative papers are Refs 24, 25.
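To make the structure of the partitioned algorithm concrete, here is a minimal NumPy/SciPy sketch of a right-looking blocked LU factorization. It is an illustration under stated assumptions, not code from the article: the function name blocked_lu and the default block size are invented, and pivoting is omitted for brevity (library routines such as LAPACK's getrf of course pivot). Steps 1–3 below correspond to the panel factorization, the triangular solve for the block row of U, and the trailing-submatrix update.

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, r=64):
    """Right-looking partitioned LU without pivoting (illustration only).

    Returns (L, U) with L unit lower triangular and U upper triangular,
    so that L @ U reproduces A. Almost all of the O(n^3) work is done in
    step 3, a single matrix-matrix product per block step (level 3 BLAS).
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, n, r):
        e = min(k + r, n)
        # 1. Unblocked LU of the panel A[k:n, k:e] (level 2 BLAS work).
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]                 # multipliers: column of L
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # 2. Block row of U: solve L11 @ U12 = A12, L11 unit triangular.
            A[k:e, e:] = solve_triangular(A[k:e, k:e], A[k:e, e:],
                                          lower=True, unit_diagonal=True)
            # 3. Trailing-submatrix update: A22 <- A22 - L21 @ U12.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return np.tril(A, -1) + np.eye(n), np.triu(A)

# Quick check; with no pivoting we keep the pivots safely away from zero
# by making the test matrix diagonally dominant.
rng = np.random.default_rng(0)
n = 200
M = rng.standard_normal((n, n)) + n * np.eye(n)
L, U = blocked_lu(M, r=32)
assert np.allclose(L @ U, M)
```

In LAPACK's getrf the same three steps appear, with partial pivoting, as calls to getf2, trsm, and gemm, which is why the bulk of the flops land in optimized level 3 BLAS.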
Further activity is concerned with algorithms for distributed memory machines, aiming to improve upon those in the ScaLAPACK library26; see, for example, Ref 27. See Box 2 for the role GE plays in the TOP500 ranking of the world's fastest computers.

BOX 2

TOP500

The TOP500 list (http://www.top500.org) ranks the world's fastest computers by their performance on the LINPACK benchmark,34 which solves a random linear system Ax = b by an implementation of GE for parallel computers written in C and MPI. Performance is measured by the floating point execution rate, counted in floating point operations (flops) per second. The user is allowed to tune the code to obtain the best performance, by varying parameters such as the dimension n, the block size, the processor grid size, and so on. However, the computed solution x̂ must produce a small residual in order for the result to be valid, in the sense that $\|b - A\hat{x}\|_\infty / (u \|A\|_\infty \|\hat{x}\|_\infty)$ is of order 1, where u is the unit roundoff. This benchmark has its origins in the LINPACK project,18 in which the performance of contemporary machines was compared by running the LINPACK GE code dgefa on a 100 × 100 system.

BLOCK LU FACTORIZATION

At each stage of GE a pivot element is used to eliminate elements below the diagonal in the pivot column. This notion can be generalized to use a pivot block to eliminate all elements below that block. For example, consider the factorization

$$
A =
\begin{pmatrix}
 0 & 1 & 1 & 1 \\
-1 & 1 & 1 & 1 \\
-2 & 3 & 4 & 2 \\
-1 & 2 & 1 & 3
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1 & 2 & 1 & 0 \\
1 & 1 & 0 & 1
\end{pmatrix}
\begin{pmatrix}
 0 & 1 &  1 &  1 \\
-1 & 1 &  1 &  1 \\
 0 & 0 &  1 & -1 \\
 0 & 0 & -1 &  1
\end{pmatrix}
\equiv L_1 U_1.
$$

GE without pivoting fails on A because of the zero (1, 1) pivot. The displayed factorization corresponds to using the leading 2 × 2 principal submatrix of A to eliminate the elements in the (3: 4, 1: 2) submatrix. In the context of a linear system Ax = b, we have effectively solved for the variables x1 and x2 in terms of x3 and x4 and then substituted for x1 and x2 in the last two equations. This is the key idea underlying block Gaussian elimination, or block LU factorization. In general, for a given partitioning $A = (A_{ij})_{i,j=1}^{m}$ with the diagonal blocks Aii square (but not necessarily all of the same dimension), a block LU factorization has the form

$$
A =
\begin{pmatrix}
I      &        &           &   \\
L_{21} & I      &           &   \\
\vdots &        & \ddots    &   \\
L_{m1} & \cdots & L_{m,m-1} & I
\end{pmatrix}
\begin{pmatrix}
U_{11} & U_{12} & \cdots & U_{1m} \\
       & U_{22} & \ddots & \vdots \\
       &        & \ddots & U_{m-1,m} \\
       &        &        & U_{mm}
\end{pmatrix}
\equiv LU,
$$
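The 2 × 2 pivot-block step in the example above can be reproduced in a few lines. The following NumPy sketch (an illustration, not code from the article) forms the block of multipliers L21 = A21 A11⁻¹ and the Schur complement S = A22 − L21 A12, recovering exactly the factors L1 and U1 displayed above.

```python
import numpy as np

# The 4x4 example from the text: GE without pivoting fails on A because
# of the zero (1,1) pivot, but the leading 2x2 block A11 is nonsingular
# and can serve as a pivot block.
A = np.array([[ 0., 1., 1., 1.],
              [-1., 1., 1., 1.],
              [-2., 3., 4., 2.],
              [-1., 2., 1., 3.]])

A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

# Block of multipliers and Schur complement. (In practice one would
# solve with LU factors of A11 rather than form its inverse explicitly.)
L21 = A21 @ np.linalg.inv(A11)      # [[1., 2.], [1., 1.]]
S   = A22 - L21 @ A12               # [[1., -1.], [-1., 1.]]

L1 = np.eye(4)
L1[2:, :2] = L21
U1 = np.block([[A11, A12],
               [np.zeros((2, 2)), S]])

assert np.allclose(L1 @ U1, A)      # reproduces the displayed factorization
```

Note that U1 is only block upper triangular: its leading 2 × 2 diagonal block is A11 itself rather than a triangular matrix, which is what distinguishes a block LU factorization from an ordinary LU factorization.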