[Figure 5. LBM Global Load Access Patterns: (a) original load address pattern; (b) optimized load address pattern. Each panel shows the addresses touched by threads th0 through th15 for Load 1 and Load 2.]

…that could be scheduled, exposing the latency of texture memory. Even so, kernel performance improves by 2.8X over global-only access through the use of texture memory.
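As a rough illustration of the access change involved, and not the paper's actual LBM kernel, the sketch below contrasts a plain strided global load with the same read routed through the legacy texture-reference API of early CUDA releases (since removed from modern CUDA). The names srcTex, gatherGlobal, and gatherTexture are hypothetical.

```cuda
// Sketch: serving a read-only, poorly coalesced access pattern through the
// texture cache instead of plain global loads (legacy CUDA texture API).
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> srcTex;  // bound to the input array

__global__ void gatherGlobal(const float* src, float* dst, int stride, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i * stride];  // strided global load: uncoalesced on G80
}

__global__ void gatherTexture(float* dst, int stride, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = tex1Dfetch(srcTex, i * stride);  // same read via texture cache
}

// Host side: bind the device array to the texture reference before launching:
//   cudaBindTexture(0, srcTex, d_src, n * stride * sizeof(float));
```

The texture path can tolerate access patterns that defeat global-memory coalescing, but the fetched values occupy additional registers, consistent with the scheduling effect described above.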
Loop unrolling and other "classic" compiler optimizations can have unexpected results, but in general, local optimizations on the most frequently executed parts of the code have beneficial effects. The benefit comes from reducing the number of operations or from strength reduction of individual operations such as integer multiply, thus increasing overall computational efficiency. In H.264, complete unrolling of the innermost loop obtained a significant performance increase, as did register tiling [10] for the next two higher-level loops.

The most common case of compiler optimizations having negative effects is when they increase the number of registers per thread as a side effect, forcing the GeForce 8800 to schedule fewer thread blocks per SM and thus degrading performance. The optimizations for which this is most often seen are common subexpression elimination and redundant load elimination, the latter often storing thread and block coordinates in registers. Even relatively simple instruction scheduling can change the live ranges of variables and increase register usage. Register pressure-sensitive code scheduling algorithms and optimization strategies have been investigated in the context of instruction-level parallelism compilers; additional research is needed to apply these strategies to massively threaded environments like CUDA. We will address the control of register usage in future work.
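To make the two transformations named above concrete, the sketch below shows complete unrolling of an innermost loop combined with register tiling, using a small 4x4 sum-of-absolute-differences computation as a stand-in. This is a minimal illustration under assumed names (sad4x4, cur, ref, and the strides), not the paper's actual H.264 kernel.

```cuda
// Minimal sketch of inner-loop unrolling plus register tiling; the 4x4
// sum-of-absolute-differences shape is illustrative, not the paper's kernel.
__device__ int sad4x4(const unsigned char* cur, const unsigned char* ref,
                      int curStride, int refStride) {
    int sad = 0;
    #pragma unroll              // ask nvcc to fully unroll the row loop
    for (int y = 0; y < 4; ++y) {
        // Register tile: load the row once into scalars so each value is
        // read from memory a single time and reused from registers.
        int c0 = cur[y * curStride + 0], r0 = ref[y * refStride + 0];
        int c1 = cur[y * curStride + 1], r1 = ref[y * refStride + 1];
        int c2 = cur[y * curStride + 2], r2 = ref[y * refStride + 2];
        int c3 = cur[y * curStride + 3], r3 = ref[y * refStride + 3];
        sad += abs(c0 - r0) + abs(c1 - r1) + abs(c2 - r2) + abs(c3 - r3);
    }
    return sad;
}
```

The trade-off discussed above is visible here: every register-tiled scalar raises the per-thread register count. When that count would otherwise reduce the number of thread blocks scheduled per SM, nvcc's -maxrregcount flag can cap registers per thread, at the cost of possible spills.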
6. Conclusion and Future Work

We present a performance evaluation of the GeForce 8800 GTX architecture using CUDA. Although its primary market is graphics processing, this GPU is also capable of impressive performance on a set of disparate non-graphics applications. This work presents general principles for optimizing applications for this type of architecture, namely writing efficient code, utilizing many threads to hide latency, and using local memories to alleviate pressure on global memory bandwidth. We also present an application suite that has been ported to this architecture, showing that application kernels with low global memory access after optimization achieve substantial speedups over CPU execution if they are not limited by local resource availability.

We are currently performing research on automated optimizations for this architecture. Although many of the optimizations are classical ones, their effects on this architecture can differ from their effects on traditional superscalar processors. It is also possible to get stuck in local maxima of performance when attempting to follow a particular optimization strategy; these maxima may be significantly lower than the peak achievable performance. Better tools and compilers that allow programmers to specify the types of reorganizations desired, and that automatically experiment with their performance effects, would greatly reduce the optimization effort. In addition, two updated versions of CUDA have been released between the original and final submission of this paper, changing the resource usage and optimal configurations of many applications. We are exploring methods to preserve or enhance the performance of applications when shifts in the underlying architecture or runtime occur.

Acknowledgments

The authors thank John Nickolls, Mark Harris, and Michael Cox at NVIDIA for their advice on drafts of the paper. We also thank Sanjay Patel, Kurt Akeley, Pradeep Dubey, John Kelm, Hillery Hunter, and the anonymous reviewers for their feedback. We thank the Spring 2007 class of ECE 498AL at the University of Illinois at Urbana-Champaign for their work on initial versions of many of the applications presented in this paper. We also thank the other members of the IMPACT research group for their support.

Sam Stone is supported under a National Science Foundation Graduate Research Fellowship. This work was supported by the Gigascale Systems Research Center, funded under the Focus Center Research Program, a Semiconductor Research Corporation program. Experiments were made possible by generous hardware loans from NVIDIA and by NSF CNS grant 05-51665. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF.

References

[1] AMD Stream Processor. http://ati.amd.com/products/streamprocessor/index.html.
[2] CUDA benchmark suite. http://www.crhc.uiuc.edu/impact/cudabench.html.
[3] NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html.
[4] The PeakStream platform: High productivity software development for multi-core processors. Technical report, 2006.
[5] ECE 498AL1: Programming massively parallel processors, Fall 2007. http://courses.ece.uiuc.edu/ece498/al1/.
[6] J. C. Adams, W. S. Brainerd, J. T. Martin, B. T. Smith, and J. L. Wagener. Fortran 90 Handbook: Complete ANSI/ISO Reference. Intertext Publications, Inc./McGraw-Hill, Inc., 1992.
[7] R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491–542, 1987.
[8] M. J. Atallah, editor. Algorithms and Theory of Computation Handbook. CRC Press LLC, 1998.
[9] I. Buck. Brook Specification v0.2, October 2003.
[10] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. ACM SIGPLAN Notices, 39(4):328–342, 2004.
[11] K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 133–137, 2004.
[12] N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A memory model for scientific algorithms on graphics processors. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, number 89, 2006.
[13] K. Kennedy and J. R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., 2002.
[14] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63–74, April 1991.
[15] D. B. Loveman. High Performance Fortran. IEEE Parallel & Distributed Technology: Systems & Technology, 1(1):25–42, 1993.
[16] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a system for programming graphics hardware in a C-like language. In ACM SIGGRAPH 2003 Papers, pages 896–907, 2003.