allows more tuning of application performance but changes the assumptions developers can make when performing optimizations. We discuss these issues in further detail.

Another question we address is how well applications can execute on the GeForce 8800 and what design features contribute to or limit performance. As a collaborative effort between industry and academia, a set of complete numerical applications was ported and evaluated on the CUDA platform. Several application research groups in the areas of medical imaging, molecular dynamics, computational chemistry, electromagnetic analysis, and scientific visualization contributed to this effort. The following are the major principles when choosing code to be executed on this platform:

1. Leverage zero-overhead thread scheduling to hide memory latency. On the GeForce 8800 there are 128 execution units available for use, requiring hundreds of threads to completely occupy them. In addition, threads can be starved of data due to the long latency to global memory. The general philosophy of CUDA for tolerating this latency is to generate and maintain thousands of threads in flight. This is in contrast with the use of large caches to hide memory latencies in CPU designs. Developers used to traditional multicore systems may need to define threads at a finer granularity in order to generate enough threads. In addition, a high compute-to-memory-access ratio is necessary to avoid saturation of the memory channels.

2. Optimize use of on-chip memory to reduce bandwidth usage and redundant execution. Working memory within a group of cores consists primarily of a register file and a software-managed on-chip memory called shared memory. These are high fan-out, low-latency, limited-capacity memories which are partitioned among the thread blocks assigned to the same SM at runtime. The data in shared memory can be shared among threads in a thread block, enabling interthread data reuse (see the kernel sketch following this list). An incremental increase in the usage of registers or shared memory per thread can result in a substantial decrease in the number of threads that can be simultaneously executed.

3. Group threads to avoid SIMD penalties and memory port/bank conflicts. CUDA is based on the SPMD model, but its current implementation on the GeForce 8800 imposes Single-Instruction, Multiple-Data (SIMD) execution among subsets of threads. This mode differs from the short-vector SIMD present in most contemporary processors. It is a cost-effective hardware model for exploiting data parallelism and allows the GeForce 8800 to share one instruction issue unit among eight execution units. However, it can be ineffective for algorithms that require diverging control flow decisions in data-parallel sections. In some algorithms, threads can be reorganized to avoid divergent control flow. Appropriate thread grouping can also preserve performance by avoiding port and bank conflicts in memories.

4. Threads within a thread block can communicate via synchronization, but there is no built-in global communication mechanism for all threads. This avoids the need for virtualization of hardware resources, enables the execution of the same CUDA program across processor family members with a varying number of cores, and makes the hardware scalable. However, it also limits the kinds of parallelism that can be utilized within a single kernel call.
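Because Section 4 uses matrix multiplication kernels to demonstrate the optimization process, the following minimal kernel sketch shows how the first, second, and fourth principles typically surface in CUDA source. It is written for this discussion rather than taken from the paper: the tile width TILE, the kernel name matrixMulTiled, and the assumption that the matrices are square with a dimension n divisible by TILE are all illustrative choices.

#define TILE 16

__global__ void matrixMulTiled(const float *A, const float *B,
                               float *C, int n)
{
    /* Per-block staging buffers in the software-managed shared memory. */
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   /* one thread per element of C */
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        /* Each thread loads one element of the current A and B tiles, so a
           value fetched once from global memory is reused TILE times. */
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                         /* block-level barrier */

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                         /* finish before tiles are overwritten */
    }
    C[row * n + col] = acc;
}

/* Illustrative launch:
     dim3 block(TILE, TILE);
     dim3 grid(n / TILE, n / TILE);
     matrixMulTiled<<<grid, block>>>(dA, dB, dC, n);                      */

Launched with one TILE x TILE thread block per output tile, an n x n problem creates (n/TILE)^2 blocks, keeping thousands of threads in flight (principle 1); each tile element fetched from global memory is reused TILE times out of shared memory (principle 2); and the __syncthreads() barriers coordinate only the threads of a block, with no global synchronization (principle 4).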
We first discuss related work in Section 2. Section 3 introduces the threading model and execution hardware. Section 4 demonstrates the optimization process with in-depth performance analysis, using matrix multiplication kernels. Section 5 presents several studied applications with performance and optimization information. We conclude with some final statements and suggestions for future work.

2. Related Work

Data parallel programming languages are considered an intermediate approach between automatic parallelization efforts [7, 28] and explicit parallel programming models such as OpenMP [19] for supporting parallel computing. Fortran 90 [6] was the first such language in wide use and influenced later data parallel languages by introducing array assignment statements. Similar to array assignments in Fortran 90 is the lock-step execution of each single instruction by threads executing simultaneously on a streaming multiprocessor in the CUDA programming model. Later, High Performance Fortran (HPF) [15] was introduced as a standard data parallel language to support SPMD programs. However, the complexity of data distribution and communication optimization techniques, as discussed in the final two chapters of [13], was a hard-to-solve challenge. As a result, application developers became more involved in explicitly handling data distribution and communication; message passing libraries such as [23] became a popular programming model for scalable parallel systems. Similarly, in CUDA the developer explicitly manages data layout in the DRAM memory spaces, data caching, thread communication within thread blocks, and other resources.

The interest in GPGPU programming has been driven by relatively recent improvements in the programmability of graphics hardware. The release of Cg [16] signified the recognition that GPUs were programmable processors and that a higher-level language was needed to develop applications on them. Others felt that the abstractions provided by Cg and other shading languages were insufficient and built higher-level language constructs. Brook [9] enables the use of the GPU as a streaming coprocessor. Accelerator [26] is another system that uses data-parallel arrays to perform general-purpose computation on the GPU. A Microsoft C# library provides data types and functions to operate on data-parallel arrays. Data-parallel array computation is transparently compiled to shader programs by the runtime. Other efforts to provide a more productive stream processing programming environment for developing multi-threaded applications include the RapidMind Streaming Execution Manager [17] and the PeakStream Virtual Machine [4]. These mainly target HPC applications that are amenable to stream processing. The achieved performance may lag behind customized GPU/CPU code due to virtual machine and dynamic compilation overhead. We refer the reader to a review of the main body of work done to map general-purpose computation to GPUs by Owens et al. in [21].

In general, previous GPU programming systems limit the size and complexity of GPU code due to their underlying graphics API-based implementations. CUDA supports kernels with much larger code sizes via a new hardware interface and instruction caching. Previous GPU generations and their APIs also restricted the allowed memory access patterns, usually allowing only sequential writes to a linear array. This is due primarily to limits in the graphics APIs and corresponding limits in the specialized pixel and vertex processors. Accelerator does not allow access to an individual element in parallel arrays: operations are performed on all array elements. Brook also executes its kernel for every element in the stream, with some exceptions. The GeForce 8800 allows for general addressing of memory via a unified processor model, which enables CUDA to perform unrestricted scatter-gather operations.
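To make the contrast concrete, the two short kernels below read and write global memory through an arbitrary index array, the kind of scatter-gather access that the stream-oriented systems above could not express directly. This is a minimal sketch written for this discussion, not code from the paper; the kernel names and the assumption that idx holds valid, in-range (and, for the scatter, unique) indices are illustrative.

/* Gather: each thread reads through an arbitrary index. */
__global__ void gatherKernel(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];      /* indirect read from global memory */
}

/* Scatter: each thread writes through an arbitrary index. */
__global__ void scatterKernel(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[idx[i]] = in[i];      /* indirect write to global memory */
}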
Traditional GPUs also provided limited cache bandwidth. Fatahalian et al. discuss in [11] that the low-bandwidth cache designs on GPUs prevent many types of applications from benefiting from the computational power available on these architectures. Work discussed in [12] uses an analytical cache performance prediction model for GPU-based algorithms. Their results indicate that memory opti-