Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA

Shane Ryoo† Christopher I. Rodrigues† Sara S. Baghsorkhi† Sam S. Stone† David B. Kirk∗ Wen-mei W. Hwu†
†Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
∗NVIDIA Corporation
{sryoo, cirodrig, bsadeghi, ssstone2, hwu}@crhc.uiuc.edu, dk@nvidia.com

Abstract

GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.
We apply these strategies across a variety of applications and domains and achieve between a 10.5X and 457X speedup in kernel codes and between a 1.16X and 431X total application speedup.

Categories and Subject Descriptors D.1.3 [Software]: Programming Techniques—Concurrent Programming

General Terms Design, Performance, Languages

Keywords parallel computing, GPU computing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP '08, February 20-23, 2008, Salt Lake City, Utah, USA
Copyright © 2008 ACM 978-1-59593-960-9/08/0002...$5.00

1. Introduction

As a result of continued demand for programmability, modern graphics processing units (GPUs) such as the NVIDIA GeForce 8 Series are designed as programmable processors employing a large number of processor cores [20]. With the addition of new
hardware interfaces, programming them does not require specialized programming languages or execution through graphics application programming interfaces (APIs), as with previous GPU generations. This makes an inexpensive, highly parallel system available to a broader community of application developers.

The NVIDIA CUDA programming model [3] was created for developing applications for this platform. In this model, the system consists of a host that is a traditional CPU and one or more compute devices that are massively data-parallel coprocessors. Each CUDA device processor supports the Single-Program Multiple-Data (SPMD) model [8], widely available in parallel processing systems, where all concurrent threads are based on the same code, although they may not follow exactly the same path of execution. All threads share the same global address space.

CUDA programming is done with standard ANSI C extended with keywords that designate data-parallel functions, called kernels, and their associated data structures to the compute devices. These kernels describe the work of a single thread and typically are invoked on thousands of threads. These threads can, within developer-defined bundles termed thread blocks, share their data and synchronize their actions through built-in primitives. The CUDA runtime also provides library functions for device memory management and data transfers between the host and the compute devices. One can view CUDA as a programming environment that enables software developers to isolate program components that are rich in data parallelism for execution on a coprocessor specialized for exploiting massive data parallelism. An overview of the CUDA programming model can be found in [5].

The first version of CUDA programming tools and runtime for the NVIDIA GeForce 8 Series GPUs has been available through beta testing since February 2007. To CUDA, the GeForce 8800 GTX¹ consists of 16 streaming multiprocessors (SMs), each with eight processing units, 8096 registers, and 16KB of on-chip memory. It has a peak attainable multiply-add performance of 345.6 single-precision GFLOPS², features 86.4 GB/s memory bandwidth, contains 768MB of main memory, and incurs little cost in creating thousands of threads. The architecture allows efficient data sharing and synchronization among threads in the same thread block [18].

A unique aspect of this architecture relative to other parallel platforms is the flexibility in the assignment of local resources, such as registers or local memory, to threads. Each SM can run a variable number of threads, and the local resources are divided among threads as specified by the programmer. This flexibility

¹ There are several versions of the GeForce 8800 GPU. References to GeForce 8800 are implied to be the GTX model.
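As a concrete illustration of the programming model described above, the following is a minimal sketch (not taken from the paper) of a kernel and its host-side launch. The kernel name, array size, and scale factor are illustrative assumptions; only the CUDA runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaFree`) and the `<<<blocks, threads>>>` launch syntax are part of the model itself:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// A kernel describes the work of a single thread. Here each thread
// scales one element of the array. Threads within a thread block
// could additionally share data via __shared__ memory and
// synchronize with the built-in __syncthreads() primitive.
__global__ void scaleKernel(float *data, float factor, int n)
{
    // Global thread index, derived from block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * factor;
}

int main(void)
{
    const int n = 1 << 20;              // one million elements
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Device memory management and host<->device transfers are
    // provided by the CUDA runtime library.
    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Invoke the kernel on thousands of threads, grouped into
    // developer-defined thread blocks.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("data[0] = %f\n", h_data[0]);  // 2.0 if the launch succeeded

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Note how the kernel itself is scalar code: the parallelism comes entirely from launching it over a grid of thread blocks, matching the SPMD model described above.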
² Particular mixes of instructions can achieve higher throughput, as will be explained in Section 3.
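The peak figures quoted above can be reconstructed from the hardware parameters. As a sketch, the arithmetic below assumes the GeForce 8800 GTX's published 1.35 GHz shader clock and 384-bit, 900 MHz (1.8 GT/s effective) GDDR3 memory interface, neither of which is stated in this excerpt:

```latex
% Peak multiply-add throughput: 16 SMs x 8 processing units = 128 units,
% each retiring one multiply-add (2 flops) per shader cycle.
16 \times 8 \times 2\,\text{flops} \times 1.35\,\text{GHz} = 345.6\ \text{GFLOPS}

% Peak memory bandwidth: 384-bit bus at 1.8 GT/s effective data rate.
\frac{384\ \text{bits}}{8\ \text{bits/byte}} \times 1.8\,\text{GT/s}
    = 48\ \text{B} \times 1.8 \times 10^{9}\,\text{/s} = 86.4\ \text{GB/s}
```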