mization techniques designed for CPU-based algorithms may not be directly applicable to GPUs. With the introduction of reasonably sized, low-latency, on-chip memory in new generations of GPUs, this issue and its optimizations have become less critical.

A programming interface alternative to CUDA is available for the AMD Stream Processor, using the R580 GPU, in the form of the Close to Metal (CTM) compute runtime driver [1]. Like CUDA, CTM can maintain the usage of the GPU as a graphics engine; however, instead of abstracting away architecture-level instructions, CTM completely exposes the ISA to the programmer for fine-grained control. Furthermore, the R580 continues to resemble previous-generation GPUs with their divided architecture for vertex and pixel processing, whereas the GeForce 8800 has a more general, unified model. This is presented in the next section.

Intel's C for Heterogeneous Integration (CHI) programming environment [27] is a different approach that tightly integrates accelerators such as GPUs with general-purpose CPU cores, based on the proposed EXOCHI [27] model. EXOCHI supports a shared virtual memory, heterogeneous, multi-threaded programming model with minimal OS intrusion. In the CUDA execution model, by contrast, the GPU is a device with an address space separate from the CPU's; as a result, all data communication and synchronization between CPU and GPU are performed explicitly through the GPU device driver.

3. Architecture Overview

The GeForce 8800 GPU is effectively a large set of processor cores with the ability to directly address into a global memory. This allows for a more general and flexible programming model than previous generations of GPUs, making it easier for developers to implement data-parallel kernels. In this section we discuss NVIDIA's Compute Unified Device Architecture (CUDA) and the major microarchitectural features of the GeForce 8800. A more complete description can be found in [3, 18]. It should be noted that this architecture, although more exposed than previous GPU architectures, still has details which have not been publicly revealed.

3.1 Threading Model

The CUDA programming model is ANSI C extended by several keywords and constructs. The GPU is treated as a coprocessor that executes data-parallel kernel code. The user supplies a single source program encompassing both host (CPU) and kernel (GPU) code. These are separated and compiled as shown in Figure 1. Each CUDA program consists of multiple phases that are executed on either the CPU or the GPU. The phases that exhibit little or no data parallelism are implemented in host (CPU) code, which is expressed in ANSI C and compiled with the host C compiler as shown in Figure 1. The phases that exhibit rich data parallelism are implemented as kernel functions in the device (GPU) code. A kernel function defines the code to be executed by each of the massive number of threads to be invoked for a data-parallel phase. These kernel functions are compiled by the NVIDIA CUDA C compiler and the kernel GPU object code generator. There are several restrictions on kernel functions: there must be no recursion, no static variable declarations, and a non-variable number of arguments. The host code transfers data to and from the GPU's global memory using API calls. Kernel code is initiated by performing a function call.

[Figure 1. CUDA Compilation Flow: cudacc separates the integrated source (foo.c, bar.cu) into CPU host code (foo.c, bar.c) and GPU assembly/kernel code (bar.s); the host compiler produces the host binary (foo.o, bar.o), the kernel object code generator produces kernel object code (bar.gpu), and the two are combined into the executable.]
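To make this structure concrete, the listing below sketches a complete CUDA program of the kind just described: a kernel function, host code that moves data through the CUDA API, and a kernel invocation. The kernel, its name, and the launch configuration are our own illustration rather than code from the paper; the thread and block coordinates it uses are explained in the following paragraphs.

#include <stdio.h>
#include <cuda_runtime.h>

/* Kernel (GPU) code, executed by every thread of the grid.
   Note the restrictions described above: no recursion, no
   static variable declarations, a fixed argument list. */
__global__ void scale(float *data, float factor, int n)
{
    /* Each thread handles one element, identified by its
       block and thread coordinates. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; i++)
        h_data[i] = (float)i;

    /* Host (CPU) code: transfer data to and from the GPU's
       global memory using API calls, as described above. */
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* Kernel invocation; the parameters between <<< and >>>
       give the grid and thread block dimensions. */
    scale<<<n / 256, 256>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    printf("h_data[10] = %f\n", h_data[10]);
    return 0;
}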
Threads executing on the GeForce 8800 are organized into a three-level hierarchy. At the highest level, all threads in a data-parallel execution phase form a grid; they all execute the same kernel function. Each grid consists of many thread blocks. A grid can be at most 2^16 - 1 blocks in either of two dimensions, and each block has unique coordinates. In turn, each thread block is a three-dimensional array of threads, explicitly defined by the application developer, that is assigned to an SM. The invocation parameters of a kernel function call define the sizes and dimensions of the thread blocks in the grid thus generated. Threads also have unique coordinates, and up to 512 threads can exist in a block. Threads in a block can share data through a low-latency, on-chip shared memory and can perform barrier synchronization by invoking the __syncthreads primitive. Threads are otherwise independent; synchronization across thread blocks can only be safely accomplished by terminating a kernel. Finally, the hardware groups threads in a way that affects performance, which is discussed in Section 3.2.
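The hypothetical kernel below illustrates intra-block cooperation: each block stages a tile of data in on-chip shared memory, and __syncthreads ensures that every element has been loaded before any thread reads a value written by a neighbor. The tile size and the operation performed (reversing each 256-element tile) are assumptions made for illustration.

#define BLOCK_SIZE 256   /* within the 512-thread block limit */

/* Illustrative kernel: each block reverses its own tile.
   Assumes the array length is a multiple of BLOCK_SIZE. */
__global__ void reverse_tiles(float *data)
{
    __shared__ float tile[BLOCK_SIZE];   /* on-chip, per block */

    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    tile[threadIdx.x] = data[i];

    /* Barrier: no thread proceeds until the whole tile is
       loaded; otherwise a thread could read an element that
       its neighbor has not yet written. */
    __syncthreads();

    data[i] = tile[BLOCK_SIZE - 1 - threadIdx.x];
}

Note that the barrier synchronizes only the threads of one block; as stated above, safe synchronization across thread blocks requires terminating the kernel.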
An application developer for this platform can compile CUDA code to an assembly-like representation of the code called PTX. PTX is not natively executed, but is processed by a run-time environment, making it uncertain what instructions are actually executed on a cycle-by-cycle basis. Two examples we have observed are simple cases of loop-invariant code that can be easily moved and branches which are split into condition evaluations and predicated jump instructions. However, PTX is generally sufficient in the initial stages of estimating resource requirements of an application and optimizing it.

3.2 Base Microarchitecture

Figure 2 depicts the microarchitecture of the GeForce 8800. It consists of 16 streaming multiprocessors (SMs), each containing eight streaming processors (SPs), or processor cores, running at 1.35 GHz. Each core executes a single thread's instruction in SIMD (single-instruction, multiple-data) fashion, with the instruction unit broadcasting the current instruction to the cores. Each core has one 32-bit, single-precision floating-point multiply-add arithmetic unit that can also perform 32-bit integer arithmetic. Additionally, each SM has two special functional units (SFUs), which execute more complex FP operations such as reciprocal square root, sine, and cosine with low multi-cycle latency. The arithmetic units and the SFUs are fully pipelined, yielding 388.8 GFLOPS (16 SMs × 18 FLOPS/SM × 1.35 GHz) of peak theoretical performance for the GPU.
Each SM has 8192 registers, which are dynamically partitioned among the threads running on it. Non-register memories with distinctive capabilities and uses are described in Table 1 and depicted in Figure 2. Variables in the source code can be declared to reside in global, shared, local, or constant memory. Texture memory is accessed through API calls which compile to special instructions. Bandwidth to off-chip memory is very high at 86.4 GB/s, but memory bandwidth can saturate if many threads request access within a short period of time. In addition, this bandwidth can be obtained only when accesses are contiguous 16-word lines; in other cases the achievable bandwidth is a fraction of the maximum. Optimizations that coalesce accesses into 16-word lines and reuse data are generally necessary to achieve good performance.
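The two hypothetical kernels below sketch this constraint. In the first, consecutive threads access consecutive words, so the accesses of a 16-thread group fall within one contiguous 16-word line; in the second, a stride of 16 words places every access in a different line, so fetching the same amount of useful data consumes many times the bandwidth. Both kernels are illustrations, not code from the paper.

/* Coalesced: thread k of a 16-thread group reads word base+k,
   so the group's requests combine into one 16-word line. */
__global__ void copy_coalesced(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

/* Uncoalesced: a 16-word stride puts each thread's access in a
   separate 16-word line, costing one line transfer per word.
   Assumes the input array is 16 times larger than the output. */
__global__ void copy_strided(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * 16];
}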
There are several non-storage limits to the number of threads that can be executed on the system. First, a maximum of 768 simultaneously active thread contexts is supported per SM. Second, an integral number of up to eight thread blocks can be run per SM at one time. The number of thread blocks that are simultaneously resident on an SM is limited by whichever limit of registers, shared memory, threads, or thread blocks is reached first. This has two consequences. First, optimization may have negative effects in some cases, because increasing each thread's use of a resource can reduce the number of threads and thread blocks simultaneously resident on an SM.
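A back-of-the-envelope sketch of how these limits interact is given below. It is our own illustration: it considers only the register, thread-context, and block limits (ignoring shared memory usage and any hardware allocation granularity) and assumes 256-thread blocks.

#include <stdio.h>

/* Resident blocks per SM under the limits described above:
   8192 registers, 768 thread contexts, 8 thread blocks. */
int blocks_per_sm(int regs_per_thread, int threads_per_block)
{
    int by_regs    = 8192 / (regs_per_thread * threads_per_block);
    int by_threads = 768 / threads_per_block;
    int blocks     = by_regs < by_threads ? by_regs : by_threads;
    return blocks < 8 ? blocks : 8;
}

int main(void)
{
    /* 10 registers/thread: 3 blocks (768 threads) are resident. */
    printf("%d blocks\n", blocks_per_sm(10, 256));
    /* 11 registers/thread: only 2 blocks (512 threads) fit, so an
       optimization costing one extra register per thread removes a
       whole block's worth of latency tolerance. */
    printf("%d blocks\n", blocks_per_sm(11, 256));
    return 0;
}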