strain the number of active threads, sometimes exposing memory latency. We show one example where the use of additional registers in an attempted optimization allows one fewer thread block to be scheduled per SM, reducing performance. Finally, global memory bandwidth can limit the throughput of the system. Increasing the number of threads does not help performance in this situation. Alleviating the pressure on global memory bandwidth generally involves using additional registers and shared memory to reuse data, which in turn can limit the number of simultaneously executing threads. Balancing the usage of these resources is often non-intuitive, and some applications will run into resource limits other than instruction issue on this architecture.

The example we use to illustrate these principles is a matrix multiplication kernel. In matrix multiplication, the value of an element in the result matrix is calculated by computing the dot product of the corresponding row of the first matrix and column of the second matrix. For this example we assume densely populated input matrices. We analyze several code versions and their sustained performance when multiplying two square matrices with a height and width of 4096 elements. The stated resource usage is for CUDA version 0.8; later versions of CUDA may have different usages.
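For reference, the dot-product formulation and the total work implied by the 4096x4096 problem size can be written as follows. The operation count is our own back-of-the-envelope figure, not one quoted in the text:

    \[
      C_{ij} \;=\; \sum_{k=0}^{N-1} A_{ik}\,B_{kj}, \qquad N = 4096,
      \qquad \text{total work} \approx 2N^{3} \approx 1.37\times 10^{11}\ \text{FLOP}.
    \]

Counting one multiply and one add per term of the sum gives roughly 2N^3 floating-point operations per multiplication, which is the work against which the GFLOPS figures below are measured.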
4.1 Initial Code Version

We begin with a simple version of matrix multiplication. The matrix multiplication kernel creates a thread for each result element of the multiplication, for a total of 4K*4K threads. Many threads are created in an attempt to hide the latency of global memory by overlapping execution. These threads loop through a sequence that loads two values from global memory, multiplies them, and accumulates the value. Figure 3(a) shows the core loops of the dot-product computation kernel; starting values for indexA, indexB, and indexC are determined by block and thread coordinates, which the hardware supports. This code uses ten registers per thread, allowing the maximum of 768 threads to be scheduled per SM. For convenience, we group them as three thread blocks of 256 threads each.

Performance for this code is 10.58 GFLOPS, which is lower than highly optimized libraries executing on a CPU using SIMD extensions. By examining the PTX for this code, we find that there is approximately [3] one fused multiply-add out of eight operations in the inner loop, for an estimated potential throughput of 43.2 GFLOPS. Because we have the maximum number of threads scheduled per SM, the bottleneck appears to be global memory bandwidth. One quarter of the operations executed during the loop are loads from off-chip memory, which would require a bandwidth of 173 GB/s (128 SPs * 1/4 instructions * 4 B/instruction * 1.35 GHz) to fully utilize the SPs [4]. Thus, the strategy for optimizing this kernel is to improve data reuse and reduce global memory access.

[3] As previously mentioned, PTX code does not necessarily translate to executed instructions, so instruction counts are estimates.
[4] This is also an estimate. Threads can simultaneously load the same value from memory and the memory system may be able to combine these into a single request.
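Figure 3(a) itself is not reproduced in this excerpt. The sketch below is a minimal reconstruction, under our own assumptions about structure and naming, of a kernel matching the description above: one thread per result element, with an inner loop that performs two global loads, a multiply, and an accumulate per iteration. The identifiers (matmul_simple, MATRIX_SIZE, dotProduct) are illustrative, not the paper's actual code.

    // Non-tiled matrix multiplication sketch: each thread computes one
    // element of C by streaming a row of A and a column of B from global memory.
    #define MATRIX_SIZE 4096

    __global__ void matmul_simple(const float *A, const float *B, float *C)
    {
        // Starting indices derived from block and thread coordinates,
        // analogous to indexA, indexB, and indexC in the text.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        int indexA = row * MATRIX_SIZE;          // walks along a row of A
        int indexB = col;                        // walks down a column of B
        int indexC = row * MATRIX_SIZE + col;    // result element

        float dotProduct = 0.0f;
        for (int k = 0; k < MATRIX_SIZE; ++k) {
            // Two global loads, one multiply, one accumulate per iteration,
            // matching the instruction mix discussed in Section 4.1.
            dotProduct += A[indexA + k] * B[indexB + k * MATRIX_SIZE];
        }
        C[indexC] = dotProduct;
    }

A launch such as matmul_simple<<<dim3(256, 256), dim3(16, 16)>>>(dA, dB, dC), with dA, dB, and dC as hypothetical device pointers, would create the 4K*4K threads in 256-thread blocks mentioned above.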
4.2 Use of Local Storage

In Figure 3(a), although the computations of two result elements in the same row or column share half their input data (the same indexA or indexB values), the previous code accesses global memory for each datum in every thread. A common optimization for this type of access pattern is to enhance data sharing via tiling [14]. In the GeForce 8800, developers can utilize shared memory to amortize the global latency cost when values are reused. Using low-overhead block synchronization, values can be shared between threads: one thread loads a datum and then synchronizes so that other threads in the same block can use it. Finally, we can also take advantage of contiguity in main memory accesses when loading in values as a block, reducing the cycles needed to access the data.

Figure 3(b) shows the code for a tiled matrix multiplication, with a tile size of 16x16, or 256 result elements and threads. During execution, the threads work within two input tiles that stride across 16 contiguous rows or columns in the input matrices. Each of the 256 threads is tied to a specific coordinate in a tile. It loads the element at that coordinate from the two input tiles into shared memory, so cooperatively the threads load the complete tiles. These loads are organized to take advantage of global access coalescing. The threads then synchronize to establish consistency, which enables each thread to load all of its inputs contained in the tiles from shared memory. Finally, the threads calculate the partial dot product for the inputs in shared memory within a loop.

The choice of tile shape and size is a key decision. For a given size, square tiles generally improve the computation to memory access ratio by improving data locality among threads. Larger tile sizes increase data sharing and thus global memory efficiency. The tiling support code adds several overhead instructions per tile, which also makes larger sizes more efficient. On this architecture, developers also need to consider whether a tile size provides enough threads to have good occupancy. Figure 4 shows the results of experimenting with different tile sizes. 4x4 tiles use only 16 threads, so half of the issue slots in a warp would go unused. This inefficiency, coupled with the 8 thread block limit, causes performance to be worse than the non-tiled code. 8x8 tiles create thread blocks that occupy two warps, but would still need 12 thread blocks to fully occupy an SM, 50% more than the supported limit. 12x12 tiles use 144 threads, which is also not an integral number of warps, and also requires padding of the arrays to prevent overrun. 16x16 is the largest convenient size for this platform, and we can schedule three thread blocks of 8 warps each, for the maximum of 768 threads. Global memory coalescing also happens naturally with this configuration. Other applications may have higher performance with smaller tile sizes when they allow a larger number of threads to be scheduled.

[Figure 4. Performance of Matrix Multiplication Kernels: sustained GFLOPS (0 to 100) for the non-tiled kernel and for 4x4, 8x8, 12x12, and 16x16 tiles, each in tiled-only and tiled & unrolled variants.]

The use of 16x16 tiles reduces global loads by a factor of 16 over the non-tiled configuration, so instead of the bandwidth requirement being twice what is available, it is now approximately an eighth, removing it as the bottleneck to performance. The additional instructions reduce the potential throughput slightly below that of the original code. The 16x16 tiled version of matrix multiplication achieves 46.49 GFLOPS, or approximately 4.5X the execution throughput of the initial version. This is slightly higher than the estimated potential throughput of the original code, so it appears that the application achieves full usage of the SPs.
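Figure 3(b) is likewise not reproduced here. The following is a minimal sketch, under our own assumptions about structure and naming, of a 16x16 shared-memory tiled kernel of the kind described: the 256 threads of a block cooperatively stage one tile of each input matrix into shared memory, synchronize, and then compute partial dot products entirely from shared memory.

    // Tiled matrix multiplication sketch with 16x16 tiles staged in shared memory.
    // All identifiers (TILE, As, Bs, matmul_tiled) are illustrative assumptions.
    #define TILE 16
    #define MATRIX_SIZE 4096

    __global__ void matmul_tiled(const float *A, const float *B, float *C)
    {
        __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
        __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float partial = 0.0f;

        // The two input tiles stride across the matrices; each thread loads one
        // element of each tile, so the 256 threads cooperatively load both tiles,
        // with consecutive threads touching consecutive addresses (coalescing).
        for (int t = 0; t < MATRIX_SIZE / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * MATRIX_SIZE + (t * TILE + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * MATRIX_SIZE + col];
            __syncthreads();               // establish consistency before reuse

            // Partial dot product over the tile, reading only shared memory.
            for (int k = 0; k < TILE; ++k)
                partial += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               // wait before overwriting the tiles
        }
        C[row * MATRIX_SIZE + col] = partial;
    }

In a kernel of this shape, each staged element is read 16 times from shared memory before the tile is replaced, which corresponds to the factor-of-16 reduction in global loads noted above.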