Table 2. Application Suite

Application | Description | Source Lines | Kernel Lines | CPU Execution Parallelized
H.264 | A modified version of the 464.h264ref benchmark from SPEC CPU2006. This is an H.264 (MPEG-4 AVC) video encoder. A serial dependence between motion estimation of macroblocks in a video frame is removed to enable parallel execution of the motion estimation code. Although this modification changes the output of the program, it is allowed within the H.264 standard. | 34811 | 194 | 35%
LBM | A modified version of the 470.lbm benchmark from SPEC CPU2006. This uses the Lattice-Boltzmann Method for simulating 3D fluid dynamics. The program has been changed to use single-precision floating point and print fewer status reports. | 1481 | 285 | > 99%
RC5-72 | This application accelerates distributed.net's RSA RC5-72 bit challenge, which performs brute-force encryption key generation and matching. | 1979 | 218 | > 99%
FEM | Finite Element Modeling. Simulation of dynamic behavior of 3D graded materials. | 1874 | 146 | 99%
RPES | Rys Polynomial Equation Solver. Calculates 2-electron repulsion integrals, which are a sub-problem of molecular dynamics. | 1104 | 281 | 99%
PNS | Petri Net Simulation. Simulation of a mathematical representation of a distributed system. | 322 | 160 | > 99%
SAXPY | Single-precision floating-point implementation of saxpy from High-Performance LINPACK, used as part of a Gaussian elimination routine. | 952 | 31 | > 99%
TPACF | Implementation of Two Point Angular Correlation Function, used to find the probability of finding an astronomical body at a given angular distance from another astronomical body. | 536 | 98 | 96%
FDTD | Finite-Difference Time-Domain. 2D electromagnetic wave propagation simulation in an arbitrary, user-defined medium. | 1365 | 93 | 16.4%
MRI-Q | Computation of a matrix Q, representing the scanner configuration, used in a 3D magnetic resonance image reconstruction algorithm in non-Cartesian space. | 490 | 33 | > 99%
MRI-FHD | Computation of an image-specific matrix FHd, used in a 3D magnetic resonance image reconstruction algorithm in non-Cartesian space. | 343 | 39 | > 99%
CP | Computation of electric potential in a volume containing point charges. Based on direct Coulomb summation, as described in [24]. | 409 | 47 | > 99%

Table 3.
Application Implementation Performance For Typical, Long-Running Execution Profiles

Application | Max Simultaneously Active Threads | Registers per Thread | Shared Mem per Thread (B) | Global Memory to Computation Cycles Ratio | GPU Exec % | CPU-GPU Transfer % | Architectural Bottleneck(s) | Kernel Speedup on GPU | Application Speedup
Mat Mul | 12288 | 9 | 8.1 | 0.276 | 16.2% | 4% | Instruction issue | 7.0X | 2.0X
H.264 | 3936 | 30 | 55.1 | 0.006 | 2.6% | 4.5% | Register file capacity and cache latencies | 20.2X | 1.47X
LBM | 3200 | 32 | 84.2 | 0.415 | 98.3% | 0.4% | Shared memory capacity | 12.5X | 12.3X
RC5-72 | 3072 | 42 | 0.3 | ~0 | 64.3% | 0.5% | Instruction issue | 17.1X | 11.0X
FEM | 4096 | 18 | 61 | 1.135 | 91.4% | < 1% | Global memory bandwidth | 11.0X | 10.1X
RPES | 4096 | 23 | 24.8 | 0.01 | 37.5% | 1% | Instruction issue | 210X | 79.4X
PNS | 2048 | 32 | 9.9 | 0.241 | 98% | < 1% | Global memory capacity | 24.0X | 23.7X
SAXPY | 12288 | 7 | 0.3 | 0.375 | 88% | 4.5% | Global memory bandwidth | 19.4X | 11.8X
TPACF | 4096 | 24 | 52.2 | 0.0002 | 34.3% | < 1% | Shared memory capacity | 60.2X | 21.6X
FDTD | 12288 | 11 | 8.1 | 0.516 | 1.8% | 0.9% | Global memory bandwidth | 10.5X | 1.16X
MRI-Q | 8192 | 11 | 20.1 | 0.008 | > 99% | < 1% | Instruction issue | 457X | 431X
MRI-FHD | 8192 | 12 | 20.1 | 0.006 | 99% | 1% | Instruction issue | 316X | 263X
CP | 6144 | 20 | 0.4 | 0.0005 | > 99% | < 1% | Instruction issue | 102X | 102X

word granularities. LBM, FEM, FDTD, and other lattice computations use arrays of small structures in global memory. Threads simultaneously read or write a given field of multiple elements, but these fields are not contiguous in memory. Each non-contiguous access is a separate DRAM access request, overwhelming the device's memory bandwidth. In LBM we alleviated the problem using contiguous accesses to prefetch the arrays into shared memory. Figure 5 illustrates the access patterns before and after the optimization. Before computation, threads cooperatively load blocks of memory into shared memory, as shown in Figure 5(b). They then synchronize, after which each thread operates on its own data.
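The cooperative load-then-barrier pattern described above can be sketched as a small CUDA kernel. The kernel name, array names, and tile size are illustrative assumptions, not taken from the LBM source:

```cuda
// Sketch of the cooperative prefetch pattern (cf. Figure 5(b)).
// TILE and all identifiers are illustrative; the real LBM kernel differs.
#define TILE 64

__global__ void lattice_step(const float *g_in, float *g_out, int n)
{
    __shared__ float s_buf[TILE];
    int base = blockIdx.x * TILE;
    int tid  = threadIdx.x;

    // Adjacent threads issue adjacent loads, so the hardware can
    // coalesce them into a few wide DRAM requests.
    if (base + tid < n)
        s_buf[tid] = g_in[base + tid];

    __syncthreads();   // wait until the whole tile is resident on chip

    // After the barrier, each thread works on its own element out of
    // shared memory instead of making scattered global accesses.
    if (base + tid < n)
        g_out[base + tid] = 0.5f * s_buf[tid];
}
```

The barrier is what makes the staging safe: no thread may consume `s_buf` until every thread in the block has finished its portion of the contiguous load.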
The buffering optimization may also be possible with FDTD if a substantial amount of data reorganization is performed, but FEM uses an irregular mesh data structure that has few contiguous accesses even with data reorganization.

On-chip caches are useful in several applications; we focus on two here. For the MRI applications, we placed data in constant memory, which reduced average access time [25]. We also performed a loop interchange to make all threads in a warp simultaneously access the same value in the table, removing conflicts. Constant memory is generally intended for small lookup tables, but any read-only data whose same location is simultaneously read by all threads is appropriate for it. Our implementation of H.264 uses texture memory for part of the input data, since the data use has 2D locality and the hardware provides boundary-value calculation support that would otherwise need to be computed in software. However, a lack of registers restricts the number of threads
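The constant-memory lookup pattern described above can be sketched as follows. The table size, kernel name, and table contents are illustrative assumptions; only the access structure (every thread in a warp reading the same entry per iteration) reflects the optimization in the text:

```cuda
// Sketch of a warp-uniform constant-memory lookup. Identifiers and
// sizes are illustrative, not from the MRI kernels themselves.
#define TABLE_SIZE 1024

__constant__ float c_table[TABLE_SIZE];   // small read-only lookup table

__global__ void apply_table(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // After the loop interchange, every thread in a warp reads the SAME
    // table entry in a given iteration, so the constant cache broadcasts
    // one value to the whole warp instead of serializing the accesses.
    for (int k = 0; k < TABLE_SIZE; k++)
        if (i < n)
            data[i] += c_table[k];
}
```

The host would populate `c_table` with `cudaMemcpyToSymbol` before launch; divergent per-thread indices into constant memory would instead serialize into multiple constant-cache accesses, which is why the loop interchange matters.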