• CUB [21]: Our single-pass decoupled-lookback parallelization with ~2n data movement (including the adaptations and optimizations described in the previous sections).

• StreamScan [27]: A single-pass chained-scan parallelization with ~2n data movement. StreamScan is a 32-bit implementation (OpenCL), which precludes very large problem sizes.
Furthermore, it is auto-tuned per problem size.

• MGPU [2]: A three-kernel reduce-then-scan parallelization with ~3n data movement.

• Thrust [3]: A recursive scan-then-propagate parallelization with ~4n data movement.

We also measure the throughput performance of CUDA's global memcpy operation. Copy serves as an ideal performance ceiling for prefix scan because it shares the same minimum I/O workload, is completely data-parallel, and has no computational overhead.

We conducted our evaluation using the three most recent generations of NVIDIA Tesla GPU processors (all with ECC disabled): the Maxwell-based M40 (Fig. 7), the Kepler-based K40 (Fig. 8), and the Fermi-based C2050 (Fig. 9). Our CUB performance meets or exceeds that of the other implementations for all architectures and problem sizes. For the Kepler and Maxwell platforms, CUB throughput matches the performance ceiling of memcpy for large problems and cannot be improved upon nontrivially.

Despite extensive per-input auto-tuning, StreamScan performance is hindered by the latencies of serial prefix propagation. This is manifest in two ways: (1) the roofline saturation of StreamScan throughput occurs at relatively higher problem sizes on all architectures, and (2) StreamScan is unable to match memcpy throughput on the Fermi and Kepler architectures, where on-chip resources (register file and shared memory) preclude blocking factors large enough to cover roundtrip L2 cache latency.

Furthermore, these results largely match our performance speedup expectations. If we were to assume memory-bound operation for all implementations, we would expect speedups of 1x, 1.5x, and 2x versus StreamScan, MGPU, and Thrust, respectively. In practice, for very large problems (capable of saturating the processor), we achieve harmonic-mean speedups of 1.1x, 1.4x, and 2.3x, respectively. Fig. 10 further enumerates saturated CUB speedup per architecture, and Fig. 11 enumerates harmonic-mean CUB speedup across all problem sizes.
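The expected speedups above follow directly from the data-movement volumes: under a memory-bound model, runtime is proportional to bytes moved, so CUB's ~2n traffic predicts 2n/2n = 1x, 3n/2n = 1.5x, and 4n/2n = 2x against StreamScan, MGPU, and Thrust. A minimal sketch (not from the paper's artifact) of this arithmetic, including recomputing the reported harmonic means from the per-architecture saturated speedups in Fig. 10:

```python
# Sketch: memory-bound speedup model and harmonic means of the Fig. 10 data.

def expected_speedup(other_volume, cub_volume=2.0):
    # Under a memory-bound model, runtime is proportional to data moved,
    # so relative speedup is the ratio of data-movement volumes.
    return other_volume / cub_volume

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

# ~2n (StreamScan), ~3n (MGPU), ~4n (Thrust) versus CUB's ~2n:
print(expected_speedup(2.0), expected_speedup(3.0), expected_speedup(4.0))
# -> 1.0 1.5 2.0

# Saturated per-architecture speedups from Fig. 10 (M40, K40, C2050):
print(round(harmonic_mean([1.00, 1.23, 1.12]), 2))  # vs. StreamScan -> 1.11
print(round(harmonic_mean([1.37, 1.46, 1.47]), 2))  # vs. MGPU       -> 1.43
print(round(harmonic_mean([2.08, 2.73, 2.10]), 2))  # vs. Thrust     -> 2.27
```

The harmonic mean is the appropriate average here because the quantities are rates (speedup ratios); it reproduces the "H-mean saturated" row of Fig. 10 exactly.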
5.2 Adaptation for compaction behavior

Using CUB collective primitives for data movement, we have applied this single-pass scan strategy to construct very fast, performance-portable implementations of various compaction algorithms:

• select-if: applies a binary selection functor to selectively copy items from input to output

• partition-if: applies a binary selection functor to split-copy items from input into separate partitions within the output

                     StreamScan   MGPU    Thrust
  Saturated M40        1.00x      1.37x   2.08x
  Saturated K40        1.23x      1.46x   2.73x
  Saturated C2050      1.12x      1.47x   2.10x
  H-mean saturated     1.11x      1.43x   2.27x

Fig. 10. CUB speedup for large inputs

                     StreamScan   MGPU    Thrust
  H-mean M40           1.62x      1.35x   2.99x
  H-mean K40           1.67x      1.18x   2.87x
  H-mean C2050         1.54x      1.20x   2.73x
  H-mean all           1.60x      1.19x   2.80x

Fig. 11. Average CUB speedup

(a) select_if(): int32 data w/ 50% uniform-random selection
(b) reduce_by_key(): {int32, fp32} pairs w/ average segment length 500
(c) partition_if(): int32 data w/ 50% uniform-random selection
(d) run_length_encode(): int32 data w/ average segment length 500

Fig. 12.
Performance of compaction-like algorithms across 32M inputs

[Bar charts: CUB vs. Thrust v1.7.1 throughput (billions of input items or pairs per second) for each algorithm above, measured on GeForce 9800 GTX+, GTX 285, GTX 580, GTX Titan, Tesla C1060, C2050, and K20C.]
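As a reference for the compaction semantics benchmarked above, a hypothetical sequential sketch (the paper's actual implementations are single-pass CUDA kernels built from CUB collective primitives): select-if keeps only the items satisfying the selection functor, while partition-if retains the rejected items as well, in a second partition of the same output sequence.

```python
# Hypothetical sequential reference for select-if / partition-if semantics;
# not the paper's CUDA implementation.

def select_if(items, pred):
    # Copy only the items satisfying pred to the output.
    return [x for x in items if pred(x)]

def partition_if(items, pred):
    # Split-copy items into two partitions of one output sequence:
    # selected items first, rejected items after; also report the
    # number of selected items (the partition boundary).
    selected = [x for x in items if pred(x)]
    rejected = [x for x in items if not pred(x)]
    return selected + rejected, len(selected)

data = [3, 1, 4, 1, 5, 9, 2, 6]
is_even = lambda x: x % 2 == 0

print(select_if(data, is_even))   # -> [4, 2, 6]
out, num_selected = partition_if(data, is_even)
print(out, num_selected)          # -> [4, 2, 6, 3, 1, 1, 5, 9] 3
```

Note that this sketch keeps both partitions in input order for clarity; it illustrates the input/output contract only, not the ordering guarantees or the single-pass, ~2n-traffic execution of the GPU versions.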