Single-pass Parallel Prefix Scan with Decoupled Look-back

Duane Merrill
NVIDIA Corporation
dumerrill@nvidia.com

Michael Garland
NVIDIA Corporation
mgarland@nvidia.com

NVIDIA Technical Report NVR-2016-002, March 2016. CUB v1.0.1, August 2013. (c) NVIDIA Corporation. All rights reserved.

This research was developed, in part, with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings contained in this article/presentation are those of the author(s)/presenter(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Abstract

We describe a work-efficient, communication-avoiding, single-pass method for the parallel computation of prefix scan. When consuming input from memory, our algorithm requires only ~2n data movement: n inputs are read, n outputs are written. Our method embodies a decoupled look-back strategy that performs redundant work to dissociate local computation from the latencies of global prefix propagation. Implemented by the CUB library of parallel primitives for GPU architectures, the performance throughput of our parallel prefix scan approaches that of copy operations. Furthermore, the single-pass nature of our method allows it to be adapted for (1) in-place compaction behavior, and (2) in-situ global allocation within computations that oversubscribe the processor.

1. Introduction

Parallel prefix scan is a fundamental parallel computing primitive.
Given a list of input elements and a binary reduction operator, a prefix scan produces a corresponding output list where each output is computed to be the reduction of the elements occurring earlier in the input. A prefix sum connotes a prefix scan with the addition operator, i.e., each output number is the sum of the corresponding numbers occurring previously in the input list. An inclusive scan indicates that the i-th output reduction incorporates the i-th input element. An exclusive scan indicates the i-th input is not incorporated into the i-th output reduction. Applications of scan include adder design, linear recurrence and tridiagonal solvers, parallel allocation and queuing, list compaction and partitioning, segmented reduction, etc. For example, an exclusive prefix sum across a list of allocation requirements [8,6,7,5,3,0,9] produces a corresponding list of allocation offsets [0,8,14,21,26,29,29].

In this report, we describe the decoupled look-back method of single-pass parallel prefix scan and its implementation within the open-source CUB library of GPU parallel primitives [21]. For highly parallel architectures, prefix sum is a scalable mechanism for cooperative allocation within dynamic and irregular data structures [4, 20]. Contemporary GPUs are at the leading edge of the current trend of increased parallelism in computer architecture, provisioning tens of thousands of data-parallel threads. As such, prefix scan plays an important role in many GPU algorithms.

In modern computer systems, the performance and power consumption of prefix scan is typically bound by the cost of data movement: reading inputs and writing results to memory is generally more expensive than computing the reduction operations themselves. Therefore, communication avoidance (minimizing last-level data movement) is a practical design objective for parallel prefix scan.
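The allocation example above can be reproduced with a short sequential sketch of an exclusive prefix sum (illustrative code only; the function name is ours, and this is not CUB's API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential exclusive prefix sum: out[i] is the sum of all inputs
// preceding position i; the i-th input itself is excluded.
std::vector<int> exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;                  // identity of the + operator
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;             // prefix of the elements before i
        running += in[i];             // fold in the i-th input
    }
    return out;
}

// Allocation requirements [8,6,7,5,3,0,9] yield offsets [0,8,14,21,26,29,29].
```

An inclusive scan differs only in that the accumulation `running += in[i]` happens before the store.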
The sequential prefix scan algorithm requires only a single pass through the data to accumulate and progressively output the running total. As such, it incurs the optimal 2n data movement: n reads and n writes. Contemporary GPU scan parallelization strategies such as reduce-then-scan are typically memory-bound, but impose ~3n global data movement [2, 16, 22]. Furthermore, they perform two full passes over the input, which precludes them from serving as in-situ global allocation mechanisms within computations that oversubscribe the processor. Finally, these scan algorithms cannot be modified for in-place compaction behavior (selection, run-length encoding, duplicate removal, etc.) because the execution order of thread blocks within the output pass is unconstrained. Separate storage is required for the compacted output to prevent race conditions where inputs might otherwise be overwritten before they can be read.

Alternatively, the chained-scan GPU parallelization [11, 27] operates in a single pass, but is hindered by serial prefix dependences between adjacent processors that prevent memory I/O from fully saturating [27]. In comparison, our decoupled look-back algorithm elides these serial dependences at the expense of bounded redundant computation. As a result, our prefix scan computations (as well as adaptations for in-place compaction behavior and in-situ allocation) are typically capable of saturating memory bandwidth in a single pass.

2. Background

Parallel solutions to scan problems have been investigated for decades. In fact, the earliest research predates the discipline of Computer Science itself: scan circuits are fundamental to the operation of fast adder hardware (e.g., carry-skip, carry-select, and carry-lookahead adders) [8, 24]. As such, many commonplace scan parallelizations are presented as recursively-defined, acyclic dataflow networks in the circuit model [7, 26] of parallel computation.
In this model, prefix scan can be thought of as a forest of reduction trees, one for each output. Network size is reduced when reduction trees share intermediate partial sums. For practical use in computer software, scan networks are typically encoded as imperative algorithms in the PRAM model [13, 14].

The minimum circuit depth and size for a parallel scan network are log₂n steps and n−1 operators, respectively. However, there are no known O(n) work-efficient networks having depth log₂n. Snir provides a lower bound regarding depth+size optimality (DSO) for parallel prefix networks: for a given network of size s gates and depth d levels, d + s ≥ 2n − 2 [25]. His research
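This depth/work tension can be illustrated with the classic Hillis–Steele doubling scan: it completes in ⌈log₂n⌉ steps but performs O(n log n) additions, so it is minimum-depth without being work-efficient. The sketch below (our illustration, not code from the paper) simulates the network sequentially:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hillis–Steele inclusive scan: ceil(log2 n) doubling steps, each
// combining every element with the element d positions to its left.
// Total additions are O(n log n): minimum depth, but not work-efficient.
std::vector<int> hillis_steele_scan(std::vector<int> x) {
    for (std::size_t d = 1; d < x.size(); d *= 2) {
        std::vector<int> prev = x;          // snapshot of the prior step
        for (std::size_t i = d; i < x.size(); ++i)
            x[i] = prev[i - d] + prev[i];   // combine with element d away
    }
    return x;
}
```

By contrast, a work-efficient network such as Brent–Kung performs O(n) operations but takes roughly twice the minimum depth, the kind of trade-off that Snir's d + s bound formalizes.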