- reduce-by-key: Reduces segments of values, where segments are demarcated by corresponding runs of identical keys.
- run-length-encode: Computes a compressed representation of a sequence of input elements such that each maximal "run" of consecutive same-valued data items is encoded as a single item along with a count of the elements in that run.
These algorithms all make use of prefix sum as a method to determine where output items should be placed. The scatter offset for a given thread is the count of preceding items to be written by lower-ranked threads. Not only is the Thrust scan algorithm less efficient, but the process of item selection must either (a) be memoized in off-chip temporary storage, which consumes extra bandwidth, or (b) run twice, once during upsweep and again during downsweep. In comparison, item selection can be fused with our prefix scan strategy without incurring additional I/O or redundant selection computation.
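The role of prefix sum in computing scatter offsets can be illustrated with a sequential sketch of run-length-encode. This is illustrative only (the function names are hypothetical, and CUB performs the equivalent logic in parallel across GPU threads): each item that begins a new run raises a "head flag", and the exclusive prefix sum of those flags is exactly the scatter offset, i.e., the count of selected items owned by lower-ranked threads.

```python
def exclusive_prefix_sum(xs):
    """Each output element is the sum of all preceding inputs."""
    total, out = 0, []
    for x in xs:
        out.append(total)
        total += x
    return out

def run_length_encode(data):
    # A thread "selects" an item iff it begins a new run (head flag = 1).
    flags = [1 if i == 0 or data[i] != data[i - 1] else 0
             for i in range(len(data))]
    # Exclusive prefix sum of the flags gives each head item its scatter
    # offset: the number of runs that start before it.
    offsets = exclusive_prefix_sum(flags)
    num_runs = sum(flags)
    uniques = [None] * num_runs
    counts = [0] * num_runs
    for i in range(len(data)):
        # Non-head items belong to the most recent run (offset minus one).
        run = offsets[i] if flags[i] else offsets[i] - 1
        if flags[i]:
            uniques[run] = data[i]
        counts[run] += 1
    return uniques, counts

print(run_length_encode([3, 3, 3, 5, 5, 3, 1, 1]))  # ([3, 5, 3, 1], [3, 2, 1, 2])
```

Fusing the flag computation and the scan, as our strategy permits, avoids materializing the `flags` array in off-chip storage.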
Fig. 12 illustrates the throughput of our CUB-based primitives versus the functionally equivalent versions implemented by Thrust. Our implementations of these operations are 4.1x, 7.1x, 3.5x, and 3.8x faster, respectively.
6. Conclusion
Our method is a novel generalization of the chained-scan approach with dramatically reduced prefix propagation latencies. The principal idea is for processors to progressively inspect the status of predecessors that are increasingly further away.
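A sequential sketch may help fix the idea (names and structure here are illustrative, not CUB's actual implementation): each tile publishes its local aggregate, then walks backward over predecessor status descriptors, folding in their aggregates until it reaches a tile whose inclusive prefix is already known. On a GPU the tiles run concurrently with atomic status flags, so the backward walk genuinely overlaps with predecessors still computing; in this single-threaded simulation the immediate predecessor always already holds a prefix, so the loop degenerates.

```python
# Per-tile status descriptor: local aggregate vs. known inclusive prefix.
AGGREGATE, PREFIX = "A", "P"

def tile_scan(tiles):
    status, value = [], []  # published descriptor per tile
    result = []
    for t, tile in enumerate(tiles):
        agg = sum(tile)
        status.append(AGGREGATE)
        value.append(agg)
        # Look-back: walk predecessors, accumulating aggregates, until a
        # tile with a known inclusive prefix is found.
        exclusive = 0
        p = t - 1
        while p >= 0 and status[p] != PREFIX:
            exclusive += value[p]  # fold in that predecessor's aggregate
            p -= 1
        if p >= 0:
            exclusive += value[p]  # predecessor's inclusive prefix
        # Publish this tile's inclusive prefix so successors can stop here.
        value[t] = exclusive + agg
        status[t] = PREFIX
        # Emit the exclusive scan of this tile's elements.
        running = exclusive
        for x in tile:
            result.append(running)
            running += x
    return result

print(tile_scan([[1, 2], [3, 4], [5]]))  # [0, 1, 3, 6, 10]
```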
We demonstrate that, unlike prior single-pass algorithms, our method is capable of fully saturating DRAM bandwidth by overlapping the propagation of prefix dependences with a small amount of redundant computation. We have also shown this strategy to be amenable for implementing parallel algorithms that exhibit in-place compaction behavior, i.e., it does not require separate storage for the compacted output.
Another important distinction between our method and prior work is nondeterministic execution scheduling. Whereas contemporary scan parallelizations are constructed from static data flow networks (i.e., the order of operations is fixed for a given problem setting), our method allows parallel processors to perform redundant work as necessary to avoid delays from serial dependences. An interesting implication is that scan results are not necessarily deterministic for pseudo-associative operators. For example, the prefix sum across a given floating-point dataset may vary from run to run because the number and order of scan operators applied by CUB may vary from one run to the next.
Our CUDA-based implementation is freely available within the open-source CUB library of GPU parallel primitives (v1.0.1 and later) [21].
References
[1] Back40 Computing: Fast and efficient software primitives for GPU computing: http://code.google.com/p/back40computing/. Accessed: 2011-08-25.
[2] Baxter, S. 2013. Modern GPU. NVIDIA Research. http://nvlabs.github.io/moderngpu/.
[3] Bell, N. and Hoberock, J. Thrust. http://thrust.github.io/. Accessed: 2011-08-25.
[4] Blelloch, G.E. 1990. Prefix Sums and Their Applications. Synthesis of Parallel Algorithms.
[5] Blelloch, G.E. 1989. Scans as primitive parallel operations. IEEE Transactions on Computers. 38, 11 (Nov. 1989), 1526–1538.
[6] Blelloch, G.E. et al. Solving linear recurrences with loop raking. 416–424.
[7] Borodin, A. 1977. On Relating Time and Space to Size and Depth. SIAM Journal on Computing. 6, 4 (1977), 733–744.
[8] Brent, R.P. and Kung, H.T. 1982. A Regular Layout for Parallel Adders. IEEE Transactions on Computers. C-31, 3 (Mar. 1982), 260–264.
[9] Chatterjee, S. et al. 1990. Scan primitives for vector computers. Proceedings of the 1990 ACM/IEEE Conference on Supercomputing (Los Alamitos, CA, USA, 1990), 666–675.
[10] CUDPP - CUDA Data Parallel Primitives Library: http://code.google.com/p/cudpp/. Accessed: 2011-07-12.
[11] CUSPARSE: https://developer.nvidia.com/cusparse. Accessed: 2013-06-05.
[12] Dotsenko, Y. et al. 2008. Fast scan algorithms on graphics processors. Proceedings of the 22nd Annual International Conference on Supercomputing (New York, NY, USA, 2008), 205–213.
[13] Fortune, S. and Wyllie, J. 1978. Parallelism in random access machines. Proceedings of the Tenth Annual ACM Symposium on Theory of Computing (New York, NY, USA, 1978), 114–118.
[14] Goldschlager, L.M. 1982. A universal interconnection pattern for parallel computers. J. ACM. 29, 4 (Oct. 1982), 1073–1086.
[15] Han, T. and Carlson, D.A. 1987. Fast area-efficient VLSI adders. Computer Arithmetic (ARITH), 1987 IEEE 8th Symposium on (May 1987), 49–56.
[16] Ha, S.-W. and Han, T.-D. 2013. A Scalable Work-Efficient and Depth-Optimal Parallel Scan for the GPGPU Environment. IEEE Transactions on Parallel and Distributed Systems. 24, 12 (2013), 2324–2333.
[17] Hillis, W.D. and Steele, G.L. 1986. Data parallel algorithms. Communications of the ACM. 29, 12 (Dec. 1986), 1170–1183.
[18] Hinze, R. 2004. An Algebra of Scans. Mathematics of Program Construction. D. Kozen, ed. Springer Berlin / Heidelberg. 186–210.
[19] Kogge, P.M. and Stone, H.S. 1973. A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations. IEEE Transactions on Computers. C-22, 8 (Aug. 1973), 786–793.
[20] Merrill, D. 2011. Allocation-oriented Algorithm Design with Application to GPU Computing. University of Virginia.
[21] Merrill, D. 2013. CUB v1.0.1: A library of warp-wide, block-wide, and device-wide GPU parallel primitives. NVIDIA Research. http://nvlabs.github.io/cub/. Accessed: 2013-08-08.
[22] Merrill, D. and Grimshaw, A. 2009. Parallel Scan for Stream Architectures. Technical Report #CS2009-14. Department of Computer Science, University of Virginia.
[23] Sengupta, S. et al. 2008. Efficient parallel scan algorithms for GPUs. Technical Report #NVR-2008-003. NVIDIA.
[24] Sklansky, J. 1960. Conditional-Sum Addition Logic. IEEE Transactions on Electronic Computers. EC-9, 2 (Jun. 1960), 226–231.
[25] Snir, M. 1986. Depth-size trade-offs for parallel prefix computation. Journal of Algorithms. 7, 2 (Jun. 1986), 185–201.
[26] Valiant, L.G. 1976. Universal circuits (Preliminary Report). Proceedings of the Eighth Annual ACM Symposium on Theory of Computing (New York, NY, USA, 1976), 196–203.
[27] Yan, S. et al. 2013. StreamScan: fast scan algorithms for GPUs without global barrier synchronization. Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2013), 229–238.