Parallel Prefix Sum (Scan) with CUDA

Mark Harris
mharris@nvidia.com

April 2007
Document Change History

Version    Date                 Responsible    Reason for Change
           February 14, 2007    Mark Harris    Initial release
Abstract

Parallel prefix sum, also known as parallel scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads.
Table of Contents

Abstract
Table of Contents
Introduction
    Inclusive and Exclusive Scan
    Sequential Scan
A Naïve Parallel Scan
A Work-Efficient Parallel Scan
Avoiding Bank Conflicts
Arrays of Arbitrary Size
Performance
Conclusion
Bibliography
Introduction

A simple and common parallel algorithm building block is the all-prefix-sums operation. In this paper we will define and illustrate the operation, and discuss in detail its efficient implementation on NVIDIA CUDA. As mentioned by Blelloch [1], all-prefix-sums is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm. The all-prefix-sums operation is defined as follows in [1]:

Definition: The all-prefix-sums operation takes a binary associative operator ⊕, and an array of n elements

    [a0, a1, …, an-1],

and returns the array

    [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)].

Example: If ⊕ is addition, then the all-prefix-sums operation on the array [3 1 7 0 4 1 6 3] would return [3 4 11 11 15 16 22 25].

There are many uses for all-prefix-sums, including, but not limited to, sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, and building histograms and data structures (graphs, trees, etc.) in parallel. For example applications, we refer the reader to the survey by Blelloch [1].

In general, all-prefix-sums can be used to convert some sequential computations into equivalent, but parallel, computations, as shown in Figure 1.

    // sequential
    out[0] = 0;
    for j from 1 to n do
        out[j] = out[j-1] + f(in[j-1]);

    // parallel
    forall j in parallel do
        temp[j] = f(in[j]);
    all_prefix_sums(out, temp);

Figure 1: A sequential computation and its parallel equivalent.

Inclusive and Exclusive Scan

All-prefix-sums on an array of data is commonly known as scan. We will use this simpler terminology (which comes from the APL programming language [1]) for the remainder of this paper. As shown in the last section, a scan of an array generates a new array where each element j is the sum of all elements up to and including j. This is an inclusive scan. It is often useful for each element j in the results of a scan to contain the sum of all previous elements, but not j itself. This operation is commonly known as an exclusive scan (or prescan) [1].

Definition: The exclusive scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements

    [a0, a1, …, an-1],
and returns the array

    [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)].

Example: If ⊕ is addition, then the exclusive scan operation on the array [3 1 7 0 4 1 6 3] returns [0 3 4 11 11 15 16 22].

An exclusive scan can be generated from an inclusive scan by shifting the resulting array right by one element and inserting the identity. Likewise, an inclusive scan can be generated from an exclusive scan by shifting the resulting array left, and inserting at the end the sum of the last element of the scan and the last element of the input array [1]. For the remainder of this paper we will focus on the implementation of exclusive scan and refer to it simply as scan unless otherwise specified.

Sequential Scan

Implementing a sequential version of scan (that could be run in a single thread on a CPU, for example) is trivial. We simply loop over all the elements in the input array, add the value of the previous element of the input array to the sum computed for the previous element of the output array, and write the sum to the current element of the output array.

void scan(float* output, float* input, int length)
{
    output[0] = 0; // since this is a prescan, not a scan
    for (int j = 1; j < length; ++j)
    {
        output[j] = input[j-1] + output[j-1];
    }
}

This code performs exactly n adds for an array of length n; this is the minimum number of adds required to produce the scanned array. When we develop our parallel version of scan, we would like it to be work-efficient: it should do no more addition operations (or work) than the sequential version. In other words, the two implementations should have the same work complexity, O(n).

A Naïve Parallel Scan

for d := 1 to log2 n do
    forall k in parallel do
        if k ≥ 2^(d-1) then
            x[k] := x[k − 2^(d-1)] + x[k]

Algorithm 1: A sum scan algorithm that is not work-efficient.
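To make Algorithm 1 concrete, the following trace (ours, not part of the original text) applies it to the example array; in pass d, each element k ≥ 2^(d-1) adds in the element 2^(d-1) positions to its left:

    input: [3 1  7  0  4  1  6  3]
    d=1:   [3 4  8  7  4  5  7  9]
    d=2:   [3 4 11 11 12 12 11 14]
    d=3:   [3 4 11 11 15 16 22 25]

Note that the result is an inclusive scan; Listing 1 below produces the exclusive variant by shifting the input right by one element as it is loaded into shared memory.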
The pseudocode in Algorithm 1 shows a naïve parallel scan implementation. This algorithm is based on the scan algorithm presented by Hillis and Steele¹ [4], and demonstrated for GPUs by Horn [5]. The problem with Algorithm 1 is apparent if we examine its work complexity. The algorithm performs

    ∑ from d = 1 to log2 n of (n − 2^(d-1)) = O(n log2 n)

addition operations. Remember that a sequential scan performs O(n) adds. Therefore, this naïve implementation is not work-efficient. The factor of log2 n can have a large effect on performance. In the case of a scan of 1 million elements, the performance difference between this naïve implementation and a theoretical work-efficient parallel implementation would be almost a factor of 20.

¹ Note that while we call this a naïve scan in the context of CUDA and NVIDIA GPUs, it was not necessarily naïve for a Connection Machine [3], which is the machine Hillis and Steele were focused on. Related to work complexity is the concept of step complexity, which is the number of steps that the algorithm executes. The Connection Machine was a SIMD machine with many thousands of processors. In the limit where the number of processors equals the number of elements to be scanned, execution time is dominated by step complexity rather than work complexity. Algorithm 1 has a step complexity of O(log n) compared to the O(n) step complexity of the sequential algorithm, and is therefore step-efficient.

Algorithm 1 assumes that there are as many processors as data elements. On a GPU running CUDA, this is not usually the case. Instead, the forall is automatically divided into small parallel batches (called warps) that are executed sequentially on a multiprocessor. A G80 GPU executes warps of 32 threads in parallel. Because not all threads run simultaneously for arrays larger than the warp size, the algorithm above will not work, because it performs the scan in place on the array: the results of one warp would be overwritten by threads in another warp.

To solve this problem, we need to double-buffer the array we are scanning. We use two temporary arrays (temp[2][n]) to do this. Pseudocode for this is given in Algorithm 2, and CUDA C code for the naïve scan is given in Listing 1. Note that this code will run on only a single thread block of the GPU, and so the size of the arrays it can process is limited (to 512 elements on G80 GPUs). Extension of scan to large arrays is discussed later.

for d := 1 to log2 n do
    forall k in parallel do
        if k ≥ 2^(d-1) then
            x[out][k] := x[in][k − 2^(d-1)] + x[in][k]
        else
            x[out][k] := x[in][k]
    swap(in, out)

Algorithm 2: A double-buffered version of the sum scan from Algorithm 1.
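For checking results on the CPU, Algorithm 2 can be simulated sequentially. The sketch below is ours, not from the original document; the function name and the MAX_N bound are hypothetical. It mirrors the pin/pout double buffering that Listing 1 performs in shared memory.

#include <string.h>

#define MAX_N 512

// Sequential simulation of Algorithm 2: every pass reads buffer `in`
// and writes buffer `out`, so no element is read after being overwritten.
void scan_naive_ref(float *data, int n)
{
    static float temp[2][MAX_N];
    int in = 0, out = 1;
    memcpy(temp[in], data, n * sizeof(float));
    for (int offset = 1; offset < n; offset *= 2)  // offset = 2^(d-1)
    {
        for (int k = 0; k < n; ++k)                // "forall k in parallel"
        {
            if (k >= offset)
                temp[out][k] = temp[in][k - offset] + temp[in][k];
            else
                temp[out][k] = temp[in][k];
        }
        int t = in; in = out; out = t;             // swap(in, out)
    }
    memcpy(data, temp[in], n * sizeof(float));     // result is an inclusive scan
}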
    d=0:  x0         x1         x2         x3         x4         x5         x6         x7
    d=1:  Σ(x0..x0)  Σ(x0..x1)  Σ(x1..x2)  Σ(x2..x3)  Σ(x3..x4)  Σ(x4..x5)  Σ(x5..x6)  Σ(x6..x7)
    d=2:  Σ(x0..x0)  Σ(x0..x1)  Σ(x0..x2)  Σ(x0..x3)  Σ(x1..x4)  Σ(x2..x5)  Σ(x3..x6)  Σ(x4..x7)
    d=3:  Σ(x0..x0)  Σ(x0..x1)  Σ(x0..x2)  Σ(x0..x3)  Σ(x0..x4)  Σ(x0..x5)  Σ(x0..x6)  Σ(x0..x7)

Figure 1: Computing a scan of an array of 8 elements using the naïve scan algorithm.

__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Load input into shared memory.
    // This is exclusive scan, so shift right by one and set first elt to 0.
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout; // swap double buffer indices
        pin  = 1 - pout;
        if (thid >= offset)
            temp[pout*n + thid] = temp[pin*n + thid] + temp[pin*n + thid - offset];
        else
            temp[pout*n + thid] = temp[pin*n + thid];
        __syncthreads();
    }

    g_odata[thid] = temp[pout*n + thid]; // write output
}

Listing 1: CUDA C code for the naïve scan algorithm. This version can handle arrays only as large as can be processed by a single thread block running on one multiprocessor of a GPU.
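Listing 1 declares its shared memory as extern, so the size of the double buffer temp[2][n] (2·n floats) must be supplied as the third launch parameter. A minimal host-side sketch (ours, not from the original document; h_in and h_out are hypothetical host arrays) for a single block of n ≤ 512 elements might look like this:

int n = 512;
size_t bytes = n * sizeof(float);
float *d_in, *d_out;
cudaMalloc((void**)&d_in,  bytes);
cudaMalloc((void**)&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

// One thread block of n threads; the third launch parameter is the
// dynamically allocated shared memory for the 2*n-element double buffer.
scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_in);
cudaFree(d_out);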
A Work-Efficient Parallel Scan

Our goal in this section is to develop a work-efficient scan algorithm that avoids the extra factor of log n work performed by the naïve algorithm of the previous section. To do this we will use an algorithmic pattern that arises often in parallel computing: balanced trees. The idea is to build a balanced binary tree on the input data and sweep it to and from the root to compute the prefix sum. A binary tree with n leaves has log2 n levels, and each level d ∈ [0, log2 n) has 2^d nodes. If we perform one add per node, then we will perform O(n) adds on a single traversal of the tree.

The tree we build is not an actual data structure, but a concept we use to determine what each thread does at each step of the traversal. In this work-efficient scan algorithm, we perform the operations in place on an array in shared memory. The algorithm consists of two phases: the reduce phase (also known as the up-sweep phase) and the down-sweep phase. In the reduce phase we traverse the tree from leaves to root, computing partial sums at internal nodes of the tree, as shown in Figure 2. This is also known as a parallel reduction, because after this phase, the root node (the last node in the array) holds the sum of all nodes in the array. Pseudocode for the reduce phase is given in Algorithm 3.

In the down-sweep phase, we traverse back down the tree from the root, using the partial sums computed by the reduce phase to build the scan in place on the array. The down-sweep is shown in Figure 3, and pseudocode is given in Algorithm 4. Note that because this is an exclusive scan (i.e., the total sum is not included in the results), between the phases we zero the last element of the array. This zero propagates back to the head of the array during the down-sweep phase. CUDA C code for the complete algorithm is given in Listing 2. Like the naïve scan code in the previous section, the code in Listing 2 will run on only a single thread block. Because it processes two elements per thread, the maximum array size this code can scan is 1024 elements on G80. Scans of large arrays are discussed later.

This scan algorithm performs O(n) operations (it performs 2*(n-1) adds and n-1 swaps); therefore it is work-efficient, and for large arrays it should perform much better than the naïve algorithm from the previous section. But algorithmic efficiency is not enough; we must also use the hardware efficiently. If we examine the operation of this scan on a GPU running CUDA, we will find that it suffers from many shared memory bank conflicts. These hurt the performance of every access to shared memory, and significantly affect overall performance. In the next section we will look at some simple modifications we can make to the memory address computations to recover much of that lost performance.
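As a concrete illustration (ours, not part of the original text), here is the state of the array as this two-phase algorithm scans the example input [3 1 7 0 4 1 6 3]:

    input:               [3 1 7  0 4  1 6  3]
    up-sweep, d=0:       [3 4 7  7 4  5 6  9]
    up-sweep, d=1:       [3 4 7 11 4  5 6 14]
    up-sweep, d=2:       [3 4 7 11 4  5 6 25]
    zero last element:   [3 4 7 11 4  5 6  0]
    down-sweep, d=2:     [3 4 7  0 4  5 6 11]
    down-sweep, d=1:     [3 0 7  4 4 11 6 16]
    down-sweep, d=0:     [0 3 4 11 11 15 16 22]

The final array is exactly the exclusive scan of the input.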
for d := 0 to log2 n - 1 do
    for k from 0 to n - 1 by 2^(d+1) in parallel do
        x[k + 2^(d+1) - 1] := x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Algorithm 3: The up-sweep (reduce) phase of a work-efficient sum scan algorithm (after Blelloch [1]).

    input:      x0  x1         x2  x3         x4  x5         x6  x7
    after d=0:  x0  Σ(x0..x1)  x2  Σ(x2..x3)  x4  Σ(x4..x5)  x6  Σ(x6..x7)
    after d=1:  x0  Σ(x0..x1)  x2  Σ(x0..x3)  x4  Σ(x4..x5)  x6  Σ(x4..x7)
    after d=2:  x0  Σ(x0..x1)  x2  Σ(x0..x3)  x4  Σ(x4..x5)  x6  Σ(x0..x7)

Figure 2: An illustration of the up-sweep, or reduce, phase of a work-efficient sum scan algorithm.
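Complete CUDA C for both phases appears later in Listing 2. As a preview of how Algorithm 3 maps onto threads, a sketch of the up-sweep loop might look like the following, where temp is the shared-memory array and each of the n/2 threads (indexed by thid) owns a pair of elements; treat it as illustrative rather than the final listing:

// Up-sweep (reduce) phase: build partial sums in place in shared memory.
// offset is the distance between the two elements a thread combines (2^d).
int offset = 1;
for (int d = n >> 1; d > 0; d >>= 1) // here d counts active threads per pass
{
    __syncthreads();
    if (thid < d)
    {
        int ai = offset * (2 * thid + 1) - 1;
        int bi = offset * (2 * thid + 2) - 1;
        temp[bi] += temp[ai];
    }
    offset *= 2;
}

Because consecutive threads combine elements that are offset apart, threads are kept contiguous and idle threads fall into whole warps as the pass count shrinks, which is what makes this formulation efficient on the GPU.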