
Parallel Prefix Sum (Scan) with CUDA

Mark Harris
mharris@nvidia.com

April 2007

Document Change History

Version    Date                 Responsible    Reason for Change
           February 14, 2007    Mark Harris    Initial release

Abstract

Parallel prefix sum, also known as parallel scan, is a useful building block for many parallel algorithms, including sorting and building data structures. In this document we introduce scan and describe step by step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain the best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads.

Table of Contents

Abstract
Table of Contents
Introduction
    Inclusive and Exclusive Scan
    Sequential Scan
A Naïve Parallel Scan
A Work-Efficient Parallel Scan
Avoiding Bank Conflicts
Arrays of Arbitrary Size
Performance
Conclusion
Bibliography

Introduction

A simple and common parallel algorithm building block is the all-prefix-sums operation. In this paper we will define and illustrate the operation, and discuss in detail its efficient implementation on NVIDIA CUDA. As mentioned by Blelloch [1], all-prefix-sums is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm. The all-prefix-sums operation is defined as follows in [1]:

Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements

    [a0, a1, ..., an-1],

and returns the array

    [a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-1)].

Example: If ⊕ is addition, then the all-prefix-sums operation on the array [3 1 7 0 4 1 6 3] would return [3 4 11 11 15 16 22 25].

There are many uses for all-prefix-sums, including, but not limited to, sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, and building histograms and data structures (graphs, trees, etc.) in parallel. For example applications, we refer the reader to the survey by Blelloch [1].

In general, all-prefix-sums can be used to convert some sequential computations into equivalent, but parallel, computations, as shown in Figure 1.

    // sequential
    out[0] = 0;
    forall j from 1 to n do
        out[j] = out[j-1] + f(in[j-1]);

    // parallel equivalent
    forall j in parallel do
        temp[j] = f(in[j]);
    all_prefix_sums(out, temp);

Figure 1: A sequential computation (top) and its parallel equivalent (bottom).
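For concreteness, the pattern in Figure 1 can be written out in C. In the sketch below, f(x) = x*x is an arbitrary example function of our choosing (the paper leaves f abstract), and all_prefix_sums is given a sequential stand-in so the example is self-contained; the rest of this paper develops parallel implementations of it.

    float f(float x) { return x * x; } // arbitrary example function

    // Sequential stand-in for the scan used by the parallel pattern;
    // this is an exclusive scan (see the next section).
    void all_prefix_sums(float* out, const float* in, int n)
    {
        out[0] = 0;
        for (int j = 1; j < n; ++j)
            out[j] = out[j-1] + in[j-1];
    }

    // Sequential form from Figure 1: each iteration depends on the previous.
    void sequential_version(float* out, const float* in, int n)
    {
        out[0] = 0;
        for (int j = 1; j < n; ++j)
            out[j] = out[j-1] + f(in[j-1]);
    }

    // Parallel form from Figure 1: a fully independent map over the input,
    // followed by a scan that carries all of the sequential dependence.
    void parallel_version(float* out, float* temp, const float* in, int n)
    {
        for (int j = 0; j < n; ++j) // forall j in parallel
            temp[j] = f(in[j]);
        all_prefix_sums(out, temp);
    }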

Inclusive and Exclusive Scan

All-prefix-sums on an array of data is commonly known as scan. We will use this simpler terminology (which comes from the APL programming language [1]) for the remainder of this paper. As shown in the last section, a scan of an array generates a new array where each element j is the sum of all elements up to and including j. This is an inclusive scan. It is often useful for each element j in the results of a scan to contain the sum of all previous elements, but not j itself. This operation is commonly known as an exclusive scan (or prescan) [1].

Definition: The exclusive scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements

    [a0, a1, ..., an-1],

and returns the array

    [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)].

Example: If ⊕ is addition, then the exclusive scan operation on the array [3 1 7 0 4 1 6 3] returns [0 3 4 11 11 15 16 22].

An exclusive scan can be generated from an inclusive scan by shifting the resulting array right by one element and inserting the identity. Likewise, an inclusive scan can be generated from an exclusive scan by shifting the resulting array left, and inserting at the end the sum of the last element of the scan and the last element of the input array [1]. For the remainder of this paper we will focus on the implementation of exclusive scan and refer to it simply as scan unless otherwise specified.
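These two shifting rules translate directly into code. The helpers below are a minimal sketch of ours (the names are not from the paper), assuming the operator ⊕ is addition, whose identity is 0:

    // Exclusive from inclusive: shift right by one, insert the identity.
    void inclusive_to_exclusive(const float* inc, float* exc, int n)
    {
        exc[0] = 0.0f; // identity for addition
        for (int j = 1; j < n; ++j)
            exc[j] = inc[j-1];
    }

    // Inclusive from exclusive: shift left by one; the last element is the
    // sum of the last element of the scan and the last element of the input.
    void exclusive_to_inclusive(const float* exc, const float* input,
                                float* inc, int n)
    {
        for (int j = 0; j < n - 1; ++j)
            inc[j] = exc[j+1];
        inc[n-1] = exc[n-1] + input[n-1];
    }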

Sequential Scan

Implementing a sequential version of scan (that could be run in a single thread on a CPU, for example) is trivial. We simply loop over all the elements in the input array, add the value of the previous element of the input array to the sum computed for the previous element of the output array, and write the sum to the current element of the output array.

    void scan(float* output, float* input, int length)
    {
        output[0] = 0; // since this is a prescan, not a scan
        for (int j = 1; j < length; ++j)
        {
            output[j] = input[j-1] + output[j-1];
        }
    }

This code performs exactly n − 1 adds for an array of length n; this is the minimum number of adds required to produce the scanned array. When we develop our parallel version of scan, we would like it to be work-efficient: it should do no more addition operations (or work) than the sequential version. In other words, the two implementations should have the same work complexity, O(n).
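As a quick check, a small driver (ours, not part of the paper) can run this sequential scan on the example array from the previous section:

    #include <stdio.h>

    void scan(float* output, float* input, int length); // as defined above

    int main(void)
    {
        float in[8] = {3, 1, 7, 0, 4, 1, 6, 3};
        float out[8];
        scan(out, in, 8);
        for (int j = 0; j < 8; ++j)
            printf("%g ", out[j]); // prints: 0 3 4 11 11 15 16 22
        printf("\n");
        return 0;
    }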

A Naïve Parallel Scan

    for d := 1 to log2 n do
        forall k in parallel do
            if k ≥ 2^(d−1) then
                x[k] := x[k − 2^(d−1)] + x[k]

Algorithm 1: A sum scan algorithm that is not work-efficient.

The pseudocode in Algorithm 1 shows a naïve parallel scan implementation. This algorithm is based on the scan algorithm presented by Hillis and Steele¹ [4], and demonstrated for GPUs by Horn [5]. The problem with Algorithm 1 is apparent if we examine its work complexity. The algorithm performs

    ∑_{d=1}^{log2 n} (n − 2^(d−1)) = O(n log2 n)

addition operations. Remember that a sequential scan performs O(n) adds. Therefore, this naïve implementation is not work-efficient. The factor of log2 n can have a large effect on performance. In the case of a scan of 1 million elements, the performance difference between this naïve implementation and a theoretical work-efficient parallel implementation would be almost a factor of 20.

Algorithm 1 assumes that there are as many processors as data elements. On a GPU running CUDA, this is not usually the case. Instead, the forall is automatically divided into small parallel batches (called warps) that are executed sequentially on a multiprocessor. A G80 GPU executes warps of 32 threads in parallel. Because not all threads run simultaneously for arrays larger than the warp size, the algorithm above will not work, since it performs the scan in place on the array: the results of one warp would be overwritten by threads in another warp.

To solve this problem, we need to double-buffer the array we are scanning. We use two temporary arrays (temp[2][n]) to do this. Pseudocode for this is given in Algorithm 2, and CUDA C code for the naïve scan is given in Listing 1. Note that this code will run on only a single thread block of the GPU, and so the size of the arrays it can process is limited (to 512 elements on G80 GPUs). Extension of scan to large arrays is discussed later.

    for d := 1 to log2 n do
        forall k in parallel do
            if k ≥ 2^(d−1) then
                x[out][k] := x[in][k − 2^(d−1)] + x[in][k]
            else
                x[out][k] := x[in][k]
        swap(in, out)

Algorithm 2: A double-buffered version of the sum scan from Algorithm 1.

¹ Note that while we call this a naïve scan in the context of CUDA and NVIDIA GPUs, it was not necessarily naïve for a Connection Machine [3], which is the machine Hillis and Steele were focused on. Related to work complexity is the concept of step complexity, which is the number of steps that the algorithm executes. The Connection Machine was a SIMD machine with many thousands of processors. In the limit where the number of processors equals the number of elements to be scanned, execution time is dominated by step complexity rather than work complexity. Algorithm 1 has a step complexity of O(log n), compared to the O(n) step complexity of the sequential algorithm, and is therefore step-efficient.

    d=0:  [x0,        x1,        x2,        x3,        x4,        x5,        x6,        x7       ]
    d=1:  [Σ(x0..x0), Σ(x0..x1), Σ(x1..x2), Σ(x2..x3), Σ(x3..x4), Σ(x4..x5), Σ(x5..x6), Σ(x6..x7)]
    d=2:  [Σ(x0..x0), Σ(x0..x1), Σ(x0..x2), Σ(x0..x3), Σ(x1..x4), Σ(x2..x5), Σ(x3..x6), Σ(x4..x7)]
    d=3:  [Σ(x0..x0), Σ(x0..x1), Σ(x0..x2), Σ(x0..x3), Σ(x0..x4), Σ(x0..x5), Σ(x0..x6), Σ(x0..x7)]

Figure 2: Computing a scan of an array of 8 elements using the naïve scan algorithm. Each row shows the contents of the array after the pass for the given value of d; Σ(xi..xj) denotes the sum xi + ... + xj.

    __global__ void scan(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[]; // allocated on invocation
        int thid = threadIdx.x;
        int pout = 0, pin = 1;

        // Load input into shared memory.
        // This is exclusive scan, so shift right by one and set first element to 0.
        temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
        __syncthreads();

        for (int offset = 1; offset < n; offset *= 2)
        {
            pout = 1 - pout; // swap double buffer indices
            pin  = 1 - pout;
            if (thid >= offset)
                temp[pout*n + thid] = temp[pin*n + thid] + temp[pin*n + thid - offset];
            else
                temp[pout*n + thid] = temp[pin*n + thid];
            __syncthreads();
        }

        g_odata[thid] = temp[pout*n + thid]; // write output
    }

Listing 1: CUDA C code for the naïve scan algorithm. This version can handle arrays only as large as can be processed by a single thread block running on one multiprocessor of a GPU.
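Because Listing 1 declares its shared memory as extern __shared__, the double buffer of 2n floats must be sized when the kernel is launched. The host-side driver below is a minimal sketch of ours (not part of the paper; error checking omitted) showing one way to invoke it:

    #include <cuda_runtime.h>

    // Hypothetical helper: scan n elements (n <= 512 on G80), one thread each.
    void naive_scan_on_device(float* h_out, const float* h_in, int n)
    {
        float *d_in, *d_out;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        // One block of n threads; the third launch parameter sizes the dynamic
        // shared memory: two n-element buffers for the double buffering.
        scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);

        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
    }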

A Work-Efficient Parallel Scan

Our goal in this section is to develop a work-efficient scan algorithm that avoids the extra factor of log n work performed by the naïve algorithm of the previous section. To do this we will use an algorithmic pattern that arises often in parallel computing: balanced trees. The idea is to build a balanced binary tree on the input data and sweep it to and from the root to compute the prefix sum. A binary tree with n leaves has log2 n levels, and each level d ∈ [0, log2 n) has 2^d nodes. If we perform one add per node, then we will perform O(n) adds on a single traversal of the tree.

The tree we build is not an actual data structure, but a concept we use to determine what each thread does at each step of the traversal. In this work-efficient scan algorithm, we perform the operations in place on an array in shared memory. The algorithm consists of two phases: the reduce phase (also known as the up-sweep phase) and the down-sweep phase. In the reduce phase we traverse the tree from leaves to root, computing partial sums at internal nodes of the tree, as shown in Figure 3. This is also known as a parallel reduction, because after this phase the root node (the last node in the array) holds the sum of all nodes in the array. Pseudocode for the reduce phase is given in Algorithm 3.

In the down-sweep phase, we traverse back down the tree from the root, using the partial sums computed by the reduce phase to build the scan in place on the array. The down-sweep is shown in Figure 4, and pseudocode is given in Algorithm 4. Note that because this is an exclusive scan (i.e., the total sum is not included in the results), between the phases we zero the last element of the array. This zero propagates back to the head of the array during the down-sweep phase. CUDA C code for the complete algorithm is given in Listing 2. Like the naïve scan code in the previous section, the code in Listing 2 will run on only a single thread block. Because it processes two elements per thread, the maximum array size this code can scan is 1024 elements on G80. Scans of large arrays are discussed later.

This scan algorithm performs O(n) operations (it performs 2*(n−1) adds and n−1 swaps); therefore it is work-efficient, and for large arrays it should perform much better than the naïve algorithm from the previous section. Algorithmic efficiency is not enough; we must also use the hardware efficiently. If we examine the operation of this scan on a GPU running CUDA, we will find that it suffers from many shared memory bank conflicts. These hurt the performance of every access to shared memory, and significantly affect overall performance. In the next section we will look at some simple modifications we can make to the memory address computations to recover much of that lost performance.

    for d := 0 to log2 n − 1 do
        for k from 0 to n − 1 by 2^(d+1) in parallel do
            x[k + 2^(d+1) − 1] := x[k + 2^d − 1] + x[k + 2^(d+1) − 1]

Algorithm 3: The up-sweep (reduce) phase of a work-efficient sum scan algorithm (after Blelloch [1]).

    input:      [x0, x1,        x2, x3,        x4, x5,        x6, x7       ]
    after d=0:  [x0, Σ(x0..x1), x2, Σ(x2..x3), x4, Σ(x4..x5), x6, Σ(x6..x7)]
    after d=1:  [x0, Σ(x0..x1), x2, Σ(x0..x3), x4, Σ(x4..x5), x6, Σ(x4..x7)]
    after d=2:  [x0, Σ(x0..x1), x2, Σ(x0..x3), x4, Σ(x4..x5), x6, Σ(x0..x7)]

Figure 3: An illustration of the up-sweep, or reduce, phase of a work-efficient sum scan algorithm on an array of 8 elements. Each row shows the array after the pass for the given value of d in Algorithm 3.
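Listing 2, which gives the CUDA C code for the complete algorithm, lies beyond this excerpt. To make the two phases concrete, the following sequential C sketch (ours, for reference only) performs the up-sweep of Algorithm 3 followed by the down-sweep described above; a real CUDA implementation runs each inner loop in parallel across threads.

    // Work-efficient exclusive sum scan in place; n must be a power of two.
    void blelloch_scan(float* x, int n)
    {
        // Up-sweep (reduce): build partial sums at the internal tree nodes.
        // Here 'stride' plays the role of 2^d in Algorithm 3.
        for (int stride = 1; stride < n; stride *= 2)
            for (int k = 0; k < n; k += 2 * stride)
                x[k + 2*stride - 1] += x[k + stride - 1];

        x[n - 1] = 0; // zero the last element (exclusive scan)

        // Down-sweep: each node passes its value to its left child, and the
        // sum of its value and the old left-child value to its right child.
        for (int stride = n / 2; stride >= 1; stride /= 2)
            for (int k = 0; k < n; k += 2 * stride)
            {
                float t = x[k + stride - 1];
                x[k + stride - 1]   = x[k + 2*stride - 1];
                x[k + 2*stride - 1] += t;
            }
    }

On the example array [3 1 7 0 4 1 6 3], this produces [0 3 4 11 11 15 16 22], matching the exclusive scan computed earlier.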
