LECTURE6-PARALLEL COMPUTATION PATTERNS (SCAN)
Prefix Sum A Work-inefficient Scan Kernel A Work-Efficient Parallel Scan Kernel More on Parallel Scan 电子件做女字
2 Prefix Sum A Work-inefficient Scan Kernel A Work-Efficient Parallel Scan Kernel More on Parallel Scan
Objective To master parallel scan (prefix sum)algorithms Frequently used for parallel work assignment and resource allocation A key primitive in many parallel algorithms to convert serial computation into parallel computation A foundational parallel computation pattern Work efficiency in parallel code/algorithms Reading -Mark Harris,Parallel Prefix Sum with CUDA http://http.developer.nvidia.com/GPUGems3/gpugems3 ch39.html 电子科妓女学 O
3 Objective – To master parallel scan (prefix sum) algorithms – Frequently used for parallel work assignment and resource allocation – A key primitive in many parallel algorithms to convert serial computation into parallel computation – A foundational parallel computation pattern – Work efficiency in parallel code/algorithms – Reading –Mark Harris, Parallel Prefix Sum with CUDA – http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html 3
Inclusive Scan (Prefix-Sum)Definition Definition:The scan operation takes a binary associative operator (pronounced as circle plus),and an array of n elements [oxx and returns the array [xo(x0⊕x1),(x0⊕x1⊕..⊕xn-)1 Example:If is addition,then scan operation on the array would return [31704163], [34111115162225]. 电子科效女学 O
4 Inclusive Scan (Prefix-Sum) Definition Definition: The scan operation takes a binary associative operator ⊕ (pronounced as circle plus), and an array of n elements [x0 , x1 , …, xn-1 ], and returns the array [x0 , (x0 ⊕ x1 ), …, (x0 ⊕ x1 ⊕ … ⊕ xn-1 )]. Example: If ⊕ is addition, then scan operation on the array would return [3 1 7 0 4 1 6 3], [3 4 11 11 15 16 22 25]
An Inclusive Scan Application Example Assume that we have a 100-inch sandwich to feed 10 people We know how much each person wants in inches -[35272843081] How do we cut the sandwich quickly? How much will be left? Method 1:cut the sections sequentially:3 inches first,5 inches second,2 inches third,etc. Method 2:calculate prefix sum: -[3,8,10,17,45,49,52,52,60,61](39 inches left) 电子科妓女学 O
5 An Inclusive Scan Application Example – Assume that we have a 100-inch sandwich to feed 10 people – We know how much each person wants in inches – [3 5 2 7 28 4 3 0 8 1] – How do we cut the sandwich quickly? – How much will be left? – Method 1: cut the sections sequentially: 3 inches first, 5 inches second, 2 inches third, etc. – Method 2: calculate prefix sum: – [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left) 5
Typical Applications of Scan Scan is a simple and useful parallel building block -Convert recurrences from sequential: for(j=1;j<n;j++) out[j]out[j-1]f(j); Into parallel: forall(j)temp[j]=f(j)}; scan (out,temp); 电子科妓女学 O
6 Typical Applications of Scan – Scan is a simple and useful parallel building block – Convert recurrences from sequential: for(j=1;j<n;j++) out[j] = out[j-1] + f(j); – Into parallel: forall(j) { temp[j] = f(j) }; scan(out, temp);
Other Applications Assigning camping spots Assigning Farmer's Market spaces 一 Allocating memory to parallel threads Allocating memory buffer space for communication channels 电子科妓女学 O
7 Other Applications – Assigning camping spots – Assigning Farmer’s Market spaces – Allocating memory to parallel threads – Allocating memory buffer space for communication channels – … 7
An Inclusive Sequential Addition Scan Given a sequence [Xo,X1,X2,... Calculate output [yo,y1,y2,...] Such that yo=Xo y1=X0+X1 y2=X0+X1+X2 Using a recursive definition y%=y%-1+X 电子科妓女学 O
8 An Inclusive Sequential Addition Scan Given a sequence [x0 , x1 , x2 , ... ] Calculate output [y0 , y1 , y2 , ... ] Such that y0 = x0 y1 = x0 + x1 y2 = x0 + x1+ x2 … Using a recursive definition yi = yi − 1 + xi 8
A Work Efficient C Implementation y[0]=[0]: for(i=1;i<Max_;i++)y[0=y[i-1]+[0]: Computationally efficient: N additions needed for N elements-O(N)! Only slightly more expensive than sequential reduction. 电子科妓女学 O
9 A Work Efficient C Implementation y[0] = x[0]; for (i = 1; i < Max_i; i++) y[i] = y [i-1] + x[i]; Computationally efficient: N additions needed for N elements - O(N) ! Only slightly more expensive than sequential reduction. 9
A Naive Inclusive Parallel Scan Assign one thread to calculate each y element Have every thread to add up all x elements needed for the y element yo=Xo y1=X0+X1 y2=X0+X1+X2 "Parallel programming is easy as long as you do not care about performance." 电子科妓女学 O
10 A Naïve Inclusive Parallel Scan – Assign one thread to calculate each y element – Have every thread to add up all x elements needed for the y element y0 = x0 y1 = x0 + x1 y2 = x0 + x1+ x2 “Parallel programming is easy as long as you do not care about performance.” 10