PARALLEL PATTERNS: MERGE SORT
PARALLEL PATTERNS: MERGE SORT
Data Parallelism Data- Dependent Execution Data-Independent Data-Dependent Data Parallel Stencil Histogram SpMV Not Data Prefix Scan Merge Parallel
Data Parallelism / DataDependent Execution
Objective Study increasingly sophisticated parallel merge kernels Observe the combined effects of data- dependent execution and a lack of data parallelism on GPU algorithm design
Objective • Study increasingly sophisticated parallel merge kernels • Observe the combined effects of data - dependent execution and a lack of data parallelism on GPU algorithm design
Merge Sort Input:two sorted arrays Output:the (sorted)union of the input A: 8 9 10 B: 7 10 10 12 C: 1 8 9 10 10 10 12
Merge Sort • Input: two sorted arrays • Output: the (sorted) union of the input arrays
Merge Sort A bottom-up divide-and-conquer sorting algorithm O(n log n)average-(and worst-)case performance O(n)additional space requirement Merging two arrays is the core computation 6 -18-2 -4 3 6 -7-2-18
Merge Sort • A bottom-up divide-and-conquer sorting algorithm • O(n log n) average - (and worst - ) case performance • O(n) additional space requirement • Merging two arrays is the core computation
Other Uses for Merge Taking the union of two (non-overlapping) sparse matrices represented in the CSR format Each row is merged col indices are the keys In MapReduce,when Map produces sorted key-value pairs and Reduce must maintain sorting
Other Uses for Merge • Taking the union of two (non-overlapping) sparse matrices represented in the CSR format – Each row is merged – col_indices are the keys • In MapReduce, when Map produces sorted key-value pairs and Reduce must – maintain sorting
Sequential Merge void merge(const int A,int m,const int B,int n,int C){ int i=0;/Index into A int j=0;/Index into B int k=0;/Index into C //merge the initial overlapping sections of A and B while ((i<m)&&(j<n)){ if(A[0<=B]){ C[k+]=A[i+]; }else k increases by one for every C[k++]=B+]; iteration of the loops } if (i==m){ //done with A,place the rest of B for j<n;j++){ C[k+]=B; } }else{ //done with B,place the rest of A for (;i<m;i++){ C[k+]=A[]; In any given iteration (other than the first),the values of i and j are data-dependent
Sequential Merge void merge(const int * A, int m, const int * B, int n, int * C) { int i = 0; // Index into A int j = 0; // Index into B int k = 0; // Index into C // merge the initial overlapping sections of A and B while ((i < m) && (j < n)) { if (A[i] <= B[j]) { C[k++] = A[i++]; } else { C[k++] = B[j++]; } } if (i == m) { // done with A, place the rest of B for ( ; j < n; j++) { C[k++] = B[j]; } } else { // done with B, place the rest of A for ( ; i < m; i++) { C[k++] = A[i]; } } } k increases by one for every iteration of the loops In any given iteration (other than the first), the values of i and j are data-dependent
Sequential Merge Parallelization Challenges We could assign one thread to write each output element However,given a particular output location, the input element that belongs there is data-dependent The sequential merge is O(n)in the length of the output array so we must be work-efficient
Sequential Merge Parallelization Challenges • We could assign one thread to write each output element • However, given a particular output location, the input element that belongs – there is data-dependent • The sequential merge is O(n) in the length of the output array – so we must be work-efficient
Observations about Merge 1.For any k s.t.0<k<m+n,there is either: 一 a.anis.t.0si<m and C[k]A[i] -b.ajs.t.0≤j<n and C[k]∈B[j] ● 2.For any k s.t.0sk<m+n,there is an iand ajs.t.: -a.i+j=k -b.0≤i≤m c.0≤j≤n d.The subarray C[O:k-1]is the result of merging A[o: i-1]and B[O j-1] Indices i and jare referred to as co-ranks
Observations about Merge • 1. For any k s.t. 0 ≤ k < m + n, there is either: – a. an i s.t. 0 ≤ i < m and C[k] ⇐ A[i] – b. a j s.t. 0 ≤ j < n and C[k] ⇐ B[j] • 2. For any k s.t. 0 ≤ k < m + n, there is an i and a j s.t. : – a. i + j = k – b. 0 ≤ i ≤ m – c. 0 ≤ j ≤ n – d. The subarray C[0 : k-1] is the result of merging A[0 : i-1] and B[0 : j-1] Indices i and j are referred to as co-ranks
A Merge Parallelization Approach Assume a co-rank function of the form: co-rank(k,A,B)=i j=k-i We can use the co-rank function to map a range of output values to a range of input values We'll need to compute co-rank efficiently for a work-efficient merge
A Merge Parallelization Approach • Assume a co-rank function of the form: co-rank(k,A,B) = i j = k - i • We can use the co-rank function to map a range of output values to a range of input values • We’ll need to compute co-rank efficiently for a work-efficient merge