
University of Electronic Science and Technology of China: GPU Parallel Programming, course teaching resources (lecture slides), Lecture 09 Parallel Patterns (Merge Sort)


Resource category: document library; format: PDF; pages: 36; file size: 565.11 KB

PARALLEL PATTERNS: MERGE SORT

Data Parallelism / Data-Dependent Execution
• Data parallel: Stencil, Histogram, SpMV (listed from data-independent to data-dependent)
• Not data parallel: Prefix Scan (data-independent), Merge (data-dependent)

Objective
• Study increasingly sophisticated parallel merge kernels
• Observe the combined effects of data-dependent execution and a lack of data parallelism on GPU algorithm design

Merge Sort
• Input: two sorted arrays
• Output: the (sorted) union of the input arrays
(Figure: two example sorted arrays A and B merged into a sorted array C)

Merge Sort
• A bottom-up divide-and-conquer sorting algorithm
• O(n log n) average-case (and worst-case) performance
• O(n) additional space requirement
• Merging two arrays is the core computation
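The bullet points above can be made concrete with a small bottom-up merge sort. The following is a minimal sketch in plain C, not code from the lecture: the names merge_runs and merge_sort_bottom_up are hypothetical, and the width-doubling arithmetic assumes n is small enough that 2 * width does not overflow an int. It shows the O(n) scratch buffer and makes visible that every pass consists only of merges.

```c
#include <stdlib.h>
#include <string.h>

/* Merge the sorted runs src[lo, mid) and src[mid, hi) into dst[lo, hi). */
static void merge_runs(const int *src, int *dst, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        /* "<=" takes ties from the left run, which keeps the sort stable. */
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    }
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Bottom-up merge sort (illustrative name).
 * Returns 0 on success, -1 if the O(n) scratch buffer cannot be allocated. */
int merge_sort_bottom_up(int *data, int n) {
    if (n < 2) return 0;
    int *buf = (int *)malloc((size_t)n * sizeof *buf);
    if (buf == NULL) return -1;

    int *src = data, *dst = buf;
    /* Runs of length `width` are already sorted; each pass doubles it. */
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = (lo + width < n) ? lo + width : n;
            int hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge_runs(src, dst, lo, mid, hi);
        }
        int *tmp = src; src = dst; dst = tmp;  /* output becomes next input */
    }
    if (src != data) memcpy(data, src, (size_t)n * sizeof *data);
    free(buf);
    return 0;
}
```

Calling merge_sort_bottom_up(a, n) sorts a in place. There are about log2(n) passes, and each pass touches every element once, which is where the O(n log n) bound comes from.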

Other Uses for Merge
• Taking the union of two (non-overlapping) sparse matrices represented in the CSR format
  – Each row is merged
  – col_indices are the keys
• In MapReduce, when Map produces sorted key-value pairs and Reduce must maintain sorting

Sequential Merge

    void merge(const int *A, int m, const int *B, int n, int *C) {
        int i = 0; // Index into A
        int j = 0; // Index into B
        int k = 0; // Index into C

        // Merge the initial overlapping sections of A and B
        while ((i < m) && (j < n)) {
            if (A[i] <= B[j]) {
                C[k++] = A[i++];
            } else {
                C[k++] = B[j++];
            }
        }

        if (i == m) {
            // Done with A, place the rest of B
            for ( ; j < n; j++) {
                C[k++] = B[j];
            }
        } else {
            // Done with B, place the rest of A
            for ( ; i < m; i++) {
                C[k++] = A[i];
            }
        }
    }

• k increases by one for every iteration of the loops
• In any given iteration (other than the first), the values of i and j are data-dependent

Sequential Merge Parallelization Challenges
• We could assign one thread to write each output element
• However, given a particular output location, the input element that belongs there is data-dependent
• The sequential merge is O(n) in the length of the output array, so we must be work-efficient

Observations about Merge
1. For any k s.t. 0 ≤ k < m + n, there is either:
  – a. an i s.t. 0 ≤ i < m and C[k] ⇐ A[i] (C[k] receives its value from A[i]), or
  – b. a j s.t. 0 ≤ j < n and C[k] ⇐ B[j]
2. For any k s.t. 0 ≤ k < m + n, there is an i and a j s.t.:
  – a. i + j = k
  – b. 0 ≤ i ≤ m
  – c. 0 ≤ j ≤ n
  – d. The subarray C[0 : k-1] is the result of merging A[0 : i-1] and B[0 : j-1]
Indices i and j are referred to as co-ranks
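Observation 2 can be checked directly: running the sequential merge for exactly k steps and then reading off i and j yields the co-ranks of k. The following is a minimal sketch in plain C under that reading. The function name co_rank_by_merging is hypothetical, ties are assumed to be taken from A first (matching the A[i] <= B[j] test in the sequential merge), and k = m + n is also accepted for convenience.

```c
#include <assert.h>

/* Co-ranks of output position k, found by replaying the first k steps
 * of the sequential merge. Cost is O(k), so this is a definition check,
 * not an efficient co-rank function. */
void co_rank_by_merging(int k, const int *A, int m, const int *B, int n,
                        int *i_out, int *j_out) {
    assert(k >= 0 && k <= m + n);
    int i = 0; /* elements consumed from A so far */
    int j = 0; /* elements consumed from B so far */
    while (i + j < k) {
        /* Take from A when B is exhausted, or when A[i] wins the tie-aware
         * comparison; otherwise take from B. */
        if (j >= n || (i < m && A[i] <= B[j])) {
            i++;
        } else {
            j++;
        }
    }
    *i_out = i; /* C[0 : k-1] is the merge of A[0 : i-1] ...   */
    *j_out = j; /* ... and B[0 : j-1], with i + j == k.        */
}
```

For the illustrative arrays A = {1, 7, 8, 9, 10} and B = {7, 10, 10, 12}, k = 3 gives i = 2 and j = 1: the first three outputs 1, 7, 7 come from A[0 : 1] and B[0 : 0].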

A Merge Parallelization Approach
• Assume a co-rank function of the form: co-rank(k, A, B) = i, with j = k - i
• We can use the co-rank function to map a range of output values to a range of input values
• We'll need to compute co-rank efficiently for a work-efficient merge
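One way to compute co-rank efficiently is a binary search over i, which costs O(log(min(k, m))) per call. The sketch below is plain C and is an assumption about how such a function could look, not the lecture's kernel: co_rank uses a simple lower-bound search, merge_partitioned is a hypothetical helper, and its loop over parts stands in for independent GPU threads. Ties are taken from A first, as in the sequential merge.

```c
#include <assert.h>

/* Returns i such that the first k outputs of merging A (length m) and
 * B (length n) are exactly A[0 : i-1] and B[0 : k-i-1]. */
int co_rank(int k, const int *A, int m, const int *B, int n) {
    assert(k >= 0 && k <= m + n);
    int lo = (k - n > 0) ? k - n : 0; /* i >= max(0, k - n) so that j <= n */
    int hi = (k < m) ? k : m;         /* i <= min(k, m)     so that j >= 0 */
    while (lo < hi) {
        int i = lo + (hi - lo) / 2;
        int j = k - i; /* here 1 <= j <= n and i < m, so both reads are valid */
        if (B[j - 1] >= A[i]) {
            lo = i + 1; /* A[i] precedes B[j-1] in the output: i is too small */
        } else {
            hi = i;
        }
    }
    return lo;
}

/* Plain sequential merge, used for each output range. */
static void merge_sequential(const int *A, int m, const int *B, int n, int *C) {
    int i = 0, j = 0, k = 0;
    while (i < m && j < n) C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
    while (i < m) C[k++] = A[i++];
    while (j < n) C[k++] = B[j++];
}

/* Merge A and B into C as `parts` independent output ranges. Each loop
 * iteration reads and writes disjoint ranges, so on a GPU each one could
 * be assigned to its own thread. */
void merge_partitioned(const int *A, int m, const int *B, int n, int *C, int parts) {
    assert(parts > 0);
    int total = m + n;
    int chunk = (total + parts - 1) / parts; /* ceil(total / parts) */
    for (int t = 0; t < parts; t++) {
        int k_cur  = (t * chunk < total) ? t * chunk : total;
        int k_next = ((t + 1) * chunk < total) ? (t + 1) * chunk : total;
        int i_cur  = co_rank(k_cur, A, m, B, n);
        int i_next = co_rank(k_next, A, m, B, n);
        int j_cur  = k_cur - i_cur;
        int j_next = k_next - i_next;
        merge_sequential(A + i_cur, i_next - i_cur,
                         B + j_cur, j_next - j_cur, C + k_cur);
    }
}
```

With `parts` ranges, the total work is the O(m + n) merge plus two binary searches per range, which keeps the approach work-efficient as long as each range is much longer than log(m + n).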
