上海交通大学 SHANGHAI JIAO TONG UNIVERSITY

CS427 Multicore Architecture and Parallel Computing
Lecture 8: CUDA, cont'd
Prof. Xiaoyao Liang
2016/10/26
Register File Limitation

• If each block has 16×16 threads and each thread uses 10 registers, how many threads can run on each SM?
– Each block requires 10 × 256 = 2,560 registers
– 8,192 = 3 × 2,560 + change
– So three blocks can run on an SM as far as registers are concerned
• How about if each thread increases its use of registers by one?
– Each block now requires 11 × 256 = 2,816 registers
– 8,192 < 2,816 × 3
– Only two blocks can run on an SM: a 1/3 reduction in parallelism! (the arithmetic is sketched in code below)
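The occupancy arithmetic above can be written down directly in host code. A minimal sketch (plain C; the function name is ours, and it assumes registers are the only limiting resource, using the 8,192-register file per SM from the example):

#include <stdio.h>

/* Blocks that fit on one SM when the register file is the only constraint. */
int blocks_per_sm(int regfile_size, int threads_per_block, int regs_per_thread)
{
    return regfile_size / (threads_per_block * regs_per_thread);
}

int main(void)
{
    printf("%d\n", blocks_per_sm(8192, 256, 10)); /* prints 3 */
    printf("%d\n", blocks_per_sm(8192, 256, 11)); /* prints 2 */
    return 0;
}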
Dynamic Partitioning

• Dynamic partitioning gives more flexibility to compilers/programmers
• One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each
• This allows for finer-grain threading than traditional CPU threading models
• The compiler can trade off between instruction-level parallelism and thread-level parallelism (one way to steer this from source code is sketched below)
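CUDA exposes this register/occupancy tradeoff to the programmer through the __launch_bounds__ qualifier. A hedged sketch (the kernel itself is a placeholder of ours):

// Promise at most 256 threads per block and request at least 3 resident
// blocks per SM; the compiler budgets (or spills) registers to comply.
__global__ void __launch_bounds__(256, 3)
scaleKernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}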
ILP vs. TLP

• Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers, so 3 blocks can fit on an SM; global loads take 400 cycles
– 4 cycles × 4 inst × 24 warps = 384 < 400, not quite enough work to hide the load latency
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only 2 blocks can fit
– 4 cycles × 8 inst × 16 warps = 512 > 400, better hiding memory latency (an illustration follows)
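To illustrate "independent instructions per load" (our own sketch, not from the slides), the loop below issues four loads per iteration that do not depend on one another, so all four can be in flight while the warp waits:

// Strided sum with 4-way ILP: the four loads are mutually independent,
// so their latencies overlap before the adds consume the values.
// For brevity, assumes n is a multiple of 4 * (total thread count).
__global__ void sum4(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    for (int j = i; j < n; j += 4 * stride) {
        a0 += in[j];
        a1 += in[j + stride];
        a2 += in[j + 2 * stride];
        a3 += in[j + 3 * stride];
    }
    out[i] = a0 + a1 + a2 + a3;
}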
Memory Coalescing

[Figure: matrix access pattern traced over Time Period 1 and Time Period 2 against the linearized row-major layout M0,0 M1,0 M2,0 M3,0 M0,1 … M3,3. The access direction in the kernel code makes the simultaneous accesses of a warp land far apart in the layout: not coalesced.]
Memory Coalescing

[Figure: the complementary access pattern over Time Period 1 and Time Period 2 against the same linearized layout. Here the access direction in the kernel code makes the simultaneous accesses of a warp land on consecutive addresses: coalesced.]
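The two figures correspond to kernels like the following sketch (ours, assuming a row-major Width × Width matrix):

// Un-coalesced: each thread walks along its own row, so on every
// iteration the threads of a warp touch addresses Width floats apart.
__global__ void rowSums(const float* M, float* sums, int Width)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float s = 0.f;
    for (int k = 0; k < Width; ++k)
        s += M[row * Width + k];
    sums[row] = s;
}

// Coalesced: each thread walks down its own column, so on every
// iteration the threads of a warp touch consecutive addresses.
__global__ void colSums(const float* M, float* sums, int Width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float s = 0.f;
    for (int k = 0; k < Width; ++k)
        s += M[k * Width + col];
    sums[col] = s;
}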
Memory Coalescing

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m * TILE_WIDTH + ty) * Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}

Both global loads are coalesced: tx varies fastest across the threads of a warp, and it enters both index expressions with unit stride.
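For reference, a minimal host-side launch for this kernel might look like the following (a sketch of ours, assuming Width is a multiple of TILE_WIDTH and Md, Nd, Pd already reside in device memory):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);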
Increasing Per-Thread Work

• Each thread computes two elements of Pd (one possible layout is sketched below)
• Reduced loads from global memory (Md) to shared memory
• Reduced instruction overhead
• More work done in each iteration

[Figure: a thread block covering two adjacent TILE_WIDTH × TILE_WIDTH tiles of Pd]
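One way to realize this is sketched below (ours; the slide does not fix the exact layout): each thread computes two Pd elements TILE_WIDTH columns apart, so each Md tile is loaded into shared memory once but used for both results.

// Launch with dimBlock(TILE_WIDTH, TILE_WIDTH) and
// dimGrid(Width / (2 * TILE_WIDTH), Width / TILE_WIDTH).
__global__ void MatrixMulKernel2(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][2 * TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * (2 * TILE_WIDTH) + tx;

    float Pvalue0 = 0.f, Pvalue1 = 0.f;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // The Md tile is loaded once but reused for both Pd elements
        Mds[ty][tx] = Md[Row * Width + m * TILE_WIDTH + tx];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        Nds[ty][tx + TILE_WIDTH] = Nd[(m * TILE_WIDTH + ty) * Width + Col + TILE_WIDTH];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            Pvalue0 += Mds[ty][k] * Nds[k][tx];
            Pvalue1 += Mds[ty][k] * Nds[k][tx + TILE_WIDTH];
        }
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue0;
    Pd[Row * Width + Col + TILE_WIDTH] = Pvalue1;
}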
Double Buffering

• One could double buffer the computation, getting a better instruction mix within each thread
• This is classic software pipelining in ILP compilers

Single buffered:

    Loop {
        Load current tile to shared memory
        __syncthreads()
        Compute current tile
        __syncthreads()
    }

Double buffered:

    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        __syncthreads()
        Load next tile from global memory
        Compute current tile
        __syncthreads()
    }
Double Buffering

• Deposit "blue" tile from register into shared memory
• __syncthreads()
• Load "orange" tile into register
• Compute "blue" tile
• Deposit "orange" tile into shared memory
• … (a full kernel sketch follows)

[Figure: Pd strips TILE_WIDTH wide, colored blue and orange alternately to mark the current and next tiles]
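Putting the two Double Buffering slides together, a hedged sketch of the tiled kernel with tiles staged through registers (ours; the slides give only the pseudocode above):

// Double-buffered tiled matrix multiply: while the current tile is being
// computed from shared memory, the next tile is already being fetched
// from global memory into registers.
__global__ void MatrixMulKernelDB(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    // Load the first ("blue") tile from global memory into registers
    float mReg = Md[Row * Width + tx];
    float nReg = Nd[ty * Width + Col];

    float Pvalue = 0.f;
    int numTiles = Width / TILE_WIDTH;
    for (int m = 0; m < numTiles; ++m) {
        // Deposit the current tile from registers into shared memory
        Mds[ty][tx] = mReg;
        Nds[ty][tx] = nReg;
        __syncthreads();

        // Load the next ("orange") tile into registers; these loads are
        // independent of the compute loop below, so they overlap with it
        if (m + 1 < numTiles) {
            mReg = Md[Row * Width + (m + 1) * TILE_WIDTH + tx];
            nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + Col];
        }

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}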