上海交通大学 SHANGHAI JIAO TONG UNIVERSITY

CS427 Multicore Architecture and Parallel Computing
Lecture 8: CUDA, cont'd
Prof. Xiaoyao Liang
2016/10/26
Register File Limitation

• If each block has 16×16 threads and each thread uses 10 registers, how many threads can run on each SM?
– Each block requires 10 × 256 = 2,560 registers
– 8,192 = 3 × 2,560 + change
– So three blocks can run on an SM as far as registers are concerned
• How about if each thread increases its use of registers by one?
– Each block now requires 11 × 256 = 2,816 registers
– 8,192 < 2,816 × 3
– Only two blocks can run on an SM: a 1/3 reduction in parallelism! (the arithmetic is sketched in code below)
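The occupancy arithmetic above can be written down directly in host code. A minimal sketch (plain C; the function name is ours, and it assumes registers are the only limiting resource, using the 8,192-register file per SM from the example):

#include <stdio.h>

/* Blocks that fit on one SM when the register file is the only constraint. */
int blocks_per_sm(int regfile_size, int threads_per_block, int regs_per_thread)
{
    return regfile_size / (threads_per_block * regs_per_thread);
}

int main(void)
{
    printf("%d\n", blocks_per_sm(8192, 256, 10)); /* prints 3 */
    printf("%d\n", blocks_per_sm(8192, 256, 11)); /* prints 2 */
    return 0;
}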
Dynamic Partitioning

• Dynamic partitioning gives more flexibility to compilers/programmers
• One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each
• This allows for finer-grain threading than traditional CPU threading models
• The compiler can trade off between instruction-level parallelism and thread-level parallelism (one way to steer this from source code is sketched below)
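CUDA exposes this register/occupancy tradeoff to the programmer through the __launch_bounds__ qualifier. A hedged sketch (the kernel itself is a placeholder of ours):

// Promise at most 256 threads per block and request at least 3 resident
// blocks per SM; the compiler budgets (or spills) registers to comply.
__global__ void __launch_bounds__(256, 3)
scaleKernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}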
ILP vs. TLP

• Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers, so 3 blocks can fit on an SM; global loads take 400 cycles
– 4 cycles × 4 inst × 24 warps = 384 < 400, not quite enough work to hide the load latency
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only 2 blocks can fit
– 4 cycles × 8 inst × 16 warps = 512 > 400, better hiding memory latency (an illustration follows)
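To illustrate "independent instructions per load" (our own sketch, not from the slides), the loop below issues four loads per iteration that do not depend on one another, so all four can be in flight while the warp waits:

// Strided sum with 4-way ILP: the four loads are mutually independent,
// so their latencies overlap before the adds consume the values.
// For brevity, assumes n is a multiple of 4 * (total thread count).
__global__ void sum4(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
    for (int j = i; j < n; j += 4 * stride) {
        a0 += in[j];
        a1 += in[j + stride];
        a2 += in[j + 2 * stride];
        a3 += in[j + 3 * stride];
    }
    out[i] = a0 + a1 + a2 + a3;
}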
Memory Coalescing

[Figure: matrix access pattern traced over Time Period 1 and Time Period 2 against the linearized row-major layout M0,0 M1,0 M2,0 M3,0 M0,1 … M3,3. The access direction in the kernel code makes the simultaneous accesses of a warp land far apart in the layout: not coalesced.]
Memory Coalescing

[Figure: the complementary access pattern over Time Period 1 and Time Period 2 against the same linearized layout. Here the access direction in the kernel code makes the simultaneous accesses of a warp land on consecutive addresses: coalesced.]
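The two figures correspond to kernels like the following sketch (ours, assuming a row-major Width × Width matrix):

// Un-coalesced: each thread walks along its own row, so on every
// iteration the threads of a warp touch addresses Width floats apart.
__global__ void rowSums(const float* M, float* sums, int Width)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float s = 0.f;
    for (int k = 0; k < Width; ++k)
        s += M[row * Width + k];
    sums[row] = s;
}

// Coalesced: each thread walks down its own column, so on every
// iteration the threads of a warp touch consecutive addresses.
__global__ void colSums(const float* M, float* sums, int Width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float s = 0.f;
    for (int k = 0; k < Width; ++k)
        s += M[k * Width + col];
    sums[col] = s;
}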
Memory Coalescing

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m * TILE_WIDTH + ty) * Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}

Both global loads are coalesced: tx varies fastest across the threads of a warp, and it enters both index expressions with unit stride.
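For reference, a minimal host-side launch for this kernel might look like the following (a sketch of ours, assuming Width is a multiple of TILE_WIDTH and Md, Nd, Pd already reside in device memory):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);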
Increasing Per-Thread Work

• Each thread computes two elements of Pd (one possible layout is sketched below)
• Reduced loads from global memory (Md) to shared memory
• Reduced instruction overhead
• More work done in each iteration

[Figure: a thread block covering two adjacent TILE_WIDTH × TILE_WIDTH tiles of Pd]
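One way to realize this is sketched below (ours; the slide does not fix the exact layout): each thread computes two Pd elements TILE_WIDTH columns apart, so each Md tile is loaded into shared memory once but used for both results.

// Launch with dimBlock(TILE_WIDTH, TILE_WIDTH) and
// dimGrid(Width / (2 * TILE_WIDTH), Width / TILE_WIDTH).
__global__ void MatrixMulKernel2(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][2 * TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * (2 * TILE_WIDTH) + tx;

    float Pvalue0 = 0.f, Pvalue1 = 0.f;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // The Md tile is loaded once but reused for both Pd elements
        Mds[ty][tx] = Md[Row * Width + m * TILE_WIDTH + tx];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        Nds[ty][tx + TILE_WIDTH] = Nd[(m * TILE_WIDTH + ty) * Width + Col + TILE_WIDTH];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            Pvalue0 += Mds[ty][k] * Nds[k][tx];
            Pvalue1 += Mds[ty][k] * Nds[k][tx + TILE_WIDTH];
        }
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue0;
    Pd[Row * Width + Col + TILE_WIDTH] = Pvalue1;
}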
Double Buffering

• One could double buffer the computation, getting a better instruction mix within each thread
• This is classic software pipelining in ILP compilers

Single buffered:

    Loop {
        Load current tile to shared memory
        __syncthreads()
        Compute current tile
        __syncthreads()
    }

Double buffered:

    Load next tile from global memory
    Loop {
        Deposit current tile to shared memory
        __syncthreads()
        Load next tile from global memory
        Compute current tile
        __syncthreads()
    }
Double Buffering

• Deposit "blue" tile from register into shared memory
• __syncthreads()
• Load "orange" tile into register
• Compute "blue" tile
• Deposit "orange" tile into shared memory
• … (a full kernel sketch follows)

[Figure: Pd strips TILE_WIDTH wide, colored blue and orange alternately to mark the current and next tiles]
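Putting the two Double Buffering slides together, a hedged sketch of the tiled kernel with tiles staged through registers (ours; the slides give only the pseudocode above):

// Double-buffered tiled matrix multiply: while the current tile is being
// computed from shared memory, the next tile is already being fetched
// from global memory into registers.
__global__ void MatrixMulKernelDB(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    // Load the first ("blue") tile from global memory into registers
    float mReg = Md[Row * Width + tx];
    float nReg = Nd[ty * Width + Col];

    float Pvalue = 0.f;
    int numTiles = Width / TILE_WIDTH;
    for (int m = 0; m < numTiles; ++m) {
        // Deposit the current tile from registers into shared memory
        Mds[ty][tx] = mReg;
        Nds[ty][tx] = nReg;
        __syncthreads();

        // Load the next ("orange") tile into registers; these loads are
        // independent of the compute loop below, so they overlap with it
        if (m + 1 < numTiles) {
            mReg = Md[Row * Width + (m + 1) * TILE_WIDTH + tx];
            nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + Col];
        }

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}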