ILP Vs. TLP 4 • Assume that a kernel _中国高校课件下载中心

点击下载：上海交通大学：《Multicore Architecture and Parallel Computing》课程教学资源（PPT课件讲稿）Lecture 8 CUDA, cont’d

正在加载图片...

O)ILP VS. TLP assume that a kernel has 256-thread blocks. 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers, can fit 3 blocks global loads have 400 cycles 4 cycles 4 inst 24 warps =384 400 If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, can only fit 2 blocks 4 cycles"8 inst *16warps=512>400, better hiding memory latencILP Vs. TLP 4 • Assume that a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers, can fit 3 blocks global loads have 400 cycles – 4 cycles * 4 inst * 24 warps = 384 < 400 • If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, can only fit 2 blocks – 4 cycles * 8 inst * 16 warps = 512 > 400, better hiding memory latency

<<向上翻页向下翻页>>

点击下载：上海交通大学：《Multicore Architecture and Parallel Computing》课程教学资源（PPT课件讲稿）Lecture 8 CUDA, cont’d