ILP vs. TLP
• Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers, so 3 blocks can fit. Global loads take 400 cycles.
  – 4 cycles * 4 inst * 24 warps = 384 < 400, so the memory latency is not fully hidden
• If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load, only 2 blocks can fit.
  – 4 cycles * 8 inst * 16 warps = 512 > 400, so the memory latency is hidden better
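The arithmetic on this slide can be checked with a short sketch. The slide gives the 256-thread blocks, the 4 issue cycles per instruction, the 400-cycle load latency, and the block counts. The warp size of 32, the register file of 8192 registers per multiprocessor, and the limit of 768 resident threads are assumptions taken from G80-class hardware, which these numbers appear to match. All function names are hypothetical.

```python
WARP_SIZE = 32             # threads per warp (assumption: NVIDIA hardware)
REGISTERS_PER_SM = 8192    # assumption: G80-class register file, not stated on the slide
MAX_THREADS_PER_SM = 768   # assumption: G80-class resident-thread limit
ISSUE_CYCLES = 4           # cycles to issue one warp instruction (from the slide)
LOAD_LATENCY = 400         # global load latency in cycles (from the slide)


def blocks_that_fit(threads_per_block, regs_per_thread):
    """Blocks resident on one multiprocessor, limited by registers and threads."""
    by_registers = REGISTERS_PER_SM // (threads_per_block * regs_per_thread)
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    return min(by_registers, by_threads)


def resident_warps(threads_per_block, blocks):
    """Number of warps available for the scheduler to switch between."""
    return (threads_per_block // WARP_SIZE) * blocks


def cycles_covered(threads_per_block, blocks, independent_insts):
    """Cycles of useful work available while one global load is outstanding."""
    return ISSUE_CYCLES * independent_insts * resident_warps(threads_per_block, blocks)


if __name__ == "__main__":
    for regs, insts in ((10, 4), (11, 8)):
        blocks = blocks_that_fit(256, regs)
        covered = cycles_covered(256, blocks, insts)
        verdict = "hides" if covered >= LOAD_LATENCY else "does not hide"
        print(f"{regs} regs/thread: {blocks} blocks, {covered} cycles, "
              f"{verdict} the {LOAD_LATENCY}-cycle latency")
```

Under these assumptions, 10 registers per thread gives 3 blocks and 384 cycles, while 11 registers per thread gives 2 blocks and 512 cycles. Fewer resident warps with more independent instructions per warp cover more of the load latency here.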