A Better Parallel Scan Algorithm

1. Read input from device global memory to shared memory
2. Iterate log(n) times; stride from 1 to n-1: double stride each iteration
• Active threads: stride to n-1 (n-stride threads)
• Thread j adds elements j and j-stride from shared memory and writes the result into element j in shared memory
• Requires barrier synchronization, once before read and once before write

ITERATION = 1, STRIDE = 1
XY (before): 3 1 7 0 4 1 6 3
XY (after):  3 4 8 7 4 5 7 9
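
The steps above can be sketched as a single-block CUDA kernel. This is a minimal illustration, not the course's reference code: the names `scan_section`, `scan_on_device`, and `SECTION_SIZE` are hypothetical, the element type is assumed to be `int`, and the input is assumed to fit in one thread block (n <= SECTION_SIZE). The two `__syncthreads()` calls correspond to the barrier before the read and the barrier before the write.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define SECTION_SIZE 8  // assumed block size: one thread per element

// Inclusive scan of one section held in shared memory (hypothetical kernel name).
__global__ void scan_section(const int *X, int *Y, int n) {
    __shared__ int XY[SECTION_SIZE];
    int j = threadIdx.x;

    // Step 1: read input from global memory into shared memory.
    XY[j] = (j < n) ? X[j] : 0;

    // Step 2: about log2(n) iterations, doubling the stride each time.
    for (int stride = 1; stride < n; stride *= 2) {
        __syncthreads();  // barrier before read: earlier writes are complete
        int sum = 0;
        bool active = (j >= stride && j < n);  // threads stride .. n-1
        if (active) sum = XY[j] + XY[j - stride];
        __syncthreads();  // barrier before write: all reads are complete
        if (active) XY[j] = sum;
    }

    // Each thread copies back the element it last wrote itself.
    if (j < n) Y[j] = XY[j];
}

// Hypothetical host helper: copies n values to the device, scans, copies back.
// Returns true when every CUDA call succeeded.
bool scan_on_device(const int *h_in, int *h_out, int n) {
    assert(n > 0 && n <= SECTION_SIZE);  // single-block sketch only
    size_t bytes = n * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaError_t err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        scan_section<<<1, SECTION_SIZE>>>(d_in, d_out, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

For the slide's input 3 1 7 0 4 1 6 3, the first iteration (stride 1) leaves 3 4 8 7 4 5 7 9 in shared memory, matching the figure; after the strides 2 and 4 the result is the inclusive scan 3 4 11 11 15 16 22 25. Reading into a register between the two barriers is one way to keep a thread from overwriting an element that another thread still needs to read.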