LECTURE 5 PARALLEL COMPUTATION PATTERNS(HISTOGRAM)
Warps and SIMD Hardware
parallel histogram Data Racing condition privatized histogram kernel 笔十1女子 Uaimraity at Eleetreie Scieeand Tecgd
2 parallel histogram Data Racing condition privatized histogram kernel
Objective To learn the parallel histogram computation pattern -An important,useful computation -Very different from all the patterns we have covered so far in terms of output behavior of each thread: Output can be modified by all participating threads. 电子科妓女学 O
3 Objective – To learn the parallel histogram computation pattern – An important, useful computation – Very different from all the patterns we have covered so far in terms of output behavior of each thread: Output can be modified by all participating threads
A Text Histogram Example Define the bins as four-letter sections of the alphabet:a-d,e-h,i-l,n- p,… For each character in an input string,increment the appropriate bin counter. 一 In the phrase "Programming Massively Parallel Processors"the output histogram is shown below: 12 10 8 0 电子神戏女学 e-h m-p q-t U-X y-Z
4 A Text Histogram Example – Define the bins as four-letter sections of the alphabet: a-d, e-h, i-l, np, … – For each character in an input string, increment the appropriate bin counter. – In the phrase “Programming Massively Parallel Processors” the output histogram is shown below: 4
A simple parallel histogram algorithm Partition the input into sections Have each thread to take a section of the input -Each thread iterates through its section. For each letter,increment the appropriate bin counter 电子科妓女学 O
5 A simple parallel histogram algorithm – Partition the input into sections – Have each thread to take a section of the input – Each thread iterates through its section. – For each letter, increment the appropriate bin counter
Sectioned Partitioning (Iteration #1) p r 0 g r a m m n g m a i e y p a r a e l p r o r Thread 0 Thread 1 Thread 2 Thread 3 0 0 0 3 0 0 0 a-d e-h i-l m-p q-t U-X y-Z 电子神越女学 0
6 Sectioned Partitioning (Iteration #1)
Sectioned Partitioning (Iteration #2) 0 r 0 g r a m m n m a i e p a r e p r 0 e Thread 0 Thread 1 Thread 2 Thread 3 2 0 0 A 1 0 0 a-d e-h i m-p q-t U-X y-2 电子神越女学 O
7 Sectioned Partitioning (Iteration #2) 7
Input Partitioning Affects Memory Access Efficiency Sectioned partitioning results in poor memory access efficiency Adjacent threads do not access adjacent memory locations Accesses are not coalesced DRAM bandwidth is poorly utilized 电子科妓女学 O
8 Input Partitioning Affects Memory Access Efficiency – Sectioned partitioning results in poor memory access efficiency – Adjacent threads do not access adjacent memory locations – Accesses are not coalesced – DRAM bandwidth is poorly utilized
Input Partitioning Affects Memory Access Efficiency Sectioned partitioning results in poor memory access efficiency Adjacent threads do not access adjacent memory locations Accesses are not coalesced DRAM bandwidth is poorly utilized 111112222233333 4444 4 Change to interleaved partitioning All threads process a contiguous section of elements They all move to the next section and repeat The memory accesses are coalesced 12 3 4 23412 3 4 1 2 3 4 1 2 3 4 电子神越女学 0
9 Input Partitioning Affects Memory Access Efficiency – Sectioned partitioning results in poor memory access efficiency – Adjacent threads do not access adjacent memory locations – Accesses are not coalesced – DRAM bandwidth is poorly utilized – Change to interleaved partitioning – All threads process a contiguous section of elements – They all move to the next section and repeat – The memory accesses are coalesced 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
Interleaved Partitioning of Input For coalescing and better memory access performance r o g r a mm i n g m a ss e p a Thread 0 Thread 1 Thread 2 Thread 3 0 1 0 2 1 0 0 a-d e-h m-p q-t U-X y-z 电子神越女学 O
10 Interleaved Partitioning of Input – For coalescing and better memory access performance …