GPU Teaching Kit – Accelerated Computing
Module 11 – Computational Thinking
Objective
– To provide you with a framework for further studies on:
  – Thinking about the problems of parallel programming
  – Discussing your work with others
  – Approaching complex parallel programming problems
  – Using or building useful tools and environments
Fundamentals of Parallel Computing
– Parallel computing requires that:
  – The problem can be decomposed into sub-problems that can be safely solved at the same time
  – The programmer structures the code and data to solve these sub-problems concurrently
– The goals of parallel computing are:
  – To solve problems in less time (strong scaling), and/or
  – To solve bigger problems (weak scaling), and/or
  – To achieve better solutions (advancing science)
– The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency (quantified in the sketch after this list)
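A rough way to quantify the strong- and weak-scaling goals above is with Amdahl's and Gustafson's laws (standard results, not stated on the slide). With parallelizable fraction $f$ of the work and $P$ processors:

\[
S_{\text{strong}}(P) = \frac{1}{(1-f) + f/P},
\qquad
S_{\text{weak}}(P) = (1-f) + fP
\]

For example, at $f = 0.95$ the strong-scaling speedup can never exceed $1/(1-f) = 20$ no matter how many processors are used; weak scaling sidesteps this cap by growing the problem with $P$, which is one reason problems "must be large enough" to justify parallel execution.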
Shared Memory vs. Message Passing
– We have focused on shared memory parallel programming
  – This is what CUDA (and OpenMP, OpenCL) is based on
– Future massively parallel microprocessors are expected to support shared memory at the chip level
– The programming considerations of the message passing model are quite different!
  – However, you will find parallels for almost every technique you learned in this course
  – Need to be aware of space-time constraints
Data Sharing
– Data sharing can be a double-edged sword
  – Excessive data sharing drastically reduces the advantage of parallel execution
  – Localized sharing can improve memory bandwidth efficiency
– Efficient memory bandwidth usage can be achieved by synchronizing the execution of task groups and coordinating their usage of memory data
  – Efficient use of on-chip, shared storage and datapaths (see the sketch below)
– Read-only sharing can usually be done at much higher efficiency than read-write sharing, which often requires more synchronization
  – Sharing patterns: Many:Many, One:Many, Many:One, One:One
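To make "localized, read-only sharing" concrete, here is a minimal CUDA sketch (the kernel and its names are illustrative, not from the slides): each block stages a tile of a read-only input into on-chip shared memory once, and every thread in the block then reuses it, trading many redundant global-memory reads for fast shared-memory reads.

```cuda
#define TILE 256  // assumed block size; launch with blockDim.x == TILE

// Normalize each element by the sum of its block's tile.
__global__ void normalizeByTileSum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                     // on-chip storage shared by the block

    int idx = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;  // one global read per element
    __syncthreads();                                 // after this, the tile is read-only

    // Every thread reads the whole tile; these TILE reads per thread hit
    // shared memory instead of global memory.
    float sum = 0.0f;
    for (int j = 0; j < TILE; ++j)
        sum += tile[j];

    if (idx < n)
        out[idx] = (sum != 0.0f) ? tile[threadIdx.x] / sum : 0.0f;
}
```

Because no thread writes the tile after the barrier, this is pure read-only sharing within the block and needs only the single `__syncthreads()`; a read-write version of the same pattern would need a barrier around every update.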
Synchronization
– Synchronization == Control Sharing
  – Barriers make threads wait until all threads catch up
  – Waiting is a lost opportunity for work
  – Atomic operations may reduce waiting
    – Watch out for serialization
– Important: be aware of which items of work are truly independent (see the sketch below)
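Both synchronization flavors from this slide appear side by side in the following CUDA sketch (kernel name and structure are illustrative): `__syncthreads()` barriers order the steps of a block-level tree reduction, and a single `atomicAdd` per block then folds the partial sums into a global total, limiting serialization on the shared counter compared with one atomic per thread.

```cuda
// Assumes blockDim.x is a power of two and *total is zeroed before launch.
__global__ void sumReduce(const float *in, float *total, int n) {
    extern __shared__ float partial[];           // one slot per thread

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                             // barrier: all loads finished

    // Tree reduction: each step halves the number of active threads, so the
    // inactive ones wait at the barrier (lost opportunity for work).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);            // one serialized update per block
}
```

Launched as `sumReduce<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_total, n)`, the kernel performs only `blocks` atomic updates in total; those atomics still serialize, which is the "watch out for serialization" point above.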
Parallel Programming Coding Styles – Program and Data Models
– Program Models
  – SPMD
  – Master/Worker
  – Loop Parallelism
  – Fork/Join
– Data Models
  – Shared Data
  – Shared Queue
  – Distributed Array
These are not necessarily mutually exclusive.
Program Models
– SPMD (Single Program, Multiple Data)
  – All PEs (Processing Elements) execute the same program in parallel, but each has its own data
  – Each PE uses a unique ID to access its portion of data
  – Different PEs can follow different paths through the same code
  – This is essentially the CUDA Grid model (also OpenCL, MPI); see the sketch below
  – SIMD is a special case: warps are used for efficiency
– Master/Worker
– Loop Parallelism
– Fork/Join
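The CUDA flavor of SPMD fits in a few lines. A minimal sketch using the standard saxpy example (not taken from the slides): every thread executes the same kernel body, and its built-in IDs form the unique ID that selects its portion of the data.

```cuda
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread ID
    if (i < n)                       // threads past the end take a different path
        y[i] = a * x[i] + y[i];      // each thread touches only its own element
}
```

The `if (i < n)` guard is exactly the "different paths through the same code" point: all threads run one program, yet out-of-range threads simply fall through.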
SPMD
1. Initialize: establish localized data structures and communication channels.
2. Uniquify: each thread acquires a unique identifier, typically ranging from 0 to N-1, where N is the number of threads. Both OpenMP and CUDA have built-in support for this.
3. Distribute data: decompose global data into chunks and localize them, or share/replicate major data structures, using thread IDs to associate subsets of the data with threads.
4. Compute: run the core computation! Thread IDs are used to differentiate the behavior of individual threads. Use the thread ID in loop index calculations to split loop iterations among threads (beware of the potential for memory/data divergence). Use the thread ID, or conditions based on it, to branch to thread-specific actions (beware of the potential for instruction/execution divergence).
5. Finalize: reconcile global data structures, and prepare for the next major iteration or group of program phases (these five phases are mapped onto CUDA in the sketch below).
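A host-side sketch of how the five phases might map onto a CUDA program, reusing the `saxpy` kernel from the previous sketch (the function and variable names here are illustrative; phase 2 is implicit because CUDA assigns each thread its built-in IDs at launch):

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float *x, float *y, int n);  // from the sketch above

void runSaxpy(float a, const float *h_x, float *h_y, int n) {
    // 1. Initialize: set up device-side data structures.
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    // 3. Distribute data: copy (replicate) the inputs to the device; thread
    //    IDs will associate subsets of the arrays with threads.
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Uniquify + 4. Compute: the launch gives every thread its unique
    //    (blockIdx, threadIdx) pair, and the kernel splits the work by it.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(a, d_x, d_y, n);

    // 5. Finalize: reconcile the global data structure on the host before
    //    the next program phase.
    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}
```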
Program Models
– SPMD (Single Program, Multiple Data)
– Master/Worker (OpenMP, OpenACC, TBB)
  – A master thread sets up a pool of worker threads and a bag of tasks
  – Workers execute concurrently, removing tasks until done
– Loop Parallelism (OpenMP, OpenACC, C++ AMP)
  – Loop iterations execute in parallel (see the sketch after this list)
  – FORTRAN do-all (truly parallel), do-across (with dependences)
– Fork/Join (POSIX pthreads)
  – The most general, generic way of creating threads
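In CUDA, loop parallelism is commonly expressed as a grid-stride loop. A small sketch (illustrative, and assuming a "do-all" loop with no cross-iteration dependences): the iterations of a sequential loop are split among all launched threads, and the same kernel works for any grid size.

```cuda
__global__ void scale(float *data, float s, int n) {
    int stride = gridDim.x * blockDim.x;   // total number of threads in the grid
    // Each thread starts at its unique ID and strides over the index space,
    // so iterations are divided among threads without any shared counter.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = s * data[i];             // iterations are independent (do-all)
}
```

A do-across loop (one with cross-iteration dependences) has no such direct mapping; it needs the synchronization techniques discussed earlier.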