《并行与分布式程序设计》课程教学参考书：并行与并发编程《An Introduction to Parallel Programming》.pdf_大学文库

Contents Preface .............................................................................. xv Acknowledgments .................................................................. xviii About the Author ................................................................... xix CHAPTER 1 Why Parallel Computing?................................... 1 1.1 Why We Need Ever-Increasing Performance .................. 2 1.2 Why We’re Building Parallel Systems ......................... 3 1.3 Why We Need to Write Parallel Programs ..................... 3 1.4 How Do We Write Parallel Programs? ......................... 6 1.5 What We’ll Be Doing ............................................. 8 1.6 Concurrent, Parallel, Distributed ................................ 9 1.7 The Rest of the Book ............................................. 10 1.8 A Word of Warning ............................................... 10 1.9 Typographical Conventions ...................................... 11 1.10 Summary ........................................................... 12 1.11 Exercises ........................................................... 12 CHAPTER 2 Parallel Hardware and Parallel Software .................... 15 2.1 Some Background................................................. 15 2.1.1 The von Neumann architecture ........................ 15 2.1.2 Processes, multitasking, and threads .................. 17 2.2 Modifications to the von Neumann Model ..................... 18 2.2.1 The basics of caching................................... 19 2.2.2 Cache mappings ........................................ 20 2.2.3 Caches and programs: an example .................... 22 2.2.4 Virtual memory ......................................... 23 2.2.5 Instruction-level parallelism ........................... 25 2.2.6 Hardware multithreading............................... 28 2.3 Parallel Hardware ................................................. 29 2.3.1 SIMD systems .......................................... 29 2.3.2 MIMD systems ......................................... 32 2.3.3 Interconnection networks .............................. 35 2.3.4 Cache coherence ........................................ 43 2.3.5 Shared-memory versus distributed-memory .......... 46 2.4 Parallel Software .................................................. 47 2.4.1 Caveats .................................................. 47 2.4.2 Coordinating the processes/threads.................... 48 2.4.3 Shared-memory ......................................... 49 ix

x Contents 2.4.4 Distributed-memory .................................... 53 2.4.5 Programming hybrid systems .......................... 56 2.5 Input and Output .................................................. 56 2.6 Performance ....................................................... 58 2.6.1 Speedup and efficiency ................................. 58 2.6.2 Amdahl’s law ........................................... 61 2.6.3 Scalability ............................................... 62 2.6.4 Taking timings .......................................... 63 2.7 Parallel Program Design ......................................... 65 2.7.1 An example ............................................. 66 2.8 Writing and Running Parallel Programs........................ 70 2.9 Assumptions ....................................................... 70 2.10 Summary ........................................................... 71 2.10.1 Serial systems ........................................... 71 2.10.2 Parallel hardware ....................................... 73 2.10.3 Parallel software ........................................ 74 2.10.4 Input and output ........................................ 75 2.10.5 Performance............................................. 75 2.10.6 Parallel program design ................................ 76 2.10.7 Assumptions ............................................ 76 2.11 Exercises ........................................................... 77 CHAPTER 3 Distributed-Memory Programming with MPI ................. 83 3.1 Getting Started..................................................... 84 3.1.1 Compilation and execution............................. 84 3.1.2 MPI programs........................................... 86 3.1.3 MPI Init and MPI Finalize .......................... 86 3.1.4 Communicators, MPI Comm size and MPI Comm rank..................................... 87 3.1.5 SPMD programs ........................................ 88 3.1.6 Communication ......................................... 88 3.1.7 MPI Send ................................................ 88 3.1.8 MPI Recv ................................................ 90 3.1.9 Message matching ...................................... 91 3.1.10 The status p argument ................................ 92 3.1.11 Semantics of MPI Send and MPI Recv ................ 93 3.1.12 Some potential pitfalls ................................. 94 3.2 The Trapezoidal Rule in MPI .................................... 94 3.2.1 The trapezoidal rule .................................... 94 3.2.2 Parallelizing the trapezoidal rule ...................... 96

Contents xi 3.3 Dealing with I/O .................................................. 97 3.3.1 Output ................................................... 97 3.3.2 Input ..................................................... 100 3.4 Collective Communication....................................... 101 3.4.1 Tree-structured communication........................ 102 3.4.2 MPI Reduce ............................................. 103 3.4.3 Collective vs. point-to-point communications ........ 105 3.4.4 MPI Allreduce ......................................... 106 3.4.5 Broadcast ................................................ 106 3.4.6 Data distributions ....................................... 109 3.4.7 Scatter ................................................... 110 3.4.8 Gather ................................................... 112 3.4.9 Allgather ................................................ 113 3.5 MPI Derived Datatypes .......................................... 116 3.6 Performance Evaluation of MPI Programs..................... 119 3.6.1 Taking timings .......................................... 119 3.6.2 Results................................................... 122 3.6.3 Speedup and efficiency ................................. 125 3.6.4 Scalability ............................................... 126 3.7 A Parallel Sorting Algorithm .................................... 127 3.7.1 Some simple serial sorting algorithms ................ 127 3.7.2 Parallel odd-even transposition sort ................... 129 3.7.3 Safety in MPI programs ................................ 132 3.7.4 Final details of parallel odd-even sort ................. 134 3.8 Summary ........................................................... 136 3.9 Exercises ........................................................... 140 3.10 Programming Assignments ...................................... 147 CHAPTER 4 Shared-Memory Programming with Pthreads................ 151 4.1 Processes, Threads, and Pthreads ............................... 151 4.2 Hello, World ....................................................... 153 4.2.1 Execution................................................ 153 4.2.2 Preliminaries ............................................ 155 4.2.3 Starting the threads ..................................... 156 4.2.4 Running the threads .................................... 157 4.2.5 Stopping the threads .................................... 158 4.2.6 Error checking .......................................... 158 4.2.7 Other approaches to thread startup .................... 159 4.3 Matrix-Vector Multiplication .................................... 159 4.4 Critical Sections ................................................... 162

xii Contents 4.5 Busy-Waiting.… 165 4.6 Mutexes.… 168 4.7 Producer-Consumer Synchronization and Semaphores....... 171 4.8 Barriers and Condition Variables............................ 176 4.8.1 Busy-waiting and a mutex............................. 177 4.8.2 Semaphores… 177 4.8.3 Condition variables..................................... 179 4.8.4 Pthreads barriers........................................ 181 4.9 Read-Write Locks................................................. 181 4.9.1 Linked list functions.................................... 181 4.9.2 A multi-threaded linked list............................ 183 4.9.3 Pthreads read-write locks.............................. 187 4.9.4 Performance of the various implementations......... 188 4.9.5 Implementing read-write locks........................ 190 4.10 Caches,Cache Coherence,and False Sharing......... 190 4.11 Thread-Safety.… 195 4.11.1 Incorrect programs can produce correct output.......198 4.12 Summary.… 198 4.13 Exercises..… 200 4.14 Programming Assignments................................... 206 CHAPTER 5 Shared-Memory Programming with OpenMP............. 209 5.1 Getting Started............................................... 210 5.1.1 Compiling and running OpenMP programs........... 211 5.1.2 The program.…… 212 5.1.3 Error checking ........................................ 215 5.2 The Trapezoidal Rule................ 216 5.2.1 A first OpenMP version............................. 216 5.3 Scope of Variables..... 220 5.4 The Reduction Clause...................................... 221 5.5 The parallel for Directive............................... 224 5.5.1 Caveats.… 225 5.5.2 Data dependences............................. 227 5.5.3 Finding loop-carried dependences..................... 228 5.5.4 Estimatingπ.....…..… 229 5.5.5 More on scope.… 231 5.6 More About Loops in OpenMP:Sorting................... 232 5.6.1 Bubble sort...… 232 5.6.2 Odd-even transposition sort............................ 233 5.7 Scheduling Loops........... 236 5.7.1 The schedule clause................................... 237 5.7.2 The static schedule type.......................... 238

xii Contents 4.5 Busy-Waiting ...................................................... 165 4.6 Mutexes ............................................................ 168 4.7 Producer-Consumer Synchronization and Semaphores....... 171 4.8 Barriers and Condition Variables................................ 176 4.8.1 Busy-waiting and a mutex ............................. 177 4.8.2 Semaphores ............................................. 177 4.8.3 Condition variables ..................................... 179 4.8.4 Pthreads barriers ........................................ 181 4.9 Read-Write Locks ................................................. 181 4.9.1 Linked list functions.................................... 181 4.9.2 A multi-threaded linked list ............................ 183 4.9.3 Pthreads read-write locks .............................. 187 4.9.4 Performance of the various implementations ......... 188 4.9.5 Implementing read-write locks ........................ 190 4.10 Caches, Cache Coherence, and False Sharing ................. 190 4.11 Thread-Safety...................................................... 195 4.11.1 Incorrect programs can produce correct output ....... 198 4.12 Summary ........................................................... 198 4.13 Exercises ........................................................... 200 4.14 Programming Assignments ...................................... 206 CHAPTER 5 Shared-Memory Programming with OpenMP ................ 209 5.1 Getting Started..................................................... 210 5.1.1 Compiling and running OpenMP programs........... 211 5.1.2 The program ............................................ 212 5.1.3 Error checking .......................................... 215 5.2 The Trapezoidal Rule ............................................. 216 5.2.1 A first OpenMP version ................................ 216 5.3 Scope of Variables ................................................ 220 5.4 The Reduction Clause ............................................ 221 5.5 The parallel for Directive .................................... 224 5.5.1 Caveats .................................................. 225 5.5.2 Data dependences....................................... 227 5.5.3 Finding loop-carried dependences..................... 228 5.5.4 Estimating π ............................................ 229 5.5.5 More on scope .......................................... 231 5.6 More About Loops in OpenMP: Sorting ....................... 232 5.6.1 Bubble sort .............................................. 232 5.6.2 Odd-even transposition sort ............................ 233 5.7 Scheduling Loops ................................................. 236 5.7.1 The schedule clause................................... 237 5.7.2 The static schedule type.............................. 238

Contents xiii 5.7.3 The dynamic and guided schedule types............. 239 5.7.4 The runtime schedule type............................ 239 5.7.5 Which schedule?........................................ 241 5.8 Producers and Consumers..................... 241 5.8.1 Queues.… 241 5.8.2 Message-passing............. 242 5.8.3 Sending messages ..................................... 243 5.8.4 Receiving messages.................................... 243 5.8.5 Termination detection.................................. 244 58.6 Startup.… 244 5.8.7 The atomic directive................................... 245 5.8.8 Critical sections and locks............................. 246 5.8.9 Using locks in the message-passing program......... 248 5.8.10 critical directives,atomic directives, 0rl0Cks2.… 249 5.8.11 Some caveats.… 249 5.9 Caches,Cache Coherence,and False Sharing................. 251 5.10 Thread-Safety.… 256 5.10.1 Incorrect programs can produce correct output....... 258 5.11 Summary.… 259 5.12 Exercises.… 263 5.13 Programming Assignments................................ 267 CHAPTER 6 Parallel Program Development............ 271 6.1 Two n-Body Solvers............................. 271 6.l.1 The problem.… 271 6.1.2 Two serial programs.................................. 273 6.1.3 Parallelizing the n-body solvers....................... 277 6.1.4 A word about I/O............................. 280 6.1.5 Parallelizing the basic solver using OpenMP......... 281 6.1.6 Parallelizing the reduced solver using OpenMP...... 284 6.1.7 Evaluating the OpenMP codes....................... 288 6.1.8 Parallelizing the solvers using pthreads............... 289 6.1.9 Parallelizing the basic solver using MPI.............. 290 6.1.10 Parallelizing the reduced solver using MPI........... 292 6.1.11 Performance of the MPI solvers....................... 297 6.2 Tree Search.................. 299 6.2.1 Recursive depth-first search............................ 302 6.2.2 Nonrecursive depth-first search........................ 303 6.2.3 Data structures for the serial implementations........ 305 6.2.4 Performance of the serial implementations........... 306 6.2.5 Parallelizing tree search.............................. 306

Contents xiii 5.7.3 The dynamic and guided schedule types............. 239 5.7.4 The runtime schedule type ............................ 239 5.7.5 Which schedule?........................................ 241 5.8 Producers and Consumers........................................ 241 5.8.1 Queues................................................... 241 5.8.2 Message-passing ........................................ 242 5.8.3 Sending messages ...................................... 243 5.8.4 Receiving messages .................................... 243 5.8.5 Termination detection .................................. 244 5.8.6 Startup ................................................... 244 5.8.7 The atomic directive ................................... 245 5.8.8 Critical sections and locks ............................. 246 5.8.9 Using locks in the message-passing program ......... 248 5.8.10 critical directives, atomic directives, or locks? ................................................. 249 5.8.11 Some caveats............................................ 249 5.9 Caches, Cache Coherence, and False Sharing ................. 251 5.10 Thread-Safety...................................................... 256 5.10.1 Incorrect programs can produce correct output ....... 258 5.11 Summary ........................................................... 259 5.12 Exercises ........................................................... 263 5.13 Programming Assignments ...................................... 267 CHAPTER 6 Parallel Program Development ............................... 271 6.1 Two n-Body Solvers .............................................. 271 6.1.1 The problem............................................. 271 6.1.2 Two serial programs .................................... 273 6.1.3 Parallelizing the n-body solvers ....................... 277 6.1.4 A word about I/O ....................................... 280 6.1.5 Parallelizing the basic solver using OpenMP ......... 281 6.1.6 Parallelizing the reduced solver using OpenMP ...... 284 6.1.7 Evaluating the OpenMP codes......................... 288 6.1.8 Parallelizing the solvers using pthreads ............... 289 6.1.9 Parallelizing the basic solver using MPI .............. 290 6.1.10 Parallelizing the reduced solver using MPI ........... 292 6.1.11 Performance of the MPI solvers ....................... 297 6.2 Tree Search ........................................................ 299 6.2.1 Recursive depth-first search............................ 302 6.2.2 Nonrecursive depth-first search........................ 303 6.2.3 Data structures for the serial implementations........ 305 6.2.4 Performance of the serial implementations ........... 306 6.2.5 Parallelizing tree search ................................ 306

Preface Parallel hardware has been ubiquitous for some time now.It's difficult to find a lap- top,desktop,or server that doesn't use a multicore processor.Beowulf clusters are nearly as common today as high-powered workstations were during the 1990s,and cloud computing could make distributed-memory systems as accessible as desktops. In spite of this,most computer science majors graduate with little or no experience in parallel programming.Many colleges and universities offer upper-division elective courses in parallel computing,but since most computer science majors have to take numerous required courses,many graduate without ever writing a multithreaded or multiprocess program. It seems clear that this state of affairs needs to change.Although many programs can obtain satisfactory performance on a single core,computer scientists should be made aware of the potentially vast performance improvements that can be obtained with parallelism,and they should be able to exploit this potential when the need arises. An Introduction to Parallel Programming was written to partially address this problem.It provides an introduction to writing parallel programs using MPI, Pthreads,and OpenMP-three of the most widely used application programming interfaces (APIs)for parallel programming.The intended audience is students and professionals who need to write parallel programs.The prerequisites are mini- mal:a college-level course in mathematics and the ability to write serial programs in C.They are minimal because we believe that students should be able to start programming parallel systems as early as possible. At the University of San Francisco,computer science students can fulfill a requirement for the major by taking the course,on which this text is based,immedi- ately after taking the"Introduction to Computer Science I"course that most majors take in the first semester of their freshman year.We've been offering this course in parallel computing for six years now,and it has been our experience that there really is no reason for students to defer writing parallel programs until their junior or senior year.To the contrary,the course is popular,and students have found that using concurrency in other courses is much easier after having taken the Introduction course. If second-semester freshmen can learn to write parallel programs by taking a class,then motivated computing professionals should be able to learn to write paral- lel programs through self-study.We hope this book will prove to be a useful resource for them. About This Book As we noted earlier,the main purpose of the book is to teach parallel programming in MPI,Pthreads,and OpenMP to an audience with a limited background in computer science and no previous experience with parallelism.We also wanted to make it as XV

Preface Parallel hardware has been ubiquitous for some time now. It’s difficult to find a laptop, desktop, or server that doesn’t use a multicore processor. Beowulf clusters are nearly as common today as high-powered workstations were during the 1990s, and cloud computing could make distributed-memory systems as accessible as desktops. In spite of this, most computer science majors graduate with little or no experience in parallel programming. Many colleges and universities offer upper-division elective courses in parallel computing, but since most computer science majors have to take numerous required courses, many graduate without ever writing a multithreaded or multiprocess program. It seems clear that this state of affairs needs to change. Although many programs can obtain satisfactory performance on a single core, computer scientists should be made aware of the potentially vast performance improvements that can be obtained with parallelism, and they should be able to exploit this potential when the need arises. An Introduction to Parallel Programming was written to partially address this problem. It provides an introduction to writing parallel programs using MPI, Pthreads, and OpenMP—three of the most widely used application programming interfaces (APIs) for parallel programming. The intended audience is students and professionals who need to write parallel programs. The prerequisites are minimal: a college-level course in mathematics and the ability to write serial programs in C. They are minimal because we believe that students should be able to start programming parallel systems as early as possible. At the University of San Francisco, computer science students can fulfill a requirement for the major by taking the course, on which this text is based, immediately after taking the “Introduction to Computer Science I” course that most majors take in the first semester of their freshman year. We’ve been offering this course in parallel computing for six years now, and it has been our experience that there really is no reason for students to defer writing parallel programs until their junior or senior year. To the contrary, the course is popular, and students have found that using concurrency in other courses is much easier after having taken the Introduction course. If second-semester freshmen can learn to write parallel programs by taking a class, then motivated computing professionals should be able to learn to write parallel programs through self-study. We hope this book will prove to be a useful resource for them. About This Book As we noted earlier, the main purpose of the book is to teach parallel programming in MPI, Pthreads, and OpenMP to an audience with a limited background in computer science and no previous experience with parallelism. We also wanted to make it as xv

xvi Preface flexible as possible so that readers who have no interest in learning one or two of the APIs can still read the remaining material with little effort.Thus,the chapters on the three APIs are largely independent of each other:they can be read in any order, and one or two of these chapters can be bypass.This independence has a cost:It was necessary to repeat some of the material in these chapters.Of course,repeated material can be simply scanned or skipped. Readers with no prior experience with parallel computing should read Chapter 1 first.It attempts to provide a relatively nontechnical explanation of why parallel sys- tems have come to dominate the computer landscape.The chapter also provides a short introduction to parallel systems and parallel programming. Chapter 2 provides some technical background in computer hardware and soft- ware.Much of the material on hardware can be scanned before proceeding to the API chapters.Chapters 3,4,and 5 are the introductions to programming with MPI, Pthreads,and OpenMP,respectively. In Chapter 6 we develop two longer programs:a parallel n-body solver and a parallel tree search.Both programs are developed using all three APIs.Chapter 7 provides a brief list of pointers to additional information on various aspects of parallel computing. We use the C programming language for developing our programs because all three APIs have C-language interfaces,and,since C is such a small language,it is a relatively easy language to learn-especially for C++and Java programmers,since they are already familiar with C's control structures. Classroom Use This text grew out of a lower-division undergraduate course at the University of San Francisco.The course fulfills a requirement for the computer science major,and it also fulfills a prerequisite for the undergraduate operating systems course.The only prerequisites for the course are either a grade of"B"or better in a one-semester introduction to computer science or a "C"or better in a two-semester introduction to computer science.The course begins with a four-week introduction to C program- ming.Since most students have already written Java programs,the bulk of what is covered is devoted to the use pointers in C.I The remainder of the course provides introductions to programming in MPI,Pthreads,and OpenMP. We cover most of the material in Chapters 1,3,4,and 5,and parts of the material in Chapters 2 and 6.The background in Chapter 2 is introduced as the need arises. For example,before discussing cache coherence issues in OpenMP(Chapter 5),we cover the material on caches in Chapter 2. The coursework consists of weekly homework assignments,five programming assignments,a couple of midterms,and a final exam.The homework usually involves IInterestingly,a number of students have said that they found the use of C pointers more difficult than MPI programming

xvi Preface flexible as possible so that readers who have no interest in learning one or two of the APIs can still read the remaining material with little effort. Thus, the chapters on the three APIs are largely independent of each other: they can be read in any order, and one or two of these chapters can be bypass. This independence has a cost: It was necessary to repeat some of the material in these chapters. Of course, repeated material can be simply scanned or skipped. Readers with no prior experience with parallel computing should read Chapter 1 first. It attempts to provide a relatively nontechnical explanation of why parallel systems have come to dominate the computer landscape. The chapter also provides a short introduction to parallel systems and parallel programming. Chapter 2 provides some technical background in computer hardware and software. Much of the material on hardware can be scanned before proceeding to the API chapters. Chapters 3, 4, and 5 are the introductions to programming with MPI, Pthreads, and OpenMP, respectively. In Chapter 6 we develop two longer programs: a parallel n-body solver and a parallel tree search. Both programs are developed using all three APIs. Chapter 7 provides a brief list of pointers to additional information on various aspects of parallel computing. We use the C programming language for developing our programs because all three APIs have C-language interfaces, and, since C is such a small language, it is a relatively easy language to learn—especially for C++ and Java programmers, since they are already familiar with C’s control structures. Classroom Use This text grew out of a lower-division undergraduate course at the University of San Francisco. The course fulfills a requirement for the computer science major, and it also fulfills a prerequisite for the undergraduate operating systems course. The only prerequisites for the course are either a grade of “B” or better in a one-semester introduction to computer science or a “C” or better in a two-semester introduction to computer science. The course begins with a four-week introduction to C programming. Since most students have already written Java programs, the bulk of what is covered is devoted to the use pointers in C.1 The remainder of the course provides introductions to programming in MPI, Pthreads, and OpenMP. We cover most of the material in Chapters 1, 3, 4, and 5, and parts of the material in Chapters 2 and 6. The background in Chapter 2 is introduced as the need arises. For example, before discussing cache coherence issues in OpenMP (Chapter 5), we cover the material on caches in Chapter 2. The coursework consists of weekly homework assignments, five programming assignments, a couple of midterms, and a final exam. The homework usually involves 1 Interestingly, a number of students have said that they found the use of C pointers more difficult than MPI programming