Introduction to Parallel Computing, Second Edition
By Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar

Publisher: Addison Wesley
Pub Date: January 16, 2003
ISBN: 0-201-64865-2
Pages: 856

Increasingly, parallel processing is being seen as the only cost-effective method for the fast solution of computationally large and data-intensive problems. The emergence of inexpensive parallel computers such as commodity desktop multiprocessors and clusters of workstations or PCs has made such parallel methods generally applicable, as have software standards for portable parallel programming. This sets the stage for substantial growth in parallel software.

Data-intensive applications such as transaction processing and information retrieval, data mining and analysis, and multimedia services have provided a new challenge for the modern generation of parallel platforms. Emerging areas such as computational biology and nanotechnology have implications for algorithms and systems development, while changes in architectures, programming models, and applications have implications for how parallel platforms are made available to users in the form of grid-based services.

This book takes these new developments into account, as well as covering the more traditional problems addressed by parallel computers. Where possible, it employs an architecture-independent view of the underlying platforms and designs algorithms for an abstract model. Message Passing Interface (MPI), POSIX threads, and OpenMP have been selected as programming models, and the evolving application mix of parallel computing is reflected in various examples throughout the book.
Table of Contents

Copyright
Pearson Education
Preface
Acknowledgments
Chapter 1. Introduction to Parallel Computing
    Section 1.1. Motivating Parallelism
    Section 1.2. Scope of Parallel Computing
    Section 1.3. Organization and Contents of the Text
    Section 1.4. Bibliographic Remarks
    Problems
Chapter 2. Parallel Programming Platforms
    Section 2.1. Implicit Parallelism: Trends in Microprocessor Architectures*
    Section 2.2. Limitations of Memory System Performance*
    Section 2.3. Dichotomy of Parallel Computing Platforms
    Section 2.4. Physical Organization of Parallel Platforms
    Section 2.5. Communication Costs in Parallel Machines
    Section 2.6. Routing Mechanisms for Interconnection Networks
    Section 2.7. Impact of Process-Processor Mapping and Mapping Techniques
    Section 2.8. Bibliographic Remarks
    Problems
Chapter 3. Principles of Parallel Algorithm Design
    Section 3.1. Preliminaries
    Section 3.2. Decomposition Techniques
    Section 3.3. Characteristics of Tasks and Interactions
    Section 3.4. Mapping Techniques for Load Balancing
    Section 3.5. Methods for Containing Interaction Overheads
    Section 3.6. Parallel Algorithm Models
    Section 3.7. Bibliographic Remarks
    Problems
Chapter 4. Basic Communication Operations
    Section 4.1. One-to-All Broadcast and All-to-One Reduction
    Section 4.2. All-to-All Broadcast and Reduction
    Section 4.3. All-Reduce and Prefix-Sum Operations
    Section 4.4. Scatter and Gather
    Section 4.5. All-to-All Personalized Communication
    Section 4.6. Circular Shift
    Section 4.7. Improving the Speed of Some Communication Operations
    Section 4.8. Summary
    Section 4.9. Bibliographic Remarks
    Problems
Chapter 5. Analytical Modeling of Parallel Programs
    Section 5.1. Sources of Overhead in Parallel Programs
    Section 5.2. Performance Metrics for Parallel Systems
    Section 5.3. The Effect of Granularity on Performance
    Section 5.4. Scalability of Parallel Systems
    Section 5.5. Minimum Execution Time and Minimum Cost-Optimal Execution Time
    Section 5.6. Asymptotic Analysis of Parallel Programs
    Section 5.7. Other Scalability Metrics
    Section 5.8. Bibliographic Remarks
    Problems
Chapter 6. Programming Using the Message-Passing Paradigm
    Section 6.1. Principles of Message-Passing Programming
    Section 6.2. The Building Blocks: Send and Receive Operations
    Section 6.3. MPI: the Message Passing Interface
    Section 6.4. Topologies and Embedding
    Section 6.5. Overlapping Communication with Computation
    Section 6.6. Collective Communication and Computation Operations
    Section 6.7. Groups and Communicators
    Section 6.8. Bibliographic Remarks
    Problems
Chapter 7. Programming Shared Address Space Platforms
    Section 7.1. Thread Basics
    Section 7.2. Why Threads?
    Section 7.3. The POSIX Thread API
    Section 7.4. Thread Basics: Creation and Termination
    Section 7.5. Synchronization Primitives in Pthreads
    Section 7.6. Controlling Thread and Synchronization Attributes
    Section 7.7. Thread Cancellation
    Section 7.8. Composite Synchronization Constructs
    Section 7.9. Tips for Designing Asynchronous Programs
    Section 7.10. OpenMP: a Standard for Directive Based Parallel Programming
    Section 7.11. Bibliographic Remarks
    Problems
Chapter 8. Dense Matrix Algorithms
    Section 8.1. Matrix-Vector Multiplication
    Section 8.2. Matrix-Matrix Multiplication
    Section 8.3. Solving a System of Linear Equations
    Section 8.4. Bibliographic Remarks
    Problems
Chapter 9. Sorting
    Section 9.1. Issues in Sorting on Parallel Computers
    Section 9.2. Sorting Networks
    Section 9.3. Bubble Sort and its Variants
    Section 9.4. Quicksort
    Section 9.5. Bucket and Sample Sort
    Section 9.6. Other Sorting Algorithms
    Section 9.7. Bibliographic Remarks
    Problems
Chapter 10. Graph Algorithms
    Section 10.1. Definitions and Representation
    Section 10.2. Minimum Spanning Tree: Prim's Algorithm
    Section 10.3. Single-Source Shortest Paths: Dijkstra's Algorithm
    Section 10.4. All-Pairs Shortest Paths
    Section 10.5. Transitive Closure
    Section 10.6. Connected Components
    Section 10.7. Algorithms for Sparse Graphs
    Section 10.8. Bibliographic Remarks
    Problems
Chapter 11. Search Algorithms for Discrete Optimization Problems
    Section 11.1. Definitions and Examples
    Section 11.2. Sequential Search Algorithms
    Section 11.3. Search Overhead Factor
    Section 11.4. Parallel Depth-First Search
    Section 11.5. Parallel Best-First Search
    Section 11.6. Speedup Anomalies in Parallel Search Algorithms
    Section 11.7. Bibliographic Remarks
    Problems
Chapter 12. Dynamic Programming
    Section 12.1. Overview of Dynamic Programming
    Section 12.2. Serial Monadic DP Formulations
    Section 12.3. Nonserial Monadic DP Formulations
    Section 12.4. Serial Polyadic DP Formulations
    Section 12.5. Nonserial Polyadic DP Formulations
    Section 12.6. Summary and Discussion
    Section 12.7. Bibliographic Remarks
    Problems
Chapter 13. Fast Fourier Transform
    Section 13.1. The Serial Algorithm
    Section 13.2. The Binary-Exchange Algorithm
    Section 13.3. The Transpose Algorithm
    Section 13.4. Bibliographic Remarks
    Problems
Appendix A. Complexity of Functions and Order Analysis
    Section A.1. Complexity of Functions
    Section A.2. Order Analysis of Functions
Bibliography
Preface

Since the 1994 release of the text "Introduction to Parallel Computing: Design and Analysis of Algorithms" by the same authors, the field of parallel computing has undergone significant changes. Whereas tightly coupled scalable message-passing platforms were the norm a decade ago, a significant portion of the current generation of platforms consists of inexpensive clusters of workstations, and multiprocessor workstations and servers.

Programming models for these platforms have also evolved over this time. Whereas most machines a decade back relied on custom APIs for messaging and loop-based parallelism, current models standardize these APIs across platforms. Message passing libraries such as PVM and MPI, thread libraries such as POSIX threads, and directive based models such as OpenMP are widely accepted as standards, and have been ported to a variety of platforms.

With respect to applications, fluid dynamics, structural mechanics, and signal processing formed dominant applications a decade back. These applications continue to challenge the current generation of parallel platforms. However, a variety of new applications have also become important. These include data-intensive applications such as transaction processing and information retrieval, data mining and analysis, and multimedia services. Applications in emerging areas of computational biology and nanotechnology pose tremendous challenges for algorithms and systems development. Changes in architectures, programming models, and applications are also being accompanied by changes in how parallel platforms are made available to the users in the form of grid-based services.

This evolution has a profound impact on the process of design, analysis, and implementation of parallel algorithms. Whereas the emphasis of parallel algorithm design a decade back was on precise mapping of tasks to specific topologies such as meshes and hypercubes, the current emphasis is on programmability and portability, both from the points of view of algorithm design and implementation. To this effect, where possible, this book employs an architecture-independent view of the underlying platforms and designs algorithms for an abstract model. With respect to programming models, Message Passing Interface (MPI), POSIX threads, and OpenMP have been selected. The evolving application mix for parallel computing is also reflected in various examples in the book.

This book forms the basis for a single concentrated course on parallel computing or a two-part sequence. Some suggestions for such a two-part sequence are:

1. Introduction to Parallel Computing: Chapters 1-6. This course would provide the basics of algorithm design and parallel programming.

2. Design and Analysis of Parallel Algorithms: Chapters 2 and 3 followed by Chapters 8-12. This course would provide an in-depth coverage of design and analysis of various parallel algorithms.

The material in this book has been tested in Parallel Algorithms and Parallel Computing courses at the University of Minnesota and Purdue University. These courses are taken primarily by graduate students and senior-level undergraduate students in Computer Science. In addition, related courses in Scientific Computation, for which this material has also been tested, are taken by graduate students in science and engineering who are interested in solving computationally intensive problems.
Most chapters of the book include (i) examples and illustrations; (ii) problems that supplement the text and test students' understanding of the material; and (iii) bibliographic remarks to aid researchers and students interested in learning more about related and advanced topics. The comprehensive subject index helps the reader locate terms they might be interested in. The page number on which a term is defined is highlighted in boldface in the index. Furthermore, the term itself appears in bold italics where it is defined. The sections that deal with relatively complex material are preceded by a '*'. An instructors' manual containing slides of the figures and solutions to selected problems is also available from the publisher (http://www.booksites.net/kumar).

As with our previous book, we view this book as a continually evolving resource. We thank all the readers who have kindly shared critiques, opinions, problems, code, and other information relating to our first book. It is our sincere hope that we can continue this interaction centered around this new book. We encourage readers to address communication relating to this book to book-vk@cs.umn.edu. All relevant reader input will be added to the information archived at the site http://www.cs.umn.edu/~parbook with due credit to (and permission of) the sender(s). An on-line errata of the book will also be maintained at the site. We believe that in a highly dynamic field such as ours, a lot is to be gained from a healthy exchange of ideas and material in this manner.
Chapter 1. Introduction to Parallel Computing

The past decade has seen tremendous advances in microprocessor technology. Clock rates of processors have increased from about 40 MHz (e.g., a MIPS R3000, circa 1988) to over 2.0 GHz (e.g., a Pentium 4, circa 2002). At the same time, processors are now capable of executing multiple instructions in the same cycle. The average number of cycles per instruction (CPI) of high-end processors has improved by roughly an order of magnitude over the past 10 years. All this translates to an increase in the peak floating point operation execution rate (floating point operations per second, or FLOPS) of several orders of magnitude.
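As a rough back-of-the-envelope sketch of how these two trends compound, the small C program below multiplies the clock-rate increase by the CPI improvement. It uses only the ballpark figures quoted above; the tenfold CPI improvement is treated here as an illustrative order-of-magnitude estimate, not a measured value.

#include <stdio.h>

int main(void) {
    /* Ballpark figures from the text: clock rates from ~40 MHz (circa 1988)
       to ~2 GHz (circa 2002), and roughly an order-of-magnitude improvement
       in cycles per instruction (CPI). */
    double clock_old_hz = 40.0e6;
    double clock_new_hz = 2.0e9;
    double cpi_factor   = 10.0;   /* illustrative order-of-magnitude estimate */

    double clock_factor = clock_new_hz / clock_old_hz;   /* 50x */
    double peak_factor  = clock_factor * cpi_factor;     /* ~500x */

    printf("Clock-rate increase:        %.0fx\n", clock_factor);
    printf("Estimated peak-rate growth: ~%.0fx\n", peak_factor);
    return 0;
}

These two effects alone give a factor of roughly 500, in line with the growth of several orders of magnitude cited above.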
A variety of other issues have also become important over the same period. Perhaps the most prominent of these is the ability (or lack thereof) of the memory system to feed data to the processor at the required rate. Significant innovations in architecture and software have addressed the alleviation of bottlenecks posed by the datapath and the memory.

The role of concurrency in accelerating computing elements has been recognized for several decades. However, its role in providing multiplicity of datapaths, increased access to storage elements (both memory and disk), scalable performance, and lower costs is reflected in the wide variety of applications of parallel computing. Desktop machines, engineering workstations, and compute servers with two, four, or even eight processors connected together are becoming common platforms for design applications. Large scale applications in science and engineering rely on larger configurations of parallel computers, often comprising hundreds of processors. Data intensive platforms such as database or web servers and applications such as transaction processing and data mining often use clusters of workstations that provide high aggregate disk bandwidth. Applications in graphics and visualization use multiple rendering pipes and processing elements to compute and render realistic environments with millions of polygons in real time. Applications requiring high availability rely on parallel and distributed platforms for redundancy. It is therefore extremely important, from the point of view of cost, performance, and application requirements, to understand the principles, tools, and techniques for programming the wide variety of parallel platforms currently available.

1.1 Motivating Parallelism

Development of parallel software has traditionally been thought of as time and effort intensive. This can be largely attributed to the inherent complexity of specifying and coordinating concurrent tasks, a lack of portable algorithms, standardized environments, and software development toolkits. When viewed in the context of the brisk rate of development of microprocessors, one is tempted to question the need for devoting significant effort towards exploiting parallelism as a means of accelerating applications. After all, if it takes two years to develop a parallel application, during which time the underlying hardware and/or software platform has become obsolete, the development effort is clearly wasted. However, there are some unmistakable trends in hardware design, which indicate that uniprocessor (or implicitly parallel) architectures may not be able to sustain the rate of realizable performance increments in the future. This is a result of lack of implicit parallelism as well as other bottlenecks such as the datapath and the memory.
At the same time, standardized hardware interfaces have reduced the turnaround time from the development of a microprocessor to a parallel machine based on the microprocessor. Furthermore, considerable progress has been made in standardization of programming environments to ensure a longer life-cycle for parallel applications. All of these present compelling arguments in favor of parallel computing platforms.

1.1.1 The Computational Power Argument – from Transistors to FLOPS

In 1965, Gordon Moore made the following simple observation:

"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."

His reasoning was based on an empirical log-linear relationship between device complexity and time, observed over three data points.
He used this to justify that by 1975, devices with as many as 65,000 components would become feasible on a single silicon chip occupying an area of only about one-fourth of a square inch. This projection turned out to be accurate with the fabrication of a 16K CCD memory with about 65,000 components in 1975. In a subsequent paper in 1975, Moore attributed the log-linear relationship to exponential behavior of die sizes, finer minimum dimensions, and "circuit and device cleverness". He went on to state that:

"There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions."

He revised his rate of circuit complexity doubling to 18 months and projected from 1975 onwards at this reduced rate. This curve came to be known as "Moore's Law". Formally, Moore's Law states that circuit complexity doubles every eighteen months. This empirical relationship has been amazingly resilient over the years, both for microprocessors and for DRAMs. By relating component density and increases in die size to the computing power of a device, Moore's Law has been extrapolated to state that the amount of computing power available at a given cost doubles approximately every 18 months.

The limits of Moore's Law have been the subject of extensive debate in the past few years. Staying clear of this debate, the issue of translating transistors into useful OPS (operations per second) is the critical one. It is possible to fabricate devices with very large transistor counts. How we use these transistors to achieve increasing rates of computation is the key architectural challenge. A logical recourse is to rely on parallelism, both implicit and explicit. We will briefly discuss implicit parallelism in Section 2.1 and devote the rest of this book to exploiting explicit parallelism.
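To make the doubling rule concrete, it can be written as C(t) = C(1975) * 2^((t - 1975) / 1.5), with t in years. The short C sketch below extrapolates from the 65,000-component figure of 1975; it illustrates only the stated rule, and the projected counts are not claims about actual devices.

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Moore's Law as stated above: circuit complexity doubles every 18 months.
       The 1975 starting point (65,000 components) is taken from the text;
       the projected values merely illustrate the doubling rule. */
    const double start_year       = 1975.0;
    const double start_components = 65000.0;
    const double doubling_years   = 1.5;      /* 18 months */

    for (int year = 1975; year <= 2005; year += 10) {
        double doublings  = (year - start_year) / doubling_years;
        double components = start_components * pow(2.0, doublings);
        printf("%d: ~%.2e components per chip\n", year, components);
    }
    return 0;
}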
1.1.2 The Memory/Disk Speed Argument

The overall speed of computation is determined not just by the speed of the processor, but also by the ability of the memory system to feed data to it. While clock rates of high-end processors have increased at roughly 40% per year over the past decade, DRAM access times have only improved at the rate of roughly 10% per year over this interval. Coupled with increases in instructions executed per clock cycle, this gap between processor speed and memory presents a tremendous performance bottleneck. This growing mismatch between processor speed and DRAM latency is typically bridged by a hierarchy of successively faster memory devices called caches that rely on locality of data reference to deliver higher memory system performance. In addition to the latency, the net effective bandwidth between DRAM and the processor poses other problems for sustained computation rates.

The overall performance of the memory system is determined by the fraction of the total memory requests that can be satisfied from the cache. Memory system performance is addressed in greater detail in Section 2.2. Parallel platforms typically yield better memory system performance because they provide (i) larger aggregate caches, and (ii) higher aggregate bandwidth to the memory system (both typically linear in the number of processors). Furthermore, the principles that are at the heart of parallel algorithms, namely locality of data reference, also lend themselves to cache-friendly serial algorithms. This argument can be extended to disks, where parallel platforms can be used to achieve high aggregate bandwidth to secondary storage. Here, parallel algorithms yield insights into the development of out-of-core computations. Indeed, some of the fastest growing application areas of parallel computing in data servers (database servers, web servers) rely not so much on their high aggregate computation rates but rather on the ability to pump data out at a faster rate.
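A rough way to see why the cache hit fraction dominates is the standard effective-access-time model, t_eff = h * t_cache + (1 - h) * t_DRAM. The C sketch below uses hypothetical round-number latencies (1 ns cache, 100 ns DRAM), chosen only to illustrate the effect; Section 2.2 treats memory system performance quantitatively.

#include <stdio.h>

int main(void) {
    /* Effective access time: t_eff = h * t_cache + (1 - h) * t_dram.
       The 1 ns / 100 ns latencies are hypothetical round numbers used only
       to show how strongly the hit fraction h drives memory performance. */
    const double t_cache_ns  = 1.0;
    const double t_dram_ns   = 100.0;
    const double hit_rates[] = { 0.80, 0.90, 0.99 };

    for (int i = 0; i < 3; i++) {
        double h     = hit_rates[i];
        double t_eff = h * t_cache_ns + (1.0 - h) * t_dram_ns;
        printf("hit rate %.2f -> effective access time %5.2f ns\n", h, t_eff);
    }
    return 0;
}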
1.1.3 The Data Communication Argument

As the networking infrastructure evolves, the vision of using the Internet as one large heterogeneous parallel/distributed computing environment has begun to take shape. Many applications lend themselves naturally to such computing paradigms. Some of the most impressive applications of massively parallel computing have been in the context of wide-area distributed platforms. The SETI (Search for Extra Terrestrial Intelligence) project utilizes the power of a large number of home computers to analyze electromagnetic signals from outer space. Other such efforts have attempted to factor extremely large integers and to solve large discrete optimization problems.

In many applications there are constraints on the location of data and/or resources across the Internet. An example of such an application is mining of large commercial datasets distributed over a relatively low bandwidth network. In such applications, even if the computing power is available to accomplish the required task without resorting to parallel computing, it is infeasible to collect the data at a central location. In these cases, the motivation for parallelism comes not just from the need for computing resources but also from the infeasibility or undesirability of alternate (centralized) approaches.
1.2 Scope of Parallel Computing

Parallel computing has made a tremendous impact on a variety of areas ranging from computational simulations for scientific and engineering applications to commercial applications in data mining and transaction processing. The cost benefits of parallelism coupled with the performance requirements of applications present compelling arguments in favor of parallel computing. We present a small sample of the diverse applications of parallel computing.

1.2.1 Applications in Engineering and Design

Parallel computing has traditionally been employed with great success in the design of airfoils (optimizing lift, drag, stability), internal combustion engines (optimizing charge distribution, burn), high-speed circuits (layouts for delays and capacitive and inductive effects), and structures (optimizing structural integrity, design parameters, cost, etc.), among others. More recently, design of microelectromechanical and nanoelectromechanical systems (MEMS and NEMS) has attracted significant attention. While most applications in engineering and design pose problems of multiple spatial and temporal scales and coupled physical phenomena, in the case of MEMS/NEMS design these problems are particularly acute. Here, we often deal with a mix of quantum phenomena, molecular dynamics, and stochastic and continuum models with physical processes such as conduction, convection, radiation, and structural mechanics, all in a single system. This presents formidable challenges for geometric modeling, mathematical modeling, and algorithm development, all in the context of parallel computers.

Other applications in engineering and design focus on optimization of a variety of processes. Parallel computers have been used to solve a variety of discrete and continuous optimization problems. Algorithms such as Simplex and interior point methods for linear optimization, and branch-and-bound and genetic programming for discrete optimization, have been efficiently parallelized and are frequently used.

1.2.2 Scientific Applications

The past few years have seen a revolution in high performance scientific computing applications. The sequencing of the human genome by the International Human Genome Sequencing Consortium and Celera, Inc. has opened exciting new frontiers in bioinformatics. Functional and structural characterization of genes and proteins holds the promise of understanding and fundamentally influencing biological processes. Analyzing biological sequences with a view to developing new drugs and cures for diseases and medical conditions requires innovative algorithms as well as large-scale computational power. Indeed, some of the newest parallel computing technologies are targeted specifically towards applications in bioinformatics.

Advances in computational physics and chemistry have focused on understanding processes ranging in scale from quantum phenomena to macromolecular structures. These have resulted in the design of new materials, understanding of chemical pathways, and more efficient processes. Applications in astrophysics have explored the evolution of galaxies, thermonuclear processes, and the analysis of extremely large datasets from telescopes. Weather modeling, mineral prospecting, flood prediction, etc., rely heavily on parallel computers and have very significant impact on day-to-day life.

Bioinformatics and astrophysics also present some of the most challenging problems with respect to analyzing extremely large datasets.
Protein and gene databases (such as PDB, SwissProt, ENTREZ, and NDB) along with sky survey datasets (such as the Sloan Digital Sky Survey) represent some of the largest scientific datasets. Effectively analyzing these datasets requires tremendous computational power and holds the key to significant scientific discoveries.

1.2.3 Commercial Applications

With the widespread use of the web and associated static and dynamic content, there is increasing emphasis on cost-effective servers capable of providing scalable performance. Parallel platforms ranging from multiprocessors to Linux clusters are frequently used as web and database servers. For instance, on heavy volume days, large brokerage houses on Wall Street handle hundreds of thousands of simultaneous user sessions and millions of orders. Platforms such as IBM's SP supercomputers and Sun Ultra HPC servers power these business-critical sites.
While not highly visible, some of the largest supercomputing networks are housed on Wall Street.

The availability of large-scale transaction data has also sparked considerable interest in data mining and analysis for optimizing business and marketing decisions. The sheer volume and geographically distributed nature of this data require the use of effective parallel algorithms for such problems as association rule mining, clustering, classification, and time-series analysis.

1.2.4 Applications in Computer Systems

As computer systems become more pervasive and computation spreads over the network, parallel processing issues become ingrained in a variety of applications. In computer security, intrusion detection is an outstanding challenge. In the case of network intrusion detection, data is collected at distributed sites and must be analyzed rapidly to signal intrusion. The infeasibility of collecting this data at a central location for analysis requires effective parallel and distributed algorithms. In the area of cryptography, some of the most spectacular applications of Internet-based parallel computing have focused on factoring extremely large integers.

Embedded systems increasingly rely on distributed control algorithms for accomplishing a variety of tasks. A modern automobile consists of tens of processors communicating to perform complex tasks for optimizing handling and performance. In such systems, traditional parallel and distributed algorithms for leader selection, maximal independent set, etc., are frequently used.

While parallel computing has traditionally confined itself to platforms with well-behaved compute and network elements in which faults and errors do not play a significant role, there are valuable lessons that extend to computations on ad-hoc, mobile, or faulty environments.

1.3 Organization and Contents of the Text

This book provides a comprehensive and self-contained exposition of problem solving using parallel computers. Algorithms and metrics focus on practical and portable models of parallel machines. Principles of algorithm design focus on desirable attributes of parallel algorithms and techniques for achieving these in the context of a large class of applications and architectures. Programming techniques cover standard paradigms such as MPI and POSIX threads that are available across a range of parallel platforms.

Chapters in this book can be grouped into four main parts as illustrated in Figure 1.1. These parts are as follows:

Figure 1.1. Recommended sequence for reading the chapters.
The figure shows Chapter 1 leading into four groups of chapters: Fundamentals (Chapters 2 through 5), Parallel Programming (Chapters 6 and 7), Non-Numerical Algorithms (Chapters 9 through 12), and Numerical Algorithms (Chapters 8 and 13).
Fundamentals

This section spans Chapters 2 through 5 of the book. Chapter 2, Parallel Programming Platforms, discusses the physical organization of parallel platforms. It establishes cost metrics that can be used for algorithm design. The objective of this chapter is not to provide an exhaustive treatment of parallel architectures; rather, it aims to provide sufficient detail required to use these machines efficiently. Chapter 3, Principles of Parallel Algorithm Design, addresses key factors that contribute to efficient parallel algorithms and presents a suite of techniques that can be applied across a wide range of applications. Chapter 4, Basic Communication Operations, presents a core set of operations that are used throughout the book for facilitating efficient data transfer in parallel algorithms. Finally, Chapter 5, Analytical Modeling of Parallel Programs, deals with metrics for quantifying the performance of a parallel algorithm.

Parallel Programming

This section includes Chapters 6 and 7 of the book. Chapter 6, Programming Using the Message-Passing Paradigm, focuses on the Message Passing Interface (MPI) for programming message passing platforms, including clusters. Chapter 7, Programming Shared Address Space Platforms, deals with programming paradigms such as threads and directive based approaches. Using paradigms such as POSIX threads and OpenMP, it describes various features necessary for programming shared-address-space parallel machines. Both of these chapters illustrate various programming concepts using a