to execute concurrently are nonhomogeneous code; fine-grained, complicated interactions; and pointer-based data structures.

PROGRAMMING MODELS
Today, you can express parallelism in a number of different ways, each applicable to only a subset of programs. In many cases, it is difficult, without careful design and analysis, to know in advance which model is appropriate for a particular problem, and it is always tricky to combine several models when a given application does not fit cleanly into a single paradigm.

These parallel programming models differ significantly in two dimensions: the granularity of the parallel operations and the degree of coupling between these tasks. Different points in this space favor different programming models, so let's examine these axes in turn.

Operations executed in parallel can range from single instructions, such as addition or multiplication, to complex programs that take hours or days to run. Obviously, for small operations, the overhead costs of the parallel infrastructure are significant; for example, parallel instruction execution generally requires hardware support. Multicore processors reduce communication and synchronization costs, as compared with conventional multiprocessors, which can reduce the overhead burden on smaller pieces of code. Still, in general, the finer grained the task, the more attention must be paid to the cost of spawning it as a separate task and providing its communication and synchronization.

The other dimension is the degree of coupling in the communication and synchronization between the operations. The ideal is none: operations run entirely independently and produce distinct outputs. In this case, the operations can run in any order, incur no synchronization or communication costs, and are easily programmed without the possibility of data races. This state of affairs is rare, as most concurrent programs share data among their operations. The complexity of ensuring correct and efficient operation increases as the operations become more diverse. The easiest case is executing the same code for each operation. This type of sharing is often regular and can be understood by analyzing only a single task. More challenging is irregular parallelism, in which the operations are distinct and the sharing patterns are more difficult to comprehend.

INDEPENDENT PARALLELISM
Perhaps the simplest and best-behaved model is independent parallelism (sometimes called "embarrassingly parallel tasks"), in which one or more operations are applied independently to each item in a data collection.

Fine-grained data parallelism relies on the independence of the operations executed concurrently. They should not share input data or results and should be executable without coordination. For example:

double A[100][100];
…
A = A * 2;

multiplies each element of a 100x100 array by 2 and stores the result in the same array location. Each of the 10,000 multiplications proceeds independently and without coordination with its peers. This is probably more concurrency than necessary for most computers, and its granularity is very fine, so a more practical approach would partition the matrix into n x m blocks and execute the operations on the blocks concurrently.
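To make the blocked approach concrete, the following sketch scales the array one block of rows at a time. It is written in C with an OpenMP pragma, neither of which the article prescribes; the 25-row block size and the function name scale_blocks are illustrative assumptions.

/* Sketch only: scale a 100x100 array by 2, one block of rows per task.
   Compile with OpenMP enabled (e.g., gcc -fopenmp); without it the
   pragma is ignored and the loop simply runs serially. */
#define N     100
#define BLOCK 25    /* rows per block; 100/25 = 4 independent blocks */

void scale_blocks(double A[N][N])
{
    /* The blocks share no inputs or results, so they may run in any
       order and concurrently, without coordination. */
    #pragma omp parallel for
    for (int b = 0; b < N; b += BLOCK)
        for (int i = b; i < b + BLOCK; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = A[i][j] * 2;
}

Each block touches a disjoint set of rows, so the independence that made the element-wise version embarrassingly parallel is preserved at a coarser, cheaper granularity.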
At the other end of the granularity axis, some applications, such as search engines, share only a large read-only database, so concurrently processing queries requires no coordination. Similarly, large simulations, which require many runs to explore a large space of input parameters, are another embarrassingly parallel application.

REGULAR PARALLELISM
The next step beyond independent parallelism is to apply the same operation to a collection of data when the computations are mutually dependent. An operation on one piece of data is dependent on another operation if there is communication or synchronization between the two operations.

For example, consider a stencil computation that replaces each point in an array with the average of its four nearest neighbors:

A[i, j] = (A[i-1, j] + A[i, j-1] + A[i+1, j] + A[i, j+1]) / 4;

This computation requires careful coordination to ensure that an array location is read by its neighbors before being replaced by its average. If space is no concern, then the averages can be computed into a new array. In general, other more structured computation strategies, such as traversing the array in a diagonal wavefront, will produce the same result, with better cache locality and lower memory consumption.
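The "compute into a new array" strategy can be sketched as follows; the fixed array size, the restriction to interior points, and the OpenMP loop are assumptions made for illustration, not the article's code.

/* Sketch only: one stencil step that reads old values from A and
   writes averages into a separate array B, so no point is overwritten
   before its neighbors have read it. */
#define N 100

void stencil_step(double A[N][N], double B[N][N])
{
    /* All reads come from A and all writes go to B, so every interior
       point can be updated concurrently without further coordination. */
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            B[i][j] = (A[i-1][j] + A[i][j-1] + A[i+1][j] + A[i][j+1]) / 4;

    /* The next step would read from B and write back into A,
       alternating between the two buffers. */
}

Trading an extra array for independence is the space cost the text alludes to; the wavefront traversal avoids that cost by ordering the updates instead.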
Regular parallel programs may require synchronization or carefully orchestrated execution strategies to produce the correct results, but unlike general parallelism, the