Parallel Algorithms Underlying MPI Implementations
Parallel Algorithms Underlying MPI Implementations • This chapter looks at a few of the parallel algorithms underlying the implementations of some simple MPI calls. • The purpose of this is not to teach you how to "roll your own" versions of these routines, but rather to help you start thinking about algorithms in a parallel fashion. • First, the method of recursive halving and doubling, which is the algorithm underlying operations such as broadcasts and reduction operations, is discussed. • Then, specific examples of parallel algorithms that implement message passing are given.
Recursive Halving and Doubling • To illustrate recursive halving and doubling, suppose you have a vector distributed among p processors, and you need the sum of all components of the vector in each processor, i.e., a sum reduction. • One method is to use a tree-based algorithm to compute the sum to a single processor and then broadcast the sum to every processor.
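From the user's point of view, this entire operation is a reduction followed by a broadcast, or the single combined call MPI_Allreduce; the algorithms in this chapter are what an MPI library may use internally to implement these calls. A minimal complete C example (the partial-sum value here is only a placeholder):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend each processor has already formed the partial sum of its
     * portion of the vector. */
    double partial = (double)(rank + 1);
    double total;

    /* Sum the partial sums onto rank 0, then broadcast the result ... */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&total, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... or do both at once; the library is then free to use the
     * recursive halving and doubling algorithms discussed here. */
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("Processor %d has the total %g\n", rank, total);

    MPI_Finalize();
    return 0;
}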
Recursive Halving and Doubling • Assume that each processor has formed the partial sum of the components of the vector that it has. • Step 1: Processor 2 sends its partial sum to processor 1 and processor 1 adds this partial sum to its own. Meanwhile, processor 4 sends its partial sum to processor 3 and processor 3 performs a similar summation. • Step 2: Processor 3 sends its partial sum, which is now the sum of the components on processors 3 and 4, to processor 1, and processor 1 adds it to its partial sum to get the final sum across all the components of the vector. • At each stage of the process, the number of processors doing work is cut in half. The algorithm is depicted in Figure 13.1 below, where the solid arrow denotes a send operation and the dotted arrow denotes a receive operation followed by a summation. A sketch of this summation in code is given after the figure.
Recursive Halving and Doubling Figure 13.1. Summation in log(N) steps
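The tree summation in Figure 13.1 can be written with ordinary point-to-point calls. The following is a hedged sketch rather than an actual library routine: it uses 0-based ranks (the figure's processors 1 through 4 correspond to ranks 0 through 3), assumes MPI has already been initialized, and leaves the total on rank 0 only; the broadcast back to the other processors (Step 3) is omitted.

#include <mpi.h>

/* Tree-based sum reduction: at each step the "upper" rank of a pair
 * sends its partial sum to the "lower" rank and drops out, so the
 * number of working ranks is halved.  After log2(N) steps rank 0
 * holds the total; the return value is meaningful only there. */
double tree_sum_to_rank0(double partial, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int stride = 1; stride < size; stride *= 2) {
        if (rank % (2 * stride) == stride) {
            /* Send my partial sum one stride down, then stop working. */
            MPI_Send(&partial, 1, MPI_DOUBLE, rank - stride, 0, comm);
            break;
        } else if (rank % (2 * stride) == 0 && rank + stride < size) {
            /* Receive a partner's partial sum and add it to my own. */
            double incoming;
            MPI_Recv(&incoming, 1, MPI_DOUBLE, rank + stride, 0, comm,
                     MPI_STATUS_IGNORE);
            partial += incoming;
        }
    }
    return partial;
}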
Recursive Halving and Doubling • Step 3: Processor 1 then must broadcast this sum to all other processors. This broadcast operation can be done using the same communication structure as the summation, but in reverse. You will see pseudocode for this at the end of this section. Note that if the total number of processors is N, then only 2 log(N) (log base 2) steps are needed to complete the operation. • There is an even more efficient way to finish the job in only log(N) steps. By way of example, look at the next figure containing 8 processors. At each step, processor i and processor i+k send and receive data in a pairwise fashion and then perform the summation. k is iterated from 1 through N/2 in powers of 2. If the total number of processors is N, then log(N) steps are needed. As an exercise, you should write out the necessary pseudocode for this example; one possible sketch is given after Figure 13.2.
Recursive Halving and Doubling Figure 13.2. Summation to all processors in log(N) steps
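One possible answer to the exercise, written as a C sketch rather than pseudocode: at distance k each rank exchanges its running partial sum with the partner rank XOR k (which is i+k or i-k, depending on which side of the pair a rank is on), and both partners add what they receive. The function name is illustrative only; the sketch assumes a power-of-two number of ranks and that MPI has already been initialized.

#include <mpi.h>

/* Pairwise-exchange summation (Figure 13.2): after log2(N) steps every
 * rank holds the total, with no separate broadcast phase needed. */
double pairwise_sum_all(double partial, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int k = 1; k < size; k *= 2) {
        int partner = rank ^ k;        /* pairwise partner at distance k */
        double incoming;
        MPI_Sendrecv(&partial, 1, MPI_DOUBLE, partner, 0,
                     &incoming, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        partial += incoming;           /* both partners do the summation */
    }
    return partial;                    /* the full sum, on every rank */
}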
Recursive Halving and Doubling • What about adding vectors? That is, how do you add several vectors component-wise to get a new vector? The answer is that you apply the method discussed earlier in a component-wise fashion. This approach, which reduces the communication and avoids redundant summations, is described next. It uses the recursive halving and doubling technique and is illustrated in Figure 13.3.
Recursive Halving and Doubling • Suppose there are 4 processors and the length of each vector is also 4. • Step 1: Processor p0 sends the first two components of the vector to processor p1, and p1 sends the last two components of the vector to p0. Then p0 gets the partial sums for the last two components, and p1 gets the partial sums for the first two components. Processors p2 and p3 do the same. • Step 2: Processor p0 sends the partial sum of the third component to processor p3. Processor p3 then adds it to get the total sum of the third component. Similarly, processors 0, 1, and 2 find the total sums of the 4th, 2nd, and 1st components, respectively. The sums of the vector components have now all been computed, but they are stored on different processors. • Step 3: Broadcast the result using the reverse of the above communication process.
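The three steps above amount to a reduce-scatter by recursive halving followed by an all-gather by recursive doubling. The sketch below is illustrative and does not reproduce the text's exact exchange pattern: it pairs ranks by rank XOR distance, so which processor ends up holding a given component differs from the 4-processor walk-through, but the number of steps and the amount of data moved are the same. It assumes a power-of-two number of processors, a vector length divisible by that number, and that MPI has already been initialized.

#include <mpi.h>
#include <stdlib.h>

/* Component-wise sum of a distributed vector (Figure 13.3): phase 1
 * (recursive halving) leaves each rank with one fully summed block of
 * the vector, phase 2 (recursive doubling) exchanges finished blocks
 * until every rank holds the entire summed vector. */
void vector_sum_all(double *vec, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *tmp = malloc((n / 2) * sizeof(double));

    /* Phase 1: recursive halving.  The active block [lo, hi) shrinks by
     * half each step; a rank keeps the half on its own side and adds in
     * the matching half received from its partner. */
    int lo = 0, hi = n;
    for (int dist = size / 2; dist >= 1; dist /= 2) {
        int partner = rank ^ dist;
        int half = (hi - lo) / 2;
        int keep_lo = (rank < partner) ? lo : lo + half;  /* half I keep */
        int send_lo = (rank < partner) ? lo + half : lo;  /* half I send */
        MPI_Sendrecv(&vec[send_lo], half, MPI_DOUBLE, partner, 0,
                     tmp, half, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < half; i++)
            vec[keep_lo + i] += tmp[i];
        lo = keep_lo;
        hi = keep_lo + half;
    }

    /* After phase 1, rank r owns the fully summed block starting at
     * r * (n / size).  Phase 2: recursive doubling.  Exchange ever
     * larger runs of finished blocks until the whole summed vector is
     * present on every rank. */
    int blk = n / size;
    for (int dist = 1; dist < size; dist *= 2) {
        int partner = rank ^ dist;
        int mine   = (rank    & ~(dist - 1)) * blk;  /* offset of my run */
        int theirs = (partner & ~(dist - 1)) * blk;  /* partner's run    */
        MPI_Sendrecv(&vec[mine], dist * blk, MPI_DOUBLE, partner, 1,
                     &vec[theirs], dist * blk, MPI_DOUBLE, partner, 1,
                     comm, MPI_STATUS_IGNORE);
    }

    free(tmp);
}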