Feng, T. “Parallel Processors.” The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
95 Parallel Processors

Tse-yun Feng
The Pennsylvania State University

95.1 Introduction
95.2 Classifications
95.3 Types of Parallel Processors
    Ensemble Processors • Array Processors • Associative Processor
95.4 System Utilization

95.1 Introduction

A computer usually consists of four major components: the arithmetic-logic unit (ALU), the main memory unit (MU), the input/output unit (I/O), and the control unit (CU). Such a computer is known as a uniprocessor, since the processing is achieved by operating on one word or word pair at a time. To increase computer performance, we may improve the device technology to reduce the switching (gate delay) time. Indeed, for the past half century we have seen switching speeds improve from 200 to 300 ms for relays to present-day subnanosecond very large scale integration (VLSI) circuits. As the switching speeds of computer devices approach a limit, however, any further significant improvement in performance is more likely to come from increasing the number of words or word pairs that can be processed simultaneously. For example, we may use one ALU to compute N sets of additions N times in a uniprocessor, or we may design a computer system with N ALUs to add all N sets at once. Conceptually, such a computer system may still consist of the four major components mentioned previously, except that there are N ALUs. An organization with multiple ALUs under the control of a single CU is called a parallel processor. To make a parallel processor more efficient and cost-effective, a fifth major component, the interconnection network, is usually required to facilitate interprocessor and processor-memory communications. In addition, each ALU requires not only its own registers but also network interfaces; the expanded ALU is then called a processing element (PE). Figure 95.1 shows a block diagram of a parallel processor.

FIGURE 95.1 A basic parallel processor organization.

95.2 Classifications

Flynn has classified computer systems according to the multiplicity of instruction and data streams, where computers are partitioned into four groups [Flynn, 1966] (a small sketch contrasting the first two classes follows the list):

1. Single instruction stream, single data stream (SISD): The conventional, word-sequential architecture, including pipelined computers (usually with a parallel ALU).
2. Single instruction stream, multiple data stream (SIMD): The multiple-ALU-type architectures (e.g., parallel/array processor). The ALU may be either bit-serial or bit-parallel.
3. Multiple instruction stream, single data stream (MISD): Not as practical as the other classes.
4. Multiple instruction stream, multiple data stream (MIMD): The multiprocessor system.
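The contrast between the first two classes is the crux of parallel processing. The following minimal sketch (an illustration of the concept, not taken from the handbook) uses NumPy's vectorized addition to stand in for an N-ALU array: one ALU is reused N times in the SISD style, while the SIMD style performs all N additions in one lockstep operation.

```python
import numpy as np

N = 8
a = np.arange(N)            # first operands
b = np.arange(N, 2 * N)     # second operands

# SISD: a single ALU steps through the N additions one word pair at a time.
c_sisd = np.empty(N, dtype=a.dtype)
for i in range(N):
    c_sisd[i] = a[i] + b[i]

# SIMD: one instruction drives N ALUs, producing all N sums at once.
c_simd = a + b

assert (c_sisd == c_simd).all()
```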
As a general rule, one could conclude that SISD and SIMD machines are single-CU systems, whereas MIMD machines are multiple-CU systems. Flynn's classification does not address the interactions among the processing modules or the methods by which the processing modules in a concurrent system are controlled. As a result, one can classify both uniprocessors and pipelined computers as SISD machines, because both instructions and data are provided sequentially.

We may also classify computer systems according to the number of bits or bit pairs a computer executes at any instant [Feng, 1972]. For example, a computer may perform operations on one bit or bit pair at a time through the use of a simple serial ALU. For an M-bit word or operand, the operation repeats M times (Point A in Fig. 95.2). To speed up the processing, a parallel ALU is usually used so that all bits of a word can be operated on simultaneously. This is how a conventional word-sequential computer executes on its operands (Point B in Fig. 95.2). A parallel processor may execute either (a) all the ith bits of N operands or operand pairs (i.e., a bit slice, or bis) or (b) all N M-bit operands or operand pairs simultaneously (Points C and D in Fig. 95.2, respectively). Figure 95.2 also shows some of the systems in this classification. It is seen from this classification that the performance of a computer is proportional to the total number of bits or bit pairs it can execute simultaneously.

FIGURE 95.2 Feng's classification.

Feng's classification [Hwang and Briggs, 1984] was originally intended for parallel processors, and as a result, the number of CUs in a computer system was not specified. Händler extended Feng's classification by adding a third dimension, namely, the number of CUs. Pipelined systems are also included in this classification through additional parameters [Händler, 1977].

95.3 Types of Parallel Processors

Ensemble Processors

An ensemble system is an extension of the conventional uniprocessor systems. It is a collection of N PEs (a PE here consists of an ALU, a set of local registers, and limited local control capability) and N MUs under the control of a single CU. Thus, the organization of an ensemble processor is similar to that shown in Fig. 95.1, except that there are no direct interprocessor and processor-memory communications, i.e., no interconnection networks. When the need for communication arises, it is met through the CU. This slows down the system for applications requiring extensive interprocessor and processor-memory communications. For example, the sum of two matrices A and B can be executed in one step if R² PEs are available in an ensemble processor, where R is the rank of the matrices. On the other hand, the product of the same two matrices requires extensive data alignment between the elements of A and B. As a result, an ensemble processor is ineffective for performing matrix multiplications, as the sketch below illustrates. Therefore, while ensemble processors are capable of executing up to N identical jobs simultaneously, they have very limited applications. The parallel element processing ensemble (PEPE) [Evensen and Troy, 1973] is an example of such parallel processors.
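To make the communication argument concrete, here is a minimal sketch (an assumed illustration, not a model of PEPE itself) of why a matrix sum maps onto R² communication-free PEs while a matrix product does not: PE(i, j) holds a(i, j) and b(i, j) locally, so the sum is one parallel step, but the product needs operands scattered across other PEs, all of which must be routed through the CU.

```python
import numpy as np

R = 4                        # matrices are R x R; assume R*R PEs, one per element
A = np.random.rand(R, R)
B = np.random.rand(R, R)

# Sum: every PE(i, j) already holds a[i, j] and b[i, j] in its local memory,
# so C = A + B completes in a single step with no communication at all.
C_sum = A + B

# Product: PE(i, j) needs all of row i of A and column j of B, so 2 * (R - 1)
# of its operands live in other PEs' memories.  With no interconnection
# network, every one of those operands is funneled through the single CU.
remote_operands_per_pe = 2 * (R - 1)
cu_transfers = R * R * remote_operands_per_pe
print(f"sum: 0 CU-mediated transfers; product: {cu_transfers}")   # 96 for R = 4
```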
Array Processors

Because most applications need interprocessor and processor-memory communication, a parallel processor usually has one or more circuits (known as interconnection networks) to support various applications for efficient processing. In general, an array processor may consist of N identical PEs under the control of a single CU and a number of MUs. Within each PE there are circuits for the network interface as well as its own local memories. The PEs and MUs communicate with each other through an interconnection network. A typical array processor organization is shown in Fig. 95.3. Depending on the design, each PE may perform serial-by-bit (as in MPP) or parallel-by-bit (as in ILLIAC IV) operations.

FIGURE 95.3 An array processor organization. I/O, input/output devices; LM, local memory; PE, processing element; SM, shared memory.

As can be seen from Fig. 95.3, the interconnection networks play a very important role in parallel processors. The network usually provides a uniform interconnection among PEs on one hand and between PEs and MUs on the other. Different array processor organizations might use different interconnection networks. In general, interconnection networks can be classified into two categories, static and dynamic, as shown in Fig. 95.4; a small sketch of two representative patterns follows. ILLIAC IV [Barnes et al., 1968] and MPP [Batcher, 1979] are examples of parallel processors using static interconnections, while STARAN [Batcher, 1973] and BSP [Kuck and Stokes, 1982] are examples using dynamic interconnections.

FIGURE 95.4 Some examples of static (a) and dynamic (b) interconnection networks.
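As a toy illustration (mine; the PE numbering scheme is assumed, and neither function is tied to a particular machine), the sketch below computes destinations under two representative patterns: the static four-neighbor mesh used by ILLIAC IV-style arrays, and the perfect-shuffle permutation that underlies one-stage shuffle-exchange networks.

```python
def mesh_neighbors(pe, side):
    """Neighbors of PE `pe` in a static side x side mesh (no wraparound)."""
    r, c = divmod(pe, side)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [nr * side + nc for nr, nc in candidates
            if 0 <= nr < side and 0 <= nc < side]

def perfect_shuffle(pe, n_bits):
    """Destination of PE `pe` under a perfect shuffle of 2**n_bits PEs:
    rotate the PE's binary address left by one bit."""
    msb = (pe >> (n_bits - 1)) & 1
    return ((pe << 1) | msb) & ((1 << n_bits) - 1)

print(mesh_neighbors(5, 4))                         # PE 5 of a 4x4 mesh: [1, 9, 4, 6]
print([perfect_shuffle(pe, 3) for pe in range(8)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```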
The CU usually has its own high-speed registers, local memory, and arithmetic unit. Thus, in many cases it is a conventional computer, and the instructions are stored in a main memory together with data. However, in some machines, such as ILLIAC IV, programs are distributed among the local memories of the PEs. Hence, the instructions are fetched from the processors' local memories into an instruction buffer in the CU. Each instruction is either a local-type instruction, which is executed entirely within the CU, or a parallel instruction, which is executed in the processing array. The primary function of the CU is to examine each instruction as it is to be executed and to determine where the execution should take place.

Associative Processor

Associative memories, also known as content-addressable memories, retrieve information on the basis of data content rather than addresses. An associative memory performs comparison (i.e., exclusive-OR or equivalence) operations at its bit level. The results of the comparison on a group of bits in a word, for all words in the memory, are transmitted to a register called a response register or flag. In addition, there are circuits such as a multiple-match resolver, an enable/disable register, and a number of temporary registers, as well as appropriate logic gates, for resolving multiple responses and retrieving information. For associative processors, arithmetic capabilities are added to this unit. The unit can be viewed as consisting of a number of bit-serial PEs. Furthermore, the bit-level logic is moved out of the memory, so that the memory part of the processor consists of a number of random-access memories called word modules. A typical associative processor is shown in Fig. 95.5. STARAN and MPP (Fig. 95.2) are representative of this bit-serial, word-parallel SIMD organization. In Fig. 95.5 the common register is where the common operand is stored, and the mask register defines the bit positions requiring operation; a sketch of such a masked search follows below. The enable/disable register provides local control of individual PEs. Because of its design simplicity, the per-PE cost of an associative processor is much lower, but the bit-serial operations slow down the system drastically. To compensate, these systems are useful only for applications requiring a large number of PEs.

FIGURE 95.5 An associative processor organization.
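The following minimal sketch (an assumed illustration; the word values, key, and mask are invented) shows the masked-search primitive described above. NumPy's elementwise operations stand in for the word-parallel hardware, and each comparison is the XOR-then-test that a bit-serial PE would perform one bit at a time.

```python
import numpy as np

words  = np.array([0b10110110, 0b10111111, 0b00010110, 0b10110000],
                  dtype=np.uint8)      # contents of the word modules
common = np.uint8(0b10110000)          # common register: the search key
mask   = np.uint8(0b11110000)          # mask register: compare high nibble only

# A word responds when the exclusive-OR with the key is zero on every
# unmasked bit; the Boolean vector plays the role of the response register.
response = ((words ^ common) & mask) == 0
print(response)                        # [ True  True False  True]
```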
95.4 System Utilization

As discussed previously, for any computer there is a maximum number of bits or bit pairs that can be processed concurrently, whether it is under single-instruction or multiple-instruction control [Feng, 1972, 1973]. This maximum degree of concurrency, or maximum concurrency ($C_m$), is an indication of the computer's processing capability. The actual utilization of this capability is indicated by the average concurrency, defined to be

$$C_a = \frac{\sum_i c_i \,\Delta t_i}{\sum_i \Delta t_i}$$

where $c_i$ is the concurrency during the interval $\Delta t_i$. If each $\Delta t_i$ is set to one time unit, then the average concurrency over a period of $T$ time units is

$$C_a = \frac{1}{T} \sum_{i=1}^{T} c_i$$
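As a small worked example (an illustration with an assumed trace, not from the chapter), the code below evaluates these definitions for a hypothetical machine with $C_m = 64$ bits of maximum concurrency, using unit time steps.

```python
def average_concurrency(trace):
    """C_a = (1/T) * sum of c_i, with each time step one unit long."""
    return sum(trace) / len(trace)

def utilization(trace, c_max):
    """mu = C_a / C_m, i.e., the mean of sigma_i = c_i / C_m."""
    return average_concurrency(trace) / c_max

C_m = 64                            # bits the hardware can process at once
trace = [64, 64, 32, 16, 64, 8]     # c_i: bits actually processed per step
print(average_concurrency(trace))   # C_a = 41.33...
print(utilization(trace, C_m))      # mu  = 0.6458...
```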
The average hardware utilization is then

$$\mu = \frac{C_a}{C_m} = \frac{1}{T C_m} \sum_{i=1}^{T} c_i = \frac{1}{T} \sum_{i=1}^{T} \sigma_i$$

where $\sigma_i$ is the hardware utilization at time $i$. Whereas $C_m$ is determined by the hardware design, $C_a$ or $\mu$ is highly dependent on the software and applications. A general-purpose computer should achieve a high $\mu$ for as many applications as possible, whereas a special-purpose computer would yield a high $\mu$ for at least the intended applications. In either case, maximizing the value of $\mu$ for a computer design is important. This equation can also be used to evaluate the relative effectiveness of machine designs.

For a parallel processor, the degree of concurrency is called the degree of parallelism. A similar discussion can be used to define the average hardware utilization of a parallel processor. The maximum parallelism is then $P_m$, and the average parallelism is

$$P_a = \frac{\sum_i p_i \,\Delta t_i}{\sum_i \Delta t_i}$$

or

$$P_a = \frac{1}{T} \sum_{i=1}^{T} p_i$$

for $T$ time units.
The average hardware utilization becomes

$$\mu = \frac{P_a}{P_m} = \frac{1}{T P_m} \sum_{i=1}^{T} p_i = \frac{1}{T} \sum_{i=1}^{T} \rho_i$$

where $\rho_i$ is the hardware utilization of the parallel processor at time $i$. With appropriate instrumentation, the average hardware utilization of a system can be determined. In practice, however, it is not always true that every bit or bit pair being processed is productive. Some of the bits produce only repetitious (superfluous) or even meaningless results. This happens more often, and more severely, in a parallel processor than in a word-sequential processor. Consider, for example, performing a maximum search in a mesh-connected parallel processor (such as ILLIAC IV). For $N$ operands, it takes $(N/2)\log_2 N$ comparisons ($N/2$ comparisons for each of $\log_2 N$ iterations) instead of the usual $N - 1$ comparisons in word-sequential machines; a toy sketch of this count follows below. Thus, in effect there are

$$\frac{N}{2}\log_2 N - (N - 1) = \frac{N}{2}(\log_2 N - 2) + 1$$

comparisons that are nonproductive. If we let $\hat{P}_a$ be the effective parallelism over a period of $T$ time units and $\hat{\mu}$, $\hat{p}_i$, and $\hat{\rho}_i$ be the corresponding effective values, the effective hardware utilization is then

$$\hat{\mu} = \frac{\hat{P}_a}{P_m} = \frac{1}{T P_m} \sum_{i=1}^{T} \hat{p}_i = \frac{1}{T} \sum_{i=1}^{T} \hat{\rho}_i$$
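The toy sketch below (my illustration) reproduces this count for N = 16. In every one of the log₂N lockstep rounds the CU broadcasts one compare instruction, so the array performs N/2 comparisons per round whether or not the results are still needed; only N − 1 of them turn out to be productive.

```python
def simd_max(values):
    """Maximum of N = 2**k values in log2(N) lockstep rounds, counting the
    N/2 comparisons the SIMD array performs in every round."""
    n = len(values)
    hardware_comparisons = 0
    step = 1
    while step < n:                          # log2(N) rounds
        for i in range(0, n, 2 * step):      # pairs whose results still matter
            values[i] = max(values[i], values[i + step])
        hardware_comparisons += n // 2       # ...but all N/2 PE pairs compared
        step *= 2
    return values[0], hardware_comparisons

N = 16
maximum, used = simd_max(list(range(N)))
print(maximum, used)        # 15 32  -> (N/2) * log2(N) comparisons
print(used - (N - 1))       # 17     -> (N/2) * (log2(N) - 2) + 1 nonproductive
```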
A successful parallel processor design should yield a high $\hat{\mu}$ as well as the required throughput for, at least, the intended applications. This involves not only a proper hardware and software design but also the development of efficient parallel algorithms for these applications. Suppose $T_u$ is the execution time of an application program using a conventional word-sequential machine, and $T_c$ is the execution time of the same program using a concurrent system; the speed-up ratio is then defined as

$$S = \frac{T_u}{T_c}$$

Naturally, for a specific parallel organization, the speed-up ratio determines how well an application program can utilize the hardware resources. Supporting software has a direct effect on the speed-up ratio.

Defining Terms

Array processor: A parallel processor consisting of a number of processing elements, memory modules, and input/output devices, as well as interconnection networks, under a single control unit.

Associative processor: A parallel processor consisting of a number of processing elements, memory modules, and input/output devices under a single control unit. The capability of the processing elements is usually limited to bit-serial operations.

Ensemble processor: A parallel processor consisting of a number of processing elements, memory modules, and input/output devices under a single control unit. It has no interconnection network to provide interprocessor or processor-memory communications.

Interconnection network: A network of interconnections providing interprocessor and processor-memory communications. It may be static or dynamic, distributed or centralized.

Parallel processor: A computing system consisting of a number of processors, memory modules, input/output devices, and other components under the control of a single control unit. It is known as a single-instruction-stream, multiple-data-stream (SIMD) machine.

Processing element: A basic processor consisting of an arithmetic-logic unit, a number of registers, network interfaces, and some local control facilities.

Related Topics

18.1 Special Architectures • 96.5 Parallel Processing

References

G.H. Barnes, R.M. Brown, M. Kato, D.J. Kuck, D.L. Slotnick, and R.A. Stokes, “The ILLIAC IV computer,” IEEE Trans. Comput., vol. C-17, pp. 746–757, 1968.
K.E. Batcher, “STARAN/RADCAP hardware architecture,” Proc. Sagamore Computer Conf. on Parallel Processing, pp. 147–152, 1973.
K.E. Batcher, “MPP—A massively parallel processor,” Proc. Int. Conf. on Parallel Processing, p. 249, 1979.
A.J. Evensen and J.L. Troy, “Introduction to the architecture of a 288-element PEPE,” Proc. Sagamore Computer Conference, pp. 162–169, 1973.
T. Feng, “An overview of parallel processing systems,” 1972 WESCON Tech. Papers, Session 1, “Parallel Processing Systems,” pp. 1–2, 1972.
T. Feng, Parallel Processing Characteristics & Implementation of Data Manipulating Functions, Technical Report RADC-TR-73-189, July 1973.
M.J. Flynn, “Very high speed computing systems,” Proc. IEEE, vol. 54(12), pp. 1901–1909, 1966.
W. Händler, “The impact of classification schemes on computer architecture,” Proc. Int. Conf. on Parallel Processing, pp. 7–15, 1977.
K. Hwang and F.A. Briggs, Computer Architecture and Parallel Processing, New York: McGraw-Hill, 1984.
D.J. Kuck and R.A. Stokes, “The Burroughs Scientific Processor (BSP),” IEEE Trans. Comput., vol. C-31(5), pp. 363–376, 1982.

Further Information

Proceedings of the International Conference on Parallel Processing: An annual conference held since 1972. Recent proceedings published by CRC Press.
IEEE Transactions on Parallel and Distributed Systems: Started in 1990 as a quarterly, now a monthly, published by the IEEE Computer Society.
Journal of Parallel and Distributed Computing: A monthly published by Academic Press.
Computer Architecture and Parallel Processing: A book by K. Hwang and F.A. Briggs published by McGraw-Hill.