GPU Parallel Programming course teaching resource (reference): Some Computer Organizations and Their Effectiveness

IEEE TRANSACTIONS ON COMPUTERS, VOL. C-21, NO. 9, SEPTEMBER 1972

Some Computer Organizations and Their Effectiveness

MICHAEL J. FLYNN, MEMBER, IEEE

(Manuscript received February 26, 1970; revised May 25, 1971, and January 21, 1972. This work was supported by the U.S. Atomic Energy Commission under Contract AT(11-1) 3288. The author is with the Department of Computer Science, The Johns Hopkins University, Baltimore, Md. 21218.)

Abstract-A hierarchical model of computer organizations is developed, based on a tree model using request/service type resources as nodes. Two aspects of the model are distinguished: logical and physical.

General parallel- or multiple-stream organizations are examined as to type and effectiveness, especially regarding intrinsic logical difficulties.

The overlapped simplex processor (SISD) is limited by data dependencies. Branching has a particularly degenerative effect.

The parallel processors [single-instruction stream-multiple-data stream (SIMD)] are analyzed. In particular, a nesting type explanation is offered for Minsky's conjecture: the performance of a parallel processor increases as log M instead of M (the number of data stream processors).

Multiprocessors (MIMD) are subjected to a saturation syndrome based on general communications lockout. Simplified queuing models indicate that saturation develops when the fraction of task time spent locked out (L/E) approaches 1/n, where n is the number of processors. Resource sharing in multiprocessors can be used to avoid several other classic organizational problems.

Index Terms-Computer organization, instruction stream, overlapped, parallel processors, resource hierarchy.

INTRODUCTION

ATTEMPTS to codify the structure of a computer have generally been from one of three points of view: 1) automata theoretic or microscopic; 2) individual problem oriented; or 3) global or statistical.

In the microscopic view of computer structure, relationships are described exhaustively. All possible interactions and parameters are considered without respect to their relative importance in a problem environment.

Measurements made by using individual problem yardsticks compare organizations on the basis of their relative performances in a peculiar environment. Such comparisons are usually limited because of their ad hoc nature.

Global comparisons are usually made on the basis of elaborate statistical tabulations of relative performances on various jobs or mixtures of jobs. The difficulty here lies in the fact that the analysis is ex post facto and usually of little consequence in the architecture of the system, since the premises on which they were based (the particular computer analyzed) have been changed.

The object of this paper is to reexamine the principal interactions within a processor system so as to generate a more "macroscopic" view, yet without reference to a particular user environment. Clearly, any such effort must be sharply limited in many aspects; some of the more significant are as follows.

1) There is no treatment of I/O problems or I/O as a limiting resource. We assume that all programs of interest will either not be limited by I/O, or the I/O limitations will apply equally to all computer memory configurations. That is, the I/O device sees a "black box" computer with a certain performance. We shall be concerned with how the computer attained a performance potential, while it may never be realized due to I/O considerations.

2) We make no assessment of particular instruction sets. It is assumed that there exists a (more or less) ideal set of instructions with a basically uniform execution time, except for data conditional branch instructions whose effects will be discussed.

3) We will emphasize the notion of effectiveness (or efficiency) in the use of internal resources as a criterion for comparing organizations, despite the fact that either condition 1) or 2) may dominate a total performance assessment.

Within these limitations, we will first attempt to classify the forms or gross structures of computer systems by observing the possible interaction patterns between instructions and data. Then we will examine physical and logical attributes that seem fundamental to achieving efficient use of internal resources (execution facilities, memory, etc.) of the system.

CLASSIFICATION: FORMS OF COMPUTING SYSTEMS

Gross Structures

In order to describe a machine structure from a macroscopic point of view, on the one hand, and yet avoid the pitfalls of relating such descriptions to a particular problem, the stream concept will be used [1]. Stream in this context simply means a sequence of items (instructions or data) as executed or operated on by a processor. The notion of "instruction" or "datum" is defined with respect to a reference machine. To avoid trivial cases of parallelism, the reader should consider a reference instruction or datum as similar to those used by familiar machines (e.g., IBM 7090). In this description, organizations are categorized by the magnitude (either in space or time multiplex) of interactions of their instruction and data streams. This immediately gives rise to four broad classifications of machine organizations.

1) The single-instruction stream-single-data stream organization (SISD), which represents most conventional computing equipment available today.

2) The single-instruction stream-multiple-data stream (SIMD), which includes most array processes, including Solomon [2] (Illiac IV).

3) The multiple-instruction stream-single-data stream type organizations (MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum (SD) passed from program step i to program step i+1 (MI).

4) The multiple-instruction stream-multiple-data stream (MIMD), which include organizations referred to as "multiprocessor." Univac [3], among other corporations, has proposed various MIMD structures.

These are qualitative notations. They could be quantified somewhat by specifying the number of streams of each type in the organization or the number of instruction streams per data stream, or vice versa. But in order to attain a better insight into the notions of organization, let us formalize the stream view of computing.

Consider a generalized system model consisting of a requestor and a server. For our purposes we will consider the requestor as synonymous with a program (or user) and the server as synonymous with the resources, both physical and logical, which process the elements of the program. Note that, for now, the distinction between a program and a resource is not precise; indeed, under certain conditions a resource may be a program and vice versa. Then a problem can be defined as a stream (or sequence) of requests for service (or resources). Since each request is, in general, a program and can also specify a sequence of requests, we have a request hierarchy.

Let P be a program. A program is defined simply as a request for service by a structured set of resources. P specifies a sequence of other (sub)requests R1, R2, R3, R4, ···, Rn, called tasks. While the tasks appear here as a strictly ordered set, in general the tasks will have a more complex control structure associated with them. Each request, again, consists of a sequence of subrequests (the process terminates only at the combinational circuit level). Regardless of level, any request Ri is a bifurcated function having two roles:

    Ri ≡ {f_i^l, f_i^v},

the logical role of the requestor f_i^l and the combined logical and physical role of a server f_i^v.

1) Logical Role of Requestor: The logical role of the requestor is to define a result given an argument and to define a control structure or precedence among the other tasks directly defined by the initiating program P. We anticipate a hierarchy of requests, where the transition between the initiating level P and the next level tasks {Ri} is viewed as the logical role of each of the Ri, while actual service is through a combination of a tree of lower level logical requests and eventual physical service at the leaves of the tree.

Thus consider

    f_i^l(x, τ) → (y, τ*),

where f_i^l is a functional mapping (in a mathematical sense) of argument x into result y. Here x, y ∈ B, where B is the set of values defined by the modulo class of the arithmetic (logical and physical notions of arithmetic should be identical). The τ and τ* indicate logical time or precedence; τ is a Boolean control variable whose validity is established by one or more predecessor logical functions {f_j^l}.

The requirement for a τ control variable stems from the need for specification of the validity of f_i^l and its argument x. Notice that two tasks f_i^l and f_j^l are directly dependent if either τ_i = 1 implies τ_j* = 1 or τ_j = 1 implies τ_i* = 1. This precedence specification may be performed implicitly by use of a restrictive convention (e.g., by strict sequence), requiring that the physical time t at which the ith task control variable τ_i becomes valid, t(τ_i = 1), satisfies t(τ_i = 1) ≥ t(τ_{i-1}* = 1) for all i, or by explicit control of τ and τ*.

That there can be no general way of rearranging the request sequence {f_i^l} (or finding an alternate sequence {g_i^l}) such that the precedence requirement vanishes is a consequence of the composition requirement f(g(x)), intrinsic to the definition of a computable function. That is, f cannot be applied to an argument until g(x) has been completed.

The notion of an explicit precedence control has been formalized by Leiner [16] and others by use of a precedence matrix.

Given an unordered set of requests (tasks) {f_j^l}, 1 ≤ j ≤ n, an n×n matrix M is defined so that a_ij = 1 if we require t(τ_j = 1) ≥ t(τ_i* = 1), i.e., task f_i must be completed before f_j can be begun; otherwise a_ij = 0. The matrix M so defined identifies the initial priority. By determining M^2 (in the conventional matrix product sense), secondary implicit precedence is determined. This process is continued until

    M^(P+1) = 0.

The fully determined precedence matrix H is defined as

    H = M + M^2 + M^3 + ··· + M^P,   P ≤ n,

where + is the Boolean union operation (entrywise inclusive OR).

Thus H defines a scheduling of precedence among the n tasks. At any moment of logical time (τ_i), the set of tasks {f_k^l : for every j, either a_jk = 0, or a_jk = 1 and τ_j* = 1} is independently executable.
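The precedence-matrix construction just described lends itself to a short computational sketch. The following Python fragment is our own illustration rather than code from the paper (the names full_precedence and ready_tasks are invented); it forms H = M + M^2 + ··· + M^P with Boolean union and then lists the tasks that are independently executable at a given moment, assuming an acyclic task set.

```python
import numpy as np

def full_precedence(M):
    """Boolean closure H = M + M^2 + ... + M^P of an initial precedence
    matrix M (M[i][j] = 1 iff task i must complete before task j begins).
    Assumes the task set is acyclic, so some power of M is all zeros."""
    M = np.array(M, dtype=bool)
    H = np.zeros_like(M)
    power = M.copy()
    while power.any():                                    # stops once M^(P+1) = 0
        H |= power                                        # Boolean union (entrywise OR)
        power = (power.astype(int) @ M.astype(int)) > 0   # next Boolean power of M
    return H.astype(int)

def ready_tasks(H, completed):
    """Tasks not yet done whose every predecessor (nonzero column entries
    of H) has already completed; these are independently executable."""
    n = H.shape[0]
    return [k for k in range(n) if k not in completed
            and all(H[j][k] == 0 or j in completed for j in range(n))]

# Example: f0 precedes f1 and f2, which both precede f3.
M = [[0, 1, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
H = full_precedence(M)
print(H)                      # contains the implicit precedence 0 -> 3 as well
print(ready_tasks(H, set()))  # [0]: only task 0 may start
print(ready_tasks(H, {0}))    # [1, 2]: tasks 1 and 2 may proceed in parallel
```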

Fig. 1. Service hierarchy.

Since "task" and "instruction" are logically equivalent requests for service, the preceding also applies to the problem of detecting independent instructions in a sequence. In practice, tasks are conditionally issued. This limits the use of the precedence computation to the "while running" or dynamic environment.

2) Service Role of Request: Thus far we have been concerned with precedence or task control at a single level. The service for node Ri is defined by f_i^v, which is a subtree structure (see Fig. 1) terminating in the physical resources of the system, i.e., the physically executable primitives of the system. Thus f_i^l defines the transition from P to Ri and among the {Ri}, i = 1, ···, n, at a given level, while f_i^v defines the substructure under node Ri. Thus f_i^v is actually a hierarchy of subrequests terminating in a primitive physical service resource.

These terminal nodes are defined as a request for service lying within the physical resource vocabulary of the system, i.e.,

    f_{i_k}^v ≡ v,   v ∈ V,

where V is the set of available physical resources. Note that the request is generally for any element in a particular resource class rather than a specific resource v.

The elements {v} are usually of two types: operational or storage. A storage resource is a device capable of retaining a representation of a datum after the initial excitation is removed. The specification of a device is usually performed by coordinate location, contents, or implication. An operational resource is a combinational circuit primitive (e.g., add, shift, transfer, ···) that performs a (usually) binary mapping S × S → S, where S is the set of storage (space) resources.

Strictly speaking, since a request for a physical resource v is a request for accessing that resource, there is no "request for storage resource" but rather a request to an allocation algorithm to access or modify the memory map. Thus a storage resource has operational characteristics if we include the accessing mechanism in the storage resource partition.

Partitions are defined on physical resources which define a memory hierarchy (space) or a set of primitive operations. Note that since partitions may not be unique, resulting tree structures may also differ. Also note that while leaves of the tree must be requests for physical resources, higher level nodes may also be if all inferior nodes are physical.

Since physical time is associated with a physical activity, at the terminal nodes we have

    f_{i_k}^v: (S × S) → S over the interval (t, t̂),

where t is the initiation time and t̂ is the completion time. Initiation is conditioned on the availability of operational resource v and the validity of both source operands (τ = 1). When the operation is complete, the result is placed in a specified sink location, and the control variable τ* is set to "1."

The advantage of this model is that it allows a perspective of a system at a consistent hierarchical level of control structure. One may replace the subtree with its equivalent physical execution time (actual or mean value) and mean operational and spatial resource requirements if task contention is a factor. The net effect is to establish a vector space with a basis defined by the reference of the observer. The vector of operations O may appear as physical resources at a particular level k in the tree, but actually may represent a subtree of requests; similarly with the storage vector S. The control structure defined by the set of request functions {f_i^l}, 1 ≤ i ≤ n, will determine the program structure as an interaction of resources on O × S. The reader may note the similarity, in hierarchy treatment at least, between the preceding model and the general model of Bell and Newell [28].

Implicit in the foregoing statement is the notion that a physical resource exists in space-time, not space or time alone. For example, N requests may be made for an "add" operational resource; these may be served by N servers, each completing its operation in time T, or equivalently by one resource that operates in time T/N. Two parameters are useful in characterizing physical resources [1]: latency and bandwidth. Latency is the total time associated with the processing (from excitation to response) of a particular data unit at a phase in the computing process. Bandwidth is an expression of time-rate of occurrence. In particular, operational bandwidth would be the number of operand pairs processed per unit time.

If, as a hierarchical reference point, we choose operations and operands as used by, for example, an IBM 7090, we can explore arrangements of, and interactions between, familiar physical resources. The IBM 7090 itself has a trivial control tree. In particular, we have the SISD: a single operational resource operating on a single pair of storage resources. The multiple-stream organizations are more interesting, however, as are two considerations: 1) the latency for interstream communication; and 2) the possibilities for high computational bandwidth within a stream.
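The space-time equivalence noted above (N requests served by N units of latency T, or by one unit of latency T/N) can be made concrete with a small sketch. This is a hypothetical illustration of the bandwidth notion, not anything from the paper, and the function name bandwidth is ours.

```python
def bandwidth(n_servers: int, latency: float) -> float:
    """Operand pairs completed per unit time by n_servers identical
    units, each with the given latency, assuming they are kept busy."""
    return n_servers / latency

# N requests for an "add" resource: N units of latency T ...
N, T = 8, 4.0
wide = bandwidth(n_servers=N, latency=T)      # 8 units, 4 time units each
# ... or one unit that is N times faster (latency T/N).
fast = bandwidth(n_servers=1, latency=T / N)

print(wide, fast)   # both 2.0 operand pairs per unit time
# The two configurations differ in the latency seen by a single
# request (T versus T/N) but not in bandwidth.
```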

Interstream Communications

There are two aspects of communications: operational resource accessing of a storage item (O × S) and storage to storage transfer (S × S). Both aspects can be represented by communications matrices, each of whose entries t_ij is the time to transfer a datum in the jth storage resource to the ith resource (operational or storage). The operational communication matrix is quite useful for MIMD organizations, while the storage communications matrix is usually more interesting for SIMD organizations. An (O × O) matrix can also be defined for describing MISD organizations.

An alternate form of the communications matrix, called the connection matrix, can also be developed for the square matrix cases. This avoids the possibly large or infinite entries possible in the communications matrix (when interstream communications fail to exist). The reciprocal of the normalized access time, t_ij/t_ii (assuming t_ii is the minimum entry for a row), is entered as d_ij for the access of an element of the ith data storage resource by the jth operational or storage resource. The minimum access time (resolution) is 1. If a particular item were inaccessible, there would be a zero entry. Notice that in comparing parallel organizations to the serial organization, the latter has immediate access to corresponding data. While it appears that under certain conditions an element expression can be zero due to lack of communication between resources, in practice this does not occur, since data can be transferred from one stream to another in finite time, however slow. Usually such transfers occur in a common storage hierarchy.

Stream Inertia

It is well known that the action of a single-instruction stream may be telescoped for maximum performance by overlapping the various constituents of the execution of an individual instruction [4]. Such overlapping usually does not exceed the issuing of one instruction per instruction decode resolution time Δt. This avoids the possibly exponentially increasing number of decision elements required in such a decoder [1], [5]. A recent study [13] provides an analysis of the multiple-instruction issuing problem in a single-overlapped instruction stream. In any event, a certain number of instructions in a single-instruction stream are being processed during the latency time for one instruction execution. This number may be referred to as the confluence factor or inertia factor J of the processor per individual instruction stream. Thus the maximum performance per instruction stream can be enhanced by a factor J. If the average instruction execution time is L·Δt time units, the maximum performance per stream would be

    perf_max = J / (L·Δt).

Fig. 2. Stream inertia.

This is illustrated in Fig. 2. Successive instructions are offset in this example by Δt time units.

System Classification

Then to summarize, a technology independent macroscopic specification of a large computing system would include: 1) the number of instruction streams and the number of data streams (the "instruction" and "data" unit should be taken with respect to a convenient reference); 2) the appropriate communications (or connection) matrices; and 3) the stream inertia factor J and the number of time units of instruction execution latency L.

EFFECTIVENESS IN PERFORMING THE COMPUTING PROCESS

Resolution of Entropy

Measures of effectiveness are necessarily problem based. Therefore, comparisons between parallel and simplex organizations frequently are misleading, since such comparisons can be based on different problem environments. The historic view of parallelism in problems is probably represented best by Amdahl [6] and is shown in Fig. 3. This viewpoint is developed by the observation that certain operations in a problem environment must be done on an absolutely sequential basis. These operations include, for example, the ordinary housekeeping operations in a program. In order to achieve any effectiveness at all, from this point of view, parallel organizations processing N streams must have substantially less than 1/N × 100 percent of absolutely sequential instruction segments. One can then proceed to show that typically for large N this does not exist in conventional programs. A major difficulty with this analysis lies in the concept of "conventional programs," since this implies that what exists today in the way of programming procedures and algorithms must also exist in the future. Another difficulty is that it ignores the possibility of overlapping some of this sequential processing with the execution of "parallel" tasks.
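The Amdahl viewpoint sketched above is usually summarized by the familiar speedup bound. The following Python fragment is our own illustration using that standard formulation (the function name amdahl_speedup and the sample numbers are assumptions, not taken from the paper); it shows how quickly a small absolutely sequential fraction s caps the gain from N streams.

```python
def amdahl_speedup(N: int, s: float) -> float:
    """Speedup of N parallel streams when a fraction s of the work
    is absolutely sequential (the remaining 1 - s splits N ways)."""
    return 1.0 / (s + (1.0 - s) / N)

# For large N the sequential fraction dominates: even s = 1/N
# already caps the speedup at roughly N/2.
for N in (4, 16, 64, 1024):
    print(N,
          round(amdahl_speedup(N, s=0.05), 1),
          round(amdahl_speedup(N, s=1.0 / N), 1))
# e.g., N = 1024 with s = 0.05 gives a speedup of only about 19.6.
```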

Fig. 3. "Parallelism" in problems (horizontal axis: percent of the problem requiring sequential execution).

To review this problem from a general perspective, consider a problem in which N1 words, each of p bits, serve as input data. The program or algorithm operates on this input data and produces output results corresponding to N2 words, each of p bits. If we presume that as a maximum each of the input bits could affect each of the bits in the output, then there are N1 × p bits of uncertainty or entropy which specify an output bit. Thus a table-oriented solution algorithm (i.e., generate a table of solutions, one solution for each input combination) requires 2^(p·N1) entries and uses p·N2 bits per entry. The total number of bits required is p·N2·2^(p·N1).

We could generalize this for variable length input and output by treating N2 and N1 as the maximum number of words required to specify output or input information. Note that this imposes a limit on the size of output strings, and hence the table algorithm is too restrictive for generalized computable functions. However, within this restricted class the entropy Q to be resolved by an algorithm is

    p·N2 ≤ Q ≤ p·N2·2^(p·N1).

The lower bound also needs special interpretation for trivial algorithms (e.g., strings of identical ones); the p·N2 bits were assumed to be independently derived. Hence the p·N2 bits must be reduced by their dependency. (See Kolmogorov [10] for an alternate notion on the specification of Q.)

In any event, Q bits of uncertainty must be resolved. This resolution may be performed in space or time. Typically it is resolved in space by combinatorial logic. Each decision element may resolve between zero and one bit of information, depending upon the associated probabilities of the binary outcomes. Usually a useful repertoire or vocabulary of logical decisions is available to an instruction stream. Let an element of the vocabulary consist of M decisions. Depending upon the type of operation to be performed, these execution unit decision elements resolve less than M bits of information; thus

    Q ≤ m'·M·N_t,

where m' is the number of data stream execution units and N_t is the number of time operations that were used (i.e., the number of sequential instructions). In order to retire the complete algorithm, of course, a sequence of operations is performed to execute the instruction stream; each operation in the sequence resolves or exceeds the required amount of entropy for a solution.

Thus from this most general point of view there is little difference in the resolution of entropy in space or time. The reader will note, however, that space resolution is not necessarily as efficient as time sequences. In fact, the number of time sequence operations N_t is a linear function of input size n:

    N_t = k·n,

while Muller [32] has shown that for combinational circuits of arbitrary functional basis a measure for the number of circuits required, N_s, is

    k1·(2^n)/n ≤ N_s ≤ k2·(2^n)/n,

where n is the number of input lines. See Cook [33] and Cook and Flynn [34] for a more general and complete treatment of space-time functional measures. Later in this paper we will discuss similar inefficiencies in parallel SIMD processors.

Recursive Functions

Substantial difficulties arise when the preceding "general point of view" is reduced to the specific. In particular, the class of functions for which arbitrary "space-time" transformations can be developed is not equivalent to the class of all recursive (computable) functions. Recursion (in Kleene's sense [7]) is based on the application of the composition operation on a finite set of initial functions. This composition operation [8] is the association of a function h(X1, X2, ···, Xn) with F(X1, ···, Xn), g1(X1, ···, Xn), g2(X1, ···, Xn), ···, gn(X1, ···, Xn), so that h(X1, ···, Xn) = F(g1(X1, ···, Xn), ···, gn(X1, ···, Xn)). Clearly, the application of F on the results of the gi is a sequential operation. Any general attempt to resolve the functional entropy without such sequential (time) dependence leads to "recursion" based on infinite sets of initial functions.

Bernstein [9] has developed three (rather strong) sufficient conditions for two programs to operate in parallel, based on the referencing of partitions of storage. As in the preceding discussion, these conditions specify limitations on the interactions between programs.

Notice that none of the foregoing serves to limit the capability of a processor organized in a parallel fashion to perform computation, but rather serves as a limit on the efficiency of such an operation. Also note that the composing function F may induce similar inefficiencies in a confluent simplex processor (depending on the nature of F and the last gi to be evaluated). Such performance degradation will be discussed later. The composition mechanism that causes a problem here is an interprogram action, while the stream inertia difficulty occurs more prominently in purely intraprogram conditional actions. Indeed, there are techniques for the elimination of branches in simple programs by use of Boolean test variables (0 or 1) operating multiplicatively on each of two alternate task paths [14]. This is a direct "branch" to "composition" transformation.
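As a small sketch of the branch-to-composition transformation just mentioned (our own illustration, not code from the paper or from [14]), a Boolean test variable t in {0, 1} applied multiplicatively selects between two alternate task paths without issuing a data conditional branch:

```python
def branch_free_select(t: int, x: float) -> float:
    """Evaluate both alternate task paths and combine them with the
    Boolean test variable t (0 or 1) acting multiplicatively,
    replacing 'if t: path_a else: path_b'."""
    path_a = x * x + 1.0      # taken when t == 1
    path_b = 2.0 * x - 3.0    # taken when t == 0
    return t * path_a + (1 - t) * path_b

x = 5.0
assert branch_free_select(1, x) == x * x + 1.0
assert branch_free_select(0, x) == 2.0 * x - 3.0
# Cost: both paths are always executed, trading extra operations
# for the removal of a data conditional branch.
```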

SYSTEM ORGANIZATIONS AND THEIR EFFECTIVENESS IN RESOURCE USE

SISD and Stream Inertia

Serious inefficiencies may arise in confluent SISD organizations due to turbulence when data interacts with the instruction stream. Thus an instruction may require an argument that is not yet available from a preceding instruction (direct composition), or (more seriously) an address calculation is incomplete (indirect composition). Alternately, when a data conditional branch is issued, testable data must be available before the branch can be fully executed (although both paths can be prepared). In conventional programs, delays due to branching usually dominate the other considerations; hence in the following we make the simplified assumption that inertia delay is due exclusively to branching. In addition to providing a simple model of performance degradation, we can relate it to certain test data derived from a recent study. The reader should beware, however, that elimination of branching by the introduction of composition delay will not alter the turbulence situation. For a machine issuing an instruction per Δt, if the data interaction with the stream is determined by the ith instruction and the usage of such data is required by the (i+1)th instruction, a condition of maximum turbulence is encountered, and this generates the maximum serial latency time (L - 1)·Δt, which must be inserted into the stream until the overlapped issuing conditions can be reestablished.

If we treat the expected serial latency time of an instruction, L·Δt, as being equivalent to the total execution time for the average instruction (from initiation to completion), we must also consider an anticipation factor N: the average number of instructions between (inclusively) the instruction stating a condition and the instruction which tests or uses this result. Clearly, for N ≥ J instructions no turbulence (or delay) will result.

Thus under ideal conditions one instruction would be executed each L·Δt/J time units. Turbulence adds a delay:

    delay = (L - N·L/J)·Δt,  for N < J;
    delay = 0,               for N ≥ J.

Given a block of M instructions with a probability p of encountering a turbulence-causing instruction, the total time to execute these instructions would be

    T = L·Δt·[ M·(1 - p)/J + 1 + p·M·(1 - N/J) ].

The additional "1" in the expression is due to the startup of the instruction overlapping. If we define performance as the number of instructions executed per unit time, then

    perf = M/T = 1 / { L·Δt·[ (1 - p)/J + p·(1 - N/J) + 1/M ] }.

The last term in the denominator drops out as M becomes large. Then

    perf = J / { L·Δt·[ 1 + p·(J - N - 1) ] }.

Fig. 4. Stream inertia and effects of branch (performance against the inertia factor J for several turbulence probabilities p).

Fig. 4 shows the effect of turbulence probability p for various J and L. In particular, if J = 20 and L were 20 time units, N = 2 instructions and a turbulence probability of 10 percent, the performance of a system would degrade from a maximum possible of 1 (instruction/time unit) to 1/2.7, or about 30 percent of its potential.

A major cause of turbulence in conventional programs is the conditional branch; typically the output of a computer would include 10-20-percent conditional branches. A study by O'Regan [12] on the branch problem was made using the foregoing type of analysis. For a typically scientific problem mix (five problems: root finding; ordinary differential equation; partial differential equations; matrix inversion; and Polish string manipulation), O'Regan attempted to eliminate as many data conditional branches as possible using a variety of processor architectures (single and multiple accumulators, etc.). O'Regan did not resort to extreme tactics, however, such as multiplying loop sizes or transformations to a Boolean test (mentioned earlier). For four of the problems selected, the best (i.e., minimum) conditional branch probability attainable varied from p = 0.02 to p = 0.10. The partial differential equation results depend on grid size, but p = 0.001 seems attainable. The best (largest) attainable N (the set-test offset) average was less than 3. No attempt was made in the study to evaluate degradation due to other than branch dependent turbulence.
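The degradation model above is easy to evaluate numerically. The sketch below is ours (the function name sisd_performance and its argument names are assumptions, not the paper's notation); it reproduces the quoted example in which J = 20, L = 20, N = 2 and p = 0.1 reduce performance to roughly 1/2.7 of the ideal one instruction per time unit.

```python
def sisd_performance(J, L, p, N, dt=1.0, M=None):
    """Instructions completed per time unit for an overlapped SISD
    stream: inertia factor J, execution latency L*dt, turbulence
    (branch) probability p, and set-to-test offset N instructions.
    If M is given, the startup term 1/M is kept; otherwise M -> infinity."""
    startup = (1.0 / M) if M else 0.0
    denom = L * dt * ((1.0 - p) / J + p * (1.0 - N / J) + startup)
    return 1.0 / denom

ideal = sisd_performance(J=20, L=20, p=0.0, N=2)     # 1.0 instruction per time unit
degraded = sisd_performance(J=20, L=20, p=0.1, N=2)  # about 0.37 = 1/2.7
print(ideal, degraded, ideal / degraded)             # ratio is about 2.7
```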

FLYNN: COMPUTER ORGANIZATIONS 953 tively on each of two alternate task paths [14]. This is a direct "branch" to "composition" transformation. s 2 SYSTEM ORGANIZATIONS AND THEIR EFFECTIVENESS IN RESOURCE USE V SISD and Stream Inertia Serious inefficiencies may arise in confluent SISD ppZ0.20 organizations due to turbulence when data interacts with the instruction stream. Thus an instruction may . inerafactor require an argument that is not yet available from a Fig. 4. Stream inertia and effects of branch. preceding instruction (direct composition), or (more seriously) an address calculation is incomplete (indirect composition). Alternately, when a data conditional formace instructiohn overlapping. If we define per￾branch is issued, testable data must be available before the branch can be fully executed (although both paths unit time, tl7en can be prepared). In conventional programs delays due M M to branching usually dominate the other considera- perf.=-= - tions; then in the following we make the simplified T1N assumption that inertia delay is due exclusively to _J J branching. In addition to providing a simple model of 1 performance degradation, we can relate it to certain test - L data derived from a recent study. The reader should [(1 p)+p(J-NI)+-- beware, however, that elimination of branching by the J L . L Mj introduction of composition delay will not alter the turbulence situation. For a machine issuing an instruc- The last term in the denominator drops out as M be￾tion per At, if the data interaction with the stream is comes large. Then determined by th-e ith instruction and the usage of such data is required by the (i+l )th instruction, a condition perf. = - - of maximum turbulence is encountered, and this gen- LAi [I + p(J-N erates the maximum serial latency time (L - t)At which . 4 s must be inserted into the stream until the overlapped Fi.4sostefecofurlnepobiiypfr various J and L. In particular, if J= 20 and L were 20 issuing conditions issuig coditioscan c beb reestablished. eestalishd. time units, N=2 instructions and a turbulence prob￾If we treat the expected serial latency time of an in- ailityof1 p nst,uth ormanc a system woul ability of t0 percent, the performance of a system would struction L At as being equivalent to the total execution time for the average instruction (from initiation to com- timeutom 1 oabout percentof its potential. pletion), we must also consider an anticipation factor N Anmajo ause oftbul in conventioa pograms as the average number of instructions between (inclu- . A sively) the ins n sis the conditional branch; typically the output of a svly) the instruction stating a condition and the in- computer would include 10-20-percent conditioa struction w;hich tests or uses this result. Clearly, for branches Aust ud e [12] iot an N>JIL instrctions no trbulence (o delay) wil branches. A study by O'Regan [12 ] on the branch N.J/l problem was made using the foregoing type of analysis. For a typically scientific problem mix (five problems: Thus under ideal conditions one instruction each root finding; ordinary differential equation; partial L .At/J (time units) would be executed. Turbulence adds differential equations; matrix inversion; and Polish a delay: string manipulation), O'Regan attempted to eliminate deaL-NL J as many data conditional branches as possible using a delay ---) Al, for N -. tactics, however, such as multiplying loop sizes or L transformations to a Boolean test (mentioned earlier). 
For four of the problems selected the best (i.e., mini￾Given a block of M instructions w&ith a probability of mum) conditional branch probability attainable varied encountering a turbulence causing instruction p the from p =0.02 to p =0.10. The partial differential equa￾total time to execute these instructions would be tion results depend on grid size, but p = 0.001 seems at￾FL / NL\] tainlable. The best (largest) attainable N (thle set-test T = L [M((- p)] + 1 + pMgJL- )J offset) average was less than 3. No attempt wvas made in the study to evaluate degradation due to othaer than The additional "1" in the expression is due to the branch dependent turbulence

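One of the branch-removal tactics just mentioned, transformation to a Boolean test, can be illustrated with a small sketch. The example is ours and deliberately simplified: a data-conditional branch is replaced by a 0/1 test folded into arithmetic, so the selection no longer appears as a data-dependent jump in the expressed instruction stream.

```python
# Illustrative only: replace a data-conditional branch with a Boolean (0/1) selection.
# Function names are hypothetical and not from the paper.

def clamp_with_branch(values, limit):
    out = []
    for v in values:
        if v > limit:          # data-conditional branch: a turbulence source
            out.append(limit)
        else:
            out.append(v)
    return out

def clamp_branchless(values, limit):
    out = []
    for v in values:
        over = int(v > limit)                         # Boolean test evaluated as 0/1
        out.append(over * limit + (1 - over) * v)     # arithmetic select, no data-dependent jump
    return out

assert clamp_with_branch([1, 5, 9], 4) == clamp_branchless([1, 5, 9], 4) == [1, 4, 4]
```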
SIMD and Its Effectiveness

There are three basic types of SIMD processors, that is, processors characterized by a master instruction applied over a vector of related operands. These include (Fig. 5) the following types.

1) The Array Processor: One control unit and m directly connected processing elements. Each processing element is independent, i.e., has its own registers and storage, but only operates on command from the control unit.

2) The Pipelined Processor: A time-multiplexed version of the array processor, that is, a number of functional execution units, each tailored to a particular function. The units are arranged in a production-line fashion, staged to accept a pair of operands every Δt time units. The control unit issues a vector operation to memory. Memory is arranged so that it is suitable to a high-speed data transfer and produces the source operands, which are entered a pair every Δt time units into the designated function. The result stream returns to memory.

3) The Associative Processor: This is a variation of the array processor. Processing elements are not directly addressed. The processing elements of the associative processor are activated when a generalized match relation is satisfied between an input register and characteristic data contained in each of the processing elements. For those designated elements the control unit instruction is carried out. The other units remain idle.

Fig. 5. SIMD processors. (a) Array processor. (b) Pipelined processor. (c) Associative processor.

A number of difficulties can be anticipated for the SIMD organization. These would include the following problems.

1) Communications between processing elements.

2) Vector Fitting: This is the matching of size between the logical vector to be performed and the size of the physical array which will process the vector.

3) Sequential Code (Nonvector): This includes housekeeping and bookkeeping operations associated with the preparation of a vector instruction. This corresponds to the Amdahl effect. Degradation due to this effect can be masked out by overlapping the sequential instructions with the execution of vector-type instructions.

4) Degradation Due to Branching: When a branch point occurs, several of the executing elements will be in one state, and the remainder will be in another. The master controller can essentially control only one of the two states; thus the other goes idle. (A toy model of this effect is sketched below.)

5) Empirically, Minsky and Papert [29] have observed that the SIMD organization has performance proportional to log2 M (M, the number of data streams per instruction stream) rather than linear. If this is generally true, it is undoubtedly due to all of the preceding effects (and perhaps others). We will demonstrate an interpretation of it based upon branching degradation.

Communication in SIMD organizations has been widely studied [17]-[20]. Results to date, however, indicate that it is not as significant a problem as was earlier anticipated. Neuhauser [17], in an analysis of several classical SIMD programs, noted that communications time for an array-type organization rarely exceeded 40 percent of total job time and for the matrix inversion case was about 15 percent.

The fitting problem is illustrated in Fig. 6. Given a source vector of size m, performance is affected in an array processor when the M physical processing elements do not divide m [21]. However, so long as m is substantially larger than M, this effect will not contribute significant performance degradation. The pipeline processor exhibits similar behavior, as will be discussed later.

The Amdahl effect is caused by a lack of "parallelism" in the source program; this can be troublesome in any multistream organization. Several SIMD organizations use overlapping of "sequential type" control unit instructions with "vector operations" to avoid this effect, with some apparent success.
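Difficulty 4), the branching degradation, is the effect modeled in the next section. As a deliberately simplified illustration (the helper below is hypothetical and not the API of any real machine), both sides of a conditional are issued across all M data streams and a mask discards the unwanted results, so only half of the issued work is useful at one level of nesting.

```python
# Toy illustration of difficulty 4): on a data-conditional branch the single instruction
# stream must issue the "then" path and the "else" path one after the other, masking off
# the processing elements that do not take each path.

def simd_if_else(data, predicate, then_op, else_op):
    """Apply then_op where predicate holds and else_op elsewhere, SIMD-style:
    both paths are issued across all M elements and a mask selects the kept results."""
    mask = [predicate(x) for x in data]                # one test per data stream
    then_results = [then_op(x) for x in data]          # pass 1: every element runs the 'then' path
    else_results = [else_op(x) for x in data]          # pass 2: every element runs the 'else' path
    merged = [t if m else e for m, t, e in zip(mask, then_results, else_results)]
    issued = 2 * len(data)                             # element-steps issued by the controller
    useful = len(data)                                 # element-steps whose result is kept
    return merged, useful / issued

result, utilization = simd_if_else([1, -2, 3, -4], lambda x: x > 0,
                                   lambda x: 2 * x, lambda x: x + 100)
print(result, utilization)   # [2, 98, 6, 96] 0.5; half the issued work is discarded
```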

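The fitting effect of Fig. 6 can likewise be sketched numerically for the array-processor case: a logical vector of length m requires ceil(m/M) issues of each vector instruction. The pipeline curves of Fig. 6 depend on the flush/unflush timing discussed later and are not modeled in this sketch.

```python
import math

# Sketch of the vector-fitting effect plotted in Fig. 6, array-processor case only:
# covering a logical vector of length m with M physical elements takes ceil(m / M)
# issues of the vector instruction, so utilization dips just past every multiple of M.

def passes(m, M):
    return math.ceil(m / M)

def utilization(m, M):
    """Fraction of issued element-operations that do useful work."""
    return m / (M * passes(m, M))

M = 64
for m in (32, 64, 65, 128, 129, 1000):
    print(f"m = {m:5d}   passes = {passes(m, M):3d}   utilization = {utilization(m, M):.2f}")
# For m substantially larger than M the utilization approaches 1, matching the remark
# above that the fitting effect is then insignificant.
```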
Fig. 6. Fitting problem. (Size of logical vector, in units of M, the physical array size; curves for flushed pipeline, array processor, and unflushed pipeline.)

Fig. 7. SIMD branching.

Multiple-execution organizations such as SIMD have potential difficulty in the use of the execution resources. The reason for this is that all units must process the same instruction at a particular unit of time. When nested decisions are considered (Fig. 7), difficulty arises because the execution units are not available to work on any other task.

Consider an SIMD system with M data streams and an average instruction execution time (per data stream) of L·Δt time units. Now a single instruction will act uniformly on M pairs of operands. With respect to our reference instruction I (which operates on only a pair of operands), the SIMD instruction, designated I*, has M times the effect. Ignoring possible overlap, a single I* instruction will be executed in at least time L·Δt, while the conventional unoverlapped SISD system would execute it in M·L·Δt.

To achieve close to the 1/M bound, the problem must be partitionable into M identical code segments. When a conditional branch is encountered, if at least one of the M data differs in its condition, the alternate path instructions must be fully executed. We now make a simplifying assumption: the number of source instructions is the same for the primary branch path and the alternate. Since the number of data items required to be processed by a branch stream is less than M, only the fraction available will be executed initially and the task will be reexecuted for the remainder. Thus a branch identifying two separate tasks, each of length N, will take twice the amount of time as their unconditional expectation. Thus the time to execute a block of N* source instructions (each operating on M data streams) with equal probability on primary and alternate branch paths is

    T = L·Σi Ni,0 + L·Σi Ni,1·2 + L·Σi Ni,2·4 + ··· + L·Σi Ni,j·2^j + ···

    T = L·Σj Σi Ni,j·2^j

where j is the level of nesting of a branch and Ni,j is the number of source instructions in the primary (or alternate) branch path for the ith branch at level j. As the level of nesting increases, fewer resources are available to execute the M data streams, and the time required increases; the situation would be similar if we had used other branch probabilities (p ≠ 0.5). Thus the performance is

    perf. = N*/T = N* / ( L·Σj Σi Ni,j·2^j ).

The factor Σi Ni,j/N* is the probability of encountering a source instruction at level j:

    Pj = Σi Ni,j / N* ≤ 1.

This analysis assumed that the overhead for reassigning execution elements to alternate-path tasks is prohibitive. This is usually true when the task size Ni,j is small or when the swapping overhead is large (an array processor, each of whose data streams has a private data storage). Based on empirical evaluation of program performance in a general scientific environment (i.e., not the well-known "parallel type" programs such as matrix inversion, etc.), it has been suggested [29] that the actual performance of the SIMD processor is proportional to the log2 of the number of slave processing elements rather than the hoped-for linear relation. This has been called Minsky's conjecture:

    perf.SIMD ≈ log2 M.

While this degradation is undoubtedly due to many causes, it is interesting to interpret it as a branching degradation.
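A short numeric sketch of the block-time relation T = L·Σj Σi Ni,j·2^j and of perf. = N*/T follows; the per-level instruction counts are invented purely for illustration, and L is taken as one time unit.

```python
# Numeric sketch of the nested-branch time model above.  N_by_level[j] lists the
# source-instruction counts Ni,j for the branch paths at nesting level j; the counts
# themselves are invented for illustration only.

def simd_block_time(N_by_level, L=1.0):
    """T = L * sum over j, i of Ni,j * 2**j (time to run the block on the full array)."""
    return L * sum(n * (2 ** j) for j, level in enumerate(N_by_level) for n in level)

def simd_performance(N_by_level, L=1.0):
    """perf. = N*/T, where N* is the total number of source instructions in the block."""
    n_star = sum(sum(level) for level in N_by_level)
    return n_star / simd_block_time(N_by_level, L)

# Level 0: straight-line code; level 1: two paths of one branch; level 2: four paths.
N = [[40], [10, 10], [5, 5, 5, 5]]
print(simd_block_time(N), simd_performance(N))   # 160.0 0.5
# Pushing the same 80 source instructions into deeper nesting raises T by the 2**j
# factor, since the array must be re-run for the data streams on each alternate path.
```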

Assume that the probability of being at one of the lower levels of nesting is uniform; that is, it is equally likely to be at level 1, 2, ···, [log2 M]. Since beyond this level of nesting no further degradation occurs, assume P1 = P2 = ··· = Pj = ··· = P[log2 M], with P[log2 M] = Σ(k ≥ [log2 M]) Pk. Now

    Pj = 1/[log2 M],   j = 0, 1, ···, [log2 M] - 1,

and the earlier performance relation can be restated:

    perf. = 1 / ( L·Σj Pj·2^j ).

This was derived for an absolute model, i.e., number of SIMD instructions per unit time. If we wish to discuss performance relative to an SISD unoverlapped processor with equivalent latency characteristics, then

    perf. relative = M / ( Σj Pj·2^j ).

Thus the SIMD organization is M times faster if we have no degradation. Now

    perf. relative = M / ( Σj 2^j/[log2 M] )

or, ignoring the effect of nonintegral values of log2 M,

    perf. relative = M·[log2 M] / (2M - 1)

and, for M large,

    perf. relative ≈ (log2 M)/2.

Note that if we had made the less restrictive assumption that Pj = 2^-j, then

    perf. relative = M / [log2 M].

Thus we have two plausible performance relations based on alternate nesting assumptions. Of course this degradation is not due to idle resources alone; in fact, programs can be restructured to keep processing elements busy. The important open question is whether these restructured programs truly enhance the performance of the program as distinct from just keeping the resource busy. Empirical evidence suggests that the most efficient single-stream program organization for this larger class of problems is presently substantially more efficient than an equivalent program organization suited to the SIMD processors. Undoubtedly this degradation is a combination of effects; however, branching seems to be an important contributor, or rather, the ability to branch efficiently in a simple SISD organization substantially enhances its performance.

Certain SIMD configurations, e.g., pipelined processors, which use a common data storage may appear to suffer less from the nested branch degradation, but actually the pipelined processor should exhibit an equivalent behavior. In a system with source operand vectors A = {a0, a1, ···, ai, ···, an} and B = {b0, b1, ···, bi, ···, bn}, a sink vector C = {c0, c1, ···, ci, ···, cn} is the resultant. Several members of C will satisfy a certain criterion for a type of future processing, and others will not. Elements failing this criterion are tagged and not processed further, but the vector C is usually left unaltered. If one rearranges C, filters the dissenting elements, and compresses the vector, then an overhead akin to task swapping in the array processor is introduced. Notice that the automatic hardware generation of the compressed vector is not practical at the high data rates required by the pipeline.

If the pipelined processor is logically equivalent to other forms of SIMD, how does one interpret the number of data streams? This question is related to the vector fitting problem. Fig. 6 illustrates the equivalence of an array processor to the two main categories of pipeline processors.

1) Flushed: The control unit does not issue the next vector instruction until the last elements of the present vector operation have completed their functional processing (gone through the last stage of the functional pipeline).

2) Unflushed: The next vector instruction is issued as soon as the last elements of the present vector operation have been initiated (entered the first stage of the pipeline).

Assuming that the minimum time for the control unit to prepare a vector instruction, τc, is less than the average functional unit latency tL, for the flushed case the equivalent number of data streams per instruction stream m is

    m = tL/Δt   (flushed pipeline)

where Δt is the average stage time in the pipeline. With the unflushed case, again assuming tL > τc, the equivalent m is

    m = τc/Δt   (unflushed pipeline).

Notice that when τc = Δt, m = 1, and we no longer have SIMD; in fact, we have returned to the overlapped SISD.
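The two relative-performance relations can be compared directly. The sketch below evaluates both, treating log2 M as exact (i.e., ignoring the nonintegral-value caveat, as the text does).

```python
import math

# The two relative-performance relations derived above, evaluated side by side.

def perf_uniform_nesting(M):
    """Uniform nesting assumption: perf. relative = M * log2(M) / (2*M - 1)."""
    return M * math.log2(M) / (2 * M - 1)

def perf_geometric_nesting(M):
    """Pj = 2**-j assumption: perf. relative = M / log2(M)."""
    return M / math.log2(M)

for M in (4, 16, 64, 256, 1024):
    print(f"M = {M:5d}   uniform = {perf_uniform_nesting(M):7.2f}   "
          f"geometric = {perf_geometric_nesting(M):8.2f}   log2(M)/2 = {math.log2(M) / 2:5.1f}")
# The uniform-nesting assumption reproduces the Minsky-conjecture behavior (about
# log2(M)/2); the weaker geometric assumption is far more optimistic (about M/log2(M)).
```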

MIMD and Its Effectiveness

The multiple-instruction stream organizations (the "multiprocessors") include at least two types.

1) True Multiprocessors: Configurations in which several physically complete and independent SI processors share storage at some level for the cooperative execution of a multitask program.

2) Shared Resource Multiprocessors: As the name implies, skeleton processors are arranged to share the system resources. These arrangements will be discussed later.

Fig. 8. MIMD lockout. (Normalized performance versus number of processors for L/E = 0.025, 0.050, and 0.200.)

Traditional MIMD organizational problems include: 1) communications/composition overhead; 2) cost increases linearly with additional processors, while performance increases at a lesser rate (due to interference); and 3) providing a method for dynamic reconfiguration of resources to match a changing program environment (critical tasks); this last problem is related to 1).

Shared resource organization may provide limited answers to these problems, as we will discuss later.

Communications and composition is a primary source of degradation in MI systems. When several instruction streams are processing their respective data streams on a common problem set, passing of data points is inevitable. Even if there is naturally a favorable precedence relationship among parallel instruction streams insofar as use of the data is concerned, composition delays may ensue, especially if the task execution time is variable. The time one instruction stream spends waiting for data to be passed to it from another is a macroscopic form of the strictly sequential problem of one instruction waiting for a condition to be established by its immediate predecessor.

Even if the precedence problem (which is quite program dependent) is ignored, the "lockout" problem associated with multiple-instruction streams sharing common data may cause serious degradation. Note that multiple-instruction stream programs without data sharing are certainly as sterile as a single-instruction stream program without branches.

Madnick [11] provides an interesting model of software lockout in the MIMD environment. Assume that an individual processor (instruction stream control unit) has an expected task execution time (without conflicts) of E time units. Suppose a processor is "locked out" from accessing needed data for L time units. This locking out may be due to interstream communications (or accessing) problems (especially if the shared storage is an I/O device). Then the lockout time for the jth processor (or instruction stream) is

    Lj = Σi pij·Tij

where Tij is the communications time discussed earlier and pij is the probability of task j accessing data from data stream i. Note that the lockout here may be due to the broader communications problem of the jth processor requesting a logical data stream i. This includes the physical data stream accessing problem as well as additional sources of lockout due to control, allocation, etc.

In any event, Madnick [11] used a Markov model to derive the following relationship:

    E(idle) = [ Σ(i=2..n) (i - 1) / ( (n - i)!·(E/L)^i ) ] / [ Σ(i=0..n) 1 / ( (n - i)!·(E/L)^i ) ]

where E(idle) is the expected number of locked-out processors and n is the total number of processors. If a single processor has unit performance, then for n processors

    perf. = n - E(idle)

and normalized performance (max = 1) is given by

    perf.N = ( n - E(idle) ) / n.

Fig. 8 is an evaluation of the normalized performance as the number of processors (instruction stream-data stream pairs) is increased for various interaction activity ratios L/E.

Regis [15] has recently completed a study substantially extending the simple Markovian model previously described (homogeneous resources, identical processors, etc.) by developing a queuing model that allows for vectors of requests to a vector of service resources. Lehman [30] presents some interesting simulation results related to the communications interference problem.

Since shared resource MIMD structures provide some promising (though perhaps limited) answers to the MIMD problems, we will outline these arrangements. The execution resources of an SISD overlapped computer (adders, multiplier, etc.; most of the system exclusive of registers and minimal control) are rarely efficiently used, as discussed in the next section.

In order to effectively use this execution potential
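The lockout relationship is straightforward to evaluate numerically. The sketch below follows the ratio-of-sums form given above and computes the normalized performance for the three L/E ratios plotted in Fig. 8; the particular processor counts are chosen only for illustration.

```python
from math import factorial

# Numeric sketch of the software-lockout relationship quoted above.
# E = expected task execution time between shared-data accesses, L = lockout time,
# n = number of processors; term(i) is 1 / ((n - i)! * (E/L)^i) = (L/E)^i / (n - i)!.

def expected_idle(n, L_over_E):
    """Expected number of locked-out processors, E(idle)."""
    def term(i):
        return (L_over_E ** i) / factorial(n - i)
    numerator = sum((i - 1) * term(i) for i in range(2, n + 1))
    denominator = sum(term(i) for i in range(0, n + 1))
    return numerator / denominator

def normalized_performance(n, L_over_E):
    """perf.N = (n - E(idle)) / n, the quantity plotted in Fig. 8."""
    return (n - expected_idle(n, L_over_E)) / n

for ratio in (0.025, 0.050, 0.200):          # the L/E curves of Fig. 8
    row = ", ".join(f"n={n}: {normalized_performance(n, ratio):.2f}" for n in (2, 8, 16, 32))
    print(f"L/E = {ratio:5.3f} -> {row}")
# Small L/E ratios keep normalized performance near 1 as processors are added, while at
# L/E = 0.2 additional processors are quickly absorbed by lockout, the saturation
# behavior that Fig. 8 illustrates.
```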

