
GPU Parallel Programming (GPU并行编程) course teaching resources (references): Some Computer Organizations and Their Effectiveness


956 IEEE TRANSACTIONS ON COMPUTERS, SEPTEMBER 1972

degradation. Assume that the probability of being at one of the lower levels of nesting is uniform. That is, it is equally likely to be at level $1, 2, \ldots, [\log_2 M]$. Since beyond this level of nesting no further degradation occurs, assume $P_1 = P_2 = \cdots = P_j = P_{[\log_2 M]-1}$ and $P_l = \sum_{k \ge [\log_2 M]} P_k$. Now

$$P_j = \frac{1}{[\log_2 M]}, \qquad j = 0, \ldots, [\log_2 M] - 1$$

and the earlier performance relation can be restated:

$$\text{perf.} = \frac{1}{L} \cdot \frac{1}{\sum_j P_j 2^j}.$$

This was derived for an absolute model, i.e., number of SIMD instructions per unit time. If we wish to discuss performance relative to an SISD unoverlapped processor with equivalent latency characteristics, then

$$\text{perf. relative} = \frac{M}{\sum_j P_j 2^j}.$$

Thus the SIMD organization is $M$ times faster if we have no degradation. Now

$$\text{perf. relative} = \frac{M}{\sum_j \dfrac{2^j}{[\log_2 M]}}$$

or, ignoring the effect of nonintegral values of $\log_2 M$,

$$\text{perf. relative} = \frac{M}{\dfrac{2M - 1}{\log_2 M}};$$

for $M$ large:

$$\text{perf. relative} \approx \frac{\log_2 M}{2}.$$

Note that if we had made the less restrictive assumption that

$$P_j = 2^{-j}$$

then

$$\text{perf. relative} \approx \frac{M}{\log_2 M}.$$

Thus we have two plausible performance relations based on alternate nesting assumptions. Of course this degradation is not due to idle resources alone; in fact, programs can be restructured to keep processing elements busy. The important open question is whether these restructured programs truly enhance the performance of the program as distinct from just keeping the resource busy. Empirical evidence suggests that the most efficient single-stream program organization for this larger class of problems is presently substantially more efficient than an equivalent program organization suited to the SIMD processors. Undoubtedly this degradation is a combination of effects; however, branching seems to be an important contributor, or rather the ability to branch efficiently in a simple SISD organization substantially enhances its performance.

Certain SIMD configurations, e.g., pipelined processors, which use a common data storage may appear to suffer less from the nested branch degradation, but actually the pipelined processor should exhibit an equivalent behavior. In a system with source operand vectors $A = \{a_0, a_1, \ldots, a_i, \ldots, a_n\}$ and $B = \{b_0, b_1, \ldots, b_i, \ldots, b_n\}$, a sink vector $C = \{c_0, c_1, \ldots, c_i, \ldots, c_n\}$ is the resultant. Several members of $C$ will satisfy a certain criterion for a type of future processing, and others will not. Elements failing this criterion are tagged and not processed further, but the vector $C$ is usually left unaltered. If one rearranges $C$, filters the dissenting elements, and compresses the vector, then an overhead akin to task swapping the array processor is introduced. Notice that the automatic hardware generation of the compressed vector is not practical at the high data rates required by the pipeline.

If the pipelined processor is logically equivalent to other forms of SIMD, how does one interpret the number of data streams? This question is related to the vector fitting problem. Fig. 6 illustrates the equivalence of an array processor to the two main categories of pipeline processors.

1) Flushed: The control unit does not issue the next vector instruction until the last elements of the present vector operation have completed their functional processing (gone through the last stage of the functional pipeline).

2) Unflushed: The next vector instruction is issued as soon as the last elements of the present vector operation have been initiated (entered the first stage of the pipeline).

Assuming that the minimum time for the control unit to prepare a vector instruction, $\tau_c$, is less than the average functional unit latency, $\bar{t}_L$, for the flushed case the equivalent number of data streams per instruction stream $m$ is

$$m = \frac{\bar{t}_L}{\Delta t} \qquad \text{flushed pipeline}$$

where $\Delta t$ is the average stage time in the pipeline. With the unflushed case, again assuming that $\bar{t}_L > \tau_c$, the equivalent $m$ is

$$m = \frac{\tau_c}{\Delta t} \qquad \text{unflushed pipeline}.$$

Notice that when $\tau_c = \Delta t$, $m = 1$, and we no longer have
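The two nested-branch performance relations derived on this page can be compared numerically with a minimal Python sketch. The function names are hypothetical and not from the paper. The sketch assumes that M is a power of two, so that log2 M is an integer. It also assumes that the uniform case weights levels j = 0 through log2 M by 1 / log2 M, the range that reproduces the 2M - 1 term quoted in the text.

```python
import math


def relative_perf(M, probs):
    """Evaluate M / sum_j P_j * 2**j, where probs[j] is the weight P_j."""
    return M / sum(p * 2 ** j for j, p in enumerate(probs))


def uniform_relative_perf(M):
    """Uniform nesting assumption, closed form M / ((2M - 1) / log2 M)."""
    return M * math.log2(M) / (2 * M - 1)


def uniform_large_m(M):
    """Large-M approximation of the uniform case: log2(M) / 2."""
    return math.log2(M) / 2


def geometric_relative_perf(M):
    """Alternative assumption P_j = 2**-j, giving M / log2 M."""
    return M / math.log2(M)


if __name__ == "__main__":
    # Compare the two assumptions for a few (assumed) array sizes.
    for M in (16, 64, 1024):
        print(M,
              round(uniform_relative_perf(M), 3),
              uniform_large_m(M),
              round(geometric_relative_perf(M), 3))
```

With no degradation (all weight on level 0), `relative_perf(M, [1.0])` returns M, matching the statement that the SIMD organization is M times faster in that case.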
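The flushed and unflushed relations for the equivalent number of data streams m can be evaluated with a short Python sketch. The function and parameter names are hypothetical, and the numeric values in the usage lines are illustrative assumptions, not figures from the paper. The check that the control-unit time does not exceed the average latency mirrors the assumption stated in the text.

```python
def equivalent_streams(latency, stage_time, control_time, flushed):
    """Equivalent data streams per instruction stream for a pipeline.

    latency      -- average functional unit latency
    stage_time   -- average stage time in the pipeline (delta t)
    control_time -- minimum control-unit time to prepare a vector instruction
    flushed      -- True for a flushed pipeline, False for an unflushed one
    """
    if stage_time <= 0:
        raise ValueError("stage_time must be positive")
    if control_time > latency:
        # The relations in the text assume control_time < latency.
        raise ValueError("control_time must not exceed latency")
    numerator = latency if flushed else control_time
    return numerator / stage_time


if __name__ == "__main__":
    # Illustrative values only: 8 time units of latency, 1-unit stages.
    print(equivalent_streams(8.0, 1.0, 2.0, flushed=True))
    print(equivalent_streams(8.0, 1.0, 2.0, flushed=False))
    # When the control time equals the stage time, the unflushed m is 1.
    print(equivalent_streams(8.0, 1.0, 1.0, flushed=False))
```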

