Parhi, K.K., Chassaing, R., and Bitler, B. "VLSI for Signal Processing." The Electrical Engineering Handbook, Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
18 VLSI for Signal Processing

Keshab K. Parhi, University of Minnesota
Rulph Chassaing, Roger Williams University
Bill Bitler, InfiMed

18.1 Special Architectures
Pipelining • Parallel Processing • Retiming • Unfolding • Folding Transformation • Look-Ahead Technique • Associativity Transformation • Distributivity • Arithmetic Processor Architectures • Computer-Aided Design • Future VLSI DSP Systems

18.2 Signal Processing Chips and Applications
DSP Processors • Fixed-Point TMS320C25-Based Development System • Implementation of a Finite Impulse Response Filter with the TMS320C25 • Floating-Point TMS320C30-Based Development System • EVM Tools • Implementation of a Finite Impulse Response Filter with the TMS320C30 • FIR and IIR Implementation Using C and Assembly Code • Real-Time Applications • Conclusions and Future Directions

18.1 Special Architectures

Keshab K. Parhi

Digital signal processing (DSP) is used in numerous applications, including telephony, mobile radio, satellite communications, speech processing, video and image processing, biomedical applications, radar, and sonar. Real-time implementations of DSP systems require the design of hardware that can match the application sample rate to the hardware processing rate (which is related to the clock rate and the implementation style). Thus, real-time does not always mean high speed. Real-time architectures process samples as they are received from the signal source, as opposed to storing them in buffers for later processing as done in batch processing. Furthermore, real-time architectures operate on an effectively infinite time series (the number of samples from the signal source is so large that it can be considered infinite). While speech and sonar applications require lower sample rates, radar and video image processing applications require much higher sample rates. The sample rate alone cannot be used to choose the architecture; the algorithm complexity is also an important consideration. For example, a very complex and computationally intensive algorithm for a low-sample-rate application and a computationally simple algorithm for a high-sample-rate application may require similar hardware speed and complexity. This range of algorithms and applications motivates us to study a wide variety of architecture styles.

Using very large scale integration (VLSI) technology, DSP algorithms can be prototyped in many ways. These options include (1) single or multiprocessor programmable digital signal processors, (2) the use of a core programmable digital signal processor with customized interface logic, (3) semicustom gate-array implementations, and (4) full-custom dedicated hardware implementations. DSP algorithms are implemented in programmable processors by translating the algorithm to the processor assembly code, which can require an extensive amount of time. On the other hand, high-level compilers for DSP can be used to generate the assembly code. Although this is currently feasible, the code generated by a compiler is not as efficient as hand-optimized code. Design of DSP compilers that generate efficient code is still an active research topic. In the case of
dedicated designs, the challenge lies in a thorough understanding of the DSP algorithms and the theory of architectures. For example, just minimizing the number of multipliers in an algorithm may not lead to a better dedicated design; the area saved by reducing the number of multipliers may be offset by the increase in control, routing, and placement costs.

Off-the-shelf programmable digital signal processors can lead to faster prototyping. These prototyped systems can prove very effective in fast simulation of computation-intensive algorithms (such as those encountered in speech recognition, video compression, and seismic signal processing) or in benchmarking and standardization. After standards are determined, it is more useful to implement the algorithms using dedicated circuits.

Design of dedicated circuits is not a simple task. Dedicated circuits provide limited or no programming flexibility. They require less silicon area and consume less power. However, low production volume, high design cost, and long turnaround time are some of the difficulties associated with the design of dedicated systems. Another difficulty is the availability of appropriate computer-aided design (CAD) tools for DSP systems. As time progresses, however, the architectural design techniques will be better understood and can be incorporated into CAD tools, thus making the design of dedicated circuits easier. Hierarchical CAD tools can integrate the design at various levels in an automatic and efficient manner. Implementation of standards for signal and image processing using dedicated circuits will lead to higher-volume production. As time progresses, dedicated designs will be more acceptable to customers of DSP.

Successful design of dedicated circuits requires careful algorithm and architecture considerations. For example, for a filtering application, different equivalent realizations may possess different levels of concurrency. Thus, some of these realizations may be suitable for a particular application while other realizations may not be able to meet the sample rate requirements of the application. The lower-level architecture may be implemented in a word-serial or word-parallel manner. The arithmetic functional units may be implemented in a bit-serial, digit-serial, or bit-parallel manner. The synthesized architecture may be implemented with a dedicated data path or a shared data path. The architecture may be systolic or nonsystolic.

Algorithm transformations play an important role in the design of dedicated architectures [Parhi, 1989], because the transformed algorithms can be made to operate with better performance (where performance may be measured in terms of speed, area, or power). Examples of these transformations include pipelining, parallel processing, retiming, unfolding, folding, look-ahead, associativity, and distributivity. These transformations and other architectural concepts are described in detail in subsequent sections.

Pipelining

Pipelining can increase the amount of concurrency (the number of activities performed simultaneously) in an algorithm. Pipelining is accomplished by placing latches at appropriate intermediate points in a data flow graph that describes the algorithm; each latch refers to a storage unit, buffer, or register. The latches can be placed at feed-forward cutsets of the data flow graph. In synchronous hardware implementations, pipelining can increase the clock rate of the system (and therefore the sample rate).
The drawbacks associated with pipelining are the increase in system latency and the increase in the number of registers. To illustrate the speed increase using pipelining, consider the second-order three-tap finite impulse response (FIR) filter shown in Fig. 18.1(a). The signal x(n) in this system can be sampled at a rate limited by the throughput of one multiplication and two additions. For simplicity, if we assume the multiplication time to be two times the addition time (Tadd), the effective sample or clock rate of this system is 1/4Tadd. By placing latches as shown in Fig. 18.1(b) at the cutset shown in the dashed line, the sample rate can be improved to the rate of one multiplication or two additions. While pipelining can be easily applied to all algorithms with no feedback loops by the appropriate placement of latches, it cannot easily be applied to algorithms with feedback loops. This is because the cutsets in feedback algorithms contain feed-forward and feedback data flow and cannot be considered as feed-forward cutsets. Pipelining can also be used to improve the performance in software programmable multiprocessor systems. Most software programmable DSP processors are programmed using assembly code. The assembly code is generated by high-level compilers that perform scheduling. Schedulers typically use the acyclic precedence graph to construct schedules. The removal of all edges in the signal (or data) flow graph containing delay
While pipelining can be easily applied to any algorithm with no feedback loops by the appropriate placement of latches, it cannot easily be applied to algorithms with feedback loops, because cutsets in such algorithms contain both feed-forward and feedback data flow and cannot be considered feed-forward cutsets.

Pipelining can also be used to improve performance in software programmable multiprocessor systems. Most software programmable DSP processors are programmed using assembly code. The assembly code is generated by high-level compilers that perform scheduling. Schedulers typically use the acyclic precedence graph to construct schedules; removing all edges containing delay elements from the signal (or data) flow graph converts it to an acyclic precedence graph. By placing latches to pipeline a data flow graph, we can alter the acyclic precedence graph. In particular, the critical path of the acyclic precedence graph can be reduced, and the new precedence graph can be used to construct schedules with lower iteration periods (although this may often require an increase in the number of processors).

Pipelining of algorithms can increase the sample rate of the system. Sometimes, for a constant sample rate, pipelining can also reduce the power consumed by the system, because the data paths in the pipelined system can be charged or discharged with a lower supply voltage. Since the capacitance remains almost constant, the power can be reduced. Achieving low power can be important in many battery-powered applications [Chandrakasan et al., 1992].

Parallel Processing

Parallel processing is related to pipelining but requires replication of hardware units. Pipelining exploits concurrency by breaking a large task into multiple smaller tasks and by separating these smaller tasks with storage units. Parallelism, on the other hand, exploits concurrency by performing multiple larger tasks simultaneously in separate hardware units.

To illustrate the speed increase due to parallelism, consider the parallel implementation of the second-order three-tap FIR filter of Fig. 18.1(a) shown in Fig. 18.2. In the architecture of Fig. 18.2, two input samples are processed and two output samples are generated in each clock cycle period of four addition times. Because each clock cycle processes two samples, the effective sample rate is 1/(2T_add), which is the same as that of Fig. 18.1(b). The parallel architecture achieves this speed increase with significant hardware overhead: the entire data flow graph must be replicated as the amount of parallelism increases. Thus, it is usually more desirable to use pipelining than parallelism. However, parallelism may be useful if pipelining alone cannot meet the speed demand of the application or if technology constraints (such as limitations on the clock rate by the I/O technology) limit the use of pipelining. In obvious ways, pipelining and parallelism can also be combined.

Parallelism, like pipelining, can also lead to power reduction, but with significant overhead in hardware requirements. Achieving pipelining and parallelism can be difficult for systems with feedback loops; concurrency may be created in these systems by using the look-ahead transformation.

FIGURE 18.2 Twofold parallel realization of the three-tap filter of Fig. 18.1(a).
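A behavioral sketch of the two-parallel structure of Fig. 18.2 follows, again with arbitrary illustrative coefficients and input: two samples are consumed and two produced per pass of the loop (one "clock cycle"), and the block realization reproduces the serial filter output exactly.

```python
def fir_two_parallel(x, a, b, c):
    """Block realization: consume x(2k), x(2k+1); emit y(2k), y(2k+1)."""
    assert len(x) % 2 == 0             # two samples per clock cycle
    out = []
    xm1 = xm2 = 0.0                    # x(2k-1) and x(2k-2) registers
    for k in range(len(x) // 2):
        x_even, x_odd = x[2 * k], x[2 * k + 1]
        out.append(a * x_even + b * xm1 + c * xm2)    # y(2k)
        out.append(a * x_odd + b * x_even + c * xm1)  # y(2k+1)
        xm2, xm1 = x_even, x_odd       # update the held samples
    return out

x = [1.0, 2.0, -1.0, 3.0, 0.5, -2.0]   # arbitrary test input
a, b, c = 0.5, 0.25, 0.125             # arbitrary tap values
ref = [a * x[n] + b * (x[n - 1] if n >= 1 else 0.0)
             + c * (x[n - 2] if n >= 2 else 0.0) for n in range(len(x))]
assert fir_two_parallel(x, a, b, c) == ref
```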
Retiming

Retiming is similar to pipelining, yet different in some ways [Leiserson et al., 1983]. Retiming is the process of moving the delays around in the data flow graph; removal of one delay from each incoming edge of a node and insertion of one delay on each outgoing edge of the same node is the simplest example of retiming. Unlike pipelining, retiming does not increase the latency of the system. However, retiming alters the number of delay elements in the system. Retiming can reduce the critical path of the data flow graph. As a result, it can reduce the clock period in hardware implementations, or the critical path of the acyclic precedence graph and hence the iteration period in programmable software system implementations.

The single-host formulation of the retiming transformation preserves the latency of the algorithm. The retiming formulation with no constraints on latency (i.e., with separate input and output hosts) can also achieve pipelining with no retiming, or pipelining with retiming. Pipelining with retiming is the most desirable transformation in DSP architecture design. Pipelining with retiming can be interpreted as retiming of the original algorithm with a large number of delays placed at the input edges. Thus, we can increase the system latency arbitrarily and remove the appropriate number of delays from the inputs after the transformation.

The retiming formulation assigns a retiming variable r(·) to each node in the data flow graph. If i(U→V) is the number of delays associated with the edge U→V in the original data flow graph, and r(V) and r(U), respectively, represent the retiming variable values of the nodes V and U, then the number of delays associated with the edge U→V in the retimed data flow graph is given by

    i_r(U→V) = i(U→V) + r(V) - r(U)

For the retimed data flow graph to be realizable, i_r(U→V) ≥ 0 must be satisfied for every edge. The retiming transformation formulates the problem by calculating path lengths and by imposing constraints on certain path lengths. These constraints are solved as a shortest-path problem.

To illustrate the usefulness of retiming, consider the data flow graph of a two-stage pipelined lattice digital filter shown in Fig. 18.3(a) and its equivalent pipelined-retimed data flow graph shown in Fig. 18.3(b). If the multiply time is two units and the add time is one unit, the architecture in Fig. 18.3(a) can be clocked with period 10 units whereas the architecture in Fig. 18.3(b) can be clocked with period 2 units.

FIGURE 18.3 (a) A two-stage pipelinable time-invariant lattice digital filter. If multiplication and addition operations require 2 and 1 time units, respectively, then this data flow graph can achieve a sampling period of 10 time units (which corresponds to the critical path M1 → A2 → M2 → A1 → M3 → A3 → A4). (b) The pipelined/retimed lattice digital filter can achieve a sampling period of 2 time units.
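The retiming rule can be exercised directly. In the sketch below, the toy graph, node computation times, and retiming values are all invented for illustration; the code checks feasibility (every retimed edge delay nonnegative) and shows the critical path, i.e., the longest chain of computation along zero-delay edges, shrinking from 4 to 2 time units.

```python
def retime(edges, r):
    """Apply i_r(U->V) = i(U->V) + r(V) - r(U) to every edge."""
    retimed = [(u, v, w + r[v] - r[u]) for u, v, w in edges]
    assert all(w >= 0 for _, _, w in retimed), "infeasible retiming"
    return retimed

def critical_path(edges, times):
    """Longest chain of computation times along zero-delay edges."""
    succ = {}
    for u, v, w in edges:
        if w == 0:
            succ.setdefault(u, []).append(v)
    def longest(u, seen):
        return times[u] + max((longest(v, seen | {v})
                               for v in succ.get(u, []) if v not in seen),
                              default=0)
    return max(longest(u, {u}) for u in times)

# Invented loop A -> B -> C -> A with both delays on the edge C -> A.
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2)]
times = {"A": 2, "B": 1, "C": 1}       # node computation times
r = {"A": 0, "B": 1, "C": 1}           # hypothetical retiming variables

print(critical_path(edges, times))             # 4  (chain A -> B -> C)
print(retime(edges, r))                        # delays become 1, 0, 1
print(critical_path(retime(edges, r), times))  # 2  (clock period halved)
# Note: the total delay count around the loop (2) is preserved,
# as retiming guarantees.
```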
Unfolding

The unfolding transformation is similar to loop unrolling. In J-unfolding, each node is replaced by J nodes and each edge is replaced by J edges. The J-unfolded data flow graph executes J iterations of the original algorithm [Parhi, 1991]. The unfolding transformation can unravel hidden concurrency in a data flow program. The achievable iteration period for a J-unfolded data flow graph is 1/J times the critical path length of the unfolded data flow graph. By exploiting interiteration concurrency, unfolding can lead to a lower iteration period in the context of a software programmable multiprocessor implementation.

The unfolding transformation can also be applied in the context of hardware design. If we apply an unfolding transformation to a (word-serial) nonrecursive algorithm, the resulting data flow graph represents a word-parallel (or simply parallel) algorithm that processes multiple samples or words in parallel every clock cycle. If we apply 2-unfolding to the 3-tap FIR filter in Fig. 18.1(a), we obtain the data flow graph of Fig. 18.2.
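A minimal sketch of the standard unfolding construction follows: an edge U→V with w delays maps to the J edges U_j → V_{(j+w) mod J}, each carrying floor((j+w)/J) delays. The one-node example graph is made up for illustration.

```python
def unfold(edges, J):
    """J-unfold: U -> V with w delays becomes U_j -> V_{(j+w) % J}
    with (j + w) // J delays, for j = 0 .. J-1."""
    return [(f"{u}{j}", f"{v}{(j + w) % J}", (j + w) // J)
            for u, v, w in edges for j in range(J)]

# First-order recursion x(n) = a*x(n-1) + u(n) viewed as a single
# multiply-add node M feeding itself through one delay:
print(unfold([("M", "M", 1)], 2))
# [('M0', 'M1', 0), ('M1', 'M0', 1)]
# Two iterations now execute per clock cycle; the loop still holds one
# delay in total, so unfolding alone does not break the feedback bound.
```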
Because the unfolding algorithm is based on a graph-theoretic approach, it can also be applied at the bit level. Thus, unfolding a bit-serial data flow program by a factor of J leads to a digit-serial program with digit size J, where the digit size represents the number of bits processed per clock cycle. The digit-serial architecture is clocked at the same rate as the bit-serial one (assuming that the clock rate is limited by the communication I/O bound well before reaching the computation bound of the bit-serial program). Because the digit-serial program processes J bits per clock cycle, the effective bit rate of the digit-serial architecture is J times higher. A simple example of this unfolding is illustrated in Fig. 18.4, where the bit-serial adder in Fig. 18.4(a) is unfolded by a factor of 2 to obtain the digit-serial adder in Fig. 18.4(b) with digit size 2 for a word length of 4. In obvious ways, the unfolding transformation can be applied at the word level and the bit level simultaneously to generate word-parallel digit-serial architectures. Such architectures process multiple words per clock cycle and process a digit of each word (not the entire word).

FIGURE 18.4 (a) A least-significant-bit-first bit-serial adder for a word length of 4; (b) a digit-serial adder with digit size 2 obtained by two-unfolding of the bit-serial adder. Bit position 0 stands for the least significant bit.

Folding Transformation

The folding transformation is the reverse of the unfolding transformation. While the unfolding transformation is simpler, the folding transformation is more difficult [Parhi et al., 1992].

The folding transformation can be applied to fold a bit-parallel architecture to a digit-serial or bit-serial one, or to fold a digit-serial architecture to a bit-serial one. It can also be applied to fold an algorithm data flow graph to a hardware architecture data flow graph for a specified folding set. The folding set indicates the processor in which, and the time partition at which, a task is executed. A specified folding set may be infeasible, and this needs to be detected first. The folding transformation performs a preprocessing step to detect feasibility and, in the feasible case, transforms the algorithm data flow graph to an equivalent pipelined/retimed data flow graph that can be folded. For the special case of regular data flow graphs and linear space-time mappings, the folding transformation reduces to systolic array design.

In the folded architecture, each edge in the algorithm data flow graph is mapped to a communicating edge in the hardware architecture data flow graph. Consider an edge U→V in the algorithm data flow graph with an associated number of delays i(U→V). Let the tasks U and V be mapped to the hardware units H_U and H_V, respectively. Assume that N time partitions are available, i.e., the iteration period is N. A modulo operation determines the time partition. For example, the time unit 18 for N = 4 corresponds to time partition 18 modulo 4, i.e., time partition 2.
Let the tasks U and V be executed in time partitions u and v, i.e., the l-th iterations of tasks U and V are executed in time units Nl + u and Nl + v, respectively. The i(U→V) delays on the edge U→V imply that the result of the l-th iteration of U is used in the (l + i)-th iteration of V. The (l + i)-th iteration of V is executed in time unit N(l + i) + v. Thus the number of storage units needed in the folded edge corresponding to the edge U→V is

    D_F(U→V) = N(l + i) + v - Nl - u - P_U = Ni + v - u - P_U

where P_U is the level of pipelining of the hardware operator H_U. The D_F(U→V) delays should be connected to the edge between H_U and H_V, and this signal should be switched to the input of H_V at time partition v. If the D_F(U→V)'s calculated here were always nonnegative for all edges U→V, then the problem would be solved. However, some D_F(·)'s may be negative. The algorithm data flow graph then needs to be pipelined and retimed such that all the D_F(·)'s are nonnegative. This can be formulated with simple inequalities using the retiming variables. The retiming formulation can be solved as a path problem, and the retiming variables can be determined if a solution exists. The algorithm data flow graph can be retimed for folding, and the calculation of the D_F(·)'s can be repeated. The folded hardware architecture data flow graph can then be completed.

The folding technique is illustrated in Fig. 18.5. The algorithm data flow graph of the two-stage pipelined lattice recursive digital filter of Fig. 18.3(a) is folded for the folding set shown in Fig. 18.5. Figure 18.5(a) shows the pipelined/retimed data flow graph (preprocessed for folding) and Fig. 18.5(b) shows the hardware architecture data flow graph obtained after folding.

FIGURE 18.5 (a) A pipelined/retimed data flow graph obtained from Fig. 18.3(a) by preprocessing for folding; (b) the folded hardware architecture data flow graph. In our folding notation, the tasks are ordered within a set, and the ordering represents the time partition in which each task is executed. For example, S_A1 = (A2, A1) implies that A2 and A1 are, respectively, executed in even and odd time partitions in the same processor. The notation φ represents a null operation.

As indicated before, a special case of folding can address systolic array design for regular data flow graphs and linear mappings. Systolic architectures make use of extensive pipelining and local communication and operate in a synchronous manner [Kung, 1988]. Systolic processors can also be made to operate in an asynchronous manner, and such systems are often referred to as wavefront processors. Systolic architectures have been designed for a variety of applications including convolution, matrix solvers, matrix decomposition, and filtering.
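The folding-delay computation is mechanical once the folding set is fixed. In the sketch below, the iteration period, edge delays, time partitions, and pipelining levels are all invented for illustration; a negative D_F flags an infeasible folding set that must first be repaired by the pipelining/retiming step described above.

```python
def folded_delays(N, edges):
    """D_F(U->V) = N*i + v - u - P_U for each (i, u, v, P_U) tuple,
    where i = delays on U->V, u and v = time partitions of U and V,
    and P_U = pipelining level of the operator executing U."""
    return [N * i + v - u - P_U for i, u, v, P_U in edges]

# Invented example with iteration period N = 4:
edges = [(1, 3, 0, 2),   # D_F = 4*1 + 0 - 3 - 2 = -1
         (0, 1, 3, 1)]   # D_F = 4*0 + 3 - 1 - 1 =  1
print(folded_delays(4, edges))   # [-1, 1]
# The -1 means this folding set is infeasible as given: the algorithm
# data flow graph must be pipelined/retimed until every D_F >= 0,
# after which each D_F gives the register count on that folded edge.
```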
The look-ahead transformation creates additional concurrency in the signal processing algorithms and the speed bound of the transformed algorithms is increased substantially. The look-ahead transformation is not free from its drawbacks. It is accompanied by an increase in the hardware overhead. This difficulty has encouraged us to develop inherently pipelinable topologies for recursive signal processing algorithms. Fortunately, this is possible to achieve in adaptive digital filters using relaxations on the look-ahead or by the use of relaxed look-ahead [Shanbhag and Parhi, 1992]. To begin, consider a time-invariant one-pole recursive digital filter transfer function H z X z U z az ( ) ( ) ( ) = = - - 1 1 1
described by the difference equation

    x(n) = a x(n-1) + u(n)

and shown in Fig. 18.6(a). The maximum achievable speed in this system is limited by the operating speed of one multiply-add operation. To increase the speed of this system by a factor of 2, we can express x(n) in terms of x(n-2) by substituting one recursion into the other:

    x(n) = a[a x(n-2) + u(n-1)] + u(n) = a^2 x(n-2) + a u(n-1) + u(n)
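As a quick sanity check before turning to the transfer-function view, the sketch below (with an arbitrary coefficient and input) verifies that the two-step look-ahead recursion produces the same output as the original first-order recursion, assuming zero initial conditions.

```python
def original(u, a):
    """x(n) = a*x(n-1) + u(n), zero initial condition."""
    x, out = 0.0, []
    for un in u:
        x = a * x + un
        out.append(x)
    return out

def look_ahead(u, a):
    """x(n) = a^2*x(n-2) + a*u(n-1) + u(n): two delays in the loop,
    so the multiply-add can be pipelined by two stages."""
    out = []
    xm1 = xm2 = 0.0                     # x(n-1), x(n-2)
    for n, un in enumerate(u):
        um1 = u[n - 1] if n >= 1 else 0.0
        xn = a * a * xm2 + a * um1 + un
        out.append(xn)
        xm2, xm1 = xm1, xn              # shift the state registers
    return out

a, u = 0.9, [1.0, 0.5, -0.25, 2.0, -1.0, 0.75]   # arbitrary values
pairs = zip(original(u, a), look_ahead(u, a))
assert all(abs(p - q) < 1e-12 for p, q in pairs)
```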
The transfer function of the emulated second-order system is given by

    H(z) = (1 + a z^{-1}) / (1 - a^2 z^{-2})

and is obtained by a pole-zero cancellation at z = -a. In the modified system, x(n) is computed using x(n-2) as opposed to x(n-1); thus we look ahead. The modified system has two delays in the multiply-add feedback loop, and these two delays can be distributed to pipeline the multiply-add operation by two stages. Of course, the additional multiply-add operation that realizes the new zero would also need to be pipelined by two stages to keep up with the sample rate of the system. To increase the speed by four times, we can rewrite the transfer function as

    H(z) = (1 + a z^{-1})(1 + a^2 z^{-2}) / (1 - a^4 z^{-4})

This system is shown in Fig. 18.6(b). An arbitrary speed increase is possible; for a power-of-two speed increase, the hardware complexity grows logarithmically with the speed-up factor.

FIGURE 18.6 (a) A first-order recursive digital filter; (b) a four-stage pipelinable equivalent filter obtained by look-ahead computation.

The same technique can be applied to any higher-order system. For example, a second-order recursive filter with transfer function

    H(z) = 1 / (1 - 2r cos θ z^{-1} + r^2 z^{-2})

can be modified to

    H(z) = (1 + 2r cos θ z^{-1} + r^2 z^{-2}) / (1 - 2r^2 cos 2θ z^{-2} + r^4 z^{-4})

for a twofold increase in speed. In this example, the output y(n) is computed using y(n-2) and y(n-4); thus, this is referred to as scattered look-ahead.
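The scattered look-ahead identity can be checked numerically. The sketch below multiplies the second-order denominator by its mirror polynomial and confirms that the odd powers of z^{-1} cancel, leaving a denominator in powers of z^{-2} only; the pole radius r and angle θ are arbitrary illustrative values.

```python
import math

def polymul(p, q):
    """Multiply polynomials in z^-1 (coefficient lists, ascending powers)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

r, theta = 0.95, math.pi / 5       # illustrative pole radius and angle
den    = [1.0, -2 * r * math.cos(theta), r * r]   # 1 - 2r cos(t)z^-1 + r^2 z^-2
mirror = [1.0,  2 * r * math.cos(theta), r * r]   # zeros added by look-ahead
target = [1.0, 0.0, -2 * r * r * math.cos(2 * theta), 0.0, r ** 4]
prod = polymul(den, mirror)
assert all(abs(x - y) < 1e-12 for x, y in zip(prod, target))
# Odd powers of z^-1 vanish: the feedback loop now spans two delays
# (z^-2 and z^-4 terms only), so the loop can be pipelined by two stages.
```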