A High-Speed,Programmable,CSD Coefficient FIR Filter Tang Zhangwen.Student Member.IEEE.Zhang Zhanpeng.Zhang Jie.Min Hao,Member:IEEE Email:zwtang @fudan.edu.cn,tel:(021)65642765 ASIC System State Key Laboratory, Fudan University,Shanghai,200433,P.R.China 4bstracA new high-speed,programmable FIR Xin D filter is present,which is a multiplierless filter with CSD encoding coefficients.With this encoding scheme,the speed of filter is improved and the area is Xh(10 optimized.In order to make this filter more applicable,we employ a new programmable CSD encoding structure to make CSD coefficients programmable.In the end of this paper,we design a 10-bits,18 taps video luminance filter with the filter structure we present.The completed filter core occupies 6.8x6.8 mm2 of silicon area in Wu-Xi Shanghua 0.6-um 2P2M CMOS technology,and its maximum work frequency is 100MHz. Yout Index Terms-Finite Impulse Response filter, Application Specific Integrated Circuit,Canonic Fig.1.Structure of traditional FIR filter Signed-digit Encode,Booth Multiplier,Wallace use of custom application specific integrated circuits Adder Tree. (ASICs),because programmable signal processors(such as DSPs)cannot accommodate such high sample rates without an excessive amount of parallel processing,and I, Introduction for dedicated applications,the flexibility of a filter with high-speed multipliers [2]is not necessary. i n mmune uls more n more In this paper,we present a new high-speed,CSD broadly,and higher performance in speed and area is coefficient FIR filter structure,which is explored demanded.The traditional FIR filter structure [1],as through studying CSD coefficient filters,BOOTH shown in Fig.1,has already not met the speed demand multipliers and high-speed adders.With such structure, of high performance systems,because it is restricted by we can implement any order high-speed FIR filters,and the multiplier and adder circuits. the critical path delay is almost not relative to the tap number.In order to make this filter more applicable,we The transform function of FIR filter is described as employ a new programmable CSD encoding structure to follows. make CSD coefficients programmable.In the following, we will respectively address CSD encoding in FIR filters M-1 in the second part,programmable CSD encoding y(n)=h(k)x(n-k) (1) structure in the third part and the structure of partial k=0 product adder tree in the fourth part.In the end,we adopt this type of filter to implement a 10-bits,18 taps video The critical path delay of the FIR filter in Fig.I is luminance filter. TM+MTA,TM is the delay of one multiplication,TA is the delay of one adder,and M is the tap number of filter.It is clear that the critical path is rapidly increasing with the II.CSD encoding in FIR filter tap number of FIR filter. The traditional FIR filter is shown as Fig.1,but for fixed High-speed digital filtering application (such as, coefficient filters we can simply shift the data bus to the sample rates in excess of 20MHz)generally require the left or right by an appropriate number of bits and
A High-Speed, Programmable, CSD Coefficient FIR Filter Tang Zhangwen, Student Member,IEEE, Zhang Zhanpeng, Zhang Jie, Min Hao, Member,IEEE Email: zwtang@fudan.edu.cn, tel: (021)65642765 ASIC & System State Key Laboratory, Fudan University, Shanghai, 200433, P.R.China AbstractA new high-speed, programmable FIR filter is present, which is a multiplierless filter with CSD encoding coefficients. With this encoding scheme, the speed of filter is improved and the area is optimized. In order to make this filter more applicable, we employ a new programmable CSD encoding structure to make CSD coefficients programmable. In the end of this paper, we design a 10-bits, 18 taps video luminance filter with the filter structure we present. The completed filter core occupies 6.8×6.8 mm2 of silicon area in Wu-Xi Shanghua 0.6-µm 2P2M CMOS technology, and its maximum work frequency is 100MHz. Index TermsFinite Impulse Response filter, Application Specific Integrated Circuit, Canonic Signed-digit Encode, Booth Multiplier, Wallace Adder Tree. I. Introduction inite Impulse response (FIR) filters have been used in video and communication circuits more and more broadly, and higher performance in speed and area is demanded. The traditional FIR filter structure [1], as shown in Fig. 1, has already not met the speed demand of high performance systems, because it is restricted by the multiplier and adder circuits. The transform function of FIR filter is described as follows. − = = − 1 0 ( ) ( ) ( ) M k y n h k x n k (1) The critical path delay of the FIR filter in Fig. 1 is TM+MTA, TM is the delay of one multiplication, TA is the delay of one adder, and M is the tap number of filter. It is clear that the critical path is rapidly increasing with the tap number of FIR filter. High-speed digital filtering application (such as, sample rates in excess of 20MHz) generally require the D D D D ... ... Yout Xin h(0) h(10) h(1) h(2) Fig. 1. Structure of traditional FIR filter use of custom application specific integrated circuits (ASICs), because programmable signal processors (such as DSPs) cannot accommodate such high sample rates without an excessive amount of parallel processing, and for dedicated applications, the flexibility of a filter with high-speed multipliers [2] is not necessary. In this paper, we present a new high-speed, CSD coefficient FIR filter structure, which is explored through studying CSD coefficient filters, BOOTH multipliers and high-speed adders. With such structure, we can implement any order high-speed FIR filters, and the critical path delay is almost not relative to the tap number. In order to make this filter more applicable, we employ a new programmable CSD encoding structure to make CSD coefficients programmable. In the following, we will respectively address CSD encoding in FIR filters in the second part, programmable CSD encoding structure in the third part and the structure of partial product adder tree in the fourth part. In the end, we adopt this type of filter to implement a 10-bits, 18 taps video luminance filter. II. CSD encoding in FIR filter The traditional FIR filter is shown as Fig. 1, but for fixed coefficient filters we can simply shift the data bus to the left or right by an appropriate number of bits and F
…品 Fig.3.Structure of one tap CSD encoding Fig.3 is one tap CSD encoding structure,which CSD coefficients are generated by MATLAB program.There are no more than three nonzero digits in a CSD coefficient.The shift operation according to the position of nonzero digit is shown as Table 1. 物票)·-3" 力45与。2·】请 侧)=9+++-34 00年2+1…2+ Table 1.CSD Encoding Operation CSD Definition Operation Fig.2.Frequency response and CSD coefficients for 10- Encoding bits,18-taps luminance filter 00000 2 Shift 0 bit 00001-01110 21214 Shift 1 to 14 bits employ a small number of adders/subtracters instead of 01111 0 Zero multipliers.The resulting hardware complexity is a small 10000 -2 Shift 0 bit and fraction of the complexity of a general filter with Negate multipliers and thus a significantly larger number of taps 10001-11110 2124 Shift 1 to 14 bits can be integrated into a single chip. and Negate As we all know,any fraction can be described as 11111 0 Zero follows [3]. The input of CSD encoding structure is 5 bits'signed binary number.MSB is singed bit,which represents X=S2-P (2) negating operation;four low bits are the number of K= shifting left.If they are full one"1111",the output is full zero.In the end,the three outputs (partial products)of where sk∈{-l,0,l}and pk∈{0,l,,M.The one tap are added together. representation given by (2)has M+1 total (ternary)digits and L nonzero digits.A canonic signed-digit (CSD) Employing the above programmable CSD encoding representation is defined as the minimal representation structure,the partial products and the internal data length for which no two nonzero digits sy are adjacent.So the decrease,but the resolution of filter does not degrade.As number of adders/subtracters required to realize a CSD we all know,Nbits X Nbits BOOTH multiplier has [N/2] coefficient is one less than the number of nonzero digits partial products [5][6]and the internal data length is 2N in the fraction.For any coefficient of FIR filters can been bits,but CSD encoding structure has three partial translated into CSD coefficient [4],we develop a products and the internal data length is smaller than N to MATLAB program to generate the CSD code of general guarantee the truncation error less than quantization FIR filter's coefficient.The CSD coefficients and noise.Thus the programmable CSD encoding structure is frequency response diagram for a 10-bits,18 taps more advantageous in the complexity and compatibility. luminance filter are shown as Fig.2. IV.Partial product adder tree III.Programmable CSD encoding structure A. Wallace adder tree in Booth multiplier The complexity of FIR filter can decrease rapidly with CSD coefficient multipliers instead of fixed coefficient Multiplier is a fundamental unit in digital signal multipliers,but the compatibility decreases too.In this processing circuits,the searching about the multiplier paper we explore a new programmable CSD encoding architecture had grown up.The multipliers(Fig.4)in [5] structure to decrease the complexity and increase [6]employ modified Booth algorithm and parallel compatibility. Wallace adder tree.It consists of the following three
Fig. 2. Frequency response and CSD coefficients for 10- bits, 18-taps luminance filter employ a small number of adders/subtracters instead of multipliers. The resulting hardware complexity is a small fraction of the complexity of a general filter with multipliers and thus a significantly larger number of taps can be integrated into a single chip. As we all know, any fraction can be described as follows [3]. = − = L K p k k x s 1 2 (2) where sk ∈{-1,0,1} and pk ∈{0,1,...,M}. The representation given by (2) has M+1 total (ternary) digits and L nonzero digits. A canonic signed-digit (CSD) representation is defined as the minimal representation for which no two nonzero digits sk are adjacent. So the number of adders/subtracters required to realize a CSD coefficient is one less than the number of nonzero digits in the fraction. For any coefficient of FIR filters can been translated into CSD coefficient [4], we develop a MATLAB program to generate the CSD code of general FIR filter’s coefficient. The CSD coefficients and frequency response diagram for a 10-bits, 18 taps luminance filter are shown as Fig. 2. III. Programmable CSD encoding structure The complexity of FIR filter can decrease rapidly with CSD coefficient multipliers instead of fixed coefficient multipliers, but the compatibility decreases too. In this paper we explore a new programmable CSD encoding structure to decrease the complexity and increase compatibility. D Fixed CSD TAP 8 2 − − 10 2 − 0 h(0) D h(0) CSD coefficient D Programmable CSD TAP h(0) 5 11000 5 01010 5 11111 xin(i-1) xin(i) xin(i-1) xin(i) xin(i-1) xin(i) xin(i-1) xin(i-1) xin(i) xin(i-1) xin(i) Fig. 3. Structure of one tap CSD encoding Fig. 3 is one tap CSD encoding structure, which CSD coefficients are generated by MATLAB program. There are no more than three nonzero digits in a CSD coefficient. The shift operation according to the position of nonzero digit is shown as Table 1. Table 1. CSD Encoding Operation CSD Encoding Definition Operation 00000 20 Shift 0 bit 00001-01110 2- 1∼2-14 Shift 1 to 14 bits 01111 0 Zero 10000 -20 Shift 0 bit and Negate 10001-11110 2- 1∼2-14 Shift 1 to 14 bits and Negate 11111 0 Zero The input of CSD encoding structure is 5 bits’ signed binary number. MSB is singed bit, which represents negating operation; four low bits are the number of shifting left. If they are full one “1111”, the output is full zero. In the end, the three outputs (partial products) of one tap are added together. Employing the above programmable CSD encoding structure, the partial products and the internal data length decrease, but the resolution of filter does not degrade. As we all know, Nbits×Nbits BOOTH multiplier has [N/2] partial products [5] [6] and the internal data length is 2N bits, but CSD encoding structure has three partial products and the internal data length is smaller than N to guarantee the truncation error less than quantization noise. Thus the programmable CSD encoding structure is more advantageous in the complexity and compatibility. IV. Partial product adder tree A. Wallace adder tree in Booth multiplier Multiplier is a fundamental unit in digital signal processing circuits, the searching about the multiplier architecture had grown up. The multipliers (Fig. 4) in [5] [6] employ modified Booth algorithm and parallel Wallace adder tree. It consists of the following three
Multiplier(N bits) Multiplicand(N bits) H234 BOOTH Encoder A UR Full Booth Adder Block Partial products generator Cin Cout Full C.A.A Partial Product © Adder Array 囱 Block C.A.C Compresso Adder 4:2 Compression Adder Final Carry Select Adder &CLA Adder Wallace Adder Tree Product(2N bits) Fig.4.Wallace adder tree in Booth multiplier 。44。。 Couto Carry Lookahead 4:2 Adder Compression TT Adder s13s12s11S10 Carry Selector Partial Product Adder Array Critical Pa Fig.5.Adder array in FIR filter modules:Booth encoder,partial products adder array and Fig.6.Carry selected adder with carry look-ahead adder final adder.Employing modified Booth algorithm,the number of partial products decreases to half.If the bit products used in FIR filters is shown as Fig.5.There are number of multiplier is N.then there are [N/2]partial three partial products output in one tap CSD encoding products.The partial product adder array adds the whole structure,so the N-taps FIR filter has 3N partial [N/2]partial products and generates 2N-bits Carry and products.In order to add all these partial products,we Sum.In the interest of improving the parallelity,we need [log23N]level partial-product adders (4:2 adopt a 4:2 compression adder instead of 3:2 full adder compression adder)and generate two partial products C in partial product adder array.In the end,the final adder (Carry)and S(Sum). adds the 2N-bits C(Carry)and S(Sum)and generates the multiplier product. B.Final adder A.Partial product adder tree in FIR filters The critical path delay of partial product adder array in FIR filter is only the delay of 2x[log-3N]serial full In the duration of design FIR filters,we design a fit adders.But final adder is a (N+M)bits full adder (M is adder tree for FIR filters through studying the above the guarantee bits to decrease the truncation error),its Wallace adder tree used in Booth multiplier.The partial delay is greater than the one in adder array.So we need
Partial products generator BOOTH Encoder C.A. A C.A. B C.A.C Carry Select Adder &CLA Multiplier(N bits) Multiplicand(N bits) Booth Block Partial Product Adder Array Block Final Adder Product (2N bits) Compressor Adder I1I2I3I4 C S Cout Cin I1I2I3I4 C S Cout Cin I1I2I3I4 C S Cout Cin I1I2I3I4 C S Cout Cin I1I2I3I4 C S Cout Cin I1I2I3I4 C S Cout Cin C I1I2I3I4 C S Cout Cin I1I2I3I4 C S Cout Cin I1I2I3I4 C S Cout Cin Partial Products Partial Products Full Adder Full Adder I1 I2 I3 I4 Cout Cin S C A A B B Cin Cin Cout Cout Sum Sum Wallace Adder Tree 4:2 Compression Adder Fig. 4. Wallace adder tree in Booth multiplier 4:2 Compression Adder Partial Product Adder Array ...... ...... ...... ...... Fig. 5. Adder array in FIR filter modules: Booth encoder, partial products adder array and final adder. Employing modified Booth algorithm, the number of partial products decreases to half. If the bit number of multiplier is N, then there are [N/2] partial products. The partial product adder array adds the whole [N/2] partial products and generates 2N-bits Carry and Sum. In the interest of improving the parallelity, we adopt a 4:2 compression adder instead of 3:2 full adder in partial product adder array. In the end, the final adder adds the 2N-bits C (Carry) and S (Sum) and generates the multiplier product. A. Partial product adder tree in FIR filters In the duration of design FIR filters, we design a fit adder tree for FIR filters through studying the above Wallace adder tree used in Booth multiplier. The partial Carry Lookahead Adder Cin0 Cin1 Cout0 Cout1 A3 A2 A1 A0 B2 B1 B0 S03 S13 S02 S12 S01 S00 S11 S10 CLA CLA Sum Selector Sum Selector 0 1 4 4 4 4 4 4 CLA CLA Sum Selector Sum Selector 0 1 4 4 4 4 4 4 Carry Selector Critical Path B3 Fig. 6. Carry selected adder with carry look-ahead adder products used in FIR filters is shown as Fig. 5. There are three partial products output in one tap CSD encoding structure, so the N-taps FIR filter has 3N partial products. In order to add all these partial products, we need [log23N] level partial-product adders (4:2 compression adder) and generate two partial products C (Carry) and S (Sum). B. Final adder The critical path delay of partial product adder array in FIR filter is only the delay of 2×[ log23N] serial full adders. But final adder is a (N+M) bits full adder (M is the guarantee bits to decrease the truncation error), its delay is greater than the one in adder array. So we need
4:2 I2C Compressio SCL Slave Adder SDA Controller Partial Product Adder Array Carry Lookahead Finnl Adder aK-]Yout' DC gain Module Yout Fig.7.Programmable,CSD coefficient FIR filter to adopt carry selected adder (CSA)with carry look- with CSD encoding coefficients.With this encoding ahead adder(CLA).Carry look-ahead adder is shown as scheme,the speed of filter is improved and the area is Fig.6,which employs two 4-bits CLA to compose one optimized.In order to make this filter more applicable, 8-bits CSA.It can be concluded that the critical path we employ a new programmable CSD encoding delay is one 8-bits CLA and some multiplexers,so the structure to make CSD coefficients programmable.In the adder of this structure can work at high speed. end of this paper,we also design a 10-bits,18-taps video luminance filter with the filter structure we present.The completed filter core occupies 6.8x6.8 mm2 of silicon V.Implementation of programmable CSD area in Wu-Xi Shanghua 0.6-um 2P2M CMOS technology,and its maximum work frequency is We construct programmable CSD coefficient filter 100MHz. (Fig.7)with many above modules.In Fig.7,the top is programmable CSD encoding structure,the middle is 4:2 Reference compression adder tree,the bottom is carry-selected [1]Zhou Yaohua et al.,Digital Signal Processing adder with carry look-ahead adder.To eliminate the DC (Fudan University Press,Shanghai,1996),pp.65. gain,the filtered signal is output to a DC gain cancelled [2]Jun Rim Choi,Lak Hyun Jang,et al.,"Structured module. Design of a 288-TAP FIR Filter by Optimized Partial Product Tree Compression",IEEE J.Solid-State Circuit In digital video encoder(DVE)system,we design a vol 32,N0.3,March,1997. 10-bits,18-taps high-speed luminance filter using the above FIR filter structure.Its frequency response [3]A.Avizienis,"Signed Digit Number Representation diagram is shown as Fig.2.In order to program the for Fast Parallel Arithmetic,"IRE Trans.Electron. coefficients in FIR filters,we design a slave mode I'C Computers,vol.EC-10,pp.389-400,1961. bus controller.The whole chip is implemented in 0.6-um [4]H.Samueli,"An improved search algorithm for the 2P2M CMOS technology in Wuxi-Shanghua.The filter design of multiplierless FIR filters with powers-of-two core area is 6.8mmX6.8mm,the maximum work coefficients",IEEE Trans.Circuits Syst.,vol.36,pp. frequency is 100MHz. 1044-1047,July1989. [5]M.Nagamatsu et al.,"A 15ns 32x32-bit CMOS Multiplier with an Improved Parallel structure,"IEEE VI.Conclusion J.Solid-State Circuit vol 25,N0.2,April,1990. Junji Mori,et al,"A 10ns 54x54-b parallel structured In this paper we present a new high-speed, Full Array Multiplier with 0.5-um COMS Technology," programmable FIR filter,which is a multiplierless filter IEEE J.Solid-State Ciruit vol 26.No.4.April,1991
D CSD TAP D CSD TAP D CSD TAP CSD TAP Carry Lookahead Final Adder Yout' Xin 4:2 Compression Adder ............... ............... ............... ............... ............... ............... Partial Product Adder Array h(0) h(1) h(2) h(N) DC_gain Module D Yout CLK CLK CLK CLK I2C Slave Controller SCL SDA Fig. 7. Programmable, CSD coefficient FIR filter to adopt carry selected adder (CSA) with carry lookahead adder (CLA).Carry look-ahead adder is shown as Fig. 6, which employs two 4-bits CLA to compose one 8-bits CSA. It can be concluded that the critical path delay is one 8-bits CLA and some multiplexers, so the adder of this structure can work at high speed. V. Implementation of programmable CSD We construct programmable CSD coefficient filter (Fig. 7) with many above modules. In Fig. 7, the top is programmable CSD encoding structure, the middle is 4:2 compression adder tree, the bottom is carry-selected adder with carry look-ahead adder. To eliminate the DC gain, the filtered signal is output to a DC gain cancelled module. In digital video encoder (DVE) system, we design a 10-bits, 18-taps high-speed luminance filter using the above FIR filter structure. Its frequency response diagram is shown as Fig. 2. In order to program the coefficients in FIR filters, we design a slave mode I2 C bus controller. The whole chip is implemented in 0.6-µm 2P2M CMOS technology in Wuxi-Shanghua. The filter core area is 6.8mm×6.8mm, the maximum work frequency is 100MHz. VI. Conclusion In this paper we present a new high-speed, programmable FIR filter, which is a multiplierless filter with CSD encoding coefficients. With this encoding scheme, the speed of filter is improved and the area is optimized. In order to make this filter more applicable, we employ a new programmable CSD encoding structure to make CSD coefficients programmable. In the end of this paper, we also design a 10-bits, 18-taps video luminance filter with the filter structure we present. The completed filter core occupies 6.8×6.8 mm2 of silicon area in Wu-Xi Shanghua 0.6-µm 2P2M CMOS technology, and its maximum work frequency is 100MHz. Reference [1] Zhou Yaohua et al., Digital Signal Processing (Fudan University Press, Shanghai, 1996), pp.65. [2] Jun Rim Choi, Lak Hyun Jang, et al., “Structured Design of a 288-TAP FIR Filter by Optimized Partial Product Tree Compression”, IEEE J. Solid-State Circuit vol 32, N0.3, March, 1997. [3] A.Avizienis, “Signed Digit Number Representation for Fast Parallel Arithmetic,” IRE Trans. Electron. Computers, vol. EC-10, pp.389-400,1961. [4] H. Samueli, “An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients”, IEEE Trans. Circuits Syst., vol.36, pp. 1044-1047, July 1989. [5] M. Nagamatsu et al., “A 15ns 32×32-bit CMOS Multiplier with an Improved Parallel structure,” IEEE J. Solid-State Circuit vol 25, N0.2, April, 1990. Junji Mori, et al, “ A 10ns 54×54-b parallel structured Full Array Multiplier with 0.5-um COMS Technology,” IEEE J. Solid-State Ciruit vol 26. No. 4. April, 1991