

Copyright 2003 IEEE. Published in the Proceedings of 2003 IEEE International Symposium on Circuits and Systems (ISCAS), Vol V, pages 413-416, Bangkok, Thailand, May 25-28, 2003.


Power-Delay Tradeoffs in Residue Number System

Alberto Nannarelli, Gian Carlo Cardarilli and Marco Re
Department of Electrical Engineering
University of Rome "Tor Vergata" - Italy

http://dspvlsi.uniroma2.it/

Abstract

In this paper we present some tradeoffs between delay and power consumption in the design of digital processors based on the Residue Number System (RNS). We focus on reducing the switching capacitance, and therefore the power, in modular adders and isomorph multipliers. Results on architectures such as FIR filters show that the techniques used to reduce the switching capacitance not only lead to more power-efficient circuits, but also to better performance.

1  Introduction

In CMOS circuits the main contribution to power dissipation is the dynamic dissipation due to the charging and discharging of load capacitances. The well-known expression P ∝ VDD²·CL·α·f shows the design parameters that can be changed in order to modify the power figures of a system. We indicate with α the average activity of the circuit relative to the clock frequency f (0 ≤ α ≤ 1). Usually the main constraint to be met in system design is the timing constraint (e.g. the clock cycle); therefore, reducing the power dissipation by lowering f is not an option for these systems. On the other hand, lowering the supply voltage VDD, although it leads to a quadratic reduction of the power consumed, is also ruled out because the propagation delay of a circuit increases at a rate proportional to VDD/(VDD-VT)² [1]. As a consequence, in most cases power optimization is carried out by minimizing the so-called switching capacitance α·CL, i.e. by reducing the switching activity in some parts of the circuit.
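As a purely numeric illustration of the two scaling laws above (VT = 0.5 V is an assumed value, not taken from the paper), scaling VDD from 3.3 V to 2.5 V cuts power quadratically but slows every gate:

```python
# Illustrative arithmetic only: effect of VDD scaling on
# P ∝ VDD^2·CL·α·f and on delay ∝ VDD/(VDD - VT)^2.
# VT = 0.5 V is an assumed value for this sketch.
def power_ratio(v_new, v_old):
    return (v_new / v_old) ** 2          # quadratic power reduction

def delay_ratio(v_new, v_old, vt=0.5):
    d = lambda v: v / (v - vt) ** 2      # propagation-delay model of [1]
    return d(v_new) / d(v_old)

# power drops by roughly 43%, but gates become roughly 48% slower
assert power_ratio(2.5, 3.3) < 0.58
assert delay_ratio(2.5, 3.3) > 1.4
```

This is why the paper targets the switching capacitance α·CL instead of VDD or f.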

In a Residue Number System, defined by a set of relatively prime integers {m1, m2, …, mP} with dynamic range M = ∏ mi, any integer X ∈ {0, 1, …, M-1} has a unique representation given by:

    X  →RNS  ( ⟨X⟩m1, ⟨X⟩m2, …, ⟨X⟩mP )

where ⟨X⟩mi denotes the operation X mod mi. Operations on the individual moduli are done in parallel:
    Z = X op Y  →RNS
        Zm1 = ⟨Xm1 op Ym1⟩m1
        Zm2 = ⟨Xm2 op Ym2⟩m2
        …
        ZmP = ⟨XmP op YmP⟩mP
Therefore, the RNS allows the decomposition of a given dynamic range in slices of narrower bit-width on which the computation can be implemented in parallel [2].
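The decomposition can be sketched in a few lines of Python, using the moduli set {11, 13, 29} adopted later in the paper as the example:

```python
# Minimal sketch of RNS decomposition and channel-parallel operations
# for the moduli set {11, 13, 29} (dynamic range M = 11*13*29 = 4147)
moduli = (11, 13, 29)

def to_rns(x):
    # unique representation of x as a tuple of residues <x> mod mi
    return tuple(x % m for m in moduli)

def rns_op(xr, yr, op):
    # the same operation is applied independently, modulo mi, in each channel
    return tuple(op(a, b) % m for a, b, m in zip(xr, yr, moduli))

X, Y = 1234, 2718
assert rns_op(to_rns(X), to_rns(Y), lambda a, b: a * b) == to_rns(X * Y)
assert rns_op(to_rns(X), to_rns(Y), lambda a, b: a + b) == to_rns(X + Y)
```

Each channel only ever sees a narrow residue, which is exactly what shortens the carry chains discussed next.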

This parallelization usually implies a shortening of the critical path of a system, because narrower datapaths have shorter carry chains and consequently smaller delays. Taking advantage of this parallelism, in [3] the power dissipation in RNS structures is reduced by lowering the supply voltage until the desired value of delay is obtained. In [4], we presented an approach to reduce the power dissipation in an RNS filter without penalizing its speed: we used the time slack available in the non-critical parallel paths and reduced VDD for these moduli until the delays in all paths are equalized to the critical one. In this way, we reduce the power dissipation without affecting the overall performance. However, this multiple-VDD approach is very technology dependent: it cannot be used if only one supply voltage is available for the whole chip, and it is not very effective if VDD cannot be easily scaled to the desired value.

In this work, we focus on power reduction techniques applied to the switching capacitance of RNS paths which are not the critical path. Our objective is to reduce the power dissipation without penalizing the overall performance.

Results show that reducing the switching capacitance, in some applications such as FIR filters, not only leads to more power efficient circuits, but also to better performance.

2  Modular Addition

The modular addition ⟨a + b⟩m consists of two binary additions. If the result of a + b exceeds the modulus (i.e. it is larger than m-1), we have to subtract the modulus m. In order to speed up the operation, the two additions (a + b) and (a + b - m) are executed in parallel.
If the sign of the three-term addition is negative, it means that (a + b) < m and the modular sum is (a + b); otherwise the modular sum is the result of (a + b - m). The above algorithm can be implemented with a carry-save adder (CSA), two n-bit (n = ⌈log2 m⌉) adders1 (nCPA), and a multiplexer, as shown in Figure 1.
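A behavioural sketch of this algorithm (the arithmetic only, not the gate-level CSA/nCPA structure of Figure 1):

```python
# Behavioural model of the modular adder: (a + b) and (a + b - m) are
# computed in parallel; the sign of (a + b - m) drives the multiplexer
def mod_add(a, b, m):
    s  = a + b        # first path: (a + b)
    sm = a + b - m    # second path: CSA + nCPA computing (a + b - m)
    return s if sm < 0 else sm   # mux: negative sign selects (a + b)

# exhaustive check for m = 11 over all residue pairs
assert all(mod_add(a, b, 11) == (a + b) % 11
           for a in range(11) for b in range(11))
```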

 

Figure 1: Structure of modular adder.

The modular adder can be simplified when m = 2n (only an n-bit adder is required), and when m = 2n-1 (implemented with two n-bit adders in parallel, one with carry-in cin = 0 and the other with cin = 1, selecting the correct result according to the cout computed in the adder with cin = 0).
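The m = 2^n - 1 case can be sketched behaviourally as follows. Note one deliberate deviation: the sketch selects on the carry-out of the cin = 1 adder, a common equivalent variant that keeps zero single-valued, whereas the text selects on the cout of the cin = 0 adder (which admits the all-ones encoding of zero):

```python
# Sketch of the m = 2^n - 1 modular adder: two n-bit adders in parallel,
# one with cin = 0 and one with cin = 1, plus a final selection.
def mod_add_2n_minus_1(a, b, n):
    mask = (1 << n) - 1          # m = 2^n - 1
    s0 = a + b                   # adder with cin = 0
    s1 = a + b + 1               # adder with cin = 1
    cout1 = s1 >> n              # carry-out = 1 iff a + b >= 2^n - 1
    return (s1 & mask) if cout1 else s0

# exhaustive check for n = 4 (m = 15)
assert all(mod_add_2n_minus_1(a, b, 4) == (a + b) % 15
           for a in range(15) for b in range(15))
```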

In the general case of Figure 1, the delay of the modular adder is
tmodadd = tCSA + tnCPA + tMUX .

The power dissipation is that of two additions executed in parallel. We can reduce the power by switching off the path producing the unused result in the modular adder of Figure 1, by predicting whether the binary sum of the two operands will exceed the modulus. According to the prediction, we select the path (left or right adder) in the modular adder to be active and switch off the other. This activation/deactivation can be done, for combinational logic, by using transparent latches which block the propagation of the new values into the adder to be turned off (Figure 2). With this new scheme, the prediction function must be chosen carefully because its delay is in the critical path, as derived from Figure 2:
t′modadd = tprediction + tlatch + tmodadd
The optimal prediction would be to implement the function:
F(a,b,m) = sign of (a+b-m)
because with this function we could always disable one of the two paths. However, the required sign detector has roughly the same complexity as an n-bit adder; therefore, the scheme would not be convenient in terms of either critical path (almost doubled) or power dissipation. For this reason, we have to design a simpler, but less precise, prediction function which distinguishes three cases (refer to Figure 2):

  1. ENABLE RIGHT: correct result produced by the adder at right (i.e. a+b < m);
  2. ENABLE LEFT: correct result produced by the adder at left (i.e. a+b ≥ m);
  3. ENABLE BOTH: the prediction function is unable to decide and both adders are enabled to produce the correct result.

Clearly, the better the prediction function, the fewer input combinations fall in the case ENABLE BOTH.

 

Figure 2: modular adder with latches and F.

As an example, for m = 11 we can choose the following prediction function F (indicating with a3 and a0 the MSB and the LSB of a, respectively, and with ′ the complement of a bit):

    FR = a3′·b3′·(a2′ + b2′)        (a ≤ 7 AND b ≤ 3, or vice versa)
    FL = a3·b2 + a2·b3 + a3·b3      (a ≥ 8 AND b ≥ 4, or vice versa)

    F(a,b,11) =
        ENABLE RIGHT   if FR = 1
        ENABLE LEFT    if FL = 1
        ENABLE BOTH    otherwise
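As a sanity check, the prediction function FR, FL for m = 11 can be modelled behaviourally and verified exhaustively over all 121 residue pairs (the complemented bits in FR are inferred from the accompanying range conditions). The counts reproduce the 40%/27%/33% split reported below:

```python
# Behavioural model of the prediction function F for m = 11
def bit(v, i):
    return (v >> i) & 1

def predict(a, b):
    a3, a2 = bit(a, 3), bit(a, 2)
    b3, b2 = bit(b, 3), bit(b, 2)
    FR = (not a3) and (not b3) and not (a2 and b2)   # a3'.b3'.(a2'+b2')
    FL = (a3 and b2) or (a2 and b3) or (a3 and b3)   # a3.b2 + a2.b3 + a3.b3
    return 'RIGHT' if FR else 'LEFT' if FL else 'BOTH'

counts = {'RIGHT': 0, 'LEFT': 0, 'BOTH': 0}
for a in range(11):
    for b in range(11):
        p = predict(a, b)
        counts[p] += 1
        # the prediction must never disable the path holding the result
        if p == 'RIGHT': assert a + b < 11
        if p == 'LEFT':  assert a + b >= 11

# 48/121 ~ 40% RIGHT, 33/121 ~ 27% LEFT, 40/121 ~ 33% BOTH
assert (counts['RIGHT'], counts['LEFT'], counts['BOTH']) == (48, 33, 40)
```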

For m = 11 and the F defined above, in 40% of all possible input combinations the adder at right is enabled, in 27% the adder at left, and in 33% both adders are enabled. We implemented the original mod 11 adder and the new scheme with a library of standard cells (0.35 µm) by using Synopsys Design Analyzer, and obtained the following values for delay and power dissipation:

             delay [ns]   P [µW] @ 100 MHz
  original      1.45           440
  new           1.68           420
  ratio         1.15          0.95

The power reduction obtained (5%) in a favorable case (33% ENABLE BOTH is quite good for such a simple F) makes this technique not very power efficient, although the delay overhead is not large and can be accommodated in the small time slack of non-critical RNS paths. The introduced latches almost offset the reduced switching capacitance in the adders.

 

Figure 3: Structure of isomorphic multiplier.

3  Modular Multiplication

To reduce the complexity of modular multiplication, we use the isomorphism technique [5] to implement the product of residues. This method works quite well for small prime moduli (m < 64). By using isomorphisms, the product of the two residues is transformed into the sum of their indices, which are obtained by an isomorphic transformation. According to [5], if m is prime there exists a primitive radix q such that its powers modulo m cover the set [1, m-1]:
    p = ⟨q^w⟩m    with p ∈ [1, m-1], w ∈ [0, m-2].

Both transformations p → w and w → p can be implemented with tables of m-1 entries. Therefore, the product of a and b modulo m can be obtained as:

    ⟨a·b⟩m = ⟨q^w⟩m ,    w = ⟨x + y⟩m-1 ,    a = ⟨q^x⟩m ,    b = ⟨q^y⟩m
In order to implement the modular multiplication the following operations are performed:

  1. Two direct isomorphic transformations (DIT) to obtain x and y;
  2. One modulus mI = m-1 addition ⟨x + y⟩mI;
  3. One inverse isomorphic transformation (IIT) to obtain the product.
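These three steps can be modelled for m = 11 with primitive radix q = 2 (an assumed but valid choice: the powers of 2 modulo 11 cover [1, 10]):

```python
# Behavioural sketch of an isomorphic multiplier for m = 11, q = 2
m, q = 11, 2
IIT = {w: pow(q, w, m) for w in range(m - 1)}     # index w -> residue p
DIT = {p: w for w, p in IIT.items()}              # residue p -> index w

def iso_mult(a, b):
    if a == 0 or b == 0:               # zero has no index: bypass the adder
        return 0
    w = (DIT[a] + DIT[b]) % (m - 1)    # modulus m-1 addition of the indices
    return IIT[w]                      # inverse isomorphic transformation

# exhaustive check against the direct modular product
assert all(iso_mult(a, b) == (a * b) % m
           for a in range(m) for b in range(m))
```

In hardware the two dictionaries become the DIT and IIT tables, and the index addition is the modular adder of Section 2 with modulus mI = m-1.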

The tables are implemented as synthesized multi-level logic, and special attention must be paid when one of the two operands is zero: in this case there is no isomorphic correspondence and the modular adder has to be bypassed. The delay of the resulting scheme, shown in Figure 3, is
tisomult = tDIT + tCSA + tnCPA + tMUX + tIIT .

 

Figure 4: Isomorphic multiplier after first modification.

Because isomorph multipliers use modular adders in combination with tables, as a first step we eliminate one of the two parallel adders and incorporate the modulus reduction into the IIT table, as follows:

  1. we compute x+y-mI (mI = m-1; x and y are the indices of the isomorphism);
  2. if the result is positive (i.e. x+y ≥ mI) we access the normal IIT table;
    otherwise we access a modified table (IIT*) in which the inverse isomorphism is addressed by ⟨x+y-mI⟩k, where k = ⌈log2 mI⌉.

In this new scheme, depicted in Figure 4, the IIT tables are more complex (entries are doubled) than in the original scheme, but the delay is not increased because the multiplexer in the last stage of the modular adder is eliminated.

As a second step, we can incorporate the addition of the constant -mI into one of the DIT tables. The scheme of Figure 4 is modified as follows:

  1. in one of the DIT tables, instead of addressing the index y we address e = y-mI;
  2. using an n-bit adder, we compute w = x+e = x+y-mI;
  3. if w is positive (i.e. x+y ≥ mI) we access the normal IIT table;
    otherwise we access the IIT* table, as previously shown.
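Both modifications can be checked with a behavioural model for m = 11 (mI = 10, k = ⌈log2 10⌉ = 4, q = 2 assumed as primitive radix); the table contents follow the construction described above:

```python
# Behavioural sketch of the modified isomorphic multiplier (Figure 5):
# DIT* stores e = y - mI, a plain n-bit adder computes w = x + e, and the
# sign of w selects between the IIT and IIT* tables.
m, q, mI, k = 11, 2, 10, 4
IIT      = {w: pow(q, w, m) for w in range(mI)}   # addressed by w >= 0
DIT      = {pow(q, w, m): w for w in range(mI)}   # residue p -> index x
DIT_star = {p: w - mI for p, w in DIT.items()}    # residue p -> e = y - mI
# IIT*: addressed by the k low bits of the negative sum x + y - mI
IIT_star = {(s - mI) & ((1 << k) - 1): pow(q, s, m) for s in range(mI)}

def iso_mult2(a, b):
    if a == 0 or b == 0:                     # zero still bypasses everything
        return 0
    w = DIT[a] + DIT_star[b]                 # single n-bit adder, no CSA
    return IIT[w] if w >= 0 else IIT_star[w & ((1 << k) - 1)]

# exhaustive check against the direct modular product
assert all(iso_mult2(a, b) == (a * b) % m
           for a in range(m) for b in range(m))
```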

In this new scheme (Figure 5) we have eliminated the carry-save adder which was in the critical path, while the modifications in DIT* do not affect the table delay.

 

Figure 5: Isomorphic multiplier after second modification.

In conclusion, with these two modifications we reduced the critical path of the isomorph multiplier to:
t′isomult = tDIT + tnCPA + tIIT* .
As for the power dissipation, we implemented an m = 11 isomorph multiplier2 and obtained the following results:

             delay [ns]   P [µW] @ 100 MHz
  original      2.00           535
  new           1.60           455
  ratio         0.80          0.85

For this isomorph multiplier the power is reduced by 15% and the delay of the new scheme is 20% shorter.

4  Application to FIR Filters

To evaluate the impact of the techniques presented above we implemented a 16-tap programmable FIR filter using different schemes:

STD
standard RNS implementation in transposed form (one isomorph multiplier and modular adder per tap);
CS
carry-save RNS implementation as described in [4] (one isomorph multiplier and carry-save adder per tap);
NEW
carry-save RNS implementation with new isomorph multipliers (one new isomorph multiplier and carry-save adder per tap).

         min. Tcycle   area (ND2 equiv.)   power @ 100 MHz
  STD      5.8 ns          13,500              90 mW
  CS       4.3 ns          18,200             100 mW
  NEW      3.7 ns          17,100              94 mW

Table 1: Results of FIR filter implementations.

In all schemes the DIT tables can be eliminated: the input x is the multiplicand of all the multiplications, so its DIT is incorporated in the binary-to-RNS conversion, while the coefficients of the filter (the multiplicators) can be directly loaded at start-up as isomorphism indices. We chose the set of moduli {11, 13, 29}, which covers a 12-bit dynamic range.
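The end-to-end RNS filtering idea can be sketched behaviourally for the chosen moduli set. The 4-tap filter, coefficients and samples below are illustrative (the paper's filter has 16 taps), and the CRT reconstruction stands in for the hardware RNS-to-binary converter:

```python
# Sketch: FIR filtering computed independently in each RNS channel for
# {11, 13, 29}, then reconstructed via the Chinese Remainder Theorem and
# checked against a plain integer filter. Coefficients/samples illustrative.
moduli = [11, 13, 29]
M = 11 * 13 * 29                     # 4147 >= 4096: 12-bit dynamic range

def crt(residues):
    # Chinese Remainder Theorem reconstruction (binary conversion stand-in)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

coeffs  = [3, 1, 4, 1]
samples = [2, 7, 1, 8, 2, 8]

# reference: plain integer FIR
ref = [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
       for n in range(len(samples))]

# RNS: filter each channel modulo m, then reconstruct
rns_out = []
for n in range(len(samples)):
    res = [sum(c * samples[n - k] for k, c in enumerate(coeffs)
               if n - k >= 0) % m
           for m in moduli]
    rns_out.append(crt(res))

assert rns_out == ref    # outputs stay within the dynamic range M
```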

Results in Table 1 show that the CS filter is about 25% faster than the standard RNS one, but consumes about 11% more power due to the extra registers introduced to keep the carry-save representation of data. By implementing the new isomorph multiplier with reduced switching capacitance, the power overhead is reduced to 5% and the speed-up with respect to the standard scheme is 1.56 (about 37% faster).

5  Conclusions

We applied techniques aimed at reducing the switching capacitance to the design of modular adders and multipliers. For the adders, the power savings are quite small compared to the time overhead introduced. However, the technique of disabling unused parts of a circuit by using transparent latches can lead to better results if those disabled parts are deeper (in terms of levels of gates) than adders, or if custom low-power latches are used.

The modifications introduced for isomorph multipliers allow a reduction in both delay and power dissipation, and make the use of these modified multipliers in RNS FIR filters very attractive.

References

[1]
W. Nebel and J. Mermet editors, Low Power Design in Deep Submicron Electronics, Kluwer Academic Publishers, 1997.

[2]
M.A. Soderstrand, W.K. Jenkins, G.A. Jullien, and F.J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing, New York: IEEE Press, 1986.

[3]
W.L. Freking and K.K. Parhi, ``Low-power digital filters using residue arithmetic,'' Thirty-First Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 739-743, 1998.

[4]
A. Del Re, A. Nannarelli, and M. Re, ``Implementation of Digital Filters in Carry-Save Residue Number System,'' Proc. of 35th Asilomar Conference on Signals, Systems, and Computers, pp. 1309-1313, Nov. 2001.

[5]
I.M. Vinogradov, An Introduction to the Theory of Numbers, New York: Pergamon Press, 1955.


Footnotes:

1Adders are implemented in a carry-look-ahead scheme. An extra inverter is required in the (a + b - m) adder to obtain the sign.

2 We implemented a multiplier which is used in FIR filters in which the DIT tables are eliminated as explained in the next section.

