Alberto Nannarelli Ph.D. Dissertation

4.5 Radix-512 Division

4.5.1 Algorithm and Basic Implementation

We now refresh some of the expressions of the radix-512 division algorithm presented in Chapter 2, The recurrence is implemented by

w[j+1] = 512 w[j] - q_j+1z j = 0,1, ¼5

(21)

with w[0] = Mx, z = Md, and quotient-digit selection

q_j+1 = ë

^
y

+ 1/2 û .

The scaling factor M is determined by

M = -g₁d₁₅ + g₂ .

(22)

All the details of the implementation of the radix-512 divider are given in [33]. Here we briefly summarize the main features of the unit and determine the energy dissipated.

As for the other radices, d is normalized in [0.5, 1) and x < d. According to [32] the scaling factor M is in the range:

0 < M < 2

with 13 fractional bits. A total of 15 bits is required to store M. Because the scaled operands can be greater than 1, for the z and w[j] representation we need one sign bit, one integer bit and 54+13 = 67 fractional bits for a total of 69 bits. To have the correct recoding, as explained in [33], an extra integer bit is added. In conclusion, the number of bits needed to store the partial remainder w[j] (bits in the recurrence) is 70.

A first implementation of the divider is shown in Figure 2.4 on page pageref. Since the operations indicated in expression (4.5) and expression (4.6) are similar, they can be executed in the same unit. In [33] in order to reduce the area, the multiplier-accumulator required for the computation of M is eliminated, and the operation of expression (4.6) is executed in the main multiplier-accumulator. The modified block diagram is shown in Figure 4.23.

Figure 4.23: Block diagram of modified divider.

Figure 4.24: Cycles and operations.

Figure 4.24 shows the operation performed in the divider and the values stored in the registers during the different cycles.

Block gamma-table is a logic function which produces the two quantities -g₁ and -g₂ according to

g₁ =

d₆² + d₆2^-6 + 2^-15

g₂ =

2d₆ + 2^-6

d₆² + d₆2^-6 + 2^-15

where d₁₅ and d₆ are d truncated to its 15th and 6th fractional bit respectively. The block was synthesized with standard cells using COMPASS ASICSynthesizer.

The recoder is used to recode the multiplier into radix-4 representation with digits in the set {-2, -1, 0, 1, 2}.

Block MultAdd in Figure 4.23 executes both the multiplication and the addition in the recurrence. The multiple generator produces the partial products t₀, t₁, ¼, t₇ and the adder reduces the number of the partial products to the final carry-save representation (9:2 adder). Summarizing, the MultAdd operations are:

In M calculation, the recoded multiplier is d₁₅ (15 bits) which produces 8 partial products (t₀, t₁, ¼, t₇). In addition, -g₂ must be added.

-M = -( -g₁d₁₅ + g₂ ) = t₀ + t₁ + t₂ + t₃ + t₄ + t₅ + t₆ + t₇ + (-g₂)
In the scaling, the recoded multiplier is -M (15 bits).

Md, Mx = t₀ + t₁ + t₂ + t₃ + t₄ + t₅ + t₆ + t₇
In the recurrence, the recoded multiplier is q_j+1 that is 11 bits (10 + 1 for sign) bits. In this case only 6 partial products are generated (t₀-t₅) and t₆ and t₇ are used to add the carry-save represented shifted residual rw[j].

w[j+1] = 512 w[j] - q_j+1z = t₀ + t₁ + t₂ + t₃ + t₄ + t₅ + rw_s + rw_c

The conversion block performs the conversion from the signed-digit quotient and the rounding. The carry-propagate adder is used to assimilate the carry save representation of Md = z and the final remainder. In addition, hardware to detect if the sum is zero (needed for the rounding) is provided in the carry-propagate adder.

The first implementation of the divider, which corresponds to the scheme of Figure 4.23, has a critical path of 10.5 ns (Figure 4.25) and a total execution time

t_div = 10.5 ×10 = 105 [ns] .

Figure 4.25: Critical path (ns) for basic implementation.

The energy dissipated by this standard implementation is reported in Table 4.11 and indicated in the column ßtandard". Figure 4.26 shows the percentage of energy dissipated in the blocks for the basic implementation of Figure 4.23.

Figure 4.26: Percentage of Energy dissipation in radix-512 divider.

4.5.2 Low-Power Implementation

For energy reduction in radix-512, we used a different approach than for the lower radices. From Figure 4.26, it is clear that most of the energy (about 60% of the total) is dissipated in the MultAdd block. This is not unexpected because MultAdd is the largest block and consists of several levels of CSAs. As a consequence, the distribution of the energy consumption is quite different from the dissipation in lower radices where, for example, the energy dissipation in the corresponding blocks (MULT and CSA) for radix-4 is less than 25% of the total. For this reason, many of the techniques presented in Chapter 3, which were developed for lower radices, are not very effective for the radix-512 divider.

Retiming by itself only reduces glitches at the input of MULT for lower radices, and in case of radix-512 retiming by itself is not much beneficial for MultAdd because the several levels of the tree of adders produce many glitches anyway.

Changing the redundant representation is a technique designed to reduce the energy dissipation in the registers by eliminating some flip-flops. Because it requires the propagation of the carry within a digit, this technique increase the energy dissipation in the adder, that however, for lower radices, does not offset the reductions obtained in the register. For radix-512, there is not sufficient time to propagate the carry in a log₂ 512 = 9 bit adder without increasing the critical path. The use of a radix-8 CSA (a radix-512 CSA could be decomposed in 3 radix-8 CSAs) might reduce the energy dissipated in the registers by about 2% of the whole energy consumption. But this reduction in the registers will be offset by the increase of glitches for the propagation of the carry in MultAdd.

Techniques such as equalizing the paths and using low-drive cells are not very effective for lower radices and are impractical for radix-512.

The techniques that can reduce significantly the energy dissipation in the radix-512 are the modification in the convert-and-round unit, disabling the CPA when not used, and using dual voltage in the recurrence to reduce the energy dissipated in MultAdd.

In addition to the techniques presented in Chapter 3, some work has been done in [45] to reduce the power dissipation in trees of adders. In [45], by using the redundancy in a 4:2 CSA (compressor), different configurations of the compressors are used to reduce the probability of transitions in the tree. However, experimental results in [45], showed that for a large tree of adders such as the one used in a 54 × 54 multiplier [7], the power savings are about 5%.

Disabling the clock in registers

The first modification applied to the radix-512 divider is to disable the clock in flip-flops that do not change. This is particularly advantageous in register M and Z which change once and three times respectively, per division (Figure 4.24). The reduction in the energy dissipated is about 2.0 nJ corresponding to 3% of the total divider.

Reductions in the CPA unit

The carry-propagate adder (CPA) is used twice during the division. A first time to assimilate the value of z in the third cycle, and a second time in the last cycle to determine the sign of the remainder (and if it is zero). The CPA is switched off by forcing a constant logic value at its inputs when it is not used. The reduction in energy with respect to the basic implementation is about 5%.

Reductions in the convert-and-round unit

In the basic implementation of the radix-512 divider, three registers (Q, QM, and QP) are necessary to store the partial quotient (r = 1). As explained in Section 3.10, the algorithm is modified as for lower radices, by eliminating the shifting of the digits previously loaded, the two registers QM and QP, and by recoding. Two additional registers are introduced: the 6-bit ring counter C to keep track of the digits to update and the temporary register T (10 bits) used in the recoding. The digit-decrementer, implemented both for the digits of Q and for register T, is a 9-bit ripple decrement-by-one circuit and its delay is about 4.0 ns. Note that the delay of the decrementer does not affects the critical path.

By implementing the modified algorithm in the convert-and-round unit, we reduce the number of flip-flops in the unit from 162 (registers Q, QM, QP) to 70 (registers Q, C, T). The energy consumption in the divider is reduced by about 10%.

4.5.3 Dual Voltage Implementation

Retiming the recurrence

As mentioned above, for radix-512 the purpose of retiming is to limit the critical path to a few bits and use dual voltage. In order to do so, we have to move the operations done on the MSBs (selection in mux2 and recoding) from the beginning of the cycle to the end of the cycle. This can be done as sketched in Figure 4.27.b by introducing an extra register to store the carry-save representation of the recoded operand.

Figure 4.27: Retiming of the recurrence.

However, the scheme of Figure 4.27 requires an additional initialization cycle to store in register R the recoded value of d₁₅ (d_REC). An alternative to the addition of the extra cycle is to take advantage of the fact that in cycles 2 and 3 the value to be recoded is the same (-M). In this case we can use a multiplexer to divert to MultAdd either the output or register R or directly the output of the recoder, as shown in Figure 4.28. This multiplexer (Mux-R) is controlled by a new signal DIVERT, set by the controller, that routes the MultAdd input signals. Table 4.10 shows the values of the signals in the unit in the first 4 cycles. Note that blocks Mux2 and Rec are replicated in the table for clarity.


	cycles
	1	2	3	4
REG M	-	-M	-M	-M
REG W	-	-	Md	w[0]
Mux2	d	-M
Rec	d_REC	-M_REC
REG R	-	-	-M_REC	[^y][1]_REC
Mux-R	d_REC	-M_REC	-M_REC	[^y][1]_REC
MultAdd	-M	Md	w[0] = Mx	w[1]
Mux2			[^y][1]	[^y][2]
Rec			[^y][1]_REC	[^y][2]_REC

DIVERT	0	0	1	1

Table 4.10: Operations and signal values in retimed unit.

Figure 4.28: Retimed recurrence with Mux-R.

Figure 4.29: Critical path (ns) after retiming.

The new multiplexer Mux-R is now on the critical path and the clock cycle must be lengthed to accommodate the additional 0.5 ns of its delay. The new critical path is shown in Figure 4.29. However, the solution with Mux-R is still advantageous over the solution with an extra cycle. In fact, the number of cycles for the radix-512 division (10) is quite small and the longer clock cycle increases the execution time by about 5 ns that is still a shorter time than one extra clock cycle required for the first solution.

Dual Voltage

After the retiming the 52 LSBs of MultAdd and register W can be redesigned for dual voltage. The time slack for those 52 LSBs is 4.3 ns that allows a minimum dual voltage V₂ = 2.3 V.

Furthermore, voltage can be also scaled in the convert-and-round unit resulting in reduction of energy dissipated of 40% with respect to the basic implementation of the divider.

4.5.4 Summary of Results for Radix-512

For the radix-512 divider, the retiming increases the critical path by about 5%. Because we want to reduce the energy without penalizing the performance, in this case, there is a tradeoff between smaller energy and longer delay. Since, as previously discussed, the only technique which reduces significantly the energy consumption in the recurrence is the use of dual voltage, we decided not to apply the other techniques (change redundant representation, using low-drive cells, etc.) and give up performance for small energy reductions. Table 4.11 reports the energy values for the three implementations:

standard is the basic implementation of Figure 4.23.
low-power is the implementation with reduced energy dissipation, but same delay as the basic. It is obtained by applying the low-power techniques mentioned above with the exception of retiming.
dual-voltage is the estimation of the implementation with retiming and dual voltage, but longer execution time.

Optimization with Synopsys Power Compiler was not performed because in the radix-512 divider the selection of the quotient digit is done by rounding.

Figure 4.30 shows the breakdown, as a percentage of the total, of the energy dissipated in the main blocks composing the unit.

	standard	low-power	dual voltage
blocks	nJ	nJ	(est.) nJ
control	1.0	1.0	1.0
clk tree	0.5	0.5	0.5
g table	0.4	0.4	0.4
mux 1	1.1	1.1	1.1
mux 2	0.6	0.6	0.8
mux 3	1.1	1.1	1.1
recoder	2.0	2.0	3.8
Mult	14.5	14.5	8.0
Add	22.5	22.5	12.5
registers W	6.6	5.5	^*3.6
register M	0.5	0.3	0.3
register Z	2.3	1.5	1.5
reg. R + mux-R	-	-	1.1
CPA	4.5	1.5	1.5
C&R unit	8.7	2.7	^*1.3
Total [nJ]	66.5	55.0	38.5
Ratio	1.00	0.85	0.60

Area [mm²]	6.0	6.4	-
T_cycle [ns]	11.0	11.0	11.5
t_div [ns]	110	110	115

Values marked ^* include level shifters.

Table 4.11: Energy-per-division for radix-512.

Figure 4.30: Percentage of energy dissipation in radix-512 divider.

File translated from T_EX by T_TH, version 1.1 and by M_E. Last Modified : Fri Jul 9 11:14:35 PDT 1999