4.4  Radix-16 Division

4.4.1  Algorithm and Implementation

The radix-16 division algorithm is implemented by the residual recurrence

w[j+1] = 16w[j] - qj+1d j = 0,1, ¼13
with initial value w[0] = x , and quotient given by

q = 14
å
j = 1 
qj 16-j
(20)
Two additional cycles, for initialization and rounding, are required to produce the quotient in conventional representation (53 bits for IEEE double-precision) for a total of 16 cycles. As usual, both d and x are normalized in [0.5, 1) and x < d.

The radix-16 division unit is obtained by overlapping the computation of two radix-4 digits [30]. Consequently, the quotient digit is split into two parts qH and qL such that qj = 4qH + qL with digit set {-2,-1,0,1,2} in each part, resulting in the digit-set [-10,10] for qj (a = 10). The quotient digit is determined, at each iteration, by the selection function depicted in Figure 4.17. Once the digit qH is chosen, its value is used to select among all the possible combinations of qH d. The redundancy factor is r = [a/(r-1)] = [2/3]. The residual w[j] is stored in carry-save representation (wS and wC). The signed-digit representation of the quotient is converted to conventional two's complement representation and rounded by the on-the-fly convert-and-round unit.

Figure 4.17: Selection function for radix-16.

The implementation of the standard divider, optimized for shortest latency, is shown in Figure 4.18. Table 4.6 shows the delay through the two parts of SEL. Note that the larger delay of SELqL is compensated by the additional carry-save adder in the recurrence (Figure 4.18) in the path from SELqH.

The critical path post-layout is 9.2 ns and 16 cycles are required to complete the operation, corresponding to a latency of 150 ns. The energy dissipated by this basic implementation is Ediv = 46.0 nJ and the contributions of the blocks is shown in Table 4.9 in column ßtandard".

path delay [ns]
qL SELqL - MULT - HA - REG
5.7 + 1.4 + 0.6 + 1.5 = 9.2
qH SELqH - MULT - HA - FA - REG
4.0 + 1.4 + 0.6 + 1.1 + 1.5 = 8.6

Table 4.6: Critical path through qL and qH.

Figure 4.18: Basic implementation radix-16.

4.4.2  Low-Power Implementation

The retiming of the recurrence is done by moving the selection function from the first part of the cycle to the last part of the previous cycle. Two 4-bit registers are needed to store the quotient digit. After the retiming the critical path is limited to the 10 most-significant bits of the recurrence. As shown in Figure 4.19, in order not to increase the cycle delay in the retimed unit, the clock of register qL is skewed.

Since in the retimed implementation the selection function is placed after the second CSA, instead of directly after the registers, there is a large increase in the number of glitches, which are responsible for the increased dissipation of the selection function. One way to filter those glitches is to buffer the selection function with multiplexers acting as latches, as described in Figure 3.13 at page pageref. The select signal is driven by a different clock (same period, different phase) that enables the muxes to transmit the value from the CSA when it is stable, and hold the current value otherwise. However, in this case the delay of the mux affects the critical path. For radix-16 the energy dissipated in the selection function is halved, but the critical path is increased by about 5%. For this reason, the value is not included in Table 4.9.

For the 44 least-significant bits the radix-2 CSA is replaced by a radix-16 CSA (R16 CSA) that for each digit of the radix stores only one carry bit. Figure 4.20 shows the radix-16 CSA and Table 4.7 explains how the two level of adders are connected to produce the correct residual. The number of flip-flops in register Wc is reduced from 57 to 25.

Figure 4.19: Retiming and critical path.

Figure 4.20: Radix-16 CSA.

iter. j iteration j+1
first level second level

ws0

0 maSma0S0 0 mbSmb0 ws0
ws1 0 ma1S1 0 mb1 ws1

ws2

wc2 4ws0 ma2S2 4S0 mb2 ws2 wc2
ws3 4ws1 ma3S3 4S1 mb3 ws3

ws4

4ws24wc2ma4S4C44S2 mb4 ws4
ws5 4ws3 ma5S5 4S3 mb5 ws5

ws6

wc6 4ws4 ma6S6 4S4 4C4 mb6 ws6 wc6
ws7 4ws5 ma7S7 4S5 mb7 ws7

ws8

4ws64wc6ma8S8C84S6 mb8 ws8
ws9 4ws7 ma9S9 4S7 mb9 ws9

ws10

wc104ws8 ma10S10 4S8 4C8 mb10 ws10 wc10
ws11 4ws9 ma11S11 4S9 mb11 ws11
...... ...... ......

Table 4.7: Bit arrangement in two-level adders.

In order not to increase the cycle time when using radix-16 CSA the two paths shown in Table 4.8 should have the same delay. Therefore, the condition to be satisfied is:

tR16 CSA £ tHA + tSELqL
2
= 3.15 ns
The easiest way to implement this R16 CSA is by using a 4-bit carry-ripple adder. The corresponding delay, in our library, is

tSn = tC1 + 3tripple + tHA = 0.8 + 3(0.35) + 0.5 ns @ 2.0 ns .

MSBs: MULT HA SEL QL REG
LSBs: MULT R16 CSA R16 CSA REG

Table 4.8: Paths in MSBs and LSBs in the recurrence.

Furthermore, the 7 most-significant bits assimilated in the selection function could be stored in the register Ws, saving 7 additional flip-flops. However, in the radix-16 case, the assimilated value must be selected among the 5 possible alternatives (see Figure 4.17) and this requires an additional multiplexer driven by qH that increases the load both on qH and qL. For this reason the 7 MSBs bits are stored in carry-save representation.

In the original on-the-fly conversion and rounding algorithm the partial quotient is stored in two registers (Q and QM). By implementing the modified algorithm, with r = [2/3], only register Q (54 bits) and register C (14 bits) are needed. With the implementation of the modified algorithm the number of flip-flops in the convert-and-round unit is reduced from 108 to 69 and the power dissipated from 10.7nJ to 2.4nJ, resulting in a reduction of about 20% in the whole divider.

4.4.3  Dual Voltage Implementation

When applying dual voltage to the recurrence, the two cases with a radix-2 and a radix-16 CSA must be considered. As explained in Section 3.6, the time slack is longer for the radix-2 CSA implementation and the possible voltage V2 is lower. In particular, for the radix-16 divider (critical path is 9.2 ns), we get the following values:

radix-2 CSA
tslack = 4.6 ns Þ V2 = 2.0 V Þ Ediv = 22 nJ.
radix-16 CSA
tslack = 1.8 ns Þ V2 = 2.7 V. Þ Ediv = 27 nJ.

The values of Ediv indicated above take into account both the different number of flip-flops in register Wc and the voltage scaling in the convert-and-round unit.

It is clear that the implementation with radix-2 CSA is the most convenient for dual voltage.

4.4.4  Optimization with Synopsys Power Compiler

The synthesized selection function showed a delay of 4.0 ns through qL, and 3.0 ns through qH with delay constraints met. In addition the reduction in energy dissipation, obtained by incremental compilation with power dissipation constraints resulted to be about 25%.

4.4.5  Summary of Results for Radix-16

Table 4.9 reports the average energy dissipation and area for the standard and the low-power implementation, which is shown in Figure 4.21.

Figure 4.21: Low-power radix-16 divider.

standard low-power synopsys dual voltage synopsys
blocks nJnJ (est.) nJ (est.) nJ (est.) nJ

control

0.5 0.5 0.5
clk tree 0.5 0.5 0.5

mux

2.6 0.4 0.4
mul. gen. H 2.5 1.6 0.8
CSA H 3.3 3.3 1.9
mul. gen. L 2.7 1.8 1.0
CSA L 5.0 4.3 2.5
sel. func. 5.9 8.2 6.1 8.2 6.1
register Ws 4.4 4.3 *2.1
register Wc 4.2 1.6 *1.9
register qH - 0.2 0.2
register qL - 0.2 0.2

SZD

3.7 0.6 0.6
C&R unit 10.7 2.4 *0.9

Ediv  [nJ]

46.0 30.0 28.0 22.0 20.0
Ratio 1.00 0.65 0.60 0.50 0.45

Area [mm2]

2.2 1.8 - - -
Values marked * include level shifters.

Table 4.9: Energy-per-division for radix-16.

Figure 4.22 shows the breakdown, as a percentage of the total, of the energy dissipated in the main blocks composing the unit.

Figure 4.22: Percentage of energy dissipation in radix-16 divider.

In the std the largest part of energy is dissipated in the convert-and-round unit (about 25%). With the application of the energy reduction techniques, the energy dissipated in C&R unit is reduced to less than 10% of the total for l-p and less than 5% for d-v. On the other hand the contribution of the selection function to the total energy dissipation increases up to 35% of the total in d-v. By optimizing the selection function with Synopsys Power Compiler, a reduction of 25% in the block would reflect in a contribution of about 30% of the total in d-v.


File translated from TEX by TTH, version 1.1 and by ME. Last Modified : Fri Jul 9 11:14:34 PDT 1999