4.3  Radix-8 Division

4.3.1  Algorithm and Basic Implementation

The radix-8 division algorithm is implemented by the residual recurrence

w[j+1] = 8w[j] - qj+1d j = 0,1, ¼m
with initial value w[0] = x , where x is the dividend, d the divisor, and qj+1 the quotient digit at the j-th iteration. Both d and x are normalized in [0.5, 1). The quotient digit is in signed-digit representation {-a, ¼, -1,0,1, ¼, a} with redundancy factor r = a/7. The residual w[j] is stored in carry-save representation (wS and wC). The quotient digit is determined, at each iteration, by the selection function
qj = SEL(dd, ^
y
 
)
where dd is d truncated after the d-th fractional bit and [^y] = 8wS + 8wC truncated after t fractional bits.

In order to avoid the implementation of a complicated multiple generator, the quotient digit is split into two parts qH with weight 4 and qL with weight 1 (qj = 4qH + qL) and the digit set of each part is reduced to {-2,-1,0,1,2}.

Since the selection function (SEL) is in the critical path, to have the minimum latency we have to minimize its delay. We explored the implementation of three possible values of a: 6, 7, and 10 (the maximum value possible with the above mentioned representation). Table 4.2 shows a summary of the results. The gate-level implementation was obtained by synthesizing the VHDL description of the selection function with COMPASS ASICSynthesizer. This includes both the assimilation of the carry-save representation of [^y] and the actual digit-selection function.

bits of delay [ns]
a d [^y] qL qH gates
10 3 6 4.0 3.0 325
7 3 8 3.8 3.0 370
6 5 7 3.8 2.7 420

Table 4.2: Radix-8: summary selection function.

From Table 4.2, we can see that SEL for a = 7 is as fast as for a = 6, but its area is smaller4. Surprisingly, the delay for the over-redundant case a = 10 is larger. Therefore, the SEL for a = 7 is chosen, which results in a redundancy factor r = 1. The selection logic function is described in Table 4.3.

mh dd
8/16 9/16 10/16 11/16 12/16 13/16 14/16 15/16

m7

27 30 34 36 40 44 48 48
m6 23 26 28 32 34 36 40 40
m5 18 20 24 24 28 28 32 32
m4 14 16 18 20 20 24 24 24
m3 10 12 12 12 16 16 16 16
m2 6 8 8 8 8 8 8 8
m1 0 0 0 0 0 0 0 0
m0 -4 -4 -4 -4 -4 -4 -4 -4
m-1 -8 -8 -8 -8 -8 -8 -12 -12
m-2 -12 -12 -12 -16 -16 -16 -16 -20
m-3 -16 -16 -20 -20 -24 -24 -24 -28
m-4 -20 -22 -24 -26 -28 -32 -32 -36
m-5 -24 -26 -30 -32 -36 -36 -40 -44
m-6 -28 -31 -34 -38 -40 -44 -48 -52
Values in table are multiplied by 16

Table 4.3: Selection function for radix-8 and a = 7.

A first implementation of the divider is shown in Figure 4.10. The scheme is completed by a controller (not depicted in the figure).

Figure 4.10: Implementation of the radix-8 divider.

To have the divider compliant with IEEE standard for double-precision while operating with fractional values, 1-bit shifts are performed on the operands. Moreover, to have a bound residual in the first iteration (w[0] = x < d ), when x ³ d we shift x one bit to the right obtaining a fractional quotient. To compute the 53 bits of the quotient and an additional bit to perform rounding, 54/3 = 18 iterations are required. An additional cycle is required to load the value x as first residual w[0]. However, for the proposed architecture and selection function, the simplest way to accomplish this is to do as follows:

In conclusion, the load cycle is substituted by an extra iteration for a total of 20 iterations: 19 to compute the digits and one for the rounding. Finally, the quotient is normalized in [1,2) by shifting it four positions to the left. Note that all shifts are done by wiring and do not affect the latency of the operation. In the recurrence (w[j]) we need 54 fractional bits and 2 integer bits: one to hold the sign and the other to avoid the overflow in the CS-representation (being r = 1).

There are two possible critical paths, one going through qH and the other through qL. Since the delay of qH is smaller than that of qL, but the number of adders to traverse is larger, a good design tries to equalize the delays of both paths. The resulting critical paths are

SEL(qL) + mult + HA + reg
4.5 + 1.4 + 0.6 + 1.5 = 8.0 ns.
SEL(qH) + mult + HA+ FA + reg
3.6 + 1.4 + 0.6 + 0.9 + 1.5 = 8.0 ns.

In conclusion, the post-layout critical path is 8.0 ns. The energy dissipated by this basic implementation is Ediv = 47.5 nJ and the contributions of the blocks is shown in Table 4.4 in column ßtandard".

4.3.2  Low-Power Implementation

For the recurrence, the retiming is done by moving the selection function of Figure 4.10 from the first part of the cycle to the last part of the previous cycle (see Figure 4.11.a and Figure 4.11.b). Two new registers (qH and qL) are needed to store the quotient digit.

However the retiming alters the critical path because the two paths through qH and qL have different delays, and now the delay of qL is added to the delay of the two CSAs (see Figure 4.11.b). To reduce the critical path to the previous value we skew the clock of register qL by the delay difference of the two paths, which, in this case, corresponds to the delay of the CSA, as shown in Figure 4.11.c. The clock can be skewed by adding the appropriate delay (e.g. some buffers) in the clock distribution tree.

Figure 4.11: Retiming and critical path.

c) after retiming and skewing the clock.

After the retiming the multiplexer is moved out of the recurrence, as shown in Figure 4.15. Consequently, the operations in the first cycle are modified by resetting registers qH and qL to 0 and -1 respectively and by storing x in w[0] = 0 - (-x).

By using a radix-8 carry-save representation as shown in Figure 4.12, we only need to store one carry bit for each digit, instead of three. This can be done for the 50 LSBs that, after the retiming, are not on the critical path. The eight MSBs, assimilated in the adder inside the selection function block (Figure 4.15), can be stored in wS eliminating another eight flip-flops in wC.

Figure 4.12: Radix-8 carry-save adder (lower).

By retiming and changing the representation, the reduction in l-p with respect to std is about 10%.

Figure 4.13: Partitioned selection function.

The quotient-digit selection is a function of three bits of the divisor and eight of the residual. In the radix-8 case, Figure 4.13 shows the partitioning in eight parts (all the possible values of d3) for both the higher and lower parts. By partitioning the selection function, we could obtain a reduction of 40% in the energy dissipated in SEL, but at the expense of a larger clock cycle (about 10%), due to additional delay of the demultiplexer and the OR gates, and area. For this reason, the corresponding value is not included in Table 4.4.

In the original formulation of the on-the-fly conversion algorithm, three registers (Q, QM, and QP) are necessary to store the partial quotient for r = 1. As explained in Section 3.10, the algorithm is modified by eliminating the shifting of the digits previously loaded, the two registers QM and QP, and by recoding. Two additional registers are introduced: the ring counter C to keep track of the digits to update and the temporary register T used in the recoding (Figure 4.14). In this way, we reduce the number of flip-flops in the convert-and-round unit from 171 (registers Q, QM, QP) to 81 (registers Q, C, T).

Figure 4.14: Convert-and-round unit for radix-8 divider

4.3.3  Dual Voltage Implementation

For the radix-8 divider, the 50 least-significant bits in the recurrence can be redesigned for low voltage. We can apply the dual-voltage technique also to the convert-and-round unit which is not in the critical path.

When applying dual voltage to the recurrence, the two cases with radix-2 and radix-8 CSA must be considered. As explained in Section 3.6, the time slack is longer for the radix-2 CSA implementation and the possible voltage V2 is lower. In particular for the radix-8 divider (critical path is 8.0 ns), we obtain the following values:

radix-2 CSA
tslack = 3.5 ns Þ V2 = 2.0 V Þ Ediv = 20 nJ.
radix-8 CSA
tslack = 1.2 ns Þ V2 = 3.0 V. Þ Ediv = 26 nJ.

The values of Ediv indicated above take into account both the different number of flip-flops in register Wc and the voltage scaling in the convert-and-round unit.

It is clear that the implementation with radix-2 CSA is the most convenient for dual voltage. The reduction in d-v is about 30% with respect to l-p.

4.3.4  Optimization with Synopsys Power Compiler

Because the synthesis of large circuits does not give results as good as manual design, we synthesized only the selection function using Synopsys Power Compiler. The synthesis of the selection function showed a critical path (through qL) of 3.0 ns (critical path for SEL in Passport/COMPASS implementation is 4.5 ns), while the path through qH was of 2.6 ns (3.6 ns for Passport/COMPASS). The energy reduction in the selection function, obtained by incremental compilation with power dissipation constraints was very little, about 5%.

4.3.5  Summary of Results for Radix-8

Figure 4.15 shows the implementation of the low-power radix-8 divider and Table 4.4 summarizes the results obtained by applying the low-power techniques described above. We did not include in the table the estimation of a possible implementation with Synopsys Power Compiler because the reduction of energy in the selection function is less than 1% of the total.

Figure 4.15: Low-power implementation of the radix-8 divider.

standard low-power dual voltage
blocks nJnJ (est.) nJ
control 0.6 0.6 0.6
clk tree 0.4 0.4 0.4

mux

1.4 0.2 0.2
mul. gen. H 3.1 2.2 1.1
CSA H 4.4 4.4 2.2
mul. gen. L 2.6 2.2 1.1
CSA L 6.0 4.8 2.4
sel. func. 3.6 4.6 4.6
register Ws 4.2 4.0 *2.2
register Wc 4.2 1.2 *2.0
register qH - 0.2 0.2
register qL - 0.2 0.2

SZD

3.8 0.6 0.6
C&R unit 13.4 2.8 *1.0

Total [nJ]

47.5 28.5 19.0
Ratio 1.00 0.60 0.40

Area [mm2]

2.2 1.8 -
Values marked * include level shifters.

Table 4.4: Energy-per-division for radix-8.

Figure 4.16 shows the breakdown, as a percentage of the total, of the energy dissipated in the main blocks composing the unit.

Figure 4.16: Percentage of energy dissipation in radix-8 divider.

In the l-p implementation the largest part of the energy is dissipated in the CSAs (more than 30%), while in the d-v estimate the largest portion is equally distributed (about 25% of the total for each block) among the two CSAs, the two registers W and the selection function.

4.3.6  Comparison with scheme with overlapped radix-2
stages

In [34] a radix-8 divider is implemented by overlapping three radix-2 stages and computing the quotient digits in parallel. Moreover, the next partial remainder (ws and wc) is calculated speculatively for each possible quotient digit. This scheme, indicated here as r8overlap, is implemented in the Sun UltraSPARC FP-unit. As described in [34], the critical path is: 1 ×SEL + 2 ×CSA + 3 ×MUX

In order to compare the r8overlap division unit with our radix-8 divider, we made the following assumptions:

With these assumptions, we can reasonably estimate the pre-layout critical path of r8overlap as:

SEL + 2 CSA + 3 MUX + buff. + reg.
1.9 + 1.6 + 1.5 + 0.7 + 1.3 = 7.0 ns

This is similar to the critical path (pre-layout) of the radix-8 unit described here:

SEL(qL) + mult + HA + reg
3.8 + 1.2 + 0.5 + 1.3 = 6.8 ns.
SEL(qH) + mult + HA+ FA + reg
3.0 + 1.2 + 0.5 + 0.8 + 1.3 = 6.8 ns.

As for the area, Table 4.5 shows a comparison of the number of the wider (bitwise) blocks. Register QN in r8overlap can be eliminated by introducing register C. However it is not possible to change the representation of wC (e.g. reducing the size of register Wc) without penalizing the performance. Moreover, the resulting selection function of the overlapped implementation is about twice as large as the radix-8 SEL. In conclusion, it is reasonable to assume that the area of our divider is significantly smaller.

r8overlap radix-8
no. CSAs 6 2
no. mux/mult 6 2
no. registers 4 (Ws, Wc, Q, QN) 2.7 (Ws, Wc/3, Q, C)

Table 4.5: Area comparison.

We don't have data available on the energy consumption of the r8overlap division unit, but considering the larger area (larger current drawn) and roughly the same operation latency, we conclude that the energy dissipation is smaller in our implementation. Furthermore, the energy reduction techniques applied in the radix-8 divider might not be effective in the r8overlap scheme.


Footnotes:

4 Smaller area implies smaller capacitance and usually reduced energy dissipation.


File translated from TEX by TTH, version 1.1 and by ME. Last Modified : Fri Jul 9 11:14:33 PDT 1999