Alberto Nannarelli Ph.D. Dissertation

4.3 Radix-8 Division

4.3.1 Algorithm and Basic Implementation

The radix-8 division algorithm is implemented by the residual recurrence

w[j+1] = 8w[j] - q_j+1d j = 0,1, Ľm

with initial value w[0] = x , where x is the dividend, d the divisor, and q_j+1 the quotient digit at the j-th iteration. Both d and x are normalized in [0.5, 1). The quotient digit is in signed-digit representation {-a, Ľ, -1,0,1, Ľ, a} with redundancy factor r = a/7. The residual w[j] is stored in carry-save representation (w_S and w_C). The quotient digit is determined, at each iteration, by the selection function

q_j = SEL(d_d,

^
y

)

where d_d is d truncated after the d-th fractional bit and [^y] = 8w_S + 8w_C truncated after t fractional bits.

In order to avoid the implementation of a complicated multiple generator, the quotient digit is split into two parts q_H with weight 4 and q_L with weight 1 (q_j = 4q_H + q_L) and the digit set of each part is reduced to {-2,-1,0,1,2}.

Since the selection function (SEL) is in the critical path, to have the minimum latency we have to minimize its delay. We explored the implementation of three possible values of a: 6, 7, and 10 (the maximum value possible with the above mentioned representation). Table 4.2 shows a summary of the results. The gate-level implementation was obtained by synthesizing the VHDL description of the selection function with COMPASS ASICSynthesizer. This includes both the assimilation of the carry-save representation of [^y] and the actual digit-selection function.

	bits of		delay [ns]
a	d	[^y]	q_L	q_H	gates
10	3	6	4.0	3.0	325
7	3	8	3.8	3.0	370
6	5	7	3.8	2.7	420

Table 4.2: Radix-8: summary selection function.

From Table 4.2, we can see that SEL for a = 7 is as fast as for a = 6, but its area is smaller⁴. Surprisingly, the delay for the over-redundant case a = 10 is larger. Therefore, the SEL for a = 7 is chosen, which results in a redundancy factor r = 1. The selection logic function is described in Table 4.3.

m_h	d_d
	8/16	9/16	10/16	11/16	12/16	13/16	14/16	15/16
m₇	27	30	34	36	40	44	48	48
m₆	23	26	28	32	34	36	40	40
m₅	18	20	24	24	28	28	32	32
m₄	14	16	18	20	20	24	24	24
m₃	10	12	12	12	16	16	16	16
m₂	6	8	8	8	8	8	8	8
m₁	0	0	0	0	0	0	0	0
m₀	-4	-4	-4	-4	-4	-4	-4	-4
m_-1	-8	-8	-8	-8	-8	-8	-12	-12
m_-2	-12	-12	-12	-16	-16	-16	-16	-20
m_-3	-16	-16	-20	-20	-24	-24	-24	-28
m_-4	-20	-22	-24	-26	-28	-32	-32	-36
m_-5	-24	-26	-30	-32	-36	-36	-40	-44
m_-6	-28	-31	-34	-38	-40	-44	-48	-52

Values in table are multiplied by 16

Table 4.3: Selection function for radix-8 and a = 7.

A first implementation of the divider is shown in Figure 4.10. The scheme is completed by a controller (not depicted in the figure).

Figure 4.10: Implementation of the radix-8 divider.

To have the divider compliant with IEEE standard for double-precision while operating with fractional values, 1-bit shifts are performed on the operands. Moreover, to have a bound residual in the first iteration (w[0] = x < d ), when x ł d we shift x one bit to the right obtaining a fractional quotient. To compute the 53 bits of the quotient and an additional bit to perform rounding, 54/3 = 18 iterations are required. An additional cycle is required to load the value x as first residual w[0]. However, for the proposed architecture and selection function, the simplest way to accomplish this is to do as follows:

Clear the registers for w (this is done at the end of the previous division). With the selection function we have implemented, this produces a q₁ = 1.
Compute w[1] = x - d using the hardware for the recurrence. This requires a multiplexer, which is not on the critical path.
For q₁ to be 1, we shift the dividend three bits to the right. As a consequence, it is necessary to shift the final quotient accordingly,

In conclusion, the load cycle is substituted by an extra iteration for a total of 20 iterations: 19 to compute the digits and one for the rounding. Finally, the quotient is normalized in [1,2) by shifting it four positions to the left. Note that all shifts are done by wiring and do not affect the latency of the operation. In the recurrence (w[j]) we need 54 fractional bits and 2 integer bits: one to hold the sign and the other to avoid the overflow in the CS-representation (being r = 1).

There are two possible critical paths, one going through q_H and the other through q_L. Since the delay of q_H is smaller than that of q_L, but the number of adders to traverse is larger, a good design tries to equalize the delays of both paths. The resulting critical paths are

SEL(q_L) + mult + HA + reg
4.5 + 1.4 + 0.6 + 1.5 = 8.0 ns.
SEL(q_H) + mult + HA + FA + reg
3.6 + 1.4 + 0.6 + 0.9 + 1.5 = 8.0 ns.

In conclusion, the post-layout critical path is 8.0 ns. The energy dissipated by this basic implementation is E_div = 47.5 nJ and the contributions of the blocks is shown in Table 4.4 in column ßtandard".

4.3.2 Low-Power Implementation

For the recurrence, the retiming is done by moving the selection function of Figure 4.10 from the first part of the cycle to the last part of the previous cycle (see Figure 4.11.a and Figure 4.11.b). Two new registers (q_H and q_L) are needed to store the quotient digit.

However the retiming alters the critical path because the two paths through q_H and q_L have different delays, and now the delay of q_L is added to the delay of the two CSAs (see Figure 4.11.b). To reduce the critical path to the previous value we skew the clock of register q_L by the delay difference of the two paths, which, in this case, corresponds to the delay of the CSA, as shown in Figure 4.11.c. The clock can be skewed by adding the appropriate delay (e.g. some buffers) in the clock distribution tree.

Figure 4.11: Retiming and critical path.

c) after retiming and skewing the clock.

After the retiming the multiplexer is moved out of the recurrence, as shown in Figure 4.15. Consequently, the operations in the first cycle are modified by resetting registers q_H and q_L to 0 and -1 respectively and by storing x in w[0] = 0 - (-x).

By using a radix-8 carry-save representation as shown in Figure 4.12, we only need to store one carry bit for each digit, instead of three. This can be done for the 50 LSBs that, after the retiming, are not on the critical path. The eight MSBs, assimilated in the adder inside the selection function block (Figure 4.15), can be stored in w_S eliminating another eight flip-flops in w_C.

Figure 4.12: Radix-8 carry-save adder (lower).

By retiming and changing the representation, the reduction in l-p with respect to std is about 10%.

Figure 4.13: Partitioned selection function.

The quotient-digit selection is a function of three bits of the divisor and eight of the residual. In the radix-8 case, Figure 4.13 shows the partitioning in eight parts (all the possible values of d₃) for both the higher and lower parts. By partitioning the selection function, we could obtain a reduction of 40% in the energy dissipated in SEL, but at the expense of a larger clock cycle (about 10%), due to additional delay of the demultiplexer and the OR gates, and area. For this reason, the corresponding value is not included in Table 4.4.

In the original formulation of the on-the-fly conversion algorithm, three registers (Q, QM, and QP) are necessary to store the partial quotient for r = 1. As explained in Section 3.10, the algorithm is modified by eliminating the shifting of the digits previously loaded, the two registers QM and QP, and by recoding. Two additional registers are introduced: the ring counter C to keep track of the digits to update and the temporary register T used in the recoding (Figure 4.14). In this way, we reduce the number of flip-flops in the convert-and-round unit from 171 (registers Q, QM, QP) to 81 (registers Q, C, T).

Figure 4.14: Convert-and-round unit for radix-8 divider

4.3.3 Dual Voltage Implementation

For the radix-8 divider, the 50 least-significant bits in the recurrence can be redesigned for low voltage. We can apply the dual-voltage technique also to the convert-and-round unit which is not in the critical path.

When applying dual voltage to the recurrence, the two cases with radix-2 and radix-8 CSA must be considered. As explained in Section 3.6, the time slack is longer for the radix-2 CSA implementation and the possible voltage V₂ is lower. In particular for the radix-8 divider (critical path is 8.0 ns), we obtain the following values:

radix-2 CSA: t_slack = 3.5 ns Ţ V₂ = 2.0 V Ţ E_div = 20 nJ.
radix-8 CSA: t_slack = 1.2 ns Ţ V₂ = 3.0 V. Ţ E_div = 26 nJ.

The values of E_div indicated above take into account both the different number of flip-flops in register Wc and the voltage scaling in the convert-and-round unit.

It is clear that the implementation with radix-2 CSA is the most convenient for dual voltage. The reduction in d-v is about 30% with respect to l-p.

4.3.4 Optimization with Synopsys Power Compiler

Because the synthesis of large circuits does not give results as good as manual design, we synthesized only the selection function using Synopsys Power Compiler. The synthesis of the selection function showed a critical path (through q_L) of 3.0 ns (critical path for SEL in Passport/COMPASS implementation is 4.5 ns), while the path through q_H was of 2.6 ns (3.6 ns for Passport/COMPASS). The energy reduction in the selection function, obtained by incremental compilation with power dissipation constraints was very little, about 5%.

4.3.5 Summary of Results for Radix-8

Figure 4.15 shows the implementation of the low-power radix-8 divider and Table 4.4 summarizes the results obtained by applying the low-power techniques described above. We did not include in the table the estimation of a possible implementation with Synopsys Power Compiler because the reduction of energy in the selection function is less than 1% of the total.

Figure 4.15: Low-power implementation of the radix-8 divider.

	standard	low-power	dual voltage
blocks	nJ	nJ	(est.) nJ
control	0.6	0.6	0.6
clk tree	0.4	0.4	0.4
mux	1.4	0.2	0.2
mul. gen. H	3.1	2.2	1.1
CSA H	4.4	4.4	2.2
mul. gen. L	2.6	2.2	1.1
CSA L	6.0	4.8	2.4
sel. func.	3.6	4.6	4.6
register Ws	4.2	4.0	^*2.2
register Wc	4.2	1.2	^*2.0
register q_H	-	0.2	0.2
register q_L	-	0.2	0.2
SZD	3.8	0.6	0.6
C&R unit	13.4	2.8	^*1.0
Total [nJ]	47.5	28.5	19.0
Ratio	1.00	0.60	0.40

Area [mm²]	2.2	1.8	-

Values marked ^* include level shifters.

Table 4.4: Energy-per-division for radix-8.

Figure 4.16 shows the breakdown, as a percentage of the total, of the energy dissipated in the main blocks composing the unit.

Figure 4.16: Percentage of energy dissipation in radix-8 divider.

In the l-p implementation the largest part of the energy is dissipated in the CSAs (more than 30%), while in the d-v estimate the largest portion is equally distributed (about 25% of the total for each block) among the two CSAs, the two registers W and the selection function.

4.3.6 Comparison with scheme with overlapped radix-2
stages

In [34] a radix-8 divider is implemented by overlapping three radix-2 stages and computing the quotient digits in parallel. Moreover, the next partial remainder (w_s and w_c) is calculated speculatively for each possible quotient digit. This scheme, indicated here as r8overlap, is implemented in the Sun UltraSPARC FP-unit. As described in [34], the critical path is: 1 ×SEL + 2 ×CSA + 3 ×MUX

In order to compare the r8overlap division unit with our radix-8 divider, we made the following assumptions:

The CSA (or FA) has the same delay (0.8 ns) in both implementations.
The multiple generator is equivalent to the 3:1 MUX of r8overlap.
We implemented the radix-2 selection function of [34] with our library and obtained a delay of 1.9 ns.
Buffering of one of the MUXes is required.

With these assumptions, we can reasonably estimate the pre-layout critical path of r8overlap as:

SEL + 2 CSA + 3 MUX + buff. + reg.
1.9 + 1.6 + 1.5 + 0.7 + 1.3 = 7.0 ns

This is similar to the critical path (pre-layout) of the radix-8 unit described here:

SEL(q_L) + mult + HA + reg
3.8 + 1.2 + 0.5 + 1.3 = 6.8 ns.
SEL(q_H) + mult + HA + FA + reg
3.0 + 1.2 + 0.5 + 0.8 + 1.3 = 6.8 ns.

As for the area, Table 4.5 shows a comparison of the number of the wider (bitwise) blocks. Register QN in r8overlap can be eliminated by introducing register C. However it is not possible to change the representation of w_C (e.g. reducing the size of register Wc) without penalizing the performance. Moreover, the resulting selection function of the overlapped implementation is about twice as large as the radix-8 SEL. In conclusion, it is reasonable to assume that the area of our divider is significantly smaller.

	r8overlap		radix-8
no. CSAs	6		2
no. mux/mult	6		2
no. registers	4	(Ws, Wc, Q, QN)	2.7	(Ws, Wc/3, Q, C)

Table 4.5: Area comparison.

We don't have data available on the energy consumption of the r8overlap division unit, but considering the larger area (larger current drawn) and roughly the same operation latency, we conclude that the energy dissipation is smaller in our implementation. Furthermore, the energy reduction techniques applied in the radix-8 divider might not be effective in the r8overlap scheme.

Footnotes:

⁴ Smaller area implies smaller capacitance and usually reduced energy dissipation.

File translated from T_EX by T_TH, version 1.1 and by M_E. Last Modified : Fri Jul 9 11:14:33 PDT 1999


SEL(q_L)		+ mult	+ HA	+ reg
4.5		+ 1.4	+ 0.6	+ 1.5	= 8.0 ns.
SEL(q_H)	+ mult	+ HA	+ FA	+ reg
3.6	+ 1.4	+ 0.6	+ 0.9	+ 1.5	= 8.0 ns.

SEL	+ 2 CSA	+ 3 MUX	+ buff.	+ reg.
1.9	+ 1.6	+ 1.5	+ 0.7	+ 1.3 = 7.0 ns