Alberto Nannarelli Ph.D. Dissertation

4.6 Radix-4 Combined Division and Square Root

4.6.1 Algorithm and Implementation

For radix-4, expression (2.6) and expression (2.7) are rewritten as

w[j+1] = 4w[j] + F[j] j = 0,2, ¼26

(23)

and

F[j] =

ì
ï
ï
í
ï
ï
î

-q_j+1d

(division)

-(S[j]s_j+1 +

4^-(j+1)s²_j+1)

(square root)

(24)

The selection function is

q_j+1 = SEL(d_d,

^
y

)

(division)

s_j+1 = SEL(

^
S

[j],

^
y

)

(square root)

where d and S[j] are truncated after 4 fractional bits, and [^y] = 4w_S + 4w_C is truncated after 4 fractional bits. The result digit is in signed-digit representation {-2,-1,0,1,2} with redundancy factor r = [2/3] and the residual w[j] is stored in carry-save representation (w_S and w_C) to reduce the iteration time. The value S[j] is obtained by the on-the-fly conversion algorithm. In the on-the-fly conversion, two variables A and B are required. They are updated, in every iteration, as follows:

A[j] = S[j] and B[j] = S[j] - r^-j

The number of bits required in the recurrence is 55 fractional plus 1 integer for a total of 56 bits. To execute the operation 28 cycles are required for the iterations plus one additional cycle for the rounding, for a total of 29 clock cycles. The block diagram of the basic implementation is shown in Figure 4.31.

Figure 4.31: Radix-4 combined division/square root unit.

Implementation of block DSMUX

Block DSMUX selects the inputs to block FGEN according to the operation (division or square root) under execution. If the operation is a division, the value to provide to FGEN is d, while for square root the value to provide to FGEN is the partial result S[j], which coincides with A[j]. If in the conversion block we implement the algorithm described in Chapter 3 Section 3.10, the value of B can be derived from A and the state in register C. In conclusion, in block DSMUX we utilize the inputs (d, A, and C) to obtain the desired outputs (A_OUT, B) according to Table 4.12.

operation

A_OUT

division

square root

[`(C_i)]A_2i+1 + C_i( A_2i+1 ÅA_2i)

(odd bits)

[`(C_i)]A_2i + C_i[`(A_2i)]

(even bits)

i = m, ¼, 1, 0

Z_i refers to bit in position i in vector Z, being i = 0 the LSB.

Table 4.12: DSMUX operations.

Implementation of Selection Function SEL

The selection function can be divided into two parts: an adder and a logic function. The adder is an 8-bit carry-propagate adder whose addends are the 8 MSBs of the carry-save representation of W. The 7 MSBs of the sum are used to generate the result digit at each iteration, along with 3 bits of either d or A chosen as follows:

if the operation is division the 3 bits of d are those with weight 2^-2, 2^-3, and 2^-4.
If the operation is square root, the 3 bits are chosen from A, as explained in [10], according to Table 4.13.

The selection logic function is described in Table 4.14. The digit h selected is the one satisfying the expression m_h £ [^y] < m_h+1. The result digit is in the set {-2,-1,0,1,2}. This representation makes the F-generator block (FGEN) simpler.

1	0	-	first iteration (j = 0)
1	1	1	if (A_{< 0 >} = 1) and (j > 0)
A_{< -2 >}	A_{< -3 >}	A_{< -4 >}	if (A_{< 0 >} = 0) and (j > 0)

A_{< -k >} refers to bit in A with weight 2^-k. A_{< -k >} = A_55-k

Table 4.13: Bits of A used in SEL.

m_h	d_d, [^S][j]
	8/16	9/16	10/16	11/16	12/16	13/16	14/16	15/16
m₂	12	14	15	16	18	20	20	22
m₁	4	4	4	4	6	6	8	8
m₀	-4	-5	-6	-6	-6	-8	-8	-8
m_-1	-13	-15	-16	-17	-18	-20	-22	-23

Values in table are multiplied by 16

Table 4.14: Selection function for radix-4 combined division/square root.

Implementation of block FGEN

Block FGEN generates the signal F[j] as described in expression (2.7):

F[j] =

ì
ï
ï
í
ï
ï
î

-q_j+1d

(division)

-(S[j]s_j+1 +

r^-(j+1)s²_j+1)

(square root)

For square root the generation of F is quite complicated [10]. Table 4.15, where a...aa and b...bb represent bits of A[j] and B[j] respectively, shows the values of the bit-string for the different result digits. To keep track of the position, a 28-bit register (K) is used. The register K is initially loaded with 1 in the MSB and 0 in the remaining positions. At each iteration the 1 is replicated one position to the right according to the following expression.

K_i[j+1] = K_i+1[j], K₂₇[j] = 1 i = 0, 1, ¼, 27

At the end of the square root all bits of K are 1. To simplify the implementation of F[j], the bit string of Table 4.15 is rearranged as shown in Table 4.16. As can be seen, three zones are defined by the variables

K1_i = K_i K_i-1 , K2_i = K_i

K_i-1

, K3_i = K_i+1

K_i

and the relation between i and the bit h is i = ëh/2 û. In terms of these variables we obtain

F_h (odd h) =

K1_i (P2

A_h-1

+ P1

A_h

+ M1 B_h + M2 B_h-1)

+ K2_i (P2 + P1

A_h

+ M1 B_h + M2) + K3_i (P1 + M1)

F_h (even h) =

K1_i (P2

A_h-1

+ P1

A_h

+ M1 B_h + M2 B_h-1)

+ K2_i (P2 + P1 + M1 + M2) + K3_i (P1 + M1)

For division by setting all the bits of register K to 1 (all K1_i = 1 and K2_i = K3_i = 0), F[j] can be generated by the same expression as for square root. This corresponds to

F_h = P2

A_h-1

+ P1

A_h

+ M1 B_h + M2 B_h-1 h = 0, 1, ¼, 55

with d = A = B. Note that for division, when q_j+1 is positive (subtraction), the carry-in in the adder must be set to 1.

s_j+1	F[j]	bit-string
		0		j-1	j	j+1	j+2		27
0	0	00	...	00	00	00	00	...	00
1	-A[j]-2 ×4^-(j+2)	[`a][`a]	...	[`a][`a]	[`a][`a]	11	10	...	00
2	-2A[j]-8 ×4^-(j+2)	[`a][`a]	...	[`a][`a]	[`a]1	10	00	...	00
-1	B[j]+14 ×4^-(j+2)	bb	...	bb	bb	11	10	...	00
-2	2B[j]+24 ×4^-(j+2)	bb	...	bb	b1	10	00	...	00

Table 4.15: Generation of F[j].

s_j+1	F[j]		bit-string
			27		i+1	i	i-1	i-2		0
0	0		00	...	00	00	00	00	...	00
1	-A[j]-2 ×4^-(j+2)		[`a][`a]	...	[`a][`a]	[`a]1	11	00	...	00
2	-2A[j]-8 ×4^-(j+2)		[`a][`a]	...	[`a][`a]	11	00	00	...	00
-1	B[j]+14 ×4^-(j+2)		bb	...	bb	b1	11	00	...	00
-2	2B[j]+24 ×4^-(j+2)		bb	...	bb	11	00	00	...	00
		K	1	...	1	1	0	0	...	0
		K1	1	...	1	0	0	0	...	0
		K2	0	...	0	1	0	0	...	0
		K3	0	...	0	0	1	0	...	0

Table 4.16: Generation of F[j] with rearranged bit-string.

Minimum delay implementation

The minimum delay implementation, also referred as standard, is the one obtained with the constraint of minimum delay. The post-layout critical path is 7.3 ns. The critical path for the combined unit is 5% longer than the critical path for the division only unit of Section 4.2. This is mainly due to the more complicated FGEN block. The energy dissipated by this basic implementation is shown in Table 4.19 in column ßtandard".

4.6.2 Low Power Implementation

In this section we apply the techniques described in Chapter 3 to the standard implementation. Because of the different conversion algorithm required for square root, in the combined unit a low-power conversion and rounding unit, as the one described in Section 4.2, is already in place in the standard implementation. In addition to the techniques of Chapter 3, we reduce the energy dissipation in register K by gating the clock in the flip-flops.

Retiming the recurrence

The retiming is done by moving the selection function of Figure 4.31 from the first part of the cycle to the last part of the previous cycle (see Figure 4.32). The new 4-bit register is initialized to q₀ = 1 for division and s₀ = 1 for square root.

Figure 4.32: Retiming of the recurrence.

This retiming causes a problem for the square root operation. The result digit s_j+1 is converted in the next clock cycle and, as a consequence, in the first few iterations the value [^S][j] is not available for the selection function (Figure 4.33). However, because the digit-selection is done in the last part of the cycle and the conversion of the previous digit is a short delay operation, we can forward the value of the converted digit from the digit-converter to the the selection function and determine the correct value for [^S][j].

Figure 4.33: Digit forwarding.

By indicating with s_{< 1 >} s_{< 0 >} the converted digit and with s_SIGN its sign, Table 4.17 shows the modifications in the selection function.

j	[^S][j]
1,2	A_{< 0 >}	A_{< -3 >}	A_{< 0 >}
3	s_{< 0 >}	0	0
4	A_{< -2 >} Ås_SIGN	s_{< 1 >}	s_{< 0 >}
5	A_{< -2 >}[`(([`(A_{< -3 >})] [`(A_{< -4 >})]s_SIGN))]	f₁	A_{< -4 >} Ås_SIGN
others	A_{< -2 >}	A_{< -3 >}	A_{< -4 >}

f₁ = A_{< -3 >}[`(s_SIGN)] +[`((A_{< -3 >} ÅA_{< -4 >}))] s_SIGN

Table 4.17: Bits of A used in SEL (retimed).

After the retiming, we change the representation of the residual to reduce the number or flip-flops and use low-drive gates in the non critical portion of the recurrence as explained for the radix-4 divider in Section 4.2.

Reduction in register K

Register K is used to generate F[j]. In division the register is initialized to 1 in each bit, and the configuration is not changed for the whole operation. In square root the register is initialized to 1 in the MSB and to 0 in the remaining bits. Every iteration the 1 is propagated to the next bit, for a total of two transitions per flip-flop (one to set the bit to 1, one to reset at the end of the operation). It is convenient to disable the clock for those flip-flops that do not need to be changed in a specific cycle. This is the same modification done for registers A and C in the convert-and-round unit implemented by gating the clock of the flip-flops not used (Section 3.10.2). The enabling function (for the i-th flip-flop) is

f_i = OP K_i+1

K_i

i = 27, ¼, 0

where OP = 1 for square root and OP = 0 for division. By implementing this technique the energy dissipation in K is reduced virtually to zero for division and to about one third for square root.

MSBs:	DSMUX	FGEN	CSA	SEL	REG
	0.6	0.9	0.6	4.1	1.1	=	7.3 ns
LSBs:	MUX	FGEN	CSA		REG
	1.2	0.9	0.6		1.1	=	3.8 ns

Table 4.18: Paths for MSBs and LSBs in retimed recurrence.

4.6.3 Dual Voltage Implementation

The critical path in the retimed implementation is 7.3 ns. By implementing the LSBs of the recurrence with radix-2 CSAs, the delay in the LSBs is 3.8 ns, resulting in a time slack of 3.5 ns. In this case V₂ = 2.0 V can be chosen without affecting the latency of the unit. On the other hand, by opting for the use of radix-4 CSAs, the time slack is reduced to 2.6 ns and, consequently, V₂ can be lowered to 2.5 V.

Our estimation showed that the lowest energy is obtained by implementing radix-2 CSAs and V₂ = 2.0 V for the dual voltage implementation.

Figure 4.34: Low-power combined division/square root unit.

4.6.4 Optimization with Synopsys Power Compiler

The synthesized selection function met the set delay constraints (3.0 ns). the reduction in energy dissipation, obtained by incremental compilation with energy dissipation constraints resulted to be about 25%, without increasing the delay.

4.6.5 Summary of Results for Combined Unit

The implementation that consumes a reduced amount of energy is shown in Figure 4.34. Table 4.19 reports the average energy dissipation and area for the standard and low-power implementation. In the table, entry std refers to the standard implementation, optimized for speed, entry l-p is the low-power implementation with the same delay, entry d-v is an estimate of a possible implementation with dual voltage, and entry sym is an estimate of l-p with selection function optimized by Synopsys Power Compiler.

The energy dissipation for the square root operation is about 10% lower than for the division, on average. This is due to the fact that for square root in every iteration we add F[j], which initially contains many zeros, to the residual w[j].

Some of the low-power techniques used, such as the changed redundant representation, reduce the number of flip-flops in the registers and, consequently, the area.

	division				sqr. root
blocks	std	l-p	syn	d-v	std	l-p	syn	d-v
	nJ	nJ	nJ	nJ	nJ	nJ	nJ	nJ
control	0.9	0.9		0.9	0.9	0.9		0.9
clk tree	0.8	0.8		0.8	0.8	0.8		0.8
mux	1.9	1.9		0.9	1.7	1.7		0.8
DSMUX	0.1	0.1		0.1	0.3	0.3		0.3
FGEN	7.3	4.9		2.4	5.8	4.3		2.0
CSA	8.8	5.2		3.8	4.7	3.9		2.3
sel. func.	1.5	2.0	1.5	2.0	1.2	1.8	1.4	1.8
register Ws	6.7	6.4		^*3.7	6.3	6.1		^*3.6
register Wc	6.7	2.7		^*3.1	5.1	2.2		^*2.5
register q	-	0.3		0.3	-	0.3		0.3
register K	1.3	0.0		0.0	1.6	0.5		0.2
SZD	5.8	0.6		0.6	4.6	0.6		0.6
C&R unit	3.7	3.7		^*1.4	3.7	3.7		^*1.4
E_op [nJ]	46.0	29.5	29.0	20.0	37.0	27.0	26.5	17.5

Ratio to std	1.00	0.65	0.63	0.45	1.00	0.75	0.70	0.50
E_sqrt/E_div	-	-	-	-	0.80	0.90	0.90	0.90

Area [mm²]	1.9	1.8	-	-	1.9	1.8	-	-

Values marked ^* include level shifters.

Table 4.19: Summary of reductions for division and square root operations.

Figure 4.35 shows the breakdown, as a percentage of the total, of the energy dissipated in the main blocks composing the unit.

Figure 4.35: Percentage of energy dissipation in radix-4 combined unit.

4.6.6 Energy Comparison with Radix-4 Divider

In this section we compare the results obtained for the radix-4 combined division and square root with those obtained for a radix-4 divider.

Table 4.20 summarizes the results for the implementation l-p when performing division. Similar blocks are put in the same row. The combined unit dissipates 15% more than the divider, on average. The largest differences are for the blocks FGEN and "mux". The implementation of FGEN is considerably more complicated than the corresponding unit in the divider and gates with large number of inputs (8-input NAND) have been used to keep the number of levels of logic (and the delay) low. As for the multiplexer, in the retimed implementation of the divider, it is moved out of the recurrence. However, in the divider an extra cycle is required because the first iteration is only used for initialization and no quotient-digit is produced. In the combined implementation in the first iteration we perform the subtraction x-d using two inputs of the CSA and produce the first digit of the result.

blocks	divider	combined
control	1.1	0.9
clk tree	0.9	0.8
mux	0.3	1.9
DSMUX	-	0.1
FGEN	2.8	4.9
CSA	4.8	5.2
sel. func.	1.6	2.0
register Ws	6.4	6.3
register Wc	3.5	2.7
register q	0.3	0.3
register K	-	0.0
SZD	0.6	0.6
C&R unit	3.9	3.7
E_div [nJ]	26.0	29.5
Ratio	1.00	1.15
Area [mm²]	1.2	1.8
Ratio	1.0	1.5

Table 4.20: Comparison radix-4 divider/combined unit.

File translated from T_EX by T_TH, version 1.1 and by M_E. Last Modified : Fri Jul 9 11:14:36 PDT 1999