Appendix B

CAD Tools

Introduction

In this appendix we describe the features of some CAD tools used in the realization of this work. A brief description of COMPASS tools is given in Chapter 4. First, the two tools developed in our laboratory (PET and ACC) are presented. Then, the main features of the commercial tool Synopsys Power Compiler are summarized.

B.1  PET: Power Evaluation Tool

PET belongs to the category of power estimators loosely-coupled with the simulator. It is coupled with COMPASS Qsim and it was developed internally for two main reasons:

PET computes the energy and power dissipation by reading the energy views for the cells in the library, the layout-extracted netlist and the trace file generated by Qsim. The energy views are computed once for a given library, by characterization using ACC (Section B.2), and then stored in a database.

B.1.1  PET Energy and Power Models

As discussed in Section 1.2, the energy consumption in a cell is proportional to the output load, the supply voltage, the number of output transitions in a given time window and the energy dissipated internally. This is summarized by expression (1.5), which is rewritten below

$$E_i = \left( \frac{1}{2} V_{DD}^2 C_L + E_{int} \right) n_i$$
where:

  - $V_{DD}$ is the power supply voltage.
  - $C_L$ is the total load applied to the output.
  - $E_{int}$ is the internal energy dissipated in the cell during one transition.
  - $n_i$ is the number of transitions at the output of the i-th cell in the time window.

The term in parentheses

$$E_{tran} = \frac{1}{2} V_{DD}^2 C_L + E_{int} \quad [\mathrm{J}]$$
represents the energy per transition. The average power dissipated in a cell can be computed from the energy, by introducing the following quantities:

   
  - $f_0$ is the circuit main frequency (clock frequency),
  - $a_i$ is the activity factor:
    $$a_i = \frac{\text{number of output transitions (in time window)}}{\text{number of clock cycles (in time window)}} = \frac{n_i}{n_T}$$

as

$$P_i = \left( \frac{1}{2} V_{DD}^2 C_L + E_{int} \right) a_i f_0 = \frac{E_i f_0}{n_T} \quad [\mathrm{W}]$$
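
As a quick numerical check of the expressions above, consider a cell with assumed, purely illustrative values (not taken from any library): VDD = 3.3 V, CL = 50 fF, Eint = 0.1 pJ, ni = 40 output transitions in a window of nT = 100 clock cycles, and f0 = 100 MHz.

$$
\begin{aligned}
E_{tran} &= \tfrac{1}{2} (3.3\,\mathrm{V})^2 (50\,\mathrm{fF}) + 0.1\,\mathrm{pJ} \approx 0.272\,\mathrm{pJ} + 0.1\,\mathrm{pJ} = 0.372\,\mathrm{pJ} \\
E_i &= E_{tran}\, n_i \approx 0.372\,\mathrm{pJ} \times 40 \approx 14.9\,\mathrm{pJ} \\
a_i &= \frac{n_i}{n_T} = \frac{40}{100} = 0.4 \\
P_i &= E_{tran}\, a_i f_0 \approx 0.372\,\mathrm{pJ} \times 0.4 \times 100\,\mathrm{MHz} \approx 14.9\,\mu\mathrm{W} \\
    &= \frac{E_i f_0}{n_T} \approx \frac{14.9\,\mathrm{pJ} \times 100\,\mathrm{MHz}}{100} \approx 14.9\,\mu\mathrm{W}
\end{aligned}
$$

Both routes to the power give the same value, as the identity between the two forms requires.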

In a sequential cell, internal switching that does not affect the cell's output also dissipates energy. To take this contribution into account, we can write the energy and power expressions as follows:

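$$E_i = \left( \frac{1}{2} V_{DD}^2 C_L + E_{int} \right) n_i + E_{cl}\, n_i^{cl} \quad [\mathrm{J}]$$

$$P_i = \left( \frac{1}{2} V_{DD}^2 C_L + E_{int} \right) a_i f_0 + E_{cl}\, f_i^{cl} \quad [\mathrm{W}]$$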
where:

  - $E_{cl}$ is the energy dissipated internally per transition due to clock switching.
  - $f_i^{cl} = \frac{n_i^{cl}}{n_T} f_0$ is the frequency of the transitions of the cell's clock (see footnote 5).

Now we consider a large circuit containing N cells, NS of which are sequential. The total energy consumption in the time window is given by:

$$E_{total} = \sum_{i=1}^{N} \left( \frac{1}{2} V_{DD}^2 C_{L_i} + E_i^{int} \right) n_i + \sum_{i=1}^{N_S} E_i^{cl}\, n_i^{cl} \quad [\mathrm{J}] \tag{B.1}$$

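Summarizing, in order to calculate the energy dissipated, given by expression (B.1), we need to determine the value of the following parameters: VDD; CL, Eint and the number of output transitions ni for each cell; and Ecl and the number of clock transitions for each sequential cell.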

To compute the power dissipation

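$$P_{total} = \frac{f_0}{n_T} \sum_{i=1}^{N} \left( \frac{1}{2} V_{DD}^2 C_{L_i} + E_i^{int} \right) n_i + \frac{f_0}{n_T} \sum_{i=1}^{N_S} E_i^{cl}\, n_i^{cl} = \frac{f_0}{n_T}\, E_{total} \quad [\mathrm{W}] \tag{B.2}$$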
we need the two additional values f0 (the clock frequency) and nT (the number of clock cycles in the time window).

The quantities VDD, Eint and Ecl depend on the library that we are using. CL depends on the design and layout (type of cell connected and wire capacitance) and the number of transitions depends on the design and on the set of input vectors used.

The procedure to determine the energy and power dissipation is the following:

  1. For the chosen library determine the quantities Eint and Ecl for each cell. These values can be provided directly by the silicon vendors or obtained by cell characterization.
  2. From the layout, extract the capacitance (output load plus interconnection capacitance) at each node and associate them as output load (CL) to each cell.
  3. Run a simulation on a set of randomly chosen test vectors using a tool that is able to detect transitions (e.g. a logic-level simulator).
  4. Calculate energy and power using expression (B.1) and expression (B.2).
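
Step 4 of the procedure can be sketched in a few lines of Python. This is only an illustration of expressions (B.1) and (B.2), not the PET source (which is written in C); the `Cell` record, its field names and the numerical values are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """Per-cell data needed by expression (B.1); field names are illustrative."""
    c_load: float        # CL: extracted output load [F]
    e_int: float         # Eint: internal energy per output transition [J]
    n_out: int           # ni: output transitions in the time window
    e_clk: float = 0.0   # Ecl: internal energy per clock transition [J] (sequential cells)
    n_clk: int = 0       # clock transitions in the time window (sequential cells)

def total_energy(cells, vdd):
    """Expression (B.1): switching, internal and clock energy summed over all cells [J]."""
    return sum(
        (0.5 * vdd ** 2 * c.c_load + c.e_int) * c.n_out + c.e_clk * c.n_clk
        for c in cells
    )

def total_power(cells, vdd, f0, n_cycles):
    """Expression (B.2): Ptotal = (f0 / nT) * Etotal [W]."""
    return f0 / n_cycles * total_energy(cells, vdd)

# Hypothetical two-cell circuit observed over 100 clock cycles at 100 MHz.
cells = [
    Cell(c_load=50e-15, e_int=0.1e-12, n_out=40),                             # combinational
    Cell(c_load=30e-15, e_int=0.2e-12, n_out=10, e_clk=0.05e-12, n_clk=200),  # sequential
]
e_total = total_energy(cells, vdd=3.3)
p_total = total_power(cells, vdd=3.3, f0=100e6, n_cycles=100)
print(f"Etotal = {e_total:.3e} J, Ptotal = {p_total:.3e} W")
```

The sequential cell is given 200 clock transitions in 100 cycles, consistent with two clock transitions per clock period.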

B.1.2  PET Implementation

The procedure described above was implemented in PET. It consists of three C routines (analyze, ttgen and calpot) and the use of two COMPASS tools: Qsim (logic-level simulator) and extract (COMPASS Interconnect layout to netlist extractor) [38]. The latter is used to determine the capacitance (including wires) at each node of the circuit while Qsim is used to determine the logic values of the nodes used later to determine the number of transitions. PET is structured as depicted in Figure B.1.

Figure B.1: Structure of PET.

analyze reads the extracted netlist and determines the output load for each cell of the circuit. It also provides to Qsim the labels of the nodes to monitor. The files read are:

The files produced are:

All these references are resolved later by calpot. The [mon] file is incorporated with the input stimuli in the simulation file [sim] to be used along with the netlist [nle] in the simulator.

ttgen (transitions table generator) reads the simulation output file [trc] and creates a transitions table [trn]. In this table each label/node is associated with the number of transitions that occurred at that node during the simulation.
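
The counting step performed by ttgen can be illustrated with a short Python sketch. The layout of the [trc] file is not described here, so the example assumes a hypothetical trace given as (time, node, value) tuples sorted by time; it is a stand-in for the idea, not the actual ttgen routine.

```python
from collections import Counter

def count_transitions(samples):
    """Count value changes per node.

    `samples` is an iterable of (time, node, value) tuples sorted by time.
    This layout is an assumption for the example, not the Qsim trace format.
    """
    last = {}                 # most recent value seen on each node
    transitions = Counter()   # node -> number of value changes
    for _time, node, value in samples:
        if node in last and last[node] != value:
            transitions[node] += 1
        last[node] = value
    return transitions

trace = [
    (0, "n1", 0), (0, "n2", 1),
    (10, "n1", 1),
    (20, "n1", 0), (20, "n2", 1),   # n2 is reported again without changing
    (30, "n2", 0),
]
table = count_transitions(trace)
for node in sorted(table):
    print(node, table[node])
```

Only changes of value are counted, so a node that is reported repeatedly with the same logic value contributes no transitions.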

Finally, calpot calculates energy and power dissipation according to expression (B.1) and expression (B.2). The files read are:

B.1.3  PET Testing

PET was tested on a limited set of benchmarks by comparing its results with those obtained using SPICE, where the power was calculated as the product of the supply voltage and the average current over a time window of the same size as that used for PET [46]. The error was never greater than 10% (the largest benchmark circuit contained about 3,000 transistors).

The main drawback of PET is that it accounts for a fixed amount of short-circuit current for each cell, determined independently of the transition time. This can lead to a lack of accuracy in some situations, for example in the power dissipation of blocks not on the critical path, where signals could have slow ramps. An approach to include a more accurate evaluation of the short-circuit current is described in [47]. However, the improvement in the results obtained is not large enough to justify a significantly greater modeling effort.

B.2  ACC: Automatic Cell Characterization

As an increasing number of transistors is packed into a single chip, the design tools (CAD tools) have to handle larger circuits. Because it is unrealistic to simulate the behavior of a complete system with an electrical-level simulator, such as SPICE, design tools are shifting toward higher levels of abstraction. These levels of abstraction are organized in a hierarchical structure with the circuit/electrical level at the bottom of the hierarchy. Circuit characterization is necessary to provide information about the electrical properties of small functional parts of the system to higher hierarchical levels. In general, cell characterization provides capacitance, timing and power values for all the cells in the library to CAD tools operating at gate level. In our specific case, we characterize the standard cell library to extract the energy views necessary for PET.

B.2.1  ACC Energy Views

ACC (Automatic Cell Characterization) is a tool that performs library characterization by automatically running several SPICE simulations on all the cells of the library. It is derived from the tool presented in [48], and can characterize cells for timing, capacitance and energy. However, in this appendix, we only focus on characterization for energy.

As described in Section B.1, the PET energy model for a single cell is

$$E = \left( \frac{1}{2} V_{DD}^2 C_L + E_{int} \right) n_i + E_{cl}\, n_i^{cl}$$
where:

  - $V_{DD}$ is the power supply voltage.
  - $C_L$ is the total load applied to the output.
  - $E_{int}$ is the internal energy dissipated in the cell during one transition.
  - $n_i$ is the number of output transitions in the time window.
  - $E_{cl}$ is the energy dissipated internally due to clock switching.
  - $n_i^{cl}$ is the number of clock transitions, if the cell is sequential.

Of all the quantities indicated in the above expression, the ones obtained by characterization are Eint and Ecl (energy views).

It is convenient to characterize a cell over a period of time in which two output transitions occur (one low-to-high and one high-to-low). The value of energy is computed as the product of VDD and the value obtained by numerical integration of the current i(t) over a time window [t1, t2] in which two transitions occur:

$$E_{cy} = \int_{t_1}^{t_2} v(t)\, i(t)\, dt \cong V_{DD} \sum_{k=0}^{N} i(t_1 + k \Delta t)\, \Delta t \quad \text{with} \quad \Delta t = \frac{t_2 - t_1}{N}$$

The graph of the current $i(t_1 + k \Delta t)$ is obtained by SPICE simulation with resolution step $\Delta t$. By simulating the cell with different loads we determine different values of Ecy. The value Eint can be obtained as follows:

  1. By linear curve fitting of the values of CL and Ecy, we obtain the two coefficients x1 and x0:
    $$E_{cy} = x_1 C_L + x_0$$
  2. From expression (1.5), with ni = 2 output transitions, we get:
    $$E_{cy} = \left( \frac{1}{2} V_{DD}^2 C_L + E_{int} \right) n_i = 2 \left( \frac{1}{2} V_{DD}^2 C_L + E_{int} \right)$$
  3. By combining the two expressions above:
    $$V_{DD}^2 C_L + 2 E_{int} = x_1 C_L + x_0$$
    we obtain:
    $$V_{DD}^2 = x_1 \quad \text{and} \quad E_{int} = \frac{x_0}{2}$$
    Note that, since the actual value of VDD is known, the value of x1 can be used to evaluate the accuracy of the linear curve fitting.
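
The extraction of Eint described above can be sketched as follows. This is an illustration only, not the ACC code: the function names are hypothetical, the energy per cycle is approximated with a simple rectangle-rule sum of sampled current, and the data are synthetic values generated from an assumed VDD and Eint so that the fit can be checked.

```python
def energy_per_cycle(current, dt, vdd):
    """Approximate Ecy as VDD times the rectangle-rule integral of the sampled current."""
    return vdd * sum(current) * dt

def linear_fit(xs, ys):
    """Least-squares line ys = x1 * xs + x0; returns (x1, x0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    x1 = sxy / sxx
    return x1, mean_y - x1 * mean_x

def internal_energy(loads, e_cycles):
    """Return (Eint, fitted VDD) using VDD^2 = x1 and Eint = x0 / 2."""
    x1, x0 = linear_fit(loads, e_cycles)
    return x0 / 2, x1 ** 0.5

# Synthetic characterization data: two transitions per cycle, assumed VDD and Eint.
VDD, E_INT = 3.3, 0.1e-12
loads = [10e-15, 20e-15, 50e-15, 100e-15]                       # CL values [F]
e_cycles = [2 * (0.5 * VDD ** 2 * c + E_INT) for c in loads]    # Ecy values [J]

e_int, vdd_fit = internal_energy(loads, e_cycles)
print(f"Eint = {e_int:.3e} J, fitted VDD = {vdd_fit:.3f} V")

# Energy of a hypothetical constant 1 mA current sampled four times at 1 ns steps.
e_const = energy_per_cycle([1e-3] * 4, dt=1e-9, vdd=VDD)
print(f"Ecy (constant current) = {e_const:.3e} J")
```

Comparing the fitted VDD with the known supply voltage gives the accuracy check mentioned in step 3.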

For sequential cells, the contribution due to clock switching, Ecl, is measured independently of the output load by applying an input pattern that causes no output transitions (i.e. ni = 0).

Note that the internal energy includes the energy due to short-circuit current, which depends on the slope of the transitions. In our characterization for PET, we assumed the input slope to be constant for the library and chose it as the response time of a gate with drive strength of one [43], [49]. This assumption leads to accurate energy values when the circuit is optimized for timing, since longer transition times translate into longer delays. More detailed information on the characterization of energy due to the short-circuit current is provided in [47].

B.2.2  ACC Implementation

The structure of ACC is shown in Figure B.2. ACC reads three databases containing the SPICE netlists of the cells in the library, a set of loads (CapLib), and different waveforms to be applied as input stimuli (WaveLib). In addition, ACC reads three files containing the simulation specifications, the global parameters for SPICE, and the SPICE models for the transistors.

Figure B.2: Structure of ACC.

ACC was implemented with routines written in C and scripts in UNIX C-shell; for further details see [50]. The flow of ACC is described in Table B.1.

Source configuration file containing library paths and global parameters.
For each cell in library
{
 Create a working directory $CELLNAME.
 Copy in $CELLNAME the simulation specifications (sim.specs).
 Copy in $CELLNAME the SPICE subcircuit ($CELLNAME.sub).
 For each line in sim.specs (i.e. each specification)
 {
  Create SPICE netlist ($CELLNAME.spi).
  Write file containing simulation variables (var).
  For each capacitance value CL in var
  {
   For each input stimuli set specified in var
   {
    Run SPICE.
    Extract value (e.g. Etran) specified in var.
   }
  }
  Elaborate results (polynomial fitting).
 }
 Write energy view.
}

Table B.1: ACC working flow.

B.3  Synopsys Power Compiler

We summarize below the main features of Synopsys Power Compiler. In particular we discuss the power model, the cost function and some techniques used to reduce the power dissipation. Most of the information and data are derived from those presented in [15].

Power Compiler is built on the synthesis environment of Design Compiler and allows power optimization to be performed with delay and area optimization. Power Compiler obtains its power estimates from Design Power. The power dissipated is divided into three contributions:

Switching power:
$\frac{1}{2} C V^2 f$, which depends on pin and wire capacitance, whose values are available in the synthesis technology libraries, and on transition count information described by toggle rates obtained either from Design Power's probabilistic estimation algorithm or from gate-level simulation.
Internal power:
power consumed internally to the gate. The internal power model is nonlinear and is provided by ASIC vendors as a look-up table derived from SPICE characterization. This energy table is indexed by the cell's input edge rates (slopes) and output loads to produce an energy value that is then multiplied by the toggle rate of the output.
Static or leakage power:
This is a single constant value for the cell specified by the ASIC vendor.

In Power Compiler the cost function is prioritized as follows:

  1. maximum delay
  2. minimum delay
  3. maximum dynamic power
  4. maximum leakage power
  5. maximum area.

This means that timing constraints will not be violated to save power, but available time slack will be used to reduce it. A transformation is accepted if it decreases one of the cost functions without increasing any higher-priority cost.
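
The acceptance rule can be read as a lexicographic comparison over the prioritized costs. The sketch below illustrates that reading only; it is not Power Compiler's implementation or interface. The cost names are illustrative labels, and each cost is assumed to be a number for which lower is better (for the delay entries, for instance, the amount of constraint violation).

```python
# Cost names in priority order, highest priority first, as listed above.
# The dictionary keys are illustrative labels, not Power Compiler identifiers.
PRIORITY = ["max_delay", "min_delay", "dynamic_power", "leakage_power", "area"]

def accepts(before, after):
    """Accept if some cost decreases and no higher-priority cost increases."""
    for name in PRIORITY:
        if after[name] > before[name]:
            return False   # a cost worsened before any improvement was found
        if after[name] < before[name]:
            return True    # the highest-priority change is an improvement
    return False           # no cost changed

before = {"max_delay": 0.0, "min_delay": 0.0, "dynamic_power": 10.0,
          "leakage_power": 1.0, "area": 100.0}
saves_power = dict(before, dynamic_power=8.0, area=120.0)        # trades area for power
breaks_timing = dict(before, max_delay=0.2, dynamic_power=5.0)   # violates max delay

print(accepts(before, saves_power))
print(accepts(before, breaks_timing))
```

Under this reading, a transformation that saves power at the expense of area is accepted, while one that saves more power but worsens the maximum-delay cost is rejected.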

The circuit transformations described next each try to reduce one of the main factors contributing to the power dissipation: gate transistor dimensions, net switching activity, net transition times and net capacitive loading.

B.3.1  Gate transistor dimensions

The dimensions of the transistors that compose a CMOS gate can influence a number of factors that determine the power consumption of a design. Sizing of a cell is done by choosing different implementations of the same logic function. These implementations might differ in their parasitic capacitance and internal power.

B.3.2  Composition

In order to reduce the switching power, Power Compiler merges or composes sets of cells into a more complex one. The switching power of the enclosed net is completely eliminated; however, the internal power of the new cell is higher because of the increased gate size.

B.3.3  Pin swapping

Some cells can have input pins that are symmetric with respect to the logic function (for example, in a 2-input NAND gate the two input pins are symmetric), but have different capacitance values. Power can be reduced by assigning a higher switching rate net to a lower capacitance pin.
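
The idea behind pin swapping can be shown with a small sketch: among mutually symmetric pins, the most active net is paired with the lowest-capacitance pin. This is an illustration of the principle, not Power Compiler's algorithm; the function names, toggle rates and capacitance values are hypothetical, and all listed pins are assumed to be interchangeable.

```python
def assign_pins(net_toggle_rates, pin_caps):
    """Pair the most active net with the lowest-capacitance symmetric pin."""
    nets = sorted(net_toggle_rates, key=net_toggle_rates.get, reverse=True)
    pins = sorted(pin_caps, key=pin_caps.get)
    return dict(zip(nets, pins))

def switched_capacitance(assignment, net_toggle_rates, pin_caps):
    """Sum of toggle rate times pin capacitance (proportional to pin switching power)."""
    return sum(net_toggle_rates[net] * pin_caps[pin] for net, pin in assignment.items())

toggle = {"a": 0.1, "b": 0.4}        # toggle rates of the two input nets (assumed)
caps = {"A": 4e-15, "B": 6e-15}      # input pin capacitances of a 2-input NAND [F] (assumed)

original = {"a": "A", "b": "B"}
swapped = assign_pins(toggle, caps)

print(swapped)
print(switched_capacitance(original, toggle, caps),
      switched_capacitance(swapped, toggle, caps))
```

Sorting one list in descending order and the other in ascending order minimizes the sum of products, which is why the swapped assignment never switches more capacitance than the original one.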

B.3.4  Sizing and buffering

The power due to the net transition time can be reduced by decreasing the transition times at the inputs. Power Compiler substitutes the driver of a net with a driver of higher drive strength to sharpen the edge of the transition. Alternatively, buffers can be used to reduce the transition time. The drawback is that the added capacitance (larger transistors in the driver, or extra gates to implement buffers) might offset the reductions obtained.


Footnotes:

5 There are two transitions per clock period. Therefore, $f_i^{cl}$ is twice the frequency of the cell's clock.


File translated from TEX by TTH, version 1.1 and by ME. Last Modified : Fri Jul 9 11:14:41 PDT 1999