Sequential Logic Synthesis with Retiming in Encounter RTL Compiler (RC) 

Christoph Albrecht 1 , Shrirang Dhamdhere 1 , Suresh Nair 1 , Krishnan Palaniswami 2 , Sascha Richter 1 

1 Cadence Design Systems, 2 Focus Semiconductor 

Session Track: Digital IC Design 

Session Number: 2.3 

Relevant Cadence Products: Encounter RTL Compiler (RC), Encounter Conformal Logic Equivalence 

Checker (LEC) 

Abstract 

Typical ASIC designs are highly unbalanced with respect to the timing criticality of their combinational logic 

paths. This is mainly due to the ad-hoc manual design specification of the register transfer level (RTL), 

which does not use any information regarding the sequential timing criticality. Traditional logic synthesis 

does not support “borrowing” of timing slack across registers, and the optimization is restricted by fixed 

positions of the registers. This may result in a suboptimal solution, in a loss of performance, and 

unnecessary area and power consumption. 

This paper explains the concept of clock scheduling and retiming used by Encounter RTL Compiler (RC) to 

optimize across register boundaries. Retiming is a structural transformation which changes the positions of 

the registers without modifying the input-output behavior of the circuit. The reader will understand how the 

area, the number of registers, or the delay of the design is minimized. Computational results show the 

tradeoff between these two objectives. 

Practical applications are discussed: Registers may have different control signals, enable signals, or reset 

signals. This leads to the multiclass retiming problem and the reset line justification problem. 

Retiming used to be a difficult challenge for equivalence checking. However, together with Encounter 

Conformal Logic Equivalence Checker (LEC) the verification is now simple: RC writes out checkpoint netlist 

files and one script, which LEC can then process to automatically verify the golden RTL against the final 

netlist. 

We present a case study showing how retiming was used by Focus Semiconductor, a division of Focus 

Enhancements, on a 1.5 M instance UWB baseband chip. Retiming substantially improved the Quality of 

Results (QoR) and helped to meet the design objectives. 

CDNLive! Silicon Valley 2006

1 Introduction 

Traditional combinational logic synthesis focuses all the optimization efforts on the combinational paths between the registers. It does not support any tradeoff between tight paths and loose paths when these are separated by registers.

To motivate the use of sequential logic synthesis with retiming, we will discuss the slack distribution of a 

typical ASIC design. 

Figure 1: Slack distribution of a 

typical ASIC design. 

Figure 1 shows the slack distribution, more specifically the distribution of the setup slacks of a late-mode 

analysis after synthesis. For each slack interval on the x-axis, the number of combinational paths which 

have a slack value within that interval is shown. The design has a worst negative slack of -529 ps. 

Figure 2: Slack distribution of the 

same ASIC design for which the 

slack distribution is shown in 

Figure 1, however this time with 

optimized clock latencies. 

Figure 2 shows an optimized slack distribution for the same design. The netlist was not changed, only the clock latencies at the registers. The latencies were computed with a slack balancing algorithm which we will discuss later. The number of critical paths has decreased drastically. Only a small fraction of the paths have a negative slack. In this case it was not possible to improve the worst negative slack, because the worst path in this design is a path from a primary input to a primary output.

The two figures, Figure 1 and Figure 2, impressively demonstrate the optimization potential which becomes 

available when the registers are unlocked and not kept fixed as hard boundaries, which constrains the 

synthesis optimization algorithms. With the optimized clock latencies, many paths become uncritical. The 

additional slack can be used to downsize the combinational gates or even to use a different logic structure 

that has smaller area and power consumption. 

While clock scheduling was not able to reduce the worst negative slack for this specific design, clock 

scheduling was able to improve the slack of the side paths. These are either combinational paths that start 


at the primary input of the critical path and end at a register or paths that start at a register and end at a 

primary output. This is helpful for the synthesis optimization algorithms in RC. RC is able to improve the 

slack of a path by using slack of the side paths. 

In this paper we discuss the two sequential optimization techniques, clock scheduling and retiming, and 

show how the combination of both these techniques is used in RC. The paper is organized as follows: 

In Section 2 we discuss clock scheduling. Clock scheduling is also known as useful skew. It changes the 

latencies of the clock signal but does not change the logic. The different latencies need to be realized by a 

sophisticated clock network. 

In Section 3 we describe retiming. Retiming is a structural transformation. While retiming does not change 

the combinational gates, it modifies the netlist by moving the registers forward and backward in the logic. 

RC can use clock scheduling as an intermediate step to drive the logic synthesis and optimization process. 

Ultimately, it realizes the different latencies by retiming so that a conventional zero or limited skew backend 

flow can place the design, construct the clock network, and route the nets. This is described in Section 4. 

In practice, retiming can be constrained by registers that have different control signals (for example, enable 

signals, asynchronous set or reset signals). Section 5 discusses these constraints. 

In Section 6 we discuss the automatic verification flow with LEC. 

In the last section we present a case study of how retiming was used on a UWB baseband chip from Focus Semiconductor.

2 Clock Scheduling 

The following figure shows how the worst slack of a design can be improved by changing the clock 

latencies: Buffers are added to the clock distribution network and the switching time of the register is 

delayed. In this case the worst slack is improved from -2 ns to 0 ns and the design meets the timing 

requirements. If the clock latency of the capturing register of a combinational path is increased, the slack of 

the combinational path increases by the same amount. If, on the other hand, the clock latency for the 

capturing register is decreased, the slack of the combinational path decreases. Increasing the clock latency 

of the launching register decreases the slack and decreasing the latency has the opposite effect on the slack 

of the path. 

Figure 3: The worst slack is improved by adjusting the clock latencies. (Target clock period: 5 ns. Added clock latencies: +2 ns, +1 ns, +1 ns. Worst slack without clock latencies: -2 ns; worst slack with clock latencies: 0 ns.)


A linear programming formulation 

The clock scheduling problem can be formulated as a linear program. This was first done by Fishburn in 1990 [1]. Let $T$ be the clock period, which is to be minimized. Furthermore, let $l_i$ be the latency of the clock signal arriving at register $i$, and let $d_{ij}$ be the maximum delay of all combinational paths from register $i$ to register $j$.

$$\min T \quad \text{subject to} \quad l_i + d_{ij} \le l_j + T \quad \text{for all combinational paths } (i, j).$$

The difference between the two sides of the inequality, $l_j + T - l_i - d_{ij}$, is the slack of the path. Should the design have constrained primary inputs or outputs, we can represent all these inputs and outputs by one dummy register that can have, without loss of generality, a clock latency of zero. Hence, we can assume that even in this case the linear program has the form above.

The linear program is a very special linear program and it can be solved efficiently with combinatorial 

algorithms. It can be proved that the minimum clock period achievable by clock scheduling is equal to the 

maximum average path delay of all cycles in the register-to-register timing graph. The register-to-register 

timing graph contains a node for every register and an edge whenever there is a combinational path 

between the registers with a weight equal to the maximum delay of these paths. 
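The cycle characterization above can be checked on small graphs. The following Python sketch computes the maximum mean cycle delay with Karp's dynamic program. This is a textbook technique chosen for illustration and not necessarily the algorithm used in RC; the function name `max_mean_cycle` and the three-register graph are hypothetical.

```python
from fractions import Fraction

def max_mean_cycle(nodes, edges):
    """Maximum mean delay over all cycles (Karp-style dynamic program).

    edges maps (i, j) to the maximum combinational delay from register i
    to register j. Returns a Fraction, or None if the graph has no cycle.
    """
    n = len(nodes)
    # best[k][v]: largest total delay of a walk with exactly k edges ending at v
    # (None means that no such walk exists).
    best = [{v: (0 if k == 0 else None) for v in nodes} for k in range(n + 1)]
    for k in range(1, n + 1):
        for (u, v), delay in edges.items():
            if best[k - 1][u] is not None:
                cand = best[k - 1][u] + delay
                if best[k][v] is None or cand > best[k][v]:
                    best[k][v] = cand
    result = None
    for v in nodes:
        if best[n][v] is None:
            continue
        worst = min(Fraction(best[n][v] - best[k][v], n - k)
                    for k in range(n) if best[k][v] is not None)
        if result is None or worst > result:
            result = worst
    return result

# Hypothetical register-to-register timing graph (not taken from the paper).
example_edges = {("x", "y"): 9, ("y", "x"): 11, ("y", "z"): 4, ("z", "x"): 3}
print(max_mean_cycle(["x", "y", "z"], example_edges))  # 10, from the cycle x -> y -> x
```

In this hypothetical graph the longest single path has delay 11, yet the cycle x, y, x has an average delay of 10, so clock scheduling could in principle reach a period of 10.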

In general, the linear program does not have a unique solution, and an arbitrary solution that minimizes the clock period is usually not desirable. Consider, for example, the ASIC design whose two slack distributions are shown in Figure 1 and Figure 2. The worst negative slacks of the two slack distributions are equal, and so are the clock periods at which the chips can operate without failure, yet the distribution in Figure 2 has far fewer critical paths.

Clock scheduling optimally balancing the slack 

In the following we discuss how it is possible to compute a clock schedule with a specific property which we 

call optimally balanced slack. As a result of this property many paths are uncritical and have a lot of slack. 

This part is more theoretical and if the time of the reader is limited, we recommend skipping this part 

because the sections following are more important for the practical use. 

We consider a small example circuit with four registers, a, b, c, and d, shown in Figure 4.

Figure 4: Example circuit with combinational gates and four registers (a, b, c, d). The numbers specify the delay of the gates.

From the circuit we can construct the register-to-register timing graph which is shown in the following figure. 

The graph has one node for each of the four registers and an edge between two nodes whenever there is a 

combinational path between the corresponding registers. Associated with the edges is the maximum delay 

of the combinational paths. 


Figure 5: Register-to-register timing graph for the circuit in Figure 4. (Nodes: a, b, c, d. Edge delays: 6, 7, 5, 11, and four edges of delay 9.)

Without clock latencies, the minimum feasible clock period for this circuit is equal to the maximum delay of 

the combinational paths, in this case T = 11. By increasing the clock latency for the register b to +1, the 

clock period can be decreased to T = 10. This is the minimum clock period which can be achieved by clock 

scheduling, because with these latencies the two paths (b,d) and (d,b) have a slack of zero. Figure 6 shows 

the register-to-register timing graph with the latency +1 at register b. In addition to the combinational delays 

we show also the slacks for the clock period T = 10 in brackets. 

Figure 6: A clock schedule applied to the registers such that the worst incoming slack equals the worst outgoing slack for every register. The edges corresponding to the critical paths with a slack smaller than or equal to 1 are shown in red. (Clock period T = 10; latency +1 at register b. Edge labels, delay (slack): 9 (1), 9 (1), 6 (5), 7 (3), 5 (5), 9 (1), 11 (0), 9 (0).)

The clock schedule shown in Figure 6 has the property that for every register the worst incoming slack is equal to the worst outgoing slack. Changing the clock latency of one single register alone does not give an improvement, since the worst slack of all the paths starting or ending at the register can only get worse.

Figure 6 shows that there is one critical edge in red, the edge (d,c), which is not part of a critical cycle. It is possible to increase the slack of this edge by increasing the clock latency of the registers a and c simultaneously. This does not affect the two critical edges (c,a) and (a,c). The result is shown in Figure 7. In this figure the worst incoming slack equals the worst outgoing slack for every subset of the registers. Note that before, in Figure 6, the worst outgoing slack for the registers a and c together is equal to 5 whereas the worst incoming slack is only 1.

Figure 7: An optimally balanced clock schedule: The worst incoming slack equals the worst outgoing slack for every subset of the registers. (Clock period T = 10; latencies +2 at register a, +2 at register c, +1 at register b. Edge labels, delay (slack): 9 (1), 9 (1), 6 (3), 7 (5), 5 (3), 9 (3), 11 (0), 9 (0).)
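The balance property can be checked by brute force on this small example. The Python sketch below rebuilds the timing graph from the delays and slacks printed in Figures 6 and 7. The endpoints of the edges (b,d), (d,b), (d,c), (d,a), (a,c) and (c,a) follow from the text and the printed slacks; the start nodes of the edges with delay 6 and delay 5 are not recoverable from the text and are assumed here (a and c, respectively). All function names are hypothetical.

```python
from itertools import combinations

T = 10  # clock period used in Figures 6 and 7

DELAY = {
    ("a", "c"): 9, ("c", "a"): 9,
    ("b", "d"): 9, ("d", "b"): 11,
    ("d", "c"): 9, ("d", "a"): 7,
    ("a", "b"): 6,  # assumed start node a (could also be c)
    ("c", "d"): 5,  # assumed start node c (could also be a)
}

def slack(latency, i, j):
    """Setup slack of the path i -> j: l_j + T - l_i - d_ij."""
    return latency[j] + T - latency[i] - DELAY[(i, j)]

def worst_in_out(latency, subset):
    """Worst slack over edges entering and over edges leaving a register subset."""
    incoming = [slack(latency, i, j) for (i, j) in DELAY
                if i not in subset and j in subset]
    outgoing = [slack(latency, i, j) for (i, j) in DELAY
                if i in subset and j not in subset]
    return min(incoming, default=None), min(outgoing, default=None)

def is_balanced(latency, sizes=(1, 2, 3)):
    """True if worst incoming == worst outgoing slack for all subsets of the given sizes."""
    regs = sorted(latency)
    for size in sizes:
        for subset in combinations(regs, size):
            worst_in, worst_out = worst_in_out(latency, set(subset))
            if worst_in != worst_out:
                return False
    return True

fig6 = {"a": 0, "b": 1, "c": 0, "d": 0}
fig7 = {"a": 2, "b": 1, "c": 2, "d": 0}

print(worst_in_out(fig6, {"a", "c"}))  # (1, 5): unbalanced for the pair a, c
print(is_balanced(fig6, sizes=(1,)))   # True: balanced for every single register
print(is_balanced(fig7))               # True: balanced for every proper subset
```

Enumerating all subsets is only practical for a toy graph like this one; it serves to illustrate the definition, not to suggest an algorithm.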


The clock schedule shown in Figure 2, in which the number of critical paths has decreased so drastically, has exactly this property. It is computationally too expensive to consider all subsets of the registers explicitly, because there are exponentially many of them. Nevertheless, the efficient minimum-balance algorithm by Young, Tarjan and Orlin [3] can find such a solution by iteratively finding critical cycles and contracting them.

For synthesis operations it is helpful if the side paths of a critical path have additional slack. The slack can 

be used to reduce the delay of the critical path. An example for such a synthesis operation is Shannon 

decomposition shown in the following figure. 

Figure 8: A critical path becomes short and fast using Shannon decomposition. (The combinational logic driving x is duplicated with a set to 0 and to 1; a selects between the two copies.)

If only one path, starting at a point a and ending at a point x, is critical and all other paths ending at x are uncritical, then the fanin logic of x can be duplicated: in one copy the value of a is permanently set to zero, and in the other copy it is set to one. The two outputs of the replicated logic feed a multiplexer that chooses the right value for x depending on the value of a. The constant values for a are propagated to simplify the logic. After this transformation the path from a to x is very short and hence very fast.
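The transformation can be illustrated on Boolean functions. In the following Python sketch the logic cone `original` and the helper `shannon_decompose` are hypothetical; the sketch only demonstrates that selecting between the two cofactors with a multiplexer preserves the function, while the late signal a is used solely as the multiplexer select.

```python
from itertools import product

def shannon_decompose(f):
    """Return g with g(a, *rest) == f(a, *rest), built from the two cofactors of f."""
    def cofactor_a0(*rest):
        return f(0, *rest)   # copy of the logic with a tied to 0

    def cofactor_a1(*rest):
        return f(1, *rest)   # copy of the logic with a tied to 1

    def g(a, *rest):
        x0 = cofactor_a0(*rest)  # does not depend on a
        x1 = cofactor_a1(*rest)  # does not depend on a
        return x1 if a else x0   # multiplexer: a is only the select input
    return g

def original(a, b, c, d):
    """Hypothetical fanin logic of x in which a is the late-arriving input."""
    return ((a & b) ^ c) | (a & (1 - d))

decomposed = shannon_decompose(original)
equivalent = all(original(*bits) == decomposed(*bits)
                 for bits in product((0, 1), repeat=4))
print(equivalent)  # True
```

The price of the shorter path is the duplicated logic, which is why the text requires that the side paths have slack to spare.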

Limitations of clock scheduling 

Clock scheduling has limitations. Changing the clock latencies may increase the number of hold violations. The hold constraint ensures that data signals do not arrive too early at the data input pin of the register at the end of the path. The signal has to arrive after the register has closed. A high number of hold violations can lead to an enormous number of hold buffers, which need to be added at the end of the flow. Due to process variations the final delay of the paths on the fabricated chip can deviate from the computed delay. This limits the use of clock scheduling further. For example, it is not possible to have a long combinational path that has a combinational delay equal to ten times the clock period and realize the timing constraints by adjusting the latencies of the clock signals at the launching and receiving register. On such a combinational path there would be 10 different data signals at the same time. These signals need to arrive at the receiving register at the right time. If the combinational delay of the path were only 10% smaller on the final fabricated chip due to process variations, the signal would arrive too early and this would result in a hold time violation. As the delay could also increase, it is not possible to fix this hold violation by adding additional delay with hold buffers.
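The ten-clock-period example can be made concrete with a little arithmetic. The Python sketch below assumes an idealized model that is not stated in the paper: zero setup time, a single hold-time parameter, and a symmetric delay variation given in percent. Setup requires l_launch + d_max <= l_capture + T, and hold requires l_launch + d_min >= l_capture + hold_time, so a feasible latency difference exists only if d_max - d_min <= T - hold_time.

```python
def skew_window(d_nominal_ps, variation_pct, period_ps, hold_time_ps=0):
    """Feasible range of (l_capture - l_launch) in ps, or None if setup and hold conflict.

    Idealized model (an assumption for illustration): zero setup time,
    path delay anywhere within +/- variation_pct of the nominal delay.
    """
    d_max = d_nominal_ps * (100 + variation_pct) // 100
    d_min = d_nominal_ps * (100 - variation_pct) // 100
    lowest = d_max - period_ps      # setup: l_launch + d_max <= l_capture + T
    highest = d_min - hold_time_ps  # hold:  l_launch + d_min >= l_capture + hold
    return (lowest, highest) if lowest <= highest else None

# Path delay of ten clock periods (T = 1000 ps).
print(skew_window(10_000, 10, 1_000))  # None: +/-10% spreads the delay over 2 T
print(skew_window(10_000, 4, 1_000))   # (9400, 9600): only a narrow window remains
```

Under these assumptions the total delay spread of a path must stay below one clock period no matter how long the path is, which is why long paths are far more sensitive to variation.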

Nevertheless, RC can use internally large positive and negative clock latencies and optimize the 

combinational logic with these latencies. In the end, the latencies are realized by retiming and moving the 

registers through the combinational logic. The latencies are only bounded by the number and the movement 

of the registers. 


3 Retiming 

Retiming is a powerful sequential optimization technique which overcomes the limitations of clock 

scheduling. Retiming moves the registers across the combinational logic to improve the performance 

without changing the input/output behavior of the circuit. 

The following figure shows how the slack of a circuit can be improved by retiming. It is the same circuit to which we applied clock scheduling in Figure 3. The registers are retimed backward against the direction of the signal propagation.

Figure 9: The worst slack is improved by retiming the registers backward against the direction of the signal propagation. (Target clock period: 5 ns. Worst slack before retiming: -2 ns; worst slack after retiming: 0 ns.)

This example shows that retiming changes the number of registers. In this case, the number of registers 

increases. However, the number of registers can also decrease. RC minimizes the clock period as a first 

objective. Among all possible retiming solutions that achieve the minimum clock period, RC finds the 

solution with the minimum number of registers. In addition, RC has the option to minimize the number of 

registers without increasing the current clock period. 

Any retiming can be achieved by a sequence of elementary retiming steps of two kinds: Forward retiming removes a register from each input of a gate and creates a new register at each output. Backward retiming does the opposite: It removes a register from each output and creates a new register at each input. The two retiming steps are shown in the following figure.

Figure 10: Registers retimed forward and backward over an AND gate.

For forward retiming it is necessary that each input of the gate is driven by a register. Similarly, for backward 

retiming the gate must not drive any combinational gate but only registers. 

In order to ensure equivalent input/output behavior of the circuit, retiming must not change the number of registers on any loop or on any path from a primary input to a primary output. This is guaranteed by the two operations. Of course, it may still be possible to retime registers forward or backward over a gate even if the precondition above (a register at every input, or only registers at the output) does not hold in the original circuit; the precondition then has to be established first by elementary retiming steps applied to other gates.
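The two elementary steps and the invariant can be sketched on a simple graph model. In the Python sketch below, the circuit is a dictionary that maps each connection (driver, sink) to the number of registers on it, in the spirit of the Leiserson and Saxe model [2]. The circuit, the node names, and the function names are hypothetical, and register sharing at fanout points is ignored.

```python
def forward_retime(regs, gate):
    """Remove one register from every input of `gate`, add one at every output."""
    ins = [e for e in regs if e[1] == gate]
    outs = [e for e in regs if e[0] == gate]
    if not ins or any(regs[e] < 1 for e in ins):
        raise ValueError("forward step needs a register on every input")
    new = dict(regs)
    for e in ins:
        new[e] -= 1
    for e in outs:
        new[e] += 1
    return new

def backward_retime(regs, gate):
    """Remove one register from every output of `gate`, add one at every input."""
    ins = [e for e in regs if e[1] == gate]
    outs = [e for e in regs if e[0] == gate]
    if not outs or any(regs[e] < 1 for e in outs):
        raise ValueError("backward step needs a register on every output")
    new = dict(regs)
    for e in outs:
        new[e] -= 1
    for e in ins:
        new[e] += 1
    return new

def registers_on(regs, nodes):
    """Number of registers along the path (or loop) given as a node sequence."""
    return sum(regs[(u, v)] for u, v in zip(nodes, nodes[1:]))

# Hypothetical circuit: in -> g1 -> g2 -> out, with a feedback loop g2 -> g1.
circuit = {("in", "g1"): 1, ("g2", "g1"): 1, ("g1", "g2"): 0, ("g2", "out"): 0}
retimed = forward_retime(circuit, "g1")

loop = ["g1", "g2", "g1"]
path = ["in", "g1", "g2", "out"]
print(registers_on(circuit, loop), registers_on(retimed, loop))  # 1 1
print(registers_on(circuit, path), registers_on(retimed, path))  # 1 1
print(sum(circuit.values()), sum(retimed.values()))              # 2 1
```

In this sketch the forward step over g1 merges two registers into one, while the register count on the loop and on the input-to-output path stays the same. A forward step over g2 in the original circuit is rejected because its input carries no register.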

Constants and dangling logic (logic that does not drive anything) are an exception. Constant propagation as 

part of the RC synthesis operations simplifies any logic driven by a constant, unless the gates are preserved 

by an attribute. Similarly, dangling logic is removed. However, should this logic be preserved, retiming is 

able to create or remove registers at constants and dangling logic. 

The following figure shows an example in which retiming cannot improve the critical path because no 

elementary retiming step is possible: 

Figure 11: An example in which retiming cannot improve the clock period because the register cannot be moved forward. (Primary inputs A and B, primary output C; gate delays 2, 3, 3, 4.)

Depending on the clk-to-q delay of the register, the critical path goes from the register to the primary output C. If the primary inputs are unconstrained, the critical path starts at the register in any case. A user who checks only the slack at the data input pin and at the output pin of the register may wonder why the register was not moved forward. This is not possible, because there is no register directly following the primary input B.

Efficient algorithms for retiming have been developed and published. We refer the interested reader to the 

fundamental paper by Leiserson and Saxe published in 1991 [2] in which the problem of finding a retiming 

realizing a given clock period and minimizing the number of registers is formulated and solved as a minimum 

cost flow problem. Polynomial time algorithms have been developed for this problem. A comprehensive 

book about timing in general and clock scheduling and retiming is the recent book by S. Sapatnekar [5]. 

Relationship between clock scheduling and retiming 

The two sequential optimization techniques, clock scheduling and retiming are related: It can be proved that 

the clock period achievable by clock scheduling (ignoring any hold constraints) is a lower bound on the clock 

period that can be achieved by retiming [3]. It can also be proved that retiming can almost achieve this clock 

period: The minimum clock period achievable by retiming is at most the minimum clock period achievable 

by clock scheduling plus the maximum delay of all gates. 

If a clock schedule is given, a retiming can be computed as follows: Find a register with the maximum positive clock latency. Decrease the clock latency until the incoming slack is zero. If the slack is already zero, perform a backward retiming over the gate driving the register. The new registers added in front of the gate get a clock latency equal to the latency of the original registers minus the delay of the gate. This procedure is repeated until the clock latency of each register is smaller than half the delay of the gate driving the register. Then a similar procedure is applied to registers with the minimum negative clock latency. The registers are moved forward and the clock latency is increased by the delay of the gate until the clock latency of each register is larger than the negative value of half the delay of the gate driven by the register. If the clock latency of every register is then set to zero, the retimed circuit has a clock period which is at most the clock period of the original circuit with clock scheduling plus the maximum delay of all gates.
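The final bound follows in one step from the setup constraint of Section 2. As a sketch, write $T_{cs}$ for the clock period achieved with clock scheduling and $d_{\max}$ for the maximum delay of all gates (both symbols are introduced here for illustration), and assume that the procedure keeps every setup constraint satisfied. After the procedure every latency satisfies $|l_i| < d_{\max}/2$, so for every combinational path $(i, j)$ of the retimed circuit:

$$l_i + d_{ij} \le l_j + T_{cs} \;\Longrightarrow\; d_{ij} \le T_{cs} + l_j - l_i < T_{cs} + \tfrac{1}{2} d_{\max} + \tfrac{1}{2} d_{\max} = T_{cs} + d_{\max}.$$

With all latencies set to zero the clock period only has to cover the largest $d_{ij}$, which is therefore smaller than $T_{cs} + d_{\max}$.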


4 The global sequentially driven synthesis flow in RC 

RC combines the two sequential optimization techniques, clock scheduling and retiming, in a global 

sequential synthesis flow shown in the following figure. 

Figure 12: The global sequentially driven synthesis flow in RC. (Steps shown: sequentially driven synthesis and optimization, consisting of combinational synthesis interlinked with clock scheduling; retiming; combinational synthesis.)

The logic synthesis and optimization algorithms are tightly interlinked with clock scheduling. Clock scheduling computes clock latencies which improve the clock period and the slack of the combinational paths. The synthesis algorithms can use slack of side paths to further improve critical paths. In the next step, retiming moves the registers through the combinational logic. It minimizes the clock period and, as a second objective, minimizes the number of registers. Ultimately, retiming is followed once more by combinational synthesis. This is necessary because the loads of the gates have changed as the registers were moved.

RC performs these steps automatically. The user only has to set the attribute “retime” to true for either the 

top design or the subdesigns for which retiming should be performed and then call the “synthesize” 

command. 

5 Special cases for retiming 

In this section we describe special cases for retiming due to control signals at the registers. The control 

signals at the registers may constrain the movement of the registers. First we discuss the retiming of 

registers with enable signals. Then we describe the case when registers with an enable signal are 

implemented by a simple register with a multiplexer feedback loop. Finally, we discuss asynchronous set 

and reset signals. 

Retiming of registers with different enable signals 

In practice, the retiming of the registers can be constrained: The registers in the circuit may have different 

control signals, for example enable signals. Retiming cannot combine registers which have different control 

signals. Figure 13 shows an example. To improve the timing, the two registers should be combined and 

retimed backward. However, this is not possible because the two registers receive different enable signals. 

RC can combine and retime registers forward or backward only if they receive the same enable signals. 


Figure 13: The two registers cannot be moved backward because they receive different enable signals. (Signals shown: clock, enable 1, enable 2; gate delays 4, 7, 5.)

Multiplexer feedback loop 

Registers with an enable signal can also be implemented by a simple register and a multiplexer. This may 

be an advantage for retiming because the registers can then be merged even though the enable signals are 

different. It may, however, also constrain the register movement and increase the number of registers. 

Figure 14 shows that the number of registers can be larger. It is a pipeline design with three stages of 

registers at the primary outputs. The enable is realized by a multiplexer. When the registers are retimed 

into the combinational logic (applying only the elementary retiming steps in Figure 10), one register has to 

remain in each loop with the multiplexer. Furthermore, registers pile up at the select lines of the multiplexer. 

Figure 14: Registers with enable can be implemented by a simple register and a multiplexer. This may increase the register count when the registers are moved backward. (Signals shown: enable 1, enable 2, enable 3, before and after retiming.)

If the registers have a built-in enable signal, which can be moved along with the registers, instead of a loop with a multiplexer, then the number of registers after retiming is smaller.

If the registers with the multiplexers are at the primary inputs and have to be moved forward, the problem is 

different: only the last register can be retimed forward. To retime more registers forward it would be 

necessary to have additional registers at the select line of the multiplexers. 

By default RC uses registers which have enable logic built into the register. Only if the variable 

“hdl_ff_keep_feedback” is true, RC uses simple registers which are in a loop with a multiplexer. The results 

depend on the structure of the design and can differ drastically. 

Retiming of registers with asynchronous set and reset signals 

Retiming of registers with asynchronous set or reset signals is more involved. When these registers are 

retimed forward or backward through the combinational logic it is necessary to compute the new reset 

values. Moving these registers forward through the combinational logic is simple: The reset values are 

propagated through the logic. Figure 15 shows an example. 


Figure 15: The registers are retimed forward. The reset values are propagated to the registers in the new locations.
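Forward propagation of reset values amounts to evaluating the logic on the old reset values. A minimal Python sketch follows, with a hypothetical two-gate netlist and hypothetical net names; the gate types and values are not those of Figure 15.

```python
GATES = {
    "and": lambda x, y: x & y,
    "or": lambda x, y: x | y,
    "xor": lambda x, y: x ^ y,
    "not": lambda x: 1 - x,
}

def propagate_resets(netlist, reset_values):
    """Evaluate the gates in topological order on the given reset values.

    netlist: list of (output_net, gate_type, input_nets).
    reset_values: reset value of each register at its original location.
    Returns the value of every net; a register moved forward to a net
    takes that net's value as its new reset value.
    """
    values = dict(reset_values)
    for out, kind, ins in netlist:
        values[out] = GATES[kind](*(values[n] for n in ins))
    return values

# Hypothetical example: registers r1, r2, r3 are moved forward to net n2.
netlist = [("n1", "and", ("r1", "r2")), ("n2", "or", ("n1", "r3"))]
values = propagate_resets(netlist, {"r1": 1, "r2": 0, "r3": 1})
print(values["n1"], values["n2"])  # 0 1
```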

Moving registers backward is more complicated. First, all the registers driven by the gate need to have the same reset values. Second, the reset values of the new registers that drive the inputs of the gate are not unique. A naive approach that moves the registers over the gates one gate after the other and chooses arbitrary reset values does not work: the wrong reset values could be chosen, so that later the registers cannot be retimed backward over a gate because their reset values are different. Hence, it is necessary to solve a global problem: what are the required 0/1 reset values for the registers in the new locations such that propagating these values through the logic results in the given reset values at the registers in the original locations? This problem can be transformed into a satisfiability problem. It is very similar to verifying that two netlists are equivalent, in which we ask the question: do 0/1 values exist for the registers and primary inputs such that propagating these values through the logic results in different values at an input of a register or at a primary output?

Sometimes no 0/1 reset values exist for the registers in the new locations, such that propagating these 

values forward would result in the right given values at the original locations. The following figure shows an 

example. In this case no valid reset values exist if the registers were moved further backward. RC can 

move registers with asynchronous set or reset backward only as far as valid reset values for the registers 

exist. 
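For a handful of registers the justification problem can be solved by exhaustive search, which stands in here for the satisfiability solver mentioned above. In the following Python sketch the function `justify_resets` and the two-gate logic are hypothetical and do not correspond to Figure 16.

```python
from itertools import product

def justify_resets(logic, n_new_registers, required_values):
    """Search for reset values at the new (backward) register locations.

    logic maps the values at the new locations to the values at the
    original register locations. Returns a tuple of 0/1 values whose
    forward propagation yields required_values, or None if none exists.
    Exhaustive search is exponential and only meant for tiny examples.
    """
    for candidate in product((0, 1), repeat=n_new_registers):
        if tuple(logic(*candidate)) == tuple(required_values):
            return candidate
    return None

def two_gates(x, y):
    """Hypothetical logic: one AND gate and one OR gate sharing both inputs."""
    return (x & y, x | y)

print(justify_resets(two_gates, 2, (0, 1)))  # (0, 1): valid reset values exist
print(justify_resets(two_gates, 2, (1, 0)))  # None: AND = 1 forces OR = 1
```

In the second call no assignment exists, so in this hypothetical circuit registers with reset values (1, 0) could not be moved backward over the two gates.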

Figure 16: It is not possible to find reset values for the registers in the new locations such that propagating these values results in the given values for the registers in the original locations.

If all the registers that retiming needs to merge and move either forward or backward receive equivalent 

control signals and if also the reset line justification problem is solvable, then retiming is more powerful than 

clock scheduling. It is possible to have extremely long combinational paths that have a delay as large as 

several times the clock period. If there are sufficient registers at the beginning or end of the paths, retiming 

can move these registers into the combinational logic and still achieve the target clock period. Earlier we 

had seen that clock scheduling is limited because hold constraints need to be considered. If the delays of 

the paths as well as the variations of the path delays are too large, it is at some point impossible to realize 

the hold constraints together with the setup constraints. 

Retiming may increase the number of registers. This is the only drawback. For some designs the increase 

can be significant. However, RC can also decrease the number of registers. Usually for larger designs that 

have only one critical part, RC can improve the clock period as well as decrease the number of registers: In 

the uncritical parts the locations of the registers are very flexible and hence the registers can be moved and 

possibly merged. 


6 An automated verification flow 

Retiming used to pose fundamental hurdles for equivalence checking. Proving that two netlists are 

equivalent if one netlist was generated from another netlist through combinational synthesis as well as 

through retiming is a problem of enormous complexity. To address these verification challenges RC writes 

out checkpoint files (Verilog netlist) that describe the design at a particular stage. When retiming is used, 

RC can write out the checkpoint files before and after retiming as shown in the following diagram. 

Figure 17: The automated synthesis and verification flow with checkpoint files generated by RC and read by LEC. (RC steps: read RTL, write checkpoint file, retiming, write checkpoint file, write final netlist. LEC equivalence checks: initial RTL against the pre-retiming checkpoint netlist (combinational); pre-retiming against post-retiming checkpoint netlist (retiming); post-retiming checkpoint netlist against the final netlist (combinational).)

Along with each checkpoint file, RC also generates a corresponding “dofile”, a command script used by Conformal Logic Equivalence Checker (LEC). Equivalence between the RTL and the final netlist is established through a series of verification steps which compare the initial RTL to the first checkpoint file, checkpoint file to checkpoint file, and the last checkpoint file to the final netlist. The appropriate dofile sets up the verification of the corresponding stages as shown in the diagram. Conformal verifies the equivalence under the assumption that either only combinational synthesis operations were performed or only the registers were moved by retiming operations.

7 Case study: Retiming for a UWB baseband chip from Focus Enhancements

As a case study we describe how retiming in RC was used by Focus Semiconductor, a division of Focus 

Enhancements, for the dual-phy UWB baseband chip MADRAS. This chip supports a proprietary Focus 

(Turbo) mode and a WiMedia mode which is compliant with the Multiband OFDM Alliance (MBOA). The 

Focus mode is more powerful than the MBOA mode: The ratio of the bandwidth versus the distance is 

about 2x greater. The chip is designed in a 0.13um CMOS TSMC process technology with an analog front 

end. It has about 4 million transistors which correspond to approximately 1.5 million instances. 

The Synchronization Module has a three stage hierarchical datapath implementation. Each stage is composed of a finite impulse response (FIR) filter which required datapath optimization support from RC.

The Synchronization Peak Finder Module contains a divider which is used to normalize the synchronization 

threshold. Enough pipeline registers were added at the inputs and outputs of the block. RC then 

rebalances the combinational paths by retiming the registers into the combinational logic. 
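The effect of moving pipeline registers from the block boundary into the logic can be illustrated with a toy model. The following Python sketch is not RC's algorithm: it assumes a single linear chain of gates with made-up delays and finds the best register positions by exhaustive search, and all names (stage_delays, rebalance, GATE_DELAYS) are hypothetical. The number of registers on the path from input to output stays the same, so the latency in clock cycles is unchanged.

```python
from itertools import combinations_with_replacement


def stage_delays(gate_delays, cuts):
    """Combinational delay of each pipeline stage.

    A register at position p sits directly in front of gate p, so position 0
    is the primary input and len(gate_delays) is the primary output.
    """
    bounds = [0, *sorted(cuts), len(gate_delays)]
    return [sum(gate_delays[a:b]) for a, b in zip(bounds, bounds[1:])]


def clock_period(gate_delays, cuts):
    """The clock period is limited by the slowest stage."""
    return max(stage_delays(gate_delays, cuts))


def rebalance(gate_delays, n_registers):
    """Exhaustively pick register positions that minimize the clock period."""
    positions = range(len(gate_delays) + 1)
    candidates = combinations_with_replacement(positions, n_registers)
    return min(candidates, key=lambda cuts: clock_period(gate_delays, cuts))


GATE_DELAYS = [3, 1, 2, 2, 1, 3]   # hypothetical gate delays, arbitrary units
before = (0, 0)                    # both pipeline registers at the primary input
after = rebalance(GATE_DELAYS, len(before))

print("before:", before, "clock period", clock_period(GATE_DELAYS, before))
print("after: ", after, "clock period", clock_period(GATE_DELAYS, after))
```

In this toy chain, the two registers move from the input to positions 2 and 4, which splits the logic into three stages of equal delay and reduces the clock period from 12 to 4 units.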


The Coarse Equalization Module consists of a Media Access Controller (MAC) and scratchpad memory. 

Retiming was also used for this module. Pipeline registers were added at the primary inputs and outputs 

and retiming automatically moved these registers into the logic and rebalanced the delay of the 

combinational paths. 

The Fine Equalization Module and the Tracking Module use a similar MAC and memory, which made retiming necessary for these modules as well.

A top-down sequential synthesis flow with retiming 

The design consists of a 600K-instance top-level block, FPT, which was synthesized top-down. The “retime” attribute was set on 16 submodules corresponding to about 45% of the total logic and 49% of the registers. The following table shows all the modules for which the retime attribute was set to true in the automatic “synthesize -retime” flow.

                                      number of registers           clock period (ps)
subdesign     gates     PIs     POs   before   after    change      before   after    change
block_1      51,667     738     571    2,589   2,558    -1.20%      12,908   3,248   -74.80%
block_2      13,893     266     234    1,766   2,042    15.60%      13,119   3,384   -74.20%
block_3      28,017     880     895    8,283   6,990   -15.60%       6,583   3,176   -51.80%
block_4       2,577      65      66      141     327   131.90%       6,724   3,142   -53.30%
block_5      17,646     407      54      380     639    68.20%       5,489   3,748   -31.70%
block_6       8,345     503     175      388     520    34.00%       9,044   4,407   -51.30%
block_7-a     7,680     597      77    1,269   1,473    16.10%       5,484   3,249   -40.70%
block_7-b     7,748     597      77    1,269   1,416    11.60%       5,484   3,369   -38.60%
block_7-c     7,716     597      77    1,269   1,420    11.90%       5,451   3,422   -37.20%
block_7-d     7,772     597      77    1,269   1,392     9.70%       5,446   3,457   -36.50%
block_7-e     7,778     597      77    1,269   1,446    13.90%       5,465   3,366   -38.40%
block_7-f     7,789     597      77    1,269   1,445    13.90%       5,459   3,380   -38.10%
block_8       7,163     141      71    1,088   1,128     3.70%       8,421   5,300   -37.10%
block_9      28,841     411     170    1,500   1,392    -7.20%      12,291   5,693   -53.70%
block_10     18,009     440     135    2,862   3,035     6.00%       9,195   4,427   -51.90%
block_11     88,925   1,683   1,700    6,694   5,897   -11.90%       5,212   4,573   -12.30%
Average      19,472     569     283    2,081   2,070    -0.60% (1)   7,611   3,834   -49.60% (2)

(1) percentage change of the average number of registers before and after retiming

(2) percentage change of the average clock period before and after retiming

The table shows the number of combinational gates, the number of primary inputs (PIs), and the number of 

primary outputs (POs). The next three columns show the number of registers before and after retiming and 

the percentage change. The last three columns show the clock period in picoseconds before and after 

retiming and the percentage change. 

The table shows that retiming can either increase or decrease the number of registers. Overall, the number of registers decreases by 0.6%. The clock period improves for every subdesign. For many of the subdesigns, a large decrease in the clock period is expected because pipeline registers were added at either the primary inputs or the primary outputs.
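The summary figures can be rechecked from the per-block values. The short Python sketch below is an illustration added for this purpose, with hypothetical names (ROWS, pct_change, mean); the numbers are copied from the table. It recomputes the change between the average register counts and between the average clock periods, which round to -0.6% and -49.6%. For comparison it also computes the mean of the sixteen individual clock-period changes, a different statistic that comes out near -45.1%.

```python
# (registers before, registers after, clock period before [ps], clock period after [ps])
ROWS = [
    (2589, 2558, 12908, 3248),  # block_1
    (1766, 2042, 13119, 3384),  # block_2
    (8283, 6990, 6583, 3176),   # block_3
    (141, 327, 6724, 3142),     # block_4
    (380, 639, 5489, 3748),     # block_5
    (388, 520, 9044, 4407),     # block_6
    (1269, 1473, 5484, 3249),   # block_7-a
    (1269, 1416, 5484, 3369),   # block_7-b
    (1269, 1420, 5451, 3422),   # block_7-c
    (1269, 1392, 5446, 3457),   # block_7-d
    (1269, 1446, 5465, 3366),   # block_7-e
    (1269, 1445, 5459, 3380),   # block_7-f
    (1088, 1128, 8421, 5300),   # block_8
    (1500, 1392, 12291, 5693),  # block_9
    (2862, 3035, 9195, 4427),   # block_10
    (6694, 5897, 5212, 4573),   # block_11
]


def pct_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Change between the column averages (the statistic in the "Average" row).
reg_change_of_avg = pct_change(mean(r[0] for r in ROWS), mean(r[1] for r in ROWS))
period_change_of_avg = pct_change(mean(r[2] for r in ROWS), mean(r[3] for r in ROWS))

# Mean of the individual per-block clock-period changes (a different statistic).
mean_of_period_changes = mean(pct_change(r[2], r[3]) for r in ROWS)

print(f"registers, change of averages:    {reg_change_of_avg:.1f}%")
print(f"clock period, change of averages: {period_change_of_avg:.1f}%")
print(f"clock period, mean of changes:    {mean_of_period_changes:.1f}%")
```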


Conclusion 

With increasing demands for faster designs and shorter time-to-market, it is important for designers to look 

for efficient optimization techniques. Retiming in Encounter RTL Compiler is one very powerful technique 

that can achieve substantial improvements in performance. 

In this paper we have described how RTL Compiler uses clock scheduling in a sequentially driven synthesis flow and then performs retiming to minimize the clock period and the number of registers. We have discussed special cases of retiming: registers with enable signals, registers with a multiplexer feedback loop, and registers with asynchronous set and reset signals.

With RTL Compiler it is easy to perform retiming, and the direct link to Conformal Logic Equivalence Checker provides a complete verification solution.

References 

[1] J. P. Fishburn, Clock Skew Optimization, IEEE Transactions on Computers, vol. 39, pp. 945-951, July 1990.

[2] C. Leiserson and J. Saxe, Retiming Synchronous Circuitry, Algorithmica, vol. 6, pp. 5-35, 1991.

[3] N. E. Young, R. E. Tarjan, and J. B. Orlin, Faster Parametric Shortest Path and Minimum Balance Algorithms, Networks, vol. 21, pp. 205-221, 1991.

[4] S. S. Sapatnekar and R. B. Deokar, Utilizing the Retiming-Skew Equivalence in a Practical Algorithm for Retiming Large Circuits, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 10, October 1996.

[5] S. S. Sapatnekar, Timing, Kluwer Academic Publishers, Boston, MA, 2004.


Appendix: Encounter RTL Compiler commands for retiming 

Automatic synthesis with retiming 

It is easy to use retiming in RC: only the attribute “retime” needs to be set to true for the design or 

subdesign which should be retimed. Then during synthesis the design or subdesign is processed 

automatically by the sequentially driven synthesis flow with retiming as described in Section 4. 

set_attr retime true [subdesign]
synthesize -to_mapped

Manual retiming flow 

This flow can be used when a specific module or modules need to be retimed. It can also serve as an exploratory tool to see the impact of retiming on a subdesign in a mapped design. The first step, “retime -prepare”, prepares the design for retiming, and “retime -min_delay” performs the actual retiming. Even though “retime -min_delay” performs a local mapping of the logic immediately surrounding the flops, it is recommended to follow it with an incremental synthesis or, preferably, a global synthesis, depending on the granularity of the changes.

retime -prepare [subdesign | design]
retime -min_delay [subdesign | design]
synthesize -to_mapped [-incr]

Manual retiming flow minimizing the number of registers 

This flow explicitly tries to minimize the number of registers and thus the area. This should be used only for 

a design which has positive slack. 


retime -min_area [subdesign | design]
synthesize -to_mapped [-incr]

Attributes 

set_attr dont_retime true [flop]
    Do not retime the register specified.

set_attr retime_hard_region true [subdesign]
    Retiming cannot move registers into or out of the “subdesign”.

set_attr boundary_opto false [subdesign]
    Disable boundary optimization (constant propagation and rewiring of equivalent signals across hierarchy) and preserve the input and output pins of a subdesign. This enables easier ECO for the blocks and might be necessary for formal verification.

set_attr retime_async_reset true
    Enable retiming on flops with asynchronous set or reset signals. The runtime may increase if registers need to be moved backward. By default, registers with asynchronous set or reset signals are excluded from retiming.

set_attr retime_optimize_reset true
    If this attribute is used in combination with the previous attribute, the reset logic is optimized by replacing asynchronous flops with simple flops wherever possible.

For more information refer to the Encounter RTL Compiler User Guide, chapter 9, “Retiming the Design”. 


Interface to Conformal Logic Equivalence Checker (LEC) 

The checkpoint files of the automatic verification flow described in Section 6 and the corresponding dofiles 

for LEC are generated by RC if the checkpoint attributes are set as shown below. 

set_attribute checkpoint_flow true
set_attribute library my_library.lib
read my_design.v
elaborate
set_attribute checkpoint_netlist_naming_style \
    "my_chk_dir/chk_%d.v" /designs/my_top
set_attribute checkpoint_dofile_naming_style \
    "my_chk_dir/chk_%d_to_chk_%d.do" /designs/my_top
read_sdc my_constraints.sdc
set_attr retime true my_top

write -m > final.v
write_do_lec -revised final.v > final.do

To run LEC:

lec -ultra -Dofile hdl_to_chk_01.do
lec -ultra -Dofile chk_01_to_chk_02.do
lec -ultra -Dofile final.do

For more information refer to the document “Interfacing between RTL Compiler and Conformal”.
