EE577a Final Project Spring 2009

16-Bit Motion Estimator for DSP 

EE577a Final Project 

Due Date: 5/4/2009 

Names: Sun, Qifeng 6383195770 

Bao, Shengkun 9538161299 

E-mail: qifengsu@usc.edu 

sbao@usc.edu

Description of the Project 

Students will design a low-power 16-Bit Motion Estimation (ME) kernel for a DSP, as shown in the figure below. The 16-Bit ME kernel is implemented by the 4Kb (2×128×16) data memory, the data pointer, and the absolute difference accumulator, which consists of a 16-Bit absolute difference circuit and a 16-Bit adder circuit. It reads two 4×4-bit pixel blocks (a macro block and a candidate block) from the data memory and calculates the absolute difference between each corresponding pair of pixels in the two blocks. At the end it outputs the accumulated difference over all pixel pairs in the two blocks.

Overall Block Diagram
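As a behavioural reference for the kernel described above, the following Python sketch accumulates the absolute differences of two blocks of 16-bit words. It is only a functional model under stated assumptions: the function name, the use of Python lists for the two memories, and the 24-bit wrap of the running total are illustrative choices, not a description of the circuit.

```python
MASK16 = 0xFFFF      # one data word is 16 bits wide
MASK24 = 0xFFFFFF    # assumed width of the running total (24-bit accumulator)


def sum_abs_diff(macro_block, candidate_block):
    """Accumulate |a - b| over corresponding 16-bit words of two blocks."""
    if len(macro_block) != len(candidate_block):
        raise ValueError("blocks must have the same number of words")
    total = 0
    for a, b in zip(macro_block, candidate_block):
        diff = abs((a & MASK16) - (b & MASK16))  # absolute difference stage
        total = (total + diff) & MASK24          # accumulator stage
    return total


if __name__ == "__main__":
    # Hypothetical two-word blocks, chosen only to exercise the function.
    print(hex(sum_abs_diff([0x1986, 0x1111], [0x0117, 0x1111])))
```

With 128 words per memory, the largest possible total is 128 × FFFF = 7FFF80 (hex), which fits in 24 bits.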

Part I Introduction 

1. Top Level Architecture 

Our design differs slightly from the standard top level architecture. To reduce the delay of the Row/Column Decoder and the Absolute Difference Calculator (ADC), the optimized DFFs in the Data Pointer and in the registers following the SRAM provide both the original and the complemented outputs. As a result, the SRAM has 14 address I/Ps and the ADC has 64 data I/Ps in total, half of them original signals and half their complemented values.

2. Performance Matrix

Table 1.1 Gross Performance Matrix

Clock Frequency                     667MHz
Area                                103400um^2 (347um * 298um)
Average Power Consumption           4.5E+01mW
Clock Frequency / Power             1.48E+10 Hz/W
Clock Frequency / Area              6.48E+15 Hz/m^2
Clock Frequency / (Area * Power)    1.44E+17 Hz/(W*m^2)

Part II Floor Plan

The Motion Estimator can be divided into 4 pipeline stages. 

Stage 1: Data Pointer

Component               Count
7-Bit Accumulator       1
1-Bit-2-Output DFF      7

I/P (3): Clock, nClock, Reset
O/P (14): Q, nQ

Stage 2: SRAM Macro

Component                        Count
2K-Bit SRAM Cell (32*4*16)       2
8-Bit-2-Output Register Array    4

I/P (50): Pre_Charge, Write_en, Read_en, A_EN, Add, nAdd, Data1, Data2
O/P (64): Dout1, nDout1, Dout2, nDout2

Stage 3: Absolute Difference Calculator

Component                Count
16-Bit Full Adder        2
16-Bit MUX Array         1
16-Bit Register Array    1

I/P (64): Dout1, nDout1, Dout2, nDout2
O/P (16): ABS

Stage 4: 24-Bit Accumulator

Component                Count
16-Bit Full Adder        1
8-Bit Accumulator        1
9-Bit MUX Array          1
24-Bit Register Array    1
1-Bit DFF                1

I/P (17): ABS, G0 = gnd
O/P (25): SUMABS, Gout

The area taken by wires only (with no components underneath) is about 1400um^2 (a 39um*35um rectangle plus some small additional areas), which is about 1.35% of the overall area.

Part III Component Design and Exploration 

1. 16-Bit Absolute Difference Calculator 

(1) Macro Block

Fig 3.1 Top Level Architecture Diagram 

The ADC calculates the absolute difference between vectors A[16:1] and B[16:1], the data words read out from the two SRAM cells. It computes both A-B and B-A by adding the original value of one vector to the complemented value of the other, and uses the complemented carry-out (Cout, or Gout, of the two 16-Bit full adders) as the select signal of the 16-Bit MUX, which passes the expected absolute difference vector.
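The two-adder-plus-MUX scheme can be sketched in Python as follows. This is a minimal functional model, not the circuit itself: the helper names `add16` and `abs_diff16` are hypothetical, the carry-in of 1 on both adders is an assumption (it is what turns A + ~B into A - B in two's complement), and the polarity of the select signal is simplified to "carry-out of A-B equals 1 means A >= B".

```python
MASK16 = 0xFFFF


def add16(x, y, carry_in):
    """16-bit addition with carry-in; returns (sum, carry_out)."""
    full = (x & MASK16) + (y & MASK16) + (carry_in & 1)
    return full & MASK16, full >> 16


def abs_diff16(a, b):
    """|a - b| built from two adders and a 2:1 MUX."""
    not_a = ~a & MASK16
    not_b = ~b & MASK16
    a_minus_b, carry_ab = add16(a, not_b, 1)  # A + ~B + 1 = A - B
    b_minus_a, _ = add16(b, not_a, 1)         # B + ~A + 1 = B - A
    # Assumed select polarity: carry_ab == 1 means no borrow, i.e. A >= B.
    return a_minus_b if carry_ab else b_minus_a


if __name__ == "__main__":
    for a, b in [(0x0000, 0x0000), (0x1986, 0x0117), (0x1986, 0x1989)]:
        print(f"ABS({a:04X}, {b:04X}) = {abs_diff16(a, b):04X}")
```

Both differences are computed in parallel, so the MUX only has to wait for the carry-out rather than for a second, serial subtraction.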

(2) Schematic 

The main component of the ADC schematic is the 16-Bit full adder.

Fig 3.2 Schematic View of optimized 16-Bit Full Adder 

According to the simulation results for the SRAM we designed in Lab 4, the worst case read delay, including the decoder delay, is about 800~900ps. This is less than the delay of the ADC, which includes the delays of the 16-Bit full adder and the MUX, so the ADC is the critical stage and must be made faster to reach a high clock frequency. The SRAM cell consumes the most power of all the components because of its scale, but we also need to reduce the power consumption of the ADC while maintaining its high speed, because the 16-Bit full adder structure is used three times across two pipeline stages.

To speed up the 16-Bit full adder and to lower its power consumption, we redesigned the adder. It still follows the Sklansky model, but the Black and Gray cells at the vital points of the critical path have been sized by logical effort to match their load capacitance, minimizing the propagation delay along the critical path.

The other cells are scaled down to the minimum size that keeps equal rising and falling delays. All of the redundant buffers are removed to decrease the power consumption.

Through these optimizations, the worst case delay of the 16-Bit adder is reduced from 1.03ns to 910ps, and the carry (Gout) delay is about 600ps, which helps greatly because Gout is used as the control signal of the following logic cell.
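For readers unfamiliar with the Sklansky structure, the sketch below models a 16-bit Sklansky parallel-prefix adder at the bit level in Python. It is an illustration under assumptions, not the schematic of Fig 3.2: every prefix node is modelled as a full black cell (a real design uses gray cells where the group propagate is not needed), and the carry-in G0 is folded in after the prefix tree, which is one common arrangement and may differ from the actual design.

```python
def sklansky_add16(a, b, g0=0):
    """Bit-level model of a 16-bit Sklansky adder; returns (sum, gout)."""
    n = 16
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # bit generate
    G, P = g[:], p[:]                                  # group signals
    span = 1
    while span < n:                                    # log2(16) = 4 levels
        for i in range(n):
            if i & span:
                # Top bit of the lower half of this 2*span block.
                j = (i & ~(span - 1)) - 1
                G[i] = G[i] | (P[i] & G[j])            # uses P[i] before update
                P[i] = P[i] & P[j]
        span *= 2
    # After the tree, G[i] and P[i] cover bits i..0; fold in the carry-in.
    cin = g0 & 1
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(n)]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[n]


if __name__ == "__main__":
    for a, b in [(0xFFFF, 0x0000), (0x6555, 0xAAAA), (0x5555, 0xAAAA)]:
        s, gout = sklansky_add16(a, b, g0=1)
        print(f"{a:04X} + {b:04X} + 1 = {s:04X}, Gout = {gout}")
```

The Sklansky tree reaches every carry in four levels at the cost of high fanout on a few nodes, which is why sizing the cells on the critical path to their load matters.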

(3) Layout 

Fig 3.3 Layout View of optimized 16-Bit Full Adder (2560um^2, 64um*40um)

To make it easy to place all the components in the pipeline and to integrate them into a whole system, we laid out some of the components, such as the 16-Bit full adder and the register array, with the same horizontal length, 64um, which accommodates 16 data bits in parallel.

Furthermore, because some unnecessary buffers have been removed from the adder, some blank areas in the layout are inevitable.

Fig 3.4 Layout View of ADC (5376um^2, 64um*84um)

The layout above shows the ADC, which consists of two 16-Bit full adders and the 16-Bit MUX array marked by the golden ellipse. The ADC has two groups of complementary inputs, 32 pairs in total: A[16:1], nA[16:1], B[16:1] and nB[16:1].

The whole ADC is also laid out to match the standard 64um horizontal length.

(4) Testing and Simulation 

Fig 3.5 Simulation Waveform of 16-Bit full Adder (G0 is set to vdd)

According to the simulation waveform, the 16-Bit full adder operates as intended. With G0 = vdd:

FFFF + 0000 = 0000, Gout = 1;
6555 + AAAA = 1000, Gout = 1;
5555 + AAAA = 0000, Gout = 1.

The worst case delays are very small:

Worst Sum Delay: 910ps
Worst Gout Delay: 610ps

Fig 3.6 Simulation Waveform of ADC 

According to the simulation waveform, the ADC operates as intended:

ABS (0000, 0000) = 0000;
ABS (1986, 0117) = 186F;
ABS (1111, 1111) = 0000;
ABS (1986, 1989) = 0003.

The worst case delay is 1.12ns.

The output becomes stable only after this delay. Before that it may hold an undetermined value or some other wrong value, so the clock period of the pipeline must be long enough for the register to sample the correct, stable value.

The minimum setup time of the DFF designed in Lab 3 is about 110~140ps, so the ADC stage requires at least

1.12ns + 0.14ns = 1.26ns

In general, then, the clock period of the pipeline should be no less than 1.3ns.
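The timing budget above can be written as a short Python calculation. This is a sketch that uses only figures quoted in this report (the 440ps data pointer delay, the upper SRAM read estimate of 900ps, the 1.12ns ADC delay and the 0.14ns setup time); the function and dictionary names are illustrative, and the accumulator stage is left out because its total delay is not stated.

```python
def min_clock_period_ns(stage_delays_ns, setup_ns):
    """Smallest period that lets the slowest stage settle before setup."""
    return max(stage_delays_ns.values()) + setup_ns


# Worst case combinational delays quoted in the report, in ns.
stage_delays_ns = {
    "data pointer": 0.44,
    "SRAM read": 0.90,   # upper end of the 800~900ps estimate
    "ADC": 1.12,
}

critical_stage = max(stage_delays_ns, key=stage_delays_ns.get)
period_ns = min_clock_period_ns(stage_delays_ns, setup_ns=0.14)

if __name__ == "__main__":
    print(f"critical stage: {critical_stage}")
    print(f"minimum period: {period_ns:.2f} ns")
```

This budget covers combinational delay and setup time only, so it is a lower bound for a single stage rather than a prediction for the integrated system.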

2. 7-Bit Data Pointer 

As mentioned above, our analysis shows that the ADC stage has the largest delay and that the delays of the SRAM and accumulator stages are comparable to it, while the delay of the data pointer is much smaller.

The emphasis of the data pointer design is therefore not speed. Instead, we implement it with as few and as small components as possible to reduce the power and area consumption.

The smaller the transistors, the lower the current, the lower the speed, and the smaller the area and power consumption.

We therefore use the simplest logic implementation and scale all of the transistors to the minimum size, 300nm/200nm.

Fig 3.7 Schematic View of Data Pointer 

Fig 3.8 Simulation Waveform of Data Pointer 

Worst case delay: 440ps (combinational logic only)

3. 24-Bit Accumulator 

Fig 3.9 Schematic View of Accumulator 

The goal of the 24-Bit accumulator design is to reduce the power and area consumption while keeping its speed comparable with that of the ADC.

Through analysis, we found that the upper 8 bits of the accumulator behave like a data pointer rather than an adder, so we combined the 16-Bit full adder with an 8-Bit data pointer. Since the delay of the data pointer is much less than that of the adder, its speed is not a concern. Instead, we minimized its area and power consumption by scaling all of its transistors to the minimum size.

The data pointer is controlled by Gout: when Gout is 1, the output of the data pointer equals its input plus 1; otherwise the value is held.

We therefore use Gout as the select signal of a 9-Bit MUX. The worst case delay of Gout is about 300ps less than that of the SUM outputs, the MUX has a maximum delay of about 200ps, and the 8-Bit data pointer takes 440ps. Since their sum is less than the maximum delay of the 16-Bit full adder, the calculation of the upper 8 bits takes almost no additional time.
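The split between a 16-bit adder for the low word and a Gout-selected increment for the upper 8 bits can be modelled as below. This is a minimal functional sketch: the function name `accumulate24` is hypothetical, only the 8 upper data bits of the MUX are modelled (the report's MUX is 9 bits wide), and the carry out of the upper byte is ignored.

```python
def accumulate24(acc, abs_value):
    """One accumulation step: 16-bit add on the low word, +1 on the high byte."""
    low = acc & 0xFFFF
    high = (acc >> 16) & 0xFF

    full = low + (abs_value & 0xFFFF)    # 16-bit adder with G0 = gnd
    new_low = full & 0xFFFF
    gout = full >> 16                    # carry out of the low word

    high_plus_one = (high + 1) & 0xFF    # incrementer, ready before Gout arrives
    new_high = high_plus_one if gout else high  # MUX selected by Gout
    return (new_high << 16) | new_low


if __name__ == "__main__":
    acc = 0
    for value in [0xFFFF, 0x0001, 0x186F]:   # arbitrary sample inputs
        acc = accumulate24(acc, value)
    print(f"{acc:06X}")
```

Because the incremented value is prepared independently of the low-word addition, the upper byte only has to wait for Gout and the MUX, not for a 24-bit carry chain.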

Fig 3.10 Layout View of 24-Bit Accumulator (6528um^2, 64um*102um)

4. SRAM 

For the SRAM, we use the original design we produced in Lab 4 with some slight revisions.

We decreased the size of the column decoder. The column decoder operated a little faster than the row decoder, so there was an interval between the column being selected and the row being selected during which nothing useful happened. To speed up the operation of the SRAM, my partner and I slowed down the column decoder so that the column and row decoders are synchronous, which reduced the total delay.

Fig 3.11 Layout View of SRAM_2048 (41720um^2, 140um*298um)

Part IV Top Level Integration 

1. Schematic 

Fig 4.1 Schematic View of Motion Estimator

2. Layout 

Fig 4.2 Layout View of Motion Estimator (103400um^2, 347um*298um)

3. Simulation 

Fig 4.3 Simulation Waveform of Motion Estimator 

Fig 4.4 Simulation Waveform of Motion Estimator (Zoom In) 

The ME system operates as expected and produces the final value 13DF07, which is highlighted by the golden ellipse.

Minimum Clock Period: 1.5ns

Part V Conclusion 

If I could start over, I would redesign the SRAM by separating the row decoder and the column decoder from the SRAM array. The final project has two SRAM macros, and the address applied to both is always identical, so we do not need two sets of decoders: two decoders identical in function and structure double the decoder area and power consumption. One row decoder and one column decoder would be adequate. The loading capacitance of the decoders would be twice its present value, but the resulting penalty would be negligible because of the logarithmic relationship between delay and load. The area and power consumption would thus be reduced.

For the present design, my partner and I spent a great deal of time analyzing it top-down, from the architecture level through the transistor level to the layout. We took into consideration almost everything we could. We tried to speed up the critical stage and the critical path, and we looked for ways to simplify the components with lower timing priority and to reduce their area and power consumption.

Further optimization would be possible only if my partner and I could find a better SRAM structure; otherwise, we cannot produce another version with smaller area or lower power consumption.

The other three stages, the Data Pointer, the ADC and the Accumulator, are in my opinion excellent.

In short, we are proud of our design.

As for what I gained, I think the most dramatic change is my gradually shifting perspective on IC design. When I first started digital IC design, I only cared about realizing the function: a design was fine once the function was right, whatever its area, timing and power consumption. Then we came to know that speed was also important, then area, and power last of all.

We seldom took power consumption into consideration in our labs. During the final project, however, through trade-off optimization, my partner and I came to know that although a trade-off may slow down the operation as the cost of small size and low power consumption, an optimal structure at the architecture level can still give the whole system high speed. This means slowing down parts of the system to reduce the area and power consumption while preserving high timing performance on the critical stage and along the critical path.

We will apply these principles in more advanced design projects.
