EE577a Final Project Spring 2009

16-Bit Motion Estimator for DSP

EE577a Final Project

Due Date: 5/4/2009

Names: Sun, Qifeng 6383195770
       Bao, Shengkun 9538161299

E-mail: qifengsu@usc.edu
        sbao@usc.edu

Description of the Project

Students will design a low-power 16-Bit Motion Estimation (ME) kernel for a DSP, as shown in the figure below. The 16-bit ME kernel is implemented by the 4Kb (2×128×16) data memory, the data pointer, and the absolute difference accumulator, which consists of a 16-bit absolute difference circuit and a 16-bit adder circuit. It reads two 4×4-bit pixel blocks (a macro block and a candidate block) from the data memory and calculates the absolute difference between each corresponding pair of pixels in the two blocks. At the end it outputs the accumulated difference over all pairs of pixels in the two blocks.

Overall Block Diagram
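
The behaviour described above can be summarised with a small reference model. This is only a sketch under stated assumptions: it treats every 16-bit memory word as one value to compare, walks both 128-word banks in step, and wraps the total to 24 bits (the accumulator width given later in the report). The function name and the test data are hypothetical and are not taken from the project's memory image.

```python
# Behavioural sketch of the ME kernel: two 128-word banks of 16-bit data,
# one absolute difference per address, summed into a 24-bit accumulator.
WORDS_PER_BANK = 128   # from the 2x128x16 memory organisation
WORD_MASK = 0xFFFF     # 16-bit data words
ACC_MASK = 0xFFFFFF    # 24-bit accumulator (assumed wrap-around on overflow)


def accumulated_difference(bank_a, bank_b):
    """Return the accumulated absolute difference of two equally sized banks."""
    if len(bank_a) != len(bank_b):
        raise ValueError("banks must have the same length")
    total = 0
    for a, b in zip(bank_a, bank_b):
        total = (total + abs((a & WORD_MASK) - (b & WORD_MASK))) & ACC_MASK
    return total


# Hypothetical test data: every word pair differs by 0x186F.
bank_a = [0x1986] * WORDS_PER_BANK
bank_b = [0x0117] * WORDS_PER_BANK
result = accumulated_difference(bank_a, bank_b)
print(hex(result))
```

Even in the worst case (128 differences of FFFF each) the total stays below 2^24, so the wrap-around in this model is never triggered for a full pass over the memory.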


Part I Introduction

1. Top Level Architecture

Our design differs slightly from the standard top level architecture. To reduce the delay of the Row/Column Decoder and the Absolute Difference Calculator (ADC), the optimized DFFs in the Data Pointer and in the registers following the SRAM provide both the true and the complemented values of the address outputs and of the data read out. As a result, the SRAM cell has 14 address inputs and the ADC has 64 data inputs in total, half of them original signals and half their complements.

2. Performance Matrix

Table 1.1 Gross Performance Matrix

    Metric                              Value
    Clock Frequency                     667MHz
    Area                                103400um^2 (347um * 298um)
    Average Power Consumption           4.5E+01mW
    Clock Frequency / Power             1.48E+10 Hz/W
    Clock Frequency / Area              6.48E+15 Hz/m^2
    Clock Frequency / (Area * Power)    1.44E+17 Hz/(W*m^2)

Part II Floor Plan

The Motion Estimator can be divided into 4 pipeline stages.

Stage 1: Data Pointer
    Component                        Count
    7-Bit Accumulator                1
    1-Bit-2-Output DFF               7
    I/P: 3     Clock, nClock, Reset
    O/P: 14    Q, nQ

Stage 2: SRAM Macro
    Component                        Count
    2Kb SRAM Cell (32*4*16)          2
    8-Bit-2-Output Register Array    4
    I/P: 50    Pre_Charge, Write_en, Read_en, A_EN, Add, nAdd, Data1, Data2
    O/P: 64    Dout1, nDout1, Dout2, nDout2

Stage 3: Absolute Difference Calculator
    Component                        Count
    16-Bit Full Adder                2
    16-Bit MUX Array                 1
    16-Bit Register Array            1
    I/P: 64    Dout1, nDout1, Dout2, nDout2
    O/P: 16    ABS

Stage 4: 24-Bit Accumulator
    Component                        Count
    16-Bit Full Adder                1
    8-Bit Accumulator                1
    9-Bit MUX Array                  1
    24-Bit Register Array            1
    1-Bit DFF                        1
    I/P: 17    ABS, G0 = gnd
    O/P: 25    SUMABS, Gout

The area occupied by wires only (with no components underneath) is about 1400um^2 (a 39um*35um rectangle plus some small additional areas), which is about 1.35% of the overall area.

Part III Component Design and Exploration

1. 16-Bit Absolute Difference Calculator

(1) Macro Block

Fig 3.1 Top Level Architecture Diagram

The ADC calculates the absolute difference between vectors A[16:1] and B[16:1], the data words read out from the two SRAM cells. It computes both A-B and B-A by adding the original value of one vector to the complemented value of the other, and uses the complemented overflow signal (Cout, or Gout, of the two 16-Bit full adders) as the select signal of the 16-Bit MUX, which outputs the expected absolute difference.
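
A minimal Python sketch of this two-adder scheme follows. It assumes unsigned 16-bit operands and a carry-in of 1 on both adders (matching the G0 = vdd setting used in the adder simulation), so that each adder forms a two's complement subtraction. The function names are hypothetical, and the select polarity shown here (carry-out of the A-B adder picks A-B) is one possible wiring; the actual circuit uses the complemented signal with its own MUX input order.

```python
MASK16 = 0xFFFF


def add16(x, y, carry_in):
    """16-bit add with carry-in; returns (sum, carry_out)."""
    total = (x & MASK16) + (y & MASK16) + carry_in
    return total & MASK16, total >> 16


def abs_diff16(a, b):
    """|a - b| for unsigned 16-bit a and b, using two adders and a 2:1 MUX."""
    a_minus_b, cout_ab = add16(a, ~b & MASK16, 1)   # A + nB + 1
    b_minus_a, _ = add16(b, ~a & MASK16, 1)         # B + nA + 1
    # cout_ab == 1 means no borrow, i.e. a >= b, so A-B is the valid result.
    return a_minus_b if cout_ab else b_minus_a


print(hex(abs_diff16(0x1986, 0x0117)))
```

Because both differences are computed in parallel, the only extra delay after the adders is the MUX, which is why the carry-out delay matters for this stage.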

(2) Schematic

The main part of the ADC schematic is the 16-Bit full adder.

Fig 3.2 Schematic View of optimized 16-Bit Full Adder

According to the simulation results of the SRAM we designed in Lab 4, the worst case read delay, including the decoder delay, is about 800~900ps. This is less than the delay of the ADC, which includes the delays of the 16-Bit full adder and the MUX, so the ADC is the critical stage and must be made faster to reach a high clock frequency. The SRAM cell consumes the most power of all the components because of its scale, but we also need to reduce the power consumption of the ADC while maintaining its high speed, because the 16-Bit full adder structure is used three times across two pipeline stages.


To speed up the 16-Bit full adder and reduce its power consumption, we redesigned the adder. It still uses the Sklansky structure, but the Black and Gray cells at the vital points of the critical path have been sized, following logical effort, to match their load capacitance and so minimize the propagation delay along the critical path.

The other cells are scaled down to the minimum size that keeps rising and falling delays equal. All redundant buffers are removed to decrease the power consumption.

With these optimizations, the worst case delay of the 16-Bit adder is reduced from 1.03ns to 910ps, and the carry-out (Gout) delay is about 600ps, which is a great help because Gout is used as the control signal of the following logic.
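
For readers unfamiliar with the Sklansky structure, the following Python sketch models its logic at the bit level. It is a behavioural illustration only: it says nothing about transistor sizing, it does not distinguish Black cells from Gray cells (it computes both group generate and group propagate at every node), and the function name `sklansky_add16` is hypothetical.

```python
def sklansky_add16(a, b, carry_in=0, width=16):
    """Add two width-bit values with a Sklansky prefix tree.

    Returns (sum, gout), where gout is the carry out of the top bit.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    G, P = g[:], p[:]                                      # group signals
    level = 0
    while (1 << level) < width:
        for i in range(width):
            if (i >> level) & 1:
                # Upper half of each block combines with the top bit of
                # the lower half; that bit is not modified at this level.
                j = ((i >> level) << level) - 1
                G[i] = G[i] | (P[i] & G[j])
                P[i] = P[i] & P[j]
        level += 1
    # carries[i] is the carry into bit i; carries[width] is the carry out.
    carries = [carry_in] + [G[i] | (P[i] & carry_in) for i in range(width)]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]


print(sklansky_add16(0x6555, 0xAAAA, carry_in=1))
```

The tree has log2(16) = 4 levels, which is what makes the carry-out available well before the slowest sum bit in a sized implementation.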

(3) Layout

Fig 3.3 Layout View of optimized 16-Bit Full Adder (2560um^2, 64um*40um)

To fit all the components into the pipeline and integrate them into a whole system easily, we laid out some of the components, such as the 16-Bit full adder and the register array, with the same horizontal length of 64um, which corresponds to 16 data bits in parallel.

Furthermore, because some unnecessary buffers have been removed from the adder, some blank areas in the layout are inevitable.

Fig 3.4 Layout View of ADC (5376um^2, 64um*84um)

The layout above shows the ADC, consisting of two 16-Bit full adders and the 16-Bit MUX array (marked by the golden ellipse). The ADC has two groups of inputs, 32 complementary pairs in total: A[16:1], nA[16:1], B[16:1] and nB[16:1].

The whole ADC is also laid out to match the 64um horizontal length standard.

(4) Testing and Simulation

Fig 3.5 Simulation Waveform of 16-Bit full Adder (G0 is set to vdd)

According to the simulation waveform, the 16-Bit full adder operates as expected. With G0 = vdd:

    FFFF + 0000 = 0000, Gout = 1
    6555 + AAAA = 1000, Gout = 1
    5555 + AAAA = 0000, Gout = 1

The worst case delays are small:

    Worst Sum Delay:  910ps
    Worst Gout Delay: 610ps

Fig 3.6 Simulation Waveform of ADC

According to the simulation waveform, the ADC operates as expected:

    ABS (0000, 0000) = 0000
    ABS (1986, 0117) = 186F
    ABS (1111, 1111) = 0000
    ABS (1986, 1989) = 0003

The worst case delay is 1.12ns.

The output becomes stable only after this delay. Before that it may take undetermined or otherwise wrong values, so the clock period of the pipeline must be long enough for the register to sample the correct, stable value.

Taking into account the minimum setup time of the DFF designed in Lab 3, which is about 110~140ps, the required period is:

    1.12ns + 0.14ns = 1.26ns

So, in general, the clock period of the pipeline should be no less than 1.3ns.
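
The same budget can be written as a short calculation. This sketch works in picoseconds to avoid floating point rounding, and, like the estimate above, it ignores clock-to-Q delay and clock skew. The 100ps rounding step and the function name are assumptions made for illustration.

```python
ADC_WORST_DELAY_PS = 1120   # worst case ADC delay from the simulation
DFF_SETUP_MAX_PS = 140      # upper end of the 110~140ps setup time range


def min_clock_period_ps(logic_delay_ps, setup_ps, step_ps=100):
    """Smallest multiple of step_ps that covers logic delay plus setup time."""
    required = logic_delay_ps + setup_ps
    return -(-required // step_ps) * step_ps   # ceiling division


required_ps = ADC_WORST_DELAY_PS + DFF_SETUP_MAX_PS
period_ps = min_clock_period_ps(ADC_WORST_DELAY_PS, DFF_SETUP_MAX_PS)
print(required_ps, period_ps)   # 1260 1300
```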


2. 7-Bit Data Pointer

As mentioned above, analysis shows that the ADC stage has the largest delay, the delays of the SRAM and accumulator stages are comparable to it, and the delay of the data pointer is much smaller.

The emphasis of the data pointer design is therefore not speed. Instead, we implement it with as few and as small components as possible to reduce power and area.

The smaller the transistors, the smaller the current, the lower the speed, and the smaller the area and power consumption. We therefore use the simplest logic implementation and scale all of the transistors to the minimum size, 300nm/200nm.

Fig 3.7 Schematic View of Data Pointer

Fig 3.8 Simulation Waveform of Data Pointer

Worst case delay: 440ps (combinational logic only)
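
As a rough behavioural picture of this block, the sketch below models a 7-bit address counter whose register outputs are available in both true (Q) and complemented (nQ) form, giving the 14 outputs listed in the floor plan. It assumes the pointer simply increments by one on every clock and wraps after 128 addresses; the class and method names are hypothetical.

```python
class DataPointer:
    """7-bit address counter with true and complemented outputs."""

    WIDTH = 7

    def __init__(self):
        self.count = 0

    def reset(self):
        """Model the Reset input: return to address 0."""
        self.count = 0

    def tick(self):
        """Model one clock edge: advance to the next address, wrapping at 128."""
        self.count = (self.count + 1) & ((1 << self.WIDTH) - 1)

    def outputs(self):
        """Return (Q, nQ) as lists of bits, least significant bit first."""
        q = [(self.count >> i) & 1 for i in range(self.WIDTH)]
        nq = [bit ^ 1 for bit in q]
        return q, nq


pointer = DataPointer()
for _ in range(5):
    pointer.tick()
print(pointer.outputs())
```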


3. 24-Bit Accumulator

Fig 3.9 Schematic View of Accumulator

The goal of the 24-Bit accumulator design is to reduce power and area while keeping its speed comparable with that of the ADC.

Through analysis, we found that the 8 higher bits of the accumulator behave like the data pointer rather than like an adder. So we combined the 16-Bit full adder with an 8-Bit data pointer. Since the delay of the data pointer is much less than that of the adder, its speed is not a concern; instead we minimized its area and power consumption by scaling all of its transistors to minimum size.

In addition, this data pointer is controlled by Gout: when Gout is 1, its output equals its input plus 1; otherwise the output keeps its value.

So we use Gout as the select signal of a 9-Bit MUX. The worst case delay of Gout is about 300ps less than that of the SUM outputs, the MUX has a maximum delay of about 200ps, and the 8-Bit data pointer takes 440ps. The 8 higher bits are therefore ready within the maximum delay of the 16-Bit full adder, so their calculation adds almost no time.
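
The split described above can be checked with a small behavioural model. This is a sketch, not the circuit: it assumes a 16-bit adder with carry-in 0 (G0 = gnd) on the lower bits, an incrementer on the upper 8 bits that is evaluated in parallel, and a MUX that picks the incremented value when Gout is 1. The function name is hypothetical, and the 9th MUX bit of the real design is not modelled.

```python
MASK16 = 0xFFFF
MASK8 = 0xFF


def accumulate24(acc, abs_value):
    """One accumulation step; acc is 24 bits wide, abs_value is 16 bits wide."""
    low, high = acc & MASK16, (acc >> 16) & MASK8
    low_sum = low + (abs_value & MASK16)        # 16-bit full adder, G0 = gnd
    gout = low_sum >> 16                        # carry out of the lower half
    high_plus_one = (high + 1) & MASK8          # incrementer, ready early
    new_high = high_plus_one if gout else high  # MUX selected by Gout
    return (new_high << 16) | (low_sum & MASK16)


acc = 0
for value in (0xFFFF, 0x0001, 0x186F):
    acc = accumulate24(acc, value)
print(hex(acc))
```

For any 24-bit state and 16-bit input this gives the same result as a plain 24-bit addition modulo 2^24, which is why the upper bits never need a full adder.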


Fig 3.10 Layout View of 24-Bit Accumulator (6528um^2, 64um*102um)

4. SRAM

For the SRAM, we use the original design we produced in Lab 4 with some slight revisions.

We decreased the size of the column decoder. In the original design the column decoder was a little faster than the row decoder, so there was an interval between column selection and row selection during which nothing useful happened. To speed up the SRAM, my partner and I slowed down the column decoder so that the column and row decoders operate synchronously. As a result, the total delay is reduced.

Fig 3.11 Layout View of SRAM_2048 (41720um^2, 140um*298um)


Part IV Top Level Integration

1. Schematic

Fig 4.1 Schematic View of Motion Estimator

2. Layout

Fig 4.2 Layout View of Motion Estimator (103400um^2, 347um*298um)

3. Simulation

Fig 4.3 Simulation Waveform of Motion Estimator

Fig 4.4 Simulation Waveform of Motion Estimator (Zoom In)

The ME system operates as expected and produces the final value 13DF07, as highlighted by the golden ellipse.

Minimum Clock Period: 1.5ns
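
As a quick cross-check, the clock frequency and the frequency-per-power figure in the performance matrix of Part I follow directly from this minimum period and the reported average power. The sketch below only repeats that arithmetic; the variable names are hypothetical.

```python
MIN_CLOCK_PERIOD_S = 1.5e-9   # minimum clock period reported above
AVERAGE_POWER_W = 45e-3       # 4.5E+01mW from the performance matrix

clock_hz = 1.0 / MIN_CLOCK_PERIOD_S
freq_per_watt = clock_hz / AVERAGE_POWER_W
print(f"{clock_hz / 1e6:.0f} MHz, {freq_per_watt:.2e} Hz/W")
```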


Part V Conclusion

If I could start over, I would redesign the SRAM and separate the row decoder and column decoder from the SRAM array. We have two SRAM macros in the final project, and the address applied to each of them is always identical, so we do not need two sets of decoders: using two decoders that are identical in both function and structure doubles the decoder area and power consumption. One row decoder and one column decoder would be adequate. The load capacitance on the decoders would be twice its present value, but the effect on delay would be negligible due to the logarithmic relationship. The area and power consumption would therefore be reduced.

For our present design, my partner and I spent a lot of time analyzing it top-down, through the architecture level, the transistor level and the layout. We have taken into consideration almost everything we could. We sped up the operation of the critical stage and the critical path, and we looked for ways to simplify the components with lower timing priority to reduce their area and power consumption.

As for further optimization, I think that unless my partner and I find a better SRAM structure, we cannot produce another version with smaller area or lower power consumption.

The other three stages, the Data Pointer, the ADC and the Accumulator, are in my opinion excellent.

So in a word, we are proud of our design.

As for what we achieved, I think the most dramatic change is in my perspective on IC design. When I first started digital IC design, I only considered realizing the function: once the function was right, the design was fine, whatever the area, timing and power consumption. Then we came to know that speed was also important, then area, and power last of all.

We seldom took power consumption into consideration in our labs. During the final project, however, through the process of trade-off optimization, my partner and I learned that although a trade-off may slow down the operation as the cost of small size and low power consumption, an optimal structure at the architecture level can still give the whole system high speed. This means slowing down parts of the system to reduce area and power consumption while preserving high timing performance in the critical stage and along the critical path.

We will apply these principles in advanced design projects.
