15.01.2013 Views

U. Glaeser


a fundamental property that distinguished the early DSP processors. On the TMS320C1x, released in the early ’80s, it took 2N cycles for an N-tap filter (without the shift of the delay line) [5].

The modified Harvard architecture improves on this idea even further. It is combined with a “repeat” instruction and a specialized addressing mode, the circular addressing mode. In this case, one multiply-accumulate instruction is fetched from program memory and kept in the one-instruction-deep instruction “cache.” Then the data access cycles are performed in parallel: the coefficient is fetched from the program memory in parallel with the data sample being fetched from data memory. This architecture is found in all early DSP processors and is the foundation for all following DSP architectures. The number of memory accesses for one tap is reduced to two, and these occur in the same cycle. Thus, one tap can execute in one cycle and the multiply-accumulate unit is kept occupied every cycle.
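The combination of a repeated multiply-accumulate and circular addressing can be sketched in software. The following is a minimal Python model, not code for any particular DSP: the class name `CircularFir`, the helper `fir_direct`, and the choice to store the newest sample at a decreasing index are assumptions made for illustration. The list `coeffs` stands in for coefficients held in program memory and `delay` for samples held in data memory.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer (no data shifting)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)            # stands in for program memory
        self.delay = [0] * len(self.coeffs)   # stands in for data memory
        self.newest = 0                       # circular pointer to newest sample

    def step(self, sample):
        n = len(self.coeffs)
        # Overwrite the oldest sample instead of shifting the delay line.
        self.newest = (self.newest - 1) % n
        self.delay[self.newest] = sample
        acc = 0
        idx = self.newest
        for c in self.coeffs:                 # models the repeated MAC instruction
            acc += c * self.delay[idx]        # two reads per tap: coefficient, sample
            idx = (idx + 1) % n               # pointer update with wraparound
        return acc


def fir_direct(coeffs, samples):
    """Reference: direct evaluation of y[n] = sum_k h[k] * x[n-k]."""
    out = []
    for n in range(len(samples)):
        out.append(sum(h * samples[n - k]
                       for k, h in enumerate(coeffs) if n - k >= 0))
    return out
```

Each pass of the inner loop performs exactly the two memory reads per tap described above. On the hardware these two reads and the pointer updates happen in the same cycle, whereas this model simply executes them one after another.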

Newer generations of DSP processors have even more memory banks, accompanying address generation units, and control hardware, such as the repeat instruction, to support multiple parallel accesses.

The execution of a 32-tap FIR filter on the dual-MAC architecture of the Lucent DSP16210, shown in Fig. 42.56, will take only 19 cycles. The corresponding pseudo code is the following:

    do 14 { //one instruction !
        a0=a0+p0+p1
        p0=xh*yh p1=xl*yl
        y=*r0++ x=*pt0++
    }

This code can be executed in 19 clock cycles with only 38 bytes of instruction code. The inner loop takes one cycle to execute and, as can be seen from the assembly code, seven operations are executed in parallel: one addition, two multiplications, two memory reads, and two address pointer updates. Note that the second pointer update, *pt0++, updates a circular address pointer.
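To make the parallel semantics of the loop body concrete, here is a rough Python model of a software-pipelined dual-MAC dot product. It is a sketch under stated assumptions, not a cycle-accurate simulation of the DSP16210: the helper names `pack`, `halves`, and `dual_mac_dot` are hypothetical, the packing of two 16-bit values into the high and low halves of a 32-bit word is assumed, and the circular wraparound of the pointer is left out. The loop runs two extra passes to drain the pipeline, which need not match the prologue and epilogue of the real code.

```python
def pack(hi, lo):
    """Pack two signed 16-bit values into one 32-bit word (assumed layout)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def halves(word):
    """Split a 32-bit word into signed 16-bit high and low halves."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16((word >> 16) & 0xFFFF), s16(word & 0xFFFF)


def dual_mac_dot(x_words, y_words):
    """Dot product over packed words, two multiplies per pass."""
    a0 = 0           # accumulator
    p0 = p1 = 0      # product registers
    x = y = 0        # 32-bit operand registers (zero gives harmless products)
    n = len(x_words)
    for i in range(n + 2):              # two extra passes drain the pipeline
        # Every right-hand side uses register values from the previous pass,
        # mimicking operations that execute in parallel in one cycle.
        xh, xl = halves(x)
        yh, yl = halves(y)
        new_a0 = a0 + p0 + p1           # a0=a0+p0+p1
        new_p0, new_p1 = xh * yh, xl * yl   # p0=xh*yh  p1=xl*yl
        new_x = x_words[i] if i < n else 0  # x=*pt0++
        new_y = y_words[i] if i < n else 0  # y=*r0++
        a0, p0, p1, x, y = new_a0, new_p0, new_p1, new_x, new_y
    return a0
```

A word loaded in one pass is multiplied in the next pass and accumulated in the pass after that, so the three lines of the loop body work on three different pairs of taps at any given time.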

Two architectures that speed up the FIR calculation to 0.5 cycle per tap are shown in Fig. 42.56. The first one is the above-mentioned Lucent DSP16210. The second one is an architecture presented in [9]. It has a multiply-accumulate unit that operates at twice the frequency of the memory accesses.
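The second approach can be illustrated with a small model in which each memory delivers one pair of 16-bit values per machine cycle and a single MAC unit consumes one value from each pair in each half cycle. This is only an interpretation of the block diagram labels (even and odd sides, half machine cycle); the function name `double_rate_mac` and the tuple layout are assumptions, and the latency of the pipeline register is ignored.

```python
def double_rate_mac(mem_x, mem_y):
    """Model a MAC clocked at twice the memory access rate.

    mem_x, mem_y: lists of (even, odd) 16-bit pairs; one pair is read
    from each memory per machine cycle.
    Returns (accumulated sum, machine cycles used).
    """
    acc = 0
    machine_cycles = 0
    for (x_even, x_odd), (y_even, y_odd) in zip(mem_x, mem_y):
        machine_cycles += 1          # one access to each memory per machine cycle
        acc += x_even * y_even       # first half machine cycle
        acc += x_odd * y_odd         # second half machine cycle
    return acc, machine_cycles


# Usage: a 32-tap dot product needs 16 machine cycles in this model.
taps = list(range(1, 33))
data = list(range(32, 0, -1))
mem_x = [(taps[i], taps[i + 1]) for i in range(0, 32, 2)]
mem_y = [(data[i], data[i + 1]) for i in range(0, 32, 2)]
result, cycles = double_rate_mac(mem_x, mem_y)
```

In this model the single multiplier is used twice per memory access, which gives the same 0.5 cycle per tap as two multipliers running at the memory rate.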

The difficult part in the implementation of this tight loop is the arrangement of the data samples in memory. To supply the parallel MAC data paths, two 32-bit data items are read from memory and stored in the X and Y registers, as shown in Fig. 42.56. A similar split in lower and higher halves occurs in the

FIGURE 42.56 DSP architectures for 0.5 cycle per FIR tap: (a) Lucent DSP16210 architecture (32-bit X and Y registers fed from XDB and IDB, two 16 x 16 multipliers producing p0 and p1, shift/saturate units, ALU, ADD, BMU, and an 8 x 40 accumulator file); (b) MAC at double frequency [14] (memories MX and MY with 16-bit even and odd sides, pointers PX and PY, temporary registers, A-bus and B-bus, and a MAC unit with multiplier, 32-bit pipeline register, adder, 40-bit accumulator, and barrel shifter, clocked every half machine cycle).

© 2002 by CRC Press LLC
