15.01.2013 Views

U. Glaeser


additional power benefits. Efficient hardwired instructions can reduce control overhead and minimize communication among functional units, thus reducing the switched capacitance in Eq. (29.2).

General-purpose DSPs typically include instructions that place the processor in one of several levels of standby mode [6]. As an example, a DSP processor can be in full operational mode, in level 2 standby mode (computational units powered down, peripheral circuits and PLL on), in level 1 standby mode (only the PLL on), or in sleep mode (everything, including the PLL, is powered down except for a small sleep circuit capable of ramping up the PLL and powering on the rest of the units). Digital PLL designs help in the implementation of multiple idle states because they allow fast PLL frequency ramping and fast switching between the various standby modes. Depending on the application, such software-induced standby modes can provide substantial power savings.
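The four modes described above can be sketched as a small table of which blocks stay powered. This is only an illustrative model: the per-block power figures, the block names, and the assumption that the sleep circuit is always on are hypothetical and are not taken from the text or from any particular DSP.

```python
from enum import Enum


class PowerMode(Enum):
    FULL = "full operational"
    STANDBY_2 = "level 2 standby"
    STANDBY_1 = "level 1 standby"
    SLEEP = "sleep"


# Blocks that remain powered in each mode, following the description above.
# Assumption: the small sleep circuit is powered in every mode.
POWERED_UNITS = {
    PowerMode.FULL: {"compute", "peripherals", "pll", "sleep_circuit"},
    PowerMode.STANDBY_2: {"peripherals", "pll", "sleep_circuit"},
    PowerMode.STANDBY_1: {"pll", "sleep_circuit"},
    PowerMode.SLEEP: {"sleep_circuit"},
}

# Hypothetical per-block power in microwatts (integers keep the sums exact).
UNIT_POWER_UW = {
    "compute": 100_000,
    "peripherals": 20_000,
    "pll": 5_000,
    "sleep_circuit": 100,
}


def mode_power_uw(mode):
    """Total power of all blocks that stay on in the given mode."""
    return sum(UNIT_POWER_UW[unit] for unit in POWERED_UNITS[mode])


def average_power_uw(time_in_mode):
    """Time-weighted average power for a {mode: duration} schedule."""
    total_time = sum(time_in_mode.values())
    weighted = sum(mode_power_uw(m) * t for m, t in time_in_mode.items())
    return weighted / total_time
```

With such a table, an application that spends most of its time waiting for input can be compared under different idle policies simply by changing the durations passed to `average_power_uw`.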

DSP algorithms usually involve the repetitive execution of a small set of instructions (a kernel). Most programmable DSPs include hardware support for tight loops. A standard software loop implementation requires the maintenance and update of a loop index, a compare instruction, and a conditional branch to the beginning of the loop. This loop overhead can easily slow down a DSP kernel by a substantial factor. DSPs include hardware support for both single-instruction and multiple-instruction loops (e.g., a REPEAT instruction) [11]. A single-instruction loop repeats one instruction multiple times without maintaining a loop index, and fetches it from memory only once. A multiple-instruction loop, on the other hand, must fetch its instructions from memory again each time the processor executes the loop body. Hiraki et al. [12] have proposed an interesting low-power optimization for multiple-instruction loops that has wide applicability in programmable DSPs. A small decoded instruction buffer (DIB) stores the decoded instructions during the first iteration of the loop. Subsequent iterations do not engage the instruction memory and decode unit, but fetch the decoded instructions from the DIB. Case studies have indicated 40% power savings when a DIB is implemented in a DSP for certain multimedia applications.
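The two effects described above (software loop overhead, and repeated instruction fetches that a DIB avoids) can be illustrated with a simple counting model. It is a sketch under stated assumptions rather than a model of any real DSP: the three overhead instructions per iteration (index update, compare, branch), the `dib_capacity` parameter, and the function names are all hypothetical.

```python
def software_loop_instructions(body_len, iterations, overhead=3):
    """Instructions executed by a software loop.

    Assumes one instruction each for the index update, the compare,
    and the conditional branch (overhead=3) on every iteration.
    """
    return iterations * (body_len + overhead)


def hardware_loop_instructions(body_len, iterations):
    """Instructions executed when the loop control is done in hardware."""
    return iterations * body_len


def count_fetches(body_len, iterations, use_dib, dib_capacity=32):
    """Return (instruction_memory_fetches, dib_reads) for one loop.

    With a DIB, the first iteration fetches and decodes from instruction
    memory and fills the buffer; later iterations read only the DIB.
    Assumption: a loop body larger than dib_capacity bypasses the DIB.
    """
    if iterations == 0:
        return 0, 0
    if use_dib and body_len <= dib_capacity:
        return body_len, body_len * (iterations - 1)
    return body_len * iterations, 0
```

For a one-instruction kernel executed 100 times, this model gives 400 instructions for the software loop against 100 with hardware looping. For an 8-instruction body, it gives 800 instruction-memory fetches without a DIB against 8 with one. The model counts accesses only; it says nothing about the energy per access, so it does not by itself reproduce the reported 40% figure.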

Recently, the Berkeley Pleiades project [13] has introduced a 1-V heterogeneous reconfigurable DSP targeted at wireless baseband processing. The architecture consists of multiple “satellite” arithmetic processors, on-chip FPGA sections, on-chip memory banks, address generators, and an embedded ARM core. All these heterogeneous units are interconnected by a hierarchical reconfigurable network. The ARM core is responsible for online reconfiguration through a dedicated bus. According to the Pleiades computation model, the embedded microprocessor core executes the high-level control and spawns arithmetic-intensive DSP kernels on the satellites. The flow of control returns to the ARM core when all the satellite operations have completed. Run-time reconfiguration makes such an architecture very power-efficient compared to conventional programmable DSPs. A Pleiades silicon implementation is reported to implement baseband wireless functions at 10–100 MOPS/mW.

Circuit Power Optimizations

Most of the DSPs available in the market today include some form of fine-grain clock-gating mechanism for power reduction. DSPs are very well suited for clock-gating because of their regular datapath structure and small control structures (which typically cannot employ fine-grain gated clocks). A typical datapath pipeline stage employing clock-gating is shown in Fig. 29.3. Signals EN0 and EN1 are the stage clock enables; they are computed by the control section and latched 180° ahead of time. A master clock is distributed to the gating clock drivers, whose cost is typically amortized across the entire datapath width of the pipeline. Clock-gating not only saves clock and flip-flop power, but also prevents the combinational logic between pipeline stages from switching. The main downside of clock-gating is that it can complicate static timing closure because of increased uncertainty in the calculation of setup and hold time constraints. Clock-gating reduces the switching activity factor α in Eq. (29.2).
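The effect of a stage enable on switching activity can be sketched with a cycle-level behavioral model of a single pipeline register. This is not the circuit of Fig. 29.3: the function name, the 16-bit default width, and the toggle-counting metric are assumptions made for illustration.

```python
def simulate_stage(data_in, enables, width=16):
    """Simulate one clock-gated pipeline register.

    data_in -- value presented at the register input on each cycle
    enables -- stage clock enable for each cycle (True lets the clock through)

    Returns (final_value, clock_edges, bit_toggles), where clock_edges counts
    the gated clock edges that reach the register and bit_toggles counts the
    output bits that change. Output toggles are a rough proxy for activity in
    the downstream combinational logic.
    """
    mask = (1 << width) - 1
    q = 0               # register output, assumed to reset to zero
    clock_edges = 0
    bit_toggles = 0
    for d, en in zip(data_in, enables):
        if not en:
            continue    # gated cycle: no clock edge, output holds its value
        clock_edges += 1
        bit_toggles += bin((q ^ d) & mask).count("1")
        q = d & mask
    return q, clock_edges, bit_toggles
```

Running the same input sequence once with every enable set and once with the enable asserted only on cycles whose results are actually used shows fewer clock edges and fewer output toggles in the gated case, which is the reduction in activity that the text attributes to clock-gating.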

On-chip memory blocks (SRAMs and ROMs) are typically optimized for low power. Memory blocks are partitioned into multiple banks so that only a small fraction of the total memory array is activated during a memory access [6,8]. Moreover, address bits are typically allocated among the row decoders and column decoders in such a way that sequential memory accesses do not activate the row decoders during each cycle [8].
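One way to picture this address allocation is to place the column bits in the low-order part of the address, so that consecutive addresses stay within the same row of the same bank. The sketch below assumes a hypothetical layout of 2 bank bits, 6 row bits, and 3 column bits; the field widths and function names are illustrative and do not come from the cited designs.

```python
def split_address(addr, col_bits=3, row_bits=6, bank_bits=2):
    """Split an address into (bank, row, column) fields.

    Assumed layout, from most to least significant: bank | row | column.
    """
    col = addr & ((1 << col_bits) - 1)
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    bank = (addr >> (col_bits + row_bits)) & ((1 << bank_bits) - 1)
    return bank, row, col


def count_row_activations(addresses, **layout):
    """Count accesses that need a new row decode.

    A row decoder is considered active only when the (bank, row) pair
    differs from that of the previous access.
    """
    activations = 0
    last = None
    for addr in addresses:
        bank, row, _ = split_address(addr, **layout)
        if (bank, row) != last:
            activations += 1
            last = (bank, row)
    return activations
```

Under this layout, 64 sequential accesses trigger only 8 row activations, because each row holds 8 consecutive words. Only one of the four banks is touched by any single access.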

© 2002 by CRC Press LLC
