15.01.2013 Views

U. Glaeser


A priori knowledge of data stream distribution can be exploited at the algorithmic and architectural levels for computation minimization and power reduction.

Nielsen and Sparso [22] observed that 16-bit speech samples exhibit significant correlation, in addition to a predominance of small signal values. As a result, they proposed a sliced datapath for a digital hearing aid filter bank that exploits the small magnitude of the input samples. The arithmetic datapath is partitioned into an MSB slice and an LSB slice, and the MSB slice is engaged only when the input bitwidth requires it. The slices are activated by special data tags that indicate the presence of sign extension bits in the MSB input slice. Additional circuit overhead is required to compute and update the tags. Dynamic bitwidth adaptation is coarse and can only be performed on a slice basis. With this scheme, both the power reduction and the processing time are data dependent.
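The tag idea can be illustrated in software. The following is a minimal sketch, not the circuit from [22]: it assumes a 16-bit two's-complement sample split into two 8-bit slices (the slice width is an assumption, as are the names `msb_slice_needed` and `msb_activity`), and it treats the tag as "the upper slice holds only sign extension bits".

```python
def msb_slice_needed(sample: int, lsb_bits: int = 8, width: int = 16) -> bool:
    """Return True when a two's-complement sample does not fit in the LSB slice.

    When this returns False, the upper (width - lsb_bits) bits are pure sign
    extension, so a tagged datapath could leave the MSB slice idle.
    The 8/8 split is an assumption made for illustration only.
    """
    if not -(1 << (width - 1)) <= sample <= (1 << (width - 1)) - 1:
        raise ValueError("sample does not fit in the full datapath width")
    lo = -(1 << (lsb_bits - 1))       # most negative value the LSB slice holds
    hi = (1 << (lsb_bits - 1)) - 1    # most positive value the LSB slice holds
    return not (lo <= sample <= hi)


def msb_activity(samples) -> float:
    """Fraction of samples that would engage the MSB slice."""
    return sum(msb_slice_needed(s) for s in samples) / len(samples)


# Hypothetical sample values, dominated by small magnitudes.
samples = [3, -7, 120, -128, 300, -2000, 15, 0]
print(f"MSB slice active for {msb_activity(samples):.0%} of samples")
```

The smaller the fraction reported, the more often the MSB slice can stay idle, which is where the data-dependent saving comes from.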

Xanthopoulos [23] has demonstrated a DA-based discrete cosine transform (DCT) [24] architecture that exploits correlation in the incoming image or video samples for computation minimization and power reduction. The DCT is a frequency-domain data transform widely used in video and still image compression standards such as MPEG [25] and JPEG [26]. The 8-point one-dimensional DCT is defined as follows:

© 2002 by CRC Press LLC

$$X[u] = \frac{c[u]}{2} \sum_{i=0}^{7} x[i] \cos\left(\frac{(2i+1)u\pi}{16}\right) \tag{29.8}$$

where $c[u] = 1/\sqrt{2}$ if $u = 0$ and 1 otherwise.
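As a reference point, Eq. (29.8) can be evaluated directly. The sketch below is a plain floating-point transcription of the formula, not the DA-based hardware of [23]; the function name `dct8` is an assumption.

```python
import math


def dct8(x):
    """Directly evaluate the 8-point one-dimensional DCT of Eq. (29.8)."""
    if len(x) != 8:
        raise ValueError("expected exactly 8 input samples")
    X = []
    for u in range(8):
        c = 1 / math.sqrt(2) if u == 0 else 1.0   # c[u] from Eq. (29.8)
        acc = sum(x[i] * math.cos((2 * i + 1) * u * math.pi / 16)
                  for i in range(8))
        X.append(c / 2 * acc)
    return X


# A constant input has energy only in the DC coefficient X[0].
coeffs = dct8([1.0] * 8)
print([round(v, 6) for v in coeffs])
```

Direct evaluation costs 64 multiply-accumulate operations per 8-point transform, which is the baseline that fast algorithms and data-dependent schemes try to reduce.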

Image pixels are locally well correlated and share a certain number of most significant bits. These bits constitute a common-mode DC offset that affects only the computation of the DC coefficient (X[0] in Eq. (29.8)) and is irrelevant to the computation of the higher spectral (AC) coefficients (X[1] … X[7] in Eq. (29.8)). The DCT chip in [23] includes adaptive-bitwidth DA computation units that reject the common most significant bits in all AC coefficient computations, so the arithmetic operates on reduced-bitwidth operands and switching activity drops. The bit-serial nature of the DA operation allows very fine-grained (1-bit) adaptation to the input dynamic range, as opposed to the coarse slice-level adaptation in [22].
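The claim that common most significant bits matter only for X[0] can be checked numerically. The sketch below is an illustration under stated assumptions, not the bit-serial DA units of [23]: it assumes unsigned 8-bit pixels, uses hypothetical pixel values, and the helper names `common_msb_count` and `strip_common_msbs` are invented for this example.

```python
import math


def dct8(x):
    """Direct 8-point DCT, transcribed from Eq. (29.8)."""
    out = []
    for u in range(8):
        c = 1 / math.sqrt(2) if u == 0 else 1.0
        acc = sum(x[i] * math.cos((2 * i + 1) * u * math.pi / 16)
                  for i in range(8))
        out.append(c / 2 * acc)
    return out


def common_msb_count(pixels, width=8):
    """Number of leading bits shared by every unsigned pixel in the row."""
    diff = 0
    for p in pixels:
        diff |= p ^ pixels[0]          # set every bit position that differs
    return width - diff.bit_length()


def strip_common_msbs(pixels, width=8):
    """Drop the shared leading bits; return the residuals and their bitwidth."""
    keep = width - common_msb_count(pixels, width)
    mask = (1 << keep) - 1
    return [p & mask for p in pixels], keep


# Hypothetical row of locally correlated 8-bit pixels.
pixels = [200, 201, 203, 202, 205, 204, 206, 207]
residuals, bits_kept = strip_common_msbs(pixels)

full = dct8(pixels)
reduced = dct8(residuals)
ac_error = max(abs(a - b) for a, b in zip(full[1:], reduced[1:]))
print(f"bits kept: {bits_kept}, largest AC difference: {ac_error:.2e}")
```

Removing the shared prefix subtracts the same constant from every pixel, and a constant contributes only to X[0], so the AC coefficients computed from the narrower residuals agree with the full-width ones up to floating-point rounding.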

An interesting algorithmic adaptation to data distribution properties has been demonstrated in [27]. The chip computes the inverse discrete cosine transform (IDCT) and is targeted at MPEG-compressed video data. The 8-point one-dimensional IDCT is defined as follows:

$$x[i] = \sum_{u=0}^{7} \frac{c[u]}{2} X[u] \cos\left(\frac{(2i+1)u\pi}{16}\right) \tag{29.9}$$

where $c[u] = 1/\sqrt{2}$ if $u = 0$ and 1 otherwise.

Numerous fast IDCT algorithms minimize the number of multiplications and additions implied by Eq. (29.9) [28,29]. Yet the statistical distribution of the input DCT coefficients has distinctive properties that can influence IDCT algorithm design. Typically, 64-coefficient DCT blocks of MPEG-compressed video sequences have only 5–6 nonzero coefficients, located mainly in the low spatial frequency positions because of the low-pass characteristics of frame sequences [27]. The histogram of Fig. 29.6 plots the frequency of occurrence of 64-coefficient blocks against the number of nonzero coefficients they contain for a typical MPEG sequence. The mode of such distributions is invariably a block with a single nonzero spectral coefficient (typically the DC).

Given such input data statistics, direct application of Eq. (29.9) results in a small average number of operations, since multiplication and accumulation with a zero-valued coefficient X[u] constitutes a NOP [30]. The chip in [27] uses such a direct coefficient-by-coefficient algorithm coupled with extensive clock-gating techniques to implement the implied NOPs. An IDCT computation power of 4.5 mW at MPEG-2 sample rates has been reported.
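The coefficient-by-coefficient idea can be sketched in software by skipping zero-valued inputs and counting the multiply-accumulate operations actually performed. This is a floating-point illustration of Eq. (29.9) under assumed names (`idct8_skip_zeros`, `macs`); it says nothing about the clock-gating circuitry or the datapath of the chip in [27].

```python
import math


def idct8_skip_zeros(X):
    """Evaluate Eq. (29.9) one input coefficient at a time.

    Zero-valued coefficients contribute nothing and are skipped, which is
    the software analogue of treating them as NOPs. Returns the 8 output
    samples and the number of multiply-accumulates actually executed.
    """
    if len(X) != 8:
        raise ValueError("expected exactly 8 coefficients")
    x = [0.0] * 8
    macs = 0
    for u, coeff in enumerate(X):
        if coeff == 0:
            continue                                  # implied NOP
        c = 1 / math.sqrt(2) if u == 0 else 1.0       # c[u] from Eq. (29.9)
        scale = c / 2 * coeff
        for i in range(8):
            x[i] += scale * math.cos((2 * i + 1) * u * math.pi / 16)
            macs += 1
    return x, macs


# A block with only a DC coefficient, the most common case described above.
dc_only = [4 / math.sqrt(2), 0, 0, 0, 0, 0, 0, 0]
samples, macs = idct8_skip_zeros(dc_only)
print(f"{macs} multiply-accumulates instead of 64")
```

With this formulation the operation count scales with the number of nonzero coefficients, so sparse blocks cost proportionally less than a dense evaluation.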
