Lecture Notes in Computer Science 4917

COFFEE: COmpiler Framework for Energy-Aware Exploration

We have used the proposed compiler and simulator flow to optimize the WCDMA receiver code from [23] for a baseline VLIW processor. Starting from the detailed performance estimate on the complete code, the receiver filter (Rx Filter in Figure 5) was identified as the single most important part of the application in terms of computational requirements and energy consumption (85% of the WCDMA receiver's cycles). Optimizing code transformations using the URUK flow will be illustrated on this part.
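
To see why focusing on the receiver filter pays off, the 85% figure can be plugged into Amdahl's law. The sketch below is not part of the COFFEE flow; the function name and the local speedup factors are hypothetical and chosen only for illustration.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when a part that takes
    `fraction` of the cycles is made `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


RX_FILTER_FRACTION = 0.85  # from the text: 85% of the receiver's cycles

# Hypothetical speedups of the filter alone (assumed values, not measured).
for s in (2.0, 4.0, 8.0):
    total = overall_speedup(RX_FILTER_FRACTION, s)
    print(f"filter x{s:g} -> receiver x{total:.2f}")

# Limit when the filter cost goes to zero: the remaining 15% dominates.
upper_bound = 1.0 / (1.0 - RX_FILTER_FRACTION)
print(f"upper bound: x{upper_bound:.2f}")
```

Under these assumptions, even an unlimited speedup of the filter caps the whole-receiver gain at roughly 6.7x, which is why the remaining 15% of the code eventually matters as well.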

It should be emphasized here that the presented COFFEE framework can be used for any ANSI-C compliant application. WCDMA is shown as a relevant example from our target application domain: multimedia and wireless communication algorithms for portable devices.

5.2 Processor Architectures

The wide range of architectural parameters supported by the presented framework allows designers to explore many different architectural styles and variants quickly. It enables experiments with combinations of parameters or components not commonly found in current processors. In the experimental results shown here, we have optimized the WCDMA receiver filter for two architectures described in this subsection, as an example of the range and complexity of architectures that can be supported.

TI C64-like VLIW processor. The TI C64 [10] is a clustered VLIW processor with two clusters of 4 Functional Units (FUs) each. This processor is chosen as a typical example of an Instruction Level Parallel (ILP) processor. In this heterogeneous VLIW, each FU can perform a subset of all supported operations. The Instruction Set Architecture (ISA) and the sizes of the memory hierarchy are modeled as described in [10]. The TI C64 also supports a number of hardware accelerators (e.g. Viterbi decoder). These blocks can be correctly modeled by our flow by using intrinsics, but since they are not needed for the WCDMA benchmark, they are not modeled in this case study.
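
The clustered, heterogeneous organization described above can be pictured as a small data structure. This is a minimal sketch and not the architecture description format used by COFFEE: the class names are hypothetical, the unit names follow the usual TI convention (L, S, M, D per cluster), and the operation subsets are heavily simplified assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalUnit:
    name: str
    ops: frozenset  # operations this FU is able to execute


@dataclass
class Cluster:
    fus: list


@dataclass
class ClusteredVLIW:
    clusters: list

    @property
    def issue_width(self) -> int:
        # One operation can be issued per FU per cycle.
        return sum(len(c.fus) for c in self.clusters)

    def units_for(self, op: str) -> list:
        # Names of all FUs, across clusters, that support `op`.
        return [fu.name for c in self.clusters for fu in c.fus if op in fu.ops]


def make_c64_like() -> ClusteredVLIW:
    # ASSUMPTION: simplified operation subsets, for illustration only.
    op_sets = {
        "L": {"add", "sub", "and", "or", "cmp"},
        "S": {"add", "sub", "shl", "shr", "branch"},
        "M": {"mul"},
        "D": {"add", "sub", "load", "store"},
    }
    clusters = []
    for cid in (1, 2):  # two clusters of 4 FUs each, as in the text
        fus = [FunctionalUnit(f"{kind}{cid}", frozenset(ops))
               for kind, ops in op_sets.items()]
        clusters.append(Cluster(fus))
    return ClusteredVLIW(clusters)


arch = make_c64_like()
print(arch.issue_width)       # 8
print(arch.units_for("mul"))  # ['M1', 'M2']
```

A scheduler for such a machine has to respect both constraints at once: an operation may only be placed on an FU whose subset contains it, and operands produced in the other cluster must first be moved across.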

ARM Cortex A8-like processor. The Cortex A8 processor [14] is an enhanced ARMv7 architecture with support for SIMD, and is used here as a typical example of a Data Level Parallel (DLP) architecture. The processor consists of separate scalar and vector datapaths. Each datapath has a register file and specialized FUs. The vector units in the vector datapath support up to 128-bit wide data in various subword modes. The FUs have different numbers of execute stages, which results in different latencies. The details of the modeled architecture, including its memory hierarchy, can be found in [14].
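
The subword modes mentioned above can be illustrated with a short sketch of how one 128-bit vector is split into independent lanes. The text only says "various subword modes"; the widths 8, 16, 32 and 64 bits used here are an assumption, and the helper names are hypothetical. Vector registers are modeled as plain Python integers.

```python
VECTOR_BITS = 128  # vector width stated in the text


def lanes(subword_bits: int, vector_bits: int = VECTOR_BITS) -> int:
    """Number of subwords that fit in one vector register."""
    if subword_bits <= 0 or vector_bits % subword_bits != 0:
        raise ValueError("subword width must evenly divide the vector width")
    return vector_bits // subword_bits


def simd_add(a: int, b: int, subword_bits: int,
             vector_bits: int = VECTOR_BITS) -> int:
    """Lane-wise wrapping add: carries never cross a lane boundary."""
    mask = (1 << subword_bits) - 1
    result = 0
    for i in range(lanes(subword_bits, vector_bits)):
        shift = i * subword_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift
    return result


# ASSUMPTION: these four widths are the supported subword modes.
for width in (8, 16, 32, 64):
    print(f"{width}-bit subwords: {lanes(width)} lanes")

# The same bit patterns give different results in different modes.
print(hex(simd_add(0x00FF, 0x0001, 8)))   # 0x0   (lane 0 wraps around)
print(hex(simd_add(0x00FF, 0x0001, 16)))  # 0x100 (carry stays inside the lane)
```

The narrower the subword, the more data elements are processed per instruction, which is the source of the data-level parallelism this architecture is meant to represent.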

Novel Design Space architectures. To illustrate the wide design space that is supported by our tools, we start with a standard processor and modify the most power-consuming parts of it. Both architectures are shown in Figure 6. Architecture A is a 4-issue homogeneous VLIW with a 4kB L1 data cache, a 4kB L1 instruction memory and a centralized register file that is 32 entries deep with 12 ports. Architecture B optimizes some important parts of the architecture: the cache is replaced with a data scratchpad memory and a DMA, the L1 instruction memory is enhanced with a distributed loop buffer
