
Lecture Notes in Computer Science 4917

COFFEE: COmpiler Framework for Energy-Aware Exploration 199

executed by each slot (this depends on the functional units in that particular slot) can be specified in the XML machine description. New functional units can be added to the architecture, compiler and simulator easily by modifying the machine description and adding the behavior of the new instruction (in C) to the intrinsic library. The framework provides a user-friendly XML schema for adding new functional units and specifying their properties. An operation’s latency, operand specification and pipelining, as well as the association of the functional unit with a certain slot, are specified in the XML machine description and correctly taken into account during simulation. Different datapath widths are supported: 16-bit, 32-bit, 64-bit and 128-bit. By varying the width and the number of slots, the trade-off between ILP and DLP can be explored. The width can be specified for each FU separately, allowing SIMD and scalar units to be used in one architecture. An example of this approach is shown for ARM’s Cortex-A8 in Section 5. SIMD units can be exploited by using intrinsics, similar to those available in Intel’s SSE2 and Freescale’s AltiVec. We have also used the proposed tools to simulate other novel architectures such as SyncPro [13].
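As a rough illustration of how such a description could drive exploration, the sketch below parses a small XML fragment and derives an upper bound on element operations per cycle. The element names, attribute names and the helper functions `load_slots` and `peak_ops_per_cycle` are assumptions made for this example only; they are not COFFEE's actual schema or API.

```python
import xml.etree.ElementTree as ET

# Hypothetical machine description. The tags and attributes are
# illustrative assumptions, not the real COFFEE XML schema.
MACHINE_XML = """
<machine>
  <slot id="0">
    <fu name="alu" width="32" latency="1" pipelined="true"/>
    <fu name="mul" width="32" latency="2" pipelined="true"/>
  </slot>
  <slot id="1">
    <fu name="simd_alu" width="128" latency="1" pipelined="true"/>
  </slot>
</machine>
"""


def load_slots(xml_text):
    """Return {slot id: [functional unit dicts]} parsed from the description."""
    root = ET.fromstring(xml_text)
    slots = {}
    for slot in root.findall("slot"):
        units = []
        for fu in slot.findall("fu"):
            units.append({
                "name": fu.get("name"),
                "width": int(fu.get("width")),
                "latency": int(fu.get("latency")),
                "pipelined": fu.get("pipelined") == "true",
            })
        slots[int(slot.get("id"))] = units
    return slots


def peak_ops_per_cycle(slots, element_bits=16):
    """Upper bound on element operations issued per cycle.

    Assumes each slot issues one operation per cycle (ILP) and that a unit
    of width W processes W // element_bits elements per operation (DLP).
    """
    total = 0
    for units in slots.values():
        widest = max(fu["width"] for fu in units)
        total += widest // element_bits
    return total
```

Under these assumptions, adding a slot raises the bound through ILP, while widening a unit raises it through DLP, which is the kind of trade-off the text describes.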

The pipeline depth of the processor can be specified in the machine description, and the compiler correctly schedules operations onto pipelined functional units. Based on the activity, the energy consumption of the pipeline registers is automatically estimated. This is crucial for architectures with deep pipelines (high clock frequency) and for wide SIMD architectures. In both cases the number of pipeline registers is large, so they account for a large share of the energy cost and significantly affect performance.
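The section does not give the energy model itself, so the following is only a simplified sketch of what an activity-based estimate for pipeline registers might look like. The function name, its parameters, the assumption of one register per stage and the numeric values used in the usage note are all hypothetical.

```python
def pipeline_register_energy_pj(stages, width_bits, slots, cycles,
                                activity, energy_per_bit_pj):
    """Estimate pipeline register energy in picojoules.

    Simplifying assumptions (not taken from the paper):
    - every slot has one register of width_bits per pipeline stage;
    - `activity` is the average fraction of register bits written per cycle;
    - each written bit costs a fixed energy_per_bit_pj.
    """
    register_bits = stages * width_bits * slots
    return register_bits * cycles * activity * energy_per_bit_pj
```

With these placeholder values, a 5-stage, 32-bit, 2-slot design over 1000 cycles at 0.5 activity and 0.25 pJ per bit comes to 40000 pJ, while a 10-stage, 128-bit design with the same slots and activity comes to 320000 pJ. The eightfold gap illustrates why deep pipelines and wide SIMD datapaths make this term hard to ignore.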

Register File. Register files are known to be among the most power-consuming parts of the processor. Hence it is important to ensure that the register file design space is explored properly. The COFFEE flow can handle centralized and clustered register files, with or without a bypass network between the functional units and the register files. The size, number of ports and connectivity of the register files are specified in the machine description file.

Figure 3 shows different register file configurations that are supported by our framework. Combinations of the configurations shown in Figure 3 are supported, both in the simulator and in the register allocation phase of the compiler. Separate register files for scalar and vector slots can be used together with heterogeneous datapath widths. An example of such a scalar and vector register file is shown in Section 5 using ARM’s Cortex-A8 core [14]. Communication of data between clusters is supported in various ways, as shown in Figure 3, ranging from extra copy units to a number of point-to-point connections between the clusters. The performance vs. power trade-off (inter-cluster copy operations take an extra cycle, while increasing the fan-out of register file ports costs energy in the interconnections) can be explored using our framework.
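The trade-off described above can be made concrete with a small back-of-the-envelope model. The sketch below is an assumption-laden illustration, not the framework's cost model: the class `InterClusterScheme`, the function `estimate` and all numeric values are hypothetical, and the model pessimistically assumes that no copy latency is hidden by the schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterClusterScheme:
    """Hypothetical description of one inter-cluster communication scheme."""
    name: str
    extra_cycles_per_copy: int   # cycles added by each inter-cluster transfer
    port_energy_pj: float        # energy per register file port access


def estimate(scheme, base_cycles, rf_accesses, inter_cluster_transfers):
    """Return (cycles, register file energy in pJ) for a given workload.

    Assumes every inter-cluster transfer sits on the critical path and that
    higher port fan-out raises the cost of every register file access.
    """
    cycles = base_cycles + inter_cluster_transfers * scheme.extra_cycles_per_copy
    energy_pj = rf_accesses * scheme.port_energy_pj
    return cycles, energy_pj


# Placeholder numbers chosen only to show the direction of the trade-off.
COPY_UNIT = InterClusterScheme("copy unit", 1, 1.0)
POINT_TO_POINT = InterClusterScheme("point-to-point", 0, 1.5)
```

For a workload of 1000 base cycles, 3000 register file accesses and 100 inter-cluster transfers, this model gives 1100 cycles and 3000 pJ with copy units, against 1000 cycles and 4500 pJ with point-to-point connections. Real numbers would come from the simulator and the energy models.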

A detailed study of register file configurations and their impact on power and performance has been done in [15,16], but these studies are limited to the register file and do not provide a framework for exploring other parts of the processor.

3.3 Loop Transformations

Having an automated transformation framework integrated into the backend compiler is crucial for efficient optimization. The URUK framework [6] performs designer directed
