21.01.2013 Views

Lecture Notes in Computer Science 4917



204 P. Raghavan et al.

[Figure 6: two block diagrams, (A) Unoptimized Architecture and (B) Low Power Architecture. Labels recovered from the figure: PC, LC, IL1 Memory, L0 Loop Buffer1, L0 Loop Buffer2, FU, Centralized RF, Distributed Reg. File (DRF0, DRF1), DL1 Cache, DL1 Scratchpad, DMA.]

Fig. 6. Low Power Architecture Design Space

(16 instructions deep). The register file is clustered (16 deep, 6 ports each). Both these architectures have a standard 5 stage pipeline.

5.3 Loop Transformations<br />

Figure 7 shows the transformations performed on the benchmark code. The first three (loop split, loop merge, loop merge) improve data locality in memory and IPC. For the Cortex A8 processor, loop tiling is performed as an enabling transformation, and vectorization is performed manually using intrinsics. The decision on which transformations to apply is up to the designer, and defining an optimal set of transformations for a certain platform and application domain is outside the scope of this paper.
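As an illustration of why loop merging improves data locality and how tiling can prepare a loop for vectorization, the following is a minimal sketch in C. It is not taken from the WCDMA benchmark: the function names, the `gain`/`offset` computation, and the `tile` parameter are all hypothetical, and the tiled version shows only strip-mining, the one-dimensional form of tiling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel, before merging: two passes over the data.
   tmp[] is written by the first loop and read back by the second,
   so every intermediate value travels through the data memory. */
void scale_then_offset_split(const int *in, int *tmp, int *out,
                             size_t n, int gain, int offset)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = in[i] * gain;
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] + offset;
}

/* After loop merge: a single pass. The intermediate value can stay
   in a register, and the tmp[] array is no longer needed. */
void scale_then_offset_merged(const int *in, int *out,
                              size_t n, int gain, int offset)
{
    for (size_t i = 0; i < n; i++) {
        int t = in[i] * gain;
        out[i] = t + offset;
    }
}

/* After strip-mining the merged loop: the inner loop covers at most
   `tile` consecutive elements, which is the shape a vectorizing step
   (manual intrinsics or a compiler) would typically work on. */
void scale_then_offset_tiled(const int *in, int *out,
                             size_t n, int gain, int offset, size_t tile)
{
    assert(tile > 0);
    for (size_t base = 0; base < n; base += tile) {
        /* Clamp the last tile without overflowing base + tile. */
        size_t end = (n - base < tile) ? n : base + tile;
        for (size_t i = base; i < end; i++)
            out[i] = in[i] * gain + offset;
    }
}
```

All three functions compute the same output; only the order and grouping of memory accesses differ, which is the property the transformations rely on.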

[Figure 7: flow diagram. Input: WCDMA receiver source (*.c) with loops a, b, c. Stages shown: Loop Split, Loop Merge, Loop Merge, Loop Tile, then "To Compiler". Panel labels: "Transformations used for ARM Cortex-A8-like" and "Transformations used for TI C64x-like". Intermediate loop labels: c1, c2, a-c1, b-c2.]

Fig. 7. Transformations Used in Uruk for Optimizations

5.4 Results and Analysis

Figure 8 shows the normalized performance in cycles for the TI C64x-like processor and the ARM Cortex A8-like processor, both before and after the transformations. The numbers have been normalized to each processor's initial cycle count. For the ARM, loop tiling was performed on both the initial and the transformed code to enable SIMD. Therefore the gains are the result of an improvement in locality in the data memory. For the TI, the performance gains are due to improved data locality and improved ILP. The loop-merge
