13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



OPTIMIZING CACHE USAGE

time is larger than the memory latency. Inserting a prefetch of the first data element needed prior to entering the nested loop computation would eliminate or reduce the start-up penalty for the very first iteration of the outer loop. This uncomplicated high-level code optimization can improve memory performance significantly.

9.6.8 Minimize Number of Software Prefetches

Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources, even though they require minimal clock and memory bandwidth.

Excessive prefetching may lead to performance penalties because of issue penalties in the front end of the machine and/or resource contention in the memory subsystem. This effect may be severe in cases where the target loops are small and/or cases where the target loop is issue-bound.

One approach to solve the excessive prefetching issue is to unroll and/or software-pipeline loops to reduce the number of prefetches required. Figure 9-4 presents a code example which implements prefetch and unrolls the loop to remove the redundant prefetch instructions whose prefetch addresses hit the previously issued prefetch instructions. In this particular example, unrolling the original loop once saves six prefetch instructions and nine instructions for conditional jumps in every other iteration.

Original loop:

    top_loop:
        prefetchnta [edx+esi+32]
        prefetchnta [edx*4+esi+32]
        . . . . .
        movaps xmm1, [edx+esi]
        movaps xmm2, [edx*4+esi]
        . . . . .
        add esi, 16
        cmp esi, ecx
        jl top_loop

Unrolled iteration:

    top_loop:
        prefetchnta [edx+esi+128]
        prefetchnta [edx*4+esi+128]
        . . . . .
        movaps xmm1, [edx+esi]
        movaps xmm2, [edx*4+esi]
        . . . . .
        movaps xmm1, [edx+esi+16]
        movaps xmm2, [edx*4+esi+16]
        . . . . .
        movaps xmm1, [edx+esi+96]
        movaps xmm2, [edx*4+esi+96]
        . . . . .
        . . . . .
        add esi, 128
        cmp esi, ecx
        jl top_loop

Figure 9-4. Prefetch and Loop Unrolling
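
The same idea can be sketched in C. The following is a minimal illustration, not code from the manual: it assumes a GCC- or Clang-compatible compiler (for `__builtin_prefetch`, whose locality argument 0 is typically lowered to `prefetchnta` on x86), and the function names, the dot-product workload, and the constants are all hypothetical. The first function issues a prefetch pair on every 16-byte step, as in the original loop of Figure 9-4; the second issues one pair per 128-byte block, as in the unrolled loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for this sketch (floats are assumed to be 4 bytes). */
enum {
    FLOATS_PER_STEP  = 4,   /* 16 bytes: one movaps-sized chunk            */
    FLOATS_PER_BLOCK = 32   /* 128 bytes: stride of the unrolled iteration */
};

/* Non-temporal read prefetch `bytes` ahead of p. The address is formed as an
 * integer so that no out-of-range pointer arithmetic takes place; a prefetch
 * is only a hint and is not expected to fault on an address past the array. */
static inline void prefetch_ahead(const float *p, size_t bytes)
{
    __builtin_prefetch((const void *)((uintptr_t)p + bytes), 0, 0);
}

/* Baseline: a prefetch pair on every 16-byte step. Consecutive steps mostly
 * request a cache line that an earlier prefetch has already requested. */
float sum_prefetch_every_step(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    size_t i = 0;
    for (; i + FLOATS_PER_STEP <= n; i += FLOATS_PER_STEP) {
        prefetch_ahead(a + i, 32);
        prefetch_ahead(b + i, 32);
        for (size_t k = 0; k < FLOATS_PER_STEP; ++k)
            sum += a[i + k] * b[i + k];
    }
    for (; i < n; ++i)              /* tail: fewer than 4 elements left */
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled: one prefetch pair per 128-byte block. The fixed-count inner loop
 * stands in for the eight unrolled 16-byte steps of the figure. */
float sum_prefetch_per_block(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    size_t i = 0;
    for (; i + FLOATS_PER_BLOCK <= n; i += FLOATS_PER_BLOCK) {
        prefetch_ahead(a + i, 128);
        prefetch_ahead(b + i, 128);
        for (size_t k = 0; k < FLOATS_PER_BLOCK; ++k)
            sum += a[i + k] * b[i + k];
    }
    for (; i < n; ++i)              /* tail: fewer than 32 elements left */
        sum += a[i] * b[i];
    return sum;
}
```

Both functions add the products in the same order, so they return the same value; only the number of prefetches and loop branches differs. Whether the second form is faster depends on the target processor's cache-line size and on how issue-bound the loop is, so it should be measured rather than assumed.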
