Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

Figure 9-5 demonstrates the effectiveness of software prefetches in latency hiding.

Figure 9-5. Memory Access Latency and Execution With Prefetch (figure: a timeline of the execution pipeline and the front-side bus; the prefetch for vertex n is issued while vertex n-2 is being processed, so the memory latencies for vertices n, n+1, and n+2 overlap with computation on earlier vertices)

The X axis in Figure 9-5 indicates the number of computation clocks per loop (each iteration is independent). The Y axis indicates the execution time measured in clocks per loop. The secondary Y axis indicates the percentage of bus bandwidth utilization. The tests vary by the following parameters:

• Number of load/store streams — Each load and store stream accesses one 128-byte cache line per iteration.
• Amount of computation per loop — This is varied by increasing the number of dependent arithmetic operations executed.
• Number of software prefetches per loop — For example, one every 16 bytes, 32 bytes, 64 bytes, 128 bytes.

As expected, the leftmost portion of each of the graphs in Figure 9-5 shows that when there is not enough computation to overlap the latency of memory access, prefetch does not help and that the execution is essentially memory-bound. The graphs also illustrate that redundant prefetches do not increase performance.

9.6.9 Mix Software Prefetch with Computation Instructions

It may seem convenient to cluster all of the PREFETCH instructions at the beginning of a loop body or before a loop, but this can lead to severe performance degradation. In order to achieve the best possible performance, PREFETCH instructions must be interspersed with other computational instructions in the instruction sequence rather than clustered together. If possible, they should also be placed apart from loads. This improves the instruction level parallelism and reduces the potential instruction resource stalls.
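The sketch below is not taken from the manual; it is a minimal C illustration of the interleaving described above, using the _mm_prefetch intrinsic. The vertex structure, the scale_vertices loop, and the PREFETCH_AHEAD distance are assumptions chosen for the example and would need tuning for a real workload.

/* Illustrative sketch (not from the manual): software prefetch interspersed
 * with computation inside the loop body, rather than clustered before the
 * loop. Requires SSE (default on x86-64). */
#include <stddef.h>
#include <xmmintrin.h>          /* _mm_prefetch, _MM_HINT_T0 */

#define PREFETCH_AHEAD 8        /* iterations ahead; assumed value, tune per workload */

typedef struct {
    float x, y, z, w;           /* 16 bytes per vertex */
} vertex_t;

void scale_vertices(vertex_t *v, size_t n, float scale)
{
    for (size_t i = 0; i < n; i++) {
        v[i].x *= scale;
        v[i].y *= scale;

        /* Prefetch a vertex several iterations ahead, placed between the
         * arithmetic operations and away from the loads at the top of the
         * iteration, so the memory latency overlaps with the computation. */
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&v[i + PREFETCH_AHEAD], _MM_HINT_T0);

        v[i].z *= scale;
        v[i].w *= scale;
    }
}

The prefetch distance should be chosen so that the requested cache line arrives before the loop reaches it; too small a distance leaves memory latency exposed, while issuing redundant prefetches for lines already in flight adds overhead without improving performance, consistent with the observations for Figure 9-5 above.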
