13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



OPTIMIZING CACHE USAGE

schedule prefetch instructions one iteration ahead). For small loop bodies (that is, loop iterations with little computation), the prefetch scheduling distance must be more than one iteration.

A simplified equation to compute PSD is deduced from the mathematical model. For a simplified equation, complete mathematical model, and methodology of prefetch distance determination, see Appendix E, “Summary of Rules and Suggestions.”

Example 9-3 illustrates the use of a prefetch within the loop body. The prefetch scheduling distance is set to 3, ESI is effectively the pointer to a line, EDX is the address of the data being referenced and XMM1-XMM4 are the data used in computation. Example 9-4 uses two independent cache lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines are used per iteration.

Example 9-3. Prefetch Scheduling Distance

    top_loop:
        prefetchnta [edx + esi + 128*3]
        prefetchnta [edx*4 + esi + 128*3]
        . . . . .
        movaps xmm1, [edx + esi]
        movaps xmm2, [edx*4 + esi]
        movaps xmm3, [edx + esi + 16]
        movaps xmm4, [edx*4 + esi + 16]
        . . . . .
        . . . . .
        add esi, 128
        cmp esi, ecx
        jl top_loop

9.6.7 Software Prefetch Concatenation

Maximum performance can be achieved when the execution pipeline is at maximum throughput, without incurring any memory latency penalties. This can be achieved by prefetching data to be used in successive iterations in a loop. De-pipelining memory generates bubbles in the execution pipeline.

To explain this performance issue, a 3D geometry pipeline that processes 3D vertices in strip format is used as an example. A strip contains a list of vertices whose predefined vertex order forms contiguous triangles. It can be easily observed that the memory pipe is de-pipelined on the strip boundary due to ineffective prefetch arrangement. The execution pipeline is stalled for the first two iterations for each strip. As a result, the average latency for completing an iteration will be 165(FIX) clocks. See Appendix E, “Summary of Rules and Suggestions”, for a detailed description.
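
The strip-boundary stall described above can be avoided by letting the look-ahead of one strip spill into the start of the next. The following is a minimal C sketch of that idea, not a listing from this manual. It assumes a GCC- or Clang-compatible compiler (for `__builtin_prefetch`), 64-byte cache lines, and strips stored as separate arrays of `float`. The names `sum_strips`, `LINE_FLOATS`, and `PSD_LINES` are hypothetical, and the look-ahead of 3 lines simply mirrors the distance used in Example 9-3.

```c
#include <stddef.h>

enum {
    LINE_FLOATS = 16, /* assumption: 64-byte cache line / 4-byte float */
    PSD_LINES   = 3   /* hypothetical prefetch scheduling distance, in lines */
};

/* Sums every element of every strip. One prefetch is issued per cache line,
   PSD_LINES lines ahead of the current position. When the look-ahead runs
   past the end of the current strip, it is redirected to the beginning of
   the next strip, so the first lines of that strip are already in flight
   when the outer loop reaches it. */
float sum_strips(const float *const *strips, size_t n_strips, size_t strip_len)
{
    const size_t ahead = (size_t)PSD_LINES * LINE_FLOATS;
    float total = 0.0f;

    for (size_t s = 0; s < n_strips; ++s) {
        const float *cur = strips[s];
        for (size_t i = 0; i < strip_len; ++i) {
            if (i % LINE_FLOATS == 0) {
                size_t target = i + ahead;
                if (target < strip_len) {
                    /* Ordinary look-ahead inside the current strip.
                       rw = 0 (read), locality = 0 (non-temporal hint). */
                    __builtin_prefetch(cur + target, 0, 0);
                } else if (s + 1 < n_strips && target - strip_len < strip_len) {
                    /* Concatenation: look-ahead crosses the strip boundary. */
                    __builtin_prefetch(strips[s + 1] + (target - strip_len), 0, 0);
                }
            }
            total += cur[i];
        }
    }
    return total;
}
```

Without the `else if` branch, the last `PSD_LINES` line-steps of each strip would issue no prefetch, and the first lines of the following strip would be fetched on demand, which is the de-pipelining the text describes. The appropriate look-ahead depends on the memory latency and the work per iteration, so the value 3 here is only a placeholder.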
